A Comparative Analysis of Machine Learning Algorithms for Credit Risk Scoring Using Chi-Square Feature Selection
Abstract—Credit payment, which involves making predetermined installment payments beginning upon receipt of the purchased product, has become increasingly popular. Lending institutions often employ various risk assessment strategies, commonly known as credit risk scoring. Manual credit risk analysis often requires a longer processing time and may result in a decrease in accuracy. By leveraging the advancements in artificial intelligence, the current landscape offers promising opportunities for conducting credit risk analysis in a more efficient and accurate manner. The primary objective of this paper is to utilize the Chi-Square feature selection method to improve the speed and efficiency of these algorithms. Four machine learning algorithms are used to assess the outcomes, and the 22 selected features are thoroughly evaluated. The final results indicate that the Naïve Bayes Classifier, with an accuracy of only 0.65, is less accurate than the other models. For predicting whether a client will default, the Random Forest Classifier emerges as the most effective model: it achieves the highest accuracy while maintaining an AUC value of 0.68, making it the top choice among all the models. Deployment of the model shows a 10.22% decrease in Loss Given Default (LGD).

Keywords—machine learning, credit scoring, default

I. INTRODUCTION

Recently, credit payments have become increasingly popular among the general public. Credit payment involves making predetermined installment payments beginning upon receipt of the purchased product. The resources provided can come in various forms, including financial assistance such as loans, as well as tangible goods and services such as consumer credit. Essentially, credit facilitates delayed payment for these resources. Credit, whether acquired through monetary means or deferred transactions, remains a vital tool for efficiently managing economic exchanges. Individuals and enterprises can purchase a wide range of products, whether through offline or online channels, without making immediate cash payments, simply by providing their identification and the necessary documents [1].

However, it is crucial to understand that credit transactions come with inherent risks. The term credit risk refers to the potential downsides of providing financial assistance to individuals or businesses in the form of loans or lines of credit. This risk includes the likelihood that borrowers may be unable to repay the loan by the due date or will default on the loan entirely, which can negatively impact the financial stability of the lending institution. Consequently, lending institutions often employ various risk assessment strategies known as credit risk scoring [2]. These strategies involve evaluating the borrower's credit history, stability of income, and collateral. The purpose of these assessments is to reduce the potential negative outcomes of these risks and to guarantee that responsible lending practices are adhered to.

To mitigate credit-related challenges, credit providers must employ stringent customer selection processes through credit risk scoring. Manual credit risk analysis often requires a longer processing time and may suffer from reduced accuracy. Leveraging the advancements in artificial intelligence, the current landscape offers promising opportunities for conducting credit risk analysis more efficiently and with higher accuracy. This research uses a credit dataset from Home Credit, obtained from Kaggle [3]. The borrower's data features are used to predict customers' repayment abilities by analyzing customer-related data, such as telecommunications and transaction information.

One limitation is that processing a large amount of data takes a long time, so a feature selection process is required. Feature selection techniques demonstrate that having more information is not always beneficial in the machine learning process. Feature selection is an essential procedure that involves obtaining a subset from an original set of features, determined by specific criteria. The subset is selected to include only the most relevant features of the dataset. The feature selection process therefore helps compress the scale of data processing by removing redundant and irrelevant features while maintaining accuracy.
In this study, we aim to conduct a comparative analysis of various machine learning algorithms for credit risk scoring. Our main focus is to utilize the Chi-Square feature selection method to improve the speed and efficiency of these algorithms. The purpose of credit risk scoring is to assess the likelihood of a borrower defaulting on their loan payments. By accurately predicting credit risk, financial institutions can make informed decisions regarding lending and minimize potential losses. The ultimate goal is to provide data for predicting Non-Performing Loans (NPL), enabling the company to avoid financial losses while effectively reallocating credit to borrowers.

II. RELATED WORKS

Banking and other credit distribution companies inherently carry a significant amount of risk. These risks include credit risk, market risk, operational risk, liquidity risk, strategic risk, reputation risk, legal risk, and compliance risk [4]. In Indonesia, these risks are outlined in Bank Indonesia Regulation Number 11/25/PBI/2009 [5]. Credit is the primary source of income and profit for banks and credit distributors. However, it is also an investment activity that can lead to credit problems for banks, and improper management of credit can result in the accumulation of non-performing loans. Credit risk refers to the inability of a company, institution, agency, or individual to fulfill their financial obligations promptly, both at the time of maturity and beyond, in compliance with relevant regulations and agreements.

To calculate credit risk, one can use the Non-Performing Loan (NPL) ratio, which represents the proportion of total non-performing loans to the total loans extended to debtors. The NPL ratio indicates that as the credit quality of a bank deteriorates, resulting in an increase in non-performing loans, the likelihood of the bank encountering financial difficulties also rises. Other literature investigates various attempts to enhance the accuracy of credit default prediction and elucidates how such predictive models can contribute to the mitigation of non-performing loans [6]. Considering that portfolios of banking and lending companies can reach billions of dollars, even a minor improvement can be deemed significant, so it is worthwhile to make efforts to enhance a model's performance in predicting credit default or credit risk.

In the realm of financial institutions, the manual calculation of credit risk has always posed intricate challenges. The process entails a thorough evaluation of various factors, including the borrower's financial history, market trends, and macroeconomic indicators, to ascertain the probability of a loan or credit product defaulting [7]. However, the manual method poses several significant challenges. Firstly, the large amount and intricate nature of data needed for precise risk assessment can overwhelm even the most skilled analysts, resulting in errors and inconsistencies. Moreover, relying solely on static datasets can result in outdated risk profiles, as they may not reflect current borrower dynamics or economic fluctuations in real time. Furthermore, the complex network of interdependencies among risk factors requires a level of computational sophistication that manual calculations often struggle to deliver, which could compromise the accuracy of credit risk assessments. In today's ever-changing financial landscape, it is crucial to consider automated solutions or advanced algorithms to address these challenges, with the goal of improving accuracy, efficiency, and adaptability in credit risk assessment.

Previous research has shown that credit default prediction can be successfully implemented using machine learning methods. The work in [8] presents a systematic review of credit risk analysis using machine learning approaches, and the step-by-step study of machine learning approaches in [9] shows their potential in terms of accuracy. However, machine learning may require significant computational power when all features in the dataset are used during the calculation process. The speed and efficiency of machine learning can be enhanced by feature selection: studies have shown that feature selection affects the speed and efficiency of machine learning computations without compromising accuracy [10]. This paper utilizes the Chi-Square (χ²) technique, which is used in statistical analysis to test the independence of two events [11]. It evaluates the relationship between categorical variables and assists in determining the features that have the greatest influence on the target variable by comparing the observed frequency distribution to the expected distribution under the assumption that the variables are independent. In the context of feature selection, the chi-square test calculates a value that indicates the degree of association between each categorical feature and the target variable. Higher chi-square values indicate a stronger association, suggesting that the feature may contain valuable information for predicting the target. The Chi-Square Test was chosen as the feature selection method in this study; it is one of the feature selection methods that can be used in supervised machine learning.

The work in [12]–[15] all demonstrate that tree-based methods such as Random Forest and Gradient Boosted Trees (XGBoost, LightGBM) outperform Logistic Regression as a predictive tool for company bankruptcy. The work in [13], [14] experimented with methods such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN), but the performance benefits observed from these methods compared to Logistic Regression are less convincing or significantly worse. Study [16] utilized machine learning methods to predict mortgage default and found that tree-based methods outperformed Logistic Regression. This demonstrates that using less interpretable machine learning methods (particularly tree-based methods) can lead to valuable performance improvements. As stated in [12], tree-based methods are more capable of capturing non-linear relationships than Logistic Regression. Study [17] discovered that the Random Forest model did not convincingly outperform Logistic Regression; however, other tree-based ensemble models, such as LightGBM, were able to do so. There are certain cases where the choice of model is not a factor at all in performance, as seen in [16], where no model convincingly outperforms the others. It should be noted that the work in [16] employed a rigorous evaluation procedure, wherein out-of-sample testing was conducted and performance was assessed based on varying risk preferences.
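To make the mechanics of the chi-square test concrete, the sketch below computes the statistic χ² = Σ(O − E)²/E for a single categorical feature against a binary target from a contingency table. This is a minimal illustration with made-up values, not the paper's own code; the feature name is borrowed from Table I only for familiarity.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: a binary categorical feature vs. the default target (hypothetical values).
df = pd.DataFrame({
    "FLAG_WORK_PHONE": [0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "TARGET":          [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Observed frequency table of feature value vs. target class.
observed = pd.crosstab(df["FLAG_WORK_PHONE"], df["TARGET"])

# chi2_contingency computes chi2 = sum((O - E)^2 / E) over all cells,
# where E is the expected count under the independence assumption.
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
```

A large statistic (small p-value) indicates that the feature's distribution differs between defaulters and non-defaulters, which is exactly the signal the feature selection step in Section IV-C exploits.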
The previous research serves as the foundation for the data analysis process and for the search for relevant features. With feature selection, computations can be done faster and more efficiently without sacrificing machine learning accuracy. Based on this experience, it is anticipated that this research will establish an effective risk analysis model with a high level of accuracy, achieved by comparing various machine learning algorithms.

III. RESEARCH METHODOLOGY

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a significant framework in the field of data analytics and knowledge discovery, and it is the methodology used in this paper [18]. The structured methodology depicted in Fig. 1 provides a comprehensive and flexible approach that helps organizations navigate the complexities of data mining projects. The CRISP-DM methodology consists of six distinct phases, each with its own specific objectives and activities, offering a systematic roadmap that guides a project from its initial stages to its completion. These phases encompass business understanding, data understanding, data preparation, modeling, evaluation, and deployment. By starting with a clear understanding of the business objectives and requirements, CRISP-DM ensures that the subsequent analytical steps are in line with the overall goals.

IV. RESULT AND DISCUSSION

A. Business Understanding

Home Credit is a financial institution that offers financing in partner stores. Credit financing can be experienced immediately when purchasing household appliances, smartphones, and other electronic equipment. Home Credit strives to expand financial inclusion for the unbanked population by providing a positive and secure credit experience. To ensure that this underserved population has a positive loan experience, Home Credit utilizes various alternative data, including telecommunications and transactional information, to predict the repayment ability of its consumers.

The large number of customers makes it impractical to conduct credit risk analysis manually. Instead, data mining is applied, together with analysis using various algorithms, to create a credit risk scoring model. The model is built using historical data from Home Credit's loan applications available on Kaggle to predict whether customers will be able to repay a loan. In doing so, the factors that cause customers to be unable to repay the loan within the specified time period can also be identified.

B. Data Understanding

The dataset used in this paper consists of three datasets, depicted in Fig. 2, obtained from the Kaggle competition Home Credit Default Risk [3]. Firstly, application_train contains personal information of clients, consisting of 307,511 rows and 122 columns. Secondly, the bureau dataset contains all previous credits held by clients at financial institutions other than Home Credit; it consists of 1,670,214 rows and 37 columns. Thirdly, previous_application consists of 1,716,428 rows and 17 columns and includes all previous loan applications made by clients at Home Credit. The large number of features in these datasets is what must be filtered for relevance to the credit scoring calculation, in order to make the calculation process faster and more efficient.
After the datasets are combined, the characteristics of the data can be examined. The types of data and their corresponding quantities in the combined dataset are as follows: 67 columns of the float64 data type, 41 of the int64 data type, and 16 of the object data type.
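For reproducibility, the three competition files can be loaded as below. The file names are the ones published for the Kaggle competition; the local paths are an assumption, and the paper does not detail how the three tables were aggregated into the main 124-column dataframe, so that step is omitted here.

```python
import pandas as pd

# File names as published for the Kaggle "Home Credit Default Risk" competition.
application_train = pd.read_csv("application_train.csv")
bureau = pd.read_csv("bureau.csv")
previous_application = pd.read_csv("previous_application.csv")

for name, frame in [("application_train", application_train),
                    ("bureau", bureau),
                    ("previous_application", previous_application)]:
    print(name, frame.shape)

# Data-type census, analogous to the counts reported for the combined dataframe.
print(application_train.dtypes.value_counts())
```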
C. Data Preparation

The main dataframe, referred to as df, contains a total of 307,511 rows and 124 columns. Several steps are taken during the data preparation stage. One step transforms the column DAYS_BIRTH into AGE to obtain the ages of the borrowers. To determine the borrower's length of employment in years, the column DAYS_EMPLOYED is converted to YEARS_EMPLOYED, and anomalies in YEARS_EMPLOYED are addressed. To determine the number of years since the borrower last updated their registration, the column DAYS_REGISTRATION is converted to YEARS_REGISTRATION.

The value XNA detected in the data is substituted with null. To address missing values, columns with more than 60% null values are removed from the aggregate dataset; 17 columns are removed this way. Imputation is then performed on the remaining columns: null values in columns with data types other than object are replaced with the median value of the column, while columns of the object data type have null values replaced with the mode value of the column. The column DAYS_ID_PUBLISH is renamed to YEARS_PUBLISHED.

The column DAYS_LAST_PHONE_CHANGE is converted to YEAR_LAST_PHONE_CHANGE to determine how many years ago the customer changed their phone. At this stage, new columns are also defined: TOTAL_DOCUMENT, which holds the total number of documents provided by the client, and doc_provided, which holds the value 0 if the client did not provide any documents and 1 if the client provided documents.
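A minimal pandas sketch of these cleaning steps is given below, assuming df is the merged dataframe described above. The paper does not state how the YEARS_EMPLOYED anomaly was resolved; treating this dataset's well-known 365243 placeholder as missing is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    # DAYS_* columns count days relative to the application date (negative values).
    df["AGE"] = (-df["DAYS_BIRTH"] / 365).astype(int)
    df["YEARS_EMPLOYED"] = -df["DAYS_EMPLOYED"] / 365
    # Assumed anomaly fix: 365243 is a placeholder in DAYS_EMPLOYED; treat as missing.
    df.loc[df["DAYS_EMPLOYED"] == 365243, "YEARS_EMPLOYED"] = np.nan
    df["YEARS_REGISTRATION"] = -df["DAYS_REGISTRATION"] / 365
    df["YEARS_PUBLISHED"] = -df["DAYS_ID_PUBLISH"] / 365
    df["YEAR_LAST_PHONE_CHANGE"] = -df["DAYS_LAST_PHONE_CHANGE"] / 365
    df = df.drop(columns=["DAYS_BIRTH", "DAYS_EMPLOYED", "DAYS_REGISTRATION",
                          "DAYS_ID_PUBLISH", "DAYS_LAST_PHONE_CHANGE"])

    # Replace the XNA marker with a proper missing value.
    df = df.replace("XNA", np.nan)

    # Drop columns that are more than 60% null (17 columns in the paper).
    df = df.loc[:, df.isnull().mean() <= 0.60]

    # Impute: median for numeric columns, mode for object columns.
    for col in df.columns[df.isnull().any()]:
        if df[col].dtype == object:
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            df[col] = df[col].fillna(df[col].median())

    # Document-related columns derived from the FLAG_DOCUMENT_* indicators.
    doc_cols = [c for c in df.columns if c.startswith("FLAG_DOCUMENT_")]
    df["TOTAL_DOCUMENT"] = df[doc_cols].sum(axis=1)
    df["doc_provided"] = (df["TOTAL_DOCUMENT"] > 0).astype(int)
    return df
```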
Scaling is also performed so that columns with data types other than object share the same range of values. Scaling uses the MinMaxScaler() technique, which maps the values of each column to the range 0 to 1. Afterwards, the columns containing object-type data are encoded: columns with only two distinct values are encoded with the LabelEncoder method, while columns with more than two distinct values are encoded using one-hot encoding.
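These operations map directly onto scikit-learn and pandas; a sketch, again assuming the cleaned dataframe df with its TARGET column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Scale non-object columns to the [0, 1] range, leaving the target untouched.
num_cols = df.select_dtypes(exclude="object").columns.drop("TARGET")
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Encode object columns: LabelEncoder for binary columns, one-hot otherwise.
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() <= 2:
        df[col] = LabelEncoder().fit_transform(df[col])
df = pd.get_dummies(df)  # one-hot encodes the remaining object columns
```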
During the final stage, feature selection takes place. This entails selecting the columns that will serve as features in the machine learning model being developed. Feature selection is determined by evaluating the Chi-Square values of all columns in relation to the TARGET column: the chi-square test calculates a value that indicates the degree of association between each categorical feature and the target variable, and higher chi-square values indicate a stronger association, suggesting that the feature may contain valuable information for predicting the target. The results show that 22 features are selected for further use, as indicated in Table I. The number of columns is adjusted according to the threshold applied, meaning only columns with chi-square values greater than 140 are selected.

TABLE I. THE FEATURES SELECTED BY CHI-SQUARE

No.  Feature                        Chi-square value
1    Higher education               745.108117
2    EXT_SOURCE_2                   654.363170
3    REG_CITY_NOT_WORK_CITY         615.377434
4    CODE_GENDER                    606.035521
5    REG_CITY_NOT_LIVE_CITY         558.708941
6    Pensioner                      538.416471
7    EXT_SOURCE_3                   494.610060
8    Working                        491.439930
9    LIVE_CITY_NOT_WORK_CITY        266.788515
10   NAME_CONTRACT_TYPE             265.588342
11   Drivers                        265.304825
12   With parents                   262.804675
13   AGE                            244.861751
14   Low-skill Laborers             232.500451
15   Self-employed                  228.484051
16   Secondary/secondary special    221.231858
17   FLAG_WORK_PHONE                200.319862
18   Single/not married             184.289183
19   State servant                  157.124301
20   YEAR_LAST_PHONE_CHANGE         149.606292
21   YEARS_EMPLOYED                 144.733462
22   Civil marriage                 140.557212
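The threshold-based selection can be expressed with scikit-learn's chi2 scorer. The cutoff of 140 follows Table I; the prepared dataframe df and its non-negative scaling carry over from the sketches above (chi2 requires non-negative inputs, which the MinMax scaling guarantees).

```python
import pandas as pd
from sklearn.feature_selection import chi2

X = df.drop(columns=["TARGET"])
y = df["TARGET"]

# Score every feature against the target; chi2 returns (statistics, p-values).
scores, p_values = chi2(X, y)
chi_scores = pd.Series(scores, index=X.columns).sort_values(ascending=False)

# Keep only features whose chi-square value exceeds the threshold of 140.
selected = chi_scores[chi_scores > 140].index.tolist()
print(len(selected), "features selected")
X_selected = X[selected]
```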
D. Modelling

The modeling process builds a machine learning model to predict whether a client will be able to repay a loan on time, represented by the number 0 (non-default), or will struggle to make timely payments or have difficulty repaying the loan, represented by the number 1 (default). The modeling compares the performance of four machine learning techniques: Logistic Regression, Decision Tree Classifier, Random Forest Classifier, and Naïve Bayes Classifier. Before the models are built, the data is divided into training data, used to create the models, and test data, used to evaluate their accuracy. The test data comprises 20% of the total data, and the random state is set to 42. The training process includes balancing the data to counteract the class imbalance, as the number of training samples with Target 0 is much greater than that with Target 1; the number of training samples with Target 1 is adjusted to be 50% of the number with Target 0.
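A sketch of this setup follows. The paper does not name the resampling tool, so using imbalanced-learn's RandomOverSampler is an assumption; sampling_strategy=0.5 makes the Target 1 count equal to half the Target 0 count, matching the stated 50% ratio. Default hyperparameters are likewise assumed.

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# 80/20 split with the random state fixed at 42, as stated in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42)

# Resample so that Target 1 amounts to 50% of Target 0 in the training set.
X_bal, y_bal = RandomOverSampler(
    sampling_strategy=0.5, random_state=42).fit_resample(X_train, y_train)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
    "Naïve Bayes Classifier": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_bal, y_bal)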
E. Evaluation

The main performance metric used is the accuracy obtained on the test data. Additionally, the Precision, Recall, F1-Score, and Area Under the ROC Curve (AUC) of each model are calculated. The accuracy metric is determined by dividing the number of correct predictions by the
overall number of predictions made. Precision, by contrast, measures the proportion of predicted positive cases that are actually positive, while recall measures the proportion of actual positive cases in the test dataset that are correctly predicted. Additionally, the F1 score evaluates the balance between precision and recall, providing a comprehensive assessment of the model's performance. Furthermore, the time required to construct each model and the time needed to make predictions are also measured, offering a thorough picture of the computational efficiency required for deploying the model.
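These metrics correspond directly to scikit-learn's scoring functions; a sketch, assuming the fitted models dictionary from the previous step. The positive-label orientation of the precision and recall values in Table II is not stated in the paper, so scikit-learn's default (positive class = 1) is assumed here; training time can be captured the same way by timing model.fit().

```python
import time
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

for name, model in models.items():
    t0 = time.perf_counter()
    y_pred = model.predict(X_test)
    predict_time = time.perf_counter() - t0

    auc_value = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: "
          f"accuracy={accuracy_score(y_test, y_pred):.2f} "
          f"auc={auc_value:.2f} "
          f"precision={precision_score(y_test, y_pred):.2f} "
          f"recall={recall_score(y_test, y_pred):.2f} "
          f"f1={f1_score(y_test, y_pred):.2f} "
          f"predict_time={predict_time:.2f}s")
```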
Based on the findings presented in Table II, the model constructed with the Naïve Bayes Classifier shows a lower capability in predicting the payment ability of clients, as indicated by its smaller accuracy value compared to the other models. Meanwhile, the Random Forest Classifier stands out as the most effective model for predicting whether a client will default, outperforming all other models: it exhibits the highest accuracy, recall, and F1 score, and ranks second in terms of AUC and precision.
TABLE II. THE EVALUATION SCORES FOR EACH ALGORITHM

Algorithm                  Accuracy  AUC   Precision  Recall  F1 Score  Modeling Time (s)  Predict Time (s)
Logistic Regression        0.68      0.73  0.96       0.68    0.80      2.62               0.04
Decision Tree Classifier   0.82      0.53  0.93       0.88    0.90      7.09               0.04
Random Forest Classifier   0.89      0.68  0.93       0.96    0.94      131.61             2.75
Naïve Bayes Classifier     0.65      0.68  0.95       0.65    0.77      0.31               0.08

The confusion matrices in Fig. 3 show that the Random Forest Classifier has the highest True Negative count, indicating that it possesses the greatest number of predictions matching the original values in comparison to the other models. Because non-default (0) is treated as the negative class in the model, correctly predicted non-default clients appear as True Negatives.

Fig. 3. Confusion matrices (a. Logistic Regression, b. Decision Tree, c. Random Forest, d. Naïve Bayes)

The classification quality of each model is further assessed with the Area Under the ROC Curve (AUC), depicted in Fig. 4. The AUC metric offers a comprehensive evaluation of performance by considering all possible classification thresholds. One interpretation of AUC is the probability that the model will rank a randomly selected positive example higher than a randomly selected negative example.
According to Fig. 4, Logistic Regression has the highest AUC score at 0.73, while the Decision Tree has the lowest.
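A plot like Fig. 4 can be reproduced from the fitted models; the sketch below is one plausible way to generate the curves, not the paper's own code.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

plt.figure()
for name, model in models.items():
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")  # random-guess baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```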
F. Deployment

The model chosen for deployment is the one built with Random Forest and Chi-Square feature selection, as it achieves the highest accuracy in predicting NPL. To determine which columns have the most influence in the Random Forest Classifier model, in other words the factors with the greatest impact on NPL, a feature importance plot is generated; the results are shown in Fig. 5.

Fig. 5. Feature importance ranking of the Random Forest Classifier
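Random Forest exposes these importances directly through its feature_importances_ attribute; a sketch of how a plot like Fig. 5 can be produced, with variable names carried over from the earlier sketches:

```python
import matplotlib.pyplot as plt
import pandas as pd

rf = models["Random Forest Classifier"]
importances = pd.Series(rf.feature_importances_, index=X_selected.columns)

# Horizontal bar chart of the 22 selected features, most important at the top.
importances.sort_values().plot.barh(figsize=(8, 6))
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
```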
To observe the results of predictions made with the constructed model, a comparison is made using Loss Given Default (LGD), i.e., the total credit from clients who default. The LGD calculated first is the current LGD recorded in the dataset; it is then compared with the LGD obtained when the constructed prediction model is used. The total LGD experienced by Home Credit before using the prediction model was approximately 2,780,554,000. After using the prediction model, the LGD obtained was 2,496,462,552. The difference between the LGD before and after using the prediction model is 284,091,601.5, which means that the prediction model reduced the LGD of defaulting clients. The decrease amounts to 10.22% of the original LGD.
V. CONCLUSION

The selected approach of deploying a model built with the Random Forest algorithm in combination with Chi-Square feature selection has significant potential for improving predictive capability. The feature selection process aids in identifying the most influential features by comparing the observed and expected distributions. The Random Forest model, combined with this reduced feature set, demonstrates remarkable accuracy, particularly in predicting instances related to non-performing loans (NPL). The careful selection of features not only improves the overall performance of the model but also adds a level of interpretability to the predictions, providing valuable insight into the factors that influence the outcomes. The deployment of the model shows a decrease of 10.22% in LGD. However, it is crucial to carefully consider the use of the Random Forest Classifier in situations where computational efficiency is of utmost importance, because the Random Forest Classifier requires additional time for both building the model and making predictions.

ACKNOWLEDGMENT

We would like to express our sincere appreciation to LPPM Unila for their invaluable funding through the Basic Research Scheme. This funding has greatly facilitated the completion of our research and the subsequent publication of our findings in this paper.

REFERENCES

[1] A. O'Sullivan and S. M. Sheffrin, Economics: Principles in Action. Princeton, NJ: Prentice Hall, 2004.
[2] I. Genriha and I. Voronova, "Methods for Evaluating the Creditworthiness of Borrowers," Economics and Business, vol. 22, pp. 42–49.
[3] Home Credit Group, "Home Credit Default Risk." https://www.kaggle.com/competitions/home-credit-default-risk/data (accessed Jul. 12, 2023).
[4] T. Roncalli, Handbook of Financial Risk Management, 1st ed. Boca Raton: Chapman and Hall/CRC, 2020, doi: 10.1201/9781315144597.
[5] "Bank Indonesia Regulation Number 11/25/PBI/2009 regarding Amendments to Bank Indonesia Regulation Number 5/8/PBI/2003 on the Implementation of Risk Management for Commercial Banks."
[6] M. Naili and Y. Lahrichi, "Banks' credit risk, systematic determinants and specific factors: recent evidence from emerging markets," Heliyon, vol. 8, no. 2, p. e08960, Feb. 2022, doi: 10.1016/j.heliyon.2022.e08960.
[7] A. Ieda, K. Marumo, and T. Yoshiba, "A Simplified Method for Calculating the Credit Risk of Lending Portfolios," Monetary and Economic Studies, vol. 18, no. 2, pp. 49–82, Dec. 2000.
[8] S. Shi, R. Tse, W. Luo, S. D'Addona, and G. Pau, "Machine learning-driven credit risk: a systemic review," Neural Computing and Applications, vol. 34, no. 17, pp. 14327–14339, Sep. 2022, doi: 10.1007/s00521-022-07472-2.
[9] T. Mokheleli and T. Museba, "Machine Learning Approach for Credit Score Predictions," Journal of Information Systems and Informatics, vol. 5, no. 2, pp. 497–517, May 2023, doi: 10.51519/journalisi.v5i2.487.
[10] J. Laborda and S. Ryoo, "Feature Selection in a Credit Scoring Model," Mathematics, vol. 9, no. 7, p. 746, Mar. 2021, doi: 10.3390/math9070746.
[11] A. Sikri, S. Dalal, and N. P. Singh, "Chi-Square Method of Feature Selection: Impact of Pre-Processing of Data," International Journal of Intelligent Systems and Applications in Engineering, vol. 11, no. 3s, pp. 241–248, Feb. 2023.
[12] M. Moscatelli, F. Parlapiano, S. Narizzano, and G. Viggiano, "Corporate default forecasting with machine learning," Expert Systems with Applications, vol. 161, p. 113567, Dec. 2020, doi: 10.1016/j.eswa.2020.113567.
[13] F. Barboza, H. Kimura, and E. Altman, "Machine learning models and bankruptcy prediction," Expert Systems with Applications, vol. 83, pp. 405–417, Oct. 2017, doi: 10.1016/j.eswa.2017.04.006.
[14] D. Guégan and B. Hassani, "Regulatory learning: How to supervise machine learning models? An application to credit scoring," The Journal of Finance and Data Science, vol. 4, no. 3, pp. 157–171, Sep. 2018, doi: 10.1016/j.jfds.2018.04.001.
[15] Y. Wang, Y. Zhang, Y. Lu, and X. Yu, "A Comparative Assessment of Credit Risk Model Based on Machine Learning: a case study of bank loan data," Procedia Computer Science, vol. 174, pp. 141–149, 2020, doi: 10.1016/j.procs.2020.06.069.
[16] Z. Qiu, Y. Li, P. Ni, and G. Li, "Credit Risk Scoring Analysis Based on Machine Learning Models," in 2019 6th International Conference on Information Science and Control Engineering (ICISCE), Shanghai, China: IEEE, Dec. 2019, pp. 220–224, doi: 10.1109/ICISCE48695.2019.00052.
[17] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA: ACM, Aug. 2016, pp. 785–794, doi: 10.1145/2939672.2939785.
[18] C. Shearer, "The CRISP-DM Model: The New Blueprint for Data Mining," Journal of Data Warehousing, vol. 5, no. 4, 2000.