Predictive and Probabilistic Approach Using Logistic Regression: Application To Prediction of Loan Approval
Abstract - Decision making is achieved through probabilistic and predictive approaches developed with various machine learning algorithms. This paper discusses logistic regression and its mathematical representation. It uses logistic regression as a machine learning tool to apply the predictive and probabilistic approaches to the problem of loan approval prediction. Specifically, it predicts whether or not a loan will be approved for a given set of applicant records. Furthermore, it also discusses other real-world applications of this machine learning model.

Keywords - Artificial Intelligence, Machine Learning, Logistic Regression, Data Munging, Distributional Analysis, Data Fitting

I. INTRODUCTION

The technical world is advancing toward complete automation. To attain automation, various concepts are being developed and put to use, as can be observed from the numerous developments being made and symposiums being held. One of the most striking developments exciting scientists and technologists, with regard to automation, is Artificial Intelligence.

Artificial Intelligence is the concept of simulating human-like intelligence in a computer [6]. Making a machine think exactly like a human intrigues scientists and developers, and they strive to achieve this goal by putting Artificial Intelligence to use. The idea is not to overpower human society but to work with humans so that the combined intelligence can lead to many more discoveries in this technological era.

Artificial Intelligence dates back to the advent of computers and has since diversified into numerous fields [6]. Over the years, technologists have acquired a deep understanding of this field, which has led to the development of well-defined models and their application to real-world problems. The domains of Artificial Intelligence include machine learning, neural networks, fuzzy logic, natural language processing, and expert systems [6]. These concepts are deployed according to the specific requirements at hand.

In this paper, one of the concepts in the domain of machine learning is used and applied to a real-world problem. Machine learning is a tool that facilitates the development of analytical models without explicit programming [4]. Various machine learning algorithms have been developed to suit different problem requirements. Leading-edge industries are now using machine learning to achieve higher sales growth, and statistics have shown that they are getting positive results. As institutions generate more and more data, exploiting the data manually becomes difficult; hence machine learning, with its capability for analytical modelling, is sought as a solution.

II. LOGISTIC REGRESSION

Logistic regression is also called the logistic model or logit regression. It takes in independent features and returns a categorical output. The probability of
occurrence of a categorical output can also be found with the logistic regression model by fitting the features to the logistic curve. Fig. 1 shows a general logit curve [1].

$$\operatorname{logit}(y) = \ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_l x_l \qquad (3)$$

In (3), $\alpha, \beta_1, \dots, \beta_l$ are the parameters of logistic regression and $x_1, \dots, x_l$ are the features used for fitting the model. Taking the antilog of both sides of (3) and solving for $P$, we get a result similar to (2) but extended to several features, which is given by [2][5][7]:

$$P = \frac{1}{1 + e^{-(\alpha + \beta_1 x_1 + \dots + \beta_l x_l)}} \qquad (4)$$
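As a quick numerical check of how (3) and (4) relate, the following minimal sketch evaluates the logistic function for a linear score and confirms that the logit maps the resulting probability back to that score. The parameter and feature values are arbitrary placeholders chosen only for illustration, and the function names `logistic` and `logit` are assumptions, not names used in the paper.

```python
import math

def logistic(z):
    """Probability P for a linear score z = alpha + sum(beta_i * x_i), as in (4)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds ln(P / (1 - P)), as in (3)."""
    return math.log(p / (1.0 - p))

# Placeholder values (assumptions for illustration, not fitted parameters).
alpha = -1.0
betas = [0.8, -0.5]
xs = [2.0, 1.0]

z = alpha + sum(b * x for b, x in zip(betas, xs))  # linear predictor
p = logistic(z)

print(f"z = {z:.3f}, P = {p:.3f}, logit(P) = {logit(p):.3f}")
```

A score of zero corresponds to a probability of exactly 0.5, which is why a 0.5 probability cut-off is equivalent to checking the sign of the linear score.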
$$P = \frac{1}{1 + e^{-(\alpha + \beta_{ed}\,\text{education} + \beta_{ch}\,\text{credit history} + \beta_{se}\,\text{self employed} + \beta_{pa}\,\text{property area})}} \qquad (5)$$

Equation (5) gives P, the probability of the loan getting approved for a particular dataset. In (5), $\alpha$, $\beta_{ed}$, $\beta_{ch}$, $\beta_{se}$, $\beta_{pa}$ are the parameters of logistic regression and 'education', 'credit history', 'self employed', 'property area' are the features used for building the model. If the probability of the loan getting approved is more than 0.5, the predicted value for the set of attributes is that the loan will be approved; otherwise, if the probability is less than 0.5, the result is that the loan will not be approved.

LP001002  M  No   0  Grad      No
LP001003  M  Yes  1  Grad      No
LP001005  M  Yes  0  Grad      Yes
LP001006  M  Yes  0  Not Grad  No

TABLE II. Contents of Dataset (the continuation of Table I)

ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area
5849             0                              360               1               U
4583             1508               128         360               1               R
3000             0                  66          360               1               U
2583             2358               120         360               1               U
In Table I, 'M' stands for male and 'F' (not present in the displayed records) stands for female. In Table II, 'U' stands for urban and 'R' stands for rural. Statistical measures such as the mean, standard deviation, minimum value, and maximum value can be computed and are tabulated in Table III.

TABLE III. Statistical Analysis of Dataset

        Applicant Income   Credit History
Count   614.000000         564.000000

C. Distributional and Categorical Variable Analysis

After getting a basic idea of the data we are dealing with, the distributions of the variables present in the dataset are analysed to gain a better understanding of the data and to identify outliers. All the attributes to be fitted into the model are statistically analysed to find any outliers or extreme values, so that these can be dealt with before fitting the data to the model. If these extreme values are left as they are, the accuracy of the model is compromised. An attribute containing outliers has to be standardized in order to generate a more accurate model.

Fig. 2 gives the distribution of applicant income.
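The kind of analysis described in this subsection can be sketched on a handful of values. The following minimal example is illustrative only: the first four incomes are the ApplicantIncome entries shown in Table II, the fifth is a made-up extreme value, and the 1.5 × IQR rule used to flag outliers is a common convention assumed here, not a method stated in the paper.

```python
import pandas as pd

# First four values: ApplicantIncome entries shown in Table II.
# Last value: a made-up extreme added only to illustrate an outlier.
income = pd.Series([5849, 4583, 3000, 2583, 81000], name="ApplicantIncome")

# Summary statistics (count, mean, std, min, quartiles, max).
summary = income.describe()
print(summary)

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
is_outlier = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)
print(income[is_outlier])

# Standardize the attribute to zero mean and unit standard deviation.
standardized = (income - income.mean()) / income.std()
```

Only the made-up extreme value is flagged here; on the real dataset the threshold and the treatment of flagged values would need to be chosen with the data in view.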
# Fill missing values with the most frequent category (the mode) of each column.
train_data['Gender'] = train_data['Gender'].fillna(train_data['Gender'].mode()[0])
train_data['Married'] = train_data['Married'].fillna(train_data['Married'].mode()[0])
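Filling missing categorical values with the most frequent category can be checked on a small example. The sketch below uses a made-up `Gender` column (the values are illustrative, not taken from the dataset). In pandas, `value_counts().max()` returns the count of the most frequent category, whereas `value_counts().idxmax()` or `mode()[0]` returns its label, and the label is what the fill value should be.

```python
import numpy as np
import pandas as pd

# Made-up column for illustration only.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male", np.nan, "Male"]})

counts = toy["Gender"].value_counts()   # NaN entries are excluded from the counts
most_frequent = counts.idxmax()         # label of the most frequent category
highest_count = counts.max()            # its count, which is not a valid category

# Fill missing entries with the most frequent label.
toy["Gender"] = toy["Gender"].fillna(most_frequent)
print(toy["Gender"].tolist())
```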
Fig. 4 Applicant Income grouped in terms of Gender
import numpy as np

# Total income of applicant and co-applicant, and its logarithm.
train_data['TotalIncome'] = train_data['ApplicantIncome'] + train_data['CoapplicantIncome']
train_data['TotalIncome_log'] = np.log(train_data['TotalIncome'])
Code 5

from sklearn.linear_model import LogisticRegression

y = train_data[outcome_var]
model = LogisticRegression()
predictor_var = ['Credit_History', 'Education', 'Married', 'Self_Employed', 'Property_Area']
# Gender is not taken as a possible term affecting the prediction.
# Fit the model:
clf = model.fit(train_data[predictor_var], y.values.ravel())
Code 6
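To connect the fitted model back to the probability in (5) and the 0.5 decision rule, here is a minimal, self-contained sketch. It is not the paper's experiment: the records in `toy` are made up, the features are assumed to be already numerically encoded, only two predictors are used, and `Loan_Status` is an assumed name for the outcome column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up, already numerically encoded records (illustration only).
toy = pd.DataFrame({
    "Credit_History": [1, 1, 0, 1, 0, 0, 1, 0],
    "Education":      [1, 0, 1, 1, 0, 1, 0, 0],
    "Loan_Status":    [1, 1, 0, 1, 0, 0, 1, 0],
})
predictor_var = ["Credit_History", "Education"]
outcome_var = "Loan_Status"  # assumed column name

model = LogisticRegression()
model.fit(toy[predictor_var], toy[outcome_var])

# Probability of approval (class 1) for each record.
proba = model.predict_proba(toy[predictor_var])[:, 1]

# The same probability computed by hand from the fitted intercept and coefficients.
z = model.intercept_[0] + toy[predictor_var].to_numpy() @ model.coef_[0]
manual = 1.0 / (1.0 + np.exp(-z))

# Decision rule: predict approval when the probability exceeds 0.5.
approved = (proba > 0.5).astype(int)
print(np.round(proba, 3), approved)
```

The hand-computed values agree with `predict_proba`, and thresholding at 0.5 reproduces what `model.predict` returns for a binary outcome.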