

PREDICTIVE AND PROBABILISTIC APPROACH USING LOGISTIC REGRESSION: APPLICATION TO PREDICTION OF LOAN APPROVAL
Ashlesha Vaidya
Computer Science Engineering
SRM University, Chennai
[email protected]

Abstract - Decision making is attained through the probabilistic and predictive approaches developed by various machine learning algorithms. This paper discusses logistic regression and its mathematical representation, and adopts logistic regression as a machine learning tool in order to actualize the predictive and probabilistic approaches for a given problem: loan approval prediction. Using logistic regression as a tool, the paper specifically delineates whether or not a loan will be approved for a given applicant record. Furthermore, it also discusses other real-world applications of this machine learning model.

Keywords - Artificial Intelligence, Machine Learning, Logistic Regression, Data Munging, Distributional Analysis, Data Fitting

I. INTRODUCTION

The technical world is advancing toward complete automation. In order to attain automation, various concepts are being developed and put to use, as can be observed from the numerous developments being made and symposiums being held. One of the most striking features that excites scientists and technologists with regard to automation is Artificial Intelligence.

Artificial Intelligence is the concept of simulating human-like intelligence in a computer [6]. Making a machine think like a human is intriguing to scientists and developers, and they strive to achieve this goal by putting Artificial Intelligence to use. The idea is not to overpower human society but to work alongside people, so that the combined intelligence can lead to many more revelations in this technological era.

Artificial Intelligence dates back to the advent of computers, and since then it has diversified into numerous fields [6]. Over the years, technologists have acquired a deep understanding of this field, which has led to the development of well-defined models and to the application of these models to real-world problems. Domains within Artificial Intelligence include machine learning, neural networks, fuzzy logic, natural language processing and expert systems [6]. These concepts are deployed according to the specificity of the desired requirements.

In this paper, one of the concepts in the domain of machine learning is exploited and applied to a real-world problem. Machine learning is a tool that facilitates the development of analytical models without explicit programming [4]. Various machine learning algorithms have been developed to suit different problem requirements. Leading-edge industries are now utilizing the capabilities of machine learning to gain higher sales growth, and statistics show that they are getting positive results. With institutions generating more and more data, manual exploitation of the data becomes difficult; hence machine learning, with its capability for analytical modelling, is sought as a solution.

II. LOGISTIC REGRESSION

Logistic regression is also called the logistic model or logit regression. It takes in independent features and returns a categorical output. The probability of occurrence of a categorical outcome can also be found with a logistic regression model by fitting the features to the logistic curve. Fig. 1 shows a general logit curve [1].

Fig. 1 General Logit Curve

The logistic regression model can be replaced by the simpler linear regression model when the output variable is continuous. When the output variable is not continuous, or is dichotomous, another model has to be applied in order to take this difference into consideration. Many models were therefore developed to account for the dichotomous behaviour of the outcome variable. The logistic regression model was chosen over the other models because of its mathematical clarity and flexibility. The model can have a single predictor or multiple predictors. With this model we express the natural log of the odds of the desired variable as a linear function of the input features. Mathematically, this can be represented as:

logit(y) = ln(p / (1 - p)) = α + βx                                  (1)

In (1), α and β are the parameters of the logistic regression model, p is the probability of the desired categorical output and x is the feature used as input. Taking the antilog on both sides of (1), we can easily find the probability of occurrence of the desired output. Its mathematical representation is:

P = 1 / (1 + e^-(α + βx))                                            (2)

If multiple parameters are used as features and are required for the prediction, then the natural log of the odds of the desired variable can be written as:

logit(y) = ln(p / (1 - p)) = α + β₁x₁ + β₂x₂ + ... + βₗxₗ            (3)

In (3), α, β₁, ..., βₗ are the parameters of the logistic regression model and x₁, ..., xₗ are the features fitted into the model. Taking the antilog on both sides of (3), we get a result similar to, but more general than, (2), given by [2][5][7]:

P = 1 / (1 + e^-(α + β₁x₁ + ... + βₗxₗ))                             (4)

Logistic regression is now extensively used in fields where a relationship has to be formed among the features and a dichotomous output is to be obtained.
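As a concrete illustration, equations (2) and (4) can be evaluated directly in Python. The following sketch is not part of the original implementation; the values of α, the β coefficients and the features are placeholders chosen only for demonstration.

import math

def logistic_probability(alpha, betas, features):
    # Evaluate P = 1 / (1 + e^-(alpha + sum(beta_i * x_i))), as in (2) and (4).
    z = alpha + sum(b * x for b, x in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder parameter and feature values (for illustration only).
alpha = -1.5
betas = [2.0, 0.8]
features = [1.0, 0.5]

print(logistic_probability(alpha, betas, features))   # probability of the positive class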

III. RELATED WORK

In all money lending firms, a baseline criterion is required in order to decide whether or not a loan for a particular borrower will be approved. The criterion for decision making need not be a single attribute; multiple attributes may need to be taken into consideration while making the decision. Datasets containing the relevant client information can be acquired from the money lending firms. The attributes in such a dataset are to be incorporated into a model that outputs whether or not the loan should be approved for the given borrower. The outcome can take either of two forms: Accept or Reject.

The developed model has to reach a decision in less time than is required to take the decision manually. The model should also be accurate enough to be accepted by the standards of the money lending firms.

IV. BUILDING A MODEL

According to the problem statement, the outcome variable is dichotomous. Fitting the data into a logistic regression model for multivariate independent entries will give the required output.

A. Describing the data

The data acquired from the money lending firms contains variable features, according to each firm's standards for choosing which features to incorporate into the model for prediction of loan approval status. Generally, information about the borrowers such as their gender, education, whether or not they are self-employed, their income, whether or not they are financially dependent on someone and, if so, what that income is, their property area, etc., is considered for predicting the loan status. Table I and Table II give a few of the contents of the data set used for fitting values into the logistic regression model.


TABLE I. Contents of data set (first 6 columns)

Loan_ID    Gender   Married   Dependents   Education   Self_Employed
LP001002   M        No        0            Grad        No
LP001003   M        Yes       1            Grad        No
LP001005   M        Yes       0            Grad        Yes
LP001006   M        Yes       0            Not Grad    No

TABLE II. Contents of data set (continuation of Table I)

ApplicantIncome   CoapplicantIncome   LoanAmount   Loan_Amount_Term   Credit_History   Property_Area
5849              0                                360                1                U
4583              1508                128          360                1                R
3000              0                   66           360                1                U
2583              2358                120          360                1                U

In Table I, 'M' stands for male and 'F' (not present in the displayed rows) stands for female. In Table II, 'U' stands for urban and 'R' stands for rural. Statistical quantities such as the mean, standard deviation, minimum value and maximum value can be computed for the numerical attributes and are tabulated in Table III.

TABLE III. Statistical Analysis of Dataset

         Applicant Income   Credit History
Count    614.000000         564.000000
Mean     5403.459283        0.842199
Min      150.000000         0.000000
25%      2877.500000        1.000000
50%      3812.500000        1.000000
Max      81000.000000       1.000000
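Summary statistics of the kind shown in Table III can be obtained directly from a pandas DataFrame. A minimal sketch, assuming the training data has been read into a DataFrame called train_data (the file name is a placeholder; the paper does not state the actual path):

import pandas as pd

train_data = pd.read_csv('train.csv')   # placeholder file name

# describe() reports count, mean, std, min, quartiles and max for the
# numerical columns, which corresponds to the figures tabulated in Table III.
print(train_data[['ApplicantIncome', 'Credit_History']].describe())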
B. Mathematical representation of Logistic Regression

The attributes considered for building the model are: education, credit history, self-employed and property area.

P = 1 / (1 + e^-(α + β_ed·education + β_ch·credit_history + β_se·self_employed + β_pa·property_area))        (5)

Equation (5) gives P, the probability of the loan getting approved for a particular record. In (5), α, β_ed, β_ch, β_se and β_pa are the parameters of the logistic regression model, and 'education', 'credit history', 'self employed' and 'property area' are the features used for building the model. If the probability of the loan getting approved is more than 0.5, then the predicted value for that set of attributes is that the loan will be approved; otherwise, if the probability is less than 0.5, the result is that the loan will not be approved.
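The 0.5 decision threshold described above reduces to a one-line rule. A small illustrative sketch (the probability values are examples, not results from the data set):

def decide(probability, threshold=0.5):
    # Map the predicted approval probability from (5) to Accept / Reject.
    return 'Accept' if probability > threshold else 'Reject'

print(decide(0.72))   # example value -> Accept
print(decide(0.31))   # example value -> Reject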
C. Distributional and Categorical Variable Analysis

After getting a basic idea of the data we are dealing with, the distributions of the variables present in the dataset are analysed so as to attain a better understanding of the data and to get an idea of the outliers. This analysis gives an overall picture of the data to be dealt with. All the attributes to be fitted into the model are statistically analysed in order to find any existing outliers or extreme values, so that these outliers are dealt with before fitting the data into the model. If these extreme values are left as they are, the accuracy of the model is compromised. The attributes containing outliers have to be standardized in order to generate a more accurate model.

Fig. 2 gives the distribution of applicant income.

Fig. 2 Distribution Analysis of Applicant Income


The distributional analysis of the applicant income shows a few outliers in the data. A similar analysis is performed on all the numerical features in order to obtain a clear understanding of all the existing outliers and their range. Fig. 3 and Fig. 4 give plots of the applicant income grouped in terms of education (whether or not the applicant is a graduate) and gender (male or female), respectively.

Fig. 3 Applicant income grouped in terms of education

Fig. 4 Applicant income grouped in terms of gender
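The paper does not list the code behind the grouped plots of Fig. 3 and Fig. 4. A minimal sketch of how similar plots can be drawn with pandas box plots, assuming the same train_data DataFrame and column names used elsewhere in this paper:

import matplotlib.pyplot as plt

# Box plots expose the ApplicantIncome outliers within each group,
# in the spirit of Fig. 3 (by education) and Fig. 4 (by gender).
train_data.boxplot(column='ApplicantIncome', by='Education')
train_data.boxplot(column='ApplicantIncome', by='Gender')
plt.show()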

Code 1 shows the Python code employed to carry out the distributional analysis of applicant income and get an idea of the outliers, and Code 2 shows the Python code employed to estimate the number of missing values.

import matplotlib.pyplot as plt

plot1 = plt.hist(train_data['ApplicantIncome'], bins=50)
plt.xlabel('APPLICANT INCOME')
plt.title('DISTRIBUTION ANALYSIS OF APPLICANT INCOME')
plt.show()

Code 1

print("CHECKING MISSING VALUES IN TRAIN DATASET")
print(train_data.apply(lambda x: sum(x.isnull()), axis=0))

Code 2

The distributional analysis of all the numerical features yields the following consolidated results:

• There exist some extreme values or outliers which need to be dealt with before fitting the data into the model.
• There are some missing values in the data set. These places have to be filled with suitable values so that logistic regression can be applied uniformly.

In order to overcome these inconsistencies, data munging has to be performed on the data.

D. Data Munging

The problems found while analysing the numerical data are handled using a technique called data munging. It is a pre-processing of the data that yields an ideal data set for modelling.

1) Handling the missing values: The missing values in the non-numerical attributes can be imputed using the statistical mode. Finding the mode of each attribute containing missing values and filling it in place of the missing values serves our purpose. Code 3 shows the Python code employed to fill the missing non-numerical values.

# Fill missing categorical values with the column mode (most frequent value).
train_data['Gender'].fillna(train_data['Gender'].mode()[0], inplace=True)
train_data['Married'].fillna(train_data['Married'].mode()[0], inplace=True)

Code 3

For numerical variables that depend on other variables, the missing values have to be imputed taking into consideration the values of the variables on which they depend. The pivot table method is used to fill these missing values accurately. Code 4 gives the Python code for filling the missing values in the LoanAmount variable, taking into consideration its dependency on the borrower's education and self-employment status.


import numpy as np

table = train_data.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc=np.median)

# Define function to return the value of this pivot table
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]

# Replace missing values
train_data['LoanAmount'].fillna(train_data[train_data['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

Code 4

2) Handling the extreme values: The extreme values in the data can be handled by standardizing the data, i.e. by taking a log transformation of the variables which have outliers. Code 5 shows the Python implementation of this method for the applicant income variable. Other variables containing outliers are handled in a similar fashion. After running Code 5, the graph shown in Fig. 5 is plotted, and it is observed that the variable no longer has outliers.

# Combine applicant and co-applicant income, then log-transform
# to damp the effect of the extreme values.
train_data['TotalIncome'] = train_data['ApplicantIncome'] + train_data['CoapplicantIncome']
train_data['TotalIncome_log'] = np.log(train_data['TotalIncome'])

Code 5

Fig. 5 Managing extreme values in applicant income

E. Fitting data into model

Once the data is properly rectified and idealized, it is used to train the model. Training of the model can be accomplished in either a supervised or an unsupervised manner: supervised learning includes the desired category of the output, whereas unsupervised learning does not. In this problem of loan approval prediction, the training data set contains the desired outcome for each borrower, hence it is a supervised learning model. Not all of the variables should be selected, otherwise it might lead to the problem of overfitting: taking a larger number of attributes will result in the model learning the specifics of the dataset rather than generalizing, and it will thus fail to give accurate results for other, general cases. A few attributes which have a direct impact on whether or not the loan for a given borrower should be approved are chosen and fitted into the model. While fitting, the desired output variable is also passed to the fitting routine, since this is supervised learning. Code 6 shows the Python commands to fit the data to the logistic regression model.

from sklearn.linear_model import LogisticRegression

outcome_var = ['Loan_Status']
y = train_data[outcome_var]
model = LogisticRegression()
predictor_var = ['Credit_History', 'Education', 'Married', 'Self_Employed', 'Property_Area']
## gender is not taken as a possible term affecting the prediction

# Fit the model:
clf = model.fit(train_data[predictor_var], y.values.ravel())

Code 6
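Code 6 passes categorical columns such as 'Education' and 'Property_Area' to the classifier. scikit-learn estimators require numeric inputs, so these columns must be converted to numeric codes before fitting; the paper does not show this step explicitly. A minimal sketch of one way to do it, using scikit-learn's LabelEncoder on the same train_data DataFrame:

from sklearn.preprocessing import LabelEncoder

# Credit_History is already numeric (0/1); the remaining predictors and the
# outcome are text labels and are converted to integer codes here.
for col in ['Education', 'Married', 'Self_Employed', 'Property_Area', 'Loan_Status']:
    train_data[col] = LabelEncoder().fit_transform(train_data[col].astype(str))

With this preprocessing in place, the fit call in Code 6 operates on purely numeric data.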

F. Predicting and finding Probability of outcomes

Whether or not the loan will be approved for a given borrower can easily be predicted from the developed model, as illustrated in Code 7. Furthermore, the probability of the loan being accepted can also be obtained from the logistic regression model. This probabilistic output is an advantage of logistic regression over other models.


# test_data is assumed to be the test set, loaded and pre-processed
# in the same way as train_data.
predictions = model.predict(test_data[predictor_var])
print(predictions)

probabilities = model.predict_proba(test_data[predictor_var])
print(probabilities)

Code 7
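Section III states that the model should be accurate enough for the lending firms' standards, but the paper does not show an evaluation step. One possible way to estimate accuracy is k-fold cross-validation on the training data; a minimal sketch using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the logistic regression model.
scores = cross_val_score(model, train_data[predictor_var],
                         train_data['Loan_Status'], cv=5, scoring='accuracy')
print('Cross-validated accuracy: %.3f' % scores.mean())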
V. APPLICATIONS OF LOGISTIC REGRESSION

Logistic regression is a generalized form of linear regression and is used whenever the outcome of a problem is dichotomous and depends on some other variables.

Some of the applications of logistic regression include:

• In financial firms, as illustrated by the problem of loan sanctioning. Given the manual work required for processing a large number of applications, automation with high accuracy has proved to be a blessing for these firms.
• Logistic regression is widely used in data analytics, where analysis of pre-existing data is required within all kinds of organisations. This helps the economic growth of the organisation, as predictions about future policies can be made on the basis of pre-existing policies.
• From a governmental standpoint, this model can be applied to predict whether or not the policies a government wishes to introduce will yield the desired results.

The probabilistic approach which accompanies the predictive approach is very useful in real-world applications. The applications of logistic regression modelling are not restricted to those stated above but extend to many more fields.

VI. CONCLUSION

Various machine learning models exist for predictive analysis, such as logistic regression, decision trees, artificial neural networks (ANNs) and Bayesian networks. This paper deals with the logistic regression model. While logistic regression is a statistical model, the other three are graphical models. ANNs have a very complex structure with multiple layers of nodes. Logistic regression and ANNs are the most widely used because they are easy to develop and provide highly accurate predictive analysis. Logistic regression can handle non-linear effects and power terms, and the independent variables on which the prediction is based need not be normally distributed.

Limitations of the logistic regression model are:

• Logistic regression requires a large sample for parameter estimation.
• It requires independent variables for estimation. If the variables are not independent, then the model tends to over-weigh the importance of the dependent variables.
• The logistic model cannot provide continuous outputs. It only provides discrete values as its output, i.e. either true or false. For example, it cannot predict how the temperature will rise in a given city based on previous data.

REFERENCES

[1] Chao-Ying Joanne Peng, Kuk Lida Lee and Gary M. Ingersoll, "An Introduction to Logistic Regression Analysis and Reporting", The Journal of Educational Research, Indiana University Bloomington, vol. 96, pp. 1-13, September/October 2002.
[2] David W. Hosmer and Stanley Lemeshow, Applied Logistic Regression, 2nd ed., John Wiley & Sons, 2004, pp. 47-135.
[3] Hyeoun-Ae Park, "An Introduction to Logistic Regression: From Basic Concepts to Interpretation with Particular Attention to Nursing Domain", J Korean Acad Nurs., vol. 43, pp. 154-164, April 2013.
[4] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, Massachusetts Institute of Technology, 2012, pp. 1-21.
[5] Ömay Çokluk, "Logistic Regression: Concept and Application", Kuram ve Uygulamada Egitim Bilimleri, vol. 10, pp. 1397-1407, Summer 2010.
[6] Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 1st ed., Prentice Hall Inc., 1995, pp. 4-8.
[7] Vijayalakshmi Sampath, Andrew Flagel and Carolina Figueroa, "A Logistic Regression Model To Predict Freshmen Enrolments", Northern Virginia Community College, VA, pp. 1-12, May 2016.
[8] yhat - Machine Learning, Data Science and Engineering blog, "Logistic Regression in Python", March 2013.
