India Credit Risk Default Model - Nivedita Dey - PGP BABI May19 - 2
Mini Project
1. Project Objective
The objective of this report is to build an India credit risk (default) model in R using a logistic regression framework, based on "raw-data.xlsx".
The report also reflects on the performance of the model using "validation_data.xlsx".
2. Assumptions
3. Exploratory Data Analysis – Step by step approach
The steps followed to analyze the case study are explained below.
The R packages used to analyze the data are listed below:
Setting up the working directory helps keep all the files related to the project in one place
on the system.
The given datasets are in “.xlsx” format, so to import the data into R we use the “read_excel”
command from the readxl package.
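The import and the first structural checks can be sketched as follows. The working-directory path is illustrative; the file names and the expected dimensions are taken from the report.

```r
# Sketch: load the two workbooks and run the first structural checks.
library(readxl)

setwd("~/credit-risk-project")    # illustrative path; use your own

rawTrainData <- read_excel("raw-data.xlsx")
valTestData  <- read_excel("validation_data.xlsx")

dim(rawTrainData)    # expected: 3541 rows, 52 columns
dim(valTestData)     # expected: 715 rows, 52 columns

str(rawTrainData)    # variable types
head(rawTrainData)   # first few records
summary(rawTrainData)
```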
Please refer to Appendix A for the source code.
DIM
rawTrainData data frame: there are 3541 rows and 52 columns
valTestData data frame: there are 715 rows and 52 columns
STR
There are 52 variables in both the raw and the validation dataset.
HEAD
rawTrainData data frame: Verifying head records
SUMMARY
rawTrainData and valTestData data frames: variables such as WIP turnover, Raw
material turnover and Shares outstanding should be numbers, so we will convert them to
numeric. The continuous variables have outliers, and there are missing values as well.
Please refer to Appendix A for the source code.
We use the ‘is.na’ function to check whether there are any missing values. There are, so
we plot them to get the overall picture.
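The missing-value check and the overview plot can be sketched as below. The VIM package is one common choice for visualizing missingness; it is an assumption here, since the report does not name the plotting package.

```r
# Count missing values per column
colSums(is.na(rawTrainData))

# Visual overview of missingness (VIM package assumed)
library(VIM)
aggr(rawTrainData, col = c("navyblue", "red"),
     numbers = TRUE, sortVars = TRUE)
```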
rawTrainData:
valTestData:
We replaced the missing values with the column mean in both datasets.
Please refer to Appendix A for the source code.
In the summary of the data we saw that a few of the variables should be numeric, so we use
‘as.numeric’ to convert them.
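The conversion and the mean imputation can be sketched together. The column names below are illustrative; in practice, use the columns flagged by summary().

```r
# Coerce character columns that actually hold numbers (names illustrative)
numCols <- c("WIP Turnover", "Raw material turnover", "Shares outstanding")
rawTrainData[numCols] <- lapply(rawTrainData[numCols],
                                function(x) as.numeric(as.character(x)))

# Replace remaining NAs in every numeric column with the column mean
rawTrainData[] <- lapply(rawTrainData, function(x) {
  if (is.numeric(x)) x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
```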
The following variables were created on train dataset:
Variable Name              Type           Formula
PAT2Sales                  Profitability  Profit after tax / Sales
PAT2Totalassets            Profitability  Profit after tax / Total assets
PAT2Equity                 Profitability  Profit after tax / Total equity
Liquidity                  Liquidity      Net working capital / Total assets
Leverage                   Leverage       Total liabilities / Total equity
Totalincome2Totalassets    Size           Total income / Total assets
We analyze all 50 independent variables from the ‘rawTrainData’ dataset. ‘Networth
Next Year’ is the dependent variable, from which we created the ‘default’ variable; refer to
the Variable Transformation section for details. We then perform univariate
and bivariate analysis.
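The derived variables can be sketched as below. The exact source-column names, and the rule that a company is flagged as a defaulter when next year's net worth is negative, are assumptions; the report's Variable Transformation section defines the actual rule.

```r
# 'default' flag: assumed to be 1 when next year's net worth is negative
rawTrainData$default <- ifelse(rawTrainData$`Networth Next Year` < 0, 1, 0)

# Ratio variables from the table above (column names assumed)
rawTrainData$PAT2Sales       <- rawTrainData$`Profit after tax` / rawTrainData$Sales
rawTrainData$PAT2Totalassets <- rawTrainData$`Profit after tax` / rawTrainData$`Total assets`
rawTrainData$PAT2Equity      <- rawTrainData$`Profit after tax` / rawTrainData$`Total equity`
rawTrainData$Liquidity       <- rawTrainData$`Net working capital` / rawTrainData$`Total assets`
rawTrainData$Leverage        <- rawTrainData$`Total liabilities` / rawTrainData$`Total equity`
rawTrainData$Totalincome2Totalassets <- rawTrainData$`Total income` / rawTrainData$`Total assets`
```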
• All the variables, except a few such as PE on BSE and Cumulative retained profit, are
concentrated in a narrow range of values. They are either right- or left-skewed, hence
the difference between mean and median.
• Change in stock, Long Term Liabilities / tangible net worth, Net working capital, Raw
material turnover and PE on BSE have negative values.
• The summary shows outliers in most of the continuous variables, but their
number is small.
• The scatter plot shows that values are not widely spread and there are outliers in most of
the variables.
We will analyze default against the other variables from the ‘rawTrainData’ dataset.
Most of the variables do not seem to have much effect on whether a company will default.
Companies that have defaulted have a low PAT as a percentage of net worth.
Please refer to Appendix A for the source code.
There are outliers in most of the variables, as is evident from the box plots and the
summary. Hence we capped everything above the 99th percentile and below the 1st percentile in
rawTrainData.
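The 1%/99% capping described above can be sketched as a small winsorising helper. The helper name is illustrative; note that it leaves non-numeric columns (and the 0/1 default flag, whose 1st and 99th percentiles are its own extremes) unchanged.

```r
# Winsorise each numeric column at the 1st and 99th percentiles
capOutliers <- function(x) {
  if (!is.numeric(x)) return(x)
  q <- quantile(x, probs = c(0.01, 0.99), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

rawTrainData[] <- lapply(rawTrainData, capOutliers)
```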
3.8 Correlation/Multicollinearity
Based on the above plot, we can see that many of the variables are highly correlated.
Hence we will check for multicollinearity during model building and drop variables if
required.
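A correlation plot of the numeric predictors can be produced as below. The corrplot package is an assumption; any correlation heatmap function would do.

```r
# Correlation matrix over the numeric columns, plotted as a heatmap
library(corrplot)

numVars <- sapply(rawTrainData, is.numeric)
corMat  <- cor(rawTrainData[, numVars], use = "pairwise.complete.obs")

corrplot(corMat, method = "color", tl.cex = 0.5)
```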
4. Logistic regression
Logistic regression is a supervised learning technique. It is used to describe data
and to explain the relationship between one binary dependent variable and one or more nominal,
ordinal, interval or ratio-level independent variables.
The dependent variable, default, is dichotomous in nature.
We can scale the data to reduce the impact of outliers. While building the model, we checked
with scaled data as well, but scaling had no impact on the model. Hence, we do not
scale the data.
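The scaling check mentioned above can be sketched as follows; this is the variant that was ultimately not used.

```r
# Z-score the numeric predictors, keeping the 0/1 target unscaled
numCols <- sapply(rawTrainData, is.numeric)

scaledTrain          <- rawTrainData
scaledTrain[numCols] <- scale(rawTrainData[numCols])
scaledTrain$default  <- rawTrainData$default
```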
4.1.1 Model-1
The initial model is built with all the variables, after which we check for multicollinearity.
Since 4 coefficients are not defined because of singularities, VIF cannot be computed. Hence
we remove those variables and rebuild the model to check for multicollinearity.
Multicollinearity: the VIF values for all the variables are low, so there is no multicollinearity.
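The Model-1 fit and the VIF check can be sketched as below, assuming `model1` stands for Model-1, all predictors are numeric, and the car package provides vif(). Backticks handle column names containing spaces.

```r
library(car)   # vif()

# Full model with every predictor
fullModel <- glm(default ~ ., data = rawTrainData, family = binomial)

# Terms with NA coefficients are aliased (singularities); vif() fails on them
aliased <- names(coef(fullModel))[is.na(coef(fullModel))]
keep    <- setdiff(setdiff(names(rawTrainData), "default"), aliased)

# Refit without the aliased variables, then check multicollinearity
fmla   <- reformulate(sprintf("`%s`", keep), response = "default")
model1 <- glm(fmla, data = rawTrainData, family = binomial)
vif(model1)
```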
4.1.2 Model-2
We build the final model by iteratively removing the variable with the highest p-value until
most of the variables are significant and the AIC is low.
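One pruning step of this loop can be sketched as below, assuming `model1` holds the Model-1 fit; the variable being dropped is purely illustrative.

```r
summary(model1)                                # inspect p-values

# Drop the term with the highest p-value (name illustrative) and refit
model2 <- update(model1, . ~ . - `PE on BSE`)

AIC(model1, model2)                            # keep the change if AIC improves
```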
To predict the class, we find the threshold value, which comes to around 0.1, from the plot given below:
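Scoring at that cut-off can be sketched as below, assuming `model2` holds the final model (Model-2).

```r
# Predicted probabilities and class labels at the 0.1 cut-off
trainProb  <- predict(model2, newdata = rawTrainData, type = "response")
trainClass <- ifelse(trainProb > 0.1, 1, 0)

# Confusion matrix on the training data
table(Actual = rawTrainData$default, Predicted = trainClass)
```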
4.2 Performance Metrics
Metrics Value for Training Dataset
Accuracy 0.89
Sensitivity 0.78
Specificity 0.90
AUC 0.92
K-S 0.70
Gini 0.74
Metrics Value for Testing Dataset
Accuracy 0.83
Sensitivity 0.79
Specificity 0.84
AUC 0.88
K-S 0.68
Gini 0.76
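The metrics in these tables can be computed as sketched below. The ROCR and ineq packages are assumptions, as is `trainProb` standing for the final model's predicted probabilities on the training data; ineq's Gini on the score distribution is one common convention in such reports (2·AUC − 1 is an alternative definition).

```r
library(ROCR)   # AUC, ROC curve
library(ineq)   # Gini coefficient

# trainProb: predicted probabilities from the final model
pred <- prediction(trainProb, rawTrainData$default)

auc  <- performance(pred, "auc")@y.values[[1]]
perf <- performance(pred, "tpr", "fpr")
ks   <- max(perf@y.values[[1]] - perf@x.values[[1]])   # max TPR-FPR gap
gini <- ineq(trainProb, type = "Gini")

c(AUC = auc, KS = ks, Gini = gini)
```

The same calls on the validation probabilities yield the testing-dataset column.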
4.3 Rank Chart
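A decile rank table underlying such a chart can be sketched as below; duplicate probability values can make some quantile breaks coincide, in which case cut() needs deduplicated breaks. `trainProb` again denotes the final model's training-set probabilities.

```r
# Bucket observations into deciles of predicted probability
deciles <- cut(trainProb,
               breaks = unique(quantile(trainProb, probs = seq(0, 1, 0.1))),
               include.lowest = TRUE, labels = FALSE)

# Default rate per decile, and lift against the overall default rate
rankTab <- aggregate(rawTrainData$default,
                     by = list(decile = deciles), FUN = mean)
rankTab$lift <- rankTab$x / mean(rawTrainData$default)

rankTab[order(-rankTab$decile), ]
```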
4.4 Interpretation
Based on the performance metrics of the model on the raw and validation data, we can say the model
is good and stable, as the values of AUC, Gini, K-S and lift are comparable across the raw and
validation data.
Based on the train metrics we can interpret that:
1. The model will catch 78% of the companies that will actually ‘Default’ (sensitivity)
2. The model will catch 90% of the companies that will actually ‘Not default’ (specificity)
3. Overall accuracy is 89%
4. Of the companies the model predicted will ‘Default’, about 30% will actually default
5. Of the companies the model predicted will ‘Not default’, about 95% will actually not default
6. AUC is about 92%, so it is a good classifier
7. K-S is 70%, which is also good; the model separates well between default and not default
8. Gini is above 70%, so the model has good predictive power
9. Lift is about 7 times