FRA Report
Sravanthi.M
Table of Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step by step approach
   3.1. Environment Set up and Data Import
        3.1.1. Install necessary Packages and Invoke Libraries
        3.1.2. Cleaning up data
        3.1.3. Reading the Data and visualization
   3.2. Variable Identification
4. Conclusion
5. Detailed Explanation of Findings
   5.1. EDA
        Outlier Treatment
        New Variables Creation (one ratio each for profitability, leverage, liquidity and company size)
   5.2. Modelling
        Build Logistic Regression Model on the most important variables
        Sort the data in descending order based on probability of default, then divide it into 10 deciles based on probability and check how well the model has performed
6. Source Code
1 Project Objective
The objective of the project is to create an India credit risk (default) model using the given training
dataset and to validate it. The logistic regression framework is to be used to develop the credit default model.
2 Assumptions
The raw data provided comprises financial data.
The major data points are treated as variables.
4 Conclusion
Major data points or variables are Net worth next year, Total assets, Net worth, Total income, Total
expenses, Profit after tax, PBDITA, PBT (Profit Before Tax), Cash profit, PBDITA as % of total income,
PBT as % of total income, Cash profit as % of total income, PAT as % of net worth, Sales, Total
capital, Reserves and funds, Borrowings, Current liabilities & provisions, Capital employed, Net fixed
assets, Investments, Net working capital, Debt to equity ratio (times), Cash to current liabilities
(times), Total liabilities.
In addition to the above variables, there are other financial parameters that define the financial
strength of the organization, taking the total tally of variables to 51.
5 Detailed Explanation of Findings
5.1 EDA
Ans: There are two datasets, a Training dataset and a Testing dataset, with similar variables. The datasets
consist of organisation details such as Net worth next year, Total assets, Net worth, Total income, Total expenses,
Profit after tax, PBDITA, PBT (Profit Before Tax), Cash profit, PBDITA as % of total income, PBT as
% of total income, Cash profit as % of total income, PAT as % of net worth, Sales, Total capital,
Reserves and funds, Borrowings, Current liabilities & provisions, Capital employed, Net fixed assets,
Investments, Net working capital, Debt to equity ratio (times), Cash to current liabilities (times) and Total
liabilities.
Output:
From the above plot, we can notice that the training dataset has 9.5% missing observations.
The missing observations are replaced with the median of the corresponding column, and the columns that are
entirely missing are removed from the dataset. On running the plot again, it shows that
the training dataset no longer has any missing observations or columns.
Output:
Similarly, the testing dataset also has variables of the type character.
In the following code, the variables of type character are changed into the type numeric, and the
missing observations in each column are replaced with the median of that column.
Output:
From the above plot, we can notice that the testing dataset has 9.4% missing observations and 1.9%
missing columns.
Running the above code changes the variables of type character into the type numeric and replaces the
missing observations in each column with the median of that column. The columns that are entirely missing
are removed from the dataset.
Output:
The above plot shows that the testing dataset does not contain any missing observations or missing
columns.
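The following is a minimal sketch of the conversion and imputation step described above, mirroring the loops in the source code of Section 6; the helper name impute_median is ours, and it assumes every column is stored as character or numeric and can be coerced to numeric.

impute_median <- function(df) {
  df <- as.data.frame(df)
  for (i in seq_len(ncol(df))) {
    df[, i] <- suppressWarnings(as.numeric(df[, i]))        # character -> numeric
    df[is.na(df[, i]), i] <- median(df[, i], na.rm = TRUE)  # fill NAs with the column median
  }
  df[, colSums(is.na(df)) == 0, drop = FALSE]               # drop columns that are entirely missing
}

train <- impute_median(train)
test  <- impute_median(test)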
Outlier Treatment:
The outliers in the dataset are treated by replacing observations below the 1st percentile with the
value of the 1st percentile, and observations above the 99th percentile with the value of the 99th
percentile. This outlier treatment is applied to every column in the dataset.
The quantile function identifies the observations below the 1st percentile and above the 99th
percentile, and the squish function (from the scales package) replaces these outliers with the
1st-percentile and 99th-percentile values. Redundant variables are then removed from the Training and Testing
datasets.
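A short sketch of this capping step, assuming the squish function from the scales package (loaded in Section 6) and the 1st/99th percentile cut-offs described above:

library(scales)

## Cap every predictor column at its 1st and 99th percentile values
for (i in 2:ncol(train)) {
  q <- quantile(train[, i], probs = c(0.01, 0.99), na.rm = TRUE)
  train[, i] <- squish(train[, i], range = q)
}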
Output:
New variables are created as per the requirement. One ratio each for Profitability, Liquidity, Leverage and
company size is required as per the problem statement.
Other ratios are also created by dividing several variables by Total assets; the contribution of these
ratios towards the model is assessed later in the document.
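A possible construction of these ratios is sketched below; the new variable names follow those that appear in the model output later in the report, while the exact numerators and denominators are assumptions made for illustration, as the report does not show this code.

## Ratio variables (the definitions below are illustrative assumptions)
train$Profitability               <- train$`Profit after tax`    / train$`Total income`  # profitability
train$NWC2TA                      <- train$`Net working capital` / train$`Total assets`  # liquidity
train$Networth2Totalassets        <- train$`Net worth`           / train$`Total assets`  # leverage
train$Sales2Totalassets           <- train$Sales                 / train$`Total assets`  # size / turnover
train$Capitalemployed2Totalassets <- train$`Capital employed`    / train$`Total assets`
train$Investments2Totalassets     <- train$Investments           / train$`Total assets`

The same ratios would be created on the testing dataset.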
5.2 Modelling: Logistic Regression
Ans: A logistic regression model is used for this dataset. Initially, all the variables are used as
predictors, with the Default variable as the response variable.
glm(formula = Default ~ Profitability, family = binomial, data = train)
Analysis:
The model has an AIC value of 875.4 and predicts the training and testing datasets with
almost 95% accuracy (seen later in the document).
A few of the most important variables are Total Assets, Cash Profit, PAT as % of net
worth, Reserves and Funds, Current Liabilities and Provisions, Capital employed, Net
Working Capital/Total Assets and Networth/Total Assets. These variables have very
low Pr(>|z|) values.
Among the most important variables, the variables with positive estimates are Total
Assets, Current ratio and Sales/Total Assets, and the variables with negative estimates are
Cash Profit, PAT as % of net worth, Current Liabilities and Provisions and Capital
employed.
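As a hedged illustration of how the most important variables could be shortlisted from the full model (the report does not show its selection code; ranking the predictors by Pr(>|z|) is our assumption):

## Fit the full model and rank predictors by their p-values
trainLOGIT <- glm(Default ~ ., data = train, family = binomial)
coefs <- summary(trainLOGIT)$coefficients

## The predictors with the smallest Pr(>|z|) are the candidate "most important" variables
head(coefs[order(coefs[, "Pr(>|z|)"]), ], 10)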
Model Performance and Measure:
The Logistic regression model that was created is used to predict the Training dataset.
     obs
pred    0   1
   0 3227 107
   1   35 109
attr(,"class")
[1] "confusion.matrix"
The confusion matrix shows that there are 35 Type I errors (false positives) and 107 Type II errors (false negatives).
[1] 0.9591719
The accuracy of the model on the training dataset is 95.92%.
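A sketch of how these figures could be reproduced; the report's exact confusion-matrix helper is not shown (the printed class attribute points to a dedicated confusion.matrix function), so base R table() and an assumed 0.5 probability cut-off are used here.

## Predicted default probabilities on the training data, classified at 0.5
## (the 0.5 cut-off is an assumption, not stated in the report)
probTrain <- predict(trainLOGIT, newdata = train, type = "response")
predTrain <- ifelse(probTrain > 0.5, 1, 0)

## Confusion matrix (rows = predicted, columns = observed) and overall accuracy
confTrain <- table(pred = predTrain, obs = train$Default)
confTrain
sum(diag(confTrain)) / sum(confTrain)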
The same logistic regression model is used to predict the testing dataset.
     obs
pred   0  1
   0 639 20
   1  22 34
attr(,"class
")
[1] "confusion.matrix"
The confusion matrix shows that there are 22 Type I errors (false positives) and 20 Type II errors (false negatives).
[1] 0.9412587
The accuracy of the model on the testing dataset is 94.13%.
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Call:
roc.default(response = test$`Default - 1`, predictor = PredLOGIT)
Data: PredLOGIT in 661 controls (test$`Default - 1` 0) < 54 cases (test$`Default - 1` 1).
Area under the curve: 0.941
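The messages above are characteristic of the roc() function from the pROC package; pROC is not loaded in the source code of Section 6, so the sketch below is an inference from the output rather than the report's own code.

library(pROC)

## Predicted default probabilities on the testing dataset
PredLOGIT <- predict(trainLOGIT, newdata = test, type = "response")

## ROC curve and area under the curve for the testing dataset
rocTest <- roc(response = test$`Default - 1`, predictor = PredLOGIT)
auc(rocTest)
plot(rocTest)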
Deciling:
The training dataset is then divided into 10 deciles based on the probability of default.
The ranks of the deciles are seen above. The deciles are sorted in descending order, and the 10th decile
has the maximum number of defaults (cnt_resp).
The testing dataset is then divided into 10 deciles based on the probability of default.
The ranks of these deciles are seen above as well; they are sorted in descending order, and the 10th decile
again has the maximum number of defaults (cnt_resp).
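A hedged sketch of the deciling step, built on the data.table objects that appear in the source code (tmp_DT); the column names decile and cnt_resp follow the report, while the exact aggregation is an assumption. The testing dataset would be deciled the same way starting from tmp_DT = data.table(test).

library(data.table)

## Attach predicted probabilities and assign each observation to a decile
tmp_DT <- data.table(train)
tmp_DT[, prob := predict(trainLOGIT, newdata = train, type = "response")]
tmp_DT[, decile := cut(prob,
                       breaks = quantile(prob, probs = seq(0, 1, 0.1)),
                       include.lowest = TRUE, labels = FALSE)]

## Count of observations and of defaults (cnt_resp) in each decile,
## sorted in descending order of decile rank
rank_train <- tmp_DT[, .(cnt = .N, cnt_resp = sum(Default)), by = decile][order(-decile)]
rank_train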
The mean is taken for both the Training and Testing datasets to compare the
predicted and observed values.
The plot shows that the model predicts both the Training and Testing datasets
almost accurately, with an accuracy of almost 95%.
6 Source Code
#Loading relevant libraries for current session
library(caTools)
install.packages("car")
library(car)
install.packages("lattice")
library(caret)
library(ROCR)
library(corrplot)
install.packages("ipred")
library(ipred)
library(ggplot2)
install.packages("dplyr")
library(dplyr)
library(StatMeasures)
install.packages("scales")
library(scales)
install.packages("DataExplorer")
library(DataExplorer)
##Companies with Total assets of 3 or less are removed from further analysis
train <- train[!train$`Total assets` <= 3, ]
##Missing Values
sum(is.na(train))
## Print the class of each column (several columns are of type character)
for(i in 1:length(train)){
  print(paste(colnames(train[i]), class(train[,i])))
}
plot_intro(train)
## Coerce every column to numeric and replace missing values with the column median
for(i in 1:ncol(train)){
  train[,i] <- as.numeric(train[,i])
  train[is.na(train[,i]), i] <- median(train[,i], na.rm = TRUE)
}
train <- train[,-22]   ## drop the column that is entirely missing
sum(is.na(train))
plot_intro(train)
##Missing values#
sum(is.na(test))
test<-as.data.frame(test)
for(i in 1:length(test)){
print(paste(colnames(test[i]),class(test[,i])))
}
plot_intro(test)
## Coerce every column to numeric and replace missing values with the column median
for(i in 1:ncol(test)){
  test[,i] <- as.numeric(test[,i])
  test[is.na(test[,i]), i] <- median(test[,i], na.rm = TRUE)
}
##Outlier Treatment
boxplot(rawdata)
## Cap each predictor column at the 1st and 99th percentiles
for(i in 2:ncol(train)){
  q <- quantile(train[,i], c(0.01, 0.99))
  train[,i] <- squish(train[,i], q)
}
##Redundant variables are removed from the Training and Testing dataset
plot_str(train)
plot_intro(train)
plot_missing(train)
plot_histogram(train)
plot_qq(train)
plot_bar(train)
plot_correlation(train)
##Variable Creation
##Logistic Regression
## Full logistic regression model: Default regressed on all predictors
trainLOGIT <- glm(Default ~ ., data = train, family = binomial)
summary(trainLOGIT)
## Refit on the selected ratio variables (additional predictors in the original
## call were truncated in this report and are not shown here)
trainLOGIT <- glm(Default ~ PriceperShare + NWC2TA + Networth2Totalassets +
                    Sales2Totalassets + Capitalemployed2Totalassets +
                    Investments2Totalassets,
                  data = train, family = binomial)
summary(trainLOGIT)
## Deciling of the Training and Testing datasets
library(data.table)
tmp_DT = data.table(train)
tmp_DT = data.table(test)
View(rank)