Business Report M2 PDF
Problem Statement
Businesses can fall into default if they are unable to keep up with their debt obligations. A default lowers the company's credit rating, which in turn reduces its chances of obtaining credit in the future, and the company may have to pay higher interest on existing debts as well as on any new obligations. From an investor's point of view, a company is worth investing in if it is capable of handling its financial obligations, can grow quickly, and is able to manage that growth.
The available data includes information from the companies' financial statements for the previous year (2015). Information about each company's net worth in the following year (2016) is also provided, which can be used to derive the labelled field.
An explanation of the data fields is available in the data dictionary, 'Credit Default Data Dictionary.xlsx'.
Hints :
Dependent variable - We need to create a default variable that takes the value 1 when net worth next year is negative and 0 when net worth next year is positive.
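The default-variable rule above can be sketched as a one-line pandas step; the column name "Networth Next Year" is an assumption here and should be taken from the data dictionary:

```python
import pandas as pd

# Hypothetical sample values; the real column comes from the 2016 data.
df = pd.DataFrame({"Networth Next Year": [-100.5, 250.0, 0.0, -3.2]})

# Default = 1 when next year's net worth is negative, else 0.
df["default"] = (df["Networth Next Year"] < 0).astype(int)
print(df["default"].tolist())  # [1, 0, 0, 1]
```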
Test Train Split - Split the data into Train and Test dataset in a ratio of 67:33 and use
random_state =42. Model Building is to be done on Train Dataset and Model Validation is to be
done on Test Dataset.
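The split described in the hint corresponds to the following scikit-learn sketch, using a toy feature matrix in place of the actual financial data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and default labels standing in for the 2015 financials.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# 67:33 split with random_state=42, as specified in the hints.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape)  # (67, 2) (33, 2)
```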
• Please avoid sharing code in the business report; marks may be deducted if code is included in the report
• Please ensure all the graphs displayed in the report are clearly visible
• Proper interpretation should be provided wherever required
1.8 Build a Random Forest Model on Train Dataset. Also showcase your
model building approach
We performed the train–test split as specified in the question and built a random forest model on the training data.
The random forest model was built with 100 estimators (trees) and fitted on the training data. We also set the maximum number of features considered at each split to 6, which helps with performance tuning.
After fitting the model, we obtained a high out-of-bag score of 0.96, meaning the model correctly predicts about 96% of the out-of-bag samples drawn from the training data.
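A minimal sketch of this configuration with scikit-learn, using synthetic data in place of the actual training set (so the printed OOB score will differ from the report's 0.96):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 100 trees, at most 6 features per split, out-of-bag scoring enabled.
rf = RandomForestClassifier(n_estimators=100, max_features=6,
                            oob_score=True, random_state=42)
rf.fit(X, y)
print(round(rf.oob_score_, 2))
```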
1.9 Validate the Random Forest Model on the test Dataset and state the
performance metrics. Also state interpretations from the model
After fitting the model on train dataset, we predict the values while fitting the model on the
test dataset. We also created the confusion and classification report for both the training and
test dataset to measure its performance metrics.
As per the results, we see that we have over-fit our model as we have accuracy value to be
1 and also as per confusion matrix, it identifies all of the values correctly without any false
value. This means we need to pass different value parameters while creating the model to
eliminate over-fitting model.
We performed a grid search to find suitable hyperparameters and built the model once again. After finding the hyperparameters, we recalculated the performance metrics.
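The grid search step might look like the sketch below; the parameter grid shown here is illustrative only, not the one actually used in the report:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Illustrative grid over tree depth and leaf size to curb over-fitting.
param_grid = {"max_depth": [3, 5], "min_samples_leaf": [5, 10]}
grid = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                    param_grid, cv=3, scoring="recall")
grid.fit(X, y)
print(grid.best_params_)
```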
Confusion Matrix and Classification Report on the train data (after hyperparameter tuning):
Confusion Matrix and Classification Report on the test data (after hyperparameter tuning):
We see high accuracy and high recall on both the training and test data, and the model now performs well, eliminating the over-fitting issue that occurred before tuning the hyperparameters.
1.10 Build a LDA Model on Train Dataset. Also showcase your model
building approach
We built a Linear Discriminant Analysis model on the same training data as above.
We built the model without passing any parameters, so the model uses all of its default parameters while fitting.
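With defaults only, the LDA step reduces to the following sketch (again on synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# LDA with all default parameters, as described above.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(round(lda.score(X, y), 2))  # training accuracy
```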
1.11 Validate the LDA Model on the test Dataset and state the performance
metrics. Also state interpretations from the model
After fitting the model on the training data, we predicted values on the test data.
The confusion matrices for both the training and test sets show that true positives and true negatives are far more common than false negatives and false positives.
Accuracy is high on both the training and test sets, at 94% and 93% respectively.
Precision is also high, but recall is low on both the training and test sets.
Recall is a problem for this model; as per the problem statement, we need to improve recall because the goal is to identify potentially risky customers who are likely to default in the future.
1.12 Compare the performances of Logistic Regression, Random Forest
and LDA models (include ROC Curve)
We are considering recall along with precision as our main performance metrics. Let us compare the metrics on the test set for all our models.
The logistic regression model had low recall at first, but after applying the SMOTE technique its performance, and hence its recall, improved.
The random forest model achieved better overall performance metrics, including precision and recall, on both the training and test sets.
The AUC score is highest for this model: 0.99 on the training set and 0.98 on the test set.
The classification report is shown in the figure above.
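The ROC/AUC comparison behind these figures can be sketched as follows; the model and data here are placeholders, not the report's fitted models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Placeholder model and data for illustrating the ROC/AUC computation.
X, y = make_classification(n_samples=400, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probabilities for the positive class drive the ROC curve and AUC.
probs = model.predict_proba(X)[:, 1]
fpr, tpr, _ = roc_curve(y, probs)
auc = roc_auc_score(y, probs)
print(round(auc, 2))
```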
The LDA model's performance metrics are significantly lower than those of the other models, Random Forest and Logistic Regression, as shown above.
• We are choosing the Random Forest model as our optimal model because it has the highest performance across all metrics: accuracy, precision and recall, on both the training and test sets.
• We chose this model as the best because the logistic regression model gives better results only when the SMOTE technique is used to address the imbalanced dataset. Logistic regression gives the best recall of all our models, but its precision for predicted defaulters takes a hit.
• We set out to identify potential customers of the bank who are predicted to default, so our model should give high recall, but we also need to consider a model that gives high precision.
• The random forest model also gives a high AUC of 98% on the test set, which will give better results in identifying potential defaulters, and identifying defaulters is important for us as a bank.
We imported the dataset, which contains the stock values of various companies across the years 2014 to 2021.
The dataset has 314 rows and 11 columns.
Descriptive Statistics:
2.1 Draw Stock Price Graph (Stock Price vs. Time) for any 2 given stocks
with inference
Below is the Stock Price Graph for 2 major companies – Infosys, Idea Vodafone
Infosys:
The stock price trends upward from 2014 to 2021. There is a slight dip in 2018, but the company got hold of its problems and the stock price has risen steeply since then.
Idea-Vodafone:
In this plot, the stock price decreases over the years; the price is inversely related to time. We find the company carries higher risk as a stock-market investment, as it could not improve its share value even after the merger.
We plotted graphs for the other companies as well; the results are below.
Companies with rising stock prices over the years: Infosys, Indian Hotel, Shree Cement, Axis Bank.
Companies with high variability in stock prices across the years: SAIL, Jindal Steel, Jet Airways, Mahindra & Mahindra.
Companies with declining stock prices over the years: Sun Pharma, Idea Vodafone.
We calculate returns as the difference between the logarithm of a day's stock price and that of the previous day.
The first row returns a null value, as there is no previous-day stock price.
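The log-return calculation can be sketched with pandas; the price series below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for one stock.
prices = pd.Series([100.0, 110.0, 121.0, 108.9])

# Log return: log(price today) - log(price previous day).
log_returns = np.log(prices).diff()
print(log_returns)  # first value is NaN, since there is no previous price
```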
Below are the mean and standard deviation of the stock returns across all companies, calculated using Python functions.
The lower the standard deviation, the lower the risk of investing in that particular company; a higher mean value indicates that the stock's returns are on the higher side.
We created a data frame showing the mean value as the average and the standard deviation as the volatility for all companies.
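Building that summary data frame might look like this sketch; the two stocks and their returns are invented for illustration:

```python
import pandas as pd

# Hypothetical log returns for two stocks.
returns = pd.DataFrame({"StockA": [0.01, -0.02, 0.03, 0.00],
                        "StockB": [0.10, -0.15, 0.20, -0.05]})

# Average return and volatility (standard deviation) per company.
summary = pd.DataFrame({"Average": returns.mean(),
                        "Volatility": returns.std()})
print(summary)
```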
We also plotted these columns: Average stock price vs. Standard Deviation (Volatility).
For the X-axis, we want companies with a mean value above 0, so we drew a reference line at mean = 0; values to its right indicate companies with a higher mean stock value.
For the Y-axis, we set a limit at a lower-risk value of 0.02 to find which companies have risk near or below that limit.
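The plot with the two reference lines can be sketched in matplotlib; the summary values below are placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted plotting
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder average/volatility values for three companies.
summary = pd.DataFrame({"Average": [0.002, -0.001, 0.004],
                        "Volatility": [0.015, 0.035, 0.010]},
                       index=["A", "B", "C"])

fig, ax = plt.subplots()
ax.scatter(summary["Average"], summary["Volatility"])
ax.axvline(0, color="grey")     # X-axis reference: mean value = 0
ax.axhline(0.02, color="grey")  # Y-axis reference: lower-risk limit 0.02
ax.set_xlabel("Average")
ax.set_ylabel("Volatility")
fig.savefig("risk_plot.png")
```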
We see that Infosys and Shree Cement have high stock values with low volatility, and Indian Hotel and Axis Bank also perform well, though with lower stock values than the top two.
Below are the conclusions and recommendations on the market risk analysis