Article Review 11 Eng
Scripting
End-to-End Solution
Table of Contents
Learn to build an end to end data science project
Data Science Workflow
Business Understanding
Analytic Approach
Data Requirements
Data Collection
Data Understanding
Data Preparation
Exploratory Data Analysis
Model Building
Model Evaluation
Model Deployment
Use Case
References
Other Reading Sources:
Learn to build an end to end data science project
Appreciating the process you must work through for any Data Science project is
valuable before you land your first job in this field. With a well-honed strategy, such
as the one outlined in this example project, you will remain productive and
consistently deliver valuable machine learning models.
A Data Scientist is one who is the best programmer among all the statisticians and
the best statistician among all the programmers. Every Data Scientist needs an
efficient strategy to solve data science problems. Data Science positions differ
across the country, so we can try to predict the salary of a data science position
based on Job Title, Company, Geography, etc. Here I have built a project where any
user can plug in this information and it spits out a salary range, so if anyone
is trying to negotiate, this is a pretty cool tool for them to use.
Business Understanding
This stage is significant because it helps clarify the customer’s target. The
success of any project depends on the quality of the questions asked. If you
understand the business requirement correctly, then it helps you collect the
right data. Asking the right questions will help you narrow down the data
acquisition part.
Analytic Approach
Once the business problem has been clearly stated, the data scientist can
define the analytic approach to solve it. This step means expressing the
problem in terms of statistical and machine-learning techniques, and it is
important because it helps determine what kind of patterns are needed to
address the question in the most efficient way possible. If the question is to
determine the probability of something, then a predictive model might be
used; if the question is to show relationships, a descriptive approach may be
required; and if the problem requires counts, then statistical analysis is the
best way to solve it. Each type of approach can use different algorithms.
Data Requirements
We find out the necessary data content, formats, and sources for initial data
collection, and we use this data inside the algorithm of the approach we
chose. Data reveals impact, and with data, you can bring more science to your
decisions.
Data Collection
We identify the available data resources relevant to the problem domain. To
retrieve the data, we can apply web scraping on a related website, or we can
use a repository with premade datasets that are ready to use. If you want to
collect data from any website or repository, use the Pandas library, which is a
very useful tool to download, convert, and modify datasets.
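As a rough illustration of loading, converting, and modifying a dataset with pandas, the sketch below reads a small CSV. The in-memory text stands in for a file path or URL, and the column names, values, and reference year are made up for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice a local path or a URL can be
# passed straight to pd.read_csv instead of this in-memory buffer.
raw_csv = io.StringIO(
    "job_title,rating,founded\n"
    "Data Scientist,3.8,1973\n"
    "Data Analyst,4.1,2010\n"
)

jobs = pd.read_csv(raw_csv)                     # load the data
jobs["rating"] = jobs["rating"].astype(float)  # convert a column type
jobs["company_age"] = 2020 - jobs["founded"]   # modify: derive a new column (2020 is an assumed reference year)
print(jobs)
```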
For this project, I tweaked a web scraper to collect 1000 job postings from
glassdoor.com. For each job, we get the following: Job Title, Salary Estimate,
Job Description, Rating, Company, Location, Company Headquarters, Company
Size, Company Founded Date, Type of Ownership, Industry, Sector, Revenue,
Competitors. These are the attributes used to determine the salary of a
person working in the Data Science field.
You can have data without information, but you cannot have information
without data.
Data Understanding
At this stage, data scientists try to learn more about the data collected in
the previous step. We have to check the data type of each column and get
familiar with the attributes and their names.
Figure 2: Example of Dataframe
A few salary entries contain -1. Those rows are of little use to us, so let’s
remove them. Because the Salary Estimate column currently holds strings, we
need to compare against -1 in string format.
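A minimal sketch of this check and filter, assuming a pandas DataFrame named df with a "Salary Estimate" column. The toy rows below are made up and only stand in for the scraped data.

```python
import pandas as pd

# Toy stand-in for the scraped Glassdoor data (values are made up).
df = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Analyst", "ML Engineer"],
    "Salary Estimate": [
        "$53K-$91K (Glassdoor est.)",
        "-1",
        "$80K-$120K (Glassdoor est.)",
    ],
    "Rating": [3.8, 4.1, 3.5],
})

# Inspect the type of each column: Salary Estimate is stored as text,
# not as a number.
print(df.dtypes)

# Compare against the string "-1", because the column holds strings.
df = df[df["Salary Estimate"] != "-1"].reset_index(drop=True)
print(len(df))
```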
So now we can see that the number of rows has come down to 742. We
observe that most of our variables are categorical and not numerical. This
dataset comprises 2 numerical and 12 categorical variables. But in reality, our
dependent variable, Salary Estimate, has to be numerical. So we need to
convert that into a numerical variable.
Data Preparation
Data can come in any format. To analyze it, you need to have the data in a
consistent format. Data scientists have to prepare data for modeling, which is
one of the most crucial steps because the data has to be clean and should not
contain any errors or null values.
It is often said that, in real-world scenarios, data scientists spend about 80%
of their time cleaning data and only about 20% giving insights and conclusions.
This is a pretty messy process, so that’s something you should be prepared
for.
After scraping the data, we needed to clean it up so that it was usable for our
model. Let’s make a few changes and create new variables.
When we split each Salary Estimate string on the left parenthesis, the text
to the left and right of ‘(’ goes into two separate list elements. That’s why we
need to take element [0] to get the salaries. After obtaining the salaries,
replace ‘K’ and ‘$’ with an empty string. In a few entries, the salary is marked
as ‘employer provided’ or ‘per hour’; these are inconsistent with the rest and
need to be handled separately.
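The parsing steps above can be sketched as follows. The exact wording of the raw strings ("Employer Provided Salary:", "Per Hour") and the new column names are assumptions made for illustration, and hourly figures are only flagged here, not converted to annual amounts.

```python
import pandas as pd

# Made-up examples of the three formats described in the text.
df = pd.DataFrame({
    "Salary Estimate": [
        "$53K-$91K (Glassdoor est.)",
        "Employer Provided Salary:$80K-$100K",
        "$20-$30 Per Hour (Glassdoor est.)",
    ]
})

text = df["Salary Estimate"]

# Flag the two inconsistent formats before stripping their markers.
df["hourly"] = text.str.lower().str.contains("per hour").astype(int)
df["employer_provided"] = (
    text.str.lower().str.contains("employer provided").astype(int)
)

# Keep only the part before "(", i.e. element [0] of the split.
salary = text.apply(lambda s: s.split("(")[0])

# Replace "K" and "$" with an empty string, then drop the text markers.
salary = salary.str.replace("K", "", regex=False).str.replace("$", "", regex=False)
salary = (
    salary.str.lower()
    .str.replace("per hour", "", regex=False)
    .str.replace("employer provided salary:", "", regex=False)
    .str.strip()
)

# What remains looks like "53-91"; turn it into numeric columns.
df["min_salary"] = salary.apply(lambda s: int(s.split("-")[0]))
df["max_salary"] = salary.apply(lambda s: int(s.split("-")[1]))
df["avg_salary"] = (df["min_salary"] + df["max_salary"]) / 2
print(df[["min_salary", "max_salary", "avg_salary", "hourly"]])
```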
Model Building
At this stage, the data scientist has the chance to see whether the work is
ready to go or needs review. Modeling focuses on developing models that are
either descriptive or predictive. Here, we perform predictive modeling, which
is a process that uses data mining and probability to forecast outcomes. For
predictive modeling, data scientists use a training set, which is a set of
historical data in which the outcomes are already known. This step can be
repeated several times until the model answers the question well enough.
Categorical data has to be encoded numerically before modeling, so I
transformed the categorical variables into dummy variables. I also split the
data into train and test sets with a test size of 20%. I tried three different
models and evaluated them using Mean Absolute Error (MAE). I chose MAE
because it is relatively easy to interpret, and outliers aren’t particularly bad
for this type of model.
After this conversion, the number of columns in our dataset has increased
from 14 to 178!!
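A small sketch of the dummy-variable conversion and the 80/20 split, assuming pandas and scikit-learn. The toy table, its column names, and the random_state value are illustrative only, so the column counts differ from the 14 and 178 reported above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "avg_salary": [72, 90, 110, 65, 130, 85, 95, 120, 70, 105],
    "Rating": [3.8, 4.1, 3.5, 4.0, 4.4, 3.9, 3.2, 4.6, 3.7, 4.2],
    "Sector": ["Tech", "Finance", "Tech", "Health", "Tech",
               "Finance", "Health", "Tech", "Finance", "Health"],
    "Type of ownership": ["Private", "Public", "Public", "Private", "Public",
                          "Private", "Public", "Public", "Private", "Private"],
})

# Each categorical column becomes one indicator column per category level.
df_dum = pd.get_dummies(df)
print(df.shape[1], "->", df_dum.shape[1])

X = df_dum.drop("avg_salary", axis=1)
y = df_dum["avg_salary"].values

# Hold back 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```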
I have implemented three different models:
● Multiple Linear Regression: baseline for the model.
● Lasso Regression: because of the sparse data from the many categorical
variables, I thought a regularized regression like lasso would be
effective.
● Random Forest: again, with the sparsity associated with the data, I
thought this would be a good fit.
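One way to compare the three models listed above is with cross-validated MAE. This is only a sketch: the synthetic data from make_regression stands in for the dummy-encoded salary table, and the use of cross_val_score, the fold count, and the hyperparameter values are assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the prepared salary features.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

mae = {}
for name, model in models.items():
    # scikit-learn reports negative MAE so that higher is better; flip the sign.
    scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=3)
    mae[name] = -np.mean(scores)
    print(f"{name}: MAE = {mae[name]:.2f}")
```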
Figure 8: Model Evaluation
I first tried alpha values of i/10, but the error was still high, so I reduced
the values of alpha.
After plotting the graph and checking the value of alpha, we see that an alpha
value of 0.13 gives the lowest error. Our error has now dropped from 21.09
to 19.25 (which means 19.25K dollars). We can also improve the model by
tuning it with GridSearch.
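The alpha search can be sketched as a simple loop. The i/100 step is an assumption (the text only says the values were reduced from i/10), the data is synthetic, and the plot is left out, so the best alpha found here will not match the 0.13 reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the prepared salary features.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

alphas, errors = [], []
for i in range(1, 100):
    alpha = i / 100  # assumed finer grid than i / 10
    model = Lasso(alpha=alpha)
    scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=3)
    alphas.append(alpha)
    errors.append(-np.mean(scores))  # flip the sign to get a positive MAE

# Pick the alpha with the lowest cross-validated MAE.
best_idx = int(np.argmin(errors))
best_alpha = alphas[best_idx]
print(f"best alpha = {best_alpha}, MAE = {errors[best_idx]:.2f}")
```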
GridSearch is the process of performing hyperparameter tuning in order to
determine the optimal values for a given model. With GridSearchCV, you put
in all the parameter values you want to try, and it fits a model for every
combination and spits out the one with the best results.
We can even use Support Vector Regression, XGBoost, or any other models.
Random Forest Regression is built from decision trees, and our dataset
contains many 0s and 1s after dummy encoding, so we expect it to be a better
model. That’s why I have preferred Random Forest Regression here.
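A minimal sketch of tuning a Random Forest with scikit-learn's GridSearchCV. The parameter grid, fold count, and synthetic data are hypothetical choices for illustration, not the values used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared salary features.
X, y = make_regression(n_samples=150, n_features=10, noise=10.0, random_state=0)

# Hypothetical grid; every combination is fitted and cross-validated.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5],
    "max_features": [1.0, "sqrt"],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)          # the winning combination
print(-search.best_score_)          # its cross-validated MAE
best_model = search.best_estimator_  # refitted on all of X, y
```

Because the grid grows multiplicatively with each added parameter, it usually pays to start with a small grid like this one and widen it only around the values that perform well.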
Figure 11: Test Best Model
Here we get a smaller error than with the previous models, so the Random
Forest model is the better one. I have also combined the Random Forest
model with the Linear Regression model to make a prediction by taking the
average of both, which means that I have given 50% weightage to each of the
models.
It is often better to combine different models and then make predictions,
because there is a good chance of improving accuracy. These types of models
are called ensemble models, and they are widely used. The error may or may
not decrease, however, because one of the models might be overfitting.
The tuned Random Forest model is the best here because it has the least
error when compared to Lasso and Linear regression. So instead of taking the
average of both, we can even combine 90% of the random forest model with
10% of any other model and test the accuracy/performance. Generally,
these types of ensemble models are better for classification problems.
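The 50/50 and 90/10 blends described above amount to a weighted average of the two models' predictions. The sketch below assumes scikit-learn and synthetic data; the helper name blend and the hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared salary features.
X, y = make_regression(n_samples=300, n_features=15, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lm = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

pred_lm = lm.predict(X_test)
pred_rf = rf.predict(X_test)


def blend(w_rf):
    """Weighted average: w_rf for the forest, the rest for linear regression."""
    return w_rf * pred_rf + (1 - w_rf) * pred_lm


# Compare the equal-weight blend with the 90/10 blend on the test set.
results = {}
for w_rf in (0.5, 0.9):
    results[w_rf] = mean_absolute_error(y_test, blend(w_rf))
    print(f"rf weight {w_rf}: MAE = {results[w_rf]:.2f}")
```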
The project should not be about trying every model. It should be about
choosing the most effective models and being able to tell a story as to why
we have chosen those specific ones. Usually, Lasso regression would be
expected to do better than linear regression because of its regularization
effect and because we have a sparse matrix, but here Lasso performed worse
than linear regression. Results vary from model to model, and we cannot
generalize.
Model Evaluation
Data scientists can evaluate the model in two ways: Hold-Out and
Cross-Validation. In the Hold-Out method, the dataset is divided into three
subsets: a training set used to build the model; a validation set used to
assess the performance of the model built in the training phase; and a test
set used to estimate the likely future performance of the model. In most
cases, the training:validation:test ratio will be 3:1:1, which means 60% of the
data goes to the training set, 20% to the validation set, and 20% to the test
set.
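A 60/20/20 hold-out split can be sketched with two calls to scikit-learn's train_test_split. The synthetic data and the random_state values are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data with 100 rows so the split sizes are easy to read.
X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# First cut: 60% for training, 40% held back.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Second cut: split the held-back 40% in half, giving 20% validation
# and 20% test of the original data.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```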
Model Deployment
I have created a basic webpage so that the model is simple to use. Given the
details of the employee and company, this model predicts the expected
salary for the employee.
I have deployed my Machine Learning model on Heroku using Flask. I have
trained the model using Linear regression (because it’s easy to understand),
but you can always train your model using any other Machine Learning
model, or you can even use ensemble models as they often provide good
accuracy.
Use Case
Background and Problem Statement:
You are a data scientist at IDX Partners and are currently helping the sales
team increase sales using machine learning. The aim of this project is to
create a machine learning model that can help predict whether or not a
customer has the potential to be given a sales offer.
After studying the material, create simple code to build a machine learning
model that meets these criteria:
- You can use dummy data
- You can use any machine learning model
- The metric used is accuracy
Solution:
Step 1: Install scikit-learn
If you haven't installed scikit-learn yet, you can do so using:
pip install scikit-learn

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris

# Load the Iris dataset to serve as dummy data
iris = load_iris()
# Features go into X, class labels into y
X, y = iris.data, iris.target
Step 6: Build and Train a Model and evaluate using test dataset
Let's use a Random Forest Classifier as a simple model:
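A minimal end-to-end sketch of this step, assuming the Iris dataset as dummy data. The 20% test size, random_state, and n_estimators values are assumptions rather than settings taken from the original solution.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Iris stands in for dummy customer data.
iris = load_iris()
X, y = iris.data, iris.target

# Hold back 20% of the rows to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build and train the Random Forest classifier.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the test set using accuracy, the metric the task asks for.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```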
References
https://www.kdnuggets.com/2020/11/build-data-science-project.html
https://blog.devgenius.io/learn-to-build-an-end-to-end-data-science-project-c9f79692191
https://www.kaggle.com/kabure/predicting-credit-risk-model-pipeline/notebook
https://www.kaggle.com/raenish/home-credit-default-risk-r
https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification
https://peps.python.org/pep-0008/
https://realpython.com/python-pep8/
http://adv-r.had.co.nz/Style.html
https://www.r-bloggers.com/2014/07/consistent-naming-conventions-in-r/