Data Science in Python - Regression
Regression
With Expert Python Instructor Chris Bruehl
This is Part 2 of a 5-Part series designed to take you through several applications of data science
using Python, including data prep & EDA, regression, classification, unsupervised learning & NLP
This is a project-based course for students looking for a practical, hands-on approach to
learning data science and applying regression models with Python
Quizzes & Assignments to test and reinforce key concepts, with step-by-step solutions
Interactive demos to keep you engaged and apply your skills throughout the course
2  Regression 101 – Review the basics of regression, including key terms, the types and goals of regression analysis, and the regression modeling workflow
3  Pre-modeling Data Prep & EDA – Recap the data prep & EDA steps required to perform modeling, including key techniques to explore the target, features, and their relationships
4  Simple Linear Regression – Build simple linear regression models in Python and learn about the metrics and statistical tests that help evaluate their quality and output
5  Multiple Linear Regression – Build multiple linear regression models in Python and evaluate the model fit, perform variable selection, and compare models using error metrics
7  Model Testing & Validation – Test model performance by splitting data, tuning the model with the train & validation data, selecting the best model, and scoring it on the test data
10 Time Series Analysis – Learn methods for exploring time series data and how to perform time series forecasting using linear regression and Prophet
THE SITUATION: You’ve just been hired as an Associate Data Scientist for Maven Consulting Group to work on a team that specializes in price research for various industries.

THE ASSIGNMENT: You’ll have access to price data for several different industries, including diamonds, computers, and apartment rent. Your task is to build regression models that can accurately predict the price of goods, while giving clients insights into the factors that impact pricing.
This course covers both the theory & application of linear regression models
• We’ll start with Ordinary Least Squares (OLS), including evaluation metrics, assumptions, and validation & testing
• We’ll then pivot to regularized regression models, which are extensions of linear regression with special properties
In this section we’ll install Anaconda and introduce Jupyter Notebook, a user-friendly
coding environment where we’ll be coding in Python
Installing Anaconda
Launching Jupyter
Google Colab
1) Once inside the Jupyter interface, create a folder to store your notebooks for the course
NOTE: You can rename your folder by clicking “Rename” in the top left corner
2) Open your new coursework folder and launch your first Jupyter notebook!
NOTE: You can rename your notebook by clicking on the title at the top of the screen
NOTE: When you launch a Jupyter notebook, a terminal window may pop up as
well; this is called a notebook server, and it powers the notebook interface
In this section we’ll introduce the field of data science, discuss how it compares to
other data fields, and walk through each phase of the data science workflow
Yes! The differences lie in the types of problems you solve, and the tools and techniques you use to solve them:

What happened?                What’s going to happen?
• Descriptive Analytics       • Predictive Analytics
• Data Analysis               • Data Mining
• Business Intelligence       • Data Science

EXAMPLE: “What will house prices look like for the next 12 months?” “How can I segment my customers?”
MACHINE LEARNING
These are some of the most common machine learning algorithms that data scientists use in practice
The data science workflow consists of scoping the project; gathering, cleaning, and exploring the data; applying models; and sharing insights with end users:

1. Scoping a Project
2. Gathering Data
3. Cleaning Data
4. Exploring Data
5. Modeling Data
6. Sharing Insights

This is not a linear process! You’ll likely go back to further gather, clean and explore your data
Projects don’t start with data, they start with a clearly defined scope:
• Who are your end users or stakeholders?
• What business problems are you trying to help them solve?
• Is this a supervised or unsupervised learning problem? (do you even need data science?)
• What data do you need for your analysis?
A project is only as strong as the underlying data, so gathering the right data is essential to set a proper foundation for your analysis
A popular saying within data science is “garbage in, garbage out”, which means that cleaning data properly is key to producing accurate and reliable results

Data cleaning tasks may include:
• Resolving formatting issues
• Correcting data types
• Imputing missing data
• Restructuring the data

Building models: the flashy part of data science
Cleaning data: less fun, but very important (data scientists estimate that around 50-80% of their time is spent here!)
Exploratory data analysis (EDA) is all about exploring and understanding the data you’re working with before applying models or algorithms
Modeling data involves structuring and preparing data for specific modeling techniques, and applying those models to make predictions or discover patterns
The final step of the workflow involves summarizing your key findings and sharing insights with end users or stakeholders:
• Reiterate the problem
• Interpret the results of your analysis
• Share recommendations and next steps

Even with all the technical work that’s been done, it’s important to remember that the focus here is on non-technical solutions

NOTE: Another way to share results is to deploy your model, or put it into production
Data scientists have both coding and math skills along with domain expertise
• In addition to technical expertise, soft skills like communication, problem-solving, curiosity, creativity, grit,
and Googling prowess round out a data scientist’s skillset
In this section we’ll cover the basics of regression, including key modeling terminology,
the types & goals of regression analysis, and the regression modeling workflow
𝒚 Target
• This is the variable you’re trying to predict
• The target is also known as the “Y”, “model output”, “response”, or “dependent” variable
• Regression helps understand how the target variable is impacted by the features

𝒙 Features
• These are the variables that help you predict the target variable
• Features are also known as “X”, “model inputs”, “predictors”, or “independent” variables
• Regression helps understand how the features impact, or predict, the target
EXAMPLE: Carat, cut, color, clarity, and the rest of the columns are all features, since they can help us explain, or predict, the price of diamonds
...then apply that model to new, unobserved values containing features but no target (this is what our model will predict!)
Regression models are used for two primary goals: prediction and inference
• The goal shapes the modeling approach, including the regression algorithm used, the complexity of the model, and more

PREDICTION
• Used to predict the target as accurately as possible
• “What is the predicted price of a diamond given its characteristics?”

INFERENCE
• Used to understand the relationships between the features and target
• “How much do a diamond’s size and weight impact its price?”

You often need to strike a balance between these goals – a model that is very inaccurate won’t be too trustworthy for inference, and understanding the impact that variables have on predictions can help make them more accurate
Types of regression:
• Linear Regression – Models the relationship between the features & target using a linear equation
• Regularized Regression – An extension of linear regression that penalizes model complexity
• Time-Series Forecasting – Predicts future data using historical trends & seasonality
• Tree-Based Regression – Splits data by maximizing the difference between groups

Even though logistic regression (which you may have heard of) has “regression” in its name, it’s actually a classification modeling technique!
1. Scoping a project  2. Gathering data  3. Cleaning data  4. Exploring data  5. Modeling data  6. Sharing insights

• Data prep – Get your data ready to be input into an ML algorithm (single table, non-null; feature engineering; data splitting)
• Model building – Build regression models from training data (linear regression; regularized regression; time series)
• Model evaluation – Evaluate model fit on training & validation data (R-squared & MAE; checking assumptions; validation performance)
• Model selection – Pick the best model to deploy and identify insights (test performance; interpretability)
The target is the value we want to predict, and the features help us predict it
• The target is also known as “Y”, “model output”, “response”, or “dependent” variable
• Features are also known as “X”, “model inputs”, “predictors”, or “independent” variables
In this section we’ll review the data prep & EDA steps required before applying regression
algorithms, including key techniques to explore the target, features, and their relationships
Exploratory data analysis (EDA) is the process of exploring and visualizing data to find useful patterns & insights that help inform the modeling process
• In regression, EDA lets you identify & understand the most promising features for your model
• It also helps uncover potential issues with the features or target that need to be addressed

Even though data cleaning comes before exploratory data analysis, it’s common for EDA to reveal additional data cleaning steps needed before modeling
Exploring the target variable lets you understand where most values lie, how spread out they are, and what shape, or distribution, they have
• Charts like histograms and boxplots are great tools to explore regression targets
• sns.despine() removes the top & right borders

The price data is right-skewed (not uncommon for prices, but we may want to transform it later), with lots of outliers past $12,000
Exploring the features helps you understand them and start to get a sense of the transformations you may need to apply to each one
• Histograms and boxplots let you explore numeric features
• The .value_counts() method and bar charts are best for categorical features

In the diamonds data, “carat” is right-skewed, with spikes at quarter and half carat increments and few diamonds bigger than 2.5 carats; there aren’t many “Fair” diamonds, which are the lowest quality (enough for modeling though)
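A quick sketch of profiling a categorical feature with .value_counts(); the “cut” values below are made up to stand in for the course data:

```python
import pandas as pd

# Hypothetical "cut" column standing in for the diamonds data
cut = pd.Series(["Ideal", "Premium", "Ideal", "Good", "Fair", "Ideal", "Very Good"])

counts = cut.value_counts()                 # category frequencies, descending
shares = cut.value_counts(normalize=True)   # same counts, as proportions

print(counts)
# A bar chart of the counts is one more line: counts.plot(kind="bar")
```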
Key Objectives
1. Read in the “Computers.csv” file
2. Explore the target variable: “price”
3. Explore the features: “speed” and “RAM”

NEW MESSAGE – June 28, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: EDA

Hi there, glad to have you on the team!
Can you explore the computer prices data set at a high level?
Thanks!

01_EDA_assignments.ipynb
It’s common for numeric variables to have linear relationships between them
• When one variable changes, so does the other
• This relationship is commonly visualized with a scatterplot

EXAMPLE: Relationship between digital advertising spend and site traffic – as “digital_spend” moves up or down, so does “site_traffic” (this is a positive relationship); the labeled point ($72,954; 7,592) is one observation of spend and traffic
• Positive relationship – as one changes, the other changes in the same direction
• Negative relationship – as one changes, the other changes in the opposite direction
• No relationship – no association can be found between the changes in one variable and the other
The correlation (r) measures the strength & direction of a linear relationship (-1 to 1)
• You can use the .corr() method to calculate correlations in Pandas – df[“col1”].corr(df[“col2”])
EXAMPLES: Strong positive correlation · Moderate negative correlation · No correlation

PRO TIP: Highly correlated variables tend to be the best candidates for your “baseline” regression model, but other variables can still be useful features after exploring non-linear relationships and fixing data issues
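A minimal sketch of .corr() on toy data (the column names and values are illustrative):

```python
import pandas as pd

# Toy data standing in for two numeric columns
df = pd.DataFrame({
    "digital_spend": [10, 20, 30, 40, 50],
    "site_traffic":  [12, 25, 29, 43, 48],
})

# Pairwise correlation between two columns (Pearson by default)
r = df["digital_spend"].corr(df["site_traffic"])
print(round(r, 3))

# .corr() on the whole DataFrame returns the full correlation matrix,
# which is what a correlation heatmap is built from
matrix = df.corr(numeric_only=True)
```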
Average prices differ between cuts:
• “Fair” cut diamonds (worst) have the 2nd highest average price
• “Ideal” cut diamonds (best) have the lowest average price

There’s a strong positive relationship between carat & price (potentially exponential and not linear), and a strong positive relationship between diamond length (x) and carat (we might not be able to include both features in our model – more later!)

“Fair” cut diamonds have the largest average carat size, which explains why their average price is higher than “Ideal” cut diamonds, which have the smallest average carat size
Use sns.pairplot() to create a pairplot that shows all the scatterplots and histograms that can be made using the numeric variables in a DataFrame
• The row with your target can help explore numeric feature-target relationships quickly!
Use sns.lmplot() to create a scatterplot with a fitted regression line (more soon!)
• This is commonly used to explore the impact of other variables on a linear relationship
• sns.lmplot(df, x=“feature”, y=“target”, hue=“categorical feature”)

In the diamonds data, larger carat diamonds don’t follow the relationship well, and diamonds with “I1” clarity increase in price at a much slower rate
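A sketch of lmplot with a hue column, on generated data where one clarity level deliberately gains price at a slower rate:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
n = 150
carat = rng.uniform(0.2, 2.0, n)
clarity = rng.choice(["I1", "SI2"], n)
# "I1" diamonds gain price more slowly per carat (illustrative slopes)
price = carat * np.where(clarity == "I1", 3000, 8000) + rng.normal(0, 500, n)
df = pd.DataFrame({"carat": carat, "clarity": clarity, "price": price})

# One regression line per clarity level, colored by the hue column
g = sns.lmplot(data=df, x="carat", y="price", hue="clarity")
```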
Key Objectives
1. Build a correlation matrix heatmap
2. Build a pair plot of numeric features
3. Build an lmplot of “RAM” vs. “price”

NEW MESSAGE – June 28, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: Exploring Variable Relationships

Hi there!
Thanks!

01_EDA_assignments.ipynb
Preparing for modeling involves structuring the data as a valid input for a model:
• Stored in a single table (DataFrame)
• Aggregated to the right grain (1 row per target)
• Non-null (no missing values) and numeric

EXAMPLE: raw columns with text values are converted into fully numeric features, with price as the target:

  carat   cut         color   price      →   carat   cut   …   price
  0.90    Good        D       $3,423     →   0.90    1     …   3423
  0.31    Fair        E       $527       →   0.31    0     …   527
  0.52    Very Good   …       $1,250     →   0.52    2     …   1250
  1.55    Ideal       D       $12,500    →   1.55    3     …   12500

We’ll revisit this during the feature engineering section of the course
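One way to make a table like this fully numeric; the quality ordering in the mapping below is an assumption for illustration, and the course’s own feature engineering section may do this differently:

```python
import pandas as pd

# Raw data with a text column (illustrative, not the actual course file)
df = pd.DataFrame({
    "carat": [0.90, 0.31, 0.52, 1.55],
    "cut":   ["Good", "Fair", "Very Good", "Ideal"],
    "price": [3423, 527, 1250, 12500],
})

# Ordinal encoding: map the quality order to integers (assumed ordering)
cut_order = {"Fair": 0, "Good": 1, "Very Good": 2, "Ideal": 3}
df["cut_encoded"] = df["cut"].map(cut_order)

# One-hot (dummy) encoding is the alternative for unordered categories
dummies = pd.get_dummies(df["cut"], prefix="cut")

model_ready = df[["carat", "cut_encoded", "price"]]  # fully numeric table
```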
In this section we’ll build our first simple linear regression models in Python and learn
about the metrics and statistical tests that help evaluate their quality and output
Simple linear regression models use a single feature to predict the target
• This is achieved by fitting a line through the data points in a scatterplot (like an lmplot!)

EXAMPLE: Simple linear regression model using carat (the feature, x) and price (the target, y)
The linear regression model is an equation that best describes a linear relationship:

ŷ = β₀ + β₁x

• ŷ is the predicted value for the target
• x is the value for the feature
• β₀ is the y-intercept
• β₁ is the slope of the relationship

The actual value for the target (y) adds an error term (ε):

y = β₀ + β₁x + ε
The least squared error method finds the line that best fits through the data
• It works by solving for the line that minimizes the sum of squared error
• The equation that minimizes error can be solved with linear algebra

Why “squared” error?
• Squaring the residuals converts them into positive values, and prevents positive and negative distances from cancelling each other out (this makes the algebra to solve for the line much easier, too)
• One drawback of squared errors is that outliers can significantly impact the line (more later!)

Ordinary Least Squares (OLS) is another term for traditional linear regression. There are other frameworks for linear regression that don’t use least squared error, but they are rarely used outside of specialized domains.
EXAMPLE: Start with a candidate line, here the flat line ŷ = 30, and total up the squared errors:

  x     y     ŷ     error (ŷ − y)   squared error
  35    30    30      0               0
  40    40    30    −10             100
  50    15    30     15             225
  60    40    30    −10             100
  65    30    30      0               0
  70    50    30    −20             400
  80    40    30    −10             100

SUM OF SQUARED ERROR: 1,450

*Copyright Maven Analytics, LLC
Trying a sloped line reduces the sum of squared error:

  x     y     ŷ     error (ŷ − y)   squared error
  50    15    35     20             400
  60    40    40      0               0
  80    40    50     10             100
  …

SUM OF SQUARED ERROR: 862.5
The least squared error line is the one where no other line can produce a smaller total:

  x     y     ŷ     error (ŷ − y)   squared error
  35    30    26     −4              16
  40    40    28    −12             144
  50    15    32     17             289
  60    40    36     −4              16
  65    30    38      8              64
  70    50    40    −10             100
  80    40    44      4              16

SUM OF SQUARED ERROR: 722
REGRESSION IN PYTHON
These Python libraries are used to fit regression models: statsmodels & scikit-learn
• Both libraries use the same math and return the same regression equation!
• We will begin by focusing on statsmodels, but once we have the fundamentals of regression down, we’ll introduce scikit-learn
You can fit a regression in statsmodels with just a few lines of code:

1) Import statsmodels.api (standard alias is sm)
2) Define the target (y) and features (X)
3) Add a constant to the features with sm.add_constant()
4) Call sm.OLS(y, X) to set up the model, then use .fit() to build the model
5) Call .summary() on the model to review the model output

Why do we need to add a constant?
• Statsmodels assumes you want to fit a model with a line that runs through the origin (0, 0)
• sm.add_constant() lets statsmodels calculate a y-intercept other than 0 for the model
• Most regression software (like sklearn) takes care of this step behind the scenes
The summary output includes the variable summary statistics and the residual (error) statistics. The model output can be intimidating the first time you see it, but we’ll cover the important pieces in the next few lessons and later sections!
To interpret the model, use the “coef” column in the variable summary statistics.

How do we interpret this?
• An increase of 1 carat in a diamond is associated with a $7,756 increase in its price
• We cannot say 1 carat causes a $7,756 increase in price without a more rigorous experiment
• Technically, a 0-carat diamond is predicted to cost -$2,256
The .predict() method returns model predictions for single points or DataFrames
In simple linear regression, squaring the correlation between the feature and target yields R² (this doesn’t hold in multiple linear regression)
• EXAMPLE: An R² of 0.849 means the model explains 84.9% of the variation in price not explained by the mean of price

A “good” value for R² is relative to your data – an R² of .05 might be industry leading in sports analytics, but if you were trying to prove a physics theory with an experiment, an R² of .95 might be a disappointment!
Regression models include several hypothesis tests, including the F-test, which indicates whether our model is significantly better at predicting our target than using the mean of the target as the model
• In other words, you’re trying to find significant evidence that your model isn’t useless

Steps for the hypothesis test:
1) State the null and alternative hypotheses
2) Set a significance level (α)
3) Calculate the test statistic and p-value
4) Draw a conclusion from the test
   a) If p≤α, reject the null hypothesis (you’re confident the model isn’t useless)
   b) If p>α, don’t reject it (the model is probably useless, and needs more training)
1) For F-tests, the null & alternative hypotheses are always the same:
• Ho: F=0 – the model is NOT significantly better than the mean (naïve guess)
• Ha: F≠0 – the model is significantly better than the mean (naïve guess)

2) The significance level is the threshold you set to determine when the evidence against your null hypothesis is considered “strong enough” to prove it wrong
• This is set by alpha (α), which is the accepted probability of error
• The industry standard is α = .05 (this is what we’ll use in the course)
• Some teams and industries set a much higher bar, such as .01 or even .001, making the null hypothesis less likely to be rejected
3) The F-statistic and associated p-value are part of the model summary and help understand the predictive power of the regression model as a whole
• The F-statistic is the ratio of variability the model explains vs. the variability it doesn’t
• The p-value, or F-significance, is the probability of seeing results this strong if the model were truly no better than the mean
4) Comparing the p-value and alpha lets us draw a conclusion from the test:
• p≤α – reject the null hypothesis (you’re confident the model isn’t useless)
• p>α – don’t reject it (the model is probably useless, and needs more training)
The T-statistics and associated p-values are part of the model summary and help understand the predictive power of individual model coefficients
• It’s essentially another hypothesis test designed to find which coefficients are useful
Residual plots show how well a model performs across the range of predictions
• Ideally, residual plots should show errors normally distributed around 0
• model.resid returns a series with the residuals (actual value - predicted value)
• The residuals are plotted against the predictions (ŷ)
THE SITUATION: You’ve just been hired as a Data Science Intern for Maven National Insurance, a large private health insurance provider in the United States.

THE ASSIGNMENT: The company is looking at updating their insurance pricing model and wants you to start a new one from scratch using only a handful of variables. If you’re successful, they can reduce the complexity of their model while maintaining its accuracy (the data you’ve been provided has already been QA’d and cleaned).
Key Objectives
1. Build a simple linear regression model with “price” as the target and the most correlated variable as the feature
2. Interpret the model summary
4. Predict new “price” values with the model

NEW MESSAGE – June 30, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: Price predictions

Hi there!
Can you build a linear regression model with the target as “price” and the most correlated variable as the feature? I’d also like to see a few predictions for common values of the feature that you selected.
Thanks!

02_simple_regression_assignments.ipynb
A simple linear regression model is the line that best fits a scatterplot
• The line can be described using an equation with a slope and y-intercept, plus an error term
• The least squared error method is used to find the line of best fit

Python uses the statsmodels & scikit-learn libraries to fit regression models
• Statsmodels is ideal if your goal is inference, while scikit-learn is optimal for prediction workflows
In this section we’ll build multiple linear regression models using more than one feature, evaluate the model fit, perform variable selection, and compare models using error metrics
Simple linear regression: y = β₀ + β₁x + ε

Multiple linear regression: y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ⋯ + βₖxₖ + ε

Instead of just one “x”, we have a whole set of features (and associated coefficients) to help predict the target (y)

Multiple regression can scale well beyond 2 variables, but this is where visual analysis breaks down – and one reason why we need machine learning!
“carat” and “x” are the most correlated features with “price”, so both are used in the multiple regression model:

ŷ = 1738 + 10130(carat) – 1027(x)

• R² increased from 0.849
• The F-statistic’s p < 0.05, so the model isn’t useless
• Each coefficient’s p < 0.05, so all variables are useful
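Plugging illustrative values into that fitted equation (the carat = 1.0 and x = 6.0 inputs below are made up):

```python
# Apply the multiple regression equation ŷ = 1738 + 10130(carat) – 1027(x)
def predict_price(carat: float, x: float) -> float:
    """Return the predicted price from the fitted coefficients."""
    return 1738 + 10130 * carat - 1027 * x

# e.g. a 1-carat diamond that is 6mm long (hypothetical inputs)
print(predict_price(carat=1.0, x=6.0))  # 1738 + 10130 - 6162 = 5706
```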
Adjusted R² penalizes models for having additional features, so it only rises when a new variable improves the fit enough to justify the added complexity
• In this example, R² increased after adding a variable but adjusted R² didn’t
• In this case, the p-value could also tell us to remove this variable (other times, a variable can be significant and lower adjusted R²)
Key Objectives
1. Build and interpret two multiple linear regression models
2. Evaluate model fit and coefficient values for both models

NEW MESSAGE – July 2, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: Multiple Regression Model

Hi there!
Thanks!

03_multiple_regression_assignments.ipynb
Mean error metrics:
• MAE – Average of the absolute distance between actual & predicted values
• MSE – Average of the squared distance between actual & predicted values
• RMSE – Square root of Mean Squared Error, to return to the target’s units (like MAE)
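All three metrics computed by hand with numpy, on made-up actual/predicted values:

```python
import numpy as np

# Toy actual and predicted values (illustrative only)
actual    = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 210.0, 230.0])

errors = actual - predicted

mae  = np.mean(np.abs(errors))   # Mean Absolute Error
mse  = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)              # Root Mean Squared Error

print(mae, mse, rmse)
```

Note that RMSE (≈13.2) exceeds MAE (12.5) here because squaring weights the single larger error more heavily, which is exactly why RMSE is preferred when large errors are especially undesirable.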
Key Objectives
1. Calculate MAE and RMSE for fitted regression models

NEW MESSAGE – July 3, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: Error Metrics

Hi there!
Can you calculate MAE and RMSE for the model with all variables and our simple regression model that just included RAM?
Thanks!

03_multiple_regression_assignments.ipynb
Multiple linear regression models use multiple features to predict the target
• Each new feature comes with an associated coefficient that forms part of the regression equation
Variable selection methods help you identify valuable features for the model
• Coefficients with p-values greater than alpha (0.05) indicate that a coefficient isn’t significantly different than 0
• Contrary to R2, the adjusted R2 metric penalizes new variables added to a model
Mean error metrics let you compare predictive accuracy across models
• Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) quantify a model’s inaccuracy in the target’s units
• RMSE is more sensitive to large errors, so it is preferred in situations where large errors are undesirable
In this section we’ll cover the assumptions of linear regression models which should be
checked and met to ensure that the model’s predictions and interpretation are valid
There are a few key assumptions of linear regression models that can be violated, leading to unreliable predictions and interpretations
• If the goal is inference, these all need to be checked rigorously
• If the goal is prediction, some of these can be relaxed

1• Linearity: there’s a linear relationship between the target and each feature
2• Independence of errors: the residuals have no patterns or relationships between them
3• Normality of errors: the residuals are normally distributed
4• No perfect multicollinearity: the features aren’t perfectly correlated with each other
5• Equal variance of errors: the spread of residuals is consistent across predictions

You can use the L.I.N.N.E acronym (like linear regression) to remember them

It’s worth noting that you might see resources saying there are anywhere from 3 to 6 assumptions, but these 5 are the ones to focus on
Linearity assumes there’s a linear relationship between the target and each feature
• If this assumption is violated, it means the model isn’t capturing the underlying relationship between the variables, which could lead to inaccurate predictions

You can diagnose linearity by using scatterplots and residual plots, comparing against an ideal scatterplot and an ideal residual plot
You can fix linearity issues by transforming features with non-linear relationships
• Common transformations include polynomial terms (x², x³, etc.) and log transformations (log(x))
The scatterplot has a U-like shape, similar to a y = x² plot, so let’s try squaring the “carat” feature. The new “carat_sq” feature has a much more linear relationship with the “price” target variable.
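A minimal sketch of why the squared term helps, using synthetic data in place of the diamonds set (the names and values here are stand-ins, not the course data):

```python
import numpy as np

rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 2.0, 200)                      # synthetic stand-in for "carat"
price = 3 * carat**2 + rng.normal(scale=0.3, size=200)  # target with a curved relationship

r_raw = np.corrcoef(carat, price)[0, 1]    # correlation with the raw feature
r_sq = np.corrcoef(carat**2, price)[0, 1]  # correlation with the squared feature

print(r_sq > r_raw)  # the squared term tracks the curved target more closely
```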
You can fix linearity issues by transforming features with non-linear relationships
• Common transformations include polynomial terms (x², x³, etc.) and log transformations (log(x))
R² increased from 0.859
When adding polynomial terms, you need to include the lower-order terms in the model, regardless of significance. We can drop “y” (p > 0.05).
Independence of errors assumes that the residuals in your model have no patterns or relationships between them (they aren’t autocorrelated)
• In other words, it checks that you haven’t fit a linear model to time series data
You can diagnose independence with the Durbin-Watson Test:
• H0: DW = 2 – the errors are NOT autocorrelated
• Ha: DW ≠ 2 – the errors are autocorrelated
• As a rule of thumb, values between 1.5 and 2.5 are accepted
You can fix independence issues by using a time series model (more later!)
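The statistic itself is simple enough to compute by hand. This is a dependency-free sketch (the model summary in statsmodels reports it for you); the two synthetic series stand in for well-behaved vs. autocorrelated residuals:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
independent = rng.normal(size=500)          # uncorrelated "residuals" -> DW near 2
trending = np.cumsum(rng.normal(size=500))  # strongly autocorrelated -> DW near 0

print(round(durbin_watson(independent), 2), round(durbin_watson(trending), 3))
```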
Sorting the DataFrame by the target (price) leads to a Durbin-Watson statistic outside the desired range; randomizing the row order brings things back to normal
Normality of errors assumes the residuals are normally distributed; you can diagnose this with a QQ plot
(Figure: Ideal QQ Plot)
The red line represents a normal distribution
It’s normal for a few points to fall off the line,
particularly outside of 2 standard deviations
The Jarque-Bera Test (JB) in the model summary tests the normality of errors
However, this test has numerous issues and is far too restrictive to use in practice
You can typically fix normality issues by applying a log transform on the target
• Other options are applying log transforms to features or simply leaving the data as is
Before: our line starts to curve in between these thresholds (between -3 and 3), which indicates moderate to severe non-normality
After: this looks better! There are still large residuals left, but they don’t violate the normality assumption (check later)
You generally want to see points fall along the line in between -2 and +2 standard deviations
A 1-unit increase in carat is associated with a 10.6X increase in price (e^2.3598 = 10.59)
Common transformations and their inverses:
• y = np.sqrt(x) → x = y**2
• y = np.log(x) → x = np.exp(y)
• y = np.log10(x) → x = 10 ** y
• y = 1/x → x = 1/y
A diamond with these features is predicted to cost $9,458
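A short sketch of how this interpretation works with a log-transformed target. The coefficient value matches the slide; the log prediction below is a hypothetical placeholder, not taken from the course model:

```python
import numpy as np

# With log(price) as the target, a coefficient b on a feature means a 1-unit
# increase in that feature multiplies the predicted price by e**b
b_carat = 2.3598              # the slide's carat coefficient
multiplier = np.exp(b_carat)
print(round(multiplier, 2))   # ~10.59, the "10.6X" quoted above

# Predictions come back in log units, so invert with np.exp to get dollars
log_price_pred = 9.1547       # hypothetical log-price prediction
dollars = np.exp(log_price_pred)
```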
You can diagnose multicollinearity with the Variance Inflation Factor (VIF):
• Each feature is treated as the target, and R² measures how well the other features predict it
• As a rule of thumb, a VIF > 5 indicates that a variable is causing multicollinearity problems
We can ignore the VIF for the intercept, but most of our variables have a VIF > 5
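The definition above translates directly into code. This is a dependency-free sketch (statsmodels also ships a `variance_inflation_factor` helper), with two nearly duplicated synthetic features standing in for multicollinear columns:

```python
import numpy as np

def vif(X, j):
    """VIF_j = 1 / (1 - R^2), where R^2 comes from regressing column j on the others."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ beta
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a -> multicollinear pair
c = rng.normal(size=300)                 # unrelated feature
X = np.column_stack([a, b, c])

print([round(vif(X, j), 1) for j in range(3)])  # first two are huge, last is near 1
```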
(VIF tables: Original Model vs. Removing x)
The VIF can also be ignored when using dummy variables (more later!)
Equal variance of errors assumes the residuals are consistent across predictions
• In other words, the average error should stay roughly the same across the range of the target
• Equal variance is known as homoskedasticity, and non-equal variance is heteroskedasticity
You can diagnose heteroskedasticity with residual plots:
(Figures: heteroskedasticity in the original regression model | homoskedasticity after fixing the violated assumptions)
You can typically fix heteroskedasticity by applying a log transform on the target
• In other words, the average error should stay roughly the same across the target variable
Before: errors have a cone shape along the x-axis. After: errors are spread evenly along the x-axis.
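A small sketch of why the log transform helps, using synthetic data: when the noise is multiplicative, the raw target's spread grows with the feature (the cone shape), while the spread of the logged target stays roughly constant:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 500)
y = x * np.exp(rng.normal(scale=0.3, size=500))  # multiplicative noise -> cone shape

low, high = x < 3, x > 8
raw_ratio = y[high].std() / y[low].std()                  # spread grows with x
log_ratio = np.log(y)[high].std() / np.log(y)[low].std()  # roughly constant after log

print(raw_ratio > log_ratio)  # the log transform evens out the spread
```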
Outliers are extreme data points that fall well outside the usual pattern of data
• Some outliers have a dramatic influence on model fit, while others won’t
• Outliers that impact a regression equation significantly are called influential points
Left: an outlier in both profit and units sold, but it’s in line with the rest of the data, so it’s not influential
Right: an outlier in terms of profit that doesn’t follow the same pattern as the rest of the data, so it changes the regression line
An outlier’s influence depends on the size of the dataset (large datasets are impacted less)
Cook’s Distance measures the influence a data point has on the regression line
• It works by fitting a regression without the point and calculating the impact
• Cook’s D > 1 is considered a significant problem, while Cook’s D > 0.5 is worth investigating
• Use .get_influence().summary_frame() on a regression model to return influence statistics
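In statsmodels, `.get_influence().summary_frame()` returns a `cooks_d` column among other influence statistics. As a dependency-free sketch of the underlying formula (synthetic data, with one point planted far off the line):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D via the hat matrix: D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2)."""
    A = np.column_stack([np.ones(len(y)), X])  # add an intercept
    H = A @ np.linalg.inv(A.T @ A) @ A.T       # hat (projection) matrix
    h = np.diag(H)
    e = y - H @ y                              # residuals
    p = A.shape[1]
    s2 = (e @ e) / (len(y) - p)                # residual variance
    return (e ** 2 / (p * s2)) * h / (1 - h) ** 2

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(size=50)
x = np.append(x, 10.0)
y = np.append(y, 60.0)  # one point planted far off the y = 2x line

d = cooks_distance(x.reshape(-1, 1), y)
print(int(d.argmax()))  # the planted point is by far the most influential
```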
Not surprisingly, the most influential points were the largest diamonds!
Assumption troubleshooting summary:
• Linearity. Violation: the feature-target relationship is not linear. Consequence: suboptimal model accuracy and non-normal residuals. Diagnosis: curved patterns in feature-target scatterplots. Fix: add polynomial terms. Tradeoff: slightly less intuitive coefficients. The more curved the relationship, the worse it is; it’s usually worth fixing as it can improve accuracy.
• Outliers & Influence. Violation: influential data points. Consequence: lower accuracy & possible violated assumptions. Diagnosis: Cook’s D greater than 1 (highly influential) or 0.5 (potentially problematic). Fix: remove data points or engineer features to explain outliers. Tradeoff: removing data points is not ideal if we need to predict them. If we expect the data to have similar points, removing them won’t change model performance (consider a model with and without outliers).
NEW MESSAGE (July 10, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Model Assumptions
Hi there!
Can you check the model assumptions for our computer price regression model?
Thanks!
Key Objectives:
1. Check for violated assumptions on a fitted regression model
2. Implement fixes for assumptions
3. Decide whether the fixes are worth the trade-offs and settle on a final model
04_assumptions_assignments.ipynb
Diagnosing & fixing these assumptions can help improve model accuracy
• Use residual plots, QQ plots, the Durbin-Watson test, and the Variance Inflation Factor to diagnose
• Transforming the features and/or target (polynomial terms, log transforms, etc.) can typically help fix issues
In this section we’ll cover model testing & validation, which is a crucial step in the
modeling process designed to ensure that a model performs well on new, unseen data
The modeling workflow: 1) Scoping a project, 2) Gathering data, 3) Cleaning data, 4) Exploring data, 5) Modeling data, 6) Sharing insights
• Cleaning & exploring: get your data ready to be input into an ML algorithm
• Modeling: build regression models from training data, then evaluate model fit on training & validation data
• Sharing: pick the best model to deploy and identify insights
This is what we’ve covered so far
So far we’ve covered: a single, non-null table (cleaning), linear regression & checking assumptions (modeling), and R-squared & MAE (evaluation). Still to come: data splitting, validation performance, and test performance. Data splitting needs to be the first step in the process so we can do these properly.
These model scoring steps need to be considered before you start modeling:
1. Split the data into a “training” and “test” set
• Around 80% of the data is used for training and 20% is reserved for testing
2. Tune the model with the training & validation data
3. Select the best model
4. Score the model using the test data
• Once you’ve settled on a model, combine the training & validation data sets and refit the model, then score it on the test data to evaluate model performance
Data splitting involves separating a data set into “training” and “test” sets
• Training data, or in-sample data, is used to fit and tune a model
• Test data, or out-of-sample data, provides realistic estimates of accuracy on new data
(Diagram: Training data 80% | Test data 20%)
In practice, the rows are sampled randomly
80/20 is the most common ratio for train/test data, but anywhere from 70-90% can be used for training
For smaller data sets, a higher ratio of test data is needed to ensure a representative sample
4 inputs:
• All feature columns
• The target column
• The percentage of data for the test set
• A random state (or a different split is created each time)
4 outputs:
• Training set for features (X)
• Test set for features (X_test)
• Training set for target (y)
• Test set for target (y_test)
Perfect 80/20 split!
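The four inputs and four outputs map directly onto scikit-learn's `train_test_split`; a minimal sketch with stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # stand-in feature matrix
y = rng.normal(size=100)       # stand-in target

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # reserve 20% of the rows for testing
    random_state=42,  # fix the seed so the same split is created each time
)
print(len(X_train), len(X_test))  # 80 20
```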
Data splitting is primarily used to avoid overfitting, which is when a model predicts known (training) data very well but unknown (test) data poorly
• Overfitting is like memorizing the answers to a test instead of learning the material; you’ll ace the test, but lack the ability to generalize and apply your knowledge to unfamiliar questions
You can diagnose overfit & underfit models by comparing evaluation metrics like R², MAE, and RMSE between the train and test data sets
• Large gaps between train and test scores indicate that a model is overfitting the data
• Poor results across both train and test scores indicate that a model is underfitting the data
You can fix overfit models by:
• Simplifying the model
• Removing features
• Regularization (more later!)
You can fix underfit models by:
• Making the model more complex
• Adding new features
• Feature engineering (more later!)
Models will usually have lower performance on test data compared to the performance on training data, so small gaps are expected – remember that no model is perfect!
When splitting data for regression, there are two types of errors that can occur:
• Bias: How much the model fails to capture the relationships in the training data
• Variance: How much the model fails to generalize to the test data
On the training data:
• OVERFIT model: low bias (no error)
• WELL-FIT model: medium bias (some error)
• UNDERFIT model: high bias (lots of error)
On the test data:
• OVERFIT model: low bias (no error), high variance (much more error)
• WELL-FIT model: medium bias (some error), medium variance (a bit more error)
• UNDERFIT model: high bias (lots of error), low variance (same error)
The bias-variance tradeoff aims to find a balance between the two types of errors
• It’s rare for a model to have low bias and low variance, so finding a “sweet spot” is key
• This is something that you can monitor during the model tuning process
High-bias models fail to capture trends in the data, are underfit, are too simple, and have a high error rate on both train & test data
High-variance models capture noise from the training data, are overfit, are too complex, and have a large gap between training & test error
(Diagram: prediction error vs. model complexity, with bias falling, variance rising, and a “sweet spot” in between)
Validation data is a subset of the training data set that is used to assess model fit and provide feedback to the modeling process
• This is an extension of a simple train-test split of the data
You can also use the train_test_split function to create the validation data set
• First split off the test data, then separate the training and validation sets
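The two-step split described above can be sketched with two calls to `train_test_split` (toy arrays here; the proportions produce a 60/20/20 train/validation/test split overall):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(-1, 1)  # toy feature matrix
y = np.arange(200)                 # toy target

# First carve off the 20% test set, then split the remaining 80% into
# 75% train / 25% validation, which is 60/20/20 of the original data
X_rem, X_test, y_rem, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rem, y_rem, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 120 40 40
```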
(Tuning iterations:)
• Both R² values seem relatively low with a small gap between them (underfit: high bias, low variance). Action: add complexity
• Training R² improved but the gap between them grew considerably (overfit: low bias, high variance). Action: simplify
• Training R² dropped a bit but the gap between them closed down (best fit: balanced bias & variance). Action: decide on model
Once you’ve tuned and selected the best model, you need to refit the model on both the training and validation data, then score the model on the test data
• Combining the validation data back with the training data helps improve coefficient estimates
• The final “model score” that you’d share is the score from the test data
Model tuning: Training (60%, R² = 0.81) + Validation (20%, R² = 0.79); once you’ve finished tuning, select the best model
Model scoring: refit on Training + Validation (60% + 20% = 80%, R² = 0.83) and score on the Test data (20%, R² = 0.81)
Cross validation is another validation method that splits, or “folds”, the training data into “k” parts, and treats each part as the validation data across iterations
• You fit the model k times on the training folds, while validating on a different fold each time
Train-test split: Training | Test
5-fold cross validation: the training data is split into folds 1-5, and each iteration holds out a different fold as validation (e.g., training on four folds and validating on the fifth might score 0.42)
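In scikit-learn this whole loop collapses into one call; a minimal sketch with synthetic data (one R² per fold comes back):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))                               # toy features
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

# cv=5 fits the model 5 times, validating on a different fold each time
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(len(scores), scores.round(2))  # one validation R-squared per fold
```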
This can be put into a function to reuse!
You should choose one validation approach to use throughout your modeling process, so it’s important to highlight the pros and cons of each:
• Training time. Validation set: faster, since we’re only training and scoring the model once. Cross validation: slower, since we train and score the model once for each fold.
NEW MESSAGE (July 12, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Data Splitting
Hi there!
The model seems to be shaping up nicely, but I forgot to mention we need to split off a test dataset to assess performance.
Can you split off a test set, and fit your model using cross-validation?
Thanks!
Key Objectives:
1. Split your data into training and test
2. Use cross validation to fit your model and report each validation fold score
3. Fit your model on all of your training data and score it on the test dataset
05_data_splitting_assignments.ipynb
In this section we’ll cover feature engineering techniques for regression models, including
dummy variables, interaction terms, binning, and more
• Math calculations: polynomial terms, combining features, interaction terms
• Category mappings: binary columns, dummy variables, binning
• DateTime calculations: days from “today”, time between dates
• Group calculations: aggregations, ranks within groups
• Scaling: standardization, normalization
Your own creativity, domain knowledge, and critical thinking will lead to feature
engineering ideas not covered in this course, but these are worth keeping in mind!
Adding polynomial terms (x², x³, etc.) to regression models can improve fit when spotting “curved” feature-target relationships during EDA
• Generally, as the degree of your polynomial increases, so does the risk of overfitting
• If your goal is prediction, let cross validation guide the complexity of your polynomial term
Model Structure → Cross-Val R²:
• Carat → .847
• Carat + Carat² → .929
• Carat + Carat² + Carat³ → .936
• Carat + … + Carat⁴ → .936
• Carat + … + Carat⁵ → .936
Feature correlations:
Carat (weight), x (width), y (length), and z (depth)
capture a diamond’s size, which is why they are
highly correlated with each other and will cause
multicollinearity issues in the model
A dummy variable is a field that only contains zeros and ones to represent the presence (1) or absence (0) of a value, also known as one-hot encoding
• They are used to transform a categorical field into multiple numeric fields
• Use pd.get_dummies() to create dummy variables in Python
In linear regression models, you need to drop a dummy variable category using the “drop_first=True” argument to avoid perfect multicollinearity
• The category that gets dropped is known as the “reference level”
PRO TIP: Your model accuracy will be the same regardless of which dummy column is dropped, but some reference levels are more intuitive to interpret than others. If you want to choose your reference level, skip the drop_first argument and drop the desired reference level manually.
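A minimal sketch of both approaches on a toy categorical column (not the course data):

```python
import pandas as pd

df = pd.DataFrame({"cut": ["Fair", "Good", "Ideal", "Good"]})  # toy categorical column

# drop_first=True drops the first category ("Fair" becomes the reference level)
dummies = pd.get_dummies(df["cut"], drop_first=True)
print(list(dummies.columns))  # ['Good', 'Ideal']

# To pick your own reference level, make all dummies and drop that column manually
manual = pd.get_dummies(df["cut"]).drop(columns="Ideal")  # "Ideal" is now the reference
print(list(manual.columns))  # ['Fair', 'Good']
```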
Adding dummy variables for each categorical column in your data can lead to very wide data sets, which tends to increase model variance
Grouping, or binning categorical data, solves this and can improve interpretability
• After binning, create dummy variables for the groups (which should be fewer than before)
If we had less data, the “I1” category would be too rare to produce reliable estimates, as random data splitting means we might not see “I1” diamonds in our test or validation sets! (categories with low counts are especially at risk of overfitting)
Binning numeric data lets you turn numeric features into categories
• Generally, this is less accurate than using raw values, but it is a highly interpretable way of capturing non-linear trends and numeric fields with a high percentage of missing values
Carat is a continuous field, but we’re binning it into values at various intervals, making it a categorical field
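A minimal sketch of numeric binning with `pd.cut` (the bin edges and labels here are hypothetical, not the course's exact choices):

```python
import pandas as pd

carat = pd.Series([0.3, 0.7, 1.2, 2.5])  # toy continuous values

# Hypothetical bin edges; pd.cut assigns each value to an interval
# (intervals are right-inclusive by default)
binned = pd.cut(carat, bins=[0, 0.5, 1.0, 2.0, 5.0],
                labels=["small", "medium", "large", "very_large"])
print(list(binned.astype(str)))  # ['small', 'medium', 'large', 'very_large']
```

After binning, the groups can be fed into `pd.get_dummies()` just like any other categorical column.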
NEW MESSAGE (July 14, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Feature Engineering
Hello again!
I notice that your last model didn’t have any of the categorical features in it, can you make sure to include them?
Thanks!
Key Objectives:
1. Perform feature engineering on numeric and categorical features
2. Evaluate model performance after including the new features
06_feature_engineering_assignment.ipynb
Feature engineering lets you turn raw data into useful model features
• This is critical to getting the best accuracy out of your data sets
Most ideas will come from domain expertise and critical thinking
• Thinking carefully about what might influence your target variable can lead to the creation of powerful features
NEW MESSAGE (July 20, 2023)
From: Cathy Coefficient (Data Science Lead)
Subject: Apartment Rental Prices
Hi there,
Cam has told me some great things about your work. We have a new client that has a modelling project they need help with.
The client works in the real estate industry in San Francisco and wants to understand the key factors affecting rental prices. More importantly, they hope to be able to use your model to predict an appropriate price range for the apartments they build in the city.
Key Objectives:
1. Perform EDA on a modelling dataset
2. Split your data and choose a validation framework
3. Fit and tune a linear regression model by checking model assumptions and performing feature engineering
4. Interpret a linear regression model
07_regression_modelling_project.ipynb
In this section we’ll cover regularized regression models, also known as “penalized” regression models, which focus on reducing model variance to improve predictive accuracy
• Linear Regression (OLS): low bias, high variance (overfit model)
• Regularized Regression (Ridge, Lasso, Elastic Net): balance of bias and variance
The “slope” coefficients tend to shrink during regularization
In ordinary least squares, the cost function being minimized is J = SSE
• J is the cost you’re trying to minimize, and SSE is the sum of squared error
In regularized regression, you add a regularization term controlled by alpha:
J = SSE + αR
• α (alpha) controls the impact of the regularization term R
J = SSE + αR
• Ridge Regression: R is the sum of the squared coefficient values, Σ βj²
• Lasso Regression: R is the sum of the absolute coefficient values, Σ |βj|
• Elastic Net Regression: R is the sum of the ridge & lasso regularization terms, weighted by lambda (λ): (1 − λ) Σ βj² + λ Σ |βj|
Ridge Regression: J = SSE + α Σ βj²
When fitting a ridge regression, the model is trying to minimize this cost function by both:
• Minimizing training error with SSE
• Keeping the coefficient values small
The idea is to incorporate the bias-variance tradeoff directly into the algorithm! We want to reduce the sum of squared error, but also constrain the magnitude of the coefficients to prevent overfitting.
Because of this, ridge regression tends to produce much more accurate models when working with multicollinear features.
For comparison, the linear regression workflow:
1. Fit a linear regression model on the training data and score on the validation data
2. Tune the model by adding or removing inputs
3. Fit the model on the train & validation data and score on the test data
As a best practice, only retrieve the mean & standard deviation from the training data. Doing this before splitting the data would be collecting information from the test data and using it for training the model, which will lead to inflated test performance.
Fit the ridge regression on the standardized features with alpha = 1
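A minimal scikit-learn sketch of this step (the course's exact code isn't reproduced here; a `StandardScaler` + `Ridge` pipeline is one standard way to do it, on synthetic data). Comparing against OLS shows the shrinkage:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))  # toy features
y = X @ np.array([3.0, 0.0, 0.0, 1.0, -2.0]) + rng.normal(size=100)

# Standardizing inside the pipeline means the scaling is learned from training data only
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
ridge_coefs = ridge_pipe.named_steps["ridge"].coef_

ols_coefs = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

# Ridge shrinks the coefficient vector towards 0 relative to OLS
print(np.sum(ridge_coefs ** 2) < np.sum(ols_coefs ** 2))  # True
```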
Coefficient estimates shrink towards 0 as alpha increases
You can tune alpha using cross validation with RidgeCV, or with a manual validation loop
NEW MESSAGE (July 22, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Ridge Regression
Hi there!
I love the model you built for our computer price data set.
Thanks!
Key Objectives:
1. Fit a ridge regression model in Python
2. Tune alpha through cross validation
3. Compare model performance with traditional regression
08_regularization_assignments.ipynb
Lasso Regression: J = SSE + α Σ |βj|
When fitting a lasso regression, the model is trying to minimize this cost function by both:
• Minimizing training error with SSE
• Keeping the coefficient values small
PRO TIP: If you have lots of features in your model, fitting a moderately penalized lasso regression can help inform which features are strongest by dropping the rest to 0
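A minimal sketch of that variable-selection behavior on synthetic data, where only two of ten features actually matter (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))                        # toy features
y = 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(size=200)  # only the first two matter

# A moderate penalty drives the weak coefficients to exactly 0
lasso_model = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X), y)
print(int((lasso_model.coef_ == 0).sum()), lasso_model.coef_[:2].round(1))
```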
These coefficients all dropped to 0! You can tune alpha with the same validation loop.
NEW MESSAGE (July 23, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: RE: Ridge Regression
Hi there!
I was hoping ridge regression would increase performance more than it did, but given that we didn’t have any highly correlated features, it makes sense.
Thanks!
Key Objectives:
1. Fit a lasso regression model in Python
2. Tune alpha through cross validation
3. Compare model performance with traditional regression
08_regularization_assignments.ipynb
Elastic Net Regression combines the ridge and lasso penalties:
J = SSE + α((1 − λ) Σ βj² + λ Σ |βj|)
• α controls the total regularization strength, and λ controls the balance between the ridge & lasso penalties
PRO TIP: Elastic net regression combines the effects of both lasso and ridge regression and can be a lifesaver for challenging modeling problems
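A minimal scikit-learn sketch on synthetic data. Note a naming swap worth flagging: sklearn's `alpha` is the total strength (this deck's α), and `l1_ratio` plays the role of this deck's λ (0 = pure ridge, 1 = pure lasso); the values below are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 8))             # toy features
y = 4.0 * X[:, 0] + rng.normal(size=200)  # only the first feature matters

# alpha = total regularization strength; l1_ratio = ridge/lasso mix
enet = ElasticNet(alpha=0.1, l1_ratio=0.9).fit(StandardScaler().fit_transform(X), y)
print(enet.coef_.round(2))  # large weight on the first feature, the rest near 0
```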
Tuning lambda while keeping alpha constant, or vice versa, misses out on many possible combinations of these hyperparameters that might perform better – we need to try multiple combinations of both!
The best combination here had low regularization strength, 90% skewed towards a lasso penalty; this model slightly underperformed the lasso model, but it is still quite good.
NEW MESSAGE (July 25, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Re: Re: Ridge Regression
Hi there!
Ok, last request for a bit, I promise!
Thanks!
Key Objectives:
1. Fit an elastic net regression model in Python
2. Tune alpha and lambda through cross validation
3. Compare model performance with other linear regression models
08_regularization_assignments.ipynb
Model comparison (cost function; penalty; hyperparameters; notes):
• Linear Regression (OLS): J = SSE; no penalty; no hyperparameters; fits a line of best fit by minimizing SSE
• Ridge Regression: J = SSE + α Σ βj²; L2 penalty; α; helps with overfitting by shrinking coefficients towards 0, especially with multicollinear features
• Lasso Regression: J = SSE + α Σ |βj|; L1 penalty; α; helps with overfitting by dropping some coefficients to 0, making it a good variable selection technique
• Elastic Net Regression: J = SSE + α((1 − λ) Σ βj² + λ Σ |βj|); L2 and L1 penalty; α, λ; helps with overfitting by balancing ridge and lasso regression
All features need to be standardized before being input into these models
• This allows the resulting coefficients to be fairly compared with one another
• Use the training data set to calculate the mean and standard deviation values, then apply to the test data set
NEW MESSAGE (July 27, 2023)
From: Cathy Coefficient (Data Science Lead)
Subject: RE: Apartment Rental Prices
Hi again,
I just reviewed your modelling work and, overall, I’m quite pleased with the results.
Key Objectives:
1. Fit Lasso, Ridge, and Elastic Net regression models
2. Tune the models to the optimal regularization strength based on validation results
3. Select the model which has the highest accuracy on out-of-sample data
09_regularized_regression_project.ipynb
In this section we’ll cover time series analysis and forecasting, specialized techniques applied to time series data to extract patterns & trends and predict future values
Time series data requires each row to represent a unique moment in time
• This can be in any unit of time (seconds, hours, days, months, years, etc.)
(Figures: raw data aggregated into daily, monthly, and yearly time series)
Time series data is often visualized using a line chart with time as the x-axis
Time series smoothing is the process of reducing volatility in time series data to help identify trends and patterns that are otherwise challenging to see
The simplest way to smooth time series data is by calculating a moving average
PRO TIP: Align your windows with intuitive seasons: with monthly data, look at quarters or years; with daily data, look at weekly or monthly windows, etc.
The rolling() method lets you calculate moving averages for a specified window
• df[“col”].rolling(window).mean()
• Setting min_periods allows partial windows at the start of the series, so the smoothed series has no NaN values!
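A minimal sketch on a toy series (the values and dates are made up), showing both the default behavior and the `min_periods` option:

```python
import pandas as pd

sales = pd.Series([10, 12, 14, 50, 16, 18],
                  index=pd.date_range("2023-01-01", periods=6, freq="D"))

smoothed = sales.rolling(3).mean()  # 3-period moving average; first 2 values are NaN
print(smoothed.tolist())

# min_periods allows partial windows, so the smoothed series has no NaN values
no_nan = sales.rolling(3, min_periods=1).mean()
print(int(no_nan.isna().sum()))  # 0
```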
NEW MESSAGE (August 1, 2023)
From: Tammy Tiempo (Financial Analyst)
Subject: Smoothing
Hi there!
Thanks!
Key Objectives:
1. Explore and manipulate time series data
2. Modify smoothing parameters to reveal different patterns
10_time_series_assignments.ipynb
Time series data can be decomposed into trend, seasonality, and random noise
• Trend: Are values trending up, down, or staying flat over time?
• Seasonality: Do values display a cyclical pattern? (like more customers buying on weekends)
• Random noise: What volatility exists outside the trend and seasonal patterns?
(Left figure: a positive trend, unclear seasonality, and lots of random noise; right figure: a flat trend, clear hourly seasonality, and relatively little noise)
Decomposition can be additive or multiplicative:
• Additive: yt = Tt + St + Rt (trend + seasonality + random); the seasonal amplitude stays constant
• Multiplicative: yt = Tt × St × Rt (trend × seasonality × random); the seasonal amplitude grows with the level
The residuals look good in the middle but get worse towards the end, which indicates we need a multiplicative model; after switching, we can no longer see a pattern in the residuals
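A small synthetic sketch of the difference between the two forms (the constants are arbitrary): in the additive series the yearly seasonal swing stays the same, while in the multiplicative series it grows with the trend.

```python
import numpy as np

t = np.arange(48)
trend = 100 + 2 * t                 # upward trend
cycle = np.sin(2 * np.pi * t / 12)  # 12-period seasonal cycle

additive = trend + 20 * cycle               # constant seasonal amplitude
multiplicative = trend * (1 + 0.2 * cycle)  # amplitude scales with the level

def yearly_swing(series):
    return series.max() - series.min()

# Additive: the swing is (essentially) identical in the first and last year
print(round(abs(yearly_swing(additive[:12]) - yearly_swing(additive[-12:])), 6))  # 0.0
# Multiplicative: the swing grows as the trend rises
print(yearly_swing(multiplicative[-12:]) > yearly_swing(multiplicative[:12]))  # True
```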
[ACF chart: correlation (%) on the y-axis vs lag 0–28 on the x-axis; the peak at lag 24 indicates a seasonal pattern every 24 periods (hours)]
NEW MESSAGE — August 5, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Decomposition

Hi there!
Thanks for your help with smoothing.
Thanks!

Key Objectives:
1. Use time series decomposition to understand the trend, seasonality and noise of time series data
2. Use an ACF chart to estimate the seasonal window for time series data

10_time_series_assignments.ipynb
Time series forecasting lets you predict future values using historical data
• Forecasting models use existing trends & seasonality to make accurate future predictions

Common Forecasting Techniques:
• Linear Regression (covered in this course)
• Facebook Prophet (covered in this course)
• Holt-Winters
• ARIMA Modeling
• LSTM (deep learning approach)
Forecasts get less accurate the further out we are trying to predict, so think carefully
about how far in advance you really need to forecast in the context of the problem
Time series data splitting does not follow traditional train/test splits:
• Instead of random splits, we need to split by points in time to mimic forecasting the future
• The training data set should be at least as long as your desired forecast window (test data)
• You may need to change the period to get enough holdout data

[Timeline diagram: Train | Test]
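A minimal sketch of a time-based split (the consumption values and the 80/20 ratio here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical daily consumption data -- assumed for illustration
df = pd.DataFrame(
    {"consumption": range(100)},
    index=pd.date_range("2023-01-01", periods=100, freq="D"),
)

# Split at a point in time (no shuffling!) so the test set is the most
# recent 20% of rows, mimicking a real forecast of the future
cutoff = int(len(df) * 0.8)
train = df.iloc[:cutoff]
test = df.iloc[cutoff:]
```

Note that everything in the test set occurs after everything in the training set, which is exactly what a random split would break.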
You can fit the model on the training data like a regular linear regression model
• It violates the assumptions of linear regression, but it's often very effective
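One way to sketch this (the data, the numeric time index feature, and the day-of-week dummies are all assumptions for illustration, not the course's exact feature set):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical daily series: upward trend plus a weekend bump (assumed data)
rng = np.random.default_rng(1)
idx = pd.date_range("2023-01-01", periods=140, freq="D")
y = pd.Series(
    50 + 0.2 * np.arange(140) + 5 * (idx.dayofweek >= 5) + rng.normal(0, 1, 140),
    index=idx,
)

# Features: a numeric time index captures the trend, and day-of-week
# dummies capture the weekly seasonality
X = pd.get_dummies(
    pd.DataFrame({"t": np.arange(140), "dow": idx.dayofweek.astype(str)}, index=idx),
    columns=["dow"],
)

# Time-based split, then fit like any other linear regression
X_train, X_test = X.iloc[:112], X.iloc[112:]
y_train, y_test = y.iloc[:112], y.iloc[112:]
model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)
```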
To score the model, you can plot the predictions against the actual values for the test data and calculate error metrics like MAE and MAPE
• The Mean Absolute Percentage Error (MAPE) is calculated by finding the average percent error of all the data points (it's essentially MAE converted to a percentage)
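The two metrics can be sketched directly in NumPy (the helper names and the sample values below are illustrative; sklearn.metrics also ships equivalent functions):

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: average absolute error, in the target's units."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted))

def mape(actual, predicted):
    """Mean Absolute Percentage Error: average percent error per point."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# Hypothetical actuals vs. predictions -- assumed values
actual = [100, 200, 400]
predicted = [110, 180, 400]
# mae  -> (10 + 20 + 0) / 3 = 10.0
# mape -> (10% + 10% + 0%) / 3 = 6.67%
```

Note that MAPE divides by the actuals, so it is undefined whenever an actual value is zero.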
Instead of OLS, Facebook Prophet uses an additive regression model with three main components:
1. Growth curve trend (created by automatically detecting changepoints in the data)
2. Yearly, weekly, and daily seasonal components
3. User-provided list of important holidays
NEW MESSAGE — August 10, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Forecasting

Hi there!
Thanks!

Key Objectives:
1. Split time series data into train and test datasets
2. Fit time series models with linear regression and Facebook Prophet

10_time_series_assignments.ipynb
Time series data requires each row to represent a unique moment in time
• It’s important to decide on the units, or row granularity, of your time series data (years, months, etc.)
Time series smoothing allows you to visualize trends in the time series data
• Common techniques include moving average and exponential smoothing
Decomposition breaks the data down into trend, seasonality, and random noise
• Time series data can be decomposed in an additive or multiplicative fashion
NEW MESSAGE — August 14, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Forecasting Electricity Consumption

Hello,
-Tammy

Key Objectives:
1. Explore and manipulate time series data
2. Perform time series data splitting
3. Build and compare the predictive accuracy of forecasting models

11_time_series_project.ipynb