Data Science in Python - Regression
Regression
With Expert Python Instructor Chris Bruehl
This is Part 2 of a 5-Part series designed to take you through several applications of data science
using Python, including data prep & EDA, regression, classification, unsupervised learning & NLP
This is a project-based course for students looking for a practical, hands-on approach to
learning data science and applying regression models with Python
Quizzes & Assignments to test and reinforce key concepts, with step-by-step solutions
Interactive demos to keep you engaged and apply your skills throughout the course
2  Regression 101 – Review the basics of regression, including key terms, the types and goals of regression analysis, and the regression modeling workflow
3  Pre-modeling Data Prep & EDA – Recap the data prep & EDA steps required to perform modeling, including key techniques to explore the target, features, and their relationships
4  Simple Linear Regression – Build simple linear regression models in Python and learn about the metrics and statistical tests that help evaluate their quality and output
5  Multiple Linear Regression – Build multiple linear regression models in Python and evaluate the model fit, perform variable selection, and compare models using error metrics
7  Model Testing & Validation – Test model performance by splitting data, tuning the model with the train & validation data, selecting the best model, and scoring it on the test data
10 Time Series Analysis – Learn methods for exploring time series data and how to perform time series forecasting using linear regression and Prophet
THE SITUATION: You’ve just been hired as an Associate Data Scientist for Maven Consulting Group to work on a team that specializes in price research for various industries.

THE ASSIGNMENT: You’ll have access to price data for several different industries, including diamonds, computers, and apartment rent. Your task is to build regression models that can accurately predict the price of goods, while giving clients insights into the factors that impact pricing.
This course covers both the theory & application of linear regression models
• We’ll start with Ordinary Least Squares (OLS), including evaluation metrics, assumptions, and validation & testing
• We’ll then pivot to regularized regression models, which are extensions of linear regression with special properties
In this section we’ll install Anaconda and introduce Jupyter Notebook, a user-friendly
coding environment where we’ll be coding in Python
Installing Anaconda
Launching Jupyter
Google Colab
1) Once inside the Jupyter interface, create a folder to store your notebooks for the course
NOTE: You can rename your folder by clicking “Rename” in the top left corner
2) Open your new coursework folder and launch your first Jupyter notebook!
NOTE: You can rename your notebook by clicking on the title at the top of the screen
NOTE: When you launch a Jupyter notebook, a terminal window may pop up as
well; this is called a notebook server, and it powers the notebook interface
In this section we’ll introduce the field of data science, discuss how it compares to
other data fields, and walk through each phase of the data science workflow
Yes! The differences lie in the types of problems you solve, and the tools and techniques you use to solve them:

What happened?                What’s going to happen?
• Descriptive Analytics       • Predictive Analytics
• Data Analysis               • Data Mining
• Business Intelligence       • Data Science

EXAMPLE: “What will house prices look like for the next 12 months?” “How can I segment my customers?”
MACHINE LEARNING
These are some of the most common machine learning algorithms that data scientists use in practice
The data science workflow consists of scoping the project; gathering, cleaning, and exploring the data; applying models; and sharing insights with end users:

1. Scoping a Project
2. Gathering Data
3. Cleaning Data
4. Exploring Data
5. Modeling Data
6. Sharing Insights

This is not a linear process! You’ll likely go back to further gather, clean and explore your data
Projects don’t start with data, they start with a clearly defined scope:
• Who are your end users or stakeholders?
• What business problems are you trying to help them solve?
• Is this a supervised or unsupervised learning problem? (do you even need data science?)
• What data do you need for your analysis?
A project is only as strong as the underlying data, so gathering the right data is essential to set a proper foundation for your analysis
A popular saying within data science is “garbage in, garbage out”, which means that cleaning data properly is key to producing accurate and reliable results

Data cleaning tasks may include:
• Resolving formatting issues
• Correcting data types
• Imputing missing data
• Restructuring the data

Building models: the flashy part of data science
Cleaning data: less fun, but very important (data scientists estimate that around 50-80% of their time is spent here!)
Exploratory data analysis (EDA) is all about exploring and understanding the data you’re working with before applying models or algorithms
Modeling data involves structuring and preparing data for specific modeling techniques, and applying those models to make predictions or discover patterns
The final step of the workflow involves summarizing your key findings and sharing insights with end users or stakeholders:
• Reiterate the problem
• Interpret the results of your analysis
• Share recommendations and next steps

Even with all the technical work that’s been done, it’s important to remember that the focus here is on non-technical solutions

NOTE: Another way to share results is to deploy your model, or put it into production
Data scientists have both coding and math skills along with domain expertise
• In addition to technical expertise, soft skills like communication, problem-solving, curiosity, creativity, grit,
and Googling prowess round out a data scientist’s skillset
In this section we’ll cover the basics of regression, including key modeling terminology,
the types & goals of regression analysis, and the regression modeling workflow
𝒚 Target
• This is the variable you’re trying to predict
• The target is also known as the “Y”, “model output”, “response”, or “dependent” variable
• Regression helps understand how the target variable is impacted by the features

𝒙 Features
• These are the variables that help you predict the target variable
• Features are also known as “X”, “model inputs”, “predictors”, or “independent” variables
• Regression helps understand how the features impact, or predict, the target
EXAMPLE: Carat, cut, color, clarity, and the rest of the columns are all features, since they can help us explain, or predict, the price of diamonds
...then apply that model to new, unobserved values containing features but no target (this is what our model will predict!)
Regression models are used for two primary goals: prediction and inference
• The goal shapes the modeling approach, including the regression algorithm used, the complexity of the model, and more

PREDICTION
• Used to predict the target as accurately as possible
• “What is the predicted price of a diamond given its characteristics?”

INFERENCE
• Used to understand the relationships between the features and target
• “How much do a diamond’s size and weight impact its price?”

You often need to strike a balance between these goals – a model that is very inaccurate won’t be too trustworthy for inference, and understanding the impact that variables have on predictions can help make them more accurate
Types of regression:
• Linear Regression – Models the relationship between the features & target using a linear equation
• Regularized Regression – An extension of linear regression that penalizes model complexity
• Time-Series Forecasting – Predicts future data using historical trends & seasonality
• Tree-Based Regression – Splits data by maximizing the difference between groups

Even though logistic regression (which you may have heard of) has “regression” in its name, it’s actually a classification modeling technique!
1. Scoping a project  2. Gathering data  3. Cleaning data  4. Exploring data  5. Modeling data  6. Sharing insights

• Data prep – Get your data ready to be input into an ML algorithm (single table, non-null; feature engineering; data splitting)
• Model building – Build regression models from training data (linear regression; regularized regression; time series)
• Model evaluation – Evaluate model fit on training & validation data (R-squared & MAE; checking assumptions; validation performance)
• Model selection – Pick the best model to deploy and identify insights (test performance; interpretability)
The target is the value we want to predict, and the features help us predict it
• The target is also known as “Y”, “model output”, “response”, or “dependent” variable
• Features are also known as “X”, “model inputs”, “predictors”, or “independent” variables
In this section we’ll review the data prep & EDA steps required before applying regression
algorithms, including key techniques to explore the target, features, and their relationships
Exploratory data analysis (EDA) is the process of exploring and visualizing data to find useful patterns & insights that help inform the modeling process
• In regression, EDA lets you identify & understand the most promising features for your model
• It also helps uncover potential issues with the features or target that need to be addressed

Even though data cleaning comes before exploratory data analysis, it’s common for EDA to reveal additional data cleaning steps needed before modeling
Exploring the target variable lets you understand where most values lie, how spread out they are, and what shape, or distribution, they have
• Charts like histograms and boxplots are great tools to explore regression targets
• sns.despine() removes the top & right borders

The price data is right-skewed (not uncommon for prices, but we may want to transform it later), with lots of outliers past $12,000
Exploring the features helps you understand them and start to get a sense of the transformations you may need to apply to each one
• Histograms and boxplots let you explore numeric features
• The .value_counts() method and bar charts are best for categorical features

In the diamonds data, “carat” is right-skewed, with spikes at quarter and half carat increments and few diamonds bigger than 2.5 carats; there aren’t many “Fair” diamonds, which are the lowest quality (enough for modeling though)
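A quick sketch of profiling a categorical feature with .value_counts(); the “cut” values below are made up to stand in for the course data:

```python
import pandas as pd

# Hypothetical "cut" column standing in for the diamonds data
cut = pd.Series(["Ideal", "Premium", "Ideal", "Good", "Fair", "Ideal", "Very Good"])

counts = cut.value_counts()                 # category frequencies, descending
shares = cut.value_counts(normalize=True)   # same counts, as proportions

print(counts)
# A bar chart of the counts is one more line: counts.plot(kind="bar")
```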
Key Objectives
1. Read in the “Computers.csv” file
2. Explore the target variable: “price”
3. Explore the features: “speed” and “RAM”

NEW MESSAGE – June 28, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: EDA

Hi there, glad to have you on the team!
Can you explore the computer prices data set at a high level?
Thanks!

01_EDA_assignments.ipynb
It’s common for numeric variables to have linear relationships between them
• When one variable changes, so does the other
• This relationship is commonly visualized with a scatterplot

EXAMPLE: Relationship between digital advertising spend and site traffic – as “digital_spend” moves up or down, so does “site_traffic” (this is a positive relationship); the labeled point ($72,954; 7,592) is one observation of spend and traffic
• Positive relationship – as one changes, the other changes in the same direction
• Negative relationship – as one changes, the other changes in the opposite direction
• No relationship – no association can be found between the changes in one variable and the other
The correlation (r) measures the strength & direction of a linear relationship (-1 to 1)
• You can use the .corr() method to calculate correlations in Pandas – df[“col1”].corr(df[“col2”])
EXAMPLES: Strong positive correlation · Moderate negative correlation · No correlation

PRO TIP: Highly correlated variables tend to be the best candidates for your “baseline” regression model, but other variables can still be useful features after exploring non-linear relationships and fixing data issues
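A minimal sketch of .corr() on toy data (the column names and values are illustrative):

```python
import pandas as pd

# Toy data standing in for two numeric columns
df = pd.DataFrame({
    "digital_spend": [10, 20, 30, 40, 50],
    "site_traffic":  [12, 25, 29, 43, 48],
})

# Pairwise correlation between two columns (Pearson by default)
r = df["digital_spend"].corr(df["site_traffic"])
print(round(r, 3))

# .corr() on the whole DataFrame returns the full correlation matrix,
# which is what a correlation heatmap is built from
matrix = df.corr(numeric_only=True)
```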
Average prices differ between cuts:
• “Fair” cut diamonds (worst) have the 2nd highest average price
• “Ideal” cut diamonds (best) have the lowest average price

There’s a strong positive relationship between carat & price (potentially exponential and not linear), and a strong positive relationship between diamond length (x) and carat (we might not be able to include both features in our model – more later!)

“Fair” cut diamonds have the largest average carat size, which explains why their average price is higher than “Ideal” cut diamonds, which have the smallest average carat size
Use sns.pairplot() to create a pairplot that shows all the scatterplots and histograms that can be made using the numeric variables in a DataFrame
• The row with your target can help explore numeric feature-target relationships quickly!
Use sns.lmplot() to create a scatterplot with a fitted regression line (more soon!)
• This is commonly used to explore the impact of other variables on a linear relationship
• sns.lmplot(df, x=“feature”, y=“target”, hue=“categorical feature”)

In the diamonds data, larger carat diamonds don’t follow the relationship well, and diamonds with “I1” clarity increase in price at a much slower rate
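A sketch of lmplot with a hue column, on generated data where one clarity level deliberately gains price at a slower rate:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
n = 150
carat = rng.uniform(0.2, 2.0, n)
clarity = rng.choice(["I1", "SI2"], n)
# "I1" diamonds gain price more slowly per carat (illustrative slopes)
price = carat * np.where(clarity == "I1", 3000, 8000) + rng.normal(0, 500, n)
df = pd.DataFrame({"carat": carat, "clarity": clarity, "price": price})

# One regression line per clarity level, colored by the hue column
g = sns.lmplot(data=df, x="carat", y="price", hue="clarity")
```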
Key Objectives
1. Build a correlation matrix heatmap
2. Build a pair plot of numeric features
3. Build an lmplot of “RAM” vs. “price”

NEW MESSAGE – June 28, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: Exploring Variable Relationships

Hi there!
Thanks!

01_EDA_assignments.ipynb
Preparing for modeling involves structuring the data as a valid input for a model:
• Stored in a single table (DataFrame)
• Aggregated to the right grain (1 row per target)
• Non-null (no missing values) and numeric

EXAMPLE: raw columns with text values are converted into fully numeric features, with price as the target:

  carat   cut         color   price      →   carat   cut   …   price
  0.90    Good        D       $3,423     →   0.90    1     …   3423
  0.31    Fair        E       $527       →   0.31    0     …   527
  0.52    Very Good   …       $1,250     →   0.52    2     …   1250
  1.55    Ideal       D       $12,500    →   1.55    3     …   12500

We’ll revisit this during the feature engineering section of the course
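One way to make a table like this fully numeric; the quality ordering in the mapping below is an assumption for illustration, and the course’s own feature engineering section may do this differently:

```python
import pandas as pd

# Raw data with a text column (illustrative, not the actual course file)
df = pd.DataFrame({
    "carat": [0.90, 0.31, 0.52, 1.55],
    "cut":   ["Good", "Fair", "Very Good", "Ideal"],
    "price": [3423, 527, 1250, 12500],
})

# Ordinal encoding: map the quality order to integers (assumed ordering)
cut_order = {"Fair": 0, "Good": 1, "Very Good": 2, "Ideal": 3}
df["cut_encoded"] = df["cut"].map(cut_order)

# One-hot (dummy) encoding is the alternative for unordered categories
dummies = pd.get_dummies(df["cut"], prefix="cut")

model_ready = df[["carat", "cut_encoded", "price"]]  # fully numeric table
```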
In this section we’ll build our first simple linear regression models in Python and learn
about the metrics and statistical tests that help evaluate their quality and output
Simple linear regression models use a single feature to predict the target
• This is achieved by fitting a line through the data points in a scatterplot (like an lmplot!)

EXAMPLE: Simple linear regression model using carat (the feature, x) and price (the target, y)
The linear regression model is an equation that best describes a linear relationship:

ŷ = β₀ + β₁x

• ŷ is the predicted value for the target
• x is the value for the feature
• β₀ is the y-intercept
• β₁ is the slope of the relationship

The actual value for the target (y) adds an error term (ε):

y = β₀ + β₁x + ε
The least squared error method finds the line that best fits through the data
• It works by solving for the line that minimizes the sum of squared error
• The equation that minimizes error can be solved with linear algebra

Why “squared” error?
• Squaring the residuals converts them into positive values, and prevents positive and negative distances from cancelling each other out (this makes the algebra to solve for the line much easier, too)
• One drawback of squared errors is that outliers can significantly impact the line (more later!)

Ordinary Least Squares (OLS) is another term for traditional linear regression. There are other frameworks for linear regression that don’t use least squared error, but they are rarely used outside of specialized domains.
EXAMPLE: Start with a candidate line, here the flat line ŷ = 30, and total up the squared errors:

  x     y     ŷ     error (ŷ − y)   squared error
  35    30    30      0               0
  40    40    30    −10             100
  50    15    30     15             225
  60    40    30    −10             100
  65    30    30      0               0
  70    50    30    −20             400
  80    40    30    −10             100

SUM OF SQUARED ERROR: 1,450

*Copyright Maven Analytics, LLC
Trying a sloped line reduces the sum of squared error:

  x     y     ŷ     error (ŷ − y)   squared error
  50    15    35     20             400
  60    40    40      0               0
  80    40    50     10             100
  …

SUM OF SQUARED ERROR: 862.5
The least squared error line is the one where no other line can produce a smaller total:

  x     y     ŷ     error (ŷ − y)   squared error
  35    30    26     −4              16
  40    40    28    −12             144
  50    15    32     17             289
  60    40    36     −4              16
  65    30    38      8              64
  70    50    40    −10             100
  80    40    44      4              16

SUM OF SQUARED ERROR: 722
REGRESSION IN PYTHON
These Python libraries are used to fit regression models: statsmodels & scikit-learn
• Both libraries use the same math and return the same regression equation!
• We will begin by focusing on statsmodels, but once we have the fundamentals of regression down, we’ll introduce scikit-learn
You can fit a regression in statsmodels with just a few lines of code:

1) Import statsmodels.api (standard alias is sm)
2) Define the target (y) and features (X)
3) Add a constant to the features with sm.add_constant()
4) Call sm.OLS(y, X) to set up the model, then use .fit() to build the model
5) Call .summary() on the model to review the model output

Why do we need to add a constant?
• Statsmodels assumes you want to fit a model with a line that runs through the origin (0, 0)
• sm.add_constant() lets statsmodels calculate a y-intercept other than 0 for the model
• Most regression software (like sklearn) takes care of this step behind the scenes
The summary output includes the variable summary statistics and the residual (error) statistics. The model output can be intimidating the first time you see it, but we’ll cover the important pieces in the next few lessons and later sections!
To interpret the model, use the “coef” column in the variable summary statistics.

How do we interpret this?
• An increase of 1 carat in a diamond is associated with a $7,756 increase in its price
• We cannot say 1 carat causes a $7,756 increase in price without a more rigorous experiment
• Technically, a 0-carat diamond is predicted to cost -$2,256
The .predict() method returns model predictions for single points or DataFrames
In simple linear regression, squaring the correlation between the feature and target yields R² (this doesn’t hold in multiple linear regression)
• EXAMPLE: An R² of 0.849 means the model explains 84.9% of the variation in price not explained by the mean of price

A “good” value for R² is relative to your data – an R² of .05 might be industry leading in sports analytics, but if you were trying to prove a physics theory with an experiment, an R² of .95 might be a disappointment!
Regression models include several hypothesis tests, including the F-test, which indicates whether our model is significantly better at predicting our target than using the mean of the target as the model
• In other words, you’re trying to find significant evidence that your model isn’t useless

Steps for the hypothesis test:
1) State the null and alternative hypotheses
2) Set a significance level (α)
3) Calculate the test statistic and p-value
4) Draw a conclusion from the test
   a) If p≤α, reject the null hypothesis (you’re confident the model isn’t useless)
   b) If p>α, don’t reject it (the model is probably useless, and needs more training)
1) For F-tests, the null & alternative hypotheses are always the same:
• Ho: F=0 – the model is NOT significantly better than the mean (naïve guess)
• Ha: F≠0 – the model is significantly better than the mean (naïve guess)

2) The significance level is the threshold you set to determine when the evidence against your null hypothesis is considered “strong enough” to prove it wrong
• This is set by alpha (α), which is the accepted probability of error
• The industry standard is α = .05 (this is what we’ll use in the course)
• Some teams and industries set a much higher bar, such as .01 or even .001, making the null hypothesis less likely to be rejected
3) The F-statistic and associated p-value are part of the model summary and help understand the predictive power of the regression model as a whole
• The F-statistic is the ratio of variability the model explains vs. the variability it doesn’t
• The p-value, or F-significance, is the probability of seeing results this strong if the model were truly no better than the mean
4) Comparing the p-value and alpha lets us draw a conclusion from the test:
• p≤α – reject the null hypothesis (you’re confident the model isn’t useless)
• p>α – don’t reject it (the model is probably useless, and needs more training)
The T-statistics and associated p-values are part of the model summary and help understand the predictive power of individual model coefficients
• It’s essentially another hypothesis test designed to find which coefficients are useful
Residual plots show how well a model performs across the range of predictions
• Ideally, residual plots should show errors normally distributed around 0
• model.resid returns a series with the residuals (actual value - predicted value)
• The residuals are plotted against the predictions (ŷ)
THE SITUATION: You’ve just been hired as a Data Science Intern for Maven National Insurance, a large private health insurance provider in the United States.

THE ASSIGNMENT: The company is looking at updating their insurance pricing model and wants you to start a new one from scratch using only a handful of variables. If you’re successful, they can reduce the complexity of their model while maintaining its accuracy (the data you’ve been provided has already been QA’d and cleaned).
Key Objectives
1. Build a simple linear regression model with “price” as the target and the most correlated variable as the feature
2. Interpret the model summary
4. Predict new “price” values with the model

NEW MESSAGE – June 30, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: Price predictions

Hi there!
Can you build a linear regression model with the target as “price” and the most correlated variable as the feature? I’d also like to see a few predictions for common values of the feature that you selected.
Thanks!

02_simple_regression_assignments.ipynb
A simple linear regression model is the line that best fits a scatterplot
• The line can be described using an equation with a slope and y-intercept, plus an error term
• The least squared error method is used to find the line of best fit

Python uses the statsmodels & scikit-learn libraries to fit regression models
• Statsmodels is ideal if your goal is inference, while scikit-learn is optimal for prediction workflows
In this section we’ll build multiple linear regression models using more than one feature, evaluate the model fit, perform variable selection, and compare models using error metrics
Simple linear regression: y = β₀ + β₁x + ε

Multiple linear regression: y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ⋯ + βₖxₖ + ε

Instead of just one “x”, we have a whole set of features (and associated coefficients) to help predict the target (y)

Multiple regression can scale well beyond 2 variables, but this is where visual analysis breaks down – and one reason why we need machine learning!
“carat” and “x” are the most correlated features with “price”, so both are used in the multiple regression model:

ŷ = 1738 + 10130(carat) – 1027(x)

• R² increased from 0.849
• The F-statistic’s p < 0.05, so the model isn’t useless
• Each coefficient’s p < 0.05, so all variables are useful
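Plugging illustrative values into that fitted equation (the carat = 1.0 and x = 6.0 inputs below are made up):

```python
# Apply the multiple regression equation ŷ = 1738 + 10130(carat) – 1027(x)
def predict_price(carat: float, x: float) -> float:
    """Return the predicted price from the fitted coefficients."""
    return 1738 + 10130 * carat - 1027 * x

# e.g. a 1-carat diamond that is 6mm long (hypothetical inputs)
print(predict_price(carat=1.0, x=6.0))  # 1738 + 10130 - 6162 = 5706
```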
Adjusted R² penalizes models for having additional features, so it only rises when a new variable improves the fit enough to justify the added complexity
• In this example, R² increased after adding a variable but adjusted R² didn’t
• In this case, the p-value could also tell us to remove this variable (other times, a variable can be significant and lower adjusted R²)
Key Objectives
1. Build and interpret two multiple linear regression models
2. Evaluate model fit and coefficient values for both models

NEW MESSAGE – July 2, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: Multiple Regression Model

Hi there!
Thanks!

03_multiple_regression_assignments.ipynb
Mean error metrics:
• MAE – Average of the absolute distance between actual & predicted values
• MSE – Average of the squared distance between actual & predicted values
• RMSE – Square root of Mean Squared Error, to return to the target’s units (like MAE)
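All three metrics computed by hand with numpy, on made-up actual/predicted values:

```python
import numpy as np

# Toy actual and predicted values (illustrative only)
actual    = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 210.0, 230.0])

errors = actual - predicted

mae  = np.mean(np.abs(errors))   # Mean Absolute Error
mse  = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)              # Root Mean Squared Error

print(mae, mse, rmse)
```

Note that RMSE (≈13.2) exceeds MAE (12.5) here because squaring weights the single larger error more heavily, which is exactly why RMSE is preferred when large errors are especially undesirable.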
Key Objectives
1. Calculate MAE and RMSE for fitted regression models

NEW MESSAGE – July 3, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: Error Metrics

Hi there!
Can you calculate MAE and RMSE for the model with all variables and our simple regression model that just included RAM?
Thanks!

03_multiple_regression_assignments.ipynb
Multiple linear regression models use multiple features to predict the target
• Each new feature comes with an associated coefficient that forms part of the regression equation
Variable selection methods help you identify valuable features for the model
• Coefficients with p-values greater than alpha (0.05) indicate that a coefficient isn’t significantly different than 0
• Contrary to R2, the adjusted R2 metric penalizes new variables added to a model
Mean error metrics let you compare predictive accuracy across models
• Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) quantify a model’s inaccuracy in the target’s units
• RMSE is more sensitive to large errors, so it is preferred in situations where large errors are undesirable
In this section we’ll cover the assumptions of linear regression models which should be
checked and met to ensure that the model’s predictions and interpretation are valid
There are a few key assumptions of linear regression models that can be violated, leading to unreliable predictions and interpretations
• If the goal is inference, these all need to be checked rigorously
• If the goal is prediction, some of these can be relaxed

1• Linearity: there’s a linear relationship between the target and each feature
2• Independence of errors: the residuals have no patterns or relationships between them
3• Normality of errors: the residuals are normally distributed
4• No perfect multicollinearity: the features aren’t perfectly correlated with each other
5• Equal variance of errors: the spread of residuals is consistent across predictions

You can use the L.I.N.N.E acronym (like linear regression) to remember them

It’s worth noting that you might see resources saying there are anywhere from 3 to 6 assumptions, but these 5 are the ones to focus on
Linearity assumes there’s a linear relationship between the target and each feature
• If this assumption is violated, it means the model isn’t capturing the underlying relationship between the variables, which could lead to inaccurate predictions

You can diagnose linearity by using scatterplots and residual plots, comparing against an ideal scatterplot and an ideal residual plot
You can fix linearity issues by transforming features with non-linear relationships
• Common transformations include polynomial terms (x², x³, etc.) and log transformations (log(x))
The scatterplot has a U-like shape, similar to a y = x² plot, so let’s try squaring the “carat” feature. The new “carat_sq” feature has a much more linear relationship with the “price” target variable.
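A minimal sketch of why the squared term helps, using synthetic data in place of the diamonds set (the names and values here are stand-ins, not the course data):

```python
import numpy as np

rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 2.0, 200)                      # synthetic stand-in for "carat"
price = 3 * carat**2 + rng.normal(scale=0.3, size=200)  # target with a curved relationship

r_raw = np.corrcoef(carat, price)[0, 1]    # correlation with the raw feature
r_sq = np.corrcoef(carat**2, price)[0, 1]  # correlation with the squared feature

print(r_sq > r_raw)  # the squared term tracks the curved target more closely
```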
You can fix linearity issues by transforming features with non-linear relationships
• Common transformations include polynomial terms (x², x³, etc.) and log transformations (log(x))
R² increased from 0.859
When adding polynomial terms, you need to include the lower-order terms in the model, regardless of significance. We can drop “y” (p > 0.05).
Independence of errors assumes that the residuals in your model have no patterns or relationships between them (they aren’t autocorrelated)
• In other words, it checks that you haven’t fit a linear model to time series data
You can diagnose independence with the Durbin-Watson Test:
• H0: DW = 2 – the errors are NOT autocorrelated
• Ha: DW ≠ 2 – the errors are autocorrelated
• As a rule of thumb, values between 1.5 and 2.5 are accepted
You can fix independence issues by using a time series model (more later!)
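The statistic itself is simple enough to compute by hand. This is a dependency-free sketch (the model summary in statsmodels reports it for you); the two synthetic series stand in for well-behaved vs. autocorrelated residuals:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
independent = rng.normal(size=500)          # uncorrelated "residuals" -> DW near 2
trending = np.cumsum(rng.normal(size=500))  # strongly autocorrelated -> DW near 0

print(round(durbin_watson(independent), 2), round(durbin_watson(trending), 3))
```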
Sorting the DataFrame by the target (price) leads to a Durbin-Watson statistic outside the desired range; randomizing the row order brings things back to normal
Normality of errors assumes the residuals are normally distributed; you can diagnose this with a QQ plot
(Figure: Ideal QQ Plot)
The red line represents a normal distribution
It’s normal for a few points to fall off the line,
particularly outside of 2 standard deviations
The Jarque-Bera Test (JB) in the model summary tests the normality of errors
However, this test has numerous issues and is far too restrictive to use in practice
You can typically fix normality issues by applying a log transform on the target
• Other options are applying log transforms to features or simply leaving the data as is
Before: our line starts to curve in between these thresholds (between -3 and 3), which indicates moderate to severe non-normality
After: this looks better! There are still large residuals left, but they don’t violate the normality assumption (check later)
You generally want to see points fall along the line in between -2 and +2 standard deviations
A 1-unit increase in carat is associated with a 10.6X increase in price (e^2.3598 = 10.59)
Common transformations and their inverses:
• y = np.sqrt(x) → x = y**2
• y = np.log(x) → x = np.exp(y)
• y = np.log10(x) → x = 10 ** y
• y = 1/x → x = 1/y
A diamond with these features is predicted to cost $9,458
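A short sketch of how this interpretation works with a log-transformed target. The coefficient value matches the slide; the log prediction below is a hypothetical placeholder, not taken from the course model:

```python
import numpy as np

# With log(price) as the target, a coefficient b on a feature means a 1-unit
# increase in that feature multiplies the predicted price by e**b
b_carat = 2.3598              # the slide's carat coefficient
multiplier = np.exp(b_carat)
print(round(multiplier, 2))   # ~10.59, the "10.6X" quoted above

# Predictions come back in log units, so invert with np.exp to get dollars
log_price_pred = 9.1547       # hypothetical log-price prediction
dollars = np.exp(log_price_pred)
```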
You can diagnose multicollinearity with the Variance Inflation Factor (VIF):
• Each feature is treated as the target, and R² measures how well the other features predict it
• As a rule of thumb, a VIF > 5 indicates that a variable is causing multicollinearity problems
We can ignore the VIF for the intercept, but most of our variables have a VIF > 5
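The definition above translates directly into code. This is a dependency-free sketch (statsmodels also ships a `variance_inflation_factor` helper), with two nearly duplicated synthetic features standing in for multicollinear columns:

```python
import numpy as np

def vif(X, j):
    """VIF_j = 1 / (1 - R^2), where R^2 comes from regressing column j on the others."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ beta
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a -> multicollinear pair
c = rng.normal(size=300)                 # unrelated feature
X = np.column_stack([a, b, c])

print([round(vif(X, j), 1) for j in range(3)])  # first two are huge, last is near 1
```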
(VIF tables: Original Model vs. Removing x)
The VIF can also be ignored when using dummy variables (more later!)
Equal variance of errors assumes the residuals are consistent across predictions
• In other words, the average error should stay roughly the same across the range of the target
• Equal variance is known as homoskedasticity, and non-equal variance is heteroskedasticity
You can diagnose heteroskedasticity with residual plots:
(Figures: heteroskedasticity in the original regression model | homoskedasticity after fixing the violated assumptions)
You can typically fix heteroskedasticity by applying a log transform on the target
• In other words, the average error should stay roughly the same across the target variable
Before: errors have a cone shape along the x-axis. After: errors are spread evenly along the x-axis.
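A small sketch of why the log transform helps, using synthetic data: when the noise is multiplicative, the raw target's spread grows with the feature (the cone shape), while the spread of the logged target stays roughly constant:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 500)
y = x * np.exp(rng.normal(scale=0.3, size=500))  # multiplicative noise -> cone shape

low, high = x < 3, x > 8
raw_ratio = y[high].std() / y[low].std()                  # spread grows with x
log_ratio = np.log(y)[high].std() / np.log(y)[low].std()  # roughly constant after log

print(raw_ratio > log_ratio)  # the log transform evens out the spread
```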
Outliers are extreme data points that fall well outside the usual pattern of data
• Some outliers have a dramatic influence on model fit, while others won’t
• Outliers that impact a regression equation significantly are called influential points
Left: an outlier in both profit and units sold, but it’s in line with the rest of the data, so it’s not influential
Right: an outlier in terms of profit that doesn’t follow the same pattern as the rest of the data, so it changes the regression line
An outlier’s influence depends on the size of the dataset (large datasets are impacted less)
Cook’s Distance measures the influence a data point has on the regression line
• It works by fitting a regression without the point and calculating the impact
• Cook’s D > 1 is considered a significant problem, while Cook’s D > 0.5 is worth investigating
• Use .get_influence().summary_frame() on a regression model to return influence statistics
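In statsmodels, `.get_influence().summary_frame()` returns a `cooks_d` column among other influence statistics. As a dependency-free sketch of the underlying formula (synthetic data, with one point planted far off the line):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D via the hat matrix: D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2)."""
    A = np.column_stack([np.ones(len(y)), X])  # add an intercept
    H = A @ np.linalg.inv(A.T @ A) @ A.T       # hat (projection) matrix
    h = np.diag(H)
    e = y - H @ y                              # residuals
    p = A.shape[1]
    s2 = (e @ e) / (len(y) - p)                # residual variance
    return (e ** 2 / (p * s2)) * h / (1 - h) ** 2

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(size=50)
x = np.append(x, 10.0)
y = np.append(y, 60.0)  # one point planted far off the y = 2x line

d = cooks_distance(x.reshape(-1, 1), y)
print(int(d.argmax()))  # the planted point is by far the most influential
```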
Not surprisingly, the most influential points were the largest diamonds!
Assumption troubleshooting summary:
• Linearity. Violation: the feature-target relationship is not linear. Consequence: suboptimal model accuracy and non-normal residuals. Diagnosis: curved patterns in feature-target scatterplots. Fix: add polynomial terms. Tradeoff: slightly less intuitive coefficients. The more curved the relationship, the worse it is; it’s usually worth fixing as it can improve accuracy.
• Outliers & Influence. Violation: influential data points. Consequence: lower accuracy & possible violated assumptions. Diagnosis: Cook’s D greater than 1 (highly influential) or 0.5 (potentially problematic). Fix: remove data points or engineer features to explain outliers. Tradeoff: removing data points is not ideal if we need to predict them. If we expect the data to have similar points, removing them won’t change model performance (consider a model with and without outliers).
NEW MESSAGE (July 10, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Model Assumptions
Hi there!
Can you check the model assumptions for our computer price regression model?
Thanks!
Key Objectives:
1. Check for violated assumptions on a fitted regression model
2. Implement fixes for assumptions
3. Decide whether the fixes are worth the trade-offs and settle on a final model
04_assumptions_assignments.ipynb
Diagnosing & fixing these assumptions can help improve model accuracy
• Use residual plots, QQ plots, the Durbin-Watson test, and the Variance Inflation Factor to diagnose
• Transforming the features and/or target (polynomial terms, log transforms, etc.) can typically help fix issues
In this section we’ll cover model testing & validation, which is a crucial step in the
modeling process designed to ensure that a model performs well on new, unseen data
The modeling workflow: 1) Scoping a project, 2) Gathering data, 3) Cleaning data, 4) Exploring data, 5) Modeling data, 6) Sharing insights
• Cleaning & exploring: get your data ready to be input into an ML algorithm
• Modeling: build regression models from training data, then evaluate model fit on training & validation data
• Sharing: pick the best model to deploy and identify insights
This is what we’ve covered so far
So far we’ve covered: a single, non-null table (cleaning), linear regression & checking assumptions (modeling), and R-squared & MAE (evaluation). Still to come: data splitting, validation performance, and test performance. Data splitting needs to be the first step in the process so we can do these properly.
These model scoring steps need to be considered before you start modeling:
1. Split the data into a “training” and “test” set
• Around 80% of the data is used for training and 20% is reserved for testing
2. Tune the model with the training & validation data
3. Select the best model
4. Score the model using the test data
• Once you’ve settled on a model, combine the training & validation data sets and refit the model, then score it on the test data to evaluate model performance
Data splitting involves separating a data set into “training” and “test” sets
• Training data, or in-sample data, is used to fit and tune a model
• Test data, or out-of-sample data, provides realistic estimates of accuracy on new data
(Diagram: Training data 80% | Test data 20%)
In practice, the rows are sampled randomly
80/20 is the most common ratio for train/test data, but anywhere from 70-90% can be used for training
For smaller data sets, a higher ratio of test data is needed to ensure a representative sample
4 inputs:
• All feature columns
• The target column
• The percentage of data for the test set
• A random state (or a different split is created each time)
4 outputs:
• Training set for features (X)
• Test set for features (X_test)
• Training set for target (y)
• Test set for target (y_test)
Perfect 80/20 split!
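The four inputs and four outputs map directly onto scikit-learn's `train_test_split`; a minimal sketch with stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # stand-in feature matrix
y = rng.normal(size=100)       # stand-in target

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # reserve 20% of the rows for testing
    random_state=42,  # fix the seed so the same split is created each time
)
print(len(X_train), len(X_test))  # 80 20
```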
Data splitting is primarily used to avoid overfitting, which is when a model predicts known (training) data very well but unknown (test) data poorly
• Overfitting is like memorizing the answers to a test instead of learning the material; you’ll ace the test, but lack the ability to generalize and apply your knowledge to unfamiliar questions
You can diagnose overfit & underfit models by comparing evaluation metrics like R², MAE, and RMSE between the train and test data sets
• Large gaps between train and test scores indicate that a model is overfitting the data
• Poor results across both train and test scores indicate that a model is underfitting the data
You can fix overfit models by:
• Simplifying the model
• Removing features
• Regularization (more later!)
You can fix underfit models by:
• Making the model more complex
• Adding new features
• Feature engineering (more later!)
Models will usually have lower performance on test data compared to the performance on training data, so small gaps are expected – remember that no model is perfect!
When splitting data for regression, there are two types of errors that can occur:
• Bias: How much the model fails to capture the relationships in the training data
• Variance: How much the model fails to generalize to the test data
On the training data:
• OVERFIT model: low bias (no error)
• WELL-FIT model: medium bias (some error)
• UNDERFIT model: high bias (lots of error)
On the test data:
• OVERFIT model: low bias (no error), high variance (much more error)
• WELL-FIT model: medium bias (some error), medium variance (a bit more error)
• UNDERFIT model: high bias (lots of error), low variance (same error)
The bias-variance tradeoff aims to find a balance between the two types of errors
• It’s rare for a model to have low bias and low variance, so finding a “sweet spot” is key
• This is something that you can monitor during the model tuning process
High-bias models fail to capture trends in the data, are underfit, are too simple, and have a high error rate on both train & test data
High-variance models capture noise from the training data, are overfit, are too complex, and have a large gap between training & test error
(Diagram: prediction error vs. model complexity, with bias falling, variance rising, and a “sweet spot” in between)
Validation data is a subset of the training data set that is used to assess model fit and provide feedback to the modeling process
• This is an extension of a simple train-test split of the data
You can also use the train_test_split function to create the validation data set
• First split off the test data, then separate the training and validation sets
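The two-step split described above can be sketched with two calls to `train_test_split` (toy arrays here; the proportions produce a 60/20/20 train/validation/test split overall):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(-1, 1)  # toy feature matrix
y = np.arange(200)                 # toy target

# First carve off the 20% test set, then split the remaining 80% into
# 75% train / 25% validation, which is 60/20/20 of the original data
X_rem, X_test, y_rem, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rem, y_rem, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 120 40 40
```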
(Tuning iterations:)
• Both R² values seem relatively low with a small gap between them (underfit: high bias, low variance). Action: add complexity
• Training R² improved but the gap between them grew considerably (overfit: low bias, high variance). Action: simplify
• Training R² dropped a bit but the gap between them closed down (best fit: balanced bias & variance). Action: decide on model
Once you’ve tuned and selected the best model, you need to refit the model on both the training and validation data, then score the model on the test data
• Combining the validation data back with the training data helps improve coefficient estimates
• The final “model score” that you’d share is the score from the test data
Model tuning: Training (60%, R² = 0.81) + Validation (20%, R² = 0.79); once you’ve finished tuning, select the best model
Model scoring: refit on Training + Validation (60% + 20% = 80%, R² = 0.83) and score on the Test data (20%, R² = 0.81)
Cross validation is another validation method that splits, or “folds”, the training data into “k” parts, and treats each part as the validation data across iterations
• You fit the model k times on the training folds, while validating on a different fold each time
Train-test split: Training | Test
5-fold cross validation: the training data is split into folds 1-5, and each iteration holds out a different fold as validation (e.g., training on four folds and validating on the fifth might score 0.42)
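In scikit-learn this whole loop collapses into one call; a minimal sketch with synthetic data (one R² per fold comes back):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))                               # toy features
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

# cv=5 fits the model 5 times, validating on a different fold each time
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(len(scores), scores.round(2))  # one validation R-squared per fold
```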
This can be put into a function to reuse!
You should choose one validation approach to use throughout your modeling process, so it’s important to highlight the pros and cons of each:
• Training time. Validation set: faster, since we’re only training and scoring the model once. Cross validation: slower, since we train and score the model once for each fold.
NEW MESSAGE (July 12, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Data Splitting
Hi there!
The model seems to be shaping up nicely, but I forgot to mention we need to split off a test dataset to assess performance.
Can you split off a test set, and fit your model using cross-validation?
Thanks!
Key Objectives:
1. Split your data into training and test
2. Use cross validation to fit your model and report each validation fold score
3. Fit your model on all of your training data and score it on the test dataset
05_data_splitting_assignments.ipynb
In this section we’ll cover feature engineering techniques for regression models, including
dummy variables, interaction terms, binning, and more
• Math calculations: polynomial terms, combining features, interaction terms
• Category mappings: binary columns, dummy variables, binning
• DateTime calculations: days from “today”, time between dates
• Group calculations: aggregations, ranks within groups
• Scaling: standardization, normalization
Your own creativity, domain knowledge, and critical thinking will lead to feature
engineering ideas not covered in this course, but these are worth keeping in mind!
Adding polynomial terms (x², x³, etc.) to regression models can improve fit when spotting “curved” feature-target relationships during EDA
• Generally, as the degree of your polynomial increases, so does the risk of overfitting
• If your goal is prediction, let cross validation guide the complexity of your polynomial term
Model Structure → Cross-Val R²:
• Carat → .847
• Carat + Carat² → .929
• Carat + Carat² + Carat³ → .936
• Carat + … + Carat⁴ → .936
• Carat + … + Carat⁵ → .936
Feature correlations:
Carat (weight), x (width), y (length), and z (depth)
capture a diamond’s size, which is why they are
highly correlated with each other and will cause
multicollinearity issues in the model
A dummy variable is a field that only contains zeros and ones to represent the presence (1) or absence (0) of a value, also known as one-hot encoding
• They are used to transform a categorical field into multiple numeric fields
• Use pd.get_dummies() to create dummy variables in Python
In linear regression models, you need to drop a dummy variable category using the “drop_first=True” argument to avoid perfect multicollinearity
• The category that gets dropped is known as the “reference level”
PRO TIP: Your model accuracy will be the same regardless of which dummy column is dropped, but some reference levels are more intuitive to interpret than others. If you want to choose your reference level, skip the drop_first argument and drop the desired reference level manually.
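A minimal sketch of both approaches on a toy categorical column (not the course data):

```python
import pandas as pd

df = pd.DataFrame({"cut": ["Fair", "Good", "Ideal", "Good"]})  # toy categorical column

# drop_first=True drops the first category ("Fair" becomes the reference level)
dummies = pd.get_dummies(df["cut"], drop_first=True)
print(list(dummies.columns))  # ['Good', 'Ideal']

# To pick your own reference level, make all dummies and drop that column manually
manual = pd.get_dummies(df["cut"]).drop(columns="Ideal")  # "Ideal" is now the reference
print(list(manual.columns))  # ['Fair', 'Good']
```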
Adding dummy variables for each categorical column in your data can lead to very wide data sets, which tends to increase model variance
Grouping, or binning categorical data, solves this and can improve interpretability
• After binning, create dummy variables for the groups (which should be fewer than before)
If we had less data, the “I1” category would be too rare to produce reliable estimates, as random data splitting means we might not see “I1” diamonds in our test or validation sets! (categories with low counts are especially at risk of overfitting)
Binning numeric data lets you turn numeric features into categories
• Generally, this is less accurate than using raw values, but it is a highly interpretable way of capturing non-linear trends and numeric fields with a high percentage of missing values
Carat is a continuous field, but we’re binning it into values at various intervals, making it a categorical field
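A minimal sketch of numeric binning with `pd.cut` (the bin edges and labels here are hypothetical, not the course's exact choices):

```python
import pandas as pd

carat = pd.Series([0.3, 0.7, 1.2, 2.5])  # toy continuous values

# Hypothetical bin edges; pd.cut assigns each value to an interval
# (intervals are right-inclusive by default)
binned = pd.cut(carat, bins=[0, 0.5, 1.0, 2.0, 5.0],
                labels=["small", "medium", "large", "very_large"])
print(list(binned.astype(str)))  # ['small', 'medium', 'large', 'very_large']
```

After binning, the groups can be fed into `pd.get_dummies()` just like any other categorical column.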
NEW MESSAGE (July 14, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Feature Engineering
Hello again!
I notice that your last model didn’t have any of the categorical features in it, can you make sure to include them?
Thanks!
Key Objectives:
1. Perform feature engineering on numeric and categorical features
2. Evaluate model performance after including the new features
06_feature_engineering_assignment.ipynb
Feature engineering lets you turn raw data into useful model features
• This is critical to getting the best accuracy out of your data sets
Most ideas will come from domain expertise and critical thinking
• Thinking carefully about what might influence your target variable can lead to the creation of powerful features
NEW MESSAGE (July 20, 2023)
From: Cathy Coefficient (Data Science Lead)
Subject: Apartment Rental Prices
Hi there,
Cam has told me some great things about your work. We have a new client that has a modelling project they need help with.
The client works in the real estate industry in San Francisco and wants to understand the key factors affecting rental prices. More importantly, they hope to be able to use your model to predict an appropriate price range for the apartments they build in the city.
Key Objectives:
1. Perform EDA on a modelling dataset
2. Split your data and choose a validation framework
3. Fit and tune a linear regression model by checking model assumptions and performing feature engineering
4. Interpret a linear regression model
07_regression_modelling_project.ipynb
In this section we’ll cover regularized regression models, also known as “penalized” regression models, which focus on reducing model variance to improve predictive accuracy
• Linear Regression (OLS): low bias, high variance (overfit model)
• Regularized Regression (Ridge, Lasso, Elastic Net): balance of bias and variance
The “slope” coefficients tend to shrink during regularization
In ordinary least squares, the cost function being minimized is J = SSE
• J is the cost you’re trying to minimize, and SSE is the sum of squared error
In regularized regression, you add a regularization term controlled by alpha:
J = SSE + αR
• α (alpha) controls the impact of the regularization term R
J = SSE + αR
• Ridge Regression: R is the sum of the squared coefficient values, Σ βj²
• Lasso Regression: R is the sum of the absolute coefficient values, Σ |βj|
• Elastic Net Regression: R is the sum of the ridge & lasso regularization terms, weighted by lambda (λ): (1 − λ) Σ βj² + λ Σ |βj|
Ridge Regression: J = SSE + α Σ βj²
When fitting a ridge regression, the model is trying to minimize this cost function by both:
• Minimizing training error with SSE
• Keeping the coefficient values small
The idea is to incorporate the bias-variance tradeoff directly into the algorithm! We want to reduce the sum of squared error, but also constrain the magnitude of the coefficients to prevent overfitting.
Because of this, ridge regression tends to produce much more accurate models when working with multicollinear features.
For comparison, the linear regression workflow:
1. Fit a linear regression model on the training data and score on the validation data
2. Tune the model by adding or removing inputs
3. Fit the model on the train & validation data and score on the test data
As a best practice, only retrieve the mean & standard deviation from the training data. Doing this before splitting the data would be collecting information from the test data and using it for training the model, which will lead to inflated test performance.
Fit the ridge regression on the standardized features with alpha = 1
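A minimal scikit-learn sketch of this step (the course's exact code isn't reproduced here; a `StandardScaler` + `Ridge` pipeline is one standard way to do it, on synthetic data). Comparing against OLS shows the shrinkage:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))  # toy features
y = X @ np.array([3.0, 0.0, 0.0, 1.0, -2.0]) + rng.normal(size=100)

# Standardizing inside the pipeline means the scaling is learned from training data only
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
ridge_coefs = ridge_pipe.named_steps["ridge"].coef_

ols_coefs = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

# Ridge shrinks the coefficient vector towards 0 relative to OLS
print(np.sum(ridge_coefs ** 2) < np.sum(ols_coefs ** 2))  # True
```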
Coefficient estimates shrink towards 0 as alpha increases
You can tune alpha using cross validation with RidgeCV, or with a manual validation loop
NEW MESSAGE (July 22, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Ridge Regression
Hi there!
I love the model you built for our computer price data set.
Thanks!
Key Objectives:
1. Fit a ridge regression model in Python
2. Tune alpha through cross validation
3. Compare model performance with traditional regression
08_regularization_assignments.ipynb
Lasso Regression: J = SSE + α Σ |βj|
When fitting a lasso regression, the model is trying to minimize this cost function by both:
• Minimizing training error with SSE
• Keeping the coefficient values small
PRO TIP: If you have lots of features in your model, fitting a moderately penalized lasso regression can help inform which features are strongest by dropping the rest to 0
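A minimal sketch of that variable-selection behavior on synthetic data, where only two of ten features actually matter (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))                        # toy features
y = 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(size=200)  # only the first two matter

# A moderate penalty drives the weak coefficients to exactly 0
lasso_model = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X), y)
print(int((lasso_model.coef_ == 0).sum()), lasso_model.coef_[:2].round(1))
```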
These coefficients all dropped to 0! You can tune alpha with the same validation loop.
NEW MESSAGE (July 23, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: RE: Ridge Regression
Hi there!
I was hoping ridge regression would increase performance more than it did, but given that we didn’t have any highly correlated features, it makes sense.
Thanks!
Key Objectives:
1. Fit a lasso regression model in Python
2. Tune alpha through cross validation
3. Compare model performance with traditional regression
08_regularization_assignments.ipynb
Elastic Net Regression combines the ridge and lasso penalties:
J = SSE + α((1 − λ) Σ βj² + λ Σ |βj|)
• α controls the total regularization strength, and λ controls the balance between the ridge & lasso penalties
PRO TIP: Elastic net regression combines the effects of both lasso and ridge regression and can be a lifesaver for challenging modeling problems
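A minimal scikit-learn sketch on synthetic data. Note a naming swap worth flagging: sklearn's `alpha` is the total strength (this deck's α), and `l1_ratio` plays the role of this deck's λ (0 = pure ridge, 1 = pure lasso); the values below are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 8))             # toy features
y = 4.0 * X[:, 0] + rng.normal(size=200)  # only the first feature matters

# alpha = total regularization strength; l1_ratio = ridge/lasso mix
enet = ElasticNet(alpha=0.1, l1_ratio=0.9).fit(StandardScaler().fit_transform(X), y)
print(enet.coef_.round(2))  # large weight on the first feature, the rest near 0
```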
Tuning lambda while keeping alpha constant, or vice versa, misses out on many possible combinations of these hyperparameters that might perform better – we need to try multiple combinations of both!
The best combination here had low regularization strength, 90% skewed towards a lasso penalty; this model slightly underperformed the lasso model, but it is still quite good.
NEW MESSAGE (July 25, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Re: Re: Ridge Regression
Hi there!
Ok, last request for a bit, I promise!
Thanks!
Key Objectives:
1. Fit an elastic net regression model in Python
2. Tune alpha and lambda through cross validation
3. Compare model performance with other linear regression models
08_regularization_assignments.ipynb
Model comparison (cost function; penalty; hyperparameters; notes):
• Linear Regression (OLS): J = SSE; no penalty; no hyperparameters; fits a line of best fit by minimizing SSE
• Ridge Regression: J = SSE + α Σ βj²; L2 penalty; α; helps with overfitting by shrinking coefficients towards 0, especially with multicollinear features
• Lasso Regression: J = SSE + α Σ |βj|; L1 penalty; α; helps with overfitting by dropping some coefficients to 0, making it a good variable selection technique
• Elastic Net Regression: J = SSE + α((1 − λ) Σ βj² + λ Σ |βj|); L2 and L1 penalty; α, λ; helps with overfitting by balancing ridge and lasso regression
All features need to be standardized before being input into these models
• This allows the resulting coefficients to be fairly compared with one another
• Use the training data set to calculate the mean and standard deviation values, then apply to the test data set
NEW MESSAGE (July 27, 2023)
From: Cathy Coefficient (Data Science Lead)
Subject: RE: Apartment Rental Prices
Hi again,
I just reviewed your modelling work and, overall, I’m quite pleased with the results.
Key Objectives:
1. Fit Lasso, Ridge, and Elastic Net regression models
2. Tune the models to the optimal regularization strength based on validation results
3. Select the model which has the highest accuracy on out-of-sample data
09_regularized_regression_project.ipynb
In this section we’ll cover time series analysis and forecasting, specialized techniques applied to time series data to extract patterns & trends and predict future values
Time series data requires each row to represent a unique moment in time
• This can be in any unit of time (seconds, hours, days, months, years, etc.)
(Figures: raw data aggregated into daily, monthly, and yearly time series)
Time series data is often visualized using a line chart with time as the x-axis
Time series smoothing is the process of reducing volatility in time series data to help identify trends and patterns that are otherwise challenging to see
The simplest way to smooth time series data is by calculating a moving average
PRO TIP: Align your windows with intuitive seasons: with monthly data, look at quarters or years; with daily data, look at weekly or monthly windows, etc.
The rolling() method lets you calculate moving averages for a specified window
• df[“col”].rolling(window).mean()
• Setting min_periods allows partial windows at the start of the series, so the smoothed series has no NaN values!
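A minimal sketch on a toy series (the values and dates are made up), showing both the default behavior and the `min_periods` option:

```python
import pandas as pd

sales = pd.Series([10, 12, 14, 50, 16, 18],
                  index=pd.date_range("2023-01-01", periods=6, freq="D"))

smoothed = sales.rolling(3).mean()  # 3-period moving average; first 2 values are NaN
print(smoothed.tolist())

# min_periods allows partial windows, so the smoothed series has no NaN values
no_nan = sales.rolling(3, min_periods=1).mean()
print(int(no_nan.isna().sum()))  # 0
```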
NEW MESSAGE (August 1, 2023)
From: Tammy Tiempo (Financial Analyst)
Subject: Smoothing
Hi there!
Thanks!
Key Objectives:
1. Explore and manipulate time series data
2. Modify smoothing parameters to reveal different patterns
10_time_series_assignments.ipynb
Time series data can be decomposed into trend, seasonality, and random noise
• Trend: Are values trending up, down, or staying flat over time?
• Seasonality: Do values display a cyclical pattern? (like more customers buying on weekends)
• Random noise: What volatility exists outside the trend and seasonal patterns?
(Left figure: a positive trend, unclear seasonality, and lots of random noise; right figure: a flat trend, clear hourly seasonality, and relatively little noise)
Decomposition can be additive or multiplicative:
• Additive: yt = Tt + St + Rt (trend + seasonality + random); the seasonal amplitude stays constant
• Multiplicative: yt = Tt × St × Rt (trend × seasonality × random); the seasonal amplitude grows with the level
The residuals look good in the middle but get worse towards the end, which indicates we need a multiplicative model; after switching, we can no longer see a pattern in the residuals
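A small synthetic sketch of the difference between the two forms (the constants are arbitrary): in the additive series the yearly seasonal swing stays the same, while in the multiplicative series it grows with the trend.

```python
import numpy as np

t = np.arange(48)
trend = 100 + 2 * t                 # upward trend
cycle = np.sin(2 * np.pi * t / 12)  # 12-period seasonal cycle

additive = trend + 20 * cycle               # constant seasonal amplitude
multiplicative = trend * (1 + 0.2 * cycle)  # amplitude scales with the level

def yearly_swing(series):
    return series.max() - series.min()

# Additive: the swing is (essentially) identical in the first and last year
print(round(abs(yearly_swing(additive[:12]) - yearly_swing(additive[-12:])), 6))  # 0.0
# Multiplicative: the swing grows as the trend rises
print(yearly_swing(multiplicative[-12:]) > yearly_swing(multiplicative[:12]))  # True
```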
[ACF chart: correlation (%) on the y-axis vs lag 0–28 on the x-axis; the peak at lag 24 indicates a seasonal pattern every 24 periods (hours)]
NEW MESSAGE — August 5, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Decomposition

Hi there!
Thanks for your help with smoothing.
Thanks!

Key Objectives:
1. Use time series decomposition to understand the trend, seasonality and noise of time series data
2. Use an ACF chart to estimate the seasonal window for time series data

10_time_series_assignments.ipynb
Time series forecasting lets you predict future values using historical data
• Forecasting models use existing trends & seasonality to make accurate future predictions

Common Forecasting Techniques:
• Linear Regression (covered in this course)
• Facebook Prophet (covered in this course)
• Holt-Winters
• ARIMA Modeling
• LSTM (deep learning approach)
Forecasts get less accurate the further out we are trying to predict, so think carefully
about how far in advance you really need to forecast in the context of the problem
Time series data splitting does not follow traditional train/test splits:
• Instead of random splits, we need to split by points in time to mimic forecasting the future
• The training data set should be at least as long as your desired forecast window (test data)
• You may need to change the period to get enough holdout data

[Timeline diagram: Train | Test]
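A minimal sketch of a time-based split (the consumption values and the 80/20 ratio here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical daily consumption data -- assumed for illustration
df = pd.DataFrame(
    {"consumption": range(100)},
    index=pd.date_range("2023-01-01", periods=100, freq="D"),
)

# Split at a point in time (no shuffling!) so the test set is the most
# recent 20% of rows, mimicking a real forecast of the future
cutoff = int(len(df) * 0.8)
train = df.iloc[:cutoff]
test = df.iloc[cutoff:]
```

Note that everything in the test set occurs after everything in the training set, which is exactly what a random split would break.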
You can fit the model on the training data like a regular linear regression model
• It violates the assumptions of linear regression, but it's often very effective
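One way to sketch this (the data, the numeric time index feature, and the day-of-week dummies are all assumptions for illustration, not the course's exact feature set):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical daily series: upward trend plus a weekend bump (assumed data)
rng = np.random.default_rng(1)
idx = pd.date_range("2023-01-01", periods=140, freq="D")
y = pd.Series(
    50 + 0.2 * np.arange(140) + 5 * (idx.dayofweek >= 5) + rng.normal(0, 1, 140),
    index=idx,
)

# Features: a numeric time index captures the trend, and day-of-week
# dummies capture the weekly seasonality
X = pd.get_dummies(
    pd.DataFrame({"t": np.arange(140), "dow": idx.dayofweek.astype(str)}, index=idx),
    columns=["dow"],
)

# Time-based split, then fit like any other linear regression
X_train, X_test = X.iloc[:112], X.iloc[112:]
y_train, y_test = y.iloc[:112], y.iloc[112:]
model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)
```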
To score the model, you can plot the predictions against the actual values for the test data and calculate error metrics like MAE and MAPE
• The Mean Absolute Percentage Error (MAPE) is calculated by finding the average percent error of all the data points (it's essentially MAE converted to a percentage)
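The two metrics can be sketched directly in NumPy (the helper names and the sample values below are illustrative; sklearn.metrics also ships equivalent functions):

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: average absolute error, in the target's units."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted))

def mape(actual, predicted):
    """Mean Absolute Percentage Error: average percent error per point."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# Hypothetical actuals vs. predictions -- assumed values
actual = [100, 200, 400]
predicted = [110, 180, 400]
# mae  -> (10 + 20 + 0) / 3 = 10.0
# mape -> (10% + 10% + 0%) / 3 = 6.67%
```

Note that MAPE divides by the actuals, so it is undefined whenever an actual value is zero.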
Instead of OLS, Facebook Prophet uses an additive regression model with three main components:
1. Growth curve trend (created by automatically detecting changepoints in the data)
2. Yearly, weekly, and daily seasonal components
3. User-provided list of important holidays
NEW MESSAGE — August 10, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Forecasting

Hi there!
Thanks!

Key Objectives:
1. Split time series data into train and test datasets
2. Fit time series models with linear regression and Facebook Prophet

10_time_series_assignments.ipynb
Time series data requires each row to represent a unique moment in time
• It’s important to decide on the units, or row granularity, of your time series data (years, months, etc.)
Time series smoothing allows you to visualize trends in the time series data
• Common techniques include moving average and exponential smoothing
Decomposition breaks the data down into trend, seasonality, and random noise
• Time series data can be decomposed in an additive or multiplicative fashion
NEW MESSAGE — August 14, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Forecasting Electricity Consumption

Hello,
-Tammy

Key Objectives:
1. Explore and manipulate time series data
2. Perform time series data splitting
3. Build and compare the predictive accuracy of forecasting models

11_time_series_project.ipynb