
DATA SCIENCE IN PYTHON

Regression
With Expert Python Instructor Chris Bruehl

*Copyright Maven Analytics, LLC


ABOUT THIS SERIES

This is Part 2 of a 5-part series designed to take you through several applications of data science
using Python, including data prep & EDA, regression, classification, unsupervised learning & NLP

PART 1: Data Prep & EDA
PART 2: Regression
PART 3: Classification
PART 4: Unsupervised Learning
PART 5: Natural Language Processing


COURSE STRUCTURE

This is a project-based course for students looking for a practical, hands-on approach to
learning data science and applying regression models with Python

Additional resources include:

Downloadable PDF to serve as a helpful reference when you’re offline or on the go

Quizzes & Assignments to test and reinforce key concepts, with step-by-step solutions

Interactive demos to keep you engaged and apply your skills throughout the course



COURSE OUTLINE

1. Intro to Data Science: Introduce the fields of data science and machine learning, review
   essential skills, and introduce each phase of the data science workflow

2. Regression 101: Review the basics of regression, including key terms, the types and goals
   of regression analysis, and the regression modeling workflow

3. Pre-modeling Data Prep & EDA: Recap the data prep & EDA steps required to perform modeling,
   including key techniques to explore the target, features, and their relationships

4. Simple Linear Regression: Build simple linear regression models in Python and learn about the
   metrics and statistical tests that help evaluate their quality and output

5. Multiple Linear Regression: Build multiple linear regression models in Python and evaluate the
   model fit, perform variable selection, and compare models using error metrics


COURSE OUTLINE

6. Model Assumptions: Review the assumptions of linear regression models that need to be met
   to ensure that the model's predictions and interpretation are valid

7. Model Testing & Validation: Test model performance by splitting data, tuning the model with the
   train & validation data, selecting the best model, and scoring it on the test data

8. Feature Engineering: Apply feature engineering techniques for regression models, including
   dummy variables, interaction terms, binning, and more

9. Regularized Regression: Introduce regularized regression techniques, which are alternatives to
   linear regression, including Ridge, Lasso, and Elastic Net regression

10. Time Series Analysis: Learn methods for exploring time series data and how to perform time
    series forecasting using linear regression and Prophet


WELCOME TO MAVEN CONSULTING GROUP

THE SITUATION:
You've just been hired as an Associate Data Scientist for Maven Consulting Group to work on
a team that specializes in price research for various industries.

THE ASSIGNMENT:
You'll have access to price data for several different industries, including diamonds,
computers, and apartment rentals. Your task is to build regression models that can accurately
predict the price of goods, while giving clients insights into the factors that impact pricing.

THE OBJECTIVES:
1. Explore & visualize the data
2. Prepare the data for modeling
3. Apply regression algorithms to the data
4. Evaluate how well your models fit
5. Select the best model and interpret it


SETTING EXPECTATIONS

This course covers both the theory & application of linear regression models
• We’ll start with Ordinary Least Squares (OLS), including evaluation metrics, assumptions, and validation & testing
• We’ll then pivot to regularized regression models, which are extensions of linear regression with special properties

We’ll also do an overview of time series analysis and forecasting methods


• We’ll cover time series behavior, smoothing, decomposition, and basic forecasting models & theory, but won’t do
a deep dive into advanced time series modeling techniques

We’ll use Jupyter Notebooks as our primary coding environment


• Jupyter Notebooks are free to use, and the industry standard for conducting data analysis with Python
(we’ll introduce Google Colab as an alternative, cloud-based environment as well)

You do NOT need to be a Python expert to take this course


• It is strongly recommended that you complete the first course in this series, Data Prep & EDA, and have some basic
statistics knowledge, but we will teach the relevant Python code for building regression models from scratch



INSTALLATION & SETUP



INSTALLATION & SETUP

In this section we'll install Anaconda and introduce Jupyter Notebook, a user-friendly
coding environment where we'll be coding in Python

TOPICS WE'LL COVER:
• Installing Anaconda
• Launching Jupyter
• Google Colab

GOALS FOR THIS SECTION:
• Install Anaconda and launch Jupyter Notebook
• Get comfortable with the Jupyter Notebook environment and interface


INSTALL ANACONDA (MAC)

1) Go to anaconda.com/products/distribution and click the download button

2) Launch the downloaded Anaconda pkg file

3) Follow the installation steps (default settings are OK)


INSTALL ANACONDA (PC)

1) Go to anaconda.com/products/distribution and click the download button

2) Launch the downloaded Anaconda exe file

3) Follow the installation steps (default settings are OK)


LAUNCHING JUPYTER

1) Launch Anaconda Navigator

2) Find Jupyter Notebook and click "Launch"


YOUR FIRST JUPYTER NOTEBOOK

1) Once inside the Jupyter interface, create a folder to store your notebooks for the course

   NOTE: You can rename your folder by clicking "Rename" in the top left corner

2) Open your new coursework folder and launch your first Jupyter notebook!

   NOTE: You can rename your notebook by clicking on the title at the top of the screen


THE NOTEBOOK SERVER

NOTE: When you launch a Jupyter notebook, a terminal window may pop up as
well; this is called a notebook server, and it powers the notebook interface

• If you close the server window, your notebooks will not run!

• Depending on your OS and method of launching Jupyter, one may not open; as long
  as you can run your notebooks, don't worry!


ALTERNATIVE: GOOGLE COLAB

Google Colab is Google's cloud-based version of Jupyter Notebooks

To create a Colab notebook:
1. Log in to a Gmail account
2. Go to colab.research.google.com
3. Click "new notebook"

Colab is very similar to Jupyter Notebooks (they even share the same file extension); the
main difference is that you are connecting to Google Drive rather than your machine, so
files will be stored in Google's cloud


INTRO TO DATA SCIENCE



INTRO TO DATA SCIENCE

In this section we'll introduce the field of data science, discuss how it compares to
other data fields, and walk through each phase of the data science workflow

TOPICS WE'LL COVER:
• What is Data Science?
• Essential Skills
• Machine Learning
• Data Science Workflow

GOALS FOR THIS SECTION:
• Compare data science and machine learning with other common data analytics fields
• Introduce supervised and unsupervised learning, and examples of each technique
• Review the machine learning landscape and commonly used algorithms
• Discuss essential skills, and review each phase of the data science workflow


WHAT IS DATA SCIENCE?

Data science is about using data to make smart decisions

Wait, isn't that business intelligence? Data analysis? Predictive analytics?

Yes! The differences lie in the types of problems you solve, and the tools and
techniques you use to solve them:

What happened?                    What's going to happen?
• Descriptive Analytics           • Predictive Analytics
• Data Analysis                   • Data Mining
• Business Intelligence           • Data Science


DATA SCIENCE SKILL SET

Data science requires a blend of coding, math, and domain expertise

[Venn diagram: Coding + Math = Machine Learning; Coding + Domain Expertise = Danger Zone!;
Math + Domain Expertise = Traditional Research; all three combined = Data Science]

The key is in applying these along with soft skills like:
• Communication
• Problem solving
• Curiosity & creativity
• Grit
• Googling prowess

Data scientists & analysts approach problem solving in similar ways, but data scientists
will often work with larger, more complex data sets and utilize advanced algorithms


WHAT IS MACHINE LEARNING?

Machine learning uses algorithms applied by data scientists to enable computers
to learn and make decisions from data

Machine learning algorithms fall into two broad categories:

Supervised Learning: using historical data to predict the future
• What will house prices look like for the next 12 months?
• How can I flag suspicious emails as spam?

Unsupervised Learning: finding patterns and relationships in data
• How can I segment my customers?
• Which TV shows should I recommend to each user?


COMMON ALGORITHMS

These are some of the most common machine learning algorithms that data
scientists use in practice

Supervised Learning
• Regression: Linear Regression, Regularized Regression, Time Series
• Classification: KNN, Logistic Regression, Tree-Based Models, Naïve Bayes (NLP)
• Also: support vector machines, neural networks, deep learning, etc.

Unsupervised Learning
• K-Means Clustering & Hierarchical Clustering
• Anomaly Detection
• Matrix Factorization & Principal Components Analysis
• Recommender Systems & Topic Modeling (NLP)
• Also: DBSCAN, T-SNE, factor analysis, association rule mining, etc.

Another category of machine learning algorithms is called reinforcement learning,
which is commonly used in robotics and gaming

Fields like deep learning and natural language processing utilize both supervised
and unsupervised learning techniques


DATA SCIENCE WORKFLOW

The data science workflow consists of scoping the project, gathering, cleaning
and exploring the data, applying models, and sharing insights with end users

1. Scoping a Project
2. Gathering Data
3. Cleaning Data
4. Exploring Data
5. Modeling Data
6. Sharing Insights

This is not a linear process! You'll likely go back to further gather, clean and explore your data


STEP 1: SCOPING A PROJECT

Projects don't start with data, they start with a clearly defined scope:
• Who are your end users or stakeholders?
• What business problems are you trying to help them solve?
• Is this a supervised or unsupervised learning problem? (do you even need data science?)
• What data do you need for your analysis?


STEP 2: GATHERING DATA

A project is only as strong as the underlying data, so gathering the right data is
essential to set a proper foundation for your analysis

Data can come from a variety of sources, including:
• Files (flat files, spreadsheets, etc.)
• Databases
• Websites
• APIs


STEP 3: CLEANING DATA

A popular saying within data science is "garbage in, garbage out", which means that
cleaning data properly is key to producing accurate and reliable results

Data cleaning tasks may include:
• Resolving formatting issues
• Correcting data types
• Imputing missing data
• Restructuring the data

Building models is the flashy part of data science; cleaning data is less fun, but very
important (data scientists estimate that around 50-80% of their time is spent here!)


STEP 4: EXPLORING DATA

Exploratory data analysis (EDA) is all about exploring and understanding the
data you're working with before applying models or algorithms

EDA tasks may include:
• Slicing & dicing the data
• Profiling the data
• Visualizing the data

A good number of the final insights that you share will come from the exploratory analysis phase!


STEP 5: MODELING DATA

Modeling data involves structuring and preparing data for specific modeling
techniques, and applying those models to make predictions or discover patterns

Modeling tasks include:
• Data splitting
• Feature selection & engineering
• Training & validating models

With fancy new algorithms introduced every year, you may feel the need to learn and
apply the latest and greatest techniques. In practice, simple is best; businesses &
leadership teams appreciate solutions that are easy to understand, interpret and implement


STEP 6: SHARING INSIGHTS

The final step of the workflow involves summarizing your key findings and sharing
insights with end users or stakeholders:
• Reiterate the problem
• Interpret the results of your analysis
• Share recommendations and next steps
• Focus on potential impact, not technical details

Even with all the technical work that's been done, it's important to remember that
the focus here is on non-technical solutions

NOTE: Another way to share results is to deploy your model, or put it into production


REGRESSION MODELING

Regression models are used to predict the value of numeric variables

Even though regression models fall into the final two steps of the workflow (modeling
data and sharing insights), data prep & EDA should always come first to help you get
the most out of your models


KEY TAKEAWAYS

Data science is about using data to make smart decisions


• Supervised learning techniques use historical data to predict the future, and unsupervised learning
techniques use algorithms to find patterns and relationships

Data scientists have both coding and math skills along with domain expertise
• In addition to technical expertise, soft skills like communication, problem-solving, curiosity, creativity, grit,
and Googling prowess round out a data scientist’s skillset

The data science workflow starts with defining a clear scope


• Once the project scope is defined, you can move on to gathering and cleaning data, performing exploratory data
analysis, preparing data for modeling, applying algorithms, and sharing insights with end users

Regression modeling is a key skill used for predicting real-world outcomes


• Data scientists are often asked to create regression models used to predict numeric values like revenue,
website traffic, order volumes, and much more



REGRESSION 101



REGRESSION 101

In this section we'll cover the basics of regression, including key modeling terminology,
the types & goals of regression analysis, and the regression modeling workflow

TOPICS WE'LL COVER:
• Regression 101
• Goals of Regression
• Types of Regression
• The Modeling Workflow

GOALS FOR THIS SECTION:
• Introduce the basics of regression modeling
• Understand key modeling terminology
• Discuss the different goals of regression modeling
• Review the regression modeling workflow


REGRESSION 101

Regression analysis is a supervised learning technique used to predict a numeric
variable (target) by modeling its relationship with a set of other variables (features)

EXAMPLE: Predicting the price of diamonds based on carat weight

In its simplest form, a regression model is a line that represents this relationship:
• The predicted price for a 1.5 carat diamond is roughly $10,000
• As "carat" moves up or down, so does "price" (positive relationship)

"All models are wrong, but some are useful" (George Box)


REGRESSION 101

Regression analysis is a supervised learning technique used to predict a numeric
variable (target) by modeling its relationship with a set of other variables (features)

y: Target
• This is the variable you're trying to predict
• The target is also known as the "Y", "model output", "response", or "dependent" variable
• Regression helps understand how the target variable is impacted by the features

x: Features
• These are the variables that help you predict the target variable
• Features are also known as "X", "model inputs", "predictors", or "independent" variables
• Regression helps understand how the features impact, or predict, the target


REGRESSION 101

EXAMPLE: Predicting the price of diamonds based on diamond characteristics

• Price is our target, since it's what we want to predict; since price is numerical,
  we'll use regression to predict it
• Carat, cut, color, clarity, and the rest of the columns are all features, since they
  can help us explain, or predict, the price of diamonds


REGRESSION 101

EXAMPLE: Predicting the price of diamonds based on diamond characteristics

• We'll use records with observed values for both the features and target to "train"
  our regression model...
• ...then apply that model to new, unobserved values containing features but no target
  (this is what our model will predict!)


GOALS OF REGRESSION

Regression models are used for two primary goals: prediction and inference
• The goal shapes the modeling approach, including the regression algorithm used, the
  complexity of the model, and more

PREDICTION
• Used to predict the target as accurately as possible
• "What is the predicted price of a diamond given its characteristics?"

INFERENCE
• Used to understand the relationships between the features and target
• "How much do a diamond's size and weight impact its price?"

You often need to strike a balance between these goals: a model that is very inaccurate
won't be too trustworthy for inference, and understanding the impact that variables have
on predictions can help make them more accurate


TYPES OF REGRESSION

These are some of the major types of regression modeling techniques:

• Linear Regression: models the relationship between the features & target using a linear equation
• Regularized Regression: an extension of linear regression that penalizes model complexity
• Time-Series Forecasting: predicts future data using historical trends & seasonality
• Tree-Based Regression: splits data by maximizing the difference between groups

Even though logistic regression (which you may have heard of) has "regression" in its
name, it's actually a classification modeling technique!


REGRESSION MODELING WORKFLOW

Within the data science workflow (scoping a project, gathering data, cleaning data,
exploring data, modeling data, sharing insights), regression modeling breaks down
into four steps:

1. Preparing for Modeling: get your data ready to be input into an ML algorithm
   • Single table, non-null
   • Feature engineering
   • Data splitting

2. Applying Algorithms: build regression models from training data
   • Linear regression
   • Regularized regression
   • Time series

3. Model Evaluation: evaluate model fit on training & validation data
   • R-squared & MAE
   • Checking assumptions
   • Validation performance

4. Model Selection: pick the best model to deploy and identify insights
   • Test performance
   • Interpretability


KEY TAKEAWAYS

Regression modeling is used to predict numeric values


• There are several types of regression models, but we will mostly focus on linear regression in this course

The target is the value we want to predict, and the features help us predict it
• The target is also known as “Y”, “model output”, “response”, or “dependent” variable
• Features are also known as “X”, “model inputs”, “predictors”, or “independent” variables

Regression Modeling has two primary goals: prediction and inference


• Prediction focuses on predicting the target as accurately as possible
• Inference is used to understand the relationship between the features and target

The modeling workflow is designed to ensure strong performance


• Splitting data, feature engineering, and model validation all work to ensure your model is as accurate as possible



PRE-MODELING DATA PREP & EDA



PRE-MODELING DATA PREP & EDA

In this section we'll review the data prep & EDA steps required before applying regression
algorithms, including key techniques to explore the target, features, and their relationships

TOPICS WE'LL COVER:
• EDA for Regression
• Exploring the Target
• Exploring the Features
• Linear Relationships
• Exploring Relationships
• Preparing for Modeling

GOALS FOR THIS SECTION:
• Visualize & explore the target and features
• Quantify the strength of linear relationships
• Identify relationships between the target and features, as well as between features themselves
• Review the data prep steps required for modeling


EDA FOR REGRESSION

Exploratory data analysis (EDA) is the process of exploring and visualizing data to
find useful patterns & insights that help inform the modeling process
• In regression, EDA lets you identify & understand the most promising features for your model
• It also helps uncover potential issues with the features or target that need to be addressed

When performing EDA for regression, it's key to explore:
• The target variable
• The features
• Feature-target relationships
• Feature-feature relationships

Even though data cleaning comes before exploratory data analysis, it's common
for EDA to reveal additional data cleaning steps needed before modeling


EXPLORING THE TARGET

Exploring the target variable lets you understand where most values lie, how
spread out they are, and what shape, or distribution, they have
• Charts like histograms and boxplots are great tools to explore regression targets
• The seaborn library is ideal for these charts (sns.despine() removes the top & right borders)

Observations from a histogram & boxplot of diamond prices (see the sketch below):
• Data is right-skewed (not uncommon for prices, but we may want to transform it later)
• There is a secondary "bump" around $4,000 (can this be explained by any features?)
• Median price is around $2,500
• Lots of outliers past $12,000
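A minimal sketch of this kind of target exploration, using seaborn's built-in example
diamonds dataset as a stand-in for the course files:

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("diamonds")

# Histogram: shows the shape (skew, bumps) of the target's distribution
sns.histplot(data=df, x="price")
sns.despine()  # remove the top & right chart borders
plt.show()

# Boxplot: highlights the median, spread, and outliers
sns.boxplot(data=df, x="price")
sns.despine()
plt.show()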


EXPLORING THE FEATURES

Exploring the features helps you understand them and start to get a sense of the
transformations you may need to apply to each one
• Histograms and boxplots let you explore numeric features
• The .value_counts() method and bar charts are best for categorical features

Observations from the diamond features (see the sketch below):
• Carat is right-skewed, with spikes at quarter and half carat increments, and few
  diamonds bigger than 2.5 carats
• There aren't many "Fair" diamonds, which are the lowest quality cut
  (enough for modeling though)
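A short sketch of feature exploration along these lines, again using seaborn's example
diamonds dataset as a stand-in:

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("diamonds")

# Numeric feature: histogram reveals skew and spikes
sns.histplot(data=df, x="carat")
sns.despine()
plt.show()

# Categorical feature: counts per category...
print(df["cut"].value_counts())

# ...and the same counts as a bar chart
sns.countplot(data=df, x="cut")
sns.despine()
plt.show()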


ASSIGNMENT: EXPLORING THE TARGET & FEATURES

NEW MESSAGE (June 28, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: EDA

Hi there, glad to have you on the team!

We've been getting a lot of partnership requests from various business units, so I
could really use your help.

Can you explore the computer prices data set at a high level? I want to see a boxplot
and histogram of the "price" variable, and histograms of the "speed" and "ram"
variables as well.

Thanks!

Key Objectives:
1. Read in the "Computers.csv" file
2. Explore the target variable: "price"
3. Explore the features: "speed" and "ram"

01_EDA_assignments.ipynb


LINEAR RELATIONSHIPS

It's common for numeric variables to have linear relationships between them
• When one variable changes, so does the other
• This relationship is commonly visualized with a scatterplot

EXAMPLE: Relationship between digital advertising spend and site traffic
• As "digital_spend" moves up or down, so does "site_traffic" (this is a positive relationship)
• One example point on the scatterplot: ($72,954 spend; 7,592 visits)


LINEAR RELATIONSHIPS

There are two possible linear relationships: positive & negative
• Variables can also have no relationship

• Positive Relationship: as one variable changes, the other changes in the same direction
• Negative Relationship: as one variable changes, the other changes in the opposite direction
• No Relationship: no association can be found between the changes in one variable and the other


CORRELATION

The correlation (r) measures the strength & direction of a linear relationship (-1 to 1)
• You can use the .corr() method to calculate correlations in Pandas:
  df["col1"].corr(df["col2"])

Example scatterplots: r = 0.858, r = -0.499, r = 0.008

Interpreting the magnitude of r:

  |r| >= 0.9         Very Strong
  0.7 <= |r| < 0.9   Strong
  0.4 <= |r| < 0.7   Moderate
  0.1 <= |r| < 0.4   Weak
  |r| < 0.1          None
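A quick sketch of computing a correlation with pandas, using seaborn's diamonds
data as a stand-in:

import seaborn as sns

df = sns.load_dataset("diamonds")

# Correlation between a single feature and the target
r = df["carat"].corr(df["price"])
print(f"r(carat, price) = {r:.3f}")  # very strong positive correlation (~0.92)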


CORRELATION

Interpreting the example scatterplots from the previous slide:
• r = 0.858: strong positive correlation
• r = -0.499: moderate negative correlation
• r = 0.008: no correlation

PRO TIP: Highly correlated variables tend to be the best candidates for your "baseline"
regression model, but other variables can still be useful features after exploring
non-linear relationships and fixing data issues


PRO TIP: CORRELATION MATRIX

A correlation matrix returns the correlation between each pair of columns in a DataFrame
• Use df.corr(numeric_only=True) to only consider numeric columns in the DataFrame
• Wrap your correlation matrix in sns.heatmap() to make it easier to interpret


PRO TIP: CORRELATION MATRIX

Setting the minimum and maximum values of the heatmap at -1 and 1 respectively, and
adding a divergent color scale, can help increase interpretability
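A minimal sketch of a correlation matrix heatmap along these lines (the vmin/vmax and
divergent cmap settings mirror the tip above; the specific cmap choice is illustrative):

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("diamonds")

# Correlation matrix over numeric columns only
corr = df.corr(numeric_only=True)

# Heatmap with a divergent color scale anchored at -1 and 1
sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True, fmt=".2f")
plt.show()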


FEATURE-TARGET RELATIONSHIPS

Exploring feature-target relationships helps identify potentially useful predictors
• Use scatterplots for numeric features and bar charts for categorical features

Observations from the diamonds data:
• Strong positive relationship between carat & price (potentially exponential rather than linear)
• Average prices differ between cuts:
  • "Fair" cut diamonds (worst) have the 2nd highest average price
  • "Ideal" cut diamonds (best) have the lowest average price


FEATURE-FEATURE RELATIONSHIPS

Exploring feature-feature relationships helps identify highly correlated features,
which can cause problems with regression models (more on this later!)

Observations from the diamonds data:
• Strong positive relationship between diamond length (x) and carat (we might not be
  able to include both features in our model; more later!)
• Some diamonds have a length (x) of 0: missing data coded as 0?
• "Fair" cut diamonds have the largest average carat size, which explains why their
  average price is higher than "Ideal" cut diamonds, which have the smallest average
  carat size


PRO TIP: PAIRPLOTS

Use sns.pairplot() to create a pairplot that shows all the scatterplots and
histograms that can be made using the numeric variables in a DataFrame
• The row with your target can help explore numeric feature-target relationships quickly!

PRO TIP: It can take a long time to generate pairplots for large datasets, so consider
using df.sample() to speed up the process significantly without losing high-level insights!
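A quick sketch combining both tips, assuming the diamonds data again (the sample size
is arbitrary):

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("diamonds")

# Pairplot on a random sample to keep rendering fast;
# random_state makes the sample reproducible
sns.pairplot(df.sample(1000, random_state=42))
plt.show()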


PRO TIP: LMPLOTS

Use sns.lmplot() to create a scatterplot with a fitted regression line (more soon!)
• This is commonly used to explore the impact of other variables on a linear relationship
• sns.lmplot(df, x="feature", y="target", hue="categorical feature")

Observations from the diamonds data:
• Larger carat diamonds don't follow the fitted relationship well
• Diamonds with "I1" clarity increase in price at a much slower rate
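A minimal runnable version of the call above, using diamonds columns as stand-ins
(the sample keeps the plot fast):

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("diamonds")

# One fitted regression line per clarity level reveals how the
# carat-price relationship shifts across a categorical feature
sns.lmplot(data=df.sample(2000, random_state=42), x="carat", y="price", hue="clarity")
plt.show()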


ASSIGNMENT: EXPLORING RELATIONSHIPS

NEW MESSAGE (June 28, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Exploring Variable Relationships

Hi there!

Thanks for taking a quick look at those variables for me. Now let's take it one step
further and explore variable relationships in the data.

Thanks!

Key Objectives:
1. Build a correlation matrix heatmap
2. Build a pair plot of numeric features
3. Build an lmplot of "ram" vs. "price"

01_EDA_assignments.ipynb


PREPARING FOR MODELING

Preparing for modeling involves structuring the data as a valid input for a model:
• Stored in a single table (DataFrame)
• Aggregated to the right grain (1 row per target)
• Non-null (no missing values) and numeric

Data ready for EDA:

  Carat   Cut         Color   Price
  0.90    Good        D       $3,423
  0.31    Fair        E       $527
  0.52    Very Good           $1,250
  1.55    Ideal       D       $12,500

Data ready for modeling (features: Carat, Cut, D, Missing; target: Price):

  Index   Carat   Cut   D   Missing   Price ($)
  0       0.90    1     1   0         3423
  1       0.31    0     0   0         527
  2       0.52    2     0   1         1250
  3       1.55    3     1   0         12500

We'll revisit this during the feature engineering section of the course


KEY TAKEAWAYS

It's critical to explore the features & target in your data
• Use histograms and boxplots to visualize and explore your target and any numeric features
• Use bar charts to visualize and explore categorical features

Use scatterplots & correlations to find linear relationships in your data
• Features that are highly correlated with your target are likely strong predictors of it
• Features that are highly correlated with each other can cause problems in a model

Remember to prepare your data for modeling after performing EDA
• The data must be stored in a single table, each column must be numeric, and missing values must be handled


SIMPLE LINEAR REGRESSION



SIMPLE LINEAR REGRESSION

In this section we'll build our first simple linear regression models in Python and learn
about the metrics and statistical tests that help evaluate their quality and output

TOPICS WE'LL COVER:
• Linear Regression Model
• Least Squared Error
• Regression in Python
• Making Predictions
• Evaluation Metrics

GOALS FOR THIS SECTION:
• Introduce the linear regression equation
• Visualize the least squared error method for finding the line of best fit
• Build linear regression models in Python and use them to make predictions for the target
• Interpret the model summary statistics and use them to evaluate the model's accuracy


SIMPLE LINEAR REGRESSION

Simple linear regression models use a single feature to predict the target
• This is achieved by fitting a line through the data points in a scatterplot (like an
  lmplot!), with the feature (x) on the horizontal axis and the target (y) on the vertical axis

EXAMPLE: Simple linear regression model using carat and price


LINEAR REGRESSION MODEL

The linear regression model is an equation that best describes a linear relationship

The predicted value for the target:

  ŷ = β₀ + β₁x

where x is the value for the feature, β₀ is the y-intercept, and β₁ is the slope of the relationship

The actual value for the target:

  y = β₀ + β₁x + ε

where ε is the error, or residual, caused by the difference between the actual and predicted values


LEAST SQUARED ERROR

The least squared error method finds the line that best fits through the data
• It works by solving for the line that minimizes the sum of squared error
Linear Regression
Model • The equation that minimizes error can be solved with linear algebra

Least Squared
Error
Why “squared” error?
Regression in
Python
• Squaring the residuals converts them into positive values, and prevents positive and negative
distances from cancelling each other out (this makes the algebra to solve the line much easier, too)
Making • One drawback of squared errors is that outliers can significantly impact the line (more later!)
Predictions

Evaluation
Metrics Ordinary Least Squares (OLS) is another term for traditional linear regression
There are other frameworks for linear regression that don’t use least squared
error, but they are rarely used outside of specialized domains



LEAST SQUARED ERROR

The least squared error method finds the line that best fits through the data
• It works by solving for the line that minimizes the sum of squared error
• The equation that minimizes error can be solved with linear algebra

If we know nothing about x, the mean of y is our best guess: ŷ = 30

  x    y    ŷ    ε     ε²
  10   10   30   20    400
  20   25   30   5     25
  30   20   30   10    100
  35   30   30   0     0
  40   40   30   -10   100
  50   15   30   15    225
  60   40   30   -10   100
  65   30   30   0     0
  70   50   30   -20   400
  80   40   30   -10   100

  SUM OF SQUARED ERROR: 1,450
LEAST SQUARED ERROR

We do know x, so let's use it to improve our guess: ŷ = 10 + 0.5x

  x    y    ŷ      ε      ε²
  10   10   15     5      25
  20   25   20     -5     25
  30   20   25     5      25
  35   30   27.5   -2.5   6.25
  40   40   30     -10    100
  50   15   35     20     400
  60   40   40     0      0
  65   30   42.5   12.5   156.25
  70   50   45     -5     25
  80   40   50     10     100

  SUM OF SQUARED ERROR: 862.5 (down from 1,450)
LEAST SQUARED ERROR

This is the line with the least squared error: ŷ = 12 + 0.4x

  x    y    ŷ    ε     ε²
  10   10   16   6     36
  20   25   20   -5    25
  30   20   24   4     16
  35   30   26   -4    16
  40   40   28   -12   144
  50   15   32   17    289
  60   40   36   -4    16
  65   30   38   8     64
  70   50   40   -10   100
  80   40   44   4     16

  SUM OF SQUARED ERROR: 722 (down from 862.5)
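A small sketch that reproduces these sums of squared error numerically, using the
ten (x, y) points from the tables above:

import numpy as np

# The ten data points from the slides
x = np.array([10, 20, 30, 35, 40, 50, 60, 65, 70, 80])
y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40])

def sse(y_pred):
    """Sum of squared error between actual and predicted values."""
    return np.sum((y - y_pred) ** 2)

print(sse(np.full_like(y, 30)))   # mean-only guess (mean of y is 30): 1450
print(sse(10 + 0.5 * x))          # ŷ = 10 + 0.5x: 862.5
print(sse(12 + 0.4 * x))          # ŷ = 12 + 0.4x: 722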
REGRESSION IN PYTHON

Two Python libraries are commonly used to fit regression models: statsmodels & scikit-learn

statsmodels:
• Ideal if your goal is inference
• Similar output to other tools (SAS, R, Excel)
• Easy access to dozens of statistical tests
• Harder to leverage in production ML

scikit-learn:
• Ideal if your goal is prediction
• Most popular ML library in Python
• Has various models for easy comparison
• Designed to be deployed to production

Both libraries use the same math and return the same regression equation!
We will begin by focusing on statsmodels, but once we have the fundamentals
of regression down, we'll introduce scikit-learn


REGRESSION IN STATSMODELS

You can fit a regression in statsmodels with just a few lines of code (see the sketch below):

1) Import statsmodels.api (standard alias is sm)
2) Create an "X" DataFrame with your feature(s) and add a constant
3) Create a "y" DataFrame with your target
4) Call sm.OLS(y, X) to set up the model, then use .fit() to build the model
5) Call .summary() on the model to review the model output

Why do we need to add a constant?
• Statsmodels assumes you want to fit a model with a line that runs through the origin (0, 0)
• sm.add_constant() lets statsmodels calculate a y-intercept other than 0 for the model
• Most regression software (like sklearn) takes care of this step behind the scenes
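A minimal sketch of the five steps, assuming the seaborn example diamonds dataset with
"carat" as the feature and "price" as the target (the dataset choice is illustrative;
the course uses its own files):

import seaborn as sns
import statsmodels.api as sm

df = sns.load_dataset("diamonds")

# 2) Feature matrix with a constant column for the intercept
X = sm.add_constant(df[["carat"]])

# 3) Target
y = df["price"]

# 4) Set up and fit the OLS model
model = sm.OLS(y, X).fit()

# 5) Review coefficients, R-squared, F-statistic, p-values, etc.
print(model.summary())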


REGRESSION IN STATSMODELS

The .summary() output contains three blocks:
• Model summary statistics
• Variable summary statistics
• Residual (error) statistics

The model output can be intimidating the first time you see it, but we'll cover the
important pieces in the next few lessons and later sections!


INTERPRETING THE MODEL

To interpret the model, use the "coef" column in the variable summary statistics:

  ŷ = -2256 + 7756x

How do we interpret this?
• An increase of 1 carat in a diamond is associated with a $7,756 increase in its price
• We cannot say 1 carat causes a $7,756 increase in price without a more rigorous experiment
• Technically, a 0-carat diamond is predicted to cost -$2,256


MAKING PREDICTIONS

The .predict() method returns model predictions for single points or DataFrames
• The leading 1 in the input adds the constant (intercept) term

  -2256 + 7756(1.5) = 9378

A 1.5 carat diamond has a predicted price of $9,378
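A short sketch of prediction, continuing from the fitting sketch above (the 1 in the
single-point input multiplies the constant term):

import pandas as pd
import statsmodels.api as sm

# Predict a single point: [constant, carat]
print(model.predict([1, 1.5]))  # ≈ 9378 for the diamonds model

# Predict for a DataFrame of new diamonds
new_X = pd.DataFrame({"carat": [0.5, 1.0, 1.5]})
new_X = sm.add_constant(new_X)  # add the constant column, matching training
print(model.predict(new_X))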




R-SQUARED

R-squared, or the coefficient of determination, measures how much better the model is
at predicting the target than using its mean (our best guess without using features)
• R-squared values are bounded between 0 and 1 on training data

For the diamonds model, R² = 0.849: the model explains 84.9% of the variation in
price around its mean

In simple linear regression, squaring the correlation between the feature and target
yields R² (this doesn't hold in multiple linear regression)


R-SQUARED

R² compares the model's squared error to the total variation of the target around its mean:

  R² = 1 - SSE/SST

• SSE, the sum of squared error, is the distance between the actual values and the line;
  reducing error increases R², and a perfect model has an error of 0, or an R² of 1
• SST, the sum of squared total, is the variation of "y" around its mean; if R² = 0,
  the model is no better than the mean of y

A "good" value for R² is relative to your data: an R² of .05 might be industry leading in
sports analytics, but if you were trying to prove a physics theory with an experiment, an
R² of .95 might be a disappointment!
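A small sketch computing R² from its definition and checking it against statsmodels,
continuing from the fitted diamonds model above:

import numpy as np

y_pred = model.predict(X)            # fitted values on the training data
sse = np.sum((y - y_pred) ** 2)      # sum of squared error
sst = np.sum((y - y.mean()) ** 2)    # sum of squared total

print(1 - sse / sst)     # manual R-squared
print(model.rsquared)    # statsmodels' R-squared (should match, ~0.849)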


HYPOTHESIS TEST

Regression models include several hypothesis tests, including the F-test, which
indicates whether our model is significantly better at predicting our target than
using the mean of the target as the model
• In other words, you're trying to find significant evidence that your model isn't useless

Steps for the hypothesis test:
1) State the null and alternative hypotheses
2) Set a significance level (α)
3) Calculate the test statistic and p-value
4) Draw a conclusion from the test
   a) If p ≤ α, reject the null hypothesis (you're confident the model isn't useless)
   b) If p > α, don't reject it (the model is probably useless, and needs more training)


HYPOTHESES & SIGNIFICANCE LEVEL

1) For F-tests, the null & alternative hypotheses are always the same:
• H₀: F = 0, the model is NOT significantly better than the mean (naïve guess)
• Hₐ: F ≠ 0, the model is significantly better than the mean (naïve guess)

The hope is to reject the null hypothesis (and therefore, accept the alternative)

2) The significance level is the threshold you set to determine when the evidence
against your null hypothesis is considered "strong enough" to prove it wrong
• This is set by alpha (α), which is the accepted probability of error
• The industry standard is α = .05 (this is what we'll use in the course)

Some teams and industries set a much stricter bar, such as .01 or even .001,
making the null hypothesis less likely to be rejected


F-STATISTIC & P-VALUE

3) The F-statistic and associated p-value are part of the model summary and help
you understand the predictive power of the regression model as a whole
• The F-statistic is the ratio of the variability the model explains vs the variability it doesn't
• The p-value, or F-significance, is the probability of seeing an F-statistic this large
  if the model were truly no better than the mean

The F-statistic is primarily used as a stepping-stone to calculate the p-value, which
is easier to interpret and more commonly used in model diagnostics
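The fitted statsmodels results expose these values directly; a tiny sketch, continuing
from the model above:

# F-statistic and its p-value from the fitted OLS results
print(model.fvalue)    # F-statistic
print(model.f_pvalue)  # p-value for the F-test

# Per-coefficient t-statistics and p-values (used in the next lessons)
print(model.tvalues)
print(model.pvalues)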


HYPOTHESIS TEST CONCLUSION

4) Comparing the p-value and alpha lets us draw a conclusion from the test
• p ≤ α: reject the null hypothesis (you're confident the model isn't useless)
• p > α: don't reject it (the model is probably useless, and needs more training)

For the diamonds model, the F-statistic is much greater than 0, and the p-value is
less than 0.05, so we can reject the null hypothesis
(carat is a good predictor of a diamond's price!)


T-STATISTICS & P-VALUES

The t-statistics and associated p-values are part of the model summary and help
you understand the predictive power of individual model coefficients
• This is essentially another hypothesis test, designed to find which coefficients are useful

Since the p-value for carat is lower than our alpha of 0.05, we can conclude that
carat is a good predictor of price
(the constant also has a p-value lower than 0.05, but we can generally ignore
insignificant p-values for the intercept term)

This will become more relevant when performing variable selection in multiple
linear regression models (up next!)


RESIDUAL PLOTS

Residual plots show how well a model performs across the range of predictions
• Ideally, residuals should be normally distributed around 0
• Observations on the 0 line fall exactly on the regression model (0 error)

For the ŷ = 10 + 0.5x example, plotting each residual (ε) against its prediction (ŷ):

  x    y    ŷ      ε      ε²
  10   10   15     -5     25
  20   25   20     5      25
  30   20   25     -5     25
  35   30   27.5   2.5    6.25
  40   40   30     10     100
  50   15   35     -20    400
  60   40   40     0      0
  65   30   42.5   -12.5  156.25
  70   50   45     5      25
  80   40   50     -10    100

For example, the point (30, 10) on the residual plot comes from the row with ŷ = 30, ε = 10


RESIDUAL PLOTS

Residual plots show how well a model performs across the range of predictions
• Ideally, residual plots should show errors to be normally distributed around 0
• model.resid returns a series with the residuals (actual value - predicted value)

For the diamonds model (see the sketch below):
• It looks like the model underpredicts in part of the price range
• It looks like the model overpredicts as price increases
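A minimal sketch of a residual plot with statsmodels and seaborn, continuing from the
fitted model above:

import seaborn as sns
import matplotlib.pyplot as plt

# Residuals vs predictions: look for a symmetric band around 0
sns.scatterplot(x=model.predict(X), y=model.resid)
plt.axhline(0, color="red")  # the 0-error reference line
plt.xlabel("Prediction (ŷ)")
plt.ylabel("Residual (ε)")
plt.show()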


CASE STUDY: HEALTH INSURANCE

THE SITUATION:
You've just been hired as a Data Science Intern for Maven National Insurance, a
large private health insurance provider in the United States

THE ASSIGNMENT:
The company is looking at updating their insurance pricing model and wants you to
start a new one from scratch using only a handful of variables. If you're successful,
they can reduce the complexity of their model while maintaining its accuracy
(the data you've been provided has already been QA'd and cleaned)

THE OBJECTIVES:
1. Identify the strongest predictor of insurance prices using correlation
2. Build a simple linear regression model using this feature
3. Predict insurance prices for new customers using the model


ASSIGNMENT: SIMPLE LINEAR REGRESSION

NEW MESSAGE (June 30, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Price predictions

Hi there!

Looks like a few variables are promising!

Can you build a linear regression model with "price" as the target and the most
correlated variable as the feature? I'd also like to see a few predictions for
common values of the feature that you selected.

Thanks!

Key Objectives:
1. Build a simple linear regression model with "price" as the target and the most
   correlated variable as the feature
2. Interpret the model summary
3. Visualize the model residuals
4. Predict new "price" values with the model

02_simple_regression_assignments.ipynb


KEY TAKEAWAYS

A simple linear regression model is the line that best fits a scatterplot
• The line can be described using an equation with a slope and y-intercept, plus an error term
• The least squared error method is used to find the line of best fit

Python uses the statsmodels & scikit-learn libraries to fit regression models
• Statsmodels is ideal if your goal is inference, while scikit-learn is optimal for prediction workflows

Linear regression models can be used to predict new data
• These predictions can be used to assess if our model performance holds on new data and help make decisions

The model summary contains metrics used to evaluate the model
• The model's R² value and the F-test's p-value help determine if the regression model is useful


MULTIPLE LINEAR REGRESSION



MULTIPLE LINEAR REGRESSION

In this section we'll build multiple linear regression models using more than one feature,
evaluate the model fit, perform variable selection, and compare models using error metrics

TOPICS WE'LL COVER:
• Multiple Regression
• Variable Selection
• Mean Error Metrics

GOALS FOR THIS SECTION:
• Learn how to fit and interpret multiple linear regression models in Python
• Walk through methods of performing variable selection, ensuring model variables add value
• Compare the predictive accuracy across models using mean error metrics like MAE and RMSE


MULTIPLE LINEAR REGRESSION MODEL

Multiple linear regression models use multiple features to predict the target
• In other words, it's the same linear regression model, but with additional "x" variables

SIMPLE LINEAR REGRESSION MODEL:

  y = β₀ + β₁x + ε

MULTIPLE LINEAR REGRESSION MODEL:

  y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ... + βₖxₖ + ε

Instead of just one "x", we have a whole set of features (and associated coefficients)
to help predict the target (y)


MULTIPLE LINEAR REGRESSION MODEL

To visualize how a multiple linear regression model works with 2 features,
imagine fitting a plane (instead of a line) through a 3D scatterplot

Multiple regression can scale well beyond 2 variables, but this is where visual
analysis breaks down (and one reason why we need machine learning!)


FITTING A MULTIPLE REGRESSION

To fit a multiple regression, include the desired features in the "X" DataFrame
(see the sketch below)

For the diamonds data, "carat" and "x" are the most correlated features with "price":
• R² increased from 0.849 (the simple regression value)
• The F-test's p-value is < 0.05, so the model isn't useless
• Each coefficient's p-value is < 0.05, so all variables are useful
• However, "x" has a negative coefficient but a positive correlation with price
  (this is a concern we'll cover shortly)
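A minimal sketch of fitting a two-feature model with statsmodels, assuming the
diamonds data ("x" is the diamond's length column):

import seaborn as sns
import statsmodels.api as sm

df = sns.load_dataset("diamonds")

# Two features this time: carat and length (x)
X = sm.add_constant(df[["carat", "x"]])
y = df["price"]

model = sm.OLS(y, X).fit()
print(model.summary())  # compare R-squared against the single-feature model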


INTERPRETING MULTIPLE REGRESSION

Interpreting multiple regression is like simple regression, with a slight twist:

  ŷ = 1738 + 10130(carat) - 1027(x)

How do we interpret this?
• Technically, a 0-carat diamond with 0 length (x) is predicted to cost $1,738
• An increase in carat by 1 corresponds to a $10,130 increase in price, holding x constant
• An increase in x by 1 corresponds to a $1,027 decrease in price, holding carat constant


VARIABLE SELECTION

Variable selection is a critical part of the modeling process that helps reduce
complexity by only including features that help meet the model's goal
• Do new variables improve model accuracy? (critical for prediction)
• Is each variable statistically significant? (critical for inference)
• Do new variables make the model more challenging to interpret or explain to stakeholders?

EXAMPLE: Using all the possible features to predict diamond price
• R² improved to .859 compared to the simple regression model (.849); this helps R²
  but makes the model harder to explain, so if the goal is inference, a single
  variable model may be a good choice!
• "x" still has a negative coefficient
• "z" has a p-value greater than alpha (0.347 > 0.05), so we should drop this
  variable from the model


ADJUSTED R-SQUARED

A criticism of R² is that it will never decrease as new variables are added
• Adjusted R² corrects this by penalizing new variables added to a model
• Unlike R², the adjusted value has no direct interpretation, but it's useful as a
  variable selection tool

EXAMPLE: Adding a random column to a sample of 100 diamonds
• R² increased, but adjusted R² didn't
• In this case, the p-value could also tell us to remove this variable
  (other times, a variable can be significant and still lower adjusted R²)
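A small sketch of this experiment, assuming the diamonds data (the random column
and sample size are illustrative):

import numpy as np
import seaborn as sns
import statsmodels.api as sm

df = sns.load_dataset("diamonds").sample(100, random_state=42)
df["random"] = np.random.default_rng(42).normal(size=len(df))  # pure noise feature
y = df["price"]

# Model without, then with, the noise column
for cols in (["carat"], ["carat", "random"]):
    X = sm.add_constant(df[cols])
    model = sm.OLS(y, X).fit()
    # Training R-squared never decreases when adding a column; adjusted R-squared can
    print(cols, round(model.rsquared, 4), round(model.rsquared_adj, 4))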


ASSIGNMENT: MULTIPLE LINEAR REGRESSION

NEW MESSAGE (July 2, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Multiple Regression Model

Hi there!

Let's include more features in the model.

Can you start by adding all of the PC component variables: ram, screen, hd, and speed?

Then take a look at including trend, which accounts for cost decreases over time, and
ads, which measures marketing spend on each model.

Thanks!

Key Objectives:
1. Build and interpret two multiple linear regression models
2. Evaluate model fit and coefficient values for both models

03_multiple_regression_assignments.ipynb


MEAN ERROR METRICS

Mean error metrics measure how well your regression model fits in the units of the
target, as opposed to how well it explains variance (like R-squared)
• The most common are Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)
• They are used to compare model fit across models (the lower the better!)

• MAE: average of the absolute distance between actual & predicted values
  MAE = (1/n) Σ|yᵢ - ŷᵢ|

• MSE: average of the squared distance between actual & predicted values
  MSE = (1/n) Σ(yᵢ - ŷᵢ)²

• RMSE: square root of Mean Squared Error, to return to the target's units (like MAE)
  RMSE = √MSE

RMSE is more sensitive to large outliers, so it is preferred over MAE in situations
where they are particularly undesirable


MEAN ERROR METRICS
You can use sklearn.metrics to calculate MAE and MSE for your model
• sklearn.metrics.mean_absolute_error(y_actual, y_predicted)
• sklearn.metrics.mean_squared_error(y_actual, y_predicted)

The simple linear regression model has an average prediction error of ~$1,000

The outliers in the dataset are making RMSE around 50% larger than MAE
(taking the square root of MSE returns RMSE instead of MSE)

The multiple linear regression model performs better across both metrics

RMSE will always be at least as large as MAE
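A minimal sketch of these calculations (assuming y and preds hold the actuals and the predictions from a fitted model):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y, preds)
rmse = np.sqrt(mean_squared_error(y, preds))  # square root returns RMSE in the target's units

print(f"MAE: {mae:,.0f}  RMSE: {rmse:,.0f}")
```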

*Copyright Maven Analytics, LLC


ASSIGNMENT: MEAN ERROR METRICS

Key Objectives
1. Calculate MAE and RMSE for fitted regression models

NEW MESSAGE (July 3, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Error Metrics

Hi there!

Thanks for building those models, they’re definitely showing some potential!

Can you calculate MAE and RMSE for the model with all
variables and our simple regression model that just included
RAM?

This will help me communicate the improvement in fit better
to our stakeholders.

Thanks!

03_multiple_regression_assignments.ipynb

*Copyright Maven Analytics, LLC


KEY TAKEAWAYS

Multiple linear regression models use multiple features to predict the target
• Each new feature comes with an associated coefficient that forms part of the regression equation

Variable selection methods help you identify valuable features for the model
• A p-value greater than alpha (0.05) indicates that a coefficient isn’t significantly different from 0
• Contrary to R2, the adjusted R2 metric penalizes new variables added to a model

Mean error metrics let you compare predictive accuracy across models
• Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) quantify a model’s inaccuracy in the target’s units
• RMSE is more sensitive to large errors, so it is preferred in situations where large errors are undesirable

*Copyright Maven Analytics, LLC


MODEL ASSUMPTIONS

*Copyright Maven Analytics, LLC


MODEL ASSUMPTIONS

In this section we’ll cover the assumptions of linear regression models which should be
checked and met to ensure that the model’s predictions and interpretation are valid

TOPICS WE’LL COVER:
• Assumptions Overview • Linearity • Independence of Errors • Normality of Errors
• No Perfect Multicollinearity • Equal Variance of Errors • Outliers & Influence

GOALS FOR THIS SECTION:
• Review the assumptions of linear regression models
• Learn to diagnose and fix violations to each assumption using Python
• Assess the influence of outliers on a regression model, and learn methods for dealing with them

*Copyright Maven Analytics, LLC


ASSUMPTIONS OF LINEAR REGRESSION

There are a few key assumptions of linear regression models that can be violated,
leading to unreliable predictions and interpretations
• If the goal is inference, these all need to be checked rigorously
• If the goal is prediction, some of these can be relaxed

1. Linearity: a linear relationship exists between the target and features
2. Independence of errors: the residuals are not correlated
3. Normality of errors: the residuals are approximately normally distributed
4. No perfect multicollinearity: the features aren’t perfectly correlated with each other
5. Equal variance of errors: the spread of residuals is consistent across predictions

You can use the L.I.N.N.E acronym (like linear regression) to remember them
It’s worth noting that you might see resources saying there are anywhere from
3 to 6 assumptions, but these 5 are the ones to focus on

*Copyright Maven Analytics, LLC


LINEARITY

Linearity assumes there’s a linear relationship between the target and each feature
• If this assumption is violated, it means the model isn’t capturing the underlying relationship
between the variables, which could lead to inaccurate predictions

You can diagnose linearity by using scatterplots and residual plots:

[Figures: an ideal scatterplot and an ideal residual plot]

*Copyright Maven Analytics, LLC


LINEARITY

You can fix linearity issues by transforming features with non-linear relationships
• Common transformations include polynomial terms (x², x³, etc.) and log transformations (log(x))

[Figures: model sample of carat vs price, and carat² vs price]

The carat vs price scatterplot has a U-like shape, which is similar to the y = x² plot,
so let’s try squaring the “carat” feature. The “carat_sq” feature has a much more
linear relationship with the “price” target variable.

*Copyright Maven Analytics, LLC


LINEARITY

You can fix linearity issues by transforming features with non-linear relationships
• Common transformations include polynomial terms (x², x³, etc.) and log transformations (log(x))

After adding the squared term, R2 increased from 0.859

When adding polynomial terms, you need to include the lower order terms in the
model, regardless of significance. We can drop “y” (p > 0.05).

*Copyright Maven Analytics, LLC


INDEPENDENCE OF ERRORS

Independence of errors assumes that the residuals in your model have no patterns
or relationships between them (they aren’t autocorrelated)
• In other words, it checks that you haven’t fit a linear model to time series data

You can diagnose independence with the Durbin-Watson Test:
• Ho: DW = 2 – the errors are NOT autocorrelated
• Ha: DW ≠ 2 – the errors are autocorrelated
• As a rule of thumb, values between 1.5 and 2.5 are accepted

You can fix independence issues by using a time series model (more later!)
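A minimal sketch of the Durbin-Watson check (assuming model is a fitted statsmodels OLS result; the statistic also appears in the model summary):

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)  # residuals from the fitted model
print(dw)  # values between ~1.5 and 2.5 are generally acceptable
```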

*Copyright Maven Analytics, LLC


INDEPENDENCE OF ERRORS

IMPORTANT: If your data is sorted by your target, or potentially an important
predictor variable, it can cause this assumption to be violated
• Use df.sample(frac=1) to fix this by randomly shuffling your dataset rows

[Figures: model sorted by price vs model with randomized rows]

Sorting the DataFrame by the target (price) leads to a Durbin-Watson statistic
outside the desired range; randomizing the row order brings things back to normal

*Copyright Maven Analytics, LLC


NORMALITY OF ERRORS

Normality of errors assumes the residuals are approximately normally distributed

You can diagnose normality by using a QQ plot (quantile-quantile plot):

[Figure: ideal QQ plot]
• The red line represents a normal distribution
• The x-axis represents the number of standard deviations away from the mean that residuals lie
• It’s normal for a few points to fall off the line, particularly outside of 2 standard deviations

The Jarque-Bera Test (JB) in the model summary also tests the normality of errors;
however, this test has numerous issues and is far too restrictive to use in practice
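A minimal sketch of building a QQ plot with statsmodels (assuming model is a fitted OLS result):

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# standardize the residuals and compare them to a 45-degree normality line
sm.qqplot(model.resid, line="45", fit=True)
plt.show()
```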

*Copyright Maven Analytics, LLC


NORMALITY OF ERRORS

You can typically fix normality issues by applying a log transform on the target
• Other options are applying log transforms to features or simply leaving the data as is

[Figures: QQ plots for the price model and the log of price model]

In the price model, the line starts to curve between -3 and 3 standard deviations,
which indicates moderate to severe non-normality. The log of price model looks
better! There are still large residuals left, but they don’t violate the normality
assumption (check later).

You generally want to see points fall along the line in between -2 and +2 standard deviations

*Copyright Maven Analytics, LLC


PRO TIP: INTERPRETING TRANSFORMED TARGETS

Transforming your target fundamentally changes the interpretation of your model,
so you need to invert the transformation to understand coefficients & predictions
• Inverting a feature coefficient returns the associated multiplicative change in the target’s value
• Inverting a prediction returns the target in its original units

EXAMPLE A 1-unit increase in carat is associated with a 10.6X increase in price
(e^2.3598 ≈ 10.59), and a diamond with these features is predicted to cost $9,458

Transformation        Inverse
y = np.sqrt(x)        x = y**2
y = np.log(x)         x = np.exp(y)
y = np.log10(x)       x = 10 ** y
y = 1/x               x = 1/y
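A minimal sketch of inverting a log-transformed target (assuming the model was fit on np.log(price) and X_new holds the features of a new diamond; the names are illustrative):

```python
import numpy as np

# invert a coefficient: the multiplicative change in price per 1-unit increase in carat
carat_multiplier = np.exp(model.params["carat"])  # e.g. e**2.3598 ≈ 10.59

# invert a prediction: back to the target's original units (dollars)
price_pred = np.exp(model.predict(X_new))
```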

*Copyright Maven Analytics, LLC


NO PERFECT MULTICOLLINEARITY

No perfect multicollinearity assumes that features aren’t perfectly correlated
with each other, as that would lead to unreliable and illogical model coefficients
• If two features have a correlation (r) of 1, there are infinite ways to minimize squared error
• Even if it’s not perfect, strong multicollinearity (r > 0.7) can still cause issues for a model

You can diagnose multicollinearity with the Variance Inflation Factor (VIF):
• Each feature is treated as the target, and R2 measures how well the other features predict it
• As a rule of thumb, a VIF > 5 indicates that a variable is causing multicollinearity problems

We can ignore the VIF for the intercept, but most of our variables have a VIF > 5
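A minimal sketch of the VIF calculation (assuming X is a feature DataFrame that already includes a constant column):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# one VIF per column: how well the other features predict this one
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)  # a VIF above 5 flags a feature causing multicollinearity problems
```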

*Copyright Maven Analytics, LLC


NO PERFECT MULTICOLLINEARITY

There are several ways to fix multicollinearity issues:
• The most common is to drop features with a VIF > 5 (leave at least 1 to see the impact)
• Another is to engineer combined features (like “GDP” and “Population” to “GDP per capita”)
• You can also turn to regularized regression or tree-based models (more later!)

After removing x from the original model, we still have VIF > 5 for our carat terms,
but we can generally ignore those since they are polynomial terms

The VIF can also be ignored when using dummy variables (more later!)

*Copyright Maven Analytics, LLC


EQUAL VARIANCE OF ERRORS

Equal variance of errors assumes the residuals are consistent across predictions
• In other words, average error should stay roughly the same across the range of the target
• Equal variance is known as homoskedasticity, and non-equal variance is heteroskedasticity

You can diagnose heteroskedasticity with residual plots:

[Figures: heteroskedasticity in the original regression model vs homoskedasticity
after fixing the violated assumptions; our model predicts much better now!]

*Copyright Maven Analytics, LLC


EQUAL VARIANCE OF ERRORS

You can typically fix heteroskedasticity by applying a log transform on the target

[Figures: residual plots for the price model and the log of price model]

In the price model, errors have a cone shape along the x-axis; in the log of price
model, errors are spread evenly along the x-axis

*Copyright Maven Analytics, LLC


OUTLIERS & INFLUENCE

Outliers are extreme data points that fall well outside the usual pattern of data
• Some outliers have dramatic influence on model fit, while others won’t
• Outliers that impact a regression equation significantly are called influential points
• An outlier’s influence depends on the size of the dataset (large datasets are impacted less)

[Figures: three scatterplots]
• No outliers: slope = 0.34
• Non-influential outlier: slope = 0.34; this point is an outlier in both profit and
units sold, but it’s in line with the rest of the data, so it’s not influential
• Influential outlier: slope = 0.09; this point is an outlier in terms of profit that
doesn’t follow the same pattern as the rest of the data, so it changes the regression line

*Copyright Maven Analytics, LLC


OUTLIERS & INFLUENCE

Cook’s Distance measures the influence a data point has on the regression line
• It works by fitting a regression without the point and calculating the impact
• Cook’s D > 1 is considered a significant problem, while Cook’s D > 0.5 is worth investigating
• Use .get_influence().summary_frame() on a regression model to return influence statistics
(the Cook’s D column is the one to focus on; the other metrics can be ignored)

Not surprisingly, the most influential points were the largest diamonds!
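A minimal sketch (assuming model is a fitted statsmodels OLS result; “cooks_d” is the column to focus on):

```python
# influence statistics for every row used to fit the model
influence = model.get_influence().summary_frame()

# inspect the most influential points first
print(influence.sort_values("cooks_d", ascending=False).head())
```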

*Copyright Maven Analytics, LLC


OUTLIERS & INFLUENCE

There are several ways to deal with influential points:
• You can remove them or leave them in
• You can engineer features to help capture their variance
• You can use robust regression (outside the scope of this course)

[Figures: model with outliers vs model without the 2 largest outliers]

After removing the 2 outliers, R2 improved slightly and the coefficients changed a
bit, but not much changed overall (this is a large dataset)

*Copyright Maven Analytics, LLC


RECAP: ASSUMPTIONS OF LINEAR REGRESSION
Linearity
• Problem to solve: the feature-target relationship is not linear
• Effect on model: suboptimal model accuracy, non-normal residuals
• Diagnosis: curved patterns in feature-target scatterplots
• Fix: add polynomial terms
• Drawbacks to fix: slightly less intuitive coefficients
• Should I fix it? The more curved the relationship, the worse it is. It’s usually worth fixing as it can improve accuracy.

Independence of Errors
• Problem to solve: residuals aren’t independent of each other
• Effect on model: suboptimal model structure
• Diagnosis: Durbin-Watson stat not between 1.5–2.5
• Fix: use time-series modeling techniques
• Drawbacks to fix: none
• Should I fix it? Yes

Normality of Errors
• Problem to solve: residuals aren’t normally distributed around 0
• Effect on model: inaccurate predictions & less reliable coefficient estimates
• Diagnosis: data significantly deviates from the line of normality in the QQ plot inside of +/- 2 standard deviations
• Fix: transform the target with log or other transforms; add features to the model if available
• Drawbacks to fix: less interpretability
• Should I fix it? If it’s due to a non-linear pattern or missing variable, or if the fix significantly improves accuracy (unless you need a simple model)

No Perfect Multicollinearity
• Problem to solve: features with perfect, or near perfect, correlation with each other
• Effect on model: unstable & illogical model coefficients
• Diagnosis: Variance Inflation Factor greater than 5
• Fix: drop features or use regularized regression
• Drawbacks to fix: may lose some training accuracy
• Should I fix it? For perfect or near perfect collinearity, yes. For features closer to r = 0.7, gains seen in training usually offset accuracy.

Equal Variance of Errors
• Problem to solve: residuals are not consistent across the full range of predicted values
• Effect on model: unreliable confidence intervals for coefficients & predictions
• Diagnosis: cone-shaped residual plots
• Fix: transform the target
• Drawbacks to fix: less interpretability
• Should I fix it? This is generally ok to leave as is if your goal is prediction

Outliers & Influence
• Problem to solve: influential data points
• Effect on model: lower accuracy & possible violated assumptions
• Diagnosis: Cook’s D greater than 1 (highly influential) or 0.5 (potentially problematic)
• Fix: remove data points or engineer features to explain outliers
• Drawbacks to fix: removing data points is not ideal if we need to predict them
• Should I fix it? If we expect the data to have similar points, removing them won’t change model performance (consider a model with and without outliers)

*Copyright Maven Analytics, LLC


ASSIGNMENT: MODEL ASSUMPTIONS

Key Objectives
1. Check for violated assumptions on a fitted regression model
2. Implement fixes for assumptions
3. Decide whether the fixes are worth the trade-offs and settle on a final model

NEW MESSAGE (July 10, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Model Assumptions

Hi there!

Can you check the model assumptions for our computer price regression model?

Make any changes necessary to improve model fit!

Don’t force it though, use your best judgement and remember
that not all assumptions will be violated and there are some
trade-offs for fixes!

Thanks!

04_assumptions_assignments.ipynb

*Copyright Maven Analytics, LLC


KEY TAKEAWAYS

Linear regression models have 5 key assumptions that must be checked


• Linearity, independence of errors, normality of errors, no perfect multicollinearity, and equal variance of errors
• If the goal is inference, they need to be checked rigorously, but for prediction some of them can be relaxed

Diagnosing & fixing these assumptions can help improve model accuracy
• Use residual plots, QQ plots, the Durbin-Watson test, and the Variance Inflation Factor to diagnose
• Transforming the features and/or target (polynomial terms, log transforms, etc.) can typically help fix issues

Outliers that significantly impact a model are known as influential points


• Use Cook’s distance to identify influential points (Cook’s D > 1)
• Before removing influential points, consider if you expect to encounter similar data points in new data

*Copyright Maven Analytics, LLC


MODEL TESTING & VALIDATION

*Copyright Maven Analytics, LLC


MODEL TESTING & VALIDATION

In this section we’ll cover model testing & validation, which is a crucial step in the
modeling process designed to ensure that a model performs well on new, unseen data

TOPICS WE’LL COVER:
• Model Scoring Steps • Data Splitting • Overfitting • Bias-Variance Tradeoff
• Validation • Cross Validation

GOALS FOR THIS SECTION:
• Understand the concept of data splitting and its impact on model fit, bias, and variance
• Split data in Python into training and test data sets, and then into training and validation data sets
• Learn the concept of cross validation, and its advantages & disadvantages over simple validation
• Evaluate models by tuning on training & validation data, refitting on both, then scoring on test data

*Copyright Maven Analytics, LLC


RECAP: REGRESSION MODELING WORKFLOW

[Workflow diagram: 1. Scoping a project → 2. Gathering data → 3. Cleaning data →
4. Exploring data → 5. Modeling data → 6. Sharing insights]

The modeling phase breaks down into four steps:
• Preparing for Modeling: get your data ready to be input into an ML algorithm (single table, non-null)
• Applying Algorithms: build regression models from training data (linear regression)
• Model Evaluation: evaluate model fit on training & validation data (R-squared & MAE, checking assumptions)
• Model Selection: pick the best model to deploy and identify insights

This is what we’ve covered so far

*Copyright Maven Analytics, LLC


RECAP: REGRESSION MODELING WORKFLOW

[Workflow diagram: 1. Scoping a project → 2. Gathering data → 3. Cleaning data →
4. Exploring data → 5. Modeling data → 6. Sharing insights]

• Preparing for Modeling: single table, non-null, data splitting
• Applying Algorithms: linear regression
• Model Evaluation: R-squared & MAE, checking assumptions, validation performance
• Model Selection: test performance

Data splitting needs to be the first step in the process, so we can evaluate and
select models properly

*Copyright Maven Analytics, LLC


MODEL SCORING STEPS

These model scoring steps need to be considered before you start modeling:

1. Split the data into a “training” and “test” set
• Around 80% of the data is used for training and 20% is reserved for testing

2. Select a validation method
• Either split the training set into “training” and “validation” or use cross validation

3. Tune the model using the training data
• Gradually improve the model by fitting on the training set and scoring on the validation set

4. Score the model using the test data
• Once you’ve settled on a model, combine the training & validation data sets and refit the
model, then score it on the test data to evaluate model performance

*Copyright Maven Analytics, LLC


DATA SPLITTING

Data splitting involves separating a data set into “training” and “test” sets
• Training data, or in-sample data, is used to fit and tune a model
• Test data, or out-of-sample data, provides realistic estimates of accuracy on new data

[Figure: a data set split into training data (80%) and test data (20%)]

*Copyright Maven Analytics, LLC


DATA SPLITTING

Data splitting involves separating a data set into “training” and “test” sets
• Training data, or in-sample data, is used to fit and tune a model
• Test data, or out-of-sample data, provides realistic estimates of accuracy on new data
• In practice, the rows are sampled randomly

80/20 is the most common ratio for train/test data, but anywhere from 70-90% can
be used for training; for smaller data sets, a higher ratio of test data is needed to
ensure a representative sample

*Copyright Maven Analytics, LLC


DATA SPLITTING

You can split data in Python using scikit-learn’s train_test_split function
• train_test_split(feature_df, target_column, test_pct, random_state)

4 inputs:
• All feature columns
• The target column
• The percentage of data for the test set
• A random state (or a different split is created each time)

4 outputs:
• Training set for features (X)
• Test set for features (X_test)
• Training set for target (y)
• Test set for target (y_test)

A perfect 80/20 split!
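A minimal sketch (assuming a diamonds DataFrame with a “price” target; the names are illustrative):

```python
from sklearn.model_selection import train_test_split

X, X_test, y, y_test = train_test_split(
    diamonds.drop(columns="price"),  # all feature columns
    diamonds["price"],               # the target column
    test_size=0.2,                   # reserve 20% of rows for the test set
    random_state=42,                 # fixed seed so the split is reproducible
)
```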

*Copyright Maven Analytics, LLC


OVERFITTING & UNDERFITTING

Data splitting is primarily used to avoid overfitting, which is when a model predicts
known (training) data very well but unknown (test) data poorly
• Overfitting is like memorizing the answers to a test instead of learning the material; you’ll ace
the test, but lack the ability to generalize and apply your knowledge to unfamiliar questions

[Figures: three model fits]
• OVERFIT model: models the training data too well; doesn’t generalize well to test data
• WELL-FIT model: models the training data just right; generalizes well to test data
• UNDERFIT model: doesn’t model training data well enough; doesn’t score well on test data either

*Copyright Maven Analytics, LLC


OVERFITTING & UNDERFITTING

You can diagnose overfit & underfit models by comparing evaluation metrics like
R2, MAE, and RMSE between the train and test data sets
• Large gaps between train and test scores indicate that a model is overfitting the data
• Poor results across both train and test scores indicate that a model is underfitting the data

You can fix overfit models by:
• Simplifying the model
• Removing features
• Regularization (more later!)

You can fix underfit models by:
• Making the model more complex
• Adding new features
• Feature engineering (more later!)

Models will usually have lower performance on test data compared to the performance
on training data, so small gaps are expected – remember that no model is perfect!

*Copyright Maven Analytics, LLC


THE BIAS-VARIANCE TRADEOFF

When splitting data for regression, there are two types of errors that can occur:
• Bias: How much the model fails to capture the relationships in the training data
• Variance: How much the model fails to generalize to the test data

[Figure: three model fits on the training data]
• OVERFIT model: low bias (no error)
• WELL-FIT model: medium bias (some error)
• UNDERFIT model: high bias (lots of error)

*Copyright Maven Analytics, LLC


THE BIAS-VARIANCE TRADEOFF

When splitting data for regression, there are two types of errors that can occur:
• Bias: How much the model fails to capture the relationships in the training data
• Variance: How much the model fails to generalize to the test data

[Figure: the same three model fits on the test data]
• OVERFIT model: low bias (no error), high variance (much more error)
• WELL-FIT model: medium bias (some error), medium variance (a bit more error)
• UNDERFIT model: high bias (lots of error), low variance (same error)

This is the bias-variance tradeoff!


*Copyright Maven Analytics, LLC
THE BIAS-VARIANCE TRADEOFF

The bias-variance tradeoff aims to find a balance between the two types of errors
• It’s rare for a model to have low bias and low variance, so finding a “sweet spot” is key
• This is something that you can monitor during the model tuning process

[Figure: prediction error vs model complexity, with test error and training error
curves and the “sweet spot” where bias and variance are balanced]

High bias models:
• Fail to capture trends in the data
• Are underfit
• Are too simple
• Have a high error rate on both train & test data

High variance models:
• Capture noise from the training data
• Are overfit
• Are too complex
• Have a large gap between training & test error

*Copyright Maven Analytics, LLC


VALIDATION DATA

Validation data is a subset of the training data set that is used to assess model fit
and provide feedback to the modeling process
• This is an extension of a simple train-test split of the data

Train-Test Split: Training (80%) | Test (20%)
With a simple train/test split, you cannot use test data to optimize your models

Train-Validation-Test Split: Training (60%) | Validation (20%) | Test (20%)
With separate data sets for validation and testing, the validation data can and should be used to:
• Check for overfitting
• Fine tune model parameters
• Verify the impact of specific features on out-of-sample data
• Verify the impact of outliers on out-of-sample data

*Copyright Maven Analytics, LLC


VALIDATION DATA

You can also use the train_test_split function to create the validation data set
• First split off the test data, then separate the training and validation sets
• The test size for the second split is 25% of the training data (80%), which
is 20% of the full data set

A perfect 60/20/20 split!
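A minimal sketch of the two-step split (X_all and y_all hold the full features and target; the names are illustrative):

```python
from sklearn.model_selection import train_test_split

# step 1: split off the 20% test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# step 2: carve the validation set out of the remaining 80%
# (25% of 80% = 20% of the full data, giving a 60/20/20 split)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)
```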

*Copyright Maven Analytics, LLC


MODEL TUNING

Model tuning is the process of making gradual improvements to a model by fitting it
on the training data and scoring it on the validation data
• This lets you compare modifications to your models (like adding features) and avoid overfitting

Model #1 – 3 variables
• Training R2 = 0.67, Validation R2 = 0.66
• Both R2 values seem relatively low with a small gap between them
(underfit – high bias, low variance) → Action: Add complexity

Model #2 – 20 variables
• Training R2 = 0.89, Validation R2 = 0.71
• Training R2 improved but the gap between them grew considerably
(overfit – low bias, high variance) → Action: Simplify

Model #3 – 3 variables + 1 squared term
• Training R2 = 0.81, Validation R2 = 0.79
• Training R2 dropped a bit but the gap between them closed down
(best fit – balanced bias & variance) → Action: Decide on model

*Copyright Maven Analytics, LLC


MODEL TUNING

To tune your model in Python:


1. Use sm.OLS(y_train, X_train).fit() to fit the model on the training data
2. Use r2(y_train, model.predict(X_train)) to return R2 for the training data (or use mae / mse)
3. Use r2(y_valid, model.predict(X_valid)) to return R2 for the validation data (or use mae / mse)
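Put together, a minimal tuning sketch looks like this (assuming X_train and X_valid already include a constant column):

```python
import statsmodels.api as sm
from sklearn.metrics import r2_score

# fit on the training data only
model = sm.OLS(y_train, X_train).fit()

print(r2_score(y_train, model.predict(X_train)))  # training R2
print(r2_score(y_valid, model.predict(X_valid)))  # validation R2 - watch the gap between them
```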

*Copyright Maven Analytics, LLC


MODEL SCORING

Once you’ve tuned and selected the best model, you need to refit the model on both
the training and validation data, then score the model on the test data
• Combining the validation data back with the training data helps improve coefficient estimates
• The final “model score” that you’d share is the score from the test data

Model Tuning: Training (60%), R2 = 0.81 | Validation (20%), R2 = 0.79
Once you’ve finished tuning, select the best model

Model Scoring: Training + Validation (60% + 20% = 80%), R2 = 0.83 | Test (20%), R2 = 0.81
Refit the model on the training & validation data and score on the test data

*Copyright Maven Analytics, LLC


MODEL SCORING

To score your model in Python:


1. Use sm.OLS(y, X).fit() to fit the final model on the training and validation data
2. Use r2(y, model.predict(X)) to return R2 for the training and validation data (or use mae / mse)
3. Use r2(y_test, model.predict(X_test)) to return R2 for the test data (or use mae / mse)

*Copyright Maven Analytics, LLC


CROSS VALIDATION

Cross validation is another validation method that splits, or “folds”, the training
data into “k” parts, and treats each part as the validation data across iterations
• You fit the model k times on the training folds, while validating on a different fold each time

Train-Test Split: Training | Test

5-Fold Cross Validation (validation score per fold):
• Fold 5 held out: 0.42
• Fold 4 held out: 0.40
• Fold 3 held out: 0.45
• Fold 2 held out: 0.41
• Fold 1 held out: 0.37

Cross validation score (average of the folds): 0.41

*Copyright Maven Analytics, LLC


CROSS VALIDATION

You can use cross validation in Python with scikit-learn’s KFold function
• sklearn.model_selection.KFold(n_splits, shuffle=True, random_state)
• This can be put into a function to reuse!
• Report the average validation score, plus/minus the standard deviation
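A minimal sketch of a 5-fold cross validation loop (assuming X and y are the training features and target as pandas objects):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, valid_idx in kf.split(X):
    model = sm.OLS(y.iloc[train_idx], X.iloc[train_idx]).fit()
    preds = model.predict(X.iloc[valid_idx])
    scores.append(r2_score(y.iloc[valid_idx], preds))

print(f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}")  # average score and its spread
```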

*Copyright Maven Analytics, LLC


SIMPLE VS CROSS VALIDATION

You should choose one validation approach to use throughout your modeling
process, so it’s important to highlight the pros and cons of each:

Reliability
• Simple validation: less reliable, particularly on smaller data sets
• Cross validation: more reliable, as the validation process looks at multiple holdout sets

Data size
• Simple validation: more appropriate for large data sets, as there is less risk for the holdout set to be biased
• Cross validation: more appropriate for small-medium sized datasets (<10,000 rows), or larger ones (<1M rows) if you have access to sizeable computing power

Training time
• Simple validation: faster, since we’re only training and scoring the model once
• Cross validation: slower, since we train and score the model once for each fold

*Copyright Maven Analytics, LLC


ASSIGNMENT: MODEL TESTING & VALIDATION

Key Objectives
1. Split your data into training and test
2. Use cross validation to fit your model and report each validation fold score
3. Fit your model on all of your training data and score it on the test dataset

NEW MESSAGE (July 12, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Data Splitting

Hi there!

The model seems to be shaping up nicely, but I forgot to
mention we need to split off a test dataset to assess
performance.

Can you split off a test set, and fit your model using cross-validation?

Thanks!

05_data_splitting_assignments.ipynb

*Copyright Maven Analytics, LLC


KEY TAKEAWAYS

Data Splitting is a key step in the modeling process


• Training, Validation, and Test datasets all have unique purposes

Test Data Sets give us an unbiased estimate of model accuracy


• Scoring on your test dataset should only be performed once you are ready to decide on a final model

Validation Data provides critical feedback into model tuning


• Simple validation is suitable for larger datasets, while cross validation tends to be more reliable on small-to-medium sized data
• Fold your validation data back into training data to fit your final model

*Copyright Maven Analytics, LLC


FEATURE ENGINEERING

*Copyright Maven Analytics, LLC


FEATURE ENGINEERING

In this section we’ll cover feature engineering techniques for regression models, including
dummy variables, interaction terms, binning, and more

TOPICS WE’LL COVER:
• Feature Engineering • Math Calculations • Category Mappings

GOALS FOR THIS SECTION:
• Understand the goal and importance of feature engineering in modeling
• Learn common types of feature engineering and how to implement them using Python
• Walk through examples of some more advanced feature engineering options

*Copyright Maven Analytics, LLC


FEATURE ENGINEERING

Feature engineering is the process of creating or modifying features that you


think will be helpful inputs for improving a model’s performance
• This is where data scientists add the most value in the modeling process
• Strong domain knowledge is required to get the most out of this step

Only “carat” can be used as a feature here, since the rest are not numeric

Feature engineering is all about trial and error, and you can use cross
validation to test new ideas and determine if they improve the model

*Copyright Maven Analytics, LLC


FEATURE ENGINEERING TECHNIQUES

These are some commonly used feature engineering techniques:

• Math Calculations: polynomial terms, combining features, interaction terms
• Category Mappings: binary columns, dummy variables, binning
• DateTime Calculations: days from “today”, time between dates
• Group Calculations: aggregations, ranks within groups
• Scaling: standardization, normalization

We’ll do an in-depth review of the math calculations and category mappings, and
briefly demo the rest (they reference previously learned concepts)

Your own creativity, domain knowledge, and critical thinking will lead to feature
engineering ideas not covered in this course, but these are worth keeping in mind!

*Copyright Maven Analytics, LLC


POLYNOMIAL TERMS

Adding polynomial terms (x², x³, etc.) to regression models can improve fit when
spotting “curved” feature-target relationships during EDA
• Generally, as the degree of your polynomial increases, so does the risk of overfitting
• If your goal is prediction, let cross validation guide the complexity of your polynomial term
• Always keep the lower order terms in the model

Model Structure                 Cross-Val R2
Carat                           .847
Carat + Carat2                  .929
Carat + Carat2 + Carat3         .936
Carat + … + Carat4              .936
Carat + … + Carat5              .936
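A minimal sketch of adding polynomial terms (assuming a diamonds DataFrame; the names are illustrative):

```python
# keep carat in the model and add higher-order versions of it
diamonds["carat_sq"] = diamonds["carat"] ** 2
diamonds["carat_cubed"] = diamonds["carat"] ** 3
```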

*Copyright Maven Analytics, LLC


POLYNOMIAL TERMS

Cross validation R2 stops increasing after adding the cubic term, so we should stop
there (if you need to explain this model, you could justify stopping at the squared
term for interpretability)

*Copyright Maven Analytics, LLC


COMBINING FEATURES

Combining features with arithmetic operators can help avoid multicollinearity
issues without throwing away information altogether
• Sums (+), differences (-), products (*) and ratios (÷) of columns are all valid combinations

Current best model: Carat + Carat2 + Carat3, Cross-Val R2 = .936
Carat weight is our strongest variable alone, but can we engineer some stronger features?

Feature correlations: carat (weight), x (width), y (length), and z (depth) all capture
a diamond’s size, which is why they are highly correlated with each other and will
cause multicollinearity issues in the model

While carat is the most important factor in price, “deep” diamonds are less valuable
than “wide” diamonds, since depth can’t be seen in jewelry

*Copyright Maven Analytics, LLC


COMBINING FEATURES

Combining features with arithmetic operators can help avoid multicollinearity
issues without throwing away information altogether
• Sums (+), differences (-), products (*) and ratios (÷) of columns are all valid combinations

Current best model: Carat + Carat2 + Carat3, Cross-Val R2 = .936 (the carat model is still the best!)

Models with combined features:
Model Structure                     Cross-Val R2
Original columns: x, y, z           .913
Sum: (x + y + z)                    .909
Product: (x * y * z)                .909
Area/depth ratio: (x + y) / z       .877

PRO TIP: There is no guarantee that even the most clever feature engineering will
improve your model; give it a try, but be prepared to move on if you don’t see results!

*Copyright Maven Analytics, LLC


INTERACTION TERMS

Interaction terms capture feature-target relationships that change based on
the value of another feature
• They can be detected with careful EDA or brute force engineering
• They exist for both categorical-numeric and numeric-numeric feature combinations
• When adding interaction terms, always include the original features in your model

Adding an interaction term (the 𝛽3 term below):

y = 𝛽0 + 𝛽1x1 + 𝛽2x2 + 𝛽3x1x2
y = 𝛽0 + 𝛽1(carat) + 𝛽2(clarity_I1) + 𝛽3(carat * clarity_I1)

In the diamonds data, “I1” diamonds have a much lower slope coefficient than others
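A minimal sketch of building this interaction term (assuming a diamonds DataFrame with a “clarity” column; the names are illustrative):

```python
# binary flag for I1 clarity, then its interaction with carat
diamonds["clarity_I1"] = (diamonds["clarity"] == "I1").astype(int)
diamonds["carat_x_I1"] = diamonds["carat"] * diamonds["clarity_I1"]
```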

*Copyright Maven Analytics, LLC


INTERACTION TERMS

Current best model: Carat + Carat2 + Carat3, Cross-Val R2 = .936 (the carat model is still the best!)

Model with interaction term:
Carat + I1 + Carat * I1, Cross-Val R2 = .870

PRO TIP: Interaction terms are cool, but often don’t add enough value to justify
the time it takes to find them or the increased complexity!

*Copyright Maven Analytics, LLC


CATEGORICAL FEATURES

Categorical features must be converted to numeric before modeling
• In their simplest form, they can be represented with binary columns

For example, a binary column can assign a value of 1 to diamonds with “IF” clarity
and a value of 0 to any others; this field that represents clarity is now numeric
and can be input into a model
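A minimal sketch of creating the binary column (assuming a diamonds DataFrame; the names are illustrative):

```python
# 1 for diamonds with "IF" clarity, 0 for all others
diamonds["clarity_IF"] = (diamonds["clarity"] == "IF").astype(int)
```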

*Copyright Maven Analytics, LLC


CATEGORICAL FEATURES

Categorical features must be converted to numeric before modeling
• In their simplest form, they can be represented with binary columns

Interpreting coefficients for binary columns: diamonds with a clarity of “IF” have
a price $1,278 higher, on average, than non-IF diamonds

Effect on a model: binary columns have the effect of shifting the intercept of the
line without affecting its slope!

*Copyright Maven Analytics, LLC


DUMMY VARIABLES

A dummy variable is a field that only contains zeros and ones to represent the
presence (1) or absence (0) of a value, also known as one-hot encoding
• They are used to transform a categorical field into multiple numeric fields
• Use pd.get_dummies() to create dummy variables in Python

The resulting dummy variables are numeric representations of the “clarity” field
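A minimal sketch (assuming a diamonds DataFrame with a “clarity” column):

```python
import pandas as pd

# one column of zeros and ones per clarity value
dummies = pd.get_dummies(diamonds["clarity"])

# replace the original categorical column with its dummy columns
diamonds = pd.concat([diamonds.drop(columns="clarity"), dummies], axis=1)
```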

*Copyright Maven Analytics, LLC


DUMMY VARIABLES

In linear regression models, you need to drop a dummy variable category using
the “drop_first=True” argument to avoid perfect multicollinearity
• The category that gets dropped is known as the “reference level”
Feature
Engineering

Math
Calculations

Category
Mappings

PRO TIP: Your model accuracy will be the same regardless of which dummy column
dropped, but some reference levels are more intuitive to interpret than others
If you want to choose your reference level, skip the drop_first argument and drop the
desired reference level manually

*Copyright Maven Analytics, LLC


DUMMY VARIABLES

In linear regression models, you need to drop a dummy variable category using
the “drop_first=True” argument to avoid perfect multicollinearity
• The category that gets dropped is known as the “reference level”
Feature
Engineering

Interpreting coefficients for dummy variables:


Math
Calculations Diamonds with a clarity of “IF” have a predicted
price of $5,571 higher than our reference level
Category (diamonds with a clarity of I1)
Mappings
The coefficients align with the diamond quality chart!


The reference level is represented in the intercept term, and


the coefficients of other categories are compared to it

*Copyright Maven Analytics, LLC


BINNING CATEGORICAL DATA

Adding dummy variables for each categorical column in your data can lead to very
wide data sets, which tends to increase model variance
Grouping, or binning, categorical data solves this and can improve interpretability
• After binning, create dummy variables for the groups (which should be fewer than before)

EXAMPLE We can map the 8 clarity values into just 3 buckets

If we had less data, the “I1” category would be too rare to produce reliable estimates,
as random data splitting means we might not see “I1” diamonds in our test or validation
sets! (categories with low counts are especially at risk of overfitting)

*Copyright Maven Analytics, LLC


BINNING CATEGORICAL DATA

[Figures: dummy variables from the raw clarity values vs dummy variables from the
binned clarity groups]

*Copyright Maven Analytics, LLC


BINNING NUMERIC DATA

Binning numeric data lets you turn numeric features into categories
• Generally, this is less accurate than using raw values, but it is a highly interpretable way of
capturing non-linear trends and numeric fields with a high percentage of missing values

EXAMPLE Carat is a continuous field, but we can bin it into values at various
intervals, making it a categorical field – then you can get dummy variables from it!
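A minimal sketch with pd.cut (the bin edges and labels are illustrative):

```python
import pandas as pd

diamonds["carat_bin"] = pd.cut(
    diamonds["carat"],
    bins=[0, 0.5, 1.0, 1.5, 2.0, float("inf")],        # cut points for each interval
    labels=["<0.5", "0.5-1", "1-1.5", "1.5-2", "2+"],  # one label per bin
)
```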

*Copyright Maven Analytics, LLC


BINNING NUMERIC DATA

Binning numeric data lets you turn numeric features into categories

Current best model: Carat + Carat2 + Carat3, Cross-Val R2 = .936

Models with binned data:
Model Structure     Cross-Val R2
Carat               .847
Carat Bins          .870

Note that the binned data has created steps in the model versus the earlier
smooth lines / curves

*Copyright Maven Analytics, LLC


ASSIGNMENT: FEATURE ENGINEERING

Key Objectives
1. Perform feature engineering on numeric and categorical features
2. Evaluate model performance after including the new features
3. Select only features that improve model fit

NEW MESSAGE (July 14, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Feature Engineering

Hello again!

We’re starting to get somewhere with this model, nice job!

I notice that your last model didn’t have any of the categorical
features in it, can you make sure to include them?

I’ve also included some additional feature engineering ideas I
had in the notebook.

Thanks!

06_feature_engineering_assignment.ipynb

*Copyright Maven Analytics, LLC


KEY TAKEAWAYS

Feature engineering lets you turn raw data into useful model features
• This is critical to getting the best accuracy out of your data sets

Several calculations can be applied to enhance numeric features


• Combining features, polynomial terms, interaction terms, and binning can be applied on numeric variables

It also allows you to prepare categorical features for modeling


• Techniques like binary columns, dummy variables, and binning let you turn categorical variables into numeric

Most ideas will come from domain expertise and critical thinking
• Thinking carefully about what might influence your target variable can lead to the creation of powerful features

*Copyright Maven Analytics, LLC


PROJECT 1: LINEAR REGRESSION

*Copyright Maven Analytics, LLC


PROJECT DATA: SAN FRANCISCO RENT PRICES

*Copyright Maven Analytics, LLC


PROJECT BRIEF: LINEAR REGRESSION

Key Objectives
1. Perform EDA on a modelling dataset
2. Split your data and choose a validation framework
3. Fit and tune a linear regression model by checking model assumptions and performing feature engineering
4. Interpret a linear regression model

NEW MESSAGE (July 20, 2023)
From: Cathy Coefficient (Data Science Lead)
Subject: Apartment Rental Prices

Hi there,

Cam has told me some great things about your work. We have
a new client that has a modelling project they need help with.

The client works in the real estate industry in San Francisco
and wants to understand the key factors affecting rental
prices. More importantly, they hope to be able to use your model
to predict an appropriate price range for the apartments they
build in the city.

You’ll find more info in the attached notebook, thanks!

07_regression_modelling_project.ipynb

*Copyright Maven Analytics, LLC


REGULARIZED REGRESSION

*Copyright Maven Analytics, LLC


REGULARIZED REGRESSION

In this section we’ll cover regularized regression models, also known as “penalized”
regression models, which focus on reducing model variance to improve predictive accuracy

TOPICS WE’LL COVER:
• Regularization • Ridge Regression • Lasso Regression • Elastic Net Regression

GOALS FOR THIS SECTION:
• Understand the difference between linear regression and regularized regression models
• Introduce the cost function and the impact of the regularization term
• Review the steps for fitting and training regularized regression models, including
standardization and hyperparameter tuning
• Discuss the similarities and differences between Ridge Regression, Lasso Regression and Elastic Net

*Copyright Maven Analytics, LLC


REGULARIZED REGRESSION
Regularized regression models add bias by NOT choosing the line of best fit for
the training data, with the purpose of reducing the variance on the test data
• This helps reduce overfitting and leads to better predictions (especially in smaller data sets)
• The “slope” coefficients tend to shrink during regularization

[Figures: linear regression (OLS) with low bias and high variance (an overfit model)
vs regularized regression with a balance of bias and variance]

*Copyright Maven Analytics, LLC


THE COST FUNCTION
Regularized regression models modify the “line of best fit” by adding additional
elements to the cost function used to find the optimal model

In linear regression, the cost function is just the sum of squared error (SSE):

J = SSE

(J is the cost you’re trying to minimize)

In regularized regression, you add a regularization term (R) controlled by alpha (α):

J = SSE + αR

(the alpha term controls the impact of the regularization term)

*Copyright Maven Analytics, LLC


TYPES OF REGULARIZED REGRESSION
Different types of regularized regression have different regularization terms (R) in
J = SSE + αR:

Ridge Regression – sum of the squared coefficient values:
    R = Σⱼ βⱼ²

Lasso Regression – sum of the absolute coefficient values:
    R = Σⱼ |βⱼ|

Elastic Net Regression – sum of the ridge & lasso regularization terms, weighted by lambda (λ):
    R = (1 − λ) Σⱼ βⱼ² + λ Σⱼ |βⱼ|

*Copyright Maven Analytics, LLC


RIDGE REGRESSION
Ridge regression adds a regularization term, or complexity penalty, that’s equal to
the sum of the model’s squared coefficients (β), also known as an L2 penalty
• The L2 penalty shrinks the coefficients towards 0 (but never to 0 exactly)
• The higher alpha, the more the coefficients get shrunk
• The features that reduce SSE the most will shrink at a slower rate than less useful ones

J = SSE + α Σⱼ βⱼ²

When fitting a ridge regression, the model is trying to minimize this cost function by both:
• Minimizing training error with SSE
• Keeping the coefficient values small

The idea is to incorporate the bias-variance tradeoff directly into the algorithm!
We want to reduce the sum of squared error, but also constrain the magnitude of
the coefficients to prevent overfitting.

Because of this, ridge regression tends to produce much more accurate models
when working with multicollinear features.

*Copyright Maven Analytics, LLC


RIDGE REGRESSION WORKFLOW
The ridge regression workflow has a few additional steps that are required:

Linear regression:
1. Fit a linear regression model on the training data and score on the validation data
2. Tune the model by adding or removing inputs
3. Fit the model on the train & validation data and score on the test data

Ridge regression:
1. Standardize all inputs into the model
2. Fit a ridge regression model on the training data and score on the validation data
3. Tune the model by adding or removing inputs and modifying the ⍺ hyperparameter
4. Fit the model on the train & validation data and score on the test data

*Copyright Maven Analytics, LLC


STANDARDIZATION
Standardization sets all input features on the same scale by transforming all
values to have a mean of 0 and a standard deviation of 1
• Ridge regression is fit using the size of the coefficients, so they need to have the same units
• The StandardScaler() function from scikit-learn is the preferred way to do this
• Use .fit_transform on the training data to calculate the mean & standard deviation and apply the standardization
• Use .transform on the validation & test data to apply the same standardization

As a best practice, only retrieve the mean & standard deviation from the training data.
Doing this before splitting the data would be collecting information from the test data
and using it for training the model, which will lead to inflated test performance.
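A minimal sketch (assuming the data has already been split into train, validation, and test sets):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean & std from the training data only
X_valid_std = scaler.transform(X_valid)      # reuse the training statistics
X_test_std = scaler.transform(X_test)
```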

*Copyright Maven Analytics, LLC


STANDARDIZATION
EXAMPLE After standardizing the diamonds features, a 0.29-carat diamond is 1.08
standard deviations (0.46) smaller than the mean carat weight (0.79)

*Copyright Maven Analytics, LLC


FITTING A RIDGE REGRESSION
You can fit a ridge regression with the Ridge() function in sklearn.linear_model
• Use the .coef_ attribute to return the model coefficients
• Fit on the standardized features (here with alpha = 1), then calculate R2 for the
training and validation sets
• A large gap between training and validation R2 indicates the model is overfit,
so we should try different values for alpha
• Note that the coefficient magnitudes differ from previous estimates due to standardization
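A minimal sketch (assuming standardized feature sets from the previous step):

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

ridge = Ridge(alpha=1).fit(X_train_std, y_train)

print(r2_score(y_train, ridge.predict(X_train_std)))  # training R2
print(r2_score(y_valid, ridge.predict(X_valid_std)))  # validation R2
print(ridge.coef_)                                    # the shrunken coefficients
```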

*Copyright Maven Analytics, LLC


TUNING ALPHA
We need to test different values of ⍺ to see how it affects model performance
• In sklearn, the default value is 1, but this should be tuned
• The ideal value is different for every model and can be found via trial & error with validation

EXAMPLE With an alpha of 71.49 (instead of 1), the training score went down slightly
from 0.869, but the validation score improved significantly from 0.332; the tiny
sacrifice in bias results in a huge improvement in variance!

*Copyright Maven Analytics, LLC


TUNING ALPHA
We need to test different values of ⍺ to see how it affects model performance

[Figure: coefficient paths; the coefficient estimates shrink towards 0 as alpha increases]

*Copyright Maven Analytics, LLC


PRO TIP: RIDGECV
The RidgeCV() function performs a cross validation loop that returns the best-
performing value for alpha (instead of writing a loop from scratch)
• RidgeCV(alphas=list_of_alphas, cv=5_or_10)
• Use .alpha_ to return the best-performing value for alpha

RidgeCV() will usually yield a different optimal regularization strength than simple
validation, and tends to yield better test performance as it looks at multiple cross
validation folds. Here, RidgeCV() used an alpha of 41.02 and achieved a better test
score than the alpha of 71.49 from the normal validation loop.
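A minimal sketch (the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-2, 3, 100)  # candidate alpha values from 0.01 to 1000
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train_std, y_train)

print(ridge_cv.alpha_)  # the best-performing alpha across the folds
```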

*Copyright Maven Analytics, LLC


ASSIGNMENT: RIDGE REGRESSION

Key Objectives
1. Fit a ridge regression model in Python
2. Tune alpha through cross validation
3. Compare model performance with traditional regression

NEW MESSAGE (July 22, 2023)
From: Cam Correlation (Sr. Data Scientist)
Subject: Ridge Regression

Hi there!

I love the model you built for our computer price data set.

Can you try fitting a ridge regression and compare the model’s accuracy?

Thanks!

08_regularization_assignments.ipynb

*Copyright Maven Analytics, LLC


LASSO REGRESSION
Lasso regression adds a regularization term equal to the sum of the magnitude
(absolute value) of the model’s coefficients (β), also known as an L1 penalty
• The L1 penalty shrinks the coefficients towards 0 (they can drop to 0, unlike ridge regression)
• The higher alpha, the more the coefficients get shrunk
• The features that reduce SSE the most will shrink at a slower rate than less useful ones

J = SSE + α Σⱼ |βⱼ|

When fitting a lasso regression, the model is trying to minimize this cost function by both:
• Minimizing training error with SSE
• Keeping the coefficient values small

PRO TIP: If you have lots of features in your model, fitting a moderately penalized lasso
regression can help inform which features are strongest by dropping the rest to 0

*Copyright Maven Analytics, LLC


FITTING A LASSO REGRESSION
𝛂
You can fit a lasso regression with the Lasso() class in sklearn.linear_model

These coefficients all dropped to 0!

You can tune alpha with the same validation loop!
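A sketch of the fit, again assuming the standardized splits from the ridge example:

```python
from sklearn.linear_model import Lasso

# Fit a lasso regression on the standardized features
lasso = Lasso(alpha=1).fit(X_train_s, y_train)

print(lasso.coef_)  # some coefficients drop all the way to 0
```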

*Copyright Maven Analytics, LLC


PRO TIP: LASSOCV
𝛂 Like RidgeCV(), the LassoCV() function performs a cross validation loop that
returns the best-performing value for alpha
Regularization
• LassoCV(alphas=list_of_alphas, cv=5_or_10)
• Use .alpha_ to return the best-performing value for alpha

Ridge Regression

Lasso Regression

Elastic Net
Regression
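A sketch mirroring the RidgeCV example (the alpha grid is an assumption):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Cross validate across an alpha grid, just like RidgeCV
lasso_cv = LassoCV(alphas=np.logspace(-2, 3, 100), cv=10).fit(X_train_s, y_train)

print(lasso_cv.alpha_)  # best-performing alpha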

*Copyright Maven Analytics, LLC


ASSIGNMENT: LASSO REGRESSION

NEW MESSAGE
July 23, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: RE: Ridge Regression

Hi there!

I was hoping ridge regression would increase performance
more than it did, but given that we didn’t have any highly
correlated features, it makes sense.

I meant to ask you to check the performance of the lasso
model as well. Can you try a lasso model and compare it with
the others?

Thanks!

Key Objectives
1. Fit a lasso regression model in Python
2. Tune alpha through cross validation
3. Compare model performance with traditional regression

08_regularization_assignments.ipynb

*Copyright Maven Analytics, LLC


ELASTIC NET REGRESSION
𝛂
Elastic net regression combines the ridge and lasso penalties into a single model
and introduces a new hyperparameter (λ) that controls the balance between them
• When lambda is 1, the model is equivalent to lasso; when it is 0, it is equivalent to ridge

J = SSE + α((1 − λ) Σⱼ βⱼ² + λ Σⱼ |βⱼ|)   (sums over j = 1 to p)

α sets the total regularization strength; λ controls the balance between the
ridge penalty (the squared term) and the lasso penalty (the absolute value term)

PRO TIP: Elastic net regression combines the effects of both lasso and
ridge regression and can be a lifesaver for challenging modeling problems

*Copyright Maven Analytics, LLC


FITTING AN ELASTIC NET REGRESSION
𝛂
You can fit an elastic net with the ElasticNet() class in sklearn.linear_model
• ElasticNet(alpha=1, l1_ratio=.5)
• The “l1_ratio” controls the balance of L1 to L2 penalty (the default is 0.5, or equal weight)

This plot for alpha holds lambda constant at 0.5

Coefficients drop much faster here
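A sketch of the fit on the standardized splits from the earlier examples:

```python
from sklearn.linear_model import ElasticNet

# Equal balance between the L1 and L2 penalties (the default l1_ratio of 0.5)
enet = ElasticNet(alpha=1, l1_ratio=0.5).fit(X_train_s, y_train)

print(enet.coef_)
print(enet.score(X_val_s, y_val))
```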

*Copyright Maven Analytics, LLC


TUNING LAMBDA
𝛂
You need to tune lambda in order to find the best ridge / lasso balance
• To do so, you can try fixing alpha at 1 and evaluating performance across “l1_ratios”
Regularization

Ridge Regression

Lasso Regression

Elastic Net
Regression

Tuning lambda while keeping alpha constant, or vice versa, misses out on many possible combinations of
these hyperparameters that might perform better – we need to try multiple combinations of both!
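A sketch of the fixed-alpha search described above (the l1_ratio grid is an assumption):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hold alpha at 1 and score each l1_ratio on the validation set
l1_ratios = np.linspace(0.01, 1, 100)
scores = [ElasticNet(alpha=1, l1_ratio=r).fit(X_train_s, y_train)
                                          .score(X_val_s, y_val)
          for r in l1_ratios]

print(l1_ratios[np.argmax(scores)])
```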

*Copyright Maven Analytics, LLC


PRO TIP: ELASTICNETCV
𝛂
The ElasticNetCV() class performs a cross validation loop that returns the
best-performing values for both alpha and lambda
• ElasticNetCV(alphas=list_of_alphas, l1_ratio=list_of_lambdas, cv=5_or_10)
• Use .alpha_ and .l1_ratio_ to return the best-performing values for alpha and lambda

Low regularization strength, and 90% skewed towards a Lasso penalty

This model slightly underperformed the Lasso model, but it is still quite good
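A sketch of searching both hyperparameters at once (both grids are assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Cross validate over alpha and l1_ratio together
enet_cv = ElasticNetCV(alphas=np.logspace(-2, 3, 50),
                       l1_ratio=np.linspace(0.1, 1, 10),
                       cv=10).fit(X_train_s, y_train)

print(enet_cv.alpha_, enet_cv.l1_ratio_)
```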

*Copyright Maven Analytics, LLC


ASSIGNMENT: ELASTIC NET REGRESSION

NEW MESSAGE
July 25, 2023
From: Cam Correlation (Sr. Data Scientist)
Subject: Re: Re: Ridge Regression

Hi there!

Ok, last request for a bit, I promise!

Can you try an elastic net as well?

Just want to make sure we’re checking all of our possible
options here.

Thanks!

Key Objectives
1. Fit an elastic net regression model in Python
2. Tune alpha and lambda through cross validation
3. Compare model performance with other linear regression models

08_regularization_assignments.ipynb

*Copyright Maven Analytics, LLC


RECAP: REGULARIZED REGRESSION MODELS
𝛂
Model: Linear Regression (OLS)
• Cost function: SSE
• Penalty type: None
• Hyperparameters: None
• Details: Fits a line of best fit by minimizing SSE

Model: Ridge Regression
• Cost function: J = SSE + α Σⱼ βⱼ²
• Penalty type: L2 penalty
• Hyperparameters: α
• Details: Helps with overfitting by shrinking coefficients towards zero, but never
reaching it. Great for modelling highly correlated features

Model: Lasso Regression
• Cost function: J = SSE + α Σⱼ |βⱼ|
• Penalty type: L1 penalty
• Hyperparameters: α
• Details: Helps with overfitting by dropping some coefficients to 0, making it a
good variable selection technique

Model: Elastic Net Regression
• Cost function: J = SSE + α((1 − λ) Σⱼ βⱼ² + λ Σⱼ |βⱼ|)
• Penalty type: L2 and L1 penalty
• Hyperparameters: α, λ
• Details: Helps with overfitting by balancing ridge and lasso regression

*Copyright Maven Analytics, LLC


KEY TAKEAWAYS

Regularized regression adds bias to a model to help reduce variance


• To achieve this, it strays away from the line of best fit by introducing a regularization term to the cost equation
• The goal is now to minimize the sum of squared error AND keep the coefficient estimates small

There are 3 types of regularized regression models you can try


• Any regularization technique combats overfitting, with Ridge regression reducing the coefficient values to be
closer to 0, Lasso regression dropping some coefficients down to 0, and Elastic Net balancing the two

All features need to be standardized before being input into these models
• This allows the resulting coefficients to be fairly compared with one another
• Use the training data set to calculate the mean and standard deviation values, then apply to the test data set

Tuning the hyperparameters is key in achieving the best performance


• The alpha and lambda hyperparameters modify the regularization penalty amount in the cost equation

*Copyright Maven Analytics, LLC


PROJECT 2: REGULARIZATION

*Copyright Maven Analytics, LLC


PROJECT DATA: SAN FRANCISCO RENT PRICES

*Copyright Maven Analytics, LLC


PROJECT BRIEF: REGULARIZATION

NEW MESSAGE
July 27, 2023
From: Cathy Coefficient (Data Science Lead)
Subject: RE: Apartment Rental Prices

Hi again,

I just reviewed your modelling work and, overall, I’m quite
pleased with the results.

In order to make sure we’re doing everything we can to
exceed our client expectations, I’d like to see if regularized
regression can improve the model.

Please build on your previous work to fit regularized
regression models and compare them with your original
model, then select the final model based on accuracy.

Key Objectives
1. Fit Lasso, Ridge, and Elastic Net regression models
2. Tune the models to the optimal regularization strength based on validation results
3. Select the model which has the highest accuracy on out-of-sample data

09_regularized_regression_project.ipynb

*Copyright Maven Analytics, LLC


TIME SERIES ANALYSIS

*Copyright Maven Analytics, LLC


TIME SERIES ANALYSIS

In this section we’ll cover time series analysis and forecasting, specialized techniques
applied to time series data to extract patterns & trends and predict future values

TOPICS WE’LL COVER:
• Time Series Data
• Smoothing
• Decomposition
• Forecasting

GOALS FOR THIS SECTION:
• Learn how to transform data into a format that is
ready for time series analysis and forecasting
• Become familiar with the various ways to smooth
time series data to visualize patterns
• Understand how decomposition works and how it
can be useful for looking at trends and seasonality
• Apply time series forecasting techniques, including
OLS and Facebook Prophet

*Copyright Maven Analytics, LLC


TIME SERIES DATA

Time series data requires each row to represent a unique moment in time
• This can be in any unit of time (seconds, hours, days, months, years, etc.)

Raw Data vs. Time Series Data (Daily, Monthly, Yearly)

This is NOT time series data, as each row
represents a transaction, not a point in time

Aggregating the data by date converts it into time series data

Deciding which unit of time to analyze is an important
first step. Does your company need a daily forecast to
help plan staffing? Or does it need a monthly forecast
to help finance make budgeting plans?
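A sketch of this aggregation in pandas (the file name and "date"/"sales" columns are assumptions):

```python
import pandas as pd

# Aggregate transaction-level rows into a monthly time series
transactions = pd.read_csv("transactions.csv", parse_dates=["date"])
monthly_sales = (transactions
                 .set_index("date")["sales"]
                 .resample("M")   # "D" for daily, "Y" for yearly
                 .sum())
```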

*Copyright Maven Analytics, LLC


TIME SERIES DATA

Time series data requires each row to represent a unique moment in time
• This can be in any unit of time (seconds, hours, days, months, years, etc.)

Time series data is often
visualized using a line chart
with time as the x-axis

*Copyright Maven Analytics, LLC


TYPES OF TIME SERIES ANALYSIS

There are many types of time series analysis, including:

Smoothing: Reduces volatility to reveal underlying trends & patterns
• Moving average
• Exponential smoothing

Decomposition: Breaks down data into seasonality, trend, and noise components
• Additive model
• Multiplicative model

Forecasting: Predicts future values in time using historical time series data
• Linear Regression
• Facebook Prophet*

*Copyright Maven Analytics, LLC


MOVING AVERAGE

Time series smoothing is the process of reducing volatility in time series data to
help identify trends and patterns that are otherwise challenging to see

The simplest way to smooth time series data is by calculating a moving average

The larger the window,
the smoother the data

PRO TIP: Align your windows with intuitive seasons: with monthly data,
look at quarters or years; with daily data, look at weekly or monthly, etc.

*Copyright Maven Analytics, LLC




MOVING AVERAGE

The rolling() method lets you calculate moving averages for a specified window
• df[“col”].rolling(window).mean()

This creates new columns with a 3-month,
6-month, and 12-month moving average

How does it work?
• Each “quarterly” value represents the average
sales from the current month and 2 previous
• Each “half_year” value represents the average
sales from the current month and 5 previous
• Each “annual” value represents the average
sales from the current month and 11 previous
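A sketch of these three columns, assuming a monthly DataFrame df with a "sales" column:

```python
# Add the three moving averages described above
df["quarterly"] = df["sales"].rolling(3).mean()   # current month + 2 previous
df["half_year"] = df["sales"].rolling(6).mean()   # current month + 5 previous
df["annual"] = df["sales"].rolling(12).mean()     # current month + 11 previous
```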

*Copyright Maven Analytics, LLC


EXPONENTIAL SMOOTHING

The exponential smoothing technique is similar to moving average, but it gives
more weight to recent data points within a window
• The weight is controlled by alpha, which is a value between 0 and 1
• The higher alpha, the more weight is given to recent points (which also increases volatility)
• df[“col”].ewm(alpha=alpha).mean()

No NaN values!
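A sketch contrasting two alpha values on the same assumed df["sales"] column:

```python
# Exponential smoothing: higher alpha puts more weight on recent points
df["smooth"] = df["sales"].ewm(alpha=0.2).mean()     # smoother line
df["reactive"] = df["sales"].ewm(alpha=0.8).mean()   # tracks recent values closely
```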

*Copyright Maven Analytics, LLC


ASSIGNMENT: SMOOTHING

NEW MESSAGE
August 1, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Smoothing

Hi there!

We’re working with an electric utility in Morocco on a
forecasting model for electric consumption. The data is quite
noisy due to lots of seasonality. Can you calculate a weekly
and monthly rolling average so we can see broader patterns?

Thanks!

Key Objectives
1. Explore and manipulate time series data
2. Modify smoothing parameters to reveal different patterns

10_time_series_assignments.ipynb

*Copyright Maven Analytics, LLC


DECOMPOSITION

Time series data can be decomposed into trend, seasonality, and random noise
• Trend: Are values trending up, down, or staying flat over time?
• Seasonality: Do values display a cyclical pattern? (like more customers buying on weekends)
• Random noise: What volatility exists outside the trend and seasonal patterns?

This data has a positive trend, unclear seasonality, and lots of random noise
This data has a flat trend, clear hourly seasonality, and relatively little noise

*Copyright Maven Analytics, LLC


DECOMPOSITION

You can use statsmodels’ seasonal_decompose() to decompose time series data

Time Series Data

Smoothing

Decomposition

The positive trend started around July 2016


Forecasting

There appears to be a yearly seasonal pattern,


but it doesn’t seem to match the data that well

After taking away the trend and seasonality, the


residuals look somewhat random over time
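A minimal sketch with statsmodels (the "sales" column and monthly period of 12 are assumptions):

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose a monthly series into trend, seasonal, and residual components
decomposition = seasonal_decompose(df["sales"], model="additive", period=12)
decomposition.plot()
```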

*Copyright Maven Analytics, LLC


TYPES OF DECOMPOSITION

There are two types of decomposition: additive and multiplicative
• Additive decomposition assumes the trend and seasonality remain constant
• Multiplicative decomposition assumes the trend and seasonality increase over time

Additive: linear trend, constant amplitude
𝑦𝑡 = 𝑇𝑡 + 𝑆𝑡 + 𝑅𝑡 (trend + seasonality + random)

Multiplicative: growing trend, growing amplitude
𝑦𝑡 = 𝑇𝑡 × 𝑆𝑡 × 𝑅𝑡 (trend × seasonality × random)

*Copyright Maven Analytics, LLC


TYPES OF DECOMPOSITION

There are two types of decomposition: additive and multiplicative


• Additive decomposition assumes the trend and seasonality remain constant
Time Series Data • Multiplicative decomposition assumes the trend and seasonality increase over time

Smoothing

Decomposition

Forecasting
Multipliers

The residuals look good in the middle but get worse towards
the end, which indicates we need a multiplicative model

We can no longer see a pattern in the residuals

*Copyright Maven Analytics, LLC


PRO TIP: AUTOCORRELATION CHART

An autocorrelation chart calculates the correlation between time series data and
lagged versions of itself, then plots those correlations
• This allows you to visualize which lags are highly correlated with the original data, and reveals
the length (or period) of the seasonal trend

Strong correlation every 7th lag, which indicates a weekly cycle

[Autocorrelation chart: correlation (-60% to 100%) plotted against lags 0–28]

*Copyright Maven Analytics, LLC


PRO TIP: AUTOCORRELATION CHART

This peak indicates a seasonal pattern every 24 periods (hours)

[Autocorrelation chart of hourly data]

If your series has a trend, you must difference your time series first, which
can be done with the .diff() method in Pandas
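A sketch of producing an ACF chart with statsmodels (the "temperature" column and lag count are assumptions):

```python
from statsmodels.graphics.tsaplots import plot_acf

# Difference first if the series has a trend, then plot autocorrelations by lag
plot_acf(df["temperature"].diff().dropna(), lags=48)
```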
*Copyright Maven Analytics, LLC
ASSIGNMENT: DECOMPOSITION

NEW MESSAGE
August 5, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Decomposition

Hi there!

Thanks for your help with smoothing.

In a related project, we’re looking at weather in neighboring
Spain and we want to confirm the seasonality of weather on
both a daily and annual basis.

Can you decompose hourly and monthly weather and plot
ACF charts for them?

Thanks!

Key Objectives
1. Use time series decomposition to understand the trend, seasonality and
noise of time series data
2. Use an ACF chart to estimate the seasonal window for time series data

10_time_series_assignments.ipynb

*Copyright Maven Analytics, LLC


FORECASTING

Time series forecasting lets you predict future values using historical data
• Forecasting models use existing trends & seasonality to make accurate future predictions
Time Series Data

Smoothing
Common Forecasting Techniques:
• Linear Regression
• Facebook Prophet
• Holt-Winters
• ARIMA Modeling
• LSTM (deep learning approach)

We’ll cover linear regression and Facebook Prophet in this course

Forecasts get less accurate the further out we are trying to predict, so think carefully
about how far in advance you really need to forecast in the context of the problem

*Copyright Maven Analytics, LLC


DATA SPLITTING

Time series data splitting does not follow traditional train/test splits:
• Instead of random splits, we’ll need to split by points in time to mimic forecasting the future
Time Series Data • The training data set should be at least as long as your desired forecast window (test data)
• You may need to change the period to get enough holdout data
Smoothing

EXAMPLE Forecasting the next 48 hours of weather

Decomposition
Train Test

Forecasting
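A sketch of a time-based split for the 48-hour example, assuming an hourly DataFrame df sorted by time:

```python
# Split by position in time rather than randomly: hold out the final 48 hours
train = df.iloc[:-48]
test = df.iloc[-48:]
```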

*Copyright Maven Analytics, LLC


LINEAR REGRESSION WITH TREND & SEASON

It’s common to build forecasting models using linear regression by creating
“trend” and “seasonal” dummy variables
• The trend variable is a sequence of numbers that increments by 1 unit every period
• The seasonal variables are dummies for every period in the season
Smoothing

Decomposition

Forecasting

This creates a “trend” variable and 24


“seasonal” dummy variables
(1 per hour in a day)
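A sketch of building these features, assuming an hourly DataFrame df with a DatetimeIndex:

```python
import numpy as np
import pandas as pd

# Build a trend counter plus one dummy per hour of the day
features = pd.DataFrame(index=df.index)
features["trend"] = np.arange(len(df))  # increments by 1 each period
hour_dummies = pd.get_dummies(df.index.hour, prefix="hour")  # 24 dummies
hour_dummies.index = df.index
features = pd.concat([features, hour_dummies], axis=1)
```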

*Copyright Maven Analytics, LLC


FITTING THE MODEL

You can fit the model on the training data like a regular linear regression model
• It violates the assumptions of linear regression, but it’s often very effective
Time Series Data

Smoothing

Decomposition

Forecasting

The temperature increases by
0.02 degrees every hour

The temperature is 2.47 degrees
higher than the reference level,
which is hour_0, or midnight
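A sketch of the fit, reusing the features frame from the previous sketch and assuming a "temperature" column:

```python
from sklearn.linear_model import LinearRegression

# Fit on the trend + seasonal features, holding out the final 48 hours
X_train, y_train = features.iloc[:-48], df["temperature"].iloc[:-48]
lr = LinearRegression().fit(X_train, y_train)

print(lr.coef_[0])  # the hourly trend coefficient
```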
*Copyright Maven Analytics, LLC
SCORING THE MODEL

To score the model, you can plot the predictions against the actual values for the
test data and calculate error metrics like MAE and MAPE
• The Mean Absolute Percentage Error (MAPE) is calculated by finding the average percent
error of all the data points (it’s essentially MAE converted to a percentage)
Smoothing

Decomposition

Forecasting

The model is wrong by 2.96 degrees on average, or by 12.7%
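A sketch of computing both metrics with sklearn, continuing from the fitted model above:

```python
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Score the 48-hour holdout: average degrees off, and the same error as a percentage
X_test, y_test = features.iloc[-48:], df["temperature"].iloc[-48:]
preds = lr.predict(X_test)
print(mean_absolute_error(y_test, preds))
print(mean_absolute_percentage_error(y_test, preds))
```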

*Copyright Maven Analytics, LLC


FACEBOOK PROPHET

Facebook Prophet is a forecasting package available in Python and R that was


developed by Meta’s data science research team
Time Series Data

Instead of OLS, it uses an additive regression model with three main components:
1. Growth curve trend (created by automatically detecting changepoints in the data)
2. Yearly, weekly, and daily seasonal components
3. User-provided list of important holidays

PRO TIP: Prophet is a relatively sophisticated and easy-to-use
package, making it ideal for quick forecasting activities

*Copyright Maven Analytics, LLC


FACEBOOK PROPHET

EXAMPLE Forecasting the next 48 hours of weather

Install Facebook Prophet
Import the package
Split the data
Rename the x and y variables to “ds” and “y”
Fit the model
Specify the number of periods and period
frequency to predict, and plot the predictions
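A sketch of these steps (the df, "date", and "temperature" names are assumptions; install with pip install prophet first):

```python
from prophet import Prophet

# Prophet requires the date column named "ds" and the target named "y"
prophet_df = df.reset_index().rename(columns={"date": "ds", "temperature": "y"})
train = prophet_df.iloc[:-48]  # hold out the final 48 hours

m = Prophet()
m.fit(train)

# Predict 48 hourly periods past the end of the training data, then plot
future = m.make_future_dataframe(periods=48, freq="H")
forecast = m.predict(future)
m.plot(forecast)
```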

*Copyright Maven Analytics, LLC


LINEAR REGRESSION VS. PROPHET

Linear Regression: MAE 2.96, MAPE 0.127
Facebook Prophet: MAE 2.09, MAPE 0.091
Smoothing

Decomposition

Forecasting

*Copyright Maven Analytics, LLC


ASSIGNMENT: FORECASTING

NEW MESSAGE
August 10, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Forecasting

Hi there!

We’re looking at evaluating two forecasting methods to
propose to an airline client.

Can you take a look at historical airline passenger growth and
compare the accuracy of forecasts for linear regression and
Facebook Prophet?

I’ve left a few more details in the notebook.

Thanks!

Key Objectives
1. Split time series data into train and test datasets
2. Fit time series models with linear regression and Facebook Prophet
3. Compare the accuracy of the two forecasting methods, calculating MAE and MAPE

10_time_series_assignments.ipynb

*Copyright Maven Analytics, LLC


KEY TAKEAWAYS

Time series data requires each row to represent a unique moment in time
• It’s important to decide on the units, or row granularity, of your time series data (years, months, etc.)

Time series smoothing allows you to visualize trends in the time series data
• Common techniques include moving average and exponential smoothing

Decomposition breaks the data down into trend, season, and random noise
• Time series data can be decomposed in an additive or multiplicative fashion

Time series forecasting allows you to predict future values in time


• Common techniques include linear regression (even though the assumptions are violated) and Facebook Prophet

*Copyright Maven Analytics, LLC


PROJECT 2: FORECASTING

*Copyright Maven Analytics, LLC


PROJECT DATA: ELECTRICITY CONSUMPTION

*Copyright Maven Analytics, LLC


ASSIGNMENT: FINAL PROJECT

NEW MESSAGE
August 14, 2023
From: Tammy Tiempo (Financial Analyst)
Subject: Forecasting Electricity Consumption

Hello,

Can you build a simple forecast for our Moroccan electricity
consumption? I’d like to see a linear regression model
compared to Facebook Prophet, now that we know they’re
both reasonable forecasting methods.

I’m hoping we can get a forecast accurate enough to allow
us to properly estimate the demand for electricity, which will
help us properly price it during peak demand.

-Tammy

Key Objectives
1. Explore and manipulate time series data
2. Perform time series data splitting
3. Build and compare the predictive accuracy of forecasting models

11_time_series_project.ipynb

*Copyright Maven Analytics, LLC
