Machine Learning in Python - Course Notes
Ivan Manov

365 DATA SCIENCE
Abstract

Regression analysis is one of the most widely used methods for predictions. A large portion of the predictive modelling that occurs in practice is carried out through regression analysis on sample data. The goal is to design a model that captures the relationships in the data. Linear regression is a fundamental machine learning method and a starting point for the advanced analytical learning path of every aspiring data scientist.
Variables:
• Dependent (predicted): y
• Independent (predictor): x

Population regression model:

y = β0 + β1x1 + ε

y – dependent variable
x1 – independent variable
β0 – constant/intercept
β1 – coefficient of x1
ε – error of estimation
Sample regression equation:

ŷ = b0 + b1x1

ŷ – estimated/predicted value
b0 – coefficient – estimate of β0
b1 – coefficient – estimate of β1
e – residual – estimate of the error ε
Correlation vs. regression:
• Correlation measures the relationship between two variables; regression shows how one variable affects another.
• Correlation is symmetrical; regression is one-way.
Linear regression analysis finds the best fitting line that goes through the data points and minimizes the distances between them.

ŷj = b0 + b1xj

b1 – the slope of the regression line – shows how much y changes for each unit change of x
êi – estimator of the error – the distance between the observed values and the regression line
Regression line – the best fitting line through the data points
Coding steps for creating the regression equation:
• Regression itself
• Adding a constant
• A model summary

Dep. Variable – the dependent variable, y. This is the variable we are trying to predict.
R-squared – a measure of the goodness of fit; range: [0; 1].
F-statistic – evaluates the overall significance of the model (if at least one predictor is significant, the F-statistic is also significant). The lower the F-statistic, the closer the model is to a non-significant one.

• A coefficients table
std err = standard error – shows the accuracy of the prediction for each variable; the lower it is, the better the estimate
P>|t| – the significance of each coefficient (its p-value)
• Additional tests – statistics that help verify whether some of the regression assumptions hold (e.g., the Durbin-Watson statistic for the no-autocorrelation assumption)
Decomposition of variability

The variability of a data set is measured through the sum of squared deviations from the mean. It describes how far the observed values differ from their mean. The total variability of the data set is equal to the variability explained by the regression plus the unexplained variability:

SST = SSR + SSE

• Sum of squares total (SST) = Total sum of squares (TSS) – measures the total variability of the data set: SST = ∑ (yi − ȳ)²
• Sum of squares regression (SSR) = Explained sum of squares (ESS) – measures the variability explained by the regression: SSR = ∑ (ŷi − ȳ)²
• Sum of squares error (SSE) = Residual sum of squares (RSS) – measures the unexplained variability: SSE = ∑ ei², where ei = yi − ŷi
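A minimal numpy sketch of the decomposition, on made-up data; for an OLS fit that includes an intercept, SST = SSR + SSE holds exactly:

```python
import numpy as np

# Hypothetical data, fitted with a straight line by ordinary least squares
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)   # slope and intercept of the OLS line
y_hat = b0 + b1 * x            # predicted values

sst = np.sum((y - y.mean()) ** 2)       # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variability
sse = np.sum((y - y_hat) ** 2)          # unexplained variability
r2 = ssr / sst                          # R-squared (see section 1.7)
```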
1.6 OLS

OLS, or ordinary least squares, is the most common method used to estimate the linear regression equation. "Least squares" stands for the minimum sum of squared errors, or SSE. The method finds the line which minimizes the sum of the squared errors:

S(b) = ∑i=1..n (yi − xi b)²
There are other methods for determining the regression line; they are, however, rarely used in practice, as OLS is simple and powerful enough for most problems.
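The OLS minimization above has a closed-form solution, the normal equations b = (XᵀX)⁻¹Xᵀy, which the following sketch computes directly (the data is synthetic and noiseless, so the true coefficients are recovered):

```python
import numpy as np

# A noiseless line y = 2 + 3x, so OLS should recover b0 = 2, b1 = 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

X = np.column_stack([np.ones_like(x), x])   # design matrix with a constant column
b = np.linalg.solve(X.T @ X, X.T @ y)       # minimizes S(b) = sum (y - Xb)^2
```

Libraries like statsmodels do this (with better numerics) internally.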
1.7 R-squared

R-squared measures the goodness of fit of the model – the proportion of the total variability of the data that is explained by the regression:

R² = SSR / SST, with range [0; 1]

An R-squared of zero means your regression line explains none of the variability of the data. An R-squared of 1 would mean your model explains the entire variability of the data. Unfortunately, regressions explaining the entire variability are rare. What you typically observe are values somewhere in between.
Population model:

y = β0 + β1x1 + β2x2 + ... + βkxk + ε

Sample regression equation:

ŷ = b0 + b1x1 + b2x2 + ... + bkxk

ŷ – inferred value
b0 – intercept
x1 ... xk – independent variables
b1 ... bk – corresponding coefficients
• It is not about the best fitting line, as there is no way to represent the model with a single line in a multi-dimensional space; instead, we look for the coefficients that minimize the error: ➔ min SSE. The decomposition SST = SSR + SSE still applies.
Steps for creating the regression equation:
• Regression itself
• Adding a constant
1.10 F-test

The F-statistic is used for testing the overall significance of the model.

F-test:
H0: β1 = β2 = ... = βk = 0
H1: at least one βi ≠ 0

y = β0 + β1x1 + β2x2 + ... + βkxk + ε

If the null hypothesis holds, none of the independent variables matter and the model has no merit.
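The F-statistic reported in the summary can be recovered from R-squared; a sketch with hypothetical values for the sample size n, the number of regressors k, and R²:

```python
# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
# Hypothetical values: 84 observations, 2 regressors, R-squared of 0.45
n, k = 84, 2
r2 = 0.45

f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
# A large F-statistic is evidence against H0 (all betas equal to zero)
```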
Linearity

The easiest way to verify if the relationship between two variables is linear is to plot them on a scatter plot. If the data points form a pattern that looks like a straight line, then a linear regression is suitable. If the relationship is non-linear, you should not use the data before transforming it appropriately.

Fixes:
• exponential transformation
• log transformation
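A sketch of the log-transformation fix on synthetic data: y grows exponentially in x, so log(y) is linear in x and an ordinary straight-line fit recovers the relationship.

```python
import numpy as np

# Synthetic exponential relationship: y = 3 * e^(1.2x)
x = np.linspace(0.0, 5.0, 50)
y = 3.0 * np.exp(1.2 * x)

log_y = np.log(y)                           # log transformation
slope, intercept = np.polyfit(x, log_y, 1)  # now a straight-line fit works
# slope recovers 1.2, intercept recovers log(3)
```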
No endogeneity:

σXε = 0 : ∀ x, ε

The error (the difference between the observed values and the predicted values) must not be correlated with the independent variables. A common cause of endogeneity is omitted variable bias. Omitted variable bias is introduced to the model when you forget to include a relevant variable. Chances are that the omitted variable is also correlated with at least one independent x, but it is not included as a regressor. Everything that you don't explain with your model goes into the error, so the error becomes correlated with the included variables.

Normality and homoscedasticity:

ε ~ N(0, σ²)
• Zero mean – if the mean of the errors is not expected to be zero, then the line is not the best fitting one.
• Prevention: including an intercept in the model.
No autocorrelation:

σεiεj = 0 : ∀ i ≠ j

Detection:
• Durbin-Watson test:
  • 2 – no autocorrelation
  • values far from 2 – cause for alarm

There is no direct remedy within the linear regression framework – when autocorrelation is present, avoid the linear regression model!

Alternatives:
• Autoregressive model
• Other time-series models (e.g., moving average)
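The Durbin-Watson statistic itself is simple to compute from the residuals; a sketch on simulated white noise, where the statistic should be close to 2:

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 mean no autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Uncorrelated (white-noise) residuals as a sanity check
rng = np.random.default_rng(0)
dw = durbin_watson(rng.standard_normal(1000))   # expect a value close to 2
```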
No multicollinearity:

ρxixj ≈ 1 : ∀ i, j; i ≠ j

Multicollinearity is observed when two or more of the independent variables are strongly correlated. Before creating the regression, find the correlation between each two pairs of independent variables.

Fixes:
• drop one of the two correlated variables
• combine them into a single variable
Example: a categorical variable such as Attendance is included through a dummy variable:

Attendance: Yes → 1, No → 0

Underfitting – the model doesn't capture the underlying logic of the data. Overfitting – the model has focused on the training set so much that it misses the point on new data. A good model captures the underlying logic of the data.
We split the data into training and testing parts: we train the model on the training dataset but test it on the testing dataset. The goal is to avoid the scenario where the model learns to predict the training data very well but fails on data it has never seen before.
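A minimal sketch of the split (in practice the course uses sklearn's `train_test_split`; this version shuffles and slices by hand so the mechanics are visible):

```python
import numpy as np

# Hypothetical observations, indexed 0..99
data = np.arange(100)

rng = np.random.default_rng(42)
shuffled = rng.permutation(data)     # shuffle before splitting
split = int(0.8 * len(shuffled))     # 80% for training, 20% for testing
train, test = shuffled[:split], shuffled[split:]
```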
A logistic regression implies that the possible outcomes are not numerical, but rather – categorical. In the same way that we can include categorical predictors in a linear regression, we can model a categorical outcome with a logistic regression.

Logistic regression

The main difference between logistic and linear regression is the linearity of the model: instead of a numerical value, a logistic regression predicts the probability of an event occurring. Thus, we are asking the question: given input data, what is the probability of a student being admitted?
[Figure: the logistic curve – any input is mapped to a probability between 0 and 1]
Logistic function:

p(X) = e^(β0 + β1x1 + ... + βkxk) / (1 + e^(β0 + β1x1 + ... + βkxk))

Odds:

p(X) / (1 − p(X)) = e^(β0 + β1x1 + ... + βkxk)

Log-odds:

log(odds) = β0 + β1x1 + ... + βkxk
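A numpy sketch of these relationships for a single predictor with hypothetical coefficients; it also verifies the odds interpretation discussed below (a unit change in x multiplies the odds by e^b1):

```python
import numpy as np

def logistic_probability(x, b0, b1):
    """p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))"""
    z = b0 + b1 * x                  # the log-odds (logit)
    return np.exp(z) / (1 + np.exp(z))

b0, b1 = 0.0, 1.0                    # hypothetical coefficients
p0 = logistic_probability(0.0, b0, b1)   # log-odds are 0, so p = 0.5
odds0 = p0 / (1 - p0)

p1 = logistic_probability(1.0, b0, b1)   # one unit change in x
odds1 = p1 / (1 - p1)
odds_ratio = odds1 / odds0               # equals e^b1
```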
Coding steps mirror those of the linear regression:
• Adding a constant
• Regression itself

Maximum likelihood estimation (MLE) – the likelihood function measures how likely it is that the model describes the real underlying relationship of the variables. The bigger the likelihood function, the higher the probability that our model is correct.

LL-Null – the log-likelihood of a model with no independent variables.

The log-likelihood is useful only when comparing variations of the same model. Different models will have completely different log-likelihoods. Other measures of fit:
• AIC
• BIC
• McFadden's R-squared
△odds = e^(bk)

For a unit change in a variable, the change in the odds equals the exponential of its coefficient. That is exactly what provides us with a way to interpret the coefficients of a logistic regression.
Confusion matrix

The confusion matrix compares the predicted values against the true ones. For 69 observations, the model predicted 0 when the true value was 0; for 90 observations, it predicted 1 when the true value was 1. These cells indicate in how many cases the model did its job well.

The most important metric we can calculate from this matrix is the accuracy of the model. In 69 plus 90 of the cases, the model was correct. In 4 plus 5 of the cases, the model was incorrect. Overall, the model made an accurate prediction in 159 out of 168 cases. That gives us 159 divided by 168, which is 94.6% accuracy.
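The accuracy computation above, reproduced in numpy with the counts from the notes:

```python
import numpy as np

# Rows = true class, columns = predicted class (counts from the notes)
cm = np.array([[69, 4],
               [5, 90]])

correct = np.trace(cm)        # the diagonal: 69 + 90 = 159
total = cm.sum()              # all cases: 168
accuracy = correct / total    # 159 / 168 ≈ 94.6%
```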
Cluster Analysis

Cluster analysis is a multivariate technique that groups observations on the basis of some of their features or variables that they are described by. Its goal is to identify patterns in the data.

Classification vs. clustering:
• Classification: Model (Inputs) → Outputs → Correct values (the correct outputs are known)
• Clustering: Model (Inputs) → Outputs → ? (the correct outputs are unknown)
Euclidean distance

N-dim space: d(A, B) = √(∑i=1..n (ai − bi)²)

Clusters are formed by grouping the given data points according to these distances.
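The n-dimensional distance formula in a few lines of numpy, checked on the classic 3-4-5 right triangle:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two points in n-dimensional space."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

d = euclidean([0, 0], [3, 4])   # sqrt(3^2 + 4^2) = 5
```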
The Elbow method – a criterion for setting the proper number of clusters. It is about making the WCSS (within-cluster sum of squares) as low as possible, while still having a small number of clusters.

Pros and cons of K-means:
• Pros: simple to understand and fast to compute
• Cons: the number of clusters must be chosen beforehand; sensitive to initialization and to outliers
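A minimal K-means sketch in plain numpy (the course itself uses sklearn's `KMeans`, whose `inertia_` attribute is the WCSS). On two well-separated blobs, WCSS drops sharply when k goes from 1 to 2 – the "elbow":

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: assign each point to its nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distances from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    wcss = np.sum((points - centroids[labels]) ** 2)   # within-cluster sum of squares
    return labels, centroids, wcss

# Two well-separated blobs of identical points
pts = np.vstack([np.zeros((10, 2)), np.full((10, 2), 10.0)])
_, _, wcss1 = kmeans(pts, 1)   # one cluster: large WCSS
_, _, wcss2 = kmeans(pts, 2)   # two clusters: WCSS collapses
```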
4.3 Standardization

Standardization rescales each variable to a mean of 0 and a standard deviation of 1. It decreases the effect of variables with larger magnitudes and increases that of lower ones. If we don't standardize, the range of the values serves as weights for each variable, and differences in scale can dominate the clustering.

Although standardization puts all variables on equal footing, in some cases we don't need to do that. If we know that one variable is genuinely more important than the others, standardization shouldn't be used.
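A sketch of standardization on hypothetical two-column data (sklearn's `StandardScaler` performs the same operation):

```python
import numpy as np

# Hypothetical data: a large-scale column (e.g., income) and a small-scale one
data = np.array([[1500.0, 1.2],
                 [1800.0, 3.4],
                 [2100.0, 2.6]])

# Subtract each column's mean and divide by its standard deviation
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
# Each column now has mean 0 and standard deviation 1,
# so neither variable dominates the distance computations
```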
Classification vs. clustering:
• Classification is used whenever we have input data and the desired correct outcomes (targets). The model receives feedback and learns to get better outputs.
• Clustering is used whenever we have input data but have no clue what the correct outputs are. There is no feedback loop, therefore the model cannot be corrected – it simply groups the data.
Types of clustering:
• Flat
• Hierarchical:
  • Agglomerative (bottom-up)
  • Divisive (top-down)

Flat – with flat methods there is no hierarchy; rather, the number of clusters is chosen prior to clustering. Flat methods have been developed more recently; nowadays, they are preferred because of the volume of data we typically try to cluster.

Hierarchical – hierarchical clustering is superior to flat clustering in the fact that it explores (contains) all solutions.
Divisive – we start with all observations in one big cluster and split this big cluster into 2 smaller ones. Next, we continue with 3, 4, and so on. To find the best split, we must explore all possibilities at each step.
5.1 Dendrogram

Pros:
• shows all the possible linkages between clusters
• there is no need to preset the number of clusters

Cons:
• computationally expensive – the bigger the data set, the slower it gets
Iliya Valchanov
Ivan Manov

Email: team@365datascience.com