Machine Learning
Fig 1: The plots display sales, in thousands of units, as a function of TV,
Radio, and Newspaper budgets, in thousands of dollars, across 200 markets.
Input – Output Variables
• In this case, the three media budgets are input variables.
• We generally denote input variables by symbol 𝑋 and use subscripts
to distinguish among the input variables.
• Hence, we can denote the TV budget by 𝑋1, the Radio budget by 𝑋2, and
the Newspaper budget by 𝑋3.
• In this case, the predicted sales is the response of the model.
• We call the response the output variable and denote it by the
symbol 𝑌.
Independent Vs Dependent Variables
• We can give the input variables different names such as independent
variables, predictors, features etc.
• The output variable or response is also known as the dependent variable.
• Suppose we observe a quantitative output variable 𝑌 and 𝑝 different input
variables, 𝑋1, 𝑋2, …, 𝑋𝑝.
• Our assumption is that there is a relationship between 𝑌 and 𝑋 =
(𝑋1, 𝑋2, …, 𝑋𝑝). We can write the relationship as
𝑌 = 𝑓(𝑋) + 𝜖
• Here 𝑓 is some fixed but unknown function of 𝑋1, …, 𝑋𝑝, and 𝜖 is a random
error term that is independent of 𝑋 and has mean zero.
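As a concrete sketch, the relationship 𝑌 = 𝑓(𝑋) + 𝜖 can be simulated in a few lines of Python; the particular function 𝑓 and the noise level below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The true function f: fixed but, in practice, unknown (assumed here)
    return 3.0 + 0.5 * x

x = rng.uniform(0, 20, size=200)       # input variable X, e.g. years of education
eps = rng.normal(0.0, 2.0, size=200)   # random error: independent of X, mean zero
y = f(x) + eps                         # observed response Y = f(X) + eps

# The sample mean of the errors is close to zero, matching the assumption
mean_eps = float(eps.mean())
```

Some simulated observations land above the curve 𝑓 and some below, but the errors average out to roughly zero, exactly as in the income example that follows.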
Function Approximation
• Let us consider another
dataset, income dataset.
• The left-hand panel shows
the plot of income versus
years of education for 30
individuals.
• Using the plot, you may be
able to predict the income
given the years of education.
• But the function that relates
the income to the years of
education is not known.
Function Approximation
• In this situation, we should
estimate the function 𝑓 based on
observed data.
• Since the dataset is simulated,
we know the function 𝑓.
• Based on the given function, we
have plotted the blue curve in the
right-hand side panel.
• You may notice the vertical lines
that represent the error terms 𝜖.
• Out of the 30 observations, some
observations lie above the blue
curve whereas some observations
lie below the blue curve.
• But the overall mean would be
zero.
Function Approximation
• Generally, the function 𝑓
may involve more than one
input variable.
• In the figure, the plot of
income as a function of
seniority and years of
education is given.
• We need to estimate the
function 𝑓, which is now a two-
dimensional surface.
• So, we can say that machine
learning refers to a set of
approaches for estimating
function 𝑓.
Why Estimate 𝑓?
Why Estimate 𝑓
• The purpose of function approximation is either prediction or
inference.
Prediction
• In several cases, a set of input variables 𝑋 is available, but we cannot
easily obtain the response variable 𝑌. Since the average of the error
term is zero, we can predict 𝑌 using
𝑌̂ = 𝑓̂(𝑋)
• Here 𝑓̂ represents the estimate of 𝑓 and 𝑌̂ represents the resulting
prediction of 𝑌. If prediction is the priority, 𝑓̂ is often treated as a
black box, provided it yields accurate predictions of 𝑌.
Prediction
• Example: Suppose 𝑋1 , … , 𝑋𝑝 represent the characteristics of a
patient’s blood sample that has been measured in a lab.
• The output variable 𝑌 encodes the patient’s risk for a severe adverse
reaction to a particular drug.
• We can predict 𝑌 using 𝑋 so that we can avoid giving the drug to
patients who are at high risk of an adverse reaction.
• That is, we can predict which patients are at high risk of an
adverse reaction.
Reducible - Irreducible Errors
• Two quantities, the reducible error and the irreducible error, determine the accuracy
of 𝑌̂ as a prediction of 𝑌.
• In general, 𝑓̂ will not be a perfect estimate of 𝑓.
• This inaccuracy introduces some error.
• We can reduce this error by improving the accuracy of 𝑓̂, using the most appropriate
machine learning technique to estimate 𝑓.
• However, even if we could estimate 𝑓 perfectly, so that our predicted response is
𝑌̂ = 𝑓(𝑋), the prediction would still contain some error, because 𝑌 is also a function
of 𝜖, which cannot be predicted from 𝑋.
• The variability associated with 𝜖 therefore affects the accuracy of our predictions.
• This is known as the irreducible error. No matter how accurately we estimate 𝑓, the
error introduced by 𝜖 cannot be reduced.
Reducible - Irreducible Errors
• Why is the irreducible error larger than zero?
• The quantity 𝜖 may contain unmeasured variables that would be
useful for predicting 𝑌.
• Because we do not measure them, 𝑓 cannot use them for
prediction.
• The quantity 𝜖 may also contain unmeasurable variation.
• For example, the risk of an adverse reaction might vary with
manufacturing variation in the drug itself, or with the patient's
general feeling of well-being on a given day.
Reducible - Irreducible Errors
• Given an estimated function 𝑓̂ and a set of predictors 𝑋, we obtain the
prediction 𝑌̂ = 𝑓̂(𝑋). It can be shown that
E(𝑌 − 𝑌̂)² = E[𝑓(𝑋) + 𝜖 − 𝑓̂(𝑋)]²
= [𝑓(𝑋) − 𝑓̂(𝑋)]² + Var(𝜖)
= Reducible + Irreducible
• Here E(𝑌 − 𝑌̂)² represents the average, or expected, value of the squared
difference between the predicted and actual value of 𝑌, and Var(𝜖)
represents the variance associated with the error term 𝜖.
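The decomposition above can be checked numerically. The choices of 𝑓, 𝑓̂, and Var(𝜖) below are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                 # true function (assumed for illustration)
    return np.sin(x)

def f_hat(x):             # a deliberately imperfect estimate of f
    return 0.9 * np.sin(x)

x = rng.uniform(0.0, 2.0 * np.pi, size=100_000)
sigma = 0.5
eps = rng.normal(0.0, sigma, size=x.size)   # Var(eps) = 0.25 is irreducible
y = f(x) + eps

mse = np.mean((y - f_hat(x)) ** 2)            # E(Y - Yhat)^2
reducible = np.mean((f(x) - f_hat(x)) ** 2)   # E[f(X) - f_hat(X)]^2
irreducible = sigma ** 2                      # Var(eps)

# The two sides of the decomposition agree up to sampling noise
gap = abs(mse - (reducible + irreducible))
```

However good 𝑓̂ becomes, the prediction error cannot drop below Var(𝜖): only the `reducible` term shrinks as 𝑓̂ approaches 𝑓.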
Inference
• On various occasions, we want to understand how 𝑌 changes as we
change 𝑋1, …, 𝑋𝑝.
• We still wish to estimate 𝑓, but making predictions of 𝑌 may not be
our goal.
• Instead, we may want to understand the relationship between 𝑋 and 𝑌.
• In such situations, we need to understand the exact form of 𝑓̂.
Hence, it cannot be treated as a black box.
• We would like to answer the following questions:
Inference
• Which input variables are associated with output variable 𝒀?
• Generally, only a small fraction of the available input variables is
substantially associated with the response 𝑌.
• Identifying the few important predictors among a large set of
possible predictors can be extremely useful.
Inference
• What is the relationship between the response and each predictor?
• Different predictors may have different relationships with the response.
• Some predictors may have a positive relationship with 𝑌, whereas others
may have a negative relationship with 𝑌.
• If the relationship is positive, increasing the value of the predictor
increases the value of 𝑌.
• A negative relationship works in the opposite direction: increasing the
predictor decreases 𝑌.
• Due to the complexity of the function 𝑓, the relationship between a
predictor and the response may also depend on the values of the other
predictors.
Inference
• Whether the relationship between 𝒀 and each predictor is linear or
more complex?
• Many methods for estimating 𝑓 assume a linear form.
• However, in certain cases, the true relationship is more complex and
cannot be accurately represented by a linear model.
How to estimate 𝑓?
How to estimate 𝑓
• Suppose we have a set of observed data that includes 𝑛 = 30 data
points.
• We name these observations, training data. The training data will be
used to train or teach our model to estimate 𝑓.
• Let 𝑥𝑖𝑗 represent the value of the 𝑗th predictor, or input, for observation 𝑖,
where 𝑖 = 1, 2, …, 𝑛 and 𝑗 = 1, 2, …, 𝑝.
• Correspondingly, the response variable for the 𝑖th observation is
represented by 𝑦𝑖.
• Hence our training data consist of {(𝑥1, 𝑦1), (𝑥2, 𝑦2), …, (𝑥𝑛, 𝑦𝑛)},
where 𝑥𝑖 = (𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑝)ᵀ.
How to estimate 𝑓
• Our objective is to estimate the unknown function 𝑓.
• For this, we apply the machine learning model to the training data.
• We want to find a function 𝑓̂ such that 𝑌 ≈ 𝑓̂(𝑋) for any
observation (𝑋, 𝑌).
• Two types of approaches can be used for this: parametric and
non-parametric.
Parametric Approach
• The parametric approach is a two-step, model-based approach.
• The first step is to make an assumption about the functional form, or
shape, of 𝑓. For example, we can make the simple assumption that 𝑓 is
linear in 𝑋:
𝑓(𝑋) = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝
• Once the assumption that 𝑓 is linear is made, 𝑓 can be
estimated easily.
• Instead of estimating the entire 𝑝-dimensional function 𝑓(𝑋), we only
have to estimate the 𝑝 + 1 coefficients 𝛽0, 𝛽1, …, 𝛽𝑝.
Parametric Approach
• After we select the model, we use the training data to fit or train the
model.
• From the example given in step 1, we need to estimate the
parameters 𝛽0 , 𝛽1 , … , 𝛽𝑝 .
• In other words, we need to find the values of the parameters such
that
𝑌 ≈ 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝
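This fitting step can be sketched in Python with NumPy on simulated training data; the "true" coefficient values below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated training data; the true coefficients are arbitrary illustrative values
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 1.0, -3.0, 0.5])   # beta_0, beta_1, beta_2, beta_3
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 0.1, size=n)

# Fit: estimate the p + 1 coefficients by least squares
X1 = np.column_stack([np.ones(n), X])         # prepend a column of 1s for beta_0
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
```

With enough training data and small noise, the estimated coefficients `beta_hat` land close to the values used to generate the data.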
Parametric Approach
• Several ways are available to fit the linear model. (Ordinary) least
squares is one of the possible ways.
• The least squares method is the most common approach.
• The model-based approach described above is known as the parametric
approach.
• This approach reduces the problem of estimating 𝑓 to one of estimating a
set of parameters.
• On one hand, the linear model makes it very easy to estimate the
parameters 𝛽0, 𝛽1, …, 𝛽𝑝; on the other hand, the model that we choose
may not represent the true unknown form of 𝑓.
Parametric Approach
• If the estimated model is too far from the true 𝑓, our estimate will not
be good.
• To fix this problem, we can choose a more flexible model, but such
models require estimating a greater number of parameters.
• More complex models may lead to overfitting the data, which means
that the model follows the errors, or noise, too closely.
• We applied the parametric approach to the Income data, fitting a
linear model as shown in the figure below.
Left: Data simulated from 𝑓, shown in black. Three estimates of 𝑓 are shown: the
linear regression line (orange curve) and two smoothing spline fits (blue and green
curves). Right: Training MSE (grey curve), test MSE (red curve), and minimum
possible test MSE over all methods (dashed line). Squares represent the training and
test MSEs for the three fits shown in the left-hand panel.
Model Selection
• The figure above shows the phenomenon using a simple example.
• The left-hand side panel shows the true 𝑓 by the black curve.
• The three possible estimates for 𝑓 that have been obtained using methods with
increasing levels of flexibility have been shown in orange, blue, and green curves.
• The orange line is the linear regression fit, which is relatively inflexible.
• The blue and green curves were produced using smoothing splines with different
levels of smoothness.
• We can see that as the level of flexibility increases, the curves fit the observed
data more closely.
• The green curve is the most flexible and matches the data very well but it fits the
true 𝑓 (shown in black) poorly because it is too wiggly.
• We can adjust the level of flexibility of the smoothing spline fit to produce
different fits to the data.
Model Selection
• Now we will analyze the right-hand panel.
• The grey curve shows the average training MSE as a function of flexibility,
measured here by the degrees of freedom of the smoothing splines.
• The degrees of freedom summarize the flexibility of a curve.
• The orange, blue, and green squares represent the MSEs corresponding to the curves in
the left-hand panel.
• A more restricted and hence smoother curve such as linear regression has fewer degrees
of freedom than a wiggly curve.
• The linear regression is at the most restrictive end, with two degrees of freedom. The
training MSE declines monotonically as flexibility increases.
• In this example the true 𝑓 is non-linear, and so the orange linear fit is not flexible enough
to estimate 𝑓 well.
• The green curve has the lowest training MSE of all three methods since it corresponds to
the most flexible of the three curves fit in the left-hand panel.
Model Selection
• We have demonstrated the test MSE using the red curve in the right-
hand panel.
• Like the training MSE, the test MSE initially declines as the level of
flexibility increases.
• At a certain point, however, the test MSE levels off and then starts to
increase again.
Model Selection
• The orange and green curves both have high test MSE.
• The blue curve minimizes the test MSE, because of the three curves it
appears to estimate 𝑓 best.
• The horizontal dashed line represents the irreducible error, 𝑉𝑎𝑟(𝜖)
that is the lowest achievable test MSE among all possible methods.
• Hence, the smoothing spline represented by the blue curve is close to
optimal.
Model Selection
• From the right-hand side panel, we can see that as the flexibility of the statistical
learning method increases, we observe a monotone decrease in the training MSE
and a U-shape in the test MSE.
• This is a fundamental property of machine learning that holds regardless of the
particular data set at hand and regardless of the statistical method being used.
• As model flexibility increases, training MSE will decrease, but the test MSE may
not.
• When a given method yields a small training MSE but a large test MSE, we are
said to be overfitting the data.
• This happens because our machine learning procedure is working too hard to find
patterns in the training data, and may be picking up patterns that are caused by
random chance rather than by true properties of the unknown function 𝑓.
Model Selection
• When we overfit the training data, the test MSE will be very large
because the supposed patterns that the method found in the training
data simply don’t exist in the test data.
• Note that regardless of whether or not overfitting has occurred, we
almost always expect the training MSE to be smaller than the test
MSE because most machine learning methods either directly or
indirectly seek to minimize the training MSE.
• Overfitting refers specifically to the case in which a less flexible model
would have yielded a smaller test MSE.
Model Selection
Figure 10 provides another example, in which the true 𝑓 is approximately linear. We can
notice that the training MSE decreases monotonically as the model flexibility increases,
and that there is a U-shape in the test MSE. However, because the truth is close to linear,
the test MSE only decreases slightly before increasing again, so the orange least
squares fit is substantially better than the highly flexible green curve.
Linear Model
• Given a vector of inputs 𝑋 = (𝑋1, 𝑋2, …, 𝑋𝑝), the linear model predicts
the output via
𝑌̂ = 𝛽̂0 + Σ(𝑗=1 to 𝑝) 𝑋𝑗𝛽̂𝑗
• The term 𝛽̂0 is the intercept. This is also known as the bias in machine
learning.
Linear Model
• We can include the constant variable 1 in 𝑋 and include 𝛽̂0 in the
vector of coefficients 𝛽̂. Hence, we can write the linear model as
𝑌̂ = 𝑋ᵀ𝛽̂
• Here 𝑋ᵀ denotes the vector or matrix transpose, and 𝑓(𝑋) = 𝑋ᵀ𝛽 is
linear in 𝑋.
• How can we fit the linear model to a set of training data?
• Many different methods are available. The most popular is least
squares.
Linear Model
• In this method, we pick the coefficients 𝛽 to minimize the residual sum of
squares
RSS(𝛽) = Σ(𝑖=1 to 𝑁) (𝑦𝑖 − 𝑥𝑖ᵀ𝛽)²
• RSS(𝛽) is a quadratic function of the parameters, so its minimum always
exists, although it may not be unique. In matrix notation we can write
RSS(𝛽) = (y − X𝛽)ᵀ(y − X𝛽)
• Differentiating with respect to 𝛽, we get the normal equations
Xᵀ(y − X𝛽) = 0
• If XᵀX is nonsingular, the unique solution is
𝛽̂ = (XᵀX)⁻¹Xᵀy
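The closed-form solution from the normal equations can be verified numerically; a minimal sketch on simulated data, where the 𝛽 values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

N = 100
# X includes the constant variable 1 as its first column
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 2.0, -1.0])            # arbitrary illustrative values
y = X @ beta + rng.normal(0.0, 0.5, size=N)

# Normal equations X^T (y - X beta) = 0  =>  beta_hat = (X^T X)^{-1} X^T y;
# np.linalg.solve avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def rss(b):
    # Residual sum of squares (y - Xb)^T (y - Xb)
    r = y - X @ b
    return float(r @ r)
```

`beta_hat` minimizes the RSS, so `rss(beta_hat)` is no larger than the RSS at any other coefficient vector, including the true one.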
Linear Model - Classification
• Let’s consider an example of the linear model in a classification context.
• Figure 12 shows a scatterplot of training data on a pair of inputs 𝑋1 and 𝑋2 .
• The output class variable 𝐺 has the values BLUE or ORANGE.
• These values have been represented in the scatterplot. There are 100
points in each of the two classes.
• The linear regression model was used to fit these data.
• The response 𝑌 was coded as 0 for BLUE and 1 for ORANGE.
• The fitted values 𝑌̂ are converted to a fitted class variable 𝐺̂ according to
the rule
𝐺̂ = ORANGE if 𝑌̂ > 0.5, BLUE if 𝑌̂ ≤ 0.5
Linear Model - Classification
• A classification example in two dimensions.
The classes are coded as a binary variable
(BLUE = 0, ORANGE = 1), and then fit by linear
regression.
• The line is the decision boundary defined by
𝑋 𝑇 𝛽መ = 0.5.
• The decision boundary in this case is linear.
• The orange shaded region shows the part of
input space that is classified as ORANGE,
while the blue region is classified as BLUE.
• In this case, we can notice several
misclassifications on both sides of the
decision boundary.
• This may be because either our model is too
rigid or such errors are unavoidable.
Model Accuracy - Classification
• In the classification setting, 𝑦𝑖 is no longer numerical.
• Suppose we want to estimate 𝑓 on the basis of training data
{(𝑥1, 𝑦1), …, (𝑥𝑛, 𝑦𝑛)}, where 𝑦1, …, 𝑦𝑛 are qualitative.
• We can use the training error rate to quantify the accuracy of our
estimate 𝑓̂.
• The training error rate is the proportion of mistakes that we make if we
apply 𝑓̂ to the training observations:
(1/𝑛) Σ(𝑖=1 to 𝑛) 𝐼(𝑦𝑖 ≠ 𝑦̂𝑖)
Model Accuracy - Classification
• Here 𝑦̂𝑖 is the predicted class label for the 𝑖th observation using 𝑓̂.
• 𝐼(𝑦𝑖 ≠ 𝑦̂𝑖) is an indicator variable that equals 1 if 𝑦𝑖 ≠ 𝑦̂𝑖 and 0 if 𝑦𝑖 = 𝑦̂𝑖.
• If 𝐼(𝑦𝑖 ≠ 𝑦̂𝑖) = 0, the 𝑖th observation was correctly classified; otherwise it
was misclassified.
• The equation above is referred to as the training error rate because it is
computed using the training data.
• However, we are interested in the error rates that are based on test data, which
was not used in the training.
• The test error rate associated with a set of test observations of the form
(𝑥0, 𝑦0) is given by
Ave(𝐼(𝑦0 ≠ 𝑦̂0))
• Here 𝑦̂0 is the predicted class label that results from applying the classifier
to a test observation with predictor 𝑥0.
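The error-rate computation itself is a one-liner; the labels below are hypothetical:

```python
import numpy as np

# Hypothetical labels: true classes and a classifier's predictions
y_true = np.array(["A", "B", "A", "A", "B"])
y_pred = np.array(["A", "B", "B", "A", "B"])

# Error rate = (1/n) * sum of I(y_i != yhat_i); one mistake out of five here
error_rate = float(np.mean(y_true != y_pred))
print(error_rate)  # 0.2
```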
Bayes Classifier
• The test error rate can be minimized by using a simple classifier that
assigns each observation to the most likely class, given its predictor values.
• Hence, we assign a test observation with predictor vector 𝑥0 to the class 𝑗
for which
Pr(𝑌 = 𝑗 | 𝑋 = 𝑥0)
is largest. This is a conditional probability: the probability that 𝑌 = 𝑗,
given the observed predictor vector 𝑥0.
• This simple classifier is called the Bayes Classifier.
• If we have a two-class problem where only two response values, say class 1
or class 2, are possible, the Bayes classifier predicts class 1 if
Pr 𝑌 = 1 𝑋 = 𝑥0 > 0.5, and class 2 otherwise.
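For a two-class problem where the conditional probability is known, the Bayes classifier reduces to a one-line rule. The logistic form of Pr(𝑌 = 1 | 𝑋 = 𝑥0) below is purely an illustrative assumption:

```python
import math

def p_class1(x0):
    # Assumed known conditional probability Pr(Y = 1 | X = x0);
    # the logistic form is purely illustrative
    return 1.0 / (1.0 + math.exp(-x0))

def bayes_classify(x0):
    # Predict class 1 if Pr(Y = 1 | X = x0) > 0.5, class 2 otherwise
    return 1 if p_class1(x0) > 0.5 else 2

print(bayes_classify(1.3), bayes_classify(-0.7))  # 1 2
```

The set of points where `p_class1(x0)` equals exactly 0.5 (here, 𝑥0 = 0) is the Bayes decision boundary discussed next.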
Bayes Classifier
• The figure corresponds to an example using a simulated data set in a two-
dimensional space consisting of predictors 𝑋1 and 𝑋2.
• In the figure, the orange and blue circles represent training observations
belonging to two different classes.
• For each value of 𝑋1 and 𝑋2, the probability of the response being orange or
blue is different.
• Since the data are simulated and we know how the data set was generated, we
can calculate the conditional probability for each value of 𝑋1 and 𝑋2.
• The orange shaded region represents the set of points for which
Pr(𝑌 = orange | 𝑋) is greater than 50%.
• The blue shaded region represents the set of points where that probability is
less than 50%.
• The purple dashed line represents the points where the probability is exactly
50%. This line is known as the Bayes decision boundary.
Bayes Classifier
• We can determine the Bayes classifier prediction using the Bayes
decision boundary.
• An observation that falls on the orange side of the boundary will be
assigned to the orange class whereas the observation on the blue side
of the boundary will be assigned to the blue class.
• The Bayes classifier produces the lowest possible test error rate, which
is known as the Bayes error rate.
• The Bayes error rate is analogous to the irreducible error.
K Nearest Neighbours
• Theoretically, we would always prefer to predict a qualitative response
using the Bayes classifier.
• However, for real data, we do not know the conditional distribution of
𝑌 given 𝑋.
• Therefore, computing the Bayes classifier is impossible.
• Hence, the Bayes classifier serves as an unattainable gold standard
against which other methods can be compared.
• Many approaches attempt to estimate the conditional distribution of
𝑌 given 𝑋.
• We then classify a given observation to the class with the highest
estimated probability.
K Nearest Neighbours
• The K-nearest neighbours (KNN) classifier is one of the most popular such
methods.
• Given a positive integer 𝐾 and a test observation 𝑥0, the KNN classifier
first identifies the 𝐾 training observations that are nearest to 𝑥0,
represented by 𝑁0.
• The second step is to estimate the conditional probability for class 𝑗 as the
fraction of points in 𝑁0 whose response values equal 𝑗:
Pr(𝑌 = 𝑗 | 𝑋 = 𝑥0) = (1/𝐾) Σ(𝑖 ∈ 𝑁0) 𝐼(𝑦𝑖 = 𝑗)
• In the last step, KNN applies Bayes rule and classifies the test observation
𝑥0 to the class that has the highest probability.
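The three steps can be sketched as a short from-scratch implementation; the training points below are hypothetical:

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K):
    """Classify x0 by the KNN rule: find the K nearest training points (N0),
    estimate class probabilities as vote fractions, and take the largest."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distance to x0
    n0 = np.argsort(dists)[:K]                     # indices of the K nearest points
    labels, counts = np.unique(y_train[n0], return_counts=True)
    return labels[np.argmax(counts)]               # class with highest fraction in N0

# Hypothetical training set: six "blue" and six "orange" points
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                    [0.5, 0.5], [0.2, 0.8],
                    [3.0, 3.0], [3.0, 4.0], [4.0, 3.0], [4.0, 4.0],
                    [3.5, 3.5], [3.2, 3.8]])
y_train = np.array(["blue"] * 6 + ["orange"] * 6)

print(knn_predict(X_train, y_train, np.array([0.4, 0.6]), K=3))  # blue
```

With 𝐾 = 3, the neighbourhood of the query point contains only blue points here, so the estimated probability for blue is 3/3 and the prediction is blue.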
K Nearest Neighbours
• An example of the KNN approach is given
in figure 15.
• The left side panel demonstrates a small
training data set that consists of six blue
and six orange observations.
• The objective is to predict the label for the
point marked by the black cross. Let’s
choose 𝐾 = 3.
• The first step for KNN would be to identify
the three observations that are nearest to
the cross. We can show the neighborhood
as a circle.
• It consists of two blue points and one
orange point.
K Nearest Neighbours
• That results in estimated
probabilities of 2/3 for the blue
class and 1/3 for the orange class.
• Thus, the KNN prediction would be
that the black cross belongs to the
blue class.
• In the right-hand panel of Figure 15,
KNN approach with 𝐾 = 3 has
been applied to all of the possible
values for 𝑋1 and 𝑋2 .
• Based on the classification approach
as given above, the KNN decision
boundary has been drawn.
K Nearest Neighbours
• Even though KNN is a very simple
approach, it often produces results
that are close to the optimal Bayes
classifier.
• In Figure 16, we have displayed the KNN
decision boundary, using 𝐾 = 10,
applied to the larger simulated data
set.
• We can see that even though the KNN
classifier is not aware of the true
distribution, the KNN decision boundary
and the Bayes decision boundary are
very close.
• In this case, the test error rate using
KNN is 0.1363, whereas the Bayes
error rate is 0.1304; the two are very
close.
K Nearest Neighbours
• The results from the KNN classifier depend strongly
on the choice of 𝐾. Figure 17 shows two KNN fits, using
𝐾 = 1 and 𝐾 = 100, on the data shown in Figure 14.
• With 𝐾 = 1, the decision boundary is very
flexible and finds patterns in the data that do not
correspond to the Bayes decision boundary.
• Hence, we get a classifier that has low bias but very high
variance.
• As we increase the value of 𝐾, the method becomes less
flexible and produces a decision boundary that is close to
linear.
• Hence, we get a classifier that has low variance but
high bias.
• On this data set, neither 𝐾 = 1 nor 𝐾 = 100 gives
good predictions.
• The test error rates are 0.1695 and 0.1925, respectively.
K Nearest Neighbours
• Just as in the regression setting, there is no strong relationship between
the training error rate and the test error rate.
• With 𝐾 = 1, the KNN training error rate is 0, but the test error rate may be
quite high. In general, as we use more flexible classification models, the
training error rate will decline but the test error rate may not.
• Figure 18 shows the KNN test and training errors as a function of 1/𝐾. As 1/𝐾
increases, the method becomes more flexible.
• As in the regression setting, the training error rate consistently declines as
the flexibility increases.
• However, the test error exhibits a characteristic U-shape: it declines at first
(with a minimum at approximately 𝐾 = 10) before increasing again when the
method becomes excessively flexible and overfits.
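The effect of 𝐾 can be sketched on simulated data (the class distributions here are invented, not the book's data set). With 𝐾 = 1 the training error is exactly zero, because each training point is its own nearest neighbour:

```python
import numpy as np

rng = np.random.default_rng(5)

def knn_predict(X_tr, y_tr, X_te, K):
    # Majority vote among the K nearest training points, for each test point
    preds = []
    for x0 in X_te:
        n0 = np.argsort(np.linalg.norm(X_tr - x0, axis=1))[:K]
        labels, counts = np.unique(y_tr[n0], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def make_data(n):
    # Two overlapping Gaussian classes (invented for illustration)
    X = np.vstack([rng.normal(-1.0, 1.2, size=(n, 2)),
                   rng.normal(1.0, 1.2, size=(n, 2))])
    y = np.r_[np.zeros(n), np.ones(n)]
    return X, y

X_tr, y_tr = make_data(100)
X_te, y_te = make_data(500)

# K = 1: training error is 0 by construction, but the very flexible
# boundary typically generalizes worse than a moderate K
train_err_k1 = float(np.mean(knn_predict(X_tr, y_tr, X_tr, 1) != y_tr))
test_err_k1 = float(np.mean(knn_predict(X_tr, y_tr, X_te, 1) != y_te))
test_err_k10 = float(np.mean(knn_predict(X_tr, y_tr, X_te, 10) != y_te))
print(train_err_k1)  # 0.0
```

The zero training error at 𝐾 = 1 says nothing about test performance; because the classes overlap, the test error stays well above zero for every choice of 𝐾.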
K Nearest Neighbours
• Hence, for both regression and classification models, choosing the
correct level of flexibility is critical to the success of any machine
learning method.
• The bias-variance tradeoff, and the resulting U-shape in the test error,
can make this a challenging task.
Thanks
Samatrix Consulting Pvt Ltd