Machine Learning
Fig 1: The plots display sales, in thousands of units, as a function of TV,
Radio, and Newspaper budgets, in thousands of dollars, across 200 markets.
Input – Output Variables
• In this case, the three media budgets are input variables.
• We generally denote input variables by symbol 𝑋 and use subscripts
to distinguish among the input variables.
• Hence, we can denote the TV budget by 𝑋1, the Radio budget by 𝑋2, and
the Newspaper budget by 𝑋3.
• In this case, the predicted sales is the response of the model.
• We call the response the output variable and denote it by the
symbol 𝑌.
Independent Vs Dependent Variables
• We can give the input variables different names such as independent
variables, predictors, features etc.
• The output variable or response is also known as the dependent variable.
• Suppose we observe a quantitative output variable 𝑌 and 𝑝 different input
variables, 𝑋1, 𝑋2, …, 𝑋𝑝.
• Our assumption is that there is a relationship between 𝑌 and 𝑋 =
(𝑋1, 𝑋2, …, 𝑋𝑝). We can write the relationship as
𝑌 = 𝑓(𝑋) + 𝜖
• Here 𝑓 is some fixed but unknown function of 𝑋1, …, 𝑋𝑝, and 𝜖 is a random
error term that is independent of 𝑋 and has mean zero.
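As a concrete sketch, the relationship 𝑌 = 𝑓(𝑋) + 𝜖 can be simulated in a few lines of Python; the particular function 𝑓 and the noise level below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The true function f: fixed but, in practice, unknown (assumed here)
    return 3.0 + 0.5 * x

x = rng.uniform(0, 20, size=200)       # input variable X, e.g. years of education
eps = rng.normal(0.0, 2.0, size=200)   # random error: independent of X, mean zero
y = f(x) + eps                         # observed response Y = f(X) + eps

# The sample mean of the errors is close to zero, matching the assumption
mean_eps = float(eps.mean())
```

Some simulated observations land above the curve 𝑓 and some below, but the errors average out to roughly zero, exactly as in the income example that follows.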
Function Approximation
• Let us consider another
dataset, income dataset.
• The left-hand panel shows
the plot of income versus
years of education for 30
individuals.
• Using the plot, you may be
able to predict the income
given the years of education.
• But the function that relates
the income to the years of
education is not known.
Function Approximation
• In this situation, we should
estimate the function 𝑓 based on
observed data.
• Since the dataset is simulated,
we know the function 𝑓.
• Based on the given function, we
have plotted the blue curve in the
right-hand side panel.
• You may notice the vertical lines
that represent the error terms 𝜖.
• Out of the 30 observations, some
observations lie above the blue
curve whereas some observations
lie below the blue curve.
• But the overall mean would be
zero.
Function Approximation
• Generally, the function 𝑓
may involve more than one
input variable.
• In the figure, the plot of
income as a function of
seniority and years of
education is given.
• We need to estimate the
function 𝑓, which is now a two-
dimensional surface.
• So, we can say that machine
learning refers to a set of
approaches for estimating
function 𝑓.
Why Estimate 𝑓?
Why Estimate 𝑓
• The purpose of function approximation is either prediction or
inference.
Prediction
• In several cases, a set of input variables 𝑋 is available, but we cannot
easily obtain the response variable 𝑌. Since the average of the error
term is zero, we can predict 𝑌 using
𝑌̂ = 𝑓̂(𝑋)
• Here 𝑓̂ represents the estimate of 𝑓 and 𝑌̂ represents the resulting
prediction of 𝑌. If prediction is the priority, 𝑓̂ is often treated as a
black box, provided it yields accurate predictions of 𝑌.
Prediction
• Example: Suppose 𝑋1 , … , 𝑋𝑝 represent the characteristics of a
patient’s blood sample that has been measured in a lab.
• The output variable 𝑌 encodes the patient’s risk for a severe adverse
reaction to a particular drug.
• We can predict 𝑌 using 𝑋 so that we can avoid giving the drug to
patients who are at high risk of an adverse reaction.
• That is, we can predict which patients are at high risk of an
adverse reaction.
Reducible - Irreducible Errors
• Two quantities, the reducible error and the irreducible error, determine the accuracy
of 𝑌̂ as a prediction of 𝑌.
• In general, 𝑓̂ will not be a perfect estimate of 𝑓.
• This inaccuracy introduces some error.
• We can reduce this error by improving the accuracy of 𝑓̂, using the most appropriate
machine learning technique to estimate 𝑓.
• However, even if we could estimate 𝑓 perfectly, so that our predicted response is
𝑌̂ = 𝑓(𝑋), the prediction would still contain some error, because 𝑌 is also a function
of 𝜖, which cannot be predicted from 𝑋.
• The variability associated with 𝜖 therefore affects the accuracy of our predictions.
• This is known as the irreducible error. No matter how accurately we estimate 𝑓, the
error introduced by 𝜖 cannot be reduced.
Reducible - Irreducible Errors
• Why is the irreducible error larger than zero?
• The quantity 𝜖 may contain unmeasured variables that would be
useful for predicting 𝑌.
• Because we do not measure them, 𝑓 cannot use them for
prediction.
• The quantity 𝜖 may also contain unmeasurable variation.
• For example, the risk of an adverse reaction might vary with
manufacturing variation in the drug itself, or with the patient's
general feeling of well-being on a given day.
Reducible - Irreducible Errors
• Given an estimated function 𝑓̂ and a set of predictors 𝑋, we obtain the
prediction 𝑌̂ = 𝑓̂(𝑋). It can be shown that
E(𝑌 − 𝑌̂)² = E[𝑓(𝑋) + 𝜖 − 𝑓̂(𝑋)]²
= [𝑓(𝑋) − 𝑓̂(𝑋)]² + Var(𝜖)
= Reducible + Irreducible
• Here E(𝑌 − 𝑌̂)² represents the average, or expected, value of the squared
difference between the predicted and actual value of 𝑌, and Var(𝜖)
represents the variance associated with the error term 𝜖.
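The decomposition above can be checked numerically. The choices of 𝑓, 𝑓̂, and Var(𝜖) below are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                 # true function (assumed for illustration)
    return np.sin(x)

def f_hat(x):             # a deliberately imperfect estimate of f
    return 0.9 * np.sin(x)

x = rng.uniform(0.0, 2.0 * np.pi, size=100_000)
sigma = 0.5
eps = rng.normal(0.0, sigma, size=x.size)   # Var(eps) = 0.25 is irreducible
y = f(x) + eps

mse = np.mean((y - f_hat(x)) ** 2)            # E(Y - Yhat)^2
reducible = np.mean((f(x) - f_hat(x)) ** 2)   # E[f(X) - f_hat(X)]^2
irreducible = sigma ** 2                      # Var(eps)

# The two sides of the decomposition agree up to sampling noise
gap = abs(mse - (reducible + irreducible))
```

However good 𝑓̂ becomes, the prediction error cannot drop below Var(𝜖): only the `reducible` term shrinks as 𝑓̂ approaches 𝑓.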
Inference
• On various occasions, we want to understand how 𝑌 changes as we
change 𝑋1, …, 𝑋𝑝.
• We still wish to estimate 𝑓, but making predictions of 𝑌 may not be
our goal.
• Instead, we may want to understand the relationship between 𝑋 and 𝑌.
• In such situations, we need to understand the exact form of 𝑓̂.
Hence, it cannot be treated as a black box.
• We would like to answer the following questions:
Inference
• Which input variables are associated with output variable 𝒀?
• Generally, only a small fraction of the available input variables is
substantially associated with the response 𝑌.
• Identifying the few important predictors among a large set of
possible predictors can be extremely useful.
Inference
• What is the relationship between the response and each predictor?
• Different predictors may have different relationships with the response.
• Some predictors may have a positive relationship with 𝑌, whereas others
may have a negative relationship with 𝑌.
• If the relationship is positive, increasing the value of the predictor
increases the value of 𝑌.
• A negative relationship works in the opposite direction: increasing the
predictor decreases 𝑌.
• Due to the complexity of the function 𝑓, the relationship between a
predictor and the response may also depend on the values of the other
predictors.
Inference
• Whether the relationship between 𝒀 and each predictor is linear or
more complex?
• Many methods for estimating 𝑓 assume a linear form.
• However, in certain cases, the true relationship is more complex and
cannot be accurately represented by a linear model.
How to estimate 𝑓?
How to estimate 𝑓
• Suppose we have a set of observed data that includes 𝑛 = 30 data
points.
• We name these observations, training data. The training data will be
used to train or teach our model to estimate 𝑓.
• Let 𝑥𝑖𝑗 represent the value of the 𝑗th predictor, or input, for observation 𝑖,
where 𝑖 = 1, 2, …, 𝑛 and 𝑗 = 1, 2, …, 𝑝.
• Correspondingly, the response variable for the 𝑖th observation is
represented by 𝑦𝑖.
• Hence our training data consist of {(𝑥1, 𝑦1), (𝑥2, 𝑦2), …, (𝑥𝑛, 𝑦𝑛)},
where 𝑥𝑖 = (𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑝)ᵀ.
How to estimate 𝑓
• Our objective is to estimate the unknown function 𝑓.
• For this, we apply the machine learning model to the training data.
• We want to find a function 𝑓̂ such that 𝑌 ≈ 𝑓̂(𝑋) for any
observation (𝑋, 𝑌).
• Two types of approaches can be used for this: parametric and
non-parametric.
Parametric Approach
• The parametric approach is a two-step, model-based approach.
• The first step is to make an assumption about the functional form, or
shape, of 𝑓. For example, we can make the simple assumption that 𝑓 is
linear in 𝑋:
𝑓(𝑋) = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝
• Once the assumption that 𝑓 is linear is made, 𝑓 can be
estimated easily.
• Instead of estimating the entire 𝑝-dimensional function 𝑓(𝑋), we only
have to estimate the 𝑝 + 1 coefficients 𝛽0, 𝛽1, …, 𝛽𝑝.
Parametric Approach
• After we select the model, we use the training data to fit or train the
model.
• From the example given in step 1, we need to estimate the
parameters 𝛽0 , 𝛽1 , … , 𝛽𝑝 .
• In other words, we need to find the values of the parameters such
that
𝑌 ≈ 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝
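This fitting step can be sketched in Python with NumPy on simulated training data; the "true" coefficient values below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated training data; the true coefficients are arbitrary illustrative values
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 1.0, -3.0, 0.5])   # beta_0, beta_1, beta_2, beta_3
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 0.1, size=n)

# Fit: estimate the p + 1 coefficients by least squares
X1 = np.column_stack([np.ones(n), X])         # prepend a column of 1s for beta_0
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
```

With enough training data and small noise, the estimated coefficients `beta_hat` land close to the values used to generate the data.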
Parametric Approach
• Several ways are available to fit the linear model. (Ordinary) least
squares is one of the possible ways.
• The least squares method is the most common approach.
• The model-based approach described above is known as the parametric
approach.
• This approach reduces the problem of estimating 𝑓 to one of estimating a
set of parameters.
• On one hand, the linear model makes it very easy to estimate the
parameters 𝛽0, 𝛽1, …, 𝛽𝑝; on the other hand, the model that we choose
may not represent the true unknown form of 𝑓.
Parametric Approach
• If the estimated model is too far from the true 𝑓, our estimate will not
be good.
• To fix this problem, we can choose a more flexible model, but such
models require estimating a greater number of parameters.
• More complex models may lead to overfitting the data, which means
that the model follows the errors, or noise, too closely.
• We applied the parametric approach to the Income data, fitting a
linear model as shown in the figure below.
Left: Data simulated from 𝑓, shown in black. Three estimates of 𝑓 are shown: the
linear regression line (orange curve) and two smoothing spline fits (blue and green
curves). Right: Training MSE (grey curve), test MSE (red curve), and minimum
possible test MSE over all methods (dashed line). Squares represent the training and
test MSEs for the three fits shown in the left-hand panel.
Model Selection
• The figure above shows the phenomenon using a simple example.
• The left-hand side panel shows the true 𝑓 by the black curve.
• The three possible estimates for 𝑓 that have been obtained using methods with
increasing levels of flexibility have been shown in orange, blue, and green curves.
• The orange line is the linear regression fit, which is relatively inflexible.
• The blue and green curves were produced using smoothing splines with different
levels of smoothness.
• We can see that as the level of flexibility increases, the curves fit the observed
data more closely.
• The green curve is the most flexible and matches the data very well but it fits the
true 𝑓 (shown in black) poorly because it is too wiggly.
• We can adjust the level of flexibility of the smoothing spline fit to produce
different fits to the data.
Model Selection
• Now we will analyze the right-hand panel.
• The grey curve shows the average training MSE as a function of flexibility,
measured here by the degrees of freedom of the smoothing splines.
• The degrees of freedom summarize the flexibility of a curve.
• The orange, blue, and green squares represent the MSEs corresponding to the curves in
the left-hand panel.
• A more restricted and hence smoother curve such as linear regression has fewer degrees
of freedom than a wiggly curve.
• The linear regression is at the most restrictive end, with two degrees of freedom. The
training MSE declines monotonically as flexibility increases.
• In this example the true 𝑓 is non-linear, and so the orange linear fit is not flexible enough
to estimate 𝑓 well.
• The green curve has the lowest training MSE of all three methods since it corresponds to
the most flexible of the three curves fit in the left-hand panel.
Model Selection
• We have demonstrated the test MSE using the red curve in the right-
hand panel.
• Like the training MSE, the test MSE initially declines as the level of
flexibility increases.
• At a certain point, however, the test MSE levels off and then starts to
increase again.
Model Selection
• The orange and green curves both have high test MSE.
• The blue curve minimizes the test MSE, because of the three curves it
appears to estimate 𝑓 best.
• The horizontal dashed line represents the irreducible error, 𝑉𝑎𝑟(𝜖)
that is the lowest achievable test MSE among all possible methods.
• Hence, the smoothing spline represented by the blue curve is close to
optimal.
Model Selection
• From the right-hand side panel, we can see that as the flexibility of the statistical
learning method increases, we observe a monotone decrease in the training MSE
and a U-shape in the test MSE.
• This is a fundamental property of machine learning that holds regardless of the
particular data set at hand and regardless of the statistical method being used.
• As model flexibility increases, training MSE will decrease, but the test MSE may
not.
• When a given method yields a small training MSE but a large test MSE, we are
said to be overfitting the data.
• This happens because our machine learning procedure is working too hard to find
patterns in the training data, and may be picking up patterns that are caused by
random chance rather than by true properties of the unknown function 𝑓.
Model Selection
• When we overfit the training data, the test MSE will be very large
because the supposed patterns that the method found in the training
data simply don’t exist in the test data.
• Note that regardless of whether or not overfitting has occurred, we
almost always expect the training MSE to be smaller than the test
MSE because most machine learning methods either directly or
indirectly seek to minimize the training MSE.
• Overfitting refers specifically to the case in which a less flexible model
would have yielded a smaller test MSE.
Model Selection
Figure 10 provides another example, in which the true 𝑓 is approximately linear. We can
notice that the training MSE decreases monotonically as the model flexibility increases,
and that there is a U-shape in the test MSE. However, because the truth is close to linear,
the test MSE only decreases slightly before increasing again, so the orange least
squares fit is substantially better than the highly flexible green curve.
Linear Model
• Given a vector of inputs 𝑋 = (𝑋1, 𝑋2, …, 𝑋𝑝), the linear model predicts
the output via
𝑌̂ = 𝛽̂0 + Σ(𝑗=1 to 𝑝) 𝑋𝑗𝛽̂𝑗
• The term 𝛽̂0 is the intercept. This is also known as the bias in machine
learning.
Linear Model
• We can include the constant variable 1 in 𝑋 and include 𝛽̂0 in the
vector of coefficients 𝛽̂. Hence, we can write the linear model as
𝑌̂ = 𝑋ᵀ𝛽̂
• Here 𝑋ᵀ denotes the vector or matrix transpose, and 𝑓(𝑋) = 𝑋ᵀ𝛽 is
linear in 𝑋.
• How can we fit the linear model to a set of training data?
• Many different methods are available. The most popular is least
squares.
Linear Model
• In this method, we pick the coefficients 𝛽 to minimize the residual sum of
squares
RSS(𝛽) = Σ(𝑖=1 to 𝑁) (𝑦𝑖 − 𝑥𝑖ᵀ𝛽)²
• RSS(𝛽) is a quadratic function of the parameters, so its minimum always
exists, although it may not be unique. In matrix notation we can write
RSS(𝛽) = (y − X𝛽)ᵀ(y − X𝛽)
• Differentiating with respect to 𝛽, we get the normal equations
Xᵀ(y − X𝛽) = 0
• If XᵀX is nonsingular, the unique solution is
𝛽̂ = (XᵀX)⁻¹Xᵀy
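The closed-form solution from the normal equations can be verified numerically; a minimal sketch on simulated data, where the 𝛽 values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

N = 100
# X includes the constant variable 1 as its first column
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 2.0, -1.0])            # arbitrary illustrative values
y = X @ beta + rng.normal(0.0, 0.5, size=N)

# Normal equations X^T (y - X beta) = 0  =>  beta_hat = (X^T X)^{-1} X^T y;
# np.linalg.solve avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def rss(b):
    # Residual sum of squares (y - Xb)^T (y - Xb)
    r = y - X @ b
    return float(r @ r)
```

`beta_hat` minimizes the RSS, so `rss(beta_hat)` is no larger than the RSS at any other coefficient vector, including the true one.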
Linear Model - Classification
• Let’s consider an example of the linear model in a classification context.
• Figure 12 shows a scatterplot of training data on a pair of inputs 𝑋1 and 𝑋2 .
• The output class variable 𝐺 has the values BLUE or ORANGE.
• These values have been represented in the scatterplot. There are 100
points in each of the two classes.
• The linear regression model was used to fit these data.
• The response 𝑌 was coded as 0 for BLUE and 1 for ORANGE.
• The fitted values 𝑌̂ are converted to a fitted class variable 𝐺̂ according to
the rule
𝐺̂ = ORANGE if 𝑌̂ > 0.5, BLUE if 𝑌̂ ≤ 0.5
Linear Model - Classification
• A classification example in two dimensions.
The classes are coded as a binary variable
(BLUE = 0, ORANGE = 1), and then fit by linear
regression.
• The line is the decision boundary defined by
𝑋 𝑇 𝛽መ = 0.5.
• The decision boundary in this case is linear.
• The orange shaded region shows the part of
input space that is classified as ORANGE,
while the blue region is classified as BLUE.
• In this case, we can notice several
misclassifications on both sides of the
decision boundary.
• This may be because either our model is too
rigid or such errors are unavoidable.
Model Accuracy - Classification
• In the classification setting, 𝑦𝑖 is no longer numerical.
• Suppose we want to estimate 𝑓 on the basis of training data
{(𝑥1, 𝑦1), …, (𝑥𝑛, 𝑦𝑛)}, where 𝑦1, …, 𝑦𝑛 are qualitative.
• We can use the training error rate to quantify the accuracy of our
estimate 𝑓̂.
• The training error rate is the proportion of mistakes that we make if we
apply 𝑓̂ to the training observations:
(1/𝑛) Σ(𝑖=1 to 𝑛) 𝐼(𝑦𝑖 ≠ 𝑦̂𝑖)
Model Accuracy - Classification
• Here 𝑦̂𝑖 is the predicted class label for the 𝑖th observation using 𝑓̂.
• 𝐼(𝑦𝑖 ≠ 𝑦̂𝑖) is an indicator variable that equals 1 if 𝑦𝑖 ≠ 𝑦̂𝑖 and 0 if 𝑦𝑖 = 𝑦̂𝑖.
• If 𝐼(𝑦𝑖 ≠ 𝑦̂𝑖) = 0, the 𝑖th observation was correctly classified; otherwise it
was misclassified.
• The equation above is referred to as the training error rate because it is
computed using the training data.
• However, we are interested in the error rates that are based on test data, which
was not used in the training.
• The test error rate associated with a set of test observations of the form
(𝑥0, 𝑦0) is given by
Ave(𝐼(𝑦0 ≠ 𝑦̂0))
• Here 𝑦̂0 is the predicted class label that results from applying the classifier
to a test observation with predictor 𝑥0.
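The error-rate computation itself is a one-liner; the labels below are hypothetical:

```python
import numpy as np

# Hypothetical labels: true classes and a classifier's predictions
y_true = np.array(["A", "B", "A", "A", "B"])
y_pred = np.array(["A", "B", "B", "A", "B"])

# Error rate = (1/n) * sum of I(y_i != yhat_i); one mistake out of five here
error_rate = float(np.mean(y_true != y_pred))
print(error_rate)  # 0.2
```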
Bayes Classifier
• The test error rate can be minimized by using a simple classifier that
assigns each observation to the most likely class, given its predictor values.
• Hence, we assign a test observation with predictor vector 𝑥0 to the class 𝑗
for which
Pr(𝑌 = 𝑗 | 𝑋 = 𝑥0)
is largest. This is a conditional probability: the probability that 𝑌 = 𝑗,
given the observed predictor vector 𝑥0.
• This simple classifier is called the Bayes Classifier.
• If we have a two-class problem where only two response values, say class 1
or class 2, are possible, the Bayes classifier predicts class 1 if
Pr 𝑌 = 1 𝑋 = 𝑥0 > 0.5, and class 2 otherwise.
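For a two-class problem where the conditional probability is known, the Bayes classifier reduces to a one-line rule. The logistic form of Pr(𝑌 = 1 | 𝑋 = 𝑥0) below is purely an illustrative assumption:

```python
import math

def p_class1(x0):
    # Assumed known conditional probability Pr(Y = 1 | X = x0);
    # the logistic form is purely illustrative
    return 1.0 / (1.0 + math.exp(-x0))

def bayes_classify(x0):
    # Predict class 1 if Pr(Y = 1 | X = x0) > 0.5, class 2 otherwise
    return 1 if p_class1(x0) > 0.5 else 2

print(bayes_classify(1.3), bayes_classify(-0.7))  # 1 2
```

The set of points where `p_class1(x0)` equals exactly 0.5 (here, 𝑥0 = 0) is the Bayes decision boundary discussed next.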
Bayes Classifier
• The figure corresponds to an example using a simulated data set in a two-
dimensional space consisting of predictors 𝑋1 and 𝑋2.
• In the figure, the orange and blue circles represent training observations
belonging to two different classes.
• For each value of 𝑋1 and 𝑋2, the probability of the response being orange or
blue is different.
• Since the data are simulated and we know how the data set was generated, we
can calculate the conditional probability for each value of 𝑋1 and 𝑋2.
• The orange shaded region represents the set of points for which
Pr(𝑌 = orange | 𝑋) is greater than 50%.
• The blue shaded region represents the set of points where that probability is
less than 50%.
• The purple dashed line represents the points where the probability is exactly
50%. This line is known as the Bayes decision boundary.
Bayes Classifier
• We can determine the Bayes classifier prediction using the Bayes
decision boundary.
• An observation that falls on the orange side of the boundary will be
assigned to the orange class whereas the observation on the blue side
of the boundary will be assigned to the blue class.
• The Bayes classifier produces the lowest possible test error rate, which
is known as the Bayes error rate.
• The Bayes error rate is analogous to the irreducible error.
K Nearest Neighbours
• Theoretically, we would always prefer to predict a qualitative response
using the Bayes classifier.
• However, for real data, we do not know the conditional distribution of
𝑌 given 𝑋.
• Therefore, computing the Bayes classifier is impossible.
• Hence, the Bayes classifier serves as an unattainable gold standard
against which other methods can be compared.
• Many approaches attempt to estimate the conditional distribution of
𝑌 given 𝑋.
• We then classify a given observation to the class with the highest
estimated probability.
K Nearest Neighbours
• The K-nearest neighbours (KNN) classifier is one of the most popular such
methods.
• Given a positive integer 𝐾 and a test observation 𝑥0, the KNN classifier
first identifies the 𝐾 training observations that are nearest to 𝑥0,
represented by 𝑁0.
• The second step is to estimate the conditional probability for class 𝑗 as the
fraction of points in 𝑁0 whose response values equal 𝑗:
Pr(𝑌 = 𝑗 | 𝑋 = 𝑥0) = (1/𝐾) Σ(𝑖 ∈ 𝑁0) 𝐼(𝑦𝑖 = 𝑗)
• In the last step, KNN applies Bayes rule and classifies the test observation
𝑥0 to the class that has the highest probability.
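The three steps can be sketched as a short from-scratch implementation; the training points below are hypothetical:

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K):
    """Classify x0 by the KNN rule: find the K nearest training points (N0),
    estimate class probabilities as vote fractions, and take the largest."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distance to x0
    n0 = np.argsort(dists)[:K]                     # indices of the K nearest points
    labels, counts = np.unique(y_train[n0], return_counts=True)
    return labels[np.argmax(counts)]               # class with highest fraction in N0

# Hypothetical training set: six "blue" and six "orange" points
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                    [0.5, 0.5], [0.2, 0.8],
                    [3.0, 3.0], [3.0, 4.0], [4.0, 3.0], [4.0, 4.0],
                    [3.5, 3.5], [3.2, 3.8]])
y_train = np.array(["blue"] * 6 + ["orange"] * 6)

print(knn_predict(X_train, y_train, np.array([0.4, 0.6]), K=3))  # blue
```

With 𝐾 = 3, the neighbourhood of the query point contains only blue points here, so the estimated probability for blue is 3/3 and the prediction is blue.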
K Nearest Neighbours
• An example of the KNN approach is given
in figure 15.
• The left side panel demonstrates a small
training data set that consists of six blue
and six orange observations.
• The objective is to predict the label for the
point marked by the black cross. Let’s
choose 𝐾 = 3.
• The first step for KNN would be to identify
the three observations that are nearest to
the cross. We can show the neighborhood
as a circle.
• It consists of two blue points and one
orange point.
K Nearest Neighbours
• That results in estimated
probabilities of 2/3 for the blue
class and 1/3 for the orange class.
• Thus, the KNN prediction would be
that the black cross belongs to the
blue class.
• In the right-hand panel of Figure 15,
KNN approach with 𝐾 = 3 has
been applied to all of the possible
values for 𝑋1 and 𝑋2 .
• Based on the classification approach
as given above, the KNN decision
boundary has been drawn.
K Nearest Neighbours
• Even though KNN is a very simple
approach, it often produces results
that are close to the optimal Bayes
classifier.
• In Figure 16, we have displayed the KNN
decision boundary, using 𝐾 = 10,
applied to the larger simulated data
set.
• We can see that even though the KNN
classifier is not aware of the true
distribution, the KNN decision boundary
and the Bayes decision boundary are
very close.
• In this case, the test error rate using
KNN is 0.1363, whereas the Bayes
error rate is 0.1304; the two are very
close.
K Nearest Neighbours
• The results from the KNN classifier depend strongly
on the choice of 𝐾. Figure 17 shows two KNN fits, using
𝐾 = 1 and 𝐾 = 100, on the data shown in Figure 14.
• With 𝐾 = 1, the decision boundary is very
flexible and finds patterns in the data that do not
correspond to the Bayes decision boundary.
• Hence, we get a classifier that has low bias but very high
variance.
• As we increase the value of 𝐾, the method becomes less
flexible and produces a decision boundary that is close to
linear.
• Hence, we get a classifier that has low variance but
high bias.
• On this data set, neither 𝐾 = 1 nor 𝐾 = 100 gives
good predictions.
• The test error rates are 0.1695 and 0.1925, respectively.
K Nearest Neighbours
• Just as in the regression setting, there is no strong relationship between
the training error rate and the test error rate.
• With 𝐾 = 1, the KNN training error rate is 0, but the test error rate may be
quite high. In general, as we use more flexible classification models, the
training error rate will decline but the test error rate may not.
• Figure 18 shows the KNN test and training errors as a function of 1/𝐾. As 1/𝐾
increases, the method becomes more flexible.
• As in the regression setting, the training error rate consistently declines as
the flexibility increases.
• However, the test error exhibits a characteristic U-shape: it declines at first
(with a minimum at approximately 𝐾 = 10) before increasing again when the
method becomes excessively flexible and overfits.
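The effect of 𝐾 can be sketched on simulated data (the class distributions here are invented, not the book's data set). With 𝐾 = 1 the training error is exactly zero, because each training point is its own nearest neighbour:

```python
import numpy as np

rng = np.random.default_rng(5)

def knn_predict(X_tr, y_tr, X_te, K):
    # Majority vote among the K nearest training points, for each test point
    preds = []
    for x0 in X_te:
        n0 = np.argsort(np.linalg.norm(X_tr - x0, axis=1))[:K]
        labels, counts = np.unique(y_tr[n0], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def make_data(n):
    # Two overlapping Gaussian classes (invented for illustration)
    X = np.vstack([rng.normal(-1.0, 1.2, size=(n, 2)),
                   rng.normal(1.0, 1.2, size=(n, 2))])
    y = np.r_[np.zeros(n), np.ones(n)]
    return X, y

X_tr, y_tr = make_data(100)
X_te, y_te = make_data(500)

# K = 1: training error is 0 by construction, but the very flexible
# boundary typically generalizes worse than a moderate K
train_err_k1 = float(np.mean(knn_predict(X_tr, y_tr, X_tr, 1) != y_tr))
test_err_k1 = float(np.mean(knn_predict(X_tr, y_tr, X_te, 1) != y_te))
test_err_k10 = float(np.mean(knn_predict(X_tr, y_tr, X_te, 10) != y_te))
print(train_err_k1)  # 0.0
```

The zero training error at 𝐾 = 1 says nothing about test performance; because the classes overlap, the test error stays well above zero for every choice of 𝐾.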
K Nearest Neighbours
• Hence, for both regression and classification models, choosing the
correct level of flexibility is critical to the success of any machine
learning method.
• The bias-variance tradeoff, and the resulting U-shape in the test error,
can make this a challenging task.
Thanks
Samatrix Consulting Pvt Ltd