Regression Problems
In binary classification, we have to predict one of two classes for each data point i.e. $y^i \in \{-1, +1\}$
In regression, we have to predict a real value i.e. $y^i \in \mathbb{R}$
Training data looks like $\{(\mathbf{x}^i, y^i)\}_{i=1}^{n}$ where $\mathbf{x}^i \in \mathbb{R}^d$ and $y^i \in \mathbb{R}$
Predicting the price of a stock, predicting the change in price of a stock, predicting the test scores of a student, etc. can all be solved using regression
Let us look at a few ways to solve regression problems, as well as loss functions for regression problems
Recall: logistic regression is not a way to perform regression; it is a way to perform binary classification
Loss Functions for Regression Problems
Can use linear models to solve regression problems too i.e. learn a $\mathbf{w} \in \mathbb{R}^d$ and predict the score for a test data point $\mathbf{x}$ as $\hat{y} = \langle\mathbf{w}, \mathbf{x}\rangle$
Need loss functions that define what counts as bad behaviour
Absolute Loss: $\ell_{\mathrm{abs}}(y, \hat{y}) = |y - \hat{y}|$
Intuition: a model is doing badly if $\hat{y}$ is either much larger than $y$ or much smaller than $y$
Squared Loss: $\ell_{\mathrm{sq}}(y, \hat{y}) = (y - \hat{y})^2$
Intuition: a model is doing badly if $\hat{y}$ is either much larger than $y$ or much smaller than $y$. Also, I want the loss fn to be differentiable so that I can take gradients etc.
Actually, even these loss fns can be derived from
basic principles. We will see this soon.
Loss Functions for Regression Problems
Loss functions have to be chosen according to the needs of the problem and experience. E.g. Huber loss is popular if some data points are corrupted.
However, some loss functions are very popular in ML for various reasons
e.g. hinge/cross entropy for binary classification, squared for regression
Indeed, notice that the Huber loss function penalizes large deviations less than the squared loss function (the purple curve is much below the dotted red curve). Doing so may be a good idea in corrupted data settings since it tells the optimizer to not worry too much about corrupted points
Vapnik's $\epsilon$-insensitive Loss: $\ell_\epsilon(y, \hat{y}) = \left(\max\{|y - \hat{y}| - \epsilon, 0\}\right)^2$
Intuition: if a model is doing slightly badly, don't penalize it at all, else penalize it as squared loss. Ensure a differentiable fn
Huber Loss:
Intuition: if a model is doing slightly badly, penalize it as squared loss; if it is doing very badly, penalize it as absolute loss. Also, please ensure a differentiable fn (see the sketch below)
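To make these intuitions concrete, here is a minimal NumPy sketch of the four losses above; the thresholds delta and eps are illustrative choices, not values fixed by these slides:

```python
import numpy as np

def absolute_loss(y, y_hat):
    return np.abs(y - y_hat)

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def huber_loss(y, y_hat, delta=1.0):
    # Squared for small deviations, absolute (linear) for large ones;
    # the two pieces are stitched so the function stays differentiable
    r = np.abs(y - y_hat)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def eps_insensitive_loss(y, y_hat, eps=0.1):
    # Zero penalty inside the eps-tube, squared penalty outside it
    # (the squared variant described above, which keeps the fn differentiable)
    return np.maximum(np.abs(y - y_hat) - eps, 0.0) ** 2
```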
Finding good parameters
Age | Highest education (HS, BT, MT, PhD) | Gender (M, F, O) | True Salary
45  | 0 0 1 0 | 1 0 0 | 64475
22  | 0 1 0 0 | 0 1 0 | 34179
28  | 1 0 0 0 | 0 1 0 | 34573
34  | 0 0 1 0 | 0 1 0 | 50882
49  | 0 0 0 1 | 0 1 0 | 79430
27  | 0 1 0 0 | 0 0 1 | 34355
25  | 0 1 0 0 | 1 0 0 | 43837
We suspect that the following law holds: $y \approx \langle\mathbf{w}, \mathbf{x}\rangle$ (actually, just a linear model)
$\mathbf{x}^i \in \mathbb{R}^8$ is the feature vector for the i-th person in our training set
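As a sketch of how such feature vectors could be assembled (the helper below and its encodings are illustrative: age kept as-is, education and gender one-hot encoded):

```python
import numpy as np

EDU = {"HS": 0, "BT": 1, "MT": 2, "PhD": 3}
GENDER = {"M": 0, "F": 1, "O": 2}

def featurize(age, edu, gender):
    # x in R^8: [age, 4-dim one-hot education, 3-dim one-hot gender]
    x = np.zeros(8)
    x[0] = age
    x[1 + EDU[edu]] = 1.0
    x[5 + GENDER[gender]] = 1.0
    return x

# First row of the table: a 45-year-old with an MT degree, gender M
print(featurize(45, "MT", "M"))  # [45.  0.  0.  1.  0.  1.  0.  0.]
```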
Ridge Regression
Ridge regression uses the least squares loss and $\ell_2$ regularization. However, the term “least squares regression” is often used to refer to ridge regression
Ignore the bias for the sake of notational simplicity
Could have used the $C$-weighted form $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^n \ell(y^i, \langle\mathbf{w}, \mathbf{x}^i\rangle)$ too but it is customary in regression to write the optimization problem a bit differently:
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^n \left(y^i - \langle\mathbf{w}, \mathbf{x}^i\rangle\right)^2 + \lambda\|\mathbf{w}\|_2^2$
Can rewrite as $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$ with $X = [\mathbf{x}^1, \ldots, \mathbf{x}^n]^\top \in \mathbb{R}^{n \times d}$ and $\mathbf{y} = [y^1, \ldots, y^n]^\top \in \mathbb{R}^n$
Can apply first order optimality to obtain a solution (convex objective)
Gradient: $\nabla f(\mathbf{w}) = 2X^\top(X\mathbf{w} - \mathbf{y}) + 2\lambda\mathbf{w}$ (recall how to take gradients w.r.t. vectors)
Gradient must vanish at the minimum so we must have $(X^\top X + \lambda I)\hat{\mathbf{w}} = X^\top\mathbf{y}$, i.e. $\hat{\mathbf{w}} = (X^\top X + \lambda I)^{-1} X^\top\mathbf{y}$ (see the sketch below)
Takes time to invert the matrix – may use faster methods e.g. (S)GD
Much faster methods available e.g. conjugate gradient method
Even here, we may use dual methods like SDCA (liblinear etc do indeed use them) but yet again, deriving the dual is not as straightforward here since there are no constraints in the original problem
If we want to use fancier loss functions like Vapnik's loss function, then we cannot apply first order optimality; need to use SGD, SDCA etc methods
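Here is a minimal NumPy sketch of the closed-form solution above, solving the linear system directly rather than forming the inverse (the function name and toy data are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam=1.0):
    # Solve (X^T X + lam*I) w = X^T y without explicitly inverting the matrix
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.1 * rng.normal(size=100)
w_hat = ridge_closed_form(X, y, lam=0.1)
```

Solving the system is preferred to computing the inverse both for speed and numerical stability.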
Behind the scenes: GD for Ridge Regression
(ignore the bias for now) GD and other optimization techniques merely try to obtain models that obey the rules of good behaviour as encoded in the loss function (a sketch follows below)
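A sketch of plain GD on the ridge objective $f(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$, using the gradient derived earlier; the step size eta and iteration count are illustrative choices:

```python
import numpy as np

def ridge_gd(X, y, lam=1.0, eta=1e-3, n_iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Gradient of the ridge objective: 2 X^T (Xw - y) + 2 lam w
        grad = 2 * X.T @ (X @ w - y) + 2 * lam * w
        w = w - eta * grad
    return w
```

For a suitably small step size this converges to the same $\hat{\mathbf{w}}$ as the closed-form solution, without ever inverting a matrix.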
The constraint set (feasible set) is a convex set here (an $\ell_2$ ball)
Regularization by imposing non-convex constraints