

Regularization

Regression Problems
In binary classification, we have to predict one of two classes for each data point i.e. $y \in \{-1, +1\}$
In regression, we have to predict a real value i.e. $y \in \mathbb{R}$
Training data looks like $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ where $\mathbf{x}^i \in \mathbb{R}^d$ and $y^i \in \mathbb{R}$
Predicting the price of a stock, predicting the change in price of a stock, predicting test scores of a student, etc. can be solved using regression
Let us look at a few ways to solve regression problems as well as loss functions for regression problems
Recall: logistic regression is not a way to perform regression, it is a
way to perform binary classification
Loss Functions for Regression Problems
Can use linear models to solve regression problems too i.e. learn a $\mathbf{w} \in \mathbb{R}^d$ (and a bias $b \in \mathbb{R}$) and predict the score for a test data point $\mathbf{x}$ as $\hat{y} = \mathbf{w}^\top\mathbf{x} + b$
Need loss functions that define what they think is bad behaviour
Absolute Loss: $\ell(y, \hat{y}) = |y - \hat{y}|$
Intuition: a model is doing badly if $\hat{y}$ is either much larger than $y$ or much smaller than $y$
Squared Loss: $\ell(y, \hat{y}) = (y - \hat{y})^2$
Intuition: a model is doing badly if $\hat{y}$ is either much larger than $y$ or much smaller than $y$. Also I want the loss fn to be differentiable so that I can take gradients etc
Actually, even these loss fns can be derived from basic principles. We will see this soon.
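A minimal Python (numpy) sketch of these two losses for a single data point, with the prediction written as $\hat{y} = \mathbf{w}^\top\mathbf{x} + b$; all numbers below are illustrative only:

import numpy as np

def absolute_loss(y, y_hat):
    # |y - y_hat|: penalizes over- and under-prediction equally, grows linearly
    return np.abs(y - y_hat)

def squared_loss(y, y_hat):
    # (y - y_hat)^2: also symmetric, but differentiable everywhere and
    # penalizes large deviations much more heavily
    return (y - y_hat) ** 2

w, b = np.array([0.5, -1.0]), 1.0
x, y = np.array([2.0, 1.0]), 3.0
y_hat = w @ x + b                   # linear model prediction
print(absolute_loss(y, y_hat))      # 2.0
print(squared_loss(y, y_hat))       # 4.0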
Loss Functions for Regression Problems
Loss functions have to be chosen according to the needs of the problem and experience, e.g. the Huber loss is popular if some data points are corrupted
However, some loss functions are very popular in ML for various reasons e.g. hinge/cross entropy for binary classification, squared for regression

Vapnik's ε-insensitive Loss:
Intuition: if a model is doing slightly badly, don't penalize it at all, else penalize it as squared loss. Ensure a differentiable fn
Huber Loss:
Intuition: if a model is doing slightly badly, penalize it as squared loss, if doing very badly, penalize it as absolute loss. Also, please ensure a differentiable fn
Indeed, notice that the Huber loss function penalizes large deviations less strictly than the squared loss function (in the plot, the purple curve is much below the dotted red curve). Doing so may be a good idea in corrupted data settings since it tells the optimizer to not worry too much about corrupted points
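A Python sketch of the Huber and ε-insensitive losses as functions of the residual $r = y - \hat{y}$, using standard textbook forms; the constants and the squared penalty beyond ε are assumptions, and the lecture's own definitions may differ slightly:

import numpy as np

def huber_loss(r, delta=1.0):
    # quadratic (squared-loss-like) for small residuals, linear (absolute-loss-like) beyond delta
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

def eps_insensitive_loss(r, eps=0.5):
    # zero penalty for residuals within +/- eps, squared penalty on the excess beyond that
    return np.maximum(0.0, np.abs(r) - eps) ** 2

r = np.linspace(-3, 3, 7)
print(huber_loss(r))            # grows only linearly for |r| > delta
print(eps_insensitive_loss(r))  # exactly zero for |r| <= eps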
Finding good parameters
Age   Highest education (HS, BT, MT, PhD)   Gender (M, F, O)   True Salary
45    0 0 1 0                               1 0 0              64475
22    0 1 0 0                               0 1 0              34179
28    1 0 0 0                               1 0 0              34573
34    0 0 1 0                               0 1 0              50882
47    1 0 0 0                               0 1 0              38660
55    0 0 1 0                               1 0 0              71487
49    0 0 0 1                               0 1 0              79430
27    0 1 0 0                               0 0 1              34355
25    0 1 0 0                               1 0 0              43837
We suspect that the following law holds: salary is some fixed combination of age, education and gender. Actually, just a linear model $y \approx \mathbf{w}^\top\mathbf{x} + b$
$\mathbf{x}^i \in \mathbb{R}^8$ (age plus one-hot education plus one-hot gender) is the feature vector for the i-th person in our training set
Use average loss to find good params: $\hat{\mathbf{w}}, \hat{b} = \arg\min_{\mathbf{w}, b} \frac{1}{n}\sum_{i=1}^n \ell(y^i, \mathbf{w}^\top\mathbf{x}^i + b)$
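A sketch of how rows of the table above turn into feature vectors and how the average squared loss of a candidate model might be computed; the candidate weights below are arbitrary placeholders, not learned values:

import numpy as np

# each row: age, one-hot education (HS, BT, MT, PhD), one-hot gender (M, F, O)
X = np.array([
    [45, 0, 0, 1, 0, 1, 0, 0],
    [22, 0, 1, 0, 0, 0, 1, 0],
    [28, 1, 0, 0, 0, 1, 0, 0],
])
y = np.array([64475, 34179, 34573])

w = np.zeros(8)      # arbitrary candidate model (placeholder values)
b = 40000.0          # arbitrary candidate bias

avg_loss = np.mean((y - (X @ w + b)) ** 2)   # average squared loss over the training data
print(avg_loss)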
Ridge Regression
Ridge regression uses least squares loss and L2 regularization. However the term
"least squares regression" is often used to refer to ridge regression 
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^n (y^i - \mathbf{w}^\top\mathbf{x}^i)^2 + \lambda \cdot \|\mathbf{w}\|_2^2$
Ignore the bias $b$ for sake of notational simplicity
Could have used a formulation with a constant $C$ multiplying the losses (as in CSVM) too but it is customary in regression to write the optimization problem a bit differently, with $\lambda$ multiplying the regularizer
Can rewrite as $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|_2^2 + \lambda \cdot \|\mathbf{w}\|_2^2$ with $X = [\mathbf{x}^1, \ldots, \mathbf{x}^n]^\top \in \mathbb{R}^{n \times d}$ and $\mathbf{y} = [y^1, \ldots, y^n]^\top \in \mathbb{R}^n$
Can apply first order optimality to obtain a solution (convex objective)
Gradient: $\nabla f(\mathbf{w}) = 2X^\top(X\mathbf{w} - \mathbf{y}) + 2\lambda\mathbf{w}$
Gradient must vanish at the minimum so we must have $(X^\top X + \lambda I)\hat{\mathbf{w}} = X^\top\mathbf{y}$ i.e. $\hat{\mathbf{w}} = (X^\top X + \lambda I)^{-1}X^\top\mathbf{y}$
Even here, we may use dual methods like SDCA (liblinear etc do indeed use them) but yet again, deriving the dual is not as straightforward here since there are no constraints in the original problem
Takes time to invert the $d \times d$ matrix – may use faster methods like (S)GD
If we want to use fancier loss functions like Vapnik's loss function, then we cannot apply first order optimality, need to use SGD, SDCA etc methods
Much faster methods available e.g. conjugate gradient method
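A minimal numpy sketch of the closed-form ridge solution above; np.linalg.solve is used rather than an explicit matrix inverse, and the data is synthetic:

import numpy as np

def ridge_closed_form(X, y, lam):
    # solve (X^T X + lam * I) w = X^T y, i.e. the first order optimality condition
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(ridge_closed_form(X, y, lam=0.1))   # close to w_true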
Behind the scenes: GD for Ridge Regression
(ignore bias for now)
GD and other optimization techniques merely try to obtain models that obey the rules of good behaviour as encoded in the loss function
The objective is $f(\mathbf{w}) = \sum_{i=1}^n (y^i - \mathbf{w}^\top\mathbf{x}^i)^2 + \lambda \cdot \|\mathbf{w}\|_2^2$ and the GD update is $\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \cdot \nabla f(\mathbf{w}^t)$, where $\eta > 0$ is the step length
Assume a single data point $(\mathbf{x}, y)$ for a moment for sake of understanding: its contribution to the update is $-\eta \cdot 2(\mathbf{w}^{t\top}\mathbf{x} - y)\mathbf{x}$
If $\mathbf{w}^t$ predicts the data point perfectly i.e. $\mathbf{w}^{t\top}\mathbf{x} = y$, there is no change to $\mathbf{w}$ due to the data point
If $\mathbf{w}^t$ does well on $(\mathbf{x}, y)$, say $\mathbf{w}^{t\top}\mathbf{x} \approx y$, then the update is small i.e. $\mathbf{w}^{t+1}$ stays closer to $\mathbf{w}^t$
If $\mathbf{w}^t$ does badly on $(\mathbf{x}, y)$, say $\mathbf{w}^{t\top}\mathbf{x} < y$, then GD will try to increase the value of $\mathbf{w}^{t\top}\mathbf{x}$
Small $\eta$: even if the gradient is large, do not change $\mathbf{w}$ too much!
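A sketch of plain gradient descent on the ridge objective, matching the update discussed above; the step length and iteration count are illustrative choices, not tuned values:

import numpy as np

def ridge_gd(X, y, lam, iters=1000):
    # objective: ||Xw - y||^2 + lam * ||w||^2
    w = np.zeros(X.shape[1])
    eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 + 2 * lam)   # a safe, conservative step length
    for _ in range(iters):
        grad = 2 * X.T @ (X @ w - y) + 2 * lam * w   # vanishes only at the minimum
        w = w - eta * grad                           # small eta: do not change w too much
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
print(ridge_gd(X, y, lam=0.1))   # approaches the closed-form solution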
Regularization
An umbrella term used in ML to describe a whole family of steps taken to prevent ML algos from suffering from problems in data
Makes sense since regularization is supposed to protect us from data issues

These help ML algos offer stable behaviour even if data misbehaves


In an ideal world where data is perfectly clean and there is plenty of data available, there would be no need for any regularization!
So regularization can be somewhat data dependent
How to do regularization is often decided without looking at data
However, regularization usually involves its own hyperparameters that need
to be tuned using data itself (using validation techniques)
In general, regularization techniques prevent the model from just
blindly doing well on data (since data cannot be trusted)
Frequently, regularization can also make the optimization problem well-
posed and the solution unique – this helps optimizers (SGD etc) immensely
Regularization by adding a regularizer
A regularizer essentially tells the optimizer to not just blindly return a model that does well on data (according to the loss function), but rather return a model that does well and is simple. The L2 regularizer defines simplicity using the L2 norm (or Euclidean length). A model is simple if it has small L2 norm

$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^n \ell(\mathbf{w}; \mathbf{x}^i, y^i) + \lambda \cdot \|\mathbf{w}\|_2^2$
The term $\|\mathbf{w}\|_2^2$ above is called the L2 regularizer (or the squared L2 regularizer if we want to be very specific)
In binary classification, simple models also had large margins. However, be careful not to over regularize. If you use a very large value of the regularization constant e.g. a huge $\lambda$ (or a tiny $C$ in SVM) then you may get a very useless model that does not fit data at all i.e. does not care to do well on data at all!
In binary classification settings, we saw that this regularizer encourages a large margin
In regression settings it ensures uniqueness of the solution
Recall that the closed form solution is $\hat{\mathbf{w}} = (X^\top X + \lambda I)^{-1}X^\top\mathbf{y}$
If $X^\top X$ is non-invertible then we have infinitely many solutions if $\lambda = 0$
Having $\lambda > 0$ ensures a unique solution no matter what the data
The key is moderation. Usually regularization constants are chosen using validation. Other important considerations include: how noisy do we expect data to be and how much data do we have. Rule of thumb: as you have more and more data, you can safely afford to regularize less and less
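A small numerical illustration of the uniqueness point above, assuming a synthetic setting with more features than data points so that $X^\top X$ is singular:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))            # n = 5 data points, d = 20 features
y = rng.normal(size=5)

A = X.T @ X                             # 20 x 20 but rank at most 5: not invertible
print(np.linalg.matrix_rank(A))         # 5

lam = 0.1
A_reg = A + lam * np.eye(20)            # adding lam * I makes it full rank
print(np.linalg.matrix_rank(A_reg))     # 20, so the ridge solution is unique
w_hat = np.linalg.solve(A_reg, X.T @ y)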
Other popular regularizers
The other most popular regularizer is the L1 regularizer $\|\mathbf{w}\|_1 = \sum_{j=1}^d |w_j|$
LASSO: $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^n (y^i - \mathbf{w}^\top\mathbf{x}^i)^2 + \lambda \cdot \|\mathbf{w}\|_1$
L1-reg SVM: $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|_1 + C \cdot \sum_{i=1}^n \max\{0, 1 - y^i \cdot \mathbf{w}^\top\mathbf{x}^i\}$
If you pay close attention, then the dual of CSVM also has L1 regularization on $\boldsymbol{\alpha}$. Since $\alpha_i \ge 0$, $\|\boldsymbol{\alpha}\|_1 = \sum_i \alpha_i$. Note that the dual does indeed have sparse solutions (i.e. not all $\alpha_i > 0$) which means not every vector becomes a support vector!
The L1 regularizer prefers model vectors that have lots of coordinates
whose value is either 0 or close to 0 – called sparse vectors
Often, we make coordinates close to zero actually zero to save space
Sparse models are faster at test time, also consume less memory
Very popular in high dimensional problems e.g. when $d$ runs into the millions
Since L1 norm is non-differentiable, need to use subgradient methods
although proximal gradient descent does much better in general
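A sketch of proximal gradient descent (ISTA) for the LASSO objective written above; the soft-thresholding step is the proximal operator of the L1 norm, and the data, λ and iteration count are illustrative:

import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1: shrink every coordinate towards 0, set small ones exactly to 0
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, iters=2000):
    # objective: ||Xw - y||^2 + lam * ||w||_1
    w = np.zeros(X.shape[1])
    eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # safe step length for the smooth part
    for _ in range(iters):
        grad = 2 * X.T @ (X @ w - y)              # gradient of the squared loss part only
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[[0, 3]] = [2.0, -1.5]                      # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=100)
print(lasso_ista(X, y, lam=10.0))                 # most coordinates come out exactly 0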
Regularization by Early Stopping
Sometimes, ML practitioners stop an optimizer well before it has
solved the optimization problem fully – sometimes due to timeout
but sometimes deliberately
Note that this automatically prevents the model from fitting the data
too closely (this is good if the data was noisy or had outliers)
This often happens implicitly with complex optimization problems
e.g. training deep networks where the person training the network
gets tired and gives up training – gets some regularization for free 
Be careful not to misuse this – if you stop too early, you may just get
an overregularized model that does not fit data at all i.e. does not do
well on data at all!
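A sketch of deliberate early stopping, assuming a held-out validation set: run gradient descent on the (unregularized) squared loss and stop once validation loss stops improving, instead of solving the problem fully. The patience threshold and other details are illustrative:

import numpy as np

def gd_early_stopping(X_tr, y_tr, X_val, y_val, max_iters=10000, patience=20):
    w = np.zeros(X_tr.shape[1])
    eta = 1.0 / (2 * np.linalg.norm(X_tr, 2) ** 2)    # safe step length
    best_w, best_val, waited = w.copy(), np.inf, 0
    for _ in range(max_iters):
        w = w - eta * 2 * X_tr.T @ (X_tr @ w - y_tr)  # plain squared loss, no explicit regularizer
        val_loss = np.mean((X_val @ w - y_val) ** 2)
        if val_loss < best_val:
            best_w, best_val, waited = w.copy(), val_loss, 0
        else:
            waited += 1
            if waited >= patience:                    # validation loss stopped improving: stop early
                break
    return best_w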
Regularization by adding noise
A slightly counter-intuitive way of regularization (considering that
regularization is supposed to save us from noise in data)
Add controlled noise to data so that the model learns to perform well
despite noise – note that it does not fit the data exactly here either
Most well-known instance of this technique is the practice of dropout in
deep learning – randomly make features go missing (see the sketch after this list)
Related methods:
Learn from not entire data, but a subset of the data that seems clean
Use a corruption-aware loss function like the Huber loss in robust regression
Of the features present, choose only those that are informative, non-noisy
Called sparse recovery: e.g. LASSO does sparse recovery for least squares loss function
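A sketch of the dropout-style noise addition mentioned above: each feature of each training point is randomly zeroed out with probability p, so the model cannot rely too heavily on any single feature. The probability p and the rescaling of survivors are illustrative choices:

import numpy as np

def dropout_features(X, p=0.2, rng=None):
    # randomly make features go missing; rescale survivors so each feature keeps its expected value
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(X.shape) >= p
    return X * mask / (1.0 - p)

# during training, a fresh noisy copy of the data could be drawn at every pass
# X_noisy = dropout_features(X_train, p=0.2)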
Other forms of regularization
Regularization by imposing convex constraints e.g. $\hat{\mathbf{w}} = \arg\min_{\|\mathbf{w}\| \le r} \sum_{i=1}^n \ell(\mathbf{w}; \mathbf{x}^i, y^i)$
The constraint set (feasible set) is a convex set here (a norm ball)
Regularization by imposing non-convex constraints e.g. $\hat{\mathbf{w}} = \arg\min_{\|\mathbf{w}\|_0 \le k} \sum_{i=1}^n \ell(\mathbf{w}; \mathbf{x}^i, y^i)$
Forcing us to choose only $k$ features to solve the problem
Called "sparse recovery" – related to compressed sensing
The feasible set is a non-convex set here
For a vector $\mathbf{v}$, the term $\|\mathbf{v}\|_0$ refers to the number of non-zeros in the vector. It is not actually a norm although people frequently abuse terminology to call it the "$L_0$ norm".
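A sketch of the two constraint-based ideas above, written as projection steps that could be used inside projected gradient descent: projecting onto a Euclidean ball (convex) and hard-thresholding to at most k non-zero coordinates (non-convex, the projection used in sparse recovery). The radius r and sparsity k are illustrative:

import numpy as np

def project_l2_ball(w, r):
    # convex constraint ||w||_2 <= r: rescale the model if it lies outside the ball
    norm = np.linalg.norm(w)
    return w if norm <= r else w * (r / norm)

def project_l0_ball(w, k):
    # non-convex constraint ||w||_0 <= k: keep only the k largest-magnitude coordinates
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-k:]
    out[idx] = w[idx]
    return out

w = np.array([0.1, -3.0, 0.5, 2.0])
print(project_l2_ball(w, r=1.0))   # shrunk onto the L2 ball
print(project_l0_ball(w, k=2))     # only the two largest coordinates survive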
