Lec 05
Jia-Bin Huang
ECE-5424G / CS-5824 Virginia Tech Spring 2019
Administrative
• Please start HW 1 early!
• $P(X_1, \ldots, X_n \mid Y)$: Need $2(2^n - 1)$ parameters (binary attributes, binary label, no independence assumption)
• $P(Y)$: Need 1 parameter
• Test example: classify a new $x$ by comparing
• $P(y = 1) \prod_i P(x_i \mid y = 1)$
• $P(y = 0) \prod_i P(x_i \mid y = 0)$
Naïve Bayes algorithm – discrete
• For each value $y_k$:
Estimate $\pi_k = P(Y = y_k)$
For each value $x_{ij}$ of each attribute $X_i$:
Estimate $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$
• Classify: $y^{new} = \arg\max_{y_k} P(y_k) \prod_i P(x_i^{new} \mid y_k)$
• Additional assumptions on the variance $\sigma_{ik}$ (Gaussian naïve Bayes):
• Is independent of $Y$ ($\sigma_i$)
• Is independent of $X_i$ ($\sigma_k$)
• Is independent of both $X_i$ and $Y$ ($\sigma$)
• Classify: $y^{new} = \arg\max_{y_k} P(y_k) \prod_i \mathcal{N}(x_i^{new};\, \mu_{ik}, \sigma_{ik})$
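The estimate-then-classify recipe above can be sketched for binary attributes. This is a minimal NumPy sketch with made-up toy data; the Laplace smoothing constant `alpha` is an added assumption, not part of the slide.

```python
import numpy as np

# Hypothetical toy data: 4 binary attributes, binary label.
X = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])
y = np.array([1, 1, 0, 0, 0, 1])

def train_nb(X, y, alpha=1.0):
    """Estimate P(Y = y_k) and P(X_i = 1 | Y = y_k) with Laplace smoothing."""
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}
    # theta[k][i] = P(X_i = 1 | Y = k)
    theta = {k: (X[y == k].sum(axis=0) + alpha) / ((y == k).sum() + 2 * alpha)
             for k in classes}
    return priors, theta

def predict_nb(x, priors, theta):
    """Classify: argmax_k P(y_k) * prod_i P(x_i | y_k), computed in log space."""
    scores = {}
    for k in priors:
        p = theta[k]
        log_lik = np.sum(np.log(np.where(x == 1, p, 1 - p)))
        scores[k] = np.log(priors[k]) + log_lik
    return max(scores, key=scores.get)

priors, theta = train_nb(X, y)
print(predict_nb(np.array([1, 0, 1, 0]), priors, theta))  # 1
```

Working in log space avoids underflow when the product runs over many attributes.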
Logistic Regression
• Hypothesis representation
• Cost function
• Regularization
• Multi-class classification
[Figure: classifying tumors as malignant (y = 1 "Yes") or not (y = 0 "No") by Tumor Size]
Linear regression: $h_\theta(x) = \theta^\top x$
Logistic regression: $h_\theta(x) = g(\theta^\top x)$,
where $g(z) = \dfrac{1}{1 + e^{-z}}$
• Sigmoid function
• Logistic function
Slide credit: Andrew Ng
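The hypothesis above can be sketched in a few lines of NumPy (the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~1 as z -> +inf
print(sigmoid(-10.0))  # ~0 as z -> -inf
```

The squashing to $(0, 1)$ is what lets the output be read as a probability.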
Interpretation of hypothesis output
• $h_\theta(x) = P(y = 1 \mid x; \theta)$: estimated probability that $y = 1$ on input $x$
• Example: If $h_\theta(x) = 0.7$, the model estimates a 70% chance that the tumor is malignant
(Here $z = \theta^\top x$.)
Suppose predict “$y = 1$” if $h_\theta(x) \ge 0.5$
predict “$y = 0$” if $h_\theta(x) < 0.5$
[Figure: training data and a linear decision boundary in the (Tumor Size, Age) plane]
• Predict “$y = 1$” if $\theta^\top x \ge 0$
• Predict “$y = 0$” if $\theta^\top x < 0$
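The two forms of the rule agree because $g$ is monotone increasing with $g(0) = 0.5$. A small sketch (the parameter values are made up for illustration):

```python
import numpy as np

def predict(theta, x):
    """Predict y = 1 iff h_theta(x) >= 0.5, equivalently iff theta^T x >= 0
    (g is monotone increasing and g(0) = 0.5, so the two tests agree)."""
    return 1 if theta @ x >= 0 else 0

# Hypothetical boundary: theta = [-3, 1, 1] predicts y = 1 exactly when
# x1 + x2 >= 3, with x = [1, x1, x2] carrying an intercept term.
theta = np.array([-3.0, 1.0, 1.0])
print(predict(theta, np.array([1.0, 2.0, 2.0])))  # 1, since 2 + 2 >= 3
print(predict(theta, np.array([1.0, 1.0, 1.0])))  # 0, since 1 + 1 < 3
```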
Apply Bayes rule and plug in the Gaussian class-conditional density
$$P(x_i \mid y_k) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_i^2}\right)$$
which yields the logistic form
$$P(y = 1 \mid x) = \frac{1}{1 + \exp\!\left(\ln\frac{1-\pi}{\pi} + \sum_i \left(\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, x_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right)\right)}$$
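A numeric sanity check of this derivation (all parameter values below are made up): the logistic form with the implied weights should match the posterior computed directly from Bayes rule.

```python
import numpy as np

# Hypothetical Gaussian naive Bayes parameters for two features.
mu0 = np.array([1.0, 2.0])    # class-conditional means for y = 0
mu1 = np.array([3.0, 1.0])    # class-conditional means for y = 1
sigma = np.array([1.0, 0.5])  # per-feature std dev, independent of y
pi = 0.4                      # prior P(y = 1)

# Weights implied by the derivation above.
w = (mu0 - mu1) / sigma**2
w0 = np.log((1 - pi) / pi) + np.sum((mu1**2 - mu0**2) / (2 * sigma**2))

def posterior_logistic(x):
    """P(y=1|x) in the logistic form 1 / (1 + exp(w0 + sum_i w_i x_i))."""
    return 1.0 / (1.0 + np.exp(w0 + w @ x))

def posterior_bayes(x):
    """P(y=1|x) computed directly from Bayes rule with Gaussian densities."""
    def lik(mu):
        return np.prod(np.exp(-(x - mu)**2 / (2 * sigma**2))
                       / (np.sqrt(2 * np.pi) * sigma))
    return pi * lik(mu1) / (pi * lik(mu1) + (1 - pi) * lik(mu0))

x = np.array([2.0, 1.5])
print(posterior_logistic(x), posterior_bayes(x))  # the two agree
```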
Logistic Regression
• Hypothesis representation
• Cost function
• Regularization
• Multi-class classification
Training set with $m$ examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$
• Training data: $D = \{\langle x^{(j)}, y^{(j)} \rangle\}_{j=1}^{m}$
• Data likelihood: $\prod_j P(x^{(j)}, y^{(j)} \mid \theta)$
• Data conditional likelihood: $\prod_j P(y^{(j)} \mid x^{(j)}, \theta)$
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
[Figure: the cost curves $-\log(h_\theta(x))$ for $y = 1$ and $-\log(1 - h_\theta(x))$ for $y = 0$]
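The piecewise cost above heavily penalizes confident wrong answers. A minimal sketch (function names are illustrative):

```python
import numpy as np

def cost(h, y):
    """Per-example logistic cost: -log(h) if y = 1, -log(1 - h) if y = 0."""
    return -np.log(h) if y == 1 else -np.log(1 - h)

# A confident correct prediction is cheap, a confident wrong one is expensive:
print(cost(0.99, 1))  # ~0.01
print(cost(0.01, 1))  # ~4.6

def J(theta, X, Y):
    """Average cost over the training set (the cross-entropy loss)."""
    H = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.mean(-Y * np.log(H) - (1 - Y) * np.log(1 - H))
```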
Logistic Regression
• Hypothesis representation
• Cost function
• Regularization
• Multi-class classification
Gradient descent
Goal: $\min_\theta J(\theta)$
Good news: Convex function!
Bad news: No analytical solution
Repeat {
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
(Simultaneously update all $\theta_j$)
}
where $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
Slide credit: Andrew Ng
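The update above, vectorized over all $j$ at once (the toy 1-D data and step size are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.1, iters=5000):
    """Batch gradient descent on the logistic regression cost J(theta).
    Update: theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij,
    applied to all j simultaneously (vectorized)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        H = sigmoid(X @ theta)        # predictions for all m examples
        grad = X.T @ (H - Y) / m      # gradient of J(theta)
        theta = theta - alpha * grad  # simultaneous update of all theta_j
    return theta

# Hypothetical 1-D toy problem (intercept column + one feature):
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, Y)
print(sigmoid(X @ theta))  # each prediction on the correct side of 0.5
```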
Logistic Regression
• Hypothesis representation
• Cost function
• Regularization
• Multi-class classification
How about MAP?
• Maximum conditional likelihood estimate (MCLE): $\hat{\theta}_{MCLE} = \arg\max_\theta \prod_j P(y^{(j)} \mid x^{(j)}, \theta)$
Logistic Regression
• Hypothesis representation
• Cost function
• Regularization
• Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
[Figure: binary vs. multi-class data in the $(x_1, x_2)$ plane]
One-vs-all (one-vs-rest):
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
• Class 1: $h_\theta^{(1)}(x)$
• Class 2: $h_\theta^{(2)}(x)$
• Class 3: $h_\theta^{(3)}(x)$
• Prediction: on a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$
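One-vs-rest can be sketched end to end (the synthetic data and hyperparameters below are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, Y, alpha=0.5, iters=3000):
    """Fit one logistic regression classifier by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - Y) / len(Y)
    return theta

def train_one_vs_rest(X, y, classes):
    """One classifier h^(i) per class: class i vs. everything else."""
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict(thetas, x):
    """Pick the class whose classifier is most confident: argmax_i h^(i)(x)."""
    return max(thetas, key=lambda c: sigmoid(thetas[c] @ x))

# Hypothetical toy data: 3 well-separated classes in the (x1, x2) plane.
rng = np.random.default_rng(0)
centers = {1: [0, 0], 2: [4, 0], 3: [2, 4]}
X = np.vstack([rng.normal(centers[c], 0.3, size=(20, 2)) for c in (1, 2, 3)])
X = np.hstack([np.ones((60, 1)), X])  # add an intercept column
y = np.repeat([1, 2, 3], 20)
thetas = train_one_vs_rest(X, y, (1, 2, 3))
print(predict(thetas, np.array([1.0, 4.0, 0.0])))  # point near (4, 0) -> class 2
```

Each binary problem reuses the same cost and update from the earlier slides; only the labels change.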
Further readings
• Tom M. Mitchell
Generative and discriminative classifiers: Naïve Bayes and Logistic
Regression
http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Regularization
$$\theta_j := \theta_j - \alpha \frac{\lambda}{m} \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
• Multi-class classification
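The regularized update can be sketched as a single gradient step (following the common convention, not shown on the slide, of leaving the intercept $\theta_0$ unpenalized; all numbers are illustrative):

```python
import numpy as np

def regularized_step(theta, X, Y, alpha=0.1, lam=1.0):
    """One gradient step with L2 regularization:
    theta_j := theta_j - alpha*(lam/m)*theta_j
                       - alpha*(1/m)*sum_i (h(x_i) - y_i) * x_ij.
    By convention the intercept theta_0 is not regularized."""
    m = len(Y)
    H = 1.0 / (1.0 + np.exp(-X @ theta))
    reg = (lam / m) * theta
    reg[0] = 0.0  # leave the intercept unpenalized
    return theta - alpha * (reg + X.T @ (H - Y) / m)

theta = np.array([0.0, 2.0])
theta = regularized_step(theta, np.array([[1.0, 0.0]]), np.array([1.0]))
print(theta)  # the penalized weight shrinks toward zero
```

The extra $-\alpha \frac{\lambda}{m} \theta_j$ term shrinks each weight a little every iteration, which is what discourages overfitting.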
Coming up…
• Regularization