Lec 05

Logistic Regression

Jia-Bin Huang
ECE-5424G / CS-5824 Virginia Tech Spring 2019
• Please start HW 1 early!

• Questions are welcome!

Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE)
Choose $\theta$ that maximizes the probability of the observed data: $\hat{\theta}_{MLE} = \arg\max_\theta P(\text{Data} \mid \theta)$

• Maximum a posteriori estimation (MAP)

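Choose $\theta$ that is most probable given the prior probability and the data: $\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid \text{Data}) = \arg\max_\theta P(\text{Data} \mid \theta) P(\theta)$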

Slide credit: Tom Mitchell
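A minimal sketch of the difference between the two principles, using a coin-flip (Bernoulli) model that is not in the slides. The function names and the Beta(2, 2) prior are assumptions chosen for illustration; the MAP formula is the mode of the Beta posterior and holds for prior parameters greater than 1.

```python
def mle_bernoulli(heads: int, flips: int) -> float:
    """MLE of P(heads): the observed fraction of heads."""
    return heads / flips


def map_bernoulli(heads: int, flips: int, a: float = 2.0, b: float = 2.0) -> float:
    """MAP estimate under a Beta(a, b) prior (mode of the Beta posterior).

    The Beta(2, 2) default is an illustrative assumption; requires a, b > 1.
    """
    return (heads + a - 1) / (flips + a + b - 2)


print(mle_bernoulli(3, 3))  # 1.0: the MLE claims tails is impossible
print(map_bernoulli(3, 3))  # 0.8: the prior pulls the estimate toward 0.5
```

With only three flips the two estimates differ noticeably; as the number of flips grows, the prior's contribution shrinks and the two estimates converge.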

Naïve Bayes classifier
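• Want to learn $P(Y \mid X_1, \dots, X_n)$
• But this requires $2^n$ parameters (for Boolean $X_i$ and $Y$)...

• How about applying Bayes rule? $P(Y \mid X) = \frac{P(X \mid Y) P(Y)}{P(X)}$

• $P(X \mid Y)$: Need $2(2^n - 1)$ parameters
• $P(Y)$: Need 1 parameter

• Apply conditional independence assumption: $P(X_1, \dots, X_n \mid Y) = \prod_{i=1}^n P(X_i \mid Y)$

• $P(X_i \mid Y)$ for all $i$: Need $2n$ parameters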
Naïve Bayes classifier
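• Bayes rule: $P(Y = y_k \mid X_1, \dots, X_n) = \frac{P(Y = y_k) P(X_1, \dots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j) P(X_1, \dots, X_n \mid Y = y_j)}$

• Assume conditional independence among the $X_i$’s: $P(Y = y_k \mid X_1, \dots, X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$

• Pick the most probable $Y$: $\hat{Y} = \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$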

Slide credit: Tom Mitchell


Bayes rule Conditional indep.

• Estimating parameters

• Test example:
Naïve Bayes algorithm – discrete
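• For each value $y_k$
Estimate $\pi_k = P(Y = y_k)$
For each value $x_{ij}$ of each attribute $X_i$
Estimate $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$

• Classify $X^{new}$: $Y^{new} \leftarrow \arg\max_{y_k} \pi_k \prod_i P(X_i^{new} \mid Y = y_k)$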

Slide credit: Tom Mitchell
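The training and classification steps for discrete Naïve Bayes can be sketched as below. This is an illustrative sketch with count-based (MLE) estimates, not a reference implementation; the function names `train_nb` and `classify_nb` and the toy data set are hypothetical.

```python
from collections import Counter


def train_nb(examples):
    """examples: list of (features_tuple, label). Returns MLE estimates."""
    label_counts = Counter(label for _, label in examples)
    # feature_counts[(i, value, label)] = number of examples with X_i = value and Y = label
    feature_counts = Counter()
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[(i, value, label)] += 1
    total = len(examples)
    prior = {y: c / total for y, c in label_counts.items()}
    likelihood = {key: c / label_counts[key[2]] for key, c in feature_counts.items()}
    return prior, likelihood


def classify_nb(prior, likelihood, features):
    """Return argmax_y P(Y = y) * prod_i P(X_i = x_i | Y = y)."""
    def score(y):
        s = prior[y]
        for i, value in enumerate(features):
            s *= likelihood.get((i, value, y), 0.0)  # unseen pair: MLE is 0
        return s
    return max(prior, key=score)


# Hypothetical toy data: three Boolean features, Boolean label.
data = [((1, 1, 0), 1), ((1, 0, 0), 1), ((0, 1, 1), 0), ((0, 0, 1), 0), ((1, 1, 1), 0)]
prior, likelihood = train_nb(data)
print(classify_nb(prior, likelihood, (1, 1, 0)))  # 1
```

Note that a feature value never seen with a class gets probability 0 here, which zeroes out the whole product for that class.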

Estimating parameters: discrete
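• Maximum likelihood estimates (MLE)
$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$
$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$
where $\#D\{\cdot\}$ counts the training examples in $D$ that satisfy the condition.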

Slide credit: Tom Mitchell

• F = 1 iff you live in Fox Ridge
• S = 1 iff you watched the Super Bowl last night
• D = 1 iff you drive to VT
• G = 1 iff you went to the gym in the last month

$P(F \mid S, D, G) \propto P(F) P(S \mid F) P(D \mid F) P(G \mid F)$

Naïve Bayes: Subtlety #1
• Often the $X_i$ are not really conditionally independent

• Naïve Bayes often works pretty well anyway

• Often gives the right classification, even when not the right probability
[Domingos & Pazzani, 1996]

• What is the effect on the estimated $P(Y \mid X)$?

• What if we have two copies: $X_i = X_k$?

Slide credit: Tom Mitchell
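To make the "two copies" question concrete, the sketch below computes the Naïve Bayes posterior when one feature is counted once versus twice. The function name and all probabilities are hypothetical numbers chosen for illustration.

```python
def posterior_y1(prior_y1, lik_y1, lik_y0, copies=1):
    """P(Y=1 | X) under Naive Bayes when the same feature is counted `copies` times."""
    num = prior_y1 * lik_y1 ** copies
    den = num + (1 - prior_y1) * lik_y0 ** copies
    return num / den


# Hypothetical numbers: P(Y=1) = 0.5, P(X=1 | Y=1) = 0.8, P(X=1 | Y=0) = 0.4.
print(posterior_y1(0.5, 0.8, 0.4, copies=1))  # about 0.667
print(posterior_y1(0.5, 0.8, 0.4, copies=2))  # about 0.8
```

The duplicated evidence is counted twice, so the estimated posterior moves toward 0 or 1. The predicted class is unchanged in this example, which is consistent with the observation that the classification is often right even when the probability is not.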

Naïve Bayes: Subtlety #2
The MLE estimate for $P(X_i \mid Y)$ might be zero
(for example, $X_i$ = birthdate, $x_{ij}$ = Feb_4_1995).

• Why worry about just one parameter out of many?

• What can we do to address this?

• MAP estimates (adding “imaginary” examples)
Slide credit: Tom Mitchell
Estimating parameters: discrete
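• Maximum likelihood estimates (MLE)
$\hat{\pi}_k = \frac{\#D\{Y = y_k\}}{|D|}$, $\hat{\theta}_{ijk} = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$

• MAP estimates (Dirichlet priors), with $\ell$ “imaginary” examples per value:
$\hat{\pi}_k = \frac{\#D\{Y = y_k\} + \ell}{|D| + \ell K}$, $\hat{\theta}_{ijk} = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\} + \ell}{\#D\{Y = y_k\} + \ell J}$
where $K$ is the number of classes and $J$ is the number of values $X_i$ can take.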

Slide credit: Tom Mitchell
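A short sketch of how adding "imaginary" examples removes zero estimates. The function name, the `pseudo` parameter, and the counts are hypothetical; the sketch assumes the same number of imaginary examples for every value (a symmetric prior).

```python
def smoothed_estimate(count_xy, count_y, num_values, pseudo=1.0):
    """Estimate P(X_i = x | Y = y) with `pseudo` imaginary examples per value.

    pseudo=0 recovers the MLE; pseudo=1 is add-one (Laplace) smoothing.
    """
    return (count_xy + pseudo) / (count_y + pseudo * num_values)


# Hypothetical counts: a birthdate value never seen among 50 examples of class y,
# with 365 possible values.
print(smoothed_estimate(0, 50, 365, pseudo=0.0))  # 0.0 (MLE)
print(smoothed_estimate(0, 50, 365, pseudo=1.0))  # 1/415: small but nonzero
```

A single zero matters because the classifier multiplies the per-feature probabilities, so one zero factor forces the whole product for that class to zero regardless of the other features.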

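What if we have continuous $X_i$?
• Gaussian Naïve Bayes (GNB): assume $P(X_i = x \mid Y = y_k) = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\left(-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$

• Additional assumption on $\sigma_{ik}$ (sometimes made):
• Is independent of $Y$ ($\sigma_i$)
• Is independent of $X_i$ ($\sigma_k$)
• Is independent of $X_i$ and $Y$ ($\sigma$)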

Slide credit: Tom Mitchell

Naïve Bayes algorithm – continuous
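• For each value $y_k$
Estimate $\pi_k = P(Y = y_k)$
For each attribute $X_i$ estimate
class conditional mean $\mu_{ik}$, variance $\sigma_{ik}^2$

• Classify $X^{new}$: $Y^{new} \leftarrow \arg\max_{y_k} \pi_k \prod_i \mathcal{N}(X_i^{new}; \mu_{ik}, \sigma_{ik}^2)$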

Slide credit: Tom Mitchell
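The continuous algorithm can be sketched as follows. This is an illustrative sketch only: the function names and the data are hypothetical, variances are MLE estimates (divide by the class count), and it assumes every class has nonzero variance in every feature. Scores are computed in log space to avoid underflow when many small densities are multiplied.

```python
import math


def fit_gnb(X, y):
    """Estimate per-class prior, feature means, and feature variances (MLE)."""
    params = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(zip(*rows), means)]
        params[label] = (n / len(y), means, variances)
    return params


def log_gaussian(x, mean, var):
    """Log of the Gaussian density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_gnb(params, x):
    """argmax_k of log P(Y = y_k) + sum_i log N(x_i; mu_ik, sigma_ik^2)."""
    def score(label):
        prior, means, variances = params[label]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
    return max(params, key=score)


# Hypothetical data with two real-valued features and two classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y = [0, 0, 0, 1, 1, 1]
params = fit_gnb(X, y)
print(predict_gnb(params, [1.1, 2.1]), predict_gnb(params, [2.9, 0.4]))  # 0 1
```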

Things to remember
• Probability basics
• Conditional probability, joint probability, Bayes rule

• Estimating parameters from data

• Maximum likelihood (ML): maximize $P(\text{Data} \mid \theta)$
• Maximum a posteriori estimation (MAP): maximize $P(\theta \mid \text{Data})$

• Naive Bayes
Logistic Regression
• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
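[Figure: label 1 (Yes) / 0 (No) versus Tumor Size, with the linear regression fit $h_\theta(x) = \theta^\top x$]

• Threshold classifier output $h_\theta(x)$ at 0.5

• If $h_\theta(x) \ge 0.5$, predict “$y = 1$”
• If $h_\theta(x) < 0.5$, predict “$y = 0$”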

Slide credit: Andrew Ng

Classification: $y = 0$ or $y = 1$

$h_\theta(x)$ can be $> 1$ or $< 0$ (from linear regression)

Logistic regression: $0 \le h_\theta(x) \le 1$

Despite its name, logistic regression is actually used for classification

Slide credit: Andrew Ng

Hypothesis representation
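• Want $0 \le h_\theta(x) \le 1$

• $h_\theta(x) = g(\theta^\top x)$, where $g(z) = \frac{1}{1 + e^{-z}}$, so $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

• Sigmoid function
• Logistic function
[Figure: plot of $g(z)$ versus $z$]
Slide credit: Andrew Ng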
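A minimal sketch of the sigmoid (logistic) function and the hypothesis built on it. The function names are hypothetical, and the two-branch form is a common numerical-stability choice rather than something the slides prescribe: it avoids overflow in `math.exp` for large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to include x_0 = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


print(sigmoid(0.0))      # 0.5
print(sigmoid(-1000.0))  # 0.0, with no overflow error
```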
Interpretation of hypothesis output
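• $h_\theta(x)$ = estimated probability that $y = 1$ on input $x$, i.e. $h_\theta(x) = P(y = 1 \mid x; \theta)$

• Example: If $h_\theta(x) = 0.7$

• Tell the patient that there is a 70% chance of the tumor being malignant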

Slide credit: Andrew Ng

Logistic regression

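$h_\theta(x) = g(z)$, where $z = \theta^\top x$ and $g(z) = \frac{1}{1 + e^{-z}}$

Suppose we predict “$y = 1$” if $h_\theta(x) \ge 0.5$, i.e. $z = \theta^\top x \ge 0$

and predict “$y = 0$” if $h_\theta(x) < 0.5$, i.e. $z = \theta^\top x < 0$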

Slide credit: Andrew Ng
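The prediction rule can be sketched in a few lines. The parameter vector below is a hypothetical example, chosen so that the boundary $\theta^\top x = 0$ is the line $x_1 + x_2 = 3$; the function name is also an assumption.

```python
import math


def predict(theta, x, threshold=0.5):
    """Predict 1 when h_theta(x) >= threshold, else 0 (x includes x_0 = 1)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    h = 1.0 / (1.0 + math.exp(-z))
    return 1 if h >= threshold else 0


theta = [-3.0, 1.0, 1.0]  # hypothetical parameters: boundary is x1 + x2 = 3
print(predict(theta, [1.0, 1.0, 1.0]))  # 0, since theta^T x = -1 < 0
print(predict(theta, [1.0, 2.0, 2.0]))  # 1, since theta^T x = 1 >= 0
```

Because $g(z) \ge 0.5$ exactly when $z \ge 0$, thresholding $h_\theta(x)$ at 0.5 is the same as checking the sign of $\theta^\top x$.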

Decision boundary


Tumor Size

• Predict “$y = 1$” if $\theta^\top x \ge 0$

Slide credit: Andrew Ng


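• Predict “$y = 1$” if $\theta^\top x \ge 0$; when $x$ includes nonlinear features (e.g. $x_1^2$, $x_2^2$), the decision boundary is nonlinear in the original inputs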

Slide credit: Andrew Ng

Where does the form come from?
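• Logistic regression hypothesis representation: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}} = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n)}}$

• Consider learning f: $X \to Y$, where

• $X$ is a vector of real-valued features $\langle X_1, \dots, X_n \rangle$
• $Y$ is Boolean
• Assume all $X_i$ are conditionally independent given $Y$
• Model $P(X_i \mid Y = y_k)$ as Gaussian $N(\mu_{ik}, \sigma_i)$
• Model $P(Y)$ as Bernoulli($\pi$)

What is $P(Y \mid X_1, \dots, X_n)$?
Slide credit: Tom Mitchell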

Applying Bayes rule
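$P(Y = 1 \mid X) = \frac{P(Y = 1) P(X \mid Y = 1)}{P(Y = 1) P(X \mid Y = 1) + P(Y = 0) P(X \mid Y = 0)}$

Divide by $P(Y = 1) P(X \mid Y = 1)$, with $\pi = P(Y = 1)$:

$P(Y = 1 \mid X) = \frac{1}{1 + \frac{P(Y = 0) P(X \mid Y = 0)}{P(Y = 1) P(X \mid Y = 1)}} = \frac{1}{1 + \exp\left(\ln\frac{1 - \pi}{\pi} + \sum_i \ln\frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}\right)}$

Plug in $P(x \mid y_k) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x - \mu_{ik})^2}{2\sigma_i^2}\right)$:

$\sum_i \ln\frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} = \sum_i \left(\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right)$

The exponent is linear in $X$, so $P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\theta_0 + \sum_i \theta_i X_i)}}$ with $\theta_i = \frac{\mu_{i1} - \mu_{i0}}{\sigma_i^2}$ and $\theta_0 = \ln\frac{\pi}{1 - \pi} + \sum_i \frac{\mu_{i0}^2 - \mu_{i1}^2}{2\sigma_i^2}$.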

Slide credit: Tom Mitchell
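As a numerical sanity check of this derivation, the sketch below computes $P(Y = 1 \mid X)$ for a Gaussian Naïve Bayes model in two ways: directly from Bayes rule, and from the logistic form $1 / (1 + e^{-(\theta_0 + \sum_i \theta_i x_i)})$ with $\theta_i = (\mu_{i1} - \mu_{i0}) / \sigma_i^2$ and $\theta_0 = \ln\frac{\pi}{1 - \pi} + \sum_i \frac{\mu_{i0}^2 - \mu_{i1}^2}{2\sigma_i^2}$. All parameter values are hypothetical, and the variance of each feature is assumed to be shared across the two classes.

```python
import math


def gaussian(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


# Hypothetical parameters for two features; sigma_i is shared across classes.
pi = 0.3                           # P(Y = 1)
mu0, mu1 = [0.0, 2.0], [1.0, 0.5]  # class-conditional means for Y = 0 and Y = 1
sigma = [1.5, 0.8]
x = [0.7, 1.1]

# Posterior straight from Bayes rule with the Naive Bayes factorization.
p1 = pi * math.prod(gaussian(v, m, s) for v, m, s in zip(x, mu1, sigma))
p0 = (1 - pi) * math.prod(gaussian(v, m, s) for v, m, s in zip(x, mu0, sigma))
bayes = p1 / (p0 + p1)

# The same posterior in logistic form.
theta = [(m1 - m0) / s ** 2 for m0, m1, s in zip(mu0, mu1, sigma)]
theta0 = math.log(pi / (1 - pi)) + sum(
    (m0 ** 2 - m1 ** 2) / (2 * s ** 2) for m0, m1, s in zip(mu0, mu1, sigma))
logistic = 1.0 / (1.0 + math.exp(-(theta0 + sum(t * v for t, v in zip(theta, x)))))

print(bayes, logistic)  # the two values agree up to floating-point error
```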

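Training set with $m$ examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$

$x = [x_0, x_1, \dots, x_n]^\top$ with $x_0 = 1$, $y \in \{0, 1\}$, and $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

How to choose parameters $\theta$?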

Slide credit: Andrew Ng

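Cost function for Linear Regression

$J(\theta) = \frac{1}{m} \sum_{i=1}^m \frac{1}{2} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, i.e. $\text{Cost}(h_\theta(x), y) = \frac{1}{2} (h_\theta(x) - y)^2$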

Slide credit: Andrew Ng

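Cost function for Logistic Regression
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

[Figure: cost versus $h_\theta(x) \in [0, 1]$, one panel for $y = 1$ and one for $y = 0$]
Slide credit: Andrew Ng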
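A small sketch of the per-example cost, evaluated at a few hypothetical values of $h_\theta(x)$ to show its shape. The function name is an assumption, and the sketch assumes $0 < h < 1$ strictly, since the logarithm is undefined at the endpoints.

```python
import math


def cost(h: float, y: int) -> float:
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


for h in (0.99, 0.5, 0.01):
    print(h, cost(h, 1), cost(h, 0))
```

The cost is close to zero when the prediction agrees with the label and grows without bound as a confident prediction turns out to be wrong.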

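Logistic regression cost function

$J(\theta) = \frac{1}{m} \sum_{i=1}^m \text{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$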


Slide credit: Andrew Ng

Logistic regression

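Learning: fit parameter $\theta$ by solving $\min_\theta J(\theta)$
Prediction: given new $x$, output $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$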


Slide credit: Andrew Ng

Where does the cost come from?
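• Training set with $m$ examples: $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$

• Maximum likelihood estimate for parameter $\theta$: $\theta_{MLE} = \arg\max_\theta \prod_{i=1}^m P_\theta(x^{(i)}, y^{(i)})$

• Maximum conditional likelihood estimate for parameter $\theta$: $\theta_{MCLE} = \arg\max_\theta \prod_{i=1}^m P_\theta(y^{(i)} \mid x^{(i)})$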

Slide credit: Tom Mitchell

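• Goal: choose $\theta$ to maximize the conditional likelihood of the training data

• Training data $D = \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$
• Data likelihood $= \prod_{i=1}^m P_\theta(x^{(i)}, y^{(i)})$
• Data conditional likelihood $= \prod_{i=1}^m P_\theta(y^{(i)} \mid x^{(i)})$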

Slide credit: Tom Mitchell

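Expressing conditional log-likelihood

$L(\theta) = \log \prod_{i=1}^m P_\theta(y^{(i)} \mid x^{(i)}) = \sum_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$

Maximizing $L(\theta)$ is therefore the same as minimizing the sum of the per-example costs:

$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$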
Gradient descent

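$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$

Goal: $\min_\theta J(\theta)$
Good news: Convex function!
Bad news: No analytical solution

Repeat {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
(Simultaneously update all $\theta_j$)
}

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$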
Slide credit: Andrew Ng
Gradient descent


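Repeat {
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
(Simultaneously update all $\theta_j$)
}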

Slide credit: Andrew Ng
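A minimal sketch of batch gradient descent for logistic regression, using the update $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. The function names, learning rate, iteration count, and data are hypothetical choices for illustration; the labels overlap on purpose so that the optimum is finite.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for logistic regression; rows of X include x_0 = 1."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) - yi
                  for x, yi in zip(X, y)]
        grad = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update: every theta_j uses the gradient at the old theta.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical 1-D data (e.g. tumor size); labels overlap so the optimum is finite.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]]
y = [0, 0, 1, 0, 1, 1]
theta = gradient_descent(X, y)
print(theta)
```

Computing the full gradient before touching `theta` is what makes the update simultaneous; updating each component in place inside the loop over `j` would mix old and new parameter values.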

Gradient descent for Linear Regression
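Repeat {
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
} with $h_\theta(x) = \theta^\top x$

Gradient descent for Logistic Regression

Repeat {
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
} with $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

The update rule has the same form in both cases; only the hypothesis $h_\theta$ differs.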
Slide credit: Andrew Ng
How about MAP?
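• Maximum conditional likelihood estimate (MCLE): $\theta_{MCLE} = \arg\max_\theta \prod_{i=1}^m P_\theta(y^{(i)} \mid x^{(i)})$

• Maximum conditional a posteriori estimate (MCAP): $\theta_{MCAP} = \arg\max_\theta P(\theta) \prod_{i=1}^m P_\theta(y^{(i)} \mid x^{(i)})$

• Common choice of $P(\theta)$:
• Normal distribution, zero mean, identity covariance
• “Pushes” parameters towards zero
• Corresponds to regularization
• Helps avoid very large weights and overfitting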

Slide credit: Tom Mitchell
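The effect of the prior on the weights can be sketched with the regularized update $\theta_j := \theta_j - \alpha\lambda\theta_j - \alpha \frac{1}{m} \sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. The function name, the values of $\alpha$ and $\lambda$, and the data are hypothetical. For simplicity the sketch shrinks every $\theta_j$, including the intercept; many implementations leave $\theta_0$ unpenalized.

```python
import math


def map_gradient_step(theta, X, y, alpha=0.1, lam=0.1):
    """One update: theta_j := theta_j - alpha*lam*theta_j - alpha*(1/m)*sum_i(...)."""
    m = len(X)
    errors = [1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x)))) - yi
              for x, yi in zip(X, y)]
    return [t - alpha * lam * t - alpha * sum(e * x[j] for e, x in zip(errors, X)) / m
            for j, t in enumerate(theta)]


# Hypothetical linearly separable data: without the prior the weights keep growing.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
plain, shrunk = [0.0, 0.0], [0.0, 0.0]
for _ in range(5000):
    plain = map_gradient_step(plain, X, y, lam=0.0)
    shrunk = map_gradient_step(shrunk, X, y, lam=0.1)
print(plain, shrunk)
```

With `lam=0.0` the update is plain maximum conditional likelihood, and on separable data the weight keeps increasing with more iterations. With `lam=0.1` the weight settles at a moderate value.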

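• Maximum conditional likelihood estimate (MCLE): $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

• Maximum conditional a posteriori estimate (MCAP): $\theta_j := \theta_j - \alpha\lambda\theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$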

Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby

• Medical diagnosis: Not ill, Cold, Flu

• Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng

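[Figure: Binary classification versus Multiclass classification, each plotted in the $(x_1, x_2)$ plane]

One-vs-all (one-vs-rest)
[Figure: a three-class data set in the $(x_1, x_2)$ plane is split into three binary problems (Class 1, Class 2, and Class 3, each versus the rest) with classifiers $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$]

$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \quad (i = 1, 2, 3)$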
Slide credit: Andrew Ng
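• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$
to predict the probability that $y = i$

• Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$: $\hat{y} = \arg\max_i h_\theta^{(i)}(x)$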

Slide credit: Andrew Ng
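One-vs-all can be sketched by training one binary classifier per class and taking the most confident one at prediction time. The function names, hyperparameters, and data below are hypothetical, and the binary trainer is plain batch gradient descent without regularization.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(X, y, alpha=0.1, iterations=3000):
    """Plain batch gradient descent for one binary logistic regression classifier."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) - yi
                  for x, yi in zip(X, y)]
        theta = [t - alpha * sum(e * x[j] for e, x in zip(errors, X)) / m
                 for j, t in enumerate(theta)]
    return theta


def train_one_vs_all(X, y):
    """Fit one classifier per class c, relabeling class c as 1 and all others as 0."""
    return {c: train_binary(X, [1 if label == c else 0 for label in y])
            for c in sorted(set(y))}


def predict_one_vs_all(models, x):
    """Pick the class whose classifier reports the highest probability for x."""
    return max(models, key=lambda c: sigmoid(sum(t * xi for t, xi in zip(models[c], x))))


# Hypothetical 2-D data with three well-separated classes; x_0 = 1 is prepended.
X = [[1, 0, 0], [1, 0, 1], [1, 5, 0], [1, 5, 1], [1, 0, 5], [1, 1, 5]]
y = [1, 1, 2, 2, 3, 3]
models = train_one_vs_all(X, y)
print(predict_one_vs_all(models, [1, 0, 0.5]),
      predict_one_vs_all(models, [1, 5, 0.5]),
      predict_one_vs_all(models, [1, 0.5, 5]))
```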

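Generative Approach (Ex: Naïve Bayes)
• Estimate $P(Y)$ and $P(X \mid Y)$
• Prediction: $\hat{y} = \arg\max_y P(Y = y) P(X = x \mid Y = y)$

Discriminative Approach (Ex: Logistic regression)
• Estimate $P(Y \mid X)$ directly
(Or a discriminant function: e.g., SVM)
• Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$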
Further readings
• Tom M. Mitchell
Generative and discriminative classifiers: Naïve Bayes and Logistic Regression

• Andrew Ng, Michael Jordan
On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
Things to remember
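• Hypothesis representation: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

• Cost function: $\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

• Logistic regression with gradient descent: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

• Regularization: $\theta_j := \theta_j - \alpha\lambda\theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$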
• Multi-class classification
Coming up…
• Regularization

• Support Vector Machine

