Lab 4: Training Neural Nets


What next?

 Given an example (or a group of examples), we know how to compute the derivative of the loss with respect to each weight.
 How exactly do we update the weights?
 How often? (After each training data point? After all the training data points?)

2
What next?—Gradient Descent
 W_new = W_old - lr * derivative (of the loss with respect to W)
 Classical approach—compute the derivative over the entire data set, then take one step in that direction
 Pros: each step is informed by all the data
 Cons: very slow, especially as the data set gets big
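
A minimal sketch of one full-batch gradient descent step, assuming a simple linear model with a mean squared error loss; the data, weights, and learning rate here are illustrative, not from the slides:

import numpy as np

# Illustrative data: 100 examples, 3 features; linear model predictions are X @ w
X = np.random.randn(100, 3)
y = np.random.randn(100)
w = np.zeros(3)
lr = 0.1

# Gradient of the mean squared error, computed over the ENTIRE data set
grad = 2 * X.T @ (X @ w - y) / len(X)

# One full-batch gradient descent step: W_new = W_old - lr * derivative
w = w - lr * grad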

3
Another approach: Stochastic Gradient Descent
 Get the derivative for just one point, and take a step in that direction
 Steps are "less informed", but you take more of them
 Should "balance out"
 You probably want a smaller step size (learning rate)
 The noise in the updates also helps "regularize"
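
A matching SGD sketch, continuing the illustrative X, y, and w from above: each update uses the gradient from a single randomly chosen example.

lr_sgd = 0.01  # typically smaller than the full-batch learning rate

for _ in range(len(X)):                      # n updates per pass through the data
    i = np.random.randint(len(X))            # pick a single example at random
    grad_i = 2 * X[i] * (X[i] @ w - y[i])    # gradient from that one example
    w = w - lr_sgd * grad_i                  # a noisier, but much cheaper, step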

4
Compromise approach: Mini-batch
 Get the derivative for a "small" set of points, then take a step in that direction
 Typical mini-batch sizes are 16 or 32
 Strikes a balance between the two extremes
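
A sketch of one epoch of mini-batch updates, again continuing the illustrative X, y, w, and lr from the sketches above; note that batch_size = 1 recovers SGD and batch_size = len(X) recovers full-batch gradient descent.

batch_size = 32   # batch_size = 1 gives SGD; batch_size = len(X) gives full batch

perm = np.random.permutation(len(X))              # visit the data in a random order
for start in range(0, len(X), batch_size):
    idx = perm[start:start + batch_size]
    Xb, yb = X[idx], y[idx]
    grad_b = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)   # gradient from this mini-batch only
    w = w - lr * grad_b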

5
Comparison of Batching Approaches

6
Batching Terminology
Full-batch:
Use the entire data set to compute the gradient before updating

Mini-batch:
Use a smaller portion of the data (but more than a single example) to compute the gradient before updating

Stochastic Gradient Descent (SGD):
Use a single example to compute the gradient before updating (though sometimes people use "SGD" to refer to mini-batch updates as well)

7
Batching Terminology
 An epoch is a single pass through all of the training data.
 In full-batch gradient descent, one step is taken per epoch.
 In SGD / online learning, n steps are taken per epoch (n = training set size).
 In mini-batch, n / batch size (rounded up) steps are taken per epoch.
 When training, it is common to refer to the number of epochs needed for the model to be "trained".
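
A quick worked example with illustrative numbers: with 50,000 training examples and a batch size of 32, each epoch takes ceil(50,000 / 32) = 1,563 mini-batch steps.

import math

n = 50_000        # illustrative training set size
batch_size = 32

steps_per_epoch = math.ceil(n / batch_size)   # 1563 steps per epoch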

8
Note on Data Shuffling
 To avoid any cyclical movement and to aid convergence, it is recommended to shuffle the data after each epoch.
 This way, the data is not seen in the same order every time, and the batches are not exactly the same ones.
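
A common way to do this in plain NumPy is to draw a fresh permutation of the indices at the start of every epoch (a sketch with illustrative data; Keras's model.fit does this for you by default via shuffle=True):

import numpy as np

X = np.random.randn(100, 3)   # illustrative features
y = np.random.randn(100)      # illustrative targets

for epoch in range(5):
    perm = np.random.permutation(len(X))      # a fresh order every epoch
    X_shuf, y_shuf = X[perm], y[perm]
    # ... form mini-batches from X_shuf, y_shuf and update the weights ...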

9
Feedforward Neural Network

[Figure: the full training set (full batch) split into five mini-batches, Batch 1 through Batch 5, to be fed to a feedforward neural network]

10
Training in Action

[Figure: Step 1 uses Batch 1 to compute the gradient and update the weights]

11
Training in Action

[Figure: Step 2 uses Batch 2 to compute the gradient and update the weights]

12
Training in Action

[Figure: Step 3 uses Batch 3 to compute the gradient and update the weights]

13
Training in Action

[Figure: Step 4 uses Batch 4 to compute the gradient and update the weights]

14
Training in Action

[Figure: Step 5 uses Batch 5 to compute the gradient and update the weights]

15
Training in Action

[Figure: after Step 5, all five batches have been used. First epoch complete!]

16
Shuffle the Data!

[Figure: the data is reshuffled into five new batches for the next epoch]

17
Shuffle the Data!

[Figure: Step 6 begins the second epoch using the reshuffled batches]

18
The Keras Package
 Keras allows easy construction, training, and execution of deep neural networks
 Written in Python, and allows users to configure complicated models directly in Python
 Uses either TensorFlow or Theano "under the hood"
 Uses either the CPU or GPU for computation
 Uses numpy data structures, and a command structure similar to scikit-learn (model.fit, model.predict, etc.)

19
Typical Command Structure in Keras
 Build the structure of your network.
 Compile the model, specifying your loss function, metrics, and optimizer (which includes the learning rate).
 Fit the model on your training data (specifying the batch size and number of epochs).
 Predict on new data.
 Evaluate your results.
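
A sketch of those steps for a generic classifier, assuming a model object built as on the following slides. The arrays X_train, y_train, X_test, y_test are placeholders, and the optimizer, learning rate, batch size, and epoch count are illustrative choices, not prescribed by the slides (older Keras versions also use nb_epoch instead of epochs):

from keras.optimizers import SGD

# Compile: loss, metrics, and optimizer (the learning rate lives in the optimizer)
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.01),
              metrics=['accuracy'])

# Fit on training data, specifying batch size and number of epochs
model.fit(X_train, y_train, batch_size=32, epochs=20)

# Predict on new data, then evaluate
y_pred = model.predict(X_test)
loss, accuracy = model.evaluate(X_test, y_test)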

20
Building the model
 Keras provides two approaches to building the structure of your model:
 Sequential Model: allows a linear stack of layers – simpler and more convenient if the model has this form
 Functional API: more detailed and complex, but allows more complicated architectures
 We will focus on the Sequential Model.

21
Running Example, this time in Keras
Let’s build this Neural Network structure shown below in Keras:

[Figure: a fully connected network with 3 inputs (x1, x2, x3), two hidden layers of 4 sigmoid units each, and a 3-unit output layer (y1, y2, y3)]

22
Keras—Sequential Model
First, import the Sequential class and initialize your model object:

from keras.models import Sequential

model = Sequential()

23
Keras—Sequential Model
Then we add layers to the model one by one.

from keras.layers import Dense, Activation

# For the first layer, specify the input dimension
model.add(Dense(units=4, input_dim=3))

# Specify an activation function
model.add(Activation('sigmoid'))

# For subsequent layers, the input dimension is inferred from
# the previous layer
model.add(Dense(units=4))
model.add(Activation('sigmoid'))
model.add(Dense(units=3))
model.add(Activation('softmax'))

24
Multiclass Classification with Neural Networks
 For binary classification problems, we have a final layer with a single node
and a sigmoid activation.
 This has many desirable properties
 Gives an output strictly between 0 and 1
 Can be interpreted as a probability
 Derivative is “nice”
 Analogous to logistic regression

 Is there a natural extension of this to a multiclass setting?

25
Multiclass Classification
with Neural Networks
 Reminder: one-hot encoding for categories
 Take a vector with length equal to the number of categories
 Represent each category with a one at a particular position (and zeros everywhere else)

Cat = [1, 0, 0]
Dog = [0, 1, 0]
Toaster = [0, 0, 1]
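
In Keras, labels can be one-hot encoded with keras.utils.to_categorical (a small sketch; the label values 0, 1, 2 for cat, dog, toaster are illustrative, and in older Keras versions the helper lives in keras.utils.np_utils):

import numpy as np
from keras.utils import to_categorical

labels = np.array([0, 1, 2, 1])                  # e.g. cat, dog, toaster, dog
one_hot = to_categorical(labels, num_classes=3)
# one_hot is now:
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]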

26
Multiclass Classification with Neural Networks
 For multiclass classification problems, let the final layer be a vector with
length equal to the number of possible classes.
 Extension of sigmoid to multiclass is the softmax function:

$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}$

 Yields a vector with entries that are between 0 and 1, and sum to 1
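
A small NumPy sketch of softmax (subtracting the maximum is a standard numerical-stability trick, not something the slide requires):

import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # entries in (0, 1) that sum to 1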

27
Multiclass Classification with Neural Networks
 For loss function use “categorical cross entropy”
 This is just the log-loss function in disguise
$C.E. = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$

 Derivative has a nice property when used with softmax


$\frac{\partial\, C.E.}{\partial\, \mathrm{softmax}} \cdot \frac{\partial\, \mathrm{softmax}}{\partial z_i} = \hat{y}_i - y_i$
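
A quick NumPy check of the loss and of that gradient property, reusing the softmax helper sketched above; the target and scores are illustrative:

y_true = np.array([0.0, 1.0, 0.0])     # one-hot target
z = np.array([0.5, 2.0, -1.0])         # pre-softmax scores
y_hat = softmax(z)

cross_entropy = -np.sum(y_true * np.log(y_hat))
grad_wrt_z = y_hat - y_true            # the "nice" gradient when C.E. is paired with softmax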

28
Ways to scale inputs
 Linear scaling to the interval [0,1]

$x_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}$

 Linear scaling to the interval [-1,1]

$x_i = 2\,\frac{x_i - x_{min}}{x_{max} - x_{min}} - 1$
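
A NumPy sketch of both linear scalings applied column-wise to a feature matrix (the array X here is illustrative):

import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])             # illustrative feature matrix

x_min, x_max = X.min(axis=0), X.max(axis=0)
X_01  = (X - x_min) / (x_max - x_min)           # each column scaled to [0, 1]
X_pm1 = 2 * (X - x_min) / (x_max - x_min) - 1   # each column scaled to [-1, 1]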

29
Ways to scale inputs
 Standardization (making variable approx. std. normal)

$x_i = \frac{x_i - \bar{x}}{\sigma}; \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$
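
And a matching standardization sketch, continuing the illustrative X from the previous sketch; NumPy's std uses the 1/n (population) form in the formula above by default:

mu = X.mean(axis=0)
sigma = X.std(axis=0)          # population standard deviation per feature
X_std = (X - mu) / sigma       # each column now has mean 0 and std 1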

30
