Lab 4: Training Neural Nets
What next?—Gradient Descent
W_new = W_old - lr * derivative
Classical approach: compute the gradient over the entire data set, then take a step in that direction.
Pros: each step is informed by all the data.
Cons: very slow, especially as the data set gets big.
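A minimal NumPy sketch of this update rule on a least-squares problem (the data, learning rate, and step count below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])          # targets from known weights

w = np.zeros(3)
lr = 0.1
for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient over the ENTIRE data set
    w = w - lr * grad                       # W_new = W_old - lr * derivative
print(w)                                    # close to [1.0, -2.0, 0.5]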
Another approach: Stochastic Gradient Descent (SGD)
Compute the gradient for just one point, and take a step in that direction.
Steps are "less informed," but you take more of them, so they should "balance out."
You probably want a smaller step size.
The extra noise also helps "regularize" the model.
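The same setup with single-point updates (a sketch; note the smaller step size and larger step count):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
for step in range(5000):
    i = rng.integers(len(y))                # pick ONE random example
    grad = 2 * X[i] * (X[i] @ w - y[i])     # gradient from that example alone
    w = w - 0.01 * grad                     # smaller lr, many more steps
print(w)                                    # noisy, but near [1.0, -2.0, 0.5]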
Compromise approach: Mini-batch
Compute the gradient for a "small" set of points, then take a step in that direction.
Typical mini-batch sizes are 16 or 32.
Strikes a balance between the two extremes.
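And the mini-batch version (a sketch; batch_size=1 recovers SGD, and batch_size=len(y) recovers full batch):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
batch_size = 16
for step in range(1000):
    idx = rng.integers(len(y), size=batch_size)    # sample a mini-batch (with replacement, for simplicity)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size   # gradient over the batch
    w = w - 0.05 * grad
print(w)                                           # near [1.0, -2.0, 0.5]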
Comparison of Batching Approaches
[Figure: comparison of full-batch, mini-batch, and stochastic gradient descent]
Batching Terminology
Full-batch: use the entire data set to compute the gradient before updating.
Mini-batch: use a smaller portion of the data (but more than a single example) to compute the gradient before updating.
Batching Terminology
An epoch refers to a single pass through all of the training data.
In full-batch gradient descent, there is one step taken per epoch.
In SGD / online learning, there are n steps taken per epoch (n = training set size).
In mini-batch, there are n / (batch size) steps taken per epoch; for example, n = 1,000 with a batch size of 32 gives about 31 steps per epoch.
When training, it is common to refer to the number of epochs needed for the model to be "trained".
Note on Data Shuffling
To avoid any cyclical movement and aid convergence, it is recommended to shuffle the data after each epoch.
This way, the data is not seen in the same order every time, and the batches are not exactly the same ones.
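A minimal sketch of per-epoch shuffling with NumPy (X and y stand for the training arrays; the function name is just for illustration):

import numpy as np

def shuffled_batches(X, y, batch_size, rng):
    idx = rng.permutation(len(y))            # new random order each epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]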
Feedforward Neural Network
[Figure: a feedforward network with the training data split into five mini-batches, Batch 1 through Batch 5]
Training in Action
[Figure sequence: Steps 1 through 5 take one gradient step each on Batch 1 through Batch 5 in turn; after the step on Batch 5, the first epoch is complete]

Shuffle the Data!
[Figure: the data is reshuffled into five new batches, and Step 6 begins the second epoch]
The Keras Package
Keras allows easy construction, training, and execution of Deep Neural Networks.
Written in Python, and allows users to configure complicated models directly in Python.
Uses either TensorFlow or Theano "under the hood".
Uses either CPU or GPU for computation.
Uses numpy data structures, and a similar command structure to scikit-learn (model.fit, model.predict, etc.).
Typical Command Structure in Keras
Build the structure of your network.
Compile the model, specifying your loss function, metrics, and optimizer (which includes the learning rate).
Fit the model on your training data (specifying batch size and number of epochs).
Predict on new data.
Evaluate your results.
A sketch of the whole sequence is shown below.
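A hedged sketch of all five steps (the layer sizes, optimizer choice, and random data here are assumptions for illustration, not from the slides; parameter names follow the Keras 2 API):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation

# 1. Build the structure
model = Sequential()
model.add(Dense(4, input_dim=3))
model.add(Activation('sigmoid'))
model.add(Dense(3))
model.add(Activation('softmax'))

# 2. Compile: loss, metrics, and optimizer (the learning rate lives in the optimizer)
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# 3. Fit, specifying batch size and number of epochs
X_train = np.random.rand(100, 3)
y_train = np.eye(3)[np.random.randint(0, 3, 100)]   # made-up one-hot labels
model.fit(X_train, y_train, batch_size=16, epochs=10)

# 4. Predict on new data
preds = model.predict(np.random.rand(5, 3))

# 5. Evaluate
loss, acc = model.evaluate(X_train, y_train)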
Building the model
Keras provides two approaches to building the structure of your model:
Sequential Model: allows a linear stack of layers; simpler and more convenient if your model has this form.
Functional API: more detailed and complex, but allows more complicated architectures.
We will focus on the Sequential Model.
Running Example, this time in Keras
Let's build the Neural Network structure shown below in Keras:
[Figure: a fully connected network with three inputs x1, x2, x3, hidden layers of sigmoid (σ) units, and three outputs y1, y2, y3]
Keras—Sequential Model
First, import the Sequential function and initialize your model object:
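A minimal sketch of that step:

from keras.models import Sequential

model = Sequential()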
Keras—Sequential Model
Then we add layers to the model one by one.
from keras.layers import Dense, Activation
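Continuing the sketch for the running example (three inputs and three sigmoid outputs; the hidden-layer width of 4 is an assumption read off the figure):

model.add(Dense(4, input_dim=3))   # first hidden layer: 4 units, 3 inputs
model.add(Activation('sigmoid'))
model.add(Dense(4))                # second hidden layer (assumed width 4)
model.add(Activation('sigmoid'))
model.add(Dense(3))                # output layer: one unit per output y1..y3
model.add(Activation('sigmoid'))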
Multiclass Classification with Neural Networks
For binary classification problems, we have a final layer with a single node and a sigmoid activation.
This has many desirable properties
Gives an output strictly between 0 and 1
Can be interpreted as a probability
Derivative is “nice”
Analogous to logistic regression
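In Keras, such a final layer could look like this (a sketch in the same style as above):

model.add(Dense(1, activation='sigmoid'))   # single node, sigmoid output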
Multiclass Classification with Neural Networks
Reminder: one-hot encoding for categories.
Take a vector with length equal to the number of categories.
Represent each category with a one at a particular position (and zeros everywhere else):
Cat:     1 0 0
Dog:     0 1 0
Toaster: 0 0 1
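Keras ships a helper for this encoding (a small sketch; keras.utils.to_categorical is the assumed import path):

from keras.utils import to_categorical

labels = [0, 1, 2]             # e.g. Cat=0, Dog=1, Toaster=2
print(to_categorical(labels))  # [[1,0,0],[0,1,0],[0,0,1]]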
Multiclass Classification with Neural Networks
For multiclass classification problems, let the final layer be a vector with length equal to the number of possible classes.
The extension of the sigmoid to multiclass is the softmax function:

$\mathrm{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}$

This yields a vector with entries that are between 0 and 1 and sum to 1.
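A quick NumPy check of this definition (a sketch; subtracting the max is a standard numerical-stability trick, not something from the slide):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))        # [0.659 0.242 0.099]
print(softmax(z).sum())  # 1.0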
Multiclass Classification with Neural Networks
For the loss function, use "categorical cross entropy".
This is just the log-loss function in disguise:

$CE = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$

where $y$ is the one-hot label vector and $\hat{y}$ is the predicted probability vector.
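Checking the formula in NumPy with the one-hot labels from above (the predicted probabilities are made up):

import numpy as np

y_true = np.array([1.0, 0.0, 0.0])   # one-hot: "Cat"
y_pred = np.array([0.7, 0.2, 0.1])   # model's softmax output
ce = -np.sum(y_true * np.log(y_pred))
print(ce)                            # 0.357 (= -log(0.7))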
Ways to scale inputs
Linear scaling to the interval [0, 1]:

$x_i' = \dfrac{x_i - x_{min}}{x_{max} - x_{min}}$

Linear scaling to the interval [-1, 1]:

$x_i' = 2\,\dfrac{x_i - x_{min}}{x_{max} - x_{min}} - 1$
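A sketch of [0, 1] scaling with NumPy, applied column-wise to a made-up feature matrix:

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X01)   # each column now runs from 0 to 1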
Ways to scale inputs
Standardization (making the variable approximately standard normal):

$x_i' = \dfrac{x_i - \bar{x}}{\sigma}, \qquad \sigma = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
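And a matching NumPy sketch (again column-wise, on a made-up matrix):

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Xstd = (X - X.mean(axis=0)) / X.std(axis=0)   # np.std uses the 1/n form by default
print(Xstd.mean(axis=0), Xstd.std(axis=0))    # ~0 and 1 per column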