Lab 4: Training Neural Nets
What next?—Gradient Descent
W_new = W_old - lr * derivative
Classical approach: compute the gradient over the entire data set, then take a step in that direction.
Pros: each step is informed by all the data.
Cons: very slow, especially as the data set gets big.
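A minimal NumPy sketch of this update rule on a least-squares problem (the data, learning rate, and step count below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])          # targets from known weights

w = np.zeros(3)
lr = 0.1
for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient over the ENTIRE data set
    w = w - lr * grad                       # W_new = W_old - lr * derivative
print(w)                                    # close to [1.0, -2.0, 0.5]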
Another approach: Stochastic Gradient Descent (SGD)
Compute the gradient for just one point, and take a step in that direction.
Steps are "less informed," but you take more of them, so they should "balance out."
You probably want a smaller step size.
The extra noise also helps "regularize" the model.
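The same setup with single-point updates (a sketch; note the smaller step size and larger step count):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
for step in range(5000):
    i = rng.integers(len(y))                # pick ONE random example
    grad = 2 * X[i] * (X[i] @ w - y[i])     # gradient from that example alone
    w = w - 0.01 * grad                     # smaller lr, many more steps
print(w)                                    # noisy, but near [1.0, -2.0, 0.5]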
Compromise approach: Mini-batch
Compute the gradient for a "small" set of points, then take a step in that direction.
Typical mini-batch sizes are 16 or 32.
Strikes a balance between the two extremes.
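And the mini-batch version (a sketch; batch_size=1 recovers SGD, and batch_size=len(y) recovers full batch):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
batch_size = 16
for step in range(1000):
    idx = rng.integers(len(y), size=batch_size)    # sample a mini-batch (with replacement, for simplicity)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size   # gradient over the batch
    w = w - 0.05 * grad
print(w)                                           # near [1.0, -2.0, 0.5]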
Comparison of Batching Approaches
[Figure: comparison of full-batch, mini-batch, and stochastic gradient descent]
Batching Terminology
Full-batch: use the entire data set to compute the gradient before updating.
Mini-batch: use a smaller portion of the data (but more than a single example) to compute the gradient before updating.
Batching Terminology
An epoch refers to a single pass through all of the training data.
In full-batch gradient descent, there is one step taken per epoch.
In SGD / online learning, there are n steps taken per epoch (n = training set size).
In mini-batch, there are n / (batch size) steps taken per epoch; for example, n = 1,000 with a batch size of 32 gives about 31 steps per epoch.
When training, it is common to refer to the number of epochs needed for the model to be "trained".
Note on Data Shuffling
To avoid any cyclical movement and aid convergence, it is recommended to shuffle the data after each epoch.
This way, the data is not seen in the same order every time, and the batches are not exactly the same ones.
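A minimal sketch of per-epoch shuffling with NumPy (X and y stand for the training arrays; the function name is just for illustration):

import numpy as np

def shuffled_batches(X, y, batch_size, rng):
    idx = rng.permutation(len(y))            # new random order each epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]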
Feedforward Neural Network
[Figure: a feedforward network with the training data split into five mini-batches, Batch 1 through Batch 5]
Training in Action
[Figure sequence: Steps 1 through 5 take one gradient step each on Batch 1 through Batch 5 in turn; after the step on Batch 5, the first epoch is complete]

Shuffle the Data!
[Figure: the data is reshuffled into five new batches, and Step 6 begins the second epoch]
The Keras Package
Keras allows easy construction, training, and execution of Deep Neural Networks.
Written in Python, and allows users to configure complicated models directly in Python.
Uses either TensorFlow or Theano "under the hood".
Uses either CPU or GPU for computation.
Uses numpy data structures, and a similar command structure to scikit-learn (model.fit, model.predict, etc.).
Typical Command Structure in Keras
Build the structure of your network.
Compile the model, specifying your loss function, metrics, and optimizer (which includes the learning rate).
Fit the model on your training data (specifying batch size and number of epochs).
Predict on new data.
Evaluate your results.
A sketch of the whole sequence is shown below.
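A hedged sketch of all five steps (the layer sizes, optimizer choice, and random data here are assumptions for illustration, not from the slides; parameter names follow the Keras 2 API):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation

# 1. Build the structure
model = Sequential()
model.add(Dense(4, input_dim=3))
model.add(Activation('sigmoid'))
model.add(Dense(3))
model.add(Activation('softmax'))

# 2. Compile: loss, metrics, and optimizer (the learning rate lives in the optimizer)
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# 3. Fit, specifying batch size and number of epochs
X_train = np.random.rand(100, 3)
y_train = np.eye(3)[np.random.randint(0, 3, 100)]   # made-up one-hot labels
model.fit(X_train, y_train, batch_size=16, epochs=10)

# 4. Predict on new data
preds = model.predict(np.random.rand(5, 3))

# 5. Evaluate
loss, acc = model.evaluate(X_train, y_train)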
Building the model
Keras provides two approaches to building the structure of your model:
Sequential Model: allows a linear stack of layers; simpler and more convenient if your model has this form.
Functional API: more detailed and complex, but allows more complicated architectures.
We will focus on the Sequential Model.
Running Example, this time in Keras
Let's build the Neural Network structure shown below in Keras:
[Figure: a fully connected network with three inputs x1, x2, x3, hidden layers of sigmoid (σ) units, and three outputs y1, y2, y3]
Keras—Sequential Model
First, import the Sequential function and initialize your model object:
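A minimal sketch of that step:

from keras.models import Sequential

model = Sequential()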
Keras—Sequential Model
Then we add layers to the model one by one.
from keras.layers import Dense, Activation
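Continuing the sketch for the running example (three inputs and three sigmoid outputs; the hidden-layer width of 4 is an assumption read off the figure):

model.add(Dense(4, input_dim=3))   # first hidden layer: 4 units, 3 inputs
model.add(Activation('sigmoid'))
model.add(Dense(4))                # second hidden layer (assumed width 4)
model.add(Activation('sigmoid'))
model.add(Dense(3))                # output layer: one unit per output y1..y3
model.add(Activation('sigmoid'))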
Multiclass Classification with Neural Networks
For binary classification problems, we have a final layer with a single node and a sigmoid activation.
This has many desirable properties
Gives an output strictly between 0 and 1
Can be interpreted as a probability
Derivative is “nice”
Analogous to logistic regression
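In Keras, such a final layer could look like this (a sketch in the same style as above):

model.add(Dense(1, activation='sigmoid'))   # single node, sigmoid output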
Multiclass Classification with Neural Networks
Reminder: one-hot encoding for categories.
Take a vector with length equal to the number of categories.
Represent each category with a one at a particular position (and zeros everywhere else):
Cat:     1 0 0
Dog:     0 1 0
Toaster: 0 0 1
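Keras ships a helper for this encoding (a small sketch; keras.utils.to_categorical is the assumed import path):

from keras.utils import to_categorical

labels = [0, 1, 2]             # e.g. Cat=0, Dog=1, Toaster=2
print(to_categorical(labels))  # [[1,0,0],[0,1,0],[0,0,1]]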
Multiclass Classification with Neural Networks
For multiclass classification problems, let the final layer be a vector with length equal to the number of possible classes.
The extension of the sigmoid to multiclass is the softmax function:

$\mathrm{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}$

This yields a vector with entries that are between 0 and 1 and sum to 1.
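A quick NumPy check of this definition (a sketch; subtracting the max is a standard numerical-stability trick, not something from the slide):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))        # [0.659 0.242 0.099]
print(softmax(z).sum())  # 1.0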
Multiclass Classification with Neural Networks
For the loss function, use "categorical cross entropy".
This is just the log-loss function in disguise:

$CE = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$

where $y$ is the one-hot label vector and $\hat{y}$ is the predicted probability vector.
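Checking the formula in NumPy with the one-hot labels from above (the predicted probabilities are made up):

import numpy as np

y_true = np.array([1.0, 0.0, 0.0])   # one-hot: "Cat"
y_pred = np.array([0.7, 0.2, 0.1])   # model's softmax output
ce = -np.sum(y_true * np.log(y_pred))
print(ce)                            # 0.357 (= -log(0.7))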
Ways to scale inputs
Linear scaling to the interval [0, 1]:

$x_i' = \dfrac{x_i - x_{min}}{x_{max} - x_{min}}$

Linear scaling to the interval [-1, 1]:

$x_i' = 2\,\dfrac{x_i - x_{min}}{x_{max} - x_{min}} - 1$
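A sketch of [0, 1] scaling with NumPy, applied column-wise to a made-up feature matrix:

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X01)   # each column now runs from 0 to 1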
Ways to scale inputs
Standardization (making the variable approximately standard normal):

$x_i' = \dfrac{x_i - \bar{x}}{\sigma}, \qquad \sigma = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
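And a matching NumPy sketch (again column-wise, on a made-up matrix):

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Xstd = (X - X.mean(axis=0)) / X.std(axis=0)   # np.std uses the 1/n form by default
print(Xstd.mean(axis=0), Xstd.std(axis=0))    # ~0 and 1 per column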