L5 Neural Network
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2023
Contents
¡ Introduction to Machine Learning
¡ Supervised learning
¨ Artificial neural network
¡ Unsupervised learning
¡ Reinforcement learning
¡ Practical advice
Artificial neural network: introduction (1)
¡ Artificial neural network (ANN)
¡ Simulates the biological neural systems (human brain)
¡ ANN is a structure/network made of interconnected artificial neurons
¡ Neuron
¡ Has input/output
¡ Executes a local calculation (local function)
¡ Output of a neuron is characterized by
¡ In/out characteristics
¡ Connections between it and other neurons
¡ (Possible) other inputs
Artificial neural network: introduction (2)
¡ ANN can be thought of as a highly decentralized and parallel
information processing structure
¡ ANN can learn, recall and generalize from the training data
¡ The ability of an ANN depends on
¡ Network architecture
¡ Input/output characteristics
¡ Learning algorithm
¡ Training data
Structure of a neuron
¡ Input signals of a neuron: $\{x_i, i = 1 \dots m\}$
¡ Each input signal $x_i$ is associated with a weight $w_i$
¡ Bias $w_0$ (with $x_0 = 1$)
¡ Net input is a combination of the input signals: $Net(w, x)$
¡ Activation/transfer function $f$ computes the output of a neuron
¡ Output: $Out = f(Net(w, x))$
[Figure: a neuron with inputs x_0 = 1, x_1, ..., x_m weighted by w_0, w_1, ..., w_m, feeding the net input function (Net), followed by the activation function (f), which produces the output (Out)]
Net Input
¡ Net input is usually calculated by a function of linear form
$$Net = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m = w_0 \cdot 1 + \sum_{i=1}^{m} w_i x_i = \sum_{i=0}^{m} w_i x_i$$
¡ Role of bias:
¡ $Net = w_1 x_1$ may not separate the classes well
¡ $Net = w_1 x_1 + w_0$ is able to do better
[Figure: two plots of Net against x_1, one for Net = w_1 x_1 and one for Net = w_1 x_1 + w_0]
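The role of the bias can be illustrated with a minimal sketch. The two data points and the weight values are hypothetical, chosen only for illustration: without a bias both points give a net input of the same sign, while with a bias the sign of Net separates them.

```python
def net_input(w, x):
    """Linear net input: Net = w0*1 + w1*x1 + ... + wm*xm, where w[0] is the bias."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


# Hypothetical 1-D data: one class at x1 = 1, the other at x1 = 3.
points = [1.0, 3.0]

w_no_bias = [0.0, 1.0]     # Net = w1*x1 (bias fixed at 0)
w_with_bias = [-2.0, 1.0]  # Net = w1*x1 + w0

nets_no_bias = [net_input(w_no_bias, [x1]) for x1 in points]
nets_with_bias = [net_input(w_with_bias, [x1]) for x1 in points]

print(nets_no_bias)    # both values have the same sign
print(nets_with_bias)  # the sign of Net now differs between the two points
```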
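Activation function: hard-limited
¡ Also known as a threshold function
$$Out(Net) = HL(Net, \theta) = \begin{cases} 1, & \text{if } Net \ge \theta \\ 0, & \text{otherwise} \end{cases}$$
[Figure: binary hard-limiter (Out steps from 0 to 1 at Net = θ) and bipolar hard-limiter (Out steps from -1 to 1 at Net = θ)]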
Activation function: threshold logic
$$Out(Net) = tl(Net, \alpha, \theta) = \begin{cases} 0, & \text{if } Net < -\theta \\ \alpha(Net + \theta), & \text{if } -\theta \le Net \le \frac{1}{\alpha} - \theta \\ 1, & \text{if } Net > \frac{1}{\alpha} - \theta \end{cases} \quad (\alpha > 0)$$
$$= \max(0, \min(1, \alpha(Net + \theta)))$$
¡ Also known as a saturating linear function
¡ Combination of 2 activation functions: linear and hard-limited
¡ $\alpha$ determines the slope of the linear range
¡ Properties: continuous, non-smooth
[Figure: Out against Net, rising linearly from 0 at Net = -θ to 1 at Net = (1/α) - θ; the linear range has width 1/α]
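A minimal sketch of the hard-limiter and the threshold logic function, assuming the definitions given on the slides. The function names and default parameter values are illustrative choices. The piecewise form and the max/min form of the threshold logic function are written separately so that they can be compared.

```python
def hard_limiter(net, theta=0.0, bipolar=False):
    """Binary hard-limiter: 1 if Net >= theta, else 0 (or -1 in the bipolar variant)."""
    if net >= theta:
        return 1
    return -1 if bipolar else 0


def threshold_logic(net, alpha=1.0, theta=0.0):
    """Saturating linear function written as max(0, min(1, alpha*(Net + theta)))."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return max(0.0, min(1.0, alpha * (net + theta)))


def threshold_logic_piecewise(net, alpha=1.0, theta=0.0):
    """The same function written as the three-case definition."""
    if net < -theta:
        return 0.0
    if net > 1.0 / alpha - theta:
        return 1.0
    return alpha * (net + theta)


for net in (-1.0, -0.5, -0.25, 0.0, 1.0):
    print(net, threshold_logic(net, alpha=2.0, theta=0.5))
```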
Activation function: Sigmoid
$$Out(Net) = sf(Net, \alpha, \theta) = \frac{1}{1 + e^{-\alpha(Net + \theta)}}$$
¡ Popular
¡ The parameter $\alpha$ determines the slope
¡ Output lies between 0 and 1
¡ Advantages
¡ Continuous, smooth
¡ Gradient of a sigmoid function can be expressed as a function of itself
[Figure: sigmoid curve, Out against Net, passing through 0.5 at Net = -θ and saturating at 1]
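The last property can be checked numerically. The sketch below assumes the sigmoid definition from this slide, for which the derivative with respect to Net is α · Out · (1 − Out). The helper numeric_grad is a hypothetical central-difference check, not part of the lecture.

```python
import math


def sigmoid(net, alpha=1.0, theta=0.0):
    """Out = 1 / (1 + exp(-alpha * (Net + theta)))."""
    return 1.0 / (1.0 + math.exp(-alpha * (net + theta)))


def sigmoid_grad(net, alpha=1.0, theta=0.0):
    """dOut/dNet expressed through the output itself: alpha * Out * (1 - Out)."""
    out = sigmoid(net, alpha, theta)
    return alpha * out * (1.0 - out)


def numeric_grad(func, net, eps=1e-6):
    """Central-difference approximation of the derivative of func at net."""
    return (func(net + eps) - func(net - eps)) / (2.0 * eps)


analytic = sigmoid_grad(0.3, alpha=2.0, theta=0.1)
numeric = numeric_grad(lambda n: sigmoid(n, alpha=2.0, theta=0.1), 0.3)
print(analytic, numeric)
```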
Activation function: Hyperbolic tangent
$$Out(Net) = \tanh(Net, \alpha, \theta) = \frac{1 - e^{-\alpha(Net + \theta)}}{1 + e^{-\alpha(Net + \theta)}} = \frac{2}{1 + e^{-\alpha(Net + \theta)}} - 1$$
¡ Popular
¡ The parameter $\alpha$ determines the slope
¡ Output lies between -1 and 1
¡ Advantages
¡ Continuous, continuous derivative
¡ Gradient of a tanh function can be expressed as a function of itself
[Figure: tanh curve, Out against Net, crossing 0 at Net = -θ and saturating at -1 and 1]
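A minimal sketch of this activation, assuming the formula exactly as written on the slide. With this parameterization the output equals the standard tanh evaluated at α(Net + θ)/2, so the derivative with respect to Net is (α/2)(1 − Out²). The function names are illustrative.

```python
import math


def tanh_unit(net, alpha=1.0, theta=0.0):
    """Out = 2 / (1 + exp(-alpha * (Net + theta))) - 1, as defined on the slide."""
    return 2.0 / (1.0 + math.exp(-alpha * (net + theta))) - 1.0


def tanh_unit_grad(net, alpha=1.0, theta=0.0):
    """dOut/dNet expressed through the output itself: (alpha / 2) * (1 - Out^2)."""
    out = tanh_unit(net, alpha, theta)
    return 0.5 * alpha * (1.0 - out * out)


# With alpha = 2 and theta = 0 the slide's formula coincides with math.tanh(Net).
print(tanh_unit(0.7, alpha=2.0), math.tanh(0.7))

eps = 1e-6
numeric = (tanh_unit(0.3 + eps, 1.5, 0.2) - tanh_unit(0.3 - eps, 1.5, 0.2)) / (2.0 * eps)
analytic = tanh_unit_grad(0.3, 1.5, 0.2)
print(analytic, numeric)
```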
Act. function: Rectified linear unit (ReLU)
$$Out(Net) = \max(0, Net)$$
¡ Most popular
¡ Output is non-negative
¡ Properties
¡ Continuous
¡ No derivative at point 0
¡ Easy to calculate
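A minimal sketch of ReLU. Because the derivative does not exist at 0, an implementation has to pick a value there; using 0 at that point is an assumption of this sketch (a common convention), not something stated on the slide.

```python
def relu(net):
    """Out = max(0, Net)."""
    return max(0.0, net)


def relu_grad(net):
    """Derivative of ReLU: 1 for Net > 0, 0 for Net < 0.

    At Net = 0 the derivative is undefined; this sketch returns 0 by convention.
    """
    return 1.0 if net > 0 else 0.0


print([relu(v) for v in (-2.0, 0.0, 3.5)])
print([relu_grad(v) for v in (-2.0, 0.0, 3.5)])
```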
ANN: Architecture (1)
¡ ANN’s architecture is determined by bias
[Figure: example architectures: feed-forward network with multiple layers; recurrent network with single layer; recurrent network with multiple layers]
ANN: Training
¡ 2 types of learning in ANNs
¡ Parameter learning: The goal is to adapt the weights of the
connections in the ANN, given a fixed network structure
¡ Structure learning: The goal is to learn the network structure,
including the number of neurons and the types of connections
between them, and the weights
Or
§ where out(x) is the output of the network for the input x, whose label is $d_x$; loss is a function for measuring the prediction error
$$Out = sign(Net(w, x)) = sign\left(\sum_{j=0}^{m} w_j x_j\right)$$
¡ -1 otherwise
[Figure: the separation plane $w_0 + w_1 x_1 + w_2 x_2 = 0$ in the $(x_1, x_2)$ plane, with Output = -1 on one side]
Perceptron: Algorithm
¡ Training data D = {(x, d)}
¡ x is input vector
¡ d is output (1 or -1)
¡ The goal of the perceptron learning (training) process is to determine a weight vector that allows the perceptron to produce the correct output value (-1 or 1) for each data point
¡ For a data point x correctly classified by the perceptron, the weight vector w is unchanged
¡ If d = 1 but the perceptron produces -1 (Out = -1), then w needs to be
changed so that the value of Net (w, x) increases
¡ If d = -1 but the perceptron produces 1 (Out = 1), then w needs to be
changed so that the value of Net (w, x) decreases
Perceptron: Batch training
Perceptron_batch(D, η)
  Initialize w (w_i ← an initial (small) random value)
  do
    ∆w ← 0
    for each instance (x, d) ∈ D
      Compute the real output value Out
      if (Out ≠ d)
        ∆w ← ∆w + η(d - Out)x
    end for
    w ← w + ∆w
  until all the training instances in D are correctly classified
  return w
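The batch procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the output is taken as 1 when Net > 0 and -1 otherwise, the max_epochs guard and the seed parameter are additions for safety and reproducibility, and the AND data set is a hypothetical linearly separable example.

```python
import random


def sign(value):
    """Assumed convention: 1 if Net > 0, -1 otherwise."""
    return 1 if value > 0 else -1


def perceptron_batch(data, eta=0.1, max_epochs=1000, seed=0):
    """Batch perceptron training; data is a list of (x, d) pairs with d in {-1, 1}."""
    rng = random.Random(seed)
    m = len(data[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(m + 1)]  # w[0] is the bias
    for _ in range(max_epochs):
        delta_w = [0.0] * (m + 1)
        errors = 0
        for x, d in data:
            xb = [1.0] + list(x)  # prepend x0 = 1 for the bias
            out = sign(sum(wi * xi for wi, xi in zip(w, xb)))
            if out != d:
                errors += 1
                for i, xi in enumerate(xb):
                    delta_w[i] += eta * (d - out) * xi
        if errors == 0:  # all instances correctly classified
            return w
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]
    return w  # reached only if max_epochs is exhausted


def predict(w, x):
    return sign(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


# Hypothetical linearly separable data: logical AND with labels in {-1, 1}.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = perceptron_batch(and_data)
print([predict(w, x) for x, _ in and_data])
```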
Perceptron: Limitation
¡ The training algorithm for perceptron is proved to converge if:
¡ Data points are linearly separable
[Figure: a data set that is not linearly separable, annotated "A perceptron cannot classify correctly for this case!"]
$$E_D(w) = \frac{1}{|D|} \sum_{x \in D} E_x(w)$$
Minimize errors with gradients
¡ Gradient of E (denoted by ∇E) is a vector
$$\nabla E(w) = \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \dots, \frac{\partial E}{\partial w_N}\right)$$
¡ where N is the total number of weights (connections) in the ANN
¡ The gradient ∇E determines the direction that causes the steepest increase for the error value E
¡ Therefore, the direction that causes the steepest decrease is opposite to the gradient of E
$$\Delta w = -\eta \nabla E(w); \quad \Delta w_i = -\eta \frac{\partial E}{\partial w_i} \text{ for } i = 1 \dots N$$
[Figure: error surfaces E(w) in one-dimensional space and E(w_1, w_2) in 2-dimensional space]
Incremental training
Gradient_descent_incremental(D, η)
  Initialize w (w_i ← an initial (small) random value)
  do
    for each training instance (x, d) ∈ D
      Compute the network output
      for each weight component w_i
        w_i ← w_i - η(∂E_x/∂w_i)
      end for
    end for
  until (stopping criterion satisfied)
  return w
Stopping criterion: epochs, threshold error, ...
Note: if we take a small subset (mini-batch) randomly from D to update the weights, we will have mini-batch training.
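The incremental procedure can be made concrete for a single linear unit. The sketch below assumes a per-instance squared error E_x = 0.5 (d − Out)² with Out = Σ w_i x_i, which gives ∂E_x/∂w_i = −(d − Out) x_i. The data set (generated from d = 1 + 2 x_1) and the fixed epoch count used as the stopping criterion are hypothetical choices.

```python
import random


def gradient_descent_incremental(data, eta=0.1, epochs=500, seed=0):
    """Incremental gradient descent for one linear unit Out = sum_i wi*xi.

    Assumes E_x = 0.5 * (d - Out)^2, so dE_x/dwi = -(d - Out) * xi.
    """
    rng = random.Random(seed)
    m = len(data[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(m + 1)]  # w[0] is the bias
    for _ in range(epochs):  # stopping criterion: number of epochs
        for x, d in data:
            xb = [1.0] + list(x)
            out = sum(wi * xi for wi, xi in zip(w, xb))  # network output
            for i, xi in enumerate(xb):
                grad_i = -(d - out) * xi
                w[i] = w[i] - eta * grad_i
    return w


# Hypothetical noise-free data generated from d = 1 + 2*x1.
data = [((x1,), 1.0 + 2.0 * x1) for x1 in (0.0, 0.5, 1.0, 1.5, 2.0)]
w = gradient_descent_incremental(data)
print(w)  # expected to approach the generating weights (bias 1, slope 2)
```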
Backpropagation algorithm
¡ A perceptron can only represent a linear function
¡ A multi-layer NN learned by the Backpropagation (BP) algorithm can
represent a highly non-linear function
¡ The BP algorithm is used to learn the weights of an ANN
¡ Fixed network structure (a network structure chosen in advance)
¡ For each neuron, the activation function must be differentiable
¡ Outi is the output value of the network corresponding to the output neuron yi
BP algorithm: Forward (1)
¡ For each data point x
¡ Input vector x is forwarded from the input layer to the output layer
¡ The network will generate an actual output value Out (a vector with values $Out_i$, i = 1..n)
¡ For an input vector x, a neuron $z_q$ at the hidden layer receives the net input
$$Net_q = \sum_{j=1}^{m} w_{qj} x_j$$
then produces a (local) output value
$$Out_q = f(Net_q) = f\left(\sum_{j=1}^{m} w_{qj} x_j\right)$$
where f(.) is an activation function of neuron $z_q$
BP algorithm: Forward (2)
¡ Net input value of the neuron $y_i$ at the output layer
$$Net_i = \sum_{q=1}^{l} w_{iq} Out_q = \sum_{q=1}^{l} w_{iq} f\left(\sum_{j=1}^{m} w_{qj} x_j\right)$$
¡ Neuron $y_i$ produces an output value (which is an output value of the network)
$$Out_i = f(Net_i) = f\left(\sum_{q=1}^{l} w_{iq} Out_q\right) = f\left(\sum_{q=1}^{l} w_{iq} f\left(\sum_{j=1}^{m} w_{qj} x_j\right)\right)$$
¡ The vector of output values $Out_i$ (i = 1..n) is the actual output value of the network for the input vector x
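The forward pass for a network with one hidden layer can be sketched as below. This assumes a sigmoid activation for every neuron and, as in the formulas of these slides, no separate bias terms. The weight layout (one row of incoming weights per neuron) and the example values are hypothetical.

```python
import math


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def forward(x, w_hidden, w_output, f=sigmoid):
    """Forward pass for one hidden layer.

    w_hidden[q][j] is the weight w_qj from input j to hidden neuron z_q;
    w_output[i][q] is the weight w_iq from hidden neuron z_q to output neuron y_i.
    """
    out_hidden = [f(sum(w_qj * x_j for w_qj, x_j in zip(row, x))) for row in w_hidden]
    out = [f(sum(w_iq * out_q for w_iq, out_q in zip(row, out_hidden))) for row in w_output]
    return out_hidden, out


# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output neuron.
w_hidden = [[0.0, 0.0], [0.0, 0.0]]
w_output = [[1.0, -1.0]]
out_hidden, out = forward([0.3, -0.7], w_hidden, w_output)
print(out_hidden, out)
```

With all hidden weights at zero, each hidden neuron outputs f(0) = 0.5, and the two output weights cancel, so the network output is again f(0) = 0.5.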
BP algorithm: Backward (1)
¡ For each data point x
¡ Error signals due to the difference between the desired output value d and the actual output value Out are calculated
¡ These error signals are back-propagated from the output layer to the front layers, to update weights
¡ To consider the error signals and their back-propagated ones, an error function needs to be defined
$$E(w) = \frac{1}{2} \sum_{i=1}^{n} (d_i - Out_i)^2 = \frac{1}{2} \sum_{i=1}^{n} [d_i - f(Net_i)]^2 = \frac{1}{2} \sum_{i=1}^{n} \left[d_i - f\left(\sum_{q=1}^{l} w_{iq} Out_q\right)\right]^2$$
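BP algorithm: Backward (2)
¡ According to the gradient descent method, the weights of the connections from the hidden layer to the output layer are updated by
$$\Delta w_{iq} = -\eta \frac{\partial E}{\partial w_{iq}}$$
¡ Using the derivative chain rule for $\partial E / \partial w_{iq}$, we have
$$\Delta w_{iq} = -\eta \left[\frac{\partial E}{\partial Out_i}\right] \left[\frac{\partial Out_i}{\partial Net_i}\right] \left[\frac{\partial Net_i}{\partial w_{iq}}\right] = \eta [d_i - Out_i][f'(Net_i)][Out_q] = \eta \delta_i Out_q$$
¡ $\delta_i$ is the error signal of neuron $y_i$ at the output layer
$$\delta_i = -\frac{\partial E}{\partial Net_i} = -\left[\frac{\partial E}{\partial Out_i}\right] \left[\frac{\partial Out_i}{\partial Net_i}\right] = [d_i - Out_i][f'(Net_i)]$$
where $Net_i$ is the net input of the neuron $y_i$ at the output layer, and $f'(Net_i) = \partial f(Net_i) / \partial Net_i$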
BP algorithm: Backward (3)
¡ To update the weights of the connections from the input layer to the hidden layer, we also apply the gradient-descent method and the derivative chain rule
$$\Delta w_{qj} = -\eta \frac{\partial E}{\partial w_{qj}} = -\eta \left[\frac{\partial E}{\partial Out_q}\right] \left[\frac{\partial Out_q}{\partial Net_q}\right] \left[\frac{\partial Net_q}{\partial w_{qj}}\right]$$
¡ From the formula for calculating the error function E(w), we see that each error component $(d_i - Out_i)$ (i = 1..n) is a function of $Out_q$
$$E(w) = \frac{1}{2} \sum_{i=1}^{n} \left[d_i - f\left(\sum_{q=1}^{l} w_{iq} Out_q\right)\right]^2$$
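BP algorithm: Backward (4)
¡ Applying the derivative chain rule, we have
$$\Delta w_{qj} = \eta \sum_{i=1}^{n} \left[(d_i - Out_i) f'(Net_i) w_{iq}\right] f'(Net_q) x_j = \eta \sum_{i=1}^{n} \left[\delta_i w_{iq}\right] f'(Net_q) x_j = \eta \delta_q x_j$$
¡ $\delta_q$ is the error signal of neuron $z_q$ at the hidden layer
$$\delta_q = -\frac{\partial E}{\partial Net_q} = -\left[\frac{\partial E}{\partial Out_q}\right] \left[\frac{\partial Out_q}{\partial Net_q}\right] = f'(Net_q) \sum_{i=1}^{n} \delta_i w_{iq}$$
where $Net_q$ is the net input of the neuron $z_q$ at the hidden layer, and $f'(Net_q) = \partial f(Net_q) / \partial Net_q$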
BP algorithm: Backward (5)
¡ According to the formulas for calculating the error signals $\delta_i$ and $\delta_q$, the error signal of a neuron in the hidden layer is different from the error signal of a neuron in the output layer
¡ Because of this difference, the weight update procedure in the BP algorithm is also known as the generalized delta learning rule
¡ The error signal $\delta_q$ of neuron $z_q$ at the hidden layer is determined by:
¡ The error signals $\delta_i$ of the neurons $y_i$ at the output layer (to which neuron $z_q$ is connected)
¡ The weights $w_{iq}$
BP algorithm: Backward (6)
¡ The process of calculating the error signals as above can be extended (generalized) easily for neural networks with more than 1 hidden layer
¡ The general form of the weight update rule in the BP algorithm
$$\Delta w_{ab} = \eta \delta_a x_b$$
¡ b and a are the 2 indices corresponding to the two ends of the connection (b → a) (from a neuron (or input signal) b to neuron a)
¡ $x_b$ is the output value of the neuron at the hidden layer (or input signal) b
¡ $\delta_a$ is the error signal of neuron a
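The update rule Δw_ab = η δ_a x_b can be checked against a numerical derivative of the error. The sketch below is a minimal check for a network with one hidden layer, assuming sigmoid activations, no bias terms, and E = 0.5 Σ (d_i − Out_i)². All weight values and the names W1, W2 are hypothetical.

```python
import math


def f(net):
    return 1.0 / (1.0 + math.exp(-net))


def f_prime(net):
    out = f(net)
    return out * (1.0 - out)


def forward(x, W1, W2):
    net_q = [sum(w * xj for w, xj in zip(row, x)) for row in W1]
    out_q = [f(n) for n in net_q]
    net_i = [sum(w * o for w, o in zip(row, out_q)) for row in W2]
    out_i = [f(n) for n in net_i]
    return net_q, out_q, net_i, out_i


def error(x, d, W1, W2):
    out_i = forward(x, W1, W2)[3]
    return 0.5 * sum((di - oi) ** 2 for di, oi in zip(d, out_i))


def backprop_updates(x, d, W1, W2, eta):
    """Return (dW1, dW2), the weight changes eta * delta_a * x_b for both layers."""
    net_q, out_q, net_i, out_i = forward(x, W1, W2)
    delta_i = [(di - oi) * f_prime(n) for di, oi, n in zip(d, out_i, net_i)]
    delta_q = [f_prime(net_q[q]) * sum(delta_i[i] * W2[i][q] for i in range(len(W2)))
               for q in range(len(W1))]
    dW2 = [[eta * delta_i[i] * out_q[q] for q in range(len(out_q))] for i in range(len(W2))]
    dW1 = [[eta * delta_q[q] * x[j] for j in range(len(x))] for q in range(len(W1))]
    return dW1, dW2


x, d = [0.5, -1.0], [1.0]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.2, -0.5]]
dW1, dW2 = backprop_updates(x, d, W1, W2, eta=1.0)

# Numerical check of -dE/dw for the hidden-layer weight W1[0][1].
eps = 1e-6
W1_plus = [[w + (eps if (q, j) == (0, 1) else 0.0) for j, w in enumerate(row)] for q, row in enumerate(W1)]
W1_minus = [[w - (eps if (q, j) == (0, 1) else 0.0) for j, w in enumerate(row)] for q, row in enumerate(W1)]
numeric = -(error(x, d, W1_plus, W2) - error(x, d, W1_minus, W2)) / (2.0 * eps)
print(dW1[0][1], numeric)
```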
BP algorithm
Back_propagation_incremental(D, η)
Neural network consists of Q layers, q = 1, 2, ..., Q
$^{q}Net_i$ and $^{q}Out_i$ are the net input and output value of neuron i at layer q
Step 0 (Initialization)
  Select the error threshold $E_{threshold}$ (the acceptable error value)
  Initialize the weights with small random values
  Assign E = 0
Step 1 (Start a training cycle)
  Apply the input vector of the data point k to the input layer (q = 1)
  $$^{q}Out_i = {}^{1}Out_i = x_i^{(k)}, \ \forall i$$
Step 2 (Forward)
  Forward the input signals over the network, until the network output values $^{Q}Out_i$ (at the output layer) are received
  $$^{q}Out_i = f\left(^{q}Net_i\right) = f\left(\sum_{j} {}^{q}w_{ij} \, {}^{q-1}Out_j\right)$$
BP algorithm
Step 3 (Calculate the output error)
  Calculate the network output error and the error signal $^{Q}\delta_i$ of each neuron at the output layer
  $$E = E + \frac{1}{2} \sum_{i=1}^{n} \left(d_i^{(k)} - {}^{Q}Out_i\right)^2$$
  $$^{Q}\delta_i = \left(d_i^{(k)} - {}^{Q}Out_i\right) f'\left(^{Q}Net_i\right)$$
Step 4 (Error backward)
  Back-propagate the error to update the weights and calculate the error signals $^{q-1}\delta_i$ for the front layers
  $$\Delta\,{}^{q}w_{ij} = \eta \cdot \left(^{q}\delta_i\right) \cdot \left(^{q-1}Out_j\right); \quad {}^{q}w_{ij} = {}^{q}w_{ij} + \Delta\,{}^{q}w_{ij}$$
  $$^{q-1}\delta_i = f'\left(^{q-1}Net_i\right) \sum_{j} {}^{q}w_{ji} \, {}^{q}\delta_j$$
Step 5 (Check stopping criterion satisfied)
  Check whether the entire training data has been used
  If the entire training data has been used, go to Step 6; otherwise go to Step 1
Step 6 (Check net error)
  If the net error E is less than the acceptable threshold ($< E_{threshold}$), then training is completed and the learned weights are returned;
  otherwise, assign E = 0 and start a new training cycle (go back to Step 1)
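Steps 0 to 6 can be sketched as a short training loop. This is a minimal sketch under several assumptions that are not in the slides: sigmoid activations (so f'(Net) = Out (1 − Out)), no bias terms, a max_epochs guard in addition to the error threshold, a fixed random seed, and a hypothetical data set (logical OR).

```python
import math
import random


def f(net):
    return 1.0 / (1.0 + math.exp(-net))


def f_prime_from_out(out):
    """For the sigmoid, f'(Net) can be written as Out * (1 - Out)."""
    return out * (1.0 - out)


def train_bp(data, layer_sizes, eta=0.5, e_threshold=0.01, max_epochs=2000, seed=0):
    # Step 0: small random weights; weights[l][i][j] connects neuron j of
    # layer l to neuron i of layer l + 1 (layers are indexed from 0 here).
    rng = random.Random(seed)
    weights = [[[rng.uniform(-0.5, 0.5) for _ in range(layer_sizes[l])]
                for _ in range(layer_sizes[l + 1])]
               for l in range(len(layer_sizes) - 1)]
    history = []
    for _ in range(max_epochs):
        E = 0.0
        for x, d in data:
            # Steps 1-2: apply the input and forward it through all layers.
            outs = [list(x)]
            for W in weights:
                outs.append([f(sum(w * o for w, o in zip(row, outs[-1]))) for row in W])
            # Step 3: accumulate the error and compute output error signals.
            E += 0.5 * sum((di - oi) ** 2 for di, oi in zip(d, outs[-1]))
            delta = [(di - oi) * f_prime_from_out(oi) for di, oi in zip(d, outs[-1])]
            # Step 4: propagate error signals backward and update the weights.
            for l in range(len(weights) - 1, -1, -1):
                W, prev_out = weights[l], outs[l]
                prev_delta = None
                if l > 0:  # error signals of the previous layer, using the old weights
                    prev_delta = [f_prime_from_out(prev_out[j]) *
                                  sum(delta[i] * W[i][j] for i in range(len(W)))
                                  for j in range(len(prev_out))]
                for i in range(len(W)):
                    for j in range(len(prev_out)):
                        W[i][j] += eta * delta[i] * prev_out[j]
                delta = prev_delta
        # Steps 5-6: one pass over the data is done; check the accumulated error.
        history.append(E)
        if E < e_threshold:
            break
    return weights, history


# Hypothetical training data: logical OR with targets in {0, 1}.
or_data = [((0, 0), [0.0]), ((0, 1), [1.0]), ((1, 0), [1.0]), ((1, 1), [1.0])]
weights, history = train_bp(or_data, layer_sizes=[2, 2, 1])
print(history[0], history[-1])
```

The error recorded for each cycle is accumulated while the weights are changing, as in the step description, so it is only an approximation of the error of the final weights of that cycle.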
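BP algorithm: Forward (1)
[Figure: example network with inputs x_1, x_2; neurons 1, 2, 3 (outputs f(Net_1), f(Net_2), f(Net_3)) in the first hidden layer; neurons 4, 5 (outputs f(Net_4), f(Net_5)) in the second hidden layer; output neuron 6 producing Out_6 = f(Net_6)]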
BP algorithm: Forward (2)
[Figure: the example network, with the connections from x_1 and x_2 into neuron 1 highlighted]
$$Out_1 = f(w_{1x_1} x_1 + w_{1x_2} x_2)$$
BP algorithm: Forward (3)
[Figure: the example network, with the connections from x_1 and x_2 into neuron 2 highlighted]
$$Out_2 = f(w_{2x_1} x_1 + w_{2x_2} x_2)$$
BP algorithm: Forward (4)
[Figure: the example network, with the connections from x_1 and x_2 into neuron 3 highlighted]
$$Out_3 = f(w_{3x_1} x_1 + w_{3x_2} x_2)$$
BP algorithm: Forward (5)
[Figure: the example network, with the connections from neurons 1, 2, 3 into neuron 4 highlighted]
$$Out_4 = f(w_{41} Out_1 + w_{42} Out_2 + w_{43} Out_3)$$
BP algorithm: Forward (6)
[Figure: the example network, with the connections from neurons 1, 2, 3 into neuron 5 highlighted]
$$Out_5 = f(w_{51} Out_1 + w_{52} Out_2 + w_{53} Out_3)$$
BP algorithm: Forward (7)
[Figure: the example network, with the connections from neurons 4 and 5 into neuron 6 highlighted]
$$Out_6 = f(w_{64} Out_4 + w_{65} Out_5)$$
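BP algorithm: Calculate error
[Figure: the example network, with the error signal δ_6 marked at the output neuron 6]
¡ d is the desired output value
$$\delta_6 = -\frac{\partial E}{\partial Net_6} = -\left[\frac{\partial E}{\partial Out_6}\right] \left[\frac{\partial Out_6}{\partial Net_6}\right] = [d - Out_6][f'(Net_6)]$$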
BP algorithm: Backward(1)
[Figure: the example network, with δ_6 propagated back to neuron 4 through the weight w_64]
$$\delta_4 = f'(Net_4)(w_{64} \delta_6)$$
BP algorithm: Backward(2)
[Figure: the example network, with δ_6 propagated back to neuron 5 through the weight w_65]
$$\delta_5 = f'(Net_5)(w_{65} \delta_6)$$
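BP algorithm: Backward(3)
[Figure: the example network, with δ_4 and δ_5 propagated back to neuron 1 through the weights w_41 and w_51]
$$\delta_1 = f'(Net_1)(w_{41} \delta_4 + w_{51} \delta_5)$$
BP algorithm: Backward(4)
[Figure: the example network, with δ_4 and δ_5 propagated back to neuron 2 through the weights w_42 and w_52]
$$\delta_2 = f'(Net_2)(w_{42} \delta_4 + w_{52} \delta_5)$$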
BP algorithm: Backward(5)
[Figure: the example network, with δ_4 and δ_5 propagated back to neuron 3 through the weights w_43 and w_53]
$$\delta_3 = f'(Net_3)(w_{43} \delta_4 + w_{53} \delta_5)$$
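BP algorithm: Update weight(1)
[Figure: the example network, with the weights w_{1x_1} and w_{1x_2} into neuron 1 and its error signal δ_1 highlighted]
$$w_{1x_1} = w_{1x_1} + \eta \delta_1 x_1$$
$$w_{1x_2} = w_{1x_2} + \eta \delta_1 x_2$$
BP algorithm: Update weight(2)
[Figure: the example network, with the weights w_{2x_1} and w_{2x_2} into neuron 2 and its error signal δ_2 highlighted]
$$w_{2x_1} = w_{2x_1} + \eta \delta_2 x_1$$
$$w_{2x_2} = w_{2x_2} + \eta \delta_2 x_2$$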
BP algorithm: Update weight(3)
[Figure: the example network, with the weights w_{3x_1} and w_{3x_2} into neuron 3 and its error signal δ_3 highlighted]
$$w_{3x_1} = w_{3x_1} + \eta \delta_3 x_1$$
$$w_{3x_2} = w_{3x_2} + \eta \delta_3 x_2$$
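BP algorithm: Update weight(4)
[Figure: the example network, with the weights w_41, w_42, w_43 into neuron 4 and its error signal δ_4 highlighted]
$$w_{41} = w_{41} + \eta \delta_4 Out_1$$
$$w_{42} = w_{42} + \eta \delta_4 Out_2$$
$$w_{43} = w_{43} + \eta \delta_4 Out_3$$
BP algorithm: Update weight(5)
[Figure: the example network, with the weights w_51, w_52, w_53 into neuron 5 and its error signal δ_5 highlighted]
$$w_{51} = w_{51} + \eta \delta_5 Out_1$$
$$w_{52} = w_{52} + \eta \delta_5 Out_2$$
$$w_{53} = w_{53} + \eta \delta_5 Out_3$$
BP algorithm: Update weight(6)
[Figure: the example network, with the weights w_64, w_65 into neuron 6 and its error signal δ_6 highlighted]
$$w_{64} = w_{64} + \eta \delta_6 Out_4$$
$$w_{65} = w_{65} + \eta \delta_6 Out_5$$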
BP algorithm: Initialize weights
¡ Normally, weights are initialized with small random values
¡ If the weights have large initial values
¡ Sigmoid functions will saturate early
¡ The system will get stuck at saddle/stationary points
BP algorithm: Learning rate
¡ Important effect on the efficiency and convergence of the BP algorithm
¡ A large value of η can accelerate the convergence of the learning process, but can cause the system to skip over the global optimum or get stuck at bad points (saddle points).
¡ A small value of η can make the learning process take a long time
¡ Often selected empirically
¡ Good values of the learning rate at the beginning (of the learning process) may not be good at a later time
¡ Using an adaptive (dynamic) learning rate?
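The effect of the learning rate can be illustrated on a toy problem. The sketch below assumes a hypothetical one-dimensional error E(w) = w² (gradient 2w, minimum at w = 0) rather than a neural network, and compares three values of η over the same number of steps.

```python
def descend(eta, w0=1.0, steps=20):
    """Gradient descent on the toy error E(w) = w^2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w


w_small = descend(0.01)    # small eta: moves toward 0 slowly
w_moderate = descend(0.4)  # moderate eta: very close to 0 after 20 steps
w_large = descend(1.1)     # large eta: each step overshoots and |w| grows

print(w_small, w_moderate, w_large)
```

On this toy error each step multiplies w by (1 − 2η), so the iterates shrink only when 0 < η < 1; the thresholds for a real network depend on the shape of its error surface.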
BP algorithm: Momentum
¡ The gradient descent method can
[Figure: vector diagram with labels αΔw(t') and -η∇E(t'+1)]