Deep Learning Fundamentals Materials
1
1. What is Deep Learning?
2
What is Deep Learning?
● Deep Learning (DL): A type of ML.
[Figure: model performance versus data amount, with a curve labeled "Deep Learning"]
4
Challenge in Handling High Dimensional Data
Example: 3,000 employees described by categorical attributes
• Gender: Male, Female, Others (3 types)
• Department: Admin, IT, Operation, Others (4 types)
• Age group: 20-29, 30-39, 40-49, 50-59, 60- (5 types)
• Rating: 7 (Excellent) ‥‥ 1 (Poor) (7 types)
• Job title: Manager, Supervisor, Rank-and-filer (3 types)
3 × 4 = 12 types, 12 × 5 = 60 types, 60 × 7 = 420 types, 420 × 3 = 1,260 types
1,260 types in 3,000 employees… Each type has only about 2.4 employees.
5
Curse of Dimensionality
6
Advantage of Deep Learning
7
2. Artificial Neural
Network
8
Neurons
Dendrite: receives signals from other neurons.
9
Neural Network
10
Modeling Neural Network
[Diagram: a modeled neuron outputs a signal when Input A + Input B > Threshold]
11
Example
[Diagram: an example neuron with input $x_1$ and weight $w_1$]
12
3. Perceptron
13
What is a Perceptron?
[Diagram: a perceptron with input nodes $x_1$, $x_2$, edge weights $w_1$, $w_2$, bias $b$, and output $y$ (0 or 1)]
14
Output of a Perceptron
[Diagram: the perceptron sums the weighted inputs and the bias, $w_1 x_1 + w_2 x_2 + b$]
The perceptron outputs $y = 1$ if $w_1 x_1 + w_2 x_2 + b > 0$, and $y = 0$ otherwise.
18
Logic Gate
A logic gate takes inputs of 1 or 0 and returns an output of 1 or 0.
・ AND
・ OR
・ NAND
19
AND Gate
The perceptron's parameters $(w_1, w_2, \theta)$ determine which truth table it produces.
[Table: truth tables for the parameter settings $(0.6, 0.4, 0.8)$ and $(0.7, 0.7, 0.6)$]
22
Perceptron and Logic Gate
23
5. Logic Gate with Python
24
Logic Gate
25
AND Gate
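For reference, the AND, NAND, and OR gates can be written as simple perceptron functions with NumPy. This is a minimal sketch; the function names and the particular weights and biases (e.g. 0.5, 0.5, -0.7) are illustrative choices, and other parameter settings work as well.

# Logic gates as perceptrons: output 1 if w1*x1 + w2*x2 + b > 0, else 0
import numpy as np

def AND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])    # weights
    b = -0.7                    # bias (threshold 0.7)
    return int(np.sum(w * x) + b > 0)

def NAND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([-0.5, -0.5])
    b = 0.7
    return int(np.sum(w * x) + b > 0)

def OR(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.2
    return int(np.sum(w * x) + b > 0)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, AND(x1, x2), NAND(x1, x2), OR(x1, x2))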
29
XOR Gate
Exclusive OR gate
A logic gate that outputs 1 only when exactly one of $x_1$ or $x_2$ is 1; otherwise, it outputs 0.
Truth Table
$x_1$  $x_2$  $y$
 0      0     0
 1      0     1
 0      1     1
 1      1     0
30
Output of a Perceptron
[Plot: a perceptron separates the ($x_1$, $x_2$) plane into two regions with a single straight line]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     0
 0      1     0
 1      1     1
32
AND Gate and Perceptron
[Plot: a single straight line separates the point (1, 1), where $y = 1$, from the other three points]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     0
 0      1     0
 1      1     1
33
OR Gate and Perceptron
[Plot: a single straight line separates the point (0, 0), where $y = 0$, from the other three points]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     1
 0      1     1
 1      1     1
34
NAND Gate
[Plot: a single straight line separates the point (1, 1), where $y = 0$, from the other three points]
$x_1$  $x_2$  $y$
 0      0     1
 1      0     1
 0      1     1
 1      1     0
35
XOR Gate and Perceptron
[Plot: no single straight line can separate the points where $y = 1$ from the points where $y = 0$]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     1
 0      1     1
 1      1     0
36
XOR Gate and Multiple Perceptrons
[Plot: combining multiple perceptrons produces a non-linear boundary that separates the XOR outputs]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     1
 0      1     1
 1      1     0
40
XOR Gate and Multiple Perceptrons
1 0 1
0 1 1
0 𝑥𝑥1 1 1 0
1
41
Truth Tables
42
Multilayer Perceptron
[Diagram: inputs $x_1$, $x_2$ feed a NAND gate (output $s_1$) and an OR gate (output $s_2$); combining $s_1$ and $s_2$ with an AND gate gives the output $y$, which realizes XOR]
43
7. Multilayer Perceptron with Python
44
XOR Gate
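Following the multilayer construction above, XOR can be sketched by combining the gate functions (this assumes the AND, NAND, and OR functions shown earlier; the layout is one possible implementation):

# XOR as a two-layer combination of perceptrons: XOR(x1, x2) = AND(NAND(x1, x2), OR(x1, x2))
def XOR(x1, x2):
    s1 = NAND(x1, x2)
    s2 = OR(x1, x2)
    return AND(s1, s2)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, XOR(x1, x2))    # outputs 0, 1, 1, 0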
Multilayer Perceptron for Regression
# Import libraries
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
46
Multilayer Perceptron for Regression (Continued)
# Create X and y
X = pd.DataFrame(data)
y = target
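A minimal sketch of fitting and evaluating the regressor with the libraries imported above; the hidden-layer sizes, test split, and other hyperparameters here are illustrative, not necessarily the ones used on the original slides.

# Split, scale, fit, and evaluate an MLP regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

mlp_reg = MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
mlp_reg.fit(X_train_s, y_train)

y_pred = mlp_reg.predict(X_test_s)
print(mean_squared_error(y_test, y_pred))    # test MSE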
47
Multilayer Perceptron for Regression (Continued)
48
Multilayer Perceptron for Classification
# Import libraries
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer

# Load dataset
breast_cancer = load_breast_cancer()

# Create X and y
X = pd.DataFrame(breast_cancer.data)
y = breast_cancer.target
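A minimal sketch of fitting and evaluating the classifier, reusing the split and scaling utilities imported earlier; the hyperparameters are illustrative.

# Split, scale, fit, and evaluate an MLP classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

mlp_clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
mlp_clf.fit(X_train_s, y_train)
print(mlp_clf.score(X_test_s, y_test))    # classification accuracy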
51
Multilayer Perceptron for Classification (Continued)
52
8. Neural Network
53
Artificial Neural Network
54
Artificial Neural Network
[Diagram: a network with two inputs, two hidden nodes, and one output; the edges carry weights $w_1, \dots, w_6$]
55
Artificial Neural Network
[Diagram: the same network with inputs 10 and 20, input-to-hidden weights 0.8, 0.4, 1.0, 1.2, and hidden-to-output weights 0.6, 0.2]
56
Artificial Neural Network
With inputs 10 and 20:
Hidden node 1: 10 × 0.8 + 20 × 0.4 = 16
Hidden node 2: 10 × 1.0 + 20 × 1.2 = 34
Output: 16 × 0.6 + 34 × 0.2 = 16.4
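The same forward pass can be checked with NumPy (a sketch; the matrix layout below is one possible convention):

# Forward pass for the 2-2-1 network above
import numpy as np

x = np.array([10, 20])              # inputs
W1 = np.array([[0.8, 1.0],          # weights from the inputs to the two hidden nodes
               [0.4, 1.2]])
W2 = np.array([0.6, 0.2])           # weights from the hidden nodes to the output

h = x @ W1                          # [10*0.8 + 20*0.4, 10*1.0 + 20*1.2] = [16, 34]
y = h @ W2                          # 16*0.6 + 34*0.2 = 16.4
print(h, round(y, 1))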
57
Forward Propagation
[Diagram: inputs $x_1$, $x_2$ are propagated forward through the network]
Output $= w_5 (x_1 w_1 + x_2 w_3) + w_6 (x_1 w_2 + x_2 w_4)$
58
Deep Neural Network
59
Difference between Multilayer Perceptron and Neural Network
Multilayer Perceptron: uses the step function, so the output jumps from 0 to 1.
Neural Network: uses the sigmoid function, so the output changes smoothly between 0 and 1.
60
9. Activation Function
61
Recap: Forward Propagation
Output $= w_5 (x_1 w_1 + x_2 w_3) + w_6 (x_1 w_2 + x_2 w_4)$
62
Forward Propagation
Each hidden node corresponds to a straight line in the ($x_1$, $x_2$) plane:
$x_2 = \frac{w_1}{w_3} x_1$ and $x_2 = \frac{w_2}{w_4} x_1$
63
Forward Propagation
$w_5 (x_1 w_1 + x_2 w_3) + w_6 (x_1 w_2 + x_2 w_4)$
$\Leftrightarrow w_5 x_1 w_1 + w_5 x_2 w_3 + w_6 x_1 w_2 + w_6 x_2 w_4$
$\Leftrightarrow w_1 w_5 x_1 + w_3 w_5 x_2 + w_2 w_6 x_1 + w_4 w_6 x_2$
$\Leftrightarrow w_1 w_5 x_1 + w_2 w_6 x_1 + w_3 w_5 x_2 + w_4 w_6 x_2$
$\Leftrightarrow (w_1 w_5 + w_2 w_6) x_1 + (w_3 w_5 + w_4 w_6) x_2$
$\Leftrightarrow x_2 = \frac{(w_1 w_5 + w_2 w_6)}{(w_3 w_5 + w_4 w_6)} x_1$
Without a nonlinear activation function, the two-layer network therefore collapses to a single straight line, the same expressive power as a single perceptron.
64
Activation Function
[Diagram: an activation function is applied to the weighted sum of the inputs $x_1$, $x_2$ at each hidden node and at the output node]
65
Step Function
$y = \begin{cases} 0 & (x \le 0) \\ 1 & (x > 0) \end{cases}$
66
Activation Function
67
Sigmoid Function
$y = \dfrac{1}{1 + \exp(-x)}$
68
Tanh Function
$y = \dfrac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
69
ReLU Function
$y = \max(0, x)$
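For reference, these activation functions are one-liners in NumPy (a sketch):

# Common activation functions
import numpy as np

def step(x):
    return (x > 0).astype(int)      # 0 for x <= 0, 1 for x > 0

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)               # (exp(x) - exp(-x)) / (exp(x) + exp(-x))

def relu(x):
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(x), sigmoid(x), tanh(x), relu(x))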
70
Activation Function for Output Layer
71
Identity Function
$y = x$
72
Sigmoid Function
$y = \dfrac{1}{1 + \exp(-x)}$
73
Softmax Function
$y_k = \dfrac{\exp(a_k)}{\sum_{i=1}^{n} \exp(a_i)}$
74
10. Loss Function
75
Loss Function
A loss function quantifies how far the model's predicted values deviate from the true values.
A neural network is trained to find the parameters that minimize the value returned by the loss function.
76
Mean Squared Error (MSE)
$\mathrm{MSE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
77
Mean Absolute Error (MAE)
$\mathrm{MAE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
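Both losses are direct to write in NumPy (a sketch with made-up numbers):

# MSE and MAE
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))    # 0.833..., 0.666...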
78
Cross Entropy Loss
79
Example: Cross-Entropy Loss
The class with the highest probability will be the predicted class.
80
Example: Cross-Entropy Loss (Continued)
$p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$
$E(p, y) = -\begin{pmatrix} p_1 & p_2 & \cdots & p_n \end{pmatrix} \begin{pmatrix} \log y_1 \\ \log y_2 \\ \vdots \\ \log y_n \end{pmatrix} = -\sum_{i=1}^{n} p_i \log y_i$
81
Example: Cross-Entropy Loss (Continued)
The logarithm base is Napier's constant $e$.
$E = -(1 \times \log 0.6 + 0 \times \log 0.2 + 0 \times \log 0.2) = -\log 0.6 \approx 0.51$
82
Cross-Entropy Loss of Binary Classification
$\mathrm{LogLoss} = -\dfrac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
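A direct NumPy version of this formula (a sketch; the clipping constant is added only to avoid log(0)):

# Binary cross-entropy (log loss)
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    p = np.clip(p_pred, eps, 1 - eps)     # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(log_loss(y_true, p_pred))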
83
Cross-Entropy Loss of Binary Classification (Continued)
Since the logarithm base is Napier's $e$, $y_i \log p_i + (1 - y_i) \log(1 - p_i) = \cdots$
The advantage of using such a loss function is that its derivative is almost never 0, so the network keeps receiving a learning signal and can continue to update its parameters until it finds good values.
85
11. Training Neural Network
86
Data Splitting and Generalizability
[Diagram: the dataset is split into a training part (used for training) and a held-out part (used for prediction and evaluation)]
87
Batch Learning and Mini Batch Learning
Batch learning:
Uses all cases in the training dataset together for each update.
Mini-batch learning:
Trains the model on randomly sampled subsets (mini-batches) of the training set.
88
ANN’s Learning Process
[Diagram: inputs $x_1$, $x_2$ with weights $w_1$, $w_2$ and bias $b$ are summed to $w_1 x_1 + w_2 x_2 + b$, giving the output $\hat{y}$, which is compared with the true value $y$ to compute the error]
An ANN learns the weights and biases that minimize the difference between the outputs and the true values.
89
Backpropagation
STEP 2 Compute the error between the output and the true value.
90
Simplified Example: Backpropagation
$Wx = \hat{y}$
[Table: a simplified numerical example of repeatedly updating $W$ so that $\hat{y}$ approaches the true value]
91
12. Gradient Descent Method (1)
92
What is Gradient Descent?
An optimization algorithm that finds a local minimum of a differentiable function by following the function's gradients.
93
Differential
[Plot: the function values $f(x)$ and $f(x + h)$ at the points $x$ and $x + h$]
97
Differential
$\dfrac{df(x)}{dx} = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}$
[Plot: the function values $f(x)$ and $f(x + h)$ at the points $x$ and $x + h$]
98
Partial Differential
$\dfrac{\partial}{\partial x_n} f(x_1, x_2, \dots, x_n)$
Example: $\dfrac{\partial}{\partial x_1} (3x_1^2 + 4x_1 + 2x_2^2 + 5) = 6x_1 + 4$  (the terms without $x_1$ are treated as constants)
99
Gradient
100
Parameter Optimization
Optimization problem:
Objective function
101
Parameter Optimization (Continued)
[Plot: the loss $L$ as a function of the parameter $w$; the gradient $\frac{\partial L}{\partial w}$ gives the slope]
$w = w - \eta \dfrac{\partial L}{\partial w}$
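As an illustration, here is gradient descent on the simple one-parameter loss $L(w) = (w - 3)^2$ (an example chosen for clarity, not taken from the slides):

# Gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3)
eta = 0.1          # learning rate
w = 0.0            # starting value
for i in range(50):
    grad = 2 * (w - 3)
    w = w - eta * grad
print(w)           # converges toward the minimizer w = 3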
102
13. Gradient Descent Method (2)
103
Types of Gradient Descent Method
104
Batch Gradient Descent
• Strength
105
Stochastic Gradient Descent
• Strength
• Weakness:
106
Mini-batch Gradient Descent
• Splits the training data into smaller mini-batches, and updates parameters
for each mini-batch.
• Lower computational cost and memory usage than batch gradient descent.
107
14. Chain Rule
108
Chain Rule
[Diagram: $x \xrightarrow{\;g\;} g(x) \xrightarrow{\;f\;} f(g(x))$]
109
Chain Rule: Another Expression
$F(x) = f(g(x))$
[Diagram: $x \xrightarrow{\;g\;} g(x) \xrightarrow{\;f\;} f(g(x))$]
Let $u = g(x)$ and $y = f(u)$, so that $u' = \dfrac{du}{dx}$ and $y' = \dfrac{dy}{du}$.
Then
$F'(x) = \dfrac{dy}{dx} = \dfrac{dy}{du} \cdot \dfrac{du}{dx}$
111
Example: Chain Rule
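For instance (an illustrative example, not necessarily the one on the slide), let $y = (3x + 2)^2$:
$u = g(x) = 3x + 2, \quad y = f(u) = u^2$
$\dfrac{du}{dx} = 3, \quad \dfrac{dy}{du} = 2u$
$\dfrac{dy}{dx} = \dfrac{dy}{du} \cdot \dfrac{du}{dx} = 2u \cdot 3 = 6(3x + 2)$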
113
Recap: ANN’s Learning Process
[Diagram: the weighted sum $w_1 x_1 + w_2 x_2 + b$ produces the output $\hat{y}$, which is compared with the true value $y$ to compute the error]
114
Minimize Loss
Set parameters randomly:
$w_1 x_1 + w_2 x_2 + b$
Loss function:
$\mathrm{MSE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$\mathrm{MAE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
Optimization:
$w = w - \eta \dfrac{\partial L}{\partial w}$
115
Gradient Descent and Backpropagation
[Diagram: the weights are updated step by step along the loss curve]
$w = w - \eta \dfrac{\partial L}{\partial w}$
116
Example: Backpropagation
[Diagram: inputs $x_1$, $x_2$ with weights $w_1, \dots, w_4$ produce the hidden values $z_{11}$, $z_{12}$]
$\hat{y} = w_5 z_{11} + w_6 z_{12}$
117
Example: Backpropagation
$L = \mathrm{MSE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$\dfrac{\partial L}{\partial w_5} = \dfrac{\partial L}{\partial \hat{y}} \cdot \dfrac{\partial \hat{y}}{\partial w_5}$
$L = (y - \hat{y})^2 = y^2 - 2y\hat{y} + \hat{y}^2 = \{y - (w_5 z_{11} + w_6 z_{12})\}^2$
$\dfrac{\partial L}{\partial \hat{y}} = -2(y - \hat{y}) = -2(y - w_5 z_{11} - w_6 z_{12})$
$\dfrac{\partial \hat{y}}{\partial w_5} = z_{11}$
$w_5 = w_5 - \eta \dfrac{\partial L}{\partial w_5}$
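The update for $w_5$ can be checked numerically (a sketch; the values of $z_{11}$, $z_{12}$, $y$, and the weights are made up for illustration):

# One gradient-descent step for w5 in y_hat = w5*z11 + w6*z12 with L = (y - y_hat)^2
z11, z12 = 0.5, 0.8            # hidden-layer outputs
y = 1.0                        # true value
w5, w6 = 0.3, 0.9              # current weights
eta = 0.1                      # learning rate

y_hat = w5 * z11 + w6 * z12
dL_dyhat = -2 * (y - y_hat)    # dL/dy_hat
dyhat_dw5 = z11                # dy_hat/dw5
dL_dw5 = dL_dyhat * dyhat_dw5  # chain rule
w5 = w5 - eta * dL_dw5
print(w5)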
119
16. Vanishing Gradient Problem
120
Vanishing Gradient Problem
121
Why Do Gradients Vanish?
Sigmoid function: $f(x) = \dfrac{1}{1 + e^{-x}}$, with derivative $f'(x) = \dfrac{e^{-x}}{(1 + e^{-x})^2}$.
The derivative is at most 0.25, so multiplying many such derivatives together during backpropagation shrinks the gradient toward 0.
122
Use ReLU Function
$f(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$
$f'(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x \le 0) \end{cases}$
123
17. Nonsaturating Activation Functions
124
Dying ReLUs
$w_i = w_i - \eta \dfrac{\partial L}{\partial w_i}$
$f(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$
If a neuron's weighted input stays negative, its output and its gradient are both 0, so its weights stop being updated and the neuron "dies".
125
Leaky ReLU
$f(x) = \begin{cases} x & (x > 0) \\ \alpha x & (x \le 0) \end{cases} \quad (\alpha = 0.01)$
The output is $\alpha x$ when the input is negative, so the output is not exactly horizontal when $x < 0$.
ELU:
$f(x) = \begin{cases} x & (x > 0) \\ \alpha (e^x - 1) & (x \le 0) \end{cases} \quad (\alpha > 0)$
SELU:
$f(x) = \lambda \begin{cases} x & (x > 0) \\ \alpha (e^x - 1) & (x \le 0) \end{cases} \quad (\lambda > 1, \alpha > 0)$
127
18. Parameter Initialization
128
Recap: Vanishing Gradient Problem
129
Parameter Initialization
Initialize parameter values with a range matched to the layer size: the more nodes feeding a layer, the smaller the range (standard deviation) of its initial weights.
Popular methods:
・ Xavier initialization
・ He initialization
130
Xavier Initialization
Initialize the weights of each layer from a distribution with mean $= 0$ and std $= \sqrt{\dfrac{1}{n}}$, where $n$ is the number of nodes in the previous layer.
[Diagram: for the layer fed by $x_1, \dots, x_n$, std $= \sqrt{1/n}$; for the layer fed by $h_1, \dots, h_m$, std $= \sqrt{1/m}$]
131
He Initialization
Initialize the weights of each layer from a distribution with mean $= 0$ and std $= \sqrt{\dfrac{2}{n}}$, where $n$ is the number of nodes in the previous layer.
[Diagram: for the layer fed by $x_1, \dots, x_n$, std $= \sqrt{2/n}$; for the layer fed by $h_1, \dots, h_m$, std $= \sqrt{2/m}$]
132
19. ANN Regression with Keras
133
Import Libraries
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
134
Load and Prepare the Dataset
# Load dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
# Create X and y
X = pd.DataFrame(data)
y = target
135
Define Artificial Neural Network
# Define ANN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(13,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

Leaky ReLU function:
from tensorflow.keras.layers import LeakyReLU
leaky_relu = LeakyReLU(alpha=0.01)
Dense(64, activation=leaky_relu)

ELU function: activation='elu'
SELU function: activation='selu'
136
He Initialization
# Define ANN
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(13,),
          kernel_initializer="he_normal"))
model.add(Dense(64, activation='relu',
          kernel_initializer="he_normal"))
model.add(Dense(64, activation='relu',
          kernel_initializer="he_normal"))
model.add(Dense(32, activation='relu',
          kernel_initializer="he_normal"))
model.add(Dense(1))
137
Show the Model Summary
Number of parameters per layer = (nodes in the previous layer × nodes in this layer) + biases
13 × 128 + 128 = 1,792
128 × 64 + 64 = 8,256
64 × 64 + 64 = 4,160
64 × 32 + 32 = 2,080
32 × 1 + 1 = 33
Total: 1,792 + 8,256 + 4,160 + 2,080 + 33 = 16,321
138
Compile the Model
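A typical compile call for this regression model (a sketch; the optimizer chosen on the original slide may differ):

# Compile model
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mae'])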
139
Fit the Model
batch_size=200
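A sketch of the surrounding fit call, assuming a standard train/test split; batch_size=200 is taken from the slide, while the split and the number of epochs are illustrative.

# Split the data and fit the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

history = model.fit(X_train, y_train,
                    epochs=300,
                    batch_size=200,
                    validation_split=0.2,
                    verbose=0)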
140
Fit the Model
141
Learning History
143
Model Evaluation
# Model evaluation
train_loss, train_mae = model.evaluate(X_train, y_train)
test_loss, test_mae = model.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss, test_loss))
print('train mae:{:.3f}\ntest mae: {:.3f}'.format(train_mae, test_mae))
----------------------------------------------------------------------------
train loss: 4.239
test loss: 10.772
train mae: 0.884
test mae: 2.293
144
Make Prediction
# Make prediction
y_pred = model.predict(X_test)
print(y_pred)
----------------------------------------------------------------------------
[[28.521143 ]
[19.221983 ]
[26.441172 ]
[19.683641 ]
[15.493921 ]
・・・・・
[17.98763 ]
[14.546152 ]]
145
20. ANN Classification with Keras
146
Import Libraries
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
147
Load and Prepare the Dataset
# Load dataset
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
# Create X and y
X = pd.DataFrame(breast_cancer.data)
y = breast_cancer.target
148
Define Artificial Neural Network
# Define ANN
model = Sequential()
model.add(Dense(64, activation='relu',
          input_shape=(30,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
149
Show the Model Summary
# Model summary
model.summary()
Number of parameters per layer = (nodes in the previous layer × nodes in this layer) + biases
30 × 64 + 64 = 1984
64 × 32 + 32 = 2080
32 × 1 + 1 = 33
150
Compile the Model
# Compile model
model.compile(optimizer='sgd',
loss='binary_crossentropy',
metrics=['acc'])
151
Fit the Model
152
Learning History
・・・・・・・・・・
Epoch 300/300
340/340 [==============================] - 0s 97us/sample - loss: 0.0442 - acc: 0.9882 -
val_loss: 0.0766 - val_acc: 0.9767
153
Visualize Learning History
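A common way to plot the curves, assuming the fit call was assigned to a history object (a sketch):

# Plot training and validation curves
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.plot(history.history['acc'], label='train acc')
plt.plot(history.history['val_acc'], label='val acc')
plt.xlabel('epoch')
plt.legend()
plt.show()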
154
Model Evaluation
# Model evaluation
train_loss, train_acc = model.evaluate(X_train, y_train)
test_loss, test_acc = model.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss, test_loss))
print('train acc:{:.3f}\ntest acc: {:.3f}'.format(train_acc, test_acc))
----------------------------------------------------------------------------
426/426 [==============================] - 0s 30us/sample - loss: 0.0506 - acc:
0.9859
143/143 [==============================] - 0s 63us/sample - loss: 0.0649 - acc:
0.9790
train loss:0.051
test loss: 0.065
train acc:0.986
test acc: 0.979
155
Make Prediction
# Make prediction
y_pred = model.predict(X_test)
print(np.round(y_pred, 3))
--------------------------------------------------------------------------
[[0.993]
[1. ]
[0. ]
[0.998]
[1. ]
・・・・
[0.996]
[0.894]]
156
Multiclass Classification
# Load dataset
from sklearn.datasets import load_wine
data = load_wine()

# Display target names
list(data.target_names)
--------------------------------------
['class_0', 'class_1', 'class_2']
[Pie chart: proportions of class_0, class_1, and class_2 in the target]
157
Multiclass Classification
# Create X and y
features = data.data
target = data.target
X = pd.DataFrame(features)
y = target

# Compile model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
159
Multiclass Classification
# Model evaluation
train_loss, train_acc = model.evaluate(X_train, y_train)
test_loss, test_acc = model.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss, test_loss))
print('train acc:{:.3f}\ntest acc: {:.3f}'.format(train_acc, test_acc))
160
Multiclass Classification
# Make prediction
y_pred = model.predict(X_test)
print(np.round(y_pred, 3))
----------------------------------------
[[0. 1. 0. ]
[0. 1. 0. ]
[0. 1. 0. ]
[0. 1. 0. ]
[0. 1. 0. ]
[0.01 0.99 0. ]
[1. 0. 0. ]
[0. 1. 0. ]
[0.003 0.997 0. ]
[0. 1. 0. ]
・・・・・・・・・・・・・
[1. 0. 0. ]
[0. 0. 1. ]
[1. 0. 0. ]]
161
21. Overfitting
162
Model Generalizability
163
Model Generalizability
164
Model Generalizability
165
Overfitting
166
How Can We Prevent Overfitting?
・ Regularization
・ Early Stopping
167
22. L1 & L2 Regularization
168
Regularization
169
Q. Which is a case where “X and Y are not related”?
[Figure: four scatter plots of Y against X showing different strengths of relationship]
170
Statistical Hypothesis Testing in Regression Analysis
$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$
If $\beta_i \ne 0$, the corresponding $x_i$ is related to $y$.
171
Model Complexity and Overfitting
$y = \beta_1 x_1 + \varepsilon$
$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon$
If $\beta_2 = \beta_3 = \beta_4 = 0$, the second model reduces to $y = \beta_1 x_1 + \varepsilon$.
Regularization forces the weights of uninformative features to be zero or nearly zero.
174
Lasso Regression (L1 Regularization)
Ridge Regression (L2 Regularization)
176
23. Dropout
177
Dropout
178
Dropout
179
Dropout
180
Dropout and Generalizability
181
24. Regularization with Keras
182
Import Libraries
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
183
Load and Prepare the Dataset
# Load dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
# Create X and y
X = pd.DataFrame(data)
y = target
184
L1 Regularization
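A sketch of adding an L1 penalty to each hidden layer with Keras; the penalty strength 0.01 is illustrative, and the variable name model_l1 matches the evaluation code below.

# Define ANN with L1 regularization on the kernel weights
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

model_l1 = Sequential()
model_l1.add(Dense(128, activation='relu', input_shape=(13,),
                   kernel_regularizer=regularizers.l1(0.01)))
model_l1.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.01)))
model_l1.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.01)))
model_l1.add(Dense(32, activation='relu', kernel_regularizer=regularizers.l1(0.01)))
model_l1.add(Dense(1))
model_l1.compile(optimizer='adam', loss='mse', metrics=['mae'])

L2 regularization (next slide) is set up the same way with regularizers.l2.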
185
Fit and Predict with L1 Regularization
# Model evaluation
train_loss_l1, train_mae_l1 = model_l1.evaluate(X_train, y_train)
test_loss_l1, test_mae_l1 = model_l1.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss_l1, test_loss_l1))
print('train mae:{:.3f}\ntest mae: {:.3f}'.format(train_mae_l1, test_mae_l1))
--------------------------------------------------------------------------------------------------------
train loss:5.964
test loss: 9.608
train mae:1.132
test mae: 2.114
186
L2 Regularization
187
Dropout
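A sketch of inserting Dropout layers after each hidden layer; the dropout rate 0.2 and the variable name model_d are illustrative.

# Define ANN with Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model_d = Sequential()
model_d.add(Dense(128, activation='relu', input_shape=(13,)))
model_d.add(Dropout(0.2))
model_d.add(Dense(64, activation='relu'))
model_d.add(Dropout(0.2))
model_d.add(Dense(32, activation='relu'))
model_d.add(Dropout(0.2))
model_d.add(Dense(1))
model_d.compile(optimizer='adam', loss='mse', metrics=['mae'])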
189
Early Stopping with Keras
# Define ANN
model_e = Sequential()
model_e.add(Dense(128, activation='relu', input_shape=(13,)))
model_e.add(Dense(64, activation='relu'))
model_e.add(Dense(64, activation='relu'))
model_e.add(Dense(32, activation='relu'))
model_e.add(Dense(1))

# Set EarlyStopping
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=30)
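The callback is then passed to fit (a sketch; the compile settings, split variables, and epoch count are illustrative):

# Compile and fit with the EarlyStopping callback
model_e.compile(optimizer='adam', loss='mse', metrics=['mae'])
history_e = model_e.fit(X_train, y_train,
                        epochs=1000,
                        batch_size=200,
                        validation_split=0.2,
                        callbacks=[early_stop],
                        verbose=0)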
190
Early Stopping with Keras
191
Model Performance
192
25. Optimizer
193
Optimizers
e.g.,
・SGD
・Momentum
・AdaGrad
・RMSProp
・Adam etc.
194
SGD (Stochastic Gradient Descent)
$W \leftarrow W - \eta \dfrac{\partial L}{\partial W}$
$W$: parameter, $\dfrac{\partial L}{\partial W}$: gradient
195
Momentum
$v \leftarrow \alpha v - \eta \dfrac{\partial L}{\partial W}$
$W \leftarrow W + v$
$v$: velocity
197
AdaGrad
Learning rate decay: Set a high learning rate at the outset, and gradually lower it.
$h \leftarrow h + \dfrac{\partial L}{\partial W} \odot \dfrac{\partial L}{\partial W}$
$W \leftarrow W - \eta \dfrac{1}{\sqrt{h} + \varepsilon} \dfrac{\partial L}{\partial W}$
200
RMSProp
AdaGrad tends to lower the learning rate too fast, so the effective learning rate becomes 0 too early.
RMSProp also accumulates squared gradients, but it weights recent iterations more heavily (an exponentially decaying average).
$h \leftarrow \beta h + (1 - \beta) \dfrac{\partial L}{\partial W} \odot \dfrac{\partial L}{\partial W}$   ($\beta$: decay rate, e.g. 0.9)
$W \leftarrow W - \eta \dfrac{1}{\sqrt{h} + \varepsilon} \dfrac{\partial L}{\partial W}$
201
Adam
$m \leftarrow \beta_1 m + (1 - \beta_1) \dfrac{\partial L}{\partial W}$   ($m$: velocity (first moment), $\beta_1$: decay rate)
$v \leftarrow \beta_2 v + (1 - \beta_2) \dfrac{\partial L}{\partial W} \odot \dfrac{\partial L}{\partial W}$   ($v$: second moment, $\beta_2$: decay rate)
$\hat{m} = \dfrac{m}{1 - \beta_1^t}, \quad \hat{v} = \dfrac{v}{1 - \beta_2^t}$
$W \leftarrow W - \eta \dfrac{\hat{m}}{\sqrt{\hat{v}} + \varepsilon}$
($L$: loss, $W$: parameter, $\dfrac{\partial L}{\partial W}$: gradient)
202
26. Batch Normalization
203
Recap
・ Regularization
・ Optimizer
・ Parameter Initialization
Input data normalization: standardization $x_i^{\mathrm{new}} = \dfrac{x_i - \mu}{\sigma}$, or min-max scaling so that $0 \le x' \le 1$.
205
Weakness of Input Data Normalization
206
Batch Normalization
[Diagram: Batch Normalization layers are inserted between the layers of the network]
207
Batch Normalization: Equations
Mean: $\mu_b \leftarrow \dfrac{1}{m} \sum_{i=1}^{m} x_i$
Variance: $\sigma_b^2 \leftarrow \dfrac{1}{m} \sum_{i=1}^{m} (x_i - \mu_b)^2$
Standardization: $\hat{x}_i \leftarrow \dfrac{x_i - \mu_b}{\sqrt{\sigma_b^2 + \varepsilon}}$
208
27. Optimization & Batch Normalization with Keras
209
Import Libraries
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
210
Load and Prepare Dataset
# Load dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
# Create X and y
X = pd.DataFrame(data)
y = target
# Define ANN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

def ann():
    model = Sequential()
    model.add(Dense(128, activation='relu', input_shape=(13,)))
    model.add(BatchNormalization())
    model.add(Dense(64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(32, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(1))
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae'])
    return model
212
Set Optimizer
from tensorflow.keras.optimizers import SGD, Adagrad, RMSprop, Adam

# Momentum
optimizer = SGD(lr=0.001, momentum=0.9)
# Adagrad
optimizer = Adagrad(lr=0.001, epsilon=10**-10)
# RMSprop
optimizer = RMSprop(lr=0.001, rho=0.9)
# Adam
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.9)
213
Fit and Evaluate the Model
# Model evaluation
train_loss, train_mae = model.evaluate(X_train, y_train)
test_loss, test_mae = model.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss, test_loss))
print('train mae:{:.3f}\ntest mae: {:.3f}'.format(train_mae, test_mae))
214
Save and Load the Model
# Save model
model.save("my_ann_model.h5")
# Load model
model = keras.models.load_model("my_ann_model.h5")
215
Save the Best Model
# Import ModelCheckpoint
from keras.callbacks import ModelCheckpoint
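A sketch of how the callback might be used; the file name and monitored metric are illustrative.

# Save the weights of the best epoch during training
checkpoint = ModelCheckpoint("best_ann_model.h5",
                             monitor='val_loss',
                             save_best_only=True)
history = model.fit(X_train, y_train,
                    epochs=300,
                    batch_size=200,
                    validation_split=0.2,
                    callbacks=[checkpoint],
                    verbose=0)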
216