Deep Learning Fundamentals Materials
1
1. What is Deep Learning?
2
What is Deep Learning?
● Deep Learning (DL): A type of ML.
[Figure: model performance versus data amount, with a curve labeled "Deep Learning"]
4
Challenge in Handling High Dimensional Data
Example: 3,000 employees described by categorical attributes
• Gender: Male, Female, Others (3 types)
• Department: Admin, IT, Operation, Others (4 types)
• Age group: 20-29, 30-39, 40-49, 50-59, 60- (5 types)
• Rating: 7 (Excellent) ‥‥ 1 (Poor) (7 types)
• Job title: Manager, Supervisor, Rank-and-filer (3 types)
3 × 4 = 12 types, 12 × 5 = 60 types, 60 × 7 = 420 types, 420 × 3 = 1,260 types
1,260 types in 3,000 employees… Each type has only about 2.4 employees.
5
Curse of Dimensionality
6
Advantage of Deep Learning
7
2. Artificial Neural
Network
8
Neurons
Dendrite: receives signals from other neurons.
9
Neural Network
10
Modeling Neural Network
[Diagram: a modeled neuron outputs a signal when Input A + Input B > Threshold]
11
Example
[Diagram: an example neuron with input $x_1$ and weight $w_1$]
12
3. Perceptron
13
What is a Perceptron?
[Diagram: a perceptron with input nodes $x_1$, $x_2$, edge weights $w_1$, $w_2$, bias $b$, and output $y$ (0 or 1)]
14
Output of a Perceptron
[Diagram: the perceptron sums the weighted inputs and the bias, $w_1 x_1 + w_2 x_2 + b$]
The perceptron outputs $y = 1$ if $w_1 x_1 + w_2 x_2 + b > 0$, and $y = 0$ otherwise.
18
Logic Gate
A logic gate takes inputs of 1 or 0 and returns an output of 1 or 0.
・ AND
・ OR
・ NAND
19
AND Gate
The perceptron's parameters $(w_1, w_2, \theta)$ determine which truth table it produces.
[Table: truth tables for the parameter settings $(0.6, 0.4, 0.8)$ and $(0.7, 0.7, 0.6)$]
22
Perceptron and Logic Gate
23
5. Logic Gate with Python
24
Logic Gate
25
AND Gate
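For reference, the AND, NAND, and OR gates can be written as simple perceptron functions with NumPy. This is a minimal sketch; the function names and the particular weights and biases (e.g. 0.5, 0.5, -0.7) are illustrative choices, and other parameter settings work as well.

# Logic gates as perceptrons: output 1 if w1*x1 + w2*x2 + b > 0, else 0
import numpy as np

def AND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])    # weights
    b = -0.7                    # bias (threshold 0.7)
    return int(np.sum(w * x) + b > 0)

def NAND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([-0.5, -0.5])
    b = 0.7
    return int(np.sum(w * x) + b > 0)

def OR(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.2
    return int(np.sum(w * x) + b > 0)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, AND(x1, x2), NAND(x1, x2), OR(x1, x2))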
29
XOR Gate
Exclusive OR gate
A logic gate that outputs 1 only when exactly one of $x_1$ or $x_2$ is 1; otherwise, it outputs 0.
Truth Table
$x_1$  $x_2$  $y$
 0      0     0
 1      0     1
 0      1     1
 1      1     0
30
Output of a Perceptron
[Plot: a perceptron separates the ($x_1$, $x_2$) plane into two regions with a single straight line]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     0
 0      1     0
 1      1     1
32
AND Gate and Perceptron
[Plot: a single straight line separates the point (1, 1), where $y = 1$, from the other three points]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     0
 0      1     0
 1      1     1
33
OR Gate and Perceptron
[Plot: a single straight line separates the point (0, 0), where $y = 0$, from the other three points]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     1
 0      1     1
 1      1     1
34
NAND Gate
[Plot: a single straight line separates the point (1, 1), where $y = 0$, from the other three points]
$x_1$  $x_2$  $y$
 0      0     1
 1      0     1
 0      1     1
 1      1     0
35
XOR Gate and Perceptron
[Plot: no single straight line can separate the points where $y = 1$ from the points where $y = 0$]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     1
 0      1     1
 1      1     0
36
XOR Gate and Multiple Perceptrons
[Plot: combining multiple perceptrons produces a non-linear boundary that separates the XOR outputs]
$x_1$  $x_2$  $y$
 0      0     0
 1      0     1
 0      1     1
 1      1     0
40
XOR Gate and Multiple Perceptrons
1 0 1
0 1 1
0 𝑥𝑥1 1 1 0
1
41
Truth Tables
42
Multilayer Perceptron
[Diagram: inputs $x_1$, $x_2$ feed a NAND gate (output $s_1$) and an OR gate (output $s_2$); combining $s_1$ and $s_2$ with an AND gate gives the output $y$, which realizes XOR]
43
7. Multilayer Perceptron with Python
44
XOR Gate
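Following the multilayer construction above, XOR can be sketched by combining the gate functions (this assumes the AND, NAND, and OR functions shown earlier; the layout is one possible implementation):

# XOR as a two-layer combination of perceptrons: XOR(x1, x2) = AND(NAND(x1, x2), OR(x1, x2))
def XOR(x1, x2):
    s1 = NAND(x1, x2)
    s2 = OR(x1, x2)
    return AND(s1, s2)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, XOR(x1, x2))    # outputs 0, 1, 1, 0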
Multilayer Perceptron for Regression
# Import libraries
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
46
Multilayer Perceptron for Regression (Continued)
# Create X and y
X = pd.DataFrame(data)
y = target
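A minimal sketch of fitting and evaluating the regressor with the libraries imported above; the hidden-layer sizes, test split, and other hyperparameters here are illustrative, not necessarily the ones used on the original slides.

# Split, scale, fit, and evaluate an MLP regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

mlp_reg = MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
mlp_reg.fit(X_train_s, y_train)

y_pred = mlp_reg.predict(X_test_s)
print(mean_squared_error(y_test, y_pred))    # test MSE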
47
Multilayer Perceptron for Regression (Continued)
48
Multilayer Perceptron for Classification
# Import libraries
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer

# Load dataset
breast_cancer = load_breast_cancer()

# Create X and y
X = pd.DataFrame(breast_cancer.data)
y = breast_cancer.target
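A minimal sketch of fitting and evaluating the classifier, reusing the split and scaling utilities imported earlier; the hyperparameters are illustrative.

# Split, scale, fit, and evaluate an MLP classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

mlp_clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
mlp_clf.fit(X_train_s, y_train)
print(mlp_clf.score(X_test_s, y_test))    # classification accuracy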
51
Multilayer Perceptron for Classification (Continued)
52
8. Neural Network
53
Artificial Neural Network
54
Artificial Neural Network
[Diagram: a network with two inputs, two hidden nodes, and one output; the edges carry weights $w_1, \dots, w_6$]
55
Artificial Neural Network
[Diagram: the same network with inputs 10 and 20, input-to-hidden weights 0.8, 0.4, 1.0, 1.2, and hidden-to-output weights 0.6, 0.2]
56
Artificial Neural Network
With inputs 10 and 20:
Hidden node 1: 10 × 0.8 + 20 × 0.4 = 16
Hidden node 2: 10 × 1.0 + 20 × 1.2 = 34
Output: 16 × 0.6 + 34 × 0.2 = 16.4
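The same forward pass can be checked with NumPy (a sketch; the matrix layout below is one possible convention):

# Forward pass for the 2-2-1 network above
import numpy as np

x = np.array([10, 20])              # inputs
W1 = np.array([[0.8, 1.0],          # weights from the inputs to the two hidden nodes
               [0.4, 1.2]])
W2 = np.array([0.6, 0.2])           # weights from the hidden nodes to the output

h = x @ W1                          # [10*0.8 + 20*0.4, 10*1.0 + 20*1.2] = [16, 34]
y = h @ W2                          # 16*0.6 + 34*0.2 = 16.4
print(h, round(y, 1))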
57
Forward Propagation
[Diagram: inputs $x_1$, $x_2$ are propagated forward through the network]
Output $= w_5 (x_1 w_1 + x_2 w_3) + w_6 (x_1 w_2 + x_2 w_4)$
58
Deep Neural Network
59
Difference between Multilayer Perceptron and Neural Network
Multilayer Perceptron: uses the step function, so the output jumps from 0 to 1.
Neural Network: uses the sigmoid function, so the output changes smoothly between 0 and 1.
60
9. Activation Function
61
Recap: Forward Propagation
Output $= w_5 (x_1 w_1 + x_2 w_3) + w_6 (x_1 w_2 + x_2 w_4)$
62
Forward Propagation
Each hidden node corresponds to a straight line in the ($x_1$, $x_2$) plane:
$x_2 = \frac{w_1}{w_3} x_1$ and $x_2 = \frac{w_2}{w_4} x_1$
63
Forward Propagation
$w_5 (x_1 w_1 + x_2 w_3) + w_6 (x_1 w_2 + x_2 w_4)$
$\Leftrightarrow w_5 x_1 w_1 + w_5 x_2 w_3 + w_6 x_1 w_2 + w_6 x_2 w_4$
$\Leftrightarrow w_1 w_5 x_1 + w_3 w_5 x_2 + w_2 w_6 x_1 + w_4 w_6 x_2$
$\Leftrightarrow w_1 w_5 x_1 + w_2 w_6 x_1 + w_3 w_5 x_2 + w_4 w_6 x_2$
$\Leftrightarrow (w_1 w_5 + w_2 w_6) x_1 + (w_3 w_5 + w_4 w_6) x_2$
$\Leftrightarrow x_2 = \frac{(w_1 w_5 + w_2 w_6)}{(w_3 w_5 + w_4 w_6)} x_1$
Without a nonlinear activation function, the two-layer network therefore collapses to a single straight line, the same expressive power as a single perceptron.
64
Activation Function
[Diagram: an activation function is applied to the weighted sum of the inputs $x_1$, $x_2$ at each hidden node and at the output node]
65
Step Function
$y = \begin{cases} 0 & (x \le 0) \\ 1 & (x > 0) \end{cases}$
66
Activation Function
67
Sigmoid Function
$y = \dfrac{1}{1 + \exp(-x)}$
68
Tanh Function
$y = \dfrac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
69
ReLU Function
$y = \max(0, x)$
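For reference, these activation functions are one-liners in NumPy (a sketch):

# Common activation functions
import numpy as np

def step(x):
    return (x > 0).astype(int)      # 0 for x <= 0, 1 for x > 0

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)               # (exp(x) - exp(-x)) / (exp(x) + exp(-x))

def relu(x):
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(x), sigmoid(x), tanh(x), relu(x))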
70
Activation Function for Output Layer
71
Identity Function
$y = x$
72
Sigmoid Function
$y = \dfrac{1}{1 + \exp(-x)}$
73
Softmax Function
$y_k = \dfrac{\exp(a_k)}{\sum_{i=1}^{n} \exp(a_i)}$
74
10. Loss Function
75
Loss Function
A loss function quantifies how far the model's predicted values deviate from the true values.
A neural network is trained to find the parameters that minimize the value returned by the loss function.
76
Mean Squared Error (MSE)
$\mathrm{MSE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
77
Mean Absolute Error (MAE)
$\mathrm{MAE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
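Both losses are direct to write in NumPy (a sketch with made-up numbers):

# MSE and MAE
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))    # 0.833..., 0.666...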
78
Cross Entropy Loss
79
Example: Cross-Entropy Loss
The class with the highest probability will be the predicted class.
80
Example: Cross-Entropy Loss (Continued)
$p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$
$E(p, y) = -\begin{pmatrix} p_1 & p_2 & \cdots & p_n \end{pmatrix} \begin{pmatrix} \log y_1 \\ \log y_2 \\ \vdots \\ \log y_n \end{pmatrix} = -\sum_{i=1}^{n} p_i \log y_i$
81
Example: Cross-Entropy Loss (Continued)
The logarithm base is Napier's constant $e$.
$E = -(1 \times \log 0.6 + 0 \times \log 0.2 + 0 \times \log 0.2) = -\log 0.6 \approx 0.51$
82
Cross-Entropy Loss of Binary Classification
$\mathrm{LogLoss} = -\dfrac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
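A direct NumPy version of this formula (a sketch; the clipping constant is added only to avoid log(0)):

# Binary cross-entropy (log loss)
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    p = np.clip(p_pred, eps, 1 - eps)     # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(log_loss(y_true, p_pred))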
83
Cross-Entropy Loss of Binary Classification (Continued)
Since the logarithm base is Napier's $e$, $y_i \log p_i + (1 - y_i) \log(1 - p_i) = \cdots$
The advantage of using such a loss function is that its derivative is almost never 0, so the network keeps receiving a learning signal and can continue to update its parameters until it finds good values.
85
11. Training Neural Network
86
Data Splitting and Generalizability
[Diagram: the dataset is split into a training part (used for training) and a held-out part (used for prediction and evaluation)]
87
Batch Learning and Mini Batch Learning
Batch learning:
Uses all cases in the training dataset together for each update.
Mini-batch learning:
Trains the model on randomly sampled subsets (mini-batches) of the training set.
88
ANN’s Learning Process
[Diagram: inputs $x_1$, $x_2$ with weights $w_1$, $w_2$ and bias $b$ are summed to $w_1 x_1 + w_2 x_2 + b$, giving the output $\hat{y}$, which is compared with the true value $y$ to compute the error]
An ANN learns the weights and biases that minimize the difference between the outputs and the true values.
89
Backpropagation
STEP 2 Compute the error between the output and the true value.
90
Simplified Example: Backpropagation
$Wx = \hat{y}$
[Table: a simplified numerical example of repeatedly updating $W$ so that $\hat{y}$ approaches the true value]
91
12. Gradient Descent Method (1)
92
What is Gradient Descent?
An optimization algorithm that finds a local minimum of a differentiable function by following the function's gradients.
93
Differential
[Plot: the function values $f(x)$ and $f(x + h)$ at the points $x$ and $x + h$]
97
Differential
$\dfrac{df(x)}{dx} = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}$
[Plot: the function values $f(x)$ and $f(x + h)$ at the points $x$ and $x + h$]
98
Partial Differential
$\dfrac{\partial}{\partial x_n} f(x_1, x_2, \dots, x_n)$
Example: $\dfrac{\partial}{\partial x_1} (3x_1^2 + 4x_1 + 2x_2^2 + 5) = 6x_1 + 4$  (the terms without $x_1$ are treated as constants)
99
Gradient
100
Parameter Optimization
Optimization problem:
Objective function
101
Parameter Optimization (Continued)
[Plot: the loss $L$ as a function of the parameter $w$; the gradient $\frac{\partial L}{\partial w}$ gives the slope]
$w = w - \eta \dfrac{\partial L}{\partial w}$
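As an illustration, here is gradient descent on the simple one-parameter loss $L(w) = (w - 3)^2$ (an example chosen for clarity, not taken from the slides):

# Gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3)
eta = 0.1          # learning rate
w = 0.0            # starting value
for i in range(50):
    grad = 2 * (w - 3)
    w = w - eta * grad
print(w)           # converges toward the minimizer w = 3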
102
13. Gradient Descent Method (2)
103
Types of Gradient Descent Method
104
Batch Gradient Descent
• Strength
105
Stochastic Gradient Descent
• Strength
• Weakness:
106
Mini-batch Gradient Descent
• Splits the training data into smaller mini-batches, and updates parameters
for each mini-batch.
• Lower computational cost and memory usage than batch gradient descent.
107
14. Chain Rule
108
Chain Rule
[Diagram: $x \xrightarrow{\;g\;} g(x) \xrightarrow{\;f\;} f(g(x))$]
109
Chain Rule: Another Expression
$F(x) = f(g(x))$
[Diagram: $x \xrightarrow{\;g\;} g(x) \xrightarrow{\;f\;} f(g(x))$]
Let $u = g(x)$ and $y = f(u)$, so that $u' = \dfrac{du}{dx}$ and $y' = \dfrac{dy}{du}$.
Then
$F'(x) = \dfrac{dy}{dx} = \dfrac{dy}{du} \cdot \dfrac{du}{dx}$
111
Example: Chain Rule
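For instance (an illustrative example, not necessarily the one on the slide), let $y = (3x + 2)^2$:
$u = g(x) = 3x + 2, \quad y = f(u) = u^2$
$\dfrac{du}{dx} = 3, \quad \dfrac{dy}{du} = 2u$
$\dfrac{dy}{dx} = \dfrac{dy}{du} \cdot \dfrac{du}{dx} = 2u \cdot 3 = 6(3x + 2)$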
113
Recap: ANN’s Learning Process
[Diagram: the weighted sum $w_1 x_1 + w_2 x_2 + b$ produces the output $\hat{y}$, which is compared with the true value $y$ to compute the error]
114
Minimize Loss
Set parameters randomly:
$w_1 x_1 + w_2 x_2 + b$
Loss function:
$\mathrm{MSE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$\mathrm{MAE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
Optimization:
$w = w - \eta \dfrac{\partial L}{\partial w}$
115
Gradient Descent and Backpropagation
[Diagram: the weights are updated step by step along the loss curve]
$w = w - \eta \dfrac{\partial L}{\partial w}$
116
Example: Backpropagation
[Diagram: inputs $x_1$, $x_2$ with weights $w_1, \dots, w_4$ produce the hidden values $z_{11}$, $z_{12}$]
$\hat{y} = w_5 z_{11} + w_6 z_{12}$
117
Example: Backpropagation
$L = \mathrm{MSE}(y_i, \hat{y}_i) = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$\dfrac{\partial L}{\partial w_5} = \dfrac{\partial L}{\partial \hat{y}} \cdot \dfrac{\partial \hat{y}}{\partial w_5}$
$L = (y - \hat{y})^2 = y^2 - 2y\hat{y} + \hat{y}^2 = \{y - (w_5 z_{11} + w_6 z_{12})\}^2$
$\dfrac{\partial L}{\partial \hat{y}} = -2(y - \hat{y}) = -2(y - w_5 z_{11} - w_6 z_{12})$
$\dfrac{\partial \hat{y}}{\partial w_5} = z_{11}$
$w_5 = w_5 - \eta \dfrac{\partial L}{\partial w_5}$
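The update for $w_5$ can be checked numerically (a sketch; the values of $z_{11}$, $z_{12}$, $y$, and the weights are made up for illustration):

# One gradient-descent step for w5 in y_hat = w5*z11 + w6*z12 with L = (y - y_hat)^2
z11, z12 = 0.5, 0.8            # hidden-layer outputs
y = 1.0                        # true value
w5, w6 = 0.3, 0.9              # current weights
eta = 0.1                      # learning rate

y_hat = w5 * z11 + w6 * z12
dL_dyhat = -2 * (y - y_hat)    # dL/dy_hat
dyhat_dw5 = z11                # dy_hat/dw5
dL_dw5 = dL_dyhat * dyhat_dw5  # chain rule
w5 = w5 - eta * dL_dw5
print(w5)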
119
16. Vanishing Gradient Problem
120
Vanishing Gradient Problem
121
Why Do Gradients Vanish?
Sigmoid function: $f(x) = \dfrac{1}{1 + e^{-x}}$, with derivative $f'(x) = \dfrac{e^{-x}}{(1 + e^{-x})^2}$.
The derivative is at most 0.25, so multiplying many such derivatives together during backpropagation shrinks the gradient toward 0.
122
Use ReLU Function
$f(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$
$f'(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x \le 0) \end{cases}$
123
17. Nonsaturating Activation Functions
124
Dying ReLUs
$w_i = w_i - \eta \dfrac{\partial L}{\partial w_i}$
$f(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$
If a neuron's weighted input stays negative, its output and its gradient are both 0, so its weights stop being updated and the neuron "dies".
125
Leaky ReLU
$f(x) = \begin{cases} x & (x > 0) \\ \alpha x & (x \le 0) \end{cases} \quad (\alpha = 0.01)$
The output is $\alpha x$ when the input is negative, so the output is not exactly horizontal when $x < 0$.
ELU:
$f(x) = \begin{cases} x & (x > 0) \\ \alpha (e^x - 1) & (x \le 0) \end{cases} \quad (\alpha > 0)$
SELU:
$f(x) = \lambda \begin{cases} x & (x > 0) \\ \alpha (e^x - 1) & (x \le 0) \end{cases} \quad (\lambda > 1, \alpha > 0)$
127
18. Parameter Initialization
128
Recap: Vanishing Gradient Problem
129
Parameter Initialization
Initialize parameter values with a range matched to the layer size: the more nodes feeding a layer, the smaller the range (standard deviation) of its initial weights.
Popular methods:
・ Xavier initialization
・ He initialization
130
Xavier Initialization
Initialize the weights of each layer from a distribution with mean $= 0$ and std $= \sqrt{\dfrac{1}{n}}$, where $n$ is the number of nodes in the previous layer.
[Diagram: for the layer fed by $x_1, \dots, x_n$, std $= \sqrt{1/n}$; for the layer fed by $h_1, \dots, h_m$, std $= \sqrt{1/m}$]
131
He Initialization
Initialize the weights of each layer from a distribution with mean $= 0$ and std $= \sqrt{\dfrac{2}{n}}$, where $n$ is the number of nodes in the previous layer.
[Diagram: for the layer fed by $x_1, \dots, x_n$, std $= \sqrt{2/n}$; for the layer fed by $h_1, \dots, h_m$, std $= \sqrt{2/m}$]
132
19. ANN Regression with Keras
133
Import Libraries
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
134
Load and Prepare the Dataset
# Load dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
# Create X and y
X = pd.DataFrame(data)
y = target
135
Define Artificial Neural Network
# Define ANN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(13,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

Leaky ReLU function:
from tensorflow.keras.layers import LeakyReLU
leaky_relu = LeakyReLU(alpha=0.01)
Dense(64, activation=leaky_relu)

ELU function: activation='elu'
SELU function: activation='selu'
136
He Initialization
# Define ANN
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(13,),
          kernel_initializer="he_normal"))
model.add(Dense(64, activation='relu',
          kernel_initializer="he_normal"))
model.add(Dense(64, activation='relu',
          kernel_initializer="he_normal"))
model.add(Dense(32, activation='relu',
          kernel_initializer="he_normal"))
model.add(Dense(1))
137
Show the Model Summary
Number of parameters per layer = (nodes in the previous layer × nodes in this layer) + biases
13 × 128 + 128 = 1,792
128 × 64 + 64 = 8,256
64 × 64 + 64 = 4,160
64 × 32 + 32 = 2,080
32 × 1 + 1 = 33
Total: 1,792 + 8,256 + 4,160 + 2,080 + 33 = 16,321
138
Compile the Model
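A typical compile call for this regression model (a sketch; the optimizer chosen on the original slide may differ):

# Compile model
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mae'])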
139
Fit the Model
batch_size=200
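A sketch of the surrounding fit call, assuming a standard train/test split; batch_size=200 is taken from the slide, while the split and the number of epochs are illustrative.

# Split the data and fit the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

history = model.fit(X_train, y_train,
                    epochs=300,
                    batch_size=200,
                    validation_split=0.2,
                    verbose=0)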
140
Fit the Model
141
Learning History
143
Model Evaluation
# Model evaluation
train_loss, train_mae = model.evaluate(X_train, y_train)
test_loss, test_mae = model.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss, test_loss))
print('train mae:{:.3f}\ntest mae: {:.3f}'.format(train_mae, test_mae))
----------------------------------------------------------------------------
train loss: 4.239
test loss: 10.772
train mae: 0.884
test mae: 2.293
144
Make Prediction
# Make prediction
y_pred = model.predict(X_test)
print(y_pred)
----------------------------------------------------------------------------
[[28.521143 ]
[19.221983 ]
[26.441172 ]
[19.683641 ]
[15.493921 ]
・・・・・
[17.98763 ]
[14.546152 ]]
145
20. ANN Classification with Keras
146
Import Libraries
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
147
Load and Prepare the Dataset
# Load dataset
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
# Create X and y
X = pd.DataFrame(breast_cancer.data)
y = breast_cancer.target
148
Define Artificial Neural Network
# Define ANN
model = Sequential()
model.add(Dense(64, activation='relu',
          input_shape=(30,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
149
Show the Model Summary
# Model summary
model.summary()
Number of parameters per layer = (nodes in the previous layer × nodes in this layer) + biases
30 × 64 + 64 = 1984
64 × 32 + 32 = 2080
32 × 1 + 1 = 33
150
Compile the Model
# Compile model
model.compile(optimizer='sgd',
loss='binary_crossentropy',
metrics=['acc'])
151
Fit the Model
152
Learning History
・・・・・・・・・・
Epoch 300/300
340/340 [==============================] - 0s 97us/sample - loss: 0.0442 - acc: 0.9882 -
val_loss: 0.0766 - val_acc: 0.9767
153
Visualize Learning History
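A common way to plot the curves, assuming the fit call was assigned to a history object (a sketch):

# Plot training and validation curves
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.plot(history.history['acc'], label='train acc')
plt.plot(history.history['val_acc'], label='val acc')
plt.xlabel('epoch')
plt.legend()
plt.show()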
154
Model Evaluation
# Model evaluation
train_loss, train_acc = model.evaluate(X_train, y_train)
test_loss, test_acc = model.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss, test_loss))
print('train acc:{:.3f}\ntest acc: {:.3f}'.format(train_acc, test_acc))
----------------------------------------------------------------------------
426/426 [==============================] - 0s 30us/sample - loss: 0.0506 - acc:
0.9859
143/143 [==============================] - 0s 63us/sample - loss: 0.0649 - acc:
0.9790
train loss:0.051
test loss: 0.065
train acc:0.986
test acc: 0.979
155
Make Prediction
# Make prediction
y_pred = model.predict(X_test)
print(np.round(y_pred, 3))
--------------------------------------------------------------------------
[[0.993]
[1. ]
[0. ]
[0.998]
[1. ]
・・・・
[0.996]
[0.894]]
156
Multiclass Classification
# Load dataset
from sklearn.datasets import load_wine
data = load_wine()

# Display target names
list(data.target_names)
--------------------------------------
['class_0', 'class_1', 'class_2']
[Pie chart: proportions of class_0, class_1, and class_2 in the target]
157
Multiclass Classification
# Create X and y
features = data.data
target = data.target
X = pd.DataFrame(features)
y = target

# Compile model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
159
Multiclass Classification
# Model evaluation
train_loss, train_acc = model.evaluate(X_train, y_train)
test_loss, test_acc = model.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss, test_loss))
print('train acc:{:.3f}\ntest acc: {:.3f}'.format(train_acc, test_acc))
160
Multiclass Classification
# Make prediction
y_pred = model.predict(X_test)
print(np.round(y_pred, 3))
----------------------------------------
[[0. 1. 0. ]
[0. 1. 0. ]
[0. 1. 0. ]
[0. 1. 0. ]
[0. 1. 0. ]
[0.01 0.99 0. ]
[1. 0. 0. ]
[0. 1. 0. ]
[0.003 0.997 0. ]
[0. 1. 0. ]
・・・・・・・・・・・・・
[1. 0. 0. ]
[0. 0. 1. ]
[1. 0. 0. ]]
161
21. Overfitting
162
Model Generalizability
163
Model Generalizability
164
Model Generalizability
165
Overfitting
166
How Can We Prevent Overfitting?
・ Regularization
・ Early Stopping
167
22. L1 & L2 Regularization
168
Regularization
169
Q. Which is a case where “X and Y are not related”?
[Figure: four scatter plots of Y against X showing different strengths of relationship]
170
Statistical Hypothesis Testing in Regression Analysis
$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$
If $\beta_i \ne 0$, the corresponding $x_i$ is related to $y$.
171
Model Complexity and Overfitting
$y = \beta_1 x_1 + \varepsilon$
$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon$
If $\beta_2 = \beta_3 = \beta_4 = 0$, the second model reduces to $y = \beta_1 x_1 + \varepsilon$.
Regularization forces the weights of uninformative features to be zero or nearly zero.
174
Lasso Regression (L1 Regularization)
Ridge Regression (L2 Regularization)
176
23. Dropout
177
Dropout
178
Dropout
179
Dropout
180
Dropout and Generalizability
181
24. Regularization with Keras
182
Import Libraries
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
183
Load and Prepare the Dataset
# Load dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
# Create X and y
X = pd.DataFrame(data)
y = target
184
L1 Regularization
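A sketch of adding an L1 penalty to each hidden layer with Keras; the penalty strength 0.01 is illustrative, and the variable name model_l1 matches the evaluation code below.

# Define ANN with L1 regularization on the kernel weights
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

model_l1 = Sequential()
model_l1.add(Dense(128, activation='relu', input_shape=(13,),
                   kernel_regularizer=regularizers.l1(0.01)))
model_l1.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.01)))
model_l1.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.01)))
model_l1.add(Dense(32, activation='relu', kernel_regularizer=regularizers.l1(0.01)))
model_l1.add(Dense(1))
model_l1.compile(optimizer='adam', loss='mse', metrics=['mae'])

L2 regularization (next slide) is set up the same way with regularizers.l2.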
185
Fit and Predict with L1 Regularization
# Model evaluation
train_loss_l1, train_mae_l1 = model_l1.evaluate(X_train, y_train)
test_loss_l1, test_mae_l1 = model_l1.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss_l1, test_loss_l1))
print('train mae:{:.3f}\ntest mae: {:.3f}'.format(train_mae_l1, test_mae_l1))
--------------------------------------------------------------------------------------------------------
train loss:5.964
test loss: 9.608
train mae:1.132
test mae: 2.114
186
L2 Regularization
187
Dropout
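A sketch of inserting Dropout layers after each hidden layer; the dropout rate 0.2 and the variable name model_d are illustrative.

# Define ANN with Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model_d = Sequential()
model_d.add(Dense(128, activation='relu', input_shape=(13,)))
model_d.add(Dropout(0.2))
model_d.add(Dense(64, activation='relu'))
model_d.add(Dropout(0.2))
model_d.add(Dense(32, activation='relu'))
model_d.add(Dropout(0.2))
model_d.add(Dense(1))
model_d.compile(optimizer='adam', loss='mse', metrics=['mae'])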
189
Early Stopping with Keras
# Define ANN
model_e = Sequential()
model_e.add(Dense(128, activation='relu', input_shape=(13,)))
model_e.add(Dense(64, activation='relu'))
model_e.add(Dense(64, activation='relu'))
model_e.add(Dense(32, activation='relu'))
model_e.add(Dense(1))

# Set EarlyStopping
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=30)
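The callback is then passed to fit (a sketch; the compile settings, split variables, and epoch count are illustrative):

# Compile and fit with the EarlyStopping callback
model_e.compile(optimizer='adam', loss='mse', metrics=['mae'])
history_e = model_e.fit(X_train, y_train,
                        epochs=1000,
                        batch_size=200,
                        validation_split=0.2,
                        callbacks=[early_stop],
                        verbose=0)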
190
Early Stopping with Keras
191
Model Performance
192
25. Optimizer
193
Optimizers
e.g.,
・SGD
・Momentum
・AdaGrad
・RMSProp
・Adam etc.
194
SGD (Stochastic Gradient Descent)
$W \leftarrow W - \eta \dfrac{\partial L}{\partial W}$
$W$: parameter, $\dfrac{\partial L}{\partial W}$: gradient
195
Momentum
$v \leftarrow \alpha v - \eta \dfrac{\partial L}{\partial W}$
$W \leftarrow W + v$
$v$: velocity
197
AdaGrad
Learning rate decay: Set a high learning rate at the outset, and gradually lower it.
$h \leftarrow h + \dfrac{\partial L}{\partial W} \odot \dfrac{\partial L}{\partial W}$
$W \leftarrow W - \eta \dfrac{1}{\sqrt{h} + \varepsilon} \dfrac{\partial L}{\partial W}$
200
RMSProp
AdaGrad tends to lower the learning rate too fast, so the effective learning rate becomes 0 too early.
RMSProp also accumulates squared gradients, but it weights recent iterations more heavily (an exponentially decaying average).
$h \leftarrow \beta h + (1 - \beta) \dfrac{\partial L}{\partial W} \odot \dfrac{\partial L}{\partial W}$   ($\beta$: decay rate, e.g. 0.9)
$W \leftarrow W - \eta \dfrac{1}{\sqrt{h} + \varepsilon} \dfrac{\partial L}{\partial W}$
201
Adam
$m \leftarrow \beta_1 m + (1 - \beta_1) \dfrac{\partial L}{\partial W}$   ($m$: velocity (first moment), $\beta_1$: decay rate)
$v \leftarrow \beta_2 v + (1 - \beta_2) \dfrac{\partial L}{\partial W} \odot \dfrac{\partial L}{\partial W}$   ($v$: second moment, $\beta_2$: decay rate)
$\hat{m} = \dfrac{m}{1 - \beta_1^t}, \quad \hat{v} = \dfrac{v}{1 - \beta_2^t}$
$W \leftarrow W - \eta \dfrac{\hat{m}}{\sqrt{\hat{v}} + \varepsilon}$
($L$: loss, $W$: parameter, $\dfrac{\partial L}{\partial W}$: gradient)
202
26. Batch Normalization
203
Recap
・ Regularization
・ Optimizer
・ Parameter Initialization
Input data normalization: standardization $x_i^{\mathrm{new}} = \dfrac{x_i - \mu}{\sigma}$, or min-max scaling so that $0 \le x' \le 1$.
205
Weakness of Input Data Normalization
206
Batch Normalization
[Diagram: Batch Normalization layers are inserted between the layers of the network]
207
Batch Normalization: Equations
Mean: $\mu_b \leftarrow \dfrac{1}{m} \sum_{i=1}^{m} x_i$
Variance: $\sigma_b^2 \leftarrow \dfrac{1}{m} \sum_{i=1}^{m} (x_i - \mu_b)^2$
Standardization: $\hat{x}_i \leftarrow \dfrac{x_i - \mu_b}{\sqrt{\sigma_b^2 + \varepsilon}}$
208
27. Optimization & Batch Normalization with Keras
209
Import Libraries
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
210
Load and Prepare Dataset
# Load dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
# Create X and y
X = pd.DataFrame(data)
y = target
# Define ANN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

def ann():
    model = Sequential()
    model.add(Dense(128, activation='relu', input_shape=(13,)))
    model.add(BatchNormalization())
    model.add(Dense(64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(32, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(1))
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae'])
    return model
212
Set Optimizer
from tensorflow.keras.optimizers import SGD, Adagrad, RMSprop, Adam

# Momentum
optimizer = SGD(lr=0.001, momentum=0.9)
# Adagrad
optimizer = Adagrad(lr=0.001, epsilon=10**-10)
# RMSprop
optimizer = RMSprop(lr=0.001, rho=0.9)
# Adam
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.9)
213
Fit and Evaluate the Model
# Model evaluation
train_loss, train_mae = model.evaluate(X_train, y_train)
test_loss, test_mae = model.evaluate(X_test, y_test)
print('train loss:{:.3f}\ntest loss: {:.3f}'.format(train_loss, test_loss))
print('train mae:{:.3f}\ntest mae: {:.3f}'.format(train_mae, test_mae))
214
Save and Load the Model
# Save model
model.save("my_ann_model.h5")
# Load model
model = keras.models.load_model("my_ann_model.h5")
215
Save the Best Model
# Import ModelCheckpoint
from keras.callbacks import ModelCheckpoint
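A sketch of how the callback might be used; the file name and monitored metric are illustrative.

# Save the weights of the best epoch during training
checkpoint = ModelCheckpoint("best_ann_model.h5",
                             monitor='val_loss',
                             save_best_only=True)
history = model.fit(X_train, y_train,
                    epochs=300,
                    batch_size=200,
                    validation_split=0.2,
                    callbacks=[checkpoint],
                    verbose=0)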
216