ML UNIT-4 Notes PDF
COLLEGE
(Approved by AICTE, Affiliated to JNTUK)(An ISO9001:2008 Certified
Institution)
Prepared by Dr.M.BHEEMALINGAIAH
Artificial Neural Networks and Support Vector Machine (SVM)
UNIT IV: Artificial Neural Networks: Neurons and biological motivation, Linear threshold
units. Perceptrons: representational limitation and gradient descent training, Multilayer networks
and backpropagation, Hidden layers and constructing intermediate, distributed representations.
Overfitting, learning network structure, recurrent networks.
Support Vector Machines: Maximum margin linear separators. Quadratic programming solution
to finding maximum margin separators. Kernels for learning non-linear functions.
Content
4.1 Introduction of Artificial neural network
4.2 When to consider Artificial Neural Networks?
4.3 Basic fundamental of Artificial Neural Networks
4.3.1 Neural Network Architecture Types
4.3.2 What are the Five Algorithms to Train a Neural Network?
4.3.3 What is the architecture of Artificial Neural Network?
4.3.4 What are the Learning Techniques in Neural Networks?
4.3.5 What are the applications of neural networks?
4.4 Activation Functions in Artificial Neural Network
4.5 PERCEPTRON (Single Neuron Network)
4.5.1 Gradient Descent and the Delta Rule
4.5.2 Derivation of the Gradient Descent Rule
4.5.3 Gradient Descent Algorithm
4.5.4 Stochastic Approximation to Gradient Descent
4.6 Multilayer Networks (Multi-Layer Perceptron Network)
4.6.1 Backpropagation Algorithm
4.6.2 Derivation of the Backpropagation Rule
4.6.3 Hidden Layer Representations
4.6.4 Generalization, Overfitting, and Stopping Criterion
4.7 Recurrent networks
4.8 Support Vector Machine
4.8.1 Linear Discriminant Functions for Binary Classification
4.8.2 Estimation of Margin
4.8.3 Quadratic programming solution to finding maximum margin
separators
4.8.4 Kernels for learning non-linear functions
The term "Artificial Neural Network" is derived from the biological neural networks that make up
the structure of the human brain. Just as the human brain has neurons that are interconnected with
one another, artificial neural networks have neurons that are interconnected with one another in
the various layers of the network. These neurons are known as nodes.
Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Cell nucleus                 Nodes
Synapse                      Weights
Axon                         Output
Dendrites in a biological neural network correspond to inputs in an artificial neural network, the
cell nucleus corresponds to nodes, synapses correspond to weights, and the axon corresponds to the output.
Artificial Neural Networks can be viewed as weighted directed graphs in which artificial
neurons are nodes, and directed edges with weights are connections between neuron outputs
and neuron inputs.
The Artificial Neural Network receives information from the external world, such as patterns
and images, in vector form. These inputs are denoted x(n), where n indexes the inputs.
Each input is multiplied by its corresponding weight. Weights are the information the neural
network uses to solve a problem. Typically, a weight represents the strength of the
interconnection between neurons inside the neural network.
The weighted inputs are all summed inside the computing unit (artificial neuron). A bias is
added to this sum so that the output can be non-zero even when the weighted sum is zero, or to
scale up the system response. The bias can be treated as a weight whose input is always equal to 1.
The sum can be any real value, from negative infinity to positive infinity. To limit the
response to the desired range, the sum is passed through an activation function, which may
apply a threshold.
The activation function, also called the transfer function, produces the desired output. There
are linear as well as nonlinear activation functions.
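The computation described above (weighted sum, bias, activation) can be sketched in a few lines. This is a minimal illustration, not taken from the notes: the function names, the input values, the weights, and the bias are all assumptions chosen so the arithmetic is easy to follow.

```python
import math

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

def identity(z):
    """Linear activation: returns the weighted sum unchanged."""
    return z

def sigmoid(z):
    """Nonlinear activation: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values only (assumptions, not from the notes).
x = [1.0, 0.0, 1.0]
w = [0.5, -0.3, 0.2]
b = -0.7

linear_out = neuron_output(x, w, b, identity)    # 0.5 + 0.0 + 0.2 - 0.7
squashed_out = neuron_output(x, w, b, sigmoid)   # sigmoid of the same sum
print(linear_out, squashed_out)
```

Swapping the `activation` argument is the only change needed to move between a linear and a nonlinear unit.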
Perceptron Model in Neural Networks: This neural network has input units (two in the simplest
example) and one output unit, with no hidden layers. Such networks are also known as 'single-layer perceptrons.'
Radial Basis Function Neural Network: These networks are similar to the feed-forward Neural
Network, except radial basis function is used as these neurons' activation function.
Multilayer Perceptron Neural Network: These networks use one or more hidden layers of
neurons, unlike the single-layer perceptron. With several hidden layers they are also known as
Deep Feedforward Neural Networks.
Recurrent Neural Network: Type of Neural Network in which hidden layer neurons have self-
connections. Recurrent Neural Networks possess memory. At any instance, the hidden layer
neuron receives activation from the lower layer and its previous activation value.
Long Short-Term Memory Neural Network (LSTM): The type of Neural Network in which
memory cell is incorporated into hidden layer neurons is called LSTM network.
Boltzmann Machine Neural Network: These networks are similar to the Hopfield network,
except that some neurons are input (visible) units while others are hidden. The weights are
initialized randomly and are learned with a stochastic learning rule rather than with backpropagation.
Convolutional Neural Network: These networks use convolutional layers, in which small learned
filters are applied across the input. They are widely used for image data.
Modular Neural Network: It is a combined structure of different types of neural networks,
such as the multilayer perceptron, Hopfield network, and recurrent neural network, each
incorporated as a single module that performs an independent subtask of the complete
neural network.
Physical Neural Network: In this type of Artificial Neural Network, an electrically adjustable
resistance material is used to emulate the synapse, instead of the software simulation used in
other neural networks. Artificial Intelligence systems collect and analyze data using smart sensors
or machine learning algorithms and automatically route service requests to reduce the human workload.
A typical neural network contains a large number of artificial neurons, called units, arranged in a
series of layers. A typical artificial neural network comprises the following layers:
Input layer - It contains those units (artificial neurons) that receive input from the
outside world, which the network will learn from, recognize, or otherwise process.
Output layer - It contains units that respond to the information fed into the network, reflecting
what it has learned about the task.
Hidden layer - These units are in between the input and output layers. The hidden layer's job is
to transform the input into something that the output units can use.
Most neural networks are fully connected, which means each hidden neuron is linked to every
neuron in its previous layer (input) and in the next layer (output).
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Offline Learning
Online Learning
Supervised Learning: The training data is input to the network together with the known desired
output, and the weights are adjusted until the network's output matches the desired value.
Unsupervised Learning: The network is trained on input data whose desired output is not known. The
network groups the input data and adjusts the weights by extracting features from the input data.
Reinforcement Learning: Here, the correct output value is unknown, but the network receives
feedback on whether its output is right or wrong. It is sometimes described as a form of
semi-supervised learning.
Offline Learning: The weight vector and threshold are adjusted only after the whole training set
has been shown to the network. It is also called Batch Learning.
Online Learning: The weight and threshold are adjusted after each training sample is presented
to the network.
4.4 Activation Functions in Artificial Neural Network
Several well-known activation functions are used in ANNs. They are shown in the following table.
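As a rough companion to this section, the sketch below implements a handful of commonly used activation functions. The selection (step, sign, sigmoid, tanh, ReLU) is an assumption based on common practice and may not match the table exactly; the function names are illustrative.

```python
import math

def step(z):
    """Binary step: 1 when z is non-negative, otherwise 0."""
    return 1 if z >= 0 else 0

def sign(z):
    """Sign (bipolar step): +1 when z is positive, otherwise -1."""
    return 1 if z > 0 else -1

def sigmoid(z):
    """Logistic sigmoid: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: squashes any real z into (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)

ACTIVATIONS = {"step": step, "sign": sign, "sigmoid": sigmoid, "tanh": tanh, "relu": relu}

# Print each function at a negative, zero, and positive input.
for name, fn in ACTIVATIONS.items():
    values = [round(fn(z), 3) for z in (-2.0, 0.0, 2.0)]
    print(f"{name:8s} {values}")
```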
4.5 PERCEPTRON (Single Neuron Network)
The perceptron is a linear machine learning algorithm used for supervised learning of binary
classifiers. This algorithm enables neurons to learn by processing the elements of the training
set one at a time.
Frank Rosenblatt invented the perceptron model as a binary classifier with three
main components:
Input Nodes or Input Layer: This is the primary component of the perceptron, which accepts the
initial data into the system for further processing. Each input node holds a real numerical
value.
Weight and Bias: The weight parameter represents the strength of the connection between units and
is another important parameter of the perceptron. The larger the magnitude of a weight, the more
strongly the associated input influences the output. The bias can be considered as the intercept
term in a linear equation.
Activation Function: This is the final component, which determines whether the neuron will fire
or not. In the perceptron, the activation function is primarily a step function.
Types of Activation functions:
o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem statement and the form of
the desired outputs. The choice (e.g., sign, step, or sigmoid) may also change in perceptron models
depending on whether the learning process is slow or suffers from vanishing or exploding gradients.
This step function or activation function plays a vital role in ensuring that the output is mapped
to the required range, such as (0, 1) or (-1, 1). The weight of an input indicates how strongly that
input influences the node. Similarly, the bias value makes it possible to shift the activation
function curve left or right.
Figure: A perceptron
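Learning a perceptron involves choosing values for the weights w0, ..., wn. The space of candidate
hypotheses is therefore the set of all possible real-valued weight vectors.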
Representational Power of Perceptrons
The perceptron can be viewed as representing a hyperplane decision surface in
the n-dimensional space of instances (i.e., points)
The perceptron outputs a 1 for instances lying on one side of the hyperplane and
outputs a -1 for instances lying on the other side, as illustrated in the figure below.
Example: Representation of AND functions
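One way to see how a perceptron can represent AND is to pick a weight vector by hand and check all four input combinations. The sketch below assumes inputs encoded as 0 and 1 and uses the illustrative weights w0 = -0.8, w1 = w2 = 0.5; both the encoding and the weights are assumptions, and many other weight vectors work equally well.

```python
def perceptron(weights, inputs):
    """Return +1 if w0 + w1*x1 + ... + wn*xn > 0, otherwise -1."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else -1

# Hand-picked weights (assumption): the sum exceeds 0 only when both inputs are 1.
and_weights = [-0.8, 0.5, 0.5]

truth_table = {
    (x1, x2): perceptron(and_weights, (x1, x2))
    for x1 in (0, 1)
    for x2 in (0, 1)
}
for inputs, output in truth_table.items():
    print(inputs, output)
```

The line w0 + w1*x1 + w2*x2 = 0 is the separating hyperplane: only the point (1, 1) lies on its positive side.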
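The learning problem is to determine a weight vector that causes the perceptron to produce
the correct +1 or -1 output for each of the given training examples.
One way to learn an acceptable weight vector is to begin with random weights, then iteratively
apply the perceptron to each training example, modifying the perceptron weights whenever it
misclassifies an example.
This process is repeated, iterating through the training examples as many
times as needed until the perceptron classifies all training examples correctly.
Weights are modified at each step according to the perceptron training rule,
which revises the weight wi associated with input xi according to the rule
$$ w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta (t - o) x_i $$
Here t is the target output for the current training example, o is the output generated by the
perceptron, and η is a positive constant called the learning rate.
The role of the learning rate is to moderate the degree to which weights are
changed at each step. It is usually set to some small value (e.g., 0.1) and is
sometimes made to decay as the number of weight-tuning iterations increases.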
Drawback:
The perceptron rule finds a successful weight vector when the training examples are
linearly separable, but it can fail to converge if the examples are not linearly separable.
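The perceptron training rule can be sketched as a short loop. This is a minimal illustration under stated assumptions: the update Δwi = η (t − o) xi is applied after each misclassified example, weights start at zero rather than at random values so the run is reproducible, and `max_epochs` is a hypothetical safety limit for data that is not linearly separable.

```python
def predict(weights, x):
    """Perceptron output: +1 if w0 + sum(wi * xi) > 0, otherwise -1."""
    total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply the perceptron training rule until every example is classified correctly."""
    weights = [0.0] * (len(examples[0][0]) + 1)   # zero start (assumption)
    for epoch in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = predict(weights, x)
            if o != t:
                errors += 1
                weights[0] += eta * (t - o)            # bias input is fixed at 1
                for i, xi in enumerate(x):
                    weights[i + 1] += eta * (t - o) * xi
        if errors == 0:
            return weights, epoch + 1
    return weights, max_epochs

# AND with targets +1 / -1: a linearly separable problem.
and_examples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
learned_weights, epochs_used = train_perceptron(and_examples)
print(learned_weights, epochs_used)
```

On data that is not linearly separable (XOR, for instance) the loop would simply run until `max_epochs`, which is the failure to converge noted in the drawback above.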
PERCEPTRON
Gradient Descent is an optimization algorithm used for minimizing the cost function in
various machine learning algorithms. It is basically used for updating the parameters of the
learning model.
Types of gradient Descent:
Batch Gradient Descent (Gradient Descent): This type of gradient descent processes all the
training examples in each iteration. If the number of training examples is large, batch
gradient descent is computationally very expensive and is not preferred; stochastic gradient
descent or mini-batch gradient descent is used instead.
Stochastic Gradient Descent: This type of gradient descent processes one training example per
iteration, so the parameters are updated after every single example. Each iteration is
therefore much cheaper than an iteration of batch gradient descent. However, when the number
of training examples is large, processing only one example at a time means the number of
iterations is also large, which can add overhead for the system.
Mini-Batch Gradient Descent: This type of gradient descent processes b examples per iteration,
where b < m and m is the total number of training examples. It often works faster in practice
than both batch and stochastic gradient descent: even a large training set is processed in
batches of b examples at a time, so it handles large training sets with fewer iterations than
stochastic gradient descent.
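The three variants differ only in how many examples contribute to each parameter update. The sketch below makes that explicit with a single `batch_size` parameter. It is an illustration under assumptions: the model is a one-dimensional line y ≈ w·x + b, the data is noiseless and synthetic, and the learning rate and epoch count are arbitrary choices.

```python
import random

def gradient_descent(data, batch_size, eta=0.2, epochs=500, seed=0):
    """Fit y ~ w*x + b by minimizing squared error.

    batch_size == len(data) gives batch gradient descent,
    batch_size == 1 gives stochastic gradient descent,
    anything in between gives mini-batch gradient descent.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of 1/2 * (w*x + b - y)^2 over the batch.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

m = 8
data = [(i / 4.0, 2.0 * (i / 4.0) + 1.0) for i in range(m)]   # synthetic y = 2x + 1

results = {
    "batch": gradient_descent(data, batch_size=m),
    "stochastic": gradient_descent(data, batch_size=1),
    "mini-batch": gradient_descent(data, batch_size=4),
}
for name, (w, b) in results.items():
    print(f"{name:11s} w={w:.4f} b={b:.4f}")
```

With m examples per epoch, batch gradient descent makes one update per epoch, stochastic gradient descent makes m, and mini-batch makes m / b.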
To understand the delta training rule, consider the task of training an unthresholded perceptron,
that is, a linear unit for which the output o is given by
$$ o(\vec{x}) = \vec{w} \cdot \vec{x} $$
To derive a weight learning rule for linear units, specify a measure for the training error
of a hypothesis (weight vector), relative to the training examples:
$$ E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad (2) $$
Where,
D is the set of training examples,
$t_d$ is the target output for training example d,
$o_d$ is the output of the linear unit for training example d.
$E(\vec{w})$ is simply half the squared difference between the target output $t_d$ and the linear
unit output $o_d$, summed over all training examples.
Visualizing the Hypothesis Space
To understand the gradient descent algorithm, it is helpful to visualize the entire hypothesis
space of possible weight vectors and their associated E values as shown in below figure.
Here the axes w0 and w1 represent possible values for the two weights of a simple linear unit.
The w0, w1 plane therefore represents the entire hypothesis space.
The vertical axis indicates the error E relative to some fixed set of training examples.
The arrow shows the negated gradient at one particular point, indicating the direction in the
w0, w1 plane producing steepest descent along the error surface.
The error surface shown in the figure thus summarizes the desirability of every weight vector in
the hypothesis space
Given the way in which we chose to define E, for linear units this error surface must
always be parabolic with a single global minimum.
Gradient descent search determines a weight vector that minimizes E by starting with an
arbitrary initial weight vector, then repeatedly modifying it in small steps.
At each step, the weight vector is altered in the direction that produces the steepest descent
along the error surface depicted in above figure. This process continues until the global
minimum error is reached.
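How to calculate the direction of steepest descent along the error surface?
The direction of steepest descent can be found by computing the derivative of E with respect to
each component of the vector $\vec{w}$. This vector derivative is called the gradient of E with
respect to $\vec{w}$, written as
$$ \nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right] $$
Since the gradient specifies the direction of steepest increase of E, the training rule for
gradient descent is
$$ \vec{w} \leftarrow \vec{w} + \Delta \vec{w}, \qquad \Delta \vec{w} = -\eta \nabla E(\vec{w}) $$
Here η is a positive constant called the learning rate, which determines the step size in the
gradient descent search.
The negative sign is present because we want to move the weight vector in the direction that
decreases E. In component form, $w_i \leftarrow w_i + \Delta w_i$ with $\Delta w_i = -\eta \, \partial E / \partial w_i$.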
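Calculate the gradient at each step. The vector of derivatives $\partial E / \partial w_i$ that
form the gradient can be obtained by differentiating E from Equation (2), as
$$ \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) = \sum_{d \in D} (t_d - o_d)(-x_{id}) $$
where $x_{id}$ denotes the single input component $x_i$ for training example d. Substituting this
into $\Delta w_i = -\eta \, \partial E / \partial w_i$ gives the weight update rule for gradient descent:
$$ \Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id} \qquad (7) $$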
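A minimal sketch of gradient descent for a linear unit follows, using the update summed over all training examples (the rule the text refers to as Equation (7), Δwi = η Σd (td − od) xid). Assumptions: weights start at zero rather than at small random values, a fixed iteration count stands in for a termination condition, and the training data is synthetic.

```python
def train_linear_unit(examples, eta=0.05, iterations=2000):
    """Batch gradient descent for a linear unit o = w . x."""
    n = len(examples[0][0])
    w = [0.0] * n                      # zero start (assumption)
    for _ in range(iterations):
        delta = [0.0] * n              # accumulated update for this pass
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]
        # Weights change only after all examples have been summed.
        w = [wi + di for wi, di in zip(w, delta)]
    return w

# Each input vector starts with the constant x0 = 1 that pairs with w0.
# Targets follow t = 1 + 2 * x1 (synthetic data).
examples = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]
weights = train_linear_unit(examples)
print(weights)
```

Because the error surface for a linear unit has a single global minimum, the run converges to that minimum provided η is small enough; too large a value makes the updates overshoot.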
Issues in Gradient Descent Algorithm
The gradient descent training rule presented in Equation (7) computes weight updates
after summing over all the training examples in D.
The idea behind stochastic gradient descent is to approximate this gradient descent
search by updating weights incrementally, following the calculation of the error for
each individual example:
$$ \Delta w_i = \eta (t - o) x_i $$
where t, o, and $x_i$ are the target value, unit output, and ith input for the training
example in question.
One way to view this stochastic gradient descent is to consider a distinct error function
$E_d(\vec{w})$ for each individual training example d as follows
$$ E_d(\vec{w}) = \frac{1}{2} (t_d - o_d)^2 $$
where $t_d$ and $o_d$ are the target value and the unit output value for training example d.
Stochastic gradient descent iterates over the training examples d in D, at each iteration
altering the weights according to the gradient with respect to $E_d(\vec{w})$.
The sequence of these weight updates, when iterated over all training examples, provides a
reasonable approximation to descending the gradient with respect to our original error
function $E(\vec{w})$.
By making the value of η sufficiently small, stochastic gradient descent can be made to
approximate true gradient descent arbitrarily closely.
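The claim that a small η makes stochastic gradient descent approximate true gradient descent can be checked numerically. The sketch below compares one batch step with one incremental pass over the same examples, both starting from the same weights, and reports the largest difference between the resulting weight vectors. The data and the three values of η are illustrative assumptions.

```python
def batch_step(w, examples, eta):
    """One true gradient descent step: sum the updates, then apply them."""
    delta = [0.0] * len(w)
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi
    return [wi + di for wi, di in zip(w, delta)]

def stochastic_pass(w, examples, eta):
    """One pass of incremental updates: apply the update after each example."""
    w = list(w)
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

examples = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]

def gap(eta):
    """Largest weight difference between the two procedures after one pass."""
    start = [0.0, 0.0]
    a = batch_step(start, examples, eta)
    b = stochastic_pass(start, examples, eta)
    return max(abs(ai - bi) for ai, bi in zip(a, b))

for eta in (0.1, 0.01, 0.001):
    print(eta, gap(eta))
```

The gap shrinks much faster than η itself, since the two procedures differ only through the small weight changes made partway through the incremental pass.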
The key differences between standard gradient descent and stochastic gradient descent are
In standard gradient descent, the error is summed over all examples before updating
weights, whereas in stochastic gradient descent weights are updated upon examining each
training example.
Summing over multiple examples in standard gradient descent requires more computation
per weight update step. On the other hand, because it uses the true gradient, standard
gradient descent is often used with a larger step size per weight update than stochastic
gradient descent.
In cases where there are multiple local minima with respect to $E(\vec{w})$, stochastic gradient
descent can sometimes avoid falling into these local minima because it uses the various
$\nabla E_d(\vec{w})$ rather than $\nabla E(\vec{w})$ to guide its search.
4.6 Multilayer Networks (Multi-Layer Perceptron Network)
The main shortcoming of single-layer perceptrons is that they can express only linear decision
surfaces. Multi-layer perceptrons are neural networks that incorporate one or more hidden layers
and nonlinear activation functions. Learning takes place in a supervised manner, with the weights
updated by means of gradient descent. A multi-layer perceptron works in two directions: forward
propagation of the inputs, and backward propagation of the errors used for the weight updates.
The activation function can be changed to suit the type of target: softmax is usually used for
multi-class classification, sigmoid for binary classification, and so on. These networks are also
called dense networks because all the neurons in a layer are connected to all the neurons in the
next layer, as shown in the figure.
A Differentiable Threshold Unit (Sigmoid unit)
The sigmoid unit is a unit very much like a perceptron, but based on a smoothed,
differentiable threshold function.
The sigmoid unit first computes a linear combination of its inputs, then applies a
threshold to the result. The thresholded output is a continuous function of its input.
More precisely, the sigmoid unit computes its output o as
$$ o = \sigma(\vec{w} \cdot \vec{x}), \qquad \sigma(y) = \frac{1}{1 + e^{-y}} $$
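A short sketch of the sigmoid unit, assuming the standard logistic function σ(y) = 1 / (1 + e^(−y)). It also checks numerically a property that makes this unit convenient for gradient descent: the derivative can be written in terms of the output itself, σ'(y) = σ(y)(1 − σ(y)). The function names and sample values are illustrative.

```python
import math

def sigmoid(y):
    """Logistic function: a smooth, differentiable threshold."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(weights, inputs):
    """Linear combination of the inputs followed by the sigmoid."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(net)

def sigmoid_derivative(y):
    """Derivative expressed through the output: sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

# Compare the closed form with a central finite difference at one point.
y, h = 0.3, 1e-5
numeric = (sigmoid(y + h) - sigmoid(y - h)) / (2 * h)
analytic = sigmoid_derivative(y)
print(numeric, analytic)

print(sigmoid_unit([0.5, -0.5], [1.0, 1.0]))   # net = 0, so the output is 0.5
```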
4.6.1 Backpropagation Algorithm
Backpropagation is an algorithm that propagates the errors backward from the output nodes toward
the input nodes; the name is short for "backward propagation of errors". It is used in a wide
range of neural network applications, such as character recognition and signature
verification.
Backpropagation is a widely used algorithm for training feedforward neural networks. It
computes the gradient of the loss function with respect to the network weights far more
efficiently than naively computing the gradient with respect to each individual
weight separately. This efficiency makes it possible to use gradient methods to train multi-layer
networks and update weights to minimize loss; variants such as gradient descent or stochastic
gradient descent are often used.
The backpropagation algorithm works by computing the gradient of the loss function with
respect to each weight via the chain rule, computing the gradient one layer at a time, and
iterating backward from the last layer to avoid redundant computation of intermediate terms in
the chain rule.
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from the input vectors that the
network operates on. Backpropagation compares the generated output with the desired output and
computes an error if the two do not match. It then adjusts the weights
according to this error to bring the output closer to the desired output.
Backpropagation Algorithm:
Step 5: From the output layer, go back to the hidden layer to adjust the weights to reduce the
error.
Step 6: Repeat the process until the desired output is achieved.
Advantages:
It is simple, fast, and easy to program.
Apart from the number of inputs, there are no other parameters to tune.
It is Flexible and efficient.
No need for users to learn any special functions.
Disadvantages:
It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.
Performance is highly dependent on input data.
Training can take a long time.
The matrix-based approach is preferred over a mini-batch
The BACKPROPAGATION algorithm learns the weights for a multilayer network, given a
network with a fixed set of units and interconnections. It employs gradient descent to attempt to
minimize the squared error between the network output values and the target values for these
outputs.
In the BACKPROPAGATION algorithm, we consider networks with multiple output units rather than
single units as before, so we redefine E to sum the errors over all of the network output units:
$$ E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 $$
where,
outputs is the set of output units in the network,
$t_{kd}$ and $o_{kd}$ are the target and output values associated with the kth output unit and
training example d.
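BACKPROPAGATION Algorithm:
BACKPROPAGATION(training_examples, η, n_in, n_out, n_hidden)
Each training example is a pair of the form $\langle \vec{x}, \vec{t} \rangle$, where
$\vec{x}$ is the vector of network input values and $\vec{t}$ is the vector of target network
output values.
η is the learning rate, n_in is the number of network inputs, n_hidden is the number of units in
the hidden layer, and n_out is the number of output units.
The input from unit i into unit j is denoted $x_{ji}$, and the weight from unit i to unit j is
denoted $w_{ji}$.
Create a feed-forward network with n_in inputs, n_hidden hidden units, and n_out output units.
Initialize all network weights to small random numbers.
Until the termination condition is met, Do
For each $\langle \vec{x}, \vec{t} \rangle$ in training_examples, Do
Propagate the input forward through the network:
1. Input the instance $\vec{x}$ to the network and compute the output $o_u$ of every unit u in
the network.
Propagate the errors backward through the network:
2. For each network output unit k, calculate its error term $\delta_k$:
$$ \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k) $$
3. For each hidden unit h, calculate its error term $\delta_h$:
$$ \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k $$
4. Update each network weight $w_{ji}$:
$$ w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \Delta w_{ji} = \eta \, \delta_j \, x_{ji} $$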
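The algorithm can be sketched for a network with one hidden layer of sigmoid units, trained with stochastic (per-example) updates. This is an illustration under assumptions rather than a reference implementation: a fixed number of epochs stands in for the termination condition, each unit carries a bias weight at index 0, weights are drawn from a seeded generator in [-0.05, 0.05], and the error terms use the standard sigmoid forms δk = ok(1 − ok)(tk − ok) for output units and δh = oh(1 − oh) Σk wkh δk for hidden units.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, w_hidden, w_out):
    """Return (hidden outputs, network outputs). Index 0 of each weight row is the bias."""
    h = [sigmoid(row[0] + sum(w * xi for w, xi in zip(row[1:], x))) for row in w_hidden]
    o = [sigmoid(row[0] + sum(w * hi for w, hi in zip(row[1:], h))) for row in w_out]
    return h, o

def backpropagation(examples, eta, n_in, n_out, n_hidden, epochs, seed=0):
    rng = random.Random(seed)

    def small():
        return rng.uniform(-0.05, 0.05)

    w_hidden = [[small() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[small() for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            h, o = forward(x, w_hidden, w_out)
            # Error terms for output units, then for hidden units.
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            d_hid = [hj * (1 - hj) * sum(w_out[k][j + 1] * d_out[k] for k in range(n_out))
                     for j, hj in enumerate(h)]
            # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji.
            for k in range(n_out):
                w_out[k][0] += eta * d_out[k]
                for j in range(n_hidden):
                    w_out[k][j + 1] += eta * d_out[k] * h[j]
            for j in range(n_hidden):
                w_hidden[j][0] += eta * d_hid[j]
                for i in range(n_in):
                    w_hidden[j][i + 1] += eta * d_hid[j] * x[i]
    return w_hidden, w_out

def total_error(examples, w_hidden, w_out):
    """Half the squared error summed over all examples and output units."""
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in examples
                     for tk, ok in zip(t, forward(x, w_hidden, w_out)[1]))

# Two-input AND with 0/1 targets (illustrative task).
examples = [((0, 0), (0,)), ((0, 1), (0,)), ((1, 0), (0,)), ((1, 1), (1,))]
untrained = backpropagation(examples, 0.5, 2, 1, 2, epochs=0)
trained = backpropagation(examples, 0.5, 2, 1, 2, epochs=2000)
error_before = total_error(examples, *untrained)
error_after = total_error(examples, *trained)
print(error_before, error_after)
```

The hidden-unit error terms are computed before any weight is changed, so they use the same output-layer weights that produced the forward pass.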
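Adding Momentum
Because BACKPROPAGATION is such a widely used algorithm, many variations have
been developed. The most common is to alter the weight-update rule
by making the weight update on the nth iteration depend partially on the update that
occurred during the (n - 1)th iteration, as follows:
$$ \Delta w_{ji}(n) = \eta \, \delta_j \, x_{ji} + \alpha \, \Delta w_{ji}(n - 1) $$
Here $\Delta w_{ji}(n)$ is the weight update performed during the nth iteration, and
$0 \le \alpha < 1$ is a constant called the momentum.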
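The effect of momentum is easiest to see on a single weight. The sketch below assumes the usual form of the rule, Δw(n) = η · g + α · Δw(n − 1), where g stands for the gradient term δj xji and α is the momentum constant; the values of η, α, and g are illustrative assumptions.

```python
def momentum_update(weight, gradient_term, previous_delta, eta=0.1, alpha=0.9):
    """Return the new weight and the update that was applied.

    delta(n) = eta * gradient_term + alpha * delta(n - 1)
    """
    delta = eta * gradient_term + alpha * previous_delta
    return weight + delta, delta

# With a constant gradient term, successive updates grow: the weight keeps
# "rolling" in the same direction, like a ball gathering speed downhill.
weight, delta = 0.0, 0.0
deltas = []
for _ in range(3):
    weight, delta = momentum_update(weight, gradient_term=1.0, previous_delta=delta)
    deltas.append(delta)
print(deltas)
```

When the gradient term stays constant the step size approaches η / (1 − α), which is why momentum can speed up progress across flat regions of the error surface.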
4.6.2 Derivation of the Backpropagation Rule
Deriving the stochastic gradient descent rule: Stochastic gradient descent involves iterating
through the training examples one at a time, for each training example d descending the
gradient of the error $E_d$ with respect to this single example.
For each training example d, every weight $w_{ji}$ is updated by adding to it $\Delta w_{ji}$:
$$ \Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} $$
where $E_d$ is the error on training example d, summed over all output units in the network:
$$ E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 $$
Here outputs is the set of output units in the network, $t_k$ is the target value of unit k for
training example d, and $o_k$ is the output of unit k given training example d.
The derivation of the stochastic gradient descent rule is conceptually straightforward, but
requires keeping track of a number of subscripts and variables.
Consider two cases: the case where unit j is an output unit for the network, and the case
where j is an internal unit (hidden unit).
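Notation: $x_{ji}$ is the ith input to unit j, $w_{ji}$ is the associated weight,
$net_j = \sum_i w_{ji} x_{ji}$ is the weighted sum of inputs for unit j, $o_j = \sigma(net_j)$ is its
output, and $\delta_j = -\partial E_d / \partial net_j$.
$w_{ji}$ can influence the rest of the network only through $net_j$, so the chain rule gives
$$ \frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} x_{ji} $$
Case 1 (output units): $net_j$ can influence the network only through $o_j$. Therefore, we can
invoke the chain rule again to write
$$ \frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} = -(t_j - o_j) \, o_j (1 - o_j) $$
so that $\Delta w_{ji} = \eta (t_j - o_j) \, o_j (1 - o_j) \, x_{ji}$.
Case 2 (hidden units): $net_j$ can influence the network outputs only through the units in
Downstream(j). Therefore, we can write
$$ \frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k \, w_{kj} \, o_j (1 - o_j) $$
so that $\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}$ and $\Delta w_{ji} = \eta \, \delta_j \, x_{ji}$.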
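A derivation of this kind can be sanity-checked numerically. The sketch below builds a tiny sigmoid network (two inputs, two hidden units, one output, no bias terms) and compares the chain-rule result ∂Ed/∂wji = −δj xji for the hidden-layer weights against a central finite difference. The network size and all numeric values are arbitrary assumptions; the error terms used are δout = o(1 − o)(t − o) for the output unit and δj = hj(1 − hj) wj δout for hidden unit j.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def run(w_h, w_o, x, t):
    """Return (E_d, hidden outputs, network output) for one training example."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
    o = sigmoid(sum(w * hi for w, hi in zip(w_o, h)))
    return 0.5 * (t - o) ** 2, h, o

w_h = [[0.1, -0.2], [0.4, 0.3]]   # w_h[j][i]: weight from input i to hidden unit j
w_o = [0.5, -0.6]                 # w_o[j]: weight from hidden unit j to the output
x, t = [1.0, 0.5], 1.0

_, h, o = run(w_h, w_o, x, t)
delta_out = o * (1 - o) * (t - o)                                       # output unit
delta_hid = [h[j] * (1 - h[j]) * w_o[j] * delta_out for j in range(2)]  # hidden units

# Chain-rule gradient for the hidden-layer weights: dE_d/dw_ji = -delta_j * x_i.
analytic = [[-delta_hid[j] * x[i] for i in range(2)] for j in range(2)]

# Central finite differences on the same weights.
eps = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for j in range(2):
    for i in range(2):
        plus = [row[:] for row in w_h]
        minus = [row[:] for row in w_h]
        plus[j][i] += eps
        minus[j][i] -= eps
        numeric[j][i] = (run(plus, w_o, x, t)[0] - run(minus, w_o, x, t)[0]) / (2 * eps)

max_gap = max(abs(analytic[j][i] - numeric[j][i]) for j in range(2) for i in range(2))
print(max_gap)
```

A check like this is a common way to catch sign or indexing mistakes when implementing the update rules by hand.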
4.6.3 Hidden Layer Representations
BACKPROPAGATION can define new hidden layer features that are not explicit in the input
representation, but which capture properties of the input instances that are most relevant to learning
the target function. Consider, for example, the network shown in the figure below.
Suppose this network is trained to learn the simple target function f(x) = x, where x is a
vector containing seven 0's and a single 1.
The network must learn to reproduce the eight inputs at the corresponding eight output units.
Although this is a simple function, the network in this case is constrained to use only three hidden
units. Therefore, the essential information from all eight input units must be captured by the three
learned hidden units.
When BACKPROPAGATION is applied to this task, using each of the eight possible vectors as
training examples, it successfully learns the target function. By examining the hidden unit values
generated by the learned network for each of the eight possible input vectors, it is easy to see that
the learned encoding is similar to the familiar standard binary encoding of eight values using three
bits (e.g., 000, 001, 010, ..., 111). The exact values of the hidden units for one typical run of
BACKPROPAGATION are shown in the figure.
4.6.4 Generalization, Overfitting, and Stopping Criterion
What is an appropriate condition for terminating the weight update loop? One choice is
to continue training until the error E on the training examples falls below some
predetermined threshold. To see the dangers of minimizing the error over the training
data, consider how the error E varies with the number of weight iterations.
Consider first the top plot in this figure. The lower of the two lines shows the monotonically
decreasing error E over the training set, as the number of gradient descent iterations grows. The
upper line shows the error E measured over a different validation set of examples, distinct from
the training examples. This line measures the generalization accuracy of the network, that is, the
accuracy with which it fits examples beyond the training data.
The error measured over the validation examples first decreases, then
increases, even as the error over the training examples continues to decrease. How can this
occur? This occurs because the weights are being tuned to fit idiosyncrasies of the training
examples that are not representative of the general distribution of examples. The large number of
weight parameters in ANNs provides many degrees of freedom for fitting such idiosyncrasies.
Why does overfitting tend to occur during later iterations, but not during earlier iterations?
Given enough weight-tuning iterations, BACKPROPAGATION will often be able to create
overly complex decision surfaces that fit noise in the training data or unrepresentative
characteristics of the particular training sample.
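One common response to this behaviour is to monitor the validation error during training and keep the weights from the iteration where it was lowest. The sketch below shows only the bookkeeping for choosing that iteration. The function name, the `patience` parameter, and the validation curve are all hypothetical; the numbers are made up to mimic an error that first falls and then rises.

```python
def best_stopping_point(validation_errors, patience=3):
    """Return (iteration, error) for the lowest validation error seen.

    Scanning stops once `patience` consecutive iterations have failed to
    improve on the best error so far.
    """
    best_index, best_error = 0, validation_errors[0]
    for i, err in enumerate(validation_errors[1:], start=1):
        if err < best_error:
            best_index, best_error = i, err
        elif i - best_index >= patience:
            break
    return best_index, best_error

# Made-up validation errors: falling at first, then rising as overfitting sets in.
validation_curve = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.61]
stop_at, lowest = best_stopping_point(validation_curve)
print(stop_at, lowest)
```

A small `patience` risks stopping at a temporary bump in the validation error; a large one spends more iterations before giving up.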
4.7 Recurrent networks
Architecture of a traditional RNN: Recurrent neural networks, also known as RNNs, are a
class of neural networks that allow previous outputs to be used as inputs while having hidden
states. They typically have the following structure:
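A minimal sketch of one common form of this recurrence is given here, in which the new hidden state depends on the current input and the previous hidden state: h_t = tanh(W_xh x_t + W_hh h_(t-1) + b_h). The tanh activation, the weight names, and all numeric values are assumptions chosen for illustration; the figure in the original notes may use a different notation.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """Compute the next hidden state from the current input and the previous state."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(w_xh[j], x_t))
            + sum(w * h for w, h in zip(w_hh[j], h_prev))
            + b_h[j]
        )
        for j in range(len(b_h))
    ]

def run_rnn(sequence, w_xh, w_hh, b_h):
    """Feed a sequence through the recurrence, starting from a zero hidden state."""
    h = [0.0] * len(b_h)
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
        states.append(h)
    return states

# One input feature, two hidden units (illustrative weights).
w_xh = [[1.0], [0.5]]
w_hh = [[0.5, 0.0], [0.0, 0.5]]
b_h = [0.0, 0.0]

# The input is non-zero only at the first time step.
states = run_rnn([[1.0], [0.0], [0.0]], w_xh, w_hh, b_h)
for t, h in enumerate(states):
    print(t, [round(v, 4) for v in h])
```

The hidden state stays non-zero after the input has dropped to zero, which is the "memory" that distinguishes a recurrent network from a feed-forward one.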
Applications of recurrent neural networks:
Prediction problems.
Language Modelling and Generating Text.
Machine Translation.
Speech Recognition.
Generating Image Descriptions.
Video Tagging.
Text Summarization.
Call Center Analysis
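$$ \mathbf{x} = \mathbf{x}_P + r \frac{\mathbf{w}}{\|\mathbf{w}\|} $$
where $\|\mathbf{w}\|$ is the Euclidean norm of $\mathbf{w}$ and $\mathbf{w} / \|\mathbf{w}\|$ is a unit vector.
$$ g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 \begin{cases} > 0 & \text{if } \mathbf{x} \in \mathcal{H}^+ \\ = 0 & \text{if } \mathbf{x} \in \mathcal{H} \\ < 0 & \text{if } \mathbf{x} \in \mathcal{H}^- \end{cases} $$
Perpendicular distance d from the coordinate origin to $\mathcal{H}$: $d = w_0 / \|\mathbf{w}\|$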
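The geometry can be made concrete with a small numeric example. The sketch assumes that x_P is the projection of x onto the hyperplane and that r is the signed distance of x from it; substituting x = x_P + r w/‖w‖ into g and using g(x_P) = 0 then gives r = g(x)/‖w‖. The vector w = (3, 4), the offset w0 = -5, and the test points are illustrative values chosen so that ‖w‖ = 5.

```python
import math

def g(w, w0, x):
    """Linear discriminant function g(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def signed_distance(w, w0, x):
    """Signed perpendicular distance r = g(x) / ||w|| from x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return g(w, w0, x) / norm

w, w0 = [3.0, 4.0], -5.0

points = {
    "positive side": (3.0, 4.0),
    "on hyperplane": (1.0, 0.5),
    "origin": (0.0, 0.0),
}
for label, x in points.items():
    print(f"{label:14s} g={g(w, w0, x):6.2f}  r={signed_distance(w, w0, x):5.2f}")
```

For the origin the signed distance equals w0 / ‖w‖, matching the expression for d above; its sign shows on which side of the hyperplane the origin lies.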
Geometry for 3-dimensions (n=3)
Hyperplane ℋ separates the feature space into two half-spaces, ℋ+ and ℋ−.
• For linearly separable data, many hyperplanes exist that perform the separation.
• The SVM framework tells which hyperplane is best.
• The best hyperplane is the one with the largest margin that also minimizes the training error.
• Select the decision boundary that is far away from both classes. Large margin
separation is expected to yield good generalization.
Two parallel hyperplanes ℋ1 and ℋ2 pass through 𝐱(𝑖) and 𝐱(𝑘) respectively.
Linearly separable training examples,
𝒟 = {(𝐱 (1) , 𝑦 (1) ), (𝐱 (2) , 𝑦 (2) ), … , (𝐱 (𝑁) , 𝑦 (𝑁) )}
4.8.3 Quadratic programming solution to finding maximum margin separators
Problem: Solve the following constrained minimization problem:
$$ \text{minimize } f(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} $$
$$ \text{subject to } y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + w_0) \ge 1; \quad i = 1, \dots, N $$
This is the formulation of hard-margin SVM.
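Dual formulation of constrained optimization problem:
The Lagrangian is constructed:
$$ L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} - \sum_{i=1}^{N} \lambda_i \left[ y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + w_0) - 1 \right] $$
The KKT conditions are:
i. $\frac{\partial L}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{i=1}^{N} \lambda_i y^{(i)} \mathbf{x}^{(i)}$ and $\frac{\partial L}{\partial w_0} = 0 \Rightarrow \sum_{i=1}^{N} \lambda_i y^{(i)} = 0$
ii. $y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + w_0) - 1 \ge 0; \quad i = 1, \dots, N$
iii. $\lambda_i \ge 0; \quad i = 1, \dots, N$
iv. $\lambda_i \left( y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + w_0) - 1 \right) = 0; \quad i = 1, \dots, N$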
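After solving the dual problem numerically, the resulting optimum $\lambda_i$ values are used to
compute $\mathbf{w}$ and $w_0$ using the KKT conditions.
$\mathbf{w}$ is computed using condition (i) of the KKT conditions:
$$ \mathbf{w} = \sum_{i=1}^{N} \lambda_i y^{(i)} \mathbf{x}^{(i)} $$
• Only a very small percentage of the training examples have $\lambda_i > 0$. These are the
support vectors, so the sum can be restricted to their indices (svindex):
$$ \mathbf{w} = \sum_{i \in svindex} \lambda_i y^{(i)} \mathbf{x}^{(i)} $$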
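Once the multipliers are known, recovering the separator is plain arithmetic. The sketch below assumes a toy data set of three points whose multipliers were worked out by hand rather than produced by a quadratic programming solver: two points are support vectors with λ = 0.25 each and the third has λ = 0. The offset w0 is obtained from a support vector, for which y(wᵀx + w0) = 1 and hence w0 = y − wᵀx because y is ±1. The function name and data are illustrative.

```python
def recover_separator(xs, ys, lambdas, tol=1e-12):
    """Compute w and w0 from the multipliers of a hard-margin SVM."""
    # Only examples with a positive multiplier (support vectors) contribute.
    sv_index = [i for i, lam in enumerate(lambdas) if lam > tol]
    dim = len(xs[0])
    w = [sum(lambdas[i] * ys[i] * xs[i][d] for i in sv_index) for d in range(dim)]
    # For a support vector, y * (w.x + w0) = 1 and y is +1 or -1, so w0 = y - w.x.
    i = sv_index[0]
    w0 = ys[i] - sum(wd * xd for wd, xd in zip(w, xs[i]))
    return w, w0, sv_index

# Toy data (assumption): multipliers derived by hand for these three points.
xs = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
ys = [1, -1, 1]
lambdas = [0.25, 0.25, 0.0]

w, w0, sv_index = recover_separator(xs, ys, lambdas)
margins = [y * (sum(wd * xd for wd, xd in zip(w, x)) + w0) for x, y in zip(xs, ys)]
print(w, w0, sv_index, margins)
```

The two support vectors sit exactly on the margin (value 1), while the third point lies farther from the boundary and plays no part in defining it.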
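4.8.4 Kernels for learning non-linear functions: Training examples that cannot be linearly
separated in the input space can often be separated linearly in a feature space after a suitable
transformation.
Let us see some common kernels used with SVMs and their uses:
Gaussian kernel: It is a general-purpose kernel, used when there is no prior knowledge about the
data. Equation is:
$$ k(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2} \right) $$
Gaussian radial basis function (RBF): It is a general-purpose kernel, used when there is no prior
knowledge about the data. Equation is:
$$ k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right), \text{ for } \gamma > 0 $$
Laplace RBF kernel: It is a general-purpose kernel, used when there is no prior knowledge about
the data. Equation is:
$$ k(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{y}\|}{\sigma} \right) $$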
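The kernels named above can be sketched directly from their usual textbook definitions, which are assumed here: the Gaussian kernel exp(−‖x − y‖² / (2σ²)), the Gaussian RBF exp(−γ‖x − y‖²) with γ > 0, and the Laplace RBF exp(−‖x − y‖ / σ). The parameter defaults and sample points are illustrative.

```python
import math

def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2))"""
    return math.exp(-squared_distance(x, y) / (2.0 * sigma ** 2))

def gaussian_rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2), for gamma > 0"""
    return math.exp(-gamma * squared_distance(x, y))

def laplace_rbf_kernel(x, y, sigma=1.0):
    """exp(-||x - y|| / sigma)"""
    return math.exp(-math.sqrt(squared_distance(x, y)) / sigma)

a, b = (0.0, 0.0), (1.0, 2.0)
for kernel in (gaussian_kernel, gaussian_rbf_kernel, laplace_rbf_kernel):
    print(f"{kernel.__name__:20s} k(a, a)={kernel(a, a):.3f}  k(a, b)={kernel(a, b):.3f}")
```

Under these definitions the Gaussian kernel and the Gaussian RBF are the same function written with different parameters: setting γ = 1 / (2σ²) makes them coincide.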