DSML
The sample mean is usually the best, unbiased estimate of the population mean.
However, the mean is influenced by extreme values (outliers) and may not be the
best measure of center with strongly skewed data. The following equations
compute the population mean and sample mean.
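The equations themselves are not reproduced in the text above. In conventional notation (assumed here: N is the population size, n is the sample size, and x_i is the i-th observation), the two means are usually written as:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```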
Median
• The median of a variable is the middle value of the data set when the
data are sorted in order from least to greatest. It splits the data into
two equal halves with 50% of the data below the median and 50%
above the median.
#1 : 23, 27, 29, 31, 35, 39, 40, 42, 44, 47, 51
#2 : 23, 27, 29, 31, 35, 39, 40, 42, 44, 47
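As a quick check of the two data sets above, the sketch below uses Python's standard `statistics.median`. Data set #1 has an odd number of values, so the median is the middle value; data set #2 has an even number, so the median is the mean of the two middle values. The variable names are illustrative only.

```python
from statistics import median

data_1 = [23, 27, 29, 31, 35, 39, 40, 42, 44, 47, 51]  # 11 values (odd count)
data_2 = [23, 27, 29, 31, 35, 39, 40, 42, 44, 47]      # 10 values (even count)

median_1 = median(data_1)  # the 6th sorted value
median_2 = median(data_2)  # mean of the 5th and 6th sorted values

print(median_1, median_2)
```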
Mode
• The mode is the most frequently occurring value and is commonly used
with qualitative data as the values are categorical. Categorical data cannot
be added, subtracted, multiplied or divided, so the mean and median
cannot be computed.
• Understanding the relationship between the mean and median is
important. It gives us insight into the distribution of the variable. For
example, if the distribution is skewed right (positively skewed), the mean
will increase to account for the few larger observations that pull the
distribution to the right.
Measures of Dispersion
• Measures of center look at the average or middle values of a data set.
Measures of dispersion look at the spread or variation of the
data. Variation refers to the amount that the values vary among
themselves. Values in a data set that are relatively close to each other
have lower measures of variation. Values that are spread farther
apart have higher measures of variation.
Examine the two histograms below. Both groups have the same mean weight of 267 lb, but the weights in Group A are
more spread out, and therefore more variable, than the weights in Group B.
Range
• The range of a variable is the largest value minus the smallest value. It
is the simplest measure and uses only these two values in a
quantitative data set.
• Find the range for the given data set: 12, 29, 32, 34, 38, 49, 57
Range = 57 – 12 = 45
Variance
The variance uses the difference between each value and its arithmetic
mean. The differences are squared to deal with positive and negative
differences. The sample variance (s²) is an unbiased estimator of the
population variance (σ²), with n − 1 degrees of freedom.
• Compute the variance of the sample data: 3, 5, 7.
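A minimal sketch of the computation for the sample 3, 5, 7, done step by step and then cross-checked against the standard library. It assumes the sample variance (division by n − 1) is what is asked for.

```python
from statistics import mean, variance

sample = [3, 5, 7]

x_bar = mean(sample)                               # arithmetic mean of the sample
squared_devs = [(x - x_bar) ** 2 for x in sample]  # squared differences from the mean
s2 = sum(squared_devs) / (len(sample) - 1)         # divide by n - 1 degrees of freedom

print(x_bar, squared_devs, s2)
print(variance(sample))  # library cross-check of the sample variance
```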
Standard Deviation
The standard deviation is the square root of the variance (both
population and sample). The sample variance is an unbiased estimator
of the population variance, but its units are the square of the data's
units. Taking the square root returns the measure to the original units,
which makes the standard deviation a common way to numerically
describe the spread of a variable.
We want to use this sample mean to estimate the true but unknown population mean.
• Sample 1—we compute sample mean x̄
• Sample 2—we compute sample mean x̄
• Sample 3—we compute sample mean x̄
The sample mean (x̄) is a random variable with its own probability distribution called
the sampling distribution of the sample mean. The distribution of the sample mean will
have a mean equal to µ and a standard deviation equal to σ/√n, where σ is the population
standard deviation and n is the sample size.
The standard error is the standard deviation of all possible sample means.
Example #1:
Five students in a college were selected at random and their ages were found to
be 18, 21, 19, 20 and 26.
a) Calculate the standard deviation of the ages in the sample.
b) Calculate the standard error.
Example #2:
In a certain university, the mean age of a student is 20.5 with a standard
deviation of 0.8.
a) Calculate the standard error of the mean if a sample of 25 students is
selected.
b) What would the standard error of the mean be if a sample of 100
students were selected?
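The two examples can be sketched as follows. This assumes that Example #1 estimates the standard error from the sample standard deviation (n − 1 in the denominator) and that Example #2 treats 0.8 as the population standard deviation; both are interpretations, not something the exercises state.

```python
from math import sqrt
from statistics import stdev

# Example #1: sample standard deviation, then standard error s / sqrt(n)
ages = [18, 21, 19, 20, 26]
s = stdev(ages)
se_example_1 = s / sqrt(len(ages))

# Example #2: standard error sigma / sqrt(n) for two sample sizes
sigma = 0.8
se_n25 = sigma / sqrt(25)
se_n100 = sigma / sqrt(100)

print(round(s, 4), round(se_example_1, 4), se_n25, se_n100)
```

Quadrupling the sample size from 25 to 100 halves the standard error, since the sample size enters through its square root.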
Coefficient of Variation
The goal of hypothesis testing is to see if there is enough evidence against the null hypothesis. In other words, to
see if there is enough evidence to reject the null hypothesis. If there is not enough evidence, then we fail to reject
the null hypothesis.
• Ex: A man, Mr. XyZ, goes to trial and is tried for the murder of his ex-
wife. He is either guilty or innocent. Set up the null and alternative
hypotheses for this example.
The hypotheses being tested are:
1. The man is guilty
2. The man is innocent
Since the calculated chi-squared value is smaller, there is no significant difference. As a prerequisite for this test,
please note that all expected frequencies must be greater than 5.
• A school principal would like to know on which days of the week students
are most likely to be absent. The principal expects that students will be
absent equally often across the five-day school week. The principal selects a
random sample of 100 teachers and asks them on which day of the week
they had the highest number of student absences. The observed and
expected results are shown in the table below. Based on these results, do
the days with the highest number of absences occur with equal
frequency? (Use a 5% significance level.)
                     Mon   Tue   Wed   Thu   Fri
Observed Absences    23    16    14    19    28
Expected Absences    20    20    20    20    20
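A minimal sketch of the goodness-of-fit computation for the absence table, using only the standard library. The critical value 9.488 is the chi-squared table value for 4 degrees of freedom at the 5% level; the variable names are illustrative.

```python
observed = [23, 16, 14, 19, 28]
expected = [20, 20, 20, 20, 20]

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

degrees_of_freedom = len(observed) - 1  # 5 categories -> 4 degrees of freedom
CRITICAL_VALUE_5PCT_DF4 = 9.488         # chi-squared table value (alpha = 0.05, df = 4)

reject_null = chi_squared > CRITICAL_VALUE_5PCT_DF4
print(round(chi_squared, 2), degrees_of_freedom, reject_null)
```

Because the statistic is below the critical value, the null hypothesis of equal frequencies is not rejected at the 5% level.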
• In an antimalarial campaign in India, quinine was administered to
500 persons out of a total population of 2000. The number of fever
cases is shown below:
• The number of successes (tails) in an experiment of 100 trials of tossing a coin. Here the
sample space is {0, 1, 2, …, 100}
• The number of successes (fours) in an experiment of 100 trials of rolling a die. Here the
sample space is {0, 1, 2, …, 100}
Binomial Distribution
Binomial distribution is a discrete probability distribution that represents
the probabilities of binomial random variables. The binomial distribution
is a probability distribution associated with a binomial experiment in
which the binomial random variable specifies the number of successes
or failures that occurred within that sample space.
Ex: Let’s calculate the probability of getting exactly six heads when a
coin is tossed ten times.
P(x=6) = 10C6 * 0.5^6 * 0.5^4 = 210 / 1024 ≈ 0.2051
Mean and Variance of Binomial Distribution: The mean and variance of
the binomial distribution are:
• Mean = np
• Variance = npq
where,
• p is the probability of success
• q is the probability of failure (1-p)
• n is the number of trials.
Properties of a binomial distribution are:
1. There are only two possible outcomes: True or False, Yes or No.
2. There are N independent trials.
3. The probability of success (and of failure) remains the same in each trial.
4. Only the number of successes is taken into account out of N independent trials.
Examples:
#1: 80% of people who purchase pet insurance are women. If 9
pet insurance owners are randomly selected, find the probability
that exactly 6 are women.
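The binomial probabilities in this section can be sketched with a small helper. `binomial_pmf` is a hypothetical name introduced for illustration; it implements P(X = k) = nCk · p^k · (1 − p)^(n − k) and is applied both to the coin example (six heads in ten tosses) and to the pet insurance example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_six_heads = binomial_pmf(6, 10, 0.5)  # coin tossed ten times
p_six_women = binomial_pmf(6, 9, 0.8)   # 9 owners, 80% are women

# Mean and variance for the pet insurance example (n = 9, p = 0.8)
mean_women = 9 * 0.8           # np
variance_women = 9 * 0.8 * 0.2  # npq

print(round(p_six_heads, 4), round(p_six_women, 4))
```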
Example 1: In a cafe, customers arrive at a mean rate of 2 per minute. Find the probability that 5 customers arrive
in 1 minute using the Poisson distribution formula.
Given: λ = 2, and x = 5.
Example 2: Find the value of the probability mass function at x = 6, if the mean is 3.4.
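Both examples use the Poisson probability mass function P(X = x) = e^(−λ) λ^x / x!. The sketch below evaluates it with the standard library; `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)

p_example_1 = poisson_pmf(5, 2)    # 5 arrivals in a minute, mean rate 2
p_example_2 = poisson_pmf(6, 3.4)  # x = 6, mean 3.4

print(round(p_example_1, 4), round(p_example_2, 4))
```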
In this case, the task T is to flag spam for new emails, the experience E
is the training data, and the performance measure P needs to be
defined;
Examples of Applications:
• Whether or not they can learn incrementally on the fly (online versus batch
learning)
• Whether they work by simply comparing new data points to known data
points, or instead by detecting patterns in the training data and building a
predictive model, much like scientists do (instance-based versus model-
based learning)
• These criteria are not exclusive; you can combine them in any way you like
Supervised learning
• In supervised learning, the training set you feed to the algorithm
includes the desired solutions, called labels
Another typical task is to predict a target numeric value, such as the price of
a car, given a set of features (mileage, age, brand, etc.). This sort of task is
called regression.
Unsupervised learning
• In unsupervised learning, as you might guess, the training data is
unlabeled. The system tries to learn without a teacher.
• For example, say you have a lot of data about your blog’s visitors. You may want
to run a clustering algorithm to try to detect groups of similar visitors.
• Another important unsupervised task is anomaly detection—for
example, detecting unusual credit card transactions to prevent fraud,
catching manufacturing defects, or automatically removing outliers
from a dataset before feeding it to another learning algorithm. The
system is shown mostly normal instances during training, so it learns
to recognize them; then, when it sees a new instance, it can tell
whether it looks like a normal one or whether it is likely an anomaly
Semi-supervised learning
• Since labeling data is usually time-consuming and costly, you will
often have plenty of unlabeled instances, and few labeled instances.
Some algorithms can deal with data that’s partially labeled. This is
called semi-supervised learning.
Some photo-hosting services, such as
Google Photos, are good examples of
this
Self-supervised learning
• Another approach to machine learning involves actually
generating a fully labeled dataset from a fully unlabeled one.
Again, once the whole dataset is labeled, any supervised
learning algorithm can be used. This approach is called self-
supervised learning.
For example, if you have a large dataset of
unlabeled images, you can randomly mask a
small part of each image and then train a model
to recover the original image. During training,
the masked images are used as the inputs to the
model, and the original images are used as the
labels
Reinforcement learning
• The learning system, called
an agent in this context, can
observe the environment,
select and perform actions,
and get rewards in return (or
penalties in the form of
negative rewards). It must
then learn by itself what is
the best strategy, called a
policy, to get the most
reward over time. A policy
defines what action the
agent should choose when it
is in a given situation.
Batch Versus Online Learning
• Batch learning - In batch learning, the system is incapable of learning
incrementally: it must be trained using all the available data. This will
generally take a lot of time and computing resources, so it is typically
done offline. First the system is trained, and then it is launched into
production and runs without learning anymore; it just applies what it
has learned. This is called offline learning.
• Even a model trained to classify pictures of cats and dogs may need to be
retrained regularly, not because cats and dogs will mutate overnight, but because
cameras keep changing, along with image formats, sharpness, brightness, and
size ratios. Moreover, people may love different breeds next year, or they may
decide to dress their pets with tiny hats—who knows?
Online learning
• In online learning, you train the
system incrementally by feeding it
data instances sequentially, either
individually or in small groups
called minibatches. Each learning
step is fast and cheap, so the
system can learn about new data
on the fly
Instance-Based Versus Model-Based Learning
• Instance-based learning: the system learns the examples by heart,
then generalizes to new cases by using a similarity measure to
compare them to the learned examples (or a subset of them)
A (very basic) similarity
measure between two emails
could be to count the number
of words they have in
common. The system would
flag an email as spam if it has
many words in common with a
known spam email
• Another way to generalize from a set of examples is to build a model
of these examples and then use that model to make predictions. This
is called model-based learning.
Suppose you want to know if money makes people happy,
so you download the Better Life Index data from the OECD’s
website and World Bank stats about gross domestic product
(GDP) per capita. Then you join the tables and sort by GDP
per capita.
life_satisfaction = θ0 + θ1 × GDP_per_capita
This model has two model parameters, θ0 and θ1.
Before you can use your model, you need to define the
parameter values θ0 and θ1.
This model is just a linear function of the input feature GDP_per_capita. θ0 and θ1
are the model’s parameters.
• Variance This part is due to the model’s excessive sensitivity to small variations in
the training data.
• A model with many degrees of freedom (such as a high-degree polynomial model) is
likely to have high variance and thus overfit the training data.
• Irreducible error This part is due to the noisiness of the data itself. The only way
to reduce this part of the error is to clean up the data (e.g., fix the data sources,
such as broken sensors, or detect and remove outliers).
Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s
complexity increases its bias and reduces its variance
Regularized Linear Models
• A simple way to regularize a polynomial model is to reduce the number of
polynomial degrees. For a linear model, regularization is typically achieved by
constraining the weights of the model. We will now look at ridge regression, lasso
regression, and elastic net regression.
• Ridge Regression: This forces the learning algorithm to not only fit the data but also keep the model
weights as small as possible. Note that the regularization term should only be added to the cost
function during training.
• The hyperparameter α controls how much you want to regularize the model. If α = 0, then ridge
regression is just linear regression. If α is very large, then all weights end up very close to zero and
the result is a flat line going through the data’s mean.
If we define w as the vector of feature weights (θ1 to θn), then the regularization term is equal to α(∥w∥₂)² / m,
where ∥w∥₂ represents the ℓ2 norm of the weight vector.
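To make the regularization term concrete, here is a minimal sketch in plain Python. It assumes the training cost is mean squared error plus α(∥w∥₂)² / m and that the bias θ0 is excluded from the penalty; the function names and the tiny data set are hypothetical.

```python
def ridge_penalty(weights, alpha, m):
    """alpha * (l2 norm of the weights)^2 / m; the bias term is not included."""
    return alpha * sum(w * w for w in weights) / m

def ridge_cost(theta, X, y, alpha):
    """Mean squared error plus the ridge penalty (training-time cost only)."""
    bias, weights = theta[0], theta[1:]
    m = len(X)
    predictions = [bias + sum(w * x for w, x in zip(weights, row)) for row in X]
    mse = sum((p - t) ** 2 for p, t in zip(predictions, y)) / m
    return mse + ridge_penalty(weights, alpha, m)

# Hypothetical data that the line y = 2x fits exactly
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.0, 2.0]  # theta0 = 0 (bias), theta1 = 2 (weight)

cost_plain = ridge_cost(theta, X, y, alpha=0.0)  # equals plain MSE
cost_ridge = ridge_cost(theta, X, y, alpha=0.3)  # MSE plus 0.3 * 2^2 / 3
print(cost_plain, cost_ridge)
```

With α = 0 the cost reduces to ordinary linear regression; with α > 0 even a perfect fit is charged for the size of its weights.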
• Lasso Regression: Least absolute shrinkage and selection operator
regression (usually simply called lasso regression) is another
regularized version of linear regression: just like ridge regression, it
adds a regularization term to the cost function, but it uses the ℓ1
norm of the weight vector.
Ridge is a good default, but if you suspect that only a few features are useful, you should prefer lasso or elastic
net because they tend to reduce the useless features’ weights down to zero, as discussed earlier. In general,
elastic net is preferred over lasso because lasso may behave erratically when the number of features is greater
than the number of training instances or when several features are strongly correlated.
Early Stopping
• A very different way to regularize
iterative learning algorithms such as
gradient descent is to stop training
as soon as the validation error
reaches a minimum. This is called
early stopping.
• With early stopping you just stop
training as soon as the validation
error reaches the minimum. It is
such a simple and efficient
regularization technique that
Geoffrey Hinton called it a “beautiful
free lunch”
With stochastic and mini-batch gradient descent, the curves are not so smooth, and it may be hard to know whether you
have reached the minimum or not. One solution is to stop only after the validation error has been above the minimum for
some time then roll back the model parameters to the point where the validation error was at a minimum.
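The "wait, then roll back" idea can be sketched as a patience rule over a sequence of validation errors. This is an illustration under assumptions: `early_stopping_epoch` and `patience` are hypothetical names, and the error values are made up. A real training loop would also save a copy of the model parameters each time a new best epoch is found.

```python
def early_stopping_epoch(val_errors, patience):
    """Return (best_epoch, best_error), stopping once the validation error
    has failed to improve for `patience` consecutive epochs."""
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch  # new minimum: remember it
        elif epoch - best_epoch >= patience:
            break                                  # waited long enough: stop
    return best_epoch, best_error

# Hypothetical noisy validation curve, as with stochastic gradient descent
val_errors = [0.9, 0.7, 0.6, 0.62, 0.58, 0.61, 0.63, 0.64, 0.5]
best_epoch, best_error = early_stopping_epoch(val_errors, patience=3)
print(best_epoch, best_error)
```

Training stops at epoch 7 and rolls back to epoch 4, so the later value 0.5 is never reached. That is the trade-off the patience setting controls.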
Logistic regression is one of the most popular machine learning algorithms for binary
classification. This is because it is a simple algorithm that performs very well on a wide range
of problems.
Tutorial Dataset
In this tutorial we will use a contrived dataset.
This dataset has two input variables (X1 and X2) and one output variable (Y). The input
variables are real-valued random numbers drawn from a Gaussian distribution. The output
variable has two values, making the problem a binary classification problem.
X1             X2               Y
2.7810836      2.550537003      0
1.465489372    2.362125076      0
3.396561688    4.400293529      0
1.38807019     1.850220317      0
3.06407232     3.005305973      0
7.627531214    2.759262235      1
5.332441248    2.088626775      1
6.922596716    1.77106367       1
8.675418651    -0.2420686549    1
7.673756466    3.508563011      1
Below is a plot of the dataset. You can see that it is completely contrived and that we can
easily draw a line to separate the classes.
This is exactly what we are going to do with the logistic regression model.
Logistic Regression Tutorial Dataset
Logistic Function
Before we dive into logistic regression, let’s take a look at the logistic function, the heart of
the logistic regression technique.
transformed = 1 / (1 + e^-x)
Where e is the numerical constant Euler’s number and x is an input we plug into the function.
Let’s plug in a series of numbers from -5 to +5 and see how the logistic function transforms
them:
X     Transformed
-5    0.006692850924
-4    0.01798620996
-3    0.04742587318
-2    0.119202922
-1    0.2689414214
0     0.5
1     0.7310585786
2     0.880797078
3     0.9525741268
4     0.98201379
5     0.9933071491
You can see that all of the inputs have been transformed into the range [0, 1] and that the
smallest negative numbers resulted in values close to zero and the larger positive numbers
resulted in values close to one. You can also see that 0 transformed to 0.5 or the midpoint of
the new range.
From this we can see that as long as our mean value is zero, we can plug in positive and
negative values into the function and always get out a consistent transform into the new
range.
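The table of transformed values can be reproduced with a few lines of Python. `logistic` is a hypothetical helper name that implements 1 / (1 + e^-x).

```python
from math import exp

def logistic(x):
    """Logistic (sigmoid) function: maps any real x into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

# Transform the integers from -5 to +5
transformed = {x: logistic(x) for x in range(-5, 6)}

for x, value in transformed.items():
    print(x, value)
```

The function is symmetric around the midpoint: logistic(x) + logistic(-x) = 1, which is why 0 maps to exactly 0.5.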
Logistic Function
If the probability is > 0.5 we can take the output as a prediction for class 1,
otherwise the prediction is for class 0.
For this dataset, the logistic regression has three coefficients just like linear regression, for
example:
output = b0 + b1*x1 + b2*x2
The job of the learning algorithm will be to discover the best values for the coefficients (b0,
b1 and b2) based on the training data.
Unlike linear regression, the output is transformed into a probability using the logistic
function:
p(class=1) = 1 / (1 + e^(-output))
p(class=1) = 1 / (1 + EXP(-output))
This is a simple procedure that can be used by many algorithms in machine learning. It works
by using the model to calculate a prediction for each instance in the training set and
calculating the error for each prediction.
We can apply stochastic gradient descent to the problem of finding the coefficients for the
logistic regression model as follows:
The process is repeated until the model is accurate enough (e.g. error drops to some desirable
level) or for a fixed number of iterations. You continue to update the model for training
instances and correct errors until the model is accurate enough or cannot be made any
more accurate. It is often a good idea to randomize the order of the training instances shown
to the model to mix up the corrections made.
Updating the model after each training pattern is called online learning. It is also possible
to collect up all of the changes to the model over all training instances and make one large
update. This variation is called batch learning and might make a nice extension to this tutorial
if you’re feeling adventurous.
Calculate Prediction
Let’s start off by assigning 0.0 to each coefficient and calculating the probability of the first
training instance that belongs to class 0.
B0 = 0.0
B1 = 0.0
B2 = 0.0
Using the above equation we can plug in all of these numbers and calculate a prediction:
prediction = 0.5
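We can calculate the new coefficient values using a simple update equation:
b = b + alpha * (y - prediction) * prediction * (1 - prediction) * x
Where b is the coefficient we are updating, y is the class label of the training instance, and
prediction is the output of making a prediction using the model.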
Alpha is a parameter that you must specify at the beginning of the training run. This is the
learning rate and controls how much the coefficients (and therefore the model) change or
learn each time they are updated. Larger learning rates are used in online learning (when we
update the model for each training instance). Good values might be in the range 0.1 to 0.3.
Let’s use a value of 0.3.
You will notice that the last term in the equation is x. This is the input value for the
coefficient. You will also notice that B0 does not have an input. This coefficient is often called
the bias or the intercept and we can assume it always has an input value of 1.0. This
assumption can help when implementing the algorithm using vectors or arrays.
Let’s update the coefficients using the prediction (0.5) and coefficient values (0.0) from the
previous section.
or
b0 = -0.0375
b1 = -0.104290635
b2 = -0.09564513761
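The first update can be checked numerically. The sketch below assumes the update rule b = b + alpha * (y - prediction) * prediction * (1 - prediction) * x, which reproduces the three values listed above; the function names are hypothetical.

```python
from math import exp

def predict(row, coefficients):
    """Logistic of b0 + b1*x1 + b2*x2 for one training instance."""
    output = coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))
    return 1.0 / (1.0 + exp(-output))

def sgd_update(row, y, coefficients, alpha):
    """One stochastic gradient descent step for a single training instance."""
    prediction = predict(row, coefficients)
    step = alpha * (y - prediction) * prediction * (1.0 - prediction)
    inputs = [1.0] + list(row)  # the bias b0 uses a constant input of 1.0
    return [b + step * x for b, x in zip(coefficients, inputs)]

first_row, first_label = [2.7810836, 2.550537003], 0
start = [0.0, 0.0, 0.0]

first_prediction = predict(first_row, start)  # all-zero coefficients
updated = sgd_update(first_row, first_label, start, alpha=0.3)
print(first_prediction, updated)
```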
We can repeat this process and update the model for each training instance in the dataset.
A single iteration through the training dataset is called an epoch. It is common to repeat the
stochastic gradient descent procedure for a fixed number of epochs.
At the end of each epoch you can calculate error values for the model. Because this is a
classification problem, it would be nice to get an idea of how accurate the model is at each
iteration.
The graph below shows a plot of the accuracy of the model over 10 epochs.
Logistic Regression with Gradient Descent Accuracy versus Iteration
You can see that the model very quickly achieves 100% accuracy on the training dataset.
b0 = -0.4066054641
b1 = 0.8525733164
b2 = -1.104746259
Make Predictions
Now that we have trained the model, we can use it to make predictions.
We can make predictions on the training dataset, but this could just as easily be new data.
Using the coefficients above learned after 10 epochs, we can calculate output values for each
training instance:
0.2987569857
0.145951056
0.08533326531
0.2197373144
0.2470590002
0.9547021348
0.8620341908
0.9717729051
0.9992954521
0.905489323
These are the probabilities of each instance belonging to class=1. We can convert these into
crisp class values using a threshold of 0.5: the prediction is 0 if the output is below 0.5 and 1 otherwise.
With this simple procedure we can convert all of the outputs to class values:
0
0
0
0
0
1
1
1
1
1
Finally, we can calculate the accuracy for the model on the training dataset:
accuracy = 100%
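The prediction and accuracy steps can be sketched end to end using the dataset and the coefficients reported after 10 epochs. The 0.5 threshold (below 0.5 maps to class 0, otherwise class 1) is an assumption consistent with the class values listed above; the names are illustrative.

```python
from math import exp

# (X1, X2, Y) rows of the tutorial dataset
dataset = [
    (2.7810836, 2.550537003, 0), (1.465489372, 2.362125076, 0),
    (3.396561688, 4.400293529, 0), (1.38807019, 1.850220317, 0),
    (3.06407232, 3.005305973, 0), (7.627531214, 2.759262235, 1),
    (5.332441248, 2.088626775, 1), (6.922596716, 1.77106367, 1),
    (8.675418651, -0.2420686549, 1), (7.673756466, 3.508563011, 1),
]
b0, b1, b2 = -0.4066054641, 0.8525733164, -1.104746259  # after 10 epochs

def predict_probability(x1, x2):
    """Logistic of the linear combination b0 + b1*x1 + b2*x2."""
    return 1.0 / (1.0 + exp(-(b0 + b1 * x1 + b2 * x2)))

probabilities = [predict_probability(x1, x2) for x1, x2, _ in dataset]
predicted = [0 if p < 0.5 else 1 for p in probabilities]  # crisp class values
labels = [y for _, _, y in dataset]
accuracy = 100.0 * sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(predicted, accuracy)
```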
Summary
In this post you discovered how you can implement logistic regression from scratch, step-by-
step. You learned:
Features: The attributes used to make the digit decision
Pixels: (6,8)=ON
Shape Patterns: NumComponents, AspectRatio, NumLoops
??
…
Other Classification Tasks
Classification: given inputs x, predict labels (classes) y
Examples:
Spam detection (input: document,
classes: spam / ham)
OCR (input: images, classes: characters)
Medical diagnosis (input: symptoms,
classes: diseases)
Automatic essay grading (input: document,
classes: grades)
Fraud detection (input: account activity,
classes: fraud / no fraud)
Customer service email routing
… many more
Challenges
What structure should the BN have?
How should we learn its parameters?
Naïve Bayes for Digits
Naïve Bayes: Assume all features are independent effects of the label
|Y| parameters
F1 F2 Fn
+
Step 2: sum to get probability of evidence
P(spam | w) = 98.9%
Training and Testing
Important Concepts
Data: labeled instances, e.g. emails marked spam/ham
Training set
Held out set
Test set
Features: attribute‐value pairs which characterize each x
Experimentation cycle
Learn parameters (e.g. model probabilities) on training set
(Tune hyperparameters on held‐out set)
Compute accuracy of test set
Very important: never “peek” at the test set!
Evaluation
Accuracy: fraction of instances predicted correctly
Overfitting and generalization
Want a classifier which does well on test data
Overfitting: fitting the training data very closely, but not generalizing well
We’ll investigate overfitting and generalization formally in a few lectures
[Figure: data split into Training Data, Held-Out Data, and Test Data]
Generalization and Overfitting
Overfitting
[Figure: degree 15 polynomial fit; x axis from 0 to 20, y axis from -15 to 30]
Example: Overfitting
2 wins!!
Example: Overfitting
Posteriors determined by relative probabilities (odds ratios):
As an extreme case, imagine using the entire email as the only feature
Would get the training data perfect (if deterministic labeling)
Wouldn’t generalize at all
Just making the bag‐of‐words assumption gives us some generalization, but isn’t enough
Another option is to consider the most likely parameter value given the data
????
Unseen Events
Laplace Smoothing
Laplace’s estimate:
Pretend you saw every outcome once more than you actually did
[Figure: example observations r, r, b]
What is the subject of this article?
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language Identification
• Sentiment analysis
• …
Dan Jurafsky
Text Classification: Definition
• Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
• spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
• If rules carefully refined by expert
• But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
• Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}
• A training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
• a learned classifier γ: d → c
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines
• k-Nearest Neighbors
• …
Naïve Bayes Intuition
• Simple (“naïve”) classification method based on Bayes rule
• Relies on very simple representation of document
• Bag of words
The bag of words representation
I love this movie! It's sweet,
but with satirical humor. The
dialogue is great and the
adventure scenes are fun… It
The bag of words representation
γ( great 2, love 2, recommend 1, laugh 1, happy 1, ... ) = c
Bag of words for document classification
Test document (class ?): parser, language, label, translation, …
Machine Learning: learning, training, algorithm, shrinkage, network...
NLP: parser, tag, training, translation, language...
Garbage Collection: garbage, collection, memory, optimization, region...
Planning: planning, temporal, reasoning, plan, language...
GUI: ...
Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c
P(c | d) = P(d | c) P(c) / P(d)
Naïve Bayes Classifier (I)
cMAP = argmax_{c∈C} P(c | d)                 (MAP is “maximum a posteriori” = most likely class)
     = argmax_{c∈C} P(d | c) P(c) / P(d)     (Bayes Rule)
     = argmax_{c∈C} P(d | c) P(c)            (Dropping the denominator)
Naïve Bayes Classifier (II)
Naïve Bayes Classifier (IV)
Could only be estimated if a very, very large number of training examples was available.
We can just count the relative frequencies in a corpus
Multinomial Naïve Bayes Independence Assumptions
P(x1, x2, …, xn | c)
• Bag of Words assumption: Assume position doesn’t matter
• Conditional Independence: Assume the feature probabilities P(xi | cj) are independent given the class c.
P(x1, …, xn | c) = P(x1 | c) • P(x2 | c) • P(x3 | c) • ... • P(xn | c)
Multinomial Naïve Bayes Classifier
Applying Multinomial Naive Bayes Classifiers to Text Classification
positions ← all word positions in test document
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
• simply use the frequencies in the data
P̂(cj) = doccount(C = cj) / Ndoc
P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
Parameter estimation
P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
= fraction of times word wi appears among all words in documents of topic cj
• Create mega-document for topic j by concatenating all docs in this topic
• Use frequency of w in mega-document
Sec. 13.3
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic
and classified in the topic positive (thumbs-up)?
P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0
• Zero probabilities cannot be conditioned away, no matter
the other evidence!
Laplace (add-1) smoothing for Naïve Bayes
P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)
          = (count(wi, c) + 1) / ((Σ_{w∈V} count(w, c)) + |V|)
Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
• For each cj in C do
docsj ← all docs with class = cj
P(cj) ← |docsj| / |total # documents|
• Calculate P(wk | cj) terms
• Textj ← single doc containing all docsj
• For each word wk in Vocabulary
nk ← # of occurrences of wk in Textj
P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
Naïve Bayes and Language Modeling
• Naïve bayes classifiers can use any sort of feature
• URL, email address, dictionaries, network features
• But if, as in the previous slides
• We use only word features
• we use all of the words in the text (not a subset)
• Then
• Naïve bayes has an important similarity to language
modeling.
Sec. 13.2.1
Each class = a unigram language model
• Assigning each word: P(word | c)
• Assigning each sentence: P(s | c) = Π P(word | c)
Class pos
Word    P(word | pos)
I       0.1
love    0.1
this    0.01
fun     0.05
film    0.1
Sentence s: I love this fun film
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
Sec. 13.2.1
Naïve Bayes as a Language Model
• Which class assigns the higher probability to s?
Sentence s: I love this fun film
Word    Model pos    Model neg
I       0.1          0.2
love    0.1          0.001
this    0.01         0.01
fun     0.05         0.005
film    0.1          0.1
Conditional Probabilities:
P(Chinese | c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo | c) = (0+1) / (8+6) = 1/14
P(Japan | c) = (0+1) / (8+6) = 1/14
P(Chinese | j) = (1+1) / (3+6) = 2/9
P(Tokyo | j) = (1+1) / (3+6) = 2/9
P(Japan | j) = (1+1) / (3+6) = 2/9
P(c | d5) ∝ 3/4 × (3/7)^3 × 1/14 × 1/14 ≈ 0.0003
P(j | d5) ∝ 1/4 × (2/9)^3 × 2/9 × 2/9 ≈ 0.0001
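The add-1 smoothed computation for classes c and j can be sketched with exact fractions. The counts (5, 0, 0 out of 8 words for class c; 1, 1, 1 out of 3 words for class j), the vocabulary size of 6, and the priors 3/4 and 1/4 are read off the worked example above. The content of test document d5 (Chinese three times, then Tokyo and Japan) is inferred from the exponents in that example and should be treated as an assumption.

```python
from fractions import Fraction

VOCABULARY_SIZE = 6
priors = {"c": Fraction(3, 4), "j": Fraction(1, 4)}
word_counts = {
    "c": {"Chinese": 5, "Tokyo": 0, "Japan": 0},
    "j": {"Chinese": 1, "Tokyo": 1, "Japan": 1},
}
total_words = {"c": 8, "j": 3}

def smoothed_likelihood(word, label):
    """Laplace (add-1) estimate of P(word | label)."""
    return Fraction(word_counts[label][word] + 1,
                    total_words[label] + VOCABULARY_SIZE)

# Assumed test document d5, inferred from the exponents in the example
test_document = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

def score(label):
    """Prior times the product of smoothed word likelihoods (unnormalized)."""
    result = priors[label]
    for word in test_document:
        result *= smoothed_likelihood(word, label)
    return result

scores = {label: score(label) for label in priors}
best_class = max(scores, key=scores.get)
print({k: float(v) for k, v in scores.items()}, best_class)
```

Without the +1 terms, P(Tokyo | c) would be zero and class c could never be chosen for this document, which is the problem Laplace smoothing is meant to avoid.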
Machine Learning
Classification Methods
Bayesian Classification, Nearest
Neighbor, Ensemble Methods
Bayesian Classification: Why?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
Choosing Hypotheses
⚫ Maximum Likelihood
hypothesis:
hML = arg max_{h∈H} P(d | h)
P(C | A1 A2 … An) = P(A1 A2 … An | C) P(C) / P(A1 A2 … An)
cNaive Bayes = arg max_c P(c) P(x | c) = arg max_c P(c) ∏_i P(ai | c)
How to Estimate Probabilities from Data?
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Evade (class)
Tid   Refund   Marital Status   Taxable Income   Evade
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
⚫ Class: P(C) = Nc/N
⚫ e.g., P(No) = 7/10, P(Yes) = 3/10
⚫ For discrete attributes: P(Ai | Ck) = |Aik| / Nck
⚫ where |Aik| is the number of instances having attribute Ai and belonging to class Ck
⚫ Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes) = 0
How to Estimate Probabilities
from Data?
⚫ For continuous attributes:
⚫ Discretize the range into bins
⚫ one ordinal attribute per bin
⚫ violates independence assumption
⚫ Two-way split: (A < v) or (A > v)
⚫ choose only one of the two splits as new attribute
⚫ Probability density estimation:
⚫ Assume attribute follows a normal distribution
⚫ Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
⚫ Once probability distribution is known, can use it to
estimate the conditional probability P(Ai|c)
How to Estimate Probabilities from Data?
⚫ Normal distribution:
$$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$
⚫ One for each (Ai, cj) pair
⚫ For (Income, Class=No), using the same training table as before:
⚫ If Class=No
⚫ sample mean = 110
$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(120 - 110)^2}{2\sigma^2}\right)$$
where σ² is the sample variance of Income for Class=No.
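As a sketch of the density estimate for a continuous attribute, the snippet below fits a normal distribution to the Taxable Income values of the Class=No rows and evaluates it at Income = 120. The sample variance is computed from the table rather than quoted from the slide, and the name `gaussian_pdf` is illustrative.

```python
import math
import statistics

# Taxable Income (in K) of the seven training rows with Evade = No.
income_no = [125, 100, 70, 120, 60, 220, 75]

def gaussian_pdf(x, mean, variance):
    """Normal density with the given mean and variance, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

mean_no = statistics.mean(income_no)
var_no = statistics.variance(income_no)  # sample variance (divides by n - 1)
density = gaussian_pdf(120, mean_no, var_no)
print(mean_no, var_no, round(density, 4))
```

The result is a density, not a probability, but it can be used in place of P(Ai | c) when comparing classes because the same interval width would multiply every class score.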
First step: Compute P(C) The prior probability of each class can be
computed based on the training tuples:
P(buys_computer=yes)=9/14=0.643
P(buys_computer=no)=5/14=0.357
Naïve Bayesian Classifier:
An Example
$$\text{Original: } P(A_i \mid C) = \frac{N_{ic}}{N_c}$$
$$\text{Laplace: } P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$$
$$\text{m-estimate: } P(A_i \mid C) = \frac{N_{ic} + mp}{N_c + m}$$
c: number of classes
p: prior probability
m: parameter
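The three estimators can be compared on the zero-count case P(Refund=Yes | Evade=Yes), where N_ic = 0 and N_c = 3. This is a minimal sketch; the function names and the choices c = 2, m = 3 and p = 1/3 are illustrative assumptions. Note that many treatments take c in the Laplace formula to be the number of distinct values of the attribute, so that the smoothed estimates over all values sum to one.

```python
def original_estimate(n_ic, n_c):
    """Unsmoothed estimate: N_ic / N_c."""
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    """Laplace estimate: (N_ic + 1) / (N_c + c)."""
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    """m-estimate: (N_ic + m * p) / (N_c + m)."""
    return (n_ic + m * p) / (n_c + m)

n_ic, n_c = 0, 3  # Refund = Yes among the three Evade = Yes rows
print(original_estimate(n_ic, n_c))
print(laplace_estimate(n_ic, n_c, c=2))
print(m_estimate(n_ic, n_c, m=3, p=1 / 3))
```

Both smoothed estimates are non-zero, so one unseen attribute value no longer forces the whole class score to zero.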
Naïve Bayes (Summary)
⚫ Advantage
⚫ Robust to isolated noise points
⚫ Handle missing values by ignoring the instance during probability
estimate calculations
⚫ Robust to irrelevant attributes
⚫ Disadvantage
⚫ Assumption: class conditional independence, which may cause loss
of accuracy
⚫ Independence assumption may not hold for some attribute.
Practically, dependencies exist among variables
⚫ Use other techniques such as Bayesian Belief Networks (BBN)
Remember
•The network encodes our assumptions: the callers do not directly perceive the burglary, do not notice the minor earthquake, and do not confer before calling.
•The conditional distribution for each node is given as a conditional probability table, or CPT.
•Each row in the CPT must sum to 1 because the entries in a row represent an exhaustive set of cases for the variable.
DECISION TREE
An instance is classified by starting at the root node of the tree, testing the attribute specified
by this node, then moving down the tree branch corresponding to the value of the attribute in
the given example. This process is then repeated for the subtree rooted at the new node.
Sandeep Chaurasia, SPSU
APPROPRIATE PROBLEMS FOR
DECISION TREE LEARNING
• Instances are represented by attribute-value
pairs.
• The target function has discrete output values.
• Disjunctive descriptions may be required.
• The training data may contain errors.
• The training data may contain missing
attribute values.
Values(A) is the set of all possible values for attribute A, and Sv is the
subset of S for which attribute A has value v.
An algorithm is transparent only if people can read and understand its decisions clearly.
Deep learning is the superstar of machine learning nowadays, but it is opaque: we do not
know the reason behind a decision. Decision tree algorithms remain popular because they
produce transparent decisions. ID3 uses information gain and C4.5 uses gain ratio for
splitting. CART is an alternative decision tree building algorithm. It can handle both
classification and regression tasks, and it uses a metric named the gini index to create
decision points for classification tasks. We will work through a CART decision tree
example step by step, by hand.
We will use the same dataset as in the ID3 example. There are 14 instances of golf playing
decisions based on the outlook, temperature, humidity and wind factors.
Gini index
The gini index is the metric CART uses for classification tasks. It is one minus the sum of
the squared probabilities of each class: Gini = 1 – Σ (Pi)^2 for i = 1 to the number of classes.
Outlook
Outlook is a nominal feature. It can be sunny, overcast or rain. I will summarize the final
decisions for outlook feature.
Outlook   Yes  No  Number of instances
Sunny     2    3   5
Overcast  4    0   4
Rain      3    2   5
Then, we will calculate weighted sum of gini indexes for outlook feature.
Temperature
Similarly, temperature is a nominal feature and it could have 3 different values: Cool, Hot
and Mild. Let’s summarize decisions for temperature feature.
Temperature  Yes  No  Number of instances
Hot          2    2   4
Cool         3    1   4
Mild         4    2   6
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439
Humidity
Humidity  Yes  No  Number of instances
High      3    4   7
Normal    6    1   7
Wind
Wind    Yes  No  Number of instances
Weak    6    2   8
Strong  3    3   6
Time to decide
We’ve calculated gini index values for each feature. The winner will be outlook feature
because its cost is the lowest.
Feature      Gini index
Outlook      0.342
Temperature  0.439
Humidity     0.367
Wind         0.428
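The comparison above can be recomputed from the Yes/No counts in the four summary tables. The sketch below is illustrative (the helper names `gini` and `weighted_gini` are not from the source) and rounds rather than truncates, so the third decimal can differ slightly from the table.

```python
def gini(counts):
    """Gini index of one group: 1 minus the sum of squared class shares."""
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

def weighted_gini(groups):
    """Weighted sum of the group gini indexes, weighted by group size."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

# (Yes, No) counts per feature value, copied from the summary tables.
features = {
    "Outlook": [(2, 3), (4, 0), (3, 2)],      # Sunny, Overcast, Rain
    "Temperature": [(2, 2), (3, 1), (4, 2)],  # Hot, Cool, Mild
    "Humidity": [(3, 4), (6, 1)],             # High, Normal
    "Wind": [(6, 2), (3, 3)],                 # Weak, Strong
}
scores = {name: weighted_gini(groups) for name, groups in features.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("split on:", best)
```

The feature with the lowest weighted gini index becomes the decision node, which is why outlook is placed at the root.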
You might realize that sub dataset in the overcast leaf has only yes decisions. This means that
overcast leaf is over.
We will apply same principles to those sub datasets in the following steps.
Focus on the sub dataset for sunny outlook. We need to find the gini index scores for
temperature, humidity and wind features respectively.
Temperature  Yes  No  Number of instances
Hot          0    2   2
Cool         1    0   1
Mild         1    1   2

Humidity  Yes  No  Number of instances
High      0    3   3
Normal    2    0   2

Wind    Yes  No  Number of instances
Weak    1    2   3
Strong  1    1   2
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)^2 – (2/3)^2 = 0.444
We’ve calculated gini index scores for feature when outlook is sunny. The winner is humidity
because it has the lowest value.
Feature      Gini index
Temperature  0.2
Humidity     0
Wind         0.466
As seen, decision is always no for high humidity and sunny outlook. On the other hand,
decision will always be yes for normal humidity and sunny outlook. This branch is over.
Decisions for high and normal humidity
Rain outlook
We’ll calculate gini index scores for temperature, humidity and wind features when outlook
is rain.
Temperature  Yes  No  Number of instances
Cool         1    1   2
Mild         2    1   3

Humidity  Yes  No  Number of instances
High      1    1   2
Normal    2    1   3

Wind    Yes  No  Number of instances
Weak    3    0   3
Strong  0    2   2
The winner is wind feature for rain outlook because it has the minimum gini index score in
features.
Feature      Gini index
Temperature  0.466
Humidity     0.466
Wind         0
Put the wind feature for rain outlook branch and monitor the new sub data sets.
Sub data sets for weak and strong wind and rain outlook
As seen, decision is always yes when wind is weak. On the other hand, decision is always no
if wind is strong. This means that this branch is over.
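Putting the three branches together, the finished tree can be written as a small function. This is only a sketch of the tree derived above; the function name `play_golf` and the string labels are illustrative. Temperature never appears because no branch ended up splitting on it.

```python
def play_golf(outlook, humidity, wind):
    """Decision tree built above: outlook at the root, then humidity or wind."""
    if outlook == "Overcast":
        return "Yes"  # the overcast leaf contains only yes decisions
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook}")

print(play_golf("Sunny", "High", "Weak"))
print(play_golf("Rain", "High", "Weak"))
```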
Spring 2023
Estimation:
f(w,b) = sign(w. x + b)
w: weight vector
x: data vector
(Figure: data points marked +1 and -1; the input x passes through f to give the estimate yest. A plane separates the different classes.)
How would you classify this data?
LINEAR CLASSIFIERS
f(w,b) = sign(w. x + b)
(Figure: the same +1 and -1 data points with several candidate separating lines.)
Any of these would be fine, but which is best?
CLASSIFIER MARGIN
f(w,b) = sign(w. x + b)
Define the margin of a linear classifier as the width that the boundary could be
increased by before hitting a datapoint.
MAXIMUM MARGIN
f(w,b) = sign(w. x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those data points that the margin pushes up against.
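The decision rule f(w,b) = sign(w. x + b) is short enough to sketch directly. The weight vector, bias and test points below are made-up values for illustration, and treating a score of exactly zero as +1 is an assumption, since the slides do not say how ties are handled.

```python
def predict(w, b, x):
    """Linear classifier: +1 if w . x + b is non-negative, otherwise -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical separating line x1 + x2 - 3 = 0.
w, b = (1.0, 1.0), -3.0
print(predict(w, b, (2.0, 2.0)))  # lies on the positive side
print(predict(w, b, (0.0, 1.0)))  # lies on the negative side
```

Many different (w, b) pairs classify a separable training set correctly; the maximum margin criterion is what picks one of them.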
HYPERPLANE : NUMERICAL-1
We would like to discover a simple SVM that accurately
discriminates the two classes. Since the data is linearly separable,
we can use a linear SVM (that is, one whose mapping function Φ()
is the identity function). By inspection, it should be obvious that
there are three support vectors (see Figure 2):
HYPERPLANE : NUMERICAL-2
Our goal, again, is to discover a separating hyperplane that
accurately discriminates the two classes. Of course, it is
obvious that no such hyperplane exists in the input space (that
is, in the space in which the original input data live). Therefore,
we must use a nonlinear SVM (that is, one whose mapping
function is a nonlinear mapping from input space into some
feature space).
Define
We again use vectors augmented with a 1 as a bias input and will
differentiate them as before. Now, given the [augmented] support
vectors, we must again find values for the αi
WHY MAXIMUM MARGIN?
(Figure: the maximum margin linear classifier f(w,b) = sign(w. x + b) on the +1 and -1 data points, with its support vectors marked.)
How to calculate the distance from a point to a line?
wx + b = 0
X – Vector
W – Normal Vector
b – Scalar Value (bias)
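One standard answer to the question above is that the distance from a point x to the hyperplane wx + b = 0 is |w. x + b| / ||w||. The sketch below applies that formula and then takes the smallest distance over a data set as the margin of a given boundary. The function names and the numbers are illustrative assumptions, not values from the slides.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the hyperplane w . x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))

def margin(w, b, points):
    """Distance from the boundary to the closest data point."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Hypothetical boundary 3*x1 + 4*x2 - 5 = 0 and two sample points.
w, b = (3.0, 4.0), -5.0
points = [(3.0, 4.0), (0.0, 0.0)]
print(distance_to_hyperplane(w, b, points[0]))
print(margin(w, b, points))
```

Dividing by ||w|| is what makes the distance independent of how w and b are scaled.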
ESTIMATE THE MARGIN
wx + b = 0
X – Vector
W – Normal Vector
b – Scalar Value
(Figure: the boundary wx + b = 0 with normal vector W between Class 1 and Class 2; m marks the margin.)
FINDING THE DECISION BOUNDARY
Let {x1, ..., xn} be our data set and let yi ∈ {1,-1} be the
class label of xi
The decision boundary should classify all points
correctly
To see this:
NEXT STEP… OPTIONAL
DERIVATION
SVM : EXAMPLE
THE DUAL PROBLEM
The new objective function is in terms of αi only
It is known as the dual problem: if we know w, we know
all αi; if we know all αi, we know w
The original problem is known as the primal problem
w can be recovered by
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$
CHARACTERISTICS OF THE SOLUTION
Many of the αi are zero (see next page for example)
w is a linear combination of a small number of data points
This “sparse” representation can be viewed as data
compression, as in the construction of a KNN classifier
xi with non-zero αi are called support vectors (SV)
The decision boundary is determined only by the SV
Let tj (j=1, ..., s) be the indices of the s support vectors. We
can write
$$w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}$$
For testing with a new data z
Compute
$$w^T z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} \left(x_{t_j}^T z\right) + b$$
and classify z as class 1 if the sum is positive, and class 2
otherwise
Note: w need not be formed explicitly
SUPPORT VECTOR: GEOMETRICAL INTERPRETATION
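(Figure: ten training points from Class 1 and Class 2 with their multipliers. Only three are non-zero: α1 = 0.8, α6 = 1.4, α8 = 0.6. The rest are zero: α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0.)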
LINEAR SVM: EXAMPLE
We would like to discover a simple SVM that accurately
discriminates the two classes. Since the data is linearly
separable, we can use a linear SVM.
By inspection, it is obvious that there are three support
vectors.
SUPPORT VECTOR ARCHITECTURE
54
EXAMPLE CONTINUES…
55
ALLOWING ERRORS IN OUR SOLUTIONS
We allow “error” ξi in classification; it is based on
the output of the discriminant function wTx+b
ξi approximates the number of misclassified
samples
SOFT MARGIN HYPERPLANE
If we minimize Σi ξi, ξi can be computed by
We want to minimize
C : tradeoff parameter between error and margin
The optimization problem becomes
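The formulas on this slide did not survive extraction, so the sketch below uses the standard soft-margin formulation as an assumption: the slack of a sample is ξi = max(0, 1 - yi (w. xi + b)), and the quantity to minimize is ½||w||² + C Σ ξi. The function names and the toy data are illustrative.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slack(w, b, x, y):
    """Slack of one sample: zero when it lies on the correct side of its margin."""
    return max(0.0, 1 - y * (dot(w, x) + b))

def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * (sum of slacks); C trades margin width against error."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in data)

# Toy data: three samples labeled +1, one of them on the wrong side of the boundary.
w, b = (1.0, 0.0), 0.0
data = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-1.0, 0.0), 1)]
print([slack(w, b, x, y) for x, y in data])
print(soft_margin_objective(w, b, data, C=100))
```

A slack between 0 and 1 means the sample is inside the margin but still correctly classified; a slack above 1 means it is misclassified, which is why the sum of slacks bounds the number of errors.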
EXTENSION TO NON-LINEAR DECISION BOUNDARY
definite
This also means that optimization problem can be
solved in polynomial time!
62
EXAMPLES OF KERNEL FUNCTIONS
Φ: x → φ(x)
64
EXAMPLE
Suppose we have 5 one-dimensional data points
x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4,
5 as class 2, so y1=1, y2=1, y3=-1, y4=-1, y5=1
We use the polynomial kernel of degree 2
K(x,y) = (xy+1)^2
C is set to 100
We first find αi (i=1, …, 5) by
By using a Quadratic Programming (QP) solver, we get
α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
Note that the constraints are indeed satisfied
The support vectors are {x2=2, x4=5, x5=6}
The discriminant function is
$$f(z) = 2.5(1)(2z+1)^2 + 7.333(-1)(5z+1)^2 + 4.833(1)(6z+1)^2 + b = 0.6667z^2 - 5.333z + b$$
b is recovered by solving f(2) = 1 (or f(5) = -1, or f(6) = 1), since x2, x4 and x5 are support vectors; all three give b = 9
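The example can be checked numerically without ever forming w. The sketch below plugs the multipliers quoted above into the kernel sum, recovers b from the support vector x2 = 2 (assuming, as is standard, that a support vector with 0 < αi < C satisfies yi f(xi) = 1), and classifies the five training points. The helper names are illustrative, and because the multipliers are rounded to three decimals the recovered b is only approximately 9.

```python
xs = [1, 2, 4, 5, 6]
ys = [1, 1, -1, -1, 1]
alphas = [0.0, 2.5, 0.0, 7.333, 4.833]  # values quoted from the QP solver

def kernel(x, z):
    """Polynomial kernel of degree 2: K(x, z) = (x*z + 1)^2."""
    return (x * z + 1) ** 2

def kernel_sum(z):
    """Sum over support vectors of alpha_i * y_i * K(x_i, z)."""
    return sum(a * y * kernel(x, z) for a, y, x in zip(alphas, ys, xs) if a > 0)

# x2 = 2 is a support vector with y2 = +1, so f(2) = 1 fixes the bias.
b = 1 - kernel_sum(2)

def classify(z):
    return 1 if kernel_sum(z) + b > 0 else -1

print(round(b, 2))
print([classify(x) for x in xs])
```

Only the three support vectors contribute to the sum, which is the sparsity property described earlier.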
DEGREE OF POLYNOMIAL FEATURES
(Figure panels: polynomial features of degree 4, 5 and 6, labeled X^4, X^5 and X^6.)
CHOOSING THE KERNEL FUNCTION
Probably the most tricky part of using SVM.
69
SUMMARY: STEPS FOR CLASSIFICATION
Prepare the pattern matrix
Select the kernel function to use
DISTANCE AND MARGIN
Given P1=(x1,y1), P2=(x2,y2) and P3=(x3,y3), any point (x, y) on the line through P1 and P2 can be written as
x = x1 + u (x2 - x1)
y = y1 + u (y2 - y1)
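Assuming P3 is the point whose distance to the line through P1 and P2 is wanted (the slide's figure is not available), the parametric form gives the answer directly: choose u so that the vector from the line point to P3 is perpendicular to the line, then measure the remaining gap. The function names below are illustrative.

```python
import math

def closest_point_on_line(p1, p2, p3):
    """Point on the line through p1 and p2 that is closest to p3."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    dx, dy = x2 - x1, y2 - y1
    # u is the projection of (p3 - p1) onto (p2 - p1), scaled by the squared length.
    u = ((x3 - x1) * dx + (y3 - y1) * dy) / (dx * dx + dy * dy)
    return (x1 + u * dx, y1 + u * dy)

def distance_point_to_line(p1, p2, p3):
    x, y = closest_point_on_line(p1, p2, p3)
    return math.hypot(p3[0] - x, p3[1] - y)

print(closest_point_on_line((0, 0), (4, 0), (1, 3)))
print(distance_point_to_line((0, 0), (4, 0), (1, 3)))
```

P1 and P2 must be distinct, otherwise the denominator is zero and no line is defined.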
ARTIFICIAL NEURAL
NETWORKS: AN
INTRODUCTION
Massive parallelism
Learning ability
Generalization ability
Adaptivity
Fault tolerance
The figure shows a simple artificial neural net with two input neurons
(X1, X2) and one output neuron (Y). The interconnection weights are
given by W1 and W2.
1. A set of links, describing the neuron inputs, with weights W1, W2,
…, Wm.
x1-x2= 1
x1
HIDDEN LAYER: the number of nodes in the hidden layer and the
number of hidden layers depends on implementation.
(Figure: neuron j with inputs x0, x1, ..., xn, where x0 is the bias input; weights w0j, w1j, ..., wnj; activation function f; output y.)
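A single neuron of this kind can be sketched as a weighted sum followed by an activation function. In the sketch below the bias is passed separately, which is equivalent to a fixed input x0 = 1 with weight w0j; the function names and numbers are illustrative, not taken from the slides.

```python
def neuron_output(inputs, weights, bias, activation):
    """Net input is bias + sum(x_i * w_i); the activation maps it to the output y."""
    net = bias + sum(x * w for x, w in zip(inputs, weights))
    return activation(net)

def identity(net):
    return net

def binary_step(net, theta=0.0):
    return 1 if net >= theta else 0

inputs, weights, bias = (0.5, -1.0), (0.4, 0.2), 0.1
print(neuron_output(inputs, weights, bias, identity))
print(neuron_output(inputs, weights, bias, binary_step))
```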
All changes in the weights (wij) in the previous epoch are below some
threshold, or
Activation Function
Activation functions:
(A) Identity
(F) Ramp
All but one neuron are excitatory (tend to increase voltage of other
cells)
2.1 Fundamental Concept
Neural networks are those information processing systems which are constructed and implemented to model
the human brain. The main objective of neural network research is to develop a computational device
for modeling the brain to perform various computational tasks at a faster rate than the traditional systems.
Artificial neural networks perform various tasks such as pattern matching and classification, optimization,
function approximation, vector quantization and data clustering. These tasks are very difficult for traditional
computers, which are faster in algorithmic computational tasks and precise arithmetic operations. Therefore,
for implementation of artificial neural networks, high-speed digital computers are used, which makes the
simulation of neural processes feasible.
As already stated in Chapter 1, an artificial neural network (ANN) is an efficient information processing
system which resembles a biological neural network in its characteristics. ANNs possess a large number of
highly interconnected processing elements called nodes or units or neurons, which usually operate in parallel
and are configured in regular architectures. Each neuron is connected with the others by a connection link. Each
connection link is associated with weights which contain information about the input signal. This information
is used by the neuron net to solve a particular problem. ANNs' collective behavior is characterized by their
ability to learn, recall and generalize training patterns or data similar to that of a human brain. They have the
capability to model networks of original neurons as found in the brain. Thus, the ANN processing elements
are called neurons or artificial neurons.
Figure 2-1 Architecture of a simple artificial neuron net.
Figure 2-5 Mathematical model of artificial neuron.

Table 2-1 Terminology relationships between biological and artificial neurons
Biological neuron  Artificial neuron
Cell               Neuron
Dendrites          Weights or interconnections
Soma               Net input
Axon               Output

Figure 2-5 shows a mathematical representation of the above-discussed chemical processing taking place
in an artificial neuron.
In this model, the net input is elucidated as

2. Processing: Basically, the biological neuron can perform massive parallel operations simultaneously. The
artificial neuron can also perform several parallel operations simultaneously, but, in general, the artificial
neuron network process is faster than that of the brain.
3. Size and complexity: The total number of neurons in the brain is about 10^11 and the total number of
interconnections is about 10^15. Hence, it can be noted that the complexity of the brain is comparatively
higher, i.e. the computational work takes place not only in the brain cell body, but also in axon, synapse,
etc. On the other hand, the size and complexity of an ANN is based on the chosen application and
the network designer. The size and complexity of a biological neuron is more than that of an artificial
neuron.
4. Storage capacity (memory): The biological neuron stores the information in its interconnections or in
synapse strength, but in an artificial neuron it is stored in its contiguous memory locations. In an artificial
neuron, the continuous loading of new information may sometimes overload the memory locations. As a
result, some of the addresses containing older memory locations may be destroyed. But in case of the brain,
new information can be added in the interconnections by adjusting the strength without destroying the
older information. A disadvantage related to the brain is that sometimes its memory may fail to recollect the
stored information, whereas in an artificial neuron, once the information is stored in its memory locations,
it can be retrieved. Owing to these facts, the adaptability is more toward an artificial neuron.
5. Tolerance: The biological neuron possesses fault tolerant capability whereas the artificial neuron has no
fault tolerance. The distributed nature of the biological neurons enables them to store and retrieve information
even when the interconnections in them get disconnected. Thus biological neurons are fault tolerant. But in
case of artificial neurons, the information gets corrupted if the network interconnections are disconnected.
Biological neurons can accept redundancies, which is not possible in artificial neurons. Even when some
cells die, the human nervous system appears to be performing with the same efficiency.
6. Control mechanism: In an artificial neuron modeled using a computer, there is a control unit present in
the Central Processing Unit, which can transfer and control precise scalar values from unit to unit, but there
is no such control unit for monitoring in the brain. The strength of a neuron in the brain depends on the
active chemicals present and whether neuron connections are strong or weak as a result of structure layer
rather than individual synapses. However, the ANN possesses simpler interconnections and is free from
models, self-organizing systems, neuro-computing systems and neuro-morphic systems.
Year | Neural network | Designer | Description
1943 | McCulloch and Pitts neuron | McCulloch and Pitts | The arrangement of neurons in this case is a combination of logic functions. Unique feature of this neuron is the concept of threshold.
1949 | Hebb network | Hebb | It is based upon the fact that if two neurons are found to be active simultaneously then the strength of the connection between them should be increased.
1958, 1959, 1962, 1988 | Perceptron | Frank Rosenblatt, Block, Minsky and Papert | Here the weights on the connection path can be adjusted.
1960 | Adaline | Widrow and Hoff | Here the weights are adjusted to reduce the difference between the net input to the output unit and the desired output. The result here is very negligible. Mean squared error is obtained.
1972 | Kohonen self-organizing feature map | Kohonen | The concept behind this network is that the inputs are clustered together to obtain a fired output neuron. The clustering is performed by winner-take-all policy.
1982, 1984, 1985, 1986, 1987 | Hopfield network | John Hopfield and Tank | This neural network is based on fixed weights. These nets can also act as associative memory nets.
1986 | Back-propagation network | Rumelhart, Hinton and Williams | This network is multi-layer with error being propagated backwards from the output units to the hidden units.
1988 | Counter-propagation network | Grossberg | This network is similar to the Kohonen network; here the learning occurs for all units in a particular layer, and there exists no

The neurons should be visualized for their arrangements in layers. An ANN consists of a set of highly
interconnected processing elements (neurons) such that each processing element output is found to be connected
through weights to the other processing elements or to itself; delay lead and lag-free connections are allowed.
Hence, the arrangements of these processing elements and the geometry of their interconnections are essential
for an ANN. The point where the connection originates and terminates should be noted, and the function
of each processing element in an ANN should be specified.
Besides the simple neuron shown in Figure ??, there exist several other types of neural network connections.
The arrangement of neurons to form layers and the connection pattern formed within and between layers is
called the network architecture. There exist five basic types of neuron connection architectures. They are:
1. single-layer feed-forward network;
2. multilayer feed-forward network;
3. single node with its own feedback;
4. single-layer recurrent network;
5. multilayer recurrent network.
Figures 2-6 to 2-10 depict the five types of neural network architectures. Basically, neural nets are classified
into single-layer or multilayer neural nets. A layer is formed by taking a processing element and combining it
with other processing elements. Practically, a layer implies a stage, going stage by stage, i.e., the input stage and
the output stage are linked with each other. These linked interconnections lead to the formation of various
network architectures. When a layer of the processing nodes is formed, the inputs can be connected to these
Figure 2-11
structure, each processing neuron receives two different classes of inputs: "excitatory" input from nearby
processing elements and "inhibitory" inputs from more distantly located processing elements. This type of
interconnection is shown in Figure 2-11.
In Figure 2-11, the connections with open circles are excitatory connections and the links with solid
connective circles are inhibitory connections. From Figure 2-10, it can be noted that a processing element output
can be directed back to the nodes in a preceding layer, forming a multilayer recurrent network. Also, in these
networks, a processing element output can be directed back to the processing element itself and to other
processing elements in the same layer. Thus, the various network architectures as discussed from Figures 2-6 to 2-11
can be suitably used for giving an effective solution to a problem by using ANN.
Figure 2-12 Supervised learning.
2.3.2 Learning
The main property of an ANN is its capability to learn. Learning or training is a process by means of which a
neural network adapts itself to a stimulus by making proper parameter adjustments, resulting in the production
of desired response. Broadly, there are two kinds of learning in ANNs:
1. Parameter learning: It updates the connecting weights in a neural net.
2. Structure learning: It focuses on the change in network structure (which includes the number of processing
elements as well as their connection types).
The above two types of learning can be performed simultaneously or separately. Apart from these two categories
of learning, the learning in an ANN can be generally classified into three categories: supervised learning;
unsupervised learning; reinforcement learning. Let us discuss these learning types in detail.

2.3.2.1 Supervised Learning
The learning here is performed with the help of a teacher. Let us take the example of the learning process
of a small child. The child doesn't know how to read/write. He/she is being taught by the parents at home
and by the teacher in school. The children are trained and molded to recognize the alphabets, numerals, etc.
Their each and every action is supervised by a teacher. Actually, a child works on the basis of the output that
he/she has to produce. All these real-time events involve supervised learning methodology. Similarly, in ANNs
following the supervised learning, each input vector requires a corresponding target vector, which represents
the desired output. The input vector along with the target vector is called a training pair. The network is
informed precisely about what should be emitted as output. The block diagram in Figure 2-12 shows the
working of a supervised learning network.
During training, the input vector is presented to the network, which results in an output vector. This
output vector is the actual output vector. Then the actual output vector is compared with the desired (target)
output vector. If there exists a difference between the two output vectors then an error signal is generated by
the network.

2.3.2.2 Unsupervised Learning
The learning here is performed without the help of a teacher. Consider the learning process of a tadpole:
it learns to swim by itself and is not taught by its mother. Thus, its learning
process is independent and is not supervised by a teacher. In ANNs following unsupervised learning, the
input vectors of similar type are grouped without the use of training data to specify how a member of each
group looks or to which group a member belongs. In the training process, the network receives the input
patterns and organizes these patterns to form clusters. When a new input pattern is applied, the neural
network gives an output response indicating the class to which the input pattern belongs. If for an input,
a pattern class cannot be found then a new class is generated. The block diagram of unsupervised learning is
shown in Figure 2-13.
From Figure 2-13 it is clear that there is no feedback from the environment to inform what the outputs
should be or whether the outputs are correct. In this case, the network must itself discover patterns,
regularities, features or categories from the input data and relations for the input data over the output. While
discovering all these features, the network undergoes change in its parameters. This process is called
self-organizing, in which exact clusters will be formed by discovering similarities and dissimilarities among the
objects.
Figure 2-13 Unsupervised learning.

2.3.2.3 Reinforcement Learning
This learning process is similar to supervised learning. In the case of supervised learning, the correct target
output values are known for each input pattern. But, in some cases, less information might be available.
For example, the network might be told that its actual output is only "50% correct" or so. Thus, here only
critic information is available, not the exact information. The learning based on this critic information is
called reinforcement learning and the feedback sent is called the reinforcement signal.
The block diagram of reinforcement learning is shown in Figure 2-14. Reinforcement learning is a
form of supervised learning because the network receives some feedback from its environment. However, the
feedback obtained here is only evaluative and not instructive. The external reinforcement signals are processed
in the critic signal generator, and the obtained critic signals are sent to the ANN for adjustment of weights
properly so as to get better critic feedback in future. Reinforcement learning is also called learning with a
critic, as opposed to learning with a teacher, which indicates supervised learning.
So, now you've a fair understanding of the three generalized learning rules used in the training process of
ANNs.
Figure 2-14 Reinforcement learning.

2.3.3 Activation Functions
To better understand the role of the activation function, let us assume a person is performing some work.
To make the work more efficient and to obtain exact output, some force or activation may be given. This
activation helps in achieving the exact output. In a similar way, the activation function is applied over the net
input to calculate the output of an ANN.
The information processing of a processing element can be viewed as consisting of two major parts: input
and output. An integration function (say f) is associated with the input of a processing element. This function
serves to combine activation, information or evidence from an external source or other processing elements
into a net input to the processing element. The nonlinear activation function is used to ensure that a neuron's
response is bounded, that is, the actual response of the neuron is conditioned or dampened as a result of
large or small activating stimuli and is thus controllable.
Certain nonlinear functions are used to achieve the advantages of a multilayer network from a single-layer
network. When a signal is fed through a multilayer network with linear activation functions, the output
obtained remains the same as that which could be obtained using a single-layer network. Due to this reason, nonlinear
functions are widely used in multilayer networks compared to linear functions.
There are several activation functions. Let us discuss a few in this section:
1. Identity function: It is a linear function and can be defined as
$$f(x) = x \text{ for all } x$$
The output here remains the same as input. The input layer uses the identity activation function.
2. Binary step function: This function can be defined as
$$f(x) = \begin{cases} 1 & \text{if } x \ge \theta \\ 0 & \text{if } x < \theta \end{cases}$$
where θ represents the threshold value. This function is most widely used in single-layer nets to convert
the net input to an output that is binary (1 or 0).
3. Bipolar step function: This function can be defined as
$$f(x) = \begin{cases} 1 & \text{if } x \ge \theta \\ -1 & \text{if } x < \theta \end{cases}$$
where θ represents the threshold value. This function is also used in single-layer nets to convert the net
input to an output that is bipolar (+1 or -1).
4. Sigmoidal functions: The sigmoidal functions are widely used in back-propagation nets because of the
relationship between the value of the function at a point and the value of the derivative at that point,
which reduces the computational burden during training.
Sigmoidal functions are of two types:
• Binary sigmoid function: It is also termed the logistic sigmoid function or unipolar sigmoid function.
It can be defined as
$$f(x) = \frac{1}{1 + e^{-\lambda x}}$$
where λ is the steepness parameter. The derivative of this function is
$$f'(x) = \lambda f(x)[1 - f(x)]$$
Here the range of the sigmoid function is from 0 to 1.
• Bipolar sigmoid function: This function is defined as
$$f(x) = \frac{2}{1 + e^{-\lambda x}} - 1 = \frac{1 - e^{-\lambda x}}{1 + e^{-\lambda x}}$$
where λ is the steepness parameter and the sigmoid function range is between -1 and +1. The derivative
of this function is
$$f'(x) = \frac{\lambda}{2}[1 + f(x)][1 - f(x)]$$
The bipolar sigmoidal function is closely related to the hyperbolic tangent function, which is written as
$$h(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}$$
The derivative of the hyperbolic tangent function is
$$h'(x) = [1 + h(x)][1 - h(x)]$$
If the network uses binary data, it is better to convert it to bipolar form and use the bipolar sigmoidal
activation function or hyperbolic tangent function.
5. Ramp function: The ramp function is defined as
$$f(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } 0 \le x \le 1 \\ 0 & \text{if } x < 0 \end{cases}$$
The graphical representations of all the activation functions are shown in Figure 2-15(A)-(F).
Figure 2-15 Depiction of activation functions: (A) identity function; (B) binary step function; (C) bipolar step
function; (D) binary sigmoidal function; (E) bipolar sigmoidal function; (F) ramp function.
2.4 Important Terminologies of ANNs
This section introduces you to the various terminologies related with ANNs.
2.4.1 Weights
In the architecture of an ANN, each neuron is connected to other neurons by means of directed communication
links, and each communication link is associated with weights. The weights contain information about the
input signal. This information is used by the net to solve a problem. The weights can be represented in
terms of a matrix. The weight matrix can also be called the connection matrix. To form a mathematical notation, it
is assumed that there are "n" processing elements in an ANN and each processing element has exactly "m"
adaptive weights. Thus, the weight matrix W is defined by
\[ W = \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_n^T \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nm} \end{bmatrix} \]
where w_i = [w_i1, w_i2, ..., w_im]^T, i = 1, 2, ..., n, is the weight vector of processing element i and w_ij is the
weight from processing element "i" (source node) to processing element "j" (destination node).
If the weight matrix W contains all the adaptive elements of an ANN, then the set of all W matrices
will determine the set of all possible information processing configurations for this ANN. The ANN can be
realized by finding an appropriate matrix W. Hence, the weights encode long-term memory (LTM) and the
activation states of neurons encode short-term memory (STM) in a neural network.
The bias is considered like another weight, that is, w_0j = b_j. Consider a simple network shown in Figure 2-16
with bias. From Figure 2-16, the net input to the output neuron Y_j is calculated as
\[ y_{inj} = \sum_{i=0}^{n} x_i w_{ij} = x_0 w_{0j} + x_1 w_{1j} + x_2 w_{2j} + \cdots + x_n w_{nj} \]
The activation function discussed in Section 2.3.3 is applied over this net input to calculate the output. The
bias can also be explained as follows: Consider the equation of a straight line,
\[ y = mx + c \]
where x is the input, m is the weight, c is the bias and y is the output. The equation of the straight line can
also be represented as a block diagram, shown in Figure 2-17. Thus, bias plays a major role in determining
the output of the network.
Figure 2-17 (block diagram): input x, weight m, bias c, output y = mx + c.
The bias can be of two types: positive bias and negative bias. The positive bias helps in increasing the net
input of the network and the negative bias helps in decreasing the net input of the network. Thus, as a result
of the bias effect, the output of the network can be varied.
2.4.3 Threshold
Threshold is a set value based upon which the final output of the network may be calculated. The threshold
value is used in the activation function. A comparison is made between the calculated net input and the
threshold to obtain the network output. For each and every application, there is a threshold limit. Consider a
direct current (DC) motor. If its maximum speed is 1500 rpm, then the threshold based on the speed is 1500
rpm. If the motor is run at a speed higher than its set threshold, it may damage the motor coils. Similarly, in neural
networks, based on the threshold value, the activation functions are defined and the output is calculated. The
activation function using threshold can be defined as
\[ f(net) = \begin{cases} 1 & \text{if } net \ge \theta \\ -1 & \text{if } net < \theta \end{cases} \]
where θ is the fixed threshold value.
2.4.6 Vigilance Parameter
2.4.7 Notations
The notations mentioned in this section have been used in this textbook for explaining each network.
x_i: Activation of unit X_i, input signal.
y_j: Activation of unit Y_j, y_j = f(y_inj).
w_ij: Weight on connection from unit X_i to unit Y_j.
b_j: Bias acting on unit j. Bias has a constant activation of 1.
W: Weight matrix, W = {w_ij}.
y_inj: Net input to unit Y_j given by y_inj = b_j + Σ_i x_i w_ij.
||x||: Norm of magnitude vector X.
θ_j: Threshold for activation of neuron Y_j.
S: Training input vector, S = (s_1, ..., s_i, ..., s_n).
T: Training output vector, T = (t_1, ..., t_j, ..., t_n).
X: Input vector, X = (x_1, ..., x_i, ..., x_n).
Δw_ij: Change in weights given by Δw_ij = w_ij(new) - w_ij(old).
α: Learning rate; it controls the amount of weight adjustment at each step of training.
2.5 McCulloch-Pitts Neuron
2.5.1 Theory
The McCulloch-Pitts neuron was the earliest neural network, discovered in 1943. It is usually called the M-P
neuron. The M-P neurons are connected by directed weighted paths. It should be noted that the activation of
an M-P neuron is binary, that is, at any time step the neuron may fire or may not fire. The weights associated
with the communication links may be excitatory (weight is positive) or inhibitory (weight is negative). All the
excitatory connected weights entering into a particular neuron will have the same weights. The threshold plays
a major role in the M-P neuron: there is a fixed threshold for each neuron, and if the net input to the neuron
is greater than the threshold then the neuron fires. Also, it should be noted that any nonzero inhibitory
input would prevent the neuron from firing. The M-P neurons are most widely used in the case of logic
functions.
2.5.2 Architecture
A simple M-P neuron is shown in Figure 2-18. As already discussed, the M-P neuron has both excitatory and
inhibitory connections. A connection is excitatory with weight w (w > 0) or inhibitory with weight -p (p > 0). In Figure
2-18, inputs from x_1 to x_n possess excitatory weighted connections and inputs from x_{n+1} to x_{n+m} possess
inhibitory weighted interconnections. Since the firing of the output neuron is based upon the threshold, the
activation function here is defined as
\[ f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} \ge \theta \\ 0 & \text{if } y_{in} < \theta \end{cases} \]
For inhibition to be absolute, the threshold with the activation function should satisfy the following condition:
\[ \theta > nw - p \]
The output will fire if it receives, say, "k" or more excitatory inputs but no inhibitory inputs, where
\[ kw \ge \theta > (k - 1)w \]
The M-P neuron has no particular training algorithm. An analysis has to be performed to determine the
values of the weights and the threshold. Here the weights of the neuron are set along with the threshold to
make the neuron perform a simple logic function. The M-P neurons are used as building blocks on which
we can model any function or phenomenon that can be represented as a logic function.
Figure 2-18 McCulloch-Pitts neuron model.
2.6 Linear Separability
An ANN does not give an exact solution for a nonlinear problem. However, it provides possible approximate
solutions to nonlinear problems. Linear separability is the concept wherein the separation of the input space
into regions is based on whether the network response is positive or negative.
A decision line is drawn to separate positive and negative responses. The decision line may also be called
the decision-making line or decision-support line or linear-separable line. The necessity of the linear separability
concept was felt to classify the patterns based upon their output responses. Generally the net input calculated
to the output unit is given as
\[ y_{in} = b + \sum_{i=1}^{n} x_i w_i \]
For example, if a bipolar step activation function is used over the calculated net input (y_in) then the value of
the function is 1 for a positive net input and -1 for a negative net input. Also, it is clear that there exists a
boundary between the regions where y_in > 0 and y_in < 0. This region may be called the decision boundary and
can be determined by the relation
\[ b + \sum_{i=1}^{n} x_i w_i = 0 \]
On the basis of the number of input units in the network, the above equation may represent a line, a plane
or a hyperplane. The linear separability of the network is based on the decision-boundary line. If there exist
weights (with bias) for which the training input vectors having positive (correct) response, +1, lie on one side
of the decision boundary and all the other vectors having negative (incorrect) response, -1, lie on the other
side of the decision boundary, then we can conclude the problem is "linearly separable."
Consider a single-layer network as shown in Figure 2-19 with bias included. The net input for the network
shown in Figure 2-19 is given as
\[ y_{in} = b + x_1 w_1 + x_2 w_2 \]
Figure 2-19 A single-layer neural net.
The separating line for which the boundary lies between the values x_1 and x_2, so that the net gives a positive
response on one side and negative response on the other side, is given as
\[ b + x_1 w_1 + x_2 w_2 = 0 \]
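As a brief aside on the M-P neuron of Section 2.5.2, the sketch below shows how fixed weights and a threshold realize two logic functions. The function names and the particular choices of w, p and θ are illustrative assumptions; they are picked to satisfy kw ≥ θ > (k - 1)w and θ > nw - p, and are not values taken from the text.

```python
def mp_neuron(excitatory, inhibitory, w, p, theta):
    """McCulloch-Pitts neuron: returns 1 (fires) when the net input reaches theta.

    excitatory and inhibitory are lists of binary inputs. Every excitatory
    link carries weight w > 0 and every inhibitory link carries weight -p.
    """
    y_in = w * sum(excitatory) - p * sum(inhibitory)
    return 1 if y_in >= theta else 0


def and_gate(x1, x2):
    # Two excitatory inputs, w = 1, theta = 2: fires only for (1, 1).
    return mp_neuron([x1, x2], [], w=1, p=0, theta=2)


def and_not_gate(x1, x2):
    # x1 excitatory (w = 1), x2 inhibitory (p = 1), theta = 1: x1 AND NOT x2.
    return mp_neuron([x1], [x2], w=1, p=1, theta=1)


truth_and = [and_gate(a, b) for a in (0, 1) for b in (0, 1)]
truth_and_not = [and_not_gate(a, b) for a in (0, 1) for b in (0, 1)]
```

No training takes place here: the weights and threshold are chosen by analysis, as the section describes.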
30 Artificial Neural Network: An Introduction 2.7 Hebb Network 31
If weight w_2 is not equal to 0 then we get
\[ x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2} \]
Thus, the requirement for the positive response of the net is
\[ b + x_1 w_1 + x_2 w_2 > 0 \]
During the training process, the values of w_1, w_2 and b are determined so that the net will produce a positive
(correct) response for the training data. If, on the other hand, a threshold value is being used, then the condition
for obtaining the positive response from the output unit is
Net input received > θ (threshold)
\[ y_{in} > \theta \]
\[ x_1 w_1 + x_2 w_2 > \theta \]
The separating line equation will then be
\[ x_1 w_1 + x_2 w_2 = \theta \]
\[ x_2 = -\frac{w_1}{w_2}\,x_1 + \frac{\theta}{w_2} \quad (\text{with } w_2 \neq 0) \]
During the training process, the values of w_1 and w_2 have to be determined, so that the net will have a correct
response to the training data. For this correct response, the line passes close to the origin. In certain
situations, even for correct response, the separating line does not pass through the origin.
Consider a network having positive response in the first quadrant and negative response in all other
quadrants (AND function) with either binary or bipolar data; then the decision line is drawn separating the
positive response region from the negative response region. This is depicted in Figure 2-20.
Thus, based on the conditions discussed above, the equation of this decision line may be obtained.
Also, in all the networks that we would be discussing, the representation of data plays a major role.
Figure 2-20 Decision boundary line. (Axes x_1 and x_2; the decision line separates the positive response region from the negative response region.)
However, the data representation mode has to be decided: whether it would be in binary form or in
bipolar form. It may be noted that the bipolar representation is better than the binary representation.
Using bipolar data ... values are represented ... can be represented by ... vice versa.
2.7 Hebb Network
2.7.1 Theory
For a neural net, the Hebb learning rule is a simple one. Let us understand it. Donald Hebb stated in 1949
that in the brain, the learning is performed by the change in the synaptic gap. Hebb explained it: "When an
axon of cell A is near enough to excite cell B, and repeatedly or permanently takes place in firing it, some
growth process or metabolic change takes place in one or both the cells such that A's efficiency, as one of the
cells firing B, is increased."
According to the Hebb rule, the weight vector is found to increase proportionately to the product of the
input and the learning signal. Here the learning signal is equal to the neuron's output. In Hebb learning,
if two interconnected neurons are "on" simultaneously then the weights associated with these neurons can
be increased by the modification made in their synaptic gap (strength). The weight update in Hebb rule is
given by
\[ w_i(\text{new}) = w_i(\text{old}) + x_i y \]
The Hebb rule is more suited for bipolar data than binary data. If binary data is used, the above weight
updation formula cannot distinguish two conditions, namely:
1. A training pair in which an input unit is "on" and target value is "off."
2. A training pair in which both the input unit and the target value are "off."
Thus, there are limitations in Hebb rule application over binary data. Hence, the representation using bipolar
data is advantageous.
2.7.2 Flowchart of Training Algorithm
The training algorithm is used for the calculation and adjustment of weights. The flowchart for the training
algorithm of Hebb network is given in Figure 2-21. The notations used in the flowchart have already been
discussed in Section 2.4.7.
In Figure 2-21, s: t refers to each training input and target output pair. As long as there exists a pair of training
input and target output, the training process takes place; else, it is stopped.
2.7.3 Training Algorithm
The training algorithm of Hebb network is given below:
Step 0: First initialize the weights. Basically in this network they may be set to zero, i.e., w_i = 0 for i = 1
to n where "n" may be the total number of input neurons.
Step 1: Steps 2-4 have to be performed for each input training vector and target output pair, s: t.
Step 2: Input unit activations are set. Generally, the activation function of the input layer is the identity function:
x_i = s_i for i = 1 to n
The above five steps complete the algorithmic process. In Step 4, the weight updation formula can also be
given in vector form, with the change in weight expressed as
\[ \Delta w = xy \]
As a result,
\[ w(\text{new}) = w(\text{old}) + \Delta w \]
The Hebb rule can be used for pattern association, pattern categorization, pattern classification and over a
range of other areas.
Figure 2-21 Flowchart of Hebb training algorithm.
\[ y_{in} = x_1 w_1 + x_2 w_2 + x_3 w_3 = 0.3 \times 0.2 + 0.5 \times 0.1 + 0.6 \times (-0.3) = 0.06 + 0.05 - 0.18 = -0.07 \]
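The Hebb training algorithm of Section 2.7.3 and the separating-line formula of Section 2.6 can be sketched together as follows. Steps 3 and 4 are not legible above, so the sketch assumes the form used in the solved problems (output set to the target, then w_i(new) = w_i(old) + x_i y and b(new) = b(old) + y). The names `hebb_train`, `separating_line` and `AND_BIPOLAR` are illustrative.

```python
def hebb_train(samples):
    """One pass of Hebb learning over (inputs, target) pairs.

    Weights and bias start at zero. For each pair the output is set to the
    target (y = t, an assumption) and the update is w_i += x_i * y, b += y.
    """
    n = len(samples[0][0])
    w = [0] * n
    b = 0
    for x, t in samples:
        y = t
        w = [wi + xi * y for wi, xi in zip(w, x)]
        b += y
    return w, b


def separating_line(w, b):
    """Return (slope, intercept) of x2 = -(w1/w2) * x1 - b/w2, assuming w2 != 0."""
    w1, w2 = w
    return -w1 / w2, -b / w2


# Bipolar AND function: inputs (x1, x2) and target.
AND_BIPOLAR = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]

weights, bias = hebb_train(AND_BIPOLAR)
print(weights, bias)                    # [2, 2] -2
print(separating_line(weights, bias))   # (-1.0, 1.0), i.e. x2 = -x1 + 1
```

The bias is kept as a separate variable here rather than as an extra input fixed at 1; the two conventions give the same updates.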
(n = 3, because only 0
1
y=f(y;,)= 0 ify;,< 1
1 ify;,::01
be written as 0
0
0
1
0
1
0
0
1
0
w11
WZI
e~ 1
== 1
= -1
for the zl neuron
f.{; r '',I SIV
---
8~1
Solution: The trmh table for XOR function is given wu = 1021 = 1 TableS I) 0 0 0 0 0
in Table 3. 0 1 1 0 1
Calculate the net inpms. For inputs, Xi "'- zz 1 0 1 1 0
Table3 0 0 0 I 1 0 0 0
X]
"'- y (0, 0), Zj,0 = 0 X 1+ 0 X I= 0 0 I 1
0 0 0 (Q, 1), ZJin = 0 X 1+ l X 1= l 0 0 Here the net input is calculated using
0 1 1 (1, 0), Z!i, = 1 X 1+ 0 1= 1
X 1 0
~]in :::
1
1
0
1
1
0
w12 = wn = 1
-· Z] V] + Z2VZ
Case 1: Assume both weights as excitatory, i.e.,
V] ::: VZ = 1
)
\Lf::~
WB=l; U/21=-l
(O,O),Z2in=Ox 1+0x 1=0 (O,O),y;,=Ox 1+0x 1=0
y=z, +za
,, 1
~·
,!
l
z, z,
, where the threshold is taken as "I" (e = 1) based final (new) weights obtained by presenting the
on the calculated net input. Hence, using the linear first input paaern, i.e.,
separability concept, the response is obtained fo.r
(-1,1) /7 (1,1) [wi w, b] = [1 l 1]
+ / + "OR" function.
y
-~x,, y,) 8. Design a Hebb net to implement logical AND The weight change here is
.(-1,0) function (use bipolar inputs and targets). ·
x, t:..w 1 =x1y= 1 X -1 = -1
Solution: The training data for the AND function is l>w, =xzy= -I X -I= I
given in Table 9.
(>,, y,)
l>b=y=-1
Figure 9 Nemal ner for Y(Z1 ORZ,). +
(0,-1)
(-1, -1) (1, -1) Table9
The new weights here are
e
Swing a threshold -of 2::. 1' Vj == 1'2 = I, which
Function decision
boundary
Inputs Target
implies that the net is recognized. Therefore, the Xi X2 b y w1(new) = w1(old) + 6.w1 =I -1 = 0
analysis is made for XOR function using M-P Figure 10 Graph for 'OR' function.
1 1 1 1 w, (new) = w,(old) + l>w, = 1 + 1 = 2
neurons. Thus for XOR function, the weights are
1 -1 1 -1 b(new) = b(old) + l>b = 1- 1 = 0
obtained as Using this value the equation for the line is given as
-I 1 1 -1
y = mx+c= (-1)x-l = -x-1 -1 -I 1 -1 Similarly, by presenting the third and fourth
wu = Zll22 = 1 (excitatory) input patterns, the new weights can be calculated.
WJ2 = W21 = -1 (inhibitory) Here the quadrants are nm x andy but XJ and xz, so The neMork is trained using the Hebb network train- Table 10 shows the values of weights for all inputs.
VJ = Vz = 1 (excirarory) the above equation becomes ing algorithm discussed in Section 2.7 .3.lnitially the .
weights and bias are set to zero, i.e., .- ~-- ~~J--- Table 10
7. Using the linear separability concept, obtain the \ xz =-xi -1 (2.1)
Inputs Weight changes Weights
response for OR function (rake bipolar inputs and
This can be wrinen as 0'2_~\ ·"'0 Xj X2, b y D.w, D.wz t:..b w1 wz b
bipolar targets). (0 0 0)
-WI b First input [xi xz b] = [1 1 1] and target = 1 I 1
Solution: Table 7 is the truth table for OR function xz= --XI-'-- (2.2)
[i.e., y = 1]: Setting the initial weights as old
I I 1 1 1 I I
with bipolar inputs and targets. wz wz I -1 I -I -I I -1 0 2 0
weights and applying the Hebb rule, we get -1 -1 I I -1
Comparing Eqs. (2.1) and (2.2), we get -1 I I -1 1
Table7 -1 -1 I -1 1 1 -1 2 2 -2
w;(new) = w;(old) + x;y
Xi X2 y Wi b
w2 =I; w 1(new) = w1 (old) +Xi]= 0 + I x l= 1 The sepaming line equation is given by
I 1
-I 1 '"' w,(new) = w,(oid) + xzy = 0 + I x I = f
-I I I Therefore, WJ = l, wz = 1 and b =
1. Calculating
b(new) = b(old) +y = 0 + 1 = I
-WJ
xz= - - x , - -
b
-I -I -I the net input and output of OR function on the basis ruz wz
of these weights and bias, we get emries in Table 8.
The weights calculated above arc the final weights
The uurh table inpurs and corresponding outputs that are obtained after presenting the first input. For all inputs, use the final weights obtained
TableS
.------;-:=::;:=~----:.cl
have been plotted in Figure 10. If output is 1, it is These weights are used as rhe initial weights when for each input to obtain the separating line.
~[Y•·=b+~}D
denoted as"+" else"-." Assuming rbe ~res ~ X2 the second input pattern is presented. The weight For the first input [1 I 1), the separating line is
as ( l, 0) 3.nd (0, ll; (x,, Yl) and (.xz,yz), the slope change here is t:..w; = x;y. Hence weight changes given by
1 1
"m" of the straight line can be obtained as I -1 1 1 1 relating to the first input are
-1 1
-1 I I 1 1 XZ = - X i - - ::::} XZ = -XJ - 1
)'2-yi -1-0 -1 -1 -1 -1 -1 t:..w1 = XJJ = l x 1 = I 1 1
m=--=--=-=-1
X2-X] 0+1 1 "'"" =w= 1 x 1 = 1
~ Similarly, for the second input [ 1 -1 1], the
Thus, the output of neuron Y can be written as y,·] l>b=y=l separating line is
We now calculate c:
y=f(;;;,) = 11OifJin<1
if y;,) I • Second input [x, X2, b] = [1 - 1 1] and
XZ =
-0
-x, --02 => xz = 0
'= Ji- '"-"i = 0- (~1)(-1) = -1 =
y -1: The initial or old weights here are the 2
I
_L_
T;:•·
X, Forthethirdinput[-lll],itis Solution: The training pair for the OR function is ilie output response is "-1" lies on ilie other side -of
(-1, 1)
given in Table 11.
the bOun J, ~ . '·-!. N;
I (1,1)
X2. =
-1
-x,
I
+-I ; ;:;} X2 = -x, + 1 Table 11 , I W] = 2; W, = 2; b = 2 ~}' v.J<" Q
' I
Inputs Target ~~ nerwork can be represented as shown in L ...
- x,
Finally, for the fourth input [ -1 - 1 1], the
separating line is X]
"' I
b
I
-
y
I
Figure 14.
X,.
(i
v' .y.:,J-"
·~.w
J,,.,;;:rl>
J '>
<ii ~',..
(1, -1) -I -1 I -I + +
r~
The graphs for each of these separating lines
obtained are shown in Figure 11. In this figure Initially the weights and bias are set to zero, i.e.,
(A) First Input "+" mark is used for output "1" and"-" mark
is used for output "-1." From Figure 11, it can Wj =w2=h=O x,
X, be noticed rhat. for the first input, the decision
boundary differentiates only the first and fourth The nerwork is trained and the final weights are out·
(-1,1)
I (1,1)
+
inputs, and nor all negative responses are separated
from positive responSes. When rhe second input
pattern is presented, the decision boundary sep·
lined using the Hebb training algorithm discussed
in Section 2.7.3. The weighrs are considered as final
weights if the boundary line obtained from these
(-1, -1) '
(1. -1)
ar.ues (1, I) from (I, -I) and (-I. -I) and nor weights separates the positive response region and .112"'-x, -1
~ negative response region.
(-1, I). But the boundary line is same for the both
- x,
third and fourth training pairs. And, the decision
boundary line obtained from these input training
pairs separates the positive response region from
By presenting all the input patterns, the weights
are calculated. Table 12 shows the weights calculated
for all the inputS.
Figure 13 Decision boundary for OR function.
WI =2; tuz=2;
--
b=-2 Xj
Inputs
Xz b J
Weight changes
l).wl t.,wz t..b WI
(0
Weights
W1.
0
b
0)
x1
x1 2
2
y
(-1.~1
Figure 12. -I -1 I 2 0 2
(1,1)
-I -1 I I I I 3 Figure 14 Hebb net for OR function.
' -1 -( -1 I I -1 2 2 2
10. Use the Hebb rule method to implement XOR
~·
-2 function {take bipolar inputs and targets).
Using the final weights, the boundary line equation
x1 x 2 _ can be obtained. The separating line equation is Solution: The training patterns for an XOR function
1 y are shown in Table 13.
-wl b -2 2
X,= --X] - - = - X I - - =-X\ - 1 Table 13
(-1. -1) (1, -1) 2 wz wz2 2
__Inpu~ Target
The decision region for this net is shown in Figure 13.
It is observed in Figure 13 that straight li~e X']. =
b y
"'I
X]
II I
'";'.·
_,__.
Here, a single-layer network with two input neurons, The XOR function can be made linearly separable by [J,'"2 = X2J = 1 X 1 = 1 w,(new) = w,(old) + x,y = 1 + 1 x -1 = 0
one bias and one output neuron is considered. In solving it in a manner as discussedjn Problem 6. This
method of solving will result in rwo decision bound- l:J.w3 = X3Y = 1 X 1= 1 w4(n.W) = w,(old) + X4J = -1 + 1 x -1 = -2
this case also, the initial weights are assumed to be
zero: ary lines for separating positive and negative regions
ofXOR function.
l:J.w4 =xv= -1 x 1 = -1 u:s(new) = ws(old) +xsy =I+ -1 x -1 = 2
§ill §ili
1 I -1 -1 -1 -1 -1 -1 -1 + +
w;(new) = wi(old) + l:J.wi The weights obtained are indicated in the Hebb net
1 -1 1 1 1 -1 1 0 -2 0 + + shown in Figure 17,
-1 1 1 1 -1 1 I -1 -1 1 Setting the old weights as the initial weights here,
we obrairt 12. Find the weights required to perform the follow-
-1-11-1 1 1 -1 0 0 0 + + +
ing classifications of given input patterns using
'I' ·o· the Hebb rule. The inpurs are "1" where''+"
WJ (new) = WJ (old) + l:J.w1 = 0 + 1 = 1
symbol is present and" -1 ''where"," is presem.
The final weights obtained after presenting aH the Figure 16 Data for input patterns. '"2(new) = '"2(old) + !J,'"2 = 0 + 1 = 1 "L" pattern belongs to the class (target value+ 1)
inpm pauerns do nm give correct output for all pat-
w,(new) = w3(o!d) + IJ,w3 = 0 + 1 = 1 and "U" pattern does not belong to the class
terns. Figure 15 shows that the input patterns are Solution: The training input patterns for the given (target value -1).
linearly non-separable. The graph shown in Figure 15 net (Figure 16) are indicated in Table 15.
indicates that the four input pairs that are present can- Similarly, calculating for other weights we get Solution: The training input patterns for Figure 18
not be divided by a single line m separate them into Table 15 are given in Table 16.
two regions. Thus XORfi.mcrion is a case of a panern Pattern Inputs Target W4(new) = -1, ws(new) = l, WG(new):::: -1,
classification problem, which is not linearly separable. WJ(new) = 1, wa(new) = 1, llJ9(new) = 1, Table 16
XI X2 X3 :Gj xs X6X7xaX9b y
1 1 -1 1 -1 1 1 I ·1 b(new) = 1 Pattern Inputs Target
X, 0 1 1 1 1 -1 1 1 I 1 1 -1 X3 X4 X) XG
X] X2. '-7XSX9 b J
')i!-/ The weights after presenting first input pattern are L 1-1-11-1-1
(-1, 1) (1,1)
\ ;,,_ Here a single-layer ne[Work with nine input netUons,
·P one bias and one output neuron is formed. Set rhe W(new) = [1 1 1 -1 1 -I 1 I 1 1] u -1 1 I -1 -1
+
' initial weights and bias to zero, i.e.,
IN::::\
x, booodmy u'") W] ::=W2=W3=W<i=Ws
Case 2: Now we present the second input pattern
(0). The initial weights used here are the final weights
obtained after presenting the fim input pa~ern. Here,
A single-layer ne[Work with nine input neurons, one
bias and one output neuron is formed. Set the initial
=wG =w-, =wa =llJ9 = b= 0 weights and bias tO zero, i.e.,
the weights are calculated as shown below (y = -1
+ wiclHheinitialweighrsbeing[1ll-11-ll1I1]). W]::=W2=W3=W<j:=W5
(-1, -1) (1, -1) Case 1: Presenting first input panern (I), we calculate
=wG=UJ?=wa=U19=b=O
change in weights: w;(new) = w;(old) + l:J.x; [l:J.w; = x;y] The weights are calculated using
X,
f:..w;=x,y, i= 1 to9 w,(new) = WJ(old) + XiJ =I+ 1 X -1 = 0
w;(new) = w;(old) + x;y
Figure 15 Decision boundary for XOR function.
f:..w 1 = XIJ = 1 X 1= } ""(new)= '"2(old) + x,y = 1 + 1 x -1 = 0
I
x, Table 17
Tatge' Weights
X, "' "'-
X3 X4 X5
Inpuu
X6X7XSX9b ----
j WJ
(0
Uf2 Ul3 W4 W5
0 0 0 0
W6 W'J
0 0
Wg
0
U19
0 0)
b
-1 -1 1 -1 -1
,, -1 -1 1 -1 -1 1 1 1 1, '
-I 0 0 -2 0 0 -2 0 0 0 0
-I I I -I
"'
,, x,
,, ,,
,,
x 9
1 (x9
Figure 17 Hebb ner for the data matrix shown in Figure 16. ....
y
,,
+ + + + "" x,
'
·e ·u·
Figure 18 Input clara for given parrerns. -- ,(X,
Figure 19 flebb ne< of Figure 18.
2.10 Review Questions
1. Define an artificial neural network.
2. State the properties of the processing element of an artificial neural network.
3. How many signals can be sent by a neuron at a particular time instant?
4. Draw a simple artificial neuron and discuss the calculation of net input.
5. What is the influence of a linear equation over the net input calculation?
6. List the main components of the biological neuron.
7. Compare and contrast biological neuron and artificial neuron.
11. Define net architecture and give its classifications.
12. Define learning.
13. Differentiate between supervised and unsupervised learning.
14. How is the critic information used in the learning process?
15. What is the necessity of activation function?
16. List the commonly used activation functions.
17. What is the impact of weight in an artificial neural network?
18. What is the other name for weight?
19. Define bias and threshold.
20. What is a learning rate parameter?
21. How does a momentum factor make faster convergence of a network?
22. State the role of vigilance parameter in ART network.
23. Why is the McCulloch-Pitts neuron widely used ...
27. How can the equation of a straight line be formed using linear separability?
28. In what ways is bipolar representation better than binary representation?
29. State the training algorithm used for the Hebb network.
30. Compare feed-forward and feedback network.
2.11 Exercise Problems
1. For the network shown in Figure 20, calculate the net input to the output neuron.
Figure 20 Neural net.
2. Calculate the output of neuron Y for the net shown in Figure 21. Use binary and bipolar
sigmoidal activation functions.
Figure 21 Neural net.
3. Design neural networks with only one M-P neuron that implements the three basic logic
operations: ...
4. (a) Show that the derivative of the binary sigmoidal function is
\[ f'(x) = \lambda f(x)[1 - f(x)] \]
(b) Show that the derivative of the bipolar sigmoidal function is
\[ f'(x) = \frac{\lambda}{2}\,[1 + f(x)][1 - f(x)] \]
5. (a) Construct a feed-forward network with five input nodes, three hidden nodes and four output
nodes that has lateral inhibition structure in the output layer.
(b) Construct a recurrent network with four input nodes, three hidden nodes and two output
nodes that has feedback links from the hidden layer to the input layer.
6. Using the linear separability concept, obtain the response for NAND function.
7. Design a Hebb net to implement logical AND function with
(a) binary inputs and targets and
(b) binary inputs and bipolar targets.
8. Implement NOR function using Hebb net with
(a) bipolar inputs and targets and
(b) bipolar inputs and binary targets.
9. Classify the input patterns shown in Figure 22 using Hebb training algorithm.
Figure 22 Input pattern.
10. Using Hebb rule, find the weights required to perform the following classifications. The vectors
(1 -1 1 -1) and (1 1 1 -1) belong to class (target value +1); vectors (-1 -1 1 1) and (1 1 -1 -1)
do not belong to class (target value -1). Also, using each of the training x vectors as input, test the
response of the net.
2.12 Projects
1. Write a program to classify the letters and numerals using Hebb learning rule. Take a pair of letters
or numerals of your own. Also, after training the network, test the response of the net using a
suitable activation function. Perform the classification using bipolar data as well as binary data.
2. Write suitable programs for implementing logic functions using McCulloch-Pitts neuron.
3. Write a computer program to train a Madaline to perform AND function, using MRI algorithm.
4. Write a program for implementing BPN for training a single-hidden-layer back-propagation
CHAPTER 3
SUPERVISED LEARNING
NETWORK
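[Figure: perceptron with inputs x_1, ..., x_n, weights w_1, ..., w_n and bias weight w_0; the weighted sum Σ_{i=0}^{n} w_i x_i feeds the output o.]
\[ o = f(x) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases} \]
\[ w_i \leftarrow w_i + \Delta w_i \]
\[ \Delta w_i = \alpha\,(t - o)\,x_i \]
where
t = c(x) is the target value,
o is the perceptron output,
α is a small constant (e.g., 0.1) called the learning rate.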
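The perceptron training rule above can be sketched in a few lines. This is a minimal illustration: the learning-rate symbol is lost in the source and is written α here, the bias input x_0 = 1 is assumed so that the sum can start at i = 0, and the function names and the stopping criterion (an epoch with no errors) are choices made for the example.

```python
def perceptron_output(w, x):
    """o = 1 if sum_i w_i * x_i > 0, else -1. x[0] is the constant bias input 1."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else -1


def train_perceptron(samples, alpha=0.1, max_epochs=100):
    """Apply w_i <- w_i + alpha * (t - o) * x_i until an epoch has no errors."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)              # w[0] is the bias weight w_0
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            xb = (1,) + tuple(x)     # prepend the assumed bias input x_0 = 1
            o = perceptron_output(w, xb)
            if o != t:
                errors += 1
                w = [wi + alpha * (t - o) * xi for wi, xi in zip(w, xb)]
        if errors == 0:
            break
    return w


AND_DATA = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w_and = train_perceptron(AND_DATA)
```

When the output already equals the target, (t - o) is zero and the weights are left unchanged, so the explicit `if` only serves to count errors.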
Error: The error value is the amount by which the value output by
the network differs from the target value. For example, if we
required the network to output 0 and it outputs 1, then Error = -1.
Use a set of sample patterns where the desired output (given the
inputs presented) is known.
[Figure: single-layer network. Input signals (external stimuli) enter the input layer, pass through adjustable weights to the output layer, and produce the output values.]
The Adaline network uses the delta learning rule. This rule is also called the
Widrow-Hoff learning rule or the least mean square (LMS) rule. The delta rule for
adjusting the weights is given as (i = 1 to n):
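The equation itself is not present at this point. The standard Widrow-Hoff form, written with the notation used elsewhere in this text (α for the learning rate, t for the target, y_in for the net input), is assumed to be:

```latex
\Delta w_i = \alpha\,(t - y_{in})\,x_i,
\qquad
w_i(\text{new}) = w_i(\text{old}) + \Delta w_i,
\qquad i = 1, \dots, n
```

Unlike the perceptron rule, the error here is measured on the net input rather than on the thresholded output.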
Training
• Feed in known inputs in random sequence
• Simulate the network
• Compute the error between the network output and the target output (error function)
• Adjust weights (learning function)
• Repeat until total error < ε
Thinking
• Simulate the network
• Network will respond to any input
• Does not guarantee a correct solution even
for trained inputs
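The training steps listed above can be sketched for a single Adaline unit. The sketch assumes the standard delta (LMS) update on the net input and a total squared error as the error function; `train_adaline`, its parameters and the toy data are illustrative and not taken from the slides.

```python
import random


def net_input(w, b, x):
    """Simulate the network: y_in = b + sum_i x_i * w_i."""
    return b + sum(wi * xi for wi, xi in zip(w, x))


def train_adaline(samples, alpha=0.1, epsilon=1e-6, max_epochs=1000, seed=0):
    """Repeat delta-rule updates until the total squared error drops below epsilon."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    b = 0.0
    data = list(samples)
    total_error = float("inf")
    for _ in range(max_epochs):
        rng.shuffle(data)                      # feed known inputs in random sequence
        for x, t in data:
            err = t - net_input(w, b, x)       # error function
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]   # learning function
            b += alpha * err
        total_error = sum((t - net_input(w, b, x)) ** 2 for x, t in data)
        if total_error < epsilon:              # repeat until total error < epsilon
            break
    return w, b, total_error


# Toy targets generated from t = 2*x1 - x2 + 0.5, so an exact linear fit exists.
LINEAR_DATA = [((1, 1), 1.5), ((1, -1), 3.5), ((-1, 1), -2.5), ((-1, -1), -0.5)]
w_fit, b_fit, err_fit = train_adaline(LINEAR_DATA)
```

For targets that a linear unit cannot reproduce exactly, such as the bipolar AND function, the total squared error has a nonzero floor (1.0 for that case), so ε has to be chosen above it or the loop simply runs to `max_epochs`.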
“Principles of Soft Computing, 2nd Edition”
by S.N. Sivanandam & SN Deepa
Copyright 2011 Wiley India Pvt. Ltd. All rights reserved.
MADALINE NETWORK
MADALINE (Multiple Adaptive Linear Element) is a multilayer network of ADALINE units. MADALINE was the
first neural network to be applied to a real-world problem. It is used in
several adaptive filtering processes.
[Figure: MADALINE network with inputs I0-I3, hidden units h0-h2 and outputs o0-o1.]
MULTILAYER FEEDFORWARD NETWORK:
ACTIVATION AND TRAINING
For feed forward networks:
• A continuous, differentiable activation function allows gradient descent.
• Back propagation is an example of a gradient-descent technique.
• Uses sigmoid (binary or bipolar) activation function.
Will find a local, not necessarily global, error minimum. In practice it
often works well (it can be invoked multiple times with different initial
weights).
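A compact sketch of back-propagation as gradient descent is given below, assuming a network with one hidden layer, binary sigmoid units throughout and the squared error E = 0.5 (t - y)^2 per pattern. The layer sizes, the learning rate and all function names are illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(params, x):
    """Return (hidden activations, output) for weights W1, b1, W2, b2."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W1, b1)]
    y = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
    return h, y


def loss(params, data):
    return sum(0.5 * (t - forward(params, x)[1]) ** 2 for x, t in data)


def gradients(params, x, t):
    """Back-propagate the error of one pattern; returns dE/dW1, dE/db1, dE/dW2, dE/db2."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    delta_out = (y - t) * y * (1.0 - y)                      # uses sigmoid'(net) = y(1 - y)
    delta_hid = [delta_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    gW1 = [[dj * xi for xi in x] for dj in delta_hid]
    gW2 = [delta_out * hj for hj in h]
    return gW1, delta_hid, gW2, delta_out


def train(data, hidden=2, lr=0.5, epochs=2000, seed=0):
    """Gradient descent from random initial weights; a different seed gives a new start."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, t in data:
            gW1, gb1, gW2, gb2 = gradients((W1, b1, W2, b2), x, t)
            W1 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, gW1)]
            b1 = [b - lr * g for b, g in zip(b1, gb1)]
            W2 = [w - lr * g for w, g in zip(W2, gW2)]
            b2 -= lr * gb2
    return W1, b1, W2, b2
```

Because the result depends on the starting point, calling `train` with several `seed` values and keeping the run with the smallest `loss` is one way to act on the remark about invoking the method multiple times.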
Image processing.
Signature verification.
Bioinformatics.
Perceptron,
Adaline,
Madaline,
Backpropagation Network,
Radial Basis Function Network.
Apart from these mentioned above, there are several other supervised
neural networks like tree neural networks, wavelet neural network,
functional link neural network and so on.
3
testing. The input-output data are obtained by varying input variables (x_1, x_2) within [-1, +1]
achieve the following two-to-one mappings:
Learning Objectives
• The basic networks in supervised learning.
• How the perceptron learning rule is better than the Hebb rule.
• Original perceptron layer description.
• Delta rule with single output unit.
• Adaline, Madaline, back-propagation and radial basis function network.
• The various learning factors used in BPN.
• An overview of Time Delay, Function Link, Wavelet and Tree Neural Networks.
3.1 Introduction
The chapter covers major topics involving supervised learning networks and their associated single-layer
and multilayer feed-forward networks. The following topics have been discussed in detail: the perceptron
learning rule for simple perceptrons, the delta rule (Widrow-Hoff rule) for Adaline and single-layer feed-forward
networks with continuous activation functions, and the back-propagation algorithm for multilayer
feed-forward networks with continuous activation functions. In short, all the feed-forward networks have
been explored.
3.2 Perceptron Networks
1. The perceptron network consists of three units, namely, sensory unit (input unit), associator unit (hidden
unit), response unit (output unit).
2. The sensory units are connected to associator units with fixed weights having values 1, 0 or -1, which are
assigned at random.
3. The binary activation function is used in sensory unit and associator unit.
4. The response unit has an activation of 1, 0 or -1. The binary step with fixed threshold θ is used as
activation for associator. The output signals that are sent from the associator unit to the response unit are
only binary.
5. The output of the perceptron network is given by
\[ y = f(y_{in}) \]
where f(y_in) is the activation function and is defined as
\[ f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \le y_{in} \le \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases} \]
6. The perceptron learning rule is used in the weight updation between the associator unit and the response
unit. For each training input, the net will calculate the response and it will determine whether or not an
error has occurred.
Figure 3-1 Original perceptron network. (Sensory unit: sensor grid representing any pattern; associator unit; response unit. Fixed weights of value 1, 0, -1 assigned at random; output 0 or 1; desired output.)
w~t fL 9-·
7. The error calculation is based on the comparison of th~~~~~rgets with those of the ca1~t!!_~~ed
outputs. b'"~ r>-"-j ::.Kq>
(l.. u,•>J;>.-l? '<Y\
' ~ AJA I &J)
8. The weights on the connections from the units that send the nonzero signal will get adjusted suitably. I 3.2.2 Perceptron Learning Rule '
9. The weights will be adjusted on the basis of the learning_rykjf an error has occurred for a particular
training patre_!Jl.,..i.e..,- In case of the percepuon learrling rule, the learning signal is the difference between esir.ed...and.actuaL...- -·--,
~ponse of a neuron. The perceptron learning rule IS exp rune as o ows: j ~ f.:] (\ :._ PK- A-£. )
Wi{new) = Wj{old) + a tx1• Consider a finite "n" number of input training vectors, with their associated r;g~ ~ired) values x(n) {
and t{n), where "n" r~o N. The target is either+ 1 or -1. The ourput ''y" is obtained on the
b(new) = b(old) + at basis of the net input calculated and activation function being applied over the net input.
If no error occurs, there is no weight updarion and hence the training process may be stopped. In the above
equations, the target value "t" is+ I or-land a is the learningrate.ln general, these learning rules begin with
an initial guess at rhe weight values and then successive adjusunents are made on the basis of the evaluation
of an ob~~ve function. Evenrually, the lear!Jillg rules reac~.a near~optimal or optimal solution in a finite __
y = f(y,,) = l~
-1
if J1i1 > (}
if-{} 5Jirl 58
if Jin < -{}
\r~~ ~r
I
-~~
~
'
.,
number of steps. -------
APcrceprron nerwork with irs three units is shown in Figure 3~1. A£ shown in Figure 3~1. a sensory unir The weight updacion in case of perceprron learning is as shown. X~ -~~. ·.
can be a two-dimensional matrix of 400 photodetectors upon which a lighted picture with geometric black
and white pmern impinges. These detectors provide a bif!.~.{~) __:~r-~lgl.signal__if.f\1_~ i~.u.~und lfy ,P • then /I
co exceei~. certain value of threshold. Also, these detectors are conne ed randomly with the associator ullit. w{new) = w{old) + a tx {a - learning rate)
The associator unit is found to conSISt of a set ofsubcircuits called atrtre predicates. The feature predicates are else, we have
(
hard-wired to detect the specific fearure of a pattern and are e "valent to the feature detectors. For a particular
w(new) = w(old)
fearure, each predicate is examined with a few or all of the ponses of the sensory unit. It can be found that
the results from the predicate units are also binary (0 1). The last unit, i.e. response unit, contains the
pattern~recognizers or perceptrons. The weights pr tin the input layers are all fixed, while the weights on
the response unit are trainable.
~I
l
52 Supervised Learning Network 3.2 Perceptron Networks 53
For
each No
Figure 3-2 Single classification perceptron network. s:t
training patterns, and this learning takes place within a finite number of steps provided that the solution
exists."-
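The perceptron learning rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function names `activation` and `train_perceptron`, the `max_epochs` safeguard and the bipolar AND data set are assumptions introduced here for the example; the update $w(\text{new}) = w(\text{old}) + \alpha t x$ and the three-valued activation follow the text.

```python
def activation(y_in, theta=0.0):
    """Three-valued step function with fixed threshold theta."""
    if y_in > theta:
        return 1
    if y_in < -theta:
        return -1
    return 0


def train_perceptron(samples, alpha=1.0, theta=0.0, max_epochs=100):
    """Update weights only when the computed output differs from the target."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0          # weights and bias start at zero
    for _ in range(max_epochs):    # max_epochs is a safeguard (assumption)
        changed = False
        for x, t in samples:
            y_in = b + sum(xi * wi for xi, wi in zip(x, w))
            if activation(y_in, theta) != t:
                # w(new) = w(old) + alpha * t * x ; b(new) = b(old) + alpha * t
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                b += alpha * t
                changed = True
        if not changed:            # a full pass with no weight change: stop
            break
    return w, b


# Illustrative data (assumption): the AND function with bipolar inputs and targets.
AND_DATA = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
weights, bias = train_perceptron(AND_DATA)
predictions = [
    activation(bias + sum(xi * wi for xi, wi in zip(x, weights)))
    for x, _ in AND_DATA
]
```

With zero initial weights, $\alpha = 1$ and $\theta = 0$, the loop stops after the second pass over the four patterns, because no weight changes in that pass.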
3.2.3 Architecture
In the original perceptron network, the output obtained from the associator unit is a binary vector, and hence that output can be taken as input signal to the response unit and classification can be performed. Here only the weights between the associator unit and the output unit can be adjusted, and the weights between the sensory and associator units are fixed. As a result, the discussion of the network is limited to a single portion. Thus, the associator unit behaves like the input unit. A simple perceptron network architecture is shown in Figure 3-2.
In Figure 3-2, there are n input neurons, 1 output neuron and a bias. The input-layer and output-layer neurons are connected through a directed communication link, which is associated with weights. The goal of the perceptron net is to classify the input pattern as a member or not a member to a particular class.
the desired output, the weight updation process is carried out. The entire network is trained based on the mentioned stopping criterion. The algorithm of a perceptron network is as follows:
Step 0: Initialize the weights and the bias (for easy calculation they can be set to zero). Also initialize the learning rate $\alpha$ ($0 < \alpha \le 1$). For simplicity $\alpha$ is set to 1.
Step 1: Perform Steps 2-6 while the stopping condition is false.
Step 2: Perform Steps 3-5 for each training pair indicated by s:t.
Step 3: The input layer containing input units is applied with identity activation functions:
$$x_i = s_i$$
Step 4: Calculate the net input and apply the activation function defined in Section 3.2.2:
$$y_{in} = b + \sum_{i=1}^{n} x_i w_i, \qquad y = f(y_{in})$$
Step 5: Compare the output with the target and adjust the weights and bias. If $y \ne t$, then
$$w_i(\text{new}) = w_i(\text{old}) + \alpha t x_i$$
$$b(\text{new}) = b(\text{old}) + \alpha t$$
else, we have
$$w_i(\text{new}) = w_i(\text{old})$$
$$b(\text{new}) = b(\text{old})$$
Step 6: Train the network until there is no weight change. This is the stopping condition for the network. If this condition is not met, then start again from Step 2.
The algorithm discussed above is not sensitive to the initial values of the weights or the value of the learning rate.
3.2.6 Perceptron Training Algorithm for Multiple Output Classes
For multiple output classes, the perceptron training algorithm is as follows:
Step 0: Initialize the weights, biases and learning rate suitably.
Step 1: Check for stopping condition; if it is false, perform Steps 2-6.
Step 2: Perform Steps 3-5 for each bipolar or binary training vector pair s:t.
Step 3: Set activation (identity) of each input unit i = 1 to n:
$$x_i = s_i$$
Step 4: First, the net input is calculated as
$$y_{inj} = b_j + \sum_{i=1}^{n} x_i w_{ij}$$
and then the activation function is applied to obtain $y_j = f(y_{inj})$ for each output unit j.
Step 5: Adjust the weights and bias of each output unit j. If $t_j \ne y_j$, then
$$w_{ij}(\text{new}) = w_{ij}(\text{old}) + \alpha t_j x_i$$
$$b_j(\text{new}) = b_j(\text{old}) + \alpha t_j$$
else, we have
$$w_{ij}(\text{new}) = w_{ij}(\text{old})$$
$$b_j(\text{new}) = b_j(\text{old})$$
Step 6: Test for the stopping condition, i.e., if there is no change in weights then stop the training process, else start again from Step 2.
It can be noticed that after training, the net classifies each of the training vectors. The above algorithm is suited for the architecture shown in Figure 3-4.
3.2.7 Perceptron Network Testing Algorithm
It is best to test the network performance once the training process is complete. For efficient performance of the network, it should be trained with more data. The testing algorithm (application procedure) is as follows:
Step 0: The initial weights to be used here are taken from the training algorithms (the final weights obtained during training).
Step 1: For each input vector X to be classified, perform Steps 2-3.
Step 2: Set activations of the input unit.
Step 3: Obtain the response of output unit.
$$y_{in} = \sum_{i=1}^{n} x_i w_i$$
$$y = f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \le y_{in} \le \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases}$$
Thus, the testing algorithm tests the performance of network.
The condition for separating the response from region of positive to region of zero is
$$w_1 x_1 + w_2 x_2 + b > \theta$$
The condition for separating the response from region of zero to region of negative is
$$w_1 x_1 + w_2 x_2 + b < -\theta$$
The conditions above are stated for a single-layer perceptron network with two input neurons and one output neuron and one bias.
Figure 3-4 Network architecture for perceptron network for several output classes.
3.3 Adaptive Linear Neuron (Adaline)
3.3.1 Theory
The units with linear activation function are called linear units. A network with a single linear unit is called an Adaline (adaptive linear neuron). That is, in an Adaline, the input-output relationship is linear. Adaline uses bipolar activation for its input signals and its target output. The weights between the input and the output are adjustable. The bias in Adaline acts like an adjustable weight, whose connection is from a unit with activations being always 1. Adaline is a net which has only one output unit. The Adaline network may be trained using delta rule. The delta rule may also be called as least mean square (LMS) rule or Widrow-Hoff rule. This learning rule is found to minimize the mean-squared error between the activation and the target value.
The Widrow-Hoff rule is very similar to perceptron learning rule. However, their origins are different. The perceptron learning rule originates from the Hebbian assumption while the delta rule is derived from the gradient-descent method (it can be generalized to more than one layer). Also, the perceptron learning rule stops after a finite number of learning steps, but the gradient-descent approach continues forever, converging only asymptotically to the solution. The delta rule updates the weights between the connections so as to minimize the difference between the net input to the output unit and the target value. The major aim is to minimize the error over all training patterns. This is done by reducing the error for each pattern, one at a time.
The delta rule for adjusting the weight of ith pattern (i = 1 to n) is
$$\Delta w_i = \alpha (t - y_{in}) x_i$$
where $\Delta w_i$ is the weight change; $\alpha$ the learning rate; x the vector of activation of input unit; $y_{in}$ the net input to output unit, i.e., $y_{in} = \sum_{i=1}^{n} x_i w_i$; t the target output. The delta rule in case of several output units for adjusting the weight from ith input unit to the jth output unit (for each pattern) is
$$\Delta w_{ij} = \alpha (t_j - y_{inj}) x_i$$
3.3.3 Architecture
As already stated, Adaline is a single-unit neuron, which receives input from several units and also from one unit called bias. An Adaline model is shown in Figure 3-5. The basic Adaline model consists of trainable weights. Inputs are either of the two values (+1 or -1) and the weights have signs (positive or negative). Initially, random weights are assigned. The net input calculated is applied to a quantizer transfer function (possibly activation function) that restores the output to +1 or -1. The Adaline model compares the actual output with the target output and on the basis of the training algorithm, the weights are adjusted.
3.3.4 Flowchart for Training Process
The flowchart for the training process is shown in Figure 3-6. This gives a pictorial representation of the network training. The conditions necessary for weight adjustments have to be checked carefully. The weights and other required parameters are initialized. Then the net input is calculated, output is obtained and compared with the desired output for calculation of error. On the basis of the error factor, weights are adjusted.
Figure 3-5 Adaline model.
Figure 3-6 Flowchart for Adaline training process.
Step 0: Weights and bias are set to some random values but not zero. Set the learning rate parameter $\alpha$.
Step 1: Perform Steps 2-6 when stopping condition is false.
Step 2: Perform Steps 3-5 for each bipolar training pair s:t.
Step 3: Set activations for input units i = 1 to n.
$$x_i = s_i$$
Step 4: Calculate the net input to the output unit.
$$y_{in} = b + \sum_{i=1}^{n} x_i w_i$$
Step 5: Weight updation for i = 1 to n:
$$w_i(\text{new}) = w_i(\text{old}) + \alpha (t - y_{in}) x_i$$
$$b(\text{new}) = b(\text{old}) + \alpha (t - y_{in})$$
Step 6: If the highest weight change that occurred during training is smaller than a specified tolerance then stop the training process, else continue. This is the test for stopping condition of a network.
The range of learning rate can be between 0.1 and 1.0.
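The Adaline training steps above can be sketched as follows. This is a hedged illustration under stated assumptions: the names `train_adaline` and `classify`, the seed, the initial weight range, the `max_epochs` cap and the bipolar AND data are choices made for this example and do not come from the text. Only the delta-rule update $w_i(\text{new}) = w_i(\text{old}) + \alpha (t - y_{in}) x_i$ and the tolerance test are taken from the algorithm.

```python
import random


def train_adaline(samples, alpha=0.1, tolerance=1e-3, max_epochs=200, seed=0):
    """Delta (LMS) rule for a single linear unit with a bias."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    # Step 0: small random weights and bias (the range is an assumption).
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    b = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):              # cap is a safeguard (assumption)
        largest_change = 0.0
        for x, t in samples:
            # Step 4: net input to the output unit.
            y_in = b + sum(xi * wi for xi, wi in zip(x, w))
            error = t - y_in
            # Step 5: delta-rule update of weights and bias.
            for i, xi in enumerate(x):
                change = alpha * error * xi
                w[i] += change
                largest_change = max(largest_change, abs(change))
            b += alpha * error
            largest_change = max(largest_change, abs(alpha * error))
        # Step 6: stop when the largest change falls below the tolerance.
        if largest_change < tolerance:
            break
    return w, b


def classify(x, w, b):
    """Step function applied to the net input when the trained net is used."""
    y_in = b + sum(xi * wi for xi, wi in zip(x, w))
    return 1 if y_in >= 0 else -1


# Illustrative data (assumption): bipolar AND.
AND_DATA = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
adaline_w, adaline_b = train_adaline(AND_DATA)
adaline_predictions = [classify(x, adaline_w, adaline_b) for x, _ in AND_DATA]
```

Because the error is measured on the net input rather than on the thresholded output, the per-pattern error for this data set does not reach zero, so with a fixed learning rate the loop usually ends at the epoch cap rather than at the tolerance test.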
It is essential to perform the testing of a network that has been trained. When training is completed, the Adaline can be used to classify input patterns. A step function is used to test the performance of the network. The testing procedure for the Adaline network is as follows:
Step 0: Initialize the weights. (The weights are obtained from the training algorithm.)
Step 1: Perform Steps 2-4 for each bipolar input vector x.
Step 2: Set the activations of the input units to x.
Step 3: Calculate the net input to the output unit:
$$y_{in} = b + \sum_{i} x_i w_i$$
Step 4: Apply the activation function over the net input calculated:
$$y = \begin{cases} 1 & \text{if } y_{in} \ge 0 \\ -1 & \text{if } y_{in} < 0 \end{cases}$$
Adaline.
$$f(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$$
3.4.2 Architecture
A simple Madaline architecture is shown in Figure 3-7, which consists of "n" units of input layer, "m" units of Adaline layer and "1" unit of the Madaline layer. Each neuron in the Adaline and Madaline layers has a bias of excitation 1. The Adaline layer is present between the input layer and the Madaline (output) layer; hence, the Adaline layer can be considered a hidden layer. The use of the hidden layer gives the net computational capability which is not found in single-layer nets, but this complicates the training process to some extent.
The Adaline and Madaline models can be applied effectively in communication systems of adaptive equalizers and adaptive noise cancellation and other cancellation circuits.
Figure 3-7 Architecture of Madaline layer.
3.4.3 Flowchart of Training Process
The flowchart of the training process of the Madaline network is shown in Figure 3-8. In case of training, the weights between the input layer and the hidden layer are adjusted, and the weights between the hidden layer and the output layer are fixed. The time taken for the training process in the Madaline network is very high compared to that of the Adaline network.
Step 0: Initialize the weights. The weights entering the output unit are set as above. Set initial small random values for Adaline weights. Also set initial learning rate $\alpha$.
Step 1: When stopping condition is false, perform Steps 2-3.
Step 2: For each bipolar training pair s:t, perform Steps 3-7.
Step 3: Activate input layer units. For i = 1 to n,
$$x_i = s_i$$
Step 4: Calculate net input to each hidden Adaline unit:
$$z_{inj} = b_j + \sum_{i=1}^{n} x_i w_{ij}, \quad j = 1 \text{ to } m$$
Step 5: Calculate output of each hidden unit:
$$z_j = f(z_{inj})$$
Step 6: Find the output of the net:
$$y_{in} = b_0 + \sum_{j=1}^{m} z_j v_j$$
$$y = f(y_{in})$$
Step 7: Calculate the error and update the weights.
1. If t = y, no weight updation is required.
2. If $t \ne y$ and t = +1, update weights on $z_j$, where net input is closest to 0 (zero):
$$b_j(\text{new}) = b_j(\text{old}) + \alpha (1 - z_{inj})$$
$$w_{ij}(\text{new}) = w_{ij}(\text{old}) + \alpha (1 - z_{inj}) x_i$$
3. If $t \ne y$ and t = -1, update weights on units $z_k$ whose net input is positive:
$$w_{ik}(\text{new}) = w_{ik}(\text{old}) + \alpha (-1 - z_{ink}) x_i$$
$$b_k(\text{new}) = b_k(\text{old}) + \alpha (-1 - z_{ink})$$
Step 8: Test for the stopping condition. (If there is no weight change or weight reaches a satisfactory level, or if a specified maximum number of iterations of weight updation have been performed then stop, or else continue).
Madalines can be formed with the weights on the output unit set to perform some logic functions. If there are only two hidden units present, or if there are more than two hidden units, then the "majority vote rule" function may be used.
3.5.1 Theory
The back-propagation learning algorithm is one of the most important developments in neural networks (Bryson and Ho, 1969; Werbos, 1974; Lecun, 1985; Parker, 1985; Rumelhart, 1986). This network has re-awakened the scientific and engineering community to the modeling and processing of numerous phenomena using neural networks. This learning algorithm is applied to multilayer feed-forward networks consisting of processing elements with continuous differentiable activation functions. The networks associated with back-propagation learning algorithm are also called back-propagation networks (BPNs). For a given set of training input-output pair, this algorithm provides a procedure for changing the weights in a BPN to classify the given input patterns correctly. The basic concept for this weight update algorithm is simply the gradient-descent method as used in the case of simple perceptron networks with differentiable units. This is a method where the error is propagated back to the hidden unit. The aim of the neural network is to train the net to achieve a balance between the net's ability to respond (memorization) and its ability to give reasonable responses to the input that is similar but not identical to the one that is used in training (generalization).
The back-propagation algorithm is different from other networks in respect to the process by which weights are calculated during the learning period of the network. The general difficulty with the multilayer perceptrons is calculating the weights of the hidden layers in an efficient way that would result in a very small or zero output error. When the hidden layers are increased the network training becomes more complex. To update weights, the error must be calculated. The error, which is the difference between the actual (calculated) and the desired (target) output, is easily measured at the output layer. It should be noted that at the hidden layers, there is no direct information of the error. Therefore, other techniques should be used to calculate an error at the hidden layer, which will cause minimization of the output error, and this is the ultimate goal.
The training of the BPN is done in three stages: the feed-forward of the input training pattern, the calculation and back-propagation of the error, and updation of weights. The testing of the BPN involves the computation of feed-forward phase only. There can be more than one hidden layer (more beneficial) but one hidden layer is sufficient. Even though the training is very slow, once the network is trained it can produce its outputs very rapidly.
3.5.2 Architecture
A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a hidden layer and an output layer. The neurons present in the hidden and output layers have biases, which are the connections from the units whose activation is always 1. The bias terms also act as weights. Figure 3-9 shows the architecture of a BPN, depicting only the direction of information flow for the feed-forward phase. During the back-propagation phase of learning, signals are sent in the reverse direction.
The inputs sent to the BPN and the output obtained from the net could be either binary (0, 1) or bipolar (-1, +1). The activation function could be any function which increases monotonically and is also differentiable.
Figure 3-9 Architecture of a back-propagation network.
The flowchart for the training process using a BPN is shown in Figure 3-10. The terminologies used in the flowchart and in the training algorithm are as follows:
$x$ = input training vector $(x_1, \ldots, x_i, \ldots, x_n)$
$t$ = target output vector $(t_1, \ldots, t_k, \ldots, t_m)$
$\alpha$ = learning rate parameter
$x_i$ = input unit i. (Since the input layer uses identity activation function, the input and output signals here are same.)
$v_{0j}$ = bias on jth hidden unit
$w_{0k}$ = bias on kth output unit
$z_j$ = hidden unit j. The net input to $z_j$ is
$$z_{inj} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}$$
and the output is
$$z_j = f(z_{inj})$$
$y_k$ = output unit k. The net input to $y_k$ is
$$y_{ink} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}$$
and the output is
$$y_k = f(y_{ink})$$
$\delta_k$ = error correction weight adjustment for $w_{jk}$ that is due to an error at output unit $y_k$, which is back-propagated to the hidden units that feed into unit $y_k$
$\delta_j$ = error correction weight adjustment for $v_{ij}$ that is due to the back-propagation of error to the hidden unit $z_j$
Also, it should be noted that the commonly used activation functions are binary sigmoidal and bipolar sigmoidal activation functions (discussed in Section 2.3.3). These functions are used in the BPN because of the following characteristics: (i) continuity; (ii) differentiability; (iii) nondecreasing monotony.
The range of binary sigmoid is from 0 to 1, and for bipolar sigmoid it is from -1 to +1.
Figure 3-10
The error back-propagation learning algorithm can be outlined in the following algorithm:
Step 0: Initialize weights and learning rate (take some small random values).
Step 1: Perform Steps 2-9 when stopping condition is false.
Step 2: Perform Steps 3-8 for each training pair.
Feed-forward phase (Phase I):
Step 3: Each input unit receives input signal $x_i$ and transmits it to the hidden unit (i = 1 to n).
Step 4: Each hidden unit $z_j$ (j = 1 to p) sums its weighted input signals to calculate the net input:
$$z_{inj} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}$$
Calculate output of the hidden unit by applying its activation functions over $z_{inj}$ (binary or bipolar sigmoidal activation function):
$$z_j = f(z_{inj})$$
and send the output signal from the hidden unit to the input of output layer units.
Step 5: For each output unit $y_k$ (k = 1 to m), calculate the net input:
$$y_{ink} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}$$
and apply the activation function to compute the output signal:
$$y_k = f(y_{ink})$$
Back-propagation of error (Phase II):
Step 6: Each output unit $y_k$ (k = 1 to m) compares its output with the target $t_k$ and computes the error correction term:
$$\delta_k = (t_k - y_k) f'(y_{ink})$$
Find the weight and bias correction terms:
$$\Delta w_{jk} = \alpha \delta_k z_j, \qquad \Delta w_{0k} = \alpha \delta_k$$
Step 7: Each hidden unit $z_j$ (j = 1 to p) sums its delta inputs from the output units:
$$\delta_{inj} = \sum_{k=1}^{m} \delta_k w_{jk}$$
The term $\delta_{inj}$ gets multiplied with the derivative of $f(z_{inj})$ to calculate the error term:
$$\delta_j = \delta_{inj} f'(z_{inj})$$
The derivative $f'(z_{inj})$ can be calculated as discussed in Section 2.3.3 depending on whether binary or bipolar sigmoidal function is used. On the basis of the calculated $\delta_j$, update the change in weights and bias:
$$\Delta v_{ij} = \alpha \delta_j x_i, \qquad \Delta v_{0j} = \alpha \delta_j$$
Weight and bias updation (Phase III):
Step 8: Each output unit ($y_k$, k = 1 to m) updates the bias and weights:
$$w_{jk}(\text{new}) = w_{jk}(\text{old}) + \Delta w_{jk}$$
$$w_{0k}(\text{new}) = w_{0k}(\text{old}) + \Delta w_{0k}$$
Each hidden unit ($z_j$, j = 1 to p) updates its bias and weights:
$$v_{ij}(\text{new}) = v_{ij}(\text{old}) + \Delta v_{ij}$$
$$v_{0j}(\text{new}) = v_{0j}(\text{old}) + \Delta v_{0j}$$
Step 9: Check for the stopping condition. The stopping condition may be certain number of epochs reached or when the actual output equals the target output.
The above algorithm uses the incremental approach for updation of weights, i.e., the weights are being changed immediately after a training pattern is presented. There is another way of training called batch-mode training, where the weights are changed only after all the training patterns are presented. The effectiveness of two approaches depends on the problem, but batch-mode training requires additional local storage for each connection to maintain the immediate weight changes. When a BPN is used as a classifier, it is equivalent to the optimal Bayesian discriminant function for asymptotically large sets of statistically independent training patterns.
The problem in this case is whether the back-propagation learning algorithm can always converge and find proper weights for network even after enough learning. It will converge since it implements a gradient-descent on the error surface in the weight space, and this will roll down the error surface to the nearest minimum error and will stop. This becomes true only when the relation existing between the input and the output training patterns is deterministic and the error surface is deterministic. This is not the case in real world because the produced square-error surfaces are always at random. This is the stochastic nature of the back-propagation algorithm, which is purely based on the stochastic gradient-descent method. The BPN is a special case of stochastic approximation.
If the BPN algorithm converges at all, then it may get stuck with local minima and may be unable to find satisfactory solutions. The randomness of the algorithm helps it to get out of local minima. The error functions may have large number of global minima because of permutations of weights that keep the network input-output function unchanged. This causes the error surfaces to have numerous troughs.
3.5.5 Learning Factors of Back-Propagation Network
The training of a BPN is based on the choice of various parameters. Also, the convergence of the BPN is based on some important learning factors such as the initial weights, the learning rate, the updation rule, the size and nature of the training set, and the architecture (number of layers and number of neurons per layer).
3.5.5.1 Initial Weights
The ultimate solution may be affected by the initial weights of a multilayer feed-forward network. They are initialized at small random values. The choice of the initial weight determines how fast the network converges. The initial weights cannot be very high because the sigmoidal activation functions used here may get saturated from the beginning itself and the system may be stuck at a local minima or at a very flat plateau at the starting point itself. One method of choosing the weight is choosing it in the range
$$\left[ \frac{-3}{\sqrt{o_i}}, \frac{3}{\sqrt{o_i}} \right]$$
$$v_{ij}(\text{new}) = \gamma \frac{v_{ij}(\text{old})}{\lVert v_j(\text{old}) \rVert}$$
where $v_j$ is the average weight calculated for all values of i, and the scale factor $\gamma = 0.7 (P)^{1/n}$ ("n" is the number of input neurons and "P" is the number of hidden neurons).
3.5.5.2 Learning Rate $\alpha$
The learning rate ($\alpha$) affects the convergence of the BPN. A larger value of $\alpha$ may speed up the convergence but might result in overshooting, while a smaller value of $\alpha$ has vice-versa effect. The range of $\alpha$ from $10^{-3}$ to 10 has been used successfully for several back-propagation algorithmic experiments. Thus, a large learning rate leads to rapid learning but there is oscillation of weights, while the lower learning rate leads to slower learning.
3.5.5.3 Momentum Factor
The gradient descent is very slow if the learning rate $\alpha$ is small and oscillates widely if $\alpha$ is too large. One very efficient and commonly used method that allows a larger learning rate without oscillations is by adding a momentum factor to the normal gradient-descent method.
The momentum factor is denoted by $\eta \in [0, 1]$ and the value of 0.9 is often used for the momentum factor. Also, this approach is more useful when some training data are very different from the majority of data. A momentum factor can be used with either pattern by pattern updating or batch-mode updating. In case of batch mode, it has the effect of complete averaging over the patterns. Even though the averaging is only partial in the pattern-by-pattern mode, it leaves some useful information for weight updation.
The weight updation formulas used here are
$$w_{jk}(t+1) = w_{jk}(t) + \underbrace{\alpha \delta_k z_j + \eta \left[ w_{jk}(t) - w_{jk}(t-1) \right]}_{\Delta w_{jk}(t+1)}$$
and
$$v_{ij}(t+1) = v_{ij}(t) + \underbrace{\alpha \delta_j x_i + \eta \left[ v_{ij}(t) - v_{ij}(t-1) \right]}_{\Delta v_{ij}(t+1)}$$
The momentum factor also helps in faster convergence.
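The three phases of back-propagation training and the momentum update above can be sketched together in Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `train_bpn`, the XOR data, the seed, the initial weight range, the number of hidden units and the parameter values are all choices made for this example. The binary sigmoid is assumed, so $f'(\cdot) = f(\cdot)(1 - f(\cdot))$, and the standard error term $\delta_k = (t_k - y_k) f'(y_{ink})$ is used at the output layer.

```python
import math
import random


def sigmoid(x):
    """Binary sigmoid, range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def train_bpn(data, p=4, alpha=0.25, eta=0.9, epochs=5000, seed=1):
    """Incremental back-propagation with a momentum factor eta."""
    rng = random.Random(seed)
    n, m = len(data[0][0]), len(data[0][1])
    # Row 0 of v and w holds the biases v_0j and w_0k.
    v = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(n + 1)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(p + 1)]
    dv_prev = [[0.0] * p for _ in range(n + 1)]   # previous weight changes
    dw_prev = [[0.0] * m for _ in range(p + 1)]

    def forward(x):
        xs = [1.0] + list(x)                      # prepend the bias input
        z = [1.0] + [sigmoid(sum(xs[i] * v[i][j] for i in range(n + 1)))
                     for j in range(p)]
        y = [sigmoid(sum(z[j] * w[j][k] for j in range(p + 1)))
             for k in range(m)]
        return xs, z, y

    def squared_error():
        return sum((tk - yk) ** 2
                   for x, t in data for tk, yk in zip(t, forward(x)[2]))

    initial_error = squared_error()
    for _ in range(epochs):
        for x, t in data:
            xs, z, y = forward(x)                 # Phase I: feed-forward
            # Phase II: error terms at the output and hidden layers.
            delta_k = [(t[k] - y[k]) * y[k] * (1 - y[k]) for k in range(m)]
            delta_j = [sum(delta_k[k] * w[j][k] for k in range(m))
                       * z[j] * (1 - z[j]) for j in range(1, p + 1)]
            # Phase III: weight change = alpha*delta*input + eta*previous change.
            for j in range(p + 1):
                for k in range(m):
                    dw = alpha * delta_k[k] * z[j] + eta * dw_prev[j][k]
                    w[j][k] += dw
                    dw_prev[j][k] = dw
            for i in range(n + 1):
                for j in range(p):
                    dv = alpha * delta_j[j] * xs[i] + eta * dv_prev[i][j]
                    v[i][j] += dv
                    dv_prev[i][j] = dv
    return initial_error, squared_error(), forward


# Illustrative data (assumption): XOR with binary inputs and targets.
XOR_DATA = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
error_before, error_after, predict = train_bpn(XOR_DATA)
```

The sketch stores the previous weight change for every connection, which is the term $w_{jk}(t) - w_{jk}(t-1)$ in the momentum formulas. As the text notes, training can still stall in a local minimum, so the final error depends on the initial weights.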
72 Supervised Learning Network 3.6 Radiat Basis Function Network 73
3.5.5.4 Generalization Step 4: Now c?mpure the output of the output layer unit. Fork= I tom,
The best network for generalization is BPN. A network is said robe generalized when it sensibly imerpolates p
with input networks thai: are new to the nerwork. When there are many trainable parameters for the given link =:WOk + L ZjWjk
amount of training dam, the network learns well bm does not generalize well. This is usually called overfitting ·. ·j=l
or overtraining. One solurion to this problem is to moniror the error on the rest sec and terminate the training
when che error increases. With small number of trainable parameters, ~e network fails to learn the training Jk = f(yj,,)
_r!-'' ~.,_,r; data and performs very poorly. on the .test data. For improving rhe abi\icy of the network ro generalize from Use sigmoidal activation functions for calculating the output.
.-.!( ~o_ a training data set w a rest clara set, ir is desirable to make small changes in rhe iripur space of a panern,
}{i 1
.,'e,) without changing the output components. This is achieved by introducing variations in the in pur space of
-0
c..!( '!f.!' training panerns as pan of the training set. However, computationally, this method is very expensive. Also,
,-. ,:'\ j a net With large number of nodes is capable of membfizing the training set at the cost of generali:zation ...As a I 3.6 Radial Basis Function Network
?\ Ji result, smaller nets are preferred than larger ones.
r I 3.6.1 Theory
3.5.5.5 Number of Training Data
The radial basis function (RBF) is a classification and functional approximation neural network developed
The training clara should be sufficient and proper. There exisrs a rule of thumb, which states !!!:r rhe training
by M.J.D. Powell. The newark uses the most common nonlineariries such as sigmoidal and Gaussian kernel
dat:uhould cover the entire expected input space, and while training, training-vector pairs should be selected
functions. The Gaussian functions are also used in regularization networks. The response of such a function is
randomly from the set. Assume that theffiput space as being linearly separable into "L" disjoint regions
positive for all values ofy; rhe response decreases to 0 as lyl _. 0. The Gaussian function is generally defined as
with their boundaries being part of hyper planes. Let "T" be the lower bound on the ~umber~ of training
pens. Then, choosing T suE!!_ that TIL ») will allow the network w discriminate pauern classes using f(y) = ,-1
fine piecewise hyperplane parririomng. Also in some cases, scaling.ornot;!:flalization has to be done to help
learning. __ ,•' ··: }) \ .. The derivative of this function is given by
3.5.5.6 Number of Hidden Layer Nodes .•. A/77 _/ ['(yl = -zy,-r' = -2yf(yl
If there exists more than one hidden layer in a BPN, rhe~~ICufarions
performed for a single layer are The graphical represemarion of this Gaussian Function is shown in Figure 3-11 below.
repeated for all the layers and are summed up at rhe end. In case of"all mufnlayer feed-forward networks, When rhe Gaussian potemial functions are being used, each node is found to produce an idemical outpm
rhe size of a h1dden layer i'f"VeTy important. The number of hidden units required for an application needs for inputs existing wirhin the fixed radial disrance from rhe center of the kernel, they are found m be radically
to be determined separately. The size of a hidden lay~_:___is usually determi_~Q~~p_qim~~- For a network symmerric, and hence the name radial basis function network. The emire network forms a linear combination
of a reasonable size,~ SIZe of hidden nod -- araariVel}r~mall fraction of the inpllrl~For of the nonlinear basis function.
example, if the network does not converge to a solution, it may need mor hidduJ lmdes:-i3~and,
Step 0: Initialize the weights. The weights are taken from the training algorithm.
Step 1: Perform Steps 2-4 for each input vector.
Step 2: Set the activation of input unit for x; (i = I ro n).
Step 3: Calculate the net input to hidden unit x and irs output-. For j = 1 ro p,
"
Zinj = VOj + L XiVij ~----~~--r---L-~--~r-----~Y
i:=l -2 -1 0 2
x,
X,
For "'- No
each >--
x,
Input Hidden Output
layer layer (RBF) layer
The flowchart for the training process of the RBF is shown in Figure 3-13 below. In this case, the center of the RBF functions has to be chosen and hence, based on all parameters, the output of network is calculated.

The training algorithm describes in detail all the calculations involved in the training process depicted in the flowchart. The training is started in the hidden layer with an unsupervised learning algorithm. The training is continued in the output layer with a supervised learning algorithm. Simultaneously, we can apply supervised learning algorithm to the hidden and output layers for fine-tuning of the network. The training algorithm is given as follows.

Step 0: Set the weights to small random values.
Step 1: Perform Steps 2-8 when the stopping condition is false.
Step 2: Perform Steps 3-7 for each input.
Step 3: Each input unit ($x_i$ for all i = 1 to n) receives input signals and transmits to the next hidden layer unit.
Figure 3-13 Flowchart for the training process of RBF.
76 Supervised Learning Network
3.8 Functional Link Networks 77
Step 4: Calculate the radial basis function.
Step 5: Select the centers for the radial basis function. The centers are selected from the set of input vectors. It should be noted that a sufficient number of centers have to be selected to ensure adequate sampling of the input vector space.
Step 6: Calculate the output from the hidden layer unit:

$v_i(x_i) = \exp\left[-\frac{\sum_{j=1}^{r} (x_{ji} - \hat{x}_{ji})^2}{\sigma_i^2}\right]$

where $\hat{x}_{ji}$ is the center of the RBF unit for input variables; $\sigma_i$ the width of ith RBF unit; $x_{ji}$ the jth variable of input pattern.
Step 7: Calculate the output of the neural network:

$y_{nm} = \sum_{i=1}^{k} w_{im} v_i(x_i) + w_0$

where k is the number of hidden layer nodes (RBF function); $y_{nm}$ the output value of mth node in output layer for the nth incoming pattern; $w_{im}$ the weight between ith RBF unit and mth output node; $w_0$ the biasing term at nth output node.
Step 8: Calculate the error and test for the stopping condition. The stopping condition may be number of epochs or to a certain extent weight change.

Figure 3-14 Time delay neural network (FIR filter).
Figure 3-15 TDNN with output feedback (IIR filter).
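The hidden-layer and output computations in Steps 6 and 7 of the RBF training algorithm can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `rbf_forward`, the list-based data layout, and the sample centers, widths and weights are assumptions made for the example and are not taken from the text.

```python
import math

def rbf_forward(x, centers, widths, weights, bias):
    """Forward pass of a Gaussian RBF network for one input pattern.

    x       : list of r input values
    centers : k centers, each a list of r values
    widths  : k widths (sigma_i), one per RBF unit
    weights : k rows, each with m weights (RBF unit i -> output node m)
    bias    : m bias terms, one per output node
    """
    hidden = []
    for c, sigma in zip(centers, widths):
        # Step 6: Gaussian response of one RBF unit
        sq_dist = sum((xj - cj) ** 2 for xj, cj in zip(x, c))
        hidden.append(math.exp(-sq_dist / sigma ** 2))
    outputs = []
    for m in range(len(bias)):
        # Step 7: linear combination of the RBF responses plus bias
        total = sum(weights[i][m] * hidden[i] for i in range(len(hidden)))
        outputs.append(bias[m] + total)
    return hidden, outputs

# Hypothetical network: two RBF units, one output node
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [1.0, 1.0]
weights = [[1.0], [-1.0]]
bias = [0.5]
hidden, outputs = rbf_forward([0.0, 0.0], centers, widths, weights, bias)
```

An input that coincides with a center gives that unit its maximum response of 1, and the response decays with squared distance from the center, which is the radial symmetry the network is named after.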
Figure 3-16 Functional link network model.
Figure 3-17 The XOR problem.

Thus, it can be easily seen that the functional link network in Figure 3-17 is used for solving this problem. The functional link network consists of only one layer, therefore, it can be trained using delta learning rule instead of the generalized delta learning rule used in BPN. As a result, the learning speed of the functional link network is faster than that of the BPN.

3.9 Tree Neural Networks

The tree neural networks (TNNs) are used for the pattern recognition problem. The main concept of this network is to use a small multilayer neural network at each decision-making node of a binary classification tree for extracting the non-linear features. TNNs completely extract the power of tree classifiers for using appropriate local features at the different levels and nodes of the tree. A binary classification tree is shown in Figure 3-18.

The decision nodes are present as circular nodes and the terminal nodes are present as square nodes. The terminal node has class label denoted by C associated with it. The rule base is formed in the decision node (splitting rule in the form of f(x) < θ). The rule determines whether the pattern moves to the right or to the left. Here, f(x) indicates the associated feature of pattern and θ is the threshold. The pattern will be given the class label of the terminal node on which it has landed. The classification here is based on the fact that the appropriate features can be selected at different nodes and levels in the tree. The output feature y = f(x) obtained by a multilayer network at a particular decision node is used in the following way:

x directed to left child node $t_L$, if y < 0
x directed to right child node $t_R$, if y ≥ 0

The algorithm for a TNN consists of two phases:

1. Tree growing phase: In this phase, a large tree is grown by recursively finding the rules for splitting until all the terminal nodes have pure or nearly pure class membership, else it cannot split further.
2. Tree pruning phase: Here a smaller tree is being selected from the pruned subtree to avoid the overfitting of data.

The training of TNN involves two nested optimization problems. In the inner optimization problem, the BPN algorithm can be used to train the network for a given pair of classes. On the other hand, in outer optimization problem, a heuristic search method is used to find a good pair of classes. The TNN when tested on a character recognition problem decreases the error rate and size of the tree relative to that of the standard classification tree design methods. The TNN can be implemented for waveform recognition problem. It obtains comparable error rates and the training here is faster than the large BPN for the same application. Also, TNN provides a structured approach to neural network classifier design problems.

3.10 Wavelet Neural Networks

The wavelet neural network (WNN) is based on the wavelet transform theory. This network helps in approximating arbitrary nonlinear functions. The powerful tool for function approximation is wavelet decomposition.

Let f(x) be a piecewise continuous function. This function can be decomposed into a family of functions, which is obtained by dilating and translating a single wavelet function $\phi: \mathbb{R}^n \to \mathbb{R}$ as

$f(x) = \sum_{i=1}^{n} w_i \det\left[D_i^{1/2}\right] \phi\left[D_i(x - t_i)\right]$

where $D_i = \mathrm{diag}(d_i)$, $d_i \in \mathbb{R}_+^n$ are dilation vectors; $t_i$ are the translational vectors; det [ ] is the determinant operator. The wavelet function $\phi$ selected should satisfy some properties. For selecting $\phi: \mathbb{R}^n \to \mathbb{R}$, the condition may be

$\phi(x) = \phi_s(x_1) \cdots \phi_s(x_n)$ for $x = (x_1, x_2, \ldots, x_n)$
where

$\phi_s(x) = -x \exp\left(-\frac{x^2}{2}\right)$

is called scalar wavelet. The network structure can be formed based on the wavelet decomposition as

$y(x) = \sum_{i=1}^{n} w_i \phi\left[D_i(x - t_i)\right] + \bar{y}$

where $\bar{y}$ helps to deal with nonzero mean functions on finite domains. For proper dilation, a rotation can be made for better network operation:

$y(x) = \sum_{i=1}^{n} w_i \phi\left[D_i R_i(x - t_i)\right] + \bar{y}$

where $R_i$ are the rotation matrices. The network which performs according to the above equation is called wavelet neural network. This is a combination of translation, rotation and dilation; and if a wavelet is lying on the same line, then it is called wavelon in comparison to the neurons in neural networks. The wavelet neural network is shown in Figure 3-19.

Figure 3-19 Wavelet neural network.

to form a Madaline network. These networks are trained using delta learning rule. Back-propagation network is the most commonly used network in the real time applications. The error is back-propagated here and is fine tuned for achieving better performance. The basic difference between the back-propagation network and radial basis function network is the activation function used. The radial basis function network mostly uses Gaussian activation function. Apart from these networks, some special supervised learning networks such as time delay neural networks, functional link networks, tree neural networks and wavelet neural networks have also been discussed.

3.12 Solved Problems

1. Implement AND function using perceptron networks for bipolar inputs and targets.

Solution: Table 1 shows the truth table for AND function with bipolar inputs and targets:

Table 1
x1    x2    t
 1     1     1
 1    -1    -1
-1     1    -1
-1    -1    -1

The perceptron network, which uses perceptron learning rule, is used to train the AND function. The network architecture is as shown in Figure 1. The input patterns are presented to the network one by one. When all the four input patterns are presented, then one epoch is said to be completed. The initial weights and threshold are set to zero, i.e., w1 = w2 = b = 0 and θ = 0. The learning rate α is set equal to 1.

For the first input pattern (x1 = 1, x2 = 1, t = 1), calculate the net input

$y_{in} = b + x_1 w_1 + x_2 w_2 = 0 + 1 \times 0 + 1 \times 0 = 0$

The output y is computed by applying activations over the net input calculated:

$y = f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > 0 \\ 0 & \text{if } y_{in} = 0 \\ -1 & \text{if } y_{in} < 0 \end{cases}$

Here we have taken θ = 0. Hence, when $y_{in}$ = 0, y = 0.

Check whether t = y. Here, t = 1 and y = 0, so t ≠ y, hence weight updation takes place:

$w_i(\text{new}) = w_i(\text{old}) + \alpha t x_i$

w1(new) = w1(old) + αtx1 = 0 + 1 × 1 × 1 = 1
w2(new) = w2(old) + αtx2 = 0 + 1 × 1 × 1 = 1
b(new) = b(old) + αt = 0 + 1 × 1 = 1

Here, the change in weights are

Δw1 = αtx1;
Δw2 = αtx2;
Δb = αt
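The perceptron procedure walked through above for the first pattern (net input, thresholded activation, update only when t ≠ y) can be sketched as a small Python routine. This is a minimal sketch under stated assumptions: the names `activation` and `train_perceptron`, the `max_epochs` cap, and the stopping test (an epoch with no weight change) are illustrative choices, not part of the text.

```python
def activation(y_in, theta=0.0):
    """Three-level perceptron activation with threshold theta."""
    if y_in > theta:
        return 1
    if y_in < -theta:
        return -1
    return 0

def train_perceptron(samples, n_inputs, alpha=1.0, theta=0.0, max_epochs=10):
    """Train with the perceptron learning rule, starting from zero weights.

    samples: list of (inputs, target) pairs.
    Returns (weights, bias, epochs_run); training stops after the first
    epoch in which no weight changes.
    """
    weights = [0.0] * n_inputs
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        changed = False
        for x, t in samples:
            y_in = bias + sum(xi * wi for xi, wi in zip(x, weights))
            y = activation(y_in, theta)
            if y != t:
                # w_i(new) = w_i(old) + alpha * t * x_i ; b(new) = b(old) + alpha * t
                weights = [wi + alpha * t * xi for wi, xi in zip(weights, x)]
                bias += alpha * t
                changed = True
        if not changed:
            return weights, bias, epoch
    return weights, bias, max_epochs

# AND function with bipolar inputs and targets (Table 1)
and_samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
weights, bias, epochs = train_perceptron(and_samples, n_inputs=2)
```

For the AND data the routine stops in the second epoch, once every pattern is classified without a weight change. A non-zero `theta` gives the three-region activation used when the threshold is not zero.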
Table 2
Columns: input (x1 x2 1), target (t), net input (y_in), calculated output (y), weight changes (Δw1 Δw2 Δb), weights (w1 w2 b); initial weights (0 0 0).

x1   x2   1     t    y_in    y    Δw1  Δw2  Δb     w1   w2   b
EPOCH-1
 1    1   1     1     0      0     1    1    1      1    1    1
 1   -1   1    -1     1      1    -1    1   -1      0    2    0
-1    1   1    -1     2      1     1   -1   -1      1    1   -1
-1   -1   1    -1    -3     -1     0    0    0      1    1   -1
EPOCH-2
 1    1   1     1     1      1     0    0    0      1    1   -1
 1   -1   1    -1    -1     -1     0    0    0      1    1   -1
-1    1   1    -1    -1     -1     0    0    0      1    1   -1
-1   -1   1    -1    -3     -1     0    0    0      1    1   -1

Training continues until the target and calculated output converge for all the patterns.

The final weights and bias after second epoch are

w1 = 1, w2 = 1, b = -1

Since the threshold for the problem is zero, the equation of the separating line is

$x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$

Here

$w_1 x_1 + w_2 x_2 + b > \theta$
$w_1 x_1 + w_2 x_2 + b > 0$

Thus, using the final weights we obtain

$x_2 = -\frac{1}{1} x_1 - \frac{(-1)}{1}$
$x_2 = -x_1 + 1$

It can be easily found that the above straight line separates the positive response and negative response region, as shown in Figure 2.

Figure 2 Decision boundary for AND function in perceptron training (θ = 0).

The same methodology can be applied for implementing other logic functions such as OR, ANDNOT, NAND, etc. If there exists a threshold value θ ≠ 0, then two separating lines have to be obtained, i.e., one to separate positive response from zero and the other for separating zero from the negative response.

2. Implement OR function with binary inputs and bipolar targets using perceptron training algorithm up to 3 epochs.

Solution: The truth table for OR function with binary inputs and bipolar targets is shown in Table 3.

Table 3
x1    x2    t
 1     1     1
 1     0     1
 0     1     1
 0     0    -1

Figure 3 Perceptron network for OR function.

The perceptron network, which uses perceptron learning rule, is used to train the OR function. The network architecture is shown in Figure 3. The initial values of the weights and bias are taken as zero, i.e.,

w1 = w2 = b = 0

Also the learning rate is 1 and threshold is 0.2. So, the activation function becomes

$f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > 0.2 \\ 0 & \text{if } -0.2 \le y_{in} \le 0.2 \\ -1 & \text{if } y_{in} < -0.2 \end{cases}$

The network is trained as per the perceptron training algorithm and the steps are as in problem 1 (given for first pattern). Table 4 gives the network training for 3 epochs.

Table 4
Columns: input (x1 x2 1), target (t), net input (y_in), calculated output (y), weight changes (Δw1 Δw2 Δb), weights (w1 w2 b); initial weights (0 0 0).

x1   x2   1     t    y_in    y    Δw1  Δw2  Δb     w1   w2   b
EPOCH-1
 1    1   1     1     0      0     1    1    1      1    1    1
 1    0   1     1     2      1     0    0    0      1    1    1
 0    1   1     1     2      1     0    0    0      1    1    1
 0    0   1    -1     1      1     0    0   -1      1    1    0
EPOCH-2
 1    1   1     1     2      1     0    0    0      1    1    0
 1    0   1     1     1      1     0    0    0      1    1    0
 0    1   1     1     1      1     0    0    0      1    1    0
 0    0   1    -1     0      0     0    0   -1      1    1   -1
EPOCH-3
 1    1   1     1     1      1     0    0    0      1    1   -1
 1    0   1     1     0      0     1    0    1      2    1    0
 0    1   1     1     1      1     0    0    0      2    1    0
 0    0   1    -1     0      0     0    0   -1      2    1   -1

The final weights at the end of third epoch are

w1 = 2, w2 = 1, b = -1

Further epochs have to be done for the convergence of the network.

3. Find the weights using perceptron network for ANDNOT function when all the inputs are presented only one time. Use bipolar inputs and targets.

Solution: The truth table for ANDNOT function is shown in Table 5.

Table 5
x1    x2    t
 1     1    -1
 1    -1     1
-1     1    -1
-1    -1    -1

The network architecture of ANDNOT function is shown as in Figure 4. Let the initial weights be zero and α = 1, θ = 0. For the first input sample, we compute the net input as

$y_{in} = b + \sum_{i=1}^{n} x_i w_i = b + x_1 w_1 + x_2 w_2 = 0 + 1 \times 0 + 1 \times 0 = 0$
Figure 4 Network for ANDNOT function.

Applying the activation function over the net input, we obtain

$y = f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > 0 \\ 0 & \text{if } y_{in} = 0 \\ -1 & \text{if } y_{in} < 0 \end{cases}$

Hence, the output y = f(y_in) = 0. Since t ≠ y, the new weights are computed as

w1(new) = w1(old) + αtx1 = 0 + 1 × (-1) × 1 = -1
w2(new) = w2(old) + αtx2 = 0 + 1 × (-1) × 1 = -1
b(new) = b(old) + αt = 0 + 1 × (-1) = -1

The weights after presenting the first sample are

w = [-1 -1 -1]

For the second input sample, we calculate the net input as

$y_{in} = b + \sum_{i=1}^{n} x_i w_i = b + x_1 w_1 + x_2 w_2 = -1 + 1 \times (-1) + (-1) \times (-1) = -1 - 1 + 1 = -1$

The output y = f(y_in) is obtained by applying activation function, hence y = -1. Since t ≠ y, the new weights are calculated as

w1(new) = w1(old) + αtx1 = -1 + 1 × 1 × 1 = 0
w2(new) = w2(old) + αtx2 = -1 + 1 × 1 × (-1) = -2
b(new) = b(old) + αt = -1 + 1 × 1 = 0

The weights after presenting the second sample are

w = [0 -2 0]

For the third input sample, x1 = -1, x2 = 1, t = -1, the net input is calculated as

$y_{in} = b + \sum_{i=1}^{n} x_i w_i = b + x_1 w_1 + x_2 w_2 = 0 + (-1) \times 0 + 1 \times (-2) = 0 + 0 - 2 = -2$

The output is obtained as y = f(y_in) = -1. Since t = y, no weight changes. Thus, even after presenting the third input sample, the weights are

w = [0 -2 0]

For the fourth input sample, x1 = -1, x2 = -1, t = -1, the net input is calculated as

$y_{in} = b + \sum_{i=1}^{n} x_i w_i = b + x_1 w_1 + x_2 w_2 = 0 + (-1) \times 0 + (-1) \times (-2) = 0 + 0 + 2 = 2$

The output is obtained as y = f(y_in) = 1. Since t ≠ y, the new weights on updating are given as

w1(new) = w1(old) + αtx1 = 0 + 1 × (-1) × (-1) = 1
w2(new) = w2(old) + αtx2 = -2 + 1 × (-1) × (-1) = -1
b(new) = b(old) + αt = 0 + 1 × (-1) = -1

The weights after presenting fourth input sample are

w = [1 -1 -1]

One epoch of training for ANDNOT function using perceptron network is tabulated in Table 6.

Table 6
Columns: input (x1 x2 1), target (t), net input (y_in), calculated output (y), weights (w1 w2 b); initial weights (0 0 0).

x1   x2   1     t    y_in    y     w1   w2   b
 1    1   1    -1     0      0     -1   -1   -1
 1   -1   1     1    -1     -1      0   -2    0
-1    1   1    -1    -2     -1      0   -2    0
-1   -1   1    -1     2      1      1   -1   -1

4. Find the weights required to perform the following classification using perceptron network. The vectors (1, 1, 1, 1) and (-1, 1, -1, -1) are belonging to the class (so have target value 1), vectors (1, 1, 1, -1) and (1, -1, -1, 1) are not belonging to the class (so have target value -1). Assume learning rate as 1 and initial weights as 0.

Solution: The truth table for the given vectors is given in Table 7.

Table 7
x1   x2   x3   x4   b    Target (t)
 1    1    1    1   1     1
-1    1   -1   -1   1     1
 1    1    1   -1   1    -1
 1   -1   -1    1   1    -1

Let w1 = w2 = w3 = w4 = b = 0 and the learning rate α = 1. Since the threshold θ = 0.2, the activation function is

$y = \begin{cases} 1 & \text{if } y_{in} > 0.2 \\ 0 & \text{if } -0.2 \le y_{in} \le 0.2 \\ -1 & \text{if } y_{in} < -0.2 \end{cases}$

The network architecture is shown in Figure 5. The net input is given by

$y_{in} = b + x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4$

The training is performed and the weights are tabulated in Table 8.

Table 8
Columns: inputs (x1 x2 x3 x4 b), target (t), net input (y_in), output (y), weight changes (Δw1 Δw2 Δw3 Δw4 Δb), weights (w1 w2 w3 w4 b); initial weights (0 0 0 0 0).

( x1  x2  x3  x4  b)    t   y_in   y    (Δw1 Δw2 Δw3 Δw4 Δb)    (w1  w2  w3  w4  b)
EPOCH-1
(  1   1   1   1  1)    1    0     0    (  1   1   1   1   1)    (  1   1   1   1   1)
( -1   1  -1  -1  1)    1   -1    -1    ( -1   1  -1  -1   1)    (  0   2   0   0   2)
(  1   1   1  -1  1)   -1    4     1    ( -1  -1  -1   1  -1)    ( -1   1  -1   1   1)
(  1  -1  -1   1  1)   -1    1     1    ( -1   1   1  -1  -1)    ( -2   2   0   0   0)
EPOCH-2
(  1   1   1   1  1)    1    0     0    (  1   1   1   1   1)    ( -1   3   1   1   1)
( -1   1  -1  -1  1)    1    3     1    (  0   0   0   0   0)    ( -1   3   1   1   1)
(  1   1   1  -1  1)   -1    3     1    ( -1  -1  -1   1  -1)    ( -2   2   0   2   0)
(  1  -1  -1   1  1)   -1   -2    -1    (  0   0   0   0   0)    ( -2   2   0   2   0)
EPOCH-3
(  1   1   1   1  1)    1    2     1    (  0   0   0   0   0)    ( -2   2   0   2   0)
( -1   1  -1  -1  1)    1    2     1    (  0   0   0   0   0)    ( -2   2   0   2   0)
(  1   1   1  -1  1)   -1   -2    -1    (  0   0   0   0   0)    ( -2   2   0   2   0)
(  1  -1  -1   1  1)   -1   -2    -1    (  0   0   0   0   0)    ( -2   2   0   2   0)

Thus, in the third epoch, all the calculated outputs become equal to targets and the network has converged. The network convergence can also be checked by forming separating line equations for separating positive response regions from zero and zero from negative response region.

5. Classify the two-dimensional input pattern shown in Figure 6 using perceptron network. The symbol "*" indicates the data representation to be +1 and "•" indicates data to be -1. The patterns are I-F. For pattern I, the target is +1, and for F, the target is -1.
86
= b +x1w1 + XZW2 +X3w3 +X4W4 +xsws
+XGW6 + X7WJ + xawa + X9W9
Supervised learning Network
1
3.12 Solved Problems
w;(new) = w;(old)+ O:IXS = 1 + 1 x -1 x 1 = 0 lnitiaJly all the weights and links are assumed to be
W6(new) == WG(oJd) + 0:0:6 = -1 + 1 X -1 X 1 = -2 small raridom values, say 0.1, and the learning rare is
87
II
also set to 0.1. Also here the least mean square error
=0+1 x0+1 x0+1 x 0+(-1) xO W?{new) = W?(old) + atx'] =I+ 1 x -1 x 1 = 0
+1xO+~Dx0+1x0+1x0+1xO wg(new) = ws(old)+ o:txs = 1 + 1 x -1 x -1 = 2
· miy Qe set. The weights are calculated until the least
m~ square error is obtained.
I
Yin= 0 fU9(new) == fV9(old) + etfX9 = 1 + 1 x -1 x -1 "== 2 The initial weighlS are taken to be WJ = W2 =
b[new) = b(old) +or= I+ 1 x -1 = 0 b = 0.1 and rhe learning rate ct = 0.1. For the first
Therefore, by applying the activation function the
input sample, XJ = 1, X2 = 1, t = 1, we calculate the
output is given by y = ff.J;n) = 0. Now since t '# y, The weighlS afrer presenting rhe second input sam~ net input as
the new weights are computed as pie are ~
I~
+ X6W6 + X7lll] + XflWB +X<) IV<) Figure 7 Network architecture.
.6.wz = a(t- y;,)X2
The initial weights are all assumed to be zero, i.e., =1+ l X 1+ 1 X l+ l X 1+ 1 X -1 + 1 X 1 lmplemenr OR function with bipolar inputs and
e = 0 and a = 1. The activation function is given by
+1x-1+1x1+(-1)x 1+(-1)x1 targelS using Adaline network.
t.b = o(t- y;,)
. -1 ifyrn < -0 I Therefore the output is given by y = f (y;u) = l. E = (r- y;,) 2 = (0.7) 2 = 0.49
I Since t f= y, rhe new weights are Table 10
For the first input sample, Xj = [l 1 L ~ I--1 -1 1 1 t The final weights after presenting ftrsr inpur sam·
1 1], t = l, the net input is calculated as
w,(new) == + o:oq == l + 1 x -1
WJ(old) X\== 0 Xj X:z
- pie are
fV2(new) == fV2(old) + O:tx]. = 1 + 1 X -1 Xl=0 1
-1 w= [0.17 0.17 0.17]
w3(new) = w3(old)+ O:b:J =I+\ X -1 X1= 0
y;, = b + Lx;w; -1
i=l
w~(new) = wq(old) + CtP:4 =-I+ 1 x -1 x t = -2 -1 -1 -1 and errorE= 0.49.
11
II
These calculations are performed for all the input Table 12 7. UseAdaline nerwork to train AND NOT funaion w,(new) = w,(old) + a(t- y,,)x:z
'
samples and the error is caku1ared. One epoch is
completed when all the input patterns are presented.
Summing up all the errors obtained for each input
Epoch
Epoch I
Total mean square error
3.02
with bipolar inputs and targets. Perform 2 epochs
of training.
= 0.2+ 0.2 X (-1.6) X I= -0.12
b(new) = b(old) + a(t- y;,) !
Epoch 2 1.938 Solution: The truth table for ANDNOT function = 0.2+ 0.2 (-1.6) = -0.12
sample during one epoch will give the mtal mean X '!:
Epoch 3 1.5506 with bipolar inputs and targets is shown in Table 13.
square error of that epoch. The network training is
Epoch 4 1.417 Table 13 Now we compute the error,
continued until this error is minimized to a very small
Epoch 5 1.377
value.
Adopting the method above, the network training E= (t- y;,) 2 = (-1.6) 2 = 2.56
~-
is done for OR function using Adaline network and
is tabulated below in Table 11 for a = 0.1. The final weights after presenting first input sample
-".~
The total mean square error aft:er each epoch is a<e w = [-0.12- 0.12- 0.12] and errorE= 2.56.
The operational steps are carried for 2 epochs
given as in Table 12. ,1 @ w1 == 0.4893 f::\_ ~
of training and network performance is noted. It is
Thus from Table 12, it can be noticed that as
training goes on, the error value gets minimized.
~~1'~Y Initially the weights and bias have assumed a random
tabulated as shown in Table 14.
Hence, further training can be continued for fur~ - value say 0.2. The learning rate is also set m 0.2. The
weights are calculated until the least mean square error
The total mean square error at the end of two
epochs is summation of the errors of all input samples
t:her minimization of error. The network archirecrure ~ is obtained. The initial weights are WJ = W1. b = =
of Adaline network for OR function is shown in as shown in Table 15.
0.2, and a= 0.2. For the fim input samplex1 = 1,
Figure 8. Figure 8 Network architecture of Adaline.
.::q = l, & = -1, we calculate the net input as Table15
Yin= b + XtWJ + X2lli2
)
Table 11
= 0.2+ I X 0.2+ I X 0.2= 0.6
Epoch Total mean square error
ll
Weights
Epoch I 5.71 :~
Net Epoch 2 2.43 ·'
Inputs T: Weight changes Now compute (t- Yin} = (-1- 0.6) = -1.6.
- - a<get input Wt b Enor
X] x:z I t Yin (r- Y;,l) i>wt
"'"" i>b (0.1 ""
0.1 0.1) (t- Y;,? Updacing ilie weights we obtain
Hence from Table 15, it is clearly undersrood rhat the .,
EPOCH-I w,-(new) = w,-(old) + o:(t- y,n)x; mean square error decreases as training progresses.
I I I I 0.3 0.7 0,07 0,07 om 0.17 0.17 0.17 0.49
Also, it can be noted rhat at the end of the sixth
'
\;
I -1 I I 0.17 0.83 0.083 -0.083 0.083 0.253 0.087 0.253 0.69 The new weights are obtained as
-I I I I 0.087 0.913 -0.0913 0,0913 0,0913 0.1617 0.1783 0.3443 0.83 epoch, rhe error becomes approximately equal to l.
-1 -1 1 -I 0.0043 -1.0043 0.1004 0.1004 -0.1004 0.2621 0.2787 0.2439 1.01 WI (new) ::::: w, (old) + ct(t- Jj )x,
11 The network architecture for ANDNOT function
EPOCH.2 = 0.2 + 0.2 X (-1.6) X I= -0.12 using Adaline network is shown in Figure 9.
1 I 1 1 0.7847 0.2153 0.0215 0.0215 0.0215 0.2837 0.3003 0.2654 0.046
I -1 1 I 0.2488 0.7512 0.7512 -0.0751 0.0751 0.3588 0.2251 0.3405 0.564 Table 14
-I I 1 I 0.2069 0.7931 -0.7931 0.0793 0.0793 0.2795 0.3044 0.4198 0.629
-1 -1 I -I Weights
-0.1641 -0.8359 0.0836 0.0836 -0.0836 0.3631 0.388 0.336 0.699 Ne<
Inputs Weight changes
EPOCH-3 _ _ Target input w, b Error
I I I I 1.0873 -0.0873 -0.087 -0.087 -0.087 0.3543 0.3793 0.3275 0.0076
t>w, M (0.2 ""
0.2 0.2) (t- Y;n)2
-I
I -1 I
I I
-1 -1 1 -1
I
I
0.3025 +0.6975
0.2827
0.0697 -0.0697 0.0697 0.4241 0.3096 0.3973
0.7173 -0.0717 0,0717 0,0717 0.3523 0.3813 0.469
-0.2647 -0.7353 0.0735 0.0735 -0.0735 0.4259 0.4548 0.3954
0.487
0.515
0.541
X[ X:Z
EPOCH-I
I t Y;" (t-y;rl)
"'""
I -I 0.6 -1.6 -0.32 -0.32 -0.32 -0.12 -0.12 -0.12 2.56
EPOCH-4
I I I I 0,076 -I I I -0.12 1.12 0.22 -0.22 0.22 0.10 -0.34 0.10 1.25
1.2761 -0.2761 -0.0276 -0.0276 -0.0276 0.3983 0.4272 0.3678
I -1 I I 0.3389 0.6611 0.0661 -0.0661 0.0661 0.4644 0.3611 0.4339 0.437 -I I I -I -0.34 -0.66 0.13 -0.13 -0.13 0.24 -0.48 -0.03 0.43
-I I 1 I 0.3307 0.6693 -0.0669 0.0669 0.0699 0.3974 0.428 0.5009 0.448 -1 -1 I -I 0.21 -1.2 0.24 0.24 -0.24 0.48 -0.23 -0.27 1.47
-1 -1 I -I -0.3246 -0.6754 0.0675 0.0675 -0.0675 0.465 0.4956 0.4333 0.456 EPOCH-2
EPOCH-5 -I -0.02 -0.98 -0.195 -0.195 -0.195 0.28 -0.43 -0.46 0.95
I I I I 1.3939 -0.3939 -0.0394 -0.0394 -0.0394 0.4256 0.4562 0.393 0.155
I -1 I I 0.25 0.76 0.15 -0.15 0.15 0.43 -0.58 -0.31 0.57
I -1 I I 0.3634 0.6366 0.0637 -0.0637 0.0637 0.4893 0.3925 0.457 0.405
-I I I I 0.3609 0.6391 -0.0639 0.0639 0.0639 0.4253 0.4654 0.5215 0.408 -I I I -I -1.33 0.33 -0.065 0.065 0.065 0.37 -0.51 -0.25 0.106
-1 -1 I -I -0.3603 -0.6397 0.064 0.064 -0.064 0.4893 0.5204 0.4575 0.409 -1 -1 I -I -0.11 -0.90 0.18 0.18 -0.18 0.55 -0.38 0.43 0.8
I
-~
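The Adaline training used in Problems 6 and 7 (net input, error t − y_in, delta-rule update, squared error summed over an epoch) can be sketched as follows. The function name `train_adaline` and its argument layout are assumptions made for this illustration; the data and parameters are those stated for the OR function with bipolar inputs and targets (all weights and bias 0.1, α = 0.1).

```python
def train_adaline(samples, init=0.1, alpha=0.1, epochs=1):
    """Delta-rule (least mean square) training of a single Adaline.

    samples: list of (inputs, target) pairs.
    Returns (weights, bias, epoch_errors), where epoch_errors[k] is the
    sum of (t - y_in)^2 over all samples presented in epoch k.
    """
    n_inputs = len(samples[0][0])
    weights = [init] * n_inputs
    bias = init
    epoch_errors = []
    for _ in range(epochs):
        total = 0.0
        for x, t in samples:
            y_in = bias + sum(xi * wi for xi, wi in zip(x, weights))
            err = t - y_in
            # w_i(new) = w_i(old) + alpha * (t - y_in) * x_i
            weights = [wi + alpha * err * xi for wi, xi in zip(weights, x)]
            bias += alpha * err
            total += err ** 2
        epoch_errors.append(total)
    return weights, bias, epoch_errors

# OR function with bipolar inputs and targets
or_samples = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
weights, bias, epoch_errors = train_adaline(or_samples, init=0.1, alpha=0.1, epochs=1)
```

With these settings one epoch gives a summed squared error of about 3.02, in line with the Epoch 1 entry of Table 12. Unlike the perceptron rule, the update is applied for every sample, since the error is measured on the net input rather than on the thresholded output.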
...
b.,o learning rate a equal to 0.5: =0.2+0.5(-1-0.55) X 1 =-0.575
"'22 (new)= "'22 (old)+ a(t- z;" 2)"2
x, x ') w1=o.ss
Calculate net input to the hidden units:
1 y =0.2+0.5(-1-0.45)x 1=-0.525'
Zinl = + XJ WlJ + X2U/2J
b1
y
>Nz"'_o.~ = 0.3 + 1 X 0.05 + 1 X 0.2 = 0.55
b2 (new]= b2 (old)+ a(t- z;d
x, x, Zin2 = /n. +X} WJ2 + xiW22 = 0.15+0.5(-1-0.45)=-0.575
= 0.15 + 1 X 0.1 + 1 X 0.2 = 0.45 All the weights and bias between the input layer and
Figure 9 Network architecrure for ANDNOT hidden layer are adjusted. This completes the train-
function using Adaline nerwork.. Calculate the output z 1,Z2 by applying the activa- ~::-1.08
ing for the first epoch. The same process is repeated
tions over the net input computed. The activation until the weight converges. It is found that the weight
8 Using Madaline network, implement XOR func- function is given by Figure 11 Madaline network for XOR function
tion with bipolar inputs and targets. Assume the converges at the end of 3 epochs. Table 17 shows the
(final weights given).
required parameters for training of the network. I ifz;,<:O training performance of Madaline network for XOR y
! (Zir~) = ( -1 ifz;11 <0 function.
Solution: The uaining pattern for XOR function is The network architecture for Madaline network
given in Table 16. Hence, with final weights for XOR function is shown in
Table 16 z1 = j(z;,,) = /(0.55) = I Figure 11.
z, = /(z;,,) = /(0.45) = 1 9._}Jsing back-propagation_ network, find the new
• After computing the output of the hidden units, / weights ~or the ~et shown in Figure 12. It is pre- .0.5
, semed wuh the mput pattern [0, 1] and the target 0.3,
then find the net input entering into the output
output is 1. Use a learning rare a = 0.25 and
unit:
binary sigmoidal activation function.
Yin= b3 +zJVJ +z2112
Solution: The new weights are calculated based
The Madaline Rule I (MRI) algorithm in which the = 0.5 + 1 X 0.5 + I X 0.5 = 1.5 on the training -algorithm in Section 3.5.4. The
-oj
weights between the hidden layer and ourpur layer Figure 12 Ne[Work.
remain fixed is used for uaining the nerwork. Initializ- • Apply the activation function over the net input initial weights are [v11 v11 vod = [0.6 -0.1 0.3],
ing the weights to small random values, the net\York Yin to calculate the output y.
Table 17
architecture is as shown in Figure 10, widt initial
y = f(;y;,) = /(1.5) = 1 Inputs Target
weights. From Figure 10, rhe initial weights and bias b, b2
X~ (t} wn
are [wu "'21 bd = [0.05 0.2 0.3], [wn "'22 b,] = Since t f:. y, weight updation has to be performed. Zinl Zinl ZJ Zl Y;11 Y "'21 W12
'""
[0.1 0.2 0.15] and [v 1 v, b3] = [0.5 0.5 0.5]. For fim Also since t = -1, the weights are updated on z1
EPOCH-I
and Zl that have positive net input. Since here both
I I 1 -1 0.55 0.45 I 1 1.5 1-0.725 -0.58 -0.475-0.625 -0.525 -0.575
1lbj=0.3 net inputs Zinl and Zinl are positive, updating the 1-1 I I -0.625 -0.675 -1-1 -0.5 -1 0.0875-1.39 0.34 -0.625 -0.525 -0.575
weights and bias on both hidden units, we obtain -I 1 1 I -1.1375 -0.475 -I -1 -0.5 -I 0.0875 -1.39 0.34 -1.3625 0.2125 0.1625
Wij(new) = Wij(old) + a(t- Zin)x; -1-1 1 -1 1.6375 1.3125 1 1 1.5 1 1.4065 -0.069 -0.98 -0.207 1.369 -0.994
bj(new) = bj(old) + a(t- z;"j) EPOCH-2
1 I I -1 0.3565 0.168 1 I 1.5 I 0.7285 -0.75 -1.66 -0.791 -0.207 -1.58
y
This implies: 1-1 I 1 -0.1845-3.154 -1-1-0.5-1 1.3205-1.34 -1.068-0.791 0.785 -1.58
-1 1 I 1 -3.728 -0.002 -1-1-0.5-1 1.3205 -1.34 -1.068- 1.29 0.785 -1.08
WI! (new)= WI! (old)+ a(t- ZinJ)XJ
-1-1 I -1 -1.0495-1.071 -1-1-0.5-1 1.3205 -1.34 -1.068-1.29 1.29 -1.08
=0.05+0.5(-1-0.55) X 1 = -0.725
EPOCH-3
WJ2(new) = WJ2(old) + a(t- Zin2)Xl 1.32 -1.34 -1.07 - 1.29 1.29 -1.08
1 1 1 -1 -1.0865-1.083 -1-1-0.5-1
'bz =0.15 =0.!+0.5(-1-0.45) X I =-0.625 -1.34 -1.07 -1.29 1.29 -1.08
1-1 I I 1.5915-3.655 1-1 0.5 I 1.32
b1(new)= b1(old)+a(t-z;"Il -I 1 I I -3.728 1.501 -1 1 0.5 1 1.32 -1.34 -1.07 -1.29 1.29 -1.08
Figure 10 Nerwork archicecrure ofMadaline for 1.29
=0.3+0.5( -I- 0.55) = -0.475 1-1 1 -1 -1.0495-1.701 -1-1-0.5-1 1.32 -1.34 -1.07 -1.29 -1.08
XOR funcr.ions .(initial weights given).
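The first training step of the Madaline worked in Problem 8 can be checked with a short sketch of one MRI update. The code assumes the usual statement of Madaline Rule I: the output-layer weights stay fixed; when t = -1 every hidden Adaline with positive net input is updated, and when t = 1 only the hidden Adaline whose net input is closest to zero is updated. The names `sign` and `mri_step` and the list layout are illustrative assumptions.

```python
def sign(x):
    """Bipolar step activation: 1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1

def mri_step(x, t, w, b, v, b_out, alpha):
    """One Madaline Rule I update for a single input pattern.

    w[j]     : weights into hidden Adaline j
    b[j]     : bias of hidden Adaline j
    v, b_out : fixed weights and bias of the output unit
    Returns (new_w, new_b, z_in, y).
    """
    z_in = [b[j] + sum(xi * wij for xi, wij in zip(x, w[j])) for j in range(len(w))]
    z = [sign(s) for s in z_in]
    y = sign(b_out + sum(zj * vj for zj, vj in zip(z, v)))
    if y == t:
        return w, b, z_in, y  # no update needed
    if t == 1:
        # update only the unit whose net input is closest to zero
        targets = [min(range(len(z_in)), key=lambda j: abs(z_in[j]))]
    else:
        # update every unit with positive net input
        targets = [j for j, s in enumerate(z_in) if s > 0]
    new_w = [list(row) for row in w]
    new_b = list(b)
    for j in targets:
        err = t - z_in[j]
        new_w[j] = [wij + alpha * err * xi for wij, xi in zip(w[j], x)]
        new_b[j] = b[j] + alpha * err
    return new_w, new_b, z_in, y

# Initial weights of Figure 10: [w11, w21], [w12, w22]; biases b1, b2
w = [[0.05, 0.2], [0.1, 0.2]]
b = [0.3, 0.15]
new_w, new_b, z_in, y = mri_step([1, 1], -1, w, b, v=[0.5, 0.5], b_out=0.5, alpha=0.5)
```

For the first XOR pattern (x1 = 1, x2 = 1, t = -1) both hidden net inputs are positive, so both hidden units are updated, which matches the hand calculation in the problem.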
I Compute rhe final weights of the network:
[v12 vn "02l = [-0.3 0.40.5] and [w, w, wo] = [0.4 This implies
0.1 -0.2], and the learning' rate is a = 0.25. Acti- v11(new) = VIt(old)+b.vJI = 0.6 + 0 = 0.6
!, = (I - 0.5227) (0.2495) = 0.1191
vation function used is binary sigmoidal activation vn(new) = vn(old)+t.v12 = -0.3 + 0 = -0.3 .
function and is given by Find the change5~Ulweights be~een hidden and "21 (new) = "21 (oldl+<'>"21
output layer:.
I = -0.1 + 0.00295 = -0.09705
f(x) = I+ ,-• <'>wi = a!1 ZI = 0.25 X 0.1191 X 0.5498 vu(new) = vu(old)+t>vu
,-- 0.0164 ::>
= 0.4 + 0.0006125 = 0.4006125
Given the output sample [x 1, X2] = [0, 1] and target
t= 1, t.w, = a!1 Z2 = 0.25 X 0.1191 X 0.7109 w,(new) = w1(old)+t.w, = 0.4 + 0.0164,
Calculate the net input: For zt layer ---=o:o2iT7 = 0.4164
Figure 13 Network.
<'>wo = a! 1 = 0.25 x 0.1191 = 0.02978 w2(now) = w,(old)+<'>W2 = 0.1 + 0.02!17
Zinl = !lQJ + XJ + X2V21
V11
Compute the error portion 8j between input and = 0.!2!17 For z2layer
= 0.3+0 X 0.6+ I X -0.1 = 0.2 hidden layer (j = 1 to 2): VOl (new) = VOl (old)+<'>•OI = 0.3 + 0.00295
For z2 layer ~f'( = 0.30295
z;,2 = V02 + XJVJ2 + X2V22
Dj= O;,j Zinj)
= 0.5 + (-1) X -0.3 +I X 0.4 = l.2
vo2(new) = 1102(old)+.6.vo2
Zjril = VQ2 + Xj V!2 + X2.V1.2 '
O;,j= I:okwjk = 0.5 + 0.0006125 = 0.5006!25 Applying activation to calculate the output, we
= 0.5 + 0 X -0.3 +I X 0.4 = 0.9 k=!/
.,.(new)= .,.(old)+8wo = -0.2 + 0.02976 obtain
8;nj = 81 Wj! I·.' only one output neuron]
Applying activation co calculate Ute output, we 1_ 1 _ t'0.4
obrain ------
=>!;,I= !1 wn = 0.1191.K0ft = 0.04764
-~
= -0.!7022
Thus, the final weights hav~ been computed for the
t-"inl
ZI =f(z; 1l = - - - = - - = -0.!974
n 1 + t'-z:;nl 1 + /1.4
I
ZI = f(z;,,) = - - - = - - - = 0.5498
1 + e-z.o.1 1 + t-0.2
I =>O;,z = Ot Wzl = 0.1191
_,- X 0.1 = 0.01191
_-:~
network shown in Figure 12.
zz =/(z;,2) = -
1- t'-Z:,;JL
- - = - -1- 2 = 0.537
l - t'-1.2
Error, 81 =O;,,f'(Zirll). 1+t-Zin2 1 +e-.
I 1 19. Find rhe new weights, using back-propagation
z2 = f(z· 2l = - - - = - - - = 0.7109 j'(z;,I) = f(z;,,) [1- f(z;,,)] network for the network shown in Figure 13.
m 1 + e-Zilll 1 + e-0.9 Calculate lhe net input entering the output layer.
= 0.5498[1- 0.5498] = 0.2475 The network is presented with the input pat- For y layer
Calculate the net input entering the output layer. 0 1 =8;,1/'(z;,J) tern l-1, 1] and the target output is +1. Use a
For y layer
= 0.04764 X 0.2475 = 0.0118
learning rate of a = 0.25 and bipolar sigmoidal Yin= WO + ZJWJ +zzWz
activation function. = -0.2 + (-0.1974) X 0.4 + 0.537 X 0.1
Ji11 = WO+ZJWJ +z2wz Error, Oz =0;,a/'(z;,2) Sn_ly.tion: The initial weights are [vii VZI vod = [0.6 = -0.22526
= -0.2 + 0.5498 X 0.4 + 0.7109 X 0.1
j'(z;,) = f(z;d [1 - f(z;,2)] ·0.1 0.3], [v12 "22 vo2l = [ -0.3 0.4 0.5] and [w,
= 0.09101 Wz wo] = [0.4 0.1 -0.2], and die learning rme is Applying activations to calculate the output, we
= 0.7109[1 - 0.7!09] = 0.2055
Applying activations to calculate the output, we
a= 0.25. obtain
Oz =8;,zf' (z;,2) Activation function used is binary sigmoidal 1 1 0.22526
obtain
= 0.01191 X 0.2055 = 0.00245 activacion function and is given by 1 - t'- '" _-_--",=< -0.1!22
1 1
y = f(y;,) = l + t'-y,.. = 1 + 11-22526
Y = f{y;n) = ~ = 1 + e-0.09101 = 0.5227 Now find rhe changes in weights between input 2 1 -e-x
and hidden layer: f (x )----1---
- 1 +e-x - 1 +e-x Compute the error portion 8k:
Compute the error portion 811.:
.6.v 11 =a0 1x1 =0.25 x0.0118 x0=0 Given the input sample [x1, X21 = [-1, l] and target !, = (t, - yllf' (y;,,)
!,= (t,- y,)f'(y,,.,) <'>"21 = a!pQ=0.25 X 0.0118 X I =0.00295 t= 1:
Now
'
----------------
I f'(J;.) = 0.5[1 + f(J;,)] [I- f(J;,)]
= 0.5[! - 0.1122][1 + 0.1122] = 0.4937 .
-- .
-~~
ll:"22 =a!2X'2 =0.25 X 0.00245 X I =0.0006125 I = Q.3 + (-1) X 0.6 +I X -0.1 = -0.4
!' (J;,) = 0.2495 <'>v02 =a!2=0.25 x 0.00245 =0.0006!25
I '-..
-·---
)
l _...-/
l
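The single back-propagation update of Problem 9 (binary sigmoidal activation, input [0, 1], target 1, α = 0.25) can be verified numerically with the sketch below. It assumes a network with two inputs, two hidden units and one output unit, as in that problem; the names `binary_sigmoid` and `backprop_step` and the nested-list weight layout are illustrative assumptions.

```python
import math

def binary_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, t, v, v0, w, w0, alpha):
    """One back-propagation update for a network with a single output unit.

    v[i][j] : weight from input i to hidden unit j
    v0[j]   : bias of hidden unit j
    w[j]    : weight from hidden unit j to the output unit
    w0      : bias of the output unit
    """
    n_hidden = len(w)
    z_in = [v0[j] + sum(x[i] * v[i][j] for i in range(len(x))) for j in range(n_hidden)]
    z = [binary_sigmoid(s) for s in z_in]
    y = binary_sigmoid(w0 + sum(z[j] * w[j] for j in range(n_hidden)))
    # Error term at the output: (t - y) * f'(y_in), with f' = f (1 - f)
    delta_k = (t - y) * y * (1.0 - y)
    # Error terms at the hidden units, using the old output weights
    delta_j = [delta_k * w[j] * z[j] * (1.0 - z[j]) for j in range(n_hidden)]
    new_w = [w[j] + alpha * delta_k * z[j] for j in range(n_hidden)]
    new_w0 = w0 + alpha * delta_k
    new_v = [[v[i][j] + alpha * delta_j[j] * x[i] for j in range(n_hidden)]
             for i in range(len(x))]
    new_v0 = [v0[j] + alpha * delta_j[j] for j in range(n_hidden)]
    return new_v, new_v0, new_w, new_w0, y

# Initial weights of Problem 9
v = [[0.6, -0.3],   # v11, v12
     [-0.1, 0.4]]   # v21, v22
v0 = [0.3, 0.5]     # v01, v02
w = [0.4, 0.1]      # w1, w2
w0 = -0.2
new_v, new_v0, new_w, new_w0, y = backprop_step([0, 1], 1, v, v0, w, w0, alpha=0.25)
```

Because x1 = 0, the weights leaving the first input do not change; only the weights from the second input, the hidden biases and the output-layer weights move. For the bipolar sigmoid of Problem 10, the activation and its derivative 0.5[1 + f][1 − f] would have to be swapped in.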
3.14 Exercise Prob!ems 95
94 Supervised learning Network
13. State the testing algorithm used in perceptron 34. What are the activations used in back-
This implies f>'OI =•01 = 0.25 X 0.1056';'0.0264 propagation network algorithm?
algorithm.
t,.,,=•o 2x, =0.25 x 0.0195 x -1 =-0.0049 35. What is meant by local minima and global
,, = (l + 0.1122) (0.4937) = 0.5491 14. How is _the linear separability concept imple-
[,."22 = cl02X, =0.25 X 0.0195 X 1 =0.0049 mented using perceprron network training? minima?
Find the changes in weights between hidden and l>'02 = •o2= 0.25 X 0.0195 =0.0049 3i5. · Derive the generalized delta learning rule.
15. Define perceprron learning rule.
output layer:
16. Define d_dta rule. 37. Derive the derivations of the binary and bipolar
Comp'Lite the final weights of the nerwork:
L\w1 = a81 ZJ = 0.25 X 0.5491 X -0.1974 1.1~ SGlte the error function for delta rule. sigmoidal activation function.
= -0.0271 18. What is the drawback of using optimization 38. What are the factors that improve the conver-
""(new) = "" (old)+t., 11 = 0.6- 0.0264
gence of learning in BPN network?
/).w, = •01 Z2 = 0.25 X 0.549! X 0.537 = 0.0737 = 0.5736 algorithm?
39. What is meant by incremenrallearning?
L\wo = a81 = 0.25 x 0.5491 = 0.1373 ,,(n<w) = ,,(old)+t.,, = -0.3-0.0049 19. What is Adaline?
40. Why is gradient descent method adopted to
20. Draw the model of an Adaline network.
Compute the error portion Bj beMeen input and = -0.3049 minimize error?
21. Explain the training algorithm used in Adaline
hidden layer (j = 1 to 2): "21 (new) = "21 (old)+t...., 1 = -0.1 + 0.0264 41. What are the methods of initialization of
network.
= -0.0736 weights?
81 = 8;/ljj' (z;nj) 22. How is a Madaline network fOrmed?
m ...,,(new) = "22(old)+t."22 = 0.4 + 0.0049 42. What is the necessity of momentum factor in
23. Is it true that Madaline network consists of many
8inj = L 8k Wjk = 0.4049 perceptrons?
weight updation process?
43. Define "over fitting" or "over training."
~I 24. Scare the characteristics of weighted interconnec-
WI (new) = WI (old)+t.w 1 = 0.4- 0.0271
._ 8inj = 81 WjJ [· •· only one output neuron] tions between Adaline and Madaline. 44. State the techniques for proper choice oflearning
= 0.3729 rate.
=>8in1 =81 WJJ = 0.5491 X 0.4 = 0.21964
w,(n<w) = w,(old)+t.w, = 0.1 + 0.0737
25. How is training adopted in Madaline network
using majority vme rule? 45. What are the limitations of using momentum
( =>o;., =o, ""' = o.549I x o.1 = o.05491 = 0.1737 factor?
Error, 81 =8;,J/'(z;nJ) = 0.21964 X 0.5 26. State few applications of Adaline and Madaline;
1 ''' (n<w) = "OI (old)+l>'OI = 0.3 + 0.0264 46. How many hidden layers can there be in a neural
27. What is meant by epoch in training process?
~
X (I +0.1974)(1- 0.1974) = 0.1056 network?
= 0.3264 28. Wha,r is meant by gradient descent meiliod?
Error, 82 =8;112/'(z;,2) = 0.05491 X 0.5 47. What is the activation function used in radial
"oz(n<w) = '02(old)+t..,, = 0.5 + 0.0049 29. State ilie importance of back-propagation
X (1- 0.537)(1 + 0.537) = 0.0195 basis function network?
= 0.5049 algorithm.
48. Explain the training algorithm of radial basis
Now find the changes in weights berw-een input wo(new) = wo(old)+t.wo = -0.2 + 0.1373 30. What is called as memorization and generaliza- function network.
and hidden layer: = -0.0627 tion? 49. By what means can an IIR and an FIR filter be
31. List the stages involved in training of back- formed in neural network?
f'l.V]J =Cl:'8]X1 =0.25 X 0.1056 X -1 = -0.0264
Thus, the final weight has been computed for the propagation network.
/).'21 =•OiX, =0.25 X 0.1056 X 1 =0.0264 50. What is the importance of functional link net-
network shown in Figure 13. 32. Draw the architecture of back-propagation algo· work?
I 3.13 Review Questions
rithm.
33. State the significance of error portions 8k and Oj
51. Write a short note on binary classification tree
neural network.
1. What is supervised learning and how is it differ- 7. Smte the activation function used in perceprron in BPN algorithm.
52. Explain in detail about wavelet neural network.
em from unsupervised learning? network.
2. How does learning take place in supervised 8. What is the imporrance of threshold in percep-
learning? tron network? I 3.14 Exercise Problems
3. From a mathematical point of view, what is the 9. Mention the applications of perceptron network.
1. Implement NOR function using perceptron are belonging to the class (so have targ.etvalue 1),
process of learning in supervised learning? 10. What are feature detectors?
network for bipolar inputs and targets. vector (-1, -1, -1, 1) and (-1, -1, 1 1) are
4. What is the building block of the perceprron? 11. With a neat flowchart, explain the training not belonging to the class_ (so have target· value
2. Find the weights required to perform the fol-
5. Does perceprron require supervised learning? If process of percepuon network. -1). Assume learning rate 1 and initial weighlS
lowing classifications using perceptron network
no, what does it require? 12. What is the significance of error signal in per- "0.
The vectors (1, 1, -1, -1) ,nd (!,-I. 1, -I)
6. List the limitations of perceptron. ceptron network?
,L
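As a check on the worked back-propagation example earlier in this section (input [-1, 1], target 1, α = 0.25, bipolar sigmoid), the single training step can be sketched in Python. This is a minimal sketch and not the textbook's own code: the function names and the list layout `v[i][j]` (weight from input i to hidden unit j) are assumptions made for illustration, and the initial weights are read off the "old" values in the final-weight lines.

```python
import math

def bipolar_sigmoid(x):
    """Bipolar sigmoid f(x) = (1 - e^-x) / (1 + e^-x), with range (-1, 1)."""
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))

def bipolar_sigmoid_prime(fx):
    """Derivative expressed through the output: f'(x) = 0.5 (1 + f)(1 - f)."""
    return 0.5 * (1.0 + fx) * (1.0 - fx)

def bpn_step(x, t, v, v0, w, w0, alpha):
    """One training step for a network with one hidden layer and one output."""
    n_in, n_hidden = len(x), len(v0)
    # Forward pass: hidden outputs z_j, then the network output y
    z = [bipolar_sigmoid(v0[j] + sum(x[i] * v[i][j] for i in range(n_in)))
         for j in range(n_hidden)]
    y = bipolar_sigmoid(w0 + sum(z[j] * w[j] for j in range(n_hidden)))
    # Error portions: delta_k at the output, delta_j at each hidden unit
    delta_k = (t - y) * bipolar_sigmoid_prime(y)
    delta_j = [delta_k * w[j] * bipolar_sigmoid_prime(z[j])
               for j in range(n_hidden)]
    # Weight changes added to the old weights
    new_w = [w[j] + alpha * delta_k * z[j] for j in range(n_hidden)]
    new_w0 = w0 + alpha * delta_k
    new_v = [[v[i][j] + alpha * delta_j[j] * x[i] for j in range(n_hidden)]
             for i in range(n_in)]
    new_v0 = [v0[j] + alpha * delta_j[j] for j in range(n_hidden)]
    return new_v, new_v0, new_w, new_w0

# Values from the worked example: v11=0.6, v12=-0.3, v21=-0.1, v22=0.4
new_v, new_v0, new_w, new_w0 = bpn_step(
    x=[-1.0, 1.0], t=1.0,
    v=[[0.6, -0.3], [-0.1, 0.4]], v0=[0.3, 0.5],
    w=[0.4, 0.1], w0=-0.2, alpha=0.25)
print(new_v, new_v0, new_w, new_w0)
```

Up to rounding in the fourth decimal place, the printed weights should agree with the final weights listed in the worked example.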
The neuron
• The sigmoid equation is what is typically used as a transfer
function between neurons. It is similar to the step function,
but is continuous and differentiable.
\[
\sigma(x) = \frac{1}{1 + e^{-x}} \tag{1}
\]
[Figure: plot of σ(x) for x from -5 to 5]
The derivative of the sigmoid transfer function
\[
\begin{aligned}
\frac{d}{dx}\sigma(x) &= \frac{d}{dx}\left[\frac{1}{1 + e^{-x}}\right] \\
&= \frac{e^{-x}}{(1 + e^{-x})^2} \\
&= \frac{(1 + e^{-x}) - 1}{(1 + e^{-x})^2} \\
&= \frac{1 + e^{-x}}{(1 + e^{-x})^2} - \left(\frac{1}{1 + e^{-x}}\right)^2 \\
&= \sigma(x) - \sigma(x)^2
\end{aligned}
\]
\[
\sigma' = \sigma(1 - \sigma)
\]
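The identity σ′ = σ(1 − σ) can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the central-difference helper is an added check, not part of the slides.

```python
import math

def sigmoid(x):
    """Sigmoid transfer function: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative from the closed form sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x), used only as a check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, sigmoid_prime(x), numerical_derivative(sigmoid, x))
```

The two printed columns should agree to several decimal places, which is what makes the closed form convenient: the derivative costs one extra multiplication once σ(x) is known.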
Single input neuron
[Figure 2: a single neuron with input ξ, weight ω, transfer function σ, and output O]
In the above figure (2) you can see a diagram representing a single
neuron with only a single input. The equation defining the figure is:
O = σ(ξω)
With a bias term θ added, the equation becomes:
O = σ(ξω + θ)
Multiple input neuron
[Figure: a neuron with inputs ξ1, ξ2, ξ3, weights ω1, ω2, ω3, bias θ, a summation node Σ, transfer function σ, and output O]
O = σ(ω1 ξ1 + ω2 ξ2 + ω3 ξ3 + θ)
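The neuron equation above translates directly into a few lines of Python. This is a sketch for illustration; `neuron_output` and its parameter names are hypothetical, and it accepts any number of inputs rather than exactly three.

```python
import math

def neuron_output(inputs, weights, theta):
    """Output O = sigma(sum_i(w_i * xi_i) + theta) of a single neuron."""
    weighted_sum = sum(w * xi for w, xi in zip(weights, inputs)) + theta
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Three inputs, as in the figure; the numbers are arbitrary illustration values
print(neuron_output(inputs=[1.0, 0.5, -1.0], weights=[0.2, -0.4, 0.1], theta=0.3))
```

With a single input and a single weight, the same function reduces to the single input neuron O = σ(ξω + θ).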
A neural network
[Figure: a layer]
[Figure: a neural network with layers I, J, and K]
The back propagation algorithm
Notation
• x_j^ℓ : Input to node j of layer ℓ
• W_ij^ℓ : Weight from layer ℓ − 1 node i to layer ℓ node j
• σ(x) = 1 / (1 + e^(−x)) : Sigmoid transfer function
• θ_j^ℓ : Bias of node j of layer ℓ
• O_j^ℓ : Output of node j in layer ℓ
• t_j : Target value of node j of the output layer
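The error calculation
The error E is half the sum of squared differences between outputs and targets over the output layer K, and we need its derivative with respect to each weight:
\[
E = \frac{1}{2}\sum_{k \in K} (O_k - t_k)^2
\]
Output layer node
\[
\begin{aligned}
\frac{\partial E}{\partial W_{jk}} &= \frac{\partial}{\partial W_{jk}} \frac{1}{2}\sum_{k \in K} (O_k - t_k)^2 \\
&= (O_k - t_k)\frac{\partial}{\partial W_{jk}} O_k \\
&= (O_k - t_k)\frac{\partial}{\partial W_{jk}} \sigma(x_k) \\
&= (O_k - t_k)\,\sigma(x_k)(1 - \sigma(x_k))\frac{\partial x_k}{\partial W_{jk}} \\
&= (O_k - t_k)\,O_k(1 - O_k)\,O_j
\end{aligned}
\]
For notation purposes I will define δk to be the expression
(Ok − tk)Ok(1 − Ok), so we can rewrite the equation above as
\[
\frac{\partial E}{\partial W_{jk}} = O_j \delta_k
\]
where
\[
\delta_k = O_k(1 - O_k)(O_k - t_k)
\]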
Hidden layer node
\[
\begin{aligned}
\frac{\partial E}{\partial W_{ij}} &= \frac{\partial}{\partial W_{ij}} \frac{1}{2}\sum_{k \in K} (O_k - t_k)^2 \\
&= \sum_{k \in K} (O_k - t_k)\frac{\partial}{\partial W_{ij}} O_k \\
&= \sum_{k \in K} (O_k - t_k)\frac{\partial}{\partial W_{ij}} \sigma(x_k) \\
&= \sum_{k \in K} (O_k - t_k)\,\sigma(x_k)(1 - \sigma(x_k))\frac{\partial x_k}{\partial W_{ij}} \\
&= \sum_{k \in K} (O_k - t_k)\,O_k(1 - O_k)\frac{\partial x_k}{\partial O_j} \cdot \frac{\partial O_j}{\partial W_{ij}} \\
&= \sum_{k \in K} (O_k - t_k)\,O_k(1 - O_k)\,W_{jk}\frac{\partial O_j}{\partial W_{ij}} \\
&= \frac{\partial O_j}{\partial W_{ij}} \sum_{k \in K} (O_k - t_k)\,O_k(1 - O_k)\,W_{jk} \\
&= O_j(1 - O_j)\frac{\partial x_j}{\partial W_{ij}} \sum_{k \in K} (O_k - t_k)\,O_k(1 - O_k)\,W_{jk} \\
&= O_j(1 - O_j)\,O_i \sum_{k \in K} (O_k - t_k)\,O_k(1 - O_k)\,W_{jk}
\end{aligned}
\]
For an output layer node:
\[
\frac{\partial E}{\partial W_{jk}} = O_j \delta_k
\quad\text{where}\quad
\delta_k = O_k(1 - O_k)(O_k - t_k)
\]
For a hidden layer node:
\[
\frac{\partial E}{\partial W_{ij}} = O_i \delta_j
\quad\text{where}\quad
\delta_j = O_j(1 - O_j)\sum_{k \in K} \delta_k W_{jk}
\]
What about the bias?
If we incorporate the bias term θ into the equation you will find
that
\[
\frac{\partial O}{\partial \theta} = O(1 - O)\frac{\partial \theta}{\partial \theta}
\]
and because ∂θ/∂θ = 1 we view the bias term as output from a
node which is always one.
This holds for any layer ℓ we are concerned with; a substitution
into the previous equations gives us that
\[
\frac{\partial E}{\partial \theta} = \delta_\ell
\]
(because the bias node's constant output of one replaces the output from the “previous layer”)
The back propagation algorithm
1. Run the network forward with your input data to get the
network output
2. For each output node compute
\[
\delta_k = O_k(1 - O_k)(O_k - t_k)
\]
3. For each hidden node calculate
\[
\delta_j = O_j(1 - O_j)\sum_{k \in K} \delta_k W_{jk}
\]
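The three steps above can be sketched for a network with one hidden layer J and an output layer K. This is a minimal sketch under stated assumptions, not the author's implementation: the function names and the nested-list layout (`w_ij[i][j]`, `w_jk[j][k]`) are hypothetical, and the input nodes are assumed to pass their values through unchanged, so O_i is simply the input.

```python
import math

def sigmoid(x):
    """Sigmoid transfer function sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_ij, theta_j, w_jk, theta_k):
    """Step 1: run the network forward; returns hidden outputs O_j and outputs O_k."""
    o_j = [sigmoid(sum(o_i * w_ij[i][j] for i, o_i in enumerate(inputs)) + theta_j[j])
           for j in range(len(theta_j))]
    o_k = [sigmoid(sum(o * w_jk[j][k] for j, o in enumerate(o_j)) + theta_k[k])
           for k in range(len(theta_k))]
    return o_j, o_k

def deltas(o_j, o_k, targets, w_jk):
    """Steps 2 and 3: delta_k for output nodes, then delta_j for hidden nodes."""
    delta_k = [o * (1.0 - o) * (o - t) for o, t in zip(o_k, targets)]
    delta_j = [o * (1.0 - o) * sum(d * w_jk[j][k] for k, d in enumerate(delta_k))
               for j, o in enumerate(o_j)]
    return delta_j, delta_k

def gradients(inputs, o_j, delta_j, delta_k):
    """dE/dW_ij = O_i * delta_j and dE/dW_jk = O_j * delta_k.

    The bias gradients are the deltas themselves (dE/dtheta = delta).
    """
    grad_w_ij = [[o_i * d for d in delta_j] for o_i in inputs]
    grad_w_jk = [[o * d for d in delta_k] for o in o_j]
    return grad_w_ij, grad_w_jk
```

What is done with these gradients is not shown in the steps above; a common choice, stated here as an assumption, is gradient descent, where each weight is moved by a small learning rate times the negative of its gradient.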
Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF)
BOW (Count vectorizer)
Tfidf
Term Frequency
Length of every vector = size of vocabulary
Concept: A word that is present in every document is the least relevant.
• Until now, TF-IDF has been applied to single words only.
• A TF-IDF vectorizer can also be applied to n-grams (e.g., word bigrams), in
which case it scores the relevance of each word bigram.
Example #2
• Example: Suppose we are given 4 reviews of an Italian pasta dish.
• Review 1: This pasta is very tasty and affordable.
• Review 2: This pasta is not tasty and is affordable.
• Review 3: This pasta is delicious and cheap.
• Review 4: Pasta is tasty and pasta tastes good.
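A bag-of-words (count vectorizer) representation of the four reviews can be sketched with the standard library alone. The tokenization rule (lowercase, letters only) and the alphabetical vocabulary order are assumptions made for this illustration; a library vectorizer may tokenize differently.

```python
import re

reviews = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is delicious and cheap.",
    "Pasta is tasty and pasta tastes good.",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every distinct word across all reviews, in alphabetical order
vocabulary = sorted({word for review in reviews for word in tokenize(review)})

def count_vector(text):
    """One count per vocabulary word, so every vector has the vocabulary's length."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]

vectors = [count_vector(review) for review in reviews]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each review becomes a vector whose length equals the size of the vocabulary, and word order within a review is discarded.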
A high weight in tf–idf is reached by a high term frequency (in the given document) and a low
document frequency of the term in the whole collection of documents; the weights hence tend
to filter out common terms.
Since the ratio inside the IDF's log function is always greater than or equal to 1, the value of
IDF (and tf–idf) is greater than or equal to 0.
As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing
the IDF and tf–idf closer to 0.
TF-IDF gives larger values to words that are less frequent across the document corpus. The TF-IDF value is
high when both the IDF and TF values are high, i.e., the word is rare in the corpus as a whole but
frequent in a given document.
TF-IDF also does not take the semantic meaning of words into account.
Let’s take an example to get a clearer understanding.
• Sentence 1: The car is driven on the road.
• Sentence 2: The truck is driven on the highway.
• In this example, each sentence is a separate document.
• We will now calculate the TF-IDF for the above two documents, which
represent our corpus.
From the table, we can see that the TF-IDF of the common
words is zero, which shows that they are not significant.
On the other hand, the TF-IDF values of “car”, “truck”, “road”, and “highway”
are non-zero. These words have more significance.
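The two-sentence example can be reproduced with a short sketch. Two assumptions are made here because the slides do not fix them: term frequency is taken as the word count divided by the document length, and IDF as the natural logarithm of (number of documents / number of documents containing the term). Other log bases or smoothed variants give different numbers but the same zero/non-zero pattern.

```python
import math
import re

documents = [
    "The car is driven on the road.",
    "The truck is driven on the highway.",
]
corpus = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]

def tf(term, tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency; the term must occur in at least one document."""
    document_frequency = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / document_frequency)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

for term in ("the", "driven", "car", "road", "truck", "highway"):
    print(term, [round(tf_idf(term, tokens, corpus), 4) for tokens in corpus])
```

Words that appear in both sentences get an IDF of log(2/2) = 0, so their TF-IDF is zero in both documents, while words unique to one sentence get a positive score only in that sentence.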
Sandeep Chaurasia
Self Organizing Maps – Kohonen Maps
• Self Organizing Map (or Kohonen Map or SOM) is a type of
Artificial Neural Network inspired by biological
models of neural systems from the 1970s.
• It follows an unsupervised learning approach and trains its
network through a competitive learning algorithm.
• SOM is used for clustering and mapping (dimensionality
reduction): it maps multidimensional data onto a
lower-dimensional space, which reduces complex problems to a
form that is easier to interpret.
• SOM has two layers: the Input layer and
the Output layer.
• The architecture of the Self Organizing Map with two clusters and
n input features of any sample is given below:
• Consider input data of size (m, n), where m is the number of training
examples and n is the number of features in each example.
• First, the weights are initialized with size (n, C), where C is the number of clusters.
• Then, iterating over the input data, for each training example the algorithm updates the
winning vector (the weight vector with the shortest distance, e.g. Euclidean
distance, from the training example).
• The weight update rule is given by:
wij = wij(old) + alpha(t) * (xik - wij(old))
where alpha(t) is the learning rate at time t, j denotes the winning vector, i denotes
the ith feature of the training example, and k denotes the kth training example from
the input data.
After training the SOM network, the trained weights are used for clustering new
examples. A new example falls in the cluster of its winning vector.
• Training:
• Step 1: Initialize the weights wij (random values may be assumed). Initialize the
learning rate α.
• Step 2: Calculate the squared Euclidean distance for each output unit j:
D(j) = Σ (wij - xi)^2, summed over i = 1 to n, for j = 1 to m
(here m is the number of output units, i.e. clusters)
• Step 3: Find the index J for which D(j) is minimum; J is the winning index.
• Step 4: For each j within a specific neighborhood of J and for all i, calculate the new
weight:
wij(new) = wij(old) + α[xi - wij(old)]
• Step 5: Update the learning rate using:
α(t+1) = 0.5 * α(t)
• Step 6: Test the stopping condition.
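The training steps can be sketched in plain Python. Several simplifications are assumptions of this sketch rather than part of the steps above: the neighborhood is taken to be the winning unit only (radius zero), the stopping condition is a fixed number of epochs, the learning rate is halved once per epoch, and the weights are stored as one vector per cluster unit.

```python
def squared_distance(weight_vector, x):
    """Step 2: squared Euclidean distance D(j) between one weight vector and x."""
    return sum((w - xi) ** 2 for w, xi in zip(weight_vector, x))

def winner(weights, x):
    """Step 3: index J of the weight vector closest to x."""
    distances = [squared_distance(w_j, x) for w_j in weights]
    return distances.index(min(distances))

def train_som(data, weights, alpha, epochs):
    """Steps 2 to 6 with a zero-radius neighborhood and a fixed epoch count."""
    weights = [list(w_j) for w_j in weights]  # copy so the caller's list is untouched
    for _ in range(epochs):
        for x in data:
            j = winner(weights, x)
            # Step 4: move only the winning vector toward the example
            weights[j] = [w + alpha * (xi - w) for w, xi in zip(weights[j], x)]
        alpha *= 0.5  # Step 5: halve the learning rate
    return weights
```

After training, `winner(trained_weights, new_example)` gives the cluster index for a new example, which is how the trained weights are used for clustering.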
SOM Example
K-Means Clustering
• K-Means Clustering is an unsupervised machine learning algorithm that
groups an unlabeled dataset into different clusters.
• Unsupervised machine learning is the process of teaching a computer to
use unlabeled, unclassified data and enabling the algorithm to operate on
that data without supervision. Without any prior training on labeled data, the
machine’s job in this case is to organize unsorted data according to
similarities, patterns, and differences.
• The goal of clustering is to divide the population or set of data points into a
number of groups so that the data points within each group are more similar
to one another than to the data points in the other groups. It is
essentially a grouping of things based on how similar and different they are to
one another.
• We are given a data set of items, with certain features, and values for these
features (like a vector). The task is to categorize those items into groups.
To achieve this, we will use the K-means algorithm, an unsupervised
learning algorithm. ‘K’ in the name of the algorithm represents the number
of groups/clusters we want to classify our items into.
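The slides describe the goal of K-means but not its steps; the sketch below uses the standard alternating procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points). The function names, the choice of passing initial centroids explicitly, and the handling of empty clusters are assumptions of this illustration.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))  # an empty cluster keeps its centroid
    return new_centroids

def k_means(points, centroids, max_iter=100):
    """K is the number of initial centroids supplied by the caller."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        new_centroids = update(points, assign(points, centroids), centroids)
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

centroids, labels = k_means([[1, 1], [1, 2], [8, 8], [9, 8]], [[0, 0], [10, 10]])
print(centroids, labels)
```

The result depends on the initial centroids, which is why practical implementations usually try several initializations.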
Single Linkage: The shortest distance between the closest points of two
clusters.
Complete Linkage: The farthest distance between two points of two
different clusters. It is one of the popular linkage methods as it forms tighter
clusters than single linkage.
Average Linkage: The linkage method in which the distances between every pair
of data points, one from each cluster, are added up and then divided by the number
of such pairs to calculate the average distance between two clusters. It is also one
of the most popular linkage methods.
Centroid Linkage: The linkage method in which the distance between the
centroids of the clusters is calculated. Consider the below image:
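The four linkage definitions can be compared on a small example. This sketch assumes Euclidean distance between points and uses hypothetical function names; each function takes two clusters given as lists of points.

```python
import math

def single_linkage(a, b):
    """Shortest distance between any point of a and any point of b."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Farthest distance between any point of a and any point of b."""
    return max(math.dist(p, q) for p in a for q in b)

def average_linkage(a, b):
    """Mean of the distances over all cross-cluster pairs of points."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))

cluster_a = [[0, 0], [1, 0]]
cluster_b = [[3, 0], [5, 0]]
print(single_linkage(cluster_a, cluster_b),
      complete_linkage(cluster_a, cluster_b),
      average_linkage(cluster_a, cluster_b),
      centroid_linkage(cluster_a, cluster_b))
```

In hierarchical clustering, the chosen linkage decides which two clusters count as closest at each merge step, so the same data can produce different hierarchies under different linkages.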
Single Linkage Example