FDS Assignment


1. Types of Machine Learning with examples

Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions.
Machine learning contains a set of algorithms that work on a huge amount of data. Data
is fed to these algorithms to train them, and on the basis of training, they build the
model & perform a specific task.
These ML algorithms help to solve different business problems like Regression,
Classification, Forecasting, Clustering, and Associations, etc.

Based on the methods and way of learning, machine learning is divided into mainly four
types, which are:

1. Supervised Machine Learning

2. Unsupervised Machine Learning

3. Semi-Supervised Machine Learning

4. Reinforcement Learning

1. Supervised Machine Learning

As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More precisely, we first train the machine with the input and its corresponding output, and then we ask the machine to predict the output for the test dataset.

Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images: the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), etc. After training, we input the picture of a cat and ask the machine to identify the object and predict the output. The machine is now well trained, so it will check all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it is a cat. So, it will put it in the Cat category. This is how the machine identifies objects in supervised learning.

The main goal of the supervised learning technique is to map the input variable(x) with
the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.

Categories of Supervised Machine Learning

Supervised machine learning can be classified into two types of problems, which are
given below:

○ Classification

○ Regression

a) Classification

Classification algorithms are used to solve classification problems in which the
output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc.
The classification algorithms predict the categories present in the dataset. Some
real-world examples of classification algorithms are Spam Detection, Email filtering,
etc.

Some popular classification algorithms are given below:

○ Random Forest Algorithm

○ Decision Tree Algorithm

○ Logistic Regression Algorithm


○ Support Vector Machine Algorithm

b) Regression

Regression algorithms are used to solve regression problems in which there is a linear
relationship between input and output variables. These are used to predict continuous
output variables, such as market trends, weather prediction, etc.

Some popular Regression algorithms are given below:

○ Simple Linear Regression Algorithm

○ Multivariate Regression Algorithm

○ Decision Tree Algorithm

○ Lasso Regression
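
To make the two problem types concrete, here is a minimal scikit-learn sketch (scikit-learn is assumed to be available, and the synthetic datasets are invented purely for illustration) that fits one classifier and one regressor:

```python
# A classifier for a categorical target and a regressor for a continuous target.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
Xr, yr = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

clf = LogisticRegression().fit(Xc, yc)  # predicts discrete categories
reg = LinearRegression().fit(Xr, yr)    # predicts continuous values

print(clf.predict(Xc[:3]))  # class labels, e.g. [0 1 0]
print(reg.predict(Xr[:3]))  # real numbers
```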

2. Unsupervised Machine Learning

Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning,
the machine is trained using the unlabeled dataset, and the machine predicts the output
without any supervision.

In unsupervised learning, the models are trained with the data that is neither classified
nor labelled, and the model acts on that data without any supervision.

The main aim of an unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.

Let's take an example to understand it more precisely. Suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and its task is to find the patterns and categories of the objects. The machine will discover its own patterns and differences, such as colour and shape differences, and predict the output when it is tested with the test dataset.

Categories of Unsupervised Machine Learning

Unsupervised Learning can be further classified into two types, which are given below:

○ Clustering

○ Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the most
similarities remain in one group and have fewer or no similarities with the objects of
other groups. An example of the clustering algorithm is grouping the customers by their
purchasing behaviour.

Some of the popular clustering algorithms are given below:

○ K-Means Clustering algorithm

○ Mean-shift algorithm

○ DBSCAN Algorithm

○ Principal Component Analysis

○ Independent Component Analysis
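
As a simple illustration of clustering (scikit-learn assumed; the two synthetic "customer" blobs are invented for the example), k-means can group unlabeled points by similarity:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabeled groups of points, e.g. customers described by (spend, visits).
data = np.vstack([rng.normal(0, 1, (50, 2)),
                  rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:5])        # cluster index assigned to each point
print(kmeans.cluster_centers_)   # the two discovered group centres
```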


2) Association

Association rule learning is an unsupervised learning technique that finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another data item and map those variables accordingly so that it can generate maximum profit. This algorithm is mainly applied in market basket analysis, web usage mining, continuous production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat,
FP-growth algorithm.
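
The quantities these algorithms search over are support and confidence. A small self-contained sketch (the baskets are made up for illustration) computes both for the rule {bread} -> {butter}:

```python
# Support and confidence of the association rule {bread} -> {butter}.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(baskets)
support_bread = sum("bread" in b for b in baskets) / n             # P(bread)
support_both = sum({"bread", "butter"} <= b for b in baskets) / n  # P(bread, butter)
confidence = support_both / support_bread                          # P(butter | bread)

print(f"support = {support_both:.2f}, confidence = {confidence:.2f}")  # 0.50, 0.67
```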

3. Semi-Supervised Learning

Semi-supervised learning is a type of machine learning algorithm that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data), and it uses a combination of labelled and unlabeled datasets during the training period.

Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, its data mostly consists of unlabeled examples; because labels are costly to obtain, real-world datasets often contain only a few of them. This distinguishes it from both supervised and unsupervised learning, which are based on the presence or absence of labels.

The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning algorithms. The main aim of semi-supervised learning is to use all the available data effectively, rather than only the labelled data as in supervised learning. Initially, similar data points are clustered with an unsupervised learning algorithm, and the clusters then help turn the unlabeled data into labelled data. This matters because labelled data is comparatively more expensive to acquire than unlabeled data.

We can picture these algorithms with an example. Supervised learning is a student under the supervision of an instructor at home and at college. If that student instead analyses the same concept on their own without any help from an instructor, it comes under unsupervised learning. Under semi-supervised learning, the student first studies the concept under the guidance of an instructor at college and then revises it on their own.

4. Reinforcement Learning

Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial: taking actions, learning from experience, and improving its performance. The agent gets rewarded for each good action and punished for each bad action; hence, the goal of a reinforcement learning agent is to maximize the rewards.

In reinforcement learning, there is no labelled data like supervised learning, and agents
learn from their experiences only.

The reinforcement learning process is similar to how a human being learns; for example, a child learns various things through experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments and rewards.

Due to the way it works, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.

Categories of Reinforcement Learning

Reinforcement learning is categorized mainly into two types of methods/algorithms:

○ Positive Reinforcement Learning: Positive reinforcement learning increases the tendency that the required behaviour will occur again by adding something. It strengthens the behaviour of the agent and impacts it positively.

○ Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to positive RL. It increases the tendency that the specific behaviour will occur again by avoiding the negative condition.
2. DIFFERENTIATE CLASSIFICATION AND REGRESSION

Classification and regression algorithms are both supervised learning algorithms: both can be used for prediction in machine learning and operate on labelled datasets. The distinction between classification and regression lies in how they are applied to particular machine learning problems.

| Classification Algorithm | Regression Algorithm |
| --- | --- |
| The mapping function is used for assigning values to predefined groups. | The mapping function is used for assigning values to a continuous output. |
| The output element must be a discrete attribute. | The output element must be a continuous real value. |
| The role of the classification algorithm is to map the input value (x) to a discrete output variable (y). | The role of the regression algorithm is to map the input value (x) to a continuous output variable (y). |
| Classification algorithms are used for discrete data. | Regression algorithms are used for continuous data. |
| In classification, we strive to locate the decision boundary, which can split the dataset into different classes. | In regression, we strive to find the best-fit line, which can forecast the output more accurately. |
| Classification algorithms solve problems such as voice recognition, identification of spam emails, identification of cancer cells, etc. | Regression algorithms solve problems such as house price prediction, weather prediction, etc. |
| Classification algorithms can be further divided into multi-class classifiers and binary classifiers. | Regression algorithms can be further divided into non-linear and linear regression. |
1. Explain the terms features, training set, target vector, test set, and curse of dimensionality in machine learning.

Features in machine learning

In machine learning, features are individual independent variables that act as inputs to a system. When making predictions, models use these features. Through the feature engineering process, new features can also be derived from old ones. Features are very important: being the building blocks of datasets, the quality of the features in your dataset has a major impact on the quality of the insights you will gain when using it for machine learning. However, the relevant features are not the same across different business problems and industries, so you need a strong understanding of the business goal of your data science project.

Training set in machine learning

The training data is the largest (in size) subset of the original dataset, which is used to train or fit the machine learning model. The training data is fed to the ML algorithms first, which lets them learn how to make predictions for the given task.

The training data varies depending on whether we are using Supervised Learning or Unsupervised
Learning Algorithms.

For Unsupervised learning, the training data contains unlabeled data points, i.e., inputs are not tagged
with the corresponding outputs. Models are required to find the patterns from the given training datasets
in order to make predictions.

On the other hand, for supervised learning, the training data contains labels in order to train the model
and make predictions.

The type of training data that we provide to the model is highly responsible for the model's accuracy and prediction ability: the better the quality of the training data, the better the performance of the model. The training data typically makes up 60% or more of the total data for an ML project.

Target Vector in machine learning

The target vector is a central concept in supervised machine learning, representing the output variable
that a model aims to predict based on input features. In essence, it's the desired outcome or the variable
we seek to approximate accurately. The target vector guides the learning process, shaping the model's
understanding of the data. Its accuracy and relevance are vital for the model's ability to generalize to
new data, making it a crucial element in building effective machine learning models.

In supervised learning, we have a dataset containing both input features and corresponding target values.
The model uses this labeled data to learn patterns and relationships, enabling it to make predictions on
new, unseen data.

In regression, the target vector contains continuous numerical values. The model's goal is to predict a
continuous output, like estimating a house price based on features like size, location, etc.

In classification, the target vector comprises categorical labels representing different classes. The
model's task is to assign the correct class label to new data points, using patterns learned from the
training data.

Test set in machine learning

Once we train the model with the training dataset, it's time to test the model with the test dataset. This dataset evaluates the performance of the model and ensures that the model can generalize well to new or unseen data. The test dataset is another subset of the original data, independent of the training dataset but with similar feature types and class probability distribution, and it serves as a benchmark for model evaluation once training is completed. Test data is a well-organized dataset that contains data for each type of scenario the model would face when used in the real world. Usually, the test dataset is approximately 20-25% of the total original data for an ML project.

At this stage, we can also check and compare the testing accuracy with the training accuracy, which
means how accurate our model is with the test dataset against the training dataset. If the accuracy of the
model on training data is greater than that on testing data, then the model is said to have overfitting.

The testing data should:

○ Be representative of the original dataset.

○ Be large enough to yield meaningful evaluations.
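
A minimal sketch of the split-and-evaluate workflow described above (scikit-learn assumed; the 25% test share follows the 20-25% guideline):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
# A training accuracy much higher than the test accuracy suggests overfitting.
print(f"train = {train_acc:.2f}, test = {test_acc:.2f}")
```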

Curse of dimensionality in machine learning

As the number of dimensions or features increases, the amount of data needed to generalize the machine
learning model accurately increases exponentially. The curse of dimensionality is a fundamental
challenge that arises when dealing with high-dimensional data in machine learning. With each added
dimension, the amount of data required to represent the feature space accurately grows exponentially.
This sparsity of data points can lead to issues like overfitting, where models capture noise rather than
meaningful patterns.

Moreover, the curse of dimensionality leads to increased computational demands. Algorithms that
perform well in lower dimensions might become impractical in high-dimensional spaces due to
escalating processing requirements. Identifying relevant features also becomes intricate, as irrelevant or
redundant dimensions introduce noise that can degrade model performance.

Different methods to solve the curse of dimensionality are given below:


1. Dimensionality Reduction is the data conversion from a high-dimensional into a
low-dimensional space. The idea behind this conversion is to let the low-dimensional
representation hold some significant properties of the data. These properties will be
almost identical to the data’s natural dimensions. Alternatively, it suggests decreasing the
dataset’s dimensions.
2. Deep Learning Technique implies that in the high-dimensional data, there exists a
fundamental pattern in lower-level dimensions that deep learning techniques can
effectively manipulate. Hence, for a high-dimensional matrix, neural networks can
efficiently find low-dimensional features that don’t exist in the high-dimensional space.
3. Cosine similarity can be used as a substitute for Euclidean distance, as it is less affected by high dimensionality. Assuming the observations are points spread randomly and uniformly, the effect of dimensionality is strong when the points are densely located and weak when they are sparsely located.
4. Principal Component Analysis (PCA): a popular linear dimensionality reduction technique that projects the data onto a lower-dimensional subspace while preserving the maximum variance. It helps in visualizing data, reducing noise, and speeding up model training (a short sketch follows this list).
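
A minimal PCA sketch (scikit-learn assumed): projecting the 64-dimensional digits dataset down to 2 components while reporting how much variance each component keeps:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 1797 samples, 64 features each
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)               # same samples, now 2 features each

print(X_2d.shape)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```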
FDS ASSIGNMENT

Linear Discriminant Functions and Decision Surfaces:

A discriminant function that is a linear combination of the components of x can be written as

g(x) = w^T x + w0   (9.1)

where w is the weight vector and w0 the bias or threshold weight. Linear discriminant functions are going to be studied for the two-category case, the multicategory case, and the general case (Figure 9.1). For the general case there will be c such discriminant functions, one for each of c categories.

Discriminant functions are studied for:

 The Two-Category Case
 The Multicategory Case
 Generalized Linear Discriminant Functions

The Two-Category Case

For a discriminant function of the form of eq. 9.1, a two-category classifier implements the following decision rule: decide ω1 if g(x) > 0 and ω2 if g(x) < 0. Thus, x is assigned to ω1 if the inner product w^T x exceeds the threshold -w0, and to ω2 otherwise. If g(x) = 0, x can ordinarily be assigned to either class, or can be left undefined. The equation g(x) = 0 defines the decision surface that separates points assigned to ω1 from points assigned to ω2. When g(x) is linear, this decision surface is a hyperplane. If x1 and x2 are both on the decision surface, then

w^T x1 + w0 = w^T x2 + w0

or

w^T (x1 - x2) = 0

and this shows that w is normal to any vector lying in the hyperplane. In general, the hyperplane H divides the feature space into two half-spaces: decision region R1 for ω1 and region R2 for ω2. Because g(x) > 0 if x is in R1, it follows that the normal vector w points into R1. It is sometimes said that any x in R1 is on the positive side of H, and any x in R2 is on the negative side (Figure 9.2).

The discriminant function g(x) gives an algebraic measure of the distance from x to the hyperplane. The easiest way to see this is to express x as

x = xp + r (w / ||w||)

where xp is the normal projection of x onto H, and r is the desired algebraic distance, which is positive if x is on the positive side and negative if x is on the negative side. Then, because g(xp) = 0,

g(x) = w^T x + w0 = r ||w||

or

r = g(x) / ||w||

The linear decision boundary H separates the feature space into two half-spaces. In particular, the distance from the origin to H is given by w0 / ||w||. If w0 > 0, the origin is on the positive side of H, and if w0 < 0, it is on the negative side. If w0 = 0, then g(x) has the homogeneous form w^T x, and the hyperplane passes through the origin (Figure 9.2).

A linear discriminant function divides the feature space by a hyperplane decision surface. The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias w0. The discriminant function g(x) is proportional to the signed distance from x to the hyperplane, with g(x) > 0 when x is on the positive side and g(x) < 0 when x is on the negative side.

The Multicategory Case

There is more than one way to devise multicategory classifiers employing linear discriminant functions. For example, we might reduce the problem to c two-class problems, where the i-th problem is solved by a linear discriminant function that separates points assigned to ωi from those not assigned to ωi. A more extravagant approach would be to use c(c-1)/2 linear discriminants, one for every pair of classes. As illustrated in Figure 9.3, both of these approaches can lead to regions in which the classification is undefined. We shall avoid this problem by defining c linear discriminant functions

gi(x) = wi^T x + wi0,   i = 1, ..., c

and assigning x to ωi if gi(x) > gj(x) for all j ≠ i; in case of ties, the classification is left undefined. The resulting classifier is called a linear machine. A linear machine divides the feature space into c decision regions (Figure 9.4), with gi(x) being the largest discriminant if x is in region Ri. If Ri and Rj are contiguous, the boundary between them is a portion of the hyperplane Hij defined by

gi(x) = gj(x)

or

(wi - wj)^T x + (wi0 - wj0) = 0

It follows at once that wi - wj is normal to Hij, and the signed distance from x to Hij is given by

(gi(x) - gj(x)) / ||wi - wj||

Thus, with the linear machine it is not the weight vectors themselves but their differences that are important. While there are c(c-1)/2 pairs of regions, they need not all be contiguous, and the total number of hyperplane segments appearing in the decision surfaces is often fewer than c(c-1)/2, as shown in Figure 9.4.

[Figure 9.3: Linear decision boundaries for a four-class problem with undefined regions.]

[Figure 9.4: Decision boundaries defined by a linear machine.]
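
A numpy sketch of a linear machine for c = 3 classes (the weight vectors are invented for illustration): compute every gi(x) and assign x to the class with the largest one:

```python
import numpy as np

# Rows are the weight vectors w_i; w0 holds the bias terms w_i0.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])

def classify(x):
    scores = W @ x + w0            # g_1(x), g_2(x), g_3(x)
    return np.argmax(scores) + 1   # assign x to the class with the largest g_i(x)

print(classify(np.array([2.0, 0.5])))    # class 1: scores (2.0, 0.5, -2.0)
print(classify(np.array([-1.0, -1.0])))  # class 3: scores (-1.0, -1.0, 2.5)
```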

Generalized Linear Discriminant Functions

The linear discriminant function g(x) can be written as

g(x) = w0 + Σi wi xi   (i = 1, ..., d)

where the coefficients wi are the components of the weight vector w. By adding additional terms involving the products of pairs of components of x, we obtain the quadratic discriminant function

g(x) = w0 + Σi wi xi + Σi Σj wij xi xj   (9.9)

Because xi xj = xj xi, we can assume that wij = wji with no loss of generality. In a two-dimensional feature space, eq. 9.9 expands into eq. 9.10, and the two equations are equivalent. Thus, the quadratic discriminant function has an additional d(d+1)/2 coefficients at its disposal with which to produce more complicated separating surfaces. The separating surface defined by g(x) = 0 is a second-degree or hyperquadric surface. If the symmetric matrix W = [wij], whose elements wij are the weights of the last term in eq. 9.9, is nonsingular, then the linear terms in g(x) can be eliminated by translating the axes. The basic character of the separating surface can be described in terms of the scaled matrix

W̄ = W / (w^T W⁻¹ w - 4 w0)

where w = (w1, ..., wd)^T and W = [wij].

Types of Quadratic Surfaces:

The types of quadratic separating surfaces that arise in the general multivariate Gaussian case are as follows.

1. If W̄ is a positive multiple of the identity matrix, the separating surface is a hypersphere, i.e. W̄ = k I with k ≥ 0. (Recall that a hypersphere is the set of points x at a fixed distance from a centre x0, ||x - x0|| = r.)

2. If W̄ is positive definite, the separating surface is a hyperellipsoid whose axes are in the directions of the eigenvectors of W̄.

3. If neither of the above cases holds, that is, some of the eigenvalues of W̄ are positive and others are negative, the surface is one of the varieties of hyperhyperboloids.
By continuing to add terms such as wijk xi xj xk, we can obtain the class of polynomial discriminant functions. These can be thought of as truncated series expansions of some arbitrary g(x), and this in turn suggests the generalized linear discriminant function

g(x) = Σi ai yi(x)   (i = 1, ..., d̂)

or

g(x) = a^T y

where a is now a d̂-dimensional weight vector and the d̂ functions yi(x) can be arbitrary functions of x. Such functions might be computed by a feature detecting subsystem. By selecting these functions judiciously and letting d̂ be sufficiently large, one can approximate any desired discriminant function by such an expansion. The resulting discriminant function is not linear in x, but it is linear in y. The d̂ functions yi(x) merely map points in d-dimensional x-space to points in d̂-dimensional y-space. The homogeneous discriminant a^T y separates points in this transformed space by a hyperplane passing through the origin. Thus, the mapping from x to y reduces the problem to one of finding a homogeneous linear discriminant function.
Some of the advantages and disadvantages of this approach can be clarified by considering a simple example. Let the quadratic discriminant function be

g(x) = a1 + a2 x + a3 x²

so that the three-dimensional vector y is given by

y = (1, x, x²)^T

The mapping from x to y is illustrated in Figure 9.5. The data remain inherently one-dimensional, because varying x causes y to trace out a curve in three dimensions.

[Figure 9.5: The mapping y = (1, x, x²)^T takes a line and transforms it to a parabola.]

The plane Ĥ defined by a^T y = 0 divides the y-space into two decision regions R̂1 and R̂2. Figure 5.6 shows the separating plane corresponding to a = (-1, 1, 2)^T, the decision regions R̂1 and R̂2, and their corresponding decision regions R1 and R2 in the original x-space. The quadratic discriminant function g(x) = -1 + x + 2x² is positive if x < -1 or if x > 0.5. Even with relatively simple functions yi(x), the decision surfaces induced in an x-space can be fairly complex.

While it may be hard to realize the potential benefits of a generalized linear discriminant function, we can at least exploit the convenience of being able to write g(x) in the homogeneous form a^T y. In the particular case of the linear discriminant function we have

g(x) = w0 + Σi wi xi (i = 1, ..., d) = Σi wi xi (i = 0, ..., d)

where we set x0 = 1. Thus we can write

y = (1, x1, ..., xd)^T

and y is sometimes called an augmented feature vector. Likewise, an augmented weight vector can be written as

a = (w0, w1, ..., wd)^T

This mapping from d-dimensional x-space to (d+1)-dimensional y-space is mathematically trivial but nonetheless quite convenient. By using this mapping, we reduce the problem of finding a weight vector w and a threshold weight w0 to the problem of finding a single weight vector a.
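
A numpy sketch of this example: mapping the scalar x to y = (1, x, x²)^T and applying a = (-1, 1, 2)^T reproduces the quadratic discriminant g(x) = -1 + x + 2x² as a single homogeneous inner product a^T y:

```python
import numpy as np

a = np.array([-1.0, 1.0, 2.0])        # augmented weight vector from the example

def y(x):
    return np.array([1.0, x, x * x])  # the mapping y = (1, x, x^2)^T

for x in [-2.0, -0.5, 1.0]:
    gx = a @ y(x)                     # g(x) = a^T y = -1 + x + 2x^2
    print(f"x = {x:+.1f}, g(x) = {gx:+.2f} ->", "R1" if gx > 0 else "R2")
# g is positive for x < -1 or x > 0.5, matching the decision regions above.
```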

Bayesian Decision Theory

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. It is considered the ideal pattern classifier and is often used as the benchmark for other algorithms, because its decision rule automatically minimizes its loss function. It might not make much sense right now, so hold on; we'll unravel it all.

It makes the assumption that the decision problem is posed in probabilistic terms, and that all the relevant probability values are known.

Bayes' Theorem

Derivation of Bayes' theorem. We know from conditional probability:

P(A|B) = P(A, B) / P(B)
=> P(A, B) = P(A|B) * P(B)   ...(i)

Similarly,

P(A, B) = P(B|A) * P(A)   ...(ii)

From equations (i) and (ii):

P(A|B) * P(B) = P(B|A) * P(A)
=> P(A|B) = [P(B|A) * P(A)] / P(B)

For the case of classification, let:

 A ≡ ω (the state of nature, or the class of an entry)

 B ≡ x (the input feature vector)

After substituting we get:

P(ω|x) = [P(x|ω) * P(ω)] / P(x)

which becomes

P(ω|x) = [p(x|ω) * P(ω)] / p(x)

because:

* P(ω|x) ≡ called the posterior; it is the probability of the predicted class being ω for a given entry of features (x). It is analogous to P(O|θ), because the class is the desired outcome to be predicted according to the data distribution (model). Capital 'P' because ω is a discrete random variable.

* p(x|ω) ≡ the class-conditional probability density function for the features. We call it the likelihood of ω with respect to x, a term chosen to indicate that, other things being equal, the category (or class) for which it is large is more "likely" to be the true category. It is a function of the parameters within the parametric space that describes the probability of obtaining the observed data (x). Small 'p' because x is a continuous random variable; we usually assume it follows a Gaussian distribution.

* P(ω) ≡ the a priori probability (or simply the prior) of class ω. It is usually pre-determined and depends on external factors; it expresses how probable the occurrence of class ω is out of all the classes.

* p(x) ≡ called the evidence; it is merely a scaling factor that guarantees that the posterior probabilities sum to one: p(x) = Σ p(x|ω) * P(ω) over all the classes.

So finally, we get the following equation to frame our decision rule:

Bayes' formula for classification:

P(ω|x) = [p(x|ω) * P(ω)] / p(x)

Decision Rule

The above equation is the governing formula for our decision theory. The rule is as follows: for each sample input, calculate its posterior probability and assign the input to the class corresponding to the maximum value of the posterior. Mathematically, Bayes' decision rule can be written as:

Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.
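
A tiny numpy sketch of the rule with invented numbers (two classes, one observed feature value x): compute the evidence, normalize to posteriors, and take the argmax:

```python
import numpy as np

prior = np.array([0.6, 0.4])       # P(w1), P(w2)
likelihood = np.array([0.2, 0.5])  # p(x|w1), p(x|w2) at the observed x

evidence = np.sum(likelihood * prior)      # p(x) = sum_i p(x|wi) * P(wi)
posterior = likelihood * prior / evidence  # P(wi|x); sums to one

decision = np.argmax(posterior) + 1        # Bayes' decision rule
print(posterior, f"-> decide w{decision}") # [0.375 0.625] -> decide w2
```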

Bayesian classification for normal distributions:

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms, all of which share a common principle: every pair of features being classified is independent of each other.

Gaussian Naive Bayes classifier

In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.

The likelihood of the features is assumed to be Gaussian; hence, the conditional probability is given by:

P(xi | y) = (1 / sqrt(2π σy²)) · exp(-(xi - μy)² / (2σy²))
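
A minimal usage sketch (scikit-learn assumed): GaussianNB fits per-class means and variances for each feature and applies the formula above at prediction time:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X, y)

print(gnb.theta_[0])             # fitted per-feature means for class 0
print(gnb.predict(X[:3]))        # predicted classes
print(gnb.predict_proba(X[:1]))  # posterior probabilities P(w|x)
```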

The Gaussian (Normal) Density

The definition of the expected value of a scalar function f(x), defined for some density p(x), is given by

E[f(x)] = ∫ f(x) p(x) dx

If the values of the feature x are restricted to points in a discrete set D, we must instead sum over all samples as

E[f(x)] = Σ f(x) P(x)   (over x in D)

where P(x) is the probability mass.

The continuous univariate normal density is given by

p(x) = (1 / (sqrt(2π) σ)) · exp(-(1/2) ((x - μ) / σ)²)

where the mean μ (expected value, average) is given by

μ = E[x] = ∫ x p(x) dx

and the special expectation that is the variance (squared deviation) is given by

σ² = E[(x - μ)²] = ∫ (x - μ)² p(x) dx

The univariate normal density is specified by its two parameters, the mean μ and the variance σ². Samples from normal distributions tend to cluster about the mean, and the extent to which they spread out depends on the variance (Figure 4.4).

Gaussian Discriminant Analysis in its general form assumes that p(x|t) is distributed according to a multivariate normal (Gaussian) distribution:

p(x | t = k) = (1 / ((2π)^(d/2) |Σk|^(1/2))) · exp(-(1/2) (x - µk)^T Σk⁻¹ (x - µk))

where |Σk| denotes the determinant of the matrix and d is the dimension of x. Each class k has an associated mean vector µk and covariance matrix Σk. Typically the classes share a single covariance matrix Σ ("share" means that they have the same parameters; the covariance matrix, in this case): Σ = Σ1 = ··· = Σk.

Multivariate data:

 Multiple measurements (sensors)
 d inputs/features/attributes
 N instances/observations/examples

Multivariate parameters:

 Mean: E[x] = (µ1, ···, µd)^T
 Covariance: Σ = Cov(x) = E[(x - µ)(x - µ)^T]
 Correlation: Corr(x) is the covariance divided by the product of the standard deviations: Corr(xi, xj) = Σij / (σi σj)

x ∼ N(µ, Σ) denotes a Gaussian (or normal) distribution, defined as:

p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) · exp(-(1/2) (x - µ)^T Σ⁻¹ (x - µ))

The Mahalanobis distance (x − µk)^T Σ⁻¹ (x − µk) measures the distance from x to µ in terms of Σ.
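
A numpy sketch of the Mahalanobis distance with an illustrative covariance matrix: two points equally far from µ in the Euclidean sense get different Mahalanobis distances, because Σ stretches one axis:

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])   # feature 1 has variance 4, feature 2 has variance 1
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x):
    d = x - mu
    return d @ Sigma_inv @ d     # (x - mu)^T Sigma^{-1} (x - mu)

print(mahalanobis_sq(np.array([2.0, 0.0])))  # 1.0: along the high-variance axis
print(mahalanobis_sq(np.array([0.0, 2.0])))  # 4.0: along the low-variance axis
```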
Applications of the Naive Bayes Algorithm

As you must have noticed, this algorithm offers plenty of advantages to its users. That's why it has a lot of applications in various sectors too. Here are some applications of the Naive Bayes algorithm:

 As this algorithm is fast and efficient, you can use it to make real-time predictions.

 This algorithm is popular for multi-class predictions. You can easily find the probability of multiple target classes by using it.

 Email services (like Gmail) use this algorithm to figure out whether an email is spam or not; it is excellent for spam filtering.

 Its assumption of feature independence, and its effectiveness in solving multi-class problems, make it perfect for performing sentiment analysis. Sentiment analysis refers to the identification of positive or negative sentiments of a target group (customers, audience, etc.).

 Collaborative filtering and the Naive Bayes algorithm work together to build recommendation systems. These systems use data mining and machine learning to predict whether the user would like a particular resource or not.
CONVOLUTIONAL NEURAL NETWORKS

Convolutional Neural Networks are a special type of feed-forward artificial neural network in which the connectivity pattern between neurons is inspired by the visual cortex. The visual cortex contains small regions of cells that are sensitive to specific areas of the visual field. Individual neuronal cells in the brain fire only when certain oriented edges are present: some neurons respond when exposed to vertical edges, while others respond to horizontal or diagonal edges. This behaviour is the motivation behind Convolutional Neural Networks.
Convolutional Neural Networks, also called ConvNets, are neural networks that share their parameters. Suppose there is an image, represented as a cuboid with a length, a width, and a height, where the depth of the cuboid corresponds to the Red, Green, and Blue channels, as shown in the image given below.

Now assume that we take a small patch of this image and run a small neural network on it, producing k outputs, represented vertically. When we slide this small neural network across the whole image, the result is another image with a different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels, each with a smaller width and height; this is the concept of convolution. If the patch size were the same as the image size, this would be a regular neural network. It is this small patch that keeps the number of weights small.
Working of CNN

Generally, a Convolutional Neural Network has three layers, which are as follows;

Convolution layer:
The convolution layer is the first layer used to extract features from an input image. By learning image features over small squares of input data, the convolution layer preserves the relationship between pixels. It is a mathematical operation that takes two inputs: an image matrix and a kernel or filter.
o The dimension of the image matrix is h×w×d.
o The dimension of the filter is fh×fw×d.
o The dimension of the output is (h-fh+1)×(w-fw+1)×1.
Let's consider a 5×5 image whose pixel values are 0 or 1, and a 3×3 filter matrix.

The convolution of the 5×5 image matrix with the 3×3 filter matrix is called the "feature map" and is shown as an output.
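
A self-contained numpy sketch of this operation (the 0/1 image and the filter values are illustrative): sliding a 3×3 kernel over a 5×5 image with stride 1 and no padding yields the (5-3+1)×(5-3+1) = 3×3 feature map:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """'Valid' convolution (cross-correlation, as CNN libraries implement it):
    slide the kernel with stride 1 and no padding."""
    h, w = image.shape
    fh, fw = kernel.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

print(convolve2d_valid(image, kernel))  # the 3x3 feature map
```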

Convolving an image with different filters can perform operations such as blurring, sharpening, and edge detection.

Strides: The stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time; similarly, when the stride is 2, we move the filter 2 pixels at a time. The following figure shows convolution working with a stride of 2.

Padding: Padding plays a crucial role in building a convolutional neural network. Without it, each convolution shrinks the image, and with a network of 100s of layers, we would end up with a tiny image after all the filtering. To overcome this, we introduce padding: "Padding is an additional layer which can be added to the border of an image."

Pooling Layer

The pooling layer plays an important role in processing an image. It reduces the number of parameters when the images are too large. Pooling is a "downscaling" of the image obtained from the previous layers; it can be compared to shrinking an image to reduce its pixel density. Spatial pooling is also called downsampling or subsampling; it reduces the dimensionality of each map but retains the important information. There are the following types of spatial pooling:

Max Pooling: Max pooling is a sample-based discretization process. Its main objective is to downscale an input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions. Max pooling is done by applying a max filter to non-overlapping sub-regions of the initial representation (see the sketch after these definitions).

Average Pooling: Average pooling downscales by dividing the input into rectangular pooling regions and computing the average value of each region.

Sum Pooling: The sub-regions for sum pooling or mean pooling are set exactly as for max pooling, but instead of using the max function we use the sum or the mean.
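
A numpy sketch of max pooling, as referenced above (the feature-map values are invented): a 2×2 max filter over non-overlapping sub-regions halves each spatial dimension:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Max pooling with non-overlapping size x size sub-regions (stride = size)."""
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 1, 8]])
print(max_pool(fmap))
# [[6. 4.]
#  [7. 9.]]
```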

Fully Connected Layer

The fully connected layer is a layer in which the input from the other layers is flattened into a vector and passed on. It transforms the output into the number of classes desired by the network. The feature map matrix is converted into a vector (x1, x2, x3, ..., xn) with the help of the fully connected layers. We combine these features to create a model and apply an activation function such as softmax or sigmoid to classify the outputs as a car, dog, truck, etc.
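
Putting the three layers together, here is a minimal Keras sketch (TensorFlow assumed; the layer sizes and the 28×28 grayscale input shape are illustrative, not prescriptive):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),               # e.g. a grayscale image
    layers.Conv2D(16, (3, 3), activation="relu"),  # convolution layer
    layers.MaxPooling2D((2, 2)),                   # pooling layer
    layers.Flatten(),                              # flatten feature maps to a vector
    layers.Dense(10, activation="softmax"),        # fully connected layer, 10 classes
])
model.summary()
```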

CNN HYPERPARAMETER TUNING


Hyperparameters of a neural network are variables that determine the network's architecture and behavior
during training. They include the number of layers, the number of nodes in each layer, the activation
functions, learning rate, batch size, regularization parameters, dropout rate, optimizer choice, and weight
initialization methods. Tuning these hyperparameters is crucial for optimizing the neural network's
performance.
Hyperparameter tuning in neural networks refers to the process of finding the optimal combination of hyperparameters to maximize the performance and effectiveness of the network. It involves systematically exploring different values or ranges of hyperparameters, training and evaluating the network for each configuration, and selecting the set of hyperparameters that yields the best performance on a validation set or through cross-validation. A major challenge when working with DL algorithms is setting and controlling hyperparameter values; this is technically known as hyperparameter tuning or hyperparameter optimization. Some examples of hyperparameters are: the k in the kNN or K-Nearest Neighbour algorithm, the learning rate for training a neural network, the train-test split ratio, the batch size, the number of epochs, the number of branches in a decision tree, and the number of clusters in a clustering algorithm.
Broadly hyperparameters can be divided into two categories, which are given below:
1. Hyperparameter for Optimization
2. Hyperparameter for Specific Models
Some of the popular optimization parameters are given below:
● Learning Rate: The learning rate is the hyperparameter in optimization algorithms that controls how much the model changes in response to the estimated error each time the model's weights are updated. It is one of the crucial parameters when building a neural network, and it also determines the frequency of cross-checking with the model parameters. Selecting an optimal learning rate is a challenging task: if the learning rate is too small, it may slow down the training process, while if it is too large, the model may not be optimized properly.
● Batch Size: To enhance the speed of the learning process, the training set is divided into different subsets, known as batches.
● Number of Epochs: An epoch can be defined as one complete cycle of training the machine learning model; it represents an iterative learning process. The number of epochs varies from model to model. To determine the right number of epochs, the validation error is taken into account: the number of epochs is increased as long as the validation error keeps decreasing, and when it stops improving for consecutive epochs, that indicates it is time to stop increasing the number of epochs.
Hyperparameters that are involved in the structure of the model are known as hyperparameters for
specific models. These are given below:
● Number of Hidden Units: Hidden units are part of neural networks; they refer to the components comprising the layers of processors between the input and output units in a neural network.
It is important to specify the number-of-hidden-units hyperparameter for the neural network. It should be between the size of the input layer and the size of the output layer; a common rule of thumb is 2/3 of the size of the input layer plus the size of the output layer.
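
One common way to run such a tuning loop is a grid search scored by cross-validation. A hedged sketch (scikit-learn assumed; the grid values are illustrative) over the learning rate and hidden-layer size of a small neural network:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
grid = {
    "learning_rate_init": [0.001, 0.01, 0.1],  # learning-rate candidates
    "hidden_layer_sizes": [(32,), (64,)],      # hidden-unit candidates
}
search = GridSearchCV(MLPClassifier(max_iter=300, random_state=0), grid, cv=3)
search.fit(X, y)  # trains and cross-validates every combination
print(search.best_params_, search.best_score_)
```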

REGULARIZATION

Regularization refers to techniques that are used to calibrate machine learning models in order to
minimize the adjusted loss function and prevent overfitting or underfitting. Regularization is a technique
that helps prevent overfitting, which occurs when a neural network learns too much from the training data
and fails to generalize well to new data. Convolutional neural networks (CNNs) are a type of neural
network that are especially good at processing images, but they can also suffer from overfitting due to
their high complexity and large number of parameters.
How does regularization work in CNNs?
Regularization is a way of adding some constraints or penalties to the model, so that it does not overfit the
training data. There are different types of regularization methods, but they all aim to reduce the variance
of the model and increase its bias. Variance measures how sensitive the model is to small changes in the
data, while bias measures how far the model is from the true relationship. A good model should have low
variance and low bias, but there is usually a trade-off between them.
Regularization helps find a balance between them by shrinking or pruning the model parameters, adding
noise or dropout to the layers, or augmenting the data with transformations.
What are some common regularization methods for CNNs?
Regularization methods for CNNs are commonly used to reduce overfitting. L1 and L2 regularization, known as lasso and ridge regularization respectively (L2 is also called weight decay), add a term to the loss function that penalizes large weights in the model. L1 regularization tends to make some weights zero, while L2 regularization makes all weights smaller. Dropout is a technique that randomly drops out some units or
neurons in the hidden layers during training, preventing co-adaptation of features. Data augmentation
artificially increases the size and diversity of the training data by applying random transformations, such
as flipping, rotating, cropping, scaling, or adding noise to the images. This helps the model learn from
different perspectives and variations of the data.
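
A hedged Keras sketch of two of these methods (TensorFlow assumed; the penalty strength and dropout rate are illustrative): L2 weight decay on a convolution layer plus dropout before the classifier:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),  # penalizes large weights
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),                     # randomly drops units during training
    layers.Dense(10, activation="softmax"),
])
```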

INITIALIZATION

Initialization in the context of neural networks or CNN refers to the process of setting the initial values for
the parameters (weights and biases) of the network before training begins. Proper initialization is crucial
for achieving efficient and stable training, as it can help prevent issues like vanishing gradients, exploding
gradients, and slow convergence.

Common initialization techniques include:


1. Zero Initialization: Setting all weights and biases to zero. However, this can lead to issues
because all neurons will compute the same output during forward and backward propagation,
resulting in symmetric weights and no learning.
2. Random Initialization: Assigning random values to the weights and biases. This approach helps
break the symmetry and can enable learning. However, the magnitudes of the random values need
to be controlled to prevent gradient-related problems.
3. Xavier (Glorot) Initialization: This method calculates the scale of initialization based on the
number of input and output neurons for a given layer. It aims to keep the variance of activations
and gradients consistent across layers, reducing the risk of vanishing or exploding gradients.
4. He Initialization: Similar to Xavier initialization, but adapted for ReLU (Rectified Linear Unit)
activation functions, which are commonly used in many neural networks. It uses a slightly
different scaling factor to better suit the properties of ReLU.
5. LeCun Initialization: Specifically designed for networks with sigmoid activation functions. It uses
a scaling factor based on the number of input neurons to each neuron.
6. Orthogonal Initialization: In this technique, the weight matrices are initialized with orthogonal
matrices. This helps maintain the orthogonality of weights throughout training and can aid in
preserving gradient flow.
7. Normalized Initialization: This method normalizes the weights of each neuron based on their
fan-in (number of input neurons). This normalization can help prevent large activations that could
lead to gradient problems.
The choice of initialization method can greatly impact the training process and the final performance of a
neural network. Different activation functions, architectures, and problem domains might require specific
initialization strategies. Experimentation and hyperparameter tuning are often necessary to find the most
suitable initialization strategy for a given CNN. Improper initialization can lead to issues during training,
such as slow convergence, training instability, or poor generalization to new data.
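
As a sketch of how these schemes are selected in practice (Keras/TensorFlow assumed), initializers can be chosen per layer; "he_normal" and "glorot_uniform" correspond to the He and Xavier methods above:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), activation="relu",
                  kernel_initializer="he_normal"),      # He: suited to ReLU layers
    layers.Flatten(),
    layers.Dense(10, activation="softmax",
                 kernel_initializer="glorot_uniform"),  # Xavier/Glorot
])
```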
CNN EXAMPLES

Various examples of Convolutional Neural Networks (CNNs) that have been influential in the field of
computer vision:
1. LeNet-5:
● LeNet-5 is one of the earliest CNN architectures developed by Yann LeCun for
handwritten digit recognition. It played a crucial role in demonstrating the effectiveness
of CNNs for image classification.
● It consists of two sets of convolutional and average pooling layers followed by fully
connected layers. It also introduced the concept of using non-linear activation functions
(sigmoid) in convolutional layers.

2. AlexNet:
● AlexNet, developed by Alex Krizhevsky, is a significant milestone in the resurgence of
neural networks. It won the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) in 2012.
● It features multiple convolutional and max-pooling layers, ReLU activation functions,
and dropout for regularization. It also popularized the use of GPUs for training deep
networks.
3. VGGNet:
● VGGNet, created by the Visual Geometry Group (VGG) at the University of Oxford, is
known for its simplicity and uniform architecture.
● It consists of multiple convolutional layers with small 3x3 filters, followed by
max-pooling layers. The repeated stacking of these layers results in a deep network.

4. GoogLeNet (Inception):
● The GoogLeNet architecture introduced the concept of "inception modules," which use
multiple filter sizes within the same layer to capture features at various scales.
● This design allows the network to efficiently extract both local and global features,
contributing to improved accuracy.
5. ResNet (Residual Network):
● ResNet is a breakthrough architecture that addresses the challenge of training very deep
networks by introducing residual connections.
● Residual connections allow the network to skip certain layers and retain information from
previous layers, combating the vanishing gradient problem.

6. DenseNet:
● DenseNet is designed to improve gradient flow and feature reuse by connecting each
layer to all subsequent layers in a feed-forward manner.
● DenseNet's densely connected blocks lead to more compact models and better feature
propagation.
7. MobileNet:
● MobileNet focuses on efficient model architectures for mobile and embedded devices. It
introduces depthwise separable convolutions to reduce the computational cost.
● Depthwise separable convolutions split the standard convolution into separate depthwise
and pointwise convolutions, significantly reducing the number of parameters.
8. YOLO (You Only Look Once):
● YOLO is an object detection architecture that performs real-time object detection in a
single pass. It divides the image into a grid and predicts bounding boxes and class
probabilities for each grid cell.
● YOLO's efficiency and speed make it suitable for real-time applications like video
analysis.
9. U-Net:
● U-Net is a CNN architecture designed for semantic segmentation tasks, such as
segmenting objects within an image.
● It has a U-shaped architecture with a contracting path (encoder) and an expanding path
(decoder), allowing it to capture both global and local context.
10. Transformer-based Vision Models (e.g., ViT, DeiT):
● Transformers, initially designed for natural language processing, have been adapted for
computer vision tasks with architectures like Vision Transformers (ViT) and
Data-efficient Image Transformer (DeiT).
● These models use attention mechanisms to capture relationships between different image
patches, eliminating the need for hand-designed convolutional architectures.
These are just a few examples of CNN architectures, each designed to tackle specific challenges in
computer vision. The field continues to evolve, and researchers are exploring novel architectures to
improve performance, efficiency, and versatility across various tasks.
Over-fitting
Overfitting is a problem in machine learning that occurs when a model learns the training data too
well and is unable to generalize to new data. This happens when the model is too complex and learns
the noise and patterns in the training data that are not relevant to the problem it is trying to solve.

A model that is overfit will perform very well on the training data, but it will not perform as well on
new data. This is because the model has learned the specific details of the training data, and it will
not be able to generalize to new data that has different features or patterns.

There are a few ways to prevent overfitting:

 Use a simpler model. A simpler model is less likely to learn the noise in the training data.
 Use regularization. Regularization is a technique that penalizes the model for complexity.
This can help to prevent the model from learning the noise in the training data.
 Use cross-validation. Cross-validation is a technique for evaluating a model on data that it
has not seen before. This can help to identify if the model is overfitting the training data.

If you are concerned that your model is overfitting, you can try using some of these techniques to
prevent it.

Here are some additional things to keep in mind about overfitting:

 Overfitting is more likely to occur when the training data is small.


 Overfitting is more likely to occur when the model is complex.
 Overfitting can be difficult to detect, especially if the training data is small.

If you are not sure if your model is overfitting, you can try using cross-validation to evaluate it. Cross-
validation will help you to identify if the model is performing well on the training data, but not on
new data. If this is the case, then your model is likely overfitting.

There are a number of techniques that can be used to prevent overfitting. Some of these techniques
include:

 Data augmentation: This involves artificially increasing the size of the training data by
creating new data points that are similar to the existing data points. This can help to prevent
the model from overfitting to the noise in the training data.
 Regularization: This involves adding a penalty to the model's loss function that discourages
the model from becoming too complex. This can help to prevent the model from overfitting
to the training data.
 Early stopping: This involves stopping the training process early, before the model has had a
chance to overfit the training data. This can be done by monitoring the model's performance
on a validation set (see the sketch below).
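
A minimal early-stopping sketch in Keras (TensorFlow assumed; the tiny synthetic task exists only to show the mechanics): training halts once the validation loss stops improving:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

X = np.random.rand(200, 10)          # synthetic inputs
y = (X.sum(axis=1) > 5).astype(int)  # synthetic binary labels

model = keras.Sequential([
    layers.Input(shape=(10,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
history = model.fit(X, y, validation_split=0.2, epochs=100,
                    callbacks=[stop], verbose=0)
print("stopped after", len(history.history["loss"]), "epochs")
```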

Curse of Dimensionality
The curse of dimensionality arises from the fact that the volume of a high-dimensional space
increases exponentially with the number of dimensions. This means that there will be fewer data
points in each dimension, which can make it difficult to find patterns in the data. Additionally, the
distance between two points in high-dimensional space can be misleading, as two points that are far
apart in terms of Euclidean distance may actually be very close together in terms of other measures
of similarity.
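
This distance behaviour is easy to demonstrate. In the numpy sketch below (random uniform points, invented purely for illustration), the ratio between the nearest and farthest neighbour of a point approaches 1 as the dimension grows, so all points start to look equally far away:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    points = rng.random((500, d))                           # 500 random points in d dims
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances to point 0
    print(f"d = {d:4d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
# The ratio climbs toward 1 with d: distances concentrate and lose contrast.
```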

The curse of dimensionality can have a number of negative implications for data science. For
example, it can make it difficult to:

 Identify patterns in the data: As the number of dimensions increases, the data becomes
more sparse, which can make it difficult to find patterns in the data.
 Build accurate models: Machine learning models are trained on data, and in high-
dimensional space the data become so sparse that models are more likely to fit spurious
relationships that do not generalize, which can lead to inaccurate models.
 Interpret the results of models: The results of machine learning models can be difficult to
interpret in high-dimensional space, as the models may be making decisions based on
features that are not easily understandable.

There are a number of techniques that can be used to mitigate the problems caused by the curse of
dimensionality. Some of these techniques include:

 Dimensionality reduction: This involves reducing the number of dimensions in the data. This
can be done by using techniques such as principal component analysis (PCA) or linear
discriminant analysis (LDA).
 Feature selection: This involves selecting a subset of the features in the data that are most
relevant to the problem being solved. This can help to reduce the dimensionality of the data
and improve the performance of machine learning models.
 Regularization: This involves adding a penalty to the model's loss function that discourages
the model from becoming too complex. This can help to prevent the model from overfitting
to the training data.
 Ensemble learning: This involves training multiple models on different subsets of the data
and then combining the predictions of the models. This can help to improve the robustness
of the model to the curse of dimensionality.

By using these techniques, it is possible to build machine learning models that can learn from high-
dimensional data and make accurate predictions.

The curse of dimensionality is an important concept to understand in data science, as it can have a
significant impact on the performance of machine learning models. By understanding the curse of
dimensionality and the techniques that can be used to mitigate its effects, data scientists can build
more accurate and reliable models
BIAS VARIANCE TRADE-OFF

What is Bias?
Bias is the difference between the values predicted by the machine learning model and the correct values. High bias gives a large error on training as well as testing data. It is recommended that an algorithm always be low-bias, to avoid the problem of underfitting. With high bias, the predicted data follow a straight-line pattern and thus do not fit the dataset accurately; such fitting is known as underfitting of the data. This happens when the hypothesis is too simple or linear in nature.

What is Variance?
The variability of the model's prediction for a given data point, which tells us the spread of our data, is called the variance of the model. A model with high variance fits the training data in a very complex way and is thus not able to fit accurately data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data. When a model has high variance, it is said to be overfitting the data. Overfitting means fitting the training set accurately via a complex curve and a high-order hypothesis, but it is not the solution, because the error on unseen data is high. While training a model, the variance should be kept low.

Bias Variance Trade-off:

If the algorithm is too simple (a hypothesis with a linear equation) it will tend toward
high bias and low variance and thus be error-prone. If the algorithm fits too complex a
hypothesis (a high-degree equation) it will tend toward high variance and low bias, and
in that condition new entries will not perform well. There is something between both of
these conditions, known as a Trade-off, or the Bias-Variance Trade-off. This trade-off in
complexity is why there is a trade-off between bias and variance: an algorithm cannot be
more complex and less complex at the same time. We optimize the total error of the
model by working at the Bias-Variance Trade-off.

The best fit is given by the hypothesis at the trade-off point on the error-versus-complexity
curve. This point is chosen for training the algorithm because it gives low error on both
training and testing data. A sketch of this trade-off in code follows.
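As a rough sketch of this trade-off (not from the text; the degrees 1, 4 and 15 and the sine data are illustrative assumptions), fitting polynomials of increasing degree shows training error falling while test error rises once the hypothesis becomes too complex:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):          # underfit, balanced, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # testing error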
TRAINING SET

A training dataset is a collection of instances used in the learning process to fit the
parameters (e.g., weights) of a classifier.

A supervised learning method for classification tasks examines the training dataset to
discover, or learn, the best combinations of variables that will produce a strong
predictive model.

The goal is to create a fitted model that does a good job of generalizing to new, unknown
data. To estimate the model's accuracy in categorizing fresh data, "new" instances
from the held-out datasets are used to evaluate the fitted model. The examples in
the validation and test datasets should not be used to train the model, to minimize the
danger of over-fitting.

Most approaches to finding empirical links in training data tend to overfit the data,
which means they can find and exploit apparent links in the training data that don’t
hold in general.

VALIDATION SET
The model must be assessed regularly as it is being trained, which is exactly what the
validation set is for. We can determine how accurate a model is by computing the loss it
produces on the validation set at any given point. This is what training is all about.

What is a validation dataset? In simple terms:

 A validation dataset is a collection of instances used to fine-tune a classifier's
hyperparameters.

The number of hidden units in each layer is one good example of a hyperparameter for
machine learning neural networks. The validation set should have the same probability
distribution as the training dataset, as should the test dataset. Whenever a classifier's
hyperparameters must be tuned, a validation dataset separate from the test and training
datasets is required to avoid over-fitting.

If the most appropriate classifier for the problem is sought, the training dataset is used
to train the various candidate classifiers, the validation dataset is used to compare their
performances and decide which one to employ, and the test dataset is used to obtain
performance characteristics such as F-measure, sensitivity, accuracy or specificity.

The validation dataset is a hybrid: it is training data that is used for testing, but it is not
included in either the low-level training or the final testing. Early stopping is a
technique in which the candidate models are iterations of the same network, and
training stops when the error on the validation set grows, choosing the previous
model, the one with the least error.
TEST SET
This refers to the model’s final evaluation when the training phase is done. This stage
is crucial for determining the model’s generalization. We can get the working accuracy
of our model by using this collection.

 Validation data vs test data: a validation set is a sample of data used to provide an
unbiased assessment of a model fit on the training dataset while the model is being
tuned, whereas the test set is the data collection used to provide an impartial
assessment of the final model's fit to the training dataset.

It's worth noting that we must be objective, and truthful, by delaying exposing
the model to the test set until the training phase is complete. Only in this sense can we
consider the final accuracy measure to be dependable.

 Machine learning validation vs testing: validation means instructing the model to
learn from its errors, while testing means drawing conclusions about the model's
performance. A minimal splitting sketch follows.
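A minimal sketch of carving one dataset into the three sets, assuming scikit-learn and an illustrative 60/20/20 split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the training set, then split the remainder in half.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 90 30 30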
MULTIVARIATE REGRESSION

Multivariate regression is a controlled, or supervised, machine learning technique that
analyses multiple data variables. It is an extension of multiple regression, which involves
one dependent variable and many independent variables; the output is predicted based
on the independent variables.

Multivariate regression is a technique used to measure the degree to which the various
independent variables and the various dependent variables are linearly related to each
other. The relation is said to be linear due to the correlation between the variables. Once
the multivariate regression is applied to the dataset, this method is then used to predict
the behaviour of the response variable based on its corresponding predictor variables.

A multivariate linear regression model has the form

Yk = β0k + β1k X1 + β2k X2 + ... + βpk Xp + εk, for k = 1, ..., m,

where the relationships between multiple dependent variables (i.e., the Ys), measures of
multiple outcomes, and a single set of predictor variables (i.e., the Xs) are assessed.

Example of multivariate regression

An agriculture expert decides to study the crops that were ruined in a certain region. He
collects data about recent climatic changes, water supply, irrigation methods, pesticide
usage, etc., to understand why the crops are turning black, yield no fruit, and dry out
soon.

In the above example, the expert decides to collect the mentioned data, which act as the
independent variables. These variables will affect the dependent variables which are
nothing but the conditions of the crops. In such a case, using single regression would be
a bad choice and multivariate regression might just do the trick.
Steps to achieve multivariate regression
The processes involved in multivariate regression analysis include the selection of
features, engineering the features, feature normalization, selection of loss functions,
hypothesis analysis, and creating a regression model.

1. Selection of features: This is the most important step in multivariate regression.
Also known as variable selection, this process involves selecting viable variables
to build efficient models.
2. Feature normalization: This involves feature scaling to maintain streamlined
distribution and data ratios. This helps in better data analysis. The value of all
the features can be changed according to the requirement.
3. Selecting Loss function and hypothesis: The loss function is used for
predicting errors. The loss function comes into play when the hypothesis
prediction changes from the actual figures. Here, the hypothesis represents the
value predicted from the feature or variable.
4. Fixing hypothesis parameters: The parameters of the hypothesis are fixed, or set, in
such a way that they minimize the loss function and improve prediction.
5. Reducing the loss function: The loss function is minimized by generating an
algorithm specifically for loss minimization on the dataset which in turn
facilitates the alteration of hypothesis parameters. Gradient descent is the most
commonly used algorithm for loss minimization. The algorithm can also be used
for other actions once the loss minimization is complete.
6. Analysing the hypothesis function: The function of the hypothesis needs to be
analysed as it is crucial for predicting the values. After the function is analysed, it
is then tested on test data.
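As a rough sketch of these steps (the agricultural feature and target names are hypothetical, and the synthetic data stands in for real measurements), scikit-learn's LinearRegression can fit several dependent variables at once:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical independent variables: rainfall, irrigation, pesticide use, temperature.
X = rng.normal(size=(150, 4))
# Hypothetical dependent variables: crop yield and plant health, linearly related to X.
Y = X @ rng.normal(size=(4, 2)) + rng.normal(scale=0.1, size=(150, 2))

X_scaled = StandardScaler().fit_transform(X)    # the feature normalization step
model = LinearRegression().fit(X_scaled, Y)     # fits both dependent variables at once

print(model.coef_.shape)                        # (2, 4): one coefficient row per Y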

Assumptions in the Multivariate Regression Model

 The dependent and the independent variables have a linear relationship.
 The independent variables do not have a strong correlation among themselves.
 The observations Yi are chosen randomly and independently from the population.
Advantages of Multivariate Regression

1. Multivariate regression helps us to study the relationships among multiple
variables in the dataset.
2. The correlation between dependent and independent variables helps in
predicting the outcome.
3. It is one of the most convenient and popular algorithms used in machine
learning.

Disadvantages of Multivariate Regression

 The complexity of multivariate techniques requires complex mathematical
calculations.
 It is not easy to interpret the output of a multivariate regression model, since there
can be inconsistencies in the loss and error outputs.
 Multivariate regression models are not well suited to smaller datasets; they are
designed to produce accurate outputs on larger datasets.

APPLICATIONS OF REGRESSION ANALYSIS

Regression analysis can be used for a wide range of applications, including:

 Predictive modelling: Regression analysis can be used to predict future values of the
dependent variable based on changes in the independent variable.
 Causal analysis: Regression analysis can be used to determine whether changes in
the independent variable cause changes in the dependent variable.
 Forecasting: Regression analysis can be used to forecast trends and patterns in the
data, which can be useful for planning and decision-making.
 Control and optimization: Regression analysis can be used to optimize processes
and control systems by identifying the factors that have the greatest impact on the
outcome.
BIAS AND VARIANCE

Bias is simply defined as the inability of the model to match the data, because of which
there is some difference, or error, between the model's predicted value and the actual
value. These differences between actual or expected values and the predicted values
are known as error, bias error, or error due to bias. Bias is a systematic error that
occurs due to wrong assumptions in the machine learning process.
Let θ be the true value of a parameter and let θ̂ be an estimator of θ based on a sample of
data. Then, the bias of the estimator is given by:

Bias(θ̂) = E[θ̂] - θ

where E[θ̂] is the expected value of the estimator. Bias is a measurement of how well the
model fits the data.
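As a small numerical illustration (not from the text), the 1/n sample-variance estimator is a classic biased estimator, and simulating it shows E[θ̂] - θ directly:

import numpy as np

rng = np.random.default_rng(3)
true_var = 4.0                                        # theta: the variance of N(0, 2^2)
estimates = [np.var(rng.normal(scale=2.0, size=10))   # np.var with ddof=0 is the 1/n estimator
             for _ in range(100_000)]

print(np.mean(estimates) - true_var)                  # approx -0.4, i.e. -true_var / n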

 Low Bias: Low bias value means fewer assumptions are taken to build the
target function. In this case, the model will closely match the training
dataset.
 High Bias: High bias value means more assumptions are taken to build the
target function. In this case, the model will not match the training dataset
closely.
The high-bias model will not be able to capture the trend of the dataset. It is considered
an underfitting model, with a high error rate caused by a very simplified algorithm.
For example, a linear regression model may have a high bias if the data has a non-
linear relationship.

Variance

Variance is the measure of spread in data from its mean position. In machine learning,
variance is the amount by which the performance of a predictive model changes when
it is trained on different subsets of the training data. More specifically, variance measures
how sensitive the model is to a different subset of the training dataset, i.e. how much it
adjusts when trained on a new subset.
Let Y be the actual values of the target variable and let Ŷ be the predicted values. Then
the variance of a model can be measured as the expected value of the square of the
difference between the predicted values and the expected value of the predicted values:

Variance = E[(Ŷ - E[Ŷ])²]

where E[Ŷ] is the expected value of the predicted values, averaged over all the training
data.
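This expectation can be approximated empirically. A rough sketch (the tree model and sine data are illustrative assumptions) trains the same model on many bootstrap subsets and measures the spread of its predictions at a single query point:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)
x_query = np.array([[1.0]])                      # a fixed point to predict at

preds = []
for _ in range(200):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample of the training set
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(tree.predict(x_query)[0])

print(np.var(preds))                             # approximates E[(Y_hat - E[Y_hat])^2] at x_query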

Variance errors are either low or high-variance errors.

 Low variance: Low variance means that the model is less sensitive to
changes in the training data and can produce consistent estimates of the
target function with different subsets of data from the same distribution.
Combined with high bias, this is the case of underfitting, where the model
fails to fit both the training and the test data.
 High variance: High variance means that the model is very sensitive to
changes in the training data and can result in significant changes in the
estimate of the target function when trained on different subsets of data
from the same distribution. This is the case of overfitting when the model
performs well on the training data but poorly on new, unseen test data: it
fits the training data so closely that it fails to generalize to new data.

Ways to Reduce the Variance in Machine Learning:

 Cross-validation: By splitting the data into training and testing sets
multiple times, cross-validation can help identify whether a model is
overfitting or underfitting, and can be used to tune hyperparameters to
reduce variance.
 Feature selection: Choosing only the relevant features decreases the
model's complexity, which can reduce the variance error.
 Regularization: We can use L1 or L2 regularization to reduce variance in
machine learning models.
 Ensemble methods: These combine multiple models to improve
generalization performance. Bagging, boosting, and stacking are common
ensemble methods that can help reduce variance and improve
generalization performance.
 Simplifying the model: Reducing the complexity of the model, such as
decreasing the number of parameters or layers in a neural network, can also
help reduce variance and improve generalization performance.
 Early stopping: Early stopping is a technique used to prevent overfitting by
stopping the training of the deep learning model when the performance on
the validation set stops improving.
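Two of these tools in code, as a minimal sketch assuming scikit-learn (the alpha value, estimator count, and synthetic data are illustrative): L2 regularization with Ridge, and a bagged tree ensemble:

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0)                   # L2 penalty shrinks weights, reducing variance
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)

print(cross_val_score(ridge, X, y, cv=5).mean())    # R^2 averaged over 5 folds
print(cross_val_score(bagged, X, y, cv=5).mean())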

Different Combinations of Bias-Variance

There can be four combinations between bias and variance.

 High Bias, Low Variance: A model with high bias and low variance is said
to be underfitting.
 High Variance, Low Bias: A model with high variance and low bias is said
to be overfitting.
 High-Bias, High-Variance: A model has both high bias and high variance,
which means that the model is not able to capture the underlying patterns
in the data (high bias) and is also too sensitive to change in the training data
(high variance). As a result, the model will produce inconsistent and
inaccurate predictions on average.
 Low Bias, Low Variance: A model that has low bias and low variance
means that the model is able to capture the underlying patterns in the data
(low bias) and is not too sensitive to change in the training data (low
variance). This is the ideal scenario for a machine learning model, as it is
able to generalize well to new, unseen data and produce consistent and
accurate predictions, though in practice this ideal is rarely fully achievable.
Introduction to Deep Learning
Deep learning is a branch of machine learning which is based on artificial neural networks. It
is capable of learning complex patterns and relationships within data. In deep learning, we
don't need to explicitly program everything. It has become increasingly popular in recent
years due to advances in processing power and the availability of large datasets. It is
based on artificial neural networks (ANNs), also known as deep neural networks (DNNs).
These neural networks are inspired by the structure and function of the human brain’s
biological neurons, and they are designed to learn from large amounts of data.

1. Deep Learning is a subfield of Machine Learning that involves the use of neural
networks to model and solve complex problems. Neural networks are modeled after
the structure and function of the human brain and consist of layers of interconnected
nodes that process and transform data.
2. The key characteristic of Deep Learning is the use of deep neural networks, which
have multiple layers of interconnected nodes. These networks can learn complex
representations of data by discovering hierarchical patterns and features in the data.
Deep Learning algorithms can automatically learn and improve from data without the
need for manual feature engineering.
3. Deep Learning has achieved significant success in various fields, including image
recognition, natural language processing, speech recognition, and recommendation
systems. Some of the popular Deep Learning architectures include Convolutional
Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Deep Belief
Networks (DBNs).
4. Training deep neural networks typically requires a large amount of data and
computational resources. However, the availability of cloud computing and the
development of specialized hardware, such as Graphics Processing Units (GPUs), has
made it easier to train deep neural networks.

In summary, Deep Learning is a subfield of Machine Learning that involves the use of deep
neural networks to model and solve complex problems. Deep Learning has achieved
significant success in various fields, and its use is expected to continue to grow as more
data and more powerful computing resources become available.

What is Deep Learning?


Deep learning is the branch of machine learning which is based on artificial neural network
architecture. An artificial neural network or ANN uses layers of interconnected nodes called
neurons that work together to process and learn from the input data.

In a fully connected Deep neural network, there is an input layer and one or more hidden
layers connected one after the other. Each neuron receives input from the previous layer
neurons or the input layer. The output of one neuron becomes the input to other neurons in
the next layer of the network, and this process continues until the final layer produces the
output of the network. The layers of the neural network transform the input data through a
series of nonlinear transformations, allowing the network to learn complex representations of
the input data.
Today Deep learning has become one of the most popular and visible areas of machine
learning, due to its success in a variety of applications, such as computer vision, natural
language processing, and Reinforcement learning.

Deep learning can be used for supervised, unsupervised, as well as reinforcement machine
learning, and it uses a different approach for each.

• Supervised Machine Learning: Supervised machine learning is the machine
learning technique in which the neural network learns to make predictions or classify
data based on labelled datasets. Here we provide both the input features and the
target variables. The neural network learns to make predictions based on the cost, or
error, that comes from the difference between the predicted and the actual target; this
process is known as backpropagation. Deep learning algorithms like convolutional
neural networks and recurrent neural networks are used for many supervised tasks like
image classification and recognition, sentiment analysis, language translation, etc.
• Unsupervised Machine Learning: Unsupervised machine learning is the machine
learning technique in which the neural network learns to discover patterns or to
cluster the dataset based on unlabelled datasets. Here there are no target variables;
the machine has to determine the hidden patterns or relationships within the data by
itself. Deep learning algorithms like autoencoders and generative models are used for
unsupervised tasks like clustering, dimensionality reduction, and anomaly detection.
• Reinforcement Machine Learning: Reinforcement Machine Learning is the
machine learning technique in which an agent learns to make decisions in an
environment to maximize a reward signal. The agent interacts with the environment
by taking action and observing the resulting rewards. Deep learning can be used to
learn policies, or a set of actions, that maximizes the cumulative reward over time.
Deep reinforcement learning algorithms like Deep Q-Networks and Deep
Deterministic Policy Gradient (DDPG) are used for tasks such as robotics and
game playing.
Artificial neural networks
Artificial neural networks are built on the principles of the structure and operation of
human neurons. It is also known as neural networks or neural nets. An artificial neural
network’s input layer, which is the first layer, receives input from external sources
and passes it on to the hidden layer, which is the second layer. Each neuron in the
hidden layer gets information from the neurons in the previous layer, computes the
weighted total, and then transfers it to the neurons in the next layer. These
connections are weighted, which means that the impacts of the inputs from the
preceding layer are more or less optimized by giving each input a distinct weight.
These weights are then adjusted during the training process to enhance the
performance of the model.

Artificial neurons, also known as units, are found in artificial neural networks. The whole
Artificial Neural Network is composed of these artificial neurons, which are arranged in a
series of layers. The complexities of neural networks will depend on the complexities of the
underlying patterns in the dataset whether a layer has a dozen units or millions of
units. Commonly, Artificial Neural Network has an input layer, an output layer as well as
hidden layers. The input layer receives data from the outside world which the neural network
needs to analyze or learn about.

In a fully connected artificial neural network, there is an input layer and one or more hidden
layers connected one after the other. Each neuron receives input from the previous layer
neurons or the input layer. The output of one neuron becomes the input to other neurons in
the next layer of the network, and this process continues until the final layer produces the
output of the network. Then, after passing through one or more hidden layers, this data is
transformed into valuable data for the output layer. Finally, the output layer provides an
output in the form of an artificial neural network’s response to the data that comes in.

Units are linked to one another from one layer to another in most neural networks. Each
of these links has weights that control how much one unit influences another. The neural
network learns more and more about the data as it moves from one unit to another, ultimately
producing an output from the output layer.

Difference between Machine Learning and Deep Learning:

Machine learning and deep learning are both subsets of artificial intelligence, but there
are many similarities and differences between them.

Machine Learning vs Deep Learning:

• Machine learning applies statistical algorithms to learn the hidden patterns and
relationships in the dataset; deep learning uses artificial neural network architectures
to learn them.
• Machine learning can work on smaller datasets; deep learning requires a larger
volume of data.
• Machine learning is better for simpler, low-label tasks; deep learning is better for
complex tasks like image processing, natural language processing, etc.
• Machine learning takes less time to train a model; deep learning takes more time.
• In machine learning, a model is created from relevant features that are manually
extracted from images to detect an object in the image; in deep learning, relevant
features are extracted automatically, in an end-to-end learning process.
• Machine learning models are less complex and their results are easy to interpret;
deep learning models are more complex, work like a black box, and their results are
not easy to interpret.
• Machine learning can work on a CPU, or requires less computing power; deep
learning requires a high-performance computer with a GPU.

Types of neural networks

Deep Learning models are able to automatically learn features from the data, which makes
them well-suited for tasks such as image recognition, speech recognition, and natural
language processing. The most widely used architectures in deep learning are feedforward
neural networks, convolutional neural networks (CNNs), and recurrent neural networks
(RNNs).

Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow of
information through the network. FNNs have been widely used for tasks such as image
classification, speech recognition, and natural language processing.

Convolutional Neural Networks (CNNs) are designed specifically for image and video recognition
tasks. CNNs are able to automatically learn features from the images, which makes them
well-suited for tasks such as image classification, object detection, and image segmentation.
Recurrent Neural Networks (RNNs) are a type of neural network that is able to process
sequential data, such as time series and natural language. RNNs are able to maintain an
internal state that captures information about the previous inputs, which makes them well-
suited for tasks such as speech recognition, natural language processing, and language
translation.

Applications of Deep Learning:


The main applications of deep learning can be divided into computer vision, natural language
processing (NLP), and reinforcement learning.

Computer vision

In computer vision, Deep learning models can enable machines to identify and understand
visual data. Some of the main applications of deep learning in computer vision include:

• Object detection and recognition: Deep learning models can be used to identify and
locate objects within images and videos, enabling applications such as self-driving
cars, surveillance, and robotics.
• Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications such as
medical imaging, quality control, and image retrieval.
• Image segmentation: Deep learning models can be used for image segmentation into
different regions, making it possible to identify specific features within images.

Natural language processing (NLP):

In NLP, the Deep learning model can enable machines to understand and generate human
language. Some of the main applications of deep learning in NLP include:

• Automatic text generation: Deep learning models can learn a corpus of text, and
new text like summaries and essays can be generated automatically using these
trained models.
• Language translation: Deep learning models can translate text from one language to
another, making it possible to communicate with people from different linguistic
backgrounds.
• Sentiment analysis: Deep learning models can analyze the sentiment of a piece of
text, making it possible to determine whether the text is positive, negative, or neutral.
This is used in applications such as customer service, social media monitoring, and
political analysis.
• Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion, voice
search, and voice-controlled devices.

Reinforcement learning:

In reinforcement learning, deep learning is used to train agents to take actions in an
environment so as to maximize a reward. Some of the main applications of deep learning
in reinforcement learning include:
• Game playing: Deep reinforcement learning models have been able to beat human
experts at games such as Go, Chess, and Atari.
• Robotics: Deep reinforcement learning models can be used to train robots to perform
complex tasks such as grasping objects, navigation, and manipulation.
• Control systems: Deep reinforcement learning models can be used to control
complex systems such as power grids, traffic management, and supply chain
optimization.

Challenges in Deep Learning


Deep learning has made significant advancements in various fields, but there are still some
challenges that need to be addressed. Here are some of the main challenges in deep learning:

1. Data availability: Deep learning requires large amounts of data to learn from, and
gathering enough data for training is a big concern.
2. Computational resources: Training a deep learning model is computationally
expensive because it requires specialized hardware like GPUs and TPUs.
3. Time-consuming: When working on sequential data, training can take a very long
time, even days or months, depending on the computational resources.
4. Interpretability: Deep learning models are complex and work like a black box, so it
is very difficult to interpret their results.
5. Overfitting: When a model is trained over and over, it becomes too specialized for
the training data, leading to overfitting and poor performance on new data.

Advantages of Deep Learning:

1. High accuracy: Deep Learning algorithms can achieve state-of-the-art performance in
various tasks, such as image recognition and natural language processing.
2. Automated feature engineering: Deep Learning algorithms can automatically discover
and learn relevant features from data without the need for manual feature engineering.
3. Scalability: Deep Learning models can scale to handle large and complex datasets,
and can learn from massive amounts of data.
4. Flexibility: Deep Learning models can be applied to a wide range of tasks and can
handle various types of data, such as images, text, and speech.
5. Continual improvement: Deep Learning models can continually improve their
performance as more data becomes available.

Disadvantages of Deep Learning:

1. High computational requirements: Deep Learning models require large amounts of
data and computational resources to train and optimize.
2. Requires large amounts of labeled data: Deep Learning models often require a large
amount of labeled data for training, which can be expensive and time-consuming to
acquire.
3. Interpretability: Deep Learning models can be challenging to interpret, making it
difficult to understand how they make decisions.
4. Overfitting: Deep Learning models can sometimes overfit to the training data,
resulting in poor performance on new and unseen data.
5. Black-box nature: Deep Learning models are often treated as black boxes, making it
difficult to understand how they work and how they arrived at their predictions.
In summary, while Deep Learning offers many advantages, including high accuracy
and scalability, it also has some disadvantages, such as high computational
requirements, the need for large amounts of labeled data, and interpretability
challenges. These limitations need to be carefully considered when deciding whether
to use Deep Learning for a specific task.
DEEP FEEDFORWARD NETWORKS

A deep feedforward network, also known as a feedforward neural network or a
multilayer perceptron (MLP), is a fundamental type of artificial neural network
used in machine learning and deep learning.
used in machine learning and deep learning. These models are called feedforward
because information flows through the function being evaluated from x, through
the intermediate computations used to define f, and finally to the output y. There
are no feedback connections in which outputs of the model are fed back into itself.
When feedforward neural networks are extended to include feedback connections,
they are called recurrent neural networks.
Feedforward Architecture: Information flows in one direction, from the input layer
through one or more hidden layers to the output layer, without any cycles or
feedback loops. Each layer in the network is fully connected to the next.
Architecture

The architecture of a deep feedforward network, also known as a multilayer
perceptron (MLP), consists of multiple layers, including the input layer, one or
more hidden layers, and the output layer.
more hidden layers, and the output layer. Each layer is composed of a certain
number of neurons (also referred to as units or nodes) that perform computations
on the input data.
Input Layer
This layer receives the raw input data, which could be a feature vector
representing the input features for the task at hand. The number of neurons in the
input layer is determined by the dimensionality of the input data.
Hidden Layers
These layers are sandwiched between the input and output layers and perform
intermediate computations. Deep networks have multiple hidden layers, which
allows them to learn hierarchical representations and capture complex features in
the data. Each hidden layer typically uses an activation function to introduce
nonlinearity into the model. The number of hidden layers and the number of
neurons in each hidden layer are design choices that depend on the complexity of
the problem and the amount of data available. These layers use activation
functions, such as ReLU or sigmoid, to introduce non-linearity into the network,
allowing it to learn and model more complex relationships between the inputs and
outputs.
Output Layer
This layer produces the final output or prediction based on the information
processed through the hidden layers. The number of neurons in the output layer
depends on the task being performed.
For example:
• In binary classification, the output layer may have a single neuron using a
sigmoid activation function to produce a probability score.
• In multi-class classification, the output layer may have multiple neurons
(equal to the number of classes) with a softmax activation function to
produce class probabilities.
• In regression, the output layer may have a single neuron without any
activation function (linear activation) for predicting continuous values.
Weights and Biases
Each connection between neurons in adjacent layers has an associated weight,
and each neuron typically has a bias term. These weights and biases are learned
during training using optimization algorithms such as gradient descent and
backpropagation.
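A minimal sketch of such an architecture, assuming Keras (any framework would do; the 20-feature, 3-class setup and the layer sizes are illustrative):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),              # input layer: 20 features
    keras.layers.Dense(64, activation="relu"),    # hidden layer 1
    keras.layers.Dense(32, activation="relu"),    # hidden layer 2
    keras.layers.Dense(3, activation="softmax"),  # output layer: 3-class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()                                   # lists the weights and biases per layer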
Activation Functions
Activation functions are applied to the output of each neuron in the hidden layers.
They introduce nonlinearity, allowing the network to learn complex relationships
in the data.
1. Sigmoid:

The sigmoid activation function, σ(x) = 1 / (1 + e^(-x)), maps any input value to a
value between 0 and 1, which is useful for binary classification problems.
2. The rectified linear unit (ReLU):

It is a popular choice in neural networks. It is defined as f(x) = max(0, x), where x
is the input to the function. For any input x, the output of the ReLU function is x if
x is positive and 0 if x is negative. This activation function is computationally
simple to implement and faster than other non-linear activation functions like tanh
or sigmoid.
3. Softmax:
The softmax activation function maps a vector of input values to a probability
distribution, softmax(x)_i = e^(x_i) / Σ_j e^(x_j), which is useful for multi-class
classification problems.
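Plain NumPy versions of these three functions, as a small self-contained sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes any input into (0, 1)

def relu(x):
    return np.maximum(0, x)                # f(x) = max(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))              # subtract the max for numerical stability
    return e / e.sum()                     # non-negative values summing to 1

print(sigmoid(0.0), relu(-2.0), softmax(np.array([1.0, 2.0, 3.0])))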

Example: Learning of XOR

Truth table of XOR:

x1  x2  |  x1 XOR x2
0   0   |  0
0   1   |  1
1   0   |  1
1   1   |  0
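A minimal sketch of a small network learning XOR, assuming scikit-learn's MLPClassifier (the hyperparameters are illustrative; with one hidden layer the network should typically recover the table above):

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                 # the XOR truth table above

# One hidden layer of 4 tanh units; XOR is not linearly separable,
# so a network with no hidden layer could not learn it.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))                      # expected: [0 1 1 0]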