Deep Learning Notes

The key takeaways are that loss functions need to be differentiable and computable from mini-batches for backpropagation during training neural networks. Common loss functions include cross-entropy for classification and mean squared error for regression.

Feature maps refer to the outputs of convolutional layers, where each channel corresponds to detecting a specific feature, and the spatial map shows where that feature is detected in the input.

Common activation functions are sigmoid, used for binary classification to output probabilities, and softmax, used for multiclass classification by normalizing exponentiated scores into a probability distribution. ReLU mitigates vanishing gradients but can cause dead neurons.

DEEP LEARNING

Notes
LOSS FUNCTION

Loss functions, after all, need to be computable given only a mini-batch of data (ideally, a loss function should be computable for as little as a single data point) and must be differentiable (otherwise, you can't use backpropagation to train your network). For instance, the widely used classification metric ROC AUC can't be directly optimized. Hence, in classification tasks, it's common to optimize for a proxy metric of ROC AUC, such as crossentropy. In general, you can hope that the lower the crossentropy gets, the higher the ROC AUC will be.
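As a rough illustration of a loss that is computable on a mini-batch (or a single point), here is a minimal sketch of binary crossentropy in plain Python. The function name, the clipping constant `eps` and the sample numbers are assumptions made for this example, not taken from the notes.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-12):
    """Mean binary crossentropy over a mini-batch of any size (even one point)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical mini-batches: confident-and-right vs. mostly wrong predictions.
good = binary_crossentropy([1, 0, 1], [0.9, 0.1, 0.8])
bad = binary_crossentropy([1, 0, 1], [0.4, 0.6, 0.3])
single = binary_crossentropy([1], [0.9])  # a single data point also works
```

The better predictions give the lower loss, which is the behaviour one hopes will also track a metric such as ROC AUC.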

CNN → patterns learned are translation invariant
Earlier layers → learn local patterns
Later layers → learn global patterns

CNN → FILTERS
Filters encode specific aspects of the input data; e.g. a single filter could encode the presence of a face.

That is what the term feature map means: every dimension in the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input.
AIM OF MACHINE LEARNING

MINIMIZE COST FUNCTION → find MODEL PARAMETERS

DIFFERENTIATION
RATE OF CHANGE OF A QUANTITY WITH RESPECT TO ANOTHER QUANTITY

PARTIAL DERIVATIVE
DERIVATIVE OF A FUNCTION OF MULTIPLE VARIABLES W.R.T. ONE VARIABLE, KEEPING OTHERS FIXED

GRADIENT
VECTOR OF PARTIAL DERIVATIVES IS THE GRADIENT OF THE FUNCTION

At maxima and minima, the gradient of the function should be zero.

COST FUNCTION → DIFFERENCE BETWEEN ACTUAL AND PREDICTED
MODEL PARAMETERS → which are used for prediction
OPTIMIZING → MINIMIZING COST FUNCTION OR MAXIMIZING LIKELIHOOD OF TRAINING DATA
WHY NON-LINEARITY?

There is no way we can separate the 2 classes with a single hyperplane.

REQUIRING 2 HYPERPLANES TO SEPARATE 2 CLASSES IS EQUIVALENT TO HAVING A NON-LINEAR CLASSIFIER

Multilayer perceptron (MLP) can provide non-linear separation; a single perceptron can't.
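XOR is a standard case where one hyperplane is not enough. The sketch below hand-picks the weights of a two-layer perceptron (they are illustrative assumptions, not learned values) so that two hidden units, i.e. two hyperplanes, together separate the classes.

```python
def step(z):
    """Threshold activation of a classic perceptron."""
    return 1 if z > 0 else 0

def perceptron(x1, x2, w1, w2, b):
    return step(w1 * x1 + w2 * x2 + b)

def xor_mlp(x1, x2):
    h_or = perceptron(x1, x2, 1.0, 1.0, -0.5)    # fires if at least one input is 1
    h_and = perceptron(x1, x2, 1.0, 1.0, -1.5)   # fires only if both inputs are 1
    # Output unit: "OR and not AND", a combination of the two hyperplanes.
    return perceptron(h_or, h_and, 1.0, -1.0, -0.5)

xor_table = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

No single call to `perceptron` on the raw inputs can reproduce this table, which is the point of the note above.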
ACTIVATION FUNCTIONS

1. SIGMOID ⇒ $y = \frac{1}{1 + e^{-x}}$
→ It has the capability to provide a smooth change in the output → hence is used for probability.
→ Used for binary classification.

2. SOFTMAX ⇒
→ Generalization (normalization) of the sigmoid function
→ Multiclass classification
$P(y = i \mid x) = \frac{e^{w_i^{\top} x + b_i}}{\sum_{j} e^{w_j^{\top} x + b_j}}$
→ In softmax we set different weights for different classes.

Loss function for Softmax:
$C = -\sum_{i} y_i \log P(y = i \mid x)$

3. RELU ⇒ $y = \max(0, z)$
→ As the gradient is non-zero even for large positive numbers (unlike sigmoid) → it never stops learning, hence avoids vanishing gradient.
→ But if the value is non-positive, then both output and gradient are zero → it stops learning. In this case use PReLU: $y = \max(0, z) + \beta \min(0, z)$

4. TANH ⇒ $y = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
→ With tanh, vanishing gradient problems around 0 can be avoided (it still saturates for large |z|).

Loss function: Log loss
$C = -\left[ y \log q + (1 - y) \log(1 - q) \right]$
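The activations and the log loss above can be sketched in a few lines of plain Python. The function names, the `beta` default for PReLU and the sample scores are assumptions for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu(z):
    return max(0.0, z)

def prelu(z, beta=0.1):
    # Keeps a small slope (beta) for negative inputs instead of a flat zero.
    return max(0.0, z) + beta * min(0.0, z)

def log_loss(y, q):
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

probs = softmax([2.0, 1.0, 0.1])
two_class = softmax([1.5, 0.0])[0]   # with two classes, softmax reduces to a sigmoid
```

The last line shows in what sense softmax generalizes the sigmoid: for two classes with scores `x` and `0`, the first probability equals `sigmoid(x)`.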
MOTIVATION BEHIND BACKPROPAGATION

Weights follow a hierarchical structure in neural networks (unlike the logistic / linear structure), hence use chain-rule-based backpropagation to update weights:
$w^{(t+1)} = w^{(t)} - \eta \, \frac{\partial C}{\partial w}$

For minima, the gradient of the cost function with respect to $\theta$ should be zero:
$\nabla_{\theta} C(\theta) = 0$

Update rule for gradient descent:
$\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_{\theta} C(\theta^{(t)})$
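A minimal sketch of the gradient descent update on a one-parameter cost. The quadratic cost, the learning rate and the step count are assumptions chosen so the minimum (at 3) is easy to verify by hand.

```python
def cost(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, lr=0.1, steps=100):
    for _ in range(steps):
        theta = theta - lr * grad(theta)   # theta(t+1) = theta(t) - eta * gradient
    return theta

theta_star = gradient_descent(theta=0.0)
```

At the value reached, the gradient is close to zero, matching the condition for a minimum stated above.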
BACK PROPAGATION

Is a method to propagate the error at the output layer backward so that gradients at previous layers can easily be computed using the chain rule of derivatives.

The weight would be impacted by the errors at all three output units. Basically, the error at the output of the jth unit in the hidden layer would have an error contribution from all output units, scaled by the weights connecting the output layers to the jth hidden unit.

Each iteration is composed of a forward pass and a backward pass, or backpropagation. In the forward pass, the net input and output at each neuron unit in each layer are computed. Based on the predicted output and the actual target values, the error is computed in the output layers. The error is backpropagated by combining it with the neuron outputs computed in the forward pass and with existing weights. Through backpropagation the gradients get computed iteratively. Once the gradients are computed, the weights are updated by gradient-descent methods.

FORWARD PASS ⇒ ERROR ⇒ BACKPROP ⇒ GRADIENTS COMPUTED ⇒ WEIGHTS UPDATED

BACKPROP STEPS

1. ERROR IS BACKPROPAGATED WITH EXISTING WEIGHTS AND NEURON OUTPUTS (from forward pass)
2. GRADIENTS CALCULATED ITERATIVELY
3. WEIGHTS ARE UPDATED USING GRADIENT DESCENT
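The three steps can be sketched on a tiny network with two inputs, two sigmoid hidden units and one sigmoid output, trained on squared error. The architecture, the weights and the input are illustrative assumptions; the numerical check at the end is only there to confirm that the hand-derived gradient is right.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """W1: one weight row per hidden unit; W2: one weight per hidden unit (no biases)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hj for w, hj in zip(W2, h)))
    return h, y

def loss(x, t, W1, W2):
    _, y = forward(x, W1, W2)
    return 0.5 * (y - t) ** 2

def backward(x, t, W1, W2):
    h, y = forward(x, W1, W2)                  # forward pass
    delta_out = (y - t) * y * (1.0 - y)        # error at the output unit
    grad_W2 = [delta_out * hj for hj in h]
    # Hidden error: output error scaled by the connecting weight, times sigmoid'.
    delta_hid = [delta_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    grad_W1 = [[dj * xi for xi in x] for dj in delta_hid]
    return grad_W1, grad_W2

x, t = [0.5, -1.0], 1.0
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [0.7, -0.5]
grad_W1, grad_W2 = backward(x, t, W1, W2)

# Central-difference check of one entry of grad_W1.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss(x, t, W1_plus, W2) - loss(x, t, W1_minus, W2)) / (2 * eps)

# Gradient-descent weight update.
lr = 0.5
W1_new = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, grad_W1)]
W2_new = [w - lr * g for w, g in zip(W2, grad_W2)]
```

After one update the loss on this example is lower than before, which is all a single gradient step promises.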
Deep learning has automatic feature learning capability, which reduces feature engineering time.

[Plot: performance vs. amount of data (MB, GB, PB); curves for Deep Learning and Machine Learning]
TENSORFLOW (1.x API)

1. InteractiveSession(): to run TensorFlow in interactive mode.
2. eval(): to run inside an interactive session.
3. tf.global_variables_initializer(): to initialize variables.
HOG

HOG is a feature descriptor.
Image is divided into portions.
Gradient for each portion of the image is calculated.
GRADIENT DIRECTIONS → N, NW, NE, S, SW, SE, etc.
A histogram of these gradients is built for each portion.
sum :

BOW

.
final Steps :

fans
Cost
function MLNN is smooth
of .

# Problem with quadratic cost functions with NN → they are mostly non-convex, leading to getting stuck @ local minima.

UNDERSTANDING LEARNING RATE
Momentum generally keeps track of the previous gradients through the velocity component. So, if the gradients are steadily pointing toward a good local minimum that has a large basin of attraction, the velocity component would be high in the direction of the good local minimum. If the new gradient is noisy and points toward a bad local minimum, the velocity component would provide momentum to continue in the same direction and not get influenced by the new gradient too much.
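A small sketch of that behaviour: four steady gradients followed by one noisy gradient of opposite sign. The learning rate, the momentum coefficient `gamma` and the gradient sequence are assumptions picked for illustration.

```python
def sgd_momentum(grads, lr=0.1, gamma=0.9, w=0.0):
    """Return the weight after each step; velocity accumulates past gradients."""
    velocity = 0.0
    history = []
    for g in grads:
        velocity = gamma * velocity + lr * g
        w = w - velocity
        history.append(w)
    return history

grads = [1.0, 1.0, 1.0, 1.0, -1.0]            # last gradient is the noisy one
with_momentum = sgd_momentum(grads)
without_momentum = sgd_momentum(grads, gamma=0.0)
```

With momentum the weight keeps moving in the established direction on the last step; with `gamma=0.0` (plain SGD) the noisy gradient reverses it.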


Documents with high cosine similarity are considered similar.

Number of distinct words in the two sentences would ...

$\cos(\theta) = \frac{u_1^{\top} u_2}{\lVert u_1 \rVert \, \lVert u_2 \rVert}$, where $\lVert \cdot \rVert$ is the L2 norm (magnitude).
For unit-length vectors, squared Euclidean distance $= 2(1 - \cos\theta)$.
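A minimal sketch of cosine similarity between two sentences represented as word-count vectors. The two sentences and the helper names are assumptions; the vector length is the number of distinct words across both sentences.

```python
import math
from collections import Counter

def bow_vectors(sent_a, sent_b):
    """Count vectors over the combined vocabulary of the two sentences."""
    vocab = sorted(set(sent_a.split()) | set(sent_b.split()))
    ca, cb = Counter(sent_a.split()), Counter(sent_b.split())
    return [ca[w] for w in vocab], [cb[w] for w in vocab]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u, v = bow_vectors("the cat sat on the mat", "the cat lay on the mat")
sim = cosine_similarity(u, v)
```

Here the two sentences share five of six distinct words, so the similarity is high (7/8).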

TF-IDF

In tf-idf, the number of words in the document is not considered; instead, how frequently the word is used is taken.
A frequently occurring word (e.g. 'a', 'an', 'the') should contribute less, hence the count of such words is penalized by a factor called inverse document frequency.
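A minimal sketch of the idea, assuming the plain variant idf = log(N / number of documents containing the word); libraries often use smoothed variants. The toy corpus is made up, and `idf` as written assumes the word occurs in at least one document.

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word, docs):
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

score_the = tfidf("the", docs[0], docs)   # appears in every document
score_mat = tfidf("mat", docs[0], docs)   # appears in one document only
```

The very frequent word "the" is penalized down to zero, while the rarer word "mat" keeps a positive weight.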

WORD2VEC

Is better for measuring similarity than one-hot encoding.
Word2Vec is an intelligent way of expressing a word as a vector by training the word against words in its neighborhood. Words that are contextually like the given word would produce high cosine similarity or dot product when their Word2Vec representations are considered.

Generally, the words in the corpus are trained with respect to the words in their neighborhood to derive the set of the Word2Vec representations. The two most popular methods of extracting Word2Vec representations are the CBOW (Continuous Bag of Words) method and Skip-Gram method.

The CBOW method tries to predict the center word from the context of the neighboring words in a specific window length.
CBOW

[Diagram: CBOW network. The context words inside the window are mapped to embeddings; the hidden layer h is their average.]
To make this more intuitive, let's say our target variable is cat. If the hidden-layer vector h gives the maximum dot product with the outer matrix word-embeddings vector for cat while the dot product with the other outer word embedding is low, then the embedding vectors are more or less correct, and very little error or log loss will be backpropagated to correct the embedding matrices. However, let's say the dot product of h with cat is less and that of the other outer embedding vectors is more; the loss of the SoftMax is going to be significantly high, and thus more errors/log loss are going to be backpropagated to reduce the error.

The dot product of the hidden-layer embedding h is computed with each of the output word-embedding vectors by multiplying the matrix WO by h. The dot product, as we know, would give a similarity measure for each of the output word embedding and the hidden-layer computed embedding h. The dot products are normalized to probability through a SoftMax and, based on the target word w, the categorical cross-entropy loss is computed and backpropagated through gradient descent to update the matrices' weights for both the input and output embedding matrices.
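The forward pass described above can be sketched with toy numbers. The five-word vocabulary, the 2-dimensional embedding values in `W_in` and `W_out`, and the function name are all made up for illustration.

```python
import math

vocab = ["the", "cat", "sat", "on", "mat"]
# Toy input and output embeddings (assumed values, 2 dimensions each).
W_in = {"the": [0.1, 0.3], "cat": [0.8, 0.1], "sat": [0.5, 0.5],
        "on": [0.2, 0.4], "mat": [0.7, 0.2]}
W_out = {"the": [0.0, 0.2], "cat": [0.2, 0.9], "sat": [1.0, 0.6],
         "on": [0.1, 0.1], "mat": [0.3, 0.8]}

def cbow_forward(context, target):
    dim = len(W_in[context[0]])
    # Hidden layer h: average of the context-word input embeddings.
    h = [sum(W_in[w][d] for w in context) / len(context) for d in range(dim)]
    # Dot product of h with every output embedding, then SoftMax.
    scores = [sum(a * b for a, b in zip(W_out[w], h)) for w in vocab]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[vocab.index(target)])   # categorical cross-entropy
    return probs, loss

context = ["the", "cat", "on", "mat"]
probs, loss_sat = cbow_forward(context, target="sat")
_, loss_on = cbow_forward(context, target="on")
```

With these numbers "sat" has the largest dot product with h, so choosing it as the target gives a smaller loss than choosing "on", and less error would be backpropagated.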
SKIP GRAM

In skip gram, context words are predicted based on the current word.

For Skip-gram models, the window size is not generally fixed. Given a maximum window size, the window size at each current word is randomly chosen so that smaller windows are chosen more frequently than larger ones. With Skip-gram, one can generate a lot of training samples from a limited amount of text, and infrequent words and phrases are also very well represented.
• CBOW is much faster to train than Skip-gram and has slightly better accuracy for frequent words.
• Both Skip-gram and CBOW look at local windows for word co-occurrences and then try to predict either the context words from the center word (as with Skip-gram) or the center word from the context words (as with CBOW). So, basically, in Skip-gram, locally within each window, the probability of the co-occurrence of the context word wc and the current word wt, given by P(wc | wt), is assumed to be proportional to the exponential of the dot product of their word-embedding vectors. For example: $P(w_c \mid w_t) \propto \exp(u_c^{\top} v_t)$, writing $u_c$ and $v_t$ for the embedding vectors of wc and wt.
Since the co-occurrence is measured locally, these models miss utilizing the global co-occurrence statistics for word pairs within certain window lengths. Next, we are going to explore a basic method to look at the global co-occurrence statistics over a corpus and then use SVD (singular value decomposition) to generate word vectors.
Before we move on to recurrent neural networks, one thing I want to mention is the importance of word embeddings for recurrent neural networks in the context of natural language processing. A recurrent neural network doesn't understand text, and hence each word in the text needs to have some form of number representation. Word-embeddings vectors are a great choice since words can be represented by multiple concepts given by the components of the word-embeddings vector. Recurrent neural networks can be made to work both ways, either by providing the word-embeddings vectors as input or by letting the network learn those embeddings vectors by itself. In the latter case, the word-embeddings vectors would be aligned more toward the ultimate problem being solved through the recurrent neural network. However, at times the recurrent neural network might have a lot of other parameters to learn, or the network might have very little data to train on. In such cases, having to learn the word-embeddings vectors as parameters might lead to overfitting or sub-optimal results. Using the pre-trained word-vector embeddings might be a wiser option in such scenarios.
RNN

RNNs are great for NLP because of the sequential dependency of words.
Prior memory & current input.

[Diagram: RNN unrolled over time steps t-1, t, t+1, with inputs x_{t-1}, x_t, x_{t+1}, outputs o_{t-1}, o_t, o_{t+1}, and shared weights W_xh, W_hh, W_ho]

Memory: $h_t = f(W_{xh} x_t + W_{hh} h_{t-1})$
→ $x_t$: input, $h_{t-1}$: previous hidden state, $f$: non-linear activation function

Output: $o_t = \mathrm{softmax}(W_{ho} h_t + b_o)$

Markov assumption: probability of a sequence of words
$P(w_1, \dots, w_n) = \prod_{t} P(w_t \mid w_{t-1}, \dots, w_{t-d})$; d: window length
RNN - BACKPROPAGATION IN TIME

Gradient is the sum of the gradients w.r.t. the loss at each time step.
Vanishing & Exploding Gradient.

RELU
If, due to tanh / sigmoid, the derivative is small, the weight update is slow; hence use ReLU → $f(z) = \max(0, z)$
Note: the derivative is either 0 or 1, there is no 0.2 / 0.3 → vanishing gradient doesn't occur.
→ Convergence is faster in ReLU.
[Plot: training error vs. epochs, tanh vs. ReLU]
But because there is 0 in {0, 1}, there is a problem of dead activations.
SOFTPLUS

Smooth approximation of ReLU:
$f(x) = \log(1 + \exp x)$, $f'(x) = \frac{1}{1 + \exp(-x)}$

Noisy ReLU is used in Restricted Boltzmann Machines.

Leaky ReLU
$f(z) = \max(\alpha z, z)$ with a small $\alpha$ (e.g. 0.01)
In an MLP, if one derivative in the chain is zero, the whole gradient also becomes zero (because of the chain rule), therefore the weight is not updated; Leaky ReLU keeps a small non-zero slope for negative inputs.
Note: Dead ReLU usually happens when initialized weights are large negative.
WEIGHT INITIALIZATION:

Initialize weights: either normal (Gaussian) or uniform.

1. If all weights are the same → same gradient updates. Our weights should be different.
2. If the weights $w_{ij}$ are different, then different neurons learn different things.
3. If the weights are large -ve, then $z$ becomes large -ve → $f(z) = \mathrm{ReLU}(z) = 0$.

SOLUTIONS:
Weights should be small (not too small), with good variance.

IDEA #1 → Gaussian: $w_{ij} \sim N(0, \sigma)$ with small $\sigma$

IDEA #2 → Uniform: $w_{ij} \sim \mathrm{Unif}\left[-\frac{1}{\sqrt{fan_{in}}}, \frac{1}{\sqrt{fan_{in}}}\right]$
WORKS WELL FOR SIGMOID

IDEA #3 → XAVIER / GLOROT
Normal: $w_{ij} \sim N(0, \sigma)$, $\sigma = \sqrt{\frac{2}{fan_{in} + fan_{out}}}$
Uniform: $w_{ij} \sim \mathrm{Unif}\left[-\frac{\sqrt{6}}{\sqrt{fan_{in} + fan_{out}}}, \frac{\sqrt{6}}{\sqrt{fan_{in} + fan_{out}}}\right]$

IDEA #4 → HE INITIALIZATION
Normal: $w_{ij} \sim N(0, \sigma)$, $\sigma = \sqrt{\frac{2}{fan_{in}}}$
WORKS WELL FOR RELU
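The commonly quoted forms of these schemes can be sketched as below. The function names, the `scheme` strings and the layer sizes are assumptions for this example, and the formulas are the standard textbook ones rather than something confirmed by the handwritten page.

```python
import math
import random

def uniform_fan_in_limit(fan_in):
    return 1.0 / math.sqrt(fan_in)

def xavier_normal_std(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))

def xavier_uniform_limit(fan_in, fan_out):
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    return math.sqrt(2.0 / fan_in)

def init_layer(fan_in, fan_out, scheme, seed=0):
    """Return a fan_out x fan_in weight matrix drawn with the chosen scheme."""
    rng = random.Random(seed)

    def draw():
        if scheme == "xavier":
            return rng.gauss(0.0, xavier_normal_std(fan_in, fan_out))
        if scheme == "he":
            return rng.gauss(0.0, he_normal_std(fan_in))
        limit = uniform_fan_in_limit(fan_in)   # default: uniform in +-1/sqrt(fan_in)
        return rng.uniform(-limit, limit)

    return [[draw() for _ in range(fan_in)] for _ in range(fan_out)]

W = init_layer(fan_in=4, fan_out=3, scheme="uniform")
```

Every neuron gets different, small weights, which avoids both the "all same" and the "large negative" problems listed above.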
BATCH NORMALIZATION:
Normalization of input data: $\hat{x}_i = \frac{x_i - \mu}{\sigma}$

VANISHING GRADIENT PROBLEMS IN RNNS

CNN
CNN exploits local connectivity constraints & exploits local spatial correlation.

The impulse response of a system can either be known or be determined from the system by noting down its response to an impulse function.

Representing an image as a digital signal.

A video is a sequence of images with a temporal dimension. A black and white video can be expressed as a signal of its spatial and temporal coordinates (x, y, t).

So, a grayscale n × m image can be expressed as a function I(x, y), where I denotes the intensity of the pixel at the (x, y) coordinate. For a digital image, x and y are sampled coordinates and take discrete values. Similarly, the pixel intensity is quantized between 0 and 255.
2D Convolution of an Image to Different LSI System Responses
Any image can be convolved with an LSI system's impulse response. Those LSI system impulse responses are called filters or kernels. For example, when we try to take an image through a camera and the image gets blurred because of shaking of hands, the blur introduced can be treated as an LSI system with a specific impulse response. This impulse response convolves the actual image and produces the blurred image as output. Any image that we take through the camera gets convolved with the impulse response of the camera. So, the camera can be treated as an LSI system with a specific impulse response.

Based on the choice of image-processing filter, the nature of the output images will vary. For example, a Gaussian filter would create an output image that would be a blurred version of the input image, whereas a Sobel filter would detect the edges in an image and produce an output image that contains the edges of the input image.

https://www.quora.com/Why-do-we-need-to-flip-the-impulse-response-in-convolution
A 2D Median filter replaces each pixel with the median pixel intensity in its neighborhood, based on the filter size. The Median filter is good for removing salt and pepper noise. This type of noise presents itself in the images in the form of black and white pixels and is generally caused by sudden disturbances while capturing the images.

The Gaussian filter is a modified version of the Mean filter where the weights of the impulse function are distributed normally around the origin. Weight is highest at the center of the filter and falls normally away from the center.

Gaussian filters are used to reduce noise by suppressing the high-frequency components. However, in its pursuit of suppressing the high-frequency components it ends up producing a blurred image, called Gaussian blur.

The original image is convolved with the Gaussian filter to produce an image that has Gaussian blur. We then subtract the blurred image from the original image to get the high-frequency component of the image. A small portion of the high-frequency image is added to the original image to improve the sharpness of the image.
EDGE DETECTORS

[Figure: horizontal and vertical gradient filters]

The impulse response of a Sobel Edge Detector along the horizontal and vertical axes can be expressed by the following Hx and Hy matrices respectively. The Sobel Detectors are extensions of the Horizontal and Vertical Gradient filters just illustrated. Instead of only taking the gradient at the point, it also takes the sum of the gradients at the points on either side of it. Also, it gives double weight to the point of interest.
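As a rough sketch of how such a filter is applied, the code below slides a kernel over a tiny image, once without flipping (cross-correlation) and once with the kernel flipped (true convolution). The kernel is the usual textbook Sobel kernel for changes along x, assumed here since the matrices did not survive in these notes; sign and orientation conventions vary between sources.

```python
def cross_correlate2d(image, kernel):
    """Valid-mode sliding window: element-wise multiply, then add (no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def convolve2d(image, kernel):
    """True convolution: flip the kernel along both axes, then cross-correlate."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate2d(image, flipped)

# Assumed standard Sobel kernel; the middle row carries double weight.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# 4x4 image with a vertical edge between the second and third columns.
image = [[0, 0, 9, 9] for _ in range(4)]

corr = cross_correlate2d(image, SOBEL_X)
conv = convolve2d(image, SOBEL_X)
```

Both outputs respond strongly at the edge; they differ only in sign, which is why a learned filter can be read as a flipped version of the corresponding image-processing filter.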
Convolutional neural networks (CNNs) are based on the convolution of images and detect features based on filters that are learned by the CNN through training. For example, we don't apply any known filter, such as the ones for the detection of edges or for removing the Gaussian noise, but through the training of the convolutional neural network the algorithm learns image-processing filters on its own that might be very different from normal image-processing filters. For supervised training, the filters are learned in such a way that the overall cost function is reduced as much as possible. Generally, the first convolution layer learns to detect edges, while the second may learn to detect more complex shapes that can be formed by combining different edges, such as circles and rectangles, and so on. The third layer and beyond learn much more complicated features based on the features generated in the previous layer.

The good thing about convolutional neural networks is the sparse connectivity that results from weight sharing, which greatly reduces the number of parameters to learn. The same filter can learn to detect the same edge in any given portion of the image through its equivariance property, which is a great property of convolution useful for feature detection.

→ Sparse connectivity
→ Weight sharing
→ Equivariance property
Filter size – Filter size defines the height and width of the filter kernel. A filter kernel of size 3 × 3 would have nine weights. Generally, these filters are initialized and slid over the input image for convolution without flipping these filters. Technically, when convolution is performed without flipping the filter kernel it's called cross-correlation and not convolution. However, it doesn't matter, as we can consider the filters learned as a flipped version of image-processing filters.

Stride – If a stride of 2 is chosen along both the height and the width of the image, then after convolving the output image would be approximately 1/4th of the input image size.

Padding – To keep the output image the same size as the input, a pad length of (k − 1)/2 should be used (for a k × k filter with odd k and stride 1).
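The effect of filter size, stride and padding on the output size can be checked with the usual formula floor((n + 2p − k) / s) + 1 along each dimension. The function name and the sample sizes are assumptions for this sketch.

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Output length along one dimension for input n, filter k, given stride and padding."""
    return (n + 2 * pad - k) // stride + 1

same = conv_output_size(n=28, k=3, stride=1, pad=(3 - 1) // 2)  # pad = (k - 1) / 2
halved = conv_output_size(n=28, k=3, stride=2, pad=1)           # stride 2 per side
valid = conv_output_size(n=6, k=3)                              # no padding
```

A stride of 2 halves each side (28 → 14), so the output area is about a quarter of the input, as stated above.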
Values at
output feature
:
map
ffeatwxinfwuy
Number tormentors =xhn2
97 on as , , connotations =
#
This property of convolution is called translational equivariance. In fact, if the digit is represented by a set of pixel intensities x, and f is the translation operation on x, while g is the convolution operation with a filter kernel, then the following holds true for convolution:
$g(f(x)) = f(g(x))$
Max pooling provides some translational invariance to feature detection if the translation distance is not very high with respect to the size of the receptive field or kernel for max pooling.
RESNET: IDENTITY (skip connection)

→ RELU → x → CONV → RELU → CONV → (+ x) → RELU →
The block computes $\mathrm{ReLU}(F(x) + x)$.
If $F(x) = 0$: $\mathrm{ReLU}(F(x) + x) = \mathrm{ReLU}(x) = x$, because x is already the output of a ReLU and $\mathrm{ReLU}(\mathrm{ReLU}(x)) = \mathrm{ReLU}(x)$.

Note: Regularization (Dropout, L1, L2) during backpropagation removes all useless weights (because of optimization).

1. Adding additional / new layers must not hurt performance, as regularization will skip them if these layers are not useful.
2. If they are useful, then the weights will be non-zero.
IMAGE SIMILARITY

1. Take as input: Query image, Positive image, Negative image (a triplet).
2. One way of building image similarity is hand-crafted features (Gabor, SIFT, HOG); another is Deep Ranking.
3. Metric: L1 (Manhattan), L2 (Euclidean), Squared Euclidean.
4. Loss function:

Neural networks are function approximators; the Deep Ranking network can be thought of as a learned embedding function $f$.

Embedding function: it should satisfy
$D(f(p_i), f(p_i^{+})) < D(f(p_i), f(p_i^{-}))$
so that similar images have a smaller distance.

$p_i$: query image, $p_i^{+}$: positive image, $p_i^{-}$: negative image
$D$: Euclidean distance; the smaller the distance, the better the similarity.

⇒ $l(p_i, p_i^{+}, p_i^{-}) = \max\{0,\; g + D(f(p_i), f(p_i^{+})) - D(f(p_i), f(p_i^{-}))\}$

$g$: gap parameter that regularizes the distance.
The loss penalizes the model in such a way that the distance between the query image & positive image is not only lesser than the distance between the query image & negative image, but lesser by an amount $g$.
The most crucial component is to learn an image embedding function 'f'. Traditional methods typically employ hand-crafted visual features, and learn linear or nonlinear transformations to obtain the image embedding function. Here, a deep learning technique is employed to learn image similarity models directly from images. The Deep Ranking network looks like this:

This network takes image triplets as input. One image triplet contains a query image 'pi', a positive image 'pi+' and a negative image 'pi-', which are fed independently into three identical deep neural networks 'f(.)' with shared architecture and parameters. A triplet characterizes the relative similarity relationship for the three images. The deep neural network 'f(.)' computes the embedding of an image 'pi': f(pi) ∈ R^d, where 'd' is the dimension of the feature embedding, and 'R' represents the Real number space.

The ranking layer on the top evaluates the hinge loss of a triplet. During learning, it evaluates the model's violation of the ranking order, and back-propagates the gradients to the lower layers so that the lower layers can adjust their parameters to minimize the ranking loss.
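The hinge loss of a triplet can be sketched on already-computed embeddings. The use of squared Euclidean distance, the `gap` default and the 2-dimensional toy embeddings are assumptions for illustration; in the real network the embeddings would come from `f(.)`.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_hinge_loss(f_query, f_pos, f_neg, gap=1.0):
    """Zero when the positive is closer to the query than the negative by at least `gap`."""
    return max(0.0, gap + squared_euclidean(f_query, f_pos)
                        - squared_euclidean(f_query, f_neg))

query = [0.0, 0.0]
well_ranked = triplet_hinge_loss(query, f_pos=[0.1, 0.0], f_neg=[2.0, 0.0])
violated = triplet_hinge_loss(query, f_pos=[1.0, 0.0], f_neg=[0.5, 0.0])
```

Only the violated triplet produces a non-zero loss, so only it would send gradients back to the lower layers.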
@
invariance
caption

* hires visual appearance


BATCH NORMALIZATION

When training a neural network through stochastic gradient descent, the distribution of the inputs to each layer changes due to the update of weights on the preceding layers. This slows down the training process and makes it difficult to train very deep neural networks. The training process for neural networks is complicated by the fact that the input to any layer is dependent on the parameters for all preceding layers, and thus even small parameter changes can have an amplified effect as the network grows. This leads to input-distribution changes in a layer.

Now, let's try to understand what might go wrong when the input distribution to the activation functions in a layer changes because of weight changes in the preceding layer.

A sigmoid or tanh activation function has good linear gradients only within a specified range of its input, and the gradient drops to zero once the inputs grow large.

The parameter change in the preceding layers might change the input probability distribution to a sigmoid units layer in such a way that most of the inputs to the sigmoids belong to the saturation zone and hence produce near-zero gradients, as shown in Figure 3-29. Because of these zero or near-zero gradients, the learning becomes terribly slow or stops entirely.

One way to avoid this problem is to have rectified linear units (ReLUs). The other way to avoid this problem is to keep the distribution of inputs to the sigmoid units stable within the unsaturated zone so that stochastic gradient descent doesn't get stuck in a saturated zone.

This phenomenon of change in the distribution of the input to the internal network units has been referred to by the inventors of the batch normalization process as internal covariate shift.

Batch normalization reduces the internal covariate shift by normalizing the inputs to a layer to have a zero mean and unit standard deviation.
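A minimal sketch of the normalization step for one unit over a mini-batch. The learnable scale `gamma` and shift `beta`, the small constant `eps` and the sample activations are assumptions added for this example; they are not described in the notes above.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's inputs over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x + beta for x in x_hat]

activations = [10.0, 12.0, 14.0, 16.0]     # inputs that drifted away from zero
normed = batch_norm(activations)
new_mean = sum(normed) / len(normed)
new_var = sum((x - new_mean) ** 2 for x in normed) / len(normed)
```

After normalization the inputs are centred on zero with roughly unit variance, which keeps a sigmoid or tanh unit in its unsaturated zone.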
6×6 input → 4×4 output
CONV: ELEMENT-WISE MULTIPLICATION, then ADDITION
COMPARISON OF OBJECT DETECTION MODELS

SSD → FEATURE EXTRACTOR → DETECTION GENERATOR
(SSD here ENCAPSULATES MULTIBOX, YOLO, ...)
FASTER RCNN → FEATURE EXTRACTOR → PROPOSAL GENERATOR → CROP → BOX CLASSIFIER → BOX REFINEMENT
R-FCN
MASK RCNN

• FEATURE EXTRACTORS → INCEPTION / VGG / MOBILENET / RESNET / INCEPTION RESNET
• # PROPOSALS → for Faster RCNN / R-FCN WE CHANGE THE NUMBER OF PROPOSALS SENT TO THE NEXT STAGE OF THE MODEL
• IMAGE SIZE
• OUTPUT STRIDE → LOW RES (fast, not accurate), HIGH RES (accurate)
• FIXED PARAMETERS → MATCHERS, LOCATION LOSS FUNCTION, BOX ENCODING
• POST-PROCESSING PARAMETERS? → multi-crop, horizontal flip, ensembling, box voting

* We want TO COMPARE SPEED & ACCURACY
* EXPERIMENTAL SETUP

READ: SPEED / ACCURACY TRADE-OFFS FOR MODERN CONVOLUTIONAL OBJECT DETECTORS
arXiv: 1611.10012v1, 30 Nov 2016

mAP: FASTER RCNN > R-FCN > SSD
SSD DOES WELL AT LARGE OBJECTS, LESS WELL FOR SMALL ONES.
R-FCN / Faster RCNN improve with a better feature extractor (e.g. Inception ResNet).
SSD's performance is less dependent on the feature extractor.
Reducing # Proposals is a great way to speed up without hurting accuracy significantly.

READ
INCEPTION RESNET V2
INCEPTION
VGG
RESNET
MOBILENET
6×6 image, k = 3 → 4×4 output
Note: the output shrinks; to avoid this, use padding.
n×n image, k×k kernel, padding p → output $(n + 2p - k + 1) \times (n + 2p - k + 1)$
CNN BLOCK / LAYER

CONV → kernel of size k×k×c slides over the image
MAX POOLING → makes the convnet location invariant
FLATTEN
OPTIMIZATION

x → ŷ → L(y, ŷ) → BACKPROPAGATION

Conv → element-wise multiplication + addition: this is differentiable.
ReLU → is differentiable (everywhere except at 0, where a subgradient is used).
Max pooling → for the max value the derivative is 1; for non-max values the derivative is 0.

https://world4jason.gitbooks.io/research-log/content/deepLearning/CNN/Model%20&%20ImgNet/lenet.html
SGD

GD: using all points.
SGD: using one random point.
Mini-batch SGD: random subset of k points.
SGD updates are noisy.

Derivative is zero at: maxima, minima & saddle points.
Simple SGD / GD gets stuck at saddle points.

Ex: Lin Reg, Log Reg, SVM → convex; Deep learning → non-convex.
Based on initialization → we could end up at different minima.
SGD with momentum

SGD updates are noisy. So let's use a weighted average:

$v_1 = a_1$
$v_2 = \gamma v_1 + a_2$
$v_3 = \gamma v_2 + a_3 = \gamma(\gamma a_1 + a_2) + a_3$
...
→ Denoised updates.

Update: $w_t = w_{t-1} - v_t$, with $v_t = \gamma v_{t-1} + \eta g_t$
($\gamma = 0$: SGD without momentum; $\gamma v_{t-1}$: momentum term; $\eta g_t$: gradient term)
ADAGRAD

In SGD & SGD + momentum: learning rate η is constant.
In Adagrad: learning rate is different for each weight / parameter, i.e. different for different weights @ different times.

$\eta'_t = \frac{\eta}{\sqrt{\alpha_t + \epsilon}}$, $\alpha_t = \sum_{i=1}^{t} g_i^2$
→ $\eta'_t$ reduces as time increases: the learning rate adapts.

+ No need to manually tune η
+ Takes care of sparse & dense features
− As t increases, α increases → which causes η' to become very small → which leads to slow convergence
ADADELTA

Exponential weighted averages of squared gradients instead of the simple sum (as in Adagrad)
→ to avoid small values of η' → slow convergence.
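The difference between the two accumulators can be sketched for a single parameter. The learning rate, the decay `gamma`, the constant gradient stream and the function names are assumptions for this example; the second function only shows the averaging idea, not a full Adadelta update.

```python
import math

def adagrad(grads, lr=0.1, eps=1e-8, w=0.0):
    """Plain Adagrad for one parameter; returns the weight and each step's learning rate."""
    alpha = 0.0                                  # running SUM of squared gradients
    effective_lrs = []
    for g in grads:
        alpha += g * g
        eta_t = lr / math.sqrt(alpha + eps)      # shrinks as alpha keeps growing
        w -= eta_t * g
        effective_lrs.append(eta_t)
    return w, effective_lrs

def squared_gradient_average(grads, gamma=0.9):
    """Exponentially weighted AVERAGE of squared gradients (the Adadelta-style idea)."""
    avg, history = 0.0, []
    for g in grads:
        avg = gamma * avg + (1.0 - gamma) * g * g
        history.append(avg)
    return history

w, lrs = adagrad([1.0] * 5)
ewa = squared_gradient_average([1.0] * 50)
```

The Adagrad learning rate keeps falling step after step, whereas the weighted average stays bounded, so the effective learning rate does not collapse toward zero.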

IMAGE (256×256) → CONV + MAX POOL → CONV + MAX POOL → ... → (8×8) → UPSAMPLE → UPSAMPLE → OUTPUT (256×256)
↳ (8×8): sparse output
Faster R-CNN: Down the rabbit hole of modern object detection
Thu, Jan 18, 2018

Previously, we talked about object detection, what it is and how it has been recently tackled using deep
learning. If you haven’t read our previous blog post, we suggest you take a look at it before continuing.

Last year, we decided to get into Faster R-CNN, reading the original paper, and all the referenced papers
(and so on and on) until we got a clear understanding of how it works and how to implement it.

We ended up implementing Faster R-CNN in Luminoth, a computer vision toolkit based on TensorFlow
which makes it easy to train, monitor and use these types of models. So far, Luminoth has raised an
incredible amount of interest and we even talked about it at both ODSC Europe and ODSC West.

Based on all the work developing Luminoth and based on the presentations we did, we thought it would be
a good idea to have a blog post with all the details and links we gathered in our research as a future
reference for anyone interested in the topic.

Background
Faster R-CNN was originally published in NIPS 2015. After publication, it went through a couple of
revisions which we’ll later discuss. As we mentioned in our previous blog post, Faster R-CNN is the third
iteration of the R-CNN papers — which had Ross Girshick as author & co-author.

Everything started with “Rich feature hierarchies for accurate object detection and semantic segmentation”
(R-CNN) in 2014, which used an algorithm called Selective Search to propose possible regions of interest
and a standard Convolutional Neural Network (CNN) to classify and adjust them. It quickly evolved into
Fast R-CNN, published in early 2015, where a technique called Region of Interest Pooling allowed for
sharing expensive computations and made the model much faster. Finally came Faster R-CNN, where the
first fully differentiable model was proposed.

Architecture
The architecture of Faster R-CNN is complex because it has several moving parts. We’ll start with a high
level overview, and then go over the details for each of the components.

It all starts with an image, from which we want to obtain:

a list of bounding boxes.


a label assigned to each bounding box.
a probability for each label and bounding box.

Complete Faster R-CNN architecture

The input images are represented as Height×Width×Depth tensors (multidimensional arrays), which are
passed through a pre-trained CNN up until an intermediate layer, ending up with a convolutional feature
map. We use this as a feature extractor for the next part.

This technique is very commonly used in the context of Transfer Learning, especially for training a
classifier on a small dataset using the weights of a network trained on a bigger dataset. We’ll take a deeper
look at this in the following sections.

Next, we have what is called a Region Proposal Network (RPN, for short). Using the features that the CNN
computed, it is used to find up to a predefined number of regions (bounding boxes), which may contain
objects.

Probably the hardest issue with using Deep Learning (DL) for object detection is generating a variable-
length list of bounding boxes. When modeling deep neural networks, the last block is usually a fixed sized
tensor output (except when using Recurrent Neural Networks, but that is for another post). For example, in
image classification, the output is a (N,) shaped tensor, with N being the number of classes, where each
scalar in location i contains the probability of that image being label i.

The variable-length problem is solved in the RPN by using anchors: fixed sized reference bounding boxes
which are placed uniformly throughout the original image. Instead of having to detect where objects are,
we model the problem into two parts. For every anchor, we ask:

Does this anchor contain a relevant object?
How would we adjust this anchor to better fit the relevant object?

This is probably getting confusing, but fear not, we’ll dive into this below.

After having a list of possible relevant objects and their locations in the original image, it becomes a more
straightforward problem to solve. Using the features extracted by the CNN and the bounding boxes with
relevant objects, we apply Region of Interest (RoI) Pooling and extract those features which would
correspond to the relevant objects into a new tensor.

Finally, comes the R-CNN module, which uses that information to:

Classify the content in the bounding box (or discard it, using “background” as a label).
Adjust the bounding box coordinates (so it better fits the object).

Obviously, some major bits of information are missing, but that’s basically the general idea of how Faster
R-CNN works. Next, we’ll go over the details on both the architecture and loss/training for each of the
components.

Base network
As we mentioned earlier, the first step is using a CNN pretrained for the task of classification (e.g. using
ImageNet) and using the output of an intermediate layer. This may sound really simple for people with a
deep learning background, but it’s important to understand how and why it works, as well as visualize what
the intermediate layer output looks like.

There is no real consensus on which network architecture is best. The original Faster R-CNN used ZF and
VGG pretrained on ImageNet but since then there have been lots of different networks with a varying
number of weights. For example, MobileNet, a smaller and efficient network architecture optimized for
speed, has approximately 3.3M parameters, while ResNet-152 (yes, 152 layers), once the state of the art in
the ImageNet classification competition, has around 60M. More recently, new architectures like DenseNet
have been improving results while lowering the number of parameters.

VGG
Before we talk about which is better or worse, let’s try to understand how it all works using the standard
VGG-16 as an example.

VGG architecture

VGG, whose name comes from the team which used it in the ImageNet ILSVRC 2014 competition, was
published in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition” by Karen
Simonyan and Andrew Zisserman. By today’s standards it would not be considered very deep, but at the
time it more than doubled the number of layers commonly used and kickstarted the “deeper → more
capacity → better” wave (when training is possible).

When using VGG for classification, the input is a 224×224×3 tensor (that means a 224x224 pixel RGB
image). This has to remain fixed for classification because the final block of the network uses fully-
connected (FC) layers (instead of convolutional), which require a fixed length input. This is usually done
by flattening the output of the last convolutional layer, getting a rank 1 tensor, before using the FC layers.

Since we are going to use the output of an intermediate convolutional layer, the size of the input is not our
problem. At least, it is not the problem of this module since only convolutional layers are used. Let’s get a
bit more into low-level details and define which convolutional layer we are going to use. The paper does
not specify which layer to use, but in the official implementation you can see they use the output of
the conv5/conv5_1 layer.

Each convolutional layer creates abstractions based on the previous information. The first layers usually
learn edges, the second finds patterns in edges in order to activate for more complex shapes and so forth.
Eventually we end up with a convolutional feature map which has spatial dimensions much smaller than
the original image, but greater depth. The width and height of the feature map decrease because of the
pooling applied between convolutional layers and the depth increases based on the number of filters the
convolutional layer learns.

Image to convolutional feature map

In its depth, the convolutional feature map has encoded all the information for the image while maintaining
the location of the “things” it has encoded relative to the original image. For example, if there was a red
square on the top left of the image and the convolutional layers activate for it, then the information for that
red square would still be on the top left of the convolutional feature map.

VGG vs ResNet
Nowadays, ResNet architectures have mostly replaced VGG as a base network for extracting features.
Three of the co-authors of Faster R-CNN (Kaiming He, Shaoqing Ren and Jian Sun) were also co-authors
of “Deep Residual Learning for Image Recognition”, the original paper describing ResNets.

The obvious advantage of ResNet over VGG is that it is bigger, hence it has more capacity to actually learn
what is needed. This is true for the classification task and should be equally true in the case of object
detection.

Also, ResNet makes it easy to train deep models with the use of residual connections and batch
normalization, which had not been invented when VGG was first released.

Anchors
Now that we are working with a processed image, we need to find proposals, i.e. regions of interest for
classification. We previously mentioned that anchors are a way to solve the variable length problem, but we
skipped most of the explanation.

Our objective is to find bounding boxes in the image. These have rectangular shape and can come in
different sizes and aspect ratios. Imagine we were trying to solve the problem knowing beforehand that
there are two objects on the image. The first idea that comes to mind is to train a network that returns 8
values: two (x_min, y_min, x_max, y_max) tuples defining a bounding box for each object. This approach has some
fundamental problems. For example, images may have different sizes and aspect ratios, so training a good
model to predict raw coordinates can turn out to be very complicated (if not impossible). Another
problem is invalid predictions: when predicting x_min and x_max we have to somehow enforce that
x_min < x_max.

It turns out that there is a simpler approach to predicting bounding boxes: learning to predict offsets from
reference boxes. We take a reference box (x_center, y_center, width, height) and learn to predict
(Δx_center, Δy_center, Δwidth, Δheight), which are usually small values that tweak the reference box to better fit
what we want.
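To make the idea of offsets concrete, here is a minimal sketch of how a set of deltas could be applied to a reference box. The exact parameterization is an assumption on our part: we use the one from the Faster R-CNN paper, where center offsets are scaled by the box size and width/height deltas are applied through an exponential. The function name `apply_deltas` is ours, for illustration only.

```python
import math

def apply_deltas(box, deltas):
    """Shift and scale a reference box given as (x_center, y_center, width, height)."""
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    new_x = x + dx * w          # center offsets are relative to the box size
    new_y = y + dy * h
    new_w = w * math.exp(dw)    # exp keeps the predicted width and height positive
    new_h = h * math.exp(dh)
    return (new_x, new_y, new_w, new_h)

reference = (100.0, 100.0, 64.0, 128.0)
print(apply_deltas(reference, (0.0, 0.0, 0.0, 0.0)))    # zero deltas leave the box unchanged
print(apply_deltas(reference, (0.25, -0.5, 0.0, 0.0)))  # (116.0, 36.0, 64.0, 128.0)
```

Note that with this form any predicted deltas produce a valid box, which avoids the x_min < x_max problem described above.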

Anchors are fixed bounding boxes that are placed throughout the image with different sizes and ratios that
are going to be used for reference when first predicting object locations.

Since we are working with a convolutional feature map of size conv_width × conv_height × conv_depth, we
create a set of anchors for each of the points in conv_width × conv_height. It’s important to understand that
even though anchors are defined based on the convolutional feature map, the final anchors reference the
original image.

Since we only have convolutional and pooling layers, the dimensions of the feature map will be
proportional to those of the original image. Mathematically, if the image was w×h, the feature map will end
up w/r × h/r, where r is called the subsampling ratio. If we define one anchor per spatial position of the feature
map, the final image will end up with a bunch of anchors separated by r pixels. In the case of VGG, r=16.
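As a rough illustration of the arithmetic above, the following sketch lists the anchor centers in image coordinates for a given image size and subsampling ratio. Placing each center in the middle of its r×r cell and using integer division for the feature map size are assumptions made for this example; implementations differ on both details.

```python
def anchor_centers(image_width, image_height, stride=16):
    """Return (x, y) image coordinates, one per feature-map position."""
    feat_w = image_width // stride    # w / r
    feat_h = image_height // stride   # h / r
    centers = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Assumption: each center sits in the middle of its stride x stride cell.
            centers.append((col * stride + stride // 2, row * stride + stride // 2))
    return centers

centers = anchor_centers(64, 48)
print(len(centers))   # 4 * 3 = 12 positions
print(centers[:2])    # [(8, 8), (24, 8)], i.e. neighbours are 16 pixels apart
```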

Anchor centers throughout the original image

In order to choose the set of anchors we usually define a set of sizes (e.g. 64px, 128px, 256px) and a set of
ratios between width and height of boxes (e.g. 0.5, 1, 1.5) and use all the possible combinations of sizes and
ratios.
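A small sketch of how the size and ratio combinations could be turned into anchor shapes. We assume here that the ratio is width divided by height and that each anchor keeps an area of roughly size × size; both are conventions chosen for the example rather than something fixed by the text.

```python
import math

def anchor_shapes(sizes=(64, 128, 256), ratios=(0.5, 1.0, 1.5)):
    """Return one (width, height) pair per combination of size and ratio."""
    shapes = []
    for size in sizes:
        for ratio in ratios:
            # Assumption: ratio = width / height, with area kept near size * size.
            width = size * math.sqrt(ratio)
            height = size / math.sqrt(ratio)
            shapes.append((width, height))
    return shapes

shapes = anchor_shapes()
print(len(shapes))   # 3 sizes * 3 ratios = 9 anchors per point
print(shapes[1])     # (64.0, 64.0) for size 64 and ratio 1
```

Centering each of these shapes on every anchor center gives the full set of anchors for the image.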

Left: Anchors, Center: Anchor for a single point, Right: All anchors

Region Proposal Network

The RPN takes the convolutional feature map and generates proposals over the image

As we mentioned before, the RPN takes all the reference boxes (anchors) and outputs a set of good
proposals for objects. It does this by having two different outputs for each of the anchors.

The first one is the probability that an anchor is an object. An “objectness score”, if you will. Note that the
RPN doesn’t care what class of object it is, only that it does in fact look like an object (and not
background). We are going to use this objectness score to filter out the bad predictions for the second stage.
The second output is the bounding box regression for adjusting the anchors to better fit the object it’s
predicting.

The RPN is implemented efficiently in a fully convolutional way, using the convolutional feature map
returned by the base network as an input. First, we use a convolutional layer with 512 channels and 3x3
kernel size and then we have two parallel convolutional layers using a 1x1 kernel, whose number of
channels depends on the number of anchors per point.

Convolutional implementation of an RPN architecture, where k is the number of anchors.

For the classification layer, we output two predictions per anchor: the score of it being background (not an
object) and the score of it being foreground (an actual object).

For the regression, or bounding box adjustment layer, we output 4 predictions per anchor: the deltas
(Δx_center, Δy_center, Δwidth, Δheight), which we will apply to the anchors to get the final proposals.
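To see how the number of channels depends on k, the sketch below computes the shapes of the two RPN outputs for a feature map of a given size. The 38×50 feature map is a hypothetical example, and the height × width × channels layout is an assumption.

```python
def rpn_output_shapes(feat_height, feat_width, anchors_per_point):
    """Shapes of the two 1x1 convolution outputs of the RPN."""
    k = anchors_per_point
    cls_shape = (feat_height, feat_width, 2 * k)   # background/foreground score per anchor
    reg_shape = (feat_height, feat_width, 4 * k)   # four deltas per anchor
    total_anchors = feat_height * feat_width * k
    return cls_shape, reg_shape, total_anchors

print(rpn_output_shapes(38, 50, 9))
# ((38, 50, 18), (38, 50, 36), 17100)
```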

Using the final proposal coordinates and their “objectness” score we then have a good set of proposals for
objects.

Training, target and loss functions


The RPN makes two different types of predictions: the binary classification and the bounding box regression
adjustment.

For training, we take all the anchors and put them into two different categories. Those that overlap a
ground-truth object with an Intersection over Union (IoU) bigger than 0.5 are considered “foreground” and
those that don’t overlap any ground truth object or have less than 0.1 IoU with ground-truth objects are
considered “background”.
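The labeling rule above can be sketched in a few lines. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, and anchors whose best IoU falls between the two thresholds are marked "ignored" here, which is our reading of what happens to anchors that fit neither category.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, fg_thresh=0.5, bg_thresh=0.1):
    """Label an anchor from its best IoU with any ground-truth box."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > fg_thresh:
        return "foreground"
    if best < bg_thresh:
        return "background"
    return "ignored"   # assumption: in-between anchors do not contribute to training

gt = [(0, 0, 10, 10)]
print(label_anchor((0, 0, 10, 8), gt))      # IoU 0.8 -> foreground
print(label_anchor((20, 20, 30, 30), gt))   # no overlap -> background
print(label_anchor((0, 0, 10, 3), gt))      # IoU 0.3 -> ignored
```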

Then, we randomly sample those anchors to form a mini batch of size 256 — trying to maintain a balanced
ratio between foreground and background anchors.
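One possible way to build such a mini batch is sketched below. Filling the remaining slots with background anchors when there are not enough foreground ones is an assumption (it matches what the paper describes), and the label strings and `seed` argument are only there to make the example reproducible.

```python
import random

def sample_minibatch(labels, batch_size=256, fg_fraction=0.5, seed=0):
    """Return (foreground_indices, background_indices) for one mini batch."""
    rng = random.Random(seed)
    fg = [i for i, label in enumerate(labels) if label == "foreground"]
    bg = [i for i, label in enumerate(labels) if label == "background"]
    num_fg = min(len(fg), int(batch_size * fg_fraction))
    # Assumption: slots not filled by foreground anchors go to background anchors.
    num_bg = min(len(bg), batch_size - num_fg)
    return rng.sample(fg, num_fg), rng.sample(bg, num_bg)

labels = ["foreground"] * 20 + ["background"] * 1000 + ["ignored"] * 50
fg_idx, bg_idx = sample_minibatch(labels)
print(len(fg_idx), len(bg_idx))   # 20 236
```

With only 20 foreground anchors available the batch cannot be balanced, which is the situation discussed further below.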

The RPN uses all the anchors selected for the mini batch to calculate the classification loss using binary
cross entropy. Then, it uses only those minibatch anchors marked as foreground to calculate the regression
loss. For calculating the targets for the regression, we use the foreground anchor and the closest ground
truth object and calculate the correct Δ needed to transform the anchor into the object.
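The "correct Δ" can be computed as in the sketch below. We assume the parameterization from the Faster R-CNN paper (center offsets divided by the anchor size, log ratios for width and height) and boxes given as (x_center, y_center, width, height); the function name is ours.

```python
import math

def regression_target(anchor, gt):
    """Deltas that would transform the anchor into the ground-truth box."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return (
        (gx - ax) / aw,        # horizontal shift, relative to anchor width
        (gy - ay) / ah,        # vertical shift, relative to anchor height
        math.log(gw / aw),     # log of the width ratio
        math.log(gh / ah),     # log of the height ratio
    )

target = regression_target((100.0, 100.0, 64.0, 64.0), (116.0, 100.0, 128.0, 64.0))
print(target)   # dx = 0.25, dy = 0.0, dwidth = log(2), dheight = 0.0
```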

Instead of using a simple L1 or L2 loss for the regression error, the paper suggests using Smooth L1 loss.
Smooth L1 is basically L1, but when the L1 error is small enough, defined by a certain σ, the error is
considered almost correct and the loss diminishes at a faster rate.
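As a sketch, Smooth L1 for a single error value can be written as follows. The σ-based form used here (quadratic below 1/σ², linear above) is the one commonly found in Faster R-CNN implementations and is an assumption on our part; the text only says that σ defines what counts as a small error.

```python
def smooth_l1(error, sigma=1.0):
    """Smooth L1 loss for one scalar regression error."""
    abs_error = abs(error)
    cutoff = 1.0 / sigma ** 2
    if abs_error < cutoff:
        return 0.5 * (sigma * error) ** 2   # quadratic near zero
    return abs_error - 0.5 * cutoff         # behaves like L1 for larger errors

print(smooth_l1(0.5))   # 0.125
print(smooth_l1(2.0))   # 1.5
```

The two branches meet at the cutoff, so the loss is continuous while still being less sensitive to outliers than L2.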

Using dynamic batches can be challenging for a number of reasons. Even though we try to maintain a
balanced ratio between anchors that are considered background and those that are considered foreground,
that is not always possible. Depending on the ground truth objects in the image and the size and ratios of
the anchors, it is possible to end up with zero foreground anchors. In those cases, we turn to using the
anchors with the biggest IoU to the ground truth boxes. This is far from ideal, but practical in the sense that
we always have foreground samples and targets to learn from.

Post processing
Non-maximum suppression. Since anchors usually overlap, proposals also end up overlapping over the
same object. To solve the issue of duplicate proposals we use a simple algorithmic approach called Non-
Maximum Suppression (NMS). NMS takes the list of proposals sorted by score and iterates over the
sorted list, discarding those proposals that have an IoU larger than some predefined threshold with a
proposal that has a higher score.

While this looks simple, it is very important to be cautious with the IoU threshold. Too low and you may
end up missing proposals for objects; too high and you could end up with too many proposals for the same
object. A value commonly used is 0.6.
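A minimal, unoptimized sketch of NMS is shown below. It uses the common greedy variant, in which a proposal is only compared against proposals that have already been kept; the (score, box) tuple layout is an assumption made for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(proposals, iou_threshold=0.6):
    """proposals: list of (score, box). Returns the kept proposals, best first."""
    kept = []
    for score, box in sorted(proposals, key=lambda p: p[0], reverse=True):
        # Keep the proposal only if it does not overlap too much with a better one.
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

proposals = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (0, 0, 10, 9)),      # IoU 0.9 with the first box: suppressed
    (0.7, (20, 20, 30, 30)),   # no overlap: kept
]
print(nms(proposals))
# [(0.9, (0, 0, 10, 10)), (0.7, (20, 20, 30, 30))]
```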

Proposal selection. After applying NMS, we keep the top N proposals sorted by score. In the paper N=2000
is used, but it is possible to lower that number to as little as 50 and still get quite good results.

Standalone application
The RPN can be used by itself without needing the second stage model. In problems where there is only a
single class of objects, the objectness probability can be used as the final class probability. This is because
for this case, “foreground” = “single class” and “background” = “not single class”.

Some examples of machine learning problems that can benefit from a standalone usage of the RPN are the
popular (but still challenging) face detection and text detection.

One of the advantages of using only the RPN is the gain in speed both in training and prediction. Since the
RPN is a very simple network which only uses convolutional layers, the prediction time can be faster than
using the classification base network.

Region of Interest Pooling


After the RPN step, we have a bunch of object proposals with no class assigned to them. Our next problem
to solve is how to take these bounding boxes and classify them into our desired categories.

The simplest approach would be to take each proposal, crop it, and then pass it through the pre-trained base
network. Then, we can use the extracted features as input for a vanilla image classifier. The main problem
is that running the computations for all the 2000 proposals is really inefficient and slow.

Faster R-CNN tries to solve, or at least mitigate, this problem by reusing the existing convolutional feature
map. This is done by extracting fixed-sized feature maps for each proposal using region of interest pooling.
Fixed size feature maps are needed for the R-CNN in order to classify them into a fixed number of classes.

Region of Interest Pooling

A simpler method, which is widely used by object detection implementations, including Luminoth’s Faster
R-CNN, is to crop the convolutional feature map using each proposal and then resize each crop to a fixed
size of 14 × 14 × conv_depth using interpolation (usually bilinear). After cropping, max pooling with a 2x2
kernel is used to get a final 7 × 7 × conv_depth feature map for each proposal.
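The crop, resize and pool steps can be sketched for a single channel in plain Python. This is an illustration under stated assumptions, not Luminoth’s implementation: the proposal is assumed to be already expressed in feature-map coordinates and to lie inside the map, and a real model would apply the same operation to every one of the conv_depth channels.

```python
def crop_and_resize(feature, box, out_size=14):
    """Bilinearly sample an out_size x out_size grid inside box.

    feature: 2D list (one channel of the feature map).
    box: (x_min, y_min, x_max, y_max) in feature-map coordinates.
    """
    x_min, y_min, x_max, y_max = box
    rows, cols = len(feature), len(feature[0])
    out = []
    for i in range(out_size):
        y = y_min + (y_max - y_min) * i / (out_size - 1)
        y0 = min(int(y), rows - 1)
        y1 = min(y0 + 1, rows - 1)
        wy = y - y0
        row = []
        for j in range(out_size):
            x = x_min + (x_max - x_min) * j / (out_size - 1)
            x0 = min(int(x), cols - 1)
            x1 = min(x0 + 1, cols - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
            bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 over a 2D list with even dimensions."""
    return [
        [max(grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1])
         for j in range(0, len(grid[0]), 2)]
        for i in range(0, len(grid), 2)
    ]

feature = [[float(r * 10 + c) for c in range(10)] for r in range(10)]
pooled = max_pool_2x2(crop_and_resize(feature, (2, 3, 8, 9)))
print(len(pooled), len(pooled[0]))   # 7 7
```

Whatever the size of the proposal, the output is always 7×7, which is what the next block needs.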

The reason for choosing those exact shapes is related to how they are used by the next block (R-CNN). It
is important to understand that they are customizable depending on the second stage use.

Region-based Convolutional Neural Network


Region-based convolutional neural network (R-CNN) is the final step in Faster R-CNN’s pipeline. After
getting a convolutional feature map from the image, using it to get object proposals with the RPN and
finally extracting features for each of those proposals (via RoI Pooling), we finally need to use these
features for classification. R-CNN tries to mimic the final stages of classification CNNs where a fully-
connected layer is used to output a score for each possible object class.

R-CNN has two different goals:

1. Classify proposals into one of the classes, plus a background class (for removing bad proposals).
2. Better adjust the bounding box for the proposal according to the predicted class.

In the original Faster R-CNN paper, the R-CNN takes the feature map for each proposal, flattens it and uses
two fully-connected layers of size 4096 with ReLU activation.

Then, it uses two different fully-connected layers, one for each of the two goals:

A fully-connected layer with N+1 units where N is the total number of classes and that extra one is
for the background class.
A fully-connected layer with 4N units. We want to have a regression prediction, thus we need
(Δx_center, Δy_center, Δwidth, Δheight) for each of the N possible classes.

R-CNN architecture

Training and targets


Targets for R-CNN are calculated in almost the same way as the RPN targets, but taking into account the
different possible classes. We take the proposals and the ground-truth boxes, and calculate the IoU between
them.

Those proposals that have an IoU greater than 0.5 with any ground truth box get assigned to that ground
truth. Those that have between 0.1 and 0.5 get labeled as background. Contrary to what we did while
assembling targets for the RPN, we ignore proposals without any intersection. This is because at this stage
we are assuming that we have good proposals and we are more interested in solving the harder cases. Of
course, all these values are hyperparameters that can be tuned to better fit the type of objects that you are
trying to find.
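The assignment rule for a single proposal can be sketched as follows, taking its IoU with every ground-truth box as input. Returning `None` for ignored proposals and using class names as labels are choices made for this example only.

```python
def assign_rcnn_target(ious, gt_classes, fg_thresh=0.5, bg_low=0.1):
    """ious[i] is the IoU between one proposal and ground-truth box i."""
    if not ious:
        return None
    best = max(range(len(ious)), key=lambda i: ious[i])
    if ious[best] > fg_thresh:
        return gt_classes[best]     # assigned to the best-matching ground truth
    if ious[best] >= bg_low:
        return "background"         # a hard negative: some overlap, but not enough
    return None                     # little or no intersection: ignored

print(assign_rcnn_target([0.2, 0.7], ["cat", "dog"]))    # dog
print(assign_rcnn_target([0.3, 0.1], ["cat", "dog"]))    # background
print(assign_rcnn_target([0.0, 0.05], ["cat", "dog"]))   # None
```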

The targets for the bounding box regression are calculated as the offset between the proposal and its
corresponding ground-truth box, only for those proposals that have been assigned a class based on the IoU
threshold.

We randomly sample a balanced mini batch of size 64 in which we have up to 25% foreground proposals
(with class) and 75% background.

Following the same path as we did for the RPN’s losses, the classification loss is now a multiclass cross
entropy loss computed over all the selected proposals, and the regression loss is a Smooth L1 loss computed
only over the foreground proposals (up to 25% of the batch) that are matched to a ground truth box. We have
to be careful when computing that loss since the output of the R-CNN fully connected network for bounding
box regressions has one prediction for each of the classes. When calculating the loss, we only have to take
into account the one for the correct class.
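Picking the prediction for the correct class out of the 4N regression outputs could look like the sketch below. The layout (four consecutive values per class, with classes indexed from 0 and background excluded) is an assumption; an actual implementation may order the outputs differently.

```python
def deltas_for_class(bbox_output, class_index):
    """bbox_output: flat list of 4 * N values; class_index in [0, N)."""
    start = 4 * class_index
    return bbox_output[start:start + 4]

bbox_output = [
    0.1, 0.1, 0.0, 0.0,     # deltas predicted for class 0
    0.2, -0.1, 0.3, 0.05,   # deltas predicted for class 1
    0.0, 0.0, 0.0, 0.0,     # deltas predicted for class 2
]
# During training, class_index is the ground-truth class of the proposal.
print(deltas_for_class(bbox_output, 1))   # [0.2, -0.1, 0.3, 0.05]
```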

Post processing
Similar to the RPN, we end up with a bunch of objects with classes assigned which need further processing
before returning them.

In order to apply the bounding box adjustments we have to take into account which is the class with the
highest probability for that proposal. We also have to ignore those proposals that have the background class
as the one with the highest probability.

After getting the final objects and ignoring those predicted as background, we apply class-based NMS. This
is done by grouping the objects by class, sorting them by probability and then applying NMS to each
independent group before joining them again.
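Class-based NMS can be sketched by grouping detections per class and running the greedy suppression inside each group. The (label, probability, box) layout and the 0.6 threshold are assumptions made for the example.

```python
from collections import defaultdict

def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def class_based_nms(detections, iou_threshold=0.6):
    """detections: list of (label, probability, box) tuples."""
    by_class = defaultdict(list)
    for det in detections:
        by_class[det[0]].append(det)
    kept = []
    for group in by_class.values():
        survivors = []
        for det in sorted(group, key=lambda d: d[1], reverse=True):
            # Only detections of the same class can suppress each other.
            if all(iou(det[2], s[2]) <= iou_threshold for s in survivors):
                survivors.append(det)
        kept.extend(survivors)
    return kept

detections = [
    ("cat", 0.9, (0, 0, 10, 10)),
    ("cat", 0.6, (0, 0, 10, 9)),   # same class, IoU 0.9: suppressed
    ("dog", 0.8, (0, 0, 10, 9)),   # different class: kept
]
print(class_based_nms(detections))
# [('cat', 0.9, (0, 0, 10, 10)), ('dog', 0.8, (0, 0, 10, 9))]
```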

For our final list of objects, we also can set a probability threshold and a limit on the number of objects for
each class.

Training
In the original paper, Faster R-CNN was trained using a multi-step approach, training parts independently
and merging the trained weights before a final full training approach. Since then, it has been found that
doing end-to-end, joint training leads to better results.

After putting the complete model together we end up with 4 different losses, two for the RPN and two for
R-CNN. We have the trainable layers in RPN and R-CNN, and we also have the base network which we
can train (fine-tune) or not.

The decision to train the base network depends on the nature of the objects we want to learn and the
computing power available. If we want to detect objects that are similar to those in the original
dataset on which the base network was trained, then there is no real need except for trying to squeeze all
the possible performance we can get. On the other hand, training the base network can be expensive both in
time and in the hardware needed to fit the complete gradients.

The four different losses are combined using a weighted sum. This is because we may want to give
classification losses more weight relative to regression ones, or maybe give R-CNN losses more power
over the RPN’s.

Apart from the regular losses, we also have the regularization losses which we skipped for the sake of
brevity but can be defined both in RPN and in R-CNN. We use L2 regularization for some of the layers and,
depending on which base network is being used and whether it’s trained, it may also have regularization.
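The weighted sum of the four losses described above is simple enough to write down directly. The loss names, values and weights below are hypothetical and only illustrate the mechanics; regularization terms would be added to the same sum.

```python
def total_loss(losses, weights):
    """Combine named losses into one scalar using per-loss weights."""
    return sum(weights[name] * value for name, value in losses.items())

losses = {"rpn_cls": 0.5, "rpn_reg": 0.25, "rcnn_cls": 1.0, "rcnn_reg": 0.5}
weights = {"rpn_cls": 1.0, "rpn_reg": 2.0, "rcnn_cls": 1.0, "rcnn_reg": 2.0}
print(total_loss(losses, weights))   # 3.0
```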

We train using Stochastic Gradient Descent with momentum, setting the momentum value to 0.9. You can
easily train Faster R-CNN with any other optimizer without bumping into any big problem.

The learning rate starts at 0.001 and then decreases to 0.0001 after 50K steps. This is one of the
hyperparameters that usually matters the most. When training with Luminoth, we usually start with the
defaults and tune it from then on.
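The schedule and the optimizer step can be sketched as follows. Treating step 50,000 as the first step with the lower rate is an assumption, and the momentum update shown is one common formulation; frameworks differ slightly in how they write it.

```python
def learning_rate(step, initial=0.001, decayed=0.0001, decay_step=50_000):
    """Piecewise-constant schedule: one drop after decay_step steps."""
    return initial if step < decay_step else decayed

def sgd_momentum_step(weight, velocity, grad, lr, momentum=0.9):
    """One SGD-with-momentum update for a single scalar weight."""
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity

print(learning_rate(0), learning_rate(50_000))   # 0.001 0.0001
```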

Evaluation
The evaluation is done using the standard Mean Average Precision (mAP) at some specific IoU threshold
(e.g. mAP@0.5). mAP is a metric that comes from information retrieval, and is commonly used for
calculating the error in ranking problems and for evaluating object detection problems.

We won’t go into details since these types of metrics deserve a blog post of their own, but the important
takeaway is that mAP penalizes you when you miss a box that you should have detected, as well as when
you detect something that does not exist or detect the same thing multiple times.

Conclusion
By now, you should have a clear idea of how Faster R-CNN works, why some decisions have been made
and some idea of how to tweak it for your specific case. If you want to get a deeper
understanding of how it works, you should check Luminoth’s implementation.

Faster R-CNN is one of the models that proved that it is possible to solve complex computer vision
problems with the same principles that showed such amazing results at the start of this new deep learning
revolution.

New models are currently being built, not only for object detection, but for semantic segmentation, 3D-
object detection, and more, that are based on this original model. Some borrow the RPN, some borrow the
R-CNN, others just build on top of both. This is why it is important to fully understand what is under the
hood so we are better prepared to tackle future problems.
Faster R-CNN: Down the rabbit hole of modern
object detection
Thu, Jan 18, 2018

Previously, we talked about object detection, what it is and how it has been recently tackled using deep
learning. If you haven’t read our previous blog post, we suggest you take a look at it before continuing.

Last year, we decided to get into Faster R-CNN, reading the original paper, and all the referenced papers
(and so on and on) until we got a clear understanding of how it works and how to implement it.

We ended up implementing Faster R-CNN in Luminoth, a computer vision toolkit based on TensorFlow
which makes it easy to train, monitor and use these types of models. So far, Luminoth has raised an
incredible amount of interest and we even talked about it at both ODSC Europe and ODSC West.

Based on all the work developing Luminoth and based on the presentations we did, we thought it would be
a good idea to have a blog post with all the details and links we gathered in our research as a future
reference for anyone who is interested in the topic.

Background
Faster R-CNN was originally published in NIPS 2015. After publication, it went through a couple of
revisions which we’ll later discuss. As we mentioned in our previous blog post, Faster R-CNN is the third
iteration of the R-CNN papers — which had Ross Girshick as author & co-author.

Everything started with “Rich feature hierarchies for accurate object detection and semantic segmentation”
(R-CNN) in 2014, which used an algorithm called Selective Search to propose possible regions of interest
and a standard Convolutional Neural Network (CNN) to classify and adjust them. It quickly evolved into
Fast R-CNN, published in early 2015, where a technique called Region of Interest Pooling allowed for
sharing expensive computations and made the model much faster. Finally came Faster R-CNN, where the
first fully differentiable model was proposed.

Architecture
The architecture of Faster R-CNN is complex because it has several moving parts. We’ll start with a high
level overview, and then go over the details for each of the components.

It all starts with an image, from which we want to obtain:

a list of bounding boxes.
a label assigned to each bounding box.
a probability for each label and bounding box.

Complete Faster R-CNN architecture

The input images are represented as Height×Width×Depth tensors (multidimensional arrays), which are
passed through a pre-trained CNN up until an intermediate layer, ending up with a convolutional feature
map. We use this as a feature extractor for the next part.

This technique is very commonly used in the context of Transfer Learning, especially for training a
classifier on a small dataset using the weights of a network trained on a bigger dataset. We’ll take a deeper
look at this in the following sections.

Next, we have what is called a Region Proposal Network (RPN, for short). Using the features that the CNN
computed, it is used to find up to a predefined number of regions (bounding boxes), which may contain
objects.

Probably the hardest issue with using Deep Learning (DL) for object detection is generating a variable-
length list of bounding boxes. When modeling deep neural networks, the last block is usually a fixed sized
tensor output (except when using Recurrent Neural Networks, but that is for another post). For example, in
image classification, the output is a (N,) shaped tensor, with N being the number of classes, where each
scalar in location i contains the probability of that image being label i.

The variable-length problem is solved in the RPN by using anchors: fixed sized reference bounding boxes
which are placed uniformly throughout the original image. Instead of having to detect where objects are,
we model the problem into two parts. For every anchor, we ask:

Does this anchor contain a relevant object?
How would we adjust this anchor to better fit the relevant object?

This is probably getting confusing, but fear not, we’ll dive into this below.

After having a list of possible relevant objects and their locations in the original image, it becomes a more
straightforward problem to solve. Using the features extracted by the CNN and the bounding boxes with
relevant objects, we apply Region of Interest (RoI) Pooling and extract those features which would
correspond to the relevant objects into a new tensor.

Finally, comes the R-CNN module, which uses that information to:

Classify the content in the bounding box (or discard it, using “background” as a label).
Adjust the bounding box coordinates (so it better fits the object).

Obviously, some major bits of information are missing, but that’s basically the general idea of how Faster
R-CNN works. Next, we’ll go over the details on both the architecture and loss/training for each of the
components.

Base network
As we mentioned earlier, the first step is using a CNN pretrained for the task of classification (e.g. using
ImageNet) and using the output of an intermediate layer. This may sound really simple for people with a
deep learning background, but it’s important to understand how and why it works, as well as visualize what
the intermediate layer output looks like.

There is no real consensus on which network architecture is best. The original Faster R-CNN used ZF and
VGG pretrained on ImageNet but since then there have been lots of different networks with a varying
number of weights. For example, MobileNet, a smaller and efficient network architecture optimized for
speed, has approximately 3.3M parameters, while ResNet-152 (yes, 152 layers), once the state of the art in
the ImageNet classification competition, has around 60M. More recently, new architectures like DenseNet
have been improving results while lowering the number of parameters.

VGG
Before we talk about which is better or worse, let’s try to understand how it all works using the standard
VGG-16 as an example.

VGG architecture

VGG, whose name comes from the team which used it in the ImageNet ILSVRC 2014 competition, was
published in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition” by Karen
Simonyan and Andrew Zisserman. By today’s standards it would not be considered very deep, but at the
time it more than doubled the number of layers commonly used and kickstarted the “deeper → more
capacity → better” wave (when training is possible).

When using VGG for classification, the input is a 224×224×3 tensor (that means a 224x224 pixel RGB
image). This has to remain fixed for classification because the final block of the network uses fully-
connected (FC) layers (instead of convolutional), which require a fixed length input. This is usually done
by flattening the output of the last convolutional layer, getting a rank 1 tensor, before using the FC layers.

Since we are going to use the output of an intermediate convolutional layer, the size of the input is not our
problem. At least, it is not the problem of this module since only convolutional layers are used. Let’s get a
bit more into low-level details and define which convolutional layer we are going to use. The paper does
not specify which layer to use, but in the official implementation you can see they use the output of
the conv5/conv5_1 layer.

Each convolutional layer creates abstractions based on the previous information. The first layers usually
learn edges, the second finds patterns in edges in order to activate for more complex shapes and so forth.
Eventually we end up with a convolutional feature map which has spatial dimensions much smaller than
the original image, but greater depth. The width and height of the feature map decrease because of the
pooling applied between convolutional layers and the depth increases based on the number of filters the
convolutional layer learns.

Image to convolutional feature map

In its depth, the convolutional feature map has encoded all the information for the image while maintaining
the location of the “things” it has encoded relative to the original image. For example, if there was a red
square on the top left of the image and the convolutional layers activate for it, then the information for that
red square would still be on the top left of the convolutional feature map.

VGG vs ResNet
Nowadays, ResNet architectures have mostly replaced VGG as a base network for extracting features.
Three of the co-authors of Faster R-CNN (Kaiming He, Shaoqing Ren and Jian Sun) were also co-authors
of “Deep Residual Learning for Image Recognition”, the original paper describing ResNets.

The obvious advantage of ResNet over VGG is that it is bigger, hence it has more capacity to actually learn
what is needed. This is true for the classification task and should be equally true in the case of object
detection.

Also, ResNet makes it easy to train deep models with the use of residual connections and batch
normalization, which had not been invented when VGG was first released.

Anchors
Now that we are working with a processed image, we need to find proposals, i.e. regions of interest for
classification. We previously mentioned that anchors are a way to solve the variable length problem, but we
skipped most of the explanation.

Our objective is to find bounding boxes in the image. These are rectangular and can come in
different sizes and aspect ratios. Imagine we were trying to solve the problem knowing beforehand that
there are two objects in the image. The first idea that comes to mind is to train a network that returns 8
values: two (x_min, y_min, x_max, y_max) tuples, each defining the bounding box of one object. This approach
has some fundamental problems. For example, images may have different sizes and aspect ratios, so training a
good model to predict raw coordinates can turn out to be very complicated (if not impossible). Another
problem is invalid predictions: when predicting x_min and x_max we have to somehow enforce that
x_min < x_max.

It turns out that there is a simpler approach to predicting bounding boxes by learning to predict offsets from
reference boxes. We take a reference box (x_center, y_center, width, height) and learn to predict
(Δx_center, Δy_center, Δwidth, Δheight), which are usually small values that tweak the reference box to better fit
what we want.
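
As a minimal sketch of the offset idea, the snippet below encodes a target box relative to a reference box and decodes it back. The text above does not spell out the parameterization; the one assumed here (center shifts scaled by the reference size, log-scaled width and height) follows the Faster R-CNN paper, and the function names are made up for illustration.

```python
import math

def encode(ref, box):
    """Deltas that turn `ref` into `box`; both are (x_center, y_center, width, height)."""
    rx, ry, rw, rh = ref
    bx, by, bw, bh = box
    return ((bx - rx) / rw, (by - ry) / rh, math.log(bw / rw), math.log(bh / rh))

def decode(ref, deltas):
    """Apply predicted deltas to a reference box to get the adjusted box."""
    rx, ry, rw, rh = ref
    dx, dy, dw, dh = deltas
    # exp() keeps width and height positive, so x_min < x_max holds by construction.
    return (rx + dx * rw, ry + dy * rh, rw * math.exp(dw), rh * math.exp(dh))

reference = (100.0, 100.0, 128.0, 128.0)
target = (110.0, 95.0, 140.0, 120.0)
deltas = encode(reference, target)
restored = decode(reference, deltas)
```

Because width and height are recovered through an exponential, a decoded box can never have a negative size, which sidesteps the invalid-prediction problem described earlier.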

Anchors are fixed bounding boxes that are placed throughout the image with different sizes and ratios that
are going to be used for reference when first predicting object locations.

Since we are working with a convolutional feature map of size conv_width × conv_height × conv_depth, we
create a set of anchors for each of the points in conv_width × conv_height. It’s important to understand that
even though anchors are defined based on the convolutional feature map, the final anchors reference the
original image.

Since we only have convolutional and pooling layers, the dimensions of the feature map will be
proportional to those of the original image. Mathematically, if the image was w×h, the feature map will end
up being (w/r)×(h/r), where r is called the subsampling ratio. If we define one anchor per spatial position of
the feature map, the final image will end up with a bunch of anchors separated by r pixels. In the case of VGG, r=16.
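
The mapping from feature-map positions back to image coordinates can be sketched as follows. Placing each center in the middle of its r×r cell and using integer division for the feature-map size are assumptions made for this illustration; implementations differ in these details.

```python
def anchor_centers(image_w, image_h, r=16):
    """Image-space (x, y) centers, one per spatial position of the feature map."""
    feat_w, feat_h = image_w // r, image_h // r
    return [
        (x * r + r // 2, y * r + r // 2)  # middle of the r x r image cell
        for y in range(feat_h)
        for x in range(feat_w)
    ]

centers = anchor_centers(64, 48)  # a 4 x 3 feature map for r = 16
```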

Anchor centers throughout the original image

In order to choose the set of anchors we usually define a set of sizes (e.g. 64px, 128px, 256px) and a set of
ratios between width and height of boxes (e.g. 0.5, 1, 1.5) and use all the possible combinations of sizes and
ratios.
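
A small sketch of combining sizes and ratios is shown below. It assumes that a ratio means width divided by height and that each anchor keeps an area of size × size; both are conventions chosen for this example, and the helper names are hypothetical.

```python
import math
from itertools import product

def anchor_shapes(sizes=(64, 128, 256), ratios=(0.5, 1.0, 1.5)):
    """(width, height) for every size/ratio combination."""
    shapes = []
    for size, ratio in product(sizes, ratios):
        width = size * math.sqrt(ratio)   # ratio = width / height (assumed)
        height = size / math.sqrt(ratio)  # area stays size * size (assumed)
        shapes.append((width, height))
    return shapes

def anchors_at(cx, cy, shapes):
    """Anchors as (x_min, y_min, x_max, y_max) around one center point."""
    return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2) for w, h in shapes]

shapes = anchor_shapes()
point_anchors = anchors_at(8, 8, shapes)
```

With three sizes and three ratios, every point gets nine anchors. Anchors near the border can extend outside the image, as the negative coordinates in `point_anchors` show.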

Left: Anchors, Center: Anchor for a single point, Right: All anchors

Region Proposal Network

The RPN takes the convolutional feature map and generates proposals over the image

As we mentioned before, the RPN takes all the reference boxes (anchors) and outputs a set of good
proposals for objects. It does this by having two different outputs for each of the anchors.

The first one is the probability that an anchor is an object. An “objectness score”, if you will. Note that the
RPN doesn’t care what class of object it is, only that it does in fact look like an object (and not
background). We are going to use this objectness score to filter out the bad predictions for the second stage.
The second output is the bounding box regression for adjusting the anchors to better fit the object it’s
predicting.

The RPN is implemented efficiently in a fully convolutional way, using the convolutional feature map
returned by the base network as an input. First, we use a convolutional layer with 512 channels and 3x3
kernel size and then we have two parallel convolutional layers using a 1x1 kernel, whose number of
channels depends on the number of anchors per point.

Convolutional implementation of an RPN architecture, where k is the number of anchors.

For the classification layer, we output two predictions per anchor: the score of it being background (not an
object) and the score of it being foreground (an actual object).

For the regression, or bounding box adjustment layer, we output 4 predictions per anchor: the deltas
(Δx_center, Δy_center, Δwidth, Δheight), which we will apply to the anchors to get the final proposals.

Using the final proposal coordinates and their “objectness” score we then have a good set of proposals for
objects.
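
To make the channel counts concrete, here is a rough sketch of the output shapes for a feature map of a given size with k anchors per point. It assumes the 3x3 convolution is padded so that the spatial size is preserved; the function name and dictionary keys are illustrative only.

```python
def rpn_output_shapes(feat_h, feat_w, k):
    """Shapes (height, width, channels) of the RPN tensors for k anchors per point."""
    return {
        "shared_3x3": (feat_h, feat_w, 512),    # intermediate 3x3 convolution
        "objectness": (feat_h, feat_w, 2 * k),  # background / foreground score per anchor
        "box_deltas": (feat_h, feat_w, 4 * k),  # four deltas per anchor
        "num_anchors": feat_h * feat_w * k,
    }

shapes_out = rpn_output_shapes(37, 50, 9)
```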

Training, target and loss functions


The RPN makes two different types of predictions: the binary classification and the bounding box regression
adjustment.

For training, we take all the anchors and put them into two different categories. Those that overlap a
ground-truth object with an Intersection over Union (IoU) bigger than 0.5 are considered “foreground” and
those that don’t overlap any ground truth object or have less than 0.1 IoU with ground-truth objects are
considered “background”.
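
The labeling rule can be sketched as below, using the 0.5 and 0.1 thresholds quoted above. Marking the remaining anchors with -1 so that they are ignored is a convention assumed for this example.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, fg_thresh=0.5, bg_thresh=0.1):
    """Return 1 (foreground), 0 (background) or -1 (ignored during training)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > fg_thresh:
        return 1
    if best < bg_thresh:
        return 0
    return -1

ground_truth = [(0, 0, 10, 10)]
labels = [
    label_anchor((0, 0, 10, 10), ground_truth),          # perfect overlap
    label_anchor((100, 100, 110, 110), ground_truth),    # no overlap
    label_anchor((5, 0, 15, 10), ground_truth),          # IoU = 1/3, in between
]
```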

Then, we randomly sample those anchors to form a mini batch of size 256 — trying to maintain a balanced
ratio between foreground and background anchors.
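
A possible way to draw such a mini batch is sketched here. The 50/50 target split and the fallback of filling the remaining slots with background anchors are assumptions; the text only says that a balanced ratio is attempted.

```python
import random

def sample_minibatch(labels, batch_size=256, fg_fraction=0.5, seed=0):
    """Pick anchor indices for one mini batch from labels in {1, 0, -1}."""
    rng = random.Random(seed)
    fg = [i for i, label in enumerate(labels) if label == 1]
    bg = [i for i, label in enumerate(labels) if label == 0]
    n_fg = min(len(fg), int(batch_size * fg_fraction))
    n_bg = min(len(bg), batch_size - n_fg)  # background fills what foreground cannot
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)

anchor_labels = [1] * 10 + [0] * 1000 + [-1] * 50
fg_idx, bg_idx = sample_minibatch(anchor_labels)
```

In this example only 10 foreground anchors exist, so the batch ends up with 10 foreground and 246 background anchors, which illustrates why a perfectly balanced batch is not always achievable.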

The RPN uses all the anchors selected for the mini batch to calculate the classification loss using binary
cross entropy. Then, it uses only those minibatch anchors marked as foreground to calculate the regression
loss. For calculating the targets for the regression, we use the foreground anchor and the closest ground
truth object and calculate the correct Δ needed to transform the anchor into the object.

Instead of using a simple L1 or L2 loss for the regression error, the paper suggests using Smooth L1 loss.
Smooth L1 is basically L1, but when the absolute error falls below a threshold set by a parameter σ, the loss
becomes quadratic, so small errors are penalized less and the gradient shrinks smoothly towards zero.
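
One common formulation of Smooth L1 with a σ parameter is sketched below. The exact form (quadratic below 1/σ², linear above) is an assumption borrowed from widely used implementations rather than something stated in the text.

```python
def smooth_l1(error, sigma=1.0):
    """Smooth L1 loss for a single regression error."""
    sigma_sq = sigma * sigma
    abs_error = abs(error)
    if abs_error < 1.0 / sigma_sq:
        return 0.5 * sigma_sq * abs_error * abs_error  # quadratic near zero
    return abs_error - 0.5 / sigma_sq                  # linear (L1-like) elsewhere

small = smooth_l1(0.5)
large = smooth_l1(2.0)
```

The two branches meet at the threshold, so the loss is continuous, and larger values of σ shrink the quadratic region.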

Using dynamic batches can be challenging for a number of reasons. Even though we try to maintain a
balanced ratio between anchors that are considered background and those that are considered foreground,
that is not always possible. Depending on the ground truth objects in the image and the size and ratios of
the anchors, it is possible to end up with zero foreground anchors. In those cases, we turn to using the
anchors with the biggest IoU to the ground truth boxes. This is far from ideal, but practical in the sense that
we always have foreground samples and targets to learn from.

Post processing
Non-maximum suppression. Since anchors usually overlap, proposals also end up overlapping over the
same object. To solve the issue of duplicate proposals we use a simple algorithmic approach called
Non-Maximum Suppression (NMS). NMS takes the list of proposals sorted by score and iterates over the
sorted list, discarding those proposals that have an IoU larger than some predefined threshold with a
proposal that has a higher score.

While this looks simple, it is very important to be cautious with the IoU threshold. Too low and you may
end up missing proposals for objects; too high and you could end up with too many proposals for the same
object. A value commonly used is 0.6.
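
A minimal sketch of NMS follows, using the 0.6 threshold mentioned above. It is the usual greedy variant, in which each proposal is compared only against the higher-scoring proposals that have already been kept; the data layout (a list of box/score pairs) is chosen for this example.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(proposals, iou_threshold=0.6):
    """proposals: list of (box, score). Returns the kept ones, best score first."""
    kept = []
    for box, score in sorted(proposals, key=lambda p: p[1], reverse=True):
        # Keep the proposal only if it does not overlap too much with a better one.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept

proposals = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),    # IoU of about 0.68 with the first box
    ((50, 50, 60, 60), 0.7),
]
kept = nms(proposals)
```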

Proposal selection. After applying NMS, we keep the top N proposals sorted by score. In the paper N=2000
is used, but it is possible to lower that number to as little as 50 and still get quite good results.

Standalone application
The RPN can be used by itself without needing the second stage model. In problems where there is only a
single class of objects, the objectness probability can be used as the final class probability. This is because
for this case, “foreground” = “single class” and “background” = “not single class”.

Some examples of machine learning problems that can benefit from a standalone usage of the RPN are the
popular (but still challenging) face detection and text detection.

One of the advantages of using only the RPN is the gain in speed both in training and prediction. Since the
RPN is a very simple network which only uses convolutional layers, the prediction time can be faster than
using the classification base network.

Region of Interest Pooling


After the RPN step, we have a bunch of object proposals with no class assigned to them. Our next problem
to solve is how to take these bounding boxes and classify them into our desired categories.

The simplest approach would be to take each proposal, crop it, and then pass it through the pre-trained base
network. Then, we can use the extracted features as input for a vanilla image classifier. The main problem
is that running the computations for all the 2000 proposals is really inefficient and slow.

Faster R-CNN tries to solve, or at least mitigate, this problem by reusing the existing convolutional feature
map. This is done by extracting fixed-sized feature maps for each proposal using region of interest pooling.
Fixed size feature maps are needed for the R-CNN in order to classify them into a fixed number of classes.

Region of Interest Pooling

A simpler method, which is widely used by object detection implementations, including Luminoth’s Faster
R-CNN, is to crop the convolutional feature map using each proposal and then resize each crop to a fixed
size of 14×14×conv_depth using interpolation (usually bilinear). After cropping and resizing, max pooling
with a 2x2 kernel is used to get a final 7×7×conv_depth feature map for each proposal.
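
The crop, resize and pool steps can be illustrated on a single channel with plain Python lists. This is only a sketch: real implementations run on all conv_depth channels at once with framework operations, and the way the proposal is projected onto the feature map (dividing by r and rounding outwards) is an assumption.

```python
import math

def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D list of floats with bilinear interpolation (corners aligned)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def max_pool_2x2(grid):
    """Non-overlapping 2x2 max pooling over a 2D list."""
    return [
        [max(grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1])
         for j in range(0, len(grid[0]) - 1, 2)]
        for i in range(0, len(grid) - 1, 2)
    ]

def roi_pool(feature_map, proposal, r=16, crop_size=14):
    """Crop one proposal (image coordinates) from the map, resize, then pool."""
    x_min, y_min, x_max, y_max = proposal
    feat_h, feat_w = len(feature_map), len(feature_map[0])
    fx0 = max(0, min(feat_w - 1, int(x_min // r)))
    fy0 = max(0, min(feat_h - 1, int(y_min // r)))
    fx1 = min(feat_w, max(fx0 + 1, math.ceil(x_max / r)))
    fy1 = min(feat_h, max(fy0 + 1, math.ceil(y_max / r)))
    crop = [row[fx0:fx1] for row in feature_map[fy0:fy1]]
    return max_pool_2x2(bilinear_resize(crop, crop_size, crop_size))

# Toy 20 x 20 single-channel feature map whose values encode their position.
feature_map = [[float(y * 20 + x) for x in range(20)] for y in range(20)]
pooled = roi_pool(feature_map, (32, 48, 160, 208))
```

Whatever the size of the proposal, the result is always 7×7, which is what lets the next block use fully-connected layers.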

The reason for choosing those exact shapes is related to how they are used by the next block (R-CNN). It
is important to understand that they can be customized depending on what the second stage needs.

Region-based Convolutional Neural Network


Region-based convolutional neural network (R-CNN) is the final step in Faster R-CNN’s pipeline. After
getting a convolutional feature map from the image, using it to get object proposals with the RPN and
finally extracting features for each of those proposals (via RoI Pooling), we finally need to use these
features for classification. R-CNN tries to mimic the final stages of classification CNNs where a fully-
connected layer is used to output a score for each possible object class.

R-CNN has two different goals:

1. Classify proposals into one of the classes, plus a background class (for removing bad proposals).
2. Better adjust the bounding box for the proposal according to the predicted class.

In the original Faster R-CNN paper, the R-CNN takes the feature map for each proposal, flattens it and uses
two fully-connected layers of size 4096 with ReLU activation.

Then, it uses two different fully-connected layers, one for each of the goals above:

1. A fully-connected layer with N+1 units, where N is the total number of classes and the extra one is
for the background class.
2. A fully-connected layer with 4N units. We want to have a regression prediction, thus we need
(Δx_center, Δy_center, Δwidth, Δheight) for each of the N possible classes.

R-CNN architecture

Training and targets


Targets for R-CNN are calculated in almost the same way as the RPN targets, but taking into account the
different possible classes. We take the proposals and the ground-truth boxes, and calculate the IoU between
them.

Those proposals that have an IoU greater than 0.5 with any ground truth box get assigned to that ground
truth. Those with an IoU between 0.1 and 0.5 get labeled as background. Contrary to what we did while
assembling targets for the RPN, we ignore proposals without any intersection. This is because at this stage
we are assuming that we have good proposals and we are more interested in solving the harder cases. Of
course, all these values are hyperparameters that can be tuned to better fit the type of objects that you are
trying to find.

The targets for the bounding box regression are calculated as the offset between the proposal and its
corresponding ground-truth box, only for those proposals that have been assigned a class based on the IoU
threshold.

We randomly sample a balanced mini batch of size 64 in which we have up to 25% foreground proposals
(with class) and 75% background.

Following the same path as we did for the RPN losses, the classification loss is now a multiclass cross
entropy loss computed over all the selected proposals, and the regression loss is a Smooth L1 loss computed
over the (up to 25%) proposals that are matched to a ground truth box. We have to be careful when computing
the regression loss, since the output of the R-CNN fully connected network for bounding box regression has
one prediction for each of the classes. When calculating the loss, we only take into account the one for the
correct class.
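
Picking the prediction for the correct class can be sketched as a simple slice. The layout assumed here (the 4N outputs stored class by class, with a 0-based class index that excludes background) is an illustration and not necessarily the layout of any particular implementation.

```python
def select_class_deltas(bbox_output, class_index, num_classes):
    """Return the 4 deltas that belong to `class_index` from a flat 4N output."""
    if len(bbox_output) != 4 * num_classes:
        raise ValueError("expected 4 values per class")
    start = 4 * class_index
    return bbox_output[start:start + 4]

# Toy output for N = 3 classes: values 0..11 stand in for predicted deltas.
bbox_output = list(range(12))
deltas_for_class_1 = select_class_deltas(bbox_output, class_index=1, num_classes=3)
```

Only the selected four values would then be compared against the regression target with the Smooth L1 loss.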

Post processing
Similar to the RPN, we end up with a bunch of objects with classes assigned which need further processing
before returning them.

In order to apply the bounding box adjustments we have to take into account which is the class with the
highest probability for that proposal. We also have to ignore those proposals that have the background class
as the one with the highest probability.

After getting the final objects and ignoring those predicted as background, we apply class-based NMS. This
is done by grouping the objects by class, sorting them by probability and then applying NMS to each
independent group before joining them again.

For our final list of objects, we also can set a probability threshold and a limit on the number of objects for
each class.

Training
In the original paper, Faster R-CNN was trained using a multi-step approach, training parts independently
and merging the trained weights before a final full training approach. Since then, it has been found that
doing end-to-end, joint training leads to better results.

After putting the complete model together we end up with 4 different losses, two for the RPN and two for
R-CNN. We have the trainable layers in RPN and R-CNN, and we also have the base network which we
can train (fine-tune) or not.

The decision to train the base network depends on the nature of the objects we want to learn and the
computing power available. If we want to detect objects that are similar to those in the original
dataset on which the base network was trained, then there is no real need to train it, except for trying to
squeeze out all the performance we can get. On the other hand, training the base network can be expensive
both in time and in the hardware needed to fit the complete gradients.

The four different losses are combined using a weighted sum. This is because we may want to give
classification losses more weight relative to regression ones, or maybe give R-CNN losses more power
over the RPNs’.
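
As a sketch, the weighted sum looks like this. The loss names, values and weights are made-up numbers for illustration and are not taken from any configuration.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss values."""
    return sum(weights[name] * value for name, value in losses.items())

losses = {"rpn_cls": 0.5, "rpn_reg": 0.2, "rcnn_cls": 1.0, "rcnn_reg": 0.4}
weights = {"rpn_cls": 1.0, "rpn_reg": 1.0, "rcnn_cls": 1.0, "rcnn_reg": 2.0}
combined = total_loss(losses, weights)
```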

Apart from the regular losses, we also have the regularization losses which we skipped for the sake of
brevity but can be defined both in RPN and in R-CNN. We use L2 regularization for some of the layers and,
depending on which base network is being used and whether it is trained, it may also have regularization.

We train using Stochastic Gradient Descent with momentum, setting the momentum value to 0.9. You can
easily train Faster R-CNN with any other optimizer without bumping into any big problem.

The learning rate starts at 0.001 and then decreases to 0.0001 after 50K steps. This is one of the
hyperparameters that usually matters the most. When training with Luminoth, we usually start with the
defaults and tune it from then on.
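
The optimizer settings above can be written down as a small sketch: a step-wise learning rate and one classical momentum update on a single scalar parameter. Whether the drop happens exactly at step 50,000 or right after it is an assumption, as is the particular momentum formulation.

```python
def learning_rate(step, base_lr=0.001, decayed_lr=0.0001, decay_step=50_000):
    """Piecewise-constant schedule: base_lr first, decayed_lr from decay_step on."""
    return base_lr if step < decay_step else decayed_lr

def momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update; returns the new parameter and velocity."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

new_param, new_velocity = momentum_step(param=1.0, grad=2.0, velocity=0.0, lr=0.1)
```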

Evaluation
The evaluation is done using the standard Mean Average Precision (mAP) at some specific IoU threshold
(e.g. mAP@0.5). mAP is a metric that comes from information retrieval, and is commonly used for
calculating the error in ranking problems and for evaluating object detection problems.

We won’t go into details since this type of metric deserves a blogpost of its own, but the important
takeaway is that mAP penalizes you when you miss a box that you should have detected, as well as when
you detect something that does not exist or detect the same thing multiple times.

Conclusion
By now, you should have a clear idea of how Faster R-CNN works, why some decisions have been made
and some idea of how to tweak it for your specific case. If you want to get a deeper
understanding of how it works you should check Luminoth’s implementation.

Faster R-CNN is one of the models that proved that it is possible to solve complex computer vision
problems with the same principles that showed such amazing results at the start of this new deep learning
revolution.

New models are currently being built, not only for object detection, but for semantic segmentation, 3D-
object detection, and more, that are based on this original model. Some borrow the RPN, some borrow the
R-CNN, others just build on top of both. This is why it is important to fully understand what is under the
hood so we are better prepared to tackle future problems.

Why do we use convolutions for images rather than just FC layers?

Firstly, convolutions preserve, encode, and actually use the spatial information from the image. If we used
only FC layers we would have no relative spatial information. Secondly, Convolutional Neural Networks (CNNs)
have a partially built-in translation invariance, since each convolution kernel acts as its own
filter/feature detector.

What makes CNNs translation invariant?

As explained above, each convolution kernel acts as its own filter/feature detector. So let’s say you’re
doing object detection: it doesn’t matter where in the image the object is, since we’re going to apply the
convolution in a sliding window fashion across the entire image anyway.
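
A tiny 1D sketch makes the sliding-window argument concrete: the same kernel applied to a shifted input yields the same response, just shifted. The signal and kernel values are arbitrary examples.

```python
def correlate1d(signal, kernel):
    """Slide `kernel` over `signal` (no padding) and return the responses."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [1, 2, 1]
signal = [0, 0, 1, 2, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 2, 1, 0, 0]  # same pattern, moved two steps right

response = correlate1d(signal, kernel)
shifted_response = correlate1d(shifted, kernel)
```

The peak response has the same value in both cases and only its position moves, so a later pooling or global maximum step sees the same feature regardless of where it appeared.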

Why do we have max-pooling in classification CNNs?

Again, as you would expect, this has to do with its role in computer vision. Max-pooling in a CNN allows
you to reduce computation, since your feature maps are smaller after the pooling. You don’t lose too much
semantic information, since you’re taking the maximum activation. There’s also a theory that max-pooling
contributes a bit to giving CNNs more translation invariance. Check out this great video
from Andrew Ng on the benefits of max-pooling.

Why do segmentation CNNs typically have an encoder-decoder style / structure?

The encoder CNN can basically be thought of as a feature extraction network, while the decoder uses that
information to predict the image segments by “decoding” the features and upscaling to the original
image size.

What is the significance of Residual Networks?

The main thing that residual connections did was allow for direct feature access from previous
layers. This makes information propagation throughout the network much easier. One very interesting paper
about this shows how using local skip connections gives the network a type of ensemble multi-path
structure, giving features multiple paths to propagate throughout the network.

Handwritten notes (largely illegible in this copy): logistic regression, sigmoid, log loss.

Handwritten notes (partly illegible in this copy):
Remove training-only files → useful for backprop and gradients.
8-bit ← floating point.
Checkpoint saving.
Frameworks not available for mobile.
Datatypes → variants; boolean.

Handwritten notes (partly illegible in this copy): multiclass, binary, regression; categorical
crossentropy, binary crossentropy; optimizer; model; learning rate.