Study and Implementation of Object Detection and Visual Tracking


Department of

Electronics and Telecommunication Engineering


National Institute of Technology, Raipur

Report of Research Internship under Dr. Rama Krishna Sai Gorthi


On
“STUDY AND IMPLEMENTATION OF
OBJECT DETECTION AND VISUAL TRACKING”
Undertaken at

INDIAN INSTITUTE OF TECHNOLOGY, TIRUPATI


Submitted By:
Bharat Giddwani
Roll Number: 16116901

Submitted to:

Dr. T. Meenpal,
Assistant Professor,
Dept. of ETC, NIT Raipur.

Dr. Ajay Singh Raghuvanshi,
Head of Department,
Dept. of ETC, NIT Raipur.
ACKNOWLEDGEMENT

The period of internship at the Indian Institute of Technology Tirupati was a valuable one, full of new experiences and opportunities to learn, many of which will influence my professional as well as personal life.
I have put a lot of effort into this project. However, it would not have been possible for me to complete it with ease without the help of the people at IIT Tirupati. I sincerely thank my research guide Dr. Rama Krishna Sai Gorthi, Associate Professor, Electrical Engineering Department, IIT Tirupati, for giving me the golden opportunity to intern under him; he has helped me a lot with suggestions, comments and appreciation. I express my deep gratitude to Mr. Mohana Murali, PhD scholar at IIT Tirupati, who gave complete assistance during the internship, and my thanks to Mr. Naveen, MS scholar at IIT Tirupati, for his help and tips regarding the project.
I express my special gratitude to my co-intern, Mr. Dheeraj Varma, for his help and coordination during the stay in Tirupati, and special thanks to all the PhD and MS scholars of IIT Tirupati who helped me during the internship. I thank all the staff of IIT Tirupati for their help.

Study and Implementation of Object Detectors and Visual Trackers:
Bharat Giddwani , Mohan Murali, Naveen Palaru , Dr. Gorthi R. K. Sai Subrahmanyam,
Department of Electrical Engineering, Indian Institute of Technology, Tirupati, A.P.-517506, India

Abstract

Efficient and accurate object detection has been an important topic in the advancement of computer vision systems. With the advent of deep learning techniques, the accuracy of object detection has increased drastically. This project aims to incorporate state-of-the-art techniques for object detection with the goal of achieving high accuracy with real-time performance. A major challenge in many object detection systems is the dependency on other computer vision techniques to support the deep learning based approach, which leads to slow and non-optimal performance. In this project, we use a completely deep learning based approach to solve the problem of object detection and visual tracking in an end-to-end fashion. We used two different networks pre-trained on the most challenging publicly available datasets (PASCAL VOC and MS-COCO), on which object detection challenges are conducted annually; our main focus is on YOLO, a unified state-of-the-art object detector.
In this report, we also examine the problem of tracking objects in video streams using deep learning. We use spatially supervised recurrent convolutional neural networks for visual object tracking. In this method, the recurrent convolutional network uses both the history of locations and the visual features from the deep neural networks, and performs tracking based on the detection results. We concatenate the locations of detected bounding boxes with the high-level visual features produced by convolutional networks and then predict the tracking bounding box for subsequent frames. Because a video consists of continuous frames, we chose a method that uses information from the history of frames to achieve robust tracking in visually challenging cases such as occlusion, motion blur, fast movement, etc. Long Short-Term Memory (LSTM), a kind of recurrent neural network, is well suited to this purpose. We used the OTB100 dataset to train our tracking network. Instead of the binary classification commonly used in deep learning based tracking methods, we use regression for direct prediction of the tracking locations. The resulting system is fast and accurate, thus aiding applications which require object detection and visual tracking.
Key words: Convolutional Neural Networks, Recurrent Neural Networks, YOLO Object Detector,
Visual Tracking.

CONTENTS

S.no Name of Topic Page No.


1. Introduction
1.1.Background 5
1.2.Problem Statement 5
1.3.Why Deep Learning? 6
2. Theory and Background
2.1. Convolutional Neural Networks 6
2.2. CNN Architectures 7
3. Object Detection in Real Images
3.1. Challenges 8
3.2. Related Work 8
4. Approach for Object Detection
4.1. YOLO – You Only Look Once 10
4.2. SSD – Single Shot Multi box Detector 16
4.3. YOLOv3: An Incremental Improvement 19
4.4. Conclusion 23
5. Approach for Visual Tracking
5.1. Recurrent Neural Network. 24
5.2. Long Short-Term Memory (LSTM) Network 24
5.3. Visual Tracking 27
➢ Types of Trackers 27
➢ CNN based Tracker – MDNet 27
➢ CNN-RNN based Tracker (ROLO) 28
6. 6.1. Applications 30
6.2. Conclusion 31
6.3. Future Work 31

Chapter 1
Introduction

1.1 Background

In the past two decades, the problem of object detection, localization and tracking has received significant attention in different research areas. This coincides with the rising demand for information about objects' location and identity, which stems from applications in various fields, such as manufacturing, military, business management, surveillance and security, transport and logistics, medical care, traffic management, childcare, and performance analysis in sports and sports medicine. Human detection and tracking in particular can be widely used in many applications, including people counting and security surveillance in public scenes.
Different methods have been used for this purpose. Some research uses a combination of Kalman filter prediction and mean-shift tracking; others use a tree-structured probabilistic model for human tracking. Recently, neural networks such as the radial basis function (RBF) network and the CNN have become more popular for image processing purposes. CNNs have been applied to various computer vision tasks such as image classification, semantic segmentation, object detection, and many others. This success has led CNNs to be used widely, with distinguished performance, in visual applications. Using a CNN for tracking has a limitation related to training data: tracking requires data of sufficient variety, and it is difficult to collect a large enough amount of training data for video processing applications and training algorithms. Several recent tracking algorithms address the data deficiency issue by transferring CNNs pre-trained on a large-scale dataset such as ImageNet.
Alternatively, trackers can be trained entirely from scratch online at test time, learning to handle complex challenges such as rotations, changes in viewpoint and lighting changes with no offline training being performed; but these tracking methods are too slow. Such trackers also have lower performance compared with offline-trained methods because they cannot take advantage of a large number of videos for improving their performance.

1.2 Problem Statement


In this research project report, deep learning for tracking humans in a video stream will be implemented. The question examined is how to adapt a deep learning based method for object detection to the application of object/human tracking, and how to improve it in terms of tracking accuracy, computational cost and tracking speed. The assumption is that it should be possible to re-track objects if tracking is temporarily lost due to the object leaving the field of view. We implement and test different structures and architectures for improving the overall performance of the results. The system should be able to track people at several distances and in several poses. Tracking algorithms suffer from sequence-specific challenges including occlusion, deformation, lighting condition changes, motion blur, etc.
Finding an object detector (CNN) and a visual tracker (CNN-RNN) that are invariant to these conditions is therefore our goal; the method we propose aims at robust tracking.

1.3 Why Deep Learning?
Many problems in computer vision were saturating in accuracy a decade ago. However, with the rise of deep learning techniques, the accuracy on these problems improved drastically. One of the major problems was image classification, which is defined as predicting the class of the image. A slightly more complicated problem is image localization, where the image contains a single object and the system should predict the class of the object and its location in the image (a bounding box around the object). The more complicated problem addressed in this project, object detection, involves both classification and localization: the input to the system is an image, and the output is a bounding box for every object in the image, along with the class of the object in each box. An overview of these problems is depicted in Fig. 1.

Chapter 2
Theory and Background
2.1 Convolutional Neural Networks - CNN
Convolutional Neural Networks (ConvNets or CNNs) are a category of neural networks that have recently found great application in visual analysis and machine learning. ConvNets have been successful in classification, segmentation, detection and tracking problems.

Figure-2 Typical CNN Architecture

A CNN has four main stages: convolution, activation, subsampling and full connection.
The first stage is convolution. The main idea of using convolution in the first layers is to extract features from the input image: a set of filters act as feature detectors on the original input image. In other words, convolution is a process where the input signal is labelled by the network based on what it has learned in the past. If the network decides that the input signal looks like the cat images it has learned previously, the "cat" reference signal is convolved with the input signal, and the resulting output signal is passed on to the next layer.
The second stage is activation. The activation layer controls how the signal flows from one layer to the next. In different CNN structures, a wide variety of activation functions can be chosen to model signal propagation. One of the most popular is the Rectified Linear Unit (ReLU), known for its fast training speed. ReLU has the mathematical form:
f(x) = max(0, x)
The third stage is subsampling: to reduce the sensitivity of the filters to noise and variations, the inputs from the convolution layer are smoothed. Subsampling also reduces the dimensionality of each feature map while preserving the most important information. This smoothing process is called subsampling, downsampling or pooling, and can be performed with different operations such as max, average or sum.
The fourth stage is full connection. The last layers in most convolutional networks are fully connected, which means that the neurons of the previous layer are connected to every neuron in the next layer. The output from the convolutional and pooling layers contains high-level features of the input image. The features of the fully connected layers are used by a softmax layer, and the input image is classified into different classes based on the training data. Fully connected layers also help to learn non-linear combinations of these features; such combinations may be better for classification or other applications of CNNs.
• Formulas for height, width and depth:
After a convolutional layer with input W × H × D, K filters of size F × F, stride S and zero-padding P:
W_out = (W − F + 2P)/S + 1,  H_out = (H − F + 2P)/S + 1,  D_out = K
After a max/average pooling layer with window size F and stride S:
W_out = (W − F)/S + 1,  H_out = (H − F)/S + 1,  D_out = D
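To make these four stages and formulas concrete, below is a minimal Keras sketch (Keras is the library used later in this project); the layer counts and sizes are illustrative assumptions, not a network from this report:

from tensorflow.keras import layers, models

# Minimal CNN with the four stages: convolution, activation (ReLU),
# subsampling (max pooling) and full connection (Dense + softmax).
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),                  # subsampling: keep the max of each 2x2 window
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),         # fully connected layer
    layers.Dense(10, activation="softmax"),       # class probabilities
])
model.summary()  # each printed output shape follows the formulas above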

2.2 CNN Architectures:


Types of CNN Architectures:
A.) LeNet
B.) AlexNet
C.) VGG-16 and VGG-19
D.) GoogLeNet / Inception
E.) ResNet, etc.

A.) LeNet (1998):

The first popular implementation of the CNN was LeNet, introduced by Yann LeCun in 1998. Figure 2.1 illustrates the LeNet structure.
C.) VGG-16 (2014):

Figure 2.2 –VGG16 neural Network Structure


VGG-16 was the runner-up at the ILSVRC 2014 competition and was developed by Simonyan and Zisserman.

E.) ResNet-50 (2015) :

Figure 2.3 ResNet-50 Neural Network Structure


At the ILSVRC 2015, the so-called Residual Neural Network (ResNet) by Kaiming He et al. introduced a novel architecture with "skip connections" and heavy use of batch normalization.

Chapter 3
Object Detection in Real Images
3.1 Challenges
The major challenge in this problem is the variable dimension of the output, caused by the variable number of objects that can be present in any given input image. Any general machine learning task requires a fixed dimension of input and output for the model to be trained. Another important obstacle for widespread adoption of object detection systems is the requirement of real-time performance (>30 fps) while remaining accurate in detection. The more complex the model, the more time it requires for inference; the less complex the model, the lower its accuracy. This trade-off between accuracy and speed needs to be chosen as per the application. The problem involves classification as well as regression, and the model must learn both simultaneously, which adds to the complexity of the problem.

3.2 Related Work


There has been a lot of work in object detection using traditional computer vision techniques (sliding windows, deformable part models). However, they lack the accuracy of deep learning based techniques. Among the deep learning based techniques, two broad classes of methods are prevalent: two-stage detection (RCNN, Fast RCNN and Faster RCNN) and unified detection (YOLO, SSD, DSSD, YOLOv2, YOLOv3, ESSD). The major concepts involved in these techniques are explained below.

3.2.1) Bounding Box
The bounding box is a rectangle drawn on the image which tightly fits the object in the image. A bounding box exists for every instance of every object in the image. For each box, 4 numbers (center x, center y, width, height) are predicted. This can be trained using a distance measure between the predicted and ground-truth bounding boxes. The distance measure is the Jaccard distance, which computes the intersection over union (IoU) between the predicted and ground-truth boxes, as shown in Fig. 3.1.

Figure 3.1: Intersection over Union calculation using Jaccard Method (From Andrew Ng)
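As a concrete illustration, IoU can be computed directly from two boxes; a minimal Python sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.142...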

3.2.2) Classification + Regression:


The bounding box is predicted using regression and the class within the bounding box is predicted using
classification.

3.2.3 Two-stage Method:


In this case, the proposals are extracted using some other computer vision technique and then resized to a fixed input size for the classification network, which acts as a feature extractor. Then an SVM is trained to classify between object and background (one SVM for each class). A bounding box regressor is also trained to output corrections (offsets) for the proposal boxes. The overall idea is shown in Fig. 3.2. These methods are very accurate but computationally intensive (low fps).

Types of Detectors:
1. R-CNN
2. Fast R-CNN
3. R-FCN
4. Faster R-CNN

Figure 3.2 RCNN Description

3.2.4 Unified Method


The difference here is that, instead of producing proposals, a set of boxes is pre-defined in which to look for objects. Using convolutional feature maps from later layers of the network, another network is run over these feature maps to predict class scores and bounding box offsets. The broad idea is depicted in Fig. 6.
The steps are as follows:
1. Train a CNN with regression and classification objectives.
2. Gather activations from later layers to infer classification and location with fully connected or convolutional layers.
3. During training, use the Jaccard distance to relate predictions with the ground truth.
4. During inference, use non-maximum suppression to filter multiple boxes around the same object.
The major techniques that follow this strategy are:
SSD (with DSSD, ESSD and YOLOv3), which uses different activation maps (multiple scales) for the prediction of classes and bounding boxes, and YOLO (with YOLOv2), which uses a single activation map for the prediction of classes and bounding boxes. Using multiple scales helps achieve a higher mAP (mean average precision) by better detecting objects of different sizes in the image.
Thus the techniques used in this project are YOLO (single activation map) and YOLOv3 (multiple activation maps).

Chapter 4
Approach for Object Detection
The network used in this project is based on YOLO [] and YOLOv3 []

4.1) YOLO – You Only Look Once


4.1.1) Grid cell
YOLO divides the input image into an S×S grid. Each grid cell predicts only one object. For example,
the red grid cell below tries to predict the “dog” object whose center falls inside the grid cell.

Figure 4.1 – Visualizing the YOLO Method
To evaluate on PASCAL VOC, YOLO uses a 7×7 grid (S×S), 2 boundary boxes per cell (B) and 20 classes (C).

4.1.2) How it works:


• Uses features from the entire image to predict each bounding box.
• Predicts all bounding boxes across all classes for an image simultaneously.
• Divides the input image into an S×S grid. If the center of an object falls into a grid cell, that cell is responsible for detecting the object.
• Each grid cell predicts B bounding boxes and confidence scores for those boxes.
• Each grid also predicts C conditional (conditioned on the grid cell containing an object) class
probabilities.

Figure 4.2 – The model. Models detection as a regression problem. It divides the image into an S × S grid and
for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These
predictions are encoded as an S × S × (B ∗ 5 + C) tensor. (From YOLO Research Paper)

4.1.3) Characteristics:
• Tool? -- A single neural network with a unified architecture (24 convolutional layers, 4 max-pooling layers and 2 fully connected layers).
• Framework? -- Darknet; the original implementation is in C and CUDA.
• Technology background? -- Related methods are slow, not real-time, and lack generalization ability.

4.1.4) Network Architecture:

Figure 4.3: The Architecture – the detection network has 24 convolutional layers and 2 fully connected layers, producing an output tensor of size S × S × (B∗5 + C).

4.1.5) Terms and Formulas:


1.0) Confidence scores:
Reflect how confident the model is that the box contains an object and how accurate the box is:

Confidence = Pr(Object) × IoU(pred, truth)

2.0) Conditional class probabilities:

Pr(Class_i | Object), conditioned on the grid cell containing an object.
At test time we multiply the conditional class probabilities and the individual box confidence predictions:

Pr(Class_i | Object) × Pr(Object) × IoU(pred, truth) = Pr(Class_i) × IoU(pred, truth)

3.0) Total Loss Function: Localization Loss + Confidence Loss + Classification Loss

L = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
  + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)² + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
  + Σ_{i=0}^{S²} 1_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²

where 1_i^{obj} denotes if an object appears in cell i and 1_{ij}^{obj} denotes that the jth bounding box predictor in cell i is "responsible" for that prediction.

4.0) Non-Max Suppression:
During prediction, non-maximum suppression is used to filter the multiple boxes per object that may be matched, as shown in Fig. 4.4.

Figure 4.4- Non Max Suppression


• Discard all boxes with pc < 0.6.
• While any boxes remain:
  - Pick the box with the largest pc and output it as a prediction.
  - Discard any remaining box with IoU >= 0.5 with the box output in the previous step.
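A minimal Python sketch of this procedure (assuming detections are (box, score) pairs with corner-coordinate boxes; the thresholds follow the values above):

def iou(a, b):
    # intersection over union of (x1, y1, x2, y2) boxes, as in Section 3.2.1
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, score_thresh=0.6, iou_thresh=0.5):
    """detections: list of (box, score); returns the kept (box, score) pairs."""
    # 1. discard boxes with low confidence
    boxes = [d for d in detections if d[1] >= score_thresh]
    # 2. repeatedly pick the highest-scoring box and suppress its overlaps
    boxes.sort(key=lambda d: d[1], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)
        kept.append(best)
        boxes = [d for d in boxes if iou(best[0], d[0]) < iou_thresh]
    return kept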

Advantages:
• Simpler network structure.
• Much faster, with real-time capability: 45 fps with YOLO and 150 fps with the Fast version; able to process streaming video in real time with less than 25 milliseconds of latency.
• Maintains a reasonable accuracy range.

4.1.6) Experimental Results:


1.0) Dataset:
The dataset includes 10K images from the training set of COCO, with a 9K/1K (train/val) split to make results comparable. It includes 80 "thing" classes, 91 "stuff" classes and 1 class 'unlabeled'. The object detection challenge now runs on the 80 thing classes with 1.5 million object instances. The images were downloaded from Flickr. This dataset is used in the MS-COCO Challenge.

COCO 2017 Object Detection Task

2.0) Implementation Details:


The project is implemented in Python 3, using the model of YOLO pre-trained on the COCO dataset. Keras (about 90% of the code) with the TensorFlow backend (about 10%) was used for fine-tuning the deep network, and OpenCV3 was used for image pre-processing. The system specifications on which the model is trained and evaluated are as follows: CPU - Intel Core i5 (2.5 GHz), 4 GB RAM, AMD 2 GB graphics; we also worked on a GPU - NVIDIA GeForce GTX 1080 Titan Xp (12 GB RAM), driver version 396, with CUDA 9.0, to train and test the model. Libraries used are shown below.
3.0) Pre-processing
The "yolo.cfg" and "yolo.weights" files (from https://pjreddie.com/) are read and converted into a Keras-readable format. OpenCV3 is used for resizing, scaling and data augmentation tasks.
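A minimal sketch of this kind of pre-processing with OpenCV (assuming the 448 × 448 input resolution of the original YOLO; the file name is illustrative):

import cv2
import numpy as np

def preprocess(image_path, size=448):
    """Resize an image to the network input size and scale pixels to [0, 1]."""
    img = cv2.imread(image_path)                  # BGR, uint8
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    # network expects RGB
    img = cv2.resize(img, (size, size))           # network input resolution
    img = img.astype(np.float32) / 255.0          # scale to [0, 1]
    return np.expand_dims(img, axis=0)            # add a batch dimension

batch = preprocess("dog.jpg")  # shape: (1, 448, 448, 3)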

4.0) Network :
The entire network architecture is shown in Fig. 4.6. The model consists of the base network derived from Darknet-19, followed by the modified convolutional layer replacing the last fully connected layer for fine-tuning, and then the classifier and localizer networks.

Figure 4.6- YOLO network

5.0) Qualitative Analysis


The results from the MS- COCO dataset are shown in Table 1

INPUT PREDICTION INPUT PREDICTION

False Detections:
False or varied detection images: Table 2

The results on custom dataset are shown in Table 3.


PREDICTIONS:

a) Fails to predict small objects b) Occlusion c) Low resolution

Problems Observed:
• Larger objects dominate when present along with small objects, as seen in Fig. a.
• Occlusion creates a problem for detection: as shown in Fig. b, the occluded birds are not detected correctly.
• The image resolution must be high, otherwise the bounding box may be displaced from its position, as shown in Fig. c.
To solve the above problems, several unified detection methods followed YOLO:
1.) YOLO9000: Better, Faster, Stronger
2.) SSD – Single Shot MultiBox Detector
3.) DSSD – Deconvolutional Single Shot MultiBox Detector
4.) YOLOv3: An Incremental Improvement

Improvement:
This section describes some major improvements in object detection methods after YOLO, finally adopting YOLOv3 as the base for this project report.

4.2) SSD: Single Shot Multibox Detector:


4.2.1) Characteristics:
• Tool? -- A single deep neural network with a VGG-16 architecture (modern versions use a ResNet-50 architecture).
• Framework? -- Caffe
• Technology background? -- Related methods are structurally complicated and struggle to combine high speed with good accuracy (SSD was the fastest method to date).
4.2.2) Advantages:
• Provides a unified framework.
• Much faster (59 FPS on the VOC2007 test set).
• Better accuracy, even with a smaller input image size (74% to 77% mAP).

4.2.3) Network Architecture:

Figure 4.7: SSD Architecture


The SSD normally starts with a VGG model, which is converted to a fully convolutional network. Then some extra convolutional layers are attached that help to handle bigger objects. The output of the VGG network is a 38x38 feature map (conv4_3). The added layers produce 19x19, 10x10, 5x5, 3x3 and 1x1 feature maps. All these feature maps are used for predicting bounding boxes at various scales.
Anchors/default boxes (a collection of boxes overlaid on the image at different spatial locations, scales and aspect ratios) act as reference points on the ground-truth images. A model is trained to make two predictions for each anchor/default box:
➢ A discrete class.
➢ A continuous offset by which the anchor needs to be shifted to fit the ground-truth bounding box.

4.2.4) Working:
Consider the case shown in Fig. 4.8, where the cat has two anchors matched and the dog has one anchor matched. Note that both have been matched on different feature maps.

Figure 4.8 – Working overview
During training, SSD:
➢ Needs an input image and ground-truth boxes for each object.
➢ Evaluates a small set (e.g., 4) of default boxes of different aspect ratios at each location in several feature maps of different scales.
➢ Matches these default boxes to the ground-truth boxes (using the IoU method, say for IoU > 0.5) and predicts both the shape offsets and the confidences for all object categories for each default box.
➢ The feed-forward convolutional network produces a fixed-size collection of bounding boxes and scores for the presence of each object class in those boxes.
➢ During prediction, non-maximum suppression is used to filter the multiple boxes per object that may be matched.

4.2.5) Quantity Choice:


• Different aspect ratios are imposed for the default boxes, denoted as:

a_r ∈ {1, 2, 3, 1/2, 1/3}

• Instead of using all the negative examples, SSD sorts them by the highest confidence loss for each default box and picks the top ones, so that the ratio between negatives and positives is at most 3:1, leading to faster optimization and more stable training (a sketch of this step follows the loss function below).
• Total Loss Function: the loss used is a weighted sum of the multi-box classification (confidence) loss and the regression (localization) loss. The classification loss is the softmax cross-entropy, and the smooth L1 loss is used for regression:

L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))

where N is the number of matched default boxes.
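A minimal NumPy sketch of the hard-negative mining step above (an illustrative helper, assuming per-box confidence losses and a boolean mask of positive matches have already been computed):

import numpy as np

def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    """Keep all positive boxes and only the hardest negatives (at most 3:1)."""
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    # rank negative boxes by confidence loss, hardest first
    neg_loss = np.where(is_positive, -np.inf, conf_loss)
    hardest_negs = np.argsort(-neg_loss)[:num_neg]
    positives = np.where(is_positive)[0]
    return np.concatenate([positives, hardest_negs])  # indices of boxes to keep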

After the success of SSD, many researchers adopted this architecture and tried to improve it, since SSD is unable to accurately detect small objects and handle occlusion in an image.

Some modified SSD based object detectors with their Architectures are:
A.) DSSD: Deconvolutional Single Shot Detector.

SSD Network

Figure 4.9: DSSD Architecture

B. ) ESSD: Extend the shallow part of Single Shot MultiBox Detector via CNN.

Fig. ESSD Framework

Fig. Extension Module


Figure 4.10: ESSD Architecture
4.3) YOLOv3: An Incremental Improvement:
You Only Look Once, or YOLO, is one of the fastest object detection algorithms available. Though it is no longer the most accurate object detection algorithm, it is a very good choice when you need real-time detection without losing too much accuracy.
Some months later the second version was released under the official title "YOLO9000: Better, Faster, Stronger". For its time, YOLO9000 was the fastest, and also one of the most accurate, algorithms. A couple of years down the line, however, it is no longer the most accurate, with algorithms like RetinaNet and SSD outperforming it in accuracy; it remains one of the fastest.
That speed has been traded for gains in accuracy in YOLO v3. While the earlier variant ran at 45 FPS on a Titan X, the current version clocks about 30 FPS. This has to do with the increased complexity of the underlying architecture, called Darknet-53.

So what’s New with YOLOv3?


4.3.1) Darknet-53
First, YOLO v3 uses a variant of Darknet which originally has a 53-layer network trained on ImageNet. For the task of detection, 53 more layers are stacked onto it, giving a 106-layer fully convolutional underlying architecture for YOLO v3. This is the reason behind the slowness of YOLO v3 compared to YOLO v2.

Figure: 4.11- Darknet53 (From “What’s new with YOLOv3”in towardsdatascience.com by Ayoosh Kathuria.)

4.3.3) Characteristics:
• An end-to-end fully convolutional neural network (FCN).
• YOLO v3 makes use of only convolutional layers. It has 75 convolutional layers, with skip connections and upsampling layers. No form of pooling is used; a convolutional layer with stride 2 downsamples the feature maps. This helps prevent the loss of low-level features often attributed to pooling.
• In YOLO v3, the prediction is done by a convolutional layer (it is a fully convolutional network, after all) with a kernel size of 1 x 1 x (B x (5 + C)).

4.3.4) Working :
Consider an example where the input image is 416 x 416 and the stride is 32. As noted earlier, the dimensions of the feature map will then be 13 x 13, so the input image is divided into 13 x 13 cells.

Each bounding box (BB) predicts (5 + C) attributes:
• C = the number of classes in the dataset (e.g., C = 80 in COCO: p1, p2, ..., p80).
• 5 = the center coordinates, the dimensions, and the objectness score.
In YOLO v3 trained on COCO, B = 3 and C = 80, so the kernel size is 1 x 1 x 255. The feature map produced by this kernel has the same height and width as the previous feature map, and has the detection attributes along the depth as described above.

Figure 4.12: Working Overview
Note: The cell (on the input image) containing the center of the ground-truth box of an object is chosen to be the one responsible for predicting that object (here, the red cell, the 7th cell in the 7th row of the grid, is responsible for detecting the dog).
The cell predicts 3 bounding boxes, so which one should be assigned to the object's (dog) ground-truth label? To answer this, we look at the concept of anchor boxes.
1.) Anchor Boxes:
Most modern object detectors predict log-space transforms (rather than the height and width of the bounding box directly), or simply offsets to pre-defined default bounding boxes called anchors. YOLOv3 uses 3 anchor boxes per scale, which results in the prediction of 3 bounding boxes per cell. The bounding box responsible for detecting the object (the dog) is the one whose anchor has the highest IoU with the ground-truth box.

2.) Dimensions of the Bounding Box:
The network outputs t_x, t_y, t_w, t_h, which are converted into the predicted box as:

b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w · e^{t_w},  b_h = p_h · e^{t_h}

Figure 4.13: Dimensions

b_x, b_y, b_w, b_h are the x, y center coordinates, width and height of our prediction; t_x, t_y, t_w, t_h are what the network outputs; c_x and c_y are the top-left coordinates of the grid cell; p_w and p_h are the anchor dimensions for the box.
3.) Objectness Score (pc): The objectness score represents the probability that an object is contained inside a bounding box. It should be nearly 1 for the red cell and its neighbouring grid cells, and almost 0 for, say, the cells at the corners in Figure 4.12. The objectness score is also passed through a sigmoid, as it is to be interpreted as a probability.
4.) Center Coordinates (bx , by):
Notice we are running our center coordinates prediction through a sigmoid function. This forces the
value of the output to be between 0 and 1.
5.) Dimensions of bounding box (bw , bh):
The dimensions of the bounding box are predicted by applying a log-space transform to the output and
then multiplying with an anchor.
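Taken together, points 2) to 5) amount to the following decoding step; a minimal Python sketch (the raw outputs, the cell index and the 116 x 90 anchor are illustrative values):

import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Decode raw network outputs t* into a box center and size in pixels."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = (sigmoid(tx) + cx) * stride  # center x: sigmoid keeps the offset inside the cell
    by = (sigmoid(ty) + cy) * stride  # center y
    bw = pw * math.exp(tw)            # anchor width scaled by exp(tw)
    bh = ph * math.exp(th)            # anchor height scaled by exp(th)
    return bx, by, bw, bh

# e.g. cell (6, 6) of the 13 x 13 map (stride 32) with a 116 x 90 anchor:
print(decode_box(0.2, -0.1, 0.3, 0.1, 6, 6, 116, 90, 32))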
6.) Class Predictions (c1, c2, ...):
Each box predicts the classes the bounding box may contain using multilabel classification (a sigmoid activation and binary cross-entropy loss are used).
7.) Predictions across different scales:
YOLO v3 makes predictions across 3 different scales. The detection layer makes detections on feature maps of three different sizes, with strides 32, 16 and 8 respectively. This means that, with an input of 416 x 416, we make detections on grids of 13 x 13, 26 x 26 and 52 x 52.
At each scale, each cell predicts 3 bounding boxes using 3 anchors, making the total number of anchors used 9 (the anchors are different for different scales).
Output Processing:
For an image of size 416 x 416, YOLO predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10,647 bounding boxes. However, in the case of our image, there is only one object, a dog.
How do we reduce the detections from 10,647 to 1?
1.) Thresholding by object confidence score, and 2.) non-maximum suppression.

4.3.5) Experimental Results:

1.0) Dataset: COCO 2017 Object Detection Dataset, with 80 classes

2.0) Implementation Details:


The project is implemented in Python 3, using the model of YOLOv3 pre-trained on the COCO dataset. The PyTorch library was used for fine-tuning the deep network and OpenCV3 was used for image pre-processing. Libraries used are shown below.

The system specifications on which the model is trained and evaluated are as follows: CPU - Intel Core i5 (2.5 GHz), 4 GB RAM, AMD 2 GB graphics; we also worked on a GPU - NVIDIA GeForce GTX 1080 Titan Xp (12 GB RAM), driver version 396, with CUDA 9.0, to train and test the model.
3.0) Pre-processing:
The "yolov3.cfg" and "yolov3.weights" files (from https://pjreddie.com/) are read and converted into a Python (PyTorch) readable format. OpenCV3 is used for resizing, scaling and data augmentation tasks.

4.0) Network :
The entire network architecture is shown in Fig. 4.11 above. The model consists of the base network derived from Darknet-53, followed by the modified convolutional (1×1×nc) layers used in place of fully connected layers for fine-tuning, and then the classifier and localizer networks.

5.0) Qualitative Analysis


Images that YOLO is not able to detect correctly (YOLO prediction vs. YOLOv3 prediction):

Images with some errors such as occlusion and small/far object detection.

6.0) Quantitative Analysis
The evaluation metric used is mean average precision (mAP). For a given class, a precision-recall curve is computed. Recall is defined as the proportion of all positive examples ranked above a given rank; precision is the proportion of all examples above that rank which are from the positive class. The AP summarizes the shape of the precision-recall curve, and is defined as the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]. Thus, to obtain a high score, high precision is required at all levels of recall. This measure is better than area under the curve (AUC) because it gives importance to the sensitivity. Detections were assigned to ground-truth objects and judged to be true/false positives by measuring bounding box overlap. To be considered a correct detection, the area of overlap between the predicted bounding box and the ground-truth bounding box must exceed a threshold. Detections assigned to ground-truth objects satisfying the overlap criterion were ranked in order of (decreasing) confidence. Multiple detections of the same object in an image were considered false detections, i.e. 5 detections of a single object count as 1 true positive and 4 false positives. If no prediction is made for an image, it is counted as a false negative.
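A minimal NumPy sketch of the 11-point interpolated AP described above (assuming parallel arrays of precision and recall computed from the ranked detections):

import numpy as np

def ap_11_point(recalls, precisions):
    """11-point interpolated average precision (PASCAL VOC style)."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):  # recall levels 0, 0.1, ..., 1
        mask = recalls >= r
        # interpolated precision: best precision at recall >= r (0 if none)
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap

# toy example: precision drops as recall grows
print(ap_11_point(np.array([0.3, 0.6, 1.0]), np.array([1.0, 0.8, 0.6])))  # 0.8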

Figure 4.14: mAP vs inference time curve


(From official YOLOv3 paper)

4.4) Conclusion
An accurate and efficient object detection system has been studied and developed, achieving metrics comparable with the existing state of the art. This project uses recent techniques from the fields of computer vision and deep learning. A custom dataset was created by labelling, and the evaluation was consistent. The system can be used in real-time applications which require object detection for pre-processing in their pipeline. An important extension would be to train the system on video sequences for use in tracking applications: adding a temporally consistent network would enable smooth detection, more optimal than per-frame detection. We will also see some visual trackers in the next section of this report.

Chapter 5
Approach for Visual Tracking
5.1) Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many tasks that need historical information, such as language processing or video processing. The main idea behind RNNs is to use sequential information. In other neural networks, all inputs and outputs are independent of each other, but for many tasks this does not work well: for example, to predict the next word in a sentence, it is better to know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. We can also say that RNNs have a "memory" which captures information about what has been calculated so far. In theory, RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps. Here is what a typical RNN looks like:

Figure 5.1: A recurrent neural network and the unfolding in time of the computation involved in its forward pass

By unrolling we simply mean that we write out the network for the full sequence. For example, if we want to use 5 frames of a video, the network would be unrolled into a 5-layer neural network, one layer per frame. More details about the parameters in the figure and the formulas of the RNN are as follows: x_t is the input at time step t; for example, x_1 could be a vector corresponding to the second frame of a video. s_t is the hidden state at time step t, the "memory" of the network, calculated from the previous hidden state and the input at the current step:

s_t = f(U x_t + W s_{t−1})

The function f is usually a nonlinearity such as tanh or ReLU. s_{−1}, which is required to calculate the first hidden state, is usually initialized to zero. o_t is the output at step t; for example, if we wanted to predict the position of a human at the next time step in a video, it would be a vector of probabilities:

o_t = softmax(V s_t)
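These two formulas unroll into a simple loop over the sequence; a minimal NumPy sketch (all dimensions are illustrative assumptions):

import numpy as np

def rnn_forward(xs, U, W, V):
    """Vanilla RNN: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t)."""
    s = np.zeros(W.shape[0])  # s_{-1} initialized to zero
    outputs = []
    for x in xs:              # one step per element of the sequence
        s = np.tanh(U @ x + W @ s)
        logits = V @ s
        outputs.append(np.exp(logits) / np.exp(logits).sum())  # softmax
    return outputs

# toy sizes: 4-dim inputs, 8-dim hidden state, 3-dim output, 5 time steps
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
print(rnn_forward([rng.normal(size=4) for _ in range(5)], U, W, V)[-1])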
The most commonly used type of RNN is the LSTM, which is much better at capturing long-term dependencies than vanilla RNNs. The LSTM network is explained in the next section.

5.2) Long Short-Term Memory (LSTM) Network:


The LSTM is a special kind of recurrent neural network which works much better than the traditional version for many tasks [34]. One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task; for example, previous video frames might inform the understanding of the present frame or the prediction of a future frame. But a plain RNN does not have a long enough memory to make perfect predictions. Sometimes we only need to look at recent information to perform the present task, but sometimes we need long-term memory for our prediction. Long Short-Term Memory networks (LSTMs) are a special kind of RNN, capable of learning long-term dependencies; they were introduced by Hochreiter & Schmidhuber (1997). In a standard recurrent neural network, during gradient back-propagation, the gradient signal is multiplied by the weight matrix of the recurrent hidden layer a large number of times (once per timestep). This is why the magnitude of the weights in the transition matrix matters. If the weights in this matrix are small (smaller than 1.0), they cause vanishing gradients: the gradients become so small that learning becomes very slow or stops working, making the task of learning long-term dependencies in the data impossible. The vanishing gradient problem is illustrated in Figure 5.2. On the other hand, if the weights in this matrix are large (larger than 1.0), they can cause exploding gradients: the gradients become so large that learning diverges.

Figure 5.2: The vanishing gradient problem for RNNs. The shading of the nodes in the unfolded network shows their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity). The sensitivity decreases over time as new inputs overwrite the activations of the hidden layer, and the network 'forgets' the first inputs. (From Alex Graves, 2012)

These problems of RNNs are the main motivation for designing the LSTM model, which has a memory cell, shown in Figure 5.3. A memory cell has four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. The weight of the self-recurrent connection is 1.0, ensuring that the state of a memory cell can remain unchanged across timesteps. The input gate can allow an incoming signal to change the state of the memory cell or block it. The output gate can allow the state of the memory cell to affect other neurons or prevent it. Finally, the forget gate can let the cell remember or forget its previous state, as needed.

Figure 5.3: LSTM memory cell,

(From Kyunghyun Cho Pierre Luc Carrier, 2017)

Gradient information is preserved by the LSTM, as illustrated in Figure 5.4. As in Figure 5.2, the shading of the nodes shows their sensitivity to the inputs at time one; in the LSTM, the black nodes are maximally sensitive and the white nodes are completely insensitive. The input, forget, and output gates are drawn below, to the left of, and above the hidden layer respectively. All gates are either entirely open ('O') or closed ('—'). The memory cell 'remembers' the first input as long as the forget gate is open and the input gate is closed. The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.

Figure 5.4: Preservation of gradient information by LSTM. (From Alex Graves, 2012)

All recurrent neural networks have the form of a chain of repeating modules of neural network. In
traditional RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

Figure 5.5: The repeating module in a standard RNN contains a single layer (From Cristopher Olah)
LSTMs also have a similar chain structure, but the repeating module is a bit different: instead of a single neural network layer, there are four, interacting in a very special way.

Figure 5.6: The repeating module in an LSTM contains four interacting layers (From Cristopher Olah)
The input gate i_t, forget gate f_t, output gate o_t, cell state c_t and final state h_t are defined as follows:

i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t−1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)

The main difference of the LSTM from classical RNNs is the use of these gating functions i_t, f_t, o_t, explained previously, which denote the input, forget and output gates at time t respectively. The weight parameters W_xi, W_hi, W_hf, W_ho, W_xf, W_hc, W_xo and W_xc connect the different inputs and gates with the memory cells and outputs, together with the biases b_i, b_f, b_c and b_o. The cell state c_t is updated with a fraction of the previous cell state c_{t−1} that is controlled by f_t.
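A minimal NumPy sketch of one LSTM time step following these equations (the weight shapes are illustrative; p is a dictionary holding the matrices and biases named above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p holds the W** matrices and b* biases."""
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["bi"])  # input gate
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["bf"])  # forget gate
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["bo"])  # output gate
    c_new = np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_new  # forget a fraction of the old state, add the new
    h = o * np.tanh(c)          # hidden state gated by the output gate
    return h, c

# toy step: 4-dim input, 3-dim hidden state
rng = np.random.default_rng(1)
p = {k: rng.normal(size=(3, 4)) for k in ("Wxi", "Wxf", "Wxo", "Wxc")}
p.update({k: rng.normal(size=(3, 3)) for k in ("Whi", "Whf", "Who", "Whc")})
p.update({k: rng.normal(size=3) for k in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), p)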
5.3) Visual Tracking:
Visual object tracking is the process of localizing a single target in a video or sequence of images, given the target position in the first frame. Visual tracking is a challenging task in computer vision due to target deformations, illumination variations, scale changes, fast and abrupt motion, partial occlusions, motion blur and background clutter. The methods of visual tracking can be divided into 3 categories: 1.) fast tracking, 2.) robust tracking, 3.) fast & robust tracking.

5.3.1) Types of Trackers


1.0) Fast Trackers:
• “Visual tracking with Simple convolutional neural network”
• “Learning to Track at 100 FPS with Deep Regression Networks”
2.0) Robust Tracker:
• “Learning Multi-Domain Convolutional Neural Networks for Visual Tracking-MDNet”
• “STCT: Sequentially Training Convolutional Networks for Visual Tracking”
3.0) Robust and Fast Trackers:
• “Using Recurrent Convolutional Neural Networks for Visual Object Tracking-ROLO”

5.3.2) CNN Based Tracker – MDNet:


Multi-Domain Convolutional Neural Networks (MDNet) is a tracking algorithm based on a discriminatively trained Convolutional Neural Network (CNN). A large set of videos with tracking ground truths is used to pre-train the CNN and obtain a generic target representation. The network has shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch performs binary classification to find the target in its domain. The network is trained iteratively over the domains to obtain generic target representations in the shared layers. When tracking a target in a new sequence, a new network is built by combining the shared layers of the pre-trained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating candidate windows randomly sampled around the previous target state. This method produces very robust tracking and was the winner of the VOT2015 Challenge. The structure of this method is shown in Fig. 5.7.

Figure 5.7: Architecture of MDNet. (From Hyeonseob Nam and Bohyung Han, 2016)

The original implementation of MDNet is in MATLAB 2014 (with the MatConvNet library) with GPU support. In order to check the robustness of the tracker, I made small changes to the code, with the help of PhD scholar Mr. Mohan Murali, for the research of former MS student Ms. Pallavi Venagopal under Prof. R. K. Gorthi, and tested video samples from the VOT2017, OTB100 and ALOV300++ datasets, visualizing the errors and drawbacks of the tracker.
CNN-based trackers are fast enough, but not robust enough in challenging environments such as motion blur. In order to combine a robust and a fast tracker in one model, we used an approach that combines the features of a CNN with an RNN to track a video sequence; we call this a recurrent-convolutional neural network based object tracker.
Details of our approach are given below (taken from https://arxiv.org/pdf/1607.05781.pdf).
5.3.3) General overview of the proposed method (CNN-RNN based visual tracker):
YOLO + LSTM = ROLO (Recurrent YOLO)
(https://arxiv.org/pdf/1607.05781.pdf)

YOLO Detector (or any CNN)

Figure 5.8: Structure of YOLO detection + LSTM

The proposed model contains a deep neural network whose input is raw video frames and which returns the coordinates of a bounding box for the object being tracked in each frame. Tracking is formulated as estimating the probability

P(B_t | X_{≤t}, B_{<t})

where B_t and X_t are the location of the human and the input frame, respectively, at time t; X_{<t} is the history of input frames and B_{<t} is the history of previous locations of the human before time t.

1.0) Why YOLO?

• YOLO collects rich and robust visual features, as well as preliminary location inferences; and
• the LSTM is used in the next stage, as it is spatially deep and appropriate for sequence processing.

2.0) Problem with object detection:

Many state-of-the-art object detectors, such as YOLO, detect multiple objects in each frame. As we do single object tracking, we need one bounding box in each frame, so we used two methods for selecting one bounding box per frame:
• In the first method, we simply select the bounding box with the largest confidence.
• In the second method, we define a cost matrix computed as the intersection-over-union (IoU) distance between the current detections and the previous tracking result.
For the detection in the first frame, we make the decision based on the IoU distance between the detection boxes and the ground truth. A minimum IoU (IOU_min) is also defined, to reject detections whose IoU is less than IOU_min.
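A minimal Python sketch of the second method (reusing an IoU helper like the one in Section 3.2.1; the IOU_min default is an illustrative value):

def iou(a, b):
    # intersection over union of (x1, y1, x2, y2) boxes, as in Section 3.2.1
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def select_detection(detections, prev_box, iou_min=0.3):
    """Pick the detection with the highest IoU against the previous result."""
    best, best_iou = None, iou_min
    for box in detections:
        overlap = iou(box, prev_box)
        if overlap >= best_iou:
            best, best_iou = box, overlap
    return best  # None if every detection falls below IOU_min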

3.0) Training of ROLO Model:


There are three phases for the end-to-end training of the ROLO model:
1. The pre-training phase of convolutional layers for feature learning,
2. The traditional YOLO training phase for object proposal,
3. The LSTM (tracking Module) training phase for object tracking.

4.0) Network Training of the Tracking Module:


The LSTM RNN is used to train the tracking module. In this module, the LSTM has two streams of data: the feature vector from the convolutional layers (Darknet) and the detection information B_{t,i} from the detection block. At each time step t, a feature vector X_t is extracted; X_t, B_{t,i} and the output states from the last time step, S_{t−1}, are the inputs of the LSTM network. Mean Squared Error (MSE) is used for training:

L_MSE = (1/n) Σ_{i=1}^{n} ‖B_prediction − B_target‖²

where n indicates the number of training samples in a batch, B_prediction is the tracking prediction and B_target is the ground-truth value. The Adam method is used for stochastic optimization.
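A minimal PyTorch sketch of such a training step (the feature dimension, hidden size and batch shapes are illustrative assumptions, not the authors' exact implementation):

import torch
import torch.nn as nn

class TrackerLSTM(nn.Module):
    """LSTM regressor: visual features plus detected box in, predicted box out."""
    def __init__(self, feat_dim=4096, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + 4, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 4)  # regress (x, y, w, h)

    def forward(self, feats, boxes):
        x = torch.cat([feats, boxes], dim=-1)  # concatenate X_t and B_t
        out, _ = self.lstm(x)
        return self.head(out)

model = TrackerLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, as in the paper
mse = nn.MSELoss()

# one training step on a toy batch: 2 sequences of 6 frames each
feats = torch.randn(2, 6, 4096)   # CNN feature vectors X_t
boxes = torch.randn(2, 6, 4)      # detected boxes B_t
target = torch.randn(2, 6, 4)     # ground-truth boxes B_target
loss = mse(model(feats, boxes), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()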

5.0) Dataset: OTB30
OTB is one of the most commonly used tracking datasets. Each video is annotated with one or more attributes:
• IV: Illumination Variation
• SV: Scale Variation
• OCC: Occlusion
• DEF: Deformation
• MB: Motion Blur
• FM: Fast Motion
• IPR: In-plane Rotation
• OPR: Out-of-Plane Rotation
• OV: Out-of-View
• BC: Background Clutters
• LR: Low Resolution


6.0) ROLO Results and Comparison with YOLO:
Blue box – YOLO , Yellow –ROLO , Red – GROUND TRUTH

Chapter 6
6.1) Applications
A well-known application of object detection is face detection that is used in almost all the mobile
cameras. A more generalized (multi-class) application can be used in autonomous driving where a
variety of objects need to be detected. Also it has a important role to play in surveillance systems. These
systems can be integrated with other tasks such as pose estimation where the first stage in the pipeline
is to detect the object, and then the second stage will be to estimate pose in the detected region. It can be
used for tracking objects and thus can be used in robotics and medical applications. Thus this problem
serves a multitude of applications.

(a) Surveillance (b) Autonomous vehicles

6.2) Conclusion:
During the internship, I learnt many new things by doing intensive research on deep learning, mainly in the field of object detection and tracking, and by finding and implementing effective, accurate and efficient state-of-the-art CNN and RNN based object detectors and trackers. The internship taught me how to approach current research problems and contribute something new to the community of computer vision and deep learning.
I faced many challenges while implementing the state-of-the-art visual tracker, such as learning new libraries for efficient and fast object tracking. Different approaches were tried for different tasks during the development and implementation of the proposed algorithms, as discussed in the previous chapters.
On completing the internship, I understood that the community of computer vision and deep learning is vast and open source, receiving contributions day to day from many parts of the world. I can say that I have satisfactorily completed the internship at IIT Tirupati; but there is still much to work on in this field, and I want to be a part of this large, developing field.

6.3) Future Work:


We plan experiments on various detectors and trackers, trying to improve their performance by embedding new methods such as rotation invariance and scale invariance on different datasets, since present detectors are not able to perform well on rotated images. I will be working on this with my co-intern Mr. Dheeraj Varma, under the guidance of Dr. Rama Krishna Sai Gorthi.

CERTIFICATE

This is to certify that this internship report entitled “Study and Implementation of
Object Detection and Visual Tracking” submitted to Indian Institute of Technology,
Tirupati, is a record of work done by “Bharat Giddwani” under my supervision from
___ May 2018 to __ June 2018.

Dr. Rama Krishna Sai Gorthi,


Associate Professor,
IIT Tirupati.

Place:
Date:

