Master Thesis
By
Shah Zeb
CUI/FA19-RCE-008/ISB
MS Thesis
In
Computer Engineering
A Thesis Presented to
In partial fulfillment
MS in Computer Engineering
By
Shah Zeb
FA19-RCE-008
Unsupervised Learning of Depth and Ego-Motion
from Video
A Post Graduate Thesis submitted to the Department of Electrical and
Computer Engineering as partial fulfillment for the award of a Degree M.S. in
Computer Engineering.
Supervisor
Assistant Professor
Islamabad Campus
January 2023
Final Approval
Shah Zeb
FA19-RCE-008
Supervisor: ________________________________________________
HOD: ____________________________________________________
Declaration
I, Shah Zeb, Registration# FA19-RCE-008, hereby declare that I have produced the
work presented in this thesis during the scheduled study period. I also declare that I
have not taken any material from any source except where due reference is made, and
that the amount of plagiarism is within an acceptable range. If a violation of HEC rules
on research has occurred in this thesis, I shall be liable to punishable action under the
plagiarism rules of the HEC.
Signature of Student
Certificate
It is certified that Shah Zeb, Registration# FA19-RCE-008, has carried out all the
work related to this thesis under my supervision at the Department of Electrical and
Computer Engineering, COMSATS University Islamabad, Islamabad Campus, and
the work fulfills the requirement for the award of MS degree.
__________________________________
Dr. Omar Ahmad
Assistant Professor
Head of Department
_________________________________
Dr. Shurjeel Wyne
Department of Electrical and Computer
Engineering
DEDICATION
ACKNOWLEDGEMENTS
First, I would like to thank Allah SWT for the countless blessings He has showered on me
throughout my life. He has always given me the best opportunities regardless of my
weakness. I pray to Allah SWT to allow me to be His humble servant and bless my
family and me with steadfastness in His religion. Then, I express my heartiest
gratitude towards my supervisor Dr. Omar Ahmad. This work would not have been
possible without his guidance and support. His strong command of my area of
research and extraordinary problem-solving skills are the key factors in completing
this thesis. I will never forget his kind behavior while conveying technical arguments
about the research topic. Indeed it was an honor to work with such a friendly,
thorough, and dedicated professional. I am grateful to Dr. Usman Qayyum from the
National Electronics Complex (NECOP), who introduced me to my area of research,
supported me, and provided me with the facilities to carry out my research work. This
work would not have been possible without him. I am thankful to all my course instructors at
COMSATS University Islamabad for developing my knowledge base in Computer
Engineering during the course work that helped me choose the area of research for my
Master's. I am also thankful to Dr. Haroon and Dr. Sufwan for their valuable
suggestions. I would also like to thank my family, including my father, mother,
brother, and sister, for their unwavering support in continuing my studies. They were
always there to help me with all their abilities. Special thanks to my father for his
encouragement and moral support. He pushed me beyond my limits from the start of
this thesis till this point.
ABSTRACT
TABLE OF CONTENTS
1. Introduction ............................................................................................ 1
2. Background ............................................................................................. 7
3.1.2 Fully Convolutional Residual Networks, 2016 ............................. 38
4. Methodology.......................................................................................... 47
4.6.1 Splitting KITTI Dataset for Training, Validation & Testing ........ 59
4.7 Error Metrics for Quantitative Analysis .............................................. 60
5. Results.................................................................................................... 62
7. Reference ............................................................................................... 70
LIST OF FIGURES
Figure 1.1: Elements and their pixel values of a Digital Image ................................................. 2
Figure 1.2: Depth Perceived by Humans ................................................................................... 2
Figure 1.3: Stereo Vision System .............................................................................................. 3
Figure 1.4: Example of Ego-Motion .......................................................................................... 4
Figure 2.1: A Digital Image ....................................................................................................... 8
Figure 2.2: Image obtained on Cathode Ray Tube (CRT) .......................................... 9
Figure 2.3: Digital Image Acquisition Process .......................................................................... 9
Figure 2.4: Sequence of frames from time t-1 to t+2 ............................................................... 10
Figure 2.5: Applications of VO ............................................................................................... 11
Figure 2.6: Occluded Points ..................................................................................................... 12
Figure 2.7: Stereo Camera Model ............................................................................................ 12
Figure 2.8: Stereo Camera Model in Bird’s Eye View ............................................................ 13
Figure 2.9: Brute force solution by searching the whole image for each pixel........................ 15
Figure 2.10: Movement of 3D point along the epipolar line.................................................... 15
Figure 2.11: Movement of 3D point along the epipolar line.................................................... 16
Figure 2.12: Properties of Input Image .................................................................................... 17
Figure 2.13: Each channel applied to the respective channel of an RGB input volume .......... 17
Figure 2.14: Process of cross-correlation................................................................................. 18
Figure 2.15: Applying two filters on an RGB input volume ................................................... 19
Figure 2.16: Applying max pool function on a 4x4 matrix...................................................... 20
Figure 2.17: Deconvolving a 3 x 3 feature map ....................................................................... 21
Figure 2.18: General Encoder-Decoder for depth estimation .................................................. 22
Figure 2.19: General supervised framework for depth estimation ........................................... 23
Figure 2.20: General unsupervised framework for depth estimation using stereo image pairs23
Figure 2.21: General unsupervised framework for depth and pose estimation ........................ 24
Figure 2.22: Types of Ego-Motion .......................................................................................... 25
Figure 2.23: Pose of a Car ....................................................................................... 26
Figure 2.24: General process of Visual Odometry .................................................................. 26
Figure 2.25: General process of Visual Odometry .................................................................. 27
Figure 2.26: Example of image stitching technique ................................................................ 27
Figure 2.27: Interesting features .............................................................................................. 28
Figure 2.28: Illustration of feature description ........................................................................ 29
Figure 2.29: Computing the descriptor .................................................................................... 30
Figure 2.30: Matching feature from image 1 to image 2 ......................................................... 31
Figure 2.31: Sequences of frames concatenated with each other............................................. 32
Figure 2.32: Projection of 3D to 2D using camera motion ...................................................... 33
Figure 2.33: General Supervised Learning Pose Estimation Architecture .............................. 34
Figure 3.1: Network architecture of coarse and fine network .................................................. 37
Figure 3.2: Up-Projection Block .............................................................................................. 38
Figure 3.3: ResNet Network architecture with up sampling block .......................................... 39
Figure 3.4: Convolution Block In Residual Network .............................................................. 39
Figure 3.5: Fast Up-projection Block ...................................................................................... 40
Figure 3.6: ResNet Network architecture with upsampling block ......................... 41
Figure 3.7: PoseNet Architecture .............................................................................. 41
Figure 3.8: Basic architecture of DeepStereo .......................................................................... 42
Figure 3.9: Base architecture of depth estimation without LR ................................................ 43
Figure 3.10: Main architecture for depth estimation for LR .................................................... 43
Figure 3.11: SfM-Net architecture ........................................................................................... 44
Figure 3.12: Overview of supervision pipeline ....................................................... 46
Figure 4.1: Our modified network ........................................................................................... 48
Figure 4.2: Disparity Network architecture ............................................................................. 49
Figure 4.3: The AdaBins Architecture ..................................................................................... 50
Figure 4.4: Residual Block ...................................................................................................... 51
Figure 4.5: Pose Network ........................................................................................................ 55
Figure 4.6: View Synthesis Block ........................................................................................... 55
Figure 4.7: A stereo rig and a Velodyne laser scanner mounted on a Volkswagen ................. 59
Figure 4.8: Example of an input image from KITTI dataset.................................................... 59
Figure 5.1: Depth Prediction on The Validation Set ................................................................ 63
Figure 5.2: Difference between the target image and synthesized image ................................ 63
Figure 5.3: Results on test set .................................................................................................. 64
Figure 5.4: Comparison of our methods with the previous methods ....................................... 65
Figure 5.5: Trajectory Plot of Sequence 9 and 10.................................................................... 65
LIST OF TABLES
Chapter 1
Introduction
A digital image is a two-dimensional function f(x, y) having a finite set of elements,
where (x, y) are the coordinates that give a particular position to an element in an
image. The elements in a digital image are referred to as pixels. Each pixel has an
amplitude called intensity, and the intensity of a pixel is a scalar value. In the case of
a grayscale image, as shown in figure 1.1, the intensity of a pixel can be between 0
and 255 [1].
A sequence of still images is used to create a video by displaying the images in rapid
succession such that it gives the perception of motion. The frequency at which the
images are displayed is called the frame rate. Different frame rates impact how we
perceive motion and the speed of motion which appears on a screen [2].
Humans can estimate the distance of the object they can see and figure out the 3D
structure of the objects in a scene by combining the information obtained using their
eyes very efficiently. On a computing platform, the task of estimating the distance
from the object is known as depth estimation, i.e., how far or near objects are from the
observer. Depth can be estimated from multiple views or a single view.
Our eyes can capture a point from two views, i.e., from the left eye and the right eye,
which deduces valuable information from the scene quickly and efficiently, allowing
us to perform feats such as avoiding a collision or catching a ball. This approach of
estimating the depth using multiple views is known as stereopsis which is solved
using stereo vision cameras (multiple view cameras) [3].
Such simple tasks that humans efficiently perform are difficult for a computer to
perform using a video. These abilities that look simple at first glance need to be
designed with great diligence for the computer to achieve. Many researchers have
submitted proposals to address this problem. They have shown remarkable results,
though these methods require extensive computational resources, and delays can be
noted in the output. Hence, monocular vision cameras (single-view cameras) were then
adopted for depth estimation.
The monocular depth estimation task was challenging to achieve using the same
method as stereo. To solve this, a learning-based method was adopted. Broadly, the
machine learning methods relevant here are the supervised learning method, which
requires ground-truth data for the model to learn from, and the unsupervised learning
method, in which the model is its own teacher [4].
Monocular depth estimation using supervised learning was first solved by Saxena et
al. [5], using Markov Random Fields (MRFs) by training the model on Make3D data.
The problem with this was that very little data was used. Little attention was paid to
the problem until 2014, when Eigen et al. [6] used Convolutional Neural
Networks to estimate the depth, which will be explained in detail in chapter 3. The
network was trained on the KITTI dataset [7], and since then, deep convolutional
neural networks have gathered great attention.
This type of learning method required ground-truth data, which needed expensive
labor and equipment and cost much time. To train a network without using
ground truth, Zhou et al. [8] estimated the depth from a single image using an
unsupervised method which requires no ground truth at all. For learning to be carried
out, a new view was synthesized using image-warping techniques.
We can use depth information to navigate through the environment, avoid collisions,
and know our own and others' positions in a scene through a process called Ego-Motion
estimation: being aware of where you are and what everything around you is doing.
Navigation has been possible using sensors such as the Global Positioning Systems
(GPS) from which the Ego-Motion of an object can easily be estimated without any
known data [9]. Sometimes the GPS might fail to work when the signals are jammed
or an object enters an environment such as underwater or space where the GPS signal
starts to attenuate. At such times, a vision system can be used in places where GPS
signals do not work [10].
A vision-based navigation system and visual odometry (VO), which is a type of Ego-
Motion, are some alternatives to this. VO uses a camera sensor to estimate the Ego-
Motion of an object from one frame to another frame of a video sequence. Traditional
indirect methods extracted features and matched them between image frames. Then
image geometry and perspective changes were used to estimate the motion. Direct VO
methods would use photometric consistency error in the pixel intensity values to
determine the camera motion [10].
Recently Convolutional Neural Networks, just as for depth estimation, have been used
to estimate the Ego-Motion of objects and have shown relatively good results.
1.1 Limitations
While deep learning techniques to estimate depth and Ego-Motion from a video have
shown relatively good results, some things could still be improved. To perform well,
the environment should have sufficient light [10]. Weather changes can also affect
estimating both depth and Ego-Motion. If an aerial vehicle moves over an ocean, it
might be challenging to estimate Ego-Motion as there will not be enough features for
calculations [11].
Stereo vision systems have successfully injected scale information into Ego-Motion
because of the baseline distance between the cameras. Monocular vision systems can
estimate 3-dimensional rotation and translation from 2D images. However, they cannot
accurately determine the scale of the translation, and some external sense of scale
needs to be provided [10].
1.2 Motivation
The motivation of this thesis is to implement monocular depth & Ego-Motion using
an unsupervised technique and examine the effects of depth network on the Ego-
Motion/pose network. We use a similar method as Zhou et al. [8], using multiple
depth networks and taking the weighted average of all the outputs obtained from the
three depth networks. Since Ego-Motion or pose estimation is assisted by the depth
network, improvements in depth should also improve the estimated camera pose. We train
the network using the KITTI dataset and then evaluate it by comparing the results
with previous methods.
supervised learning, so to improve the performance of depth estimation we use a
weighted average that combines three unsupervised networks in an ensemble to achieve
a more reliable unsupervised paradigm.
Chapter 2 deals with the theoretical concepts related to depth and Ego-Motion
estimation from a video. We show in detail how a video works and how we can
estimate depth and Ego-Motion from it. We also show the advantages and
disadvantages of the approaches used in estimating depth and Ego-Motion.
Chapter 3 is the literature review, where we will look at architectures of the depth and
Ego-Motion used by researchers.
Chapter 5 benchmarks the qualitative and quantitative results with previous research.
Chapter 2
Background
The purpose of this chapter is to introduce the basic concepts used throughout this
thesis. We will look at an image and how we make a video from images. Then we will
look at how we estimate depth using traditional techniques. In section 2.2.5, we will
look at how convolutional neural networks work. Then we discuss the different
learning methods for estimating depth. Then we look at what Ego-Motion is and how
it was traditionally estimated. Finally, we will discuss the learning-based methods
used to estimate Ego-Motion.
An image is obtained from both analog and digital signals. Back in the 1940s, a
cathode tube was used in a camera to take pictures, as shown in figure 2.2. A lens
focused the scene on a plate in front of the tube. The cathode tube would sweep a
stream of electrons across the plate back and forth, turning the scene's brightness and
darkness into voltage. High voltage amplitude corresponded to high brightness, and low
voltage amplitude to low brightness. These signals were then sent to a monitor,
i.e., the cathode ray tube (CRT), which pushed the electrons to the monitor's screen.
The CRT moved the electron beam from left to right to create the original scene.
These days, devices are used to convert scenes to digital data. However, an analog
signal carrying some form of energy is still required to hit the sensor. The
imaging sensor transforms this signal into a digital image. The particular
sensor we use is called an array sensor, which is the most widely used in cameras. The
sensor digitizes the analog signal, and each sensor element responds to the analog
signal. The output response is a digital image, as shown in figure 2.3.
2.1.2 Video using Images
In the real world, everything is in motion. For example, the wind blows the trees, and
cars move on the roads, tennis players swing their rackets, etc. Things are moving,
and our perception system handles them very well. We also want to deal with motion
in computer vision, but nothing is moving in a single digital image. When sequences
of images are displayed one after another, separated by only a short time, we get the
perception of motion [12]. When dealing with computers and machines, we want
them to have the same motion perception as humans.
When talking about motion, we talk about video. In the past, when someone would
talk about a video, people would refer to it as a sequence of images; these days, we
instead use the term frames. A video is a sequence of frames captured over time relatively quickly. When a
camera records a video, it means that the camera captures images at regular intervals
with minor changes in a scene.
Now our images are not just a function of space (x, y) but space (x, y) and time t. It
can also be written as f(x, y, t). Remember that when dealing with this 3-dimensional
function, we do not mean that time and space are the same. Figure 2.4 shows an
example sequence of frames where each frame is captured at a different interval.
The rate at which frames appear is measured by frame rate, sometimes called frames
per second. Different frame rates have a different impact on how motion is perceived.
The smaller the frame rate, the more jittery the motion in a video; the greater the
frame rate, the smoother the video. The most common frame rate for video is 30 fps
or 60 fps.
measure of obtaining the distance of an object from the camera. This information can
then be used to create a 3D representation of the scene.
(a) Mars Exploration Rover (b) Google’s Self-Driving Car (c) Starbug X sub-marine
Figure 2.5: Applications of VO
LiDAR, radar, and other range sensors have been used to capture the depth of the
environment, but in computer vision, we use cameras to get the depth information.
This is useful in vehicles like self-driving cars and can also be used in robot-human
interaction, augmented reality, and virtual reality.
Most methods for depth estimation are based on finding correspondences between
points in different images and then using triangulation to reconstruct the 3D positions
of those points. However, this approach is less reliable when occlusions exist, as shown
in figure 2.6.
Figure 2.6: Occluded Points
2.2.3.1 Stereopsis
For stereopsis, we use stereo vision cameras, which have nowadays been used in self-
driving cars, mobile cameras, etc. A stereo sensor [12] is created by placing two
cameras in parallel whose optical axes align. Given the rotation R and translation t,
a point O projects onto the left and right image planes at OL and OR, respectively [14].
2. While manufacturing the stereo setup, we try as much as possible to keep the
two cameras' optical axes aligned.
Let us list some critical parameters. The distance between the camera center and
image planes is called the focal length f. The baseline distance b is the distance
between the centers of the two cameras along the x-axis. We assume that the rotation
matrix R is identity and the x-component in the translation vector t is zero.
Let’s see this structure in a bird’s eye view, as shown in figure 2.8.
We first want to compute point O's z and x coordinates with respect to the left camera frame. We
can see two similar triangles formed by the left camera, △CLZLO, and △CLXLO. The
equation formed is given as

\frac{Z}{f} = \frac{X}{X_L}    (2.1)

Similarly, we can say about the right camera, which forms △CRZRO and △CRXRO,

\frac{Z}{f} = \frac{X - b}{X_R}    (2.2)
We define the disparity d as the difference between the image coordinates of the same
pixel in left and right images.

d = X_L - X_R    (2.3)

Where:
X_L = u_L - u_o
X_R = u_R - u_o
Y_R = v_R - v_o
We now arrive at computing the 3D coordinates as follows.

From Eq 2.1:

Z X_L = f X    (2.4)

From Eq 2.2:

Z X_R = f (X - b)    (2.5)

Subtracting Eq 2.5 from Eq 2.4 gives Z (X_L - X_R) = f b, so with the disparity d from Eq 2.3 the depth is Z = \frac{f b}{d}. From Eq 2.4 and 2.5 we then have

X = \frac{Z X_L}{f}    (2.8)

Y = \frac{Z Y_R}{f}    (2.9)
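To make the triangulation concrete, the following sketch back-projects a single pixel to 3D using the relations above; the numerical values (focal length, baseline, principal point) are illustrative assumptions, not calibration values from this thesis.

```python
import numpy as np

def stereo_to_3d(u_l, v_l, disparity, f, b, u0, v0):
    """Back-project a left-image pixel (u_l, v_l) with a known disparity to a
    3D point, using Z = f*b/d and Eqs. 2.8-2.9."""
    x_l = u_l - u0            # image coordinate relative to the principal point
    y_l = v_l - v0
    Z = f * b / disparity     # depth from the subtraction of Eqs. 2.4 and 2.5
    X = Z * x_l / f           # Eq. 2.8
    Y = Z * y_l / f           # Eq. 2.9 (y coordinate is the same in both images)
    return np.array([X, Y, Z])

# Example: 8 px of disparity with a 720 px focal length and a 0.54 m baseline.
print(stereo_to_3d(u_l=660, v_l=360, disparity=8.0,
                   f=720.0, b=0.54, u0=620.0, v0=360.0))  # Z = 48.6 m
```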
Now that expressions for the 3D coordinates are available, we still need the focal
length f, the baseline b, and the x and y offsets uo and vo. These are obtained by
calibrating the stereo camera. Another thing that we need to do is compute the
disparity, which requires a specialized algorithm to perform the matching efficiently.
This setup has some limitations; for example, as a point moves further away, the accuracy
of its depth suffers a lot. However, stereo depth is still quite valuable for many computer vision
applications.
2.2.3.2 Computing the Disparity
We found the position of a point in 3D space, but there are still parameters that we
need to find, such as baseline b, disparity d, focal length f, and camera pixel centers
uo, vo. These days there is much software available to obtain these parameters.
However, we will see how disparity is computed using the stereo setup. The disparity
is “the difference in the image location of the same 3D point as observed by two
different cameras”. In this problem, we need to find the same point in both the left and
right cameras. The problem is known as the correspondence problem. The most naïve
technique is the exhaustive search method, where a window around a pixel in the left
image is compared against every location in the right image.
Figure 2.9: Brute force solution by searching the whole image for each pixel
But this method is inefficient and might not be good to run in real-time, especially in
self-driving cars.
To solve this problem, we can use stereo geometry to constrain the problem from 2D
over the entire image to a 1D line. We already know about the stereo setup, shown in
figure 2.7, where we observed a single point in both the left and right cameras. Now
let us move the 3D point along the line connecting it with the left camera center.
As we move the point, we will see in the left image that the projection does not
change. However, if we notice in the right camera, the projection of the point changes
along the horizontal line. This horizontal line is called the epipolar line [14], which
follows directly from the fixed lateral offset and image plane alignment of two
cameras in a stereo pair. The epipolar line is a straight horizontal line only when the
two cameras are parallel; otherwise, the line becomes skewed. It is known as multi-
view geometry.
15
Figure 2.11: Movement of 3D point along the epipolar line
A skewed epipolar line is not a big problem; it is resolved using stereo
rectification [12]. Going into the mathematical model of this is beyond the scope of
this work. So, we head back to the image in figure 2.10 and follow these
steps.
1. We put a horizontal line at the center of both images.
2. Compare the pixels from the left image to the pixels of the right image along the
epipolar line.
3. Pick the pixel that has the minimum cost. The cost can be the squared difference
between the pixel intensities.
4. Finally, we compute the disparity d by subtracting the right image location from
the left one. A minimal sketch of this procedure is given below.
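The sketch below scans along the same row of a rectified pair (the epipolar line) and picks, for one left-image pixel, the disparity with the minimum squared-difference cost over a small window; the window size and the search range are arbitrary choices made for this illustration.

```python
import numpy as np

def disparity_for_pixel(left, right, row, col, max_disp=64, half_win=3):
    """Estimate the disparity of left[row, col] by comparing a window around it
    with windows along the same row of the right image (steps 1-4 above)."""
    patch_l = left[row - half_win:row + half_win + 1,
                   col - half_win:col + half_win + 1].astype(np.float32)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        c = col - d                      # candidate column in the right image
        if c - half_win < 0:
            break                        # ran out of image; stop searching
        patch_r = right[row - half_win:row + half_win + 1,
                        c - half_win:c + half_win + 1].astype(np.float32)
        cost = np.sum((patch_l - patch_r) ** 2)   # squared-difference cost
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d
```

Real stereo matchers add cost aggregation and sub-pixel refinement on top of this basic comparison.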
2.2.4 Convolution Neural Networks
In the last decade, convolutional neural networks have been mainly used for
perception tasks such as object classification, object recognition, image segmentation,
and depth estimation. Convolutional neural networks mainly comprise two layers:
1. Convolution Layers
2. Pooling Layers
number of channels tells us that the image is an RGB image. The gray area in the
image is added by a process called padding. The number of pixels added on each side
is called the padding size. Here the padding size is 1.
The advantage of padding is that when we perform the convolution operation on the
image using a filter, the width and height of the image are retained.
Figure 2.13: Each channel applied to the respective channel of an RGB input volume
First, let us better understand this by applying a single filter, as shown in figure 2.14.
We will take each channel of the filter, apply it to the corresponding channels of the
RGB image, and perform the convolution operation. Then we sum the corresponding
values and add the bias. We will get the first output pixel. Then we slide the filter
to the right and again perform cross-correlation. Then add the values where we will
get another output pixel. Similarly, we will keep doing this until we obtain all the
output pixels.
(a) (b)
(c) (d)
Figure 2.14: Process of cross-correlation of each channel of filters with respective channels of
the RGB input volume
The size of the output has been reduced because we moved the filter with a stride of
two. If we want to confirm the output size mathematically, we can do it as
W_{out} = \frac{W_{in} - m + 2P}{S} + 1    (2.10)

H_{out} = \frac{H_{in} - m + 2P}{S} + 1    (2.11)

D_{out} = K

Where:
m × m = filter size
K = number of filters
P = padding size
S = stride
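A quick way to check Eqs. 2.10–2.11 is to evaluate them directly; the example numbers below are arbitrary.

```python
def conv_output_shape(w_in, h_in, m, k, padding=0, stride=1):
    """Output size of a convolution layer following Eqs. 2.10-2.11:
    W_out = (W_in - m + 2P)/S + 1, and the output depth equals the
    number of filters K."""
    w_out = (w_in - m + 2 * padding) // stride + 1
    h_out = (h_in - m + 2 * padding) // stride + 1
    return w_out, h_out, k

# A 6x6 input with 3x3 filters, padding 1 and stride 1 keeps the spatial size.
print(conv_output_shape(6, 6, m=3, k=2, padding=1, stride=1))  # (6, 6, 2)
```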
Now we know how to apply the convolution layer, but where will the neural network
learn the parameters? The filters we apply to the input image will have values acting
as weights for the neural network. These are the weights in the filters that the
convolution neural network will learn, as shown in figure 2.13.
We know that the dimension of the RGB input is Win × Hin × 3, and when we apply a
single filter on the image, the dimension becomes Wout × Hout × 1. However, we will
not be applying a single filter but multiple filters on a single RGB image. For multiple
filters, we represent the dimension as Wout × Hout × K. Figure 2.15 shows that we first
apply filter 1 and then, in the same way, we apply filter 2. We can notice that we have
two output channels when we apply two filters. These channels are stacked over each
other to form output volume.
2.2.4.2 Pooling Layers
Pooling layers [15] help the representation become invariant to small translations in
the input. This layer uses a pooling function that replaces the output of the previous
layer at each location with a summary of the nearby outputs. There are many pooling
functions, but we will use Max Pooling to understand the idea better.

W_{out} = \frac{W_{in} - n}{S} + 1    (2.12)

H_{out} = \frac{H_{in} - n}{S} + 1    (2.13)

D_{out} = D_{in}

Where n × n is the pooling window size and S is the stride.
Unlike the convolution layers, the number of channels in the output will be the same as
in the input. Additionally, there are no parameters to learn in the pooling layers; only
the parameters of the convolution layers are learned.
(a) (b)
(c) (d)
Figure 2.16: Applying max pool function on a 4x4 matrix
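A minimal max-pooling sketch matching Eqs. 2.12–2.13, with a 2x2 window and stride 2 applied to an arbitrary 4x4 matrix:

```python
import numpy as np

def max_pool(x, n=2, stride=2):
    """Max pooling per Eqs. 2.12-2.13: each n x n window is replaced by its maximum."""
    h_out = (x.shape[0] - n) // stride + 1
    w_out = (x.shape[1] - n) // stride + 1
    out = np.empty((h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = x[i * stride:i * stride + n,
                          j * stride:j * stride + n].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])
print(max_pool(x))   # [[6 4]
                     #  [8 9]]
```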
2.2.4.3 Deconvolution
We are familiar with regular convolution, but there is another type of convolution
algorithm called transpose convolution [16], also sometimes called the deconvolution
layer, which is mainly used for segmentation and depth perception tasks.
In transpose convolution, each input pixel is multiplied with all the filter values and
written into the output. We then multiply the second input pixel with all the filter
values and, in the output, slide the window by a stride of 2. The areas where the pixel
values overlap are added. Similarly, for the third input pixel, we move the window by a
stride of 2 downwards, multiply that pixel with all the values of the filter, add the
pixels in the overlapped region, and do the same for the fourth input pixel.
(a) (b)
(c) (d)
(e)
Figure 2.17: Deconvolving a 3 x 3 feature map
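The sketch below reproduces the stride-2 transposed convolution described above in plain NumPy; the 3x3 feature map and the all-ones filter are arbitrary illustrative inputs.

```python
import numpy as np

def transpose_conv2d(x, w, stride=2):
    """Transposed (de)convolution: each input pixel is multiplied by the whole
    filter, the result is pasted into the output shifted by the stride, and
    overlapping regions are summed."""
    kh, kw = w.shape
    out = np.zeros(((x.shape[0] - 1) * stride + kh,
                    (x.shape[1] - 1) * stride + kw), dtype=np.float32)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * w
    return out

x = np.arange(9, dtype=np.float32).reshape(3, 3)   # 3x3 feature map
w = np.ones((3, 3), dtype=np.float32)              # 3x3 filter
print(transpose_conv2d(x, w).shape)                # (7, 7) up-sampled output
```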
Before diving into the learning methods of estimating depth, we must first understand
what CNNs do.
The encoder consists of a series of convolution & pooling layers to extract the depth
features. The decoder consists of a series of deconvolution layers that regress the pixel-
level depth map. The output dimension, i.e., h × w of the depth map, should be the
same as the input. To preserve the depth features, the output from each layer of the
encoder is concatenated with the output of the corresponding decoder layer. The depth loss function is
used to train the network till we obtain the desired depth.
Figure 2.19: General supervised framework for depth estimation
Advantages
The accuracy rate is very high when the scale of the estimated depth is close to the
ground truth, and we can further generate an accurate 3D map.
Disadvantages
The cost of obtaining the ground truth is very high because it requires extensive
equipment and time.
Figure 2.20: General unsupervised framework for depth estimation using stereo image pairs
Similarly, we can not only estimate depth but also pose simultaneously. The depth
and pose network predictions are used to synthesize a new view, and that new view is
compared with the original target image. The only difference from the former
unsupervised method is that the warping is based on adjacent frames. Figure 2.21
shows the general diagram of estimating both depth and pose.
Figure 2.21: General unsupervised framework for depth and pose estimation
Advantages
The unsupervised learning method of estimating depth does not require any ground
truth, which helps in reducing the cost of building depth labels.
Disadvantages
Unlike the supervised method, unsupervised depth and pose estimation suffer from
lower accuracy.
(a) Visual Odometry (b) Localization (c) SLAM
Figure 2.22: Types of Ego-Motion
For this work, we will focus on visual odometry. Visual odometry refers to estimating
the relative Ego-Motion from images.
It can be used in space or underwater exploration, where sensors such as GPS do not
work, to see what is in the environment. In places like these, it can also be used to
estimate the camera's pose and move through the environment.
Visual odometry aims to estimate the camera's pose by examining the changes that the
motion induces in the images it captures.
Figure 2.23 shows a car at time t and t + 1. We can see that the car has gone through a
rigid body transformation of rotation R and translation t, which has 6 degrees of
freedom (6-DoF), from time t to t + 1. So we want to estimate the current pose, i.e., at
time t + 1, with respect to the previous pose, i.e., at time t. This is why it is also
sometimes referred to as relative Ego-Motion: we are not localizing the vehicle with
respect to a global reference such as a map, but with respect to the previous time instance.
Figure 2.23: Pose of a Car
Even though we talked about mapping, it is often built as a by-product, which means
that Visual Odometry’s focus is not mapping but estimating pose.
2.3.3.1 Visual Features – Detection, Description, and Matching
Image features are vital pieces of information that describe the contents of an image. In
computer vision, they have a variety of applications, like face detection or object detection.
All of these tasks have a general framework, as shown in figure 2.25.
These features have specific structures like points, edges, and corners. Let us take an
example of image stitching or panorama, which is widely used on our mobile devices.
Let us consider two images, and we would like to stitch them together to form a
panorama. Remember that a panorama can be formed using not only two but more than
two images. For the sake of understanding, we are using two images.
(a) (b)
(c) (d)
(e)
Figure 2.26: Example of image stitching technique
First, we would like to find distinct points in one image. These points will be called
features. Then we associate a descriptor for each feature from its neighborhood. Then
we finally match the features [18] across both images. Then these matched features
can be used to stitch both images, creating a panorama. However, we can see some
artifacts and missing regions in the extreme bottom-right corner of the image, as shown
in figure 2.26(e).
Features are points of interest in an image. A point of interest in an image should have
some characteristics for it to be a good feature point. These are the following.
The feature points should be salient, i.e., distinct, identifiable, and different from its
immediate neighborhood.
1. In order to extract the same features from each image, the feature points should be
repeatable.
2. Feature points should be local. It means they should stay the same if an image
region far away from the immediate neighborhood changes.
3. Many applications like camera calibration and localization require a minimum
number of distinct feature points, so they should be abundant.
4. Generating features should not require much computation.
To understand further how these features should look, consider figure 2.27. In (a),
repetitive textureless patterns are tough to recognize, so we cannot consider these as
features. Similarly, the two rectangles on the white line, as shown in (b), will also be
hard to recognize. The concept of corners [12] for image features is essential. These
corners occur when the gradients in at least two significant directions are
considerable. An example is shown in rectangles in (c).
To solve this problem, the Harris-Laplace corner detector was proposed, which is
used to detect corners at different scales and choose the best corner based on the
Laplacian of the image. Similarly, many machine learning corner detectors were later
proposed. One prominent example is the FAST corner detector [20], which is
computationally efficient and performs well. Other scale-invariant feature detectors
are based on the concept of blobs, such as Laplacian of Gaussian. We will not discuss
these feature detectors in great detail as they represent a complex area of research.
Thanks to open-source implementations of these algorithms, we can see them in
action using OpenCV [19].
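For instance, the FAST detector mentioned above can be tried in a few lines of OpenCV; the image file name below is only a placeholder.

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image

# FAST corner detector: cheap to compute, widely used in visual odometry.
fast = cv2.FastFeatureDetector_create(threshold=25)
keypoints = fast.detect(img, None)
print(f"Detected {len(keypoints)} FAST corners")

# Draw the detected corners and save the visualization for inspection.
vis = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imwrite("frame_fast.png", vis)
```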
A feature descriptor summarizes information about a particular feature point, and its
key characteristic is that it should be repeatable: regardless of position, scale, and
illumination, the same point of interest in both images should have approximately the
same feature descriptor. The other characteristics required of a point of interest,
already described in section 2.2, also help the feature descriptors to match well.
In order to have a good sense of finding a feature point, let us look at a particular
study called Scale Invariant Feature Transform (SIFT) descriptor [21].
1. Given a feature in an image, the SIFT descriptor takes a 16 by 16 window around
the feature.
2. It divides the window into 16 cells, each of which comprises a 4 by 4 patch of pixels.
3. Edges are then computed using the gradients. For stability, we suppress the weak
edges by defining a threshold as they vary with orientation with small amounts of
noise between images.
4. We construct an 8-bin histogram of gradient orientations for each cell and
concatenate them to get a 128-dimensional descriptor.
SIFT descriptor is a very well-engineered feature descriptor. It is usually computed
over multiple scales and orientations. It has also been combined with scale invariant
feature detection such as DoG, which results in a highly robust feature detector and
descriptor pair.
Now we want to find the best match. The simplest method to solve this problem is
called brute force feature matching. It is defined as the distance function d(fi, fj) that
compares feature fi in image 1 and feature fj in image 2. The smaller the distance
between two features, the more similar will be the feature points. For every feature fi
in image 1, we compute the distance d(fi, fj) with every feature fj in image 2 and find
the closest match fc that has the minimum distance.
Now, what distance function shall be used for feature matching? We can use the Sum
of Squared Differences (SSD) which is defined as

d(f_i, f_j) = \sum_{k=1}^{D} (f_{i,k} - f_{j,k})^2    (2.14)
We can even use other distance functions instead of SSD which are the following.
• Hamming Distance
This is a basic method of matching features, but refinements can be added on top of the
brute-force method to make feature matching more robust.
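A minimal brute-force matcher implementing Eq. 2.14, assuming the descriptors of each image are stored as rows of a NumPy array:

```python
import numpy as np

def brute_force_match(desc1, desc2):
    """For every descriptor f_i in image 1, find the descriptor f_j in image 2
    with the smallest sum of squared differences (Eq. 2.14)."""
    matches = []
    for i, f_i in enumerate(desc1):
        ssd = np.sum((desc2 - f_i) ** 2, axis=1)   # distance to every f_j
        j = int(np.argmin(ssd))
        matches.append((i, j, float(ssd[j])))
    return matches

desc1 = np.random.rand(100, 128)   # e.g. 100 SIFT-like descriptors in image 1
desc2 = np.random.rand(120, 128)   # 120 descriptors in image 2
print(brute_force_match(desc1, desc2)[:3])
```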
Visual Odometry [10], or VO for short, is the process of estimating the pose [7] of the
camera by examining the changes that the motion induces in images.
Furthermore, cameras are passive sensors and may not be robust to weather changes
and illumination changes like car headlights and street lights. Similarly, at night it can
be challenging to perform VO.
Like other odometry estimation techniques, VO will drift over time as the estimation
error accumulates.
To define the problem mathematically, we are given two consecutive
images Ik and I(k−1). Our goal is to estimate the transformation matrix Tk between these
two frames. The transformation matrix is defined by the rotation matrix R and the
translation vector t, where k is the time step.

T_k = \begin{bmatrix} R_{k,k-1} & t_{k,k-1} \\ 0 & 1 \end{bmatrix}    (2.17)
The camera will be attached to a rigid body, and as it is in motion, it will capture
every frame. We will call these frames Cm, where m is the frame number. When we
concatenate these frames, we can estimate the camera's trajectory.
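As a sketch of how a trajectory is obtained from these frames, the snippet below chains relative 4x4 transformations (Eq. 2.17) into global camera poses; the convention that each T_k expresses the motion from frame k-1 to frame k in the coordinates of frame k-1, and the example motions themselves, are assumptions made for illustration.

```python
import numpy as np

def accumulate_trajectory(relative_transforms):
    """Chain 4x4 relative transforms into global camera poses, with the first
    camera frame taken as the world origin."""
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for T in relative_transforms:
        pose = pose @ T                  # compose the incremental motion
        trajectory.append(pose.copy())
    return trajectory

# Two made-up steps: 1 m forward, then 1 m forward with a 5-degree yaw.
T1 = np.eye(4); T1[2, 3] = 1.0
a = np.deg2rad(5.0)
T2 = np.array([[ np.cos(a), 0, np.sin(a), 0.0],
               [ 0.0,       1, 0.0,       0.0],
               [-np.sin(a), 0, np.cos(a), 1.0],
               [ 0.0,       0, 0.0,       1.0]])
poses = accumulate_trajectory([T1, T2])
print(poses[-1][:3, 3])   # position of the camera after the two steps
```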
In VO, motion estimation is an important step. How we perform this step depends on
the type of feature correspondences we use. Three types of feature correspondences
can be used in motion estimation.
• 2D-2D: Feature matches in frames fk−1 and fk are defined purely in image
coordinates. It is instrumental in tracking objects and image stabilization in
videography.
• 3D-3D: Feature matches in frames fk−1 and fk are defined purely in 3D world
coordinates. This approach helps locate new features in 3D space and estimate
depth.
• 3D-2D: Features in frame fk−1 are defined in 3D space, and features in frame fk are
defined in image coordinates.
Let us see how 3D-2D projection [14] works. We are given a set of features in
frame Ck−1 and their estimated 3D world coordinates. Furthermore, through feature
matching, we have a set of features in frame Ck and their 2D image coordinates. Since
we cannot recover the scale for monocular visual odometry directly, we include a
scalar parameter s when forming a homogeneous feature vector from the image
coordinates. With this information, we estimate the rotation matrix R and translation
vector t between the frame Ck−1 and Ck.
This process is similar to camera calibration, with the exception that the intrinsic
parameters of the camera, represented by matrix K, are already known.
So now, our problem is reduced to finding the transformation matrix [R|t] from the
equations constructed using all of our matched features. It can be solved using the
Perspective-n-Point (PnP) algorithm. Given the feature locations in 3D, their 2D
correspondences, and the intrinsic camera matrix K, PnP solves for the extrinsic
parameters R and t. The equations here are non-linear, so we refine the Direct
Linear Transformation solution with an iterative non-linear optimization technique
such as the Levenberg-Marquardt algorithm.
The PnP algorithm requires at least 3 feature points to solve for R and t, and if we
want a good solution, then we use 4 feature points.
To improve the robustness of the PnP algorithm, we incorporate RANSAC (Random
Sample Consensus): PnP generates a candidate transformation from a minimal set of
four points, and we evaluate this model by calculating the percentage of inliers (points
that fit the model) to confirm the validity of the selected point matches.
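This 3D-2D step can be prototyped directly with OpenCV's PnP-with-RANSAC solver; the correspondences below are random placeholders and the intrinsic matrix K is only an example.

```python
import cv2
import numpy as np

# Placeholder correspondences: N 3D landmarks seen in frame C_{k-1} and their
# matched 2D pixel locations in frame C_k.
points_3d = (np.random.rand(50, 3) * 10.0).astype(np.float32)
points_2d = (np.random.rand(50, 2) * 500.0).astype(np.float32)

K = np.array([[718.0, 0.0, 607.0],
              [0.0, 718.0, 185.0],
              [0.0, 0.0, 1.0]])           # example intrinsic matrix

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    points_3d, points_2d, K, distCoeffs=None)

R, _ = cv2.Rodrigues(rvec)                # rotation vector -> rotation matrix R
print("success:", ok, "inliers:", 0 if inliers is None else len(inliers))
```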
2.3.4 Learning-Based Ego-Motion Estimation Methods
There are two ways to estimate the pose of the camera:
1. Supervised way
2. Unsupervised way
Pose estimated in a supervised way gives accurate estimates that are robust to motion
and illumination changes, but over time the error starts to accumulate, giving rise to
inevitable drift.
2.3.4.2 Unsupervised Learning
Figure 2.21 shows the general architecture of pose estimation. In the unsupervised way,
the predicted pose is assisted by the depth estimated from a single image. This method
is robust to some extent, but not to dynamic changes in the environment, and the
learning takes a lot of time.
Chapter 3
Literature Review
In this chapter, we will see what researchers have done to improve the depth and pose
estimated from images.
The coarse network is used to get a global sense of what is in the image. It is passed
through a bunch of convolution layers, and at the end, there are two fully connected
layers. These fully connected layers are used to get a coarse depth estimation by
looking at every image pixel. In the end, the prediction is resized back to the original
size of the input image.
The fine network is used for looking at fine details in the image. The image is also
passed through a bunch of convolution layers. The difference from the layers in the
coarse network is that the resolution of the output from each layer is the same as that
of the input image. The outcome from the coarse network is then
concatenated with output features from the first layer and then passed through two
convolution layers to get a refined depth prediction.
The convolution layers of the coarse network are pre-trained on ImageNet, and the
fully connected layers are then fine-tuned. The job of this network was to understand
the global scene, and the fine network was trained after the coarse network was
trained.
When writing a loss function for training the network, there is a catch: the global scale
of the scene is unknown, which is an ambiguity for depth prediction.
What this means is that we know the relative sizes of the objects in the scene but not
the exact sizes. Eigen et al. used a loss function independent of scale to solve this
issue.
Consider y as the predicted depth, y* as the ground-truth depth, and n as the number of
pixels indexed by i. A straightforward depth loss function is the mean squared error in
log space,

D(y, y^*) = \frac{1}{n} \sum_{i=1}^{n} (\log y_i - \log y_i^*)^2    (3.1)

However, this function is heavily dependent on scale. To correct that, the loss function
was written as

D(y, y^*) = \frac{1}{n} \sum_{i=1}^{n} (\log y_i - \log y_i^* + \alpha(y, y^*))^2    (3.2)

Where:

\alpha(y, y^*) = \frac{1}{n} \sum_{i=1}^{n} (\log y_i^* - \log y_i)    (3.3)

The value of \alpha is the one that minimizes the error, which makes this mean squared
error scale-invariant. Writing d_i = \log y_i - \log y_i^*, it can be re-written as

L(y, y^*) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \left( \sum_i d_i \right)^2    (3.4)
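A small NumPy version of the scale-invariant loss in Eq. 3.4; λ = 0.5 is the value reported by Eigen et al. and is taken here as an assumption, and the random depth maps are placeholders.

```python
import numpy as np

def scale_invariant_loss(pred, gt, lam=0.5, eps=1e-8):
    """Scale-invariant log loss (Eq. 3.4):
    L = (1/n) * sum(d_i^2) - (lambda / n^2) * (sum(d_i))^2,
    with d_i = log(y_i) - log(y_i*)."""
    d = np.log(pred + eps) - np.log(gt + eps)
    n = d.size
    return np.mean(d ** 2) - lam * (np.sum(d) ** 2) / (n ** 2)

pred = np.random.uniform(1.0, 80.0, size=(128, 416))   # predicted depths
gt = np.random.uniform(1.0, 80.0, size=(128, 416))     # ground-truth depths
print(scale_invariant_loss(pred, gt))
```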
Another contribution to this work is the design of up-convolution blocks. In the up
convolution, the feature map is given to the 2 × 2 unpooling layer, which performs the
inverse of the pooling layer. It is then followed by two 5 × 5 convolution layers which
are followed by a ReLU activation function.
One 5×5 convolution branch is followed by a 3 × 3 convolution layer. The output from
this branch and that of the other convolution branch are then added together, followed
by a ReLU activation function.
Another block they introduced was the fast-up convolution block which is much more
efficient and faster. They observed that after un-pooling the feature map and applying a
5x5 convolution filter, only certain parts are multiplied with non-zero values. This
inspired them to convolve the feature map with four different convolution layers of
sizes 3×3, 2×3, 3×2 & 2×2 and then interleave the outputs, which places each result at
its corresponding pixel location.
This block is then used in the same way as the up-projection block and is called a fast
up-projection block.
For learning, they found that the reverse Huber (berHu) loss yields a lower error.

B(x) = \begin{cases} |x| & |x| \le c \\ \frac{x^2 + c^2}{2c} & |x| > c \end{cases}    (3.5)

Where c = \frac{1}{5} \max_i (|\tilde{y}_i - y_i|).
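A direct transcription of the berHu loss in Eq. 3.5, averaged over all pixels:

```python
import numpy as np

def berhu_loss(pred, gt):
    """Reverse Huber (berHu) loss, Eq. 3.5: L1 for small residuals and a scaled
    L2 branch for residuals larger than c = (1/5) * max |residual|."""
    x = np.abs(pred - gt)
    c = 0.2 * x.max() + 1e-12          # epsilon guards against a zero threshold
    l2_branch = (x ** 2 + c ** 2) / (2.0 * c)
    return np.mean(np.where(x <= c, x, l2_branch))
```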
An input image is fed to a series of convolution layers to estimate the depth at each
scale. The depths predicted at the different scales are upsampled and added together,
and the result is sent to the selection layer to produce a right-view image.
The selection layer models the DIBR [26] step using the traditional 2D-3D conversion
method. Given the left-view image I and predicted depth Z, the disparity D is
computed as

D = \frac{B(Z - f)}{Z}    (3.6)
B is the baseline. The network then predicts a probability distribution over the possible
disparity values d at each pixel location (i, j), where \sum_d D_{i,j}^d = 1. A stack of
shifted left-view images is then created. The disparity probabilities and the shifted left
images are multiplied element-wise and summed to give a right-view image:

O_{i,j} = \sum_d I_{i,j}^d D_{i,j}^d    (3.7)
3.1.4 PoseNet
The PoseNet architecture [27] used GoogleNet’s inception modules.
We have a series of convolution layers in the pose network; this block is called the
encoder. The encoder outputs a vector v, which is an encoding of the visual features.
The vector v is fed to the localizer, a fully connected layer that outputs local features u.
The final output is the rigid-body transformation T = [R|t], where R is the rotation and
t is the translation.
Another key idea was jointly predicting probability over depths for each pixel and a
set of corresponding hypothesis colors. The final color calculated for each pixel is a
probability-weighted sum of colors. This is optimized end-to-end within a deep
learning framework. The input to the network is the plane sweep volume reprojected to
the target camera at various depths. Here the network learns the similarity function
from lots of data. The
network learns the depth distribution and color hypotheses for each pixel. It consists
of two towers; the selection tower and the color tower.
The selection tower learns to produce a selection probability for each pixel in each
depth plane, while the color tower combines and warps pixels and colors across the
input images. The outputs of both towers are multiplied and summed to produce a
final image.
3.2.2 Depth Estimation with Left-Right Consistency, 2017
Godard et al. [29] aimed at inferring depth using a single monocular image at test
time. At training time, two images, Il and Ir, are used, which are obtained from a pair
of stereo cameras at the same moment. The target is one of these two images, meaning
that the model needs to reproduce that image. For example, if the target is the left
image, the output should look similar to the left image. The reconstruction
loss is used to compare the output image with the target image for learning. A
convolution neural network is used to estimate the depth. Using the estimated depth
and the right input image, a sampler from the spatial transformer network (STN) [30]
outputs a reconstructed left image. The STN uses a bilinear sampler where the output
pixel is the weighted sum of four input pixels.
This gave quite good results, but some artifacts could still be observed. A similar
procedure is carried out using the right image to estimate depth as well. This enforces
consistency between both depths, which leads to more accurate depth estimates.
Another sampler is used to produce a right image using the left image. The
reconstructed right image is then compared with the original right image.
For learning, the loss uses a combination of three main terms: the appearance matching
loss, the disparity smoothness loss, and the left-right disparity consistency loss.

The appearance matching loss C_{ap}^l is a combination of an L1 term and a single-scale
SSIM term, which compares the input image I_{ij}^l with the reconstructed image \tilde{I}_{ij}^l:

C_{ap}^l = \frac{1}{N} \sum_{i,j} \alpha \frac{1 - \mathrm{SSIM}(I_{ij}^l, \tilde{I}_{ij}^l)}{2} + (1 - \alpha) \| I_{ij}^l - \tilde{I}_{ij}^l \|    (3.8)

The disparity smoothness loss penalizes disparity gradients, weighted by an edge-aware
term on the image gradients:

C_{ds}^l = \frac{1}{N} \sum_{i,j} |\partial_x d_{ij}^l| e^{-\|\partial_x I_{ij}^l\|} + |\partial_y d_{ij}^l| e^{-\|\partial_y I_{ij}^l\|}    (3.9)

To ensure consistency between the left and right disparities, the consistency loss is
defined as

C_{lr}^l = \frac{1}{N} \sum_{i,j} | d_{ij}^l - d_{ij + d_{ij}^l}^r |    (3.10)

Similarly, we have the right-image variants C_{ap}^r, C_{ds}^r and C_{lr}^r. Using these
terms, the total loss is written as

C_s = \alpha_{ap} (C_{ap}^l + C_{ap}^r) + \alpha_{ds} (C_{ds}^l + C_{ds}^r) + \alpha_{lr} (C_{lr}^l + C_{lr}^r)    (3.11)
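As an example of the photometric term in Eq. 3.8, the sketch below combines a simplified single-scale SSIM (computed with uniform 3x3 average pooling) with an L1 term; α = 0.85 is the value used by Godard et al. and is an assumption here, and the tensors are expected in (N, C, H, W) layout.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified single-scale SSIM over 3x3 windows, clamped to [0, 1]."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1), F.avg_pool2d(y, 3, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(num / den, 0, 1)

def appearance_loss(target, reconstructed, alpha=0.85):
    """Eq. 3.8: alpha * (1 - SSIM)/2 + (1 - alpha) * L1, averaged over pixels."""
    ssim_term = (1 - ssim(target, reconstructed)) / 2
    l1_term = torch.abs(target - reconstructed)[..., 1:-1, 1:-1]  # match SSIM crop
    return (alpha * ssim_term + (1 - alpha) * l1_term).mean()
```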
The structure network uses conv and deconv layers to estimate the depth from a single
image. Then, using the camera intrinsics (cx, cy), the focal length f and the per-pixel
depth, the pixels (x_t^i, y_t^i) are converted to a point cloud using Eq 3.12.
X_i^t = \begin{bmatrix} X_i^t \\ Y_i^t \\ Z_i^t \end{bmatrix} = \frac{d_i^t}{f} \begin{bmatrix} x_t^i - c_x \\ y_t^i - c_y \\ f \end{bmatrix}    (3.12)
A similar architecture is used to estimate the camera motion and object motion.
However, two fully connected layers are used after convolving a pair of images: one to
estimate the pose/motion of the camera and the other the pose/motion of K object
segments. These poses/motions are transformation matrices that describe rotation and
translation. The camera rotation R_t^c can be represented using the Euler angle
representation as
R_t^{c_x}(\gamma) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{bmatrix}
3.2.4 SfM-Learner, 2017
Zhou et al. [8] jointly predicted depth from a single image and pose from a sequence of
images using unlabeled training data. They also showed that, although trained jointly,
both networks can be used independently at test time. To train the networks, the
learning is carried out
from the task of novel view synthesis, where a new target image is reconstructed from
the nearby views and compared with the original target image.

The depth network takes in the target image I_t and outputs the depth \hat{D}_t. The pose
network takes in the target image I_t and the nearby view/source images I_{t-1}, I_{t+1}
and outputs the relative camera poses, i.e., the transformation matrices. The source
images are then inverse-warped, using the relative poses from the pose network and the
per-pixel depth from the depth network, to reconstruct the target view.

Figure 3.12 (b) shows how a target view I_t is reconstructed by sampling pixels from a
source view I_s using the depth prediction \hat{D}_t and the camera pose \hat{T}_{t \to s}.
The projected coordinates p_s of a target pixel p_t onto the source view can be obtained as

p_s \sim K \hat{T}_{t \to s} \hat{D}_t(p_t) K^{-1} p_t    (3.16)

To populate the values of \hat{I}_s(p_t), a differentiable bilinear sampler, as used in [26],
interpolates the 4-pixel neighbors (top-left, top-right, bottom-left, bottom-right) of p_s
to approximate I_s(p_s):

\hat{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\},\, j \in \{l,r\}} w^{ij} I_s(p_s^{ij})    (3.17)
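A sketch of the projection in Eq. 3.16, which maps every target pixel into the source view using the predicted depth and relative pose; the tensor shapes and the homogeneous-coordinate handling follow common SfMLearner-style implementations and are assumptions here.

```python
import torch

def project_to_source(depth, T_t2s, K):
    """Eq. 3.16: p_s ~ K * T_{t->s} * D_t(p_t) * K^{-1} * p_t.
    depth: (H, W) depth of the target view; T_t2s: (4, 4) relative pose;
    K: (3, 3) intrinsics. Returns (H, W, 2) source-pixel coordinates."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)  # p_t (homog.)

    cam = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)   # D_t * K^-1 * p_t
    cam = torch.cat([cam, torch.ones(1, H * W)], dim=0)      # to homogeneous 3D

    src = K @ (T_t2s @ cam)[:3]                              # K * T_{t->s} * X
    px = src[0] / (src[2] + 1e-7)                            # perspective divide
    py = src[1] / (src[2] + 1e-7)
    return torch.stack([px, py], dim=-1).reshape(H, W, 2)
```

The returned coordinates can then be normalized and passed to a bilinear sampler (e.g. torch.nn.functional.grid_sample) to realize Eq. 3.17.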
Chapter 4
Methodology
This chapter deals with implementation methodology. We start by explaining network
architectures used to estimate depth and Ego-Motion. Then we look at the view
synthesis block in detail and explain how the synthesized view is used by the network
to learn its parameters.
Then we discuss the dataset for training, evaluation, and testing and the
hyperparameters used to train our networks.
The following depth networks are used, and we discuss each of them below.
1. DispNet
2. AdaBins
3. ResNet18
We could have used more than three networks, but as we increase the number of
networks predicting depth, the number of parameters and the computation increase, and
training takes longer. This is why we limit the ensemble to three depth networks.
4.1.1 Subnetwork1 - Disparity Network
Figure 4.2 shows the architecture of DispNet [34], which is similar to the U-Net [35].
The network is made of 7 down-sampling blocks, each consisting of 2 convolution
layers. The 1st block consists of convolution layers with a kernel size of 7, and the
2nd block consists of convolution layers with a kernel size of 4. The remaining blocks,
3 to 7, consist of convolution layers with a kernel size of 3. The output channels of the
respective blocks are 32, 64, 128, 256, 512, 512, and 512.
After the 7th block, the up-sampling process begins. Similar to convolution blocks,
we have up-sampling blocks where each block has two deconvolution layers. Each
layer in deconvolution blocks 1-5 has a kernel size of 3, and the layers in block 6 have
a kernel size of 7. The output channels from each respective up-sampling block
reverse from the convolution blocks, i.e., 512, 512, 256, 128, 64, 32, 16.
The outputs from the second convolution layer of each down-sampling block are
concatenated with the output of the first deconvolution layer of the up-sampling
blocks. The predictions from up-sampling blocks 3, 4 & 5 are again up-sampled and
then concatenated with the output of the first deconvolution layer of blocks 4, 5 & 6.
The output x of the network is converted to depth as

Depth = \frac{1}{\alpha \cdot \mathrm{sigmoid}(x) + \beta}    (4.1)
4.1.2 Subnetwork2 - AdaptiveBin Network
Distributions of depth vary a lot and increase the complexity of estimating depth. To
solve this, Shariq et al. [36] proposed dividing the depth ranges into bins so that the
bins can adapt to changes in a scene.
Figure 4.3 shows the architecture with two blocks; the standard encoder-decoder and
the AdaBins module.
The input is an H × W × 3 image. The encoder uses the EfficientNet-B0 [37] network to
extract features from the image, which are then up-sampled in a similar fashion to
DispNet. The output is then H × W × Cd.
The output from the encoder-decoder block is then divided into patches and fed to a
patch-based transformer [38] called mViT to preserve the global information. It
produces two outputs: Range Attention Maps R and the normalized bin widths b. Eq
4.2 is used to obtain the bin centers from the bin widths.
c(b_i) = d_{min} + (d_{max} - d_{min}) \left( \frac{b_i}{2} + \sum_{j=1}^{i-1} b_j \right)    (4.2)
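A direct transcription of Eq. 4.2, assuming the network outputs a vector of normalized bin widths; the depth range of 1e-3 to 80 m is a common choice for KITTI and is an assumption here.

```python
import numpy as np

def bin_centers(bin_widths, d_min=1e-3, d_max=80.0):
    """Eq. 4.2: c(b_i) = d_min + (d_max - d_min) * (b_i / 2 + sum_{j<i} b_j)."""
    b = np.asarray(bin_widths, dtype=np.float64)
    b = b / b.sum()                   # ensure the widths sum to 1
    prev = np.cumsum(b) - b           # sum of the widths of all previous bins
    return d_min + (d_max - d_min) * (b / 2.0 + prev)

widths = np.random.rand(256)          # e.g. 256 adaptive bins
print(bin_centers(widths)[:5])        # first few bin centers in metres
```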
Another reason we chose this model is that it was originally trained in a supervised
way, whereas we train it in an unsupervised way.
4.1.3 Subnetwork3 - ResNet18
The residual block can be repeated as many times as needed. For this work, we connect
8 blocks in series and apply a single convolution layer before and after them. Since
every block consists of two convolution layers, adding the convolution layers before
and after the blocks gives 18 convolution layers in total, hence the name ResNet18.
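The sketch below shows one way to realize the block structure just described: a basic residual block with two convolution layers and a skip connection, eight of which are stacked between a single input and output convolution. The channel widths, strides, and use of batch normalization are assumptions for illustration, not the exact configuration used in the thesis.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """A minimal residual block: two 3x3 convolutions plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection

# Eight blocks in series, with one convolution before and one after
# (8 x 2 + 2 = 18 convolution layers, as described in the text).
depth_subnet = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3),
    *[BasicBlock(64) for _ in range(8)],
    nn.Conv2d(64, 1, 3, padding=1),
)
```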
4.2 Weighted Average Depth
Each subnetwork in the architecture we use to estimate depth has its own strengths,
as described in sections 4.1.1 – 4.1.3, and in certain situations one subnetwork may
outperform the others. Instead of a simple average, we therefore use a weighted
average: with a plain average, the contribution of a well-performing subnetwork could
be degraded by a poorly performing one. In the weighted average, each subnetwork is
assigned a weight that determines its relative importance. The weighted average depth
is computed as
D_{avg} = \frac{\sum_{i=1}^{n} w_i D_i}{\sum_{i=1}^{n} w_i}    (4.4)
Where w_i is the weight assigned to subnetwork i, D_i is the depth predicted by
subnetwork i, and n is the number of subnetworks.
Since we use three subnetworks, the weighted average depth D_avg can be written as
D_{avg} = \frac{w_1 D_1 + w_2 D_2 + w_3 D_3}{w_1 + w_2 + w_3}    (4.5)
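The combination in Eqs. 4.4–4.5 is straightforward to implement; a minimal PyTorch sketch is given below, where the tensor shapes are assumptions for illustration.

```python
import torch

def weighted_average_depth(depths, weights):
    """Weighted average of per-subnetwork depth maps (Eqs. 4.4-4.5).

    depths:  list of tensors, each of shape (B, 1, H, W)
    weights: list of scalars w_i, one per subnetwork
    """
    w = torch.as_tensor(weights, dtype=depths[0].dtype)
    stacked = torch.stack(depths, dim=0)          # (n, B, 1, H, W)
    w = w.view(-1, 1, 1, 1, 1)
    return (w * stacked).sum(dim=0) / w.sum()
```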
The contribution of each subnetwork is determined by its weight. One way to select the
weights is to assign a higher weight to a subnetwork that yields lower errors, so that
it contributes more to the depth estimate, and a lower weight to a subnetwork that
yields higher errors, so that it contributes less.
4.2.1 Initializing Weights Based on Accuracy
Since we use subnetworks that have already been designed by other researchers, we can
select the initial weights based on their accuracies. We test each network
individually, obtain its accuracy, and then normalize the accuracies to the 0-1 range.
These values are then used as the weights of the networks. For example, if
subnetwork-1 has 69% accuracy, subnetwork-2 has 84% accuracy, and subnetwork-3 has
78% accuracy, normalizing these accuracies gives 0.516, 0.628, and 0.583, which become
the weights w1, w2, and w3 for subnetwork-1, subnetwork-2, and subnetwork-3,
respectively. These values serve as the initial weights of the subnetworks.
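The normalized values quoted above (0.516, 0.628, 0.583) are consistent with scaling the accuracy vector to unit L2 norm; the sketch below reproduces them under that assumption, which is an interpretation of the text rather than something it states explicitly.

```python
import numpy as np

# Standalone accuracies of the three subnetworks (illustrative values from the text).
acc = np.array([0.69, 0.84, 0.78])

# L2-normalize to obtain the initial weights w1, w2, w3 in the 0-1 range.
weights = acc / np.linalg.norm(acc)
print(weights.round(3))   # -> [0.516 0.628 0.583]
```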
4.2.2 Adjustment of Weights Based on Performance During Training
There will be situations during training where one subnetwork performs well and
another does not. It is in our interest to increase the weight of the subnetwork that
performs well so that it contributes more to the depth estimate. After each subnetwork
estimates a depth map, we synthesize a new view from each of these depths. Using each
synthesized image, we compute the corresponding error and normalize the errors to the
0-1 range. These normalized errors become the subnetwork weights, and they are updated
at every step. It should be kept in mind that the network does not learn these
weights; they can be regarded as hyperparameters. The error between the input image
and the image synthesized using the weighted average depth is the one used for
backpropagation.
4.2.3 Updating Weights of Subnetworks in Real-time
At run time, we would like to update the weights every N frames, so that if one
subnetwork performs poorly, the better-performing subnetworks contribute more to the
depth estimate. To update the weights automatically, we use the following steps.
1. Initialize the weights w1, w2, and w3.
2. Set the frame rate and start a timer to determine the number of frames N. For
example, with a frame rate of 30 fps and a time window of 0.1 s, we have N = 3 frames.
6. Compute the errors between the n-th input target frame I_t^n and each of the n-th
synthesized images.
Error_1^n = |I_t^n - I_{t\_predA}^n|    (4.7)
Error_2^n = |I_t^n - I_{t\_predB}^n|    (4.8)
Error_3^n = |I_t^n - I_{t\_predC}^n|    (4.9)
7. Take the mean of each error over the N samples to obtain E1, E2, and E3.
8. Compute the mean of the three errors.
mean = \frac{E_1 + E_2 + E_3}{3}    (4.13)
9. Subtract the mean from each error to center the errors around 0, giving E1_norm,
E2_norm, and E3_norm.
10. Compute the standard deviation of the centered errors.
stddev = \sqrt{\frac{E_{1\_norm}^2 + E_{2\_norm}^2 + E_{3\_norm}^2}{3}}    (4.17)
11. Divide each centered error by the standard deviation so that the errors have a
standard deviation of 1. These values are the weights w1, w2, and w3.
w_1 = \frac{E_{1\_norm}}{stddev}    (4.18)
w_2 = \frac{E_{2\_norm}}{stddev}    (4.19)
w_3 = \frac{E_{3\_norm}}{stddev}    (4.20)
Finally, we apply these weights in Eq. 4.5. The timer is then reset, and after the
next time window the weights are calculated again.
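For illustration, the sketch below follows steps 8–11 literally, given the per-subnetwork mean errors E1, E2, E3 from step 7: the three errors are centered and divided by their (population) standard deviation, and the standardized values are used as w1, w2, w3 in Eq. 4.5. Function and variable names are illustrative, and the text does not specify how negative standardized values are handled.

```python
import numpy as np

def update_weights(mean_errors):
    """Standardize the three mean photometric errors (steps 8-11).

    mean_errors: array-like with E1, E2, E3 accumulated over the last N frames.
    Returns the standardized values used as w1, w2, w3 in Eq. 4.5.
    """
    e = np.asarray(mean_errors, dtype=float)
    mean = e.mean()                            # Eq. 4.13
    centered = e - mean                        # center around 0
    stddev = np.sqrt((centered ** 2).mean())   # Eq. 4.17 (population std-dev)
    return centered / stddev                   # Eqs. 4.18-4.20
```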
The pose network's first two convolution layers have kernel sizes of 7 and 5, and the
rest have a kernel size of 3. These are followed by the sixth and seventh convolution
layers; at the seventh convolution layer, we obtain the rotation and translation of
the camera. All convolution layers except the last are followed by the ReLU activation
function.
We generate a target image It_pred using the source images, the predicted depth, and
the predicted pose, and compare the generated image It_pred with the original target
image It using a pixel-wise loss function. If the pixel-wise loss is small, the
network is predicting depth and pose successfully.
The first step is to back-project a point on the 2D image plane to a 3D coordinate. In
section 2.2.3.1 of chapter 2, we saw how a point in the 3D world coordinate system is
projected onto the 2D image plane:
O_{image} = K[R|t]\, O_{world}
Inverting this projection for a pixel (u, v) with predicted depth Z gives the
camera-frame coordinates
X = \frac{Z(u - u_o)}{s f_x}    (4.21)
Y = \frac{Z(v - v_o)}{s f_y}    (4.22)
Where s is the scale factor, (u_o, v_o) is the principal point, and f_x, f_y are the
focal lengths of the camera.
Now that we have back-projected the point from the 2D target frame into the 3D camera
coordinate frame, we apply a rigid transformation to that point using the predicted
camera pose to obtain the corresponding point in It−1 and It+1. The transformed 3D
points are then projected onto the image plane.
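A per-pixel sketch of the steps just described (Eqs. 4.21–4.22 followed by the rigid transform and re-projection) is shown below; in the actual pipeline this is done for all pixels at once as a batched tensor operation, and the function here is only illustrative.

```python
import numpy as np

def backproject_transform_project(u, v, Z, K, R, t, s=1.0):
    """Back-project pixel (u, v) with depth Z (Eqs. 4.21-4.22), apply the
    predicted motion [R|t], and re-project into the source view.

    K is the 3x3 intrinsic matrix; s is the scale factor used in the text.
    """
    fx, fy = K[0, 0], K[1, 1]
    u0, v0 = K[0, 2], K[1, 2]

    # back-project to camera coordinates of the target frame
    X = Z * (u - u0) / (s * fx)
    Y = Z * (v - v0) / (s * fy)
    P = np.array([X, Y, Z])

    # rigid transform into the source frame, then pinhole projection
    P_src = R @ P + t
    p = K @ P_src
    return p[0] / p[2], p[1] / p[2]   # pixel coordinates in the source image
```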
The key component of the learning framework is the differentiable depth image-based
renderer [30], which produces the new target view by sampling the source view using
the predicted depth D̂_t and the predicted pose T̂_{t→s} = [R|t]:
p_s \sim K \hat{T}_{t \to s} \hat{D}_t(p_t) K^{-1} p_t    (4.23)
Where p_s is the (homogeneous) pixel coordinate in the source frame, p_t is the
(homogeneous) pixel coordinate in the target frame, and K is the camera intrinsic
matrix.
To populate the pixels, the STN's [30] bilinear sampling method, as shown in Eq. 4.24,
takes the pixels from the source images It−1 and It+1 and matches them to the
corresponding pixel coordinates in the target image. By doing this, we obtain the
predicted target image It_pred.
\hat{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\},\, j \in \{l,r\}} w^{ij} I_s(p_s^{ij})    (4.24)
Where p_s^{ij} are the four pixel neighbours (top-left, top-right, bottom-left,
bottom-right) of p_s in the source frame, w^{ij} are the corresponding bilinear
interpolation weights, and Î_s(p_t) = I_s(p_s) is the value of the target pixel p_t
sampled from the source frame.
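A minimal sketch of this bilinear sampling step using PyTorch's differentiable grid_sample, which interpolates the four neighbouring source pixels, is given below; the coordinate layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def sample_source(source_img, src_pixel_coords):
    """Sample the source image at the projected coordinates p_s (Eq. 4.24).

    source_img:       (B, 3, H, W) source frame I_{t-1} or I_{t+1}
    src_pixel_coords: (B, H, W, 2) projected (x, y) pixel coordinates
    """
    b, _, h, w = source_img.shape
    # normalize coordinates to [-1, 1] as required by grid_sample
    x = 2.0 * src_pixel_coords[..., 0] / (w - 1) - 1.0
    y = 2.0 * src_pixel_coords[..., 1] / (h - 1) - 1.0
    grid = torch.stack([x, y], dim=-1)
    # bilinear interpolation of the four neighbouring source pixels
    return F.grid_sample(source_img, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=True)
```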
4.5.1 Smooth Loss
The predicted depth needs to be smooth, which we encourage with an L1 penalty on its
second-order gradients: the gradient of the depth along the x-axis is differentiated
again with respect to both x and y, and likewise for the gradient along the y-axis.
The smooth loss is given as
L_{sl} = \frac{1}{N} \sum_{i,j} \left| \partial_x^2 d_{i,j} \right| + \left| \partial_x \partial_y d_{i,j} \right| + \left| \partial_y \partial_x d_{i,j} \right| + \left| \partial_y^2 d_{i,j} \right|    (4.26)
Where d_{i,j} is the predicted depth at pixel (i, j) and N is the number of pixels.
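A sketch of this second-order smoothness penalty written as finite differences of the predicted depth map is shown below; it mirrors the common SfmLearner-style implementation and is not necessarily identical to the thesis code.

```python
def smooth_loss(depth):
    """Second-order L1 smoothness penalty on the depth map (Eq. 4.26).

    depth: tensor of shape (B, 1, H, W).
    """
    dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]   # first-order x-gradient
    dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]   # first-order y-gradient

    dxx = dx[:, :, :, 1:] - dx[:, :, :, :-1]        # d^2/dx^2
    dxy = dx[:, :, 1:, :] - dx[:, :, :-1, :]        # d^2/(dy dx)
    dyx = dy[:, :, :, 1:] - dy[:, :, :, :-1]        # d^2/(dx dy)
    dyy = dy[:, :, 1:, :] - dy[:, :, :-1, :]        # d^2/dy^2

    return (dxx.abs().mean() + dxy.abs().mean()
            + dyx.abs().mean() + dyy.abs().mean())
```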
The synthesized target image is generated in the view synthesis block discussed
earlier in this chapter.
The KITTI dataset consists of 96 thousand images of outdoor scenes. The scenes are
divided into five categories: city, residential, roads, campus, and pedestrians. The
data was recorded using a high-resolution stereo camera rig and a Velodyne laser
scanner (LiDAR) to capture the depth of the environment, all mounted on a Volkswagen,
as shown in Figure 4.7.
There are 151 sequences in total, and the left and right images are provided for each
frame. The raw RGB images and the raw LiDAR scans can be downloaded from CVlabs [7].
Figure 4.7: A stereo rig and a Velodyne laser scanner mounted on a Volkswagen
The image resolution depends on the calibration parameters, but it is approximately
1242 × 375 pixels. Figure 4.8 shows an example of an input image.
The common augmentations are small random rotations, random horizontal flips, color
augmentations, and translation.
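For illustration, a torchvision-style pipeline covering the augmentations named above is sketched below; the parameter ranges are assumptions, and in practice the geometric transforms must be applied consistently to all frames of a snippet (with the intrinsics adjusted) rather than independently per image.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=2),                        # small random rotation
    T.RandomHorizontalFlip(p=0.5),                      # random horizontal flip
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.2),                      # color augmentation
    T.RandomAffine(degrees=0, translate=(0.02, 0.02)),  # small translation
])
```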
To evaluate the predicted depth, we use the following standard error metrics.
RMSE = \sqrt{\frac{1}{T} \sum_{i} \| d_i - d_i^* \|^2}    (4.28)
RMSE(\log) = \sqrt{\frac{1}{T} \sum_{i} \| \log(d_i) - \log(d_i^*) \|^2}    (4.29)
ARD = \frac{1}{T} \sum_{i} \frac{| d_i - d_i^* |}{d_i^*}    (4.30)
SRD = \frac{1}{T} \sum_{i} \frac{\| d_i - d_i^* \|^2}{d_i^*}    (4.31)
We also use accuracy metrics with a threshold. The accuracy is calculated by dividing
the number of pixels whose error ratio is below the threshold by the total number of
pixels in the image.
\frac{1}{T} \sum_{i} \left( \max\left( \frac{y_i}{y_i^*}, \frac{y_i^*}{y_i} \right) = \delta < thr \right), \quad thr = [\lambda, \lambda^2, \lambda^3]    (4.32)
For evaluating the predicted pose, we use the root-mean-square absolute trajectory error:
ATE_{rmse} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \| trans(E_i) \|^2}    (4.33)
Where E_i is the trajectory error at frame i, computed from the estimated and
ground-truth poses, and P_i is the pose at time i.
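As a reference, the NumPy sketch below computes the depth metrics of Eqs. 4.28–4.32, assuming the customary threshold base λ = 1.25 and that pred and gt contain only valid, positive ground-truth pixels; these assumptions are not stated explicitly in the text.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Depth-evaluation metrics (Eqs. 4.28-4.32) over valid pixels.

    pred, gt: 1-D arrays of predicted and ground-truth depths d_i, d_i*.
    """
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean(((pred - gt) ** 2) / gt)

    ratio = np.maximum(pred / gt, gt / pred)
    a1 = np.mean(ratio < 1.25)        # threshold lambda
    a2 = np.mean(ratio < 1.25 ** 2)   # lambda^2
    a3 = np.mean(ratio < 1.25 ** 3)   # lambda^3
    return rmse, rmse_log, abs_rel, sq_rel, a1, a2, a3
```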
We kept the batch size at 2 because of limited resources; we use an 11 GB RTX 2080
graphics card to train our networks. We first train the network for 200 epochs using
the Adam optimizer [42]. After that, we change the optimizer from Adam to SGD [43] and
train the network for 3 more epochs to analyze whether the change affects the
predictions. The other hyperparameters are shown in Table 4.2.
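A schematic of this two-phase training schedule (200 epochs with Adam, then 3 with SGD) is sketched below; the learning rates and momentum are placeholders, not values taken from Table 4.2.

```python
import torch

def train(model, loader, loss_fn):
    """Two-phase training: 200 epochs with Adam, then 3 more with SGD."""
    phases = [
        (200, torch.optim.Adam(model.parameters(), lr=2e-4)),
        (3,   torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)),
    ]
    for num_epochs, opt in phases:
        for _ in range(num_epochs):
            for batch in loader:
                opt.zero_grad()
                loss = loss_fn(model, batch)   # photometric + smoothness losses
                loss.backward()
                opt.step()
```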
Chapter 5
Results
The networks used in this research were trained, evaluated, and tested on the KITTI
dataset using an RTX 2080 (11 GB) GPU. The networks took 2.5 months to train due to
the heavy computation involved. In this chapter, we compare the results of both the
depth and pose networks qualitatively and quantitatively with previous research.
The loss function is an essential part of a learning algorithm for finding the
parameters. Since our method is unsupervised, we adopted the view synthesis approach
described in chapter 4: a new view is synthesized, and the difference between the
target image and the synthesized view is taken. Figure 5.2 shows the synthesized image
and the difference between the target and synthesized images.
Figure 5.2: Difference between the target image and synthesized image
We test the model on the test images. The test images come from different scenes and
are used for visual inspection, as each image highlights the model's strengths and
weaknesses. First, we train the network with the Adam optimizer and then train it for
three more epochs using the SGD optimizer. Both versions are then tested on the test
images, as shown in Figure 5.3. In the predicted maps, bright regions correspond to
large pixel values and faded regions to small pixel values; large values indicate that
the scene content is near the camera, while small values indicate that it is far away.
Training for three more epochs with the SGD optimizer did not noticeably change the
test images, except for the fourth image, where the traffic pole shows some
improvement and gains more structure.
Then we compared the results with previous research, as shown in Figure 5.4. The 2nd
and 3rd columns show methods that predict depth in a supervised fashion, while the 4th
column shows a method that predicts depth in an unsupervised fashion. We can see that
the depth predictions improve when the weighted average depth is used. The previous
methods show blurry predictions; [16] gave good depth predictions, but they were still
blurry. Improving the depth network did give good predictions, but they still
contained some artifacts.
Figure 5.4: Comparison of our methods with the previous methods
Table 5.1 shows the error metrics calculated over the entire test dataset. The
original paper did not use data augmentation, and its results are available in [70].
By changing the depth networks and taking the weighted average, the results have
improved. We limit the ensemble to three depth networks because of limited resources.
Table 5.1: Depth evaluation results on KITTI dataset
The aim is to see whether improving the depth also improves the pose. Table 5.2 shows
the absolute trajectory error (ATE) and its standard deviation calculated using the
pose network. The error metrics are calculated on sequences 09 and 10 of the odometry
data [46]. The results show that improving the depth also improved the pose, but
Figure 5.5 showed that the trajectories were too far from the ground truth, which
indicates that the algorithm suffers from overfitting.
Method                      Seq. 09 ATE   Seq. 09 Std.   Seq. 10 ATE   Seq. 10 Std.
ORB-SLAM (Full)             0.014         0.008          0.012         0.011
ORB-SLAM (Short)            0.064         0.141          0.064         0.130
Mean Odometry               0.032         0.026          0.028         0.023
Zhou et al. (Original)      0.021         0.017          0.020         0.015
Zhou et al. (Data Aug.)     0.0179        0.011          0.0141        0.0115
Our (Adam Opt.)             0.0098        0.0062         0.0083        0.0067
Our (SGD Opt.)              0.0092        0.0060         0.0081        0.0066
Table 5.2: Absolute Trajectory Error (ATE) on KITTI test split
Chapter 6
Conclusion and Future Work
Depth and Ego-Motion estimation are demanding tasks in computer vision. Currently,
most of the focus has shifted from stereo vision towards monocular vision, even though
monocular depth estimation is an inherently ill-posed problem. To address this
problem, machine learning and deep learning techniques are used. Supervised learning
methods have still shown superior results compared to unsupervised learning methods;
however, the cost of obtaining the ground truth is high and requires extensive labor.
We saw that, when using the weighted average depth, the depth estimation had no effect
on the camera pose, and in the previous section we also discussed the drift that
occurred and its causes. Since we use the KITTI dataset, whose high-quality images
were captured with a very well-calibrated camera, the dataset is not the cause of the
drift we observed. The leading cause is most likely the pose network; to solve this,
we need another pose network that can accurately estimate the camera's pose and avoid
overfitting the data.
References
[1] R. C. Gonzalez, Digital image processing. Pearson education India, 2009.
[2] T. C. School, “What is a video?.” https://www.youtube.com/watch?v=
9CSjUl-xKSU&ab_channel=TheCutSchool.
[3] C.. Tutorial, “Learning-based depth estimation from stereo and monocular
images: successes, limitations and future challenges,” 2017.
[4] Y. Ming, X. Meng, C. Fan, and H. Yu, “Deep learning for monocular depth
estimation: A review,” Neurocomputing, vol. 438, pp. 14–33, 2021.
[5] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single monocular
images,” Advances in neural information processing systems, vol. 18, 2005.
[6] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single
image using a multi-scale deep network,” Advances in neural information
processing systems, vol. 27, 2014.
[7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI
dataset,” The International Journal of Robotics Research, 2013.
[8] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of
depth and Ego-Motion from video,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 1851–1858, 2017.
[9] G. Grisetti, R. Kümmerle, C. Stachniss, and W. Burgard, “A tutorial on
graph-based slam,” IEEE Intelligent Transportation Systems Magazine, vol. 2,
no. 4, pp. 31–43, 2010.
[10] D. Scaramuzza and F. Fraundorfer, “Tutorial: visual odometry,” IEEE Robot.
Autom. Mag, vol. 18, no. 4, pp. 80–92, 2011.
[11] M. S. N. G. Hanna, Vehicle Distance Detection Using Monocular Vision and
Machine Learning. PhD thesis, University of Windsor (Canada), 2019.
[12] D. A. Forsyth and J. Ponce, Computer vision: a modern approach. prentice
hall professional technical reference, 2002.
[13] X. Yang, H. Luo, Y. Wu, Y. Gao, C. Liao, and K.-T. Cheng, “Reactive
obstacle avoidance of monocular quadrotors with online adapted depth
prediction network,” Neurocomputing, vol. 325, pp. 142–158, 2019.
[14] R. Szeliski, Computer vision: algorithms and applications. Springer Nature, 2022.
[15] J. Heaton, “Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep learning,”
Springer, 2018.
[16] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep
learning,” arXiv preprint arXiv:1603.07285, 2016.
[17] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural
network for joint depth and surface normal estimation,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 283–291,
2018.
[18] “Feature matching (opencv).” https://docs.opencv.org/4.0.0/dc/dc3/
tutorial_py_matcher.html.
[19] “Harris corner detection (opencv).” https://docs.opencv.org/4.0.0/dc/
d0d/tutorial_py_features_harris.html.
[20] M. Trajković and M. Hedley, “Fast corner detection,” Image and vision
computing, vol. 16, no. 2, pp. 75–87, 1998.
[21] “Introduction to sift (scale-invariant feature transform).” https://docs.
opencv.org/4.x/da/df5/tutorial_py_sift_intro.html.
[22] R. M. Schmidt, “Recurrent neural networks (rnns): A gentle introduction and
overview,” arXiv preprint arXiv:1912.05911, 2019.
[23] F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth
estimation from a single image,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 5162–5170, 2015.
[24] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper
depth prediction with fully convolutional residual networks,” in 2016 Fourth
international conference on 3D vision (3DV), pp. 239–248, IEEE, 2016.
[25] J. Xie, R. Girshick, and A. Farhadi, “Deep3d: Fully automatic 2d-to-3d video
conversion with deep convolutional neural networks,” in European conference
on computer vision, pp. 842–857, Springer, 2016.
[26] C. Fehn, “Depth-image-based rendering (dibr), compression, and transmission
for a new approach on 3d-tv,” in Stereoscopic displays and virtual reality
systems XI, vol. 5291, pp. 93–104, SPIE, 2004.
[27] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for
real-time 6-dof camera relocalization,” 2015.
[28] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to
predict new views from the world’s imagery,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 5515–5524, 2016.
[29] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular
depth estimation with left-right consistency,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 270–279, 2017.
[30] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,”
Advances in neural information processing systems, 2015.
[31] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K.
Fragkiadaki, “Sfm-net: Learning of structure and motion from video,” arXiv
preprint arXiv:1704.07804, 2017.
[32] D. Tan, “Sfm self supervised depth estimation: Breaking down the ideas,”
2020.
[33] Q. Sun, Y. Tang, C. Zhang, C. Zhao, F. Qian, and J. Kurths, “Unsupervised
estimation of monocular depth and vo in dynamic environments via hybrid
masks,” IEEE Transactions on Neural Networks and Learning Systems, vol.
33, no. 5, pp. 2023–2033, 2021.
[34] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T.
Brox, “A large dataset to train convolutional networks for disparity, optical
flow, and scene flow estimation,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 4040–4048, 2016.
[35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical
image computing and computer-assisted intervention, pp. 234–241, Springer,
2015.
[36] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using
adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 4009–4018, 2021.
[37] G. Marques, D. Agarwal, and I. de la Torre Díez, “Automated medical
diagnosis of covid-19 through efficientnet convolutional neural network,”
Applied soft computing, vol. 96, p. 106691, 2020.
[38] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T.
Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An
image is worth 16x16 words: Transformers for image recognition at scale,”
arXiv preprint arXiv:2010.11929, 2020.
[39] “Preparing dataset.” https://github.com/ClementPinard/SfmLearner-Pytorch/tree/master/data.
[40] L. Perez and J. Wang, “The effectiveness of data augmentation in image
classification using deep learning,” arXiv preprint arXiv:1712.04621, 2017.
[41] D. Prokhorov, D. Zhukov, O. Barinova, K. Anton, and A. Vorontsova,
“Measuring robustness of visual slam,” in 2019 16th International Conference
on Machine Vision Applications (MVA), pp. 1–6, IEEE, 2019.
[42] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[43] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv
preprint arXiv:1609.04747, 2016.
[44] Y. Wu, L. Liu, J. Bae, K.-H. Chow, A. Iyengar, C. Pu, W. Wei, L. Yu, and Q.
Zhang, “Demystifying learning rate policies for high accuracy training of deep
neural networks,” in 2019 IEEE International conference on big data (Big
Data), pp. 1971–1980, IEEE, 2019.
[45] G. Goh, “Why momentum really works,” Distill, 2017.
[46] A. Geiger, P. Lenz, and R. Urtasun, “Visual odometry data,” 2012.