Master Thesis
By
Shah Zeb
CUI/FA19-RCE-008/ISB
MS Thesis
In
Computer Engineering
A Thesis Presented to
In partial fulfillment
MS in Computer Engineering
By
Shah Zeb
FA19-RCE-008
Unsupervised Learning of Depth and Ego-Motion
from Video
A Post Graduate Thesis submitted to the Department of Electrical and
Computer Engineering as partial fulfillment for the award of a Degree M.S. in
Computer Engineering.
Supervisor
Assistant Professor
Islamabad Campus
January 2023
Final Approval
Shah Zeb
FA19-RCE-008
Supervisor: ________________________________________________
HOD: ____________________________________________________
Declaration
I, Shah Zeb, Registration# FA19-RCE-008, hereby declare that I have produced the
work presented in this thesis during the scheduled study period. I also declare that I
have not taken any material from any source except where due reference is made, and
that the amount of plagiarism is within an acceptable range. If a violation of HEC rules
on research has occurred in this thesis, I shall be liable to punishable action under the
plagiarism rules of the HEC.
Signature of Student
Certificate
It is certified that Shah Zeb, Registration# FA19-RCE-008, has carried out all the
work related to this thesis under my supervision at the Department of Electrical and
Computer Engineering, COMSATS University Islamabad, Islamabad Campus, and
the work fulfills the requirement for the award of MS degree.
__________________________________
Dr. Omar Ahmad
Assistant Professor
Head of Department
_________________________________
Dr. Shurjeel Wyne
Department of Electrical and Computer
Engineering
DEDICATION
ACKNOWLEDGEMENTS
First, I would like to thank Allah SWT for the countless blessings He has showered on me
throughout my life. He has always given me the best opportunities regardless of my
weakness. I pray to Allah SWT to allow me to be His humble servant and bless my
family and me with steadfastness in His religion. Then, I express my heartiest
gratitude towards my supervisor Dr. Omar Ahmad. This work would not have been
possible without his guidance and support. His strong command of my area of
research and extraordinary problem-solving skills are the key factors in completing
this thesis. I will never forget his kind behavior while conveying technical arguments
about the research topic. Indeed it was an honor to work with such a friendly,
thorough, and dedicated professional. I am grateful to Dr. Usman Qayyum from the
National Electronics Complex (NECOP), who introduced me to my area of research,
supported me, and provided me with the facilities to carry out my research work. This
work would not have been possible without him. I am thankful to all my course instructors at
COMSATS University Islamabad for developing my knowledge base in Computer
Engineering during the course work that helped me choose the area of research for my
Master's. I am also thankful to Dr. Haroon and Dr. Sufwan for their valuable
suggestions. I would also like to thank my family, including my father, mother,
brother, and sister, for their unwavering support in continuing my studies. They were
always there to help me with all their abilities. Special thanks to my father for his
encouragement and moral support. He pushed me beyond my limits from the start of
this thesis till this point.
ABSTRACT
TABLE OF CONTENTS
1. Introduction ............................................................................................ 1
2. Background ............................................................................................. 7
3.1.2 Fully Convolutional Residual Networks, 2016 ............................. 38
4. Methodology.......................................................................................... 47
4.6.1 Splitting KITTI Dataset for Training, Validation & Testing ........ 59
4.7 Error Metrics for Quantitative Analysis .............................................. 60
5. Results.................................................................................................... 62
7. Reference ............................................................................................... 70
LIST OF FIGURES
Figure 1.1: Elements and their pixel values of a Digital Image ................................................. 2
Figure 1.2: Depth Perceived by Humans ................................................................................... 2
Figure 1.3: Stereo Vision System .............................................................................................. 3
Figure 1.4: Example of Ego-Motion .......................................................................................... 4
Figure 2.1: A Digital Image ....................................................................................................... 8
Figure 2.2: Image obtained on Cathode Ray Tube (CRT) .......................................... 9
Figure 2.3: Digital Image Acquisition Process .......................................................................... 9
Figure 2.4: Sequence of frames from time t-1 to t+2 ............................................................... 10
Figure 2.5: Applications of VO ............................................................................................... 11
Figure 2.6: Occluded Points ..................................................................................................... 12
Figure 2.7: Stereo Camera Model ............................................................................................ 12
Figure 2.8: Stereo Camera Model in Bird’s Eye View ............................................................ 13
Figure 2.9: Brute force solution by searching the whole image for each pixel........................ 15
Figure 2.10: Movement of 3D point along the epipolar line.................................................... 15
Figure 2.11: Movement of 3D point along the epipolar line.................................................... 16
Figure 2.12: Properties of Input Image .................................................................................... 17
Figure 2.13: Each channel applied to the respective channel of an RGB input volume .......... 17
Figure 2.14: Process of cross-correlation................................................................................. 18
Figure 2.15: Applying two filters on an RGB input volume ................................................... 19
Figure 2.16: Applying max pool function on a 4x4 matrix...................................................... 20
Figure 2.17: Deconvolving a 3 x 3 feature map ....................................................................... 21
Figure 2.18: General Encoder-Decoder for depth estimation .................................................. 22
Figure 2.19: General supervised framework for depth estimation ........................................... 23
Figure 2.20: General unsupervised framework for depth estimation using stereo image pairs23
Figure 2.21: General unsupervised framework for depth and pose estimation ........................ 24
Figure 2.22: Types of Ego-Motion .......................................................................................... 25
Figure 2.23: Pose of a Car ....................................................................................... 26
Figure 2.24: General process of Visual Odometry .................................................................. 26
Figure 2.25: General process of Visual Odometry .................................................................. 27
Figure 2.26: Example of image stitching technique ................................................................ 27
Figure 2.27: Interesting features .............................................................................................. 28
Figure 2.28: Illustration of feature description ........................................................................ 29
Figure 2.29: Computing the descriptor .................................................................................... 30
Figure 2.30: Matching feature from image 1 to image 2 ......................................................... 31
Figure 2.31: Sequences of frames concatenated with each other............................................. 32
Figure 2.32: Projection of 3D to 2D using camera motion ...................................................... 33
Figure 2.33: General Supervised Learning Pose Estimation Architecture .............................. 34
Figure 3.1: Network architecture of coarse and fine network .................................................. 37
Figure 3.2: Up-Projection Block .............................................................................................. 38
Figure 3.3: ResNet Network architecture with up sampling block .......................................... 39
Figure 3.4: Convolution Block In Residual Network .............................................................. 39
Figure 3.5: Fast Up-projection Block ...................................................................................... 40
Figure 3.6: ResNet Network architecture with upsampling block ......................... 41
Figure 3.7: PoseNet Architecture .............................................................................. 41
Figure 3.8: Basic architecture of DeepStereo .......................................................................... 42
Figure 3.9: Base architecture of depth estimation without LR ................................................ 43
Figure 3.10: Main architecture for depth estimation for LR .................................................... 43
Figure 3.11: SfM-Net architecture ........................................................................................... 44
Figure 3.12: Overview of supervision pipeline ....................................................... 46
Figure 4.1: Our modified network ........................................................................................... 48
Figure 4.2: Disparity Network architecture ............................................................................. 49
Figure 4.3: The AdaBins Architecture ..................................................................................... 50
Figure 4.4: Residual Block ...................................................................................................... 51
Figure 4.5: Pose Network ........................................................................................................ 55
Figure 4.6: View Synthesis Block ........................................................................................... 55
Figure 4.7: A stereo rig and a Velodyne laser scanner mounted on a Volkswagen ................. 59
Figure 4.8: Example of an input image from KITTI dataset.................................................... 59
Figure 5.1: Depth Prediction on The Validation Set ................................................................ 63
Figure 5.2: Difference between the target image and synthesized image ................................ 63
Figure 5.3: Results on test set .................................................................................................. 64
Figure 5.4: Comparison of our methods with the previous methods ....................................... 65
Figure 5.5: Trajectory Plot of Sequence 9 and 10.................................................................... 65
LIST OF TABLES
Chapter 1
Introduction
A digital image is a two-dimensional function f(x, y) having a finite set of elements,
where (x, y) are the coordinates that give a particular position to an element in an
image. The elements in a digital image are referred to as pixels. Each pixel has an
amplitude called intensity, and the intensity of a pixel is a scalar value. In the case of
a grayscale image, as shown in figure 1.1, the intensity of a pixel can be between 0
and 255 [1].
A sequence of still images is used to create a video by displaying the images in rapid
succession such that it gives the perception of motion. The frequency at which the
images are displayed is called the frame rate. Different frame rates impact how we
perceive motion and the speed of motion which appears on a screen [2].
Humans can estimate the distance of the object they can see and figure out the 3D
structure of the objects in a scene by combining the information obtained using their
eyes very efficiently. On a computing platform, the task of estimating the distance
from the object is known as depth estimation, i.e., how far or near objects are from the
observer. Depth can be estimated from multiple views or a single view.
Our eyes can capture a point from two views, i.e., from the left eye and the right eye,
which deduces valuable information from the scene quickly and efficiently, allowing
us to perform feats such as avoiding a collision or catching a ball. This approach of
estimating the depth using multiple views is known as stereopsis which is solved
using stereo vision cameras (multiple view cameras) [3].
Such simple tasks that humans efficiently perform are difficult for a computer to
perform using a video. These abilities that look simple at first glance need to be
designed with great diligence for the computer to achieve. Many researchers have
submitted proposals to address this problem. They have shown remarkable results,
though these methods require extensive computational resources, and delays can be
noted in the output. Hence, monocular vision cameras (single-view cameras) were then
adopted for depth estimation.
The monocular depth estimation task was challenging to achieve using the same
method as stereo. To solve this, a learning-based method was adopted. Broadly, the
machine learning methods relevant here are the supervised learning method, which
requires ground-truth data for the model to learn from, and the unsupervised learning
method, in which the model is its own teacher [4].
Monocular depth estimation using supervised learning was first solved by Saxena et
al. [5], using Markov Random Fields (MRFs) by training the model on Make3D data.
The problem with this was that very little data was used. Little attention was paid to
the problem until 2014, when Eigen et al. [6] used Convolutional Neural
Networks to estimate the depth, which will be explained in detail in chapter 3. The
network was trained on the KITTI dataset [7], and since then, deep convolutional
neural networks have gathered great attention.
This type of learning method required ground-truth data, which needed expensive
labor and equipment and cost much time. To train a network without using
ground truth, Zhou et al. [8] estimated the depth from a single image using an
unsupervised method which requires no ground truth at all. For learning to be carried
out, a new view was synthesized using image-warping techniques.
We can use depth information to navigate through the environment, avoid collisions,
and know our own and others' positions in a scene through a process called Ego-Motion
estimation: being aware of where you are and what everything around you is doing.
Navigation has been possible using sensors such as the Global Positioning Systems
(GPS) from which the Ego-Motion of an object can easily be estimated without any
known data [9]. Sometimes the GPS might fail to work when the signals are jammed
or an object enters an environment such as underwater or space where the GPS signal
starts to attenuate. At such times, a vision system can be used in places where GPS
signals do not work [10].
A vision-based navigation system and visual odometry (VO), which is a type of Ego-
Motion, are some alternatives to this. VO uses a camera sensor to estimate the Ego-
Motion of an object from one frame to another frame of a video sequence. Traditional
indirect methods extracted features and matched them between image frames. Then
image geometry and perspective changes were used to estimate the motion. Direct VO
methods would use photometric consistency error in the pixel intensity values to
determine the camera motion [10].
Recently Convolutional Neural Networks, just as for depth estimation, have been used
to estimate the Ego-Motion of objects and have shown relatively good results.
1.1 Limitations
While deep learning techniques to estimate depth and Ego-Motion from a video have
shown relatively good results, some things could still be improved. To perform well,
the environment should have sufficient light [10]. Weather changes can also affect
estimating both depth and Ego-Motion. If an aerial vehicle moves over an ocean, it
might be challenging to estimate Ego-Motion as there will not be enough features for
calculations [11].
Stereo vision systems have successfully injected scale information into Ego-Motion
because of the baseline distance between the cameras. Monocular vision systems can
estimate 3-dimensional rotation and translation from 2D images. However, they cannot
accurately determine the scale of the translation, and some external sense of scale
needs to be provided [10].
1.2 Motivation
The motivation of this thesis is to implement monocular depth & Ego-Motion using
an unsupervised technique and examine the effects of depth network on the Ego-
Motion/pose network. We use a similar method as Zhou et al. [8], using multiple
depth networks and taking the weighted average of all the outputs obtained from the
three depth networks. Since Ego-Motion or pose estimation is assisted by the depth
network, improvements in depth should also improve the estimated camera pose. We train
the network using the KITTI dataset and then evaluate it by comparing the results
with previous methods.
supervised learning, so to improve the performance of depth estimation we use a
weighted average that combines three unsupervised networks in an ensemble to achieve
a more reliable unsupervised paradigm.
Chapter 2 deals with the theoretical concepts related to depth and Ego-Motion
estimation from a video. We show in detail how a video works and how we can
estimate depth and Ego-Motion from it. We also show the advantages and
disadvantages of the approaches used in estimating depth and Ego-Motion.
Chapter 3 is the literature review, where we will look at architectures of the depth and
Ego-Motion used by researchers.
Chapter 5 benchmarks the qualitative and quantitative results with previous research.
Chapter 2
Background
The purpose of this chapter is to introduce the basic concepts used throughout this
thesis. We will look at an image and how we make a video from images. Then we will
look at how we estimate depth using traditional techniques. In section 2.2.5, we will
look at how convolutional neural networks work. Then we discuss the different
learning methods for estimating depth. Then we look at what Ego-Motion is and how
it was traditionally estimated. Finally, we will discuss the learning-based methods
used to estimate Ego-Motion.
An image is obtained from both analog and digital signals. Back in the 1940s, a
cathode tube was used in a camera to take pictures, as shown in figure 2.2. A lens
focused the scene on a plate in front of the tube. The cathode tube would sweep a
stream of electrons across the plate back and forth, turning the scene's brightness and
darkness into voltage. High voltage amplitude corresponded to high brightness, and low
voltage amplitude to low brightness. These signals were then sent to a monitor,
i.e., the cathode ray tube (CRT), which pushed the electrons to the monitor's screen.
The CRT moved the electron beam from left to right to create the original scene.
These days, devices are used to convert scenes to digital data. However, an analog
signal carrying some form of energy is still required to hit the sensor. The
imaging sensor transforms this signal into a digital image. The particular
sensor we use is called an array sensor, which is the most widely used in cameras. The
sensor digitizes the analog signal, and each sensor element responds to the analog
signal. The output response is a digital image, as shown in figure 2.3.
2.1.2 Video using Images
In the real world, everything is in motion. For example, the wind blows the trees, and
cars move on the roads, tennis players swing their rackets, etc. Things are moving,
and our perception system handles them very well. We also want to deal with motion
in computer vision, but nothing is moving in a single digital image. When sequences
of images are displayed one after another, separated by only a short time, we get the
perception of motion [12]. When dealing with computers and machines, we want
them to have the same motion perception as humans.
When talking about motion, we talk about video. In the past, when someone would
talk about a video, people would refer to it as a sequence of images; these days, we
instead use the term frames. A video is a sequence of frames captured over time relatively quickly. When a
camera records a video, it means that the camera captures images at regular intervals
with minor changes in a scene.
Now our images are not just a function of space (x, y) but space (x, y) and time t. It
can also be written as f(x, y, t). Remember that when dealing with this 3-dimensional
function, we do not mean that time and space are the same. Figure 2.4 shows an
example sequence of frames where each frame is captured at a different interval.
The rate at which frames appear is measured by frame rate, sometimes called frames
per second. Different frame rates have a different impact on how motion is perceived.
The smaller the frame rate, the more jittery the motion in a video; the greater the
frame rate, the smoother the video. The most common frame rate for video is 30 fps
or 60 fps.
measure of obtaining the distance of an object from the camera. This information can
then be used to create a 3D representation of the scene.
(a) Mars Exploration Rover (b) Google’s Self-Driving Car (c) Starbug X sub-marine
Figure 2.5: Applications of VO
LiDAR, radar, and other range sensors have been used to capture the depth of the
environment, but in computer vision, we use cameras to get the depth information.
This is useful in vehicles like self-driving cars and can also be used in robot-human
interaction, augmented reality, and virtual reality.
Most methods for depth estimation are based on finding correspondences between
points in different images and then using triangulation to reconstruct the 3D positions
of those points. However, this approach is less reliable when occlusions exist, as shown
in figure 2.6.
Figure 2.6: Occluded Points
2.2.3.1 Stereopsis
For stereopsis, we use stereo vision cameras, which have nowadays been used in self-
driving cars, mobile cameras, etc. A stereo sensor [12] is created by placing two
cameras in parallel whose optical axes align. Given the rotation R and translation t,
a point O projects onto the left and right image planes at OL and OR, respectively [14].
2. While manufacturing the stereo setup, we try as much as possible to keep the
two cameras' optical axes aligned.
Let us list some critical parameters. The distance between the camera center and
image planes is called the focal length f. The baseline distance b is the distance
between the centers of the two cameras along the x-axis. We assume that the rotation
matrix R is identity and the x-component in the translation vector t is zero.
Let’s see this structure in a bird’s eye view, as shown in figure 2.8.
We first want to compute point O's z and x coordinates with respect to the left camera frame. We
can see two similar triangles formed by the left camera, △CLZLO, and △CLXLO. The
equation formed is given as

\frac{Z}{f} = \frac{X}{X_L}    (2.1)

Similarly, we can say about the right camera, which forms △CRZRO and △CRXRO,

\frac{Z}{f} = \frac{X - b}{X_R}    (2.2)
We define the disparity d as the difference between the image coordinates of the same
pixel in left and right images.

d = X_L - X_R    (2.3)

Where:
X_L = u_L - u_o
X_R = u_R - u_o
Y_R = v_R - v_o
We now arrive at computing the 3D coordinates as follows.

From Eq 2.1:

Z X_L = f X    (2.4)

From Eq 2.2:

Z X_R = f (X - b)    (2.5)

Subtracting Eq 2.5 from Eq 2.4 gives Z (X_L - X_R) = f b, so with the disparity d from Eq 2.3 the depth is Z = \frac{f b}{d}. From Eq 2.4 and 2.5 we then have

X = \frac{Z X_L}{f}    (2.8)

Y = \frac{Z Y_R}{f}    (2.9)
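To make the triangulation concrete, the following sketch back-projects a single pixel to 3D using the relations above; the numerical values (focal length, baseline, principal point) are illustrative assumptions, not calibration values from this thesis.

```python
import numpy as np

def stereo_to_3d(u_l, v_l, disparity, f, b, u0, v0):
    """Back-project a left-image pixel (u_l, v_l) with a known disparity to a
    3D point, using Z = f*b/d and Eqs. 2.8-2.9."""
    x_l = u_l - u0            # image coordinate relative to the principal point
    y_l = v_l - v0
    Z = f * b / disparity     # depth from the subtraction of Eqs. 2.4 and 2.5
    X = Z * x_l / f           # Eq. 2.8
    Y = Z * y_l / f           # Eq. 2.9 (y coordinate is the same in both images)
    return np.array([X, Y, Z])

# Example: 8 px of disparity with a 720 px focal length and a 0.54 m baseline.
print(stereo_to_3d(u_l=660, v_l=360, disparity=8.0,
                   f=720.0, b=0.54, u0=620.0, v0=360.0))  # Z = 48.6 m
```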
Now that expressions for the 3D coordinates are available, we still need the focal
length f, the baseline b, and the x and y offsets uo and vo. These are obtained by
calibrating the stereo camera. Another thing that we need to do is compute the
disparity, which requires a specialized algorithm to perform the matching efficiently.
This setup has some limitations; for example, as a point moves further away, the accuracy
of its depth suffers a lot. However, stereo depth is still quite valuable for many computer vision
applications.
2.2.3.2 Computing the Disparity
We found the position of a point in 3D space, but there are still parameters that we
need to find, such as baseline b, disparity d, focal length f, and camera pixel centers
uo, vo. These days there is much software available to obtain these parameters.
However, we will see how disparity is computed using the stereo setup. The disparity
is “the difference in the image location of the same 3D point as observed by two
different cameras”. In this problem, we need to find the same point in both the left and
right cameras. The problem is known as the correspondence problem. The most naïve
technique is the exhaustive search method, where a window around a pixel in the left
image is compared against every location in the right image.
Figure 2.9: Brute force solution by searching the whole image for each pixel
But this method is inefficient and might not be good to run in real-time, especially in
self-driving cars.
To solve this problem, we can use stereo geometry to constrain the problem from 2D
over the entire image to a 1D line. We already know about the stereo setup, shown in
figure 2.7, where we observed a single point in both the left and right cameras. Now
let us move the 3D point along the line connecting it with the left camera center.
As we move the point, we will see in the left image that the projection does not
change. However, if we notice in the right camera, the projection of the point changes
along the horizontal line. This horizontal line is called the epipolar line [14], which
follows directly from the fixed lateral offset and image plane alignment of two
cameras in a stereo pair. The epipolar line is a straight horizontal line only when the
two cameras are parallel; otherwise, the line becomes skewed. It is known as multi-
view geometry.
15
Figure 2.11: Movement of 3D point along the epipolar line
A skewed epipolar line is not a big problem; it is resolved using stereo
rectification [12]. Going into the mathematical model of this is beyond the scope of
this work. So, we head back to the image in figure 2.10 and follow these
steps.
1. We put a horizontal line at the center of both images.
2. Compare the pixels from the left image to the pixels of the right image along the
epipolar line.
3. Pick the pixel that has the minimum cost. The cost can be the squared difference
between the pixel intensities.
4. Finally, we compute the disparity d by subtracting the right image location from
the left one. A minimal sketch of this procedure is given below.
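The sketch below scans along the same row of a rectified pair (the epipolar line) and picks, for one left-image pixel, the disparity with the minimum squared-difference cost over a small window; the window size and the search range are arbitrary choices made for this illustration.

```python
import numpy as np

def disparity_for_pixel(left, right, row, col, max_disp=64, half_win=3):
    """Estimate the disparity of left[row, col] by comparing a window around it
    with windows along the same row of the right image (steps 1-4 above)."""
    patch_l = left[row - half_win:row + half_win + 1,
                   col - half_win:col + half_win + 1].astype(np.float32)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        c = col - d                      # candidate column in the right image
        if c - half_win < 0:
            break                        # ran out of image; stop searching
        patch_r = right[row - half_win:row + half_win + 1,
                        c - half_win:c + half_win + 1].astype(np.float32)
        cost = np.sum((patch_l - patch_r) ** 2)   # squared-difference cost
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d
```

Real stereo matchers add cost aggregation and sub-pixel refinement on top of this basic comparison.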
2.2.4 Convolution Neural Networks
In the last decade, convolutional neural networks have been mainly used for
perception tasks such as object classification, object recognition, image segmentation,
and depth estimation. Convolutional neural networks mainly comprise two layers:
1. Convolution Layers
2. Pooling Layers
number of channels tells us that the image is an RGB image. The gray area in the
image is added by a process called padding. The number of pixels added on each side
is called the padding size. Here the padding size is 1.
The advantage of padding is that when we perform the convolution operation on the
image using a filter, the width and height of the image are retained.
Figure 2.13: Each channel applied to the respective channel of an RGB input volume
First, let us better understand this by applying a single filter, as shown in figure 2.14.
We will take each channel of the filter, apply it to the corresponding channels of the
RGB image, and perform the convolution operation. Then we sum the corresponding
values and add the bias. We will get the first output pixel. Then we slide the filter
to the right and again perform cross-correlation. Then add the values where we will
get another output pixel. Similarly, we will keep doing this until we obtain all the
output pixels.
(a) (b)
(c) (d)
Figure 2.14: Process of cross-correlation of each channel of filters with respective channels of
the RGB input volume
The size of the output has been reduced because we moved the filter with a stride of
two. If we want to confirm the output size mathematically, we can do it as
W_{out} = \frac{W_{in} - m + 2P}{S} + 1    (2.10)

H_{out} = \frac{H_{in} - m + 2P}{S} + 1    (2.11)

D_{out} = K

Where:
m × m = filter size
K = number of filters
P = padding size
S = stride
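A quick way to check Eqs. 2.10–2.11 is to evaluate them directly; the example numbers below are arbitrary.

```python
def conv_output_shape(w_in, h_in, m, k, padding=0, stride=1):
    """Output size of a convolution layer following Eqs. 2.10-2.11:
    W_out = (W_in - m + 2P)/S + 1, and the output depth equals the
    number of filters K."""
    w_out = (w_in - m + 2 * padding) // stride + 1
    h_out = (h_in - m + 2 * padding) // stride + 1
    return w_out, h_out, k

# A 6x6 input with 3x3 filters, padding 1 and stride 1 keeps the spatial size.
print(conv_output_shape(6, 6, m=3, k=2, padding=1, stride=1))  # (6, 6, 2)
```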
Now we know how to apply the convolution layer, but where will the neural network
learn the parameters? The filters we apply to the input image will have values acting
as weights for the neural network. These are the weights in the filters that the
convolution neural network will learn, as shown in figure 2.13.
We know that the dimension of the RGB input is Win × Hin × 3, and when we apply a
single filter on the image, the dimension becomes Wout × Hout × 1. However, we will
not be applying a single filter but multiple filters on a single RGB image. For multiple
filters, we represent the dimension as Wout × Hout × K. Figure 2.15 shows that we first
apply filter 1 and then, in the same way, we apply filter 2. We can notice that we have
two output channels when we apply two filters. These channels are stacked over each
other to form output volume.
2.2.4.2 Pooling Layers
Pooling layers [15] help the representation become invariant to small translations in
the input. This layer uses a pooling function that replaces the output of the previous
layer at each location with a summary of the nearby outputs. There are many pooling
functions, but we will use Max Pooling to understand the idea better.

W_{out} = \frac{W_{in} - n}{S} + 1    (2.12)

H_{out} = \frac{H_{in} - n}{S} + 1    (2.13)

D_{out} = D_{in}

Where n × n is the pooling window size and S is the stride.
Unlike the convolution layers, the number of channels in the output will be the same as
in the input. Additionally, there are no parameters to learn in the pooling layers; only
the parameters of the convolution layers are learned.
(a) (b)
(c) (d)
Figure 2.16: Applying max pool function on a 4x4 matrix
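A minimal max-pooling sketch matching Eqs. 2.12–2.13, with a 2x2 window and stride 2 applied to an arbitrary 4x4 matrix:

```python
import numpy as np

def max_pool(x, n=2, stride=2):
    """Max pooling per Eqs. 2.12-2.13: each n x n window is replaced by its maximum."""
    h_out = (x.shape[0] - n) // stride + 1
    w_out = (x.shape[1] - n) // stride + 1
    out = np.empty((h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = x[i * stride:i * stride + n,
                          j * stride:j * stride + n].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])
print(max_pool(x))   # [[6 4]
                     #  [8 9]]
```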
2.2.4.3 Deconvolution
We are familiar with regular convolution, but there is another type of convolution
algorithm called transpose convolution [16], also sometimes called the deconvolution
layer, which is mainly used for segmentation and depth perception tasks.
In transpose convolution, each input pixel is multiplied with all the filter values and
written into the output. We then multiply the second input pixel with all the filter
values and, in the output, slide the window by a stride of 2. The areas where the pixel
values overlap are added. Similarly, for the third input pixel, we move the window by a
stride of 2 downwards, multiply that pixel with all the values of the filter, add the
pixels in the overlapped region, and do the same for the fourth input pixel.
(a) (b)
(c) (d)
(e)
Figure 2.17: Deconvolving a 3 x 3 feature map
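The sketch below reproduces the stride-2 transposed convolution described above in plain NumPy; the 3x3 feature map and the all-ones filter are arbitrary illustrative inputs.

```python
import numpy as np

def transpose_conv2d(x, w, stride=2):
    """Transposed (de)convolution: each input pixel is multiplied by the whole
    filter, the result is pasted into the output shifted by the stride, and
    overlapping regions are summed."""
    kh, kw = w.shape
    out = np.zeros(((x.shape[0] - 1) * stride + kh,
                    (x.shape[1] - 1) * stride + kw), dtype=np.float32)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * w
    return out

x = np.arange(9, dtype=np.float32).reshape(3, 3)   # 3x3 feature map
w = np.ones((3, 3), dtype=np.float32)              # 3x3 filter
print(transpose_conv2d(x, w).shape)                # (7, 7) up-sampled output
```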
Before diving into the learning methods of estimating depth, we must first understand
what CNNs do.
The encoder consists of a series of convolution & pooling layers to extract the depth
features. The decoder consists of a series of deconvolution layers that regress the pixel-
level depth map. The output dimension, i.e., h × w of the depth map, should be the
same as the input. To preserve the depth features, the output from each layer of the
encoder is concatenated with the output of the corresponding decoder layer. The depth loss function is
used to train the network till we obtain the desired depth.
Figure 2.19: General supervised framework for depth estimation
Advantages
The accuracy rate is very high when the scale of the estimated depth is close to the
ground truth, and we can further generate an accurate 3D map.
Disadvantages
The cost of obtaining the ground truth is very high because it requires extensive
equipment and time.
Figure 2.20: General unsupervised framework for depth estimation using stereo image pairs
Similarly, we can not only estimate depth but also pose simultaneously. The depth
and pose network predictions are used to synthesize a new view, and that new view is
compared with the original target image. The only difference from the former
unsupervised method is that the warping is based on adjacent frames. Figure 2.21
shows the general diagram of estimating both depth and pose.
Figure 2.21: General unsupervised framework for depth and pose estimation
Advantages
The unsupervised learning method of estimating depth does not require any ground
truth, which helps in reducing the cost of building depth labels.
Disadvantages
Unlike the supervised method, unsupervised depth and pose estimation suffer from
lower accuracy.
(a) Visual Odometry (b) Localization (c) SLAM
Figure 2.22: Types of Ego-Motion
For this work, we will focus on visual odometry. Visual odometry refers to estimating
the relative Ego-Motion from images.
It can be used in space or underwater exploration, where sensors such as GPS do not
work, to see what is in the environment. In places like these, it can also be used to
estimate the camera's pose and move through the environment.
Visual odometry aims to estimate the camera's pose by examining the changes that the
motion induces in the images it captures.
Figure 2.23 shows a car at time t and t + 1. We can see that the car has gone through a
rigid body transformation of rotation R and translation t, which has 6 degrees of
freedom (6-DoF), from time t to t + 1. So we want to estimate the current pose, i.e., at
time t + 1, with respect to the previous pose, i.e., at time t. This is why it is also
sometimes referred to as relative Ego-Motion: we are not localizing the vehicle with
respect to a global reference such as a map, but with respect to the previous time instance.
Figure 2.23: Pose of a Car
Even though we talked about mapping, it is often built as a by-product, which means
that Visual Odometry’s focus is not mapping but estimating pose.
2.3.3.1 Visual Features – Detection, Description, and Matching
Image features are vital pieces of information that describe the contents of an image. In
computer vision, they have a variety of applications, like face detection or object detection.
All of these tasks have a general framework, as shown in figure 2.25.
These features have specific structures like points, edges, and corners. Let us take an
example of image stitching or panorama, which is widely used on our mobile devices.
Let us consider two images, and we would like to stitch them together to form a
panorama. Remember that a panorama can be formed using not only two but more than
two images. For the sake of understanding, we are using two images.
(a) (b)
(c) (d)
(e)
Figure 2.26: Example of image stitching technique
First, we would like to find distinct points in one image. These points will be called
features. Then we associate a descriptor for each feature from its neighborhood. Then
we finally match the features [18] across both images. Then these matched features
can be used to stitch both images, creating a panorama. However, we can see some
artifacts and missing regions in the extreme bottom-right corner of the image, as shown
in figure 2.26(e).
Features are points of interest in an image. A point of interest in an image should have
some characteristics for it to be a good feature point. These are the following.
The feature points should be salient, i.e., distinct, identifiable, and different from its
immediate neighborhood.
1. In order to extract the same features from each image, the feature points should be
repeatable.
2. Feature points should be local. It means they should stay the same if an image
region far away from the immediate neighborhood changes.
3. Many applications like camera calibration and localization require a minimum
number of distinct feature points, so they should be abundant.
4. Generating features should not require much computation.
To understand further how these features should look, consider figure 2.27. In (a),
repetitive textureless patterns are tough to recognize, so we cannot consider these as
features. Similarly, the two rectangles on the white line, as shown in (b), will also be
hard to recognize. The concept of corners [12] for image features is essential. These
corners occur when the gradients in at least two significant directions are
considerable. An example is shown in rectangles in (c).
To solve this problem, the Harris-Laplace corner detector was proposed, which is
used to detect corners at different scales and choose the best corner based on the
Laplacian of the image. Similarly, many machine learning corner detectors were later
proposed. One prominent example is the FAST corner detector [20], which is
computationally efficient and performs well. Other scale-invariant feature detectors
are based on the concept of blobs, such as Laplacian of Gaussian. We will not discuss
these feature detectors in great detail as they represent a complex area of research.
Thanks to open-source implementations of these algorithms, we can see them in
action using OpenCV [19].
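For instance, the FAST detector mentioned above can be tried in a few lines of OpenCV; the image file name below is only a placeholder.

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image

# FAST corner detector: cheap to compute, widely used in visual odometry.
fast = cv2.FastFeatureDetector_create(threshold=25)
keypoints = fast.detect(img, None)
print(f"Detected {len(keypoints)} FAST corners")

# Draw the detected corners and save the visualization for inspection.
vis = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imwrite("frame_fast.png", vis)
```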
A feature descriptor summarizes information about a particular feature point, and its
key characteristic is that it should be repeatable: regardless of position, scale, and
illumination, the same point of interest in both images should have approximately the
same feature descriptor. The other characteristics required of a point of interest,
already described in section 2.2, also help the feature descriptors to match well.
In order to have a good sense of finding a feature point, let us look at a particular
study called Scale Invariant Feature Transform (SIFT) descriptor [21].
1. Given a feature in an image, the SIFT descriptor takes a 16 by 16 window around
the feature.
2. It divides the window into 16 cells, each of which comprises a 4 by 4 patch of pixels.
3. Edges are then computed using the gradients. For stability, we suppress the weak
edges by defining a threshold as they vary with orientation with small amounts of
noise between images.
4. We construct an 8-bin histogram of gradient orientations for each cell and
concatenate them to get a 128-dimensional descriptor.
SIFT descriptor is a very well-engineered feature descriptor. It is usually computed
over multiple scales and orientations. It has also been combined with scale invariant
feature detection such as DoG, which results in a highly robust feature detector and
descriptor pair.
Now we want to find the best match. The simplest method to solve this problem is
called brute force feature matching. It is defined as the distance function d(fi, fj) that
compares feature fi in image 1 and feature fj in image 2. The smaller the distance
between two features, the more similar will be the feature points. For every feature fi
in image 1, we compute the distance d(fi, fj) with every feature fj in image 2 and find
the closest match fc that has the minimum distance.
Now, what distance function shall be used for feature matching? We can use the Sum
of Squared Differences (SSD) which is defined as

d(f_i, f_j) = \sum_{k=1}^{D} (f_{i,k} - f_{j,k})^2    (2.14)
We can even use other distance functions instead of SSD which are the following.
• Hamming Distance
This is a basic method of matching features, but refinements can be added on top of the
brute-force method to make feature matching more robust.
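A minimal brute-force matcher implementing Eq. 2.14, assuming the descriptors of each image are stored as rows of a NumPy array:

```python
import numpy as np

def brute_force_match(desc1, desc2):
    """For every descriptor f_i in image 1, find the descriptor f_j in image 2
    with the smallest sum of squared differences (Eq. 2.14)."""
    matches = []
    for i, f_i in enumerate(desc1):
        ssd = np.sum((desc2 - f_i) ** 2, axis=1)   # distance to every f_j
        j = int(np.argmin(ssd))
        matches.append((i, j, float(ssd[j])))
    return matches

desc1 = np.random.rand(100, 128)   # e.g. 100 SIFT-like descriptors in image 1
desc2 = np.random.rand(120, 128)   # 120 descriptors in image 2
print(brute_force_match(desc1, desc2)[:3])
```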
Visual Odometry [10], or VO for short, is the process of estimating the pose [7] of the
camera by examining the changes that the motion induces in images.
Furthermore, cameras are passive sensors and may not be robust to weather changes
and illumination changes like car headlights and street lights. Similarly, at night it can
be challenging to perform VO.
Like other odometry estimation techniques, VO will drift over time as the estimation
error accumulates.
To define the problem mathematically, we are given two consecutive
images Ik and I(k−1). Our goal is to estimate the transformation matrix Tk between these
two frames. The transformation matrix is defined by the rotation matrix R and the
translation vector t, where k is the time step.

T_k = \begin{bmatrix} R_{k,k-1} & t_{k,k-1} \\ 0 & 1 \end{bmatrix}    (2.17)
The camera will be attached to a rigid body, and as it is in motion, it will capture
every frame. We will call these frames Cm, where m is the frame number. When we
concatenate these frames, we can estimate the camera's trajectory.
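As a sketch of how a trajectory is obtained from these frames, the snippet below chains relative 4x4 transformations (Eq. 2.17) into global camera poses; the convention that each T_k expresses the motion from frame k-1 to frame k in the coordinates of frame k-1, and the example motions themselves, are assumptions made for illustration.

```python
import numpy as np

def accumulate_trajectory(relative_transforms):
    """Chain 4x4 relative transforms into global camera poses, with the first
    camera frame taken as the world origin."""
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for T in relative_transforms:
        pose = pose @ T                  # compose the incremental motion
        trajectory.append(pose.copy())
    return trajectory

# Two made-up steps: 1 m forward, then 1 m forward with a 5-degree yaw.
T1 = np.eye(4); T1[2, 3] = 1.0
a = np.deg2rad(5.0)
T2 = np.array([[ np.cos(a), 0, np.sin(a), 0.0],
               [ 0.0,       1, 0.0,       0.0],
               [-np.sin(a), 0, np.cos(a), 1.0],
               [ 0.0,       0, 0.0,       1.0]])
poses = accumulate_trajectory([T1, T2])
print(poses[-1][:3, 3])   # position of the camera after the two steps
```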
In VO, motion estimation is an important step. How we perform this step depends on
the type of feature correspondences we use. Three types of feature correspondences
can be used in motion estimation.
• 2D-2D: Feature matches in frames fk−1 and fk are defined purely in image
coordinates. It is instrumental in tracking objects and image stabilization in
videography.
• 3D-3D: Feature matches in frames fk−1 and fk are defined purely in 3D world
coordinates. This approach helps locate new features in 3D space and estimate
depth.
• 3D-2D: Features in frame fk−1 are defined in 3D space, and features in frame fk are
defined in image coordinates.
Let us see how 3D-2D projection [14] works. We are given a set of features in
frame Ck−1 and their estimated 3D world coordinates. Furthermore, through feature
matching, we have a set of features in frame Ck and their 2D image coordinates. Since
we cannot recover the scale for monocular visual odometry directly, we include a
scalar parameter s when forming a homogeneous feature vector from the image
coordinates. With this information, we estimate the rotation matrix R and translation
vector t between the frame Ck−1 and Ck.
This process is similar to camera calibration, with the exception that the intrinsic
parameters of the camera, represented by matrix K, are already known.
So now, our problem is reduced to finding the transformation matrix [R|t] from the
equations constructed using all of our matched features. It can be solved using the
Perspective-n-Point (PnP) algorithm. Given the feature locations in 3D, their 2D
correspondences, and the intrinsic camera matrix K, PnP solves for the extrinsic
parameters R and t. The equations here are non-linear, so we refine the Direct
Linear Transformation solution with an iterative non-linear optimization technique
such as the Levenberg-Marquardt algorithm.
The PnP algorithm requires at least 3 feature points to solve for R and t, and if we
want a good solution, then we use 4 feature points.
To improve the robustness of the PnP algorithm, we incorporate RANSAC (Random
Sample Consensus): PnP generates a candidate transformation from a minimal set of
four points, and we evaluate this model by calculating the percentage of inliers (points
that fit the model) to confirm the validity of the selected point matches.
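This 3D-2D step can be prototyped directly with OpenCV's PnP-with-RANSAC solver; the correspondences below are random placeholders and the intrinsic matrix K is only an example.

```python
import cv2
import numpy as np

# Placeholder correspondences: N 3D landmarks seen in frame C_{k-1} and their
# matched 2D pixel locations in frame C_k.
points_3d = (np.random.rand(50, 3) * 10.0).astype(np.float32)
points_2d = (np.random.rand(50, 2) * 500.0).astype(np.float32)

K = np.array([[718.0, 0.0, 607.0],
              [0.0, 718.0, 185.0],
              [0.0, 0.0, 1.0]])           # example intrinsic matrix

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    points_3d, points_2d, K, distCoeffs=None)

R, _ = cv2.Rodrigues(rvec)                # rotation vector -> rotation matrix R
print("success:", ok, "inliers:", 0 if inliers is None else len(inliers))
```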
2.3.4 Learning-Based Ego-Motion Estimation Methods
There are two ways to estimate the pose of the camera:
1. Supervised way
2. Unsupervised way
Pose estimated in a supervised way gives accurate estimates that are robust to motion
and illumination changes, but over time the error starts to accumulate, giving rise to
inevitable drift.
2.3.4.2 Unsupervised Learning
Figure 2.21 shows the general architecture of pose estimation. In the unsupervised way,
the predicted pose is assisted by the depth estimated from a single image. This method
is robust to some extent, but not to dynamic changes in the environment, and the
learning takes a lot of time.
Chapter 3
Literature Review
In this chapter, we will see what researchers have done to improve the depth and pose
estimated from images.
The coarse network is used to get a global sense of what is in the image. It is passed
through a bunch of convolution layers, and at the end, there are two fully connected
layers. These fully connected layers are used to get a coarse depth estimation by
looking at every image pixel. In the end, the prediction is resized back to the original
size of the input image.
The fine network is used for looking at fine details in the image. The image is also
passed through a bunch of convolution layers. The difference from the layers in the
coarse network is that the resolution of the output from each layer is the same as that
of the input image. The outcome from the coarse network is then
concatenated with output features from the first layer and then passed through two
convolution layers to get a refined depth prediction.
The convolution layers of the coarse network are pre-trained on ImageNet, and the
fully connected layers are then fine-tuned. The job of this network was to understand
the global scene, and the fine network was trained after the coarse network was
trained.
When writing a loss function for training the network, there is a catch: the global scale
of the scene is unknown, which is an ambiguity for depth prediction.
What this means is that we know the relative sizes of the objects in the scene but not
the exact sizes. Eigen et al. used a loss function independent of scale to solve this
issue.
Consider y as the predicted depth, y* as the ground-truth depth, and n as the number of
pixels indexed by i. A straightforward depth loss function is the mean squared error in
log space,

D(y, y^*) = \frac{1}{n} \sum_{i=1}^{n} (\log y_i - \log y_i^*)^2    (3.1)

However, this function is heavily dependent on scale. To correct that, the loss function
was written as

D(y, y^*) = \frac{1}{n} \sum_{i=1}^{n} (\log y_i - \log y_i^* + \alpha(y, y^*))^2    (3.2)

Where:

\alpha(y, y^*) = \frac{1}{n} \sum_{i=1}^{n} (\log y_i^* - \log y_i)    (3.3)

The value of \alpha is the one that minimizes the error, which makes this mean squared
error scale-invariant. Writing d_i = \log y_i - \log y_i^*, it can be re-written as

L(y, y^*) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \left( \sum_i d_i \right)^2    (3.4)
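A small NumPy version of the scale-invariant loss in Eq. 3.4; λ = 0.5 is the value reported by Eigen et al. and is taken here as an assumption, and the random depth maps are placeholders.

```python
import numpy as np

def scale_invariant_loss(pred, gt, lam=0.5, eps=1e-8):
    """Scale-invariant log loss (Eq. 3.4):
    L = (1/n) * sum(d_i^2) - (lambda / n^2) * (sum(d_i))^2,
    with d_i = log(y_i) - log(y_i*)."""
    d = np.log(pred + eps) - np.log(gt + eps)
    n = d.size
    return np.mean(d ** 2) - lam * (np.sum(d) ** 2) / (n ** 2)

pred = np.random.uniform(1.0, 80.0, size=(128, 416))   # predicted depths
gt = np.random.uniform(1.0, 80.0, size=(128, 416))     # ground-truth depths
print(scale_invariant_loss(pred, gt))
```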
Another contribution to this work is the design of up-convolution blocks. In the up
convolution, the feature map is given to the 2 × 2 unpooling layer, which performs the
inverse of the pooling layer. It is then followed by two 5 × 5 convolution layers which
are followed by a ReLU activation function.
One 5×5 convolution branch is followed by a 3 × 3 convolution layer. The output from
this branch and that of the other convolution branch are then added together, followed
by a ReLU activation function.
Another block they introduced was the fast-up convolution block which is much more
efficient and faster. They observed that after un-pooling the feature map and applying a
5x5 convolution filter, only certain parts are multiplied with non-zero values. This
inspired them to convolve the feature map with four different convolution layers of
sizes 3×3, 2×3, 3×2 & 2×2 and then interleave the outputs, which places each result at
its corresponding pixel location.
This block is then used in the same way as the up-projection block and is called a fast
up-projection block.
For learning, they found that the reverse Huber (berHu) loss yields a lower error.

B(x) = \begin{cases} |x| & |x| \le c \\ \frac{x^2 + c^2}{2c} & |x| > c \end{cases}    (3.5)

Where c = \frac{1}{5} \max_i (|\tilde{y}_i - y_i|).
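A direct transcription of the berHu loss in Eq. 3.5, averaged over all pixels:

```python
import numpy as np

def berhu_loss(pred, gt):
    """Reverse Huber (berHu) loss, Eq. 3.5: L1 for small residuals and a scaled
    L2 branch for residuals larger than c = (1/5) * max |residual|."""
    x = np.abs(pred - gt)
    c = 0.2 * x.max() + 1e-12          # epsilon guards against a zero threshold
    l2_branch = (x ** 2 + c ** 2) / (2.0 * c)
    return np.mean(np.where(x <= c, x, l2_branch))
```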
An input image is fed to a series of convolution layers to estimate the depth at each
scale. The depths predicted at the different scales are upsampled and added together,
and the result is sent to the selection layer to produce a right-view image.
The selection layer models the DIBR [26] step using the traditional 2D-3D conversion
method. Given the left-view image I and predicted depth Z, the disparity D is
computed as

D = \frac{B(Z - f)}{Z}    (3.6)
B is the baseline. The network then predicts a probability distribution over the possible
disparity values d at each pixel location (i, j), where \sum_d D_{i,j}^d = 1. A stack of
shifted left-view images is then created. The disparity probabilities and the shifted left
images are multiplied element-wise and summed to give a right-view image:

O_{i,j} = \sum_d I_{i,j}^d D_{i,j}^d    (3.7)
3.1.4 PoseNet
The PoseNet architecture [27] used GoogleNet’s inception modules.
We have a series of convolution layers in the pose network; this block is called the
encoder. The encoder outputs a vector v, which is an encoding of the visual features.
The vector v is fed to the localizer, a fully connected layer that outputs local features u.
The final output is the rigid-body transformation T = [R|t], where R is the rotation and
t is the translation.
Another key idea was jointly predicting probability over depths for each pixel and a
set of corresponding hypothesis colors. The final color calculated for each pixel is a
probability-weighted sum of colors. This is optimized end-to-end within a deep
learning framework. The input to the network is the plane sweep volume reprojected to
the target camera at various depths. Here the network learns the similarity function
from lots of data. The
network learns the depth distribution and color hypotheses for each pixel. It consists
of two towers; the selection tower and the color tower.
The selection tower learns to produce a selection probability for each pixel in each
depth plane, while the color tower combines and warps pixels and colors across the
input images. The outputs of both towers are multiplied and summed to produce a
final image.
3.2.2 Depth Estimation with Left-Right Consistency, 2017
Godard et al. [29] aimed at inferring depth using a single monocular image at test
time. At training time, two images, Il and Ir, are used, which are obtained from a pair
of stereo cameras at the same moment. The target is one of these two images, meaning
that the model needs to reproduce that image. For example, if the target is the left
image, the output should look similar to the left image. The reconstruction
loss is used to compare the output image with the target image for learning. A
convolution neural network is used to estimate the depth. Using the estimated depth
and the right input image, a sampler from the spatial transformer network (STN) [30]
outputs a reconstructed left image. The STN uses a bilinear sampler where the output
pixel is the weighted sum of four input pixels.
This gave quite good results, but some artifacts could still be observed. A similar
procedure is carried out using the right image to estimate depth as well. This enforces
consistency between both depths, which leads to more accurate depth estimates.
Another sampler is used to produce a right image using the left image. The
reconstructed right image is then compared with the original right image.
For learning, the loss uses a combination of three main terms: the appearance matching
loss, the disparity smoothness loss, and the left-right disparity consistency loss.

The appearance matching loss C_{ap}^l is a combination of an L1 term and a single-scale
SSIM term, which compares the input image I_{ij}^l with the reconstructed image \tilde{I}_{ij}^l:

C_{ap}^l = \frac{1}{N} \sum_{i,j} \alpha \frac{1 - \mathrm{SSIM}(I_{ij}^l, \tilde{I}_{ij}^l)}{2} + (1 - \alpha) \| I_{ij}^l - \tilde{I}_{ij}^l \|    (3.8)

The disparity smoothness loss penalizes disparity gradients, weighted by an edge-aware
term on the image gradients:

C_{ds}^l = \frac{1}{N} \sum_{i,j} |\partial_x d_{ij}^l| e^{-\|\partial_x I_{ij}^l\|} + |\partial_y d_{ij}^l| e^{-\|\partial_y I_{ij}^l\|}    (3.9)

To ensure consistency between the left and right disparities, the consistency loss is
defined as

C_{lr}^l = \frac{1}{N} \sum_{i,j} | d_{ij}^l - d_{ij + d_{ij}^l}^r |    (3.10)

Similarly, we have the right-image variants C_{ap}^r, C_{ds}^r and C_{lr}^r. Using these
terms, the total loss is written as

C_s = \alpha_{ap} (C_{ap}^l + C_{ap}^r) + \alpha_{ds} (C_{ds}^l + C_{ds}^r) + \alpha_{lr} (C_{lr}^l + C_{lr}^r)    (3.11)
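As an example of the photometric term in Eq. 3.8, the sketch below combines a simplified single-scale SSIM (computed with uniform 3x3 average pooling) with an L1 term; α = 0.85 is the value used by Godard et al. and is an assumption here, and the tensors are expected in (N, C, H, W) layout.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified single-scale SSIM over 3x3 windows, clamped to [0, 1]."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1), F.avg_pool2d(y, 3, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(num / den, 0, 1)

def appearance_loss(target, reconstructed, alpha=0.85):
    """Eq. 3.8: alpha * (1 - SSIM)/2 + (1 - alpha) * L1, averaged over pixels."""
    ssim_term = (1 - ssim(target, reconstructed)) / 2
    l1_term = torch.abs(target - reconstructed)[..., 1:-1, 1:-1]  # match SSIM crop
    return (alpha * ssim_term + (1 - alpha) * l1_term).mean()
```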
The structure network uses conv and deconv layers to estimate the depth from a single
image. Then, using the camera intrinsics (cx, cy), the focal length f and the per-pixel
depth, the pixels (x_t^i, y_t^i) are converted to a point cloud using Eq 3.12.
X_i^t = \begin{bmatrix} X_i^t \\ Y_i^t \\ Z_i^t \end{bmatrix} = \frac{d_i^t}{f} \begin{bmatrix} x_t^i - c_x \\ y_t^i - c_y \\ f \end{bmatrix}    (3.12)
A similar architecture is used to estimate the camera motion and object motion.
However, two fully connected layers are used after convolving a pair of images: one to
estimate the pose/motion of the camera and the other the pose/motion of K object
segments. These poses/motions are transformation matrices that describe rotation and
translation. The camera rotation R_t^c can be represented using the Euler angle
representation as
R_t^{c_x}(\gamma) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{bmatrix}
3.2.4 SfM-Learner, 2017
Zhou et al. [8] jointly predicted depth from a single image and pose from a sequence of
images using unlabeled training data. They also showed that, although trained jointly,
both networks can be used independently at test time. To train the networks, the
learning is carried out
from the task of novel view synthesis, where a new target image is reconstructed from
the nearby views and compared with the original target image.

The depth network takes in the target image I_t and outputs the depth \hat{D}_t. The pose
network takes in the target image I_t and the nearby view/source images I_{t-1}, I_{t+1}
and outputs the relative camera poses, i.e., the transformation matrices. The source
images are then inverse-warped, using the relative poses from the pose network and the
per-pixel depth from the depth network, to reconstruct the target view.

Figure 3.12 (b) shows how a target view I_t is reconstructed by sampling pixels from a
source view I_s using the depth prediction \hat{D}_t and the camera pose \hat{T}_{t \to s}.
The projected coordinates p_s of a target pixel p_t onto the source view can be obtained as

p_s \sim K \hat{T}_{t \to s} \hat{D}_t(p_t) K^{-1} p_t    (3.16)

To populate the values of \hat{I}_s(p_t), a differentiable bilinear sampler, as used in [26],
interpolates the 4-pixel neighbors (top-left, top-right, bottom-left, bottom-right) of p_s
to approximate I_s(p_s):

\hat{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\},\, j \in \{l,r\}} w^{ij} I_s(p_s^{ij})    (3.17)
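A sketch of the projection in Eq. 3.16, which maps every target pixel into the source view using the predicted depth and relative pose; the tensor shapes and the homogeneous-coordinate handling follow common SfMLearner-style implementations and are assumptions here.

```python
import torch

def project_to_source(depth, T_t2s, K):
    """Eq. 3.16: p_s ~ K * T_{t->s} * D_t(p_t) * K^{-1} * p_t.
    depth: (H, W) depth of the target view; T_t2s: (4, 4) relative pose;
    K: (3, 3) intrinsics. Returns (H, W, 2) source-pixel coordinates."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)  # p_t (homog.)

    cam = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)   # D_t * K^-1 * p_t
    cam = torch.cat([cam, torch.ones(1, H * W)], dim=0)      # to homogeneous 3D

    src = K @ (T_t2s @ cam)[:3]                              # K * T_{t->s} * X
    px = src[0] / (src[2] + 1e-7)                            # perspective divide
    py = src[1] / (src[2] + 1e-7)
    return torch.stack([px, py], dim=-1).reshape(H, W, 2)
```

The returned coordinates can then be normalized and passed to a bilinear sampler (e.g. torch.nn.functional.grid_sample) to realize Eq. 3.17.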
Chapter 4
Methodology
This chapter deals with implementation methodology. We start by explaining network
architectures used to estimate depth and Ego-Motion. Then we look at the view
synthesis block in detail and explain how the synthesized view is used by the network
to learn its parameters.
Then we discuss the dataset for training, evaluation, and testing and the
hyperparameters used to train our networks.
The following depth networks are used, and we discuss each of them below.
1. DispNet
2. AdaBins
3. ResNet18
We could have used more than three networks, but as we increase the number of
networks predicting depth, the number of parameters and the computation increase, and
training takes longer. This is why we limit the ensemble to three depth networks.
4.1.1 Subnetwork1 - Disparity Network
Figure 4.2 shows the architecture of DispNet [34], which is similar to the U-Net [35].
The network is made of 7 down-sampling blocks, each consisting of 2 convolution
layers. The 1st block consists of convolution layers with a kernel size of 7, and the
2nd block consists of convolution layers with a kernel size of 4. The remaining blocks,
3 to 7, consist of convolution layers with a kernel size of 3. The output channels of the
respective blocks are 32, 64, 128, 256, 512, 512, and 512.
After the 7th block, the up-sampling process begins. Similar to convolution blocks,
we have up-sampling blocks where each block has two deconvolution layers. Each
layer in deconvolution blocks 1-5 has a kernel size of 3, and the layers in block 6 have
a kernel size of 7. The output channels from each respective up-sampling block
reverse from the convolution blocks, i.e., 512, 512, 256, 128, 64, 32, 16.
The outputs from the second convolution layer of each down-sampling block are
concatenated with the output of the first deconvolution layer of the up-sampling
blocks. The predictions from up-sampling blocks 3, 4 & 5 are again up-sampled and
then concatenated with the output of the first deconvolution layer of blocks 4, 5 & 6.
The output x of the network is converted to depth as

Depth = \frac{1}{\alpha \cdot \mathrm{sigmoid}(x) + \beta}    (4.1)
4.1.2 Subnetwork2 - AdaptiveBin Network
Distributions of depth vary a lot and increase the complexity of estimating depth. To
solve this, Shariq et al. [36] proposed dividing the depth ranges into bins so that the
bins can adapt to changes in a scene.
Figure 4.3 shows the architecture with two blocks; the standard encoder-decoder and
the AdaBins module.
The input is an H × W × 3 image. The encoder uses the EfficientNet-B0 [37] network to
extract features from the image, which are then up-sampled in a similar fashion to
DispNet. The output is then H × W × Cd.
The output from the encoder-decoder block is then divided into patches and fed to a
patch-based transformer [38] called mViT to preserve the global information. It
produces two outputs: Range Attention Maps R and the normalized bin widths b. Eq
4.2 is used to obtain the bin centers from the bin widths.
c(b_i) = d_{min} + (d_{max} - d_{min}) \left( \frac{b_i}{2} + \sum_{j=1}^{i-1} b_j \right)    (4.2)
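A direct transcription of Eq. 4.2, assuming the network outputs a vector of normalized bin widths; the depth range of 1e-3 to 80 m is a common choice for KITTI and is an assumption here.

```python
import numpy as np

def bin_centers(bin_widths, d_min=1e-3, d_max=80.0):
    """Eq. 4.2: c(b_i) = d_min + (d_max - d_min) * (b_i / 2 + sum_{j<i} b_j)."""
    b = np.asarray(bin_widths, dtype=np.float64)
    b = b / b.sum()                   # ensure the widths sum to 1
    prev = np.cumsum(b) - b           # sum of the widths of all previous bins
    return d_min + (d_max - d_min) * (b / 2.0 + prev)

widths = np.random.rand(256)          # e.g. 256 adaptive bins
print(bin_centers(widths)[:5])        # first few bin centers in metres
```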
Another reason we chose this model is that it was originally trained in a supervised
way, whereas we train it in an unsupervised way.
4.1.3 Subnetwork3 - ResNet18
The residual block can be repeated as many times as needed. For this work, we connect
8 blocks in series and apply a single convolution layer before and after them. Since
every block consists of two convolution layers, adding the convolution layers before
and after the blocks gives 18 convolution layers in total, hence the name ResNet18.
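The sketch below shows one way to realize the block structure just described: a basic residual block with two convolution layers and a skip connection, eight of which are stacked between a single input and output convolution. The channel widths, strides, and use of batch normalization are assumptions for illustration, not the exact configuration used in the thesis.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """A minimal residual block: two 3x3 convolutions plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection

# Eight blocks in series, with one convolution before and one after
# (8 x 2 + 2 = 18 convolution layers, as described in the text).
depth_subnet = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3),
    *[BasicBlock(64) for _ in range(8)],
    nn.Conv2d(64, 1, 3, padding=1),
)
```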
4.2 Weighted Average Depth
Each subnetwork in the architecture we use to estimate depth has its own strengths,
as described in sections 4.1.1 – 4.1.3, and in certain situations one subnetwork may
outperform the others. Instead of a simple average, we therefore use a weighted
average: with a plain average, the contribution of a well-performing subnetwork could
be degraded by a poorly performing one. In the weighted average, each subnetwork is
assigned a weight that determines its relative importance. The weighted average depth
is computed as
D_{avg} = \frac{\sum_{i=1}^{n} w_i D_i}{\sum_{i=1}^{n} w_i}    (4.4)
Where w_i is the weight assigned to subnetwork i, D_i is the depth predicted by
subnetwork i, and n is the number of subnetworks.
Since we use three subnetworks, the weighted average depth D_avg can be written as
D_{avg} = \frac{w_1 D_1 + w_2 D_2 + w_3 D_3}{w_1 + w_2 + w_3}    (4.5)
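The combination in Eqs. 4.4–4.5 is straightforward to implement; a minimal PyTorch sketch is given below, where the tensor shapes are assumptions for illustration.

```python
import torch

def weighted_average_depth(depths, weights):
    """Weighted average of per-subnetwork depth maps (Eqs. 4.4-4.5).

    depths:  list of tensors, each of shape (B, 1, H, W)
    weights: list of scalars w_i, one per subnetwork
    """
    w = torch.as_tensor(weights, dtype=depths[0].dtype)
    stacked = torch.stack(depths, dim=0)          # (n, B, 1, H, W)
    w = w.view(-1, 1, 1, 1, 1)
    return (w * stacked).sum(dim=0) / w.sum()
```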
The contribution of each subnetwork is determined by its weight. One way to select the
weights is to assign a higher weight to a subnetwork that yields lower errors, so that
it contributes more to the depth estimate, and a lower weight to a subnetwork that
yields higher errors, so that it contributes less.
4.2.1 Initializing Weights Based on Accuracy
Since we use subnetworks that have already been designed by other researchers, we can
select the initial weights based on their accuracies. We test each network
individually, obtain its accuracy, and then normalize the accuracies to the 0-1 range.
These values are then used as the weights of the networks. For example, if
subnetwork-1 has 69% accuracy, subnetwork-2 has 84% accuracy, and subnetwork-3 has
78% accuracy, normalizing these accuracies gives 0.516, 0.628, and 0.583, which become
the weights w1, w2, and w3 for subnetwork-1, subnetwork-2, and subnetwork-3,
respectively. These values serve as the initial weights of the subnetworks.
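The normalized values quoted above (0.516, 0.628, 0.583) are consistent with scaling the accuracy vector to unit L2 norm; the sketch below reproduces them under that assumption, which is an interpretation of the text rather than something it states explicitly.

```python
import numpy as np

# Standalone accuracies of the three subnetworks (illustrative values from the text).
acc = np.array([0.69, 0.84, 0.78])

# L2-normalize to obtain the initial weights w1, w2, w3 in the 0-1 range.
weights = acc / np.linalg.norm(acc)
print(weights.round(3))   # -> [0.516 0.628 0.583]
```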
4.2.2 Adjustment of Weights Based on Performance During Training
There will be situations during training where one subnetwork performs well and
another does not. It is in our interest to increase the weight of the subnetwork that
performs well so that it contributes more to the depth estimate. After each subnetwork
estimates a depth map, we synthesize a new view from each of these depths. Using each
synthesized image, we compute the corresponding error and normalize the errors to the
0-1 range. These normalized errors become the subnetwork weights, and they are updated
at every step. It should be kept in mind that the network does not learn these
weights; they can be regarded as hyperparameters. The error between the input image
and the image synthesized using the weighted average depth is the one used for
backpropagation.
4.2.3 Updating Weights of Subnetworks in Real-time
At run time, we would like to update the weights every N frames, so that if one
subnetwork performs poorly, the better-performing subnetworks contribute more to the
depth estimate. To update the weights automatically, we use the following steps.
1. Initialize the weights w1, w2, and w3.
2. Set the frame rate and start a timer to determine the number of frames N. For
example, with a frame rate of 30 fps and a time window of 0.1 s, we have N = 3 frames.
6. Compute the errors between the n-th input target frame I_t^n and each of the n-th
synthesized images.
Error_1^n = |I_t^n - I_{t\_predA}^n|    (4.7)
Error_2^n = |I_t^n - I_{t\_predB}^n|    (4.8)
Error_3^n = |I_t^n - I_{t\_predC}^n|    (4.9)
7. Take the mean of each error over the N samples to obtain E1, E2, and E3.
8. Compute the mean of the three errors.
mean = \frac{E_1 + E_2 + E_3}{3}    (4.13)
9. Subtract the mean from each error to center the errors around 0, giving E1_norm,
E2_norm, and E3_norm.
10. Compute the standard deviation of the centered errors.
stddev = \sqrt{\frac{E_{1\_norm}^2 + E_{2\_norm}^2 + E_{3\_norm}^2}{3}}    (4.17)
11. Divide each centered error by the standard deviation so that the errors have a
standard deviation of 1. These values are the weights w1, w2, and w3.
w_1 = \frac{E_{1\_norm}}{stddev}    (4.18)
w_2 = \frac{E_{2\_norm}}{stddev}    (4.19)
w_3 = \frac{E_{3\_norm}}{stddev}    (4.20)
Finally, we apply these weights in Eq. 4.5. The timer is then reset, and after the
next time window the weights are calculated again.
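For illustration, the sketch below follows steps 8–11 literally, given the per-subnetwork mean errors E1, E2, E3 from step 7: the three errors are centered and divided by their (population) standard deviation, and the standardized values are used as w1, w2, w3 in Eq. 4.5. Function and variable names are illustrative, and the text does not specify how negative standardized values are handled.

```python
import numpy as np

def update_weights(mean_errors):
    """Standardize the three mean photometric errors (steps 8-11).

    mean_errors: array-like with E1, E2, E3 accumulated over the last N frames.
    Returns the standardized values used as w1, w2, w3 in Eq. 4.5.
    """
    e = np.asarray(mean_errors, dtype=float)
    mean = e.mean()                            # Eq. 4.13
    centered = e - mean                        # center around 0
    stddev = np.sqrt((centered ** 2).mean())   # Eq. 4.17 (population std-dev)
    return centered / stddev                   # Eqs. 4.18-4.20
```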
The pose network's first two convolution layers have kernel sizes of 7 and 5, and the
rest have a kernel size of 3. These are followed by the sixth and seventh convolution
layers; at the seventh convolution layer, we obtain the rotation and translation of
the camera. All convolution layers except the last are followed by the ReLU activation
function.
We generate a target image It_pred using the source images, the predicted depth, and
the predicted pose, and compare the generated image It_pred with the original target
image It using a pixel-wise loss function. If the pixel-wise loss is small, the
network is predicting depth and pose successfully.
The first step is to back-project a point on the 2D image plane to a 3D coordinate. In
section 2.2.3.1 of chapter 2, we saw how a point in the 3D world coordinate system is
projected onto the 2D image plane:
O_{image} = K[R|t]\, O_{world}
Inverting this projection for a pixel (u, v) with predicted depth Z gives the
camera-frame coordinates
X = \frac{Z(u - u_o)}{s f_x}    (4.21)
Y = \frac{Z(v - v_o)}{s f_y}    (4.22)
Where s is the scale factor, (u_o, v_o) is the principal point, and f_x, f_y are the
focal lengths of the camera.
Now that we have back-projected the point from the 2D target frame into the 3D camera
coordinate frame, we apply a rigid transformation to that point using the predicted
camera pose to obtain the corresponding point in It−1 and It+1. The transformed 3D
points are then projected onto the image plane.
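A per-pixel sketch of the steps just described (Eqs. 4.21–4.22 followed by the rigid transform and re-projection) is shown below; in the actual pipeline this is done for all pixels at once as a batched tensor operation, and the function here is only illustrative.

```python
import numpy as np

def backproject_transform_project(u, v, Z, K, R, t, s=1.0):
    """Back-project pixel (u, v) with depth Z (Eqs. 4.21-4.22), apply the
    predicted motion [R|t], and re-project into the source view.

    K is the 3x3 intrinsic matrix; s is the scale factor used in the text.
    """
    fx, fy = K[0, 0], K[1, 1]
    u0, v0 = K[0, 2], K[1, 2]

    # back-project to camera coordinates of the target frame
    X = Z * (u - u0) / (s * fx)
    Y = Z * (v - v0) / (s * fy)
    P = np.array([X, Y, Z])

    # rigid transform into the source frame, then pinhole projection
    P_src = R @ P + t
    p = K @ P_src
    return p[0] / p[2], p[1] / p[2]   # pixel coordinates in the source image
```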
The key component of the learning framework is the differentiable depth image-based
renderer [30], which produces the new target view by sampling the source view using
the predicted depth D̂_t and the predicted pose T̂_{t→s} = [R|t]:
p_s \sim K \hat{T}_{t \to s} \hat{D}_t(p_t) K^{-1} p_t    (4.23)
Where p_s is the (homogeneous) pixel coordinate in the source frame, p_t is the
(homogeneous) pixel coordinate in the target frame, and K is the camera intrinsic
matrix.
To populate the pixels, the STN's [30] bilinear sampling method, as shown in Eq. 4.24,
takes the pixels from the source images It−1 and It+1 and matches them to the
corresponding pixel coordinates in the target image. By doing this, we obtain the
predicted target image It_pred.
\hat{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\},\, j \in \{l,r\}} w^{ij} I_s(p_s^{ij})    (4.24)
Where p_s^{ij} are the four pixel neighbours (top-left, top-right, bottom-left,
bottom-right) of p_s in the source frame, w^{ij} are the corresponding bilinear
interpolation weights, and Î_s(p_t) = I_s(p_s) is the value of the target pixel p_t
sampled from the source frame.
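A minimal sketch of this bilinear sampling step using PyTorch's differentiable grid_sample, which interpolates the four neighbouring source pixels, is given below; the coordinate layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def sample_source(source_img, src_pixel_coords):
    """Sample the source image at the projected coordinates p_s (Eq. 4.24).

    source_img:       (B, 3, H, W) source frame I_{t-1} or I_{t+1}
    src_pixel_coords: (B, H, W, 2) projected (x, y) pixel coordinates
    """
    b, _, h, w = source_img.shape
    # normalize coordinates to [-1, 1] as required by grid_sample
    x = 2.0 * src_pixel_coords[..., 0] / (w - 1) - 1.0
    y = 2.0 * src_pixel_coords[..., 1] / (h - 1) - 1.0
    grid = torch.stack([x, y], dim=-1)
    # bilinear interpolation of the four neighbouring source pixels
    return F.grid_sample(source_img, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=True)
```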
4.5.1 Smooth Loss
The predicted depth needs to be smooth, which we encourage with an L1 penalty on its
second-order gradients: the gradient of the depth along the x-axis is differentiated
again with respect to both x and y, and likewise for the gradient along the y-axis.
The smooth loss is given as
L_{sl} = \frac{1}{N} \sum_{i,j} \left| \partial_x^2 d_{i,j} \right| + \left| \partial_x \partial_y d_{i,j} \right| + \left| \partial_y \partial_x d_{i,j} \right| + \left| \partial_y^2 d_{i,j} \right|    (4.26)
Where d_{i,j} is the predicted depth at pixel (i, j) and N is the number of pixels.
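A sketch of this second-order smoothness penalty written as finite differences of the predicted depth map is shown below; it mirrors the common SfmLearner-style implementation and is not necessarily identical to the thesis code.

```python
def smooth_loss(depth):
    """Second-order L1 smoothness penalty on the depth map (Eq. 4.26).

    depth: tensor of shape (B, 1, H, W).
    """
    dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]   # first-order x-gradient
    dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]   # first-order y-gradient

    dxx = dx[:, :, :, 1:] - dx[:, :, :, :-1]        # d^2/dx^2
    dxy = dx[:, :, 1:, :] - dx[:, :, :-1, :]        # d^2/(dy dx)
    dyx = dy[:, :, :, 1:] - dy[:, :, :, :-1]        # d^2/(dx dy)
    dyy = dy[:, :, 1:, :] - dy[:, :, :-1, :]        # d^2/dy^2

    return (dxx.abs().mean() + dxy.abs().mean()
            + dyx.abs().mean() + dyy.abs().mean())
```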
The synthesized target image is generated in the view synthesis block discussed
earlier in this chapter.
The KITTI dataset consists of 96 thousand images of outdoor scenes. The scenes are
divided into five categories: city, residential, roads, campus, and pedestrians. The
data was recorded using a high-resolution stereo camera rig and a Velodyne laser
scanner (LiDAR) to capture the depth of the environment, all mounted on a Volkswagen,
as shown in Figure 4.7.
There are 151 sequences in total, and the left and right images are provided for each
frame. The raw RGB images and the raw LiDAR scans can be downloaded from CVlabs [7].
Figure 4.7: A stereo rig and a Velodyne laser scanner mounted on a Volkswagen
The image resolution depends on the calibration parameters, but it is approximately
1242 × 375 pixels. Figure 4.8 shows an example of an input image.
The common augmentations are small random rotations, random horizontal flips, color
augmentations, and translation.
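For illustration, a torchvision-style pipeline covering the augmentations named above is sketched below; the parameter ranges are assumptions, and in practice the geometric transforms must be applied consistently to all frames of a snippet (with the intrinsics adjusted) rather than independently per image.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=2),                        # small random rotation
    T.RandomHorizontalFlip(p=0.5),                      # random horizontal flip
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.2),                      # color augmentation
    T.RandomAffine(degrees=0, translate=(0.02, 0.02)),  # small translation
])
```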
To evaluate the predicted depth, we use the following standard error metrics.
RMSE = \sqrt{\frac{1}{T} \sum_{i} \| d_i - d_i^* \|^2}    (4.28)
RMSE(\log) = \sqrt{\frac{1}{T} \sum_{i} \| \log(d_i) - \log(d_i^*) \|^2}    (4.29)
ARD = \frac{1}{T} \sum_{i} \frac{| d_i - d_i^* |}{d_i^*}    (4.30)
SRD = \frac{1}{T} \sum_{i} \frac{\| d_i - d_i^* \|^2}{d_i^*}    (4.31)
We also use accuracy metrics with a threshold. The accuracy is calculated by dividing
the number of pixels whose error ratio is below the threshold by the total number of
pixels in the image.
\frac{1}{T} \sum_{i} \left( \max\left( \frac{y_i}{y_i^*}, \frac{y_i^*}{y_i} \right) = \delta < thr \right), \quad thr = [\lambda, \lambda^2, \lambda^3]    (4.32)
For evaluating the predicted pose, we use the root-mean-square absolute trajectory error:
ATE_{rmse} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \| trans(E_i) \|^2}    (4.33)
Where E_i is the trajectory error at frame i, computed from the estimated and
ground-truth poses, and P_i is the pose at time i.
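As a reference, the NumPy sketch below computes the depth metrics of Eqs. 4.28–4.32, assuming the customary threshold base λ = 1.25 and that pred and gt contain only valid, positive ground-truth pixels; these assumptions are not stated explicitly in the text.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Depth-evaluation metrics (Eqs. 4.28-4.32) over valid pixels.

    pred, gt: 1-D arrays of predicted and ground-truth depths d_i, d_i*.
    """
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean(((pred - gt) ** 2) / gt)

    ratio = np.maximum(pred / gt, gt / pred)
    a1 = np.mean(ratio < 1.25)        # threshold lambda
    a2 = np.mean(ratio < 1.25 ** 2)   # lambda^2
    a3 = np.mean(ratio < 1.25 ** 3)   # lambda^3
    return rmse, rmse_log, abs_rel, sq_rel, a1, a2, a3
```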
We kept the batch size at 2 because of limited resources; we use an 11 GB RTX 2080
graphics card to train our networks. We first train the network for 200 epochs using
the Adam optimizer [42]. After that, we change the optimizer from Adam to SGD [43] and
train the network for 3 more epochs to analyze whether the change affects the
predictions. The other hyperparameters are shown in Table 4.2.
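A schematic of this two-phase training schedule (200 epochs with Adam, then 3 with SGD) is sketched below; the learning rates and momentum are placeholders, not values taken from Table 4.2.

```python
import torch

def train(model, loader, loss_fn):
    """Two-phase training: 200 epochs with Adam, then 3 more with SGD."""
    phases = [
        (200, torch.optim.Adam(model.parameters(), lr=2e-4)),
        (3,   torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)),
    ]
    for num_epochs, opt in phases:
        for _ in range(num_epochs):
            for batch in loader:
                opt.zero_grad()
                loss = loss_fn(model, batch)   # photometric + smoothness losses
                loss.backward()
                opt.step()
```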
Chapter 5
Results
The networks used in this research were trained, evaluated, and tested on the KITTI
dataset using an RTX 2080 (11 GB) GPU. The networks took 2.5 months to train due to
the heavy computation involved. In this chapter, we compare the results of both the
depth and pose networks qualitatively and quantitatively with previous research.
The loss function is an essential part of a learning algorithm for finding the
parameters. Since our method is unsupervised, we adopted the view synthesis approach
described in chapter 4: a new view is synthesized, and the difference between the
target image and the synthesized view is taken. Figure 5.2 shows the synthesized image
and the difference between the target and synthesized images.
Figure 5.2: Difference between the target image and synthesized image
We test the model on the test images. The test images come from different scenes and
are used for visual inspection, as each image highlights the model's strengths and
weaknesses. First, we train the network with the Adam optimizer and then train it for
three more epochs using the SGD optimizer. Both versions are then tested on the test
images, as shown in Figure 5.3. In the predicted maps, bright regions correspond to
large pixel values and faded regions to small pixel values; large values indicate that
the scene content is near the camera, while small values indicate that it is far away.
Training for three more epochs with the SGD optimizer did not noticeably change the
test images, except for the fourth image, where the traffic pole shows some
improvement and gains more structure.
Then we compared the results with previous research, as shown in Figure 5.4. The 2nd
and 3rd columns show methods that predict depth in a supervised fashion, while the 4th
column shows a method that predicts depth in an unsupervised fashion. We can see that
the depth predictions improve when the weighted average depth is used. The previous
methods show blurry predictions; [16] gave good depth predictions, but they were still
blurry. Improving the depth network did give good predictions, but they still
contained some artifacts.
Figure 5.4: Comparison of our methods with the previous methods
Table 5.1 shows the error metrics calculated over the entire test dataset. The
original paper did not use data augmentation, and its results are available in [70].
By changing the depth networks and taking the weighted average, the results have
improved. We limit the ensemble to three depth networks because of limited resources.
Table 5.1: Depth evaluation results on KITTI dataset
The aim is to see whether improving the depth also improves the pose. Table 5.2 shows
the absolute trajectory error (ATE) and its standard deviation calculated using the
pose network. The error metrics are calculated on sequences 09 and 10 of the odometry
data [46]. The results show that improving the depth also improved the pose, but
Figure 5.5 showed that the trajectories were too far from the ground truth, which
indicates that the algorithm suffers from overfitting.
Method                      Seq. 09 ATE   Seq. 09 Std.   Seq. 10 ATE   Seq. 10 Std.
ORB-SLAM (Full)             0.014         0.008          0.012         0.011
ORB-SLAM (Short)            0.064         0.141          0.064         0.130
Mean Odometry               0.032         0.026          0.028         0.023
Zhou et al. (Original)      0.021         0.017          0.020         0.015
Zhou et al. (Data Aug.)     0.0179        0.011          0.0141        0.0115
Our (Adam Opt.)             0.0098        0.0062         0.0083        0.0067
Our (SGD Opt.)              0.0092        0.0060         0.0081        0.0066
Table 5.2: Absolute Trajectory Error (ATE) on KITTI test split
Chapter 6
Conclusion and Future Work
Depth and Ego-Motion estimation are demanding tasks in computer vision. Currently,
most of the focus has shifted from stereo vision towards monocular vision, even though
monocular depth estimation is an inherently ill-posed problem. To address this
problem, machine learning and deep learning techniques are used. Supervised learning
methods have still shown superior results compared to unsupervised learning methods;
however, the cost of obtaining the ground truth is high and requires extensive labor.
We saw that, when using the weighted average depth, the depth estimation had no effect
on the camera pose, and in the previous section we also discussed the drift that
occurred and its causes. Since we use the KITTI dataset, whose high-quality images
were captured with a very well-calibrated camera, the dataset is not the cause of the
drift we observed. The leading cause is most likely the pose network; to solve this,
we need another pose network that can accurately estimate the camera's pose and avoid
overfitting the data.
References
[1] R. C. Gonzalez, Digital image processing. Pearson education India, 2009.
[2] T. C. School, “What is a video?.” https://www.youtube.com/watch?v=
9CSjUl-xKSU&ab_channel=TheCutSchool.
[3] C.. Tutorial, “Learning-based depth estimation from stereo and monocular
images: successes, limitations and future challenges,” 2017.
[4] Y. Ming, X. Meng, C. Fan, and H. Yu, “Deep learning for monocular depth
estimation: A review,” Neurocomputing, vol. 438, pp. 14–33, 2021.
[5] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single monocular
images,” Advances in neural information processing systems, vol. 18, 2005.
[6] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single
image using a multi-scale deep network,” Advances in neural information
processing systems, vol. 27, 2014.
[7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI
dataset,” The International Journal of Robotics Research, 2013.
[8] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of
depth and Ego-Motion from video,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 1851–1858, 2017.
[9] G. Grisetti, R. Kümmerle, C. Stachniss, and W. Burgard, “A tutorial on
graph-based slam,” IEEE Intelligent Transportation Systems Magazine, vol. 2,
no. 4, pp. 31–43, 2010.
[10] D. Scaramuzza and F. Fraundorfer, “Tutorial: visual odometry,” IEEE Robot.
Autom. Mag, vol. 18, no. 4, pp. 80–92, 2011.
[11] M. S. N. G. Hanna, Vehicle Distance Detection Using Monocular Vision and
Machine Learning. PhD thesis, University of Windsor (Canada), 2019.
[12] D. A. Forsyth and J. Ponce, Computer vision: a modern approach. prentice
hall professional technical reference, 2002.
[13] X. Yang, H. Luo, Y. Wu, Y. Gao, C. Liao, and K.-T. Cheng, “Reactive
obstacle avoidance of monocular quadrotors with online adapted depth
prediction network,” Neurocomputing, vol. 325, pp. 142–158, 2019.
[14] R. Szeliski, Computer vision: algorithms and applications. Springer Nature, 2022.
[15] J. Heaton, “Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep learning,”
Springer, 2018.
[16] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep
learning,” arXiv preprint arXiv:1603.07285, 2016.
[17] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural
network for joint depth and surface normal estimation,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 283–291,
2018.
[18] “Feature matching (opencv).” https://docs.opencv.org/4.0.0/dc/dc3/
tutorial_py_matcher.html.
[19] “Harris corner detection (opencv).” https://docs.opencv.org/4.0.0/dc/
d0d/tutorial_py_features_harris.html.
[20] M. Trajković and M. Hedley, “Fast corner detection,” Image and vision
computing, vol. 16, no. 2, pp. 75–87, 1998.
[21] “Introduction to sift (scale-invariant feature transform).” https://docs.
opencv.org/4.x/da/df5/tutorial_py_sift_intro.html.
[22] R. M. Schmidt, “Recurrent neural networks (rnns): A gentle introduction and
overview,” arXiv preprint arXiv:1912.05911, 2019.
[23] F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth
estimation from a single image,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 5162–5170, 2015.
[24] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper
depth prediction with fully convolutional residual networks,” in 2016 Fourth
international conference on 3D vision (3DV), pp. 239–248, IEEE, 2016.
[25] J. Xie, R. Girshick, and A. Farhadi, “Deep3d: Fully automatic 2d-to-3d video
conversion with deep convolutional neural networks,” in European conference
on computer vision, pp. 842–857, Springer, 2016.
[26] C. Fehn, “Depth-image-based rendering (dibr), compression, and transmission
for a new approach on 3d-tv,” in Stereoscopic displays and virtual reality
systems XI, vol. 5291, pp. 93–104, SPIE, 2004.
[27] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for
real-time 6-dof camera relocalization,” 2015.
[28] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to
predict new views from the world’s imagery,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 5515–5524, 2016.
[29] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular
depth estimation with left-right consistency,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 270–279, 2017.
[30] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,”
Advances in neural information processing systems, 2015.
[31] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K.
Fragkiadaki, “Sfm-net: Learning of structure and motion from video,” arXiv
preprint arXiv:1704.07804, 2017.
[32] D. Tan, “Sfm self supervised depth estimation: Breaking down the ideas,”
2020.
[33] Q. Sun, Y. Tang, C. Zhang, C. Zhao, F. Qian, and J. Kurths, “Unsupervised
estimation of monocular depth and vo in dynamic environments via hybrid
masks,” IEEE Transactions on Neural Networks and Learning Systems, vol.
33, no. 5, pp. 2023–2033, 2021.
[34] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T.
Brox, “A large dataset to train convolutional networks for disparity, optical
flow, and scene flow estimation,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 4040–4048, 2016.
[35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical
image computing and computer-assisted intervention, pp. 234–241, Springer,
2015.
[36] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using
adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 4009–4018, 2021.
[37] G. Marques, D. Agarwal, and I. de la Torre Díez, “Automated medical
diagnosis of covid-19 through efficientnet convolutional neural network,”
Applied soft computing, vol. 96, p. 106691, 2020.
[38] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T.
Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An
image is worth 16x16 words: Transformers for image recognition at scale,”
arXiv preprint arXiv:2010.11929, 2020.
[39] “Preparing dataset.” https://github.com/ClementPinard/SfmLearner-Pytorch/tree/master/data.
[40] L. Perez and J. Wang, “The effectiveness of data augmentation in image
classification using deep learning,” arXiv preprint arXiv:1712.04621, 2017.
[41] D. Prokhorov, D. Zhukov, O. Barinova, K. Anton, and A. Vorontsova,
“Measuring robustness of visual slam,” in 2019 16th International Conference
on Machine Vision Applications (MVA), pp. 1–6, IEEE, 2019.
[42] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[43] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv
preprint arXiv:1609.04747, 2016.
[44] Y. Wu, L. Liu, J. Bae, K.-H. Chow, A. Iyengar, C. Pu, W. Wei, L. Yu, and Q.
Zhang, “Demystifying learning rate policies for high accuracy training of deep
neural networks,” in 2019 IEEE International conference on big data (Big
Data), pp. 1971–1980, IEEE, 2019.
[45] G. Goh, “Why momentum really works,” Distill, 2017.
[46] A. Geiger, P. Lenz, and R. Urtasun, “Visual odometry data,” 2012.