
OverlapNet: Loop Closing for LiDAR-based SLAM

Xieyuanli Chen∗ Thomas Läbe∗ Andres Milioto∗ Timo Röhling∗,‡


Olga Vysotska†,∗ Alexandre Haag† Jens Behley∗ Cyrill Stachniss∗
∗ Photogrammetry & Robotics Lab, University of Bonn, Germany
‡ Fraunhofer FKIE, Wachtberg, Germany
† Autonomous Intelligent Driving GmbH, Munich, Germany

Abstract—Simultaneous localization and mapping (SLAM) is a fundamental capability required by most autonomous systems.
In this paper, we address the problem of loop closing for SLAM
based on 3D laser scans recorded by autonomous cars. Our
approach utilizes a deep neural network exploiting different cues
generated from LiDAR data for finding loop closures. It estimates
an image overlap generalized to range images and provides a
relative yaw angle estimate between pairs of scans. Based on such
predictions, we tackle loop closure detection and integrate our
approach into an existing SLAM system to improve its mapping
results. We evaluate our approach on sequences of the KITTI
odometry benchmark and the Ford campus dataset. We show that
our method can effectively detect loop closures surpassing the
detection performance of state-of-the-art methods. To highlight
the generalization capabilities of our approach, we evaluate our
model on the Ford campus dataset while using only KITTI for
training. The experiments show that the learned representation
is able to provide reliable loop closure candidates, also in unseen
environments.

Fig. 1: Overlap of two scans (blue and orange points) at a loop closure location but computed with different relative transformations. The overlap depends on the relative transformation, and larger overlap values often correspond to better alignment between the point clouds. Our approach can predict the overlap without knowing the relative transformation between the scans.

I. INTRODUCTION
Simultaneous localization and mapping or SLAM [1, 29] is an integral part of most robots and autonomous cars. Graph-based SLAM often relies on (i) pose estimation relative to a recent history, which is called odometry or incremental scan matching, and (ii) loop closure detection, which is needed for data association on a global scale. Loop closures enable SLAM approaches to correct accumulated drift, resulting in a globally consistent map.

In this paper, we propose a new method to loop closing for laser range scans produced by a rotating 3D LiDAR sensor installed on a wheeled robot or similar vehicle. Instead of using handcrafted features [15, 31], we propose a deep neural network designed to find loop closure candidates. Our network predicts both a so-called overlap defined on range images and a relative yaw angle between two 3D LiDAR scans recorded with a typical sensor setup often used on automated cars. The concept of overlap has been used in photogrammetry to estimate image overlaps, see also Sec. III-A and III-B, and we use it on LiDAR range images. It is a useful tool for loop closure detection, as illustrated in Fig. 1, and can quantify the quality of matches. The yaw estimate serves as an initial guess for a subsequent application of the iterative closest point (ICP) algorithm [4] to determine the relative pose between scans and to derive loop closure constraints for the pose graph optimization. Instead of ICP, one could also use global scan matching [7, 37, 34] to estimate the relative pose between scans.

The main contribution of this paper is a deep neural network that exploits different types of information generated from LiDAR scans to provide overlap and relative yaw angle estimates between pairs of 3D scans. This information includes depth, normals, and intensity or remission values. We additionally exploit a probability distribution over semantic classes that can be computed for each laser beam. Our approach relies on a spherical projection of LiDAR scans, rather than the raw point clouds, which makes the proposed OverlapNet comparably lightweight. We furthermore integrate it into a state-of-the-art SLAM system [3] for loop closure detection and evaluate its performance also with respect to generalization to different environments.

We train the proposed OverlapNet on parts of the KITTI odometry dataset and evaluate it on unseen data. We thoroughly evaluate our approach, provide ablation studies using different modalities, and test the integrated SLAM system in an online manner. Furthermore, we provide results for the Ford campus dataset, which was recorded using a different sensor setup, in a different country, and in a differently structured environment. The experimental results suggest that our method outperforms other state-of-the-art baseline methods and is also able to generalize well to unseen environments.

In sum, our approach is able to (i) predict the overlap and relative yaw angle between pairs of LiDAR scans by
exploiting multiple cues without using relative poses, (ii) combine odometry information with overlap predictions to detect correct loop closure candidates, (iii) improve the overall pose estimation results in a state-of-the-art SLAM system yielding more globally consistent maps, (iv) solve loop closure detection without prior pose information, and (v) initialize ICP using the OverlapNet predictions yielding correct scan matching results. The implementation of our approach is available at: https://github.com/PRBonn/OverlapNet

II. RELATED WORK

Loop closure detection using various sensor modalities [16, 28] is a classical topic in robot mapping. We refer to the article by Lowry et al. [21] for an overview of approaches using cameras. Here, we mainly concentrate on related work addressing 3D LiDAR-based approaches.

Steder et al. [31] propose a place recognition system operating on range images generated from 3D LiDAR data that uses a combination of bag-of-words and a NARF-feature-based [30] relative pose estimation exploiting ideas of FAB-MAP [11]. Röhling et al. [26] present an efficient method for detecting loop closures through the use of similarity measures on histograms extracted from 3D LiDAR scans. The work by He et al. [15] presents M2DP, which projects a LiDAR scan into multiple reference planes to generate a descriptor using a density signature of points in each plane. Besides using pure geometric information, there is also work [9, 14] exploiting the remission information, i.e., how well LiDAR beams are reflected by a surface, to create descriptors for localization and loop closure detection with 3D LiDAR data.

Motivated by the success of deep learning in computer vision [20], learning-based methods have been proposed recently. Barsan et al. [2] propose a deep network-based localization method, which embeds LiDAR sweeps and intensity maps into a joint embedding space and achieves localization by matching between these embeddings. Dubé et al. [12] advocate the usage of segments for loop closure detection. Cramariuc et al. [10] train a CNN to extract descriptors from segments and use it to retrieve near-by place candidates. Schaupp et al. [27] propose a system called OREOS for place recognition that also estimates the yaw discrepancy between scans. Furthermore, Yin et al. [35] develop LocNet, which uses semi-handcrafted feature learning based on a siamese network to solve place recognition. Lu et al. [22] propose L3-Net, which uses 3D convolutions and a recurrent neural network to learn local descriptors for global localization. Uy and Lee [33] propose PointNetVLAD to generate a global descriptor for 3D point clouds. Kim et al. [19] propose a learning-based descriptor called SCI to solve long-term global localization. Most recently, Sun et al. [32] also propose a learning-based method combined with Monte Carlo localization to achieve fast global localization.

Contrary to the above-mentioned methods, our method exploits multiple types of information extracted from 3D LiDAR scans, including depth, normal information, intensity/remission, and probabilities of semantic classes generated by a semantic segmentation system [23].

Similar to LocNet [35], we also use a siamese network, but we learn features and yield predictions end-to-end. Our network can directly provide estimates for overlap and the relative yaw angle between pairs of LiDAR scans. Different from OREOS [27], our method not only provides loop closure candidates but also an estimate of the matching quality in terms of the overlap.

Recently, Zaganidis et al. [36] proposed a Normal Distributions Transform (NDT) histogram-based loop closure detection method, which is also assisted by semantic information. In contrast to ours, their method needs a dense global map and cannot estimate the relative yaw angle.

III. OUR APPROACH

A. The Concept of Overlap

The idea of overlap that we are using here has its origin in the photogrammetry and computer vision community [18]. To successfully match two images and calculate their relative pose, the images must overlap. This can be quantified by defining the overlap percentage as the percentage of pixels in the first image that can successfully be projected back into the second image without occlusion. Note that this measure is not symmetric: If there is a large scale difference between the images, e.g., one image shows a wall and the other shows many buildings around that wall, the overlap percentage from the first to the second image can be large, while it is low from the second to the first image. In this paper, we use the idea of overlap for range images, exploiting the range information explicitly.

For loop closing, a threshold on the overlap percentage can be used to decide whether two LiDAR scans are taken at the same place and/or a loop closure can be established. For this purpose, the measure may even be better than the commonly used distance between the recorded positions of a pair of scans, since the positions might be affected by drift and therefore be unreliable. The overlap predictions are independent of the relative poses and can therefore be used to find loop closures without knowing the correct relative pose between scans. Fig. 1 shows the overlap of two scans as an example.

B. Definition of the Overlap between Pairs of LiDAR Scans

We use spherical projections of LiDAR scans as input data, which are often used to speed up computations [3, 5, 8]. We project the point cloud P to a so-called vertex map V : R^2 -> R^3, where each pixel is mapped to the nearest 3D point. Each point p_i = (x, y, z) is converted via the function Pi : R^3 -> R^2 to spherical coordinates and finally to image coordinates (u, v) by

u = 1/2 (1 - arctan(y, x) pi^{-1}) w
v = (1 - (arcsin(z r^{-1}) + f_up) f^{-1}) h,   (1)

where r = ||p||_2 is the range, f = f_up + f_down is the vertical field-of-view of the sensor, and w, h are the width and height of the resulting vertex map V.
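To make the projection concrete, the following minimal NumPy sketch maps a point cloud to a range map R and a vertex map V in the spirit of Eq. (1). The image size 64 x 900 matches the input size used in Sec. III-C; the field-of-view values are assumed HDL-64E-like parameters, and the vertical coordinate is normalized with the magnitude of the lower field-of-view bound so that it spans the full image height, as in common range-image implementations. Function and parameter names are illustrative, not taken from the released implementation.

```python
import numpy as np

def spherical_projection(points, h=64, w=900, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud to a range map R and a vertex map V (cf. Eq. (1)).

    fov_up_deg / fov_down_deg are assumed sensor parameters (HDL-64E-like)."""
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = abs(fov_up) + abs(fov_down)                    # vertical field of view f

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                   # range r = ||p||_2
    yaw = np.arctan2(y, x)
    pitch = np.arcsin(z / np.maximum(r, 1e-8))

    # Image coordinates; the vertical axis uses |fov_down| so that v covers [0, h).
    u = 0.5 * (1.0 - yaw / np.pi) * w
    v = (1.0 - (pitch + abs(fov_down)) / fov) * h
    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    # Keep the nearest point per pixel: write points in order of decreasing range
    # so that the closest 3D point ends up in the vertex map.
    order = np.argsort(r)[::-1]
    range_map = np.full((h, w), -1.0, dtype=np.float32)   # R, one channel
    vertex_map = np.zeros((h, w, 3), dtype=np.float32)    # V, nearest 3D point per pixel
    range_map[v[order], u[order]] = r[order]
    vertex_map[v[order], u[order]] = points[order]
    return range_map, vertex_map
```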
For a pair of LiDAR scans P_1 and P_2, we generate the corresponding vertex maps V_1, V_2. We denote the sensor-centered coordinate frame at time step t as C_t. Each pixel in coordinate frame C_t is associated with the world frame W by a pose T_{W C_t} in R^{4x4}. Given the poses T_{W C_1} and T_{W C_2}, we can reproject scan P_1 into the coordinate frame of the other's vertex map V_2 and generate a reprojected vertex map V'_1:

V'_1 = Pi( T^{-1}_{W C_1} T_{W C_2} P_1 ).   (2)

We then calculate the absolute difference of all corresponding pixels in V'_1 and V_2, considering only those pixels that correspond to valid range readings in both range images. The overlap is then calculated as the percentage of all differences within a certain distance epsilon relative to all valid entries, i.e., the overlap of two LiDAR scans O_{C_1 C_2} is defined as follows:

O_{C_1 C_2} = sum_{(u,v)} I{ ||V'_1(u, v) - V_2(u, v)|| <= epsilon } / min( valid(V'_1), valid(V_2) ),   (3)

where I{a} = 1 if a is true and 0 otherwise, and valid(V) is the number of valid pixels in V, since not all pixels might have a valid LiDAR measurement associated after the projection.
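For intuition, a direct NumPy sketch of this computation could look as follows. It reuses the spherical_projection sketch given above, the default epsilon of 1 m matches the value reported in Sec. IV, and the homogeneous-coordinate handling assumes that T_w_ci maps points from the sensor frame C_i to the world frame; none of these names come from the released code.

```python
import numpy as np

def reproject(points1, T_w_c1, T_w_c2, h=64, w=900):
    """Express scan 1 in the sensor frame of scan 2 and build its range/vertex map V1'."""
    p_h = np.hstack([points1, np.ones((points1.shape[0], 1))])       # homogeneous points
    p_in_c2 = (np.linalg.inv(T_w_c2) @ T_w_c1 @ p_h.T).T[:, :3]      # scan 1 seen from C2
    return spherical_projection(p_in_c2, h, w)

def overlap(points1, points2, T_w_c1, T_w_c2, eps=1.0):
    """Overlap O_{C1 C2} of two scans given their poses (cf. Eq. (3))."""
    range1, vertex1 = reproject(points1, T_w_c1, T_w_c2)
    range2, vertex2 = spherical_projection(points2)

    valid1 = range1 > 0                                   # pixels with a LiDAR return
    valid2 = range2 > 0
    both = valid1 & valid2

    # absolute difference of corresponding vertex-map entries, thresholded by eps
    diff = np.linalg.norm(vertex1 - vertex2, axis=2)
    inliers = np.count_nonzero(both & (diff <= eps))
    return inliers / max(1, min(valid1.sum(), valid2.sum()))
```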
Fig. 2: Overlap estimations of one frame to all others for (a) the exhaustive evaluation of Eq. (3), (b) the OverlapNet estimates, and (c) the ground truth. The red arrow points out the position of the query scan. If we directly use Eq. (3) to estimate the overlap between two LiDAR scans without knowing the accurate relative poses, it is hard to decide which pairs of scans are true loop closures, since most evaluations of Eq. (3) show high values. In contrast, our OverlapNet can predict the overlaps between two LiDAR scans well.

We use Eq. (3) only for creating training data, i.e., only positive examples of correct loop closures get a non-zero overlap assigned using the relative poses between scans, as shown in Fig. 2(c). However, when performing loop closure detection for online SLAM, the approximate relative poses from SLAM before loop closure are not accurate enough to calculate usable overlaps by using Eq. (3) because of accumulated drift. We tried directly estimating overlaps using Eq. (3), assuming the relative pose to be the identity and applying different orientations, e.g., every 30 degrees of rotation around the vertical axis, and using the maximum over all these overlaps as an estimate. Fig. 2 shows the overlaps between a query scan and all other scans estimated by this method as well as the corresponding estimates of OverlapNet. We leave out the 100 most recent scans because they will not be loop closure candidates. In the case of the exhaustive approach, many wrong loop closure candidates get high overlap values, while our approach performs better since it produces a highly distinctive peak around the correct location. Furthermore, it takes on average 1.2 s to calculate the overlap for one pair of scans using the exhaustive approach, which makes it unusable in real-world scenarios. In contrast, the complete OverlapNet needs on average 17 ms for one pair overlap estimation when using depth and normal information only.
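The exhaustive baseline described above can be written directly on top of the overlap function sketched earlier; the 30-degree step follows the text, everything else is illustrative.

```python
import numpy as np

def exhaustive_overlap(points1, points2, step_deg=30):
    """Estimate the overlap of two scans without known poses: assume the identity
    pose, try a set of yaw rotations, and keep the maximum overlap (baseline only)."""
    best = 0.0
    for yaw in np.radians(np.arange(0, 360, step_deg)):
        T = np.eye(4)
        T[:2, :2] = [[np.cos(yaw), -np.sin(yaw)],
                     [np.sin(yaw),  np.cos(yaw)]]
        # rotated identity for scan 1, identity for scan 2
        best = max(best, overlap(points1, points2, T, np.eye(4)))
    return best
```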
C. Overlap Network Architecture

The overview of the proposed OverlapNet is depicted in Fig. 3. We exploit multiple cues, which can be generated from a single LiDAR scan, including depth, normal, intensity, and semantic class probability information. The depth information is stored in the range map R, which consists of one channel. We use neighborhood information of the vertex map to generate a normal map N, which has three channels encoding the normal coordinates. We directly obtain the intensity information, also called remission, from the sensor and represent it as a one-channel intensity map I. The point-wise semantic class probabilities are computed using RangeNet++ [23], and we represent them as a semantic map S. RangeNet++ delivers probabilities for 20 different classes. For efficiency's sake, we reduce the 20-dimensional RangeNet++ output to a compressed 3-dimensional vector using principal component analysis. The information is combined into an input tensor of size 64 x 900 x D, where 64, 900 are the height and width of the inputs, and D depends on the types of data used.

Our proposed OverlapNet is a siamese network architecture [6], which consists of two legs sharing weights and two heads that use the same pair of feature volumes generated by the two legs. The trainable layers are listed in Tab. I.

1) Legs: The proposed OverlapNet has two legs, which have the same architecture and share the same weights. Each leg is a fully convolutional network (FCN) consisting of 11 convolutional layers. This architecture is quite lightweight and generates feature volumes of size 1 x 360 x 128. Note that our range images are cyclic projections and that a change in the yaw angle of the vehicle results in a cyclic column shift of the range image. Thus, the single row in the feature volume can represent a relative yaw angle estimate (because a yaw angle rotation results in a pure horizontal shift of the input maps). As the FCN is translation-equivariant, the feature volume will be shifted horizontally. The number of columns of the feature volume defines the resolution of the yaw estimation, which is 1 degree in the case of our leg architecture.

2) Delta Head: The delta head is designed to estimate the overlap between two scans. It consists of a delta layer, three convolutional layers, and one fully connected layer.

The delta layer, shown in Fig. 4, computes all possible absolute differences of all pixels. It takes the output feature volumes L_l in R^{HxWxC} from the two legs l as input. These are stacked in a tiled tensor T_l in R^{HW x HW x C} as follows:

T_0(iW + j, k, c) = L_0(i, j, c)   (4)
T_1(k, iW + j, c) = L_1(i, j, c),   (5)

with k = {0, . . . , HW - 1}, i = {0, . . . , H - 1}, and j = {0, . . . , W - 1}.
Fig. 3: Pipeline overview of our proposed approach. The left-hand side shows the preprocessing of the input data which exploits multiple
cues generated from a single LiDAR scan, including range R, normal N , intensity I, and semantic class probability S information. The
right-hand side shows the proposed OverlapNet which consists of two legs sharing weights and the two heads use the same pair of feature
volumes generated by the two legs. The outputs are the overlap and relative yaw angle between two LiDAR scans.

TABLE I: Layers of our network architecture.

             Operator   Stride    Filters   Size      Output Shape
Legs         Conv2D     (2, 2)    16        (5, 15)   30 x 443 x 16
             Conv2D     (2, 1)    32        (3, 15)   14 x 429 x 32
             Conv2D     (2, 1)    64        (3, 15)   6 x 415 x 64
             Conv2D     (2, 1)    64        (3, 12)   2 x 404 x 64
             Conv2D     (2, 1)    128       (2, 9)    1 x 396 x 128
             Conv2D     (1, 1)    128       (1, 9)    1 x 388 x 128
             Conv2D     (1, 1)    128       (1, 9)    1 x 380 x 128
             Conv2D     (1, 1)    128       (1, 9)    1 x 372 x 128
             Conv2D     (1, 1)    128       (1, 7)    1 x 366 x 128
             Conv2D     (1, 1)    128       (1, 5)    1 x 362 x 128
             Conv2D     (1, 1)    128       (1, 3)    1 x 360 x 128
Delta Head   Conv2D     (1, 15)   64        (1, 15)   360 x 24 x 64
             Conv2D     (15, 1)   128       (15, 1)   24 x 24 x 128
             Conv2D     (1, 1)    256       (3, 3)    22 x 22 x 256
             Dense      -         -         -         1
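As an illustration of Tab. I, a Keras sketch of one leg could look as follows. It assumes 'valid' padding and ReLU activations, which reproduce the listed output shapes; D is the number of input channels and depends on the cues used. This is a sketch, not the released implementation.

```python
import tensorflow as tf

def build_leg(input_channels):
    """One OverlapNet leg following Tab. I: 11 convolutions without padding,
    mapping a 64 x 900 x D input to a 1 x 360 x 128 feature volume."""
    conv = lambda f, k, s: tf.keras.layers.Conv2D(f, k, strides=s,
                                                  padding='valid', activation='relu')
    return tf.keras.Sequential([
        tf.keras.Input(shape=(64, 900, input_channels)),
        conv(16, (5, 15), (2, 2)),    # 30 x 443 x 16
        conv(32, (3, 15), (2, 1)),    # 14 x 429 x 32
        conv(64, (3, 15), (2, 1)),    #  6 x 415 x 64
        conv(64, (3, 12), (2, 1)),    #  2 x 404 x 64
        conv(128, (2, 9), (2, 1)),    #  1 x 396 x 128
        conv(128, (1, 9), (1, 1)),    #  1 x 388 x 128
        conv(128, (1, 9), (1, 1)),    #  1 x 380 x 128
        conv(128, (1, 9), (1, 1)),    #  1 x 372 x 128
        conv(128, (1, 7), (1, 1)),    #  1 x 366 x 128
        conv(128, (1, 5), (1, 1)),    #  1 x 362 x 128
        conv(128, (1, 3), (1, 1)),    #  1 x 360 x 128
    ])

# Both legs share weights, so the same model instance is applied to both inputs,
# e.g. for depth (1 channel) plus normals (3 channels):
leg = build_leg(input_channels=4)
```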
Fig. 4: Delta layer. Computation of pairwise differences is efficiently performed by concatenating the feature volumes and transposition of one concatenated feature volume.

Note that T_1 is transposed with respect to T_0, as depicted in the middle of Fig. 4. After that, all differences are calculated as element-wise absolute differences between T_0 and T_1. By using the delta layer, we obtain a representation of the latent difference information, which can later be exploited by the convolutional and fully connected layers to estimate the overlap. Different overlaps induce different patterns in the output of the delta layer.
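A small NumPy sketch of the delta layer (the tiling of Eqs. (4)-(5) followed by the element-wise absolute difference) could look as follows; with the leg output of Tab. I, H = 1, W = 360 and C = 128. The function name is illustrative.

```python
import numpy as np

def delta_layer(L0, L1):
    """All pairwise absolute differences between two leg outputs of shape (H, W, C).

    Returns a tensor of shape (H*W, H*W, C): T0 tiles L0 along the rows (Eq. (4)),
    while T1 is the transposed tiling of L1 along the columns (Eq. (5))."""
    H, W, C = L0.shape
    flat0 = L0.reshape(H * W, C)       # row index corresponds to iW + j
    flat1 = L1.reshape(H * W, C)

    T0 = np.repeat(flat0[:, np.newaxis, :], H * W, axis=1)   # T0(iW+j, k, c) = L0(i, j, c)
    T1 = np.repeat(flat1[np.newaxis, :, :], H * W, axis=0)   # T1(k, iW+j, c) = L1(i, j, c)
    return np.abs(T0 - T1)

# Example with the feature volume size of Tab. I (1 x 360 x 128):
diff = delta_layer(np.random.rand(1, 360, 128), np.random.rand(1, 360, 128))
print(diff.shape)   # (360, 360, 128), fed to the remaining delta-head convolutions
```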
3) Correlation Head: The correlation head [24] is designed to estimate the yaw angle between two scans using the feature volumes of the two legs. To perform the cross-correlation, we first pad one feature volume horizontally by copying its values (as the range images are cyclic projections around the yaw angle). This doubles the size of the feature volume. We then use the other feature volume as a kernel that is shifted over the first feature volume, generating a 1D output of size 360. The argmax of this output serves as the estimate of the relative yaw angle of the two input scans with a 1 degree resolution.

D. Loss Functions

We train our OverlapNet end-to-end to estimate the overlap and the relative yaw angle between two LiDAR scans at the same time. Typically, to train a neural network one needs a large amount of manually labeled ground truth data. In our case, this is (I_1, I_2, Y_O, Y_Y), where I_1, I_2 are the two inputs and Y_O, Y_Y are the ground truth overlap and the ground truth yaw angle, respectively. We are, however, able to generate the input and the ground truth without any manual effort in a fully automated fashion given a dataset with pose information: from the given poses, we can calculate the ground truth overlaps and relative yaw angles directly. We denote the leg part of the network with trainable weights as f_L(.), the delta head as f_D(.), and the correlation head as f_C(.).

For training, we combine the loss L_O(.) for the overlap and the loss L_Y(.) for the yaw angle using a weight alpha:

L(I_1, I_2, Y_O, Y_Y) = L_O(I_1, I_2, Y_O) + alpha L_Y(I_1, I_2, Y_Y).   (6)

We treat the overlap estimation as a regression problem and use a weighted absolute difference of the ground truth Y_O and the network output Y^_O = f_D(f_L(I_1), f_L(I_2)) as the loss function. For weighting, we use a scaled sigmoid function:

L_O(I_1, I_2, Y_O) = sigmoid( s ( |Y^_O - Y_O| + a ) - b ),   (7)

with sigmoid(v) = (1 + exp(-v))^{-1}, where a, b are offsets and s is a scaling factor.

For the yaw angle estimation, we use a lightweight representation of the correlation head output, which leads to a
one-dimensional vector of size 360. We take the index of the maximum, the argmax, as the estimate of the relative angle in degrees. As the argmax is not differentiable, we cannot treat this as a simple regression problem. The yaw angle estimation, however, can be regarded as a binary classification problem that decides for every entry of the head output whether it is the correct angle or not. Therefore, we use the binary cross-entropy loss given by

L_Y(I_1, I_2, Y_Y) = sum_{i in {1,...,N}} H(Y_Y^i, Y^_Y^i),   (8)

where H(p, q) = -p log(q) - (1 - p) log(1 - q) is the binary cross-entropy and N is the size of the output 1D vector. Y^_Y = f_C(f_L(I_1), f_L(I_2)) is the relative yaw angle estimate. Note that we only train the network to estimate the relative yaw angle for pairs of scans with an overlap larger than 30%, since this minimum overlap is needed to obtain correct pose estimates from ICP, as explained in Sec. IV-A and experimentally validated in Sec. IV-F.
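To make the two heads and the training objective concrete, a NumPy sketch could look as follows. The cross-correlation mimics the correlation head of Sec. III-C (cyclically shifting one feature volume over the other, which is equivalent to padding it by copying its values), and the losses follow Eqs. (6)-(8) with the hyperparameters reported in Sec. IV (alpha = 5, a = 0.25, b = 12, s = 24). How the real head normalizes its output before the cross-entropy is not specified here, so the squashing below is an assumption; all function names are illustrative.

```python
import numpy as np

def yaw_correlation(F0, F1):
    """Correlation head sketch: circular cross-correlation of two 1 x 360 x C volumes.

    Returns a 360-dimensional response; its argmax is the relative yaw estimate
    in degrees (1 degree per feature column)."""
    A = F0.reshape(F0.shape[1], -1)              # (360, C)
    B = F1.reshape(F1.shape[1], -1)
    response = np.empty(A.shape[0])
    for shift in range(A.shape[0]):
        response[shift] = np.sum(A * np.roll(B, shift, axis=0))
    return response

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def overlap_loss(overlap_pred, overlap_gt, a=0.25, b=12.0, s=24.0):
    """Eq. (7): scaled sigmoid of the absolute overlap error."""
    return sigmoid(s * (np.abs(overlap_pred - overlap_gt) + a) - b)

def yaw_loss(response, yaw_gt_deg):
    """Eq. (8): binary cross-entropy over the 360 correlation entries, treating the
    ground-truth yaw bin as the single positive class."""
    q = sigmoid(response - response.max())        # assumed squashing into (0, 1)
    p = np.zeros_like(q)
    p[int(yaw_gt_deg) % len(q)] = 1.0
    eps = 1e-12
    return -np.sum(p * np.log(q + eps) + (1.0 - p) * np.log(1.0 - q + eps))

def total_loss(overlap_pred, overlap_gt, response, yaw_gt_deg, alpha=5.0):
    """Eq. (6): weighted combination; the yaw term is only used for overlaps > 30%."""
    loss = overlap_loss(overlap_pred, overlap_gt)
    if overlap_gt > 0.3:
        loss = loss + alpha * yaw_loss(response, yaw_gt_deg)
    return loss
```

The argmax of the correlation response yields the yaw estimate at 1 degree resolution, matching the lightweight representation described above.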
E. SLAM Pipeline

We use the surfel-based mapping system called SuMa [3] as our SLAM pipeline and integrate OverlapNet into SuMa, replacing its original heuristic loop closure detection method. We only summarize here the steps of SuMa relevant to our approach and refer for more details to the original paper [3].

SuMa uses the same vertex map V_D and normal map N_D as discussed in Sec. III-B. Furthermore, SuMa uses projective ICP with respect to a rendered map view V_M and N_M at timestep t - 1 to compute the pose update T_{C_{t-1} C_t} and, consequently, T_{W C_t} by chaining all pose increments. Therefore, each vertex u in V_D is projectively associated to a reference vertex v_u in V_M. Given this association information, SuMa estimates the transformation between scans by incrementally minimizing the point-to-plane error given by

E(V_D, V_M, N_M) = sum_{u in V_D} ( n_u^T ( T^{(k)}_{C_{t-1} C_t} u - v_u ) )^2.   (9)

Each vertex u in V_D is projectively associated to a reference vertex v_u in V_M and its normal n_u in N_M via

v_u = V_M( Pi( T^{(k)}_{C_{t-1} C_t} u ) )   (10)
n_u = N_M( Pi( T^{(k)}_{C_{t-1} C_t} u ) ).   (11)

SuMa then minimizes the objective of Eq. (9) using Gauss-Newton and determines increments delta by iteratively solving

delta = (J_delta^T W J_delta)^{-1} J_delta^T W r,   (12)

where W in R^{n x n} is a diagonal matrix containing weights w_u, r in R^n is the stacked residual vector, and J_delta in R^{n x 6} is the Jacobian of r with respect to the increment delta.

SuMa employs a loop closure detection module, which considers the nearest frame in the built map as the candidate for loop closure given the current pose estimate. This works well for small loops, but the heuristic fails in areas with only a few large loops. Furthermore, drift in the odometry estimate can lead to large displacements, where the heuristic of just taking the nearest frame in the already mapped areas does not yield correct candidates, as will also be shown in our experiments.

F. Covariance Propagation for Geometric Verification

SuMa's loop closure detection uses a fixed search radius. In contrast, we use the covariance of the pose estimate and error propagation to automatically adjust the search radius.

We assume a noisy pose T_{C_{t-1} C_t} = {T-bar_{C_{t-1} C_t}, Sigma_{C_{t-1} C_t}} with mean T-bar_{C_{t-1} C_t} and covariance Sigma_{C_{t-1} C_t}. We can estimate the covariance matrix by

Sigma_{C_{t-1} C_t} = (1/K) (E / (N - M)) (J_delta^T W J_delta)^{-1},   (13)

where K is the correction factor of the Huber-robustified covariance estimation [17], E is the sum of the squared point-to-plane errors (sum of squared residuals) given the pose T_{C_{t-1} C_t}, see Eq. (9), N is the number of correspondences, and M = 6 is the dimension of the transformation between two 3D poses.

To estimate the propagated uncertainty during the incremental pose estimation, we can update the mean and covariance as follows:

T-bar_{C_{t-1} C_{t+1}} = T-bar_{C_{t-1} C_t} T-bar_{C_t C_{t+1}}   (14)
Sigma_{C_{t-1} C_{t+1}} approx Sigma_{C_{t-1} C_t} + J_{C_t C_{t+1}}^T Sigma_{C_t C_{t+1}} J_{C_t C_{t+1}},   (15)

where J_{C_t C_{t+1}} is the Jacobian of T_{C_t C_{t+1}}.

Since we need the Mahalanobis distance D_M as a probabilistic distance measure between two poses, we make use of the Lie algebra to express T as a 6D vector xi in se(3) using xi = log T, yielding

D_M(T_{C_1}, T_{C_2}) = sqrt( Delta-xi_{C_1 C_2}^T Sigma_{C_1 C_2}^{-1} Delta-xi_{C_1 C_2} ).   (16)

Using the scaled distance, we can now restrict the search space depending on the pose uncertainty to save computation time. However, we can use our framework also without any prior information, i.e., perform place recognition.

IV. EXPERIMENTAL EVALUATION

The experimental evaluation is designed to support the key claims that our approach is able to: (i) predict the overlap and relative yaw angle between pairs of LiDAR scans by exploiting multiple cues without given poses, (ii) combine odometry information with overlap predictions to detect correct loop closure candidates, (iii) improve the overall pose estimation results in graph-based SLAM yielding more globally consistent maps, (iv) solve loop closure detection without prior pose information, and (v) initialize ICP using OverlapNet predictions yielding correct scan matching results.

We train and evaluate our approach on the KITTI odometry benchmark [13], which provides LiDAR scans recorded with a Velodyne HDL-64E in urban areas around Karlsruhe in Germany. We follow the experimental setup of Schaupp et al. [27] and use sequence 00 for evaluation. Sequences 03-10 are used for training and sequence 02 is used for validation.
Fig. 5: Precision-recall curves of different approaches (M2DP, Histogram, SuMa, Ours, Ours_CovNearestOfTop10) on (a) KITTI sequence 00 and (b) Ford campus sequence 00.

To evaluate the generalization ability of our method, we also test it on the Ford campus dataset [25], which was recorded on the Ford research campus and in downtown Dearborn in Michigan using a different version of the Velodyne HDL-64E. In the case of the Ford campus dataset, we test our method on sequence 00, which has several large loops. Note that we never trained our approach on the Ford campus dataset.

For generating the overlap ground truth, we only use points within a distance of 75 m to the sensor. For the overlap computation, see Eq. (3), we use epsilon = 1 m. We use a learning rate of 10^{-3} with a decay of 0.99 every epoch and train at most 100 epochs. For the combined loss, Eq. (6), we set alpha = 5. For the overlap loss, Eq. (7), we use a = 0.25, b = 12, and a scale factor s = 24.

A. Loop Closure Detection

In our first experiments, we investigate the loop closure performance of our approach and compare it to existing methods. Loop closure detection typically assumes that robots revisit places during the mapping while moving with uncertain odometry. Therefore, the prior information about the robot poses extracted from the pose graph is available for the loop closure detection. The following criteria are used in these experiments:
• To avoid detecting a loop closure in the most recent scans, we do not search candidates in the latest 100 scans.
• For each query scan, only the best candidate is considered throughout this evaluation.
• Most SLAM systems search for potential closures only within the 3-sigma area around the current pose estimate. We do the same, either using the Euclidean or the Mahalanobis distance, depending on the approach.
• We use a relatively low threshold of 30% for the overlap to decide if a candidate is a true positive. We aim to find more loops even in some challenging situations with low overlaps, e.g., when the car drives back to an intersection from the opposite direction (as highlighted in the supplementary video¹). Furthermore, ICP can find correct poses if the overlap between pairs of scans is around 30%, as illustrated in the experimental evaluation.

¹ https://youtu.be/YTfliBco6aw

We evaluate OverlapNet on both KITTI sequence 00 and Ford campus sequence 00 using the precision-recall curves shown in Fig. 5. We compare our method, trained with two heads and all cues (labeled as Ours (AllChannel, TwoHeads)), with three state-of-the-art approaches, M2DP [15], Histogram [26], and the original SuMa [3]. Since SuMa always uses the nearest frame as the candidate for loop closure detection, we can only get one pair of precision and recall values, resulting in a single point. We also show the result of our method using prior information, named Ours_CovNearestOfTop10, which uses covariance propagation (Sec. III-F) to define the search space with the Mahalanobis distance and uses the nearest in Mahalanobis distance of the top 10 predictions of OverlapNet as the loop closure candidates.
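A sketch of this selection strategy, assuming that per-frame overlap predictions and the relative pose vectors with their propagated covariances (Sec. III-F) are already available, could look as follows; the function names are illustrative.

```python
import numpy as np

def mahalanobis_distance(xi_delta, cov):
    """Eq. (16): distance between two poses, given the relative 6D vector
    xi_delta = log(T_1^{-1} T_2) in se(3) and the propagated covariance."""
    return float(np.sqrt(xi_delta @ np.linalg.inv(cov) @ xi_delta))

def select_candidate(overlap_pred, maha_dist, top_k=10):
    """CovNearestOfTop10: among the top-k frames by predicted overlap, return the
    one closest to the current pose estimate in Mahalanobis distance.

    overlap_pred, maha_dist: arrays over all candidate frames (the most recent
    100 scans are assumed to be excluded already)."""
    if overlap_pred.size == 0:
        return None
    top = np.argsort(overlap_pred)[::-1][:top_k]     # top-k overlap predictions
    return int(top[np.argmin(maha_dist[top])])       # nearest of those in Mahalanobis distance
```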
Tab. II shows the comparison between our approach and the state of the art using the F1 score and the area under the curve (AUC) on both the KITTI and the Ford campus dataset. For the KITTI dataset, our approach uses the model trained with all cues, including depth, normals, intensity, and a probability distribution over semantic classes. For the Ford campus dataset, our approach uses the model trained with geometric information only, named Ours (GeoOnly), since the other cues are not available in this dataset. We can see that our method outperforms the other methods on the KITTI dataset and attains a similar performance on the Ford campus dataset. There are two reasons for the worse performance on the Ford campus dataset: First, we never trained our network on the Ford campus dataset or even on US roads, and secondly, only geometric information is available in the Ford campus dataset. However, our method outperforms all baseline methods on both the KITTI and the Ford campus dataset if we integrate prior information.

We also show the performance in comparison to variants of our method in Tab. III. We compare our best model, AllChannel, using two heads and all available cues to a variant which only uses a basic multilayer perceptron as the head, named MLPOnly, which consists of two hidden fully connected layers and a final fully connected layer with two neurons (one for overlap, one for yaw angle). The substantial difference in the AUC and F1 scores shows that such a simple network structure is not sufficient to get a good result. Training the network with only one head (only the delta head for overlap estimation, named DeltaOnly) does not have a significant influence on the performance. A huge gain can be observed when regarding the nearest frame in Mahalanobis distance among the top 10 candidates in overlap percentage (CovNearestOfTop10).

B. Qualitative Results

The second experiment is designed to support the claim that our method is able to improve the overall mapping result. Fig. 6 shows the odometry results on KITTI sequence 02. The color in Fig. 6 shows the 3D translation error (including height). The left figure shows SuMa and the right figure shows Ours_CovNearestOfTop10 using the proposed OverlapNet to detect loop closures. We can see that after integrating our method, the overall odometry is much more
accurate, since we can provide more loop closure candidates with higher accuracy in terms of overlap. The colors represent the translation error of the estimated poses with respect to the ground truth. Furthermore, after integrating the proposed OverlapNet, the SLAM system can find more loops even in some challenging situations, e.g., when the car drives back to an intersection from the opposite direction, which is highlighted in the supplementary video¹.

TABLE II: Comparison with state of the art.

Dataset       Approach                      AUC    F1 score
KITTI         Histogram [26]                0.83   0.83
              M2DP [15]                     0.83   0.87
              SuMa [3]                      -      0.85
              Ours (AllChannel, TwoHeads)   0.87   0.88
Ford Campus   Histogram [26]                0.84   0.83
              M2DP [15]                     0.84   0.85
              SuMa [3]                      -      0.33
              Ours (GeoOnly)                0.85   0.84

TABLE III: Comparison with our variants.

Dataset       Variant                       AUC    F1 score
KITTI         MLPOnly                       0.58   0.65
              DeltaOnly                     0.85   0.88
              CovNearestOfTop10             0.96   0.96
              Ours (AllChannel, TwoHeads)   0.87   0.88
Ford Campus   Ours (GeoOnly)                0.85   0.84
              GeoCovNearestOfTop10          0.85   0.88

Fig. 6: Qualitative result on KITTI sequence 02: (a) SuMa, (b) Ours_CovNearestOfTop10. The color encodes the translation error with respect to the ground truth.

Fig. 7: Loop closure detection performance on KITTI sequence 00.

C. Loop Closure Detection without Odometry Information

The third experiment is designed to support the claim that our approach is well-suited for solving the more general loop closure detection task without using odometry information. In this case, we assume that we have no prior information about the robot pose. To compare with the state-of-the-art method OREOS [27], we follow their experimental setup and refer to the original paper for more details. The OREOS results are those produced by the authors of OREOS.

The respective loop closure candidate recall results are shown in Fig. 7. Our method outperforms all the baseline methods with a small number of candidates and attains similar performance as the baseline methods for higher numbers of candidates. However, OREOS and LocNet++ attain a slightly higher recall if more candidates are considered.

D. Yaw Estimation

We aim at supporting our claim that our network provides good relative yaw angle estimates. We use the same experimental setup as described in Sec. IV-C. Tab. IV summarizes the yaw angle errors on KITTI sequence 00.

We can see that our method outperforms the other methods in terms of mean error and standard deviation. In terms of recall, OverlapNet and OREOS always provide a yaw angle estimate, since both approaches are designed to estimate the relative yaw angle for any pair of scans, in contrast to the RANSAC-based method that sometimes fails.

The superior performance can be mainly attributed to the correlation head exploiting the fact that the orientation in LiDAR scans can be well represented by the shift in the range projection. Therefore, it is easier to train the correlation head to accurately predict the relative yaw angles than a multilayer perceptron as used in OREOS [27]. Furthermore, there is also a strong relationship between overlap and yaw angle, which also improves the results when trained together.

Fig. 8 shows the relationship between real overlap and yaw angle estimation error. As expected, the yaw angle estimate gets better with increasing overlap. Based on these plots, our method not only finds candidates but also measures their quality, i.e., when the overlap of two scans is larger than 90%, our method can accurately estimate the relative yaw angle with an average error of only about 1 degree.

E. Ablation Study on Input Modalities

An ablation study on the usage of different inputs is shown in Tab. V. As can be seen, when employing more input modalities, the proposed method is more robust. We notice that exploiting only depth information with OverlapNet already performs reasonably in terms of overlap prediction, while it does not perform well in yaw angle estimation. When combined with normal information, OverlapNet performs well in both tasks. Another interesting finding is the drastic reduction of the yaw angle mean error and standard deviation when using semantic information. One reason could
be that adding semantic information makes the input data more distinguishable when the car drives in symmetrical environments. We also notice that semantic information increases the computation time, see Sec. IV-G. However, the ablation study also shows that the proposed method achieves good performance by only employing geometric information (depth and normals).

TABLE IV: Yaw estimation errors without ICP.

Approach                      Mean [deg]   Std [deg]   Recall [%]
FPFH+RANSAC*                  13.28        32.19       97
OREOS*                        12.67        15.23       100
Ours (AllChannel, TwoHeads)   1.13         3.34        100
*: The results are those produced by the authors of OREOS [27].

TABLE V: Ablation study on the usage of input modalities.

Depth   Normals   Intensity   Semantics   Overlap AUC   Overlap F1   Yaw Mean [deg]   Yaw Std [deg]
✓       -         -           -           0.86          0.87         11.67            25.32
✓       ✓         -           -           0.86          0.85         2.97             14.28
✓       ✓         ✓           -           0.87          0.87         2.53             14.56
✓       ✓         ✓           ✓           0.87          0.88         1.13             3.34

Fig. 8: Overlap and yaw estimation relationship on (a) KITTI sequence 00 and (b) Ford campus sequence 00 (yaw error over overlap percentage).

Fig. 9: ICP using OverlapNet predictions as initial guess, (a) with the identity as initial guess and (b) with the yaw angle as initial guess. The error of ICP registration here is the Euclidean distance between the estimated translation and the ground-truth translation.

F. Using OverlapNet Predictions as Initial Guesses for ICP

We aim at supporting the claim that our network provides good initializations for ICP with 3D laser scans collected on autonomous cars. Fig. 9 shows the relation between the overlap and the ICP registration error with and without using OverlapNet predictions as initial guesses. The error of ICP registration is here depicted by the Euclidean distance between the estimated relative translation and the ground-truth translation. As can be seen, the yaw angle prediction of OverlapNet increases the chance to get a good result from ICP even if two frames are relatively far away from each other (with low overlap). Therefore, in some challenging cases, e.g., when the car drives back into an intersection from a different street, our approach can still find loop closures (see the supplementary video¹). The results also show that the overlap estimates measure the quality of the found loop closures: larger overlap values result in better registration results of the involved ICP.
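A minimal sketch of how the predicted yaw angle can seed the ICP refinement is shown below; the ICP call itself is left abstract, since the paper uses the point-to-plane formulation of Sec. III-E, and run_icp is only a placeholder name.

```python
import numpy as np

def initial_guess_from_yaw(yaw_deg):
    """Turn an OverlapNet yaw prediction into a 4x4 initial transform for ICP.

    Only the rotation about the vertical axis is initialized; the translation
    is left at zero and refined by ICP."""
    yaw = np.radians(yaw_deg)
    T = np.eye(4)
    T[:2, :2] = [[np.cos(yaw), -np.sin(yaw)],
                 [np.sin(yaw),  np.cos(yaw)]]
    return T

# Example usage with any ICP implementation (run_icp is a placeholder):
# T_init = initial_guess_from_yaw(predicted_yaw)
# T_refined = run_icp(source_scan, target_scan, T_init)
```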
G. Runtime

We tested our method on a system equipped with an Intel i7-8700 with 3.2 GHz and an Nvidia GeForce GTX 1080 Ti with 11 GB memory.

For the KITTI sequence 00, we could exploit all input cues including the semantic classes provided by RangeNet++ [23]. We need on average 75 ms per frame for the input data preprocessing, 6 ms per frame for the legs' feature extraction, and 27 ms per frame for the head matching. The worst case for the head matching takes 630 ms for all candidates in the search space.

For the Ford campus dataset, we used only geometric information, which could be generated in 10 ms on average per frame, with 2 ms for feature extraction and 24 ms for matching, and a worst case of 550 ms. In real SLAM operation, we only search loop closure candidates inside a certain search space given by the pose uncertainty using the Mahalanobis distance (see Sec. III-F). Therefore, our method can achieve online operation in long-term tasks, since we usually only have to evaluate a small number of candidate poses.

V. CONCLUSION

In this paper, we presented a novel approach for LiDAR-based loop closure detection. It is based on the overlap between LiDAR scan range images and provides a measure for the quality of the loop closure. Our approach utilizes a siamese network structure to leverage multiple cues and allows us to estimate the overlap and relative yaw angle between scans. The experiments on two different datasets suggest that, when combined with odometry information, our method outperforms other state-of-the-art methods and that it generalizes well to different environments never seen during training.

Despite these encouraging results, there are several avenues for future research. First, we want to investigate the integration of other input modalities, such as vision and radar information. We furthermore plan to test our approach with other datasets collected in different seasons.

ACKNOWLEDGMENTS

This work has been supported in part by the German Research Foundation (DFG) under Germany's Excellence Strategy, EXC-2070 - 390732324 (PhenoRob) and under grant number BE 5996/1-1, as well as by the Chinese Scholarship Committee.
matching takes 630 ms for all candidates in the search space. Committee.
R EFERENCES Computer Vision and Pattern Recognition (CVPR), pages
3354–3361, 2012.
[1] Tim Bailey and Hugh Durrant-Whyte. Simultaneous [14] Jiadong Guo, Paulo V.K. Borges, Chanoh Park, and Abel
localisation and mapping (SLAM): Part I. IEEE Robotics Gawel. Local descriptor for robust place recognition
and Automation Magazine (RAM), 13(2):99–110, 2006. using LiDAR intensity. IEEE Robotics and Automation
ISSN 1070-9932. doi: 10.1109/MRA.2006.1638022. Letters (RA-L), 4(2):1470–1477, 2019.
[2] Ioan Andrei Barsan, Shenlong Wang, Andrei Pokrovsky, [15] Li He, Xiaolong Wang, and Hong Zhang. M2DP: A
and Raquel Urtasun. Learning to Localize Using a Novel 3D Point Cloud Descriptor and Its Application
LiDAR Intensity Map. In Proc. of the Second Conference in Loop Closure Detection. In Proc. of the IEEE/RSJ
on Robot Learning (CoRL), pages 605–616, 2018. Intl. Conf. on Intelligent Robots and Systems (IROS),
[3] Jens Behley and Cyrill Stachniss. Efficient Surfel- 2016.
Based SLAM using 3D Laser Range Data in Urban [16] Wolfgang Hess, Damon Kohler, Holger Rapp, and Daniel
Environments. In Proc. of Robotics: Science and Systems Andor. Real-Time Loop Closure in 2D LIDAR SLAM. In
(RSS), 2018. Proc. of the IEEE Intl. Conf. on Robotics & Automation
[4] Paul J. Besl and Neil D. McKay. A Method for (ICRA), 2016.
Registration of 3D Shapes. IEEE Trans. on Pattern [17] Peter J. Huber. Robust Statistics. Wiley, 1981.
Analalysis and Machine Intelligence (TPAMI), 14(2): [18] Mushtaq Hussain and James Bethel. Project and mission
239–256, 1992. planing. In Chris McGlone, Edward Mikhail, James
[5] Igor Bogoslavskyi and Cyrill Stachniss. Fast range Bethel, and Roy Mullen, editors, Manual of Photogram-
image-based segmentation of sparse 3d laser scans for metry, chapter 15.1.2.6, pages 1109–1111. American
online operation. In Proc. of the IEEE/RSJ Intl. Conf. on Society for Photogrammetry and Remote Sensing, 2004.
Intelligent Robots and Systems (IROS), 2016. [19] Giseop Kim, Byungjae Park, and Ayoung Kim. 1-day
[6] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard learning, 1-year localization: Long-term LiDAR local-
Säckinger, and Roopak Shah. Signature Verifica- ization using scan context image. IEEE Robotics and
tion using a “Siamese” Time Delayed Neural Net- Automation Letters (RA-L), 4(2):1948–1955, 2019.
work. Intl. Journal of Pattern Recognition and Artifi- [20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton.
cial Intelligence, 07(04):669–688, 1993. doi: 10.1142/ Imagenet classification with deep convolutional neural
S0218001493000339. networks. Communications of the ACM, 60(6):8490, May
[7] Andrea Censi and Stefano Carpin. HSM3D: feature- 2017. ISSN 0001-0782. doi: 10.1145/3065386.
less global 6DOF scan-matching in the Hough/Radon [21] Stephanie Lowry, Niko Sünderhauf, Paul Newman,
domain. In Proc. of the IEEE Intl. Conf. on Robotics John J Leonard, David Cox, Peter Corke, and Michael J
& Automation (ICRA), pages 3899–3906. IEEE, 2009. Milford. Visual place recognition: A survey. IEEE
[8] Xieyuanli Chen, Andres Milioto, Emanuele Palazzolo, Trans. on Robotics (TRO), 32(1):1–19, 2016. ISSN 1552-
Philippe Giguère, Jens Behley, and Cyrill Stachniss. 3098. doi: 10.1109/TRO.2015.2496823.
SuMa++: Efficient LiDAR-based Semantic SLAM. In [22] Weixin Lu, Yao Zhou, Guowei Wan, Shenhua Hou, and
Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots Shiyu Song. L3-Net: Towards Learning Based LiDAR
and Systems (IROS), 2019. Localization for Autonomous Driving. In Proc. of the
[9] Konrad P. Cop, Paulo V.K. Borges, and Renaud Dubé. IEEE Conf. on Computer Vision and Pattern Recognition
Delight: An efficient descriptor for global localisation (CVPR), June 2019.
using lidar intensities. In Proc. of the IEEE Intl. Conf. on [23] Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill
Robotics & Automation (ICRA), 2018. Stachniss. RangeNet++: Fast and Accurate LiDAR
[10] Andrei Cramariuc, Renaud Dubé, Hannes Sommer, Semantic Segmentation. In Proc. of the IEEE/RSJ
Roland Siegwart, and Igor Gilitschenski. Learning Intl. Conf. on Intelligent Robots and Systems (IROS),
3D Segment Descriptors for Place Recognition. arXiv 2019.
preprint, 2018. [24] Sei Nagashima, Koichi Ito, Takafumi Aoki, Hideaki Ishii,
[11] Mark Cummins and Paul Newman. Highly scalable and Koji Kobayashi. A high-accuracy rotation estimation
appearance-only SLAM - FAB-MAP 2.0. In Proc. of algorithm based on 1D phase-only correlation. In Proc. of
Robotics: Science and Systems (RSS), 2009. the Intl. Conf. on Image Analysis and Recognition, pages
[12] Renaud Dubé, Daniel Dugas, Elena Stumm, Juan Nieto, 210–221, 2007.
Roland Siegwart, and Cesar Cadena. SegMatch: Segment [25] Gaurav Pandey, James R. McBride, and Ryan M. Eustice.
Based Place Recognition in 3D Point Clouds. In Proc. of Ford campus vision and lidar data set. Intl. Journal of
the IEEE Intl. Conf. on Robotics & Automation (ICRA), Robotics Research (IJRR), 30(13):1543–1552, 2011.
2017. [26] Timo Röhling, Jennifer Mack, and Dirk Schulz. A Fast
[13] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are Histogram-Based Similarity Measure for Detecting Loop
we ready for Autonomous Driving? The KITTI Vision Closures in 3-D LIDAR Data. In Proc. of the IEEE/RSJ
Benchmark Suite. In Proc. of the IEEE Conf. on Intl. Conf. on Intelligent Robots and Systems (IROS),
pages 736–741, 2015.
[27] Lukas Schaupp, Mathias Bürki, Renaud Dubé, Roland
Siegwart, and Cesar Cadena. OREOS: Oriented Recog-
nition of 3D Point Clouds in Outdoor Scenarios. Proc. of
the IEEE/RSJ Intl. Conf. on Intelligent Robots and Sys-
tems (IROS), 2019.
[28] Cyrill Stachniss, Dirk Hähnel, Wolfram Burgard, and
Giorgio Grisetti. On Actively Closing Loops in Grid-
based FastSLAM. Advanced Robotics, 19(10):1059–
1080, 2005.
[29] Cyrill Stachniss, John J. Leonard, and Sebastian Thrun.
Springer Handbook of Robotics, 2nd edition, chapter
Chapt. 46: Simultaneous Localization and Mapping.
Springer Verlag, 2016.
[30] Bastian Steder, Radu B. Rusu, Kurt Konolige, and Wol-
fram Burgard. NARF: 3D range image features for
object recognition. In Workshop on Defining and Solving
Realistic Perception Problems in Personal Robotics at the
IEEE/RSJ Int. Conf. on Intelligent Robots and Systems
(IROS), 2010.
[31] Bastian Steder, Michael Ruhnke, Slawomir Grzonka, and
Wolfram Burgard. Place Recognition in 3D Scans Using
a Combination of Bag of Words and Point Feature Based
Relative Pose Estimation. In Proc. of the IEEE/RSJ
Intl. Conf. on Intelligent Robots and Systems (IROS),
2011.
[32] Li Sun, Daniel Adolfsson, Martin Magnusson, Henrik
Andreasson, Ingmar Posner, and Tom Duckett. Lo-
calising Faster: Efficient and precise lidar-based robot
localisation in large-scale environments. In Proc. of
the IEEE Intl. Conf. on Robotics & Automation (ICRA),
2020.
[33] Mikaela A. Uy and Gim H. Lee. PointNetVLAD:
Deep point cloud based retrieval for large-scale place
recognition. In Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), pages 4470–
4479, 2018.
[34] Heng Yang, Jingnan Shi, and Luca Carlone. TEASER:
Fast and Certifiable Point Cloud Registration. arXiv
preprint, 2020.
[35] Huan Yin, Yue Wang, Xiaqing Ding, Li Tang, Shoudong
Huang, and Rong Xiong. 3D LiDAR-Based Global
Localization Using Siamese Neural Network. IEEE
Trans. on Intelligent Transportation Systems (TITS),
2019.
[36] Anestis Zaganidis, Alexandros Zerntev, Tom Duckett,
and Grzegorz Cielniak. Semantically Assisted Loop Clo-
sure in SLAM Using NDT Histograms. In Proc. of the
IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems
(IROS), 2019.
[37] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Fast
global registration. In Proc. of the Europ. Conf. on
Computer Vision (ECCV), pages 766–782, 2016.
