
LiDAR Loop Closure Detection using Semantic Graphs with Graph Attention Networks

Liudi Yang1*, Ruben Mascaro1, Ignacio Alzugaray3, Sai Manoj Prakhya2, Marco Karrer1, Ziyuan Liu2, Margarita Chli1

1 Vision for Robotics Lab, ETH Zurich and University of Cyprus, Switzerland and Cyprus.
2 Huawei Munich Research Center, Germany.
3 Department of Computing, Imperial College London, UK.

*Corresponding author(s). E-mail(s): [email protected]

Abstract
In this paper, we propose a novel loop closure detection algorithm that uses graph attention neural networks to encode semantic graphs for place recognition and then uses semantic registration to estimate the 6 DoF relative pose constraint. Our place recognition algorithm has two key modules, namely a semantic graph encoder module and a graph comparison module. The semantic graph encoder employs graph attention networks to efficiently encode the spatial, semantic and geometric information from the semantic graph of the input point cloud. We then use the self-attention mechanism in both the node-embedding and graph-embedding steps to create distinctive graph vectors. The graph vectors of the current scan and a keyframe scan are then compared in the graph comparison module to identify a possible loop closure. Specifically, employing the difference of the two graph vectors showed a significant improvement in performance, as shown in our ablation studies. Lastly, we implemented a semantic registration algorithm that takes loop closure candidate scans and estimates the relative 6 DoF pose constraint for a LiDAR SLAM system. Extensive evaluation on mainstream datasets shows that our model is more accurate and robust, achieving a 13% improvement in maximum F1 score on the SemanticKITTI dataset compared to the baseline semantic graph algorithm. For the benefit of the community, we open-source the complete implementation of our proposed algorithm and a custom implementation of semantic registration at https://github.com/crepuscularlight/SemanticLoopClosure.

Keywords: Semantic LiDAR Loop Closure, Graph Attention Network, Semantic Registration

1 Introduction
Simultaneous Localization and Mapping (SLAM) plays a crucial role in enabling
autonomous mobile robots to explore and navigate unknown environments. One of its
fundamental challenges is the accumulation of drift caused by state estimation errors
in the front-end odometry, which can lead to globally inconsistent maps. To address
this issue, loop closure detection algorithms have been developed to identify revisited places and reduce the accumulated drift by adding 6 DoF pose constraints to pose-graph or non-linear factor-graph-based LiDAR SLAM systems. There has been
continual innovation in loop closure detection algorithms by employing the latest AI
advancements Uy and Lee (2018); Cattaneo et al (2022); Kong et al (2020) to improve
their accuracy and real-time performance.
Classical (learning-free) LiDAR-based place recognition algorithms use heuristic and handcrafted methods to reduce a large raw point cloud to a distinctive and descriptive multi-dimensional vector. Moreover, most of these handcrafted methods also have a carefully designed metric, specific to the descriptor, to compare similarity. Many such handcrafted global feature descriptors, such as M2DP He et al (2016), ScanContext Kim and Kim (2018) and its variants Wang et al (2020); Kim et al (2021), are quite sensitive to design parameters, the type of LiDAR used (i.e., 16, 32, 64 or 128 channels), the pose of the LiDAR (horizontal or inclined) and, lastly, perceptual disturbances such as occlusion and rotation.
In recent years, deep-learning-based loop closure detection algorithms have gained a lot of traction, as they can be custom trained on the target environment where the robot is to be deployed, considering the actual installed pose and the type of LiDAR used. Early representative methods such as PointNetVlad Uy and Lee (2018) and LCDNet Cattaneo et al (2022) already exhibited promising accuracy, but they process the raw point clouds as a whole and, when it comes to actual deployment, need a lot of computing power to run in real time.
A more recent work, SGPR Kong et al (2020), proposed a less computationally intensive approach that explicitly incorporates a semantic graph as the underlying representation, in an attempt to better mimic how the real-world scene actually looks semantically. To this end, SGPR takes the instance segmentation result of the point cloud as input, creates semantic graphs and encodes spatial and semantic information into lightweight graph embeddings. These graph embeddings are matched in a graph-graph interaction module, a graph-matching neural network that treats loop detection as a graph comparison problem. However, SGPR, being one of the first graph-based approaches to place recognition, did not encode comprehensive information from the semantic graph (it missed the geometric information), used the much simpler EdgeConv module to create node embeddings, and left scope for improving the graph-graph interaction module, as proposed in our paper.
Recently, graph attention networks (GAT) Veličković et al (2018) and graph similarity computation methods Bai et al (2019) have shown significant improvements in how graphs can be encoded, compared and made learnable in an end-to-end fashion. Specifically, graph attention networks use learnable linear transformations instead of simple scalar values to aggregate neighboring features, offering better graph encoding than the EdgeConv used in SGPR Kong et al (2020). This essentially allows the multi-head attention in GATs to learn from multiple subspaces, thus encoding complex relationships between graph nodes and offering a significant boost in performance. In addition, the seminal paper on Transformers Vaswani et al (2017) introduced the self-attention mechanism, in which the relationships between different elements of the input sequence can be learned effectively to reason about the underlying complex relations in the training data.
Inspired by SGPR Kong et al (2020), GAT Veličković et al (2018), self-attention Vaswani et al (2017) and SimGNN Bai et al (2019), we have developed an enhanced graph-based loop closure detection algorithm that overcomes many of these drawbacks and uses the latest techniques to effectively encode a semantic graph, resulting in a significant boost in performance. We propose a two-stage approach consisting of a semantic graph encoder and a graph comparison module.
• As our first contribution, we enhance SGPR by designing a semantic graph encoder that uses graph attention networks to encode the spatial, semantic and geometric information of the semantic graph, as opposed to SGPR's limited information and simpler encoding.
• Our second contribution is to use the self-attention mechanism in the node embedding and graph embedding steps to encode complex underlying relationships, essentially creating more distinctive graph vectors.
• As our third contribution, we show that employing the difference of the input graph vectors in the graph comparison module to perform classification offers a significant boost in performance, as opposed to the direct usage of graph vectors as in SGPR Kong et al (2020).
• Our final contribution is to open-source our work at https://github.com/crepuscularlight/SemanticLoopClosure, which consists of the semantic graph encoder module, the graph comparison module and a custom implementation of semantic registration for 6 DoF pose estimation, to foster further research in this direction.
Exhaustive experiments and ablation studies on public datasets demonstrate the increased accuracy and robustness of both our semantic place recognition network and our semantic registration algorithm compared to other state-of-the-art methods. In addition, we demonstrate that both modules run in real time with minimal memory and compute requirements, making them an ideal choice for integration into existing SLAM frameworks.

2 Related Work
We review previous works on traditional and learning-based 3D place recognition
algorithms, and related graph neural networks that can function as backbones to
extract representative features from graphs.

2.1 3D Place Recognition
Traditional methods reduce a raw point cloud with millions of points into a multi-dimensional vector using meticulously designed methods. Mostly, these extracted descriptors can be compared using the Euclidean distance or specific handcrafted metrics to find a close match, essentially representing a place match/revisit. Magnusson et al (2009) developed NDT, a histogram-based feature descriptor exploiting the normal distribution representation to describe 3D surfaces, together with an evaluation metric for scene matching. In 2013, Bosse and Zlot (2013) presented a keypoint voting mechanism to achieve fast matching between the current scan and database scans while estimating the matching thresholds/hyper-parameters by fitting a parametric model to the underlying distribution. M2DP He et al (2016) first projects 3D point clouds onto a 2D plane to generate density signatures and uses the corresponding concatenated singular vectors as descriptors.
SegMatch Dubé et al (2017) is one of the first approaches that extracted descriptors from clustered segments of the raw point clouds and used a geometric verification step to find a correct match. Approaches in which high-level geometric clustering and semantic/feature description are used for matching generally achieve high accuracy and are more robust in loop closure detection. Scan Context Kim and Kim (2018) initiated the trend of directly using point clouds, without calculating histograms, to create a global descriptor. It uses an encoding function that stores condensed information in spatial bins along the radial and azimuthal directions to generate more distinctive global descriptors. While vanilla Scan Context Kim and Kim (2018) only stores the maximum height in each bin, its variants encode more effective and representative information in the bins, including detected intensity Wang et al (2020), semantic labels of point clouds Li et al (2021a) and subcontexts Kim et al (2021), boosting the performance of the descriptor.
Learning-based methods essentially have the advantage of being custom trainable for the target environment with a specific robot/sensor setup, enabling them to offer better performance in particularly complex environments. PointNetVLAD Uy and Lee (2018) leveraged deep neural networks to retrieve large-scale scenes by using PointNet Qi et al (2017) as the backbone and NetVLAD Arandjelovic et al (2016) to aggregate learned local features, outputting global descriptors for matching. SegMap Dubé et al (2018) proposed to learn data-driven descriptors leveraging the 3D point cloud variance of each cluster/segment, but its innate 3D CNN architecture comes with a considerable computational burden.
To enrich local geometric details, LPD-Net Liu et al (2019) resorts to an adaptive backbone to aggregate local information into global descriptors. The core of its local information extraction module is to fuse the nearest neighbors' information from feature space and Cartesian space. MinkLoc3D Komorowski (2021) proposed a simple neural network to process sparse voxelized point clouds based on sparse 3D CNNs. By quantizing the raw point clouds and employing sparse convolution, it achieves an inference speed similar to other multilayer-perceptron-based algorithms while maintaining high precision. LCDNet Cattaneo et al (2022) adopted an end-to-end architecture that simultaneously accomplishes place recognition and 6 DoF pose estimation; a shared 3D voxel CNN extracts features for the two-head output of place recognition and pose estimation. SGPR Kong et al (2020), on the other hand, converted place recognition into a graph-matching problem by treating every instance as a node and designed an efficient graph neural network to infer similarity, exhibiting excellent robustness on mainstream datasets.

2.2 Graph Neural Networks for Feature Extraction


Here, we review a few related works on graph neural networks that encode graphs into a distinctive feature vector, as this is the core idea of SGPR and of our proposed algorithm. Inspired by the tremendous success of CNNs in the computer vision field, plenty of methods transplant convolution onto graph structures. ConvGNN Bruna et al (2014) developed a graph convolution based on spectral graph theory. GCN Kipf and Welling (2017) proposed a semi-supervised way to implement an efficient variant of graph convolution that utilizes a first-order approximation in the spectral domain. In order to boost the flexibility of graph neural networks, DGCNN Wang et al (2019) adopted EdgeConv to dynamically aggregate features from nearest neighbors, which is more suitable for high-level features in the graph. ResGCN Li et al (2023) showed further advantages by adding residual connections and introducing large-scale architectures.
Another branch of GNNs originates from the self-attention mechanism Vaswani et al (2017). Graph attention networks (GAT) Veličković et al (2018) proposed the novel idea of leveraging self-attention layers to calculate learnable weights for the neighbors, encoding complex relations with more parameters from multiple subspaces of graph nodes. This work opened up a completely new direction for working with graph data and has been applied in various fields. Brody et al (2022) explored the innate mechanism of GAT and modified the order of operations to overcome the original limitation of static attention.
In this paper, we take SGPR as the baseline and further enhance its architecture with graph attention networks, extract additional geometric information of instances, use the self-attention mechanism to encode complex relations, and develop a more discriminative graph comparison module that offers a significant performance boost with a lower model size and faster inference.

3 Method
3.1 System Overview
Our proposed system's pipeline is shown in Fig. 1. F-LOAM Wang et al (2021) is used as the front-end LiDAR odometry to provide a pose estimate for every incoming LiDAR scan. This pose is regarded as a node in the pose graph for nonlinear optimization of the whole trajectory. The relative pose according to odometry is added as a factor between the current node and its previous node. Additionally, when the semantic graph-based place recognition module successfully finds potential loop candidates, another constraint, calculated by semantic registration, is inserted into the pose graph between the corresponding nodes. This pose graph with both odometry constraints

Fig. 1: The high-level workflow of the proposed semantic graph based loop closure system integrated into a SLAM framework. The proposed loop closure algorithm takes two semantically segmented point clouds as input, which are converted to semantic graphs. After that, semantic graph encoders are deployed to compress them into graph vectors. Finally, the graph comparison module predicts the similarity of the two loop candidates. When the similarity exceeds a specific threshold, a pose constraint is estimated using semantic registration, which is added to the pose graph for trajectory optimization.

and loop closure constraints is optimized to get the final consistent and accurate 3D map of the environment. There are two components in the proposed loop closure back-end: first, a semantic place recognition module that generates loop closure candidates, and second, a semantic relocalization module that calculates the relative 6 DoF pose.

3.2 Semantic Place Recognition


The semantic place recognition module identifies whether the robot has come back to a previously traversed location and, in quantitative terms, provides a set of possible loop closure candidates. In order to judge whether two LiDAR scans were collected at the same place, we design this module with three parts:
1. Construction of semantic graphs from input point clouds
2. Semantic graph encoder module to extract distinctive graph vectors from semantic
graphs
3. Graph comparison module to estimate the similarity of two graph vectors
If the predicted similarity is higher than a certain threshold, the scan pair is regarded as a potential loop closure candidate for semantic registration and further pose graph optimization.

3.2.1 Semantic Graph


We adopt a similar strategy to SGPR Kong et al (2020) to construct semantic graphs from raw semantic point clouds; however, we add additional geometric information by encoding detected bounding boxes to enhance the performance. The nodes of the constructed semantic graph encode the semantic labels of detected objects, their local centroid coordinates and their bounding boxes. One-hot encoding is used as the embedding function for the semantic information, which eliminates the ordinal influence of semantic labels. The encoded semantic labels and centroids $(x_k, y_k, z_k)$ of instances represent the spatial distribution of semantic objects and the topological relationships of those instances in the current scene. By adding the proposed bounding boxes, we enhance the semantic graph with additional geometric information representing each object's size and boundaries. We have evaluated the following three possible ways of encoding the geometric information about the instances:
• FPFH Rusu et al (2009) - Classical 3D feature descriptor
• PointNet Qi et al (2017) - Deep learning based 3D feature extractor
• Bounding box (top left, bottom right) points of the instance
Intuitively, the FPFH and PointNet feature descriptors encode more insightful information than bounding boxes. However, in practice, quantitative evaluation on datasets has shown that bounding boxes offer better performance than the other two options, as discussed further in Section 4.2 (Table 4). A possible reason is that adding such multidimensional vectors to the graph nodes does not contribute semantic, graphical or topological information that the graph comparison module can use to differentiate between two semantic graphs. Instead, bounding boxes directly encode the relative size and boundaries of the encoded semantic instances, making the graph comparison module's task easier. Moreover, these traditional and deep learning feature extractors come with additional computational costs, whereas bounding boxes are readily available from instance segmentation. Hence, we settled on adding bounding boxes as an additional source of geometric information for the constructed semantic graph nodes, boosting performance with no computational overhead.

3.2.2 Semantic Graph Encoder Module


The semantic graph encoder is designed to convert semantic graphs into representative graph vectors. In this process, the graph vectors become more distinguishable than the input semantic graphs while representing them in a smaller number of dimensions. The semantic graph encoder has two steps: extracting feature matrices from graphs (node embedding) and compressing the feature matrices into vectors (graph embedding).
Node Embedding: Given a graph with multiple nodes containing unique information, we encode the information of the considered node together with contextual information from its neighbouring nodes. In the SGPR model, this is realized by EdgeConv from DGCNN Wang et al (2019), which skips the time-consuming step of building adjacency matrices for every input graph and instead searches for the k nearest neighbor (kNN) nodes in the convolution operation. This made DGCNN more flexible and robust while reducing the computational overhead. Instead of using EdgeConv as in SGPR to encode neighbourhood node information, we observe that graph attention networks (GATs) can extract more comprehensive features that encode complex underlying relationships between neighbourhood nodes by learning from multiple subspaces through multi-head attention. Hence, we propose to replace the EdgeConv backbone with GATs and subsequently modify the following steps of the whole pipeline.

Fig. 2: The architecture of the proposed semantic graph encoder. The semantic graphs created from input point clouds are passed through three GATs to extract contextual spatial, semantic and geometric features. These features are then concatenated and passed through a self-attention module to produce a node embedding f. Another self-attention module operates on the node embedding f to learn a global context vector c. We finally project the node embedding f onto the global context vector c to obtain the corresponding node weights and use them to calculate the final graph vector e.

The detailed illustration of the semantic graph encoder is shown in Fig. 2. The encoder has three branches emanating from the input semantic graph, corresponding to the semantic labels, centroids and bounding boxes. For a considered branch, the input semantic graph can be denoted as $h = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^F$, where $N$ represents the number of nodes and $F$ is the feature dimension; for example, $F = 3$ for the centroid branch and $F = 6$ for the bounding box branch, as shown in Fig. 2.
To aggregate neighbourhood node information, we first compute the difference between the current node and a neighbouring node, $\vec{h}_i - \vec{h}_j$, and then concatenate this difference with $\vec{h}_i$, resulting in $\vec{h}_i \| (\vec{h}_i - \vec{h}_j)$ and essentially doubling the dimensionality. We perform this concatenation of $\vec{h}_i$ with $\vec{h}_i - \vec{h}_j$ to combine contextual information between $\vec{h}_i$ and $\vec{h}_j$. We then use a learnable matrix $W \in \mathbb{R}^{F' \times 2F}$ to transform $\vec{h}_i \| (\vec{h}_i - \vec{h}_j)$ and estimate the attention-based weights $\alpha_{ij}$ as shown below

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\,T} W \left[\vec{h}_i \| (\vec{h}_i - \vec{h}_j)\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\,T} W \left[\vec{h}_i \| (\vec{h}_i - \vec{h}_k)\right]\right)\right)} \tag{1}$$

where $\vec{a}$ is the learnable attention vector used to reduce the dimensionality, $\|$ denotes concatenation and $\mathcal{N}_i$ is the index set of the $k$ nearest neighbor nodes of node $i$. We use 10 nearest neighbours throughout our experiments, i.e., $k = 10$; we experimented with multiple values of $k$ and found no noticeable effect on performance. Compared to vanilla/classical GAT Veličković et al (2018), which attends over all nodes, our proposed k-NN search drastically reduces the computational cost of both training and inference. We use the LeakyReLU activation function with slope 0.2 to enhance learning and alleviate dead-neuron issues during training. The $i$-th row of the features extracted using the GAT can be
represented as

$$\vec{h}'_i = \Big\Vert_{z=1}^{Z}\, \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{z}\, W_z \left[\vec{h}_i \| (\vec{h}_i - \vec{h}_j)\right]\right) \tag{2}$$

where $\sigma$ is a nonlinear function, $\|$ represents concatenation, $\alpha_{ij}^{z}$ are the normalized attention coefficients computed by the $z$-th attention vector $\vec{a}_z$, and $W_z \in \mathbb{R}^{\frac{F'}{Z} \times 2F}$ is the $z$-th learnable linear transformation. In this formula, we apply multi-head attention with $Z$ heads, so that different heads can concentrate on different subspaces of the features $\vec{h}_i \| (\vec{h}_i - \vec{h}_j)$, boosting the expressiveness of the enhanced GATs.
After extracting features individually from the semantic, spatial and geometric branches, we employ a self-attention module to fuse them together (Fig. 2). Our intention is to not only interact with neighboring nodes within one branch but also aggregate information across the different branches, thereby yielding a more comprehensive node-embedding representation. The self-attention mechanism is innately suitable for determining different weights in a sequence. The corresponding output, the node embedding $f \in \mathbb{R}^{N \times F'}$, is calculated as

$$f = \mathrm{pooling}\left(\mathrm{softmax}\left(\frac{Q(x) K^T(x)}{\sqrt{d_k}}\right) V(x)\right) \tag{3}$$

where $Q$, $K$, $V$ are respectively the query, key and value mapping functions, $x \in \mathbb{R}^{N \times 3F'}$ is the concatenated feature from the three branches and $d_k$ is the dimension of the keys. Going through the self-attention module, the original graph containing three separate branches of information from different nodes gets converted into a single node embedding matrix $f$. The next step is to compress this node embedding $f$ into a graph vector $e$.
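As a sketch, the fusion in Eq. (3) amounts to scaled dot-product self-attention over the N nodes (illustrative; the learned query/key/value maps are assumptions, and the pooling step of Eq. (3) is omitted here):

    import torch
    import torch.nn as nn

    class BranchFusion(nn.Module):
        """Scaled dot-product self-attention over nodes, in the spirit of Eq. (3)."""

        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.query = nn.Linear(in_dim, out_dim)
            self.key = nn.Linear(in_dim, out_dim)
            self.value = nn.Linear(in_dim, out_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (N, 3F') concatenated semantic/spatial/geometric node features
            q, k, v = self.query(x), self.key(x), self.value(x)
            attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (N, N)
            return attn @ v  # (N, F') node embedding f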
Graph Embedding: Compressing graphs into fixed-length vectors that encode the information from all nodes is essential for comparing two graphs efficiently and enables many downstream applications. We propose to use a self-attention module to learn a global context vector $c \in \mathbb{R}^{F'}$ from the node embedding $f$, instead of the much simpler approach in SimGNN, to efficiently capture the useful information in $f$. The node embedding matrix $f$ is passed through the self-attention module, producing a stack of auxiliary vectors, which can be represented as

$$\mathrm{attention}(f) = (u_1, \ldots, u_N)^T \in \mathbb{R}^{N \times F'} \tag{4}$$

where $u_i \in \mathbb{R}^{F'}$ represents the auxiliary vector in the $i$-th row of $\mathrm{attention}(f)$. The learnable global context vector $c$ is estimated by pooling the auxiliary vectors, similar to SimGNN Bai et al (2019), as shown below:

$$c = \tanh\left(\frac{1}{N} \sum_{i=1}^{N} u_i\right) \tag{5}$$
Fig. 3: Overview of the graph comparison module. We propose a relative difference vector (shown in orange) computed as the absolute value of the difference between the two graph vectors. The similarity vector is learnt from the first-order and second-order difference terms and the concatenated graph vectors. This similarity vector is then passed through fully connected layers to predict the similarity value between the two input graph vectors.


We then obtain the weight vector shown in Fig. 2, representing the similarity between the node embedding $f$ and the global context vector $c$, by calculating their inner products. After mapping the weight vector into $[0, 1]$ via a sigmoid function, we finally estimate the graph vector $e$ as the weighted sum of the rows of the node embedding $f$, as shown below:

$$e = \sum_{i=1}^{N} \mathrm{sigmoid}(f_i^T c)\, f_i \tag{6}$$

where $f_i$ represents the $i$-th row of the node embedding $f$. This graph vector $e$ essentially compresses and represents the input semantic graph by encoding all the nodes and their spatial, semantic and geometric information into an $F'$-dimensional vector. In our experiments, we create a 32-dimensional graph vector; we found that varying the graph vector's dimension between 16 and 64 did not bring any noticeable change in performance.

3.2.3 Graph Comparison Module


The similarity between two graph vectors $e_1$, $e_2$ can be predicted by another neural network that can comprehend the resemblance and amplify the difference. We specifically enhance the performance of this module by adding an additional relative difference term, defined as $d = |e_1 - e_2|$, to the neural network proposed in SimGNN. As shown in Fig. 3, this module yields a similarity vector measuring the similarity between
the two graph vectors using a function defined as

$$f(e_1, e_2) = \mathrm{ReLU}\left(d^T W_1^{[1:S]} d + W_2\, d + W_3 \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} + b\right) \tag{7}$$

In this equation, $W_1^{[1:S]} \in \mathbb{R}^{S \times F' \times F'}$ is the weight tensor that captures the second-order difference term, $W_2 \in \mathbb{R}^{S \times F'}$ is the weight matrix for the first-order difference term, $W_3 \in \mathbb{R}^{S \times 2F'}$ is the weight matrix for the concatenated vector $e_1 \| e_2$, $b \in \mathbb{R}^S$ is the learnable bias term and $S$ is the dimension of the similarity vector.
The similarity vector then goes through fully connected layers and a sigmoid function to guarantee that the predicted probability lies in the range $[0, 1]$. By feeding the graph vectors into the graph comparison module, we convert place recognition into a binary classification problem and hence employ the binary cross entropy function as the training loss:

$$\mathrm{loss} = -\frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \tag{8}$$

where $y_i \in \{0, 1\}$ is the ground-truth value, $\hat{y}_i \in [0, 1]$ is the predicted value and $N_{batch}$ is the batch size. Lastly, only when the resulting similarity between two graph vectors from the graph comparison module is higher than a threshold are the input scans passed on to the semantic registration module to estimate the 6 DoF pose.
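A sketch of this comparison head following Eq. (7) (illustrative shapes and names; for training, its output would be paired with binary cross entropy as in Eq. (8), e.g. torch.nn.BCELoss):

    import torch
    import torch.nn as nn

    class GraphComparison(nn.Module):
        """Similarity head over two graph vectors, in the spirit of Eq. (7)."""

        def __init__(self, fdim: int, sdim: int):
            super().__init__()
            self.W1 = nn.Parameter(torch.randn(sdim, fdim, fdim))  # second-order term
            self.W2 = nn.Linear(fdim, sdim, bias=False)            # first-order term
            self.W3 = nn.Linear(2 * fdim, sdim, bias=False)        # concatenation term
            self.b = nn.Parameter(torch.zeros(sdim))
            self.fc = nn.Linear(sdim, 1)                           # final FC layer(s)

        def forward(self, e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
            d = (e1 - e2).abs()                                 # relative difference vector
            second = torch.einsum('i,sij,j->s', d, self.W1, d)  # d^T W1^[1:S] d
            s = torch.relu(second + self.W2(d) + self.W3(torch.cat([e1, e2])) + self.b)
            return torch.sigmoid(self.fc(s))                    # similarity in [0, 1]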

3.3 Semantic Registration


In LiDAR SLAM systems, to reduce the drift accumulated by the front-end lidar odometry pipeline, relative pose constraints are added between nodes representing revisited places, and the whole pose graph with both odometry and loop closure constraints is then optimized. Following this, after obtaining potential loop candidates from the semantic place recognition module, we estimate the relative pose constraints between them with an enhanced semantic registration algorithm.
Essentially, to perform robust registration in complex environments, we enhance the front-end pose estimation algorithm of F-LOAM Wang et al (2021) by incorporating semantic labels, drawing inspiration from SA-LOAM Li et al (2021b). The key enhancements in our semantic registration pipeline are:
• Remove outliers and moving/dynamic objects based on the available semantic labels of the points.
• During data association, associate target points with the same semantic label to construct point-to-line and point-to-plane error metrics.
• Assign weights based on semantic labels during the minimization of the cost function.
First, based on the semantic labels from the SemanticKITTI dataset, we remove dynamic objects such as cars, persons, trucks, buses and other vehicles, as well as outliers, from the input point cloud. This aids in extracting more robust keypoints, which alleviates false matching during registration. As in F-LOAM, we then compute the curvature/smoothness and extract a set of edge keypoints ($P_e$) and surface keypoints ($P_s$). Second, the relative pose between two scans is estimated by aligning the edge and surface keypoints using point-to-line and point-to-plane error metrics. While calculating the target lines and planes corresponding to the source edge and planar points, we use the nearest 5 points in the target scan with the same semantic label. Compared to the original F-LOAM, using target points with the same semantic label aids in establishing accurate correspondences. Drawing inspiration from SA-LOAM's observation that different classes influence semantic registration differently, we assign larger weights to clearly distinguishable classes, such as traffic signs, poles and buildings. For more details, we refer the reader to our open-source implementation. The cost function is defined as

$$r = \sum_{l \in L} \left( \sum_i w_l\, d_{e_i}^{l} + \sum_j w_l\, d_{s_j}^{l} \right) \tag{9}$$

where $d_{e_i}^{l}$ is the distance from the $i$-th edge keypoint to its corresponding edge, $d_{s_j}^{l}$ is the distance from the $j$-th surface keypoint to its corresponding surface, $l \in L$ is the semantic label of the considered keypoint and $w_l$ is the semantic-related weight. We set the weights for the traffic sign, pole and building labels to 1.2 and the rest to 0.8. Once the relative pose constraint is calculated, we perform geometric verification based on the fitness score, and based on this the constraint is added to the pose graph for optimization.
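As a sketch of the label handling described above (our own Python illustration; the released implementation is in C++, and the exact label sets there may differ from the ones shown):

    import numpy as np

    DYNAMIC_LABELS = {"car", "person", "truck", "bus", "other-vehicle", "outlier"}
    HIGH_WEIGHT_LABELS = {"traffic-sign", "pole", "building"}

    def semantic_weight(label: str) -> float:
        """Per-class weight w_l used in the cost function of Eq. (9)."""
        return 1.2 if label in HIGH_WEIGHT_LABELS else 0.8

    def remove_dynamic(points: np.ndarray, labels: list) -> tuple:
        """Drop dynamic objects and outliers before keypoint extraction."""
        keep = [i for i, l in enumerate(labels) if l not in DYNAMIC_LABELS]
        return points[keep], [labels[i] for i in keep]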

4 Experimental Evaluation
We design experiments to evaluate the performance of the proposed pipeline and compare it with related state-of-the-art methods on open-source datasets. We test its robustness by randomly rotating and occluding input scans while highlighting its low memory footprint and compute requirements. Extensive ablation studies evaluate multiple ways of encoding geometric information, vary the number of nodes in the semantic graph and highlight the performance improvements contributed by each proposed enhancement. Lastly, we evaluate the semantic registration module as a standalone lidar odometry pipeline and then integrate the proposed loop closure module into an open-source SLAM algorithm and present its performance.

4.1 Semantic Place Recognition


We evaluate the performance and robustness of our place recognition module as a classifier, compare it with relevant state-of-the-art methods, and present its memory and compute requirements.

4.1.1 Dataset and Implementation Details


We select three mainstream semantic LiDAR scan datasets to evaluate our semantic place recognition model's performance: SemanticKITTI Behley et al (2019), KITTI-360 Liao et al (2021) and KITTI Geiger et al (2012) preprocessed by RangeNet++ Chen et al (2019). SemanticKITTI is the annotated version of the KITTI dataset, containing 11 publicly open sequences with semantic labels. KITTI-360 is a novel semantic LiDAR dataset which extends SemanticKITTI to much larger areas. The classical KITTI dataset with semantic labels produced by RangeNet++ tests our model's robustness while emulating real scenarios in which deep-learning-based segmentation models produce wrong labels. The raw point clouds, the available per-point semantic labels and the ground-truth poses are used to build the datasets for evaluating place recognition. Similar to SGPR, we randomly generate a large set of scan pairs: if the distance between the two scans is less than 3 m, they are deemed a true positive pair, and if the distance is larger than 20 m, they are regarded as a true negative pair.
Our proposed model is developed in PyTorch using the AdamW Loshchilov and Hutter (2019) optimizer with a learning rate of 0.0001. The model is trained with a batch size of 128 for 50 epochs on one Nvidia Tesla T4 (16 GB). The number of nodes in the semantic graph created from an input scan is set to 50 by default. If the number of segmented instances in a scan is less than 50, pseudo nodes with zero node information are added. If there are more than 50 semantic instances, we randomly sample 50 of them to build the semantic graph, so as to have a consistent batch size for training. In these datasets, most scans contain 30-40 semantic instances, and only a few go as high as 60-70 nodes per scene graph. Setting the k-nearest-neighbor parameter k of our GAT to 10 drastically improves the training and inference speed compared to classical GAT, where all neighbourhood nodes are used.
To make the SemanticKITTI dataset fit our model, we remap the original 28 classes into 12 appropriate classes (car, other vehicles, other ground, fence, trunk, pole, truck, sidewalk, building, vegetation, terrain and traffic sign), the same as SGPR. We select 5 sequences (01, 03, 04, 09, 10) for training and 6 sequences (00, 02, 05, 06, 07, 08) for testing. For the KITTI-360 dataset, we remap 19 classes into 13 classes (car, static object, ground, parking, rail track, building, wall, fence, guard rail, bridge, pole, vegetation and traffic sign). We train on sequences (00, 02, 03, 04) and test on sequences (05, 06, 07, 09, 10). For the KITTI dataset, labels are inferred from the pretrained RangeNet++ model, whose 19 output classes are mapped to the same 12 classes as SemanticKITTI. The train and test sequences are identical to those of SemanticKITTI.
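A sketch of the pair-labelling procedure described above (illustrative; the function and variable names are our own):

    import numpy as np

    def sample_labelled_pairs(positions: np.ndarray, num_pairs: int,
                              rng=np.random.default_rng(0)) -> list:
        """Sample scan pairs and label them by ground-truth distance.

        positions: (T, 3) ground-truth scan positions of one sequence.
        Returns (i, j, y) tuples with y=1 for pairs closer than 3 m and
        y=0 for pairs farther than 20 m; pairs in between are discarded.
        """
        pairs = []
        while len(pairs) < num_pairs:
            i, j = rng.integers(0, len(positions), size=2)
            if i == j:
                continue
            dist = np.linalg.norm(positions[i] - positions[j])
            if dist < 3.0:
                pairs.append((i, j, 1))
            elif dist > 20.0:
                pairs.append((i, j, 0))
        return pairs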

4.1.2 Analysis
We use the maximum value of the F1 score as the evaluation metric for our model. The F1 score is defined as

$$F_1 = 2 \times \frac{P \times R}{P + R}$$

where $P$ represents precision and $R$ represents recall. As the place recognition datasets are unbalanced, i.e., the proportion of negative pairs is much larger than that of positive pairs, the max F1 score is a more comprehensive metric than accuracy (success rate) and average precision.
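For reference, a minimal sketch of how the max F1 score can be computed by sweeping the similarity threshold (our own illustration using scikit-learn; the authors' evaluation scripts may differ):

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def max_f1(y_true: np.ndarray, y_score: np.ndarray) -> float:
        """Maximum F1 over all thresholds on the predicted similarities."""
        precision, recall, _ = precision_recall_curve(y_true, y_score)
        f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
        return float(f1.max())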
We compare our method with the following open-source algorithms: SGPR Kong et al (2020), Scan Context (SC) Kim and Kim (2018) and Intensity Scan Context (ISC) Wang et al (2020). While there are other deep-learning-based loop closure algorithms to compare against, most of them require a large amount of data pre-processing work to obtain results that can be compared fairly. Hence, we specifically focused on methods of high relevance that leverage semantic graphs for place recognition Kong et al (2020), or widely used ones such as Scan Context Kim and Kim (2018).
                 SemanticKITTI                                    KITTI-360
Method     00     02     05     06     07     08     Mean    05     06     07     09     10     Mean
SGPR       0.846  0.78   0.724  0.901  0.902  0.731  0.814   0.703  0.707  0.745  0.673  0.697  0.705
SGPR-RN    0.771  0.758  0.767  0.857  0.813  0.635  0.767   -      -      -      -      -      -
SC         0.579  0.535  0.577  0.729  0.684  0.171  0.546   0.550  0.412  0.554  0.455  0.672  0.529
ISC        0.860  0.808  0.840  0.901  0.634  0.626  0.778   0.756  0.692  0.811  0.712  0.867  0.768
Ours-RN    0.923  0.839  0.873  0.947  0.913  0.730  0.871   -      -      -      -      -      -
Ours       0.935  0.902  0.858  0.979  0.926  0.923  0.921   0.816  0.824  0.890  0.762  0.916  0.842

Table 1: Evaluation of various methods on the SemanticKITTI and KITTI-360 datasets using the max F1 score metric. Ours-RN and SGPR-RN represent the performance of our proposed system and of SGPR with semantic labels inferred from RangeNet++ on the KITTI dataset.

Please note that in all experiments, "Ours" and "SGPR" represent our proposed method and the SGPR algorithm, respectively, with the semantic labels taken directly from the dataset's ground-truth annotations. When we refer to "Ours-RN" or "SGPR-RN", we mimic the real-world scenario where the semantic labels are inferred from a pretrained network, RangeNet++ Chen et al (2019), instead of using ground-truth annotations.

4.1.3 Evaluation using Max F1 Score


In Table 1, we compare the max F1 scores of our proposed method (Ours) and of Ours-RN (semantic labels inferred from RangeNet++) with SC Kim and Kim (2018), ISC Wang et al (2020), SGPR Kong et al (2020) and SGPR-RN (semantic labels inferred from RangeNet++). Table 1 shows that our proposed method (Ours) achieves the highest performance, 0.921 and 0.842, compared to the other approaches over all sequences on both datasets. This translates to 13% and 19% improvements over the baseline SGPR algorithm on the SemanticKITTI and KITTI-360 datasets, respectively. We also notice that the max F1 scores are generally lower on KITTI-360 than on SemanticKITTI. The reason is that the annotation of KITTI-360 is provided as submaps of combined scans, and we had to use a clustering method to recover the labels of each scan, causing some annotation errors/noise. The results of Ours-RN and SGPR-RN are missing for the KITTI-360 sequences, as they contain about 25000 frames per sequence, which would require weeks of pre-processing to obtain the RangeNet++ labels and our evaluation results. Hence, we used the available ground-truth annotations and evaluated directly on the KITTI-360 dataset.

4.1.4 Evaluation with real-world semantic label inference


The results of Ours-RN and SGPR-RN in Table 1 refer to the performance on the KITTI dataset with semantic labels inferred from the pretrained RangeNet++ model. It can be clearly seen that inference with the pretrained model (Ours-RN and SGPR-RN) yields slightly lower performance than the counterparts with ground-truth semantic annotations (Ours and SGPR). This is expected behaviour, as inference with a pretrained deep learning model, in this case RangeNet++, can never reach the accuracy of ground-truth annotations. However, the drop in loop closure performance (about 5% in max F1 score) is not significant when compared to the drop in RangeNet++

                  Rotation                                            Occlusion
Method   00     02     05     06     07     08     Mean   CMP      00     02     05     06     07     08     Mean   CMP
SGPR     0.741  0.701  0.708  0.905  0.734  0.675  0.744  -0.070   0.815  0.672  0.721  0.927  0.894  0.695  0.787  -0.027
SC       0.269  0.112  0.323  0.668  0.386  0.192  0.325  -0.221   0.524  0.447  0.530  0.654  0.158  0.386  0.450  -0.096
ISC      0.857  0.804  0.836  0.899  0.626  0.627  0.775  -0.003   0.833  0.793  0.810  0.878  0.578  0.600  0.749  -0.029
Ours     0.927  0.898  0.877  0.982  0.910  0.918  0.919  -0.002   0.923  0.831  0.836  0.972  0.846  0.802  0.868  -0.052

Table 2: Robustness evaluation using random rotation and occlusion on SemanticKITTI. The "CMP" column shows the difference between the mean values of the corresponding algorithms in this table and in Table 1. Our method shows excellent robustness and has the lowest drop in performance compared to its performance on the original SemanticKITTI dataset.

performance relative to the ground-truth annotations (about 48% in mean IoU score). Ours-RN still maintains high performance even with a pretrained model inferring the semantic labels, mainly because of its architectural advantage of using a semantic graph representation, which is tolerant to noisy labels in a few graph nodes while maintaining a distinctive high-level semantic graph representation of every scene.

4.1.5 Evaluation using Precision-Recall curves


In Fig. 4, we show the precision-recall curves on various sequences of the SemanticKITTI dataset. It can be seen that on most sequences the area under the curve (AUC), or average precision, of our model (Ours and Ours-RN) is much larger than that of the baseline SGPR or of the widely used Scan Context. Sequence 08 in particular contains reverse loop closures, wherein the robot visits the same place in the reverse direction. Reverse loop closure detection is a challenging problem for most existing loop closure detection solutions, and most traditional methods fail to recognize such loops. However, our model, combining contextual information via GAT, can discern the place through spatial topology and semantic understanding. Hence, it can be seen in Fig. 4 (f), which corresponds to sequence 08, that Ours and Ours-RN offer significantly better performance than the other methods.

4.1.6 Robustness to random rotation and occlusion


In order to test robustness in real-world scenarios, we randomly rotate and occlude the scans of the SemanticKITTI dataset. Specifically, we apply a random yaw angle between −30° and 30° and mask out the points within a randomly positioned horizontal field of view of 30°. We perform the same experiments as in Section 4.1.3 to estimate the max F1 score after this random rotation and occlusion; the resulting performance is shown in Table 2. Compared to the other methods, Table 2 shows that our proposed algorithm offers the highest max F1 score on all sequences.
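A sketch of this augmentation (our own numpy illustration of the described yaw perturbation and field-of-view masking):

    import numpy as np

    def rotate_and_occlude(points: np.ndarray,
                           rng=np.random.default_rng()) -> np.ndarray:
        """Apply a random yaw in [-30, 30] deg and mask a 30 deg horizontal wedge."""
        yaw = np.radians(rng.uniform(-30.0, 30.0))
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        pts = points @ R.T
        center = rng.uniform(-np.pi, np.pi)            # random occlusion direction
        azimuth = np.arctan2(pts[:, 1], pts[:, 0])
        diff = (azimuth - center + np.pi) % (2 * np.pi) - np.pi
        return pts[np.abs(diff) > np.radians(15.0)]    # drop points inside the wedge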
In Table 2, the "CMP" column shows the difference between the mean values of the corresponding algorithms in Table 2 and Table 1. It shows that, among all methods, our proposed algorithm has the lowest drop in mean performance relative to its original performance in Table 1. The reason is that the semantic graph representation encodes the semantic structure of the scene in a sparse manner, making it robust to occlusion of objects and to small variations arising from wrong semantic label inference. This particularly makes our proposal an ideal choice for loop closure detection in real-world scenarios.
(a) Sequence-00  (b) Sequence-02  (c) Sequence-05  (d) Sequence-06  (e) Sequence-07  (f) Sequence-08

Fig. 4: Precision-recall curves of max F1 score on the SemanticKITTI dataset. In the legend, AUC denotes the area under the curve. We compare Ours and Ours-RN with SGPR Kong et al (2020), SGPR-RN, Scan Context (SC) Kim and Kim (2018) and ISC Wang et al (2020). Ours and Ours-RN outperform the other methods on all sequences, especially on sequence 08, which contains many reverse loop closures.


4.1.7 Model size & Computational time for inference

Method                              Model Size
MinkLoc Komorowski (2021)           4 MB
LCD-Net Cattaneo et al (2022)       142 MB
PointNetVlad Uy and Lee (2018)      237 MB
LPD-Net Liu et al (2019)            197 MB
Ours                                426 KB

Table 3: Model size comparison.

One of the impressive features of our proposed model is that it is extremely lightweight in terms of memory, which in turn makes it extremely fast. As shown in Table 3, our model requires only 426 KB, far less than other learning-based methods. This minimal model size makes our model trainable and deployable even on a consumer laptop. Owing to its small model size, our proposed model runs extremely fast, at about 73 Hz on an Nvidia Tesla T4 (16 GB) GPU, for inference with pre-processed semantic graphs as input.

Fig. 5: Performance change with a varying number of nodes in the semantic graph.

Our proposed model runs sequentially after the instance segmentation algorithm, and most state-of-the-art instance segmentation algorithms run at around 1-5 Hz depending on the hardware and their level of optimization. This shows that our model has minimal computational requirements and does not slow down the system by becoming a bottleneck.

4.2 Ablation Study


We perform three ablation studies: first, evaluating multiple ways of encoding geometric information; second, varying the number of nodes in the semantic graph; and lastly, highlighting the performance improvements that each proposed enhancement brings to the whole pipeline.

4.2.1 Evaluation with multiple geometric features


There are multiple ways of encoding geometric information in the semantic graph, and we tested three options to find the best performing one for our pipeline. First, we encoded the 3D neighbourhood using one of the best classical 3D feature descriptors, FPFH Rusu et al (2009). Second, we integrated the features extracted using a pretrained PointNet Qi et al (2017) model. Lastly, we used the bounding box information that is readily available from instance segmentation. While there are other 3D feature descriptors, such as SHOT Salti et al (2014) and 3DHoPD Prakhya et al (2017), we chose FPFH owing to its low dimensionality and ease of use. For example, SHOT is a 352-dimensional descriptor while FPFH is just 33-dimensional, and 3DHoPD's design requires a two-stage feature extraction and matching, making FPFH the easier choice to integrate.
To extract FPFH features, we operate on each instance point cloud, considering its centroid as the keypoint and the maximum distance from the centroid to the farthest point as the feature radius, while extracting normals from the 30 nearest neighbours.

Geometry Feature   Dimension   Mean
Bounding box       6           0.921
FPFH               33          0.876
PointNet           1024        0.883

Table 4: Max F1 score comparison of different geometric features on the SemanticKITTI dataset.

For PointNet features, we used the model pretrained for the classification task on ModelNet40 Wu et al (2015), whose output global feature has dimension 1024. Lastly, for the bounding box, we simply use the top-left and bottom-right 3D points, forming a six-dimensional vector. The resulting max F1 scores on the SemanticKITTI dataset with these three different geometric features are shown in Table 4. They clearly highlight that, although classical and deep learning descriptors intuitively encode more information about neighbourhood variance and point statistics, they do not essentially improve the whole system's performance when added to the semantic graph along with the spatial and semantic labels. Instead, the lightweight bounding boxes, which essentially just capture the boundaries of the objects and, more importantly, their scale, representing how big the objects are, aid loop closure detection in a much better way, offering higher performance.

4.2.2 Impact of number of nodes in semantic graph


In all our experiments, we set the maximum number of nodes in the semantic graph to 50. If there are fewer nodes, we pad with zero nodes, and if there are more, we randomly sample 50 of them. In most cases, on the existing datasets, the number of nodes is less than 50, with a few samples reaching around 60 or more. In this experiment, we test the influence of the number of graph nodes on the model's performance. Fig. 5 compares the performance of SGPR with that of our model as the number of graph nodes varies from 10 to 90. It can be seen that our model always performs better than SGPR with the same number of graph nodes. When there are few nodes (10-30), the semantic graph discards most of the semantic information and encodes only a partial set, resulting in lower performance. Choosing 50 or more graph nodes offers almost identical performance, with negligible change.

4.2.3 Performance improvement from proposed enhancements


Here, we show how each of our proposed enhancements brings considerable improvement over the baseline SGPR algorithm on the SemanticKITTI dataset. We dissect our complete system, starting from SGPR as the baseline, add one module after another, and report the resulting improvement in max F1 score in Table 5. The baseline SGPR is first enhanced with the DIFF module, which stands for the relative difference term added in the graph comparison module, resulting in a jump from 0.814 to 0.876 (about 8%). We then replace EdgeConv in SGPR with graph attention networks (GAT) and the self-attention that immediately follows it in Fig. 2, to encode complex graph relations into feature vectors, demonstrating a further improvement in performance. We then add geometric features (GEO), i.e., the bounding box information that supplements the graph with the scale and boundaries of detected

SGPR   DIFF   GAT   GEO   ATT   Mean
✓                               0.814
✓      ✓                        0.876
✓      ✓      ✓                 0.882
✓      ✓      ✓     ✓           0.908
✓      ✓      ✓     ✓     ✓     0.921

Table 5: Ablation study on the SemanticKITTI dataset showing how each proposed enhancement to the baseline SGPR contributes to the overall performance of the proposed place recognition module. SGPR denotes the baseline model; DIFF denotes the relative difference term added in the graph comparison module of Fig. 3; GAT stands for the graph attention networks added in the semantic graph encoder along with the immediately following self-attention module that generates the node embedding f (see Fig. 2); GEO represents the geometric information branch that encodes the bounding box information; and ATT is the self-attention layer in the graph embedding part that creates the global context vector c (see Fig. 2).

objects, resulting in a further performance enhancement. Lastly, we show that using self-attention (ATT) in the graph embedding step to create a representative global context vector, as discussed earlier, improves the performance by another 2%, as shown in Table 5. The intuitive reasons for these enhancements were discussed in detail in Section 3; here we quantitatively show their corresponding performance improvements.

4.3 Semantic Registration


4.3.1 LiDAR odometry estimation with Semantic registration
Our semantic registration is an enhanced version of the front-end pose estimation algorithm of F-LOAM Wang et al (2021), with semantic-label-assisted dynamic point removal, correspondence association and weighting. As our semantic registration algorithm can also function as a front-end LiDAR odometry algorithm, we evaluate its pose estimation accuracy on the SemanticKITTI dataset. We compare vanilla F-LOAM and our enhanced version with semantic labels, based on their odometry accuracy on multiple sequences, as shown in Table 6. The code is implemented using ROS Quigley et al (2009) in C++ on an Intel Core i7-7700HQ CPU (2.80 GHz × 8). We employ the absolute trajectory error (ATE) Prokhorov et al (2019) to evaluate the accuracy of the estimated trajectory. For this, the error matrix $E_i$ at time $i$ is defined as

$$E_i = Q_i^{-1} S P_i$$

where $Q_i \in SE(3)$ is the ground-truth pose, $P_i \in SE(3)$ is the estimated pose and $S$ is the rigid body transformation matrix aligning the two trajectories.

Method          00      02      05      08      09
F-LOAM          4.998   8.388   2.854   4.112   1.806
Ours-odometry   4.590   8.378   3.229   3.734   1.175

Table 6: Comparison between F-LOAM and semantic-registration-based odometry using the ATE metric on the SemanticKITTI dataset. This shows that semantic registration is accurate and can work on long sequences.

Method      00      02       05      08
ISC-LOAM    1.304   39.114   0.730   3.860
Ours        1.252   3.213    0.689   3.699

Table 7: Comparison of our proposed loop closure algorithm and the intensity-scan-context-based ISC-LOAM, with the same front-end algorithm, F-LOAM, on the SemanticKITTI dataset.

The ATE is then calculated as the root mean square error over the error matrices:

$$\mathrm{ATE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\lVert \mathrm{trans}(E_i) \right\rVert^2}$$

where $\mathrm{trans}(E_i)$ denotes the translational part of $E_i$. The ATE reflects the average per-frame deviation between the estimated pose and the ground-truth pose.
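A sketch of this ATE computation (illustrative; it follows the common convention of measuring the translational part of each error matrix E_i):

    import numpy as np

    def ate_rmse(gt_poses: np.ndarray, est_poses: np.ndarray,
                 S: np.ndarray) -> float:
        """RMSE absolute trajectory error over n aligned 4x4 SE(3) poses.

        gt_poses, est_poses: (n, 4, 4) homogeneous poses Q_i and P_i;
        S: 4x4 rigid alignment between the estimated and ground-truth frames.
        """
        sq_errors = []
        for Q, P in zip(gt_poses, est_poses):
            E = np.linalg.inv(Q) @ S @ P                     # E_i = Q_i^{-1} S P_i
            sq_errors.append(np.linalg.norm(E[:3, 3]) ** 2)  # translational part
        return float(np.sqrt(np.mean(sq_errors)))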
We select 5 sequences (00, 02, 05, 08, 09) to compare the two methods. From Table 6, we can see that our semantic-assisted registration generates more accurate trajectories than the original F-LOAM on all sequences except sequence 05. This can likely be attributed to a few instances of erroneous pose estimation that were then propagated to later frames, as this is an open-ended odometry system with no loop closure. We want to highlight that the proposed semantic registration is often more accurate (our first requirement) and robust enough to work on complete sequences with no loop closure; this should not be mistaken for the complete system accuracy with loop closure, which is covered in the next experiment. Even after integrating semantic information, the semantic registration algorithm runs at 9.3 Hz, while vanilla F-LOAM runs at 10.2 Hz on the above system configuration, a negligible difference.

4.4 Semantic Loop Closure


We finally evaluate our whole semantic loop closure system, consisting of the proposed semantic place recognition module and semantic registration for estimating the 6 DoF pose, by plugging it into the ISC-LOAM framework. We compare our proposed back-end module with ISC-LOAM's back-end, which uses intensity scan context, while keeping the front-end odometry algorithm unchanged. We conducted the experiments on 4 SemanticKITTI sequences (00, 02, 05, 08), as they contain enough loops, following the settings described in Section 4.3 with ATE as the evaluation metric.

Fig. 6: Trajectories of the complete SLAM system based on our loop closure module, of ISC-LOAM, and of the ground truth on sequence 02 from SemanticKITTI.

The results shown in Table 7 conclude that our semantic loop closure algorithm finds enough loops and that semantic registration estimates accurate relative pose constraints, resulting in superior performance over the traditional LiDAR SLAM system ISC-LOAM. In particular, on sequence 02 (Fig. 6), ISC-LOAM fails to close the loop, possibly because of wrong loop closure detections or inaccurate relative constraints, while our proposed method succeeds. Thus, our proposed loop closure detection algorithm, based on the semantic graph encoder and graph comparison module, is a robust and reliable alternative with minimal memory and computational requirements.

5 Conclusion & Future Work
We introduced a LiDAR-based loop closure detection algorithm that uses semantic graphs with graph attention networks to identify possible revisited places. Our algorithm has two modules, namely the semantic graph encoder module and the graph comparison module. The semantic graph encoder encodes the spatial, semantic and geometric information from the scene graph of a point cloud using graph attention networks and the self-attention mechanism to create distinctive graph vectors. The graph vectors of candidate loop closure scans are then classified as a successful match or not by the graph comparison module, mainly leveraging the difference of these graph vectors in an end-to-end trainable network. Lastly, we implemented a semantic registration algorithm to estimate the 6 DoF pose and integrated it into an existing LiDAR SLAM algorithm. Our experiments show that the proposed approach offers a significant boost in performance and opens up a direction of employing graph attention networks for improved accuracy in place recognition. Lastly, to foster further research in this direction, we open-source our complete algorithm.
In the future, we plan to employ RGB information, leverage classical (learning-free) loop closure detection algorithms and develop an even more efficient algorithm to improve both the accuracy and runtime of the learning-based place recognition module. We are also exploring the direction of leveraging foundation models such as SAM Kirillov et al (2023), CLIP Radford et al (2021) and LLMs Hong et al (2023) to automatically segment and reason about RGB and point cloud information to perform accurate place recognition in a human-like reasoning manner.

Declarations
5.1 Acknowledgements
Not applicable

5.2 Funding
Not applicable

5.3 Conflict of interest/Competing interests


The authors declare that they have no conflicts of interest or competing interests.

5.4 Ethics approval


Not applicable

5.5 Consent to participate


Informed consent was obtained from all individual participants included in the study.

5.6 Consent for publication
Consent for publication was obtained from all participants whose data is included in
this manuscript.

5.7 Data and code availability


The raw data that support the findings of this study are openly available on the
websites of SemanticKITTI and KITTI-360. The custom code used for data analysis
is available on GitHub at https://github.com/crepuscularlight/SemanticLoopClosure.

5.8 Author contribution


Liudi Yang: Methodology, Data Curation, Experiment and Writing Original Draft
Ruben Mascaro: Methodology, Supervision
Ignacio Alzugaray: Methodology, Supervision
Sai Manoj Prakhya: Formal Analysis, Investigation, Writing Review and Editing
Marco Karrer: Methodology, Supervision
Ziyuan Liu: Supervision
Margarita Chli: Supervision

References
Arandjelovic R, Gronat P, Torii A, et al (2016) NetVLAD: CNN architecture for
weakly supervised place recognition. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp 5297–5307

Bai Y, Ding H, Bian S, et al (2019) SimGNN: A Neural Network Approach to Fast
Graph Similarity Computation. In: Proceedings of the Twelfth ACM International
Conference on Web Search and Data Mining (WSDM '19). Association for Computing
Machinery, New York, NY, USA, pp 384–392, https://doi.org/10.1145/3289600.3290967

Behley J, Garbade M, Milioto A, et al (2019) SemanticKITTI: A Dataset for Semantic
Scene Understanding of LiDAR Sequences. In: Proc. of the IEEE/CVF International
Conf. on Computer Vision (ICCV)

Bosse M, Zlot R (2013) Place Recognition using Keypoint Voting in Large 3D LiDAR
Datasets. In: 2013 IEEE International Conference on Robotics and Automation, pp
2677–2684, https://doi.org/10.1109/ICRA.2013.6630945

Brody S, Alon U, Yahav E (2022) How Attentive are Graph Attention Networks? In:
International Conference on Learning Representations, URL https://openreview.net/forum?id=F72ximsx7C1

Bruna J, Zaremba W, Szlam A, et al (2014) Spectral Networks and Locally Connected
Networks on Graphs. In: International Conference on Learning Representations
(ICLR 2014), CBLS, April 2014

Cattaneo D, Vaghi M, Valada A (2022) LCDNet: Deep Loop Closure Detection
and Point Cloud Registration for LiDAR SLAM. IEEE Transactions on Robotics
38(4):2074–2093

Chen X, Milioto A, Palazzolo E, et al (2019) SuMa++: Efficient LiDAR-based Semantic
SLAM. In: Proceedings of the IEEE/RSJ Int. Conf. on Intelligent Robots and
Systems (IROS)

Dubé R, Dugas D, Stumm E, et al (2017) SegMatch: Segment Based Place Recognition
in 3D Point Clouds. In: IEEE International Conference on Robotics and Automation
(ICRA)

Dubé R, Cramariuc A, Dugas D, et al (2018) SegMap: 3D Segment Mapping using
Data-Driven Descriptors. In: Robotics: Science and Systems XIV,
https://doi.org/10.15607/rss.2018.xiv.003

Geiger A, Lenz P, Urtasun R (2012) Are we ready for Autonomous Driving? The
KITTI Vision Benchmark Suite. In: Conference on Computer Vision and Pattern
Recognition (CVPR)

He L, Wang X, Zhang H (2016) M2DP: A Novel 3D Point Cloud Descriptor and Its
Application in Loop Closure Detection. In: 2016 IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems (IROS), pp 231–237, https://doi.org/10.
1109/IROS.2016.7759060

Hong Y, Zhen H, Chen P, et al (2023) 3D-LLM: Injecting the 3D World into Large
Language Models. arXiv

Kim G, Kim A (2018) Scan Context: Egocentric Spatial Descriptor for Place Recog-
nition Within 3D Point Cloud Map. In: 2018 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pp 4802–4809, https://doi.org/10.1109/
IROS.2018.8593953

Kim G, Choi S, Kim A (2021) Scan Context++: Structural Place Recognition Robust
to Rotation and Lateral Variations in Urban Environments. IEEE Transactions on
Robotics 38(3):1856–1874

Kipf TN, Welling M (2017) Semi-Supervised Classification with Graph Convolutional
Networks. In: International Conference on Learning Representations, URL
https://openreview.net/forum?id=SJU4ayYgl

Kirillov A, Mintun E, Ravi N, et al (2023) Segment Anything. arXiv preprint
arXiv:2304.02643

Komorowski J (2021) MinkLoc3D: Point Cloud Based Large-Scale Place Recognition.
In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV),
pp 1789–1798, https://doi.org/10.1109/WACV48630.2021.00183

Kong X, Yang X, Zhai G, et al (2020) Semantic Graph Based Place Recognition
for 3D Point Clouds. In: 2020 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pp 8216–8223, https://doi.org/10.1109/IROS45743.
2020.9341060

Li G, Müller M, Qian G, et al (2023) DeepGCNs: Making GCNs Go as Deep as CNNs.
IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6):6923–6939,
https://doi.org/10.1109/TPAMI.2021.3074057

Li L, Kong X, Zhao X, et al (2021a) SSC: Semantic Scan Context for Large-Scale Place
Recognition. In: 2021 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pp 2092–2099, https://doi.org/10.1109/IROS51168.2021.9635904

Li L, Kong X, Zhao X, et al (2021b) SA-LOAM: Semantic-aided LiDAR SLAM with
Loop Closure. In: 2021 IEEE International Conference on Robotics and Automation
(ICRA), pp 7627–7634

Liao Y, Xie J, Geiger A (2021) KITTI-360: A Novel Dataset and Benchmarks for Urban
Scene Understanding in 2D and 3D. https://doi.org/10.48550/ARXIV.2109.13410,
URL https://arxiv.org/abs/2109.13410

Liu Z, Zhou S, Suo C, et al (2019) LPD-Net: 3D Point Cloud Learning for Large-Scale
Place Recognition and Environment Analysis. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision, pp 2831–2840

Loshchilov I, Hutter F (2019) Decoupled Weight Decay Regularization. In: Inter-
national Conference on Learning Representations, URL https://openreview.net/forum?id=Bkg6RiCqY7

Magnusson M, Andreasson H, Nuchter A, et al (2009) Appearance-based Loop Detec-
tion from 3D Laser Data Using the Normal Distributions Transform. In: 2009
IEEE International Conference on Robotics and Automation, pp 23–28,
https://doi.org/10.1109/ROBOT.2009.5152712

Prakhya SM, Lin J, Chandrasekhar V, et al (2017) 3DHoPD: A Fast Low-Dimensional
3-D Descriptor. IEEE Robotics and Automation Letters 2(3):1472–1479,
https://doi.org/10.1109/LRA.2017.2667721

Prokhorov D, Zhukov D, Barinova O, et al (2019) Measuring Robustness of Visual
SLAM. In: 2019 16th International Conference on Machine Vision Applications
(MVA), pp 1–6, https://doi.org/10.23919/MVA.2019.8758020

Qi CR, Su H, Mo K, et al (2017) PointNet: Deep Learning on Point Sets for 3D Clas-
sification and Segmentation. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp 652–660

Quigley M, Conley K, Gerkey B, et al (2009) ROS: An Open-Source Robot Operating
System. In: ICRA Workshop on Open Source Software

Radford A, Kim JW, Hallacy C, et al (2021) Learning Transferable Visual Models from
Natural Language Supervision. In: International conference on machine learning,
PMLR, pp 8748–8763

Rusu RB, Blodow N, Beetz M (2009) Fast Point Feature Histograms (FPFH) for 3D
Registration. In: 2009 IEEE International Conference on Robotics and Automation,
pp 3212–3217, https://doi.org/10.1109/ROBOT.2009.5152473

Salti S, Tombari F, Di Stefano L (2014) SHOT: Unique Signatures of Histograms
for Surface and Texture Description. Computer Vision and Image Understanding
125:251–264, https://doi.org/10.1016/j.cviu.2014.04.011

Uy MA, Lee GH (2018) PointNetVLAD: Deep Point Cloud Based Retrieval for Large-
Scale Place Recognition. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR)

Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is All you Need. In: Guyon I,
Luxburg UV, Bengio S, et al (eds) Advances in Neural Information Processing Sys-
tems, vol 30. Curran Associates, Inc., URL https://proceedings.neurips.cc/paper_
files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Veličković P, Cucurull G, Casanova A, et al (2018) Graph Attention Networks. In:
International Conference on Learning Representations, URL https://openreview.net/forum?id=rJXMpikCZ

Wang H, Wang C, Xie L (2020) Intensity Scan Context: Coding Intensity and Geom-
etry Relations for Loop Closure Detection. In: 2020 IEEE International Conference
on Robotics and Automation (ICRA). IEEE, https://doi.org/10.1109/icra40945.
2020.9196764

Wang H, Wang C, Chen CL, et al (2021) F-LOAM : Fast LiDAR odometry and
mapping. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS). IEEE, https://doi.org/10.1109/iros51168.2021.9636655

Wang Y, Sun Y, Liu Z, et al (2019) Dynamic Graph CNN for Learning on Point
Clouds. ACM Transactions on Graphics (TOG) 38(5):1–12

Wu Z, Song S, Khosla A, et al (2015) 3D ShapeNets: A Deep Representation for
Volumetric Shapes. In: 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp 1912–1920, https://doi.org/10.1109/CVPR.2015.7298801
