Manuscript v2
Abstract
In this paper, we propose a novel loop closure detection algorithm that uses
graph attention neural networks to encode semantic graphs for place
recognition and then uses semantic registration to estimate the 6 DoF relative
pose constraint. Our place recognition algorithm has two key modules,
namely, a semantic graph encoder module and a graph comparison module. The
semantic graph encoder employs graph attention networks to efficiently encode
spatial, semantic and geometric information from the semantic graph of the input
point cloud. We then use the self-attention mechanism in both the node-embedding
and graph-embedding steps to create distinctive graph vectors. The graph vectors
of the current scan and a keyframe scan are then compared in the graph comparison
module to identify a possible loop closure. In particular, employing the
difference of the two graph vectors showed a significant improvement in
performance, as shown in our ablation studies. Lastly, we implemented a semantic
registration algorithm that takes loop closure candidate scans as input and estimates
the relative 6 DoF pose constraint for a LiDAR SLAM system. Extensive evaluation
on mainstream datasets shows that our model is more accurate and
robust, achieving a 13% improvement in maximum F1 score on the SemanticKITTI
dataset compared to the baseline semantic graph algorithm. For the
benefit of the community, we open-source the complete implementation of our
proposed algorithm and our custom implementation of semantic registration at
https://github.com/crepuscularlight/SemanticLoopClosure.
1 Introduction
Simultaneous Localization and Mapping (SLAM) plays a crucial role in enabling
autonomous mobile robots to explore and navigate unknown environments. One of its
fundamental challenges is the accumulation of drift caused by state estimation errors
in the front-end odometry, which can lead to globally inconsistent maps. To address
this issue, loop closure detection algorithms have been developed that identify
revisited places and help reduce the accumulated drift by adding a 6 DoF pose
constraint in pose-graph or non-linear factor-graph-based LiDAR SLAM systems. There has been
continual innovation in loop closure detection algorithms by employing the latest AI
advancements Uy and Lee (2018); Cattaneo et al (2022); Kong et al (2020) to improve
their accuracy and real-time performance.
Classical (learning-free) LiDAR-based place recognition algorithms use heuristic
and handcrafted methods to reduce a large raw point cloud to a distinctive and descrip-
tive multi-dimensional vector. Moreover, most of these handcrafted methods also have
a carefully designed metric specific to the descriptor to compare their similarity. Many
such handcrafted global feature descriptors, such as M2DP He et al (2016), Scan
Context Kim and Kim (2018) and its variants Wang et al (2020); Kim et al (2021), are quite
sensitive to the design parameters, the type of LiDAR used (i.e., 16, 32, 64 or 128
channels), the pose of the LiDAR (horizontal or inclined) and, lastly, perceptual
disturbances such as occlusion and rotation.
In recent years, deep-learning-based loop closure detection algorithms have gained
a lot of traction as they can be custom-trained for the target environment where the
robot will be deployed, taking into account the actual installation pose and the type of
LiDAR used. Early representative methods such as PointNetVLAD Uy and Lee (2018)
and LCDNet Cattaneo et al (2022) already exhibited promising accuracy, but they
process raw point clouds as a whole and, when it comes to actual deployment,
require substantial computing power to run in real time.
A more recent work, SGPR Kong et al (2020), proposed a less computationally
intensive approach that explicitly incorporates a semantic graph as the underlying
representation, in an attempt to better mimic how the real-world scene actually looks
semantically. To this end, SGPR takes the instance segmentation result of the point
cloud as the input, creates semantic graphs and encodes spatial and semantic infor-
mation into lightweight graph embeddings. These graph embeddings are matched in a
graph-graph interaction module, which is a graph-matching neural network that treats
loop detection as a graph comparison problem. SGPR Kong et al (2020), being one of
the first graph-based approaches for place recognition, did not encode comprehensive
information from the semantic graph (it left out the geometric information), used
a much simpler EdgeConv module Wang et al (2019) to create node embeddings,
and, as we propose in this paper, there is still scope to improve its graph-graph
interaction module.
Recently, graph attention networks (GAT) Veličković et al (2018) and graph sim-
ilarity computation methods Bai et al (2019) have shown significant improvements
in how graphs can be encoded, compared and made learnable in an end-to-
end fashion. Specifically, graph attention networks use a learnable linear transformation
instead of simple scalar values to aggregate neighbouring features, offering better graph
encoding than the EdgeConv used in SGPR Kong et al (2020). This essentially allows
the multi-head attention in GATs to learn from multiple subspaces, thus encoding
complex relationships between graph nodes and offering a significant boost in
performance. Next, the seminal paper on Transformers Vaswani et al (2017) introduced the
self-attention mechanism, where the relationships between different elements of
the input sequence can be learned effectively to reason about the underlying complex
relations in the training data.
Inspired by SGPR Kong et al (2020), GAT Veličković et al (2018), self-attention
Vaswani et al (2017) and SimGNN Bai et al (2019), we have developed an enhanced
graph-based loop closure detection algorithm that overcomes many of their drawbacks
and uses the latest techniques to effectively encode a semantic graph, resulting in a
significant boost in performance. We propose a two-stage approach consisting of a
semantic graph encoder and a graph comparison module.
• As our first contribution, we enhance SGPR by designing a semantic graph encoder
that uses graph attention networks to encode the spatial, semantic and geometric
information of the semantic graph, as opposed to SGPR's limited information and
simpler encoding.
• Our second contribution is to use the self-attention mechanism in the node embedding
and graph embedding steps to encode complex underlying relationships, essentially
creating more distinctive graph vectors.
• As our third contribution, we show that employing the difference of the input graph
vectors in the graph comparison module to perform classification offers a significant
boost in performance, as opposed to the direct usage of graph vectors, as in
SGPR Kong et al (2020).
• Our final contribution is to open-source our work at
https://github.com/crepuscularlight/SemanticLoopClosure, which consists of the
semantic graph encoder module, the graph comparison module and a custom
implementation of semantic registration for 6 DoF pose estimation, to foster further
research in this direction.
Exhaustive experiments and ablation studies on public datasets prove the increased
accuracy and robustness of both our semantic place recognition network and our
semantic registration algorithm compared to other state-of-the-art methods. In
addition, we demonstrate that both modules can run in real time with minimal memory
and compute requirements, making them an ideal choice for integration into existing
SLAM frameworks.
2 Related Work
We review previous works on traditional and learning-based 3D place recognition
algorithms, and related graph neural networks that can function as backbones to
extract representative features from graphs.
2.1 3D Place Recognition
Traditional methods reduce a raw point cloud with millions of points into a multi-
dimensional vector using meticulously designed methods. Mostly, these extracted
descriptors can be compared using Euclidean distance or specific handcrafted metrics
to find a close match, essentially representing a place match/revisit. Magnusson
et al (2009) developed NDT, a histogram-based feature descriptor exploiting the
normal distribution representation to describe 3D surfaces, along with an evaluation
metric for scene matching. In 2013, Bosse and Zlot (2013) presented
a keypoint voting mechanism to achieve fast matching between the current scan and
database scans while estimating the matching thresholds/hyper-parameters by fitting
a parametric model to the underlying distribution. M2DP He et al (2016) first projects
3D point clouds into a 2D plane to generate density signatures and uses corresponding
concatenated singular vectors as descriptors.
SegMatch Dubé et al (2017) is one of the first approaches that extracted descriptors
from the clustered segments of the raw point clouds and used a geometric verification
step to find a correct match. Approaches where high-level geometric clustering
and semantic/feature description are used for matching generally achieve high accuracy
and are more robust in loop closure detection. Scan Context Kim and Kim (2018)
initiated the trend to directly use point clouds without calculating histograms to create
a global descriptor. It uses an encoding function that stores condensed information in
spatial bins along radial and azimuthal directions to generate more distinctive global
descriptors. While the vanilla Scan Context Kim and Kim (2018) only stores
the maximum height in each bin, its variants encode more effective and representative
information in the bins, including detected intensity Wang et al (2020), semantic
labels of point clouds Li et al (2021a) and subcontexts Kim et al (2021), boosting the
performance of the descriptor.
Learning-based methods have the advantage of being able to be custom-trained
for the target environment with a specific robot/sensor setup, thus
enabling them to offer better performance in particularly complex environments.
PointNetVLAD Uy and Lee (2018) leveraged deep neural networks to retrieve large-
scale scenes by using PointNet Qi et al (2017) as the backbone and NetVLAD
Arandjelovic et al (2016) to aggregate learned local features, while outputting global
descriptors for matching. SegMap Dubé et al (2018) proposed to learn data-driven
descriptors leveraging the 3D point cloud variance of each cluster/segment, but its
innate 3D CNN architecture comes with a considerable computational burden.
To enrich local geometric details, LPD-Net Liu et al (2019) resorts to an adaptive
backbone to aggregate local information into the global descriptors. The core of the
local information extraction module is to fuse the nearest neighbors’ information from
feature space and Cartesian space. MinkLoc3D Komorowski (2021) proposed a simple
neural network to process sparse voxelized point clouds based on sparse 3D CNN. By
quantizing the raw point clouds and employing sparse convolution, it achieves a simi-
lar inference speed to other multilayer-perceptron-based algorithms while maintaining
high precision. LCDNet Cattaneo et al (2022) adopted an end-to-end architecture
simultaneously accomplishing place recognition and 6 DoF pose estimation; a shared
3D voxel CNN is used to extract features for the two-head output of place recognition
and pose estimation. SGPR Kong et al (2020), on the other hand, converted place
recognition into a graph-matching problem by deeming every instance a node
and designed an efficient graph neural network to infer similarity, exhibiting
excellent robustness on mainstream datasets.
3 Method
3.1 System Overview
Our proposed system’s pipeline is shown in Fig. 1. F-LOAM Wang et al (2021) is used
as the front-end LiDAR odometry to provide the pose estimation for every incoming
LiDAR scan. This pose is regarded as a node in the pose graph for nonlinear optimiza-
tion of the whole trajectory. The relative pose according to odometry is added as a
factor between the current node and its previous node. Specifically, when the seman-
tic graph-based place recognition module finds potential loop candidates successfully,
another constraint calculated by semantic registration is inserted into the pose graph
between the corresponding nodes. This pose graph with both odometry constraints
5
Fig. 1: The high-level workflow of the proposed semantic graph based loop closure
system integrated into a SLAM framework. The proposed loop closure algorithm takes
two semantically segmented point clouds as input, which are converted to semantic
graphs. After that, semantic graph encoders are deployed to compress them into graph
vectors. Finally, the graph comparison module predicts the similarity of the two loop
candidates. When the similarity exceeds a specific threshold, a pose constraint is esti-
mated using semantic registration, which is added to the pose graph for trajectory
optimization.
This pose graph, with both odometry and loop closure constraints, is optimized to
obtain the final consistent and accurate 3D map of the environment. There are two
components in the proposed loop closure back-end: first, a semantic place recognition
module that generates loop closure candidates, and second, a semantic relocalization
module that calculates the relative 6 DoF pose.
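To make the data flow concrete, the following minimal Python sketch mirrors this back-end logic; all names (encoder, comparator, register, graph_vector, the 0.5 threshold) are illustrative assumptions rather than our actual API.

```python
def loop_closure_backend(curr_scan, keyframes, encoder, comparator, register,
                         threshold=0.5):
    """Return (keyframe id, 6 DoF constraint) for an accepted loop, else None."""
    e_curr = encoder(curr_scan.semantic_graph)       # graph vector of current scan
    for kf in keyframes:
        similarity = comparator(e_curr, kf.graph_vector)
        if similarity > threshold:                   # place recognition succeeded
            pose = register(curr_scan, kf.scan)      # semantic registration
            return kf.id, pose                       # constraint for the pose graph
    return None
```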
Each node of the constructed semantic graph encodes an instance's semantic label,
centroid coordinates and bounding box. One-hot encoding is used as the embedding
function for the semantic information, which eliminates the ordinal influence of semantic
labels. The encoded semantic labels and centroids $(x_k, y_k, z_k)$ of the instances represent
the spatial distribution of semantic objects and the topological relationship of those
instances in the current scene. By adding the proposed bounding boxes, we enhance
the semantic graph with additional geometric information representing the object’s
size and boundaries. We have evaluated the following three possible ways of encoding
the geometric information about the instances:
• FPFH Rusu et al (2009) - Classical 3D feature descriptor
• PointNet Qi et al (2017) - Deep learning based 3D feature extractor
• Bounding box (top left, bottom right) points of the instance
Intuitively, FPFH and PointNet feature descriptors encode more insightful information
than bounding boxes. However, in practice, quantitative evaluation on the datasets
has shown that bounding boxes offer better performance than the other two options, as
discussed further in Section 4.2 and Table 4. A possible reason is that
adding such multidimensional vectors to graph nodes does not carry semantic, graphical
or topological information that the graph comparison module can use to differentiate
between two semantic graphs. Instead, bounding boxes directly encode the relative
size and boundaries of the encoded semantic instances, making the graph comparison
module's task easier. Moreover, these traditional and deep-learning feature extractors
come with additional computational costs, while bounding boxes are readily available
from instance segmentation. Hence, we settled on adding bounding boxes as an additional
source of geometric information to the constructed semantic graph nodes, boosting
performance with no computational overhead.
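To illustrate, below is a minimal Python sketch of how the per-node features (one-hot semantic label, centroid and 6-D bounding box) could be assembled from an instance-segmented scan; the helper name and its inputs are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def build_graph_nodes(points, instance_ids, labels, num_classes=12):
    """points: (M, 3) scan; instance_ids/labels: per-point arrays from an
    instance segmentation front-end. Returns one-hot labels (N x L),
    centroids (N x 3) and bounding boxes (N x 6)."""
    sem, cen, box = [], [], []
    for inst in np.unique(instance_ids):
        mask = instance_ids == inst
        pts = points[mask]
        one_hot = np.zeros(num_classes)
        one_hot[labels[mask][0]] = 1.0      # eliminates ordinal label influence
        sem.append(one_hot)
        cen.append(pts.mean(axis=0))        # centroid (x_k, y_k, z_k)
        box.append(np.concatenate([pts.min(axis=0), pts.max(axis=0)]))  # 6-D box
    return np.stack(sem), np.stack(cen), np.stack(box)
```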
Fig. 2: The architecture of the proposed semantic graph encoder. The semantic graphs
created from input point clouds are passed through three GATs to extract contextual
spatial, semantic and geometric features. These features are then concatenated and
passed through a self-attention module to produce a node embedding f . Another self-
attention module operates on the node embedding f to learn a global context vector
c. We finally project the node embedding f onto the global context vector c to obtain
corresponding node weights and use them to calculate the final graph vector e.
The detailed illustration of the semantic graph encoder is shown in Fig. 2. The semantic
graph encoder has three branches emanating from the input semantic graph that
correspond to semantic labels, centroids and bounding boxes. For a considered branch,
the input semantic graph can be denoted as $h = \{\vec{h}_1, \vec{h}_2, \dots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^F$, where $N$
represents the number of nodes and $F$ is the feature dimension; for example, $F = 3$
for the centroid and $F = 6$ for the bounding box, as shown in Fig. 2.
To aggregate neighbourhood node information, we first compute the difference between
the current node and a neighbouring node, $\vec{h}_i - \vec{h}_j$, and then concatenate this difference
with $\vec{h}_i$, resulting in $\vec{h}_i \| (\vec{h}_i - \vec{h}_j)$, essentially doubling its dimensionality. We perform
this concatenation of $\vec{h}_i$ with $\vec{h}_i - \vec{h}_j$ to combine contextual information between $\vec{h}_i$
and $\vec{h}_j$. We then use a learnable matrix $W \in \mathbb{R}^{F' \times 2F}$ to transform $\vec{h}_i \| (\vec{h}_i - \vec{h}_j)$ and
estimate the attention-based weights $\alpha_{ij}$ as shown below:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\,T} W \left[\vec{h}_i \| (\vec{h}_i - \vec{h}_j)\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\,T} W \left[\vec{h}_i \| (\vec{h}_i - \vec{h}_k)\right]\right)\right)} \qquad (1)$$

where $\vec{a}$ is the learnable attention vector to reduce dimensions, $\|$ denotes concatenation
and $\mathcal{N}_i$ is the set of the $k$ nearest neighbour nodes of node $i$. We use 10 nearest
neighbours throughout our experiments, i.e., $k = 10$; we experimented with multiple
$k$ values and found that the choice has no noticeable effect on performance.
Compared to the vanilla/classical GAT Veličković et al (2018), which attends over all
nodes, our proposed k-NN search drastically reduces the computational cost of both
training and inference. We use the LeakyReLU activation function with
slope 0.2 to enhance the learning accuracy of the neural network and alleviate
dead-neuron issues during training. The $i$-th row of the features extracted using the
GAT can be represented as

$$\vec{h}'_i = \mathop{\big\Vert}_{z=1}^{Z} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha^z_{ij} W^z \left[\vec{h}_i \| (\vec{h}_i - \vec{h}_j)\right]\Big) \qquad (2)$$

where $Z$ is the number of attention heads, $\sigma$ is the activation function, and $\alpha^z_{ij}$ and
$W^z$ are the attention weights and transformation matrix of the $z$-th head.
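For illustration, a single-head PyTorch sketch of this k-NN graph attention layer (Eqs. (1)-(2)) could look as follows; the use of centroid coordinates for the k-NN search and the ReLU output activation are assumptions made for the sketch, not a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KNNGATLayer(nn.Module):
    """Single-head sketch of the k-NN graph attention layer of Eqs. (1)-(2)."""
    def __init__(self, f_in, f_out, k=10):
        super().__init__()
        self.W = nn.Linear(2 * f_in, f_out, bias=False)  # W in R^{F' x 2F}
        self.a = nn.Linear(f_out, 1, bias=False)         # attention vector a
        self.k = k

    def forward(self, h, coords):
        # h: (N, F) node features; coords: (N, 3) centroids used for the k-NN.
        dist = torch.cdist(coords, coords)               # (N, N) pairwise distances
        nbr = dist.topk(self.k + 1, largest=False).indices[:, 1:]  # (N, k), skip self
        h_i = h.unsqueeze(1).expand(-1, self.k, -1)      # (N, k, F)
        h_j = h[nbr]                                     # (N, k, F) neighbour features
        z = self.W(torch.cat([h_i, h_i - h_j], dim=-1))  # W [h_i || (h_i - h_j)]
        att = F.leaky_relu(self.a(z), negative_slope=0.2)
        alpha = torch.softmax(att, dim=1)                # Eq. (1), over neighbours
        return torch.relu((alpha * z).sum(dim=1))        # Eq. (2), one head
```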
The outputs of the three branches are concatenated and passed through a self-attention
module to produce the node embedding

$$f = \mathrm{pooling}\Big(\mathrm{softmax}\Big(\frac{Q(x)K(x)^T}{\sqrt{d_k}}\Big)V(x)\Big) \qquad (3)$$

where $Q$, $K$, $V$ are respectively the query, key and value mapping functions,
$x \in \mathbb{R}^{N \times 3F'}$ is the concatenated feature from the three branches and $d_k$ is the dimension
of the keys. Going through this self-attention module, the original graph containing three
separate branches of information from different nodes gets converted into a single node
embedding matrix $f$.
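A minimal PyTorch sketch of Eq. (3), assuming learned linear maps for Q, K, V, $d_k$ equal to the input width, and a learned linear pooling from $3F'$ to $F'$ (the exact pooling operator is our assumption):

```python
import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    """Sketch of Eq. (3): self-attention over concatenated branch features."""
    def __init__(self, f_prime):
        super().__init__()
        d = 3 * f_prime
        self.q, self.k, self.v = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.pool = nn.Linear(d, f_prime)   # assumed learned pooling 3F' -> F'

    def forward(self, x):                   # x: (N, 3F') concatenated branches
        att = torch.softmax(self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5, dim=-1)
        return self.pool(att @ self.v(x))   # node embedding f: (N, F')
```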
The next step is to compress this node embedding $f$ into a graph vector $e$.

Graph Embedding: Compressing graphs into fixed-length vectors that encode
the information from all nodes is essential to compare two graphs efficiently and
enable many downstream applications. We propose to use a self-attention module to
learn a global context vector $c \in \mathbb{R}^{F'}$ from the node embedding $f$, instead of the much
simpler approach in SimGNN, to efficiently capture the useful information in $f$. The
node embedding matrix $f$ is passed through the self-attention module, producing a
stack of auxiliary vectors, which can be represented as

$$\mathrm{attention}(f) = (u_1, \dots, u_N)^T \in \mathbb{R}^{N \times F'} \qquad (4)$$

where $u_i \in \mathbb{R}^{F'}$ represents the auxiliary vector for the $i$-th row of $\mathrm{attention}(f)$. The
learnable global context vector $c$ is estimated by pooling the auxiliary vectors, similar
to SimGNN Bai et al (2019):
Fig. 3: The architecture of the proposed graph comparison module. The two graph
vectors and their difference vector are combined through the learnable weights $W_1$,
$W_2$, $W_3$ and bias $b$ to form the $S \times 1$ similarity vector, which is passed through
fully connected (FC) layers and a sigmoid to output a match probability in $[0, 1]$.
$$c = \tanh\Big(\frac{1}{N}\sum_{i=1}^{N} u_i\Big) \qquad (5)$$
We then obtain the weight vector shown in Fig. 2, representing the similarity between
the node embedding $f$ and the global context vector $c$, by calculating their inner product.
After converting the weight vector into $[0, 1]$ via a sigmoid function, we finally estimate
the graph vector $e$ as the inner product of the node embedding $f$ and the weight vector,
as shown below:

$$e = \sum_{i=1}^{N} \mathrm{sigmoid}(f_i^T c)\, f_i \qquad (6)$$

where $f_i$ represents the $i$-th row of the node embedding $f$. This graph vector $e$ essentially
compresses and represents the input semantic graph by encoding all the nodes and
their spatial, semantic and geometric information into an $F'$-dimensional vector. In our
experiments, we create a 32-dimensional graph vector; we found that changing the
graph vector's dimension from 16 to 64 did not bring any noticeable change in
performance.
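Eqs. (4)-(6) can be sketched in the same style; again the Q/K/V maps are assumed linear and the module is a simplification of the actual implementation.

```python
import torch
import torch.nn as nn

class GraphEmbedding(nn.Module):
    """Sketch of Eqs. (4)-(6): node embedding f -> graph vector e."""
    def __init__(self, f_prime=32):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(f_prime, f_prime, bias=False)
                                  for _ in range(3))

    def forward(self, f):                    # f: (N, F') node embedding
        att = torch.softmax(self.q(f) @ self.k(f).T / f.shape[-1] ** 0.5, dim=-1)
        u = att @ self.v(f)                  # auxiliary vectors, Eq. (4)
        c = torch.tanh(u.mean(dim=0))        # global context vector, Eq. (5)
        w = torch.sigmoid(f @ c)             # per-node weights in [0, 1]
        return w @ f                         # graph vector e, Eq. (6): (F',)
```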
The graph comparison module estimates a similarity vector from the two graph vectors
using a function defined as

$$f(e_1, e_2) = \mathrm{ReLU}\Big(d^T W_1^{[1:S]} d + W_2\, d + W_3\, (e_1 \| e_2) + b\Big) \qquad (7)$$

where $d = e_1 - e_2$ is the difference vector of the two graph vectors. In this equation,
$W_1^{[1:S]} \in \mathbb{R}^{S \times F' \times F'}$ is the weight tensor capturing the second-order difference term,
$W_2 \in \mathbb{R}^{S \times F'}$ is the weight matrix for the first-order difference term,
$W_3 \in \mathbb{R}^{S \times 2F'}$ is the weight matrix for the concatenated vector $e_1 \| e_2$, $b \in \mathbb{R}^S$ is the
learnable bias term and $S$ is the dimension of the similarity vector.
The similarity vector then goes through fully connected layers and a sigmoid function
to guarantee that the output probability lies in the range $[0, 1]$. By feeding the graph
vectors into the graph comparison module, we convert place recognition into a binary
classification problem and hence employ the binary cross-entropy function as the loss for
training:

$$\mathrm{loss} = -\frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} \big[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \big] \qquad (8)$$

where $y_i \in \{0, 1\}$ is the ground-truth value, $\hat{y}_i \in [0, 1]$ is the predicted value and
$N_{batch}$ is the batch size. Lastly, only when the resultant similarity between two graph
vectors from the graph comparison module is higher than a threshold are the input
scans passed on to the semantic registration module to estimate the 6 DoF pose.
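A minimal sketch of the comparison module of Eq. (7) together with its classification head; the similarity dimension $S = 16$ and the head layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphComparison(nn.Module):
    """Sketch of Eq. (7) plus the classification head."""
    def __init__(self, f_prime=32, s=16):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(s, f_prime, f_prime) * 0.01)
        self.W2 = nn.Linear(f_prime, s, bias=False)
        self.W3 = nn.Linear(2 * f_prime, s, bias=False)
        self.b = nn.Parameter(torch.zeros(s))
        self.fc = nn.Sequential(nn.Linear(s, s), nn.ReLU(), nn.Linear(s, 1))

    def forward(self, e1, e2):               # e1, e2: (F',) graph vectors
        d = e1 - e2                          # difference vector
        second = torch.einsum('i,sij,j->s', d, self.W1, d)  # d^T W1 d
        sim = torch.relu(second + self.W2(d)
                         + self.W3(torch.cat([e1, e2])) + self.b)
        return torch.sigmoid(self.fc(sim)).squeeze(-1)      # match probability

# Training would use binary cross-entropy (Eq. (8)), e.g. nn.BCELoss().
```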
In the semantic registration module, the relative pose between two scans is estimated
by aligning edge keypoints and surface keypoints using point-to-line and point-to-plane
error metrics. While calculating the target lines and planes corresponding to the source
edge and planar points, we use the nearest 5 points in the target scan with the same
semantic label. Compared to the original F-LOAM, using target points with the same
semantic label aids in establishing accurate correspondences. Drawing inspiration from
SA-LOAM's observation that different classes exert different influence during semantic
registration, we assign larger weights to clearly distinguishable classes, such as traffic
signs, poles and buildings. For more details, one can refer to our open-source
implementation. The cost function is defined as
$$r = \sum_{l \in L} \Big( \sum_i w_l\, d^l_{e_i} + \sum_j w_l\, d^l_{s_j} \Big) \qquad (9)$$
where $d^l_{e_i}$ is the distance from the $i$-th edge keypoint to the corresponding edge, $d^l_{s_j}$
is the distance from the $j$-th surface keypoint to the corresponding surface, $l \in L$ is
the semantic label of the considered keypoint and $w_l$ is the semantic-related weight. We
set the weights for the traffic sign, pole and building labels to 1.2 and the rest to
0.8. Once the relative pose constraint is calculated, we perform geometric verification
based on the fitness score, and only then is the constraint added to the pose graph for
optimization.
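The following Python sketch illustrates the semantic-aware correspondence search and class weighting described above; the label names, the SciPy KD-tree usage and the function signature are assumptions for illustration, not our exact code.

```python
import numpy as np
from scipy.spatial import cKDTree

# Semantic-related weights w_l: distinguishable classes get 1.2, the rest 0.8.
HIGH_WEIGHT_LABELS = {"traffic sign", "pole", "building"}

def weight(label):
    return 1.2 if label in HIGH_WEIGHT_LABELS else 0.8

def semantic_correspondences(src_pts, src_labels, tgt_pts, tgt_labels, k=5):
    """For each source keypoint, return its k nearest target points that
    share the same semantic label, plus the semantic weight w_l."""
    matches = []
    for label in set(src_labels):
        tgt_mask = tgt_labels == label
        if tgt_mask.sum() < k:
            continue                          # not enough same-label targets
        tree = cKDTree(tgt_pts[tgt_mask])     # per-label KD-tree
        src_mask = src_labels == label
        _, idx = tree.query(src_pts[src_mask], k=k)
        matches.append((src_pts[src_mask], tgt_pts[tgt_mask][idx], weight(label)))
    return matches
```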
4 Experimental Evaluation
We design experiments to evaluate the performance of the proposed pipeline and
compare it with related state-of-the-art methods on open-source datasets. We test its
robustness by randomly rotating and occluding input scans while highlighting its
low memory footprint and compute requirements. Extensive ablation studies were
performed to evaluate multiple ways of encoding geometric information, to vary the
number of nodes in the semantic graph and to highlight the performance improvement
contributed by each proposed enhancement. Lastly, we evaluate the semantic
registration module as a standalone LiDAR odometry pipeline and later integrate the
proposed loop closure module into an open-source SLAM algorithm and present its
performance.
We evaluate on SemanticKITTI, on KITTI-360 Liao et al (2021), a novel semantic
LiDAR dataset which extends SemanticKITTI to much larger areas, and on the classical
KITTI dataset Geiger et al (2012) with semantic labels produced by RangeNet++, which
tests our model's robustness by emulating real scenarios where deep-learning-based
segmentation models produce wrong labels. The raw point clouds, the available
per-point semantic labels and the ground-truth poses are used to build datasets for
evaluating place recognition. Similar to SGPR, we randomly generate a large set of scan
pairs: if the distance between the two scans of a pair is less than 3 m, they are deemed
a true positive pair; if the distance is larger than 20 m, they are regarded as a true
negative pair.
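A minimal sketch of this pair-generation procedure, assuming ground-truth scan positions are available as an (N, 3) array:

```python
import numpy as np

def generate_pairs(positions, n_pairs, pos_thresh=3.0, neg_thresh=20.0, seed=0):
    """Randomly sample labelled scan pairs from ground-truth positions (N, 3):
    distance < 3 m -> positive (1), distance > 20 m -> negative (0)."""
    rng = np.random.default_rng(seed)
    pairs = []
    while len(pairs) < n_pairs:
        i, j = rng.integers(len(positions), size=2)
        dist = np.linalg.norm(positions[i] - positions[j])
        if dist < pos_thresh:
            pairs.append((i, j, 1))
        elif dist > neg_thresh:
            pairs.append((i, j, 0))          # pairs in between are discarded
    return pairs
```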
Our proposed model is developed in PyTorch, using the AdamW Loshchilov and
Hutter (2019) optimizer with a learning rate of 0.0001. The model is trained with a batch
size of 128 for 50 epochs on one Nvidia Tesla T4 (16 GB). The number of nodes in the
semantic graph created from an input scan is set to 50 by default. If the number of
segmented instances in a scan is less than 50, pseudo nodes with zero node information
are added; if there are more than 50 semantic instances, we randomly sample 50
of them to build the semantic graph, so as to have a consistent batch size for training. In
these datasets, most scans contain 30-40 semantic instances and only a few of them
go as high as 60-70 nodes per scene graph. Setting the k-nearest-neighbour parameter
k in our GAT to 10 drastically improves the training and inference speed as compared
to the classical GAT, where all neighbourhood nodes are used.
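The node-count normalization can be sketched as follows (a simplification under the stated 50-node default):

```python
import numpy as np

def normalize_node_count(node_feats, target_n=50, seed=0):
    """Pad with zero pseudo nodes or randomly subsample so every semantic
    graph has exactly `target_n` nodes (node_feats: (N, F))."""
    n, f = node_feats.shape
    if n < target_n:
        pad = np.zeros((target_n - n, f))            # zero pseudo nodes
        return np.vstack([node_feats, pad])
    rng = np.random.default_rng(seed)
    keep = rng.choice(n, size=target_n, replace=False)
    return node_feats[keep]
```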
To make the SemanticKITTI dataset fit our model, we remap the original 28
classes into 12 appropriate classes (car, other vehicles, other ground, fence, trunk,
pole, truck, sidewalk, building, vegetation, terrain and traffic sign), the same
as SGPR. We select 5 sequences (01, 03, 04, 09, 10) for training and 6 sequences
(00, 02, 05, 06, 07, 08) for testing. For the KITTI-360 dataset, we remap 19 classes into 13
classes (car, static object, ground, parking, rail track, building, wall, fence, guard rail,
bridge, pole, vegetation and traffic sign); we train on sequences (00, 02, 03, 04) and
test on sequences (05, 06, 07, 09, 10). For the KITTI dataset, labels are inferred from the
pretrained RangeNet++ model, whose 19 output classes are mapped to the same 12 classes
as SemanticKITTI; the train and test sequences are identical to those of SemanticKITTI.
4.1.2 Analysis
We use the maximum value of the F1 score as the evaluation metric for our model. It is
defined as

$$F_1 = 2 \times \frac{P \times R}{P + R}$$

where $P$ represents precision and $R$ represents recall. As the place recognition datasets
are unbalanced, i.e., the proportion of negative pairs is much larger than that of positive
pairs, the maximum F1 score is a more comprehensive metric than accuracy (success rate)
and average precision. We compare our method with the following open-source algorithms:
SGPR Kong et al (2020), Scan Context (SC) Kim and Kim (2018) and Intensity
Scan Context (ISC) Wang et al (2020). While there are other deep-learning-based
loop closure algorithms to compare against, most of them need a large amount of
work to pre-process the data to obtain results that can be compared fairly.
Hence, we specifically focused on methods with high relevance that leverage semantic
graphs for place recognition Kong et al (2020), or ones that are widely used, such as
Scan Context Kim and Kim (2018).
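For reference, the maximum F1 score over all decision thresholds can be computed as in the following sketch (scikit-learn is assumed merely for convenience):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def max_f1(y_true, y_score):
    """Maximum F1 over all decision thresholds, given ground-truth pair
    labels and predicted match probabilities."""
    p, r, _ = precision_recall_curve(y_true, y_score)
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)   # guard against 0/0
    return f1.max()
```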
Table 1: Maximum F1 scores on the SemanticKITTI and KITTI-360 datasets.

Method    |              SemanticKITTI               |           KITTI-360
          | 00    02    05    06    07    08    Mean | 05    06    07    09    10    Mean
SGPR      | 0.846 0.78  0.724 0.901 0.902 0.731 0.814| 0.703 0.707 0.745 0.673 0.697 0.705
SGPR-RN   | 0.771 0.758 0.767 0.857 0.813 0.635 0.767| -     -     -     -     -     -
SC        | 0.579 0.535 0.577 0.729 0.684 0.171 0.546| 0.550 0.412 0.554 0.455 0.672 0.529
ISC       | 0.860 0.808 0.840 0.901 0.634 0.626 0.778| 0.756 0.692 0.811 0.712 0.867 0.768
Ours-RN   | 0.923 0.839 0.873 0.947 0.913 0.730 0.871| -     -     -     -     -     -
Ours      | 0.935 0.902 0.858 0.979 0.926 0.923 0.921| 0.816 0.824 0.890 0.762 0.916 0.842
Please note that in all the experiments, "Ours" and "SGPR" denote our proposed
method and the SGPR algorithm respectively, with the semantic labels taken directly
from the dataset's ground-truth annotations. When we refer to "Ours-RN" or
"SGPR-RN", we mimic the real-world scenario where the semantic labels are inferred
from a pretrained network, RangeNet++ Chen et al (2019), instead of using
ground-truth annotations.
Table 2: Maximum F1 scores on SemanticKITTI under random rotation and occlusion
of the input scans (CMP denotes the change in the mean score relative to Table 1).

Method |                    Rotation                     |                   Occlusion
       | 00    02    05    06    07    08    Mean  CMP   | 00    02    05    06    07    08    Mean  CMP
SGPR   | 0.741 0.701 0.708 0.905 0.734 0.675 0.744 -0.070| 0.815 0.672 0.721 0.927 0.894 0.695 0.787 -0.027
SC     | 0.269 0.112 0.323 0.668 0.386 0.192 0.325 -0.221| 0.524 0.447 0.530 0.654 0.158 0.386 0.450 -0.096
ISC    | 0.857 0.804 0.836 0.899 0.626 0.627 0.775 -0.003| 0.833 0.793 0.810 0.878 0.578 0.600 0.749 -0.029
Ours   | 0.927 0.898 0.877 0.982 0.910 0.918 0.919 -0.002| 0.923 0.831 0.836 0.972 0.846 0.802 0.868 -0.052
RangeNet++ provides noticeably inferior segmentation performance compared to the
ground-truth annotations (about 48% mean IoU score). Ours-RN still
maintains high performance even with a pre-trained model to infer semantic labels,
mainly because of its architectural advantage of using a semantic graph representation,
which is tolerant to noisy labels in a few graph nodes while maintaining a distinctive
high-level semantic graph representation of every scene.
Fig. 4: (a) Sequence-00, (b) Sequence-02, (c) Sequence-05.
Our method thus remains robust even under wrong semantic label inference. This
particularly makes our proposal an ideal choice for loop closure detection in
real-world scenarios.

One of the impressive features of our proposed model is that it is extremely
lightweight in terms of memory, which in turn makes it extremely fast. As shown in
Table 3, our model requires only 426 KB, which is far smaller than
other learning-based methods. This minimal model size makes our model trainable
and deployable even on a consumer laptop. Owing to its small model size, our proposed
model runs extremely fast, at about 73 Hz, on an Nvidia Tesla T4 (16 GB)
GPU for inference with pre-processed semantic graphs as input. Our proposed model
runs sequentially after the instance segmentation algorithm, and most state-of-the-art
Fig. 5: Performance change with varying number of nodes in semantic graph.
Geometry Feature   Dimension   Mean
Bounding box       6           0.921
FPFH               33          0.876
PointNet           1024        0.883

Table 4: Maximum F1 score comparison of different geometric features on the
SemanticKITTI dataset.
SGPR   DIFF   GAT   GEO   ATT   Mean
 ✓                              0.814
 ✓      ✓                       0.876
 ✓      ✓      ✓                0.882
 ✓      ✓      ✓     ✓          0.908
 ✓      ✓      ✓     ✓     ✓    0.921
Table 5: Ablation study on SemanticKITTI dataset showing how each proposed
enhancement to baseline SGPR contributes to overall performance improvement of the
proposed place recognition module. SGPR denotes the baseline model, DIFF denotes
the relative difference term added in the graph comparison module of Fig. 3, GAT
stands for the graph attention networks added in semantic graph encoder along with
the immediate self-attention module to generate node embedding f (refer Fig. 2), GEO
represents the geometric information branch that encodes the bounding box informa-
tion, and ATT is the self-attention layer from the graph embedding part that creates
the global context vector c (refer Fig. 2).
To evaluate the odometry accuracy, we use the Absolute Trajectory Error (ATE), where
the per-frame error is

$$E_i = Q_i^{-1} S P_i$$

where $Q_i \in SE(3)$ is the ground-truth pose, $P_i \in SE(3)$ is the estimated pose and $S$ is
the rigid-body transformation matrix aligning the estimated trajectory to the ground
truth. The ATE is then calculated as the root mean square of the translational
components of $E_i$ over all frames.
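A minimal sketch of this ATE computation, assuming poses are given as 4x4 SE(3) matrices and the alignment transform S has been precomputed:

```python
import numpy as np

def ate_rmse(gt_poses, est_poses, S=np.eye(4)):
    """ATE sketch: E_i = Q_i^{-1} S P_i, RMSE over the translational part
    (gt_poses/est_poses are lists of 4x4 SE(3) matrices)."""
    errs = []
    for Q, P in zip(gt_poses, est_poses):
        E = np.linalg.inv(Q) @ S @ P
        errs.append(np.linalg.norm(E[:3, 3]))   # translation error per frame
    return np.sqrt(np.mean(np.square(errs)))
```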
Method          00     02     05     08     09
F-LOAM          4.998  8.388  2.854  4.112  1.806
Ours-odometry   4.590  8.378  3.229  3.734  1.175

Table 6: Comparison between F-LOAM and semantic-registration-based odometry
using the ATE metric on the SemanticKITTI dataset. This shows that semantic
registration is accurate and can work on long sequences.
Method      00     02      05     08
ISC-LOAM    1.304  39.114  0.730  3.860
Ours        1.252  3.213   0.689  3.699

Table 7: Comparison of our proposed loop closure algorithm and the intensity-scan-
context-based ISC-LOAM, with the same front-end algorithm, F-LOAM, on the
SemanticKITTI dataset.
The ATE reflects the average deviation between the estimated pose and the
ground-truth pose per frame.

We select 5 sequences (00, 02, 05, 08, 09) to compare the two methods. From
Table 6, we can see that our semantic-assisted registration generates more accurate
trajectories than the original F-LOAM on four sequences, the exception being sequence
05. A possible reason is that a few instances of erroneous pose estimation propagated
to later frames, as this is an open-ended odometry system with no loop closure. We
want to highlight that the proposed semantic registration is often more accurate (our
first requirement) and robust enough to work on complete sequences with no loop
closure. Note that this is not the complete system accuracy with loop closure, which is
covered in the next experiment. Even after integrating semantic information, the
semantic registration algorithm runs at 9.3 Hz, while vanilla F-LOAM runs at 10.2 Hz
on the above system configuration, a negligible difference.
Fig. 6: Trajectories of complete SLAM system based on our loop closure module,
ISC-LOAM and the ground truth of sequence 02 from SemanticKITTI.
We again use the ATE as the evaluation metric. The results shown in Table 7 indicate
that our semantic loop closure algorithm finds enough loops and that semantic
registration estimates accurate relative pose constraints, resulting in superior
performance over the traditional LiDAR SLAM system ISC-LOAM. In particular, on
sequence 02 (Fig. 6), ISC-LOAM fails to close the loop, possibly because of wrong
loop closure detections or inaccurate relative constraints, while our proposed method
runs successfully. Thus, our proposed loop closure detection algorithm, based on the
semantic graph encoder and graph comparison modules, is a robust and reliable
alternative with minimal memory and computational requirements.
5 Conclusion & Future Work
We introduced a LiDAR-based loop closure detection algorithm that uses semantic
graphs with graph attention networks to identify possible revisited places. Our
algorithm has two modules, namely the semantic graph encoder module and the graph
comparison module. The semantic graph encoder encodes the spatial, semantic and
geometric information from the scene graph of the point clouds using graph attention
networks and the self-attention mechanism to create distinctive graph vectors. These
graph vectors of candidate loop closure scans are then classified as a successful match
or not by the graph comparison module, mainly leveraging the difference of these
graph vectors in an end-to-end trainable network. We further implemented a semantic
registration algorithm to estimate the 6 DoF pose and integrated it into an existing
LiDAR SLAM algorithm. Our experiments show that the proposed approach offers a
significant boost in performance and opens up a direction of employing graph attention
networks for improved accuracy in place recognition. Lastly, to foster further research
in this direction, we open-source our complete algorithm.
In the future, we plan to employ RGB information, leverage classical (learning-
free) loop closure detection algorithms and develop an even more efficient algorithm
to improve both the accuracy and the run time of the learning-based place recognition
module. We are also exploring the direction of leveraging foundation models such as
SAM Kirillov et al (2023), CLIP Radford et al (2021) and LLMs Hong et al (2023)
to automatically segment and reason about RGB and point cloud information to
perform accurate place recognition in a human-like reasoning manner.
Declarations
5.1 Acknowledgements
Not Applicable
5.2 Funding
Not Applicable
5.6 Consent for publication
Consent for publication was obtained from all participants whose data is included in
this manuscript.
References
Arandjelovic R, Gronat P, Torii A, et al (2016) NetVLAD: CNN architecture for
weakly supervised place recognition. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp 5297–5307
Bosse M, Zlot R (2013) Place recognition using keypoint voting in large 3d lidar
datasets. In: 2013 IEEE International Conference on Robotics and Automation, pp
2677–2684, https://doi.org/10.1109/ICRA.2013.6630945
Brody S, Alon U, Yahav E (2022) How attentive are graph attention networks? In:
International Conference on Learning Representations, URL https://openreview.
net/forum?id=F72ximsx7C1
Cattaneo D, Vaghi M, Valada A (2022) LCDNet: Deep Loop Closure Detection
and Point Cloud Registration for LiDAR SLAM. IEEE Transactions on Robotics
38(4):2074–2093
Geiger A, Lenz P, Urtasun R (2012) Are we ready for Autonomous Driving? The
KITTI Vision Benchmark Suite. In: Conference on Computer Vision and Pattern
Recognition (CVPR)
He L, Wang X, Zhang H (2016) M2DP: A Novel 3D Point Cloud Descriptor and Its
Application in Loop Closure Detection. In: 2016 IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems (IROS), pp 231–237, https://doi.org/10.
1109/IROS.2016.7759060
Hong Y, Zhen H, Chen P, et al (2023) 3D-LLM: Injecting the 3D World into Large
Language Models. arXiv
Kim G, Kim A (2018) Scan Context: Egocentric Spatial Descriptor for Place Recog-
nition Within 3D Point Cloud Map. In: 2018 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pp 4802–4809, https://doi.org/10.1109/
IROS.2018.8593953
Kim G, Choi S, Kim A (2021) Scan Context++: Structural Place Recognition Robust
to Rotation and Lateral Variations in Urban Environments. IEEE Transactions on
Robotics 38(3):1856–1874
Kong X, Yang X, Zhai G, et al (2020) Semantic Graph Based Place Recognition
for 3D Point Clouds. In: 2020 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pp 8216–8223, https://doi.org/10.1109/IROS45743.
2020.9341060
Li L, Kong X, Zhao X, et al (2021a) SSC: Semantic Scan Context for Large-Scale Place
Recognition. In: 2021 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). IEEE Press, p 2092–2099, URL https://doi.org/10.1109/
IROS51168.2021.9635904
Liao Y, Xie J, Geiger A (2021) Kitti-360: A novel dataset and benchmarks for urban
scene understanding in 2d and 3d. https://doi.org/10.48550/ARXIV.2109.13410,
URL https://arxiv.org/abs/2109.13410
Liu Z, Zhou S, Suo C, et al (2019) LPD-Net: 3D Point Cloud Learning for Large-Scale
Place Recognition and Environment Analysis. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision, pp 2831–2840
Quigley M, Conley K, Gerkey B, et al (2009) ROS: An Open-Source Robot Operating
System. In: ICRA Workshop on Open Source Software
Radford A, Kim JW, Hallacy C, et al (2021) Learning Transferable Visual Models from
Natural Language Supervision. In: International conference on machine learning,
PMLR, pp 8748–8763
Rusu RB, Blodow N, Beetz M (2009) Fast Point Feature Histograms (FPFH) for 3D
Registration. In: 2009 IEEE International Conference on Robotics and Automation,
pp 3212–3217, https://doi.org/10.1109/ROBOT.2009.5152473
Uy MA, Lee GH (2018) PointNetVLAD: Deep Point Cloud Based Retrieval for Large-
Scale Place Recognition. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR)
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is All you Need. In: Guyon I,
Luxburg UV, Bengio S, et al (eds) Advances in Neural Information Processing Sys-
tems, vol 30. Curran Associates, Inc., URL https://proceedings.neurips.cc/paper_
files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Wang H, Wang C, Xie L (2020) Intensity Scan Context: Coding Intensity and Geom-
etry Relations for Loop Closure Detection. In: 2020 IEEE International Conference
on Robotics and Automation (ICRA). IEEE, https://doi.org/10.1109/icra40945.
2020.9196764
Wang H, Wang C, Chen CL, et al (2021) F-LOAM : Fast LiDAR odometry and
mapping. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS). IEEE, https://doi.org/10.1109/iros51168.2021.9636655
Wang Y, Sun Y, Liu Z, et al (2019) Dynamic Graph CNN for Learning on Point
Clouds. ACM Transactions on Graphics (tog) 38(5):1–12