Article

O2SAT: Object-Oriented-Segmentation-Guided Spatial-Attention Network for 3D Object Detection in Autonomous Vehicles

by Husnain Mushtaq 1, Xiaoheng Deng 1,*, Irshad Ullah 1, Mubashir Ali 2 and Babur Hayat Malik 3

1 School of Computer Science and Engineering, Central South University, Changsha 410083, China
2 School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
3 Department of Computer Science, University of Chenab, Gujrat 50700, Pakistan
* Author to whom correspondence should be addressed.
Information 2024, 15(7), 376; https://doi.org/10.3390/info15070376
Submission received: 20 April 2024 / Revised: 19 June 2024 / Accepted: 25 June 2024 / Published: 28 June 2024

Abstract

Autonomous vehicles (AVs) strive to adapt to the specific characteristics of sustainable urban environments. Accurate 3D object detection with LiDAR is paramount for autonomous driving. However, existing research predominantly relies on the 3D object-based assumption, which overlooks the complexity of real-world road environments. Consequently, current methods suffer performance degradation when they target only local features and overlook the intersection of object and road features, especially in uneven road conditions. This study proposes a 3D Object-Oriented-Segmentation Spatial-Attention (O2SAT) approach to distinguish object points from road points and enhance keypoint feature learning through a channel-wise spatial attention mechanism. O2SAT consists of three modules: Object-Oriented Segmentation (OOS), Spatial-Attention Feature Reweighting (SFR), and a Road-Aware 3D Detection Head (R3D). OOS distinguishes object and road points and performs object-aware downsampling to augment the data by learning the hidden connection between landscape and objects; SFR performs weight augmentation to learn crucial neighboring relationships and dynamically adjusts feature weights through spatial attention mechanisms, which enhances long-range interactions and contextual feature discrimination for noise suppression, improving overall detection performance; and R3D utilizes the refined object segmentation and optimized feature representations, forecasting prediction confidence within existing point backbones. The effectiveness and robustness of our method have been demonstrated through extensive experiments on the KITTI dataset. The proposed modules integrate seamlessly into existing point-based frameworks in a plug-and-play manner.

1. Introduction

The swift progress in the communication, sensor, and computer industries has accelerated autonomous driving technology [1,2,3,4]. The environmental perception system is a crucial phase of autonomous driving, significantly influencing subsequent motion planning and decision-making processes. A 3D LiDAR-based system involves object point cloud clustering, road point segmentation, and bounding box fitting tasks through environmental perception via a target object point cloud [5,6,7,8]. This study strives to improve 3D object detection performance in autonomous vehicles (AVs) by learning to discriminate between objects (cars, pedestrians, and cyclists) and road points, which play a crucial role in object localization, bounding box fitting, and computational optimization in the 3D point cloud environmental perception system [9,10]. The precision of object detection holds paramount importance for the ultimate accuracy of AV technology and environmental sensing systems [11].
One-third or more of the collected points are identified as road points due to the installation and operating characteristics of LiDAR, and these points are often eroded during the downsampling process in 3D backbones [12]. However, these road points impose an unnecessary computational burden on object identification and tracking [13,14,15]. Most point-based downsampling approaches, like feature-based farthest point sampling (Feat-FPS) [16] and distance-based farthest point sampling (D-FPS) [17], result in a shortage of object points (especially for pedestrians and poles), adversely impacting perception in diverse urban driving environments. Existing 3D object detection pipelines for AVs often encounter false positives or false negatives when object (car, pedestrian, and cyclist) points are mistakenly categorized as road points and vice versa. Consequently, eliminating ground points has become a common practice to optimize the efficiency of object identification. However, this practice still prevents geometric features from being fully exploited to generate high-quality proposals, owing to the strong correlation between object points and road points [18,19].
Road point segmentation approaches for LiDAR point cloud data target good accuracy in 3D object detection pipelines for AVs [1,2,3]. Enhancing 3D object detection through 3D object segmentation of road points aims to reduce these two kinds of inaccurate scenarios while detecting 3D objects [7]. Moreover, the efficient handling of substantial real-time data poses notable challenges for both computational performance and algorithm efficiency [17,20,21,22].
Two primary challenges need addressing. Firstly, the issue of generating full-space 3D proposals is crucial, involving object detection at any spatial position to refrain from overlooking 3D objects’ intersections with road points [23,24,25,26]. The direct incorporation of point sampling strategies often fails to preserve enough foreground points at the last encoding layer. As such, it remains challenging to detect objects of interest precisely, especially small objects such as pedestrians and cyclists, where only extremely limited foreground points are left [27,28,29]. Secondly, there is a lack of discrimination between the object class features (truth boxes) and road class features when point features are passed through feature training pipelines [30,31,32], as highlighted in Figure 1c.
We propose the 3D Object-Oriented-Segmentation-Guided Spatial Attention (O2SAT) approach to enhance 3D object detection for autonomous vehicles (AVs). This approach distinguishes object points from road points and improves keypoint feature learning using a channel-wise spatial attention mechanism. Initially, a point-based framework coupled with an Object-Oriented Segmentation (OOS) module with deep Hough transform is used to segment the road surface from 3D point cloud data, eliminating spatial limitations. A road-aware orientation branch leverages the results to a 3D detection head and an object-aware sampling module to detect objects without spatial limitations for anchor-free center point generation, as shown in Figure 2. This object-aware sampling approach produces the references for each center-point feature by incorporating the foreground semantic priors into the network training pipelines. The object coordinate system undergoes segmentation, where some segments are mapped to locations in the Cartesian system. Real object (i.e., car) points are found in each segment using the deep Hough transform. As depicted in Figure 1a, ground truth boxes located on the figure's right side are more distant (deeper) from the observer, presenting a greater challenge for localization.
The Spatial Attention-Based Feature Reweighting (SFR) module dynamically adjusts the importance of different channels in the feature representation of the segmented object regions. It leverages attention mechanisms and focuses on relevant features while suppressing noise and irrelevant information. This adaptive reweighting scheme enhances the discriminative power of the feature representation, improving the overall performance of the detection system. We exploit local and global channel-wise attributes of the encoded points by employing a channel-wise reweighting approach, improving the augmentative approach of the standard transformer. We incorporate a scaling mechanism across the decoding feature space to enhance the expressiveness of query–key interactions and compute channel-wise attention distribution in each key embedding. This approach improves the expressive capacity of the model and the depth of query–key interactions, ultimately enhancing the standard attention mechanism, as depicted in Figure 1b; 3D boxes are better observed compared to Figure 1a. The Road-Aware 3D Detection Head (R3D) module utilizes the refined object segmentation results from OOS and optimized feature representations from SFR to detect and classify objects, such as vehicles, pedestrians, and obstacles in the AV’s vicinity. It also provides crucial information about the objects’ positions, dimensions, and semantic labels, enabling the AV to make informed decisions for safe navigation. To summarize, the principal findings of this study can be delineated as follows:
  • We propose O2SAT to augment object points from the road surface. This enables an attention-based, channel-wise feature reweighting module to learn more discriminating features, enhancing the overall detection performance.
  • We designed the OOS module to segment the road surface, with its road-aware orientation branch feeding into the object-aware sampling module and the 3D detection head. This enables the detection of objects without spatial limitations for anchor-free center point generation.
  • The SFR module leverages self-attention-based spatial encoding to embed contextual information into each point feature, enhancing feature representation and suppressing noise. This adaptive reweighting enhances feature discrimination, thereby improving the detection system performance.
  • Our method’s effectiveness and robustness across diverse datasets (KITTI and SlopedKITTI) have been demonstrated through experiments conducted in various urban traffic environments.

2. Related Work

2.1. Point Cloud Representations for 3D Object Detection

LiDAR plays a crucial role in autonomous vehicles (AVs), generating uneven, unordered, irregular, and unstructured 3D point clouds. Handling raw 3D points with conventional data processing techniques is challenging. Several 3D object detection methods have surfaced in recent years [14,17,21,26,28,33,34,35,36,37,38,39,40,41,42], and these approaches are classified based on their 3D point cloud processing techniques.

2.1.1. Voxel-Based Methods

Many studies have been reported that create regular voxel grids from uneven and unordered 3D point clouds, employing convolutional neural networks (CNNs) to learn geometric patterns [26,36,37,43]. Initial research incorporated highly dense CNNs and voxelization for analyzing voxel-based point cloud data [40,41,42].
The SECOND architecture introduced by Yan et al. reduced memory and quantization costs and enhanced efficiency using 3D sub-manifold sparse convolution [36]. PointPillars, on the other hand, groups points into pillars to form a simplified voxel representation of the 3D point cloud [37]. However, existing single-stage and two-stage voxel-based 3D object detectors often yield different accuracy for objects of different sizes and distances, lacking generalization [21,44]. ImVoxelNet by Rukhovich et al. applied an image-to-voxel approach, which greatly increases quantization and memory usage due to the additional projection of images into voxels [43]. Zhou et al. incorporated 3D CNNs to create regular 3D voxels by transforming LiDAR point clouds for object detection [26]. Noh et al. incorporated a simple yet efficient single-stage integration of point-based and voxel-based features for 3D object detection [41]. To achieve dense feature representation with less processing cost, Shi et al. proposed a roadside-LiDAR voxel-based feature learning network that projects raw point clouds and voxelizes them into a bird's-eye view (BEV) [28]. While voxel-based methods are reported to provide efficient 3D object identification performance, they may suffer from quantization loss and structural complexity, making it difficult to determine the ideal point cloud resolution for local geometry and associated circumstances.

2.1.2. Point-Based Methods

Point-based approaches produce 3D models by learning directly from raw point clouds with unstructured geometry, in contrast to voxel-based approaches [34,38]. PointNet [17] and its variations [21,45] are used in point-based approaches to aggregate point-wise characteristics using symmetric functions, which helps to manage the uneven and unordered LiDAR 3D point clouds. Shi et al. [21] introduced PointRCNN, a two-stage regional proposal 3D object detection framework. It regresses high-quality 3D bounding boxes by generating object proposals from foreground point segments and using local spatial and semantic information.
Qi et al. [20] proposed VoteNet, a one-stage deep Hough voting-based 3D object detector that predicts the center point of an object. Yang et al. [13] introduced 3DSSD, a single-stage 3D object detection framework that employs Euclidean space and farthest point sampling (FPS) as a combined downsampling strategy. PointGNN [46] presented a generalized GNN-based 3D object detector. Compared to voxel-based methods, point-based methods are less resource-intensive, more straightforward and intuitive, and require less pre-processing of raw point clouds. However, point-based methods still exhibit a considerable gap in learning ability and efficiency for 3D object detection applications.

2.1.3. Ground Segmentation from 3D Object Detection

The reviewed work encompasses diverse methodologies designed to precisely determine the orientation and position of objects, primarily leveraging object depth information acquired from RGB-D devices. With the advantage of known CAD models, the need to predict object dimensions is eliminated [47,48,49,50,51,52,53]. Among these methodologies, Guo et al. introduced Point Pair Features (PPFs), a point-based technique to estimate object poses from the object depth map [47]. Gao et al. adopted a dual-branch network to estimate orientation and position, improving the accuracy of rotation regression with a novel geodesic loss function [50]. Keypoint-based and voting-based methods utilize CNN-based and PointNet-like networks for keypoint prediction from point clouds, recovering 6D poses by estimating transformations between predicted and predefined keypoint sets, often incorporating an iterative closest point (ICP) approach for further refinement based on known CAD models [51].
The work also delves into ground segmentation algorithms, categorizing them into line-based and surface-based methods. The latter involves modeling the ground as either a plane or a sloped surface, with RANSAC (random sample consensus) and its variations being commonly employed for plane fitting [10,30,54,55]. Alternative approaches involve optimizing sampling point selection using curved surface fitting, the normal distribution transformation (NDT) [10], and the Gaussian Process Incremental Sample Consensus (GP-INSAC) algorithm [10,54]. The paper highlights the challenges surface-based methods face, such as reduced accuracy in dynamic ground scenarios, leading to the proposal of line-based methods as an alternative approach.
Furthermore, three primary techniques are reported for ground segmentation in point cloud object detection for AVs. The first method projects the 3D point cloud onto a horizontal grid and uses spatial characteristics, accomplishing ground segmentation through the elevations of the highest and lowest terrain points and their difference [30]. The second method considers the terrain as a plane surface, utilizing algorithms like RANSAC for iterative determination of the unknown plane parameters, as sketched below [56]. The third method involves projecting the LiDAR point cloud into polar coordinates, segmenting the ground based on feature analysis within polar sectors [55]. The paper emphasizes these approaches' efficiency and high precision in terrain feature recognition.
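For illustration, a minimal NumPy sketch of the RANSAC plane-fitting idea mentioned above follows; the iteration count and distance threshold are assumed values, not parameters from the cited works.

```python
import numpy as np

def ransac_ground_plane(points, iters=100, dist_thresh=0.2, seed=0):
    """Minimal RANSAC plane-fitting sketch for ground segmentation.

    points: (N, 3) LiDAR coordinates. Returns (plane, inlier_mask), where plane
    holds the coefficients (a, b, c, d) of a*x + b*y + c*z + d = 0."""
    rng = np.random.default_rng(seed)
    best_plane, best_mask = None, np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(normal) < 1e-6:       # degenerate (collinear) sample
            continue
        normal = normal / np.linalg.norm(normal)
        d = -normal.dot(p1)
        dist = np.abs(points @ normal + d)      # point-to-plane distances
        mask = dist < dist_thresh
        if mask.sum() > best_mask.sum():        # keep the plane with most inliers
            best_plane, best_mask = np.append(normal, d), mask
    return best_plane, best_mask
```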

2.1.4. Attention-Based Methods

The Point Cloud Transformer applies a neighbor embedding technique to improve local feature representation and incorporates offset attention to extract global features [57]. Concurrently, the Point Transformer showcases an altered Transformer design that blends local characteristics with vector attention and relative location encoding [58]. Using a stratified sampling technique for keys inspired by the Swin Transformer [59], the Stratified Transformer [60] creates non-overlapping cubic windows in 3D space. DFA-SAT utilizes self-attention to learn semantic features with contextual information by incorporating neighboring data and focusing on vital geometric details [11]. To enhance long-dependency correlation learning, PointASNL [61] uses an adaptive sample module in conjunction with a non-local network. PointNeXt uses the PointNet++ backbone as a foundation while investigating sophisticated training and data augmentation methods to improve precision and performance. The PCT design [57] replaces the PointNet module of PointNet++ with neighbor embedding patches. However, both approaches still depend on local and neighboring features; their performance is therefore affected by missing points and point variations.
This work distinguishes itself from existing methods in 3D object detection for AVs by augmenting object data in point clouds through road point segmentation and object-based downsampling. It enhances the capability of attention-based feature learning approaches to improve spatial feature learning, addressing challenges in detecting objects in urban traffic environments. The proposed approach aims to enhance performance metrics and provides a comprehensive overview of object detection and ground segmentation techniques.

3. Methodology

3.1. Overview

O2SAT leverages an OOS approach to augment object points from the road surface. This ultimately enables the SFR module to learn more discriminating features, improving overall detection performance [13,62]. OOS distinguishes between object and road points, performing object-aware downsampling to augment data by learning to identify the hidden connection between the landscape and objects. SFR performs weight augmentation to learn crucial neighboring relationships and dynamically adjusts feature weights through spatial attention mechanisms, enhancing long-range interactions and discriminating contextual features for noise suppression, thereby improving overall detection performance. Finally, R3D utilizes refined object segmentation and optimized feature representations, forecasting prediction confidence within existing point backbones, as shown in Figure 2.

3.2. Point-Based Backbone

The backbone exclusively integrates the encoder component of the U-shaped network [63] to ensure sustainable computing. The input $X$ is progressively encoded into a compact set of point features $X_r = \{x_i\}_{i=1}^{n_1}$, each associated with a semantic point feature $P_r \in \mathbb{R}^{c_1}$. This aggregation is achieved through multiple object-aware downsampling layers and set abstraction layers, with detailed explanations in the subsequent section. The sampled points are grouped using set abstraction layers, after which the point-based backbone [64] extracts features for each center point.
P = \gamma\Big(\max_{i=1,\dots,k} \{\, h(f_i) \,\}\Big)
Here, the features $P$ are extracted from the point group $\{f_i\}_{i=1}^{k}$, where $\gamma(\cdot)$ and $h(\cdot)$ denote multilayer perceptrons (MLPs). To enhance recall during downsampling and preserve more foreground points, the feature distance measurement approach [12] weights local point density and foreground score.
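For illustration, a minimal PyTorch sketch of the max-pooled MLP aggregation above is given below; the layer sizes and tensor shapes are illustrative assumptions rather than the exact O2SAT configuration.

```python
import torch
import torch.nn as nn

class SetAbstractionPool(nn.Module):
    """Sketch of P = gamma(max_i h(f_i)) over one point group.

    h(.) and gamma(.) are MLPs; channel sizes are illustrative assumptions."""
    def __init__(self, in_dim=3, hidden_dim=64, out_dim=128):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim))
        self.gamma = nn.Sequential(nn.Linear(hidden_dim, out_dim), nn.ReLU())

    def forward(self, group):            # group: (B, k, in_dim) points per center
        per_point = self.h(group)        # (B, k, hidden_dim)
        pooled, _ = per_point.max(dim=1) # symmetric max over the k grouped points
        return self.gamma(pooled)        # (B, out_dim) center feature P


# usage: 16 centers, each with k = 32 neighboring points
features = SetAbstractionPool()(torch.randn(16, 32, 3))
```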

3.3. Object-Oriented Segmentation

The OOS module employs a deep Hough transform to segment the road surface from the 3D LiDAR point cloud data, minimizing spatial limitations for object detection. By accurately identifying and delineating road surfaces, this module provides a foundational understanding of the environment for subsequent object detection tasks. The object coordinate system undergoes segmentation, where segments are mapped to locations in the Cartesian system. The deep Hough transform identifies the real object (car) points in each segment. Notably, as depicted in Figure 1a, 3D boxes farther from the observer present a greater challenge for localization. The subsequent object-aware sampling process generates references for each center-point feature corresponding to each object, completing the discrimination between road points and object points.

3.3.1. Road-Aware Orientation

In contrast to other methods, we present a road-aware orientation branch which predicts not only the horizontal angle ($\theta_x$) but also the additional, commonly overlooked vertical ($\theta_y$) and depth ($\theta_z$) angles. This extension is straightforward due to the difference in data distribution. We apply this approach to discriminate the intersection area of an object and road. The object distribution in KITTI is illustrated in Figure 1, which includes 3D annotations and pose information. The dimensions ($w$, $l$, $h$) and the horizontal angle ($\theta_x$) roughly follow a mixed point distribution, facilitating regression. However, even with non-flat roads making up 10–15% of this dataset, the number of points with zero $\theta_y$ and $\theta_z$ is three to four orders of magnitude greater. Direct prediction of $\theta_y$ and $\theta_z$, like $\theta_x$, is affected by this imbalanced distribution and leads to mostly zero results because it treats non-zero points as noise. Additionally, the shape of points varies noticeably with $\theta_x$ but less so with $\theta_y$ and $\theta_z$, making it challenging to acquire an explicit relationship between these two angles and the road shape.
A distinct perspective of this approach is that objects in AV scenes are constrained by the road, which restricts the vertical and depth angles ($\theta_y$ and $\theta_z$). Thus, incorporating road-aware orientation can aid in the $\theta_y$ and $\theta_z$ regression. To achieve this, we incorporate several lightweight fully connected layers for object segmentation from road points, as shown in Figure 3a; they classify each terrain point as an object or road point. Each coarse center's feature $P_{cc}$ in $X_{cc}$ is fed to this segmentation branch, which computes the probability $p_g$ of ground slope via $p_g = \mathrm{Sigmoid}(\mathrm{MLPs}(P_{cc}))$.
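A minimal sketch of this segmentation branch is shown below, assuming a generic feature dimension; it mirrors $p_g = \mathrm{Sigmoid}(\mathrm{MLPs}(P_{cc}))$, with the layer widths chosen for illustration only.

```python
import torch
import torch.nn as nn

class RoadAwareOrientationHead(nn.Module):
    """Sketch of the lightweight fully connected branch that maps each
    coarse-center feature P_cc to a ground-slope probability p_g.
    Feature dimensions are illustrative assumptions."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlps = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 1))

    def forward(self, p_cc):                                 # p_cc: (N, feat_dim)
        return torch.sigmoid(self.mlps(p_cc)).squeeze(-1)    # p_g in (0, 1), shape (N,)
```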

3.3.2. Object-Aware Sampling Approach

The instance recall rate exhibited a noticeable decline after multiple random downsampling procedures, indicating a substantial reduction in foreground points. Furthermore, D-FPS and Feat-FPS demonstrated relatively higher instance recall rates in the initial stages. However, they also struggled to retain adequate foreground points in the final encoding layer. Consequently, the accurate detection of objects of interest, particularly small entities like pedestrians and cyclists, poses a significant challenge, given the scarcity of remaining foreground points. To address this issue, we seek to capitalize on the latent semantics of individual points, as the learned point features may encapsulate more comprehensive semantic insights with the progression of hierarchical aggregation across layers. Building upon this concept, we introduce a task-specific object-aware sampling approach, integrating foreground semantic labels into the network training frameworks, as shown in Algorithm 1.
Algorithm 1 Object-oriented segmentation based on road-aware orientation
  •   Input: LiDAR point cloud P = { p 1 , p 2 , , p n }
  •   Output: Identified road points
  •   Step 1: Collect one frame of LiDAR points P = { p 1 , p 2 , , p n }
  •   Step 2: Determine hyperparameters δ , σ , l in the sparse kernel function from prior data
  •   Step 3: Set other parameters: k train , H train , T min , T max
  •   Step 4: Project points to polar coordinates
  •   Step 5: Identify the lowest points in each bin and initialize seed with them
  •   Step 6: Find true road points within seed using height threshold and radius
  •   Return: Road points for further processing
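A simplified sketch of Algorithm 1 is given below; it omits the sparse-kernel refinement of Steps 2 and 3, and the bin count and height threshold are assumed values rather than the tuned hyperparameters.

```python
import numpy as np

def segment_road_points(points, num_bins=360, height_thresh=0.3):
    """Illustrative sketch of Algorithm 1 (hyperparameter values are assumptions).

    points: (N, 3) array of x, y, z LiDAR coordinates.
    Returns a boolean mask marking the points treated as road."""
    # Step 4: project to polar coordinates (azimuth bins around the sensor)
    azimuth = np.arctan2(points[:, 1], points[:, 0])
    bins = np.floor((azimuth + np.pi) / (2 * np.pi) * num_bins).astype(int)
    bins = np.clip(bins, 0, num_bins - 1)

    road_mask = np.zeros(len(points), dtype=bool)
    for b in range(num_bins):
        idx = np.where(bins == b)[0]
        if idx.size == 0:
            continue
        # Step 5: lowest point in the bin initializes the road seed
        seed_height = points[idx, 2].min()
        # Step 6: true road points lie within a height threshold of the seed
        road_mask[idx] = points[idx, 2] < seed_height + height_thresh
    return road_mask
```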
The derived coarse centers $F_{cc}$, corresponding to every center point in $P_{cc}$, are refined by predicting offsets (depicted as brown dots in Figure 3a). We also incorporate a set abstraction layer to extract a feature $f_{cc} \in \mathbb{R}^{c_2}$ for each coarse center $X_{cc}$.
c_g = \gamma\big(\theta_g^x > t_{\theta_x} \ \text{or}\ \theta_g^y > t_{\theta_y}\big)
where $c_g$ represents the road segmentation feature, $\theta_t^y$ and $\theta_t^z$ denote the object regression targets, and each object center point is given by $\theta_t^x$. Specifically, two MLPs were added after the encoding stages to predict the semantic classes of individual points. The point-wise one-hot-encoded semantic labels derived from the initial bounding box annotations are employed for guidance and supervision during this process.
\theta_t^x = \frac{\theta_g^x - t_{\theta_x}}{\pi/2}, \qquad \theta_t^y = \frac{\theta_g^y - t_{\theta_y}}{\pi/2}
where $t_{\theta_x}$ and $t_{\theta_y}$ denote thresholds, and $\gamma$ denotes the indicator function that ascertains the classification of road and object points. The orientation branch assists the network in leveraging the segmentation outcomes and recognizing the object-road relationship, as shown in Figure 1b. The ultimate predictions $\theta_y$ and $\theta_z$ are obtained through association with $p_g$:
\theta_a = \begin{cases} \theta_a^x, & \text{if } p_g > 0.5 \\ 0, & \text{otherwise} \end{cases} \qquad \text{for } a \in \{x, y\}
where $\theta_a$ is decoded from the prediction branch output $\theta_a^x$. During the inference phase, the points with the highest $k$ foreground scores are selected as representative points, which are then fed into the subsequent spatial-attention-based encoding layers. As depicted in Figure 3a, this approach conserves more foreground points, resulting in a heightened instance recall ratio.
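The top-$k$ foreground selection and the $p_g$-gated angle decoding above can be sketched as follows; tensor shapes and function names are illustrative assumptions.

```python
import torch

def object_aware_sample(features, foreground_scores, k=512):
    """Keep the k points with the highest foreground scores as representative
    points for the subsequent spatial-attention encoding layers (sketch).

    features: (N, C), foreground_scores: (N,)."""
    _, idx = torch.topk(foreground_scores, k)          # indices of the kept points
    return features[idx], idx

def gate_orientation(theta_pred, p_g, threshold=0.5):
    """p_g-gated decoding of the vertical/depth angles: predictions are kept
    only where the road-slope probability exceeds the threshold (sketch)."""
    return torch.where(p_g > threshold, theta_pred, torch.zeros_like(theta_pred))
```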

3.4. Spatial Attention

3.4.1. Spatial Feature Encoding

The embedded point characteristics are introduced into the multi-head self-attention layer to capture intricate contextual connections. This is achieved through a residual architecture with a feed-forward network (FFN) that learns the inter-point dependencies within proposals to refine the point attributes. As illustrated in Figure 3b, this internal self-attention encoding mechanism closely resembles the original transformer encoder from natural language processing, except that no position embedding is needed, since positional information is already inherent in the point characteristics.
The multi-head self-attention layer computes multiple query, key, and value matrices using different learned linear projections [11]. Representing the embedded point features with dimensionality $D$ by $X = [p_1^T, \dots, p_N^T]^T \in \mathbb{R}^{N \times D}$, we define $Q = O_q P$, $K = O_k P$, and $V = O_v P$, where $O_q, O_k, O_v \in \mathbb{R}^{N \times N}$ represent linear projections. Let $W_Q^i$, $W_K^i$, and $W_V^i$ denote the weight matrices for the query, key, and value projections of the $i$-th attention head, respectively.
For the $i$-th attention head, the query matrix $Q_i$, key matrix $K_i$, and value matrix $V_i$ are computed as follows:
Q_i = X W_Q^i, \qquad K_i = X W_K^i, \qquad V_i = X W_V^i.
Then, the scaled dot-product attention mechanism is applied independently for each attention head:
SE(Q_m, K_m, V_m) = \sigma\!\left(\frac{Q_m K_m^T}{\sqrt{D}}\right) V_m, \qquad m = 1, \dots, M
where σ ( · ) represents the softmax function. Subsequently, through a straightforward FFN and residual mechanism, the outcome is as follows:
SE^{(enc)}(X) = \mathcal{N}\Big(F\big(\mathcal{N}\big([\,SE(Q_1, K_1, V_1), \dots, SE(Q_M, K_M, V_M)\,]\big)\big)\Big)
where $\mathcal{N}(\cdot)$ signifies the addition-and-normalization function, and $F(\cdot)$ denotes the FFN composed of two linear layers with a single ReLU activation. A series of three such attention-based encoders is aligned within O2SAT.
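A compact sketch of such an encoder block follows; the head count and dimensions are assumptions, and PyTorch's built-in multi-head attention is used as a stand-in for a hand-rolled implementation.

```python
import torch
import torch.nn as nn

class SpatialFeatureEncoder(nn.Module):
    """Sketch of one encoder block: multi-head self-attention over point features
    followed by a residual FFN, without positional embeddings.
    Dimensions and head count are illustrative assumptions."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(),
                                 nn.Linear(dim * 2, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, dim) embedded point features
        attn_out, _ = self.attn(x, x, x)        # scaled dot-product attention per head
        x = self.norm1(x + attn_out)            # add & normalize
        return self.norm2(x + self.ffn(x))      # residual FFN, then add & normalize


# a stack of three such encoders, matching the series of triad encoders noted above
encoder = nn.Sequential(*[SpatialFeatureEncoder() for _ in range(3)])
out = encoder(torch.randn(2, 512, 256))
```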

3.4.2. Spatial Feature Reweighting

In an alternate feature reweighting approach, we propose leveraging attention mechanisms to dynamically adjust the importance of different channels of the key embeddings $\hat{K}_m$ based on their spatial relationships. This approach aims to enhance the discriminative power of the decoding weights while preserving global information aggregation. Firstly, we introduce spatial-aware attention weights that consider the spatial context of each channel within the key embeddings. This is achieved by computing attention weights for each channel individually using the query embedding $\hat{q}_m$ and the corresponding channel of the key embeddings, as illustrated in Figure 4a.
\alpha_i = \sigma\!\left(\frac{\hat{q}_m \cdot \hat{k}_m^{i\,T}}{\sqrt{D}}\right)
where $\hat{k}_m^i$ represents the $i$-th channel of the key embeddings $\hat{K}_m$ obtained via the encoder output projection. Each element of the vector $\hat{q}_m \cdot \hat{k}_m^{i\,T}$ performs the aggregation of global points, and the subsequent softmax function assigns the decoding value for each point based on the probability in the normalized vector. Consequently, the decoding weight vector values are derived from straightforward global aggregation, overlooking the local channel-wise modeling that is critical for learning the 3D surface structures of point clouds, given the strong geometric relationships exhibited by different channels.
Next, we use these spatial-aware attention weights to reweight the channel-wise contributions to the decoding weights. Accentuating spatial information for the key embeddings $\hat{K}_m$ involves computing a decoding weight vector over the points for each of the $D$ channels of $\hat{K}_m$, yielding $D$ decoding values. These $D$ decoding values are then merged via a linear projection into a unified decoding vector in a channel-wise manner. As depicted in Figure 4b, our spatial reweighting approach for the channel-wise reweighting vector is given as
R_m = \sum_{i=1}^{D} \alpha_i \cdot \hat{K}_m^{i\,T}, \qquad m = 1, \dots, M
where the weight assigned to each channel of the key embeddings corresponds to its spatial significance, and the aggregate of these weighted channels forms the final decoding weight vector per attention head. The symbol $\alpha_i$ denotes a linear transformation compressing the $D$ decoding values into a scalar for reweighting, while $\hat{\sigma}(\cdot)$ calculates the softmax across the $N$ dimensions. Nevertheless, the decoding weights derived from $\hat{\sigma}(\cdot)$ are channel-specific, overlooking holistic point aggregation, as depicted in Figure 4b.
We initially disseminate spatial information into each channel by repeating the matrix product of query embedding and key embeddings, followed by element-wise multiplication to preserve channel distinctions.
R_m = s \cdot \hat{\sigma}\!\left(\frac{\rho(\hat{q}_m \hat{K}_m^T) \odot \hat{K}_m^T}{\sqrt{D}}\right), \qquad m = 1, \dots, M
where $\rho(\cdot)$ represents a repeat operator that transforms $\mathbb{R}^{1 \times N}$ into $\mathbb{R}^{D \times N}$ and $\odot$ denotes element-wise multiplication. This approach ensures the preservation of global information relative to the channel-wise reweighting scheme while enriching local and intricate channel-wise decoding compared to Equation (8). Consequently, the final decoded vector representation is given as follows:
y = [\,R_1 \cdot \hat{V}_1, \dots, R_M \cdot \hat{V}_M\,]
where the value embeddings $\hat{V}$ are the linear projections obtained from $\hat{X}$.
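The channel-wise spatial reweighting described above can be sketched for a single attention head as follows; the tensor shapes and the final channel-wise value aggregation are assumptions made for illustration rather than the exact decoding used in O2SAT.

```python
import torch
import torch.nn.functional as F

def spatial_reweight_decode(q_hat, K_hat, V_hat, scale=1.0):
    """Sketch of channel-wise spatial reweighting for one attention head.

    q_hat: (D,) query embedding; K_hat, V_hat: (N, D) key/value embeddings."""
    D = K_hat.shape[1]
    global_weights = q_hat @ K_hat.t()                  # (N,) global point aggregation
    spread = global_weights.unsqueeze(0).expand(D, -1)  # rho(.): repeat to (D, N)
    logits = spread * K_hat.t() / D ** 0.5              # element-wise, keeps channel distinctions
    R = scale * F.softmax(logits, dim=1)                # (D, N) channel-wise decoding weights
    return (R * V_hat.t()).sum(dim=1)                   # (D,) decoded vector for this head


# usage for one head with N = 256 points and D = 128 channels
y_m = spatial_reweight_decode(torch.randn(128), torch.randn(256, 128), torch.randn(256, 128))
```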

3.5. 3D Detection Head

In the preceding steps, the input point features undergo aggregation to yield a $D_{agg}$-dimensional vector, denoted as $\mathbf{v}$, which is subsequently processed by two FFNs for confidence prediction ($c_{pred}$) and the 3D bounding box ($r_{box}$) corresponding to the input 3D proposal, respectively. The training target for confidence prediction is established as the 3D intersection over union (IoU) between the 3D proposals and their corresponding ground truth boxes. The confidence prediction target ($c_{pred}^t$) is computed using the following equation:
c_{pred}^t = \min\!\left(1,\ \max\!\left(0,\ \frac{\mathrm{IoU} - \theta_B}{\theta_F - \theta_B}\right)\right)
Here, $\theta_B$ and $\theta_F$ represent the IoU thresholds for background and foreground, respectively.
Additionally, the regression targets (superscript t) are encoded based on the proposals and their corresponding ground truth (superscript g) boxes, as expressed below:
r_{box}^t:\quad v^t = \frac{v^g - v^c}{d}, \quad h^t = \log\frac{h^g}{h^c}, \quad w^t = \log\frac{w^g}{w^c}, \quad l^t = \log\frac{l^g}{l^c}, \quad y^t = \frac{y^g - y^c}{d}, \quad x^t = \frac{x^g - x^c}{d}, \quad \theta^t = \theta^g - \theta^c,
where $d = \sqrt{(l^c)^2 + (w^c)^2}$ represents the proposal box diagonal.
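A small sketch of the confidence and box regression targets above is given below; the IoU thresholds are assumed values, and `v` is interpreted here as the vertical coordinate of the box center, which is an assumption about the notation.

```python
import math

def confidence_target(iou, theta_b=0.25, theta_f=0.75):
    """Map the proposal-ground-truth IoU into [0, 1] between the background and
    foreground thresholds (threshold values are illustrative assumptions)."""
    return min(1.0, max(0.0, (iou - theta_b) / (theta_f - theta_b)))

def box_regression_targets(gt, proposal):
    """Encode a ground-truth box against a proposal box (sketch).
    Each box is a dict with keys x, y, v, w, l, h, theta."""
    d = math.sqrt(proposal["l"] ** 2 + proposal["w"] ** 2)   # proposal box diagonal
    return {
        "x": (gt["x"] - proposal["x"]) / d,
        "y": (gt["y"] - proposal["y"]) / d,
        "v": (gt["v"] - proposal["v"]) / d,
        "w": math.log(gt["w"] / proposal["w"]),
        "l": math.log(gt["l"] / proposal["l"]),
        "h": math.log(gt["h"] / proposal["h"]),
        "theta": gt["theta"] - proposal["theta"],
    }
```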

4. Loss

Several branch loss factors make up the total training loss of bounding boxes { b 3 D } :
L_{\theta_{x,y}} = \frac{1}{N_p} \sum \mathbb{1}(c > 0)\, L_{\text{segment}}(c_g, p_g) + \frac{1}{N_s} \sum_{a \in \{x, y\}} \mathbb{1}(c_g > 0)\, L_{\text{regress}}(\theta_t^a, \hat{\theta}_a)
L_{\theta_z} = \frac{1}{N_p} \sum \mathbb{1}(c > 0)\, L_{\text{cls}}(c_{\theta_z}, \hat{c}_{\theta_z}) + L_{\text{regress}}(\Delta\theta_t^z, \Delta\hat{\theta}_z)
L_{\text{box}} = L_{\text{dim}} + L_{\text{cls}} + L_{\theta_{x,y}} + L_{\text{pos}} + L_{\theta_z}
Here, the regression loss $L_{\text{regress}}$ employs the smooth-L1 loss. For road point segmentation, $L_{\text{segment}}$ is the focal loss, while $L_{\text{cls}}$ is the cross-entropy loss.
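A minimal sketch of the orientation-branch term $L_{\theta_{x,y}}$ follows; torchvision's focal-loss implementation is used here as a stand-in for the focal loss named above, and the masking and normalization are simplified assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def orientation_branch_loss(seg_logits, c_g, theta_pred, theta_target):
    """Sketch: focal loss for road-point segmentation plus smooth-L1 regression
    on the angles of points classified as objects (c_g > 0)."""
    seg_loss = sigmoid_focal_loss(seg_logits, c_g.float(), reduction="mean")
    mask = c_g > 0                                       # regress only object points
    if mask.any():
        reg_loss = F.smooth_l1_loss(theta_pred[mask], theta_target[mask])
    else:
        reg_loss = torch.zeros((), device=seg_logits.device)
    return seg_loss + reg_loss
```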

5. Experiment and Results

5.1. Datasets

KITTI is the primary benchmark for assessing the efficacy of O2SAT in the context of 3D object detection within flat scenes [65]. There are 7481 training and 7518 test samples in the KITTI dataset. We further partition the training dataset into 3712 training and 3769 validation split samples. We adopt the strategy proposed in [46] to create the train-val split for KITTI test server submissions. To address the limitations of existing datasets, which lack non-flat scenes, we utilized SlopedKITTI, which is constructed based on KITTI's val split and incorporates synthesized pseudo-slope information. SlopedKITTI includes annotations for slope and full pose, enabling the evaluation of O2SAT's performance in non-flat environments. Even with the addition of non-flat urban landscapes, SlopedKITTI retains the original KITTI's appearance. We assess performance in two additional settings to ensure a fair comparison and showcase O2SAT's generalization capabilities.

5.2. Metrics

Like the official KITTI evaluation protocol, all results undergo assessment using average precision (AP), computed at 11 recall positions for the validation split set and 40 recall positions for the test split set. A true positive is determined using an IoU threshold of 0.7 in vehicle detection. For 3D and bird’s-eye view (BEV) detection, rotated BEV 3D IoU and rotated BEV IoU are utilized.
However, because the IoU calculations are based on BEV, accurately describing the relationship between two full-pose bounding boxes in 3D space becomes challenging. In alignment with the approach in NDS [66], we introduce an additional metric known as a rotated 3D metric. This metric uses a 1.0 m distance threshold from the center instead of the IoU threshold 0.7 and is given as APcd.
For true positives, we extend the evaluation by computing additional metrics such as average scale score (ASS), average translation score (ATS), and average full orientation score (AOS) based on their respective error items. The composite rotation object detection score (RODS) is then calculated as 3APcd + ATS + ASS + AOS, providing a comprehensive measure of performance in rotated object detection.
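The composite score above reduces to a simple weighted sum; a one-line sketch is shown below, with no additional normalization assumed beyond what is stated.

```python
def rods_score(ap_cd, ats, ass, aos):
    """Composite rotation object detection score as stated above:
    RODS = 3 * APcd + ATS + ASS + AOS."""
    return 3 * ap_cd + ats + ass + aos
```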

5.3. Implementation Details

Our O2SAT implementation is built upon the OpenPCDet [12] toolbox and executed on a single RTX 2080 Ti GPU.

5.3.1. Network Architecture

We initially sample 16,384 randomly shuffled points from each input raw point cloud. In the backbone, we progressively downsample the points using three layers, resulting in 4096, 1024, and 512 sampled points. The output feature channels of the set abstraction layers are set to 64, 128, and 256, respectively. Next, we choose $n_2 = 256$ representative points as the starting center points from the $N_1 = 512$ representative points acquired by the previous downsampling layer. These points' features are fed into (256, 128, 3) MLPs for offset predictions, while the detection head comprises shared MLPs with 512 and 256 channels. The parameter $N_{\theta_z}$ is set to 12, and the threshold values $t_{\theta_x}$ and $t_{\theta_y}$ are both 10 degrees. During the testing phase, we apply non-maximum suppression.
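The settings above can be summarized in a single configuration sketch; the dictionary keys below are illustrative assumptions and do not reflect the actual OpenPCDet configuration schema.

```python
# Illustrative configuration mirroring the architecture described above.
O2SAT_BACKBONE_CFG = {
    "num_input_points": 16384,          # randomly shuffled raw points per cloud
    "downsample_points": [4096, 1024, 512],
    "sa_output_channels": [64, 128, 256],
    "num_center_points": 256,           # n2 representative points from N1 = 512
    "offset_mlps": [256, 128, 3],       # MLPs for center offset prediction
    "detection_head_mlps": [512, 256],  # shared MLPs in the detection head
    "num_theta_z_bins": 12,             # N_theta_z
    "angle_thresholds_deg": {"t_theta_x": 10, "t_theta_y": 10},
}
```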

5.3.2. Training

The proposed O2SAT model is trained for 8 h with a batch size of 4 on the KITTI train split, adopting a one-cycle learning rate policy with an initial rate of 0.1. Object-oriented segmentation-based data augmentation is employed to enhance the diversity of the training data, introducing foreground objects randomly sampled from other scenes. The proposed Object-Oriented Segmentation module is integrated into the training process with a probability of $p_s = 0.1$, contributing to the augmentation strategies. These strategies include global flipping, scaling, and rotation, which are applied to augment the dataset. It is important to note that, except for the test set, all models undergo training on the train split and are subsequently validated on the val split.

5.4. Main Results

We conducted a comprehensive quantitative comparison with state-of-the-art techniques to assess the effectiveness of our approach in terrains with varying road conditions. The models were uniformly trained on the KITTI train split and evaluated on the KITTI val split. Visualizations of selected results are depicted in Figure 5. As outlined in our summarized findings in Table 1, Table 2, Table 3 and Table 4, our proposed method showcases significant performance improvements across all metrics in normal and uneven road scenes, surpassing previous methods by a notable margin. Specifically, in the domain of 3D object detection, in Table 2, O2SAT exhibits noteworthy enhancements, reaching an mAP of 81.25% and easy, moderate, and hard average precisions (APs) for cars of 88.72%, 80.02%, and 74.11%, respectively, with the PointRCNN backbone. O2SAT also demonstrates increases of 7.14% in mAP and 6.14%, 5.71%, and 5.12% in the easy, moderate, and hard modes with the PointPillars backbone. Moreover, our method demonstrates increases of 4.84% in mAP and 4.35%, 5.20%, and 4.08% in AP for cars using the PV-RCNN backbone.
Similarly, in the domain of BEV detection, in Table 1, O2SAT exhibits noteworthy enhancements, reaching an mAP of 89.94% and easy, moderate, and hard average precisions for cars of 93.86%, 89.25%, and 84.53% with the PointRCNN backbone. O2SAT also demonstrates increases of 5.18% in mAP and 5.51%, 3.15%, and 4.70% in the easy, moderate, and hard modes with the PointPillars backbone. Moreover, our method demonstrates increases of 2.53% in mAP and 1.73%, 1.86%, and 1.81% in AP for cars using the PV-RCNN backbone.
Table 1, Table 2, Table 3 and Table 4 further highlight the superiority of point-based backbones over voxel-based methods in both 3D object detection and BEV detection, owing to their ability to detect objects wherever points exist, without a detection range limitation. Table 3 reports the road points score of voxel-based and point-based methods and empirically shows that point-based methods perform better in true positive detection due to a better road point understanding than voxel grids. Furthermore, the sustainable computing ability is evident, as our method requires fewer computational resources than the other methods, as shown in Table 5. Despite this, all methods demonstrate comparatively lower performance in 3D AP, attributed to the neglected road point intersection prediction, resulting in low mean error.
It is noteworthy in Figure 5 that our point-based method O2SAT yielded numerous false positives on slopes. Our method also precisely predicts object properties such as the union and intersection contrasts of ground and vehicle points, slopes, and some unlabeled vehicles. Table 1, Table 2, Table 3 and Table 4 and Figure 1 and Figure 5 additionally investigate the impact of separating road points from object points on detector performance and report results on KITTI and SlopedKITTI. Accurately detecting objects in non-flat scenes and with full poses contributes to improved and robust 3D object detection in autonomous driving.

5.5. Effectiveness

We assess and verify the potency of the proposed study and its qualitative performance using the KITTI benchmark. We trained all models on the KITTI dataset, and selected outcomes are illustrated in Figure 1 and Figure 5. In alignment with previous findings, voxel-based methods encounter challenges in detecting objects at road-wise intersections, and prior point-based methods also struggle to achieve accurate object detection in such scenarios. Conversely, our proposed method excels in accurately detecting objects and predicting their full poses, showcasing its ability to handle real slopes effectively through training with object-oriented segmentation.
To gain insights into the knowledge acquired by the network, we conducted an extended experiment on SlopedKITTI. We explore whether the road orientation influences the horizontal and vertical predictions, whereas height is directly affected by the object itself. When the vehicle is at a distance and difficult to distinguish from road points in the LiDAR input, the anticipated posture reliably corresponds with the normal of the local road points, irrespective of the actual poses. In contrast, for flat roads the predicted boxes vary with the local road points. These experimental findings confirm that our proposed O2SAT approach accurately predicts the horizontal and vertical point intersection between an object and the road by considering the ground, highlighting a fundamental distinction from height prediction.

5.6. Ablation Studies

The outcomes presented in Table 1, Table 2, Table 3 and Table 4 detail the impact of each component on overall performance. The empirical results illustrate that object-oriented segmentation significantly enhances detection performance when combined with the self-attention-based channel-wise reweighting approach. The subsequent tables delve into the influence of OOS and SFR on predicting horizontal and vertical point orientation in R3D, performing dynamic adjustment of object feature weights through attention mechanisms that enhance long-range interactions and contextual feature discrimination for noise suppression, thereby reducing horizontal and vertical orientation errors. Since vehicles typically exhibit merging patterns with road points due to horizontal and vertical orientation, these errors wield a more substantial impact on IoU. Consequently, there is a 7.14% and 4.84% increase in 3D average precision (AP) compared to PointPillars and PV-RCNN, respectively.
In Table 6, Table 7 and Table 8, we offer further experimental data that are consistent with these experiments. We carry out identical tests on two other backbones, PointPillars and PV-RCNN, and find comparable patterns of gains resulting from learning with the segmentation loss and from constructing the OOS to correct the bias in road segmentation estimation using SFR and R3D. The abbreviations are explained in Table 9. We note that in the Table 6 results, both competitive backbones are shown with easy, moderate, and hard settings. Another observation differs from that in Table 7, where PointPillars and PV-RCNN are evaluated with each O2SAT module. We further evaluate various numbers of input points, as PointPillars and PV-RCNN accept sparse to dense inputs, as shown in Table 8.

6. Conclusions

This research focuses on enhancing LiDAR-based 3D object detection for AVs in complex urban traffic environments, specifically emphasizing object-oriented segmentation aided by spatial attention mechanisms. The proposed O2SAT introduces efficient modules to augment the accuracy and efficiency of 3D object detection networks, thereby bolstering the robustness of autonomous vehicle navigation. Highlighting the significance of object-oriented segmentation and spatial attention within 3D object detection pipelines, especially in challenging scenarios like sloped roads and intersections, the O2SAT approach integrates three key modules. Firstly, it employs advanced techniques to accurately segment road surfaces from 3D object data, effectively removing irrelevant points and outliers. This is further complemented by object-aware sampling, resulting in substantial improvements in model performance. Additionally, the SFR module leverages self-attention mechanisms to embed contextual information into each point feature, enhancing feature representation and noise suppression. Coupled with the Road-Aware 3D Detection Head, this facilitates precise orientation prediction and comprehensive 3D proposal generation, thereby enhancing overall object detection performance. By addressing the limitations associated with non-flat roads in existing methods, the O2SAT approach ensures accurate road segmentation and facilitates long-range object feature dependencies while minimizing computational complexity. Experimental validation across diverse datasets, including KITTI and SlopedKITTI, showcases the method's robustness and superior performance across various terrains and urban scenarios. Furthermore, the study contributes to advancing environmental perception systems in autonomous vehicles, focusing on optimizing 3D object detection. The proposed approach seamlessly integrates into existing point-based frameworks, offering a plug-and-play solution adaptable for practical implementations. In the future, one promising direction is integrating multi-modal data fusion, incorporating additional sensors such as cameras and radar. This would enhance the robustness and accuracy of object detection by leveraging complementary data sources, particularly in scenarios where LiDAR data alone may be insufficient, such as in adverse weather conditions or with reflective surfaces.

Author Contributions

Conceptualization, H.M., X.D. and M.A.; methodology, H.M.; software, H.M. and B.H.M.; validation, H.M., X.D., M.A. and B.H.M.; formal analysis, H.M. and I.U.; investigation, H.M. and I.U.; resources, H.M. and X.D.; data curation, H.M. and M.A.; writing—original draft preparation, H.M.; writing—review and editing, H.M. and X.D.; visualization, H.M., B.H.M. and I.U.; supervision, X.D.; project administration, H.M. and X.D.; funding acquisition, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China Project (62172441, 62172449); the Local Science and Technology Developing Foundation Guided by the Central Government of China (Free Exploration project 2021Szvup166); the Opening Project of State Key Laboratory of Nickel and Cobalt Resources Comprehensive Utilization (GZSYS-KY-2022-018, GZSYS-KY-2022-024); Key Project of Shenzhen City Special Fund for Fundamental Research (202208183000751); and the National Natural Science Foundation of Hunan Province (2023JJ30696).

Data Availability Statement

The dataset created and examined in the present study can be accessed from the KITTI 3D object detection repository (https://www.cvlibs.net/datasets/kitti/evalobject.php?objbenchmark=3d, accessed on 28 October 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Y.; Ma, L.; Zhong, Z.; Liu, F.; Chapman, M.A.; Cao, D.; Li, J. Deep Learning for LiDAR Point Clouds in Autonomous Driving: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3412–3432. [Google Scholar] [CrossRef]
  2. Mukhtar, A.; Xia, L.; Tang, T.B. Vehicle Detection Techniques for Collision Avoidance Systems: A Review. IEEE Trans. Intell. Transp. Syst. 2015, 16, 2318–2338. [Google Scholar] [CrossRef]
  3. Ye, Y.; Fu, L.; Li, B. Object detection and tracking using multi-layer laser for autonomous urban driving. In Proceedings of the 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Rio de Janeiro, Brazil, 1–4 November 2016; pp. 259–264. [Google Scholar] [CrossRef]
  4. Guojun, W.; Wu, J.; He, R.; Yang, S. A Point Cloud Based Robust Road Curb Detection and Tracking Method. IEEE Access 2019, 7, 24611–24625. [Google Scholar] [CrossRef]
  5. Dieterle, T.; Particke, F.; Patino-Studencki, L.; Thielecke, J. Sensor data fusion of LIDAR with stereo RGB-D camera for object tracking. In Proceedings of the 2017 IEEE SENSORS, Glasgow, UK, 29 October–1 November 2017; pp. 1–3. [Google Scholar]
  6. Zhao, C.; Fu, C.; Dolan, J.M.; Wang, J. L-Shape Fitting-Based Vehicle Pose Estimation and Tracking Using 3D-LiDAR. IEEE Trans. Intell. Veh. 2021, 6, 787–798. [Google Scholar] [CrossRef]
  7. Li, Y.; Ibanez-Guzman, J. Lidar for Autonomous Driving: The Principles, Challenges, and Trends for Automotive Lidar and Perception Systems. IEEE Signal Process. Mag. 2020, 37, 50–61. [Google Scholar] [CrossRef]
  8. Sualeh, M.; Kim, G.W. Dynamic Multi-LiDAR Based Multiple Object Detection and Tracking. Sensors 2019, 19, 1474. [Google Scholar] [CrossRef]
  9. Kim, D.; Jo, K.; Lee, M.; Sunwoo, M. L-Shape Model Switching-Based Precise Motion Tracking of Moving Vehicles Using Laser Scanners. IEEE Trans. Intell. Transp. Syst. 2018, 19, 598–612. [Google Scholar] [CrossRef]
  10. Jin, X.; Yang, H.; Li, Z. Vehicle Detection Framework Based on LiDAR for Autonoumous Driving. In Proceedings of the 2021 5th CAA International Conference on Vehicular Control and Intelligence (CVCI), Tianjin, China, 29–31 October 2021; pp. 1–5. [Google Scholar] [CrossRef]
  11. Mushtaq, H.; Deng, X.; Ali, M.; Hayat, B.; Raza Sherazi, H.H. DFA-SAT: Dynamic Feature Abstraction with Self-Attention-Based 3D Object Detection for Autonomous Driving. Sustainability 2023, 15, 13667. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  13. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  14. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  15. Zhou, Y.; Sun, P.; Zhang, Y.; Anguelov, D.; Gao, J.; Ouyang, T.; Guo, J.; Ngiam, J.; Vasudevan, V. End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Proceedings of the Conference on Robot Learning, Virtual, 16–18 November 2020; pp. 923–932. [Google Scholar]
  16. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-Net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  17. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  18. Huang, W.; Liang, H.; Lin, L.; Wang, Z.; Wang, S.; Yu, B.; Niu, R. A Fast Point Cloud Ground Segmentation Approach Based on Coarse-To-Fine Markov Random Field. IEEE Trans. Intell. Transp. Syst. 2022, 23, 7841–7854. [Google Scholar] [CrossRef]
  19. Chu, P.M.; Cho, S.; Park, J.; Fong, S.; Cho, K. Enhanced Ground Segmentation Method for Lidar Point Clouds in Human-Centric Autonomous Robot Systems. Hum.-Centric Comput. Inf. Sci. 2019, 9, 17. [Google Scholar] [CrossRef]
  20. Qi, C.R.; Litany, O.; He, K.; Guibas, L. Deep hough voting for 3D object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  21. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  22. Yang, Z.; Wang, L. Learning relationships for multi-view 3D object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  23. Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vis. 2023, 131, 531–551. [Google Scholar]
  24. Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-voxel cnn for efficient 3d deep learning. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  25. Jiang, T.; Song, N.; Liu, H.; Yin, R.; Gong, Y.; Yao, J. Vic-net: Voxelization information compensation network for point cloud 3d object detection. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13408–13414. [Google Scholar]
  26. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  27. Zhou, D.; Fang, J.; Song, X.; Guan, C.; Yin, J.; Dai, Y.; Yang, R. IoU Loss for 2D/3D Object Detection. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019. [Google Scholar] [CrossRef]
  28. Shi, H.; Hou, D.; Li, X. Center-Aware 3D Object Detection with Attention Mechanism Based on Roadside LiDAR. Sustainability 2023, 15, 2628. [Google Scholar] [CrossRef]
  29. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  30. Li, L.; Yang, F.; Zhu, H.; Li, D.; Li, Y.; Tang, L. An Improved RANSAC for 3D Point Cloud Plane Segmentation Based on Normal Distribution Transformation Cells. Remote Sens. 2017, 9, 433. [Google Scholar] [CrossRef]
  31. Miądlicki, K.; Pajor, M.; Saków, M. Ground plane estimation from sparse LIDAR data for loader crane sensor fusion system. In Proceedings of the 2017 22nd International Conference on Methods and Models in Automation and Robotics (MMAR), Międzyzdroje, Poland, 28–31 August 2017; pp. 717–722. [Google Scholar] [CrossRef]
  32. Narksri, P.; Takeuchi, E.; Ninomiya, Y.; Morales, Y.; Akai, N.; Kawaguchi, N. A Slope-robust Cascaded Ground Segmentation in 3D Point Cloud for Autonomous Vehicles. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 497–504. [Google Scholar] [CrossRef]
  33. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep Continuous Fusion for Multi-sensor 3D Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11220 LNCS. [Google Scholar] [CrossRef]
  34. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar] [CrossRef]
  35. Yang, B.; Luo, W.; Urtasun, R. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7652–7660. [Google Scholar]
  36. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  37. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  38. Zhao, X.; Liu, Z.; Hu, R.; Huang, K. 3D object detection using scale invariant and feature reweighting networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9267–9274. [Google Scholar]
  39. Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12460–12467. [Google Scholar]
  40. Li, S.; Geng, K.; Yin, G.; Wang, Z.; Qian, M. MVMM: Multi-View Multi-Modal 3D Object Detection for Autonomous Driving. IEEE Trans. Ind. Inform. 2023, 20, 845–853. [Google Scholar] [CrossRef]
  41. Noh, J.; Lee, S.; Ham, B. HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
  42. Liu, Z.; Zhao, X.; Huang, T.; Hu, R.; Zhou, Y.; Bai, X. TANet: Robust 3D object detection from point clouds with triple attention. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11677–11684. [Google Scholar] [CrossRef]
  43. Rukhovich, D.; Vorontsova, A.; Konushin, A. ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022. [Google Scholar] [CrossRef]
  44. Xu, W.; Hu, J.; Chen, R.; An, Y.; Xiong, Z.; Liu, H. Keypoint-Aware Single-Stage 3D Object Detector for Autonomous Driving. Sensors 2022, 22, 1451. [Google Scholar] [CrossRef] [PubMed]
45. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
  46. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  47. Guo, J.; Xing, X.; Quan, W.; Yan, D.M.; Gu, Q.; Liu, Y.; Zhang, X. Efficient Center Voting for Object Detection and 6D Pose Estimation in 3D Point Cloud. IEEE Trans. Image Process. 2021, 30, 5072–5084. [Google Scholar] [CrossRef]
  48. Chen, W.; Duan, J.; Basevi, H.; Chang, H.J.; Leonardis, A. PointPoseNet: Point Pose Network for Robust 6D Object Pose Estimation. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 2813–2822. [Google Scholar]
  49. Gao, G.; Lauri, M.; Wang, Y.; Hu, X.; Zhang, J.; Frintrop, S. 6D Object Pose Regression via Supervised Learning on Point Clouds. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3643–3649. [Google Scholar] [CrossRef]
  50. He, Y.; Sun, W.; Huang, H.; Liu, J.; Fan, H.; Sun, J. PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11629–11638. [Google Scholar] [CrossRef]
  51. Hagelskjær, F.; Buch, A.G. Pointvotenet: Accurate Object Detection And 6 DOF Pose Estimation In Point Clouds. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 October 2020; pp. 2641–2645. [Google Scholar] [CrossRef]
  52. Gao, G.; Lauri, M.; Hu, X.; Zhang, J.; Frintrop, S. CloudAAE: Learning 6D Object Pose Regression with On-line Data Synthesis on Point Clouds. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11081–11087. [Google Scholar] [CrossRef]
  53. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [PubMed]
  54. Douillard, B.; Underwood, J.; Kuntz, N.; Vlaskine, V.; Quadros, A.; Morton, P.; Frenkel, A. On the segmentation of 3D LIDAR point clouds. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 2798–2805. [Google Scholar] [CrossRef]
  55. Rummelhard, L.; Paigwar, A.; Nègre, A.; Laugier, C. Ground estimation and point cloud segmentation using SpatioTemporal Conditional Random Field. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017; pp. 1105–1110. [Google Scholar] [CrossRef]
  56. Xu, X.; Dong, S.; Xu, T.; Ding, L.; Wang, J.; Jiang, P.; Song, L.; Li, J. FusionRCNN: LiDAR-Camera Fusion for Two-Stage 3D Object Detection. Remote Sens. 2023, 15, 1839. [Google Scholar] [CrossRef]
57. Guo, M.H.; Cai, J.X.; Liu, Z.N.; Mu, T.J.; Martin, R.R.; Hu, S.M. PCT: Point Cloud Transformer. Comput. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
  58. Engel, N.; Belagiannis, V.; Dietmayer, K. Point transformer. IEEE Access 2021, 9, 134826–134840. [Google Scholar] [CrossRef]
  59. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  60. Lai, X.; Liu, J.; Jiang, L.; Wang, L.; Zhao, H.; Liu, S.; Qi, X.; Jia, J. Stratified transformer for 3d point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8500–8509. [Google Scholar]
61. Yan, X.; Zheng, C.; Li, Z.; Wang, S.; Cui, S. PointASNL: Robust Point Clouds Processing Using Nonlocal Neural Networks with Adaptive Sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
62. Chen, C.; Chen, Z.; Zhang, J.; Tao, D. SASA: Semantics-Augmented Set Abstraction for Point-Based 3D Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 221–229. [Google Scholar]
  63. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  64. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  65. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar] [CrossRef]
  66. Shuang, F.; Huang, H.; Li, Y.; Qu, R.; Li, P. AFE-RCNN: Adaptive Feature Enhancement RCNN for 3D Object Detection. Remote Sens. 2022, 14, 1176. [Google Scholar] [CrossRef]
  67. Nabhani, A.; Sjølie, H.K. TreeSim: An object-oriented individual tree simulator and 3D visualization tool in Python. SoftwareX 2022, 20, 101221. [Google Scholar] [CrossRef]
68. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-view Spatial Feature Fusion for 3D Object Detection. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Volume 12372 LNCS. [Google Scholar] [CrossRef]
Figure 1. Qualitative comparison. Detection results on a KITTI validation scene are shown for PV-RCNN and O 2 SAT in parts (a) and (b), respectively, visualized on the 3D point cloud. Ground-truth boxes are drawn in green and predicted bounding boxes in red. Note that the ground-truth boxes (encompassing object-class features) on the right side in part (c) lie farther from the LiDAR, making them more challenging to localize.
Figure 2. O 2 SAT framework overview. The road-aware orientation branch performs object-aware downsampling, and coarse center-point segments are used for object representation. A spatial-attention-based encoder–decoder module then exploits the local and global channel-wise attributes of the encoded points through channel-wise reweighting. Finally, a 3D detection backbone/head uses the center-point features and the optimized feature representations for bounding-box prediction with complete poses. Best viewed in color.
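To make the object-aware downsampling step summarized in Figure 2 concrete, the following is a minimal, hypothetical PyTorch sketch rather than the authors' implementation: points with high predicted objectness are kept preferentially over road points when sampling a fixed budget. The scoring input, the 70/30 budget split, and all names are illustrative assumptions.

```python
# Hypothetical sketch of object-aware downsampling (not the authors' code):
# keep likely object points preferentially, fill the rest of the budget with road points.
import torch

def object_aware_downsample(points: torch.Tensor, obj_scores: torch.Tensor, n_keep: int):
    """points: (N, 3 + C) LiDAR points; obj_scores: (N,) objectness in [0, 1]."""
    n_obj = int(0.7 * n_keep)                                  # assumed budget split favoring objects
    obj_idx = obj_scores.topk(min(n_obj, points.shape[0])).indices
    rest = torch.ones(points.shape[0], dtype=torch.bool)
    rest[obj_idx] = False                                      # mask out already-selected object points
    road_pool = torch.nonzero(rest).squeeze(1)
    fill = road_pool[torch.randperm(road_pool.numel())][: n_keep - obj_idx.numel()]
    return points[torch.cat([obj_idx, fill])]                  # (n_keep, 3 + C)

# Example: keep 2048 of 16,384 points, biased toward likely objects
pts, scores = torch.randn(16384, 4), torch.rand(16384)
sampled = object_aware_downsample(pts, scores, 2048)           # -> (2048, 4)
```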
Figure 3. (a) The ground-aware orientation branch and ground-candidate point selection modules. The ground-aware orientation module predicts the terrain classes, with the coarse centers supervised by the ground-truth boxes. The coarse center-point features are processed by shared MLPs and fed to the orientation head. (b) The multi-head self-attention layer captures intricate contextual connections and is followed by a residual FFN that learns inter-point dependencies within proposals and refines point attributes.
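The attention block in Figure 3b can be approximated, for illustration only, by a standard multi-head self-attention layer followed by a residual feed-forward network; the layer width, head count, and hidden size below are assumptions, not the paper's settings.

```python
# Hedged sketch of a self-attention + residual FFN block over per-proposal point features.
import torch
import torch.nn as nn

class PointAttentionBlock(nn.Module):
    def __init__(self, d_model=128, num_heads=4, d_ffn=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(inplace=True), nn.Linear(d_ffn, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (B, N_points, d_model) encoded point features
        attn_out, _ = self.attn(x, x, x)  # capture inter-point contextual dependencies
        x = self.norm1(x + attn_out)      # residual connection around attention
        x = self.norm2(x + self.ffn(x))   # residual FFN refines point attributes
        return x

# Example: refine 256 point features of width 128 for a batch of 2 proposals
refined = PointAttentionBlock()(torch.randn(2, 256, 128))   # -> (2, 256, 128)
```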
Figure 4. Illustration of decoding schemes. (a) Standard decoding approach; (b) spatial-attention reweighting approach.
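As a rough illustration of the reweighting idea in Figure 4b (a sketch under assumed layer sizes, not the authors' code), per-channel attention weights derived from global context rescale the decoded keypoint features before they reach the detection head, in contrast to the direct decoding of Figure 4a.

```python
# Illustrative channel-wise reweighting decoder: the gate predicts per-channel weights
# from pooled context and rescales the decoded features. Layer sizes are assumptions.
import torch
import torch.nn as nn

class ChannelReweightDecoder(nn.Module):
    def __init__(self, c_in=128):
        super().__init__()
        self.decode = nn.Linear(c_in, c_in)          # stand-in for the standard decoder path
        self.gate = nn.Sequential(                   # channel-wise attention weights in (0, 1)
            nn.Linear(c_in, c_in // 4), nn.ReLU(inplace=True),
            nn.Linear(c_in // 4, c_in), nn.Sigmoid()
        )

    def forward(self, x):                            # x: (B, N, C) keypoint features
        decoded = self.decode(x)
        weights = self.gate(x.mean(dim=1, keepdim=True))   # global context -> per-channel weights
        return decoded * weights                     # emphasize informative channels, suppress noise

reweighted = ChannelReweightDecoder()(torch.randn(2, 256, 128))  # -> (2, 256, 128)
```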
Figure 5. Visualization of object detection results in multiple views using the KITTI validation dataset. Green boxes represent predicted classifications, while red boxes indicate ground truth annotations.
Table 1. Quantitative assessment of object detection in bird’s-eye view through average precision (AP %) on the KITTI test set and comparative analysis with various methodologies.
| Category | Method | Mod | mAP | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Voxel | SECOND [36] | L | 81.80 | 88.07 | 79.37 | 77.95 | 55.10 | 46.27 | 44.76 | 73.67 | 56.04 | 48.78 |
| | PointPillars [37] | L | 84.76 | 88.35 | 86.10 | 79.83 | 58.66 | 50.23 | 47.19 | 79.19 | 62.25 | 56.00 |
| | PV-RCNN [21] | L | 87.41 | 92.13 | 87.39 | 82.72 | 54.77 | 46.13 | 42.84 | 82.56 | 67.24 | 60.28 |
| | VoxelNet [26] | L | 82.0 | 89.35 | 79.26 | 77.39 | 46.13 | 40.74 | 38.11 | 66.70 | 54.76 | 50.55 |
| Parallel | MV3D [14] | C + L | 78.45 | 86.62 | 78.93 | 69.80 | – | – | – | – | – | – |
| | Contfuse [33] | C + L | 85.10 | 94.07 | 85.35 | 75.88 | – | – | – | – | – | – |
| | PIXOR++ [35] | L | 83.68 | 89.38 | 83.70 | 77.97 | – | – | – | – | – | – |
| | MVMM [40] | C + L | 88.78 | 92.17 | 88.70 | 85.47 | 53.75 | 46.84 | 44.87 | 81.84 | 70.17 | 63.84 |
| Point | F-PointNet [17] | C + L | 83.54 | 91.17 | 84.67 | 74.77 | 57.13 | 49.57 | 45.48 | 77.26 | 61.37 | 53.78 |
| | Avod [34] | C + L | 85.14 | 90.99 | 84.82 | 79.62 | – | – | – | – | – | – |
| | PointRCNN [21] | L | 87.41 | 92.13 | 87.39 | 82.72 | 54.77 | 46.13 | 42.84 | 82.56 | 67.24 | 60.28 |
| | PSIFT + SENet [38] | C + L | 82.99 | 88.80 | 83.96 | 76.21 | – | – | – | – | – | – |
| | O 2 SAT (Our) | L | 89.94 | 93.86 | 89.25 | 84.53 | 54.94 | 47.50 | 45.78 | 84.59 | 72.83 | 65.13 |
Table 2. Evaluation of quantitative performance in 3D object detection through average precision (AP %) on the KITTI test set and comparison with various methodologies.
| Category | Method | Mod | mAP | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Voxel | SECOND [36] | L | 74.33 | 83.34 | 72.55 | 65.82 | 48.96 | 38.78 | 34.91 | 71.33 | 52.08 | 45.83 |
| | PointPillars [37] | L | 74.11 | 82.58 | 74.31 | 68.99 | 51.45 | 41.92 | 38.89 | 77.10 | 58.65 | 51.92 |
| | PV-RCNN [39] | C + L | 76.41 | 84.37 | 74.82 | 70.03 | – | – | – | – | – | – |
| Parallel | VoxelNet [26] | L | 66.77 | 77.47 | 65.11 | 57.73 | 39.48 | 33.69 | 31.5 | 61.22 | 48.36 | 44.37 |
| | MV3D [14] | C + L | 64.20 | 74.97 | 63.63 | 54.0 | – | – | – | – | – | – |
| | Contfuse [33] | C + L | 71.38 | 83.68 | 68.78 | 61.67 | – | – | – | – | – | – |
| | HVPR [41] | L | 79.11 | 86.38 | 77.92 | 73.04 | 53.47 | 43.96 | 40.64 | – | – | – |
| | MVMM [40] | C + L | 80.08 | 87.59 | 78.87 | 73.78 | 47.54 | 40.49 | 38.36 | 77.82 | 64.81 | 58.79 |
| Point | PSIFT + SENet [38] | C + L | 77.14 | 85.99 | 72.72 | 72.72 | – | – | – | – | – | – |
| | PointRCNN [21] | C + L | 77.77 | 86.96 | 75.64 | 70.70 | 47.98 | 39.37 | 36.01 | 74.96 | 58.82 | 52.53 |
| | TANET [42] | L | 76.38 | 84.39 | 75.94 | 68.82 | 53.72 | 44.34 | 40.49 | 75.70 | 59.44 | 52.53 |
| | F-PointNet [17] | C + L | 70.86 | 82.19 | 69.79 | 60.59 | 50.53 | 42.15 | 38.08 | 72.27 | 56.17 | 49.01 |
| | O 2 SAT (Our) | L | 81.25 | 88.72 | 80.02 | 74.11 | 48.93 | 41.25 | 39.10 | 79.10 | 65.83 | 60.18 |
Table 3. Comparative assessment on the SlopedKITTI car detection validation set against the latest methodologies. Average precision (AP) is computed with an IoU threshold of 0.7 for rotated full 3D and bird's-eye-view detection.
| Category | Method | Mod | ATS ↑ | AP ↑ | AOS ↑ | ASS ↑ | RODS ↑ |
|---|---|---|---|---|---|---|---|
| Voxel | SECOND [36] | L | 49.50 | 77.22 | 86.49 | 76.33 | 64.76 |
| | PointPillars [37] | L | 47.37 | 76.95 | 86.22 | 77.94 | 63.87 |
| | PV-RCNN [39] | C + L | 46.94 | 79.80 | 86.81 | 83.00 | 65.07 |
| | VoxelNet [26] | L | 50.99 | 78.59 | 86.85 | 78.60 | 66.17 |
| Parallel | MV3D [14] | C + L | 50.15 | 77.37 | 86.46 | 80.34 | 65.77 |
| | Contfuse [33] | C + L | 51.06 | 78.17 | 86.64 | 77.73 | 65.95 |
| | HVPR [41] | L | 50.99 | 78.59 | 86.85 | 78.60 | 66.17 |
| | MVMM [40] | C + L | 46.94 | 79.80 | 86.81 | 83.00 | 65.07 |
| Point | PSIFT + SENet [38] | C + L | 74.12 | 68.47 | 83.99 | 64.38 | 72.20 |
| | PointRCNN [21] | C + L | 67.83 | 70.22 | 83.87 | 63.19 | 70.13 |
| | TANET [42] | L | 72.01 | 69.23 | 83.33 | 69.12 | 72.94 |
| | F-PointNet [17] | L | 74.12 | 68.47 | 83.99 | 64.38 | 72.20 |
| | O 2 SAT (Our) | C | 69.36 | 69.04 | 82.43 | 70.98 | 71.75 |
Table 4. Comparative evaluation of car detection for state-of-the-art methods on the KITTI test set and validation split.
| Category | Method | Car 3D Test Easy | Car 3D Test Mod. | Car 3D Test Hard | Car 3D Validation Easy | Car 3D Validation Mod. | Car 3D Validation Hard |
|---|---|---|---|---|---|---|---|
| Voxel | SECOND [36] | 83.13 | 73.66 | 66.20 | 87.43 | 76.48 | 69.10 |
| | PointPillars [37] | 82.58 | 74.31 | 68.99 | – | 77.98 | – |
| | PV-RCNN [39] | 87.81 | 78.49 | 73.51 | 89.47 | 79.47 | 78.54 |
| | VoxelNet [26] | 90.25 | 81.43 | 76.82 | 89.35 | 83.69 | 78.70 |
| Parallel | MV3D [14] | – | – | – | 87.72 | 79.48 | 77.17 |
| | Contfuse [33] | 90.90 | 81.62 | 77.06 | 89.41 | 84.52 | 78.93 |
| | HVPR [41] | 86.96 | 75.64 | 70.70 | 88.88 | 78.63 | 77.38 |
| | MVMM [40] | 88.36 | 79.57 | 74.55 | 89.71 | 79.45 | 78.67 |
| Point | PSIFT + SENet [38] | 88.76 | 82.16 | 77.16 | 89.38 | 84.80 | 79.01 |
| | PointRCNN [21] | 88.87 | 80.32 | 75.10 | – | 79.57 | – |
| | TANET [42] | 86.52 | 78.95 | 73.12 | 88.38 | 83.14 | 77.48 |
| | F-PointNet [17] | 87.34 | 79.26 | 73.85 | 88.17 | 82.41 | 78.74 |
| | O 2 SAT (Our) | 88.81 | 80.62 | 75.55 | 89.24 | 84.57 | 78.45 |
Table 5. Ablation studies on the individual modules of O 2 SAT on the SlopedKITTI validation split.
| OOS | SFR | R3D | 3D AP (Moderate) | ASS | BEV AP (OSSD) | Mean Yaw Error | Mean Pitch and Roll Error |
|---|---|---|---|---|---|---|---|
| – | – | – | 37.45 | 70.12 | 73.78 | 0.16 | 0.44 |
| – | – | | 71.72 | 66.67 | 82.98 | 0.16 | 0.45 |
| – | | | 60.75 | 73.10 | 82.15 | 0.21 | 0.16 |
| – | | | 72.48 | 84.27 | 86.57 | 0.13 | 0.06 |
Table 6. Ablation studies on two different 3D backbones, with quantitative performance evaluated by AP (%) under various difficulty modes.
| PointRCNN | PV-RCNN | 2SR | Par (M) | Moderate AP | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | 20 | 76.57 | 86.21 | 78.65 | 76.15 | 62.04 | 40.41 | 38.15 | 76.59 | 63.37 | 57.47 |
| | | | 20 | 80.68 | 86.95 | 79.65 | 74.39 | 64.30 | 42.52 | 39.40 | 78.81 | 65.48 | 58.48 |
| | | | 28 | 79.12 | 88.54 | 78.60 | 74.42 | 63.90 | 41.69 | 39.35 | 78.97 | 65.43 | 59.28 |
| | | | 32 | 82.36 | 89.83 | 81.13 | 75.23 | 49.86 | 42.36 | 39.52 | 80.04 | 66.94 | 61.29 |
Table 7. Ablation studies on various O 2 SAT modules with two different region proposal networks (RPNs), demonstrating the generalization capability of the proposed approach.
| Method | OOS | SFR | R3D | ms/image | FPS | Param/MB |
|---|---|---|---|---|---|---|
| PointRCNN | | | | 18 | 65 | 72.8 |
| | | | | 24 | 51 | 73.8 |
| | | | | 20 | 60 | 72.9 |
| | | | | 21 | 51 | 76.2 |
| PV-RCNN | | | | 40 | 29 | 64.7 |
| | | | | 58 | 21 | 65.9 |
| | | | | 53 | 22 | 65.6 |
| | | | | 62 | 21 | 65.8 |
Table 8. Ablation study on O 2 SAT with different numbers of nearest neighbors (k-points).
| Method | k-Points | AP3D Easy (%) | AP3D Mod. (%) | AP3D Hard (%) | APBEV Easy (%) | APBEV Mod. (%) | APBEV Hard (%) |
|---|---|---|---|---|---|---|---|
| PointRCNN | 9 | 88.47 | 80.35 | 74.89 | 94.66 | 90.42 | 84.89 |
| | 16 | 88.32 | 79.94 | 74.26 | 93.78 | 89.46 | 84.13 |
| | 32 | 89.47 | 80.53 | 74.85 | 94.66 | 90.15 | 84.29 |
| PV-RCNN | 9 | 87.85 | 79.46 | 73.97 | 93.57 | 89.62 | 83.97 |
| | 16 | 87.34 | 78.93 | 73.95 | 92.89 | 88.57 | 83.62 |
| | 32 | 89.56 | 80.16 | 75.27 | 93.87 | 89.54 | 84.29 |
Table 9. List of abbreviations.
| Abbreviation | Definition | Abbreviation | Definition |
|---|---|---|---|
| AVs [36] | Autonomous vehicles | FPS [17] | Feature point sampling |
| LiDAR [37] | Light detection and ranging | D-FPS [21] | Deterministic feature point sampling |
| 3D [14] | Three-dimensional | Feat-FPS [21] | Feature-based feature point sampling |
| O 2 SAT | Object-Oriented-Segmentation-Guided Spatial-Attention | AP [21] | Average precision |
| OOS [67] | Object-Oriented Segmentation | FFN [57] | Feed-forward network |
| SFR [68] | Spatial-Attention-Based Feature Reweighting | R3D [4] | Road-Aware 3D Detection Head |
| IoU [26] | Intersection over union | BEV [40] | Bird's-eye view |
| MLPs [37] | Multilayer perceptrons | mAP [39] | Mean average precision |