1. Introduction
Smart sustainable cities use ICT for efficient operations, information sharing, better government services, and citizen well-being, prioritizing technological efficiency to improve urban life [1,2,3,4]. Autonomous vehicles (AVs) offer immersive user experiences, shaping future human–machine interactions in smart cities [5,6]. Mobility as a service is set to transform urban mobility in terms of sustainability [7], and cities are seeking smart mobility solutions to address their transport issues [8]. The perceived benefits of AVs drive their adoption and help offset safety concerns: AVs promise traffic improvements, enhanced public transport, safer streets, and a better quality of life in eco-conscious digital cities [9].
At the core of AV technology lies 3D object detection, a fundamental capability enabling AVs to perceive their surroundings in three dimensions. This capability is vital for safe autonomous vehicle navigation in smart cities [10,11]. It identifies and comprehends surrounding objects in 3D, enabling obstacle avoidance, path planning, and collision prevention [12]. Advancements in this technology enhance urban life through improved autonomous vehicle perception [13,14]. Autonomous vehicles are equipped with various sensors, including cameras, LiDAR (light detection and ranging), radar, and sometimes ultrasonic sensors, which capture data about the surrounding environment [15].
Recent advancements in autonomous driving technology have significantly propelled the development of sustainable smart cities [16,17,18]. Notably, 3D object detection has emerged as a pivotal element of autonomous vehicles, forming the basis for efficient planning and control in alignment with smart city principles of optimization and improving citizens’ quality of life, particularly by ensuring the safe navigation of AVs [19,20,21]. LiDAR, an active sensor that scans the environment with laser beams, is extensively integrated into AVs to provide 3D perception in urban environments, and various autonomous driving datasets, such as KITTI, have been developed to enable mass mobility in smart cities [22,23]. Although 3D LiDAR point cloud data are rich in depth and spatial information and less susceptible to lighting variations, they are irregular and sparse, particularly at longer distances, which can jeopardize the safety of pedestrians and cyclists. Traditional methods for learning point cloud features struggle to comprehend the geometrical characteristics of smaller and distant objects around AVs [24,25].
To overcome these geometric challenges and facilitate the use of deep neural networks (DNNs) for processing 3D smart city datasets to ensure safe autonomous vehicle (AV) navigation, custom discretization or voxelization techniques are employed [26,27,28,29,30,31,32,33,34]. These methods convert 3D point clouds into voxel representations, enabling the application of 2D or 3D convolutions. However, they may compromise geometric detail and suffer from quantization loss and computational bottlenecks, posing sustainability challenges for AVs in smart cities. Region proposal network (RPN) backbones exhibit high accuracy and recall but struggle with average precision (AP), particularly for distant or smaller objects. A poor AP score hinders AV integration in sustainable smart cities because of its direct impact on object detection at varying distances [35,36].
Most RPN backbones rely on convolutional neural networks (CNNs) designed for feature extraction from Euclidean data [34,37]. However, CNNs are ill-suited to handling unstructured point clouds [38]. To address this, self-attention mechanisms from transformers have been introduced to capture long-range dependencies and interactions, enhancing the representation of distant objects and reducing false negatives [2,39,40]. By combining self-attention with CNNs, the performance of 3D object detection in AVs can be enhanced even with limited point cloud data [2,41,42]. The proposed DFA-SAT approach shows promising results, addressing smart city challenges such as urban space management, pedestrian and cyclist safety, and overall quality-of-life improvement, in line with eco-conscious city development goals.
Figure 1 illustrates DFA-SAT’s performance with a reduced number of point features.
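To make the hybrid design concrete, the following is a minimal sketch, assuming PyTorch and illustrative tensor shapes, of fusing CNN-extracted local point features with multi-head self-attention so that long-range dependencies between distant points are captured; it is an illustrative pattern, not the exact DFA-SAT implementation.

```python
# Illustrative sketch: CNN local features fused with self-attention.
# Shapes and hyperparameters are assumptions for demonstration only.
import torch
import torch.nn as nn

class ConvSelfAttentionBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local feature extraction: pointwise 1D convolution.
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
        )
        # Global interactions: multi-head self-attention across all points.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_points, channels) point/voxel features.
        local = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local CNN features
        attn_out, _ = self.attn(local, local, local)          # long-range context
        return self.norm(local + attn_out)                    # residual fusion

# Example: 2 scenes, 1024 points, 64-dimensional features.
fused = ConvSelfAttentionBlock(64)(torch.randn(2, 1024, 64))  # (2, 1024, 64)
```

The residual connection lets the attention branch refine, rather than replace, the convolutional features, which is useful when point clouds are sparse and the attention signal is noisy.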
This study aims to enhance 3D object detection in AVs to address the challenges posed by smart cities, including pedestrian and cyclist safety and the reduction of vehicle collisions [6,8,18]. It emphasizes the importance of foreground global points for learning better semantic and contextual information among points, a crucial aspect of 3D object detection. The study seeks to overcome the limitations caused by insufficient semantic information in point clouds, improving AVs’ 3D object detection capabilities, which is essential for their adoption in smart cities [9,11]. To achieve this, two key observations are made. First, a unified module can address weak semantic information by leveraging both voxel-based and point-based methods. Second, enhancing interactions between global and local object features can promote better feature association. The proposed solution, dynamic feature abstraction with self-attention (DFA-SAT), combines CNNs and self-attention mechanisms to augment semantic information in both voxel-based and point-based methods, thereby addressing the issue of insufficient semantic information.
DFA-SAT is composed of four primary components: object-based down-sampling (OBDS), semantic and contextual feature extraction (SCFE), multi-level feature re-weighting (MLFR), and local and global feature aggregation (LGFA). The OBDS module preserves more semantic foreground points on the basis of spatial information, as shown in Figure 2. SCFE learns rich semantic and contextual information with respect to spatial dependencies to refine the local point feature information. MLFR decodes all the point features using a channel-wise multi-layered transformer approach to strengthen the relationships among local features, adjusting the weights of these relationships to emphasize the most significant connections. In scenarios with sparse point clouds, distant points tend to be far from their neighbors, potentially hindering detection accuracy. LGFA therefore combines local features with decoding weights for global features, using a matrix product of key and query embeddings to learn the spatial information across each channel. Figure 3 illustrates DFA-SAT, and Figure 4 demonstrates how it re-weights local and global encoded features.
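For illustration, the sketch below shows, under assumed shapes and again in PyTorch, how decoding weights obtained from a query–key matrix product can re-weight local features against a pooled global feature, in the spirit of the MLFR and LGFA steps described above; the names and dimensions are hypothetical rather than the paper’s exact configuration.

```python
# Illustrative sketch: query-key decoding weights aggregate local and
# global features. Dimensions are assumptions for demonstration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalAggregation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Linear(channels, channels)  # from the global feature
        self.key = nn.Linear(channels, channels)    # from local point features
        self.value = nn.Linear(channels, channels)
        self.scale = channels ** -0.5

    def forward(self, local: torch.Tensor, global_: torch.Tensor) -> torch.Tensor:
        # local: (batch, num_points, channels); global_: (batch, 1, channels)
        q = self.query(global_)                                 # (B, 1, C)
        k = self.key(local)                                     # (B, N, C)
        v = self.value(local)                                   # (B, N, C)
        # Query-key matrix product yields one decoding weight per point.
        weights = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, 1, N)
        context = weights.transpose(1, 2) * v   # emphasize significant connections
        return local + context                  # re-weighted local features

local = torch.randn(2, 1024, 64)
global_ = local.max(dim=1, keepdim=True).values   # simple global pooling
out = LocalGlobalAggregation(64)(local, global_)  # (2, 1024, 64)
```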
To validate the effectiveness of the proposed DFA-SAT module, it was integrated into popular baseline algorithms such as SECOND [34] and PointPillars [37], which provide a foundation for incorporating 3D object detection into AVs and realizing the perceived benefits of smart mobility [2]. Comprehensive experiments conducted on the widely recognized KITTI dataset [43] substantiate the benefits of DFA-SAT. KITTI and similar datasets play a significant role in the development of autonomous vehicles, which are integral to the advancement of smart cities’ transportation infrastructure and sustainability goals [44]. Our module enhances the extraction of semantic information and significantly improves detection accuracy in AVs, especially for objects at medium and long distances, increasing the safety of cyclists and pedestrians in sustainable smart cities. Importantly, incorporating the DFA-SAT module has minimal impact on both the number of model parameters and run-time performance. In summary, the key contributions of this research are as follows:
We propose DFA-SAT, a versatile module that improves the detection of 3D objects by preserving maximum foreground features and enhancing weak semantic information of objects around AVs.
DFA-SAT performs semantic and contextual feature extraction and decodes these features, refining the relationships among them by assigning weights to meaningful connections and thus reinforcing their importance.
This module can be seamlessly integrated into both voxel-based and point-based methods (see the sketch following this list).
Empirical evaluations on the benchmark KITTI dataset validate its efficacy in improving detection accuracy, especially for distant and sparse objects, contributing to sustainability in urban environments.
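As referenced in the third contribution above, the following is a hedged sketch of how a plug-in refinement module of this kind could sit between a baseline backbone (SECOND- or PointPillars-style) and the detection head; the backbone and head here are stand-in placeholders, not the actual implementations of those frameworks.

```python
# Illustrative drop-in composition; all submodules are placeholders.
import torch.nn as nn

class DetectorWithRefinement(nn.Module):
    def __init__(self, backbone: nn.Module, refinement: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone      # voxel- or pillar-based feature extractor
        self.refinement = refinement  # e.g., an attention block as sketched above
        self.head = head              # region proposal / detection head

    def forward(self, points):
        feats = self.backbone(points)   # baseline features
        feats = self.refinement(feats)  # semantic enrichment step
        return self.head(feats)         # boxes, classes, scores

# Toy usage with identity stand-ins in place of real components:
model = DetectorWithRefinement(nn.Identity(), nn.Identity(), nn.Identity())
```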
5. Discussion
The incorporation of autonomous vehicles (AVs) into urban settings marks a pivotal development in the evolution of smart cities, with 3D object detection at the core of AV technology, enabling vehicles to perceive their surroundings in three dimensions. The primary contributions of this work are the proposal of DFA-SAT, a versatile module for 3D object detection in AVs, and its integration into established frameworks such as PointPillars and SECOND. DFA-SAT addresses the challenge of weak semantic information in point clouds, particularly for distant and sparse objects, by preserving foreground features, refining semantic and contextual information, and enhancing feature associations.
The significance of DFA-SAT lies in its potential to improve the safety of AVs in smart cities through better detection of pedestrians, cyclists, and other objects, yielding gains of 8.03% and 6.86% with the SECOND RPN for BEV and 3D detection, respectively. The module’s minimal impact on model parameters (75.1 param/MB) and run-time performance (49 FPS) is crucial for practical applications. The experimental setup is comprehensive, with detailed information about the dataset, implementation details, and network configuration. Evaluations use the KITTI dataset, a well-established benchmark for 3D object detection, across multiple difficulty levels (easy, moderate, and hard), with AP% and mAP% as evaluation metrics. The custom down-sampling approach, encoder–decoder architecture, and transformer-based decoding weight calculations distinguish DFA-SAT, and the efficiency of the approach demonstrates its suitability for real-world applications.
Extensive results compare DFA-SAT with existing methods. It achieves competitive performance in 3D object detection, particularly for smaller and distant objects such as cyclists and pedestrians, and the improvement in mean average precision (mAP) for both the PointPillars and SECOND frameworks demonstrates its effectiveness. DFA-SAT also performs efficiently, achieving high AP while running at 32 FPS. The qualitative results showcase accurate 3D bounding box predictions and refined object detection, underscoring the importance of semantic and contextual information. With meticulous deliberation and strategic implementation, 3D object detection holds the potential to reshape the functioning of cities, rendering them more habitable, eco-conscious, and responsive to the needs of their inhabitants.
6. Conclusions
This study presented DFA-SAT, a dynamic feature abstraction with self-attention encoder–decoder architecture for 3D object detection in autonomous vehicles using LiDAR 3D point clouds, with significant implications for smart city applications. By improving the detection performance of LiDAR point-cloud-based 3D object detectors, this research contributes to the advancement of autonomous driving technology, a vital component of smart cities. The study thoroughly examines existing issues with 3D object detectors and proposes DFA-SAT to extract detailed geometric information from local semantic features and apply a feature re-weighting mechanism. DFA-SAT sets itself apart from existing methods by using a convolutional neural network (CNN) and a self-attention mechanism to learn high-dimensional local features and combine them with low-dimensional global features, leading to significant improvements in detection performance. Experimental evaluations on the KITTI 3D object detection dataset demonstrate these advantages, highlighting DFA-SAT’s potential to enhance autonomous driving and smart city development. Improvements in 3D object detection are essential for safer, more efficient, and more sustainable urban environments as autonomous vehicles become integrated into smart city infrastructures. The study’s insights pave the way for future developments in object detection techniques, driving the progress of autonomous vehicles in urban planning and smart cities; combining these technological advancements with supportive policies and responsible adoption will lead to a more sustainable and environmentally friendly transportation future.
Limitations
The DFA-SAT model demonstrates impressive efficiency in detecting objects within extensive LiDAR point clouds; however, it is not without limitations. Notably, the semantic prediction of individual points can be problematic when dealing with imbalanced class distributions, and accuracy may suffer when points are unevenly distributed for a given semantic context. To address this challenge, future research will explore and implement advanced techniques aimed at mitigating the effects of class imbalance, with the goal of enhancing the model’s overall performance and robustness in complex real-world scenarios and providing a more comprehensive understanding of the method’s implications for smart city development.
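As one example of the direction mentioned above, a focal-style loss is a commonly used remedy for class imbalance in per-point semantic prediction; the sketch below is a generic illustration under assumed shapes, not a component of DFA-SAT.

```python
# Generic focal loss sketch; shapes and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Down-weights easy, majority-class examples so that rare classes
    (e.g., sparse pedestrian or cyclist points) contribute more."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)  # estimated probability of the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()

logits = torch.randn(8, 3)              # 8 points, 3 semantic classes
targets = torch.randint(0, 3, (8,))
loss = focal_loss(logits, targets)
```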