SMIFormer: Learning Spatial Feature Representation for 3D Object Detection from 4D Imaging Radar via Multi-View Interactive Transformers
Abstract
1. Introduction
- We address the insufficient representation that extremely sparse point clouds receive in any single view by decoupling 3D voxel features into separate front-view (FV), side-view (SV), and bird's-eye-view (BEV) planes.
- We propose sparse-dimensional compression as an alternative to dense-dimensional compression: voxels are projected onto each view plane individually, and features that fall at the same position are aggregated, producing a two-dimensional sparse feature matrix. This retains precise predictions while minimizing memory and computational demands (see the first sketch after this list).
- We propose multi-view feature interaction (MVI) to enhance spatial perception. MVI divides each full-size feature map into non-overlapping windows, enabling view-inside and view-outside features to interact through self-attention and cross-attention; this strengthens spatial perception at each interaction level with only a small increase in computation (see the second sketch after this list).
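As a concrete illustration of the per-view decoupling and sparse-dimensional compression described above, here is a minimal PyTorch sketch (our illustration, not the authors' released code): each view plane is obtained by dropping one voxel coordinate axis, and features that land in the same 2D cell are sum-pooled, so only occupied cells are ever stored. The axis-to-view mapping, the sum pooling, and the function name `collapse_to_plane` are assumptions for illustration.

```python
import torch

def collapse_to_plane(coords: torch.Tensor, feats: torch.Tensor, dims: tuple):
    """Project sparse voxel features onto a 2D view plane.

    coords: (N, 3) integer voxel indices (x, y, z)
    feats:  (N, C) per-voxel features
    dims:   the two coordinate axes spanning the target plane, e.g.
            (0, 1) for BEV, (1, 2) for front view, (0, 2) for side view
            (an assumed axis convention)
    Returns sparse 2D coordinates (M, 2) and pooled features (M, C):
    only occupied cells are kept, so no dense (H, W, C) tensor is
    ever materialised -- the "sparse-dimensional compression" idea.
    """
    plane_coords = coords[:, list(dims)]                      # (N, 2)
    uniq, inv = torch.unique(plane_coords, dim=0, return_inverse=True)
    pooled = torch.zeros(uniq.shape[0], feats.shape[1], device=feats.device)
    pooled.index_add_(0, inv, feats)   # sum-pool voxels sharing a cell
    return uniq, pooled

# Toy usage: 5 voxels with 4-dim features; two pairs collapse to shared cells.
coords = torch.tensor([[0, 0, 1], [0, 0, 2], [3, 1, 1], [3, 1, 1], [2, 2, 0]])
feats = torch.randn(5, 4)
bev_xy, bev_feat = collapse_to_plane(coords, feats, dims=(0, 1))  # (x, y)
fv_yz, fv_feat = collapse_to_plane(coords, feats, dims=(1, 2))    # (y, z)
sv_xz, sv_feat = collapse_to_plane(coords, feats, dims=(0, 2))    # (x, z)
print(bev_xy.shape)  # torch.Size([3, 2]) -- 5 voxels -> 3 occupied BEV cells
```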
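Likewise, a minimal sketch of the window-splitting interaction, assuming dense (B, C, H, W) per-view maps whose height and width are divisible by the window size, standard `nn.MultiheadAttention`, and a simple one-to-one pairing of window indices across views; the class and parameter names (`WindowedViewInteraction`, `win`, `heads`) are hypothetical, and the paper's exact attention design may differ.

```python
import torch
import torch.nn as nn

class WindowedViewInteraction(nn.Module):
    """Sketch of the MVI step: each view's feature map is cut into
    non-overlapping win x win windows; self-attention runs inside each
    window of one view (view-inside), then cross-attention lets those
    tokens query the paired window of another view (view-outside)."""

    def __init__(self, channels: int, win: int = 4, heads: int = 4):
        super().__init__()
        self.win = win
        self.self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def _to_windows(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B * num_windows, win*win, C); H, W divisible by win
        B, C, H, W = x.shape
        w = self.win
        x = x.view(B, C, H // w, w, W // w, w)
        return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)

    def forward(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
        a = self._to_windows(view_a)        # tokens of the view being updated
        b = self._to_windows(view_b)        # tokens of the interacting view
        a = a + self.self_attn(a, a, a)[0]  # attention inside the view
        a = a + self.cross_attn(a, b, b)[0] # attention across views
        return a  # window tokens; folding back to (B, C, H, W) is omitted

# Toy usage: two 16x16 view maps with 32 channels, window size 4.
va, vb = torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16)
out = WindowedViewInteraction(32, win=4)(va, vb)
print(out.shape)  # torch.Size([16, 16, 32]) -- 16 windows of 16 tokens each
```

Because attention cost is quadratic in token count, splitting an H x W map into windows replaces one (HW)^2 attention with (HW/win^2) attentions of cost (win^2)^2, which is consistent with the latency figures reported in the ablation tables below.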
2. Related Work
2.1. 3D Object Detection with LiDAR Point Cloud
2.2. 3D Object Detection with 4D Imaging Radar Point Cloud
2.3. Point Cloud Perception for Indoor Scenes
3. Method
3.1. Framework Overview
3.2. Decoupled Per-View Feature Encoding
3.3. Voxel Feature Query Module
3.4. Splitting for Multi-View Feature Interaction
3.5. Feature Self-Attention inside the View
3.6. Feature Cross-Attention outside the View
4. Experiments
4.1. Dataset
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Results and Analysis
4.5. Ablation Study
4.5.1. Effects of Proposed Components
4.5.2. Effects of Splitting Window Size
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Sun, S.; Petropulu, A.P.; Poor, H.V. MIMO radar for advanced driver-assistance systems and autonomous driving: Advantages and challenges. IEEE Signal Process. Mag. 2020, 37, 98–117.
- Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793.
- Liu, J.; Bai, L.; Xia, Y.; Huang, T.; Zhu, B.; Han, Q.L. GNN-PMB: A simple but effective online 3D multi-object tracker without bells and whistles. IEEE Trans. Intell. Veh. 2022, 8, 1176–1189.
- Dreher, M.; Erçelik, E.; Bänziger, T.; Knoll, A. Radar-based 2D car detection using deep neural networks. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–8.
- Svenningsson, P.; Fioranelli, F.; Yarovoy, A. Radar-pointgnn: Graph based object recognition for unstructured radar point-cloud data. In Proceedings of the 2021 IEEE Radar Conference (RadarConf21), Atlanta, GA, USA, 8–14 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6.
- Bai, J.; Li, S.; Huang, L.; Chen, H. Robust detection and tracking method for moving object based on radar and camera data fusion. IEEE Sens. J. 2021, 21, 10761–10774.
- Han, Z.; Wang, J.; Xu, Z.; Yang, S.; He, L.; Xu, S.; Wang, J. 4D Millimeter-Wave Radar in Autonomous Driving: A Survey. arXiv 2023, arXiv:2306.04242.
- Brisken, S.; Ruf, F.; Höhne, F. Recent evolution of automotive imaging radar and its information content. IET Radar Sonar Navig. 2018, 12, 1078–1081.
- Li, G.; Sit, Y.L.; Manchala, S.; Kettner, T.; Ossowska, A.; Krupinski, K.; Sturm, C.; Lubbert, U. Novel 4D 79 GHz radar concept for object detection and active safety applications. In Proceedings of the 2019 12th German Microwave Conference (GeMiC), Stuttgart, Germany, 25–27 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 87–90.
- Li, G.; Sit, Y.L.; Manchala, S.; Kettner, T.; Ossowska, A.; Krupinski, K.; Sturm, C.; Goerner, S.; Lübbert, U. Pioneer study on near-range sensing with 4D MIMO-FMCW automotive radars. In Proceedings of the 2019 20th International Radar Symposium (IRS), Ulm, Germany, 26–28 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–10.
- Bansal, K.; Rungta, K.; Zhu, S.; Bharadia, D. Pointillism: Accurate 3d bounding box estimation with multi-radars. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems, Virtual, 16–19 November 2020; pp. 340–353.
- Bagloee, S.A.; Tavana, M.; Asadi, M.; Oliver, T. Autonomous vehicles: Challenges, opportunities, and future implications for transportation policies. J. Mod. Transp. 2016, 24, 284–303.
- Choy, C.; Gwak, J.; Savarese, S. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3075–3084.
- Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790.
- Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1477–1485.
- Liang, T.; Xie, H.; Yu, K.; Xia, Z.; Lin, Z.; Wang, Y.; Tang, T.; Wang, B.; Tang, Z. Bevfusion: A simple and robust lidar-camera fusion framework. Adv. Neural Inf. Process. Syst. 2022, 35, 10421–10434.
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2774–2781.
- Philion, J.; Fidler, S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIV. Springer: Berlin/Heidelberg, Germany, 2020; pp. 194–210.
- Zhang, Y.; Zheng, W.; Zhu, Z.; Huang, G.; Lu, J.; Zhou, J. A simple baseline for multi-camera 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 3507–3515.
- Zhang, Y.; Zhu, Z.; Zheng, W.; Huang, J.; Huang, G.; Zhou, J.; Lu, J. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv 2022, arXiv:2205.09743.
- Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499.
- Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
- Deng, S.; Liang, Z.; Sun, L.; Jia, K. Vista: Boosting 3d object detection via dual cross-view spatial attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8448–8457.
- Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal sparse convolutional networks for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5428–5437.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Zhou, Z.; Zhao, X.; Wang, Y.; Wang, P.; Foroosh, H. Centerformer: Center-based transformer for 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 496–513.
- Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1201–1209.
- Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3164–3173.
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705.
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
- Wang, J.; Lan, S.; Gao, M.; Davis, L.S. Infofocus: 3d object detection for autonomous driving with dynamic information modeling. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X. Springer: Berlin/Heidelberg, Germany, 2020; pp. 405–420.
- Shi, G.; Li, R.; Ma, C. Pillarnet: Real-time and high-performance pillar-based 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 35–52.
- Xu, B.; Zhang, X.; Wang, L.; Hu, X.; Li, Z.; Pan, S.; Li, J.; Deng, Y. RPFA-Net: A 4D radar pillar feature attention network for 3D object detection. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3061–3066.
- Tan, B.; Ma, Z.; Zhu, X.; Li, S.; Zheng, L.; Chen, S.; Huang, L.; Bai, J. 3d object detection for multi-frame 4d automotive millimeter-wave radar point cloud. IEEE Sens. J. 2022, 23, 11125–11138.
- Liu, J.; Zhao, Q.; Xiong, W.; Huang, T.; Han, Q.L.; Zhu, B. SMURF: Spatial Multi-Representation Fusion for 3D Object Detection with 4D Imaging Radar. arXiv 2023, arXiv:2307.10784.
- Zhou, T.; Chen, J.; Shi, Y.; Jiang, K.; Yang, M.; Yang, D. Bridging the view disparity between radar and camera features for multi-modal fusion 3d object detection. IEEE Trans. Intell. Veh. 2023, 8, 1523–1535.
- Kim, Y.; Kim, S.; Shin, J.; Choi, J.W.; Kum, D. Crn: Camera radar net for accurate, robust, efficient 3d perception. arXiv 2023, arXiv:2304.00670.
- Zheng, L.; Li, S.; Tan, B.; Yang, L.; Chen, S.; Huang, L.; Bai, J.; Zhu, X.; Ma, Z. RCFusion: Fusing 4D Radar and Camera with Bird's-Eye View Features for 3D Object Detection. IEEE Trans. Instrum. Meas. 2023, 72, 8503814.
- Xiong, W.; Liu, J.; Huang, T.; Han, Q.L.; Xia, Y.; Zhu, B. Lxl: Lidar exclusive lean 3d object detection with 4d imaging radar and camera fusion. arXiv 2023, arXiv:2307.00724.
- Xie, T.; Wang, S.; Wang, K.; Yang, L.; Jiang, Z.; Zhang, X.; Dai, K.; Li, R.; Cheng, J. Poly-PC: A Polyhedral Network for Multiple Point Cloud Tasks at Once. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1233–1243.
- Xie, T.; Wang, K.; Lu, S.; Zhang, Y.; Dai, K.; Li, X.; Xu, J.; Wang, L.; Zhao, L.; Zhang, X.; et al. CO-Net: Learning Multiple Point Cloud Tasks at Once with a Cohesive Network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 3523–3533.
- Xie, T.; Wang, L.; Wang, K.; Li, R.; Zhang, X.; Zhang, H.; Yang, L.; Liu, H.; Li, J. FARP-Net: Local-Global Feature Aggregation and Relation-Aware Proposals for 3D Object Detection. IEEE Trans. Multimed. 2023, 1–15.
- Wang, L.; Xie, T.; Zhang, X.; Jiang, Z.; Yang, L.; Zhang, H.; Li, X.; Ren, Y.; Yu, H.; Li, J.; et al. Auto-Points: Automatic Learning for Point Cloud Analysis with Neural Architecture Search. IEEE Trans. Multimed. 2023, 1–16.
- Palffy, A.; Pool, E.; Baratam, S.; Kooij, J.F.; Gavrila, D.M. Multi-class road user detection with 3+1D radar in the View-of-Delft dataset. IEEE Robot. Autom. Lett. 2022, 7, 4961–4968.
- OpenPCDet Development Team. OpenPCDet: An open-source toolbox for 3D object detection from point clouds. 2020. Available online: https://github.com/open-mmlab/OpenPCDet (accessed on 22 October 2023).
| Method | Modality | Car (Entire) | Ped (Entire) | Cyc (Entire) | mAP (Entire) | Car (Corridor) | Ped (Corridor) | Cyc (Corridor) | mAP (Corridor) |
|---|---|---|---|---|---|---|---|---|---|
| PointPillars [29] | R | 37.24 | 32.19 | 66.80 | 45.41 | 70.55 | 43.28 | 88.13 | 67.32 |
| SECOND [22] | R | 40.40 | 30.64 | 62.51 | 44.52 | 72.25 | 41.19 | 83.39 | 65.61 |
| CenterPoint [2] | R | 32.74 | 38.00 | 65.51 | 45.42 | 62.01 | 48.18 | 84.98 | 65.06 |
| LXL-R [39] | R | 32.75 | 39.65 | 68.13 | 46.84 | 70.26 | 47.34 | 87.93 | 68.51 |
| RCFusion [38] | R + C | 41.70 | 38.95 | 68.31 | 49.65 | 71.87 | 47.50 | 88.33 | 69.23 |
| Ours | R | 39.53 | 41.88 | 64.91 | 48.77 | 77.04 | 53.40 | 82.95 | 71.13 |

"Entire" denotes the entire annotation area and "Corridor" the driving corridor area; R = radar, C = camera.
| # | MVD | SAI | CAO | Car (Entire) | Ped (Entire) | Cyc (Entire) | mAP (Entire) | Car (Corridor) | Ped (Corridor) | Cyc (Corridor) | mAP (Corridor) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | | | 37.82 | 39.44 | 62.18 | 46.48 | 72.08 | 49.29 | 83.71 | 68.36 |
| 2 | ✓ | | | 41.16 | 40.07 | 61.06 | 47.43 | 78.73 | 52.43 | 74.80 | 68.65 |
| 3 | ✓ | ✓ | | 39.32 | 39.92 | 63.53 | 47.59 | 70.64 | 53.79 | 85.57 | 70.00 |
| 4 | ✓ | ✓ | ✓ | 39.53 | 41.88 | 64.91 | 48.77 | 77.04 | 53.40 | 82.95 | 71.13 |
| Window Size | Car (Entire) | Ped (Entire) | Cyc (Entire) | mAP (Entire) | Car (Corridor) | Ped (Corridor) | Cyc (Corridor) | mAP (Corridor) |
|---|---|---|---|---|---|---|---|---|
| 2 | 43.48 | 38.59 | 63.18 | 48.41 | 79.59 | 51.10 | 79.95 | 70.21 |
| 4 | 39.53 | 41.88 | 64.91 | 48.77 | 77.04 | 53.40 | 82.95 | 71.13 |
| 8 | 39.35 | 35.69 | 64.49 | 46.51 | 77.58 | 47.98 | 86.37 | 70.64 |
| Window Size | Latency |
|---|---|
| 2 | 68 ms |
| 4 | 61 ms |
| 8 | 65 ms |
| 32 | 69 ms |
| 128 (w/o splitting) | 90 ms |
Citation: Shi, W.; Zhu, Z.; Zhang, K.; Chen, H.; Yu, Z.; Zhu, Y. SMIFormer: Learning Spatial Feature Representation for 3D Object Detection from 4D Imaging Radar via Multi-View Interactive Transformers. Sensors 2023, 23, 9429. https://doi.org/10.3390/s23239429