UniM²AE: Multi-Modal Masked Autoencoders With Unified 3D Representation for 3D Perception in Autonomous Driving
{jianzou,tyhuang}@stu.hit.edu.cn,{yangguanglei,wmzuo}@hit.edu.cn,[email protected]
a method known for its efficacy in handling LiDAR data as mentioned in [8]. In parallel, images are divided into patches, which are embedded together with positional embeddings so that spatial relationships are preserved. After these initial processing steps, the multi-view images and the voxelized LiDAR point cloud are encoded by their designated encoders. The outputs of both encoders are then integrated into a unified 3D volume space. This space not only preserves geometric and semantic nuances but also retains an additional height dimension. The height dimension plays a pivotal role: it allows the features to be projected back to their original modalities, enabling the reconstruction of the initially masked inputs of each data branch.

Compared with indoor scenes, scenarios in autonomous driving are generally expansive, encompassing a greater number of objects and displaying intricate inter-instance relationships. The Multi-modal 3D Interaction Module (MMIM) is employed to amalgamate features from the two branches and to facilitate efficient interaction. This module is built upon stacked 3D deformable self-attention blocks, enabling the modeling of global context at high resolutions.

Comprehensive experiments on nuScenes [9] demonstrate that our pre-training method significantly enhances the model's performance and convergence speed on downstream tasks. Specifically, our UniM²AE improves detection performance by 1.2/1.5 NDS/mAP even with a larger voxel size and promotes BEV map segmentation performance by 6.5 mIoU. According to our ablation study, it also reduces training time almost by half when utilizing the entire dataset.

To sum up, our contributions can be presented as follows:
• We propose UniM²AE, a multi-modal self-supervised pre-training framework with a unified representation in a cohesive 3D volume space. This representation advantageously allows features to be transformed back to the original modalities, facilitating the reconstruction of the multi-modal masked inputs.
• To better interact the semantics and geometries retained in the unified 3D volume space, we introduce a Multi-modal 3D Interaction Module (MMIM) to obtain more informative and powerful features.
• We conduct extensive experiments on various 3D downstream tasks, where UniM²AE notably promotes diverse detectors and shows competitive performance.

and position of objects, which makes it unsuitable for MAE. In this work, we introduce a unified representation with a height dimension in 3D volume space, which captures the detailed height and position of objects.

B. Masked Autoencoders

Masked Autoencoders (MAE) [6] are a self-supervised pre-training method whose pre-text task is to predict masked pixels. Following its success, a series of 3D representation learning methods apply masked modeling to 3D data. Some works [2], [16], [17] reconstruct masked points of indoor objects, while others [3], [18]–[21] predict masked voxels in outdoor scenes. Recent methods propose multi-modal MAE pre-training: [7] transfers 2D pre-trained knowledge to 3D point prediction but fails to exploit the full potential of LiDAR point cloud and image datasets, while [1] tackles the indoor setting with a 2D-3D joint prediction based on point projection but ignores the distinct characteristics of point clouds and images. To address these problems, we propose to predict both masked 2D pixels and masked 3D voxels in a unified representation, focusing on the autonomous driving scenario.

C. Multi-Modal Fusion in 3D Perception

Recently, multi-modal fusion has been well studied in 3D perception. Proposal-level fusion methods adopt proposals in 3D and project them onto images to extract RoI features [22]–[24]. Point-level fusion methods usually paint image semantic features onto foreground LiDAR points and can be classified into input-level decoration [25]–[27] and feature-level decoration [28], [29]. However, the camera-to-LiDAR projection is semantically lossy due to the different densities of the two modalities. Some BEV-based approaches aim to mitigate this problem, but their simple fusion modules fall short in modeling the relationships between objects. Accordingly, we design the Multi-modal 3D Interaction Module to effectively fuse the projected 3D volume features.

III. METHOD

In this section, we first present an overview of our UniM²AE, focusing on the pre-training phase. We then detail the unified representation in 3D volume space and the operations of the integral sub-modules, which cover representation transformation, multi-modal interaction, and reconstruction.
[Fig. 2 diagram: Input → Masked Token Production → 3D Volume Projection → Interaction → Feature Back-Projection → Reconstruction. The LiDAR and camera encoders feed the Multi-modal 3D Interaction Module, and modality-specific decoders reconstruct the masked inputs. Legend: visible camera voxel feature, visible LiDAR voxel feature, masked voxel feature.]
Fig. 2. Pre-training overview of UniM²AE. The LiDAR branch voxelizes the point cloud, while the camera branch divides multiple images into patches; both branches then randomly mask their inputs. The tokens from the two branches are individually embedded and then passed through the Token-Volume projection, the Multi-modal 3D Interaction Module, the Volume-Token projection, and finally the modality-specific decoders. Ultimately, we reconstruct the original inputs using the fused features.
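To make the random masking step in Fig. 2 concrete, the following is a minimal sketch of MAE-style token masking that could be applied to both branches. The function name, the uniform shuffling strategy, and the toy shapes are illustrative assumptions rather than the authors' implementation; only the masking ratios (70% for LiDAR, 75% for the camera) come from the paper.

```python
import torch

def random_mask_tokens(tokens: torch.Tensor, mask_ratio: float, generator=None):
    """Randomly mask a sequence of tokens, MAE-style.

    tokens: (N, C) token embeddings (image patches or non-empty LiDAR voxels).
    Returns the visible tokens, the indices of the visible tokens, and a
    boolean mask that is True at masked positions.
    """
    n = tokens.shape[0]
    n_keep = max(1, int(n * (1.0 - mask_ratio)))
    # Shuffle token indices and keep the first n_keep as "visible".
    perm = torch.randperm(n, generator=generator, device=tokens.device)
    keep_idx = perm[:n_keep]
    mask = torch.ones(n, dtype=torch.bool, device=tokens.device)
    mask[keep_idx] = False                      # False = visible, True = masked
    return tokens[keep_idx], keep_idx, mask

# Toy usage with the masking ratios reported later in the paper.
voxel_tokens = torch.randn(12000, 192)          # hypothetical non-empty voxel embeddings
patch_tokens = torch.randn(6 * 14 * 14, 768)    # hypothetical multi-view patch embeddings
vis_voxels, _, voxel_mask = random_mask_tokens(voxel_tokens, mask_ratio=0.70)
vis_patches, _, patch_mask = random_mask_tokens(patch_tokens, mask_ratio=0.75)
```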
To align features from various modalities while preserving semantics and geometry, (F_I, F_V) are separately projected into the unified 3D volume space, which extends the BEV plane along the height dimension. Specifically, we build a mapping of each voxel to the 3D volume space based on its position in the ego-vehicle coordinates, while for the image tokens, spatial cross-attention is employed for the 2D-to-3D conversion. The projected features (F_V^vol, F_I^vol) are subsequently passed into the Multi-modal 3D Interaction Module (MMIM), aiming at promoting powerful feature fusion.

Following the cross-modal interaction, we project the fused feature F_c' back to the modality-specific tokens, denoted F_V^sp for LiDAR and F_I^proj ∈ (C, H, W) (which is then reshaped to F_I^sp ∈ (HW, C)) for the camera. The camera decoder and LiDAR decoder are finally used to reconstruct the original inputs.

B. Unified Representation in 3D volume space

Different sensors capture data that, while representing the same scene, often provide distinct descriptions. For instance, camera-derived images emphasize the color palette of the environment within their field of view, whereas point clouds primarily capture object locations. Given these variations, selecting an appropriate representation for fusing features from disparate modalities becomes paramount. Such a representation must preserve the unique attributes of the multi-modal information sourced from the various sensors.

In pursuit of capturing the full spectrum of object positioning and appearance, the voxel feature in 3D volume space is adopted as the unified representation, depicted in Figure 1 (b). The 3D volume space uniquely accommodates the height dimension, enabling it to harbor more expansive geometric data and achieve exacting precision in depicting object locations, exemplified by features like elevated traffic signs. This enriched representation naturally amplifies the accuracy of interactions between objects. A salient benefit of the 3D volume space is its capacity for direct remapping to the original modalities, cementing its position as an optimal latent space for integrating features. Due to the intrinsic alignment of images and point clouds within the 3D volume space, the Multi-modal 3D Interaction Module can bolster representations across streams, sidestepping the need for additional alignment mechanisms. Such alignment streamlines the transition between the pre-training and fine-tuning stages, producing favorable outcomes for subsequent tasks. Additionally, the adaptability of the 3D volume space leaves the door open for its extension to encompass three or even more modalities.

C. Multi-modal Interaction

1) Projection to 3D volume space: In the projection of LiDAR to the 3D volume, the voxel embedding is directed to a predefined 3D volume using positions from the ego-car coordinate system. This method ensures that no geometric distortion is introduced. The resulting feature from this process is denoted as F_V^vol. For the image-to-3D-volume projection, the 2D-3D Spatial Cross-Attention method is employed. Following prior works [30], [31], the 3D volume queries for each image are defined as Q^vol ∈ R^{C×H×W×Z}. The corresponding 3D points are then projected to the 2D views using the camera's intrinsic and extrinsic parameters. During this projection, the 3D points associate only with specific views, termed V_hit. The 2D features are then sampled from the views in V_hit at the locations of the projected 3D reference points. The process unfolds as:

    F_I^{vol} = \frac{1}{|V_{hit}|} \sum_{i \in V_{hit}} \sum_{j=1}^{N_{ref}} \mathrm{DeformAttn}\big(Q^{vol}, \mathcal{P}(p, i, j), F_I^{i}\big)    (1)

where i indexes the camera view, j indexes the 3D reference points, and N_ref is the total number of reference points for each 3D volume query. F_I^i is the feature of the i-th camera view. \mathcal{P}(p, i, j) is the projection function that obtains the j-th reference point on the i-th view image.
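A minimal sketch of the 2D-to-3D lifting in Eq. (1) is given below: it projects the 3D reference points of the volume queries into each camera with the intrinsic/extrinsic matrices, keeps only the views that are actually hit (V_hit), and bilinearly samples the image features there. For brevity it averages the sampled features instead of running deformable cross-attention, and the function names, tensor shapes, and the assumed 1600x900 image size are illustrative assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def project_points(ref_xyz, intrinsic, extrinsic):
    """Project 3D points (Q, 3) in ego coordinates into one camera.

    intrinsic: (3, 4) projection matrix (as in Eq. (3)); extrinsic: (4, 4).
    Returns normalized pixel coordinates in [-1, 1] (Q, 2) and a "hit" mask (Q,).
    """
    ones = torch.ones_like(ref_xyz[:, :1])
    homo = torch.cat([ref_xyz, ones], dim=1)              # (Q, 4)
    cam = (intrinsic @ extrinsic @ homo.T).T               # (Q, 3)
    depth = cam[:, 2:3]
    uv = cam[:, :2] / depth.clamp(min=1e-5)                # pixel coordinates
    # Assume a 1600x900 image (nuScenes resolution) for normalization.
    uv_norm = torch.stack([uv[:, 0] / 1600, uv[:, 1] / 900], dim=1) * 2 - 1
    hit = (depth.squeeze(1) > 0) & (uv_norm.abs() <= 1).all(dim=1)
    return uv_norm, hit

def lift_images_to_volume(img_feats, ref_xyz, intrinsics, extrinsics):
    """Average bilinearly-sampled image features over the views each reference
    point hits (a simplified stand-in for Eq. (1)).

    img_feats: (N_view, C, Hf, Wf); ref_xyz: (Q, 3).  Returns (Q, C).
    """
    n_view, c = img_feats.shape[:2]
    acc = torch.zeros(ref_xyz.shape[0], c, device=img_feats.device)
    hits = torch.zeros(ref_xyz.shape[0], 1, device=img_feats.device)
    for i in range(n_view):
        uv, hit = project_points(ref_xyz, intrinsics[i], extrinsics[i])
        grid = uv.view(1, -1, 1, 2)                                  # (1, Q, 1, 2)
        sampled = F.grid_sample(img_feats[i:i + 1], grid,
                                align_corners=False)                  # (1, C, Q, 1)
        sampled = sampled[0, :, :, 0].T                               # (Q, C)
        acc += sampled * hit.unsqueeze(1).float()
        hits += hit.unsqueeze(1).float()
    return acc / hits.clamp(min=1)                                    # average over V_hit
```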
[Fig. 3 diagram: two (HWZ, C) inputs are channel-concatenated (C) into an (HWZ, 2C) token sequence, processed by ×L blocks of 3D deformable self-attention and feed-forward layers, then channel-split (S) back into two (HWZ, C) outputs. Legend: C = channel concatenate, S = channel split.]

Fig. 3. Illustration of our Multi-modal 3D Interaction Module. We first concatenate the inputs F_V^vol and F_I^vol and reshape the result for the subsequent stacked 3D deformable self-attention blocks. After the interaction, we split the output volume feature and project the two parts back to feature tokens. This contributes to more generalized and effective feature learning.

2) Multi-modal 3D Interaction Module: To effectively fuse the projected 3D volume features from the camera (F_I^vol ∈ R^{C×H×W×Z}) and the LiDAR (F_V^vol ∈ R^{C×H×W×Z}) branches, the Multi-modal 3D Interaction Module (MMIM) is introduced. As depicted in Figure 3, MMIM comprises L attention blocks, with L = 3 being the default setting.

Given the emphasis on high performance at high resolutions in downstream tasks and the limited sequence length that standard self-attention can handle, deformable self-attention is selected to alleviate the computational demands. Each block is composed of deformable self-attention, a feed-forward network, and normalization. Initially, F_V^vol and F_I^vol are concatenated along the channel dimension and the result is reshaped to form the query token F_c^vol ∈ R^{HWZ×2C}. This token is then fed into the Multi-modal 3D Interaction Module, an extension of [32], to promote effective modal interaction. The interactive process can be described as follows:

    F_c' = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} W_{mk} \cdot W_m' F_c^{vol}\big(p^{vol} + \Delta p_k^{vol}\big) \Big]    (2)

where m indexes the attention head, k indexes the sampled keys, and K is the total number of sampled keys. \Delta p_k^{vol} and W_{mk} denote the sampling offset and the attention weight of the k-th sampling point in the m-th attention head, respectively. The attention weight W_{mk} lies in the range [0, 1] and is normalized by \sum_{k=1}^{K} W_{mk} = 1. At the end, F_c' is split along the channel dimension to obtain the modality-specific 3D volume features (F_V', F_I').
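The block structure of MMIM can be summarized by the sketch below: channel-concatenate the two volume features, run L attention blocks with feed-forward layers and normalization, and split the channels back into modality-specific features. Purely for brevity, standard multi-head self-attention stands in for the 3D deformable self-attention of Eq. (2); the class name and arguments are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SimplifiedMMIM(nn.Module):
    """Channel-concatenate two 3D volume features, let them interact through
    stacked attention blocks, then split them back."""

    def __init__(self, channels: int = 192, hidden: int = 768,
                 num_blocks: int = 3, num_heads: int = 8):
        super().__init__()
        dim = 2 * channels  # concatenated camera + LiDAR channels
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                # Stand-in for 3D deformable self-attention.
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm1": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                     nn.Linear(hidden, dim)),
                "norm2": nn.LayerNorm(dim),
            }) for _ in range(num_blocks)
        ])

    def forward(self, f_cam: torch.Tensor, f_lidar: torch.Tensor):
        # f_cam, f_lidar: (B, HWZ, C) flattened volume tokens.
        x = torch.cat([f_cam, f_lidar], dim=-1)           # (B, HWZ, 2C)
        for blk in self.blocks:
            attn_out, _ = blk["attn"](x, x, x)
            x = blk["norm1"](x + attn_out)
            x = blk["norm2"](x + blk["ffn"](x))
        return x.chunk(2, dim=-1)                          # (F_cam', F_lidar')

# Toy usage on a tiny volume so standard attention stays tractable.
mmim = SimplifiedMMIM(channels=192)
f_cam = torch.randn(1, 8 * 8 * 4, 192)
f_lidar = torch.randn(1, 8 * 8 * 4, 192)
f_cam_fused, f_lidar_fused = mmim(f_cam, f_lidar)
```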
3) Projection to Modality-specific Token: By tapping into the advantages of the 3D volume representation, the fused feature can be conveniently projected back onto the 2D image plane and the 3D voxel tokens. For the LiDAR branch, the process merely involves sampling the features located at the positions of the masked voxel tokens within the ego-vehicle coordinates. Regarding the camera branch, the corresponding 2D coordinate (u, v) can be determined using the projection function T_proj. The 2D-plane feature F_I^proj is then obtained by mapping the 3D volume feature at (x, y, z) to the position (u, v). The projection function T_proj is defined as:

    z \cdot [u, v, 1]^{T} = T_{proj}(P) = K \cdot R_t \cdot [x, y, z, 1]^{T}    (3)

where P ∈ R^3 is the position in the 3D volume, and K ∈ R^{3×4} and R_t ∈ R^{4×4} are the camera intrinsic and extrinsic matrices.

D. Prediction Target

Three distinct reconstruction tasks supervise the two modal decoders, and a single linear layer is applied to the output of each decoder for each task. The dual-modal reconstruction tasks and their respective loss functions are detailed below. In alignment with Voxel-MAE [34], the prediction focuses on the number of points within each voxel. Supervision for this reconstruction uses the Chamfer distance, which gauges the disparity between two point sets of different scales. Let G_n denote the masked LiDAR point cloud partitioned into voxels, and P_n symbolize the predicted voxels. The Chamfer loss L_c can be presented as:

    L_c = \mathrm{CD}\big(\mathrm{Dec}_V(F_V^{sp}), G_n\big)    (4)

where CD(·) stands for the Chamfer distance function [35], Dec_V(·) denotes the voxel decoder, and F_V^sp represents the projected voxel features.

In addition to the aforementioned reconstruction task, there is a prediction of whether a voxel is empty. Supervision for this aspect employs the binary cross-entropy loss, denoted as L_occ. The cumulative voxel reconstruction loss is thus given as:

    L_{voxel} = L_c + L_{occ}    (5)

For the camera branch, the pixel reconstruction is supervised using the Mean Squared Error (MSE) loss, represented as:

    L_{img} = L_{MSE}\big(\mathrm{Dec}_I(F_I^{sp}), G_I\big)    (6)

where G_I denotes the original images in pixel space, Dec_I(·) denotes the image decoder, and F_I^sp represents the projected image features.
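A compact sketch of the three prediction targets in Eqs. (4)-(6) is shown below: a Chamfer loss on the points of masked voxels, a binary cross-entropy occupancy loss, and an MSE loss on masked pixels. The Chamfer implementation, the batching of per-voxel point sets, and the equal weighting of the voxel and image terms are simplifying assumptions; only the loss structure follows the text.

```python
import torch
import torch.nn.functional as F

def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two point sets (Np, 3) and (Ng, 3)."""
    d = torch.cdist(pred, gt)                      # (Np, Ng) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def pretraining_loss(pred_points, gt_points, pred_occ_logit, gt_occ,
                     pred_pixels, gt_pixels):
    """Combine the voxel and image reconstruction objectives.

    pred_points / gt_points: lists of per-voxel point sets for masked voxels.
    pred_occ_logit / gt_occ:  (Nvox,) occupancy logits and 0/1 targets.
    pred_pixels / gt_pixels:  (Nmasked, P) reconstructed and original patches.
    """
    l_c = torch.stack([chamfer_distance(p, g)
                       for p, g in zip(pred_points, gt_points)]).mean()   # Eq. (4)
    l_occ = F.binary_cross_entropy_with_logits(pred_occ_logit, gt_occ)
    l_voxel = l_c + l_occ                                                  # Eq. (5)
    l_img = F.mse_loss(pred_pixels, gt_pixels)                             # Eq. (6)
    return l_voxel + l_img, {"chamfer": l_c, "occ": l_occ, "img": l_img}

# Toy usage with random tensors.
loss, parts = pretraining_loss(
    pred_points=[torch.randn(5, 3), torch.randn(7, 3)],
    gt_points=[torch.randn(6, 3), torch.randn(4, 3)],
    pred_occ_logit=torch.randn(100), gt_occ=torch.randint(0, 2, (100,)).float(),
    pred_pixels=torch.randn(32, 768), gt_pixels=torch.randn(32, 768),
)
```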
IV. EXPERIMENTS

In this section, we conduct extensive experiments to evaluate our proposed UniM²AE: 1) we compare UniM²AE with different MAE methods using various amounts of annotated data; 2) we evaluate UniM²AE on different downstream tasks, including 3D object detection and BEV map segmentation; 3) we conduct diverse ablation studies to evaluate the effectiveness of our self-supervised method.
TABLE I
DATA-EFFICIENT 3D OBJECT DETECTION RESULTS ON THE NUSCENES VALIDATION SET. BACKBONES IN SINGLE-MODALITY AND MULTI-MODALITY ARE PRE-TRAINED USING VARIOUS MAE METHODS. THE MODEL PERFORMANCES ARE REPORTED USING DIFFERENT AMOUNTS OF FINE-TUNING DATA. RANDOM DENOTES TRAINING FROM SCRATCH. MIM+VOXEL-MAE DENOTES THE COMBINATION OF THE WEIGHTS PRE-TRAINED USING GREENMIM [33] AND VOXEL-MAE [34]. L AND C REPRESENT LIDAR AND CAMERA, RESPECTIVELY. *: OUR RE-IMPLEMENTATION.
Data amount Modality Initialization mAP NDS Car Truck C.V. Bus Trailer Barrier Motor Bike Ped. T.C.
Random 44.3 56.3 78.8 41.1 13.1 50.7 18.8 52.8 46.1 17.3 75.7 49.1
L Voxel-MAE* 48.9 59.8 80.9 47.0 12.8 59.0 23.6 61.9 47.8 23.5 79.9 52.4
UniM2 AE 50.0 60.0 81.0 47.8 12.0 57.3 24.0 62.3 51.2 26.8 81.4 55.8
20%
Random 51.5 50.9 84.1 47.6 13.3 49.8 27.9 65.0 53.9 26.8 78.8 67.8
C+L MIM+Voxel-MAE 54.3 51.2 84.3 50.8 18.9 52.3 28.9 68.4 57.4 32.1 80.3 69.2
UniM2 AE 55.9 52.8 85.8 51.1 19.3 54.2 30.6 69.0 61.1 34.3 83.0 70.8
Random 50.9 60.5 81.4 47.6 13.7 58.0 24.5 61.3 57.7 30.1 80.4 54.4
L Voxel-MAE* 52.6 62.2 82.4 49.1 15.4 62.2 25.8 64.5 56.8 30.4 82.3 57.5
UniM2 AE 52.9 62.6 82.7 49.2 15.8 60.1 23.7 65.5 58.4 31.2 83.8 58.9
40%
Random 58.6 61.9 86.2 54.7 21.7 60.0 33.0 70.7 64.2 38.6 83.0 74.3
C+L MIM+Voxel-MAE 60.2 63.5 86.6 56.5 22.5 64.1 33.5 72.2 66.1 41.9 83.8 74.8
UniM2 AE 62.0 64.5 87.0 57.8 22.8 62.7 38.7 69.7 66.8 50.5 86.0 77.9
Random 51.9 61.7 82.2 49.0 15.6 61.2 24.9 62.9 56.3 32.1 81.6 53.1
L Voxel-MAE* 54.2 63.5 83.0 51.1 16.3 62.0 27.5 64.9 61.2 34.7 82.9 58.0
UniM2 AE 54.7 63.8 83.1 51.0 17.3 62.5 26.9 65.1 62.2 35.7 83.4 59.9
60%
Random 61.6 65.2 87.3 58.5 23.9 65.2 35.8 71.9 67.8 46.7 85.7 77.0
C+L MIM+Voxel-MAE 62.1 65.7 87.2 56.7 23.0 65.4 37.0 71.7 70.6 47.3 85.6 76.7
UniM2 AE 62.4 66.1 87.7 59.7 23.8 67.6 37.0 70.5 68.4 48.9 86.6 77.8
Random 52.7 62.5 82.3 49.6 16.0 63.3 25.8 60.7 58.6 31.6 82.0 56.7
L Voxel-MAE* 55.1 64.2 83.4 51.7 18.8 64.0 28.7 63.8 62.2 35.1 84.3 58.7
UniM2 AE 55.6 64.6 83.4 52.9 18.2 64.2 29.4 64.7 63.1 36.1 84.5 58.8
80%
Random 62.5 66.1 87.1 57.6 24.0 66.4 38.1 71.1 68.4 48.8 86.2 77.6
C+L MIM+Voxel-MAE 63.0 66.4 87.6 59.6 24.1 66.1 38.0 71.3 70.2 48.8 86.5 78.1
UniM2 AE 63.9 67.1 87.7 59.6 24.9 69.2 39.8 71.3 71.0 50.2 86.8 78.7
Random 53.6 63.0 82.3 49.8 16.7 64.0 26.2 60.9 61.7 33.0 82.2 58.9
L Voxel-MAE* 55.3 64.1 83.2 51.2 16.8 64.6 28.3 65.0 61.8 39.6 83.6 58.9
UniM2 AE 55.8 64.6 83.3 51.3 17.6 63.7 28.6 65.4 62.8 40.8 83.9 60.3
100%
Random 63.6 67.4 87.7 58.0 26.6 67.8 38.4 72.6 71.8 47.6 87.0 78.9
C+L MIM+Voxel-MAE 63.7 67.7 87.6 58.3 25.3 67.1 38.8 70.8 71.7 51.5 86.7 78.9
UniM2 AE 64.3 68.1 87.9 57.8 24.3 68.6 42.2 71.6 72.5 51.0 87.2 79.5
A. Implementation Details

1) Dataset and Metrics: The nuScenes dataset [9], a comprehensive autonomous driving dataset, serves as the primary dataset for both pre-training our model and evaluating its performance on multiple downstream tasks. This dataset encompasses 1,000 sequences gathered from Boston and Singapore, with 700 designated for training and 300 split evenly between validation and testing. Each sequence, recorded at 10 Hz, spans 20 seconds and is annotated at a frequency of 2 Hz. For 3D detection, the principal evaluation metrics are the mean Average Precision (mAP) and the nuScenes detection score (NDS). For BEV map segmentation, the methodology aligns with the dataset's map expansion pack, using Intersection over Union (IoU) as the assessment metric.

2) Network Architectures: UniM²AE utilizes SST [43] and Swin-T [44] as the backbones for the LiDAR encoder and camera encoder, respectively. In the Multi-modal 3D Interaction Module, 3 deformable self-attention blocks are stacked, with each attention module comprising 192 input channels and 768 hidden channels. To facilitate the transfer of pre-trained weights to downstream tasks, BEVFusion-SST and TransFusion-L-SST are introduced, with the LiDAR backbone in these architectures being replaced by SST.

3) Pre-training: During this stage, the perception range is restricted to [−50m, 50m] along the X and Y axes and [−5m, 3m] along the Z axis. Each voxel has dimensions of (0.5m, 0.5m, 4m). In terms of input data masking during the training phase, our experiments have determined that a masking ratio of 70% for the LiDAR branch and 75% for the camera branch yields optimal results. By default, all the MAE methods are trained for a total of 200 epochs on 8 GPUs with a base learning rate of 2.5e-5. Detailed configurations are reported in the supplemental material.
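For reference, the pre-training hyper-parameters listed in this subsection can be collected into a single configuration sketch. The dictionary layout and key names are assumptions made for illustration; the values are the ones reported above.

```python
# Hypothetical configuration sketch; keys and structure are illustrative,
# values follow the pre-training settings reported in the text.
pretrain_config = {
    "point_cloud_range": {          # metres, ego-vehicle frame
        "x": (-50.0, 50.0),
        "y": (-50.0, 50.0),
        "z": (-5.0, 3.0),
    },
    "voxel_size": (0.5, 0.5, 4.0),  # metres
    "mask_ratio": {"lidar": 0.70, "camera": 0.75},
    "encoders": {"lidar": "SST", "camera": "Swin-T"},
    "mmim": {"blocks": 3, "in_channels": 192, "hidden_channels": 768},
    "schedule": {"epochs": 200, "gpus": 8, "base_lr": 2.5e-5},
}

# The resulting 3D volume grid has (200, 200, 2) voxels for this range
# and a (0.5 m, 0.5 m, 4 m) voxel size.
grid = tuple(
    int(round((hi - lo) / s))
    for (lo, hi), s in zip(
        (pretrain_config["point_cloud_range"][a] for a in ("x", "y", "z")),
        pretrain_config["voxel_size"],
    )
)
assert grid == (200, 200, 2)
```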
4) Fine-tuning: Utilizing the encoders from UniM²AE for both the camera and the LiDAR, we then fine-tune and assess the capabilities of the learned features on both single-modal and multi-modal tasks. For single-modal tasks, one of the pre-trained encoders serves as the feature extractor. For multi-modal tasks, which include 3D object detection and BEV map segmentation, both the LiDAR encoder and the camera encoder are employed as the feature extractors for these downstream tasks. It is pivotal to note that while the decoders play a role during the pre-training phase, they are omitted during fine-tuning. The pre-trained feature extractors are compared with a variety of baselines across different tasks under a consistent experimental setup. Notably, when integrated into fusion-based methodologies, the pre-trained Multi-modal 3D Interaction Module shows competitive performance. Detailed configurations are in the supplemental material.
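Transferring the pre-trained encoders into a downstream detector amounts to copying the matching backbone weights and discarding the pre-training decoders. The sketch below illustrates that step; the checkpoint layout and the prefix names (lidar_encoder., camera_encoder.) are hypothetical, not the actual released checkpoints.

```python
import torch

def load_pretrained_encoders(detector: torch.nn.Module, ckpt_path: str,
                             prefixes=("lidar_encoder.", "camera_encoder.")):
    """Copy pre-trained encoder weights into a downstream model.

    Keeps only parameters whose names start with the given encoder prefixes
    and also exist in the detector (decoders from pre-training are dropped).
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("state_dict", ckpt)
    model_state = detector.state_dict()
    transferred = {
        k: v for k, v in state.items()
        if k.startswith(prefixes) and k in model_state
        and v.shape == model_state[k].shape
    }
    result = detector.load_state_dict(transferred, strict=False)
    print(f"transferred {len(transferred)} tensors; "
          f"{len(result.missing_keys)} left at their initialization")
    return detector
```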
TABLE II
PERFORMANCES OF 3D OBJECT DETECTION ON THE NUSCENES VALIDATION SPLIT. W/ MMIM MEANS APPLYING THE PRE-TRAINED MMIM TO THE DOWNSTREAM TASK.
Method Modality Voxel Size(m) NDS↑ mAP↑ mATE↓ mASE↓ mAOE↓ mAVE↓ mAAE↓
BEVDet [36] C - 47.2 39.3 60.8 25.9 36.6 82.2 19.1
BEVFormer [31] C - 51.7 41.6 67.3 27.4 37.2 39.4 19.8
PETRv2 [37] C - 52.4 42.1 68.1 26.7 35.7 37.7 18.6
CenterPoint [38] L [0.075, 0.075, 0.2] 66.8 59.6 29.2 25.5 30.2 25.9 19.4
LargeKernel3D [39] L [0.075, 0.075, 0.2] 69.1 63.9 28.6 25.0 35.1 21.1 18.7
TransFusion-L [40] L [0.075, 0.075, 0.2] 70.1 65.4 - - - - -
TransFusion-L-SST L [0.15, 0.15, 8] 69.9 65.0 28.0 25.3 30.1 24.1 19.0
UniM2 AE-L L [0.15, 0.15, 8] 70.4 65.7 28.0 25.2 29.5 23.5 18.6
FUTR3D [24] C+L [0.075, 0.075, 0.2] 68.3 64.5 - - - - -
Focals Conv [41] C+L [0.075, 0.075, 0.2] 69.2 64.0 33.2 25.4 27.8 26.8 19.3
MVP [27] C+L [0.075, 0.075, 0.2] 70.7 67.0 28.9 25.1 28.1 27.0 18.9
TransFusion [40] C+L [0.075, 0.075, 0.2] 71.2 67.3 27.2 25.2 27.4 25.4 19.0
MSMDFusion [42] C+L [0.075, 0.075, 0.2] 72.1 69.3 - - - - -
BEVFusion [14] C+L [0.075, 0.075, 0.2] 71.4 68.5 28.7 25.4 30.4 25.6 18.7
BEVFusion-SST C+L [0.15, 0.15, 8] 71.5 68.2 27.8 25.3 30.2 23.6 18.9
UniM2 AE C+L [0.15, 0.15, 8] 71.9 68.4 27.2 25.2 28.8 23.2 18.7
UniM2 AE w/ MMIM C+L [0.15, 0.15, 4] 72.7 69.7 26.9 25.2 27.3 23.2 18.9
TABLE III
PERFORMANCES OF THE BEV MAP SEGMENTATION ON THE NUSCENES VALIDATION SPLIT. W/ MMIM MEANS APPLYING THE PRE-TRAINED MMIM TO THE DOWNSTREAM TASK. *: RE-IMPLEMENTED BY TRAINING FROM SCRATCH.
Method Modality Drivable Ped. Cross. Walkway Stop Line Carpark Divider mIoU
CVT [45] C 74.3 36.8 39.9 25.8 35.0 29.4 40.2
OFT [46] C 74.0 35.3 45.9 27.5 35.9 33.9 42.1
LSS [47] C 75.4 38.8 46.3 30.3 39.1 36.5 44.4
M2 BEV [48] C 77.2 - - - - 40.5 -
BEVFusion* [14] C 78.2 48.0 53.5 40.4 45.3 41.7 51.2
UniM2 AE C 79.5 50.5 54.9 42.4 47.3 42.9 52.9
PointPillars [49] L 72.0 43.4 53.1 29.7 27.7 37.5 43.8
CenterPoint [38] L 75.6 48.4 57.5 36.5 31.7 41.9 48.6
MVP [27] C+L 76.1 48.7 57.0 36.9 33.0 42.2 49.0
PointPainting [25] C+L 75.9 48.5 57.7 36.9 34.5 41.9 49.1
BEVFusion [14] C+L 85.5 60.5 67.6 52.0 57.0 53.7 62.7
X-Align [50] C+L 86.8 65.2 70.0 58.3 57.1 58.2 65.7
BEVFusion-SST C+L 84.9 59.2 66.3 48.7 56.0 52.7 61.3
UniM2 AE C+L 85.1 59.7 66.6 48.7 56.0 52.6 61.4
UniM2 AE w/ MMIM C+L 88.7 67.4 72.9 59.0 59.0 59.7 67.8
[8] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, "End-to-end multi-view fusion for 3d object detection in lidar point clouds," in Conference on Robot Learning. PMLR, 2020, pp. 923–932.
[9] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuscenes: A multimodal dataset for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[10] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[11] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, "Filip: Fine-grained interactive language-image pre-training," arXiv preprint arXiv:2111.07783, 2021.
[12] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, "Pointclip: Point cloud understanding by clip," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8552–8562.
[13] T. Huang, B. Dong, Y. Yang, X. Huang, R. W. Lau, W. Ouyang, and W. Zuo, "Clip2point: Transfer clip to point cloud classification with image-depth pre-training," arXiv preprint arXiv:2210.01055, 2022.
[14] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, and S. Han, "Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation," in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 2774–2781.
[15] T. Liang, H. Xie, K. Yu, Z. Xia, Z. Lin, Y. Wang, T. Tang, B. Wang, and Z. Tang, "Bevfusion: A simple and robust lidar-camera fusion framework," Advances in Neural Information Processing Systems, vol. 35, pp. 10421–10434, 2022.
[16] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, "Point-bert: Pre-training 3d point cloud transformers with masked point modeling," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19313–19322.
[17] H. Liu, M. Cai, and Y. J. Lee, "Masked discrimination for self-supervised learning on point clouds," in European Conference on Computer Vision. Springer, 2022, pp. 657–675.
[18] C. Min, D. Zhao, L. Xiao, Y. Nie, and B. Dai, "Voxel-mae: Masked autoencoders for pre-training large-scale point clouds," arXiv preprint arXiv:2206.09900, 2022.
[19] R. Xu, T. Wang, W. Zhang, R. Chen, J. Cao, J. Pang, and D. Lin, "Mv-jar: Masked voxel jigsaw and reconstruction for lidar-based self-supervised pre-training," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13445–13454.
[20] H. Yang, T. He, J. Liu, H. Chen, B. Wu, B. Lin, X. He, and W. Ouyang, "Gd-mae: Generative decoder for mae pre-training on lidar point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9403–9414.
[21] A. Boulch, C. Sautier, B. Michele, G. Puy, and R. Marlet, "Also: Automotive lidar self-supervision by occupancy estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13455–13465.
[22] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3d object detection network for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
[23] R. Nabati and H. Qi, "Centerfusion: Center-based radar and camera fusion for 3d object detection," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1527–1536.
[24] X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, "Futr3d: A unified sensor fusion framework for 3d detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 172–181.
[25] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, "Pointpainting: Sequential fusion for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4604–4612.
[26] C. Wang, C. Ma, M. Zhu, and X. Yang, "Pointaugmenting: Cross-modal augmentation for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11794–11803.
[27] T. Yin, X. Zhou, and P. Krähenbühl, "Multimodal virtual point 3d detection," Advances in Neural Information Processing Systems, vol. 34, pp. 16494–16507, 2021.
[28] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3d object detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 641–656.
[29] Y. Li, A. W. Yu, T. Meng, B. Caine, J. Ngiam, D. Peng, J. Shen, Y. Lu, D. Zhou, Q. V. Le, et al., "Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17182–17191.
[30] Y. Wei, L. Zhao, W. Zheng, Z. Zhu, J. Zhou, and J. Lu, "Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving," arXiv preprint arXiv:2303.09551, 2023.
[31] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, "Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers," in European Conference on Computer Vision. Springer, 2022, pp. 1–18.
[32] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable detr: Deformable transformers for end-to-end object detection," arXiv preprint arXiv:2010.04159, 2020.
[33] L. Huang, S. You, M. Zheng, F. Wang, C. Qian, and T. Yamasaki, "Green hierarchical vision transformer for masked image modeling," Advances in Neural Information Processing Systems, vol. 35, pp. 19997–20010, 2022.
[34] G. Hess, J. Jaxing, E. Svensson, D. Hagerman, C. Petersson, and L. Svensson, "Masked autoencoder for self-supervised pre-training on lidar point clouds," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 350–359.
[35] H. Fan, H. Su, and L. J. Guibas, "A point set generation network for 3d object reconstruction from a single image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 605–613.
[36] J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, "Bevdet: High-performance multi-camera 3d object detection in bird-eye-view," arXiv preprint arXiv:2112.11790, 2021.
[37] Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, X. Zhang, and J. Sun, "Petrv2: A unified framework for 3d perception from multi-camera images," arXiv preprint arXiv:2206.01256, 2022.
[38] T. Yin, X. Zhou, and P. Krahenbuhl, "Center-based 3d object detection and tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11784–11793.
[39] Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia, "Largekernel3d: Scaling up kernels in 3d sparse cnns," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13488–13498.
[40] X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C.-L. Tai, "Transfusion: Robust lidar-camera fusion for 3d object detection with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1090–1099.
[41] Y. Chen, Y. Li, X. Zhang, J. Sun, and J. Jia, "Focal sparse convolutional networks for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5428–5437.
[42] Y. Jiao, Z. Jie, S. Chen, J. Chen, L. Ma, and Y.-G. Jiang, "Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21643–21652.
[43] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, "Embracing single stride 3d object detector with sparse transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8458–8468.
[44] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[45] B. Zhou and P. Krähenbühl, "Cross-view transformers for real-time map-view semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13760–13769.
[46] T. Roddick, A. Kendall, and R. Cipolla, "Orthographic feature transform for monocular 3d object detection," arXiv preprint arXiv:1811.08188, 2018.
[47] J. Philion and S. Fidler, "Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV. Springer, 2020, pp. 194–210.
[48] E. Xie, Z. Yu, D. Zhou, J. Philion, A. Anandkumar, S. Fidler, P. Luo, and J. M. Alvarez, "M²bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation," arXiv preprint arXiv:2204.05088, 2022.
TABLE VIII
FINE-TUNING CONFIGURATION. DET DENOTES THE 3D OBJECT DETECTION CONFIGURATION. SEG DENOTES THE BEV MAP SEGMENTATION TASK CONFIGURATION.

Config | Det | Seg
point cloud range - x | [-54.0m, 54.0m] | [-51.2m, 51.2m]
point cloud range - y | [-54.0m, 54.0m] | [-51.2m, 51.2m]
point cloud range - z | [-3.0m, 5.0m] | [-3.0m, 5.0m]
optimizer | AdamW | AdamW
base lr | 1e-4 | 1e-4
weight decay | 0.01 | 0.01
batch size | 4 | 4

pre-training we first transfer the weights of the UniM²AE LiDAR encoder to TransFusion-L-SST and fine-tune it. We follow the TransFusion [40] training schedule, and the model obtained by this fine-tuning is denoted as UniM²AE-L. As for the multi-modal strategies, the weights of the LiDAR encoder in UniM²AE-L and the camera encoder pre-trained by UniM²AE are loaded to fine-tune BEVFusion-SST following the BEVFusion [14] training schedule. Furthermore, we replace the fusion module in BEVFusion-SST with the pre-trained MMIM and obtain UniM²AE w/ MMIM.

In the BEV map segmentation task, unlike the previous two-stage training schedule used for 3D object detection, we directly fine-tune the multi-modal BEVFusion-SST with a [0.2m, 0.2m, 4m] voxel size for 24 epochs and the camera-only BEVFusion [14] for 20 epochs. The changes regarding the backbone and fusion modules are the same as for the 3D detection task. For the camera-only detectors, all configurations are aligned with the camera-only BEVFusion [14].

dataset SUN RGB-D [56], PiMAE [1] first uses FPS to filter 2,048 points from about 20,000 points and then sends them into FPS and KNN to form 128 groups in order to save GPU memory. However, in the outdoor dataset, the number of points in one scene often reaches almost 200,000, and if FPS is still employed to filter out only 2,048 points, it might undersample critical areas since FPS focuses on spreading out the sampled points. This can lead to missing important local structures, such as pedestrians, small obstacles, or even cars far away from the ego vehicle, which are essential for MAE methods to model the masked point clouds. Another option is to increase the number of output groups of FPS on the outdoor dataset. For example, we could filter out 20,000 points from 200,000 points, send them into the joint FPS+KNN operation, and finally process the 1,000 output groups with the subsequent point cloud encoder. Nevertheless, since PiMAE [1] utilizes multiple standard ViTs [57], whose computational complexity is O(n²), in both the encoder and the decoder, the computational burden is still unacceptable even after downsampling 200 times.
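To illustrate why FPS "focuses on spreading out the sampled points" and can therefore under-represent small local structures when only a tiny fraction of a scene is kept, here is a generic farthest point sampling sketch. It is a reference implementation for the discussion above, not code from PiMAE or UniM²AE; the toy usage mirrors the indoor setting (about 20,000 points filtered down to 2,048).

```python
import torch

def farthest_point_sampling(points: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Greedy FPS: repeatedly pick the point farthest from the selected set.

    points: (N, 3) point cloud; returns indices of the n_samples chosen points.
    Each new pick maximizes coverage, which spreads samples over the whole
    scene and leaves dense but small structures only sparsely represented.
    """
    n = points.shape[0]
    selected = torch.zeros(n_samples, dtype=torch.long)
    # Distance from every point to the current selected set.
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(0, n, (1,)).item()   # random seed point
    for i in range(n_samples):
        selected[i] = farthest
        d = ((points - points[farthest]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)
        farthest = int(torch.argmax(dist))
    return selected

# Toy usage: the indoor-scale setting discussed above (20,000 -> 2,048 points).
scene = torch.randn(20_000, 3) * 5.0
kept = farthest_point_sampling(scene, n_samples=2048)
subsampled = scene[kept]
```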
E. Visualization
In Figure 5, we provide examples of reconstruction visualizations. Our UniM²AE is able to reconstruct the masked LiDAR point clouds and multi-view images, accurately reflecting semantic and geometric understanding.
[Fig. 5 panels, per scene: Image, Masked Image, Image Reconstruction; Point Cloud, Masked Point Cloud, Point Cloud Reconstruction.]
Fig. 5. Visualization of reconstruction results. The reconstructions for two different scenes are presented, each including 6 images and a point cloud. For ease of observation, we zoom in on the point cloud within [0m, 15m] along the X axis and [−7.5m, 7.5m] along the Y axis.