
UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Jian Zou1, Tianyu Huang1, Guanglei Yang1, Zhenhua Guo2, Wangmeng Zuo1
1 Harbin Institute of Technology, China   2 Tianyi Traffic Technology, China

{jianzou,tyhuang}@stu.hit.edu.cn,{yangguanglei,wmzuo}@hit.edu.cn,[email protected]
arXiv:2308.10421v2 [cs.CV] 30 Aug 2023

Abstract—Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it is commonplace to deploy multiple sensors for comprehensive environment perception. While integrating multi-modal features from these sensors can produce rich and powerful features, there is a noticeable gap in MAE methods addressing this integration. This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming to pioneer a more efficient fusion of two distinct modalities. To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, UniM2AE is proposed. This model stands as a potent yet straightforward multi-modal self-supervised pre-training framework, mainly consisting of two designs. First, it projects the features from both modalities into a cohesive 3D volume space, ingeniously expanded from the bird's eye view (BEV) to include the height dimension. The extension makes it possible to back-project the informative features, obtained by fusing features from both modalities, into their native modalities to reconstruct the multiple masked inputs. Second, the Multi-modal 3D Interaction Module (MMIM) is invoked to facilitate efficient inter-modal interaction. Extensive experiments conducted on the nuScenes Dataset attest to the efficacy of UniM2AE, indicating enhancements in 3D object detection and BEV map segmentation by 1.2% (NDS) and 6.5% (mIoU), respectively. Code is available at https://github.com/hollow-503/UniM2AE.

Index Terms—Masked Autoencoders, Multi-modal, Autonomous driving.

Fig. 1. (a) Multi-modal frameworks [1] that align the masked inputs before feature extraction but ignore the feature characteristics of the two branches. (b) UniM2AE, which interacts multi-modal features in a unified representation.

I. INTRODUCTION

Autonomous driving marks a transformative leap in transportation, heralding potential enhancements in safety, efficiency, and accessibility. Fundamental to this progression is the vehicle's capability to decode its surroundings. To tackle the intricacies of real-world contexts, integration of various sensors is imperative: cameras yield detailed visual insights, LiDAR grants exact geometric data, etc. Through this multi-sensor fusion, a comprehensive grasp of the environment is achieved. While current Masked Autoencoders (MAE) excel at learning robust representations for a single modality [2], [3], they struggle to effectively combine the semantic depth of images with the geometric nuances of LiDAR point clouds, especially in the realm of autonomous driving. This limitation presents an obstacle to attaining the nuanced understanding necessary for crafting robust and potent representations.

MAE has shown notable success in 2D vision [4]–[6]. Some self-supervised frameworks, such as [7], attempt to use 2D pre-trained knowledge to inform 3D MAE pre-training. However, these efforts yield only minor performance enhancements, primarily due to challenges in bridging 2D and 3D data spaces. On the other hand, as shown in Figure 1 (a), methods like PiMAE [1] focus on directly fusing 2D and 3D data. However, these approaches often neglect the differences between indoor and outdoor scenes. Such oversight complicates their application in autonomous driving contexts, which are characterized by extensive areas and intricate occlusion relationships.

To address the identified challenges, UniM2AE is introduced as a self-supervised pre-training framework optimized for integrating two distinct modalities: images and LiDAR data. The strategy aims to establish a unified representation space, enhancing the fusion of these modalities as shown in Figure 1 (b). In this framework, the semantic richness of images is seamlessly merged with the geometric details captured by LiDAR to produce a robust and informative feature.

The transformation to a 3D volume space is a critical step in ensuring comprehensive interaction between the multi-modal data sources.
Starting with the LiDAR data, it is processed into voxels and given a specific embedding through DynamicVFE, a method known for its efficacy in handling LiDAR data, as mentioned in [8]. In parallel, the images undergo a division process in which they are segmented into patches. These patches are then embedded using positional embeddings, ensuring spatial relationships are preserved. After these initial processing steps, the multi-view images and voxelized LiDAR point clouds are interpreted and encoded by their designated encoders. The output from both encoders is then integrated into a unified 3D volume space. Not only does this ensure the maintenance of both geometric and semantic nuances, but it also accommodates an additional height dimension. This height element plays a pivotal role in the process: it grants the capability to back-project the features to their original modalities, enabling the reconstruction of the initially masked inputs from the various data branches.

Compared with indoor scenes, scenarios in autonomous driving are generally expansive, encompassing a greater number of objects and displaying intricate inter-instance relationships. The Multi-modal 3D Interaction Module (MMIM) is employed to amalgamate features from the dual branches and facilitate efficient interaction. This module is built upon stacked 3D deformable self-attention blocks, enabling the modeling of global context at elevated resolutions.

Comprehensive experiments on nuScenes [9] demonstrate that our pre-training method significantly enhances the model's performance and convergence speed in downstream tasks. Specifically, our UniM2AE improves detection performance by 1.2/1.5 NDS/mAP even with a larger voxel size and promotes BEV map segmentation performance by 6.5 mIoU. According to our ablation study, it also reduces training time almost by half when utilizing the entire dataset.

To sum up, our contributions can be presented as follows:

• We propose UniM2AE, a multi-modal self-supervised pre-training framework with a unified representation in a cohesive 3D volume space. This representation advantageously allows for feature transformation back to the original modality, facilitating the reconstruction of multi-modal masked inputs.
• To better interact the semantics and geometries retained in the unified 3D volume space, we introduce a Multi-modal 3D Interaction Module (MMIM) to obtain more informative and powerful features.
• We conduct extensive experiments on various 3D downstream tasks, where UniM2AE notably promotes diverse detectors and shows competitive performance.

II. RELATED WORK

A. Multi-Modal Representation

Multi-modal representation has raised significant interest recently, especially in vision-language pre-training [10], [11]. Some works [12], [13] align point cloud data to 2D vision by depth projection. As for unifying 3D with other modalities, the bird's-eye view (BEV) is a widely-used representation, since the transformation to BEV space retains both geometric structure and semantic density. Although many SOTA methods [14], [15] adopt BEV as the unified representation, the lack of height information leads to a poor description of the shape and position of objects, which makes it unsuitable for MAE. In this work, we introduce a unified representation with a height dimension in 3D volume space, which captures the detailed height and position of objects.

B. Masked Autoencoders

Masked Autoencoders (MAE) [6] are a self-supervised pre-training method whose pre-text task is predicting masked pixels. Following its success, a series of 3D representation learning methods apply masked modeling to 3D data. Some works [2], [16], [17] reconstruct masked points of indoor objects, while others [3], [18]–[21] predict the masked voxels in outdoor scenes. Recent methods propose multi-modal MAE pre-training: [7] exploits 2D pre-trained knowledge for 3D point prediction but fails to exploit the full potential of LiDAR point cloud and image datasets; [1] attempts to tackle it in the indoor scene and conducts a 2D-3D joint prediction by the projection of points, but ignores the characteristics of point clouds and images. To address these problems, we propose to predict both masked 2D pixels and masked 3D voxels in a unified representation, focusing on the autonomous driving scenario.

C. Multi-Modal Fusion in 3D Perception

Recently, multi-modal fusion has been well-studied in 3D perception. Proposal-level fusion methods adopt proposals in 3D and project the proposals to images to extract RoI features [22]–[24]. Point-level fusion methods usually paint image semantic features onto foreground LiDAR points, and can be classified into input-level decoration [25]–[27] and feature-level decoration [28], [29]. However, the camera-to-LiDAR projection is semantically lossy due to the different densities of the two modalities. Some BEV-based approaches aim to mitigate this problem, but their simple fusion modules fall short in modeling the relationships between objects. Accordingly, we design the Multi-modal 3D Interaction Module to effectively fuse the projected 3D volume features.

III. METHOD

In this section, we first present an overview of our UniM2AE, specifically focusing on the pre-training phase. We then detail the unified representation in 3D volume space and the operations of our integral sub-modules, which include representation transformation, multi-modal interaction, and reconstruction.

A. Overview Architecture

As shown in Figure 2, UniM2AE learns multi-modal representations by masking the inputs (I, V) and jointly combining the features projected into the 3D volume space (F_V^vol, F_I^vol) to accomplish the reconstruction. In our proposed pipeline, the point cloud is first embedded into tokens after voxelization and, similarly, we embed the images with position encoding after dividing them into non-overlapping patches. Following this, tokens from both modalities are randomly masked, producing (M_I, M_V). Separate transformer-based encoders are then utilized to extract features (F_I, F_V).
Fig. 2. Pre-training overview of UniM2AE. The LiDAR branch voxelizes the point cloud, while the camera branch divides multiple images into patches; both branches subsequently randomly mask their inputs. The tokens from the two branches are individually embedded and then passed through the Token-Volume projection, the Multi-modal 3D Interaction Module, the Volume-Token projection, and eventually the modality-specific decoders. Ultimately, we reconstruct the original inputs using the fused features.

To align features from various modalities while preserving semantics and geometry, (F_I, F_V) are separately projected into the unified 3D volume space, which extends BEV along the height dimension. Specifically, we build a mapping of each voxel to the 3D volume space based on its position in the ego-vehicle coordinates, while for the image tokens, spatial cross-attention is employed for the 2D-to-3D conversion. The projected features (F_V^vol, F_I^vol) are subsequently passed into the Multi-modal 3D Interaction Module (MMIM), aiming to promote powerful feature fusion.

Following the cross-modal interaction, we project the fused feature F_c' back to modality-specific tokens, denoted F_V^sp for LiDAR and F_I^proj ∈ (C, H, W) (which is then reshaped to F_I^sp ∈ (HW, C)) for the camera. The camera decoder and LiDAR decoder are finally used to reconstruct the original inputs.
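To make the masking step at the start of this pipeline concrete, the following is a minimal PyTorch sketch of the random token masking applied to both branches; the tensor shapes, helper name, and example token counts are illustrative assumptions rather than the released implementation.

```python
import torch

def random_mask_tokens(tokens: torch.Tensor, mask_ratio: float):
    """Randomly drop a fraction of tokens, MAE-style.

    tokens: (N, C) token features for one sample (image patches or voxels).
    Returns the visible tokens plus index tensors needed to restore the order
    and to locate the masked positions for reconstruction.
    """
    num_tokens = tokens.shape[0]
    num_keep = int(num_tokens * (1.0 - mask_ratio))

    # Shuffle token indices and keep the first `num_keep` as visible.
    shuffle = torch.randperm(num_tokens, device=tokens.device)
    visible_idx = shuffle[:num_keep]
    masked_idx = shuffle[num_keep:]
    return tokens[visible_idx], visible_idx, masked_idx

# Example with the ratios reported later in the paper: 75% for image
# patches and 70% for LiDAR voxels (token counts are illustrative).
image_tokens = torch.randn(1176, 96)
voxel_tokens = torch.randn(5000, 128)
vis_img, _, _ = random_mask_tokens(image_tokens, mask_ratio=0.75)
vis_vox, _, _ = random_mask_tokens(voxel_tokens, mask_ratio=0.70)
```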
B. Unified Representation in 3D Volume Space

Different sensors capture data that, while representing the same scene, often provide distinct descriptions. For instance, camera-derived images emphasize the color palette of the environment within their field of view, whereas point clouds primarily capture object locations. Given these variations, selecting an appropriate representation for fusing features from disparate modalities becomes paramount. Such a representation must preserve the unique attributes of multi-modal information sourced from various sensors.

In pursuit of capturing the full spectrum of object positioning and appearance, the voxel feature in 3D volume space is adopted as the unified representation, depicted in Figure 1 (b). The 3D volume space uniquely accommodates the height dimension, enabling it to harbor more expansive geometric data and achieve exacting precision in depicting object locations, exemplified by features like elevated traffic signs. This enriched representation naturally amplifies the accuracy of interactions between objects. A salient benefit of the 3D volume space is its capacity for direct remapping to the original modalities, cementing its position as an optimal latent space for integrating features. Due to the intrinsic alignment of images and point clouds within the 3D volume space, the Multi-modal 3D Interaction Module can bolster representations across streams, sidestepping the need for additional alignment mechanisms. Such alignment streamlines the transition between the pre-training and fine-tuning stages, producing favorable outcomes for subsequent tasks. Additionally, the adaptability of the 3D volume space leaves the door open for its extension to encompass three or even more modalities.

C. Multi-modal Interaction

1) Projection to 3D volume space: In the projection of LiDAR to the 3D volume, the voxel embedding is directed to a predefined 3D volume using positions from the ego-car coordinate system. This method ensures no geometric distortion is introduced. The resulting feature from this process is denoted as F_V^vol. For the image-to-3D-volume projection, the 2D-3D Spatial Cross-Attention method is employed. Following prior works [30], [31], the 3D volume queries for the images are defined as Q^vol ∈ R^{C×H×W×Z}. The corresponding 3D points are then projected to the 2D views using the cameras' intrinsic and extrinsic parameters. During this projection, each 3D point associates only with specific views, termed V_hit. The 2D feature is then sampled from the views in V_hit at the locations of those projected 3D reference points. The process unfolds as:

F_I^{vol} = \frac{1}{|V_{hit}|} \sum_{i \in V_{hit}} \sum_{j=1}^{N_{ref}} \mathrm{DeformAttn}\left(Q^{vol}, \mathcal{P}(p, i, j), F_I^{i}\right)    (1)

where i indexes the camera view, j indexes the 3D reference points, and N_ref is the total number of reference points for each 3D volume query. F_I^i is the feature of the i-th camera view, and \mathcal{P}(p, i, j) is the projection function that gets the j-th reference point on the i-th view image.
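The two projections above can be sketched as follows, under simplifying assumptions: the LiDAR branch scatters voxel embeddings into a dense volume by their grid indices, and the camera branch projects each volume cell's 3D centre into a view and samples the 2D feature map. Plain bilinear sampling stands in for the deformable cross-attention of Eq. (1), and all tensor shapes, helper names, and the separate (3, 3)-intrinsic convention are assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def lidar_to_volume(voxel_feats, voxel_coords, grid_shape):
    """Scatter voxel embeddings into a dense (C, Z, H, W) volume.

    voxel_feats:  (N, C) embeddings of non-empty voxels.
    voxel_coords: (N, 3) integer (z, y, x) indices in the volume grid.
    """
    C = voxel_feats.shape[1]
    Z, H, W = grid_shape
    volume = voxel_feats.new_zeros(C, Z, H, W)
    z, y, x = voxel_coords.unbind(dim=1)
    volume[:, z, y, x] = voxel_feats.t()
    return volume

def image_to_volume(img_feat, centers_xyz, intrinsic, extrinsic, img_hw):
    """Sample one camera's feature map at the projections of volume-cell centres.

    img_feat:    (C, Hf, Wf) feature map of a single view.
    centers_xyz: (M, 3) 3D centres of the volume cells in the ego frame.
    intrinsic:   (3, 3); extrinsic: (4, 4) ego-to-camera transform.
    Returns (M, C) sampled features and an (M,) mask of cells hitting this view.
    Per Eq. (1), features from all views a cell hits would then be averaged.
    """
    ones = centers_xyz.new_ones(centers_xyz.shape[0], 1)
    pts_cam = (extrinsic @ torch.cat([centers_xyz, ones], dim=1).t())[:3]  # (3, M)
    depth = pts_cam[2].clamp(min=1e-5)
    uv = (intrinsic @ pts_cam)[:2] / depth                                 # pixel coords
    Himg, Wimg = img_hw
    hit = (pts_cam[2] > 1e-3) & (uv[0] >= 0) & (uv[0] < Wimg) & (uv[1] >= 0) & (uv[1] < Himg)

    # Normalise to [-1, 1] and bilinearly sample the feature map.
    grid = torch.stack([uv[0] / (Wimg - 1) * 2 - 1,
                        uv[1] / (Himg - 1) * 2 - 1], dim=-1)               # (M, 2)
    sampled = F.grid_sample(img_feat[None], grid[None, :, None, :],
                            align_corners=True)                            # (1, C, M, 1)
    return sampled[0, :, :, 0].t(), hit
```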

Fig. 3. Illustration of our Multi-modal 3D Interaction Module. We first concatenate the inputs F_V^vol and F_I^vol and reshape the result for the subsequent stacked 3D deformable self-attention blocks. After the interaction, we split the output volume feature and project the two parts back to feature tokens. This contributes to more generalized and effective feature learning.

2) Multi-modal 3D Interaction Module: To effectively fuse the projected 3D volume features from the camera (F_I^vol ∈ R^{C×H×W×Z}) and LiDAR (F_V^vol ∈ R^{C×H×W×Z}) branches, the Multi-modal 3D Interaction Module (MMIM) is introduced. As depicted in Figure 3, MMIM comprises L attention blocks, with L = 3 being the default setting.

Given the emphasis on high performance at high resolutions in downstream tasks and the limited token sequence length of standard self-attention, deformable self-attention is selected to alleviate computational demands. Each block is composed of deformable self-attention, a feed-forward network, and normalization. Initially, the concatenation of F_V^vol and F_I^vol is performed along the channel dimension, and the result is reshaped to form the query token F_c^vol ∈ R^{HWZ×2C}. This token is then fed into the Multi-modal 3D Interaction Module, an extension of [32], to promote effective modal interaction. The interactive process can be described as follows:

F_c' = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} W_{mk} \cdot W_m' \, F_c^{vol}\!\left(p^{vol} + \Delta p_k^{vol}\right) \right]    (2)

where m indexes the attention head, k indexes the sampled keys, and K is the total number of sampled keys. Δp_k^vol and W_mk denote the sampling offset and attention weight of the k-th sampling point in the m-th attention head, respectively. The attention weight W_mk lies in the range [0, 1], normalized by \sum_{k=1}^{K} W_{mk} = 1. At the end, F_c' is split along the channel dimension to obtain the modality-specific 3D volume features (F_V', F_I').
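A compact sketch of the MMIM computation is given below. Standard multi-head self-attention is used as a readable stand-in for the 3D deformable self-attention of Eq. (2), so this is an approximation of the module's structure (concatenate, interact for L blocks, split), not the exact released implementation; all class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    """One fusion block: self-attention over volume tokens plus a feed-forward net.

    nn.MultiheadAttention stands in for 3D deformable self-attention here; the
    real module attends only to a few sampled offsets per query (Eq. (2)).
    """
    def __init__(self, dim, heads=8, ffn_dim=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(),
                                 nn.Linear(ffn_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, HWZ, 2C)
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ffn(x))

class MMIMSketch(nn.Module):
    """Concatenate the two volume features, interact, then split them back."""
    def __init__(self, dim, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [InteractionBlock(2 * dim) for _ in range(num_blocks)])

    def forward(self, vol_lidar, vol_cam):      # each: (B, C, H, W, Z)
        B, C, H, W, Z = vol_lidar.shape
        x = torch.cat([vol_lidar, vol_cam], dim=1)       # (B, 2C, H, W, Z)
        x = x.flatten(2).transpose(1, 2)                 # (B, HWZ, 2C)
        for blk in self.blocks:
            x = blk(x)
        x = x.transpose(1, 2).reshape(B, 2 * C, H, W, Z)
        return x.split(C, dim=1)                         # (F_V', F_I')
```

With, e.g., dim=96, the query token has 192 channels and the feed-forward width is 768, matching the sizes reported later; the deformable variant keeps the cost roughly linear in the number of volume queries, since each query attends to only K sampled offsets instead of all HWZ tokens.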
3) Projection to Modality-specific Tokens: By tapping into the advantages of the 3D volume representation, the fused feature can be conveniently projected onto the 2D image plane and the 3D voxel tokens. For the LiDAR branch, the process merely involves sampling the features located at the positions of the masked voxel tokens within the ego-vehicle coordinates. Notably, these features have already been enriched by the fusion module with semantics from the camera branch. Regarding the camera branch, the corresponding 2D coordinate (u, v) can be determined using the projection function T_proj. The 2D-plane feature F_I^proj is then obtained by mapping the 3D volume feature at (x, y, z) to the position (u, v). The projection function T_proj is defined as:

z \, [u, v, 1]^{\top} = T_{proj}(P) = K \cdot R_t \cdot [x, y, z, 1]^{\top}    (3)

where P ∈ R^3 is the position in the 3D volume, and K ∈ R^{3×4} and R_t ∈ R^{4×4} are the camera intrinsic and extrinsic matrices.
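A minimal sketch of this camera-side back-projection of Eq. (3): each occupied volume cell is mapped to a pixel with the intrinsic and extrinsic matrices and its fused feature is written onto the 2D plane. The nearest-pixel scatter and the function and argument names are assumptions made for illustration.

```python
import torch

def volume_to_image_plane(vol_feat, centers_xyz, K, Rt, feat_hw):
    """Back-project fused volume features onto a camera's 2D plane (Eq. (3)).

    vol_feat:    (M, C) fused features of the occupied volume cells.
    centers_xyz: (M, 3) cell centres in the ego frame.
    K: (3, 4) intrinsic matrix, Rt: (4, 4) extrinsic matrix, as in Eq. (3).
    Returns a (C, H, W) plane feature F_I^proj (zeros where nothing projects).
    """
    M, C = vol_feat.shape
    H, W = feat_hw
    ones = centers_xyz.new_ones(M, 1)
    hom = torch.cat([centers_xyz, ones], dim=1).t()        # (4, M)
    proj = K @ (Rt @ hom)                                  # (3, M): (z*u, z*v, z)
    z = proj[2].clamp(min=1e-5)
    u = (proj[0] / z).round().long()
    v = (proj[1] / z).round().long()

    plane = vol_feat.new_zeros(C, H, W)
    valid = (proj[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    plane[:, v[valid], u[valid]] = vol_feat[valid].t()     # later reshaped to (HW, C)
    return plane
```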
D. Prediction Target

Three distinct reconstruction tasks supervise the modal decoders, and a single linear layer is applied to the output of each decoder for each task. The dual-modal reconstruction tasks and their respective loss functions are detailed below. In alignment with Voxel-MAE [34], the LiDAR prediction focuses on the number of points within each voxel. Supervision for this reconstruction uses the Chamfer distance, which gauges the disparity between two point sets of different scales. Let G_n denote the masked LiDAR point cloud partitioned into voxels, and P_n symbolize the predicted voxels. The Chamfer loss L_c can be presented as:

L_c = \mathrm{CD}\left(\mathrm{Dec}_V(F_V^{sp}), G_n\right)    (4)

where CD(·) stands for the Chamfer distance function [35], Dec_V(·) denotes the voxel decoder, and F_V^sp represents the projected voxel features.

In addition to the aforementioned reconstruction task, there is a prediction of whether a voxel is empty. Supervision for this aspect employs the binary cross-entropy loss, denoted as L_occ. The cumulative voxel reconstruction loss is thus given as:

L_{voxel} = L_c + L_{occ}    (5)

For the camera branch, the pixel reconstruction is supervised using the Mean Squared Error (MSE) loss, represented as:

L_{img} = L_{MSE}\left(\mathrm{Dec}_I(F_I^{sp}), G_I\right)    (6)

where G_I is the original images in pixel space, Dec_I(·) denotes the image decoder, and F_I^sp represents the projected image features.
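The three objectives of Eqs. (4)–(6) can be combined as in the sketch below; the Chamfer distance is written out explicitly, while the argument layout (per-voxel point lists, occupancy logits, masked pixel patches) is an assumption about how the targets are batched, not the released training code.

```python
import torch
import torch.nn.functional as F

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between two point sets (P, 3) and (G, 3)."""
    d = torch.cdist(pred, gt)                        # (P, G) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def pretraining_loss(pred_points, gt_points, occ_logits, occ_labels,
                     pred_pixels, gt_pixels):
    """Combine the reconstruction objectives of Eqs. (4)-(6).

    pred_points / gt_points: lists of per-voxel point sets (masked voxels only).
    occ_logits / occ_labels: (Nvox,) occupancy prediction and 0/1 target.
    pred_pixels / gt_pixels: (Npatch, D) reconstructed and original masked patches.
    """
    l_chamfer = torch.stack([chamfer_distance(p, g)
                             for p, g in zip(pred_points, gt_points)]).mean()
    l_occ = F.binary_cross_entropy_with_logits(occ_logits, occ_labels.float())
    l_voxel = l_chamfer + l_occ                      # Eq. (5)
    l_img = F.mse_loss(pred_pixels, gt_pixels)       # Eq. (6)
    return l_voxel + l_img
```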
IV. EXPERIMENTS

In this section, we conduct extensive experiments to evaluate our proposed UniM2AE: 1) We compare UniM2AE with different MAE methods using various amounts of annotated data. 2) We evaluate UniM2AE on different downstream tasks, including 3D object detection and BEV map segmentation. 3) We conduct diverse ablation studies to evaluate the effectiveness of our self-supervised method.

TABLE I
Data-efficient 3D object detection results on the nuScenes validation set. Backbones in single-modality and multi-modality are pre-trained using various MAE methods. The model performances are reported using different amounts of fine-tuning data. Random denotes training from scratch. MIM+Voxel-MAE denotes the combination of the weights pre-trained using GreenMIM [33] and Voxel-MAE [34]. L and C represent LiDAR and Camera, respectively. *: our re-implementation.
Modality  Initialization   mAP  NDS   Car  Truck  C.V.  Bus  Trailer  Barrier  Motor  Bike  Ped.  T.C.

Data amount: 20%
L    Random          44.3  56.3  78.8  41.1  13.1  50.7  18.8  52.8  46.1  17.3  75.7  49.1
L    Voxel-MAE*      48.9  59.8  80.9  47.0  12.8  59.0  23.6  61.9  47.8  23.5  79.9  52.4
L    UniM2AE         50.0  60.0  81.0  47.8  12.0  57.3  24.0  62.3  51.2  26.8  81.4  55.8
C+L  Random          51.5  50.9  84.1  47.6  13.3  49.8  27.9  65.0  53.9  26.8  78.8  67.8
C+L  MIM+Voxel-MAE   54.3  51.2  84.3  50.8  18.9  52.3  28.9  68.4  57.4  32.1  80.3  69.2
C+L  UniM2AE         55.9  52.8  85.8  51.1  19.3  54.2  30.6  69.0  61.1  34.3  83.0  70.8

Data amount: 40%
L    Random          50.9  60.5  81.4  47.6  13.7  58.0  24.5  61.3  57.7  30.1  80.4  54.4
L    Voxel-MAE*      52.6  62.2  82.4  49.1  15.4  62.2  25.8  64.5  56.8  30.4  82.3  57.5
L    UniM2AE         52.9  62.6  82.7  49.2  15.8  60.1  23.7  65.5  58.4  31.2  83.8  58.9
C+L  Random          58.6  61.9  86.2  54.7  21.7  60.0  33.0  70.7  64.2  38.6  83.0  74.3
C+L  MIM+Voxel-MAE   60.2  63.5  86.6  56.5  22.5  64.1  33.5  72.2  66.1  41.9  83.8  74.8
C+L  UniM2AE         62.0  64.5  87.0  57.8  22.8  62.7  38.7  69.7  66.8  50.5  86.0  77.9

Data amount: 60%
L    Random          51.9  61.7  82.2  49.0  15.6  61.2  24.9  62.9  56.3  32.1  81.6  53.1
L    Voxel-MAE*      54.2  63.5  83.0  51.1  16.3  62.0  27.5  64.9  61.2  34.7  82.9  58.0
L    UniM2AE         54.7  63.8  83.1  51.0  17.3  62.5  26.9  65.1  62.2  35.7  83.4  59.9
C+L  Random          61.6  65.2  87.3  58.5  23.9  65.2  35.8  71.9  67.8  46.7  85.7  77.0
C+L  MIM+Voxel-MAE   62.1  65.7  87.2  56.7  23.0  65.4  37.0  71.7  70.6  47.3  85.6  76.7
C+L  UniM2AE         62.4  66.1  87.7  59.7  23.8  67.6  37.0  70.5  68.4  48.9  86.6  77.8

Data amount: 80%
L    Random          52.7  62.5  82.3  49.6  16.0  63.3  25.8  60.7  58.6  31.6  82.0  56.7
L    Voxel-MAE*      55.1  64.2  83.4  51.7  18.8  64.0  28.7  63.8  62.2  35.1  84.3  58.7
L    UniM2AE         55.6  64.6  83.4  52.9  18.2  64.2  29.4  64.7  63.1  36.1  84.5  58.8
C+L  Random          62.5  66.1  87.1  57.6  24.0  66.4  38.1  71.1  68.4  48.8  86.2  77.6
C+L  MIM+Voxel-MAE   63.0  66.4  87.6  59.6  24.1  66.1  38.0  71.3  70.2  48.8  86.5  78.1
C+L  UniM2AE         63.9  67.1  87.7  59.6  24.9  69.2  39.8  71.3  71.0  50.2  86.8  78.7

Data amount: 100%
L    Random          53.6  63.0  82.3  49.8  16.7  64.0  26.2  60.9  61.7  33.0  82.2  58.9
L    Voxel-MAE*      55.3  64.1  83.2  51.2  16.8  64.6  28.3  65.0  61.8  39.6  83.6  58.9
L    UniM2AE         55.8  64.6  83.3  51.3  17.6  63.7  28.6  65.4  62.8  40.8  83.9  60.3
C+L  Random          63.6  67.4  87.7  58.0  26.6  67.8  38.4  72.6  71.8  47.6  87.0  78.9
C+L  MIM+Voxel-MAE   63.7  67.7  87.6  58.3  25.3  67.1  38.8  70.8  71.7  51.5  86.7  78.9
C+L  UniM2AE         64.3  68.1  87.9  57.8  24.3  68.6  42.2  71.6  72.5  51.0  87.2  79.5

A. Implementation Details

1) Dataset and Metrics: The nuScenes Dataset [9], a comprehensive autonomous driving dataset, serves as the primary dataset for both pre-training our model and evaluating its performance on multiple downstream tasks. This dataset encompasses 1,000 sequences gathered from Boston and Singapore, with 700 designated for training and 300 split evenly between validation and testing. Each sequence, recorded at 10 Hz, spans 20 seconds and is annotated at a frequency of 2 Hz. For 3D detection, the principal evaluation metrics are mean Average Precision (mAP) and the nuScenes detection score (NDS). For BEV map segmentation, the methodology aligns with the dataset's map expansion pack, using Intersection over Union (IoU) as the assessment metric.

2) Network Architectures: UniM2AE utilizes SST [43] and Swin-T [44] as the backbones for the LiDAR Encoder and Camera Encoder, respectively. In the Multi-modal 3D Interaction Module, 3 deformable self-attention blocks are stacked, with each attention module comprising 192 input channels and 768 hidden channels. To facilitate the transfer of pre-trained weights to downstream tasks, BEVFusion-SST and TransFusion-L-SST are introduced, with the LiDAR backbone in these architectures being replaced by SST.

3) Pre-training: During this stage, the perception range is restricted to [−50m, 50m] for the X- and Y-axes and [−5m, 3m] for the Z-axis. Each voxel has dimensions of (0.5m, 0.5m, 4m). In terms of input data masking during the training phase, our experiments have determined that a masking ratio of 70% for the LiDAR branch and 75% for the camera branch yields optimal results. By default, all the MAE methods are trained for a total of 200 epochs on 8 GPUs with a base learning rate of 2.5e-5. Detailed configurations are reported in the supplemental material.

4) Fine-tuning: Utilizing the encoders from UniM2AE for both camera and LiDAR, the process then involves fine-tuning and assessing the capabilities of the learned features on tasks that are both single-modal and multi-modal in nature. For tasks that are solely single-modal, one of the pre-trained encoders serves as the feature extractor. For multi-modal tasks, which include 3D object detection and BEV map segmentation, both the LiDAR encoder and the camera branch's encoder are used as the feature extractors for these downstream tasks. It is pivotal to note that while the decoder plays a role during the pre-training phase, it is omitted during fine-tuning. A comparison of the pre-trained feature extractors with a variety of baselines across different tasks was conducted, ensuring the experimental setup remained consistent. Notably, when integrated into fusion-based methodologies, the pre-trained Multi-modal 3D Interaction Module showcases competitive performance. Detailed configurations are in the supplemental material.
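For reference, the settings stated in this subsection can be collected into a single configuration sketch; the dictionary layout is purely illustrative and does not mirror the released config files.

```python
# Pre-training settings gathered from this subsection (illustrative layout).
pretrain_cfg = dict(
    point_cloud_range=[-50.0, -50.0, -5.0, 50.0, 50.0, 3.0],  # x/y in [-50, 50] m, z in [-5, 3] m
    voxel_size=[0.5, 0.5, 4.0],                               # metres
    mask_ratio=dict(lidar=0.70, camera=0.75),
    optimizer=dict(type='AdamW', base_lr=2.5e-5),
    total_epochs=200,
    gpus=8,
)
```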

TABLE II
Performance of 3D object detection on the nuScenes validation split. w/ MMIM means applying the pre-trained MMIM to the downstream task.

Method Modality Voxel Size(m) NDS↑ mAP↑ mATE↓ mASE↓ mAOE↓ mAVE↓ mAAE↓
BEVDet [36] C - 47.2 39.3 60.8 25.9 36.6 82.2 19.1
BEVFormer [31] C - 51.7 41.6 67.3 27.4 37.2 39.4 19.8
PETRv2 [37] C - 52.4 42.1 68.1 26.7 35.7 37.7 18.6
CenterPoint [38] L [0.075, 0.075, 0.2] 66.8 59.6 29.2 25.5 30.2 25.9 19.4
LargeKernel3D [39] L [0.075, 0.075, 0.2] 69.1 63.9 28.6 25.0 35.1 21.1 18.7
TransFusion-L [40] L [0.075, 0.075, 0.2] 70.1 65.4 - - - - -
TransFusion-L-SST L [0.15, 0.15, 8] 69.9 65.0 28.0 25.3 30.1 24.1 19.0
UniM2AE-L L [0.15, 0.15, 8] 70.4 65.7 28.0 25.2 29.5 23.5 18.6
FUTR3D [24] C+L [0.075, 0.075, 0.2] 68.3 64.5 - - - - -
Focals Conv [41] C+L [0.075, 0.075, 0.2] 69.2 64.0 33.2 25.4 27.8 26.8 19.3
MVP [27] C+L [0.075, 0.075, 0.2] 70.7 67.0 28.9 25.1 28.1 27.0 18.9
TransFusion [40] C+L [0.075, 0.075, 0.2] 71.2 67.3 27.2 25.2 27.4 25.4 19.0
MSMDFusion [42] C+L [0.075, 0.075, 0.2] 72.1 69.3 - - - - -
BEVFusion [14] C+L [0.075, 0.075, 0.2] 71.4 68.5 28.7 25.4 30.4 25.6 18.7
BEVFusion-SST C+L [0.15, 0.15, 8] 71.5 68.2 27.8 25.3 30.2 23.6 18.9
UniM2AE C+L [0.15, 0.15, 8] 71.9 68.4 27.2 25.2 28.8 23.2 18.7
UniM2AE w/ MMIM C+L [0.15, 0.15, 4] 72.7 69.7 26.9 25.2 27.3 23.2 18.9

B. Data Efficiency

The primary motivation behind employing MAE is to minimize the dependency on annotated data without compromising the efficiency and performance of the model. In assessing the representation derived from pre-training with UniM2AE, experiments were conducted on datasets of varying sizes, utilizing different proportions of the labeled data. Notably, for training both the single-modal and multi-modal 3D object detection models, fractions of the annotated dataset, namely {20%, 40%, 60%, 80%, 100%}, are used.

In the realm of single-modal 3D self-supervised techniques, UniM2AE is juxtaposed against Voxel-MAE [34] using an anchor-based detector. Following the parameters set by Voxel-MAE, the detector undergoes training for 288 epochs with a batch size of 4 and an initial learning rate of 1e-5. On the other hand, for multi-modal strategies, evaluations are conducted on a fusion-based detector equipped with a TransFusion head [40]. As per current understanding, this is a pioneering attempt at implementing multi-modal MAE in the domain of autonomous driving. For the sake of a comparative analysis, a combination of the pre-trained Swin-T from GreenMIM [33] and SST from Voxel-MAE is utilized.

According to the results in Table I, the proposed UniM2AE presents a substantial enhancement to the detector, exhibiting an improvement of 4.4/1.9 mAP/NDS over random initialization and 1.6/1.6 mAP/NDS compared to the basic amalgamation of GreenMIM and Voxel-MAE when trained on just 20% of the labeled data. Impressively, even when utilizing the entirety of the labeled dataset, UniM2AE continues to outperform, highlighting its superior ability to integrate multi-modal features in the unified 3D volume space during the pre-training phase. Moreover, it is noteworthy that while UniM2AE is not specifically tailored for a LiDAR-only detector, it still yields competitive outcomes across varying proportions of labeled data. This underscores the capability of UniM2AE to derive more insightful representations.

C. Comparison on Downstream Tasks

1) 3D Object Detection: To demonstrate the capability of the learned representation, we fine-tune various pre-trained detectors on the nuScenes dataset. As shown in Table II, our UniM2AE substantially improves both LiDAR-only and fusion-based detection models. Compared to TransFusion-L-SST, UniM2AE-L registers a 0.5/0.7 NDS/mAP enhancement on the nuScenes validation subset. In the multi-modality realm, UniM2AE elevates the outcomes of BEVFusion-SST by 1.2/1.5 NDS/mAP when MMIM is pre-trained. Of note is that superior results are attained even when a larger voxel size is employed. This is particularly significant given that Transformer-centric strategies (e.g., SST [43]) generally trail CNN-centric methodologies (e.g., VoxelNet [51]).

2) BEV Map Segmentation: Table III presents our BEV map segmentation results on the nuScenes dataset based on BEVFusion [14]. For the camera modality, we outperform the results of training from scratch by 1.7 mIoU. In the multi-modality setting, UniM2AE boosts BEVFusion-SST by 6.5 mIoU with the pre-trained MMIM and achieves a 2.1 mIoU improvement over the state-of-the-art method X-Align [50], indicating the effectiveness and strong generalization of our UniM2AE.

D. Ablation Study

1) Multi-modal pre-training: To underscore the importance of the Multi-modal 3D Interaction Module (MMIM) in the 3D volume space for dual modalities, ablation studies were performed with single-modal input and compared with other interaction techniques. The results, as presented in Table IV, reveal that by utilizing MMIM to integrate features from the two branches within a unified 3D volume space, the UniM2AE model achieves a remarkable enhancement in performance. Specifically, there is a 3.4 NDS improvement for camera-only pre-training, 2.6 NDS for LiDAR-only pre-training, and 2.1 NDS when simply merging the two during the initialization of downstream detectors.

TABLE III
Performance of BEV map segmentation on the nuScenes validation split. w/ MMIM means applying the pre-trained MMIM to the downstream task. *: re-implemented by training from scratch.

Method Modality Drivable Ped. Cross. Walkway Stop Line Carpark Divider mIoU
CVT [45] C 74.3 36.8 39.9 25.8 35.0 29.4 40.2
OFT [46] C 74.0 35.3 45.9 27.5 35.9 33.9 42.1
LSS [47] C 75.4 38.8 46.3 30.3 39.1 36.5 44.4
M2BEV [48] C 77.2 - - - - 40.5 -
BEVFusion* [14] C 78.2 48.0 53.5 40.4 45.3 41.7 51.2
UniM2AE C 79.5 50.5 54.9 42.4 47.3 42.9 52.9
PointPillars [49] L 72.0 43.4 53.1 29.7 27.7 37.5 43.8
CenterPoint [38] L 75.6 48.4 57.5 36.5 31.7 41.9 48.6
MVP [27] C+L 76.1 48.7 57.0 36.9 33.0 42.2 49.0
PointPainting [25] C+L 75.9 48.5 57.7 36.9 34.5 41.9 49.1
BEVFusion [14] C+L 85.5 60.5 67.6 52.0 57.0 53.7 62.7
X-Align [50] C+L 86.8 65.2 70.0 58.3 57.1 58.2 65.7
BEVFusion-SST C+L 84.9 59.2 66.3 48.7 56.0 52.7 61.3
UniM2AE C+L 85.1 59.7 66.6 48.7 56.0 52.6 61.4
UniM2AE w/ MMIM C+L 88.7 67.4 72.9 59.0 59.0 59.7 67.8

Additionally, a noticeable decline in performance becomes evident when substituting the 3D volume space with BEV, as showcased in the concluding rows of Table IV. This drop is likely attributed to features mapped onto BEV losing essential geometric and semantic details, especially along the height axis, resulting in a less accurate representation of an object's true height and spatial positioning. These findings conclusively highlight the crucial role of concurrently integrating camera and LiDAR features within the unified 3D volume space and further validate the effectiveness of the MMIM.

2) Masking Ratio: In Table V, an examination of the effects of the masking ratio reveals that optimal performance is achieved with a high masking ratio (70% and 75%). This not only offers advantages in GPU memory savings but also ensures commendable performance. On the other hand, if the masking ratio is set too low or excessively high, there is a notable decline in performance, akin to the results observed with the single-modal MAE.

TABLE IV
Ablation study on the input modality and the interaction space during pre-training. Detection results on the nuScenes validation split.

Modality (Camera, LiDAR)   Interaction space (BEV, 3D volume)   mAP    NDS
–                          –                                    59.0   61.8
✓                          –                                    59.7   62.6
✓                          –                                    60.1   62.6
✓ ✓                        –                                    60.7   63.1
✓ ✓                        ✓ (BEV)                              62.0   64.3
✓ ✓                        ✓ (3D volume)                        62.8   65.2

TABLE V
Ablation study on the masking ratio. Detection results on the nuScenes validation split.

Masking ratio (Camera / LiDAR)   mAP    NDS
60% / 60%                        63.6   66.5
70% / 70%                        64.2   67.3
75% / 70%                        64.5   67.3
80% / 80%                        63.3   66.4

V. CONCLUSIONS

The disparity in the multi-modal integration of MAE methods for practical driving sensors was identified. With the introduction of UniM2AE, a multi-modal self-supervised model is brought forward that adeptly marries image semantics to LiDAR geometries. Two principal innovations define this approach: firstly, the fusion of dual-modal attributes into an augmented 3D volume, which incorporates the height dimension absent in BEV; and secondly, the deployment of the Multi-modal 3D Interaction Module that guarantees proficient cross-modal communication. Benchmarks conducted on the nuScenes Dataset reveal substantial enhancements in 3D object detection by 1.2/1.5 NDS/mAP and in BEV map segmentation by 6.5 mIoU, reinforcing the potential of UniM2AE in advancing autonomous driving perception.

REFERENCES

[1] A. Chen, K. Zhang, R. Zhang, Z. Wang, Y. Lu, Y. Guo, and S. Zhang, "Pimae: Point cloud and image interactive masked autoencoders for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5291–5301.
[2] Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan, "Masked autoencoders for point cloud self-supervised learning," in European Conference on Computer Vision. Springer, 2022, pp. 604–621.
[3] X. Tian, H. Ran, Y. Wang, and H. Zhao, "Geomae: Masked geometric target prediction for self-supervised point cloud pre-training," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13570–13580.
[4] H. Bao, L. Dong, S. Piao, and F. Wei, "Beit: Bert pre-training of image transformers," arXiv preprint arXiv:2106.08254, 2021.
[5] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, "Simmim: A simple framework for masked image modeling," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9653–9663.
[6] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
[7] R. Zhang, L. Wang, Y. Qiao, P. Gao, and H. Li, "Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21769–21780.

[8] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, "End-to-end multi-view fusion for 3d object detection in lidar point clouds," in Conference on Robot Learning. PMLR, 2020, pp. 923–932.
[9] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuscenes: A multimodal dataset for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[10] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[11] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, "Filip: Fine-grained interactive language-image pre-training," arXiv preprint arXiv:2111.07783, 2021.
[12] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, "Pointclip: Point cloud understanding by clip," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8552–8562.
[13] T. Huang, B. Dong, Y. Yang, X. Huang, R. W. Lau, W. Ouyang, and W. Zuo, "Clip2point: Transfer clip to point cloud classification with image-depth pre-training," arXiv preprint arXiv:2210.01055, 2022.
[14] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, and S. Han, "Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation," in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 2774–2781.
[15] T. Liang, H. Xie, K. Yu, Z. Xia, Z. Lin, Y. Wang, T. Tang, B. Wang, and Z. Tang, "Bevfusion: A simple and robust lidar-camera fusion framework," Advances in Neural Information Processing Systems, vol. 35, pp. 10421–10434, 2022.
[16] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, "Point-bert: Pre-training 3d point cloud transformers with masked point modeling," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19313–19322.
[17] H. Liu, M. Cai, and Y. J. Lee, "Masked discrimination for self-supervised learning on point clouds," in European Conference on Computer Vision. Springer, 2022, pp. 657–675.
[18] C. Min, D. Zhao, L. Xiao, Y. Nie, and B. Dai, "Voxel-mae: Masked autoencoders for pre-training large-scale point clouds," arXiv preprint arXiv:2206.09900, 2022.
[19] R. Xu, T. Wang, W. Zhang, R. Chen, J. Cao, J. Pang, and D. Lin, "Mv-jar: Masked voxel jigsaw and reconstruction for lidar-based self-supervised pre-training," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13445–13454.
[20] H. Yang, T. He, J. Liu, H. Chen, B. Wu, B. Lin, X. He, and W. Ouyang, "Gd-mae: Generative decoder for mae pre-training on lidar point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9403–9414.
[21] A. Boulch, C. Sautier, B. Michele, G. Puy, and R. Marlet, "Also: Automotive lidar self-supervision by occupancy estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13455–13465.
[22] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3d object detection network for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
[23] R. Nabati and H. Qi, "Centerfusion: Center-based radar and camera fusion for 3d object detection," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1527–1536.
[24] X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, "Futr3d: A unified sensor fusion framework for 3d detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 172–181.
[25] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, "Pointpainting: Sequential fusion for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4604–4612.
[26] C. Wang, C. Ma, M. Zhu, and X. Yang, "Pointaugmenting: Cross-modal augmentation for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11794–11803.
[27] T. Yin, X. Zhou, and P. Krähenbühl, "Multimodal virtual point 3d detection," Advances in Neural Information Processing Systems, vol. 34, pp. 16494–16507, 2021.
[28] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3d object detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 641–656.
[29] Y. Li, A. W. Yu, T. Meng, B. Caine, J. Ngiam, D. Peng, J. Shen, Y. Lu, D. Zhou, Q. V. Le, et al., "Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17182–17191.
[30] Y. Wei, L. Zhao, W. Zheng, Z. Zhu, J. Zhou, and J. Lu, "Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving," arXiv preprint arXiv:2303.09551, 2023.
[31] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, "Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers," in European Conference on Computer Vision. Springer, 2022, pp. 1–18.
[32] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable detr: Deformable transformers for end-to-end object detection," arXiv preprint arXiv:2010.04159, 2020.
[33] L. Huang, S. You, M. Zheng, F. Wang, C. Qian, and T. Yamasaki, "Green hierarchical vision transformer for masked image modeling," Advances in Neural Information Processing Systems, vol. 35, pp. 19997–20010, 2022.
[34] G. Hess, J. Jaxing, E. Svensson, D. Hagerman, C. Petersson, and L. Svensson, "Masked autoencoder for self-supervised pre-training on lidar point clouds," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 350–359.
[35] H. Fan, H. Su, and L. J. Guibas, "A point set generation network for 3d object reconstruction from a single image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 605–613.
[36] J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, "Bevdet: High-performance multi-camera 3d object detection in bird-eye-view," arXiv preprint arXiv:2112.11790, 2021.
[37] Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, X. Zhang, and J. Sun, "Petrv2: A unified framework for 3d perception from multi-camera images," arXiv preprint arXiv:2206.01256, 2022.
[38] T. Yin, X. Zhou, and P. Krahenbuhl, "Center-based 3d object detection and tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11784–11793.
[39] Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia, "Largekernel3d: Scaling up kernels in 3d sparse cnns," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13488–13498.
[40] X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C.-L. Tai, "Transfusion: Robust lidar-camera fusion for 3d object detection with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1090–1099.
[41] Y. Chen, Y. Li, X. Zhang, J. Sun, and J. Jia, "Focal sparse convolutional networks for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5428–5437.
[42] Y. Jiao, Z. Jie, S. Chen, J. Chen, L. Ma, and Y.-G. Jiang, "Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21643–21652.
[43] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, "Embracing single stride 3d object detector with sparse transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8458–8468.
[44] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[45] B. Zhou and P. Krähenbühl, "Cross-view transformers for real-time map-view semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13760–13769.
[46] T. Roddick, A. Kendall, and R. Cipolla, "Orthographic feature transform for monocular 3d object detection," arXiv preprint arXiv:1811.08188, 2018.
[47] J. Philion and S. Fidler, "Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. Springer, 2020, pp. 194–210.
[48] E. Xie, Z. Yu, D. Zhou, J. Philion, A. Anandkumar, S. Fidler, P. Luo, and J. M. Alvarez, "M2bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation," arXiv preprint arXiv:2204.05088, 2022.

[49] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[50] S. Borse, M. Klingner, V. R. Kumar, H. Cai, A. Almuzairee, S. Yogamani, and F. Porikli, "X-align: Cross-modal cross-view alignment for bird's-eye-view segmentation," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 3287–3297.
[51] Y. Yan, Y. Mao, and B. Li, "Second: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, 2018.
[52] Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia, "Voxelnext: Fully sparse voxelnet for 3d object detection and tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21674–21683.
[53] Y. Li, Y. Chen, X. Qi, Z. Li, J. Sun, and J. Jia, "Unifying voxel-based representation with transformer for 3d object detection," Advances in Neural Information Processing Systems, vol. 35, pp. 18442–18455, 2022.
[54] Y. Li, X. Qi, Y. Chen, L. Wang, Z. Li, J. Sun, and J. Jia, "Voxel field fusion for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1120–1129.
[55] H. Wang, C. Shi, S. Shi, M. Lei, S. Wang, D. He, B. Schiele, and L. Wang, "Dsvt: Dynamic sparse voxel transformer with rotated sets," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13520–13529.
[56] S. Song, S. P. Lichtenberg, and J. Xiao, "Sun rgb-d: A rgb-d scene understanding benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 567–576.
[57] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.

VI. SUPPLEMENTARY MATERIAL

A. Additional Results

The 3D object detection results on the test set of nuScenes are reported in Table VI. In the multi-modal setting, our UniM2AE boosts BEVFusion [14] by 0.4 NDS and achieves competitive results compared with the SOTA multi-modal detectors. For the detectors that are solely single-modal, our LiDAR-only UniM2AE-L outperforms the SST baseline [40] by a 2.4/2.0 mAP/NDS improvement, indicating the generalization of our self-supervised method. Considering that our MAE framework is not specifically designed for the LiDAR-only detector, UniM2AE lags slightly behind GeoMAE [3], which introduces extra loss functions tailored to the characteristics of the point cloud.

TABLE VI
Performance of 3D object detection on the nuScenes test split. w/ MMIM means applying the pre-trained MMIM to the downstream task.

Method              Modality  mAP   NDS
PointPillars [49]   L         40.1  55.0
CenterPoint [38]    L         60.3  67.3
VoxelNeXt [52]      L         64.5  70.0
LargeKernel3D [39]  L         65.4  70.6
GeoMAE [3]          L         67.8  72.5
TransFusion-L [40]  L         65.5  70.2
UniM2AE-L           L         67.9  72.2
UVTR-M [53]         C+L       67.1  71.1
TransFusion [40]    C+L       68.9  71.7
VFF [54]            C+L       68.4  72.4
DSVT [55]           C+L       68.4  72.7
BEVFusion [14]      C+L       70.2  72.9
UniM2AE w/ MMIM     C+L       70.3  73.3

Fig. 4. 3D object detection results on the nuScenes validation split: mAP (left) and NDS (right) over the first 10 fine-tuning epochs for our pre-trained model versus random initialization. Our UniM2AE accelerates model convergence and ultimately improves the performance.

B. Additional Ablation Study

As shown in Figure 4, we compare the performance of detectors trained from scratch and pre-trained with our UniM2AE for 10 epochs. Our pre-training method significantly accelerates model convergence and finally stabilises it at a higher score when utilizing the entire dataset.

C. Pre-training Details

To fairly compare UniM2AE with the single-modal MAE methods (i.e., GreenMIM [33] and Voxel-MAE [34]), the consistent pre-training configuration shown in Table VII is adopted during the pre-training process. The detailed hyperparameters of the MAE methods used in this work are as follows.

TABLE VII
Pre-training configuration.

Config                      Value
optimizer                   AdamW
base lr                     5e-4
weight decay                0.001
batch size                  32
lr schedule                 cosine annealing
warmup iterations           1000
point cloud augmentation    random flip, resize
image augmentation          crop, resize, random flip
total epochs                200

1) UniM2AE hyperparameters: Generally, we employ the configurations of the encoder and decoder presented in GreenMIM [33] and Voxel-MAE [34], with adaptive modifications to better suit multi-modal self-supervised pre-training. The image size is set to [256, 704] and the point cloud range is restricted to [−50m, 50m] for the X- and Y-axes and [−3m, 5m] for the Z-axis. At the same time, the volume grid shape is set to [200, 200, 2]. Specifically, to align the multi-view images and the LiDAR point cloud, only random flipping, resizing, and cropping are used in the image augmentation, discarding the other data augmentation methods originally applied to Masked Image Modeling.

For the Spatial Cross-Attention during the image-to-3D-volume projection, the number of deformable attention blocks is set to 6 with 256 hidden channels, and N_ref is set to 4. In the Multi-modal 3D Interaction Module (MMIM), we stack 3 deformable self-attention blocks comprising 8 heads each, and the number of reference points is 4.

2) Baseline hyperparameters: In the experiments on data efficiency, we compare our UniM2AE with single-modal MAE methods, whose implementations follow their publicly released code with minimal changes. Since GreenMIM [33] does not employ Swin-T as its backbone during pre-training, we replaced the original Swin-B with Swin-T and rigorously followed the other settings for pre-training. Additionally, for a fair comparison, we use the same data augmentation in the camera branch of UniM2AE. For Voxel-MAE [34], pre-training is done with intensity information in the data efficiency experiment. The rest of the setup is the same as the original Voxel-MAE.

D. Fine-tuning Details

We evaluate our multi-modal self-supervised pre-training framework by fine-tuning two state-of-the-art detectors [14], [40], denoted as TransFusion-L-SST and BEVFusion-SST, whose LiDAR backbone is replaced by SST [43]. The detailed configuration is presented in Table VIII.

In the 3D object detection task, we separately set the voxel size to [0.5m, 0.5m, 8m] in the LiDAR-only method and [0.5m, 0.5m, 4m] in the multi-modal method.

After the pre-training, we first transfer the weights of the UniM2AE LiDAR encoder to TransFusion-L-SST and fine-tune it. We follow the TransFusion [40] training schedule, and the model obtained by this fine-tuning is denoted as UniM2AE-L. As for the multi-modal strategies, the weights of the LiDAR encoder in UniM2AE-L and the camera encoder pre-trained by UniM2AE are loaded to fine-tune BEVFusion-SST, following the BEVFusion [14] training schedule. Furthermore, we replace the fusion module in BEVFusion-SST with the pre-trained MMIM and obtain UniM2AE w/ MMIM.

In the BEV map segmentation task, unlike the previous two-stage training schedule in 3D object detection, we directly fine-tune the multi-modal BEVFusion-SST with a [0.2m, 0.2m, 4m] voxel size for 24 epochs and the camera-only BEVFusion [14] for 20 epochs. The changes regarding the backbone and fusion modules are the same as for the 3D detection task. For the camera-only detectors, all configurations are aligned with the camera-only BEVFusion [14].

TABLE VIII
Fine-tuning configuration. Det denotes the 3D object detection configuration. Seg denotes the BEV map segmentation configuration.

Config                  Det                Seg
point cloud range -x    [-54.0m, 54.0m]    [-51.2m, 51.2m]
point cloud range -y    [-54.0m, 54.0m]    [-51.2m, 51.2m]
point cloud range -z    [-3.0m, 5.0m]      [-3.0m, 5.0m]
optimizer               AdamW              AdamW
base lr                 1e-4               1e-4
weight decay            0.01               0.01
batch size              4                  4
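A sketch of this weight transfer is shown below: encoder weights are remapped into the detector's backbones and the pre-training decoders are discarded. The key prefixes ('lidar_encoder', 'camera_encoder', 'pts_backbone', 'img_backbone') are assumptions for illustration, not the actual checkpoint naming.

```python
import torch

def load_pretrained_encoders(detector, ckpt_path):
    """Initialise a detector's backbones from a pre-training checkpoint (sketch)."""
    ckpt = torch.load(ckpt_path, map_location='cpu')
    state = ckpt.get('state_dict', ckpt)

    remapped = {}
    for k, v in state.items():
        if k.startswith('lidar_encoder.'):        # assumed prefix
            remapped['pts_backbone.' + k[len('lidar_encoder.'):]] = v
        elif k.startswith('camera_encoder.'):     # assumed prefix
            remapped['img_backbone.' + k[len('camera_encoder.'):]] = v
        # Decoders are dropped: they are only used during pre-training.

    missing, unexpected = detector.load_state_dict(remapped, strict=False)
    print(f'loaded {len(remapped)} tensors, {len(missing)} still randomly initialised')
```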

E. Visualization

In Figure 5, we provide examples of reconstruction visualizations. Our UniM2AE is able to reconstruct the masked LiDAR point clouds and multi-view images, accurately reflecting its semantic and geometric understanding.

F. Discussion about PiMAE

As mentioned in the main paper, [1] pre-trained PiMAE predominantly on indoor datasets like SUN RGB-D [56] and, in tandem, utilized Farthest Point Sampling (FPS) and K-Nearest Neighbors (KNN) in the point cloud branch. FPS is a sampling technique that selects points as far apart as possible from each other to provide a sparse and representative subset of the original set. In particular, it first selects a point randomly and then repeatedly chooses the subsequent point that is farthest from the already-selected points. KNN is a method used for classification and regression, but in the context of point clouds it is often used to find the closest points to a given point: given a point sampled by FPS, the KNN search identifies the k closest points to that point.

However, the combination of FPS and KNN in PiMAE is not inherently suitable for outdoor datasets like nuScenes [9] in autonomous driving. The challenges arise from the vast scale and spatial diversity of outdoor environments. On the indoor dataset SUN RGB-D [56], PiMAE [1] first uses FPS to filter 2048 points from about 20,000 points and then sends them into FPS and KNN to form 128 groups in order to save GPU memory. In an outdoor dataset, however, the number of points in one scene often reaches almost 200,000, and if FPS is still employed to filter out only 2048 points, it might undersample critical areas, since FPS focuses on spreading out the sampled points. This can lead to missing important local structures, such as pedestrians, small obstacles or even cars far away from the ego vehicle, which are essential for MAE methods to model the masked point clouds. Another option is to increase the number of output groups of FPS on the outdoor dataset: for example, one could filter out 20,000 points from 200,000 points, send them into the joint FPS+KNN operation, and finally process the 1,000 output groups with the following point cloud encoder. Nevertheless, since PiMAE [1] utilizes multiple standard ViTs [57], whose computational complexity is O(n^2), in both the encoder and the decoder, the computational burden is still unacceptable even after downsampling 200 times.
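For clarity, the FPS and KNN grouping discussed here can be sketched as follows. This is a plain PyTorch illustration (the group size k and the point counts in the usage lines are illustrative), not PiMAE's optimized implementation.

```python
import torch

def farthest_point_sampling(points, num_samples):
    """Pick `num_samples` points that are mutually far apart (FPS).

    points: (N, 3). Returns the indices of the selected points.
    """
    N = points.shape[0]
    selected = torch.zeros(num_samples, dtype=torch.long)
    dist = torch.full((N,), float('inf'))          # distance to the selected set
    selected[0] = torch.randint(N, (1,)).item()    # random seed point
    for i in range(1, num_samples):
        delta = points - points[selected[i - 1]]
        dist = torch.minimum(dist, (delta * delta).sum(dim=1))
        selected[i] = torch.argmax(dist)           # farthest from the set so far
    return selected

def knn_group(points, centers, k):
    """For each centre, gather its k nearest neighbours (one point 'group')."""
    d = torch.cdist(centers, points)               # (M, N)
    return d.topk(k, largest=False).indices        # (M, k) neighbour indices

# Indoor-style usage as described for PiMAE: 2048 FPS points, then 128 groups.
pts = torch.randn(200000, 3)                       # an outdoor-scale cloud
sub = pts[farthest_point_sampling(pts, 2048)]
groups = knn_group(sub, sub[farthest_point_sampling(sub, 128)], k=16)
```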

Fig. 5. Visualization of reconstruction results (columns: image, masked image, image reconstruction; point cloud, masked point cloud, point cloud reconstruction). The reconstruction for two different scenes is presented, including 6 images and a point cloud. For ease of observation, we zoom in on the point cloud at [0m, 15m] for the X axis and [−7.5m, 7.5m] for the Y axis.
