Scene Reconstruction From 4D Radar Data With GAN and Diffusion
ALEXANDR DJADKIN
Abstract
4D Imaging Radar is increasingly becoming a critical component in various
industries due to beamforming technology and hardware advancements.
However, it does not replace visual data in the form of 2D images captured
by an RGB camera. Instead, 4D radar point clouds are a complementary data
source that captures spatial information and velocity in a Doppler dimension
that cannot be easily captured by a camera’s view alone. Some discriminative
features of the scene captured by the two sensors are hypothesized to have
a shared representation. Therefore, a more interpretable visualization of
the radar output can be obtained by learning a mapping from the empirical
distribution of the radar to the distribution of images captured by the camera.
To this end, the application of deep generative models to generate images
conditioned on 4D radar data is explored. Two approaches that have become
state-of-the-art in recent years are tested, generative adversarial networks and
diffusion models. They are compared qualitatively through visual inspection
and by two quantitative metrics: mean squared error and object detection
count. It is found that the generative adversarial network's generative process is easier to control through conditioning than that of a diffusion process.
In contrast, the diffusion model produces samples of higher quality and is
more stable to train. Furthermore, their combination results in a hybrid
sampling method, achieving the best results while simultaneously speeding
up the diffusion process.
Keywords
Deep generative models, Generative adversarial networks, Diffusion models,
GAN, DGM, 4D imaging radar
Sammanfattning
4D imaging radar is playing an increasingly important role in various industries thanks to advances in beamforming technology and hardware. However, it does not replace visual data in the form of 2D images captured by an RGB camera. Instead, 4D radar point clouds constitute a complementary data source that represents spatial information and velocity in the form of a Doppler dimension. It is hypothesized that certain descriptive features of the observed environment have an abstract representation shared by the two sensors. The radar data can therefore be visualized more intuitively by learning a transformation from the distribution over the radar data to the distribution over the images. To this end, the application of deep generative models for images conditioned on 4D radar data is explored. Two methods that have become state-of-the-art in recent years are tested: generative adversarial networks and diffusion models. They are compared qualitatively through visual inspection and with quantitative metrics: the mean squared error and the number of correctly detected objects in the generated image. It is found that the generative process of generative adversarial networks is easier to control through conditioning than that of a diffusion process. On the other hand, the diffusion model is stable to train and generally produces images of higher quality. The best results are obtained with a hybrid: both methods are combined to take advantage of their respective strengths.
Keywords
Deep generative models, Generative adversarial networks, Diffusion models, GAN, DGM, 4D imaging radar
Acknowledgments
I am grateful to Qamcom and Sensrad for providing me with the opportunity
to work on this project. I would also like to express my thanks to the engineers
at both companies for their work in developing the software and hardware that
formed the basis of this thesis.
I could not have undertaken this journey without the continuous, enlightening
feedback of my supervisors Vangjush Komini, Debaditya Roy, and Dr. Ole
Martin Christensen.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Ethical Approach . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . 5
2 Background 6
2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Artificial Neural Networks . . . . . . . . . . . . . . . 6
2.1.2 Deep Architectures . . . . . . . . . . . . . . . . . . . 7
2.1.3 Convolutional Neural Networks . . . . . . . . . . . . 8
2.2 Deep Generative Models . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Generative Adversarial Networks . . . . . . . . . . . 10
2.2.2 Diffusion Models . . . . . . . . . . . . . . . . . . . . 10
2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Conditional GANs . . . . . . . . . . . . . . . . . . . 12
2.3.1.1 Pix2Pix . . . . . . . . . . . . . . . . . . . 13
2.3.1.2 Points2Pix . . . . . . . . . . . . . . . . . . 13
2.3.2 Conditional Diffusion Models . . . . . . . . . . . . . 14
3 Methods 15
3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Data Collection and Selection . . . . . . . . . . . . . 16
3.1.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2.1 Spatial Dimensions . . . . . . . . . . . . . 17
3.1.2.2 Additional Dimensions . . . . . . . . . . . 18
References 44
A Supporting materials 51
A.1 Code and Demos . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Additional Background . . . . . . . . . . . . . . . . . . . . . 51
A.2.1 Residual Networks . . . . . . . . . . . . . . . . . . . 51
A.2.2 U-Net . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2.3 Attention . . . . . . . . . . . . . . . . . . . . . . . . 52
A.3 Additional Methods . . . . . . . . . . . . . . . . . . . . . . . 53
A.3.1 Postprocessing . . . . . . . . . . . . . . . . . . . . . 53
A.3.2 Upscaling . . . . . . . . . . . . . . . . . . . . . . . . 53
A.4 Additional Examples of Generated Images . . . . . . . . . . . 53
List of Figures
4.1 Two cases used for evaluation by object detection: the full
image and a cropped region of interest, which excluded parked
vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Examples of generated images. . . . . . . . . . . . . . . . . . 34
List of Tables
4.1 MSE results for the different methods used to generate images
conditioned on point cloud data. . . . . . . . . . . . . . . . . 36
4.2 The number of detected objects of each class in the full image
including parked vehicles (refer to fig. 4.1). Diffusion was
closest to ground truth. . . . . . . . . . . . . . . . . . . . . . 37
4.3 Number of objects detected by each method in the ROI where
parked vehicles were excluded. . . . . . . . . . . . . . . . . . 37
4.4 Absolute difference between true and generated images in the
number of detected objects of the three classes: car, truck, and
bus (percentage of total object counts). The GAN-conditioned
diffusion model scored best in the ROI where parked vehicles
were excluded. . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 GAN quantitative metrics at various checkpoints. The metrics
show no improvement beyond four to five epochs, despite the
continued decrease in training loss shown in fig. 4.5a. The
percentage in the parentheses shows the relative error to the
total number of objects in the ground truth images. The region
of interest (ROI) is the region of the image which excludes
parked vehicles in the background. . . . . . . . . . . . . . . . 38
List of acronyms and abbreviations
NF Normalizing Flow
Chapter 1
Introduction
1.1 Motivation
4D imaging radar is increasingly becoming a critical component in the
automotive industry. This progress is mainly attributed to the advancements
in beamforming technology and hardware capabilities [1]. Although the
contextual information reconstructed from the 4D radar is limited and does
not replace the cameras’ (classical) visual information, it can be particularly
useful in certain situations where camera information fails, e.g., when driving
on a very foggy road or using a radar sensor on a drone in a forest fire to
see through the smoke. In those cases, directly visualizing the 4D radar
information could mitigate the shortcomings. The recent development of deep
(conditional) generative models in AI has shown great potential for generating
realistic images. Namely, it is possible to mimic the discriminative features
of the training data to generate new realistic images from 4D radar output.
Hence, the above methodology could be directly integrated with the scenarios
mentioned above to reap substantial benefits.
1.2 Problem
The main problem of this project is to train a generative model p(x|c) that
can produce high-quality video frames given radar recordings c ∈ C of a
road. The data consist of video frames x ∈ X , temporally and spatially
synchronized with the radar recordings, resulting in the set of ordered pairs
(x, c) ∈ X × C. The radar sensor and camera are mounted such that
the background is stationary, and the variability in the data comes from the
movement on the road and changing environmental conditions. By leveraging
deep generative models, the aim is to enhance the visual representation of 4D
radar data and improve its interpretability. To this end, the following research
questions are posed:
1. How can deep generative models be used to enhance the visual
representation of 4D radar data and improve its interpretability?
1.3 Purpose
From the perspective of Qamcom, the project aims to demonstrate the quality
of radar output by generating high-quality images and videos from 4D radar
point cloud data. Such visualizations can show that the radar captures meaningful discriminative features of the environment,
facilitating a more straightforward interpretation of the scene.
From a scientific standpoint, the project investigates the application of
deep generative models to 4D radar data, evaluating two state-of-the-art
techniques: Generative Adversarial Networks (GANs) [3] and Denoising
Diffusion Probabilistic Models (DDPMs) [4]. By achieving these goals, the
project seeks to advance the understanding and capabilities of generative
models in the domain of 4D radar data.
1.4 Goals
The primary objective of this project revolves around the visualization of 4D
radar data using deep generative models. To accomplish this overarching goal,
the project has been further delineated into two sub-goals, each serving a
specific purpose:
In summary, while the first sub-goal aligns with the specific interests of
Sensrad, the second sub-goal seeks to contribute to the scientific body of
knowledge.
1.6 Delimitations
This project is limited to the investigation of applying GANs and diffusion
models to images conditioned on 3D point clouds calculated from 4D radar
data collected in the automotive setting. While many other architectures exist,
and the models could be applied to other data, the study covers only a subset of the field of deep generative models and a very specific data source.
Chapter 2
Background
This chapter introduces DGMs and explores the relevant research conducted
on conditional DGMs specifically developed for images and point clouds.
The core architectures used in these models are deep Convolutional Neural
Networks (CNNs), which are introduced in this section. It is worth noting that CNNs are a class of deep learning models, which in turn build on artificial neural networks. Hence, this chapter also provides a background
on these fundamental building blocks to establish a foundation of knowledge
that underpins the subsequent content of the report.
as a combination of input weights and an activation function [7]. Just like the
neurons in our brains, these computational units activate when the weighted
sum of inputs exceeds a certain threshold determined by the adjusted weights.
The objective of training neural networks is to find optimal sets of weights that
enable these neurons to produce high activation signals when presented with
test data that contains discriminative features learned from the training data.
$$x_j^{(\mathrm{out})} = \sigma\left(\sum_{i=1}^{N_{\mathrm{in}}} \gamma_{ji} * x_i^{(\mathrm{in})}\right), \tag{2.2}$$
where
$$(\gamma * x)(h, w) = \sum_{h'=-a}^{a} \sum_{w'=-b}^{b} \gamma(h', w')\, x(h - h', w - w') \tag{2.3}$$
denotes the discrete 2D convolution with a filter of size (2a, 2b). The filters
learn to express local spatial connectivity patterns across the input channels,
enabling the model to identify, detect, classify, and segment objects within the
image.
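As a concrete illustration, the discrete 2D convolution of eq. (2.3) can be sketched directly in NumPy. This is a didactic example only (valid-region output, single channel); real CNN layers rely on optimized implementations and typically compute cross-correlation rather than flipping the filter.

```python
import numpy as np

def conv2d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Discrete 2D convolution as in eq. (2.3), valid region only."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]  # convolution flips the filter
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for h in range(out_h):
        for w in range(out_w):
            out[h, w] = np.sum(flipped * x[h:h + kh, w:w + kw])
    return out

# Example: respond to vertical edges in a random 128 x 128 "image".
sobel_x = np.array([[1., 0., -1.],
                    [2., 0., -2.],
                    [1., 0., -1.]])
image = np.random.rand(128, 128)
features = conv2d(image, sobel_x)  # shape (126, 126)
```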
$$\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\log D(y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2.4}$$
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \tag{2.5}$$
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\right), \tag{2.6}$$
$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \tag{2.7}$$
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right) \tag{2.8}$$
$$p(x_T) = \mathcal{N}(x_T; 0, I). \tag{2.9}$$
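A minimal sketch of the forward process defined by eqs. (2.5)-(2.6). The linear β schedule below is an assumption for illustration; the schedule actually used later in this work is the cosine schedule of eq. (3.15).

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # illustrative linear noise schedule

def q_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """One forward step x_t ~ q(x_t | x_{t-1}) from eq. (2.6)."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise

x = torch.rand(3, 128, 128) * 2 - 1  # a fake image scaled to [-1, 1]
for t in range(T):
    x = q_step(x, t)  # after T steps, x is approximately Gaussian noise
```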
The goal is to tune θ such that we can start with Gaussian noise xT in
Equation 2.9 and through the reverse process gradually transform it to x0 such
that x0 ∼ q(x0 ). This training is performed by optimizing the variational
bound on negative log-likelihood:
$$\mathbb{E}[-\log p_\theta(x_0)] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \tag{2.10}$$
$$= \mathbb{E}_q\!\left[-\log p_\theta(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right] =: L \tag{2.11}$$
$$= L_T + L_{T-1} + \cdots + L_0 \tag{2.12}$$
$$\text{where } L_T = D_{\mathrm{KL}}\!\left(q(x_T \mid x_0) \,\|\, p_\theta(x_T)\right) \tag{2.13}$$
$$L_{t-1} = D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right) \quad \text{for } 2 \le t \le T \tag{2.14}$$
$$L_0 = -\log p_\theta(x_0 \mid x_1), \tag{2.15}$$
$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t \tag{2.17}$$
$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t. \tag{2.18}$$
The forward process variances can be learned or held constant as hyperparameters. Ho et al. set $\Sigma_\theta(x_t, t) = \sigma_t^2 I$ to untrained time-dependent constants [17]. Nichol and Dhariwal found that learning the variance gave better log-likelihood but had no effect on sample quality [19]. Without learning the variance, we can set $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$ and obtain the following form for $L_{t-1}$:
$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] + C, \tag{2.19}$$
where C is a constant that does not depend on θ. Hence, we want to
parameterize µθ as a model that predicts µ̃t , the posterior mean of the forward
process. xt and µθ can be further reparameterized to formulate a simplified
objective in which we try to predict the noise at every time step:
$$\mathbb{E}_{x_0, \epsilon}\!\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\, t\right)\right\|^2\right]. \tag{2.20}$$
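In code, a training step under this parameterization reduces to noise regression. The sketch below drops the time-dependent weighting in front of eq. (2.20), as Ho et al. do for their simplified objective [17]; `model` stands for any network ε_θ(x_t, t) and is an assumed interface.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # illustrative schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Sample t and eps, form x_t in closed form, and regress the injected noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(b, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return F.mse_loss(model(x_t, t), eps)
```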
2.3.1.1 Pix2Pix
Isola et al. explored GANs for image generation in a conditional setting [21].
Their approach, named Pix2Pix, was tested on various tasks and datasets, such
as map to aerial photos and day-to-night translation. Pix2Pix’s ability to learn
the loss function through adversarial training makes it a versatile and powerful
tool for a variety of image-to-image translation tasks, allowing the network
to adapt to different problem domains without the need for task-specific loss
formulations. The model learns a mapping from an observed sample x and
random noise z to y, G : {x, z} → y. The objective function consists of a
GAN loss and an L1 term. The GAN discriminator learns the high-frequency
content, while the L1 loss helps to learn the low-frequency trends, resulting in
the following objectives:
$$\mathcal{L}_{\mathrm{cGAN}}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \tag{2.21}$$
$$\mathcal{L}_{L1} = \lambda\, \mathbb{E}[\|y - G(x, z)\|_1] \tag{2.22}$$
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{cGAN}} + \mathcal{L}_{L1}. \tag{2.23}$$
Pix2Pix used U-Net [22] as the generator and PatchGAN [23] as the
discriminator. PatchGAN models the image as a Markov random field by
focusing on the structure in local image patches. I.e., the model assumes that
given its neighbors, a pixel is independent of pixels that are more than N steps
away. The discriminator runs convolutionally across the image, averaging all
responses to provide the final output of D. GAN-based techniques have also
been applied to unpaired image translation [24].
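A minimal PatchGAN-style discriminator sketch: each spatial position of the output map scores one overlapping patch of the input, and the scores are averaged into D's final response. The layer widths and the use of a concatenated (condition, image) input are assumptions for illustration; the exact configuration in [21, 23] differs in details such as normalization layers.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Convolutional discriminator that classifies overlapping image patches."""

    def __init__(self, in_channels: int = 6):  # condition and image concatenated
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
            )
        self.net = nn.Sequential(
            block(in_channels, 64),
            block(64, 128),
            block(128, 256),
            nn.Conv2d(256, 1, kernel_size=4, stride=1, padding=1),  # per-patch logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        patch_logits = self.net(x)               # (B, 1, H', W') map of patch scores
        return patch_logits.mean(dim=(1, 2, 3))  # average over patches
```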
2.3.1.2 Points2Pix
In 2019, Milz, Simon, Fischer et al. proposed an approach for 3D point cloud
to image translation applied to Lidar [25]. The mapping from point clouds
to images was learned using a conditional GAN with three distinct conditions
c1 , c2 , c3 . Firstly, c1 was obtained by processing the raw point cloud using
The authors of this work conducted experiments on KITTI [27] for outdoor
and SunRGBD [28] for indoor scenarios. For validation, they measured the
number of correctly detected classes with the aid of the 2D object detector YOLOv3 [29], which was possible due to the object-centered image patches
used in their experiments. The classification score was then given by the
detection ratio of fake images to ground truth. Additionally, the intersection
over union (IoU) for the bounding boxes of predicted objects was measured.
Chapter 3
Methods
This chapter discusses the core methods associated with this project. Firstly,
the methods employed for selecting the training and testing datasets and the
preprocessing steps involved in transforming the data into a format suitable as
input to the models are presented. Secondly, the choice of models and their
implementation details are described. Finally, the evaluation metrics employed for comparing the performance of these models are presented.
3.1 Data
The Hugin 4D Imaging Radar (fig. 3.1) utilizes Arbe’s chipset and boasts
advanced perception capabilities and a wide field of view [36]. It uses a 48
× 48 MIMO antenna configuration, operating in the 76 to 81 GHz frequency range with a maximum bandwidth of 1540 MHz. It has a sensitivity range of 0.5 to 300+ m, a range resolution of 0.1 to 0.75 m, and a Doppler resolution of 0.1 m/s. The radar data used in this project were recorded with the A3 radar
variant at a framerate of 15 FPS and synchronized with the RGB camera, which
operated at 25 FPS. Additional technical details are available in table 3.1.
• Similar weather and lighting conditions: As the radar does not capture
weather or lighting information, recordings with similar backgrounds
were selected. This approach ensured a smoother generation of videos
by biasing the background towards a specific setting.
Table 3.2: Radar recording statistics used as heuristics for dataset selection.
3.1.2 Preprocessing
The raw 4D radar point cloud data consists of a set of $N$ points $\{x_i\}_{i=1}^{N}$ in 3D space with spatial coordinates $x, y, z$, along with additional dimensions $x_{\mathrm{doppler}}, x_{\mathrm{range}}, x_{\mathrm{power}}$. The preprocessing steps involved the separate processing
of the spatial dimensions and the additional information.
to align the point cloud with the camera’s coordinate system. Additionally, the
aligned points were projected onto the 2D plane corresponding to the camera
view (cf. eq. (3.1)):
where $c_i$, $i \in \{R, G, B\}$, are constants chosen such that the resulting RGB values were in the valid pixel range $[0, 255]$. Since objects moving both towards and away from the detector were observed, $x_{\mathrm{doppler}}$ could be either positive or negative, with an observed maximum value $|x_{\mathrm{doppler}}^{\max}|$. Setting $c_R = 8$ ensured that $R \in [0, 255]$. Similarly, setting $c_G = \frac{1}{1.42}$ and $c_B = 8$ ensured the proper scaling for the green and blue channels.
The image was then cropped to a region of interest and downsampled to
128 × 128, or 256 × 256 pixels. The 128 × 128 resolution was used in the
comparative analysis of the two generative methods. The 256 × 256 resolution
was used to demonstrate the radar's ability to capture discriminative features.
Finally, the pixels were scaled to the range [−1, 1], a common practice for
improving the stability and performance of neural networks. Four examples
of input-output pairs are shown in fig. 3.2.
Figure 3.2: Four examples of input-output pairs from the test dataset.
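A sketch of this preprocessing pipeline. The camera projection matrix `P`, the ROI bounds, and the exact assignment of Doppler, range, and power to the R, G, and B channels are illustrative assumptions; only the scaling constants and output format follow the description above.

```python
import numpy as np
import cv2

def project_points(points_xyz: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Project 3D radar points onto the image plane with a 3x4 camera matrix P."""
    homog = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    uvw = homog @ P.T
    return uvw[:, :2] / uvw[:, 2:3]  # pixel coordinates (u, v)

def rasterize(points_xyz, doppler, rng, power, P,
              img_hw=(720, 1280), roi=(0, 720, 0, 1280), out_size=128):
    """Encode Doppler/range/power as RGB at the projected locations, crop, resize, rescale."""
    h, w = img_hw
    canvas = np.zeros((h, w, 3), dtype=np.float32)
    c_r, c_g, c_b = 8.0, 1.0 / 1.42, 8.0  # scaling constants from section 3.1.2
    uv = project_points(points_xyz, P).astype(int)
    for (u, v), d, r, p in zip(uv, doppler, rng, power):
        if 0 <= v < h and 0 <= u < w:
            canvas[v, u] = (abs(d) * c_r, r * c_g, p * c_b)
    y0, y1, x0, x1 = roi
    crop = cv2.resize(canvas[y0:y1, x0:x1], (out_size, out_size),
                      interpolation=cv2.INTER_AREA)
    return np.clip(crop, 0, 255) / 127.5 - 1.0  # scale pixels to [-1, 1]
```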
• Density: Can the model evaluate the probability density function p(x)?
• Latents: Does the model use a latent vector z, and what is its
dimensionality?
VAEs learn a lower bound on the density through MLE, support fast
sampling, use latent representations with lower dimensionality than the data,
and are implemented using an encoder-decoder architecture, utilizing the
reparameterization trick [14]. NFs learn an exact density through MLE, are
slow to sample from and use latents with the same dimensionality as the
data. The architecture choice that restricts them is that we must use invertible
neural networks where each layer has a tractable Jacobian. GANs allow for
fast sampling and use small latents but do not support density evaluation due
to the min-max training objective. In addition, the generator-discriminator
architecture can lead to unstable training. Diffusion models learn a lower
bound on the density through MLE, are slow to sample from, use latents with
the same dimensionality as the data, and use an encoder-decoder architecture.
The primary concern in this practical case is the quality of the generated
samples, while being able to evaluate an exact density is less critical. GANs
have been extensively refined and explored in the literature and have produced
state-of-the-art results in recent years. On the other hand, diffusion models
have recently been used to outperform GANs on image synthesis tasks [4].
Based on these considerations and given the promising results of both GANs
and diffusion models, they were selected as starting points for this project.
Figure 3.3: A block diagram of the GAN training scheme. Both the generator
and discriminator are CNNs. The generator utilizes the discriminator’s
classification output through backpropagation to adjust its weight values. An
L1 term is also used to enforce low-level correctness (see algorithm 1).
The training scheme for the GAN is visualized in Figure 3.3. The GAN
consists of two main components: the generator (G) and the discriminator (D).
The output of D, which represents D’s confidence in the authenticity of the
input samples, is used to guide the optimization of both G and D. Specifically,
D’s output is used to formulate the loss function for updating the parameters
of both the G and D [3]. This adversarial training process aims to improve
G’s ability to produce synthetic samples that are indistinguishable from real
data, while simultaneously enhancing D’s ability to discriminate between real
and fake samples accurately. Additionally, an L1 term enforces low-level
correctness (refer to eq. (3.5)). By iteratively updating the parameters of
the two neural nets based on their respective loss functions, the GAN training
scheme facilitates the learning of a generator that can effectively generate
realistic data samples resembling the training data distribution.
As in [21], the weight of the L1 term in eq. (2.22) was set to $\lambda_{L1} = 100$, giving the combined training objective of eq. (2.23).
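A sketch of one training iteration under this objective, assuming a generator `G` that takes the noise concatenated with the projected point cloud and a discriminator `D` that takes the condition concatenated with a real or generated image; optimizer construction and data loading are omitted.

```python
import torch
import torch.nn.functional as F

LAMBDA_L1 = 100.0

def gan_step(G, D, opt_g, opt_d, condition, real):
    """One adversarial update with the weighted L1 term (lambda_L1 = 100)."""
    noise = torch.randn_like(condition)
    fake = G(torch.cat([noise, condition], dim=1))

    # Discriminator update: real pairs toward 1, fake pairs toward 0.
    opt_d.zero_grad()
    d_real = D(torch.cat([condition, real], dim=1))
    d_fake = D(torch.cat([condition, fake.detach()], dim=1))
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # Generator update: fool D while staying close to the ground truth in L1.
    opt_g.zero_grad()
    d_fake = D(torch.cat([condition, fake], dim=1))
    loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + LAMBDA_L1 * F.l1_loss(fake, real))
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```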
3.2.1.1 Implementation
Based on the ideas of Pix2Pix [21] and Points2Pix [25], we used U-Net as
the generator [22]. We did not use the raw point cloud as in Points2Pix,
since we found early on in the development process that it did not boost the
performance of our model. The architecture differed from Pix2Pix mainly in
two aspects: we concatenated Gaussian noise with the projected point cloud as the generator input, and we filtered the features in the skip connections with attention gates (see fig. 3.4).
Figure 3.4: A block diagram of the Attention U-Net generator. The input
image is progressively filtered and downsampled with stride 2. The input has
6 input channels: 3 for noise and 3 for the condition, while the output has 3 for
red, green, and blue. The features propagated through the skip connections are
filtered using attention gates. This figure was based on Attention U-Net [37].
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right), \tag{3.12}$$
where the noise ϵθ (xt , t) is predicted with the use of a neural network
Gθ (xt , t). Given a noisy image, this allows us to sample a less noisy image
xt−1 ∼ pθ (xt−1 |xt ) (eq. (3.9)) by inferring the noise and computing
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sqrt{\beta_t}\, z \quad \text{where } z \sim \mathcal{N}(0, I). \tag{3.13}$$
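Eq. (3.13) translates directly into a sampling step. The sketch below assumes the schedule quantities are precomputed tensors and omits the conditioning input to the noise predictor for brevity; as is standard, no noise is added at the final step.

```python
import torch

@torch.no_grad()
def p_sample(eps_model, x_t, t, betas, alphas, alpha_bars):
    """One reverse step x_{t-1} ~ p_theta(x_{t-1} | x_t) following eq. (3.13)."""
    eps = eps_model(x_t, t)
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
```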
$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2, \tag{3.15}$$
where s = 0.08 and T = 1000. Nichol and Dhariwal [19] found this schedule to give better results than a linear one. It is visualized in fig. 3.5.
Figure 3.5: Diffusion forward process using the cosine noise schedule. The
leftmost image is the sample at t = 0, and the rightmost is pure noise at t = T .
Every image except the one at t = 0 is a 100-step noisier version of its left neighbor.
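A sketch computing ᾱ_t and the corresponding β_t from eq. (3.15) with the values stated above; clipping β_t away from 1 is the usual practical safeguard from [19] and an assumption here.

```python
import numpy as np

def cosine_schedule(T: int = 1000, s: float = 0.08, max_beta: float = 0.999):
    """Cosine noise schedule of eq. (3.15): alpha_bar_t = f(t) / f(0)."""
    f = lambda t: np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bars = np.array([f(t) / f(0) for t in range(T + 1)])
    betas = 1.0 - alpha_bars[1:] / alpha_bars[:-1]  # beta_t from consecutive alpha_bar values
    return alpha_bars[1:], np.clip(betas, 0.0, max_beta)

alpha_bars, betas = cosine_schedule()
```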
3.2.2.1 Implementation
In DDPM, Ho et al. [17] introduced the U-Net architecture for diffusion
models. The model employs a series of residual layers and downsampling
convolutions, followed by a set of residual layers with upsampling convolu-
tions. These layers are interconnected with skip connections, linking layers of
the same spatial size. The implementation of DDPM was a modified version
of the Attention U-Net (which we used as the generator in the GAN, shown in
fig. 3.4). Rather than applying attention at every resolution, the authors opted
to utilize a single-head global attention layer, specifically at the 16 × 16
resolution. Furthermore, each residual block incorporated a projection of the
timestep embedding. Song et al. [39] discovered that implementing additional
modifications to the U-Net architecture resulted in improved performance on
the CIFAR-10 [40] and CelebA-64 [41] datasets. Dhariwal and Nichol [4]
found that architectural improvements can substantially boost sample quality, demonstrating similar gains on ImageNet 128 × 128. The authors explored increasing
the depth and the number of attention heads, using attention at multiple
resolutions rather than only at 16 × 16, using the BigGAN residual block [42]
for upsampling and downsampling, and rescaling residual connections with $\frac{1}{\sqrt{2}}$.
To reach our objectives, we used the U-Net architecture implemented
in [4], which used residual blocks with two convolutional layers, group
normalization, and the Sigmoid Linear Unit activation function. We used two
residual blocks per resolution with 128, 256, 512 and 1024 filter channels at
128 × 128, 64 × 64, 32 × 32, and 16 × 16 resolutions, respectively. Attention
was employed at 64 × 64, 32 × 32, and 16 × 16 resolutions. The model takes
6 input channels: 3 for the condition and 3 for the noisy image from the previous sampling step.
$$x_t(x_{\mathrm{GAN}}, \epsilon) = \sqrt{\bar{\alpha}_t}\, x_{\mathrm{GAN}} + \sqrt{1 - \bar{\alpha}_t}\,\epsilon \quad \text{for } \epsilon \sim \mathcal{N}(0, I), \tag{3.16}$$
Figure 3.6: The hybrid method used to generate an image with a diffusion model trained at 256 × 256 px, conditioned on images generated by the GAN at 128 × 128 px.
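A sketch of the hybrid sampling of fig. 3.6, reusing the `p_sample` step sketched in section 3.2.2: the GAN output is upscaled, noised to an intermediate timestep via eq. (3.16), and only the remaining reverse steps are run with the conditional diffusion model. The bilinear upscaling, the way the condition is concatenated, and the default starting timestep of 250 (the value adopted in section 4.4.2) are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def hybrid_sample(eps_model, x_gan, condition, betas, alphas, alpha_bars, t_start=250):
    """Bias the reverse diffusion with a GAN sample instead of starting from pure noise."""
    # Upscale the 128 x 128 GAN output to the diffusion model's 256 x 256 resolution.
    x_gan = F.interpolate(x_gan, size=condition.shape[-2:], mode="bilinear",
                          align_corners=False)
    eps = torch.randn_like(x_gan)
    x_t = torch.sqrt(alpha_bars[t_start]) * x_gan + torch.sqrt(1.0 - alpha_bars[t_start]) * eps  # eq. (3.16)
    # Run only the remaining t_start reverse steps, conditioning on the projected point cloud.
    cond_model = lambda x, t: eps_model(torch.cat([condition, x], dim=1), t)
    for t in reversed(range(t_start)):
        x_t = p_sample(cond_model, x_t, t, betas, alphas, alpha_bars)
    return x_t
```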
Chapter 4
Results and Analysis
(a) Full region used for object detection. (b) Object detection on the real image. (c) Object detection on the generated image. (d) Cropped ROI: no parked vehicles. (e) Object detection on the cropped real image. (f) Object detection on the cropped generated image.
Figure 4.1: Two cases used for evaluation by object detection: the full image
and a cropped region of interest, which excluded parked vehicles.
objects present in the real images could be assessed by calculating the absolute
difference between the detected objects in generated and real images. We
used YOLOv5 [47] pre-trained on the Microsoft COCO dataset [48] for
object detection. We disregarded certain outputs from the object detection
as irrelevant, attributing them to a lack of fine-tuning. Specifically, classes
such as "train", "cow", and "toilet" were identified as such and subsequently discarded. Three object classes were used for calculating the similarity metric: cars,
trucks, and buses. This metric was computed for two cases: the full image,
including parked vehicles shown in fig. 4.1a, and a cropped region of interest
excluding the parking zone shown in fig. 4.1d. Using a cropped region of
interest, more attention is given to the model’s ability to generate nearby
vehicles in the conditioning point cloud. The model’s ability to generate a
high-quality background affects the result in the case when parked vehicles
are included.
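A sketch of this evaluation, assuming YOLOv5 is loaded through its published torch.hub interface; the ROI coordinates are placeholders for the crop that excludes parked vehicles.

```python
import torch
from collections import Counter

detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
CLASSES = {"car", "truck", "bus"}  # classes used for the similarity metric

def count_objects(image, roi=None):
    """Count relevant detections in the full image or in a cropped region of interest."""
    if roi is not None:
        y0, y1, x0, x1 = roi  # placeholder ROI excluding parked vehicles
        image = image[y0:y1, x0:x1]
    detections = detector(image).pandas().xyxy[0]
    return Counter(name for name in detections["name"] if name in CLASSES)

def count_difference(real_image, generated_image, roi=None) -> int:
    """Absolute per-class difference in detected objects, summed over the three classes."""
    real, fake = count_objects(real_image, roi), count_objects(generated_image, roi)
    return sum(abs(real[c] - fake[c]) for c in CLASSES)
```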
methods.
Overall, the diffusion model demonstrated a higher level of visual fidelity
and realism, particularly in background elements, while the GAN offered
easier control of the generative process by conditioning on the input point
cloud. The best-looking images were achieved by combining the ease of
conditioning offered by the GAN with the quality of the output from the
diffusion. This combination was achieved by biasing the diffusion process
using the GAN output as described in section 2.3.2, and the results are shown
in fig. 4.2e.
Figure 4.3: Consecutive video frames generated using GAN and diffusion.
The GAN model generated the car in the correct position in every frame, while
the diffusion process missed some frames, especially for images where the car
was in the lower-left corner of the frame. The GAN-conditioned diffusion
samples (fig. 4.3d) look more realistic than the GAN baseline (fig. 4.3b).
Table 4.1: MSE results for the different methods used to generate images
conditioned on point cloud data.
4.4 Analysis
This section aims to provide an evaluation of the performance and limitations
of the diffusion model and the GAN in generating realistic images based on
4D radar point cloud data and delve into the specific aspects of their training
processes.
Table 4.2: The number of detected objects of each class in the full image
including parked vehicles (refer to fig. 4.1). Diffusion was closest to ground
truth.
Class   True   GAN    Diffusion   Hybrid
car     1759   835    619         1031
truck   2099   772    1049        942
bus     76     18     2           13
Total   3934   1625   1670        1986
Table 4.3: Number of objects detected by each method in the ROI where
parked vehicles were excluded.
Table 4.4: Absolute difference between true and generated images in the
number of detected objects of the three classes: car, truck, and bus (percentage
of total object counts). The GAN-conditioned diffusion model scored best in
the ROI where parked vehicles were excluded.
Epoch       3             4             5             6
MSE         0.1415        0.1353        0.1366        0.1394
∆ Objects   21346 (59%)   17789 (50%)   20551 (57%)   19393 (54%)
∆ ROI       2619 (67%)    2309 (59%)    2113 (54%)    2228 (57%)
Table 4.5: GAN quantitative metrics at various checkpoints. The metrics show
no improvement beyond four to five epochs, despite the continued decrease
in training loss shown in fig. 4.5a. The percentage in the parentheses shows
the relative error to the total number of objects in the ground truth images.
The region of interest (ROI) is the region of the image which excludes parked
vehicles in the background.
4.4.1.2 Diffusion
Due to the time-consuming nature of generating images using the diffusion
process, the evaluation of the model at different checkpoints was not conducted
as extensively as with the GAN. As a result, it is possible that the diffusion
model may be slightly overfitted, and there is potential for achieving better
quantitative scores through further experimentation and evaluation. The
limited evaluation of diffusion checkpoints highlights the practical challenges
and trade-offs involved in assessing and fine-tuning generative models with
computationally intensive processes.
4.4.2 Performance
The GAN demonstrated better control and conditioning capabilities but
sacrificed some image quality. It accurately represented objects within
the specified region of interest, excluding background elements like parked
vehicles. This indicates the effectiveness of the GAN in leveraging the
conditioning point cloud data to generate recognizable objects. Generating
a single image with the GAN took 52 milliseconds.
The diffusion model performed better than the GAN in generating higher-
quality background elements, such as accurately representing parked vehicles.
However, it was limited in generating complete and accurate representations
of objects in certain cases, as evidenced by instances where cars were missing
or incomplete in the generated images (see fig. 4.3). Generating a single image with the diffusion model took 32 seconds.
The hybrid approach, which combined the diffusion model with GAN
conditioning, outperformed the individual models in terms of both quantitative
metrics and visual appeal. By integrating the strengths of both models, the
hybrid approach achieved improved image quality and object recognition. For
this method, we experimentally adopted a starting timestep of T = 250,
considering the trade-offs between sample quality, computational complexity,
and correspondence to the ground truth. We found that a larger starting
timestep resulted in higher sample quality, meaning the generated images were
visually more realistic, excluding missing or incomplete objects. However,
this came at the cost of increased computational complexity and a more
significant deviation from the ground truth. On the other hand, a lower
starting timestep provided better correspondence to the ground truth and lower
computational complexity, but at the expense of image quality.
4.5 Discussion
The hybrid approach proved to be the most effective in achieving the
best performance, integrating the strengths of both models and mitigating
weaknesses. The combination produced images that surpassed the individual
models’ capabilities by leveraging the diffusion model’s high-quality image
generation and the GAN’s accurate object representation and adherence to
input point clouds.
While the diffusion model produced higher-quality images and superior
object detection scores in the entire image, the GAN outperformed it in
terms of MSE and object detection in the region of interest, excluding
background elements such as parked vehicles. This seemingly contradictory
result suggests that the GAN achieved better pixel-level similarity to the
ground truth images and excelled in generating accurate and recognizable
vehicles captured by the radar. However, the hybrid model outperformed both
individual approaches, aligning with the initial qualitative interpretation that
the diffusion model’s inferior MSE (table 4.1) and ROI object detection score
(table 4.4) were due to its occasional neglect of the conditioning point cloud,
while still producing high-quality images otherwise. The observed difference
in object detection results between the entire image and the region of interest
further confirms the qualitative observation that the diffusion model generated
higher-quality images but faced challenges in conditioning and accurately
representing particular objects in specific instances. This underscores the
complementary nature of the hybrid model, which combines the best features
of both approaches to deliver improved performance. Combining the
diffusion model’s background generation capabilities and the GAN’s object
representation and conditioning abilities resulted in a more effective and
comprehensive generation of realistic images from 4D radar data.
It is important to note that the performances of the diffusion model
and GAN used in this study represent specific instances of these models.
It is possible that other variations or architectures of GANs or diffusion
models could outperform the ones used here. Different GAN architectures,
such as conditional GAN with improved conditioning mechanisms, may
exhibit improved performance in generating realistic images from 4D radar
data. Similarly, alternative diffusion models with different priors or training
strategies could yield better results. Further exploration and experimentation
with different model variations are encouraged to identify potential models
that could outperform the ones employed in this study.
Choi et al. found that samples started to deviate noticeably from the
4.6 Limitations
The main limitation of the approach lies in the diversity of generated objects,
particularly trucks, which are not generated as effectively as passenger
vehicles. This limitation can be attributed to the characteristics of the dataset
used for training, which may lack sufficient representation and diversity of these
specific object classes. Another factor that may contribute to the limitation
in generalizability is the alignment between the point cloud and the camera
image. It has been observed that the alignment is more accurate at the
center of the image compared to the lower-left corner. In cases where the alignment is less precise, the diffusion model tends to disregard
or inadequately utilize the input information, resulting in an incomplete or
inaccurate generation of objects. This behavior was unexpected and requires
further investigation to gain a better understanding of the underlying dynamics
and explainability of the diffusion process. Addressing this issue could
improve the model’s performance and enhance its ability to generate more
accurate and consistent results across different image regions.
Chapter 5
Conclusions and Future Work
5.1 Conclusions
This study explored the generation of realistic images from 4D radar point
cloud data using a diffusion model and a GAN. Through comprehensive
analysis and evaluation of the performance and limitations of these models,
valuable insights were gained. The results indicate that it is possible to learn
a transformation from the 4D radar empirical distribution to the RGB values
distribution by combining the GAN with the diffusion model.
The diffusion model demonstrated strengths in generating higher-quality
images. On the other hand, the GAN showcased better control and
conditioning capabilities, allowing for an accurate representation of objects
that adheres to the conditioning point cloud. This result highlighted the
effectiveness of leveraging conditioning point cloud data for generating
recognizable objects.
A hybrid approach was proposed combining the GAN and the diffusion
model, outperforming both models. The essential advantage of this hybrid
method lies in leveraging the iterative nature of the diffusion process. By
introducing an image generated by the GAN at a predetermined diffusion
timestep, the diffusion process’s limitations and challenges were effectively
mitigated. This hybrid approach improved the ability of the generated
samples to adhere to the input condition and accelerated the sampling process
compared to diffusion alone. From a GAN perspective, the hybrid method is
a way to generate higher-quality samples at the cost of slower sampling.
The dissimilarity between the generated and ground truth images was
measured quantitatively by calculating the mean squared error and object
detection scores. This assessment complemented the qualitative evaluation
References
[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436–444, May 2015. doi: 10.1038/nature14539. [Online].
Available: https://doi.org/10.1038/nature14539 [Pages 7 and 8.]
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image
Recognition,” CoRR, vol. abs/1512.03385, 2015, arXiv: 1512.03385.
[Online]. Available: http://arxiv.org/abs/1512.03385 [Pages 11, 23,
and 51.]
[41] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep Learning Face Attributes
in the Wild,” in Proceedings of International Conference on Computer
Vision (ICCV), Dec. 2015. [Page 26.]
Appendix A
Supporting materials
A.2.2 U-Net
U-Net is a convolutional neural network that was first introduced in 2015 for
use in biomedical image segmentation [22]. The name "U-Net" comes from its characteristic U-shaped architecture, created by a sequence of downsampling layers followed by a mirrored sequence of upsampling layers, with skip connections linking feature maps of the same spatial resolution.
A.2.3 Attention
Attention is a mechanism that allows models to focus on specific parts of the
input while processing it [43]. It has become a crucial component of many
state-of-the-art natural language processing and computer vision models,
allowing them to perform better on a wide range of tasks. It is a function
that maps a query and a set of key-value pairs to a weighted sum of the
values, expressing the compatibility of the query with the corresponding key
as the weight for each value. The input consists of $d_k$-dimensional keys $k$ and queries $q$, and $d_v$-dimensional values $V$. The weights of the values are obtained by applying a softmax function to the dot products of the query with all keys, divided by $\sqrt{d_k}$. In practice, these computations are performed as
matrix multiplications, operating simultaneously on a collection of queries Q,
resulting in the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{A.1}$$
This is known as multiplicative attention. In the context of vision, the
queries, keys, and values are three sets of feature maps with the same spatial
dimensions, obtained through three separate linear transformations. The
feature maps are then reshaped into vectors q, k, and v and used in the attention
calculation. Various versions of attention exist. An improvement to U-Net
known as Attention U-Net uses additive attention [37]:
$$q_{\mathrm{att}}^{l} = \psi^{\top}\!\left(\sigma_1\!\left(W_x^{\top} x_i^{l} + W_g^{\top} g_i + b_g\right)\right) + b_\psi \tag{A.2}$$
$$\alpha_i^{l} = \sigma_2\!\left(q_{\mathrm{att}}^{l}(x_i^{l}, g_i; \Theta_{\mathrm{att}})\right), \tag{A.3}$$
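For reference, a minimal sketch of the multiplicative (scaled dot-product) attention of eq. (A.1):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Multiplicative attention of eq. (A.1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

# Example: 16 queries attending over 32 key-value pairs with d_k = d_v = 64.
Q, K, V = torch.randn(16, 64), torch.randn(32, 64), torch.randn(32, 64)
out = scaled_dot_product_attention(Q, K, V)  # shape (16, 64)
```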
A.3 Additional Methods
A.3.1 Postprocessing
The radar system is effective at capturing data across various weather
conditions, as it is not influenced by weather or lighting information. However,
this lack of weather information poses a challenge when generating video
sequences, as certain scene attributes, such as wet or dry roads, get randomly
generated. This randomness introduces inconsistency when playing back the
generated frames at a standard frame rate of 25 FPS, potentially impacting the
viewing experience. To address this issue, we employed CycleGAN [24] as
a postprocessing technique to transfer the generated images into a consistent
style. This style transfer process helped alleviate the inherent randomness in
scene attributes, allowing for a more consistent and enjoyable playback of the
generated video sequences.
A.3.2 Upscaling
In order to create a visually more appealing demo, an additional step was
taken to upscale the generated images. This involved training a diffusion
model specifically in a higher resolution of 256 × 256 pixels. The output
from the GAN, which was initially generated in a resolution of 128 × 128
pixels, was then postprocessed using CycleGAN (refer to appendix A.3.1) and
further upscaled to match the 256 × 256 pixel resolution. The frames for the
demo were sampled using the modified diffusion sampling algorithm outlined
in algorithm 4.