
Degree Project in Computer Science

Second cycle, 30 credits

Scene Reconstruction From 4D Radar Data with GAN and Diffusion

A Hybrid Method Combining GAN and Diffusion for Generating Video Frames from 4D Radar Data

ALEXANDR DJADKIN

Stockholm, Sweden, 2023


Scene Reconstruction From 4D Radar Data with GAN and Diffusion

A Hybrid Method Combining GAN and Diffusion for Generating Video Frames from 4D Radar Data

ALEXANDR DJADKIN

Master’s Programme, Computer Science, 120 credits


Date: August 22, 2023

Supervisors: Vangjush Komini, Ole Martin Christensen, Debaditya Roy


Examiner: Danica Kragic Jensfelt
School of Electrical Engineering and Computer Science
Host company: Qamcom Research and Technology AB
Swedish title: Scenrekonstruktion från 4D-radardata med GAN och Diffusion
Swedish subtitle: En Hybridmetod för Generation av Bilder och Video från
4D-radardata med GAN och Diffusionsmodeller
© 2023 Alexandr Djadkin

Abstract
4D Imaging Radar is increasingly becoming a critical component in various
industries due to beamforming technology and hardware advancements.
However, it does not replace visual data in the form of 2D images captured
by an RGB camera. Instead, 4D radar point clouds are a complementary data
source that captures spatial information and velocity in a Doppler dimension
that cannot be easily captured by a camera’s view alone. Some discriminative
features of the scene captured by the two sensors are hypothesized to have
a shared representation. Therefore, a more interpretable visualization of
the radar output can be obtained by learning a mapping from the empirical
distribution of the radar to the distribution of images captured by the camera.
To this end, the application of deep generative models to generate images
conditioned on 4D radar data is explored. Two approaches that have become
state-of-the-art in recent years are tested, generative adversarial networks and
diffusion models. They are compared qualitatively through visual inspection
and by two quantitative metrics: mean squared error and object detection
count. It is found that it is easier to control the generative adversarial
network’s generative process through conditioning than in a diffusion process.
In contrast, the diffusion model produces samples of higher quality and is
more stable to train. Furthermore, their combination results in a hybrid
sampling method, achieving the best results while simultaneously speeding
up the diffusion process.

Keywords
Deep generative models, Generative adversarial networks, Diffusion models,
GAN, DGM, 4D imaging radar

Sammanfattning
4D bildradar får en alltmer betydande roll i olika industrier tack vare
utveckling inom strålformningsteknik och hårdvara. Det ersätter dock inte
visuell data i form av 2D-bilder som fångats av en RGB-kamera. Istället
utgör 4D radar-punktmoln en kompletterande datakälla som representerar
spatial information och hastighet i form av en Doppler-dimension. Det
antas att vissa beskrivande egenskaper i den observerade miljön har en
abstrakt representation som de två sensorerna delar. Därmed kan radar-datan
visualiseras mer intuitivt genom att lära en transformation från fördelningen
över radar-datan till fördelningen över bilderna. I detta syfte utforskas
tillämpningen av djupa generativa modeller för bilder som är betingade
av 4D radar-data. Två metoder som har blivit state-of-the-art de senaste
åren testas: generativa antagonistiska nätverk och diffusionsmodeller. De
jämförs kvalitativt genom visuell inspektion och med kvantitativa metriker:
medelkvadratfelet och antalet korrekt detekterade objekt i den genererade
bilden. Det konstateras att det är lättare att styra den generativa processen i
generativa antagonistiska nätverk genom betingning än i en diffusionsprocess.
Å andra sidan är diffusionsmodellen stabil att träna och producerar generellt
bilder av högre kvalité. De bästa resultaten erhålls genom en hybrid: båda
metoderna kombineras för att dra nytta av deras respektive styrkor.

Nyckelord
Djupa generativa modeller, Generativa antagonistiska nätverk, Diffusionsmodeller, GAN, DGM, 4D-bildradar

Acknowledgments
I am grateful to Qamcom and Sensrad for providing me with the opportunity
to work on this project. I would also like to express my thanks to the engineers
at both companies for their work in developing the software and hardware that
formed the basis of this thesis.

I could not have undertaken this journey without the continuous, enlightening
feedback of my supervisors Vangjush Komini, Debaditya Roy, and Dr. Ole
Martin Christensen.

Stockholm, August 2023


Alexandr Djadkin

Contents

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Ethical Approach . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . 5

2 Background 6
2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Artificial Neural Networks . . . . . . . . . . . . . . . 6
2.1.2 Deep Architectures . . . . . . . . . . . . . . . . . . . 7
2.1.3 Convolutional Neural Networks . . . . . . . . . . . . 8
2.2 Deep Generative Models . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Generative Adversarial Networks . . . . . . . . . . . 10
2.2.2 Diffusion Models . . . . . . . . . . . . . . . . . . . . 10
2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Conditional GANs . . . . . . . . . . . . . . . . . . . 12
2.3.1.1 Pix2Pix . . . . . . . . . . . . . . . . . . . 13
2.3.1.2 Points2Pix . . . . . . . . . . . . . . . . . . 13
2.3.2 Conditional Diffusion Models . . . . . . . . . . . . . 14

3 Methods 15
3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Data Collection and Selection . . . . . . . . . . . . . 16
3.1.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2.1 Spatial Dimensions . . . . . . . . . . . . . 17
3.1.2.2 Additional Dimensions . . . . . . . . . . . 18

3.2 Deep Generative Models . . . . . . . . . . . . . . . . . . . . 19


3.2.1 Conditional Generative Adversarial Network . . . . . 21
3.2.1.1 Implementation . . . . . . . . . . . . . . . 22
3.2.2 Conditional Diffusion Model . . . . . . . . . . . . . . 24
3.2.2.1 Implementation . . . . . . . . . . . . . . . 26
3.2.3 Hybrid method: GAN-conditioned Diffusion . . . . . 27
3.3 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Hardware/Software Used . . . . . . . . . . . . . . . . . . . . 28

4 Results and Analysis 30


4.1 Evaluation Framework . . . . . . . . . . . . . . . . . . . . . 30
4.1.1 Qualitative Evaluation . . . . . . . . . . . . . . . . . 30
4.1.2 Quantitative Evaluation . . . . . . . . . . . . . . . . . 30
4.1.2.1 Mean Squared Error . . . . . . . . . . . . . 31
4.1.2.2 Object Detection . . . . . . . . . . . . . . . 31
4.2 Qualitative Assessment . . . . . . . . . . . . . . . . . . . . . 32
4.3 Quantitative Assessment . . . . . . . . . . . . . . . . . . . . 36
4.3.1 Mean Squared Error . . . . . . . . . . . . . . . . . . 36
4.3.2 Object Detection . . . . . . . . . . . . . . . . . . . . 36
4.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.1 Training Process . . . . . . . . . . . . . . . . . . . . 38
4.4.1.1 GAN . . . . . . . . . . . . . . . . . . . . . 38
4.4.1.2 Diffusion . . . . . . . . . . . . . . . . . . . 39
4.4.2 Performance . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5 Conclusions and Future work 42


5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

References 44

A Supporting materials 51
A.1 Code and Demos . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Additional Background . . . . . . . . . . . . . . . . . . . . . 51
A.2.1 Residual Networks . . . . . . . . . . . . . . . . . . . 51
A.2.2 U-Net . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2.3 Attention . . . . . . . . . . . . . . . . . . . . . . . . 52
A.3 Additional Methods . . . . . . . . . . . . . . . . . . . . . . . 53

A.3.1 Postprocessing . . . . . . . . . . . . . . . . . . . . . 53
A.3.2 Upscaling . . . . . . . . . . . . . . . . . . . . . . . . 53
A.4 Additional Examples of Generated Images . . . . . . . . . . . 53

List of Figures

3.1 The Hugin 4D Radar. . . . . . . . . . . . . . . . . . . . . . . 15


3.2 Four examples of input-output pairs from the test dataset. . . . 19
3.3 A block diagram of the Generative Adversarial Network
(GAN) training scheme. Both the generator and discriminator
are Convolutional Neural Networks (CNNs). The generator
utilizes the discriminator’s classification output through
backpropagation to adjust its weight values. An L1 term is
also used to enforce low-level correctness (see algorithm 1). . 21
3.4 A block diagram of the Attention U-Net generator. The input
image is progressively filtered and downsampled with stride
2. The input has 6 input channels: 3 for noise and 3 for the
condition, while the output has 3 for red, green, and blue. The
features propagated through the skip connections are filtered
using attention gates. This figure was based on Attention U-
Net [37]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5 Diffusion forward process using the cosine noise schedule.
The leftmost image is the sample at t = 0, and the rightmost
is pure noise at t = T . Every image except at t = 0 is a 100
steps more noisy version of its left neighbor. . . . . . . . . . . 26
3.6 The hybrid method used to generate an image using a diffusion
model trained in 256 × 256 px conditioned on images
generated by the GAN in 128 × 128 px. . . . . . . . . . . . . 28

4.1 Two cases used for evaluation by object detection: the full
image and a cropped region of interest, which excluded parked
vehicles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Examples of generated images. . . . . . . . . . . . . . . . . . 34

4.3 Consecutive video frames generated using GAN and diffusion. The GAN model generated the car in the correct position
in every frame, while the diffusion process missed some
frames, especially for images where the car was in the lower-
left corner of the frame. The GAN-conditioned diffusion
samples (fig. 4.3d) look more realistic than the GAN baseline
(fig. 4.3b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Stochasticity of the reverse diffusion process. Two separate
attempts at generating the vehicle in the lower left corner of
the image. The same input point cloud resulted in two different
outcomes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

A.1 Generated samples in 256 × 256 px. . . . . . . . . . . . . . . 54



List of Tables

3.1 HUGIN S1 4D imaging radar features. . . . . . . . . . . . . . 16


3.2 Radar recording statistics used as heuristics for dataset selection. 17

4.1 MSE results for the different methods used to generate images
conditioned on point cloud data. . . . . . . . . . . . . . . . . 36
4.2 The number of detected objects of each class in the full image
including parked vehicles (refer to fig. 4.1). Diffusion was
closest to ground truth. . . . . . . . . . . . . . . . . . . . . . 37
4.3 Number of objects detected by each method in the ROI where
parked vehicles were excluded. . . . . . . . . . . . . . . . . . 37
4.4 Absolute difference between true and generated images in the
number of detected objects of the three classes: car, truck, and
bus (percentage of total object counts). The GAN-conditioned
diffusion model scored best in the ROI where parked vehicles
were excluded. . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 GAN quantitative metrics at various checkpoints. The metrics
show no improvement beyond four to five epochs, despite the
continued decrease in training loss shown in fig. 4.5a. The
percentage in the parentheses shows the relative error to the
total number of objects in the ground truth images. The region
of interest (ROI) is the region of the image which excludes
parked vehicles in the background. . . . . . . . . . . . . . . . 38

List of acronyms and abbreviations

cGAN Conditional Generative Adversarial Network


CNN Convolutional Neural Network

DDPM Denoising Diffusion Probabilistic Model


DGM Deep Generative Model

GAN Generative Adversarial Network

MLE Maximum Likelihood Estimation


MSE Mean Squared Error

NF Normalizing Flow

ReLU Rectified Linear Unit

SDG Sustainable Development Goal


SGD Stochastic Gradient Descent

VAE Variational Autoencoder



Chapter 1

Introduction

At present, the field of Deep Generative Models (DGMs) is experiencing an
unprecedented era of progress. The objective of this master’s thesis was to
explore the application of DGMs to the domain of images conditioned on 4D
radar data. By doing so, the aim was to provide an alternative visualization
of the radar data and enhance its interpretability, showcasing its ability to
capture the important features of the observed scene. Such visualizations are
not only of interest to radar professionals seeking enhanced interpretations
of radar data but also to the broader deep learning community. The ability
of DGMs to generate images conditioned on 4D radar data opens up new
possibilities for understanding and analyzing complex scenes. Additionally,
the application of deep generative models to radar data expands the scope of
deep learning research and encourages collaboration between radar experts
and deep learning practitioners, fostering interdisciplinary advancements in
both fields.
This project was carried out at Qamcom Research and Technology AB
during the spring of 2023. Qamcom is a leading research and technology
company based in Sweden that offers expertise in hardware, firmware,
software, and system development. Through a collaboration with Arbe, the
world’s most advanced radar chip supplier, the company has developed the
Hugin 4D imaging radar. In addition to the spatial dimensions in 3D, the
radar detects the velocities of moving objects by utilizing the Doppler effect.
The solution works in all environmental conditions and finds use cases in
various applications, such as aerial vehicles, heavy machinery, and traffic
management. Qamcom Radar is now known as Sensrad, a spin-out venture
set to continue developing and commercializing the Hugin 4D imaging radar.
In our experiment, the radar system was installed in a stationary
configuration, positioned to overlook a road. The radar recordings were
spatially and temporally synchronized with video captured by an RGB camera.
These paired data served as the training data for neural networks. The
objective was to enable the translation of the radar recordings to RGB video by
leveraging deep generative models. By doing so, the radar’s ability to capture
high-quality discriminative features of the scene was demonstrated.

1.1 Motivation
4D imaging radar is increasingly becoming a critical component in the
automotive industry. This progress is mainly attributed to the advancements
in beamforming technology and hardware capabilities [1]. Although the
contextual information reconstructed from the 4D radar is limited and does
not replace the cameras’ (classical) visual information, it can be particularly
useful in certain situations where camera information fails, e.g., when driving
on a very foggy road or using a radar sensor on a drone in a forest fire to
see through the smoke. In those cases, directly visualizing the 4D radar
information could mitigate the shortcomings. The recent development of deep
(conditional) generative models in AI has shown great potential for generating
realistic images. Namely, it is possible to mimic the discriminative features
of the training data to generate new realistic images from 4D radar output.
Hence, the above methodology could be directly integrated with the scenarios
mentioned above to reap substantial benefits.

1.2 Problem
The main problem of this project is to train a generative model p(x|c) that
can produce high-quality video frames given radar recordings c ∈ C of a
road. The data consist of video frames x ∈ X , temporally and spatially
synchronized with the radar recordings, resulting in the set of ordered pairs
(x, c) ∈ X × C. The radar sensor and camera are mounted such that
the background is stationary, and the variability in the data comes from the
movement on the road and changing environmental conditions. By leveraging
deep generative models, the aim is to enhance the visual representation of 4D
radar data and improve its interpretability. To this end, the following research
questions are posed:
1. How can deep generative models be used to enhance the visual
representation of 4D radar data and improve its interpretability?
2. What are the trade-offs between generative adversarial networks and a
diffusion approach for visualizing 4D radar data?

Generative models are, in essence, modeling the transformation of an a priori simple (i.e., easy to sample from) parametric distribution (e.g., a Gaussian distribution) into the empirical distribution of the training data [2]. Similarly,
in this practical example, we hypothesize that learning to transform the 4D
radar empirical distribution into the RGB values distribution is possible. Using
deep neural networks, we can parameterize this transformation by leveraging
their ability to discover intricate structures shared between 4D radar and
camera images at high levels of abstraction.

1.3 Purpose
From the perspective of Qamcom, the project aims to demonstrate the quality
of radar output by generating high-quality images and videos from 4D radar
point cloud data. Such visualizations can show that the radar is capturing
meaningful discriminative features of the environment captured by the radar,
facilitating a more straightforward interpretation of the scene.
From a scientific standpoint, the project investigates the application of
deep generative models to 4D radar data, evaluating two state-of-the-art
techniques: Generative Adversarial Networks (GANs) [3] and Denoising
Diffusion Probabilistic Models (DDPMs) [4]. By achieving these goals, the
project seeks to advance the understanding and capabilities of generative
models in the domain of 4D radar data.

1.4 Goals
The primary objective of this project revolves around the visualization of 4D
radar data using deep generative models. To accomplish this overarching goal,
the project has been further delineated into two sub-goals, each serving a
specific purpose:

1. Showcasing the quality of radar data output by leveraging it as a
conditioning variable for video frame generation. By employing this
approach, the project aims to demonstrate the effectiveness of using
radar data as a meaningful input in the generation process.
2. Conducting a comparative analysis of the advantages and drawbacks of
GANs and diffusion models within the context of 4D radar data. This
investigation aims to contribute insights to the scientific community
regarding the suitability and performance of these two generative
modeling techniques for 4D radar data visualization.

In summary, while the first sub-goal aligns with the specific interests of
Sensrad, the second sub-goal seeks to contribute to the scientific body of
knowledge.

1.5 Ethical Approach


In our work with data collected from a public space, we recognize the
importance of protecting individuals’ privacy. To this end, we have employed
a downsampling process to ensure that identifiable information, such as license
plates and facial features, is not present in our neural network input. Such
information, along with timestamps and location data, could potentially lead
to the identification of individuals or their travel patterns. In such cases,
there may be privacy concerns, and appropriate measures should be taken to
protect the data and ensure individuals’ privacy. This aligns with Sustainable
Development Goal (SDG) 16, which focuses on promoting peace, justice,
and strong institutions by ensuring privacy and data security in public spaces
[5]. Our approach solves two problems in one go, as downsampling not
only protects privacy but also reduces the computational resources required
for training while preserving enough detail to demonstrate the discriminative
features of the radar.
Furthermore, we have prioritized sustainability in the design and
development of the models. Training deep learning models can be
computationally demanding and have a significant environmental impact. To
address this concern, we have incorporated sustainability measures in our
research, such as training the models with lower precision and smaller scale
during the development stage, thus contributing to SDG 13, which focuses on
climate action. Training models with lower precision means using reduced
numerical precision for computations, which can significantly reduce the
energy consumption and carbon footprint associated with model training.
Additionally, training models at a smaller scale involves using a subset of data
or a reduced complexity in the model architecture, resulting in lower resource
requirements.

1.6 Delimitations
This project is limited to the investigation of applying GANs and diffusion
models to images conditioned on 3D point clouds calculated from 4D radar
data collected in the automotive setting. While many other architectures exist,
and the models could be applied to other data, the study covers only a subset
of the field of deep generative models and a very specific data source.

1.7 Structure of the Thesis


The relevant background about 4D imaging radar, deep learning, convolu-
tional neural networks, and deep generative models is presented in chapter 2.
The methods used to preprocess the data and the neural network architectures
used for training are presented in chapter 3. Results are presented, analyzed,
and discussed in chapter 4. Conclusions and directions for future work are
presented in chapter 5.

Chapter 2

Background

This chapter introduces DGMs and explores the relevant research conducted
on conditional DGMs specifically developed for images and point clouds.
The core architectures used in these models are deep Convolutional Neural
Networks (CNNs), which are introduced in this section. It is worth noting that
CNNs are a subset of deep learning, which, in turn, falls under the umbrella
of artificial neural networks. Hence, this chapter also provides a background
on these fundamental building blocks to establish a foundation of knowledge
that underpins the subsequent content of the report.

2.1 Deep Learning


Deep learning has revolutionized the field of machine learning in recent years,
fueled by the availability of vast amounts of data and significant advancements
in computing power. This section explores the fundamental concepts and
architectures of deep learning, focusing on artificial neural networks and their
powerful variant, CNNs. Finally, we will explore deep generative models,
which focus on the generation of novel data that captures the essential features
present in the training data. These models have proven instrumental in various
domains, including image synthesis, natural language processing, and music
composition [2].

2.1.1 Artificial Neural Networks


Artificial neural networks are machine learning algorithms that draw
inspiration from the structure and functioning of the brain [6]. They consist
of computational units, often referred to as neurons, which can be as simple
as a combination of input weights and an activation function [7]. Just like the
neurons in our brains, these computational units activate when the weighted
sum of inputs exceeds a certain threshold determined by the adjusted weights.
The objective of training neural networks is to find optimal sets of weights that
enable these neurons to produce high activation signals when presented with
test data that contains discriminative features learned from the training data.

2.1.2 Deep Architectures


Deep learning has become a powerful tool in recent years, with algorithmic
development driven by increased data availability and computing power.
Before the emergence of deep learning, detecting patterns in data required
extensive domain expertise and thorough feature engineering to extract
representations suitable for learning [7]. While hand-crafted features can
be effective for certain tasks, they have limitations in their generalization
capabilities. Deep neural networks, also known as multi-layer neural
networks, aim to generalize across the discriminative features present in the
training data by leveraging multiple layers of computation. Each layer extracts
increasingly abstract representations of the raw input data (such as the pixel
values of an image) at multiple levels of abstraction while irrelevant variations
are suppressed. This hierarchical structure enables deep neural networks to
learn and model intricate relationships between input features without needing
hand-crafted features, facilitating better generalization and performance on
various tasks.
Deep learning models are fine-tuned by leveraging the backpropagation
algorithm. Initially, the algorithm computes the gradient at the output layer
based on the error associated with the given task. This gradient is then
propagated backward, using the chain rule of differentiation to compute the
gradients at every layer. Subsequently, the deep architecture’s parameters are
adjusted to minimize the error. Stochastic Gradient Descent (SGD) is often
used as a scalable method for updating the weights [8]. Adam is a popular SGD variant, as its adaptive learning rates and momentum often improve optimization performance [9]. Such a model and optimization process can be thought of
as analogous to a box with many knobs, with the goal of tuning them optimally
to achieve a specific objective.
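As a concrete illustration (a minimal sketch, not code from the thesis), one such parameter update step with Adam in PyTorch might look as follows; the model, data, and loss below are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# A small multi-layer network standing in for a generic deep architecture.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
# Adam: SGD with adaptive learning rates and momentum.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 16)   # a minibatch of inputs
y = torch.randn(32, 1)    # corresponding targets

optimizer.zero_grad()            # clear gradients from the previous step
loss = loss_fn(model(x), y)      # forward pass and error at the output
loss.backward()                  # backpropagation: chain rule through all layers
optimizer.step()                 # adjust the parameters to reduce the error
```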

2.1.3 Convolutional Neural Networks


CNNs have emerged as a powerful neural network architecture for tasks
involving image and video recognition [7]. These networks utilize
convolutional layers to extract meaningful features from input images,
enabling various downstream applications, such as generating new images
resembling those in the provided dataset. The foundational concept of CNNs
was initially introduced in the 1980s with Fukushima’s Neocognitron model
[10]. However, they gained immense popularity only in the past decade,
thanks to the availability of extensive labeled image datasets and the utilization
of high-performance GPUs, which facilitated the construction of large-scale
networks [11].
A generic CNN architecture follows a hierarchical structure comprising
multiple convolutional (C) layers and optionally pooling (P ) layers:

$$U_\Theta(x) = (C_{\theta_K} \cdots P \cdots C_{\theta_2} \cdots C_{\theta_1})(x), \tag{2.1}$$


where $\Theta = (\theta_1, \ldots, \theta_K)$ is the set of network parameters [12]. A convolutional layer $C_{\theta_k}$ applies a bank of filters $\theta_k = (\gamma_{ji})$ to the input $x$, followed by a non-linearity $\sigma$, such as a Rectified Linear Unit (ReLU), which is simply the half-wave rectifier $f(x) = \max(x, 0)$ [10] [13]. For an input $x^{(\mathrm{in})} \in \mathbb{R}^{N_{\mathrm{in}} \times H \times W}$ ($N_{\mathrm{in}} = 3$ at the first layer for an RGB image), the output $x^{(\mathrm{out})} \in \mathbb{R}^{N_{\mathrm{out}} \times H_{\mathrm{out}} \times W_{\mathrm{out}}}$ is obtained by

$$x^{(\mathrm{out})}_j = \sigma\!\left( \sum_{i=1}^{N_{\mathrm{in}}} \gamma_{ji} * x^{(\mathrm{in})}_i \right), \tag{2.2}$$

where the $x^{(\mathrm{out})}_j$ are often referred to as the feature maps. Here,

$$(\gamma * x)(h, w) = \sum_{h'=-a}^{a} \sum_{w'=-b}^{b} \gamma(h', w')\, x(h - h', w - w') \tag{2.3}$$

denotes the discrete 2D convolution with a filter of size (2a, 2b). The filters
learn to express local spatial connectivity patterns across the input channels,
enabling the model to identify, detect, classify, and segment objects within the
image.
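As a sketch of eq. (2.2), a single convolutional layer followed by a ReLU can be expressed in PyTorch as below; the filter count and kernel size are arbitrary choices for illustration, not values used in the thesis:

```python
import torch
import torch.nn as nn

# One convolutional layer C_theta: a bank of 16 filters over 3 input channels,
# followed by the ReLU non-linearity sigma.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
relu = nn.ReLU()

x_in = torch.randn(1, 3, 128, 128)   # an RGB image (N_in = 3)
x_out = relu(conv(x_in))             # feature maps, shape (1, 16, 128, 128)
print(x_out.shape)
```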

2.2 Deep Generative Models


The main task in DGMs is generating novel data that captures the essential
features present in the training data. Sampling is at the heart of this process to
achieve the goal of creating new data points. Consider an image, denoted as x,
drawn from a dataset X , and let us view it as a sample from the joint probability
distribution over the data’s dimensions, denoted as p(x1 , . . . , xN ) = p(x). For
RGB images with a resolution of 128 × 128, we have N = 49152 dimensions.
The high dimensionality renders p(x) exceedingly complex and impractical to
parameterize directly. Hence, how can we generate new samples from p(x)?
A naive Monte Carlo approach might involve randomly drawing every pixel
from U{0, 255} and hoping to obtain an image resembling those present in our
dataset eventually. However, due to the curse of dimensionality, the dataset
X only occupies a small subset of the high-dimensional space containing all
possible images. Consequently, most images sampled through this method
will consist primarily of noise, making it an inefficient approach.
To overcome this challenge, an alternative strategy involves sampling z
from a simple parametric distribution (e.g., Gaussian: z ∼ N (µ, σ 2 )) and
learning a transformation x′ = f (z) such that x′ ∼ p(x). If we learn such
an f , we can map high-likelihood samples from the known distribution to
high-likelihood samples from p(x). DGMs use the data and a deep neural
network architecture to estimate p(x) by learning such a high-dimensional
mapping from z (sampled from a known distribution) to the empirical data
x. This approach allows for efficient sampling while producing meaningful
outputs. Multiple DGM architectures have shown success in generating
synthetic images of high quality in recent years: Variational Autoencoders
(VAEs) [14], GANs [3], Normalizing Flows (NFs) [15], and diffusion models
[16] [17]. Diffusion models have overtaken GANs and claimed the status of
state of the art.
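The following minimal sketch (illustrative only; the decoder architecture and dimensions are arbitrary, not from the thesis) shows this idea of mapping samples from a simple Gaussian through a learned function f:

```python
import torch
import torch.nn as nn

# f: a small decoder mapping a low-dimensional latent z to a 3x32x32 "image".
# In a trained DGM, its parameters would be fitted so that f(z) ~ p(x).
f = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 3 * 32 * 32), nn.Tanh(),
)

z = torch.randn(8, 64)                  # z ~ N(0, I): easy to sample from
x_generated = f(z).view(8, 3, 32, 32)   # candidate samples approximating p(x)
```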
Sometimes we want our model to generate novel data adhering to specific
conditions c. When we want to model a dependence on such covariates c ∈ C,
we obtain a conditional generative model of the form p(x|c). A more familiar
and intuitive form in the conditional case is the model p(y|x), where we denote
the inputs x and the outputs y. This discriminative model predicts y, which
can be a class label with one correct output. Generative modeling differs from
this framework mainly because we assume many correct outputs may exist.
Rather than maximizing the number of correct outputs, it aims to match the
output distribution to the target distribution. Therefore, they are harder to
evaluate [2].

2.2.1 Generative Adversarial Networks


In 2014, Ian J. Goodfellow et al. proposed GANs, a new generative framework
in which the generator is pitted against an adversary [3]. The model consists
of two submodels, a generator G and a discriminator D. Through D, the
architecture sidesteps the difficulty of approximating intractable probabilistic
computations that arise in Maximum Likelihood Estimation (MLE). The goal
of G is to generate samples indistinguishable from the real dataset, while the
goal of D is to distinguish between real and fake ones. This setting can be
considered analogous to the police (D) trying to detect fake currency produced
by a counterfeiter (G). The generator implicitly defines a distribution pG over
d-dimensional data x which is learned as a mapping G(z, θG ) from a prior
distribution pz (z) to data space. The mapping is represented by a neural
network architecture, parameterized by θG . The discriminator D(x, θD ) maps
d-dimensional samples to a single scalar, the probability that the sample x
came from the true data distribution. D is parameterized by another set of learned
parameters θD .
G and D are set to compete against each other during the training phase.
One training step consists of an update to the parameters of D, followed by an
update to the parameters of G. The result is a mini-max game in which G tries
to maximize the probability of D making a mistake while D tries to minimize
it. The game is defined by the objective formulated in eq. (2.4), which D and
G attempt to maximize and minimize, respectively.

$$\mathcal{L}_{\text{GAN}} = \mathbb{E}_{y \sim p_{\text{data}}(x)}[\log D(y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2.4}$$

2.2.2 Diffusion Models


Diffusion models sample from a distribution by reversing a gradual noising
process. Sampling starts with noise xT and produces gradually less noisy
samples xT −1 , xT −2 , . . . until the final sample x0 is reached.
Suppose x0 is a sample from our dataset and let x0:T be the sequence
obtained by gradually adding Gaussian noise to it, resulting in the Markov
chain:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \tag{2.5}$$

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right), \tag{2.6}$$

where $\beta_1, \ldots, \beta_T$ are given according to a variance schedule. Equation (2.5) is known as the forward process or diffusion process. The $x_t$ obtained by sampling conditioned on the previous noising step $x_{t-1}$ in eq. (2.6) are a set of latent variables $x_1, \ldots, x_T$ with the same dimensionality as the data $x_0 \sim q(x_0)$. These provide a starting point for latent variable models of the form $p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$, called diffusion models, introduced by Sohl-Dickstein et al. [18]. The joint distribution $p_\theta(x_{0:T})$, known as the reverse process, is defined as a Markov chain with Gaussian transitions parameterized by $\theta$:


$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \tag{2.7}$$
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \tag{2.8}$$
$$p(x_T) = \mathcal{N}(x_T;\ 0,\ I). \tag{2.9}$$

The goal is to tune θ such that we can start with Gaussian noise xT in
Equation 2.9 and through the reverse process gradually transform it to x0 such
that x0 ∼ q(x0 ). This training is performed by optimizing the variational
bound on negative log-likelihood:
$$\mathbb{E}[-\log p_\theta(x_0)] \le \mathbb{E}_q\!\left[ -\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right] \tag{2.10}$$
$$= \mathbb{E}_q\!\left[ -\log p_\theta(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})} \right] =: L \tag{2.11}$$
$$= L_T + L_{T-1} + \cdots + L_0 \tag{2.12}$$
$$\text{where } L_T = D_{KL}\!\left( q(x_T \mid x_0)\ \|\ p_\theta(x_T) \right) \tag{2.13}$$
$$L_{t-1} = D_{KL}\!\left( q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t) \right) \quad \text{for } 2 \le t \le T \tag{2.14}$$
$$L_0 = -\log p_\theta(x_0 \mid x_1), \tag{2.15}$$

where $D_{KL}(q \,\|\, p)$ denotes the Kullback–Leibler divergence of $q$ from $p$ and is a measure of how different $q$ is from $p$. Now let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$.
Then the forward process admits sampling $x_t$ conditioned on $x_0$ at an arbitrary timestep $t$ in closed form:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right). \tag{2.16}$$
Therefore, evaluating the full loss L at every update of θ is unnecessary.
Instead, random terms can be optimized with stochastic gradient descent by
sampling a random timestep t at every update iteration. Another consequence
of this reparameterization is that the forward process posteriors (mean µ̃t and
variance β̃t ) are tractable when conditioned on x0 :

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t \tag{2.17}$$
$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t. \tag{2.18}$$
The forward process variances can be learned or held constant as hyperparameters. Ho et al. set $\Sigma_\theta(x_t, t) = \sigma_t^2 I$ to untrained time-dependent constants [17]. Nichol and Dhariwal found that learning the variance gave better log-likelihood but had no effect on sample quality [19]. Without learning the variance, we can set $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$, and obtain the following form for $L_{t-1}$:

$$L_{t-1} = \mathbb{E}_q\!\left[ \frac{1}{2\sigma_t^2} \left\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2 \right] + C, \tag{2.19}$$
where C is a constant that does not depend on θ. Hence, we want to
parameterize µθ as a model that predicts µ̃t , the posterior mean of the forward
process. xt and µθ can be further reparameterized to formulate a simplified
objective in which we try to predict the noise at every time step:

$$\mathbb{E}_{x_0, \epsilon}\!\left[ \frac{\beta_t^2}{2\sigma_t^2\, \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t \right) \right\|^2 \right]. \tag{2.20}$$
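To make eq. (2.16) and the simplified objective in eq. (2.20) concrete, here is a minimal sketch assuming a PyTorch model `eps_model(x_t, t)` that predicts the noise; the linear schedule and all names are illustrative and not the thesis implementation:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # an illustrative variance schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)    # bar{alpha}_t = prod_i alpha_i

def diffusion_loss(eps_model, x0):
    """Simplified DDPM objective: predict the noise added at a random timestep."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                       # random timesteps
    eps = torch.randn_like(x0)                          # epsilon ~ N(0, I)
    ab = alpha_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps        # eq. (2.16)
    # eq. (2.20) with the time-dependent weighting factor dropped
    return ((eps - eps_model(x_t, t)) ** 2).mean()
```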

2.3 Related Work


2.3.1 Conditional GANs
A Conditional Generative Adversarial Network (cGAN) is an extension of
the traditional GAN that incorporates additional conditional information into
the training process. In a cGAN, both the generator and discriminator
(alternatively only the generator) are conditioned on some form of an auxiliary
input, such as class labels, textual descriptions, or image conditioning. By
conditioning the GAN on additional information, the generator learns to
generate samples that not only resemble the real data but also adhere to the
specified conditions [20].

2.3.1.1 Pix2Pix
Isola et al. explored GANs for image generation in a conditional setting [21].
Their approach, named Pix2Pix, was tested on various tasks and datasets, such
as map to aerial photos and day-to-night translation. Pix2Pix’s ability to learn
the loss function through adversarial training makes it a versatile and powerful
tool for a variety of image-to-image translation tasks, allowing the network
to adapt to different problem domains without the need for task-specific loss
formulations. The model learns a mapping from an observed sample x and
random noise z to y, G : {x, z} → y. The objective function consists of a
GAN loss and an L1 term. The GAN discriminator learns the high-frequency
content, while the L1 loss helps to learn the low-frequency trends, resulting in
the following objectives:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \tag{2.21}$$
$$\mathcal{L}_{L1} = \lambda\, \mathbb{E}[\| y - G(x, z) \|_1] \tag{2.22}$$
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{cGAN} + \mathcal{L}_{L1}. \tag{2.23}$$

Pix2Pix used U-Net [22] as the generator and PatchGAN [23] as the
discriminator. PatchGAN models the image as a Markov random field by
focusing on the structure in local image patches. I.e., the model assumes that
given its neighbors, a pixel is independent of pixels that are more than N steps
away. The discriminator runs convolutionally across the image, averaging all
responses to provide the final output of D. GAN-based techniques have also
been applied to unpaired image translation [24].

2.3.1.2 Points2Pix
In 2019, Milz, Simon, Fischer et al. proposed an approach for 3D point cloud
to image translation applied to Lidar [25]. The mapping from point clouds
to images was learned using a conditional GAN with three distinct conditions
c1 , c2 , c3 . Firstly, c1 was obtained by processing the raw point cloud using
PointNet [26]. Secondly, c2 was a projection of the point cloud onto a 2D
image, with the distance encoded in the green channel and the reflectance
intensities in the blue channel. Finally, c3 consisted of an image background
patch surrounding the object to be generated from the point cloud. The first
and second conditions were concatenated at the bottleneck of the generator
architecture, which followed a UNet structure [22]. The discriminator was
implemented following the PatchGAN approach proposed in [21].
As in eq. (2.23), the loss function used was a combination of the
conditional GAN loss and an L1 term that enforced correctness in low-
frequency content:

$$\mathcal{L}_{\text{Points2Pix}} = \mathcal{L}_{cGAN}(G, D) + \lambda_{L1}\, \mathbb{E}[\| y - G(c_1, c_2, c_3) \|_1]. \tag{2.24}$$

The authors of this work conducted experiments on KITTI [27] for outdoor
and SunRGBD [28] for indoor scenarios. For validation, they measured the
number of correctly detected classes with the aid of the 2D object detector
YOLOv3 [29], which could be achieved due to object-centered image patches
used in their experiments. The classification score was then given by the
detection ratio of fake images to ground truth. Additionally, the intersection
over union (IoU) for the bounding boxes of predicted objects was measured.

2.3.2 Conditional Diffusion Models


Inspired by Pix2Pix [21], Palette evaluated conditional diffusion models
which outperformed GAN baselines on four image-to-image translation tasks:
colorization, inpainting, uncropping, and JPEG restoration [30]. Their
neural network implementation was based on the original DDPM [17]
with several modifications which further improved sample quality [19] [4].
DDPM follows the backbone of PixelCNN++ [31], which is a U-Net [22]
based on a wide ResNet [32]. Instead of class-conditioning, they added
conditioning of the source image via concatenation, following [33]. Inspired
by CycleGAN, diffusion has also been applied to unpaired image translation
[34]. Furthermore, Choi et al. proposed Iterative Latent Variable Refinement
(ILVR) [35], a conditioning method for denoising diffusion probabilistic
models. ILVR guided the generative process of a diffusion model based on
a reference image, facilitating the refinement of latent variables throughout
the diffusion process.

Chapter 3

Methods

This chapter discusses the core methods associated with this project. Firstly,
the methods employed for selecting the training and testing datasets and the
preprocessing steps involved in transforming the data into a format suitable as
input to the models are presented. Secondly, the choice of models and their
implementation details are discussed in detail. Finally, the evaluation metrics
employed for comparing the performance of these models are discussed.

3.1 Data

Figure 3.1: The Hugin 4D Radar.

The Hugin 4D Imaging Radar (fig. 3.1) utilizes Arbe’s chipset and boasts
advanced perception capabilities and a wide field of view [36]. It uses a 48
× 48 MIMO antenna configuration, operating in the 76 to 81 GHz frequency
range with a maximum bandwidth of 1540 MHz. It has a sensitivity range of
0.5 to 300+ m, a range resolution of 0.1 to 0.75 m, and a Doppler resolution of
0.1 m/s. The radar data used in this project were recorded with the A3 radar
variant at a framerate of 15 FPS and synchronized with the RGB camera, which
operated at 25 FPS. Additional technical details are available in table 3.1.

Chipset | Arbe Phoenix
Antennas | 48 × 48 MIMO configuration
Frequency | 76−81 GHz
Max Bandwidth | 1540 MHz
Field of View, Azimuth | 50−100°
Field of View, Elevation | 30°
Frame Rate | > 15 FPS
Sensitivity Range | 0.5−300+ m
Sensitivity Range Resolution | 0.1−0.75 m
Doppler Resolution | 0.1 m/s
Virtual Channels | > 2300 (2k ultra-high resolution)
Target Radial Velocity | −35 to +70 m/s
Azimuth Resolution (at beam 3 dB width point) | 1.25°
Elevation Resolution (at beam 3 dB width point) | 1.7°
Number of Simultaneously Tracked Objects | > 500

Table 3.1: HUGIN S1 4D imaging radar features.

3.1.1 Data Collection and Selection


The data for this study were collected in February of 2023, encompassing
various environmental and traffic scenarios, including day and night
conditions, snowy and rainy days, sunny days without snow, and rush hours
with varying traffic intensity. Specifically, four 15-minute recordings during
rush hour traffic were selected for training purposes, taken at 8 AM and
4:30 PM. These recordings were chosen based on specific parameters to
ensure the maximum possible traffic intensity and point cloud representation.
Consequently, a total of 89762 training samples were obtained from these
selected recordings. 5000 samples were used for testing. It is important to note
that additional training data were available; however, these four recordings
were selected to meet the following criteria:

• The static parameter: Recordings with varying static thresholding of
the signal strength (power) values were obtained, specifically 5, 10, and
15. A lower static value resulted in richer point clouds, increasing
the information density in the input data.
• The mode parameter: Recordings with mode set to Medium, Long,
and Ultra-Long were obtained. A longer mode corresponds to
greater detection capabilities at long distances, while a shorter mode
provides better range resolution.

• Similar weather and lighting conditions: As the radar does not capture
weather or lighting information, recordings with similar backgrounds
were selected. This approach ensured a smoother generation of videos
by biasing the background towards a specific setting.

• Traffic intensity: To construct a dataset with maximal traffic intensity, additional statistical heuristics were employed as selection criteria. Two metrics were used: the mean point count $\mu_{pc}$ (the average number of points in a given set of point clouds) and the Doppler ratio $\frac{\mu_{dpc}}{\mu_{pc}}$ (the ratio of the average number of points with a Doppler value greater than 1 to the mean point count). These values for the chosen dataset are shown in table 3.2.

Recording | Mean point count | Static | Doppler ratio | Mode
train1 | 817.9 | 5 | 0.1452 | Ultra-Long
train2 | 804.9 | 5 | 0.1352 | Ultra-Long
train3 | 825.3 | 5 | 0.1303 | Ultra-Long
train4 | 755.3 | 5 | 0.1099 | Ultra-Long
test | 775.0 | 5 | 0.0942 | Ultra-Long

Table 3.2: Radar recording statistics used as heuristics for dataset selection.

3.1.2 Preprocessing
The raw 4D radar point cloud data consists of a set of $N$ points $\{x_i\}_{i=1}^{N}$ in 3D space with spatial coordinates $x, y, z$, along with the additional dimensions $x_{\text{doppler}}$, $x_{\text{range}}$, $x_{\text{power}}$. The preprocessing steps involved the separate processing
of the spatial dimensions and the additional information.

3.1.2.1 Spatial Dimensions


The spatial dimensions of the point cloud data were treated first. The initial
steps involved transforming the 3D spatial dimensions into a suitable format
for input to the models. The transformation included translation, rotation, and
projection operations. The translation and rotation operations were performed
to align the point cloud with the camera’s coordinate system. Additionally, the
aligned points were projected onto the 2D plane corresponding to the camera
view (cf. eq. (3.1)):

$$x_{2D} = P\, R\, T\, x, \tag{3.1}$$


where T is the translation vector, R is the rotation vector, and P is the
projection matrix. These were computed beforehand by calibration using the
known relative position of the radar to the camera.
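For illustration, here is a sketch of the projection in eq. (3.1), with the rotation applied as a matrix and a camera intrinsic matrix standing in for the projection P; all calibration values below are hypothetical placeholders, not the project's calibration:

```python
import numpy as np

def project_points(points_xyz, R, t, K):
    """Project Nx3 radar points into the 2D image plane.

    R (3x3 rotation) and t (3-vector translation) align radar to camera
    coordinates; K (3x3 intrinsics) plays the role of the projection.
    """
    cam = (R @ points_xyz.T).T + t      # rotate and translate into the camera frame
    pix = (K @ cam.T).T                 # apply the projection
    return pix[:, :2] / pix[:, 2:3]     # perspective divide -> pixel coordinates

# Hypothetical calibration: identity rotation, small offset, simple pinhole intrinsics.
R = np.eye(3)
t = np.array([0.0, -0.5, 0.0])
K = np.array([[800.0, 0.0, 64.0],
              [0.0, 800.0, 64.0],
              [0.0, 0.0, 1.0]])

points = np.array([[1.0, 0.5, 20.0], [-2.0, 0.3, 35.0]])  # x, y, z in metres
print(project_points(points, R, t, K))
```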

3.1.2.2 Additional Dimensions


Information such as the distance to an object was lost in the processing
of the spatial dimensions. To address this, the additional dimensions were
incorporated into the 2D projections by encoding them as RGB values in the
resulting 2D plane. The encoding scheme assigned the Doppler, range, and
power values to the red, green, and blue channels, respectively. Consequently,
the input image had dimensions 3×H ×W , where H and W denote the height
and width of the image, respectively.
To ensure that the model effectively captured variations in the Doppler,
range, and power information, the extreme values of these features were
calculated. These extreme values were used to scale the RGB channels (refer
to eq. (3.4)), maximizing the contribution of these features to the model’s
outputs:

$$R = 127.5 + c_R\, x_{\text{doppler}} \tag{3.2}$$
$$G = c_G\, x_{\text{range}} \tag{3.3}$$
$$B = c_B\, (x_{\text{power}} - x_{\text{power}}^{\min}), \tag{3.4}$$

where the $c_i$ are constants chosen such that each channel value lies in $[0, 255]$, i.e., the valid range for pixels. Since objects moving both towards and away from the detector were observed, $x_{\text{doppler}}$ could be either positive or negative with an observed maximum value $|x_{\text{doppler}}^{\max}|$. Setting $c_R = 8$ ensured that $R \in [0, 255]$. Similarly, setting $c_G = \frac{1}{1.42}$ and $c_B = 8$ ensured the proper scaling of the green and blue channels.
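A sketch of the channel encoding in eqs. (3.2)–(3.4) is given below; the constants mirror the values stated above, and the clipping is an added safeguard rather than part of the described method:

```python
import numpy as np

C_R, C_G, C_B = 8.0, 1.0 / 1.42, 8.0   # scaling constants as described above

def encode_point_colors(doppler, rng, power, power_min):
    """Map Doppler, range, and power values of projected points to RGB in [0, 255]."""
    r = 127.5 + C_R * doppler            # eq. (3.2): zero Doppler maps to mid-gray
    g = C_G * rng                        # eq. (3.3)
    b = C_B * (power - power_min)        # eq. (3.4)
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

colors = encode_point_colors(
    doppler=np.array([-3.2, 0.0, 7.5]),   # m/s, towards or away from the sensor
    rng=np.array([18.0, 95.0, 240.0]),    # m
    power=np.array([12.0, 7.0, 21.0]),    # signal power
    power_min=5.0,
)
```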
The image was then cropped to a region of interest and downsampled to
128 × 128, or 256 × 256 pixels. The 128 × 128 resolution was used in the
comparative analysis of the two generative methods. The 256 × 256 resolution
was used to demonstrate the radar's ability to capture discriminative features.
Finally, the pixels were scaled to the range [−1, 1], a common practice for
improving the stability and performance of neural networks. Four examples
of input-output pairs are shown in fig. 3.2.

(a) No close objects. (b) One car.

(c) Two pickup trucks. (d) A bus on the incoming lane.

Figure 3.2: Four examples of input-output pairs from the test dataset.

3.2 Deep Generative Models


There are many different kinds of DGMs, such as VAEs, NFs, diffusion
models, and GANs. Depending on the specific task at hand, several criteria
need to be considered when choosing from the different kinds of generative
models [2]:

• Density: Can the model evaluate the probability density function p(x)?

• Sampling: Can the model generate new samples x ∼ p(x)? If so, is it
fast or slow?

• Training: How are the parameters learned? Is it stable?

• Latents: Does the model use a latent vector z, and what is its
dimensionality?

• Architecture: What kind of neural network can be used?



VAEs learn a lower bound on the density through MLE, support fast
sampling, use latent representations with lower dimensionality than the data,
and are implemented using an encoder-decoder architecture, utilizing the
reparameterization trick [14]. NFs learn an exact density through MLE, are
slow to sample from and use latents with the same dimensionality as the
data. The architecture choice that restricts them is that we must use invertible
neural networks where each layer has a tractable Jacobian. GANs allow for
fast sampling and use small latents but do not support density evaluation due
to the min-max training objective. In addition, the generator-discriminator
architecture can lead to unstable training. Diffusion models learn a lower
bound on the density through MLE, are slow to sample from, use latents with
the same dimensionality as the data, and use an encoder-decoder architecture.
The primary concern in this practical case is the quality of the generated
samples while being able to evaluate an exact density is less critical. GANs
have been extensively refined and explored in the literature and have produced
state-of-the-art results in recent years. On the other hand, diffusion models
have recently been used to outperform GANs on image synthesis tasks [4].
Based on these considerations and, given the promising results of both GANs
and diffusion models, they were selected as starting points for this project.

3.2.1 Conditional Generative Adversarial Network


Figure 3.3: A block diagram of the GAN training scheme. Both the generator
and discriminator are CNNs. The generator utilizes the discriminator’s
classification output through backpropagation to adjust its weight values. An
L1 term is also used to enforce low-level correctness (see algorithm 1).

The training scheme for the GAN is visualized in Figure 3.3. The GAN
consists of two main components: the generator (G) and the discriminator (D).
The output of D, which represents D’s confidence in the authenticity of the
input samples, is used to guide the optimization of both G and D. Specifically,
D’s output is used to formulate the loss function for updating the parameters
of both the G and D [3]. This adversarial training process aims to improve
G’s ability to produce synthetic samples that are indistinguishable from real
data, while simultaneously enhancing D’s ability to discriminate between real
and fake samples accurately. Additionally, an L1 term enforces low-level
correctness (refer to eq. (3.5)). By iteratively updating the parameters of
the two neural nets based on their respective loss functions, the GAN training
scheme facilitates the learning of a generator that can effectively generate
realistic data samples resembling the training data distribution.
As in [21], setting $\lambda_{L1} = 100$ in eq. (2.21) resulted in the objective:

$$\mathcal{L} = \mathcal{L}_{cGAN}(G, D) + 100 \cdot \mathbb{E}[\| x - G(z, c) \|_1], \tag{3.5}$$


where z denotes Gaussian noise and c is the conditioning radar point cloud
corresponding to the target image x. In practice, the discriminator took a batch
of images as input and outputted a single scalar for every image, predicting
whether the input is real (1) or fake (0). Binary Cross-Entropy (BCE), as the
discriminator loss:

BCE(ŷn , yn ) = yn · log(ŷn ) + (1 − yn ) · log(1 − ŷn ), (3.6)


where yn is the target and ŷn is the prediction for image n. D was
optimizing the matching of yn and ŷn , while G was optimized such that D
would output ŷn = 1 when yn = 0 and ŷn = 0 when yn = 1. The detailed
training procedure is presented in algorithm 1, in which a parameter update
step refers to minibatch SGD using the Adam algorithm [9] with a learning
rate of 0.0002 and momentum parameters β1 = 0.5, β2 = 0.999.

Algorithm 1 Conditional GAN training


Input Set of image-radar pairs X × C, models G and D.

1: for each batch of image-radar pairs (x, c) ∈ X × C do


2: z ∼ N (0, I) ▷ sample noise
3: y ← G(c, z) ▷ generator forward pass
4: dreal ← D(x) ▷ discriminator forward pass
5: dfake ← D(y)
6: LD = BCE(dreal , 1) + BCE(dfake , 0) ▷ discriminator loss
7: θD = update(θD , ∇θD LD ) ▷ backward pass, update
8: y ← G(c, z) ▷ repeat to update generator parameters
9: dfake ← D(y)
10: LG = BCE(dfake , 1) + λL1 L1(y, x) ▷ generator loss
11: θG = update(θG , ∇θG LG )
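As an illustration of algorithm 1, a condensed PyTorch-style sketch of one training step is given below. It assumes G and D are defined as described above, that D outputs probabilities in [0, 1], and that the optimizers use the stated Adam settings; it is a sketch, not the project's exact code:

```python
import torch
import torch.nn.functional as F

def cgan_train_step(G, D, x, c, opt_G, opt_D, lambda_l1=100.0):
    """One conditional GAN update: first the discriminator, then the generator."""
    z = torch.randn_like(c)                      # Gaussian noise, same shape as the condition
    # Discriminator update
    y = G(torch.cat([c, z], dim=1)).detach()     # fake images, no gradient to G here
    d_real, d_fake = D(x), D(y)
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # Generator update: fool D and stay close to the target in L1
    y = G(torch.cat([c, z], dim=1))
    d_fake = D(y)
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake)) + \
             lambda_l1 * F.l1_loss(y, x)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```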

3.2.1.1 Implementation
Based on the ideas of Pix2Pix [21] and Points2Pix [25], we used U-Net as
the generator [22]. We did not use the raw point cloud as in Points2Pix,
since we found early on in the development process that it did not boost the
performance of our model. The architecture differed from Pix2Pix mainly in
two aspects: we concatenated Gaussian noise with the projected point cloud
at the input instead of using dropout, and we used attention as in Attention U-Net [37], which is an extension that allows the network to focus on important
regions of the image and ignore irrelevant regions. The architecture is shown
in fig. 3.4. U-Net is a CNN that consists of an encoder and a decoder with skip
connections based on ResNet [18]. We used an encoder that consisted of a
series of convolutional blocks with increasing channels. Each block used two
convolutions followed by batch normalization [38] and ReLU activation. A
Maxpooling operation with kernel size 2 and stride 2 after each block achieved
downsampling. We applied five blocks with 64, 128, 256, 512, and 1024
output channels. The decoding path used the attention mechanism, using
the residual feature map as the input signal (key) and the upscaled feature
map coming from the bottleneck as the gating signal (query). We used the
PatchGAN discriminator [23] as the adversarial network, which treats the
image as a set of independent patches which it classifies as real or fake. The
GAN was trained with a batch size of 64.
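A sketch of the encoder block described above is shown here; the kernel size and padding are assumptions, as they are not specified in the text:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Conv-BN-ReLU twice, then 2x2 max pooling with stride 2 for downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        features = self.block(x)          # kept for the skip connection
        return self.pool(features), features

# First encoder block: from the 6-channel input (3 noise + 3 condition) to 64 channels.
first_block = EncoderBlock(6, 64)
```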

Figure 3.4: A block diagram of the Attention U-Net generator. The input
image is progressively filtered and downsampled with stride 2. The input has
6 input channels: 3 for noise and 3 for the condition, while the output has 3 for
red, green, and blue. The features propagated through the skip connections are
filtered using attention gates. This figure was based on Attention U-Net [37].

3.2.2 Conditional Diffusion Model


A diffusion model samples from a distribution by starting with noise xT and
gradually transforming it into less noisy samples xT −1 , xT −2 , . . . until the final
sample x0 is obtained. The diffusion process (refer to fig. 3.5) consists of small
increments of Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right). \tag{3.7}$$
Therefore, the transitions from xt to xt−1 in the reverse process can also
be represented as conditional Gaussians:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right). \tag{3.8}$$


Equation 3.8 enables a straightforward parameterization of the reverse
process using a neural network. It has been shown that learning the variance
does not significantly impact sample quality [19]. Therefore, we held the
variance fixed at βt and learned to predict the mean µθ (xt , t):

pθ (xt−1 | xt ) = N (xt−1 ; µθ (xt , t), βt I). (3.9)


As mentioned in section 2.2.2, we can sample xt at an arbitrary timestep using the parameterization αt = 1 − βt and ᾱt = ∏_{i=1}^t αi:

q(xt | x0) = N(xt; √ᾱt x0, (1 − ᾱt)I),        (3.10)

which can be further reparameterized as

xt(x0, ϵ) = √ᾱt x0 + √(1 − ᾱt) ϵ  for ϵ ∼ N(0, I).        (3.11)
This means that µθ(xt, t) must predict (1/√αt)(xt − (βt/√(1 − ᾱt)) ϵ) given xt. We may instead predict ϵ by parameterizing the mean as

µθ(xt, t) = (1/√αt)(xt − (βt/√(1 − ᾱt)) ϵθ(xt, t)),        (3.12)
where the noise ϵθ (xt , t) is predicted with the use of a neural network
Gθ (xt , t). Given a noisy image, this allows us to sample a less noisy image
xt−1 ∼ pθ (xt−1 |xt ) (eq. (3.9)) by inferring the noise and computing

xt−1 = (1/√αt)(xt − (βt/√(1 − ᾱt)) ϵθ(xt, t)) + √βt z,  where z ∼ N(0, I).        (3.13)

In the practical case of visualizing 4D radar output, we additionally


conditioned the predicted noise ϵθ (c, xt , t) = Gθ (c, xt , t) on the projected
point cloud c by concatenation as in [30]. Finally, the resulting sampling
process is shown in algorithm 2.

The model is trained by optimizing random terms of L = Σ_{t=0}^{T} Lt with
minibatch SGD, as shown in algorithm 3. We used the Adam algorithm [9]
with a learning rate of 0.0001 and momentum parameters β1 = 0.9, β2 =
0.999. By sampling a random timestep t for every image in the batch, we
computed an approximation to the full loss L at every update iteration. As [17]
found it to be beneficial to sample quality, we opted to optimize a simplified
version of the noise-predicting objective:
Lt−1 = E_{x0,ϵ}[ ||ϵ − ϵθ(c, √ᾱt x0 + √(1 − ᾱt) ϵ, t)||² ].        (3.14)

Algorithm 2 Sampling by reversing the diffusion process


Input A 2D point cloud projection c, model G.
xT ∼ N(0, I)  ▷ sample Gaussian noise
for t = T, . . . , 1 do
  z ∼ N(0, I) if t > 1, else z = 0
  ϵθ(c, xt, t) = Gθ(c, xt, t)  ▷ forward pass
  xt−1 = (1/√αt)(xt − (βt/√(1 − ᾱt)) ϵθ(c, xt, t)) + √βt z  ▷ eq. (3.13)
return x0
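
As a concrete counterpart to algorithm 2, a single reverse transition of eq. (3.13) might look as follows in PyTorch. The precomputed schedule tensors and the model signature Gθ(c, xt, t) are assumptions made for this sketch.

import torch

@torch.no_grad()
def reverse_step(model, c, x_t, t, alphas, alphas_bar, betas):
    """One transition x_t -> x_{t-1} of eq. (3.13).
    alphas, alphas_bar, betas are assumed to be 1-D tensors indexed 0..T-1,
    and model(c, x_t, t) is assumed to return the predicted noise."""
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = model(c, x_t, t_batch)                              # predicted noise
    coef = betas[t] / torch.sqrt(1.0 - alphas_bar[t])
    mean = (x_t - coef * eps) / torch.sqrt(alphas[t])
    if t > 0:                                                 # add noise except at the last step
        return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
    return mean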

Algorithm 3 Training a diffusion model


Input Set of image-radar pairs X × C, model G.
1: repeat
2: for each batch of image-radar pairs (x0 , c) ∈ X × C do
3: t ∼ Uniform({1, . . . , T }) ▷ sample random timesteps
4:   ϵ ∼ N(0, I)  ▷ sample Gaussian noise
5:   ϵθ = G(c, √ᾱt x0 + √(1 − ᾱt) ϵ, t)  ▷ predict noise at times t
6: Lt = ||ϵ − ϵθ ||2 ▷ calculate loss
7: θ = update(θ, ∇θ Lt ) ▷ backward pass, update
8: until convergence
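
A minimal sketch of one optimization step of algorithm 3, using the simplified objective in eq. (3.14), is given below. The model signature and the layout of the ᾱ schedule tensor (index t − 1 holding ᾱt) are assumptions of the sketch.

import torch
import torch.nn.functional as F

def diffusion_training_step(model, optimizer, x0, c, alphas_bar, T=1000):
    """One minibatch update of the simplified objective in eq. (3.14).
    alphas_bar is assumed to be a 1-D tensor where index t-1 holds the
    cumulative product for timestep t; model(c, x_t, t) returns predicted noise."""
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,), device=x0.device)           # step 3: random timesteps
    a_bar = alphas_bar[t - 1].view(b, 1, 1, 1)                    # broadcast over (C, H, W)
    eps = torch.randn_like(x0)                                    # step 4: Gaussian noise
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps  # eq. (3.11)
    loss = F.mse_loss(model(c, x_t, t), eps)                      # steps 5-6
    optimizer.zero_grad(); loss.backward(); optimizer.step()      # step 7
    return loss.item()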

We used the cosine variance schedule for ᾱt :

ᾱt = f(t)/f(0),   f(t) = cos²( (t/T + s)/(1 + s) · π/2 ),        (3.15)

where s = 0.08 and T = 1000. Nichol and Dhariwal [19] found this schedule to give better results than a linear one. The forward process is visualized in fig. 3.5.

Figure 3.5: Diffusion forward process using the cosine noise schedule. The
leftmost image is the sample at t = 0, and the rightmost is pure noise at t = T .
Each image after t = 0 is 100 diffusion steps noisier than its left neighbor.
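
The cosine schedule of eq. (3.15) can be precomputed once before training; a small sketch is shown below with the values of s and T used above. The clamping of βt at 0.999 follows Nichol and Dhariwal [19] rather than anything stated in this thesis.

import torch

def cosine_schedule(T=1000, s=0.08):
    """Cosine schedule of eq. (3.15). Returns (alpha_bar, betas), each of length T,
    where index t-1 corresponds to timestep t. Clamping beta_t at 0.999 follows [19]."""
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((t / T + s) / (1 + s)) * (torch.pi / 2)) ** 2
    alpha_bar = f / f[0]                                  # so that alpha_bar equals 1 at t = 0
    betas = (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)
    return alpha_bar[1:], betas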

3.2.2.1 Implementation
In DDPM, Ho et al. [17] introduced the U-Net architecture for diffusion
models. The model employs a series of residual layers and downsampling
convolutions, followed by a set of residual layers with upsampling convolu-
tions. These layers are interconnected with skip connections, linking layers of
the same spatial size. The implementation of DDPM was a modified version
of the Attention U-Net (which we used as the generator in GAN, shown in
fig. 3.4). Rather than applying attention at every resolution, the authors opted
to utilize a single-head global attention layer, specifically at the 16 × 16
resolution. Furthermore, each residual block incorporated a projection of the
timestep embedding. Song et al. [39] discovered that implementing additional
modifications to the U-Net architecture resulted in improved performance on
the CIFAR-10 [40] and CelebA-64 [41] datasets. Dhariwal and Nichol [4] found that such architectural changes can substantially boost sample quality, demonstrating this on ImageNet 128 × 128. The authors explored increasing the depth and the number of attention heads, using attention at multiple resolutions rather than only at 16 × 16, using the BigGAN residual block [42] for upsampling and downsampling, and rescaling residual connections with 1/√2.
To reach our objectives, we used the U-Net architecture implemented
in [4], which used residual blocks with two convolutional layers, group
normalization, and the Sigmoid Linear Unit activation function. We used two
residual blocks per resolution with 128, 256, 512 and 1024 filter channels at
128 × 128, 64 × 64, 32 × 32, and 16 × 16 resolutions, respectively. Attention
was employed at 64 × 64, 32 × 32, and 16 × 16 resolutions. The model takes
6 input channels, 3 for the condition, and 3 for the noisy image at the previous
timestep and outputs an RGB image with 3 channels. In this architecture,


the weights of the network were shared across timesteps, with the timestep t provided through sinusoidal position embeddings as in [43]. The diffusion model was trained with a batch
size of 16.
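
A sketch of such a sinusoidal timestep embedding (following [43]) is shown below; the base frequency of 10000 is the standard Transformer formulation, not necessarily the exact code used here, and the embedding dimension is assumed to be even.

import math
import torch

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps t with shape (B,); dim is assumed even."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :].to(t.device)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # shape (B, dim)

# Example: emb = timestep_embedding(torch.tensor([1, 250, 999]), dim=128)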

3.2.3 Hybrid method: GAN-conditioned Diffusion


During the qualitative assessment, it was observed that the conditional
sampling using diffusion faced some challenges (refer to chapter 4). To
circumvent this, we introduced an alternative sampling process, which we
called the hybrid approach. This was achieved by combining the trained GAN
and diffusion models by using a conditioning method for diffusion similar to
iterative latent variable refinement [35]. Instead of starting from Gaussian
noise at t = 1000, the reverse diffusion process was started at t = 250 from
a noisy image generated by the GAN. Given a GAN-generated image xGAN ,
a noisy version was calculated using eq. (3.11), resulting in the noisy GAN-
generated image:

xt(xGAN, ϵ) = √ᾱt xGAN + √(1 − ᾱt) ϵ  for ϵ ∼ N(0, I),        (3.16)

which was used in a modified version of algorithm 2 (refer to algorithm 4),


starting at T = 250. The process is visualized in fig. 3.6. The starting timestep
in the hybrid approach was determined based on prior experiments conducted
by Choi et al. [35], in which the authors generated samples with varying
starting timesteps and observed the deviation from reference images.

Algorithm 4 Sampling by reversing the diffusion process


Input Conditions c, xGAN, model G.
ϵ ∼ N(0, I)
xT(xGAN, ϵ) = √ᾱT xGAN + √(1 − ᾱT) ϵ  ▷ eq. (3.16)
for t = T, . . . , 1 do
  z ∼ N(0, I) if t > 1, else z = 0
  ϵθ(c, xt, t) = Gθ(c, xt, t)  ▷ forward pass
  xt−1 = (1/√αt)(xt − (βt/√(1 − ᾱt)) ϵθ(c, xt, t)) + √βt z  ▷ eq. (3.13)
return x0
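
Putting algorithm 4 together, a sketch of the hybrid sampling loop is given below. It reuses a single-step reverse function such as the one sketched after algorithm 2, and the GAN interface is simplified to a single call on the condition; in practice the GAN input also includes noise and its output may be resized before being noised.

import torch

@torch.no_grad()
def hybrid_sample(reverse_step, model, G_gan, c, alphas, alphas_bar, betas, t_start=250):
    """Sketch of algorithm 4: noise the GAN output to timestep t_start (eq. (3.16)),
    then run the short reverse chain. reverse_step is assumed to implement one
    transition of eq. (3.13) with 0-based schedule indexing."""
    x_gan = G_gan(c)                                                   # GAN proposal (simplified interface)
    a_bar = alphas_bar[t_start - 1]                                    # cumulative alpha at t = t_start
    eps = torch.randn_like(x_gan)
    x_t = torch.sqrt(a_bar) * x_gan + torch.sqrt(1.0 - a_bar) * eps    # eq. (3.16)
    for t in range(t_start - 1, -1, -1):                               # t_start, ..., 1 (0-based index)
        x_t = reverse_step(model, c, x_t, t, alphas, alphas_bar, betas)
    return x_t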


Figure 3.6: The hybrid method used to generate an image using a diffusion model trained at 256 × 256 px, conditioned on images generated by the GAN at 128 × 128 px.

3.3 Training Process


The DGMs were trained on the training dataset until convergence and
evaluated on the testing dataset. Determining the optimal stopping point for
training a generative model is challenging and requires careful evaluation. In
addition to monitoring the training loss, we incorporated subjective evaluation
methods to assess the model’s performance. By regularly inspecting
the generated output and considering specific factors, such as the correct
representation of car color, we aimed to identify signs of overfitting. Since
radar data does not provide explicit information about car color, the model’s
ability to generate accurate color information and specific details could
indicate overfitting. Based on these observations, we made informed decisions
to stop the training process at an appropriate time to prevent overfitting and
ensure the model’s generalization capability. As mentioned in their respective
implementations, the GAN was trained with a batch size of 64, and the
diffusion model was trained with a batch size of 16. The batch sizes were chosen to make maximal use of GPU memory.

3.4 Hardware/Software Used


The models were implemented in Python using the deep learning framework PyTorch and trained on an NVIDIA RTX 3090 GPU. The computer vision library OpenCV and the numerical computing package NumPy were used for data handling and image processing. Our implementations were built on the backbones of existing open-source implementations [4], [44], [45], [46]. In our writing process, we used Grammarly and ChatGPT to improve sentence structure, grammar, and phrasing.

Chapter 4

Results and Analysis

4.1 Evaluation Framework


4.1.1 Qualitative Evaluation
To assess the realism of the generated images, we employed a qualitative
evaluation approach. This involved rendering videos using the models and
analyzing the resulting frames. The evaluation was conducted through visual
inspection, where the generated frames were compared subjectively to real
video frames. Qualitative evaluation was chosen as the preferred method
due to its close alignment with the ultimate objective of a generative model,
which is to produce visually appealing and realistic video frames. Due to
the subjective nature of qualitative evaluation, quantitative evaluation methods
were also employed to provide a more objective assessment of the generated
images and a comparison of the two methods.

4.1.2 Quantitative Evaluation


In addition to visual inspection, we performed a quantitative assessment to
evaluate the accuracy of the generated images. This assessment involved
calculating the Mean Squared Errors (MSEs) between the generated frames
and the ground truth frames. Additionally, object detection in generated and
ground truth frames was used as a measure of similarity. The MSE was used
to indicate an average pixel-level similarity to the ground truth. The object
detection metric was used to indicate the models’ ability to generate realistic
and recognizable objects.

4.1.2.1 Mean Squared Error


The MSE was calculated between the greyscaled generated frames and the
greyscaled ground truth frames. The decision to greyscale the images was
based on the fact that the radar data does not convey color information.
By calculating the MSE, a numerical measure of dissimilarity between the
generated and ground truth frames was obtained, providing a quantitative
indication of the accuracy of the generated images.
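
A minimal sketch of this computation with OpenCV and NumPy is shown below; rescaling the pixel values to [0, 1] before computing the MSE is an assumption of the sketch, not a detail taken from the thesis pipeline.

import cv2
import numpy as np

def greyscale_mse(generated, ground_truth):
    """MSE between greyscaled frames. Inputs are assumed to be HxWx3 uint8 images
    of equal size; normalization to [0, 1] is an assumption of this sketch."""
    g = cv2.cvtColor(generated, cv2.COLOR_BGR2GRAY).astype(np.float64) / 255.0
    t = cv2.cvtColor(ground_truth, cv2.COLOR_BGR2GRAY).astype(np.float64) / 255.0
    return float(np.mean((g - t) ** 2))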

4.1.2.2 Object Detection

(a) Full region used for object detection. (b) Object detection on the real image. (c) Object detection on the generated image.
(d) Cropped ROI: no parked vehicles. (e) Object detection on the cropped real image. (f) Detection on the cropped generated image.

Figure 4.1: Two cases used for evaluation by object detection: the full image
and a cropped region of interest, which excluded parked vehicles.

We employed an object detection analysis to further enhance the


quantitative evaluation of the generated images. This analysis involved
detecting objects in both the real and generated images, followed by comparing
the similarity between the occurrences of relevant classes of the detected
objects. The extent to which the generated images accurately captured the
objects present in the real images could be assessed by calculating the absolute
difference between the detected objects in generated and real images. We
used YOLOv5 [47] pre-trained on the Microsoft COCO dataset [48] for
object detection. We disregarded certain outputs from the object detection
as irrelevant, attributing them to a lack of fine-tuning. Specifically, classes
such as ”train,” ”cow,” and ”toilet” were identified as such and subsequently
discarded. Three objects were used for calculating the similarity metric: cars,
trucks, and buses. This metric was computed for two cases: the full image,
including parked vehicles shown in fig. 4.1a, and a cropped region of interest
excluding the parking zone shown in fig. 4.1d. Using a cropped region of
interest, more weight is given to the model's ability to generate the nearby vehicles present in the conditioning point cloud, whereas the ability to generate a high-quality background affects the result when parked vehicles are included.
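
As an illustration, counting the relevant classes with a pretrained YOLOv5 model through the torch.hub interface might look as follows; confidence thresholds, cropping to the ROI, and the filtering of irrelevant classes are omitted, and this is not the exact evaluation pipeline used here.

import torch

# Pretrained YOLOv5 loaded through the ultralytics/yolov5 torch.hub interface.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
RELEVANT = {'car', 'truck', 'bus'}

def count_vehicles(image):
    """Count detected cars, trucks, and buses in one image (file path or array)."""
    detections = model(image).pandas().xyxy[0]            # DataFrame with a 'name' column
    return int(detections['name'].isin(RELEVANT).sum())

# Per-frame counts of real and generated frames can then be compared, e.g.
# abs(count_vehicles('real.png') - count_vehicles('generated.png')).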

4.2 Qualitative Assessment


The qualitative assessment of the generated images revealed distinct
characteristics of the diffusion model and the GAN. As shown in fig. 4.2, the
diffusion model produced images that were visually more realistic compared
to the GAN. The diffusion model (fig. 4.2d) performed better in generating
higher-quality background elements, such as the parked vehicles at the top of
the image. In contrast, the GAN (fig. 4.2c) tended to exhibit more averaging.
An example of this is the bus in the leftmost image, which is less blurred in
the diffusion output. However, the diffusion model occasionally generated
incomplete objects. For instance, in the second image from the right in
fig. 4.2d, the diffusion model generated a red car where the front half appears
to fade, leaving only the trunk of the vehicle visible. Furthermore, there
were instances where the diffusion model failed to generate objects that were
clearly present in the point cloud and successfully generated by the GAN. This
discrepancy is particularly noticeable in the lower left corner of the images,
as depicted in fig. 4.3. This stochasticity of the diffusion is also displayed in
fig. 4.4, where the reverse process is successful in fig. 4.4b but fails in a second
attempt for the same input shown in fig. 4.4c.
A closer inspection of the center image in fig. 4.2a reveals that both the
diffusion model and the GAN exhibit a bias towards generating passenger
vehicles, which are the most frequent class of vehicles in the training dataset.
This bias is reflected in the generated output, where what should resemble
a pickup truck was instead generated as a larger passenger vehicle by both
methods.
Overall, the diffusion model demonstrated a higher level of visual fidelity
and realism, particularly in background elements, while the GAN offered
easier control of the generative process by conditioning on the input point
cloud. The best-looking images were achieved by combining the ease of conditioning offered by the GAN with the output quality of the diffusion model. This combination was achieved by biasing the diffusion process
using the GAN output as described in section 2.3.2, and the results are shown
in fig. 4.2e.

(a) Input: projected point cloud.

(b) Output: True.

(c) Output: GAN.

(d) Output: diffusion.

(e) Output: GAN-conditioned diffusion.

Figure 4.2: Examples of generated images.



(a) True (target output).

(b) Output: GAN.

(c) Output: diffusion.

(d) Output: GAN-conditioned diffusion.

Figure 4.3: Consecutive video frames generated using GAN and diffusion.
The GAN model generated the car in the correct position in every frame, while
the diffusion process missed some frames, especially for images where the car
was in the lower-left corner of the frame. The GAN-conditioned diffusion
samples (fig. 4.3d) look more realistic than the GAN baseline (fig. 4.3b).

(a) Forward diffusion process.

(b) Successful reverse process.

(c) Unsuccessful reverse process.

Figure 4.4: Stochasticity of the reverse diffusion process. Two separate


attempts at generating the vehicle in the lower left corner of the image. The
same input point cloud resulted in two different outcomes.

4.3 Quantitative Assessment


4.3.1 Mean Squared Error
As shown in table 4.1, the images generated by the GAN (fig. 4.2c) had a lower
MSE to ground truth than the diffusion model (DIF). This indicates overall
better pixel-level similarity to the ground truth images (fig. 4.2b). However,
the combination by conditioning the diffusion process on the GAN output (the
hybrid model) scored a lower MSE than both GAN and DIF.

Model Number of Parameters MSE


GAN 35M (Generator) + 28M (Discriminator) 0.1353
Diffusion 233M 0.2079
Hybrid 35M (GAN) + 233M (Diffusion) 0.1274

Table 4.1: MSE results for the different methods used to generate images
conditioned on point cloud data.

4.3.2 Object Detection


The numbers of objects detected by YOLOv5 for each method are shown in
table 4.2 for the full image, including the parked vehicles, and in table 4.3 for
the region of interest including only cars on the road. The images generated by
the diffusion model exhibited the lowest absolute difference in the number of
detected objects in the full image (see table 4.4). On the other hand, the GAN
demonstrated better performance than DIF in the region of interest (ROI) that
excluded parked vehicles (refer to fig. 4.1d). The hybrid model outperformed
both the GAN and the diffusion model in the ROI.

4.4 Analysis
This section aims to provide an evaluation of the performance and limitations
of the diffusion model and the GAN in generating realistic images based on
4D radar point cloud data and delve into the specific aspects of their training
processes.

True GAN Diffusion Hybrid


car 24668 14764 22794 13074
truck 10794 2966 2785 3214
bus 146 89 64 58
Total 35608 17794 25643 16346

Table 4.2: The number of detected objects of each class in the full image
including parked vehicles (refer to fig. 4.1). Diffusion was closest to ground
truth.
True GAN Diffusion Hybrid
car 1759 835 619 1031
truck 2099 772 1049 942
bus 76 18 2 13
Total 3934 1625 1670 1986

Table 4.3: Number of objects detected by each method in the ROI where
parked vehicles were excluded.

Model Full Image Cropped ROI


GAN 17789 (50%) 2309 (59%)
Diffusion 9965 (28%) 2264 (58%)
Hybrid 19262 (54%) 1948 (50%)

Table 4.4: Absolute difference between true and generated images in the
number of detected objects of the three classes: car, truck, and bus (percentage
of total object counts). The GAN-conditioned diffusion model scored best in
the ROI where parked vehicles were excluded.

4.4.1 Training Process


4.4.1.1 GAN
Table 4.5 shows that beyond four to five epochs, training led to diminishing
returns. The MSE and absolute difference in the number of detected objects in
the full image did not improve after the fourth epoch. The absolute difference
in the ROI did not improve after the fifth epoch.

Epoch 3 4 5 6
MSE 0.1415 0.1353 0.1366 0.1394
∆ Objects 21346 (59%) 17789 (50%) 20551 (57%) 19393 (54%)
∆ ROI 2619 (67%) 2309 (59%) 2113 (54%) 2228 (57%)

Table 4.5: GAN quantitative metrics at various checkpoints. The metrics show
no improvement beyond four to five epochs, despite the continued decrease
in training loss shown in fig. 4.5a. The percentage in the parentheses shows
the relative error to the total number of objects in the ground truth images.
The region of interest (ROI) is the region of the image which excludes parked
vehicles in the background.

Furthermore, it is worth noting that the GAN training process exhibited


some instability. As shown in fig. 4.5a, the discriminator’s loss showed
irregular spikes at unpredictable intervals, indicating fluctuations in its ability
to distinguish between real and generated images. Additionally, a sudden
increase in the generator’s loss was observed during the second epoch,
suggesting a temporary deterioration in its ability to produce realistic images.
These fluctuations and instabilities in the GAN training process highlight the
challenges associated with training GANs and the need for careful monitoring
and adjustment to ensure optimal performance.

Figure 4.5: (a) GAN learning curves. (b) Diffusion learning curve.



4.4.1.2 Diffusion
Due to the time-consuming nature of generating images using the diffusion
process, the evaluation of the model at different checkpoints was not conducted
as extensively as with the GAN. As a result, it is possible that the diffusion
model may be slightly overfitted, and there is potential for achieving better
quantitative scores through further experimentation and evaluation. The
limited evaluation of diffusion checkpoints highlights the practical challenges
and trade-offs involved in assessing and fine-tuning generative models with
computationally intensive processes.

4.4.2 Performance
The GAN demonstrated better control and conditioning capabilities but
sacrificed some image quality. It accurately represented objects within
the specified region of interest, excluding background elements like parked
vehicles. This indicates the effectiveness of the GAN in leveraging the
conditioning point cloud data to generate recognizable objects. Generating a single image with the GAN took 52 milliseconds.
The diffusion model performed better than the GAN in generating higher-
quality background elements, such as accurately representing parked vehicles.
However, it was limited in generating complete and accurate representations
of objects in certain cases, as evidenced by instances where cars were missing
or incomplete in the generated images (see fig. 4.3). Generating a single image with the diffusion model took 32 seconds.
The hybrid approach, which combined the diffusion model with GAN
conditioning, outperformed the individual models in terms of both quantitative
metrics and visual appeal. By integrating the strengths of both models, the
hybrid approach achieved improved image quality and object recognition. For
this method, we experimentally adopted a starting timestep of T = 250,
considering the trade-offs between sample quality, computational complexity,
and correspondence to the ground truth. We found that a larger starting
timestep resulted in higher sample quality, meaning the generated images were
visually more realistic, excluding missing or incomplete objects. However,
this came at the cost of increased computational complexity and a more
significant deviation from the ground truth. On the other hand, a lower
starting timestep provided better correspondence to the ground truth and lower
computational complexity, but at the expense of image quality.

4.5 Discussion
The hybrid approach proved to be the most effective in achieving the
best performance, integrating the strengths of both models and mitigating
weaknesses. The combination produced images that surpassed the individual
models’ capabilities by leveraging the diffusion model’s high-quality image
generation and the GAN’s accurate object representation and adherence to
input point clouds.
While the diffusion model produced higher-quality images and superior
object detection scores in the entire image, the GAN outperformed it in
terms of MSE and object detection in the region of interest, excluding
background elements such as parked vehicles. This seemingly contradictory
result suggests that the GAN achieved better pixel-level similarity to the
ground truth images and excelled in generating accurate and recognizable
vehicles captured by the radar. However, the hybrid model outperformed both
individual approaches, aligning with the initial qualitative interpretation that
the diffusion model’s inferior MSE (table 4.1) and ROI object detection score
(table 4.4) were due to its occasional neglect of the conditioning point cloud,
while still producing high-quality images otherwise. The observed difference
in object detection results between the entire image and the region of interest
further confirms the qualitative observation that the diffusion model generated
higher-quality images but faced challenges in conditioning and accurately
representing particular objects in specific instances. This underscores the
complementary nature of the hybrid model, which combines the best features
of both approaches to deliver improved performance. Combining the
diffusion model’s background generation capabilities and the GAN’s object
representation and conditioning abilities resulted in a more effective and
comprehensive generation of realistic images from 4D radar data.
It is important to note that the performances of the diffusion model
and GAN used in this study represent specific instances of these models.
It is possible that other variations or architectures of GANs or diffusion
models could outperform the ones used here. Different GAN architectures,
such as conditional GAN with improved conditioning mechanisms, may
exhibit improved performance in generating realistic images from 4D radar
data. Similarly, alternative diffusion models with different priors or training
strategies could yield better results. Further exploration and experimentation
with different model variations are encouraged to identify potential models
that could outperform the ones employed in this study.
Choi et al. found that samples started to deviate noticeably from the
reference images at a starting timestep of T = 500 when employing their


conditioning method (ILVR) [35]. In the current case, by starting the diffusion
process at T = 250 instead of the usual T = 1000, the time required to generate a sample was reduced by approximately a factor of four. This adjustment in the starting timestep allowed for a more efficient
generation process, enabling quicker iterations and exploration of different
samples. From the perspective of the GAN, the hybrid process represents a
refinement of the generated samples, incorporating the benefits of the diffusion
model while sacrificing some speed in the generative process. This trade-off
between speed and refinement is an important consideration when deciding on
the optimal starting timestep for the diffusion model.
Considering the training time, the hybrid approach presents an interesting
possibility. A trained unconditional diffusion model can be a solid foundation
if the generated distribution is diverse enough. Subsequently, only the GAN
needs to be trained on new data, using it to guide the diffusion process and
fine-tune the generated samples. This approach reduces the overall training
time and computational complexity while achieving satisfactory results.
Overall, the results and analysis provide valuable insights into the
performance, limitations, and potential for improvement of the diffusion
model, the GAN, and their combination in generating realistic images from
4D radar point cloud data.

4.6 Limitations
The main limitation of the approach lies in the diversity of generated objects,
particularly trucks, which are not generated as effectively as passenger
vehicles. This limitation can be attributed to the characteristics of the dataset
used for training, which may lack sufficient representation and diversity of these specific object classes. Another factor that may contribute to the limitation
in generalizability is the alignment between the point cloud and the camera
image. It has been observed that the alignment is more accurate at the
center of the image compared to the lower left corner. In cases where the alignment was less precise, the diffusion model tended to disregard
or inadequately utilize the input information, resulting in an incomplete or
inaccurate generation of objects. This behavior was unexpected and requires
further investigation to gain a better understanding of the underlying dynamics
and explainability of the diffusion process. Addressing this issue could
improve the model’s performance and enhance its ability to generate more
accurate and consistent results across different image regions.

Chapter 5

Conclusions and Future work

5.1 Conclusions
This study explored the generation of realistic images from 4D radar point
cloud data using a diffusion model and a GAN. Through comprehensive
analysis and evaluation of the performance and limitations of these models,
valuable insights were gained. The results indicate that it is possible to learn
a transformation from the 4D radar empirical distribution to the RGB values
distribution by combining the GAN with the diffusion model.
The diffusion model demonstrated strengths in generating higher-quality
images. On the other hand, the GAN showcased better control and
conditioning capabilities, allowing for an accurate representation of objects
that adheres to the conditioning point cloud. This result highlighted the
effectiveness of leveraging conditioning point cloud data for generating
recognizable objects.
A hybrid approach was proposed combining the GAN and the diffusion
model, outperforming both models. The essential advantage of this hybrid
method lies in leveraging the iterative nature of the diffusion process. By
introducing an image generated by the GAN at a predetermined diffusion
timestep, the diffusion process’s limitations and challenges were effectively
mitigated. This hybrid approach improved the ability of the generated
samples to adhere to the input condition and accelerated the sampling process
compared to diffusion alone. From a GAN perspective, the hybrid method is
a way to generate higher-quality samples at the cost of slower sampling.
The dissimilarity between the generated and ground truth images was
measured quantitatively by calculating the mean squared error and object
detection scores. This assessment complemented the qualitative evaluation
and added an objective perspective to the accuracy of the generated images.


The quantitative metrics reinforced that the hybrid approach produced more realistic and higher-quality images than the diffusion model or GAN alone. This
convergence of results from both qualitative and quantitative evaluations
strengthens the findings’ validity and reliability, highlighting the hybrid
approach’s effectiveness in enhancing the generation of images based on 4D
radar point cloud data.
In conclusion, the findings and discussions presented in this study
contribute to advancing deep generative models for 4D radar point cloud data.
The hybrid approach combines state-of-the-art methods and provides a novel,
effective solution for generating artificial videos based on radar data. This
knowledge will be helpful as 4D radar data becomes more widespread and
opens up new possibilities for applying artificial video generation in radar
technology.

5.2 Future Work


Future research could explore further improvements to the hybrid approach
and address the limitations identified in the individual models. This work
could involve refining the conditioning mechanism of the diffusion model
to enhance its object representation capabilities. The question of why
the diffusion process is more challenging to condition could be further
investigated, e.g., by analyzing the magnitudes of weights operating on the
input point cloud and the noisy image at different timesteps. Additionally,
exploring novel architectures and training strategies for the GAN could help
mitigate the averaging effects and improve fine-grained details in the generated
images. The scalability of the models should also be investigated by training
on a more extensive and diverse dataset.

References

[1] F. Engels, P. Heidenreich, A. Zoubir, F. Jondral, and M. Win-


termantel, “Advances in Automotive Radar: A framework on
computationally efficient high-resolution frequency estimation,” IEEE
Signal Processing Magazine, vol. 34, pp. 36–46, Mar. 2017. doi:
10.1109/MSP.2016.2637700 [Page 2.]

[2] K. P. Murphy, Probabilistic Machine Learning: Advanced Topics.


MIT Press, 2023. [Online]. Available: http://probml.github.io/book2
[Pages 3, 6, 9, and 19.]

[3] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-


Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial
Networks,” 2014. [Online]. Available: https://arxiv.org/abs/1406.2661
[Pages 3, 9, 10, and 21.]

[4] P. Dhariwal and A. Nichol, “Diffusion Models Beat GANs on Image


Synthesis,” in Advances in Neural Information Processing Systems,
M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. W.
Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 8780–8794.
[Online]. Available: https://proceedings.neurips.cc/paper/2021/file/4
9ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf [Pages 3, 14, 20, 26,
and 29.]

[5] United Nations, “Sustainable Development Goals,” 2015. [Online].


Available: https://sdgs.un.org/goals [Page 4.]

[6] S. Marsland, Machine Learning: An Algorithmic Perspective, Second


Edition. CRC Press, 2015. ISBN 978-1-4987-5978-6. [Online].
Available: https://books.google.se/books?id=y_oYCwAAQBAJ
[Page 6.]

[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436–444, May 2015. doi: 10.1038/nature14539. [Online].
Available: https://doi.org/10.1038/nature14539 [Pages 7 and 8.]

[8] H. Robbins and S. Monro, “A Stochastic Approximation Method,” The


Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400 – 407, 1951.
doi: 10.1214/aoms/1177729586 Publisher: Institute of Mathematical
Statistics. [Online]. Available: https://doi.org/10.1214/aoms/117772958
6 [Page 7.]

[9] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,”


2017, _eprint: 1412.6980. [Pages 7, 22, and 25.]

[10] K. Fukushima, “Neocognitron: A self-organizing neural network model


for a mechanism of pattern recognition unaffected by shift in position,”
Biological cybernetics, vol. 36, no. 4, pp. 193–202, 1980, publisher:
Springer. [Page 8.]

[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification


with Deep Convolutional Neural Networks,” Commun. ACM, vol. 60,
no. 6, pp. 84–90, May 2017. doi: 10.1145/3065386 Place: New York,
NY, USA Publisher: Association for Computing Machinery. [Online].
Available: https://doi.org/10.1145/3065386 [Page 8.]

[12] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst,


“Geometric Deep Learning: Going beyond Euclidean data,” IEEE
Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017. doi:
10.1109/MSP.2017.2693418 [Page 8.]

[13] V. Nair and G. E. Hinton, “Rectified linear units improve restricted


boltzmann machines,” in Proceedings of the 27th international
conference on machine learning (ICML-10), 2010, pp. 807–814.
[Page 8.]

[14] C. Doersch, “Tutorial on Variational Autoencoders,” 2016. [Online].


Available: https://arxiv.org/abs/1606.05908 [Pages 9 and 20.]

[15] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and


B. Lakshminarayanan, “Normalizing Flows for Probabilistic Modeling
and Inference,” 2019. doi: 10.48550/ARXIV.1912.02762 Publisher:
arXiv. [Online]. Available: https://arxiv.org/abs/1912.02762 [Page 9.]

[16] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and


S. Ganguli, “Deep Unsupervised Learning using Nonequilibrium
Thermodynamics,” CoRR, vol. abs/1503.03585, 2015, arXiv:
1503.03585. [Online]. Available: https://proceedings.mlr.press/v3
7/sohl-dickstein15.html [Page 9.]

[17] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic


Models,” CoRR, vol. abs/2006.11239, 2020, arXiv: 2006.11239.
[Online]. Available: https://arxiv.org/abs/2006.11239 [Pages 9, 12, 14,
25, and 26.]

[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image
Recognition,” CoRR, vol. abs/1512.03385, 2015, arXiv: 1512.03385.
[Online]. Available: http://arxiv.org/abs/1512.03385 [Pages 11, 23,
and 51.]

[19] A. Nichol and P. Dhariwal, “Improved Denoising Diffusion Probabilistic


Models,” 2021. [Online]. Available: https://arxiv.org/abs/2102.09672
[Pages 12, 14, 24, and 26.]

[20] M. Mirza and S. Osindero, “Conditional generative adversarial nets,”


arXiv preprint arXiv:1411.1784, 2014. [Page 13.]

[21] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-Image


Translation with Conditional Adversarial Networks,” CoRR, vol.
abs/1611.07004, 2016, arXiv: 1611.07004. [Online]. Available:
http://arxiv.org/abs/1611.07004 [Pages 13, 14, 21, and 22.]

[22] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional


Networks for Biomedical Image Segmentation,” CoRR, vol.
abs/1505.04597, 2015, arXiv: 1505.04597. [Online]. Available:
http://arxiv.org/abs/1505.04597 [Pages 13, 14, 22, and 51.]

[23] C. Li and M. Wand, “Precomputed Real-Time Texture Synthesis


with Markovian Generative Adversarial Networks,” CoRR, vol.
abs/1604.04382, 2016, arXiv: 1604.04382. [Online]. Available:
http://arxiv.org/abs/1604.04382 [Pages 13 and 23.]

[24] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image


Translation using Cycle-Consistent Adversarial Networks,” in Computer
Vision (ICCV), 2017 IEEE International Conference on, 2017. [Pages 13
and 53.]

[25] S. Milz, M. Simon, K. Fischer, and M. Pöpperl, “Points2Pix: 3D Point-


Cloud to Image Translation using conditional Generative Adversarial
Networks,” CoRR, vol. abs/1901.09280, 2019, arXiv: 1901.09280.
[Online]. Available: http://arxiv.org/abs/1901.09280 [Pages 13, 14,
and 22.]

[26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep


Learning on Point Sets for 3D Classification and Segmentation,” CoRR,
vol. abs/1612.00593, 2016, arXiv: 1612.00593. [Online]. Available:
http://arxiv.org/abs/1612.00593 [Page 14.]

[27] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics:


The kitti dataset,” The International Journal of Robotics Research,
vol. 32, no. 11, pp. 1231–1237, 2013, publisher: Sage Publications Sage
UK: London, England. [Page 14.]

[28] S. Song, S. P. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D


Scene Understanding Benchmark Suite,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Jun.
2015. [Page 14.]

[29] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,”


CoRR, vol. abs/1804.02767, 2018, arXiv: 1804.02767. [Online].
Available: http://arxiv.org/abs/1804.02767 [Page 14.]

[30] C. Saharia, W. Chan, H. Chang, C. A. Lee, J. Ho, T. Salimans, D. J.


Fleet, and M. Norouzi, “Palette: Image-to-Image Diffusion Models,”
2022, _eprint: 2111.05826. [Pages 14 and 25.]

[31] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “PixelCNN++:


Improving the PixelCNN with Discretized Logistic Mixture Likelihood
and Other Modifications,” CoRR, vol. abs/1701.05517, 2017, arXiv:
1701.05517. [Online]. Available: http://arxiv.org/abs/1701.05517
[Page 14.]

[32] S. Zagoruyko and N. Komodakis, “Wide Residual Networks,” CoRR,


vol. abs/1605.07146, 2016, arXiv: 1605.07146. [Online]. Available:
http://arxiv.org/abs/1605.07146 [Page 14.]

[33] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi,


“Image super-resolution via iterative refinement,” arXiv preprint
arXiv:2104.07636, 2021. [Page 14.]

[34] H. Sasaki, C. G. Willcocks, and T. P. Breckon, “UNIT-DDPM:


UNpaired Image Translation with Denoising Diffusion Probabilistic
Models,” CoRR, vol. abs/2104.05358, 2021, arXiv: 2104.05358.
[Online]. Available: https://arxiv.org/abs/2104.05358 [Page 14.]

[35] J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “ILVR: Conditioning


Method for Denoising Diffusion Probabilistic Models,” 2021, _eprint:
2108.02938. [Pages 14, 27, and 41.]

[36] Q. Research and T. AB, “SensRad,” Jan. 2023. [Online]. Available:


https://www.qamcom.com/sensrad/ [Page 15.]

[37] O. Oktay, J. Schlemper, L. L. Folgoc, M. C. H. Lee, M. P. Heinrich,


K. Misawa, K. Mori, S. G. McDonagh, N. Y. Hammerla, B. Kainz,
B. Glocker, and D. Rueckert, “Attention U-Net: Learning Where to
Look for the Pancreas,” CoRR, vol. abs/1804.03999, 2018, arXiv:
1804.03999. [Online]. Available: http://arxiv.org/abs/1804.03999
[Pages vii, 23, and 52.]

[38] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep


Network Training by Reducing Internal Covariate Shift,” CoRR,
vol. abs/1502.03167, 2015, arXiv: 1502.03167. [Online]. Available:
http://arxiv.org/abs/1502.03167 [Page 23.]

[39] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon,


and B. Poole, “Score-Based Generative Modeling through Stochastic
Differential Equations,” CoRR, vol. abs/2011.13456, 2020, arXiv:
2011.13456. [Online]. Available: https://arxiv.org/abs/2011.13456
[Page 26.]

[40] “The CIFAR-10 dataset.” [Online]. Available: https://www.cs.toronto.e


du/~kriz/cifar.html [Page 26.]

[41] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep Learning Face Attributes
in the Wild,” in Proceedings of International Conference on Computer
Vision (ICCV), Dec. 2015. [Page 26.]

[42] A. Brock, J. Donahue, and K. Simonyan, “Large Scale GAN


Training for High Fidelity Natural Image Synthesis,” CoRR, vol.
abs/1809.11096, 2018, arXiv: 1809.11096. [Online]. Available:
http://arxiv.org/abs/1809.11096 [Page 26.]

[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,


L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” CoRR,
vol. abs/1706.03762, 2017, arXiv: 1706.03762. [Online]. Available:
http://arxiv.org/abs/1706.03762 [Pages 27 and 52.]

[44] R. E. Yılmaz, “Attention U-Net,” 2019. [Online]. Available: https:


//www.kaggle.com/code/truthisneverlinear/attention-u-net-pytorch#Att
ention-U-Net [Page 29.]

[45] N. Rogge and K. Rasul, “Diffusion - Huggingface,” 2022. [Online].


Available: https://huggingface.co/blog/annotated-diffusion [Page 29.]

[46] D. , “Diffusion - Colab,” 2022. [Online]. Available: https://colab.resear


ch.google.com/drive/1sjy9odlSSy0RBVgMTgP7s99NXsqglsUL?usp=
sharing#scrollTo=i7AZkYjKgQTm [Page 29.]

[47] G. Jocher, A. Chaurasia, A. Stoken, J. Borovec, NanoCode012,


Y. Kwon, K. Michael, TaoXie, J. Fang, imyhxy, Lorna, Z. Yifu,
C. Wong, A. V, D. Montes, Z. Wang, C. Fati, J. Nadar, Laughing,
UnglvKitDe, V. Sonck, tkianai, yxNONG, P. Skalski, A. Hogan, D. Nair,
M. Strobel, and M. Jain, “ultralytics/yolov5: v7.0 - YOLOv5 SOTA
Realtime Instance Segmentation,” Nov. 2022. [Online]. Available:
https://doi.org/10.5281/zenodo.7347926 [Page 32.]

[48] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays,


P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO:
Common Objects in Context,” 2015, _eprint: 1405.0312. [Page 32.]

[49] K. Simonyan and A. Zisserman, “Very deep convolutional networks for


large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[Page 51.]

Appendix A

Supporting materials

A.1 Code and Demos


For generated video demos see here. Code for GAN applied to the Kitti dataset
can be found here.

A.2 Additional Background


A.2.1 Residual Networks
In 2014, the Visual Geometry Group at the University of Oxford achieved
state-of-the-art performance on the ImageNet dataset using a conventional
convolutional neural network with increased depth [49]. They demonstrated
the significant impact of depth on image recognition, but it also resulted in
more complex networks that are harder to train. To address this issue, Kaiming He et al. introduced the concept of skip connections in 2015, which allowed
for the construction of deeper and less complex networks called ResNets [18].
The architecture is built using residual blocks, with each block producing an
output that is the sum of the previous block’s output and a learned residual.
He et al. achieved state-of-the-art accuracies with ResNets eight times deeper
than VGG networks.

A.2.2 U-Net
U-Net is a convolutional neural network that was first introduced in 2015 for
use in biomedical image segmentation [22]. The name ”U-Net” comes from its
characteristic U-shaped architecture, created by a sequence of downsampling

operations followed by upsampling operations. Downsampling is performed


by either max-pooling or convolving with a step size (stride) greater than
one. Upsampling is achieved with a convolution of fractional step size or
deconvolution. This design creates a bottleneck in the network that allows
it to capture the input image’s local and global features efficiently.

A.2.3 Attention
Attention is a mechanism that allows models to focus on specific parts of the
input while processing it [43]. It has become a crucial component of many
state-of-the-art natural language processing and computer vision models,
allowing them to perform better on a wide range of tasks. It is a function
that maps a query and a set of key-value pairs to a weighted sum of the
values, expressing the compatibility of the query with the corresponding key
as the weight for each value. The input consists of dk-dimensional keys k and queries q, and dv-dimensional values V. The weights of the values are obtained by applying a softmax function to the dot products of the query with all keys, divided by √dk. In practice, these computations are performed as
matrix multiplications, operating simultaneously on a collection of queries Q,
resulting in the following equation:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V        (A.1)
This is known as multiplicative attention. In the context of vision, the
queries, keys, and values are three sets of feature maps with the same spatial
dimensions, obtained through three separate linear transformations. The
feature maps are then reshaped into vectors q, k, and v and used in the attention
calculation. Various versions of attention exist. An improvement to U-Net known as Attention U-Net uses additive attention [37]:

q_att^l = ψ⊤(σ1(Wx⊤ xi + Wg⊤ gi + bg)) + bψ,        (A.2)
αi^l = σ2(q_att(xi^l, gi; Θatt)),        (A.3)
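
A compact PyTorch sketch of such an additive attention gate is given below; the channel sizes are illustrative, σ1 is taken as ReLU and σ2 as a sigmoid, and the gating signal is assumed to have already been resampled to the spatial size of the skip features.

import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate in the spirit of eqs. (A.2)-(A.3); channel sizes are illustrative."""
    def __init__(self, x_channels, g_channels, inter_channels):
        super().__init__()
        self.W_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)   # transforms skip features
        self.W_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)   # transforms gating signal
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)            # maps to one scalar per pixel

    def forward(self, x, g):
        # sigma_1 = ReLU, sigma_2 = sigmoid; g is assumed to match x's spatial size
        q = torch.relu(self.W_x(x) + self.W_g(g))
        alpha = torch.sigmoid(self.psi(q))        # attention coefficients in [0, 1]
        return x * alpha                          # gate the skip-connection features

# Example:
# gated = AttentionGate(64, 128, 32)(torch.randn(1, 64, 64, 64), torch.randn(1, 128, 64, 64))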
A.3 Additional Methods
A.3.1 Postprocessing
The radar system is effective at capturing data across various weather
conditions, as it is not influenced by weather or lighting information. However,
this lack of weather information poses a challenge when generating video
sequences, as certain scene attributes, such as wet or dry roads, get randomly
generated. This randomness introduces inconsistency when playing back the
generated frames at a standard frame rate of 25 FPS, potentially impacting the
viewing experience. To address this issue, we employed CycleGAN [24] as
a postprocessing technique to transfer the generated images into a consistent
style. This style transfer process helped alleviate the inherent randomness in
scene attributes, allowing for a more consistent and enjoyable playback of the
generated video sequences.

A.3.2 Upscaling
In order to create a visually more appealing demo, an additional step was
taken to upscale the generated images. This involved training a diffusion
model specifically in a higher resolution of 256 × 256 pixels. The output
from the GAN, which was initially generated in a resolution of 128 × 128
pixels, was then postprocessed using CycleGAN (refer to appendix A.3.1) and
further upscaled to match the 256 × 256 pixel resolution. The frames for the
demo were sampled using the modified diffusion sampling algorithm outlined
in algorithm 4.

A.4 Additional Examples of Generated Images
To demonstrate the radar’s ability to detect various objects, we also trained
on a larger dataset. The dataset consisted of all recordings with mode =
medium and static = 10, including additional recordings taken in March
and April. To address the class imbalance, we employed upsampling of frames
containing trucks, vans, or buses and downsampling frames which only
contained passenger vehicles. A selection of generated frames in 256×256 px
is shown in fig. A.1.

(a) A truck. (b) A van and three cars.

Figure A.1: Generated samples in 256 × 256 px.


TRITA-EECS-EX-2023:635

www.kth.se
