
Scalable Diffusion Models with Transformers

William Peebles* (UC Berkeley)        Saining Xie (New York University)

arXiv:2212.09748v2 [cs.CV] 2 Mar 2023

Figure 1. Diffusion models with transformer backbones achieve state-of-the-art image quality. We show selected samples from two
of our class-conditional DiT-XL/2 models trained on ImageNet at 512×512 and 256×256 resolution, respectively.

Abstract

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops—through increased transformer depth/width or increased number of input tokens—consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

1. Introduction

Machine learning is experiencing a renaissance powered by transformers. Over the past five years, neural architectures for natural language processing [8, 42], vision [10] and several other domains have largely been subsumed by transformers [60]. Many classes of image-level generative models remain holdouts to the trend, though—while transformers see widespread use in autoregressive models [3, 6, 43, 47], they have seen less adoption in other generative modeling frameworks. For example, diffusion models have been at the forefront of recent advances in image-level generative models [9, 46]; yet, they all adopt a convolutional U-Net architecture as the de-facto choice of backbone.
* Work done during an internship at Meta AI, FAIR Team.
Code and project page available here.


Figure 2. ImageNet generation with Diffusion Transformers (DiTs). Bubble area indicates the flops of the diffusion model. Left:
FID-50K (lower is better) of our DiT models at 400K training iterations. Performance steadily improves in FID as model flops increase.
Right: Our best model, DiT-XL/2, is compute-efficient and outperforms all prior U-Net-based diffusion models, like ADM and LDM.

The seminal work of Ho et al. [19] first introduced the U-Net backbone for diffusion models. Having initially seen success within pixel-level autoregressive models and conditional GANs [23], the U-Net was inherited from PixelCNN++ [52, 58] with a few changes. The model is convolutional, comprised primarily of ResNet [15] blocks. In contrast to the standard U-Net [49], additional spatial self-attention blocks, which are essential components in transformers, are interspersed at lower resolutions. Dhariwal and Nichol [9] ablated several architecture choices for the U-Net, such as the use of adaptive normalization layers [40] to inject conditional information and channel counts for convolutional layers. However, the high-level design of the U-Net from Ho et al. has largely remained intact.

With this work, we aim to demystify the significance of architectural choices in diffusion models and offer empirical baselines for future generative modeling research. We show that the U-Net inductive bias is not crucial to the performance of diffusion models, and they can be readily replaced with standard designs such as transformers. As a result, diffusion models are well-poised to benefit from the recent trend of architecture unification—e.g., by inheriting best practices and training recipes from other domains, as well as retaining favorable properties like scalability, robustness and efficiency. A standardized architecture would also open up new possibilities for cross-domain research.

In this paper, we focus on a new class of diffusion models based on transformers. We call them Diffusion Transformers, or DiTs for short. DiTs adhere to the best practices of Vision Transformers (ViTs) [10], which have been shown to scale more effectively for visual recognition than traditional convolutional networks (e.g., ResNet [15]).

More specifically, we study the scaling behavior of transformers with respect to network complexity vs. sample quality. We show that by constructing and benchmarking the DiT design space under the Latent Diffusion Models (LDMs) [48] framework, where diffusion models are trained within a VAE's latent space, we can successfully replace the U-Net backbone with a transformer. We further show that DiTs are scalable architectures for diffusion models: there is a strong correlation between the network complexity (measured by Gflops) vs. sample quality (measured by FID). By simply scaling-up DiT and training an LDM with a high-capacity backbone (118.6 Gflops), we are able to achieve a state-of-the-art result of 2.27 FID on the class-conditional 256 × 256 ImageNet generation benchmark.

2. Related Work

Transformers. Transformers [60] have replaced domain-specific architectures across language, vision [10], reinforcement learning [5, 25] and meta-learning [39]. They have shown remarkable scaling properties under increasing model size, training compute and data in the language domain [26], as generic autoregressive models [17] and as ViTs [63]. Beyond language, transformers have been trained to autoregressively predict pixels [6, 7, 38]. They have also been trained on discrete codebooks [59] as both autoregressive models [11, 47] and masked generative models [4, 14]; the former has shown excellent scaling behavior up to 20B parameters [62]. Finally, transformers have been explored in DDPMs to synthesize non-spatial data; e.g., to generate CLIP image embeddings in DALL·E 2 [41, 46]. In this paper, we study the scaling properties of transformers when used as the backbone of diffusion models of images.

[Figure 3 diagram: the Latent Diffusion Transformer (noised latent 32×32×4 → patchify and embed → N× DiT blocks conditioned on timestep t and label y → layer norm, linear and reshape → predicted noise and Σ), shown alongside three DiT block variants: DiT Block with adaLN-Zero, DiT Block with Cross-Attention, and DiT Block with In-Context Conditioning.]

Figure 3. The Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed
into patches and processed by several DiT blocks. Right: Details of our DiT blocks. We experiment with variants of standard transformer
blocks that incorporate conditioning via adaptive layer norm, cross-attention and extra input tokens. Adaptive layer norm works best.

Denoising diffusion probabilistic models (DDPMs). Diffusion [19, 54] and score-based generative models [22, 56] have been particularly successful as generative models of images [35, 46, 48, 50], in many cases outperforming generative adversarial networks (GANs) [12] which had previously been state-of-the-art. Improvements in DDPMs over the past two years have largely been driven by improved sampling techniques [19, 27, 55], most notably classifier-free guidance [21], reformulating diffusion models to predict noise instead of pixels [19] and using cascaded DDPM pipelines where low-resolution base diffusion models are trained in parallel with upsamplers [9, 20]. For all the diffusion models listed above, convolutional U-Nets [49] are the de-facto choice of backbone architecture. Concurrent work [24] introduced a novel, efficient architecture based on attention for DDPMs; we explore pure transformers.

Architecture complexity. When evaluating architecture complexity in the image generation literature, it is fairly common practice to use parameter counts. In general, parameter counts can be poor proxies for the complexity of image models since they do not account for, e.g., image resolution which significantly impacts performance [44, 45]. Instead, much of the model complexity analysis in this paper is through the lens of theoretical Gflops. This brings us in-line with the architecture design literature where Gflops are widely-used to gauge complexity. In practice, the golden complexity metric is still up for debate as it frequently depends on particular application scenarios. Nichol and Dhariwal's seminal work improving diffusion models [9, 36] is most related to us—there, they analyzed the scalability and Gflop properties of the U-Net architecture class. In this paper, we focus on the transformer class.

3. Diffusion Transformers

3.1. Preliminaries

Diffusion formulation. Before introducing our architecture, we briefly review some basic concepts needed to understand diffusion models (DDPMs) [19, 54]. Gaussian diffusion models assume a forward noising process which gradually applies noise to real data x_0: q(x_t | x_0) = N(x_t; √(ᾱ_t) x_0, (1 − ᾱ_t)I), where the constants ᾱ_t are hyperparameters. By applying the reparameterization trick, we can sample x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε_t, where ε_t ∼ N(0, I). Diffusion models are trained to learn the reverse process that inverts forward process corruptions: p_θ(x_{t−1} | x_t) = N(μ_θ(x_t), Σ_θ(x_t)), where neural networks are used to predict the statistics of p_θ. The reverse process model is trained with the variational lower bound [30] of the log-likelihood of x_0, which reduces to L(θ) = −p(x_0 | x_1) + Σ_t D_KL(q*(x_{t−1} | x_t, x_0) || p_θ(x_{t−1} | x_t)), excluding an additional term irrelevant for training. Since both q* and p_θ are Gaussian, D_KL can be evaluated with the mean and covariance of the two distributions. By reparameterizing μ_θ as a noise prediction network ε_θ, the model can be trained using simple mean-squared error between the predicted noise ε_θ(x_t) and the ground truth sampled Gaussian noise ε_t: L_simple(θ) = ||ε_θ(x_t) − ε_t||²₂. But, in order to train diffusion models with a learned reverse process covariance Σ_θ, the full D_KL term needs to be optimized. We follow Nichol and Dhariwal's approach [36]: train ε_θ with L_simple, and train Σ_θ with the full L. Once p_θ is trained, new images can be sampled by initializing x_{t_max} ∼ N(0, I) and sampling x_{t−1} ∼ p_θ(x_{t−1} | x_t) via the reparameterization trick.
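To make the forward-noising process and L_simple concrete, here is a minimal PyTorch sketch (an illustration, not the authors' implementation). The linear variance schedule from 1e-4 to 2e-2 over t_max = 1000 steps matches the ADM hyperparameters retained in Section 4; the noise-prediction network model(x_t, t, y) is a hypothetical stand-in for DiT.

    import torch

    T_MAX = 1000
    betas = torch.linspace(1e-4, 2e-2, T_MAX)        # linear variance schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

    def q_sample(x0, t, noise):
        # Sample x_t ~ q(x_t | x_0) via the reparameterization trick.
        a = alphas_bar[t].view(-1, 1, 1, 1)
        return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

    def l_simple(model, x0, y):
        # Mean-squared error between the predicted and the true noise.
        t = torch.randint(0, T_MAX, (x0.shape[0],))
        noise = torch.randn_like(x0)
        x_t = q_sample(x0, t, noise)
        return (model(x_t, t, y) - noise).pow(2).mean()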

Classifier-free guidance. Conditional diffusion models take extra information as input, such as a class label c. In this case, the reverse process becomes p_θ(x_{t−1} | x_t, c), where ε_θ and Σ_θ are conditioned on c. In this setting, classifier-free guidance can be used to encourage the sampling procedure to find x such that log p(c | x) is high [21]. By Bayes Rule, log p(c | x) ∝ log p(x | c) − log p(x), and hence ∇_x log p(c | x) ∝ ∇_x log p(x | c) − ∇_x log p(x). By interpreting the output of diffusion models as the score function, the DDPM sampling procedure can be guided to sample x with high p(x | c) by: ε̂_θ(x_t, c) = ε_θ(x_t, ∅) + s · ∇_x log p(x | c) ∝ ε_θ(x_t, ∅) + s · (ε_θ(x_t, c) − ε_θ(x_t, ∅)), where s > 1 indicates the scale of the guidance (note that s = 1 recovers standard sampling). Evaluating the diffusion model with c = ∅ is done by randomly dropping out c during training and replacing it with a learned "null" embedding ∅. Classifier-free guidance is widely-known to yield significantly improved samples over generic sampling techniques [21, 35, 46], and the trend holds for our DiT models.
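A rough sketch of this guided noise prediction follows (our illustration under assumed interfaces, not the released code). A conditional noise network eps_model(x_t, t, y) and a reserved class index null_id for the learned null embedding are assumed; the 10% label-dropout rate is a common choice and is not specified in the text.

    import torch

    def cfg_eps(eps_model, x_t, t, y, null_id, scale):
        # eps_hat = eps(x, null) + s * (eps(x, c) - eps(x, null))
        eps_cond = eps_model(x_t, t, y)
        eps_uncond = eps_model(x_t, t, torch.full_like(y, null_id))
        return eps_uncond + scale * (eps_cond - eps_uncond)

    def drop_labels(y, null_id, p_drop=0.1):
        # During training, randomly replace labels with the learned null embedding.
        drop = torch.rand(y.shape, device=y.device) < p_drop
        return torch.where(drop, torch.full_like(y, null_id), y)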
Latent diffusion models. Training diffusion models directly in high-resolution pixel space can be computationally prohibitive. Latent diffusion models (LDMs) [48] tackle this issue with a two-stage approach: (1) learn an autoencoder that compresses images into smaller spatial representations with a learned encoder E; (2) train a diffusion model of representations z = E(x) instead of a diffusion model of images x (E is frozen). New images can then be generated by sampling a representation z from the diffusion model and subsequently decoding it to an image with the learned decoder x = D(z).

As shown in Figure 2, LDMs achieve good performance while using a fraction of the Gflops of pixel space diffusion models like ADM. Since we are concerned with compute efficiency, this makes them an appealing starting point for architecture exploration. In this paper, we apply DiTs to latent space, although they could be applied to pixel space without modification as well. This makes our image generation pipeline a hybrid-based approach; we use off-the-shelf convolutional VAEs and transformer-based DDPMs.

3.2. Diffusion Transformer Design Space

We introduce Diffusion Transformers (DiTs), a new architecture for diffusion models. We aim to be as faithful to the standard transformer architecture as possible to retain its scaling properties. Since our focus is training DDPMs of images (specifically, spatial representations of images), DiT is based on the Vision Transformer (ViT) architecture which operates on sequences of patches [10]. DiT retains many of the best practices of ViTs. Figure 3 shows an overview of the complete DiT architecture. In this section, we describe the forward pass of DiT, as well as the components of the design space of the DiT class.

Figure 4. Input specifications for DiT. Given patch size p × p, a spatial representation (the noised latent from the VAE) of shape I × I × C is "patchified" into a sequence of length T = (I/p)² with hidden dimension d. A smaller patch size p results in a longer sequence length and thus more Gflops.

Patchify. The input to DiT is a spatial representation z (for 256 × 256 × 3 images, z has shape 32 × 32 × 4). The first layer of DiT is "patchify," which converts the spatial input into a sequence of T tokens, each of dimension d, by linearly embedding each patch in the input. Following patchify, we apply standard ViT frequency-based positional embeddings (the sine-cosine version) to all input tokens. The number of tokens T created by patchify is determined by the patch size hyperparameter p. As shown in Figure 4, halving p will quadruple T, and thus at least quadruple total transformer Gflops. Although it has a significant impact on Gflops, note that changing p has no meaningful impact on downstream parameter counts.

We add p = 2, 4, 8 to the DiT design space.
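The patchify and unpatchify reshaping can be sketched as follows (an illustration under the shapes described above, not the authors' code). The stride-p convolution is one standard way to linearly embed non-overlapping p × p patches, and unpatchify is the inverse reshaping later used by the transformer decoder.

    import torch
    import torch.nn as nn

    class Patchify(nn.Module):
        def __init__(self, patch_size=2, in_channels=4, hidden_size=1152):
            super().__init__()
            self.p = patch_size
            # A stride-p convolution linearly embeds each non-overlapping p x p patch.
            self.proj = nn.Conv2d(in_channels, hidden_size,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, z):                    # z: (N, C, I, I)
            x = self.proj(z)                     # (N, d, I/p, I/p)
            return x.flatten(2).transpose(1, 2)  # (N, T, d) with T = (I/p)**2

    def unpatchify(x, p, C):
        # Inverse reshaping: (N, T, p*p*C) -> (N, C, I, I).
        N, T, _ = x.shape
        h = w = int(T ** 0.5)
        x = x.reshape(N, h, w, p, p, C)
        x = torch.einsum("nhwpqc->nchpwq", x)
        return x.reshape(N, C, h * p, w * p)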
DiT block design. Following patchify, the input tokens are processed by a sequence of transformer blocks. In addition to noised image inputs, diffusion models sometimes process additional conditional information such as noise timesteps t, class labels c, natural language, etc. We explore four variants of transformer blocks that process conditional inputs differently. The designs introduce small, but important, modifications to the standard ViT block design. The designs of all blocks are shown in Figure 3.

– In-context conditioning. We simply append the vector embeddings of t and c as two additional tokens in the input sequence, treating them no differently from the image tokens. This is similar to cls tokens in ViTs, and it allows us to use standard ViT blocks without modification. After the final block, we remove the conditioning tokens from the sequence. This approach introduces negligible new Gflops to the model.

– Cross-attention block. We concatenate the embeddings of t and c into a length-two sequence, separate from the image token sequence. The transformer block is modified to include an additional multi-head cross-attention layer following the multi-head self-attention block, similar to the original design from Vaswani et al. [60], and also similar to the one used by LDM for conditioning on class labels. Cross-attention adds the most Gflops to the model, roughly a 15% overhead.

– Adaptive layer norm (adaLN) block. Following the widespread usage of adaptive normalization layers [40] in GANs [2, 28] and diffusion models with U-Net backbones [9], we explore replacing standard layer norm layers in transformer blocks with adaptive layer norm (adaLN). Rather than directly learn dimension-wise scale and shift parameters γ and β, we regress them from the sum of the embedding vectors of t and c. Of the three block designs we explore, adaLN adds the least Gflops and is thus the most compute-efficient. It is also the only conditioning mechanism that is restricted to apply the same function to all tokens.

– adaLN-Zero block. Prior work on ResNets has found that initializing each residual block as the identity function is beneficial. For example, Goyal et al. found that zero-initializing the final batch norm scale factor γ in each block accelerates large-scale training in the supervised learning setting [13]. Diffusion U-Net models use a similar initialization strategy, zero-initializing the final convolutional layer in each block prior to any residual connections. We explore a modification of the adaLN DiT block which does the same. In addition to regressing γ and β, we also regress dimension-wise scaling parameters α that are applied immediately prior to any residual connections within the DiT block. We initialize the MLP to output the zero-vector for all α; this initializes the full DiT block as the identity function. As with the vanilla adaLN block, adaLN-Zero adds negligible Gflops to the model.

We include the in-context, cross-attention, adaptive layer norm and adaLN-Zero blocks in the DiT design space.
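A minimal sketch of an adaLN-Zero block under the description above (shift/scale parameters γ, β and gates α regressed from the conditioning vector by a zero-initialized linear layer); the SiLU activation and the 6× output width follow the appendix, but the attention and MLP internals here are generic PyTorch stand-ins rather than the released implementation.

    import torch.nn as nn

    def modulate(x, shift, scale):
        return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

    class DiTBlockAdaLNZero(nn.Module):
        def __init__(self, d, num_heads, mlp_ratio=4.0):
            super().__init__()
            self.norm1 = nn.LayerNorm(d, elementwise_affine=False)
            self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d, elementwise_affine=False)
            hidden = int(d * mlp_ratio)
            self.mlp = nn.Sequential(nn.Linear(d, hidden),
                                     nn.GELU(approximate="tanh"),
                                     nn.Linear(hidden, d))
            # Regress gamma, beta and alpha for both sub-blocks: 6 * d outputs.
            self.ada = nn.Sequential(nn.SiLU(), nn.Linear(d, 6 * d))
            nn.init.zeros_(self.ada[-1].weight)  # adaLN-Zero: the block starts as identity
            nn.init.zeros_(self.ada[-1].bias)

        def forward(self, x, c):  # x: (N, T, d) tokens, c: (N, d) timestep + class embedding
            s1, b1, a1, s2, b2, a2 = self.ada(c).chunk(6, dim=1)
            h = modulate(self.norm1(x), b1, s1)
            x = x + a1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
            h = modulate(self.norm2(x), b2, s2)
            x = x + a2.unsqueeze(1) * self.mlp(h)
            return x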

[Figure 5 plot: FID-50K over 100K-400K training steps for XL/2 with in-context, cross-attention, adaLN and adaLN-Zero conditioning.]
Figure 5. Comparing different conditioning strategies. adaLN-Zero outperforms cross-attention and in-context conditioning at all stages of training.

Model    Layers N    Hidden size d    Heads    Gflops (I=32, p=4)
DiT-S    12          384              6        1.4
DiT-B    12          768              12       5.6
DiT-L    24          1024             16       19.7
DiT-XL   28          1152             16       29.1

Table 1. Details of DiT models. We follow ViT [10] model configurations for the Small (S), Base (B) and Large (L) variants; we also introduce an XLarge (XL) config as our largest model.

Model size. We apply a sequence of N DiT blocks, each operating at the hidden dimension size d. Following ViT, we use standard transformer configs that jointly scale N, d and attention heads [10, 63]. Specifically, we use four configs: DiT-S, DiT-B, DiT-L and DiT-XL. They cover a wide range of model sizes and flop allocations, from 0.3 to 118.6 Gflops, allowing us to gauge scaling performance. Table 1 gives details of the configs.

We add B, S, L and XL configs to the DiT design space.

Transformer decoder. After the final DiT block, we need to decode our sequence of image tokens into an output noise prediction and an output diagonal covariance prediction. Both of these outputs have shape equal to the original spatial input. We use a standard linear decoder to do this; we apply the final layer norm (adaptive if using adaLN) and linearly decode each token into a p × p × 2C tensor, where C is the number of channels in the spatial input to DiT. Finally, we rearrange the decoded tokens into their original spatial layout to get the predicted noise and covariance.

The complete DiT design space we explore is patch size, transformer block architecture and model size.

4. Experimental Setup

We explore the DiT design space and study the scaling properties of our model class. Our models are named according to their configs and latent patch sizes p; for example, DiT-XL/2 refers to the XLarge config and p = 2.

Training. We train class-conditional latent DiT models at 256 × 256 and 512 × 512 image resolution on the ImageNet dataset [31], a highly-competitive generative modeling benchmark. We initialize the final linear layer with zeros and otherwise use standard weight initialization techniques from ViT. We train all models with AdamW [29, 33]. We use a constant learning rate of 1 × 10−4, no weight decay and a batch size of 256. The only data augmentation we use is horizontal flips. Unlike much prior work with ViTs [57, 61], we did not find learning rate warmup nor regularization necessary to train DiTs to high performance. Even without these techniques, training was highly stable across all model configs and we did not observe any loss spikes commonly seen when training transformers. Following common practice in the generative modeling literature, we maintain an exponential moving average (EMA) of DiT weights over training with a decay of 0.9999. All results reported use the EMA model. We use identical training hyperparameters across all DiT model sizes and patch sizes. Our training hyperparameters are almost entirely retained from ADM. We did not tune learning rates, decay/warm-up schedules, Adam β1/β2 or weight decays.
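As a reference point, the optimizer and EMA settings above can be written down in a few lines (a sketch of the stated hyperparameters; the model here is a placeholder and the data pipeline is omitted):

    import copy
    import torch
    import torch.nn as nn

    model = nn.Linear(8, 8)  # stand-in for a DiT model; construction omitted
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)  # constant LR, no decay

    ema = copy.deepcopy(model)  # exponential moving average of weights
    for p in ema.parameters():
        p.requires_grad_(False)

    @torch.no_grad()
    def update_ema(ema_model, model, decay=0.9999):
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)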

Figure 6. Scaling the DiT model improves FID at all stages of training. We show FID-50K over training iterations for 12 of our DiT
models. Top row: We compare FID holding patch size constant. Bottom row: We compare FID holding model size constant. Scaling the
transformer backbone yields better generative models across all model sizes and patch sizes.

Diffusion. We use an off-the-shelf pre-trained variational autoencoder (VAE) model [30] from Stable Diffusion [48]. The VAE encoder has a downsample factor of 8—given an RGB image x with shape 256 × 256 × 3, z = E(x) has shape 32 × 32 × 4. Across all experiments in this section, our diffusion models operate in this Z-space. After sampling a new latent from our diffusion model, we decode it to pixels using the VAE decoder x = D(z). We retain diffusion hyperparameters from ADM [9]; specifically, we use a t_max = 1000 linear variance schedule ranging from 1 × 10−4 to 2 × 10−2, ADM's parameterization of the covariance Σ_θ and their method for embedding input timesteps and labels.

Evaluation metrics. We measure scaling performance with Fréchet Inception Distance (FID) [18], the standard metric for evaluating generative models of images. We follow convention when comparing against prior works and report FID-50K using 250 DDPM sampling steps. FID is known to be sensitive to small implementation details [37]; to ensure accurate comparisons, all values reported in this paper are obtained by exporting samples and using ADM's TensorFlow evaluation suite [9]. FID numbers reported in this section do not use classifier-free guidance except where otherwise stated. We additionally report Inception Score [51], sFID [34] and Precision/Recall [32] as secondary metrics.

Compute. We implement all models in JAX [1] and train them using TPU-v3 pods. DiT-XL/2, our most compute-intensive model, trains at roughly 5.7 iterations/second on a TPU v3-256 pod with a global batch size of 256.

5. Experiments

DiT block design. We train four of our highest Gflop DiT-XL/2 models, each using a different block design—in-context (119.4 Gflops), cross-attention (137.6 Gflops), adaptive layer norm (adaLN, 118.6 Gflops) or adaLN-zero (118.6 Gflops). We measure FID over the course of training. Figure 5 shows the results. The adaLN-Zero block yields lower FID than both cross-attention and in-context conditioning while being the most compute-efficient. At 400K training iterations, the FID achieved with the adaLN-Zero model is nearly half that of the in-context model, demonstrating that the conditioning mechanism critically affects model quality. Initialization is also important—adaLN-Zero, which initializes each DiT block as the identity function, significantly outperforms vanilla adaLN. For the rest of the paper, all models will use adaLN-Zero DiT blocks.
[Figure 7 grid of samples, arranged by increasing transformer size and decreasing patch size.]
Figure 7. Increasing transformer forward pass Gflops increases sample quality. Best viewed zoomed-in. We sample from all 12 of our DiT models after 400K training steps using the same input latent noise and class label. Increasing the Gflops in the model—either by increasing transformer depth/width or increasing the number of input tokens—yields significant improvements in visual fidelity.
[Figure 8 plot: FID-50K at 400K steps vs. transformer Gflops for all 12 DiT models (S, B, L, XL at patch sizes 8, 4, 2); correlation: -0.93. Figure 9 plot: FID-50K vs. total training compute (Gflops).]
Figure 8. Transformer Gflops are strongly correlated with FID. We plot the Gflops of each of our DiT models and each model's FID-50K after 400K training steps.
Figure 9. Larger DiT models use large compute more efficiently. We plot FID as a function of total training compute.

Scaling model size and patch size. We train 12 DiT models, sweeping over model configs (S, B, L, XL) and patch sizes (8, 4, 2). Note that DiT-L and DiT-XL are significantly closer to each other in terms of relative Gflops than other configs. Figure 2 (left) gives an overview of the Gflops of each model and their FID at 400K training iterations. In all cases, we find that increasing model size and decreasing patch size yields considerably improved diffusion models.

Figure 6 (top) demonstrates how FID changes as model size is increased and patch size is held constant. Across all four configs, significant improvements in FID are obtained over all stages of training by making the transformer deeper and wider. Similarly, Figure 6 (bottom) shows FID as patch size is decreased and model size is held constant. We again observe considerable FID improvements throughout training by simply scaling the number of tokens processed by DiT, holding parameters approximately fixed.

DiT Gflops are critical to improving performance. The results of Figure 6 suggest that parameter counts do not uniquely determine the quality of a DiT model. As model size is held constant and patch size is decreased, the transformer's total parameters are effectively unchanged (actually, total parameters slightly decrease), and only Gflops are increased. These results indicate that scaling model Gflops is actually the key to improved performance. To investigate this further, we plot the FID-50K at 400K training steps against model Gflops in Figure 8. The results demonstrate that different DiT configs obtain similar FID values when their total Gflops are similar (e.g., DiT-S/2 and DiT-B/4). We find a strong negative correlation between model Gflops and FID-50K, suggesting that additional model compute is the critical ingredient for improved DiT models. In Figure 12 (appendix), we find that this trend holds for other metrics such as Inception Score.

Larger DiT models are more compute-efficient. In Figure 9, we plot FID as a function of total training compute for all DiT models. We estimate training compute as model Gflops · batch size · training steps · 3, where the factor of 3 roughly approximates the backwards pass as being twice as compute-heavy as the forward pass. We find that small DiT models, even when trained longer, eventually become compute-inefficient relative to larger DiT models trained for fewer steps. Similarly, we find that models that are identical except for patch size have different performance profiles even when controlling for training Gflops. For example, XL/4 is outperformed by XL/2 after roughly 10^10 Gflops.
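The training-compute estimate above can be reproduced with a one-line helper (an illustration of the stated formula, plugging in the DiT-XL/2 numbers used elsewhere in the paper):

    def train_compute_gflops(model_gflops, batch_size, steps):
        # backward pass approximated as twice the forward cost, hence the factor of 3
        return model_gflops * batch_size * steps * 3

    # e.g., DiT-XL/2 (118.6 Gflops) at 400K steps with batch size 256:
    print(train_compute_gflops(118.6, 256, 400_000))  # ~3.6e10 Gflops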
Visualizing scaling. We visualize the effect of scaling on sample quality in Figure 7. At 400K training steps, we sample an image from each of our 12 DiT models using identical starting noise x_{t_max}, sampling noise and class labels. This lets us visually interpret how scaling affects DiT sample quality. Indeed, scaling both model size and the number of tokens yields notable improvements in visual quality.

5.1. State-of-the-Art Diffusion Models

256×256 ImageNet. Following our scaling analysis, we continue training our highest Gflop model, DiT-XL/2, for 7M steps. We show samples from the model in Figure 1, and we compare against state-of-the-art class-conditional generative models. We report results in Table 2. When using classifier-free guidance, DiT-XL/2 outperforms all prior diffusion models, decreasing the previous best FID-50K of 3.60 achieved by LDM to 2.27. Figure 2 (right) shows that DiT-XL/2 (118.6 Gflops) is compute-efficient relative to latent space U-Net models like LDM-4 (103.6 Gflops) and substantially more efficient than pixel space U-Net models such as ADM (1120 Gflops) or ADM-U (742 Gflops).
Class-Conditional ImageNet 256×256

Model                    FID↓    sFID↓   IS↑      Precision↑   Recall↑
BigGAN-deep [2]          6.95    7.36    171.4    0.87         0.28
StyleGAN-XL [53]         2.30    4.02    265.12   0.78         0.53
ADM [9]                  10.94   6.02    100.98   0.69         0.63
ADM-U                    7.49    5.13    127.49   0.72         0.63
ADM-G                    4.59    5.25    186.70   0.82         0.52
ADM-G, ADM-U             3.94    6.14    215.84   0.83         0.53
CDM [20]                 4.88    -       158.71   -            -
LDM-8 [48]               15.51   -       79.03    0.65         0.63
LDM-8-G                  7.76    -       209.52   0.84         0.35
LDM-4                    10.56   -       103.49   0.71         0.62
LDM-4-G (cfg=1.25)       3.95    -       178.22   0.81         0.55
LDM-4-G (cfg=1.50)       3.60    -       247.67   0.87         0.48
DiT-XL/2                 9.62    6.85    121.50   0.67         0.67
DiT-XL/2-G (cfg=1.25)    3.22    5.28    201.77   0.76         0.62
DiT-XL/2-G (cfg=1.50)    2.27    4.60    278.24   0.83         0.57

Table 2. Benchmarking class-conditional image generation on ImageNet 256×256. DiT-XL/2 achieves state-of-the-art FID.

Our method achieves the lowest FID of all prior generative models, including the previous state-of-the-art StyleGAN-XL [53]. Finally, we also observe that DiT-XL/2 achieves higher recall values at all tested classifier-free guidance scales compared to LDM-4 and LDM-8. When trained for only 2.35M steps (similar to ADM), XL/2 still outperforms all prior diffusion models with an FID of 2.55.

512×512 ImageNet. We train a new DiT-XL/2 model on ImageNet at 512 × 512 resolution for 3M iterations with identical hyperparameters as the 256 × 256 model. With a patch size of 2, this XL/2 model processes a total of 1024 tokens after patchifying the 64 × 64 × 4 input latent (524.6 Gflops). Table 3 shows comparisons against state-of-the-art methods. XL/2 again outperforms all prior diffusion models at this resolution, improving the previous best FID of 3.85 achieved by ADM to 3.04. Even with the increased number of tokens, XL/2 remains compute-efficient. For example, ADM uses 1983 Gflops and ADM-U uses 2813 Gflops; XL/2 uses 524.6 Gflops. We show samples from the high-resolution XL/2 model in Figure 1 and the appendix.

Class-Conditional ImageNet 512×512

Model                    FID↓    sFID↓   IS↑      Precision↑   Recall↑
BigGAN-deep [2]          8.43    8.13    177.90   0.88         0.29
StyleGAN-XL [53]         2.41    4.06    267.75   0.77         0.52
ADM [9]                  23.24   10.19   58.06    0.73         0.60
ADM-U                    9.96    5.62    121.78   0.75         0.64
ADM-G                    7.72    6.57    172.71   0.87         0.42
ADM-G, ADM-U             3.85    5.86    221.72   0.84         0.53
DiT-XL/2                 12.03   7.12    105.25   0.75         0.64
DiT-XL/2-G (cfg=1.25)    4.64    5.77    174.77   0.81         0.57
DiT-XL/2-G (cfg=1.50)    3.04    5.02    240.82   0.84         0.54

Table 3. Benchmarking class-conditional image generation on ImageNet 512×512. Note that prior work [9] measures Precision and Recall using 1000 real samples for 512 × 512 resolution; for consistency, we do the same.

5.2. Scaling Model vs. Sampling Compute

Diffusion models are unique in that they can use additional compute after training by increasing the number of sampling steps when generating an image. Given the impact of model Gflops on sample quality, in this section we study if smaller-model compute DiTs can outperform larger ones by using more sampling compute. We compute FID for all 12 of our DiT models after 400K training steps, using [16, 32, 64, 128, 256, 1000] sampling steps per-image. The main results are in Figure 10. Consider DiT-L/2 using 1000 sampling steps versus DiT-XL/2 using 128 steps. In this case, L/2 uses 80.7 Tflops to sample each image; XL/2 uses 5× less compute—15.2 Tflops—to sample each image. Nonetheless, XL/2 has the better FID-10K (23.7 vs 25.9). In general, scaling-up sampling compute cannot compensate for a lack of model compute.

[Figure 10 plot: FID-10K vs. sampling compute (Gflops) for all 12 DiT models.]
Figure 10. Scaling-up sampling compute does not compensate for a lack of model compute. For each of our DiT models trained for 400K iterations, we compute FID-10K using [16, 32, 64, 128, 256, 1000] sampling steps. For each number of steps, we plot the FID as well as the Gflops used to sample each image. Small models cannot close the performance gap with our large models, even if they sample with more test-time Gflops than the large models.

6. Conclusion

We introduce Diffusion Transformers (DiTs), a simple transformer-based backbone for diffusion models that outperforms prior U-Net models and inherits the excellent scaling properties of the transformer model class. Given the promising scaling results in this paper, future work should continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in backbone for text-to-image models like DALL·E 2 and Stable Diffusion.

Acknowledgements. We thank Kaiming He, Ronghang Hu, Alexander Berg, Shoubhik Debnath, Tim Brooks, Ilija Radosavovic and Tete Xiao for helpful discussions. William Peebles is supported by the NSF GRFP.

References

[1] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[3] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
[4] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In CVPR, pages 11315–11325, 2022.
[5] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In NeurIPS, 2021.
[6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.
[7] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[9] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[11] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[13] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv:1706.02677, 2017.
[14] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696–10706, 2022.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[16] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[17] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
[18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. 2017.
[19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[20] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv:2106.15282, 2021.
[21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[22] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.
[24] Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. arXiv preprint arXiv:2212.11972, 2022.
[25] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In NeurIPS, 2021.
[26] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.
[27] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.
[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[29] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[30] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
[32] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019.
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017.
[34] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021.
[35] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021.
[36] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
[37] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022.
[38] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, pages 4055–4064, 2018.
[39] William Peebles, Ilija Radosavovic, Tim Brooks, Alexei Efros, and Jitendra Malik. Learning to learn with generative models of neural network checkpoints. arXiv preprint arXiv:2209.12892, 2022.
[40] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[42] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
[43] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. 2019.
[44] Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Dollár. On network design spaces for visual recognition. In ICCV, 2019.
[45] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, 2020.
[46] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022.
[47] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
[48] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[49] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
[50] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.
[51] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[52] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[53] Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In SIGGRAPH, 2022.
[54] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[55] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020.
[56] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
[57] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? data, augmentation, and regularization in vision transformers. TMLR, 2022.
[58] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In NeurIPS, 2016.
[59] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
[60] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[61] Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick. Early convolutions help transformers see better. In NeurIPS, 2021.
[62] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022.
[63] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022.

Figure 11. Additional selected samples from our 512×512 and 256×256 resolution DiT-XL/2 models. We use a classifier-free guidance
scale of 6.0 for the 512 × 512 model and 4.0 for the 256 × 256 model. Both models use the ft-EMA VAE decoder.

A. Additional Implementation Details

We include detailed information about all of our DiT models in Table 4, including both 256 × 256 and 512 × 512 models. In Figure 13, we report DiT training loss curves. Finally, we also include Gflop counts for DDPM U-Net models from ADM and LDM in Table 6.

DiT model details. To embed input timesteps, we use a 256-dimensional frequency embedding [9] followed by a two-layer MLP with dimensionality equal to the transformer's hidden size and SiLU activations. Each adaLN layer feeds the sum of the timestep and class embeddings into a SiLU nonlinearity and a linear layer with output neurons equal to either 4× (adaLN) or 6× (adaLN-Zero) the transformer's hidden size. We use GELU nonlinearities (approximated with tanh) in the core transformer [16].
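A sketch of the timestep embedder described above (a 256-dimensional sinusoidal frequency embedding followed by a two-layer SiLU MLP); the exact frequency spacing here is an assumption borrowed from common DDPM implementations rather than something the text specifies.

    import math
    import torch
    import torch.nn as nn

    class TimestepEmbedder(nn.Module):
        def __init__(self, hidden_size, freq_dim=256):
            super().__init__()
            self.freq_dim = freq_dim
            self.mlp = nn.Sequential(nn.Linear(freq_dim, hidden_size), nn.SiLU(),
                                     nn.Linear(hidden_size, hidden_size))

        def forward(self, t):  # t: (N,) integer timesteps
            half = self.freq_dim // 2
            freqs = torch.exp(-math.log(10000.0) *
                              torch.arange(half, dtype=torch.float32, device=t.device) / half)
            args = t.float()[:, None] * freqs[None, :]
            emb = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # (N, freq_dim)
            return self.mlp(emb)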
Classifier-free guidance on a subset of channels. In our experiments using classifier-free guidance, we applied guidance only to the first three channels of the latents instead of all four channels. Upon investigating, we found that three-channel guidance and four-channel guidance give similar results (in terms of FID) when simply adjusting the scale factor. Specifically, three-channel guidance with a scale of (1 + x) appears reasonably well-approximated by four-channel guidance with a scale of (1 + 3/4 · x) (e.g., three-channel guidance with a scale of 1.5 gives an FID-50K of 2.27, and four-channel guidance with a scale of 1.375 gives an FID-50K of 2.20). It is somewhat interesting that applying guidance to a subset of elements can still yield good performance, and we leave it to future work to explore this phenomenon further.
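Applying guidance to only the first three latent channels amounts to restricting the guidance term to those channels; one possible formulation is sketched below (our illustration, assuming (N, 4, H, W) noise-prediction tensors; in particular, letting the remaining channel fall back to the conditional prediction is an assumption, not something the text specifies).

    def cfg_eps_subset(eps_cond, eps_uncond, scale, num_guided=3):
        # Classifier-free guidance applied to the first `num_guided` latent channels only.
        eps_hat = eps_uncond + scale * (eps_cond - eps_uncond)
        out = eps_cond.clone()            # unguided channels: conditional prediction (assumption)
        out[:, :num_guided] = eps_hat[:, :num_guided]
        return out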
B. Model Samples

We show samples from our two DiT-XL/2 models at 512 × 512 and 256 × 256 resolution trained for 3M and 7M steps, respectively. Figures 1 and 11 show selected samples from both models. Figures 14 through 33 show uncurated samples from the two models across a range of classifier-free guidance scales and input class labels (generated with 250 DDPM sampling steps and the ft-EMA VAE decoder). As with prior work using guidance, we observe that larger scales increase visual fidelity and decrease sample diversity.

Model Image Resolution Flops (G) Params (M) Training Steps (K) Batch Size Learning Rate DiT Block FID-50K (no guidance)
DiT-S/8 256 × 256 0.36 33 400 256 1 × 10−4 adaLN-Zero 153.60
DiT-S/4 256 × 256 1.41 33 400 256 1 × 10−4 adaLN-Zero 100.41
DiT-S/2 256 × 256 6.06 33 400 256 1 × 10−4 adaLN-Zero 68.40
DiT-B/8 256 × 256 1.42 131 400 256 1 × 10−4 adaLN-Zero 122.74
DiT-B/4 256 × 256 5.56 130 400 256 1 × 10−4 adaLN-Zero 68.38
DiT-B/2 256 × 256 23.01 130 400 256 1 × 10−4 adaLN-Zero 43.47
DiT-L/8 256 × 256 5.01 459 400 256 1 × 10−4 adaLN-Zero 118.87
DiT-L/4 256 × 256 19.70 458 400 256 1 × 10−4 adaLN-Zero 45.64
DiT-L/2 256 × 256 80.71 458 400 256 1 × 10−4 adaLN-Zero 23.33
DiT-XL/8 256 × 256 7.39 676 400 256 1 × 10−4 adaLN-Zero 106.41
DiT-XL/4 256 × 256 29.05 675 400 256 1 × 10−4 adaLN-Zero 43.01
DiT-XL/2 256 × 256 118.64 675 400 256 1 × 10−4 adaLN-Zero 19.47
DiT-XL/2 256 × 256 119.37 449 400 256 1 × 10−4 in-context 35.24
DiT-XL/2 256 × 256 137.62 598 400 256 1 × 10−4 cross-attention 26.14
DiT-XL/2 256 × 256 118.56 600 400 256 1 × 10−4 adaLN 25.21
DiT-XL/2 256 × 256 118.64 675 2352 256 1 × 10−4 adaLN-Zero 10.67
DiT-XL/2 256 × 256 118.64 675 7000 256 1 × 10−4 adaLN-Zero 9.62
DiT-XL/2 512 × 512 524.60 675 1301 256 1 × 10−4 adaLN-Zero 13.78
DiT-XL/2 512 × 512 524.60 675 3000 256 1 × 10−4 adaLN-Zero 11.93

Table 4. Details of all DiT models. We report detailed information about every DiT model in our paper. Note that FID-50K here is
computed without classifier-free guidance. Parameter and flop counts exclude the VAE model which contains 84M parameters across the
encoder and decoder. For both the 256 × 256 and 512 × 512 DiT-XL/2 models, we never observed FID saturate and continued training
them as long as possible. Numbers reported in this table use the ft-MSE VAE decoder.

C. Additional Scaling Results

Impact of scaling on metrics beyond FID. In Figure 12, we show the effects of DiT scale on a suite of evaluation metrics—FID, sFID, Inception Score, Precision and Recall. We find that our FID-driven analysis in the main paper generalizes to the other metrics—across every metric, scaled-up DiT models are more compute-efficient and model Gflops are highly-correlated with performance. In particular, Inception Score and Precision benefit heavily from increased model scale.

Impact of scaling on training loss. We also examine the impact of scale on training loss in Figure 13. Increasing DiT model Gflops (via transformer size or number of input tokens) causes the training loss to decrease more rapidly and saturate at a lower value. This phenomenon is consistent with trends observed with language models, where scaled-up transformers demonstrate both improved loss curves as well as improved performance on downstream evaluation suites [26].

D. VAE Decoder Ablations

We used off-the-shelf, pre-trained VAEs across our experiments. The VAE models (ft-MSE and ft-EMA) are fine-tuned versions of the original LDM "f8" model (only the decoder weights are fine-tuned). We monitored metrics for our scaling analysis in Section 5 using the ft-MSE decoder, and we used the ft-EMA decoder for our final metrics reported in Tables 2 and 3. In this section, we ablate three different choices of the VAE decoder; the original one used by LDM and the two fine-tuned decoders used by Stable Diffusion. Because the encoders are identical across models, the decoders can be swapped-in without retraining the diffusion model. Table 5 shows results; XL/2 continues to outperform all prior diffusion models when using the LDM decoder.

Class-Conditional ImageNet 256×256, DiT-XL/2-G (cfg=1.5)

Decoder     FID↓    sFID↓   IS↑      Precision↑   Recall↑
original    2.46    5.18    271.56   0.82         0.57
ft-MSE      2.30    4.73    276.09   0.83         0.57
ft-EMA      2.27    4.60    278.24   0.83         0.57

Table 5. Decoder ablation. We tested different pre-trained VAE decoder weights available at https://huggingface.co/stabilityai/sd-vae-ft-mse. Different pre-trained decoder weights yield comparable results on ImageNet 256 × 256.

Diffusion U-Net Model Complexities

Model    Image Resolution    Base Flops (G)    Upsampler Flops (G)    Total Flops (G)
ADM      128 × 128           307               -                      307
ADM      256 × 256           1120              -                      1120
ADM      512 × 512           1983              -                      1983
ADM-U    256 × 256           110               632                    742
ADM-U    512 × 512           307               2506                   2813
LDM-4    256 × 256           104               -                      104
LDM-8    256 × 256           57                -                      57

Table 6. Gflop counts for baseline diffusion models that use U-Net backbones. Note that we only count Flops for DDPM components.
Figure 12. DiT scaling behavior on several generative modeling metrics. Left: We plot model performance as a function of total training
compute for FID, sFID, Inception Score, Precision and Recall. Right: We plot model performance at 400K training steps for all 12 DiT
variants against transformer Gflops, finding strong correlations across metrics. All values were computed using the ft-MSE VAE decoder.

[Figure 13 plots: training loss vs. training iterations for the S, B, L and XL models at each patch size, and for the 256×256 and 512×512 XL/2 models, with insets of the first 100K iterations.]

Figure 13. Training loss curves for all DiT models. We plot the loss over training for all DiT models (the sum of the noise prediction mean-squared error and D_KL). We also highlight early training behavior. Note that scaled-up DiT models exhibit lower training losses.

[Figures 14–33: uncurated sample grids from the DiT-XL/2 models.]
Figure 14. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "arctic wolf" (270).
Figure 15. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "volcano" (980).
Figure 16. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "husky" (250).
Figure 17. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "sulphur-crested cockatoo" (89).
Figure 18. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "cliff drop-off" (972).
Figure 19. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "balloon" (417).
Figure 20. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "lion" (291).
Figure 21. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "otter" (360).
Figure 22. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 2.0, class label = "red panda" (387).
Figure 23. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 2.0, class label = "panda" (388).
Figure 24. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 1.5, class label = "coral reef" (973).
Figure 25. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 1.5, class label = "macaw" (88).
Figure 26. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "macaw" (88).
Figure 27. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "dog sled" (537).
Figure 28. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "arctic fox" (279).
Figure 29. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 4.0, class label = "loggerhead sea turtle" (33).
Figure 30. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 2.0, class label = "golden retriever" (207).
Figure 31. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 2.0, class label = "lake shore" (975).
Figure 32. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 1.5, class label = "space shuttle" (812).
Figure 33. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 1.5, class label = "ice cream" (928).
