
StarGAN v2: Diverse Image Synthesis for Multiple Domains

Yunjey Choi1*  Youngjung Uh1*  Jaejun Yoo2*  Jung-Woo Ha1
1 Clova AI Research, NAVER Corp.   2 EPFL
* indicates equal contribution

arXiv:1912.01865v2 [cs.CV] 26 Apr 2020

Figure 1. Diverse image synthesis results on the CelebA-HQ dataset and the newly collected animal faces (AFHQ) dataset. The first column shows input images, while the remaining columns are images synthesized by StarGAN v2.

Abstract

A good image-to-image translation model should learn a mapping between different visual domains while satisfying the following properties: 1) diversity of generated images and 2) scalability over multiple domains. Existing methods address either of the issues, having limited diversity or multiple models for all domains. We propose StarGAN v2, a single framework that tackles both and shows significantly improved results over the baselines. Experiments on CelebA-HQ and a new animal faces dataset (AFHQ) validate our superiority in terms of visual quality, diversity, and scalability. To better assess image-to-image translation models, we release AFHQ, high-quality animal faces with large inter- and intra-domain differences. The code, pretrained models, and dataset are available at clovaai/stargan-v2.

1. Introduction

Image-to-image translation aims to learn a mapping between different visual domains [20]. Here, a domain implies a set of images that can be grouped as a visually distinctive category, and each image has a unique appearance, which we call style. For example, we can set image domains based on the gender of a person, in which case the style includes makeup, beard, and hairstyle (top half of Figure 1). An ideal image-to-image translation method should be able to synthesize images considering the diverse styles in each domain. However, designing and learning such models becomes complicated as there can be an arbitrarily large number of styles and domains in the dataset.
To address the style diversity, much work on image-to-image translation has been developed [1, 16, 34, 28, 38, 54]. These methods inject a low-dimensional latent code into the generator, which can be randomly sampled from the standard Gaussian distribution. Their domain-specific decoders interpret the latent codes as recipes for various styles when generating images. However, because these methods have only considered a mapping between two domains, they are not scalable to an increasing number of domains. For example, having K domains, these methods require training K(K−1) generators to handle translations between each and every domain, limiting their practical usage.

To address the scalability, several studies have proposed a unified framework [2, 7, 17, 30]. StarGAN [7] is one of the earliest models, which learns the mappings between all available domains using a single generator. The generator takes a domain label as an additional input and learns to transform an image into the corresponding domain. However, StarGAN still learns a deterministic mapping per domain, which does not capture the multi-modal nature of the data distribution. This limitation comes from the fact that each domain is indicated by a predetermined label. Note that the generator receives a fixed label (e.g. a one-hot vector) as input, and thus it inevitably produces the same output per domain, given a source image.

To get the best of both worlds, we propose StarGAN v2, a scalable approach that can generate diverse images across multiple domains. In particular, we start from StarGAN and replace its domain label with our proposed domain-specific style code that can represent diverse styles of a specific domain. To this end, we introduce two modules, a mapping network and a style encoder. The mapping network learns to transform random Gaussian noise into a style code, while the encoder learns to extract the style code from a given reference image. Considering multiple domains, both modules have multiple output branches, each of which provides style codes for a specific domain. Finally, utilizing these style codes, our generator learns to successfully synthesize diverse images over multiple domains (Figure 1).

We first investigate the effect of individual components of StarGAN v2 and show that our model indeed benefits from using the style code (Section 3.1). We empirically demonstrate that our proposed method is scalable to multiple domains and gives significantly better results in terms of visual quality and diversity compared to the leading methods (Section 3.2). Last but not least, we present a new dataset of animal faces (AFHQ) with high quality and wide variations (Appendix A) to better evaluate the performance of image-to-image translation models under large inter- and intra-domain differences. We make this dataset publicly available for the research community.

2. StarGAN v2

In this section, we describe our proposed framework and its training objective functions.

2.1. Proposed framework

Let X and Y be the sets of images and possible domains, respectively. Given an image x ∈ X and an arbitrary domain y ∈ Y, our goal is to train a single generator G that can generate diverse images of each domain y that correspond to the image x. We generate domain-specific style vectors in the learned style space of each domain and train G to reflect the style vectors. Figure 2 illustrates an overview of our framework, which consists of the four modules described below.

Generator (Figure 2a). Our generator G translates an input image x into an output image G(x, s) reflecting a domain-specific style code s, which is provided either by the mapping network F or by the style encoder E. We use adaptive instance normalization (AdaIN) [15, 22] to inject s into G. We observe that s is designed to represent a style of a specific domain y, which removes the necessity of providing y to G and allows G to synthesize images of all domains.

Mapping network (Figure 2b). Given a latent code z and a domain y, our mapping network F generates a style code s = F_y(z), where F_y(·) denotes the output of F corresponding to the domain y. F consists of an MLP with multiple output branches to provide style codes for all available domains. F can produce diverse style codes by sampling the latent vector z ∈ Z and the domain y ∈ Y randomly. Our multi-task architecture allows F to efficiently and effectively learn style representations of all domains.

Style encoder (Figure 2c). Given an image x and its corresponding domain y, our encoder E extracts the style code s = E_y(x) of x. Here, E_y(·) denotes the output of E corresponding to the domain y. Similar to F, our style encoder E benefits from the multi-task learning setup. E can produce diverse style codes using different reference images. This allows G to synthesize an output image reflecting the style s of a reference image x.

Discriminator (Figure 2d). Our discriminator D is a multi-task discriminator [30, 35], which consists of multiple output branches. Each branch D_y learns a binary classification determining whether an image x is a real image of its domain y or a fake image G(x, s) produced by G.
Figure 2. Overview of StarGAN v2, consisting of four modules. (a) The generator translates an input image into an output image reflecting the domain-specific style code. (b) The mapping network transforms a latent code into style codes for multiple domains, one of which is randomly selected during training. (c) The style encoder extracts the style code of an image, allowing the generator to perform reference-guided image synthesis. (d) The discriminator distinguishes between real and fake images from multiple domains. Note that all modules except the generator contain multiple output branches, one of which is selected when training the corresponding domain.

2.2. Training objectives

Given an image x ∈ X and its original domain y ∈ Y, we train our framework using the following objectives.

Adversarial objective. During training, we sample a latent code z ∈ Z and a target domain ỹ ∈ Y randomly, and generate a target style code s̃ = F_ỹ(z). The generator G takes an image x and s̃ as inputs and learns to generate an output image G(x, s̃) via an adversarial loss

$$\mathcal{L}_{adv} = \mathbb{E}_{x,y}\left[\log D_y(x)\right] + \mathbb{E}_{x,\tilde{y},z}\left[\log\left(1 - D_{\tilde{y}}(G(x,\tilde{s}))\right)\right], \qquad (1)$$

where D_y(·) denotes the output of D corresponding to the domain y. The mapping network F learns to provide the style code s̃ that is likely in the target domain ỹ, and G learns to utilize s̃ and generate an image G(x, s̃) that is indistinguishable from real images of the domain ỹ.

Style reconstruction. In order to enforce the generator G to utilize the style code s̃ when generating the image G(x, s̃), we employ a style reconstruction loss

$$\mathcal{L}_{sty} = \mathbb{E}_{x,\tilde{y},z}\left[\lVert \tilde{s} - E_{\tilde{y}}(G(x,\tilde{s})) \rVert_1\right]. \qquad (2)$$

This objective is similar to previous approaches [16, 54], which employ multiple encoders to learn a mapping from an image to its latent code. The notable difference is that we train a single encoder E to encourage diverse outputs for multiple domains. At test time, our learned encoder E allows G to transform an input image, reflecting the style of a reference image.

Style diversification. To further enable the generator G to produce diverse images, we explicitly regularize G with the diversity sensitive loss [34, 48]

$$\mathcal{L}_{ds} = \mathbb{E}_{x,\tilde{y},z_1,z_2}\left[\lVert G(x,\tilde{s}_1) - G(x,\tilde{s}_2) \rVert_1\right], \qquad (3)$$

where the target style codes s̃_1 and s̃_2 are produced by F conditioned on two random latent codes z_1 and z_2 (i.e. s̃_i = F_ỹ(z_i) for i ∈ {1, 2}). Maximizing this regularization term forces G to explore the image space and discover meaningful style features to generate diverse images. Note that in its original form, a small difference ‖z_1 − z_2‖_1 in the denominator increases the loss significantly, which makes training unstable due to large gradients. We therefore remove the denominator and devise a new equation for stable training with the same intuition.

Preserving source characteristics. To guarantee that the generated image G(x, s̃) properly preserves the domain-invariant characteristics (e.g. pose) of its input image x, we employ the cycle consistency loss [7, 24, 53]

$$\mathcal{L}_{cyc} = \mathbb{E}_{x,y,\tilde{y},z}\left[\lVert x - G(G(x,\tilde{s}), \hat{s}) \rVert_1\right], \qquad (4)$$

where ŝ = E_y(x) is the estimated style code of the input image x, and y is the original domain of x. By encouraging the generator G to reconstruct the input image x with the estimated style code ŝ, G learns to preserve the original characteristics of x while changing its style faithfully.

Full objective. Our full objective can be summarized as follows:

$$\min_{G,F,E}\;\max_{D}\;\; \mathcal{L}_{adv} + \lambda_{sty}\,\mathcal{L}_{sty} - \lambda_{ds}\,\mathcal{L}_{ds} + \lambda_{cyc}\,\mathcal{L}_{cyc}, \qquad (5)$$

where λ_sty, λ_ds, and λ_cyc are hyperparameters for each term. We also train our model in the same manner as the above objective, using reference images instead of latent vectors when generating style codes. We provide the training details in Appendix B.
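For concreteness, the non-adversarial terms above can be written as a short PyTorch sketch, assuming a generator G(x, s), mapping network F_net(z, y), and style encoder E(x, y) with the interfaces used in the earlier sketch. The adversarial loss, R1 regularization, and optimizer steps are omitted; this illustrates Eqs. (2)-(4) and the weighting of Eq. (5), and is not the authors' training code.

```python
# Compact sketch of the style reconstruction, diversity, and cycle
# consistency terms and their weighting in the full objective.
import torch

def non_adversarial_losses(G, F_net, E, x, y_src, y_trg,
                           lambda_sty=1.0, lambda_ds=1.0, lambda_cyc=1.0,
                           latent_dim=16):
    n = x.size(0)
    z1, z2 = torch.randn(n, latent_dim), torch.randn(n, latent_dim)
    s1, s2 = F_net(z1, y_trg), F_net(z2, y_trg)      # target style codes

    x_fake = G(x, s1)
    # Style reconstruction (Eq. 2): E should recover s1 from G(x, s1).
    loss_sty = torch.mean(torch.abs(s1 - E(x_fake, y_trg)))
    # Diversity sensitivity (Eq. 3): two style codes should yield different
    # images; this term is maximized, hence the minus sign below.
    loss_ds = torch.mean(torch.abs(x_fake - G(x, s2)))
    # Cycle consistency (Eq. 4): translating back with the source style
    # should recover the input image.
    loss_cyc = torch.mean(torch.abs(x - G(x_fake, E(x, y_src))))

    return lambda_sty * loss_sty - lambda_ds * loss_ds + lambda_cyc * loss_cyc
```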
Configuration                            FID     LPIPS
(A) Baseline StarGAN [7]                 98.4    -
(B) + Multi-task discriminator           91.4    -
(C) + Tuning (e.g. R1 regularization)    80.5    -
(D) + Latent code injection              32.3    0.312
(E) + Replace (D) with style code        17.1    0.405
(F) + Diversity regularization           13.7    0.452

Table 1. Performance of various configurations on CelebA-HQ. Fréchet inception distance (FID) indicates the distance between the distributions of real and generated images (lower is better), while learned perceptual image patch similarity (LPIPS) measures the diversity of generated images (higher is better).

Figure 3. Visual comparison of generated images using each configuration in Table 1. Note that given a source image, configurations (A)-(C) provide a single output, while (D)-(F) generate multiple output images.

3. Experiments

In this section, we describe evaluation setups and conduct a set of experiments. We analyze the individual components of StarGAN v2 (Section 3.1) and compare our model with three leading baselines on diverse image synthesis (Section 3.2). All experiments are conducted using images unseen during the training phase.

Baselines. We use MUNIT [16], DRIT [28], and MSGAN [34] as our baselines, all of which learn multi-modal mappings between two domains. For multi-domain comparisons, we train these models multiple times for every pair of image domains. We also compare our method with StarGAN [7], which learns mappings among multiple domains using a single generator. All the baselines are trained using the implementations provided by the authors.

Datasets. We evaluate StarGAN v2 on CelebA-HQ [21] and our new AFHQ dataset (Appendix A). We separate CelebA-HQ into two domains of male and female, and AFHQ into three domains of cat, dog, and wildlife. Other than the domain labels, we do not use any additional information (e.g. facial attributes of CelebA-HQ or breeds of AFHQ) and let the models learn such information as styles without supervision. For a fair comparison, all images are resized to 256 × 256 resolution for training, which is the highest resolution used in the baselines.

Evaluation metrics. We evaluate both the visual quality and the diversity of generated images using Fréchet inception distance (FID) [14] and learned perceptual image patch similarity (LPIPS) [52]. We compute FID and LPIPS for every pair of image domains within a dataset and report their average values. The evaluation metrics and protocols are further described in Appendix C.

3.1. Analysis of individual components

We evaluate the individual components that are added to our baseline StarGAN using CelebA-HQ. Table 1 gives FID and LPIPS for several configurations, where each component is cumulatively added on top of StarGAN. An input image and the corresponding generated images of each configuration are shown in Figure 3. The baseline configuration (A) corresponds to the basic setup of StarGAN, which employs WGAN-GP [11], an ACGAN discriminator [39], and depth-wise concatenation [36] for providing the target domain information to the generator. As shown in Figure 3a, the original StarGAN produces only a local change by applying makeup to the input image.

We first improve our baseline by replacing the ACGAN discriminator with a multi-task discriminator [35, 30], allowing the generator to transform the global structure of an input image, as shown in configuration (B). Exploiting recent advances in GANs, we further enhance the training stability and construct a new baseline (C) by applying R1 regularization [35] and switching the depth-wise concatenation to adaptive instance normalization (AdaIN) [9, 15]. Note that we do not report LPIPS for these variations in Table 1, since they are not designed to produce multiple outputs for a given input image and target domain.

To induce diversity, one can think of directly feeding a latent code z into the generator G and imposing the latent reconstruction loss ||z − E(G(x, z, y))||₁ [16, 54]. However, in a multi-domain scenario, we observe that this baseline (D) does not encourage the network to learn meaningful styles and fails to provide as much diversity as we expect. We conjecture that this is because latent codes have no capability of separating domains, and thus the latent reconstruction loss models domain-shared styles (e.g. color) rather than domain-specific ones (e.g. hairstyle). Note that the FID gap between baselines (C) and (D) is simply due to the difference in the number of output samples.
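For reference, the R1 regularization applied in configuration (C) above penalizes the squared gradient norm of the discriminator's real/fake logit at real images [35]. A minimal sketch, under the assumption that `d_real` is the logit of the branch matching the image's own domain and that the γ/2 weight is applied by the caller:

```python
# Minimal sketch of the R1 gradient penalty on real images.
import torch

def r1_penalty(d_real, x_real):
    # x_real must have requires_grad=True before the discriminator forward pass.
    grad = torch.autograd.grad(outputs=d_real.sum(), inputs=x_real,
                               create_graph=True)[0]
    return grad.pow(2).reshape(grad.size(0), -1).sum(1).mean()
```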
Figure 4. Reference-guided image synthesis results on CelebA-HQ. The source and reference images in the first row and the first column are real images, while the rest are images generated by our proposed model, StarGAN v2. Our model learns to transform a source image reflecting the style of a given reference image. High-level semantics such as hairstyle, makeup, beard, and age are followed from the reference images, while the pose and identity of the source images are preserved. Note that the images in each column share a single identity with different styles, and those in each row share a style with different identities.

Figure 5. Qualitative comparison of latent-guided image synthesis results on the CelebA-HQ and AFHQ datasets. Each method translates the source images (left-most column) to target domains using randomly sampled latent codes. (a) Latent-guided synthesis on CelebA-HQ: the top three rows correspond to the results of converting male to female, and vice versa in the bottom three rows. (b) Latent-guided synthesis on AFHQ: every two rows from the top show the synthesized images in the following order: cat-to-dog, dog-to-wildlife, and wildlife-to-cat.

Instead of feeding a latent code into G directly, to learn meaningful styles we transform a latent code z into a domain-specific style code s through our proposed mapping network (Figure 2b) and inject the style code into the generator (E). Here, we also introduce the style reconstruction loss (Eq. (2)). Note that each output branch of our mapping network is responsible for a particular domain, so style codes have no ambiguity in separating domains. Unlike the latent reconstruction loss, the style reconstruction loss allows the generator to produce diverse images reflecting domain-specific styles. Finally, we further improve the network to produce diverse outputs by adopting the diversity regularization (Eq. (3)); this configuration (F) corresponds to our proposed method, StarGAN v2. Figure 4 shows that StarGAN v2 can synthesize images that reflect diverse styles of references, including hairstyle, makeup, and beard, without hurting the source characteristics.

3.2. Comparison on diverse image synthesis

In this section, we evaluate StarGAN v2 on diverse image synthesis from two perspectives: latent-guided synthesis and reference-guided synthesis.

Latent-guided synthesis. Figure 5 provides a qualitative comparison of the competing methods. Each method produces multiple outputs using random noise. For CelebA-HQ, we observe that our method synthesizes images with a higher visual quality compared to the baseline models. In addition, our method is the only model that can successfully change the entire hairstyle of the source images, which requires non-trivial effort (e.g. generating ears). For AFHQ, which has relatively large variations, the performance of the baselines is considerably degraded, while our method still produces images with high quality and diverse styles.

Method          CelebA-HQ (FID / LPIPS)    AFHQ (FID / LPIPS)
MUNIT [16]      31.4 / 0.363               41.5 / 0.511
DRIT [28]       52.1 / 0.178               95.6 / 0.326
MSGAN [34]      33.1 / 0.389               61.4 / 0.517
StarGAN v2      13.7 / 0.452               16.2 / 0.450
Real images     14.8 / -                   12.9 / -

Table 2. Quantitative comparison on latent-guided synthesis. The FIDs of real images are computed between the training and test sets. Note that they may not be optimal values since the number of test images is insufficient, but we report them for reference.

As shown in Table 2, our method outperforms all the baselines by a large margin in terms of visual quality. For CelebA-HQ and AFHQ, our method achieves FIDs of 13.7 and 16.2, respectively, which is more than a two-fold improvement over the previous leading method. Our LPIPS is also the highest on CelebA-HQ, which implies that our model produces the most diverse results given a single input. We conjecture that the high LPIPS values of the baseline models on AFHQ are due to their spurious artifacts.
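The two synthesis modes evaluated in this section differ only in where the style code comes from. A minimal sketch, assuming the G, F_net, and E interfaces introduced earlier (not the authors' released API):

```python
# Latent-guided vs. reference-guided synthesis: the generator call is the
# same; only the source of the style code changes.
import torch

@torch.no_grad()
def latent_guided(G, F_net, x_src, y_trg, latent_dim=16):
    z = torch.randn(x_src.size(0), latent_dim)   # random latent code
    s = F_net(z, y_trg)                          # domain-specific style code
    return G(x_src, s)

@torch.no_grad()
def reference_guided(G, E, x_src, x_ref, y_trg):
    s = E(x_ref, y_trg)                          # style extracted from the reference
    return G(x_src, s)
```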
Figure 6. Qualitative comparison of reference-guided image synthesis results on the CelebA-HQ and AFHQ datasets. Each method translates the source images into target domains, reflecting the styles of the reference images. (a) Reference-guided synthesis on CelebA-HQ. (b) Reference-guided synthesis on AFHQ.

Reference-guided synthesis. To obtain the style code from a reference image, we sample test images from a target domain and feed them to the encoder network of each method. For CelebA-HQ (Figure 6a), our method successfully renders distinctive styles (e.g. bangs, beard, makeup, and hairstyle), while the others mostly match the color distribution of the reference images. For the more challenging AFHQ (Figure 6b), the baseline models suffer from a large domain shift. They hardly reflect the style of each reference image and only match the domain. In contrast, our model renders the distinctive styles (e.g. breeds) of each reference image as well as its fur pattern and eye color. Note that StarGAN v2 produces high-quality images across all domains and that these results come from a single generator. Since the other baselines are trained individually for each pair of domains, their output quality fluctuates across domains. For example, in AFHQ (Figure 6b), the baseline models work reasonably well in dog-to-wildlife (2nd row) while they fail in cat-to-dog (1st row).

Table 3 shows FID and LPIPS of each method for reference-guided synthesis. For both datasets, our method achieves FIDs of 23.8 and 19.8, which are about 1.5× and 3.5× better than the previous leading method, respectively. The LPIPS of StarGAN v2 is also the highest among the competitors, which implies that our model produces the most diverse results considering the styles of reference images. Here, MUNIT and DRIT suffer from mode collapse on AFHQ, which results in lower LPIPS and higher FID than the other methods.

Method          CelebA-HQ (FID / LPIPS)    AFHQ (FID / LPIPS)
MUNIT [16]      107.1 / 0.176              223.9 / 0.199
DRIT [28]       53.3 / 0.311               114.8 / 0.156
MSGAN [34]      39.6 / 0.312               69.8 / 0.375
StarGAN v2      23.8 / 0.388               19.8 / 0.432
Real images     14.8 / -                   12.9 / -

Table 3. Quantitative comparison on reference-guided synthesis. We sample ten reference images to synthesize diverse images.

Human evaluation. We use Amazon Mechanical Turk (AMT) to compare the user preferences of our method with the baseline approaches. Given a pair of source and reference images, the AMT workers are instructed to select one among four image candidates from the methods, whose order is randomly shuffled. We ask separately which model offers the best image quality and which model best stylizes the input image considering the reference image. For each comparison, we randomly generate 100 questions, and each question is answered by 10 workers. We also ask each worker a few simple questions to detect unworthy workers. The total number of valid workers is 76. As shown in Table 4, our method obtains the majority of votes in all instances, especially on the challenging AFHQ dataset and for the question about style reflection. These results show that StarGAN v2 better extracts and renders the styles onto the input image than the other baselines.
Method          CelebA-HQ (Quality / Style)    AFHQ (Quality / Style)
MUNIT [16]      6.2 / 7.4                      1.6 / 0.2
DRIT [28]       11.4 / 7.6                     4.1 / 2.8
MSGAN [34]      13.5 / 10.1                    6.2 / 4.9
StarGAN v2      68.9 / 74.8                    88.1 / 92.1

Table 4. Votes from AMT workers for the most preferred method regarding visual quality and style reflection (%). StarGAN v2 outperforms the baselines with remarkable margins in all aspects.

4. Discussion

We discuss several reasons why StarGAN v2 can successfully synthesize images of diverse styles over multiple domains. First, our style code is separately generated per domain by the multi-head mapping network and style encoder. By doing so, our generator can focus solely on using the style code, whose domain-specific information is already taken care of by the mapping network (Section 3.1). Second, following the insight of StyleGAN [22], our style space is produced by learned transformations. This provides more flexibility to our model than the baselines [16, 28, 34], which assume that the style space is a fixed Gaussian distribution (Section 3.2). Last but not least, our modules benefit from fully exploiting training data from multiple domains. By design, the shared part of each module should learn domain-invariant features, which induces a regularization effect and encourages better generalization to unseen samples. To show that our model generalizes to unseen images, we test a few samples from FFHQ [22] with our model trained on CelebA-HQ (Figure 7). Here, StarGAN v2 successfully captures the styles of the references and renders these styles correctly onto the source images.

Figure 7. Reference-guided synthesis results on FFHQ with the model trained on CelebA-HQ. Despite the distribution gap between the two datasets, StarGAN v2 successfully extracts the style codes of the references and synthesizes faithful images.

5. Related work

Generative adversarial networks (GANs) [10] have shown impressive results in many computer vision tasks such as image synthesis [4, 31, 8], colorization [18, 50], and super-resolution [27, 47]. Along with improving the visual quality of generated images, their diversity has also been considered an important objective, which has been tackled by either devoted loss functions [34, 35] or architectural design [4, 22]. StyleGAN [22] introduces a non-linear mapping function that embeds an input latent code into an intermediate style space to better represent the factors of variation. However, this method requires non-trivial effort when transforming a real image, since its generator is not designed to take an image as input.

Early image-to-image translation methods [20, 53, 29] are well known to learn a deterministic mapping even with stochastic noise inputs. Several methods reinforce the connection between stochastic noise and the generated image for diversity, by marginal matching [1], latent regression [54, 16], and diversity regularization [48, 34]. Other approaches produce various outputs with the guidance of reference images [5, 6, 32, 40]. However, all these methods consider only two domains, and their extension to multiple domains is non-trivial. Recently, FUNIT [30] tackled multi-domain image translation using a few reference images from a target domain, but it requires fine-grained class labels and cannot generate images with random noise. Our method provides both latent-guided and reference-guided synthesis and can be trained with coarsely labeled datasets. In parallel work, Yu et al. [51] tackle the same issue, but they define style as domain-shared characteristics rather than domain-specific ones, which limits the output diversity.

6. Conclusion

We proposed StarGAN v2, which addresses two major challenges in image-to-image translation: translating an image of one domain to diverse images of a target domain, and supporting multiple target domains. The experimental results showed that our model can generate images with rich styles across multiple domains, remarkably outperforming the previous leading methods [16, 28, 34]. We also released a new dataset of animal faces (AFHQ) for evaluating methods in a setting with large inter- and intra-domain variation.

Acknowledgements. We thank the full-time and visiting Clova AI members for an early review: Seongjoon Oh, Junsuk Choe, Muhammad Ferjad Naeem, and Kyungjune Baek. All experiments were conducted based on NAVER Smart Machine Learning (NSML) [23, 43].
Figure 8. Examples from our newly collected AFHQ dataset.

A. The AFHQ dataset

We release a new dataset of animal faces, Animal Faces-HQ (AFHQ), consisting of 15,000 high-quality images at 512 × 512 resolution. Figure 8 shows example images of the AFHQ dataset. The dataset includes three domains of cat, dog, and wildlife, each providing 5,000 images. By having multiple (three) domains and diverse images of various breeds (≥ eight) per domain, AFHQ sets a more challenging image-to-image translation problem. For each domain, we select 500 images as a test set and provide all remaining images as a training set. We collected images with permissive licenses from the Flickr (https://www.flickr.com) and Pixabay (https://www.pixabay.com) websites. All images are vertically and horizontally aligned to have the eyes at the center, and low-quality images were discarded by human effort. We have made the dataset available at https://github.com/clovaai/stargan-v2.

B. Training details

For fast training, the batch size is set to eight and the model is trained for 100K iterations. The training time is about three days on a single Tesla V100 GPU with our implementation in PyTorch [41]. We set λ_sty = 1, λ_ds = 1, and λ_cyc = 1 for CelebA-HQ, and λ_sty = 1, λ_ds = 2, and λ_cyc = 1 for AFHQ. To stabilize the training, the weight λ_ds is linearly decayed to zero over the 100K iterations. We adopt the non-saturating adversarial loss [10] with R1 regularization [35] using γ = 1. We use the Adam [25] optimizer with β1 = 0 and β2 = 0.99. The learning rates for G, D, and E are set to 10⁻⁴, while that of F is set to 10⁻⁶. For evaluation, we employ exponential moving averages over the parameters [21, 49] of all modules except D. We initialize the weights of all modules using He initialization [12] and set all biases to zero, except for the biases associated with the scaling vectors of AdaIN, which are set to one.

C. Evaluation protocol

This section provides details of the evaluation metrics and protocols used in all experiments.

Fréchet inception distance (FID) [14] measures the discrepancy between two sets of images. We use the feature vectors from the last average pooling layer of the ImageNet-pretrained Inception-V3 [44]. For each test image from a source domain, we translate it into a target domain using 10 latent vectors, which are randomly sampled from the standard Gaussian distribution. We then calculate the FID between the translated images and the training images of the target domain. We calculate the FID values for every pair of image domains (e.g. female ↔ male for CelebA-HQ) and report the average value. Note that, for reference-guided synthesis, each source image is transformed using 10 reference images randomly sampled from the test set of the target domain.

Learned perceptual image patch similarity (LPIPS) [52] measures the diversity of generated images using the L1 distance between features extracted from the ImageNet-pretrained AlexNet [26]. For each test image from a source domain, we generate 10 outputs of a target domain using 10 randomly sampled latent vectors. Then, we compute the average of the pairwise distances among all outputs generated from the same input (i.e. 45 pairs). Finally, we report the average of the LPIPS values over all test images. For reference-guided synthesis, each source image is transformed using 10 reference images to produce 10 outputs.

D. Additional results

We provide additional reference-guided image synthesis results on both CelebA-HQ and AFHQ (Figures 9 and 10). On CelebA-HQ, StarGAN v2 synthesizes the source identity in diverse appearances reflecting the reference styles such as hairstyle and makeup. On AFHQ, the results follow the breed and hair of the reference images while preserving the pose of the source images. Interpolation results between styles can be found at https://youtu.be/0EVh5Ki4dIY.
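The LPIPS diversity protocol of Appendix C can be summarized as a small helper. In this sketch, `lpips_fn` stands in for an AlexNet-based perceptual distance (for example, the third-party `lpips` package), and the G and F_net interfaces follow the earlier sketches; both are assumptions rather than the paper's released evaluation code.

```python
# Sketch of the LPIPS diversity protocol: for one source image, generate 10
# outputs with random latent codes and average the pairwise distances
# (10 * 9 / 2 = 45 pairs).
import itertools
import torch

@torch.no_grad()
def lpips_diversity(G, F_net, lpips_fn, x_src, y_trg, num_outputs=10, latent_dim=16):
    outputs = []
    for _ in range(num_outputs):
        z = torch.randn(x_src.size(0), latent_dim)
        outputs.append(G(x_src, F_net(z, y_trg)))
    dists = [lpips_fn(a, b).mean() for a, b in itertools.combinations(outputs, 2)]
    return torch.stack(dists).mean()
```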
Figure 9. Reference-guided image synthesis results on CelebA-HQ. The source and reference images in the first row and the first column are real images, while the rest are images generated by our proposed model, StarGAN v2. Our model learns to transform a source image reflecting the style of a given reference image. High-level semantics such as hairstyle, makeup, beard, and age are followed from the reference images, while the pose and identity of the source images are preserved. Note that the images in each column share a single identity with different styles, and those in each row share a style with different identities.

Figure 10. Reference-guided image synthesis results on AFHQ. All images except the sources and references are generated by our proposed model, StarGAN v2. High-level semantics such as hair are followed from the references, while the pose of the sources is preserved.
E. Network architecture

In this section, we provide architectural details of StarGAN v2, which consists of the four modules described below.

Generator (Table 5). For AFHQ, our generator consists of four downsampling blocks, four intermediate blocks, and four upsampling blocks, all of which inherit pre-activation residual units [13]. We use instance normalization (IN) [45] and adaptive instance normalization (AdaIN) [15, 22] for the downsampling and upsampling blocks, respectively. A style code is injected into all AdaIN layers, providing scaling and shifting vectors through learned affine transformations. For CelebA-HQ, we increase the number of downsampling and upsampling layers by one. We also remove all shortcuts in the upsampling residual blocks and add skip connections with the adaptive wing based heatmap [46].

Mapping network (Table 6). Our mapping network consists of an MLP with K output branches, where K indicates the number of domains. Four fully connected layers are shared among all domains, followed by four domain-specific fully connected layers for each domain. We set the dimensions of the latent code, the hidden layer, and the style code to 16, 512, and 64, respectively. We sample the latent code from the standard Gaussian distribution. We do not apply pixel normalization [22] to the latent code, which has been observed not to increase model performance in our tasks. We also tried feature normalizations [3, 19], but this degraded performance.

Style encoder (Table 7). Our style encoder consists of a CNN with K output branches, where K is the number of domains. Six pre-activation residual blocks are shared among all domains, followed by one domain-specific fully connected layer for each domain. We do not use global average pooling [16] so as to extract fine style features of a given reference image. The output dimension "D" in Table 7 is set to 64, which is the dimension of the style code.

Discriminator (Table 7). Our discriminator is a multi-task discriminator [35], which contains multiple linear output branches.³ The discriminator contains six pre-activation residual blocks with leaky ReLU [33]. We use K fully connected layers for real/fake classification of each domain, where K indicates the number of domains. The output dimension "D" is set to 1 for real/fake classification. We do not use any feature normalization techniques [19, 45] or PatchGAN [20], as they have been observed not to improve output quality. We have observed that in our settings, the multi-task discriminator provides better results than other types of conditional discriminators [36, 37, 39, 42].

³ The original implementation of the multi-task discriminator can be found at https://github.com/LMescheder/GAN_stability.

Layer      Resample   Norm    Output shape
Image x    -          -       256 × 256 × 3
Conv1×1    -          -       256 × 256 × 64
ResBlk     AvgPool    IN      128 × 128 × 128
ResBlk     AvgPool    IN      64 × 64 × 256
ResBlk     AvgPool    IN      32 × 32 × 512
ResBlk     AvgPool    IN      16 × 16 × 512
ResBlk     -          IN      16 × 16 × 512
ResBlk     -          IN      16 × 16 × 512
ResBlk     -          AdaIN   16 × 16 × 512
ResBlk     -          AdaIN   16 × 16 × 512
ResBlk     Upsample   AdaIN   32 × 32 × 512
ResBlk     Upsample   AdaIN   64 × 64 × 256
ResBlk     Upsample   AdaIN   128 × 128 × 128
ResBlk     Upsample   AdaIN   256 × 256 × 64
Conv1×1    -          -       256 × 256 × 3

Table 5. Generator architecture.

Type       Layer      Activation   Output shape
Shared     Latent z   -            16
Shared     Linear     ReLU         512
Shared     Linear     ReLU         512
Shared     Linear     ReLU         512
Shared     Linear     ReLU         512
Unshared   Linear     ReLU         512
Unshared   Linear     ReLU         512
Unshared   Linear     ReLU         512
Unshared   Linear     -            64

Table 6. Mapping network architecture.

Layer        Resample   Norm   Output shape
Image x      -          -      256 × 256 × 3
Conv1×1      -          -      256 × 256 × 64
ResBlk       AvgPool    -      128 × 128 × 128
ResBlk       AvgPool    -      64 × 64 × 256
ResBlk       AvgPool    -      32 × 32 × 512
ResBlk       AvgPool    -      16 × 16 × 512
ResBlk       AvgPool    -      8 × 8 × 512
ResBlk       AvgPool    -      4 × 4 × 512
LReLU        -          -      4 × 4 × 512
Conv4×4      -          -      1 × 1 × 512
LReLU        -          -      1 × 1 × 512
Reshape      -          -      512
Linear × K   -          -      D × K

Table 7. Style encoder and discriminator architectures. D and K represent the output dimension and the number of domains, respectively.
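The style injection described for the generator can be sketched as a standalone AdaIN layer: a learned affine transformation maps the style code to per-channel scale and shift applied after instance normalization. This is a simplified illustration of the mechanism only, not the authors' full pre-activation residual block.

```python
# Minimal AdaIN layer: the style code modulates a normalized feature map
# through learned per-channel scale and shift.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, style_dim, num_features):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.affine = nn.Linear(style_dim, num_features * 2)  # scale and shift

    def forward(self, x, s):
        h = self.affine(s)                        # (N, 2C)
        gamma, beta = h.chunk(2, dim=1)
        gamma = gamma.unsqueeze(2).unsqueeze(3)   # (N, C, 1, 1)
        beta = beta.unsqueeze(2).unsqueeze(3)
        return (1 + gamma) * self.norm(x) + beta

# Usage: inject a 64-dim style code into a 512-channel feature map.
adain = AdaIN(style_dim=64, num_features=512)
feat = torch.randn(8, 512, 16, 16)
s = torch.randn(8, 64)
out = adain(feat, s)   # (8, 512, 16, 16)
```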
References

[1] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and A. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. In ICML, 2018.
[2] A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool. ComboGAN: Unrestrained scalability for image domain translation. In CVPRW, 2018.
[3] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. In arXiv preprint, 2016.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[5] H. Chang, J. Lu, F. Yu, and A. Finkelstein. PairedCycleGAN: Asymmetric style transfer for applying and removing makeup. In CVPR, 2018.
[6] W. Cho, S. Choi, D. K. Park, I. Shin, and J. Choo. Image-to-image translation via group-wise deep whitening-and-coloring transformation. In CVPR, 2019.
[7] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[8] J. Donahue and K. Simonyan. Large scale adversarial representation learning. In NeurIPS, 2019.
[9] V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. d. Vries, A. Courville, and Y. Bengio. Feature-wise transformations. In Distill, 2018.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In NeurIPS, 2014.
[11] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[15] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
[16] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
[17] L. Hui, X. Li, J. Chen, H. He, and J. Yang. Unsupervised multi-domain image translation with domain-specific encoders/decoders. In ICPR, 2018.
[18] K. Hyunsu, J. Ho Young, P. Eunhyeok, and Y. Sungjoo. Tag2Pix: Line art colorization using text tag with SECat and changing loss. In ICCV, 2019.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial nets. In CVPR, 2017.
[21] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
[22] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[23] H. Kim, M. Kim, D. Seo, J. Kim, H. Park, S. Park, H. Jo, K. Kim, Y. Yang, Y. Kim, et al. NSML: Meet the MLaaS platform with a real-world case study. arXiv preprint arXiv:1810.09957, 2018.
[24] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[27] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[28] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.
[29] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NeurIPS, 2017.
[30] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz. Few-shot unsupervised image-to-image translation. In ICCV, 2019.
[31] M. Lucic, M. Tschannen, M. Ritter, X. Zhai, O. Bachem, and S. Gelly. High-fidelity image generation with fewer labels. In ICML, 2019.
[32] L. Ma, X. Jia, S. Georgoulis, T. Tuytelaars, and L. Van Gool. Exemplar guided unsupervised image-to-image translation with semantic consistency. In ICLR, 2019.
[33] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
[34] Q. Mao, H.-Y. Lee, H.-Y. Tseng, S. Ma, and M.-H. Yang. Mode seeking generative adversarial networks for diverse image synthesis. In CVPR, 2019.
[35] L. Mescheder, S. Nowozin, and A. Geiger. Which training methods for GANs do actually converge? In ICML, 2018.
[36] M. Mirza and S. Osindero. Conditional generative adversarial nets. In arXiv preprint, 2014.
[37] T. Miyato and M. Koyama. cGANs with projection discriminator. In ICLR, 2018.
[38] S. Na, S. Yoo, and J. Choo. MISO: Mutual information loss with stochastic style representations for multimodal image-to-image translation. In arXiv preprint, 2019.
[39] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
[40] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NeurIPSW, 2017.
[42] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[43] N. Sung, M. Kim, H. Jo, Y. Yang, J. Kim, L. Lausen, Y. Kim, G. Lee, D. Kwak, J.-W. Ha, et al. NSML: A machine learning platform that enables you to focus on your models. arXiv preprint arXiv:1712.05902, 2017.
[44] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[45] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. In arXiv preprint, 2016.
[46] X. Wang, L. Bo, and L. Fuxin. Adaptive wing loss for robust face alignment via heatmap regression. In ICCV, 2019.
[47] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In ECCV, 2018.
[48] D. Yang, S. Hong, Y. Jang, T. Zhao, and H. Lee. Diversity-sensitive conditional generative adversarial networks. In ICLR, 2019.
[49] Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The unusual effectiveness of averaging in GAN training. In ICLR, 2019.
[50] S. Yoo, H. Bahng, S. Chung, J. Lee, J. Chang, and J. Choo. Coloring with limited data: Few-shot colorization via memory augmented networks. In CVPR, 2019.
[51] X. Yu, Y. Chen, T. Li, S. Liu, and G. Li. Multi-mapping image-to-image translation via learning disentanglement. In NeurIPS, 2019.
[52] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[53] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[54] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In NeurIPS, 2017.
