Perceptual Adversarial Networks For Image-to-Image Transformation
Abstract—In this paper, we propose Perceptual Adversarial Networks (PAN) for image-to-image transformation. Different from existing application-driven algorithms, PAN provides a generic framework for learning to map input images to desired images (Fig. 1), such as a rainy image to its de-rained counterpart, object edges to photos, semantic labels to a scene image, etc. The proposed PAN consists of two feed-forward convolutional neural networks (CNNs): the image transformation network T and the discriminative network D. Besides the generative adversarial loss widely used in GANs, we propose the perceptual adversarial loss, which undergoes an adversarial training process between the image transformation network T and the hidden layers of the discriminative network D. The hidden layers and the output of the discriminative network D are updated to constantly and automatically discover the discrepancy between the transformed image and the corresponding ground-truth, while the image transformation network T is trained to minimize the discrepancy explored by the discriminative network D. Through integrating the generative adversarial loss and the perceptual adversarial loss, D and T can be trained alternately to solve image-to-image transformation tasks. Experiments on several image-to-image transformation tasks (e.g., image de-raining, image inpainting, etc.) demonstrate the effectiveness of the proposed PAN and its advantages over many existing works.

Index Terms—Generative adversarial networks, image de-raining, image inpainting, image-to-image transformation.

Chaoyue Wang is with the Centre for Artificial Intelligence, School of Software, University of Technology Sydney, Australia, e-mail: [email protected]
Chang Xu and Dacheng Tao are with UBTech Sydney AI Institute, School of IT, FEIT, The University of Sydney, Australia, e-mail: {c.xu, dacheng.tao}@sydney.edu.au
Chaohui Wang is with the Université Paris-Est, LIGM (UMR 8049), CNRS, ENPC, ESIEE Paris, UPEM, Marne-la-Vallée, France, e-mail: [email protected]

Fig. 1. Image-to-image transformation tasks. Many tasks in image processing, computer graphics, and computer vision can be regarded as image-to-image transformation tasks, in which a model transforms an input image into the required output image. We propose Perceptual Adversarial Networks (PAN) to solve image-to-image transformation between paired images. For each pair of images shown, the left one is the input image and the right one is the result produced by the proposed PAN.

I. INTRODUCTION

IMAGE-TO-IMAGE transformations aim to transform an input image into the desired output image, and they arise in many applications in image processing, computer graphics, and computer vision, for example, generating high-quality images from corresponding degraded (e.g., simplified, corrupted, or low-resolution) images, or transforming a color input image into its semantic or geometric representations. Further examples include, but are not limited to, image de-noising [1], image in-painting [2], image super-resolution [3], image colorization [4], and image segmentation [5].

In recent years, convolutional neural networks (CNNs) have been trained in a supervised manner for various image-to-image transformation tasks [6]–[9]. They encode the input image into a hidden representation, which is then decoded to the output image. By penalizing the discrepancy between the output image and the ground-truth image, optimal CNNs can be trained to discover the mapping from the input image to the transformed image of interest. These CNNs are developed with distinct motivations and differ in the design of the loss function.

One of the most straightforward approaches is to evaluate output images pixel-wisely [8], [10], [11], e.g., using least squares or least absolute losses to calculate the distance between the output and ground-truth images in pixel space. Though pixel-wise evaluation can generate reasonable images, the outputs often suffer from unignorable defects such as image blur and image artifacts.

Besides pixel-wise losses, generative adversarial losses have been widely utilized in training image-to-image transformation models. GANs (and cGANs) [12], [13] perform an adversarial training process alternating between identifying and faking, and generative adversarial losses are formulated to evaluate the discrepancy between the generated distribution and the real-world distribution. Experimental results show that generative adversarial losses are beneficial for generating more realistic images. Therefore, many GANs (or cGANs) based works solve image-to-image transformation tasks, resulting in sharper and more realistic transformed images [7], [14]. Meanwhile, some GAN variants [15]–[17] investigated cross-domain image translation and performed image translation in the absence of paired examples. Although these unpaired works achieved reasonable results on some image-to-image translation tasks, they are inappropriate for certain image-to-image problems. For example, in image in-painting tasks, it is difficult to define the domain and formulate the distribution of corrupted images. In addition, the paired information within the training data is beneficial for learning image transformations, but it cannot be utilized by unpaired translation methods.

Moreover, perceptual losses have emerged as a novel measurement for evaluating the discrepancy between high-level perceptual features of the output and ground-truth images [18]–[20]. Hidden layers of a well-trained image classification network (e.g., VGG-16 [21]) are usually employed to extract high-level features (e.g., content or texture) of both the output images and the ground-truth images, and the output image is then encouraged to have high-level features similar to those of the ground-truth image. Recently, perceptual losses were introduced into the aforementioned GANs-based image-to-image transformation frameworks for suppressing artifacts [22] and improving the perceptual quality [23], [24] of the output images. Though integrating perceptual losses into GANs has produced impressive image-to-image transformation results, existing works tend to depend on an external well-trained image classification network (e.g., VGG-Net) outside of the GAN, and ignore the fact that the GAN, and especially its discriminative network, also has the capability of, and the demand for, perceiving the content of images and the difference between images. Moreover, since these external networks are trained on specific classification datasets (e.g., ImageNet), they mainly focus on features that contribute to classification and may perform poorly on some image transformation tasks (e.g., transforming aerial images to maps). Meanwhile, since specific hidden layers of pre-trained networks are employed, it is difficult to explore the difference between generated images and ground-truth images from more points of view.

In this paper, we propose the perceptual adversarial networks (PAN) for image-to-image transformation tasks. Inspired by GANs, PAN is composed of an image transformation network T and a discriminative network D, and both a generative adversarial loss and a perceptual adversarial loss are employed. Firstly, similar to GANs, the generative adversarial loss is utilized to measure the distribution of generated images, i.e., encouraging the generated images to lie in the desired target domain, which usually contributes to producing more visually realistic images. Meanwhile, to comprehensively evaluate transformed images, we devise the perceptual adversarial loss to form dynamic measurements based on the hidden layers of the discriminative network D. Specifically, given hidden layers of the network D, the network T is trained to generate output images that have the same high-level features as the corresponding ground-truth. Once the difference between images measured on the existing hidden layers of the discriminator becomes small, these hidden layers are updated to discover the discrepancy between images from a new point of view. Different from the pixel-wise loss and the conventional perceptual loss, our perceptual adversarial loss undergoes an adversarial training process and aims to discover and decrease the discrepancy under constantly explored dynamic measurements.

In summary, our paper makes the following contributions:
• We propose the perceptual adversarial loss, which utilizes the hidden layers of the discriminative network to evaluate the discrepancy between the output and ground-truth images through an adversarial training process.
• Through combining the perceptual adversarial loss and the generative adversarial loss, we present the PAN for solving image-to-image transformation tasks.
• We evaluate the performance of the PAN on several image-to-image transformation tasks (Fig. 1). Experimental results show that the proposed PAN has a great capability of accomplishing image-to-image transformations.

The rest of the paper is organized as follows: after a brief summary of previous related works in Section II, we illustrate the proposed PAN together with its training losses in Section III. We then present the experimental validation of the whole method in Section IV. Finally, we conclude this paper with some future directions in Section V.

II. BACKGROUND

In this section, we first introduce some representative image-to-image transformation methods based on feed-forward CNNs, and then summarize related works on GANs and perceptual losses.

A. Image-to-image transformation with feed-forward CNNs

Recent years have witnessed a variety of feed-forward CNNs developed for image-to-image transformation tasks. These feed-forward CNNs can be easily trained using the back-propagation algorithm [25], and at test time the transformed images are generated by forward-passing the input image through the well-trained CNNs.

A pixel-wise loss, alone or accompanied by other losses, is employed in a number of image-to-image transformations. Image super-resolution tasks estimate a high-resolution image from its low-resolution counterpart [8], [18], [23]. Image de-raining (or de-snowing) methods attempt to
remove the rain (or snow) streaks brought into pictures by uncontrollable weather conditions [6], [22], [26]. Given a damaged image, image inpainting aims to recover the missing part of the input image [7], [27], [28]. Image semantic segmentation methods produce dense scene labels based on a single input image [29]–[31]. Given an input object image, some feed-forward CNNs were trained to synthesize the image of the same object from a different viewpoint [32]–[34]. Further image-to-image transformation tasks based on feed-forward CNNs include, but are not limited to, image colorization [10] and depth estimation [31], [35].

B. GANs-based works

Generative adversarial networks (GANs) [12] provide an important approach for learning a generative model that generates samples from the real-world data distribution. GANs consist of a generative network and a discriminative network. Through playing a minimax game between these two networks, GANs are trained to generate more and more realistic samples. Owing to their strong performance in learning real-world distributions, a large number of GANs-based works have emerged. Some of these works are committed to training a better generative model, such as InfoGAN [36], Energy-based GAN [37], WGAN(-GP) [38], [39], Progressive GAN [40] and SN-GAN [41]. Other works integrate GANs into their models to improve the performance of classical tasks. For example, PGAN [42] was proposed for small object detection. Specifically, [42] devised a novel perceptual discriminator network, which contains an adversarial branch and a perception branch. The adversarial branch utilizes the adversarial loss to distinguish representations of real and synthesized objects, while the perception branch (or loss) employs a classification loss Lcls and a bounding-box regression loss Lloc to encourage the synthesized 'super-resolved' object representation to retain the same perceptual information as the representation of the input small object. Further works of this kind include, but are not limited to, the PGN [43] for video prediction, the SRGAN [23] for super-resolution, the ID-CGAN for image de-raining [22], the iGAN [44] for interactive applications, the IAN [45] for photo modification, and the Context-Encoder for image in-painting [7]. Most recently, Isola et al. [14] proposed pix2pix-cGANs to perform several image-to-image transformation tasks (also known as image-to-image translations in their work), such as translating semantic labels into street scenes, object edges into pictures, and aerial photos into maps.

Moreover, some GAN variants [15]–[17] investigated cross-domain image translation by exploring the cyclic (or primal-dual) mapping relation between different image domains. Specifically, a primal GAN aims to explore the mapping from source images to target images, while a dual (or inverse) GAN performs the inverse task. These two GANs form a closed loop and allow images from either domain to be translated and then reconstructed. Through combining the GAN loss and a cycle consistency loss (or recovery loss), these works can perform image translation tasks in the absence of paired examples. However, even when paired training data are available, [15]–[17] neglect the paired information between data and often have inferior performance to that of paired methods [14]. Thus, at this stage, it is still important to study paired training, especially for performance-driven situations and applications, such as high-resolution image synthesis [46], photo-realistic image synthesis [23], and real-world image inpainting [47].

C. Perceptual loss

Recently, theoretical analysis and experimental results suggested that the high-level features extracted from a well-trained image classification network have the capability to capture perceptual information from real-world images [18], [48]. Specifically, representations extracted from hidden layers of a well-trained image classification network are beneficial for interpreting the semantics of input images, and the image style distribution can be captured by the Gram matrix of hidden representations. Hence, high-level features extracted from hidden layers of a well-trained classifier are often introduced into image generation models. Dosovitskiy and Brox [19] took Euclidean distances between high-level features of images as a deep perceptual similarity metric to improve the performance of image generation. Johnson et al. [18], Bruna et al. [20] and Ledig et al. [23] used features extracted from a well-trained VGG network to improve the performance of single-image super-resolution. In addition, there are works applying high-level features to image style transfer [18], [48], image de-raining [22] and image view synthesis [49].

III. METHODS

In this section, we introduce the proposed Perceptual Adversarial Networks (PAN) for image-to-image transformation tasks. Firstly, we explain the generative and perceptual adversarial losses, respectively. Then, we give the whole framework of the proposed PAN. Finally, we illustrate the details of the training procedure and network architectures.

Fig. 2. PAN framework. PAN consists of an image transformation network T and a discriminative network D. The image transformation network T is trained to synthesize the transformed images given the input images. It is composed of a stack of Convolution-BatchNorm-LeakyReLU encoding layers and Deconvolution-BatchNorm-ReLU decoding layers, and skip-connections are used between mirrored layers. The discriminative network D is also a CNN that consists of Convolution-BatchNorm-LeakyReLU layers. Hidden layers of the network D are utilized to evaluate the perceptual adversarial loss, and the output of the network D is used to distinguish transformed images from real-world images.

A. Generative adversarial loss

We begin with the generative adversarial loss in vanilla GANs. A generative network G is trained to map samples from a noise distribution p_z to the real-world data distribution p_{data} through playing a minimax game with a discriminative network D. In the training procedure, the discriminative network D aims to distinguish the real samples y \sim p_{data} from the generated samples G(z). On the contrary, the generative network G tries to confuse the discriminative network D by generating increasingly realistic samples. This minimax game can be formulated as:

\min_G \max_D \; \mathbb{E}_{y \sim p_{data}}[\log D(y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]    (1)

Nowadays, GANs-based models have shown a strong capability of learning generative models, especially for image generation [36], [38], [43]. We therefore adopt the GAN learning strategy to solve image-to-image transformation tasks as well. As shown in Fig. 2, the image transformation network T is used to generate the transformed image T(x) given the
input image x \in \mathcal{X}. Meanwhile, each input image x has a corresponding ground-truth image y. We suppose that all target images y \in \mathcal{Y} obey the distribution p_{real}, and the transformed image T(x) is encouraged to have the same distribution as that of the target images y, i.e., T(x) \sim p_{real}. To realize the generative adversarial learning strategy, a discriminative network D is additionally introduced, and the generative adversarial loss can be written as:

\min_T \max_D V_{D,T} = \mathbb{E}_{y \in \mathcal{Y}}[\log D(y)] + \mathbb{E}_{x \in \mathcal{X}}[\log(1 - D(T(x)))]    (2)

The generative adversarial loss acts as a statistical measurement to penalize the discrepancy between the distributions of the transformed images and the ground-truth images.

B. Perceptual adversarial loss

Different from vanilla GANs that randomly generate samples from the data distribution p_{data}, our goal is to infer the transformed image according to the input image. Therefore, it is a further step beyond GANs to explore the mapping from the input image to its ground truth.

As mentioned in Sections I and II, pixel-wise losses and perceptual losses are widely used in existing works for generating images towards the ground truth. Pixel-wise losses penalize the discrepancy in pixel space, but often produce blurry results [7], [9]. Perceptual losses explore the discrepancy between high-dimensional representations of images extracted from a well-trained classifier, e.g., the VGG net trained on the ImageNet dataset [21]. Although the hidden layers of well-trained classifiers have been experimentally validated to map images from pixel space to high-level feature spaces, how to extract features that are effective for image-to-image transformation tasks from hidden layers has not been thoroughly discussed.

Here, we employ hidden layers of the discriminative network D to evaluate the perceptual adversarial loss between transformed images and ground-truth images. In our experiments, given the training samples \{(x_i, y_i) \in (\mathcal{X} \times \mathcal{Y})\}_{i=1}^{N}, the least absolute loss is employed to calculate the discrepancy of the high-dimensional representations on the hidden layers of the network D, e.g.,

\ell^{D,j}_{percep} = \frac{1}{N} \sum_{i=1}^{N} \| d_j(y_i) - d_j(T(x_i)) \|    (3)

where d_j(\cdot) is the image representation on the j-th hidden layer of the discriminative network D, and \ell^{D,j}_{percep} calculates the discrepancy measured by the j-th hidden layer of D.

Similar to what has been done with the Energy-Based GAN [37], we use two different losses, one (L_T) to train the image transformation network T, and the other (L_D) to train the hidden layers of the discriminative network D. Therefore, the image transformation network T and the hidden layers of the discriminative network D play a non-zero-sum game and form the perceptual adversarial loss. Formally, the perceptual adversarial loss L_T for the image transformation network T can be written as:

L_T = \sum_{j=1}^{F} \lambda_j \ell^{D,j}_{percep}    (4)

and, given a positive margin m, the loss L_D for the hidden layers of the discriminative network D is defined as:

L_D = [m - L_T]_+ = \Big[ m - \sum_{j=1}^{F} \lambda_j \ell^{D,j}_{percep} \Big]_+    (5)

where [\cdot]_+ = \max(0, \cdot), and \{\lambda_j\}_{j=1}^{F} are hyper-parameters balancing the influence of the F different hidden layers.
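To make Eqs. (3)–(5) concrete, the following is a minimal sketch of how the perceptual adversarial terms could be computed. It is written in PyTorch purely for illustration (the authors' implementation used Theano and is not shown here); the helper names and the assumption that D exposes a list of hidden-layer activations are ours, and the mean reduction of the L1 loss stands in for the 1/N batch average in Eq. (3).

import torch
import torch.nn.functional as F

def perceptual_terms(feats_real, feats_fake):
    # Per-layer least-absolute discrepancies l_percep^{D,j} (Eq. 3).
    # feats_real / feats_fake: lists of hidden-layer activations of D for the
    # ground-truth batch y and the transformed batch T(x), ordered by depth.
    return [F.l1_loss(ff, fr) for fr, ff in zip(feats_real, feats_fake)]

def perceptual_adversarial_losses(feats_real, feats_fake, lambdas, margin):
    # L_T (Eq. 4) and the hinged L_D = [m - L_T]_+ (Eq. 5).
    terms = perceptual_terms(feats_real, feats_fake)
    l_T = sum(lam * t for lam, t in zip(lambdas, terms))   # weighted sum over the F chosen layers
    l_D = torch.clamp(margin - l_T, min=0.0)               # hinge with the positive margin m
    return l_T, l_D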
By minimizing the perceptual adversarial loss function L_T with respect to the parameters of T, we encourage the network T to generate images T(x) that have high-level features similar to those of their ground-truth y on the hidden layers. If the weighted sum of the discrepancies between transformed images and ground-truth images on the different hidden layers is less than the positive margin m, the loss function L_D will upgrade the discriminative network D to find new latent feature spaces that preserve the discrepancy between the transformed images and their ground-truth. Therefore, based on the perceptual adversarial loss, the discrepancy between the transformed and ground-truth images can be constantly explored and exploited.

Compared to our perceptual adversarial loss, which measures the difference between the transformed image and the ground-truth image in the hidden layers of the discriminator, the conditional GAN loss indicates whether the transformed image forms an appropriate image pair with the input image, and can also exploit the supervised information of paired images during the training process. However, the two losses minimize the high-level feature differences explored by the discriminator in different ways. The perceptual adversarial loss directly penalizes the high-level representations of transformed images and ground-truth images to be as similar as possible. In contrast, the conditional GAN loss aims to model the mapping relation from the input x to its output y_real and encourages the generated image pairs (x, T(x)) to obey the same conditional distribution P_{real}(y|x). Compared to the conditional GAN loss, which only indirectly guides the generated images T(x) to share the same features as the corresponding ground-truth y_real, our perceptual adversarial loss directly measures and minimizes the differences between generated images and ground-truth images from different perspectives.

C. The perceptual adversarial networks

Based on the aforementioned generative adversarial loss (Eq. 2) and perceptual adversarial loss (Eq. 4 and Eq. 5), we develop the PAN framework, which consists of an image transformation network T and a discriminative network D. These two networks are trained alternately to perform an adversarial learning process, and the loss functions of the image transformation network, J_T, and of the discriminative network, J_D, are formally defined as:

J_T = \theta V_{D,T} + L_T
J_D = -\theta V_{D,T} + L_D = -\theta V_{D,T} + [m - L_T]_+    (6)

where \theta is the hyper-parameter balancing the influence of the generative adversarial loss and the perceptual adversarial loss. When L_T < m, minimizing J_D with respect to the parameters of D is consistent with maximizing J_T. Otherwise, when L_T \geq m, the second term of J_D has zero gradient because of the positive margin m. In general, the discriminative network D aims to distinguish the transformed image T(x) from the ground-truth image y from both the statistical (the first term of J_D) and the dynamic perceptual (the second term of J_D) aspects. On the other hand, the image transformation network T is trained to generate increasingly better images by reducing the discrepancy between the output and ground-truth images.

D. Network architectures

Fig. 2 illustrates the framework of the proposed PAN, which is composed of two CNNs, i.e., the image transformation network T and the discriminative network D.

1) Image transformation network T: The image transformation network T is designed to generate the transformed image given the input image. Following the network architectures in [14], [50], the network T first encodes the input image into a high-dimensional representation using a stack of Convolution-BatchNorm-LeakyReLU layers, and the output image is then decoded by the following Deconvolution-BatchNorm-ReLU layers¹. Note that the output layer of the network T does not use batchnorm and replaces the ReLU with a Tanh activation. Moreover, skip-connections are used to connect mirrored layers in the encoder and decoder stacks. More details of the transformation network T are listed in Table II. The same architecture of the network T is used for …

¹The deconvolution layer utilized in our framework is the transposed convolution layer used in [51], [52].
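The following sketch shows how one alternating PAN update could implement Eq. (6), reusing the perceptual_adversarial_losses helper from the earlier sketch. It is an illustrative PyTorch rendering under our own assumptions (D returns its hidden activations together with a real/fake probability, and binary log-likelihood terms realize V_{D,T}); it is not the authors' Theano code, and Eq. (2) is used in its saturating form exactly as written in the paper.

import torch

def train_step(T, D, x, y, opt_T, opt_D, theta, lambdas, margin, t_updates=3):
    # One alternating PAN step implementing Eq. 6: one update of D, then several of T.
    eps = 1e-8  # numerical guard inside the logarithms

    # --- update D: minimize J_D = -theta * V_{D,T} + [m - L_T]_+ ---
    with torch.no_grad():
        fake = T(x)
    feats_real, p_real = D(y)          # assumed D interface: (hidden activations, real/fake probability)
    feats_fake, p_fake = D(fake)
    v = torch.log(p_real + eps).mean() + torch.log(1.0 - p_fake + eps).mean()
    _, l_D = perceptual_adversarial_losses(feats_real, feats_fake, lambdas, margin)
    j_D = -theta * v + l_D
    opt_D.zero_grad(); j_D.backward(); opt_D.step()

    # --- update T: minimize J_T = theta * V_{D,T} + L_T (several T updates per D update) ---
    for _ in range(t_updates):
        fake = T(x)
        with torch.no_grad():
            feats_real, _ = D(y)       # ground-truth features act as fixed targets here
        feats_fake, p_fake = D(fake)
        v_T = torch.log(1.0 - p_fake + eps).mean()   # only the term of V_{D,T} that depends on T
        l_T, _ = perceptual_adversarial_losses(feats_real, feats_fake, lambdas, margin)
        j_T = theta * v_T + l_T
        opt_T.zero_grad(); j_T.backward(); opt_T.step()

As reported later in Section IV-A, the authors use the Adam solver with a learning rate of 0.0002 and a first momentum of 0.5, three T updates per D update, theta = 1 and (lambda_1, ..., lambda_4) = (5, 1.5, 1.5, 1); the margin m is a further hyper-parameter.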
Fig. 4. Transforming semantic labels to cityscapes images using the perceptual adversarial loss, where a different single hidden layer is utilized for each experiment (columns, left to right: Input, \ell^{D,1}_{percep}, \ell^{D,2}_{percep}, \ell^{D,3}_{percep}, \ell^{D,4}_{percep}, Ground-truth). For better visual comparison, zoomed versions of specific regions-of-interest are shown below the test images. For higher layers, the transformed images look sharper, but less color information is preserved.
TABLE I
The architecture of the discriminative network.

TABLE II
The architecture of the image transformation network.
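Since the layer-by-layer configurations of Tables I and II are not reproduced in this version of the text, the skeleton below only mirrors the structural description in Section III-D and the Fig. 2 caption: an encoder of Convolution-BatchNorm-LeakyReLU blocks, a decoder of transposed-convolution (Deconvolution)-BatchNorm-ReLU blocks with skip connections between mirrored layers and a batchnorm-free Tanh output, and a discriminator stack of Convolution-BatchNorm-LeakyReLU layers whose intermediate activations feed the perceptual adversarial loss. The depths, channel widths and kernel sizes are placeholders of our own choosing, not the authors' settings; PyTorch is used for illustration only.

import torch
import torch.nn as nn

def enc(c_in, c_out):   # Convolution-BatchNorm-LeakyReLU encoding block
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

def dec(c_in, c_out):   # Deconvolution(transposed conv)-BatchNorm-ReLU decoding block
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU())

class TransformNet(nn.Module):
    # Skeleton of T: encoder-decoder with skip connections between mirrored layers.
    def __init__(self, ch=64):
        super().__init__()
        self.e1, self.e2, self.e3 = enc(3, ch), enc(ch, ch * 2), enc(ch * 2, ch * 4)
        self.d3 = dec(ch * 4, ch * 2)
        self.d2 = dec(ch * 4, ch)          # input channels doubled by the skip connection
        # output layer: no batchnorm, Tanh activation
        self.out = nn.Sequential(nn.ConvTranspose2d(ch * 2, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        h1 = self.e1(x)
        h2 = self.e2(h1)
        h3 = self.e3(h2)
        u2 = self.d3(h3)
        u1 = self.d2(torch.cat([u2, h2], dim=1))      # skip from the mirrored encoder layer
        return self.out(torch.cat([u1, h1], dim=1))   # skip from the first encoder layer

class Discriminator(nn.Module):
    # Skeleton of D: Conv-BatchNorm-LeakyReLU stack returning hidden features and a real/fake score.
    def __init__(self, ch=64):
        super().__init__()
        self.blocks = nn.ModuleList([enc(3, ch), enc(ch, ch * 2),
                                     enc(ch * 2, ch * 4), enc(ch * 4, ch * 8)])
        self.head = nn.Sequential(nn.Conv2d(ch * 8, 1, 4), nn.Sigmoid())

    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)                 # hidden layers used for the perceptual adversarial loss
        return feats, self.head(x).flatten(1).mean(dim=1)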
IV. EXPERIMENTS

… image processing (e.g., image de-raining), computer vision (e.g., semantic segmentation), and computer graphics (e.g., image generation).

A. Experimental setup

For fair comparisons, we adopted the same settings as existing works, and reported experimental results using several evaluation metrics. The tasks and data settings include:
• Single image de-raining, on the dataset provided by ID-CGAN [22].
• Image inpainting, on a subset of ILSVRC'12 (same as the context-encoder [7]).
• Semantic labels↔images, on the Cityscapes dataset [53] (same as pix2pix [14]).
• Edges→images, on the dataset created by pix2pix [14]. The original data is from [44] and [54], and the HED edge detector [55] was used to extract edges.
• Aerial→map, on the dataset from pix2pix [14].

Furthermore, all experiments were trained on Nvidia Titan-X GPUs using Theano [56]. Given the generative and perceptual adversarial losses, we alternately updated the image transformation network T and the discriminative network D. Specifically, the Adam solver [57] with a learning rate of 0.0002 and a first momentum of 0.5 was used in network training. After one update of the discriminative network D, the image transformation network T is updated three times. Hyper-parameters θ = 1, λ1 = 5, λ2 = 1.5, λ3 = 1.5, λ4 = 1, and a batch size of 4 were used for all tasks. Since the dataset sizes vary greatly across tasks, the number of training epochs was set accordingly for each task. Overall, the number of training iterations was around 100k.

B. Evaluation metrics

To illustrate the performance on image-to-image transformation tasks, we conducted qualitative and quantitative experiments to evaluate the transformed images. For the qualitative experiments, we directly present the input and transformed images. Meanwhile, we used quantitative measures to evaluate the performance over the test sets, such as the Peak Signal to Noise Ratio (PSNR), the Structural Similarity Index (SSIM) [58], the Universal Quality Index (UQI) [59] and the Visual Information Fidelity (VIF) [60].

C. Analysis of the loss functions

As discussed in Sections I and II, the design of the loss function largely influences the performance of image-to-image transformation. Firstly, the pixel-wise loss (using the least squares loss) is widely used in various image-to-image transformation works [33], [61], [62]. Then, a joint loss integrating the pixel-wise loss and the conditional generative adversarial loss was proposed to synthesize more realistic transformed images [7], [14]. Most recently, by introducing the perceptual loss, i.e., penalizing the discrepancy between high-level features extracted by a well-trained classifier, the performance of some image-to-image transformation tasks was further enhanced [18], [22], [23]. Different from these existing methods, the proposed PAN loss integrates the generative adversarial loss and the perceptual adversarial loss to train image-to-image transformation networks. Here, we compare the performance of the proposed perceptual adversarial loss with those of existing losses. For a fair comparison, we adopted the same image transformation network and data settings from ID-CGAN [22], and used combinations of different losses to perform the image de-raining (de-snowing) task. The quantitative results over the synthetic test set are shown in Table III, while the qualitative results on real-world images are shown in Fig. 3. From both quantitative and qualitative comparisons, we find that using only the pixel-wise loss (least squares loss) achieved the worst result, and many snow-streaks remain in the transformed images (Fig. 3). Through introducing the cGAN loss, the de-snowing performance was indeed improved, but artifacts can be observed (Fig. 3) and the PSNR performance dropped (Table III). Combining the pixel-wise, cGAN and perceptual (VGG-16 [21]) losses together, i.e., using the loss function of ID-CGAN [22], the quality of the transformed images was further improved on both visual observation and quantitative measurements. However, from Fig. 3, we observe that the transformed images exhibit some color distortion compared to the input images. The proposed PAN loss (i.e., combining the perceptual adversarial loss and the original GAN loss) not only removed most streaks without color distortion, but also achieved much better performance on the quantitative measurements. Moreover, we evaluated the performance of combining the conditional GAN loss and the perceptual adversarial loss. Compared with using the cGAN …
TABLE III
De-raining

                 PSNR (dB)   SSIM     UQI      VIF
L2               22.77       0.7959   0.6261   0.3570
cGAN             21.87       0.7306   0.5810   0.3173
L2+cGAN          22.19       0.8083   0.6278   0.3640
ID-CGAN          22.91       0.8198   0.6473   0.3885
PAN              23.35       0.8303   0.6644   0.4050
PA Loss+cGAN     23.22       0.8078   0.6375   0.3904

TABLE IV
In-painting

                 PSNR (dB)   SSIM     UQI      VIF
Context-Encoder  21.74       0.8242   0.7828   0.5818
PAN              21.85       0.8307   0.7956   0.6104

To compare with the Context-Encoder, we applied PAN to inpaint images whose central regions were missing. As illustrated in Section IV-A, 100k images were randomly selected from the ILSVRC'12 dataset to train both the Context-Encoder and PAN, and 50k images from the ILSVRC'12 validation set were used for testing. Moreover, since the image inpainting models are asked to generate the missing region of the input image instead of the whole image, we employ the image transformation network architecture from [7].

In Fig. 7, we report some example results on the test set. For each input image, the missing part mixes foreground objects and background. From the inpainted results, we find that the proposed PAN performed better at understanding the surroundings and estimating the missing part with semantic content. In contrast, the context-encoder tended to use the nearest region (usually the background) to inpaint the missing part, whereas PAN can synthesize more details in the missing parts. Last but not least, in Table IV, we report the quantitative results calculated over all 50k test images, which also demonstrate that the proposed PAN achieves better performance.

2) ID-CGAN: The image de-raining task aims to remove rain streaks in a given rainy image. Considering the unpredictable weather conditions, single image de-raining (de-snowing) is a challenging image-to-image transformation task. Most recently, the Image De-raining Conditional Generative Adversarial Network (ID-CGAN) was proposed to tackle the image de-raining problem. Through combining the pixel-wise (least squares), conditional generative adversarial, and perceptual (VGG-16) losses, ID-CGAN achieved the state-of-the-art performance on single image de-raining.

We attempted to solve image de-raining with the proposed PAN using the same setting as ID-CGAN. Since there is a lack of large-scale datasets consisting of paired rainy and de-rained images, we resort to the synthesized training set of 700 images from [22]. Zhang et al. [22] provided 100 synthetic images and 50 real-world rainy images for testing. Since the ground-truth is available for the synthetic test images, we calculated and report the quantitative results in Table III. Moreover, we tested both ID-CGAN and PAN on real-world rainy images, and the results are shown in Fig. 6. For better visual comparison, we zoomed in on specific regions-of-interest below the test images.

From Fig. 6, we find that both ID-CGAN and PAN achieved great performance on single image de-raining. However, by
observing the zoomed regions, PAN removed more rain streaks with less color distortion. Additionally, as shown in Table III, for the synthetic test images, the de-rained results of PAN are much more similar to the corresponding ground-truth than those of ID-CGAN. Why can the proposed PAN achieve better results when dealing with uncontrollable weather conditions? One possible reason is that ID-CGAN utilizes a well-trained classifier to extract the high-level features of the output and ground-truth images and penalizes the discrepancy between them (i.e., the perceptual loss). The high-level features extracted by the well-trained classifier usually focus on content information, and may struggle to capture other image information, such as color. In contrast, the proposed PAN uses the perceptual adversarial loss, which aims to continually and automatically measure the discrepancy between the output and ground-truth images. This different training strategy may help the model learn a better mapping from the input to the output images, resulting in better performance.

3) Pix2pix-cGAN: Isola et al. [14] utilized cGANs as a general-purpose solution to image-to-image translation (transformation) tasks. In their work, the pixel-wise loss (least absolute loss) and a Patch-cGAN loss are employed to solve a series of image-to-image transformation tasks, such as translating object edges to photos, semantic labels to scene images, gray images to color images, etc. The proposed PAN can also solve the image-to-image transformation tasks performed by pix2pix-cGAN. Here, we implemented some of them and compared with pix2pix-cGAN.

Firstly, we attempted to translate semantic labels to cityscapes images. Unlike the image segmentation problem, this inverse translation is an ill-posed problem, and the image transformation network has to learn prior knowledge from the training data. As shown in Fig. 8, given semantic labels as input images, we list the transformed cityscapes images of pix2pix-cGAN and PAN, with the corresponding ground-truth on the right side. From the comparison, we find that the proposed PAN captured more details with less deformation, which makes the synthesized images look more realistic. Moreover, the quantitative comparison in Table V also indicates that PAN achieves much better performance.

Generating real-world objects from their corresponding edges is also a kind of image-to-image transformation task. Based on the dataset provided by [14], we trained PAN to translate edges to object photos, and compared its performance with that of pix2pix-cGAN. Given edges as input, Fig. 9 presents shoes and handbags synthesized by pix2pix-cGAN and PAN. At the same time, the quantitative results over the test set are shown in Table V. Observing the generated object …
Fig. 11. Introducing the proposed perceptual adversarial loss into the CycleGAN framework to perform unpaired image translation (horses↔zebras). Given horse images as input, the trained models (both CycleGAN and 'CycleGAN + perceptual adversarial loss') aim to generate corresponding zebra images.
In this section, we perform the unpaired image translation task, horse↔zebra, and qualitatively report some generated results. For a fair comparison, we adopted the default settings (and code) from CycleGAN, and utilized the 3rd and 4th hidden layers of the discriminator to measure the perceptual adversarial loss. In addition, the hyper-parameters λ3 = λ4 = 0.5 and m = 0.1, and a batch size of 4, were used. As shown in Fig. 11, by considering the perceptual similarity in unpaired image translation, the model performance was more or less improved. Although this work mainly focuses on exploring the perceptual features between paired images (the generated image and its ground-truth), we demonstrate the possibility of improving the performance of unpaired image translation by measuring the perceptual similarity between different image domains.

V. CONCLUSION

In this paper, we proposed the perceptual adversarial networks (PAN) for image-to-image transformation tasks. As a generic framework for learning the mapping relationship between paired images, PAN combines the generative adversarial loss and the proposed perceptual adversarial loss as a novel training loss function. According to this loss function, a discriminative network D is trained to continually and automatically explore the discrepancy between the transformed images and the corresponding ground-truth images. Simultaneously, an image transformation network T is trained to narrow the discrepancy explored by the discriminative network D. Through the adversarial training process, these two networks are updated alternately. Finally, experimental results on several image-to-image transformation tasks demonstrated that the proposed PAN framework is effective and promising for practical image-to-image transformation applications.

REFERENCES

[1] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736–3745, 2006.
[2] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., 2000, pp. 417–424.
[3] K. Nasrollahi and T. B. Moeslund, "Super-resolution: a comprehensive survey," Machine Vision and Applications, vol. 25, no. 6, pp. 1423–1468, 2014.
[4] Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y.-Q. Xu, and H.-Y. Shum, "Natural image colorization," in Proceedings of the 18th Eurographics Conference on Rendering Techniques. Eurographics Association, 2007, pp. 309–320.
[5] M. W. Khan, "A survey: Image segmentation techniques," International Journal of Future Computer and Communication, vol. 3, no. 2, p. 89, 2014.
[6] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley, "Clearing the skies: A deep network architecture for single-image rain removal," IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2944–2956, 2017.
[7] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
[9] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in European Conference on Computer Vision (ECCV), 2016.
[10] Z. Cheng, Q. Yang, and B. Sheng, "Deep colorization," in The IEEE International Conference on Computer Vision (ICCV), 2015.
[11] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[13] T. Miyato and M. Koyama, "cGANs with projection discriminator," in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," 2017.
[15] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in The IEEE International Conference on Computer Vision (ICCV), 2017.
[16] Z. Yi, H. Zhang, P. Tan, and M. Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[17] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 70, 2017, pp. 1857–1865.
[18] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 694–711.
[19] A. Dosovitskiy and T. Brox, "Generating images with perceptual similarity metrics based on deep networks," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 658–666.
[20] J. Bruna, P. Sprechmann, and Y. LeCun, "Super-resolution with deep convolutional sufficient statistics," in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[21] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[22] H. Zhang, V. Sindagi, and V. M. Patel, "Image de-raining using a conditional generative adversarial network," arXiv preprint arXiv:1701.05957, 2017.
[23] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[24] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[25] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, no. 3, p. 1, 1988.
[26] D. Eigen, D. Krishnan, and R. Fergus, "Restoring an image taken through a window covered with dirt or rain," in The IEEE International Conference on Computer Vision (ICCV), 2013.
[27] T. Ruzic and A. Pizurica, "Context-aware patch-based image inpainting using Markov random field modeling," IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 444–456, 2015.
[28] C. Qin, C.-C. Chang, and Y.-P. Chiu, "A novel joint data-hiding and compression scheme based on SMVQ and image inpainting," IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 969–978, 2014.
[29] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[30] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in The IEEE International Conference on Computer Vision (ICCV), 2015.
[31] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in The IEEE International Conference on Computer Vision (ICCV), 2015.
[32] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee, "Weakly-supervised disentangling with recurrent transformations for 3D view synthesis," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1099–1107.
[33] M. Tatarchenko, A. Dosovitskiy, and T. Brox, "Multi-view 3D models from single images with a convolutional network," in European Conference on Computer Vision (ECCV), 2016, pp. 322–337.
[34] C. Wang, C. Wang, C. Xu, and D. Tao, "Tag disentangled generative adversarial network for object image re-rendering," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 2901–2907.
[35] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2366–2374.
[36] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2172–2180.
[37] J. Zhao, M. Mathieu, and Y. LeCun, "Energy-based generative adversarial network," in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[38] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
[39] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems (NIPS), 2017.
[40] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[41] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[42] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, "Perceptual generative adversarial networks for small object detection," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[43] W. Lotter, G. Kreiman, and D. Cox, "Unsupervised learning of visual structure using predictive generative networks," in Proceedings of the International Conference on Learning Representations (ICLR) Workshop, 2016.
[44] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, "Generative visual manipulation on the natural image manifold," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 597–613.
[45] A. Brock, T. Lim, J. Ritchie, and N. Weston, "Neural photo editing with introspective adversarial networks," in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[46] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[47] Q. Chen and V. Koltun, "Photographic image synthesis with cascaded refinement networks," in The IEEE International Conference on Computer Vision (ICCV), 2017.
[48] L. A. Gatys, A. S. Ecker, and M. Bethge, "A neural algorithm of artistic style," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[49] E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg, "Transformation-grounded image generation network for novel 3D view synthesis," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[50] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[51] M. D. Zeiler, G. W. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in The IEEE International Conference on Computer Vision (ICCV), 2011.
[52] V. Dumoulin and F. Visin, "A guide to convolution arithmetic for deep learning," arXiv preprint arXiv:1603.07285, 2016.
[53] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[54] A. Yu and K. Grauman, "Fine-grained visual comparisons with local learning," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[55] S. Xie and Z. Tu, "Holistically-nested edge detection," in The IEEE International Conference on Computer Vision (ICCV), 2015.
[56] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: A CPU and GPU math compiler in Python," in Proc. 9th Python in Science Conference, 2010, pp. 1–7.
[57] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[58] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[59] Z. Wang and A. C. Bovik, "A universal image quality index," IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81–84, March 2002.
[60] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006.
[61] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision (ECCV), 2014.
[62] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.

Chaoyue Wang received his bachelor degree from Tianjin University (TJU), Tianjin, China, in 2014. Currently, he is working toward the Ph.D. degree with the Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney (UTS). His research interests mainly include machine learning and computer vision.

Chang Xu is Lecturer in Machine Learning and Computer Vision at the School of Information Technologies, The University of Sydney. He obtained a Bachelor of Engineering from Tianjin University, China, and a Ph.D. degree from Peking University, China. While pursuing his Ph.D. degree, Chang received fellowships from IBM and Baidu. His research interests lie in machine learning, data mining algorithms and related applications in artificial intelligence and computer vision, including multi-view learning, multi-label learning, visual search and face recognition. His research outcomes have been widely published in prestigious journals and top-tier conferences.

Chaohui Wang received his Ph.D. in applied mathematics and computer vision from Ecole Centrale Paris, Châtenay-Malabry, France, in 2011. After that, he was a postdoctoral researcher at the Vision Lab of the University of California, Los Angeles, CA, USA (from January 2012 to March 2013), and at the Perceiving Systems Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany (from March 2013 to August 2014). Since September 2014, he has been Maître de Conférences at Université Paris-Est, Marne-la-Vallée, France. His research interests include computer vision, machine learning, image processing, and medical image analysis.

Dacheng Tao (F'15) is Professor of Computer Science and ARC Future Fellow in the School of Information Technologies and the Faculty of Engineering and Information Technologies, and the Inaugural Director of the UBTech Sydney Artificial Intelligence Institute, at The University of Sydney. He mainly applies statistics and mathematics to Artificial Intelligence and Data Science. His research interests spread across computer vision, data science, image processing, machine learning, and video surveillance. His research results have been expounded in one monograph and 500+ publications in prestigious journals and prominent conferences, such as IEEE T-PAMI, T-NNLS, T-IP, JMLR, IJCV, NIPS, CIKM, ICML, CVPR, ICCV, ECCV, AISTATS, ICDM, and ACM SIGKDD, with several best paper awards, such as the best theory/algorithm paper runner-up award at IEEE ICDM'07, the best student paper award at IEEE ICDM'13, and the 2014 ICDM 10-year highest-impact paper award. He received the 2015 Australian Scopus-Eureka Prize, the 2015 ACS Gold Disruptor Award and the 2015 UTS Vice-Chancellor's Medal for Exceptional Research. He is a Fellow of the IEEE, OSA, IAPR and SPIE.