Style Transfer by Relaxed Optimal Transport and Self-Similarity

Nicholas Kolkin¹    Jason Salavon²    Gregory Shakhnarovich¹

¹Toyota Technological Institute at Chicago    ²University of Chicago

Abstract

The goal of style transfer algorithms is to render the content of one image using the style of another. We propose Style Transfer by Relaxed Optimal Transport and Self-Similarity (STROTSS), a new optimization-based style transfer algorithm. We extend our method to allow user-specified point-to-point or region-to-region control over visual similarity between the style image and the output. Such guidance can be used either to achieve a particular visual effect or to correct errors made by unconstrained style transfer. In order to quantitatively compare our method to prior work, we conduct a large-scale user study designed to assess the style-content tradeoff across settings in style transfer algorithms. Our results indicate that for any desired level of content preservation, our method provides higher quality stylization than prior work.

1 Introduction

One of the main challenges of style transfer is formalizing 'content' and 'style', terms which evoke strong intuitions but are hard to even define semantically. We propose formulations of each term which are novel in the domain of style transfer, but which have a long history of successful application in computer vision more broadly. We hope that related efforts to refine the definitions of both style and content will eventually lead to more robust recognition systems, but in this work we focus solely on their utility for style transfer.

We define style as a distribution over features extracted by a deep neural network, and measure the distance between these distributions using an efficient approximation of the Earth Mover's Distance initially proposed in the Natural Language Processing community [14]. This definition of style similarity is not only well motivated statistically, but also intuitive. The goal of style transfer is to deploy the visual attributes of the style image onto the content image with minimum distortion to the content's underlying layout and semantics; in essence, to 'optimally transport' these visual attributes.

Our definition of content is inspired by the concept of self-similarity, and the notion that the human perceptual system is robust because it identifies objects based on their appearance relative to their surroundings, rather than their absolute appearance. Defining content similarity in this way disconnects the term somewhat from precise pixel values, making it easier to satisfy than the definitions used in prior work. This allows the output of our algorithm to maintain the perceived semantics and spatial layout of the content image, while drastically differing in pixel space.

Figure 1: Examples of our output for unconstrained (left) and guided (right) style transfer. Images are arranged in order of content, output, style. Below the content and style image on the right we visualize the user-defined region-to-region guidance used to generate the output in the middle.

Figure 2: Examples of the effect of different content images on the same style, and vice versa (columns labeled Picasso, Dürer, Matisse, Kandinsky, Klimt).

To increase the utility of style transfer as an artistic tool, it is important that users can easily and intuitively control the algorithm's output. We extend our formulation to allow region-to-region constraints on style transfer (e.g., ensuring that hair in the content image is stylized using clouds in the style image) and point-to-point constraints (e.g., ensuring that the eye in the content image is stylized in the same way as the eye in a painting).

We quantitatively compare our method to prior work using human evaluations gathered from 662 workers on Amazon Mechanical Turk (AMT). Workers evaluated content preservation and stylization quality separately. Workers were shown two algorithms' output for the same inputs in addition to either the content or style input, then asked which output has more similar content or style, respectively, to the displayed input. In this way we are able to quantify the performance of each algorithm along both axes. By evaluating our method and prior work for multiple hyper-parameter settings, we also measure the trade-off within each method between stylization and content preservation as hyper-parameters change. Our results indicate that for any desired level of content preservation, our method provides higher quality stylization than prior work.

2 Methods

Like the original Neural Style Transfer algorithm proposed by Gatys et al. [4], our method takes two inputs, a style image I_S and a content image I_C, and uses the gradient descent variant RMSprop [11] to minimize our proposed objective function (Equation 1) with respect to the output image X:

L(X, I_C, I_S) = \frac{\alpha \ell_C + \ell_m + \ell_r + \frac{1}{\alpha} \ell_p}{2 + \alpha + \frac{1}{\alpha}}    (1)

We describe the content term of our loss, αℓ_C, in Section 2.3, and the style term, ℓ_m + ℓ_r + (1/α)ℓ_p, in Section 2.2. The hyper-parameter α represents the relative importance of content preservation to stylization. Our method is iterative; let X^(t) refer to the stylized output image at timestep t. We describe our initialization X^(0) in Section 2.5.

2.1 Feature Extraction

Both our style and content loss terms rely upon extracting a rich feature representation from an arbitrary spatial location. In this work we use hypercolumns [21, 8] extracted from a subset of layers of VGG16 trained on ImageNet [26]. Let Φ(X)_i be the tensor of feature activations extracted from input image X by layer i of network Φ. Given the set of layer indices l_1, ..., l_L, we use bilinear upsampling to match the spatial dimensions of Φ(X)_{l_1}, ..., Φ(X)_{l_L} to those of the original image X, then concatenate all such tensors along the feature dimension. This yields a hypercolumn at each pixel that includes features capturing low-level edge and color information, mid-level textures, and high-level semantics [27]. For all experiments we use all convolutional layers of VGG16 except layers 9, 10, 12, and 13, which we exclude because of memory constraints.
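As an illustration, the following is a minimal sketch of hypercolumn extraction using a torchvision VGG16; the specific indices in capture_idx are an assumed approximation of the layer subset described above, and newer torchvision versions use the weights= argument rather than pretrained=True.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal sketch of hypercolumn extraction (Sec. 2.1). The layer subset below is an
# assumed mapping of "all conv layers except 9, 10, 12, 13" onto torchvision's VGG16.
vgg = models.vgg16(pretrained=True).features.eval()
capture_idx = {1, 3, 6, 8, 11, 13, 15, 18, 25}   # ReLU outputs to keep (assumption)

def hypercolumns(img):                           # img: (1, 3, H, W), ImageNet-normalized
    h, w = img.shape[-2:]
    feats, x = [], img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in capture_idx:
            # Bilinearly upsample each captured feature map to the input resolution ...
            feats.append(F.interpolate(x, size=(h, w), mode='bilinear',
                                       align_corners=False))
    # ... and concatenate along the channel dimension: one hypercolumn per pixel.
    return torch.cat(feats, dim=1)               # (1, total_channels, H, W)
```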
2.2 Style Loss

Let A = {A_1, ..., A_n} be a set of n feature vectors extracted from X^(t), and B = {B_1, ..., B_m} be a set of m feature vectors extracted from the style image I_S. The style loss is derived from the Earth Mover's Distance (EMD):

EMD(A, B) = \min_{T \ge 0} \sum_{ij} T_{ij} C_{ij}    (2)

s.t. \; \sum_{j} T_{ij} = 1/n    (3)

\sum_{i} T_{ij} = 1/m    (4)

where T is the 'transport matrix', which defines partial pairwise assignments, and C is the 'cost matrix', which defines how far an element in A is from an element in B. (Since we consider all features to have equal mass, this is a simplified version of the more general EMD [23], which allows for transport between general, non-uniform mass distributions.) EMD(A, B) captures the distance between sets A and B, but finding the optimal T costs O(max(m, n)^3), and is therefore untenable for gradient descent based style transfer, where it would need to be computed at each update step. Instead we use the Relaxed EMD [14]. To define it we use two auxiliary distances, each of which is essentially the EMD with only one of the constraints (3) or (4):

R_A(A, B) = \min_{T \ge 0} \sum_{ij} T_{ij} C_{ij} \;\; \text{s.t.} \; \sum_{j} T_{ij} = 1/n    (5)

R_B(A, B) = \min_{T \ge 0} \sum_{ij} T_{ij} C_{ij} \;\; \text{s.t.} \; \sum_{i} T_{ij} = 1/m    (6)

We can then define the Relaxed Earth Mover's Distance as:

\ell_r = REMD(A, B) = \max(R_A(A, B), R_B(A, B))    (7)

This is equivalent to:

\ell_r = \max\Big( \frac{1}{n} \sum_{i} \min_{j} C_{ij}, \; \frac{1}{m} \sum_{j} \min_{i} C_{ij} \Big)    (8)

Computing this is dominated by computing the cost matrix C. We compute the cost of transport (ground metric) from A_i to B_j as the cosine distance between the two feature vectors:

C_{ij} = D_{\cos}(A_i, B_j) = 1 - \frac{A_i \cdot B_j}{\|A_i\| \, \|B_j\|}    (9)

We experimented with using the Euclidean distance between vectors instead, but the results were significantly worse; see the supplement for examples.
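A minimal sketch of the Relaxed EMD under the cosine ground metric, using the equivalent form of Equation 8, is shown below; A and B are assumed to be (n, d) and (m, d) matrices of sampled hypercolumn features.

```python
import torch

# Minimal sketch of the Relaxed EMD style term (Eqs. 7-9). A: (n, d) output features,
# B: (m, d) style features, both assumed to be sampled hypercolumns.
def cosine_cost(A, B, eps=1e-8):
    A = A / (A.norm(dim=1, keepdim=True) + eps)
    B = B / (B.norm(dim=1, keepdim=True) + eps)
    return 1.0 - A @ B.t()                     # C_ij = 1 - cos(A_i, B_j)       (Eq. 9)

def relaxed_emd(A, B):
    C = cosine_cost(A, B)
    R_A = C.min(dim=1).values.mean()           # (1/n) * sum_i min_j C_ij       (Eq. 8)
    R_B = C.min(dim=0).values.mean()           # (1/m) * sum_j min_i C_ij       (Eq. 8)
    return torch.max(R_A, R_B)                 # REMD = max(R_A, R_B)           (Eq. 7)
```

Only the row-wise and column-wise minima of C are needed, which is what makes the relaxation cheap enough to evaluate at every update step.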
While ℓ_r does a good job of transferring the structural forms of the source image to the target, the cosine distance ignores the magnitude of the feature vectors. In practice this leads to visual artifacts in the output, most notably over- or under-saturation. To combat this we add a moment matching loss:

\ell_m = \frac{1}{d} \|\mu_A - \mu_B\|_1 + \frac{1}{d^2} \|\Sigma_A - \Sigma_B\|_1    (10)

where μ_A and Σ_A are the mean and covariance of the feature vectors in set A, μ_B and Σ_B are defined in the same way, and d is the dimensionality of the feature vectors.
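A minimal sketch of ℓ_m, assuming A and B are (n, d) and (m, d) feature matrices; whether the covariance is normalized by n or n − 1 is an implementation detail not specified above.

```python
import torch

# Minimal sketch of the moment-matching term l_m (Eq. 10) for feature sets A (n, d)
# and B (m, d); d is the hypercolumn dimensionality. Normalizing the covariance by
# n - 1 (rather than n) is an assumption.
def moment_loss(A, B):
    d = A.shape[1]
    mu_A, mu_B = A.mean(dim=0), B.mean(dim=0)
    cov_A = (A - mu_A).t() @ (A - mu_A) / (A.shape[0] - 1)   # (d, d) covariance
    cov_B = (B - mu_B).t() @ (B - mu_B) / (B.shape[0] - 1)
    return (mu_A - mu_B).abs().sum() / d + (cov_A - cov_B).abs().sum() / d ** 2
```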
We also add a color matching loss, ℓ_p, to encourage our output and the style image to have a similar palette. ℓ_p is defined using the Relaxed EMD between pixel colors in X^(t) and I_S, this time using the Euclidean distance as the ground metric. We find it beneficial to convert the colors from RGB into a decorrelated colorspace with the mean color as one channel when computing this term. Because palette shifting is at odds with content preservation, we weight this term by 1/α.

2.3 Content Loss

Our content loss is motivated by the observation that robust pattern recognition can be built using local self-similarity descriptors [25]. An everyday example of this is the phenomenon called pareidolia, in which the self-similarity patterns of inanimate objects are perceived as faces because they match a loose template. Formally, let D^X be the pairwise cosine distance matrix of all (hypercolumn) feature vectors extracted from X^(t), and let D^{I_C} be defined analogously for the content image. We visualize several potential rows of D^X in Figure 3. We define our content loss as:

\mathcal{L}_{content}(X, I_C) = \frac{1}{n^2} \sum_{i,j} \left| \frac{D^X_{ij}}{\sum_i D^X_{ij}} - \frac{D^{I_C}_{ij}}{\sum_i D^{I_C}_{ij}} \right|    (11)

In other words, the normalized cosine distance between feature vectors extracted from any pair of coordinates should remain constant between the content image and the output image. This constrains the structure of the output without enforcing any loss directly tied to the pixels of the content image. As a result the semantics and spatial layout are broadly preserved, while pixel values in X^(t) are allowed to drastically differ from those in I_C.
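A minimal sketch of this content term, assuming F_X and F_C are (n, d) hypercolumn features sampled at corresponding coordinates of X^(t) and I_C.

```python
import torch

# Minimal sketch of the self-similarity content term (Eq. 11). F_X and F_C are (n, d)
# hypercolumn features sampled at the same coordinates of X^(t) and I_C.
def content_loss(F_X, F_C, eps=1e-8):
    def cosine_dist(F):
        Fn = F / (F.norm(dim=1, keepdim=True) + eps)
        return 1.0 - Fn @ Fn.t()                       # pairwise cosine distance matrix D
    D_X, D_C = cosine_dist(F_X), cosine_dist(F_C)
    # Normalize each column by its sum (the denominators in Eq. 11), then compare.
    D_X = D_X / (D_X.sum(dim=0, keepdim=True) + eps)
    D_C = D_C / (D_C.sum(dim=0, keepdim=True) + eps)
    return (D_X - D_C).abs().mean()                    # (1/n^2) * sum_ij | ... |
```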

2.4 User Control

We incorporate user control as constraints on the style of the output. Namely, the user defines paired sets of spatial locations (regions) in X^(t) and I_S that must have low style loss. In the case of point-to-point user guidance each set contains only a single spatial location (defined by a click). Let us denote the paired sets of spatial locations in the output and style image as (X_{t_1}, S_{s_1}), ..., (X_{t_K}, S_{s_K}). We redefine the ground metric of the Relaxed EMD as follows:

C_{ij} = \begin{cases} \beta \, D_{\cos}(A_i, B_j) & \text{if } i \in X_{t_k}, \; j \in S_{s_k} \\ \infty & \text{if } \exists k \; \text{s.t.} \; i \in X_{t_k}, \; j \notin S_{s_k} \\ D_{\cos}(A_i, B_j) & \text{otherwise} \end{cases}    (12)

where β controls the weight of the user-specified constraints relative to the unconstrained portion of the style loss; we use β = 5 in all experiments. In the case of point-to-point constraints we find it useful to augment the constraints specified by the user with 8 additional point-to-point constraints; these are automatically generated and centered around the original to form a uniform 3x3 grid. The horizontal and vertical distance between neighboring points in the grid is set to 20 pixels for 512x512 outputs, but this is a tunable parameter that could be incorporated into a user interface.
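A minimal sketch of the guided ground metric of Equation 12 is shown below; `pairs` is a hypothetical list of (out_idx, sty_idx) index tensors identifying the sampled feature locations that fall inside each user-matched output/style region.

```python
import torch

# Minimal sketch of the guided ground metric (Eq. 12). `pairs` is a hypothetical list of
# (out_idx, sty_idx) LongTensors selecting the sampled feature locations inside each
# user-matched output/style region; beta = 5 follows the text.
def guided_cost(A, B, pairs, beta=5.0, eps=1e-8):
    An = A / (A.norm(dim=1, keepdim=True) + eps)
    Bn = B / (B.norm(dim=1, keepdim=True) + eps)
    D = 1.0 - An @ Bn.t()                              # unconstrained cosine cost (Eq. 9)
    C = D.clone()
    for out_idx, sty_idx in pairs:
        # A constrained output feature may only be transported to its paired style
        # region (cost scaled by beta); every other destination is forbidden.
        C[out_idx, :] = float('inf')
        C[out_idx[:, None], sty_idx[None, :]] = beta * D[out_idx[:, None], sty_idx[None, :]]
    return C
```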
2.5 Implementation Details

We apply our method iteratively at increasing resolutions, halving α each time. We begin with the content and style images scaled to have a long side of 64 pixels. The output at each scale is bilinearly upsampled to twice the resolution and used as the initialization for the next scale. By default we stylize at four resolutions, and because we halve α at each resolution our default α = 16.0 is set such that α = 1.0 at the final resolution.

At the lowest resolution we initialize using the bottom level of a Laplacian pyramid constructed from the content image (high frequency gradients) added to the mean color of the style image. We then decompose the initialized output image into a five-level Laplacian pyramid, and use RMSprop [11] to update the entries of the pyramid to minimize our objective function. We find that optimizing the Laplacian pyramid, rather than pixels directly, dramatically speeds up convergence. At each scale we make 200 updates using RMSprop, and use a learning rate of 0.002 for all scales except the last, where we reduce it to 0.001.

The pairwise distance computation required to calculate the style and content loss precludes extracting features from all coordinates of the input images; instead we sample 1024 coordinates randomly from the style image, and 1024 coordinates in a uniform grid with a random x,y offset from the content image. We only differentiate the loss w.r.t. the features extracted from these locations, and resample these locations after each step of RMSprop.
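The following is a minimal sketch of this coarse-to-fine schedule, assuming square inputs and a hypothetical strotss_loss implementing Equation 1; it optimizes pixels directly rather than Laplacian-pyramid coefficients, and omits the feature-location resampling described above.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the coarse-to-fine schedule (Sec. 2.5), assuming square images and a
# hypothetical strotss_loss for Equation 1. Pixels are optimized directly here; the paper
# optimizes Laplacian-pyramid coefficients and resamples feature locations every step.
def multiscale_stylize(I_C, I_S, strotss_loss, sizes=(64, 128, 256, 512), alpha=16.0):
    X = None
    for i, s in enumerate(sizes):
        alpha = alpha / 2                        # halve alpha each scale (16 -> ... -> 1)
        C = F.interpolate(I_C, size=(s, s), mode='bilinear', align_corners=False)
        S = F.interpolate(I_S, size=(s, s), mode='bilinear', align_corners=False)
        # Initialize from the previous scale (or, at the lowest scale, from the content
        # image; the paper uses low-frequency content plus the style's mean color).
        X = C.clone() if X is None else F.interpolate(X, size=(s, s), mode='bilinear',
                                                      align_corners=False)
        X = X.detach().requires_grad_(True)
        opt = torch.optim.RMSprop([X], lr=0.002 if i < len(sizes) - 1 else 0.001)
        for _ in range(200):                     # 200 RMSprop updates per scale
            opt.zero_grad()
            strotss_loss(X, C, S, alpha).backward()
            opt.step()
    return X.detach()
```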
Figure 3: The blue, red, and green heatmaps visualize the cosine similarity in feature space relative to the corresponding points marked in the photograph. Our content loss attempts to maintain the relative pairwise similarities between 1024 randomly chosen locations in the content image.

3 Related Work

Style transfer algorithms have existed for decades, traditionally relying on hand-crafted algorithms to render an image in a fixed style [7, 9], or on hand-crafted features to be matched between an arbitrary style and the content image [10, 3]. The state of the art was dramatically altered in 2016 when Gatys et al. [4] introduced Neural Style Transfer. This method uses features extracted from a neural network pre-trained for image classification. It defines style in terms of the Gram matrices of features extracted from multiple layers, and content as the feature tensors extracted from another set of layers. The style loss is defined as the Frobenius norm of the difference in Gram feature matrices between the output image and the style image. The content loss is defined as the Frobenius norm of the difference between the feature tensors of the output image and the content image. Distinct from the framework of Neural Style Transfer, there are several recent methods [17, 1] that use similarities between deep neural features to build a correspondence map between the content image and style image, and warp the style image onto the content image. These methods are extremely successful in paired settings, when the contents of the style image and content image are similar, but are not designed for style transfer between arbitrary images (unpaired or texture transfer).

Subsequent work building upon [4] has explored improvements and modifications along many axes. Perhaps the most common form of innovation is in proposals for quantifying the 'stylistic similarity' between two images [15, 2, 22, 20]. For example, in order to capture long-range spatial dependencies Berger et al. [2] propose computing multiple Gram matrices using translated feature tensors (so that the outer product is taken between feature vectors at fixed spatial offsets). Both [4] and [2] discard valuable information about the complete distribution of style features that is not captured by Gram matrices.

Figure 4: Qualitative comparison between our method and prior work (columns: Content, Style, Ours, Reshuffle [6], Gatys [4], CNNMRF [15], Contextual [20]). Default hyper-parameters were used for all methods.

In [15] Li et al. formulate the style loss as minimizing the energy function of a Markov Random Field over the features extracted from one of the later layers of a pre-trained CNN, encouraging patches (which yielded the deep features) in the target image to match their nearest neighbor from the style image in feature space. Other functionally similar losses appear in [22], which treats style transfer as matching two histograms of features, and [20], which matches features between the style and target which are significantly closer to each other than to any other pairing. In all of these methods, broadly speaking, features of the output are encouraged to lie on the support of the distribution of features extracted from the style image, but need not cover it. These losses are all similar to one component of the Relaxed EMD (R_A). However, our method differs from these approaches because our style term also encourages covering the entire distribution of features in the style image (R_B). Our style loss is most similar in spirit to that proposed by Gu et al. [6], which also includes terms that encourage fidelity and diversity. Their loss minimizes the distance between explicitly paired individual patches, whereas ours minimizes the distance between distributions of features.

Another major category of innovation is replacing the optimization-based algorithm of [4] with a neural network trained to perform style transfer, enabling real-time inference. Initial efforts in this area were constrained to a limited set of pre-selected styles [13], but subsequent work relaxed this constraint and allowed arbitrary styles at test time [12]. Relative to slower optimization-based methods, these works made some sacrifices in output quality for speed. However, Sanakoyeu et al. [24] introduce a method for incorporating style images from the same artist into the real-time framework which produces high quality outputs in real time, but in contrast to our work it relies on having access to multiple images with the same style and requires training the style transfer mechanism separately for each new style.

Various methods have been proposed for controlling the output of style transfer. In [5] Gatys et al. propose two 'global' control methods that affect the entire output rather than a particular spatial region. One method is decomposing the image into hue, saturation, and luminance, and only stylizing the luminance in order to preserve the color palette of the content image.

Figure 5: Examples of using guidance for aesthetic effect (left, point-to-point) and error correction (right, region-to-region). In the top row the images are arranged in order of content, output, style. Below each content and style image we show the guidance mask, and between them the guided output.

Figure 6: Effect of varying α, the content loss weight, on our unconstrained style transfer output (columns: Content, α = 32.0, α = 16.0, α = 8.0, α = 4.0, Style). Because we stylize at four resolutions and halve α each time, our default α = 16.0 is set such that α = 1.0 at the final resolution.

A second method from [5] is to generate an auxiliary style image, either to preserve color or to transfer style from only a particular scale (for example, transferring only the brush-strokes, rather than the larger and more structurally complex elements of the style). These types of user control are orthogonal to our method, and can be incorporated into it.

Another type of control is spatial, allowing users to ensure that certain regions of the output are stylized using only features from a manually selected region of the style image (or that different regions of the output image are stylized based on different style images). In [5, 18] the authors propose forms of spatial control based on the user defining matched regions of the images by creating a dense mask for both the style and content image. We demonstrate that it is straightforward to incorporate this type of user control into our formulation of style transfer. In the supplement we show an example comparing the spatial control of our method and [5], and demonstrate that both yield visually pleasing results that match the spatial guidance provided.

Evaluating and comparing style transfer algorithms is a challenging task because, in contrast to object recognition or segmentation, there is no established "ground truth" for the output. The most common method is a qualitative, purely subjective comparison between the outputs of different algorithms. Some methods also provide more refined qualitative comparisons such as texture synthesis [22, 6] and inpainting [2]. While these comparisons provide insight into the behavior of each algorithm, without quantitative comparisons it is difficult to draw conclusions about an algorithm's performance on average. The most common quantitative evaluation is asking users to rank the output of each algorithm according to aesthetic appeal [6, 16, 19]. Recently Sanakoyeu et al. [24] proposed two new forms of quantitative evaluation. The first is testing whether a neural network pretrained for artist classification on real paintings can correctly classify the artist of the style image based on an algorithm's output. The second is asking experts in art history which algorithm's output most closely matches the style image.

We designed our human evaluation study, described in Section 4.1, to give a more complete sense of the trade-off each algorithm makes between content and style as its hyper-parameters vary. To the best of our knowledge it is the first such effort.

Figure 7: Human evaluation interface.

4 Experiments

We include representative qualitative results in Figures 2 and 4, and an illustration of the effect of the content weight α in Figure 6. Figure 5 demonstrates uses of user guidance with our method.

4.1 Large-Scale Human Evaluation

Because style transfer between arbitrary content and style pairs is such a broad task, we propose three regimes that we believe cover the major use cases of style transfer. 'Paired' refers to when the content image and style image are both representations of the same things; this mostly means images of the same category (e.g. both images of dogs), but also includes images of the same entity (e.g. both images of the London skyline). 'Unpaired' refers to when the content and style image are not representations of the same thing (e.g. a photograph of a Central American temple, and a painting of a circus). 'Texture' refers to when the content is a photograph of a face, and the style is a homogeneous texture (e.g. a brick wall, flames). For each regime we consider 30 style/content pairings (90 in total).

In order to quantitatively compare our method to prior work we performed several studies using AMT. An example of the workers' interface is shown in Figure 7. Images A and B were the result of the same inputs passed into either the algorithms proposed in [4], [6], [15], [20], or our method. In Figure 7 image C is the corresponding style image, and workers were asked to choose whether the style of image C is best matched by: 'A', 'B', 'Both Equally', or 'Neither'. If image C is a content image, workers are posed the same question with respect to content match instead of style. For each competing algorithm except [6] we test three sets of hyper-parameters: the defaults recommended by the authors, the same with 1/4 of the content weight (high stylization), and the same with double the content weight (low stylization). Because these modifications to the content weight did not alter the behavior of [4] significantly, we also tested [4] with 1/100 and 100× the default content weight. We also test our method with 4× the content weight. We were only able to test the default hyper-parameters for [6] because the code provided by the authors does not expose the content weight as a parameter to users. We test all possible pairings of A and B between different algorithms and their hyper-parameters (i.e. we do not compare an algorithm against itself with different hyper-parameters, but do compare it to all hyper-parameter settings of other algorithms). In each presentation, the order of the outputs (the assignment of methods to A or B in the interface) was randomized. Each pairing was voted on by an average of 4.98 different workers (minimum 4, maximum 5), 662 workers in total. On average, 3.7 workers agreed with the majority vote for each pairing. All of the images used in this evaluation will be made available to enable further benchmarking.

For an algorithm/hyper-parameter combination we define its content score as the number of times it was selected by workers as having closer or equal content to I_C relative to the other output it was shown with, divided by the total number of experiments it appeared in. This is always a fraction between 0 and 1. The style score is defined analogously. We present these results in Figure 8, separated by regime. The score of each point is computed over 1580 pairings on average (including the same pairings being shown to distinct workers; minimum 1410, maximum 1890). Overall, for a given level of content score, our method provides a higher style score than prior work.
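As a small illustration, the content (or style) score of a method can be computed from the recorded judgments as follows; `votes` is a hypothetical list of (method, won_or_tied) records for one evaluation axis.

```python
# Minimal sketch of the content (or style) score. `votes` is a hypothetical list of
# (method_name, won_or_tied) records, one per pairing in which the method appeared.
def score(votes, method):
    shown = [won for name, won in votes if name == method]
    return sum(shown) / len(shown)        # fraction of pairings won or tied, in [0, 1]
```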
4.2 Ablation Study

In Figure 9 we explore the effect of the different terms of our style loss, which is composed of a moment-matching loss ℓ_m, the Relaxed Earth Mover's Distance ℓ_r, and a color palette matching loss ℓ_p. As seen in Figure 9, ℓ_m alone does a decent job of transferring style, but fails to capture the larger structures of the style image. ℓ_RA alone does not make use of the entire distribution of style features, and reconstructs content more poorly than ℓ_r. ℓ_RB alone encourages every style feature to have a nearby output feature, which is too easy to satisfy. Combining ℓ_RA and ℓ_RB in the Relaxed Earth Mover's Distance ℓ_r results in a higher quality output than either term alone; however, because the ground metric used is the cosine distance, the magnitude of the features is not constrained, resulting in saturation issues. Combining ℓ_r with ℓ_m alleviates this, but some issues with the output's palette remain, which are fixed by adding ℓ_p.

4.3 Relaxed EMD Approximation Quality

To measure how well the Relaxed EMD approximates the exact Earth Mover's Distance we take each of the 900 possible content/style pairings formed by the 30 content and style images used in our AMT experiments for the unpaired regime. For each pairing we compute the REMD between 1024 features extracted from random coordinates, and the exact EMD based on the same set of features.

Figure 8: Quantitative evaluation of our method and prior work; we estimate the Pareto frontier of the methods evaluated by linear interpolation (dashed line).

Figure 9: Ablation study of the effects of our proposed style terms with low content loss (α = 4.0); columns show ℓ_m, ℓ_RA, ℓ_RB, ℓ_r, ℓ_r + ℓ_m, ℓ_r + ℓ_m + ℓ_p/α, and the style image. See text for analysis of each term's effect. Best viewed zoomed in on screen.

We then analyze the distribution of the ratio REMD(A,B)/EMD(A,B). Because the REMD is a lower bound, this quantity is always ≤ 1. Over the 900 image pairs, its mean was 0.60, with standard deviation 0.04. A better EMD approximation, or one that is an upper bound rather than a lower bound, may yield better style transfer results. On the other hand, the REMD is simple to compute, empirically easy to optimize, and yields good results.
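A minimal sketch of this comparison: with equal-sized feature sets of uniform mass, the exact EMD reduces to an optimal one-to-one matching, so it can be computed with SciPy's Hungarian solver; C is assumed to be the (n, n) ground-metric cost matrix (as a NumPy array) for one content/style pairing.

```python
from scipy.optimize import linear_sum_assignment

# Minimal sketch of the REMD-vs-exact-EMD check (Sec. 4.3). With n = m features of equal
# mass the exact EMD reduces to an optimal assignment, so the Hungarian solver suffices.
def remd_to_emd_ratio(C):                                     # C: (n, n) NumPy cost matrix
    remd = max(C.min(axis=1).mean(), C.min(axis=0).mean())    # Eq. 8
    rows, cols = linear_sum_assignment(C)
    emd = C[rows, cols].mean()                                # exact EMD for uniform masses
    return remd / emd                                         # <= 1: the REMD is a lower bound
```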
4.4 Timing Results

We compute our timing results using an Intel i5-7600 CPU @ 3.50GHz and an NVIDIA GTX 1080 GPU. We use square style and content images scaled to have the edge length indicated in the top row of Table 1. For inputs of size 1024x1024 the methods from [15] and [20] ran out of memory ('X' in the table). Because the code provided by the authors of [6] only runs on Windows, we had to run it on a different computer. To approximate the speed of their method on our hardware we project the timing result for 512x512 images reported in their paper based on the relative speedup for [15] between their hardware and ours. For low resolution outputs our method is relatively slow; however, it scales better for outputs with resolution 512 and above relative to [15] and [20], but remains slower than [4] and our projected results for [6].

Image size    64   128   256   512   1024
Ours          20    38    60    95    154
Gatys          8    10    14    33    116
CNNMRF         3     8    27   117      X
Contextual    13    40   189   277      X
Reshuffle      -     -     -   69*      -

Table 1: Timing comparison (in seconds) between our method and others. The style and content images had the same dimensions and were square. *: a projected result, see text for details. -: we were not able to project these results. X: the method ran out of memory.

5 Conclusion and Future Work

We propose novel formalizations of style and content for style transfer and show that the resulting algorithm compares favorably to prior work, both in terms of stylization quality and content preservation. Via our ablation study we show that style-similarity losses which more accurately measure the distance between distributions of features lead to better style transfer. The approximation of the Earth Mover's Distance that we use is simple but effective, and we leave it to future work to explore more accurate approximations. Another direction for future work is improving our method's speed by training feed-forward style transfer methods using our proposed objective function.

References

[1] K. Aberman, J. Liao, M. Shi, D. Lischinski, B. Chen, and D. Cohen-Or. Neural best-buddies. ACM Transactions on Graphics, 37(4):114, July 2018.

[2] G. Berger and R. Memisevic. Incorporating long-range consistency in CNN-based texture generation. arXiv preprint arXiv:1606.01286, 2016.

[3] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 341–346. ACM, 2001.

[4] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

[5] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[6] S. Gu, C. Chen, J. Liao, and L. Yuan. Arbitrary style transfer with deep feature reshuffle.

[7] P. Haeberli. Paint by numbers: Abstract image representations. In ACM SIGGRAPH Computer Graphics, volume 24, pages 207–214. ACM, 1990.

[8] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.

[9] A. Hertzmann. Painterly rendering with curved brush strokes of multiple sizes. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pages 453–460. ACM, 1998.

[10] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 327–340. ACM, 2001.

[11] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent.

[12] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization.

[13] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[14] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.

[15] C. Li and M. Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.

[16] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474, 2018.

[17] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang. Visual attribute transfer through deep image analogy. SIGGRAPH, 2017.

[18] M. Lu, H. Zhao, A. Yao, F. Xu, Y. Chen, and L. Zhang. Decoder network over lightweight reconstructed feature for fast semantic style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2469–2477, 2017.

[19] R. Mechrez, E. Shechtman, and L. Zelnik-Manor. Photorealistic style transfer with screened Poisson equation. arXiv preprint arXiv:1709.09828, 2017.

[20] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image transformation with non-aligned data. arXiv preprint arXiv:1803.02077, 2018.

[21] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3376–3385, 2015.

[22] E. Risser, P. Wilmot, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893, 2017.

[23] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision, pages 59–66. IEEE, 1998.

[24] A. Sanakoyeu, D. Kotovenko, S. Lang, and B. Ommer. A style-aware content loss for real-time HD style transfer. 2018.

[25] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2007.

[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[27] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

