Style Transfer by Relaxed Optimal Transport and Self-Similarity (Kolkin et al., CVPR 2019)
Figure 1: Examples of our output for unconstrained (left) and guided (right) style transfer. Images are arranged in order of content, output, style. Below the content and style image on the right we visualize the user-defined region-to-region guidance used to generate the output in the middle.
Figure 2: Examples of the effect of different content images on the same style, and vice versa. (Styles shown: Picasso, Dürer, Matisse, Kandinsky, Klimt.)
from input image X by layer i of network Φ. Given the set of layer indices l_1, ..., l_L we use bilinear upsampling to match the spatial dimensions of Φ(X)_{l_1}, ..., Φ(X)_{l_L} to those of the original image X, then concatenate all such tensors along the feature dimension. This yields a hypercolumn at each pixel that includes features capturing low-level edge and color features, mid-level texture features, and high-level semantic features [27]. For all experiments we use all convolutional layers of VGG16 except layers 9, 10, 12, and 13, which we exclude because of memory constraints.

2.2 Style Loss

Let A = {A_1, ..., A_n} be a set of n feature vectors extracted from X^(t), and B = {B_1, ..., B_m} be a set of m features extracted from the style image I_S. The style loss is derived from the Earth Movers Distance (EMD)¹:

    EMD(A, B) = \min_{T \geq 0} \sum_{ij} T_{ij} C_{ij}    (2)
    s.t. \sum_j T_{ij} = 1/m    (3)
         \sum_i T_{ij} = 1/n    (4)

where T is the 'transport matrix', which defines partial pairwise assignments, and C is the 'cost matrix', which defines how far an element in A is from an element in B. EMD(A, B) captures the distance between sets A and B, but finding the optimal T costs O(max(m, n)^3), and is therefore untenable for gradient descent based style transfer (where it would need to be computed at each update step). Instead we will use the Relaxed EMD [14]. To define this we will use two auxiliary distances, each essentially the EMD with only one of the constraints (3) or (4):

    R_A(A, B) = \min_{T \geq 0} \sum_{ij} T_{ij} C_{ij}   s.t. \sum_j T_{ij} = 1/m    (5)

    R_B(A, B) = \min_{T \geq 0} \sum_{ij} T_{ij} C_{ij}   s.t. \sum_i T_{ij} = 1/n    (6)

We can then define the relaxed earth movers distance as:

    ℓ_r = REMD(A, B) = \max(R_A(A, B), R_B(A, B))    (7)

This is equivalent to:

    ℓ_r = \max\Big( \frac{1}{n} \sum_i \min_j C_{ij},\ \frac{1}{m} \sum_j \min_i C_{ij} \Big)    (8)

Computing this is dominated by computing the cost matrix C. We compute the cost of transport (ground metric) from A_i to B_j as the cosine distance between the two feature vectors,

    C_{ij} = D_{cos}(A_i, B_j) = 1 - \frac{A_i \cdot B_j}{\|A_i\| \|B_j\|}    (9)

We experimented with using the Euclidean distance between vectors instead, but the results were significantly worse; see the supplement for examples.

While ℓ_r does a good job of transferring the structural forms of the source image to the target, the cosine distance ignores the magnitude of the feature vectors. In practice this leads to visual artifacts in the output, most notably over-/under-saturation. To combat this we add a moment matching loss:

    ℓ_m = \frac{1}{d} \|\mu_A - \mu_B\|_1 + \frac{1}{d^2} \|\Sigma_A - \Sigma_B\|_1    (10)

where μ_A, Σ_A are the mean and covariance of the feature vectors in set A, and μ_B and Σ_B are defined in the same way.

We also add a color matching loss, ℓ_p, to encourage our output and the style image to have a similar palette. ℓ_p is defined using the Relaxed EMD between pixel colors in X^(t) and I_S, this time using the Euclidean distance as the ground metric. We find it beneficial to convert the colors from RGB into a decorrelated colorspace with mean color as one channel when computing this term. Because palette shifting is at odds with content preservation, we weight this term by 1/α. (Minimal code sketches of these style terms, and of the content term below, are given at the end of this section.)

2.3 Content Loss

Our content loss is motivated by the observation that robust pattern recognition can be built using local self-similarity descriptors [25]. An everyday example of this is the phenomenon called pareidolia, where the self-similarity patterns of inanimate objects are perceived as faces because they match a loose template. Formally, let D^X be the pairwise cosine distance matrix of all (hypercolumn) feature vectors extracted from X^(t), and let D^{I_C} be defined analogously for the content image. We visualize several potential rows of D^X in Figure 3. We define our content loss as:

    L_{content}(X, C) = \frac{1}{n^2} \sum_{i,j} \left| \frac{D^X_{ij}}{\sum_i D^X_{ij}} - \frac{D^{I_C}_{ij}}{\sum_i D^{I_C}_{ij}} \right|    (11)

In other words, the normalized cosine distance between feature vectors extracted from any pair of coordinates should remain constant between the content image and the output image. This constrains the structure of the output without enforcing any loss directly connected to the pixels of the content image. This causes the semantics and spatial layout to be broadly preserved, while allowing pixel values in X^(t) to drastically differ from those in I_C.

¹ Since we consider all features to have equal mass, this is a simplified version of the more general EMD [23], which allows for transport between general, non-uniform mass distributions.
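The style terms above reduce to a few lines of tensor arithmetic. Below is a minimal PyTorch sketch of ℓ_r (Eq. 8) under the cosine ground metric (Eq. 9), together with the moment-matching term ℓ_m (Eq. 10); the function names, and the assumption that A and B are n×d and m×d matrices of sampled hypercolumn features, are ours rather than taken from a released implementation.

```python
import torch

def cosine_cost(A, B, eps=1e-8):
    """Pairwise cosine distance C_ij = 1 - <A_i, B_j> / (||A_i|| ||B_j||)  (Eq. 9)."""
    A_n = A / (A.norm(dim=1, keepdim=True) + eps)
    B_n = B / (B.norm(dim=1, keepdim=True) + eps)
    return 1.0 - A_n @ B_n.t()                       # (n, m) cost matrix

def remd_loss(A, B):
    """Relaxed EMD  l_r = max( mean_i min_j C_ij , mean_j min_i C_ij )  (Eq. 8)."""
    C = cosine_cost(A, B)
    r_A = C.min(dim=1).values.mean()   # each output feature finds a close style feature
    r_B = C.min(dim=0).values.mean()   # each style feature is covered by some output feature
    return torch.max(r_A, r_B)

def moment_loss(A, B):
    """Moment matching on feature means and covariances  (Eq. 10)."""
    d = A.shape[1]
    mu_A, mu_B = A.mean(dim=0), B.mean(dim=0)
    cov_A = (A - mu_A).t() @ (A - mu_A) / (A.shape[0] - 1)
    cov_B = (B - mu_B).t() @ (B - mu_B) / (B.shape[0] - 1)
    return (mu_A - mu_B).abs().sum() / d + (cov_A - cov_B).abs().sum() / d ** 2
```

Both terms are differentiable with respect to A, so gradients flow back to the output image through the sampled features; ℓ_p reuses the same relaxed-EMD computation on pixel colors with a Euclidean ground metric.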
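The content term can be sketched in the same spirit. The hypercolumn construction follows Section 2.1 (bilinear upsampling and channel-wise concatenation), while content_loss implements the normalized self-similarity comparison of Eq. 11; the helper names and the (k, D) feature-matrix layout are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def hypercolumns(feature_maps, size):
    """Bilinearly upsample each (1, C_l, H_l, W_l) CNN feature map to size = (H, W)
    and concatenate along channels, giving one hypercolumn per pixel (Sec. 2.1)."""
    up = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
          for f in feature_maps]
    return torch.cat(up, dim=1)                      # (1, sum_l C_l, H, W)

def sample_columns(hyper, coords):
    """Gather the hypercolumns at integer (y, x) coordinates -> (k, D) matrix."""
    cols = hyper[0]                                  # (D, H, W)
    return cols[:, coords[:, 0], coords[:, 1]].t()   # (k, D)

def self_similarity(feats, eps=1e-8):
    """Pairwise cosine-distance matrix of a (k, D) feature matrix."""
    f = feats / (feats.norm(dim=1, keepdim=True) + eps)
    return 1.0 - f @ f.t()

def content_loss(feats_out, feats_content, eps=1e-8):
    """Eq. 11: compare the normalized self-similarity matrices of output and content."""
    D_x = self_similarity(feats_out)
    D_c = self_similarity(feats_content)
    D_x = D_x / (D_x.sum(dim=0, keepdim=True) + eps)   # divide D_ij by sum_i D_ij
    D_c = D_c / (D_c.sum(dim=0, keepdim=True) + eps)
    return (D_x - D_c).abs().mean()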
2.4 User Control

We incorporate user control as constraints on the style of the output. Namely, the user defines paired sets of spatial locations (regions) in X^(t) and I_S that must have low style loss. In the case of point-to-point user guidance each set contains only a single spatial location (defined by a click). Let us denote paired sets of spatial locations in the output and style image as (X_{t_1}, S_{s_1}), ..., (X_{t_K}, S_{s_K}). We redefine the ground metric of the Relaxed EMD as follows:

    C_{ij} = \begin{cases}
        \beta \cdot D_{cos}(A_i, B_j) & \text{if } i \in X_{t_k},\ j \in S_{s_k} \\
        \infty & \text{if } \exists k \text{ s.t. } i \in X_{t_k},\ j \notin S_{s_k} \\
        D_{cos}(A_i, B_j) & \text{otherwise}
    \end{cases}    (12)

(A short sketch of this guided ground metric follows the optimization details below.)

We use RMSprop [11] to iteratively update the output image with respect to our objective function. We find that optimizing the Laplacian pyramid, rather than pixels directly, dramatically speeds up convergence. At each scale we make 200 updates using RMSprop, and use a learning rate of 0.002 for all scales except the last, where we reduce it to 0.001. The pairwise distance computation required to calculate the style and content loss precludes extracting features from all coordinates of the input images; instead we sample 1024 coordinates randomly from the style image, and 1024 coordinates in a uniform grid with a random x,y offset from the content image. We only differentiate the loss w.r.t. the features extracted from these locations, and resample these locations after each step of RMSprop.
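A minimal sketch of the guided ground metric of Eq. 12 is given below, assuming each user-defined region is represented as a boolean mask over the sampled output and style coordinates; the mask representation and the value of β are our assumptions, not values given in this section.

```python
import torch

def guided_cost(C, out_masks, style_masks, beta=5.0):
    """Eq. 12: given a precomputed cosine cost matrix C (Eq. 9), amplify the cost
    inside matched regions by beta and forbid transport from a guided output
    region to style features outside its partner region.
    out_masks[k] / style_masks[k] are boolean vectors over the sampled coordinates."""
    C = C.clone()
    inf = torch.tensor(float('inf'), device=C.device)
    for m_out, m_sty in zip(out_masks, style_masks):
        matched = m_out[:, None] & m_sty[None, :]      # i in X_tk and j in S_sk
        forbidden = m_out[:, None] & ~m_sty[None, :]   # i in X_tk but j outside S_sk
        C = torch.where(matched, beta * C, C)
        C = torch.where(forbidden, inf, C)
    return C
```

In point-to-point guidance each mask selects a single sampled coordinate; the β-scaled entries pull the matched pairs together, while the ∞ entries prevent forbidden pairings from ever being selected by the minima in Eq. 8.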
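The coarse-to-fine optimization described above can be sketched as follows. Only the per-scale step count and learning rates come from the text; the pyramid depth, the particular output sizes, and the loss_fn closure (assumed to extract hypercolumn features and resample its 1024 coordinates on every call) are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def lap_pyramid(img, levels=5):
    """Decompose an image tensor (1, 3, H, W) into Laplacian pyramid coefficients."""
    pyr, cur = [], img
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode='bilinear', align_corners=False)
        pyr.append(cur - up)
        cur = down
    pyr.append(cur)
    return pyr

def lap_reconstruct(pyr):
    """Collapse Laplacian coefficients back into an image."""
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = F.interpolate(cur, size=band.shape[-2:], mode='bilinear', align_corners=False) + band
    return cur

def sample_coords(h, w, k=1024):
    """k random (y, x) coordinates, e.g. for the style features inside loss_fn."""
    return torch.stack([torch.randint(0, h, (k,)), torch.randint(0, w, (k,))], dim=1)

def stylize(init_img, loss_fn, sizes=(64, 128, 256, 512), steps=200):
    """Optimize Laplacian pyramid coefficients with RMSprop at each scale
    (lr 0.002, reduced to 0.001 at the final scale)."""
    out = init_img
    for idx, size in enumerate(sizes):
        img = F.interpolate(out, size=(size, size), mode='bilinear', align_corners=False)
        pyr = [band.detach().clone().requires_grad_(True) for band in lap_pyramid(img)]
        lr = 0.001 if idx == len(sizes) - 1 else 0.002
        opt = torch.optim.RMSprop(pyr, lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(lap_reconstruct(pyr))   # resamples its coordinates internally
            loss.backward()
            opt.step()
        out = lap_reconstruct(pyr).detach()
    return out
```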
Figure 4: Qualitative comparison between our method and prior work; columns show Content, Style, Ours, Reshuffle [6], Gatys [4], CNNMRF [15], and Contextual [20]. Default hyper-parameters were used for all methods.
trained CNN, encouraging patches (which yielded the deep features) in the target image to match their nearest neighbor from the style image in feature space. Other functionally similar losses appear in [22], which treats style transfer as matching two histograms of features, and [20], which matches features between the style and target which are significantly closer than any other pairing. In all of these methods, broadly speaking, features of the output are encouraged to lie on the support of the distribution of features extracted from the style image, but need not cover it. These losses are all similar to one component of the Relaxed EMD (R_A). However, our method differs from these approaches because our style term also encourages covering the entire distribution of features in the style image (R_B). Our style loss is most similar in spirit to that proposed by Gu et al. [6], which also includes terms that encourage fidelity and diversity. Their loss minimizes the distance between explicitly paired individual patches, whereas ours minimizes the distance between distributions of features.

Another major category of innovation is replacing the optimization-based algorithm of [4] with a neural network trained to perform style transfer, enabling real-time inference. Initial efforts in this area were constrained to a limited set of pre-selected styles [13], but subsequent work relaxed this constraint and allowed arbitrary styles at test time [12]. Relative to slower optimization-based methods these works sacrifice some output quality for speed. However, Sanakoyeu et al. [24] introduce a method for incorporating style images from the same artist into the real-time framework which produces high quality outputs in real time, but in contrast to our work relies on having access to multiple images with the same style and requires training the style transfer mechanism separately for each new style.

Various methods have been proposed for controlling the output of style transfer. In [5] Gatys et al. propose two 'global' control methods that affect the entire output rather than a particular spatial region. One method decomposes the image into hue, saturation, and luminance, and only stylizes the luminance in order to preserve the color palette of the content image.
Figure 5: Examples of using guidance for aesthetic effect (left, point-to-point) and error correction (right, region-to-region). In the top row the images are arranged in order of content, output, style. Below each content and style image we show the guidance mask, and between them the guided output.
A second method from [5] is to generate an auxiliary style image either to preserve color, or to transfer style from only a particular scale (for example transferring only the brush-strokes, rather than the larger and more structurally complex elements of the style). These types of user control are orthogonal to our method, and can be incorporated into it.

Another type of control is spatial, allowing users to ensure that certain regions of the output are stylized using only features from a manually selected region of the style image (or that different regions of the output image are stylized based on different style images). In [5, 18] the authors propose forms of spatial control based on the user defining matched regions of the image by creating a dense mask for both the style and content image. We demonstrate that it is straightforward to incorporate this type of user control into our formulation of style transfer. In the supplement we show an example comparing the spatial control of our method and [5], and demonstrate that both yield visually pleasing results that match the spatial guidance provided.

Evaluating and comparing style transfer algorithms is a challenging task because, in contrast to object recognition or segmentation, there is no established "ground truth" for the output. The most common method is a qualitative, purely subjective comparison between the output of different algorithms. Some methods also provide more refined qualitative comparisons such as texture synthesis [22, 6] and inpainting [2]. While these comparisons provide insight into the behavior of each algorithm, without quantitative comparisons it is difficult to draw conclusions about the algorithm's performance on average. The most common quantitative evaluation is asking users to rank the output of each algorithm according to aesthetic appeal [6, 16, 19]. Recently Sanakoyeu et al. [24] propose two new forms of quantitative evaluation. The first is testing if a neural network pretrained for artist classification on real paintings can correctly classify the artist of the style image based on an algorithm's output. The second is asking experts in art history which algorithm's output most closely matches the style image.
We designed our human evaluation study, described in section 4.1, to give a more complete sense of the trade-off each algorithm makes between content and style as its hyper-parameters vary. To the best of our knowledge it is the first such effort.

Figure 7: Human evaluation interface

4 Experiments

We include representative qualitative results in Figures 2 and 4, and an illustration of the effect of the content weight α in Figure 6. Figure 5 demonstrates uses of user guidance with our method.

4.1 Large-Scale Human Evaluation

Because style transfer between arbitrary content and style pairs is such a broad task, we propose three regimes that we believe cover the major use cases of style transfer. 'Paired' refers to when the content image and style image are both representations of the same things; this is mostly images of the same category (e.g. both images of dogs), but also includes images of the same entity (e.g. both images of the London skyline). 'Unpaired' refers to when the content and style image are not representations of the same thing (e.g. a photograph of a Central American temple, and a painting of a circus). 'Texture' refers to when the content is a photograph of a face, and the style is a homogeneous texture (e.g. a brick wall, flames). For each regime we consider 30 style/content pairings (90 in total).

In order to quantitatively compare our method to prior work we performed several studies using AMT. An example of the workers' interface is shown in Figure 7. Images A and B were the result of the same inputs passed into either the algorithms proposed in [4], [6], [15], [20], or our method. In Figure 7 image C is the corresponding style image, and workers were asked to choose whether the style of image C is best matched by: 'A', 'B', 'Both Equally', or 'Neither'. If image C is a content image, workers are posed the same question with respect to content match instead of style. For each competing algorithm except [6] we test three sets of hyper-parameters: the defaults recommended by the authors, the same with 1/4 of the content weight (high stylization), and the same with double the content weight (low stylization). Because these modifications to content weight did not alter the behavior of [4] significantly, we also tested [4] with 1/100 and 100× the default content weight. We also test our method with 4× the content weight. We were only able to test the default hyper-parameters for [6] because the code provided by the authors does not expose content weight as a parameter to users. We test all possible pairings of A and B between different algorithms and their hyper-parameters (i.e. we do not compare an algorithm against itself with different hyper-parameters, but do compare it to all hyper-parameter settings of other algorithms). In each presentation, the order of output (assignment of methods to A or B in the interface) was randomized. Each pairing was voted on by an average of 4.98 different workers (minimum 4, maximum 5), 662 workers in total. On average, 3.7 workers agreed with the majority vote for each pairing. All of the images used in this evaluation will be made available to enable further benchmarking.

For an algorithm/hyper-parameter combination we define its content score to be the number of times it was selected by workers as having closer or equal content to I_C relative to the other output it was shown with, divided by the total number of experiments it appeared in. This is always a fraction between 0 and 1. The style score is defined analogously. We present these results in Figure 8, separated by regime. The score of each point is computed over 1580 pairings on average (including the same pairings being shown to distinct workers; minimum 1410, maximum 1890). Overall, for a given level of content score, our method provides a higher style score than prior work.

4.2 Ablation Study

In Figure 9 we explore the effect of the different terms of our style loss, which is composed of a moment-matching loss ℓ_m, the Relaxed Earth Movers Distance ℓ_r, and a color palette matching loss ℓ_p. As seen in Figure 9, ℓ_m alone does a decent job of transferring style, but fails to capture the larger structures of the style image. ℓ_{R_A} alone does not make use of the entire distribution of style features, and reconstructs content more poorly than ℓ_r. ℓ_{R_B} alone encourages every style feature to have a nearby output feature, which is too easy to satisfy. Combining ℓ_{R_A} and ℓ_{R_B} in the relaxed earth movers distance ℓ_r results in a higher quality output than either term alone; however, because the ground metric used is the cosine distance, the magnitude of the features is not constrained, resulting in saturation issues. Combining ℓ_r with ℓ_m alleviates this, but some issues with the output's palette remain, which are fixed by adding ℓ_p.

4.3 Relaxed EMD Approximation Quality

To measure how well the Relaxed EMD approximates the exact Earth Movers Distance we take each of the 900 possible content/style pairings formed by the 30 content and style images used in our AMT experiments for the unpaired regime. For each pairing we compute the REMD between 1024 features extracted from random coordinates, and the exact EMD based on the same set of features.
Figure 8: Quantitative evaluation of our method and prior work; we estimate the Pareto frontier of the methods evaluated by linear interpolation (dashed line).
Figure 9: Ablation study of the effects of our proposed style terms with low content loss (α = 4.0); columns show ℓ_m, ℓ_{R_A}, ℓ_{R_B}, ℓ_r, ℓ_r + ℓ_m, ℓ_r + ℓ_m + ℓ_p/α, and the style image. See text for analysis of each term's effect. Best viewed zoomed-in on screen.
We then analyze the distribution of the ratio REMD(A, B)/EMD(A, B). Because the REMD is a lower bound, this quantity is always ≤ 1. Over the 900 image pairs, its mean was 0.60, with standard deviation 0.04. A better EMD approximation, or one that is an upper bound rather than a lower bound, may yield better style transfer results. On the other hand, the REMD is simple to compute, empirically easy to optimize, and yields good results. (A small numerical sketch of this comparison follows Table 1.)

4.4 Timing Results

Image size    64   128   256   512   1024
Ours          20    38    60    95    154
Gatys          8    10    14    33    116
CNNMRF         3     8    27   117      X
Contextual    13    40   189   277      X
Reshuffle      -     -     -   69*      -

Table 1: Timing comparison (in seconds) between our method and others. The style and content images had the same dimensions and were square. *: a projected result, see text for details. -: we were not able to project these results. X: the method ran out of memory.
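The comparison in Section 4.3 can be reproduced with a short NumPy/SciPy sketch. Because both feature sets have equal size and uniform mass, the exact EMD reduces to an optimal assignment problem, which SciPy's Hungarian solver handles; the random matrices below are stand-ins for the 1024 hypercolumn features used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_cost(A, B, eps=1e-8):
    A_n = A / (np.linalg.norm(A, axis=1, keepdims=True) + eps)
    B_n = B / (np.linalg.norm(B, axis=1, keepdims=True) + eps)
    return 1.0 - A_n @ B_n.T

def remd(C):
    """Relaxed EMD (Eq. 8): a lower bound on the exact EMD."""
    return max(C.min(axis=1).mean(), C.min(axis=0).mean())

def exact_emd(C):
    """With n = m and uniform masses the optimal transport plan is a permutation
    scaled by 1/n, so the EMD equals the mean cost of the optimal assignment."""
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].mean()

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 64))   # stand-in for sampled output hypercolumns
B = rng.standard_normal((256, 64))   # stand-in for sampled style hypercolumns
C = cosine_cost(A, B)
print(remd(C) / exact_emd(C))        # always <= 1; the paper reports a mean of 0.60 on real features
```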
References

[1] K. Aberman, J. Liao, M. Shi, D. Lischinski, B. Chen, and D. Cohen-Or. Neural best-buddies. ACM Transactions on Graphics, 37(4):114, Jul 2018.
[2] G. Berger and R. Memisevic. Incorporating long-range consistency in cnn-based texture generation. arXiv preprint arXiv:1606.01286, 2016.
[3] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 341–346. ACM, 2001.
[4] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[5] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[6] S. Gu, C. Chen, J. Liao, and L. Yuan. Arbitrary style transfer with deep feature reshuffle.
[7] P. Haeberli. Paint by numbers: Abstract image representations. In ACM SIGGRAPH Computer Graphics, volume 24, pages 207–214. ACM, 1990.
[8] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.
[9] A. Hertzmann. Painterly rendering with curved brush strokes of multiple sizes. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pages 453–460. ACM, 1998.
[10] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 327–340. ACM, 2001.
[11] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.
[12] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization.
[13] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[14] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.
[15] C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
[16] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474, 2018.
[17] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang. Visual attribute transfer through deep image analogy. SIGGRAPH, 2017.
[18] M. Lu, H. Zhao, A. Yao, F. Xu, Y. Chen, and L. Zhang. Decoder network over lightweight reconstructed feature for fast semantic style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2469–2477, 2017.
[19] R. Mechrez, E. Shechtman, and L. Zelnik-Manor. Photorealistic style transfer with screened poisson equation. arXiv preprint arXiv:1709.09828, 2017.
[20] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image transformation with non-aligned data. arXiv preprint arXiv:1803.02077, 2018.
[21] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3376–3385, 2015.
[22] E. Risser, P. Wilmot, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893, 2017.
[23] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision, pages 59–66. IEEE, 1998.
[24] A. Sanakoyeu, D. Kotovenko, S. Lang, and B. Ommer. A style-aware content loss for real-time hd style transfer. 2018.
[25] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2007.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.