
Graph Optimal Transport for Cross-Domain Alignment

Liqun Chen 1 Zhe Gan 2 Yu Cheng 2 Linjie Li 2 Lawrence Carin 1 Jingjing Liu 2

Abstract

Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing methods mainly focus on designing advanced attention mechanisms to simulate soft alignment, with no training signals to explicitly encourage alignment. The learned attention matrices are also dense and lack interpretability. We propose Graph Optimal Transport (GOT), a principled framework that germinates from recent advances in Optimal Transport (OT). In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities as a dynamically-constructed graph. Two types of OT distances are considered: (i) Wasserstein distance (WD) for node (entity) matching; and (ii) Gromov-Wasserstein distance (GWD) for edge (structure) matching. Both WD and GWD can be incorporated into existing neural network models, effectively acting as a drop-in regularizer. The inferred transport plan also yields sparse and self-normalized alignment, enhancing the interpretability of the learned model. Experiments show consistent outperformance of GOT over baselines across a wide range of tasks, including image-text retrieval, visual question answering, image captioning, machine translation, and text summarization.

*Most of this work was done when the first author was an intern at Microsoft. ¹Duke University. ²Microsoft Dynamics 365 AI Research. Correspondence to: Liqun Chen <[email protected]>, Zhe Gan <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

Cross-domain Alignment (CDA), which aims to associate related entities across different domains, plays a central role in a wide range of deep learning tasks, such as image-text retrieval (Karpathy & Fei-Fei, 2015; Lee et al., 2018), visual question answering (VQA) (Malinowski & Fritz, 2014; Antol et al., 2015), and machine translation (Bahdanau et al., 2015; Vaswani et al., 2017). Considering VQA as an example, in order to understand the contexts in the image and the question, a model needs to interpret the latent alignment between regions in the input image and words in the question. Specifically, a good model should: (i) identify entities of interest in both the image (e.g., objects/regions) and the question (e.g., words/phrases); (ii) quantify both intra-domain (within the image or sentence) and cross-domain relations between these entities; and then (iii) design good metrics for measuring the quality of cross-domain alignment drawn from these relations, in order to optimize towards better results.

CDA is particularly challenging because it constitutes a weakly supervised learning task. That is, only paired sets of entities are given (e.g., an image paired with a question), while the ground-truth relations between these entities are not provided (e.g., there is no supervision signal for a “dog” region in an image aligning with the word “dog” in the question). State-of-the-art methods principally focus on designing advanced attention mechanisms to simulate soft alignment (Bahdanau et al., 2015; Xu et al., 2015; Yang et al., 2016b;a; Vaswani et al., 2017). For example, Lee et al. (2018), Kim et al. (2018), and Yu et al. (2019) have shown that learned co-attention can model dense interactions between entities and infer cross-domain latent alignments for vision-and-language tasks. Graph attention has also been applied to relational reasoning for image captioning (Yao et al., 2018) and VQA (Li et al., 2019a), such as the graph attention network (GAT) (Veličković et al., 2018) for capturing relations between entities in a graph via masked attention, and the graph matching network (GMN) (Li et al., 2019b) for graph alignment via cross-graph soft attention. However, conventional attention mechanisms are guided by task-specific losses, with no training signal to explicitly encourage alignment, and the learned attention matrices are often dense and uninterpretable, thus inducing less effective relational inference.
We address whether there is a more principled approach to scalable discovery of cross-domain relations. To explore this, we present Graph Optimal Transport (GOT),¹ a new framework for cross-domain alignment that leverages recent advances in Optimal Transport (OT). OT-based learning aims to optimize distribution matching by minimizing the cost of transporting one distribution to another. We extend this to CDA (here a domain can be language, images, videos, etc.). The transport plan is thus redefined as transporting the distribution of embeddings from one domain (e.g., language) to another (e.g., images). By minimizing the cost of the learned transport plan, we explicitly minimize the embedding distance between the domains, i.e., we optimize towards better cross-domain alignment.

¹Another GOT framework was proposed in Maretic et al. (2019) for graph comparison. We use the same acronym for the proposed algorithm; however, our method is very different from theirs.

Specifically, we convert the entities (e.g., objects, words) in each domain (e.g., image, sentence) into a graph, where each entity is represented by a feature vector, and the graph representations are recurrently updated via graph propagation. Cross-domain alignment can then be formulated as a graph matching problem, and addressed by calculating matching scores based on graph distance. In our GOT framework, we utilize two types of OT distance: (i) Wasserstein distance (WD) (Peyré et al., 2019) is applied to node (entity) matching, and (ii) Gromov-Wasserstein distance (GWD) (Peyré et al., 2016) is adopted for edge (structure) matching. WD only measures the distance between node embeddings across domains, without considering the topological information encoded in the graphs. GWD, on the other hand, compares graph structures by measuring the distance between pairs of nodes within each graph. When fused together, the two distances allow the proposed GOT framework to take into account both node and edge information for better graph matching.

The main contributions of this work are summarized as follows. (i) We propose Graph Optimal Transport (GOT), a new framework that tackles cross-domain alignment by adopting Optimal Transport for graph matching. (ii) GOT is compatible with existing neural network models, acting as an effective drop-in regularizer to the original objective. (iii) To demonstrate the versatile generalization ability of the proposed approach, we conduct experiments on five diverse tasks: image-text retrieval, visual question answering, image captioning, machine translation, and text summarization. Results show that GOT provides consistent performance enhancement over strong baselines across all the tasks.

2. Graph Optimal Transport Framework

We first introduce the problem formulation of Cross-domain Alignment in Sec. 2.1, then present the proposed Graph Optimal Transport (GOT) framework in Secs. 2.2-2.4.

2.1. Problem Formulation

Assume we have two sets of entities from two different domains (denoted as Dx and Dy). For each set, every entity is represented by a feature vector, i.e., X̃ = {x̃_i}_{i=1}^{n} and Ỹ = {ỹ_j}_{j=1}^{m}, where n and m are the number of entities in each domain, respectively. The scope of this paper mainly focuses on tasks involving images and text, so entities here correspond to objects in an image or words in a sentence. An image can be represented as a set of detected objects, each associated with a feature vector (e.g., from a pre-trained Faster RCNN (Anderson et al., 2018)). With a word embedding layer, a sentence can be represented as a sequence of word feature vectors.

A deep neural network f_θ(·) can be designed to take both X̃ and Ỹ as initial inputs, and generate contextualized representations:

    X, Y = f_θ(X̃, Ỹ),    (1)

where X = {x_i}_{i=1}^{n}, Y = {y_j}_{j=1}^{m}, and advanced attention mechanisms (Bahdanau et al., 2015; Vaswani et al., 2017) can be applied to f_θ(·) to simulate soft alignment. The final supervision signal l is then used to learn θ, i.e., the training objective is defined as:

    L(θ) = L_sup(X, Y, l).    (2)

Several instantiations for different tasks are summarized as follows. (i) Image-Text Retrieval: X̃ and Ỹ are image and text features, respectively; l is the binary label indicating whether the input image and sentence are paired or not. Here f_θ(·) can be the SCAN model (Lee et al., 2018), and L_sup(·) corresponds to a ranking loss (Faghri et al., 2018; Chechik et al., 2010). (ii) VQA: here l denotes the ground-truth answer, f_θ(·) can be the BUTD or BAN model (Anderson et al., 2018; Kim et al., 2018), and L_sup(·) is the cross-entropy loss. (iii) Machine Translation: X̃ and Ỹ are textual features from the source and target sentences, respectively. Here f_θ(·) can be an encoder-decoder Transformer model (Vaswani et al., 2017), and L_sup(·) corresponds to a cross-entropy loss that models the conditional distribution p(Y|X); in this case l is not needed. To simplify subsequent discussion, all the tasks are abstracted into f_θ(·) and L_sup(·).

In most previous work, the learned attention can be interpreted as a soft alignment between X̃ and Ỹ. However, only the final supervision signal L_sup(·) is used for model training, so there is no objective that explicitly encourages cross-domain alignment. To enforce alignment and cast a regularizing effect on model training, we propose a new objective for Cross-domain Alignment:

    L(θ) = L_sup(X, Y, l) + α · L_CDA(X, Y),    (3)

where L_CDA(·) is a regularization term that explicitly encourages alignment, and α is a hyper-parameter that balances the two terms. Through gradient back-propagation, the learned θ supports more effective relational inference. We describe L_CDA(·) in detail in Section 2.4.
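To make the abstraction in (1)-(3) concrete, the following is a minimal PyTorch-style sketch of one training step with the proposed regularizer. The callables f_theta, task_loss, and got_distance are hypothetical stand-ins for the task model, the supervised loss L_sup, and the GOT distance of Sec. 2.4; they are assumptions for illustration, not the released implementation.

```python
def training_step(f_theta, task_loss, got_distance,
                  x_tilde, y_tilde, label, alpha=0.1):
    """One optimization step with the CDA regularizer of Eq. (3).

    x_tilde: (n, d) entity features from domain X (e.g., image regions)
    y_tilde: (m, d) entity features from domain Y (e.g., word embeddings)
    alpha:   weight of the alignment regularizer (hyper-parameter)
    """
    # Contextualized representations X, Y = f_theta(X~, Y~)   -- Eq. (1)
    x, y = f_theta(x_tilde, y_tilde)

    # Task-specific supervised loss L_sup(X, Y, l)            -- Eq. (2)
    l_sup = task_loss(x, y, label)

    # GOT distance used as the alignment regularizer L_CDA    -- Eq. (3)
    l_cda = got_distance(x, y)

    return l_sup + alpha * l_cda
```

The returned scalar is back-propagated as usual, so the alignment term shapes θ jointly with the task loss.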
Algorithm 1 Computing Wasserstein Distance.
1: Input: {x_i}_{i=1}^{n}, {y_j}_{j=1}^{m}, β
2: σ = (1/m) 1_m, T^(1) = 1 1^⊤
3: C_ij = c(x_i, y_j), A_ij = e^{−C_ij / β}
4: for t = 1, 2, 3, . . . do
5:   Q = A ⊙ T^(t)   // ⊙ is the Hadamard product
6:   for k = 1, 2, 3, . . . , K do
7:     δ = 1 / (n Q σ),  σ = 1 / (m Q^⊤ δ)
8:   end for
9:   T^(t+1) = diag(δ) Q diag(σ)
10: end for
11: D_wd = ⟨C^⊤, T⟩
12: Return T, D_wd   // ⟨·, ·⟩ is the Frobenius dot-product

Algorithm 2 Computing Gromov-Wasserstein Distance.
1: Input: {x_i}_{i=1}^{n}, {y_j}_{j=1}^{m}, probability vectors p, q
2: Compute intra-domain similarities:
3:   [C_x]_ij = cos(x_i, x_j), [C_y]_ij = cos(y_i, y_j)
4: Compute cross-domain similarities:
5:   C_xy = C_x^2 p 1_m^⊤ + 1_n q^⊤ (C_y^2)^⊤
6: for t = 1, 2, 3, . . . do
7:   // Compute the pseudo-cost matrix
8:   L = C_xy − 2 C_x T C_y^⊤
9:   Apply Algorithm 1 to solve the transport plan T
10: end for
11: D_gw = ⟨L^⊤, T⟩
12: Return T, D_gw

Figure 1. Illustration of the Wasserstein Distance (WD) and the Gromov-Wasserstein Distance (GWD) used for node and structure matching, respectively. WD: c(a, b) is calculated between nodes a and b across two domains; GWD: L(x, y, x′, y′) is calculated between edges c_1(x, x′) and c_2(y, y′). See Sec. 2.3 for details.

2.2. Dynamic Graph Construction

Image and text data inherently contain rich sequential/spatial structures. By representing them as graphs and performing graph alignment, not only can cross-domain relations be modeled, but intra-domain relations are also exploited (e.g., semantic/spatial relations among detected objects in an image (Li et al., 2019a)).

Given X, we aim to construct a graph G_x(V_x, E_x), where each node i ∈ V_x is represented by a feature vector x_i. To add edges E_x, we first calculate the similarity between each pair of entities inside a graph: C_x = {cos(x_i, x_j)}_{i,j} ∈ R^{n×n}. Further, we define C_x = max(C_x − τ, 0), where τ is a threshold hyper-parameter for the graph cost matrix; empirically, τ is set to 0.1. If [C_x]_ij > 0, an edge is added between nodes i and j. Given Y, another graph G_y(V_y, E_y) can be constructed similarly. Since both X and Y evolve through the update of parameters θ during training, this graph construction process is considered “dynamic”. By representing the entities in both domains as graphs, cross-domain alignment is naturally formulated as a graph matching problem.

In our proposed framework, we use Optimal Transport (OT) for graph matching, where a transport plan T ∈ R^{n×m} is learned to optimize the alignment between X and Y. OT possesses several idiosyncratic characteristics that make it a good choice for solving the CDA problem: (i) Self-normalization: all the elements of T* sum to 1 (Peyré et al., 2019). (ii) Sparsity: when solved exactly, OT yields a sparse solution T* containing at most (2r − 1) non-zero elements, where r = max(n, m), leading to a more interpretable and robust alignment (De Goes et al., 2011). (iii) Efficiency: compared with conventional linear programming solvers, our solution can be readily obtained using iterative procedures that only require matrix-vector products (Xie et al., 2018), and is hence readily applicable to large deep neural networks.

2.3. Optimal Transport Distances

As illustrated in Figure 1, two types of OT distance are adopted for our graph matching: Wasserstein distance for node matching, and Gromov-Wasserstein distance for edge matching.

Wasserstein Distance. Wasserstein distance (WD) is commonly used for matching two distributions (e.g., two sets of node embeddings). In our setting, discrete WD can be used as a solver for network flow and bipartite matching (Luise et al., 2018). The definition of WD is as follows.

Definition 2.1. Let µ ∈ P(X), ν ∈ P(Y) denote two discrete distributions, formulated as µ = Σ_{i=1}^{n} u_i δ_{x_i} and ν = Σ_{j=1}^{m} v_j δ_{y_j}, with δ_x the Dirac function centered on x. Π(µ, ν) denotes the set of all joint distributions γ(x, y) with marginals µ(x) and ν(y). The weight vectors u = {u_i}_{i=1}^{n} ∈ Δ_n and v = {v_j}_{j=1}^{m} ∈ Δ_m belong to the n- and m-dimensional simplex, respectively (i.e., Σ_{i=1}^{n} u_i = Σ_{j=1}^{m} v_j = 1), so both µ and ν are probability distributions. The Wasserstein distance between the two discrete distributions µ, ν is defined as:

    D_w(µ, ν) = inf_{γ ∈ Π(µ,ν)} E_{(x,y)∼γ} [c(x, y)]
              = min_{T ∈ Π(u,v)} Σ_{i=1}^{n} Σ_{j=1}^{m} T_ij · c(x_i, y_j),    (4)

where Π(u, v) = {T ∈ R_+^{n×m} | T 1_m = u, T^⊤ 1_n = v}, 1_n denotes an n-dimensional all-one vector, and c(x_i, y_j) is the cost function evaluating the distance between x_i and y_j. For example, the cosine distance c(x_i, y_j) = 1 − (x_i^⊤ y_j)/(||x_i||_2 ||y_j||_2) is a popular choice. The matrix T is denoted as the transport plan, where T_ij represents the amount of mass shifted from u_i to v_j.

D_w(µ, ν) defines an optimal transport distance that measures the discrepancy between each pair of samples across the two domains. In our graph matching, this is a natural choice for node (entity) matching.
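As a rough illustration of how a plan T for the problem in Eq. (4) can be obtained in practice, the sketch below implements a plain entropy-regularized Sinkhorn solver (the regularized form used later in Eq. (7)); it assumes uniform marginals u = 1/n, v = 1/m and is a simplification of the proximal-point iteration in Algorithm 1, not a drop-in replacement for it.

```python
import torch

def sinkhorn(C, beta=0.5, n_iters=50):
    """Entropy-regularized OT: returns the transport plan T and <C, T>.

    C: (n, m) cost matrix, e.g. C[i, j] = 1 - cos(x_i, y_j).
    Assumes uniform marginals u = 1/n and v = 1/m.
    """
    n, m = C.shape
    u = torch.full((n,), 1.0 / n)
    v = torch.full((m,), 1.0 / m)
    K = torch.exp(-C / beta)          # Gibbs kernel of the cost matrix
    a = torch.ones_like(u)
    for _ in range(n_iters):          # alternating marginal scaling
        b = v / (K.t() @ a)
        a = u / (K @ b)
    T = torch.diag(a) @ K @ torch.diag(b)
    return T, torch.sum(T * C)
```

Because every step is a matrix-vector product, the same routine runs on GPU and back-propagates through C, which is what makes the distance usable as a training signal.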
Gromov-Wasserstein Distance. Instead of directly calculating distances between two sets of nodes as in WD, Gromov-Wasserstein distance (GWD) (Peyré et al., 2016; Chowdhury & Mémoli, 2019) can be used to calculate distances between pairs of nodes within each domain, as well as to measure how these distances compare to those in the counterpart domain. GWD in the discrete matching setting can be formulated as follows.

Definition 2.2. Following the same notation as in Definition 2.1, the Gromov-Wasserstein distance between µ, ν is defined as:

    D_gw(µ, ν) = inf_{γ ∈ Π(µ,ν)} E_{(x,y)∼γ, (x′,y′)∼γ} [L(x, y, x′, y′)]
               = min_{T̂ ∈ Π(u,v)} Σ_{i,i′,j,j′} T̂_ij T̂_i′j′ L(x_i, y_j, x′_i, y′_j),    (5)

where L(·) is the cost function evaluating the intra-graph structural similarity between two pairs of nodes (x_i, x′_i) and (y_j, y′_j), i.e., L(x_i, y_j, x′_i, y′_j) = ||c_1(x_i, x′_i) − c_2(y_i, y′_i)||, and c_i, i ∈ [1, 2] are functions that evaluate node similarity within the same graph (e.g., the cosine similarity).

Similar to WD, in the GWD setting, c_1(x_i, x′_i) and c_2(y_i, y′_i) (corresponding to the edges) can be viewed as two nodes in the dual graphs (Van Lint et al., 2001), where edges are projected into nodes. The learned matrix T̂ then becomes a transport plan that helps align the edges in different graphs. Note that the same c_1 and c_2 are also used for graph construction in Sec. 2.2.
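To show how the quantities of Secs. 2.2-2.3 can be assembled in practice, here is a small PyTorch-style sketch of the thresholded intra-graph similarity matrices and the cross-graph cosine cost used for node matching. Function names and the default threshold are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def intra_graph_similarity(x, tau=0.1):
    """C_x[i, j] = cos(x_i, x_j), thresholded as max(C_x - tau, 0) (Sec. 2.2).

    x: (n, d) node features of one graph.
    An edge (i, j) exists exactly when the returned entry is positive.
    """
    x_norm = F.normalize(x, dim=-1)
    cos = x_norm @ x_norm.t()
    return torch.clamp(cos - tau, min=0.0)

def cross_graph_cost(x, y):
    """Node-matching cost c(x_i, y_j) = 1 - cos(x_i, y_j) for the WD term (Eq. 4)."""
    x_norm = F.normalize(x, dim=-1)
    y_norm = F.normalize(y, dim=-1)
    return 1.0 - x_norm @ y_norm.t()
```

Since both matrices are recomputed from the current X and Y at every step, the graphs are rebuilt dynamically as θ is updated, exactly in the spirit of Sec. 2.2.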
2.4. Graph Matching via OT Distances

Though GWD is capable of capturing edge similarity between graphs, it cannot be directly applied to graph alignment, since only the similarity between c_1(x_i, x′_i) and c_2(y_i, y′_i) is considered, without taking node representations into account. For example, the word pair (“boy”, “girl”) has a cosine similarity close to that of the pair (“football”, “basketball”), but the semantic meanings of the two pairs are completely different and they should not be matched.

On the other hand, WD can match nodes in different graphs, but fails to capture the similarity between edges. If there are duplicated entities represented by different nodes in the same graph, WD will treat them as identical and ignore their neighboring relations. For example, given the sentence “there is a red book on the blue desk” paired with an image containing several desks and books in different colors, it is difficult to correctly identify which book in the image the sentence refers to without understanding the relations among the objects in the image.

To best couple WD and GWD and unify these two distances in a mutually beneficial way, we propose a transport plan T shared by both WD and GWD. Compared with naively employing two different transport plans, we observe that this joint plan works better (see Table 8), and faster, since we only need to solve T once (instead of twice). Intuitively,
with a shared transport plan, WD and GWD can enhance each other effectively, as T utilizes both node and edge information simultaneously. Formally, the proposed GOT distance is defined as:

    D_got(µ, ν) = min_{T ∈ Π(u,v)} Σ_{i,i′,j,j′} T_ij [ λ c(x_i, y_j) + (1 − λ) T_i′j′ L(x_i, y_j, x′_i, y′_j) ].    (6)

We apply the Sinkhorn algorithm (Cuturi, 2013; Cuturi & Peyré, 2017) to solve WD (4) with an entropic regularizer (Benamou et al., 2015):

    min_{T ∈ Π(u,v)} Σ_{i=1}^{n} Σ_{j=1}^{m} T_ij c(x_i, y_j) + β H(T),    (7)

where H(T) = Σ_{i,j} T_ij log T_ij, and β is the hyper-parameter controlling the importance of the entropy term. Details are provided in Algorithm 1. The solver for GWD can be readily developed based on Algorithm 1, where p, q are defined as uniform distributions (as shown in Algorithm 2), following Alvarez-Melis & Jaakkola (2018). With the help of the Sinkhorn algorithm, GOT can be efficiently implemented in popular deep learning libraries, such as PyTorch and TensorFlow.

To obtain a unified solver for the GOT distance, we define the unified cost function as:

    L_unified = λ c(x, y) + (1 − λ) L(x, y, x′, y′),    (8)

where λ is the hyper-parameter controlling the relative importance of the two cost functions. Instead of using projected gradient descent or conjugate gradient descent as in Xu et al. (2019b;a) and Vayer et al. (2018), we can approximate the transport plan T by adding L_unified back into Algorithm 2, so that Line 9 in Algorithm 2 solves T for WD and GWD at the same time, effectively matching both nodes and edges simultaneously. The solver for calculating the GOT distance is illustrated in Figure 2, and the detailed algorithm is summarized in Algorithm 3. The calculated GOT distance is used as the cross-domain alignment loss L_CDA(X, Y) in (3), as a regularizer to update the parameters θ.

Figure 2. Schematic computation graph of the Graph Optimal Transport (GOT) distance used for cross-domain alignment. WD is short for Wasserstein Distance, and GWD is short for Gromov-Wasserstein Distance. See Sec. 2.1 and 2.4 for details.

Algorithm 3 Computing GOT Distance.
1: Input: {x_i}_{i=1}^{n}, {y_j}_{j=1}^{m}, hyper-parameter λ
2: Compute intra-domain similarities:
3:   [C_x]_ij = cos(x_i, x_j), [C_y]_ij = cos(y_i, y_j)
4: x′_i = g_1(x_i), y′_j = g_2(y_j)   // g_1, g_2 denote two MLPs
5: Compute cross-domain similarities:
6:   C_ij = cos(x′_i, y′_j)
7: if T is shared then
8:   Update L in Algorithm 2 (Line 8) with:
9:     L_unified = λ C + (1 − λ) L
10:  Plug L_unified back into Algorithm 2 and solve the new T
11:  Compute D_got
12: else
13:  Apply Algorithm 1 to obtain D_w
14:  Apply Algorithm 2 to obtain D_gw
15:  D_got = λ D_w + (1 − λ) D_gw
16: end if
17: Return D_got
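The sketch below is one possible reading of the shared-plan branch of Algorithm 3: it alternates between building the Gromov-Wasserstein pseudo-cost from the current plan and re-solving the plan on the fused cost λC + (1 − λ)L. It reuses the sinkhorn, intra_graph_similarity, and cross_graph_cost helpers sketched earlier, assumes uniform marginals and a squared-loss decomposition of the edge cost, and omits the g_1, g_2 MLPs; it is an illustrative simplification, not the released code.

```python
import torch

def got_distance(x, y, sinkhorn, lam=0.8, tau=0.1, outer_iters=5):
    """Fused GOT distance with a transport plan shared by WD and GWD (Sec. 2.4).

    x: (n, d) and y: (m, d) contextualized node features; `sinkhorn` maps a
    cost matrix to (plan, distance), e.g. the solver sketched in Sec. 2.3.
    """
    cx = intra_graph_similarity(x, tau)             # (n, n) edge weights of G_x
    cy = intra_graph_similarity(y, tau)             # (m, m) edge weights of G_y
    c_node = cross_graph_cost(x, y)                 # (n, m) node-matching cost

    n, m = c_node.shape
    p = torch.full((n, 1), 1.0 / n)                 # uniform marginals
    q = torch.full((m, 1), 1.0 / m)
    # Constant part of the GW pseudo-cost (squared-loss decomposition)
    c_const = (cx ** 2) @ p @ torch.ones(1, m) + torch.ones(n, 1) @ q.t() @ (cy ** 2).t()

    T = torch.full((n, m), 1.0 / (n * m))           # initial plan
    for _ in range(outer_iters):
        l_edge = c_const - 2.0 * cx @ T @ cy.t()    # GW pseudo-cost for current T
        fused = lam * c_node + (1.0 - lam) * l_edge # unified cost, Eq. (8)
        T, _ = sinkhorn(fused)                      # one shared plan for both terms
    fused = lam * c_node + (1.0 - lam) * (c_const - 2.0 * cx @ T @ cy.t())
    return torch.sum(T * fused)
```

Returning the Frobenius inner product of the final plan with the fused cost gives a scalar that can be plugged directly into the regularized objective (3).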
3. Related Work

Optimal Transport. Wasserstein distance (WD), a.k.a. Earth Mover's distance, has been widely applied to machine learning tasks. In computer vision, Rubner et al. (1998) use WD to discover the structure of color distributions for image search. In natural language processing, WD has been applied to document retrieval (Kusner et al., 2015) and sequence-to-sequence learning (Chen et al., 2019a). There are also studies adopting WD in Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Salimans et al., 2018; Chen et al., 2018; Mroueh et al., 2018; Zhang et al., 2020) to alleviate the mode-collapse issue. Recently, it has also been used for vision-and-language pre-training to encourage word-region alignment (Chen et al., 2019b). Besides WD, the Gromov-Wasserstein distance (Peyré et al., 2016) has been proposed for distributional metric matching and applied to unsupervised machine translation (Alvarez-Melis & Jaakkola, 2018).

There are different ways to solve the OT distance, such as linear programming. However, this solver is not differentiable, so it cannot be applied in deep learning frameworks. WGAN (Arjovsky et al., 2017) proposes to approximate the dual form of WD by imposing a 1-Lipschitz constraint on the discriminator; note that the duality used for WGAN is restricted to the W-1 distance, i.e., ||·||. The Sinkhorn algorithm was first proposed in Cuturi (2013) as a solver for calculating an entropic regularized OT distance. Thanks to the Envelope Theorem (Cuturi & Peyré, 2017), the Sinkhorn algorithm can be efficiently calculated and readily applied to neural networks. More recently, Vayer et al. (2018) proposed the fused GWD for graph matching. Our proposed GOT framework enjoys the benefits of both the Sinkhorn algorithm and fused GWD: it is (i) capable of capturing more structured information by marrying WD and GWD, and (ii) scalable to large datasets and trainable with deep neural networks.

Graph Neural Network. Neural networks operating on graph data were first introduced in Gori et al. (2005) using recurrent neural networks. Later, Duvenaud et al. (2015) proposed a convolutional neural network over graphs for classification tasks. However, these methods suffer from scalability issues, because they need to learn node-degree-specific weight matrices for large graphs. To alleviate this issue, Kipf & Welling (2016) proposed to use a single weight matrix per layer in the neural network, which is capable of handling varying node degrees through an appropriate normalization of the adjacency matrix of the data. To further improve the classification accuracy, the graph attention network (GAT) (Veličković et al., 2018) was proposed, which uses a learned weight matrix instead of the adjacency matrix, with masked attention to aggregate node neighborhood information.

Recently, graph neural networks have been extended to tasks beyond classification. Li et al. (2019b) proposed the graph matching network (GMN) for learning similarities between graphs. Similar to GAT, masked attention is applied to aggregate information from each node within a graph, and cross-graph information is further exploited via soft attention. Task-specific losses are then used to guide model training. In this setting, an adjacency matrix can be directly obtained from the data and soft attention is used to induce alignment. In contrast, our GOT framework does not rely on explicit graph structures in the data, and uses OT for graph alignment.

4. Experiments

To validate the effectiveness of the proposed GOT framework, we evaluate performance on a selection of diverse tasks. We first consider vision-and-language understanding, including: (i) image-text retrieval, and (ii) visual question answering. We further consider text generation tasks, including: (iii) image captioning, (iv) machine translation, and (v) abstractive text summarization. Code is available at https://github.com/LiqunChen0606/Graph-Optimal-Transport.

4.1. Vision-and-Language Tasks

Image-Text Retrieval. For the image-text retrieval task, we use a pre-trained Faster R-CNN (Ren et al., 2015) to extract bottom-up-attention features (Anderson et al., 2018) as the image representation. A set of 36 features is created for each image, each feature represented by a 2048-dimensional vector. For captions, a bi-directional GRU (Schuster & Paliwal, 1997; Bahdanau et al., 2015) is used to obtain textual features.

We evaluate our model on the Flickr30K (Plummer et al., 2015) and COCO (Lin et al., 2014) datasets. Flickr30K contains 31,000 images, with five human-annotated captions per image. We follow previous work (Karpathy & Fei-Fei, 2015; Faghri et al., 2018) for the data split: 29,000, 1,000 and 1,000 images are used for training, validation and test, respectively. COCO contains 123,287 images, each also accompanied by five captions. We follow the data split in Faghri et al. (2018), where 113,287, 5,000 and 5,000 images are used for training, validation and test, respectively.

We measure the performance of image retrieval and sentence retrieval with Recall at K (R@K) (Karpathy & Fei-Fei, 2015), defined as the percentage of queries retrieving the correct images/sentences within the top K highest-ranked results. In our experiments, K = {1, 5, 10}, and Rsum (Huang et al., 2017) (the summation over all R@K) is used to evaluate overall performance. Results are summarized in Table 1. Both WD and GWD can boost the performance of the SCAN model, while WD achieves a larger margin than GWD. This indicates that when used alone, GWD may not be a good metric for graph alignment. When combining the two distances together, GOT achieves the best performance.

                                                 Sentence Retrieval        Image Retrieval
Method                                           R@1   R@5   R@10    R@1   R@5   R@10    Rsum
VSE++ (ResNet) (Faghri et al., 2018)             52.9   –    87.2    39.6   –    79.5     –
DPC (ResNet) (Zheng et al., 2020)                55.6  81.9  89.5    39.1  69.2  80.9    416.2
DAN (ResNet) (Nam et al., 2017)                  55.0  81.8  89.0    39.4  69.2  79.1    413.5
SCO (ResNet) (Huang et al., 2018)                55.5  82.0  89.3    41.1  70.5  80.1    418.5
SCAN (Faster R-CNN, ResNet) (Lee et al., 2018)   67.7  88.9  94.0    44.0  74.2  82.6    452.2
Ours (Faster R-CNN, ResNet):
SCAN + WD                                        70.9  92.3  95.2    49.7  78.2  86.0    472.3
SCAN + GWD                                       69.5  91.2  95.2    48.8  78.1  85.8    468.6
SCAN + GOT                                       70.9  92.8  95.5    50.7  78.7  86.2    474.8
VSE++ (ResNet) (Faghri et al., 2018)             41.3   –    81.2    30.3   –    72.4     –
DPC (ResNet) (Zheng et al., 2020)                41.2  70.5  81.1    25.3  53.4  66.4    337.9
GXN (ResNet) (Gu et al., 2018)                   42.0   –    84.7    31.7   –    74.6     –
SCO (ResNet) (Huang et al., 2018)                42.8  72.3  83.0    33.1  62.9  75.5    369.6
SCAN (Faster R-CNN, ResNet) (Lee et al., 2018)   46.4  77.4  87.2    34.4  63.7  75.7    384.8
Ours (Faster R-CNN, ResNet):
SCAN + WD                                        50.2  80.1  89.5    37.9  66.8  78.1    402.6
SCAN + GWD                                       47.2  78.3  87.5    34.9  64.4  76.3    388.6
SCAN + GOT                                       50.5  80.2  89.8    38.1  66.8  78.5    403.9

Table 1. Results on image-text retrieval evaluated on Recall@K (R@K). Upper panel: Flickr30K; lower panel: COCO.
Figure 3. (a) A comparison of the inferred transport plan from GOT (top chart) and the learned attention matrix from SCAN (bottom chart). Both serve as a lens to visualize cross-domain alignment. The horizontal axis represents image regions, and the vertical axis represents word tokens. (b) The original image.

Figure 3 provides a visualization of the learned transport plan in GOT and the learned attention matrix in SCAN. Both serve as a proxy to lend insight into the learned alignment. As shown, the attention matrix from SCAN is much denser and noisier than the transport plan inferred by GOT. This shows that our model can better discover cross-domain relations between image-text pairs, since the inferred transport plan is more interpretable and has less ambiguity. For example, both the words “sidewalk” and “skateboard” match the corresponding image regions very well.

Because of the Envelope Theorem (Cuturi & Peyré, 2017), GOT needs to be calculated only during the forward phase of model training. Therefore, it does not introduce much extra computation time. For example, when using the same machine for the image-text retrieval experiments, SCAN required 6hr 34min for training and SCAN+GOT 6hr 57min.

Visual Question Answering. We also consider the VQA 2.0 dataset (Goyal et al., 2017), which contains human-annotated QA pairs on COCO images (Lin et al., 2014). For each image, an average of 3 questions are collected, with 10 candidate answers per question. The most frequent answer from the annotators is selected as the correct answer. Following previous work (Kim et al., 2018), we take the answers that appear more than 9 times in the training set as candidate answers, which results in 3129 candidates. Classification accuracy is used as the evaluation metric, defined as min(1, (# humans that provided that answer)/3).

The BAN model (Kim et al., 2018) is used as the baseline, with the original codebase used for fair comparison. Results are summarized in Table 2. Both WD and GWD improve the BAN model on the validation set, and GOT achieves a further performance lift.

Model   BAN     BAN+GWD   BAN+WD   BAN+GOT
Score   66.00   66.21     66.26    66.44

Table 2. Results (accuracy) on the VQA 2.0 validation set, using BAN (Kim et al., 2018) as the baseline.

We also investigate whether different architecture designs affect the performance gain. We consider BUTD (Anderson et al., 2018) as an additional baseline, and apply different numbers of glimpses m to the BAN model, denoted as BAN-m. Results are summarized in Table 3, with the following observations: (i) When the number of parameters in the tested model is small, such as BUTD, the improvement brought by GOT is more significant. (ii) BAN-4, a simpler model than BAN-8, can outperform BAN-8 without GOT when combined with GOT (66.10 vs. 66.00). (iii) For complex models such as BAN-8 that might have limited space for improvement, GOT is still able to achieve a performance gain.

Model     BUTD    BAN-1   BAN-2   BAN-4   BAN-8
w/o GOT   63.37   65.37   65.61   65.81   66.00
w/ GOT    65.01   65.68   65.88   66.10   66.44

Table 3. Results (accuracy) of applying GOT to BUTD (Anderson et al., 2018) and BAN-m (Kim et al., 2018) on VQA 2.0. m denotes the number of glimpses.
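For completeness, the VQA accuracy rule used above can be written in a few lines; this is a small illustrative sketch of the min(1, #humans/3) convention with hypothetical variable names, not the official evaluation script.

```python
def vqa_accuracy(predicted_answer, human_answers):
    """VQA 2.0 accuracy: min(1, (# annotators giving this answer) / 3)."""
    matches = sum(1 for ans in human_answers if ans == predicted_answer)
    return min(1.0, matches / 3.0)
```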
4.2. Text Generation Tasks

Image Captioning. We conduct experiments on image captioning using the same COCO dataset. The same bottom-up-attention features (Anderson et al., 2018) used in image-text retrieval are adopted here. The text decoder is a one-layer LSTM with 256 hidden units. The word embedding dimension is set to 256. Results are summarized in Table 4. A similar performance gain is introduced by GOT. The relative performance boost from WD to GOT on the CIDEr score is (GOT − WD)/(WD − MLE) = (109.2 − 107.9)/(107.9 − 106.3) = 81.25%. This is attributed to the additional GWD introduced in GOT, which helps model implicit intra-domain relationships in images and captions, leading to more accurate caption generation.

Method                                 CIDEr   BLEU-4   BLEU-3   BLEU-2   BLEU-1   ROUGE   METEOR
Soft Attention (Xu et al., 2015)         -      24.3     34.4     49.2     70.7      -      23.9
Hard Attention (Xu et al., 2015)         -      25.0     35.7     50.4     71.8      -      23.0
Show & Tell (Vinyals et al., 2015)     85.5     27.7       -        -        -       -      23.7
ATT-FCN (You et al., 2016)               -      30.4     40.2     53.7     70.9      -      24.3
SCN-LSTM (Gan et al., 2017)           101.2     33.0     43.3     56.6     72.8      -      25.7
Adaptive Attention (Lu et al., 2017)  108.5     33.2     43.9     58.0     74.2      -      26.6
MLE                                   106.3     34.3     45.3     59.3     75.6     55.2    26.2
MLE + WD                              107.9     34.8     46.1     60.1     76.2     55.6    26.5
MLE + GWD                             106.6     33.3     45.2     59.1     75.7     55.0    25.9
MLE + GOT                             109.2     35.1     46.5     60.3     77.0     56.2    26.7

Table 4. Results of image captioning on the COCO dataset.

Machine Translation. In machine translation (and abstractive summarization), the word embedding spaces of the source and target sentences are different, and can be considered as different domains. Therefore, GOT can be used to align words with similar semantic meanings between the source and target sentences for better translation/summarization. We choose two machine translation benchmarks for experiments: (i) the English-Vietnamese TED-talks corpus, which contains 133K pairs of sentences from the IWSLT Evaluation Campaign (Cettolo et al., 2015); and (ii) a large-scale English-German parallel corpus with 4.5M pairs of sentences, from the WMT Evaluation Campaign (Vaswani et al., 2017). The Texar codebase (Hu et al., 2018) is used in our experiments.

We apply GOT to the Transformer model (Vaswani et al., 2017) and use the BLEU score (Papineni et al., 2002) as the evaluation metric. Results are summarized in Table 5. As also observed in Chen et al. (2019a), using WD can improve the performance of the Transformer for sequence-to-sequence learning. However, if only GWD is used, the test BLEU score drops. Since GWD can only match the edges, it ignores supervision signals from node representations. This serves as empirical evidence to support our hypothesis that using GWD alone may not be enough to improve performance. However, GWD serves as a complementary method for capturing graph information that might be missed by WD. Therefore, when combining the two together, GOT achieves the best performance. Example translations are provided in Table 7.

Model                                EN-VI uncased   EN-VI cased    EN-DE uncased   EN-DE cased
Transformer (Vaswani et al., 2017)   29.25 ± 0.18    28.46 ± 0.17   25.60 ± 0.07    25.12 ± 0.12
Transformer + WD                     29.49 ± 0.10    28.68 ± 0.14   25.83 ± 0.12    25.30 ± 0.11
Transformer + GWD                    28.65 ± 0.14    28.34 ± 0.16   25.42 ± 0.17    24.82 ± 0.15
Transformer + GOT                    29.92 ± 0.11    29.09 ± 0.18   26.05 ± 0.17    25.54 ± 0.15

Table 5. Results of neural machine translation on EN-DE and EN-VI.

Abstractive Summarization. We evaluate abstractive summarization on the English Gigaword benchmark (Graff et al., 2003). A basic LSTM model as implemented in Texar (Hu et al., 2018) is used in our experiments. ROUGE-1, -2 and -L scores (Lin, 2004) are reported. Table 6 shows that both GWD and WD can improve the performance of the LSTM. The transport plan for the alignment between source and output sentences is illustrated in Figure 4. The learned alignment is sparse and interpretable. For instance, the words “largest” and “projects” in the source sentence match the words “more” and “investment” in the output summary very well.

Method                      ROUGE-1   ROUGE-2   ROUGE-L
ABS+ (Rush et al., 2015)     31.00     12.65     28.34
LSTM (Hu et al., 2018)       36.11     16.39     32.32
LSTM + GWD                   36.31     17.32     33.15
LSTM + WD                    36.81     17.34     33.34
LSTM + GOT                   37.10     17.61     33.70

Table 6. Results of abstractive text summarization on the English Gigaword dataset.

Figure 4. Inferred transport plan for aligning source and output sentences in abstractive summarization.
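Alignment plots like Figures 3 and 4 can be produced directly from the solved plan. The sketch below is a minimal illustrative way to render a transport plan as a heatmap; the array T and the token lists are placeholders for whatever the solver returns, not the authors' plotting code.

```python
import matplotlib.pyplot as plt

def plot_transport_plan(T, source_tokens, output_tokens, path="alignment.png"):
    """Render a transport plan T of shape (len(source), len(output)) as a heatmap."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.imshow(T, cmap="Blues")                      # darker cell = more mass moved
    ax.set_xticks(range(len(output_tokens)))
    ax.set_xticklabels(output_tokens, rotation=90)
    ax.set_yticks(range(len(source_tokens)))
    ax.set_yticklabels(source_tokens)
    ax.set_xlabel("output tokens")
    ax.set_ylabel("source tokens")
    fig.tight_layout()
    fig.savefig(path)
```

Because the plan is self-normalized and sparse (Sec. 2.2), such plots tend to show a few strong cells per row, which is what makes them easier to read than dense attention maps.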
Reference: India’s new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May’s election.
MLE: India ‘ s new prime minister , Narendra Modi , meets his Japanese counterpart , Shinzo Abe , in Tokyo , during his first major foreign visit in May to discuss economic and security relations .
GOT: India ’ s new prime minister , Narendra Modi , is meeting his Japanese counterpart Shinzo Abe in Tokyo in his first major foreign visit since his election victory in May to discuss economic and security relations .

Reference: Chinese leaders presented the Sunday ruling as a democratic breakthrough because it gives Hong Kongers a direct vote, but the decision also makes clear that Chinese leaders would retain a firm hold on the process through a nominating committee tightly controlled by Beijing.
MLE: The Chinese leadership presented the decision of Sunday as a democratic breakthrough , because it gives Hong Kong citizens a direct right to vote , but the decision also makes it clear that the Chinese leadership maintains the expiration of a nomination committee closely controlled by Beijing .
GOT: The Chinese leadership presented the decision on Sunday as a democratic breakthrough , because Hong Kong citizens have a direct electoral right , but the decision also makes it clear that the Chinese leadership remains firmly in hand with a nominating committee controlled by Beijing .

Table 7. Comparison of German-to-English translation examples. For each example, we show the human translation (reference) and the translations from MLE and GOT. We highlight the key-phrase differences between the reference and the translation outputs in blue and red, and denote errors in translation in bold. In the first example, GOT correctly maintains all the information in “since winning May’s election” by translating it to “since his election victory in May”, whereas MLE only generates “in May”. In the second example, GOT successfully keeps the information “Beijing”, whereas MLE generates the wrong words “expiration of”.

4.3. Ablation Study

We conduct additional ablation studies on the EN-VI and EN-DE datasets for machine translation.

Shared Transport Plan T. As discussed in Sec. 2.4, we use a shared transport plan T to solve the GOT distance. An alternative is not to share this T matrix. The comparison results are provided in Table 8. GOT with a shared transport plan achieves better performance than the alternative. Since we only need to run the iterative Sinkhorn algorithm once, it also saves training time compared with the unshared case.

Model            EN-VI uncased   EN-DE uncased
GOT (shared)     29.92 ± 0.11    26.05 ± 0.18
GOT (unshared)   29.77 ± 0.12    25.89 ± 0.17

Table 8. Ablation study on the transport plan in machine translation. Both models were run 5 times with the same hyper-parameter setting.

Hyper-parameter λ. We perform an ablation study on the hyper-parameter λ in (6). We select λ from [0, 1] and report results in Table 9. When λ = 0.8, EN-VI translation performs best, which indicates that the weight on WD needs to be larger than the weight on GWD, since intuitively node matching is more important than edge matching for machine translation. However, both WD and GWD contribute to GOT achieving the best performance.

λ       0       0.1     0.3     0.5     0.8     1.0
BLEU    28.65   29.31   29.52   29.65   29.92   29.49

Table 9. Ablation study of the hyper-parameter λ on the EN-VI machine translation dataset.

5. Conclusions

We propose Graph Optimal Transport, a principled framework for cross-domain alignment. With the Wasserstein and Gromov-Wasserstein distances, both intra-domain and cross-domain relations are captured for better alignment. Empirically, we observe that enforcing alignment can serve as an effective regularizer for model training. Extensive experiments show that the proposed method is a generic framework that can be applied to a wide range of cross-domain tasks. For future work, we plan to apply the proposed framework to self-supervised representation learning.

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments. The research at Duke University was supported in part by DARPA, DOE, NIH, NSF and ONR.

References

Alvarez-Melis, D. and Jaakkola, T. S. Gromov-Wasserstein alignment of word embedding spaces. arXiv:1809.00013, 2018.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.

Antol, S. et al. VQA: Visual question answering. In ICCV, 2015.

Arjovsky, M. et al. Wasserstein generative adversarial networks. In ICML, 2017.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyré, G. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 2015.

Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., Cattoni, R., and Federico, M. The IWSLT 2015 evaluation campaign. In International Workshop on Spoken Language Translation, 2015.

Chechik, G., Sharma, V., Shalit, U., and Bengio, S. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 2010.

Chen, L., Dai, S., Tao, C., Zhang, H., Gan, Z., Shen, D., Zhang, Y., Wang, G., Zhang, R., and Carin, L. Adversarial text generation via feature-mover's distance. In NeurIPS, 2018.

Chen, L., Zhang, Y., Zhang, R., Tao, C., Gan, Z., Zhang, H., Li, B., Shen, D., Chen, C., and Carin, L. Improving sequence-to-sequence learning via optimal transport. arXiv preprint arXiv:1901.06283, 2019a.

Chen, Y.-C., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019b.

Chowdhury, S. and Mémoli, F. The Gromov-Wasserstein distance between networks and stable network invariants. Information and Inference: A Journal of the IMA, 2019.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.

Cuturi, M. and Peyré, G. Computational optimal transport. 2017.

De Goes, F. et al. An optimal transport approach to robust reconstruction and simplification of 2D shapes. In Computer Graphics Forum, 2011.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In NeurIPS, 2015.

Faghri, F., Fleet, D. J., Kiros, J. R., and Fidler, S. VSE++: Improved visual-semantic embeddings. In BMVC, 2018.

Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin, L., and Deng, L. Semantic compositional networks for visual captioning. In CVPR, 2017.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

Gori, M., Monfardini, G., and Scarselli, F. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks, 2005.

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.

Graff, D., Kong, J., Chen, K., and Maeda, K. English Gigaword. Linguistic Data Consortium, Philadelphia, 2003.

Gu, J., Cai, J., Joty, S. R., Niu, L., and Wang, G. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR, 2018.

Hu, Z., Shi, H., Yang, Z., Tan, B., Zhao, T., He, J., Wang, W., Yu, X., Qin, L., Wang, D., et al. Texar: A modularized, versatile, and extensible toolkit for text generation. arXiv preprint arXiv:1809.00794, 2018.

Huang, Y., Wang, W., and Wang, L. Instance-aware image and sentence matching with selective multimodal LSTM. In CVPR, 2017.

Huang, Y., Wu, Q., Song, C., and Wang, L. Learning semantic concepts and order for image and sentence matching. In CVPR, 2018.

Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Kim, J.-H., Jun, J., and Zhang, B.-T. Bilinear attention networks. In NeurIPS, 2018.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907, 2016.

Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. From word embeddings to document distances. In ICML, 2015.
Lee, K.-H. et al. Stacked cross attention for image-text matching. In ECCV, 2018.

Li, L., Gan, Z., Cheng, Y., and Liu, J. Relation-aware graph attention network for visual question answering. In ICCV, 2019a.

Li, Y., Gu, C., Dullien, T., Vinyals, O., and Kohli, P. Graph matching networks for learning the similarity of graph structured objects. In ICML, 2019b.

Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In ECCV, 2014.

Lu, J., Xiong, C., Parikh, D., and Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.

Luise, G., Rudi, A., Pontil, M., and Ciliberto, C. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. arXiv:1805.11897, 2018.

Malinowski, M. and Fritz, M. A multi-world approach to question answering about real-world scenes based on uncertain input. In NeurIPS, 2014.

Maretic, H. P., El Gheche, M., Chierchia, G., and Frossard, P. GOT: an optimal transport framework for graph comparison. In NeurIPS, 2019.

Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. Sobolev GAN. In ICLR, 2018.

Nam, H., Ha, J.-W., and Kim, J. Dual attention networks for multimodal reasoning and matching. In CVPR, 2017.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.

Peyré, G., Cuturi, M., and Solomon, J. Gromov-Wasserstein averaging of kernel and distance matrices. In ICML, 2016.

Peyré, G., Cuturi, M., et al. Computational optimal transport. Foundations and Trends in Machine Learning, 2019.

Plummer, B. A. et al. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.

Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.

Rubner, Y., Tomasi, C., and Guibas, L. J. A metric for distributions with applications to image databases. In ICCV, 1998.

Rush, A. M., Chopra, S., and Weston, J. A neural attention model for abstractive sentence summarization. In EMNLP, 2015.

Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs using optimal transport. In ICLR, 2018.

Schuster, M. and Paliwal, K. K. Bidirectional recurrent neural networks. Transactions on Signal Processing, 1997.

Van Lint, J. H., Wilson, R. M., and Wilson, R. M. A course in combinatorics. Cambridge University Press, 2001.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.

Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Optimal transport for structured data with application on graphs. arXiv:1805.09114, 2018.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. Show and tell: A neural image caption generator. In CVPR, 2015.

Xie, Y., Wang, X., Wang, R., and Zha, H. A fast proximal point method for Wasserstein distance. arXiv:1802.04307, 2018.

Xu, H., Luo, D., and Carin, L. Scalable Gromov-Wasserstein learning for graph partitioning and matching. In NeurIPS, 2019a.

Xu, H., Luo, D., Zha, H., and Carin, L. Gromov-Wasserstein learning for graph matching and node embedding. In ICML, 2019b.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. Stacked attention networks for image question answering. In CVPR, 2016a.

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. Hierarchical attention networks for document classification. In NAACL, 2016b.

Yao, T., Pan, Y., Li, Y., and Mei, T. Exploring visual relationship for image captioning. In ECCV, 2018.

You, Q., Jin, H., Wang, Z., Fang, C., and Luo, J. Image
captioning with semantic attention. In CVPR, 2016.
Yu, Z., Yu, J., Cui, Y., Tao, D., and Tian, Q. Deep modular
co-attention networks for visual question answering. In
CVPR, 2019.
Zhang, R., Chen, C., Gan, Z., Wen, Z., Wang, W., and
Carin, L. Nested-wasserstein self-imitation learning for
sequence generation. arXiv:2001.06944, 2020.
Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Xu, M., and
Shen, Y.-D. Dual-path convolutional image-text embed-
dings with instance loss. TOMM, 2020.
