Transform and Tell: Entity-Aware News Image Captioning

Tran, Alasdair; Mathews, Alexander; Xie, Lexing

arXiv:2004.08070 (cs)

[Submitted on 17 Apr 2020 (v1), last revised 13 Jun 2020 (this version, v2)]

Abstract: We propose an end-to-end model that generates captions for images embedded in news articles. News images present two key challenges: they rely on real-world knowledge, especially about named entities, and they typically have linguistically rich captions that include uncommon words. We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism. We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair encoding to generate captions as a sequence of word parts. On the GoodNews dataset, our model outperforms the previous state of the art by a factor of four in CIDEr score (from 13 to 54). This performance gain comes from a unique combination of language models, word representation, image embeddings, face embeddings, object embeddings, and improvements in neural network design. We also introduce the NYTimes800k dataset, which is 70% larger than GoodNews, has higher article quality, and includes the locations of images within articles as an additional contextual cue.
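As a rough illustration of how byte-pair encoding lets a model emit uncommon words as a sequence of word parts, the sketch below segments a word by applying an ordered list of merge rules. It is not the paper's tokenizer: it works on characters rather than bytes, and the function name `apply_bpe` and the merge table `TOY_MERGES` are hypothetical, invented for this example.

```python
def apply_bpe(word, merges):
    """Split `word` into sub-word parts using an ordered list of merge rules.

    `merges` is a list of (left, right) pairs; earlier pairs have priority.
    This is a simplified, character-level sketch, not a byte-level tokenizer.
    """
    # Start from individual characters.
    parts = list(word)
    # Apply each merge rule in order, scanning the word left to right.
    for left, right in merges:
        merged = []
        i = 0
        while i < len(parts):
            if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
                # Fuse the adjacent pair into a single part.
                merged.append(left + right)
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts


# Hypothetical merge table, made up for illustration only.
TOY_MERGES = [("e", "r"), ("t", "h"), ("th", "er"), ("i", "n")]

print(apply_bpe("thinner", TOY_MERGES))
print(apply_bpe("either", TOY_MERGES))
```

Because every word can fall back to single characters, a vocabulary built this way has no out-of-vocabulary tokens, which is what makes rare names expressible as word parts.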

Comments:	Published in CVPR 2020. Code is available at this https URL and demo is available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
ACM classes:	I.4.0; I.2.7
Cite as:	arXiv:2004.08070 [cs.CV]
	(or arXiv:2004.08070v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2004.08070
Journal reference:	The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 13035-13045

Submission history

From: Alasdair Tran
[v1] Fri, 17 Apr 2020 05:44:37 UTC (4,659 KB)
[v2] Sat, 13 Jun 2020 01:21:14 UTC (4,659 KB)
