Scaling Up Vision-Language Pre-training for Image Captioning

Hu, Xiaowei; Gan, Zhe; Wang, Jianfeng; Yang, Zhengyuan; Liu, Zicheng; Lu, Yumao; Wang, Lijuan

arXiv:2111.12233 (cs)

[Submitted on 24 Nov 2021 (v1), last revised 26 Mar 2022 (this version, v2)]

Abstract: In recent years, we have witnessed a significant performance boost in the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor for this advance. However, most existing work focuses only on pre-training transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference model, which consists of an image feature extractor and a transformer model, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. In terms of data, we conduct experiments with up to 200 million image-text pairs, which are automatically collected from the web based on the alt attribute of the image (dubbed ALT200M). Extensive analysis helps to characterize the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state-of-the-art results on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2111.12233 [cs.CV]
	(or arXiv:2111.12233v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2111.12233

Submission history

From: Xiaowei Hu
[v1] Wed, 24 Nov 2021 02:30:22 UTC (1,795 KB)
[v2] Sat, 26 Mar 2022 02:02:42 UTC (1,798 KB)
