Autoregressive Pre-Training on Pixels and Texts

Chai, Yekun; Liu, Qingyi; Xiao, Jingwu; Wang, Shuohuan; Sun, Yu; Wu, Hua

arXiv:2404.10710 (cs)

[Submitted on 16 Apr 2024 (v1), last revised 3 Oct 2024 (this version, v3)]

Abstract: The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language--both visual and textual--within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based language models. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at \url{this https URL}.
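
The two training objectives named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, and the choice of mean squared error for the regression head and softmax cross-entropy for the classification head is an assumption, since the abstract does not state the exact loss forms. The only point shown is the autoregressive shift, where the output at step t is scored against the input at step t+1.

```python
import math

def next_patch_mse(predicted, patches):
    """Mean squared error between the prediction at step t and the true patch at t+1.

    Hypothetical helper: `predicted[t]` stands in for a regression head's
    output at step t; MSE is an assumed choice of regression loss.
    """
    pairs = list(zip(predicted[:-1], patches[1:]))  # shift targets by one step
    total = sum((p - q) ** 2 for pred, tgt in pairs for p, q in zip(pred, tgt))
    count = sum(len(tgt) for _, tgt in pairs)
    return total / count

def next_token_cross_entropy(logits, token_ids):
    """Average negative log-likelihood of token t+1 under the logits at step t.

    Hypothetical helper: `logits[t]` stands in for a classification head's
    output at step t; softmax cross-entropy is an assumed choice.
    """
    losses = []
    for step_logits, target in zip(logits[:-1], token_ids[1:]):
        m = max(step_logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        losses.append(log_z - step_logits[target])
    return sum(losses) / len(losses)

# Toy data: three flattened 2-pixel patches and a 3-token sequence.
patches = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
predicted = [[1.0, 1.0], [2.0, 1.0], [9.0, 9.0]]  # last prediction has no target
logits = [[0.0, 0.0], [0.0, 0.0], [5.0, -5.0]]    # last step has no target
token_ids = [0, 1, 0]

patch_loss = next_patch_mse(predicted, patches)
token_loss = next_token_cross_entropy(logits, token_ids)
```

In this toy run the final prediction and final logits are ignored because no next patch or next token exists to score them against. The abstract's "and/or" suggests the two objectives may be used separately or together; how they are combined is not specified there, so no combined loss is shown.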

Comments:	EMNLP 2024
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.10710 [cs.CL]
	(or arXiv:2404.10710v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.10710

Submission history

From: Yekun Chai [view email]
[v1] Tue, 16 Apr 2024 16:36:50 UTC (8,854 KB)
[v2] Wed, 17 Apr 2024 08:44:30 UTC (8,854 KB)
[v3] Thu, 3 Oct 2024 17:46:40 UTC (8,820 KB)
