Measuring Progress in Fine-grained Vision-and-Language Understanding

Bugliarello, Emanuele; Sartran, Laurent; Agrawal, Aishwarya; Hendricks, Lisa Anne; Nematzadeh, Aida

arXiv:2305.07558 (cs)

[Submitted on 12 May 2023]

Abstract: While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has resulted in an increased interest in the community to either develop new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which even degrades performance sometimes. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics, and discover that for some tasks, performance peaks early in training or significantly fluctuates, never converging.

Comments:	ACL 2023
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.07558 [cs.CL]
	(or arXiv:2305.07558v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.07558

Submission history

From: Emanuele Bugliarello [view email]
[v1] Fri, 12 May 2023 15:34:20 UTC (1,606 KB)
