Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

Wang, Fei; Ding, Liang; Rao, Jun; Liu, Ye; Shen, Li; Ding, Changxing

arXiv:2308.12898 (cs)

[Submitted on 24 Aug 2023 (v1), last revised 25 Aug 2023 (this version, v2)]

Abstract: The multimedia community has shown significant interest in perceiving and representing the physical world with multimodal pretrained neural network models, and among these, visual-language pretraining (VLP) is currently the most captivating topic. However, few efforts have explored 1) whether essential linguistic knowledge (e.g., semantics and syntax) can be extracted during VLP, and 2) how such linguistic knowledge impacts or enhances multimodal alignment. In response, we aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment. Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark, to detect vital linguistic components, e.g., lexical, semantic, and syntactic knowledge. It contains four tasks: Semantic structure, Negation logic, Attribute ownership, and Relationship composition. Based on the proposed probing benchmark, our holistic analyses of five advanced VLP models illustrate that the VLP model: i) shows insensitivity towards complex syntax structures and relies on content words for sentence comprehension; ii) demonstrates limited comprehension of combinations between sentences and negations; iii) faces challenges in determining the presence of actions or spatial relationships within visual information and struggles with verifying the correctness of triple combinations. We make our benchmark and code available at this https URL.

Comments:	[TL;DR] we design and release the SNARE, the first large-scale multimodal alignment probing benchmark for current vision-language pretrained models
Subjects:	Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2308.12898 [cs.MM]
	(or arXiv:2308.12898v2 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.2308.12898

Submission history

From: Liang Ding
[v1] Thu, 24 Aug 2023 16:17:40 UTC (5,857 KB)
[v2] Fri, 25 Aug 2023 12:22:53 UTC (5,860 KB)
