Democratizing Fine-grained Visual Recognition with Large Language Models

Liu, Mingxuan; Roy, Subhankar; Li, Wenjing; Zhong, Zhun; Sebe, Nicu; Ricci, Elisa

arXiv:2401.13837 (cs)

[Submitted on 24 Jan 2024 (v1), last revised 10 Mar 2024 (this version, v2)]

Abstract: Identifying subordinate-level categories from images is a longstanding task in computer vision, referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications, since the average layperson does not excel at differentiating species of birds or mushrooms because of the subtle differences among them. A major bottleneck in developing FGVR systems is the need for high-quality paired expert annotations. To circumvent the need for expert knowledge, we propose Fine-grained Semantic Category Reasoning (FineR), which internally leverages the world knowledge of large language models (LLMs) as a proxy to reason about fine-grained category names. In detail, to bridge the modality gap between images and LLMs, we extract part-level visual attributes from images as text and feed that information to an LLM. Based on the visual attributes and its internal world knowledge, the LLM reasons about the subordinate-level category names. Our training-free FineR outperforms several state-of-the-art FGVR and language-and-vision assistant models, and shows promise for working in the wild and in new domains where gathering expert annotations is arduous.
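
The attribute-to-text step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_prompt`, `parse_names`, `guess_categories`), the prompt wording, the example attributes, and the `fake_llm` stand-in are all hypothetical, and the abstract does not specify how attributes are extracted or which LLM is queried. The sketch only shows the general shape of the idea: part-level attributes, already expressed as text, are assembled into a prompt, and the reply is parsed into candidate category names.

```python
from typing import Callable, Dict, List


def build_prompt(super_category: str, attributes: Dict[str, str]) -> str:
    """Turn part-level attribute descriptions into a text prompt (hypothetical wording)."""
    lines = [f"- {part}: {description}" for part, description in attributes.items()]
    return (
        f"An image shows a {super_category} with these visual attributes:\n"
        + "\n".join(lines)
        + f"\nList plausible fine-grained {super_category} names, one per line."
    )


def parse_names(response: str) -> List[str]:
    """Strip list markers and blank lines from the model's reply."""
    names = []
    for line in response.splitlines():
        cleaned = line.strip().lstrip("-*0123456789. ").strip()
        if cleaned:
            names.append(cleaned)
    return names


def guess_categories(
    super_category: str,
    attributes: Dict[str, str],
    llm: Callable[[str], str],
) -> List[str]:
    """Ask the supplied LLM callable for candidate names and parse its reply."""
    return parse_names(llm(build_prompt(super_category, attributes)))


def fake_llm(prompt: str) -> str:
    # Assumption: a canned reply standing in for a real LLM call.
    return "1. Painted Bunting\n2. Indigo Bunting\n"


# Assumption: attributes that an upstream vision model would have produced as text.
example_attributes = {"head": "blue", "breast": "red", "back": "green"}
candidates = guess_categories("bird", example_attributes, fake_llm)
print(candidates)
```

Passing the LLM as a plain callable keeps the sketch independent of any particular model or API; a real system would replace `fake_llm` with an actual model call and `example_attributes` with the output of a visual attribute extractor.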

Comments:	Accepted as a conference paper at ICLR 2024; Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.13837 [cs.CV]
	(or arXiv:2401.13837v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.13837

Submission history

From: Mingxuan Liu
[v1] Wed, 24 Jan 2024 22:28:26 UTC (33,957 KB)
[v2] Sun, 10 Mar 2024 16:01:25 UTC (33,957 KB)
