Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation

Wang, Wenxuan; Yue, Tongtian; Zhang, Yisi; Guo, Longteng; He, Xingjian; Wang, Xinlong; Liu, Jing

arXiv:2312.08007 (cs)

[Submitted on 13 Dec 2023 (v1), last revised 21 Mar 2024 (this version, v2)]

Abstract: Referring expression segmentation (RES) aims to segment the foreground masks of the entities that match a descriptive natural language expression. Previous datasets and methods for the classic RES task rely heavily on the prior assumption that an expression must refer to object-level targets. In this paper, we take a step further to the finer-grained part-level RES task. To advance the object-level RES task towards finer-grained vision-language understanding, we put forward a new multi-granularity referring expression segmentation (MRES) task and construct an evaluation benchmark called RefCOCOm through manual annotation. By employing our automatic model-assisted data engine, we build the largest visual grounding dataset, namely MRES-32M, which comprises over 32.2M high-quality masks and captions on the provided 1M images. In addition, a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task. Extensive experiments on our RefCOCOm for MRES and on three datasets (i.e., RefCOCO, RefCOCO+, and RefCOCOg) for the classic RES task demonstrate the superiority of our method over previous state-of-the-art methods. To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset, and the UniRES model will be publicly available at this https URL

Comments:	This work is accepted by CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.08007 [cs.CV]
	(or arXiv:2312.08007v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.08007

Submission history

From: Wenxuan Wang
[v1] Wed, 13 Dec 2023 09:29:45 UTC (15,477 KB)
[v2] Thu, 21 Mar 2024 09:09:52 UTC (15,478 KB)
