SeqTR: A Simple yet Universal Network for Visual Grounding

Zhu, Chaoyang; Zhou, Yiyi; Shen, Yunhang; Luo, Gen; Pan, Xingjia; Lin, Mingbao; Chen, Chao; Cao, Liujuan; Sun, Xiaoshuai; Ji, Rongrong

doi:10.1007/978-3-031-19833-5_35

arXiv:2203.16265 (cs)

[Submitted on 30 Mar 2022 (v1), last revised 24 Jul 2022 (this version, v2)]

Title: SeqTR: A Simple yet Universal Network for Visual Grounding

Authors: Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, Rongrong Ji

Abstract: In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES). The canonical paradigms for visual grounding often require substantial expertise in designing network architectures and loss functions, making them hard to generalize across tasks. To simplify and unify the modeling, we cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or binary mask is represented as a sequence of discrete coordinate tokens. Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads, e.g., the convolutional mask decoder for RES, which greatly reduces the complexity of multi-task modeling. In addition, SeqTR also shares the same optimization objective for all tasks with a simple cross-entropy loss, further reducing the complexity of deploying hand-crafted loss functions. Experiments on five benchmark datasets demonstrate that the proposed SeqTR outperforms (or is on par with) the existing state-of-the-art methods, proving that a simple yet universal approach for visual grounding is indeed feasible. Source code is available at this https URL.
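
The coordinate-token idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The bin count `NUM_BINS`, the helper names (`quantize`, `box_to_tokens`, `tokens_to_box`, `points_to_tokens`, `token_cross_entropy`), the choice to decode tokens to bin centres, and the assumption that a mask is first reduced to a list of contour points are all hypothetical choices made here for illustration.

```python
import math
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # assumed token vocabulary size per axis; not taken from the paper


def quantize(value: float, extent: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [0, extent] to an integer token in [0, num_bins - 1]."""
    ratio = min(max(value / extent, 0.0), 1.0)  # clamp to the image
    return min(int(ratio * num_bins), num_bins - 1)


def points_to_tokens(points: Sequence[Tuple[float, float]],
                     width: float, height: float,
                     num_bins: int = NUM_BINS) -> List[int]:
    """Flatten (x, y) points into one token sequence: x1, y1, x2, y2, ..."""
    tokens: List[int] = []
    for x, y in points:
        tokens.append(quantize(x, width, num_bins))
        tokens.append(quantize(y, height, num_bins))
    return tokens


def box_to_tokens(box: Tuple[float, float, float, float],
                  width: float, height: float,
                  num_bins: int = NUM_BINS) -> List[int]:
    """A box is just two points: top-left and bottom-right corners."""
    x1, y1, x2, y2 = box
    return points_to_tokens([(x1, y1), (x2, y2)], width, height, num_bins)


def tokens_to_box(tokens: Sequence[int], width: float, height: float,
                  num_bins: int = NUM_BINS) -> Tuple[float, float, float, float]:
    """Decode four tokens back to pixel coordinates (bin centres)."""
    tx1, ty1, tx2, ty2 = tokens
    return ((tx1 + 0.5) / num_bins * width, (ty1 + 0.5) / num_bins * height,
            (tx2 + 0.5) / num_bins * width, (ty2 + 0.5) / num_bins * height)


def token_cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one predicted token distribution against its target."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


tokens = box_to_tokens((160.0, 120.0, 480.0, 360.0), width=640.0, height=480.0)
decoded = tokens_to_box(tokens, width=640.0, height=480.0)
```

Under these assumptions a bounding box becomes four tokens and a mask outline with N contour points becomes 2N tokens, so both targets can be scored with the same per-token cross-entropy and no task-specific head is needed.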

Comments:	21 pages, 8 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2203.16265 [cs.CV]
	(or arXiv:2203.16265v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2203.16265
Journal reference:	European Conference on Computer Vision, 2022
Related DOI:	https://doi.org/10.1007/978-3-031-19833-5_35

Submission history

From: Chaoyang Zhu
[v1] Wed, 30 Mar 2022 12:52:46 UTC (13,989 KB)
[v2] Sun, 24 Jul 2022 02:13:37 UTC (9,698 KB)
