Task-oriented Sequential Grounding in 3D Scenes

Zhang, Zhuofan; Zhu, Ziyu; Li, Pengxiang; Liu, Tengyu; Ma, Xiaojian; Chen, Yixin; Jia, Baoxiong; Huang, Siyuan; Li, Qing

arXiv:2408.04034 (cs)

[Submitted on 7 Aug 2024]

Abstract: Grounding natural language in physical 3D environments is essential for the advancement of embodied artificial intelligence. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented grounding necessary for practical applications. In this work, we propose a new task: Task-oriented Sequential Grounding in 3D scenes, wherein an agent must follow detailed step-by-step instructions to complete daily activities by locating a sequence of target objects in indoor scenes. To facilitate this task, we introduce SG3D, a large-scale dataset containing 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed using a combination of RGB-D scans from various 3D scene datasets and an automated task generation pipeline, followed by human verification for quality assurance. We adapted three state-of-the-art 3D visual grounding models to the sequential grounding task and evaluated their performance on SG3D. Our results reveal that while these models perform well on traditional benchmarks, they face significant challenges with task-oriented sequential grounding, underscoring the need for further research in this area.

Comments:	website: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2408.04034 [cs.CV]
	(or arXiv:2408.04034v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.04034

Submission history

From: Zhuofan Zhang
[v1] Wed, 7 Aug 2024 18:30:18 UTC (8,636 KB)
