VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

Zheng, Kaizhi; Chen, Xiaotong; Jenkins, Odest Chadwicke; Wang, Xin Eric

arXiv:2206.08522 (cs)

[Submitted on 17 Jun 2022 (v1), last revised 17 Aug 2022 (this version, v2)]

Title: VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

Authors: Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, Xin Eric Wang

Abstract: Benefiting from the flexibility and compositionality of language, humans naturally tend to use language to command an embodied agent in complex tasks such as navigation and object manipulation. In this work, we aim to close the last-mile gap for embodied agents: object manipulation by following human guidance, e.g., "move the red mug next to the box while keeping it upright." To this end, we introduce an Automatic Manipulation Solver (AMSolver) system and build a Vision-and-Language Manipulation benchmark (VLMbench) on top of it, containing various language instructions on categorized robotic manipulation tasks. Specifically, modular rule-based task templates are created to automatically generate robot demonstrations with language instructions, covering diverse object shapes and appearances, action types, and motion constraints. We also develop a keypoint-based model, 6D-CLIPort, that takes multi-view observations and language input and outputs a sequence of 6-degrees-of-freedom (DoF) actions. We hope the new simulator and benchmark will facilitate future research on language-guided robotic manipulation.

Subjects:	Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2206.08522 [cs.RO]
	(or arXiv:2206.08522v2 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2206.08522

Submission history

From: Kaizhi Zheng
[v1] Fri, 17 Jun 2022 03:07:18 UTC (6,869 KB)
[v2] Wed, 17 Aug 2022 17:18:43 UTC (7,065 KB)
