ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities

Su, Ying; Ling, Zhan; Shi, Haochen; Cheng, Jiayang; Yim, Yauwai; Song, Yangqiu

arXiv:2410.03907 (cs)

[Submitted on 4 Oct 2024]

Abstract: Large language models (LLMs) have been adopted to process textual task descriptions and accomplish procedural planning in embodied AI tasks because of their powerful reasoning ability. However, there is still a lack of study on how vision language models (VLMs) behave when multi-modal task inputs are considered. Counterfactual planning, which evaluates a model's reasoning ability over alternative task situations, is also underexplored. To evaluate planning ability in both the multi-modal and counterfactual aspects, we propose ActPlan-1K. ActPlan-1K is a multi-modal planning benchmark constructed with ChatGPT and the household activity simulator iGibson2. The benchmark consists of 153 activities and 1,187 instances. Each instance describes one activity and has a natural language task description and multiple environment images from the simulator. The gold plan of each instance is an action sequence over the objects in the provided scenes. Both correctness and commonsense satisfaction are evaluated on typical VLMs. It turns out that current VLMs still struggle to generate human-level procedural plans for both normal activities and counterfactual activities. We further provide automatic evaluation metrics by finetuning a BLEURT model to facilitate future research on our benchmark.
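
As an illustration of the instance structure described in the abstract (task description, environment images, gold action sequence), the following is a minimal sketch. It is not the benchmark's actual data format or evaluation code: the class name `PlanningInstance`, all of its field names, the example actions, and the LCS-based overlap score are assumptions chosen only to show how a predicted plan might be compared against a gold plan step by step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanningInstance:
    """Hypothetical container for one instance; field names are assumptions."""
    activity: str
    task_description: str
    image_paths: List[str] = field(default_factory=list)
    gold_plan: List[str] = field(default_factory=list)
    counterfactual: bool = False


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two action sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, step_a in enumerate(a, start=1):
        for j, step_b in enumerate(b, start=1):
            if step_a == step_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def plan_overlap(predicted: List[str], gold: List[str]) -> float:
    """Fraction of gold steps recovered in order (an illustrative score only)."""
    if not gold:
        return 0.0
    return lcs_length(predicted, gold) / len(gold)


# Made-up example instance; the actions and file name are not from the benchmark.
instance = PlanningInstance(
    activity="pouring milk",
    task_description="Pour milk from the fridge into a cup.",
    image_paths=["kitchen_view_0.png"],
    gold_plan=["open fridge", "take milk", "close fridge", "pour milk into cup"],
)
predicted_plan = ["open fridge", "take milk", "pour milk into cup"]
score = plan_overlap(predicted_plan, instance.gold_plan)
print(score)  # 0.75: three of the four gold steps appear in order
```

An exact-match score over action strings like this ignores paraphrases and valid alternative orderings, which is one reason a learned metric such as the finetuned BLEURT model mentioned in the abstract can be useful alongside it.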

Comments:	13 pages, 9 figures, 8 tables, accepted to EMNLP 2024 main conference
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2410.03907 [cs.CL]
	(or arXiv:2410.03907v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.03907

Submission history

From: Ying Su
[v1] Fri, 4 Oct 2024 20:21:40 UTC (13,648 KB)
