Egocentric Vision Language Planning

Fang, Zhirui; Yang, Ming; Zeng, Weishuai; Li, Boyu; Yue, Junpeng; Ding, Ziluo; Li, Xiu; Lu, Zongqing


arXiv:2408.05802 (cs)

[Submitted on 11 Aug 2024]



Abstract: We explore leveraging large multi-modal models (LMMs) and text2image models to build a more general embodied agent. LMMs excel in planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is needed to connect LMMs to the physical world. The paper proposes a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. This model leverages a diffusion model to simulate the fundamental dynamics between states and actions, integrating techniques like style transfer and optical flow to enhance generalization across different environmental dynamics. The LMM serves as a planner, breaking down instructions into sub-goals and selecting actions based on their alignment with these sub-goals, thus enabling more generalized and effective decision-making. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.
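
The planning loop described in the abstract (decompose the instruction into sub-goals, imagine the outcome of each candidate action with a dynamics model, pick the action whose imagined outcome best matches the current sub-goal) might be sketched as follows. This is only an illustrative toy, not the authors' implementation: the names `decompose`, `predict`, `score`, and `done` are hypothetical stand-ins for the LMM planner, the diffusion-based dynamics model, the LMM alignment check, and a sub-goal completion test, and the "world" is a 1-D corridor rather than egocentric images.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
#   decompose(instruction) -> list of sub-goals   (stands in for the LMM planner)
#   predict(obs, action)   -> imagined next obs   (stands in for the diffusion model)
#   score(obs, subgoal)    -> float, higher is better alignment
#   done(obs, subgoal)     -> True once the sub-goal is reached


def select_action(obs, subgoal, actions: Sequence[str],
                  predict: Callable, score: Callable) -> str:
    """Pick the action whose predicted next observation best matches the sub-goal."""
    return max(actions, key=lambda a: score(predict(obs, a), subgoal))


def plan(obs, instruction: str, actions: Sequence[str], decompose: Callable,
         predict: Callable, score: Callable, done: Callable,
         max_steps: int = 20) -> List[str]:
    """Work through the sub-goals in order, choosing one action per step."""
    trajectory: List[str] = []
    for subgoal in decompose(instruction):
        for _ in range(max_steps):
            if done(obs, subgoal):
                break
            action = select_action(obs, subgoal, actions, predict, score)
            # Toy simplification: the dynamics model doubles as the environment.
            obs = predict(obs, action)
            trajectory.append(action)
    return trajectory


# Toy stand-ins: observations are integer positions, sub-goals are target positions.
ACTIONS = ["left", "right"]


def toy_decompose(instruction: str) -> List[int]:
    return [int(token) for token in instruction.split(",")]


def toy_predict(pos: int, action: str) -> int:
    return pos + 1 if action == "right" else pos - 1


def toy_score(pos: int, goal: int) -> float:
    return -abs(pos - goal)


def toy_done(pos: int, goal: int) -> bool:
    return pos == goal


trajectory = plan(0, "3,1", ACTIONS, toy_decompose, toy_predict, toy_score, toy_done)
print(trajectory)
```

In a real system each callable would be far more expensive (image generation and LMM queries), so the number of candidate actions evaluated per step would matter; the toy ignores that cost.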

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2408.05802 [cs.CV]
	(or arXiv:2408.05802v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.05802

Submission history

From: Zongqing Lu
[v1] Sun, 11 Aug 2024 15:37:29 UTC (45,910 KB)
