Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations

Goko, Miyu; Kambara, Motonari; Saito, Daichi; Otsuki, Seitaro; Sugiura, Komei

arXiv:2410.00436 (cs)

[Submitted on 1 Oct 2024]

Abstract: In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the representation between two images. We evaluate Contrastive $\lambda$-Repformer on a dataset based on a large-scale standard dataset, the RT-1 dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy compared to the representative MLLM-based model.

Comments:	Accepted for presentation at CoRL2024
Subjects:	Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2410.00436 [cs.RO]
	(or arXiv:2410.00436v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2410.00436

Submission history

From: Miyu Goko
[v1] Tue, 1 Oct 2024 06:35:34 UTC (23,390 KB)
