Vision-Language Models Provide Promptable Representations for Reinforcement Learning

Chen, William; Mees, Oier; Kumar, Aviral; Levine, Sergey

arXiv:2402.02651v1 (cs)

[Submitted on 5 Feb 2024 (this version), latest version 23 May 2024 (v3)]

Abstract: Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex, long-horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings extracted from general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find that our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings.
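
The data flow described in the abstract (observation plus prompt, then a frozen embedding, then a trainable policy) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `toy_promptable_embed` is a hypothetical stand-in that only mimics the interface of a frozen VLM, `LinearPolicy` is a hypothetical policy head, and the prompt strings, `EMBED_DIM`, and `NUM_ACTIONS` are invented values.

```python
import hashlib
import math
import random
from typing import List, Sequence

EMBED_DIM = 8    # hypothetical embedding size
NUM_ACTIONS = 4  # hypothetical number of discrete actions


def toy_promptable_embed(image: Sequence[float], prompt: str) -> List[float]:
    """Hypothetical stand-in for a frozen VLM (NOT a real model).

    It applies a prompt-dependent random projection to the image, so the
    same observation yields different features under different prompts.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    embedding = []
    for _ in range(EMBED_DIM):
        weights = [rng.uniform(-1.0, 1.0) for _ in image]
        embedding.append(math.tanh(sum(w * x for w, x in zip(weights, image))))
    return embedding


class LinearPolicy:
    """Hypothetical trainable head that consumes the frozen embeddings."""

    def __init__(self, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
            for _ in range(NUM_ACTIONS)
        ]

    def action_probs(self, embedding: Sequence[float]) -> List[float]:
        # Softmax over one linear logit per action.
        logits = [sum(w * e for w, e in zip(row, embedding)) for row in self.weights]
        top = max(logits)
        exps = [math.exp(logit - top) for logit in logits]
        total = sum(exps)
        return [value / total for value in exps]

    def act(self, embedding: Sequence[float]) -> int:
        # Greedy action: index of the highest-probability action.
        probs = self.action_probs(embedding)
        return max(range(NUM_ACTIONS), key=probs.__getitem__)


# Usage: a flattened toy "image" and an invented task-context prompt.
image = [0.1, 0.5, 0.9, 0.3]
task_prompt = "Is there a spider in this image?"  # hypothetical prompt
embedding = toy_promptable_embed(image, task_prompt)
policy = LinearPolicy()
action = policy.act(embedding)
```

In this sketch only `LinearPolicy` would be updated by an RL algorithm. The embedding function stays fixed, and changing the prompt changes which features the policy sees.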

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2402.02651 [cs.LG]
	(or arXiv:2402.02651v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.02651

Submission history

From: William Chen
[v1] Mon, 5 Feb 2024 00:48:56 UTC (7,149 KB)
[v2] Tue, 13 Feb 2024 17:51:03 UTC (9,470 KB)
[v3] Thu, 23 May 2024 01:04:11 UTC (14,256 KB)
