RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation

Liu, Jiaming; Liu, Mengzhen; Wang, Zhenyu; An, Pengju; Li, Xiaoqi; Zhou, Kaichen; Yang, Senqiao; Zhang, Renrui; Guo, Yandong; Zhang, Shanghang

arXiv:2406.04339 (cs)

[Submitted on 6 Jun 2024 (v1), last revised 14 Dec 2024 (this version, v2)]

Abstract: A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing Vision-Language-Action (VLA) models for robots can handle a range of basic tasks, they still face challenges in two areas: (1) insufficient reasoning ability to tackle complex tasks, and (2) high computational costs for VLA model fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic VLA model that leverages Mamba to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual tokens with language embeddings through co-training, empowering our model with visual common sense and robotic-related reasoning. To further equip RoboMamba with SE(3) pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time. In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 3 times faster than existing VLA models. Our project web page: this https URL
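
The abstract's reference to "linear inference complexity" can be illustrated with a generic state space recurrence. The following is a minimal sketch, not RoboMamba's or Mamba's actual implementation: it assumes a diagonal, time-invariant SSM over a scalar input sequence, whereas Mamba makes these parameters input-dependent and uses a hardware-aware scan. The function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def ssm_scan(a: Sequence[float], b: Sequence[float], c: Sequence[float],
             xs: Sequence[float]) -> List[float]:
    """Run a diagonal linear state space recurrence over a scalar sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimension)
    y_t = sum(c * h_t)

    Illustrative assumption: a, b, c are fixed; Mamba instead computes
    its parameters from the input at each step.
    """
    h = [0.0] * len(a)  # hidden state: fixed size, independent of sequence length
    ys: List[float] = []
    for x in xs:  # one constant-cost update per token, so total cost is linear in len(xs)
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys
```

Because each step touches only the fixed-size state `h`, the per-token cost does not grow with the number of tokens already processed, in contrast to attention over a growing context.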

Comments:	Accepted by NeurIPS 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.04339 [cs.CV]
	(or arXiv:2406.04339v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.04339

Submission history

From: Jiaming Liu
[v1] Thu, 6 Jun 2024 17:59:47 UTC (4,198 KB)
[v2] Sat, 14 Dec 2024 18:41:03 UTC (4,388 KB)