arXiv:2605.11832

Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

Published on May 12

Abstract

A multi-view diffusion model and a geometry-guided gated transformer address depth ambiguity in vision-language-action models, while action manifold learning improves the efficiency of action prediction.

AI-generated summary

This paper tackles spatial perception and manipulation challenges in Vision-Language-Action (VLA) models. To address the depth ambiguity of monocular input, we leverage a pre-trained multi-view diffusion model to synthesize latent novel views and propose a Geometry-Guided Gated Transformer (G3T) that aligns multi-view features under 3D geometric guidance while adaptively filtering occlusion noise. To improve action-learning efficiency, we introduce Action Manifold Learning (AML), which predicts actions directly on the valid action manifold, bypassing inefficient regression of unstructured targets such as noise or velocity. Experiments on LIBERO, RoboTwin 2.0, and real-robot tasks show that our method achieves higher success rates and greater robustness than state-of-the-art baselines. Project page: https://junjxiao.github.io/Multi-view-VLA.github.io/.
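
As a concrete illustration of the adaptive gating idea, below is a minimal sketch of gated multi-view feature fusion in PyTorch. Everything in it is an assumption made for illustration: the name GatedMultiViewFusion, the pooled per-view feature shapes, and the sigmoid gate are not taken from the paper, and the actual G3T additionally conditions on 3D geometric guidance, which this sketch omits.

# Minimal sketch of gated multi-view fusion (illustrative assumptions only;
# not the paper's G3T implementation).
import torch
import torch.nn as nn

class GatedMultiViewFusion(nn.Module):
    # Fuses pooled features from several (possibly synthesized) views with a
    # learned per-view gate that can down-weight occluded or unreliable views.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-view attention lets each view attend to the others
        # (assumed alignment mechanism).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Scalar gate in [0, 1] per view, conditioned on the attended feature.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, dim), one pooled feature per view
        attended, _ = self.attn(views, views, views)
        g = self.gate(attended)                 # (batch, num_views, 1)
        fused = (g * attended).sum(dim=1)       # gated sum over views
        return fused / g.sum(dim=1).clamp_min(1e-6)  # scale-stable average

# Usage: fuse four 256-dimensional view features for a batch of two.
fusion = GatedMultiViewFusion(dim=256)
out = fusion(torch.randn(2, 4, 256))            # -> shape (2, 256)

The intuition is that a scalar gate per view lets the model suppress views whose synthesized content is occluded or inconsistent, while dividing by the total gate mass keeps the fused feature's scale stable however many views survive the gate.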

Get this paper in your agent:

hf papers read 2605.11832
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

Datasets citing this paper: 1

Spaces citing this paper: 0

Collections including this paper: 0
