FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model

Xiaoxu Xu1,2, Hao Li2, Jinhui Ye2, Yilun Chen2, Jia Zeng2, Xinyi Chen2, Linning Xu3, Dahua Lin3, Weixin Li1, Jiangmiao Pang2
1Beihang University, 2Shanghai Artificial Intelligence Laboratory,
3The Chinese University of Hong Kong

Introduction

Predictive foresight is essential for intelligent embodied agents. Since a robot's motor execution is intrinsically constrained by its visual perception of environmental geometry, effectively anticipating the future requires capturing this tightly coupled visuomotor interplay. While recent vision-language-action (VLA) models attempt to incorporate future guidance, they struggle with this joint modeling.

Existing explicit methods divert capacity to task-irrelevant visual details, whereas implicit methods relying on sparse frame pairs disrupt temporal continuity. By heavily relying on visual reconstruction, these methods become visually dominated, entangling static scene context with dynamic action intent. We argue that effective joint visuomotor predictive modeling requires both temporal continuity and visually-conditioned supervision decoupling.

To this end, we propose FutureVLA, featuring a novel Joint Visuomotor Predictive Architecture. FutureVLA extracts joint visuomotor embeddings by first decoupling visual and motor information, and then jointly encoding generalized physical priors. Specifically, in the pretraining stage, we leverage heterogeneous manipulation datasets and introduce a Joint Visuomotor Gating mechanism to structurally separate visual state preservation from temporal action modeling. This gating allows the motor stream to focus on continuous physical dynamics while explicitly querying visual tokens for environmental constraints, yielding highly generalizable joint visuomotor embeddings. Subsequently, in the post-training stage, we employ a latent embedding alignment strategy, enabling diverse downstream VLA models to internalize these temporal priors without modifying their inference architectures. Extensive experiments demonstrate that FutureVLA consistently improves upon baseline VLA frameworks across simulated and real-robot settings, validating the effectiveness of our method.

Introduction of FutureVLA

Method

(a) Joint Visuomotor Pretraining: Continuous video clips are processed by a frozen 3D-VAE into temporal tokens and structurally decoupled into two streams. Visual tokens reconstruct the initial frame, while motor tokens, supervised by action chunks, utilize the Joint Visuomotor Gating module (c) based on gated cross-attention, where the motor stream iteratively queries spatial affordances from the visual tokens, yielding physically grounded joint visuomotor embeddings.
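
As a rough illustration of the gating described above, the sketch below implements a minimal gated cross-attention step in NumPy, where motor tokens query visual tokens and a scalar gate scales the attended output before the residual add. The tanh gate, the single attention head, and all shapes are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(motor, visual, Wq, Wk, Wv, gate):
    """Motor tokens query visual tokens; a learned scalar gate scales
    the attended output before the residual add. (Shapes and the tanh
    gate are illustrative assumptions, not the paper's exact module.)"""
    q = motor @ Wq                                   # (Tm, d) queries from motor stream
    k = visual @ Wk                                  # (Tv, d) keys from visual tokens
    v = visual @ Wv                                  # (Tv, d) values from visual tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (Tm, Tv) attention weights
    return motor + np.tanh(gate) * (attn @ v)        # gated residual update

rng = np.random.default_rng(0)
d = 8
motor = rng.normal(size=(4, d))    # 4 motor tokens
visual = rng.normal(size=(6, d))   # 6 visual tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = gated_cross_attention(motor, visual, Wq, Wk, Wv, gate=0.0)
# with gate = 0, tanh(0) = 0, so the module reduces to the identity
assert np.allclose(out, motor)
```

Initializing the gate at zero makes the module an identity map at the start of training, a common choice for stabilizing newly inserted cross-attention layers.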

(b) Joint Visuomotor Embedding Guided VLA Post-training: The frozen model provides joint visuomotor embeddings as future-aware temporal priors. Through latent embedding alignment, the downstream VLA's intermediate representations are forced to internalize these dynamics.
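
To make the alignment objective concrete, here is a minimal sketch of one plausible form of latent embedding alignment: L2-normalize the downstream VLA's intermediate representations and the frozen joint visuomotor embeddings, then penalize their mean squared difference. The normalization and the MSE choice are assumptions for illustration; the paper's actual objective may differ.

```python
import numpy as np

def alignment_loss(student_h, teacher_e):
    """Cosine-style latent alignment: L2-normalize both sets of
    embeddings, then take the mean squared difference. (Illustrative
    assumption, not necessarily the paper's exact loss.)"""
    s = student_h / np.linalg.norm(student_h, axis=-1, keepdims=True)
    t = teacher_e / np.linalg.norm(teacher_e, axis=-1, keepdims=True)
    return float(np.mean((s - t) ** 2))

rng = np.random.default_rng(1)
teacher = rng.normal(size=(4, 16))                    # frozen joint visuomotor embeddings
student = teacher + 0.01 * rng.normal(size=(4, 16))   # nearly aligned VLA hidden states
far = rng.normal(size=(4, 16))                        # unrelated hidden states
# a well-aligned student incurs a smaller loss than an unrelated one
assert alignment_loss(student, teacher) < alignment_loss(far, teacher)
```

Because the teacher is frozen, only the VLA's intermediate representations receive gradients, so the downstream model's inference architecture is left unchanged.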

Real World

We evaluate our method on several real-world tasks with a Franka Research 3 robot, using a third-person camera for visual input. These tasks cover a broad spectrum of manipulation primitives, including grasping, tool use, insertion, and contact-rich control.

Simulation

The simulation experiments include: (1) Performance comparisons on the Google Robot and WidowX Robot in SimplerEnv. The experiments are conducted across 12 tasks, covering both the visual matching (VM) and variant aggregation (VA) settings. (2) Main results on LIBERO, reporting the average success rate across 3 seeds over 500 trials per task. (3) Extended experiments on LIBERO-Plus to further evaluate the generalization and robustness of our method.

Simulation_table

Video Presentation

Below are some visualizations demonstrating our method in both simulation and the real world.

Turn on the stove and put the moka pot on it.

Open the top drawer and put the bowl inside.

Put the bowl on the plate.

Pick up the bbq sauce and place it in the basket.

Open bottom drawer.

Move near.

Put the carrot on the plate.

Put the eggplant in yellow basket.

Make a burger.

Insert the rose.

Scoop the beans.

Erase the whiteboard.

BibTeX


@misc{xu2026futurevlajointvisuomotorprediction,
  title={FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model},
  author={Xiaoxu Xu and Hao Li and Jinhui Ye and Yilun Chen and Jia Zeng and Xinyi Chen and Linning Xu and Dahua Lin and Weixin Li and Jiangmiao Pang},
  year={2026},
  eprint={2603.10712},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2603.10712}
}