WorldVLN

Baining Zhao¹*, Jiacheng Xu²*, Weicheng Feng²*, Xin Zhang³*, Zhaolu Wang⁴*, Haoyang Wang¹, Shilong Ji¹, Ziyou Wang⁵, Jianjie Fang¹, Zhiheng Zheng¹, Weichen Zhang¹, Yu Shang¹, Wei Wu³, Chen Gao¹†, Xinlei Chen¹†, Yong Li¹

¹Tsinghua University ²Shandong University ³Manifold AI ⁴Beijing Institute of Technology ⁵Northeastern University

zbn22@mails.tsinghua.edu.cn chgao96@gmail.com chen.xinlei@sz.tsinghua.edu.cn liyong07@tsinghua.edu.cn

*All authors contributed equally to this research. †Corresponding authors.

Highlights

01

Autoregressive WAM for Aerial VLN

We propose the first autoregressive world action model for aerial VLN, grounded in a closed-loop observe-act-update process.

02

Action-aware GRPO

We introduce the first action-aware GRPO to boost the action ability of autoregressive WAMs.

03

SOTA and Real-World Deployment

WorldVLN achieves strong performance on indoor and outdoor benchmarks and transfers to real-world drone deployment.

Abstract

Aerial vision-language navigation requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks.

Videos

Demo

Outdoor Real-World Deployment

Indoor Real-World Deployment

Outdoor Simulation (UAV-Flow)

Indoor Simulation (IndoorUAV)

Autoregressive World-Action

WorldVLN navigates by implicitly predicting what will happen next, acting on it, and continuously correcting itself with real visual feedback.
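The observe-act-update cycle described above can be illustrated with a minimal Python sketch. This is not the WorldVLN implementation: `ClosedLoopNavigator`, `toy_policy`, and `run_episode` are hypothetical names, the "observation" is simply the current position rather than an encoded camera frame, and the toy policy stands in for the latent autoregressive backbone and its action decoder. The sketch only shows the control flow: each new observation is appended to the context before the next short waypoint segment is predicted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float, float]  # (x, y, z); a toy stand-in for a real waypoint action


@dataclass
class ClosedLoopNavigator:
    # Hypothetical stand-in for the world-action model: maps the instruction
    # and the accumulated context to a short segment of waypoints.
    predict_segment: Callable[[str, List[Waypoint]], List[Waypoint]]
    context: List[Waypoint] = field(default_factory=list)

    def step(self, instruction: str, observation: Waypoint) -> List[Waypoint]:
        # Update: fold the newly received observation into the context.
        self.context.append(observation)
        # Predict: produce the next short-horizon waypoint segment.
        return self.predict_segment(instruction, self.context)


def toy_policy(instruction: str, context: List[Waypoint]) -> List[Waypoint]:
    # Assumption for illustration only: always move forward along x
    # from the most recently observed position, two waypoints per segment.
    x, y, z = context[-1]
    return [(x + 1.0, y, z), (x + 2.0, y, z)]


def run_episode(navigator: ClosedLoopNavigator, instruction: str,
                start: Waypoint, num_segments: int) -> List[Waypoint]:
    position = start
    trajectory = [start]
    for _ in range(num_segments):
        segment = navigator.step(instruction, position)  # observe + predict
        for waypoint in segment:                          # act: execute the segment
            position = waypoint
            trajectory.append(waypoint)
    return trajectory


nav = ClosedLoopNavigator(predict_segment=toy_policy)
traj = run_episode(nav, "fly forward", (0.0, 0.0, 1.0), num_segments=3)
print(traj[-1], len(nav.context))
```

In a real system the observation would be a camera frame encoded into latent tokens, and executing a segment would drift from the predicted waypoints, which is why re-observing after every segment matters.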

Two-Stage Training Framework

We train the autoregressive WAM in two stages. Stage 1 supervises the latent autoregressive backbone with instruction-video pairs and the action decoder with video-trajectory pairs. Stage 2 samples multiple rollouts, assigns segment-level rewards from trajectory accuracy, task progress, and reference-policy regularization with temporal decay weighting, and updates WorldVLN through Action-aware GRPO.
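To make the Stage 2 description more concrete, here is a minimal Python sketch of how segment-level rewards might be combined and turned into group-relative advantages. It is an illustration under stated assumptions, not the paper's formulation: the linear reward combination, the weights `w_acc`, `w_prog`, `beta`, the exponential `decay` that favors earlier segments, and the example numbers are all hypothetical. Only the three reward ingredients, the temporal decay weighting, and the use of multiple rollouts come from the description above; the mean/standard-deviation normalization is the standard group-relative step in GRPO.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def segment_reward(accuracy: float, progress: float, kl_to_reference: float,
                   w_acc: float = 1.0, w_prog: float = 1.0, beta: float = 0.1) -> float:
    # Assumed linear combination: reward trajectory accuracy and task progress,
    # penalize divergence from the reference policy.
    return w_acc * accuracy + w_prog * progress - beta * kl_to_reference


def decayed_return(segment_rewards: Sequence[float], decay: float = 0.9) -> float:
    # Assumed temporal decay: segment t is weighted by decay**t.
    return sum((decay ** t) * r for t, r in enumerate(segment_rewards))


def group_relative_advantages(returns: Sequence[float], eps: float = 1e-8) -> List[float]:
    # Normalize each rollout's return against the group of sampled rollouts.
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Three hypothetical rollouts for one instruction; each segment is
# (accuracy, progress, kl_to_reference). Values are made up for illustration.
rollouts: List[List[Tuple[float, float, float]]] = [
    [(0.9, 0.5, 0.1), (0.8, 0.4, 0.2)],
    [(0.4, 0.1, 0.3), (0.3, 0.0, 0.5)],
    [(0.6, 0.3, 0.2), (0.7, 0.3, 0.1)],
]

returns = [decayed_return([segment_reward(*seg) for seg in rollout]) for rollout in rollouts]
advantages = group_relative_advantages(returns)
print(advantages)
```

Rollouts with above-average decayed return receive positive advantages and are reinforced; those below average are discouraged. A segment-level variant could normalize per-segment rewards instead of whole-rollout returns, which may be closer to what "action-aware" implies, but the page does not specify this.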

Quantitative Results

On both outdoor and indoor UAV benchmarks, WorldVLN outperforms existing Vision-Language-Action baselines, with success-rate gains of 12%+ and larger advantages on challenging cases.

Training Analysis

We plot the training curves of the proposed model:

Compared with VLA baselines, our method improves faster over the same number of training steps, indicating the stronger learning efficiency of WAM.
Action-aware GRPO further improves performance beyond SFT, and qualitative examples indicate that it more directly enhances action execution.

Citation

@misc{zhao2026worldvln,
    title={WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation},
    author={Baining Zhao and Jiacheng Xu and Weicheng Feng and Xin Zhang and Zhaolu Wang and Haoyang Wang and Shilong Ji and Ziyou Wang and Jianjie Fang and Zhiheng Zheng and Weichen Zhang and Yu Shang and Wei Wu and Chen Gao and Xinlei Chen and Yong Li},
    year={2026},
    eprint={2605.15964},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2605.15964}
}