AirScape

An Aerial Generative World Model with Motion Controllability

First Aerial World Model: Pioneering 6DoF world model for aerial agents with motion controllability

11k Video-Intention Dataset: First-person drone videos with annotated motion intentions across diverse scenarios

Two-Phase Training: Self-play with VLM for learning spatio-temporal constraints

Superior Performance: 165.64% improvement in intention alignment over baselines

arXiv | PDF | Code (release soon)

Baining Zhao*, Rongze Tang*, Mingyuan Jia*, Ziyou Wang*, Fanghang Man

Xin Zhang, Yu Shang, Weichen Zhang, Chen Gao, Wei Wu, Xin Wang, Xinlei Chen, Yong Li

Tsinghua University

Challenges


The paper addresses the fundamental problem of enabling robots to predict outcomes of their motion intentions in 3D space, specifically for six-degree-of-freedom (6DoF) aerial agents like drones. The main challenges include:

1. Lack of aerial datasets: Training world models requires first-person perspective videos and corresponding textual prompts about aerial agents' actions or tasks. Existing datasets offer either third-person views or ground-based perspectives from robots and vehicles.

2. Distribution gap: On the text side, existing open-source foundation models focus on generating videos from detailed textual descriptions, whereas world models rely on concise instructions or action intents. On the video side, the training data of open-source foundation models consists mostly of third-person videos with limited visual change, while embodied first-person footage typically has a narrower field of view and larger visual changes, which increases training difficulty.

3. Generation diversity: Drones operate in 6DoF with high flexibility. Compared with ground vehicles, the generated scenes must cover lateral translation, in-place rotation, camera gimbal adjustments, and combinations of multiple actions, which makes generation more challenging. The aerial world model is therefore required to simulate more complex changes in relative position, perspective variation, and parallax effects.

Dataset


We present an 11k embodied aerial agent video dataset along with corresponding annotations of motion intention, aligning the inputs and outputs of the aerial world model. The proposed dataset exhibits diversity across multiple dimensions, including actions, areas, scenes, and tasks, as illustrated by the examples below.

Action

Translation
The drone moves rightward while capturing a video of cars on the bridge, keeping it centered without gimbal adjustments.
Rotation
The drone rotates to the right and maintains a steady altitude and camera angle.
Compound Movement
The drone flies forward, adjusts gimbal downward for top-down view, stabilizing above the parking lot.

Area

Roadside
The drone maintains altitude, tilts gimbal down 45°, captures a descending view of a busy road and ends above it.
Tourist Attraction
The drone follows a red heart-shaped balloon while rotating left, without gimbal adjustment.
Seaside
The drone flies forward keeping four buildings in view, no significant altitude or gimbal changes.

Scene

Night
The drone flies forward focusing on the bridge and cars, keeping gimbal stable and centered.
Daytime
The drone tracks a Maersk ship and a tugboat with slight gimbal adjustments to keep both in view.
Snowy
The drone flies upward and backward while rotating left, tracking a snowmobile near a forest.

Task

Navigation
The drone moves forward, turns right slightly, capturing commercial street view with buildings and pedestrians.
Tracking
The drone follows a boat on a river, adjusting position to keep the boat centered in view.
Detection
The drone flies forward while detecting and observing traffic flow on the road.
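
To make the pairing of videos and motion intentions concrete, the sketch below shows what a single annotation record in such a dataset might look like. The schema, field names, and file path are purely illustrative assumptions, not the released format.

```python
# Hypothetical annotation record pairing a first-person drone clip with its
# motion intention and category tags; all fields are illustrative.
sample_record = {
    "video": "clips/seaside_0001.mp4",  # first-person drone clip (assumed path)
    "intention": (
        "The drone flies forward keeping four buildings in view, "
        "no significant altitude or gimbal changes."
    ),
    "action": "translation",   # translation / rotation / compound movement
    "area": "seaside",         # roadside / tourist attraction / seaside / ...
    "scene": "daytime",        # night / daytime / snowy / ...
    "task": "navigation",      # navigation / tracking / detection
}
```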

Method

We propose a two-phase training framework to build an aerial world model capable of predicting future observations based on current egocentric views and natural language motion intentions. The model is trained to generate physically plausible future video sequences in 6DoF space, reflecting both semantic intention and spatio-temporal consistency.

Method Overview

Phase 1: Learning Intention Controllability

In the first phase, we perform supervised fine-tuning on a pre-trained video generation model using a dataset of paired aerial videos and their corresponding high-level textual motion intentions. This phase enables the model to understand and respond to concise, goal-oriented commands such as "The drone flies toward the blue building" as opposed to passive visual descriptions. By aligning the model's generative process with embodied control semantics, the foundation model begins to function as a predictive world model.

Since aerial agents operate in 6DoF space involving translation, rotation, and gimbal changes, our dataset includes diverse scenarios to encourage robust intention understanding. The model learns to generate temporally coherent and spatially aligned video predictions based on the current visual input and a single-sentence motion command.
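
As a rough illustration of this stage, the sketch below shows a conditional denoising objective for fine-tuning a latent video diffusion backbone on (current frame, intention, future clip) triples. It assumes a standard epsilon-prediction loss; all objects passed in (model, text encoder, VAE, noise schedule) are placeholders rather than the actual CogVideoX training API.

```python
# Minimal sketch of a Phase-1 supervised objective under the stated assumptions.
import torch
import torch.nn.functional as F

def phase1_loss(model, text_encoder, vae, alphas_cumprod, clip, intention):
    # clip: [B, T, 3, H, W] ground-truth future observation; intention: list[str]
    first_frame = clip[:, :1]                  # current egocentric view
    text_emb = text_encoder(intention)         # concise motion intention
    latents = vae.encode(clip)                 # latent target for the future clip
    noise = torch.randn_like(latents)
    t = torch.randint(0, alphas_cumprod.shape[0], (latents.shape[0],),
                      device=latents.device)
    a = alphas_cumprod[t].view(-1, *([1] * (latents.dim() - 1)))
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise   # forward diffusion step
    pred = model(noisy, t, text_emb, image_cond=vae.encode(first_frame))
    return F.mse_loss(pred, noise)             # epsilon-prediction loss
```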

Phase 2: Learning Spatio-Temporal Constraints

Despite initial fine-tuning, model outputs may still contain unrealistic or unstable predictions. To address this, we introduce a self-play strategy combined with large multimodal models (LMMs) to enforce physical consistency. This phase enhances the world model’s ability to generate high-fidelity video sequences by filtering synthetic generations through a rejection sampling mechanism.

The process includes (a) sampling a current frame and generating a motion intention using LMMs, (b) producing candidate video predictions under that intention, and (c) evaluating them based on four criteria: intention alignment, spatial consistency, temporal continuity, and adherence to projective geometry. LMMs act as critics to select the most realistic prediction.

Accepted video-intention pairs are collected into a synthetic dataset and used to further fine-tune the world model. This iterative loop improves the model’s capability to generate videos that not only match the intention but also obey real-world physical constraints, even in complex aerial scenes.
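
A hedged sketch of this loop is given below. The interfaces (`world_model`, `vlm_propose_intention`, `vlm_score`) are hypothetical stand-ins for the fine-tuned generator and the LMM critic, and the acceptance threshold is an assumed parameter.

```python
# Sketch of the Phase-2 self-play rejection-sampling loop described above.
from typing import Callable, List, Tuple

def self_play_round(frames, world_model, vlm_propose_intention: Callable,
                    vlm_score: Callable, num_candidates: int = 4,
                    threshold: float = 0.8) -> List[Tuple]:
    accepted = []
    for frame in frames:
        # (a) the LMM proposes a plausible motion intention for this frame
        intention = vlm_propose_intention(frame)
        # (b) the world model rolls out several candidate futures
        candidates = [world_model.generate(frame, intention, seed=s)
                      for s in range(num_candidates)]
        # (c) the LMM critic scores each rollout on intention alignment,
        #     spatial consistency, temporal continuity, projective geometry
        scores = [vlm_score(frame, intention, video) for video in candidates]
        best = max(range(num_candidates), key=lambda i: scores[i])
        if scores[best] >= threshold:          # rejection sampling
            accepted.append((frame, intention, candidates[best]))
    return accepted  # synthetic pairs used to further fine-tune the model
```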

Experiment

Experimental Setup

We split the proposed dataset into training and testing sets at a 9:1 ratio. In Phase 1, we fine-tune the video generation model CogVideoX-i2v-5B with the following key hyperparameters: a resolution of 49×480×720 (frames×height×width), a batch size of 2, 8 gradient accumulation steps, and a total of 10 training epochs on 8 NVIDIA A800-SXM4-40GB GPUs. For Phase 2, we use Gemini-2.0-Flash as the vision-language model (VLM) to support the self-play process, owing to its strong video understanding capability and fast inference.
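
For reference, the reported Phase-1 settings can be collected as follows; the key names are illustrative and do not correspond to the actual training script's arguments.

```python
# Phase-1 fine-tuning settings as reported above (key names are illustrative).
phase1_config = {
    "base_model": "CogVideoX-i2v-5B",
    "num_frames": 49,
    "height": 480,
    "width": 720,
    "per_gpu_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "num_gpus": 8,
    "epochs": 10,
}

# Effective batch size = per-GPU batch * accumulation steps * number of GPUs
effective_batch = (phase1_config["per_gpu_batch_size"]
                   * phase1_config["gradient_accumulation_steps"]
                   * phase1_config["num_gpus"])  # 2 * 8 * 8 = 128
```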

Evaluation metrics focus on both visual quality (FID, FVD) and semantic consistency between generated future observations and motion intentions. We report performance under three 6DoF motion types: Translation, Rotation, and Compound, along with the averaged results across all tasks.
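
As a rough illustration of the visual-quality metrics, the sketch below computes frame-level FID with torchmetrics; FVD additionally requires video-level features (e.g., from an I3D network) and is not shown. Tensor shapes and preprocessing are assumptions.

```python
# Hedged sketch of frame-level FID over generated vs. real clips.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048, normalize=True)

def update_fid(real_videos: torch.Tensor, fake_videos: torch.Tensor) -> None:
    # videos: [B, T, 3, H, W] floats in [0, 1]; fold the time axis into the batch
    fid.update(real_videos.flatten(0, 1), real=True)
    fid.update(fake_videos.flatten(0, 1), real=False)

# After iterating over the test set:
# fid_score = fid.compute()
```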

RQ1: How does AirScape perform compared to prior methods?

Quantitative results in the table below show that AirScape outperforms all baseline methods across key metrics. Compared with state-of-the-art models such as Cosmos-Predict1-7B and HunyuanVideo-I2V, AirScape achieves improvements of +14.91% in FID, +14.63% in FVD, and +165.64% in IAR (intention alignment). The gains are especially pronounced under the Rotation setting, which emphasizes physical consistency.

Comparison Results

RQ2: Has AirScape Learned Spatio-Temporal Constraints?

As shown in the case analysis below, AirScape produces physically plausible and intention-aligned video predictions, while baseline models often violate physical laws. For example, in cases of 6DoF motion, HunyuanVideo-I2V and Cosmos may generate distorted objects, unrealistic camera shifts, or discontinuous scenes. AirScape maintains stable depth, object continuity, and smooth transitions, indicating a robust understanding of spatio-temporal constraints learned via self-play.

Case Analysis

RQ3: What are the effects of the self-play mechanism?

The self-play pipeline is key to injecting physical constraints into the model. For each sampled input, multiple video candidates are generated via prompt variation and random seeding. The VLM (e.g., Gemini-2.0-Flash) evaluates and selects the candidate that best aligns with the intended motion while satisfying spatio-temporal coherence. This mechanism filters low-quality generations and improves learning stability during fine-tuning.
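
The exact prompt used with Gemini-2.0-Flash is not reproduced here; the sketch below only illustrates how the four evaluation criteria could be phrased for a VLM critic, with the starting frame and candidate video attached as multimodal inputs.

```python
# Illustrative critic prompt for the VLM selection step (wording is assumed).
CRITIC_PROMPT = """You are given a starting aerial frame, a motion intention,
and a generated video. Rate the video from 0 to 1 on each criterion:
1. Intention alignment: does the camera motion match the stated intention?
2. Spatial consistency: do objects keep coherent geometry and depth?
3. Temporal continuity: are transitions smooth, without popping or flicker?
4. Projective geometry: do perspective and parallax change plausibly?
Return the four scores and their average."""

def build_critic_request(intention: str) -> str:
    # Only the text portion is sketched; the frame and video would be attached
    # as separate multimodal parts of the request.
    return f"{CRITIC_PROMPT}\n\nMotion intention: {intention}"
```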

Self-Play Mechanism

Citation

@misc{zhao2025airscapeaerialgenerativeworld,
      title={AirScape: An Aerial Generative World Model with Motion Controllability}, 
      author={Baining Zhao and Rongze Tang and Mingyuan Jia and Ziyou Wang and Fanghang Man and Xin Zhang and Yu Shang and Weichen Zhang and Chen Gao and Wei Wu and Xin Wang and Xinlei Chen and Yong Li},
      year={2025},
      eprint={2507.08885},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.08885}, 
}