Worldscape-MoE

A Unified Mixture-of-Experts Architecture for Scalable Multi-Control Video Generation World Modeling

Worldscape Team

Abstract

With the rapid progress of interactive video generation, video generation world models have gradually emerged as one of the mainstream paradigms in world model research and are increasingly regarded as a promising path toward efficient intelligent agents. However, existing video generation world models are typically developed under different forms of control supervision and mainly focus either on interactive world modeling or embodied world modeling, leaving the compatibility of heterogeneous control signals largely unexplored. In this work, we introduce Worldscape-MoE, the first training framework that enables unified learning of world models under heterogeneous supervisory controls by incorporating a Mixture-of-Experts (MoE) design into Diffusion Transformers (DiT). We further propose Worldscape-MoE Tuning, a continually extensible heterogeneous training strategy for world models, which supports diverse control signals, including robotic arms, hand joints, and camera poses, within a single world model and allows the model to be progressively expanded to new control settings. By enabling joint learning from more diverse data sources, this training strategy alleviates the scaling bottleneck of current world models. Experiments show that MoE-based heterogeneous supervision brings consistent mutual gains across control types, achieving state-of-the-art performance on WorldArena, strong locomotion and hand-motion capabilities, robust out-of-distribution generalization, and loco-manipulation generation.

Contributions

01

We propose Worldscape-MoE, the first world model framework that unifies heterogeneous control signals within a single architecture, enabling training on all available ego-centric world-modeling data and alleviating scaling bottlenecks caused by single-modality supervision.

02

We present Worldscape-MoE Tuning, a scalable training strategy that supports the progressive addition of control modalities while allowing the shared expert to continually absorb world knowledge across all controls.

03

Extensive experiments show clear positive transfer across modalities in Worldscape-MoE, with each single-modality inference setting outperforming existing baselines. In particular, Worldscape-MoE surpasses Ctrl-World by 2.86 points in EWM Score on WorldArena (62.84 vs. 59.98) and further achieves stronger physical consistency, higher visual quality, robust OOD generalization, and loco-manipulation generation.

Overview

Figure 1: Worldscape-MoE Overview. Worldscape-MoE supports three mainstream control modalities: Locomotion for trajectory-conditioned world navigation, Manipulation for robot-action-conditioned embodied tasks, and Action Map for hand-joint-conditioned egocentric interaction generation. The framework can also be extended to additional control injection settings.

Method

Worldscape-MoE unifies heterogeneous control signals in one diffusion-transformer world model by using a control-aware Mixture-of-Experts design. During training, each sample is routed through a shared expert plus the corresponding modality expert, enabling cross-modality world knowledge sharing and control-specific specialization at the same time.
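
The routing described above can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and is not taken from the Worldscape-MoE implementation: the class name `ControlAwareMoE`, the toy linear experts, the modality labels, and in particular the choice to combine the shared and modality expert outputs by summation. It only shows the idea that each sample passes through one shared expert plus the expert selected by its control label, and that a new control type can be added as a new expert.

```python
import random


def make_linear(dim, seed):
    """Return a toy linear map (dim -> dim) standing in for an expert network."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def apply(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


class ControlAwareMoE:
    """Hypothetical block: shared expert output plus one modality expert output."""

    def __init__(self, dim, modalities):
        self.dim = dim
        self.shared = make_linear(dim, seed=0)  # used by every sample
        self.experts = {}
        for name in modalities:
            self.add_expert(name)

    def add_expert(self, name):
        # A new control type gets its own expert; the shared expert is reused.
        if name in self.experts:
            raise ValueError(f"expert {name!r} already exists")
        self.experts[name] = make_linear(self.dim, seed=len(self.experts) + 1)

    def forward(self, tokens, modality):
        # Routing uses the sample's control label, not a learned gate (assumption).
        expert = self.experts[modality]
        out = []
        for x in tokens:
            shared_out = self.shared(x)
            expert_out = expert(x)
            # Summation is an illustrative choice for merging the two paths.
            out.append([s + e for s, e in zip(shared_out, expert_out)])
        return out


moe = ControlAwareMoE(dim=4, modalities=["locomotion", "manipulation", "action_map"])
tokens = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.1, 0.0]]
y_loc = moe.forward(tokens, "locomotion")
y_man = moe.forward(tokens, "manipulation")
```

In this sketch the same tokens produce different outputs under different control labels because only the modality expert changes, while the shared path is identical for all samples.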

Figure 2: Worldscape-MoE Architecture. Given the current world observation and different forms of supervisory control, our framework generates world dynamics under heterogeneous control signals. It supports both egocentric world exploration and embodied task execution.

Video Showcase

Showcase categories: Out of Distribution; W/O MoE Comparison; Locomotion Comparison; Hand Motion Comparison; Physics Consistency; Loco-Hand Motion/Manipulation.

Quantitative Results

Locomotion Experiments

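| Method | Avg | Brightness | Color Temp | Sharpness | Motion | Smoothness | Trajectory Accuracy |
|---|---|---|---|---|---|---|---|
| Worldscape-MoE | 0.7556 | 0.6955 | 0.7758 | 0.7639 | 0.6745 | 0.9941 | 0.6300 |
| w/o MoE | 0.6869 | 0.6710 | 0.6993 | 0.6613 | 0.4865 | 0.9930 | 0.6100 |
| Matrix-game 3.0 | 0.6232 | 0.5633 | 0.6180 | 0.6353 | 0.3852 | 0.9660 | 0.5714 |
| HY-World 1.5 | 0.7322 | 0.7128 | 0.7027 | 0.7477 | 0.5545 | 0.9908 | 0.6844 |
| CameraCtrl | 0.5521 | 0.4602 | 0.4812 | 0.3076 | 0.4833 | 0.9832 | 0.5970 |
| MotionCtrl | 0.5562 | 0.4583 | 0.5296 | 0.2421 | 0.5182 | 0.9776 | 0.6115 |
| CamI2V | 0.6137 | 0.5150 | 0.5904 | 0.4513 | 0.5255 | 0.9886 | 0.6115 |
| RealCam-I2V | 0.7063 | 0.6530 | 0.5712 | 0.6197 | 0.6987 | 0.9901 | 0.7050 |
| VideoX-Fun-Wan | 0.7443 | 0.6684 | 0.6856 | 0.6640 | 0.6934 | 0.9899 | 0.7645 |
| AC3D | 0.7262 | 0.4884 | 0.7764 | 0.7050 | 0.7213 | 0.9934 | 0.6729 |
| ASTRA | 0.6072 | 0.5600 | 0.5916 | 0.5088 | 0.5625 | 0.9826 | 0.4379 |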
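
The page does not state how the Avg column is computed. The reported values are consistent with an unweighted mean of the six remaining columns (Brightness, Color Temp, Sharpness, Motion, Smoothness, Trajectory Accuracy); the short check below recomputes that mean for a few rows copied from the locomotion table. The equal weighting is our reading of the numbers, not a documented definition.

```python
# Rows copied from the locomotion table:
# name -> (reported Avg, [Brightness, Color Temp, Sharpness, Motion,
#                         Smoothness, Trajectory Accuracy])
rows = {
    "Worldscape-MoE": (0.7556, [0.6955, 0.7758, 0.7639, 0.6745, 0.9941, 0.6300]),
    "w/o MoE": (0.6869, [0.6710, 0.6993, 0.6613, 0.4865, 0.9930, 0.6100]),
    "HY-World 1.5": (0.7322, [0.7128, 0.7027, 0.7477, 0.5545, 0.9908, 0.6844]),
    "VideoX-Fun-Wan": (0.7443, [0.6684, 0.6856, 0.6640, 0.6934, 0.9899, 0.7645]),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Largest gap between the reported Avg and the recomputed mean.
max_gap = max(abs(avg - mean(metrics)) for avg, metrics in rows.values())
print(f"max |reported - recomputed| = {max_gap:.5f}")
```

A gap below the four-decimal rounding of the table supports the equal-weight reading for these rows.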

Manipulation Experiments

| Model | EWM Score |
|---|---|
| Worldscape-MoE | 62.84 |
| w/o MoE | 61.88 |
| CtrlWorld | 59.98 |
| Wan 2.6 | 59.80 |
| CogvideoX | 58.79 |
| Veo 3.1 | 57.77 |
| IRASim | 56.14 |
| TesserAct | 54.62 |
| Cosmos-Predict 2.5 (action) | 54.29 |
| Cosmos-Predict 2.5 (text) | 53.06 |
| Vidar | 51.92 |
| Wan 2.2 | 51.71 |
| GigaWorld-0 | 50.96 |
| RoboMaster | 50.35 |
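
To make the size of the margins explicit, the following small calculation reports both the absolute gap in EWM Score points and the gap relative to the baseline's score, using values copied from the manipulation table. The helper name `gap` is ours and only serves this illustration.

```python
# EWM Scores copied from the manipulation table.
ewm = {
    "Worldscape-MoE": 62.84,
    "w/o MoE": 61.88,
    "CtrlWorld": 59.98,
}


def gap(model, baseline):
    """Return (absolute gap in points, gap as a percentage of the baseline score)."""
    absolute = ewm[model] - ewm[baseline]
    relative = 100 * absolute / ewm[baseline]
    return round(absolute, 2), round(relative, 2)


vs_ctrlworld = gap("Worldscape-MoE", "CtrlWorld")
vs_ablation = gap("Worldscape-MoE", "w/o MoE")
print("vs CtrlWorld:", vs_ctrlworld)
print("vs w/o MoE:", vs_ablation)
```

The absolute gap to CtrlWorld is the 2.86-point figure; expressed relative to CtrlWorld's own score, the same gap is somewhat larger in percentage terms, so the two readings should not be conflated.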

Hand Motion Experiments

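| Model | FID-VID | FVD | FID | Image Quality |
|---|---|---|---|---|
| Worldscape-MoE | 3.80 | 110.94 | 5.78 | 0.7325 |
| w/o MoE | 5.39 | 128.87 | 15.34 | 0.7250 |
| HunyuanVideo-1.5 | 23.18 | 517.42 | 56.31 | 0.6419 |
| Cosmos-Predict 2.5 | 15.02 | 628.96 | 51.36 | 0.6158 |
| MimicMotion | 26.74 | 589.47 | 48.92 | 0.5324 |
| MagicDance | 65.93 | 1498.65 | 91.78 | 0.5739 |
| LOME | 144.58 | 1794.84 | 67.82 | 0.5281 |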
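
FID-VID, FVD, and FID are distance-style metrics where lower is better, so the effect of the MoE design in the hand-motion ablation is easiest to read as a relative reduction against the w/o MoE row. The sketch below computes that reduction from the two rows of the table; the function name `relative_reduction` is a hypothetical helper for this illustration.

```python
# (with MoE, without MoE) pairs copied from the hand-motion table.
ablation = {
    "FID-VID": (3.80, 5.39),
    "FVD": (110.94, 128.87),
    "FID": (5.78, 15.34),
}


def relative_reduction(with_moe, without_moe):
    """Percent reduction relative to the w/o MoE value (lower is better)."""
    return 100 * (without_moe - with_moe) / without_moe


reductions = {name: relative_reduction(*pair) for name, pair in ablation.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower with MoE")
```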

Visual Motion and Consistency Metrics

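| Model | Image | Aesthetic | JEPA | Dynamic | Flow | Smoothness | Subject | Background | Photometric |
|---|---|---|---|---|---|---|---|---|---|
| Worldscape-MoE | 0.4566 | 0.3795 | 0.8920 | 0.4373 | 0.2632 | 0.7717 | 0.8333 | 0.9043 | 0.1439 |
| w/o MoE | 0.5220 | 0.4053 | 0.8779 | 0.4432 | 0.2457 | 0.7776 | 0.8282 | 0.8990 | 0.1126 |
| GigaWorld-0 | 0.5041 | 0.3991 | 0.4413 | 0.6709 | 0.3118 | 0.7811 | 0.7303 | 0.8563 | 0.1756 |
| TesserAct | 0.3322 | 0.4590 | 0.4579 | 0.5150 | 0.2447 | 0.7579 | 0.8250 | 0.9238 | 0.2491 |
| RoboMaster | 0.3487 | 0.3842 | 0.2966 | 0.6124 | 0.1484 | 0.6940 | 0.8295 | 0.9123 | 0.3356 |
| Vidar | 0.4145 | 0.4068 | 0.5608 | 0.2767 | 0.1426 | 0.7973 | 0.7629 | 0.8300 | 0.2350 |
| Cosmos-Predict 2.5 (text) | 0.6668 | 0.4501 | 0.3126 | 0.5911 | 0.4302 | 0.7882 | 0.7488 | 0.8511 | 0.1383 |
| Cosmos-Predict 2.5 (action) | 0.4489 | 0.3576 | 0.9296 | 0.3994 | 0.0573 | 0.7100 | 0.8197 | 0.8894 | 0.3528 |
| CtrlWorld | 0.3522 | 0.3893 | 0.9185 | 0.4257 | 0.3449 | 0.7377 | 0.8411 | 0.9057 | 0.1729 |
| Wan 2.2 | 0.3884 | 0.3963 | 0.7575 | 0.4349 | 0.1269 | 0.7019 | 0.8388 | 0.9042 | 0.4776 |
| CogvideoX | 0.3582 | 0.3777 | 0.9384 | 0.3166 | 0.2189 | 0.7391 | 0.8083 | 0.8773 | 0.3580 |
| IRASim | 0.3489 | 0.3623 | 0.9330 | 0.4139 | 0.2083 | 0.7052 | 0.8312 | 0.9068 | 0.3522 |
| Veo 3.1 | 0.6605 | 0.4632 | 0.5694 | 0.5450 | 0.1396 | 0.6989 | 0.7878 | 0.8710 | 0.3247 |
| Wan 2.6 | 0.6824 | 0.4433 | 0.7229 | 0.7421 | 0.4532 | 0.8539 | 0.7517 | 0.8687 | 0.1904 |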

Physics, 3D, and Controllability Metrics

| Model | Interaction | Trajectory | Depth | Perspectivity | Instruction | Semantic | Action |
|---|---|---|---|---|---|---|---|
| Worldscape-MoE | 0.8008 | 0.4610 | 0.9030 | 0.9686 | 0.9348 | 0.9039 | 0.0955 |
| w/o MoE | 0.7622 | 0.3540 | 0.9038 | 0.9744 | 0.8703 | 0.8914 | 0.0324 |
| GigaWorld-0 | 0.5368 | 0.1552 | 0.6316 | 0.7596 | 0.6156 | 0.8591 | 0.1134 |
| TesserAct | 0.5800 | 0.1396 | 0.7159 | 0.7920 | 0.6152 | 0.8783 | 0.0311 |
| RoboMaster | 0.5364 | 0.1158 | 0.8335 | 0.7588 | 0.5772 | 0.8761 | 0.0352 |
| Vidar | 0.5348 | 0.1928 | 0.7872 | 0.7592 | 0.5912 | 0.8826 | 0.0819 |
| Cosmos-Predict 2.5 (text) | 0.3872 | 0.0816 | 0.7051 | 0.7964 | 0.2664 | 0.7733 | 0.1418 |
| Cosmos-Predict 2.5 (action) | 0.5500 | 0.2945 | 0.8862 | 0.7644 | 0.5840 | 0.8879 | 0.0133 |
| CtrlWorld | 0.6212 | 0.4766 | 0.9300 | 0.7960 | 0.7272 | 0.8912 | 0.0210 |
| Wan 2.2 | 0.5184 | 0.1627 | 0.7768 | 0.7660 | 0.5376 | 0.8877 | 0.0512 |
| CogvideoX | 0.5940 | 0.3526 | 0.9097 | 0.7828 | 0.7268 | 0.8977 | 0.0076 |
| IRASim | 0.5656 | 0.3639 | 0.9312 | 0.7788 | 0.6604 | 0.8849 | 0.0526 |
| Veo 3.1 | 0.7872 | 0.1231 | 0.7421 | 0.8276 | 0.9328 | 0.8607 | 0.0852 |
| Wan 2.6 | 0.7280 | 0.1182 | 0.7144 | 0.8032 | 0.8536 | 0.8728 | 0.0992 |

Citation

@article{worldscape_moe_2026,
  title   = {Worldscape-MoE: A Unified Mixture-of-Experts Architecture for Scalable Multi-Control Video Generation World Modeling},
  author  = {Worldscape Team},
  journal = {Under Review},
  year    = {2026}
}