Worldscape-MoE

A Unified Mixture-of-Experts Architecture for Scalable Multi-Control Video Generation World Modeling

Worldscape Team

Paper (Coming Soon) arXiv (Coming Soon) GitHub Model

Abstract

With the rapid progress of interactive video generation, video generation world models have gradually emerged as one of the mainstream paradigms in world model research and are increasingly regarded as a promising path toward efficient intelligent agents. However, existing video generation world models are typically developed under different forms of control supervision and mainly focus either on interactive world modeling or embodied world modeling, leaving the compatibility of heterogeneous control signals largely unexplored. In this work, we introduce Worldscape-MoE, the first training framework that enables unified learning of world models under heterogeneous supervisory controls by incorporating a Mixture-of-Experts (MoE) design into Diffusion Transformers (DiT). We further propose Worldscape-MoE Tuning, a continually extensible heterogeneous training strategy for world models, which supports diverse control signals, including robotic arms, hand joints, and camera poses, within a single world model and allows the model to be progressively expanded to new control settings. By enabling joint learning from more diverse data sources, this training strategy alleviates the scaling bottleneck of current world models. Experiments show that MoE-based heterogeneous supervision brings consistent mutual gains across control types, achieving state-of-the-art performance on WorldArena, strong locomotion and hand-motion capabilities, robust out-of-distribution generalization, and loco-manipulation generation.

Contributions

We propose Worldscape-MoE, the first world model framework that supports heterogeneous control signals within a unified architecture, enabling training on all available ego-centric world-modeling data and alleviating scaling bottlenecks caused by single-modality supervision.

We present Worldscape-MoE Tuning, a scalable training strategy that supports the progressive addition of control modalities while allowing the shared expert to continually absorb world knowledge across all controls.

Extensive experiments show clear positive transfer across modalities in Worldscape-MoE, with each single-modality inference setting outperforming existing baselines. In particular, Worldscape-MoE surpasses Ctrl-World by 2.86 points in EWM Score on WorldArena (62.84 vs. 59.98) and further achieves stronger physical consistency, higher visual quality, robust OOD generalization, and loco-manipulation generation.

Overview

Figure 1: Worldscape-MoE Overview. Worldscape-MoE supports three mainstream control modalities: Locomotion for trajectory-conditioned world navigation, Manipulation for robot-action-conditioned embodied tasks, and Action Map for hand-joint-conditioned egocentric interaction generation. The framework can also be extended to additional control injection settings.

Method

Worldscape-MoE unifies heterogeneous control signals in one diffusion-transformer world model by using a control-aware Mixture-of-Experts design. During training, each sample is routed through a shared expert plus the corresponding modality expert, enabling cross-modality world knowledge sharing and control-specific specialization at the same time.
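The routing described above can be illustrated with a minimal, framework-free sketch. Everything in it is a hypothetical stand-in rather than the released implementation: the class name `ControlRoutedMoE`, the `add_expert` method, the toy scaling "experts", and the modality keys are invented for illustration, and summing the shared and modality-specific outputs is an assumed combination rule that the text does not specify.

```python
from typing import Callable, Dict, List

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_scale_expert(scale: float) -> Expert:
    """Toy stand-in for an expert network: multiplies every feature by `scale`."""
    return lambda x: [scale * v for v in x]


class ControlRoutedMoE:
    """Shared expert plus one expert per control modality (hypothetical layout)."""

    def __init__(self, shared: Expert) -> None:
        self.shared = shared
        self.modality_experts: Dict[str, Expert] = {}

    def add_expert(self, modality: str, expert: Expert) -> None:
        # A new control setting is registered as a new expert;
        # the shared expert and existing experts are left in place.
        if modality in self.modality_experts:
            raise ValueError(f"expert for {modality!r} already registered")
        self.modality_experts[modality] = expert

    def forward(self, x: Vector, modality: str) -> Vector:
        # Routing is decided by the sample's control type rather than a
        # learned gate (an assumption of this sketch).
        expert = self.modality_experts[modality]
        shared_out = self.shared(x)
        expert_out = expert(x)
        # Assumed combination rule: element-wise sum of both outputs.
        return [s + e for s, e in zip(shared_out, expert_out)]


moe = ControlRoutedMoE(shared=make_scale_expert(1.0))
moe.add_expert("locomotion", make_scale_expert(0.5))
moe.add_expert("manipulation", make_scale_expert(2.0))

tokens = [1.0, 2.0]
loco_out = moe.forward(tokens, "locomotion")
manip_out = moe.forward(tokens, "manipulation")

# Extending to a new control setting only registers one more expert.
moe.add_expert("hand_joints", make_scale_expert(-1.0))
hand_out = moe.forward(tokens, "hand_joints")
```

In this toy setup every sample passes through the shared expert, so it sees data from all control types, while each modality expert only sees samples of its own control type.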

Figure 2: Worldscape-MoE Architecture. Given the current world observation and different forms of supervisory control, our framework generates world dynamics under heterogeneous control signals. It supports both egocentric world exploration and embodied task execution.

Video Showcase

Out of Distribution

W/O MoE Comparison

Locomotion Comparison

Hand Motion Comparison

Physics Consistency

Loco-Hand Motion/Manipulation

Quantitative Results

Locomotion Experiments

Method	Avg	Brightness	Color Temp	Sharpness	Motion	Smoothness	Trajectory Accuracy
Worldscape-MoE	0.7556	0.6955	0.7758	0.7639	0.6745	0.9941	0.6300
w/o MoE	0.6869	0.6710	0.6993	0.6613	0.4865	0.9930	0.6100
Matrix-game 3.0	0.6232	0.5633	0.6180	0.6353	0.3852	0.9660	0.5714
HY-World 1.5	0.7322	0.7128	0.7027	0.7477	0.5545	0.9908	0.6844
CameraCtrl	0.5521	0.4602	0.4812	0.3076	0.4833	0.9832	0.5970
MotionCtrl	0.5562	0.4583	0.5296	0.2421	0.5182	0.9776	0.6115
CamI2V	0.6137	0.5150	0.5904	0.4513	0.5255	0.9886	0.6115
RealCam-I2V	0.7063	0.6530	0.5712	0.6197	0.6987	0.9901	0.7050
VideoX-Fun-Wan	0.7443	0.6684	0.6856	0.6640	0.6934	0.9899	0.7645
AC3D	0.7262	0.4884	0.7764	0.7050	0.7213	0.9934	0.6729
ASTRA	0.6072	0.5600	0.5916	0.5088	0.5625	0.9826	0.4379
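
The Avg column appears to be consistent with an unweighted mean of the six metric columns, although the page does not state the formula. A small check on the Worldscape-MoE row, under that assumption:

```python
# Per-metric scores for the Worldscape-MoE row, copied from the table above.
worldscape_moe = {
    "Brightness": 0.6955,
    "Color Temp": 0.7758,
    "Sharpness": 0.7639,
    "Motion": 0.6745,
    "Smoothness": 0.9941,
    "Trajectory Accuracy": 0.6300,
}
reported_avg = 0.7556

# Assumption: Avg is the unweighted mean of the six metric columns.
computed_avg = sum(worldscape_moe.values()) / len(worldscape_moe)

# Allow for rounding to four decimals in the reported value.
matches = abs(computed_avg - reported_avg) < 1e-4
print(f"computed={computed_avg:.4f} reported={reported_avg:.4f} match={matches}")
```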

Manipulation Experiments

Model	EWM Score
Worldscape-MoE	62.84
w/o MoE	61.88
CtrlWorld	59.98
Wan 2.6	59.80
CogvideoX	58.79
Veo 3.1	57.77
IRASim	56.14
TesserAct	54.62
Cosmos-Predict 2.5 (action)	54.29
Cosmos-Predict 2.5 (text)	53.06
Vidar	51.92
Wan 2.2	51.71
GigaWorld-0	50.96
RoboMaster	50.35
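
The margin over Ctrl-World quoted in the contributions can be reproduced from this table if it is read as an absolute difference in EWM Score. The sketch below also computes the relative gain, which is a derived figure not reported on the page.

```python
# EWM Scores copied from the table above.
ewm = {"Worldscape-MoE": 62.84, "w/o MoE": 61.88, "CtrlWorld": 59.98}

# Absolute margin over CtrlWorld, in EWM Score points.
absolute_gain = ewm["Worldscape-MoE"] - ewm["CtrlWorld"]

# The same margin expressed relative to the CtrlWorld score (derived, not reported).
relative_gain_pct = 100.0 * absolute_gain / ewm["CtrlWorld"]

# Gain attributable to the MoE design, from the ablation row.
moe_gain = ewm["Worldscape-MoE"] - ewm["w/o MoE"]

print(f"absolute: {absolute_gain:.2f} points")
print(f"relative: {relative_gain_pct:.2f}%")
print(f"MoE vs. w/o MoE: {moe_gain:.2f} points")
```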

Hand Motion Experiments

Model	FID-VID	FVD	FID	Image Quality
Worldscape-MoE	3.80	110.94	5.78	0.7325
w/o MoE	5.39	128.87	15.34	0.7250
HunyuanVideo-1.5	23.18	517.42	56.31	0.6419
Cosmos-Predict 2.5	15.02	628.96	51.36	0.6158
MimicMotion	26.74	589.47	48.92	0.5324
MagicDance	65.93	1498.65	91.78	0.5739
LOME	144.58	1794.84	67.82	0.5281

Visual Motion and Consistency Metrics

Model	Image	Aesthetic	JEPA	Dynamic	Flow	Smoothness	Subject	Background	Photometric
Worldscape-MoE	0.4566	0.3795	0.8920	0.4373	0.2632	0.7717	0.8333	0.9043	0.1439
w/o MoE	0.5220	0.4053	0.8779	0.4432	0.2457	0.7776	0.8282	0.8990	0.1126
GigaWorld-0	0.5041	0.3991	0.4413	0.6709	0.3118	0.7811	0.7303	0.8563	0.1756
TesserAct	0.3322	0.4590	0.4579	0.5150	0.2447	0.7579	0.8250	0.9238	0.2491
RoboMaster	0.3487	0.3842	0.2966	0.6124	0.1484	0.6940	0.8295	0.9123	0.3356
Vidar	0.4145	0.4068	0.5608	0.2767	0.1426	0.7973	0.7629	0.8300	0.2350
Cosmos-Predict 2.5 (text)	0.6668	0.4501	0.3126	0.5911	0.4302	0.7882	0.7488	0.8511	0.1383
Cosmos-Predict 2.5 (action)	0.4489	0.3576	0.9296	0.3994	0.0573	0.7100	0.8197	0.8894	0.3528
CtrlWorld	0.3522	0.3893	0.9185	0.4257	0.3449	0.7377	0.8411	0.9057	0.1729
Wan 2.2	0.3884	0.3963	0.7575	0.4349	0.1269	0.7019	0.8388	0.9042	0.4776
CogvideoX	0.3582	0.3777	0.9384	0.3166	0.2189	0.7391	0.8083	0.8773	0.3580
IRASim	0.3489	0.3623	0.9330	0.4139	0.2083	0.7052	0.8312	0.9068	0.3522
Veo 3.1	0.6605	0.4632	0.5694	0.5450	0.1396	0.6989	0.7878	0.8710	0.3247
Wan 2.6	0.6824	0.4433	0.7229	0.7421	0.4532	0.8539	0.7517	0.8687	0.1904

Physics, 3D, and Controllability Metrics

Model	Interaction	Trajectory	Depth	Perspectivity	Instruction	Semantic	Action
Worldscape-MoE	0.8008	0.4610	0.9030	0.9686	0.9348	0.9039	0.0955
w/o MoE	0.7622	0.3540	0.9038	0.9744	0.8703	0.8914	0.0324
GigaWorld-0	0.5368	0.1552	0.6316	0.7596	0.6156	0.8591	0.1134
TesserAct	0.5800	0.1396	0.7159	0.7920	0.6152	0.8783	0.0311
RoboMaster	0.5364	0.1158	0.8335	0.7588	0.5772	0.8761	0.0352
Vidar	0.5348	0.1928	0.7872	0.7592	0.5912	0.8826	0.0819
Cosmos-Predict 2.5 (text)	0.3872	0.0816	0.7051	0.7964	0.2664	0.7733	0.1418
Cosmos-Predict 2.5 (action)	0.5500	0.2945	0.8862	0.7644	0.5840	0.8879	0.0133
CtrlWorld	0.6212	0.4766	0.9300	0.7960	0.7272	0.8912	0.0210
Wan 2.2	0.5184	0.1627	0.7768	0.7660	0.5376	0.8877	0.0512
CogvideoX	0.5940	0.3526	0.9097	0.7828	0.7268	0.8977	0.0076
IRASim	0.5656	0.3639	0.9312	0.7788	0.6604	0.8849	0.0526
Veo 3.1	0.7872	0.1231	0.7421	0.8276	0.9328	0.8607	0.0852
Wan 2.6	0.7280	0.1182	0.7144	0.8032	0.8536	0.8728	0.0992

Citation

@article{worldscape_moe_2026,
  title   = {Worldscape-MoE: A Unified Mixture-of-Experts Architecture for Scalable Multi-Control Video Generation World Modeling},
  author  = {Worldscape Team},
  journal = {Under Review},
  year    = {2026}
}