Worldscape-MoE: A Unified Mixture-of-Experts World Model for Scalable Heterogeneous Action Control

Abstract

World models are rapidly becoming core infrastructure for embodied intelligence and interactive agents: they provide controllable simulators in which agents can perceive, act, forecast, and acquire experience at scale. Yet current video-generation world models are still organized around isolated control interfaces, such as camera trajectories, robot actions, or hand-joint signals. This fragmentation is increasingly a scaling bottleneck. The central challenge is not the absence of controllable generators, but the lack of a unified and extensible learning framework that can absorb heterogeneous action supervision while preserving a shared model of world dynamics. In this work, we introduce Worldscape-MoE, a Mixture-of-Experts world model built on Diffusion Transformers for scalable heterogeneous action control. Our key observation is that different controls are different interfaces to the same underlying world: although their representations differ, they constrain shared physical regularities, scene dynamics, and interaction semantics. Worldscape-MoE operationalizes this observation through modality-aware control injection, shared and control-specific experts, and a progressive MoE tuning strategy that supports continual extension to new action modalities. Experiments across locomotion, robotic manipulation, and egocentric hand control show that heterogeneous supervision improves rather than interferes with individual control capabilities. Worldscape-MoE achieves strong results on WorldArena, improves locomotion and hand-control metrics, generalizes robustly out of distribution, and exhibits scaling behavior as additional control data and experts are integrated.

Contributions

01

We formulate heterogeneous action-control world modeling as a scaling problem and identify the key obstacle as the lack of a unified learning framework rather than the lack of individual control models.

02

We propose Worldscape-MoE, a DiT-based world model that combines modality-aware control injection, shared experts, and control-specific experts to learn from locomotion, robotic manipulation, and egocentric hand-control data in one architecture.

03

We present Worldscape-MoE Tuning, a progressive and extensible training procedure that allows the shared expert to absorb cross-control world knowledge while new experts specialize to newly introduced control modalities.

04

We conduct extensive experiments across heterogeneous control performance, expert routing, MoE effectiveness, scalability, out-of-distribution generalization, and coupled loco-manipulation. The results show consistent gains over dense mixed training and strong performance on manipulation, locomotion, and hand-motion evaluations.

Overview

Figure 1: Worldscape-MoE Overview. Worldscape-MoE supports three mainstream control modalities: Locomotion for trajectory-conditioned world navigation, Manipulation for robot-action-conditioned embodied tasks, and Action Map for hand-joint-conditioned egocentric interaction generation. The framework can also be extended to additional control injection settings.

Method

Worldscape-MoE unifies heterogeneous control signals in one diffusion-transformer world model by using a control-aware Mixture-of-Experts design. During training, each sample is routed through a shared expert plus the corresponding modality expert, enabling cross-modality world knowledge sharing and control-specific specialization at the same time.
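
The routing rule described above (every sample passes through the shared expert plus the expert for its own control modality) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SharedPlusModalityMoE`, the helper `make_scaling_expert`, the toy scaling experts, and the choice to sum the two expert outputs are all assumptions made for illustration. In the real model the experts would be feed-forward blocks inside a Diffusion Transformer.

```python
from typing import Callable, Dict, List

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_scaling_expert(scale: float) -> Expert:
    """Toy stand-in for an expert block (hypothetical): scales every feature."""
    def expert(x: Vector) -> Vector:
        return [scale * v for v in x]
    return expert


class SharedPlusModalityMoE:
    """Hypothetical layer: one shared expert plus one expert per control modality."""

    def __init__(self, shared: Expert, modality_experts: Dict[str, Expert]):
        self.shared = shared
        self.modality_experts = dict(modality_experts)

    def add_modality(self, name: str, expert: Expert) -> None:
        # A new control modality gets its own expert; existing experts are untouched.
        if name in self.modality_experts:
            raise ValueError(f"modality already registered: {name}")
        self.modality_experts[name] = expert

    def forward(self, x: Vector, modality: str) -> Vector:
        if modality not in self.modality_experts:
            raise KeyError(f"unknown control modality: {modality}")
        shared_out = self.shared(x)                        # cross-modality path
        specific_out = self.modality_experts[modality](x)  # control-specific path
        # Assumption: the two paths are combined by element-wise summation.
        return [s + m for s, m in zip(shared_out, specific_out)]


layer = SharedPlusModalityMoE(
    shared=make_scaling_expert(1.0),
    modality_experts={
        "locomotion": make_scaling_expert(0.5),
        "manipulation": make_scaling_expert(2.0),
    },
)
tokens = [1.0, 2.0]
loco_out = layer.forward(tokens, "locomotion")
manip_out = layer.forward(tokens, "manipulation")

# Extending to a new modality only registers one more expert.
layer.add_modality("hand", make_scaling_expert(-1.0))
hand_out = layer.forward(tokens, "hand")
```

Because the modality label selects the expert directly, this sketch needs no learned gating network. Whether Worldscape-MoE uses hard modality routing of this kind at every layer is not stated on this page.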

Figure 2: Worldscape-MoE Architecture. Given the current world observation and different forms of supervisory control, our framework generates world dynamics under heterogeneous control signals. It supports both egocentric world exploration and embodied task execution.

Video Showcase

Out of Distribution

W/O MoE Comparison

Locomotion Comparison

Hand Motion Comparison

Physics Consistency

Loco-Hand Motion/Manipulation

Quantitative Results

Locomotion Experiments

Method	Avg	Brightness	Color Temp	Sharpness	Motion	Smoothness	Trajectory Accuracy
Worldscape-MoE	0.7556	0.6955	0.7758	0.7639	0.6745	0.9941	0.6300
w/o MoE	0.6869	0.6710	0.6993	0.6613	0.4865	0.9930	0.6100
Matrix-game 3.0	0.6232	0.5633	0.6180	0.6353	0.3852	0.9660	0.5714
HY-World 1.5	0.7322	0.7128	0.7027	0.7477	0.5545	0.9908	0.6844
CameraCtrl	0.5521	0.4602	0.4812	0.3076	0.4833	0.9832	0.5970
MotionCtrl	0.5562	0.4583	0.5296	0.2421	0.5182	0.9776	0.6115
CamI2V	0.6137	0.5150	0.5904	0.4513	0.5255	0.9886	0.6115
RealCam-I2V	0.7063	0.6530	0.5712	0.6197	0.6987	0.9901	0.7050
VideoX-Fun-Wan	0.7443	0.6684	0.6856	0.6640	0.6934	0.9899	0.7645
AC3D	0.7262	0.4884	0.7764	0.7050	0.7213	0.9934	0.6729
ASTRA	0.6072	0.5600	0.5916	0.5088	0.5625	0.9826	0.4379
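
The page does not define the Avg column. The short check below suggests that it is the unweighted mean of the six per-metric scores. The values are copied from four rows of the table above; the dictionary layout and the helper name `mean` are illustrative choices, not part of the benchmark.

```python
# Each entry: reported Avg, then Brightness, Color Temp, Sharpness,
# Motion, Smoothness, Trajectory Accuracy (copied from the table).
rows = {
    "Worldscape-MoE": (0.7556, [0.6955, 0.7758, 0.7639, 0.6745, 0.9941, 0.6300]),
    "w/o MoE": (0.6869, [0.6710, 0.6993, 0.6613, 0.4865, 0.9930, 0.6100]),
    "HY-World 1.5": (0.7322, [0.7128, 0.7027, 0.7477, 0.5545, 0.9908, 0.6844]),
    "VideoX-Fun-Wan": (0.7443, [0.6684, 0.6856, 0.6640, 0.6934, 0.9899, 0.7645]),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Largest gap between the reported Avg and the recomputed mean.
max_gap = max(abs(mean(metrics) - avg) for avg, metrics in rows.values())
avg_matches_mean = max_gap < 1e-4  # tolerance covers 4-decimal rounding
```

For these rows the recomputed mean agrees with the reported Avg to within rounding, so a method can lead on Avg without leading on every column. For example, VideoX-Fun-Wan has the highest Trajectory Accuracy while Worldscape-MoE has the highest Avg.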

Manipulation Experiments

Model	EWM Score
Worldscape-MoE	62.84
w/o MoE	61.88
CtrlWorld	59.98
Wan 2.6	59.80
CogVideoX	58.79
Veo 3.1	57.77
IRASim	56.14
TesserAct	54.62
Cosmos-Predict 2.5 (action)	54.29
Cosmos-Predict 2.5 (text)	53.06
Vidar	51.92
Wan 2.2	51.71
GigaWorld-0	50.96
RoboMaster	50.35

Hand Motion Experiments

Model	FID-VID ↓	FVD ↓	FID ↓	Image Quality ↑
Worldscape-MoE	3.80	110.94	5.78	0.7325
w/o MoE	5.39	128.87	15.34	0.7250
HunyuanVideo-1.5	23.18	517.42	56.31	0.6419
Cosmos-Predict 2.5	15.02	628.96	51.36	0.6158
MimicMotion	26.74	589.47	48.92	0.5324
MagicDance	65.93	1498.65	91.78	0.5739
LOME	144.58	1794.84	67.82	0.5281
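
To make the MoE ablation easier to read, the sketch below computes the relative reduction of the distance-based metrics (FID-VID, FVD, FID, where lower is better) when moving from the w/o MoE row to the full model. The numbers are copied from the table above; the function name `relative_reduction` is an illustrative choice.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


# Values copied from the "w/o MoE" and "Worldscape-MoE" rows.
without_moe = {"FID-VID": 5.39, "FVD": 128.87, "FID": 15.34}
with_moe = {"FID-VID": 3.80, "FVD": 110.94, "FID": 5.78}

reductions = {
    name: relative_reduction(without_moe[name], with_moe[name])
    for name in without_moe
}

for name, frac in reductions.items():
    print(f"{name}: {100 * frac:.1f}% lower with MoE")
```

On these figures the MoE variant lowers FID-VID by roughly 29.5%, FVD by roughly 13.9%, and FID by roughly 62.3% relative to the w/o MoE row.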

Visual Motion and Consistency Metrics

Model	Image	Aesthetic	JEPA	Dynamic	Flow	Smoothness	Subject	Background	Photometric
Worldscape-MoE	0.4566	0.3795	0.8920	0.4373	0.2632	0.7717	0.8333	0.9043	0.1439
w/o MoE	0.5220	0.4053	0.8779	0.4432	0.2457	0.7776	0.8282	0.8990	0.1126
GigaWorld-0	0.5041	0.3991	0.4413	0.6709	0.3118	0.7811	0.7303	0.8563	0.1756
TesserAct	0.3322	0.4590	0.4579	0.5150	0.2447	0.7579	0.8250	0.9238	0.2491
RoboMaster	0.3487	0.3842	0.2966	0.6124	0.1484	0.6940	0.8295	0.9123	0.3356
Vidar	0.4145	0.4068	0.5608	0.2767	0.1426	0.7973	0.7629	0.8300	0.2350
Cosmos-Predict 2.5 (text)	0.6668	0.4501	0.3126	0.5911	0.4302	0.7882	0.7488	0.8511	0.1383
Cosmos-Predict 2.5 (action)	0.4489	0.3576	0.9296	0.3994	0.0573	0.7100	0.8197	0.8894	0.3528
CtrlWorld	0.3522	0.3893	0.9185	0.4257	0.3449	0.7377	0.8411	0.9057	0.1729
Wan 2.2	0.3884	0.3963	0.7575	0.4349	0.1269	0.7019	0.8388	0.9042	0.4776
CogVideoX	0.3582	0.3777	0.9384	0.3166	0.2189	0.7391	0.8083	0.8773	0.3580
IRASim	0.3489	0.3623	0.9330	0.4139	0.2083	0.7052	0.8312	0.9068	0.3522
Veo 3.1	0.6605	0.4632	0.5694	0.5450	0.1396	0.6989	0.7878	0.8710	0.3247
Wan 2.6	0.6824	0.4433	0.7229	0.7421	0.4532	0.8539	0.7517	0.8687	0.1904

Physics, 3D, and Controllability Metrics

Model	Interaction	Trajectory	Depth	Perspectivity	Instruction	Semantic	Action
Worldscape-MoE	0.8008	0.4610	0.9030	0.9686	0.9348	0.9039	0.0955
w/o MoE	0.7622	0.3540	0.9038	0.9744	0.8703	0.8914	0.0324
GigaWorld-0	0.5368	0.1552	0.6316	0.7596	0.6156	0.8591	0.1134
TesserAct	0.5800	0.1396	0.7159	0.7920	0.6152	0.8783	0.0311
RoboMaster	0.5364	0.1158	0.8335	0.7588	0.5772	0.8761	0.0352
Vidar	0.5348	0.1928	0.7872	0.7592	0.5912	0.8826	0.0819
Cosmos-Predict 2.5 (text)	0.3872	0.0816	0.7051	0.7964	0.2664	0.7733	0.1418
Cosmos-Predict 2.5 (action)	0.5500	0.2945	0.8862	0.7644	0.5840	0.8879	0.0133
CtrlWorld	0.6212	0.4766	0.9300	0.7960	0.7272	0.8912	0.0210
Wan 2.2	0.5184	0.1627	0.7768	0.7660	0.5376	0.8877	0.0512
CogVideoX	0.5940	0.3526	0.9097	0.7828	0.7268	0.8977	0.0076
IRASim	0.5656	0.3639	0.9312	0.7788	0.6604	0.8849	0.0526
Veo 3.1	0.7872	0.1231	0.7421	0.8276	0.9328	0.8607	0.0852
Wan 2.6	0.7280	0.1182	0.7144	0.8032	0.8536	0.8728	0.0992
