Worldscape-MoE

A Unified Mixture-of-Experts World Model for Scalable Heterogeneous Action Control

Worldscape Team

Abstract

World models are rapidly becoming core infrastructure for embodied intelligence and interactive agents: they provide controllable simulators in which agents can perceive, act, forecast, and acquire experience at scale. Yet current video-generation world models are still organized around isolated control interfaces, such as camera trajectories, robot actions, or hand-joint signals. This fragmentation is increasingly a scaling bottleneck. The central challenge is not the absence of controllable generators, but the lack of a unified and extensible learning framework that can absorb heterogeneous action supervision while preserving a shared model of world dynamics. In this work, we introduce Worldscape-MoE, a Mixture-of-Experts world model built on Diffusion Transformers (DiT) for scalable heterogeneous action control. Our key observation is that different controls are different interfaces to the same underlying world: although their representations differ, they constrain shared physical regularities, scene dynamics, and interaction semantics. Worldscape-MoE operationalizes this observation through modality-aware control injection, shared and control-specific experts, and a progressive MoE tuning strategy that supports continual extension to new action modalities. Experiments across locomotion, robotic manipulation, and egocentric hand control show that heterogeneous supervision improves rather than interferes with individual control capabilities. Worldscape-MoE achieves strong results on WorldArena, improves locomotion and hand-control metrics, generalizes robustly to out-of-distribution inputs, and scales as additional control data and experts are integrated.


Contributions

01

We formulate heterogeneous action-control world modeling as a scaling problem and identify the key obstacle as the lack of a unified learning framework rather than the lack of individual control models.

02

We propose Worldscape-MoE, a DiT-based world model that combines modality-aware control injection, shared experts, and control-specific experts to learn from locomotion, robotic manipulation, and egocentric hand-control data in one architecture.

03

We present Worldscape-MoE Tuning, a progressive and extensible training procedure that allows the shared expert to absorb cross-control world knowledge while new experts specialize to newly introduced control modalities.

04

We conduct extensive experiments across heterogeneous control performance, expert routing, MoE effectiveness, scalability, out-of-distribution generalization, and coupled loco-manipulation. The results show consistent gains over dense mixed training and strong performance on manipulation, locomotion, and hand-motion evaluations.

Overview


Figure 1: Worldscape-MoE Overview. Worldscape-MoE supports three mainstream control modalities: Locomotion for trajectory-conditioned world navigation, Manipulation for robot-action-conditioned embodied tasks, and Action Map for hand-joint-conditioned egocentric interaction generation. The framework can also be extended to additional control injection settings.

Method

Worldscape-MoE unifies heterogeneous control signals in one diffusion-transformer world model by using a control-aware Mixture-of-Experts design. During training, each sample is routed through a shared expert plus the corresponding modality expert, enabling cross-modality world knowledge sharing and control-specific specialization at the same time.
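The routing rule described above can be illustrated with a minimal Python sketch. It is not the actual Worldscape-MoE layer: the experts here are toy linear maps rather than DiT feed-forward blocks, the two expert outputs are combined by plain summation (an assumption; the combination rule is not specified here), and the names `ControlAwareMoE`, `make_linear`, and the modality labels are hypothetical.

```python
import random


def make_linear(dim, seed):
    """Return a toy expert: a fixed random linear map on a length-`dim` vector."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return expert


class ControlAwareMoE:
    """Toy layer: every input passes through the shared expert and through
    the single expert that matches its control modality."""

    def __init__(self, dim, modalities):
        self.shared = make_linear(dim, seed=0)
        self.experts = {
            name: make_linear(dim, seed=i + 1) for i, name in enumerate(modalities)
        }

    def forward(self, x, modality):
        if modality not in self.experts:
            raise KeyError(f"no expert registered for modality {modality!r}")
        shared_out = self.shared(x)
        expert_out = self.experts[modality](x)
        # ASSUMPTION: outputs are summed; the real combination may differ.
        return [s + e for s, e in zip(shared_out, expert_out)]


layer = ControlAwareMoE(dim=4, modalities=["locomotion", "manipulation", "hand"])
token = [1.0, 0.5, -0.5, 2.0]
out_loco = layer.forward(token, "locomotion")
out_hand = layer.forward(token, "hand")
```

The same `token` yields different outputs for `"locomotion"` and `"hand"` because only the modality expert changes; the shared expert contributes identically to both, which is where cross-modality knowledge would be stored in this toy setup.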


Figure 2: Worldscape-MoE Architecture. Given the current world observation and different forms of supervisory control, our framework generates world dynamics under heterogeneous control signals. It supports both egocentric world exploration and embodied task execution.
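The progressive tuning idea from the abstract, in which the shared expert keeps absorbing cross-control knowledge while a new expert specializes to a newly added modality, might be organized as in the following Python sketch. The training schedule is not detailed here, so this sketch makes a loud assumption: when a modality is added, experts of earlier modalities are frozen and only the shared expert and the new expert receive updates. `ExpertBank`, `add_modality`, and `sgd_step` are hypothetical names, and parameters are two-element toy vectors.

```python
class ExpertBank:
    """Toy parameter store for one shared expert plus per-modality experts."""

    def __init__(self):
        self.params = {"shared": [0.0, 0.0]}
        self.trainable = {"shared"}

    def add_modality(self, name):
        """Register an expert for a new control modality."""
        if name in self.params:
            raise ValueError(f"expert {name!r} already exists")
        # ASSUMPTION: earlier modality experts are frozen at this point;
        # the shared expert and the new expert stay trainable.
        self.trainable = {"shared", name}
        self.params[name] = [0.0, 0.0]

    def sgd_step(self, grads, lr=0.1):
        """Apply a gradient step, skipping frozen experts."""
        for name, grad in grads.items():
            if name in self.trainable:
                self.params[name] = [
                    p - lr * g for p, g in zip(self.params[name], grad)
                ]


ones = [1.0, 1.0]
bank = ExpertBank()

# Stage 1: train shared + locomotion experts.
bank.add_modality("locomotion")
bank.sgd_step({"shared": ones, "locomotion": ones})

# Stage 2: add manipulation; locomotion is now frozen in this sketch.
bank.add_modality("manipulation")
bank.sgd_step({"shared": ones, "locomotion": ones, "manipulation": ones})
```

After stage 2, the shared parameters have moved twice, the locomotion expert only once, and the manipulation expert once, which mirrors the intended division of labor under the stated freezing assumption.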

Video Showcase

Video categories: Out of Distribution; W/O MoE Comparison; Locomotion Comparison; Hand Motion Comparison; Physics Consistency; Loco-Hand Motion/Manipulation.

Quantitative Results

Locomotion Experiments

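| Method | Avg | Brightness | Color Temp | Sharpness | Motion | Smoothness | Trajectory Accuracy |
|---|---|---|---|---|---|---|---|
| Worldscape-MoE | 0.7556 | 0.6955 | 0.7758 | 0.7639 | 0.6745 | 0.9941 | 0.6300 |
| w/o MoE | 0.6869 | 0.6710 | 0.6993 | 0.6613 | 0.4865 | 0.9930 | 0.6100 |
| Matrix-game 3.0 | 0.6232 | 0.5633 | 0.6180 | 0.6353 | 0.3852 | 0.9660 | 0.5714 |
| HY-World 1.5 | 0.7322 | 0.7128 | 0.7027 | 0.7477 | 0.5545 | 0.9908 | 0.6844 |
| CameraCtrl | 0.5521 | 0.4602 | 0.4812 | 0.3076 | 0.4833 | 0.9832 | 0.5970 |
| MotionCtrl | 0.5562 | 0.4583 | 0.5296 | 0.2421 | 0.5182 | 0.9776 | 0.6115 |
| CamI2V | 0.6137 | 0.5150 | 0.5904 | 0.4513 | 0.5255 | 0.9886 | 0.6115 |
| RealCam-I2V | 0.7063 | 0.6530 | 0.5712 | 0.6197 | 0.6987 | 0.9901 | 0.7050 |
| VideoX-Fun-Wan | 0.7443 | 0.6684 | 0.6856 | 0.6640 | 0.6934 | 0.9899 | 0.7645 |
| AC3D | 0.7262 | 0.4884 | 0.7764 | 0.7050 | 0.7213 | 0.9934 | 0.6729 |
| ASTRA | 0.6072 | 0.5600 | 0.5916 | 0.5088 | 0.5625 | 0.9826 | 0.4379 |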
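
The Avg column is numerically consistent with the unweighted mean of the six per-metric columns (Brightness through Trajectory Accuracy). The short Python check below uses three rows copied from the table above; the `rows` dictionary and the `mean` helper are illustrative names, and the averaging rule is inferred from the numbers rather than stated in the text.

```python
# Each entry: reported Avg, then the six per-metric scores in table order.
rows = {
    "Worldscape-MoE": (0.7556, [0.6955, 0.7758, 0.7639, 0.6745, 0.9941, 0.6300]),
    "w/o MoE": (0.6869, [0.6710, 0.6993, 0.6613, 0.4865, 0.9930, 0.6100]),
    "HY-World 1.5": (0.7322, [0.7128, 0.7027, 0.7477, 0.5545, 0.9908, 0.6844]),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Largest gap between the reported Avg and the recomputed mean.
max_gap = max(abs(mean(metrics) - avg) for avg, metrics in rows.values())
print(f"largest deviation from reported Avg: {max_gap:.5f}")
```

Any deviation is at the level of four-decimal rounding, so the Avg column weights all six metrics equally, including Smoothness, which is near 0.99 for almost every method and therefore contributes little to the ranking.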

Manipulation Experiments

| Model | EWM Score |
|---|---|
| Worldscape-MoE | 62.84 |
| w/o MoE | 61.88 |
| CtrlWorld | 59.98 |
| Wan 2.6 | 59.80 |
| CogvideoX | 58.79 |
| Veo 3.1 | 57.77 |
| IRASim | 56.14 |
| TesserAct | 54.62 |
| Cosmos-Predict 2.5 (action) | 54.29 |
| Cosmos-Predict 2.5 (text) | 53.06 |
| Vidar | 51.92 |
| Wan 2.2 | 51.71 |
| GigaWorld-0 | 50.96 |
| RoboMaster | 50.35 |

Hand Motion Experiments

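| Model | FID-VID | FVD | FID | Image Quality |
|---|---|---|---|---|
| Worldscape-MoE | 3.80 | 110.94 | 5.78 | 0.7325 |
| w/o MoE | 5.39 | 128.87 | 15.34 | 0.7250 |
| HunyuanVideo-1.5 | 23.18 | 517.42 | 56.31 | 0.6419 |
| Cosmos-Predict 2.5 | 15.02 | 628.96 | 51.36 | 0.6158 |
| MimicMotion | 26.74 | 589.47 | 48.92 | 0.5324 |
| MagicDance | 65.93 | 1498.65 | 91.78 | 0.5739 |
| LOME | 144.58 | 1794.84 | 67.82 | 0.5281 |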
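
To quantify the gain of the MoE design over dense mixed training on hand motion, the relative reduction in each Fréchet-style distance (FID-VID, FVD, FID; lower is better) can be computed from the first two rows of the table above. This is a small Python sketch; the dictionaries and the `relative_reduction` helper are illustrative names, and the values are read from the Worldscape-MoE and w/o MoE rows.

```python
# Distances for the full model and the dense (w/o MoE) baseline; lower is better.
moe = {"FID-VID": 3.80, "FVD": 110.94, "FID": 5.78}
dense = {"FID-VID": 5.39, "FVD": 128.87, "FID": 15.34}


def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


reduction = {name: relative_reduction(dense[name], moe[name]) for name in moe}

for name, frac in reduction.items():
    print(f"{name}: {100 * frac:.1f}% lower with MoE")
```

Relative reductions are used instead of raw differences because the three distances live on very different scales, so a raw gap in FVD is not comparable to a raw gap in FID.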

Visual Motion and Consistency Metrics

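| Model | Image | Aesthetic | JEPA | Dynamic | Flow | Smoothness | Subject | Background | Photometric |
|---|---|---|---|---|---|---|---|---|---|
| Worldscape-MoE | 0.4566 | 0.3795 | 0.8920 | 0.4373 | 0.2632 | 0.7717 | 0.8333 | 0.9043 | 0.1439 |
| w/o MoE | 0.5220 | 0.4053 | 0.8779 | 0.4432 | 0.2457 | 0.7776 | 0.8282 | 0.8990 | 0.1126 |
| GigaWorld-0 | 0.5041 | 0.3991 | 0.4413 | 0.6709 | 0.3118 | 0.7811 | 0.7303 | 0.8563 | 0.1756 |
| TesserAct | 0.3322 | 0.4590 | 0.4579 | 0.5150 | 0.2447 | 0.7579 | 0.8250 | 0.9238 | 0.2491 |
| RoboMaster | 0.3487 | 0.3842 | 0.2966 | 0.6124 | 0.1484 | 0.6940 | 0.8295 | 0.9123 | 0.3356 |
| Vidar | 0.4145 | 0.4068 | 0.5608 | 0.2767 | 0.1426 | 0.7973 | 0.7629 | 0.8300 | 0.2350 |
| Cosmos-Predict 2.5 (text) | 0.6668 | 0.4501 | 0.3126 | 0.5911 | 0.4302 | 0.7882 | 0.7488 | 0.8511 | 0.1383 |
| Cosmos-Predict 2.5 (action) | 0.4489 | 0.3576 | 0.9296 | 0.3994 | 0.0573 | 0.7100 | 0.8197 | 0.8894 | 0.3528 |
| CtrlWorld | 0.3522 | 0.3893 | 0.9185 | 0.4257 | 0.3449 | 0.7377 | 0.8411 | 0.9057 | 0.1729 |
| Wan 2.2 | 0.3884 | 0.3963 | 0.7575 | 0.4349 | 0.1269 | 0.7019 | 0.8388 | 0.9042 | 0.4776 |
| CogvideoX | 0.3582 | 0.3777 | 0.9384 | 0.3166 | 0.2189 | 0.7391 | 0.8083 | 0.8773 | 0.3580 |
| IRASim | 0.3489 | 0.3623 | 0.9330 | 0.4139 | 0.2083 | 0.7052 | 0.8312 | 0.9068 | 0.3522 |
| Veo 3.1 | 0.6605 | 0.4632 | 0.5694 | 0.5450 | 0.1396 | 0.6989 | 0.7878 | 0.8710 | 0.3247 |
| Wan 2.6 | 0.6824 | 0.4433 | 0.7229 | 0.7421 | 0.4532 | 0.8539 | 0.7517 | 0.8687 | 0.1904 |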

Physics, 3D, and Controllability Metrics

| Model | Interaction | Trajectory | Depth | Perspectivity | Instruction | Semantic | Action |
|---|---|---|---|---|---|---|---|
| Worldscape-MoE | 0.8008 | 0.4610 | 0.9030 | 0.9686 | 0.9348 | 0.9039 | 0.0955 |
| w/o MoE | 0.7622 | 0.3540 | 0.9038 | 0.9744 | 0.8703 | 0.8914 | 0.0324 |
| GigaWorld-0 | 0.5368 | 0.1552 | 0.6316 | 0.7596 | 0.6156 | 0.8591 | 0.1134 |
| TesserAct | 0.5800 | 0.1396 | 0.7159 | 0.7920 | 0.6152 | 0.8783 | 0.0311 |
| RoboMaster | 0.5364 | 0.1158 | 0.8335 | 0.7588 | 0.5772 | 0.8761 | 0.0352 |
| Vidar | 0.5348 | 0.1928 | 0.7872 | 0.7592 | 0.5912 | 0.8826 | 0.0819 |
| Cosmos-Predict 2.5 (text) | 0.3872 | 0.0816 | 0.7051 | 0.7964 | 0.2664 | 0.7733 | 0.1418 |
| Cosmos-Predict 2.5 (action) | 0.5500 | 0.2945 | 0.8862 | 0.7644 | 0.5840 | 0.8879 | 0.0133 |
| CtrlWorld | 0.6212 | 0.4766 | 0.9300 | 0.7960 | 0.7272 | 0.8912 | 0.0210 |
| Wan 2.2 | 0.5184 | 0.1627 | 0.7768 | 0.7660 | 0.5376 | 0.8877 | 0.0512 |
| CogvideoX | 0.5940 | 0.3526 | 0.9097 | 0.7828 | 0.7268 | 0.8977 | 0.0076 |
| IRASim | 0.5656 | 0.3639 | 0.9312 | 0.7788 | 0.6604 | 0.8849 | 0.0526 |
| Veo 3.1 | 0.7872 | 0.1231 | 0.7421 | 0.8276 | 0.9328 | 0.8607 | 0.0852 |
| Wan 2.6 | 0.7280 | 0.1182 | 0.7144 | 0.8032 | 0.8536 | 0.8728 | 0.0992 |

Citation

@article{worldscape_moe_2026,
  title   = {Worldscape-MoE: A Unified Mixture-of-Experts World Model for Scalable Heterogeneous Action Control},
  author  = {Worldscape Team},
  journal = {Under Review},
  year    = {2026}
}