CausalPhysics: Unifying Semantic Reasoning, Physical Dynamics, and Counterfactual Simulation in World Models
Mysore Supreeth ⋅ Manish Mehta
Abstract
Current world models fragment physical intelligence into separate pipelines. Vision-language models (VLMs) excel at semantic tasks but struggle with causal physical reasoning: on our CAUSALPHYSICS-BENCH evaluation, GPT-4V answers only 21.9% of counterfactual physics queries correctly. Video generators produce realistic frames but understand little physics: Sora attains 24.1%, Runway Gen-3 23.2%, and VideoPoet 21.4% on Physics-IQ (Motamed et al., 2025). Model-based reinforcement learning (MBRL) systems operate in narrow domains and lack semantic grounding. We present CAUSALPHYSICS, a single architecture that bridges these gaps with three tightly coupled modules: (1) a Semantic-Physical Encoder (SPE) that fuses DINOv2 vision tokens with frozen LLaMA-2 language representations through cross-attention; (2) a Causal Graph Induction Module (CGIM) that discovers a differentiable structural causal model from video, supporting Pearl’s do-operator and counterfactual queries; (3) a Physics-Constrained Dynamics Network (PCDN) that propagates states through the learned causal graph while enforcing differentiable conservation-law constraints. On the official Physics-IQ v1.0 toolkit, CAUSALPHYSICS scores 46.8 ± 0.9, a 47% relative gain over V-JEPA 2 (31.8 ± 1.4) and roughly double Sora's score (24.1). Causal consistency reaches 71.3 ± 1.2% on CAUSALPHYSICS-BENCH versus 21.9 ± 0.8% for GPT-4V ($p<0.001$, paired t-test, 3 seeds). Out-of-distribution (OOD) generalization improves by 20.2 percentage points over the strongest baseline.
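To make the notion of a counterfactual query on a structural causal model concrete, here is a minimal sketch. It is not the CGIM from the abstract: `ToySCM`, its single hand-written equation (acceleration = force / mass + noise), and all numbers are hypothetical, chosen only to illustrate Pearl's three-step recipe of abduction, action, and prediction.

```python
from dataclasses import dataclass


@dataclass
class ToySCM:
    """Hypothetical one-equation SCM: accel = force / mass + noise.

    Illustration only; the paper's CGIM learns its graph from video,
    whereas this structure is fixed by hand.
    """
    mass: float = 2.0

    def accel(self, force: float, noise: float = 0.0) -> float:
        # Structural equation for acceleration given its parent (force).
        return force / self.mass + noise

    def abduct_noise(self, force_obs: float, accel_obs: float) -> float:
        # Abduction: recover the exogenous noise consistent with the observation.
        return accel_obs - force_obs / self.mass

    def counterfactual_accel(
        self, force_obs: float, accel_obs: float, force_do: float
    ) -> float:
        # Action: replace the observed force with do(force = force_do).
        # Prediction: re-evaluate the equation with the abducted noise held fixed.
        noise = self.abduct_noise(force_obs, accel_obs)
        return self.accel(force_do, noise)


scm = ToySCM(mass=2.0)
# Observed: force 4.0 produced acceleration 2.5, so the implied noise is 0.5.
# Query: what would the acceleration have been had the force been 8.0?
cf = scm.counterfactual_accel(force_obs=4.0, accel_obs=2.5, force_do=8.0)
print(cf)  # 4.5
```

The counterfactual answer (4.5) differs from the purely interventional prediction with zero noise (4.0) because the noise inferred from the observed episode is carried over into the hypothetical one.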