RigidBench: Evaluating Rigid-Body Physics in Video Generation Models
Abstract
Video generation models are increasingly deployed as world-model backbones for physical AI, yet their ability to predict rigid-body dynamics remains unreliable. Existing benchmarks either lack precise ground-truth annotations, relying instead on vision-language model (VLM) judgment, or render synthetic primitives against plain backgrounds, which introduces a visual domain gap from natural video. We introduce RigidBench, a benchmark that combines Blender physics simulation with photorealistic interior scenes to provide exact 3D trajectories, segmentation masks, and depth maps across ten rigid-body physics tasks. Our evaluation protocol spans object localization, trajectory tracking, depth consistency, and perceptual quality, enabling controlled comparison across models. Evaluating seven models, spanning open-source diffusion transformers and closed-source commercial systems, we find that trajectory accuracy and perceptual quality are essentially uncorrelated (r = 0.002): models that best predict object motion often score worst on perceptual metrics. This indicates that standard video-quality metrics do not capture physical understanding, which motivates evaluation against precise physics annotations. We further show that fine-tuning on RigidBench data improves physics prediction on held-out tasks, suggesting a path toward more physically grounded video generation.