Scaling Large Vision-Language Model RL Training via Efficient Load Balancing
Zerui Wang ⋅ Qinghao Hu ⋅ Chang Chen ⋅ Jiecheng Zhou ⋅ Haojie Duanmu ⋅ Xingcheng Zhang ⋅ Peng Sun ⋅ Dahua Lin
Abstract
Reinforcement learning (RL) is increasingly used to align vision-language models (VLMs), yet scaling RL for VLMs is bottlenecked by multimodal data handling and extreme workload skew. In typical RL pipelines, visual data loading and preprocessing are centralized, creating severe I/O and CPU/memory stragglers, while batches that mix short image-text prompts with long video contexts lead to large cross-GPU imbalance during rollouts, inference, and training. We present FlexRL, an end-to-end system that removes these bottlenecks. FlexRL introduces: (1) ShadowLoader, a distributed, metadata-driven pipeline that keeps only lightweight visual metadata on the controller, pushes decoding and preprocessing to worker-side preprocessors, and asynchronously materializes tensors to overlap I/O with GPU computation; (2) FlexUlysses, a cost-aware sub-sequence sharding and execution engine that adaptively splits sequences to balance compute and memory. Our evaluation shows that across multiple VLM scales and multimodal datasets on 128-GPU clusters, FlexRL improves end-to-end throughput by up to 8.47× over state-of-the-art RL systems.
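To make the workload-skew problem concrete, the sketch below (a hypothetical illustration, not FlexRL's actual algorithm) applies greedy longest-processing-time assignment of mixed-length multimodal sequences to GPUs. All names and the example sequence lengths are invented for illustration.

```python
# Hypothetical sketch: greedy longest-processing-time (LPT) assignment
# of sequences to GPUs, illustrating the cross-GPU imbalance that arises
# when short image-text prompts are batched with long video contexts.
import heapq

def balance(seq_lens, n_gpus):
    """Assign each sequence (by token length) to the least-loaded GPU."""
    heap = [(0, g, []) for g in range(n_gpus)]  # (load, gpu_id, seqs)
    heapq.heapify(heap)
    for length in sorted(seq_lens, reverse=True):
        load, gpu, seqs = heapq.heappop(heap)
        seqs.append(length)
        heapq.heappush(heap, (load + length, gpu, seqs))
    return sorted(heap)  # per-GPU (total_load, gpu_id, assigned_seqs)

# Skewed workload: one long video context among short image-text prompts.
lens = [32768, 512, 512, 256, 4096, 256, 128, 8192]
for load, gpu, seqs in balance(lens, 4):
    print(gpu, load, seqs)
```

Note that even this per-sequence assignment cannot balance the load: the single 32K-token video sequence dominates one GPU no matter where it is placed. This is the motivation for splitting long sequences themselves into sub-sequence shards, as the abstract describes for FlexUlysses.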