UNIVERSAL AND EFFICIENT LOAD BALANCING FOR RL TRAINING OF LARGE MULTIMODAL MODELS
Abstract
Reinforcement learning (RL) is crucial for aligning Vision-Language Models (VLMs), but its practical application is hampered by significant system-level bottlenecks. The typical RL pipeline, encompassing data loading, inference-based rollouts, and model updates, suffers from severe inefficiencies when applied to VLMs due to the extreme heterogeneity of multimodal data. Centralized data loading creates I/O bottlenecks with large media files, while variations in sequence length across text, image, and video inputs lead to critical load imbalance during computation, leaving expensive GPU resources underutilized. Existing systems either focus on text-only RL or employ general load-balancing techniques that are incompatible with the small-batch, iterative nature of RL training. To address these challenges, we present FlexRL, a holistic system designed to optimize the end-to-end VLM RL pipeline. FlexRL introduces two core contributions: (1) a \textbf{Decentralized Data Pipeline} that parallelizes data fetching and preprocessing across worker nodes and enables metadata-only scheduling on the single controller, eliminating the central bottleneck and accelerating data-intensive stages; and (2) a novel \textbf{Hybrid Sequence Sharding} mechanism that partitions sequences into fine-grained chunks, enabling sub-sequence-level load balancing for both inference and training and effectively mitigating workload skew. Our evaluation on a 128-GPU cluster shows that FlexRL significantly improves training efficiency, delivering 4.2x to 7.7x end-to-end speedups over production baselines and enabling more efficient and scalable RL for large multimodal models.
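The abstract describes Hybrid Sequence Sharding only at a high level; as a minimal illustration of the underlying idea, the following Python sketch splits variable-length sequences into fixed-size chunks and greedily assigns each chunk to the currently least-loaded worker. The function name, the chunk size, and the greedy heap-based policy are assumptions made for exposition, not FlexRL's actual algorithm.

```python
# Illustrative sketch only: chunk_size, shard_sequences, and the greedy
# least-loaded assignment are assumptions, not the paper's exact method.
from heapq import heappush, heappop

def shard_sequences(seq_lengths, num_workers, chunk_size=512):
    """Split sequences into chunks of at most chunk_size tokens and
    assign each chunk to the least-loaded worker, approximating
    sub-sequence-level load balance across GPUs."""
    # Min-heap of (current token load, worker id).
    heap = [(0, w) for w in range(num_workers)]
    assignment = {w: [] for w in range(num_workers)}  # worker -> chunk list

    for seq_id, length in enumerate(seq_lengths):
        # Partition one sequence into fine-grained chunks.
        for start in range(0, length, chunk_size):
            tokens = min(chunk_size, length - start)
            load, worker = heappop(heap)  # least-loaded worker so far
            assignment[worker].append((seq_id, start, tokens))
            heappush(heap, (load + tokens, worker))
    return assignment

# Example: a skewed multimodal batch (short text vs. long video sequence).
print(shard_sequences([300, 8000, 1200], num_workers=4))
```

Because chunks rather than whole sequences are the unit of placement, a single very long video sequence no longer pins one GPU while the others sit idle, which is the workload-skew problem the abstract identifies.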