Sparse Attention for Efficient LLM Reinforcement Learning
Abstract
Reinforcement learning (RL) is a key driver of recent progress in large language model reasoning, but its scalability is increasingly limited by the cost of online rollouts, especially for long chain-of-thought generation and large-batch sampling. Sparse attention is a promising way to reduce per-token attention cost and improve rollout throughput, yet we find that practical sparse rollouts often destabilize training: approximation errors bias likelihood estimates, causing a large actor–policy distribution mismatch (between the sparse rollout actor and the dense policy) that compounds over long trajectories and can collapse training. We propose DISTILLSPARSE, a robust sparse-rollout framework that restores distribution alignment while preserving speed. DISTILLSPARSE co-trains a sparse rollout policy via lightweight, LoRA-based on-policy distillation from the dense policy, which prevents mismatch from accumulating across RL iterations. For long generations and high sparsity, DISTILLSPARSE further oversamples rollout candidates and applies reward-aware filtering to focus updates on trajectories that are both high-quality and closer to the dense distribution. We evaluate on POLARIS across 4B–8B models and mathematical reasoning benchmarks including AIME24/25, AMC23, and Math500. In settings where training-free sparse rollouts degrade or collapse, DISTILLSPARSE matches dense-rollout training performance while providing substantial practical acceleration: a 1.72× rollout speedup on NVIDIA H200 at 16K generation length with minimal overhead.
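To make two of the ideas above concrete, the following is a minimal illustrative sketch and not the DISTILLSPARSE implementation. It shows (i) how a small, consistent per-token gap between sparse and dense log-probabilities adds up over a trajectory, and (ii) one possible form of oversample-then-filter selection. The `Rollout` container, the function names, the scoring rule (reward minus a penalty on the mean absolute per-token log-ratio), and the weight `beta` are all assumptions introduced here for illustration; the abstract does not specify the actual filtering criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical container for one sampled trajectory."""
    reward: float
    sparse_logps: List[float]  # per-token log-probs under the sparse rollout policy
    dense_logps: List[float]   # log-probs of the same tokens under the dense policy


def sequence_log_ratio(r: Rollout) -> float:
    """Sum over tokens of log p_dense - log p_sparse.

    With a consistent per-token bias, the magnitude of this sum grows
    roughly linearly with trajectory length.
    """
    return sum(d - s for d, s in zip(r.dense_logps, r.sparse_logps))


def filter_rollouts(candidates: List[Rollout], keep: int, beta: float = 1.0) -> List[Rollout]:
    """Keep the `keep` best candidates out of an oversampled pool.

    Assumed scoring rule (illustrative only): reward minus beta times the
    absolute mean per-token log-ratio, so that high-reward trajectories that
    stay close to the dense distribution are preferred.
    """
    def score(r: Rollout) -> float:
        n = max(len(r.sparse_logps), 1)
        return r.reward - beta * abs(sequence_log_ratio(r)) / n

    return sorted(candidates, key=score, reverse=True)[:keep]


if __name__ == "__main__":
    # The same 0.01-nat per-token gap at two trajectory lengths.
    short = Rollout(0.0, [-1.0] * 100, [-1.01] * 100)
    long_ = Rollout(0.0, [-1.0] * 1000, [-1.01] * 1000)
    print(sequence_log_ratio(short), sequence_log_ratio(long_))

    # Oversample three candidates, keep two.
    pool = [
        Rollout(1.0, [-1.0] * 4, [-1.0] * 4),  # correct, matches dense
        Rollout(1.0, [-1.0] * 4, [-1.5] * 4),  # correct, drifts from dense
        Rollout(0.0, [-1.0] * 4, [-1.0] * 4),  # incorrect, matches dense
    ]
    print([r.reward for r in filter_rollouts(pool, keep=2)])
```

In this toy setup a fixed per-token gap yields a sequence-level log-ratio that scales with length, which is one way to read the claim that mismatch compounds over long trajectories. The filter is deliberately simple; any real criterion would need to be taken from the method description rather than from this sketch.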