Poster Sat, Apr 25, 2026 • 6:30 AM – 9:00 AM PDT Pavilion 3 P3-#1416

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Zhiheng Xi ⋅ Xin Guo ⋅ Yang Nan ⋅ Enyu Zhou ⋅ Junrui Shen ⋅ Wenxiang Chen ⋅ Jiaqi Liu ⋅ Jixuan Huang ⋅ Xun Deng ⋅ Zhihao Zhang ⋅ Honglin Guo ⋅ Zhikai Lei ⋅ Miao Zheng ⋅ Guoteng Wang ⋅ Peng Sun ⋅ Rui Zheng ⋅ Hang Yan ⋅ Tao Gui ⋅ Qi Zhang ⋅ Xuanjing Huang

Abstract
