

Poster

Faster, More Efficient RLHF through Off-Policy Asynchronous Learning

Michael Noukhovitch · Shengyi Huang · Louis-Pascal Xhonneux · Arian Hosseini · Rishabh Agarwal · Aaron Courville

Hall 3 + Hall 2B #582
Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, it is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF on instruction following tasks 40% faster than a synchronous run while matching final performance measured with GPT-4o.
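The separation of generation and learning described in the abstract can be sketched as a producer/consumer loop: one thread keeps sampling from the latest available policy snapshot while the trainer consumes batches that may have been generated by an older policy version. This is a minimal toy illustration of that idea (the function names, the integer "policy version", and the queue size are assumptions for the sketch, not the paper's implementation):

```python
import queue
import threading


def generate(policy_version):
    # Stand-in for sampling completions from the LLM policy;
    # we record which policy version produced the sample.
    return {"policy_version": policy_version, "text": f"sample@v{policy_version}"}


def run_async_rlhf(num_updates=5):
    sample_queue = queue.Queue(maxsize=2)  # bounded buffer of generations
    latest_version = [0]                   # shared policy snapshot (version counter)
    stop = threading.Event()

    def generator():
        # Asynchronous generation: keeps producing from whatever
        # snapshot is current, without waiting for the trainer.
        while not stop.is_set():
            try:
                sample_queue.put(generate(latest_version[0]), timeout=0.1)
            except queue.Full:
                pass

    t = threading.Thread(target=generator, daemon=True)
    t.start()

    staleness = []
    for step in range(1, num_updates + 1):
        batch = sample_queue.get()                     # may lag the current policy
        staleness.append(step - 1 - batch["policy_version"])
        latest_version[0] = step                       # "train": advance the policy
    stop.set()
    t.join()
    return staleness
```

The `staleness` list records, per update, how many policy versions old the consumed batch was; it is always non-negative, and values above zero correspond to the off-policy data that the paper studies the tolerance for.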
