Poster
BOND: Aligning LLMs with Best-of-N Distillation
Pier Giuseppe Sessa · Robert Dadashi · Léonard Hussenot-Desenonges · Johan Ferret · Nino Vieillard · Alexandre Rame · Bobak Shahriari · Sarah Perrin · Abram Friesen · Geoffrey Cideron · Sertan Girgin · Piotr Stanczyk · Andrea Michi · Danila Sinopalnikov · Sabela Ramos Garea · Amélie Héliou · Aliaksei Severyn · Matthew Hoffman · Nikola Momchev · Olivier Bachem
Hall 3 + Hall 2B #198
Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling, which selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
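To make the distillation target concrete, the sketch below (not the authors' implementation) computes the exact Best-of-N distribution over a small set of candidate generations with distinct rewards, together with a beta-weighted Jeffreys divergence mixing forward and backward KL. The closed form used for the Best-of-N distribution follows from the reward CDF of the sampling policy; the beta weighting convention, the toy candidate set, and all names are illustrative assumptions.

```python
# Toy sketch of the BOND distillation target (assumptions noted inline;
# not the paper's implementation).
import numpy as np

def best_of_n_dist(pi, rewards, n):
    """Exact Best-of-N distribution for a categorical policy.

    With distinct rewards, Best-of-N returns candidate y iff the maximum
    reward among n i.i.d. draws from pi equals r(y):
        P(y) = F(r(y))^n - (F(r(y)) - pi(y))^n,
    where F is the reward CDF under pi.
    """
    order = np.argsort(rewards)        # candidates sorted by reward
    cdf = np.cumsum(pi[order])         # F(r(y)) in sorted order
    p_sorted = cdf**n - (cdf - pi[order])**n
    out = np.empty_like(pi)
    out[order] = p_sorted              # map back to original candidate order
    return out

def jeffreys(p, q, beta=0.5, eps=1e-12):
    """Linear combination of forward KL(p || q) (mode-covering) and
    backward KL(q || p) (mode-seeking); this particular beta weighting
    convention is an assumption for illustration."""
    kl_fwd = np.sum(p * (np.log(p + eps) - np.log(q + eps)))
    kl_bwd = np.sum(q * (np.log(q + eps) - np.log(p + eps)))
    return beta * kl_fwd + (1.0 - beta) * kl_bwd

# Hypothetical example: 4 candidate generations with scalar rewards.
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])        # sampling (reference) policy
rewards = np.array([0.1, 0.9, 0.5, 0.7])
pi_bon = best_of_n_dist(pi_ref, rewards, n=4)
print(pi_bon, pi_bon.sum())                    # mass shifts to high-reward candidates

pi_theta = np.full(4, 0.25)                    # current policy to be distilled
print(jeffreys(pi_bon, pi_theta, beta=0.5))    # BOND-style matching objective
```

In this toy setting the Best-of-N distribution is available in closed form; at LLM scale it must instead be estimated from samples, which is where BOND's iterative formulation with a moving anchor comes in.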