Poster in Workshop: Workshop on Reasoning and Planning for Large Language Models

Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations

Xinnan Zhang ⋅ Chenliang Li ⋅ Siliang Zeng ⋅ Jiaxiang Li ⋅ Zhongruo Wang ⋅ Songtao Lu ⋅ Alfredo Garcia ⋅ Mingyi Hong


Abstract

Aligning Large Language Models (LLMs) to human preferences is essential for their effective deployment in real-world applications. Traditional post-training methods, such as Reinforcement Learning from Human Feedback (RLHF), are resource-intensive and time-consuming, especially as model sizes continue to grow. Recently, inference-time alignment methods have gained significant attention, as they can steer the LLM output without direct fine-tuning and can be integrated with post-training techniques to further enhance performance. Additionally, these methods enable personalization, allowing models to adapt dynamically to user preferences and specific task requirements. However, these approaches operate in a one-shot manner, limiting policy improvement to a single round. To address this limitation, we introduce inference-time Successive Policy Iterations (SPI), a novel algorithm that enables successive policy improvement at inference time. Specifically, inference-time SPI iteratively learns value functions and leverages them to guide the LLM through a search-based optimization process. Theoretically, our algorithm is equivalent to performing multi-iteration policy optimization on the base model, effectively improving its behavior without direct fine-tuning. Experimental results demonstrate that inference-time SPI significantly improves length-controlled win rates on challenging instruction-following benchmarks such as AlpacaEval 2.0, achieving a substantial performance boost (e.g., $30.71\% \to 43.80\%$ for \texttt{Llama-3-8B-Instruct} compared against GPT-4 responses). Furthermore, inference-time SPI consistently outperforms existing test-time alignment baselines such as Best-of-N (BoN) and weak-to-strong search, showing that it is effective for inference-time scaling on different tasks.
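
The contrast between one-shot and successive policy improvement can be illustrated on a toy problem. The sketch below is not the authors' algorithm: it assumes a standard KL-regularized improvement step, in which the new policy is proportional to the old policy times exp(value / beta), applied to a hypothetical three-response bandit with known rewards standing in for learned value functions. The names `improve`, `successive_iterations`, `base_policy`, `rewards`, and `beta` are all illustrative.

```python
import math


def improve(policy, values, beta):
    """One assumed KL-regularized step: new(y) is proportional to policy(y) * exp(values(y) / beta)."""
    weights = [p * math.exp(v / beta) for p, v in zip(policy, values)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_value(policy, values):
    """Expected reward of sampling one response from the policy."""
    return sum(p * v for p, v in zip(policy, values))


def successive_iterations(base_policy, values, beta, rounds):
    """Apply the improvement step repeatedly; index 0 is the base policy."""
    policies = [list(base_policy)]
    for _ in range(rounds):
        policies.append(improve(policies[-1], values, beta))
    return policies


# Hypothetical toy setup: three candidate responses with fixed rewards.
base_policy = [0.5, 0.3, 0.2]
rewards = [0.1, 0.5, 0.9]

history = successive_iterations(base_policy, rewards, beta=0.5, rounds=3)
scores = [expected_value(p, rewards) for p in history]

for k, score in enumerate(scores):
    print(f"iteration {k}: expected reward = {score:.4f}")
```

Here `history[1]` corresponds to a one-shot method, while later entries keep shifting probability mass toward higher-reward responses. In this toy setting the weights of the base policy are never modified; only the sampling distribution is reweighted, which loosely mirrors the idea of improving behavior without fine-tuning.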
