

Poster

Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning

Charlie Snell · Jaehoon Lee · Kelvin Xu · Aviral Kumar

Hall 3 + Hall 2B #212
Thu 24 Apr midnight PDT — 2:30 a.m. PDT
Oral presentation: Oral Session 1A
Wed 23 Apr 7:30 p.m. PDT — 9 p.m. PDT

Abstract:

Enabling LLMs to improve their outputs by using more test-time compute is a critical step towards building self-improving agents that can operate on open-ended natural language. In this paper, we scale up inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for performance, but also for the future of LLM pretraining and how to trade off inference-time and pretraining compute. Little research has attempted to understand the scaling behaviors of test-time inference methods, and current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms for scaling test-time computation: (1) searching against dense, process-based verifier reward models (PRMs); and (2) adaptively updating the model's distribution over a response, given the prompt, at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute varies critically with the difficulty of the prompt. This observation motivates a "compute-optimal" scaling strategy, which allocates test-time compute adaptively per prompt as effectively as possible. Using this compute-optimal strategy, we improve the efficiency of test-time compute scaling on math reasoning problems by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
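
For intuition, the sketch below illustrates the best-of-N baseline mentioned in the abstract, scored by a PRM-style verifier, together with a toy difficulty-dependent allocation rule in the spirit of the compute-optimal strategy. This is a minimal sketch, not the authors' implementation: `generate_candidate`, `prm_score`, and the allocation heuristic are hypothetical placeholders.

```python
# Minimal sketch (not the paper's implementation) of best-of-N sampling
# scored by a process reward model (PRM), plus a toy "compute-optimal"
# allocation that spends more samples on harder prompts.
# `generate_candidate` and `prm_score` are hypothetical stand-ins for an
# LLM sampler and a learned verifier.

import random
from typing import Callable, Tuple


def best_of_n(prompt: str,
              n: int,
              generate_candidate: Callable[[str], str],
              prm_score: Callable[[str, str], float]) -> Tuple[str, float]:
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    candidates = [generate_candidate(prompt) for _ in range(n)]
    scored = [(prm_score(prompt, c), c) for c in candidates]
    best_score, best_answer = max(scored, key=lambda s: s[0])
    return best_answer, best_score


def compute_optimal_budget(estimated_difficulty: float,
                           total_budget: int,
                           min_n: int = 1,
                           max_n: int = 64) -> int:
    """Toy allocation rule: easy prompts get few samples, hard prompts get many.
    estimated_difficulty is assumed to lie in [0, 1], e.g. 1 - base-model pass rate."""
    n = round(min_n + estimated_difficulty * (max_n - min_n))
    return max(min_n, min(n, max_n, total_budget))


if __name__ == "__main__":
    # Dummy sampler and verifier so the sketch runs end to end.
    def generate_candidate(prompt: str) -> str:
        return f"answer-{random.randint(0, 9)}"

    def prm_score(prompt: str, answer: str) -> float:
        return random.random()

    prompt = "Solve: 12 * 7 = ?"
    n = compute_optimal_budget(estimated_difficulty=0.3, total_budget=128)
    answer, score = best_of_n(prompt, n, generate_candidate, prm_score)
    print(f"n={n}, best answer={answer}, verifier score={score:.2f}")
```

In a real system the difficulty estimate would come from the model itself (e.g., the base model's predicted or observed success rate on the prompt), and the per-prompt budget would be chosen to maximize expected accuracy under a fixed total compute budget.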
