Poster in Workshop: Agentic AI in the Wild: From Hallucinations to Reliable Autonomy

Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

James Xu Zhao ⋅ Bryan Hooi ⋅ See-Kiong Ng

Abstract

Test-time scaling increases inference-time computation by enabling longer reasoning chains and has shown strong performance gains across many domains. However, frontier models still hallucinate on knowledge-intensive tasks, which raises the question of whether increasing test-time computation is effective in this setting. In this work, we evaluate 14 reasoning models under different test-time scaling strategies on parametric knowledge benchmarks. Our results challenge the effectiveness of test-time scaling in this setting: increasing test-time computation does not consistently improve accuracy and often leads to more hallucinations. We find that changes in hallucination rates are largely driven by the model's willingness to answer, as longer reasoning encourages more attempts, many of which are incorrect. Extended reasoning can also induce confirmation bias, where models reinforce early incorrect beliefs with fabricated details, resulting in overconfident hallucinations. Finally, we provide an information-theoretic perspective showing that compute-only test-time scaling, as a post-processing step on a fixed model, cannot increase information about the ground-truth answer. Overall, our findings highlight fundamental limitations of current test-time scaling methods for knowledge-intensive tasks.
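
The link between hallucination rate and willingness to answer can be illustrated with a small sketch. It assumes each response is graded as "correct", "incorrect", or "abstain"; these labels, the function name `summarize`, and the toy counts are hypothetical and are not taken from the paper's evaluation. Under that assumption, the hallucination rate factors into the attempt rate times the error rate among attempted answers, so it can rise simply because the model attempts more questions.

```python
from collections import Counter


def summarize(labels):
    """Summarize graded responses.

    labels: list of "correct", "incorrect", or "abstain"
            (a hypothetical grading scheme, assumed for illustration).
    """
    n = len(labels)
    counts = Counter(labels)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Fraction of all questions answered correctly.
        "accuracy": counts["correct"] / n,
        # Fraction of all questions answered incorrectly.
        "hallucination_rate": counts["incorrect"] / n,
        # Fraction of questions the model chose to answer at all.
        "attempt_rate": attempted / n,
        # Fraction of attempted answers that were wrong.
        "error_given_attempt": counts["incorrect"] / attempted if attempted else 0.0,
    }


# Made-up toy data, NOT results from the paper:
# a short-reasoning run that abstains often, and a long-reasoning run
# that attempts more questions.
short_run = ["correct"] * 40 + ["incorrect"] * 20 + ["abstain"] * 40
long_run = ["correct"] * 42 + ["incorrect"] * 38 + ["abstain"] * 20

for name, run in [("short", short_run), ("long", long_run)]:
    stats = summarize(run)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

In this toy data, accuracy barely moves between the two runs while the hallucination rate nearly doubles, because most of the newly attempted questions are answered incorrectly. Reporting the attempt rate alongside accuracy makes that shift visible.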