Forecasting Research Success Through Learned Comparison of Scientific Ideas
Abstract
As potentially AGI-level systems begin to reshape science, the most immediate shift is in hypothesis generation: language models (LMs) now generate research ideas at a scale far exceeding our capacity to validate them experimentally. This risks wasting resources on ideas that fail to translate into real-world gains. As a step towards filtering for the most promising ideas, we study comparative empirical forecasting: given a research goal and two candidate ideas, predict which one will achieve better empirical performance \emph{before} any experiments are run. We construct a dataset of 11,488 idea pairs grounded in objective benchmark outcomes from PapersWithCode and find that 8B-parameter models fine-tuned on this data achieve 77.1\% accuracy, outperforming frontier models such as GPT-5 (61.1\%). Further, we attempt to establish interpretable grounds for \emph{trust} by training these models to articulate their reasoning via Reinforcement Learning with Verifiable Rewards; the resulting models achieve 71.35\% accuracy. Such models can serve as transparent filters that enable oversight as the scientific process continues to evolve rapidly.
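To make the task format concrete, the comparative forecasting setup described above can be sketched as a small data structure plus an accuracy metric. This is only an illustrative sketch under stated assumptions: the names `IdeaPair`, `pairwise_accuracy`, and `always_a`, the `"A"`/`"B"` label encoding, and the toy pairs are all hypothetical and are not taken from the dataset or from any released code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class IdeaPair:
    """One comparison: a research goal, two candidate ideas, and the observed winner."""
    goal: str
    idea_a: str
    idea_b: str
    winner: str  # "A" or "B"; hypothetical encoding of which idea scored higher


def pairwise_accuracy(
    pairs: Sequence[IdeaPair],
    predict: Callable[[str, str, str], str],
) -> float:
    """Fraction of pairs where the forecaster picks the idea that actually did better."""
    if not pairs:
        raise ValueError("need at least one idea pair")
    correct = sum(predict(p.goal, p.idea_a, p.idea_b) == p.winner for p in pairs)
    return correct / len(pairs)


def always_a(goal: str, idea_a: str, idea_b: str) -> str:
    """Trivial baseline that ignores its inputs and always picks idea A."""
    return "A"


# Made-up placeholder pairs, for illustration only (not real benchmark outcomes).
toy_pairs = [
    IdeaPair("goal 1", "idea 1a", "idea 1b", winner="A"),
    IdeaPair("goal 2", "idea 2a", "idea 2b", winner="B"),
    IdeaPair("goal 3", "idea 3a", "idea 3b", winner="A"),
    IdeaPair("goal 4", "idea 4a", "idea 4b", winner="A"),
]

baseline_accuracy = pairwise_accuracy(toy_pairs, always_a)
```

In this framing, a fine-tuned model or a frontier LM would simply be another `predict` callable, and the accuracies reported above would correspond to this metric computed over held-out pairs, assuming the evaluation is plain pairwise accuracy.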