Do Language Models Provide Useful Priors for Autonomous Scientific Search? A Calibration Study
Abstract
As large language models (LLMs) are increasingly positioned as potential autonomous scientific agents, a central open question is whether they can reliably generate hypotheses that drive iterative discovery. We study this question in a minimal autonomous search loop in which a model proposes candidate solutions, receives scalar objective feedback, and iteratively refines its proposals. Using continuous black-box optimization as a controlled proxy for scientific search, we compare random search, the Tree-structured Parzen Estimator (TPE), LLM-driven proposal generation, and a hybrid TPE+LLM scheme under equal evaluation budgets. Across five independent seeds on a shifted ellipsoid benchmark, LLM-only search outperforms random sampling but performs substantially worse than TPE, while the hybrid scheme achieves the best mean final performance. However, both the LLM and hybrid methods exhibit high variance across seeds, indicating limited reliability. These results suggest that current LLMs do not encode sufficiently strong inductive biases for autonomous discovery and must be coupled with explicit optimization machinery in post-AGI scientific systems.
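To make the propose, evaluate, refine loop concrete, the following is a minimal sketch of the search protocol with the random-search baseline plugged in. It is illustrative only and is not the implementation used in this study: the function names (`shifted_ellipsoid`, `random_proposer`, `run_search`), the dimension, the shift vector, the conditioning constant of 10^6, the search box [-5, 5], and the budget of 100 evaluations are all assumptions chosen for the example. A TPE, LLM, or hybrid proposer would replace `random_proposer` and make use of the `history` argument.

```python
import random

def shifted_ellipsoid(x, shift, condition=1e6):
    """Ellipsoid whose optimum is moved to `shift`; the minimum value is 0.

    The geometric axis weighting (1 up to `condition`) is an assumed,
    commonly used form, not necessarily the one used in the study.
    """
    d = len(x)
    total = 0.0
    for i, (xi, si) in enumerate(zip(x, shift)):
        weight = condition ** (i / (d - 1)) if d > 1 else 1.0
        total += weight * (xi - si) ** 2
    return total

def random_proposer(history, dim, rng, low=-5.0, high=5.0):
    """Baseline proposer: ignores the history and samples uniformly in the box."""
    return [rng.uniform(low, high) for _ in range(dim)]

def run_search(proposer, objective, dim, budget, seed):
    """Run one search: propose, evaluate, record. Returns best-so-far per step."""
    rng = random.Random(seed)
    history = []          # list of (candidate, objective value) pairs
    best_curve = []
    best = float("inf")
    for _ in range(budget):
        x = proposer(history, dim, rng)   # the proposer may inspect past feedback
        y = objective(x)                  # scalar objective feedback
        history.append((x, y))
        best = min(best, y)
        best_curve.append(best)
    return best_curve

# Hypothetical setup: 5 dimensions, 100 evaluations, 5 seeds.
dim = 5
shift = [1.0, -2.0, 0.5, 3.0, -1.5]

def objective(x):
    return shifted_ellipsoid(x, shift)

finals = [
    run_search(random_proposer, objective, dim, budget=100, seed=s)[-1]
    for s in range(5)
]
mean_final = sum(finals) / len(finals)
```

Holding `budget` fixed across proposers is what makes the comparison an equal-evaluation-budget one, and the spread of `finals` across seeds is the quantity whose variance the reliability claim refers to.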