TimeSeek: Temporal Reliability of Agentic Forecasters
Om Shastri ⋅ Hamza Mostafa ⋅ Dennis Lee
Abstract
We present TimeSeek, a benchmark for evaluating when agentic LLM forecasters (autonomous systems that search the web and reason over evidence) are reliable enough to trust over prediction markets. Across 10 frontier models and 150 CFTC-regulated binary markets, we generate 15,000 probabilistic forecasts at five temporal checkpoints under two conditions (with and without web search). Key findings: (1) Models beat markets early (Claude $+0.167$ BSS at Open+1) but fail late (all BSS $< -0.7$ at Close-1), which we explain by a transition from an information-sparse to an information-dense regime. (2) Web search improves pooled BSS by 0.14–0.59 points for every model but degrades performance in 12% of model-checkpoint conditions, so search needs tool-access policies. (3) Seven of ten models add genuine independent signal on high-uncertainty markets (positive BSS on toss-ups), but all models hurt on easy markets, a monotonic difficulty gradient that yields deployable rules. (4) Two-model ensembles reduce error, with the best pair cutting pooled BSS loss by 40%. Unlike static benchmarks or drift-focused evaluations, TimeSeek provides the first controlled study of when web search helps versus hurts across market lifecycle stages. It quantifies the effectiveness of agentic forecasters and trading bots over the market lifecycle and enables post-training improvements in tool use and calibration.
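The abstract reports results in BSS without defining it. The sketch below assumes BSS is the Brier Skill Score with the market price as the reference forecast, $\mathrm{BSS} = 1 - \mathrm{BS}_{\text{model}} / \mathrm{BS}_{\text{market}}$, so a positive value means the model beats the market and a negative value means it does worse. The function names, the equal-weight two-model ensemble, and the toy numbers are illustrative assumptions, not the benchmark's actual code or data.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    model_probs: Sequence[float],
    market_probs: Sequence[float],
    outcomes: Sequence[int],
) -> float:
    """BSS of the model relative to the market (assumed reference forecast)."""
    bs_ref = brier_score(market_probs, outcomes)
    if bs_ref == 0:
        raise ValueError("reference Brier score is zero; BSS is undefined")
    return 1.0 - brier_score(model_probs, outcomes) / bs_ref


def average_ensemble(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Equal-weight two-model ensemble (one simple, assumed combination rule)."""
    if len(a) != len(b):
        raise ValueError("forecast lists must have equal length")
    return [(x + y) / 2 for x, y in zip(a, b)]


# Toy data: four binary markets with made-up forecasts.
outcomes = [1, 0, 1, 0]
uninformed_market = [0.5, 0.5, 0.5, 0.5]      # information-sparse regime
informed_market = [0.75, 0.25, 0.75, 0.25]    # information-dense regime
model = [0.75, 0.25, 0.75, 0.25]

early_bss = brier_skill_score(model, uninformed_market, outcomes)
late_bss = brier_skill_score(uninformed_market, informed_market, outcomes)
```

In this toy setup `early_bss` is positive because the model is sharper than an uninformed market, while `late_bss` is strongly negative because a 50/50 forecast is scored against a well-informed market. This mirrors, in miniature, how the same forecaster can look good early and poor late when the reference itself improves.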