Time-Series as Feedback: Evaluating Adaptive Reasoning in LLM Agents
Abstract
Time-series interpretation and reasoning are essential for inferring the state of physical systems and remain a key challenge for autonomous scientific discovery. We introduce a benchmark that evaluates whether large language model (LLM) agents can perform such reasoning in adaptive experiment planning, where time-series observations serve as feedback and experimental conditions are the agent's actions, each generating a new trajectory. Using kinetic mechanism identification as a motivating testbed, we construct an agent–environment loop in which an agent iteratively proposes experiments, receives time-series data, and refines its hypotheses over competing mechanisms while selecting the experimental conditions that best discriminate among them. We show that agents given likelihood-based feedback, in the form of negative log-likelihood (NLL) scores, consistently outperform a non-adaptive baseline, demonstrating effective hypothesis-aware adaptive experimental design. Agents operating directly on raw time-series feedback also outperform the same baseline, indicating a non-trivial ability to extract task-relevant information from noisy trajectories without hand-engineered analysis tools. However, raw-feedback performance remains below NLL-feedback performance, highlighting the current limitations of LLM agents in interpreting time series directly, without structured signals. Overall, this work contributes (i) a benchmark for interactive time-series reasoning in adaptive experimental settings and (ii) an empirical study of LLM agents’ strengths and limitations in hypothesis-driven scientific experimentation.
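To make the loop described above concrete, the following is a minimal sketch under stated assumptions, not the benchmark's implementation. The two candidate mechanisms (first- and second-order decay), the sampling times, the rate-constant grid, the noise level, and every function name are hypothetical choices made for illustration. A fixed scoring rule stands in for the LLM agent: at each round it picks the initial concentration at which the current best-fit predictions of the two mechanisms differ most, and the feedback is a Gaussian negative log-likelihood per mechanism.

```python
import math
import random

# Hypothetical candidate mechanisms, given as closed-form [A](t) for initial
# concentration a0 and rate constant k (illustrative, not from the paper).
MECHANISMS = {
    "first_order": lambda a0, k, t: a0 * math.exp(-k * t),     # A -> B
    "second_order": lambda a0, k, t: a0 / (1.0 + k * a0 * t),  # 2A -> B
}
TIMES = [0.5 * i for i in range(1, 11)]  # sampling times (assumed)
K_GRID = [0.25, 0.5, 1.0, 2.0]           # candidate rate constants (assumed)
SIGMA = 0.02                             # known Gaussian noise level (assumed)


def run_experiment(a0, true_name, true_k, rng):
    """Environment: return a noisy trajectory for the chosen condition a0."""
    model = MECHANISMS[true_name]
    return [model(a0, true_k, t) + rng.gauss(0.0, SIGMA) for t in TIMES]


def gaussian_nll(traj, a0, name, k):
    """Negative log-likelihood of one trajectory under mechanism `name`."""
    model = MECHANISMS[name]
    const = math.log(SIGMA * math.sqrt(2.0 * math.pi))
    return sum(0.5 * ((y - model(a0, k, t)) / SIGMA) ** 2 + const
               for t, y in zip(TIMES, traj))


def fit(history, name):
    """Grid-search the rate constant; return (best_k, total_nll) so far."""
    scores = {k: sum(gaussian_nll(traj, a0, name, k) for a0, traj in history)
              for k in K_GRID}
    best_k = min(scores, key=scores.get)
    return best_k, scores[best_k]


def discrimination(a0, best_k):
    """Squared gap between the mechanisms' current best-fit predictions."""
    first, second = MECHANISMS["first_order"], MECHANISMS["second_order"]
    return sum((first(a0, best_k["first_order"], t)
                - second(a0, best_k["second_order"], t)) ** 2
               for t in TIMES)


def identify(true_name, true_k=1.0, candidates=(0.5, 1.0, 2.0), seed=0):
    """Run the propose / observe / refit loop and return the preferred mechanism."""
    rng = random.Random(seed)
    history = []
    best_k = {name: 1.0 for name in MECHANISMS}     # guess before any data
    total_nll = {name: 0.0 for name in MECHANISMS}
    remaining = list(candidates)
    while remaining:
        # Action: choose the condition expected to separate the hypotheses most.
        a0 = max(remaining, key=lambda c: discrimination(c, best_k))
        remaining.remove(a0)
        # Feedback: a new noisy trajectory, scored by NLL under each mechanism.
        history.append((a0, run_experiment(a0, true_name, true_k, rng)))
        for name in MECHANISMS:
            best_k[name], total_nll[name] = fit(history, name)
    return min(total_nll, key=total_nll.get), total_nll


if __name__ == "__main__":
    best, nll = identify("second_order")
    print(best, nll)
```

In this sketch the choice of the next condition depends on rate constants refit from all earlier trajectories, which is what makes the design adaptive rather than fixed in advance. A raw-feedback variant would hand the trajectories themselves to the agent instead of the per-mechanism NLL totals.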