VariantBench: Benchmarking Language Models on Scientific Reasoning Across the Pharmacogenomic Evidence Pipeline
Abstract
Large language models increasingly serve as reasoning engines over scientific literature, yet it remains unclear whether they can sustain logical consistency across the multi-stage workflows required for real-world literature analysis. We introduce \textsc{VariantBench}, a benchmark that mirrors the full pharmacogenomic evidence curation pipeline and is grounded in expert-curated annotations from the ClinPGx research team. The benchmark comprises 79,592 structured single-paper questions and 394 agentic cross-document and clinical reasoning tasks spanning three tiers of complexity: factual extraction, dependent multi-turn reasoning, and CPIC guideline recreation under zero-context and evidence-provided settings. Evaluating frontier tool-use agents with the Harbor framework reveals substantial brittleness in multi-step reasoning. Although per-step accuracy on chained tasks exceeds 60%, requiring every step in a chain to be correct reduces success to 13.6%. Cross-document synthesis further degrades performance relative to single-paper comprehension. For clinical guideline recreation, providing the referenced literature improves mean reward by 20 points, indicating that models benefit substantially from explicit evidence access but remain unreliable when relying solely on parametric recall. \textsc{VariantBench} provides deterministic verifiers, reproducible agent infrastructure, and a large-scale expert-grounded evaluation suite for measuring progress toward robust scientific reasoning.
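The gap between per-step accuracy and all-steps-correct success on chained tasks can be illustrated with a minimal sketch. The function names (`per_step_accuracy`, `chain_success_rate`) and the toy data below are hypothetical and chosen only for illustration; they are not the benchmark's verifiers or results. The sketch assumes each chain is recorded as a list of per-step pass/fail outcomes.

```python
from typing import Sequence


def per_step_accuracy(chains: Sequence[Sequence[bool]]) -> float:
    """Fraction of individual steps answered correctly, pooled over all chains."""
    steps = [ok for chain in chains for ok in chain]
    return sum(steps) / len(steps)


def chain_success_rate(chains: Sequence[Sequence[bool]]) -> float:
    """Fraction of chains in which every step is correct (all-or-nothing)."""
    return sum(all(chain) for chain in chains) / len(chains)


# Toy outcomes (illustrative only, not benchmark data): four chains of three steps.
toy_chains = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [False, True, True],
]

print(f"per-step accuracy:  {per_step_accuracy(toy_chains):.2f}")   # 0.75
print(f"chain success rate: {chain_success_rate(toy_chains):.2f}")  # 0.25
```

In this toy case, 9 of 12 steps are correct, yet only 1 of 4 chains passes, because a single wrong step fails the whole chain. The same mechanism lets a chain-level metric sit far below a pooled per-step metric.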