Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents
Abstract
Financial institutions deploying LLM-based agents face a fundamental tension: regulators demand audit trails showing why a decision was made, yet LLMs exhibit non-deterministic behavior that undermines reproducibility. We introduce the Determinism-Faithfulness Assurance Harness, a framework that (1) measures trajectory determinism and evidence-conditioned faithfulness for tool-using agents, (2) quantifies the trade-off frontier between these properties, and (3) provides model-selection and architecture guidance for compliance-critical deployments. Across 74 model-task configurations at temperature zero, spanning 12 models from 4 providers (8–24 runs per configuration), we find that smaller models (7–20B parameters) achieve 100% determinism while maintaining task performance sufficient for triage workflows, whereas the largest models (Tier 3, 120B+ parameters) require 3.7× larger validation samples to achieve equivalent statistical reliability. Critically, determinism and faithfulness are positively correlated (r = 0.45), which directly challenges the assumed trade-off between reliability and capability: models that produce consistent outputs also tend to be more evidence-grounded. We introduce three benchmark tasks covering compliance triage, portfolio constraint checking, and data-ops exception handling (50 test cases each), along with an open-source stress-test harness. Finally, we provide actionable deployment guidance that maps model tiers to regulatory use cases, enabling institutions to select architectures appropriate for audit-critical versus advisory workflows.
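To make the notion of trajectory determinism concrete, the following minimal sketch shows one way such a score could be computed from repeated runs of the same configuration. It assumes, purely for illustration, that a trajectory is the ordered sequence of tool calls an agent makes and that determinism is the fraction of runs matching the most common trajectory. The function name `trajectory_determinism`, the tool names, and this particular definition are hypothetical and are not taken from the harness itself.

```python
from collections import Counter
from typing import Hashable, Sequence


def trajectory_determinism(runs: Sequence[Sequence[Hashable]]) -> float:
    """Return the fraction of runs whose trajectory equals the modal trajectory.

    Assumption (illustrative only): a trajectory is an ordered sequence of
    hashable tool-call records, and two runs agree only if the sequences
    are identical element by element.
    """
    if not runs:
        raise ValueError("at least one run is required")
    # Convert each run to a tuple so it can be counted as a dictionary key.
    counts = Counter(tuple(run) for run in runs)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(runs)


# Hypothetical tool-call trajectories for one model-task configuration.
stable = [("lookup_account", "A1"), ("check_sanctions", "A1"), ("escalate",)]
drifted = [("lookup_account", "A1"), ("escalate",)]

# Eight repeated runs at temperature zero: six agree, two diverge.
runs = [stable] * 6 + [drifted] * 2
score = trajectory_determinism(runs)
print(f"determinism = {score:.2f}")
```

Under this definition, a configuration scores 1.0 only when every repeated run produces the same tool-call sequence. Comparing final outputs rather than full trajectories would be a looser criterion and a different design choice.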