Evaluating Frontier Agents on End-to-End Investment Banking Workflows
Abstract
AI agents are expected to revolutionize professional work, but a basic question remains open: how well can today’s frontier models complete end-to-end analytical workflows in economically high-value settings? We examine this question through the lens of investment banking, evaluating AI agents on tasks routinely performed by junior bankers. To ensure ecological validity, we collaborated with 175 investment bankers to develop an evaluation suite that replicates core features of their professional environment. Agents are assigned requests at the Vice President (VP) and Managing Director (MD) level; granted access to realistic data rooms and industry-standard tools (e.g., FactSet and SEC EDGAR); and required to produce multi-file deliverables, including financial models, slide decks, reports, and email summaries. Individual tasks required as much as 8 hours of banker time to complete, highlighting the nontrivial labor investment and economic stakes for agents seeking to perform them autonomously. Benchmarking eight frontier models, we find that current AI systems struggle to reliably complete these workflows: even the best-performing model in our study (Claude Opus 4.5) achieves only 33.8% success. Our error analysis identifies key obstacles to deploying agentic AI in high-stakes professional domains (such as internal consistency across deliverables and their client readiness), as well as routes to economic value.