Evaluating Frontier Agents on End-to-End Investment Banking Workflows
Elaine Lau ⋅ Rosemary Wei ⋅ Guram Gogia ⋅ Ronak Chaudhary ⋅ Yi Liu ⋅ Samuel Danquah ⋅ Punit Arani ⋅ Ray Epps ⋅ Markus Dücker ⋅ Abdullah Arif ⋅ Asrith Devalaraju ⋅ Varsha Sandadi ⋅ Scott Millslagle ⋅ Haemi Nam ⋅ Skyler Wang ⋅ Sahil Bhaiwala ⋅ Anish Athalye ⋅ Jonas Mueller ⋅ Francisco Guzmán
Abstract
AI agents are expected to automate professional work, yet a key question remains: how well do today's frontier models actually handle $\textit{end-to-end analytical workflows}$ in economically high-value settings? We examine this question through the lens of investment banking by evaluating AI agents on tasks routinely performed by junior bankers. To ensure ecological validity, we collaborated with 175 investment bankers to develop an evaluation suite that replicates core features of their professional environment. Agents are assigned VP (Vice President)- and MD (Managing Director)-level requests; granted access to realistic \emph{data rooms} and industry-standard tools (e.g., FactSet and SEC EDGAR); and required to produce multi-file deliverables, including financial models, slide decks, reports, and email summaries. Individual tasks required as much as 8 hours of banker time to complete, highlighting the nontrivial labor investment and economic stakes for agents seeking to perform them autonomously. Across eight frontier models, we find that current AI systems struggle to reliably complete these workflows: even the best-performing model (Claude Opus 4.5) achieves only 33.8\% success. Our error analysis identifies key obstacles, such as internal consistency across deliverables and their client readiness, and routes to economic value when deploying agentic AI in high-stakes professional domains.