MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
Zijian Wu ⋅ Xiangyan Liu ⋅ Xinyuan Zhang ⋅ Lingjun Chen ⋅ Fanqing Meng ⋅ Lingxiao Du ⋅ Yiran Zhao ⋅ Fanshi Zhang ⋅ Yaoqi Ye ⋅ Jiawei Wang ⋅ Zirui Wang ⋅ Jinjie Ni ⋅ Yufan Yang ⋅ Arvin Xu ⋅ Michael Qizhe Shieh
Abstract
The Model Context Protocol (MCP) standardizes how large language models (LLMs) interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose \texttt{MCPMark}, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of $127$ high-quality tasks collaboratively created by domain experts and AI agents, each with a curated initial state and a programmatic verification script. These tasks demand diverse create, read, update, and delete (CRUD) operations and richer environmental interactions. We evaluate cutting-edge LLMs using a minimal agent framework. The best-performing model, \texttt{gpt-5-medium}, reaches only $52.56$\% pass@1 and $33.86$\% pass\textasciicircum{}4, while other strong models, including \texttt{claude-sonnet-4} and \texttt{o3}, fall below $30$\% pass@1 and $15$\% pass\textasciicircum{}4. On average, LLMs require $16.2$ turns and $17.4$ tool calls per task, highlighting the stress-testing nature of \texttt{MCPMark}.
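The abstract reports pass@1 and pass\textasciicircum{}4 without defining them here. Below is a minimal Python sketch of one common reading of these metrics over $k=4$ independent runs per task, where each run is judged by the task's verification script: pass@1 as the average single-run success rate, and pass\textasciicircum{}4 as the fraction of tasks solved in all four runs. The function names, the data layout, and the exact aggregation are illustrative assumptions, not the paper's reference implementation.

```python
from typing import Dict, List

# Hypothetical layout: results[task_id] is a list of k booleans,
# True meaning the verification script accepted the final state of that run.

def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Average single-run success rate, macro-averaged over tasks."""
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)

def pass_all_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved in *all* k runs (a consistency measure)."""
    return sum(all(runs) for runs in results.values()) / len(results)

# Toy example: two tasks, four runs each (task names are made up).
results = {
    "notion/task-01": [True, True, False, True],
    "github/task-07": [True, True, True, True],
}
print(pass_at_1(results))   # 0.875
print(pass_all_k(results))  # 0.5
```

Under this reading, pass\textasciicircum{}4 is strictly harder than pass@1, which is consistent with the gap between the two numbers reported for every model above.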