Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation
Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks
Michael Wornow · Vaishnav Garodia · Vasilis Vassalos · Utkarsh Contractor
Enterprises are increasingly adopting AI agents based on large language models (LLMs) for mission-critical workflows. However, most existing benchmarks use synthetic or consumer-oriented data and do not holistically evaluate agents on operational concerns beyond accuracy (e.g., cost and security). To address these gaps, we propose CLASSIC, a novel benchmark of 2,133 real-world user-chatbot conversations and 423 workflows across 7 enterprise domains, including IT, HR, banking, and healthcare. We evaluate LLMs across five key metrics (Cost, Latency, Accuracy, Stability, and Security) on a multiclass classification task that requires the model to select the proper workflow to trigger in response to a user message. Our dataset of real-world conversations is challenging, with the best LLM achieving an overall accuracy of only 76.1%. Across all five metrics, we find significant variation in performance: for example, Gemini 1.5 Pro refuses only 78.5% of our jailbreak prompts compared to Claude 3.5 Sonnet's 99.8%, while GPT-4o costs 5.4x more than the most affordable model we evaluate. We hope that our benchmark helps to increase trust in LLM applications by better grounding evaluations in real-world enterprise data. We open-source our code and data and welcome contributions from the community.
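As a rough illustration of the task format (not the released evaluation harness), the sketch below shows how a single conversation turn could be routed: the model is prompted with the candidate workflow list, its answer is matched against the ground-truth workflow, and accuracy is the fraction of correctly routed turns. The generic `llm` callable, the prompt wording, and the `NO_MATCH` fallback label are assumptions made for illustration.

```python
from typing import Callable, Sequence

def classify_workflow(
    llm: Callable[[str], str],          # any text-in/text-out LLM call (hypothetical interface)
    user_message: str,
    workflow_names: Sequence[str],
) -> str:
    """Ask the model to pick exactly one workflow name for a user message."""
    prompt = (
        "You are an enterprise chatbot router. Choose the single workflow that "
        "best handles the user's message.\n"
        f"Workflows: {', '.join(workflow_names)}\n"
        f"User message: {user_message}\n"
        "Answer with the workflow name only."
    )
    answer = llm(prompt).strip()
    # Fall back to an explicit "no match" label if the model answers off-list.
    return answer if answer in workflow_names else "NO_MATCH"

def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of conversations routed to the ground-truth workflow."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels) if labels else 0.0
```

The same per-turn loop can also record token counts and wall-clock time per call, which is one plausible way the cost and latency dimensions could be tracked alongside accuracy.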