SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks
Abstract
Large language models (LLMs) are increasingly capable of producing and manipulating structured artifacts, which has unlocked applications in complex, open-ended tasks in the finance domain. We consider the task of end-to-end spreadsheet generation, wherein LLMs are prompted to produce spreadsheet artifacts satisfying users' explicit and implicit natural language constraints. We introduce SpreadsheetArena, a platform for evaluating LLMs' performance on this task via blind pairwise preference votes on LLM-generated spreadsheet workbooks. Compared to general dialogue tasks, the evaluation of spreadsheet generation presents unique challenges: (1) spreadsheet workbooks are inherently multi-dimensional, with dense, graph-structured dependencies across cells and formulas, and (2) complex considerations around interactivity can be difficult to formalize in terms of programmatically verifiable surface features. Arena prompts span diverse use cases, with professional and corporate finance domains in particular heavily represented. Among other findings, we observe that the stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and that expert evaluations of spreadsheets for finance prompts suggest that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. Our live arena is hosted at https://spreadsheetarena.ai.