CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers
Haining Pan ⋅ James Roggeveen ⋅ Erez Berg ⋅ Juan Alvarez ⋅ Debanjan Chowdhury ⋅ Surya Ganguli ⋅ Federico Ghimenti ⋅ Juraj Hasik ⋅ Henry Hunt ⋅ Hong-Chen Jiang ⋅ Mason Kamb ⋅ Ying-Jer Kao ⋅ Ehsan Khatami ⋅ Michael Lawler ⋅ Di Luo ⋅ Titus Neupert ⋅ Xiaoliang Qi ⋅ Michael Brenner ⋅ Eun-Ah Kim
Abstract
Large language models (LLMs) have demonstrated remarkable progress in coding and mathematical problem-solving, but evaluation on advanced, research-level problems in the hard sciences remains scarce. To fill this gap, we present \cmt, a dataset of 50 original problems covering condensed matter theory (CMT) at the level of an expert researcher. Solving these problems requires analytical and computational approaches commonly used in quantum many-body physics and classical statistical mechanics. The dataset was designed and verified by a worldwide panel of expert researchers working in a collaborative environment. Topics include Hartree-Fock mean-field theory, exact diagonalization, quantum Monte Carlo sampling, the density matrix renormalization group, quantum statistical mechanics, classical statistical mechanics, and model building. We evaluate different LLMs by programmatically checking LLM-generated solutions against expert-supplied ground truth. To verify LLM performance at scale, we developed an automated machine-grading pipeline suitable for advanced physics research problems; for example, it handles the non-commuting operators essential to quantum many-body problems through symbolic manipulation and normal ordering. Our evaluations show that frontier models struggle with all of the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes. While the highest-performing model, GPT5, correctly solves 30\% of the problems, the average performance across 17 models (GPT, Gemini, Claude, DeepSeek, and Llama classes) is only 11.4$\pm$2.1\%. Moreover, our benchmark contains 18 problems that none of the 17 models considered here solves correctly, and 26 problems that are solved by at most one model. These currently unsolved problems span quantum Monte Carlo, variational Monte Carlo, and the density matrix renormalization group. Furthermore, we illustrate how incorrect answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe this benchmark provides valuable guidance for the future development of language models toward the goal of AI research assistants and tutors.
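To make the idea of grading operator-valued answers concrete, the sketch below shows one way an automated checker could test whether two second-quantized expressions agree up to operator ordering, using SymPy's bosonic operators and its normal-ordering utility. This is an illustrative sketch under assumed conventions, not the paper's actual grading pipeline; the names `llm_answer`, `reference`, and `operators_match` are hypothetical.

```python
# Minimal sketch (assumption: not the authors' pipeline) of comparing two
# operator expressions that may differ only by operator ordering.
from sympy import expand, simplify
from sympy.physics.quantum import Dagger
from sympy.physics.quantum.boson import BosonOp
from sympy.physics.quantum.operatorordering import normal_ordered_form

a = BosonOp("a")  # bosonic annihilation operator with [a, a^dagger] = 1

# Hypothetical "LLM answer" and expert-supplied reference for the same quantity.
llm_answer = a * Dagger(a)
reference = Dagger(a) * a + 1

def operators_match(expr1, expr2):
    """Return True if the two operator expressions agree after normal ordering."""
    diff = normal_ordered_form(expand(expr1 - expr2))
    return simplify(diff) == 0

print(operators_match(llm_answer, reference))  # True: a a^dagger = a^dagger a + 1
```

Normal ordering brings both expressions to a canonical form in which all creation operators stand to the left of annihilation operators, so a simple symbolic comparison of the difference suffices; a fermionic check would proceed analogously with anticommutation relations.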