Track: Oral Session 4E Datasets and benchmarks

Fri 24 April 11:15 - 11:25 PDT

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha ⋅ Ryan Marten ⋅ Sedrick Keh ⋅ Negin Raoof ⋅ Georgios Smyrnis ⋅ Hritik Bansal ⋅ Marianna Nezhurina ⋅ Jean Mercat ⋅ Trung Vu ⋅ Zayne Sprague ⋅ Ashima Suvarna ⋅ Benjamin Feuer ⋅ Leon Liangyu Chen ⋅ Zaid Khan ⋅ Eric Frankel ⋅ Sachin Grover ⋅ Caroline Choi ⋅ Niklas Muennighoff ⋅ Shiye Su ⋅ Wanjia Zhao ⋅ John Yang ⋅ Shreyas Pimpalgaonkar ⋅ Kartik Sharma ⋅ Charlie Ji ⋅ Yichuan Deng ⋅ Sarah Pratt ⋅ Vivek Ramanujan ⋅ Jon Saad-Falcon ⋅ Stutee Acharya ⋅ Jeffrey Li ⋅ Achal Dave ⋅ Alon Albalak ⋅ Kushal Arora ⋅ Blake Wulfe ⋅ Chinmay Hegde ⋅ Greg Durrett ⋅ Sewoong Oh ⋅ Mohit Bansal ⋅ Saadia Gabriel ⋅ Aditya Grover ⋅ Kai-Wei Chang ⋅ Vaishaal Shankar ⋅ Aaron Gokaslan ⋅ Mike Merrill ⋅ Tatsunori Hashimoto ⋅ Yejin Choi ⋅ Jenia Jitsev ⋅ Reinhard Heckel ⋅ Maheswaran Sathiamoorthy ⋅ Alex Dimakis ⋅ Ludwig Schmidt

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points, respectively, over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on openthoughts.ai.

Fri 24 April 11:27 - 11:37 PDT

FRABench and UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization

Shibo Hong ⋅ Jiahao Ying ⋅ Haiyuan Liang ⋅ Mengdi Zhang ⋅ Jun Kuang ⋅ Jiazheng Zhang ⋅ Yixin Cao

Evaluating open-ended outputs of Multimodal Large Language Models has become a bottleneck as model capabilities, task diversity, and modality rapidly expand. Existing "MLLM-as-a-Judge" evaluators, though promising, remain constrained to specific tasks and aspects (i.e., specific evaluation criteria such as fluency for text and image quality for images). In this paper, we argue that, on one hand, based on the interconnected nature of criteria, learning specific aspects can generalize to unseen aspects; on the other hand, jointly learning to assess multiple visual criteria and tasks may foster a synergistic effect. To this end, we propose UFEval, the first unified fine-grained evaluator with task and aspect generalization for four evaluation tasks: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. However, training such a unified evaluator is hindered by the lack of a large-scale, multi-modal, and aspect-level resource. To address this gap, we introduce FRABench, a comprehensive fine-grained evaluation dataset. Specifically, (1) we first construct a hierarchical aspect taxonomy encompassing 112 distinct aspects across the aforementioned four tasks. (2) Based on this taxonomy, we create FRABench, comprising 60.4k pairwise samples with 325k evaluation labels obtained from a combination of human and GPT-4o annotations. (3) Finally, leveraging FRABench, we develop UFEval, a unified fine-grained evaluator. Experiments show that learning on specific aspects enables UFEval to generalize to unseen aspects, and joint learning to assess diverse visual tasks and aspects can lead to substantial mutual benefits.

Fri 24 April 11:39 - 11:49 PDT

SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

Gyuhyeon Seo ⋅ Jungwoo Yang ⋅ Junseong Pyo ⋅ Nalim Kim ⋅ Jonggeun Lee ⋅ Yohan Jo

We introduce SimuHome, a high-fidelity smart home simulator and a benchmark of 600 episodes for LLM-based smart home agents. Existing smart home benchmarks treat the home as a static system, neither simulating how device operations affect environmental variables over time nor supporting workflow scheduling of device commands. SimuHome is grounded in the Matter protocol, the industry standard that defines how real smart home devices communicate and operate. Agents interact with devices through SimuHome's APIs and observe how their actions continuously affect environmental variables such as temperature and humidity. Our benchmark covers state inquiry, implicit user intent inference, explicit device control, and workflow scheduling, each with both feasible and infeasible requests. For workflow scheduling, the simulator accelerates time so that scheduled workflows can be evaluated immediately. An evaluation of 18 agents reveals that workflow scheduling is the hardest category, with failures persisting across alternative agent frameworks and fine-tuning. These findings suggest that SimuHome's time-accelerated simulation could serve as an environment for agents to pre-validate their actions before committing them to the real world.

Fri 24 April 11:51 - 12:01 PDT

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Pierre-Carl Langlais ⋅ Pavel Chizhov ⋅ Catherine Arnett ⋅ Carlos Hinostroza ⋅ Mattia Nee ⋅ Eliot Jones ⋅ Irène Girard ⋅ David Mach ⋅ Anastasia Stasenko ⋅ Ivan Yamshchikov

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissive licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up paths for both research and entrepreneurial needs in diverse areas of knowledge. We present the detailed provenance of the assembled data and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pre-training. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.