Open LLM Projects Should Allocate More Compute for Data Than Training
Maximilian Idahl
Abstract
Open LLM projects aim to build the best possible open language models under constrained compute budgets. Currently, most allocate the vast majority of their GPU compute to training runs rather than to improving their data. This position paper argues that these projects should instead invest the majority of their compute in data. Reported efficiency gains of 6-9x from data curation, filtering, and synthetic generation justify allocating 80% or more of development compute to data work. Beyond producing better models, data investments compound across model generations, whereas individual models are often superseded within months. We discuss allocation strategies and call for open LLM projects to adopt explicitly data-centric compute accounting.