Poster
in
Workshop: 2nd Workshop on Mathematical and Empirical Understanding of Foundation Models

QuRating: Selecting High-Quality Data for Training Language Models

Alexander Wettig · Aatmik Gupta · Saumya Malik · Danqi Chen


Abstract:

Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we employ LLMs to discern these qualities, and enhance their reliability by eliciting pairwise comparisons of texts. We investigate four qualities: writing style, required expertise, facts & trivia, and educational value. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with fine-grained quality ratings. In our experiments, we sample 30B tokens according to different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity when selecting data. With appropriate sampling, our models achieve lower perplexity and stronger in-context learning performance than baselines. We release our models and annotated data.
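The step of turning pairwise judgments into scalar ratings can be sketched with a Bradley-Terry-style model, where the probability that text i is preferred over text j is a sigmoid of the score difference. The sketch below is illustrative only: the function name, the toy data, and the plain gradient-descent fit are our assumptions, not the paper's actual QuRater training code (which fine-tunes a language model rather than free per-item scores).

```python
import math

def fit_scalar_ratings(pairs, num_items, lr=0.1, epochs=200):
    """Illustrative sketch: fit one scalar quality score per text from
    pairwise judgments via the Bradley-Terry model,
    P(i preferred over j) = sigmoid(s_i - s_j).
    `pairs` is a list of (winner_index, loser_index) judgments."""
    scores = [0.0] * num_items
    for _ in range(epochs):
        for winner, loser in pairs:
            # gradient step on -log sigmoid(s_winner - s_loser)
            p_win = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            grad = 1.0 - p_win
            scores[winner] += lr * grad
            scores[loser] -= lr * grad
    return scores

# toy example: text 0 is consistently preferred over 1, and 1 over 2
pairs = [(0, 1), (0, 1), (1, 2), (1, 2), (0, 2)]
scores = fit_scalar_ratings(pairs, num_items=3)
print(scores[0] > scores[1] > scores[2])
```

Once every document carries a scalar rating, data can be sampled with a temperature over the scores, trading off quality (low temperature concentrates on top-rated texts) against diversity (high temperature approaches uniform sampling), matching the quality/diversity balance the abstract highlights.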
