Poster
in
Workshop: Navigating and Addressing Data Problems for Foundation Models (DPFM)
QuRating: Selecting High-Quality Data for Training Language Models
Alexander Wettig · Aatmik Gupta · Saumya Malik · Danqi Chen
Keywords: [ data selection ] [ language models ]
Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we investigate four qualities—writing style, required expertise, facts & trivia, and educational value. We employ LLMs to discern these qualities and obtain more reliable judgments by prompting for pairwise comparisons between texts. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with fine-grained quality ratings. In our experiments, we sample 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity when selecting data, and with appropriate sampling, our models achieve lower perplexity and stronger in-context learning performance than baselines. Beyond data selection, we use quality ratings to construct curricula which improve performance without changing the training dataset. We present an extensive analysis of the characteristics and biases of the quality ratings. To encourage further research, we release our prompts, models, and annotated data (QuRatedPajama).
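The step of turning pairwise judgments into scalar ratings can be illustrated with a Bradley–Terry style logistic model: each text gets one latent score, and the probability that text i is preferred over text j is a sigmoid of the score difference. The sketch below is a minimal, self-contained illustration of that principle (the actual QuRater model is a fine-tuned language model; the function name and hyperparameters here are hypothetical, not the paper's implementation).

```python
import math

def fit_scores(pairs, n_items, lr=0.5, steps=2000):
    """Fit one scalar quality score per text from pairwise preferences.

    pairs: list of (winner, loser) index tuples, where `winner` was
           judged higher-quality than `loser`.
    Returns a list of scalar scores, one per item.
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for w, l in pairs:
            # Bradley-Terry: P(w preferred over l) = sigmoid(s_w - s_l)
            p = 1.0 / (1.0 + math.exp(-(scores[w] - scores[l])))
            # Gradient of the negative log-likelihood -log p
            grad[w] -= 1.0 - p
            grad[l] += 1.0 - p
        scores = [s - lr * g / len(pairs) for s, g in zip(scores, grad)]
    return scores

# Toy example: 3 texts with fully consistent judgments
# (text 0 always preferred, text 2 never preferred).
pairs = [(0, 1), (0, 2), (1, 2)]
scores = fit_scores(pairs, 3)
# The fitted scores recover the ranking: scores[0] > scores[1] > scores[2]
```

Once every document carries such a scalar rating, selection can trade off quality against diversity, e.g. by sampling documents with probability proportional to an increasing function of the rating rather than deterministically taking the top-rated ones.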