Poster in Workshop: Navigating and Addressing Data Problems for Foundation Models (DPFM)
Perplexed by Perplexity: Perplexity-Based Pruning with Small Reference Models
Zachary Ankner · Cody Blakeney · Kartik Sreenivasan · Max M Marion · Matthew Leavitt · Mansheej Paul
Keywords: LLM, perplexity, data pruning, transformer, efficiency, pretraining
In this work, we consider whether pretraining on a pruned, high-quality subset of a large-scale text dataset can improve LLM performance. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average downstream accuracy of a 3 billion parameter model by up to 1.35% and yields up to a 1.36x reduction in the pretraining steps needed to reach the baseline's performance.
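The recipe the abstract describes can be sketched in a few lines: score each pretraining document with a small reference model's perplexity, then keep a subset according to a selection criterion. Below is a minimal Python sketch under stated assumptions, not the paper's exact setup: the reference model (EleutherAI/pythia-160m as a stand-in for a ~125M-parameter model), the keep fraction, and the keep-lowest-perplexity criterion are all illustrative choices, and which criterion works best may depend on the domain composition of the data.

```python
# Minimal sketch of perplexity-based data pruning with a small reference model.
# Model name, keep fraction, and selection criterion are illustrative assumptions.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-160m"  # stand-in for a ~125M reference model
KEEP_FRACTION = 0.5                    # fraction of documents to retain (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Compute the reference model's perplexity on a single document."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    # With labels == input_ids, the model returns the mean token cross-entropy.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())


def prune(documents: list[str], keep_fraction: float = KEEP_FRACTION) -> list[str]:
    """Rank documents by reference-model perplexity and keep a subset.

    Keeping the lowest-perplexity documents is one possible criterion; the
    abstract notes that the effect of pruning varies with domain composition.
    """
    ranked = sorted(documents, key=perplexity)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "asdf qwerty zxcv lorem",
        "Paris is the capital of France.",
    ]
    print(prune(corpus))
```

The retained subset would then serve as the pretraining corpus for the larger model; only the cheap reference model is needed at scoring time, which is what makes pruning with a 125M-parameter scorer attractive for a 3B-parameter pretraining run.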