Poster in Workshop: Navigating and Addressing Data Problems for Foundation Models (DPFM)
Perplexed by Perplexity: Perplexity-Based Pruning with Small Reference Models
Zachary Ankner · Cody Blakeney · Kartik Sreenivasan · Max M Marion · Matthew Leavitt · Mansheej Paul
Keywords: LLM, perplexity, data pruning, transformer, efficiency, pretraining
In this work, we consider whether pretraining on a pruned, high-quality subset of a large-scale text dataset can improve LLM performance. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average downstream accuracy of a 3 billion parameter model by up to 1.35% and yields up to a 1.36x reduction in the pretraining steps needed to reach the baseline's performance.
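The recipe the abstract describes can be sketched in a few lines: score each pretraining document with a small reference model's perplexity, then keep a subset according to a selection criterion. Below is a minimal Python sketch under stated assumptions, not the paper's exact setup: the reference model (EleutherAI/pythia-160m as a stand-in for a ~125M-parameter model), the keep fraction, and the keep-lowest-perplexity criterion are all illustrative choices, and which criterion works best may depend on the domain composition of the data.

```python
# Minimal sketch of perplexity-based data pruning with a small reference model.
# Model name, keep fraction, and selection criterion are illustrative assumptions.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-160m"  # stand-in for a ~125M reference model
KEEP_FRACTION = 0.5                    # fraction of documents to retain (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Compute the reference model's perplexity on a single document."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    # With labels == input_ids, the model returns the mean token cross-entropy.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())


def prune(documents: list[str], keep_fraction: float = KEEP_FRACTION) -> list[str]:
    """Rank documents by reference-model perplexity and keep a subset.

    Keeping the lowest-perplexity documents is one possible criterion; the
    abstract notes that the effect of pruning varies with domain composition.
    """
    ranked = sorted(documents, key=perplexity)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "asdf qwerty zxcv lorem",
        "Paris is the capital of France.",
    ]
    print(prune(corpus))
```

The retained subset would then serve as the pretraining corpus for the larger model; only the cheap reference model is needed at scoring time, which is what makes pruning with a 125M-parameter scorer attractive for a 3B-parameter pretraining run.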