Poster
Scaling Long Context Training Data by Long-Distance Referrals
Yonghao Zhuang · Lanxiang Hu · Longfei Yun · Souvik Kundu · Zhengzhong Liu · Eric P Xing · Hao Zhang
Hall 3 + Hall 2B #276
Training large language models for long context understanding faces the challenge of data shortage. Previous data engineering approaches mechanically concatenate short documents, which may create many pseudo long documents but raise concerns about data quality. In this paper, we study the core attribute of high-quality data for long-context training and provide a data pipeline, LongPack, to scale such data. We found that long-distance referrals, which occur in natural long documents, are crucial for long-context training. However, simply concatenating short documents does not reliably generate these relations. We further show that the density of long-distance referrals, which is higher in longer documents, plays a key role in training efficiency, making previous upsampling methods suboptimal. To enrich long documents, we propose LongPack, a data pipeline that constructs long documents by packing shorter ones based on referral relationships. Specifically, for web pages, which are the primary source for language model training, we found hyperlinks to be a native signal for such relations. By packing web pages through their hyperlink connections, we can create longer, high-quality documents. Our experiments demonstrate that LongPack is highly scalable, generating a corpus of long documents equivalent in size to an entire pretraining dataset using just 0.5% root documents. Furthermore, the constructed documents have a ‘near-natural’ quality as innate long documents for long-context training, reaching a 32.7% higher score than previous state-of-the-art methods.
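To make the packing idea concrete, below is a minimal illustrative sketch of hyperlink-based document packing, not the authors' actual LongPack implementation. The `Page` class, the `pack_from_root` function, the BFS traversal order, and the character-based token estimate are all assumptions introduced for illustration; the paper's pipeline may select and order pages differently.

```python
# Hypothetical sketch of hyperlink-based packing (assumed, not LongPack itself).
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    text: str
    links: list = field(default_factory=list)  # outgoing hyperlink URLs

def pack_from_root(root_url, corpus, target_tokens=128_000, tok_per_char=0.25):
    """Pack a root page with hyperlink-connected pages into one long document.

    `corpus` maps URL -> Page. Pages are appended in BFS order from the root
    until a roughly estimated token budget is reached.
    """
    visited, packed, budget = {root_url}, [], target_tokens
    queue = deque([root_url])
    while queue and budget > 0:
        page = corpus.get(queue.popleft())
        if page is None:
            continue
        packed.append(page.text)
        budget -= int(len(page.text) * tok_per_char)  # crude token estimate
        for link in page.links:
            if link in corpus and link not in visited:
                visited.add(link)
                queue.append(link)
    # Pages concatenated this way reference one another through hyperlinks,
    # giving the packed document long-distance referrals.
    return "\n\n".join(packed)
```

Under these assumptions, each root page seeds one long document, which matches the abstract's claim that only a small fraction of root documents is needed to generate a large long-document corpus.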