Poster in Workshop: 3rd Workshop on Navigating and Addressing Data Problems For Foundation Models (DATA-FM)

[Short] Beyond Data Size: Exploring the Impact of Dataset Diversity and Density in Self-Distillation Learning

Alvard Barseghyan ⋅ Ani Vanyan ⋅ Hakob Tamazyan ⋅ Hrant Khachatrian


Abstract

Current scaling laws suggest that maximizing the amount of unique data is key to superior pre-training. For self-distillation models like iBOT, we show that data density (repetition) and data diversity (as measured by the Vendi score) can be as critical as data size (the total number of unique samples). A wide range of experiments on a large remote sensing dataset demonstrates that seeing a smaller, high-quality subset multiple times outperforms a single pass over a massive stream of unique samples under equivalent compute. Based on these results, we propose a predictive scaling law that models downstream performance as a joint function of unique data size, data density, and data diversity. We demonstrate the extrapolation power of the proposed formula.
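
For readers unfamiliar with the diversity measure named above, the following is a minimal sketch of the Vendi score: the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix K/n. The cosine-similarity kernel, the function name `vendi_score`, and the toy inputs are assumptions made for illustration; the abstract does not state which embeddings or kernel the authors use.

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi score of a set of row vectors.

    Assumption: a cosine-similarity kernel is used, so K has ones on its
    diagonal and the eigenvalues of K / n sum to 1.
    """
    # L2-normalize each row so that x @ x.T is a cosine-similarity matrix.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = x.shape[0]
    kernel = x @ x.T
    # K / n is symmetric positive semi-definite; eigvalsh returns real eigenvalues.
    eigvals = np.linalg.eigvalsh(kernel / n)
    # Drop numerically zero eigenvalues (0 * log 0 is taken to be 0).
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))

# Hypothetical toy inputs: four identical samples vs. four mutually orthogonal ones.
identical = np.tile([[1.0, 0.0, 0.0]], (4, 1))
orthogonal = np.eye(4)

print(vendi_score(identical))   # approximately 1: no diversity
print(vendi_score(orthogonal))  # approximately 4: four fully distinct samples
```

The score can be read as an effective number of distinct samples: it ranges from 1 (all samples identical under the kernel) up to n (all samples mutually dissimilar).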
