Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
Addressing Sample Inefficiency in Multi-View Representation Learning
Arna Ghosh · Kumar Agrawal · Adam Oberman · Blake A Richards
Abstract:
Non-contrastive self-supervised learning (NC-SSL) methods like BarlowTwins and VICReg have shown great promise for label-free representation learning in computer vision. However, researchers rely on several heuristics to achieve competitive performance, such as using high-dimensional projector heads and two augmentations of the same image, and it is unknown why these heuristics are necessary. In this work, we provide theoretical insights into the implicit biases and learning dynamics of the BarlowTwins and VICReg losses that can explain these heuristics and guide the development of more principled recommendations. First, our analysis reveals that the orthogonality of the learned features is more important than projector dimensionality for learning good representations. Based on this, we empirically demonstrate that low-dimensional projector heads are sufficient with appropriate regularization, contrary to the existing heuristic. Second, we show that using multiple data augmentations better aligns with the desiderata of the NC-SSL objectives. Based on this, we demonstrate that leveraging more augmentations per sample improves representation quality and optimization convergence, leading to better features emerging earlier in training. Remarkably, we demonstrate that we can $\textbf{reduce the pretraining dataset size by up to 4x}$ while maintaining downstream accuracy and improving convergence simply by using more data augmentations. Combining these insights, we present practical pretraining recommendations that $\textbf{improve wall-clock time by 2x}$ and improve performance on the CIFAR-10 and STL-10 datasets using a ResNet-50 backbone.
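To make the multi-augmentation idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of a Barlow Twins-style cross-correlation loss extended from two views to $k$ augmentations per image by averaging the standard pairwise objective over all view pairs. The function names, the off-diagonal weight, and the batch/projector sizes in the example are illustrative assumptions, not taken from the paper.

```python
# Sketch: Barlow Twins-style loss over k augmented views (assumed extension,
# not the paper's exact implementation). Each view tensor has shape (N, D),
# i.e. batch size N and projector output dimension D.
import itertools
import torch


def barlow_twins_pair_loss(z_a, z_b, lambda_offdiag=5e-3):
    """Standard Barlow Twins loss for one pair of embedding batches (N, D)."""
    n, _ = z_a.shape
    # Standardize each feature dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    # Cross-correlation matrix C of shape (D, D).
    c = (z_a.T @ z_b) / n
    # Push diagonal entries toward 1 (invariance) and off-diagonal entries
    # toward 0 (redundancy reduction / feature orthogonality).
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambda_offdiag * off_diag


def multi_view_barlow_loss(views, lambda_offdiag=5e-3):
    """Average the pairwise loss over all unordered pairs of the k views."""
    pair_losses = [
        barlow_twins_pair_loss(z_i, z_j, lambda_offdiag)
        for z_i, z_j in itertools.combinations(views, 2)
    ]
    return torch.stack(pair_losses).mean()


# Example usage: k = 4 augmentations, batch of 256, 128-dim projector output
# (a low-dimensional projector, in the spirit of the paper's recommendation).
views = [torch.randn(256, 128) for _ in range(4)]
loss = multi_view_barlow_loss(views)
```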