Optimization, Not Architecture, Governs Vision Transformer Generalization in Small-Data Regimes
Abstract
Vision Transformers (ViTs) perform competitively on large-scale vision benchmarks but consistently underperform convolutional models when trained from scratch on small datasets. We present a controlled empirical study of ViTs trained from scratch on CIFAR-10 that systematically isolates the effects of data diversity, model capacity, regularization, and optimization. Across four progressively refined ViT variants, we find that architectural scaling and data augmentation yield limited gains, whereas optimization strategies (specifically, learning-rate warmup and cosine decay combined with stronger regularization) produce substantial improvements in generalization. Our results indicate that the failure of ViTs to generalize in small-data regimes is governed primarily by optimization dynamics rather than by architectural limitations.
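For concreteness, the learning-rate schedule named above (linear warmup followed by cosine decay) can be sketched as follows. This is a minimal illustration, not the configuration used in the study: the function name `lr_at_step`, its parameters, and the numeric values in the usage lines are all hypothetical.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed).

    Hypothetical helper for illustration: linear warmup to base_lr over
    warmup_steps, then cosine decay from base_lr down to min_lr.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps; progress runs from 0 to 1.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


# Illustrative values only (assumed, not taken from the paper).
schedule = [lr_at_step(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
            for s in range(100)]
```

The rate rises linearly during the first `warmup_steps` updates, peaks at `base_lr`, and then falls smoothly toward `min_lr`, which avoids large early updates while the model's parameters are still near their random initialization.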