Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from the MNIST Auxiliary Logit Distillation Experiment
Abstract
In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait even though it is distilled only on no-class logits. Under a single-step gradient descent assumption, subliminal learning theory attributes this to alignment between the trait gradient and the distillation gradient. The theory does not guarantee that this alignment, which is necessary for trait acquisition, persists in a multi-step setting. We empirically test whether the alignment persists in the multi-step setting and whether it contributes to trait acquisition. Measuring the gradient alignment throughout training, we find a persistent, weakly positive alignment. We then run an ablation that removes the trait-aligned component by projecting the distillation gradient onto the hyperplane orthogonal to the trait gradient. This suppresses trait transfer without affecting distillation progress. Additionally, we observe a period of fast trait acquisition in the first epoch, resembling the critical period of subliminal learning reported in a previous study. Finally, we show that liminal training, the currently known mitigation motivated by the critical-period observation, does not effectively suppress trait acquisition in our setting. Liminal training applies a KL-divergence penalty during the critical period to limit the deviation of the model being distilled from the base model.
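The alignment measurement and the projection ablation described above can be illustrated with a minimal NumPy sketch. It assumes both gradients have been flattened into single vectors over the same parameters and that alignment is measured as cosine similarity. The abstract does not specify these details, and the names `cosine_alignment` and `project_out` are hypothetical, chosen only for illustration.

```python
import numpy as np

def cosine_alignment(g_distill, g_trait, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    denom = np.linalg.norm(g_distill) * np.linalg.norm(g_trait) + eps
    return float(np.dot(g_distill, g_trait) / denom)

def project_out(g_distill, g_trait, eps=1e-12):
    """Remove from g_distill its component along g_trait.

    The result lies in the hyperplane orthogonal to g_trait, so a
    step along it has no first-order effect on the trait loss.
    """
    coeff = np.dot(g_distill, g_trait) / (np.dot(g_trait, g_trait) + eps)
    return g_distill - coeff * g_trait

# Synthetic stand-ins for real gradients (assumption: illustrative only).
rng = np.random.default_rng(0)
g_trait = rng.normal(size=1000)
g_distill = rng.normal(size=1000) + 0.3 * g_trait  # built to be positively aligned

g_ablated = project_out(g_distill, g_trait)
print("alignment before projection:", cosine_alignment(g_distill, g_trait))
print("alignment after projection: ", cosine_alignment(g_ablated, g_trait))
```

In a training loop, a projection of this kind would be applied to the distillation gradient at every step before the optimizer update, with the trait gradient recomputed at the current parameters.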