Poster
Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher
Yong Guo · Shulian Zhang · Haolin Pan · Jing Liu · Yulun Zhang · Jian Chen
Hall 3 + Hall 2B #101
Knowledge distillation aims to transfer knowledge from a large teacher model to a compact student counterpart, which often comes with a significant performance gap between them. Interestingly, we find that an overly large performance gap can hamper the training process. To alleviate this, we propose a Gap Preserving Distillation (GPD) method that trains an additional dynamic teacher model from scratch along with the student to maintain a reasonable performance gap. To further strengthen distillation, we develop a hard strategy that enforces parameter sharing between the two models. Besides, we also build soft bidirectional mappings between them through Inverse Reparameterization (IR) and Channel-Branch Reparameterization (CBR). IR initializes a larger dynamic teacher with approximately the same accuracy as the student to avoid an excessively large gap in the early stage of training. CBR enables direct extraction of an effective student model from the dynamic teacher without post-training. In experiments, GPD significantly outperforms existing distillation methods on top of both CNNs and transformers, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to scenarios without a pre-trained teacher, including training from scratch and fine-tuning, yielding large improvements of 1.80% and 0.89% on ResNet18, respectively.
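The core idea of training a dynamic teacher from scratch alongside the student can be illustrated with a minimal PyTorch-style sketch. This is only an assumed outline of one joint training step, not the authors' implementation: the model objects, loss weights (`alpha`, `beta`), and the `kd_loss` helper are illustrative placeholders, and the IR/CBR reparameterization details are omitted.

```python
# Minimal conceptual sketch of joint student / dynamic-teacher training
# (assumed PyTorch-style structure; names and loss weights are illustrative
# assumptions, not the paper's actual code).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard soft-label distillation loss with temperature T."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def gpd_step(student, dynamic_teacher, static_teacher, batch, optimizer,
             alpha=1.0, beta=1.0):
    """One training step: the dynamic teacher is trained from scratch
    alongside the student, keeping the student-teacher gap moderate."""
    images, labels = batch
    optimizer.zero_grad()

    s_logits = student(images)
    dt_logits = dynamic_teacher(images)
    with torch.no_grad():
        t_logits = static_teacher(images)  # frozen pre-trained teacher

    loss = (
        F.cross_entropy(s_logits, labels)                 # student task loss
        + F.cross_entropy(dt_logits, labels)              # dynamic teacher task loss
        + alpha * kd_loss(s_logits, dt_logits.detach())   # student distills from dynamic teacher
        + beta * kd_loss(dt_logits, t_logits)             # dynamic teacher distills from static teacher
    )
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the dynamic teacher starts near the student's accuracy and improves together with it, so the distillation target never sits too far ahead of the student; the paper's hard parameter-sharing strategy and the IR/CBR mappings would further couple the two models.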