

Poster

Open-Vocabulary Customization from CLIP via Data-Free Knowledge Distillation

Yongxian Wei · Zixuan Hu · Li Shen · Zhenyi Wang · Chun Yuan · Dacheng Tao

Hall 3 + Hall 2B #112
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT
 
Oral presentation: Oral Session 4D
Fri 25 Apr 12:30 a.m. PDT — 2 a.m. PDT

Abstract:

Vision-language models such as CLIP have demonstrated strong zero-shot performance, but their considerable size and inefficient inference limit customizable deployment for users. Knowledge distillation is one solution, yet it still requires the original training data, which is often unavailable due to copyright and privacy concerns. For the many users seeking open-vocabulary customization, Data-Free Knowledge Distillation (DFKD) therefore emerges as a promising direction. Upon rethinking DFKD, we find that existing methods fail on CLIP due to their heavy reliance on BatchNorm layers, which are unexpectedly unusable in CLIP. Based on this finding, we adopt image-text matching to achieve DFKD for CLIP, enabling customization from arbitrary class texts. Our approach involves (i) inverting a surrogate dataset from CLIP based on text prompts, and (ii) distilling a student model from CLIP on that surrogate dataset. Specifically, we introduce style dictionary diversification to enhance the diversity of synthetic images. To prevent uncontrollable semantics introduced by this diversification, we propose a class-consistency maintaining strategy that keeps synthetic images semantically faithful to their target classes. Building on synthetic images with varied styles, we further propose meta knowledge distillation to train a student model with strong generalization ability. Moreover, we introduce a simple yet effective method that enables customization from a few example images. Comprehensive experiments showcase the superiority of our approach across twelve customized tasks, achieving a 9.33% improvement over existing DFKD methods.
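
To make the inversion step concrete, below is a minimal sketch of step (i) in PyTorch using the public OpenAI CLIP package: synthetic images are optimized from noise so that their CLIP image embeddings match the embeddings of user-specified class texts. The prompts, hyperparameters, and the total-variation prior are illustrative assumptions rather than the paper's exact recipe, and the style-diversification and class-consistency components described above are omitted.

import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.float()  # keep everything in fp32 so gradients flow to the pixels
for p in model.parameters():
    p.requires_grad_(False)

# Arbitrary class texts chosen by the user (hypothetical examples).
class_texts = ["a photo of a golden retriever", "a photo of a tabby cat"]
with torch.no_grad():
    text_emb = F.normalize(model.encode_text(clip.tokenize(class_texts).to(device)), dim=-1)

# One synthetic image per class text, optimized from random noise.
images = torch.randn(len(class_texts), 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([images], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    # Image-text matching: pull each synthetic image toward its class text.
    match_loss = (1.0 - (img_emb * text_emb).sum(dim=-1)).mean()
    # A simple total-variation prior to keep images smooth (an assumption here).
    tv = (images[..., 1:, :] - images[..., :-1, :]).abs().mean() \
       + (images[..., :, 1:] - images[..., :, :-1]).abs().mean()
    (match_loss + 1e-4 * tv).backward()
    optimizer.step()

# `images` now forms a small surrogate dataset for step (ii): distilling a
# student model from CLIP, e.g. by matching the student's predictions to
# CLIP's image-text similarity scores on these synthetic images.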
