Poster in Workshop: Deep Generative Model in Machine Learning: Theory, Principle and Efficacy
TASKD-LLM: Task-Aware Selective Knowledge Distillation for LLMs
Khouloud Saadi · Di Wang
Keywords: [ Gradient Attribution ] [ Selective Distillation ] [ Task-Based Knowledge Distillation ] [ Knowledge Localization ] [ LLMs ]
Large language models (LLMs) have achieved state-of-the-art performance on generative tasks, but their computational cost makes them impractical to deploy in resource-constrained environments. Knowledge distillation (KD) is a promising technique for compressing LLMs by transferring knowledge from a large teacher to a more efficient student model. However, existing task-based KD methods distill all teacher model components indiscriminately. Since teacher models are typically pre-trained for versatility across a broad range of tasks, this approach can introduce unnecessary complexity when distilling for a specific downstream task, potentially limiting the student's ability to specialize. Furthermore, prior work has shown that only a subset of an LLM's components contributes significantly to a given task, making indiscriminate distillation inefficient. Motivated by these insights, we propose task-aware selective knowledge distillation (TASKD-LLM), a novel approach that transfers only task-relevant knowledge from the teacher to the student, simplifying the distillation process and keeping the student focused on the target task. Our method is flexible and can be combined with other distillation techniques in a plug-and-play manner. Empirical results demonstrate that TASKD-LLM outperforms existing methods, achieving higher performance on several benchmark datasets.
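The abstract does not spell out the exact formulation, so the following is only a minimal, hypothetical sketch of the general idea of gradient-attribution-based selective distillation, not the authors' method. Assumptions made for illustration: tiny Linear-layer models (ToyLM) stand in for transformer teacher and student, each teacher layer is scored by the squared gradient of the task loss with respect to its parameters, the top-k layers are treated as "task-relevant", and distillation combines logit KD with hidden-state matching restricted to those layers. All names and hyperparameters (layer_attribution, selective_kd_loss, alpha, T, k) are invented for this sketch.

```python
# Hypothetical sketch of task-aware selective KD via gradient attribution.
# Not the TASKD-LLM implementation; toy models and scoring rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class ToyLM(nn.Module):
    """Toy stand-in for a transformer: a stack of Linear blocks plus an output head."""
    def __init__(self, dim, depth, vocab):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        hidden = []
        for layer in self.layers:
            x = torch.tanh(layer(x))
            hidden.append(x)
        return self.head(x), hidden

def layer_attribution(teacher, x, labels):
    """Score each teacher layer by the squared gradient of the task loss
    w.r.t. that layer's parameters (one possible attribution choice)."""
    logits, _ = teacher(x)
    loss = F.cross_entropy(logits, labels)
    scores = []
    for layer in teacher.layers:
        grads = torch.autograd.grad(loss, list(layer.parameters()), retain_graph=True)
        scores.append(sum(g.pow(2).sum().item() for g in grads))
    return scores

def selective_kd_loss(student, teacher, x, labels, selected, alpha=0.5, T=2.0):
    """Task loss + logit KD + hidden-state matching on the selected teacher layers only."""
    with torch.no_grad():
        t_logits, t_hidden = teacher(x)
    s_logits, s_hidden = student(x)
    task = F.cross_entropy(s_logits, labels)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    feat = sum(F.mse_loss(s_hidden[i], t_hidden[i]) for i in selected) / max(len(selected), 1)
    return task + alpha * (kd + feat)

# Usage: pick the k most task-relevant teacher layers, then distill only those.
dim, depth, vocab, k = 16, 6, 10, 2
teacher, student = ToyLM(dim, depth, vocab), ToyLM(dim, depth, vocab)
x, labels = torch.randn(32, dim), torch.randint(0, vocab, (32,))

scores = layer_attribution(teacher, x, labels)
selected = sorted(range(depth), key=lambda i: -scores[i])[:k]

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss = selective_kd_loss(student, teacher, x, labels, selected)
opt.zero_grad()
loss.backward()
opt.step()
print("selected teacher layers:", selected, "| loss:", round(loss.item(), 4))
```

The design choice illustrated here is the "plug-and-play" aspect mentioned in the abstract: the selection step only produces a set of layer indices, so it can be composed with any distillation objective by restricting the feature-matching terms to that set.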