

Poster in Workshop: Secure and Trustworthy Large Language Models

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Samyak Jain · Robert Kirk · Ekdeep Singh Lubana · Robert Dick · Hidenori Tanaka · Edward Grefenstette · Tim Rocktäschel · David Krueger


Abstract:

Fine-tuning large pre-trained models has become the de facto strategy for developing models that are safe to deploy. However, little work has explained how fine-tuning alters the underlying capabilities a model learns during pre-training: does fine-tuning yield entirely novel capabilities, or does it merely modulate existing ones? We address this question empirically in synthetic settings, using mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities change. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a "wrapper", is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model begins reusing this capability within a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a superficially unrelated task.
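As a rough illustration of the probing methodology mentioned above (a minimal sketch, not the authors' exact protocol), the snippet below fits a linear probe on frozen activations from a model before and after fine-tuning: comparable held-out probe accuracy on both would suggest the underlying capability persists rather than being erased. The synthetic activations and the use of scikit-learn's LogisticRegression are illustrative assumptions; in practice the features would be hidden states extracted from a chosen layer on the same inputs.

```python
# Illustrative probing sketch (assumed setup, not the paper's exact method).
# If a linear probe decodes task-relevant information equally well from the
# pretrained and fine-tuned models' activations, fine-tuning likely learned
# a "wrapper" on top of the capability rather than removing it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(activations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen activations; return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, labels, test_size=0.25, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Synthetic stand-ins for hidden states; replace with real activations
# extracted from each model on an identical batch of probe inputs.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=512)
signal = labels[:, None] * 2.0 - 1.0  # class-dependent direction
acts_pretrained = signal + rng.normal(size=(512, 64))
acts_finetuned = signal + rng.normal(size=(512, 64))

print("pretrained probe acc:", probe_accuracy(acts_pretrained, labels))
print("fine-tuned probe acc:", probe_accuracy(acts_finetuned, labels))
```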
