Enhancing Trustworthiness of Fine-Tuned LLMs via Regularized Subset Selection
Kumar Shubham · Nishant Sharma · Karn Tiwari · Prathosh AP
Abstract
Supervised fine-tuning (SFT) improves the downstream performance (perplexity) of large language models (LLMs) but can also degrade trustworthiness, leading to the generation of untruthful, biased, or unsafe content during user interactions. These issues are often traced back to specific phrases or patterns in the training data; however, correcting them usually requires expensive retraining or new data collection. In this work, we propose a two-stage, compute-efficient repair of post-SFT models that enhances trustworthiness while preserving downstream performance. In the first stage, we identify the training samples responsible for failures on trustworthiness metrics such as truthfulness, stereotypical bias, and machine ethics, and select a small, diverse subset of these examples using a determinantal point process (DPP)-based regularization. In the second stage, we repair the model under the proximal Bregman response function (PBRF) framework using a gradient ascent update, which enhances trustworthiness while preserving downstream task performance (perplexity). We evaluate our method on multiple LLMs of varying sizes and demonstrate up to 21\% improvement in trustworthiness metrics with minimal impact ($\leq 1\%$) on perplexity. Our method provides a computationally efficient way to enhance post-SFT models and offers a practical alternative to the hours of retraining otherwise required for model repair.
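To make the first stage concrete, the sketch below shows a standard greedy MAP approximation for DPP-based diverse subset selection over candidate training examples. It is an illustrative sketch only, not the authors' implementation: the function name `greedy_dpp_subset`, the `embeddings` input (feature vectors of the examples flagged as responsible for trustworthiness failures), and the subset size `k` are assumptions introduced here for exposition.

```python
# Minimal sketch of greedy MAP selection under a determinantal point process (DPP).
# Assumes `embeddings` holds feature vectors of candidate training examples; the
# selected indices maximize (greedily) the log-determinant of the kernel submatrix,
# which favors a diverse subset.
import numpy as np

def greedy_dpp_subset(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k indices that maximize log det of the DPP kernel submatrix."""
    n = len(embeddings)
    # Similarity kernel; a small ridge term keeps it positive definite.
    L = embeddings @ embeddings.T + 1e-6 * np.eye(n)
    selected: list[int] = []
    remaining = list(range(n))
    for _ in range(min(k, n)):
        best_idx, best_gain = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            # Log-determinant of the submatrix if example i were added.
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best_idx, best_gain = i, gain
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

# Example usage with random embeddings for 100 hypothetical candidate examples.
rng = np.random.default_rng(0)
subset = greedy_dpp_subset(rng.normal(size=(100, 32)), k=10)
print(subset)
```

The greedy log-determinant objective is submodular in this setting, so the greedy choice is a common practical approximation to exact DPP MAP inference, which is NP-hard.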