Poster in Workshop: Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety Across Modalities

Patching LLMs Like Software: A Lightweight Method for Improving Safety Policies in Large Language Models

Huzaifa Arif ⋅ Pin-Yu Chen ⋅ Keerthiram Murugesan ⋅ Alex Gittens ⋅ Payel Das ⋅ Ching-Yun Ko


Abstract

We propose safety policy patching, a lightweight and modular approach for addressing safety vulnerabilities in large language models (LLMs) between major releases. Major version updates are costly, infrequent, and difficult to tailor to customer needs, leaving deployed models with known safety gaps. Our method enables rapid remediation by prepending a compact, learnable prefix to an existing model's inputs. This patch introduces very few additional parameters (e.g., 0.003% for LLaMA-2), yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal), policy patches achieve safety improvements comparable to stronger safety-aligned models (e.g., future major releases) while preserving fluency. Overall, we show that LLMs can be "patched" much like software, providing vendors and practitioners a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.
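To make the idea of prepending a learnable prefix concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`make_prefix`, `prepend_prefix`, `patch_fraction`), the number of virtual tokens, the hidden size, and the base parameter count are all hypothetical values chosen for illustration. The sketch only shows the mechanics of placing a small block of trainable vectors in front of frozen input embeddings and of estimating how small that block is relative to the base model; how the prefix is trained toward a safer reference model is not covered.

```python
import random


def make_prefix(num_virtual_tokens, hidden_size, seed=0):
    """Create a small block of prefix vectors (the only trainable part).

    Assumption: the patch is a matrix of shape
    (num_virtual_tokens, hidden_size), initialized with small random values.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
        for _ in range(num_virtual_tokens)
    ]


def prepend_prefix(prefix, input_embeddings):
    """Return the sequence the frozen model would see: prefix, then input."""
    hidden_size = len(prefix[0])
    for row in input_embeddings:
        if len(row) != hidden_size:
            raise ValueError("prefix and input embeddings differ in width")
    # The base model's weights are untouched; only the input sequence grows.
    return list(prefix) + list(input_embeddings)


def patch_fraction(num_virtual_tokens, hidden_size, base_params):
    """Fraction of extra parameters the prefix adds to the base model."""
    return num_virtual_tokens * hidden_size / base_params


if __name__ == "__main__":
    # Hypothetical sizes for illustration only, not taken from the paper.
    prefix = make_prefix(num_virtual_tokens=3, hidden_size=4)
    user_input = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
    patched = prepend_prefix(prefix, user_input)
    print(len(patched), "vectors reach the frozen model")

    frac = patch_fraction(50, 4096, 7_000_000_000)
    print(f"extra parameters: {frac:.6%} of the base model")
```

Because the patch lives entirely in the input sequence, swapping or removing it does not require touching the base model's weights, which is what makes the software-patch analogy workable in this sketch.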
