Forgetting-MarI: LLM Unlearning via Marginal Information Regularization
Abstract
As Large Language Models (LLMs) face increasing regulatory scrutiny, the ability to surgically remove the influence of specific data without full retraining becomes critical, especially for deployed agentic systems that continuously accumulate user interactions, tool-use traces, and long-horizon trajectories. However, current LLM unlearning techniques are largely heuristic: they lack formal guarantees and often degrade model utility by removing information shared between the unlearn and retain sets. We bridge the gap between rigorous unlearning theory and LLM practice by introducing Forgetting-MarI, a framework that provably isolates and removes only the marginal information, that is, the unique effect contributed by the unlearn set, while preserving information supported by the retain set. By penalizing marginal information, we derive a tractable upper bound on the unlearn set’s residual influence in the unlearned model, yielding a verifiable notion of undetectability. Extensive experiments on Llama and GPT models (up to 8B parameters) confirm that Forgetting-MarI achieves better trade-offs between unlearning efficacy and utility preservation than state-of-the-art baselines. These results position marginal-information regularization as a principled and practical primitive for more controllable, auditable, and safe unlearning in real-world LLM deployments.