AEGIS: Adversarial Target–Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models
Abstract
Concept erasure helps stop diffusion models (DMs) from generating harmful content; but current methods face robustness-retention trade-off. Robustness means the model fine-tuned by concept erasure methods resists reactivation of erased concepts, even under semantically related prompts. Retention means unrelated concepts are preserved so the model’s overall utility stays intact. Both are critical for concept erasure in practice, yet addressing them simultaneously is challenging, as existing works typically improve one factor while sacrificing the other. Prior work typically strengthens one while degrading the other—e.g., mapping a single erased prompt to a fixed safe target leaves class-level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces AEGIS (Adversarial Erasure with Gradient-Informed Synergy), a retention-data-free framework that advances both robustness and retention. First, AEGIS replaces handpicked targets with an Adversarial Erasure Target (AET) optimized to approximate the semantic center of the erased concept class. By aligning the model’s prediction on the erased prompt to an AET-derived target in the shared text–image space, AEGIS increases predicted-noise distances not just for the instance but for semantically related variants, substantially hardening the DMs against state-of-the-art adversarial prompt attacks. Second, AEGIS preserves utility without auxiliary data via Gradient Regularization Projection (GRP), a conflict-aware gradient rectification that selectively projects away the destructive component of the retention update only when it opposes the erasure direction. This directional, data-free projection mitigates interference between erasure and retention, avoiding dataset bias and accidental relearning. Extensive experiments show that AEGIS markedly reduces attack success rates across various concepts while maintaining or improving FID/CLIP versus advanced baselines, effectively pushing beyond the prevailing robustness–retention trade-off. The source code is in the supplementary.