Learning for Highly Faithful Explainability
Abstract
Learning to Explain is a recently proposed, forward-looking paradigm in explainable AI that envisions training explainers capable of efficiently producing high-quality explanations for target models. Although existing studies have pursued this goal through self-supervised optimization or by learning from prior explanation methods, the Learning to Explain paradigm still faces three critical challenges: 1) self-supervised objectives often rely on assumptions about the target model or task, restricting their generalizability; 2) methods driven by prior explanations struggle to guarantee the quality of the supervisory signals; and 3) relying exclusively on either approach leads to poor convergence or limited explanation quality. To address these challenges, we propose a faithfulness-guided amortized explainer that 1) theoretically derives a self-supervised objective free of assumptions about the target model or task, 2) practically generates high-quality supervisory signals by deduplicating and filtering prior explanations, and 3) jointly optimizes both objectives via a dynamic weighting strategy, enabling the amortized explainer to produce more faithful explanations for complex, high-dimensional models. We re-formalize several well-validated faithfulness evaluation metrics within a unified notation system and theoretically prove that a single explanation mapping can be simultaneously optimal under all of them. We aggregate prior explanation methods to generate high-quality supervisory signals through deduplication and faithfulness-based filtering. Our amortized explainer uses dynamic weighting to guide optimization, initially emphasizing pattern consistency with the supervisory signals for rapid convergence and subsequently refining explanation quality by approximating the most faithful explanation mapping. Extensive experiments across diverse target models on image, text, and tabular tasks demonstrate that the proposed explainer consistently outperforms all prior explanation methods on every faithfulness metric, highlighting its effectiveness and its potential to offer a systematic solution to the fundamental challenges of the Learning to Explain paradigm.
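To illustrate the dynamically weighted joint objective described above, the sketch below shows one possible form of the training loss, assuming a PyTorch setting. The cosine weight schedule, the MSE pattern-consistency term against filtered prior explanations, and the deletion-style soft-masking faithfulness surrogate are all illustrative placeholders, not the objective derived in the paper; function names such as `faithfulness_surrogate` and `joint_loss` are hypothetical.

```python
import math
import torch
import torch.nn.functional as F


def dynamic_weight(step: int, total_steps: int) -> float:
    """Weight on the supervised term: decays smoothly from 1 to 0 (cosine schedule; illustrative)."""
    progress = min(step / max(total_steps, 1), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def faithfulness_surrogate(model, x, attributions, sparsity=1e-3):
    """Illustrative self-supervised faithfulness term: soft-masking out highly
    attributed features should reduce the target model's confidence in its
    original prediction; a sparsity penalty keeps the mask compact."""
    with torch.no_grad():
        target = model(x).argmax(-1)                       # original predicted class
    mask = torch.sigmoid(attributions)                     # soft importance mask in [0, 1]
    x_ablated = x * (1.0 - mask)                           # remove the "important" features
    retained = model(x_ablated).softmax(-1).gather(-1, target.unsqueeze(-1))
    return retained.mean() + sparsity * mask.abs().mean()  # low retained probability => more faithful


def joint_loss(explainer, model, x, prior_expl, step, total_steps):
    """Dynamically weighted combination of the supervised and self-supervised terms."""
    attributions = explainer(x)                            # amortized explainer's attributions
    l_sup = F.mse_loss(attributions, prior_expl)           # pattern consistency with filtered prior explanations
    l_faith = faithfulness_surrogate(model, x, attributions)
    w = dynamic_weight(step, total_steps)
    return w * l_sup + (1.0 - w) * l_faith
```

Under this schedule, early training is dominated by consistency with the aggregated supervisory signals (fast convergence), while later training is dominated by the faithfulness term that directly optimizes the explanation mapping.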