Oral in Workshop: Secure and Trustworthy Large Language Models
Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
Katie Matton · Robert Ness · Emre Kiciman
Large language models (LLMs) are capable of producing plausible explanations of how they arrived at an answer to a question. However, these explanations can be unfaithful to the model's true underlying behavior, potentially leading to over-trust and misuse. We introduce a new approach for measuring the faithfulness of explanations provided by LLMs. Our first contribution is to translate an intuitive understanding of what it means for an LLM explanation to be faithful into a formal definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that are influential in decision-making. We formalize faithfulness in terms of the difference between the set of concepts that the LLM says are influential and the set that truly are. We then present a novel method for quantifying faithfulness based on: (1) using an auxiliary LLM to edit, or perturb, the values of concepts within model inputs, and (2) using a hierarchical Bayesian model to quantify how changes to concepts affect model answers at both the example and dataset levels. Through preliminary experiments on a question-answering dataset, we show that our method can be used to quantify and discover interpretable patterns of unfaithfulness, including cases where LLMs fail to admit their use of social biases.
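To make the core idea concrete, here is a minimal illustrative sketch, not the paper's implementation: it compares the concepts an explanation cites as influential against the concepts whose perturbation actually changes the model's answer. All names (question_template, ask_model, etc.) are hypothetical placeholders, perturbations are assumed to be supplied externally (the paper generates them with an auxiliary LLM), and influence is judged by a single answer flip rather than the hierarchical Bayesian aggregation described in the abstract.

```python
from typing import Callable, Dict, Set


def concept_faithfulness(
    question_template: Callable[[Dict[str, str]], str],  # renders a question from concept values (hypothetical)
    concept_values: Dict[str, str],                       # original concept -> value
    alternative_values: Dict[str, str],                   # perturbed value for each concept (assumed given)
    ask_model: Callable[[str], str],                      # queries the target LLM for an answer (hypothetical)
    stated_concepts: Set[str],                            # concepts the LLM's explanation cites as influential
) -> Dict[str, Set[str]]:
    """Toy faithfulness check: contrast stated vs. actually influential concepts.

    Illustrative only; the paper's method aggregates perturbation effects with a
    hierarchical Bayesian model rather than relying on single answer flips.
    """
    # Answer to the original, unperturbed question.
    original_answer = ask_model(question_template(concept_values))

    # A concept is treated as "truly influential" here if perturbing its value
    # changes the model's answer.
    truly_influential: Set[str] = set()
    for concept, new_value in alternative_values.items():
        perturbed = dict(concept_values, **{concept: new_value})
        if ask_model(question_template(perturbed)) != original_answer:
            truly_influential.add(concept)

    return {
        "stated_but_not_influential": stated_concepts - truly_influential,  # possible overclaiming
        "influential_but_unstated": truly_influential - stated_concepts,    # possible hidden reliance, e.g. social bias
        "agreement": stated_concepts & truly_influential,
    }
```

In this sketch, a concept that changes the answer when perturbed but never appears in the explanation would surface in "influential_but_unstated", which corresponds to the kind of unfaithfulness the abstract highlights, such as unacknowledged use of social biases.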