“On Measuring Faithfulness or Self-Consistency of Natural Language Explanations”, 2023-11-13:
Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonable-sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations.
In this work, we argue that these faithfulness tests do not measure faithfulness to the models' inner workings, but rather their self-consistency at the output level. Our contributions are three-fold: (1) We clarify the status of faithfulness tests in view of model explainability, characterizing them as self-consistency tests instead. We underline this assessment by (2) constructing a Comparative Consistency Bank for self-consistency tests that, for the first time, compares existing tests on a common suite of 11 open LLMs and 5 tasks, including (3) our new self-consistency measure CC-SHAP.
CC-SHAP is a fine-grained measure (not a test) of LLM self-consistency. It compares how a model's input contributes to the predicted answer and to generating the explanation.
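The idea of comparing input contributions can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the two sets of contributions are compared, so the normalization step and the use of cosine similarity are assumptions made for illustration, and the function names (`normalize`, `cosine_similarity`, `consistency_score`) are hypothetical. The contribution values are taken as given (for example, from a SHAP-style attribution method, as the name CC-SHAP suggests) and passed in as plain lists with one value per input token.

```python
import math


def normalize(contributions):
    """Scale contributions so their absolute values sum to 1, keeping signs."""
    total = sum(abs(c) for c in contributions)
    if total == 0:
        return [0.0 for _ in contributions]
    return [c / total for c in contributions]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def consistency_score(prediction_contribs, explanation_contribs):
    """Compare per-token input contributions to the answer and to the explanation.

    Assumption: both lists are aligned to the same input tokens.
    Returns a value in [-1, 1]; higher means the same input tokens drive
    both the prediction and the explanation in the same direction.
    """
    if len(prediction_contribs) != len(explanation_contribs):
        raise ValueError("contribution lists must cover the same input tokens")
    return cosine_similarity(
        normalize(prediction_contribs), normalize(explanation_contribs)
    )
```

Under these assumptions, a score near 1 means the answer and the explanation rely on the input in the same way, while a score near -1 means the tokens that push toward the answer push against the explanation. Because the score is continuous, it is a graded measure and not a pass/fail test.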
Our fine-grained CC-SHAP metric allows us (3) to compare LLM behavior when making predictions and to analyze the effect of other consistency tests at a deeper level, which takes us one step further towards measuring faithfulness by bringing us closer to the internals of the model than strictly surface output-oriented tests.
Our code is available at GitHub.