State of XAI - Brainstorm
I have been tasked with giving a lecture at an advanced summer course on explainable AI (XAI) in a few months. Join me as I discover what I think I know about it, and what I don’t yet know that I probably should. These are unfiltered thoughts that will hopefully get clearer as I go along.
Explainability/interpretability comes in different flavors
- Inherently interpretable models: Here we make assumptions about what it means for something to be interpretable, prescribe decision-making processes accordingly, and build training methods around them. Examples include prototype-based models, decision trees, physics-inspired modeling, etc.
- Attribution methods: The way I see it, these methods focus on understanding what in the data causes what in the output. Attribution methods come in many flavors, but they all have in common that their output is focused on the data, not on the model (a minimal sketch follows this list). I would even argue that counterfactual examples are a subset of attribution methods, since we reason about counterfactuals in the input space.
- Interpretable representations: This consists of digging into the internal activations, or latent space, of the network and finding features there that correlate with something we understand. This is what we’re looking for when we do self-supervised learning, e.g. contrastive learning, or even when we just apply principal components to our data: some representation of the data that we can attach to concepts that make sense to us (see the second sketch below).
- “Mechanistic” interpretability: This means trying to deconstruct the network itself. Now, instead of assigning meaning to activations, we’re trying to assign meaning to nodes, modules, or layers in the network. In essence, what we’re trying to do is reverse-engineer the computation.
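To make the attribution flavor concrete, here is a minimal sketch of a gradient-based saliency map in PyTorch. The tiny classifier and random input are stand-ins of my own, not a method from any particular paper; the point is only that the explanation lives in input space, not inside the model.

```python
import torch
import torch.nn as nn

# Stand-in classifier and input; any differentiable model works here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # hypothetical input "image"

logits = model(x)
score = logits[0, logits.argmax(dim=1).item()]  # logit of the predicted class

# Gradient of that logit with respect to the input: large magnitudes mark
# the input features whose perturbation most changes the prediction.
score.backward()
saliency = x.grad.abs().squeeze()
print(saliency.shape)  # same shape as the input, i.e. the explanation is about the data
```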
Each of these flavors comes with its own set of goals and assumptions.
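And for the interpretable-representations flavor, a sketch of the simplest possible version: projecting hidden activations onto principal components and checking whether the leading directions line up with a concept we already understand. The activation matrix and labels below are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical hidden activations for 500 inputs (64 units each),
# plus a known attribute of each input (e.g. a class label).
rng = np.random.default_rng(0)
activations = rng.normal(size=(500, 64))
labels = rng.integers(0, 3, size=500)

# Project onto the two leading principal components of the latent space.
coords = PCA(n_components=2).fit_transform(activations)

# The "interpretation" step: does the projection separate by the known concept?
for c in range(3):
    print(c, coords[labels == c].mean(axis=0))
```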
Open questions that come to mind:
- Do I think there’s a difference between interpretability and explainability? If yes, how would I define it?
- What is the point of explainability in the first place? When would you want it? When would you not want it?
- Is there a glossary of equivalent terms?
Other obnoxious thoughts to peruse: