2 Interpretability
This chapter introduces the concepts of interpretability. While it’s difficult to define interpretability mathematically, I like the definition by Biran and Cotton (2017), which was also used by Miller (2019):
“Interpretability is the degree to which a human can understand the cause of a decision.”
Another good one is by Kim, Khanna, and Koyejo (2016):
“a method is interpretable if a user can correctly and efficiently predict the method’s results”
The more interpretable a machine learning model is, the easier it is for someone to understand why certain decisions or predictions were made. One model is more interpretable than another if its decisions are easier for a human to understand.
Sometimes you will also see the term “explainable” in this context, as in “explainable AI”. What’s the difference between explainable and interpretable? In general, researchers in the field can’t seem to agree on a definition for either term (Flora et al. 2022). I agree with the definitions from Roscher et al. (2020):
- Interpretability is about mapping an abstract concept from the model into a form that humans can understand.
- Explainability is a stronger term requiring interpretability and additional context.
Additionally, the term explanation is typically used for local methods, which are about “explaining” a prediction. Anyway, these terms are so fuzzy that the pragmatic approach is to see them as an umbrella term that captures the “extraction of relevant knowledge from a machine-learning model concerning relationships either contained in data or learned by the model” (Murdoch et al. 2019).
Importance of interpretability
If a machine learning model performs well, why do we not just trust the model and ignore why it made a certain decision? “The problem is that a single metric, such as classification accuracy, is an incomplete description of most real-world tasks.” (Doshi-Velez and Kim 2017)
Let’s dive deeper into the reasons why interpretability is so important. When it comes to predictive modeling, you have to make a trade-off: Do you just want to know what is predicted, for example, the probability that a customer will churn or how effective some drug will be for a patient? Or do you want to know why the prediction was made and possibly pay for interpretability with a drop in predictive performance? In some cases, you don’t care why a decision was made; it’s enough to know that the predictive performance on a test dataset was good. But in other cases, knowing the ‘why’ can help you learn more about the problem, the data, and the reason why a model might fail. Some models may not require explanations because they are used in a low-risk environment, meaning a mistake will not have serious consequences (e.g., a movie recommender system). It could also be that the method has already been extensively studied and evaluated (e.g., optical character recognition). The need for interpretability arises from an incompleteness in problem formalization (Doshi-Velez and Kim 2017), which means that for certain problems or tasks it is not enough to get the prediction (the what). The model must also explain how it arrived at the prediction (the why), because a correct prediction only partially solves your original problem. The following reasons drive the demand for interpretability and explanations (Doshi-Velez and Kim 2017; Miller 2017).
Human curiosity and learning: Humans have a mental model of their environment that is updated when something unexpected happens. This update is performed by finding an explanation for the unexpected event. For example, a human feels unexpectedly sick and asks, “Why do I feel so sick?” He learns that he gets sick every time he eats those red berries. He updates his mental model and decides that the berries caused the sickness and should therefore be avoided. When opaque machine learning models are used in research, scientific findings remain completely hidden if the model only gives predictions without explanations. To facilitate learning and satisfy curiosity as to why certain predictions or behaviors are created by machines, interpretability and explanations are crucial. Of course, humans don’t need explanations for everything that happens. For example, most people don’t need to understand how a computer works. However, unexpected events make us curious. For example: Why is my computer shutting down unexpectedly?
Closely related to learning is the human desire to find meaning in the world. We want to harmonize contradictions or inconsistencies between elements of our knowledge structures. “Why did my dog bite me even though it has never done so before?” a human might ask. There is a contradiction between the knowledge of the dog’s past behavior and the newly made, unpleasant experience of the bite. The vet’s explanation reconciles the dog owner’s contradiction: “The dog was under stress and that’s why it bit you.” The more a machine’s decision affects a person’s life, the more important it is for the machine to explain its behavior. If a machine learning model rejects a loan application, this may be completely unexpected for the applicants. They can only reconcile this inconsistency between expectation and reality with some kind of explanation. The explanations don’t actually have to fully explain the situation but should address a main cause. Another example is algorithmic product recommendation. Personally, I always think about why certain products or movies have been algorithmically recommended to me. Often it’s clear: Advertising follows me on the Internet because I recently bought a washing machine, and I know that for the next few days I’ll be followed by advertisements for washing machines. Yes, it makes sense to suggest gloves if I already have a winter hat in my shopping cart. The algorithm recommends this particular movie because users who liked other movies I liked also enjoyed the recommended movie. Increasingly, Internet companies are adding explanations to their recommendations. A good example is product recommendations, which are based on frequently purchased product combinations, as illustrated in Figure 2.1.
In many scientific disciplines, there is a change from qualitative to quantitative methods (e.g., sociology, psychology), and also towards machine learning (biology, genomics). The goal of science is to gain knowledge, but many problems are solved with big datasets and black box machine learning models. The model itself becomes the source of knowledge instead of the data. Interpretability makes it possible to extract this additional knowledge captured by the model.
Machine learning models take on real-world tasks that require safety measures and testing. Imagine a self-driving car automatically detects cyclists based on a deep learning system. You want to be 100% sure that the abstraction the system has learned is error-free because running over cyclists is very bad. An explanation might reveal that the most important learned feature is to recognize the two wheels of a bike, and this explanation helps you think about edge cases like bikes with side bags that partially cover the wheels.
By default, machine learning models pick up biases from the training data. This could make your machine learning models racist and discriminate against underrepresented groups. Interpretability is a useful debugging tool for detecting bias in machine learning models. It might happen that the machine learning model you have trained for automatic approval or rejection of credit applications discriminates against a minority that has been historically disenfranchised. Your main goal is to grant loans only to people who will eventually repay them. The incompleteness of the problem formulation in this case lies in the fact that you not only want to minimize loan defaults but are also obliged not to discriminate on the basis of certain demographics. This is an additional constraint that is part of your problem formulation (granting loans in a low-risk and compliant way) that is not covered by the loss function the machine learning model was optimized for.
The process of integrating machines and algorithms into our daily lives requires interpretability to increase social acceptance. People attribute beliefs, desires, intentions, and so on to objects. In a famous experiment, Heider and Simmel (1944) showed participants videos of shapes in which a circle opened a “door” to enter a “room” (which was simply a rectangle). The participants described the actions of the shapes as they would describe the actions of a human agent, assigning intentions and even emotions and personality traits to the shapes. Robots are a good example, like our vacuum cleaner, which we named “Doge”. If Doge gets stuck, I think: “Doge wants to keep cleaning, but asks me for help because it got stuck.” Later, when Doge finishes cleaning and searches for its home base to recharge, I think: “Doge has a desire to recharge and intends to find the home base.” I also attribute personality traits: “Doge is a bit dumb, but in a cute way.” These are my thoughts, especially when I find out that Doge has knocked over a plant while dutifully vacuuming the house. A machine or algorithm that explains its predictions will find more acceptance.
Explanations are used to manage social interactions. By creating a shared meaning of something, the explainer influences the actions, emotions, and beliefs of the recipient of the explanation. For a machine to interact with us, it may need to shape our emotions and beliefs. Machines have to “persuade” us so that they can achieve their intended goal. I would not fully accept my robot vacuum cleaner if it didn’t explain its behavior to some degree. The vacuum cleaner creates a shared meaning of, for example, an “accident”, such as getting stuck on the bathroom carpet … again, by explaining that it got stuck instead of simply stopping without comment. Interestingly, there may be a misalignment between the goal of the explaining machine (create trust) and the goal of the recipient (understand the prediction or behavior). Perhaps the full explanation for why Doge got stuck could be that the battery was very low, that one of the wheels is not working properly, and that there is a bug that makes the robot go to the same spot over and over again even though there was an obstacle. These reasons (and a few more) caused the robot to get stuck, but it only explained that something was in the way, and that was enough for me to trust its behavior and get a shared meaning of that accident. By the way, Doge got stuck in the bathroom again. Proof in Figure 2.2. We have to remove the carpets every time before we let Doge vacuum.
Machine learning models can only be debugged and audited when they can be interpreted. Even in low-risk environments, such as movie recommendations, the ability to interpret is valuable in the research and development phase as well as after deployment. Later, when a model is used in a product, things can go wrong. An interpretation for an erroneous prediction helps to understand the cause of the error. It delivers a direction for how to fix the system. Consider an example of a husky versus wolf classifier that misclassifies some huskies as wolves. Using interpretable machine learning methods, you would find that the misclassification was due to the snow on the image. The classifier learned to use snow as a feature for classifying images as “wolf,” which might make sense in terms of separating wolves from huskies in the training dataset but not in real-world use.
If you can ensure that the machine learning model can explain decisions, you can also check the following traits more easily (Doshi-Velez and Kim 2017):
- Fairness: Ensuring that predictions are unbiased and don’t implicitly or explicitly discriminate against underrepresented groups.
- Privacy: Ensuring that sensitive information in the data is protected.
- Reliability or Robustness: Ensuring that small changes in the input don’t lead to large changes in the prediction (see the sketch after this list).
- Causality: Ensuring that only causal relationships are picked up.
- Trust: It’s easier for humans to trust a system that explains its decisions compared to a black box.
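Robustness is the trait on this list that is easiest to probe directly. The following is a minimal sketch, not a formal test: it assumes a fitted model with a scikit-learn-style `predict` method (the function name and defaults are my own), perturbs a single instance with small amounts of noise, and reports the largest resulting change in the prediction.

```python
import numpy as np

def robustness_probe(model, x, noise_scale=0.01, n_perturbations=100, seed=0):
    """Perturb one instance with small Gaussian noise and report the largest
    change in the model's prediction. Large values hint at instability."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    baseline = model.predict(x.reshape(1, -1))[0]
    # Noise scaled relative to each feature's magnitude.
    noise = rng.normal(scale=noise_scale * (np.abs(x) + 1e-8),
                       size=(n_perturbations, x.size))
    preds = model.predict(x + noise)
    return np.max(np.abs(preds - baseline))
```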
Sometimes we don’t need interpretability
The following scenarios illustrate when we do not need or even do not want interpretability of machine learning models.
Interpretability is not required if the model has no significant impact. Imagine someone named Mike working on a machine learning side project to predict where his friends will go for their next holiday based on Facebook data. Mike just likes to surprise his friends with educated guesses about where they will go on holiday. There’s no real problem if the model is wrong (at worst just a little embarrassment for Mike), nor is there a problem if Mike cannot explain the output of his model. It’s perfectly fine not to have interpretability in this case. The situation would change if Mike started building a business around these holiday destination predictions. If the model is wrong, the business could lose money, or the model may work worse for some people because of learned racial bias. As soon as the model has a significant impact, be it financial or social, interpretability becomes relevant.
Interpretability is not required when the problem is well studied. Some applications have been sufficiently well studied so that there is enough practical experience with the model, and problems with the model have been solved over time. A good example is a machine learning model for optical character recognition that processes images of envelopes and extracts addresses. These systems have been in use for many years, and it’s clear that they work. In addition, we might not be interested in gaining additional insights about the task at hand.
Interpretability might enable people or programs to manipulate the system. Problems with users who deceive a system result from a mismatch between the goals of the creator and the user of a model. Credit scoring is such a system because banks want to ensure that loans are only given to applicants who are likely to repay them, and applicants aim to get the loan even if the bank does not want to give them one. This mismatch between the goals introduces incentives for applicants to game the system to increase their chances of getting a loan. If an applicant knows that having more than two credit cards negatively affects his score, he simply returns his third credit card to improve his score and applies for a new card after the loan has been approved. While his score improved, the actual probability of repaying the loan remained unchanged. The system can only be gamed if the inputs are proxies for a causal feature but do not actually cause the outcome. Whenever possible, proxy features should be avoided as they make models gameable. For example, Google developed a system called Google Flu Trends to predict flu outbreaks. The system correlated Google searches with flu outbreaks – and it performed poorly. The distribution of search queries changed, and Google Flu Trends missed many flu outbreaks. Google searches do not cause the flu. When people search for symptoms like “fever,” it’s merely a correlation with actual flu outbreaks. Ideally, models would only use causal features that are not gameable.
Human-friendly explanations
Let’s dig deeper and discover what we humans see as “good” explanations and what the implications are for interpretable machine learning. Humanities research can help us find out. Miller (2017) has conducted a huge survey of publications on explanations, and this chapter builds on his summary.
In this chapter, I want to convince you of the following: As an explanation for an event, humans prefer short explanations (only 1 or 2 causes) that contrast the current situation with a situation in which the event would not have occurred. Especially abnormal causes provide good explanations. Explanations are social interactions between the explainer and the explainee (recipient of the explanation), and therefore the social context has a great influence on the actual content of the explanation.
What’s a good explanation?
An explanation is the answer to a why-question (Miller 2017).
- Why did the treatment not work on the patient?
- Why was my loan rejected?
- Why have we not been contacted by alien life yet?
The first two questions can be answered with an “everyday” explanation, while the third one comes from the category “More general scientific phenomena and philosophical questions.” We focus on the “everyday”-type explanations because those are relevant to interpretable machine learning. In the following, the term “explanation” refers to the social and cognitive process of explaining, but also to the product of these processes. The explainer can be a human being or a machine. This section further condenses Miller’s summary on “good” explanations and adds concrete implications for interpretable machine learning.
Explanations are contrastive (Lipton 1990). Humans usually don’t ask why a certain prediction was made, but why this prediction was made instead of another prediction. We tend to think in counterfactual cases, i.e., “What would the prediction have been if input \(\mathbf{x}\) had been different?”. For a house price prediction, the house owner might be interested in why the predicted price was high compared to the lower price they had expected. If my loan application is rejected, I do not care to hear all the factors that generally speak for or against a rejection. I’m interested in the factors in my application that would need to change to get the loan. I want to know the contrast between my application and the would-be-accepted version of my application. The recognition that contrastive explanations matter is an important finding for explainable machine learning. You can extract an explanation from most interpretable models that implicitly contrasts a prediction of an instance with the prediction of an artificial data instance or an average of instances. Physicians might ask: “Why did the drug not work for my patient?”. And they might want an explanation that contrasts their patient with a patient for whom the drug worked and who is similar to the non-responding patient. Contrastive explanations are easier to understand than complete explanations. A complete explanation of the physician’s question why the drug does not work might include: The patient has had the disease for 10 years, 11 genes are over-expressed, the patient’s body is very quick to break the drug down into ineffective chemicals, … A contrastive explanation might be much simpler: In contrast to the responding patient, the non-responding patient has a certain combination of genes that make the drug less effective. The best explanation is the one that highlights the greatest difference between the object of interest and the reference object. What it means for interpretable machine learning: Humans don’t want a complete explanation for a prediction, but want to compare what the differences were to another instance’s prediction (can be an artificial one). Creating contrastive explanations is application-dependent because it requires a point of reference for comparison. And this may depend on the data point to be explained, but also on the user receiving the explanation. A user of a house price prediction website might want to have an explanation of a house price prediction contrastive to their own house or maybe to another house on the website or maybe to an average house in the neighborhood. The solution for the automated creation of contrastive explanations might also involve finding prototypes or archetypes in the data.
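To make the idea concrete, here is a deliberately naive, hypothetical sketch of a contrastive explanation: it compares an instance against a chosen reference (for example, an average house or the “would-be-accepted” application) by swapping one feature at a time and recording how the prediction moves. The `model`, feature names, and reference instance are placeholders; real methods such as counterfactual explanations or Shapley values handle feature interactions more carefully.

```python
import numpy as np

def contrast_with_reference(model, x, x_ref, feature_names):
    """Swap each feature of x with the value from a reference instance x_ref
    and record how the prediction changes. Ignores interactions; only meant
    to illustrate explaining 'x versus a reference' instead of 'x in isolation'."""
    x, x_ref = np.asarray(x, dtype=float), np.asarray(x_ref, dtype=float)
    prediction = model.predict(x.reshape(1, -1))[0]
    reference_prediction = model.predict(x_ref.reshape(1, -1))[0]
    contributions = {}
    for j, name in enumerate(feature_names):
        x_swapped = x.copy()
        x_swapped[j] = x_ref[j]
        # Positive value: this feature pushes x's prediction above the swapped version.
        contributions[name] = prediction - model.predict(x_swapped.reshape(1, -1))[0]
    return prediction - reference_prediction, contributions
```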
Explanations are selected. People don’t expect explanations that cover the actual and complete list of causes of an event. We are used to selecting one or two causes from a variety of possible causes as THE explanation. As proof, turn on the TV news: “The decline in stock prices is blamed on a growing backlash against the company’s product due to problems with the latest software update.” “Tsubasa and his team lost the match because of a weak defense: they gave their opponents too much room to play out their strategy.” “The increasing distrust of established institutions and our government are the main factors that have reduced voter turnout.” For machine learning models, it’s advantageous if a good prediction can be made from different features. Ensemble methods that combine multiple models with different features (different explanations) usually perform well because averaging over those “stories” makes the predictions more robust and accurate. But it also means that there is more than one selective explanation why a certain prediction was made. What it means for interpretable machine learning: Make the explanation short, give only 1 to 3 reasons, even if the world is more complex.
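A small sketch of what such selectivity could look like in practice: given feature attributions from any explanation method (the attribution values below are made up for a house price example), keep only the few largest ones.

```python
def select_top_reasons(attributions, k=3):
    """Return the k features with the largest absolute attribution,
    mirroring the human preference for 1-3 reasons over a complete list."""
    ranked = sorted(attributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]

# Made-up attribution values for a house price prediction:
attributions = {"size": 30000, "balconies": 28000, "renovated": 5000,
                "garage": 1200, "distance_to_school": -800}
print(select_top_reasons(attributions))
```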
Explanations are social. They are part of a conversation or interaction between the explainer and the receiver of the explanation. The social context determines the content and nature of the explanations. If I wanted to explain to a technical person why digital cryptocurrencies are worth so much, I would say things like: “The decentralized, distributed, blockchain-based ledger, which cannot be controlled by a central entity, resonates with people who want to secure their wealth, which explains the high demand and price.” But to my grandmother I would say: “Look, Grandma: Cryptocurrencies are a bit like computer gold. People like and pay a lot for gold, and young people like and pay a lot for computer gold.” What it means for interpretable machine learning: Pay attention to the social environment of your machine learning application and the target audience. Getting the social part of the machine learning model right depends entirely on your specific application. Find experts from the humanities (e.g., psychologists and sociologists) to help you.
Explanations focus on the abnormal. People focus more on abnormal causes to explain events (Kahneman and Tversky 1982). These are causes that had a small probability but nevertheless happened. The elimination of these abnormal causes would have greatly changed the outcome (counterfactual explanation). Humans consider these kinds of “abnormal” causes as good explanations. An example from Štrumbelj and Kononenko (2011) is: Assume we have a dataset of test situations between teachers and students. Students attend a course and pass the course directly after successfully giving a presentation. The teacher has the option to additionally ask the student questions to test their knowledge. Students who cannot answer these questions will fail the course. Students can have different levels of preparation, which translates into different probabilities for correctly answering the teacher’s questions (if the teacher decides to test the student). We want to predict whether a student will pass the course and explain our prediction. The chance of passing is 100% if the teacher does not ask any additional questions; otherwise, the probability of passing depends on the student’s level of preparation and the resulting probability of answering the questions correctly. Scenario 1: The teacher usually asks the students additional questions (e.g., 95 out of 100 times). A student who did not study (10% chance to pass the question part) was not one of the lucky ones and gets additional questions that he fails to answer correctly. Why did the student fail the course? I would say that it was the student’s own fault for not studying. Scenario 2: The teacher rarely asks additional questions (e.g., 2 out of 100 times). For a student who has not studied for the questions, we would predict a high probability of passing the course because questions are unlikely. Of course, one of the students did not prepare for the questions, which gives him a 10% chance of passing the questions. He is unlucky and the teacher asks additional questions that the student cannot answer, and he fails the course. What’s the reason for the failure? I would argue that now the better explanation is “because the teacher tested the student.” It was unlikely that the teacher would test, so the teacher behaved abnormally. What it means for interpretable machine learning: If one of the input features for a prediction was abnormal in any sense (like a rare category of a categorical feature) and the feature influenced the prediction, it should be included in an explanation, even if other ‘normal’ features have the same influence on the prediction as the abnormal one. An abnormal feature in our house price prediction example might be that a rather expensive house has two balconies. Even if some attribution method finds that the two balconies contribute as much to the price difference as the above-average house size, the good neighborhood, or the recent renovation, the abnormal feature “two balconies” might be the best explanation for why the house is so expensive.
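For the record, the pass probabilities implied by the two scenarios follow directly from the numbers in the text; the short check below uses the stated 10% chance of answering the questions correctly.

```python
def p_pass(p_teacher_asks, p_answers_correctly=0.10):
    """Probability of passing: certain pass if no questions are asked,
    otherwise the student must answer the questions correctly."""
    return (1 - p_teacher_asks) + p_teacher_asks * p_answers_correctly

print(p_pass(0.95))  # Scenario 1: ~0.145 -> failing is expected ("the student did not study")
print(p_pass(0.02))  # Scenario 2: ~0.982 -> failing is surprising ("the teacher asked questions")
```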
Explanations are truthful. Good explanations prove to be true in reality (i.e., in other situations). But disturbingly, this is not the most important factor for a “good” explanation. For example, selectiveness seems to be more important than truthfulness. An explanation that selects only one or two possible causes rarely covers the entire list of relevant causes. Selectivity omits part of the truth. It’s not true that a stock market crash, for example, was caused by only one or two factors; the truth is that millions of causes influenced millions of people to act in such a way that in the end a crash was caused. What it means for interpretable machine learning: The explanation should predict the event as truthfully as possible, which in machine learning is sometimes called fidelity. So if we say that a second balcony increases the price of a house, then that should also apply to other houses (or at least to similar houses). For humans, fidelity of an explanation is not as important as its selectivity, its contrast, and its social aspect.
Good explanations are consistent with prior beliefs of the explainee. Humans tend to ignore information that is inconsistent with their prior beliefs. This effect is called confirmation bias (Nickerson 1998). Explanations are not spared by this bias. People will tend to devalue or ignore explanations that do not agree with their beliefs. The set of beliefs varies from person to person, but there are also group-based prior beliefs such as political worldviews. What it means for interpretable machine learning: Good explanations are consistent with prior beliefs. This is difficult to integrate into machine learning and would probably drastically compromise predictive performance. Our prior belief for the effect of house size on predicted price is that the larger the house, the higher the price. Let’s assume that a model also shows a negative effect of house size on the predicted price for a few houses. The model has learned this because it improves predictive performance (due to some complex interactions), but this behavior strongly contradicts our prior beliefs. You can enforce monotonicity constraints (a feature can only affect the prediction in one direction) or use something like a linear model that has this property.
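As a minimal sketch of such a monotonicity constraint (assuming a reasonably recent scikit-learn version), the gradient boosting model below is forced to predict prices that never decrease with house size; the toy data and feature setup are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy housing data: price tends to grow with size and shrink with age.
rng = np.random.default_rng(0)
size = rng.uniform(30, 200, 500)   # living area in square meters
age = rng.uniform(0, 100, 500)     # age of the house in years
price = 3000 * size - 500 * age + rng.normal(0, 20000, 500)
X = np.column_stack([size, age])

# monotonic_cst: 1 = prediction may only increase with this feature,
# -1 = may only decrease, 0 = unconstrained.
model = HistGradientBoostingRegressor(monotonic_cst=[1, -1])
model.fit(X, price)
```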
Good explanations are general and probable. A cause that can explain many events is called “general” and could be considered a good explanation. Note that this contradicts the claim that abnormal causes make good explanations. As I see it, abnormal causes beat general causes. Abnormal causes are by definition rare in the given scenario. In the absence of an abnormal event, a general explanation is considered a good explanation. Also remember that people tend to misjudge probabilities of joint events. (Joe is a librarian. Is he more likely to be a shy person or to be a shy person who likes to read books? The first, of course: a conjunction of two attributes can never be more probable than either attribute alone, even if the richer description feels more plausible.) A good example is “The house is expensive because it is big,” which is a very general, good explanation of why houses are expensive or cheap. What it means for interpretable machine learning: Generality can easily be measured by the feature’s support, which is the number of instances to which the explanation applies divided by the total number of instances.
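Computing support is straightforward; here is a tiny sketch in which the boolean condition and the counts are hypothetical.

```python
import numpy as np

def support(condition_holds):
    """Support of an explanation: the fraction of instances its condition applies to."""
    return np.asarray(condition_holds, dtype=bool).mean()

# Hypothetical: the explanation "the house is expensive because it is big"
# (say, size > 150 square meters) applies to 320 of 1,000 houses.
applies = np.r_[np.ones(320, dtype=bool), np.zeros(680, dtype=bool)]
print(support(applies))  # 0.32
```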