2.1 The Importance of Interpretability

If a machine learning model performs well, why not just trust the model and ignore why it made a certain decision? “The problem is that a single metric, such as classification accuracy, is an incomplete description of most real-world tasks.” (Doshi-Velez and Kim 2017)

Let’s dive deeper into the reasons why interpretability is so important. In predictive modelling, you have to make a trade-off: Do you simply want to know what is predicted, for example the probability that a client will churn or how effective some medication will be for a patient? Or do you want to know why the prediction was made, possibly paying for the interpretability with a drop in accuracy? In some cases you don’t care why a decision was made; the assurance that the predictive performance was good on a test dataset is enough. But in other cases, knowing the ‘why’ can help you understand more about the problem, the data and why a model might fail. Some models might not need explanations because they are used in a low-risk environment, meaning a mistake has no severe consequences (e.g. a movie recommender system), or because the method has already been extensively studied and evaluated (e.g. optical character recognition). The necessity for interpretability comes from an incompleteness in the problem formalisation (Doshi-Velez and Kim 2017), meaning that for certain problems or tasks it is not enough to get the answer (the what). The model also has to explain how it came to the answer (the why), because a correct prediction only partially solves your original problem. The following reasons drive the demand for interpretability and explanations (Doshi-Velez and Kim 2017; Miller 2017).

Human curiosity and learning: Humans have a mental model of their environment, which gets updated when something unexpected happens. This update is done by finding an explanation for the unexpected event. For example, a human feels unexpectedly sick and asks himself: “Why do I feel so sick?” He learns that he becomes sick every time he eats those red berries. He updates his mental model and decides that the berries caused the sickness and therefore should be avoided.

Curiosity and learning are important for any machine learning model used in a research context, where scientific findings stay completely hidden when the model only gives predictions without explanations. To facilitate learning and satisfy curiosity about why machines produce certain predictions or behaviours, interpretability and explanations are crucial. Of course, humans don’t need an explanation for everything that happens. Most people are okay with not understanding how a computer works. The emphasis of this point is more on unexpected events that make us curious, like: Why is my computer shutting down unexpectedly?

Closely related to learning is the human desire to find meaning in the world. We want to reconcile contradictions or inconsistencies between elements of our knowledge structures. “Why did my dog bite me, even though it has never done so before?” a human might ask himself. There is a contradiction between the knowledge about the dog’s past behaviour and the newly made, unpleasant experience of the bite. The vet’s explanation reconciles the dog owner’s contradiction: “The dog was under stress and bit. Dogs are animals and this can happen.” The more a machine’s decision affects a human’s life, the more important it will be for the machine to explain its behaviour. When a machine learning model rejects a loan application, this could be quite unexpected for the applicant. He can only reconcile this inconsistency between expectation and reality with some form of explanation. The explanations don’t actually have to fully explain the situation, but should address a main cause.

Another example is algorithmic product recommendation. Personally, I always reflect on why certain products or movies have been recommended to me algorithmically. Often it is quite clear: The advertising is following me on the Internet because I have bought a washing machine recently, and I know that I will be followed by washing machine advertisements for the next few days. Yes, it makes sense to suggest gloves when I already have a winter hat in my shopping basket. The algorithm recommended this movie because users who liked other movies that I also liked enjoyed the recommended movie. Increasingly, Internet companies are adding explanations to their recommendations. A good example is the Amazon product recommendation based on frequently bought product combinations:

FIGURE 2.1: Recommended products when buying some paint from [Amazon](https://www.amazon.com/Colore-Acrylic-Paint-Set-12/dp/B014UMGA5W/). Visited on December 5th 2012.

There is a shift in many scientific disciplines from qualitative to quantitative methods (e.g. sociology, psychology), and also towards machine learning (biology, genomics). The goal of science is to gain knowledge, but many problems can only be solved with big datasets and black box machine learning models. The model itself becomes a source of knowledge, instead of the data. Interpretability makes it possible to tap into this additional knowledge captured by the model.

Machine learning models are taking over real-world tasks that demand safety measures and testing. Imagine a self-driving car that automatically detects cyclists, which is as desired. You want to be 100% sure that the abstraction the system learned is fail-safe, because running over cyclists is quite bad. An explanation might reveal that the most important feature learned is to recognise the two wheels of a bike, and this explanation helps you to think about edge cases like bikes with side bags that partially cover the wheels.

By default, most machine learning models pick up biases from the training data. This can turn your machine learning models into racists that discriminate against protected groups. Interpretability is a useful debugging tool to detect bias in machine learning models. It might happen that the machine learning model you trained for automatically approving or rejecting loan applications discriminates against some minority. Your main goal is to give out loans to people who will eventually pay them back. In this case, the incompleteness in the problem formulation lies in the fact that you not only want to minimise loan defaults, but are also required not to discriminate based on certain demographics. This additional constraint is part of your problem formulation (handing out loans in a low-risk and compliant way), but it is not captured by the loss function that the machine learning model optimises.
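As a rough illustration of such a bias check, the sketch below compares approval rates across groups and reports the largest gap (often called a demographic parity gap). The decisions, the group labels and the function names `approval_rates` and `parity_gap` are hypothetical and not taken from the text. A large gap is only a signal that the model deserves a closer look, not proof of discrimination.

```python
from collections import defaultdict


def approval_rates(decisions, groups):
    """Return the share of approved applications for each group."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for decision, group in zip(decisions, groups):
        total[group] += 1
        approved[group] += int(decision)
    return {group: approved[group] / total[group] for group in total}


def parity_gap(decisions, groups):
    """Difference between the highest and the lowest approval rate."""
    rates = approval_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs: 1 = loan approved, 0 = loan rejected.
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
# Hypothetical demographic group of each applicant.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = approval_rates(decisions, groups)
gap = parity_gap(decisions, groups)
print(rates, gap)
```

A check like this looks only at the model’s outputs. To find out why the gap exists, you still need an explanation of which features drive the decisions.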

The process of integrating machines and algorithms into our daily lives demands interpretability to increase social acceptance. People attribute beliefs, desires, intentions and so on to objects. In a famous experiment, Heider and Simmel (1944) showed participants videos of shapes in which a circle opened a door to enter a “room” (which was simply a rectangle). The participants described the actions of the shapes as they would describe the actions of a human agent, attributing intentions and even emotions and personality traits to the shapes. Robots are a good example, like my vacuum cleaner, which I named ‘Doge’. When Doge gets stuck, I think: “Doge wants to continue cleaning, but asks me for help because it got stuck.” Later, when Doge finishes cleaning and searches for the home base to recharge, I think: “Doge has the desire to recharge and intends to find the home base.” I also attribute personality traits: “Doge is a bit dumb, but in a cute way.” These are my thoughts, especially when I find out that Doge knocked over some plant while dutifully cleaning the house. A machine or algorithm that explains its predictions will receive more acceptance. See also the chapter about explanations, which argues that explanations are a social process.

Explanations are used to manage social interactions. By creating a shared meaning of something, the explainer influences the actions, emotions and beliefs of the receiver of the explanation. For a machine to interact with us, it might need to shape our emotions and beliefs. Machines have to “persuade” us, so that we believe that they can achieve their intended goal. I would not completely accept my robot vacuum cleaner if it did not explain its behaviour to some degree. The vacuum cleaner creates a shared meaning of, for example, an “accident” (like getting stuck on the bathroom carpet … again) by explaining that it got stuck, instead of simply stopping work without comment. Interestingly, there can be a misalignment between the goal of the explaining machine, which is to generate trust, and the goal of the recipient, which is to understand the prediction or behaviour. Maybe the correct explanation for why Doge got stuck is that the battery was very low, that one of the wheels is not working properly and, on top of that, that a bug causes the robot to retry the same spot over and over again even though there is an obstacle in the way. These reasons (and some more) caused the robot to get stuck, but it only explained that there was something in the way. This was enough for me to trust its behaviour and to get a shared meaning of that accident, which I can share with my girlfriend (“By the way, Doge got stuck again in the bathroom, we have to remove the carpets before we let it clean”). The example of the robot getting stuck on the carpet might not even require an explanation, because I can explain it to myself by observing that Doge can’t move on this carpet mess. But there are other situations that are less obvious, like a full dirt bag.

FIGURE 2.2: Doge, my vacuum cleaner got stuck. As an explanation for the accident, Doge told me that it needs to be on a flat surface.

Only with interpretability can machine learning algorithms be debugged and audited. So even in low-risk environments, like movie recommendation, interpretability is valuable in the research and development stage as well as after deployment, because things can go wrong later, when a model is used in a product. Having an interpretation for a faulty prediction helps to understand the cause of the fault. It delivers a direction for how to fix the system. Consider an example of a husky versus wolf classifier that misclassifies some huskies as wolves. Using interpretable machine learning methods, you would find out that the misclassification happened due to the snow in the image. The classifier learned to use snow as a feature for classifying images as wolves, which might make sense for separating the classes in the training dataset, but not in real-world use.
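To make this kind of debugging concrete, here is a minimal sketch using permutation feature importance: shuffle one feature column and measure how much the accuracy drops. The text does not name a specific method, so this is only one simple stand-in. The classifier `snow_classifier`, the tabular features `snow` and `ear_size` and all data values are hypothetical. A real image classifier would need an image-specific explanation method.

```python
import random


def snow_classifier(row):
    """Hypothetical flawed model: predicts 'wolf' whenever snow is present."""
    return "wolf" if row["snow"] == 1 else "husky"


def accuracy(model, rows, labels):
    """Share of rows for which the model predicts the true label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(model, rows, labels, feature, repeats=30, seed=0):
    """Mean drop in accuracy after randomly shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        shuffled = [dict(row, **{feature: value}) for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / repeats


# Hypothetical data: wolves photographed in snow, huskies mostly without snow.
rows = [
    {"snow": 1, "ear_size": 3},
    {"snow": 1, "ear_size": 2},
    {"snow": 1, "ear_size": 3},
    {"snow": 1, "ear_size": 2},
    {"snow": 0, "ear_size": 2},
    {"snow": 0, "ear_size": 3},
    {"snow": 0, "ear_size": 2},
    {"snow": 1, "ear_size": 3},  # a husky in the snow, misclassified as a wolf
]
labels = ["wolf", "wolf", "wolf", "wolf", "husky", "husky", "husky", "husky"]

baseline = accuracy(snow_classifier, rows, labels)
importances = {
    feature: permutation_importance(snow_classifier, rows, labels, feature)
    for feature in ("snow", "ear_size")
}
print(baseline, importances)
```

In this toy setup, shuffling `ear_size` never changes a prediction, so its importance is zero, while shuffling `snow` lowers the accuracy. That pattern is the hint that the model relies on the background rather than on the animal.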

If you can ensure that the machine learning model can explain decisions, the following traits can also be checked more easily (Doshi-Velez and Kim 2017):

  • Fairness: Making sure the predictions are unbiased and do not discriminate against protected groups (implicitly or explicitly). An interpretable model can tell you why it decided that a certain person is not creditworthy, and it becomes easier for a human to judge whether the decision was based on a learned demographic (e.g. racial) bias.
  • Privacy: Ensuring that sensitive information in the data is protected.
  • Reliability or Robustness: Test that small changes in the input don’t lead to big changes in the prediction.
  • Causality: Checking that only causal relationships are picked up, meaning that a predicted change in a decision due to arbitrary changes in the input values also happens in reality.
  • Trust: It is easier for humans to trust a system that explains its decisions compared to a black box.
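The robustness trait in the list above can be probed with a simple perturbation test. The sketch below nudges each input feature by a small amount and records the largest change in the prediction. The two models, the input point and the step size are hypothetical and chosen only to contrast a stable model with a brittle one.

```python
def max_prediction_change(predict, x, epsilon):
    """Largest absolute change in the prediction when a single feature
    is moved by +epsilon or -epsilon."""
    base = predict(x)
    worst = 0.0
    for i in range(len(x)):
        for delta in (-epsilon, epsilon):
            perturbed = list(x)
            perturbed[i] += delta
            worst = max(worst, abs(predict(perturbed) - base))
    return worst


def smooth_model(x):
    """Hypothetical linear model: small input changes give small output changes."""
    return 0.5 * x[0] + 0.25 * x[1]


def brittle_model(x):
    """Hypothetical threshold model: the output jumps at x[0] == 1.0."""
    return 1.0 if x[0] > 1.0 else 0.0


x = [1.0, 2.0]
smooth_change = max_prediction_change(smooth_model, x, epsilon=0.01)
brittle_change = max_prediction_change(brittle_model, x, epsilon=0.01)
print(smooth_change, brittle_change)
```

A check like this only probes the neighbourhood of one input point. A fuller audit would repeat it over many points and over perturbations of several features at once.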

When we don’t need interpretability.

The following scenarios illustrate when we don’t need or even don’t want interpretability for machine learning models.

Interpretability is not required if the model has no significant impact. Imagine someone named Mike working on a machine learning side project to predict where his friends will go for their next vacation based on Facebook data. Mike just likes to surprise his friends with educated guesses about where they’re going on vacation. There is no real problem if the model is wrong (just a little embarrassment for Mike), and it is not problematic if Mike can’t explain the output of his model. It’s perfectly fine not to have any interpretability. The situation would change if Mike started building a company around these vacation destination predictions. If the model is wrong, the company will lose money, or the model could refuse services to people based on some learned racial bias. As soon as the model has a significant impact, either financially or socially, interpretability becomes relevant.

Interpretability is not required when the problem is well-studied. Some applications have been sufficiently well-studied so that there is enough practical experience with the model and problems with the model have been solved over time. A good example is a machine learning model for optical character recognition that processes images of envelopes and extracts the addresses. There are years of experience in using these systems and it is clear that they work. Also we are not really interested in gaining additional insights about the task at hand.

Interpretability might enable gaming the system. Problems with users fooling a system result from a mismatch between the objectives of the creator and the user of a model. Credit scoring is one such system: banks want to give loans only to applicants who are likely to repay them, and applicants want to get the loan even if the bank has a different opinion. This mismatch between objectives gives applicants an incentive to game the system to increase their chances of getting a loan. If an applicant knows that having more than two credit cards affects the score negatively, he simply returns his third credit card to improve his score and gets a new card after the loan has been approved. While his score improved, the true probability of repaying the loan remained the same. The system can only be gamed if the inputs are proxies for another feature, but are not the cause of the outcome. Whenever possible, proxy features should be avoided, as they are often the reason for defective models. For example, Google developed a system called Google Flu Trends for predicting flu outbreaks that correlates Google searches with flu outbreaks - and it performed rather poorly. The distribution of searches changed and Google Flu Trends missed many flu outbreaks. Google searches are not known to cause the flu; searches for symptoms like “fever” are merely correlated with actual outbreaks. Ideally, models would only use causal features, because then they are not gameable and a lot of issues with biases would not occur.
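The credit card example can be written down as a toy sketch. Both functions below are hypothetical: `credit_score` stands in for a scoring rule that uses the number of credit cards as a proxy, and `true_repayment_probability` stands in for an assumed ground truth that depends only on the debt-to-income ratio. The numbers are invented for illustration.

```python
def credit_score(applicant):
    """Hypothetical scoring rule that penalises more than two credit cards."""
    score = 600
    if applicant["credit_cards"] > 2:
        score -= 50
    return score


def true_repayment_probability(applicant):
    """Assumed ground truth: depends only on the debt-to-income ratio."""
    ratio = applicant["debt"] / applicant["income"]
    return max(0.0, 1.0 - ratio)


# Hypothetical applicant before and after returning the third credit card.
before = {"credit_cards": 3, "debt": 20000, "income": 50000}
after = dict(before, credit_cards=2)

print(credit_score(before), credit_score(after))
print(true_repayment_probability(before), true_repayment_probability(after))
```

The score goes up after the card is returned, while the assumed repayment probability stays the same. If the score were computed from the causal inputs instead of the proxy, this manipulation would have no effect.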


  1. Doshi-Velez, Finale, and Been Kim. 2017. “Towards A Rigorous Science of Interpretable Machine Learning.” arXiv preprint, 1–13. http://arxiv.org/abs/1702.08608.

  2. Heider, Fritz, and Marianne Simmel. 1944. “An Experimental Study of Apparent Behavior.” The American Journal of Psychology 57 (2). JSTOR: 243–59.