Chapter 2 Interpretability

There is no mathematical definition of interpretability. A (non-mathematical) definition I like by Miller (2017)3 is: Interpretability is the degree to which a human can understand the cause of a decision. Another one is: Interpretability is the degree to which a human can consistently predict the model’s result 4. The higher the interpretability of a machine learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made. A model is better interpretable than another model if its decisions are easier for a human to comprehend than decisions from the other model. I will use both the terms interpretable and explainable interchangeably. Like Miller (2017), I think it makes sense to distinguish between the terms interpretability/explainability and explanation. I will use “explanation” for explanations of individual predictions. See the section about explanations to learn what we humans see as a good explanation.


  1. Miller, Tim. 2017. “Explanation in Artificial Intelligence: Insights from the Social Sciences.” arXiv Preprint arXiv:1706.07269.

  2. Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. “Examples are not enough, Learn to Criticize! Criticism for Interpretability.” Advances in Neural Information Processing Systems. 2016.