3.4 Evaluating Interpretability

There is no real consensus about what interpretability in machine learning is, nor is it clear how to measure it. However, there is some initial research on the topic and attempts to formulate approaches for its evaluation, as described in the following section.

3.4.1 Approaches for Evaluating the Interpretability Quality

Doshi-Velez and Kim (2017) propose three major levels when evaluating interpretability:

  • Application level evaluation (real task): Put the explanation into the product and let the end user test it. For example, on an application level, radiologists would test fracture detection software (which includes a machine learning component to suggest where fractures might be in an x-ray image) directly in order to evaluate the model. This requires a good experimental setup and an understanding of how to assess quality. A good baseline is always how well a human would explain the same decision.
  • Human level evaluation (simple task) is a simplified application level evaluation. The difference is that these experiments are conducted not with the domain experts, but with laypersons. This makes the experiments less expensive (especially when the domain experts are radiologists) and it is easier to recruit enough test subjects. An example would be to show a user different explanations and let the user choose the best one.
  • Function level evaluation (proxy task) does not require any humans. This works best when the class of model used has already been evaluated by someone else in a human level evaluation. For example, it might be known that the end users understand decision trees. In this case, a proxy for explanation quality might be the depth of the tree: shorter trees would get a better explainability score. It would make sense to add the constraint that the predictive performance of the tree remains good and does not drop too much compared to a larger tree.

More on Function Level Evaluation

Model size is an easy way to measure explanation quality, but it is too simplistic. For example, a sparse model with features that are themselves not interpretable is still not a good explanation.
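The tree-depth proxy mentioned above can be sketched in code: search for the shallowest decision tree whose predictive performance stays within a tolerance of an unconstrained tree. The dataset and the 5% tolerance are illustrative assumptions, not prescribed by the text.

```python
# Sketch of a function-level evaluation: tree depth as a proxy for
# explanation quality, constrained so that predictive performance does
# not drop too much. Dataset and tolerance are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Accuracy of an unconstrained (deep) tree serves as the reference.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
reference = full_tree.score(X_test, y_test)

# Find the shallowest tree whose accuracy stays within 5% of the reference.
for depth in range(1, full_tree.get_depth() + 1):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    if tree.score(X_test, y_test) >= reference - 0.05:
        print(f"depth {depth}: accuracy {tree.score(X_test, y_test):.3f} "
              f"(reference {reference:.3f})")
        break
```

Under this proxy, the depth at which the loop stops is the explainability score: the smaller, the better.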

There are more dimensions to interpretability:

  • Model sparsity: How many features are being used by the explanation?
  • Monotonicity: Is there a monotonicity constraint? Monotonicity means that a feature has a monotonic relationship with the target. If the feature increases, the target either always increases or always decreases, but never switches between increasing and decreasing.
  • Uncertainty: Is a measurement of uncertainty part of the explanation?
  • Interactions: Is the explanation able to include interactions of features?
  • Cognitive processing time: How long does it take to understand the explanation?
  • Feature complexity: Which features were used in the explanation? PCA components are harder to understand than word occurrences, for example.
  • Description length: How long is the written description of the explanation?
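The monotonicity dimension from the list above can be checked empirically: sweep one feature over a grid while holding the other inputs fixed, and verify that the model's predictions never switch between increasing and decreasing. A minimal sketch, using a hypothetical toy model purely for illustration:

```python
# Empirical monotonicity check: predictions along a one-feature sweep
# must be entirely non-decreasing or entirely non-increasing.

def is_monotonic(values):
    """True if the sequence never switches between increasing and decreasing."""
    non_decreasing = all(a <= b for a, b in zip(values, values[1:]))
    non_increasing = all(a >= b for a, b in zip(values, values[1:]))
    return non_decreasing or non_increasing

def sweep(predict, baseline, feature_index, grid):
    """Predictions over a grid for one feature, other features held at baseline."""
    out = []
    for v in grid:
        x = list(baseline)
        x[feature_index] = v
        out.append(predict(x))
    return out

# Toy model (an assumption for this sketch): linear in feature 0,
# quadratic (hence non-monotonic) in feature 1.
predict = lambda x: 2.0 * x[0] - (x[1] - 3.0) ** 2
grid = [i / 10 for i in range(0, 61)]
print(is_monotonic(sweep(predict, [0.0, 0.0], 0, grid)))
print(is_monotonic(sweep(predict, [0.0, 0.0], 1, grid)))
```

The first check passes and the second fails, flagging that an explanation relying on feature 1 cannot claim a monotonic relationship with the target.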