2.2 Criteria for Interpretability Methods

Methods for machine learning interpretability can be classified according to different criteria:

  • Intrinsic or post hoc? Intrinsic interpretability means selecting and training a machine learning model that is considered to be intrinsically interpretable (for example short decision trees). Post hoc interpretability means selecting and training a black box model (for example a neural network) and applying interpretability methods after the training (for example measuring the feature importance). The “intrinsic or post hoc”-criterion determined the layout of the chapters in the book: The two main chapters are the intrinsically interpretable models chapter and the post hoc (and model-agnostic) interpretability methods chapter.
  • Outcome of the interpretability method: The different interpretability methods can be roughly differentiated according to their outcomes:
    • Feature summary statistic: Many interpretability methods provide a kind of summary statistic of how each feature affects the model predictions. These can be feature importance measures or statistics about the interaction strength between features.
    • Feature summary visualization: Most feature summary statistics can be visualized. However, some feature summaries can only be visualized and not meaningfully be placed in a table. The partial dependence of a feature is such a case. For non-linear relationships, partial dependence plots are arbitrary curves showing a feature and the average predicted outcome.
    • Model internals (e.g. learned weights): The interpretation of intrinsically interpretable models falls under this category. Examples are the weights in linear models or the learned tree structure (which features and feature values are used for the splits?) of decision trees. The lines are blurred between model internals and feature summary statistic in, for example, linear models, because the weights are both model internals and at the same time summary statistics for the features. Another method that outputs model internals is the visualization of feature detectors that are learned in convolutional neural networks. Interpretability methods that output model internals are model-specific by definition (see next point).
    • Data point: This category includes all methods that return data points (can be existing or newly created) to make a model interpretable. Counterfactuals, for example: To explain the prediction of a data point, find a similar data point by changing some of the features for which the predicted outcome changes in a relevant way (like a flip in the predicted class). Another example is the identification of prototypes of predicted classes. Interpretability methods that output new data points require that the data points themselves can be interpreted. This works well for images and text, but is less useful for tabular data with hundreds of features.
    • Intrinsically interpretable model: This is a little circular, but one solution to interpreting black box models is to approximate them (either globally or locally) with an interpretable model. The interpretable model themselves are interpreted by internal model parameter or feature summary statistics.
  • Model-specific or model-agnostic?: Model-specific interpretation tools are limited to specific model classes. The interpretation of regression weights in a linear model is a model-specific interpretation, since - by definition - the interpretation of intrinsically interpretable models is always model-specific. Any tool that only works for e.g. interpreting neural networks is model-specific. Model-agnostic tools can be used on any machine learning model and are usually post hoc. These agnostic methods usually operate by analysing feature input and output pairs. By definition, these methods can’t have access to any model internals like weights or structural information.
  • Local or global?: Does the interpretation method explain a single prediction or the entire model behavior? Or is the scope somewhere in between? Read more about the scope criterion in the next section.