# Chapter 10 Neural Network Interpretation

*This chapter is currently only available in this web version. ebook and print will follow.*

The following chapters focus on interpretation methods for neural networks. The methods visualize features and concepts learned by a neural network, explain individual predictions and simplify neural networks.

Deep learning has been very successful, especially in tasks that involve images and texts such as image classification and language translation.
The success story of deep neural networks began in 2012, when the ImageNet image classification challenge ^{74} was won by a deep learning approach.
Since then, we have witnessed a Cambrian explosion of deep neural network architectures, with a trend towards deeper networks with more and more weight parameters.

To make predictions with a neural network, the data input is passed through many layers of multiplication with the learned weights and through non-linear transformations. A single prediction can involve millions of mathematical operations depending on the architecture of the neural network. There is no chance that we humans can follow the exact mapping from data input to prediction. We would have to consider millions of weights that interact in a complex way to understand a prediction by a neural network. To interpret the behavior and predictions of neural networks, we need specific interpretation methods. The chapters assume that you are familiar with deep learning, including convolutional neural networks.

We can certainly use model-agnostic methods, such as local models or partial dependence plots, but there are two reasons why it makes sense to consider interpretation methods developed specifically for neural networks: First, neural networks learn features and concepts in their hidden layers and we need special tools to uncover them. Second, the gradient can be utilized to implement interpretation methods that are more computationally efficient than model-agnostic methods that look at the model “from the outside”. Also most other methods in this book are intended for the interpretation of models for tabular data. Image and text data require different methods.

The next chapters cover the following techniques that answer different questions:

- Learned Features: What features has the neural network learned?
- Pixel Attribution (Saliency Maps): How did each pixel contribute to a particular prediction?
- Concepts: Which more abstract concepts has the neural network learned?
- Adversarial Examples are closely related to counterfactual explanations: How can we trick the neural network?
- Influential Instances is a more general approach with a fast implementation for gradient-based methods such as neural networks: How influential was a training data point for a certain prediction?

Olga Russakovsky and Jia Deng (equal contribution), Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. “ImageNet large scale visual recognition challenge”. IJCV (2015).↩︎