## 6.5 Feature Importance

A feature’s importance is the increase in the model’s prediction error after we permuted the feature’s values (breaks the relationship between the feature and the outcome).

### 6.5.1 The Theory

The concept is really straightforward: We measure a feature’s importance by calculating the increase of the model’s prediction error after permuting the feature. A feature is “important” if permuting its values increases the model error, because the model relied on the feature for the prediction. A feature is “unimportant” if permuting its values keeps the model error unchanged, because the model ignored the feature for the prediction. The permutation feature importance measurement was introduced for Random Forests by Breiman (2001)^{29}. Based on this idea, Fisher, Rudin, and Dominici (2018)^{30} proposed a model-agnostic version of the feature importance - they called it model reliance. They also introduce more advanced ideas about feature importance, for example a (model-specific) version that accounts for the fact that many prediction models may fit the data well. Their paper is worth a read.

**The permutation feature importance algorithm based on Breiman (2001) and Fisher, Rudin, and Dominici (2018):**

Input: Trained model \(\hat{f}\), feature matrix \(X\), target vector \(Y\), error measure \(L(Y,\hat{Y})\)

- Estimate the original model error \(e_{orig}(\hat{f})=L(Y,\hat{f}(X))\) (e.g. mean squared error)
- For each feature \(j\in1,\ldots,p\) do
- Generate feature matrix \(X_{perm_{j}}\) by permuting feature \(X_j\) in \(X\). This breaks the association between \(X_j\) and \(Y\).
- Estimate error \(e_{perm}=L(Y,\hat{f}(X_{perm_j}))\) based on the predictions of the permuted data.
- Calculate permutation feature importance \(FI_j=e_{perm}(\hat{f})/e_{orig}(\hat{f})\). Alternatively, the difference can be used: \(FI_j=e_{perm}(\hat{f})-e_{orig}(\hat{f})\)

- Sort variables by descending \(FI\).

In their paper, Fisher, Rudin, and Dominici (2018) propose to split the dataset in half and exchange the \(X_j\) values of the two halves instead of permuting \(X_j\). This is exactly the same as permuting the feature \(X_j\) if you think about it. If you want to have a more accurate estimate, you can estimate the error of permuting \(X_j\) by pairing each instance with the \(X_j\) value of each other instance (except with itself). This gives you a dataset of size \(n(n-1)\) to estimate the permutation error and it takes a big amount of computation time. I can only recommend using the \(n(n-1)\) - method when you are serious about getting extremely accurate estimates.

### 6.5.2 Example and Interpretation

We show examples for classification and regression.

**Cervical cancer (Classification)**

We fit a random forest model to predict cervical cancer. We measure the error increase by: \(1-AUC\) (one minus the area under the ROC curve). Features that are associated model error increase by a factor of 1 (= no change) were not important for predicting cervical cancer.

The feature with the highest importance was associated with an error increase of 7.8 after permutation.

**Bike sharing (Regression)**

We fit a support vector machine model to predict the number of rented bikes, given weather conditions and calendric information. As error measurement we use the mean absolute error.

### 6.5.3 Advantages

- Nice interpretation: Feature importance is the increase of model error when the feature’s information is destroyed.
- Feature importance provides a highly compressed, global insight into the model’s behavior.
- A positive aspect of using the error ratio instead of the error difference is that the feature importance measurements are comparable across different problems.

### 6.5.4 Disadvantages

- The feature importance measure is tied to the error of the model. This is not inherently bad, but in some cases not what you need. In some cases you would prefer to know how much the model’s output varies for one feature, ignoring what it means for the performance. For example: You want to find out how robust your model’s output is, given someone manipulates the features. In this case, you wouldn’t be interested in how much the model performance drops given the permutation of a feature, but rather how much of the model’s output variance is explained by each feature. Model variance (explained by the features) and feature importance correlate strongly when the model generalizes well (i.e. it doesn’t overfit).
- You need access to the actual outcome target. If someone only gives you the model and unlabeled data - but not the actual target - you can’t compute the permutation feature importance.
- When features are correlated, the permutation feature importance measure can be biased by unrealistic data instances. The problem is the same as for partial dependence plots: The permutation of features generates unlikely data instances when two features are correlated. When they are positively correlated (like height and weight of a person) and I shuffle one of the features, then I create new instances that are unlikely or even physically impossible (2m person weighting 30kg for example), yet I use those new instances to measure the importance. In other words, for the permutation feature importance of a correlated feature we consider how much the model performance drops when we exchange the feature with values that we would never observe in reality. Check if the features are strongly correlated and be careful with the interpretation of the feature importance when they are.
- Another tricky thing: Adding a correlated feature can decrease the importance of the associated feature, by splitting up the importance on both features. Let me show you with an example what I mean by “splitting up” feature importance: We want to predict the probability of rain and use the temperature at 8:00 AM of the day before as a feature together with other uncorrelated features. I fit a random forest and it turns out the temperature is the most important feature and all is good and I sleep well the next night. Now imagine another scenario in which I include the temperature at 9:00 AM as a feature, which is of course highly correlated with the temperature at 8:00 AM. The temperature at 9:00 AM doesn’t give me additional information, when I already know the temperature at 8:00 AM. But having more features is always good, right? I fit a random forest with the two temperature features and the uncorrelated features. Some of the trees in the random forest pick up the 8:00 AM temperature, some the 9:00 AM temperature, some both and some none. The two temperature features together have a bit more importance than the single temperature feature before, but instead of being on the top of the list of the important features, each temperature is now somewhere in the middle. By introducing a correlated feature, I kicked the most important feature from the top of the importance ladder to mediocrity. On one hand, that’s okay, because it simply reflects the behaviour of the underlying machine learning model, here the random forest. The 8:00 AM temperature simply has become less important, because the model can now rely on the 9:00 AM measure as well. On the other hand, it makes the interpretation of the feature importances way more difficult. Imagine that measuring the features is expensive and you decide to only include the top 3 most important features in your model. In the first case you would include the temperature, in the second case you would not include any temperature feature, simply because they now share the importance. Even though the importance measure might make sense on the model behaviour level, it’s damn confusing if you have correlated features.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1). Springer: 5–32.↩

Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2018. “Model Class Reliance: Variable Importance Measures for any Machine Learning Model Class, from the ‘Rashomon’ Perspective.” http://arxiv.org/abs/1801.01489.↩