## 5.5 Feature Importance

A feature’s importance is the increase in the model’s prediction error after we permute the feature’s values, which breaks the relationship between the feature and the true outcome.

### 5.5.1 The Theory

The concept is really straightforward: we measure a feature’s importance by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if permuting its values increases the model error, because the model relied on the feature for the prediction. A feature is “unimportant” if permuting its values leaves the model error unchanged, because the model ignored the feature for the prediction. The permutation feature importance measurement was introduced for random forests by Breiman (2001). Based on this idea, Fisher, Rudin, and Dominici (2018) proposed a model-agnostic version of the feature importance, which they called model reliance. They also introduced more advanced ideas about feature importance, for example a (model-specific) version that accounts for the fact that many prediction models may fit the data well. Their paper is worth a read.

The permutation feature importance algorithm based on Breiman (2001) and Fisher, Rudin, and Dominici (2018):

Input: Trained model $$\hat{f}$$, feature matrix $$X$$, target vector $$Y$$, error measure $$L(Y,\hat{Y})$$

1. Estimate the original model error $$e_{orig}(\hat{f})=L(Y,\hat{f}(X))$$ (e.g. mean squared error)
2. For each feature $$j\in\{1,\ldots,p\}$$ do:
• Generate feature matrix $$X_{perm_j}$$ by permuting feature $$X_j$$ in $$X$$. This breaks the association between $$X_j$$ and $$Y$$.
• Estimate the error $$e_{perm}(\hat{f})=L(Y,\hat{f}(X_{perm_j}))$$ based on the predictions of the permuted data.
• Calculate the permutation feature importance $$FI_j=e_{perm}(\hat{f})/e_{orig}(\hat{f})$$. Alternatively, the difference can be used: $$FI_j=e_{perm}(\hat{f})-e_{orig}(\hat{f})$$.
3. Sort features by descending $$FI$$.
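
The algorithm is easy to translate into code. Below is a minimal sketch in Python (my own illustration, not from the original papers). It assumes the model is exposed as a `predict_fn` that maps a NumPy feature matrix to predictions, and that `loss` has the signature `loss(y_true, y_pred)`:

```python
import numpy as np

def permutation_importance(predict_fn, X, y, loss, ratio=True, rng=None):
    """Permutation feature importance (Breiman 2001; Fisher et al. 2018)."""
    rng = np.random.default_rng(rng)
    e_orig = loss(y, predict_fn(X))  # step 1: original model error
    importances = np.empty(X.shape[1])
    for j in range(X.shape[1]):      # step 2: one permutation per feature
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break X_j <-> Y association
        e_perm = loss(y, predict_fn(X_perm))
        importances[j] = e_perm / e_orig if ratio else e_perm - e_orig
    return importances
```

For step 3, `np.argsort(-importances)` gives the feature indices sorted by descending importance. Note that a single permutation per feature is a noisy estimate; averaging over several permutations stabilizes it.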

In their paper, Fisher, Rudin, and Dominici (2018) propose to split the dataset in half and swap the $$X_j$$ values of the two halves instead of permuting $$X_j$$. If you think about it, this is exactly the same as permuting feature $$X_j$$. If you want a more accurate estimate, you can estimate the error of permuting $$X_j$$ by pairing each instance with the $$X_j$$ value of every other instance (except with itself). This gives you a dataset of size $$n(n-1)$$ to estimate the permutation error, at the cost of considerable computation time. I can only recommend the $$n(n-1)$$-method when you are serious about getting extremely accurate estimates.
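
Here is a sketch of this exact estimate, again my own Python illustration, with squared error chosen for concreteness: each instance is paired with the $$X_j$$ value of every other instance, giving $$n(n-1)$$ prediction errors.

```python
import numpy as np

def exact_permutation_error(predict_fn, X, y, j):
    """Permuted error for feature j, averaged over all n*(n-1) pairs."""
    n = X.shape[0]
    errors = []
    for i in range(n):
        X_rep = np.repeat(X[i:i + 1], n - 1, axis=0)  # n-1 copies of instance i
        X_rep[:, j] = np.delete(X[:, j], i)  # the X_j value of every other instance
        errors.append(np.mean((y[i] - predict_fn(X_rep)) ** 2))
    return np.mean(errors)
```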

### 5.5.2 Should I Compute Importance on Training or Test Data?

tl;dr: I don’t have a definite answer.

Answering the question of training vs. test data touches the fundamental question of what feature importance means. The best way to understand the difference between feature importance based on training vs. test data is an “extreme” example. I trained a support vector machine to predict a continuous, random target outcome given 50 random features (200 instances). By “random” I mean that the target outcome is independent of the 50 features. This is like predicting tomorrow’s temperature given the latest lottery numbers. If the model “learns” any relationships, then it overfits. And in fact, the SVM did overfit on the training data: the mean absolute error for the training data is 0.29 and for the test data 0.82, which is about the same as the error of the best possible model, one that always predicts the mean outcome of 0 (mae of 0.78). In other words, the SVM model is garbage. What values for the feature importance would you expect for the 50 features of this overfitted SVM? Zero, because none of the features contribute to improved performance on unseen test data? Or should the importances reflect how much the model depends on each of the features, regardless of whether the learned relationships generalize to unseen data? Let’s take a look at how the distributions of feature importances for training and test data differ.
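
If you want to reproduce the flavor of this experiment, here is a rough reconstruction (assumed details: scikit-learn’s `SVR` with default settings and a 50/50 train/test split; the exact configuration and seed of the original experiment are not given, so the numbers will differ):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))  # 50 random features
y = rng.normal(size=200)        # target independent of all features

X_train, X_test, y_train, y_test = X[:100], X[100:], y[:100], y[100:]
svm = SVR().fit(X_train, y_train)

print(mean_absolute_error(y_train, svm.predict(X_train)))  # optimistic training error
print(mean_absolute_error(y_test, svm.predict(X_test)))    # close to the mean predictor's error
```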

It’s unclear to me which of the two results is more desirable. So I will try to make a case for both versions and let you decide for yourself.

The Case for Test Data

This is a simple case: Model error estimates based on training data are garbage -> feature importance relies on model error estimates -> feature importance based on training data is garbage.
Really, it’s one of the first things you learn in machine learning: if you measure the model error (or performance) on the same data on which the model was trained, the measurement is usually too optimistic, which means that the model seems to work much better than it does in reality. And since the permutation feature importance relies on measurements of the model error, we should use unseen test data. The feature importance based on training data makes us mistakenly believe that features are important for the predictions, when in reality the model was just overfitting and the features were not important at all.

The Case for Training Data

The arguments for using training data are somewhat more difficult to formulate, but are IMHO just as compelling as the arguments for using test data. We take another look at our garbage SVM. Based on the training data, the most important feature was X13. Let’s look at a partial dependence plot of feature X13. The partial dependence plot shows how the model output changes based on changes of the feature inputs and doesn’t rely on the generalization error. It doesn’t matter whether the PDP is computed with training or test data.

The plot clearly shows that the SVM has learned to rely on feature X13 for its predictions, but according to the feature importance based on the test data (0.97), it’s not important. Based on the training data, the importance is 1.21, reflecting that the model has learned to use this feature. Feature importance based on the training data tells us which features are important for the model in the sense that it depends on them for making predictions.

As part of the case for using training data, I would like to introduce an argument against test data. In practice, you want to use all your data to train the model so that you get the best possible model in the end. This means no unused test data is left to compute the feature importance. You have the same problem when you want to estimate the generalization error of your model. One of the solutions is a (nested) cross-validation scheme. If you used (nested) cross-validation for the feature importance estimation, you would have the problem that the feature importance is not calculated on the final model trained with all the data, but on models trained with subsets of the data that might behave differently.

In the end, you need to decide whether you want to know how much the model relies on each feature for making predictions (-> training data) or how much the feature contributes to the performance of the model on unseen data (-> test data). To the best of my knowledge, there is no research addressing the question of training vs. test data. It will require more thorough examination than my “garbage-SVM” example. We need more research and more experience with these tools to gain a better understanding.

Next, we will look at some examples. I based the importance computation on the training data, because I had to choose one, and using the training data required a few lines less code.

### 5.5.3 Example and Interpretation

We show examples for classification and regression.

Cervical cancer (Classification)

We fit a random forest model to predict cervical cancer. We measure the error increase by $$1-AUC$$ (one minus the area under the ROC curve). Features associated with a model error increase by a factor of 1 (= no change) were not important for predicting cervical cancer.

The feature with the highest importance was associated with an error increase of 6.28 after permutation.
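
With the sketch from section 5.5.1, this computation looks roughly as follows (illustrative only; `clf`, `X`, and `y` are placeholders for the fitted random forest and the cervical cancer data, which are not shown here):

```python
from sklearn.metrics import roc_auc_score

# error measure: 1 - AUC, computed from predicted probabilities
def auc_loss(y_true, scores):
    return 1 - roc_auc_score(y_true, scores)

fi = permutation_importance(lambda X_: clf.predict_proba(X_)[:, 1], X, y, auc_loss)
```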

Bike sharing (Regression)

We fit a support vector machine model to predict the number of rented bikes, given weather conditions and calendar information. As the error measure we use the mean absolute error.
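
In code, only the prediction function and the loss change compared to the classification example (again with placeholder names for the fitted model and the bike data):

```python
from sklearn.metrics import mean_absolute_error

fi = permutation_importance(svm.predict, X_bike, y_bike, mean_absolute_error)
```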

### 5.5.4 Advantages and Disadvantages

• Nice interpretation: Feature importance is the increase of model error when the feature’s information is destroyed.
• Feature importance provides a highly compressed, global insight into the model’s behavior.
• A positive aspect of using the error ratio instead of the error difference is that the feature importance measurements are comparable across different problems.
• The importance measure automatically takes into account all interactions with other features. By permuting the feature you also destroy the interaction effects with other features. This means that the permutation feature importance measure captures both the main effect of the feature and its interaction effects on the model performance. This is also a disadvantage, because the importance of the interaction between two features is included in the importance measures of both features. As a consequence, the feature importances do not add up to the total drop in performance that we would get if we shuffled all the features; the sum is greater than that. Only if there are no interactions between the features, as in a linear model, do the importances roughly add up (see the sketch after this list).
• Permutation feature importance does not require retraining the model. Some other methods suggest deleting a feature, retraining the model and then comparing the model error. Since retraining a machine learning model can take a long time, “only” permuting a feature can save a lot of time.
• Importance methods that retrain the model with a subset of the features seem intuitive at first glance, but the model with the reduced data is meaningless for the feature importance. We are interested in the feature importance of a fixed model, and “fixed” also means that the features of that model are used. Retraining with a reduced dataset creates a different model from the one we are interested in. Let’s say you train a sparse linear model (with LASSO) with a fixed number of features with a non-zero weight. The dataset has 100 features, you set the number of non-zero weights to 5. You analyze the importance of one of the features that got a non-zero weight. You remove the feature and retrain the model. The model performance stays the same, because another equally good feature gets a non-zero weight, and your conclusion would be that the feature was not important. Another example: the model is a decision tree and we analyze the importance of the feature that was chosen as the first split. We remove the feature and retrain the model. Since another feature would be chosen as the first split, the whole tree can be very different, which means that we compare the error rates of (potentially) completely different trees to decide how important that feature is.
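
To make the point about interactions concrete, here is a small constructed example (my own illustration): an oracle “model” that predicts exactly $$x_1\cdot x_2$$, so all of its performance comes from the interaction. The two individual importances double-count the interaction, and their sum exceeds the drop from shuffling both features at once.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 2))
y = X[:, 0] * X[:, 1]                      # the outcome is a pure interaction
predict = lambda X_: X_[:, 0] * X_[:, 1]   # a "model" that has learned it perfectly

mse = lambda a, b: np.mean((a - b) ** 2)
e_orig = mse(y, predict(X))                # 0: perfect fit

def permuted_error(cols):
    X_perm = X.copy()
    for j in cols:
        X_perm[:, j] = rng.permutation(X_perm[:, j])
    return mse(y, predict(X_perm))

fi_1 = permuted_error([0]) - e_orig        # ~2
fi_2 = permuted_error([1]) - e_orig        # ~2
fi_all = permuted_error([0, 1]) - e_orig   # ~2, yet fi_1 + fi_2 is ~4
```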