5.5 Feature Importance

A feature’s importance is the increase in the model’s prediction error after we permuted the feature’s values (breaks the relationship between the feature and the outcome).

5.5.1 The Theory

The concept is really straightforward: We measure a feature’s importance by calculating the increase of the model’s prediction error after permuting the feature. A feature is “important” if permuting its values increases the model error, because the model relied on the feature for the prediction. A feature is “unimportant” if permuting its values keeps the model error unchanged, because the model ignored the feature for the prediction. The permutation feature importance measurement was introduced for Random Forests by Breiman (2001)31. Based on this idea, Fisher, Rudin, and Dominici (2018)32 proposed a model-agnostic version of the feature importance - they called it model reliance. They also introduce more advanced ideas about feature importance, for example a (model-specific) version that accounts for the fact that many prediction models may fit the data well. Their paper is worth a read.

The permutation feature importance algorithm based on Breiman (2001) and Fisher, Rudin, and Dominici (2018):

Input: Trained model \(\hat{f}\), feature matrix \(X\), target vector \(Y\), error measure \(L(Y,\hat{Y})\)

  1. Estimate the original model error \(e_{orig}(\hat{f})=L(Y,\hat{f}(X))\) (e.g. mean squared error)
  2. For each feature \(j\in1,\ldots,p\) do
    • Generate feature matrix \(X_{perm_{j}}\) by permuting feature \(X_j\) in \(X\). This breaks the association between \(X_j\) and \(Y\).
    • Estimate error \(e_{perm}=L(Y,\hat{f}(X_{perm_j}))\) based on the predictions of the permuted data.
    • Calculate permutation feature importance \(FI_j=e_{perm}(\hat{f})/e_{orig}(\hat{f})\). Alternatively, the difference can be used: \(FI_j=e_{perm}(\hat{f})-e_{orig}(\hat{f})\)
  3. Sort variables by descending \(FI\).

In their paper, Fisher, Rudin, and Dominici (2018) propose to split the dataset in half and exchange the \(X_j\) values of the two halves instead of permuting \(X_j\). This is exactly the same as permuting the feature \(X_j\) if you think about it. If you want to have a more accurate estimate, you can estimate the error of permuting \(X_j\) by pairing each instance with the \(X_j\) value of each other instance (except with itself). This gives you a dataset of size \(n(n-1)\) to estimate the permutation error and it takes a big amount of computation time. I can only recommend using the \(n(n-1)\) - method when you are serious about getting extremely accurate estimates.

5.5.2 Should I Compute Importance on Training or Test Data?

tl;dr: I don’t have a definite answer.

Answering the question about training or test data touches the fundamental question of what feature importance means. The best way to understand the difference between feature importance based on training vs. based on test data is an “extreme” example. I trained a support vector machine to predict a continuous, random target outcome given 50 random features (200 instances). By “random” I mean that the target outcome is independent of the 50 features. This is like predicting tomorrow’s temperature given the latest lottery numbers. If the model “learns” any relationships, then it overfits. And in fact, the SVM did overfit on the training data. The mean absolute error for the training data is 0.29 and for the test data 0.82, which is also the error of the best possible model that always predicts the mean outcome of 0 (mae of 0.78). In other words, the SVM model is garbage. What values for the feature importance would you expect for the 50 features of this overfitted SVM? Zero because none of the features contribute to improved performance on unseen test data? Or should the importances reflect how much the model depends on each of the features, regardless whether the learned relationships generalize to unseen data? Let’s take a look at how the distributions of feature importances for training and test data differ.

Distributions of feature importance values by data type. A support vector machine was trained on a regression dataset with 50 features and 200 instances. The features don't contain any information about the true target. The SVM overfits the data, so the feature importance based on the training data shows many important features. Computed on unseen test data, the feature importances are close to zero.

FIGURE 5.27: Distributions of feature importance values by data type. A support vector machine was trained on a regression dataset with 50 features and 200 instances. The features don’t contain any information about the true target. The SVM overfits the data, so the feature importance based on the training data shows many important features. Computed on unseen test data, the feature importances are close to zero.

It’s unclear to me which of the two results are more desirable. So I will try to make a case for both versions and let you decide for yourself.

The Case for Test Data

This is a simple case: Model error estimates based on training data are garbage -> feature importance relies on model error estimates -> feature importance based on training data is garbage.
Really, it’s one of the first things you learn in machine learning: If you measure the model error (or performance) on the same data on which the model was trained, the measurement is usually too optimistic, which means that the model seems to work much better than it does in reality. And since the permutation feature importance relies on measurements of the model error, we should use unseen test data. The feature importance based on training data makes us mistakenly believe that feature are important for the predictions, when in reality the model was just overfitting and the features weren’t important at all.

The Case for Training Data

The arguments for using training data are somewhat more difficult to formulate, but are IMHO just as compelling as the arguments for using test data. We take another look at our garbage SVM. Based on the training data, the most important feature was X13. Let’s look at a partial dependence plot of feature X13. The partial dependence plot shows how the model output changes based on changes of the feature inputs and doesn’t rely on the generalization error. It doesn’t matter whether the PDP is computed with training or test data.

The partial dependence of feature X13, which is the most important feature according to the feature importance based on the training data. The plot shows that the SVM depends on this feature to make predictions

FIGURE 5.28: The partial dependence of feature X13, which is the most important feature according to the feature importance based on the training data. The plot shows that the SVM depends on this feature to make predictions

The plot clearly shows that the SVM has learned to rely on feature X13 for its predictions, but according to the feature importance based on the test data (0.97), it’s not important. Based on the training data, the importance is 1.21, reflecting that the model has learned to use this feature. Feature importance based on the training data tells us which features are important for the model in the sense that it depends on them for making predictions.

As part of the case for using training data, I would like to introduce an argument against test data. In practice, you want to use all your data to train your model to get the best possible model in the end. This means no unused test data is left to compute the feature importance. You have the same problem when you want to estimate the generalization error of your model. One of the solutions is a (nested) cross-validation scheme. If you would use (nested) cross-validation for the feature importance estimation, you would have the problem that the feature importance is not calculated on the final model with all the data, but on models with subsets of the data that might behave differently.

In the end, you need to decide whether you want to know how much the model relies on each feature for making predictions (-> training data) or how much the feature contributes to the performance of the model on unseen data (-> test data). To the best of my knowledge, there is no research addressing the question of training vs. test data. It will require more thorough examination than my “garbage-SVM” example. We need more research and more experience with these tools to gain a better understanding.

Next, we will look at a some examples. I based the importance computation on the training data, because I had to choose one and using the training data needed a few lines less of code.

5.5.3 Example and Interpretation

We show examples for classification and regression.

Cervical cancer (Classification)

We fit a random forest model to predict cervical cancer. We measure the error increase by: \(1-AUC\) (one minus the area under the ROC curve). Features that are associated model error increase by a factor of 1 (= no change) were not important for predicting cervical cancer.

The importance for each of the features in predicting cervical cancer with a random forest. The importance is the factor by which the error is increased compared to the original model error.

FIGURE 5.29: The importance for each of the features in predicting cervical cancer with a random forest. The importance is the factor by which the error is increased compared to the original model error.

The feature with the highest importance was associated with an error increase of 6.28 after permutation.

Bike sharing (Regression)

We fit a support vector machine model to predict the number of rented bikes, given weather conditions and calendric information. As error measurement we use the mean absolute error.

The importance for each of the features in predicting bike counts with a support vector machine.

FIGURE 5.30: The importance for each of the features in predicting bike counts with a support vector machine.

5.5.4 Advantages

  • Nice interpretation: Feature importance is the increase of model error when the feature’s information is destroyed.
  • Feature importance provides a highly compressed, global insight into the model’s behavior.
  • A positive aspect of using the error ratio instead of the error difference is that the feature importance measurements are comparable across different problems.
  • The importance measure automatically takes into account all interactions with other features. By permuting the feature you also destroy the interaction effects with other features. This means that the permutation feature importance measure regards both the feature main effect and the interaction effects on the model performance. This is also a disadvantage because the importance of the interaction between two features will be included in the importance measures of both features. This means that the feature importances don’t add up to the total drop in performance, would we shuffle all the features, but they are greater than that. Only when there is no interaction between the features, like in a linear model, the importances would roughly add up.
  • Permutation feature importance doesn’t require retraining the model like. Some other methods suggest to delete a feature, retrain the model and then compare the model error. Since retraining a machine learning model can take a long time, ‘only’ permuting a feature can safe lots of time.
  • Importance methods that retrain the model with a subset of the features seem intuitive at first glance, but the model with the reduced data is meaningless for the feature importance. We are interested in the feature importance of a fixed model, and when I say fixed, I also mean that the features have to be used. Retraining with a reduced dataset creates a different model from the one we are interested in. Let’s say you train a sparse linear model (with LASSO) with a fixed amount of features with a non-zero weight. The dataset has 100 features, you set the number of non-zero weights to 5. You analyze the importance of one of the features that got a non-zero weight. You remove the feature and retrain the model. The model performance stays the same, because now another equally good feature gets a non-zero weight and your conclusion would be that the feature was not important. Another example: the model is a decision tree and we analyze the importance of the feature that was chose as the first split. We remove the feature and retrain the model. Since another feature will be chosen as the first split, the whole tree can be very different, meaning that we compare the error rates of (potentially) completely different trees to decide how important that feature is.

5.5.5 Disadvantages

  • It’s very unclear whether you should use training or test data for computing the feature importance.
  • The feature importance measure is tied to the error of the model. This is not inherently bad, but in some cases not what you need. In some cases you would prefer to know how much the model’s output varies for one feature, ignoring what it means for the performance. For example: You want to find out how robust your model’s output is, given someone manipulates the features. In this case, you wouldn’t be interested in how much the model performance drops given the permutation of a feature, but rather how much of the model’s output variance is explained by each feature. Model variance (explained by the features) and feature importance correlate strongly when the model generalizes well (i.e. it doesn’t overfit).
  • You need access to the actual outcome target. If someone only gives you the model and unlabeled data - but not the actual target - you can’t compute the permutation feature importance.
  • The permutation feature importance measure depends on shuffling the feature, which adds randomness to the importance measure. When the permutation is repeated, the results might differ. Repeating the permutation and averaging the importance measures over repetitions stabilizes the measure, but increases the time of computation.
  • When features are correlated, the permutation feature importance measure can be biased by unrealistic data instances. The problem is the same as for partial dependence plots: The permutation of features generates unlikely data instances when two features are correlated. When they are positively correlated (like height and weight of a person) and I shuffle one of the features, then I create new instances that are unlikely or even physically impossible (2m person weighting 30kg for example), yet I use those new instances to measure the importance. In other words, for the permutation feature importance of a correlated feature we consider how much the model performance drops when we exchange the feature with values that we would never observe in reality. Check if the features are strongly correlated and be careful with the interpretation of the feature importance when they are.
  • Another tricky thing: Adding a correlated feature can decrease the importance of the associated feature, by splitting up the importance of both features. Let me explain with an example what I mean by “splitting up” the feature importance: We want to predict the probability of rain and use the temperature at 8:00 AM of the day before as a feature together with other uncorrelated features. I fit a random forest and it turns out the temperature is the most important feature and all is good and I sleep well the next night. Now imagine another scenario in which I additionally include the temperature at 9:00 AM as a feature, which is highly correlated with the temperature at 8:00 AM. The temperature at 9:00 AM doesn’t give me much additional information, when I already know the temperature at 8:00 AM. But having more features is always good, right? I fit a random forest with the two temperature features and the uncorrelated features. Some of the trees in the random forest pick up the 8:00 AM temperature, some the 9:00 AM temperature, some both and some none. The two temperature features together have a bit more importance than the single temperature feature before, but instead of being on the top of the list of the important features, each temperature is now somewhere in the middle. By introducing a correlated feature, I kicked the most important feature from the top of the importance ladder to mediocrity. On one hand, that’s okay, because it simply reflects the behaviour of the underlying machine learning model, here the random forest. The 8:00 AM temperature simply has become less important, because the model can now rely on the 9:00 AM measure as well. On the other hand, it makes the interpretation of the feature importances way more difficult. Imagine that you want to check the features for measurement errors. The check is expensive and you decide to only check the top 3 most important features. In the first case you would check the temperature, in the second case you would not include any temperature feature, simply because they now share the importance. Even though the importance measure might make sense on the model behaviour level, it’s damn confusing if you have correlated features.

  1. Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1). Springer: 5–32.

  2. Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2018. “Model Class Reliance: Variable Importance Measures for any Machine Learning Model Class, from the ‘Rashomon’ Perspective.” http://arxiv.org/abs/1801.01489.