## 5.4 Feature Importance

A feature’s importance is the increase in the model’s prediction error after we permute the feature’s values, which breaks the relationship between the feature and the outcome.

### 5.4.1 The Theory

The concept is really straightforward: We measure a feature’s importance by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if permuting its values increases the model error, because the model relied on the feature for the prediction. A feature is “unimportant” if permuting its values leaves the model error unchanged, because the model ignored the feature for the prediction. The permutation feature importance measurement was introduced for random forests by Breiman (2001)29. Based on this idea, Fisher, Rudin, and Dominici (2018)30 proposed a model-agnostic version of the feature importance and called it model reliance. They also introduced more advanced ideas about feature importance, for example a (model-specific) version that accounts for the fact that many prediction models may fit the data well. Their paper is worth a read.

The algorithm:

Input: Trained model $$\hat{f}$$, feature matrix $$X$$, target vector $$Y$$, error measure $$L(Y,\hat{Y})$$

1. Estimate the original model error $$e_{orig}(\hat{f})=L(Y,\hat{f}(X))$$ (e.g. mean squared error)
2. For each feature $$j\in1,\ldots,p$$ do
• Generate feature matrix $$X_{perm_{j}}$$ by permuting feature $$X_j$$ in $$X$$. This breaks the association between $$X_j$$ and $$Y$$.
• Estimate error $$e_{perm}(\hat{f})=L(Y,\hat{f}(X_{perm_j}))$$ based on the predictions of the permuted data.
• Calculate permutation feature importance $$FI_j=e_{perm}(\hat{f})/e_{orig}(\hat{f})$$. Alternatively, the difference can be used: $$FI_j=e_{perm}(\hat{f})-e_{orig}(\hat{f})$$
3. Sort features by descending $$FI$$.
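
To make the steps concrete, here is a minimal sketch in Python with NumPy. It assumes a fitted model exposing a `predict` method and a loss callable of the form `L(y_true, y_pred)`; the function name, the `n_repeats` averaging, and the choice of the ratio version are illustrative additions, not part of the algorithm as stated above (scikit-learn also ships its own `permutation_importance` in `sklearn.inspection`).

```python
import numpy as np

def permutation_importance(model, X, y, loss, n_repeats=5, rng=None):
    """Permutation feature importance as the ratio e_perm / e_orig.

    model : fitted object with a predict(X) method (assumption)
    X     : array of shape (n_samples, n_features)
    y     : target vector
    loss  : callable L(y_true, y_pred), e.g. mean squared error
    """
    rng = np.random.default_rng(rng)
    e_orig = loss(y, model.predict(X))                    # step 1: original error
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):                           # step 2: loop over features
        permuted_errors = []
        for _ in range(n_repeats):                        # average several permutations
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the X_j <-> y association
            permuted_errors.append(loss(y, model.predict(X_perm)))
        importances[j] = np.mean(permuted_errors) / e_orig  # ratio version of FI_j
    order = np.argsort(importances)[::-1]                 # step 3: sort descending
    return importances, order
```

Averaging over several permutations per feature is optional, but it reduces the variance that a single random shuffle would otherwise introduce into the estimate.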

In their paper, Fisher, Rudin, and Dominici (2018) propose to split the dataset in half and exchange the $$X_j$$ values of the two halves instead of permuting $$X_j$$. This is exactly the same as permuting the feature $$X_j$$, if you think about it. If you want a more accurate estimate, you can estimate the error of permuting $$X_j$$ by pairing each instance with the $$X_j$$ value of each other instance (except with itself). This gives you a dataset of size $$n(n-1)$$ to estimate the permutation error, and it takes a large amount of computation time. I can only recommend using the $$n(n-1)$$-method when you are serious about getting extremely accurate estimates.
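A sketch of this exhaustive $$n(n-1)$$ estimate, under the same interface assumptions as the block above; the cyclic-shift trick for enumerating all pairings is an illustration of one possible implementation, not notation from the paper:

```python
import numpy as np

def pairwise_permutation_error(model, X, y, loss, j):
    """Permuted error for feature j over all n(n-1) pairings: each instance
    is combined with the X_j value of every other instance. The n-1 cyclic
    shifts of column j cover each ordered pairing exactly once."""
    n = X.shape[0]
    errors = []
    for k in range(1, n):                     # k = 0 would pair rows with themselves
        X_shift = X.copy()
        X_shift[:, j] = np.roll(X[:, j], k)   # row i receives the X_j value of row i-k
        errors.append(loss(y, model.predict(X_shift)))
    return float(np.mean(errors))
```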

### 5.4.2 Example and Interpretation

We show examples for classification and regression.

**Cervical cancer (Classification)**

We fit a random forest model to predict cervical cancer. We measure the error increase as $$1-AUC$$ (one minus the area under the ROC curve). Features associated with a model error increase by a factor of 1 (= no change) were not important for predicting cervical cancer.

The feature with the highest importance was associated with an error increase of 7.8 after permutation.
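To plug this error measure into the sketch above, a $$1-AUC$$ loss could look like the following; note that AUC needs predicted scores or probabilities (e.g. from `predict_proba`) rather than hard class labels, which is an assumption about the model interface:

```python
from sklearn.metrics import roc_auc_score

def one_minus_auc(y_true, y_score):
    # error measure from the example: 1 minus the area under the ROC curve
    return 1.0 - roc_auc_score(y_true, y_score)
```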

**Bike rentals (Regression)**

We fit a support vector machine model to predict the number of rented bikes, given weather conditions and calendar information. As the error measure we use the mean absolute error.
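
A usage sketch for the regression case, reusing the `permutation_importance` helper defined above; the data and model here are stand-ins (scikit-learn’s `SVR` on synthetic data), not the actual bike rental data or the SVM configuration used in the example:

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR

# Stand-in data and model; the real bike rental data is not reproduced here.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
model = SVR().fit(X, y)

fi, order = permutation_importance(model, X, y, loss=mean_absolute_error)
for j in order:
    print(f"feature {j}: FI = {fi[j]:.2f}")
```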