## 4.4 Decision Tree

Linear regression models and logistic regression fail in situations where the relationship between features and outcome is non-linear or where the features are interacting with each other. Time to shine for the decision trees! Tree-based models split the data according to certain cutoff values in the features multiple times. Splitting means that different subsets of the dataset are created, where each instance belongs to one subset. The final subsets are called terminal or leaf nodes and the intermediate subsets are called internal nodes or split nodes. For predicting the outcome in each leaf node, a simple model is fitted with the instances in this subset (for example the subsets average target outcome). Trees can be used for classification and regression.

There are a lot of tree algorithms with different approaches for how to grow a tree. They differ in the possible structure of the tree (e.g. number of splits per node), criteria for how to find the splits, when to stop splitting and how to estimate the simple models within the leaf nodes. Classification and regression trees (CART) is one of the more popular algorithms for tree induction. We will focus on CART, but the interpretation is similar for most of the tree types. I recommend the book ‘The elements of statistical learning’ (Hastie, Tibshirani, and Friedman 2009)^{16}for a more detailed introduction.

The following formula describes the relationship between outcome \(y\) and the features \(x\).

\[\hat{y}_i=\hat{f}(x_i)=\sum_{m=1}^Mc_m{}I\{x_i\in{}R_m\}\]

Each instance \(x_i\) falls into exactly one leaf node (=subset \(R_m\)). \(I_{\{x_i\in{}R_m\}}\) is the identity function which returns 1 if \(x_i\) is in the subset \(R_m\) and else 0. If \(x_i\) falls into a leaf node \(R_l\), the predicted outcome is \(\hat{y}=c_l\), where \(c_l\) is the mean of all the training instances in leaf node \(R_l\).

But where do the subsets come from? This is quite simple: The algorithm takes a feature and tries which cut-off point minimises the sum of squares of \(y\) for a regression tasks or the Gini index of the class distribution of \(y\) for classification tasks. The best cut-off point makes the two resulting subsets as different as possible in terms of the target outcome. For categorical features the algorithm tries to build subsets by trying different groupings of categories. After this was done for each feature, the algorithm looks for the feature with the best cut-off and chooses it to split the node into two new nodes. The algorithm continues doing this recursively in both of the new nodes until a stopping criterium is reached. Possible criteria are: A minimum number of instances that have to be in a node before the split or the minimum number of instances that have to be in a terminal node.

### 4.4.1 Interpretation

The interpretation is simple: Starting from the root node you go to the next nodes and the edges tell you which subsets you are looking at. Once you reach the leaf node, the node tells you the predicted outcome. All the edges are connected by ‘AND’.

Template: If feature x is [smaller/bigger] than threshold c AND …, then the predicted outcome is \(\hat{y}_{\text{leafnode}}\).

**Feature Importance**

The overall importance of a feature in a decision tree can be computed by going through all the splits for which the feature was used and adding up how much it has improved the predictions in the child nodes compared to the parent node (e.g. measured as decrease of Gini index). The sum of all importances is scaled to 100, so that the interpretation for each feature importance is percentage of the overall importance.

**Tree Decomposition**

Individual predictions of a decision tree can be explained by decomposing the decision path and breaking it down per feature. Similar to linear regression models, we can look at the contribution of each feature value to the prediction of an individual data instance, starting from the average prediction.

The root node in a decision tree is our starting point. If we were to use the root node to make predictions, it would predict the mean of the outcome of the training data. With the next split, we either subtract or add some term to this average, depening on which child node we go to. We can track a decision through the tree and explain a prediction by the contributions added at each decision node.

\[\hat{f}(x^{(i)})=\bar{y}+\sum_{d=1}^D\text{split contrib(d,x)}=\bar{y}+\sum_{j=1}^p\text{feat contrib(j,x)}\] The prediction of an individual instance is the mean of the target outcome plus the sum of all contributions of the splits happening between the root node and the terminal node the instance ends up in. But we are not interested in the split contributions, but the feature contributions, and a feature might be used for more than one split or for none at all. We can add up the contributions for each of the p features and get an interpretation of how much each feature contributed to a prediction.

### 4.4.2 Example

Let’s have a look again at the bike rental data. We want to predict the number of rented bikes on a given day. The learned tree visualized:

The first split and one of the second splits was done in the trend feature, which tells how many days passed since beginning of the data collection and covers the trend that the bike rental service became more popular over time. For days that came before the 105th day the predicted number of bikes is ca. 1800, between the 106th and 430th day it is around 3900. For days after the 430th day, depending on the temperature, the prediction is either 4600 (if below 12 degrees) or 6600 (if above 12 degrees).

The importance of the features is computed as percentages of purity improvement.

While the visualized tree shows that both temperature and time trend are used for splits, the feature importance shows that the time trend is far more important than temperature.

### 4.4.3 Advantages

The tree structure is perfectly suited to **cover interactions** between features in the data. The data also ends up in **distinct groups**, which are often easier to grasp than points on a hyperplane like in linear regression. The interpretation is arguably pretty straightforward. The tree structure also has a **natural visualization**, with its nodes and edges. Trees **create good explanations** as defined here. The tree structure automatically invites to think about predicted values for single instances in a counterfactual way: “If feature \(x_j\) would have been bigger / smaller than the split point, the prediction would have been \(\hat{y}_{1}\) instead of \(\hat{y}_2\)” The created explanations are contrastive, because you can always compare the prediction of an instance with relevant (as defined by the tree) “what-if”-scenarios, which are simply the other leaf nodes of the tree. If the tree is short, like one to three splits deep, the resulting explanations are selective. A tree with a depth of three needs a maximum of three features and split points to create the explanation for the prediction of an instance. The truthfulness of the prediction depends on the predictive performance of the tree. The explanations for short trees are very simple and general, because for each split, the instance either falls into one or the other leave., and binary decisions are easy to understand. There is no need to transform features. In linear models it is sometimes necessary to take the logarithm of a feature. A decision tree can handle a feature regardless of monotonic transformations.

### 4.4.4 Disadvantages

**Handling of linear relationships**, that’s what trees suck at. Any linear relationship between an input feature and the outcome has to be approximated by hard splits, which produces a step function. This is not efficient. This goes hand in hand with **lack of smoothness**. Slight changes in the input feature can have a big impact on the predicted outcome, which might not be desirable. Imagine a tree that predicts the value of a house and the tree splits in the square meters multiple times. One of the splits is at 100.5 square meters. Imagine a user of a house price estimator, that uses your decision tree model: She measures her house, concludes that the house has 99 square meters, types it into some nice web interface and get’s a prediction of 200 000 Euro. The user notices that she forgot to measure a small storeroom with 2 square meters. The storeroom has a skewed wall, so she is not sure if she can count it fully towards the whole house area or only half of the space. So she decides to try both 100.0 and 101.0 square meters. The results: 200 000 Euro and 205 000 Euro, which is quite unintuitive, because there was no change from 99 square meters to 100, but from 100 to 101.

Trees are also quite **unstable**, so a few changes in the training dataset might create a completely different tree. That’s because each split depends on the parent split. And if a different feature gets selected as the first split feature, the whole tree structure will change. It does not generate confidence in the model if the structure flips so easily.

Decision trees are very interpretable - as long as they are short. **The number of terminal nodes increases quickly with increasing depth.** The more terminal nodes, the more difficult it becomes to understand the decision rules of a tree. A depth of 1 means 2 terminal nodes. Depth of 2 means max. 4 nodes. Depth of 3 means max. 8 nodes. The maximum number of terminal nodes in a tree is 2 to the power of its depth.

Hastie, T, R Tibshirani, and J Friedman. 2009. The elements of statistical learning. http://link.springer.com/content/pdf/10.1007/978-0-387-84858-7.pdf.↩