## 5.3 Decision Tree

Linear regression models and logistic regression fail in situations where the relationship between features and outcome is non-linear or where the features are interacting with each other. Time to shine for the decision trees! Tree-based models split the data according to certain cutoff values in the features multiple times. Splitting means that different subsets of the dataset are created, where each instance belongs to one subset. The final subsets are called terminal or leaf nodes and the intermediate subsets are called internal nodes or split nodes. For predicting the outcome in each leaf node, a simple model is fitted with the instances in this subset (for example the subsets average target outcome). Trees can be used for classification and regression.

There are a lot of tree algorithms with different approaches for how to grow a tree. They differ in the possible structure of the tree (e.g. number of splits per node), criteria for how to find the splits, when to stop splitting and how to estimate the simple models within the leaf nodes. Classification and regression trees (CART) is one of the more popular algorithms for tree induction. We will focus on CART, but the interpretation is similar for most of the tree types. I recommend the book ‘The elements of statistical learning’ (Hastie, Tibshirani, and Friedman 2009)^{14}for a more detailed introduction.

The following formula describes the relationship between outcome \(y\) and the features \(x\).

\[\hat{y}_i=\hat{f}(x_i)=\sum_{m=1}^Mc_m{}I\{x_i\in{}R_m\}\]

Each instance \(x_i\) falls into exactly one leaf node (=subset \(R_m\)). \(I_{\{x_i\in{}R_m\}}\) is the identity function which returns 1 if \(x_i\) is in the subset \(R_m\) and else 0. If \(x_i\) falls into a leaf node \(R_l\), the predicted outcome is \(\hat{y}=c_l\), where \(c_l\) is the mean of all the training instances in leaf node \(R_l\).

But where do the subsets come from? This is quite simple: The algorithm takes a feature and tries which cut-off point minimises the sum of squares of \(y\) for a regression tasks or the Gini index of the class distribution of \(y\) for classification tasks. The best cut-off point makes the two resulting subsets as different as possible in terms of the target outcome. For categorical features the algorithm tries to build subsets by trying different groupings of categories. After this was done for each feature, the algorithm looks for the feature with the best cut-off and chooses it to split the node into two new nodes. The algorithm continues doing this recursively in both of the new nodes until a stopping criterium is reached. Possible criteria are: A minimum number of instances that have to be in a node before the split or the minimum number of instances that have to be in a terminal node.

### 5.3.1 Interpretation

The interpretation is simple: Starting from the root node you go to the next nodes and the edges tell you which subsets you are looking at. Once you reach the leaf node, the node tells you the predicted outcome. All the edges are connected by ‘AND’.

Template: If feature x is [smaller/bigger] than threshold c AND …, then the predicted outcome is \(\hat{y}_{\text{leafnode}}\).

### 5.3.2 Interpretation Example

Let’s have a look again at the bike rental data. We want to predict the number of rented bikes on a given day. The learned tree visualized:

The first split and one of the second splits was done in the trend feature, which tells how many days passed since beginning of the data collection and covers the trend that the bike rental service became more popular over time. For days that came before the 105th day the predicted number of bikes is ca. 1800, between the 106th and 430th day it is around 3900. For days after the 430th day, depending on the temperature, the prediction is either 4600 (if below 12 degrees) or 6600 (if above 12 degrees).

### 5.3.3 Advantages

The tree structure is perfectly suited to **cover interactions** between features in the data. The data also ends up in **distinct groups**, which are often easier to grasp than points on a hyperplane like in linear regression. The interpretation is arguably pretty straightforward. The tree structure also has a **natural visualization**, with its nodes and edges. Trees **create good explanations** as defined here. The tree structure automatically invites to think about predicted values for single instances in a counterfactual way: “If feature \(x_j\) would have been bigger / smaller than the split point, the prediction would have been \(\hat{y}_{1}\) instead of \(\hat{y}_2\)” The created explanations are contrastive, because you can always compare the prediction of an instance with relevant (as defined by the tree) “what-if”-scenarios, which are simply the other leaf nodes of the tree. If the tree is short, like one to three splits deep, the resulting explanations are selective. A tree with a depth of three needs a maximum of three features and split points to create the explanation for the prediction of an instance. The truthfulness of the prediction depends on the predictive performance of the tree. The explanations for short trees are very simple and general, because for each split, the instance either falls into one or the other leave., and binary decisions are easy to understand. There is no need to transform features. In linear models it is sometimes necessary to take the logarithm of a feature. A decision tree can handle a feature regardless of monotonic transformations.

### 5.3.4 Disadvantages

**Handling of linear relationships**, that’s what trees suck at. Any linear relationship between an input feature and the outcome has to be approximated by hard splits, which produces a step function. This is not efficient. This goes hand in hand with **lack of smoothness**. Slight changes in the input feature can have a big impact on the predicted outcome, which might not be desirable. Imagine a tree that predicts the value of a house and the tree splits in the square meters multiple times. One of the splits is at 100.5 square meters. Imagine a user of a house price estimator, that uses your decision tree model: She measures her house, concludes that the house has 99 square meters, types it into some nice web interface and get’s a prediction of 200 000 Euro. The user notices that she forgot to measure a small storeroom with 2 square meters. The storeroom has a skewed wall, so she is not sure if she can count it fully towards the whole house area or only half of the space. So she decides to try both 100.0 and 101.0 square meters. The results: 200 000 Euro and 205 000 Euro, which is quite unintuitive, because there was no change from 99 square meters to 100, but from 100 to 101.

Trees are also quite **unstable**, so a few changes in the training dataset might create a completely different tree. That’s because each split depends on the parent split. And if a different feature gets selected as the first split feature, the whole tree structure will change. It does not generate confidence in the model if the structure flips so easily.

Hastie, T, R Tibshirani, and J Friedman. 2009. The elements of statistical learning. http://link.springer.com/content/pdf/10.1007/978-0-387-84858-7.pdf.↩