## 4.3 Decision Tree

Linear regression models and logistic regression fail in situations where the relationship between features and outcome is non-linear or where the features are interacting with each other. Time to shine for the decision trees! Tree-based models split the data according to certain cutoff values in the features multiple times. Splitting means that different subsets of the dataset are created, where each instance belongs to one subset. The final subsets are called terminal or leaf nodes and the intermediate subsets are called internal nodes or split nodes. For predicting the outcome in each leaf node, a simple model is fitted with the instances in this subset (for example the subsets average target outcome). Trees can be used for classification and regression.

There are a lot of tree algorithms with different approaches for how to grow a tree. They differ in the possible structure of the tree (e.g. number of splits per node), criteria for how to find the splits, when to stop splitting and how to estimate the simple models within the leaf nodes. Classification and regression trees (CART) is one of the more popular algorithms for tree induction. We will focus on CART, but the interpretation is similar for most of the tree types. I recommend the book ‘The elements of statistical learning’ (Hastie, Tibshirani, and Friedman 2009)14 for a more detailed introduction.

The following formula describes the relationship between outcome $$y$$ and the features $$x$$.

$\hat{y}_i=\hat{f}(x_i)=\sum_{m=1}^Mc_m{}I\{x_i\in{}R_m\}$

Each instance $$x_i$$ falls into exactly one leaf node (=subset $$R_m$$). $$I_{\{x_i\in{}R_m\}}$$ is the identity function which returns 1 if $$x_i$$ is in the subset $$R_m$$ and else 0. If $$x_i$$ falls into a leaf node $$R_l$$, the predicted outcome is $$\hat{y}=c_l$$, where $$c_l$$ is the mean of all the training instances in leaf node $$R_l$$.

But where do the subsets come from? This is quite simple: The algorithm takes a feature and tries which cut-off point minimises the sum of squares of $$y$$ for a regression tasks or the Gini index of the class distribution of $$y$$ for classification tasks. The best cut-off point makes the two resulting subsets as different as possible in terms of the target outcome. For categorical features the algorithm tries to build subsets by trying different groupings of categories. After this was done for each feature, the algorithm looks for the feature with the best cut-off and chooses it to split the node into two new nodes. The algorithm continues doing this recursively in both of the new nodes until a stopping criterium is reached. Possible criteria are: A minimum number of instances that have to be in a node before the split or the minimum number of instances that have to be in a terminal node.

### 4.3.1 Interpretation

The interpretation is simple: Starting from the root node you go to the next nodes and the edges tell you which subsets you are looking at. Once you reach the leaf node, the node tells you the predicted outcome. All the edges are connected by ‘AND’.

Template: If feature x is [smaller/bigger] than threshold c AND …, then the predicted outcome is $$\hat{y}_{\text{leafnode}}$$.

### 4.3.2 Interpretation Example

Let’s have a look again at the bike rental data. We want to predict the number of bike rentals on a given day. The learned tree visualized:

The first split and one of the second splits was done in the trend feature, which tells how many days passed since beginning of the data collection and covers the trend that the bike rental service became more popular over time. For days that came before the 105th day the predicted number of bike rentals is ca. 1800, between the 106th and 430th day it is around 3900. For days after the 430th day, depending on the temperature, the prediction is either 4600 (if below 12 degrees) or 6600 (if above 12 degrees).

The tree structure is perfectly suited to cover interactions between features in the data. The data also ends up in distinct groups, which are often easier to grasp than points on a hyperplane like in linear regression. The interpretation is arguably pretty straightforward. The tree structure also has a natural visualization, with its nodes and edges. Trees create good explanations as defined here. The tree structure automatically invites to think about predicted values for single instances in a counterfactual way: “If feature $$x_j$$ would have been bigger / smaller than the split point, the prediction would have been $$\hat{y}_{1}$$ instead of $$\hat{y}_2$$” The created explanations are contrastive, because you can always compare the prediction of an instance with relevant (as defined by the tree) “what-if”-scenarios, which are simply the other leaf nodes of the tree. If the tree is short, like one to three splits deep, the resulting explanations are selective. A tree with a depth of three needs a maximum of three features and split points to create the explanation for the prediction of an instance. The truthfulness of the prediction depends on the predictive performance of the tree. The explanations for short trees are very simple and general, because for each split, the instance either falls into one or the other leave., and binary decisions are easy to understand. There is no need to transform features. In linear models it is sometimes necessary to take the logarithm of a feature. A decision tree can handle a feature regardless of monotonic transformations.