## 4.1 Linear Regression Model

A linear regression model predicts the target as a weighted sum of the feature inputs. The linearity of the learned relationship makes the interpretation easy. Linear regression models have been used for a long time by statisticians, computer scientists, and other people tackling quantitative problems.

Linear models can be used to model the dependency of a regression target \(y\) on \(p\) features \(x\). The learned relationships are linear and, for a singular instance \(i\), can be written as:

\[y_{i}=\beta_{0}+\beta_{1}x_{i1}+\ldots+\beta_{p}x_{ip}+\epsilon_{i}\]

The i-th instance’s outcome is a weighted sum of its \(p\) features. The \(\beta_{j}\) represent the learned feature weights or coefficients. The \(\epsilon_{i}\) is the error we are still making, i.e. the difference between the predicted outcome and the actual outcome. These errors are assumed to follow a Gaussian distribution, which means we make errors in both negative and positive directions and make many small errors and few large errors.

Different methods can be used to estimate the optimal weight vector \(\hat{\boldsymbol{\beta}}\). The ordinary least squares method is commonly used to find the weights that minimise the squared difference between the actual and the estimated outcome:

\[\hat{\boldsymbol{\beta}}=\arg\!\min_{\beta_0,\ldots,\beta_p}\sum_{i=1}^n\left(y_i-\left(\beta_0+\sum_{j=1}^p\beta_jx_{ij}\right)\right)^{2}\]

We won’t go into detail about how the optimal weights can be found, but if you are interested you can read Chapter 3.2 of the book “Elements of Statistical Learning” (Hastie, Tibshirani, and Friedman 2009)^{15} or one of the other zillions of sources about linear regression models.

The biggest advantage of linear regression models is their linearity: It makes the estimation procedure straightforward and, most importantly, these linear equations have an easy to understand interpretation on a modular level (i.e. the weights). That is one of the main reasons why the linear model and all similar models are so widespread in academic fields like medicine, sociology, psychology, and many more quantitative research fields. In these areas it is important to not only predict, e.g., the clinical outcome of a patient, but also to quantify the influence of the medication while at the same time accounting for things like gender, age, and other features in an interpretable manner. Estimates of coefficients or weights also come with confidence intervals. A confidence interval is a range for the weight estimate that covers the ‘true’ parameter with a certain confidence. For example, a 95% confidence interval for a weight of 2 could range from 1 to 3 and its interpretation would be: If we repeated the estimation 100 times with newly sampled data, the confidence interval would cover the true parameter in 95 out of 100 cases.

Linear regression models also come with some assumptions that make them easy to use and interpret but which are often not satisfied in reality. The assumptions are: Linearity, normality, homoscedasticity, independence, fixed features, and absence of multicollinearity.

**Linearity**: Linear regression models force the estimated response to be a linear combination of the features, which is both their greatest strength and biggest limitation. Linearity leads to interpretable models: linear effects are simple to quantify and describe (see also next chapter) and are additive, so it is easy to separate the effects. If you suspect interactions of features or a non-linear association of a feature with the target value, then you can add interaction terms and use techniques like regression splines to estimate non-linear effects.**Normality**: The target outcome given the features are assumed to follow a normal distribution. If this assumption is violated, then the estimated confidence intervals of the feature weights are not valid. Consequently, any interpretation of the features p-values is not valid.**Homoscedasticity**(constant variance): The variance of the error terms \(\epsilon_{i}\) is assumed to be constant over the whole feature space. Let’s say you want to predict the value of a house given the living area in square meters. You estimate a linear model, which assumes that no matter how big the house, the error terms around the predicted response have the same variance. This assumption is often violated in reality. In the house example it is plausible that the variance of error terms around the predicted price is higher for bigger houses, since also the prices are higher and there is more room for prices to vary.**Independence**: Each instance is assumed to be independent from the next one. If you have repeated measurements, like multiple records per patient, the data points are not independent from each other and there are special linear model classes to deal with these cases, like mixed effect models or GEEs.**Fixed features**: The input features are seen as ‘fixed’, carrying no errors or variation, which, of course, is very unrealistic and only makes sense in controlled experimental settings. But not assuming fixed features would mean that you have to fit very complex measurement error models that account for the measurement errors of your input features. And usually you don’t want to do that.**Absence of multicollinearity**: Basically you don’t want features to be highly correlated, because this messes up the estimation of the weights. In a situation where two features are highly correlated (something like correlation > 0.9) it will become problematic to estimate the weights, since the feature effects are additive and it becomes indeterminable to which of the correlated features to attribute the effects.

### 4.1.1 Interpretation

The interpretation of a weight in the linear model depends on the type of the corresponding feature:

- Numerical feature: For an increase of the numerical feature \(x_{j}\) by one unit, the estimated outcome changes by \(\beta_{j}\). An example of a numerical feature is the size of a house.
- Binary feature: A feature, that for each instance takes on one of two possible values. An example is the feature “House comes with a garden”. One of the values counts as the reference level (in some programming languages coded with 0), like “No garden”. A change of the feature \(x_{j}\) from the reference level to the other level changes the estimated outcome by \(\beta_{j}\).
- Categorical feature with multiple levels: A feature with a fixed amount of possible values. An example is the feature “Flooring type”, with possible levels “carpet”, “laminate” and “parquet”. One solution to deal with many levels is to one-hot-encode them, meaning each level gets its own binary column. From a categorical feature with \(l\) levels, you only need \(l-1\) columns, otherwise the coding is overparameterised. The interpretation for each level is then according to the binary features. Some languages, like R, allow you to code categorical features in different ways, described here.
- Intercept \(\beta_{0}\): The intercept is the feature weight for the constant feature, which is always 1 for all instances. Most software packages automatically add this feature for estimating the intercept. The interpretation is: Given all numerical features are zero and the categorical features are at the reference level, the estimated outcome of \(y_{i}\) is \(\beta_{0}\). The interpretation of \(\beta_{0}\) is usually not relevant, because instances with all features at zero often don’t make any sense, unless the features were standardised (mean of zero, standard deviation of one), where the intercept \(\beta_0\) reflects the predicted outcome of an instance where all features are at their mean.

The interpretation of the features in the linear model can be automated by using following text templates.

**Interpretation of a Numerical Feature**

An increase of \(x_{k}\) by one unit increases the expectation for \(y\) by \(\beta_k\) units, given all other features stay the same.

**Interpretation of a Categorical Feature**

A change from \(x_{k}\)’s reference level to the other category increases the expectation for \(y\) by \(\beta_{k}\), given all other features stay the same.

Another important measurement for interpreting linear models is the \(R^2\) measurement. \(R^2\) tells you how much of the total variance of your target outcome is explained by the model. The higher \(R^2\) the better your model explains the data. The formula to calculate \(R^2\) is: \(R^2=1-SSE/SST\), where SSE is the squared sum of the error terms:

\[SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2\]

and SST is the squared sum of the data variance:

\[SST=\sum_{i=1}^n(y_i-\bar{y})^2\]

The SSE tells you how much variance remains after fitting the linear model, which is measured by looking at the squared differences between the predicted and actual target values. SST is the total variance of the target around the mean. So \(R^2\) tells you how much of your variance can be explained by the linear model. \(R^2\) ranges between 0 for models that explain nothing and 1 for models that explain all of the variance in your data.

There is a catch, because \(R^2\) increases with the number of features in the model, even if they carry no information about the target value at all. So it is better to use the adjusted R-squared (\(\bar{R}^2\)), which accounts for the number of features used in the model. Its calculation is

\[\bar{R}^2=R^2-(1-R^2)\frac{p}{n-p-1}\]

where \(p\) is the number of features and \(n\) the number of instances.

It isn’t helpful to do interpretation on a model with very low \(R^2\) or \(\bar{R}^2\), because basically the model is not explaining much of the variance, so any interpretation of the weights are not meaningful.

**Feature Importance**

The importance of a feature in a linear regression model can be measured by the absolute value of its t-statistic. The t-statistic is the estimated weight scaled with it’s standard error.

\[t_{\hat{\beta}}=\frac{\hat{\beta}}{SE(\hat{\beta})}\]

The importance of a feature increases as its weight increases. This makes sense. The more variance the estimated weight has (= the less certain we are about the correct value), the less important the feature is. This also makes sense.

### 4.1.2 Example

In this example we use the linear model to predict the number of rented bikes on a day, given weather and calendrical information. For the interpretation we examine the estimated regression weights. The features are a mix of numerical and categorical features. The table shows for each feature the estimated weight, the standard error of the estimation and the absolute value of the t-statistic (feature importance).

Weight estimate | Std. Error | abs(t) | |
---|---|---|---|

(Intercept) | 2399.4 | 238.3 | 10.1 |

seasonSUMMER | 899.3 | 122.3 | 7.4 |

seasonFALL | 138.2 | 161.7 | 0.9 |

seasonWINTER | 425.6 | 110.8 | 3.8 |

holidayHOLIDAY | -686.1 | 203.3 | 3.4 |

workingdayWORKING DAY | 124.9 | 73.3 | 1.7 |

weathersitMISTY | -379.4 | 87.6 | 4.3 |

weathersitRAIN/SNOW/STORM | -1901.5 | 223.6 | 8.5 |

temp | 110.7 | 7.0 | 15.7 |

hum | -17.4 | 3.2 | 5.5 |

windspeed | -42.5 | 6.9 | 6.2 |

days_since_2011 | 4.9 | 0.2 | 28.5 |

Interpretation of a numerical feature (‘Temperature’): An increase of the temperature by 1 degree Celsius increases the expected number of bikes by 110.7, given all other features stay the same.

Interpretation of a categorical feature (‘weathersituation’)): The estimated number of bikes is -1901.5 lower when it is rainy, snowing or stormy, compared to good weather, given that all other features stay the same. Also if the weather was misty, the expected number of bikes was -379.4 lower, compared to good weather, given all other features stay the same.

As you can see in the interpretation examples, the interpretations always come with the footnote that ‘all other features stay the same’. That’s because of the nature of linear models: The target is a linear combination of the weighted features. The estimated linear equation spans a hyperplane in the feature/target space (a simple line in the case of a single feature). The \(\beta\) (weight) values specify the slope (gradient) of the hyperplane in each direction. The good side is that it isolates the interpretation. If you think of the features as knobs that you can turn up or down, it is nice to see what happens when you would just turn the knob for one feature. On the bad side of things, the interpretation ignores the joint distribution with other features. Increasing one feature, but not changing others, might create unrealistic, or at least unlikely, data points.

### 4.1.3 Visual Interpretation

Different visualisations make the linear model outcomes easy and quick to grasp for humans.

#### 4.1.3.1 Weight Plot

The information of the weights table (weight estimates and variance) can be visualised in a weight plot (showing the results from the linear model fitted before):

The weight plot makes clear that rainy/snowy/stormy weather has a strong negative effect on the expected number of bikes. The working day feature’s weight is close to zero and the zero is included in the 95% interval, meaning it is not influencing the prediction significantly. Some confidence intervals are very short and the estimates are close to zero, yet the features were important. Temperature is such a candidate. The problem about the weight plot is that the features are measured on different scales. While for weather situation feature the estimated \(\beta\) signifies the difference between good and rainy/storm/snowy weather, for temperature it signifies only an increase of 1 degree Celsius. You can improve the comparison by scaling the features to zero mean and unit standard deviation before fitting the linear model, to make the estimated weights comparable.

#### 4.1.3.2 Effect Plot

The weights of the linear model can be analysed more meaningfully when multiplied by the actual feature values. The weights depend on the scale of the features and will be different if you have a feature measuring some height and you switch from meters to centimetres. The weight will change, but the actual relationships in your data will not. It is also important to know the distribution of your feature in the data, because if you have a very low variance, it means that almost all instances will get a similar contribution from this feature. The effect plot can help to understand how much the combination of a weight and a feature contributes to the predictions in your data. Start with the computation of the effects, which is the weight per feature times the feature of an instance: \(\text{effect}_{i,j}=w_{j}x_{i,j}\). The resulting effects are visualised with boxplots: A box in a boxplot contains the effect range for half of your data (25% to 75% effect quantiles). The vertical line in the box is the median effect, i.e. 50% of the instances have a lower and the other half a higher effect on the prediction than the median value. The horizontal lines extend to\(\pm1.58\text{IQR}/\sqrt{n}\), with IQR being the inter quartile range (\(q_{0.75}-q_{0.25}\)). The points are outliers. The categorical feature effects can be aggregated into one boxplot, compared to the weight plot, where each weight gets a row.

The largest contributions to the expected number of rented bikes comes from temperature and from the days feature, which captures the trend that the bike rental became more popular over time. The temperature has a broad contribution distribution. The day trend feature goes from zero to large positive contribution, because the first day in the dataset (01.01.2011) gets a very low day effect, and the estimated weight with this feature is positive (4.93), so the effect increases with every day and is highest for the latest day in the dataset (31.12.2012). Note that for effects from a feature with a negative weight, the instances with a positive effect are the ones that have a negative feature value, so days with a high negative effect of windspeed on the bike count have the highest windspeeds.

### 4.1.4 Explaining Individual Predictions

How much did each feature of an instance contribute towards the prediction? This can, again, be answered by bringing together the weights and feature values of this instance and computing the effects. An interpretation of instance specific effects is only meaningful in comparison with the distribution of each feature’s effects. We want to explain the prediction of the linear model for the 6-th instance from the bicycle dataset. The instance has the following feature values.

Feature | Value |
---|---|

season | SPRING |

yr | 2011 |

mnth | JAN |

holiday | NO HOLIDAY |

weekday | THU |

workingday | WORKING DAY |

weathersit | GOOD |

temp | 1.604356 |

hum | 51.8261 |

windspeed | 6.000868 |

cnt | 1606 |

days_since_2011 | 5 |

To obtain the feature effects of this instance, we have to multiply its features values by the corresponding weights from the linear regression model. For example, for the value WORKING DAY of feature workingday, the effect is 124.9209381. For the value 1.604356 of feature temp, the effect is 177.6175815. We insert these individual effects as points in the effect plot, which shows us the distribution of effects in the data. This allows us to compare the individual effects with the distribution of effects in the data.

When we average the predictions for the training data instances, we get a mean value of 4504 (). In comparison, the prediction of the 6-th instance is small with only 1571 predicted bicycles. The effect plot reveals the reason. The boxplots show the distributions of the effects for all instances of the dataset, the red markers show the effects for the 6-th instance. The 6-th instance has a low temperature effect because on this day the temperature was 2 degrees, which is low compared to most other days (and remember that the weight of the temperature feature is positive). Also, the effect of the trend feature “days_since_2011” is small compared to the other data instances because this instance is from early 2011 (5 days) and the trend feature also has a positive weight.

### 4.1.5 Coding Categorical Features

There are several ways to encode a categorical feature and the choice influences the interpretation of the \(\beta\)-weights.

The standard in linear regression models is the treatment coding, which is sufficient in most cases. Using different codings boils down to creating different matrices (=design matrix) from your one column with the categorical feature. This section presents three different codings, but there are many more. The example used has six instances and one categorical feature with 3 levels. For the first two instances, the feature takes on category A, for instances three and four category B and for the last two instances category C.

**Treatment coding**:

The \(\beta\) per level is the estimated difference in \(y\) compared to the reference level. The intercept of the linear model is the mean of the reference group (given all other features stay the same). The first column of the design matrix is the intercept, which is always 1. Column two is an indicator whether instance \(i\) is in category B, column three is an indicator for category C. There is no need for a column for category A, because then the linear equation would be overspecified and no unique solution (= unique \(\beta\)’s) can be found. Knowing that an instance is neither in category B or C is enough.

Feature matrix: \[\begin{pmatrix}1&0&0\\1&0&0\\1&1&0\\1&1&0\\1&0&1\\1&0&1\\\end{pmatrix}\]

**Effect coding**:

The \(\beta\) per level is the estimated \(y\)-difference from the level to the overall mean (again, given all other features are zero or the reference level). The first column is again used to estimate the intercept. The weight \(\beta_{0}\) which is associated with the intercept represents the overall mean and \(\beta_{1}\), the weight for column two is the difference between the overall mean and category B. The overall effect of category B is \(\beta_{0}+\beta_{1}\). The interpretation for category C is equivalent. For the reference category A, \(-(\beta_{1}+\beta_{2})\) is the difference to the overall mean and \(\beta_{0}-(\beta_{1}+\beta_{2})\) the overall effect.

Feature matrix: \[\begin{pmatrix}1&-1&-1\\1&-1&-1\\1&1&0\\1&1&0\\1&0&1\\1&0&1\\\end{pmatrix}\]

**Dummy coding**:

The \(\beta\) per level is the estimated mean of \(y\) for each level (given all feature are at value zero or reference level). Note that the intercept was dropped here, so that a unique solution for the linear model weights can be found.

Feature matrix: \[\begin{pmatrix}1&0&0\\1&0&0\\0&1&0\\0&1&0\\0&0&1\\0&0&1\\\end{pmatrix}\]

If you want to dive a bit deeper into different encodings of categorical features, checkout this webpage and this blog post.

### 4.1.6 Do Linear Models Create Good Explanations?

Judging by the attributes that constitute a good explanation as presented in the Human-friendly Explanations chapter, linear models don’t create the best explanations. They are contrastive, but the reference instance is a data point for which all continuous features are zero and the categorical features at their reference levels. This is usually an artificial, meaningless instance, which is unlikely to occur in your dataset. There is an exception: When all continuous features are mean centered (feature minus mean of feature) and all categorical features are effect coded, the reference instance is the data point for which each feature is at its mean. This might also be a non-existent data point, but it might be at least more likely or meaningful. In this case, the \(\beta\)-values times the feature values (feature effects) explain the contribution to the predicted outcome contrastive to the reference mean-instance. Another aspect of a good explanation is selectivity, which can be achieved in linear models by using less features or by fitting sparse linear models. But by default, linear models don’t create selective explanations. Linear models create truthful explanations, as long as the linear equation can model the relationship between features and outcome. The more non-linearities and interactions exist, the less accurate the linear model becomes and the less truthful explanations it will produce. The linearity makes the explanations more general and simple. The linear nature of the model, I believe, is the main factor why people like linear models for explaining relationships.

### 4.1.7 Sparse Linear Models

The examples for the linear models that I chose look all nice and tidy, right? But in reality you might not have just a handful of features, but hundreds or thousands. And your normal linear models? Interpretability goes downriver. You might even get into a situation with more features than instances and you can’t fit a standard linear model at all. The good news is that there are ways to introduce sparsity (= only keeping a few features) into linear models.

#### 4.1.7.1 Lasso

The most automatic and convenient way to introduce sparsity is to use the Lasso method. Lasso stands for “least absolute shrinkage and selection operator” and when added to a linear model, it performs feature selection and regularisation of the selected feature weights. Let’s review the minimization problem, the \(\beta\)s optimise:

\[min_{\boldsymbol{\beta}}\left(\frac{1}{n}\sum_{i=1}^n(y_i-x_i^T\boldsymbol{\beta})^2\right)\]

Lasso adds a term to this optimisation problem:

\[min_{\boldsymbol{\beta}}\left(\frac{1}{n}\sum_{i=1}^n(y_i-x_i^T\boldsymbol{\beta})^2+\lambda||\boldsymbol{\beta}||_1\right)\]

The term \(||\boldsymbol{\beta}||_1\), the L1-norm of the feature vector, leads to a penalisation of large \(\boldsymbol{\beta}\)-values. Since the L1-norm is used, many of the weights for the features will get an estimate of 0 and the others are shrunk. The parameter \(\lambda\) controls the strength of the regularising effect and is usually tuned by doing cross-validation. Especially when \(\lambda\) is large, many weights become 0.

The feature weights as a function of the penalty term lambda can be visualized as a curve per feature weight as shown in the following figure.

What value should we choose for \(\lambda\)? If you see the regularization parameter as a tuning parameter, then you can use cross-validation to find the optimal \(\lambda\) that minimizes the model error. You can also consider \(\lambda\) as a parameter to adjust model interpretability. The higher the parameter, the fewer features are in the model (because their weight is zero) and the better the model can be interpreted.

**Example with LASSO**

We will predict bike rentals with LASSO. We determine the number of features we want to have in the L1 regularized model.

Let’s first set the number to 2 features:

Weight | |
---|---|

seasonSPRING | 0.00 |

seasonSUMMER | 0.00 |

seasonFALL | 0.00 |

seasonWINTER | 0.00 |

holidayHOLIDAY | 0.00 |

workingdayWORKING DAY | 0.00 |

weathersitMISTY | 0.00 |

weathersitRAIN/SNOW/STORM | 0.00 |

temp | 52.33 |

hum | 0.00 |

windspeed | 0.00 |

days_since_2011 | 2.15 |

The first two features with non-zero weights in the lasso path are temperature (“temp”) and the time trend (“days_since_2011”).

Now, let’s select 5 features:

Weight | |
---|---|

seasonSPRING | -389.99 |

seasonSUMMER | 0.00 |

seasonFALL | 0.00 |

seasonWINTER | 0.00 |

holidayHOLIDAY | 0.00 |

workingdayWORKING DAY | 0.00 |

weathersitMISTY | 0.00 |

weathersitRAIN/SNOW/STORM | -862.27 |

temp | 85.58 |

hum | -3.04 |

windspeed | 0.00 |

days_since_2011 | 3.82 |

Note that the weights for “temp” and “days_since_2011” differ from the model with two features. The reason for this is that by decreasing \(\lambda\) even features that are already “in” the model are less penalized and possibly get a larger absolute weight. The interpretation of the LASSO weights corresponds to the interpretation of weights in the linear regression model. You only have to be careful whether the features are standardized or not, because this affects the weights. In this example, the features were standardized by the software, but the weights were automatically re-transformed to match the original feature scales.

**Other Methods for Sparsity in Linear Models**

A big spectrum of methods can be used to reduce the number of features in a linear model.

Methods that include a pre-processing step:

- Hand selected features: You can always use expert knowledge to choose and discard some features. The big drawback is, that it can’t be automated and you might not be an expert.
- Use some measures to pre-select features: An example is the correlation coefficient. You only take features into account that exceed some chosen threshold of correlation between the feature and the target. Disadvantage is that it only looks at the features one at a time. Some features might only show correlation after the linear model has accounted for some other features. Those you will miss with this approach.

Step-wise procedures:

- Forward selection: Fit the linear model with one feature. Do that with each feature. Choose the model that works best (for example decided by the highest R squared). Now again, for the remaining features, fit different versions of your model by adding each feature to your chosen model. Pick the one that performs best. Continue until some criterion is reached, like the maximum number of features in the model.
- Backward selection: Same as forward selection, but instead of adding features, start with the model that includes all features and try out which feature you have to remove to get the highest performance increase. Repeat until some stopping criterion is reached.

I recommend using Lasso, because it can be automated, looks at all features at the same time and can be controlled via \(\lambda\). It also works for the logistic regression model for classification.

### 4.1.8 Advantages

The modeling of the predictions as a **weighted sum** makes it transparent how predictions are produced. And with LASSO we can ensure that the number of summands remains small.

Many people use linear regression models. This means that in many places it’s **accepted** for predictive modeling and doing inference. There is a **high level of collective experience and expertise**, including teaching materials on linear regression models and software implementations. Linear regression can be found in R, Python, Java, Julia, Scala, Javascript, …

Mathematically, it’s straightforward to estimate the weights and you have a **guarantee to find optimal weights** (given all assumptions of the linear regression model are met by the data).

Together with the weights you get confidence intervals, tests, a very solid statistical theory. There are also many extensions of the linear regression model (see chapter Linear Model 2.0 - GLMs, GAMs and more).

### 4.1.9 Disadvantages

Linear models can only represent linear relationships, i.e. a weighted sum of the input features. Each **non-linearity or interaction has to be hand-crafted** and explicitly given to the model as an input feature.

Linear models are also often **not that good regarding predictive performance**, because the relationships that can be learned are so restricted and usually oversimplifies how complex reality is.

The interpretation of a weight **can be unintuitive** because it depends on all other features. A feature with high positive correlation with the outcome \(y\) and another feature might get a negative weight in the linear model, because, given the other correlated feature, it is negatively correlated with \(y\) in the high-dimensional space. Completely correlated features make it even impossible to find a unique solution for the linear equation. An example: You have a model to predict the value of a house and have features like number of rooms and area of the house. House area and number of rooms are highly correlated: the bigger a house is, the more rooms it has. If you now take both features into a linear model, it might happen, that the area of the house is the better predictor and get’s a large positive weight. The number of rooms might end up getting a negative weight, because either, given that a house has the same size, increasing the number of rooms could make it less valuable or the linear equation becomes less stable, when the correlation is too strong.

Hastie, T, R Tibshirani, and J Friedman. 2009. The elements of statistical learning. http://link.springer.com/content/pdf/10.1007/978-0-387-84858-7.pdf.↩