In chapter 3 Model seclection we discussed the problem of overfitting the training data by using very complex models that can adapt to all variations in the training data even those coming simply from the noise and randomness in the sampling procedure. At the same time, too simple models may not be flexible enough and may underfit. Finding a balance between these two is a non-trivial task which needs to be based on solid methods for model evaluation and model selection.
Overfitting is particularly prevalent when the training sample size is small relative to the model complexity. There are various rather sophisticated methods to measure the model complexity (such as the Vapnik-Chervonenkis dimension) but for simplicity you can think of it as of the number of model free parameters (number of elements in the \(\mathbf{w}\) vector).
An indication of a model class (the hypotheses space \(\mathcal{H}\)) being too flexible is the large variability of the models \(\hat{h}_i\) trained over different random samples \(S_i\). From the decomposition of the generalization MSE into the bais and variance terms (see 3.6.3 Bias-variance trade-off) we see that there are two ways to decrease the expected MSE: by decreasing the bias or by decreasing the variance. Complex flexible models typically can achieve smaller bias but at the cost of higher variance across different training samples. Very simple models may be similar across different training samples (have low variance) but may not be able to fit the data too well (have large bias). In result both too complex and too simple models have large generalization error.
Regularization methods use the bias-variance trade-off to reduce the expected MSE. They typically willingly sacrifice some of the bias (making it somewhat bigger) in order to reduce the variability and in result the total MSE.
We have seen in section 3.3 Expected loss that ideally we would like to find our model \(h \in \mathcal{H}\) by minimizing the expected loss \(\mathbb{L}(h)\) over any (including future) data point \((\mathbf{x}, y)\) coming from the generative distribution \(P_{X,Y}\) \[\text{arg min}_{h \in \mathcal{H}} \ \mathbb{L}(h) \enspace ,\] where \(\mathbb{L}(h) = \mathbb{E} \, c(h(\mathbf{x}), y)\) is the expectation of some cost function \(c\) for the prediction \(h(\mathbf{x})\) with the true output \(y\), such as the squared error \((h(\mathbf{x}) - y)^2\).
However, in section 3.4 Empirical loss we clarified that in practice this is impossible because we do not know the generative distribution \(P_{X,Y}\) and therefore cannot calculate the expectation. We therefore resort to using the emprirical loss calculated over the training samples as an estimate of the expected loss \[\text{arg min}_{h \in \mathcal{H}} \ L(h) \enspace ,\] where \(L(h) = \frac{1}{n} \sum_i^n c(h(\mathbf{x}_i), y_i)\) is the sample average of the costs calculated over the training samples \((\mathbf{x}_i, y_i) \in S\). We also mentioned that as much as this is a reasonable approach, the sample average may not be a very reliable estimate of the population expectation especially if the size of the sample is not very big. The sample average is an unbiased estimator of the population expectation, however, its variance across data samples can be rather big (depending on the size of the sample and the specific data generating distribution).
In regularized learning instead of minimizing directly the empirical loss we minimize the so-called regularized loss (we will indicate it by \(J\)): \[\text{arg min}_{h \in \mathcal{H}} \ J(h) \enspace ,\] where \[J(h) = L(h) + \lambda \, R(h), \quad \lambda > 0 \enspace .\] In the above, \(L\) is the usual empirical loss and \(R\) is the regularization function or simply regularizer. The role of the regularizer \(R\) is to penalize the hypotheses \(h\) for complexity: the bigger the complexity of \(h\), the higher \(R(h)\) shall be (we will see some examples soon).
In front of the regularizer \(R\), there is the regularization parameter \(\lambda\) also called the hyperparameter. It is a simple scalar number bigger than zero and you should see it as a weight of the regularization term as compared to the empirical loss. When minimizing \(J\), we minimize \(L\) and \(R\) together: on one hand we minimize the empirical loss over the training data and on the other, we minimize the complexity of the model. These are two competing objectives! Remember from your experiments with polynomials that more complex models \(h\) (e.g. higher order polynomials) can achieve low training loss \(L(h)\). But they will have high regularizer \(R(h)\). Very simple models with low \(R(h)\) are likely to underfit and have high \(L(h)\). The hyperparameter \(\lambda\) decides how much importance each of the terms shall have when contributing to the overall minimization \(J\). Note that when \(\lambda = 0\) we fall back onto the standard empirical loss minimization without any regularization.
Another way to look at the regularized learning is through the bias-variance trade-off lens. The empirical loss \(L\) is an unbiased estimate of the expected loss \(\mathbb{L}\) and therefore so are the estimated parameters \(\widehat{w}\) unbiased estimators of the Bayes hypotheses. When we add the regularizer to the objective function \(J(h)\), we push the solution away from this unbiased estimator - we introduce a bias. However, the regularizer should reduce the variability of the models by pushing it to simpler models and therefore lead to lower generalization error.
So far we have been controlling the model complexity by fixing the model class \(\mathcal{H}\). For example, as discussed in section 3.1 Nonlinear feature transformation, instead of searching for the best hypotheses (the model) within the class of simple linear functions \(h \in \mathcal{H}_{linear}\) we moved into more complex classes, such as \(h \in \mathcal{H}_{quadratic}\) or \(h \in \mathcal{H}_{12degPoly}\), by applying suitable feature transformations. The only way to reduce the complexity was to change the function class (e.g. to a lower degree polynomial).
Regularization gives us an additional handle to controlling the hypotheses (model) complexity. We can now afford to search in a complex model class \(h \in \mathcal{H}\), e.g. \(\mathcal{H}_{12degPoly}\), and force the model to be simpler (reduce the complexity) by applying the regularization \(R\) with sufficient strength \(\lambda\).
It is important to understand that regularization is a general concept applicable to almost any machine learning method. We will discuss it here in more detail in the context of regression. But it is equally applicable to classification (logistic regression, support vector machines, deep learning, etc.). What changes for these different problems is the form of the loss function (e.g. squared error for regression, cross-entropy for classification). But the main idea and principles remain the same.
There are many standard regularization functions that you may want to consider. We will discuss later in this chapter the most classical ones (\(\ell_2\) and \(\ell_1\)). However, formulating regularization functions to bias the model in the direction that may bring most benefits to the learning is an active area of ML research. New regularization functions appear in research papers on regular basis. They are often motivated by the specific domain from which the data originate (e.g. regularization to promote invariance to translations and rotations in images) or by secondary objectives such as model interpretability on top of the usual goal of predictive performance.
One of the difficulties of regularized learning (other than choosing the regularizer \(R\)) is selecting the right value of the hyperparameter \(\lambda\). Once again we need to recall the main goal of supervised learning: predict on new data with the lowest possible error ~ achieve low generalization error. Unfortunately, there is no way of knowing the best value of \(\lambda\) without trying. Similarly as in choosing the best hypotheses space (e.g. from the spaces of polynomials of different degrees) we need to resort to procedures of model evaluation and selection such as the cross-validation trying different values of the hyperparameter \(\lambda\), each essentially representing a different model class.
In a typical regularization learning procedure we allow the hypotheses space \(h \in \mathcal{H}\) to be rather broad and flexible class of functions. We then establish a search grid for the hyperparameter \(\lambda\), e.g. a list of values such as \((10, 1, 0.1, 0.01, 0.001, ... )\). We select the best \(\lambda\) from the grid via some model evaluation and selection procedure (e.g. cross-validation). Once the \(\lambda\) is selected, we can use it to learn over the complete data set the final model to be put into production thereafter.
Fixing a reasonably dense search grid for the hyperparameter is somewhat of an art. A fairly common approach is to start from a basic grid such as the above and fine-tune it around the best \(\lambda\) from this first rough search. E.g. if in the 1st round using the grid above we find the best hyperparameter to be \(\hat{\lambda} = 0.01\), we can try with a more dense grid around e.g. \((0.1, 0.08, 0.06, 0.04, 0.02, 0.01, 0.008, 0.006, 0.004 0.002, 0.001)\). Or something similar.
Probably the most common regularization function \(R\) for parametric models \(h(\mathbf{w}, \mathbf{x})\) with parameters \(\mathbf{w}\) is the \(\ell_2\) norm \[R(\mathbf{w}) = ||\mathbf{w}||_2^2 = \mathbf{w}^T \mathbf{w} = \sum_i^d w_i^2\].
It is used so commonly that it has almost become the default in most ML classification and regression methods. People often just mention it very briefly saying something like “we use this-and-this method with \(\ell_2\) norm”. When referring to neural networks, it is often referred to as the weight decay.
The \(\ell_2\) regularization has been first proposed in the context of linear regression under the name of ridge regression (Hoerl, Arthur E., and Robert W. Kennard. “Ridge regression: Biased estimation for nonorthogonal problems.” Technometrics 12.1 (1970): 55-67.)
In ridge regression, same as in standard linear regression, we learn the parameters parameters \(\mathbf{w}\) of the hypotheses \(h(\mathbf{w}) = \mathbf{x}^T \mathbf{w}\) (the inputs \(\mathbf{x}\) can be easily replaced by some features obtained by the input transformations \(\phi(\mathbf{x})\)). However, unlike standard linear regression, the objective function of ridge regression is the minimization of the regularized loss functional \[J(\mathbf{w}) = L(\mathbf{w}) + \lambda \, R(\mathbf{w}) \enspace,\] where \(L(\mathbf{w})\) is the usual squared error loss \[L(\mathbf{w}) = \frac{1}{n}||\mathbf{X} \mathbf{w} - \mathbf{y}||_2^2 = \frac{1}{n} \sum_i^n (\mathbf{x}_i^T \mathbf{w} - y_i)^2 = \frac{1}{n} \sum_i^n (\sum_j^d x_{ij} w_j - y_i)^2\] and \(R(\mathbf)\) is the \(\ell_2\) norm regularizer above. Putting these two together we can write the ridge regression minimization in the matrix form as \[\widehat{\mathbf{w}}_{ridge} = \text{arg}\min_w \frac{1}{n}||\mathbf{X} \mathbf{w} - \mathbf{y}||_2^2 + \lambda \, ||\mathbf{w}||_2^2 \enspace .\]
The \(\ell_2\) regularization puts a constraint on the size of the \(\mathbf{w}\) parameters in terms of their \(\ell_2\) norm. There is an alternative equivalent way to write the regularized learning problem (from optimization theory, specifically use of Lagrange multipliers) \[\widehat{\mathbf{w}}_{ridge} = \text{arg}\min_w \frac{1}{n}||\mathbf{X} \mathbf{w} - \mathbf{y}||_2^2 \\ \text{subject to } ||\mathbf{w}||_2^2 \leq c \enspace .\]
In words, this says: find the parameters \(\mathbf{w}\) by minimizing the empirical loss but keep the \(\mathbf{w}\) small so that their \(\ell_2\) norm is smaller than \(c\).
There is a relationship between \(c\) and \(\lambda\) in the two different formulations: the bigger the \(\lambda\) (more weight on the regularizer), the smaller the \(c\) (stronger constraint, \(\mathbf{w}\) constrained to smaller size). Getting the exact corresponding values of \(c\) and \(\lambda\) is rather complicated and depends on the specific data but it is not really needed. You typically decide to work with one or the other and this relationship should be just somewhere in the back of your mind to help you understand the effects.
In the pictures below, you can see the effect of imposing the \(\ell_2\) constraint on \(\mathbf{w}\) graphically for a very simple example of a \(2\) dimensional problem (vectors \(\mathbf{x}\) and \(\mathbf{w}\) are both \(2\)-dimensional).
In the left, there is a graph of the empirical loss function \(L(\mathbf{w})\). The squared error loss is a quadratic with a parabolic shape opening up and a unique minimum at the bottom. The black rings are the level curves (contours). These are the combinations of \((w_1, w_2)\) for which the loss is constant \(L(\mathbf{w}) = k\). The highest ring has higher loss than all the other displayed rings. The rings displayed in the image are the level curves for randomly chosen values of the loss. But it should be clear that there exists such a ring for any value \(L(\mathbf{w}) = k\) and they all together create the surface of the blue 3d paraboloid.
The right plot is the projection of the loss on the \((w_1, w_2)\) plane. This is like looking at the left plot from the top. The level curves are displayed as the blue ellipses, the contour lines, with the minimizing solution for the ordinary least squares problem \(\widehat{\mathbf{w}}_{ols}\) in the middle. The further away from the center the contours are, the higher is the loss. Again, only a few contour lines are plotted but there exists such an ellipse for any value of the loss \(L(\mathbf{w}) = k\). Note also how the loss changes unevenly when \(w_1\) and \(w_2\) change.
There are two sets of ellipses in the plot. These correspond to two different losses based on two different random samples. Remember that the model you learn always depends on the particular training data. If you start from a different training set, the regression solution is different. This is because after plugging in for \(\mathbf{X}\) and \(\mathbf{y}\) the real numbers, the losses are different as are the light blue and dark blue contour lines in the plot.
The points in the center of the ellipses are the two different solutions of the empirical loss minimization. You can see these are rather far apart meaning the models are rather different and will produce quite different predictions over new data - there is a high variability in the models.
The red circle around zero is the area where the \(\ell_2\) norm of the parameters \(\mathbf{w}\) is smaller (or equal) than some constant \(c\). The red points \(\widehat{\mathbf{w}}_{r}\) are the ridge regression solutions for the two training samples. As you can see, these are compromises: they try to make the squared error loss as small as possible but at the same time they cannot be further away from the \((0,0)\) center than \(c\). The two solutions are closer together meaning that the ridge models are more similar, producing more similar predictions. Ridge regression reduces the variability in the models. This reduction comes at a cost in bias but the hope is that this compromise will lead into lower generalization MSE.
Finding the best value of \(c\) is equivalent to finding the best value of \(\lambda\) and is typically performed by some procedure for model evaluation and selection. Note also in the image that if \(c\) is big enough (\(\lambda\) is small) the red circle will cover the empirical loss solutions. In that case the regularization is ineffective and the ridge regression and standard regression solutions are equal.
Recall the discussion in section 3.2 Data normalization about the effect of varible scaling on the size of the parameters \(\mathbf{w}\) (slope of the regression line). In the ridge regression we treat all the parameters \(w_i\) evenly, forcing them all together into a small sphere. If due to the uneven variable scaling some of the parameters would tend to be naturally much bigger than others, our constraining effect would be very uneven. Therefore it is customary to learn the ridge regression model over normalized variables.
Caution: Remember that if you apply any transformation to the training data, you need to apply exactly the same transformation to the test data. For normalization this means that for the test data you need to use the same mean and standard deviation values as you use for the training data. That is test normalization should not be based on test means and standard deviations but on the corresponding training statistics.
To solve the ridge regression problem we follow the standard principles of function minimization: taking derivatives and using the optimality condition of derivatives equal to zero.
For the loss function \[L(\mathbf{w}) = \frac{1}{n}||\mathbf{X} \mathbf{w} - \mathbf{y}||_2^2 =\frac{1}{n} (\mathbf{X} \mathbf{w} - \mathbf{y})^T (\mathbf{X} \mathbf{w} - \mathbf{y}) = \frac{1}{n} \big( \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{y}^T \mathbf{X} \mathbf{w} + \mathbf{y}^T \mathbf{y} \big)\] the gradient (with respect to \(\mathbf{w}\)) is \[\nabla L(\mathbf{w}) = \frac{1}{n} \big( 2 \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{X}^T \mathbf{y} \big) \enspace .\] You could derive this manually starting from partial derivatives \(\partial L(\mathbf{w})/\partial w_i\). But usually we can use rules for vector differentiation.
In the above we use two basic rules: \[f(\mathbf{w}) = \mathbf{A} \mathbf{w} \qquad \nabla f(\mathbf{w}) = \mathbf{A}^T\\ f(\mathbf{w}) = \mathbf{w}^T \mathbf{A} \qquad \nabla f(\mathbf{w}) = \mathbf{A}\] and combine them together with a product rule to get: \[f(\mathbf{w}) = \mathbf{w}^T \mathbf{A} \mathbf{w} \qquad \nabla f(\mathbf{w}) = (\mathbf{A}^T + \mathbf{A}) \mathbf{w} \enspace.\] Using the above rules, the gradient for the \(\ell_2\) regularizer is \[R(\mathbf{w}) = ||\mathbf{w}||_2^2 = \mathbf{w}^T \mathbf{w} \qquad \nabla R(\mathbf{w}) = 2\mathbf{w} \enspace .\] Putting all these together we have the gradient of the regularized loss \(J(\mathbf{w})\) as \[\nabla J(\mathbf{w}) = \frac{1}{n} \big( 2 \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{X}^T \mathbf{y} \big) + \lambda 2\mathbf{w} \enspace .\] You should get the habbit of differentiating vectors by using similar rules. It is much easier than going through the rigorous way of partial derivatives. A list of useful reules is for example here: Petersen, Kaare Brandt, and Michael Syskind Pedersen. “The matrix cookbook.” Technical University of Denmark 7.15 (2008): 510.
We use the optimality condition for the minimum of a convex function \(J(\mathbf{w}) = 0\) \[\begin{eqnarray} \frac{1}{n} \big( 2 \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{X}^T \mathbf{y} \big) + \lambda 2\mathbf{w} & = & 0\\ \mathbf{X}^T \mathbf{X} \mathbf{w} - \mathbf{X}^T \mathbf{y} + \lambda n \mathbf{w} & = & 0 \\ (\mathbf{X}^T \mathbf{X} + \lambda n \mathbf{I}_d) \mathbf{w} = \mathbf{X}^T \mathbf{y} \end{eqnarray}\] The solution to the ridge regression problem is the \(\mathbf{w}\) solving the above linear system and can be represented in a closed form \[\widehat{\mathbf{w}}_{ridge} = (\mathbf{X}^T \mathbf{X} + \lambda n \mathbf{I}_d)^{-1} \mathbf{X}^T \mathbf{y} \enspace .\]
Another very common form of the regularizer \(R\) is the \(\ell_1\) norm of the parameters \[R(\mathbf{w}) = ||\mathbf{w}||_1 = \sum_i^d | w_i | \enspace .\]
In the context of linear regression it is known as lasso (least absolute shrinkage and selection operator) (Tibshirani, Robert. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267-288.)
In the lasso regression we learn the parameters \(\mathbf{w}\) of the linear model \(h(\mathbf{x}, \mathbf{w}) = \mathbf{x}^T \mathbf{w}\) by solving the regularized optimization problem \[\widehat{\mathbf{w}}_{lasso} = \text{arg}\min_w \frac{1}{n}||\mathbf{X} \mathbf{w} - \mathbf{y}||_2^2 + \lambda \, ||\mathbf{w}||_1 \enspace .\] This is equivalent to the minimization of the constrained problem \[\widehat{\mathbf{w}}_{lasso} = \text{arg}\min_w \frac{1}{n}||\mathbf{X} \mathbf{w} - \mathbf{y}||_2^2 \\ \text{subject to } ||\mathbf{w}||_1 \leq c \enspace .\]
Both these formulations are very similar to the ridge regression formulations, the only difference being the norm they use in the regularizor (or the constraint).
However, the effects of the \(\ell_1\) regularization are somewhat different than in the \(\ell_2\) case. The \(\ell_1\) norm constraint on the parameters \(\mathbf{w}\) is not a circle or a ball-shaped object but a square (or cube-shaped object) with corners on the \(w\) axis. As in ridge, the solutions of lasso regression are compromises between minimizing the loss and staying within the constraints. But due to the shape of the constraint, it is now much more likely that the solution will fall onto one of the corners of the cube-shaped boundary. And because the corners are no the \(w\) axis, it is more likely than ridge to find solutions with several of the parameters equal to zero \(w_i = 0\).
A zero parameter \(w_i = 0\) means that the output \(y\) does not change at all with changes in the input variable \(X_i\). The variable has no impact on the outputs and therefore can be left out of the model. This variable selection effect is THE crucial property of the lasso regularizer, the one why people decide to use \(\ell_1\) over \(\ell_2\) regularization.
There are several reasons one may want to perform variable selection and learn sparse models (models with many parameters equal to zero). The first (relevant for ridge regression as well) is the reduction of overfitting by working with a simpler class of models. Sparse models are also usually easier to interpret as we see immediately which variables are not important for the prediction. With some caution this can help the understanding of the underlying problem, e.g. in a medical application you may collect multiple samples and measurements from the patients (blood contents, bone matter, etc.) in order to predict a disease. If some of these do not help prediction, they may not be relevant for the disease. Also, we may save money and other resources (the emotional stress of the patient) by not collecting these irrelevant variables. The reduced set of variables is also easier and cheaper to store and process.
As for ridge regression, it is customary to normalize the data before learning the lasso regression model to cater for the possibly uneven scalings in the original input variables.
While the lasso optimization problem seems very similar to the ridge problem, we cannot solve by the standard approach in function minimization using derivatives. This is because the regularization function in this case is non-differentiable. Multiple alternatives have been probposed to be able to solve the problem anyway. We will discusse one of the most succesful proximal gradient descent in chapter 5. Descent methods.
Lasso regression is not the historically first method trying to adress the variable selection problem. Somewhat similarly to feature transformations, variables can be selected manually by simply excluding or including them into the model. Obviously, the number of possible subsets of variables grows very rapidly with the dimensioanlity of the input space, often so much that the subsets cannot be all exhaustively explored.
In result, multiple heuristics for searching within the subsets have been proposed. The prime examples are the stepwise backward and forward selection procedures. In the backward selection we start from the full model \(M_0\) and then search amongst the variables the one that can be dropped without hurting the performance over the training data too much (essentially by trying to drop each varaible at a time and checking the change in the train MSE). Once we find it, we drop it from the set of inputs, train the smaller model \(M_1\) and repeat the procedure to find \(M_2\), etc. Finally, we choose the best model out of the \(M_0\), \(M_1\), etc. by their test MSE performance using for example cross-validation. In the forward selection we start from an empty model \(M_0\) predicting simply the mean of the outputs and we at each step we search for the variable that will improve the trian MSE the most.