We now get back to our original example of study hour and grades. If you inspect the population plot in section 2.2 our assumption of linear relationship between \(x\) and \(y\) does not seem quite right. As \(x\) increases further, the \(y\) eventually saturates and does not increase any longer. Is there some other function than linear that would perhaps capture this better? Recalling our discussion of different function shapes, you may remember that the logarithm grows very fast in the beginning but finishes growing rather slow. How can we use this? We will resort to feature transformation.
Feature transformation is an approach in which we manually transform the features (the inputs) before using them in the standard linear regression. In our case, we will try the log transformation by creating a new set of inputs as \(\tilde{x_i} = \log(x_i), \ i=1, \ldots, n\).
We continue working with the simple linear regression model from section 2.3 but instead of the original feature \(x\) we now use the transformed \(\tilde{x} = \log(x)\) \[y \approx w_0 + w_1 \tilde{x}, \qquad \tilde{x} = \log(x) \enspace .\] We will next follow the matrix approach introduced in section 2.3.3 simply replacing everywhere the original \(x\) with the transformed \(\tilde{x}\). We will first augment the data by the column of ones so that \(\mathbf{\tilde{x}}_i = (1, \tilde{x})_i\) is a \(2\)-dimensional vector for every \(i = 1, \ldots, n\). We then work the \(n \times 2\) matrix \(\mathbf{\tilde{X}}\) and we minimize the loss function \[L(\mathbf{w}) = \frac{1}{n} || \, \mathbf{y} - \mathbf{\tilde{X}} \mathbf{w} \, ||_2^2\] In analogy with the result in section 2.3.3. the minimizing solution is \[ \widehat{\mathbf{w}} = \big(\mathbf{\tilde{X}}^T\, \mathbf{\tilde{X}} \big)^{-1} \mathbf{\tilde{X}}^T \, \mathbf{y} \enspace . \]
Let’s explore one more, perhaps more obvious example. Here clearly fitting a line to the original data does not do a great job. However, applying first the feature transformation \(\tilde{x} = x^2\) and then performing the linear regression \[y \approx w_0 + w_1 \tilde{x} = w_0 + w_1 x^2\] seems to work rather well!
In the examples above we followed the same strategy. Inspecting the data, we suspected that the hypotheses of linear relationship between the inputs \(x \in \mathcal{X}\) and \(y \in \mathcal{Y}\) is not quite satisfactory. Instead of working with the original inputs \(x \in \mathcal{X}\) and trying to figure out how to learn a complex nonlinear model, we first applied a suitable nolinear feature transformation to the data using a feature map \(\phi: \mathcal{X} \to \tilde{\mathcal{X}}, \ \tilde{x} = \phi(x)\) and performed the linear regression in this tranformed feature space \(\tilde{\mathcal{X}}\).
\[y \approx w_0 + w_1 \phi(x)\]
This neat trick allowed us to get from the very restrictive world of linear relationships to highly nonlinear ones while keeping the math and machinery simple!
In the examples above we used a single nonlinear transformation to a single input dimension. It is, however, possible to apply multiple transformations to the the same variable creating a vector of transformed inputs \(\tilde{\mathbf{x}} = (\phi_1(x), \phi_2(x), \ldots, \phi_2(x))^T\). Instead of simple linear regression we then need to work with multivariate regression with the parameters vector \(\mathbf{w}\) of length \(d+1\).
The most common transformations are powers \((x^2, x^3, \ldots)\). This is sometimes referred to as polynomial regression because the hypotheses is a polynomial in the original feature space \(\mathcal{X}\) (while still a simple linear regression in the transformed feature space \(\tilde{\mathcal{X}}\)).
When the original feature space \(\mathcal{X} \subseteq \mathbb{R}^d\) is multi-dimensional, we can apply various non-linear transformations (such as powers) to each dimension \(x_i\). These may or may not be the same for each dimension. Furthermore, we can account for interactions between dimensions by working with products such as \(x_i x_j\). The regression model is then \[y \approx \phi(\mathbf{x})^T \mathbf{w} \enspace ,\] where \(\phi: \mathcal{X} \to \tilde{\mathcal{X}}\) is the vector nonlinear transformation function.
In fact, the feature transformation is such a common approach that you can often see simply \[y \approx \mathbf{\phi}^T \mathbf{w} \enspace ,\] where everybody understands that this is a linear regression problem after first applying some non-linear transformation to the original data. Everything we discussed for the linear regression over the original input \(\mathbf{x}\) vectors and \(\mathbf{X}\) matrix can be rewritten using the transformed feature vectors \(\phi\) and matrix \(\Phi\).
There are two major take-home messages from this:
There is one question we haven’t yet answered. How do you know which feature transformation you shall apply? This indeed is a very critical question and the answer is anything but obvious. The process of finding suitable feature transformations is called feature ingeneering. It is usually very much based on domain knowledge and experience and if done right can bring rather impressive improvements to the basic ML models. Two classical examples come from computer vision and natural language processing.
In computer vision you often work with images. These can be initially represented either as \(2\)-d matrices \(x \in \mathcal{X} \in [0,1]^2\) where each element represents the gray-scale intensity of the pixel in the image, or \(3\)-d cubes \(x \in \mathcal{X} \in [0,1]^3\) with intensities of the Red, Green and Blue channels. Computer vision experts have discovered quite long ago that what typically matters in images (for tasks such as classification) are the important parts of the images such as edges, corners, blobs, etc. They have therefore developed a multitude of feature detectors of which probably the most successful one is the SIFT (Scale-Invariant Feature Transform). It performs a series of transformations to the the original pixel values of the images to find the important and part of the image and encodes them into an transformed feature space \(\tilde{x} \in \tilde{\mathcal{X}}\).
In natural language processing the original input data \(x \in \mathcal{X}\) are often simply pieces of text (one for each instance). In the bag of words (BOW) approach we treat the text as a set of words disregarding their order and grammar. We can then create features by counting the number of occurrences of all words in each of the text.
One of the most common feature transformations which is usually a good idea to apply to any numerical inputs is the so-called normalization. Some people call it standardization and there is a little bit of confusion about the correct terminology. It is thus a good idea to always make clear what exact transformations you use rather than relying simply on the keywords.
The normalization is in fact a result of a composition of two transformations: centering and rescaling.
The goal of centering is to move the average of all the data to zero. One of the major reasons for this is to get rid of the intercept term \(w_0\) in the linear regression and the need to expand the inputs \(\mathbf{X}\) by the 1st column of ones.
The centering transformation removes the mean from the data and is defined as follows (\(C\) stands for centered) \[x^{(C)} = \phi_{C}(x) = x - \bar{x} = x - \frac{1}{n} \sum_i^n x_i\] It is rather easy to show that the centered variable has zero mean (see related homework).
Revisiting the minimizing solution for the intercept \(w_0\) in simple linear regression (section 2.3.2) \[\widehat{w}_0 = \bar{y} - \widehat{w}_1 \bar{x} \enspace ,\] we see that when working with the centered data we get always \[\widehat{w}_0 = \bar{y}^{(C)} - \widehat{w}_1 \bar{x}^{(C)} = 0 - \widehat{w}_1 0 = 0 \enspace .\] We can thus drop the parameter \(w_0\) from the regression altogether (it is always zero) and there is not need to augment our data matrix by the column of ones.
The goal of rescaling is to remove from the data the effect of using different measurement scales which make comparisons of variability difficult. We re-scale the variables by their standard deviation making the variance (and standard deviation) of the re-scaled variable equal to one.
Normalization is the composition of centering and re-scaling and is define as follows (\(N\) stands for normalized) \[x^{(N)} = \phi_{N}(x) = \frac{x - \bar{x}}{\sigma_x} = \frac{x^{(C)}}{\sigma_x} \enspace ,\] where \(\bar{x}\) and \(\sigma_x\) are the empirical mean and standard deviation of \(x\).
It is rather easy to show that normalized variable has unit (~ equal to one) variance (see related homework).
In chapter 2 Linear regression we motivated the use of the mean squared error loss (MSE) for finding the best parameters \(\widehat{\mathbf{w}}\) of the linear hypotheses \(h(\mathbf{x}) = \mathbf{x}^T \mathbf{w}\). \[ L_{MSE} (h) = \frac{1}{n} \sum_{i=1}^n \big( h(\mathbf{x_i}) - y_i \big)^2 \]
However, recalling chapter 1 The learning problem, there seems to be something fishy in using the MSE over the training data for picking the best hypotheses \(h \in \mathcal{H}\). Indeed, in chapter 1 The learning problem we stressed out multiple times that what we care the most about is not the prediction over the training data (we already know the output values for the training set) but the ability to generalize and predict over new data.
Remember that one of the critical assumptions in the supervised learning is that of an IID sample from some data generating probability distribution \(P_{X,Y}\), that is every data point \((\mathbf{x}_i, y_i) \in S_n\) in the training data is a sample from the probability distribution \((\mathbf{x}_i, y_i) \sim P_{X,Y}\). Similarly, any new data point shall be coming from the same distribution \((\mathbf{x}, y) \sim P_{X,Y}\).
Following the arguments in chapter 2 Linear regression it makes sense to evaluate the quality of the predictions produced by the regression function \(h(\mathbf{x})\) in terms of the squared error cost \(c = (h(x) - y)^2\). But since our main interest is predictions outside the training set, we should look at the average cost of predictions over any random data point \((\mathbf{x}, y)\) from the distribution \(P_{X,Y}\).
Calling upon your knowledge of basic probability (or the notes in Probability review) you know that the average value for a random variable \(X\) is called its expectation and it is defined as \[\mathbb{E} [X] = \begin{cases} \sum_x \, x \, p(x) = \sum_x x \, P(X = x) & \text{ for discrete } X \\ \int_x \, x \, p(x) \, dx & \text{ for continuous } X \\ \end{cases} \enspace ,\] where \(p(x)\) is the probability mass or probability density function respectively (you can see these as weights when calculating the weighted average of the values \(x\)).
An average of the cost \(c\) of any random data point is therefore the expected loss \[\mathbb{L}_{MSE} (h) = \mathbb{E} \, \big( h(\mathbf{x}) - y \big)^2 = \int_x \int_y \big( h(\mathbf{x}) - y \big)^2 p(\mathbf{x},y) \, d\mathbf{x} dy \enspace ,\] where \(p(\mathbf{x},y)\) is the joint probability density function.
In the above, we developed arguments for using the expected loss as the correct measure of the quality of a hypotheses \(h(x)\). It is therefore natural to use this measure in the search for the best regression function and define the best regression function \(h^*(x)\) as the one minimizing the expected loss \[h^* = \text{arg min}_h \, \mathbb{L}_{MSE} (h) = \text{arg min}_h \, \mathbb{E} \, \big( h(\mathbf{x}) - y \big)^2 \enspace .\]
Without anyhow limiting the hypotheses space \(h \in \mathcal{H}\) (i.e. not limiting \(\mathcal{H}\) to the space of e.g. linear functions) the best such hypotheses \(h^*(x)\) is called the Bayes predictive function.
However, even though the Bayes predictive function \(h^*(x)\) achieves the smallest expected error among all possible prediction function, its error in the stochastic setting is not zero \[\min_h \mathbb{L}_{MSE} (h) = \mathbb{L}_{MSE} (h^*) \neq 0\]
By the stochastic setting we mean that in reality the outputs \(y\) are not produced by a deterministic function \(f\) so that once \(x\) is known, there is no doubt about the value \(y = f(x)\). Instead, even for a fixed \(x\), the outputs \(y\) can still take different values according to some conditional probability distribution \(P_{Y|X}\).
This is often represented by an additive model \[y = f(x) + \epsilon \enspace ,\] where \(\epsilon\) is a random variable independent from the random variable \(X\) (often assumed to come from a normal distribution) usually called simply the noise.
The expected loss of a hypotheses \(h(x)\) is now \[\begin{eqnarray} \mathbb{E} \, \big( h(\mathbf{x}) - y \big)^2 & = & \mathbb{E} \, \big( h(\mathbf{x}) - f(x) - \epsilon \big)^2 \\ & = & \underbrace{\mathbb{E} \, \big( h(\mathbf{x}) - f(x)\big)^2}_{reducible} + \underbrace{\mathbb{E} \, (\epsilon)^2}_{irreducible = Var(\epsilon)} - 2 \underbrace{\mathbb{E} \, \big[ (h(\mathbf{x}) - f(x))(\epsilon) \big]}_{Cov(x,\epsilon) = 0} \end{eqnarray}\]
Hence, even if our ML method was able to correctly recover the generating function \(\widehat{h}(x) = f(x)\) and therefore bring the reducible error to zero, there is still the irreducible error equal to the variance of the noise \(Var(\epsilon)\).
Skip this section if not clear to you, it is not critical for an introductory course to ML
For the case of mean squared error it is not difficult to show that the Bayes predictive function is the conditional expectation \(h^*(x) = \mathbb{E} \, [y| \mathbf{x}]\).
Proof: \[\begin{align*} \mathbb{E} \left( h(\mathbf{x}) - y \right)^2 & = \mathbb{E} \left( h(\mathbf{x}) - \mathbb{E}[y| \mathbf{x}] + \mathbb{E}[y| \mathbf{x}] - y \right)^2 \\ & = \underbrace{\mathbb{E} \big( h(\mathbf{x}) - \mathbb{E}[y| \mathbf{x}] \big)^2}_A + \underbrace{\mathbb{E} \big(\mathbb{E}[y| \mathbf{x}] - y \big)^2}_B - 2 \underbrace{\mathbb{E} \big[ \left( h(\mathbf{x}) - \mathbb{E}[y| \mathbf{x}] \right) \left(\mathbb{E}[y| \mathbf{x}] - y \right) \big]}_C \end{align*}\]
In here the cross term \(C\) disappears (is always equal to zero) and therefore we are left only with the \(A\) and \(B\) terms (see eg. p. 33 in GREGOROVA, Magda. Sparse learning for variable selection with structures and nonlinearities. Thèse de doctorat : Univ. Genève, 2018, no. Sc. 5318)
Whichever \(h\) we choose, the term \(B\) will not change. In fact, this is the irreducible error discussed above, the variation of the outputs around the conditional average \(Var(y|x)\). Therefore to minimize the expected loss \(\mathbb{E} \left( h(\mathbf{x}) - y \right)^2\) we need to minimize the term \(A\), the reducible error. Clearly, the minimum value is \(A = 0\) when \(h(x) = \mathbb{E}[y| \mathbf{x}]\).
The theory developed in the section on 3.3 Expected loss is interesting and certainly useful from the theoretical point of view, but in practice, there is a major drawback. There is actually no way we could minimize the expected loss! (And the problem is not that we are not very good at integrating functions ;-))
The whole motivation of supervised learning is that we actually do not know the data generating distribution \(P_{X,Y}\) and the density \(p(x,y)\) needed to be able to evaluate the expected loss. Though we can show that theoretically the best regression predictor with squared error loss is the conditional expectation \(\mathbb{E}[y| \mathbf{x}]\), it is of no real use to us in practice because we have no clue what this expectation is for any given \(x\). We know nothing about the true conditional distribution \(P_{X|Y}\).
So what can we do? Well, what we are left with is basic statistics. A standard way to estimate the population mean, that is the expectation \(\mathbb{E}[x]\), is to use the empirical mean \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\) calculated over the sample data \[\widehat{\mathbb{E}}[x] = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \enspace .\]
In our case, to estimate the expected loss, we will use the empirical loss calculated over the training data \[\widehat{\mathbb{L}}_{MSE} (h) = \widehat{\mathbb{E}} \, \big( h(\mathbf{x}) - y \big)^2 = L_{MSE} (h) = \frac{1}{n} \sum_{i=1}^n \big( h(\mathbf{x_i}) - y_i \big)^2 \enspace .\]
It may seem somewhat disappointing that after all the talking in section 3.3 Expected loss we finally arrived at our starting point, the empirical loss calculated over the training data. But it is not at all disappointing! Rather the opposite. We now know that minimising the empirical loss is a reasonable thing to do in our search for the best hypotheses \(h \in \mathcal{H}\).
We can also use all the statistical theory about the properties of the sample mean (such as the law of large numbers, the central limit theorem, and many more) to understand the properties of the empirical loss \(L_{MSE} (h)\) as an estimator of the expected loss \(\mathbb{L}_{MSE} (h)\). Most importantly, it provides us with a handle to analyse the properties of the hypotheses learned by minimizing the empirical loss \[ \widehat{h_S} = \text{arg min}_h \, L_{MSE} (h)\] as estimators of the theoretically correct approach of minimizing the expected loss \[\widehat{h_E} = \text{arg min}_h \, \mathbb{L}_{MSE} (h) \enspace .\] Such an analysis (called the statistical learning theory) goes, however, far beyond an introductory course on ML. For those interested, a good starting point is for example the book Shalev-Shwartz, Ben-David: Undertanding machine learning: from theory to practice, Cambridge Uiversity Press 2014.
We argued in the previous section for the second time already for the use of the empirical loss function \(L_{MSE}(h)\) as an objective function to be minimized by the learning algorithm \[\widehat{h} = \text{arg min}_h \, L_{MSE} (h) \enspace .\]
But how good is this hypotheses \(\widehat{h}\) minimizing the empirical loss (risk) in what really interests us: the predictions over new data? How good is its generalization accuracy (error)?
One possible answer is to use the same arguments as before: since we do not know the the true generative distribution we cannot evaluate the MSE over the true data and we can only use the MSE calculated over the training data \(L_{MSE}(h)\). You can certainly see that this is not a very good estimate of the true MSE over new data, especially if the training sample is not too large. In the same way that the average age calculated over 10 randomly selected students may not be a good estimate of the average age of all HEG students. Sometimes this 10-sample average may be let’s say 20, another time 25, while the true average may be 22.
You may remember from your statistical courses that by the law of large numbers as \(n \to \infty\), the sample average converges to the expectation \(\bar{x}_n \to \mathbb{E} [x]\). However, the rate at which these two converge may not be very fast and we may need a lot of observations so that these two become close. A major consequence of this slow rate of convergence is a phenomenon critical for most ML algorithms: the problem of overfitting.
In the homework Ex4 part2: feature transformation we used polynomial feature transformations to extend our hypotheses space \(\mathcal{H}\) from the simple linear functions to more complex polynomial functions (quadratic and higher degree polynomials).
We see in the plot above that the 12-degree polynomial has much more flexibility to capture all the little variances in the training data (the blue points) and therefore achieves lower mse over the training set than the 3-degree polynomial. However, checking the plot, the 12-degree polynomial regression function seems somehow unsatisfactory. Though we cannot be sure (because the only data we have to do the analysis are the blue points in the plot) it somehow seems unlikely that for a point \(x = -1.6\) the output value would really be around \(\hat{y} = 2.8\), and for points around \(x = -2.5\) so low that it cannot even fit the plot.
We next generate 100 new test data points and calculate the MSE of the two learned models over these new test data. What we observe is that the more flexible 12-degree polynomial actually performs very badly on these, having much higher mse over the test data. The problem we see in the plot is related to an important problem of many machine learning models: overfitting the training data sample.
Overfitting is closely related to the complexity of the hypotheses. Complex models often have enough capacity to capture all the variations in the training data, even those coming from the noise \(\epsilon\) instead of focusing on the important variations explaining the outputs, they overfit. Too simple models, on the other hand, cannot capture the variability in the data sufficiently and do not perform well either, they underfit. For example, a simple linear model would underfit the data in the example above.
In the image below we plot the MSE calculated over the train and test data for polynomials of degree 1 (linear) to 8. The patterns is fairly typical in any supervised learning. The training error monotonically decreases as the complexity (and therefore flexibility) of the model increases. However, the test error is high in the beginning when the model is not flexible enough and underfits, then decreases for a while and then starts increasing again as the hypothesis becomes too flexible and overfits.
Important take-away message from the previous section is to never use the MSE over the training as an indicator of the accuracy of the model for predictions on new data.
In some cases, data are plentiful and there may be actually an independent test set available for evaluating the prediction accuracy. This is often the case in ML competitions where the goal is to compare competing approaches in exactly the same manner. In such a case we can simply calculate the MSE over the test set. A word of caution: for this to be an honest evaluation of prediction accuracy over new data we must never use the test data for the training. We must come up with our models as if the test data hadn’t existed. Only after the model is completely fixed and finalized, we can use the test data to get the test MSE.
If an independent test set is not available but the data available to us are still plentiful, we can create an artificial test set by splitting it out of the data and holding it out separately from the training set (common splits are for example 80%/20% or 70%/30%). The same principal as above still holds: we should never use the test set during the training of the model when we want to estimate its prediction accuracy using the test data.
There are two potential (somewhat related) problems with this approach:
A rather popular scheme that tries to combat the problems of the hold-out approach is the cross-validation (cv). In the k-folds cross-validation we split the data available to us randomly into \(k\) parts of equal size. We then repeat the same procedure \(k\) times:
For each repetition we choose a different fold as the test set and we use the remaining \(k-1\) folds for the training. The final error estimation (or equivalently the estimation of the accuracy = 1 - error) is then calculated as the average of the error across the \(k\) test folds.
Note an interesting property of the cv approach: each sample (instance) in our data is used within the test set exactly once. Assuming our data are generated IID from some distribution \(P_{XY}\), our most complete information about the distribution is the complete data set. By the cv approach we eventually use the whole data set as a fake test set hoping it represents the unknown distribution well enough.
When we estimate the error by the k-folds cv, we also avoid many of the problems of the hold-out approach:
An extreme version of cross-validation is the leave-one-out procedure. In this each fold is composed of a single observations. For a data set of size n we therefore repeat the train-test procedure n times, each time using a single data point as the test set and the remaining \(n-1\) data points as training set.
Obviously, this procedure is prohibitively expensive for samples with large \(n\). However, it may be interesting for small data sets where we cannot afford to loose multiple data points on the test fold not to reduce the size training folds too much.
The leave-one-out procedure is also the one with the best theoretical properties though discussion of these goes beyond the scope of this course.
The process of model selection may seem at first somewhat confusing. What is important to understand is that when talking about model selection we do not really select a model, but rather a model class.
For example, when we decide to work with the class of linear regression models \(h(\mathbf{x}) \in \mathcal{H}_{linear}\), so that \(h(\mathbf{x}) = \mathbf{x}^T \mathbf{w}\), the model class is fixed a priori (from the beginning). Therefore there is nothing to select from. We can only train (learn) the model within the given class by minimizing the problem-specific objective over the training data \(S_n\) (e.g. the training MSE in the regression case). The optimal model \(\hat{h}(\mathbf{x}) = \mathbf{x}^T \widehat{\mathbf{w}}\) is the minimizer of the training loss found by using the selected algorithm. There is just one such optimal model and therefore there is nothing to select from!
In contrast, we may decide to explore multiple classes of models. We have seen an example of classes of polynomials of increasing degree, \(h^{(1)}(\mathbf{x}) \in \mathcal{H}_{linear}\), \(h^{(2)}(\mathbf{x}) \in \mathcal{H}_{quadratic}\), \(h^{(3)}(\mathbf{x}) \in \mathcal{H}_{cubic}\), etc., so that \(h^{(i)}(\mathbf{x}) = \phi^{(i)}(\mathbf{x})^T \mathbf{w}^{(i)}\), where \(\phi^{(i)}(\mathbf{x})\) is a suitable feature transformation function. But the classes may be less related to one another. For example, for a classification problem the classes my be \(h^{(1)}(\mathbf{x}) \in \mathcal{H}_{perceptron}\), \(h^{(2)}(\mathbf{x}) \in \mathcal{H}_{logistic}\), \(h^{(3)}(\mathbf{x}) \in \mathcal{H}_{neuralnet}\), etc. Another example of different model classes are neural nets with different architectures. Yet another is standard models with different regularization functions - these we will see soon in our course.
Once we decided which classes to explore, we use the training data \(S_n\) to learn the model of each class, that is find for each class the optimal model \(\widehat{h^{(i)}}(\mathbf{x}) =\phi^{(i)}(\mathbf{x})^T \widehat{\mathbf{w}^{(i)}}\) by minimizing the problem-specific objective. Out of these multiple optimal models (one per each class) we then typically want to select only a single model to keep for future work, for predictions on new data.
Clearly, the model we want to keep is the one that is likely to predict the best (with the lowest errors) on new data. This is where model selection and model evaluation meet. Any of the procedures described in the section 3.5 Model evaluation can be used to evaluate the models with the caveats and pros and cons explained there. However, for a full fairness of the comparisons, the evaluation procedure should be exactly the same for each of the competing models. For example, if we use random splits in a cross-validation procedure, the splits should be exactly the same when evaluating all the models. Moreover, all the settings of your evaluation procedure shall be well explained and documented to make your work transparent, trustful and reproducible!
After the mention of the cross-validation procedure it should be more clear why model selection is rather about selecting the model calss and not about selecting any specific one model. Recall that in a \(k\)-folds cross validation we actually train and evaluate \(k\) different models. Each of these is learned over a different set of \(k-1\) folds as a training set and evaluated over the remaining fold as a test set. The final estimate of the model accuracy (the model error = 1 - accuracy), is the average over these \(k\) different models. It is thus not the accuracy of any single model in our whole cv procedure.
Similarly, the final optimal hypothesis is not any of the hypothesis of the \(k\) folds. However, averaging the hypotheses (functions) does not really make much sense. Instead we can obtain the final optimal hypothesis by training over the complete data set (without any fake splits). We cannot be absolutely sure what the accuracy of this final optimal hypothesis will be when predicting on new data (we will only know once we get the new data). But the whole point of the model evaluation procedure is to get an estimate of this and therefore we hope and believe the accuracy will be close to what we obtained from our model evaluation procedure.
There is a procedure which may seem tempting to try but should be absolutely avoided: you shall never try to tweak the parameters of your optimal model trained by a solid procedure to fit better the test data. That is, once you have an optimal model \(\hat{h}\) coming from minimizing some objective function over the training data, you can only use the test data to evaluate the model accuracy (error). You can never ever use the test data to improve your model (tweak the parameters). If you do, you may seemingly improve the model accuracy but this improvement is just imaginary, not real! By adapting your model to the test data, you treat the test data similarly as the training data and therefore the error you calculate is conceptually rather similar to the training error. It is no longer a good indicator of performance on new data that have never been seen during the model learning.
What we noted already in the homework Ex4 part2: feature transformation is that when we use different training data \(S_i, \, i=1, \ldots, k\) to learn the models \(\hat{h}_i\), these models can be rather different. In result, there is quite a lot of variability in their predictions \(\hat{h}_i(x), \, i=1, \ldots, k\) for any single data point \(x\). This variability in the models and their predictions based on their underlying training sets has an important consequence for the error analysis.
Using an independent test set, we could calculate the test set error for each of the samples as \(E_i = \frac{1}{n}\sum_j^n (\hat{h}_i(x_j) - y_j)^2\) and then estimate the generalization error, similarly as in the cross-validation procedure, as the average of these errors.
\[\hat{E} = \frac{1}{k}\sum_i^k E_i = \frac{1}{k}\sum_i^k \frac{1}{n}\sum_j^n \big( \hat{h}_i(x_j) - y_j \big)^2\].
For simplicity, imagine our test set is very tiny. In fact, it contains only single data pair \((x,y)\). The average test error is then simply \[\hat{E} = \frac{1}{k}\sum_i^k E_i = \frac{1}{k}\sum_i^k \big( \hat{h}_i(x) - y \big)^2\].
We will now use this simplified error equation to derive an important result \[\begin{eqnarray} \hat{E} & = & \frac{1}{k}\sum_i^k E_i = \frac{1}{k}\sum_i^k \big( \hat{h}_i(x) - y \big)^2 \\ & = & \frac{1}{k}\sum_i^k \big( \hat{h}_i(x) - \frac{1}{k}\sum_i^k \hat{h}_i(x) + \frac{1}{k}\sum_i^k \hat{h}_i(x) - y \big)^2 \\ & = & \underbrace{\frac{1}{k}\sum_i^k \big( \hat{h}_i(x) - \frac{1}{k}\sum_i^k \hat{h}_i(x) \big)^2}_A + \underbrace{\frac{1}{k}\sum_i^k \big( \frac{1}{k}\sum_i^k \hat{h}_i(x) - y \big)^2}_B + 2 \underbrace{\frac{1}{k}\sum_i^k \big( \hat{h}_i(x) - \frac{1}{k}\sum_i^k \hat{h}_i(x) \big) \big( \frac{1}{k}\sum_i^k \hat{h}_i(x) - y \big)}_C \end{eqnarray}\]
For the \(C\) term we have \[\begin{eqnarray} \frac{1}{k}\sum_i^k \big( \hat{h}_i(x) - \frac{1}{k}\sum_i^k \hat{h}_i(x) \big) \big( \frac{1}{k}\sum_i^k \hat{h}_i(x) - y \big) & = & \big( \frac{1}{k} \sum_i^k \hat{h}_i(x) - \frac{1}{k}\sum_i^k \hat{h}_i(x) \big) \big( \frac{1}{k}\sum_i^k \hat{h}_i(x) - y \big) \\ & = & \big( 0 \big) \big( \frac{1}{k}\sum_i^k \hat{h}_i(x) - y \big) = 0 \end{eqnarray}\]
For the \(B\) term we have \[\begin{eqnarray} \frac{1}{k}\sum_i^k \big( \frac{1}{k}\sum_i^k \hat{h}_i(x) - y \big)^2 & = & \big( \frac{1}{k}\sum_i^k \hat{h}_i(x) - y \big)^2 = \big( bias \big)^2 \end{eqnarray}\]
For the \(A\) term we have \[\begin{eqnarray} \frac{1}{k}\sum_i^k \big( \hat{h}_i(x) - \frac{1}{k}\sum_i^k \hat{h}_i(x) \big)^2 & = & \frac{1}{k} \sum_i^k \big( \hat{h}_i(x) - \frac{1}{k}\sum_i^k \hat{h}_i(x) \big)^2 & = & variance \end{eqnarray}\]
Therefore we get this important result for the generalization error estimate \(\hat{E}\) calculated by averaging the errors of hypotheses \(\hat{h}_i\) learned over different training samples \(S_i\) \[\hat{E} = \frac{1}{k}\sum_i^k E_i = \big( bias \big)^2 + variance \enspace ,\]
where the \(bias\) term represents the extent to which the average of the predictions over all the models \(\frac{1}{k}\sum_i^k \hat{h}_i(x)\) differs from the desired true output value \(y\), and the \(variance\) term describes how much the predictions of the individual models \(\hat{h}_i(x)\) vary around their average \(\frac{1}{k}\sum_i^k \hat{h}_i(x)\).
In the above we showed the relation between the generalization error estimated over a single-sample test set. But this can be easily extended to larger samples (by simply calculating an extra average over all the points in the test set) and in fact holds even when not working with test sample averages but with the population averages in terms of expectations.
An important consequence of the above result is that there are in principal two ways to minimise the generalization error. One is to reduce the bias of our hypotheses, that is whatever training sample we use, on average the hypotheses shall be as close as possible to the true values. The other is to reduce the variance in the hypotheses, that is the hypotheses shall not vary too much when we use different training sample.
We have actually seen this informally in the homework Ex4 part2: feature transformation already. The 12-degree polynomial models estimated from different training samples varied a lot and indeed reached a very large test error. The 3rd-degree polynomial models seemed more similar to each other, they varied a lot less, and in consequence reached a lot lower generalization error.