Supervised learning problems can be characterized by the nature of the output space \(\mathcal{Y}\) (what is it we predict):
Regression: The prediction target \(y\) is a real value, \(\mathcal{Y} \subseteq \mathbb{R}\), such as predicting a person’s weight from her height, or predicting a person’s salary based on years of experience.
Binary classification: The prediction target \(y\) is binary (e.g. yes/no, 0/1, true/false), \(\mathcal{Y} = \{0, 1\}\), such as predicting if an email is a spam or not based on its contents, or predicting if you pass an exam or not based on number of hours you study.
Multiclass classification: The prediction target \(y\) is a set of classes (e.g. high/medium/low, cat/dog/mouse/elephant), \(\mathcal{Y} = \{a_i\}, i=1, \ldots, k\), such as predicting (classifying) an image into the correct category of animals, or classifying an post on twitter into interest categories (sport, politics, entertainment, etc.).
Structured prediction: The prediction target \(y\) is an object with some complex structure (e.g. graph, molecule), such as predicting new molecules based on chemical properties, or predicting social structures based on peoples interests and interactions.
Ranking: The prediction target \(y\) is an ordering, such as predicting the best order for Google suggestions based on a query, or predicting an order of movies to propose to you based on your previous evaluations.
In this lecture, we will talk about the problem of linear regression.
We will begin from a simple example.
Imagine you would like to understand how the home study time HEG students dedicate for preparation for the courses influences their final grade in the courses. The input variable \(X \in \mathcal{X} \subset \mathbb{R}\) is a random variable giving the number of hours the student study. The output variable \(Y \in \mathcal{Y} \subset \mathbb{R}\) is the final grade (which we will treat as a real number).
We assume that there is some relationship between the two, some unknown function \(f: \mathcal{X} \to \mathcal{Y}\) and our goal is to formulate a hypotheses \(h\) as an estimate of the function.
There are two reasons to want to know the function:
In our example all the HEG students (past, current, and future) is the population we are interested in. Obviously, we cannot ask everybody. Instead, we resort to sampling. We pick \(n\) students and ask them about their study time \(x\) and grades \(y\) and create the sample of input-output pairs \(S_n = \big( (x_1, y_1),(x_2, y_2), \ldots, (x_n, y_n)\big)\)
Do you think your group is a good sample for our task? What makes a sample a good or useful sample?
size: Intuitively, we feel that if our sample is too small, we will not be able to use it to induce something about the whole population. From a statistical point of view, the larger the better. We won’t go into the details, but there are various statistical proofs about how quickly the estimate from the sample approaches the true population values depending on the sample size \(n\). From a technical point of view, large sample size is costly: data collection, data storage, data processing. Therefore real life is always a compromise.
identical distribution: The sample should have similar characteristics as the population. If our sample were not representative of the population, it would be difficult (essentially impossible) to generalize from it to the whole population. Formally, we assume there is some data generating probability distribution \(P\) for the population \((X,Y) \sim P\). For example, each person decides how many hours she will study. Though these may be very much rational decisions, given that we do not understand the rationale, we treat this variable as a random with the unknown probability distribution \(P_X\). We want our sample instances to come from the same probability distribution \((x_i, y_i) \sim P\) for all \(i=1, \ldots n\). This property of coming from the same distribution is referred to as the identical distribution property of the sample. The best way to obtain a sample generated by the same distribution as the whole population is simply by random sampling from the population (formally the best is simple random sampling with replacement).
independent sampling: Somewhat more technical requirement is that the sample instances should be independent from one another. Simply speaking, you should really select the instances completely randomly and avoid selecting samples just because it is somehow convenient. For example, once you select a person you should not include into the sample her friend just because it was easy to ask since you met them both together. This would very likely skew your sample. Perhaps the two friends study together and help each other to achieve better grades). This may seem obvious but is often not that easy to do in practice.
These requirements on the sample instances being independent and identically distributed is one of the major concepts of statistical modelling. It is abbreviated as i.i.d. or IID and you will often see people talking simply about IID sample. Now you know what they mean.
Simple linear regression tries to relate a single input variable \(X \in \mathbb{R}\) to a single output variable \(Y \in \mathbb{R}\) by assuming a simple linear relationship: \[y \approx w_0 + w_1 x\] You may read the “\(\approx\)” symbol as approximately, that is we model \(y\) to be approximately linearly dependent on the \(x\). Looking at the data plots we indeed cannot assume anything more than approximately as the data clearly do not sit on a straight line.
This by no means shall mean that the true target function \(f: \mathcal{X} \to \mathcal{Y}\) which generated the data is linear. This is a crucial point of machine learning. The true target function \(f: \mathcal{X} \to \mathcal{Y}\) is unknown. But since we do not know which function it really was, our linear assumption is as good as any.
You know that \(w_0\) and \(w_1\) are the intercept and slope of the line but we will refer to them simply as parameters of the line (parameters defining the linear model) and indicate them as the vector \(\mathbf{w} = (w_0, w_1)^T\).
Using the formal ML terminology, we introduced a very strong inductive bias by assuming a rather limited hypothesis space \(\mathcal{H}\) of linear functions. Each of the hypotheses \(h: \mathcal{X} \to \mathcal{Y}, \ x \to h(x)\) in the space \(h \in \mathcal{H}\) is associated with a specific set of of parameters \(\mathbf{w}\) which exactly define the linear function. In this case thus the problem of learning the hypotheses \(h\) reduces to finding the parameters \(\mathbf{w}\).
There are many linear functions \(h\) in our hypotheses space that seem as good models for our sample dataset. How do we know which one to pick? Naturally, we want to pick the one that does the best job in taking the input \(x\) to the predicted output \(h(x)\). But to do this, we need to be able to measure how good the specific hypotheses \(h\) is so that we can compare it to all the others. This is where the cost function comes into play.
The cost function \(c : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}\) compares the predictions produced by a hypotheses \(\widehat{y} = h(x)\) with the true values \(y\). A suitable cost function in the regression setting is the squared error \[c(y, \widehat{y}) = (y - \widehat{y})^2 = (y - w_0 - w_1 x)^2 \enspace .\] We square the errors simply to make the positive and negative differences equally important (there is clearly no reason why positive or negative errors shall be more important than others).
There is one rather common choice for a regression cost function, the absolute error \[c(y, \widehat{y}) = |y - \widehat{y}| \enspace .\] Though this may seem a rather natural choice at first, it is far less convenient as will become clear shortly because of its non-differntiability. We will therefore stick with the squared error cost.
Obviously it is not enough if our hypotheses \(h\) is good at one point. It should be good at all our points. It should do well on average across all the data. Hence, that’s what we do, we simply measure the error at all the points \(x_i\) in our dataset and get the average. (This is the other reasons why we want to get rid off the negative values. They would cancel out with some of the positive errors.) The squared error loss function is therefore the sum of the squared errors over the data in the sample \[L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y_i})^2 = \frac{1}{n} \sum_{i=1}^n (y_i - w_0 - w_1 x_i)^2 \enspace .\]
We make specific in the loss \(L(\mathbf{w})\) that this is a function of the unknown parameters \(\mathbf{w}\). These are the parameters we are searching for because the data, the \((x_i, y_i)\) pairs, are known, they are in our dataset!
Let’s summarise. We have fixed a hypotheses space \(\mathcal{H}\) of linear functions \(\widehat{y} = h(x) = w_0 + w_1 x\). We want to find the hypotheses \(h\) (the parameters \(\mathbf{w}\)) that predicts the best, that achieves the lowest squared error loss \(L(\mathbf{w})\). Now you just need to recall your calculus. How de we find the minimum (\(\min\)) and the minimizer (argmin) of a function? By setting it’s derivative equal to zero and solving for \(\mathbf{w}\)! \[\frac{\partial L(\mathbf{w})}{\partial w_0} = 0\\ \frac{\partial L(\mathbf{w})}{\partial w_1} = 0\] If you do this correctly and after a bit of fiddling you will arrive to the result you have probably seen and discussed for quite a while in your statistics courses. \[\widehat{w}_1 = \frac{\sum_i^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_i^n(x_i - \bar{x})^2}\\ \widehat{w}_0 = \bar{y} - \widehat{w}_1 \bar{x} \enspace ,\] where \(\bar{x} = \frac{1}{n} \sum_i^n x_i\) indicates the average.
In ML we usually use a neat trick to get a more computer-friendly solution. We ill call upon your linear algebra now. Observer that the linear function we work with can be rewritten as \[y \approx w_0 1 + w_1 x = \mathbf{x}^T \mathbf{w} \enspace,\] where \(\mathbf{w} = (w_0, w_1)^T\) is the 2-dimensional vector of unknown parameters of the linear funciton and \(\mathbf{x} = (1, x_1)^T\) is the 2-dimensional vector of observations where we created a dummy 1st dimension containing ones “\(1\)” for all the data points (instances).
The square error loss function can then be written as \[c(y, \widehat{y}) = (y - \widehat{y})^2 = (y - \mathbf{x}^T \mathbf{w})^2 \enspace ,\] and the loss function taking he average over the whole sample as \[L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y_i})^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \mathbf{x_i}^T \mathbf{w})^2 \enspace .\]
Next, instead of thinking about each intance \((x, y)\) separately, think about the whole dataset. What you have is actually a \(n \times 2\) matrix of the input data \(\mathbf{X}\) and an n-long vector of responses (outputs) \(\mathbf{y}\).
The data now looks like this
\(\mathbf{x_0}\) | \(\mathbf{x_1}\) | \(\mathbf{y}\) |
---|---|---|
\(x_{10}\) | \(x_{11}\) | \(y_1\) |
\(x_{20}\) | \(x_{21}\) | \(y_2\) |
\(x_{30}\) | \(x_{31}\) | \(y_3\) |
… | … | … |
\(x_{n0}\) | \(x_{n1}\) | \(y_n\) |
From your linear algebra lectures you should know the \(\ell_2\) norm of a vector \(|| \, \mathbf{x} \,||_2 = \sqrt{\mathbf{x}^T \mathbf{x}} = \sqrt{\sum_i^n x_i^2}\).
Let’s indicate by \(e_i = y_i - \widehat{y}_i = y_i - \mathbf{x_i}^T \mathbf{w}\) the error of the prediction for a single observations. Using this in the total loss and observing the definition of the norm of a vector we get \[L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (e_i)^2 = \frac{1}{n} || \, \mathbf{e} \, ||_2^2 \enspace .\] From your linear algebra lectures you also know that \[\mathbf{X} \mathbf{w} = \begin{bmatrix} \mathbf{x_1}^T \mathbf{w} \\ \mathbf{x_2}^T \mathbf{w} \\ ... \\ \mathbf{x_n}^T \mathbf{w} \end{bmatrix} \enspace ,\] where each \(\mathbf{x_i}\) is a row of the matrix, an instance in our case.
Using the matrix multiplication we can write the loss function as \[L(\mathbf{w}) = \frac{1}{n} || \, \mathbf{e} \, ||_2^2 = \frac{1}{n} || \, \mathbf{y} - \mathbf{X} \mathbf{w} \, ||_2^2 = \frac{1}{n} (\mathbf{y} - \mathbf{X} \mathbf{w})^T \, (\mathbf{y} - \mathbf{X} \mathbf{w}) \enspace .\]
As before, to find the minimum (\(\min\)) and the minimizer (argmin) of the loss function we take the derivative with respect to the vector \(\mathbf{w}\) (more precisely the vector of partial derivatives, the gradient), set it equal to zeros and solve for \(\mathbf{w}\). \[\frac{d\, L(\mathbf{w})}{d \, \mathbf{w}} = 0 \enspace .\] Either you can do this from scratch starting from the partial derivatives using the scalar differentiation rules, or you can cheat a bit and follow the vector and matrix differentiation rules. They are rather similar to the scalar case and are summarised for example in The Matrix Cookbook of Kaare Brandt Petersen and Michael Syskind Pedersen.
We get the following \[ -\frac{2}{n} \mathbf{X}^T \, (\mathbf{y} - \mathbf{X} \mathbf{w}) = 0\\ \widehat{\mathbf{w}} = \big(\mathbf{X}^T\, \mathbf{X} \big)^{-1} \mathbf{X}^T \, \mathbf{y} \]
The term \(\big(\mathbf{X}^T\, \mathbf{X} \big)^{-1} \mathbf{X}^T\) is called a pseudo-inverse of \(\mathbf{X}\) and you may have seen this in your linear algebra lectures.
While this seems rather different from our original results for \(\widehat{w}_0, \, \widehat{w}_1\), it actually will give the same solutions when using the 2-dimensional matrix with ones in the 1st column. (Homework to show this?)
It also is rather simpler to implement, essentially 1 line of code instead of 1st calculating the averages and than using these in the two equations for \(\widehat{w}_0, \, \widehat{w}_1\).
What we get for free using the matrix approach explained in the previous section is the ability to solve problems with multidimensional inputs \(\mathcal{X} \subseteq \mathbf{R}^d\), where \(d > 1\).
For 2-dimensional input space \(\mathcal{X} \in \mathbb{R}^2\) the linear model we consider is \[\widehat{y} = h(x) = w_0 + w_1 \, x_1 + w_2 x_2 \enspace .\] Graphically, it can now be represented as a plane in the 3-dimensional \(\mathcal{X} \times \mathcal{Y}\) space.
You can imagine that we can continue like this for any dimension of the inputs so that the linear model has generally the form \[\widehat{y} = h(x) = w_0 + w_1 \, x_1 + \cdots + w_d \, x_d \enspace .\] We cannot really represent these linear objects graphically (dimensins are too big for our imagination) but in analogy to the plane in the 3-d \(\mathcal{X} \times \mathcal{Y}\) space we speak about hyperplanes in the higher dimensions.
As in the simple linear regression case the squared error loss that we want to minimise with respect to \(\mathbf{w}\) is \[L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y_i})^2 = \frac{1}{n} \sum_{i=1}^n (y_i - w_0 - w_1 x_{i1} - \cdots - w_d x_{id})^2 \enspace .\] In principle, you could start off by finding the partial derivatives with respect to each dimension of \(\mathbf{w}\) setting them in turn equal to zero and solving the system of equaitons \(\partial L(\mathbf{w})/ \partial w_i = 0, \ i=0, \ldots, d\).
However, it is much more convenient to use the matrix approach introduced in the previous section.
First we prepend the input vectors by ones 1
in the first dimension so that each input instance is now \(\mathbf{x} = (1, x_1, \dots, x_d)^T \in \mathbb{R}^{d+1}\). Observe that that the dimensionality of the parameter vectors \(\mathbf{w}\) and the input vectors \(\mathbf{x}\) matches and is \(d+1\) for both.
Next, similarly as before, we will organize our data into the \(n \times (d+1)\) input matrix \(\mathbf{X}\) and the n-long output vector \(y\), write the loss funciton in the vector form \[L(\mathbf{w}) = \frac{1}{n} || \, \mathbf{y} - \mathbf{X} \mathbf{w} \, ||_2^2 \enspace ,\] and following the same principal of function minimisation we obtain the same minimising solution \[ \widehat{\mathbf{w}} = \big(\mathbf{X}^T\, \mathbf{X} \big)^{-1} \mathbf{X}^T \, \mathbf{y} \enspace . \] Easy ;)
It is a good habit to think about the dimensionality of your objects. As you can see, the minimising solution for \(\widehat{\mathbf{w}}\) is the same in case of simple linear regression as well as multivariate regression. But what are the dimensions of hte individual objects in the equation?
Simple linear regression | Multivariate linear regression | |
---|---|---|
original inputs \(x\) | \(1\) | \(d\) |
outpus \(y\) | \(1\) | \(1\) |
parameters \(\mathbf{w}\) | \(2\) | \(d+1\) |
augemented inputs \((1,x)\) | \(2\) | \(d+1\) |
\(\mathbf{X}^T\, \mathbf{X}\) | \(2 \times 2\) | \((d+1) \times (d+1)\) |
\(\mathbf{X}^T\, \mathbf{y}\) | \(2\) | \((d+1)\) |
The other thing you should consider as a computer scientist is the complexity of your computations. How many operations are need for the matrix multiplication \(\mathbf{X}^T\, \mathbf{X}\). For each element \((\mathbf{X}^T\, \mathbf{X})_{ij} = \sum_k^n \mathbf{X}^T_{ik} \, \mathbf{X}_{kj}\) in the resulting matrix you need to multiply \(n\) scalars (and sum them thereafter). You have got \((d+1)^2\) elements in the matrix so overall you need \(\mathcal{O}(nd^2)\) of operations. We read this as the order of \(nd^2\) and we are using the big-O notation for complexity.
Calculating an inverse of a matrix is roughly cubic so the inversion \(\big(\mathbf{X}^T\, \mathbf{X} \big)^{-1}\) is \(\mathcal{O}(d^3)\).
The last two operations are relatively cheap: \(\mathbf{X}^T \, \mathbf{y}\) is \(\mathcal{O}(nd)\) (no squares or cubes here!), and the final multiplication \(\big(\mathbf{X}^T\, \mathbf{X} \big)^{-1} \mathbf{X}^T \, \mathbf{y}\) after having the intermediate result is a matter of multiplying \((d+1) \times (d+1)\) matrix with \((d+1)\) vector so \(\mathcal{O}(2)\).
Depending on the size of your matrices, the most costly operations can be either the creation of \(\mathbf{X}^T\, \mathbf{X}\) of taking the inverse. In any case, this is a very naive approach for solving the linear system \(\mathbf{y} = \mathbf{X} \mathbf{w}\) and in most tools cou can usually find a specific function that has been optimised to find the \(\widehat{\mathbf{w}}\) more efficiently (from a computational perspective).