What does it mean to learn?
People are very good at learning. Take yourself as an example: you learned how to differentiate a function in a matter of a few hours (perhaps not quite, but that’s to be seen).
To teach you to differentiate, I told you there is a procedure that gets us from a function \(f(x)\) to its derivative \(f'(x)\), and I showed you a few examples of pairs of \(f(x)\) and \(f'(x)\). You watched me, but at that point you could not really find the derivative of a function yourself. (Perhaps you could, but for the sake of the discussion, let’s imagine you could not.)
I then let you digest the examples and asked you to do the HW exercises. You went home, thought about the examples, and tried to come up with (infer) your own procedure for getting from \(f(x)\) to \(f'(x)\). You tried to induce, from the examples you had seen in class, a procedure that would work on the HW exercises that you had never seen before.
I repeat: the procedure had to be such that it could solve not only the cases I had given you as examples, but also the other functions I asked you to do. So it was not enough to memorize the examples or look them up in your notes. You had to generalize from the examples to a procedure that could work for other functions as well.
The ability to generalize is perhaps the most important component of learning and distinguishes it from simple memorization.
It actually didn’t happen exactly as I say above. I didn’t just give you a set of examples of functions and their derivatives and leave you to figure out how it works. I told you there is in fact a set of simple rules you can follow when differentiating a function, and I gave you the list. This means that I provided you with some prior knowledge that helped you when creating your hypotheses about a workable differentiation procedure. I limited your hypothesis space so that your search for a workable procedure was much easier than if you were just looking at the examples. This reduction of the hypothesis space is often referred to as an inductive bias. By guiding you, I tried to prevent you from inducing from the examples something that may be completely useless.
How did we verify that you really learned how to differentiate? We checked your responses to the HW exercises together and compared your answers with the true answers. If these matched, it suggested that your learning was successful. We didn’t care about the examples I gave you in class; solving those was relatively easy. The challenge was to solve the new exercises! That is what matters in learning: applying the newly learned expertise to new examples never seen before!
We didn’t finish with the learning there. Based on your responses being correct or not, you had an opportunity to revise your differentiation procedure. Perhaps you were considering two possible approaches and one of them seemed better than the other. Using new examples to validate your hypotheses thus helped you to better understand the quality of your different options and pick the most successful one. Based on your experience, you pinpointed the best procedure, which from now on you will try to apply to all future functions \(f(x)\) that you need to differentiate.
The final check of the results of your learning will be the math test. There you will apply the procedure you learned to a completely new set of test examples of functions \(f(x)\). I will then compare your answers \(\widehat{f'}(x)\) with the true derivatives \(f'(x)\). I will also have to fix some evaluation criterion. For example, I could use a very simple zero/one rule counting the number of errors you made. Each test example in which you make an error will cost you a point \[\text{cost}\big( f(x)\big) = \begin{cases} 1 & \text{ if } \ \widehat{f'}(x) \neq f'(x) \\ 0 & \text{ if } \ \widehat{f'}(x) = f'(x) \\ \end{cases} \] This indeed is a rather tough criterion. I could be nicer and perhaps tolerate some small errors in the answers. \[\text{cost}\big( f(x)\big) = \begin{cases} 1 & \text{ if } \ \widehat{f'}(x) \not\approx f'(x) \\ 0 & \text{ if } \ \widehat{f'}(x) \approx f'(x) \\ \end{cases} \] Or, instead of using just the numbers \(0, 1\), I could say that for a test exercise you can get anything in the interval \([0,1]\) depending on how close or far your derivative is from the true one. \[\text{cost}\big( f(x)\big) = \text{distance}\big(\widehat{f'}(x), f'(x)\big) \]
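To make the three criteria concrete, here is a minimal sketch in Python. It compares a predicted derivative \(\widehat{f'}\) with the true derivative \(f'\) numerically on a set of sample points; the sample points, the tolerance, and the clamped average absolute gap are illustrative assumptions, not definitions from the course.

```python
import numpy as np

def zero_one_cost(fhat, ftrue, xs):
    """Strict rule: 1 if the answers disagree anywhere, 0 only on an exact match."""
    return int(any(fhat(x) != ftrue(x) for x in xs))

def tolerant_cost(fhat, ftrue, xs, tol=1e-6):
    """Nicer rule: small errors (within tol) are forgiven."""
    return int(any(abs(fhat(x) - ftrue(x)) > tol for x in xs))

def distance_cost(fhat, ftrue, xs):
    """Graded rule: average absolute gap, clamped into [0, 1]."""
    return min(1.0, float(np.mean([abs(fhat(x) - ftrue(x)) for x in xs])))

# Example: an almost-correct answer for the derivative of f(x) = x^2.
xs = np.linspace(-1.0, 1.0, 11)
fhat = lambda x: 2.0 * x + 1e-9   # your answer, off by a tiny amount
ftrue = lambda x: 2.0 * x         # the true derivative

print(zero_one_cost(fhat, ftrue, xs))   # 1 -- the strict rule penalizes it
print(tolerant_cost(fhat, ftrue, xs))   # 0 -- the tolerant rule accepts it
print(distance_cost(fhat, ftrue, xs))   # ~1e-9 -- barely any cost
```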
These criteria might make you more or less happy in terms of getting a somewhat better or worse grade on the math test. But we should both remember that the final objective of you learning how to differentiate functions is not to pass the math test. It is to learn a new expertise that you will be able to apply to new functions whenever you may need it from now on. Therefore, in principle, the costs in the test should be adapted so that we are sure we achieved this main goal. Fixing a good cost measure (cost function) is therefore a critical step of the learning procedure and often requires specific domain knowledge to do it right (e.g. the derivative does not have to have exactly the same form in the order of mathematical operations, but it has to be mathematically equivalent).
There is one more point worth mentioning. If you remember, in that class I didn’t ask you to work out the homework examples on the whiteboard and explain your procedure. We trusted together that if your answers are correct, your differentiation procedure is good. This approach is very common in machine learning and is sometimes perceived as one of its major disadvantages. The procedure the machine learns can essentially be a black box that we don’t really understand and whose inner workings we don’t know. But as long as it seems to work well on new examples, we are usually happy with it. Some people are not, and this triggered a whole interpretable ML movement in the research community, which has become the focus of many research projects.
When we want a machine to perform some task, the classical approach is to programme the machine to do the specific task. For example, we could come up with a simple algorithm to return the maximum value of a set of integers, as in the sketch below. As you can imagine, writing an algorithm to differentiate a function is much more challenging, even if you yourself know how to do it.
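For the maximum-value task, such a classical hand-written programme could look like this (a minimal sketch; the function name is of course arbitrary):

```python
def maximum(values):
    """Return the largest element of a non-empty list of integers."""
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    return best

print(maximum([3, 7, 1, 9, 4]))  # 9
```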
Machine learning comes into play for two principal reasons:
You shouldn’t imagine some dark magic when we speak about machine learning. So far, machines cannot learn on their own; they still need our help. They still need us to tell them how to do the learning: we still need to write the programme for them, this time the learning algorithm. The machine will then use this learning algorithm and the training data we give it to learn a new programme to perform the task we want.
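Here is a minimal sketch of this idea (the line-fitting task, the least-squares learner, and all names are illustrative assumptions): we write the learning algorithm by hand, and the machine runs it on training data to produce a new prediction programme.

```python
import numpy as np

def learn_line(xs, ys):
    """The learning algorithm we wrote: fit y = a*x + b by least squares
    and return the learned programme as a new function."""
    a, b = np.polyfit(xs, ys, deg=1)
    return lambda x: a * x + b

# Training data produced by some rule the machine doesn't know (here y = 3x - 1).
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([-1.0, 2.0, 5.0, 8.0])

h = learn_line(xs, ys)  # the machine runs our learning algorithm on the data
print(h(10.0))          # the learned programme answers for a new input: ~29.0
```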
There are different tasks we may want the machine to learn to perform. The basic learning tasks that current machine learning tries to address are usually categorised into:

- supervised learning: learning from example input-output pairs how to predict the outputs for new, previously unseen inputs
- unsupervised learning: discovering some useful structure in the inputs when no outputs are given
- reinforcement learning: learning how to act by interacting with an environment and receiving rewards or penalties
In this course, we will focus primarily on supervised learning problems.
In the supervised learning setting, the learning task consists of the following objects (a small code sketch after the list ties them together):
- the input space \(\mathcal{X}\): this is the space of all the possible input data \(\mathbf{x}\); we indicate this as \(\mathbf{x} \in \mathcal{X}\). Oftentimes the input space is a set of \(d\)-dimensional vectors of real numbers, \(\mathcal{X} \subseteq \mathbb{R}^d\). But not always. For example, in our differentiation example the input set \(\mathcal{X}\) was the set of differentiable functions \(f(x)\).
- the output space \(\mathcal{Y}\): this is the space of the outputs \(y \in \mathcal{Y}\). It may be real numbers, labels such as \(\{0, 1\}\), categories such as \(\{\text{cats}, \text{dogs}, \text{elephants}, \text{birds}\}\), and many more. In our differentiation example, the output space \(\mathcal{Y}\) was the set of all derivative functions \(f'(x)\).
- the target function \(f\): we assume that there is some function \(f: \mathcal{X} \to \mathcal{Y}\), some rule that maps each \(\mathbf{x}_i \in \mathcal{X}\) to \(y_i = f(\mathbf{x}_i)\). The problem is that this function is unknown. In fact, this is what the learning algorithm shall try to learn. In our running example, it is all the differentiation rules combined in the right order.
- the training data \(S_n = \big( (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n) \big)\): a finite set (of size \(n\)) of input-output pairs. These are the data the algorithm shall use while learning. Sometimes you may see these called training examples or the training set. For us, these were the examples of functions and derivatives I gave you in the class.
- the prediction rule \(h\): this is the outcome of the learner; it is the function \(h : \mathcal{X} \to \mathcal{Y}\) that the learner discovers to approximate the unknown target function \(f\). It is also called the prediction function, the hypothesis, or the classifier.
- the measure of error \(l\): also called the cost function (and sometimes indicated as \(c\)), this is the function \(l : (\mathcal{Y} \times \mathcal{Y}) \to \mathbb{R}\) that measures how far the outputs of the learned prediction rule \(h(\mathbf{x})\) are from the true outputs produced by the unknown target function \(f(\mathbf{x})\). The choice of a suitable cost function \(l\big( h(\mathbf{x}), f(\mathbf{x})\big) = l\big( \widehat{y}, y \big)\) has an important impact on our learning procedure and has to be appropriate for the problem at hand.
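To tie these objects together, here is a minimal sketch (the sine target, the 1-nearest-neighbour learner, and the squared-error cost are all illustrative assumptions, not choices made by the course):

```python
import numpy as np

def f(x):
    """The target function f: X -> Y (normally unknown to the learner)."""
    return np.sin(x)

# Training data S_n: n input-output pairs (x_i, y_i) with y_i = f(x_i).
n = 20
inputs = np.random.default_rng(0).uniform(-3.0, 3.0, size=n)
S_n = [(x, f(x)) for x in inputs]

def learn(S):
    """A simple learner returning the prediction rule h: X -> Y.
    Here h predicts the output of the nearest training input."""
    def h(x):
        i = np.argmin([abs(x - xi) for xi, _ in S])
        return S[i][1]
    return h

def l(y_hat, y):
    """Measure of error l: (Y x Y) -> R, here the squared difference."""
    return (y_hat - y) ** 2

h = learn(S_n)                # the learned prediction rule
x_new = 1.5                   # an input never seen during training
print(l(h(x_new), f(x_new)))  # how far h is from the target f on the new input
```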