1 The learning problem

What does it mean to learn?

1.1 Informal introduction by example

People are very good at learning. You, for example, learned how to differentiate a function in a matter of a few hours (perhaps not quite, but that’s to be seen).

To teach you to differentiate, I told you there is a procedure that gets us from a function \(f(x)\) to its derivative \(f'(x)\), and I showed you a few examples of pairs of \(f(x)\) and \(f'(x)\). You watched me, but at that point you could not really find the derivative of a function yourself. (Perhaps you could, but for the sake of the discussion, let’s imagine you could not.)

I then let you digest the examples and asked you to do the HW exercises. You went home, you thought about the examples, and you tried to come up with (infer) your own procedure for how to get from \(f(x)\) to \(f'(x)\). You tried to induce, from the examples you had seen in class, a procedure that would work on the HW exercises that you had never seen before.

I repeat, the procedure had to be such that it could solve not only the cases I had given you as examples, but also the other functions I asked you to do. So it was not enough to memorize the examples or look them up in your notes. You had to generalize from the examples to a procedure that could work for other functions as well.

The ability to generalize is perhaps the most important component of learning and distinguishes it from simple memorization.


It actually didn’t happen exactly as I describe above. I didn’t just give you a set of examples of functions and their derivatives and leave you to figure out how it works. I told you there is in fact a set of simple rules you can follow when differentiating a function, and I gave you the list. This means that I provided you with some prior knowledge that helped you when forming your hypotheses about a workable differentiation procedure. I limited your hypothesis space so that your search for a workable procedure was much easier than if you were just looking at the examples. This reduction of the hypothesis space is often referred to as an inductive bias. By guiding you, I tried to prevent you from inducing from the examples something which may be completely useless.


How did we verify that you really learned how to differentiate? Together we checked your responses to the HW exercises and compared your answers with the true answers. If these matched, it suggested that your learning was successful. We didn’t care about the examples I gave you in class; solving those was relatively easy. The challenge was to solve the new exercises! That is what matters in learning: applying the newly learned expertise to new examples never seen before!


We didn’t finish with the learning there. Based on your responses being correct or not, you had an opportunity to revise your differentiation procedure. Perhaps you were considering two possible approaches and one of them seemed better than the other. Using new examples to validate your hypotheses thus helped you to better understand the quality of your different options and pick the most successful one. Based on your experience, you pinpointed the best procedure, which from now on you will try to apply to all future functions \(f(x)\) that you need to differentiate.


The final check of the results of your learning will be the math test. There you will apply the procedure you learned to a completely new set of test examples of functions \(f(x)\). I will then compare your answers \(\widehat{f'}(x)\) with the true derivatives \(f'(x)\). I will also have to fix some evaluation criterion. For example, I could use a very simple zero/one rule counting the number of errors you made. Each test example in which you make an error will cost you a point \[\text{cost}\big( f(x)\big) = \begin{cases} 1 & \text{ if } \ \widehat{f'}(x) \neq f'(x) \\ 0 & \text{ if } \ \widehat{f'}(x) = f'(x) \\ \end{cases} \] This is indeed a rather tough criterion. I could be nicer and perhaps tolerate some small errors in the answers. \[\text{cost}\big( f(x)\big) = \begin{cases} 1 & \text{ if } \ \widehat{f'}(x) \not\approx f'(x) \\ 0 & \text{ if } \ \widehat{f'}(x) \approx f'(x) \\ \end{cases} \] Or, instead of using just the \(0, 1\) values, I could say that for a test exercise you can get anything in the interval \([0,1]\) depending on how close or far your derivative is from the true one. \[\text{cost}\big( f(x)\big) = \text{distance}\big(\widehat{f'}(x), f'(x)\big) \]
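
To make these criteria concrete, here is a minimal sketch in Python. It is only an illustration: the function names, the tolerance, and the simplification of representing each derivative by its value at a single test point are my own assumptions, not part of the grading scheme described above.

```python
# A sketch of the three cost criteria. For simplicity, the predicted and
# true derivatives are represented here by their values at one test
# point (plain numbers); a real grader would compare whole functions.

def zero_one_cost(pred, true):
    """Tough criterion: any mismatch costs a full point."""
    return 1 if pred != true else 0

def tolerant_cost(pred, true, tol=1e-3):
    """Nicer criterion: errors within the tolerance are forgiven."""
    return 1 if abs(pred - true) > tol else 0

def distance_cost(pred, true):
    """Graded criterion: the cost grows with the distance to the truth."""
    return abs(pred - true)

print(zero_one_cost(2.0001, 2.0))   # 1: even a tiny error costs a point
print(tolerant_cost(2.0001, 2.0))   # 0: the small error is tolerated
print(distance_cost(2.0001, 2.0))   # ~0.0001: cost proportional to the error
```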

These criteria might make you more or less happy in terms of getting a somewhat better or worse grade on the math test. But we should both remember that the final objective of you learning how to differentiate functions is not to pass the math test. It is to acquire a new expertise that you will be able to apply to new functions whenever, from now on, you may need it. Therefore, in principle, the costs in the test should be chosen so that we are sure we achieved this main goal. Fixing a good cost measure (cost function) is therefore a critical step of the learning procedure and often requires specific domain knowledge to do it right (e.g. the derivative does not have to have exactly the same form in terms of the order of mathematical operations, but it has to be mathematically equivalent).
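
As an illustration of such domain knowledge, a grader could compare answers by mathematical equivalence rather than identical form. The sketch below assumes the sympy library and uses made-up expressions; it is not the actual test procedure.

```python
# Comparing a student's answer to the true derivative by mathematical
# equivalence rather than identical form (assumes the sympy library).
import sympy as sp

x = sp.symbols("x")
# For f(x) = sin(x)*cos(x), the product rule gives one form of f'(x),
# while differentiating the equivalent sin(2x)/2 gives another.
true_derivative = sp.cos(x)**2 - sp.sin(x)**2
student_answer = sp.cos(2 * x)

# simplify(a - b) reduces to 0 exactly when the expressions are equivalent
equivalent = sp.simplify(student_answer - true_derivative) == 0
print(equivalent)   # True: this answer should not be penalised
```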


There is one more point worth mentioning. If you remember, in that class I didn’t ask you to work out the homework examples on the whiteboard and explain your procedure. We trusted together that if your answers are correct, your differentiation procedure is good. This approach is very common in machine learning and is sometimes perceived as one of its major disadvantages. The procedure the machine learns can essentially be a black box that we don’t really understand and whose workings we don’t know. But as long as it seems to work well on new examples, we are usually happy with it. Some people are not so happy with it, which has triggered a whole interpretable ML movement in the research community and become the focus of many research projects.

1.2 Machine learning

When we want a machine to perform some task, the classical approach is to programme the machine to do the specific task. For example, we could come up with a simple algorithm to return the maximum value of a set of integers. As you can imagine, writing an algorithm to differentiate a function is much more challenging, even if you yourself know how to do it.
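
For instance, a hand-written maximum-finding algorithm might look like the following sketch (the function name and the choice of Python are mine, purely for illustration):

```python
# Classical programming: we write the explicit procedure ourselves.
def maximum(values):
    """Return the largest element of a non-empty collection of integers."""
    best = None
    for v in values:
        if best is None or v > best:
            best = v
    return best

print(maximum({3, -1, 7, 4}))   # 7
```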

Machine learning comes into play for two principal reasons:

  1. for tasks that are too complex to programme: either we humans can achieve them but are not able to decompose our actions into a programme (e.g. driving a car, speech recognition, image understanding), or they are so complex that we are not even sure how to achieve them (e.g. turning medical imaging into a diagnosis, weather prediction, analysis of genomic data)
  2. for tasks that require the ability to adapt to variations, as opposed to rigid programmed algorithms (e.g. handwritten text recognition, speech recognition, spam detection)

You shouldn’t imagine some dark magic when we speak about machine learning. So far machines cannot learn on their own; they still need our help. They still need us to tell them how to do the learning: we still need to write a programme for them, this time the learning algorithm. The machine will then use this learning algorithm and the training data we give it to learn a new programme that performs the task we want.
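
As a toy illustration of this division of labour, the sketch below shows a hand-written learning algorithm that turns training data into a new programme (the prediction rule). The one-parameter hypothesis class and the data are invented purely for illustration.

```python
# A toy 'learning algorithm': a programme we write, which turns training
# data into a new programme (the prediction rule). The hypothesis class
# h(x) = a*x and the data below are made up for illustration.

def learning_algorithm(training_data):
    """Fit the parameter a of h(x) = a*x by least squares and return
    the learned prediction rule as an ordinary Python function."""
    a = (sum(x * y for x, y in training_data)
         / sum(x * x for x, _ in training_data))
    def learned_programme(x):
        return a * x
    return learned_programme

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # roughly y = 2x, with noise
predict = learning_algorithm(data)            # the machine learns a new programme
print(predict(10.0))                          # applies it to an input never seen
```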


There are different tasks we may want the machine to learn to perform. The basic learning tasks that current machine learning tries to address are usually categorised into:

  • supervised learning: The task is to use some input data to predict some output data, outcomes, or targets. During training, the machine sees examples of inputs and the corresponding outputs. At test time, it only receives the inputs and is asked to predict the outputs. In our example, the inputs are the different functions \(f(x)\) and the outputs their derivatives \(f'(x)\).
  • unsupervised learning: The task is to discover and describe some structure in the data or to come up with some higher-level or shorter description of the data. There is no notion of inputs and outputs and nothing to predict. An example would be discovering clusters of similar data or finding the most important axes of variation in the data (principal component analysis).
  • reinforcement learning: This is now a very popular area of ML research. Imagine a baby learning to walk. She wants to walk because she wants to discover the world. She cannot learn by following some rules; she has to try. But her trials lead to more or less success. Often when she tries, she falls down and it hurts. If she does not try, she is unhappy as well, because she cannot discover the toys at the other end of the room. So she keeps trying and, little by little, she discovers how to walk. A bit more formally for our ML case: during training the machine observes different inputs and outputs (actions), but these may not be the best possible outputs (actions). The machine only receives some evaluation of how good the outputs were (either immediately or at a later point). It then has to come up with its own rule to produce outputs (actions) given the inputs so as to do as well as possible. (The short sketch after this list contrasts the data the three settings work with.)
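
The following sketch contrasts the shape of the data available in the three settings; the toy numbers, dictionary keys, and action names are made up purely for illustration.

```python
# Toy illustration of the data each setting works with (all values made up).

# Supervised learning: pairs of inputs and the outputs to be predicted.
supervised_data = [([1.0, 2.0], 0), ([3.0, 1.5], 1), ([0.5, 0.5], 0)]

# Unsupervised learning: inputs only; the task is to describe their structure.
unsupervised_data = [[1.0, 2.0], [3.0, 1.5], [0.5, 0.5]]

# Reinforcement learning: tried actions and an evaluation (reward) of how
# good they were, rather than the best possible action.
rl_experience = [
    {"state": [0.0, 0.0], "action": "step_forward", "reward": 1.0},
    {"state": [0.1, 0.0], "action": "step_forward", "reward": -1.0},  # fell down
]
```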

In this course, we will primarily focus on supervised learning problems.

1.3 Formalizing supervised learning

In the supervised learning setting, the learning task consists of the following objects (a small code sketch after the list makes them concrete):

  • the input space \(\mathcal{X}\): this is the space of all the possible input data \(\mathbf{x}\); we indicate this as \(\mathbf{x} \in \mathcal{X}\). Often the input space is a space of \(d\)-dimensional vectors of real numbers, \(\mathcal{X} \subseteq \mathbb{R}^d\). But not always. For example, in our differentiation example the input set \(\mathcal{X}\) was the set of differentiable functions \(f(x)\).

  • the output space \(\mathcal{Y}\): this is the space of the outputs \(y \in \mathcal{Y}\). It may consist of real numbers, labels such as \(\{0, 1\}\), categories such as \(\{\text{cats}, \text{dogs}, \text{elephants}, \text{birds}\}\), and many more. In our differentiation example, the output space \(\mathcal{Y}\) was the set of all derivative functions \(f'(x)\).

  • target function \(f\): we assume that there is some function \(f: \mathcal{X} \to \mathcal{Y}\), some rule that maps each \(\mathbf{x}_i \in \mathcal{X}\) to \(y_i = f(\mathbf{x}_i)\). The problem is that this function is unknown. In fact, this is what the learning algorithm shall try to learn. In our running example, it is all the differentiation rules combined in the right order.

  • training data \(S_n = \big( (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n) \big)\): a finite set (of size \(n\)) of input-output pairs. These are the data the algorithm shall use while learning. Sometimes you may see these called the training examples or the training set. For us, these were the examples of functions and derivatives I gave you in class.

  • prediction rule \(h\): this is the outcome of the learner; it is the function \(h : \mathcal{X} \to \mathcal{Y}\) that the learner discovers to approximate the unknown target function \(f\). It is also called the prediction function, the hypothesis, or the classifier.

  • measure of error \(l\): also called the cost function (and sometimes idicated as \(c\)) is the function \(l : (\mathcal{Y} \times \mathcal{Y}) \to \mathbb{R}\) that measures how far are the outputs of the learned prediction rule \(h(\mathbf{x})\) from the true outputs produced by the unknown target funcion \(f(\mathbf{x})\). The choice of suitable cost function \(l\big( h(\mathbf{x}), f(\mathbf{x})\big) = l\big( \widehat{y}, y \big)\) has an important impact on our learning procedure and has to be appropriate for the problem at hand.