5 Descent methods

5.1 Convex minimization

We focus on the problem of convex minimization \[\min_w f(\mathbf{w}) \enspace ,\] where \(f : \mathbb{R}^d \to \mathbb{R}\) is a convex differentiable function. For example, \(f(\mathbf{w})\) can be the regularized loss \(L(\mathbf{w}) + \lambda R(\mathbf{w})\) discussed in chapter 4 Regularization. We also assume that the problem is actually solvable and we denote the optimal point by \(\mathbf{w}^*\). Since this is a differentiable convex optimization problem, the necessary and sufficient condition for a point to be optimal is \[\nabla f(\mathbf{w}) = 0 \enspace .\] Finding the solution to the minimization problem is thus the same as finding the solution to the above equation.

For example, in the linear regression problem where \[f(\mathbf{w}) = L_{MSE}(\mathbf{w}) = \frac{1}{n}||\mathbf{Xw} - \mathbf{y}||_2^2\] we have the gradient \[\nabla f(\mathbf{w}) = \frac{1}{n}(2\mathbf{X}^T\mathbf{Xw} - 2 \mathbf{X}^T\mathbf{y})\] and the solution is found from \(\nabla f(\mathbf{w}) = 0\) \[\frac{1}{n}(2\mathbf{X}^T\mathbf{Xw}^* - 2 \mathbf{X}^T\mathbf{y}) = 0\\ \mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y} \enspace ,\] assuming \(\mathbf{X}^T\mathbf{X}\) is invertible.
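As a small numerical sketch of this closed-form solution (synthetic data, assuming \(\mathbf{X}^T\mathbf{X}\) is invertible; all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                  # synthetic design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy targets

# Solve grad f(w) = 0, i.e. the normal equations X^T X w = X^T y.
# Solving the linear system is preferred over explicitly inverting X^T X.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)   # close to w_true
```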

However, it may not always be possible to find an analytical solution to the problem \(\nabla f(\mathbf{w}) = 0\). For example, in logistic regression we minimize the cross-entropy loss \[f(\mathbf{w}) = L_{lr}(\mathbf{w}) = -\sum_i^n y_i \log \sigma(\mathbf{x_i}^T \mathbf{w}) - \sum_i^n (1-y_i) \log (1 - \sigma(\mathbf{x_i}^T \mathbf{w})) \enspace .\] The corresponding gradient is \[\nabla f(\mathbf{w}) = \sum_i^n \big(\sigma(\mathbf{x_i}^T \mathbf{w}) - y_i \big) \mathbf{x}_i\]

and there is no simple analytic solution \(\mathbf{w}^*\) for the optimality condition \(\nabla f(\mathbf{w}) = 0\) \[\sum_i^n \big(\sigma(\mathbf{x_i}^T \mathbf{w}) - y_i \big) \mathbf{x}_i = 0 \enspace .\]
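Even though the optimality condition has no closed-form solution, the gradient itself is cheap to evaluate, which is what the descent methods below rely on. A minimal sketch (the function names are ours):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logreg_gradient(w, X, y):
    """Gradient of the cross-entropy loss: sum_i (sigma(x_i^T w) - y_i) x_i."""
    return X.T @ (sigmoid(X @ w) - y)
```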

5.1.1 Descent algorithms

When the minimizer \(\mathbf{w}^*\) of the convex differentiable problem \[\min_w f(\mathbf{w})\]

cannot be found by solving the optimality condition \(\nabla f(\mathbf{w}) = 0\), an alternative is to use an iterative algorithm which will produce the minimizing sequence \(\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots\) such that \(\mathbf{w}^{(k)} \to \mathbf{w}^{*}\) as \(k \to \infty\).

The iterative algorithm updates the value of \(\mathbf{w}\) at each step \(k=0,1,2,\ldots\) as \[\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + t^{(k)} \Delta \mathbf{w}^{(k)} \enspace .\] Here \(k\) is the iteration number, \(\Delta \mathbf{w}^{(k)}\) is the step or search direction at iteration \(k\), and \(t^{(k)}\) is the step size or the learning rate at iteration \(k\).

For a single iteration you can also write the above in a simplified form as \[\mathbf{w}^{+} = \mathbf{w} + t \, \Delta \mathbf{w} \quad \text{or} \quad \mathbf{w} := \mathbf{w} + t \, \Delta \mathbf{w} \enspace .\]

In descent algorithms, at every iteration \(k\) (except when \(\mathbf{w}^{(k)} = \mathbf{w}^{*}\)) it holds that \[f(\mathbf{w}^{(k+1)}) < f(\mathbf{w}^{(k)}) \enspace .\]

5.2 Gradient descent

[Figure: illustration of gradient descent. Image source: ML cheatsheet]

Gradient descent is an iterative descent algorithm in which the search direction \(\Delta \mathbf{w}\) is the negative gradient, \(\Delta \mathbf{w} = - \nabla f(\mathbf{w})\). Recall from calculus that the gradient is a vector in the space of \(\mathbf{w}\) which points in the direction of the steepest increase of the function \(f\); it tells you in which direction to change \(\mathbf{w}\) to increase \(f\) most rapidly. In gradient descent we want to decrease the function, so we change \(\mathbf{w}\) in the direction of the negative gradient.

The algorithm can be summarized as follows:

Gradient descent method

given a starting point \(\mathbf{w}\)
repeat

  1. Line search: choose a step size \(t\)
  2. Update: \(\mathbf{w} := \mathbf{w} - t \, \nabla f(\mathbf{w})\)

until stopping criteria

The stopping criterion usually checks whether the gradient becomes almost zero (near the minimum of \(f\) the tangent plane is nearly flat, parallel to the \(\mathbf{w}\) plane), \(||\nabla f(\mathbf{w})||_2 \leq \eta\), where \(\eta > 0\) is some very small number. Another possible criterion checks whether the decrease in the function values slows down and nearly stops, \(f(\mathbf{w}^{(k)}) - f(\mathbf{w}^{(k+1)}) \leq \eta\). In some cases, the algorithm is simply stopped after some fixed number of iterations.
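A minimal sketch of the whole loop with the gradient-norm stopping criterion (here with a fixed step size \(t\) instead of a line search; `grad_f`, `eta` and the other names are ours):

```python
import numpy as np

def gradient_descent(grad_f, w0, t=0.1, eta=1e-6, max_iter=10_000):
    """Gradient descent with a fixed step size and the stopping
    criterion ||grad f(w)||_2 <= eta (or a maximum number of iterations)."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(max_iter):
        g = grad_f(w)
        if np.linalg.norm(g) <= eta:   # near the minimum the gradient is almost zero
            break
        w = w - t * g                  # update: w := w - t * grad f(w)
    return w
```

For instance, `gradient_descent(lambda w: logreg_gradient(w, X, y), np.zeros(d), t=0.01)` would run it on the logistic regression loss sketched above.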

5.2.2 Practical considerations

Gradient descent is particularly sensitive to feature scaling. If the loss function is much more sensitive to changes in some parameters than in others (which typically originates from uneven scaling of the input variables), gradient descent will have difficulty converging. While in some directions the step size should be rather large to speed up convergence, it has to be small in the other directions so that the algorithm does not diverge. Feature normalization reshapes the loss contours to be closer to spherical (rather than elliptical), which allows for much longer steps and typically reduces the number of iterations needed for convergence.
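A common remedy is to standardize each input column before running gradient descent; a small sketch (the function name is ours):

```python
import numpy as np

def standardize(X):
    """Rescale each feature to zero mean and unit variance so that the
    loss contours become closer to spherical."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant columns
    return (X - mu) / sigma
```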

Gradient descent convergence for a skewed (elongated) function (on the left) may be difficult: the step size needs to be small so as not to diverge in some dimensions. For spherical functions (on the right) the steps may be much longer and the convergence is faster.

5.3 Gradient descent in ML

Gradient descent is the workhorse of machine learning. It is the default approach for convex optimization when the optimality condition \(\nabla L(\mathbf{w}) = 0\) cannot be solved or is too expensive and impractical to be solved directly.

In ML the function \(f\) we minimize with respect to the parameters \(\mathbf{w}\) is the loss function \(L(\mathbf{w})\) (or some extension of it such as the regularized loss function \(J(\mathbf{w})\)).

5.3.1 Batch gradient descent

In batch gradient descent the loss we minimize is the average (or the sum) of the problem-specific cost functions (e.g. squared errors for regression) across all the data in the training sample \(S_n = \{ (\mathbf{x}_i, y_i), i=1, \ldots, n \}\) \[L(\mathbf{w}) = \frac{1}{n} \sum_i^n c(\mathbf{w}, \mathbf{x}_i, y_i) \enspace .\] At each iteration of the batch gradient descent the updates are calculated as \[\mathbf{w} := \mathbf{w} - t \frac{1}{n} \sum_i^n \nabla_w \, c(\mathbf{w}, \mathbf{x}_i, y_i) \enspace .\] If the costs are convex (and differentiable), the algorithm (with some form of line search that ensures \(L(\mathbf{w}^{(k+1)}) < L(\mathbf{w}^{(k)})\)) is guaranteed to converge to the optimal parameters, \(\mathbf{w} \to \mathbf{w}^*\) as \(k \to \infty\). Note that this guarantee is only asymptotic. As discussed in the previous section, the speed of convergence depends on the shape of the cost function (on the data) and, in practice, on the choice of the learning rate \(t\).
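As a sketch, one batch update for a generic per-example cost gradient (here called `grad_c`, a placeholder name) could look like this:

```python
import numpy as np

def batch_gd_step(w, X, y, grad_c, t):
    """One batch gradient descent update: average the per-example
    gradients grad_c(w, x_i, y_i) over all n training points."""
    g = np.zeros_like(w)
    for x_i, y_i in zip(X, y):
        g += grad_c(w, x_i, y_i)
    return w - t * g / X.shape[0]
```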

The problem with batch gradient descent is its computational cost, which is \(\mathcal{O}(n)\) at every iteration. Suppose you increase the size of your training set from \(S_n\) with \(n\) instances to \(S_{2n}\) with twice as many. Each iteration of the gradient descent will then require twice as many computations. For very large data sets these updates may take prohibitively long. In some extreme cases, the whole data set may not even fit into memory, making it impossible to obtain the gradients at all.

5.3.2 Stochastic gradient descent

In stochastic gradient descent (SGD) instead of working over the whole training sample, we use at every iteration only a single data point (randomly chosen without replacement). The update at each iteration is therefore \[\mathbf{w} := \mathbf{w} - t \, \nabla_w \, c(\mathbf{w}, \mathbf{x}_i, y_i) \qquad \text{for } i=1, \ldots, n \enspace .\] In each of these updates, the gradient (and therefore the update direction) is not in the direction of the fastest decrease of the overall average loss \(L\). In fact, we can no longer speak of a descent algorithm because the single update may even occasionally increase the overall loss function.

Each sweep through all the data instances \(i=1, \ldots, n\) is usually called an epoch. In SGD we repeat the updates for multiple epochs until convergence. Before starting each epoch we shuffle the training examples so that each time there is a different order in the updates.
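A minimal sketch of SGD over several epochs with per-epoch shuffling (`grad_c` is again the placeholder per-example gradient):

```python
import numpy as np

def sgd(w0, X, y, grad_c, t=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent: one update per training example,
    reshuffling the order of the examples before every epoch."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(n_epochs):
        for i in rng.permutation(X.shape[0]):   # shuffle before each epoch
            w = w - t * grad_c(w, X[i], y[i])
    return w
```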

While at first sight the stochastic and batch gradient descent seem very similar, they are not. From calculus we know that the gradient of a sum is equal to the sum of gradients. Therefore performing an update using the gradient of the sum loss is the same as using the sum of gradients. However, the important difference is in the point \(\mathbf{w}\) at which the gradient is calculated. In the batch update, the gradient is calculated across all instances at a single starting point \(\mathbf{w}\). In the stochastic version, the \(\mathbf{w}\) at which the gradient is calculated is different for each data instance \((\mathbf{x}_i, y_i)\). \[\mathbf{w}_{batch} := \mathbf{w} - t \frac{1}{n} \sum_i^n \nabla_w \, c(\mathbf{w}, \mathbf{x}_i, y_i) = \mathbf{w} - \frac{t}{n} \nabla_w \, c(\mathbf{w}, \mathbf{x}_1, y_1) - \frac{t}{n} \nabla_w \, c(\mathbf{w}, \mathbf{x}_2, y_2) - \frac{t}{n} \nabla_w \, c(\mathbf{w}, \mathbf{x}_3, y_3) - \cdots\] \[\mathbf{w}_{stoch} := \mathbf{w}^{(0)} - \frac{t}{n} \nabla_w \, c(\mathbf{w}^{(0)}, \mathbf{x}_1, y_1) - \frac{t}{n} \nabla_w \, c(\mathbf{w}^{(1)}, \mathbf{x}_2, y_2) - \frac{t}{n} \nabla_w \, c(\mathbf{w}^{(2)}, \mathbf{x}_3, y_3) - \cdots\]

The path of stochastic gradient descent is typically much less direct than that of batch gradient descent. Is there still any guarantee of convergence similar to the batch gradient descent? It is a bit more tricky because moving in the descending direction of the single-example cost \(\nabla_w \, c(\mathbf{w}, \mathbf{x}_i, y_i)\) does not guarantee that \(L(\mathbf{w}^{(k+1)}) < L(\mathbf{w}^{(k)})\) at each step. Therefore searching for the right learning rate \(t\) for each single-example step \(\mathbf{w} := \mathbf{w} - t \, \nabla_w \, c(\mathbf{w}, \mathbf{x}_i, y_i)\) is pointless (and trying to find the right rate to descend the total loss would mean calculating the computationally expensive sum \(L(\mathbf{w}) = \frac{1}{n} \sum_i^n c(\mathbf{w}, \mathbf{x}_i, y_i)\), and would thus defeat the purpose of the stochastic gradient descent).

Instead of trying to search for the learning rate at every step, in SGD we typically apply some form of learning rate scheduling. The most common and simplest one is learning rate decay, which decreases the learning rate gradually as we progress to later epochs of the algorithm (e.g. \(t := 0.9t\) after each epoch). The speed of convergence depends on the initial value and the rate of decay. There are no universally best values for these and they usually have to be tuned manually by trying and checking the loss evolution.
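A small sketch of such a decay schedule (the initial rate and the factor 0.9 are only illustrative values and would have to be tuned):

```python
t = 0.1                   # initial learning rate (illustrative value)
for epoch in range(20):
    # ... run one epoch of SGD updates with step size t (as in the sketch above) ...
    t = 0.9 * t           # learning rate decay after each epoch
```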

Learning rate scheduling is an area of active ML research and many more sophisticated approaches than the simple rate decay have been proposed. You will discuss and use some of them in your neural network courses, where they are most often used.

5.3.3 Mini-batch gradient descent

Mini-batch gradient descent is midway between batch gradient descent and stochastic gradient descent: for each update it uses a batch (a subset) of randomly selected (without replacement) training data \[\mathbf{w} := \mathbf{w} - t \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_w \, c(\mathbf{w}, \mathbf{x}_i, y_i) \enspace .\] The batch size \(|\mathcal{B}|\) is bigger than \(1\) but smaller than \(n\), \(1 < |\mathcal{B}| < n\); usual sizes are 64, 128 or 256. The training size \(n\) does not have to be exactly divisible by the batch size; the last batch of each epoch can be smaller and simply contain whatever is left of the training set.
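A sketch along the same lines as the SGD code above (the last, possibly smaller, batch is handled by the slicing):

```python
import numpy as np

def minibatch_gd(w0, X, y, grad_c, t=0.01, batch_size=64, n_epochs=10, seed=0):
    """Mini-batch gradient descent: average the per-example gradients
    over a randomly selected batch at each update."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    n = X.shape[0]
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]      # last batch may be smaller
            g = np.mean([grad_c(w, X[i], y[i]) for i in batch], axis=0)
            w = w - t * g
    return w
```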

Most of what we explained for SGD holds for mini-batch gradient descent as well, since SGD can be seen as the extreme version of mini-batch gradient descent with batch size \(|\mathcal{B}| = 1\). In fact, when people talk about using SGD in practice, they usually have in mind the mini-batch version of gradient descent and not the extreme \(|\mathcal{B}| = 1\) version.

The advantage of mini-batches is that the variance of the gradients at each step is smaller compared to the single-example gradients in SGD. The path is therefore usually smoother and it may be easier to find a suitable learning rate and its decay.

From the above discussion it may seem that, other than in cases of extremely large datasets that do not fit into memory, there is not much reason to prefer mini-batch gradient descent over batch gradient descent: its path is usually longer and it is more difficult to find the right values for the learning rate and its decay. This is indeed true for rather nice convex optimization problems such as linear or logistic regression. However, many modern machine learning problems are non-convex and it is there that the true strengths of SGD (in practice, mini-batch gradient descent) really kick in.

5.4 Non-convex minimization

The previous sections focused on the problem of convex optimization, that is optimization of functions that are particularly nice and easy to optimize because we know that they only have a single minimum which is the global minimum. Many of the traditional ML methods (linear regression, logistic regression, support vector machines, etc.) were formulated as convex optimization problems to benefit from the nice properties and the machinery of convex optimization.

However, modern optimization problems are, more often than not, non-convex. For example, all of the standard neural networks use loss functions which are non-convex with respect to the network parameters \(\mathbf{w}\). The surface of a non-convex function is much more complex. There may be multiple local minima and saddle points, which pose a particular challenge to gradient-based methods because at both of these special points the gradient is equal to zero and therefore standard batch gradient descent will stop there.

[Figure: non-convex loss surface with local minima and saddle points. Image source: Paperspace, Intro to optimization in deep learning: Gradient Descent]

In fact, this is the main reason why SGD (mini-batch gradient descent) has become probably the most important training algorithm in the current ML.

One of the major problems of standard batch gradient descent is that it is very sensitive to the starting point of the descent and that it is not able to recover from descending into a local minimum. For example, if the batch gradient descent starts at point A in the figure, it will follow the descent direction at each step and inevitably end in the local minimum. If it starts on the little hill near the saddle point, it will descend to the saddle point and then get stuck because the gradient at the saddle point is zero. There is no way for classical batch gradient descent to avoid getting stuck at these points if it starts from an unfortunate starting point.

In contrast, SGD (and mini-batch gradient descent) do have a chance to avoid these due to a certain level of randomness in their path. As explained, these algorithms do not necessarily descend at each step and so may sometimes randomly escape a local minimum. They are also less likely to get stuck at the saddle point because while the gradient of the total loss \(L(\mathbf{w})\) over the whole training set is equal to zero there, the gradient of the cost of a single data point \(c(\mathbf{w}, \mathbf{x}_i, y_i)\) generally is not, and therefore the algorithm will continue.

The learning rate schedule is extremely important in these cases and many methods based on momentum have been proposed to keep the algorithm going even past local minima or saddle points. These are beyond the scope of this discussion and will be addressed in more detail in the neural networks part of the course.

5.5 Proximal gradient descent

We have seen in the course one class of problems that cannot be addressed by any of the optimization approaches explained so far. This is the class of non-differentiable problems. More specifically, we will focus here on problems in the form \[\min_w L(\mathbf{w}) + \lambda R(\mathbf{w}) \enspace ,\] where \(L\) is some convex differentiable function and \(R\) is a convex but non-differentiable function of \(\mathbf{w}\). Clearly, we cannot apply any of the gradient descent methods here because we cannot even obtain the gradient of \(R\).

Typical representatives of this class of problems are the sparse regularized learning methods, such as the lasso that we have seen in chapter 4 Regularization, which use non-differentiable regularizers.

In lasso, the regularizer is the sum of absolute values across the dimensions \(||\mathbf{w}||_1 = \sum_i^d |w_i|\), which as you know from calculus is non-differentiable (at any point where some \(w_i = 0\)) \[\widehat{\mathbf{w}}_{lasso} = \text{arg}\min_w \frac{1}{n}||\mathbf{X} \mathbf{w} - \mathbf{y}||_2^2 + \lambda \, ||\mathbf{w}||_1 \enspace . \] You can find various approaches to solving these problems in the vast literature (e.g. Bach, Francis. “Optimization with Sparsity-Inducing Penalties.” Foundations and Trends® in Machine Learning 4, no. 1 (2011): 1–106.).

We will focus here on a very popular and efficient approach called proximal gradient descent.

Proximal gradient descent method

given a starting point \(\mathbf{w}\)
repeat

  1. Line search: choose a step size \(t\)
  2. Gradient step: \(\mathbf{\tilde{w}} := \mathbf{w}^{(k)} - t \, \nabla L(\mathbf{w}^{(k)})\)
  3. Proximal step: \(\mathbf{w}^{(k+1)} := \textbf{prox}_{t \lambda R}(\mathbf{\tilde{w}})\)

until stopping criteria

As you can see, the algorithm is very similar to the standard batch gradient descent. The gradient step (step 2) is exactly the same as in standard gradient descent, using only the convex differentiable part \(L(\mathbf{w})\) of the objective function. However, there is a new proximal step which depends on the non-differentiable function \(R\). The line search also uses a somewhat different stopping condition.

5.5.1 Proximal operator

In the proximal step we find the updated value \(\mathbf{w}^{(k+1)}\) by applying a proximal operator (think of an operator as a fancy name for a function) \(\textbf{prox} : \mathbb{R}^d \to \mathbb{R}^d\), which maps from the space of the parameters to itself and which operates on the temporary value \(\mathbf{\tilde{w}}\) (the result of the gradient step).

The operator is defined as follows: \[\textbf{prox}_{t \lambda R}(\mathbf{\tilde{w}}) = \text{arg}\min_w \Big( \lambda R(\mathbf{w}) + \frac{1}{2t} ||\mathbf{\tilde{w}} - \mathbf{w}||_2^2 \Big) \enspace .\] An intuitive way to think about the proximal operator is that it starts at the point \(\mathbf{\tilde{w}}\) and tries to find a nearby point \(\mathbf{w}\) (a point in the proximity, a proximal point) that minimizes \(R(\mathbf{w})\). By nearby we mean within the ball around \(\mathbf{\tilde{w}}\) measured by the squared norm \(||\mathbf{\tilde{w}} - \mathbf{w}||_2^2\). The solution is a compromise between the distance from the starting point \(\mathbf{\tilde{w}}\) and the size of \(R(\mathbf{w})\), with the parameter \(t\) acting as the relative weight of the trade-off. Note that \(t\) is the same value as the step size (learning rate) used in the gradient step, which itself is found by the line search in the first step of the algorithm.

In the above definition of the proximal operator, \(\lambda\) is the hyperparameter of the original problem and we show it only for clarity in our specific lasso example. Usually you will find the proximal operator defined without \(\lambda\) in front of \(R\), as it can easily be absorbed into \(R\) by defining \(R(\mathbf{w}) := \lambda R(\mathbf{w})\).

To summarise the steps so far:

  • in the gradient step we move \(\mathbf{w}\) in the direction of the fastest descent of the convex differentiable loss function \(L\)
  • in the proximal step we shift from this temporary point \(\mathbf{\tilde{w}}\) to a nearby point which also keeps the non-differentiable regularizer \(R(\mathbf{w})\) small

At first sight, decomposing the algorithm into these two steps seems rather pointless. The proximal problem still contains the non-differentiable part \(R(\mathbf{w})\) and in fact looks rather similar to the original problem. However, it is much simpler than the original minimization problem because it does not depend on the loss function \(L\) or on the specific training data. Most importantly, for many common regularizers there exists a closed-form solution of the proximal problem which is easy to evaluate.

In the specific case of the \(\ell_1\) norm regularizer \(R(\mathbf{w}) = ||\mathbf{w}||_1\), the proximal operator can be evaluated element-wise as follows \[\textbf{prox}_{t \lambda ||.||_1}(\mathbf{\tilde{w}})_j = sign(\tilde{w}_j) (|\tilde{w}_j| - t\lambda)_+ = \begin{cases} \tilde{w}_j - t \lambda & \text{ if } \tilde{w}_j \geq t \lambda \\ 0 & \text{ if } |\tilde{w}_j| \leq t \lambda \\ \tilde{w}_j + t \lambda & \text{ if } \tilde{w}_j \leq - t \lambda \\ \end{cases} \enspace ,\] where \((x)_+ := \max(0,x)\). This is also often called the soft thresholding operator due to its effect on the starting value \(\mathbf{\tilde{w}}\). It shifts the value closer to the origin (makes it smaller if positive or bigger if negative) and cuts it off to zero if it lies between \(\pm t\lambda\). It is this cutting-off that makes the final solution \(\widehat{\mathbf{w}}_{lasso}\) sparse, with zeros in some dimensions indicating that not all input variables are important.
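A minimal sketch of the soft thresholding operator and of a resulting proximal gradient descent for lasso (simplified to a constant step size \(t\) instead of a line search; all names are ours):

```python
import numpy as np

def soft_threshold(w_tilde, threshold):
    """Element-wise soft thresholding: the proximal operator of threshold * ||.||_1."""
    return np.sign(w_tilde) * np.maximum(np.abs(w_tilde) - threshold, 0.0)

def lasso_proximal_gd(X, y, lam, t=1e-3, n_iter=1000):
    """Proximal gradient descent for lasso with a fixed step size t."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        grad_L = (2.0 / n) * X.T @ (X @ w - y)        # gradient of the MSE part
        w = soft_threshold(w - t * grad_L, t * lam)   # gradient step, then proximal step
    return w
```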

5.5.2 Line search

The line search in the proximal gradient descent has the usual purpose of ensuring that the composed objective \(J(\mathbf{w}) = L(\mathbf{w}) + \lambda R(\mathbf{w})\) decreases at every step, so that \(J(\mathbf{w}^{(k+1)}) < J(\mathbf{w}^{(k)})\) except when \(\mathbf{w}^{(k)} = \mathbf{w}^{*}\).

Backtracking line search for the iterative soft-thresholding algorithm (ISTA)

initialize \(t > 0, \alpha \in (0,1)\) (e.g. \(t=1, \, \alpha = 0.5\), but this depends on the specific data and problem)
repeat

  1. \(t := \alpha \, t\)
  2. \(\mathbf{\tilde{w}} := \mathbf{w} - t \, \nabla L(\mathbf{w})\)
  3. \(\mathbf{\tilde{\tilde{w}}} := \textbf{prox}_{t \lambda R}(\mathbf{\tilde{w}})\)

until \(J(\mathbf{\tilde{\tilde{w}}}) \leq L(\mathbf{w}) + (\mathbf{\tilde{\tilde{w}}}-\mathbf{w})^T \nabla L(\mathbf{w}) + \frac{1}{2t} ||\mathbf{\tilde{\tilde{w}}}-\mathbf{w}||_2^2 + \lambda R(\mathbf{\tilde{\tilde{w}}})\)

The original paper for the above algorithm is: Beck, Amir, and Marc Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences 2, no. 1 (January 2009): 183–202.
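A sketch of the full proximal gradient method with a backtracking line search of the above kind, written generically in terms of the smooth part \(L\), its gradient, and the proximal operator of \(R\) (the function names and default values are ours; the sufficient-decrease check uses the fact that the \(\lambda R(\mathbf{\tilde{\tilde{w}}})\) term appears on both sides of the condition and cancels):

```python
import numpy as np

def ista(w0, L_val, grad_L, prox, lam, t0=1.0, alpha=0.5, max_iter=500, tol=1e-8):
    """Proximal gradient descent (ISTA) with a backtracking line search.
    L_val, grad_L : value and gradient of the smooth part L(w)
    prox(v, s)    : proximal operator of s * R at v (soft thresholding for the l1 norm)
    """
    w = np.asarray(w0, dtype=float).copy()
    t = t0
    for _ in range(max_iter):
        g = grad_L(w)
        while True:
            w_new = prox(w - t * g, t * lam)           # gradient step, then proximal step
            diff = w_new - w
            # backtracking: accept the step once the sufficient-decrease condition holds
            if L_val(w_new) <= L_val(w) + diff @ g + (diff @ diff) / (2 * t):
                break
            t *= alpha                                  # shrink the step size and retry
        if np.linalg.norm(w_new - w) <= tol:            # stopping criterion
            return w_new
        w = w_new
    return w
```

For the lasso example one could pass `L_val = lambda w: np.mean((X @ w - y) ** 2)`, `grad_L = lambda w: (2.0 / len(y)) * X.T @ (X @ w - y)` and `prox = soft_threshold` from the sketch above.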