7 Multivariate functions

These notes reuse text and imagery from: Calculus Volume 1 from OpenStax, Print ISBN 193816802X, Digital ISBN 1947172131, https://www.openstax.org/details/calculus-volume-1

7.1 Functions of two and more variables

Definition: A function of two variables \(z = f(x, y)\) maps each ordered pair \((x,y)\) in a subset \(D \subseteq \mathbb{R}^2\) to a unique real number \(z\). \(D\) is the domain of the function and the set of all output values \(z\) is the range.

The graph of a function of two variables consists of the ordered triples \((x, y, z)\) with \(z = f(x, y)\) and is called a surface (3D plot).

3D graphing in GeoGebra


Another option is to use level curves. A level curve of a function of two variables \(f(x, y)\) for a value \(c\) is the set of points satisfying \(f(x, y) = c\).

Level curves are equivalent to the contour lines representing elevation in topographical maps. Here \(x\) and \(y\) represent the longitude and latitude and \(z\) represents the elevation.

Example: Contour map for the function \(f(x,y) = \sqrt{8 + 8x - 4y - 4x^2 - y^2}\) at values \(c = 0, 1, 2, 3, 4\)
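To reproduce such a contour map, here is a minimal sketch using numpy and matplotlib (the plotting ranges are illustrative choices; the function is only defined where the expression under the square root is non-negative):

```python
import numpy as np
import matplotlib.pyplot as plt

# f(x, y) = sqrt(8 + 8x - 4y - 4x^2 - y^2); defined only where the radicand >= 0
x = np.linspace(-1, 3, 400)
y = np.linspace(-4, 2, 400)
X, Y = np.meshgrid(x, y)
R = 8 + 8 * X - 4 * Y - 4 * X**2 - Y**2
Z = np.sqrt(np.where(R >= 0, R, np.nan))  # NaN marks points outside the domain

cs = plt.contour(X, Y, Z, levels=[0, 1, 2, 3, 4])
plt.clabel(cs, inline=True)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Level curves f(x, y) = c")
plt.show()
```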

The above concepts extend naturally to functions of more variables.

For example, \(f(x, y, z) = x^2 - 2xy - 3x + z^2\) is a function of 3 variables with domain \(D \subseteq \mathbb{R}^3\). Generally, a function of \(d\) variables has a domain \(D \subseteq \mathbb{R}^d\) and we often write it as \(f(x_1, x_2, \dots, x_d) = f(\mathbf{x})\).

Notation: We indicate by bold \(\mathbf{x}\) the vector \(\mathbf{x} = (x_1, x_2, \ldots, x_d)\).

Clearly, plotting surfaces of functions of more than 2 variables is impossible, and we therefore have to rely on our imagination when moving beyond 3-dimensional spaces (e.g. imagining the 4th dimension as time).

7.2 Partial derivatives and gradient

7.2.1 Partial derivatives

For a function of single variable \(y = f(x)\) the derivative \(f'(x) = \frac{dy}{dx} = \frac{df}{dx}\) is the instantaneous rate of change of the function with respect to \(x\).

For a function of two or more variables \(f(x_1, \ldots, x_d) = f(\mathbf{x})\) we can investigate the instantaneous rate of change of the function with respect to each individual variable \(x_i\) while the other variables \(x_j, j \neq i\), are held fixed. This is the partial derivative.

Definition: Let \(f(\mathbf{x}) = f(x_1, \ldots, x_d)\) be a function of \(d\) variables. The partial derivative of \(f\) with respect to a variable \(x_i\) is \[\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \ldots, x_d) - f(x_1, \dots, x_i, \ldots, x_d) }{h} \enspace .\] Notation: We use the rounded “d” symbol \(\partial f / \partial x\) to distinguish the partial derivative from the single variable derivative \(df / dx\).
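The limit definition translates directly into a numerical approximation. Here is a minimal sketch (the function \(f\) and the step size \(h\) are illustrative choices):

```python
def partial_derivative(f, x, i, h=1e-6):
    """Approximate the partial derivative of f with respect to x_i
    at the point x by a forward difference with a small step h."""
    x_shifted = list(x)
    x_shifted[i] += h
    return (f(*x_shifted) - f(*x)) / h

# Example: f(x, y) = x^2 * y; the exact partials are 2xy and x^2
f = lambda x, y: x**2 * y
print(partial_derivative(f, (2.0, 3.0), 0))  # ≈ 12.0 (= 2*2*3)
print(partial_derivative(f, (2.0, 3.0), 1))  # ≈ 4.0  (= 2^2)
```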

As in the univariate case, the partial derivative equals the slope of the tangent line at the point \(P = \big(\mathbf{x}, f(\mathbf{x}) \big)\) to the curve that passes through \(P\) parallel to the \(x_i\) axis (obtained by holding all other variables fixed).


For a differentiable function \(f\) of \(d\) variables there are altogether \(d\) partial derivatives \(\partial f / \partial x_i, \ i = 1, \ldots, d\). Each partial derivative is the slope of a tangent line passing through the point \(\big(\mathbf{x}, f(\mathbf{x}) \big)\) and parallel to the \(x_i\) axis. Together these tangent lines define the tangent plane to \(f\) at the point \(\big(\mathbf{x}, f(\mathbf{x}) \big)\).
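For \(d = 2\) this tangent plane has an explicit formula. Writing the point as \((a, b)\) (a notation chosen here for illustration), the tangent plane at \(\big( a, b, f(a, b) \big)\) is \[ z = f(a, b) + \frac{\partial f}{\partial x}(a, b)\,(x - a) + \frac{\partial f}{\partial y}(a, b)\,(y - b) \enspace .\]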


The differentiability of a multivariate function \(f\) is linked to the smoothness of the function. For a function to be differentiable at a point \(P\) it has to be continuous at the point.

7.2.2 Gradient

The gradient of a function \(f\) (we use the symbol \(\nabla\) called nabla) is a vector constructed from partial derivatives of the function \[\nabla f = \big( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d} \big)^T \enspace .\] For a function \(f\) of \(d\) variables the gradient is a \(d\)-dimensional vector \(\nabla f \in \mathbb{R}^d\).
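As a quick illustration, a minimal numerical sketch of assembling the gradient from forward-difference partial derivatives (the function and point are illustrative choices):

```python
import numpy as np

def gradient(f, x, h=1e-6):
    """Numerical gradient: the vector of forward-difference partial derivatives."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    fx = f(x)
    for i in range(x.size):
        xh = x.copy()
        xh[i] += h          # shift only the i-th coordinate
        g[i] = (f(xh) - fx) / h
    return g

f = lambda v: v[0]**2 + 3 * v[1]   # f(x, y) = x^2 + 3y, so grad f = (2x, 3)
print(gradient(f, [1.0, 2.0]))     # ≈ [2.0, 3.0]
```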

The gradient vector \(\nabla f(\mathbf{x})\) points in the direction of the most rapid increase of the function \(f\) from point \(\mathbf{x}\).

The gradient vector \(\nabla f(\mathbf{x})\) is orthogonal (normal) to the level curve of \(f\) passing through the point \(\mathbf{x}\).
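The steepest-ascent property can be checked numerically. The sketch below (with the illustrative choice \(f(x, y) = x^2 + y^2\) at the point \((1, 2)\)) compares the increase of \(f\) along many unit directions and confirms that the best one matches the normalised gradient:

```python
import numpy as np

f = lambda v: v[0]**2 + v[1]**2
x = np.array([1.0, 2.0])
grad = np.array([2 * x[0], 2 * x[1]])             # analytic gradient (2x, 2y)

t = 1e-3                                          # small step size
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
increases = np.array([f(x + t * u) - f(x) for u in directions])

best = directions[np.argmax(increases)]           # direction of largest increase
print(best, grad / np.linalg.norm(grad))          # nearly identical unit vectors
```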

7.2.3 Differentiation with respect to vector

In machine learning, instead of discussing the partial derivatives with respect to each element \(x_i\) of the input vector \(\mathbf{x} \in \mathbb{R}^d\), we often treat the whole vector \(\mathbf{x}\) as a single object. As a result, we sometimes refer to the gradient \(\nabla f(\mathbf{x})\) as the derivative of \(f\) with respect to \(\mathbf{x}\) (the whole vector).

This can be further extended to functions of multiple vectors. For example \(f(\mathbf{x}, \mathbf{w})\) indicates a function of two vectors \(\mathbf{x} \in \mathbb{R}^{d}, \mathbf{w} \in \mathbb{R}^{c}\).

In the course, we may sometimes use the term (partial) derivative of \(f\) with respect to \(\mathbf{x}\), indicated as \(\partial f / \partial \mathbf{x} = \nabla_x f\), to refer to the gradient of the function when treating \(\mathbf{x}\) as the input variable of the function and \(\mathbf{w}\) as a fixed constant.
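As a concrete example (a standard identity, with \(c = d\) so that the inner product is defined), take \(f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x} = \sum_{i=1}^{d} w_i x_i\). Then \[\frac{\partial f}{\partial x_i} = w_i \text{ for all } i, \qquad \text{so} \qquad \frac{\partial f}{\partial \mathbf{x}} = \nabla_x f = \mathbf{w} \enspace .\]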


One way to find the gradient of a function is to rely on the basic differentiation rules for a single variable and create the gradient as the vector of partial derivatives.

There are, however, also rules for differentiating a function directly with respect to a vector. A practical overview of these can be found, for example, in the popular book by Kaare Brandt Petersen and Michael Syskind Pedersen: The Matrix Cookbook.
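Two frequently used identities of this kind (standard results of vector calculus; here \(\mathbf{a} \in \mathbb{R}^d\) and \(\mathbf{A} \in \mathbb{R}^{d \times d}\) are constants): \[\frac{\partial}{\partial \mathbf{x}} \big( \mathbf{a}^T \mathbf{x} \big) = \mathbf{a}, \qquad \frac{\partial}{\partial \mathbf{x}} \big( \mathbf{x}^T \mathbf{A} \mathbf{x} \big) = (\mathbf{A} + \mathbf{A}^T)\, \mathbf{x} \enspace .\]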

7.2.4 Higher order derivatives

The partial derivative \(\partial f / \partial x_i\) of a function \(f\) is itself a function (of \(x_i\) and possibly the other input variables of \(f\)). We can thus calculate its partial derivatives to create higher-order partial derivatives.

For example, compare these four second-order partial derivatives \[ \frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x} \big[ \frac{\partial f}{\partial x} \big], \qquad \frac{\partial^2 f}{\partial x \partial y} = \frac{\partial}{\partial x} \big[ \frac{\partial f}{\partial y} \big], \qquad \frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y} \big[ \frac{\partial f}{\partial x} \big], \qquad \frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y} \big[ \frac{\partial f}{\partial y} \big] \] The first is the second-order partial derivative of \(f\) with respect to \(x\), the second and third are the mixed second-order partial derivatives of \(f\) with respect to \(x\) and \(y\) (they are equal whenever they are continuous, by Clairaut's theorem, so the order of differentiation typically does not matter), and the last is the second-order partial derivative of \(f\) with respect to \(y\).
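A quick symbolic check of the equality of the mixed partials (a minimal sketch using sympy; the choice of \(f\) is illustrative):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + sp.sin(x * y)       # an illustrative choice of f

fxy = sp.diff(sp.diff(f, y), x)    # ∂/∂x [∂f/∂y]
fyx = sp.diff(sp.diff(f, x), y)    # ∂/∂y [∂f/∂x]

print(sp.simplify(fxy - fyx))      # 0: the mixed partials agree here
```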

We can continue similarly with the differentiation for 3rd-, 4th-, etc. order partial derivatives.

7.3 Application

As in the single-variable case, we can use the partial derivatives to understand the shape of the function.

Definition: A critical point of a multivariate function \(f(\mathbf{x}), \mathbf{x} \in \mathbb{R}^d\) is a point \(\mathbf{c} \in \mathbb{R}^d\) for which \[\frac{\partial f}{\partial x_i} \Big|_{\mathbf{x} = \mathbf{c}} = 0, \text{ for all } i \in 1, \ldots, d \enspace .\]

The critical point \(\mathbf{c}\) is a local maximum if \(f(\mathbf{x}) \leq f(\mathbf{c})\) for all \(\mathbf{x}\) in a neighbourhood around \(\mathbf{c}\).
It is a local minimum if \(f(\mathbf{x}) \geq f(\mathbf{c})\) for all \(\mathbf{x}\) in a neighbourhood around \(\mathbf{c}\).

If \(f\) has a local extremum at \(\mathbf{c}\), then \(\mathbf{c}\) is a critical point.


There may be critical points (\(\nabla f(\mathbf{c}) = \mathbf{0}\)) which are not local extrema. We call these saddle points.

Around a saddle point, the function values are higher along some directions and lower along others. For example, \(f(x, y) = x^2 - y^2\) has a saddle point at the origin: the values increase along the \(x\) axis and decrease along the \(y\) axis.

To determine if the critical point is a local minimum or maximum we can use the second-derivative test.

In the case of a multivariate function, there are multiple second-order partial derivatives (see the list above for the case of 2 variables). We organise them into the Hessian matrix \(\mathbf{H}\) with elements \(H_{ij} = \partial^2 f / \partial x_i \partial x_j\) and then check the eigenvalues of the matrix (a worked sketch follows the list):

  • if the Hessian is positive definite (all eigenvalues are positive), the critical point is a local minimum
  • if the Hessian is negative definite (all eigenvalues are negative), the critical point is a local maximum
  • if the Hessian has both positive and negative eigenvalues, the critical point is a saddle point
  • if some eigenvalues are zero and the rest share the same sign, the test is inconclusive
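Below is a minimal sketch of the whole procedure using sympy and numpy, assuming the illustrative function \(f(x, y) = x^3 - 3x + y^2\): find the critical points, build the Hessian, and classify each point by its eigenvalues.

```python
import numpy as np
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 - 3 * x + y**2                        # an illustrative function

grad = [sp.diff(f, v) for v in (x, y)]
critical = sp.solve(grad, [x, y], dict=True)   # points where the gradient vanishes

# Hessian: matrix of all second-order partial derivatives
H = sp.Matrix([[sp.diff(f, a, b) for b in (x, y)] for a in (x, y)])

for c in critical:
    eig = np.linalg.eigvalsh(np.array(H.subs(c), dtype=float))
    if np.all(eig > 0):
        kind = "local minimum"
    elif np.all(eig < 0):
        kind = "local maximum"
    elif np.any(eig > 0) and np.any(eig < 0):
        kind = "saddle point"
    else:
        kind = "inconclusive"
    print(c, eig, kind)
# Expected: (1, 0) is a local minimum, (-1, 0) is a saddle point
```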