We will use optimization to motivate the development of multivariate calculus.
For a sufficiently smooth scalar function f: \mathbb{R} \mapsto \mathbb{R}, we have the second-order Taylor approximation
f(x + \Delta x) \approx f(x) + \Delta x \frac{d}{dx} f(x) + \frac{1}{2} (\Delta x)^2 \frac{d^2}{dx^2} f(x).
TODO: graph
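As a quick numerical check (a sketch; the test function f(x) = \log(1 + x) and the expansion point are arbitrary choices), the error of the second-order approximation shrinks like (\Delta x)^3:
# second-order Taylor approximation of f(x) = log(1 + x) at x = 1
f(x)   = log(1 + x)
df(x)  = 1 / (1 + x)           # first derivative
d2f(x) = -1 / (1 + x)^2        # second derivative
x = 1.0
for Δx in [0.1, 0.01, 0.001]
    taylor = f(x) + Δx * df(x) + 0.5 * Δx^2 * d2f(x)
    @show Δx, f(x + Δx) - taylor   # error is O(Δx^3)
end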
To generalize to a multivariate function f: \mathbb{R}^n \mapsto \mathbb{R}, we need two pieces of notation: the gradient vector \nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)' \in \mathbb{R}^n and the Hessian matrix \nabla^2 f(\mathbf{x}) = \left( \frac{\partial^2 f}{\partial x_i \partial x_j} \right) \in \mathbb{R}^{n \times n}.
Example: TODO.
For a sufficiently smooth multivariate function f: \mathbb{R}^n \mapsto \mathbb{R}, we have the second-order Taylor approximation:
f(\mathbf{x} + \Delta \mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})' \Delta \mathbf{x} + \frac{1}{2} \Delta \mathbf{x}' [\nabla^2 f(\mathbf{x})] \Delta \mathbf{x}.
For a vector function f: \mathbb{R}^n \mapsto \mathbb{R}^m with f(\mathbf{x}) = \begin{pmatrix} f_1(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{pmatrix}, the m \times n Jacobian matrix is \mathbf{J} = \begin{pmatrix} \nabla f_1' \\ \vdots \\ \nabla f_m' \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}. The Jacobian determinant (for m = n) is the determinant of \mathbf{J}; it appears in the change-of-variables formula for multi-dimensional integrals.
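These objects are easy to compute numerically. A small illustration (assuming the ForwardDiff.jl package is available; the test functions are arbitrary choices):
using ForwardDiff   # assumes ForwardDiff.jl is installed

h(x) = x[1]^2 + 3 * x[1] * x[2]            # h: R^2 -> R
x0 = [1.0, 2.0]
@show ForwardDiff.gradient(h, x0)          # ∇h = (2x1 + 3x2, 3x1) = (8, 3)
@show ForwardDiff.hessian(h, x0)           # ∇²h = [2 3; 3 0]

g(x) = [x[1]^2, x[1] * x[2], sin(x[2])]    # g: R^2 -> R^3
@show ForwardDiff.jacobian(g, x0)          # 3×2 Jacobian matrix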
Later we will develop the chain rule for vector and matrix functions.
Optimization aims to minimize a multivariate function f: \mathbb{R}^n \mapsto \mathbb{R}, possibly subject to constraints. Typical constraints include equality constraints g_i(\mathbf{x}) = 0 and inequality constraints h_j(\mathbf{x}) \le 0.
Statisticians often talk about maximization because of maximum likelihood estimation (MLE). Note that maximizing f is the same as minimizing -f.
For unconstrained optimization of a scalar function f, we know a local minimum x^\star satisfies f'(x^\star) = 0 (first-order condition) and f''(x^\star) \ge 0 (second-order condition).
Similarly, for unconstrained optimization of a multivariate function f: \mathbb{R}^n \mapsto \mathbb{R}, a local minimum \mathbf{x}^\star satisfies \nabla f(\mathbf{x}^\star) = \mathbf{0} and \nabla^2 f(\mathbf{x}^\star) \succeq \mathbf{O}_{n \times n}.
A large body of work in optimization generalizes these optimality conditions to the constrained case.
Convexity plays a key role in optimization.
TODO: A convex function sits between its tangent lines and its chords.
TODO: graph of convex and nonconvex sets/functions.
Examples: f_1(x) = ax + b, f_2(x) = x^2, and f_3(x) = \max(f_1(x), f_2(x)).
Intersection of convex sets is convex.
The maximum of two or more convex functions is always convex.
Proof: Let f(\mathbf{x}) = \sup_{i \in \mathcal{I}} f_i (\mathbf{x}). Then f_i(\alpha \mathbf{x} + (1 - \alpha) \mathbf{y}) \le \alpha f_i(\mathbf{x}) + (1 - \alpha) f_i(\mathbf{y}) \le \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}) for all i \in \mathcal{I}. Taking supremum over i on the left hand side yields f(\alpha \mathbf{x} + (1 - \alpha) \mathbf{y}) \le \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}).
Note the minimum of convex functions is usually not convex; for example, \min(x^2, (x-2)^2) is not convex.
The set of positive (semi)definite matrices is convex.
A twice differentiable function f is convex if \nabla^2 f(\mathbf{x}) \succeq \mathbf{O}_{n \times n} at all \mathbf{x}. It is strictly convex if \nabla^2 f(\mathbf{x}) \succ \mathbf{O}_{n \times n} at all \mathbf{x}.
Convexity prevents two separate local minima: for a convex function, any stationary point with \nabla f(\mathbf{x}) = \mathbf{0} is a global minimum.
Example:
\begin{eqnarray*}
&\text{minimize}& f(x_1,x_2) = x_1^2 + x_2^2 \\
&\text{subject to}& a_1 x_1 + a_2 x_2 = b.
\end{eqnarray*}
Define the Lagrangian
L(x_1, x_2, \lambda) = f(x_1, x_2) + \lambda (a_1 x_1 + a_2 x_2 - b) = x_1^2 + x_2^2 + \lambda (a_1 x_1 + a_2 x_2 - b),
where \lambda is the Lagrange multiplier. Setting the partial derivatives to zero,
\begin{eqnarray*}
\frac{\partial}{\partial x_1} L &=& 2 x_1 + \lambda a_1 = 0 \\
\frac{\partial}{\partial x_2} L &=& 2 x_2 + \lambda a_2 = 0 \\
\frac{\partial}{\partial \lambda} L &=& a_1 x_1 + a_2 x_2 - b = 0,
\end{eqnarray*}
and solving gives the optimal x_1, x_2, and \lambda:
\begin{eqnarray*}
x_1^\star &=& \frac{a_1 b}{a_1^2 + a_2^2} \\
x_2^\star &=& \frac{a_2 b}{a_1^2 + a_2^2} \\
\lambda^\star &=& - \frac{2b}{a_1^2 + a_2^2},
\end{eqnarray*}
with optimal objective value (x_1^\star)^2 + (x_2^\star)^2 = \frac{b^2}{a_1^2 + a_2^2}. Note that -\lambda^\star is simply the derivative of the minimum cost with respect to the constraint level b:
\frac{d}{d b} \left( \frac{b^2}{a_1^2 + a_2^2} \right) = \frac{2b}{a_1^2 + a_2^2} = - \lambda^\star.
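A quick numerical check of these formulas (the values a_1 = 1, a_2 = 2, b = 3 are arbitrary): the stationarity conditions and the constraint form a 3 \times 3 linear system.
a1, a2, b = 1.0, 2.0, 3.0
# stationarity conditions and the constraint as a linear system in (x1, x2, λ)
M = [2.0 0.0 a1; 0.0 2.0 a2; a1 a2 0.0]
x1, x2, λ = M \ [0.0, 0.0, b]
@show x1, x2, λ
# compare with the closed-form solution
@show a1 * b / (a1^2 + a2^2), a2 * b / (a1^2 + a2^2), -2b / (a1^2 + a2^2)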
We generalize the above example to the problem of minimizing a convex quadratic function subject to linear constraints. For \mathbf{S} \succ \mathbf{0} and \mathbf{A} \in \mathbb{R}^{n \times m},
\begin{eqnarray*}
&\text{minimize}& \frac 12 \mathbf{x}' \mathbf{S} \mathbf{x} \\
&\text{subject to}& \mathbf{A}' \mathbf{x} = \mathbf{b}.
\end{eqnarray*}
The Lagrangian function is
L(\mathbf{x}, \boldsymbol{\lambda}) = \frac 12 \mathbf{x}' \mathbf{S} \mathbf{x} + \boldsymbol{\lambda}' (\mathbf{A}' \mathbf{x} - \mathbf{b}),
where \boldsymbol{\lambda} \in \mathbb{R}^m is the vector of Lagrange multipliers. Solving the equations
\begin{eqnarray*}
\nabla_\mathbf{x} L(\mathbf{x}, \boldsymbol{\lambda}) &=& \mathbf{S} \mathbf{x} + \mathbf{A} \boldsymbol{\lambda} = \mathbf{0}_n \\
\nabla_\boldsymbol{\lambda} L(\mathbf{x}, \boldsymbol{\lambda}) &=& \mathbf{A}' \mathbf{x} - \mathbf{b} = \mathbf{0}_m,
\end{eqnarray*}
or equivalently
\begin{pmatrix} \mathbf{S} & \mathbf{A} \\ \mathbf{A}' & \mathbf{O} \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ \boldsymbol{\lambda} \end{pmatrix} = \begin{pmatrix} \mathbf{0}_n \\ \mathbf{b} \end{pmatrix} \quad \quad (\text{saddle point matrix or KKT matrix})
yields the solution
\begin{eqnarray*}
\mathbf{x}^\star &=& \mathbf{S}^{-1} \mathbf{A} (\mathbf{A}' \mathbf{S}^{-1} \mathbf{A})^{-1} \mathbf{b} \\
\boldsymbol{\lambda}^\star &=& - (\mathbf{A}' \mathbf{S}^{-1} \mathbf{A})^{-1} \mathbf{b}.
\end{eqnarray*}
Further calculation shows the minimum cost f^\star = \frac 12 \mathbf{b}' (\mathbf{A}' \mathbf{S}^{-1} \mathbf{A})^{-1} \mathbf{b} and its gradient \nabla_\mathbf{b} f^\star = (\mathbf{A}' \mathbf{S}^{-1} \mathbf{A})^{-1} \mathbf{b} = - \boldsymbol{\lambda}^\star.
The saddle point matrix (or KKT matrix) has n positive eigenvalues and m negative eigenvalues. Accordingly, the Lagrangian function is convex in \mathbf{x} and concave in \boldsymbol{\lambda}.
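A sketch verifying the closed-form solution and the eigenvalue signs on random data (the dimensions n = 5, m = 2 are arbitrary):
using LinearAlgebra

n, m = 5, 2
A = randn(n, m)
S = let B = randn(n, n); B * B' + n * I end    # random S ≻ 0
b = randn(m)

K = [S A; A' zeros(m, m)]                      # saddle point (KKT) matrix
z = K \ [zeros(n); b]
x, λ = z[1:n], z[n+1:end]
@show norm(x - S \ (A * ((A' * (S \ A)) \ b))) # ≈ 0: matches x* = S⁻¹A(A'S⁻¹A)⁻¹b
@show norm(λ + (A' * (S \ A)) \ b)             # ≈ 0: matches λ* = -(A'S⁻¹A)⁻¹b
# inertia: n positive and m negative eigenvalues
@show count(>(0), eigvals(Symmetric(K))), count(<(0), eigvals(Symmetric(K)))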
Let \mathbf{y}_1, \ldots, \mathbf{y}_n be iid samples from a multivariate normal distribution N(\boldsymbol{\mu}, \boldsymbol{\Omega}). The log-likelihood is
L(\boldsymbol{\mu}, \boldsymbol{\Omega}) = - \frac n2 \log \det \boldsymbol{\Omega} - \frac 12 \sum_{i=1}^n (\mathbf{y}_i - \boldsymbol{\mu})' \boldsymbol{\Omega}^{-1} (\mathbf{y}_i - \boldsymbol{\mu}).
We want to maximize L(\boldsymbol{\mu}, \boldsymbol{\Omega}), or equivalently minimize
f(\boldsymbol{\mu}, \boldsymbol{\Omega}) = - L(\boldsymbol{\mu}, \boldsymbol{\Omega}) = \frac n2 \log \det \boldsymbol{\Omega} + \frac 12 \sum_{i=1}^n (\mathbf{y}_i - \boldsymbol{\mu})' \boldsymbol{\Omega}^{-1} (\mathbf{y}_i - \boldsymbol{\mu}).
The MLE is achieved by
\begin{eqnarray*}
\widehat{\boldsymbol{\mu}} &=& \frac{\sum_{i=1}^n \mathbf{y}_i}{n} \\
\widehat{\boldsymbol{\Omega}} &=& \frac{\sum_{i=1}^n (\mathbf{y}_i - \widehat{\boldsymbol{\mu}})(\mathbf{y}_i - \widehat{\boldsymbol{\mu}})'}{n}.
\end{eqnarray*}
TODO: derivation.
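Pending the derivation, a quick simulation check of these formulas (assuming the Distributions.jl package; the parameter values are arbitrary):
using Distributions, LinearAlgebra, Statistics

n, p = 10_000, 3
μ = [1.0, -2.0, 0.5]
Ω = [2.0 0.5 0.0; 0.5 1.0 0.3; 0.0 0.3 1.5]
Y = rand(MvNormal(μ, Ω), n)         # p × n matrix of iid samples
μ̂ = vec(mean(Y, dims=2))            # sample mean
Ω̂ = (Y .- μ̂) * (Y .- μ̂)' / n        # sample covariance with denominator n
@show norm(μ̂ - μ), norm(Ω̂ - Ω)      # both small for large n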
Consider a multinomial experiment with n trials and observed outcomes n_1, \ldots, n_m over m categories. The maximum likelihood estimate (MLE) seeks the maximizer of the likelihood
L(p_1,\ldots,p_m) = \binom{n}{n_1 \cdots n_m} \prod_{i=1}^m p_i^{n_i}.
The log-likelihood is
\log L(p_1,\ldots,p_m) = \log \binom{n}{n_1 \cdots n_m} + \sum_{i=1}^m n_i \log p_i.
To minimize the negative log-likelihood subject to the constraint p_1 + \cdots + p_m = 1, we find the stationary point of the Lagrangian
L(p_1,\ldots,p_m, \lambda) = - \log \binom{n}{n_1 \cdots n_m} - \sum_{i=1}^m n_i \log p_i + \lambda \left( \sum_{i=1}^m p_i - 1 \right).
Solving
\frac{\partial}{\partial p_i} L = - \frac{n_i}{p_i} + \lambda = 0
gives
\frac{n_i}{p_i} = \lambda, \quad \text{i.e., } p_i = \frac{n_i}{\lambda}.
Combining with the constraint \sum_i p_i = \sum_i n_i / \lambda = n / \lambda = 1 yields
\begin{eqnarray*}
p_i^\star &=& \frac{n_i}{n}, \quad i=1,\ldots,m, \\
\lambda^\star &=& n.
\end{eqnarray*}
Note the nonnegativity constraints p_i \ge 0 are automatically satisfied.
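A small numerical sanity check (the counts are arbitrary): the stationary point p_i^\star = n_i/n attains a higher log-likelihood than nearby feasible points.
nvec = [30, 50, 20]                 # observed counts over m = 3 categories
n = sum(nvec)
p̂ = nvec / n                        # MLE: pᵢ = nᵢ / n
loglik(p) = sum(nvec .* log.(p))    # log-likelihood up to the multinomial constant
@show loglik(p̂)
# perturb within the simplex: move mass ε from category 1 to category 2
ε = 0.01
@show loglik(p̂ + [-ε, ε, 0.0])      # smaller than loglik(p̂)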
We are looking for a point \mathbf{x}^\star such that \nabla f(\mathbf{x}^\star) = \mathbf{0}. Multivariate calculus gives us a principled way to move towards such an \mathbf{x}^\star.
Second-order Taylor approximation to a function f at current iterate \mathbf{x}^{(t)} says f(\mathbf{x}^{(t)} + \Delta \mathbf{x}) \approx f(\mathbf{x}^{(t)}) + \nabla f(\mathbf{x}^{(t)})' \Delta \mathbf{x} + \frac 12 \Delta \mathbf{x}' [\nabla^2 f(\mathbf{x}^{(t)})] \Delta \mathbf{x}. Which direction \Delta \mathbf{x} shall we move from \mathbf{x}^{(t)}?
Minimizing the quadratic approximation gives the Newton direction \Delta \mathbf{x}_{\text{newton}} = - [\nabla^2 f(\mathbf{x}^{(t)})]^{-1} \nabla f(\mathbf{x}^{(t)}).
So the Newton method iterates according to \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \Delta \mathbf{x}_{\text{newton}} = \mathbf{x}^{(t)} - [\nabla^2 f(\mathbf{x}^{(t)})]^{-1} \nabla f(\mathbf{x}^{(t)}).
Newton's method converges quadratically near the optimum: \|\mathbf{x}^{(t+1)} - \mathbf{x}^\star\| \le C \|\mathbf{x}^{(t)} - \mathbf{x}^\star\|^2.
Example: f(x) = \frac 13 x^3 - 4x with \nabla f(x) = x^2 - 4 and \nabla^2 f(x) = 2x. Newton's iterates are x^{(t+1)} = x^{(t)} - \frac{(x^{(t)})^2-4}{2x^{(t)}} = \frac{1}{2} \left( x^{(t)} + \frac{4}{x^{(t)}} \right). Let's start from x^{(0)}=2.5.
x = 2.5                        # starting point
for iter in 1:5
    x = 0.5 * (x + 4 / x)      # Newton update x - (x² - 4) / (2x)
    @show x, x^3 / 3 - 4x      # iterate and objective value
end
# iterates converge to the minimizer x* = 2; correct digits roughly double each step
If we use a scaled identity matrix as a very crude approximation of the Hessian matrix, we obtain the classical gradient descent, or steepest descent, algorithm: \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - s^{(t)} \nabla f(\mathbf{x}^{(t)}).
The gradient descent algorithm has gained tremendous interest in modern, large-scale optimization problems, because calculating the Hessian is usually quite expensive.
Example: minimize the bivariate quadratic function f(x_1, x_2) = \frac 12 (x_1^2 + bx_2^2), where 0 < b < 1. The gradient is
\nabla f(x_1, x_2) = \begin{pmatrix} x_1 \\ bx_2 \end{pmatrix},
so the gradient descent iteration is
\begin{eqnarray*}
\begin{pmatrix} x_1^{(t+1)} \\ x_2^{(t+1)} \end{pmatrix} &=& \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix} - s^{(t)} \begin{pmatrix} x_1^{(t)} \\ bx_2^{(t)} \end{pmatrix} \\
&=& \begin{pmatrix} (1 - s^{(t)}) x_1^{(t)} \\ (1 - bs^{(t)})x_2^{(t)} \end{pmatrix}.
\end{eqnarray*}
What is the optimal step length s (exact line search)? The objective as a function of s,
f(s) = \frac 12 \left[(1-s)^2 (x_1^{(t)})^2 + b(1-bs)^2 (x_2^{(t)})^2 \right],
is minimized at
s^{(t)} = \frac{(x_1^{(t)})^2 + b^2 (x_2^{(t)})^2}{(x_1^{(t)})^2 + b^3 (x_2^{(t)})^2}.
Let's assume the starting point (x_1^{(0)}, x_2^{(0)}) = (b, 1) to simplify the analysis. Then
s^{(0)} = \frac{b^2 + b^2}{b^2 + b^3} = \frac{2}{b + 1}
and
\begin{pmatrix} x_1^{(1)} \\ x_2^{(1)} \end{pmatrix} = \begin{pmatrix} \left( \frac{b-1}{b+1} \right) x_1^{(0)} \\ \left( \frac{1-b}{b+1} \right) x_2^{(0)} \end{pmatrix} = \begin{pmatrix} b \left( \frac{b-1}{b+1} \right) \\ \left( \frac{1-b}{b+1} \right) \end{pmatrix}.
Continuing this way, we find
\begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix} = \begin{pmatrix} b \left( \frac{b-1}{b+1} \right)^t \\ \left( \frac{1-b}{b+1} \right)^t \end{pmatrix}
and
f(x_1^{(t)}, x_2^{(t)}) = \frac 12 b (1 + b) \left( \frac{b-1}{b+1} \right)^{2t} = \left( \frac{b-1}{b+1} \right)^{2t} f(x_1^{(0)}, x_2^{(0)}).
The following graph shows the zig-zagging pattern of the gradient descent iterates:
using Plots; gr()
using LaTeXStrings

b = 0.1
f(x1, x2) = 0.5 * (x1^2 + b * x2^2)
s = 2 / (1 + b)                    # optimal step length
x1, x2 = b, 1.0                    # starting point
x1iter, x2iter, fiter = [x1], [x2], [f(x1, x2)]  # store iterates and objective values
for iter in 1:15
    x1 *= (1 - s)
    x2 *= (1 - s * b)
    push!(x1iter, x1)
    push!(x2iter, x2)
    push!(fiter, f(x1, x2))
end
# contour plot of f with levels at the iterate objective values
x1 = -1:0.01:1
x2 = -1:0.01:1
X1 = repeat(reshape(x1, 1, :), length(x2), 1)
X2 = repeat(x2, 1, length(x1))
Z = map(f, X1, X2)
p = contour(x1, x2, Z, levels=fiter)
plot(p, xlabel=L"x_1", ylabel=L"x_2", title="GD");
# animate the iterates
anim = @animate for iter in 1:length(x1iter)-1
    Plots.plot!(p, x1iter[iter:iter+1], x2iter[iter:iter+1], shape=:circle)
    Plots.annotate!(p, x1iter[iter], x2iter[iter], text(latexstring("x^{($iter)}"), :right))
end
gif(anim, "./gd.gif", fps = 1);
The above analysis generalizes to any strongly convex function satisfying
m \mathbf{I}_n \preceq \nabla^2 f(\mathbf{x}) \preceq M \mathbf{I}_n \text{ for all } \mathbf{x}
for some constants M > m > 0. Equivalently, all eigenvalues of \nabla^2 f(\mathbf{x}) lie in [m, M].
Theorem: for any strongly convex function f, the gradient descent with exact line search satisfies f(\mathbf{x}^{(t+1)}) - f(\mathbf{x}^\star) \le \left( 1 - \frac mM \right) [f(\mathbf{x}^{(t)}) - f(\mathbf{x}^\star)], where \mathbf{x}^\star is the optimum.
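For the bivariate example above, m = b and M = 1, so the theorem guarantees a per-step reduction factor of at most 1 - b, while the exact line search analysis gave the sharper factor \left( \frac{1-b}{1+b} \right)^2. A one-line comparison with b = 0.1:
b = 0.1
@show 1 - b                      # theorem's guaranteed factor: 0.9
@show ((1 - b) / (1 + b))^2      # exact factor for this quadratic: ≈ 0.669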
In practice, we often perform inexact line search such as backtracking, because exact line search is expensive.
To mitigate the frustrating zig-zagging of gradient descent, Polyak proposed adding a momentum term with coefficient \beta to the gradient. The idea is that a heavy ball rolling downhill zig-zags less.
Gradient descent with momentum:
\begin{eqnarray*}
\mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - s \mathbf{z}^{(t)} \text{ with } \\
\mathbf{z}^{(t)} &=& \nabla f(\mathbf{x}^{(t)}) + \beta \mathbf{z}^{(t-1)}.
\end{eqnarray*}
For a quadratic objective function f(\mathbf{x}) = \frac 12 \mathbf{x}' \mathbf{S} \mathbf{x}, the optimal step length s and momentum coefficient \beta are given by (derivation omitted)
\begin{eqnarray*}
s &=& \left( \frac{2}{\sqrt{\lambda_{\text{max}}} + \sqrt{\lambda_{\text{min}}}} \right)^2 \\
\beta &=& \left( \frac{\sqrt{\lambda_{\text{max}}} - \sqrt{\lambda_{\text{min}}}}{\sqrt{\lambda_{\text{max}}} + \sqrt{\lambda_{\text{min}}}} \right)^2,
\end{eqnarray*}
where \lambda_{\text{min}} and \lambda_{\text{max}} are the minimum and maximum eigenvalues of \mathbf{S}.
Back to our 2-dimensional example f(x_1, x_2) = \frac 12 (x_1^2 + b x_2^2). We have s = \left( \frac{2}{1+\sqrt b} \right)^2, \quad \beta = \left( \frac{1 - \sqrt b}{1 + \sqrt b} \right)^2, and f(x_1^{(t)}, x_2^{(t)}) = \left( \frac{1 - \sqrt b}{1 + \sqrt b} \right)^{2t} f(x_1^{(0)}, x_2^{(0)}). The convergence rate improves from \left( \frac{1 - b}{1 + b} \right)^{2} to \left( \frac{1 - \sqrt b}{1 + \sqrt b} \right)^{2}.
The following graph shows the iterates of gradient descent with momentum:
b = 0.1
f(x1, x2) = 0.5 * (x1^2 + b * x2^2)
s = (2 / (1 + sqrt(b)))^2              # optimal step length
β = ((1 - sqrt(b)) / (1 + sqrt(b)))^2  # optimal momentum coefficient
x1, x2 = b, 1.0                        # starting point
z1, z2 = 0.0, 0.0                      # momentum terms
x1iter, x2iter, fiter = [x1], [x2], [f(x1, x2)]  # store iterates and objective values
for iter in 1:15
    z1 = x1 + β * z1                   # z = ∇f(x) + β z; here ∇f = (x1, b x2)
    z2 = b * x2 + β * z2
    x1 -= s * z1
    x2 -= s * z2
    push!(x1iter, x1)
    push!(x2iter, x2)
    push!(fiter, f(x1, x2))
end
x1 = -1:0.01:1
x2 = -1:0.01:1
X1 = repeat(reshape(x1, 1, :), length(x2), 1)
X2 = repeat(x2, 1, length(x1))
Z = map(f, X1, X2)
p = contour(x1, x2, Z, levels=fiter)
plot(p, xlabel=L"x_1", ylabel=L"x_2", title="GD with Momentum");
anim = @animate for iter in 1:length(x1iter)-1
    Plots.plot!(p, x1iter[iter:iter+1], x2iter[iter:iter+1], shape=:circle)
    Plots.annotate!(p, x1iter[iter], x2iter[iter], text(latexstring("x^{($iter)}"), :right))
end
gif(anim, "./gd_momentum.gif", fps = 1);
Yurii Nesterov proposed that, instead of evaluating the gradient \nabla f at the current iterate \mathbf{x}^{(t)}, we shift the evaluation point to \mathbf{x}^{(t)} + \gamma (\mathbf{x}^{(t)} - \mathbf{x}^{(t-1)}).
Nesterov accelerated gradient descent:
\begin{eqnarray*}
\mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - s \nabla f(\mathbf{y}^{(t)}) \\
\mathbf{y}^{(t+1)} &=& \mathbf{x}^{(t+1)} + \gamma (\mathbf{x}^{(t+1)} - \mathbf{x}^{(t)}).
\end{eqnarray*}
For a quadratic objective function f(\mathbf{x}) = \frac 12 \mathbf{x}' \mathbf{S} \mathbf{x}, the optimal step length s and shift coefficient \gamma are given by (derivation omitted)
\begin{eqnarray*}
s &=& \frac{1}{\lambda_{\text{max}}} \\
\gamma &=& \frac{\sqrt{\lambda_{\text{max}}} - \sqrt{\lambda_{\text{min}}}}{\sqrt{\lambda_{\text{max}}} + \sqrt{\lambda_{\text{min}}}},
\end{eqnarray*}
where \lambda_{\text{min}} and \lambda_{\text{max}} are the minimum and maximum eigenvalues of \mathbf{S}.
The convergence rate improves from \left( \frac{1 - b}{1 + b} \right)^{2} to 1 - \sqrt b.
b = 0.1
f(x1, x2) = 0.5 * (x1^2 + b * x2^2)
s = 1.0                            # optimal step length 1/λmax
γ = (1 - sqrt(b)) / (1 + sqrt(b))  # optimal Nesterov coefficient
x1, x2 = b, 1.0                    # starting point
y1, y2 = x1, x2                    # shifted evaluation point
x1iter, x2iter, fiter = [x1], [x2], [f(x1, x2)]  # store iterates and objective values
for iter in 1:15
    x1prev, x2prev = x1, x2
    x1 = y1 - s * y1               # gradient step at y; ∇f(y) = (y1, b y2)
    x2 = y2 - s * b * y2
    y1 = x1 + γ * (x1 - x1prev)    # shift the evaluation point
    y2 = x2 + γ * (x2 - x2prev)
    push!(x1iter, x1)
    push!(x2iter, x2)
    push!(fiter, f(x1, x2))
end
x1 = -1:0.01:1
x2 = -1:0.01:1
X1 = repeat(reshape(x1, 1, :), length(x2), 1)
X2 = repeat(x2, 1, length(x1))
Z = map(f, X1, X2)
p = contour(x1, x2, Z, levels=fiter)
plot(p, xlabel=L"x_1", ylabel=L"x_2", title="GD with Nesterov");
anim = @animate for iter in 1:length(x1iter)-1
    Plots.plot!(p, x1iter[iter:iter+1], x2iter[iter:iter+1], shape=:circle)
    Plots.annotate!(p, x1iter[iter], x2iter[iter], text(latexstring("x^{($iter)}"), :right))
end
gif(anim, "./gd_nesterov.gif", fps = 1);
In machine learning and statistics, we need to minimize the loss function of training data L(\mathbf{x}) = \frac 1N \sum_{i=1}^N \ell(\mathbf{x}, \mathbf{v}_i), where \mathbf{x} is the parameter to be fitted, N is the training sample size, and \mathbf{v}_i is the i-th sample.
For example, in the least squares problem, we minimize L(\mathbf{x}) = \frac 1N \|\mathbf{b} - \mathbf{A} \mathbf{x}\|^2 = \frac 1N \sum_{i=1}^N (b_i - \mathbf{a}_i' \mathbf{x})^2.
Issues of the (classical) gradient descent scheme \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - s^{(t)} \nabla L(\mathbf{x}^{(t)}) include: (1) each iteration computes the full gradient \nabla L, which requires a pass over all N training samples and is expensive for large N; (2) the entire training set must be available (in memory, or by repeated passes) at every step, which is impractical for very large or streaming data.
Stochastic gradient descent (SGD) instead uses only a mini-batch (of size B) of the training data at each step:
\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - s^{(t)} \frac{1}{B} \sum_{i \in \text{mini-batch}} \nabla \ell(\mathbf{x}, \mathbf{v}_i).
Stochastic gradient descent addresses both issues of gradient descent and is the workhorse of deep learning.
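A minimal sketch of mini-batch SGD on a simulated least squares problem (the data, batch size B = 10, constant step size, and iteration count are all arbitrary choices):
using Random, LinearAlgebra

Random.seed!(123)
N, p, B = 1000, 5, 10
A = randn(N, p)
xtrue = randn(p)
bvec = A * xtrue + 0.1 * randn(N)     # noisy observations
x = zeros(p)                          # parameter iterate
s = 0.01                              # constant step size
for t in 1:2000
    idx = rand(1:N, B)                # sample a mini-batch
    # mini-batch gradient of (1/B) Σᵢ (bᵢ - aᵢ'x)²
    g = -2 / B * A[idx, :]' * (bvec[idx] - A[idx, :] * x)
    x -= s * g
end
@show norm(x - xtrue)                 # close to the truth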