
1  Optimization and Multivariate Calculus

We will use optimization to motivate the development of multivariate calculus.

1.1  Multivariate calculus

  • For a sufficiently smooth scalar function f: \mathbb{R} \to \mathbb{R}, we have the second-order Taylor approximation:
    f(x + \Delta x) \approx f(x) + \Delta x \frac{d}{dx} f(x) + \frac{1}{2} (\Delta x)^2 \frac{d^2}{dx^2} f(x).

  • TODO: graph

  • To generalize to a multivariate function f: \mathbb{R}^n \to \mathbb{R}, we need some notation:

    • Gradient: \nabla f(\mathbf{x}) = \begin{pmatrix} \frac{\partial}{\partial x_1} f(\mathbf{x}) \\ \vdots \\ \frac{\partial}{\partial x_n} f(\mathbf{x}) \end{pmatrix}.
    • Hessian: \mathbf{H}(\mathbf{x}) = \nabla^2 f(\mathbf{x}) = \begin{pmatrix} \frac{\partial^2}{\partial x_1^2} f(\mathbf{x}) & \cdots & \frac{\partial^2}{\partial x_1 \partial x_n} f(\mathbf{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_n \partial x_1} f(\mathbf{x}) & \cdots & \frac{\partial^2}{\partial x_n^2} f(\mathbf{x}) \end{pmatrix}.
  • Example: TODO.

  • For a sufficiently smooth multivariate function f: \mathbb{R}^n \to \mathbb{R}, we have the Taylor approximation (a numerical check appears at the end of this subsection):
    f(\mathbf{x} + \Delta \mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})' \Delta \mathbf{x} + \frac 12 \Delta \mathbf{x}' [\nabla^2 f(\mathbf{x})] \Delta \mathbf{x}.

  • For a vector-valued function \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m, \mathbf{f}(\mathbf{x}) = \begin{pmatrix} f_1(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{pmatrix}, the m \times n Jacobian matrix is \mathbf{J} = \begin{pmatrix} \nabla f_1' \\ \vdots \\ \nabla f_m' \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}. The Jacobian determinant \det \mathbf{J} (when m = n) appears in multi-dimensional integrals via the change of variables formula.

Later we will develop the chain rule for vector and matrix functions.
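As a quick numerical sanity check of the Taylor approximation above (a minimal sketch; the test function f(x_1, x_2) = e^{x_1} + x_1 x_2^2 and the expansion point are arbitrary choices, not from the notes), we can compare f(\mathbf{x} + \Delta \mathbf{x}) with its second-order approximation and watch the error shrink like \|\Delta \mathbf{x}\|^3:

# Compare f(x + Δx) with the second-order Taylor approximation
# f(x) + ∇f(x)'Δx + (1/2) Δx'[∇²f(x)]Δx for the arbitrary smooth test
# function f(x₁, x₂) = exp(x₁) + x₁ x₂² at x = (1, 2).
f(x) = exp(x[1]) + x[1] * x[2]^2
gradf(x) = [exp(x[1]) + x[2]^2, 2 * x[1] * x[2]]   # gradient
hessf(x) = [exp(x[1]) 2*x[2]; 2*x[2] 2*x[1]]       # Hessian

x = [1.0, 2.0]
for h in [0.1, 0.01, 0.001]
    Δx = h * [1.0, -1.0]
    taylor2 = f(x) + gradf(x)' * Δx + 0.5 * Δx' * hessf(x) * Δx
    @show h, f(x + Δx) - taylor2   # error decreases roughly like h^3
end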

1.2  Optimization and optimality conditions

  • Optimization aims to minimize a multivariate function f: \mathbb{R}^n \to \mathbb{R}, possibly subject to certain constraints. Possible constraints include

    • Linear constraints: \mathbf{A} \mathbf{x} = \mathbf{b}.
    • Inequality constraints: \mathbf{x} \ge \mathbf{0}.
    • Integer constraints: each x_i is either 0 or 1.
  • Statisticians often talk about maximization because of maximum likelihood estimation (MLE). Note that maximizing f is the same as minimizing -f.

  • For unconstrained optimization of a scalar function f, we know

    • the necessary condition for a point x to be a local minimum is \frac{d}{dx} f(x) = 0.
    • a sufficient condition for a strict local minimum is (1) \frac{d}{dx} f(x) = 0 and (2) \frac{d^2}{dx^2} f(x) > 0.
      These are called the optimality conditions.
  • Similarly, for unconstrained optimization of a multivariate function f: \mathbb{R}^n \to \mathbb{R},

    • the necessary condition for a point \mathbf{x} to be a local minimum is \nabla f(\mathbf{x}) = \mathbf{0}_n.
    • a sufficient condition for a strict local minimum is \nabla f(\mathbf{x}) = \mathbf{0}_n and \nabla^2 f(\mathbf{x}) \succ \mathbf{O}_{n \times n} (see the numerical check below).
  • A body of work in optimization is to generalize these optimality conditions to the constrained case.
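A small numerical illustration of these conditions (a sketch; the function below is a made-up example): check that the gradient vanishes and the Hessian is positive definite at the minimizer of f(x_1, x_2) = (x_1 - 1)^2 + (x_2 + 2)^2.

using LinearAlgebra

# f(x) = (x₁ - 1)² + (x₂ + 2)² has the unique minimizer x⋆ = (1, -2).
gradf(x) = [2 * (x[1] - 1), 2 * (x[2] + 2)]   # gradient ∇f(x)
hessf(x) = [2.0 0.0; 0.0 2.0]                 # Hessian ∇²f(x), constant here

xstar = [1.0, -2.0]
@show gradf(xstar)            # necessary condition: ∇f(x⋆) = 0
@show isposdef(hessf(xstar))  # sufficient condition: ∇²f(x⋆) is positive definite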

1.3  Convexity

  • Convexity plays a key role in optimization.

    • A convex set K: if \mathbf{x}, \mathbf{y} \in K, then the line segment from \mathbf{x} to \mathbf{y}, \{\alpha \mathbf{x} + (1 - \alpha) \mathbf{y}: \alpha \in [0, 1]\}, is contained in K.
    • A convex function f: f(\alpha \mathbf{x} + (1 - \alpha) \mathbf{y}) \le \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}) for all \mathbf{x}, \mathbf{y} and \alpha \in (0, 1).
    • A strictly convex function satisfies the above definition with \le replaced by <.
    • A convex function f (alternative definition): the set of points on and above the graph of f, \{(\mathbf{x}, y): f(\mathbf{x}) \le y\}, is convex.
    • A smooth and convex f: f(\mathbf{x}) \ge f(\mathbf{y}) + \nabla f(\mathbf{y})' (\mathbf{x} - \mathbf{y}) for all \mathbf{x}, \mathbf{y}. The function f sits above its tangent lines.
  • TODO: A convex function sits between its tangent lines and its chords.

  • TODO: graph of convex and nonconvex sets/functions.

  • Examples: f_1(x) = ax + b, f_2(x) = x^2, and f_3(x) = \max.

  • Intersection of convex sets is convex.

  • The maximum of two or more convex functions is always convex.

    Proof: Let f(\mathbf{x}) = \sup_{i \in \mathcal{I}} f_i (\mathbf{x}). Then f_i(\alpha \mathbf{x} + (1 - \alpha) \mathbf{y}) \le \alpha f_i(\mathbf{x}) + (1 - \alpha) f_i(\mathbf{y}) \le \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}) for all i \in \mathcal{I}. Taking supremum over i on the left hand side yields f(\alpha \mathbf{x} + (1 - \alpha) \mathbf{y}) \le \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}).

    Note the minimum of convex functions is usually not convex.
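A quick numerical spot-check of the maximum-of-convex-functions result (a minimal sketch; f_1 and f_2 below are arbitrary convex examples):

# Spot-check that fmax(x) = max(f₁(x), f₂(x)) satisfies the convexity
# inequality fmax(αx + (1-α)y) ≤ α fmax(x) + (1-α) fmax(y) at random points,
# where f₁(x) = x² and f₂(x) = 2x + 1 are both convex.
f1(x) = x^2
f2(x) = 2x + 1
fmax(x) = max(f1(x), f2(x))

check(x, y, α) = fmax(α * x + (1 - α) * y) ≤ α * fmax(x) + (1 - α) * fmax(y) + 1e-12
@show all(check(4 * randn(), 4 * randn(), rand()) for _ in 1:10_000)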

  • The set of positive (semi)definite matrices is convex.

  • A twice differentiable function f is convex if \nabla^2 f(\mathbf{x}) \succeq \mathbf{O}_{n \times n} at all \mathbf{x}. It is strictly convex if \nabla^2 f(\mathbf{x}) \succ \mathbf{O}_{n \times n} at all \mathbf{x}.

  • Convexity prevents multiple isolated local minima: for a convex function, any stationary point with \nabla f(\mathbf{x}) = \mathbf{0} is a global minimum.

1.4  Optimality conditions for constrained optimization

  • Example: \begin{eqnarray*} &\text{minimize}& f(x_1,x_2) = x_1^2 + x_2^2 \\ &\text{subject to}& a_1 x_1 + a_2 x_2 = b. \end{eqnarray*} Define the Lagrangian L(x_1, x_2, \lambda) = f(x_1, x_2) + \lambda (a_1 x_1 + a_2 x_2 - b) = x_1^2 + x_2^2 + \lambda (a_1 x_1 + a_2 x_2 - b), where \lambda is the Lagrange multiplier. Solve the stationarity equations \begin{eqnarray*} \frac{\partial}{\partial x_1} L &=& 2 x_1 + \lambda a_1 = 0 \\ \frac{\partial}{\partial x_2} L &=& 2 x_2 + \lambda a_2 = 0 \\ \frac{\partial}{\partial \lambda} L &=& a_1 x_1 + a_2 x_2 - b = 0 \end{eqnarray*} to get the optimal x_1, x_2, and \lambda: \begin{eqnarray*} x_1^\star &=& \frac{a_1 b}{a_1^2 + a_2^2} \\ x_2^\star &=& \frac{a_2 b}{a_1^2 + a_2^2} \\ \lambda^\star &=& - \frac{2b}{a_1^2 + a_2^2} \end{eqnarray*} with optimal objective value (x_1^\star)^2 + (x_2^\star)^2 = \frac{b^2}{a_1^2 + a_2^2}. We find that -\lambda^\star is simply the derivative of the minimum cost with respect to the constraint level b: \frac{d}{d b} \left( \frac{b^2}{a_1^2 + a_2^2} \right) = \frac{2b}{a_1^2 + a_2^2} = - \lambda^\star.
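A quick numerical check of this example (a sketch with arbitrary values a_1 = 1, a_2 = 2, b = 3): solve the 3-by-3 stationarity system and compare with the closed-form solution.

# Stationarity system: 2x₁ + λa₁ = 0, 2x₂ + λa₂ = 0, a₁x₁ + a₂x₂ = b.
a1, a2, b = 1.0, 2.0, 3.0
M = [2.0 0.0 a1; 0.0 2.0 a2; a1 a2 0.0]
sol = M \ [0.0, 0.0, b]                         # (x₁⋆, x₂⋆, λ⋆)
@show sol
@show [a1 * b, a2 * b, -2 * b] ./ (a1^2 + a2^2) # closed-form answer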

  • We generalize the above example to the general problem of minimizing a convex quadratic function subject to linear constraints. For \mathbf{S} \succ \mathbf{0} and \mathbf{A} \in \mathbb{R}^{n \times m}, \begin{eqnarray*} &\text{minimize}& \frac 12 \mathbf{x}' \mathbf{S} \mathbf{x} \\ &\text{subject to}& \mathbf{A}' \mathbf{x} = \mathbf{b}. \end{eqnarray*} The Lagrangian function is L(\mathbf{x}, \boldsymbol{\lambda}) = \frac 12 \mathbf{x}' \mathbf{S} \mathbf{x} + \boldsymbol{\lambda}' (\mathbf{A}' \mathbf{x} - \mathbf{b}), where \boldsymbol{\lambda} \in \mathbb{R}^m is the vector of Lagrange multipliers. Solving the equations \begin{eqnarray*} \nabla_\mathbf{x} L(\mathbf{x}, \boldsymbol{\lambda}) &=& \mathbf{S} \mathbf{x} + \mathbf{A} \boldsymbol{\lambda} = \mathbf{0}_n \\ \nabla_\boldsymbol{\lambda} L(\mathbf{x}, \boldsymbol{\lambda}) &=& \mathbf{A}' \mathbf{x} - \mathbf{b} = \mathbf{0}_m, \end{eqnarray*} or equivalently \begin{pmatrix} \mathbf{S} & \mathbf{A} \\ \mathbf{A}' & \mathbf{O} \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ \boldsymbol{\lambda} \end{pmatrix} = \begin{pmatrix} \mathbf{0}_n \\ \mathbf{b} \end{pmatrix} \quad \quad (\text{saddle point matrix or KKT matrix}) yields the solution \begin{eqnarray*} \mathbf{x}^\star &=& \mathbf{S}^{-1} \mathbf{A} (\mathbf{A}' \mathbf{S}^{-1} \mathbf{A})^{-1} \mathbf{b} \\ \boldsymbol{\lambda}^\star &=& - (\mathbf{A}' \mathbf{S}^{-1} \mathbf{A})^{-1} \mathbf{b}. \end{eqnarray*} Further calculation shows the minimum cost f^\star = \frac 12 \mathbf{b}' (\mathbf{A}' \mathbf{S}^{-1} \mathbf{A})^{-1} \mathbf{b} and the gradient of the cost \nabla_\mathbf{b} f^\star = (\mathbf{A}' \mathbf{S}^{-1} \mathbf{A})^{-1} \mathbf{b} = - \boldsymbol{\lambda}^\star.

  • The saddle point matrix (or KKT matrix) has n positive eigenvalues and m negative eigenvalues. The Lagrangian function is convex in \mathbf{x} and concave in \boldsymbol{\lambda}.
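A minimal numerical sketch of the general case (S, A, and b below are arbitrary test data, not from the notes): form the KKT matrix, solve for (x, λ), and count its positive and negative eigenvalues.

using LinearAlgebra

n, m = 4, 2
S = Matrix(2.0I, n, n) + 0.5 * ones(n, n)   # an arbitrary positive definite S
A = randn(n, m)                             # random constraint matrix (full column rank a.s.)
b = randn(m)

# Saddle point (KKT) system: [S A; A' O] [x; λ] = [0; b]
K = [S A; A' zeros(m, m)]
sol = K \ [zeros(n); b]
x, λ = sol[1:n], sol[n+1:end]

@show norm(A' * x - b)                      # constraint is satisfied
@show x ≈ S \ (A * ((A' * (S \ A)) \ b))    # matches the closed-form x⋆
λs = eigvals(Symmetric(K))
@show count(t -> t > 0, λs), count(t -> t < 0, λs)   # n positive, m negative eigenvalues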

1.5  Example: MLE of multivariate normal model

Let \mathbf{y}_1, \ldots, \mathbf{y}_n be iid samples from a multivariate normal distribution N(\boldsymbol{\mu}, \boldsymbol{\Omega}). The log-likelihood is L(\boldsymbol{\mu}, \boldsymbol{\Omega}) = - \frac n2 \log \det \boldsymbol{\Omega} - \frac 12 \sum_{i=1}^n (\mathbf{y}_i - \boldsymbol{\mu})' \boldsymbol{\Omega}^{-1} (\mathbf{y}_i - \boldsymbol{\mu}). We want to maximize L(\boldsymbol{\mu}, \boldsymbol{\Omega}) or minimize f(\boldsymbol{\mu}, \boldsymbol{\Omega}) = - L(\boldsymbol{\mu}, \boldsymbol{\Omega}) = \frac n2 \log \det \boldsymbol{\Omega} + \frac 12 \sum_{i=1}^n (\mathbf{y}_i - \boldsymbol{\mu})' \boldsymbol{\Omega}^{-1} (\mathbf{y}_i - \boldsymbol{\mu}). The MLE is achieved by \begin{eqnarray*} \widehat{\boldsymbol{\mu}} &=& \frac{\sum_{i=1}^n \mathbf{y}_i}{n} \\ \widehat{\boldsymbol{\Omega}} &=& \frac{\sum_{i=1}^n (\mathbf{y}_i - \hat{\boldsymbol{\mu}})(\mathbf{y}_i - \hat{\boldsymbol{\mu}})'}{n}. \end{eqnarray*}

TODO: derivation.
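In lieu of the derivation, here is a small simulation sketch (made-up \boldsymbol{\mu} and \boldsymbol{\Omega}; uses the Statistics standard library) confirming that the closed-form MLE equals the sample mean and the 1/n-scaled sample covariance:

using LinearAlgebra, Random, Statistics

Random.seed!(216)
n, p = 1_000, 3
μtrue = [1.0, -2.0, 0.5]
Ltrue = [1.0 0.0 0.0; 0.3 1.0 0.0; -0.2 0.4 1.0]   # so Ωtrue = Ltrue * Ltrue'
Y = μtrue' .+ randn(n, p) * Ltrue'                 # rows yᵢ ~ N(μtrue, Ωtrue)

μhat = vec(mean(Y, dims=1))                        # MLE of μ: sample mean
Ωhat = (Y .- μhat')' * (Y .- μhat') / n            # MLE of Ω: 1/n-scaled scatter
@show μhat
@show norm(Ωhat - cov(Y, corrected=false))         # agrees with the 1/n sample covariance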

1.6  Example: MLE of multinomial model

Consider a multinomial experiment with n trials and observed outcomes n_1, \ldots, n_m over m categories. The maximum likelihood estimate (MLE) seeks the maximizer of the likelihood
L(p_1,\ldots,p_m) = \binom{n}{n_1 \cdots n_m} \prod_{i=1}^m p_i^{n_i}. Equivalently, we minimize the negative log-likelihood
- \log L(p_1,\ldots,p_m) = - \log \binom{n}{n_1 \cdots n_m} - \sum_{i=1}^m n_i \log p_i subject to the constraint p_1 + \cdots + p_m = 1, by finding the stationary point of the Lagrangian
L(p_1,\ldots,p_m, \lambda) = - \log \binom{n}{n_1 \cdots n_m} - \sum_{i=1}^m n_i \log p_i + \lambda \left( \sum_{i=1}^m p_i - 1 \right).
Solving
\frac{\partial}{\partial p_i} L = - \frac{n_i}{p_i} + \lambda = 0
gives
\frac{n_i}{p_i} = \lambda.
Combining with the constraint \sum_i p_i=1 yields
\begin{eqnarray*} p_i^\star &=& \frac{n_i}{n}, \quad i=1,\ldots,m, \\ \lambda^\star &=& n. \end{eqnarray*} Note the nonnegativity constraint is automatically satisfied.
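A tiny numerical sketch with made-up counts, checking that \widehat{p}_i = n_i/n is never beaten by other points on the probability simplex:

# Multinomial MLE sketch: p̂ᵢ = nᵢ / n maximizes the log-likelihood
# ∑ᵢ nᵢ log pᵢ over the probability simplex (counts below are made up).
counts = [12, 30, 8, 50]
ntot = sum(counts)
phat = counts ./ ntot
loglik(p) = sum(counts .* log.(p))

randsimplex(m) = (w = rand(m); w ./ sum(w))   # a random point on the simplex
@show phat
@show all(loglik(phat) ≥ loglik(randsimplex(length(counts))) for _ in 1:1_000)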

1.7  Newton's method

  • We are looking for a point \mathbf{x}^\star such that \nabla f(\mathbf{x}^\star) = \mathbf{0}. Multivariate calculus gives us a principled way to move towards such an \mathbf{x}^\star.

  • Second-order Taylor approximation to a function f at current iterate \mathbf{x}^{(t)} says f(\mathbf{x}^{(t)} + \Delta \mathbf{x}) \approx f(\mathbf{x}^{(t)}) + \nabla f(\mathbf{x}^{(t)})' \Delta \mathbf{x} + \frac 12 \Delta \mathbf{x}' [\nabla^2 f(\mathbf{x}^{(t)})] \Delta \mathbf{x}. Which direction \Delta \mathbf{x} shall we move from \mathbf{x}^{(t)}?

    Minimizing the quadratic approximation gives the Newton direction \Delta \mathbf{x}_{\text{newton}} = - [\nabla^2 f(\mathbf{x}^{(t)})]^{-1} \nabla f(\mathbf{x}^{(t)}).

  • So the Newton method iterates according to \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \Delta \mathbf{x}_{\text{newton}} = \mathbf{x}^{(t)} - [\nabla^2 f(\mathbf{x}^{(t)})]^{-1} \nabla f(\mathbf{x}^{(t)}).

  • Quadratic convergence of Newton's method: \|\mathbf{x}^{(t+1)} - \mathbf{x}^\star\| \le C \|\mathbf{x}^{(t)} - \mathbf{x}^\star\|^2.

  • Example: f(x) = \frac 13 x^3 - 4x with \nabla f(x) = x^2 - 4 and \nabla^2 f(x) = 2x. Newton's iterates are x^{(t+1)} = x^{(t)} - \frac{x^{(t)2}-4}{2x^{(t)}} = \frac{1}{2} \left( x^{(t)} + \frac{4}{x^{(t)}} \right). Let's start from x^{(0)}=2.5.

In [1]:
x = 2.5
for iter in 1:5
    x = 0.5 * (x + 4 / x)
    @show x, x^3 / 3 - 4x
end
(x, x ^ 3 / 3 - 4x) = (2.05, -5.328291666666667)
(x, x ^ 3 / 3 - 4x) = (2.000609756097561, -5.333332589652766)
(x, x ^ 3 / 3 - 4x) = (2.0000000929222947, -5.333333333333316)
(x, x ^ 3 / 3 - 4x) = (2.000000000000002, -5.333333333333334)
(x, x ^ 3 / 3 - 4x) = (2.0, -5.333333333333334)
The error indeed shrinks quadratically, since x^{(t+1)} - 2 = \frac 12 \left( x^{(t)} + \frac{4}{x^{(t)}} \right) - 2 = \frac{1}{2x^{(t)}} \left( x^{(t)} - 2 \right)^2.
  • In practice, Newton's method may suffer from instability: the iterates may diverge to infinity or converge to a local maximum. Two remedies are needed:
    1. Use a positive definite matrix in the quadratic approximation (automatically satisfied by the Hessian of a convex function).
    2. Line search (backtracking) to guarantee sufficient drop in the objective function f(\mathbf{x}^{(t)} + s \Delta \mathbf{x}) \le f(\mathbf{x}^{(t)}) - \alpha s \Delta \mathbf{x}' [\nabla^2 f(\mathbf{x}^{(t)})] \Delta \mathbf{x}.
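Below is a minimal sketch of such a safeguarded (damped) Newton iteration with backtracking, using the sufficient-decrease condition above; the test function f(x) = \sqrt{1 + x^2}, the constant \alpha = 0.25, and the step-halving rule are illustrative choices, not prescribed by the notes.

# Damped Newton with backtracking line search on the convex function
# f(x) = sqrt(1 + x²), minimized at x⋆ = 0. Starting at x = 5, the full
# Newton step Δx = -x(1 + x²) would overshoot badly; backtracking fixes this.
f(x)   = sqrt(1 + x^2)
df(x)  = x / sqrt(1 + x^2)
d2f(x) = 1 / (1 + x^2)^(3/2)

x = 5.0
α = 0.25   # sufficient-decrease parameter
for t in 1:8
    Δx = -df(x) / d2f(x)   # Newton direction
    s = 1.0
    # backtrack until f(x + sΔx) ≤ f(x) - α s Δx [d²f(x)] Δx
    while f(x + s * Δx) > f(x) - α * s * d2f(x) * Δx^2
        s /= 2
    end
    global x += s * Δx
    @show t, s, x
end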

1.8  Gradient descent

  • If we use the identity matrix as a very crude approximation of the Hessian matrix, we obtain the classical gradient descent, or steepest descent, algorithm: \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - s^{(t)} \nabla f(\mathbf{x}^{(t)}).

  • The gradient descent algorithm has gained tremendous interest in modern, large-scale optimization problems because computing the Hessian is usually quite expensive.

  • Example: minimize a bivariate quadratic function f(x_1, x_2) = \frac 12 (x_1^2 + bx_2^2). Gradient: \nabla f(x_1, x_2) = \begin{pmatrix} x_1 \\ bx_2 \end{pmatrix}. Gradient descent algorithm: \begin{eqnarray*} \begin{pmatrix} x_1^{(t+1)} \\ x_2^{(t+1)} \end{pmatrix} &=& \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix} - s^{(t)} \begin{pmatrix} x_1^{(t)} \\ bx_2^{(t)} \end{pmatrix} \\ &=& \begin{pmatrix} (1 - s^{(t)}) x_1^{(t)} \\ (1 - bs^{(t)})x_2^{(t)} \end{pmatrix}. \end{eqnarray*} What is the optimal step length s (exact line search)? The objective function f(s) = \frac 12 \left[(1-s)^2 x_1^{(t)2} + b(1-bs)^2x_2^{(t)2} \right] is minimized at s^{(t)} = \frac{x_1^{(t)2} + b^2 x_2^{(t)2}}{x_1^{(t)2} + b^3 x_2^{(t)2}}. Let's assume the starting point (x_1^{(0)}, x_2^{(0)}) = (b, 1) to simplify the analysis. Then s^{(0)} = \frac{b^2 + b^2}{b^2 + b^3} = \frac{2}{b + 1} and \begin{pmatrix} x_1^{(1)} \\ x_2^{(1)} \end{pmatrix} = \begin{pmatrix} \left( \frac{b-1}{b+1} \right) x_1^{(0)} \\ \left( \frac{1-b}{b+1} \right) x_2^{(0)} \end{pmatrix} = \begin{pmatrix} b \left( \frac{b-1}{b+1} \right) \\ \left( \frac{1-b}{b+1} \right) \end{pmatrix}. Continuing this way, we find \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix} = \begin{pmatrix} b \left( \frac{b-1}{b+1} \right)^t \\ \left( \frac{1-b}{b+1} \right)^t \end{pmatrix} and f(x_1^{(t)}, x_2^{(t)}) = \frac 12 (b^2 + b) \left( \frac{b-1}{b+1} \right)^{2t} = \left( \frac{b-1}{b+1} \right)^{2t} f(x_1^{(0)}, x_2^{(0)}).

    The following graph shows the zig-zagging pattern of the gradient descent iterates:

In [2]:
using Plots; gr()
using LaTeXStrings

b = 0.1
f(x1, x2) = 0.5 * (x1^2 + b * x2^2)
s = 2 / (1 + b) # optimal step length
x1, x2 = b, 1.0   # start point
x1iter, x2iter, fiter = [x1], [x2], [f(x1, x2)]  # store function values
for iter in 1:15
    x1 *= (1 - s)
    x2 *= (1 - s * b)
    push!(x1iter, x1)
    push!(x2iter, x2)
    push!(fiter, f(x1, x2))
end

x1 = -1:0.01:1
x2 = -1:0.01:1
X1 = repeat(reshape(x1, 1, :), length(x2), 1)
X2 = repeat(x2, 1, length(x1))
Z = map(f, X1, X2)
p = contour(x1, x2, Z, levels=fiter)
plot(p, xlabel=L"x_1", ylabel=L"x_2", title="GD");
In [3]:
anim = @animate for iter in 1:length(x1iter)-1
    Plots.plot!(p, x1iter[iter:iter+1], x2iter[iter:iter+1], shape=:circle)
    Plots.annotate!(p, x1iter[iter], x2iter[iter], text(latexstring("x^{($iter)}"), :right))
end
gif(anim, "./gd.gif", fps = 1);
┌ Info: Saved animation to 
│   fn = /Users/huazhou/Documents/github.com/ucla-biostat216-2019fall.github.io/slides/14-optim/gd.gif
└ @ Plots /Users/huazhou/.julia/packages/Plots/AXUqs/src/animation.jl:98
  • The above analysis generalizes to any strongly convex function, i.e., one that satisfies m \mathbf{I}_n \preceq \nabla^2 f(\mathbf{x}) \preceq M \mathbf{I}_n \text{ for all } \mathbf{x} for some constants M > m > 0, or equivalently, all eigenvalues of \nabla^2 f(\mathbf{x}) lie in [m, M].

    Theorem: for any strongly convex function f, the gradient descent with exact line search satisfies f(\mathbf{x}^{(t+1)}) - f(\mathbf{x}^\star) \le \left( 1 - \frac mM \right) [f(\mathbf{x}^{(t)}) - f(\mathbf{x}^\star)], where \mathbf{x}^\star is the optimum.

  • In practice, we often perform inexact line search such as backtracking, because exact line search is expensive.
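A minimal sketch of gradient descent with backtracking on the same bivariate quadratic (the constant \alpha = 0.25 and the step-halving rule are illustrative choices):

# Gradient descent with backtracking (inexact) line search on
# f(x₁, x₂) = 0.5 (x₁² + b x₂²).
b = 0.1
f(x1, x2) = 0.5 * (x1^2 + b * x2^2)
gradf(x1, x2) = [x1, b * x2]

x1, x2 = b, 1.0   # start point
α = 0.25          # sufficient-decrease parameter
for iter in 1:15
    g = gradf(x1, x2)
    s = 1.0
    # shrink s until f(x - s g) ≤ f(x) - α s ‖g‖²
    while f(x1 - s * g[1], x2 - s * g[2]) > f(x1, x2) - α * s * (g[1]^2 + g[2]^2)
        s /= 2
    end
    global x1 -= s * g[1]
    global x2 -= s * g[2]
    @show iter, s, f(x1, x2)
end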

1.9  Gradient descent with momentum

  • To mitigate the frustrating zig-zagging of gradient descent, Polyak proposed adding a momentum term with coefficient \beta to the gradient. The idea is that a heavy ball rolling downhill zig-zags less.

  • Gradient descent with momentum:
    \begin{eqnarray*} \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - s \mathbf{z}^{(t)} \text{ with } \\ \mathbf{z}^{(t)} &=& \nabla f(\mathbf{x}^{(t)}) + \beta \mathbf{z}^{(t-1)}. \end{eqnarray*}

  • For a quadratic objective function f(\mathbf{x}) = \frac 12 \mathbf{x}' \mathbf{S} \mathbf{x}, the optimal step length s and momentum coefficient \beta are given by (derivation omitted) \begin{eqnarray*} s &=& \left( \frac{2}{\sqrt{\lambda_{\text{max}}} + \sqrt{\lambda_{\text{min}}}} \right)^2 \\ \beta &=& \left( \frac{\sqrt{\lambda_{\text{max}}} - \sqrt{\lambda_{\text{min}}}}{\sqrt{\lambda_{\text{max}}} + \sqrt{\lambda_{\text{min}}}} \right)^2, \end{eqnarray*} where \lambda_{\text{min}} and \lambda_{\text{max}} are the minimum and maximum eigenvalues of \mathbf{S}.

  • Back to our 2-dimensional example f(x_1, x_2) = \frac 12 (x_1^2 + b x_2^2). We have s = \left( \frac{2}{1+\sqrt b} \right)^2, \quad \beta = \left( \frac{1 - \sqrt b}{1 + \sqrt b} \right)^2, and f(x_1^{(t)}, x_2^{(t)}) = \left( \frac{1 - \sqrt b}{1 + \sqrt b} \right)^{2t} f(x_1^{(0)}, x_2^{(0)}). The convergence rate improves from \left( \frac{1 - b}{1 + b} \right)^{2} to \left( \frac{1 - \sqrt b}{1 + \sqrt b} \right)^{2}.

    The following graph shows the iterates of gradient descent with momentum:

In [4]:
b = 0.1
f(x1, x2) = 0.5 * (x1^2 + b * x2^2)
s = (2 / (1 + sqrt(b)))^2             # optimal step length
β = ((1 - sqrt(b)) / (1 + sqrt(b)))^2 # optimal momentum coefficient
x1, x2 = b, 1.0   # start point
z1, z2 = [0.0, 0.0]
x1iter, x2iter, fiter = [x1], [x2], [f(x1, x2)]  # store function values
for iter in 1:15
    z1 = x1 + β * z1
    z2 = b * x2 + β * z2
    x1 -= s * z1
    x2 -= s * z2
    push!(x1iter, x1)
    push!(x2iter, x2)
    push!(fiter, f(x1, x2))
end

x1 = -1:0.01:1
x2 = -1:0.01:1
X1 = repeat(reshape(x1, 1, :), length(x2), 1)
X2 = repeat(x2, 1, length(x1))
Z = map(f, X1, X2)
p = contour(x1, x2, Z, levels=fiter)
plot(p, xlabel=L"x_1", ylabel=L"x_2", title="GD with Momentum");
In [5]:
anim = @animate for iter in 1:length(x1iter)-1
    Plots.plot!(p, x1iter[iter:iter+1], x2iter[iter:iter+1], shape=:circle)
    Plots.annotate!(p, x1iter[iter], x2iter[iter], text(latexstring("x^{($iter)}"), :right))
end
gif(anim, "./gd_momentum.gif", fps = 1);
┌ Info: Saved animation to 
│   fn = /Users/huazhou/Documents/github.com/ucla-biostat216-2019fall.github.io/slides/14-optim/gd_momentum.gif
└ @ Plots /Users/huazhou/.julia/packages/Plots/AXUqs/src/animation.jl:98

1.10  Nesterov acceleration of gradient descent

  • Yurii Nesterov proposed that, instead of evaluating the gradient \nabla f at the current iterate \mathbf{x}^{(t)}, we shift the evaluation point to \mathbf{x}^{(t)} + \gamma (\mathbf{x}^{(t)} - \mathbf{x}^{(t-1)}).

  • Nesterov accelerated gradient descent: \begin{eqnarray*} \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - s \nabla f(\mathbf{y}^{(t)}) \\ \mathbf{y}^{(t+1)} &=& \mathbf{x}^{(t+1)} + \gamma (\mathbf{x}^{(t+1)} - \mathbf{x}^{(t)}). \end{eqnarray*}

  • For a quadratic objective function f(\mathbf{x}) = \frac 12 \mathbf{x}' \mathbf{S} \mathbf{x}, the optimal step length s and \gamma are given by (derivation omitted) \begin{eqnarray*} s &=& \frac{1}{\lambda_{\text{max}}} \\ \gamma &=& \frac{\sqrt{\lambda_{\text{max}}} - \sqrt{\lambda_{\text{min}}}}{\sqrt{\lambda_{\text{max}}} + \sqrt{\lambda_{\text{min}}}}, \end{eqnarray*} where \lambda_{\text{min}} and \lambda_{\text{max}} are the minimum and maximum eigenvalues of \mathbf{S}.

  • The convergence rate improves from \left( \frac{1 - b}{1 + b} \right)^{2} to 1 - \sqrt b.

In [6]:
b = 0.1
f(x1, x2) = 0.5 * (x1^2 + b * x2^2)
s = 1.0                           # optimal step length
γ = (1 - sqrt(b)) / (1 + sqrt(b)) # optimal Nesterov constant
x1, x2 = b, 1.0   # start point
y1, y2 = x1, x2
x1iter, x2iter, fiter = [x1], [x2], [f(x1, x2)]  # store iterates
for iter in 1:15
    x1prev, x2prev = x1, x2
    x1 = y1 - s * y1
    x2 = y2 - s * b * y2
    y1 = x1 + γ * (x1 - x1prev)
    y2 = x2 + γ * (x2 - x2prev)
    push!(x1iter, x1)
    push!(x2iter, x2)
    push!(fiter, f(x1, x2))
end

x1 = -1:0.01:1
x2 = -1:0.01:1
X1 = repeat(reshape(x1, 1, :), length(x2), 1)
X2 = repeat(x2, 1, length(x1))
Z = map(f, X1, X2)
p = contour(x1, x2, Z, levels=fiter)
plot(p, xlabel=L"x_1", ylabel=L"x_2", title="GD with Nesterov");
In [7]:
anim = @animate for iter in 1:length(x1iter)-1
    Plots.plot!(p, x1iter[iter:iter+1], x2iter[iter:iter+1], shape=:circle)
    Plots.annotate!(p, x1iter[iter], x2iter[iter], text(latexstring("x^{($iter)}"), :right))
end
gif(anim, "./gd_nesterov.gif", fps = 1);
┌ Info: Saved animation to 
│   fn = /Users/huazhou/Documents/github.com/ucla-biostat216-2019fall.github.io/slides/14-optim/gd_nesterov.gif
└ @ Plots /Users/huazhou/.julia/packages/Plots/AXUqs/src/animation.jl:98

1.11  Stochastic gradient descent

  • In machine learning and statistics, we need to minimize the loss function of training data L(\mathbf{x}) = \frac 1N \sum_{i=1}^N \ell(\mathbf{x}, \mathbf{v}_i), where \mathbf{x} is the parameter to be fitted, N is the training sample size, and \mathbf{v}_i is the i-th sample.

    For example, in the least squares problem, we minimize L(\mathbf{x}) = \frac 1N \|\mathbf{b} - \mathbf{A} \mathbf{x}\|^2 = \frac 1N \sum_{i=1}^N (b_i - \mathbf{a}_i' \mathbf{x})^2.

  • Issues of the (classical) gradient descent scheme \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - s^{(t)} \nabla L(\mathbf{x}^{(t)}) include:

    1. Evaluating \nabla L(\mathbf{x}) = \frac 1N \sum_{i=1}^N \nabla \ell(\mathbf{x}, \mathbf{v}_i) can be expensive when N is large.
    2. The solution found by gradient descent does not generalize well on the test data.
  • Stochastic gradient descent uses only a mini-batch (of size B) of the training data at each step (see the sketch at the end of this section):
    \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - s^{(t)} \frac{1}{B} \sum_{i \in \text{mini-batch}} \nabla \ell(\mathbf{x}^{(t)}, \mathbf{v}_i).

  • Stochastic gradient descent addresses both issues of gradient descent and is the workhorse of deep learning.
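A minimal mini-batch SGD sketch for the least squares example above (the simulated data, batch size B, and the decaying step size s^{(t)} = 0.1/\sqrt{t} are illustrative choices):

using LinearAlgebra, Random

Random.seed!(123)
N, p, B = 1_000, 5, 20              # sample size, dimension, mini-batch size
A = randn(N, p)
xtrue = randn(p)
bvec = A * xtrue + 0.1 * randn(N)   # simulated responses

x = zeros(p)
for t in 1:500
    idx = rand(1:N, B)                    # sample a mini-batch (with replacement)
    r = A[idx, :] * x - bvec[idx]         # mini-batch residuals aᵢ'x - bᵢ
    g = (2 / B) * (A[idx, :]' * r)        # mini-batch gradient of (1/B) Σ (bᵢ - aᵢ'x)²
    global x -= (0.1 / sqrt(t)) * g       # decaying step size
end
@show norm(x - xtrue)
@show norm((A \ bvec) - xtrue)            # full least squares solution, for comparison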