Lecture 14 · Newton Method

References

Lecture Reference: https://www.stat.cmu.edu/~ryantibs/convexopt-F18/

1. Newton’s Method Interpretation

1.1 Introduction

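First-order gradient descent (GD) can be read as minimizing a local model of $f$: at the current point $x$, approximate $f$ at a nearby point $x^{+}$ by its first-order Taylor expansion plus a quadratic proximity term (i.e. the Hessian is replaced by $\frac{1}{t} I$):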

f (x^{+}) \approx f (x) + \nabla f (x)^{⊤} (x^{+} - x) + \frac{1}{2 t} ∥ x^{+} - x ∥_{2}^{2}

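This model is justified by convexity together with Lipschitz continuity of the gradient:
- By convexity, for any $x, y \in dom (f)$: $f (y) \geq f (x) + \nabla f (x)^{⊤} (y - x)$
- If $\nabla f$ is $L$-Lipschitz continuous, then for any $x, y \in dom (f)$: $f (y) \leq f (x) + \nabla f (x)^{⊤} (y - x) + \frac{L}{2} ∥ y - x ∥_{2}^{2}$, so with $t = 1/L$ the model is an upper bound on $f$.
Minimizing the RHS over $x^{+}$, $\min_{x^{+}} f (x) + \nabla f (x)^{⊤} (x^{+} - x) + \frac{1}{2 t} ∥ x^{+} - x ∥_{2}^{2}$, and setting its gradient $\nabla f (x) + \frac{1}{t} (x^{+} - x)$ to zero recovers the GD update:
$x^{+} = x - t \nabla f (x)$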

Going one step further, the second-order Taylor expansion gives a more accurate local model:

f (x^{+}) \approx f (x) + \nabla f (x)^{⊤} (x^{+} - x) + \frac{1}{2} (x^{+} - x)^{⊤} \nabla^{2} f (x) (x^{+} - x)

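Minimizing the RHS in the same way (assuming $\nabla^{2} f (x) ≻ 0$), the stationarity condition $\nabla f (x) + \nabla^{2} f (x) (x^{+} - x) = 0$ gives: $x^{+} = x - \nabla^{2} f (x)^{- 1} \nabla f (x)$

This yields the update rule of Newton’s method: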

x_{t + 1} = x_{t} - \nabla^{2} f (x_{t})^{- 1} \nabla f (x_{t})
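
A minimal sketch of this update in Python with NumPy. The helper name `newton_step` and the quadratic test objective (`Q`, `b`) are illustrative assumptions, not part of the lecture; the step solves a linear system rather than forming the inverse Hessian.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One pure Newton update: x - [hess(x)]^{-1} grad(x)."""
    # Solve hess(x) v = -grad(x) instead of inverting the Hessian explicitly.
    v = np.linalg.solve(hess(x), -grad(x))
    return x + v

# Illustrative objective (an assumption): f(x) = 0.5 x^T Q x - b^T x, Q positive definite.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: Q @ x - b
hess = lambda x: Q

x0 = np.array([10.0, -7.0])
x1 = newton_step(grad, hess, x0)
x_star = np.linalg.solve(Q, b)   # exact minimizer of the quadratic
print(x1, x_star)
```

Because the second-order model of a quadratic is exact, a single step from any starting point lands on the minimizer here; for general $f$ the step is only a local improvement.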

1.2 Affine Invariance of Newton’s Method

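Newton’s method is affine invariant: applying an invertible linear change of variables to the problem and then running Newton’s method produces the same iterates, up to that change of variables.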

Proof of affine invariance

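For an objective $f$ and an invertible matrix $A \in R^{n \times n}$, we compare two runs of Newton’s method: one on $f$ itself in the variable $y$, and one on $ϕ (x) := f (A x)$ in the variable $x$.

Under the change of variables $y := A x$, set $y_{t} = A x_{t}$. One Newton step on $f$ from $y_{t}$ gives: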
$y_{t + 1} = y_{t} - [\nabla^{2} f (y_{t})]^{- 1} \nabla f (y_{t}) = A x_{t} - [\nabla^{2} f (A x_{t})]^{- 1} \nabla f (A x_{t})$

Mapping back to the $x$-coordinates:
$\tilde{x}_{t + 1} := A^{- 1} y_{t + 1} = x_{t} - A^{- 1} [\nabla^{2} f (A x_{t})]^{- 1} \nabla f (A x_{t})$

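On the other hand, apply Newton’s method directly to $ϕ (x) := f (A x)$. By the chain rule, $\nabla ϕ (x) = A^{⊤} \nabla f (A x)$ and $\nabla^{2} ϕ (x) = A^{⊤} \nabla^{2} f (A x) A$, so: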
$x_{t + 1} = x_{t} - [\nabla^{2} ϕ (x_{t})]^{- 1} \nabla ϕ (x_{t}) = x_{t} - [A^{⊤} \nabla^{2} f (A x_{t}) A]^{- 1} A^{⊤} \nabla f (A x_{t}) = x_{t} - A^{- 1} [\nabla^{2} f (A x_{t})]^{- 1} \nabla f (A x_{t}) = \tilde{x}_{t + 1}$

$□$
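
The identity can also be checked numerically. The sketch below runs Newton’s method on a hypothetical objective $f$ in the $y$-coordinates and on $ϕ (x) = f (A x)$ in the $x$-coordinates, and confirms that the iterates stay related by $y_{t} = A x_{t}$; the objective and the matrix `A` are assumptions chosen only for illustration.

```python
import numpy as np

# Illustrative smooth convex objective (an assumption):
# f(y) = sum_i exp(y_i) + 0.5 * ||y||^2
def f_grad(y):
    return np.exp(y) + y

def f_hess(y):
    return np.diag(np.exp(y) + 1.0)

A = np.array([[2.0, 1.0], [0.5, 3.0]])   # invertible (determinant 5.5)

# phi(x) = f(Ax): the chain rule gives the gradient and Hessian below.
def phi_grad(x):
    return A.T @ f_grad(A @ x)

def phi_hess(x):
    return A.T @ f_hess(A @ x) @ A

x = np.array([0.3, -0.2])
y = A @ x
for _ in range(5):
    y = y - np.linalg.solve(f_hess(y), f_grad(y))       # Newton on f
    x = x - np.linalg.solve(phi_hess(x), phi_grad(x))   # Newton on phi
    assert np.allclose(A @ x, y)   # iterates remain related by y = A x
```

Gradient descent fails the same check in general, because its step $A^{⊤} \nabla f (A x)$ does not transform back to the step taken in the $y$-coordinates.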

1.3 Newton Decrement

For an objective $f$ and the current point $x$, the Newton decrement is defined as:

λ (x) = [\nabla f (x)^{⊤} (\nabla^{2} f (x))^{- 1} \nabla f (x)]^{1/2}

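This quantity can be interpreted in several ways.

First, it is the length of the Newton step measured in the norm induced by the Hessian. Writing $Δ x = x_{t + 1} - x_{t} = - [\nabla^{2} f (x)]^{- 1} \nabla f (x)$, we have:
$λ (x) = [Δ x^{⊤} (\nabla^{2} f (x)) Δ x]^{1/2} = ∥Δ x ∥_{\nabla^{2} f (x)}$
- Here $∥ \cdot ∥_{P}$ denotes the norm induced by a positive definite matrix $P$, i.e. $∥ v ∥_{P} = (v^{⊤} P v)^{1/2}$.
Second, it measures the gap between $f (x)$ and the minimum of the second-order Taylor model at $x$:
$\frac{1}{2} λ (x)^{2} = f (x) - \min_{y} [f (x) + \nabla f (x)^{⊤} (y - x) + \frac{1}{2} (y - x)^{⊤} \nabla^{2} f (x) (y - x)] = f (x) - [f (x) - \frac{1}{2} \nabla f (x)^{⊤} (\nabla^{2} f (x))^{- 1} \nabla f (x)]$
- Near the optimum, where the quadratic model is accurate, $\frac{1}{2} λ (x)^{2}$ can therefore be used as an estimate of the suboptimality $f (x) - f (x^{⋆})$.
Finally, the Newton decrement is also affine invariant.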
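
The two interpretations can be verified numerically. In this sketch the gradient `g` and Hessian `H` are hypothetical values at some point $x$, and `newton_decrement` is an illustrative helper name.

```python
import numpy as np

def newton_decrement(g, H):
    """lambda(x) = sqrt(g^T H^{-1} g) for gradient g and positive definite Hessian H."""
    return float(np.sqrt(g @ np.linalg.solve(H, g)))

# Hypothetical gradient and Hessian at the current point (assumed values).
g = np.array([1.0, -2.0])
H = np.array([[4.0, 1.0], [1.0, 3.0]])

lam = newton_decrement(g, H)
dx = -np.linalg.solve(H, g)                   # Newton step
hessian_norm = float(np.sqrt(dx @ H @ dx))    # ||dx|| in the Hessian norm
model_drop = -(g @ dx + 0.5 * dx @ H @ dx)    # f(x) minus the minimum of the quadratic model
print(lam, hessian_norm, model_drop)
```

Here `hessian_norm` agrees with `lam`, and `model_drop` agrees with `0.5 * lam**2`, matching the two identities above.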

1.4 Damped Newton’s Method (Backtracking Line Search)

In pure Newton’s method, the Newton direction is $v_{t} = - \nabla^{2} f (x_{t})^{- 1} \nabla f (x_{t})$ and the step size is fixed to $1$. Damped Newton’s method instead selects a step size $s$ by backtracking line search.

Armijo Condition.

A given step size $s > 0$ is accepted if and only if

f (x_{t} + s v_{t}) \leq f (x_{t}) + α s \nabla f (x_{t})^{⊤} v_{t},

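where $α \in (0, 1/2]$ sets the fraction of the predicted decrease that must be achieved. If the condition fails, the step size is shrunk by a factor $β \in (0, 1)$.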

Note: Armijo descent condition

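If $\nabla^{2} f (x_{t})$ is positive definite and $\nabla f (x_{t}) \neq 0$, then $\nabla f (x_{t})^{⊤} v_{t} = - λ (x_{t})^{2} < 0$, so the right-hand side demands a sufficient decrease relative to $f (x_{t})$.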

Algorithm.

Choose an initial step size $s_{0} > 0$ (typically $s_{0} = 1$) and hyperparameters $α \in (0, 1/2]$, $β \in (0, 1)$.
At outer iteration $t$:
- Compute the Newton direction: $v_{t} = - \nabla^{2} f (x_{t})^{- 1} \nabla f (x_{t})$.
- Set $s \leftarrow s_{0}$.
- Repeat (backtracking) until the Armijo condition holds:
  - If $f (x_{t} + s v_{t}) > f (x_{t}) + α s \nabla f (x_{t})^{⊤} v_{t}$, set $s \leftarrow β s$;
  - otherwise accept the step size and stop backtracking.
- Update: $x_{t + 1} = x_{t} + s v_{t}$.
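
The steps above can be sketched in Python with NumPy as follows. The stopping rule $λ (x)^{2} / 2 \leq$ `tol`, the default parameter values, and the test objective are assumptions added for illustration; they are not prescribed by the lecture.

```python
import numpy as np

def damped_newton(f, grad, hess, x0, alpha=0.25, beta=0.5, s0=1.0,
                  tol=1e-10, max_iter=100):
    """Newton's method with backtracking line search on the Armijo condition."""
    x = np.asarray(x0, dtype=float)
    steps = []                                 # accepted step sizes, for inspection
    for _ in range(max_iter):
        g = grad(x)
        v = np.linalg.solve(hess(x), -g)       # Newton direction
        slope = g @ v                          # directional derivative, < 0 if Hessian is PD
        if -slope / 2.0 <= tol:                # assumed stopping rule: lambda(x)^2 / 2 <= tol
            break
        s = s0
        while f(x + s * v) > f(x) + alpha * s * slope:
            s *= beta                          # shrink until the Armijo condition holds
        x = x + s * v
        steps.append(s)
    return x, steps

# Illustrative objective (an assumption): f(x) = sum_i sqrt(1 + x_i^2), minimized at 0.
f = lambda x: np.sum(np.sqrt(1.0 + x**2))
grad = lambda x: x / np.sqrt(1.0 + x**2)
hess = lambda x: np.diag((1.0 + x**2) ** -1.5)

x_hat, steps = damped_newton(f, grad, hess, x0=[2.0, -3.0])
print(x_hat, steps)
```

On this example the pure Newton update maps each coordinate $x_{i}$ to $- x_{i}^{3}$, so it diverges from any start with $|x_{i}| > 1$. The backtracking version shrinks the first steps and then accepts $s = 1$ once the iterates are close to the optimum.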

2. Convergence Analysis

2.1 Pure Newton’s Method

Assume $f$ is twice continuously differentiable ($f \in C^{2} (R^{n})$) and that its Hessian is Lipschitz continuous in a $δ$-neighborhood of the optimum $x^{⋆}$, i.e. there exists a constant $L > 0$ such that for all $x, y \in N_{δ} (x^{⋆})$:

∥ \nabla^{2} f (x) - \nabla^{2} f (y) ∥_{2} \leq L ∥ x - y ∥_{2}

If $f$ satisfies $\nabla f (x^{⋆}) = 0$ and $\nabla^{2} f (x^{⋆}) ≻ 0$, then pure Newton’s method has the following properties:

- If the initial point is sufficiently close to $x^{⋆}$, the iterates generated by Newton’s method converge to $x^{⋆}$.
- The sequence $\{x_{k}\}$ converges Q-quadratically:
  - $∥ x_{k + 1} - x^{⋆} ∥_{2} \leq L ∥ \nabla^{2} f (x^{⋆})^{- 1} ∥_{2} ∥ x_{k} - x^{⋆} ∥_{2}^{2} := C_{1} ∥ x_{k} - x^{⋆} ∥_{2}^{2}$.
  - In other words, if the initial point $x_{0}$ satisfies $∥ x_{0} - x^{⋆} ∥_{2} \leq \min \{δ, r, \frac{1}{2 L ∥ \nabla^{2} f (x^{⋆})^{- 1} ∥_{2}}\} := \hat{δ}$, then the error at least halves at every step, so all iterates remain in $N_{\hat{δ}} (x^{⋆})$ and converge to $x^{⋆}$. Here $r$ is the radius of a neighborhood of $x^{⋆}$ in which the Hessian is continuous and nonsingular.
- The sequence $\{∥\nabla f (x_{k}) ∥_{2}\}$ converges to zero Q-quadratically. Specifically:
  $∥\nabla f (x_{k + 1}) ∥_{2} \leq 2 L ∥ \nabla^{2} f (x^{⋆})^{- 1} ∥_{2}^{2} \cdot ∥\nabla f (x_{k}) ∥_{2}^{2} := C_{2} ∥\nabla f (x_{k}) ∥_{2}^{2}$
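
Q-quadratic convergence is easy to observe on a small example. The sketch below uses a hypothetical one-dimensional objective $f (x) = e^{x} - 2 x$ (minimizer $x^{⋆} = \ln 2$), chosen only for illustration, and tracks the ratio $e_{k + 1} / e_{k}^{2}$ of successive errors.

```python
import math

# Illustrative 1-D objective (an assumption): f(x) = exp(x) - 2x, minimized at x* = ln 2.
d1 = lambda x: math.exp(x) - 2.0   # f'(x)
d2 = lambda x: math.exp(x)         # f''(x) > 0 everywhere

x_star = math.log(2.0)
x = 1.0
errors = [abs(x - x_star)]
for _ in range(4):
    x = x - d1(x) / d2(x)          # pure Newton update
    errors.append(abs(x - x_star))

# Q-quadratic convergence: the ratio e_{k+1} / e_k^2 stays bounded.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(4)]
print(errors)
print(ratios)
```

The number of correct digits roughly doubles at each iteration, and the ratios settle near a constant rather than growing. Only four iterations are run because the error then reaches the limits of double precision.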

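Newton’s method therefore converges very fast, at a Q-quadratic rate, but this comes at a price:

- The initial point must be sufficiently close to the optimum: Newton’s method is only locally convergent.
- The Hessian $\nabla^{2} f (x^{⋆})$ must be positive definite. If it is singular (positive semidefinite but not positive definite), the rate may degrade to Q-linear.
- Although the condition number does not directly affect the convergence rate, for ill-conditioned problems the region of convergence may shrink, so the choice of the initial point becomes more demanding.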

2.2 Damped Newton’s Method with Strong Convexity

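Assume $f$ is $m$-strongly convex, $\nabla f$ is $L$-Lipschitz continuous, and $\nabla^{2} f$ is $M$-Lipschitz continuous. Then damped Newton’s method exhibits a two-stage convergence structure: there exist constants $η > 0$ and $γ > 0$ such that

- Stage 1 (Damped Phase): while $∥\nabla f (x_{k}) ∥_{2} \geq η$ (far from the optimum), every iteration decreases the objective by at least a constant:
  $f (x_{k + 1}) - f (x_{k}) \leq - γ$
  so this stage lasts at most $(f (x_{0}) - f (x^{⋆})) / γ$ iterations.
- Stage 2 (Pure Newton Phase): once $∥\nabla f (x_{k}) ∥_{2} < η$ (close to the optimum), backtracking with $s_{0} = 1$ accepts the full step and the gradient norm converges quadratically:
  $∥\nabla f (x_{k + 1}) ∥_{2} \leq C ∥\nabla f (x_{k}) ∥_{2}^{2}$

Strong convexity, global Lipschitz continuity, and the Armijo condition together ensure that, even far from the optimum, the algorithm does not diverge, the function value decreases at every iteration, and the gradient norm is eventually driven below $η$.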
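
To see why the second stage is so short, the following worked bound may help. It follows the standard textbook analysis (Boyd and Vandenberghe, Convex Optimization) and relies on constants not stated in these notes, namely $C = M / (2 m^{2})$ and a threshold $η \leq m^{2} / M$, so treat it as a sketch under those assumptions.

```latex
% a_k := C \|\nabla f(x_k)\|_2 with C = M / (2 m^2) (assumed); \eta \le m^2 / M gives a_k \le 1/2
a_{k+1} \le a_k^2
\;\Longrightarrow\;
a_{k+l} \le a_k^{2^l} \le \left(\tfrac{1}{2}\right)^{2^l}

% strong convexity gives f(x) - f(x^\star) \le \frac{1}{2m} \|\nabla f(x)\|_2^2, hence
f(x_{k+l}) - f(x^\star)
\le \frac{1}{2m} \cdot \frac{4 m^4}{M^2} \, a_{k+l}^2
\le \frac{2 m^3}{M^2} \left(\tfrac{1}{2}\right)^{2^{l+1}}
```

With $ε_{0} = 2 m^{3} / M^{2}$, reaching accuracy $ε$ in the second stage therefore takes about $\log_{2} \log_{2} (ε_{0} / ε)$ iterations, on top of the at most $(f (x_{0}) - f (x^{⋆})) / γ$ iterations of the first stage.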

Note: review of $m$-strong convexity

Definition (Strong Convexity). $f$ is $m$-strongly convex if there exists a constant $m > 0$ such that for all $x, y \in R^{n}$:
$f (y) \geq f (x) + \nabla f (x)^{⊤} (y - x) + \frac{m}{2} ∥ y - x ∥_{2}^{2}$
where $m$ is called the strong convexity constant.

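Definition (Lipschitz Continuity). The gradient $\nabla f$ is $L$-Lipschitz continuous if there exists a constant $L > 0$ such that for all $x, y \in R^{n}$:
$∥ \nabla f (x) - \nabla f (y) ∥_{2} \leq L ∥ x - y ∥_{2}$

Strong convexity has the following properties: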

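If $f \in C^{2}$, then $\nabla^{2} f (x) ⪰ m I$. The curvature is bounded below by $m$ and cannot degenerate: every eigenvalue of the Hessian is at least $m$.

Function values grow at least quadratically away from the optimum: $f (x) - f (x^{⋆}) \geq \frac{m}{2} ∥ x - x^{⋆} ∥_{2}^{2}$.

The gradient norm is bounded below in terms of $m$: $∥\nabla f (x) ∥_{2} \geq m ∥ x - x^{⋆} ∥_{2}$.

The gradient is strongly monotone: $(\nabla f (x) - \nabla f (x^{⋆}))^{⊤} (x - x^{⋆}) \geq m ∥ x - x^{⋆} ∥_{2}^{2}$.

Under strong convexity plus Lipschitz continuity of the gradient, the following three quantities are equivalent up to constants: $∥ x - x^{⋆} ∥_{2}^{2} \sim f (x) - f (x^{⋆}) \sim ∥\nabla f (x) ∥_{2}^{2}$.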
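
The equivalence can be made explicit. The chain below is a sketch that uses only $m$-strong convexity, an $L$-Lipschitz gradient, and $\nabla f (x^{⋆}) = 0$:

```latex
\frac{m}{2} \|x - x^\star\|_2^2 \;\le\; f(x) - f(x^\star) \;\le\; \frac{L}{2} \|x - x^\star\|_2^2

m \|x - x^\star\|_2 \;\le\; \|\nabla f(x)\|_2 \;\le\; L \|x - x^\star\|_2

\frac{1}{2L} \|\nabla f(x)\|_2^2 \;\le\; f(x) - f(x^\star) \;\le\; \frac{1}{2m} \|\nabla f(x)\|_2^2
```

Any two of the three quantities therefore differ by a factor that depends only on $m$ and $L$.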

2.3 Convergence under Self-concordance

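Definition (Self-concordance). Take a univariate function as an example. A convex function $f$ is $κ$-self-concordant if there exists a constant $κ > 0$ such that for all $x \in dom (f)$:

\left| \frac{d ^{3}}{d x ^{3}} f (x) \right| \leq κ \left( \frac{d ^{2}}{d x ^{2}} f (x) \right)^{3/2}

By convention $κ = 2$. Any other value reduces to this case by rescaling: if $g$ is $κ$-self-concordant, then $f (x) = κ^{2} g (x) /4$ satisfies the inequality with constant $2$.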
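
As a concrete check, the standard example $f (x) = - \log x$ on $x > 0$ (chosen here for illustration, not taken from the lecture) meets the definition with constant $2$, with equality. The second display verifies the rescaling argument.

```latex
f(x) = -\log x, \quad f''(x) = \frac{1}{x^2}, \quad f'''(x) = -\frac{2}{x^3}
\;\Longrightarrow\;
|f'''(x)| = \frac{2}{x^3} = 2 \left( f''(x) \right)^{3/2}

% rescaling: suppose |g'''| \le \kappa (g'')^{3/2} and set f = \kappa^2 g / 4
|f'''| = \frac{\kappa^2}{4} |g'''|
\le \frac{\kappa^3}{4} (g'')^{3/2}
= \frac{\kappa^3}{4} \left( \frac{4}{\kappa^2} f'' \right)^{3/2}
= 2 \, (f'')^{3/2}
```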

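If the objective $f$ is self-concordant, the strong convexity and smoothness assumptions above are no longer needed: damped Newton’s method still exhibits the same two-stage (damped, then quadratic) convergence structure.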

3. Discussion

3.1 Comparison with First-order Methods

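| Method | Gradient Descent | Newton’s Method |
|---|---|---|
| Memory per iteration | $O (n)$ | $O (n^{2})$ |
| Computation per iteration | $O (n)$ (vector operations) | $O (n^{3})$ (for a dense Hessian) |
| Backtracking cost per inner step | $O (n)$ | $O (n)$ |
| Sensitivity to conditioning | Sensitive | Locally insensitive |
| Robustness | High | Low: affected by numerical instability, singular Hessians, etc. |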

3.2 Sparse, Structured Problems

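Forming the Hessian and solving the associated linear system is the bottleneck of Newton’s method. For structured problems, e.g. Hessians that are sparse, banded (nonzero entries only near the main diagonal), block diagonal, Toeplitz / Kronecker structured, or low-rank, the structure can be exploited to speed up this solve.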
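
As one illustration, a tridiagonal (banded) Hessian lets the Newton system $\nabla^{2} f (x) v = - \nabla f (x)$ be solved in $O (n)$ instead of $O (n^{3})$. The sketch below uses the Thomas algorithm in plain Python; the function name, argument layout, and example values are assumptions made for illustration, and no pivoting is performed, which is acceptable when the matrix is symmetric positive definite.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve T v = rhs in O(n) for a tridiagonal T (Thomas algorithm, no pivoting).

    lower[i] = T[i+1][i], diag[i] = T[i][i], upper[i] = T[i][i+1]; requires n >= 2.
    """
    n = len(diag)
    c = [0.0] * (n - 1)   # modified super-diagonal
    d = [0.0] * n         # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                      # forward elimination
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    v = [0.0] * n
    v[-1] = d[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        v[i] = d[i] - c[i] * v[i + 1]
    return v

# Hypothetical tridiagonal Hessian (2 on the diagonal, -1 next to it) and gradient.
hess_diag = [2.0, 2.0, 2.0]
hess_off = [-1.0, -1.0]
gradient = [-1.0, 0.0, -1.0]

# Newton direction: solve H v = -gradient.
v = solve_tridiagonal(hess_off, hess_diag, hess_off, [-g for g in gradient])
print(v)
```

Only the three diagonals are stored, so memory also drops from $O (n^{2})$ to $O (n)$.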

3.3 Equality Constrained Newton’s Method

For the equality-constrained optimization problem:

\min_{x \in R^{n}} f (x) \quad \text{s.t.} \quad A x = b

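A natural idea is to restrict the update to the tangent space of the feasible set, i.e. the null space of $A$. Let $v$ denote the update direction, so that $x^{+} = x + v$. If the current $x$ is feasible ($A x = b$), then requiring $A v = 0$ guarantees $A x^{+} = A (x + v) = A x + A v = b$.

Going back to the second-order expansion used to derive Newton’s method, the direction $v$ solves:

\min_{v : A v = 0} [\nabla f (x)^{⊤} v + \frac{1}{2} v^{⊤} \nabla^{2} f (x) v]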

The corresponding Lagrangian is:

L (v, λ) = \nabla f (x)^{⊤} v + \frac{1}{2} v^{⊤} \nabla^{2} f (x) v + λ^{⊤} A v

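The stationarity conditions are:

\begin{cases} \nabla f (x) + \nabla^{2} f (x) v + A^{⊤} λ = 0 \\ A v = 0 \end{cases} ⟺ \begin{bmatrix} \nabla^{2} f (x) & A^{⊤} \\ A & 0 \end{bmatrix} \begin{bmatrix} v \\ λ \end{bmatrix} = \begin{bmatrix} - \nabla f (x) \\ 0 \end{bmatrix}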

After solving this system for $v$, update:

x_{t + 1} = x_{t} + v
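
A minimal sketch of one such step in Python with NumPy. The helper name `eq_newton_step` and the quadratic test problem are illustrative assumptions; the sketch also assumes a feasible starting point, a positive definite Hessian, and a full-row-rank $A$, so that the KKT matrix is nonsingular.

```python
import numpy as np

def eq_newton_step(g, H, A):
    """Solve [[H, A^T], [A, 0]] [v; lam] = [-g; 0] for the Newton direction v."""
    n, p = H.shape[0], A.shape[0]
    K = np.block([[H, A.T], [A, np.zeros((p, p))]])   # KKT matrix
    rhs = np.concatenate([-g, np.zeros(p)])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]                           # direction v, multiplier lam

# Illustrative problem (an assumption): minimize 0.5 x^T Q x - c^T x  s.t.  x1 + x2 + x3 = 1
Q = np.diag([1.0, 2.0, 4.0])
c = np.array([1.0, 0.0, -1.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

x = np.array([1.0, 0.0, 0.0])                # feasible start: A x = b
v, lam = eq_newton_step(Q @ x - c, Q, A)
x_new = x + v
print(x_new, lam)
```

Since $A v = 0$, the new point stays feasible. For this quadratic objective a single step already satisfies the optimality condition $\nabla f (x) + A^{⊤} λ = 0$; for general $f$ the step would be repeated, optionally with backtracking.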
