Paper Title
Gradient Descent and the Power Method: Exploiting their connection to find the leftmost eigen-pair and escape saddle points
Paper Authors
Paper Abstract
This work shows that applying Gradient Descent (GD) with a fixed step size to minimize a (possibly nonconvex) quadratic function is equivalent to running the Power Method (PM) on the gradients. The connection between GD with a fixed step size and the PM, both with and without fixed momentum, is thus established. Consequently, valuable eigen-information is available via GD. Recent examples show that GD with a fixed step size, applied to locally quadratic nonconvex functions, can take exponential time to escape saddle points (Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabas Poczos: "Gradient descent can take exponential time to escape saddle points"; S. Paternain, A. Mokhtari, and A. Ribeiro: "A newton-based method for nonconvex optimization with fast evasion of saddle points"). Here, those examples are revisited and it is shown that eigenvalue information was missing, so that the examples may not provide a complete picture of the potential practical behaviour of GD. Thus, ongoing investigation of the behaviour of GD on nonconvex functions, possibly with an \emph{adaptive} or \emph{variable} step size, is warranted. It is shown that, in the special case of a quadratic in $R^2$, if an eigenvalue is known, then GD with a fixed step size will converge in two iterations, and a complete eigen-decomposition is available. By considering the dynamics of the gradients and iterates, new step size strategies are proposed to improve the practical performance of GD. Several numerical examples are presented, which demonstrate the advantages of exploiting the GD--PM connection.
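A minimal sketch of the stated GD--PM equivalence may help; the notation ($f$, $A$, $b$, $\alpha$, $g_k$) is chosen here for illustration and is not taken from the paper. For a quadratic $f(x) = \tfrac{1}{2}x^\top A x + b^\top x$, the gradient at $x_k$ is $g_k = A x_k + b$, so one fixed-step GD update $x_{k+1} = x_k - \alpha g_k$ gives
\[
  g_{k+1} = A x_{k+1} + b = (A x_k + b) - \alpha A\, g_k = (I - \alpha A)\, g_k .
\]
The gradients are therefore the (unnormalized) iterates of the power method applied to $I - \alpha A$, and they align with the eigenvector of $A$ whose eigenvalue lies farthest from $1/\alpha$; for a suitably large fixed step size this is the leftmost eigen-pair, which is how GD can expose the eigen-information referred to in the abstract.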