Paper Title

Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function

Paper Authors

Lingkai Kong, Molei Tao

Paper Abstract

This article suggests that deterministic Gradient Descent, which does not use any stochastic gradient approximation, can still exhibit stochastic behaviors. In particular, it shows that if the objective function exhibits multiscale behaviors, then in a large learning rate regime which only resolves the macroscopic but not the microscopic details of the objective, the deterministic GD dynamics can become chaotic and converge not to a local minimizer but to a statistical distribution. A sufficient condition is also established for approximating this long-time statistical limit by a rescaled Gibbs distribution. Both theoretical and numerical demonstrations are provided, and the theoretical part relies on the construction of a stochastic map that uses bounded noise (as opposed to discretized diffusions).
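As a rough illustration of the phenomenon described in the abstract, the sketch below runs deterministic gradient descent on an assumed multiscale objective f(x) = x²/2 + ε·cos(x/ε) with a learning rate h much larger than the microscopic scale ε. This toy objective and its parameters are illustrative choices, not necessarily the paper's own experiment. Because the gradient of the fast oscillation is O(1), the iterates do not settle at a single minimizer; their long-run statistics instead resemble a stationary distribution.

```python
# Toy sketch (assumed setup, not the paper's exact experiment):
# deterministic GD on f(x) = 0.5*x**2 + eps*cos(x/eps) with h >> eps.
import numpy as np

eps = 1e-3   # microscopic scale of the objective (illustrative choice)
h = 0.1      # "large" learning rate: resolves the macroscopic quadratic, not the oscillation

def grad_f(x):
    # f(x) = 0.5*x**2 + eps*cos(x/eps)  =>  f'(x) = x - sin(x/eps)
    return x - np.sin(x / eps)

x, burn_in, iters = 1.0, 10_000, 100_000
samples = []
for k in range(iters):
    x = x - h * grad_f(x)   # plain deterministic gradient descent, no injected noise
    if k >= burn_in:
        samples.append(x)

samples = np.array(samples)
# The iterates never converge to a point; their empirical statistics
# summarize the long-time limit instead.
print(f"mean = {samples.mean():.4f}, std = {samples.std():.4f}")
print("last iterates:", np.round(samples[-5:], 4))
```

In this regime the update reduces to x ← 0.9·x + 0.1·sin(x/ε), where the fast term behaves like a bounded pseudo-random perturbation; this is one way to picture the paper's construction of a stochastic map driven by bounded noise rather than a discretized diffusion.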
