Paper Title

The Strength of Nesterov's Extrapolation in the Individual Convergence of Nonsmooth Optimization

Paper Authors

Tao, W., Pan, Z., Wu, G., Tao, Q.

Paper Abstract

The extrapolation strategy raised by Nesterov, which can accelerate the convergence rate of gradient descent methods by orders of magnitude when dealing with smooth convex objectives, has led to tremendous success in training machine learning tasks. In this article, the convergence of individual iterates of projected subgradient (PSG) methods for nonsmooth convex optimization problems is theoretically studied based on Nesterov's extrapolation, which we name individual convergence. We prove that Nesterov's extrapolation has the strength to make the individual convergence of PSG optimal for nonsmooth problems. In light of this consideration, a direct modification of the subgradient evaluation suffices to achieve optimal individual convergence for strongly convex problems, which can be regarded as making an interesting step toward the open question about stochastic gradient descent (SGD) posed by Shamir. Furthermore, we give an extension of the derived algorithms to solve regularized learning tasks with nonsmooth losses in stochastic settings. Compared with other state-of-the-art nonsmooth methods, the derived algorithms can serve as an alternative to the basic SGD especially in coping with machine learning problems, where an individual output is needed to guarantee the regularization structure while keeping an optimal rate of convergence. Typically, our method is applicable as an efficient tool for solving large-scale $\ell_1$-regularized hinge-loss learning problems. Several comparison experiments demonstrate that our individual output not only achieves an optimal convergence rate but also guarantees better sparsity than the averaged solution.
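The abstract does not spell out the authors' exact update rule, so the following is only a minimal illustrative sketch of the general idea it describes: a projected subgradient step for an $\ell_1$-constrained hinge-loss problem combined with a Nesterov-style extrapolation term, returning the individual (last) iterate rather than an average. The extrapolation schedule $(t-1)/(t+2)$, the $O(1/\sqrt{t})$ step size, the $\ell_1$-ball radius, and the helper names (hinge_subgradient, project_l1_ball, psg_with_extrapolation) are all assumptions made for the example, not the paper's algorithm.

```python
import numpy as np

def hinge_subgradient(w, X, y):
    """Subgradient of the average hinge loss (1/n) * sum_i max(0, 1 - y_i <w, x_i>)."""
    margins = y * (X @ w)
    active = margins < 1.0
    # Each sample with margin < 1 contributes -y_i * x_i to the subgradient.
    return -(X[active].T @ y[active]) / X.shape[0]

def project_l1_ball(v, radius=1.0):
    """Euclidean projection onto the L1 ball {w : ||w||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]                      # sorted magnitudes, descending
    cssv = np.cumsum(u) - radius
    rho = np.nonzero(u > cssv / (np.arange(len(u)) + 1))[0][-1]
    theta = cssv[rho] / (rho + 1.0)                   # soft-thresholding level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def psg_with_extrapolation(X, y, T=1000, radius=5.0):
    """Projected subgradient with a Nesterov-style extrapolation (momentum) term."""
    n, d = X.shape
    w = np.zeros(d)
    w_prev = np.zeros(d)
    for t in range(1, T + 1):
        beta = (t - 1) / (t + 2)            # extrapolation weight (assumed schedule)
        v = w + beta * (w - w_prev)         # extrapolated point
        g = hinge_subgradient(v, X, y)      # subgradient at the extrapolated point
        step = 1.0 / np.sqrt(t)             # O(1/sqrt(t)) step size for nonsmooth problems
        w_prev, w = w, project_l1_ball(v - step * g, radius)
    return w                                # individual (last) iterate, not an average
```

Returning the last iterate, as sketched here, is what preserves the sparsity induced by the $\ell_1$ constraint; averaging the iterates, as in classical PSG analyses, would generally destroy it, which is the trade-off the abstract highlights.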
