Paper Title

Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently

Paper Authors

Haoyuan Sun, Kwangjun Ahn, Christos Thrampoulidis, Navid Azizan

Paper Abstract

Driven by the empirical success and wide use of deep neural networks, understanding the generalization performance of overparameterized models has become an increasingly popular question. To this end, there has been substantial effort to characterize the implicit bias of the optimization algorithms used, such as gradient descent (GD), and the structural properties of their preferred solutions. This paper answers an open question in this literature: For the classification setting, what solution does mirror descent (MD) converge to? Specifically, motivated by its efficient implementation, we consider the family of mirror descent algorithms with potential function chosen as the $p$-th power of the $\ell_p$-norm, which is an important generalization of GD. We call this algorithm $p$-$\textsf{GD}$. For this family, we characterize the solutions it obtains and show that it converges in direction to a generalized maximum-margin solution with respect to the $\ell_p$-norm for linearly separable classification. While the MD update rule is in general expensive to compute and perhaps not suitable for deep learning, $p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can be used to train deep neural networks with virtually no additional computational overhead. Using comprehensive experiments with both linear and deep neural network models, we demonstrate that $p$-$\textsf{GD}$ can noticeably affect the structure and the generalization performance of the learned models.
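To make the update concrete, here is a minimal sketch of one $p$-$\textsf{GD}$ step, assuming NumPy; the function name `pgd_step` and its signature are ours for illustration, not from the paper's code. With the potential $\psi(w) = \frac{1}{p}\|w\|_p^p$ (for $p > 1$), the mirror map $\nabla\psi(w) = \mathrm{sign}(w) \odot |w|^{p-1}$ acts entrywise, which is why the update parallelizes like SGD.

```python
import numpy as np

def pgd_step(w, grad, lr, p=3.0):
    """One p-GD (mirror descent) step with potential psi(w) = ||w||_p^p / p.

    The mirror map and its inverse act entrywise on the parameters,
    so the update is fully elementwise/parallelizable, like SGD.
    Illustrative sketch only; assumes p > 1.
    """
    # Map parameters to the dual (mirror) space: z = sign(w) * |w|^(p-1).
    z = np.sign(w) * np.abs(w) ** (p - 1)
    # Take a gradient step in the dual space.
    z = z - lr * grad
    # Map back to the primal space via the inverse mirror map.
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))
```

Note that $p = 2$ makes the mirror map the identity, so the step reduces to plain gradient descent.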
