Paper Title

Gradient-Based Empirical Risk Minimization using Local Polynomial Regression

Paper Authors

Ali Jadbabaie, Anuran Makur, Devavrat Shah

Paper Abstract

In this paper, we consider the problem of empirical risk minimization (ERM) of smooth, strongly convex loss functions using iterative gradient-based methods. A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD), by analyzing their rates of convergence to $ε$-approximate solutions. For example, the oracle complexity of GD is $O(n\log(ε^{-1}))$, where $n$ is the number of training samples. When $n$ is large, this can be expensive in practice, and SGD is preferred due to its oracle complexity of $O(ε^{-1})$. Such standard analyses only utilize the smoothness of the loss function in the parameter being optimized. In contrast, we demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD in important regimes. Specifically, at every iteration, our proposed algorithm performs local polynomial regression to learn the gradient of the loss function, and then estimates the true gradient of the ERM objective function. We establish that the oracle complexity of our algorithm scales like $\tilde{O}((p ε^{-1})^{d/(2η)})$ (neglecting sub-dominant factors), where $d$ and $p$ are the data and parameter space dimensions, respectively, and the gradient of the loss function belongs to a $η$-Hölder class with respect to the data. Our proof extends the analysis of local polynomial regression in non-parametric statistics to provide interpolation guarantees in multivariate settings, and also exploits tools from the inexact GD literature. Unlike GD and SGD, the complexity of our method depends on $d$ and $p$. However, when $d$ is small and the loss function exhibits modest smoothness in the data, our algorithm beats GD and SGD in oracle complexity for a very broad range of $p$ and $ε$.
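To make the high-level procedure concrete, here is a minimal, illustrative sketch (not the authors' implementation): at each iteration, the per-sample gradient is queried at a small subsample of data points, fitted as a function of the data by kernel-weighted (local) polynomial least squares, and the fitted values are averaged over the full training set to form an inexact ERM gradient for the descent step. The Gaussian kernel, the low-degree features without cross terms, the helper names (`poly_features`, `local_poly_predict`, `erm_gradient_estimate`, `inexact_gd`), the hyperparameters (`m`, `h`, `lr`, `T`), and the toy least-squares loss are all assumptions of this sketch, not details taken from the paper.

```python
# Illustrative sketch only: inexact gradient descent where the ERM gradient is
# approximated via local polynomial regression of the per-sample gradient over
# the data. Hyperparameters, kernel, and loss are assumptions, not the paper's.
import numpy as np

def poly_features(z, degree):
    # Per-coordinate monomial features of z up to `degree`
    # (cross terms omitted for brevity; an assumption of this sketch).
    feats = [np.ones(1)]
    for k in range(1, degree + 1):
        feats.append(z ** k)
    return np.concatenate(feats)

def local_poly_predict(X_sub, G_sub, x0, h, degree):
    # Kernel-weighted polynomial least squares centered at the query point x0;
    # the fitted value at the center (the intercept row) estimates the gradient at x0.
    Z = np.stack([poly_features(x - x0, degree) for x in X_sub])      # (m, q)
    w = np.exp(-np.sum((X_sub - x0) ** 2, axis=1) / (2.0 * h ** 2))   # Gaussian kernel weights
    W = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(W * Z, W * G_sub, rcond=None)          # (q, p)
    return poly_features(np.zeros_like(x0), degree) @ coef            # evaluate fit at x0

def erm_gradient_estimate(X, theta, grad_loss, m, h, degree, rng):
    # Query the per-sample gradient oracle at m random data points, interpolate it
    # over all data points, and average to approximate (1/n) * sum_i grad loss(theta; x_i).
    idx = rng.choice(len(X), size=m, replace=False)
    X_sub = X[idx]
    G_sub = np.stack([grad_loss(theta, x) for x in X_sub])            # m oracle calls
    return np.mean([local_poly_predict(X_sub, G_sub, x, h, degree) for x in X], axis=0)

def inexact_gd(X, grad_loss, p, T=50, m=20, h=1.0, degree=1, lr=0.1, seed=0):
    # Plain gradient descent driven by the estimated (inexact) ERM gradient.
    rng = np.random.default_rng(seed)
    theta = np.zeros(p)
    for _ in range(T):
        theta = theta - lr * erm_gradient_estimate(X, theta, grad_loss, m, h, degree, rng)
    return theta

if __name__ == "__main__":
    # Toy check: least-squares loss 0.5 * (a @ theta - b)**2, whose gradient is
    # polynomial (hence smooth) in the data point x = (a, b).
    rng = np.random.default_rng(1)
    n, p = 200, 2
    A = rng.normal(size=(n, p))
    b = A @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)
    X = np.hstack([A, b[:, None]])

    def grad_loss(theta, x):
        a, y = x[:-1], x[-1]
        return (a @ theta - y) * a

    # Should land in the vicinity of [1.0, -2.0] when the gradient fit is accurate.
    print("estimated theta:", inexact_gd(X, grad_loss, p))
```

The least-squares loss is used here only because its gradient is polynomial, and therefore smooth, in the data; for rougher losses the polynomial degree and kernel bandwidth in the local regression would have to be matched to the Hölder exponent $η$ that governs the paper's oracle-complexity bound.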
