Paper Title

An adaptive Hessian approximated stochastic gradient MCMC method

Authors

Yating Wang, Wei Deng, Guang Lin

Abstract

Bayesian approaches have been successfully integrated into training deep neural networks. One popular family is stochastic gradient Markov chain Monte Carlo methods (SG-MCMC), which have gained increasing interest due to their scalability to handle large datasets and their ability to avoid overfitting. Although standard SG-MCMC methods have shown great performance in a variety of problems, they may be inefficient when the random variables in the target posterior densities have scale differences or are highly correlated. In this work, we present an adaptive Hessian approximated stochastic gradient MCMC method to incorporate local geometric information while sampling from the posterior. The idea is to apply stochastic approximation to sequentially update a preconditioning matrix at each iteration. The preconditioner possesses second-order information and can guide the random walk of a sampler efficiently. Instead of computing and saving the full Hessian of the log posterior, we use a limited memory of samples and their stochastic gradients to approximate the inverse Hessian-vector multiplication in the updating formula. Moreover, by smoothly optimizing the preconditioning matrix, our proposed algorithm can asymptotically converge to the target distribution with a controllable bias under mild conditions. To reduce the training and testing computational burden, we adopt a magnitude-based weight pruning method to enforce the sparsity of the network. Our method is user-friendly and is scalable to standard SG-MCMC updating rules by implementing an additional preconditioner. The sparse approximation of the inverse Hessian alleviates storage and computational complexities for large dimensional models. The bias introduced by stochastic approximation is controllable and can be analyzed theoretically. Numerical experiments are performed on several problems.
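The core idea described in the abstract — preconditioning a stochastic-gradient sampler with a limited-memory inverse-Hessian approximation — can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: it uses the standard L-BFGS two-loop recursion for the inverse Hessian-vector product and a plain SGLD-style update, and omits the paper's bias-correction and adaptive smoothing terms. All function and variable names here are ours, introduced for illustration.

```python
import numpy as np

def two_loop_hvp(grad, s_hist, y_hist):
    """Approximate the inverse-Hessian-vector product H^{-1} @ grad via the
    L-BFGS two-loop recursion, using a limited memory of parameter
    differences s_k and gradient differences y_k (oldest first)."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * y
    # Initial scaling H_0 = gamma * I from the most recent curvature pair.
    if s_hist:
        gamma = np.dot(s_hist[-1], y_hist[-1]) / np.dot(y_hist[-1], y_hist[-1])
    else:
        gamma = 1.0
    r = gamma * q
    for (s, y, rho), a in zip(zip(s_hist, y_hist, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)
        r += (a - b) * s
    return r

def preconditioned_sgld(grad_U, theta, n_iter=100, step=1e-3, mem=10, rng=None):
    """Simplified preconditioned SGLD sketch. grad_U returns a (stochastic)
    gradient of the negative log posterior U. The noise term is left
    unpreconditioned here for brevity, which the actual method refines."""
    rng = np.random.default_rng(0) if rng is None else rng
    s_hist, y_hist = [], []
    prev_theta, prev_grad = None, None
    samples = []
    for _ in range(n_iter):
        g = grad_U(theta)
        if prev_theta is not None:
            s, y = theta - prev_theta, g - prev_grad
            if np.dot(s, y) > 1e-10:  # keep only positive-curvature pairs
                s_hist.append(s); y_hist.append(y)
                if len(s_hist) > mem:
                    s_hist.pop(0); y_hist.pop(0)
        prev_theta, prev_grad = theta.copy(), g.copy()
        d = two_loop_hvp(g, s_hist, y_hist)          # ~ H^{-1} grad_U
        noise = rng.normal(size=theta.shape) * np.sqrt(2.0 * step)
        theta = theta - step * d + noise             # preconditioned Langevin step
        samples.append(theta.copy())
    return np.array(samples)
```

On a target whose coordinates have very different scales (the setting the abstract highlights), the two-loop preconditioner rescales the gradient so that all directions take comparably sized steps, which is what plain SGLD lacks.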
