Paper Title
Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport
Paper Authors
Paper Abstract
The problem of optimization on the Stiefel manifold, i.e., minimizing functions of (not necessarily square) matrices that satisfy orthogonality constraints, has been extensively studied. Yet a new approach is proposed here, based for the first time on an interplay between thoughtfully designed continuous and discrete dynamics. It leads to a gradient-based optimizer with intrinsically added momentum. This method exactly preserves the manifold structure, yet requires no additional operations to keep the momentum in the changing (co)tangent space, and thus has low computational cost and high accuracy. Its generalization to adaptive learning rates is also demonstrated. Notable performance is observed in practical tasks. For instance, we found that placing orthogonality constraints on the attention heads of a trained-from-scratch Vision Transformer [Dosovitskiy et al. 2022] can markedly improve its performance when our optimizer is used, and that it is better for each head to be orthogonal within itself, but not necessarily to the other heads. This optimizer also makes the useful notion of Projection Robust Wasserstein Distance [Paty & Cuturi 2019; Lin et al. 2020] for high-dimensional optimal transport even more effective.
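For readers unfamiliar with the setting, the Stiefel manifold is the set of matrices with orthonormal columns,

$$\mathrm{St}(n, m) = \{ X \in \mathbb{R}^{n \times m} : X^\top X = I_m \}, \qquad n \ge m,$$

and the problem in question is $\min_{X \in \mathrm{St}(n, m)} f(X)$ for a smooth $f$.

Below is a minimal NumPy sketch of a naive baseline for this problem, not the paper's optimizer: Riemannian gradient descent with heavy-ball momentum, using a QR retraction and an explicit re-projection of the momentum onto the new tangent space after every step. That re-projection is exactly the kind of additional operation the paper's method avoids; all function names here are illustrative.

```python
import numpy as np

def proj_tangent(X, G):
    """Project an ambient matrix G onto the tangent space of St(n, m) at X:
    G - X * sym(X^T G), where sym(A) = (A + A^T) / 2."""
    XtG = X.T @ G
    return G - X @ ((XtG + XtG.T) / 2)

def retract_qr(Y):
    """QR-based retraction: map an ambient matrix back onto St(n, m).
    Column signs are fixed so that diag(R) >= 0, making Q unique."""
    Q, R = np.linalg.qr(Y)
    signs = np.sign(np.sign(np.diag(R)) + 0.5)  # maps 0 -> +1
    return Q * signs

def stiefel_sgd_momentum(euclid_grad, X, lr=1e-2, beta=0.9, steps=500):
    """Naive momentum SGD on the Stiefel manifold (baseline sketch).
    euclid_grad(X) returns the ordinary Euclidean gradient of f at X."""
    M = np.zeros_like(X)
    for _ in range(steps):
        G = proj_tangent(X, euclid_grad(X))  # Riemannian gradient
        M = beta * M + G                     # heavy-ball momentum
        X = retract_qr(X - lr * M)           # descent step + retraction
        M = proj_tangent(X, M)               # extra momentum re-projection
    return X

# Toy usage: minimize f(X) = -trace(A^T X), whose minimizer on St(8, 3)
# is the polar factor of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
X0 = retract_qr(rng.standard_normal((8, 3)))
X_star = stiefel_sgd_momentum(lambda X: -A, X0)
assert np.allclose(X_star.T @ X_star, np.eye(3), atol=1e-10)  # stays on St(8, 3)
```

By contrast, the abstract's point is that a careful continuous-time design lets the discrete update preserve the manifold structure without this momentum re-projection step, which is what yields the lower computational cost.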