Paper Title
Data-Efficient Augmentation for Training Neural Networks
Paper Authors
Paper Abstract
Data augmentation is essential to achieve state-of-the-art performance in many deep learning applications. However, the most effective augmentation techniques become computationally prohibitive for even medium-sized datasets. To address this, we propose a rigorous technique to select subsets of data points that, when augmented, closely capture the training dynamics of full data augmentation. We first show that data augmentation, modeled as additive perturbations, improves learning and generalization by relatively enlarging and perturbing the smaller singular values of the network Jacobian, while preserving its prominent directions. This prevents overfitting and enhances learning of the harder-to-learn information. Then, we propose a framework to iteratively extract small subsets of training data that, when augmented, closely capture the alignment of the fully augmented Jacobian with the labels/residuals. We prove that stochastic gradient descent applied to the augmented subsets found by our approach has training dynamics similar to those of the fully augmented data. Our experiments demonstrate that our method achieves a 6.3x speedup on CIFAR10 and a 2.2x speedup on SVHN, and outperforms the baselines by up to 10% across various subset sizes. Similarly, on TinyImageNet and ImageNet, our method beats the baselines by up to 8%, while achieving up to a 3.3x speedup across various subset sizes. Finally, on a version of CIFAR10 corrupted with label noise, training on and augmenting 50% subsets selected by our method even outperforms training on the full dataset. Our code is available at: https://github.com/tianyu139/data-efficient-augmentation
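To make the subset-selection idea in the abstract concrete, below is a minimal toy sketch of the general principle: greedily choose a small, weighted subset of per-example gradient/alignment features whose weighted sum approximates that of the fully augmented data. This is only an illustration under simplified assumptions (random feature vectors, a plain greedy scoring rule, least-squares weights); it is not the authors' algorithm, and the function `select_subset` and its scoring rule are hypothetical stand-ins. The paper's actual implementation is in the linked repository.

```python
import numpy as np

def select_subset(G, k):
    """Greedily choose k rows of G whose weighted sum approximates the sum of
    all rows -- a toy stand-in for matching the alignment of the fully
    augmented Jacobian with the labels/residuals.

    G : (n, d) array, row i being a per-example gradient/alignment feature.
    Returns the selected indices and per-example weights.
    """
    target = G.sum(axis=0)            # signal of the fully augmented data
    selected = []
    residual = target.copy()
    for _ in range(k):
        scores = G @ residual         # alignment with what is still unexplained
        scores[selected] = -np.inf    # do not pick the same example twice
        i = int(np.argmax(scores))
        selected.append(i)
        residual = residual - G[i]
    # least-squares weights so the weighted subset best matches the target
    weights, *_ = np.linalg.lstsq(G[selected].T, target, rcond=None)
    return np.array(selected), weights

# Toy usage: 1000 "augmented" feature vectors, keep a 5% subset.
rng = np.random.default_rng(0)
G = rng.normal(size=(1000, 64))
idx, w = select_subset(G, k=50)
rel_err = np.linalg.norm(w @ G[idx] - G.sum(axis=0)) / np.linalg.norm(G.sum(axis=0))
print(f"relative error of subset approximation: {rel_err:.3f}")
```

The greedy re-scoring against the unexplained residual is just the simplest way to mimic "capturing the alignment of the fully augmented Jacobian with labels/residuals"; the paper itself operates on network Jacobians and re-extracts the subsets iteratively during training.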