Paper Title

Adversarial Unlearning: Reducing Confidence Along Adversarial Directions

Paper Authors

Amrith Setlur, Benjamin Eysenbach, Virginia Smith, Sergey Levine

Paper Abstract

Supervised learning methods trained with maximum likelihood objectives often overfit on training data. Most regularizers that prevent overfitting look to increase confidence on additional examples (e.g., data augmentation, adversarial training), or reduce it on training data (e.g., label smoothing). In this work we propose a complementary regularization strategy that reduces confidence on self-generated examples. The method, which we call RCAD (Reducing Confidence along Adversarial Directions), aims to reduce confidence on out-of-distribution examples lying along directions adversarially chosen to increase training loss. In contrast to adversarial training, RCAD does not try to robustify the model to output the original label, but rather regularizes it to have reduced confidence on points generated using much larger perturbations than in conventional adversarial training. RCAD can be easily integrated into training pipelines with a few lines of code. Despite its simplicity, we find on many classification benchmarks that RCAD can be added to existing techniques (e.g., label smoothing, MixUp training) to increase test accuracy by 1-3% in absolute value, with more significant gains in the low data regime. We also provide a theoretical analysis that helps to explain these benefits in simplified settings, showing that RCAD can provably help the model unlearn spurious features in the training data.
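The abstract notes that RCAD can be integrated into a training pipeline with a few lines of code. To make the idea concrete, below is a minimal PyTorch sketch of an RCAD-style loss, written from the abstract's description rather than from the authors' released code: the function name `rcad_loss` and the hyperparameters `step_size` and `entropy_weight` are illustrative placeholders, and image-shaped inputs of shape (B, C, H, W) are assumed.

```python
import torch
import torch.nn.functional as F

def rcad_loss(model, x, y, step_size=1.0, entropy_weight=0.1):
    """Illustrative RCAD-style objective (placeholder names and defaults):
    cross-entropy on (x, y) plus an entropy-maximization term on
    self-generated examples lying along the adversarial direction."""
    # Standard maximum-likelihood (cross-entropy) term.
    x = x.detach().requires_grad_(True)
    ce = F.cross_entropy(model(x), y)

    # Adversarial direction: gradient of the training loss w.r.t. the input.
    (grad,) = torch.autograd.grad(ce, x, retain_graph=True)
    direction = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)

    # Self-generated out-of-distribution example: a step along the
    # adversarial direction, much larger than in conventional
    # adversarial training, with no projection back to a small ball.
    x_adv = (x + step_size * direction).detach()

    # Reduce confidence on x_adv by maximizing predictive entropy.
    log_p = F.log_softmax(model(x_adv), dim=1)
    entropy = -(log_p.exp() * log_p).sum(dim=1).mean()

    # Minimize cross-entropy while maximizing entropy on the adversarial points.
    return ce - entropy_weight * entropy
```

In a training loop this would simply replace the usual cross-entropy call (`loss = rcad_loss(model, x, y); loss.backward()`). The key design choice, per the abstract, is that the perturbed points are not given the original label as in adversarial training; instead the model is regularized toward low confidence (high predictive entropy) on them.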
