Paper Title
Plausible Counterfactuals: Auditing Deep Learning Classifiers with Realistic Adversarial Examples
Paper Authors
Paper Abstract
The last decade has witnessed the proliferation of Deep Learning models in many applications, achieving unrivaled levels of predictive performance. Unfortunately, the black-box nature of Deep Learning models has posed unanswered questions about what they learn from data. Certain application scenarios have highlighted the importance of assessing the bounds under which Deep Learning models operate, a problem addressed by assorted approaches aimed at audiences from different domains. However, as the focus of the application shifts toward non-expert users, it becomes mandatory to provide them with the means to trust the model, just as a human becomes familiar with a system or process: by understanding the hypothetical circumstances under which it fails. This is indeed the cornerstone of this research work: to undertake an adversarial analysis of a Deep Learning model. The proposed framework constructs counterfactual examples while ensuring their plausibility, i.e., a reasonable probability that a human could generate them without resorting to a computer program. This work should therefore be regarded as a valuable auditing exercise of the usable bounds within which a certain model is constrained, thereby allowing for a much greater understanding of the capabilities and pitfalls of a model used in a real application. To this end, a Generative Adversarial Network (GAN) and multi-objective heuristics are used to furnish plausible attacks on the audited model, efficiently trading off the confusion of this model against the intensity and plausibility of the generated counterfactuals. The framework's utility is showcased on a human face classification task, unveiling the enormous potential of the proposed approach.
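The three-way trade-off described above can be pictured as a multi-objective fitness over the GAN's latent space. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a pretrained GAN generator, its discriminator (used as a realism critic), and the audited classifier are available as PyTorch modules, and all names here (counterfactual_objectives, z_anchor, target_label, etc.) are hypothetical placeholders introduced for illustration.

```python
# Hypothetical sketch of the three objectives the abstract describes:
# confusion of the audited classifier, intensity of the change, and
# plausibility of the generated counterfactual. All three are framed
# as quantities to minimize. Module names are illustrative assumptions.
import torch

def counterfactual_objectives(z, z_anchor, target_label,
                              generator, discriminator, classifier):
    """Evaluate one latent candidate z against the three criteria.

    z            -- latent vector encoding the counterfactual candidate
    z_anchor     -- latent vector of the original, correctly classified sample
    target_label -- class originally predicted by the audited classifier
    """
    x = generator(z)                          # decode candidate into an image
    probs = torch.softmax(classifier(x), -1)  # audited model's class probabilities
    confusion = probs[..., target_label]      # minimize: push the original class down
    intensity = torch.norm(z - z_anchor)      # minimize: stay close to the original
    plausibility = -discriminator(x)          # minimize: look real to the GAN critic
    return confusion, intensity, plausibility
```

A multi-objective heuristic such as NSGA-II could then search the latent space for Pareto-optimal candidates under these criteria, each decoding to a counterfactual that balances how strongly it confuses the model against how little it departs from the original sample and how realistic it appears.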