Paper Title

Natural Attack for Pre-trained Models of Code

Paper Authors

Zhou Yang, Jieke Shi, Junda He, David Lo

Paper Abstract

Pre-trained models of code have achieved success in many important software engineering tasks. However, these powerful models are vulnerable to adversarial attacks that slightly perturb model inputs to make a victim model produce wrong outputs. Current works mainly attack models of code with examples that preserve the operational semantics of programs but ignore a fundamental requirement of adversarial example generation: the perturbations should appear natural to human judges, which we refer to as the naturalness requirement. In this paper, we propose ALERT (nAturaLnEss AwaRe ATtack), a black-box attack that adversarially transforms inputs to make victim models produce wrong outputs. Different from prior works, this paper considers the natural semantics of the generated examples while preserving the operational semantics of the original inputs. Our user study demonstrates that human developers consistently consider adversarial examples generated by ALERT to be more natural than those generated by the state-of-the-art work by Zhang et al., which ignores the naturalness requirement. On attacking CodeBERT, our approach achieves attack success rates of 53.62%, 27.79%, and 35.78% across three downstream tasks: vulnerability prediction, clone detection, and code authorship attribution. On GraphCodeBERT, our approach achieves average success rates of 76.95%, 7.96%, and 61.47% on the three tasks. These results outperform the baseline by 14.07% and 18.56% on the two pre-trained models on average. Finally, we investigate the value of the generated adversarial examples for hardening victim models through an adversarial fine-tuning procedure and demonstrate that the accuracy of CodeBERT and GraphCodeBERT against ALERT-generated adversarial examples increases by 87.59% and 92.32%, respectively.
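To make the attack setting in the abstract concrete, below is a minimal, illustrative Python sketch of a greedy, naturalness-aware identifier-renaming loop. This is not the authors' implementation: ALERT additionally ranks substitutes using the pre-trained model's masked-language-model predictions and contextual embeddings and includes a further search stage, none of which is reproduced here. The helpers `natural_substitutes` and `predict_proba` are hypothetical stand-ins for the substitute generator and the black-box victim model.

```python
import re
from typing import Callable, List, Optional, Tuple

def rename_identifier(code: str, old: str, new: str) -> str:
    """Rename one identifier using word-boundary matching.

    Renaming identifiers does not change what the program computes, which is how
    the attack preserves operational semantics; a production attack would use a
    parser so that strings, comments, and keywords are never touched.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

def greedy_rename_attack(
    code: str,
    true_label: int,
    identifiers: List[str],
    # Hypothetical helper: (code, identifier) -> ranked list of natural-looking substitutes.
    natural_substitutes: Callable[[str, str], List[str]],
    # Hypothetical black-box victim: code -> (predicted label, probability of the ground-truth class).
    predict_proba: Callable[[str], Tuple[int, float]],
) -> Optional[str]:
    """Greedily rename identifiers with natural-looking substitutes, keeping the
    rename that most lowers the victim's confidence, until the prediction flips.
    Returns the adversarial example, or None if the attack fails."""
    current = code
    for name in identifiers:
        best_code, best_conf = current, predict_proba(current)[1]
        for candidate in natural_substitutes(current, name):
            perturbed = rename_identifier(current, name, candidate)
            label, conf = predict_proba(perturbed)
            if label != true_label:
                return perturbed              # misclassified: attack succeeded
            if conf < best_conf:              # otherwise keep the most damaging rename
                best_code, best_conf = perturbed, conf
        current = best_code                   # commit the best rename for this identifier
    return None
```

Because the loop only queries predicted labels and output probabilities, it matches the black-box setting described in the abstract; naturalness comes entirely from how the substitute generator ranks its candidate identifier names.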
