Paper Title

DexBERT: Effective, Task-Agnostic and Fine-grained Representation Learning of Android Bytecode

Authors

Tiezhu Sun, Kevin Allix, Kisub Kim, Xin Zhou, Dongsun Kim, David Lo, Tegawendé F. Bissyandé, Jacques Klein

Abstract

The automation of a large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). Central to applying ML to software artifacts (like source or executable code) is converting them into forms suitable for learning. Traditionally, researchers have relied on manually selected features, based on expert knowledge which is sometimes imprecise and generally incomplete. Representation learning has allowed ML to automatically choose suitable representations and relevant features. Yet, for Android-related tasks, existing models either focus on the whole-app level, like apk2vec, or target specific tasks, like smali2vec, which limits their applicability. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in three distinct class-level software engineering tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.
