Fusing Feature Engineering and Deep Learning: A Case Study for Malware Classification

Paper Title

Fusing Feature Engineering and Deep Learning: A Case Study for Malware Classification

Paper Authors

Daniel Gibert, Carles Mateu, Jordi Planes, Quan Le

Paper Abstract

Machine learning has become an appealing signature-less approach to detecting and classifying malware because it can generalize to never-before-seen samples and handle large volumes of data. Traditional feature-based approaches rely on hand-crafted features designed manually from experts' knowledge of the domain. Deep learning approaches replace this manual feature engineering process with an underlying system, typically a neural network with multiple layers, that performs both feature learning and classification. However, combining the two approaches could substantially enhance detection systems. In this paper, we present a hybrid approach to malware classification that fuses multiple types of features defined by experts with features learned from raw data through deep learning. In particular, our approach relies on deep learning to extract N-gram-like features from the assembly language instructions and the bytes of malware, and to extract texture patterns and shapelet-based features from the malware's grayscale image representation and structural entropy, respectively. These deep features are then passed as input to a gradient boosting model that combines them with the hand-crafted features using an early-fusion mechanism. We evaluated the approach on the Microsoft Malware Classification Challenge benchmark, and the results show that the proposed solution achieves state-of-the-art performance and outperforms the gradient boosting and deep learning methods in the literature.
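The early-fusion step mentioned in the abstract can be illustrated with a minimal sketch: each sample's deep features and hand-crafted features are concatenated into one flat vector before any classifier sees them. The function name `early_fusion`, the feature group names, and the numeric values below are hypothetical placeholders chosen for illustration; they are not taken from the paper, and the sketch stops short of training the gradient boosting model itself.

```python
from typing import Dict, List, Sequence


def early_fusion(feature_groups: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate named feature groups into a single flat vector.

    Groups are visited in sorted key order so that every sample
    produces columns in the same positions.
    """
    fused: List[float] = []
    for name in sorted(feature_groups):
        fused.extend(float(value) for value in feature_groups[name])
    return fused


# Hypothetical feature groups for one malware sample (illustrative values only).
sample = {
    "deep_bytes_ngram": [0.12, 0.80],             # assumed output of a byte-level network
    "deep_opcode_ngram": [0.33],                  # assumed output of an opcode-level network
    "handcrafted_section_sizes": [4096.0, 512.0], # assumed expert-defined features
}

fused_vector = early_fusion(sample)
print(fused_vector)
```

In a full pipeline, one fused vector per sample would form the rows of the training matrix given to a gradient boosting classifier. Fusing before classification, rather than averaging the outputs of separate per-feature models, lets the boosted trees split on interactions between deep and hand-crafted features.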
