Paper Title

Application of Explainable Machine Learning in Detecting and Classifying Ransomware Families Based on API Call Analysis

Authors

Mowri, Rawshan Ara; Siddula, Madhuri; Roy, Kaushik

Abstract

Ransomware has recently emerged as one of the major global threats. The alarming growth in ransomware attacks and new ransomware variants drives researchers to continually examine the distinguishing traits of ransomware and refine their detection strategies. An Application Programming Interface (API) is a way for one program to interact with another; API calls are the medium through which they communicate. Ransomware uses API calls to interact with the OS, making a significantly higher number of calls, in different sequences, to request actions. This research uses the frequencies of different API calls to detect and classify ransomware families. First, a web crawler is developed to automate the collection of Windows Portable Executable (PE) files from 15 different ransomware families. By extracting the frequencies of 68 API calls, we build our dataset in the first phase of a two-phase feature engineering process. After selecting the most significant features in the second phase, we deploy six supervised machine learning models: Naïve Bayes, Logistic Regression, Random Forest, Stochastic Gradient Descent, K-Nearest Neighbor, and Support Vector Machine. The performance of all classifiers is then compared to select the best model. The results show that Logistic Regression can efficiently classify ransomware into the corresponding families, achieving 99.15% overall accuracy. Finally, rather than relying on the 'black box' nature of machine learning models, we present a post-hoc analysis of our best-performing model using SHapley Additive exPlanations (SHAP) values to assess the transparency and trustworthiness of the model's predictions.
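The frequency-based features described in the abstract can be illustrated with a minimal sketch. It is not the authors' extraction pipeline: the abstract does not list the 68 API calls, so the five-name vocabulary, the `api_frequency_vector` helper, and the sample trace below are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical vocabulary: the paper tracks 68 API calls, which the abstract
# does not enumerate. These five Windows API names are placeholders.
API_VOCABULARY = ["CreateFileW", "ReadFile", "WriteFile", "CryptEncrypt", "DeleteFileW"]


def api_frequency_vector(trace, vocabulary=API_VOCABULARY):
    """Return how often each API name in `vocabulary` occurs in `trace`.

    Calls that are not in the vocabulary are ignored, so every sample maps
    to a fixed-length vector regardless of trace length.
    """
    counts = Counter(trace)
    return [counts[name] for name in vocabulary]


# Made-up call trace for illustration only (not taken from a real sample).
sample_trace = [
    "CreateFileW", "ReadFile", "CryptEncrypt", "WriteFile",
    "CreateFileW", "ReadFile", "CryptEncrypt", "WriteFile",
    "DeleteFileW", "Sleep",
]

features = api_frequency_vector(sample_trace)
print(features)
```

One such vector per executable, paired with its family label, would give the kind of tabular dataset that a classifier such as Logistic Regression could be trained on.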
