Paper Title

NapierOne: A modern mixed file data set alternative to Govdocs1

Authors

Davies, Simon R.; Macfarlane, Richard; Buchanan, William J.

Abstract

A review of the ransomware detection research literature found that almost no proposal provided enough detail on how its test data set was created, or a sufficient description of its actual content, to allow it to be recreated by other researchers interested in reconstructing the environment and validating the research results. A modern cybersecurity mixed file data set called NapierOne is presented, primarily aimed at, but not limited to, ransomware detection and forensic analysis research. NapierOne was designed to address this deficiency in reproducibility and to improve consistency by facilitating research replication and repeatability. The methodology used in the creation of this data set is also described in detail. The data set was inspired by the Govdocs1 data set, and it is intended that NapierOne be used as a complement to this original data set. An investigation was performed with the goal of determining the common file types currently in use. No specific research was found that explicitly provided this information, so an alternative consensus approach was employed. This involved combining the findings from multiple sources of file type usage into an overall ranked list. Following this, 5000 real-world example files were gathered, and a specific data subset was created, for each of the common file types identified. In some circumstances, multiple data subsets were created for a specific file type, each subset representing a specific characteristic of that file type. For example, there are multiple data subsets for the ZIP file type, with each subset containing examples of a specific compression method. Ransomware execution tends to produce files that have high entropy, so examples of file types that naturally have this attribute are also present.
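
The abstract does not state how file entropy is measured. One common choice in ransomware detection work is the Shannon entropy of a file's byte values, expressed in bits per byte. The following is a minimal sketch under that assumption; the function name `shannon_entropy` is hypothetical and is not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0).

    Assumption: entropy is computed over the frequency of single byte
    values, which is one common convention, not necessarily the paper's.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # maps each byte value (0..255) to its count
    # Sum p * log2(1/p) over every byte value that occurs in the data.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    print(shannon_entropy(b"aaaa"))            # a single repeated byte
    print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

Under this measure, encrypted and well-compressed content tends to score close to 8 bits per byte, while plain text scores noticeably lower. This is why a data set aimed at ransomware detection benefits from including file types that are naturally high in entropy: a detector that relies on entropy alone could otherwise mistake them for encrypted files.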
