Paper Title
Counteracting Dark Web Text-Based CAPTCHA with Generative Adversarial Learning for Proactive Cyber Threat Intelligence
Paper Authors
Paper Abstract
Automated monitoring of dark web (DW) platforms on a large scale is the first step toward developing proactive Cyber Threat Intelligence (CTI). While there are efficient methods for collecting data from the surface web, large-scale dark web data collection is often hindered by anti-crawling measures. In particular, text-based CAPTCHA is the most prevalent and prohibitive of these measures on the dark web. Text-based CAPTCHA identifies and blocks automated crawlers by forcing the user to enter a combination of hard-to-recognize alphanumeric characters. On the dark web, CAPTCHA images are meticulously designed with additional background noise and variable character length to prevent automated CAPTCHA breaking. Existing automated CAPTCHA breaking methods have difficulty overcoming these dark web challenges. As such, solving dark web text-based CAPTCHA has relied heavily on human involvement, which is labor-intensive and time-consuming. In this study, we propose a novel framework for automated breaking of dark web CAPTCHA to facilitate dark web data collection. This framework encompasses a novel generative method to recognize dark web text-based CAPTCHA with noisy backgrounds and variable character length. To eliminate the need for human involvement, the proposed framework utilizes a Generative Adversarial Network (GAN) to counteract dark web background noise and leverages an enhanced character segmentation algorithm to handle CAPTCHA images with variable character length. Our proposed framework, DW-GAN, was systematically evaluated on multiple dark web CAPTCHA testbeds. DW-GAN significantly outperformed the state-of-the-art benchmark methods on all datasets, achieving a success rate of over 94.4% on a carefully collected real-world dark web dataset...
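
The abstract does not specify how the enhanced character segmentation algorithm works. As a rough illustration of why segmentation helps with variable character length, the sketch below splits an already denoised, binarized image into characters wherever a column contains no ink. This is a generic column-projection heuristic chosen for illustration, not the DW-GAN algorithm; the function name `segment_columns`, the `min_width` parameter, and the toy image are all hypothetical.

```python
def segment_columns(binary, min_width=1):
    """Split a binarized image into character spans by empty columns.

    `binary` is a list of rows, each a list of 0 (background) or 1 (ink).
    Returns a list of (start, end) column ranges, with `end` exclusive.
    Illustrative heuristic only; not the method used by DW-GAN.
    """
    if not binary:
        return []
    width = len(binary[0])
    # A column "has ink" if any row has a foreground pixel in it.
    has_ink = [any(row[x] for row in binary) for x in range(width)]

    spans = []
    start = None
    for x, filled in enumerate(has_ink):
        if filled and start is None:
            start = x                      # a character begins
        elif not filled and start is not None:
            if x - start >= min_width:     # drop spans narrower than min_width
                spans.append((start, x))
            start = None                   # the character ended
    if start is not None and width - start >= min_width:
        spans.append((start, width))       # character touching the right edge
    return spans


# Toy 5x12 image ('#' = ink) standing in for a denoised CAPTCHA.
rows = [
    ".....#......",
    ".##..#......",
    ".##..#..###.",
    ".##..#..###.",
    ".....#......",
]
binary = [[1 if ch == "#" else 0 for ch in row] for row in rows]
spans = segment_columns(binary)
print(spans)  # [(1, 3), (5, 6), (8, 11)]
```

Because the number of spans comes from the image itself, nothing in this sketch assumes a fixed CAPTCHA length. Touching or overlapping characters would defeat such a simple heuristic, which is presumably one reason a more elaborate segmentation step is needed in practice.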