Paper Title

A Late Multi-Modal Fusion Model for Detecting Hybrid Spam E-mail

Authors

Zhibo Zhang, Ernesto Damiani, Hussam Al Hamadi, Chan Yeob Yeun, Fatma Taher

Abstract

In recent years, spammers have tried to obfuscate their intent by introducing hybrid spam e-mails that combine image and text parts, which are more challenging to detect than e-mails containing only text or only images. The motivation behind this research is to design an effective approach for filtering out hybrid spam e-mails, avoiding situations where traditional text-only or image-only filters fail to detect them. To the best of our knowledge, only a few studies have been conducted with the goal of detecting hybrid spam e-mails. Ordinarily, Optical Character Recognition (OCR) technology is used to eliminate the image parts of spam by transforming images into text. However, although OCR scanning is a very successful technique for processing text-and-image hybrid spam, it is not an effective solution for large volumes of e-mail because of the CPU power and execution time required to scan e-mail files. Moreover, OCR techniques are not always reliable in the conversion process. To address these problems, we propose new late multi-modal fusion training frameworks for a text-and-image hybrid spam e-mail filtering system, in contrast to the classical early fusion detection frameworks based on the OCR method. A Convolutional Neural Network (CNN) and Continuous Bag of Words (CBOW) were implemented to extract features from the image and text parts of hybrid spam, respectively, and the generated features were fed to a sigmoid layer and to Machine Learning based classifiers, including Random Forest (RF), Decision Tree (DT), Naive Bayes (NB), and Support Vector Machine (SVM), to classify each e-mail as ham or spam.
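The general idea of late fusion can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the two branches are combined, so the weighted sum of per-branch scores passed through a sigmoid is an assumption, and the names `fuse_late`, `classify`, `w_text`, `w_image`, `bias`, and `threshold` are hypothetical. The two input scores stand in for the outputs of a text model (for example, one built on CBOW features) and an image model (for example, a CNN), each run independently with no OCR step.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse_late(text_logit: float, image_logit: float,
              w_text: float = 1.0, w_image: float = 1.0,
              bias: float = 0.0) -> float:
    """Combine per-modality scores into one spam probability.

    Assumption: each branch has already produced a raw score (logit)
    for its own modality; the weights and bias are illustrative and
    would normally be learned.
    """
    return sigmoid(w_text * text_logit + w_image * image_logit + bias)


def classify(text_logit: float, image_logit: float,
             threshold: float = 0.5) -> str:
    """Label an e-mail as 'spam' or 'ham' from the fused probability."""
    p_spam = fuse_late(text_logit, image_logit)
    return "spam" if p_spam >= threshold else "ham"


if __name__ == "__main__":
    # Hypothetical branch outputs: text looks benign, image looks spammy.
    print(classify(text_logit=-0.5, image_logit=3.0))
```

Because fusion happens on the branch outputs rather than on the raw inputs, the image never has to be converted to text, which is the OCR cost the abstract argues against. In this sketch the sigmoid fusion step could be swapped for any of the classifiers the abstract lists (RF, DT, NB, SVM) operating on the concatenated branch outputs.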
