Paper Title
An Audio-Visual Attention Based Multimodal Network for Fake Talking Face Videos Detection
Paper Authors
Paper Abstract
DeepFake-based digital facial forgery is threatening public media security; in particular, when lip manipulation is used in talking face generation, fake video detection becomes even more difficult. Because only the lip shape is changed to match a given speech, identity-related facial features are hard to discriminate in such fake talking face videos. Combined with the lack of attention to the audio stream as prior knowledge, failure to detect fake talking face generation becomes inevitable. Inspired by the decision-making mechanism of the human multisensory perception system, in which auditory information enhances post-sensory visual evidence for informed decision output, this study proposes a fake talking face detection framework, FTFDNet, which incorporates audio and visual representations to achieve more accurate detection of fake talking face videos. Furthermore, an audio-visual attention mechanism (AVAM) is proposed to discover more informative features; owing to its modular design, it can be seamlessly integrated into any audio-visual CNN architecture. With the additional AVAM, the proposed FTFDNet achieves better detection performance on the established dataset (FTFDD). Evaluation of the proposed work shows excellent performance in detecting fake talking face videos, reaching a detection rate above 97%.
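The abstract does not specify the internal design of AVAM, so the following is only a minimal PyTorch-style sketch of an audio-visual attention block of the general kind described: a module that uses an audio embedding to reweight visual CNN feature maps and can be dropped into an existing audio-visual CNN. The class name AudioVisualAttention, the concatenation-based fusion, and all layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an audio-visual attention module; the actual AVAM
# architecture is not described in the abstract.
import torch
import torch.nn as nn


class AudioVisualAttention(nn.Module):
    """Reweights visual feature maps using a learned audio-conditioned attention map."""

    def __init__(self, visual_channels: int, audio_dim: int):
        super().__init__()
        # Project the audio embedding so it can be broadcast over the
        # spatial grid of the visual feature map.
        self.audio_proj = nn.Linear(audio_dim, visual_channels)
        # 1x1 convolutions produce a per-location attention weight in [0, 1].
        self.attn = nn.Sequential(
            nn.Conv2d(2 * visual_channels, visual_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(visual_channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map; audio: (B, audio_dim) embedding.
        b, c, h, w = visual.shape
        a = self.audio_proj(audio).view(b, c, 1, 1).expand(b, c, h, w)
        weights = self.attn(torch.cat([visual, a], dim=1))  # (B, 1, H, W)
        # Residual reweighting keeps the module a drop-in addition to an
        # existing CNN backbone, as the abstract's modularity claim suggests.
        return visual * weights + visual


# Usage: attend over a ResNet-style feature map with a 128-d audio embedding.
module = AudioVisualAttention(visual_channels=256, audio_dim=128)
out = module(torch.randn(2, 256, 14, 14), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 256, 14, 14])
```

Because the module's output has the same shape as its visual input, it can be inserted between any two stages of a visual backbone without changing the rest of the network, which is one plausible reading of the abstract's "seamlessly integrated by modularization" claim.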