Paper Title

Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings

Paper Authors

Yue Wang, Jing Li, Michael R. Lyu, Irwin King

Paper Abstract

Social media produces large amounts of content every day. To help users quickly capture what they need, keyphrase prediction is receiving growing attention. Nevertheless, most prior efforts focus on text modeling, largely ignoring the rich features embedded in the matching images. In this work, we explore the joint effects of texts and images in predicting the keyphrases for a multimedia post. To better align social media style texts and images, we propose: (1) a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions; (2) image wordings, in the form of optical characters and image attributes, to bridge the two modalities. Moreover, we design a unified framework to leverage the outputs of keyphrase classification and generation and couple their advantages. Extensive experiments on a large-scale dataset newly collected from Twitter show that our model significantly outperforms the previous state of the art based on traditional attention networks. Further analyses show that our multi-head attention is able to attend to information from various aspects and boost classification or generation in diverse scenarios.
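The central mechanism described above, multi-head attention that lets text representations attend over image features, can be illustrated with a short PyTorch sketch. Note this is only a minimal, hypothetical illustration of cross-modality multi-head attention, not the authors' M3H-Att implementation: the class name, dimensions, and residual fusion below are all assumptions.

```python
# Minimal sketch of cross-modality multi-head attention, loosely inspired by
# the M3H-Att idea in the abstract. All names, sizes, and the fusion strategy
# here are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Text tokens (queries) attend over image-region features (keys/values),
        # modeling a cross-media interaction.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, text_len, dim) -- encoded post text
        # image_feats: (batch, regions, dim)  -- encoded image regions / image wordings
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return self.norm(text_feats + fused)  # residual connection, then layer norm

# Toy usage: 1 post, 10 text tokens, 5 image regions, hidden size 256.
text = torch.randn(1, 10, 256)
image = torch.randn(1, 5, 256)
out = CrossModalAttention()(text, image)
print(out.shape)  # torch.Size([1, 10, 256])
```

In the unified framework the abstract describes, a fused representation like this would feed both the keyphrase classification head and the generation decoder; the sketch stops at the fused text representation.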
