Paper Title

Multimodal Conversational AI: A Survey of Datasets and Approaches

Authors

Anirudh Sundar, Larry Heck

Abstract

As humans, we experience the world with all our senses or modalities (sound, sight, touch, smell, and taste). We use these modalities, particularly sight and touch, to convey and interpret specific meanings. Multimodal expressions are central to conversations; the modalities in a rich set amplify and often compensate for one another. A multimodal conversational AI system answers questions, fulfills tasks, and emulates human conversation by understanding and expressing itself via multiple modalities. This paper motivates, defines, and mathematically formulates the multimodal conversational research objective. We provide a taxonomy of the research required to solve the objective: multimodal representation, fusion, alignment, translation, and co-learning. We survey state-of-the-art datasets and approaches for each research area and highlight their limiting assumptions. Finally, we identify multimodal co-learning as a promising direction for multimodal conversational AI research.
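
The abstract mentions a mathematical formulation of the research objective but does not reproduce it here. As a minimal illustrative sketch only, using notation of our own choosing rather than the paper's, such an objective is commonly written as selecting the response that maximizes conditional likelihood given the multimodal dialogue history:

\hat{y}_t = \arg\max_{y} \; P_\theta\left( y \mid u_{1:t-1},\, m_{1:t-1} \right)

where u_{1:t-1} denotes the textual utterances exchanged so far, m_{1:t-1} the accompanying non-text signals (e.g., images, audio, video), and \theta the parameters of the conversational model.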
