Paper Title


Multimodal Dialogue State Tracking

Authors

Hung Le, Nancy F. Chen, Steven C. H. Hoi

Abstract


Designed for tracking user goals in dialogues, a dialogue state tracker is an essential component in a dialogue system. However, research on dialogue state tracking has largely been limited to unimodality, in which slots and slot values are constrained by knowledge domains (e.g., the restaurant domain with slots for restaurant name and price range) and are defined by a specific database schema. In this paper, we propose to extend the definition of dialogue state tracking to multimodality. Specifically, we introduce a novel dialogue state tracking task to track the information of visual objects that are mentioned in video-grounded dialogues. Each new dialogue utterance may introduce a new video segment, new visual objects, or new object attributes, and a state tracker is required to update these information slots accordingly. We created a new synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer Network (VDTN), for this task. VDTN combines both object-level features and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states. We optimized VDTN for a state generation task as well as a self-supervised video understanding task which recovers video segment or object representations. Finally, we trained VDTN to use the decoded states in a response prediction task. Together with comprehensive ablation and qualitative analysis, we discovered interesting insights towards building more capable multimodal dialogue systems.
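To make the task definition concrete, the following is a minimal sketch of the multimodal dialogue state the abstract describes: a state maps each tracked visual object to its attribute slots, and each new utterance may introduce a new video segment, new objects, or new attributes that the tracker must merge in. All names, slot keys, and the state layout here are illustrative assumptions, not the paper's actual schema or the VDTN interface.

```python
# Hypothetical sketch of multimodal dialogue state updates.
# The state holds the currently grounded video segment and a map
# from object IDs to their accumulated attribute slots.

def update_state(state, turn):
    """Merge one dialogue turn's grounded information into the state."""
    # A turn may ground the dialogue in a new video segment.
    if "segment" in turn:
        state["segment"] = turn["segment"]
    # A turn may mention new objects, or new attributes of known objects;
    # later mentions add to (or overwrite) earlier slot values.
    for obj_id, attrs in turn.get("objects", {}).items():
        state.setdefault("objects", {}).setdefault(obj_id, {}).update(attrs)
    return state

state = {}
update_state(state, {"segment": (0, 10),
                     "objects": {"cube_1": {"color": "red"}}})
update_state(state, {"objects": {"cube_1": {"action": "slide"},
                                 "sphere_2": {"color": "blue"}}})
```

After these two turns, `cube_1` carries both its color and its action, and `sphere_2` has been added; this accumulate-and-update behavior is the multimodal analogue of slot filling in conventional (text-only) dialogue state tracking.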
