Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking

Paper title

Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking

Authors

Takyoung Kim, Hoonsang Yoon, Yukyung Lee, Pilsung Kang, Misuk Kim

Abstract

Dialogue state tracking (DST) aims to extract essential information from multi-turn dialogue situations and take appropriate actions. A belief state, one of the core pieces of information, refers to the subject and its specific content, and appears in the form of domain-slot-value. The trained model predicts "accumulated" belief states in every turn, and joint goal accuracy and slot accuracy are mainly used to evaluate the prediction; however, we specify that the current evaluation metrics have a critical limitation when evaluating belief states accumulated as the dialogue proceeds, especially in the most used MultiWOZ dataset. Additionally, we propose relative slot accuracy to complement existing metrics. Relative slot accuracy does not depend on the number of predefined slots, and allows intuitive evaluation by assigning relative scores according to the turn of each dialogue. This study also encourages not solely the reporting of joint goal accuracy, but also various complementary metrics in DST tasks for the sake of a realistic evaluation.
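
The contrast between the three metrics named in the abstract can be made concrete with a small sketch. It is one plausible reading of the abstract, not the authors' reference implementation: the function names, the choice of 30 predefined slots, the handling of an empty turn, and the example belief states are all assumptions introduced here for illustration. Relative slot accuracy is taken to be computed over only the slots that appear in the gold or predicted state of the turn, which is what makes it independent of the number of predefined slots.

```python
from typing import Dict

BeliefState = Dict[str, str]  # "domain-slot" -> value (values assumed non-None)


def slot_errors(gold: BeliefState, pred: BeliefState) -> int:
    """Count slots that are missed, spurious, or carry a wrong value."""
    slots = set(gold) | set(pred)
    return sum(1 for s in slots if gold.get(s) != pred.get(s))


def joint_goal_accuracy(gold: BeliefState, pred: BeliefState) -> float:
    """1.0 only if the whole accumulated belief state matches exactly."""
    return 1.0 if gold == pred else 0.0


def slot_accuracy(gold: BeliefState, pred: BeliefState, total_slots: int = 30) -> float:
    """Accuracy over all predefined slots; total_slots=30 is an assumed ontology size."""
    return (total_slots - slot_errors(gold, pred)) / total_slots


def relative_slot_accuracy(gold: BeliefState, pred: BeliefState) -> float:
    """Accuracy over only the slots present in the gold or predicted state."""
    referenced = len(set(gold) | set(pred))
    if referenced == 0:
        return 1.0  # assumption: an empty turn predicted as empty counts as correct
    return (referenced - slot_errors(gold, pred)) / referenced


# Hypothetical turn: one correct slot, one wrong value, one spurious slot.
gold = {"hotel-area": "north", "hotel-stars": "4"}
pred = {"hotel-area": "north", "hotel-stars": "3", "taxi-destination": "airport"}

jga = joint_goal_accuracy(gold, pred)
sa = slot_accuracy(gold, pred)
rsa = relative_slot_accuracy(gold, pred)
print(f"JGA={jga:.2f}  slot accuracy={sa:.3f}  relative slot accuracy={rsa:.3f}")
```

In this hypothetical turn the prediction gets only one of three referenced slots right, yet slot accuracy stays above 0.9 because the 27 slots that neither state mentions count as correct. Joint goal accuracy drops to zero, and relative slot accuracy sits between the two.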
