DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations

Paper title

DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations

Authors

Sarik Ghazarian, Nuan Wen, Aram Galstyan, Nanyun Peng

Abstract

Automatic evaluation metrics are essential for the rapid development of open-domain dialogue systems, as they facilitate hyper-parameter tuning and comparison between models. Although recently proposed trainable conversation-level metrics have shown encouraging results, the quality of the metrics is strongly dependent on the quality of training data. Prior works mainly resort to heuristic text-level manipulations (e.g., utterance shuffling) to bootstrap incoherent conversations (negative examples) from coherent dialogues (positive examples). Such approaches are insufficient to appropriately reflect the incoherence that occurs in interactions between advanced dialogue models and humans. To tackle this problem, we propose DEAM, a Dialogue coherence Evaluation metric that relies on Abstract Meaning Representation (AMR) to apply semantic-level Manipulations for incoherent (negative) data generation. AMRs naturally facilitate the injection of various types of incoherence sources, such as coreference inconsistency, irrelevancy, contradictions, and decreased engagement, at the semantic level, thus resulting in more natural incoherent samples. Our experiments show that DEAM achieves higher correlations with human judgments than baseline methods on several dialogue datasets, by significant margins. We also show that DEAM can distinguish between coherent and incoherent dialogues generated by baseline manipulations, whereas those baseline models cannot detect incoherent examples generated by DEAM. Our results demonstrate the potential of AMR-based semantic manipulations for natural negative example generation.
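
For illustration, the text-level baseline the abstract mentions (utterance shuffling) can be sketched as follows. This is a minimal sketch and not the authors' code: the function names `shuffle_utterances` and `make_training_pair`, the label convention (1 for coherent, 0 for incoherent), and the sample dialogue are all assumptions introduced here. It does not implement DEAM's AMR-based manipulations, which operate on semantic graphs rather than on the order of utterances.

```python
import random
from typing import List, Optional, Tuple


def shuffle_utterances(dialogue: List[str], seed: Optional[int] = None) -> List[str]:
    """Return a reordered copy of `dialogue` (hypothetical helper).

    The result is guaranteed to differ from the input order whenever the
    dialogue contains at least two distinct utterances.
    """
    original = list(dialogue)
    if len(set(original)) < 2:
        # No distinct reordering exists, so return the dialogue unchanged.
        return original
    rng = random.Random(seed)
    shuffled = list(original)
    while shuffled == original:
        rng.shuffle(shuffled)
    return shuffled


def make_training_pair(
    dialogue: List[str], seed: Optional[int] = None
) -> List[Tuple[List[str], int]]:
    """Pair a coherent dialogue (label 1) with a shuffled negative (label 0)."""
    return [(list(dialogue), 1), (shuffle_utterances(dialogue, seed), 0)]


if __name__ == "__main__":
    # Made-up example dialogue, used only to exercise the sketch.
    sample = [
        "Hi, how was your weekend?",
        "Great, I went hiking with my sister.",
        "Nice! Where did you go?",
        "A trail near the lake.",
    ]
    for turns, label in make_training_pair(sample, seed=0):
        print(label, turns)
```

A shuffled negative keeps every utterance intact and only disturbs their order, which is the kind of surface-level incoherence the abstract argues is insufficient for evaluating advanced dialogue models.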
