Paper Title

Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

Paper Authors

Philippe Laban, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong

Paper Abstract

Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grain model analysis, or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.
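To make the NND protocol described in the abstract concrete, the sketch below scores a single NND test by checking whether a model assigns a higher average token log-likelihood to the high-quality candidate than to the near-negative candidate with a known error. This is a minimal sketch, not the authors' released implementation: the model name (t5-small), the helper functions, and the example candidates are illustrative assumptions.

```python
# Minimal NND-test sketch (illustrative, not the paper's code).
# A model "passes" a test if it scores the high-quality candidate
# above the near-negative candidate with a known error.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.eval()

def candidate_log_likelihood(source: str, candidate: str) -> float:
    """Average per-token log-likelihood of `candidate` given `source`."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        # The returned loss is the mean negative log-likelihood per token.
        loss = model(**inputs, labels=labels).loss
    return -loss.item()

def passes_nnd_test(source: str, good: str, near_negative: str) -> bool:
    """The test is passed when the high-quality candidate is more likely."""
    return (candidate_log_likelihood(source, good)
            > candidate_log_likelihood(source, near_negative))

# Hypothetical question-generation test pair (candidates are made up).
source = "generate question: The Eiffel Tower was completed in 1889."
print(passes_nnd_test(
    source,
    good="When was the Eiffel Tower completed?",
    near_negative="When was the Eiffel Tower demolished?",
))
```

Aggregating the pass rate over many such test pairs, grouped by the error type annotated on each near-negative candidate, yields the per-error performance breakdown the abstract refers to.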
