Paper Title

Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

Paper Authors

Philippe Laban, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong

Paper Abstract

Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grain model analysis, or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.
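To make the NND protocol described in the abstract concrete, the sketch below scores a single NND test by checking whether a model assigns a higher average token log-likelihood to the high-quality candidate than to the near-negative candidate with a known error. This is a minimal sketch, not the authors' released implementation: the model name (t5-small), the helper functions, and the example candidates are illustrative assumptions.

```python
# Minimal NND-test sketch (illustrative, not the paper's code).
# A model "passes" a test if it scores the high-quality candidate
# above the near-negative candidate with a known error.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.eval()

def candidate_log_likelihood(source: str, candidate: str) -> float:
    """Average per-token log-likelihood of `candidate` given `source`."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        # The returned loss is the mean negative log-likelihood per token.
        loss = model(**inputs, labels=labels).loss
    return -loss.item()

def passes_nnd_test(source: str, good: str, near_negative: str) -> bool:
    """The test is passed when the high-quality candidate is more likely."""
    return (candidate_log_likelihood(source, good)
            > candidate_log_likelihood(source, near_negative))

# Hypothetical question-generation test pair (candidates are made up).
source = "generate question: The Eiffel Tower was completed in 1889."
print(passes_nnd_test(
    source,
    good="When was the Eiffel Tower completed?",
    near_negative="When was the Eiffel Tower demolished?",
))
```

Aggregating the pass rate over many such test pairs, grouped by the error type annotated on each near-negative candidate, yields the per-error performance breakdown the abstract refers to.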
