Paper Title
MENLI: Robust Evaluation Metrics from Natural Language Inference
Paper Authors
Paper Abstract
Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., those relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. Instead, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling choice. We design a preference-based adversarial attack framework and show that our NLI-based metrics are much more robust to the attacks than recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (+15% to 30%) and higher-quality metrics as measured on standard benchmarks (+5% to 30%).
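The abstract mentions combining existing metrics with the NLI-based metrics but does not state the combination scheme. Below is a minimal sketch, assuming both scores are normalized to [0, 1] and combined by a simple weighted average; the function name and the interpolation weight `w` are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch: interpolate an existing metric score with an NLI-based score.
# Assumptions: both scores are normalized to [0, 1]; w is an assumed
# interpolation weight (the abstract does not specify the actual scheme).

def combine_metric_with_nli(metric_score: float, nli_score: float, w: float = 0.5) -> float:
    """Weighted average of an existing metric score (e.g., a BERT-based
    similarity score) and an NLI-based score (e.g., an entailment probability)."""
    return w * nli_score + (1.0 - w) * metric_score


# Example: a similarity score of 0.82 and an entailment probability of 0.35
# yield a combined score of 0.585 with equal weighting.
print(combine_metric_with_nli(0.82, 0.35))
```

With `w` between 0 and 1, the combined metric trades off the benchmark correlation of the existing metric against the adversarial robustness contributed by the NLI score.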