Paper title
A global analysis of metrics used for measuring performance in natural language processing
Paper authors
Paper abstract
Measuring the performance of natural language processing models is challenging. Traditionally used metrics, such as BLEU and ROUGE, originally devised for machine translation and summarization, have been shown to suffer from low correlation with human judgment and a lack of transferability to other tasks and languages. In the past 15 years, a wide range of alternative metrics have been proposed. However, it is unclear to what extent this has had an impact on NLP benchmarking efforts. Here we provide the first large-scale cross-sectional analysis of metrics used for measuring performance in natural language processing. We curated, mapped and systematized more than 3,500 machine learning model performance results from the open repository 'Papers with Code' to enable a global and comprehensive analysis. Our results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance. Furthermore, we found that ambiguities and inconsistencies in the reporting of metrics may lead to difficulties in interpreting and comparing model performances, impairing transparency and reproducibility in NLP research.
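To illustrate the kind of reporting ambiguity the abstract describes, the minimal sketch below (not from the paper; it assumes Python with the open-source NLTK library) scores the same hypothesis sentence under three common BLEU configurations. The choice of n-gram weights and smoothing is frequently left unreported, yet it changes the resulting number.

```python
# A minimal sketch, assuming Python with the NLTK library installed.
# It shows that the same hypothesis yields different scores depending
# on the BLEU configuration, a detail papers often leave unstated.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # tokenized reference(s)
hypothesis = ["the", "cat", "is", "on", "the", "mat"]    # tokenized system output

# Default BLEU-4: uniform weights over 1- to 4-grams, no smoothing.
bleu4 = sentence_bleu(reference, hypothesis)

# BLEU-2: 1- and 2-grams only, a common but often unreported variant.
bleu2 = sentence_bleu(reference, hypothesis, weights=(0.5, 0.5))

# BLEU-4 with smoothing applied to zero n-gram counts.
smoothed = sentence_bleu(
    reference, hypothesis,
    smoothing_function=SmoothingFunction().method1,
)

print(f"BLEU-4: {bleu4:.3f}  BLEU-2: {bleu2:.3f}  smoothed BLEU-4: {smoothed:.3f}")
```

Here the unsmoothed BLEU-4 collapses toward zero (no matching 4-grams), while BLEU-2 and smoothed BLEU-4 give distinctly higher values, so a reported "BLEU" score is not comparable across papers unless the configuration is stated.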