Paper Title

Better Smatch = Better Parser? AMR evaluation is not so simple anymore

Authors

Juri Opitz, Anette Frank

Abstract

Recently, astonishing advances have been observed in AMR parsing, as measured by the structural Smatch metric. In fact, today's systems achieve performance levels that seem to surpass estimates of human inter-annotator agreement (IAA). Therefore, it is unclear how well Smatch (still) relates to human estimates of parse quality, as in this situation potentially fine-grained errors of similar weight may impact the AMR's meaning to different degrees. We conduct an analysis of two popular and strong AMR parsers that -- according to Smatch -- reach quality levels on par with human IAA, and assess how human quality ratings relate to Smatch and other AMR metrics. Our main findings are: i) While high Smatch scores indicate otherwise, we find that AMR parsing is far from being solved: we frequently find structurally small, but semantically unacceptable errors that substantially distort sentence meaning. ii) Considering high-performance parsers, better Smatch scores may not necessarily indicate consistently better parsing quality. To obtain a meaningful and comprehensive assessment of quality differences of parse(r)s, we recommend augmenting evaluations with macro statistics, use of additional metrics, and more human analysis.
