Paper Title

Utility is in the Eye of the User: A Critique of NLP Leaderboards

Authors

Kawin Ethayarajh, Dan Jurafsky

Abstract

Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory. We frame both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them. With this framing, we formalize how leaderboards -- in their current form -- can be poor proxies for the NLP community at large. For example, a highly inefficient model would provide less utility to practitioners but not to a leaderboard, since it is a cost that only the former must bear. To allow practitioners to better estimate a model's utility to them, we advocate for more transparency on leaderboards, such as the reporting of statistics that are of practical concern (e.g., model size, energy efficiency, and inference latency).
