Paper Title

Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond

Authors

Masato Mita, Keisuke Sakaguchi, Masato Hagiwara, Tomoya Mizumoto, Jun Suzuki, Kentaro Inui

Abstract

Natural language processing technology has rapidly improved automated grammatical error correction, and the community has begun to explore document-level revision as one of the next challenges. To move beyond sentence-level automated grammatical error correction toward an NLP-based document-level revision assistant, two major obstacles must be overcome: (1) there are few public corpora with document-level revisions annotated by professional editors, and (2) it is not feasible to elicit all possible references and evaluate the quality of a revision against them, because the possibilities for revision are infinite. This paper tackles these challenges. First, we introduce a new document-revision corpus, TETRA, in which professional editors revised academic papers sampled from the ACL Anthology. These papers contain few trivial grammatical errors, which enables us to focus more on document- and paragraph-level edits such as coherence and consistency. Second, we explore reference-less and interpretable methods for meta-evaluation that can detect quality improvements made by document revision. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle. This promising result will encourage the community to further explore automated document revision models and metrics in the future.
