Judge a Sentence by Its Content to Generate Grammatical Errors

Paper Title

Judge a Sentence by Its Content to Generate Grammatical Errors

Author

Rahman, Chowdhury Rafeed

Abstract

Data sparsity is a well-known problem for grammatical error correction (GEC). Generating synthetic training data is one widely proposed solution to this problem, and has allowed models to achieve state-of-the-art (SOTA) performance in recent years. However, these methods often generate unrealistic errors, or aim to generate sentences with only one error. We propose a learning-based, two-stage method for synthetic data generation for GEC that relaxes this constraint on sentences containing only one error. Errors are generated in accordance with sentence merit. We show that a GEC model trained on our synthetically generated corpus outperforms models trained on synthetic data from prior work.

