Title
A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications
Authors
Abstract
Automatic text generation based on neural language models has achieved performance levels that make the generated text almost indistinguishable from text written by humans. Despite the value that text generation can have in various applications, it can also be employed for malicious tasks. The diffusion of such practices represents a threat to the quality of academic publishing. To address these problems, we propose in this paper two datasets of artificially generated research content: a completely synthetic dataset and a partial text substitution dataset. In the first case, the content is generated entirely by the GPT-2 model after a short prompt extracted from original papers. The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences generated by the Arxiv-NLP model. We evaluate the quality of the datasets by comparing the generated texts to the aligned original texts using fluency metrics such as BLEU and ROUGE. The more natural the artificial texts seem, the more difficult they are to detect and the better the benchmark is. We also evaluate the difficulty of distinguishing original from generated text using state-of-the-art classification models.
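The fluency comparison described in the abstract can be illustrated with a minimal sketch. This is not the paper's evaluation code; the paper uses standard BLEU and ROUGE implementations, while the simplified clipped n-gram precision (BLEU-style) and n-gram recall (ROUGE-N-style) below are illustrative only:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n=2):
    """BLEU-style clipped n-gram precision of a candidate against a reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def rouge_n_recall(candidate, reference, n=2):
    """ROUGE-N-style recall: fraction of reference n-grams found in the candidate."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# Hypothetical aligned pair: an original abstract sentence and a generated one.
original = "the model generates fluent scientific text"
generated = "the model generates coherent scientific text"
print(ngram_precision(generated, original, n=2))  # 0.6
print(rouge_n_recall(generated, original, n=2))   # 0.6
```

Higher overlap scores indicate generated text that closely tracks the original phrasing, which in this benchmark makes the detection task harder.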