Paper Title
arXivEdits: Understanding the Human Revision Process in Scientific Writing
Paper Authors
Paper Abstract
Scientific publications are the primary means to communicate research discoveries, where the writing quality is of crucial importance. However, prior work studying the human editing process in this domain mainly focused on the abstract or introduction sections, resulting in an incomplete picture. In this work, we provide a complete computational framework for studying text revision in scientific writing. We first introduce arXivEdits, a new annotated corpus of 751 full papers from arXiv with gold sentence alignment across their multiple versions of revisions, as well as fine-grained span-level edits and their underlying intentions for 1,000 sentence pairs. It supports our data-driven analysis to unveil the common strategies practiced by researchers for revising their papers. To scale up the analysis, we also develop automatic methods to extract revisions at the document, sentence, and word levels. A neural CRF sentence alignment model trained on our corpus achieves 93.8 F1, enabling the reliable matching of sentences between different versions. We formulate the edit extraction task as a span alignment problem, and our proposed method extracts more fine-grained and explainable edits, compared to the commonly used diff algorithm. An intention classifier trained on our dataset achieves 78.9 F1 on the fine-grained intent classification task. Our data and system are released at tiny.one/arxivedits.
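To make the comparison in the abstract concrete, the following is a minimal sketch of the commonly used diff-style baseline that the paper's span-alignment method is contrasted against. This is not the authors' system: the function name `diff_edits` and the example sentences are illustrative, and the sketch simply uses Python's standard `difflib.SequenceMatcher` to recover word-level edit operations between two versions of a sentence.

```python
# Illustrative baseline only (not the arXivEdits method): word-level diff
# between two sentence versions using Python's difflib.
from difflib import SequenceMatcher


def diff_edits(old_sentence: str, new_sentence: str):
    """Return (operation, old_span, new_span) tuples for non-matching regions."""
    old_tokens = old_sentence.split()
    new_tokens = new_sentence.split()
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only replaced, deleted, or inserted spans
            edits.append((op, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return edits


if __name__ == "__main__":
    v1 = "We propose a new method for aligning sentences ."
    v2 = "We propose a novel neural method for aligning sentences across versions ."
    for edit in diff_edits(v1, v2):
        print(edit)
    # e.g. ('replace', 'new', 'novel neural') and ('insert', '', 'across versions')
```

Such diff output is tied to token positions and surface matching, which is the kind of coarse, less interpretable edit extraction the paper reports improving upon by framing the task as span alignment with associated intention labels.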