Paper Title
Corruption Is Not All Bad: Incorporating Discourse Structure into Pre-training via Corruption for Essay Scoring
Paper Authors
Paper Abstract
Existing approaches for automated essay scoring and document representation learning typically rely on discourse parsers to incorporate discourse structure into text representation. However, parser performance is not always adequate, especially on noisy texts such as student essays. In this paper, we propose an unsupervised pre-training approach that captures the discourse structure of essays in terms of coherence and cohesion and requires no discourse parser or annotation. We introduce several types of token-, sentence-, and paragraph-level corruption techniques for our proposed pre-training approach, and we augment masked language modeling pre-training with our method to leverage both contextualized and discourse information. Our proposed unsupervised approach achieves a new state-of-the-art result on the essay Organization scoring task.
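To make the corruption idea concrete, below is a minimal Python sketch of what token-, sentence-, and paragraph-level corruption could look like. The function names, the specific corruption choices (token dropping, sentence shuffling, paragraph swapping), and the original-vs-corrupted labeling scheme are illustrative assumptions based on the abstract, not the paper's exact techniques.

```python
import random

rng = random.Random(0)  # fixed seed for reproducible corruption

def corrupt_tokens(tokens, drop_prob=0.15):
    """Token-level corruption (assumed variant): randomly drop tokens,
    degrading local cohesion."""
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept if kept else tokens  # never return an empty sequence

def corrupt_sentences(sentences):
    """Sentence-level corruption (assumed variant): shuffle sentence
    order within a paragraph, degrading coherence."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled

def corrupt_paragraphs(paragraphs):
    """Paragraph-level corruption (assumed variant): swap two paragraphs,
    degrading essay-level organization."""
    corrupted = paragraphs[:]
    if len(corrupted) >= 2:
        i, j = rng.sample(range(len(corrupted)), 2)
        corrupted[i], corrupted[j] = corrupted[j], corrupted[i]
    return corrupted

def make_training_pair(essay_paragraphs):
    """Build (text, label) pairs: 1 = original, 0 = corrupted.
    A model pre-trained to tell the two apart can learn discourse
    structure without any parser or human annotation."""
    original = " ".join(essay_paragraphs)
    corrupted = " ".join(corrupt_paragraphs(essay_paragraphs))
    return [(original, 1), (corrupted, 0)]

if __name__ == "__main__":
    essay = [
        "First paragraph introduces the thesis.",
        "Second paragraph gives supporting evidence.",
        "Third paragraph concludes the argument.",
    ]
    for text, label in make_training_pair(essay):
        print(label, "->", text)
```

In this framing, each corruption operator produces a negative example whose discourse structure is broken while its surface vocabulary is largely preserved, so a discriminator trained on such pairs (alongside masked language modeling) is pushed toward coherence and cohesion signals rather than topic or word-choice cues.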