Paper Title

Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text Segmentation

Authors

Goran Glavaš, Swapna Somasundaran

Abstract

Breaking down the structure of long texts into semantically coherent segments makes the texts more readable and supports downstream applications like summarization and retrieval. Starting from an apparent link between text coherence and segmentation, we introduce a novel supervised model for text segmentation with simple but explicit coherence modeling. Our model -- a neural architecture consisting of two hierarchically connected Transformer networks -- is a multi-task learning model that couples the sentence-level segmentation objective with the coherence objective that differentiates correct sequences of sentences from corrupt ones. The proposed model, dubbed Coherence-Aware Text Segmentation (CATS), yields state-of-the-art segmentation performance on a collection of benchmark datasets. Furthermore, by coupling CATS with cross-lingual word embeddings, we demonstrate its effectiveness in zero-shot language transfer: it can successfully segment texts in languages unseen in training.
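
The abstract describes the architecture only at a high level. Below is a minimal, illustrative PyTorch sketch of what "two hierarchically connected Transformer networks" trained with a shared multi-task objective could look like. This is not the authors' implementation: the class name `TwoLevelSegmenter`, the mean-pooling, the toy dimensions, and the margin-based coherence loss are all assumptions made for the sketch.

```python
# Illustrative sketch only (assumed names/hyperparameters, not the paper's code):
# a token-level Transformer encodes each sentence, a sentence-level Transformer
# contextualizes the resulting sentence vectors, and two heads share the encoder
# in a multi-task setup (segmentation + coherence), as the abstract describes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLevelSegmenter(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.token_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.sentence_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.seg_head = nn.Linear(d_model, 2)   # boundary / no boundary per sentence
        self.coh_head = nn.Linear(d_model, 1)   # coherence score per snippet

    def forward(self, token_ids):
        # token_ids: (batch, n_sentences, n_tokens)
        b, s, t = token_ids.shape
        tokens = self.embed(token_ids.reshape(b * s, t))
        sent_vecs = self.token_encoder(tokens).mean(dim=1)    # pool tokens -> sentence
        sent_vecs = self.sentence_encoder(sent_vecs.reshape(b, s, -1))
        seg_logits = self.seg_head(sent_vecs)                 # (b, s, 2)
        coh_score = self.coh_head(sent_vecs.mean(dim=1))      # (b, 1)
        return seg_logits, coh_score


# Toy multi-task step: segmentation cross-entropy plus a ranking loss that scores
# the original snippet above a copy with shuffled (i.e. corrupted) sentence order.
model = TwoLevelSegmenter(vocab_size=1000)
ids = torch.randint(0, 1000, (2, 8, 16))       # 2 snippets, 8 sentences, 16 tokens
labels = torch.randint(0, 2, (2, 8))           # dummy boundary labels
seg_logits, coh_good = model(ids)
_, coh_bad = model(ids[:, torch.randperm(8)])  # corrupted sentence order
loss = F.cross_entropy(seg_logits.reshape(-1, 2), labels.reshape(-1)) \
     + F.margin_ranking_loss(coh_good, coh_bad, torch.ones_like(coh_good))
```

The margin-ranking formulation is one plausible instantiation of the abstract's coherence objective (scoring correct sequences of sentences above corrupt ones); the paper itself should be consulted for the exact objectives and architecture details.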
