Title
ChapterBreak: A Challenge Dataset for Long-Range Language Models
Authors
Abstract
While numerous architectures for long-range language models (LRLMs) have recently been proposed, a meaningful evaluation of their discourse-level language understanding capabilities has not yet followed. To this end, we introduce ChapterBreak, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative. A fine-grained human annotation reveals that our dataset contains many complex types of chapter transitions (e.g., parallel narratives, cliffhanger endings) that require processing global context to comprehend. Experiments on ChapterBreak show that existing LRLMs fail to effectively leverage long-range context, substantially underperforming a segment-level model trained directly for this task. We publicly release our ChapterBreak dataset to spur more principled future research into LRLMs.
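The task setup described above — give a model a long prefix ending at a chapter boundary and ask it to pick the true next-chapter opening from a candidate set — can be sketched as a suffix-ranking loop. This is only an illustrative sketch, not the paper's released evaluation code: the `score_candidate` function below is a toy token-overlap proxy standing in for what an actual evaluation would compute, namely an LRLM's conditional log-likelihood of each candidate given the prefix.

```python
# Hedged sketch of the ChapterBreak-style suffix-identification task.
# score_candidate is a hypothetical toy stand-in for log P(candidate | prefix)
# under a long-range language model; it scores a candidate by the fraction
# of its tokens that also appear in the prefix.

def score_candidate(prefix: str, candidate: str) -> float:
    """Toy proxy score; a real evaluation would query an LRLM instead."""
    prefix_vocab = set(prefix.lower().split())
    tokens = candidate.lower().split()
    if not tokens:
        return 0.0
    return sum(t in prefix_vocab for t in tokens) / len(tokens)

def pick_next_chapter(prefix: str, candidates: list[str]) -> int:
    """Return the index of the highest-scoring candidate suffix.

    The model 'solves' an instance when the ground-truth next-chapter
    opening outscores all negative segments sampled from the same narrative.
    """
    scores = [score_candidate(prefix, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

Accuracy over a dataset would then be the fraction of instances where `pick_next_chapter` returns the index of the ground-truth suffix; the abstract's finding is that existing LRLMs score poorly under this kind of protocol despite having the full long-range prefix available.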