NEWTS: A Corpus for News Topic-Focused Summarization

Paper Title

NEWTS: A Corpus for News Topic-Focused Summarization

Authors

Seyed Ali Bahrainian, Sheridan Feucht, Carsten Eickhoff

Abstract

Text summarization models are approaching human levels of fidelity. Existing benchmarking corpora provide concordant pairs of full and abridged versions of Web, news, or professional content. To date, all summarization datasets operate under a one-size-fits-all paradigm that may not reflect the full range of organic summarization needs. Several recently proposed models (e.g., plug and play language models) have the capacity to condition the generated summaries on a desired range of themes. These capacities remain largely unused and unevaluated as there is no dedicated dataset that would support the task of topic-focused summarization. This paper introduces the first topical summarization corpus NEWTS, based on the well-known CNN/Dailymail dataset, and annotated via online crowd-sourcing. Each source article is paired with two reference summaries, each focusing on a different theme of the source document. We evaluate a representative range of existing techniques and analyze the effectiveness of different prompting methods.
