Paper Title
Dialect Diversity in Text Summarization on Twitter
Abstract
Discussions on Twitter involve participation from different communities with different dialects, and it is often necessary to summarize a large number of posts into a representative sample to provide a synopsis. Yet any such representative sample should sufficiently portray the underlying dialect diversity, so that it presents the voices of the different participating communities that use those dialects. Extractive summarization algorithms construct subsets that succinctly capture the topic of a given set of posts. However, we observe dialect bias in the summaries generated by common summarization approaches, i.e., they often return summaries that under-represent certain dialects. The vast majority of existing "fair" summarization approaches require labels for the socially salient attribute (in this case, dialect) to ensure that the generated summary is fair with respect to that attribute. In many applications, however, these labels do not exist. Furthermore, because dialects in social media evolve continually, it is unreasonable to expect that the dialect of every social media post can be labeled or accurately inferred. To correct for the dialect bias, we employ a framework that takes an existing text summarization algorithm as a black box and, using a small set of dialect-diverse sentences, returns a summary that is relatively more dialect-diverse. Crucially, this approach does not need the posts being summarized to have dialect labels, ensuring that the diversification process is independent of dialect classification/identification models. We show the efficacy of our approach on Twitter datasets containing posts written in dialects used by different social groups defined by race or gender; in all cases, our approach leads to improved dialect diversity compared to standard text summarization approaches.
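The abstract does not spell out how the black-box scores and the small dialect-diverse set are combined, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes the black box exposes a per-post relevance score, that a token-overlap (Jaccard) similarity is an acceptable stand-in for a sentence similarity, and that the summary is built by cycling through the control sentences and picking, for each, the unselected post with the highest relevance-times-similarity. The names `diversify`, `jaccard`, `score`, and `control_set` are hypothetical.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (an assumed stand-in for any sentence similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def diversify(
    posts: Sequence[str],
    score: Callable[[str], float],
    control_set: Sequence[str],
    k: int,
    sim: Callable[[str, str], float] = jaccard,
) -> List[str]:
    """Pick k posts, steering selection with a small dialect-diverse control set.

    `score` is the black-box summarizer's relevance score for a post
    (hypothetical interface). No dialect labels on `posts` are used.
    """
    if not control_set:
        raise ValueError("control_set must contain at least one sentence")
    relevance = [score(p) for p in posts]      # query the black box once per post
    remaining = list(range(len(posts)))
    chosen: List[int] = []
    step = 0
    while remaining and len(chosen) < k:
        # Cycle through control sentences so each one gets a turn to pull in
        # the relevant post that most resembles it.
        anchor = control_set[step % len(control_set)]
        best = max(remaining, key=lambda j: relevance[j] * sim(posts[j], anchor))
        chosen.append(best)
        remaining.remove(best)
        step += 1
    return [posts[j] for j in chosen]


# Toy data (made up for illustration): two writing styles about the same topic.
posts = [
    "the match was great tonight",
    "the match was really great",
    "dat match wuz lit tho",
    "dat match wuz fire",
]
toy_scores = dict(zip(posts, [0.9, 0.8, 0.5, 0.4]))  # pretend black-box output
control = ["the weather is nice", "dat food wuz good tho"]

summary = diversify(posts, toy_scores.get, control, k=2)
print(summary)
```

On this toy input, ranking by the black-box score alone would return the first two posts, both in one style, while the sketch returns one post from each style. The control sentences need not be about the same topic as the posts; they only need to carry the stylistic features of the dialects they represent.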