Paper title
Learning language variations in news corpora through differential embeddings
Authors
Abstract
There is an increasing interest in the NLP community in capturing variations in the usage of language, either through time (i.e., semantic drift), across regions (as dialects or variants) or in different social contexts (i.e., professional or media technolects). Several successful dynamical embeddings have been proposed that can track semantic change through time. Here we show that a model with a central word representation and a slice-dependent contribution can learn word embeddings from different corpora simultaneously. This model is based on a star-like representation of the slices. We apply it to The New York Times and The Guardian newspapers, and we show that it can capture both temporal dynamics in the yearly slices of each corpus, and language variations between US and UK English in a curated multi-source corpus. We provide an extensive evaluation of this methodology.
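The abstract describes a model in which each word has a central representation plus a slice-dependent contribution, arranged in a star-like fashion around the center. The following is a minimal NumPy sketch of that idea only; all names, shapes, and the random initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of a star-like differential embedding: each word w
# has a central vector U[w] shared by all corpus slices, plus a small
# slice-dependent offset D[s][w]. Slices here stand for yearly subcorpora
# of each newspaper; the names are made up for illustration.
rng = np.random.default_rng(0)
vocab_size, dim = 5, 4
slices = ["NYT-1990", "NYT-2000", "Guardian-2000"]

U = rng.normal(size=(vocab_size, dim))             # central word vectors
D = {s: 0.1 * rng.normal(size=(vocab_size, dim))   # slice-dependent deltas
     for s in slices}

def embedding(word_id: int, slice_name: str) -> np.ndarray:
    """Embedding of a word in one slice: center plus slice offset."""
    return U[word_id] + D[slice_name][word_id]

# Variation of a word between two slices reduces to the offset difference,
# since the shared central vector cancels out:
drift = embedding(2, "NYT-2000") - embedding(2, "NYT-1990")
assert np.allclose(drift, D["NYT-2000"][2] - D["NYT-1990"][2])
```

In such a scheme the central vectors would be trained jointly on all slices, while each offset is fit only on its slice, so comparing two slices of the same word is a comparison of offsets in a shared space.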