Compositional Demographic Word Embeddings

Paper Title

Compositional Demographic Word Embeddings

Authors

Charles Welch, Jonathan K. Kummerfeld, Verónica Pérez-Rosas, Rada Mihalcea

Abstract

Word embeddings are usually derived from corpora containing text from many individuals, thus leading to general purpose representations rather than individually personalized representations. While personalized embeddings can be useful to improve language model performance and other language processing tasks, they can only be computed for people with a large amount of longitudinal data, which is not the case for new users. We propose a new form of personalized word embeddings that use demographic-specific word representations derived compositionally from full or partial demographic information for a user (i.e., gender, age, location, religion). We show that the resulting demographic-aware word representations outperform generic word representations on two tasks for English: language modeling and word associations. We further explore the trade-off between the number of available attributes and their relative effectiveness and discuss the ethical implications of using them.
