Heavy-tailed Representations, Text Polarity Classification & Data Augmentation

Paper Title

Heavy-tailed Representations, Text Polarity Classification & Data Augmentation

Authors

Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Eric Gaussier, Giovanna Varni, Emmanuel Vignon, Anne Sabourin

Abstract

The dominant approaches to text representation in natural language rely on learning embeddings on massive corpora; these embeddings have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which makes it possible to analyze points far from the bulk of the distribution using the framework of multivariate extreme value theory. In particular, we obtain a classifier dedicated to the tails of the proposed embedding whose performance exceeds the baseline. This classifier exhibits a scale invariance property, which we leverage by introducing a novel text generation method for label-preserving dataset augmentation. Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework and confirm that this method generates meaningful sentences with controllable attributes, e.g. positive or negative sentiment.
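
The scale invariance property mentioned in the abstract can be illustrated with a toy sketch. It is not the authors' model: it assumes a hypothetical classifier that depends only on the direction x / ||x|| of an embedding x, with made-up weights `w` and a heavy-tailed toy sample. Under that assumption, multiplying x by any positive factor leaves the predicted label unchanged, which is the kind of property that makes label-preserving augmentation by rescaling possible.

```python
import numpy as np

def angular_classifier(x, w):
    """Hypothetical classifier that only looks at the direction x / ||x||."""
    theta = x / np.linalg.norm(x)
    return 1 if float(w @ theta) >= 0 else -1

def augment_by_scaling(x, scales):
    """Return rescaled copies of x; the direction of each copy equals that of x."""
    return [lam * x for lam in scales]

rng = np.random.default_rng(0)
w = rng.normal(size=5)             # illustrative weights, not learned from data
x = rng.standard_t(df=2, size=5)   # heavy-tailed toy "embedding" (assumption)

label = angular_classifier(x, w)
augmented = augment_by_scaling(x, scales=[1.5, 2.0, 10.0])
labels = [angular_classifier(z, w) for z in augmented]

# Every rescaled copy receives the same label as the original point.
print(label, labels)
```

In the paper the rescaled points are turned back into text by a generation method; this sketch stops at the embedding level and only shows why rescaling does not change the label of a direction-based classifier.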
