Paper title
Heavy-tailed Representations, Text Polarity Classification & Data Augmentation
Authors
Abstract
The dominant approaches to text representation in natural language rely on learning, from massive corpora, embeddings that have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which makes it possible to analyze points far from the bulk of the distribution using the framework of multivariate extreme value theory. In particular, we obtain a classifier dedicated to the tails of the proposed embedding that outperforms the baseline. This classifier exhibits a scale invariance property, which we leverage by introducing a novel text generation method for label-preserving dataset augmentation. Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework and confirm that this method generates meaningful sentences with a controllable attribute, e.g. positive or negative sentiment.