Paper title
Analyzing Text Representations under Tight Annotation Budgets: Measuring Structural Alignment
Paper authors
Abstract
Annotating large collections of textual data can be time-consuming and expensive, so the ability to train models under limited annotation budgets is of great importance. In this setting, it has been shown that under tight annotation budgets the choice of data representation is key. The goal of this paper is to better understand why this is so. To that end, we propose a metric that measures the extent to which a given representation is structurally aligned with a task. We conduct experiments on several text classification datasets, testing a variety of models and representations. Using our proposed metric, we show that an efficient representation for a task (i.e., one that enables learning from few samples) is one that induces a good alignment between latent input structure and class structure.