Paper Title
Geometry matters: Exploring language examples at the decision boundary
Paper Authors
Paper Abstract
A growing body of recent evidence has highlighted the limitations of natural language processing (NLP) datasets and classifiers. These include the presence of annotation artifacts in datasets and classifiers relying on shallow features such as a single word (e.g., if a movie review contains the word "romantic", the review tends to be positive) or on unnecessary words (e.g., learning a proper noun to classify a movie as positive or negative). The presence of such artifacts has subsequently led to the development of challenging datasets that force models to generalize better. While a variety of heuristic strategies, such as counterfactual examples and contrast sets, have been proposed, the theoretical justification for what makes these examples difficult for a classifier is often lacking or unclear. In this paper, using tools from information geometry, we propose a theoretical way to quantify the difficulty of an example in NLP. Using our approach, we explore difficult examples for several deep learning architectures. We find that BERT, CNN, and fastText are all susceptible to word substitutions in high-difficulty examples. These classifiers tend to perform poorly on the FIM test set (generated by sampling and perturbing difficult examples), with accuracy dropping below 50%. We replicate our experiments on five NLP datasets (YelpReviewPolarity, AGNEWS, SogouNews, YelpReviewFull, and Yahoo Answers). On YelpReviewPolarity, we observe a correlation coefficient of -0.4 between resilience to perturbations and the difficulty score. Similarly, we observe a correlation of 0.35 between the difficulty score and the empirical success probability of random substitutions. Our approach is simple, architecture-agnostic, and can be used to study the fragility of text classification models. All code used will be made publicly available, including a tool to explore difficult examples in other datasets.
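The abstract does not spell out how the difficulty score is computed, but a natural reading of "tools from information geometry" and the "FIM test set" is a score derived from the Fisher Information Matrix (FIM) of the model's predictive distribution. The sketch below is a minimal, hypothetical PyTorch reconstruction of such a score, taking the trace of the FIM with respect to an example's input embedding; the function name `difficulty_score` and the calling convention for `model` are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def difficulty_score(model, embedding, n_classes):
    """Hypothetical FIM-based difficulty score: the trace of the Fisher
    Information Matrix of p(y | x) with respect to the input embedding.

    `model` is assumed to map an embedding tensor of shape
    (seq_len, emb_dim) to a logits vector of shape (n_classes,).
    This is an illustrative reconstruction, not the paper's estimator.
    """
    emb = embedding.clone().detach().requires_grad_(True)
    logits = model(emb)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp().detach()

    trace = 0.0
    for c in range(n_classes):
        # Gradient of log p(y=c | x) w.r.t. the embedding; keep the
        # graph alive so we can differentiate the remaining classes.
        grad = torch.autograd.grad(log_probs[c], emb, retain_graph=True)[0]
        # tr(FIM) = E_y[ ||grad log p(y|x)||^2 ], estimated exactly here
        # by summing over all classes weighted by their probabilities.
        trace = trace + probs[c] * grad.pow(2).sum()
    return trace.item()
```

Under this reading, a larger trace means the predictive distribution is more sensitive to small input perturbations, which matches the abstract's finding that high-difficulty examples are more vulnerable to word substitutions.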
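Likewise, the reported "empirical success probability of random substitutions" can be estimated by perturbing an example many times and counting label flips. The sketch below is a hedged illustration under that assumption; `predict` and `vocab` are hypothetical stand-ins for a trained classifier and a candidate replacement vocabulary, not names from the paper.

```python
import random

def substitution_success_rate(predict, tokens, vocab, n_trials=100):
    """Estimate the empirical success probability of random single-word
    substitutions: the fraction of perturbed copies whose predicted
    label differs from the original prediction.

    `predict` is assumed to map a list of tokens to a class label;
    `vocab` is a list of candidate replacement words. Illustrative only.
    """
    original = predict(tokens)
    flips = 0
    for _ in range(n_trials):
        perturbed = list(tokens)
        # Replace one uniformly chosen token with a random vocabulary word.
        i = random.randrange(len(perturbed))
        perturbed[i] = random.choice(vocab)
        if predict(perturbed) != original:
            flips += 1
    return flips / n_trials
```

Computing this rate alongside the difficulty score over a dataset would yield the kind of correlation (e.g., the reported 0.35 on YelpReviewPolarity) described in the abstract.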