Impact of Stop Sets on Stopping Active Learning for Text Classification

Paper Title

Impact of Stop Sets on Stopping Active Learning for Text Classification

Authors

Luke Kurlandski, Michael Bloodgood

Abstract

Active learning is an increasingly important branch of machine learning and a powerful technique for natural language processing. The main advantage of active learning is its potential to reduce the amount of labeled data needed to learn high-performing models. A vital aspect of an effective active learning algorithm is determining when to stop obtaining additional labeled data. Several leading state-of-the-art stopping methods use a stop set to help make this decision. However, relatively little attention has been given to the choice of stop set compared with the stopping algorithms applied to it. Different choices of stop sets can lead to significant differences in stopping method performance. We investigate the impact of different stop set choices on different stopping methods. This paper shows that the choice of stop set can have a significant impact on the performance of stopping methods, and that the impact on stability-based methods differs from the impact on confidence-based methods. Furthermore, the unbiased representative stop sets suggested by the original authors of the methods work better than the systematically biased stop sets used in recently published work, and stopping methods based on stabilizing predictions perform better than confidence-based stopping methods when unbiased representative stop sets are used. We provide the largest quantity of experimental results on the impact of stop sets to date. The findings help illuminate an aspect of stopping methods that has been under-considered in recently published work and that can have a large practical impact on the performance of stopping methods for important semantic computing applications such as technology-assisted review and text classification more broadly.
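
To make the role of a stop set concrete, the following is a minimal sketch of a stability-based stopping check. It is not taken from the paper: it assumes that stability is measured as Cohen's kappa agreement between the predictions that successive models make on a fixed stop set, and the function names, the 0.99 threshold, and the window of 3 are illustrative assumptions rather than values stated in the abstract.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Fraction of stop-set examples on which the two models agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each model's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both models assign one identical label everywhere: full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def should_stop(stop_set_predictions, threshold=0.99, window=3):
    """Hypothetical stopping rule (assumed defaults, not from the paper).

    stop_set_predictions holds one list of predicted labels per active
    learning iteration, all made on the same stop set. Returns True once
    the last `window` consecutive pairs of iterations each reach a kappa
    agreement of at least `threshold`.
    """
    if len(stop_set_predictions) < window + 1:
        return False
    recent = stop_set_predictions[-(window + 1):]
    return all(
        cohen_kappa(prev, curr) >= threshold
        for prev, curr in zip(recent, recent[1:])
    )
```

In this sketch the stop set enters only through the examples on which predictions are compared, so a biased selection of those examples changes the agreement values the rule sees. That is one way to picture why the choice of stop set could affect when a stability-based method stops.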
