Stream-based active learning with linear models

Paper title

Stream-based active learning with linear models

Authors

Davide Cacciarelli, Murat Kulahci, John Sølve Tyssedal

Abstract

The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
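
The thresholding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes an ordinary least-squares model, uses the quantity x^T (X^T X)^{-1} x as a stand-in informativeness score (by the matrix determinant lemma, labeling x multiplies det(X^T X) by one plus this score, which loosely connects it to D-optimal design), and treats the threshold as a fixed user-chosen constant. The names `stream_active_learning`, `oracle`, and `threshold` are hypothetical.

```python
import numpy as np

def stream_active_learning(X_init, y_init, stream, oracle, threshold):
    """Illustrative stream-based querying rule for a linear model.

    A point is labeled only if its score x^T (X^T X)^{-1} x exceeds
    `threshold`; otherwise it is discarded. The score and the fixed
    threshold are assumptions of this sketch, not the paper's method.
    """
    X = np.asarray(X_init, dtype=float)
    y = np.asarray(y_init, dtype=float)
    # Inverse information matrix of the initial labeled set
    # (assumes X_init has full column rank).
    M_inv = np.linalg.inv(X.T @ X)
    n_queries = 0
    for x in stream:
        x = np.asarray(x, dtype=float)
        # Informativeness of the incoming point under the current design.
        score = float(x @ M_inv @ x)
        if score > threshold:
            # "Quality check": ask the oracle for the label and keep the point.
            label = oracle(x)
            X = np.vstack([X, x])
            y = np.append(y, label)
            # Sherman-Morrison rank-one update of (X^T X)^{-1}.
            Mx = M_inv @ x
            M_inv = M_inv - np.outer(Mx, Mx) / (1.0 + score)
            n_queries += 1
        # Points at or below the threshold are discarded without labeling.
    # Fit the linear model on every labeled point collected so far.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, n_queries
```

A higher threshold means fewer quality checks, each on a more unusual point. A threshold below zero labels every point in the stream, since the score is never negative.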
