Paper Title

Concept Drift Detection via Equal Intensity k-means Space Partitioning

Authors

Anjin Liu, Jie Lu, Guangquan Zhang

Abstract

Data streams pose additional challenges to statistical classification tasks because the distributions of the training and target samples may differ as time passes. Such distribution change in streaming data is called concept drift. Numerous histogram-based distribution change detection methods have been proposed to detect drift. Most histograms are built on grid-based or tree-based space partitioning algorithms, which make the space partitions arbitrary and unexplainable and may cause drift blind spots. There is a need to improve the drift detection accuracy of histogram-based methods in the unsupervised setting. To address this problem, we propose a cluster-based histogram, called equal intensity k-means space partitioning (EI-kMeans). In addition, a heuristic method to improve the sensitivity of drift detection is introduced. The fundamental idea of improving the sensitivity is to minimize the risk of creating partitions in distribution offset regions. Pearson's chi-square test is used as the statistical hypothesis test so that the test statistic remains independent of the sample distribution. The number of bins and their shapes, which strongly influence the ability to detect drift, are determined dynamically from the sample based on an asymptotic constraint in the chi-square test. Accordingly, three algorithms are developed to implement concept drift detection: a greedy centroids initialization algorithm, a cluster amplify-shrink algorithm, and a drift detection algorithm. For drift adaptation, we recommend retraining the learner if a drift is detected. The results of experiments on synthetic and real-world datasets demonstrate the advantages of EI-kMeans and show its efficacy in detecting concept drift.
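The general idea of a cluster-based histogram combined with Pearson's chi-square test can be illustrated with a minimal Python sketch. This is not the authors' EI-kMeans: it uses plain Lloyd's k-means with random initialization in place of the greedy centroids initialization and amplify-shrink steps, and it fixes the number of bins rather than deriving it from the sample. The function names, the choice of k = 4, and the hard-coded critical value (7.815, the 0.95 quantile of the chi-square distribution with 3 degrees of freedom) are all assumptions made for illustration.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; a stand-in for the paper's equal-intensity variant."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids


def histogram(points, centroids):
    """Count how many points fall into each cluster (one bin per centroid)."""
    counts = [0] * len(centroids)
    for p in points:
        counts[nearest(p, centroids)] += 1
    return counts


def chi_square_statistic(ref_counts, new_counts):
    """Pearson's chi-square statistic for a 2 x k contingency table."""
    n_ref, n_new = sum(ref_counts), sum(new_counts)
    total = n_ref + n_new
    stat = 0.0
    for r, t in zip(ref_counts, new_counts):
        column = r + t
        if column == 0:
            continue  # an empty bin contributes nothing
        for observed, row_total in ((r, n_ref), (t, n_new)):
            expected = row_total * column / total
            stat += (observed - expected) ** 2 / expected
    return stat


def detect_drift(reference, window, k=4, critical=7.815):
    """Flag drift when the statistic exceeds the critical value.

    The default `critical` assumes k = 4 bins (k - 1 = 3 degrees of freedom)
    and a 0.05 significance level; change both together.
    """
    centroids = kmeans(reference, k)
    ref_counts = histogram(reference, centroids)
    new_counts = histogram(window, centroids)
    return chi_square_statistic(ref_counts, new_counts) > critical
```

In this sketch the partitions are learned once from the reference sample and then reused to bin each incoming window, so the test compares two histograms defined over the same regions. A fuller implementation would also check that the expected count in every bin is large enough for the chi-square approximation to hold, which is the kind of asymptotic constraint the abstract refers to.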
