Paper Title

Sparx: Distributed Outlier Detection at Scale

Authors

Sean Zhang, Varun Ursekar, Leman Akoglu

Abstract

There is no shortage of outlier detection (OD) algorithms in the literature, yet a vast body of them are designed for a single machine. As more datasets already reside in the cloud, the need for distributed OD techniques grows. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: we design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we implement specifically in Apache Spark. Through extensive experiments on three real-world datasets, with several billion points and millions of features, we show that existing open-source solutions fail to scale up, whether to a large number of points or to high dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source Sparx under the Apache license at https://tinyurl.com/sparx2022.
