Paper title
Anomaly Search over Composite Hypotheses in Hierarchical Statistical Models
Paper authors
Paper abstract
Detecting anomalies among a large number of processes is a fundamental task that has been studied in multiple research areas, with diverse applications ranging from spectrum access to cyber-security. Anomalous events are characterized by deviations in data distributions, and thus can be inferred from noisy observations using statistical methods. In some scenarios, one can obtain noisy observations aggregated from a chosen subset of processes. Such hierarchical search can further reduce the sample complexity while retaining accuracy. An anomaly search strategy should thus be designed to meet multiple requirements: maximize the detection accuracy, be efficient in terms of sample complexity, and cope with statistical models that are known only up to some missing parameters (i.e., composite hypotheses). In this paper, we consider anomaly detection with observations taken from a chosen subset of processes that conforms to a predetermined tree structure, under a partially known statistical model. We propose Hierarchical Dynamic Search (HDS), a sequential search strategy that uses two variations of the Generalized Log Likelihood Ratio (GLLR) statistic and can be used to detect multiple anomalies. HDS is shown to be order-optimal in terms of the size of the search space, and asymptotically optimal in terms of detection accuracy. An explicit upper bound on the error probability is established for the finite sample regime. In addition to extensive experiments on synthetic datasets, experiments have been conducted on the DARPA intrusion detection dataset, showing that HDS is superior to existing methods.
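To give a concrete sense of what a GLLR statistic looks like under a composite hypothesis, the sketch below works through a toy case. The setup is an assumption of this example and not the paper's model: observations are Gaussian with known standard deviation `sigma`, the normal hypothesis fixes the mean at zero, and the anomalous hypothesis leaves the mean unknown. Replacing the unknown mean with its maximum-likelihood estimate (the sample mean) gives the closed form n * mean_hat^2 / (2 * sigma^2). The function name `gllr_gaussian_mean` is hypothetical.

```python
def gllr_gaussian_mean(samples, sigma=1.0):
    """Toy GLLR for H0: mean = 0 versus H1: mean unknown.

    Assumes i.i.d. Gaussian samples with known standard deviation sigma.
    This is an illustrative setup, not the statistic defined in the paper.
    """
    n = len(samples)
    if n == 0:
        return 0.0  # no evidence either way before any observation
    # Maximum-likelihood estimate of the unknown mean under H1.
    mean_hat = sum(samples) / n
    # log-likelihood at mean_hat minus log-likelihood at 0
    # simplifies to n * mean_hat^2 / (2 * sigma^2).
    return n * mean_hat ** 2 / (2.0 * sigma ** 2)
```

In a sequential test, a statistic of this kind would be recomputed as samples arrive and compared against a threshold; larger values indicate stronger evidence against the normal hypothesis.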