Paper title
An Evaluation of Low Overhead Time Series Preprocessing Techniques for Downstream Machine Learning
Paper authors
Paper abstract
In this paper, we apply preprocessing techniques to multi-channel time series data of varying lengths for downstream machine learning; we refer to this as the alignment problem. Multi-channel time series data may be misaligned for a variety of reasons, such as missing data, varying sampling rates, or inconsistent collection times. We consider multi-channel time series data collected from the MIT SuperCloud High Performance Computing (HPC) center, where different job start times and varying run times of HPC jobs result in misaligned data. This misalignment makes it challenging to build AI/ML approaches for tasks such as compute workload classification. Building on previous supervised classification work with the MIT SuperCloud Dataset, we address the alignment problem via three broad, low-overhead approaches: sampling a fixed subset from a full time series, computing summary statistics over a full time series, and sampling a subset of coefficients from a time series mapped to the frequency domain. Our best-performing models achieve a classification accuracy greater than 95%, outperforming previous approaches to multi-channel time series classification with the MIT SuperCloud Dataset by 5%. These results indicate that our low-overhead approaches to the alignment problem, in conjunction with standard machine learning techniques, can achieve high levels of classification accuracy, and that they serve as a baseline for future approaches to the alignment problem, such as kernel methods.
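The three approaches named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `(T, C)` array layout (time steps by channels), the default sizes, and the particular statistics chosen are all assumptions made for illustration. Each function maps a series of arbitrary length `T` to a feature vector whose length depends only on the number of channels, which is what lets a standard classifier consume misaligned inputs.

```python
import numpy as np


def fixed_subset(series, n_samples=16):
    """Sample n_samples evenly spaced time steps from a (T, C) array.

    Hypothetical helper: evenly spaced sampling is one possible choice of subset.
    """
    t = series.shape[0]
    # Indices repeat when T < n_samples, so short series are still handled.
    idx = np.linspace(0, t - 1, num=n_samples).round().astype(int)
    return series[idx].ravel()


def summary_stats(series):
    """Concatenate per-channel mean, std, min, and max (assumed statistics)."""
    return np.concatenate([
        series.mean(axis=0),
        series.std(axis=0),
        series.min(axis=0),
        series.max(axis=0),
    ])


def fourier_coeffs(series, n_coeffs=8):
    """Keep magnitudes of the first n_coeffs real-FFT coefficients per channel.

    Series with fewer than n_coeffs coefficients are zero-padded (an assumption).
    """
    spectrum = np.abs(np.fft.rfft(series, axis=0))  # shape (T // 2 + 1, C)
    out = np.zeros((n_coeffs, series.shape[1]))
    k = min(n_coeffs, spectrum.shape[0])
    out[:k] = spectrum[:k]
    return out.ravel()


# Two synthetic "jobs" with different run lengths but the same 3 channels.
rng = np.random.default_rng(0)
short_job = rng.normal(size=(50, 3))
long_job = rng.normal(size=(137, 3))

for extract in (fixed_subset, summary_stats, fourier_coeffs):
    # Both jobs yield feature vectors of identical shape for each extractor.
    print(extract.__name__, extract(short_job).shape, extract(long_job).shape)
```

Feature vectors produced this way can be stacked into a single matrix and passed to any standard classifier; which extractor and which sizes work best is an empirical question that the sketch does not address.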