Paper Title

Analytics of Longitudinal System Monitoring Data for Performance Prediction

Authors

Costello, Ian J.; Bhatele, Abhinav

Abstract

In recent years, several HPC facilities have started continuous monitoring of their systems and jobs to collect performance-related data for understanding performance and operational efficiency. Such data can be used to optimize the performance of individual jobs and the overall system by creating data-driven models that can predict the performance of jobs waiting in the scheduler queue. In this paper, we model the performance of representative control jobs using longitudinal system-wide monitoring data and machine learning to explore the causes of performance variability. We analyze these prediction models in great detail to identify the features that are dominant predictors of performance. We demonstrate that such models can be application-agnostic and can be used for predicting performance of applications that are not included in training.
