Paper Title
DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services
Paper Authors
Abstract
As cloud services grow and generate high revenues, downtime in these services is becoming increasingly expensive. To reduce losses and service downtime, a critical first step is incident triage: the process of assigning a service incident to the correct responsible team in a timely manner. An incorrect assignment risks additional incident reroutings and increases the time to mitigate by 10x. However, automated incident triage in large cloud services faces many challenges: (1) a highly imbalanced incident distribution across a large number of teams, (2) a wide variety in the formats of input data or data sources, (3) scaling to meet production-grade requirements, and (4) gaining engineers' trust in machine learning recommendations. To address these challenges, we introduce DeepTriage, an intelligent incident transfer service that combines multiple machine learning techniques (gradient boosted classifiers, clustering methods, and deep neural networks) in an ensemble to recommend the team responsible for triaging an incident. Experimental results on real incidents in Microsoft Azure show that our service achieves an F1 score of 82.9%. For highly impacted incidents, DeepTriage achieves F1 scores from 76.3% to 91.3%. We have applied best practices and state-of-the-art frameworks to scale DeepTriage to handle incident routing for all cloud services. DeepTriage has been deployed in Azure since October 2017 and is used by thousands of teams daily.