Paper Title
Failure Identification from Unstable Log Data using Deep Learning
Authors
Abstract
The reliability of cloud platforms matters because society increasingly relies on complex software systems running in the cloud. To improve it, cloud providers are automating various maintenance tasks, and failure identification is frequently among them. Automation requires observability tools, and system logs are the most commonly used source. This paper focuses on log-based failure identification. The problem is challenging for two reasons: log data is unstable, and the explicit logging of failures within the code is incomplete. To address both challenges, we present CLog, a method for failure identification. The key idea is based on our observation that representing log data as sequences of subprocesses, instead of sequences of log events, reduces the effect of unstable log data. CLog introduces a novel subprocess extraction method that uses a context-aware neural network and clustering methods to extract meaningful subprocesses. Directly modeling log event contexts allows failures to be identified through abrupt context changes, which addresses the challenge of insufficient logging coverage of failures. Our experimental results demonstrate that the learned subprocess representations reduce the instability in the input, allowing CLog to outperform the baselines on both failure identification subproblems: (1) failure detection, by 9-24% in F1 score, and (2) failure type identification, by 7% in macro-averaged F1 score. Further analysis shows a negative correlation between the instability of the input event sequences and detection performance, and this correlation holds in a model-agnostic manner.
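The key idea, that subprocess sequences are more stable than raw event sequences, can be illustrated with a toy sketch. This is not CLog's extraction method: the abstract says the subprocesses are learned with a context-aware neural network and clustering, whereas here the event-to-subprocess mapping (`EVENT_TO_SUBPROCESS`), the event IDs, and the helper `to_subprocess_sequence` are all hypothetical and hard-coded purely for illustration.

```python
from itertools import groupby

# Hypothetical mapping from log event IDs to subprocess IDs.
# In the paper this grouping is learned; here it is fixed by hand.
EVENT_TO_SUBPROCESS = {
    "E1": "S_auth", "E2": "S_auth",
    "E3": "S_boot", "E4": "S_boot", "E5": "S_boot",
}


def to_subprocess_sequence(events):
    """Map each event to its subprocess and collapse consecutive repeats."""
    labels = [EVENT_TO_SUBPROCESS[e] for e in events]
    return [label for label, _ in groupby(labels)]


# Two runs of the same workload whose event order differs (unstable logs).
run_a = ["E1", "E2", "E3", "E4", "E5"]
run_b = ["E2", "E1", "E3", "E5", "E4", "E4"]  # reordered and duplicated events

seq_a = to_subprocess_sequence(run_a)
seq_b = to_subprocess_sequence(run_b)

print(run_a == run_b)  # the raw event sequences differ
print(seq_a == seq_b)  # the subprocess sequences coincide
```

Under these assumptions, reordering or duplicating events inside one subprocess leaves the subprocess-level sequence unchanged, which is the intuition behind feeding a detector subprocess sequences rather than event sequences.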