Are We Ready For Learned Cardinality Estimation?

Paper Title

Are We Ready For Learned Cardinality Estimation?

Authors

Wang, Xiaoying; Qu, Changbo; Wu, Weiyuan; Wang, Jiannan; Zhou, Qingqing

Abstract

Cardinality estimation is a fundamental but long unresolved problem in query optimization. Recently, multiple papers from different research groups have consistently reported that learned models have the potential to replace existing cardinality estimators. In this paper, we ask a forward-thinking question: Are we ready to deploy these learned cardinality models in production? Our study consists of three main parts. First, we focus on the static environment (i.e., no data updates) and compare five new learned methods with eight traditional methods on four real-world datasets under a unified workload setting. The results show that learned models are indeed more accurate than traditional methods, but they often suffer from high training and inference costs. Second, we explore whether these learned models are ready for dynamic environments (i.e., frequent data updates). We find that they cannot keep up with fast data updates and return large errors for different reasons. For less frequent updates, they can perform better, but there is no clear winner among them. Third, we take a deeper look into learned models and explore when they may go wrong. Our results show that the performance of learned methods can be greatly affected by changes in correlation, skewness, or domain size. More importantly, their behaviors are much harder to interpret and often unpredictable. Based on these findings, we identify two promising research directions (controlling the cost of learned models and making learned models trustworthy) and suggest a number of research opportunities. We hope that our study can guide researchers and practitioners to work together to eventually push learned cardinality estimators into real database systems.
