Learning to be a Statistician: Learned Estimator for Number of Distinct Values

Paper title

Learning to be a Statistician: Learned Estimator for Number of Distinct Values

Authors

Wu, Renzhi; Ding, Bolin; Chu, Xu; Wei, Zhewei; Dai, Xiening; Guan, Tao; Zhou, Jingren

Abstract

Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload agnostic, in the sense that the model/estimator can be trained with synthetically generated training data, and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds on CPU) and accurate NDV estimations for unseen tables and workloads. We compare the learned estimator with the state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our code for training data generation, model training, and the learned estimator online for reproducibility.
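The abstract does not say what input the learned model consumes, so the following is only a rough illustration of what sample-based NDV estimation involves, not the paper's method. It computes a sample's frequency profile (f_j = the number of distinct values that occur exactly j times in the sample) and applies the classical GEE estimator, sqrt(N/n) * f_1 + sum over j >= 2 of f_j, as an example of the kind of heuristic estimator the abstract contrasts with. The function names `frequency_profile` and `gee_estimate` are hypothetical, chosen for this sketch.

```python
import math
from collections import Counter


def frequency_profile(sample):
    """Return a mapping j -> f_j, where f_j is the number of distinct
    values that appear exactly j times in the sample."""
    occurrences = Counter(sample)          # value -> times seen in the sample
    return Counter(occurrences.values())   # j -> number of values seen j times


def gee_estimate(sample, population_size):
    """Classical GEE estimate of the number of distinct values in a column
    of `population_size` rows, given a uniform random sample of its rows.

    Values seen once in the sample are scaled up by sqrt(N / n); values
    seen two or more times are counted once each.
    """
    n = len(sample)
    if n == 0:
        return 0.0
    profile = frequency_profile(sample)
    f1 = profile.get(1, 0)
    repeated = sum(count for j, count in profile.items() if j >= 2)
    return math.sqrt(population_size / n) * f1 + repeated


# Usage: a 7-row sample drawn from a hypothetical 28-row column.
sample = ["a", "a", "b", "c", "c", "c", "d"]
profile = frequency_profile(sample)   # f_1 = 2 (b, d), f_2 = 1 (a), f_3 = 1 (c)
estimate = gee_estimate(sample, 28)   # sqrt(28 / 7) * 2 + 2 = 6.0
```

A learned estimator in the sense of the abstract would replace the fixed formula in `gee_estimate` with a trained model; which features that model takes is not stated here.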
