Paper Title
Query Complexity Based Optimal Processing of Raw Data
Paper Authors
Paper Abstract
This paper aims to find an efficient way to process large datasets that serve different types of workload queries with minimal replication. The work first identifies the query complexity best suited to a given data processing tool. It then proposes a Query Complexity Aware (QCA) partitioning technique built on a lightweight query identification and partitioning algorithm. Different replication approaches are studied to cover more use cases across different application workloads. The technique is demonstrated on a scientific dataset, the Sloan Digital Sky Survey (SDSS). The results show that workload execution time (WET) is reduced by 94.6% with only 6.7% of the dataset in loaded format, compared to the original dataset. The QCA technique also reduces multi-node replication by 5.8x compared to state-of-the-art workload-aware (WA) techniques. Multi-node and multi-core execution of the workload using the QCA-proposed partitions reduces WET by 42.66% and 25.46%, respectively, compared to WA.