Paper Title

Distance-based Data Cleaning: A Survey (Technical Report)

Authors

Sun, Yu; Zhang, Jian

Abstract

With the rapid development of internet technology, dirty data are commonly observed in real-world scenarios, e.g., owing to unreliable sensor readings, transmission, and collection from heterogeneous sources. To limit their negative effects on downstream applications, data cleaning approaches preprocess the dirty data before those applications run. Most data cleaning methods identify or correct dirty data by referring to the values of neighbors that share the same information. Unfortunately, owing to data sparsity and heterogeneity, the number of neighbors under an equality relationship is rather limited, especially when data values exhibit variance. To tackle this problem, distance-based data cleaning approaches consider similarity neighbors based on value distance. By tolerating small variations, they can identify a richer set of similarity neighbors and use it for data cleaning tasks. In addition, the distance relationship between tuples carries more information than the equality relationship, which it subsumes, and therefore also helps guide data cleaning. Distance-based techniques thus play an important role in data cleaning, and we have reason to believe that they will attract more attention in future data preprocessing research. This survey therefore provides a classification into four main data cleaning tasks, i.e., rule profiling, error detection, data repair, and data imputation, and comprehensively reviews the state of the art for each class.
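As a rough illustration of the contrast the abstract draws between equality neighbors and similarity neighbors, the following minimal sketch compares the two on a handful of numeric values. The function names, the absolute-difference distance, the threshold `eps`, and the sample readings are all assumptions made for this example; they are not taken from the survey.

```python
from typing import List


def equality_neighbors(values: List[float], i: int) -> List[int]:
    """Indices of the other values that are exactly equal to values[i]."""
    return [j for j, v in enumerate(values) if j != i and v == values[i]]


def distance_neighbors(values: List[float], i: int, eps: float) -> List[int]:
    """Indices of the other values within absolute distance eps of values[i]."""
    return [j for j, v in enumerate(values) if j != i and abs(v - values[i]) <= eps]


# Hypothetical sensor readings; the last value looks suspicious.
readings = [20.1, 20.3, 19.9, 20.2, 35.0]

# No reading repeats exactly, so equality yields no neighbors at all.
eq = equality_neighbors(readings, 0)

# Tolerating small variations recovers the readings that are merely close.
near = distance_neighbors(readings, 0, eps=0.5)

# A value with no similarity neighbors is a candidate error.
outlier_near = distance_neighbors(readings, 4, eps=0.5)

print(eq, near, outlier_near)
```

In this toy setting, the first reading has no equality neighbors but several distance neighbors, while the last reading has neither, which is the kind of signal a distance-based error detector could build on.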
