Paper title
Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics
Paper authors
Paper abstract
Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and typically in "untidy" or raw data, practitioners are faced with significant issues of data quality and diversity which can be prohibitively labor intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology -- uncovering and inferring metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to infer metadata of interest. Our method is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, classifying minority-group samples, prioritizing points relevant for training and enabling scalable human auditing of relevant examples.
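The idea of inferring metadata from differences in learning dynamics can be illustrated with a toy sketch. The abstract does not specify the inference rule, so everything below is an assumption made for illustration: learning dynamics are represented as per-epoch loss curves, probe examples carry known metadata labels, and an unlabeled example receives the majority label of its k nearest probe curves. The function `infer_metadata`, the helper `shift`, and the synthetic curves are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def infer_metadata(probe_curves, probe_labels, query_curve, k=3):
    """Return the majority probe label among the k probe loss curves
    closest (Euclidean distance) to the query loss curve.

    Assumption: a plain k-nearest-neighbor vote; the abstract does not
    state which inference rule the authors use.
    """
    ranked = sorted(
        (math.dist(query_curve, curve), label)
        for curve, label in zip(probe_curves, probe_labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def shift(curve, delta):
    """Offset every point of a loss curve by delta (used to fake variation)."""
    return [value + delta for value in curve]


# Synthetic per-epoch losses (hypothetical): typical examples are learned
# quickly, while mislabeled examples keep a high loss for many epochs.
epochs = range(5)
typical = [2.0 * math.exp(-e) for e in epochs]
mislabeled = [2.0 - 0.1 * e for e in epochs]

# Probe suites: small sets of examples whose metadata is known by construction.
probe_curves = [shift(typical, d) for d in (-0.05, 0.0, 0.05)]
probe_curves += [shift(mislabeled, d) for d in (-0.05, 0.0, 0.05)]
probe_labels = ["typical"] * 3 + ["mislabeled"] * 3

# Unlabeled examples whose metadata we want to infer.
pred_a = infer_metadata(probe_curves, probe_labels, shift(typical, 0.02))
pred_b = infer_metadata(probe_curves, probe_labels, shift(mislabeled, -0.02))
print(pred_a, pred_b)
```

In a real setting the curves would come from logging each example's loss during training, and the probe suites would be built by applying simple transformations (for instance, flipping labels to create a mislabeled suite) to a small held-out subset.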