Paper Title
HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques
Authors
Abstract
Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. Instance hardness concerns the handling of unsafe or potentially noisy instances, which are more likely to be misclassified and are a root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness, mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling or oversampling a specific class, we allow users to find and sample easy-to-classify and difficult-to-classify training instances from all classes. Users can explore subsets of data from different perspectives to decide all those parameters, while HardVis keeps track of their steps and separately evaluates the model's predictive performance on a test set. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we also assess the usefulness of our system based on feedback from ML experts.
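To make the idea of hardness-aware undersampling concrete, the following is a minimal sketch and not the HardVis implementation. The abstract does not say how hardness is measured, so the sketch assumes a simple proxy: the fraction of an instance's k nearest neighbours that carry a different label. The function names `knn_disagreement` and `undersample_hard_majority`, the threshold, and the toy data are all hypothetical.

```python
import numpy as np

def knn_disagreement(X, y, k=3):
    """Assumed hardness proxy: share of the k nearest neighbours
    (Euclidean distance) whose label differs from the instance's own."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)        # an instance is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest instances
    return (y[nn] != y[:, None]).mean(axis=1)

def undersample_hard_majority(X, y, majority_label, threshold=0.5, k=3):
    """Drop only majority-class instances whose hardness exceeds the
    threshold, instead of removing majority instances uniformly."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    hardness = knn_disagreement(X, y, k)
    keep = ~((y == majority_label) & (hardness > threshold))
    return X[keep], y[keep], hardness

# Toy data (hypothetical): five majority points in one cluster, one majority
# point stranded inside the minority cluster, and three minority points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
              [5.0, 5.0],
              [5.1, 5.0], [5.0, 5.1], [4.9, 5.0]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

X_new, y_new, hardness = undersample_hard_majority(X, y, majority_label=0)
```

In this toy example only the stranded majority point exceeds the threshold and is removed, while the compact majority cluster is left intact. In an interactive system such as the one described, the threshold and the affected instance types would presumably be chosen by the user after inspecting the data rather than fixed in code.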