Paper Title
Combining datasets to increase the number of samples and improve model fitting
Authors
Abstract
For many use cases, combining information from different datasets can be of interest to improve a machine learning model's performance, especially when the number of samples from at least one of the datasets is small. However, a potential challenge in such cases is that the features from these datasets are not identical, even though some features are commonly shared among the datasets. To tackle this challenge, we propose a novel framework called Combine datasets based on Imputation (ComImp). In addition, we propose a variant of ComImp that uses Principal Component Analysis (PCA), called PCA-ComImp, to reduce dimensionality before combining datasets. This is useful when the datasets have a large number of features that are not shared between them. Furthermore, our framework can also be utilized for data preprocessing by imputing missing data, i.e., filling in the missing entries while combining different datasets. To illustrate the power of the proposed methods and their potential usages, we conduct experiments for various tasks (regression, classification) and different data types (tabular data, time series data), including when the datasets to be combined have missing data. We also investigate how the devised methods can be used with transfer learning to provide even further model training improvement. Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets. In addition, the methods can boost performance by a significant margin when combining small datasets together and can provide extra improvement when used with transfer learning.
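The core idea described in the abstract, stacking datasets on the union of their features and imputing the entries that one dataset lacks, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy datasets, column names, and the choice of mean imputation are all assumptions; the framework itself admits any imputer.

```python
# Sketch of imputation-based dataset combination (the ComImp idea).
# The datasets and the mean imputer below are illustrative assumptions.
import pandas as pd
from sklearn.impute import SimpleImputer

# Two toy datasets that share only some features ("age", "score").
d1 = pd.DataFrame({"age": [25, 32, 47], "income": [40, 55, 70], "score": [0.2, 0.5, 0.9]})
d2 = pd.DataFrame({"age": [51, 38], "height": [170, 165], "score": [0.7, 0.4]})

# Stack on the union of features; columns absent from one dataset become NaN.
combined = pd.concat([d1, d2], ignore_index=True, sort=False)

# Fill the resulting missing entries; any imputation method could be used here.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(combined), columns=combined.columns)
```

The combined table has one row per sample from either dataset and one column per feature in the union, so a single model can then be trained on all samples at once.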