Paper title
SOS: Score-based Oversampling for Tabular Data
Authors
Abstract
Score-based generative models (SGMs) are a recent breakthrough in generating fake images, and they are known to surpass other generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs). Inspired by this success, in this work we fully customize SGMs for generating fake tabular data. In particular, we are interested in oversampling minor classes, since imbalanced classes frequently lead to sub-optimal training outcomes. To our knowledge, we are the first to present a score-based oversampling method for tabular data. First, we redesign the score network, since it has to process tabular data. Second, we propose two options for our generation method: the former is equivalent to style transfer for tabular data, and the latter uses the standard generative policy of SGMs. Lastly, we define a fine-tuning method that further enhances the oversampling quality. In our experiments with 6 datasets and 10 baselines, our method outperforms other oversampling methods in all cases.
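To make the oversampling goal concrete, the following minimal sketch computes how many synthetic rows each class would need in order to match the largest class. It only illustrates the class-imbalance bookkeeping; it is not the score-based generator described in the abstract. The function name `oversampling_budget`, the toy labels, and the choice to balance every class up to the majority count are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def oversampling_budget(labels):
    """Return, per class, the number of synthetic rows needed to match the largest class.

    Assumption: "balanced" means every class reaches the majority-class count.
    """
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


# Hypothetical toy dataset: 90 rows of a major class, 10 rows of a minor class.
labels = ["neg"] * 90 + ["pos"] * 10
budget = oversampling_budget(labels)
print(budget)  # {'neg': 0, 'pos': 80}
```

Under this assumption, any oversampling method, score-based or otherwise, would be asked to generate 80 additional rows for the minor class before a downstream classifier is trained.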