Paper Title

TransTab: Learning Transferable Tabular Transformers Across Tables

Paper Authors

Zifeng Wang, Jimeng Sun

Paper Abstract

Tabular data (or tables) are the most widely used data format in machine learning (ML). However, ML models often assume that the table structure remains fixed between training and testing. Before ML modeling, heavy data cleaning is required to merge disparate tables with different columns. This preprocessing often incurs significant data waste (e.g., removing unmatched columns and samples). How to learn ML models from multiple tables with partially overlapping columns? How to incrementally update ML models as more columns become available over time? Can we leverage model pretraining on multiple distinct tables? How to train an ML model that can predict on an unseen table? To answer all these questions, we propose to relax fixed table structures by introducing a Transferable Tabular Transformer (TransTab) for tables. The goal of TransTab is to convert each sample (a row in the table) into a generalizable embedding vector, and then apply stacked transformers for feature encoding. One methodological insight is to combine column descriptions and table cells as the raw input to a gated transformer model. The other insight is to introduce supervised and self-supervised pretraining to improve model performance. We compare TransTab with multiple baseline methods on diverse benchmark datasets and five oncology clinical trial datasets. Overall, TransTab ranks 1.00, 1.00, and 1.78 out of 12 methods in the supervised learning, feature incremental learning, and transfer learning scenarios, respectively; the proposed pretraining leads to a 2.3% AUC lift on average over supervised learning.
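
The sketch below illustrates the core idea described in the abstract: serializing column descriptions and cell values into one token sequence per row, so rows from tables with partially overlapping columns can be encoded by a shared transformer into comparable embedding vectors. It is a minimal hypothetical example, not the authors' implementation and not the released transtab package; the class name RowEncoder, the toy vocabulary, and all hyperparameters (embed_dim, num_layers, num_heads) are illustrative assumptions.

```python
# Hypothetical sketch, not the authors' code: turn (column name, cell value) pairs
# into one token sequence per row and encode it with a transformer, so that tables
# with different but overlapping columns share the same encoder.
import torch
import torch.nn as nn


class RowEncoder(nn.Module):
    """Encode one table row into a fixed-size vector, independent of the table schema."""

    def __init__(self, vocab, embed_dim=64, num_layers=2, num_heads=4):
        super().__init__()
        # id 0 is shared by [CLS] and out-of-vocabulary tokens for simplicity
        self.vocab = {tok: i + 1 for i, tok in enumerate(vocab)}
        self.embed = nn.Embedding(len(self.vocab) + 1, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def tokenize(self, row):
        # Column descriptions and cell values are concatenated into one sequence,
        # which is what lets rows from disparate tables map into a shared space.
        tokens = ["[CLS]"]
        for col, val in row.items():
            tokens += col.lower().split() + str(val).lower().split()
        ids = [0 if t == "[CLS]" else self.vocab.get(t, 0) for t in tokens]
        return torch.tensor(ids).unsqueeze(0)  # shape: (1, seq_len)

    def forward(self, row):
        hidden = self.encoder(self.embed(self.tokenize(row)))
        return hidden[:, 0]  # the [CLS] position serves as the row embedding


# Two rows with partially overlapping columns are encoded by the same model.
vocab = ["age", "gender", "treatment", "arm", "male", "female", "drug", "placebo", "54", "61"]
encoder = RowEncoder(vocab)
row_a = {"age": 54, "gender": "male", "treatment arm": "drug"}
row_b = {"age": 61, "treatment arm": "placebo"}  # different schema, same encoder
print(encoder(row_a).shape, encoder(row_b).shape)  # torch.Size([1, 64]) torch.Size([1, 64])
```

The actual method additionally uses gating inside the transformer, type-specific handling of categorical, numerical, and binary features, and the supervised and self-supervised pretraining mentioned in the abstract, none of which is shown in this sketch.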
