SubOmiEmbed: Self-supervised Representation Learning of Multi-omics Data for Cancer Type Classification

Paper Title

SubOmiEmbed: Self-supervised Representation Learning of Multi-omics Data for Cancer Type Classification

Authors

Sayed Hashim, Muhammad Ali, Karthik Nandakumar, Mohammad Yaqub

Abstract

In personalized medicine, crucial intrinsic information is present in high-dimensional omics data, but it is difficult to capture because of the large number of molecular features and the small number of available samples. Different types of omics data reveal different aspects of a sample. Integrating and analyzing multi-omics data gives a broad view of tumours, which can improve clinical decision making. Omics data, mainly DNA methylation and gene expression profiles, are usually high-dimensional, with many molecular features. In recent years, variational autoencoders (VAEs) have been used extensively to embed image and text data into lower-dimensional latent spaces. In our project, we extend the idea of using a VAE model for low-dimensional latent space extraction with the self-supervised learning technique of feature subsetting. With VAEs, the key idea is to make the model learn meaningful representations from different types of omics data, which can then be used for downstream tasks such as cancer type classification. The main goals are to overcome the curse of dimensionality and to integrate methylation and expression data, combining information about different aspects of the same tissue samples and, ideally, extracting biologically relevant features. Our extension involves training the encoder and decoder to reconstruct the data from just a subset of it. By doing this, we force the model to encode the most important information in the latent representation. We also add an identity to each subset so that the model knows which subset is being fed into it during training and testing. We experimented with our approach and found that SubOmiEmbed produces results comparable to the baseline OmiEmbed with a much smaller network and using just a subset of the data. This work can be extended to integrate mutation-based genomic data as well.
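
The feature-subsetting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how subsets are chosen or how the subset identity is encoded, so the contiguous, near-equal split and the one-hot identity used here are assumptions, and the names `make_subsets`, `encoder_input` and `reconstruction_error` are hypothetical. The sketch only shows how an encoder input could be built from one subset plus its identity, while the reconstruction target stays the full feature vector.

```python
from typing import List, Sequence


def make_subsets(n_features: int, n_subsets: int) -> List[range]:
    """Split feature indices into contiguous, near-equal chunks (assumed scheme)."""
    base, extra = divmod(n_features, n_subsets)
    subsets, start = [], 0
    for k in range(n_subsets):
        # The first `extra` subsets take one additional feature each.
        size = base + (1 if k < extra else 0)
        subsets.append(range(start, start + size))
        start += size
    return subsets


def encoder_input(sample: Sequence[float], subsets: List[range], k: int) -> List[float]:
    """Features of subset k followed by a one-hot subset identity (assumed encoding)."""
    one_hot = [1.0 if j == k else 0.0 for j in range(len(subsets))]
    return [sample[i] for i in subsets[k]] + one_hot


def reconstruction_error(sample: Sequence[float], decoded: Sequence[float]) -> float:
    """Mean squared error against the FULL sample, not only the subset fed in."""
    return sum((a - b) ** 2 for a, b in zip(sample, decoded)) / len(sample)


# Toy sample with 7 features, split into 3 subsets.
sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
subsets = make_subsets(len(sample), 3)
x_in = encoder_input(sample, subsets, 1)
```

In a full model, `x_in` would be passed through the VAE encoder and decoder, and the decoder output would be scored with something like `reconstruction_error` against the complete `sample`, which is what pushes the latent code to carry information about features the encoder never saw.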
