Paper Title
SimpleChrome: Encoding of Combinatorial Effects for Predicting Gene Expression
Authors
Abstract
Due to recent breakthroughs in DNA sequencing technology, genomics data sets have become ubiquitous. The emergence of large-scale data sets provides great opportunities for a better understanding of genomics, especially gene regulation. Although each cell in the human body contains the same DNA information, gene expression controls the function of each cell by turning genes on or off, to a degree known as the gene expression level. Two important factors control the expression level of each gene: (1) Gene regulation, such as histone modifications, can directly regulate gene expression. (2) Neighboring genes that are functionally related to or interact with each other can also affect gene expression levels. Previous efforts have tried to address the former using attention-based models. However, addressing the latter requires incorporating all potentially related gene information into the model. Though modern machine learning and deep learning models have been able to capture gene expression signals when applied to moderately sized data, they have struggled to recover the underlying signals because of the data's higher dimensionality. To remedy this issue, we present SimpleChrome, a deep learning model that learns the latent histone modification representations of genes. The features learned by the model allow us to better understand the combinatorial effects of cross-gene interactions and direct gene regulation on target gene expression. The results of this paper show outstanding improvements in the predictive capabilities of downstream models and greatly relax the need for a large data set to learn a robust, generalized neural network. These results have immediate downstream effects in epigenomics research and drug development.