Towards a robust out-of-the-box neural network model for genomic data

Paper title

Towards a robust out-of-the-box neural network model for genomic data

Authors

Zhaoyi Zhang, Songyang Cheng, Claudia Solis-Lemus

Abstract

The accurate prediction of biological features from genomic data is paramount for precision medicine and sustainable agriculture. For decades, neural network models have been widely popular in fields like computer vision, astrophysics and targeted marketing given their prediction accuracy and their robust performance under big data settings. Yet neural network models have not made a successful transition into the medical and biological world due to the ubiquitous characteristics of biological data such as modest sample sizes, sparsity, and extreme heterogeneity. Here, we investigate the robustness, generalization potential and prediction accuracy of widely used convolutional neural network and natural language processing models with a variety of heterogeneous genomic datasets. Mainly, recurrent neural network models outperform convolutional neural network models in terms of prediction accuracy, overfitting and transferability across the datasets under study. While the perspective of a robust out-of-the-box neural network model is out of reach, we identify certain model characteristics that translate well across datasets and could serve as a baseline model for translational researchers.

Paper title