Paper Title

Supporting supervised learning in fungal Biosynthetic Gene Cluster discovery: new benchmark datasets

Authors

Hayda Almeida, Adrian Tsang, Abdoulaye Baniré Diallo

Abstract

Fungal Biosynthetic Gene Clusters (BGCs) of secondary metabolites are clusters of genes capable of producing natural products, compounds that play an important role in the production of a wide variety of bioactive compounds, including antibiotics and pharmaceuticals. Identifying BGCs can lead to the discovery of novel natural products to benefit human health. Previous work has focused on developing automatic tools to support BGC discovery in plants, fungi, and bacteria. Data-driven methods, as well as probabilistic and supervised learning methods, have been explored in identifying BGCs. Most methods applied to identify fungal BGCs were data-driven and limited in scope. Supervised learning methods have been shown to perform well at identifying BGCs in bacteria, and could be well suited to perform the same task in fungi. However, labeled data instances are needed to perform supervised learning. Openly accessible BGC databases contain only a very small portion of previously curated fungal BGCs. Making new fungal BGC datasets available could motivate the development of supervised learning methods for fungal BGCs and potentially improve prediction performance compared to data-driven methods. In this work, we propose new publicly available fungal BGC datasets to support the BGC discovery task using supervised learning. These datasets are prepared to perform binary classification and predict candidate BGC regions in fungal genomes. In addition, we analyse the performance of a well-supported supervised learning tool developed to predict BGCs.
