Paper title
A probabilistic analysis of shotgun sequencing for metagenomics
Paper authors
Paper abstract
Genome sequencing is the basis for many modern biological and medicinal studies. With recent technological advances, metagenomics has become a problem of interest. This problem entails the analysis and reconstruction of multiple DNA sequences from different sources. Shotgun genome sequencing works by breaking up long DNA sequences into shorter segments called reads. Given this collection of reads, one would like to reconstruct the original collection of DNA sequences. For experimental design in metagenomics, it is important to understand how the minimal read length necessary for reliable reconstruction depends on the number and characteristics of the genomes involved. Utilizing simple probabilistic models for each DNA sequence, we analyze the identifiability of collections of M genomes of length N in an asymptotic regime in which N tends to infinity and M may grow with N. Our first main result provides a threshold in terms of M and N so that if the read length exceeds the threshold, then a simple greedy algorithm successfully reconstructs the full collection of genomes with probability tending to one. Our second main result establishes a lower threshold in terms of M and N such that if the read length is shorter than the threshold, then reconstruction of the full collection of genomes is impossible with probability tending to one.
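The read-and-reconstruct setup described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm or model: the abstract does not specify the read model or the greedy rule, so everything here is an assumption. The sketch assumes linear genomes, one noiseless read starting at every position, and a generic greedy rule that merges the pair of contigs with the largest suffix-prefix overlap until the best overlap falls below a hypothetical `min_overlap` cutoff. The names `overlap`, `all_reads`, and `greedy_assemble` are illustrative only.

```python
import random


def overlap(a, b):
    """Return the length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_reads(genome, read_len):
    """Every length-read_len substring of a linear genome (assumed dense, noiseless coverage)."""
    return [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]


def greedy_assemble(reads, min_overlap):
    """Repeatedly merge the two contigs with the largest suffix-prefix overlap.

    Merging stops once the best remaining overlap is below min_overlap, so
    reads that do not overlap strongly stay in separate contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k < min_overlap:
            break
        # Append the non-overlapping tail of the second contig to the first.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy collection of M = 2 genomes of length N = 8 (hand-picked, not random).
genomes = ["ACGTTGCA", "GGATCCAT"]
read_len = 4
reads = [r for g in genomes for r in all_reads(g, read_len)]
random.Random(0).shuffle(reads)  # the reads arrive in no particular order

contigs = greedy_assemble(reads, min_overlap=read_len - 1)
print(sorted(contigs))
```

In this toy collection no 3-letter substring occurs twice, within or across the two genomes, so every overlap of length 3 is a true overlap and the greedy merge recovers both genomes exactly. With shorter reads, repeated substrings can make overlaps ambiguous, which is the kind of failure the abstract's lower threshold concerns.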