Paper Title

Dirichlet-tree multinomial mixtures for clustering microbiome compositions

Authors

Mao, Jialiang; Ma, Li

Abstract

Studying the human microbiome has gained substantial interest in recent years, and a common task in the analysis of these data is to cluster microbiome compositions into subtypes. This subdivision of samples into subgroups serves as an intermediary step in achieving personalized diagnosis and treatment. In applying existing clustering methods to modern microbiome studies, including the American Gut Project (AGP) data, we found that this seemingly standard task is very challenging in the microbiome composition context due to several key features of such data. Standard distance-based clustering algorithms generally do not produce reliable results because they do not take into account the heterogeneity of cross-sample variability among the bacterial taxa, while existing model-based approaches are not flexible enough to distinguish complex within-cluster variation from cross-cluster variation. Direct applications of such methods generally lead to overly dispersed clusters in the AGP data, and this phenomenon is common in other microbiome data. To overcome these challenges, we introduce Dirichlet-tree multinomial mixtures (DTMM) as a Bayesian generative model for clustering amplicon sequencing data in microbiome studies. DTMM models the microbiome population with a mixture of Dirichlet-tree kernels that use the phylogenetic tree to offer a more flexible covariance structure for characterizing within-cluster variation, and it provides a means of identifying a subset of signature taxa that distinguish the clusters. We perform extensive simulation studies to evaluate the performance of DTMM and compare it to state-of-the-art model-based and distance-based clustering methods in the microbiome context. Finally, we report a case study on the fecal data from the AGP to identify compositional clusters among individuals with inflammatory bowel disease and diabetes.
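To make the phrase "mixture of Dirichlet-tree kernels" more concrete, the following is a minimal generative sketch, not the authors' implementation. It assumes a hypothetical four-taxon binary tree (`TREE`), made-up Dirichlet parameters, and hypothetical helper names (`sample_dtm`, `sample_mixture`). Each sample first draws a cluster label, then splits its total read count down the tree, with the branching probabilities at every internal node drawn from that cluster's Dirichlet distribution. Inference (the Bayesian clustering itself) is not shown.

```python
import numpy as np

# Hypothetical toy phylogeny over 4 taxa (an assumption for illustration):
# root -> (A, B); A -> (taxon 0, taxon 1); B -> (taxon 2, taxon 3).
# String children are internal nodes; integer children are leaf taxa.
TREE = {"root": ["A", "B"], "A": [0, 1], "B": [2, 3]}


def sample_dtm(total, alphas, rng, tree=TREE):
    """Draw one count vector from a Dirichlet-tree multinomial.

    alphas maps each internal node to the Dirichlet parameters
    of its branching probabilities over that node's children.
    """
    n_taxa = sum(1 for kids in tree.values() for c in kids if not isinstance(c, str))
    counts = np.zeros(n_taxa, dtype=int)
    stack = [("root", int(total))]
    while stack:
        node, n = stack.pop()
        # Node-specific branching probabilities, then split the reads.
        probs = rng.dirichlet(alphas[node])
        split = rng.multinomial(n, probs)
        for child, m in zip(tree[node], split):
            if isinstance(child, str):
                stack.append((child, int(m)))  # pass reads down to a subtree
            else:
                counts[child] += int(m)        # reads land on a leaf taxon
    return counts


def sample_mixture(n_samples, total, weights, cluster_alphas, seed=0):
    """Draw cluster labels, then one count vector per sample."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(weights), size=n_samples, p=weights)
    counts = np.array([sample_dtm(total, cluster_alphas[k], rng) for k in labels])
    return labels, counts


# Made-up parameters: the two clusters differ only at the root split.
cluster_alphas = [
    {"root": [8.0, 2.0], "A": [5.0, 5.0], "B": [1.0, 1.0]},
    {"root": [2.0, 8.0], "A": [5.0, 5.0], "B": [1.0, 1.0]},
]
labels, counts = sample_mixture(6, 1000, [0.5, 0.5], cluster_alphas)
```

Because every internal node carries its own Dirichlet parameters, the dispersion can differ from one part of the tree to another, which is one way to read the abstract's "more flexible covariance structure". In this toy setup the clusters differ at a single node only, loosely echoing the idea that a subset of taxa distinguishes the clusters.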
