Paper title
From differential abundance to mtGWAS: accurate and scalable methodology for metabolomics data with non-ignorable missing observations and latent factors
Authors
Abstract
Metabolomics is the high-throughput study of small molecule metabolites. Besides offering novel biological insights, these data contain unique statistical challenges, the most glaring of which is the many non-ignorable missing metabolite observations. To address this issue, nearly all analysis pipelines first impute missing observations, and subsequently perform analyses with methods designed for complete data. While clearly erroneous, these pipelines provide key practical advantages not present in existing statistically rigorous methods, including using both observed and missing data to increase power, fast computation to support phenome- and genome-wide analyses, and streamlined estimates for factor models. To bridge this gap between statistical fidelity and practical utility, we developed MS-NIMBLE, a statistically rigorous and powerful suite of methods that offers all the practical benefits of imputation pipelines to perform phenome-wide differential abundance analyses, metabolite genome-wide association studies (mtGWAS), and factor analysis with non-ignorable missing data. Critically, we tailor MS-NIMBLE to perform differential abundance and mtGWAS in the presence of latent factors, which reduces biases and improves power. In addition to proving its statistical and computational efficiency, we demonstrate its superior performance using three real metabolomic datasets.
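To make the abstract's concern about imputation concrete, the following toy simulation sketches how non-ignorable missingness can distort a differential abundance estimate. It is not MS-NIMBLE and does not reproduce any analysis from the paper. The logistic missingness mechanism, the effect size, the minimum-value imputation rule, and all function names are assumptions chosen purely for illustration.

```python
import numpy as np


def simulate(n=20000, shift=1.0, seed=0):
    """Simulate one metabolite's log-abundance in two groups (assumed toy model)."""
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n)                  # 0 = control, 1 = case
    y = rng.normal(loc=shift * group, scale=1.0, size=n)
    # Assumed non-ignorable mechanism: the lower the true value,
    # the more likely it is to be missing.
    p_miss = 1.0 / (1.0 + np.exp(2.0 * y))
    observed = rng.random(n) > p_miss
    return group, y, observed


def group_difference(values, group):
    """Difference in mean abundance between group 1 and group 0."""
    return values[group == 1].mean() - values[group == 0].mean()


group, y, observed = simulate()

# Estimate from the full data, which is unavailable in practice.
true_diff = group_difference(y, group)

# Complete-case analysis: drop missing observations.
cc_diff = group_difference(y[observed], group[observed])

# Naive imputation: replace missing values with the smallest observed value.
y_imp = np.where(observed, y, y[observed].min())
imp_diff = group_difference(y_imp, group)

print(f"full data:      {true_diff:.3f}")
print(f"complete case:  {cc_diff:.3f}")
print(f"min-imputation: {imp_diff:.3f}")
```

Under these assumptions, neither the complete-case nor the minimum-imputed estimate recovers the full-data group difference, because missingness depends on the unobserved abundance itself.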