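Restricted maximum-likelihood method for learning latent variance components in gene expression data with known and unknown confounders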

Paper title

Restricted maximum-likelihood method for learning latent variance components in gene expression data with known and unknown confounders

Authors

Muhammad Ammar Malik, Tom Michoel

Abstract

Random effect models are popular statistical models for detecting and correcting spurious sample correlations due to hidden confounders in genome-wide gene expression data. In applications where some confounding factors are known, estimating simultaneously the contribution of known and latent variance components in random effect models is a challenge that has so far relied on numerical gradient-based optimizers to maximize the likelihood function. This is unsatisfactory because the resulting solution is poorly characterized and the efficiency of the method may be suboptimal. Here we prove analytically that maximum-likelihood latent variables can always be chosen orthogonal to the known confounding factors, in other words, that maximum-likelihood latent variables explain sample covariances not already explained by known factors. Based on this result we propose a restricted maximum-likelihood method which estimates the latent variables by maximizing the likelihood on the restricted subspace orthogonal to the known confounding factors, and show that this reduces to probabilistic PCA on that subspace. The method then estimates the variance-covariance parameters by maximizing the remaining terms in the likelihood function given the latent variables, using a newly derived analytic solution for this problem. Compared to gradient-based optimizers, our method attains greater or equal likelihood values, can be computed using standard matrix operations, results in latent factors that don't overlap with any known factors, and has a runtime reduced by several orders of magnitude. Hence the restricted maximum-likelihood method facilitates the application of random effect modelling strategies for learning latent variance components to much larger gene expression datasets than possible with current methods.
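
The restricted step described in the abstract (probabilistic PCA on the subspace orthogonal to the known confounders) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `restricted_ppca`, the data layout (samples in rows, genes in columns, already centred), and the use of a full QR decomposition to obtain the orthogonal complement are all assumptions made for illustration. The sketch covers only the latent-variable step and leaves out the analytic estimate of the variance-covariance parameters of the known factors.

```python
import numpy as np


def restricted_ppca(Y, Z, p):
    """Illustrative sketch (hypothetical helper, not the paper's code).

    Y : (n, m) array, n samples x m genes, assumed already centred.
    Z : (n, d) array of known confounders, assumed full column rank.
    p : number of latent factors to estimate, with p < n - d.
    Returns X (n, p), latent factors orthogonal to Z, and sigma2,
    the residual variance estimate from probabilistic PCA.
    """
    n, m = Y.shape
    d = Z.shape[1]
    if not 0 < p < n - d:
        raise ValueError("need 0 < p < n - d")

    # Sample covariance between samples (n x n).
    C = (Y @ Y.T) / m

    # Full QR of Z: the first d columns of Q span the known confounders,
    # the remaining n - d columns span their orthogonal complement.
    Q, _ = np.linalg.qr(Z, mode="complete")
    V2 = Q[:, d:]

    # Restrict the covariance to the orthogonal complement.
    C22 = V2.T @ C @ V2

    # Probabilistic PCA on the restricted covariance: eigendecompose,
    # sort eigenvalues in descending order.
    evals, evecs = np.linalg.eigh(C22)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # Residual variance: mean of the discarded eigenvalues.
    sigma2 = evals[p:].mean()

    # Loadings: leading eigenvectors scaled by sqrt(eigenvalue - sigma2).
    scale = np.sqrt(np.clip(evals[:p] - sigma2, 0.0, None))
    W = evecs[:, :p] * scale

    # Map back to sample space; columns are orthogonal to Z by construction.
    X = V2 @ W
    return X, sigma2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z_demo = rng.normal(size=(30, 3))    # synthetic known confounders
    Y_demo = rng.normal(size=(30, 200))  # synthetic expression data
    Y_demo -= Y_demo.mean(axis=0, keepdims=True)
    X_demo, s2_demo = restricted_ppca(Y_demo, Z_demo, p=2)
    print("max |Z^T X| =", np.abs(Z_demo.T @ X_demo).max())
```

Because every step is a standard matrix operation (one QR decomposition and one symmetric eigendecomposition), no iterative optimizer is involved, and the estimated latent factors have zero overlap with the known factors up to floating-point error.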
