Paper Title

Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays

Paper Authors

Łukasz Kidziński, Francis K. C. Hui, David I. Warton, Trevor Hastie

Paper Abstract

Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized linear latent variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method to a dataset of 48,000 observational units, each with over 2,000 observed species, and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.
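To make the fitting idea in the abstract concrete, the sketch below works through a toy version of the general approach: a penalized (quasi-)likelihood for a Poisson GLLVM with a log link, optimized by alternating Fisher-scoring (Newton) updates of the latent scores and loadings. This is an illustrative sketch under assumed settings, not the authors' released implementation; the problem sizes, the `ridge` penalty value, and all function and variable names are invented for the example.

```python
# Toy penalized quasi-likelihood fit of a Poisson GLLVM by alternating
# Fisher-scoring updates. Illustrative only; not the authors' algorithm or package.
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 200, 50, 2           # observational units, responses, latent factors
ridge = 1.0                    # quadratic penalty on scores and loadings (assumed value)

# Simulate Poisson responses from a rank-d linear predictor (log link).
U_true = rng.normal(size=(n, d))
V_true = rng.normal(scale=0.5, size=(m, d))
Y = rng.poisson(np.exp(U_true @ V_true.T))

# Initialize latent scores U (n x d) and loadings V (m x d) near zero.
U = rng.normal(scale=0.1, size=(n, d))
V = rng.normal(scale=0.1, size=(m, d))

def fisher_step(Y_block, X, B, ridge):
    """One Fisher-scoring (Newton) update of every row of B, holding X fixed.

    Rows of B parameterize the linear predictor eta = X @ B.T of a Poisson
    log-link model, so for row b_j:
        gradient    = X.T @ (y_j - mu_j) - ridge * b_j
        information = X.T @ diag(mu_j) @ X + ridge * I
    """
    mu = np.exp(X @ B.T)                                   # fitted means
    for j in range(B.shape[0]):
        grad = X.T @ (Y_block[:, j] - mu[:, j]) - ridge * B[j]
        info = X.T @ (mu[:, j][:, None] * X) + ridge * np.eye(X.shape[1])
        B[j] += np.linalg.solve(info, grad)
    return B

for it in range(30):
    V = fisher_step(Y, U, V, ridge)     # update loadings given current scores
    U = fisher_step(Y.T, V, U, ridge)   # update scores given current loadings
    eta = U @ V.T
    pql = np.sum(Y * eta - np.exp(eta)) - 0.5 * ridge * (np.sum(U**2) + np.sum(V**2))
    if it % 10 == 0:
        print(f"iter {it:2d}  penalized quasi-likelihood: {pql:.1f}")
```

Because the log link is canonical for the Poisson family, the Fisher information coincides with the observed Hessian, so each row update above is an exact Newton step on a concave, ridge-regularized subproblem; the paper's method is built on this kind of cheap per-row update, which is what lets it scale to matrices with thousands of units and responses.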
