Statistical Quantile Learning for Large, Nonlinear, and Additive Latent Variable Models

Paper Title

Statistical Quantile Learning for Large, Nonlinear, and Additive Latent Variable Models

Authors

Bodelet, Julien; Blanc, Guillaume; Shan, Jiajun; Terrera, Graciela Muniz; Chen, Oliver Y.

Abstract

Studies of large-scale, high-dimensional data in fields such as genomics and neuroscience have injected new insights into science. Yet, despite these advances, such studies confront several challenges, often simultaneously: lack of interpretability, nonlinearity, slow computation, inconsistency and uncertain convergence, and small sample sizes relative to high feature dimensions. Here, we propose a relatively simple, scalable, and consistent nonlinear dimension reduction method that can potentially address these issues in unsupervised settings. We call this method Statistical Quantile Learning (SQL) because, methodologically, it leverages a quantile approximation of the latent variables together with standard nonparametric techniques (sieve or penalized methods). We show that estimating the model reduces to a convex assignment matching problem; we derive its asymptotic properties; and we show that the model is identifiable under few conditions. Compared to its linear competitors, SQL explains more variance, yields better separation and explanation, and delivers more accurate outcome prediction. Compared to its nonlinear competitors, SQL shows considerable advantages in interpretability, ease of use, and computation in large-dimensional settings. Finally, we apply SQL to high-dimensional gene expression data (consisting of 20,263 genes from 801 subjects), where the proposed method identified latent factors predictive of five cancer types. The SQL package is available at https://github.com/jbodelet/SQL.
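
To make the phrase "assignment matching problem" concrete, the following is a minimal sketch of one plausible reading of that step, not the authors' implementation and not the API of the SQL package. It assumes a one-factor additive model x_ij = g_j(z_i) with known component functions g_j, and it assumes the latent values are restricted to a fixed grid of quantile levels, so that estimating them amounts to matching each observation to one grid point. The function names, the toy component functions, and the brute-force solver are all hypothetical choices made for illustration.

```python
from itertools import permutations
import math


def quantile_grid(n):
    """Evenly spaced quantile levels in (0, 1), standing in for the latent values."""
    return [(k + 1) / (n + 1) for k in range(n)]


def assignment_cost(X, funcs, grid):
    """cost[i][k]: squared error when observation i is explained by grid point k."""
    return [
        [sum((x - g(q)) ** 2 for x, g in zip(row, funcs)) for q in grid]
        for row in X
    ]


def best_assignment(cost):
    """Solve the linear assignment problem by brute force (only viable for tiny n)."""
    n = len(cost)
    return min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )


# Toy data: three hypothetical component functions and five noiseless observations.
funcs = [math.sin, lambda z: z ** 2, math.exp]
grid = quantile_grid(5)
true_perm = (3, 0, 4, 1, 2)  # observation i truly sits at grid[true_perm[i]]
X = [[g(grid[k]) for g in funcs] for k in true_perm]

perm = best_assignment(assignment_cost(X, funcs, grid))
z_hat = [grid[k] for k in perm]  # estimated latent value for each observation
```

Brute force is used here only to keep the sketch dependency-free. For realistic sample sizes a polynomial-time solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) would be the natural replacement, and in a full procedure this matching step would presumably alternate with re-estimating the component functions.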
