Paper title
Statistical exploration of the Manifold Hypothesis
Paper authors
Abstract
The Manifold Hypothesis is a widely accepted tenet of Machine Learning which asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold embedded in high-dimensional space. This phenomenon is observed empirically in many real-world situations, has led to the development of a wide range of statistical methods over the last few decades, and has been suggested as a key factor in the success of modern AI technologies. We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model -- the Latent Metric Model -- via elementary concepts such as latent variables, correlation and stationarity. This establishes a general statistical explanation for why the Manifold Hypothesis seems to hold in so many situations. Informed by the Latent Metric Model, we derive procedures to discover and interpret the geometry of high-dimensional data and to explore hypotheses about the data-generating mechanism. These procedures operate under minimal assumptions and make use of well-known graph-analytic algorithms.
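As a rough illustration of the phenomenon the abstract describes, the toy sketch below simulates nominally 200-dimensional data driven by a single latent variable on a circle and checks how much variance a few principal directions capture. This is not the paper's Latent Metric Model or its procedures; the function names (`simulate_latent_data`, `explained_variance_ratio`), the Fourier-type feature maps, and all parameter values are assumptions chosen purely for illustration.

```python
import numpy as np


def simulate_latent_data(n=500, p=200, noise=0.1, seed=0):
    """Toy generator (an assumption, not the paper's model).

    Each of the p observed coordinates is a smooth random function of a
    one-dimensional latent variable z on a circle, plus independent noise.
    """
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 2.0 * np.pi, size=n)  # latent positions
    # Four smooth basis functions of the latent variable, shape (n, 4).
    basis = np.column_stack(
        [np.cos(z), np.sin(z), np.cos(2 * z), np.sin(2 * z)]
    )
    coeffs = rng.normal(size=(4, p))  # random loading of each coordinate
    x = basis @ coeffs + noise * rng.normal(size=(n, p))
    return z, x


def explained_variance_ratio(x):
    """Fraction of total variance along each principal direction."""
    centered = x - x.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


z, x = simulate_latent_data()
ratio = explained_variance_ratio(x)
top4 = ratio[:4].sum()  # variance captured by the first four directions
print(f"ambient dimension: {x.shape[1]}")
print(f"variance captured by 4 directions: {top4:.3f}")
```

In this toy setting the points trace a noisy closed curve that sits inside a four-dimensional subspace of the 200-dimensional ambient space, so nearly all of the variance falls on the first four principal directions. Real data need not behave this cleanly, and the sketch makes no claim about how the paper itself detects or interprets such structure.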