Paper title
Detecting approximate replicate components of a high-dimensional random vector with latent structure
Paper authors
Paper abstract
High-dimensional feature vectors are likely to contain sets of measurements that are approximate replicates of one another. In complex applications, or under automated data collection, these feature sets are not known a priori and need to be determined. This work proposes a class of latent factor models on the observed high-dimensional random vector $X \in \mathbb{R}^p$ for defining, identifying and estimating the index set of its approximate replicate components. The model class is parametrized by a $p \times K$ loading matrix $A$ that contains a hidden sub-matrix whose rows can be partitioned into groups of parallel vectors. Under this model class, a set of approximate replicate components of $X$ corresponds to a set of parallel rows in $A$: these entries of $X$ are, up to scale and additive error, the same linear combination of the $K$ latent factors; the value of $K$ is itself unknown. The problem of finding approximate replicates in $X$ thus reduces to identifying, and estimating, the location of the hidden sub-matrix within $A$ and the partition of its row index set $H$. Both $H$ and its partition can be fully characterized in terms of a new family of criteria based on the correlation matrix of $X$, and their identifiability, as well as that of the unknown latent dimension $K$, is obtained as a consequence. The constructive nature of the identifiability arguments enables computationally efficient procedures with consistency guarantees. When $A$ has the errors-in-variables parametrization, the difficulty of the problem is elevated: the task becomes that of separating groups of parallel rows that are proportional to canonical basis vectors from other, dense parallel rows in $A$. This is accomplished under a scale assumption, via a principled way of selecting the target row indices, guided by the successive maximization of Schur complements of appropriate covariance matrices.
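For intuition only, the following is a minimal numerical sketch, not the paper's estimation procedure: it assumes a factor model $X = AZ + E$ and illustrates how a group of parallel rows in the loading matrix $A$ makes the corresponding components of $X$ approximate replicates, visible as near-one absolute entries in the correlation matrix of $X$. All dimensions, index choices and noise levels below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
p, K, n = 8, 3, 5000          # features, latent factors, samples (illustrative)

# Loading matrix A with a hidden group of parallel rows (indices 0, 1, 2):
# each is a scalar multiple of the same base row, so these components of X
# are the same linear combination of the K factors, up to scale and noise.
base = rng.normal(size=K)
A = rng.normal(size=(p, K))
A[0], A[1], A[2] = 1.0 * base, -0.7 * base, 2.5 * base

Z = rng.normal(size=(K, n))               # latent factors
E = 0.1 * rng.normal(size=(p, n))         # additive error
X = A @ Z + E                             # observed high-dimensional data

R = np.corrcoef(X)                        # sample correlation matrix of X
print(np.round(np.abs(R[:3, :3]), 2))     # |corr| close to 1 within the group
print(np.round(np.abs(R[:3, 3:]), 2))     # generally away from 1 outside it

The paper's criteria are based on the population correlation matrix rather than on inspecting pairwise correlations directly, but the simulation shows the basic phenomenon the criteria exploit: parallel rows of $A$ translate into components of $X$ that are, up to scale and additive error, replicates of one another.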