Paper Title
Resampling Sensitivity of High-Dimensional PCA
Paper Authors
Paper Abstract
The study of the stability and sensitivity of statistical methods or algorithms with respect to their data is an important problem in machine learning and statistics. The performance of an algorithm under resampling of the data is a fundamental way to measure its stability and is closely related to the generalization or privacy of the algorithm. In this paper, we study the resampling sensitivity of principal component analysis (PCA). Given an $n \times p$ random matrix $\mathbf{X}$, let $\mathbf{X}^{[k]}$ be the matrix obtained from $\mathbf{X}$ by resampling $k$ randomly chosen entries of $\mathbf{X}$. Let $\mathbf{v}$ and $\mathbf{v}^{[k]}$ denote the principal components of $\mathbf{X}$ and $\mathbf{X}^{[k]}$. In the proportional growth regime $p/n \to \xi \in (0,1]$, we establish the sharp threshold for the sensitivity/stability transition of PCA. When $k \gg n^{5/3}$, the principal components $\mathbf{v}$ and $\mathbf{v}^{[k]}$ are asymptotically orthogonal. On the other hand, when $k \ll n^{5/3}$, the principal components $\mathbf{v}$ and $\mathbf{v}^{[k]}$ are asymptotically collinear. In other words, we show that PCA is sensitive to the input data in the sense that resampling even a negligible portion of the input may completely change the output.
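The resampling procedure described in the abstract can be illustrated with a small numerical sketch. This is not the authors' code. It assumes i.i.d. standard Gaussian entries (the abstract only says "random matrix"), takes the principal component to be the leading right singular vector of $\mathbf{X}$, and uses the hypothetical helper names `top_principal_component`, `resample_entries`, and `pc_overlap`. The quantity returned is $|\langle \mathbf{v}, \mathbf{v}^{[k]} \rangle|$, which is near 1 when the two directions are collinear and near 0 when they are orthogonal.

```python
import numpy as np


def top_principal_component(X):
    """Leading right singular vector of X, i.e. the top eigenvector of X^T X."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]


def resample_entries(X, k, rng):
    """Return a copy of X with k uniformly chosen entries redrawn independently.

    Assumption: entries are standard Gaussian, so the fresh draws are too.
    """
    n, p = X.shape
    Xk = X.copy()
    # Pick k distinct positions in the flattened n*p matrix.
    idx = rng.choice(n * p, size=k, replace=False)
    Xk.flat[idx] = rng.standard_normal(k)
    return Xk


def pc_overlap(n, p, k, seed=0):
    """Absolute inner product between the top PCs of X and its resampled copy."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    Xk = resample_entries(X, k, rng)
    v = top_principal_component(X)
    vk = top_principal_component(Xk)
    # Singular vectors are defined up to sign, so compare absolute values.
    return abs(float(v @ vk))


if __name__ == "__main__":
    n, p = 400, 200  # p/n = 0.5, inside the regime (0, 1]
    threshold = n ** (5 / 3)
    for k in (int(threshold / 50), int(threshold), min(n * p, int(20 * threshold))):
        print(k, pc_overlap(n, p, k))
```

The threshold $n^{5/3}$ is an asymptotic statement, so at moderate $n$ the transition in the printed overlaps is gradual and varies with the seed; averaging over several seeds and increasing $n$ should give a clearer picture.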