Paper Title
Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection
Authors
Abstract
In data mining, handling high-dimensional data is an unavoidable problem. Unsupervised feature selection has attracted increasing attention because it does not rely on labels. The performance of spectral-based unsupervised methods depends on the quality of the constructed similarity matrix, which is used to depict the intrinsic structure of the data. However, real-world data contain many noisy samples and features, so a similarity matrix constructed from the original data cannot be completely reliable. Worse still, the size of the similarity matrix grows rapidly as the number of samples increases, which significantly increases the computational cost. Inspired by principal component analysis, we propose a simple and efficient unsupervised feature selection method that combines reconstruction error with $l_{2,p}$-norm regularization. The projection matrix, which is used for feature selection, is learned by minimizing the reconstruction error under the sparsity constraint. Then, we present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically. Finally, extensive experiments on real-world data sets demonstrate the effectiveness of our proposed method.
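
The abstract does not state the objective function or the optimization algorithm, so the following is only a minimal sketch of the two ingredients it names: a PCA-style reconstruction error and an $l_{2,p}$-norm penalty on the rows of a projection matrix. The specific form `||X - W W^T X||_F^2 + lam * sum_i ||w_i||_2^p`, the function names, the values of `lam` and `p`, and the use of a plain PCA projection as a stand-in for the learned sparse matrix are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def l2p_norm_p(W, p):
    """Sum of row-wise l2 norms raised to the power p (the l_{2,p} norm to the p-th power)."""
    row_norms = np.linalg.norm(W, axis=1)
    return float(np.sum(row_norms ** p))

def reconstruction_error(X, W):
    """Squared Frobenius error ||X - W W^T X||_F^2 for data X of shape (features, samples)."""
    residual = X - W @ (W.T @ X)
    return float(np.sum(residual ** 2))

def rank_features(W):
    """Feature indices ordered by descending l2 norm of the corresponding row of W."""
    return np.argsort(-np.linalg.norm(W, axis=1))

# Toy data: 6 features, 50 samples; the first two features carry most of the variance.
rng = np.random.default_rng(0)
d, n, k = 6, 50, 2
X = rng.normal(size=(d, n))
X[:2] *= 10.0
X -= X.mean(axis=1, keepdims=True)

# Stand-in projection: ordinary PCA (top-k left singular vectors), NOT a sparse solution.
U, _, _ = np.linalg.svd(X, full_matrices=False)
W = U[:, :k]

lam, p = 0.1, 0.5  # hypothetical regularization weight and norm parameter
objective = reconstruction_error(X, W) + lam * l2p_norm_p(W, p)
top_features = rank_features(W)[:k]
```

Under this assumed formulation, rows of `W` with a small $l_2$ norm contribute little to the projection, so ranking features by row norm is one natural way to read a feature selection off the projection matrix. A solver for the regularized problem would replace the plain PCA step above.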