Paper Title

Improving Sample and Feature Selection with Principal Covariates Regression

Authors

Cersonsky, Rose K., Helfrecht, Benjamin A., Engel, Edgar A., Ceriotti, Michele

Abstract

Selecting the most relevant features and samples out of a large set of candidates is a task that occurs very often in the context of automated data analysis, where it can be used to improve the computational performance, and often also the transferability, of a model. Here we focus on two popular sub-selection schemes which have been applied to this end: CUR decomposition, which is based on a low-rank approximation of the feature matrix, and Farthest Point Sampling (FPS), which relies on the iterative identification of the most diverse samples and most discriminating features. We modify these unsupervised approaches, incorporating a supervised component in the same spirit as the Principal Covariates Regression (PCovR) method. We show that incorporating target information yields selections that perform better in supervised tasks, which we demonstrate with ridge regression, kernel ridge regression, and sparse kernel regression. We also show that incorporating aspects of simple supervised learning models can improve the accuracy of more complex models, such as feed-forward neural networks. We present adjustments to minimize the impact that any sub-selection may incur when performing unsupervised tasks. We demonstrate the significant improvements associated with the use of PCov-CUR and PCov-FPS selections in applications to chemistry and materials science, typically reducing by a factor of two the number of features and samples required to achieve a given level of regression accuracy.
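To illustrate the unsupervised baseline that the paper builds on, the following is a minimal sketch of greedy Farthest Point Sampling over rows (samples) of a feature matrix, using Euclidean distances. The function name, the random choice of the starting sample, and the use of NumPy are illustrative assumptions, not the authors' implementation; the PCov-FPS variant described in the abstract would additionally mix in target information when computing distances.

```python
import numpy as np

def farthest_point_sampling(X, n_select, seed=0):
    """Greedy FPS: repeatedly pick the sample farthest (in Euclidean
    distance) from the set of samples already selected."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting sample
    # distance from every point to its nearest selected point
    dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < n_select:
        idx = int(np.argmax(dist))     # most distant remaining point
        selected.append(idx)
        # update nearest-selected distances with the new pick
        dist = np.minimum(dist, np.linalg.norm(X - X[idx], axis=1))
    return selected
```

The same greedy loop selects features instead of samples when applied to the columns of `X` (i.e., to `X.T`), which is how FPS is used for feature sub-selection.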
