Title

Chromatic Learning for Sparse Datasets

Authors

Vladimir Feinberg, Peter Bailis

Abstract

Learning over sparse, high-dimensional data frequently necessitates the use of specialized methods such as the hashing trick. In this work, we design a highly scalable alternative approach that leverages the low degree of feature co-occurrences present in many practical settings. This approach, which we call Chromatic Learning (CL), obtains a low-dimensional dense feature representation by performing graph coloring over the co-occurrence graph of features---an approach previously used as a runtime performance optimization for GBDT training. This color-based dense representation can be combined with additional dense categorical encoding approaches, e.g., submodular feature compression, to further reduce dimensionality. CL exhibits linear parallelizability and consumes memory linear in the size of the co-occurrence graph. By leveraging the structural properties of the co-occurrence graph, CL can compress sparse datasets, such as KDD Cup 2012, that contain over 50M features down to 1024, using an order of magnitude fewer features than frequency-based truncation and the hashing trick while maintaining the same test error for linear models. This compression further enables the use of deep networks in this wide, sparse setting, where CL similarly has favorable performance compared to existing baselines for budgeted input dimension.
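To make the coloring step concrete, below is a minimal Python sketch of the core idea, under the assumption that each example is given as a list of active categorical feature ids. Features that co-occur in some example are made adjacent in a graph, a greedy coloring assigns adjacent features distinct colors, and each example is then encoded densely with one slot per color. The function names (build_cooccurrence_graph, color_features, densify) are illustrative and not taken from the paper's implementation.

from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(rows):
    """Adjacency sets: two features are adjacent iff they co-occur in a row."""
    adj = defaultdict(set)
    for row in rows:
        for u, v in combinations(set(row), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def color_features(adj):
    """Greedy coloring (highest degree first): co-occurring features never share a color."""
    color = {}
    for u in sorted(adj, key=lambda f: len(adj[f]), reverse=True):
        taken = {color[v] for v in adj[u] if v in color}
        c = 0
        while c in taken:
            c += 1
        color[u] = c
    return color

def densify(row, color, num_colors):
    """Map a sparse row to a dense vector with one slot per color.

    Because co-occurring features get distinct colors, at most one active
    feature lands in each slot; we store its id shifted by 1 (0 = absent).
    Features that never co-occur with anything are skipped in this sketch.
    """
    dense = [0] * num_colors
    for f in row:
        if f in color:
            dense[color[f]] = f + 1
    return dense

rows = [[0, 1], [1, 2], [2, 3], [0, 3]]
adj = build_cooccurrence_graph(rows)
color = color_features(adj)
k = max(color.values()) + 1
print([densify(r, color, k) for r in rows])  # 4 sparse features -> k dense columns

Since no two co-occurring features share a color, each slot holds at most one active feature id, which is what yields the low-dimensional dense representation that downstream categorical encoders (e.g., submodular feature compression, as mentioned in the abstract) can compress further.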
