Title
Modeling Text with Decision Forests using Categorical-Set Splits
Authors
Abstract
Decision forest algorithms typically model data by learning a binary tree structure recursively where every node splits the feature space into two sub-regions, sending examples into the left or right branch as a result. In axis-aligned decision forests, the "decision" to route an input example is the result of the evaluation of a condition on a single dimension in the feature space. Such conditions are learned using efficient, often greedy algorithms that optimize a local loss function. For example, a node's condition may be a threshold function applied to a numerical feature, and its parameter may be learned by sweeping over the set of values available at that node and choosing a threshold that maximizes some measure of purity. Crucially, whether an algorithm exists to learn and evaluate conditions for a feature type determines whether a decision forest algorithm can model that feature type at all. For example, decision forests today cannot consume textual features directly -- such features must be transformed to summary statistics instead. In this work, we set out to bridge that gap. We define a condition that is specific to categorical-set features -- defined as an unordered set of categorical variables -- and present an algorithm to learn it, thereby equipping decision forests with the ability to directly model text, albeit without preserving sequential order. Our algorithm is efficient during training and the resulting conditions are fast to evaluate with our extension of the QuickScorer inference algorithm. Experiments on benchmark text classification datasets demonstrate the utility and effectiveness of our proposal.
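The abstract describes two kinds of node conditions: a threshold on a numerical feature, learned by sweeping candidate values and maximizing a purity measure, and a categorical-set condition that tests an unordered set of items. A minimal sketch of both ideas follows; the function names (`gini`, `best_threshold`, `categorical_set_condition`) and the use of Gini impurity are illustrative assumptions, not the paper's actual implementation or learning algorithm.

```python
# Illustrative sketch, not the paper's implementation.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_threshold(values, labels):
    """Learn a numerical split by sweeping midpoints between sorted
    feature values and keeping the threshold with the largest
    impurity decrease (the 'sweep and maximize purity' step)."""
    parent = gini(labels)
    n = len(labels)
    best_gain, best_t = 0.0, None
    order = sorted(range(n), key=lambda i: values[i])
    for k in range(1, n):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # no threshold separates identical values
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        gain = (parent
                - (len(left) / n) * gini(left)
                - (len(right) / n) * gini(right))
        if gain > best_gain:
            best_gain, best_t = gain, (lo + hi) / 2.0
    return best_t, best_gain

def categorical_set_condition(example_tokens, positive_items):
    """The general shape of a categorical-set condition: route an
    example by testing whether its unordered token set intersects a
    learned item set (learning that set is not shown here)."""
    return not example_tokens.isdisjoint(positive_items)
```

For example, `best_threshold([1, 2, 10, 11], [0, 0, 1, 1])` returns the threshold `6.0` with gain `0.5`, since splitting there yields two pure children. Note that `categorical_set_condition` ignores token order entirely, matching the abstract's caveat that sequential order is not preserved.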