Paper Title
Propositionalization and Embeddings: Two Sides of the Same Coin
Paper Authors
Paper Abstract
Data preprocessing is an important component of machine learning pipelines, which requires ample time and resources. An integral part of preprocessing is data transformation into the format required by a given learning algorithm. This paper outlines some of the modern data processing techniques used in relational learning that enable data fusion from different input data types and formats into a single table data representation, focusing on the propositionalization and embedding data transformation approaches. While both approaches aim at transforming data into tabular data format, they use different terminology and task definitions, are perceived to address different goals, and are used in different contexts. This paper contributes a unifying framework that allows for improved understanding of these two data transformation techniques by presenting their unified definitions, and by explaining the similarities and differences between the two approaches as variants of a unified complex data transformation task. In addition to the unifying framework, the novelty of this paper is a unifying methodology combining propositionalization and embeddings, which benefits from the advantages of both in solving complex data transformation and learning tasks. We present two efficient implementations of the unifying methodology: an instance-based PropDRM approach, and a feature-based PropStar approach to data transformation and learning, together with their empirical evaluation on several relational problems. The results show that the new algorithms can outperform existing relational learners and can solve much larger problems.
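To make the core transformation concrete, here is a minimal, hypothetical sketch of propositionalization — not the paper's PropDRM or PropStar implementations — showing how a one-to-many relation (instances with related records) can be flattened into the single-table Boolean-feature representation the abstract describes. All names and data below are illustrative assumptions.

```python
# Hypothetical toy data: target instances and a one-to-many relation.
customers = ["ann", "bob", "cat"]
purchases = {
    "ann": ["milk", "bread"],
    "bob": ["milk"],
    "cat": ["beer", "bread"],
}

# Each distinct related value becomes a propositional (Boolean) feature.
items = sorted({item for vals in purchases.values() for item in vals})

# Single-table representation: one row per instance, one column per feature.
table = {
    c: {f"bought_{item}": item in purchases.get(c, []) for item in items}
    for c in customers
}
```

The resulting `table` is a flat, tabular view of the relational input — e.g. `table["ann"]["bought_milk"]` is `True` — which any standard propositional learner can then consume. Embedding approaches would instead map each instance to a dense numeric vector, but the target format (one row per instance) is the same.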