Paper Title

Machop: an End-to-End Generalized Entity Matching Framework

Authors

Jin Wang, Yuliang Li, Wataru Hirota, Eser Kandogan

Abstract

Real-world applications frequently seek to solve a general form of the Entity Matching (EM) problem to find associated entities. Such scenarios include matching jobs to candidates in job targeting, matching students with courses in online education, matching products with user reviews on e-commerce websites, and beyond. These tasks impose new requirements, such as matching data entries with diverse formats or supporting a flexible and semantics-rich matching definition, which are beyond current EM task formulations and approaches. In this paper, we introduce the problem of Generalized Entity Matching (GEM), which satisfies these practical requirements, and present an end-to-end pipeline, Machop, as the solution. Machop allows end-users to define new matching tasks from scratch and apply them to new domains in a step-by-step manner. Machop casts the GEM problem as sequence pair classification so as to utilize the language understanding capability of Transformer-based language models (LMs) such as BERT. Moreover, it features a novel external knowledge injection approach with structure-aware pooling methods that allow domain experts to guide the LM to focus on the key matching information, thus further contributing to the overall performance. Our experiments and case studies on real-world datasets from a popular recruiting platform show a significant 17.1% gain in F1 score against state-of-the-art methods, along with meaningful matching results that are human-understandable.
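As a rough illustration of what casting matching as sequence pair classification can look like on the input side, the sketch below flattens two entries of different formats (a structured record and free text) into a single BERT-style pair string. The `[COL]`/`[VAL]` tags, the function names, and the sample data are assumptions made for this example; the abstract does not specify Machop's actual serialization, and the classifier itself is omitted.

```python
from typing import Mapping, Union

# An entry is either free text or a flat attribute -> value record.
Entry = Union[str, Mapping[str, object]]


def serialize(entry: Entry) -> str:
    """Turn one entry into a single line of text.

    Free text is whitespace-normalized; records are flattened with
    hypothetical [COL]/[VAL] tags (an assumed convention, not taken
    from the abstract).
    """
    if isinstance(entry, str):
        return " ".join(entry.split())
    return " ".join(f"[COL] {key} [VAL] {value}" for key, value in entry.items())


def to_sequence_pair(left: Entry, right: Entry) -> str:
    """Join two serialized entries in the BERT-style pair layout."""
    return f"[CLS] {serialize(left)} [SEP] {serialize(right)} [SEP]"


# Hypothetical example: a structured job posting vs. a free-text resume line.
job = {"title": "Data Engineer", "location": "Tokyo"}
resume = "Five years   building ETL pipelines in Python."
pair = to_sequence_pair(job, resume)
print(pair)
```

In practice a tokenizer would add the special tokens itself, and a pair like this would be fed to a fine-tuned LM that outputs a match or no-match label.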
