Paper Title

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Paper Authors

Bill Yuchen Lin, Ying Sheng, Nguyen Vo, Sandeep Tata

Paper Abstract

Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from that vertical without requiring expensive hand-crafted features over visual renderings of the page. Through experiments on a public dataset with 8 different verticals, we show that FreeDOM beats the previous state of the art by nearly 3.7 F1 points on average without requiring features over rendered pages or expensive hand-crafted features.
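To make the two-stage design concrete, below is a minimal PyTorch sketch, not the authors' released code: a stage-1 encoder that fuses each DOM node's text tokens with its markup (tag-path) tokens into a node vector, and a stage-2 pairwise scorer that combines two node vectors with a distance feature to model longer-range relatedness. All class names, dimensions, vocabulary sizes, and the scalar distance feature are illustrative assumptions.

import torch
import torch.nn as nn

class NodeEncoder(nn.Module):
    """Stage 1 (sketch): encode a DOM node from its text tokens and its
    markup tag-path tokens, then fuse both into one node vector.
    Vocabulary sizes and dimensions are illustrative assumptions."""
    def __init__(self, text_vocab=30000, tag_vocab=100, dim=128):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.tag_emb = nn.Embedding(tag_vocab, dim)
        self.text_rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.tag_rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(4 * dim, dim)

    def forward(self, text_ids, tag_ids):
        # text_ids: (num_nodes, text_len); tag_ids: (num_nodes, path_len)
        _, (t_h, _) = self.text_rnn(self.text_emb(text_ids))
        _, (g_h, _) = self.tag_rnn(self.tag_emb(tag_ids))
        t = torch.cat([t_h[0], t_h[1]], dim=-1)  # final states, both directions
        g = torch.cat([g_h[0], g_h[1]], dim=-1)
        return torch.relu(self.fuse(torch.cat([t, g], dim=-1)))

class PairScorer(nn.Module):
    """Stage 2 (sketch): score a pair of candidate node vectors together
    with a scalar distance feature (e.g. tree distance between nodes)."""
    def __init__(self, dim=128, num_labels=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim + 1, dim), nn.ReLU(),
            nn.Linear(dim, num_labels),
        )

    def forward(self, h_i, h_j, dist):
        # h_i, h_j: (num_pairs, dim); dist: (num_pairs, 1)
        return self.mlp(torch.cat([h_i, h_j, dist], dim=-1))

# Usage example: encode 10 nodes, score 4 candidate node pairs.
enc, scorer = NodeEncoder(), PairScorer()
h = enc(torch.randint(0, 30000, (10, 16)), torch.randint(0, 100, (10, 8)))
scores = scorer(h[:4], h[1:5], torch.rand(4, 1))  # (4, num_labels)

The sketch only mirrors the shape of the approach: in the paper, stage-1 node representations are trained on labeled nodes from a few seed sites per vertical, and stage 2 then uses pairwise distance and semantic features to generalize to unseen sites.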
