Paper Title
XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding
Paper Authors
Paper Abstract
Recently, various multimodal networks for Visually-Rich Document Understanding (VRDU) have been proposed, showing that Transformers benefit from integrating visual and layout information with the text embeddings. However, most existing approaches rely on position embeddings to incorporate sequence information, neglecting the noisy, improper reading order produced by OCR tools. In this paper, we propose a robust layout-aware multimodal network named XYLayoutLM to capture and leverage rich layout information from the proper reading orders produced by our Augmented XY Cut. Moreover, a Dilated Conditional Position Encoding module is proposed to handle input sequences of variable length; it additionally extracts local layout information from both the textual and visual modalities while generating position embeddings. Experimental results show that our XYLayoutLM achieves competitive results on document understanding tasks.
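To make the reading-order idea concrete, below is a minimal sketch of a plain recursive XY cut that orders OCR bounding boxes top-to-bottom, left-to-right by splitting on whitespace gaps. This is an illustrative simplification only: the paper's Augmented XY Cut adds mechanisms (e.g., threshold adjustment) not shown here, and the box layout used in the example is hypothetical.

```python
# Minimal sketch of a recursive XY cut over OCR boxes (x1, y1, x2, y2).
# NOT the paper's Augmented XY Cut; just the classic recursive idea.

def xy_cut(boxes, indices=None, axis=0):
    """Return box indices in reading order (top-to-bottom, left-to-right).

    axis=0 cuts along y (horizontal whitespace), axis=1 cuts along x.
    """
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)

    lo, hi = (1, 3) if axis == 0 else (0, 2)  # coordinate pair on this axis
    order = sorted(indices, key=lambda i: boxes[i][lo])

    # Group boxes into runs separated by whitespace gaps along the axis.
    groups, current, current_end = [], [order[0]], boxes[order[0]][hi]
    for i in order[1:]:
        if boxes[i][lo] > current_end:   # gap found -> cut here
            groups.append(current)
            current = [i]
        else:
            current.append(i)
        current_end = max(current_end, boxes[i][hi])
    groups.append(current)

    if len(groups) == 1:
        # No cut possible on this axis: try the other axis once,
        # then fall back to a simple (y, x) sort to guarantee termination.
        if axis == 0:
            return xy_cut(boxes, groups[0], axis=1)
        return sorted(groups[0], key=lambda i: (boxes[i][1], boxes[i][0]))

    result = []
    for g in groups:
        result.extend(xy_cut(boxes, g, axis=1 - axis))
    return result


# Hypothetical page: header, two columns, footer.
boxes = [(0, 0, 10, 2),    # header (top)
         (0, 3, 4, 9),     # left column
         (6, 3, 10, 9),    # right column
         (0, 10, 10, 12)]  # footer (bottom)
print(xy_cut(boxes))  # header, left column, right column, footer
```

Alternating the cut axis at each recursion level is what lets the procedure separate a header band first and then order the columns inside it, which is exactly the kind of structure a naive top-left raster order from an OCR tool gets wrong.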