Paper Title
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Paper Authors
Paper Abstract
Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
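The two input-side ideas in the abstract can be made concrete. Below is a minimal sketch, not the authors' implementation, of (a) rendering a language prompt such as a question directly onto the screenshot, and (b) a variable-resolution representation that rescales the image, preserving its aspect ratio, so it fits within a fixed budget of fixed-size patches. The patch size, patch budget, and helper names (`render_prompt`, `rescale_for_patch_budget`) are illustrative assumptions, not values from the paper.

```python
# Sketch of Pix2Struct-style input handling (assumptions, not the paper's code):
# (1) render the prompt as a header bar on top of the screenshot;
# (2) rescale so the image fits a fixed patch budget at a fixed patch size.
import math
from PIL import Image, ImageDraw

PATCH = 16          # assumed patch side length in pixels
MAX_PATCHES = 2048  # assumed patch (sequence-length) budget

def render_prompt(image: Image.Image, prompt: str, bar_height: int = 32) -> Image.Image:
    """Draw the text prompt in a white bar pasted above the screenshot."""
    out = Image.new("RGB", (image.width, image.height + bar_height), "white")
    ImageDraw.Draw(out).text((4, 4), prompt, fill="black")
    out.paste(image, (0, bar_height))
    return out

def rescale_for_patch_budget(image: Image.Image) -> Image.Image:
    """Scale so (w/PATCH) * (h/PATCH) <= MAX_PATCHES, keeping the aspect ratio."""
    scale = math.sqrt(MAX_PATCHES * PATCH * PATCH / (image.width * image.height))
    w = max(PATCH, PATCH * int(scale * image.width / PATCH))
    h = max(PATCH, PATCH * int(scale * image.height / PATCH))
    return image.resize((w, h))

# Usage: a blank stand-in for a real web-page screenshot.
screenshot = Image.new("RGB", (1280, 800), "white")
model_input = rescale_for_patch_budget(render_prompt(screenshot, "what is the title?"))
print(model_input.size, (model_input.width // PATCH) * (model_input.height // PATCH))
```

Rendering the prompt as pixels, rather than feeding it through a separate text encoder, is what lets a single image-to-text model handle questions, captions, and UI instructions uniformly; the aspect-ratio-preserving rescale avoids the distortion a fixed square resize would introduce for wide or tall screenshots.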