Paper Title

Structure-Grounded Pretraining for Text-to-SQL

Paper Authors

Xiang Deng, Ahmed Hassan Awadallah, Christopher Meek, Oleksandr Polozov, Huan Sun, Matthew Richardson

Paper Abstract

Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set Spider-Realistic based on Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. STRUG brings significant improvement over BERT-LARGE in all settings. Compared with existing pretraining methods such as GRAPPA, STRUG achieves similar performance on Spider, and outperforms all baselines on more realistic sets. The Spider-Realistic dataset is available at https://doi.org/10.5281/zenodo.5205322.
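The abstract names three weakly supervised pretraining tasks: column grounding, value grounding, and column-value mapping. A minimal sketch of how such labels could be derived from a parallel text-table example by simple string matching; this is an illustration of the general idea, not the paper's actual implementation, and the function name and data below are made up for the example:

```python
# Illustrative sketch (NOT the paper's implementation): derive weak
# supervision labels for the three pretraining tasks from a parallel
# text-table pair via naive string matching.

def make_labels(utterance_tokens, columns, cell_values):
    """utterance_tokens: list of tokens from the natural language question.
    columns: list of column names in the schema.
    cell_values: dict mapping each column name to a set of its cell values.
    Returns (column grounding, value grounding, column-value mapping) labels.
    """
    lowered = [t.lower() for t in utterance_tokens]

    # Task 1: column grounding -- is each column referenced in the utterance?
    col_grounding = {
        c: any(w in t for w in c.lower().split() for t in lowered)
        for c in columns
    }

    # Task 2: value grounding -- does each token match some cell value?
    # Task 3: column-value mapping -- which column does a value token belong to?
    value_grounding = []
    col_value_map = []
    for tok in lowered:
        matched_col = next(
            (c for c, vals in cell_values.items()
             if tok in {v.lower() for v in vals}),
            None,
        )
        value_grounding.append(matched_col is not None)
        col_value_map.append(matched_col)

    return col_grounding, value_grounding, col_value_map
```

For the question "show names of singers from France" over columns `name` and `country`, such matching would ground the column `name`, mark the token "France" as a value, and map it to the `country` column; a text-table encoder pretrained to predict these labels learns the alignment the abstract describes.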
