Paper Title

Importance of Synthesizing High-quality Data for Text-to-SQL Parsing

Authors

Zhao, Yiyun; Jiang, Jiarong; Hu, Yiqun; Lan, Wuwei; Zhu, Henry; Chauhan, Anuj; Li, Alexander; Pan, Lin; Wang, Jun; Hang, Chung-Wei; Zhang, Sheng; Dong, Marvin; Lilien, Joe; Ng, Patrick; Wang, Zhiguo; Castelli, Vittorio; Xiang, Bing

Abstract

Recently, there has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we first examined the existing synthesized datasets and discovered that state-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We observed two shortcomings: illogical synthetic SQL queries from independent column sampling and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, our experiments show that these models have significant accuracy boosts on popular benchmarks, including new state-of-the-art performance on Spider.
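The abstract names schema-distance-weighted column sampling without giving its details. The sketch below is one possible reading of the idea, not the authors' implementation: tables are nodes in a graph linked by foreign keys, and a column is sampled with a weight that shrinks as its table gets farther from an anchor table. The toy schema, the function names, and the `decay ** distance` weighting are all assumptions made for illustration.

```python
import random
from collections import deque

# Hypothetical toy schema (not from the paper): table -> column names.
TABLES = {
    "singer": ["singer_id", "name", "age"],
    "singer_in_concert": ["singer_id", "concert_id"],
    "concert": ["concert_id", "stadium_id", "year"],
    "stadium": ["stadium_id", "capacity"],
}
# Hypothetical foreign-key links, treated as undirected edges between tables.
FOREIGN_KEYS = [
    ("singer_in_concert", "singer"),
    ("singer_in_concert", "concert"),
    ("concert", "stadium"),
]


def table_distances(start):
    """Breadth-first hop count from `start` to each reachable table."""
    neighbors = {table: set() for table in TABLES}
    for a, b in FOREIGN_KEYS:
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def column_weights(anchor_table, decay=0.5):
    """Assumed weighting: decay ** distance, normalised to sum to 1."""
    dist = table_distances(anchor_table)
    raw = {
        (table, column): decay ** dist[table]
        for table, columns in TABLES.items()
        if table in dist  # tables with no join path get no weight
        for column in columns
    }
    total = sum(raw.values())
    return {key: weight / total for key, weight in raw.items()}


def sample_column(anchor_table, rng, decay=0.5):
    """Draw one (table, column) pair, favouring tables near the anchor."""
    weights = column_weights(anchor_table, decay)
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
```

Calling `sample_column("singer", random.Random(0))` returns a `(table, column)` pair. Under this assumed weighting, columns of `singer` are the most likely picks and columns of `stadium`, three joins away, the least likely, which discourages queries that combine unrelated columns.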
