Paper Title
Importance of Synthesizing High-quality Data for Text-to-SQL Parsing
Paper Authors
Paper Abstract
Recently, there has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we first examine the existing synthesized datasets and find that state-of-the-art text-to-SQL algorithms do not improve further on popular benchmarks when trained with augmented synthetic data. We observe two shortcomings: illogical synthetic SQL queries from independent column sampling and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from the schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, our experiments show that these models achieve significant accuracy boosts on popular benchmarks, including new state-of-the-art performance on Spider.
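
To make the idea of schema-distance-weighted column sampling more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the weighting formula, so the geometric decay (`decay ** hops`), the toy schema, and every name below (`SCHEMA`, `FOREIGN_KEYS`, `table_distances`, `column_weights`, `sample_column`) are hypothetical. Distance is taken to be the number of foreign-key hops between tables, and tables with no key path to the anchor table are never sampled, which is one way to avoid arbitrary joins.

```python
import random
from collections import deque

# Hypothetical toy schema (not from the paper): table -> column names.
SCHEMA = {
    "singer": ["singer_id", "name", "age"],
    "concert": ["concert_id", "singer_id", "stadium_id", "year"],
    "stadium": ["stadium_id", "capacity"],
}
# Foreign-key links between tables, treated as undirected edges.
FOREIGN_KEYS = [("concert", "singer"), ("concert", "stadium")]


def table_distances(start):
    """Breadth-first search over the key graph: hops from `start` to each reachable table."""
    neighbors = {table: set() for table in SCHEMA}
    for a, b in FOREIGN_KEYS:
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def column_weights(anchor_table, decay=0.5):
    """Assumed weighting: each column gets decay ** (hops from the anchor table)."""
    dist = table_distances(anchor_table)
    return {
        (table, column): decay ** hops
        for table, hops in dist.items()
        for column in SCHEMA[table]
    }


def sample_column(anchor_table, rng, decay=0.5):
    """Draw one (table, column) pair, favoring columns close to the anchor table."""
    weights = column_weights(anchor_table, decay)
    candidates = list(weights)
    return rng.choices(candidates, weights=[weights[c] for c in candidates], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_column("singer", rng))
```

With `singer` as the anchor, columns of `singer` get weight 1.0, columns of `concert` get 0.5, and columns of `stadium` get 0.25, so a second column is more likely to come from the same or an adjacent table than from a distant one. Independent (uniform) sampling corresponds to `decay=1.0` in this sketch.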