MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing

12/27/2022
by Longxu Dou, et al.

Text-to-SQL semantic parsing is an important NLP task that greatly facilitates interaction between users and databases and has become a key component of many human-computer interaction systems. Much recent progress in text-to-SQL has been driven by large-scale datasets, but most of them are centered on English. In this work, we present MultiSpider, the largest multilingual text-to-SQL dataset, which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese). Building on MultiSpider, we further identify the lexical and structural challenges of text-to-SQL (caused by specific language properties and dialect sayings) and their intensity across different languages. Experimental results under three typical settings (zero-shot, monolingual, and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages. Qualitative and quantitative analyses are conducted to understand the reasons for the performance drop in each language. Besides the dataset, we also propose a simple schema augmentation framework, SAVe (Schema-Augmentation-with-Verification), which significantly boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.

research
10/05/2020

A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese

Semantic parsing is an important NLP task. However, Vietnamese is a low-...
research
05/31/2023

Correcting Semantic Parses with Natural Language through Dynamic Schema Encoding

In addressing the task of converting natural language to SQL queries, th...
research
05/25/2023

CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset

The cross-domain text-to-SQL task aims to build a system that can parse ...
research
12/09/2020

Tracking Interaction States for Multi-Turn Text-to-SQL Semantic Parsing

The task of multi-turn text-to-SQL semantic parsing aims to translate na...
research
08/26/2022

SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset

As the first session-level Chinese dataset, CHASE contains two separate ...
research
06/25/2023

A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention

Long sequences of text are challenging in the context of transformers, d...
research
10/30/2022

Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities

Contextualizing language technologies beyond a single language kindled e...
