
Leveraging Table Content for Zero-shot Text-to-SQL with Meta-Learning

by   Yongrui Chen, et al.
NetEase, Inc

Single-table text-to-SQL aims to transform a natural language question into a SQL query according to one single table. Recent work has made promising progress on this task using pre-trained language models and a multi-submodule framework. However, zero-shot tables, that is, tables that do not appear in the training set, are currently the most critical bottleneck restricting the application of existing approaches to real-world scenarios. Although some work has utilized auxiliary tasks to help handle zero-shot tables, expensive extra manual annotation limits their practicality. In this paper, we propose a new approach for the zero-shot text-to-SQL task which does not rely on any additional manual annotations. Our approach consists of two parts. First, we propose a new model that leverages the abundant information of table content to help establish the mapping between questions and zero-shot tables. Further, we propose a simple but efficient meta-learning strategy to train our model. The strategy utilizes a two-step gradient update to force the model to learn to generalize to zero-shot tables. We conduct extensive experiments on a public open-domain text-to-SQL dataset, WikiSQL, and a domain-specific dataset, ESQL. Compared to existing approaches using the same pre-trained model, our approach achieves significant improvements on both datasets. Compared to approaches using a larger pre-trained model or a tabular-specific pre-trained model, our approach is still competitive. More importantly, on the zero-shot subsets of both datasets, the improvements are even larger.






Introduction

Since the release of WikiSQL Zhong et al. (2017), a large-scale text-to-SQL benchmark, the single-table text-to-SQL task has become an active research area. The goal of the task is to transform natural language questions into Structured Query Language (SQL) to query a single table. Although the search space is limited to one table, the task still has a considerable number of application scenarios (e.g., querying regional electricity prices or flight schedules). More importantly, it is the basis for more complex text-to-SQL tasks over multiple tables Yu et al. (2018c). Therefore, research in this area is of great significance.

Relying on large-scale pre-trained language models Devlin et al. (2019) and a multi-submodule framework, existing approaches He et al. (2019); Hwang et al. (2019); Lyu et al. (2020) have made considerable progress on the single-table text-to-SQL task. However, few of them pay attention to the challenge of zero-shot tables, whose schemas are not visible in the training set. In comparison with visible tables, zero-shot tables are more challenging because they are not directly involved in training. The model never perceives their schemas during training, so these schemas may act as noise at test time. In fact, with the rapid expansion of business, zero-shot tables are becoming more and more common in realistic scenarios. Therefore, in order to move text-to-SQL from the laboratory to real applications, it is necessary to make the model learn to handle zero-shot tables.

Chang et al. (2020) explicitly deals with zero-shot tables for the first time. The core idea of their approach is to design an auxiliary task to model the mapping from the question to the headers (similar to entity linking). However, this approach requires training data to provide the gold mappings that are annotated manually. Undoubtedly, it is a strong limitation in realistic scenarios.

Figure 1: An example of table content to help predict headers. Red indicates the matching.

In this paper, we propose a new approach called Meta-Content text-to-SQL (MC-SQL) to handle zero-shot tables. The motivation comes from the following two intuitions: 1) Table content can provide abundant information for predicting headers. Figure 1 shows an example: the cell son in the table is relevant to the question word “son”, and thus reveals the potential header Relationship to Monarch. 2) Meta-learning can help the model learn the ability to generalize across different tables from the training data, because meta-learning needs only a few gradient steps to adapt quickly to new tasks. Specifically, our approach consists of two parts. On the one hand, a table content-enhanced model encodes questions, headers, and table cells at the same time, in order to combine their semantic relevance for prediction on zero-shot tables. On the other hand, a zero-shot meta-learning algorithm is utilized to train our content-enhanced model instead of the traditional mini-batch strategy. In each training step, the algorithm generalizes the model with two sets of samples that rely on two disjoint table sets, respectively. Finally, to comprehensively evaluate our approach, we conduct experiments on the public open-domain benchmark WikiSQL and the domain-specific benchmark ESQL. Our approach achieves a significant improvement over the baselines that utilize the same pre-trained model as ours, and also achieves competitive results against the baselines that utilize larger or tabular-specific pre-trained models.


The single-table text-to-SQL task can be formally defined as

$$y = \mathcal{M}(q, T) \qquad (1)$$

where $q$ denotes a natural language question and $y$ denotes the corresponding SQL query. $T = \{h_1, h_2, \dots, h_l\}$ denotes the table which $q$ relies on, where $h_i$ denotes the $i$-th header in $T$. The goal of the task is to learn a mapping $\mathcal{M}$ from questions to SQL queries. In addition, this task supposes that no complex SQL syntax (e.g., GROUP BY and nested queries) exists and that there is only one column in the SELECT clause. Specifically, each $y$ follows a unified skeleton, which is shown in Figure 2. The tokens prefixed with “$” indicate the slots to be filled and “*” indicates zero or more AND clauses. According to the skeleton, existing approaches He et al. (2019); Hwang et al. (2019); Lyu et al. (2020) break the total task into the following six subtasks:

Figure 2: Skeleton of SQL in single-table text-to-SQL.
  • Select-Column (SC) finds the column $SEL in the SELECT clause from $T$.

  • Select-Aggregation (SA) finds the aggregation function $AGG ($\in$ {NONE, MAX, MIN, COUNT, SUM, AVG}) of the column in the SELECT clause.

  • Where-Number (WN) finds the number of WHERE conditions.

  • Where-Column (WC) finds the column (header) $COL of each WHERE condition from $T$.

  • Where-Operator (WO) finds the operator $OP ($\in$ {=, >, <}) of each $COL in the WHERE clause.

  • Where-Value (WV) finds the value $VAL for each condition from the question, specifically, locating the starting and ending positions of the value in $q$.

There are dependencies between some tasks. For example, the prediction of $OP requires $COL, and the prediction of $VAL requires both $COL and $OP.
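To make the slot-filling view concrete, the following is a minimal sketch of how the six sub-task predictions could be assembled into a query. Figure 2 is not reproduced in this text, so the skeleton is assumed to be the usual WikiSQL form, SELECT $AGG $SEL FROM table WHERE $COL $OP $VAL (AND $COL $OP $VAL)*. The names `Condition` and `build_sql` and the default table name `t` are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List

# Aggregation functions listed for the Select-Aggregation (SA) sub-task.
AGG_OPS = ["NONE", "MAX", "MIN", "COUNT", "SUM", "AVG"]


@dataclass
class Condition:
    column: str    # $COL, predicted by Where-Column (WC)
    operator: str  # $OP, predicted by Where-Operator (WO)
    value: str     # $VAL, predicted by Where-Value (WV)


def build_sql(sel: str, agg: str, conditions: List[Condition], table: str = "t") -> str:
    """Fill the assumed skeleton with the predicted slots.

    The number of conditions (WN) is implicit in len(conditions).
    """
    if agg not in AGG_OPS:
        raise ValueError(f"unknown aggregation: {agg}")
    column = f'"{sel}"'
    select = column if agg == "NONE" else f"{agg}({column})"
    sql = f"SELECT {select} FROM {table}"
    if conditions:
        parts = []
        for cond in conditions:
            escaped = cond.value.replace("'", "''")  # escape single quotes
            parts.append(f"\"{cond.column}\" {cond.operator} '{escaped}'")
        sql += " WHERE " + " AND ".join(parts)
    return sql


example = build_sql("Name", "NONE", [Condition("Relationship to Monarch", "=", "son")])
```

The dependency order described above is visible in the data flow: a `Condition` can only be completed once its column, then its operator, and finally its value are known.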

Figure 3: Overall framework of our approach.
Figure 4: Architecture of the table content-enhanced model. WN, WC, and WV are detailed in the orange, purple, and green dotted box, respectively. Blue indicates the processes for table content and gray indicates the processes for headers.


The framework of our approach is shown in Figure 3 and consists of two parts. First, the table content-enhanced model (left) captures the semantic relevance of questions to headers and cells at the same time, and predicts all sub-tasks. Further, zero-shot meta-learning (right) is leveraged to train the table content-enhanced model. In each training batch, the model parameters are updated in two stages to force the model to learn to generalize.

Table Content Enhanced Model

Figure 1 demonstrates that the core of leveraging table content is to find the table cells mentioned in the question. However, the total number of the cells can be very large, far exceeding that of the headers. Consequently, it is impractical to directly embed all of them.

To overcome this challenge, we adopt coarse-grained filtering before embedding. Specifically, for each header $h_i$, only the cell $c_i$ with the highest literal similarity to the question $q$ is retained. The literal similarity is computed by

$$\varphi(c; q) = \max_{n(q)} \left[ \frac{\mathrm{lcs}(n(q), c)}{2\,|n(q)|} + \frac{\mathrm{lcs}(n(q), c)}{2\,|c|} \right] \qquad (2)$$

where $n(q)$ denotes an $n$-gram of $q$, $|x|$ denotes the length of string $x$, and $\mathrm{lcs}(x, y)$ denotes the length of the Longest Consecutive Common Subsequence between the strings $x$ and $y$. Intuitively, the larger the proportion of overlap between the two strings, the higher their similarity. In addition, if the score of the retained cell is smaller than the threshold, the cell is replaced with a special token #None# in order to avoid noise. After filtering, each header has a corresponding cell (or #None#).
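The filtering step can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the score is assumed to average the two overlap proportions (common length over n-gram length, and common length over cell length), n-grams are taken over whitespace tokens up to a hypothetical `max_n`, and comparison is case-insensitive. The function names are hypothetical; the 0.9 default mirrors the threshold reported in the implementation details.

```python
from typing import Dict, Iterator, List


def lcs_len(x: str, y: str) -> int:
    """Length of the longest common consecutive substring of x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def ngrams(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield every word n-gram of the question, for n = 1 .. max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def literal_similarity(cell: str, question: str, max_n: int = 4) -> float:
    """Best overlap score between the cell and any n-gram of the question."""
    cell = cell.lower()
    if not cell:
        return 0.0
    best = 0.0
    for gram in ngrams(question.lower().split(), max_n):
        common = lcs_len(gram, cell)
        best = max(best, common / (2 * len(gram)) + common / (2 * len(cell)))
    return best


def filter_cells(columns: Dict[str, List[str]], question: str,
                 threshold: float = 0.9) -> Dict[str, str]:
    """Keep, for each header, its most similar cell, or #None# if below threshold."""
    kept = {}
    for header, cells in columns.items():
        scored = [(literal_similarity(c, question), c) for c in cells]
        score, cell = max(scored, default=(0.0, "#None#"))
        kept[header] = cell if score >= threshold else "#None#"
    return kept


table = {"Relationship to Monarch": ["son", "daughter"], "Name": ["Alice", "Bob"]}
kept = filter_cells(table, "who is the son of the monarch")
```

With this scoring, an exact match between a cell and an n-gram yields 1.0, so a high threshold keeps only cells that are almost literally mentioned in the question.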

The overall architecture of our table content-enhanced model is shown in Figure 4. It consists of an encoding module and six sub-modules corresponding to six sub-tasks. Intuitively, table content is mainly helpful to the sub-tasks for the WHERE clause, especially WN, WC, and WV. Therefore, we will detail these three sub-modules that utilize table content.

Encoding Module

The encoding module consists of BERT Devlin et al. (2019) and an embedding layer. BERT is employed to encode the question and the headers. Following the format of BERT, the input is a sequence of tokens, which starts with a special token [CLS] followed by the word tokens of the question and of all headers. A special token [SEP] is utilized to separate the question and each header. For each input token, BERT outputs a hidden vector that holds its context information. In addition to BERT, the embedding layer is utilized to embed the cells. Differing from BERT, the embedding is for characters rather than words, because cells are typically entity names, numerical values, and the like, and char-embedding can reduce the number of Out-Of-Vocabulary (OOV) tokens. Specifically, each cell is represented by its character embedding, a matrix with one row per character of the cell and one column per embedding dimension. Since the question and the cells should be embedded in the same vector space for calculating their semantic relevance, we also compute a char-embedding of the question instead of directly using its BERT output.
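The input layout described above can be illustrated with a small sketch. It assumes pre-tokenized words (a real BERT pipeline would apply WordPiece tokenization and map tokens to vocabulary ids), and the helper names `build_input` and `to_char_ids` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_input(question_tokens: List[str],
                headers: List[List[str]]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Lay out [CLS] question [SEP] header_1 [SEP] header_2 [SEP] ...

    Returns the token sequence and, for each header, its (start, end) span so
    that the hidden vectors of a header can be located in the encoder output.
    """
    tokens = ["[CLS]"] + list(question_tokens) + ["[SEP]"]
    spans = []
    for header in headers:
        start = len(tokens)
        tokens.extend(header)
        spans.append((start, len(tokens)))  # end index is exclusive
        tokens.append("[SEP]")
    return tokens, spans


def to_char_ids(text: str, vocab: Dict[str, int]) -> List[int]:
    """Map each character to an id, growing the vocabulary on first sight."""
    return [vocab.setdefault(ch, len(vocab)) for ch in text]


tokens, spans = build_input(["who", "is"], [["name"], ["relationship", "to", "monarch"]])
char_vocab: Dict[str, int] = {}
cell_ids = to_char_ids("son", char_vocab)
```

Because the character vocabulary is shared, the question and the cells end up indexed in the same embedding table, which is what allows their relevance to be compared in one vector space.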

Where-Number Sub-Module

This sub-module contains two similar processes, which calculate the header-aware and the content-aware context vectors of the question, respectively. Here we detail the latter. First, for each retained cell, its content vector is obtained by a BiLSTM on the cell's character embedding followed by a max-pooling operation. Then, the hidden vectors of the question are obtained by another BiLSTM. In order to make this BiLSTM aware of the cells when encoding the question, its initial states are obtained by performing a self-attention on all content vectors.


Here, each header receives an attention weight that is computed with trainable parameter matrices. Thereafter, the content-aware question context is calculated by a self-attention on the question hidden vectors, analogous to the self-attention above. As described at the beginning, the header-aware question context is calculated by the same procedure applied to the output of BERT (to distinguish them easily, all vectors obtained from the output of BERT are marked with a hat). Finally, the result is predicted by combining the header- and content-aware context vectors.


where and are the trainable parameter matrices.

Where-Column Sub-Module

In this sub-module, the processes of calculating the question hidden vectors and the content vectors are similar to those in WN. The only difference is that the initial states of all the BiLSTMs are random. Thereafter, in order to make the model focus on the parts of the question that are relevant to each cell, an attention mechanism is performed on the question hidden vectors to calculate the content-aware context vector.


Here, each hidden vector of the question is weighted by its attention weight, which is computed with a trainable parameter matrix. Finally, the result $COL is predicted by


where the header vector and the header-aware context vector are obtained by processing the output of BERT with the same steps used for the content vector and the content-aware context vector.
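The attention step in this sub-module can be illustrated as follows. Since the equations are not reproduced in this text, the sketch assumes a common bilinear form: each question hidden vector is scored against a conditioning vector (a content vector or a header vector) through a parameter matrix, the scores are normalized with a softmax, and the context vector is the weighted sum. The function names and the plain-list representation are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditioned_attention(hidden: List[List[float]], cond: List[float],
                          W: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Attend over question hidden vectors, conditioned on `cond`.

    Assumed scoring: score_j = cond^T W hidden_j, with W of shape
    len(cond) x len(hidden[0]). Returns (context_vector, attention_weights).
    """
    dim = len(hidden[0])
    # Project the conditioning vector once: proj = cond^T W.
    proj = [sum(cond[i] * W[i][j] for i in range(len(cond))) for j in range(dim)]
    scores = [sum(p * h for p, h in zip(proj, vec)) for vec in hidden]
    weights = softmax(scores)
    context = [sum(a * vec[d] for a, vec in zip(weights, hidden)) for d in range(dim)]
    return context, weights


question_hidden = [[1.0, 0.0], [0.0, 1.0]]  # two toy question positions
content_vector = [1.0, 0.0]                 # toy content vector of one cell
W_sharp = [[10.0, 0.0], [0.0, 10.0]]
context, weights = conditioned_attention(question_hidden, content_vector, W_sharp)
```

In the toy example the first question position aligns with the content vector and therefore receives almost all of the attention mass, so the context vector is close to that position's hidden vector.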

Where-Value Sub-Module

The architecture of this sub-module is almost identical to that of WC. The difference is that the prediction of $VAL also requires the results of WC and WO, namely $COL and $OP. Given $COL and $OP, the starting position and ending position of $VAL can be calculated by


Here, $OP is represented by a one-hot vector and each word token of the question is represented by a semantic vector. The remaining vectors are calculated by the same procedures used in WC. In order to leverage the content more directly, we propose a Value Linking (VL) strategy for calculating the semantic vector of each word token.


Here, the hidden vector of each question word is calculated by a BiLSTM encoding the output of BERT, and the type embedding encodes one of two types, Match and NotMatch, which indicate whether or not the word matches some cell. Initially, every word is labeled as NotMatch. When calculating the literal similarity by (2), if a cell is selected, all the words of the corresponding n-gram are labeled as Match.
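The labeling rule of Value Linking can be sketched as follows. The sketch is an assumption-laden illustration: the similarity score is assumed to average the two overlap proportions, the longest common substring is obtained from Python's `difflib.SequenceMatcher`, and the names `best_ngram_span` and `link_values` as well as the `max_n` and `threshold` defaults are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def best_ngram_span(tokens: List[str], cell: str,
                    max_n: int = 4) -> Tuple[float, Optional[Tuple[int, int]]]:
    """Return the best score and the token span (start, end) of the n-gram
    of the question that overlaps most with the cell."""
    cell = cell.lower()
    best_score, best_span = 0.0, None
    for n in range(1, min(max_n, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n]).lower()
            matcher = SequenceMatcher(None, gram, cell, autojunk=False)
            common = matcher.find_longest_match(0, len(gram), 0, len(cell)).size
            score = common / (2 * len(gram)) + common / (2 * len(cell))
            if score > best_score:
                best_score, best_span = score, (i, i + n)
    return best_score, best_span


def link_values(tokens: List[str], cells: List[str],
                threshold: float = 0.9) -> List[str]:
    """Label each question word as Match or NotMatch."""
    types = ["NotMatch"] * len(tokens)  # initial label of every word
    for cell in cells:
        if not cell or cell == "#None#":
            continue  # headers without a retained cell contribute nothing
        score, span = best_ngram_span(tokens, cell)
        if span is not None and score >= threshold:
            for k in range(span[0], span[1]):
                types[k] = "Match"
    return types


question = "who is the son of henry viii".split()
labels = link_values(question, ["son", "#None#", "Henry VIII"])
```

The resulting labels would then index a two-entry type embedding table whose vectors are combined with the word hidden vectors, giving the value predictor a direct hint about which words are likely to belong to a condition value.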

The architectures of the remaining three modules SC, SA, and WO are almost consistent with WC, except that they do not need the process for table content (i.e., removing the blue process in Figure 4). All their results are predicted by the classification that depends on the combined context .

Zero-Shot Meta Learning Framework

Meta-learning is typically leveraged to deal with classification problems and has the ability to adapt quickly between different categories. In our proposed framework, the table that each sample (question) relies on is regarded as an abstract category, which creates the conditions for applying meta-learning. Furthermore, the traditional meta-learning framework consists of two stages, meta-training and meta-test. However, Vinyals et al. (2016); Snell et al. (2017) have demonstrated that, without fine-tuning on the meta-test, a meta-learning model shows similar or even better performance. Motivated by this, our meta-learning algorithm only retains the meta-training stage. The entire process is formally described in Algorithm 1.

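Require: A set of training samples $D = \{(q_i, T_i, y_i)\}$, where $q_i$ is the $i$-th input question, $T_i$ is the table which $q_i$ relies on, and $y_i$ is the gold SQL query of $q_i$. A model $M_\theta$, where $\theta$ is its parameters. Hyperparameters $\alpha$, $\beta$, and $\gamma$.
while not done do
   for all task do
      Sample a support set $S \subseteq D$
      Evaluate $\nabla_\theta \mathcal{L}_S(M_\theta)$
      Update parameters with gradient descent: $\theta' = \theta - \alpha \nabla_\theta \mathcal{L}_S(M_\theta)$
      Sample a query set $Q \subseteq D$, where the tables of $Q$ are disjoint from the tables of $S$
      Evaluate $\nabla_\theta \mathcal{L}_Q(M_{\theta'})$
      Update $\theta$ to minimize $\mathcal{L}$ using the Adam optimizer with learning rate $\beta$, where $\mathcal{L}$ joins $\mathcal{L}_S$ and $\mathcal{L}_Q$ with weight $\gamma$
   end for
end while
Algorithm 1 Zero-Shot Meta-Learning Framework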

A task consists of several training samples. It is the basic training unit of our framework and is split into a support set and a query set. Here, to simulate the scenario of zero-shot tables, the table set of the support set is disjoint from that of the query set. According to the split, the model undergoes a two-stage gradient update during the training of each task. In the first stage, temporary parameters $\theta'$ are obtained by calculating the loss of the support set and performing a gradient update on the original parameters $\theta$. In the second stage, the loss of the query set is first calculated with $\theta'$. Then, the losses of the support set and the query set are combined to calculate the gradient. Finally, the original parameters $\theta$ are updated by this gradient. In addition, for sampling the support set and the query set, we follow the $N$-way $K$-shot setting, i.e., each set covers $N$ tables and there are $K$ samples for each table.
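The two-stage update can be illustrated on a toy problem. The sketch below is not the authors' training code and makes several labeled assumptions: the model is a one-parameter linear regressor with a mean squared error loss, the two losses are joined as a convex combination with a hypothetical weight `gamma`, the query gradient is taken at the temporary parameter and applied to the original one (a first-order approximation), and plain gradient descent stands in for the Adam optimizer.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mse_and_grad(w: float, samples: List[Sample]) -> Tuple[float, float]:
    """Mean squared error of the toy model y = w * x and its gradient in w."""
    n = len(samples)
    loss = sum((w * x - y) ** 2 for x, y in samples) / n
    grad = sum(2 * (w * x - y) * x for x, y in samples) / n
    return loss, grad


def meta_step(w: float, support: List[Sample], query: List[Sample],
              alpha: float = 0.05, beta: float = 0.05,
              gamma: float = 0.3) -> Tuple[float, float]:
    """One task update: returns (new parameter, joint loss)."""
    # Stage 1: temporary parameter from the support-set loss.
    loss_s, grad_s = mse_and_grad(w, support)
    w_tmp = w - alpha * grad_s
    # Stage 2: query-set loss evaluated with the temporary parameter.
    loss_q, grad_q = mse_and_grad(w_tmp, query)
    # Join both losses (assumed convex combination) and update the original
    # parameter; first-order approximation, plain SGD instead of Adam.
    joint_loss = gamma * loss_s + (1 - gamma) * loss_q
    joint_grad = gamma * grad_s + (1 - gamma) * grad_q
    return w - beta * joint_grad, joint_loss


support_set = [(1.0, 2.0), (2.0, 4.0)]  # stands in for samples of seen tables
query_set = [(3.0, 6.0)]                # stands in for samples of disjoint tables
w = 0.0
for _ in range(200):
    w, joint = meta_step(w, support_set, query_set)
```

The design point the sketch tries to convey is that the parameter is only rewarded when a step taken on the support tables also lowers the loss on tables it has not been adapted to.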

Although meta-learning has also been utilized in Huang et al. (2018) on text-to-SQL, there are two key differences between our proposed approach and their method: First, Huang et al. (2018) focuses on sampling support sets according to types of the questions (e.g., COUNT, MIN), but we sample according to different tables, so as to capture the potential relationship between questions and tables. Second, we ensure that the tables in the support set do not intersect with those in the query set to simulate a zero-shot environment. Following this setting, the model needs to learn the generic knowledge between two different sets of tables and perform a joint optimization, thus it can be forced to learn the generalization ability.
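The table-based sampling with disjoint support and query tables can be sketched as follows. The sample representation (dictionaries with a `table` key), the function name `sample_task`, and the handling of tables with too few samples are assumptions made for the illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def sample_task(samples: List[Dict], n_way: int, k_shot: int,
                rng: random.Random) -> Tuple[List[Dict], List[Dict]]:
    """Draw a support set and a query set over disjoint tables.

    Each set covers n_way tables with k_shot samples per table.
    """
    by_table = defaultdict(list)
    for sample in samples:
        by_table[sample["table"]].append(sample)
    # Only tables with enough samples can contribute k_shot examples.
    eligible = [t for t, items in by_table.items() if len(items) >= k_shot]
    if len(eligible) < 2 * n_way:
        raise ValueError("not enough tables for disjoint support and query sets")
    chosen = rng.sample(eligible, 2 * n_way)
    support_tables, query_tables = chosen[:n_way], chosen[n_way:]

    def draw(tables: List[str]) -> List[Dict]:
        return [s for t in tables for s in rng.sample(by_table[t], k_shot)]

    return draw(support_tables), draw(query_tables)


# Toy data: 10 tables with 5 questions each.
data = [{"table": f"table_{t}", "question": f"q_{t}_{i}"}
        for t in range(10) for i in range(5)]
support, query = sample_task(data, n_way=4, k_shot=4, rng=random.Random(0))
```

Choosing the 2N tables in one draw and splitting them afterwards guarantees disjointness by construction, so no rejection loop is needed.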


Experimental Setup

Our models are trained and evaluated over the following two text-to-SQL benchmarks:

WikiSQL Zhong et al. (2017) is an English open-domain text-to-SQL benchmark containing more than 20K tables. Each question corresponds to a table, which is extracted from a Wikipedia page. The dataset is divided into 56,355 training questions, 8,421 development questions, and 15,878 test questions. In order to focus on evaluating the performance on zero-shot tables, we conduct experiments on the remaining 30% of tables (zero-shot subset) released by Chang et al. (2020).

ESQL is a Chinese domain-specific text-to-SQL dataset built by ourselves. Its format imitates WikiSQL, and it contains 17 tables. These tables are related to the field of electric energy, including information such as electricity sales and prices. (Due to commercial secrets, we first desensitize the original dataset and then release it along with all the code of MC-SQL; all the results in this paper are obtained from the desensitized version.) Although the number of tables in ESQL is small, the number of headers in each table is several times that in a WikiSQL table, so the dataset still covers a wealth of information. The dataset is divided into 10,000 training questions, 1,000 development questions, and 2,000 test questions. In order to simulate the challenge of zero-shot tables, the training set contains only 10 of the tables, while the development set and the test set contain all the tables. We extract the questions from the development and test sets that rely on the remaining 7 tables as the respective zero-shot subsets.

Following previous approaches Zhong et al. (2017); Hwang et al. (2019), we adopt logical form (LF) accuracy and execution (EX) accuracy as the evaluation metrics. Here, LF evaluates the literal accuracy of the total SQL query and its clauses, and EX evaluates the accuracy of the results obtained by executing the SQL query.
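The difference between the two metrics can be shown with a small sketch. It assumes a query is represented by its filled slots (a dictionary with `sel`, `agg`, and a tuple of `(column, operator, value)` conditions), that LF matching ignores the order of WHERE conditions, and that queries are executed against an in-memory SQLite table named `t`. These choices and the function names are hypothetical and are not taken from the official evaluation scripts.

```python
import sqlite3

ALLOWED_OPS = {"=", ">", "<"}


def lf_match(pred: dict, gold: dict) -> bool:
    """Logical form match: all slots agree (condition order ignored)."""
    def norm(q):
        return (q["sel"], q["agg"], frozenset(q["conds"]))
    return norm(pred) == norm(gold)


def to_sql(q: dict, table: str = "t"):
    """Render a slot dictionary as a parameterized SQL string."""
    col = '"{}"'.format(q["sel"])
    sel = col if q["agg"] == "NONE" else "{}({})".format(q["agg"], col)
    sql = "SELECT {} FROM {}".format(sel, table)
    params = []
    for k, (column, op, value) in enumerate(q["conds"]):
        if op not in ALLOWED_OPS:
            raise ValueError("unsupported operator: {}".format(op))
        sql += (" WHERE " if k == 0 else " AND ") + '"{}" {} ?'.format(column, op)
        params.append(value)
    return sql, params


def ex_match(conn: sqlite3.Connection, pred: dict, gold: dict) -> bool:
    """Execution match: both queries return the same rows."""
    def run(q):
        sql, params = to_sql(q)
        return sorted(conn.execute(sql, params).fetchall(), key=repr)
    try:
        return run(pred) == run(gold)
    except (sqlite3.Error, ValueError):
        return False  # a query that cannot be executed counts as wrong


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t ("Name" TEXT, "Relation" TEXT, "Age" INTEGER)')
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [("Edward", "son", 12), ("Mary", "daughter", 30)])
gold = {"sel": "Name", "agg": "NONE", "conds": (("Relation", "=", "son"),)}
pred = {"sel": "Name", "agg": "NONE", "conds": (("Age", "<", 20),)}
```

In this toy table the prediction uses a different condition than the gold query, so it fails the LF check, yet both queries return the same row and the EX check passes. This is why EX accuracy is usually at least as high as LF accuracy.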

Implementation Details

We perform all the experiments on an NVIDIA Tesla V100 GPU. In the experiments, all the BERT models are of the base version. The following hyperparameters are tuned on the development sets: (1) The filtering threshold is set to 0.9 for both datasets. (2) The layer number of all BiLSTMs is set to 2. (3) The hidden state size is set to 100. (4) The character embedding size is set to 128. (5) The type embedding size is set to 32. (6) The number of sampling tasks is set to 10,000 for WikiSQL and 2,500 for ESQL. (7) For WikiSQL, both $N$ and $K$ in the $N$-way $K$-shot setting are set to 4. For ESQL, they are set to 1 and 4, respectively. (8) The weight $\gamma$ that joins the two losses in Algorithm 1 is set to 0.3 for WikiSQL and 0.5 for ESQL. (9) For the learning rates in Algorithm 1, BERT and the sub-modules are trained with separate values. Specifically, is set to and is set to . Similarly, is set to and is set to .

Overall Results on WikiSQL

Approach Dev LF Dev EX Test LF Test EX
Seq2SQL 49.5 60.8 48.3 59.4
Coarse2Fine 72.5 79.0 71.7 78.5
Auxiliary Mapping 76.0 82.3 75.0 81.7
SQLova (-) 80.3 85.8 79.4 85.2
SQLova (*) 81.6 87.2 80.7 86.2
X-SQL (*) 83.8 89.5 83.3 88.7
HydraNet (*) 83.6 89.1 83.8 89.2
TaBERT-k1 (-) 83.1 88.9 83.1 88.4
TaBERT-k3 (-) 84.0 89.6 83.7 89.1
MC-SQL (-) 84.1 89.7 83.7 89.4
Table 1: Overall results on WikiSQL. “x(-)” denotes the model x with BERT-base. “x(*)” denotes the model x with BERT-large or larger pre-trained model, such as MT-DNN Liu et al. (2019) in X-SQL. k1 and k3 indicate that the model considers 1 and 3 rows of related content for one question, respectively.
Dataset Model SC SA WN WC WO WV LF
WikiSQL SQLova 96.7 / 96.3 90.1 / 90.3 98.4 / 98.2 94.1 / 93.6 97.1 / 96.8 94.8 / 94.3 80.2 / 79.7
TaBERT-k1 97.2 / 97.1 90.5 / 90.6 98.9 / 98.8 96.1 / 96.1 97.9 / 97.8 96.7 / 96.6 83.1 / 83.1
TaBERT-k3 97.3 / 97.1 91.1 / 91.2 98.8 / 98.7 96.6 / 96.4 97.5 / 97.5 96.6 / 96.2 83.9 / 83.7
MC-SQL 96.9 / 96.4 90.5 / 90.6 99.1 / 98.8 97.9 / 97.8 97.5 / 97.8 96.7 / 96.9 84.1 / 83.7
 w/o TC 97.0 / 96.5 89.8 / 90.0 98.6 / 98.3 94.5 / 93.7 97.2 / 97.0 94.7 / 94.7 79.9 / 79.2
 w/o VL 97.0 / 96.7 90.4 / 90.8 99.0 / 98.7 98.0 / 97.6 97.5 / 97.2 95.6 / 95.5 82.9 / 83.0
 w/o ML 96.5 / 96.2 90.4 / 90.4 98.9 / 98.7 97.8 / 97.4 97.5 / 97.4 96.5 / 96.1 83.2 / 82.9
ESQL SQLova 96.2 / 95.9 98.9 / 99.0 98.5 / 98.4 84.6 / 84.1 96.5 / 95.8 89.9 / 89.6 72.0 / 71.5
MC-SQL 97.2 / 97.3 99.1 / 99.2 98.9 / 98.9 93.6 / 93.3 97.5 / 96.8 92.9 / 92.6 82.8 / 82.7
 w/o TC 95.9 / 96.1 99.2 / 99.1 98.8 / 98.3 84.5 / 84.4 96.7 / 96.2 90.5 / 90.3 72.9 / 72.1
 w/o VL 96.5 / 96.7 99.3 / 98.9 98.9 / 98.8 93.5 / 93.5 97.4 / 96.9 92.0 / 91.8 82.1 / 81.9
 w/o ML 96.2 / 96.0 98.8 / 98.9 98.9 / 98.8 92.4 / 92.7 97.5 / 96.7 92.7 / 92.3 82.3 / 81.9
WikiSQL (zero-shot) SQLova 95.8 / 95.2 89.7 / 89.3 97.6 / 97.4 91.1 / 90.4 95.9 / 95.7 90.1 / 90.5 74.7 / 72.8
TaBERT-k1 96.6 / 96.4 91.0 / 91.0 98.6 / 98.4 94.8 / 94.6 97.7 / 97.5 95.3 / 94.6 81.3 / 80.5
TaBERT-k3 96.7 / 96.4 91.6 / 91.5 98.2 / 98.2 95.1 / 95.0 96.8 / 97.0 94.9 / 94.2 82.0 / 81.2
MC-SQL 96.4 / 95.5 91.1 / 91.0 98.7 / 98.1 96.6 / 96.3 97.1 / 96.7 94.8 / 94.2 82.4 / 80.5
 w/o TC 96.2 / 95.7 91.0 / 90.5 97.6 / 97.7 91.5 / 90.7 96.2 / 96.1 90.5 / 90.8 75.8 / 73.6
 w/o VL 96.2 / 95.8 90.6 / 90.9 98.7 / 98.0 97.1 / 96.3 97.1 / 96.3 91.7 / 92.1 79.0 / 79.1
 w/o ML 95.7 / 95.0 90.4 / 90.2 98.5 / 98.2 96.0 / 95.8 96.8 / 96.7 94.0 / 93.5 81.2 / 79.4
ESQL (zero-shot) SQLova 94.3 / 94.0 97.8 / 97.9 97.3 / 97.0 80.5 / 80.7 95.9 / 94.6 87.8 / 86.7 62.9 / 61.2
MC-SQL 94.6 / 94.2 98.0 / 98.0 97.5 / 97.3 93.7 / 92.0 96.2 / 94.8 91.9 / 90.5 76.7 / 74.8
 w/o TC 94.4 / 94.1 98.1 / 98.2 97.1 / 97.2 80.7 / 80.6 95.5 / 94.2 88.4 / 87.6 64.7 / 63.3
 w/o VL 93.8 / 94.0 98.0 / 98.1 97.4 / 97.2 92.6 / 91.1 95.1 / 94.8 90.9 / 90.1 75.7 / 73.7
 w/o ML 93.5 / 93.0 97.7 / 97.9 97.4 / 96.9 93.2 / 91.8 96.0 / 94.3 91.2 / 90.2 75.2 / 72.9
Table 2: Results of sub-tasks on WikiSQL and ESQL. / denotes the results of the dev/test sets.

We first compared our approach with several existing text-to-SQL approaches on the public benchmark WikiSQL. Seq2SQL Zhong et al. (2017), Coarse2Fine Dong and Lapata (2018), and Auxiliary Mapping Chang et al. (2020) are all sequence-to-sequence (Seq2Seq) based models. SQLova Hwang et al. (2019) replaces the Seq2Seq framework with the multi-submodule framework. X-SQL He et al. (2019) and HydraNet Lyu et al. (2020) improve this framework with MT-DNN Liu et al. (2019) and a pair-wise ranking mechanism, respectively, and thus achieve better results. TaBERT Yin et al. (2020) is a state-of-the-art language model used to encode tabular data. It is pre-trained on a massive corpus of question-table pairs and also uses content information. In this paper, we ignore all results obtained with the execution guiding (EG) trick Wang et al. (2018), because EG works on the premise that the result of the generated SQL query must not be empty, which is unreasonable in realistic scenarios.

The overall experimental results on WikiSQL are reported in Table 1. Except for TaBERT, for which we use the official API, all other comparison results are taken directly from the original papers. On LF accuracy, our approach achieves state-of-the-art results on the development set and ranks second only to HydraNet (-0.1%) on the test set. On EX accuracy, our approach achieves state-of-the-art results on both sets. Notably, our results are achieved by only utilizing the base version of BERT. After ignoring the baselines that use larger pre-trained models (“(*)” in Table 1), our approach achieves significant improvements on both LF (4.3%) and EX (4.2%) accuracy on the test set. In addition, compared with the table-specific pre-trained model, our model still has advantages without pre-training on a table corpus. The performance of Seq2SQL, Coarse2Fine, and Auxiliary Mapping is limited by decoding without SQL syntax constraints. SQLova, X-SQL, and HydraNet ignore the abundant information from table content, so their performance is also limited. TaBERT makes use of content information and performs tabular-specific pre-training, thus achieving better results. There are two possible reasons why our approach outperforms TaBERT. On the one hand, TaBERT uses table information coarsely (the SELECT clause does not actually require content information), while we provide a more fine-grained usage (only for the WHERE clause). On the other hand, the meta-learning process gives the model a stronger generalization ability.

Detailed Analysis

Ablation Test

To explore the contributions of various components of our MC-SQL model, we compared the following settings on both the datasets.

  • w/o table content(TC) We removed all the processes in WN, WC, and WV that related to table content. For example, (5) is converted to .

  • w/o value linking(VL) We retained the processes related to TC but removed the value linking in WV, i.e., (11) is converted to after removing.

  • w/o meta-learning(ML) We replaced the meta-learning strategy with the traditional mini-batch strategy.

The detailed results on the full sets of WikiSQL and ESQL are shown in the upper two blocks of Table 2, respectively. (The SQL queries in ESQL also include other keywords such as ORDER BY. We design modules similar to those for SELECT and WHERE above to handle them. The detailed results of these sub-tasks are available on the dataset homepage.) MC-SQL equipped with all the components achieves the best results on LF accuracy and most sub-tasks, significantly improving over the baseline SQLova on both WikiSQL (4.0%) and ESQL (10.2%). Compared with TaBERT, our approach also has an advantage in overall performance on WikiSQL (0.2% on Dev.) owing to the improvement on WC. This is probably due to the fine-grained use of content information for specific modules. By removing TC, the overall performance (LF) declined by approximately 3.5% and 10.6% on the two datasets, respectively, which demonstrates the significance of content. The performance drop caused by removing VL also confirms the contribution of value linking. Removing ML brings a certain drop on both datasets; however, the drop on ESQL (-1.8%) is sharper than that on WikiSQL (-0.8%). A possible reason is that WikiSQL is an open-domain dataset, so generalization is harder to learn on it than on the domain-specific ESQL. On closer observation, the contribution of TC is mainly reflected in the four sub-tasks of WN, WC, WO, and WV. The improvement on WO is mainly attributed to the improvement on WC, because the former depends on the result of the latter. In addition, meta-learning is helpful for all sub-tasks and has the most significant improvements on the sub-tasks that are not enhanced by table content, such as SC and SA. Interestingly, the performance on WC and WO sometimes becomes better after removing VL, which reveals that VL can introduce noise when predicting $COL. However, due to its significant improvement on WV, VL is still helpful for the overall performance.

Zero-shot Test

We evaluated our model on the zero-shot subsets of both datasets. The results are shown in the bottom two blocks of Table 2. In terms of overall results, MC-SQL achieves greater improvements over SQLova on the zero-shot subsets of both WikiSQL (7.7% vs 4.0%) and ESQL (13.6% vs 10.2%). This shows that our approach is promising for enhancing the model's ability to handle zero-shot tables. Furthermore, the improvement on each subtask also increases, especially on WC and WV. The contribution of table content is greater on zero-shot tables, which is consistent with our intuition. The relatively sharper performance drop caused by removing ML also indicates that our meta-learning strategy is suitable for dealing with zero-shot tables. Notably, in addition to SC and SA, meta-learning also contributes to the WHERE clause when handling zero-shot tables. It is interesting that the performance on ESQL is generally lower than that on WikiSQL, whereas the improvement brought by meta-learning on ESQL is greater than that on WikiSQL. We speculate that this is because fewer training tables result in poorer performance, and meta-learning, which is well suited to settings with few samples, therefore achieves greater improvements. Compared to TaBERT, our approach leads on the zero-shot development set (0.4%) but lags behind on the zero-shot test set (-0.7%). Specifically, TaBERT works better on the two sub-tasks of SA and WO. It benefits from its joint pre-training on tabular data and questions, thereby learning a stronger mapping between aggregation operators and questions.

Varied Sizes of Training Data

Figure 5: LF on WikiSQL with proportions of training data.
Figure 6: LF on ESQL with proportions of training data.

To simulate the scenario of zero-shot tables from another aspect, we tested the performance of the model using different proportions of training data. The results of WikiSQL and ESQL are shown in Figure 5 and Figure 6, respectively. The MC-SQL equipped with all components always maintains optimal performance with different sizes of training data. When the training data is small, the improvement achieved by MC-SQL over SQLova is more significant, especially on WikiSQL. In addition, the results on both datasets demonstrate that the less training data, the more significant the improvement brought by meta-learning. Note that changes in training data have less impact on ESQL than that on WikiSQL. It is probably because of the few tables and the specific domain of ESQL.

Related Work

In recent research, mainstream text-to-SQL approaches mainly include two directions.

One direction is represented by Spider Yu et al. (2018c), which is a benchmark for the multi-table text-to-SQL task Yu et al. (2018a); Xu et al. (2017); Yu et al. (2018b); Guo et al. (2019); Wang et al. (2020). However, even though the evaluation does not require recognizing the values in the WHERE clause, the state-of-the-art performance (65.6% achieved by Wang et al. (2020)) on this task is still far from realistic applications.

The other direction is represented by WikiSQL Zhong et al. (2017), a benchmark for the single-table text-to-SQL task, which is the focus of this paper. Earlier single-table text-to-SQL approaches Zhong et al. (2017); Xu et al. (2017); Yu et al. (2018a); Dong and Lapata (2018); Chang et al. (2020) are mainly based on the Seq2Seq framework but ignore the characteristics of SQL skeletons, so their performance is limited. Hwang et al. (2019) were the first to break the overall task into several subtasks. Their framework enables each sub-module to focus on its corresponding subtask, thereby overcoming the bottleneck caused by a single model. In addition, the large-scale pre-trained language model of Devlin et al. (2019) greatly improved model performance. Since then, almost all work on WikiSQL has followed this framework of a pre-trained model with multiple sub-modules. He et al. (2019) leverage the type information of table headers and replace BERT with MT-DNN, a stronger pre-trained model obtained through multi-task learning. Lyu et al. (2020) propose a pair-wise ranking mechanism for each question-header pair and achieve better results on WikiSQL. The significant difference between our work and these approaches is that we take advantage of table content. The closest work to ours is TaBERT Yin et al. (2020), which also encodes table content and uses a large number of question-table pairs for pre-training. However, its use of content lacks specificity, i.e., content encoding is used for all subtasks. Intuitively, content information is helpful only for WHERE-clause predictions. Based on this intuition, our approach uses content information only on specific subtasks, which makes it more accurate. In addition, the use of meta-learning further helps our model obtain stronger generalization capabilities.
More importantly, our approach can still play an important role in scenarios that lack models pre-trained on massive tabular data (such as Chinese tables).
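To make the intuition about table content and WHERE-clause prediction concrete, the following is a minimal sketch, not the model proposed in this paper. It replaces the learned semantic relevance with a plain token-overlap score, and the names `tokenize`, `cell_score`, and `infer_where_column` are hypothetical helpers introduced only for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cell_score(question_tokens: List[str], cell: str) -> float:
    """Fraction of the cell's tokens that also occur in the question.

    Assumption: token overlap is a crude stand-in for a learned
    question-content relevance score.
    """
    cell_tokens = tokenize(cell)
    if not cell_tokens:
        return 0.0
    question_vocab = set(question_tokens)
    hits = sum(1 for token in cell_tokens if token in question_vocab)
    return hits / len(cell_tokens)


def infer_where_column(
    question: str, table: Dict[str, List[str]]
) -> Optional[Tuple[str, str, float]]:
    """Return (header, cell, score) for the best-matching cell, or None.

    The header names are never compared with the question here: the
    column is inferred from its content alone, which is the situation
    of interest for zero-shot tables.
    """
    question_tokens = tokenize(question)
    best: Optional[Tuple[str, str, float]] = None
    for header, cells in table.items():
        for cell in cells:
            score = cell_score(question_tokens, cell)
            if score > 0 and (best is None or score > best[2]):
                best = (header, cell, score)
    return best
```

For a table with columns `Player`, `Team`, and `Position`, the question "Which position did Scottie Pippen play?" matches the cell "Scottie Pippen", so the sketch suggests `Player` as the WHERE column and the cell as its value, even though the word "player" never appears in the question.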


In this paper, we propose MC-SQL, a new single-table text-to-SQL approach that focuses on handling zero-shot tables. On the one hand, our approach takes advantage of table content to enhance the model: the potential header can be inferred from the semantic relevance between the question and the content. On the other hand, our approach learns generalization capability across different tables through meta-learning, using a two-step gradient update to force the model to learn generic knowledge. The experimental results show that our approach improves over the baselines on multiple benchmarks. More importantly, the improvements are even larger on zero-shot tables. In future work, we will try to classify different tables and combine meta-learning with reinforcement learning to further explore generalization capabilities.
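The two-step gradient update can be illustrated with a toy example. The sketch below is a first-order approximation on a one-parameter regression model, written under our own assumptions rather than taken from the MC-SQL implementation: the support and query batches stand in for examples drawn from different tables, and `inner_lr`, `outer_lr`, and the mixing weight `gamma` are hypothetical hyperparameters.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def mse(w: float, batch: List[Example]) -> float:
    """Mean squared error of the toy model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def grad_mse(w: float, batch: List[Example]) -> float:
    """Gradient of the mean squared error with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def two_step_update(
    w: float,
    support: List[Example],
    query: List[Example],
    inner_lr: float = 0.05,
    outer_lr: float = 0.05,
    gamma: float = 0.5,
) -> float:
    """One meta-training step with a two-step gradient update.

    Step 1: take a temporary gradient step on the support batch.
    Step 2: evaluate the temporary parameter on the query batch, which is
    assumed to come from tables not seen in the support batch, and update
    the original parameter with a weighted mix of both gradients.
    First-order approximation: the query gradient is not backpropagated
    through the temporary step.
    """
    g_support = grad_mse(w, support)
    w_tmp = w - inner_lr * g_support          # step 1: temporary update
    g_query = grad_mse(w_tmp, query)          # step 2: check on unseen data
    return w - outer_lr * (gamma * g_support + (1.0 - gamma) * g_query)
```

Because the query loss is measured after the temporary step, the parameter is rewarded only for support updates that also help on examples from other tables, which is the generalization pressure described above.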


The research in this paper was partially supported by the National Key Research and Development Program of China (grants 2018YFC0830200 and 2017YFB1002801), the Natural Science Foundation of China (grant U1736204), and the Judicial Big Data Research Centre, School of Law at Southeast University.


  • S. Chang, P. Liu, Y. Tang, J. Huang, X. He, and B. Zhou (2020) Zero-shot text-to-sql learning with auxiliary task. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 7488–7495. External Links: Link Cited by: Introduction, Experimental Setup, Overall Results on WikiSQL, Related Work.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. External Links: Link, Document Cited by: Introduction, Encoding Module, Related Work.
  • L. Dong and M. Lapata (2018) Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, I. Gurevych and Y. Miyao (Eds.), pp. 731–742. External Links: Link, Document Cited by: Overall Results on WikiSQL, Related Work.
  • J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J. Lou, T. Liu, and D. Zhang (2019) Towards complex text-to-sql in cross-domain database with intermediate representation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 4524–4535. External Links: Link, Document Cited by: Related Work.
  • P. He, Y. Mao, K. Chakrabarti, and W. Chen (2019) X-SQL: reinforce schema representation with context. CoRR abs/1908.08113. External Links: Link, 1908.08113 Cited by: Introduction, Preliminaries, Overall Results on WikiSQL, Related Work.
  • P. Huang, C. Wang, R. Singh, W. Yih, and X. He (2018) Natural language to structured query generation via meta-learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 732–738. External Links: Link, Document Cited by: Zero-Shot Meta Learning Framework.
  • W. Hwang, J. Yim, S. Park, and M. Seo (2019) A comprehensive exploration on wikisql with table-aware word contextualization. CoRR abs/1902.01069. External Links: Link, 1902.01069 Cited by: Introduction, Preliminaries, Experimental Setup, Overall Results on WikiSQL, Related Work.
  • X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 4487–4496. External Links: Link, Document Cited by: Overall Results on WikiSQL, Table 1.
  • Q. Lyu, K. Chakrabarti, S. Hathi, S. Kundu, J. Zhang, and Z. Chen (2020) Hybrid ranking network for text-to-sql. CoRR abs/2008.04759. External Links: Link, 2008.04759 Cited by: Introduction, Preliminaries, Overall Results on WikiSQL, Related Work.
  • J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 4077–4087. External Links: Link Cited by: Zero-Shot Meta Learning Framework.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 3630–3638. External Links: Link Cited by: Zero-Shot Meta Learning Framework.
  • B. Wang, R. Shin, X. Liu, O. Polozov, and M. Richardson (2020) RAT-SQL: relation-aware schema encoding and linking for text-to-sql parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 7567–7578. External Links: Link Cited by: Related Work.
  • C. Wang, P. Huang, A. Polozov, M. Brockschmidt, and R. Singh (2018) Execution-guided neural program decoding. CoRR abs/1807.03100. External Links: Link, 1807.03100 Cited by: Overall Results on WikiSQL.
  • X. Xu, C. Liu, and D. Song (2017) SQLNet: generating structured queries from natural language without reinforcement learning. CoRR abs/1711.04436. External Links: Link, 1711.04436 Cited by: Related Work, Related Work.
  • P. Yin, G. Neubig, W. Yih, and S. Riedel (2020) TaBERT: pretraining for joint understanding of textual and tabular data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 8413–8426. External Links: Link Cited by: Overall Results on WikiSQL.
  • T. Yu, Z. Li, Z. Zhang, R. Zhang, and D. R. Radev (2018a) TypeSQL: knowledge-based type-aware neural text-to-sql generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 588–594. External Links: Link, Document Cited by: Related Work, Related Work.
  • T. Yu, M. Yasunaga, K. Yang, R. Zhang, D. Wang, Z. Li, and D. R. Radev (2018b) SyntaxSQLNet: syntax tree networks for complex and cross-domain text-to-sql task. CoRR abs/1810.05237. External Links: Link, 1810.05237 Cited by: Related Work.
  • T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. R. Radev (2018c) Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 3911–3921. External Links: Link, Document Cited by: Introduction, Related Work.
  • V. Zhong, C. Xiong, and R. Socher (2017) Seq2SQL: generating structured queries from natural language using reinforcement learning. CoRR abs/1709.00103. External Links: Link, 1709.00103 Cited by: Introduction, Experimental Setup, Experimental Setup, Overall Results on WikiSQL, Related Work.