Zero-shot Text-to-SQL Learning with Auxiliary Task

Recent years have seen great success in the use of neural seq2seq models on the text-to-SQL task. However, little work has paid attention to how these models generalize to realistic unseen data, which naturally raises a question: does this impressive performance signify a perfect generalization model, or are there still some limitations? In this paper, we first diagnose the bottleneck of text-to-SQL task by providing a new testbed, in which we observe that existing models present poor generalization ability on rarely-seen data. The above analysis encourages us to design a simple but effective auxiliary task, which serves as a supportive model as well as a regularization term to the generation task to increase the models generalization. Experimentally, We evaluate our models on a large text-to-SQL dataset WikiSQL. Compared to a strong baseline coarse-to-fine model, our models improve over the baseline by more than 3 accuracy on the whole dataset. More interestingly, on a zero-shot subset test of WikiSQL, our models achieve 5 clearly demonstrating its superior generalizability.


page 1

page 2

page 3

page 4


Leveraging Table Content for Zero-shot Text-to-SQL with Meta-Learning

Single-table text-to-SQL aims to transform a natural language question i...

Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment

In text-to-SQL tasks – as in much of NLP – compositional generalization ...

KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers

The goal of database question answering is to enable natural language qu...

Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization

Recently, there has been significant progress in studying neural network...

End-to-End Cross-Domain Text-to-SQL Semantic Parsing with Auxiliary Task

In this work, we focus on two crucial components in the cross-domain tex...

One-Shot Learning for Text-to-SQL Generation

Most deep learning approaches for text-to-SQL generation are limited to ...

Evaluating the Text-to-SQL Capabilities of Large Language Models

We perform an empirical evaluation of Text-to-SQL capabilities of the Co...

1 Introduction

Text-to-SQL has recently attracted much attention as a sequence-to-sequence learning problem due to its practical usage for search and question answering dong2016acl; zhong2017seq2sql; xu2017sqlnet; cai2018ijcai; yu2018naacl; dong2018acl; finegan-dollak2018acl; Yu2018emnlp; wang2018EG; Shi2018IncSQL. The performance on some text-to-SQL tasks has been improved progressively dong2018acl; wang2018EG; Shi2018IncSQL in recent years. As pointed out in finegan-dollak2018acl, when evaluating models on text-to-SQL tasks, we need to measure how well the models generalize to realistic unseen data, which is very common in the real applications.

Most of the previous text-to-SQL tasks assumed that all questions came from a fixed database and hence share one global table schema. This assumption is useful for some specific applications such as booking flights dahl1994atis and searching GEO zelle1996geo, but not applicable to many real scenarios when different questions involve querying on different tables. zhong2017seq2sql addressed this problem and generated a dataset called WikiSQL, which is by far the largest text-to-SQL benchmark dataset.

In WikiSQL many tables have different table schemas and each table has its own limited labeled data. One common approach is to encode the table column names in the input to the training of an encoder-decoder model yu2018naacl; dong2018acl. yu2018naacl

proposed to utilize high-level type information to better understand rare entities and numbers in the natural language questions and encode these information from the input. These type information come from either external knowledge graph, a column or a number. This approach of TypeSQL

yu2018naacl was proven to be effective on WikiSQL when it is required for the model to generalize to new tables.

We observe that a text-to-SQL encoder-decoder model implicitly learn a mapping between entities in natural language questions to column names in tables. The model is likely to fail on mapping to new table column names that it never sees before. Hence if we learn a better mapping from question words to table column names, then the text-to-SQL generation model would be better generalized to new tables. With this in mind, we introduce an auxiliary model to enhance the existing generation model.

Specifically, we propose a novel auxiliary mapping task besides traditional text-to-SQL generation task. Here we explicitly model the mapping from natural language entities to table column names. The mapping model serves as an supportive model to the specific text-to-SQL task as well as regularization to the generation model to increase its generalization. These two tasks are trained together with a multi-task learning loss. In practice, we adopt the coarse-to-fine decoder as the prototype of our generation model due to their superior performance in text-to-SQL tasks. And the generation model is further improved by introducing bi-attention layer (question-to-table attention and table-to-question attention) Seo2017iclr and attentive pooling layer bowen2016arxv.

We test our models on WikiSQL, with emphasis on a zero-shot subset, where the table schemas of the test data never occur in the training data. Compared to the coarse-to-fine model, our models improve over the baselines by absolute in accuracy, achieve execution accuracy of . In particular, on the zero-shot test part of WikiSQL, our models achieve even more gain, with improvement in accuracy over the baseline model. 333Our code will be released after paper is reviewed.

In summary our contributions in this paper are three-fold:

1) We find the existing testbed covers up the true generalization behavior of neural text-to-SQL models, and propose a new zero-shot test setting.

2) We improve the generalization ability of existing models by introducing an auxiliary task, which can explicitly learn the mapping between entities in the question and column names in the table.

3) The zero-shot evaluation not only shows the superior performance of our proposed method compared with the strong baseline but makes it possible to explain where the major gain comes from.

2 Background

2.1 Text-to-SQL Task


task can be formulated as a conditional text generation problem, in which a question

and a table are given, the goal is to generate a SQL language .

Figure 1 illustrates WikiSQL output format which consists of three components: AGG, SEL, and WHERE. Particularly, WHERE clause contains multiple conditions where each condition is a triplet with the form of (condition_column, condition_operation, condition_value).

Encoding Layer

The question and corresponding table schema

are first translated into the hidden representation by a BiLSTM sentence encoder:

where is embedding of question word and is the representation of a column name which consists of words . The first and last hidden state of a BiLSTM over is concatenated as .

Decoding Layer

Different from traditional text generation tasks, which share a decoder cross time-steps, in Text-to-SQL task, different decoders are designed in terms of different operations. Generally, these decoders can be classified two types:

cls for classifier, and pt for pointer.

cls is used for the operations, such as AGG and COND_OP:


where is one decoder hidden representation.

pt can be used to choose a proper column or word from a set of column or words. Formally, We refer to

as a pointer-query vector and

as a set of pointer-key vectors, and predict the probability of choosing each key:


can be obtained as:



Figure 1: SQL Sketch. The tokens starting with “$” are slots to fill. “*” indicates zero or more AND clauses.
Figure 2: Break down accuracy of a strong baseline model dong2018acl. X-axis represents different subsets of WikiSQL test set, split by how many times a table occurs in training data. Splitting details are in Table 2.

2.2 Diagnosing the Bottleneck of Text-to-SQL

The existing testbed covers up the true generalization behavior of existing models. To address this problem, we provide a new testbed by breaking down the testing samples. Specifically, we analyze the generalization problem on table aware Text-to-SQL tasks, by testing previous state-of-the-art model dong2018acl on different tables which occur different times in training set. We observe the following problems based on Figure 2:

  • WHERE clause performance is more sensitive to how many times the table has been seen in the training data;

  • The performance of WHERE would get a big drop once the table in test set is not present in the training data, i.e. zero-shot testing case.

Despite of the importance of the generalization problem of unseen tables, few work explored it due to the lack of appropriate datasets. The WikiSQL dataset was originally constructed to ensure that the training and test set have disjoint set of tables, which can provide a test bed for generalization test of new tables. However, we find that the current version of WikiSQL test cannot guarantee this because different tables extracted from different wiki pages may share the same table schema (i.e. table column names), even though their table content may not be the same.

The above problems motivate us to explicitly model the mapping between words in question and table column names, and test the model generalization to new tables on the true zero-shot sub testset of WikiSQL.

3 Model

Our model consists of a seq2seq model for the SQL generation task (largely following the baseline coarse-to-fine model), and a mapping model as a auxiliary task to explicitly map question words to table schema (column names).

3.1 Main Generation Model


we follow section 2.1 to obtain question and schema hidden representation and . To enhance the interaction between question words and column name , a bi-attention is used to generate final question and table schema representation:

Considering the nature of structured SQL, we follow previous works to use different sub-decoders for AGG, SEL and WHERE clause. Especially, our WHERE decoder is adapted from the baseline model dong2018acl.

Agg and Sel Decoder

Each SQL only contains one AGG and SEL, so we generate AGG and SEL based on entire question representation. Since different words or phrases in question do not equally contribute to the decisions of AGG and SEL, we employ an attentive pooling layer over to generate final hidden representation for AGG and SEL.

We feed into cls layer generate the aggregation operation AGG and meassure the similarity score between and each column name to predict SEL by Pt layer in (2):

Figure 3: Illustration of our model. The upper figure is the text-to-SQL generation model which consists of three parts: encoder (lower left), AGG/SEL decoder (upper left) and where decoder (upper right). Lower right is WHERE decoder cell. The bottom figure is our auxiliary mapping model with the ground-truth label of an example. Question word is mapped to a column only when it is tagged as part of a condition value ( or ).

WHERE Decoder

We took the WHERE decoder from the-state-of-the-art model dong2018acl, which first generates a slot sketch of WHERE clause and transform the SQL generation into a slot filling problem. There are 35-category WHERE clauses in WikiSQL and each one is subsequence of WHERE clause which skip the COND_COL and COND_VAL. For example, ”WHERE AND ” is a sketch of WHERE clause which has 2 conditions. We first predict the sketch based on :

where .

Once is predicted, we obtain the COND_OP sequence it represents. We embed each operation in COND_OP sequence and feed them into WHERE-decoder cell. As Figure 3 shows, the WHERE-decoder cell takes one COND_OP as input and output COND_COL and COND_VAL for each decoder time step, while each decoder time step has 3 LSTM time steps. For th condition, ,, are and and and output ,

are probability distribution of the index of

and the span of . We do not have output for each because the input of next time step is given by pre-predicted . The lstm-cell is updated 3 times for each decoder time step:

The output layers for COND_COL and COND_VAL are both pointer layer which are pointed to column names and question words to predict COND_COL index and the left and right end , of the span of COND_VAL in question:

3.2 Auxiliary Mapping Model

For a SQL query, each condition consists of three parts, COND_COL, COND_OP and COND_VAL. Our mapping model aims to discover the mapping between condition column and condition value. The mapping prediction is based on question and table schema representation and , which are shared with generation model. Optimization based on mapping task loss can improve the question and table schema representation. An intuitive way to achieve mapping is to directly learn a mapping function from each word in question to column names. However, since not all words in a question are condition values, it’s challenging to take all words into consideration. To address this problem, we propose a two-step mapping model, in which we first learn a detector to screen out condition values, and then we learn a mapping function from condition values to column names.

Condition Value Detection

Because the condition value sometimes covers multiple words, we label the span for condition values in questions with typical nadeau2007survey tags. We notice sometimes condition column names appear exactly in question, so the span of column name in question is also labeled with tags , during training when a column name appear in question. Altogether we have five tags , , , , , which represent the first word of condition column, subsequent word of condition column, the first word of condition value, subsequent word of condition value and outside, respectively. Figure 3 illustrates our mapping model by giving the ground-truth label for an example.

The mapping model takes encoding vector of question words and column names as input. Mapping model first predict gate :

where and are tagging parameters.

Value-column Mapping

We only learn the mapping for question words which are tagged as , :

3.3 Loss Function

We refer to the following as generation task loss and as mapping task loss.

where represents different operations during decoder phase. and denote the probability distribution of real label and predicted probability distribution. represents how many times words in question have been predicted as condition values.

Finally, the overall loss can be written as:

where is the number of training samples and is hyper-parameter.

4 Experimental Setup

4.1 Dataset

WikiSQL has over 20K tables and 80K questions corresponding to these tables. This dataset was designed for translating natural language questions to SQL queries using the corresponding table columns without access to the table content. This dataset is further split into training and testing sets that are separately obtained from different Wiki pages, assuming there is no overlap of tables between training and testing sets. However, we find in this split, question-table pairs in test set have the same table schema as those in the training set. This is because even train and test tables were obtained from different Wiki pages, these tables could still have the same table schema. For example, different football teams have their own Wiki page but each one have a table with the same schema recording match information.

We split the test set based on the number of shots (the number of a table occurrences in training data). We report experiments on the original full WikiSQL test set as well as different subset based on the number of shots, especially on the zero-shot testing case. Statistics of new test sets are in table 2.

SEQ2SQL zhong2017seq2sql - 59.4% 90.1% 88.9% 60.2%
SQLNET xu2017sqlnet 61.3% 78.0% 90.3% 90.9% 71.9%
TypeSQL yu2018naacl 66.7% 73.5% 90.5% 92.2% 77.8%
COARSE2FINE dong2018acl 71.7% 78.5% 90.4% 92.4% 84.2%
Gen-model w/o AP 72.8% 79.4% 90.2% 93.0% 84.7%
Gen-model 73.5% 80.1% 90.3% 94.2% 84.8%
Full-model 75.0% 81.7% 90.5% 94.5% 86.7%
Table 1: Overall and break down results on full WikiSQL dataset. , are accuracy numbers of query match (ignore the order of conditions) and execution result, and , , are the accuracy of AGG, SEL, WHERE clauses. The upper part are baseline models, and the lower part are our generation model Gen-model and the whole model Full-model which is the Gen-model with the auxiliary mapping model. Gen-model w/o AP is the generation model without attentive pooling.
dataset number of shots #questions
W-full [0,2045] 15878
W-0 0 5201
W-1 [1,5] 1700
W-2 [6,15] 1842
W-3 [16,40] 1971
W-4 [41,100] 1654
W-5 [101,500] 1887
W-6 [501,2045] 1623
Table 2: Statisitics of WikiSQL test set. W-full is original WikiSQL test set and W-0, W-1,, W-6 are subsets split by the number of shots (number of a table occurrences in the training data).

4.2 Evaluation

We follow the evaluation metrics in

xu2017sqlnet to measure the query synthesis accuracy: query-match accuracy () which measures the decoded query match the ground truth query without considering the order of conditions and execution accuracy () which measures the results from executing predicted queries. The accuracies are further break down into three categories: AGG, SEL and WHERE, as in xu2017sqlnet.

4.3 Model Configuration

We use 300-dim Glove word embedding as our pre-trained embedding. Hidden size for all LSTM is 250 and hidden size in attention function is set to 64. The loss weight is set to 0.5. A 0.5-rate dropout layer is used before each output layer. Each concatenation is followed by one full-connected layer to reduce the dimension to the original hidden or attention size. Test model is selected by the best performing model on validation set.

5 Results and Analysis

Table 1 shows the overall and breakdown results on full WikiSQL dataset. We compare our models with strong baseline models on the original WikiSQL test data. All these models have no access to table content following zhong2017seq2sql.

First our Gen-model with enhanced encoder/decoder improves over the baseline coarse-to-fine model by 1.6% in accuracy of both and . Our Gen-model mainly improves on compared to baseline models. Ablation test shows the improvement is attributed to the attentive pooling in SEL decoding.

Second our Full-model outperforms our single generation model by 1.5% and 1.6% in query-match accuracy and execution accuracy, achieving a very competitive new execution accuracy of 81.7%. Break down results show Full-model mainly improves the accuracy over Gen-model on the WHERE clause, with 1.9% accuracy gain.

(a) Where Decoder
(b) AGG/SEL Decoder
Figure 4: Accuracy of Full-model and Gen-model in different test subsets. W-0 represents zero-shot setting. The frequency of the table has been seen in the training data decrease from W-6 to W-0.
(a) Results on Unseen tables (W-0).
(b) Results on seen/unseen columns.
Figure 5: C2F, Gen-M and Full-M represent the baseline C2F model, and our proposed Gen-Model and Full-model respectively.

5.1 Training data amount

Figure 4(a) illustrates Gen-model and Full-model accuracy of WHERE clause prediction on different test subsets from Table 2. Full-model is consistently better than single Gen-model in WHERE clause prediction. The biggest gap between Full-model and Gen-model in WHERE clause accuracy is on test subset W-0. This shows that Full-model generalizes better than Gen-model for the unseen test tables. We also found that Full-model accuracy on W-4 is slightly lower than that on W-3. We believe this is due to the fact that table itself is the other fact affecting models’ performance, in addition to the amount of training tables.

Figure 4(b) again illustrates Gen-model outperforms Gen-model without attentive pooling on different amount of training data.

5.2 Zero-shot Test

Figure 5(a) illustrates the results on zero-shot test case (i.e. W-0). Our Full-model outperforms baseline coarse-to-fine model by 4.9% and 4.4% in and . The accuracy improvement over the baseline coarse-to-fine model lie on the SEL and WHERE clause, with 3.6% accuracy gain on WHERE clause over the baseline.

Figure 4(a) shows WHERE clause accuracy has a big drop on zero-shot setting (W-0) compared to few-shot setting (W-1). We further analyze the reason of this degradation by looking into how the performance is affected by whether a column name is present in the training data. On unseen test table schema, 28% column names never appear in training set, which makes question related to these columns harder. We further divide conditions in WHERE clauses into two classes, one class with condition column appearing in training, the other with condition column not appearing in training. We measure the accuracy of each class on the WHERE clause. The result is reported in Figure 5(b). Full-model outperforms single generation model by 4.2% on unseen column names and 2.1% on seen column names. This further shows the generalization ability of the Full-model.

5.3 Case Study on Zero-shot Setting

We also analyze the Full-model behavior on zero-shot test compared to the Gen-model alone. We first randomly sample 100 examples of which Full-model predicts correct on WHERE clause (Case-Correct in Table 3), while Gen-model fails. We label the failure reasons of Gen-model into four categories (one example can belong to more than one categories): (a) wrong COND_COL prediction, (b) wrong COND_VAL prediction, (c) predicting extra conditions or missing conditions and (d) others. Table 3 shows the majority of WHERE clause errors are in (a): wrong COND column name errors. We then randomly sample another set of 100 examples (Case-Wrong in Table 3): Gen-model predicts WHERE clause correctly on these examples but Full-model fails. Table 3 indicates Full-model corrects Gen-model mainly on wrong COND_COL prediction, which shows our mapping task improves column name predictionin the generation task.

Examples (a) (b) (c) others
Case-Wrong 63 22 18 4
Case-Correct 71 19 10 3
Table 3: Number of samples in each error categories.

6 Related Work

Recently neural network based approaches, especially sequence-to-sequence models have been applied to text-to-SQL successfully with progressively improving results

wang2017synthesizing; neelakantan2017iclr; iyer2017acl; yin2017acl; huang2018naacl; zhong2017seq2sql; xu2017sqlnet; cai2018ijcai; yu2018naacl; dong2018acl; finegan-dollak2018acl.

Sketch-based approach is very effective, especially on WikiSQL task zhong2017seq2sql; xu2017sqlnet; yu2018naacl. In zhong2017seq2sql SEQ2SQL model used a coarse-grained sketch: aggregation, SELECT column and WHERE clause; xu2017sqlnet used a finer sketch to align to the syntactical structure of a SQL query with three specific slot-filling models: model_COL, model_AGG, and model_OPval. In TypeSQL yu2018naacl it also adopted this sketch-based model structures. However, in dong2018acl sketch was referred to as abstractions for meaning representation, leaving out low-level details. This meaning sketch was used as an input to the final decoding.

One of the challenge for using neural seq2seq models is the need of large annotated question-query pairs. zhong2017seq2sql; cai2018ijcai have automatically generated large datasets using templates and had humans paraphrased the questions into natural language questions. WikiSQL is by far the largest text-to-SQL dataset. WikiSQL was designed for testing model’s generalization by splitting the tables in a way that there is no overlap of tables in training and testing.

Execution guided (EG) decoding was recently proposed in wang2018EG that detects and excludes faulty outputs during the decoding by conditioning on the execution of partially generated output. Adding execution guided decoding to the coarse-to-fine model improved accuracy by on the wikiSQL dataset; and adding on top of the most recent IncSQL model Shi2018IncSQL improved accuracy by . It is proven that the EG module is very effective with any generative model.

Zero-shot semantic parsing has not obtained enough attention. herzig2018decoupling applied a pipeline framework, including four independent models to achieve generalization, while our work is end-to-end trained and focusing on improving model’s generalization with an auxiliary mapping task. Zero-shot slot filling bapna2017towards also leverages the text of schema to connect language question words to column names (slots), but their model needs to predict the probability of each possible column indepentently while our model can select the column by processing the question and schema one time.

7 Conclusions and Future Work

In this paper, we propose a novel auxiliary mapping task for zero-shot text-to-SQL learning. Traditional seq2seq generation model is augmented with an explicit mapping model from question words to table schema. The generation model is first improved by an attentive pooling inside the question, and bi-directional attention flow to improve the interaction between the question and table schema. The mapping model serves as an enhancement model to text-to-SQL task as well as regularization to the generation model to increase its generalization.

We compare our model with the a strong baseline coarse-to-fine model on the original WikiSQL testset as well as on the totally unseen test tables (a subset of zero-shot testing). Experimental results show that our model outperforms baseline models on both setting. Even though the generation model is already augmented with bi-directional attention to enhance the interaction between question and table, our results and analysis demonstrate that the explicitly mapping task can further increase the capability of generalization to unseen tables.

Spider Yu2018emnlp was recently proposed as another large cross-domain text-to-SQL dataset besides WikiSQL. It has more complex SQL templates including joint tables, which brings other interesting problems except for generalization. We plan to expand our models on this new dataset in the future.


Appendix A Appendices

Error case (a): wrong Cond_col prediction

  • Table: 2-11568882-2, Header: [year , winners , score , runners up , venue , 3rd place]

  • Question: what ’s in third place that ’s going 1-0 ?

  • Ground Truth: SELECT 3rd place FROM 2-11568882-2 WHERE score = 1-0

  • Full-model Prediction: SELECT 3rd place FROM 2-11568882-2 WHERE score = 1-0

  • Gen-model Prediction: SELECT 3rd place FROM 2-11568882-2 WHERE 3rd = 1-0

Error case (b): wrong Cond_val prediction

  • Table: 1-1081235-1, Header: [name of lava dome , country , volcanic area , composition , last eruption or growth episode]

  • Question: what countries have had eruptions of growth episodes in 1986 ?

  • Ground Truth: SELECT country FROM 1-1081235-1 WHERE last eruption or growth episode = 1986

  • Full-model Prediction: SELECT country FROM 1-1081235-1 WHERE last eruption or growth episode = 1986

  • Gen-model Prediction: SELECT country FROM 1-1081235-1 WHERE last eruption or growth episode = growth episodes in 1986

Error case (c): predicting extra conditions or missing conditions

  • Table: 2-11480171-1, Header: [year , title , genre , role , director]

  • Question: what drama role does she play in 1973 ?

  • Ground Truth: SELECT role FROM 2-11480171-1 WHERE genre = drama AND year = 1973

  • Full-model Prediction: SELECT role FROM 2-11480171-1 WHERE genre = drama AND year = 1973

  • Gen-model Prediction: SELECT role FROM 2-11480171-1 WHERE year = 1973