Text-to-SQL has recently attracted much attention as a sequence-to-sequence learning problem due to its practical usage for search and question answering dong2016acl; zhong2017seq2sql; xu2017sqlnet; cai2018ijcai; yu2018naacl; dong2018acl; finegan-dollak2018acl; Yu2018emnlp; wang2018EG; Shi2018IncSQL. The performance on some text-to-SQL tasks has been improved progressively dong2018acl; wang2018EG; Shi2018IncSQL in recent years. As pointed out in finegan-dollak2018acl, when evaluating models on text-to-SQL tasks, we need to measure how well the models generalize to realistic unseen data, which is very common in the real applications.
Most of the previous text-to-SQL tasks assumed that all questions came from a fixed database and hence share one global table schema. This assumption is useful for some specific applications such as booking flights dahl1994atis and searching GEO zelle1996geo, but not applicable to many real scenarios when different questions involve querying on different tables. zhong2017seq2sql addressed this problem and generated a dataset called WikiSQL, which is by far the largest text-to-SQL benchmark dataset.
In WikiSQL many tables have different table schemas and each table has its own limited labeled data. One common approach is to encode the table column names in the input to the training of an encoder-decoder model yu2018naacl; dong2018acl. yu2018naacl
proposed to utilize high-level type information to better understand rare entities and numbers in the natural language questions and encode these information from the input. These type information come from either external knowledge graph, a column or a number. This approach of TypeSQLyu2018naacl was proven to be effective on WikiSQL when it is required for the model to generalize to new tables.
We observe that a text-to-SQL encoder-decoder model implicitly learn a mapping between entities in natural language questions to column names in tables. The model is likely to fail on mapping to new table column names that it never sees before. Hence if we learn a better mapping from question words to table column names, then the text-to-SQL generation model would be better generalized to new tables. With this in mind, we introduce an auxiliary model to enhance the existing generation model.
Specifically, we propose a novel auxiliary mapping task besides traditional text-to-SQL generation task. Here we explicitly model the mapping from natural language entities to table column names. The mapping model serves as an supportive model to the specific text-to-SQL task as well as regularization to the generation model to increase its generalization. These two tasks are trained together with a multi-task learning loss. In practice, we adopt the coarse-to-fine decoder as the prototype of our generation model due to their superior performance in text-to-SQL tasks. And the generation model is further improved by introducing bi-attention layer (question-to-table attention and table-to-question attention) Seo2017iclr and attentive pooling layer bowen2016arxv.
We test our models on WikiSQL, with emphasis on a zero-shot subset, where the table schemas of the test data never occur in the training data. Compared to the coarse-to-fine model, our models improve over the baselines by absolute in accuracy, achieve execution accuracy of . In particular, on the zero-shot test part of WikiSQL, our models achieve even more gain, with improvement in accuracy over the baseline model. 333Our code will be released after paper is reviewed.
In summary our contributions in this paper are three-fold:
1) We find the existing testbed covers up the true generalization behavior of neural text-to-SQL models, and propose a new zero-shot test setting.
2) We improve the generalization ability of existing models by introducing an auxiliary task, which can explicitly learn the mapping between entities in the question and column names in the table.
3) The zero-shot evaluation not only shows the superior performance of our proposed method compared with the strong baseline but makes it possible to explain where the major gain comes from.
2.1 Text-to-SQL Task
task can be formulated as a conditional text generation problem, in which a questionand a table are given, the goal is to generate a SQL language .
Figure 1 illustrates WikiSQL output format which consists of three components: AGG, SEL, and WHERE. Particularly, WHERE clause contains multiple conditions where each condition is a triplet with the form of (condition_column, condition_operation, condition_value).
The question and corresponding table schema
are first translated into the hidden representation by a BiLSTM sentence encoder:
where is embedding of question word and is the representation of a column name which consists of words . The first and last hidden state of a BiLSTM over is concatenated as .
Different from traditional text generation tasks, which share a decoder cross time-steps, in Text-to-SQL task, different decoders are designed in terms of different operations. Generally, these decoders can be classified two types:cls for classifier, and pt for pointer.
cls is used for the operations, such as AGG and COND_OP:
where is one decoder hidden representation.
2.2 Diagnosing the Bottleneck of Text-to-SQL
The existing testbed covers up the true generalization behavior of existing models. To address this problem, we provide a new testbed by breaking down the testing samples. Specifically, we analyze the generalization problem on table aware Text-to-SQL tasks, by testing previous state-of-the-art model dong2018acl on different tables which occur different times in training set. We observe the following problems based on Figure 2:
WHERE clause performance is more sensitive to how many times the table has been seen in the training data;
The performance of WHERE would get a big drop once the table in test set is not present in the training data, i.e. zero-shot testing case.
Despite of the importance of the generalization problem of unseen tables, few work explored it due to the lack of appropriate datasets. The WikiSQL dataset was originally constructed to ensure that the training and test set have disjoint set of tables, which can provide a test bed for generalization test of new tables. However, we find that the current version of WikiSQL test cannot guarantee this because different tables extracted from different wiki pages may share the same table schema (i.e. table column names), even though their table content may not be the same.
The above problems motivate us to explicitly model the mapping between words in question and table column names, and test the model generalization to new tables on the true zero-shot sub testset of WikiSQL.
Our model consists of a seq2seq model for the SQL generation task (largely following the baseline coarse-to-fine model), and a mapping model as a auxiliary task to explicitly map question words to table schema (column names).
3.1 Main Generation Model
we follow section 2.1 to obtain question and schema hidden representation and . To enhance the interaction between question words and column name , a bi-attention is used to generate final question and table schema representation:
Considering the nature of structured SQL, we follow previous works to use different sub-decoders for AGG, SEL and WHERE clause. Especially, our WHERE decoder is adapted from the baseline model dong2018acl.
Agg and Sel Decoder
Each SQL only contains one AGG and SEL, so we generate AGG and SEL based on entire question representation. Since different words or phrases in question do not equally contribute to the decisions of AGG and SEL, we employ an attentive pooling layer over to generate final hidden representation for AGG and SEL.
We feed into cls layer generate the aggregation operation AGG and meassure the similarity score between and each column name to predict SEL by Pt layer in (2):
We took the WHERE decoder from the-state-of-the-art model dong2018acl, which first generates a slot sketch of WHERE clause and transform the SQL generation into a slot filling problem. There are 35-category WHERE clauses in WikiSQL and each one is subsequence of WHERE clause which skip the COND_COL and COND_VAL. For example, ”WHERE AND ” is a sketch of WHERE clause which has 2 conditions. We first predict the sketch based on :
Once is predicted, we obtain the COND_OP sequence it represents. We embed each operation in COND_OP sequence and feed them into WHERE-decoder cell. As Figure 3 shows, the WHERE-decoder cell takes one COND_OP as input and output COND_COL and COND_VAL for each decoder time step, while each decoder time step has 3 LSTM time steps. For th condition, ,, are and and and output ,
are probability distribution of the index ofand the span of . We do not have output for each because the input of next time step is given by pre-predicted . The lstm-cell is updated 3 times for each decoder time step:
The output layers for COND_COL and COND_VAL are both pointer layer which are pointed to column names and question words to predict COND_COL index and the left and right end , of the span of COND_VAL in question:
3.2 Auxiliary Mapping Model
For a SQL query, each condition consists of three parts, COND_COL, COND_OP and COND_VAL. Our mapping model aims to discover the mapping between condition column and condition value. The mapping prediction is based on question and table schema representation and , which are shared with generation model. Optimization based on mapping task loss can improve the question and table schema representation. An intuitive way to achieve mapping is to directly learn a mapping function from each word in question to column names. However, since not all words in a question are condition values, it’s challenging to take all words into consideration. To address this problem, we propose a two-step mapping model, in which we first learn a detector to screen out condition values, and then we learn a mapping function from condition values to column names.
Condition Value Detection
Because the condition value sometimes covers multiple words, we label the span for condition values in questions with typical nadeau2007survey tags. We notice sometimes condition column names appear exactly in question, so the span of column name in question is also labeled with tags , during training when a column name appear in question. Altogether we have five tags , , , , , which represent the first word of condition column, subsequent word of condition column, the first word of condition value, subsequent word of condition value and outside, respectively. Figure 3 illustrates our mapping model by giving the ground-truth label for an example.
The mapping model takes encoding vector of question words and column names as input. Mapping model first predict gate :
where and are tagging parameters.
We only learn the mapping for question words which are tagged as , :
3.3 Loss Function
We refer to the following as generation task loss and as mapping task loss.
where represents different operations during decoder phase. and denote the probability distribution of real label and predicted probability distribution. represents how many times words in question have been predicted as condition values.
Finally, the overall loss can be written as:
where is the number of training samples and is hyper-parameter.
4 Experimental Setup
WikiSQL has over 20K tables and 80K questions corresponding to these tables. This dataset was designed for translating natural language questions to SQL queries using the corresponding table columns without access to the table content. This dataset is further split into training and testing sets that are separately obtained from different Wiki pages, assuming there is no overlap of tables between training and testing sets. However, we find in this split, question-table pairs in test set have the same table schema as those in the training set. This is because even train and test tables were obtained from different Wiki pages, these tables could still have the same table schema. For example, different football teams have their own Wiki page but each one have a table with the same schema recording match information.
We split the test set based on the number of shots (the number of a table occurrences in training data). We report experiments on the original full WikiSQL test set as well as different subset based on the number of shots, especially on the zero-shot testing case. Statistics of new test sets are in table 2.
|Gen-model w/o AP||72.8%||79.4%||90.2%||93.0%||84.7%|
|dataset||number of shots||#questions|
We follow the evaluation metrics inxu2017sqlnet to measure the query synthesis accuracy: query-match accuracy () which measures the decoded query match the ground truth query without considering the order of conditions and execution accuracy () which measures the results from executing predicted queries. The accuracies are further break down into three categories: AGG, SEL and WHERE, as in xu2017sqlnet.
4.3 Model Configuration
We use 300-dim Glove word embedding as our pre-trained embedding. Hidden size for all LSTM is 250 and hidden size in attention function is set to 64. The loss weight is set to 0.5. A 0.5-rate dropout layer is used before each output layer. Each concatenation is followed by one full-connected layer to reduce the dimension to the original hidden or attention size. Test model is selected by the best performing model on validation set.
5 Results and Analysis
Table 1 shows the overall and breakdown results on full WikiSQL dataset. We compare our models with strong baseline models on the original WikiSQL test data. All these models have no access to table content following zhong2017seq2sql.
First our Gen-model with enhanced encoder/decoder improves over the baseline coarse-to-fine model by 1.6% in accuracy of both and . Our Gen-model mainly improves on compared to baseline models. Ablation test shows the improvement is attributed to the attentive pooling in SEL decoding.
Second our Full-model outperforms our single generation model by 1.5% and 1.6% in query-match accuracy and execution accuracy, achieving a very competitive new execution accuracy of 81.7%. Break down results show Full-model mainly improves the accuracy over Gen-model on the WHERE clause, with 1.9% accuracy gain.
5.1 Training data amount
Figure 4(a) illustrates Gen-model and Full-model accuracy of WHERE clause prediction on different test subsets from Table 2. Full-model is consistently better than single Gen-model in WHERE clause prediction. The biggest gap between Full-model and Gen-model in WHERE clause accuracy is on test subset W-0. This shows that Full-model generalizes better than Gen-model for the unseen test tables. We also found that Full-model accuracy on W-4 is slightly lower than that on W-3. We believe this is due to the fact that table itself is the other fact affecting models’ performance, in addition to the amount of training tables.
Figure 4(b) again illustrates Gen-model outperforms Gen-model without attentive pooling on different amount of training data.
5.2 Zero-shot Test
Figure 5(a) illustrates the results on zero-shot test case (i.e. W-0). Our Full-model outperforms baseline coarse-to-fine model by 4.9% and 4.4% in and . The accuracy improvement over the baseline coarse-to-fine model lie on the SEL and WHERE clause, with 3.6% accuracy gain on WHERE clause over the baseline.
Figure 4(a) shows WHERE clause accuracy has a big drop on zero-shot setting (W-0) compared to few-shot setting (W-1). We further analyze the reason of this degradation by looking into how the performance is affected by whether a column name is present in the training data. On unseen test table schema, 28% column names never appear in training set, which makes question related to these columns harder. We further divide conditions in WHERE clauses into two classes, one class with condition column appearing in training, the other with condition column not appearing in training. We measure the accuracy of each class on the WHERE clause. The result is reported in Figure 5(b). Full-model outperforms single generation model by 4.2% on unseen column names and 2.1% on seen column names. This further shows the generalization ability of the Full-model.
5.3 Case Study on Zero-shot Setting
We also analyze the Full-model behavior on zero-shot test compared to the Gen-model alone. We first randomly sample 100 examples of which Full-model predicts correct on WHERE clause (Case-Correct in Table 3), while Gen-model fails. We label the failure reasons of Gen-model into four categories (one example can belong to more than one categories): (a) wrong COND_COL prediction, (b) wrong COND_VAL prediction, (c) predicting extra conditions or missing conditions and (d) others. Table 3 shows the majority of WHERE clause errors are in (a): wrong COND column name errors. We then randomly sample another set of 100 examples (Case-Wrong in Table 3): Gen-model predicts WHERE clause correctly on these examples but Full-model fails. Table 3 indicates Full-model corrects Gen-model mainly on wrong COND_COL prediction, which shows our mapping task improves column name predictionin the generation task.
6 Related Work
Recently neural network based approaches, especially sequence-to-sequence models have been applied to text-to-SQL successfully with progressively improving resultswang2017synthesizing; neelakantan2017iclr; iyer2017acl; yin2017acl; huang2018naacl; zhong2017seq2sql; xu2017sqlnet; cai2018ijcai; yu2018naacl; dong2018acl; finegan-dollak2018acl.
Sketch-based approach is very effective, especially on WikiSQL task zhong2017seq2sql; xu2017sqlnet; yu2018naacl. In zhong2017seq2sql SEQ2SQL model used a coarse-grained sketch: aggregation, SELECT column and WHERE clause; xu2017sqlnet used a finer sketch to align to the syntactical structure of a SQL query with three specific slot-filling models: model_COL, model_AGG, and model_OPval. In TypeSQL yu2018naacl it also adopted this sketch-based model structures. However, in dong2018acl sketch was referred to as abstractions for meaning representation, leaving out low-level details. This meaning sketch was used as an input to the final decoding.
One of the challenge for using neural seq2seq models is the need of large annotated question-query pairs. zhong2017seq2sql; cai2018ijcai have automatically generated large datasets using templates and had humans paraphrased the questions into natural language questions. WikiSQL is by far the largest text-to-SQL dataset. WikiSQL was designed for testing model’s generalization by splitting the tables in a way that there is no overlap of tables in training and testing.
Execution guided (EG) decoding was recently proposed in wang2018EG that detects and excludes faulty outputs during the decoding by conditioning on the execution of partially generated output. Adding execution guided decoding to the coarse-to-fine model improved accuracy by on the wikiSQL dataset; and adding on top of the most recent IncSQL model Shi2018IncSQL improved accuracy by . It is proven that the EG module is very effective with any generative model.
Zero-shot semantic parsing has not obtained enough attention. herzig2018decoupling applied a pipeline framework, including four independent models to achieve generalization, while our work is end-to-end trained and focusing on improving model’s generalization with an auxiliary mapping task. Zero-shot slot filling bapna2017towards also leverages the text of schema to connect language question words to column names (slots), but their model needs to predict the probability of each possible column indepentently while our model can select the column by processing the question and schema one time.
7 Conclusions and Future Work
In this paper, we propose a novel auxiliary mapping task for zero-shot text-to-SQL learning. Traditional seq2seq generation model is augmented with an explicit mapping model from question words to table schema. The generation model is first improved by an attentive pooling inside the question, and bi-directional attention flow to improve the interaction between the question and table schema. The mapping model serves as an enhancement model to text-to-SQL task as well as regularization to the generation model to increase its generalization.
We compare our model with the a strong baseline coarse-to-fine model on the original WikiSQL testset as well as on the totally unseen test tables (a subset of zero-shot testing). Experimental results show that our model outperforms baseline models on both setting. Even though the generation model is already augmented with bi-directional attention to enhance the interaction between question and table, our results and analysis demonstrate that the explicitly mapping task can further increase the capability of generalization to unseen tables.
Spider Yu2018emnlp was recently proposed as another large cross-domain text-to-SQL dataset besides WikiSQL. It has more complex SQL templates including joint tables, which brings other interesting problems except for generalization. We plan to expand our models on this new dataset in the future.
Appendix A Appendices
Error case (a): wrong Cond_col prediction
Table: 2-11568882-2, Header: [year , winners , score , runners up , venue , 3rd place]
Question: what ’s in third place that ’s going 1-0 ?
Ground Truth: SELECT 3rd place FROM 2-11568882-2 WHERE score = 1-0
Full-model Prediction: SELECT 3rd place FROM 2-11568882-2 WHERE score = 1-0
Gen-model Prediction: SELECT 3rd place FROM 2-11568882-2 WHERE 3rd = 1-0
Error case (b): wrong Cond_val prediction
Table: 1-1081235-1, Header: [name of lava dome , country , volcanic area , composition , last eruption or growth episode]
Question: what countries have had eruptions of growth episodes in 1986 ?
Ground Truth: SELECT country FROM 1-1081235-1 WHERE last eruption or growth episode = 1986
Full-model Prediction: SELECT country FROM 1-1081235-1 WHERE last eruption or growth episode = 1986
Gen-model Prediction: SELECT country FROM 1-1081235-1 WHERE last eruption or growth episode = growth episodes in 1986
Error case (c): predicting extra conditions or missing conditions
Table: 2-11480171-1, Header: [year , title , genre , role , director]
Question: what drama role does she play in 1973 ?
Ground Truth: SELECT role FROM 2-11480171-1 WHERE genre = drama AND year = 1973
Full-model Prediction: SELECT role FROM 2-11480171-1 WHERE genre = drama AND year = 1973
Gen-model Prediction: SELECT role FROM 2-11480171-1 WHERE year = 1973