Great efforts have been invested in deep-learning-based semantic parsing to convert natural language (NL) texts into structured representations or logical forms [Wang2015BuildingAS, PasupatL15, Jia2016DataRF]. In particular, a special case of semantic parsing - the natural language interface (NLI) to structured queries like SQL [androu1995natural, popescu2003towards, li2005nalix, li2014nalir] - has incited significant interest. The motivation is two-fold: 1) the majority of the world's data is stored in relational tables (databases), and an NLI to a database engine has the potential to support a great many dialogue-based applications; 2) it is extremely difficult for machines to understand the meanings of arbitrary NL texts, especially across multiple domains, but the complexity is more likely to be reduced by converting texts to formal languages.
Previous works have used seq2seq [sutskever2014sequence] models to generate structured queries against a certain database, given NL queries [Dong2016, Jia2016DataRF]. Provided with abundant data and trained end-to-end, a seq2seq model achieves decent performance on a single relational table; however, it is not straightforward to apply a trained model to a new table. For example, suppose we have two queries and against a geography table and an employee table, respectively:
a model trained for the geography table is able to parse , but when it comes to the employee table, the model would fail to directly parse .
This is because a seq2seq model with end-to-end training mixes up three types of knowledge: (a) the ontology of NL (grammar), (b) domain-specific language usage, and (c) the schema information of the relational table (columns and values). Returning to the example, the end-to-end model has only been trained on “age” but not “size”, and on “john smith” but not “south america”; so even though both are schema information, it is difficult to use a model trained on one table (source domain) to answer queries against another (target domain).
Therefore, a reliable domain-adaptation solution should consider two important approaches: (1) improve the learning of NL ontology knowledge on the source domain, which can then extend to the target domain; and (2) augment the data on the target domain, so that general NL knowledge and domain-specific usage are better learned. In this paper, we address the two problems accordingly:
We design a Structured-Query Inference Network (SQIN) for better cross-domain semantic-parsing, by separating schema-related information from NL query and decoding SQL in a more structural-aware way;
We design a generative adversarial network AugmentGAN to augment the limited number of training data on target domain.
To make domain adaptation more effective with fewer required resources, we: (1) explicitly separate the relational-table-related information in the NL query and structurally generate SQL, and (2) given a limited amount of target-domain data, effectively augment it to a larger size. We first introduce the scope of our method and explain its validity.
1. Single relational table (self-join operations supported as subqueries): This assumption implies that the SQL we support is a subset of standard SQL. A recent detailed analysis [JohnsonNS18] reveals that of 8.1 million industrial SQL queries are against single relational tables with self-join; in real life, people are more likely to query simple and structured data such as weather or stock prices, so it is fair to say the percentage of such queries is even higher in most practical NL-based applications. Our method is also capable of extending to more complex cases.
2. Only column names (and corresponding types) of a table are provided: In most circumstances, for privacy reasons, values stored in the table are not accessible to NLI providers, unlike in the recent work STAMP [Sun2018SemanticPW], where values/cells can be accessed. During domain transfer, in addition to the schema, a limited number of (NL, SQL) pairs on the target domain are given for training.
3. Column/value information is explicitly mentioned: This assumption ensures that we can identify and match the columns and values against an NL query. We do not require the columns to match their appearances in the table exactly; different forms (such as plurals or past tense), synonyms, and common usages are allowed. For example,
are also handled by our method. In this paper, NL queries that are too implicit are not our focus.
Domain-adaptive Semantic Parsing
We present a Structured-Query Inference Network (SQIN), by dividing the semantic parsing task into two stages:
(1) Tag the column names and value information in the NL input. Some existing works propose to detect schema information and copy it directly to the output using an attention-copying mechanism [Jia2016DataRF, vinyals2015pointer]; however, intensive learning is still needed when moving to another domain. Here, we use a convolutional tagging network (CTN) to determine, for each token in the NL query, whether it is a column, a value of a column, or nan. For example, suppose we have schema [‘country’,‘size’,‘population’] for the geography table and [‘name’,‘salary’,‘age’] for the employee table; then and will be tagged in the following forms:
(2) Convert the tagged NL query to SQL query, e.g. and will be converted to SQL formats:
where a FROM statement is omitted. In the end, the column tags are substituted by the column names in the schema, and the value tags substituted by the corresponding substrings from the input.
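The final substitution step can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the tag spellings (c1, v1) and the helper `fill_tags` are hypothetical, assuming column tags carry a 1-based schema index and value tags map to substrings recovered from the NL input.

```python
def fill_tags(sql_skeleton, schema, value_spans):
    """Replace column tags with schema names and value tags with
    the substrings recovered from the NL input."""
    tokens = []
    for tok in sql_skeleton.split():
        if tok.startswith("c") and tok[1:].isdigit():
            tokens.append(schema[int(tok[1:]) - 1])   # c1 -> first column
        elif tok.startswith("v") and tok[1:].isdigit():
            tokens.append(repr(value_spans[tok]))     # v1 -> quoted value span
        else:
            tokens.append(tok)                        # SQL keyword/operator
    return " ".join(tokens)

schema = ["country", "size", "population"]
spans = {"v1": "south america"}
sql = fill_tags("SELECT c1 WHERE c1 = v1", schema, spans)
# -> "SELECT country WHERE country = 'south america'"
```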
For more complex SQL, to decode in a more structural-aware manner, both the (a) hierarchical and (b) compositional properties of SQL queries should be addressed. SQLNet [xu2017sqlnet] uses a Seq2Set model with a sketch to deal with the compositional nature of SQL, but it only works for simple types of SQL and requires significant human effort to define and retrain a new sketch, making it hard to adapt to different types of SQL. Seq2Tree [Dong2016] tackles the hierarchical structure of SQL with a hierarchical tree decoder, but it still requires the model to memorize different possible compositions of keywords. ASN [rabinovich2017abstract] incorporates both tree-like structure and recursive decoding, but its multi-module design could be significantly simplified. Therefore, we use a simple structured sequence-based parts-of-SQL (seq2PSQL) generation to capture both natures of SQL with the help of the tagged information from the NL query.
More details of the model design will be introduced in the later sections.
Augment Data on Target Domain
Jia and Liang (2016) [Jia2016DataRF] previously developed a recombination method for data augmentation; for example, their AbsEntity method replaces a value in the query with different values for the same column, and their AbsWholePhrases method replaces a value with its column under certain conditions. However, the model may be heavily biased by the small set of seed queries used to generate the query variations, which can make the augmented set of NL queries simpler than the full scope of NL expression.
We propose an augmenting algorithm that goes beyond recombination. Given two different NL queries, it is very difficult to hybridize parts of them into a new, fluent NL text; however, it is simple to recombine two SQL queries, as SQL follows strict grammar rules. Therefore, we train another sequence-based model to generate an NL query given a SQL query, and the augmentation process generates the corresponding NL queries for recombined SQL queries. We adopt a generative adversarial network (GAN), using a discriminator to classify whether a generated NL query resembles human usage; the result is used as a reward for the generator.
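As a toy illustration of why recombination is easy on the SQL side, the sketch below swaps the WHERE clauses of two queries against the same schema. This is one hypothetical recombination rule for illustration only; `swap_where` is not the paper's actual operator, and real recombination would need type-compatibility checks.

```python
def swap_where(sql_a, sql_b):
    """Swap WHERE clauses of two single-table queries over the same schema.
    Because SQL is grammatical, both results are valid queries."""
    head_a, where_a = sql_a.split(" WHERE ")
    head_b, where_b = sql_b.split(" WHERE ")
    return head_a + " WHERE " + where_b, head_b + " WHERE " + where_a

new_a, new_b = swap_where("SELECT country WHERE size > 1000",
                          "SELECT country WHERE population < 5")
# new_a -> "SELECT country WHERE population < 5"
# new_b -> "SELECT country WHERE size > 1000"
```

The NL counterparts of `new_a` and `new_b` would then be produced by the trained SQL-to-NL generator.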
More details of the model design will be introduced in the later sections.
Seq2seq-based [sutskever2014sequence] models enable semantic-parser training in an end-to-end manner without manual feature engineering. Besides the common seq2seq framework [Jia2016DataRF, xiao2016sequence, zhong2017seq2sql], there are other sequence-based models with structural-aware decoders, such as Seq2Tree [Dong2016], SQLNet [xu2017sqlnet], EG [wang2018robust] and Abstract Syntax Networks [rabinovich2017abstract]. Due to the black-box nature of seq2seq, both Cheng et al. [cheng2017learning] and Coarse2fine [Dong2018coarse] proposed two-stage semantic parsers, with the first stage mapping utterances to intermediate states and the second stage converting intermediate states to logical forms. STAMP [Sun2018SemanticPW] recognizes the importance of “linking” between the question and the table columns, adopts a switching gate in the decoder, and includes value/cell information in SQL generation. A more recent work, MQAN [mccann2018natural], designs a multi-pointer-generator decoder for the generation. As another line of work in deep-learning-based semantic parsing for relational tables, Neural Enquirer [YinLLK15] proposes a fully distributed end-to-end model where all components (query, table, answer) are stored and differentiable, and Neural Programmer [Neelakantan2016] defines a set of symbolic operators; these approaches lack explicit interpretability and adaptability to different tables, and the input is executed to produce an answer instead of a structured query. Other progress, such as the Neural Symbolic Machine [liang2017neural], adds memory to the seq2seq model, but this is not our focus in this work. Two cross-domain seq2seq approaches [su2017cross, herzig2017neural] are relevant, but both require a large amount of target-domain data to achieve good domain adaptation. One recent work [xiong2018transfer] makes a good attempt to separate the schema information from the natural language query through annotation.
As a future direction, DialSQL [Gur2018DialSQL] incorporates user feedbacks to enhance generation.
The idea of GANs [RadfordMC15, chen2016infogan, salimans2016improved] has recently enjoyed success in NLP [lamb2016professor, yu2017seqgan]. For example, a successful application of GANs is Neural Dialogue Generation [li2017adversarial], where the generator is an RL-based seq2seq model and the outputs of the discriminator are used as rewards for the generator, pushing the system to generate dialogues that mostly resemble human usage.
Structural Query Inference Network (SQIN)
In this section, we tackle the problem of domain-adaptive semantic parsing. Given the NL query and the schema of the relational table the query is against, our goal is to convert it to the corresponding SQL .
Convolutional Tagging Network (CTN)
As discussed in General Approaches, we first identify and tag the column and value information in the NL input with a sequence of tags , i.e. for each token , we predict a tag denoting it as a column cj, a value vk, or nan:
where and both .
One challenge for tagging is that a column name or a value may consist of multiple tokens, so the model should capture features of neighboring tokens as well; therefore, we use a convolutional model. Another challenge is choosing suitable embeddings to represent the tokens: to capture a token with both semantic and character-level accuracy, we use both the GloVe embedding [pennington2014glove] and the char-n-gram embedding [kim2016character], and regard them as two separate ‘channels’ of this token. For the embedding of a column name (multiple tokens considered), we use a bi-directional GRU to encode the two-channel word vector of each token [zhong2017seq2sql].
We use multi-layer convolutional operations to process the NL query and assign a tag to each token. For each conv layer, the input is the concatenation of consecutive -dimensional embeddings with channels, and the output is a -dimensional embedding with channels, followed by a function; the convolution filter has size .
For the last layer, each output is multiplied with the embeddings of the schema (plus nan) through a bilinear matrix of size ; the result goes through a softmax function to return a probability vector, and the index with the highest probability corresponds to one of the columns or nan. We call this model the convolutional tagging network (CTN). In practice, we first use one CTN to tag column names against the NL input, and then add extra layers for value tagging.
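The last-layer scoring step can be sketched as follows. This is a pure-Python toy with hypothetical tiny dimensions (real dimensions are given in the Experiments section): each token's feature vector is scored against every schema-column embedding plus a 'nan' embedding through a bilinear matrix, and the softmax argmax yields the tag.

```python
import math

def bilinear_scores(tok_vec, col_vecs, W):
    # score_j = tok_vec^T  W  col_j, computed naively for clarity
    return [sum(tok_vec[a] * W[a][b] * col[b]
                for a in range(len(tok_vec))
                for b in range(len(col)))
            for col in col_vecs]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# toy setup: 2-dim features, identity bilinear matrix, 2 columns + nan
W = [[1.0, 0.0], [0.0, 1.0]]
cols = [[1.0, 0.0],   # embedding of column 'size'  (hypothetical)
        [0.0, 1.0],   # embedding of column 'population'
        [0.0, 0.0]]   # learned 'nan' embedding
probs = softmax(bilinear_scores([0.9, 0.1], cols, W))
tag = max(range(len(probs)), key=probs.__getitem__)   # index 0 -> 'size'
```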
Sequence-based Parts of SQL (seq2PSQL) generation
To encode the tagged NL query from the previous section, we use a sequence of embeddings , where each element is the concatenation of the original token ’s GloVe vector and its tag ’s embedding. The tag embedding itself is a concatenation of three parts: (1) the embedding of the tag type (column or value), (2) the embedding of the index , which indicates that the tag is either the th column or a value of the th column, and (3) the embedding of the value type (integer, string, etc.). Tags that share similar attributes (such as the same tag type or the same id) also share part of their embeddings.
The sequence is taken as input to a bidirectional multi-layer GRU encoder and encoded into a hidden representation. Since the encoder and decoder share the same vocabulary for the tags, we use the same tag embedding on both sides and synchronize its updates during back-propagation.
The decoder adopts a uni-directional multi-layer GRU and generates SQL queries in a top-down manner:
(1) to address the compositional nature of SQL, we use different starting tokens (like <select>, <where>…) to generate the different clauses of SQL; for each clause at each step, the output can be a column tag (c1), a value tag (v2), or a SQL functional word (such as a logical or aggregation operator); generation terminates with an ending token <eos>. The decoders for different clauses share the same set of parameters, so all possible SQL clauses are handled in one universal setting.
(2) to address the hierarchical nature, we define a nonterminal <sub> token which indicates the onset of a subquery. If <sub> is predicted, a new set of clauses starts to decode, conditioned on the nonterminal’s hidden vector. This process terminates when no more nonterminals are emitted.
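The two decoding rules above amount to the following control flow. This is a structural sketch only, not a trained model: `step` is a hypothetical stand-in for one GRU decoding step (returning the next hidden state and emitted token), and the clause list is abbreviated to <select> and <where>.

```python
def decode_clause(step, state, clause_token):
    """Decode one clause; shared parameters across clauses (same `step`)."""
    out, tokens = state, []
    while True:
        out, tok = step(out, clause_token if not tokens else tokens[-1])
        if tok == "<eos>":              # clause finished
            return out, tokens
        if tok == "<sub>":              # hierarchical nature: recurse into
            out, sub = decode_sql(step, out)   # a subquery from this state
            tokens.append(sub)
        else:
            tokens.append(tok)          # column tag, value tag, or SQL word

def decode_sql(step, state):
    """Compositional nature: one pass per clause, each with its own start token."""
    sql = {}
    for clause in ("<select>", "<where>"):
        state, sql[clause] = decode_clause(step, state, clause)
    return state, sql
```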
In Fig. 1, we use an example from Overnight [Wang2015BuildingAS] to demonstrate adaptive semantic parsing: an NL input is converted to a self-join SQL (supported as a subquery).
Data Augmentation based on GAN
To augment the seed data, it is much easier to recombine SQL queries and generate NL queries accordingly. The problem can be framed as follows: given a SQL query , the model needs to generate an NL query . We view query generation as a sequence of actions taken according to a policy defined by a seq2seq model.
In this section, we describe the proposed AugmentGAN model in detail.
The adversarial paradigm is composed of a generator and a discriminator . The key idea is to encourage the generator to produce NL queries that are indistinguishable from human-composed ones, using the discriminator to provide a reward for the generation at each step. In detail, the generator is an attention-based seq2seq model [bahdanau2014neural] that generates an NL query step by step given the SQL ; at each step, a partial query is generated and evaluated by the discriminator;
the generator is updated through reinforcement learning. The discriminator is a binary classifier that takes a pair of SQL and NL queries (, ) as input and encodes them into vector representations of size using two bi-directional GRU encoders, respectively; the two hidden vectors are then combined through a bilinear matrix of size to give a vector, which is fed to a 2-class feed-forward network, returning the score of being machine-generated or human-generated.
To calculate the score for the partial query at each step, we use Monte Carlo search [li2017adversarial, yu2017seqgan]: the model keeps sampling tokens from the distribution until decoding finishes, and repeats this (set to ) times; we use the mean score over the samplings of being human-generated () as the reward to update the policy of the generator for the next step (Fig. 2). The training objective is to maximize the expected reward of generated sequences based on the policy gradient method [williams1992simple]:
where is the policy of the generator, and is the baseline function used to reduce the variance.
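The Monte Carlo reward and the REINFORCE weighting can be sketched as below. The names are hypothetical stand-ins: `rollout` samples one completion of a partial query from the current policy, and `D` is the discriminator returning the probability that a finished query is human-generated.

```python
def mc_reward(partial, rollout, D, n_samples=5):
    """Mean discriminator score over n completed rollouts of a partial query."""
    return sum(D(rollout(partial)) for _ in range(n_samples)) / n_samples

def pg_weight(reward, baseline):
    """REINFORCE: scalar weighting the log-prob gradient of the chosen token;
    subtracting a baseline reduces the variance of the gradient estimate."""
    return reward - baseline
```

In training, `pg_weight(mc_reward(...), baseline)` multiplies the log-probability of the token emitted at that step.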
During training, we also feed the human-generated query to the generator with a positive reward for model updates, which serves as teacher intervention, giving the generator more direct access to the gold-standard targets [lamb2016professor, li2017adversarial].
Experiments & Analyses
We present experimental results on both in-domain and domain-transfer tasks, analyze our models, and compare with previous works.
Datasets and Implementation
We train and evaluate our models on GeoQuery [zettlemoyer2005learning], WikiSQL [zhong2017seq2sql], and Overnight (with the sub-domain Blocks excluded for not being a relational table) [Wang2015BuildingAS]. For the data in GeoQuery and Overnight, we manually convert each original logical form to a SQL query.
For GeoQuery and Overnight, we use the standard train-test splits as released, and randomly divide the train sets into splits for cross validation ( for each train-valid cycle); accuracies are calculated as the percentage of correct SQL queries. For WikiSQL, we use the standard train-dev-test splits, and accuracies are the percentage of correct logical forms (SQL queries).
We implement SQIN and AugmentGAN in TensorFlow and train the models on an NVIDIA GTX-1080-Ti GPU. During training, each iteration takes a batch of 128 examples, and evaluation on the development set happens every 50 iterations. For the Overnight dataset, it usually takes iterations for the models to achieve the performance reported in this work.
In-domain Semantic Parsing
For both CTN and seq2PSQL, we use pre-trained GloVe vectors [pennington2014glove] with dimension ; for out-of-vocabulary (OOV) tokens not covered by GloVe, we randomly generate a vector using a Gaussian distribution (with inferred element-wise mean and variance). The char-n-gram embeddings we use in the CTN are pre-generated as in [kim2016character]. The tag embedding is concatenated from three parts, as discussed in the previous section: (a) tag type, (b) id, and (c) value type, with dimensions , respectively, and is randomly initialized using a uniform scaling initializer .
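The OOV handling described above can be sketched as follows, assuming a hypothetical `glove` dict mapping in-vocabulary tokens to vectors; the per-dimension mean and standard deviation are inferred from the known vectors.

```python
import random
import statistics

def oov_vector(glove):
    """Draw a vector for an OOV token from a per-dimension Gaussian
    fit to the in-vocabulary GloVe vectors."""
    dims = list(zip(*glove.values()))          # values per embedding dimension
    return [random.gauss(statistics.mean(d), statistics.stdev(d))
            for d in dims]
```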
We first train a two-layer CTN for column tagging; value tagging is then based on the pretrained two-layer column CTN with one extra layer on top. By doing so, value alignment during value tagging is improved, since the pretrained layers provide important column-related information. For both the encoder and the decoder in seq2PSQL, we use 2-hidden-layer GRU cells [chung2014empirical] with hidden-state size ; dropout is applied to both encoder and decoder during training, with keep-rate for the input and for the output. The decoder uses beam search with a beam size of .
We conduct an ablation analysis (CTN and seq2PSQL) to demonstrate the performance of SQIN and compare with previous works on in-domain tasks. From Table 1, our model exhibits better performance on all three datasets. seq2PSQL alone, without a CTN, demonstrates a better structural-aware decoder. To demonstrate the performance of the CTN, we evaluate both SQIN and a combined model, CTN+seq2seq, which feeds the tagged input into a seq2seq model [Jia2016DataRF]; the CTN significantly enhances the performance of seq2seq by separating the schema-related information from the NL inputs. With the better structural-aware decoder (seq2PSQL), SQIN shows state-of-the-art performance for in-domain semantic parsing.
Data Augmentation and Evaluation
The generator is first pre-trained to predict the NL queries given the SQL queries with a maximum-likelihood-estimation (MLE) loss. The discriminator is also pre-trained: half of the negative examples are partial NL queries with incomplete information relative to the corresponding SQL queries; a quarter are complete NL queries whose token order is randomly permuted; the remaining quarter are generated by sampling.
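The three kinds of negative examples can be sketched as below. The helper `make_negatives` is a hypothetical illustration; `sample_fn` stands in for drawing a query from the pretrained generator, and the truncation point for "partial" queries is an assumption.

```python
import random

def make_negatives(nl_tokens, sample_fn, rng=random):
    """Build one negative example of each type for discriminator pretraining."""
    partial = nl_tokens[: max(1, len(nl_tokens) // 2)]  # incomplete information
    permuted = nl_tokens[:]
    rng.shuffle(permuted)                               # order randomly permuted
    sampled = sample_fn()                               # drawn from the generator
    return partial, permuted, sampled
```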
In Table 2 we show several examples generated in Overnight by AugmentGAN and by the recombination method [Jia2016DataRF], compared with the ground truths, which were originally composed through crowdsourcing [Wang2015BuildingAS]. The examples generated by recombination follow stricter rules, whereas the examples from AugmentGAN are more flexible in both sentence structure and word choice, and thereby resemble human usage more closely.
To qualitatively evaluate how good the NL queries generated by AugmentGAN are, we employ crowd-sourced judges to evaluate 100 randomly sampled pairs of human-composed and GAN-generated queries. For each pair, we ask 3 judges to decide which one is better, with ties allowed. A small set of pairs is used to validate the quality of the crowd-sourced annotations: only when an annotator passes the validation set with a correctness of are his or her annotations accepted.
In Table 3 we show the crowd-sourced evaluation of the GAN-generated NL queries versus the ground truth in two Overnight subdomains; around of the generations resemble (or surpass) human usage. For industrial purposes, even when AugmentGAN cannot generate a human-like query, an extra selection step can be added; in most circumstances, selecting is significantly less time-consuming than composing or paraphrasing an NL query directly. Therefore, AugmentGAN has both academic and practical impact.
Domain Adaptation and Evaluation
We evaluate how well our models (SQIN, seq2PSQL without tagging, and CTN + seq2seq) can leverage learning from source-domain data to generate SQL queries against a target domain in the Overnight dataset, compared with a vanilla seq2seq model (Table 4).
(1) the in-domain setting refers to the model being both trained and tested on the target domain; (2) the plain transfer setting directly applies the trained model to the target domain, where the source tables are all the Overnight subdomains except the target domain; (3) massive target-domain data uses a sufficient amount of target-domain data to fine-tune the model; (4) limited target-domain data uses a randomly selected of the target-domain data to fine-tune the model; (5) limited target-domain data + GAN refers to transfer with limited target-domain data augmented by the GAN; (6) limited target-domain data + ReComb. refers to transfer with limited target-domain data augmented by recombination [Jia2016DataRF]. For approaches with data augmentation, the size of the augmented data is a factor of of the seed data.
From Table 4, massive performs better than in-domain, which illustrates that out-of-domain information can enhance learning [su2017cross]. Both models with schema tagging (SQIN and CTN + seq2seq) achieve better results than their non-tagging counterparts, and also demonstrate better plain transfer performance, showing that separating the schema information selectively enhances the learning of general NL knowledge and provides better domain adaptability.
For approaches using augmentation, AugmentGAN is more effective than recombination [Jia2016DataRF], showing better performance for all models. Using AugmentGAN, the accuracies of all models are higher than those of the in-domain setting, and even close to the massive setting, showing that good domain adaptation is achieved even with limited target-domain data. One interesting observation: for the cases using AugmentGAN, even though there are two-thirds fewer human-composed queries in the training data, with the help of schema tagging the models are still able to generate high-accuracy SQL, which implies that satisfactory transfer learning does not require completely human-resembling augmentation.
Finally, we evaluate how many source domains are needed for the model to generate correct SQL on the target domain: we use different numbers of Overnight subdomains as the source tables, with subdomain Publication as the target table, and calculate the SQL generation accuracies for both the plain transfer and limited target-domain data + GAN approaches.
From Fig. 3, there are two observations: (1) if the source tables do not fully cover all possible query types on the target table, fine-tuning with target-domain data is necessary to achieve better saturation performance; e.g., self-join queries appear in subdomain Publication but not in the other subdomains; (2) for a model previously trained on a sufficient number of source tables ( in this case), feeding a small amount of target-domain data (+GAN) is enough to achieve good domain adaptation, a promising technique that can save resources and manpower when adapting to a new table.
Conclusion & Perspective
As one of our main insights, we developed SQIN to separate schema-related information from the NL inputs, which enhances the learning of general NL knowledge by sequence-based models on source domains, thereby improving in-domain performance and cross-domain adaptability. Based on recombining a formal language (SQL) and generating the corresponding NL texts, we developed an effective GAN-based data augmentation algorithm, which can significantly reduce the human effort of composing data. Our extensive experimental analyses demonstrate the effectiveness of our approaches on standard datasets. Future work could extend to other types of structured data by combining with syntax-directed generation [dai2018syntaxdir].