Meta-Learning for Domain Generalization in Semantic Parsing

10/22/2020 · Bailin Wang et al.

The importance of building semantic parsers which can be applied to new domains and generate programs unseen at training has long been acknowledged, and datasets testing out-of-domain performance are becoming increasingly available. However, little or no attention has been devoted to studying learning algorithms or objectives which promote domain generalization, with virtually all existing approaches relying on standard supervised learning. In this work, we use a meta-learning framework which specifically targets zero-shot domain generalization for semantic parsing. We apply a model-agnostic training algorithm that simulates zero-shot parsing by constructing virtual train and test sets from disjoint domains. The learning objective capitalizes on the intuition that gradient steps that improve source-domain performance should also improve target-domain performance, thus encouraging a parser to generalize well to unseen target domains. Experimental results on the (English) Spider and Chinese Spider datasets show that the meta-learning objective significantly boosts the performance of a baseline parser.


1 Introduction

Figure 1: Zero-shot semantic parsing: at train time, a parser observes training instances for the database concert singer. At test time, it needs to generate SQL for questions pertaining to the unseen database farm.

Semantic parsing is the task of mapping natural language (NL) utterances to meaning representations. While there has been much progress in this area, earlier work has primarily focused on evaluating parsers in-domain (i.e., on the same tables or databases seen in training) and often with the same programs as those provided in training Finegan-Dollak et al. (2018). A much more challenging goal is achieving domain generalization, i.e., building parsers which can be successfully applied to new domains and are able to produce complex unseen programs. Achieving this generalization goal would, in principle, let users query arbitrary (semi-)structured data on the Web and reduce the annotation effort required to build multi-domain NL interfaces (e.g., Apple Siri or Amazon Alexa). Current parsers struggle in this setting; for example, we show in Section 5 that a modern parser trained on the challenging Spider dataset Yu et al. (2018b) has a gap of more than 25% in accuracy between in-domain and out-of-domain performance. While the importance of domain generalization has been previously acknowledged Cai and Yates (2013); Chang et al. (2019), and datasets targeting zero-shot (or out-of-domain) performance are becoming increasingly available Pasupat and Liang (2015); Wang et al. (2015); Zhong et al. (2017); Yu et al. (2018b), little or no attention has been devoted to studying learning algorithms or objectives which promote domain generalization.

Conventional supervised learning simply assumes that source-domain and target-domain data originate from the same distribution, and as a result struggles to capture this notion of domain generalization for zero-shot semantic parsing. Previous approaches Guo et al. (2019b); Wang et al. (2019); Herzig and Berant (2018) facilitate domain generalization by incorporating inductive biases in the model, e.g., designing linking features or functions which should be invariant under domain shifts. In this work, we take a different direction and improve domain generalization of a semantic parser by modifying the learning algorithm and the objective. We draw inspiration from meta-learning Finn et al. (2017); Li et al. (2018a) and use an objective that optimizes for domain generalization. That is, we consider a set of tasks, where each task is a zero-shot semantic parsing task with its own source and target domains. By optimizing towards better target-domain performance on each task, we encourage a parser to extrapolate from source-domain data and achieve better domain generalization.

Specifically, we focus on text-to-SQL parsing, where we aim to translate NL questions into SQL queries and conduct evaluations on unseen databases. Consider the example in Figure 1: at test time, a parser needs to process questions about a new database. To simulate this scenario during training, we synthesize a set of virtual zero-shot parsing tasks by sampling disjoint source and target domains for each task from the training domains (we use the terms domain and database interchangeably). Our objective requires that gradient steps computed towards better source-domain performance are also beneficial to target-domain performance. One can think of the objective as consisting of both the loss on the source domain (as in standard supervised learning) and a regularizer, equal to the dot product between gradients computed on source- and target-domain data. Maximizing this regularizer favours finding model parameters that work not only on the source domain but also generalize to target-domain data. The objective is borrowed from Li et al. (2018a), who adapt a Model-Agnostic Meta-Learning (MAML; Finn et al. 2017) technique for domain generalization in computer vision. In this work, we study the effectiveness of this objective in the context of semantic parsing. The objective is model-agnostic, simple to incorporate, and requires no changes to the parsing model itself. Moreover, it does not introduce new parameters for meta-learning.

Our contributions can be summarized as follows.

  • To handle zero-shot semantic parsing, we apply a meta-learning objective that directly optimizes for domain generalization.

  • We propose an approximation of the meta-learning objective that is more efficient and allows more scalable training.

  • We perform experiments on two text-to-SQL benchmarks: Spider and Chinese Spider. Our new training objectives obtain significant improvements in accuracy over a baseline parser trained with conventional supervised learning.

  • We show that even when parsers are augmented with pre-trained models, e.g., BERT, our method can still effectively improve domain generalization in terms of accuracy.

Our code will be available at https://github.com/berlino/dgmaml-semparse.

2 Related Work

Zero-Shot Semantic Parsing

Developing a parser that can generalize to unseen domains has been drawing increasing attention in recent years. Previous work has mainly focused on the sub-task of schema linking as a means of promoting domain generalization. In schema linking, we need to recognize which columns or tables are mentioned in a question. For example, a parser would decide to select the column Status because of the word statuses in Figure 1. However, in the setting of zero-shot parsing, columns or tables might be mentioned in a question without being observed during training. One line of work tries to incorporate inductive biases, e.g., domain-invariant n-gram matching features Guo et al. (2019b); Wang et al. (2019), cross-domain alignment functions Herzig and Berant (2018), or auxiliary linking tasks Chang et al. (2019), to improve schema linking. However, in the cross-lingual setting of Chinese Spider Min et al. (2019), where questions and schemas are not in the same language, it is not obvious how to design such inductive biases (e.g., n-gram matching features). Another line of work relies on large-scale unsupervised pre-training on massive tables Herzig et al. (2020); Yin et al. (2020) to obtain better representations for both questions and database schemas. Our work is orthogonal to these approaches and can be easily coupled with them. As an example, we show in Section 5 that our training procedure can improve the performance of a parser already enhanced with n-gram matching features Guo et al. (2019b); Wang et al. (2019).

Our work is similar in spirit to Givoli and Reichart (2019), who also attempt to simulate source and target domains during learning. However, their optimization updates on virtual source and target domains are only loosely connected by a two-step training procedure, where a parser is first pre-trained on virtual source domains and then fine-tuned on virtual target domains. As we will show in Section 3, our training procedure does not fine-tune on virtual target domains but rather, for every batch, uses them to evaluate a gradient step made on source domains. This is better aligned with test time: there will be no fine-tuning on the real target domains, so there should be no fine-tuning on the simulated ones either. Moreover, they treat the construction of virtual train and test domains as a hyper-parameter, which is only feasible with a small number of domains, making their approach inapplicable to text-to-SQL parsing, which typically involves hundreds of domains.

Meta-Learning for NLP

Meta-learning has been receiving soaring interest in the machine learning community. Unlike conventional supervised learning, meta-learning operates on tasks, instead of data points. Most previous work Vinyals et al. (2016); Ravi and Larochelle (2016); Finn et al. (2017) has focused on few-shot learning, where meta-learning helps address the problem of learning to learn fast for adaptation to a new task or domain. The concept of fast adaptation has been reflected in many low-resource NLP tasks, e.g., low-resource machine translation Gu et al. (2018) and relation classification with limited supervision Obamuyide and Vlachos (2019). The basic motivation is that meta-learning, specifically MAML Finn et al. (2017), can learn a good initialization of parameters such that they can be easily tuned for a new task where only limited training data is available.

Very recently, there have been some adaptations of MAML to semantic parsing tasks Huang et al. (2018); Guo et al. (2019a); Sun et al. (2019). These approaches simulate few-shot learning scenarios by constructing a pseudo-task for each example, where the pseudo training examples are those relevant to the example, retrieved from the original training set. The notion of relevance is then encoded by MAML and can be intuitively understood as: train a parser such that, at test time, it can be easily fine-tuned on the examples relevant to a test example. Lee et al. (2019) use matching networks Vinyals et al. (2016) to enable one-shot text-to-SQL parsing, where tasks for meta-learning are defined by SQL templates, i.e., a parser is expected to generalize to a new SQL template from one example. In contrast, the tasks we construct for meta-learning aim to encourage generalization across domains, instead of adaptation to a new (pseudo-)task with one or a few examples. A clear difference lies in how meta-train and meta-test sets are constructed: in previous work (e.g., Huang et al. 2018), these come from the same domain, whereas we simulate domain shift and sample different sets of domains for meta-train and meta-test.

Domain Generalization

Although the notion of domain generalization has been less explored in semantic parsing, it has been extensively studied in other areas like computer vision Ghifary et al. (2015); Zaheer et al. (2017); Li et al. (2018b). Recent work Li et al. (2018a); Balaji et al. (2018) employed optimization-based meta-learning to handle domain shift issues in domain generalization. We employ the meta-learning objective originally proposed in Li et al. (2018a), where they adapt MAML to encourage generalization in unseen domains (of images). Based on this objective, we propose a cheap alternative that only requires first-order gradients, thus alleviating the overhead of computing second-order derivatives required by MAML.

3 Meta-Learning for Domain Generalization

We first formally define the problem of domain generalization in the context of zero-shot text-to-SQL parsing. Then, we introduce DG-MAML, a training algorithm that helps a parser achieve better domain generalization. Finally, we propose a computationally cheap approximation of DG-MAML.

3.1 Problem Definition

Domain Generalization

Given a natural language question $x$ in the context of a relational database $\mathcal{D}$, we aim at generating the corresponding SQL query $y$. In the setting of zero-shot parsing, we have a set of source domains $\mathcal{S}$ where labeled question-SQL pairs are available. We aim at developing a parser that can perform well on a set of unseen target domains $\mathcal{T}$. We refer to this problem as domain generalization.

Parsing Model

We assume a parsing model, parameterized by $\theta$, that specifies a predictive distribution $p_\theta(y \mid x, \mathcal{D})$ over all possible SQL queries. For domain generalization, a parsing model needs to properly condition on its input questions and databases such that it can generalize well to unseen domains.

Conventional Supervised Learning

Assuming that question-SQL pairs from source domains $\mathcal{S}$ and target domains $\mathcal{T}$ are sampled i.i.d. from the same distribution, the typical training objective of supervised learning is to minimize the negative log-likelihood of the gold SQL query:

$$\mathcal{L}_{\mathcal{B}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i, \mathcal{D}_i) \qquad (1)$$

where $N$ is the size of a mini-batch $\mathcal{B}$. Since a mini-batch is randomly sampled from all training source domains $\mathcal{S}$, it usually contains question-SQL pairs from a mixture of different domains.
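For concreteness, a minimal sketch of this objective, assuming a parser object exposing a log_prob(question, db, sql) method that scores the gold query (a hypothetical interface; RAT-SQL's actual loss decomposes over grammar production rules):

```python
import torch

def supervised_loss(parser, batch):
    """Negative log-likelihood of the gold SQL over a mini-batch
    sampled from a mixture of source domains (Equation 1)."""
    nll = torch.stack([-parser.log_prob(ex.question, ex.db, ex.sql)
                       for ex in batch])  # -log p_theta(y | x, D)
    return nll.mean()                     # average over N = |B|
```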

Distribution of Tasks

Instead of treating semantic parsing as a conventional supervised learning problem, we take an alternative view based on the meta-learning perspective. Basically, we are interested in a learning algorithm that can benefit from a distribution over choices of source and target domains, denoted by $p(\tau)$, where $\tau$ refers to an instance of a zero-shot semantic parsing task with its own source and target domains.

In practice, we usually have a fixed set of training source domains $\mathcal{S}$. We construct a set of virtual tasks $\tau$ by randomly sampling disjoint source and target domains from the training domains. Intuitively, we assume that divergences between the virtual source and target domains during the learning phase are representative of differences between the training domains and the actual test domains. This is still an assumption, but a considerably weaker one than the i.i.d. assumption used in conventional supervised learning. Next, we introduce our training algorithm, DG-MAML, which is based on this assumption.

3.2 Learning to Generalize with DG-MAML

Having simulated source and target domains for each virtual task, we now need a training algorithm that encourages generalization to unseen target domains in each task. For this, we turn to optimization-based meta-learning algorithms Finn et al. (2017); Nichol et al. (2018); Li et al. (2018a) and apply DG-MAML (Domain Generalization with Model-Agnostic Meta-Learning), a variant of MAML Finn et al. (2017) for such purpose. Intuitively, DG-MAML encourages the optimization in the source domain to have a positive effect on the target domain as well.

During each learning episode of DG-MAML, we randomly sample a task $\tau$ which has its own source domain $\mathcal{S}_\tau$ and target domain $\mathcal{T}_\tau$. For the sake of efficiency, we randomly sample mini-batches of question-SQL pairs $\mathcal{B}_s$ and $\mathcal{B}_t$ from $\mathcal{S}_\tau$ and $\mathcal{T}_\tau$, respectively, for learning in each task.
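A minimal sketch of this sampling step, assuming training examples are grouped by database; the number of held-out target databases and the batch size are illustrative choices, not the paper's configuration:

```python
import random

def sample_task(examples_by_db, n_target_dbs=2, batch_size=16):
    """Sample one virtual zero-shot task: disjoint source and target
    domains, plus a mini-batch from each (Steps 2-4 of Algorithm 1)."""
    dbs = list(examples_by_db)
    random.shuffle(dbs)
    target_dbs = dbs[:n_target_dbs]   # virtual unseen domains
    source_dbs = dbs[n_target_dbs:]   # virtual source domains
    source = [ex for db in source_dbs for ex in examples_by_db[db]]
    target = [ex for db in target_dbs for ex in examples_by_db[db]]
    batch_s = random.sample(source, min(batch_size, len(source)))
    batch_t = random.sample(target, min(batch_size, len(target)))
    return batch_s, batch_t
```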

DG-MAML conducts optimization in two steps, namely meta-train and meta-test. We explain both steps as follows.

Meta-Train

DG-MAML first optimizes the parameters towards better performance in the virtual source domain by taking one step of stochastic gradient descent (SGD) on the loss under $\mathcal{B}_s$:

$$\theta' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) \qquad (2)$$

where $\alpha$ is a scalar denoting the learning rate of meta-train. This step resembles conventional supervised learning, where we use stochastic gradient descent to optimize the parameters.

Meta-Test

We then evaluate the resulting parameters $\theta'$ in the virtual target domain by computing the loss under $\mathcal{B}_t$, which is denoted as $\mathcal{L}_{\mathcal{B}_t}(\theta')$.

Our final objective for a task $\tau$ is to minimize the joint loss on $\mathcal{B}_s$ and $\mathcal{B}_t$:

$$\mathcal{L}_\tau(\theta) = \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta)\big) \qquad (3)$$

where we optimize towards better source- and target-domain performance simultaneously. Intuitively, the objective requires that the gradient step conducted in the source domain in Equation (2) also be beneficial to performance in the target domain. In comparison, conventional supervised learning, whose objective would be equivalent to $\mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta)$, does not pose any constraint on the gradient updates. As we will elaborate shortly, DG-MAML can be viewed as adding a regularizer on gradient updates to the objective of conventional supervised learning.

We summarize our DG-MAML training process in Algorithm 1. Basically, it requires two steps of gradient update (Step 5 and Step 7). Note that $\theta'$ is a function of $\theta$ after the meta-train update. Hence, optimizing $\mathcal{L}_\tau(\theta)$ with respect to $\theta$ involves optimizing through the gradient update in Equation (2) as well. That is, when we update the parameters in the final update of Step 7, the gradients need to back-propagate through the meta-train update in Step 5. We will elaborate on this shortly.

The update function in Step 7 could be based on any gradient descent algorithm. In this work, we use the update rule of Adam Kingma and Ba (2014). In principle, gradient updates during meta-train in Step 5 could also be replaced by other gradient descent algorithms. However, we leave this to future work.

Require: Training source domains $\mathcal{S}$
Require: Learning rates $\alpha$ (meta-train) and $\beta$ (final update)
1:  for step $= 1$ to $T$ do
2:     Sample a task $\tau = (\mathcal{S}_\tau, \mathcal{T}_\tau)$ from $p(\tau)$
3:     Sample mini-batch $\mathcal{B}_s$ from $\mathcal{S}_\tau$
4:     Sample mini-batch $\mathcal{B}_t$ from $\mathcal{T}_\tau$
5:     Meta-train update: $\theta' \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta)$
6:     Compute meta-test objective: $\mathcal{L}_\tau(\theta) = \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta')$
7:     Final update: $\theta \leftarrow \theta - \beta \nabla_\theta \mathcal{L}_\tau(\theta)$
8:  end for
Algorithm 1: DG-MAML Training Algorithm
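A minimal PyTorch sketch of one such training step, assuming parser(batch) returns per-example negative log-likelihoods (a hypothetical interface) and using torch.func.functional_call (PyTorch >= 2.0) to run the model under the post-update parameters; the inner learning rate is illustrative:

```python
import torch
from torch.func import functional_call

def dg_maml_step(parser, optimizer, batch_s, batch_t, alpha=1e-4):
    """One DG-MAML update, following Steps 5-7 of Algorithm 1."""
    params = dict(parser.named_parameters())

    # Step 5, meta-train: one SGD step on the virtual source domain.
    # create_graph=True keeps this step differentiable, so the final
    # update can back-propagate through it.
    loss_s = parser(batch_s).mean()
    grads = torch.autograd.grad(loss_s, list(params.values()),
                                create_graph=True)
    theta_prime = {name: p - alpha * g
                   for (name, p), g in zip(params.items(), grads)}

    # Step 6, meta-test: evaluate theta' on the virtual target domain.
    loss_t = functional_call(parser, theta_prime, (batch_t,)).mean()

    # Step 7, final update on the joint objective; differentiating
    # through Step 5 yields the second-order term of Equation (5).
    optimizer.zero_grad()
    (loss_s + loss_t).backward()
    optimizer.step()
```

Here the outer optimizer plays the role of the Adam update used in Step 7.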

3.3 Analysis of DG-MAML

To give an intuition for the objective in Equation (3), we follow Nichol et al. (2018); Li et al. (2018a) and use the first-order Taylor series expansion to approximate it:

$$\mathcal{L}_\tau(\theta) = \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta)\big) \approx \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta) - \alpha\, \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) \cdot \nabla_\theta \mathcal{L}_{\mathcal{B}_t}(\theta) \qquad (4)$$

where in the last step we expand the function $\mathcal{L}_{\mathcal{B}_t}$ at $\theta$. The approximated objective sheds light on what DG-MAML optimizes. In addition to minimizing the losses from both source and target domains, i.e., $\mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta)$, DG-MAML further tries to maximize $\nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) \cdot \nabla_\theta \mathcal{L}_{\mathcal{B}_t}(\theta)$, the dot product between the gradients of the source and target domains. That is, it encourages gradients to generalize between source and target domains within each task $\tau$.
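Spelling out that last step: it is the generic first-order Taylor approximation $f(\theta + \delta) \approx f(\theta) + \delta \cdot \nabla_\theta f(\theta)$, applied with $f = \mathcal{L}_{\mathcal{B}_t}$ and $\delta = -\alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta)$:

$$\mathcal{L}_{\mathcal{B}_t}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta)\big) \approx \mathcal{L}_{\mathcal{B}_t}(\theta) - \alpha\, \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) \cdot \nabla_\theta \mathcal{L}_{\mathcal{B}_t}(\theta)$$

which, added to $\mathcal{L}_{\mathcal{B}_s}(\theta)$, gives Equation (4); the approximation is accurate up to $O(\alpha^2)$ terms.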

3.4 First-Order Approximation

The final update in Step 7 of Algorithm 1 requires second-order derivatives, which may be problematic, inefficient, or unstable with certain classes of models Mensch and Blondel (2018). Hence, we propose an approximation that only requires computing first-order derivatives.

First, the gradient of the objective in Equation (3) can be computed as:

$$\nabla_\theta \mathcal{L}_\tau(\theta) = \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) + \big(I - \alpha H_\theta(\mathcal{L}_{\mathcal{B}_s})\big)\, \nabla_{\theta'} \mathcal{L}_{\mathcal{B}_t}(\theta') \qquad (5)$$

where $I$ is an identity matrix and $H_\theta(\mathcal{L}_{\mathcal{B}_s})$ is the Hessian of $\mathcal{L}_{\mathcal{B}_s}$ at $\theta$. We consider the alternative of ignoring this second-order term and simply assuming that $\nabla_\theta \theta' = I$, i.e., $\nabla_\theta \mathcal{L}_\tau(\theta) \approx \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) + \nabla_{\theta'} \mathcal{L}_{\mathcal{B}_t}(\theta')$. In this variant, we simply combine gradients from the source and target domains. We show in the Appendix that this objective can still be viewed as maximizing the dot product of gradients from the source and target domains.

The resulting first-order training objective, which we refer to as DG-FMAML, is inspired by Reptile, a first-order meta-learning algorithm Nichol et al. (2018) for few-shot learning. A two-step Reptile would perform SGD on the same batch twice, while DG-FMAML performs SGD on two different batches, $\mathcal{B}_s$ and $\mathcal{B}_t$, once each. To put it differently, DG-FMAML tries to encourage cross-domain generalization whereas Reptile encourages in-domain generalization.
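A minimal sketch of a DG-FMAML step, under the same assumed parser interface as above: gradients are taken at $\theta$ on $\mathcal{B}_s$ and at $\theta'$ on $\mathcal{B}_t$, then simply summed, with no second-order terms:

```python
import copy
import torch

def dg_fmaml_step(parser, optimizer, batch_s, batch_t, alpha=1e-4):
    """One DG-FMAML update: first-order only, no Hessian terms."""
    # Gradient on the virtual source domain, taken at theta.
    loss_s = parser(batch_s).mean()
    grads_s = torch.autograd.grad(loss_s, list(parser.parameters()))

    # Inner SGD step on a detached copy: theta' = theta - alpha * g_s.
    parser_prime = copy.deepcopy(parser)
    with torch.no_grad():
        for p, g in zip(parser_prime.parameters(), grads_s):
            p -= alpha * g

    # Gradient on the virtual target domain, taken at theta'.
    loss_t = parser_prime(batch_t).mean()
    grads_t = torch.autograd.grad(loss_t, list(parser_prime.parameters()))

    # Combine the two first-order gradients and update.
    optimizer.zero_grad()
    for p, gs, gt in zip(parser.parameters(), grads_s, grads_t):
        p.grad = gs + gt
    optimizer.step()
```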

4 Semantic Parser

In general, DG-MAML is model-agnostic and can be coupled with any semantic parser to improve its domain generalization. In this work, we use a base parser based on RAT-SQL Wang et al. (2019), which currently achieves state-of-the-art performance on Spider.

Formally, RAT-SQL takes as input a question $x$ and the schema $\mathcal{D}$ of its corresponding database. It then produces a program represented as an abstract syntax tree in the context-free grammar of SQL Yin and Neubig (2018). RAT-SQL adopts the encoder-decoder framework for text-to-SQL parsing and basically has three components: an initial encoder, a transformer-based encoder, and an LSTM-based decoder. The initial encoder provides initial representations, denoted as $X$ and $S$, for the question and the schema, respectively. A relation-aware transformer (RAT) module then takes the initial representations and computes context-aware representations $\tilde{X}$ and $\tilde{S}$ for the question and the schema, respectively. Finally, a decoder generates a sequence of production rules that constitute the abstract syntax tree based on $\tilde{X}$ and $\tilde{S}$. To obtain $X$ and $S$, the initial encoder can be either 1) LSTMs Hochreiter and Schmidhuber (1997) on top of pre-trained word embeddings, like GloVe Pennington et al. (2014), or 2) pre-trained contextual embeddings like BERT Devlin et al. (2018). In our work, we test the effectiveness of our method for both variants.
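Schematically, the three components compose as follows (a structural sketch only: module internals are elided and all interfaces shown are assumptions, not RAT-SQL's actual API):

```python
import torch.nn as nn

class BaseParser(nn.Module):
    """Initial encoder -> relation-aware transformer -> grammar decoder."""
    def __init__(self, init_encoder, rat_encoder, decoder):
        super().__init__()
        self.init_encoder = init_encoder  # LSTM over GloVe, or BERT
        self.rat_encoder = rat_encoder    # relation-aware self-attention
        self.decoder = decoder            # emits SQL production rules

    def forward(self, question, schema, gold_sql=None):
        x, s = self.init_encoder(question, schema)       # initial X, S
        x, s = self.rat_encoder(x, s, schema.relations)  # contextual reps
        return self.decoder(x, s, gold_sql)  # NLL if gold given, else SQL
```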

As shown in Wang et al. (2019), the final encodings $\tilde{X}$ and $\tilde{S}$, which are the output of the RAT module, rely heavily on schema-linking features. These features are extracted by a heuristic function that links question words to columns and tables based on n-gram matching, and they are readily available in the conventional mono-lingual setting of the Spider dataset. However, we hypothesize that the parser's over-reliance on these features is specific to the Spider dataset, where annotators were shown the database schema and asked to formulate queries. As a result, they were prone to re-using terms from the schema verbatim in their questions. This would not be the case in a real-world application, where users are unfamiliar with the structure of the underlying database and free to use arbitrary terms which would not necessarily match column or table names Suhr et al. (2020). Hence, we also evaluate our parser in the cross-lingual setting, where questions and schemas are not in the same language and such features are not available.

5 Experiments

To show the effectiveness of DG-MAML, we integrate it with a base parser and test it on zero-shot text-to-SQL tasks. Then we present further analysis of DG-MAML to show how it affects the domain generalization of the parser. By designing an in-domain benchmark, we also show that the out-of-domain improvement does not come at the cost of in-domain performance.

5.1 Datasets and Metrics

We evaluate DG-MAML on two zero-shot text-to-SQL benchmarks, namely (English) Spider Yu et al. (2018b) and Chinese Spider Min et al. (2019). Spider consists of 10,181 examples (question-SQL pairs) from 206 databases, including 1,659 examples taken from the Restaurants (Popescu et al., 2003; Tang and Mooney, 2000), GeoQuery (Zelle and Mooney, 1996), Scholar (Iyer et al., 2017), Academic (Li and Jagadish, 2014), Yelp and IMDB (Yaghmazadeh et al., 2017) datasets. We follow the original split and use 8,659 examples (from 146 databases) for training and 1,034 examples (from 20 databases) as our development set. The remaining 2,147 examples from 40 test databases are held out and kept by the authors for evaluation.

Chinese Spider is a Chinese version of Spider that translates all NL questions from English to Chinese and keeps the original English databases. It simulates the real-life scenario where schemas for most relational databases in industry are written in English while NL questions from users could be in any other language. It poses the additional challenge of encoding cross-lingual correspondences between Chinese and English. Following Min et al. (2019), we use the same train/development/test split as the Spider dataset.

On both datasets, we report results using the exact set match accuracy metric, following Yu et al. (2018b). As the test sets for both datasets are not publicly available, we mostly use the development sets for further analyses.

5.2 Implementation and Hyperparameters

Our base parser is based on RAT-SQL Wang et al. (2019) and is implemented in PyTorch Paszke et al. (2019). During preprocessing, input questions, column names and table names in schemas are tokenized and lemmatized with Stanza Qi et al. (2020), which can handle both English and Chinese. For English questions and schemas, we use GloVe Pennington et al. (2014) and BERT-large Devlin et al. (2018) as the pre-trained embeddings for encoding. For Chinese questions, we use Tencent embeddings Song et al. (2018) and Multilingual BERT Devlin et al. (2018).

In all experiments, we train for up to 20,000 steps; see the Appendix for the batch size and other hyperparameter configurations.

5.3 Main Results

We present our main results in Tables 1 and 2. On Spider, DG-MAML boosts the performance of the non-BERT base parser by 2.1%, showing its effectiveness in promoting domain generalization. Moreover, the improvement is not cancelled out when the base parser is augmented with BERT representations. On Chinese Spider, DG-MAML helps the non-BERT base parser achieve a substantial improvement (+4.5%). For parsers augmented with multilingual BERT, DG-MAML is also beneficial. Overall, DG-MAML consistently helps the base parser achieve better accuracy, and it is empirically complementary to using pre-trained representations (e.g., BERT).

Compared with the mono-lingual setting of Spider, the performance gain from DG-MAML is larger in the cross-lingual setting of Chinese Spider. This is presumably because the heuristic schema-linking features, which help promote domain generalization on Spider, are not feasible for Chinese Spider. We elaborate on this in Section 5.4.

Model                                        Set Match
SyntaxSQLNet (Yu et al., 2018a)              18.9
Global-GNN (Bogin et al., 2019)              52.7
IRNet (Guo et al., 2019b)                    55.4
RAT-SQL (Wang et al., 2019)                  62.7
Our Models:
  Base Parser                                56.4
  Base Parser + DG-MAML                      58.5
With Pre-trained Representations:
RAT-SQL + BERT-large (Wang et al., 2019)     69.7
RYANSQL + BERT-large (Choi et al., 2020)     70.6
IRNet + BERT-base (Guo et al., 2019b)        61.9
Our Models:
  Base Parser + BERT-base                    66.8
  Base Parser + BERT-base + DG-MAML          68.9

Table 1: Set match accuracy (%) on the development set of Spider.

In-Domain Setting

To confirm that the base parser struggles when applied out-of-domain, we construct an in-domain setting and measure the gap in performance. This setting also helps us address a natural question: does using DG-MAML hurt in-domain performance? This would not have been surprising, as the parser is explicitly optimized towards better performance on unseen target domains. To answer these questions, we create a new split of Spider. Specifically, for each database from the training and development sets of Spider, we include 80% of its question-SQL pairs in the new training set and assign the remaining 20% to the new test set. As a result, the new split consists of 7,702 training examples and 1,991 test examples. Under this split, the parser is tested on databases that have all been seen during training. We evaluate the non-BERT parsers with the same set match metric.
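For clarity, a sketch of this per-database re-split (the grouping key, shuffling, and rounding details are assumptions):

```python
import random

def in_domain_split(examples_by_db, train_frac=0.8, seed=0):
    """Re-split Spider so every database is seen at training time:
    80% of each database's examples for training, 20% for test."""
    rng = random.Random(seed)
    train, test = [], []
    for db, examples in examples_by_db.items():
        examples = examples[:]
        rng.shuffle(examples)
        cut = int(len(examples) * train_frac)
        train.extend(examples[:cut])
        test.extend(examples[cut:])
    return train, test
```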

As the in-domain and out-of-domain settings use different splits, and thus not the same test set, a direct comparison between them only serves as a proxy to illustrate the effect of domain shift. Nevertheless, despite the original out-of-domain split containing more training examples (8,659 vs. 7,702), the base parser tested in-domain achieves much better performance (78.2%) than its counterpart tested out-of-domain (56.4%). This suggests that domain shift genuinely hurts the base parser.

We further study DG-MAML in this in-domain setting to see if it causes a drop in in-domain performance. Somewhat surprisingly, we instead observe a modest improvement (+1.1%) over the base parser trained with conventional supervised learning. This suggests that DG-MAML, despite optimizing the model towards domain generalization, captures, to a certain degree, a more general notion of generalization or robustness, which appears beneficial even in the in-domain setting.

Model                                   Set Match
SyntaxSQLNet (Yu et al., 2018a)         16.4
Our Models:
  Base Parser                           31.0
  Base Parser + DG-MAML                 35.5
With Pre-trained Representations:
RYANSQL + M-BERT (Choi et al., 2020)    41.3
Our Models:
  Base Parser + M-BERT                  47.0
  Base Parser + M-BERT + DG-MAML        50.1

Table 2: Set match accuracy (%) on the development set of Chinese Spider. M-BERT: Multilingual BERT.

5.4 Additional Experiments and Analysis

We first discuss additional experiments on linking features and DG-FMAML, and then present further analysis probing how DG-MAML works.

Linking Features

As mentioned in Section 2, previous work addressed domain generalization by focusing on the sub-task of schema linking. For Spider, where questions and schemas are both in English, Wang et al. (2019) leverage n-gram matching features which improve schema linking and significantly boost parsing performance. However, in Chinese Spider it is not obvious how to design such linking heuristics. Moreover, as pointed out by Suhr et al. (2020), the assumption that columns/tables are explicitly mentioned is not general enough, implying that exploiting matching features would not be a good general solution to domain generalization. Hence, we would like to see whether DG-MAML remains beneficial when those features are not present.

Specifically, we consider a variant of the base parser that does not use these features, and train it with conventional supervised learning and with DG-MAML on Spider. As shown in Table 3, we confirm that these features have a big impact on the base parser. (Some results in Table 3 differ from Tables 1 and 2: Table 3 reports average dev set performance over three runs, whereas the earlier tables show the best model, selected based on dev set performance.) More importantly, in the absence of these features, DG-MAML boosts the performance of the base parser by a larger margin. This is consistent with the observation that DG-MAML is more beneficial on Chinese Spider than on Spider: the parser needs to rely more on DG-MAML when such heuristics are not integrated or not available for domain generalization.

Model                            Dev (%)
Spider
  Base Parser                    55.6 ± 0.5
   + DG-FMAML                    56.8 ± 1.2
   + DG-MAML                     58.0 ± 0.8
  Base Parser without Features   38.2 ± 1.0
   + DG-FMAML                    41.8 ± 1.5
   + DG-MAML                     43.5 ± 0.9
Chinese Spider
  Base Parser                    29.7 ± 1.1
   + DG-FMAML                    32.5 ± 1.3
   + DG-MAML                     34.3 ± 0.9

Table 3: Accuracy (and confidence interval) on the development sets of Spider and Chinese Spider.

Effect of DG-FMAML

We investigate the effect of the first-order approximation in DG-FMAML to see whether it provides reasonable performance compared with DG-MAML. We evaluate it on the development sets of the two datasets (see Table 3). DG-FMAML consistently boosts the performance of the base parser, although it lags behind DG-MAML. For a fair comparison, we use the same batch size for DG-MAML and DG-FMAML. However, given that DG-FMAML uses less memory, it could potentially benefit from a larger batch size. In practice, DG-FMAML is twice as fast to train as DG-MAML; see the Appendix for details.

Probing Domain Generalization

Schema linking has been the focus of previous work on zero-shot semantic parsing. We take the opposite direction and use this task to probe the parser, to see whether it achieves domain generalization, at least to a certain degree, by improving schema linking. Our hypothesis is that improved linking is the mechanism which prevents the parser from overfitting to the source domains.

We propose to use 'relevant column recognition' as a probing task. Specifically, relevant columns are the columns that are mentioned in a SQL query. For example, the SQL query "Select Status, avg(Population) From City Groupby Status" in Figure 1 contains two relevant columns: 'Status' and 'Population'. We formalize this task as a binary classification problem: given an NL question and a column from the corresponding database schema, a binary classifier should predict whether the column is mentioned in the gold SQL query. We hypothesize that representations from the DG-MAML parser will be more predictive of relevance than those of the baseline parser, and that the probing classifier will be able to detect this difference in the quality of the representations.

We first obtain the representations of NL questions and schemas from the parsers and keep them fixed. The binary classifier is then trained based only on these representations. For classifier training, we use the same split as the Spider dataset, i.e., the classifier is evaluated on unseen databases. Details of the classifier are provided in the Appendix; a sketch follows Table 4 below. The results are shown in Table 4. The classifier trained on representations from the DG-MAML parser achieves better performance. This confirms our hypothesis that DG-MAML yields better encodings of NL questions and database schemas, and that this is one of the mechanisms the parsing model uses to ensure generalization.

Model Precision Recall F1
Spider
Base Parser 70.0 70.4 70.2
Base Parser + DG-MAML 73.8 70.6 72.1
Chinese Spider
Base Parser 61.5 60.4 61.0
Base Parser + DG-MAML 66.8 61.2 63.9
Table 4: Performance (%) of column prediction on the development sets of Spider and Chinese Spider.
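As an illustration of the probe described above, a minimal sketch of the classifier, assuming the frozen question token encodings are mean-pooled and concatenated with each frozen column encoding (the pooling scheme and dimensions are illustrative, not the configuration in the Appendix):

```python
import torch
import torch.nn as nn

class ColumnProbe(nn.Module):
    """Binary probe: is a given column mentioned in the gold SQL query?
    Trained on frozen parser encodings only."""
    def __init__(self, d_model=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, question_enc, column_enc):
        # question_enc: (num_tokens, d_model) frozen question encodings
        # column_enc:   (d_model,) frozen encoding of one column
        q = question_enc.mean(dim=0)                     # pool question
        logit = self.scorer(torch.cat([q, column_enc]))  # relevance score
        return logit.squeeze(-1)
```

Such a probe would be trained with a standard binary cross-entropy loss on (question, column) pairs from the training databases and evaluated on the development databases.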

6 Conclusions

The task of zero-shot semantic parsing has been gaining momentum in recent years. However, previous work has not proposed algorithms or objectives that explicitly promote domain generalization. We rely on the meta-learning framework to encourage domain generalization. Instead of learning from individual data points, DG-MAML learns from a set of virtual zero-shot parsing tasks. By optimizing towards better target-domain performance in each simulated task, DG-MAML encourages the parser to generalize better to unseen domains.

We conduct experiments on two zero-shot text-to-SQL parsing datasets. In both cases, using DG-MAML leads to a substantial boost in performance. Furthermore, we show that the faster first-order approximation DG-FMAML can also help a parser achieve better domain generalization.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments. We gratefully acknowledge the support of the European Research Council (Titov: ERC StG BroadSem 678254; Lapata: ERC CoG TransModal 681760) and the Dutch National Science Foundation (NWO VIDI 639.022.518).

References