Systematic compositionality—expressing novel complex concepts as a systematic composition of expressions for simpler concepts—is the property underlying human languages’ immense expressive power and productivity Lake et al. (2017); Fodor and Pylyshyn (1988). However, NLP models, in spite of large-scale pretraining, have been shown to struggle to generalize compositionally to novel composite expressions in the context of semantic parsing Lake and Baroni (2018); Loula et al. (2018); Kim and Linzen (2020).
Data augmentation has been extensively explored for compositional generalization Akyürek et al. (2021); Guo et al. (2021); Wang et al. (2021); Guo et al. (2020); Qiu et al. (2021). However, training instances that overlap in their structure—such as abstract syntax trees (ASTs) of target programs—are sub-optimal, and any augmentation mechanism that ignores this will require unnecessarily large training data to achieve compositional generalization. One way to improve sample efficiency is to train on a diverse training set—one that contains a variety of structures. Prior work exploring the efficiency of diverse training sets has been very limited in scope. Oren et al. (2021) demonstrate that template-diverse subsamples from a pool of instances are more sample-efficient than random subsamples. However, they only experiment with a single dataset, and given that templates themselves overlap in structure, diversifying over them can itself be inefficient. Bogin et al. (2022) demonstrate benefits of diversifying over bigrams—adjacent nodes in the AST of a logical form—which are too fine-grained. It is reasonable to expect that the optimal granularity of substructures over which to diversify lies somewhere between the extremes of templates and bigrams. Moreover, while they show benefits from diversity in compositional splits, there is limited evidence for whether diverse training sets also help in the traditional IID setting.
In this study, we take a broader look at the efficacy of diversity in datasets. We propose a general, substructure-agnostic recipe for structurally diverse subsampling algorithms and use it to formulate a novel algorithm that diversifies over program subtrees. Evaluating on template and IID splits of 5 different semantic parsing datasets of varying complexities, we find that structurally diverse train sets perform best in 9 out of 10 splits, with over 10-point improvements in exact match accuracy in 5 of them. In particular, our proposed Subtree diversity algorithm is the most consistent, performing better than Template and Bigram diversity in 6 out of 10 splits. Interestingly, we find that the original versions of both the bigram and template diversity algorithms often perform worse than random subsampling in IID splits, but improve with some simple modifications.
We further show that diverse subsampling can also be used to sample test sets that more comprehensively cover the space of instance structures and hence are far more challenging than IID splits for models trained on random train sets than for those trained on diverse sets. Finally, we use information theory to explore the improved generalization from diverse train sets and find that the weakening of spurious correlations between substructures might be one way diversity helps.
While the idea of diverse subsampling is applicable to any task with structured inputs or outputs, the sampling algorithms we use in this work operate on trees, and we focus in particular on abstract syntax trees of output programs found in semantic parsing tasks. In semantic parsing, an utterance has to be parsed into an executable program or logical form. Further, following Oren et al. (2021) and Bogin et al. (2022), we assume access to a pool of instances $\mathcal{P} = \{x_1, \dots, x_N\}$, where each $x_i$ is an utterance-program pair, from which we want to subsample $n$ instances. We view each instance $x$ as a bag of substructures $S(x) \subseteq \mathbb{S}$, where $S$ is the mapping from instances to their substructures and $\mathbb{S}$ is the set of all substructures. The substructures may be shared between instances.
The goal of a structurally-diverse subsampling algorithm is to select instances that contain among them as many substructures in as many different contexts as possible. When used for training, the hypothesis is that such a subsample will give the model more information about the system
underlying the instances than a random subsample and hence improve generalization and sample efficiency. Intuitively, randomly sampling instances from a grammar, or from human annotators, is likely to produce skewed distributions with spurious correlations between substructures; a diverse sampling algorithm minimizes these correlations. Similarly, when subsampling to create test sets, we argue that a diverse subsample also makes for more comprehensive test sets by better covering the space of possible structures. This is analogous to why one would choose to evaluate on a test set with balanced classes even when the training data might have considerable imbalance.
3 Structurally Diverse Datasets
3.1 Substructures

Oren et al. (2021) and Bogin et al. (2022) used program templates and bigrams extracted from program ASTs. While we follow them in extracting substructures from programs alone, we choose different substructures, as we believe templates and bigrams are not at the right granularity to diversify over. Templates are too coarse; they share a lot of structure and may themselves be large in number, making it inefficient to even cover all of them. Bigrams, on the other hand, are too fine and may be unable to capture many salient structural patterns. We thus use subtrees of the program AST up to a certain size $K$. The ASTs are constructed as in Bogin et al. (2022): tokens in a program are categorized as either functions, values, or structural tokens (such as parentheses or commas) that define the hierarchical structure of the program. Figure 1 shows an example program along with the different substructures extracted from it that we consider in this work: templates, bigrams, and subtrees.
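As an illustration of these substructure types, the sketch below (our own toy code, not the authors' released implementation) extracts bigrams and, for simplicity, depth-truncated subtrees up to a size limit from the COVR-style program in Table 1; the tuple-based AST encoding and the size/depth limits are assumptions for illustration:

```python
# A tiny AST as nested tuples: (label, child, child, ...)
ast = ("count", ("with_relation",
                 ("filter", ("black",), ("find", ("dog",))),
                 ("chasing",),
                 ("find", ("mouse",))))

def nodes(tree):
    """Yield every node (i.e. every subtree root) in the AST."""
    yield tree
    for child in tree[1:]:
        yield from nodes(child)

def truncate(tree, depth):
    """Keep only the top `depth` levels of a tree."""
    if depth == 1:
        return (tree[0],)
    return (tree[0],) + tuple(truncate(c, depth - 1) for c in tree[1:])

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in tree[1:])

def subtrees(ast, max_size=4, max_depth=3):
    """Bag of depth-truncated subtrees with at most `max_size` nodes."""
    out = set()
    for node in nodes(ast):
        for d in range(1, max_depth + 1):
            t = truncate(node, d)
            if size(t) <= max_size:
                out.add(t)
    return out

def bigrams(ast):
    """Parent-child label pairs, roughly the bigrams of Bogin et al. (2022)."""
    out = set()
    for node in nodes(ast):
        for child in node[1:]:
            out.add((node[0], child[0]))
    return out
```

A template, by contrast, would be the whole program with values anonymized, so all of its structure lives in a single substructure.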
3.2 Structurally Diverse Subsampling
As described in §2, the goal of structurally diverse subsampling is to sample instances that contain many different substructures in many different contexts to maximize the pressure on the model to learn the underlying compositional system. One way to approach this would be to set it up as an optimization problem that directly selects an optimally diverse set of instances. This, however, has the challenge of having to define a measure of diversity that is also tractable to optimize given pools that may be very large, making it harder to experiment with different measures of diversity.
We thus choose to take the route of an iterative algorithm (pseudo-code in Algorithm 1) that picks instances from the pool one by one until the requisite number of instances has been sampled. At a high level, in each iteration, it first selects a substructure based on some substructure-weighting scheme $W_S$, and then selects an instance containing that substructure using another, instance-weighting scheme $W_X$. It also keeps track of the substructures sampled so far and resets this record once all substructures have been sampled, allowing it to cycle over them repeatedly. The weighting schemes can be used to encode our interpretation of diversity. Additionally, being iterative, the algorithm has the usual benefit of being able to stop at any time and return the currently sampled instances.
In its current state, Algorithm 1 is just a recipe; instantiating it into a concrete subsampling algorithm requires specifying the substructure definition and the substructure- and instance-weighting schemes. In the following sections, we describe our proposed subsampling algorithm, which improves subtree diversity within the framework of this recipe, and show that the recipe also subsumes the template diversity and bigram diversity algorithms of Oren et al. (2021) and Bogin et al. (2022).
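Since Algorithm 1 is not reproduced here, the sketch below gives one possible reading of the iterative recipe; the function names, signatures, and the exact reset and tie-breaking details are our own assumptions rather than the paper's pseudo-code:

```python
import random

def diverse_subsample(pool, substructures, n, w_sub, w_inst, seed=0):
    """One reading of the iterative diverse-subsampling recipe.
    pool          -- list of instances
    substructures -- maps an instance to its set of substructures
    n             -- sampling budget
    w_sub         -- weight(substructure, remaining_pool, sampled_subs)
    w_inst        -- weight(instance, chosen_so_far)
    """
    rng = random.Random(seed)
    remaining = list(pool)
    all_subs = set().union(*(substructures(x) for x in pool))
    sampled_subs = set()               # reset once everything is covered
    chosen = []
    while remaining and len(chosen) < n:
        # 1. pick a substructure according to w_sub
        cands = [s for s in all_subs
                 if any(s in substructures(x) for x in remaining)]
        weights = [w_sub(s, remaining, sampled_subs) for s in cands]
        if sum(weights) == 0:          # all covered: cycle over them again
            sampled_subs = set()
            weights = [w_sub(s, remaining, sampled_subs) for s in cands]
        s = rng.choices(cands, weights=weights)[0]
        # 2. pick an instance containing that substructure via w_inst
        inst_cands = [x for x in remaining if s in substructures(x)]
        inst_weights = [w_inst(x, chosen) for x in inst_cands]
        if sum(inst_weights) == 0:     # fall back to uniform choice
            inst_weights = [1.0] * len(inst_cands)
        x = rng.choices(inst_cands, weights=inst_weights)[0]
        chosen.append(x)
        remaining.remove(x)
        sampled_subs |= substructures(x)
    return chosen
```

Plugging in different weighting functions yields the different variants: a frequency-based `w_sub` with a constant `w_inst` approximates subtree diversity, while making `substructures` return a singleton template recovers template diversity.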
3.3 Subtree Diversity
Our subsampling algorithm diversifies over subtrees as defined in §3.1, i.e. $S(x)$ is the set of subtrees of size at most $K$ in the AST of $x$. In this work we use a single fixed value of $K$. Further, it uses the following substructure- and instance-weighting schemes:
Substructure Selection: We use a substructure-weighting function that prioritizes selecting more frequent unsampled substructures in the pool: $W_S(s) = |\{x \in \mathcal{P} : s \in S(x)\}|$ if $s$ has not yet been sampled, and $0$ otherwise. Here $|\cdot|$ denotes set cardinality.
Instance Selection: We experiment with the following instance weighting schemes:
RandEx samples an instance uniformly at random: $W_X$ is a constant function.
RandNewT samples an instance with an unseen template: $W_X(x) = \mathbb{1}[T(x) \notin \mathcal{T}_{sampled}]$, where $T(x)$ is the template of $x$, $\mathcal{T}_{sampled}$ is the set of templates sampled so far, and $\mathbb{1}$ is the indicator function.
FreqNewT samples the instance with the most frequent unsampled template: $W_X(x) = |\{x' \in \mathcal{P} : T(x') = T(x)\}|$ if $T(x) \notin \mathcal{T}_{sampled}$, and $0$ otherwise. This and RandNewT try to improve coverage over templates.
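To make these weighting schemes concrete, here is a small self-contained sketch (our own toy illustration; the dict-based instance format, field names, and function names are assumptions, not the paper's code):

```python
from collections import Counter

# Toy instances: each has a bag of subtrees and a template id.
pool = [
    {"template": "T1", "subtrees": {"s1", "s2"}},
    {"template": "T1", "subtrees": {"s2", "s3"}},
    {"template": "T2", "subtrees": {"s3", "s4"}},
    {"template": "T2", "subtrees": {"s1", "s4"}},
    {"template": "T3", "subtrees": {"s5"}},
]

template_freq = Counter(x["template"] for x in pool)

# Substructure weighting: pool frequency of s if unsampled, else 0.
def w_sub(s, sampled_subs):
    if s in sampled_subs:
        return 0
    return sum(s in x["subtrees"] for x in pool)

# Instance-weighting schemes:
def rand_ex(x, sampled_templates):
    """Uniform over candidate instances."""
    return 1.0

def rand_new_t(x, sampled_templates):
    """Indicator of an unseen template."""
    return float(x["template"] not in sampled_templates)

def freq_new_t(x, sampled_templates):
    """Pool frequency of the template, if unseen; 0 otherwise."""
    if x["template"] in sampled_templates:
        return 0.0
    return float(template_freq[x["template"]])
```

Each scheme only changes how candidate instances are weighted once a substructure has been selected; the surrounding iterative loop stays the same.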
3.4 Template and Bigram Diversity
Template Diversity The template diversity algorithm of Oren et al. (2021) is an instantiation of our recipe where:
- $S(x)$ is a singleton set containing the template for $x$.
- $W_S$ is constant over unsampled templates, i.e. a uniformly random unsampled template is selected.
- $W_X$ is a constant function (RandEx).

Bigram Diversity Similarly, the bigram diversity algorithm of Bogin et al. (2022) is an instantiation where:
- $S(x)$ is the set of bigrams in $x$'s AST as defined by Bogin et al. (2022).
- Their algorithm samples uniformly from unsampled bigrams for as long as some bigram remains unsampled, and uniformly from all bigrams thereafter. This can be formulated as $W_S(s) = \mathbb{1}[s \notin \mathbb{S}_{all}]$ while $\mathbb{S}_{all} \neq \mathbb{S}$, and constant afterwards, where $\mathbb{S}_{all}$ is the set of all bigrams sampled over the entire run. Note that $\mathbb{S}_{all}$ is different from the record of sampled substructures in Algorithm 1, which is emptied to enable repeatedly cycling through substructures.
- $W_X$ is a constant function (RandEx).
Additionally, we experiment with using the template and bigram diversity algorithms along with the substructure-weighting scheme described in §3.3 that prioritizes more frequent substructures. These will be referred to as Template[Freq] and Bigram[Freq], respectively.
Given a dataset $\mathcal{D}$ of utterance-program pairs, we first create three different types of splits, described below, with each split consisting of a training pool $\mathcal{P}$ and a test set $\mathcal{D}_{test}$. We then compare the various subsampling algorithms described in §3 by using them to sample training sets of varying budget sizes from $\mathcal{P}$ and evaluating on $\mathcal{D}_{test}$.
IID split For this, we randomly sample some instances from $\mathcal{D}$ to use as $\mathcal{D}_{test}$, keeping the rest for $\mathcal{P}$.
Template split This is a type of compositional split proposed by Finegan-Dollak et al. (2018). Here, instances are grouped based on their program template, which is obtained by applying an anonymization function that replaces certain program tokens, such as strings and numbers, with their abstract types. The split is then created by randomly splitting the set of templates into a train set and a test set, and using the examples for the train and test templates as $\mathcal{P}$ and $\mathcal{D}_{test}$, respectively. We follow the procedure of Bogin et al. (2022) to obtain solvable template splits, i.e. splits in which no token appears in the test set without also appearing in the training set.
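A toy sketch of how such a template split might be constructed; the regex-based anonymization here is a simplified stand-in for the dataset-specific anonymization functions used by Finegan-Dollak et al. (2018) and Bogin et al. (2022), and the function names are our own:

```python
import random
import re

def anonymize(program: str) -> str:
    """Toy anonymization: quoted strings -> STR, numbers -> NUM."""
    program = re.sub(r'"[^"]*"', "STR", program)
    program = re.sub(r"\b\d+(\.\d+)?\b", "NUM", program)
    return program

def template_split(instances, test_frac=0.2, seed=0):
    """Split (utterance, program) pairs by held-out program templates."""
    templates = sorted({anonymize(p) for _, p in instances})
    rng = random.Random(seed)
    rng.shuffle(templates)
    k = int(len(templates) * test_frac)
    test_templates = set(templates[:k])
    train = [(u, p) for u, p in instances
             if anonymize(p) not in test_templates]
    test = [(u, p) for u, p in instances
            if anonymize(p) in test_templates]
    return train, test
```

Because whole templates are held out, every test program's structure is unseen at training time, which is what makes these splits compositional.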
Subtree split As discussed in §2, diversely subsampled sets of instances may also make for more comprehensive test sets. To test this, we also evaluate on a third type of split: we use the Subtree[FreqNewT] diverse subsampling algorithm to sample test sets $\mathcal{D}_{test}$ from $\mathcal{D}$, using the rest of the instances as $\mathcal{P}$.
| Dataset | Input Utterance | Target Program |
| --- | --- | --- |
| COVR (synthetic) | What is the number of black dog that is chasing mouse ? | count ( with_relation ( filter ( black , find ( dog ) ) , chasing , find ( mouse ) ) ) |
| Overnight (synthetic + natural) | person whose height is 180 cm and whose birthdate is 2004 / what person born in 2004 is 180 cm tall | (listValue (filter (filter (getProperty (singleton en.person) (string !type)) (string height) (string =) (number 180 en.cm)) (string birthdate) (string =) (date 2004 -1 -1))) |
| Schema2QA (synthetic) | what people are named aideliz li | ( Person ) filter id = "aideliz li" |
| Atis (natural) | a flight on continental airlines leaving boston and going to denver | ( lambda $0 e ( and ( flight $0 ) ( airline $0 co : al ) ( from $0 boston : ci ) ( to $0 denver : ci ) ) ) |
| SM-CalFlow (natural) | When is my next staff meeting scheduled for? | (Yield (Event.start (FindNumNextEvent (Event.subject? (? = "staff meeting")) 1L))) |

Table 1: Example input utterances and target programs from each dataset.
We use five different semantic parsing datasets from a variety of domains for our analysis, with varying numbers of program templates and both synthetic and natural language input utterances. Tables 1 and 2 show a few examples and statistics on the number of instances and the different types of compounds, respectively.
COVR: A synthetic dataset that uses a variable-free functional query language and is generated using a synchronous context-free grammar (SCFG) adapted from the VQA dataset of Bogin et al. (2021a). We use the SCFG to generate 100K examples for our experiments.
Overnight Wang et al. (2015): A dataset containing both synthetic and natural language utterances from 11 domains (e.g. socialnetwork, restaurants, etc.) paired with Lambda-DCS logical forms.
SM-CalFlow Andreas et al. (2020): Consists of dialogs paired with LISP programs. Each instance is a single dialogue turn from one of two domains pertaining to creating calendar events or querying an org chart.
With the exception of SM-CalFlow, which we took from Andreas et al. (2020), we used the pre-processed versions of the above datasets provided by Bogin et al. (2022) and used their code (https://github.com/benbogin/unobserved-local-structures) to anonymize programs and produce ASTs. For SM-CalFlow, we anonymized strings and numbers such as “staff meeting” and “1L” in Table 1 and used nltk (https://www.nltk.org/) to produce ASTs.
4.3 Model and Training
We use the pretrained BART-base model Lewis et al. (2020) for our experiments, fine-tuning it on subsamples for each dataset, split and subsampling algorithm. For each dataset we use 5 different random seeds to create 5 splits of each type. Then for each split and training budget, we use 3 different seeds to subsample 3 training sets for each subsampling algorithm. See App. A for more details.
5.1 Structural Diversity in Train Sets
Diverse subsampling is more sample-efficient than random subsampling. Diverse subsampling outperforms random subsampling in both IID and template splits, being more efficient in 9 out of 10 dataset-split combinations, with over 10-point improvements in template splits and over 5-point improvements in IID splits of all datasets except SM-CalFlow. The only splits where random subsampling performs better are the IID splits of SM-CalFlow. We believe this is because target programs in SM-CalFlow contain free-form strings, which inflate the number of substructures (bigrams or subtrees), as shown in Table 2.
Subtree diversity is most consistent. Among the diverse subsampling algorithms, Subtree diversity is the most consistent. It is more efficient than Bigram diversity in all datasets and splits and outperforms Template diversity everywhere except in the template splits of Atis, Schema2QA and Overnight which all have very few templates (as shown in Table 2) and hence little to gain from diversifying over smaller substructures. These results suggest that for most tasks except for those with little structural diversity in the first place, diversification over more granular substructures would be more sample efficient.
Prioritizing more frequent compounds improves efficiency. The original Bigram and Template diversity algorithms (Bigram and Template), while good in template splits, are worse than even random subsampling in IID splits. However, replacing their compound selection scheme with one that prioritizes frequent compounds (Bigram[Freq] and Template[Freq]) improves both their efficiencies in IID splits with minor degradation in template splits. The only exception is the template diversity in COVR. This is expected given that it was generated from a synchronous grammar with production rules sampled uniformly at random and hence is dominated by instances with shorter templates. Thus, Template[Freq] will only pick these.
Improving coverage over templates makes subtree diversity more consistent. Comparing Subtree[RandEx] and Subtree[RandNewT] we see that prioritizing examples with unseen templates improves efficiency in template splits with relatively minor reductions in IID splits.
5.2 Structural Diversity in Test Sets
Figure 4 compares the performance of random and diverse (Subtree[RandEx]) subsamples in IID and Subtree splits. It is clear that Subtree splits are harder than IID splits. More importantly, while random training sets perform very poorly on them, diverse subsamples quickly climb close to their performance on IID splits in all datasets except Atis. Note that this is not merely because of distributional change: these splits were created using a different variant of Subtree diverse subsampling (Subtree[FreqNewT]) and bigram and template diversity algorithms also regain much of their IID performance (see App. B). We thus believe that this demonstrates comprehensiveness of diverse samples as test sets.
6 Information-theoretic Analysis
Having seen that structurally diverse datasets indeed improve generalization, we now attempt to explain why they do so. Our hypothesis is that they improve generalization by weakening the spurious correlations between substructures that would exist in any dataset of bounded size. To test this, we define a measure for correlations and compare its value for random and diverse subsamples.
Measure for Correlations We measure correlations between substructures in a set of instances as the sum of pairwise mutual information (MI) between them, normalized to take into account the larger number of distinct substructures in diverse subsamples. Formally, as earlier, we view each instance as a bag of substructures, specifically subtrees up to size $K$, with $\mathbb{S}$ as the set of all subtrees. We define an indicator random variable $Z_s$ for each subtree $s \in \mathbb{S}$ that is 1 if $s$ is present in a random instance, and 0 otherwise. MI between two subtrees $s_1$ and $s_2$ is then defined as,

$$I(Z_{s_1}; Z_{s_2}) = \sum_{a \in \{0,1\}} \sum_{b \in \{0,1\}} p(Z_{s_1}{=}a, Z_{s_2}{=}b) \log \frac{p(Z_{s_1}{=}a, Z_{s_2}{=}b)}{p(Z_{s_1}{=}a)\, p(Z_{s_2}{=}b)}$$

where $p(Z_s{=}1)$ is the empirical probability of $s$ occurring in an instance and $p(Z_{s_1}{=}1, Z_{s_2}{=}1)$ is the empirical probability that $s_1$ and $s_2$ co-occur in an instance, i.e.

$$p(Z_s{=}1) = \frac{|\{x : s \in S(x)\}|}{N}, \qquad p(Z_{s_1}{=}1, Z_{s_2}{=}1) = \frac{|\{x : s_1 \in S(x) \wedge s_2 \in S(x)\}|}{N}.$$

Other probabilities are defined analogously. Finally, our measure of correlations, normalized cumulative mutual information (NCMI), is the average of $I(Z_{s_1}; Z_{s_2})$ over all pairs of subtrees.
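A naive, self-contained estimate of this average pairwise MI from bags of substructures might look as follows (our own sketch; the paper's exact normalization and smoothing may differ):

```python
from itertools import combinations
from math import log

def ncmi(instances):
    """Average pairwise mutual information between subtree indicator
    variables, estimated empirically from a list of substructure bags."""
    n = len(instances)
    subs = sorted(set().union(*instances))
    # marginal probability of each substructure appearing in an instance
    p1 = {s: sum(s in x for x in instances) / n for s in subs}
    total, pairs = 0.0, 0
    for s1, s2 in combinations(subs, 2):
        pairs += 1
        for a in (False, True):
            for b in (False, True):
                # joint probability of the (a, b) presence pattern
                p_joint = sum((s1 in x) == a and (s2 in x) == b
                              for x in instances) / n
                pa = p1[s1] if a else 1 - p1[s1]
                pb = p1[s2] if b else 1 - p1[s2]
                if p_joint > 0 and pa > 0 and pb > 0:
                    total += p_joint * log(p_joint / (pa * pb))
    return total / pairs if pairs else 0.0
```

On perfectly co-occurring substructures the measure approaches $\log 2$ per pair, while on independent ones it is 0, matching the intuition that diverse subsamples should weaken substructure correlations.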
Results For each dataset, we took 3 samples using the random and subtree subsampling algorithms, treating the entire dataset as the pool, and computed their NCMI. Figure 5 shows that diverse subsamples have lower NCMI than random subsamples across different training budgets and the different split types, confirming our hypothesis that diverse datasets have lower correlations. Moreover, examining Spearman correlations between NCMI and accuracy (Table 3) for train sets subsampled using the random, Subtree[RandEx], and Subtree[RandNewT] algorithms from the experiments in §4, we find that they are significantly negatively correlated, lending further credence to our measure. These results substantiate our hypothesis that the reduction of spurious correlations is indeed one way in which diverse subsamples improve generalization.
7 Related Work
Benchmarks Multiple benchmarks for compositional generalization have been created. These include those created manually specifically for the purpose of measuring generalization Lake and Baroni (2018); Bastings et al. (2018); Bahdanau et al. (2019); Kim and Linzen (2020); Ruis et al. (2020); Bogin et al. (2021a). Compositional splits have also been created automatically from existing semantic parsing datasets by splitting by output length Lake and Baroni (2018), holding out program templates Finegan-Dollak et al. (2018), and by maximizing compound divergence between the training and test sets Keysers et al. (2020); Shaw et al. (2021).
Improving Generalization Many approaches to examining and improving compositional generalization have been proposed, including specialized architectures with inductive bias for compositional generalization Herzig and Berant (2021); Bogin et al. (2021b); Chen et al. (2020); Gordon et al. (2020); Yin et al. (2021), data augmentation Andreas (2020); Akyürek et al. (2021); Guo et al. (2021), modifications to training methodology Oren et al. (2020); Csordás et al. (2021) and meta learning Conklin et al. (2021); Lake (2019). Data-based approaches have the advantage of being model-agnostic and hence can be used in conjunction with pretrained models. However without a more informed choice of what instances are added, given the infinite productivity of language, they would require large amounts of data to induce compositional generalization. Selecting structurally diverse instances is one such choice that leads to improved generalization and sample efficiency.
Data Selection Our work is most closely related to that of Oren et al. (2021) and Bogin et al. (2022), both of which take into account the structure of output programs in semantic parsing and select structurally diverse instances for training to improve compositional generalization and sample efficiency. Our work, especially that on sampling diverse test sets, is also related to work on creating compositional splits from existing datasets Keysers et al. (2020); Shaw et al. (2021); Bogin et al. (2022) and reducing biases in datasets via adversarial filtering or other means Bras et al. (2020); Sakaguchi et al. (2021); Gardner et al. (2021) and representation debiasing Li and Vasconcelos (2019).
In this work we broadly explored the efficacy of structural diversity for the task of semantic parsing. We proposed a novel subtree diversity algorithm that diversifies over subtrees in program abstract syntax trees. Evaluating on multiple datasets with varying complexities and logical formalisms and on both IID and compositional splits, we demonstrated that diversity almost always, and often greatly, improves generalization and sample efficiency with our proposed algorithm. We further demonstrated that structurally diverse sets of instances also make for more comprehensive test sets and showed that reduction in spurious correlations might be one reason why diverse train sets are more sample efficient.
We hope that our experiments demonstrating the benefits of diversity in a wide variety of datasets will encourage future research into other, potentially better substructures in other structured prediction tasks and into sampling algorithms with better sample efficiency. In particular, the current algorithms assume the availability of a pool of labeled instances. In many semantic parsing scenarios, it is possible to sample a large pool of possible outputs from which diverse programs can be selected and then annotated with corresponding utterances, but this is one of the only scenarios where our current algorithms apply. Future research could thus look at applying diversity when no pool is available, e.g., when generating a synthetic dataset from a synchronous grammar, or at preferring structural diversity in the input space instead of the output space (if a large collection of unlabeled inputs can be obtained or generated). Another important direction is to apply principles of diversity to natural language utterances, for which a hierarchical representation may not be as readily available and which also exhibit considerable linguistic variability. Additionally, future work may look at better measures of spurious correlations, which we believe may lead to more efficient diverse sampling algorithms.
Finally, and quite importantly, in the context of compositional generalization, our results indicate that the apparent lack of it in NLP models may be due in part to the lack of diversity in datasets, suggesting the need to create more diverse compositional generalization benchmarks.
- Akyürek et al. (2021) Ekin Akyürek, Afra Feyza Akyurek, and Jacob Andreas. 2021. Learning to recombine and resample data for compositional generalization. ArXiv, abs/2010.03706.
- Andreas (2020) Jacob Andreas. 2020. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566, Online. Association for Computational Linguistics.
- Andreas et al. (2020) Jacob Andreas, John Bufe, David Burkett, Charles Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner, Jason Eisner, Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo Lanman, Percy Liang, Christopher H. Lin, Ilya Lintsbakh, Andy McGovern, Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth, Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Snyder, Stephon Striplin, Yu Su, Zachary Tellman, Sam Thomson, Andrei Vorobev, Izabela Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, and Alexander Zotov. 2020. Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics, 8:556–571.
- Bahdanau et al. (2019) Dzmitry Bahdanau, Harm de Vries, Timothy J. O’Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, and Aaron C. Courville. 2019. CLOSURE: assessing systematic generalization of CLEVR models. CoRR, abs/1912.05783.
- Bastings et al. (2018) Jasmijn Bastings, Marco Baroni, Jason Weston, Kyunghyun Cho, and Douwe Kiela. 2018. Jump to better conclusions: SCAN both left and right. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 47–55, Brussels, Belgium. Association for Computational Linguistics.
- Bogin et al. (2022) Ben Bogin, Shivanshu Gupta, and Jonathan Berant. 2022. Unobserved local structures make compositional generalization hard. CoRR, abs/2201.05899.
- Bogin et al. (2021a) Ben Bogin, Shivanshu Gupta, Matt Gardner, and Jonathan Berant. 2021a. COVR: A test-bed for visually grounded compositional generalization with real images. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9824–9846, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Bogin et al. (2021b) Ben Bogin, Sanjay Subramanian, Matt Gardner, and Jonathan Berant. 2021b. Latent compositional representations improve systematic generalization in grounded question answering. Transactions of the Association for Computational Linguistics, 9:195–210.
- Bras et al. (2020) Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. CoRR, abs/2002.04108.
- Campagna et al. (2019) Giovanni Campagna, Silei Xu, Mehrad Moradshahi, Richard Socher, and Monica S. Lam. 2019. Genie: A generator of natural language semantic parsers for virtual assistant commands. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, page 394–410, New York, NY, USA. Association for Computing Machinery.
- Chen et al. (2020) Xinyun Chen, Chen Liang, Adams Wei Yu, Dawn Song, and Denny Zhou. 2020. Compositional generalization via neural-symbolic stack machines. In Advances in Neural Information Processing Systems, volume 33, pages 1690–1701. Curran Associates, Inc.
- Conklin et al. (2021) Henry Conklin, Bailin Wang, Kenny Smith, and Ivan Titov. 2021. Meta-learning to compositionally generalize. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3322–3335, Online. Association for Computational Linguistics.
- Csordás et al. (2021) Róbert Csordás, Kazuki Irie, and Juergen Schmidhuber. 2021. The devil is in the detail: Simple tricks improve systematic generalization of transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 619–634, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Dahl et al. (1994) Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.
- Finegan-Dollak et al. (2018) Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360, Melbourne, Australia. Association for Computational Linguistics.
- Fodor and Pylyshyn (1988) Jerry A Fodor and Zenon W Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71.
- Gardner et al. (2021) Matt Gardner, William Merrill, Jesse Dodge, Matthew Peters, Alexis Ross, Sameer Singh, and Noah A. Smith. 2021. Competency problems: On finding and removing artifacts in language data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1801–1813, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Gordon et al. (2020) Jonathan Gordon, David Lopez-Paz, Marco Baroni, and Diane Bouchacourt. 2020. Permutation equivariant models for compositional generalization in language. In International Conference on Learning Representations.
- Guo et al. (2020) Demi Guo, Yoon Kim, and Alexander Rush. 2020. Sequence-level mixed sample data augmentation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5547–5552, Online. Association for Computational Linguistics.
- Guo et al. (2021) Yinuo Guo, Hualei Zhu, Zeqi Lin, Bei Chen, Jian-Guang Lou, and Dongmei Zhang. 2021. Revisiting iterative back-translation from the perspective of compositional generalization. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 7601–7609. AAAI Press.
- Hemphill et al. (1990) Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The atis spoken language systems pilot corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT ’90, page 96–101, USA. Association for Computational Linguistics.
- Herzig and Berant (2021) Jonathan Herzig and Jonathan Berant. 2021. Span-based semantic parsing for compositional generalization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 908–921, Online. Association for Computational Linguistics.
- Keysers et al. (2020) Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations.
- Kim and Linzen (2020) Najoung Kim and Tal Linzen. 2020. COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087–9105, Online. Association for Computational Linguistics.
- Lake and Baroni (2018) Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2873–2882. PMLR.
- Lake (2019) Brenden M. Lake. 2019. Compositional generalization through meta sequence-to-sequence learning. CoRR, abs/1906.05381.
- Lake et al. (2017) Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. 2017. Building machines that learn and think like people. Behavioral and brain sciences, 40.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Li and Vasconcelos (2019) Yi Li and Nuno Vasconcelos. 2019. REPAIR: removing representation bias by dataset resampling. CoRR, abs/1904.07911.
- Loula et al. (2018) João Loula, Marco Baroni, and Brenden Lake. 2018. Rearranging the familiar: Testing compositional generalization in recurrent networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 108–114, Brussels, Belgium. Association for Computational Linguistics.
- Oren et al. (2021) Inbar Oren, Jonathan Herzig, and Jonathan Berant. 2021. Finding needles in a haystack: Sampling structurally-diverse training sets from synthetic data for compositional generalization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10793–10809, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Oren et al. (2020) Inbar Oren, Jonathan Herzig, Nitish Gupta, Matt Gardner, and Jonathan Berant. 2020. Improving compositional generalization in semantic parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2482–2495, Online. Association for Computational Linguistics.
- Qiu et al. (2021) Linlu Qiu, Peter Shaw, Panupong Pasupat, Pawel Krzysztof Nowak, Tal Linzen, Fei Sha, and Kristina Toutanova. 2021. Improving compositional generalization with latent structure and data augmentation. CoRR, abs/2112.07610.
- Ruis et al. (2020) Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, and Brenden M Lake. 2020. A benchmark for systematic generalization in grounded language understanding. In Advances in Neural Information Processing Systems, volume 33, pages 19861–19872. Curran Associates, Inc.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106.
- Shaw et al. (2021) Peter Shaw, Ming-Wei Chang, Panupong Pasupat, and Kristina Toutanova. 2021. Compositional generalization and natural language variation: Can a semantic parsing approach handle both? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 922–938, Online. Association for Computational Linguistics.
- Wang et al. (2021) Bailin Wang, Wenpeng Yin, Xi Victoria Lin, and Caiming Xiong. 2021. Learning to synthesize data for semantic parsing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2760–2766, Online. Association for Computational Linguistics.
- Wang et al. (2015) Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342, Beijing, China. Association for Computational Linguistics.
- Xu et al. (2020) Silei Xu, Giovanni Campagna, Jian Li, and Monica S. Lam. 2020. Schema2QA: High-Quality and Low-Cost Q&A Agents for the Structured Web, page 1685–1694. Association for Computing Machinery, New York, NY, USA.
- Yin et al. (2021) Pengcheng Yin, Hao Fang, Graham Neubig, Adam Pauls, Emmanouil Antonios Platanios, Yu Su, Sam Thomson, and Jacob Andreas. 2021. Compositional generalization for neural semantic parsing via span-level supervised attention. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2810–2823, Online. Association for Computational Linguistics.
Appendix A Training
Models for the COVR, ATIS, Schema2QA, and Overnight datasets were trained for different numbers of epochs depending on train set size, as shown in Table 4. Models for Schema2QA were all trained for 240 epochs. Training was run with batch sizes ranging from 8 to 20, depending on the maximum number of example tokens in each dataset, and a learning rate with polynomial decay. Each experiment was run on an NVIDIA Titan RTX GPU. Following Bogin et al. (2022), we do early stopping using the test set. As our goal is to estimate the quality of the train set rather than of the model, we argue this is an acceptable choice in our setting.
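The checkpoint-selection procedure described above can be sketched as follows. This is a minimal illustration, not the paper's actual training code: the function name, the patience value, and the per-epoch accuracy values are all hypothetical, and the only assumption taken from the text is that the checkpoint with the best test-set exact-match accuracy is kept.

```python
def select_checkpoint_with_early_stopping(eval_scores, patience=3):
    """Return (epoch, score) of the best checkpoint, stopping once
    `patience` consecutive epochs pass without improvement.

    `eval_scores` is the exact-match accuracy measured after each epoch
    on the held-out split used for early stopping (here, the test set).
    """
    best_score, best_epoch = float("-inf"), -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(eval_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training early; keep the best checkpoint
    return best_epoch, best_score

# Hypothetical accuracies: the metric peaks at epoch 3, then plateaus,
# so training stops after three non-improving epochs.
epoch, score = select_checkpoint_with_early_stopping(
    [0.41, 0.55, 0.60, 0.72, 0.71, 0.70, 0.69]
)
```

Selecting checkpoints on the test set would inflate model-quality estimates, but since all subsampling methods are evaluated identically, it does not bias comparisons between train sets.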
Appendix B Full Results