Compositional generalization in semantic parsing with pretrained transformers

by A. Emin Orhan, et al.

Large-scale pretraining instills large amounts of knowledge in deep neural networks. This, in turn, improves the generalization behavior of these models in downstream tasks. What exactly are the limits to the generalization benefits of large-scale pretraining? Here, we report observations from some simple experiments aimed at addressing this question in the context of two semantic parsing tasks involving natural language, SCAN and COGS. We show that language models pretrained exclusively with non-English corpora, or even with programming language corpora, significantly improve out-of-distribution generalization in these benchmarks, compared with models trained from scratch, even though both benchmarks are English-based. This demonstrates the surprisingly broad transferability of pretrained representations and knowledge. Pretraining with a large-scale protein sequence prediction task, on the other hand, mostly deteriorates the generalization performance in SCAN and COGS, suggesting that pretrained representations do not transfer universally and that there are constraints on the similarity between the pretraining and downstream domains for successful transfer. Finally, we show that larger models are harder to train from scratch and their generalization accuracy is lower when trained up to convergence on the relatively small SCAN and COGS datasets, but the benefits of large-scale pretraining become much clearer with larger models.



1 Introduction

here is the root of the root and the bud of the bud and the sky of the sky of a tree called life

— e e cummings

Large, pretrained language and vision models are widely used in a diverse range of downstream NLP and computer vision tasks. These models encode a large amount of transferable knowledge (of varying specificity) in their parameters and display remarkable (sometimes even surprising) “emergent” generalization behaviors as a result. To give a few examples: (i) the highly influential GPT-3 model displays hallmarks of few-shot learning/generalization ability from a small number of examples given in its prompt (Brown et al., 2020); (ii) DALL-E, or other similar text-to-image models trained on large datasets (such as CLIP+VQGAN, whose original notebook was made by Katherine Crowson), display qualitative evidence of compositional generalization abilities (Ramesh et al., 2021); (iii) pretraining with large, diverse image or text datasets improves the out-of-distribution generalization performance of image recognition (Orhan, 2019; Xie et al., 2020) and NLP models (Hendrycks et al., 2020), respectively.

What exactly are the limits to the generalization benefits of large-scale pretraining? More concretely, how do pretraining benefits scale with factors such as the diversity and the scale of the pretraining data, the model size, or the similarity between the pretraining data and the downstream task? Here, we report the results of some simple experiments aimed at addressing the latter two factors, namely the model size and the similarity between the pretraining data and the downstream task, in the context of two previously introduced semantic parsing tasks, SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020).

A few other recent works have also attempted to address some of these questions. Furrer et al. (2020) and Tay et al. (2021) show that large-scale pretraining (in the form of language modeling on the web-scale C4 corpus (Raffel et al., 2020), which consists mainly of English language texts) can improve compositional generalization in the SCAN and COGS benchmarks, respectively. With respect to the specificity of pretraining benefits, Lu et al. (2021) argue that pretraining with language modeling confers broad benefits in downstream tasks, including, somewhat surprisingly, a variety of non-language tasks such as image recognition or protein sequence modeling. Papadimitriou and Jurafsky (2020) similarly report that downstream natural language modeling tasks can benefit from pretraining in seemingly unrelated domains such as Java code or MIDI music scores. They argue that abstract syntactic similarity between the pretraining and downstream domains is key for the transfer success of pretraining. Consistent with this idea, Chiang and Lee (2020) also show that pretraining with a simple artificial language generated by a stack-based hierarchical grammar can improve the accuracy on a diverse set of downstream natural language tasks, namely the tasks comprising the GLUE benchmark (Wang et al., 2018). More recently, Krishna et al. (2021) show that text summarization also does not seem to require pretraining with natural language texts: they demonstrate that, surprisingly, even pretraining with texts consisting entirely of randomly and independently sampled nonsense words achieves similar results in downstream text summarization tasks. In the visual domain, Baradad et al. (2021) show that even very simple random noise processes can be used as effective pretraining data for downstream natural image recognition tasks, as long as these noise processes satisfy certain basic structural properties of natural images. In a similar vein, Sinha et al. (2021) recently showed that scrambling the (within-sentence) word order in natural language texts has surprisingly little effect on their effectiveness as pretraining data, as long as some higher-order co-occurrence statistics are preserved.

The results we report here contribute valuable observations to this prior literature. Our main results can be summarized as follows:

  • Large-scale pretraining with natural language based text-to-text tasks improves compositional generalization in both the SCAN and COGS benchmarks. This result is essentially a replication of earlier reports to the same effect: e.g. Furrer et al. (2020) for SCAN and Tay et al. (2021) for COGS.

  • Surprisingly, however, even models pretrained exclusively with non-English languages significantly improve performance on these benchmarks compared to models trained from scratch, even though both benchmarks are English-based.

  • Even more surprisingly, a language model pretrained predominantly on programming languages (Wang et al., 2021) also provides large generalization benefits on both SCAN and COGS, roughly equivalent in size to the generalization benefits of large-scale pretraining with natural languages.

  • However, the same is not true for pretraining with a large-scale protein sequence prediction task (Elnaggar et al., 2020). Pretraining with this task in fact mostly deteriorates the performance on the downstream semantic parsing tasks, suggesting that pretrained representations do not transfer universally and that there are likely constraints on the similarity between the pretraining and downstream tasks for successful transfer.

  • Bigger models are harder to train from scratch (and their generalization accuracy is lower when trained to convergence) on the relatively small-scale SCAN and COGS benchmarks, but they benefit more from large-scale pretraining.

2 Tasks

We evaluated the potential benefits of large-scale pretraining for (out-of-distribution) compositional generalization in two semantic parsing tasks, SCAN and COGS. We now briefly describe these benchmarks. We refer the reader to the original papers for further details.

SCAN (Lake and Baroni, 2018): In SCAN, the goal is to map a procedurally generated instruction in simplified natural language (English) to a more formal semantic representation in a domain-specific language: e.g. turn left twice → LTURN LTURN. The benchmark consists of several different train/test splits evaluating different aspects of compositional generalization. In this work, we considered all SCAN splits evaluated in Furrer et al. (2020) (seven different splits in total; see Table 1).
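To make the SCAN mapping concrete, the following is a minimal sketch of an interpreter for a simplified subset of SCAN-style commands. It is an illustration only, not the benchmark's generation code: the function names are hypothetical, the output tokens follow the shorthand used in the example above (LTURN, RTURN) rather than the dataset's own token names, and the handling of "opposite", "around", "twice", "thrice", "and" and "after" reflects our reading of the grammar in Lake and Baroni (2018).

```python
# Toy interpreter for a simplified subset of SCAN-style commands.
# Names and output tokens are illustrative assumptions, not the dataset's.
PRIMITIVES = {"walk": "WALK", "look": "LOOK", "run": "RUN", "jump": "JUMP"}
TURNS = {"left": "LTURN", "right": "RTURN"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret_phrase(words):
    """Interpret a phrase without conjunctions, e.g. ['jump', 'around', 'right']."""
    if words[-1] in REPEATS:
        # "x twice" / "x thrice": repeat the whole preceding phrase.
        return interpret_phrase(words[:-1]) * REPEATS[words[-1]]
    verb = words[0]
    # "turn" contributes no action of its own; only the turn tokens are emitted.
    action = [] if verb == "turn" else [PRIMITIVES[verb]]
    if len(words) == 1:
        return action
    turn = TURNS[words[-1]]
    if len(words) == 2:             # e.g. "jump left"
        return [turn] + action
    if words[1] == "opposite":      # turn twice, then act
        return [turn, turn] + action
    if words[1] == "around":        # turn and act, four times
        return ([turn] + action) * 4
    raise ValueError(f"unsupported phrase: {' '.join(words)}")


def interpret(command):
    """Map a command string to a space-separated action sequence."""
    words = command.split()
    for conjunction in ("and", "after"):
        if conjunction in words:
            i = words.index(conjunction)
            left = interpret_phrase(words[:i])
            right = interpret_phrase(words[i + 1:])
            # "x and y" runs x then y; "x after y" runs y then x.
            sequence = left + right if conjunction == "and" else right + left
            return " ".join(sequence)
    return " ".join(interpret_phrase(words))


print(interpret("turn left twice"))
```

The compositional splits then ask whether a model that has seen, say, "jump" only in isolation can produce the correct output for "jump around right" at test time.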

COGS (Kim and Linzen, 2020): COGS is similar to SCAN at a high level: like SCAN, it involves mapping procedurally generated English sentences to an abstract logical form. Unlike SCAN, however, it is geared more toward investigating compositional phenomena in natural language, hence the surface form English sentences in COGS are more complex and realistic and the targeted compositional phenomena are more linguistic in nature. COGS has a single training set and an out-of-distribution generalization set containing several different types of examples for evaluating different compositional phenomena in natural language (in English, more specifically).

We have found that for both SCAN and COGS, training the models up to convergence on the training set always improved the average out-of-distribution generalization accuracy (over all splits/conditions), as has been observed recently by Csordás et al. (2021) for models trained from scratch (in SCAN, some splits benefited from early stopping, but the average accuracy over all splits always favored training the model up to convergence). We therefore followed this strategy for all experiments reported below. This means that no early stopping, no cross-validation over a validation set, and no hyperparameter searches were performed in any of the experiments. All experiments used the example translation code provided in the Huggingface Examples repository with minimal modifications. All code for reproducing the experiments reported here, as well as the corresponding output files, can be found in the accompanying GitHub repository.
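As an illustration of how a semantic parsing pair can be cast as a translation example, the sketch below serializes (command, target) pairs as JSON lines with a top-level "translation" field, which is the input layout we assume the Huggingface translation example accepts. The helper name and the language keys "en" and "sc" are hypothetical placeholders and are not taken from the accompanying repository.

```python
import json


def to_translation_lines(pairs, source_key="en", target_key="sc"):
    """Serialize (command, target) pairs as JSON lines.

    The keys `source_key` and `target_key` are illustrative placeholders
    standing in for the source and target "languages" of the task.
    """
    lines = []
    for command, target in pairs:
        record = {"translation": {source_key: command, target_key: target}}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


example_pairs = [("turn left twice", "LTURN LTURN")]
for line in to_translation_lines(example_pairs):
    print(line)
```

Writing one such line per example to a file yields a training or test split in a form a sequence-to-sequence translation script can consume.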

3 Results

size     model            aj           atl          jar          ar           or           right        length       ave.
mar.     marian_defr_scr  2.9 ± 0.5    95.2 ± 1.0   100.0 ± 0.0  33.1 ± 0.5   89.8 ± 2.6   82.1 ± 4.1   12.8 ± 0.2   59.4
mar.     marian_defr      62.7 ± 2.7   83.5 ± 2.9   96.3 ± 0.6   45.0 ± 1.9   99.1 ± 0.1   92.4 ± 0.8   15.0 ± 0.1   70.6
t5-base  t5_base_scr      0.0 ± 0.0    67.6 ± 11.0  100.0 ± 0.0  18.1 ± 3.4   51.1 ± 7.3   58.7 ± 5.7   7.5 ± 1.3    43.3
t5-base  t5_base          93.6 ± 0.6   55.2 ± 2.3   93.6 ± 0.0   38.4 ± 0.9   96.8 ± 0.3   65.3 ± 3.0   13.5 ± 0.1   65.2
t5-base  mt5_base         9.7 ± 4.4    96.4 ± 1.4   99.8 ± 0.1   16.7 ± 5.2   85.9 ± 6.7   73.3 ± 5.5   10.7 ± 0.6   56.1
t5-base  ct5_base         5.2 ± 0.4    98.2 ± 1.6   100.0 ± 0.0  62.1 ± 12.6  94.2 ± 4.7   100.0 ± 0.0  5.0 ± 0.5    66.4
t5-base  pt5_base         0.0 ± 0.0    79.2 ± 4.2   95.7 ± 2.0   2.4 ± 0.9    23.4 ± 8.5   15.1 ± 3.4   11.6 ± 0.4   32.5
t5-3b    t5_3b_scr        0.0 ± 0.0    8.4 ± 0.4    11.7 ± 0.4   0.0 ± 0.0    0.5 ± 0.4    0.2 ± 0.1    0.6 ± 0.1    3.3
t5-3b    t5_3b            96.1 ± 1.4   94.5 ± 4.4   100.0 ± 0.0  37.0 ± 1.0   99.5 ± 0.1   98.6 ± 0.4   8.3 ± 0.7    76.3
t5-3b    mt5_xl           40.0 ± 8.1   90.0 ± 4.6   99.9 ± 0.0   90.0 ± 3.3   98.5 ± 1.1   99.4 ± 0.3   5.1 ± 0.2    74.7
t5-3b    pt5_xl           0.1 ± 0.1    50.9 ± 5.2   84.0 ± 13.1  4.7 ± 3.0    42.3 ± 10.2  9.0 ± 2.8    11.3 ± 1.8   28.9
Table 1: Exact match accuracies (%) in different SCAN splits, reported as mean ± standard error over at least three independent runs for each condition. Models with a scr attached to their name are models trained from scratch (no pretraining); the other models are all pretrained with different datasets/tasks. Abbreviations of SCAN splits: aj: add jump, atl: add turn left, jar: jump around right, ar: around right, or: opposite right. The top two rows show the results for the Marian neural machine translation models (German-to-French); these models have 74M parameters. The middle five rows show the results for the t5_base sized models; these models have 220M parameters. The bottom four rows show the results for the t5_3b (or xl) sized models; these models have 3B parameters. Note that the multilingual mt5 models use different tokenizers from the corresponding t5 models, hence their parameter counts are slightly different. The last column shows the average of averages over all splits.

size     model            average
mar.     marian_defr_scr  62.7 ± 0.5
mar.     marian_defr      83.4 ± 0.1
t5-base  t5_base_scr      32.3 ± 2.2
t5-base  t5_base          83.3 ± 0.1
t5-base  mt5_base         83.4 ± 0.1
t5-base  ct5_base         82.6 ± 0.1
t5-base  pt5_base         16.1 ± 2.3
t5-3b    t5_3b_scr        15.5 ± 0.6
t5-3b    t5_3b            84.1 ± 0.2
t5-3b    mt5_xl           84.6 ± 0.1
t5-3b    pt5_xl           0.0 ± 0.0
Table 2: Exact match accuracies (%) on the COGS generalization set, reported as mean ± standard error over at least three independent runs for each model. Breakdown of accuracy into different conditions can be found in the output files provided on the accompanying GitHub repository.
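For readers who want to reproduce the kind of summary statistics reported in Tables 1-2, the following is a minimal sketch of exact match accuracy and of the mean and standard error over independent runs. The function names are hypothetical, and two details are assumptions rather than statements about the accompanying code: leading and trailing whitespace is ignored when comparing sequences, and the standard error is computed from the sample standard deviation (n - 1 in the denominator).

```python
import math
import statistics


def exact_match_accuracy(predictions, targets):
    """Percentage of predictions that match their target sequence exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    # Assumption: surrounding whitespace is not counted as a mismatch.
    matches = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return 100.0 * matches / len(targets)


def mean_and_standard_error(run_accuracies):
    """Mean and standard error of per-run accuracies (needs at least two runs)."""
    mean = statistics.mean(run_accuracies)
    # Assumption: sample standard deviation divided by sqrt(number of runs).
    standard_error = statistics.stdev(run_accuracies) / math.sqrt(len(run_accuracies))
    return mean, standard_error


accuracy = exact_match_accuracy(["LTURN LTURN", "JUMP"], ["LTURN LTURN", "WALK"])
print(accuracy)  # one of two predictions matches exactly
```

Exact match is a strict criterion: an output that differs from the target in a single token counts as entirely wrong.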

1. Pretraining improves compositional generalization: Pretraining in natural language based tasks consistently improved the out-of-distribution generalization accuracy in both SCAN and COGS (Tables 1-2; breakdown of accuracy into different conditions in COGS can be found in the output files provided on the accompanying github repository). For example, the pretrained T5-base (t5_base) model (Raffel et al., 2020) outperformed the same model architecture trained from scratch (t5_base_scr) by 22% on SCAN (averaged over all splits) and by 51% on COGS in exact match accuracy (in absolute terms). These results reproduce earlier observations of similar pretraining benefits for (out-of-distribution) compositional generalization in SCAN and COGS (Furrer et al., 2020; Tay et al., 2021).

2. Pretraining benefits are not language specific: Furrer et al. (2020) suggest that the main benefit of pretraining might be to improve the model’s ability to substitute similar words or phrases with each other. This hypothesis is circumstantially supported by the observation that in SCAN, for instance, pretraining benefits are largest in splits requiring a single-word substitution (e.g. add jump) and smaller in splits requiring the substitution of longer phrases. However, this hypothesis assumes that pretraining with English language texts should be crucial for transfer success in SCAN and COGS. The pretrained T5 models from Raffel et al. (2020) were indeed trained primarily on the C4 corpus, which consists mainly of English language texts. To see if pretraining benefits would be sustained with models trained primarily on languages other than English, we conducted the same experiments with the multilingual T5 (mT5) models (Xue et al., 2021), which are language models trained on 101 different languages. Importantly, Xue et al. (2021) report that less than 5% of the training data for the mT5 models were English sentences. Despite this, we find that pretrained mT5 models (mt5_base and mt5_xl) also significantly improve generalization accuracy in both SCAN and COGS (Tables 1-2). In SCAN, the mT5-base model improves the generalization accuracy over the same model architecture trained from scratch by 13% and the larger mT5-xl model improves it by over 71%. Similarly large improvements are observed in the COGS benchmark as well (Table 2).

These results suggest that pretraining data do not have to come primarily from English to enable successful transfer to SCAN and COGS. However, it is still possible that the relatively small amount of English examples in mT5’s training corpus might be responsible for the successful transfer to SCAN and COGS. To rule out this possibility, we also considered a model pretrained exclusively on non-English data, namely a German-to-French machine translation model (Marian NMT) trained on the OPUS-MT corpus (Tiedemann and Thottingal, 2020). We observed substantial pretraining benefits even for this model (Tables 1-2): 11% absolute improvement in generalization accuracy in SCAN over a baseline model with the same architecture trained from scratch and 21% absolute improvement in generalization accuracy in COGS. This surprising result suggests that language overlap between the pretraining data and the downstream domain is not necessary for successful transfer and casts doubt on the hypothesis of Furrer et al. (2020) that pretraining benefits primarily originate from an improved ability of the model to substitute similar words or phrases with each other.

To probe the limits of successful transfer further, we next asked whether pretraining with programming languages, as opposed to natural languages, would also improve generalization in downstream semantic parsing tasks. To test this, we took a recent language model called CodeT5 (denoted by ct5_base in Tables 1-2), which was pretrained predominantly on several different programming languages (Wang et al., 2021). We note that the pretraining data for this model involved some amount of natural language as well, however, so the model was not pretrained exclusively with programming languages (for more details on the pretraining data and the pretraining tasks for this model, please see Wang et al. (2021)). Remarkably, CodeT5 also substantially improved generalization in both SCAN and COGS (Tables 1-2). The overall improvements were roughly equivalent in size to the improvements afforded by large-scale pretraining with natural languages. This result is consistent with an earlier observation made by Papadimitriou and Jurafsky (2020) that pretraining with Java code improves downstream language modeling tasks with natural languages and it once again demonstrates the surprisingly broad transferability of the linguistic representations and knowledge learned through large-scale pretraining.

3. Pretraining in a different domain (protein sequence modeling) mostly hurts compositional generalization: To push the envelope even further, we next considered whether pretraining with language corpora (natural or programming languages) was necessary for successful transfer. To test this, we used T5-base and T5-3b sized models pretrained with a protein sequence modeling task (Elnaggar et al., 2020). These models are denoted by pt5_base and pt5_xl in Tables 1-2. To accommodate the tokenization difference between the protein modeling and natural language modeling domains, we replaced the original protein tokenizer in pt5_base and pt5_xl with the corresponding T5 English tokenizer from Raffel et al. (2020) and learned the token embeddings from scratch when finetuning the models on SCAN and COGS. Pretraining with the protein modeling task mostly hurt the generalization accuracy in SCAN and COGS (Tables 1-2). For example, in SCAN, pt5_base had 11% lower absolute generalization accuracy than the corresponding T5 model trained from scratch (t5_base_scr). Similarly, in COGS, pt5_base had 16% lower generalization accuracy than t5_base_scr. These results suggest that the generalization benefits of pretraining, although broad, are not universal and that there has to be a certain kind and degree of similarity between the pretraining domain and the downstream task for successful transfer to occur.

4. Bigger models are harder to train from scratch, but pretraining benefits are clearer for bigger models: In all our experiments, we consistently observed that larger models were harder to train from scratch and their generalization accuracy was lower when trained to convergence, but they also benefited more from large-scale pretraining. For example, the T5-base and T5-3b models trained from scratch (t5_base_scr and t5_3b_scr) achieve an average generalization accuracy of 43% and 3%, respectively, in SCAN, whereas large-scale pretraining on C4 increases these numbers to 65% and 76%, an absolute improvement of 22% and 73%, respectively (Table 1). A similar pattern is observed in COGS as well (Table 2). These results are not entirely unexpected: bigger models are likely more prone to overfitting on relatively small-scale datasets such as SCAN and COGS, but they are much more effective at encoding the large amount of knowledge inherent in very large-scale datasets such as C4.
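The absolute improvements quoted in this section are differences between the average accuracy of a pretrained model and that of the same architecture trained from scratch. As a small worked check, the sketch below recomputes these differences for the T5 models from the averages listed in Tables 1-2; the dictionary and function names are hypothetical, and the numbers are copied from the tables.

```python
# Average exact match accuracies (%) copied from Table 1 (SCAN, last column)
# and Table 2 (COGS). Variable and function names are illustrative.
SCAN_AVERAGE = {"t5_base_scr": 43.3, "t5_base": 65.2, "t5_3b_scr": 3.3, "t5_3b": 76.3}
COGS_AVERAGE = {"t5_base_scr": 32.3, "t5_base": 83.3, "t5_3b_scr": 15.5, "t5_3b": 84.1}


def pretraining_gain(averages, pretrained, scratch):
    """Absolute gain, in percentage points, of a pretrained model over its scratch baseline."""
    return round(averages[pretrained] - averages[scratch], 1)


for name, averages in (("SCAN", SCAN_AVERAGE), ("COGS", COGS_AVERAGE)):
    base_gain = pretraining_gain(averages, "t5_base", "t5_base_scr")
    large_gain = pretraining_gain(averages, "t5_3b", "t5_3b_scr")
    print(f"{name}: t5_base gain = {base_gain}, t5_3b gain = {large_gain}")
```

In both benchmarks the gain is larger for the 3B model than for the base model, which is the pattern described above.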

4 Discussion

Our results suggest that there is considerable generality in the transferability of pretrained representations and knowledge. We have found that even models pretrained exclusively with non-English languages or models pretrained predominantly with programming languages show substantial out-of-distribution generalization benefits in downstream semantic parsing tasks couched entirely in English (SCAN and COGS). These results add to an emerging literature documenting surprisingly broad transferability between seemingly distant pretraining and downstream tasks in both NLP and computer vision (Papadimitriou and Jurafsky, 2020; Chiang and Lee, 2020; Lu et al., 2021; Krishna et al., 2021; Sinha et al., 2021; Baradad et al., 2021; Pondenkandath et al., 2018; Maennel et al., 2020).

However, successful transfer does not appear to be automatic or universal, potentially contrary to some recent claims (Lu et al., 2021). Rather, it is likely that there are constraints on the similarity between the pretraining domain and the downstream task for successful transfer. We hope that future work will better delineate these conditions for successful transfer. An important question to consider for future work is to what extent these conditions are content-specific. Recent work by Csordás et al. (2021) suggests that a few simple content-independent modifications to transformer models, such as adjusting the scale of the embedding weights and using embeddings that are more sensitive to the relative position of tokens, can dramatically improve compositional generalization in these models when they are trained from scratch. It remains to be seen whether, or to what extent, the generalization benefits of large-scale pretraining in different tasks can potentially be explained through such relatively content-independent effects.

It seems plausible that in order to bring out the real value of pretrained knowledge and pretrained representations (as opposed to some relatively generic, content-nonspecific effects), downstream transfer tasks should be significantly more challenging than those commonly used today. In this regard, zero-shot or few-shot evaluations are probably much better than benchmarks with relatively large training sets (Brown et al., 2020; Ramesh et al., 2021).


References

  • M. Baradad, J. Wulff, T. Wang, P. Isola, and A. Torralba (2021) Learning to see by looking at noise. arXiv preprint arXiv:2106.05963.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  • C. Chiang and H. Lee (2020) Pre-training a language model without human language. arXiv preprint arXiv:2012.11995.
  • R. Csordás, K. Irie, and J. Schmidhuber (2021) The devil is in the detail: simple tricks improve systematic generalization of transformers. arXiv preprint arXiv:2108.12284.
  • A. Elnaggar, M. Heinzinger, C. Dallago, G. Rihawi, Y. Wang, L. Jones, T. Gibbs, T. Feher, C. Angerer, M. Steinegger, et al. (2020) ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. arXiv preprint arXiv:2007.06225.
  • D. Furrer, M. van Zee, N. Scales, and N. Schärli (2020) Compositional generalization in semantic parsing: pre-training vs. specialized architectures. arXiv preprint arXiv:2007.08970.
  • D. Hendrycks, X. Liu, E. Wallace, A. Dziedzic, R. Krishnan, and D. Song (2020) Pretrained transformers improve out-of-distribution robustness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2744–2751.
  • N. Kim and T. Linzen (2020) COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9087–9105.
  • K. Krishna, J. Bigham, and Z. C. Lipton (2021) Does pretraining for summarization require knowledge transfer?. arXiv preprint arXiv:2109.04953.
  • B. Lake and M. Baroni (2018) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning, pp. 2873–2882.
  • K. Lu, A. Grover, P. Abbeel, and I. Mordatch (2021) Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247.
  • H. Maennel, I. Alabdulmohsin, I. Tolstikhin, R. J. Baldock, O. Bousquet, S. Gelly, and D. Keysers (2020) What do neural networks learn when trained with random labels?. arXiv preprint arXiv:2006.10455.
  • A. E. Orhan (2019) Robustness properties of Facebook’s ResNeXt WSL models. arXiv preprint arXiv:1907.07640.
  • I. Papadimitriou and D. Jurafsky (2020) Learning music helps you read: using transfer to study linguistic structure in language models. arXiv preprint arXiv:2004.14601.
  • V. Pondenkandath, M. Alberti, S. Puran, R. Ingold, and M. Liwicki (2018) Leveraging random label memorization for unsupervised pre-training. arXiv preprint arXiv:1811.01640.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, pp. 1–67.
  • A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever (2021) Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092.
  • K. Sinha, R. Jia, D. Hupkes, J. Pineau, A. Williams, and D. Kiela (2021) Masked language modeling and the distributional hypothesis: order word matters pre-training for little. arXiv preprint arXiv:2104.06644.
  • Y. Tay, M. Dehghani, J. Gupta, D. Bahri, V. Aribandi, Z. Qin, and D. Metzler (2021) Are pre-trained convolutions better than pre-trained transformers?. arXiv preprint arXiv:2105.03322.
  • J. Tiedemann and S. Thottingal (2020) OPUS-MT – Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
  • Y. Wang, W. Wang, S. Joty, and S. C. Hoi (2021) CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859.
  • Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698.
  • L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel (2021) mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the NAACL: Human Language Technologies, pp. 483–498.