Visually Grounded Continual Learning of Compositional Semantics

by   Xisen Jin, et al.
University of Southern California

Children's language acquisition from the visual world is a real-world example of continual learning from dynamic and evolving environments; yet we lack a realistic setup to study neural networks' capability in human-like language acquisition. In this paper, we propose a realistic setup by simulating children's language acquisition process. We formulate language acquisition as a masked language modeling task where the model visits a stream of data with a continuously shifting distribution. Our training and evaluation encode two important challenges in human language learning, namely continual learning and compositionality. We show that the performance of existing continual learning algorithms is far from satisfactory. We also study the interactions between memory based continual learning algorithms and compositional generalization, and conclude that overcoming overfitting and compositional overfitting may be crucial for good performance in our problem setup. Our code and data can be found at




1 Introduction

Children’s language acquisition process from the visual world is a real-world example of learning complicated natural language processing tasks. Simulating children’s language learning process with neural networks helps researchers understand the capabilities and limits of neural networks in modeling complicated tasks Surís et al. (2019), and inspires researchers to push those limits by addressing the issues found Lu et al. (2018); Lake and Baroni (2017).

However, no prior work encodes an important challenge for simulating language acquisition from the visual world: the ability to learn in an evolving environment, also known as continual learning. While continual learning has been studied for decades, its algorithms are usually evaluated on simple image classification tasks. These setups are far from real environments, where the end tasks can be much more complex, such as language acquisition. In this paper, we propose to incorporate continual learning into the framework of simulating children’s language acquisition process.

Figure 1: Visually grounded Continual Compositional Language Learning (VisCOLL). We highlight the noun phrase to be masked in each caption. Given image and masked caption, the model is required to predict the masked noun phrase.
Figure 2: Training and testing examples in our problem formulation. At training, the model visits a stream of image-caption pairs. We highlight the words that are masked for prediction. The distribution of the training data stream, identified by task labels, changes continuously over time. See Figure 3 for an illustration of the continuous shift. At testing, the model is asked to predict either a seen composition of words or a novel composition of seen words.

Another challenge that we consider is compositionality in language. Compositionality allows atomic words to be combined according to certain rules to represent complicated semantics Zadrozny (1992). For humans, compositionality is a demonstration of productivity, which emerges as early as 3 years old Pinker et al. (1987): toddlers may learn a nonsense stem, e.g., wug, to refer to an object; then, if there are two of them, they can report that by saying there are two wugs Berko (1958). To test models’ ability to learn compositionality, we formulate the language acquisition problem as a visually grounded masked language modeling task, which requires the model to predict multiple masked words; specifically, we expect the models to compose atomic words to generate novel compositions of words. Technically, it also introduces an exponentially large output space, which brings extra challenges for most continual learning algorithms. For example, memory based continual learning algorithms, which identify and store important examples for each class in a fixed-size memory for future replay, cannot expect to store an example for each word combination visited. This implies that learning to generalize compositionally, i.e., learning to identify atomic concepts in an example and combine them Keysers et al. (2020), is crucial for performance. For example, after storing examples for “red cars”, ideally the models do not further need to store examples for “red apples” to alleviate forgetting on predicting “red” in “red apples”; in contrast, we do not want the models to overfit the stored examples for “red cars” by predicting all cars as red or only “red” for apples. However, no prior work studies such interactions between memory based continual learning and compositional generalization.

In this paper, we propose the Visually grounded Continual cOmpositional Language Learning (VisCOLL) task, aiming at simulating children’s language learning process. We create two datasets, namely COCO-shift and Flickr-shift, to encode challenges of compositionality and continual learning for VisCOLL. We conduct systematic evaluations over the VisCOLL datasets to study the difficulties and the characteristics of the task.

2 Related Works

In this section, we introduce related works on continual learning as well as compositional language learning.

Continual learning aims to alleviate catastrophic forgetting Robins (1995), i.e., significant performance degradation on early data when the models are trained on a non-stationary data stream. Existing continual learning algorithms can be grouped into memory-based approaches Lopez-Paz and Ranzato (2017); Aljundi et al. (2019b), pseudo-replay based approaches Shin et al. (2017), regularization based approaches Kirkpatrick et al. (2017); Zenke et al. (2017); Nguyen et al. (2018) and architecture based approaches. The benchmarks for evaluation are usually manually constructed from classification datasets, by “splitting” the training examples into several disjoint subsets by labels, or applying a fixed transformation to each subset of training examples, and letting the model visit these subsets one by one Lopez-Paz and Ranzato (2017). The most commonly used datasets are Split MNIST, Permuted MNIST Kirkpatrick et al. (2017), and Split CIFAR Rebuffi et al. (2017). However, the training and testing environments in these benchmark datasets are far from the complicated real environment, where the end task is much more complex, and the data stream is less structured (e.g., having no strict task boundaries).

On the other hand, recent works in language learning try to understand and explicitly model compositional semantics in neural networks, i.e., the ability to compose the meanings of atomic words into higher-level meanings, but outside the context of continual learning. Lake and Baroni (2017) study compositional generalization in language generation with synthetic instruction following tasks. Yuan et al. (2019) study compositional language acquisition with text-based games. Some works further incorporate visual inputs in studying compositional language understanding and generation, taking visual navigation Anderson et al. (2018), visual question answering Bahdanau et al. (2019), and visually grounded masked word prediction Surís et al. (2019) as end tasks.

Few works have tried to apply compositional language learning as an end task for studying continual learning. Li et al. (2020) is a closely related work that studies challenges in continual learning of sequential prediction tasks while focusing on synthetic instruction following tasks. However, its analysis and its technique of separating semantics and syntax are restricted to cases where both inputs and outputs are text, and do not apply to visual inputs. Nguyen et al. (2019) study continual learning of image captioning, but they do not analyze the challenges of sequential prediction, and still make strong assumptions about the structure of the data stream.

3 Task Setup

In this section, we introduce our problem formulation for Visually grounded Continual cOmpositional Language Learning (VisCOLL). Our formulation encodes two main challenges, namely compositionality and continual learning. We choose visually grounded masked language modeling as a proxy for evaluating models’ capabilities in learning compositional semantics: it requires the model to describe complicated and unseen visual scenes by composing atomic words. Then, we construct a training environment where the training data comes as a non-stationary data stream without clear “task” boundaries, to simulate a realistic environment. Figure 2 illustrates the training and testing examples in our formulation. In the rest of the section, we introduce details of our task setup.

Task Definition. We employ masked language modeling with visual inputs as an end task: the training and testing examples consist of image-caption pairs, where a text span in the caption is masked with MASK tokens and needs to be predicted by the model. The masked text span always includes a noun and optionally includes verbs or adjectives. To study whether the model learns compositionality in language, we define each noun, verb, and adjective as an atom, and study whether the model can predict both seen and novel compositions of nouns and verbs/adjectives. For example, we may test whether the model successfully predicts “red apples” (a combination of an adjective and a noun) when the model has seen examples that involve “red” and “apples” separately.
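The task definition above can be illustrated with a minimal sketch. It assumes whitespace-tokenized captions, a placeholder `[MASK]` string, and compositions represented as (adjective, noun) tuples of lemmas; the function names `mask_span` and `is_novel_composition` are hypothetical and are not taken from the released code.

```python
MASK = "[MASK]"  # assumed placeholder token


def mask_span(tokens, start, end):
    """Replace tokens[start:end] with MASK tokens.

    Returns the masked caption and the words the model must predict.
    """
    masked = tokens[:start] + [MASK] * (end - start) + tokens[end:]
    return masked, tokens[start:end]


def is_novel_composition(pair, seen_pairs, seen_atoms):
    """A composition is novel if every atom was seen in training,
    but the atoms were never seen together."""
    return pair not in seen_pairs and all(atom in seen_atoms for atom in pair)


caption = "a man eats red apples".split()
masked, target = mask_span(caption, 3, 5)

# Toy training compositions (lemmatized): "red" and "apple" occur, but never together.
seen_pairs = {("red", "car"), ("green", "apple")}
seen_atoms = {atom for pair in seen_pairs for atom in pair}
```

Here `("red", "apple")` counts as a novel composition of seen atoms, while `("red", "car")` is a seen composition.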

Continuously Shifting Data Distribution. Unlike traditional offline training setups where the model is allowed to visit the training examples repeatedly for multiple passes, we study an online continual learning setup, where the training examples come as a non-stationary stream and are visited only once. Importantly, for a realistic simulation of the real-world scenarios in which a child sees and learns, we assume the data distribution changes gradually: for example, the model may see more “apples” in the beginning, and fewer of them later. Unlike most prior continual learning benchmarks, we do not assume strict task boundaries, under which the model would never see any apples again once that task has passed. Formally, at each time step t, the model receives a small mini-batch of examples drawn from the current data distribution. The distribution is non-stationary, i.e., it changes over time. Note that our formulation rules out continual learning algorithms that make use of information about task boundaries. In Section 4, we introduce how we construct data streams that encode these challenges.

Figure 3: Probability of the first 50 tasks at different time steps in the constructed stream on Flickr-shift. Each curve corresponds to a task. The x-axis shows the time step, and the y-axis shows the probability of the task.

4 Dataset Construction for VisCOLL

In this section, we introduce how we construct non-stationary data streams from MS COCO Lin et al. (2014) dataset and Flickr30k Plummer et al. (2015) dataset for our VisCOLL setup. We name our datasets COCO-shift and Flickr-shift respectively.

Both the COCO and Flickr datasets provide images associated with several captions. We use the part-of-speech (POS) tagger in the stanfordnlp package to perform POS tagging. Each training instance is an image-caption pair with a text span masked. In the Flickr dataset, we mask the noun phrase in each caption, which is annotated in the dataset. In the COCO dataset, we identify text spans with a regular expression chunker; each span always includes a noun, and optionally includes an adjective before it or a verb after it.
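A regular expression chunker of this kind can be sketched as follows. This is only an illustration under stated assumptions: the input is already POS-tagged with Penn Treebank style tags (written by hand here rather than produced by a tagger), and the pattern `J?NV?` (optional adjective, noun, optional verb) is a guess at the span shape described above, not the authors' actual expression.

```python
import re


def tag_code(pos):
    """Collapse a Penn Treebank style tag to one letter: J, N, V, or O (other)."""
    if pos.startswith("JJ"):
        return "J"
    if pos.startswith("NN"):
        return "N"
    if pos.startswith("VB"):
        return "V"
    return "O"


def find_spans(tagged):
    """Return (start, end) token index pairs of candidate spans to mask.

    Each tag becomes one character, so regex offsets equal token offsets.
    """
    codes = "".join(tag_code(pos) for _, pos in tagged)
    return [(m.start(), m.end()) for m in re.finditer(r"J?NV?", codes)]


# Hand-written tags for illustration only.
tagged = [("a", "DT"), ("red", "JJ"), ("car", "NN"), ("drives", "VBZ"),
          ("on", "IN"), ("the", "DT"), ("road", "NN")]
spans = find_spans(tagged)
phrases = [[word for word, _ in tagged[s:e]] for s, e in spans]
```

For the toy caption, the candidate spans are "red car drives" and "road".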

To construct a non-stationary data stream, we define a “task” as the lemmatized noun in the masked text span in the Flickr dataset. On the COCO dataset, we map the lemmatized nouns to the provided 80 object categories via a synonym table provided in Lu et al. (2018). Note that the “task” is only used as an identifier of the data distribution for constructing the dataset; the task identities are not revealed to the models, and we construct the data streams so that they contain no clear task boundaries. Specifically, we construct the data streams so that task shifts happen gradually. Figure 3 illustrates the task distribution in our constructed data streams. Table 1 shows statistics about the datasets.
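The text does not specify the exact procedure for making task shifts gradual. One possible sketch, assuming each task's probability follows a Gaussian bump over time with evenly spaced peaks, is given below. The names `task_probabilities` and `build_stream` and the `width` parameter are hypothetical. For brevity the sketch samples examples with replacement, whereas the actual single-pass stream visits each example once.

```python
import math
import random


def task_probabilities(t, num_steps, num_tasks, width=0.1):
    """Probability of each task at time step t (assumed Gaussian bumps).

    Task k peaks at relative position k / (num_tasks - 1) in the stream, so
    neighbouring tasks overlap and there are no hard task boundaries.
    """
    pos = t / max(num_steps - 1, 1)
    weights = []
    for k in range(num_tasks):
        center = k / max(num_tasks - 1, 1)
        weights.append(math.exp(-0.5 * ((pos - center) / width) ** 2))
    total = sum(weights)
    return [w / total for w in weights]


def build_stream(examples_by_task, num_steps, batch_size, seed=0):
    """Sample one mini-batch per time step from the shifting task mixture."""
    rng = random.Random(seed)
    tasks = sorted(examples_by_task)
    stream = []
    for t in range(num_steps):
        probs = task_probabilities(t, num_steps, len(tasks))
        batch_tasks = rng.choices(tasks, weights=probs, k=batch_size)
        stream.append([rng.choice(examples_by_task[task]) for task in batch_tasks])
    return stream
```

Plotting `task_probabilities` for every task over all time steps gives curves of the kind shown in Figure 3. The task identity is used only to build the stream and is never passed to the model.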

Dataset                COCO-shift    Flickr-shift
# Training examples    639,592       456,299
# Test examples        28,743        15,286
# Tasks                80            1,000
Table 1: Statistics on the constructed data streams.

5 Evaluation on VisCOLL Datasets

                                        COCO-shift                                 Flickr-shift
Method                                  PPL     Noun acc.  Verb acc.  Adj. acc.    PPL     Noun acc.  Adj. acc.
Vanilla Online                          6.055    0.51      20.93       1.58        5.965    1.72       6.96
Single-pass Offline                     1.923   51.55      47.00      25.11        2.978   26.44      14.70
Experience Replay (ER)
  memory size = 1,000                   4.475   14.59      33.81       7.76        5.485    3.79       7.41
  memory size = 10,000                  3.193   33.23      42.69      18.81        4.303   15.69      13.41
  memory size = 100,000                 2.119   45.60      49.19      25.63        3.005   26.54      18.21
Maximally Interfering Retrieval (MIR)
  memory size = 10,000                  3.186   33.60      41.79      17.49        3.688   16.67      11.57
Table 2: Overall performance of methods on COCO-shift (built from MS COCO) and Flickr-shift (built from Flickr30k).

In this section, we introduce models, continual learning algorithm baselines and metrics for VisCOLL. We also propose metrics to address the following research questions: (1) whether existing continual learning algorithms effectively alleviate forgetting in our problem setup and (2) how memory based continual learning algorithms may influence compositional generalization.

5.1 Base Model for VisCOLL

We modify VL-BERT Su et al. (2020); Surís et al. (2019) as our base model. We first encode the image with a ResNet-34 He et al. (2015) to get an image embedding. Then, we feed the image embedding as well as the word embeddings of the masked caption into a 4-layer Transformer with a hidden size of 384. The outputs of the Transformer at the masked positions are fed into a linear layer to produce the word predictions. We use the cross-entropy loss and the Adam optimizer Kingma and Ba (2014) with a learning rate of 0.0002 throughout the experiments.

5.2 Continual Learning Algorithms

We focus on memory based continual learning algorithms, as most of them are scalable and naturally applicable to scenarios where no task identifiers or task boundaries are available. We use the Experience Replay (ER) algorithm Robins (1995); Rolnick et al. (2019) with reservoir sampling as a strong baseline. The algorithm randomly stores visited examples in a fixed-size memory. We use memory sizes of 1,000, 10,000, and 100,000, which correspond to roughly 0.2%, 2%, and 20% of the training data for the two datasets. In addition, we experiment with the recently proposed Experience Replay with Maximally Interfering Retrieval (MIR) algorithm Aljundi et al. (2019a) with a memory size of 10,000. We also compare performance against the scenario where no continual learning algorithm is applied (noted as Vanilla Online), as well as the scenario where the underlying data stream is shuffled and visited in a single pass (noted as Single-pass Offline).
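Reservoir sampling keeps a uniform random subset of the stream in a fixed-size memory without knowing the stream length in advance. A minimal sketch of such a replay memory follows; the class name `ReservoirMemory` and its methods are illustrative and are not the authors' implementation.

```python
import random


class ReservoirMemory:
    """Fixed-size replay memory filled by reservoir sampling (Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        """Offer one stream example; each example seen so far ends up stored
        with equal probability capacity / num_seen."""
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.num_seen)  # uniform in [0, num_seen)
            if j < self.capacity:
                self.items[j] = example  # overwrite a stored example

    def sample(self, k):
        """Draw up to k stored examples to replay alongside the current batch."""
        return self.rng.sample(self.items, min(k, len(self.items)))
```

In an ER training loop, each incoming mini-batch would typically be combined with a batch from `sample` for the gradient step, and the incoming examples would then be offered to `add`.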

5.3 Evaluation Metrics for VisCOLL

To address the first research question of whether existing continual learning algorithms are effective in our setting, we employ perplexity (PPL) as the major metric to measure the general performance of training methods. Throughout the paper, we report perplexity in the log scale. We also evaluate the accuracy of noun, verb, and adjective predictions separately. On the Flickr-shift dataset, we only include the accuracy of nouns and adjectives, as the masked phrases in the Flickr dataset are noun phrases.
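As a small worked example of these metrics, the sketch below assumes that log-scale perplexity is the mean negative log-probability the model assigns to the gold masked words, and that accuracy is computed per part-of-speech class. The exact averaging used in the paper is not stated, so both function names and the averaging scheme are assumptions.

```python
import math
from collections import defaultdict


def log_perplexity(target_probs):
    """Log of perplexity: mean negative log-probability of the gold words."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def accuracy_by_pos(predictions):
    """predictions: iterable of (pos, predicted_word, gold_word) triples.

    Returns the fraction of correct predictions for each POS class.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pos, predicted, gold in predictions:
        total[pos] += 1
        correct[pos] += int(predicted == gold)
    return {pos: correct[pos] / total[pos] for pos in total}
```

For instance, a model that assigns probability 0.5 to every gold word has a log perplexity of log 2, roughly 0.693.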

To address the second research question of how replay memories influence compositional generalization, we start by proposing a measure for compositional overfitting. Given a reference set of compositions $S$, the compositional overfitting of an atomic word $w$ to the set $S$ is measured as the average perplexity difference between the cases where $w$ appears in a composition in the test set $D_{test}$ that does not exist in $S$, and the cases where $w$ appears in a composition in $D_{test}$ that also exists in $S$. Formally, the compositional overfitting is defined as,

$$O(w, S) = \mathbb{E}_{c \in D_{test},\, w \in c,\, c \notin S}\left[\mathrm{PPL}(w \mid c)\right] - \mathbb{E}_{c \in D_{test},\, w \in c,\, c \in S}\left[\mathrm{PPL}(w \mid c)\right]$$

We are able to compute the compositional overfitting of a word $w$ regarding the replay memory $M$, noted as $O(w, M)$. A large $O(w, M)$ implies the perplexity of $w$ is much larger when $w$ appears in compositions that do not exist in the replay memory. We also compute the compositional overfitting of $w$ regarding the training set $D_{train}$, noted as $O(w, D_{train})$. We then compare $O(w, M)$ to $O(w, D_{train})$ to evaluate whether the model is more inclined to overfit combinations stored in the memory than random examples in the training set. We note the difference between the two as $\Delta(w) = O(w, M) - O(w, D_{train})$.
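The measure described in this section can be sketched in a few lines. The sketch assumes test results are available as (word, composition, log-perplexity) records and that a reference set is a plain set of compositions; the function names and the record layout are illustrative assumptions.

```python
from statistics import mean


def compositional_overfitting(word, test_records, reference):
    """Average perplexity of `word` in test compositions absent from `reference`
    minus its average perplexity in test compositions present in `reference`.

    Returns None when either group is empty.
    """
    seen = [ppl for w, comp, ppl in test_records if w == word and comp in reference]
    novel = [ppl for w, comp, ppl in test_records if w == word and comp not in reference]
    if not seen or not novel:
        return None
    return mean(novel) - mean(seen)


def overfitting_difference(word, test_records, memory, train):
    """Overfitting to the replay memory minus overfitting to the training set."""
    o_memory = compositional_overfitting(word, test_records, memory)
    o_train = compositional_overfitting(word, test_records, train)
    if o_memory is None or o_train is None:
        return None
    return o_memory - o_train


# Toy records: log-perplexity of "red" in three test compositions.
records = [
    ("red", ("red", "car"), 1.0),
    ("red", ("red", "apple"), 3.0),
    ("red", ("red", "bus"), 1.0),
]
memory = {("red", "car")}
train = {("red", "car"), ("red", "bus")}
```

With these toy numbers, the overfitting to the memory is 1.0, the overfitting to the training set is 2.0, and the difference is -1.0.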


6 Analysis Results and Discussion

Figure 10: Perplexity curves of nouns (N), verbs (V), and adjectives (J) in compositions that are seen or novel with respect to the training set. We consider noun-verb and noun-adjective compositions for COCO-shift, and only noun-adjective compositions for Flickr-shift.

In this section, we first show the overall performance of continual learning algorithms in our VisCOLL task setup. We then measure the compositional generalization achieved by algorithms and analyze how memory based continual learning algorithms may affect compositional generalization.

6.1 Overall Performance

Table 2 shows the overall performance achieved by vanilla online training, single-pass offline training, ER, and MIR. We see a clear performance gap between vanilla online training and single-pass offline training. The gap is largest in the prediction accuracy of nouns, which are bound to task identities by our stream construction. We also see that ER alleviates forgetting, but its performance is close to offline training only when the replay memory is very large (100,000 examples, about 20% of the training examples). This contrasts with the performance of ER on popular benchmark datasets, where storing only a few examples is believed to be sufficient to achieve good performance Chaudhry et al. (2019). We see that MIR, a state-of-the-art continual learning algorithm, improves perplexity and noun prediction accuracy on both datasets at the same memory cost compared to ER. However, there is still substantial room for improvement.

6.2 Measuring Compositional Generalization

We measure compositional generalization, i.e., how well the models predict a word when it appears in a combination that is novel with respect to the training set. We measure it with the compositional overfitting to the training set introduced in Section 5.3. We consider noun-verb combinations and noun-adjective combinations separately. We average the compositional overfitting over nouns, verbs, and adjectives.

Figure 10 shows the perplexity plots of nouns, verbs, and adjectives in seen and novel contexts. We see clear gaps between the perplexity of words in seen contexts and in novel contexts for almost all methods, which implies the models suffer from compositional overfitting.

6.3 Compositional Overfitting on Memory

Memory size          1k        10k       100k
N-V composition
   Nouns            -0.635    -0.137     0.143
   Verbs            -0.744     0.462     0.345
N-J composition
   Nouns             0.311     0.171    -0.155
   Adjectives       -0.523     0.016    -0.184
Table 3: Differences between the compositional overfitting to the memory and to the training set, measured on COCO-shift. A positive difference implies the model tends to overfit the compositions stored in the memory more than those in the training set.

In addition to the compositional overfitting to the training set, we also measure the compositional overfitting to the examples stored in the memory. We measure the difference between the two overfitting statistics as introduced in Section 5.3. A positive difference indicates the model predicts a word poorly when the composition is not stored in the memory, which implies the model may overfit the compositions stored in the memory and that the replay memory may potentially hurt compositional generalization. A negative difference indicates the model predicts a word poorly when the composition is stored in the memory, which is not a good sign either, as it implies the model overfits the specific instances of the composition stored in the memory. We report the results in Table 3. The results show that both compositional overfitting and normal overfitting happen in the models. When the replay memory is small (1,000 examples), the differences are clearly negative for noun and verb prediction in noun-verb compositions and for adjective prediction in noun-adjective compositions, i.e., the model overfits the instances stored in the memory. This overfitting is reasonable, because the model may have visited the examples stored in the memory hundreds of times more often than other examples. When the size of the memory increases to 10,000 and 100,000, we see a compositional overfit to the memory for verb prediction in noun-verb compositions, while the other statistics move closer to zero.

Overall, the results indicate that both normal overfitting and compositional overfitting exist in memory based continual learning algorithms, and it is not certain which one dominates. The results motivate further study of overfitting and compositional overfitting in memory based continual learning algorithms, and the development of algorithms that can mitigate both.

7 Conclusion

In this paper, we propose VisCOLL, a problem setup for simulating children’s language acquisition process from the visual world. We construct two datasets, namely COCO-shift and Flickr-shift, and propose evaluations that encode the challenges of continual learning and compositionality. Our analysis shows there is substantial room to improve the performance of continual learning algorithms. Our analysis further shows that addressing overfitting and compositional overfitting can be crucial for better performance in our problem setup.


References

  • R. Aljundi, L. Caccia, E. Belilovsky, M. Caccia, M. Lin, L. Charlin, and T. Tuytelaars (2019a) Online continual learning with maximally interfered retrieval. In NeurIPS.
  • R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio (2019b) Gradient based sample selection for online continual learning. In NeurIPS.
  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683.
  • D. Bahdanau, H. de Vries, T. J. O’Donnell, S. Murty, P. Beaudoin, Y. Bengio, and A. Courville (2019) CLOSURE: assessing systematic generalization of clevr models. arXiv preprint arXiv:1912.05783.
  • J. Berko (1958) The child’s learning of english morphology. Word 14 (2-3), pp. 150–177.
  • A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. S. Torr, and M. Ranzato (2019) On tiny episodic memories in continual learning.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • D. Keysers, N. Schärli, N. Scales, H. Buisman, D. Furrer, S. Kashubin, N. Momchev, D. Sinopalnikov, L. Stafiniak, T. Tihon, D. Tsarkov, X. Wang, M. van Zee, and O. Bousquet (2020) Measuring compositional generalization: a comprehensive method on realistic data. In International Conference on Learning Representations.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526.
  • B. M. Lake and M. Baroni (2017) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350.
  • Y. Li, L. Zhao, K. Church, and M. Elhoseiny (2020) Compositional language continual learning. In International Conference on Learning Representations.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In NIPS.
  • J. Lu, J. Yang, D. Batra, and D. Parikh (2018) Neural baby talk. In CVPR.
  • C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner (2018) Variational continual learning. In International Conference on Learning Representations.
  • G. Nguyen, T. J. Jun, T. Tran, and D. Kim (2019) ContCap: a comprehensive framework for continual image captioning. arXiv preprint arXiv:1909.08745.
  • S. Pinker, D. S. Lebeaux, and L. A. Frost (1987) Productivity and constraints in the acquisition of the passive. Cognition 26 (3), pp. 195–267.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) Icarl: incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001–2010.
  • A. V. Robins (1995) Catastrophic forgetting, rehearsal and pseudorehearsal. Connect. Sci. 7, pp. 123–146.
  • D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne (2019) Experience replay for continual learning. In Advances in Neural Information Processing Systems, pp. 348–358.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) VL-bert: pre-training of generic visual-linguistic representations. In International Conference on Learning Representations.
  • D. Surís, D. Epstein, H. Ji, S. Chang, and C. Vondrick (2019) Learning to learn words from visual scenes. arXiv preprint arXiv:1911.11237.
  • X. Yuan, M. Côté, J. Fu, Z. Lin, C. Pal, Y. Bengio, and A. Trischler (2019) Interactive language learning by question answering. arXiv preprint arXiv:1908.10909.
  • W. Zadrozny (1992) On compositional semantics. In Proceedings of the 14th conference on Computational linguistics-Volume 1, pp. 260–266.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995.