With the success of deep learning over the past decade, image captioning has emerged as one of the most attractive research domains owing to its appealing applications. The task of image captioning is to generate the most suitable caption describing the content of an image as accurately as possible. Since the introduction of the Neural Image Caption Generator (NIC), various advanced techniques have been proposed to improve image captioning. Experiments are frequently conducted on three benchmark datasets: Flickr8k, Flickr30k, and MS-COCO.
However, since image captioning models are built on gradient-based neural networks, they inevitably suffer from catastrophic forgetting. A trained model works well on a specific task thanks to the distribution it has learned. Nonetheless, it is hard to reuse this model on a new task while retaining its performance on the old one: the model can reach high performance on the new task, but its performance on the old task degrades disastrously. Although novel models greatly improve image captioning performance, they pay little attention to catastrophic forgetting.
Continual learning (CL) enables deep learning models to continuously learn from a series of dependent or even independent tasks with an acceptable degree of forgetting. Fine-tuning, meanwhile, is a popular technique in deep learning in which a model uses what it has learned before as a starting point for a new problem. When applying fine-tuning to image captioning, we can observe catastrophic forgetting in Fig. 1 by comparing the content of the two generated sentences.
In Fig. 1, we first train a model on a set of 19 classes (class 2: bicycle to class 21: cow) - Task A. Note that class 12 is missing in MS-COCO, so classes 2-21 yield only 19 classes. The model is an instance of the convolutional neural network (CNN) + long short-term memory (LSTM) architecture; after training, we obtain the Task-A model. We then train on class 1: person - Task B, fine-tuning from the Task-A model to obtain the Task-B model. To observe catastrophic forgetting, we pick an image from class 18: dog in the test set of Task A and perform inference with both models. With the original Task-A model, the description is: a black dog laying on a grass covered field. However, after fine-tuning, the caption shifts to: a little boy is standing in the grass with a frisbee. When learning the new Task B, the model seriously forgets what it has learned before and seems to overemphasize describing person.
In this paper, we propose ContCap, an extensible framework combining the encoder-decoder image captioning architecture with continual learning. To overcome catastrophic forgetting, we introduce a pseudo-label approach that extends LwF: our method works on a multi-modal architecture, whereas LwF only works on CNNs for image classification. The idea of LwF is to record the output of the previous model on the new data and use it when training the new task. Two freezing strategies (encoder or decoder) are also integrated to transfer knowledge among tasks, preserving information from old tasks while adapting to perform well on new ones. Furthermore, we employ knowledge distillation on the output layer to guide the new model to produce outcomes similar to the old model's.
Our paper addresses the most challenging scenario in continual learning, in which training samples of different, previously unseen classes become available in subsequent tasks. This scenario is referred to as class-incremental learning. In particular, no access to data from old tasks is permitted, which addresses privacy concerns and memory constraints. To perform experiments in the class-incremental setting, we create a new dataset from MS-COCO 2014 named Split MS-COCO, in which each task introduces a new class. For example, with an old task A of classes cat, dog, and table, the new task B will contain samples of class person only. When training task B, information from task A is used to help the current model generalize well on all 4 classes cat, dog, table, and person. We introduce the term "clear image" for an image whose annotations contain only one class, which lets us manipulate the incremental steps correctly. From over 82k training images and 40k validation images in MS-COCO 2014, Split MS-COCO retains over 51k and 23k respectively. Fine-tuning is considered the baseline for comparison and for evaluating the effectiveness of the strategies.
The experiments show catastrophic forgetting in image captioning and the improvements obtained with the proposed techniques. Previously learned tasks are well captioned, while new information is also well absorbed. The traditional image captioning metrics BLEU, METEOR, ROUGE-L, and CIDEr are calculated to quantitatively assess each model.
To the best of our knowledge, this is the first work conducting image captioning in a continual learning setup. Furthermore, we provide future directions and a discussion of the experiments to elaborate the results qualitatively and quantitatively.
Contributions: Our main contributions are:
Our work is the first attempt to address catastrophic forgetting in image captioning without requiring access to data from existing tasks.
We present ContCap - a comprehensive framework that reconciles image captioning and continual learning in the class-incremental scenario. Our framework can easily be adapted to further research in continual image or video captioning, and integrates easily with existing captioning models.
Our experiments demonstrate the manifestation of catastrophic forgetting in image captioning and show that our proposed methods mitigate it without needing data from previous tasks. The experiments reveal the impact of the freezing techniques, pseudo-labeling, and distillation in dealing with catastrophic forgetting in the context of image captioning.
We propose a new dataset named Split MS-COCO, derived from the standard MS-COCO dataset. It can serve as a reference for other new benchmark datasets for continual learning. The details of creating the dataset are given in Section 4.
2 Related work
The simplest and most prevalent architecture of an image captioning system is the encoder-decoder architecture. Basically, the features of an image are extracted by a CNN with its output layer removed. The intermediate features are then fed into an embedding block to match the input shape of a recurrent neural network (RNN). The RNN, often powered by an LSTM, is in charge of generating a sentence step by step, conditioned on the current time step and the previous hidden state. Word generation is performed until the end token of the sentence is produced. Following NIC, a novel method imitating the human attention mechanism was proposed: focus on a certain region in "high resolution" while perceiving the surroundings in "low resolution", then adjust the focal point or do the inference accordingly. The idea of using an attention mechanism is also applied in Neural Baby Talk in combination with a template-based image captioning approach: an object detector detects the visual words in an image, and the model then decides whether a textual word (a word in the vocabulary) or a visual word (a detected word) should be filled in at each time step. To simplify, we use a primitive image captioning architecture (CNN-LSTM) to keep track of the forgetting problem across scenarios.
Continual learning requires the ability to learn over time without suffering catastrophic forgetting (also known as semantic drift), allowing neural networks to incrementally solve new tasks. To achieve these goals, studies mainly focus on addressing the catastrophic forgetting problem. While the community has not yet agreed on a shared categorization of continual learning strategies, Maltoni et al. propose a three-way fuzzy categorization of the most common CL strategies: architectural, regularization, and rehearsal. Rehearsal approaches demand data from old tasks, leading to privacy or storage issues and making them non-scalable when the number of classes is large. Architectural techniques, in contrast, alter the original network architecture to adapt to a new task while preserving old memory. When a new task arrives, layers or neurons are typically added to the previous model, so the capacity of the network increases accordingly. This leads to a storage problem when many new classes arrive, since the final model can become gigantic, hurting its portability. The concept of regularization approaches is to keep the model close to its previous state by adding a penalty to the objective function; neither access to old samples nor expansion of the old network is required. Ideas from continual learning are widely adopted for vision tasks such as object detection and image classification. Very recently, Michieli et al. introduced a novel framework for continual learning in semantic segmentation. Our work is mainly based on the idea of regularization, but slight expansions in the decoder are needed to secure the preciseness of prediction.
Continual Learning in Vision Tasks
So far, continual learning has been experimented with on several vision benchmarks, such as Permuted MNIST, Split MNIST, and Split CIFAR-10/100 for image classification. Shmelkov et al. use the PASCAL VOC2007 and COCO datasets for object detection, and PASCAL VOC2012 is selected for semantic segmentation.
3 The ContCap Framework

Given a task followed by a series of tasks, the ultimate target is to maximize captioning performance on the new task in addition to old images. The ContCap framework is presented in Fig. 2.

At any time point, the framework has performed a sequence of learning tasks; at time t, the samples from the dataset of the current task are previously unseen by the other tasks. When learning a task, continual learning techniques are applied with information from the Knowledge Base (KB), with the expectation of describing both the current task and the previous ones well. The KB contains the knowledge accumulated from learning the old tasks.
The proposed approaches can be applied to any encoder-decoder image captioning model, but for evaluation we pick the simplest architecture, consisting of a CNN + LSTM. The encoder for feature extraction is a pre-trained ResNet-152. For the decoder, we build a single-layer LSTM network that manages the process of description generation. A task is denoted by its classes, its data, the obtained model, and its vocabulary. Initially, we train the network on a set of classes with the corresponding training data to obtain the first model. Next, at each incremental step, we load the previous model and update the current model using the distribution of the new task's data.
Although tasks coming later are independent of previous tasks, the vocabulary should be accumulated and transferred progressively. For example, an early task of 20 classes yields an initial vocabulary. When the task person arrives, we simply tokenize the annotations of its samples with NLTK and add the most significant words to the vocabulary. The accumulated vocabulary is thus the old vocabulary extended with the newly selected entries.
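A minimal sketch of this vocabulary accumulation (the paper tokenizes with NLTK; we split on whitespace here for brevity, and the frequency threshold is an assumed hyperparameter):

```python
from collections import Counter

def accumulate_vocab(captions, threshold=4, base_vocab=None):
    """Extend the running vocabulary with frequent words from the new
    task's annotations; indices of existing entries never change."""
    vocab = dict(base_vocab) if base_vocab else {"<pad>": 0, "<start>": 1,
                                                 "<end>": 2, "<unk>": 3}
    counts = Counter(w for cap in captions for w in cap.lower().split())
    for word, n in counts.items():
        if n >= threshold and word not in vocab:
            vocab[word] = len(vocab)  # append at the next free index
    return vocab
```

Keeping old indices stable is what allows the decoder's embedding and output layers to be enlarged without disturbing previously learned entries.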
We run the fine-tuning experiment first to obtain the baseline for comparison. With new classes, we initialize the model parameters from the converged state of the old task without adding any techniques. Predictably, the new model will try to generalize to the newly added classes and give poor performance on the old ones.
3.1 Freezing

A heuristic solution to hinder catastrophic forgetting is to freeze part of the model. As our architecture is divided into an encoder and a decoder, keeping one component intact directly affects the performance on the previous tasks as well as on the new task.
On the one hand, the encoder is frozen and the decoder is updated by exploiting the new dataset. Feature extraction stays unchanged from the previous model, leaving the decoder to deal with the new task (see Fig. 3). Unlike prior work on segmentation, where new nodes are added to the decoder according to the number of new classes present, the decoder here is modified according to the vocabulary size. The vocabulary is expanded over tasks to ensure the naturalness of the predictions.
On the other hand, we freeze the decoder while the encoder is trainable. Because neurons are added to the decoder as the vocabulary expands, the newly attached neurons are the only trainable part of the decoder. Under this setting, the convolutional network can better generalize features, feeding more fine-grained input to the decoder. When both encoder and decoder are frozen, the new model is identical to the previous one, while fine-tuning is the case where both are trainable.
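The two freezing strategies amount to toggling gradient flow on one side of the model. A sketch (the mode names are ours; in the paper, the neurons newly added to a frozen decoder for vocabulary growth remain trainable, which is handled separately):

```python
import torch.nn as nn

def set_freezing(encoder: nn.Module, decoder: nn.Module, mode: str):
    """mode: 'freeze_encoder', 'freeze_decoder', or 'fine_tune'.
    Freezing both sides would reproduce the previous model exactly."""
    for p in encoder.parameters():
        p.requires_grad = (mode != "freeze_encoder")
    for p in decoder.parameters():
        p.requires_grad = (mode != "freeze_decoder")
```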
3.2 Pseudo-Labeling

We refer to this method as pseudo-labeling because, during training, pseudo-labels are generated to guide the current model to mimic the behavior of the previously trained model. The procedure of this approach is described in Fig. 4.
Pseudo-labels are acquired by inferring the caption of every input image in the new task's dataset using the previous model. Unlike prior work on object detection and semantic segmentation, the convolutional network architecture stays unchanged during training even though new classes appear, since the expected output is a textual caption rather than per-class probabilities. The model is then trained in a supervised way to minimize a loss computed from the ground truth, the predicted caption, and the pseudo-labels. Following the standard MS-COCO dataset, the number of captions per image is 5, and $\ell_{ce}$ is the cross-entropy loss function. The loss component from pseudo-labeling is explicitly written as:

$$\mathcal{L}_{pl} = \lambda_p \, \ell_{ce}\big(\hat{y}, \tilde{y}\big),$$

where $\hat{y}$ is the current model's prediction, $\tilde{y}$ is the pseudo-label, and $\lambda_p$ plays the role of a regulator to accentuate the old or the new task.
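A sketch of this loss under the simplifying assumption of one ground-truth and one pseudo caption per image (the function name and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(logits, gt_tokens, pseudo_tokens, lambda_p=1.0):
    """Cross-entropy against the ground truth plus a lambda_p-weighted
    cross-entropy against the previous model's pseudo-caption.
    logits: (batch, seq_len, vocab); token tensors: (batch, seq_len)."""
    flat = logits.reshape(-1, logits.size(-1))
    l_ce = F.cross_entropy(flat, gt_tokens.reshape(-1))
    l_pl = F.cross_entropy(flat, pseudo_tokens.reshape(-1))
    return l_ce + lambda_p * l_pl
```

With `lambda_p = 0` the pseudo-label term vanishes and training reduces to plain fine-tuning.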
3.3 Knowledge Distillation
Knowledge distillation follows a teacher-student strategy in which we transfer knowledge from a teacher model to a student model: training the student minimizes the difference between the predictions of the teacher and the student. In our case, the teacher is the model obtained from the old task, while the student is the new model. An image is passed through both teacher and student, and the mean squared error between their outputs ($o_t$ and $o_s$) is added to the loss function of the student to penalize it. The distillation term is computed as:

$$\mathcal{L}_{kd} = \lambda_d \cdot \mathrm{MSE}\big(o_t, o_s\big),$$

where $\lambda_d$ can be increased to encourage the student more intensively to imitate the behaviors of the teacher.
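The distillation term can be sketched directly (the function name is ours; the default weight of 10 follows the experimental setting described later in the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, lambda_d=10.0):
    """Mean squared error between the teacher's and the student's
    output-layer responses, scaled by lambda_d."""
    return lambda_d * F.mse_loss(student_out, teacher_out)
```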
The objective function can thus be rewritten as $\mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{pl} + \mathcal{L}_{kd}$, in which $\mathcal{L}_{ce}$ is the cross-entropy loss over the annotations and the prediction, $\mathcal{L}_{pl}$ is the additional loss from pseudo-labeling, and $\mathcal{L}_{kd}$ is the penalty from the teacher on the student model.

To control the impact of pseudo-labels or distillation in training, $\lambda_p$ or $\lambda_d$ is adjustable; a value of 0 means we are training the new model in the fine-tuning scheme. In our experiments, $\lambda_p$ and $\lambda_d$ are mostly set to 1 and 10 respectively. Increasing $\lambda_p$ or $\lambda_d$ favors the old task over the new one.
3.5 Implementation Details
We use a pre-trained ResNet-152 model, removing the final fully connected layer to obtain the intermediate features. Behind the ResNet is an embedding layer with an embedding size of 256. This embedding layer is responsible for bridging the output of the ResNet and the input of the RNN, and is followed by a batch normalization layer.
The entrance of the language model is an embedding module that embeds the features and labels fed to the LSTM. The RNN contains a single-layer LSTM with a hidden state size of 512 and an embedding size of 256. A fully connected layer is placed at the end of the language model to generate a score for each entry in the vocabulary. Note that as the vocabulary is incrementally accumulated, the embedding layer and the output layer of the decoder are enlarged accordingly.
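Enlarging the decoder when the vocabulary grows can be done by allocating bigger layers and copying the learned rows, so only the new entries start from fresh initialization. A sketch (the function name is ours):

```python
import torch
import torch.nn as nn

def expand_decoder_vocab(embedding: nn.Embedding, fc: nn.Linear, new_size: int):
    """Return enlarged copies of the decoder's embedding and output
    layers; weights for the existing vocabulary entries are preserved."""
    new_embed = nn.Embedding(new_size, embedding.embedding_dim)
    new_fc = nn.Linear(fc.in_features, new_size)
    with torch.no_grad():
        new_embed.weight[: embedding.num_embeddings] = embedding.weight
        new_fc.weight[: fc.out_features] = fc.weight
        new_fc.bias[: fc.out_features] = fc.bias
    return new_embed, new_fc
```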
To help the model work with images of different sizes, we always resize inputs to square images. We shuffle images at each epoch, and the batch size is 128. Early stopping is used to help models reach their optimal state within 50 epochs of training. The patience value - the number of epochs to wait before stopping early if there is no loss improvement on the validation set - is empirically set to 5. For updating both encoder and decoder, we choose the Adam optimizer with a shared learning rate. Exceptionally, in knowledge distillation, we apply a separate Adam optimizer to the encoder with a lower learning rate: the parameters of the encoder could change dramatically during training, so we want to slow this down.
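The early-stopping rule described here is the standard patience-based one; a minimal sketch:

```python
class EarlyStopping:
    """Stop training once the validation loss has failed to improve for
    `patience` consecutive epochs (the paper uses patience = 5)."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```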
All methods except distillation are initialized from the final state of the old task. Joint training is incorporated only in pseudo-labeling, while the normal training procedure is coupled with each remaining technique. In joint training, we first freeze the entire model except for the newly added neurons and train to convergence; subsequently, we unfreeze the model and perform training again until convergence (joint optimization).
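The two-phase joint training used with pseudo-labeling can be sketched as follows, where `train_to_convergence` stands in for a full training loop (all names here are ours):

```python
import torch.nn as nn

def joint_train(model: nn.Module, new_param_names, train_to_convergence):
    """Phase 1: freeze everything except the newly added neurons and
    train to convergence. Phase 2: unfreeze all parameters and train
    again until convergence (joint optimization)."""
    for name, p in model.named_parameters():
        p.requires_grad = name in new_param_names
    train_to_convergence(model)   # phase 1: only new parts learn
    for p in model.parameters():
        p.requires_grad = True
    train_to_convergence(model)   # phase 2: joint optimization
```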
4 Experimental Results
Our experimental results are recorded on Split MS-COCO, a variant of the MS-COCO benchmark dataset. The original MS-COCO dataset contains 82,783 images in the training split and 40,504 images in the validation split; each image belongs to at least one of 80 classes. To make Split MS-COCO, we process MS-COCO with the following steps:
Split all images into distinct classes; there are 80 classes in total.
Resize images to a fixed square size in order to deal with inputs of different sizes.
Discard images whose annotations include more than one class, keeping only "clear" images.
Since the MS-COCO test set is not available, we divide the validation set of each class into two equal parts: one for validation and the other for testing. If the number of images in a class is odd, the validation set gets one image more than the test set. Finally, we obtain 47,547 images in the training set, 11,722 in the validation set, and 11,687 in the test set.
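The construction above can be sketched with plain Python; the per-image schema (`classes`, `split` keys) is an assumption for illustration, not the MS-COCO annotation format:

```python
def make_split_coco(images):
    """Keep only 'clear' images (exactly one annotated class), group them
    per class, and halve each class's validation images into val/test,
    giving val the extra image when the count is odd."""
    per_class = {}
    for img in images:
        if len(img["classes"]) != 1:      # discard non-"clear" images
            continue
        cls = img["classes"][0]
        per_class.setdefault(cls, {"train": [], "val": []})
        per_class[cls][img["split"]].append(img)
    splits = {"train": [], "val": [], "test": []}
    for buckets in per_class.values():
        splits["train"] += buckets["train"]
        val = buckets["val"]
        half = (len(val) + 1) // 2        # val is the larger half if odd
        splits["val"] += val[:half]
        splits["test"] += val[half:]
    return splits
```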
We assess the learning capacity of the framework with the commonly used metrics BLEU, ROUGE, CIDEr, and METEOR; these metrics generally measure the word overlap between the predicted and ground-truth captions. The compared settings are fine-tuning, freezing the encoder, freezing the decoder, pseudo-labeling, and knowledge distillation; fine-tuning is incorporated into almost all the remaining techniques by default. Captions are inferred in every setting to qualitatively evaluate the framework (see Fig. 5).
4.1 Addition of One Class
In the first setting, we examine catastrophic forgetting under the addition of one class. We train the model on the training set of 19 classes (2 to 21) and evaluate on the corresponding test set of these 19 classes. Next, we analyze the learning capacity of the model by adding another class and observing how confidently the model operates on the old and new classes. The results are shown in Table 1.
From the table, we can see that fine-tuning seriously demolishes accuracy on the old task even though we add only one class. Massive decreases can be observed in all the metrics, most remarkably in CIDEr, which drops from 0.778 to 0.287. On the old task, pseudo-labeling outperforms the remaining methods, reaching 0.576 CIDEr, double fine-tuning's score. Freezing the encoder and the decoder achieve 0.409 and 0.461 CIDEr respectively. We conclude that freezing does not help much: when one part is frozen, the other is strongly driven to the new domain, misdescribing images of the old domain. Pseudo-labeling better retains the old knowledge because the pseudo-labels impede the drift of the model weights through regularization. Knowledge distillation performs worst because we rebuild the model, so no information is preserved from the beginning, resulting in only 0.255 CIDEr.
When testing on the new domain, fine-tuning reaches a relatively high performance (0.686 BLEU-1 and 0.713 CIDEr). Simply freezing the encoder, surprisingly, almost matches the performance of fine-tuning, reaching 0.685 BLEU-1 and 0.702 CIDEr, while freezing the decoder yields merely 0.467 CIDEr. Pseudo-labeling yields 0.657 BLEU-1 and 0.523 CIDEr, and distillation claims 0.706 BLEU-1 and 0.663 CIDEr.
Despite its average performance on the new task, pseudo-labeling is a reasonable option when adding a new class, as it best balances performance between the new and old tasks. Freezing the encoder is less balanced across tasks, but it keeps the original performance on the new task while still diminishing catastrophic forgetting.
4.2 Addition of Multiple Classes at Once
In this setting, we add 5 new classes at once. We choose these 5 classes because they are the largest classes absent from the first task, and they come from different super-classes, ensuring that totally new instances are introduced. Starting from the 19-class model, we add the 5 classes simultaneously, train the model, and then test it on both the old and the new test sets.
We first test on the old classes. As the vocabulary is broadened significantly by new words from the annotations of the 5 new classes, the generated sentences have the chance to be more natural and diverse. As a result, all approaches in this experiment perform better than when only one class is added. On the new classes, the trend is similar to the first experiment. We suppose that since the 5 new classes belong to different, previously absent super-classes, the features from each of them are unique; we call this type of addition "unknown-domain addition". Adding multiple classes simultaneously is similar to adding a totally new class, but it generates more fine-grained descriptions thanks to a bigger vocabulary. While pseudo-labeling still balances performance (0.612 and 0.572 CIDEr on the old and new classes), distillation performs best in this scenario, at 0.436 and 0.785 CIDEr.
4.3 Addition of Multiple Classes Sequentially
Again, we begin with the 19-class model. We then add the same 5 classes one by one; at each stage, the previous model is used to initialize the current model. Finally, we evaluate on the old and new test sets.
We find that this setting causes extremely poor performance on the old classes regardless of the presence of the techniques. The BLEU-1 score, which is computed on 1-grams, is about 0.5, but the other metrics collapse because they count matching n-grams or even more complex patterns. Sequentially adding new classes drifts the original model in various directions after each step, thus changing it drastically (stage catastrophic forgetting); CIDEr drops from 0.778 to only about 0.05 with distillation. On the new classes, performance is much lower than in the setting of adding 5 classes at once. Simultaneous addition helps the model better generalize the knowledge on new classes, while sequential addition suffers from both stage catastrophic forgetting and poor generalization. Freezing seems to govern the sentence generation process best in this scenario.
5 Conclusion

In this paper, we introduced ContCap, a comprehensive framework fusing continual learning and image captioning, together with Split MS-COCO, a new dataset created from MS-COCO for continual learning settings. We first performed image captioning in incremental schemes and then added techniques from continual learning to weaken catastrophic forgetting. The experiments in three settings show the signs of catastrophic forgetting and the effectiveness of integrating freezing, pseudo-labeling, and distillation compared to fine-tuning. Applying further advanced continual learning techniques to enhance performance, especially in the sequential-addition setting, is left for future research. Since image captioning is multi-modal, further approaches should be adopted to reach the same high performance on new tasks as in single-modal tasks (object segmentation or detection). Complementing the "unknown-domain addition" studied here, experiments on "known-domain addition" should be performed, with the expectation of witnessing less forgetting. Stage catastrophic forgetting should also be investigated more deeply to make continual learning possible with a stream of new tasks.
-  (2018) Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5561–5570. Cited by: §1.
-  (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §1.
-  (2018) End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 233–248. Cited by: §2.
-  (2019) Show, control and tell: a framework for generating controllable and grounded captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8307–8316. Cited by: §1.
-  The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Note: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html Cited by: §2.
-  (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §2.
-  (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211. Cited by: §1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.5, §3.
-  (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1.
-  (2013) Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, pp. 853–899. Cited by: §1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.5.
-  (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §2.
-  (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §1, §3.2.
-  (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §1.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, §2, §4.2, §4.
-  (2019) Fine-grained continual learning. arXiv preprint arXiv:1907.03799. Cited by: §1.
-  (2002) NLTK: the natural language toolkit. arXiv preprint cs/0205028. Cited by: §3.
-  (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476. Cited by: §2.
-  (2018) Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7219–7228. Cited by: §1, §1, §2.
-  (2019) Continuous learning in single-incremental-task scenarios. Neural Networks 116, pp. 56–73. Cited by: §2.
-  (2019) Incremental learning techniques for semantic segmentation. arXiv preprint arXiv:1907.13372. Cited by: §1, §2, §2, §3.1, §3.
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §1.
-  (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641–2649. Cited by: §1.
-  (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §2.
-  (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §2.
-  (2017) Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3400–3409. Cited by: §2, §2.
-  (2018) Incremental learning for semantic segmentation of large-scale remote sensing data. arXiv preprint arXiv:1810.12448. Cited by: §2.
-  (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §1.
-  (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §1, §2.
-  (2018) Incremental classifier learning with generative adversarial networks. arXiv preprint arXiv:1802.00853. Cited by: §2.
-  (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057. Cited by: §1, §2.
-  (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. Cited by: §2.