Bringing back simplicity and lightliness into neural image captioning

by Jean-Benoit Delbrouck, et al.

Neural Image Captioning (NIC), or neural caption generation, has attracted a lot of attention over the last few years. Describing an image in natural language has been an emerging challenge in both computer vision and language processing. A lot of research has therefore focused on driving this task forward with new creative ideas. So far, the goal has been to maximize scores on automated metrics, and to do so, one has to come up with a plurality of new modules and techniques. Once these add up, the models become complex and resource-hungry. In this paper, we take a small step backwards in order to study an architecture with an interesting trade-off between performance and computational complexity. To do so, we tackle every component of a neural captioning model and propose one or more solutions that lighten the model overall. Our ideas are inspired by two related tasks: Multimodal and Monomodal Neural Machine Translation.




1 Introduction

Problems combining vision and natural language processing, such as image captioning [DBLP:journals/corr/ChenFLVGDZ15], are viewed as extremely challenging tasks. Captioning requires grasping and expressing low- to high-level aspects of local and global areas in an image as well as their relationships, and it has continued to inspire considerable research over the years. Visual attention-based neural decoder models [pmlr-v37-xuc15, conf/cvpr/KarpathyL15] have been very successful and are now widely adopted for the NIC task. These recent advances are inspired by the neural encoder-decoder framework [SutskeverVL14, bahdanau+al-2014-nmt], also called the sequence-to-sequence (seq2seq) model, used for Neural Machine Translation (NMT). In that approach, Recurrent Neural Networks (RNN [conf/interspeech/MikolovKBCK10]) map a source sequence of words (encoder) to a target sequence (decoder). An attention mechanism is learned to focus on different parts of the source sentence while decoding. The same mechanism applies to a visual input: the attention module learns to attend to the salient parts of an image while decoding the caption.

These two fields, NIC and NMT, led to the Multimodal Neural Machine Translation (MNMT [specia-EtAl:2016:WMT]) task, where the sentence to be translated is supported by information from an image. Interestingly, NIC and MNMT share a very similar decoder: both are required to generate a meaningful natural language description or translation with the help of a visual input. However, the two tasks differ in the amount of annotated data. MNMT has 19 times fewer unique training examples, which limits the number of learnable parameters and the potential complexity of a model. Yet, over the years, the challenge has brought up clever and elegant ideas that could be transferred to the NIC task. The aim of this paper is to propose such an architecture for NIC in a straightforward manner. Indeed, our proposed models work with less data and fewer parameters, and require less computation time. More precisely, this paper intends to:

  • Work only with in-domain data. No additional data besides the proposed captioning datasets is involved in the learning process;

  • Lighten as much as possible the training data used, i.e. the visual and linguistic inputs of the model;

  • Propose a subjectively light and straightforward yet efficient NIC architecture with high training speed.

Figure 1: This image depicts a decoder time-step of the MNMT architecture. At time-step t, the decoder attends to both a visual and a textual representation. In NIC, the decoder only attends to an image. This shows how both tasks are related.

2 Captioning Model

As briefly mentioned in section 1, a neural captioning model is an RNN decoder [bahdanau+al-2014-nmt] that uses an attention mechanism over an image to generate one word of the caption at each time-step t. The following equations depict a baseline time-step [pmlr-v37-xuc15]:


where equation 1 maps the embedding of the previously generated word to the RNN hidden state size with a mapping matrix, equation 2 is the attention module over the image, equation 3 is the RNN cell computation and equation 4 is the probability distribution over the vocabulary (the matrix used in equation 4 is also called the projection matrix).
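For reference, a baseline attention-decoder time-step of this kind is commonly written as below. This is a sketch in assumed notation: the symbols $E$ (embedding matrix), $W^{x}$ (mapping matrix), $W^{y}$ (projection matrix), $f_{\mathrm{att}}$, $f_{\mathrm{rnn}}$ and $I$ (image) are chosen here for illustration, and the exact inputs of the projection in equation 4 are an assumption.

```latex
\begin{align}
x_t &= W^{x} E\, y_{t-1} \tag{1}\\
c_t &= f_{\mathrm{att}}(h_{t-1}, I) \tag{2}\\
h_t &= f_{\mathrm{rnn}}(x_t, h_{t-1}, c_t) \tag{3}\\
y_t &\sim p_t = \mathrm{softmax}\big(W^{y}\,[y_{t-1}, h_t, c_t]\big) \tag{4}
\end{align}
```

Here $y_{t-1}$ is the previously generated word, $x_t$ its mapped embedding, $c_t$ the context vector returned by the attention module and $h_t$ the hidden state.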

If we denote the model parameters by θ, then θ is learned by maximizing the likelihood of the observed sequence or, in other words, by minimizing the cross-entropy loss. The objective function is given by:
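In words, the cross-entropy objective sums, over the time-steps of a caption, the negative log-probability that the model assigns to the ground-truth word. The following minimal Python sketch illustrates this with made-up toy probabilities; the function name `sequence_cross_entropy` and the numbers are illustrative assumptions, not values from the paper.

```python
import math

def sequence_cross_entropy(step_distributions, target_ids):
    """Negative log-likelihood of a target caption.

    step_distributions[t] is the model's distribution over the vocabulary
    at time-step t; target_ids[t] is the index of the ground-truth word.
    """
    assert len(step_distributions) == len(target_ids)
    loss = 0.0
    for probs, target in zip(step_distributions, target_ids):
        loss -= math.log(probs[target])
    return loss

# Toy example: vocabulary of 3 subwords, caption of 2 tokens.
step_distributions = [
    [0.7, 0.2, 0.1],  # distribution for the first word
    [0.1, 0.8, 0.1],  # distribution for the second word
]
target_ids = [0, 1]
loss = sequence_cross_entropy(step_distributions, target_ids)
print(f"cross-entropy of the toy caption: {loss:.4f}")
```

Minimizing this quantity over the training captions is equivalent to maximizing their likelihood.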


The paper is structured so that each section tackles an equation (i.e. a main component of the captioning model) in the following manner: section 2.1 for equation 1 (embeddings), section 2.2 for equation 3 (recurrent function), section 2.3 for equation 2 (attention module), section 2.4 for equation 4 (projection) and section 2.5 for equation 5 (objective function).

2.1 Embeddings

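The total size of the embedding matrix depends on the vocabulary size and the embedding dimension: it holds one embedding vector per vocabulary entry. The mapping matrix of equation 1 also depends on the embedding dimension, because it maps an embedding to the RNN hidden state size.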

Many previous works [conf/cvpr/KarpathyL15, You2016ImageCW, yao2017boosting, Anderson2017up-down] use pretrained embeddings such as GloVe and word2vec, or one-hot vectors. Both word2vec and GloVe provide distributed representations of words. These models are pre-trained on 30 and 42 billion words respectively [NIPS2013_5021, pennington2014glove], weigh several gigabytes and work with 300-dimensional embeddings.

For our experiments, each word is a column index in an embedding matrix that is learned along with the model and initialized from a random distribution. Whilst the usual allocation is from 512 to 1024 dimensions per embedding [pmlr-v37-xuc15, Lu2017Adaptive, mun2017textguided, Rennie2017SelfCriticalST], we show that a much smaller embedding size is sufficient to learn a strong vocabulary representation. A jointly-learned embedding matrix also tackles the high-dimensionality and sparsity problem of one-hot vectors. For example, [Anderson2017up-down] works with a vocabulary of 10,010 and a hidden size of 1000. As a result, the mapping matrix of equation 1 has about 10 million parameters.
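The 10 million figure follows directly from multiplying the one-hot input size by the hidden size. A small sketch of that arithmetic is given below; the second configuration uses a hypothetical embedding size of 128 purely for comparison (the hidden size of 256 is the one reported in section 2.2).

```python
def dense_params(input_size, output_size):
    """Number of weights in a bias-free linear map."""
    return input_size * output_size

# One-hot input of size 10,010 mapped to a hidden size of 1000.
one_hot_mapping = dense_params(10_010, 1_000)

# Hypothetical small setup: 128-dimensional embedding (assumed value)
# mapped to a hidden size of 256.
small_mapping = dense_params(128, 256)

print(f"one-hot mapping matrix : {one_hot_mapping:,} parameters")
print(f"small embedding mapping: {small_mapping:,} parameters")
```

With a learned low-dimensional embedding, the vocabulary-sized factor moves into the embedding matrix, and the mapping matrix itself becomes negligible.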

Working with a small vocabulary, besides reducing the size of the embedding matrix, presents two major advantages: it lightens the projection module (as explained further in section 2.4) and reduces the action space in a Reinforcement Learning setup (detailed in section 2.5). To reduce our vocabulary size (by 50%), we use the byte pair encoding (BPE) algorithm on the train set to convert space-separated tokens into subwords [P16-1162]. Originally applied to NMT, BPE is based on the intuition that various word classes, such as compounds and loanwords, are made of units smaller than words. In addition to making the vocabulary smaller and the sentence lengths shorter, the subword model is able to productively generate new words that were not seen at training time.
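To make the BPE idea concrete, the sketch below learns a few merge operations on a toy word-frequency dictionary, in the spirit of the reference algorithm of [P16-1162]. The toy vocabulary, the function names and the number of merges are illustrative assumptions; a real setup would learn several thousand merges on the training captions.

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair by its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Words are pre-split into characters; "</w>" marks the end of a word.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(toy_vocab, num_merges=3)
print(merges)
print(segmented)
```

After three merges the frequent suffix "est" has become a single subword, which is how rare or unseen words can still be composed from known units.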

2.2 Conditional GRU

Most previous work in captioning [conf/cvpr/KarpathyL15, You2016ImageCW, Rennie2017SelfCriticalST, mun2017textguided, Lu2017Adaptive, yao2017boosting] used an LSTM [Hochreiter:1997:LSM:1246443.1246450] as the recurrent function. Our recurrent model is a pair of Gated Recurrent Units (GRU [cho-al-emnlp14]), called a conditional GRU (cGRU), as previously investigated in NMT. A GRU is a lighter gating mechanism than an LSTM since it uses two gates (update and reset) instead of three, and it leads to similar results in our experiments.

The cGRU also addresses the encoding problem of the attention mechanism. As shown in equation 2, the context vector takes the previous hidden state as input, which is information from outside the current time-step. This could be tackled by using the current hidden state instead, but then the context vector is no longer an input of the recurrent function. A conditional GRU is an efficient way to both build the context vector and encode the result of the attention module.

Mathematically, a first independent GRU encodes an intermediate hidden state proposal based on the previous hidden state and the input at each time-step t:


Then, the attention mechanism computes the context vector from the image and the intermediate hidden state proposal, similarly to equation 2:

Finally, a second independent GRU computes the hidden state of the cGRU by looking at the intermediate representation and the context vector:


We see that both problems are addressed: the context vector is computed according to the intermediate representation, and the final hidden state is computed according to the context vector. Again, whereas the size of the hidden state in the literature varies between 512 and 1024, we pick a hidden state size of 256.
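The three steps above (proposal GRU, attention on the proposal, second GRU fed with the context vector) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: biases are omitted, weights are random, the sizes are tiny, and `toy_attention` is a placeholder stand-in for the attention module of section 2.3, not the paper's exact formulation.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_gru(input_size, hidden_size, rng):
    """Random input (W) and recurrent (U) matrices for the three GRU transforms."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {name: (mat(hidden_size, input_size), mat(hidden_size, hidden_size))
            for name in ("update", "reset", "candidate")}

def gru_cell(params, x, h):
    """One GRU step (biases omitted for brevity)."""
    def pre_activation(name, state):
        w, u = params[name]
        return [a + b for a, b in zip(matvec(w, x), matvec(u, state))]
    z = [sigmoid(v) for v in pre_activation("update", h)]
    r = [sigmoid(v) for v in pre_activation("reset", h)]
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = [math.tanh(v) for v in pre_activation("candidate", gated_h)]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]

def toy_attention(h_proposal, image_vector):
    """Placeholder attention: element-wise product with an image vector."""
    return [hi * vi for hi, vi in zip(h_proposal, image_vector)]

def cgru_step(gru1, gru2, x_t, h_prev, image_vector):
    h_proposal = gru_cell(gru1, x_t, h_prev)       # first GRU: state proposal
    c_t = toy_attention(h_proposal, image_vector)  # attention uses the proposal
    h_t = gru_cell(gru2, c_t, h_proposal)          # second GRU: final state
    return h_t, c_t

rng = random.Random(0)
embed_size, hidden_size = 4, 6  # tiny illustrative sizes
gru1 = init_gru(embed_size, hidden_size, rng)
gru2 = init_gru(hidden_size, hidden_size, rng)
x_t = [rng.uniform(-1, 1) for _ in range(embed_size)]
image_vector = [rng.uniform(-1, 1) for _ in range(hidden_size)]
h_prev = [0.0] * hidden_size
h_t, c_t = cgru_step(gru1, gru2, x_t, h_prev, image_vector)
print(len(h_t), len(c_t))
```

The point of the layout is visible in `cgru_step`: the context vector depends only on information from the current time-step, and it is still encoded into the final hidden state by the second GRU.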

The most similar approach to ours is the Top-Down Attention of [Anderson2017up-down], which encodes the context vector the same way but with LSTMs and a different hidden state layout.

2.3 Attention model

Since the image is the only input to a captioning model, the attention module is crucial, but it also varies widely across works. For example, [You2016ImageCW] use a semantic attention where, in addition to image features, they run a set of attribute detectors to get a list of visual attributes or concepts that are most likely to appear in the image. [Anderson2017up-down] use the Visual Genome dataset to pre-train their bottom-up attention model. This dataset contains 47,000 images that are out-of-domain with respect to the captioning dataset, densely annotated with scene graphs containing objects, attributes and relationships. [YangReview] propose a review network, which is an extension to the decoder. The review network performs a given number of review steps on the hidden states and outputs a compact vector representation available for the attention mechanism.

Yet everyone seems to agree on using a Convolutional Neural Network (CNN) to extract features of the image. The trend is to select feature matrices, at the convolutional layers, of size 14×14×1024 (ResNet [He2016DeepRL], res4f layer) or 14×14×512 (VGGNet [Simonyan2014VeryDC], conv5 layer). Other attributes can be extracted from the last fully connected layer of a CNN and have been shown to bring useful information [YangReview, yao2017boosting, You2016ImageCW]. Some models also finetune the CNN during training [YangReview, mun2017textguided, Lu2017Adaptive], stacking even more trainable parameters.

Our attention model is guided by a single vector: a global 2048-dimensional visual representation of the image, extracted at the pool5 layer of a ResNet-50. Our attention vector is computed as follows:


Recall that, following the cGRU presented in section 2.2, we work with the intermediate hidden state proposal and not with the previous hidden state. Even though pooled features carry less information than convolutional features (50 to 100 times less), pooled features have shown great success in combination with a cGRU in MNMT [W17-4746]. Hence, our attention model consists of only a single matrix.
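The "50 to 100 times less" figure can be checked with a short calculation, assuming the usual 14×14 spatial grid of the res4f (1024 channels) and VGG conv5 (512 channels) layers for a 224×224 input; the 196×1024 ResNet layout matches the one used for multi-head attention in section 4.3.

```python
pooled_features = 2048               # ResNet-50 pool5 vector

resnet_conv_features = 14 * 14 * 1024  # 196 positions x 1024 channels
vgg_conv_features = 14 * 14 * 512      # 196 positions x 512 channels (assumed grid)

resnet_ratio = resnet_conv_features / pooled_features
vgg_ratio = vgg_conv_features / pooled_features

print(f"ResNet res4f vs pool5: {resnet_ratio:.0f}x more values")
print(f"VGG conv5 vs pool5   : {vgg_ratio:.0f}x more values")
```

These counts also line up with the "Att. feat. (in K)" column of table 1, where pooled features amount to about 2K values and convolutional features to 100K or 200K.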

2.4 Projection

The projection also accounts for a lot of trainable parameters in the captioning model, especially if the vocabulary is large. Indeed, the size of the projection matrix in equation 4 grows with the vocabulary size. To reduce the number of parameters, we use a bottleneck function:


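where the bottleneck dimension is smaller than the size of the projection input, so that the bottleneck and the reduced projection together hold fewer parameters than the original projection matrix. Interestingly enough, if the bottleneck dimension equals the embedding size, then the reduced projection matrix has the same size as the embedding matrix. We can share the weights between the two matrices to further reduce the number of learned parameters. Moreover, doing so does not negatively impact the captioning results.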
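A rough parameter count illustrates the effect of the bottleneck and of weight sharing. The vocabulary size (5066) and hidden size (256) come from the settings in section 3; the embedding size of 128, the context vector size and the assumption that the projection input concatenates the previous embedding, the hidden state and the context vector are hypothetical choices made only for this illustration.

```python
vocab_size = 5066    # from the settings section
hidden_size = 256    # from the settings section
embed_size = 128     # hypothetical value, for illustration only
context_size = 256   # assumption: context vector has the hidden state size

# Assumed projection input: previous embedding, hidden state and context vector.
concat_size = embed_size + hidden_size + context_size

# Direct projection from the concatenation to the vocabulary.
direct = concat_size * vocab_size

# Bottleneck down to the embedding size, then projection to the vocabulary.
bottleneck = concat_size * embed_size + embed_size * vocab_size

# With weight sharing, the vocabulary-side matrix is the embedding matrix,
# so only the bottleneck matrix adds new parameters.
tied = concat_size * embed_size

print(f"direct projection      : {direct:,}")
print(f"with bottleneck        : {bottleneck:,}")
print(f"bottleneck + tied embed: {tied:,}")
```

Under these assumed sizes, the bottleneck cuts the projection parameters by more than a factor of four, and sharing the weights removes the vocabulary-sized matrix from the count altogether.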

We push our projection further and use a deep-GRU, originally used in MNMT [delbrouck2018umons], so that our bottleneck function is now a third GRU, as described by the GRU equations of section 2.2:


Because we work with small dimensions, adding a new GRU block on top barely increases the model size.

2.5 Objective function


To directly optimize an automated metric, we can see the caption generator as a Reinforcement Learning (RL) problem. The recurrent function introduced above is viewed as an agent that interacts with an environment composed of words and image features. The agent interacts with the environment by taking actions, which are the predictions of the next word of the caption. An action is the result of the policy p_θ, where θ are the parameters of the network. Whilst very effective at boosting the automatic metric scores, porting the captioning problem into an RL setup significantly reduces the training speed.

[DBLP:journals/corr/RanzatoCAZ15] proposed a method (MIXER), based on the REINFORCE method, combined with a baseline reward estimator. However, they implicitly assume that each intermediate action (word) in a partial sequence has the same reward as the sequence-level reward, which is not true in general. To compensate for this, they introduce a form of training that mixes together the MLE objective and the REINFORCE objective. [Liu2017ImprovedIC] also address the delayed reward problem by estimating, at each time-step, the future rewards based on Monte Carlo rollouts. [Rennie2017SelfCriticalST] utilize the output of the model's own test-time inference to normalize the rewards it experiences. Only samples from the model that outperform the current test-time system are given positive weight.

To keep it simple, and because our reduced vocabulary allows us to do so, we follow the work of [DBLP:journals/corr/RanzatoCAZ15] and use the naive variant of the policy gradient with REINFORCE. The loss function in equation 5 is now given by:


where the reward is the score given by an automatic metric scorer to the output caption.

We use the REINFORCE algorithm based on the observation that the expected gradient of a non-differentiable reward function is computed as follows:


The expected gradient can be approximated using Monte-Carlo samples for each training example in the batch:


In practice, we can approximate it with a single sample:
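In standard notation, which is assumed here for illustration ($Y$ a sampled caption, $r(Y)$ its reward, $p_\theta$ the policy, $N$ the number of samples), the loss and the three forms of its gradient discussed above (exact expectation, Monte-Carlo estimate, single-sample estimate) can be written as:

```latex
\begin{align}
\mathcal{L}(\theta) &= -\,\mathbb{E}_{Y \sim p_\theta}\big[\, r(Y) \,\big] \\
\nabla_\theta \mathcal{L}(\theta)
  &= -\,\mathbb{E}_{Y \sim p_\theta}\big[\, r(Y)\, \nabla_\theta \log p_\theta(Y) \,\big] \\
  &\approx -\frac{1}{N} \sum_{n=1}^{N} r\big(Y^{(n)}\big)\, \nabla_\theta \log p_\theta\big(Y^{(n)}\big),
  \qquad Y^{(n)} \sim p_\theta \\
  &\approx -\, r\big(Y^{s}\big)\, \nabla_\theta \log p_\theta\big(Y^{s}\big),
  \qquad Y^{s} \sim p_\theta
\end{align}
```

The reward itself is never differentiated; only the log-probability of the sampled caption is, which is what allows a non-differentiable metric to be used.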


The policy gradient can be generalized to compute the reward associated with an action value relative to a baseline. This baseline either encourages a word choice, if its reward exceeds the baseline, or discourages it, if its reward falls below the baseline. If the baseline is an arbitrary function that does not depend on the actions, then it does not change the expected gradient and, importantly, it reduces the variance of the gradient estimate. The final expression is given by:
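The claim that an action-independent baseline leaves the expected gradient unchanged can be checked numerically on a toy problem. The sketch below uses a single-step softmax policy over three actions with made-up logits and rewards (all names and numbers are illustrative assumptions) and compares the exact expected REINFORCE gradient with and without a baseline.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce_grad(probs, action, reward, baseline=0.0):
    """One-sample gradient estimate of the loss -E[reward]."""
    scale = -(reward - baseline)
    return [scale * g for g in grad_log_prob(probs, action)]

def expected_grad(probs, rewards, baseline=0.0):
    """Exact expectation of the one-sample estimate over all actions."""
    total = [0.0] * len(probs)
    for action, p in enumerate(probs):
        g = reinforce_grad(probs, action, rewards[action], baseline)
        total = [t + p * gi for t, gi in zip(total, g)]
    return total

logits = [0.2, -0.4, 1.0]   # toy policy parameters
rewards = [1.0, 0.0, 0.5]   # toy reward per action
probs = softmax(logits)

no_baseline = expected_grad(probs, rewards)
with_baseline = expected_grad(probs, rewards, baseline=0.6)
print(no_baseline)
print(with_baseline)
```

Both expectations agree up to floating-point error, while an individual sample now gets a positive or negative weight depending on whether its reward is above or below the baseline.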


3 Settings

Our decoder is a cGRU where each GRU has a hidden state size of 256. The word embedding matrix uses the small embedding size discussed in section 2.1. To create the image annotations used by our decoder, we used a ResNet-50 and extracted the features of size 2048 at the pool5 layer. As a regularization method, we apply dropout with a probability of 0.5 on the bottleneck, and we stop training early if the CIDER metric on the validation set does not improve for 10 epochs. All variants of our models are trained with the ADAM optimizer [kingma2014adam] and a mini-batch size of 256. We decode with a beam search of size 3. In the RL setting, the baseline is a linear projection.
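As an illustration of decoding with a beam of size 3, here is a minimal beam search over a toy next-word table. The table `TOY_MODEL`, the function names and the absence of length normalization are assumptions made for this sketch; in the real model, `step_fn` would run the decoder on the prefix and return the distribution of equation 4.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=3, max_len=10):
    """Keep the beam_size best partial captions at every time-step."""
    beams = [(0.0, [bos])]   # (log-probability, token sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((score + math.log(prob), seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)   # keep unfinished captions if max_len was hit
    return max(finished, key=lambda c: c[0])[1]

# Toy next-word distributions, indexed by the last generated token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.4, "cat": 0.3, "</s>": 0.3},
    "the": {"dog": 0.9, "</s>": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}

def toy_step(seq):
    return TOY_MODEL[seq[-1]]

best = beam_search(toy_step, "<s>", "</s>", beam_size=3)
greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1)
print(best)
print(greedy)
```

In this toy table, greedy decoding (a beam of size 1) commits to the locally best first word and ends with a less probable caption than the beam of size 3.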

We evaluate our models on MSCOCO [mscoco], the most popular benchmark for image captioning, which contains 82,783 training images and 40,504 validation images. There are 5 human-annotated descriptions per image. As the annotations of the official test set are not publicly available, we follow the settings of prior work (the "Karpathy splits"), which take 82,783 images for training, 5,000 for validation and 5,000 for testing. We apply the byte pair encoding algorithm to the training captions to convert space-separated tokens into subwords ([P16-1162], 5000 symbols), reducing our vocabulary size to 5066 English tokens. For the online evaluation, all images are used for training except for the validation set.

4 Results

Our models' performance is evaluated according to the following automated metrics: BLEU-4 [Papineni:2002:BMA:1073083.1073135], METEOR [Lavie:2007:MAM:1626355.1626389] and CIDER-D [Vedantam_2015_CVPR]. Results shown in table 1 use the cross-entropy (XE) loss (cfr. equation 5). Reinforcement learning optimization results are compared in table 3.

4.1 XE scores

Model                               B4     M      C      Wt. (M)  Att. feat. (K)  O.O.D. (M)  Epochs
This work
  cGRU                              0.302  0.258  1.018  2.46     2               -           9
Comparable work
  Adaptive [Lu2017Adaptive]         0.332  0.266  1.085  17.4     100             -           42
  Top-down [Anderson2017up-down]    0.334  0.261  1.054  25       204
  Boosting [yao2017boosting]        0.325  0.251  0.986  28.4     2               -           123
  Review [YangReview]               0.290  0.237  0.886  12.3     101             -           100
  SAT [pmlr-v37-xuc15]              0.250  0.230  -      18       100             -           -
O.O.D. work
  Top-down [Anderson2017up-down]    0.362  0.27   1.135  25       920             920
  T-G att [mun2017textguided]       0.326  0.257  1.024  12.8     200             14          -
  Semantic [You2016ImageCW]         0.304  0.243  -      5.5      2               3           -
  NT [conf/cvpr/KarpathyL15]        0.230  0.195  0.660  -        -               3           -
Table 1: Models optimized with cross-entropy loss only (cfr. equation 5), sorted by CIDER-D score. B4 = BLEU-4, M = METEOR, C = CIDER-D.
Legend: pool features, conv features, FC features, GloVe or word2vec embeddings, in-domain CNN finetuning, in-domain CNN, out-of-domain CNN finetuning.

We sort the different works in table 1 by CIDER score. For each of them, we detail the trainable weights involved in the learning process (Wt.), the number of visual features used for the attention module (Att. feat.), the amount of out-of-domain data (O.O.D.) and the convergence speed (epoch).

As we see, our model has the third best METEOR and CIDER scores among comparable work. Yet our BLEU score is quite low, for which we postulate two potential causes: either our model does not have enough parameters to learn the correct precision for a set of n-grams as the metric would require, or it is a direct drawback of using subwords. Nevertheless, the CIDER and METEOR metrics show that the main concepts are present in our captions. Our models are also the lightest with regard to the number of trainable parameters and attention features. As far as convergence in epochs was reported in previous works, our cGRU model is by far the fastest to train.

Table 2 reports the online evaluation on the official MSCOCO 2014 test split. The scores of our model come from an ensemble of 5 runs with different initializations.

Model                  B4 (c5)  M (c5)  C (c5)
cGRU                   0.326    0.253   0.973
Comparable work
  [Lu2017Adaptive]     0.336    0.264   1.042
  [yao2017boosting]    0.330    0.256   0.984
  [YangReview]         0.313    0.256   0.965
  [You2016ImageCW]     0.316    0.250   0.943
  [Wu_2016_CVPR]       0.306    0.246   0.911
  [pmlr-v37-xuc15]     0.277    0.251   0.865
Table 2: Published image captioning results on the online MSCOCO test server, ranked by CIDER score.

We see that our model suffers a minor setback on this test set, especially in terms of CIDER score, whilst the adaptive [Lu2017Adaptive] and boosting [yao2017boosting] methods yield stable results on both test sets.

4.2 RL scores

Table 3 compares the different papers using direct metric optimization. [Rennie2017SelfCriticalST] used the SCST method, the most effective one according to the metric boosts (+23, +3 and +123 points respectively) but also the most sophisticated. [Liu2017ImprovedIC] used an approach similar to ours (MIXER) but with Monte-Carlo roll-outs (i.e. sampling until the end of the caption at every time-step t). Without using this technique, two of our metric improvements (METEOR and CIDER) surpass those of the MC roll-outs variant (+0 against -2 and +63 against +48 respectively).

                                 B4            M             C
[Rennie2017SelfCriticalST]
  XE                             0.296         0.252         0.940
  RL-SCST                        0.319 (+23)   0.255 (+3)    1.063 (+123)
[Liu2017ImprovedIC]
  XE                             0.294         0.251         0.947
  RL-PG                          0.333 (+39)   0.249 (-2)    0.995 (+48)
This work
  XE                             0.302         0.258         1.018
  RL-PG                          0.315 (+13)   0.258 (+0)    1.071 (+63)
Table 3: Scores before (XE) and after (RL) direct metric optimization; all optimizations are on the CIDER metric. Values in parentheses are the gains over XE, in thousandths.
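The gains quoted in points are simply the RL score minus the XE score, expressed in thousandths. The short sketch below recomputes them from the scores listed in table 3 (the dictionary layout and labels are illustrative). Note that, from the listed CIDER scores of this work (1.018 and 1.071), the computed gain is +53 rather than the reported +63, so one of these figures may contain a typo; the comparison with the +48 of the MC roll-outs variant holds either way.

```python
# (XE scores, RL scores) as (BLEU-4, METEOR, CIDER), copied from table 3.
scores = {
    "Rennie et al. (RL-SCST)": ((0.296, 0.252, 0.940), (0.319, 0.255, 1.063)),
    "Liu et al. (RL-PG)": ((0.294, 0.251, 0.947), (0.333, 0.249, 0.995)),
    "This work (RL-PG)": ((0.302, 0.258, 1.018), (0.315, 0.258, 1.071)),
}

def gains_in_points(xe_row, rl_row):
    """Difference RL - XE for each metric, in thousandths."""
    return tuple(round((rl - xe) * 1000) for xe, rl in zip(xe_row, rl_row))

deltas = {name: gains_in_points(xe, rl) for name, (xe, rl) in scores.items()}
for name, delta in deltas.items():
    print(f"{name:26s} B4 {delta[0]:+d}  M {delta[1]:+d}  C {delta[2]:+d}")
```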

4.3 Scalability

An interesting investigation is to give the architecture more parameters and see how it scales. We showed that our model performs well with few parameters, but we would also like to show that it can be used as a base for more complex future research.

We propose two variants to do so:

  • cGRUx2  The first intuition is to double the width of the model, i.e. the embedding size and the hidden state size. Unfortunately, this setup is not ideal with a deep-GRU because the recurrent matrices of the bottleneck GRU (see the GRU equations of section 2.2) get large. We can still use the classic bottleneck function (equation 9).

  • MHA  We trade our attention model described in section 2.3 for a standard multi-head attention (MHA) to see how convolutional features could improve our CIDER-D metrics. Multi-head attention [NIPS2017_7181] computes a weighted sum of some values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. This process is repeated multiple times. The compatibility function is given by:

    where the query is the intermediate hidden state proposal of the cGRU, and the keys and values are a set of 196 vectors of dimension 1024 (from the res4f_relu layer of a ResNet-50 CNN). The authors of [NIPS2017_7181] found it beneficial to linearly project the queries, keys and values several times (once per head) with different learned linear projections to a smaller dimension. The output of the multi-head attention is the concatenation of the per-head outputs, linearly projected back to the original dimension. The multi-head attention model adds 0.92M parameters to the base model and 2.63M parameters in the case of cGRUx2.
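The compatibility function of [NIPS2017_7181] is the scaled dot-product, softmax(Q Kᵀ / √d_k) V. A minimal single-query, single-head sketch in plain Python is given below; the toy vectors are made up, and the learned per-head projections and the final output projection are left out, so this is an illustration of the weighting step only.

```python
import math

def scaled_dot_product_attention(query, keys, values):
    """Weighted sum of values; weights come from query-key dot products."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    # Numerically stable softmax over the key positions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    return context, weights

# Toy example: one query attending over three positions.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context, weights = scaled_dot_product_attention(query, keys, values)
print(weights)
print(context)
```

In the MHA variant, the three toy positions would be replaced by the 196 convolutional feature vectors, and the computation would be repeated once per head on differently projected queries, keys and values before concatenation.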

Figure 2: The figure shows a new set of results on the online MSCOCO test server and tries to put those in perspective

We have hence proposed an image captioning architecture that, compared to previous work, covers a different area in the performance-complexity trade-off plane. We hope these results will be of interest and will fuel more research in this direction.

5 Related work

As mentioned in the introduction, our model is largely inspired by the work carried out in NMT and MNMT. Components such as attention models (multi-head, encoder-based and pooled attention) [W17-4746, DBLP:journals/corr/abs-1712-03449, NIPS2017_7181], reinforcement learning in NMT and MNMT [P16-1159, emnlpjb], and embeddings [P16-1162, Delbrouck2017visually] are well investigated.

In captioning, [Anderson2017up-down] used a very similar approach where two LSTMs build and encode the visual features. Two other works, [yao2017boosting] and [You2016ImageCW], used pooled features as described in this paper. However, they both used an additional vector taken from the fully connected layer of a CNN.

6 Conclusion

We presented a novel and light architecture composed of a cGRU that showed interesting performance. The model builds and encodes the context vector from pooled features in an efficient manner. The attention model presented in section 2.3 is very straightforward and seems to bring the visual information necessary to output complete captions. Also, we empirically showed that the model can easily scale with more sought-after modules or simply with more parameters. In the future, it would be interesting to use different attention features, like VGG or GoogLeNet (that have only 1024 dimensions), or different attention models to see how far this architecture can get.

7 Acknowledgements

This work was partly supported by the Chist-Era project IGLU with contribution from the Belgian Fonds de la Recherche Scientifique (FNRS), contract no. R.50.11.15.F, and by the FSO project VCYCLE with contribution from the Belgian Walloon Region, contract no. 1510501.

We also thank the authors of nmtpytorch [nmtpy2017], which we used as the framework for our experiments. Our code is made available for further research.