1 Introduction
Problems combining vision and natural language processing, such as image captioning [DBLP:journals/corr/ChenFLVGDZ15], are viewed as extremely challenging tasks. Image captioning requires grasping and expressing low- to high-level aspects of local and global areas in an image, as well as their relationships, and it has continued to inspire considerable research over the years. Visual attention-based neural decoder models [pmlrv37xuc15, conf/cvpr/KarpathyL15] have shown great success and are now widely adopted for the Neural Image Captioning (NIC) task. These recent advances are inspired by the neural encoder-decoder framework [SutskeverVL14, bahdanau+al2014nmt], also called the sequence-to-sequence (seq2seq) model, used for Neural Machine Translation (NMT). In that approach, Recurrent Neural Networks (RNN) [conf/interspeech/MikolovKBCK10] map a source sequence of words (encoder) to a target sequence (decoder). An attention mechanism is learned to focus on different parts of the source sentence while decoding. The same mechanism applies to a visual input: the attention module learns to attend to the salient parts of an image while decoding the caption.
These two fields, NIC and NMT, led to a Multimodal Neural Machine Translation (MNMT) task [speciaEtAl:2016:WMT], where the sentence to be translated is supported by information from an image. Interestingly, NIC and MNMT share a very similar decoder: both are required to generate a meaningful natural language description or translation with the help of a visual input. However, the tasks differ in the amount of annotated data: MNMT has 19 times fewer unique training examples, reducing the number of learnable parameters and the potential complexity of a model. Yet, over the years, the challenge has brought up very clever and elegant ideas that could be transferred to the NIC task. The aim of this paper is to propose such an architecture for NIC in a straightforward manner. Indeed, our proposed models work with less data, fewer parameters and less computation time. More precisely, this paper intends to:

Work only with in-domain data. No additional data besides the proposed captioning datasets is involved in the learning process;

Lighten the training data used, i.e. the visual and linguistic inputs of the model, as much as possible;

Propose a subjectively light and straightforward yet efficient NIC architecture with high training speed.
2 Captioning Model
As briefly mentioned in section 1, a neural captioning model is an RNN decoder [bahdanau+al2014nmt] that uses an attention mechanism over an image to generate one word of the caption at each timestep t. The following equations depict what a baseline timestep looks like [pmlrv37xuc15]:
(1)  x_t = W \, E[y_{t-1}]
(2)  c_t = f_{att}(V, h_{t-1})
(3)  h_t = \mathrm{RNN}([x_t; c_t], h_{t-1})
(4)  p(y_t \mid y_{<t}, V) = \mathrm{softmax}(P h_t)
where equation 1 maps the previous embedded word E[y_{t-1}] to the RNN hidden state size with matrix W, equation 2 is the attention module f_att over the image features V, equation 3 is the RNN cell computation, and equation 4 is the probability distribution over the vocabulary (matrix P is also called the projection matrix). If we denote the model parameters by θ, then θ is learned by maximizing the likelihood of the observed sequence, in other words by minimizing the cross-entropy loss. The objective function is given by:
(5)  L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, V)
The paper is structured so that each section tackles one equation (i.e. one main component of the captioning model) in the following manner: section 2.1 for equation 1 (embeddings), section 2.2 for equation 3 (the recurrent cell), section 2.3 for equation 2 (the attention module f_att), section 2.4 for equation 4 (projection) and section 2.5 for equation 5 (objective function).
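As a point of reference, a single baseline decoding timestep (equations 1 to 4) can be sketched with toy components. All sizes below, as well as the simplified multiplicative attention and the vanilla RNN cell, are illustrative assumptions rather than the paper's actual modules:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's actual choices appear in later sections.
d_e, d_h, d_v, vocab = 64, 128, 256, 100

E = rng.normal(size=(vocab, d_e)) * 0.01       # embedding matrix
W = rng.normal(size=(d_h, d_e)) * 0.01         # eq. 1: embedding -> hidden size
W_att = rng.normal(size=(d_h, d_v)) * 0.01     # eq. 2: toy attention weights
W_h = rng.normal(size=(d_h, 2 * d_h)) * 0.01   # eq. 3: toy RNN cell weights
P = rng.normal(size=(vocab, d_h)) * 0.01       # eq. 4: projection matrix

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(y_prev, h_prev, V_img):
    x_t = W @ E[y_prev]                              # eq. 1: map the previous word
    c_t = np.tanh(W_att @ V_img) * h_prev            # eq. 2: toy attention context
    h_t = np.tanh(W_h @ np.concatenate([x_t, c_t]))  # eq. 3: vanilla RNN stand-in
    p_t = softmax(P @ h_t)                           # eq. 4: distribution over words
    return h_t, p_t

h, p = decode_step(0, np.zeros(d_h), rng.normal(size=d_v))
```

At generation time, this step is run repeatedly, feeding the sampled (or argmax) word back in as `y_prev`.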
2.1 Embeddings
The total size of the embedding matrix E depends on the vocabulary size |V| and the embedding dimension d_e, such that E ∈ R^{|V| × d_e}. The mapping matrix W also depends on the embedding dimension, because W ∈ R^{d_h × d_e}.
Many previous works [conf/cvpr/KarpathyL15, You2016ImageCW, yao2017boosting, Anderson2017updown] use pretrained embeddings such as GloVe and word2vec, or one-hot vectors. Both word2vec and GloVe provide distributed representations of words. These models are pretrained on 30 and 42 billion words respectively [NIPS2013_5021, pennington2014glove], weigh several gigabytes and work with a fixed embedding size. For our experiments, each word is a column index in an embedding matrix E learned along with the model and initialized from a random distribution. Whilst the usual allocation is 512 to 1024 dimensions per embedding [pmlrv37xuc15, Lu2017Adaptive, mun2017textguided, Rennie2017SelfCriticalST], we show that a much smaller embedding size is sufficient to learn a strong vocabulary representation. A jointly-learned embedding matrix also tackles the high-dimensionality and sparsity problems of one-hot vectors. For example, [Anderson2017updown] work with a vocabulary of 10,010 words and a hidden size of 1000. As a result, the mapping matrix of equation 1 has 10 million parameters.
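The arithmetic behind these figures is straightforward. The smaller configuration below uses the 5,066-subword vocabulary reported later in this paper, while the 128-dimensional embedding is only an illustrative assumption:

```python
# Parameter count of the mapping matrix of eq. 1 for the cited setup
# ([Anderson2017updown]: vocabulary 10,010, hidden size 1000), versus a
# smaller hypothetical configuration in the spirit of this paper.
vocab_big, hidden_big = 10_010, 1000
params_big = vocab_big * hidden_big      # 10,010,000: the "10 million" in the text

vocab_small, d_e_small = 5_066, 128      # BPE vocabulary; d_e = 128 is illustrative
params_small = vocab_small * d_e_small   # 648,448: over 15x fewer parameters
```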
Working with a small vocabulary, besides reducing the size of the embedding matrix E, presents two major advantages: it lightens the projection module (as explained further in section 2.4) and reduces the action space in a Reinforcement Learning setup (detailed in section 2.5). To reduce our vocabulary size (by roughly 50%), we use the byte pair encoding (BPE) algorithm on the train set to convert space-separated tokens into subwords [P161162]. Originally applied to NMT, BPE is based on the intuition that various word classes, such as compounds and loanwords, are made of units smaller than words. In addition to making the vocabulary smaller and the sentences shorter, the subword model is able to productively generate new words that were not seen at training time.

2.2 Conditional GRU
Most previous research in captioning [conf/cvpr/KarpathyL15, You2016ImageCW, Rennie2017SelfCriticalST, mun2017textguided, Lu2017Adaptive, yao2017boosting] used an LSTM [Hochreiter:1997:LSM:1246443.1246450] as the recurrent cell. Our recurrent model is a pair of two Gated Recurrent Units (GRU [choalemnlp14]), called a conditional GRU (cGRU), as previously investigated in NMT (https://github.com/nyu-dl/dl4mt-tutorial/blob/master/docs/cgru.pdf). A GRU is a lighter gating mechanism than the LSTM, since it doesn't use a forget gate, and it led to similar results in our experiments. The cGRU also addresses an encoding problem of the attention mechanism. As shown in equation 2, the context vector c_t takes the previous hidden state h_{t-1} as input, which is information from outside the current timestep. This could be tackled by using the current hidden state h_t, but then the context vector c_t would no longer be an input of the RNN cell. A conditional GRU is an efficient way to both build and encode the result of the attention module.
Mathematically, a first independent GRU encodes an intermediate hidden state proposal h'_t based on the previous hidden state h_{t-1} and input x_t at each timestep t:

(6)  h'_t = \mathrm{GRU}_1(x_t, h_{t-1})

Then, the attention mechanism computes the context vector c_t over the image V using the intermediate hidden state proposal h'_t, similar to equation 2:

c_t = f_{att}(V, h'_t)

Finally, a second independent GRU computes the hidden state h_t of the cGRU by looking at the intermediate representation h'_t and the context vector c_t:

(7)  h_t = \mathrm{GRU}_2(c_t, h'_t)
We see that both problems are addressed: the context vector c_t is computed from the intermediate representation h'_t of the current timestep, and the final hidden state h_t is computed from the context vector c_t. Again, while the size of the hidden state in the literature varies between 512 and 1024, we pick d_h = 256.
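A cGRU timestep can be sketched as follows. The GRU gate equations are standard; the multiplicative attention form and all sizes other than d_h = 256 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_e, d_v = 256, 128, 64   # d_h = 256 as in the text; d_e and d_v illustrative

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_params(d_in, d_h, rng):
    # one (input, recurrent) weight pair per gate: update z, reset r, candidate n
    return {g: (rng.normal(size=(d_h, d_in)) * 0.05,
                rng.normal(size=(d_h, d_h)) * 0.05)
            for g in ("z", "r", "n")}

def gru_step(p, x, h):
    z = sigmoid(p["z"][0] @ x + p["z"][1] @ h)          # update gate
    r = sigmoid(p["r"][0] @ x + p["r"][1] @ h)          # reset gate
    n = np.tanh(p["n"][0] @ x + p["n"][1] @ (r * h))    # candidate state
    return (1 - z) * h + z * n

gru1 = gru_params(d_e, d_h, rng)           # first GRU (eq. 6)
gru2 = gru_params(d_h, d_h, rng)           # second GRU (eq. 7)
W_v = rng.normal(size=(d_h, d_v)) * 0.05   # toy attention weights

def cgru_step(x_t, h_prev, V_img):
    h_prop = gru_step(gru1, x_t, h_prev)   # eq. 6: intermediate proposal h'_t
    c_t = np.tanh(W_v @ V_img) * h_prop    # attention computed from h'_t, not h_{t-1}
    h_t = gru_step(gru2, c_t, h_prop)      # eq. 7: second GRU conditioned on c_t
    return h_t

h = cgru_step(rng.normal(size=d_e), np.zeros(d_h), rng.normal(size=d_v))
```

The key structural point is visible in `cgru_step`: the context vector depends on the current timestep (through `h_prop`) and still feeds the cell that produces the final state.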
The most similar approach to ours is the Top-Down Attention of [Anderson2017updown], which encodes the context vector the same way, but with LSTMs and a different hidden state layout.
2.3 Attention model
Since the image is the only input to a captioning model, the attention module is crucial, but it is also very diverse across different works. For example, [You2016ImageCW] use a semantic attention where, in addition to image features, they run a set of attribute detectors to get a list of visual attributes or concepts that are most likely to appear in the image. [Anderson2017updown] use the Visual Genome dataset to pretrain their bottom-up attention model. This dataset contains 47,000 images, out-of-domain with respect to the captioning dataset, densely annotated with scene graphs containing objects, attributes and relationships. [YangReview] propose a review network, an extension to the decoder: it performs a given number of review steps on the hidden states and outputs a compact vector representation available to the attention mechanism. Yet everyone seems to agree on using a Convolutional Neural Network (CNN) to extract features of the image. The trend is to select feature matrices at the convolutional layers, of size 14 × 14 × 1024 (ResNet [He2016DeepRL], res4f layer) or 14 × 14 × 512 (VGGNet [Simonyan2014VeryDC], conv5 layer). Other attributes can be extracted at the last fully connected layer of a CNN and have been shown to bring useful information [YangReview, yao2017boosting, You2016ImageCW]. Some models also fine-tune the CNN during training [YangReview, mun2017textguided, Lu2017Adaptive], stacking even more trainable parameters. Our attention model is guided by a unique vector V: the global 2048-dimensional visual representation of the image, extracted at the pool5 layer of a ResNet-50. Our attention vector c_t is computed as:
(8)  c_t = \tanh(W_v V) \odot h'_t
Recall that, following the cGRU presented in section 2.2, we work with h'_t and not h_{t-1}. Even though pooled features carry less information than convolutional features (50 to 100 times less), pooled features have shown great success in combination with a cGRU in MNMT [W174746]. Hence, our attention model is only the single matrix W_v.
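The "50 to 100 times" figure follows directly from the feature shapes mentioned above:

```python
# Information ratio between convolutional and pooled features mentioned in the text.
conv = 196 * 1024    # 14x14 grid of 1024-d vectors (ResNet res4f-style features)
vgg = 196 * 512      # 14x14 grid of 512-d vectors (VGG conv5-style features)
pooled = 2048        # single pool5 vector of a ResNet-50

ratio_conv = conv // pooled   # 98: roughly the "100 times less" end of the range
ratio_vgg = vgg // pooled     # 49: roughly the "50 times less" end of the range
```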
2.4 Projection
The projection also accounts for a lot of trainable parameters in the captioning model, especially if the vocabulary is large. Indeed, in equation 4 the projection matrix P is of size |V| × d_h. To reduce the number of parameters, we use a bottleneck function:
(9)  b_t = \tanh(W_b h_t)
(10)  p(y_t \mid y_{<t}, V) = \mathrm{softmax}(P b_t)
where W_b ∈ R^{d_b × d_h}, so that P is now of size |V| × d_b. Interestingly enough, if d_b = d_e (the embedding size), then P ∈ R^{|V| × d_e}. We can share the weights between the projection matrix and the embedding matrix (i.e. P = E^T) to further reduce the number of learned parameters. Moreover, doing so doesn't negatively impact the captioning results.
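The savings can be checked with a quick parameter count, using the 5,066-token vocabulary of section 3, d_h = 256, and an illustrative d_b = d_e = 128:

```python
vocab, d_h, d_e = 5_066, 256, 128   # vocabulary from section 3; d_e is illustrative

# Direct projection: P maps the hidden state straight to vocabulary logits.
direct = vocab * d_h                     # 1,296,896 parameters

# Bottleneck (eqs. 9-10): W_b compresses h_t to d_b = d_e, then P is vocab x d_e.
bottleneck = d_h * d_e + vocab * d_e     # 681,216 parameters

# Tying P with the embedding matrix (P = E^T) removes the vocab x d_e block,
# since E is already counted elsewhere in the model; only W_b remains.
tied = d_h * d_e                         # 32,768 parameters
```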
We push our projection further and use a deepGRU, originally used in MNMT [delbrouck2018umons], so that our bottleneck function is now a third GRU, analogous to the equations of section 2.2:
(11)  b_t = \mathrm{GRU}_3(c_t, h_t)
Because we work with small dimensions, adding a new GRU block on top barely increases the model size.
2.5 Objective function
To directly optimize an automated metric, we can cast caption generation as a Reinforcement Learning (RL) problem. The decoder is viewed as an agent that interacts with an environment composed of words and image features. The agent interacts with the environment by taking actions, which are predictions of the next word of the caption. An action is the result of the policy p_θ, where θ are the parameters of the network. Whilst very effective at boosting automatic metric scores, porting the captioning problem into an RL setup significantly reduces the training speed.
[DBLP:journals/corr/RanzatoCAZ15] proposed a method (MIXER) based on REINFORCE, combined with a baseline reward estimator. However, they implicitly assume each intermediate action (word) in a partial sequence has the same reward as the sequence-level reward, which is not true in general. To compensate for this, they introduce a form of training that mixes the MLE objective and the REINFORCE objective. [Liu2017ImprovedIC] also address the delayed reward problem by estimating at each timestep the future rewards based on Monte Carlo rollouts. [Rennie2017SelfCriticalST] utilize the output of their own test-time inference model to normalize the rewards they experience. Only samples from the model that outperform the current test-time system are given positive weight.
To keep it simple, and because our reduced vocabulary allows us to do so, we follow the work of [DBLP:journals/corr/RanzatoCAZ15] and use the naive variant of the policy gradient with REINFORCE. The loss function of equation 5 is now given by:

(12)  L_{RL}(\theta) = -\mathbb{E}_{w^s \sim p_\theta}[r(w^s)]

where r(w^s) is the reward (here, the score given by an automatic metric scorer) of the sampled caption w^s.
We use the REINFORCE algorithm, based on the observation that the expected gradient of a non-differentiable reward function can be computed as follows:

(13)  \nabla_\theta L_{RL}(\theta) = -\mathbb{E}_{w^s \sim p_\theta}[r(w^s) \nabla_\theta \log p_\theta(w^s)]

The expected gradient can be approximated using a Monte-Carlo sample for each training example in the batch:

(14)  \nabla_\theta L_{RL}(\theta) \approx -\frac{1}{N} \sum_{i=1}^{N} r(w^{s_i}) \nabla_\theta \log p_\theta(w^{s_i})

In practice, we can approximate the expectation with a single sample:

(15)  \nabla_\theta L_{RL}(\theta) \approx -r(w^s) \nabla_\theta \log p_\theta(w^s)

The policy gradient can be generalized to compute the reward associated with an action relative to a baseline b. The baseline either encourages a word choice (if r > b) or discourages it (if r < b). As long as the baseline is an arbitrary function that does not depend on the actions w^s, it does not change the expected gradient but, importantly, reduces the variance of the gradient estimate. The final expression is given by:

(16)  \nabla_\theta L_{RL}(\theta) \approx -(r(w^s) - b) \nabla_\theta \log p_\theta(w^s)
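Equation 16 can be exercised on a toy one-step "policy" whose logits are the parameters themselves. Everything here (five actions, the reward function, the baseline value) is an illustrative assumption, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy one-step policy: the logits ARE the parameters theta; an action is a word index.
theta = np.zeros(5)

def reinforce_grad(theta, reward_fn, baseline=0.0):
    p = softmax(theta)
    a = rng.choice(len(theta), p=p)   # Monte-Carlo sample w^s ~ p_theta (eq. 15)
    grad_logp = -p                    # grad of log p_theta(a) wrt logits ...
    grad_logp[a] += 1.0               # ... is one_hot(a) - p for a softmax policy
    return (reward_fn(a) - baseline) * grad_logp   # eq. 16: advantage-scaled gradient

# Pretend the metric scorer rewards action 3 only.
reward = lambda a: 1.0 if a == 3 else 0.0

lr = 0.5
for _ in range(2000):
    theta += lr * reinforce_grad(theta, reward, baseline=0.2)  # ascent on the reward
```

After training, the policy concentrates its probability mass on the rewarded action, even though the reward itself was never differentiated.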
3 Settings
Our decoder is a cGRU where each GRU is of size d_h = 256. The word embedding matrix E allocates d_e features per word. To create the image annotations used by our decoder, we use a ResNet-50 and extract the 2048-dimensional features at the pool5 layer. As a regularization method, we apply dropout with a probability of 0.5 on the bottleneck b_t, and we early-stop the training if the validation CIDEr metric does not improve for 10 epochs. All variants of our models are trained with the ADAM optimizer [kingma2014adam] and a minibatch size of 256. We decode with a beam search of size 3. In the RL setting, the baseline b is a linear projection of the hidden state. We evaluate our models on MSCOCO [mscoco], the most popular benchmark for image captioning, which contains 82,783 training images and 40,504 validation images. There are 5 human-annotated descriptions per image. As the annotations of the official testing set are not publicly available, we follow the settings of prior work (the "Karpathy splits", https://github.com/karpathy/neuraltalk2/tree/master/coco), which take 82,783 images for training, 5,000 for validation and 5,000 for testing. On the training captions, we use the byte pair encoding algorithm to convert space-separated tokens into subwords [P161162] (5,000 merge symbols), reducing our vocabulary to 5,066 English tokens. For the online evaluation, all images are used for training except those of the validation set.
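The decoding procedure can be illustrated with a minimal beam search, here with the paper's beam size of 3 over a deterministic toy next-token model (the model and token ids are assumptions for illustration only):

```python
import numpy as np

def beam_search(step_logprobs, beam_size=3, max_len=4, eos=0):
    """Minimal beam search: `step_logprobs(seq)` returns next-token log-probs."""
    beams = [([], 0.0)]   # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))   # finished hypothesis kept as-is
                continue
            lp = step_logprobs(seq)
            for tok in np.argsort(lp)[-beam_size:]:   # top-k next tokens
                candidates.append((seq + [int(tok)], score + lp[tok]))
        # keep only the `beam_size` highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

# A deterministic toy model that prefers token 2 twice, then EOS (token 0).
def toy_model(seq):
    lp = np.log(np.full(5, 0.025))
    lp[2 if len(seq) < 2 else 0] = np.log(0.9)
    return lp

best = beam_search(toy_model, beam_size=3)
```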
4 Results
The performance of our models is evaluated with the following automated metrics: BLEU-4 [Papineni:2002:BMA:1073083.1073135], METEOR [Lavie:2007:MAM:1626355.1626389] and CIDEr-D [Vedantam_2015_CVPR].
Results shown in table 1 use the cross-entropy (XE) loss (cf. equation 5). Reinforcement learning optimization results are compared in table 3.
4.1 XE scores
Model | B4 | M | C | Wt. (M) | Att. feat. (K) | O.O.D. (M) | epoch
This work
cGRU | 0.302 | 0.258 | 1.018 | 2.46 | 2 | - | 9
Comparable work
Adaptive [Lu2017Adaptive] | 0.332 | 0.266 | 1.085 | 17.4 | 100 | - | 42
Top-down [Anderson2017updown] | 0.334 | 0.261 | 1.054 | 25 | 204 | - | -
Boosting [yao2017boosting] | 0.325 | 0.251 | 0.986 | 28.4 | 2 | - | 123
Review [YangReview] | 0.290 | 0.237 | 0.886 | 12.3 | 101 | - | 100
SAT [pmlrv37xuc15] | 0.250 | 0.230 | - | 18 | 100 | - | -
O.O.D. work
Top-down [Anderson2017updown] | 0.362 | 0.27 | 1.135 | 25 | 920 | 920 | -
TG att [mun2017textguided] | 0.326 | 0.257 | 1.024 | 12.8 | 200 | 14 | -
Semantic [You2016ImageCW] | 0.304 | 0.243 | - | 5.5 | 2 | 3 | -
NT [conf/cvpr/KarpathyL15] | 0.230 | 0.195 | 0.660 | - | - | 3 | -
Legend: pooled features, convolutional features, FC features, GloVe or word2vec embeddings, in-domain CNN fine-tuning, in-domain CNN, O.O.D. CNN fine-tuning.
We sort the different works in table 1 by CIDEr score. For each of them, we detail the trainable weights involved in the learning process (Wt.), the number of visual features used by the attention module (Att. feat.), the amount of out-of-domain data (O.O.D.) and the convergence speed (epoch).
As we can see, our model has the third best METEOR and CIDEr scores across the board. Our BLEU metric, however, is quite low; we postulate two potential causes. Either our model does not have enough parameters to learn the precision over n-grams that the metric requires, or it is a direct drawback of using subwords. Nevertheless, the CIDEr and METEOR metrics show that the main concepts are present in our captions. Our models are also the lightest in terms of trainable parameters and number of attention features. As far as convergence in epochs was reported in previous works, our cGRU model is by far the fastest to train.
Table 2 concerns the online evaluation on the official MSCOCO 2014 test set. The scores of our model are from an ensemble of 5 runs with different initializations.
Model | B4 (c5) | M (c5) | C (c5)
cGRU (ours) | 0.326 | 0.253 | 0.973
Comparable work
Adaptive [Lu2017Adaptive] | 0.336 | 0.264 | 1.042
Boosting [yao2017boosting] | 0.330 | 0.256 | 0.984
Review [YangReview] | 0.313 | 0.256 | 0.965
Semantic [You2016ImageCW] | 0.316 | 0.250 | 0.943
[Wu_2016_CVPR] | 0.306 | 0.246 | 0.911
SAT [pmlrv37xuc15] | 0.277 | 0.251 | 0.865
We see that our model suffers a minor setback on this test set, especially in terms of CIDEr score, whilst the adaptive [Lu2017Adaptive] and boosting [yao2017boosting] methods yield stable results on both test sets.
4.2 RL scores
Table 3 depicts the different papers using direct metric optimization. [Rennie2017SelfCriticalST] used the SCST method, the most effective one according to the metric boosts (+23, +3 and +123 points respectively) but also the most sophisticated. [Liu2017ImprovedIC] used an approach similar to ours (MIXER), but with Monte Carlo rollouts (i.e. sampling until the end of the sequence at every timestep t). Without using this technique, two of our metric improvements (METEOR and CIDEr) surpass the MC-rollouts variant (+0 against -2 and +63 against +48, respectively).
Model | B4 | M | C
Rennie et al. [Rennie2017SelfCriticalST]
XE | 0.296 | 0.252 | 0.940
RL-SCST | 0.319 (+23) | 0.255 (+3) | 1.063 (+123)
Liu et al. [Liu2017ImprovedIC]
XE | 0.294 | 0.251 | 0.947
RL-PG | 0.333 (+39) | 0.249 (-2) | 0.995 (+48)
Ours
XE | 0.302 | 0.258 | 1.018
RL-PG | 0.315 (+13) | 0.258 (+0) | 1.071 (+63)
4.3 Scalability
An interesting investigation is to scale the architecture with more parameters. We showed that our model performs well with few parameters, but we would also like to show that it can serve as a base for more complex future research.
We propose two variants to effectively do so:

cGRUx2 The first intuition is to double the width of the model, i.e. the embedding size d_e and the hidden state size d_h. Unfortunately, this setup is not ideal with a deepGRU, because the recurrent matrices of the bottleneck GRU of equation 11 get large. We can still use the classic bottleneck function (equation 9).

MHA We trade the attention model described in section 2.3 for a standard multi-head attention (MHA), to see how convolutional features could improve our CIDEr-D metric. Multi-head attention [NIPS2017_7181] computes a weighted sum of some values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. This process is repeated multiple times. The compatibility function is given by:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d_k}) V

where the query is the hidden state, and the keys and values are a set of 196 vectors of dimension 1024 (from the res4f_relu layer of a ResNet-50 CNN). The authors found it beneficial to linearly project the queries, keys and values h times with different learned linear projections to dimension d_k. The output of the multi-head attention is the concatenation of the h attended values, linearly projected back to the model dimension. The multi-head attention model adds 0.92M parameters in the base setting and 2.63M parameters in the case of cGRUx2.
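A sketch of multi-head attention for a single query vector; h = 4 heads and d_k = 64 are illustrative assumptions, while the 196 keys/values of dimension 1024 match the res4f_relu features mentioned above:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, K, V, n_heads, d_k, Wq, Wk, Wv, Wo):
    """Multi-head scaled dot-product attention for a single query vector."""
    outs = []
    for i in range(n_heads):
        qh = Wq[i] @ q       # project the query to d_k
        Kh = K @ Wk[i].T     # project each key   -> (N, d_k)
        Vh = V @ Wv[i].T     # project each value -> (N, d_k)
        att = softmax(Kh @ qh / np.sqrt(d_k))   # the compatibility function
        outs.append(att @ Vh)                   # weighted sum of values
    return Wo @ np.concatenate(outs)            # concat heads, project back

# Illustrative sizes: 196 conv features of dim 1024, query = hidden state (256).
N, d_feat, d_h, n_heads, d_k = 196, 1024, 256, 4, 64
K = V = rng.normal(size=(N, d_feat))
Wq = rng.normal(size=(n_heads, d_k, d_h)) * 0.05
Wk = rng.normal(size=(n_heads, d_k, d_feat)) * 0.05
Wv = rng.normal(size=(n_heads, d_k, d_feat)) * 0.05
Wo = rng.normal(size=(d_h, n_heads * d_k)) * 0.05

c = multi_head_attention(rng.normal(size=d_h), K, V, n_heads, d_k, Wq, Wk, Wv, Wo)
```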
We have hence proposed an image captioning architecture that, compared to previous work, covers a different area of the performance-complexity trade-off plane. We hope it will be of interest and will fuel more research in this direction.
5 Related work
As mentioned in the introduction, our model is largely inspired by the work carried out in NMT and MNMT. Components such as attention models (multi-head, encoder-based and pooled attention) [W174746, DBLP:journals/corr/abs171203449, NIPS2017_7181], reinforcement learning in NMT and MNMT [P161159, emnlpjb], and embeddings [P161162, Delbrouck2017visually] are well investigated there.
In captioning, [Anderson2017updown] used a very similar approach, where two LSTMs build and encode the visual features. Two other works, [yao2017boosting] and [You2016ImageCW], used pooled features as described in this paper. However, they both used an additional vector taken from the fully connected layer of a CNN.
6 Conclusion
We presented a novel and light architecture built around a cGRU that shows interesting performance. The model builds and encodes the context vector from pooled features in an efficient manner. The attention model presented in section 2.3 is very straightforward, yet seems to bring the necessary visual information to output complete captions. We also empirically showed that the model can easily scale with more sought-after modules, or simply with more parameters. In the future, it would be interesting to use different attention features, like VGG or GoogLeNet (that have only 1024 dimensions), or different attention models, to see how far this architecture can get.
7 Acknowledgements
This work was partly supported by the Chist-Era project IGLU with contribution from the Belgian Fonds de la Recherche Scientifique (FNRS), contract no. R.50.11.15.F, and by the FSO project VCYCLE with contribution from the Belgian Walloon Region, contract no. 1510501.
We also thank the authors of nmtpytorch (https://github.com/lium-lst/nmtpytorch) [nmtpy2017], which we used as the framework for our experiments. Our code is made available for further research at https://github.com/jbdel/light_captioning.