Image captioning is a popular research area as it combines the domains of Computer Vision and Natural Language Processing. Generating captions automatically is a difficult problem: it involves not only detecting the objects present in an image, but also expressing the semantic relationships between those objects in natural language.
As the deep learning community matures, several approaches to image captioning have produced state-of-the-art models that generate sentences closer to natural language Xu et al. (2015); Vinyals et al. (2015). The different approaches usually follow a general encoder-decoder structure. The role of the encoder is to extract semantic information from the image, and it is typically implemented as a Convolutional Neural Network (CNN). The role of the decoder is to translate the encoded image features into natural language, and it is usually implemented as a Recurrent Neural Network (RNN) Xu et al. (2015). In particular, a special type of RNN called Long Short-Term Memory (LSTM) is popular because it handles long-term temporal dependencies better through the use of a memory cell. However, an RNN is sequential in the sense that it generates one word at a time, so training time grows with the length of the caption. As such, the Transformer architecture has recently been used in place of an RNN for image captioning Zhu et al. (2018) to avoid this sequential training problem.
Our key contribution in this paper is a sensitivity analysis of several hyperparameters for both LSTM-based and Transformer-based decoders using the Flickr8k dataset.
1.1 Related Work
After achieving considerable success in Neural Machine Translation tasks, the encoder-decoder framework has been used several times for image captioning Xu et al. (2015); Zhu et al. (2018); Vinyals et al. (2015). Vinyals et al. (2015) used a CNN as the encoder for the image and a Long Short-Term Memory (LSTM) RNN as the decoder; in particular, they fed the encoded image representation to the LSTM only at the first timestep. Xu et al. (2015) extended this approach by proposing a visual attention mechanism that attends to different parts of the image at every timestep during caption generation with the LSTM. Beyond recurrence-based decoders, Vaswani et al. (2017) established a new architecture for machine translation called the Transformer, which is based purely on attention mechanisms. They also showed that the Transformer is superior in quality while taking significantly less time to train, as it is parallelizable. Using the Transformer for image captioning has achieved state-of-the-art results, as shown by Zhu et al. (2018).
2 Methodology
In this section, we discuss the methodology used for our experiments. In particular, we used an encoder-decoder architecture for image captioning, where the encoder was a CNN and two different decoder networks were examined: an LSTM and a Transformer.
2.1 Flickr8k Dataset
The dataset used for experimentation is Flickr8k Hossain et al. (2018), which has 8,000 images in total. In particular, it is divided into 6,000 training images, 1,000 validation images, and 1,000 test images. Furthermore, each image is associated with five reference captions annotated by humans. As such, our training set consists of 30,000 samples, where each sample corresponds to one image and one caption.
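The pairing described above, each image with each of its five reference captions, can be sketched as follows (the file names and captions are hypothetical placeholders, not actual Flickr8k entries):

```python
# Expand an {image: [captions]} mapping into (image, caption) training pairs,
# so 6,000 images x 5 captions yields 30,000 samples as in Flickr8k.
def build_samples(captions_by_image):
    return [(img, cap)
            for img, caps in captions_by_image.items()
            for cap in caps]

# Hypothetical miniature of the dataset structure:
captions_by_image = {
    "img_0001.jpg": ["a dog runs", "a dog outside", "dog running",
                     "a brown dog", "dog in grass"],
    "img_0002.jpg": ["two kids play", "children playing", "kids on a slide",
                     "two children", "kids outdoors"],
}
samples = build_samples(captions_by_image)
assert len(samples) == 2 * 5  # one sample per (image, caption) pair
```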
2.2 Encoder
The encoder we used was a ResNet CNN. The CNN extracts image features referred to as annotation vectors; these vectors form the hidden states of the encoder on which the attention mechanism is performed. We experimented with three different ResNet models: ResNet18, ResNet50, and ResNet101, where the number following the model name indicates the number of layers. We removed the final pooling and softmax layers and extracted the features from the final convolutional layer. This yields an output of size N × 14 × 14, where the value of N depends on the encoder used. The spatial dimensions are then flattened to give 196 annotation vectors of dimension N, on which we perform attention.
2.3 LSTM Decoder
The LSTM network produces a caption by generating one word at every time-step. The output at a given time-step is conditioned on the current hidden state, a context vector obtained from the attention mechanism, and all the previous hidden states. The initial hidden and cell states of the LSTM are obtained by taking the average of the annotation vectors and passing it through separate MLPs. At a given time-step, we perform attention to obtain a context vector, which is then appended to the input word embedding. In particular, we follow the same soft attention training process described by Xu et al. (2015) for our experiments.
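One decoding step of this soft-attention LSTM can be sketched as below. The module structure and dimensions are our assumptions following Xu et al.'s formulation; the actual implementation details are not restated in the paper.

```python
import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    """One soft-attention LSTM decoding step (sketch, assumed dimensions)."""
    def __init__(self, feat_dim=2048, hid_dim=512, emb_dim=512, att_dim=512):
        super().__init__()
        self.att_feat = nn.Linear(feat_dim, att_dim)  # project annotation vectors
        self.att_hid = nn.Linear(hid_dim, att_dim)    # project previous hidden state
        self.att_out = nn.Linear(att_dim, 1)          # scalar attention score
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        # Initial hidden/cell states come from the mean annotation vector,
        # each through its own MLP.
        self.init_h = nn.Linear(feat_dim, hid_dim)
        self.init_c = nn.Linear(feat_dim, hid_dim)

    def forward(self, annotations, word_emb, state=None):
        # annotations: (B, 196, feat_dim); word_emb: (B, emb_dim)
        if state is None:
            mean_a = annotations.mean(dim=1)
            state = (torch.tanh(self.init_h(mean_a)),
                     torch.tanh(self.init_c(mean_a)))
        h, c = state
        # Soft attention: score every annotation vector against the hidden state.
        scores = self.att_out(torch.tanh(
            self.att_feat(annotations) + self.att_hid(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)          # (B, 196, 1) weights
        context = (alpha * annotations).sum(dim=1)    # (B, feat_dim) context vector
        # Context is appended to the word embedding before the LSTM update.
        h, c = self.lstm(torch.cat([word_emb, context], dim=1), (h, c))
        return h, c, alpha

step = AttentionStep()
h, c, alpha = step(torch.randn(2, 196, 2048), torch.randn(2, 512))
```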
2.4 Transformer Decoder
The Transformer model introduced by Vaswani et al. (2017) was a way forward for language modeling that did away with recurrence, relying purely on self-attention. The Transformer model is shown in Figure 1.
We use a CNN as the encoder and adopt only the decoder of the Transformer architecture. The Transformer relies on a computation known as scaled dot-product attention. The attention function is a mapping of queries (Q) and key-value pairs (K, V) to an output. The output is a weighted sum of the values, where the weight assigned to each value is based on a similarity function between the query and the corresponding key. This is shown in Figure 1(b). It is given by the following formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Here √d_k is a scaling factor that prevents the absolute value of the dot product from blowing up.
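The scaled dot-product attention described above can be written in a few lines of NumPy (a minimal sketch; the shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # query-key similarities, scaled
    # Row-wise softmax (subtracting the max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights       # weighted sum of values

Q = np.random.randn(4, 64)            # 4 queries of dimension d_k = 64
K = np.random.randn(6, 64)            # 6 key-value pairs
V = np.random.randn(6, 32)
out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (4, 32)           # one output vector per query
```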
The Transformer employs another strategy known as multi-head self-attention, which lets the model attend to information from different representation subspaces of the same input. The idea is to project the input vectors into different subspaces, apply the self-attention function in each subspace, concatenate the outputs, and use a linear layer to project the result back to the original dimensionality. This is shown in Figure 1(a). The formula is given by:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
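A NumPy sketch of multi-head attention follows (the head count and dimensions are illustrative; in practice the projections are learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Project into h subspaces, attend in each, concatenate, project back."""
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):            # one projection triple per head
        q, k, v = Q @ wq, K @ wk, V @ wv
        att = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        heads.append(att @ v)
    return np.concatenate(heads, axis=-1) @ Wo    # back to d_model dimensions

d_model, h, d_head = 64, 8, 8                     # d_head = d_model / h
Wq = rng.normal(size=(h, d_model, d_head))        # stand-ins for learned weights
Wk = rng.normal(size=(h, d_model, d_head))
Wv = rng.normal(size=(h, d_model, d_head))
Wo = rng.normal(size=(h * d_head, d_model))
X = rng.normal(size=(5, d_model))                 # 5 token vectors
out = multi_head_attention(X, X, X, Wq, Wk, Wv, Wo)
assert out.shape == (5, d_model)
```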
The decoder block can be broken into three main sub-layers. The first is a masked multi-head self-attention layer: a self-attention layer in which each position attends only to the positions that come before it, so that the model uses only past information to make judgments about the present and does not get a peek into the future. The second sub-layer is a multi-head attention layer over the encoder hidden states; this is where the representation of the image is fed into the decoder. Finally, there is a feed-forward layer which introduces non-linearity into the model.
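The causal mask used in the first sub-layer can be sketched as follows: positions that would look into the future get a score of -inf before the softmax, so their attention weights become exactly zero.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: entry (i, j) is -inf for j > i, else 0."""
    return np.triu(np.full((n, n), -np.inf), k=1)

# Uniform raw scores plus the mask: each token attends only to itself
# and earlier tokens.
scores = np.zeros((4, 4)) + causal_mask(4)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
assert np.allclose(weights[0], [1, 0, 0, 0])          # token 0 sees only itself
assert np.allclose(weights[3], [0.25, 0.25, 0.25, 0.25])  # token 3 sees all four
```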
Around each sub-layer there is a residual connection, which speeds up convergence and mitigates the vanishing gradient problem. There are also dropout layers after each sub-layer to prevent over-fitting. Multiple such decoder blocks are stacked on top of one another, and the outputs of the last block are passed through a softmax layer to obtain the output probabilities. Another regularization technique we used was label smoothing; with it, the model had a higher loss but the BLEU scores improved.
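Label smoothing can be sketched as below. The smoothing value epsilon is our choice for illustration (the paper does not restate it): the one-hot target is mixed with a uniform distribution over the rest of the vocabulary, which raises the measured loss but regularizes the model.

```python
import numpy as np

def smooth_labels(target_idx, vocab_size, eps=0.1):
    """Replace a one-hot target with a smoothed distribution."""
    dist = np.full(vocab_size, eps / (vocab_size - 1))  # share eps uniformly
    dist[target_idx] = 1.0 - eps                        # keep most mass on target
    return dist

dist = smooth_labels(target_idx=2, vocab_size=5, eps=0.1)
assert np.isclose(dist.sum(), 1.0)    # still a valid distribution
assert np.isclose(dist[2], 0.9)       # target probability reduced from 1.0
```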
3 Experimental Analysis
We performed several experiments to analyze the difference in using different encoder-decoder architectures.
3.1 ResNet + LSTM
Our baseline model was ResNet18 combined with an LSTM with a hidden vector size of 512. We do not fine-tune the encoder in the baseline. The word embeddings used in all LSTM experiments are of size 512. We conducted three different experiments: varying the encoder, fine-tuning the encoder, and varying the number of LSTM hidden units. All experiments were trained using the Adam optimizer with a learning rate of 0.0001 on an Nvidia GeForce GTX 1080 GPU. In addition, the termination of training was determined by early stopping to obtain the best possible BLEU-4 scores.
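The early-stopping rule on validation BLEU-4 can be sketched as follows (the patience value is our assumption; the paper does not state one):

```python
def early_stop_index(bleu4_per_epoch, patience=5):
    """Return the epoch whose checkpoint is kept: the best validation
    BLEU-4 score, stopping once it fails to improve for `patience` epochs."""
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(bleu4_per_epoch):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break                      # no improvement for `patience` epochs
    return best_epoch

# Hypothetical score trace: improvement at epoch 1, then a plateau.
assert early_stop_index([0.10, 0.15, 0.14, 0.13, 0.13, 0.12, 0.12]) == 1
```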
For the first experiment, we used the ResNet18, ResNet50, and ResNet101 models in order to examine the effect of richer image representations. It was hypothesized that using a larger CNN model would result in better caption generation. The summary of the experimental results is listed in Table 1. Moreover, Figure 3 shows that ResNet50 and ResNet101 perform much better than ResNet18, which validates our original hypothesis. One key observation is that the performance difference between ResNet50 and ResNet101 is minimal, and ResNet101 actually gave a lower CIDEr score. As such, there is an indication that increasing encoder capacity beyond some point may lead to saturation in caption quality.
Next, we decided to fine-tune all three encoders discussed above and examine the effect on caption quality. It was hypothesized that fine-tuning would increase captioning quality, since ResNet is trained on ImageNet while we are using Flickr8k. For each of the encoders, we compared the scores of the fine-tuned model with the corresponding scores without fine-tuning. The summary of the results is listed in Table 2. Moreover, Figure 4 shows that fine-tuning is beneficial for all encoders, as the fine-tuned models outperform the base models. One important observation is that the deeper models benefit more from fine-tuning, as can be clearly seen from the positive trend in CIDEr and METEOR scores. Additionally, we were able to outperform the metrics reported by Xu et al. (2015) through these experiments.
Lastly, we looked at the effect of varying the number of LSTM hidden units by experimenting with 256, 512, and 1024 units, keeping the encoder fixed. This experiment was repeated for all encoders: ResNet18, ResNet50, and ResNet101. It was hypothesized that increasing the number of hidden units would increase captioning quality, as the capacity to store information increases and the model should therefore capture long-term dependencies better. The results of this experiment are summarized in Table 3. From Figure 5, we can see that performance improves by a negligible amount as the number of units increases. We suspect this is due to over-fitting, as Flickr8k is a small dataset compared to MSCOCO and Flickr30k.
3.2 ResNet + Transformer
Our baseline model for the ResNet-Transformer architecture was a ResNet18 model with 3 Transformer layers, each using a single attention head. We conducted three experiments: varying the type of encoder model, varying the number of decoder layers, and varying the number of heads. Additionally, we analyzed the effect of fine-tuning the encoder on model performance. All experiments were trained using the Adam optimizer with a fixed learning rate of 0.00004 on a GeForce GTX 1080 Ti GPU. Termination of training was determined by early stopping to obtain the best possible BLEU-4 scores. The results are compiled in Tables 4-7.
For analyzing the different encoder models, we used ResNet18, ResNet50, and ResNet101. It was hypothesized that using a larger encoder model should improve caption generation, as the image representation would be better. Surprisingly, Figure 6 shows that the ResNet50 model performs best, whereas the ResNet101 model gives the worst performance. This could be because the dataset is very small: ResNet18 could be underfitting, whereas ResNet101 could be overfitting. The results are summarized in Table 4.
For analyzing the effect of the number of decoder layers and heads, we kept the encoder fixed as ResNet18. It was hypothesized that increasing the number of heads should improve caption quality, as the Transformer gets to attend to information from different representation subspaces. The results are compiled in Table 5. From Figure 7, we see that increasing the number of heads either had no effect or even reduced the performance of the model. It is possible that our learning rate is not tuned perfectly, but we suspect that increasing the number of heads causes the model to overfit, as our dataset is very small compared to bigger datasets like MSCOCO or Flickr30k. Increasing dropout or applying other regularization techniques might help curb this effect, but we have not experimented with this. Similarly, it was hypothesized that increasing the number of layers should improve caption generation, as the Transformer learns a better representation of each word as it gets deeper. From Figure 8, we see that changing the number of layers also showed no significant or consistent changes: at times the model performed better and at times worse. These results are compiled in Table 6.
Finally, the most notable difference comes from fine-tuning. It was hypothesized that fine-tuning the encoder would improve caption quality. The summary of the results is listed in Table 7, and from Figure 9 it is clear that the fine-tuned encoder models consistently outperform those that are not fine-tuned. This makes sense considering that the encoder models are trained on ImageNet, whereas our dataset is Flickr8k.
4 Conclusion
We have presented and compared two different architectures for image caption generation. A ResNet model is used as the image feature extractor; for decoding, we experimented with LSTMs and Transformers, and we performed a sensitivity analysis of various hyperparameters. Experiments show that fine-tuning the encoder model almost always improves the outcome of the decoder model. The LSTM models using ResNet50 and ResNet101 surpass our reference baseline, the Soft-Attention model, even without fine-tuning. This could be due to the higher representational capacity of the ResNet model compared to the VGG model used in the Soft-Attention model. All the other models surpass the baseline upon fine-tuning, with the only exception being ResNet50 with a Transformer decoder. We have also shown the effect of tuning other hyperparameters, such as the number of hidden units for LSTMs and the number of decoder layers and attention heads for Transformers. It was noticed that increasing the number of heads, layers, or hidden units does not always improve results and may even reduce output quality. We believe this is due to model over-fitting, as our dataset is very small. A natural continuation of this work would be to experiment with larger datasets such as Flickr30k or MSCOCO. Apart from that, it would also be interesting to experiment with changing the word embedding size used in the LSTM model, or to examine the effect of using pre-trained word embeddings such as GloVe vectors. Finally, we could experiment with more complicated Transformer architectures such as those by Zhu et al. (2018).
Acknowledgments
The authors would like to thank Jimmy Ba for his thoughtful instruction throughout the Neural Networks course as well as providing helpful insights during our project consultation meetings.
References
- Banerjee and Lavie  Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W05-0909.
- Hossain et al.  Md. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. CoRR, abs/1810.04020, 2018. URL http://arxiv.org/abs/1810.04020.
- Lin  Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W04-1013.
- Papineni et al.  Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://www.aclweb.org/anthology/P02-1040.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
- Vedantam et al.  Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. CoRR, abs/1411.5726, 2014. URL http://arxiv.org/abs/1411.5726.
- Vinyals et al.  Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164. IEEE Computer Society, 2015. ISBN 978-1-4673-6964-0. URL http://dblp.uni-trier.de/db/conf/cvpr/cvpr2015.html#VinyalsTBE15.
- Xu et al.  Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 2048–2057. JMLR.org, 2015.
- Zhu et al.  Xinxin Zhu, Jing Liu, Haipeng Peng, and Xinxin Niu. Captioning transformer with stacked attention modules. Applied Sciences, 8:739, May 2018. doi: 10.3390/app8050739.