A single image can contain a large amount of information. Humans are able to parse this information at a single glance. Humans normally communicate through written or spoken language, and they can use language to describe any image. Different individuals will generate different captions for the same image. If machines could perform the same task, it would be helpful for a variety of applications. However, generating captions for an image is a very challenging task for a machine. To generate captions, a machine needs an understanding of natural language processing and the ability to identify and relate the objects in an image. Some of the early approaches to this challenge were based on hard-coded features and well-defined syntax. This limits the types of sentence that any given model can generate. To overcome this limitation, the main challenge is to free the model from hard-coded features and sentence templates. The rules used by the model should instead be learned from the training data.
Another challenge is that, although a large number of images with associated text are available on the ever-expanding internet, most of this text is noisy and therefore cannot be used directly to train an image captioning model. Training an image captioning model requires a large dataset of images, each properly annotated by multiple people.
In this paper, we study a collection of existing natural image captioning models and how they compose new captions for unseen images. We also present the results of our implementation of these models and compare them.
Section 2 of this paper describes related work in detail. The Show & Tell model is described in detail in Section 3. Section 4 contains details about the implementation environment and dataset. Results and discussion are provided in Section 5. Finally, we provide our concluding remarks in Section 6.
2 Related work
Creating a captioning system that generates captions as accurately as a human depends on connecting the importance of each object in the image with how it relates to the other objects in the image. An image can be described using more than one sentence, but to train an image captioning model efficiently we require only a single sentence that can be provided as a caption. This leads to the problem of text summarization in natural language processing.
There are two main ways to perform image captioning: retrieval-based methods and generative methods. Most of the existing work is based on retrieval-based methods. One of the best retrieval-based models is the Im2Text model, proposed by Vicente Ordonez, Girish Kulkarni and Tamara L. Berg. Their system is divided into two main parts: (1) image matching and (2) caption generation. First, the input image is provided to the model. Matching images are retrieved from a database containing images and their associated captions. Once matching images are found, the high-level objects extracted from the original image are compared with those of the matching images. The images are then reranked based on the matched content, and the captions of the top-n ranked images are returned. The main limitation of retrieval-based methods is that they can only produce captions that are already present in the database; they cannot generate novel captions.
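The two-stage retrieval procedure described above can be illustrated with a minimal sketch. It is not the implementation of the original system: the feature vectors, the object lists, the cosine-similarity matching and the overlap-based reranking score are all assumptions chosen for illustration, and the names `retrieve_captions` and `DATABASE` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_captions(query, database, n_candidates=3, top_n=1):
    """Return the captions of the top_n database images for a query image."""
    # Stage 1 (image matching): keep the most similar images by global features.
    candidates = sorted(
        database,
        key=lambda entry: cosine_similarity(query["features"], entry["features"]),
        reverse=True,
    )[:n_candidates]

    # Stage 2 (reranking): score candidates by the overlap of detected objects.
    def object_overlap(entry):
        q, e = set(query["objects"]), set(entry["objects"])
        union = q | e
        return len(q & e) / len(union) if union else 0.0

    reranked = sorted(candidates, key=object_overlap, reverse=True)
    return [entry["caption"] for entry in reranked[:top_n]]

# Toy database (assumed data): each entry holds features, objects and a caption.
DATABASE = [
    {"features": [0.9, 0.1, 0.0], "objects": ["dog", "grass"],
     "caption": "a dog running on the grass"},
    {"features": [0.8, 0.2, 0.1], "objects": ["dog", "ball", "grass"],
     "caption": "a dog playing with a ball"},
    {"features": [0.0, 0.1, 0.9], "objects": ["car", "street"],
     "caption": "a car parked on a street"},
]

query = {"features": [0.85, 0.15, 0.05], "objects": ["dog", "ball"]}
result = retrieve_captions(query, DATABASE, n_candidates=2, top_n=1)
```

Whatever the query, the returned caption is always one of the captions stored in the database, which is exactly the limitation noted above.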
Generative models overcome this limitation of retrieval-based methods, since they can create novel sentences. Generative models are of two types: pipeline-based models and end-to-end models. Pipeline-based models use two separate learning processes, one for language modeling and one for image recognition. They first identify the objects in the image and then provide the result to the language modeling task. In end-to-end models, by contrast, the language modeling and image recognition models are combined into a single model. Both parts of the model learn at the same time in an end-to-end system. Such models are typically created by combining convolutional and recurrent neural networks.
The Show & Tell model proposed by Vinyals et al. is a generative, end-to-end model. It uses recent advances in image recognition and neural machine translation for the image captioning task. It combines the Inception-v3 model with LSTM cells: the Inception-v3 model provides the object recognition capability, while the LSTM cells provide the language modeling capability.
3 Show & Tell Model
Recurrent neural networks are generally used in neural machine translation. They encode variable-length inputs into fixed-dimensional vectors and then use these vector representations to decode the desired output sequence. Instead of using text as the input to the encoder, the Show & Tell model uses an image. The image is converted to a word vector, and this word vector is then translated into a caption using a recurrent neural network as the decoder.
To achieve this goal, the Show & Tell model is created by hybridizing two different models. It takes an image as input and provides it to the Inception-v3 model. A single fully connected layer is added at the end of the Inception-v3 model. This layer transforms the output of the Inception-v3 model into a word embedding vector, which is fed into a series of LSTM cells. An LSTM cell provides the ability to store and retrieve sequential information through time, which helps the model generate sentences while keeping the previous words in context.
Working with the Show & Tell model can be divided into two parts. The first part is the training process, in which the model learns its parameters. The second part is the testing process, in which we infer captions and evaluate these machine-generated captions by comparing them with human-generated captions.
During the training phase we provide pairs of an input image and its corresponding caption to the Show & Tell model. The Inception-v3 part of the model is trained to identify all possible objects in an image, while the LSTM part of the model is trained to predict every word in the sentence after it has seen the image as well as all previous words. To every caption we add two additional symbols, a start word and a stop word. The stop word marks the end of the string: whenever it is encountered, the model stops generating the sentence. The loss function for the model is calculated as

$$L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)$$

where $I$ represents the input image and $S$ represents the generated caption, $N$ is the length of the generated sentence, and $p_t$ and $S_t$ represent the probability and the predicted word at time $t$, respectively. During training we try to minimize this loss function.
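The loss described above, the summed negative log-probability that the model assigns to the correct word at each time step, can be sketched in a few lines. This is a toy illustration rather than the training code used in this work: the three-word vocabulary, the probability values and the function name `caption_loss` are assumptions.

```python
import math

def caption_loss(step_probabilities, caption_word_ids):
    """Sum of negative log-probabilities of the correct word at each step.

    step_probabilities[t] is the predicted distribution over the vocabulary
    at time step t; caption_word_ids[t] is the index of the correct word.
    """
    loss = 0.0
    for probs, word_id in zip(step_probabilities, caption_word_ids):
        loss -= math.log(probs[word_id])
    return loss

# Assumed toy values: a vocabulary of 3 words and a caption of 3 words.
step_probabilities = [
    [0.50, 0.25, 0.25],  # t = 1, correct word has index 0
    [0.10, 0.80, 0.10],  # t = 2, correct word has index 1
    [0.25, 0.25, 0.50],  # t = 3, correct word has index 2
]
caption_word_ids = [0, 1, 2]

loss = caption_loss(step_probabilities, caption_word_ids)
```

The loss shrinks towards zero as the model places more probability on the correct word at every step, which is why minimizing it pushes the LSTM towards reproducing the reference captions.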
Among the various approaches to generating a caption from a given image, the Show & Tell model uses beam search to find suitable words. With a beam size of K, it recursively considers the K best candidates at each word position. At each step it calculates the joint probability of a word with all previously generated words in the sequence. It keeps producing output until the end-of-sentence marker is predicted. It then selects the sentence with the best probability and outputs it as the caption.
To evaluate image captioning, we implemented the Show & Tell model. Details about the dataset, implementation tool and implementation environment are given as follows:
Several annotated image datasets are available for the task of image captioning. The most common of them are the Pascal VOC dataset and the MSCOCO dataset. In this work the MSCOCO image captioning dataset is used. MSCOCO is a dataset developed by Microsoft with the goal of advancing the state of the art in object recognition and captioning tasks. It contains a collection of images of day-to-day activities with their related captions. First, each object in an image is labeled, and then a description is added based on the objects in the image. The MSCOCO dataset contains images of 91 object types that would be easily recognizable by a 4-year-old. It contains around 2.5 million labeled objects in 328K images. The dataset was created through crowdsourcing by thousands of people.
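A minimal sketch of beam search over a toy language model may make the procedure concrete. The word table `TOY_MODEL`, its probabilities and the helper names are assumptions made for illustration; in the actual model the next-word distribution comes from the LSTM and is conditioned on the image.

```python
import math

START, END = "<s>", "</s>"

# Assumed toy model: next-word probabilities depend only on the last word.
# All probabilities are non-zero, so math.log is always defined.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.4, "</s>": 0.1},
    "the": {"dog": 0.9, "</s>": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}

def next_word_probs(words):
    """Return the distribution over the next word for a partial sentence."""
    return TOY_MODEL[words[-1]]

def beam_search(next_probs, beam_size, max_len=10):
    """Return (caption_words, probability) of the best sentence found."""
    beams = [(0.0, [START])]  # each beam is (joint log-probability, words)
    finished = []
    for _ in range(max_len):
        # Extend every partial sentence by every possible next word.
        candidates = []
        for log_p, words in beams:
            for word, p in next_probs(words).items():
                candidates.append((log_p + math.log(p), words + [word]))
        # Keep only the beam_size most probable extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for log_p, words in candidates[:beam_size]:
            if words[-1] == END:
                finished.append((log_p, words))
            else:
                beams.append((log_p, words))
        if not beams:
            break
    log_p, words = max(finished or beams, key=lambda c: c[0])
    caption = [w for w in words if w not in (START, END)]
    return caption, math.exp(log_p)

greedy_caption, greedy_prob = beam_search(next_word_probs, beam_size=1)
beam_caption, beam_prob = beam_search(next_word_probs, beam_size=2)
```

In this toy model a beam size of 1 (greedy decoding) commits to "a" at the first step and ends with "a dog", whereas a beam size of 2 keeps "the" alive and finds the more probable sentence "the dog". This is the reason for keeping several candidates at each step.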
4.2 Implementation tool and environment
For this experiment we used a machine with an Intel Xeon E3 processor with 12 cores and 32 GB of RAM running CentOS 7. The TensorFlow library is used for creating and training deep neural networks. TensorFlow is a deep learning library developed by Google. It provides a heterogeneous platform for the execution of algorithms, i.e. it can run on low-power devices such as mobile phones as well as on large-scale distributed systems containing thousands of GPUs. TensorFlow uses a graph definition to define the structure of a network. Once the graph is defined, it can be executed on any supported device.
5 Results and Discussion
With our implementation of the Show & Tell model we are able to generate captions that are moderately comparable to human-generated captions. First, the model identifies all possible objects in the image.
As shown in Fig. 2, the Inception-v3 model assigns a probability to every possible object in the image and converts the image into a word vector. This word vector is provided as input to the LSTM cells, which then form a sentence from it, as shown in Fig. 3, using beam search as described in the previous section.
5.2 Evaluation Metrics
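To evaluate any model that generates natural language sentences, the BLEU (Bilingual Evaluation Understudy) score is used. It describes how natural a generated sentence is compared to a human-generated sentence. It is widely used to evaluate the performance of machine translation. To generate the BLEU score, sentences are compared using the modified n-gram precision method, where precision is calculated using the following equation:

$$p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')}$$

Here $\text{Count}_{\text{clip}}$ is the count of an n-gram in the candidate sentence, clipped to the maximum number of times that n-gram occurs in any single reference sentence.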
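As an illustration of modified n-gram precision, the following sketch computes it for a single candidate sentence against several references. It is a simplified example and not the evaluation script used in this work: the function names are hypothetical, tokenization is a plain whitespace split, and the full BLEU score additionally combines the precisions for several values of n and applies a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Modified n-gram precision of one candidate against its references."""
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    # Clip each candidate count so repeated words are not over-rewarded.
    clipped = sum(
        min(count, max_reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    return clipped / sum(candidate_counts.values())

# Well-known degenerate example from the BLEU paper.
candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
unigram_precision = modified_precision(candidate, references, 1)
```

Without clipping, every "the" in the candidate would count as a match and the precision would be 7/7. With clipping, only two occurrences are credited, because no single reference contains "the" more than twice, so the precision is 2/7.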
To evaluate our model we used images from the validation set of the MSCOCO dataset. Some of the captions generated by the Show & Tell model are shown as follows:
As can be seen in Fig. 4, the generated sentence is “a woman sitting at a table with a plate of food.”, while the actual human-generated sentences are “The young woman is seated at the table for lunch, holding a hotdog.”, “a woman is eatting a hotdog at a wooden table.”, “there is a woman holding food at a table.”, “a young woman holding a sandwich at a table.” and “a woman that is sitting down holding a hotdog.”. This results in a BLEU score of 63 for this image.
Similarly, in Fig. 5 the generated sentence is “a woman holding a cell phone in her hand.”, while the actual human-generated sentences are “a woman holding a Hello Kitty phone on her hands”, “a woman holds up her phone in front of her face”, “a woman in white shirt holding up a cellphone”, “a woman checking her cell phone with a hello kitty case” and “the asian girl is holding her miss kitty phone”. This results in a BLEU score of 77 for this image.
Calculating the BLEU score over all images in the validation dataset gives an average score of 65.5, which shows that our generated sentences are very similar to human-generated sentences.
We can conclude from our findings that recent advances in image labeling and automatic machine translation can be combined into an end-to-end hybrid neural network system. This system is capable of autonomously viewing an image and generating a reasonable description in natural language with better accuracy and naturalness.
-  V. Ordonez, G. Kulkarni, and T. L. Berg, “Im2text: Describing images using 1 million captioned photographs,” in Advances in Neural Information Processing Systems, pp. 1143–1151, 2011.
-  A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in , pp. 3128–3137, 2015.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Lessons learned from the 2015 mscoco image captioning challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2016.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” arXiv preprint arXiv:1512.00567, 2015.
-  D. Britz, “Introduction to rnns.” WILDML, http://www.wildml.com/, 2016. [Accessed 4-September-2016].
-  Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” CoRR, vol. abs/1609.08144, 2016.
-  I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, pp. 3104–3112, 2014.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, pp. 740–755, Springer, 2014.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318, Association for Computational Linguistics, 2002.
-  D. Jurafsky, Speech & language processing. Pearson Education India, 2000.