Being able to automatically generate a description from an image is a fundamental problem in artificial intelligence, connecting computer vision and natural language processing. The problem is particularly challenging because it requires correctly recognizing the different objects in an image as well as how they interact.
Convolutional Neural Networks (CNNs) have achieved state-of-the-art results in different computer vision tasks in the last few years. More recently, different authors proposed automatic image sentence description approaches based on deep neural networks. All of these solutions take as a starting point the image representation generated by a CNN that was previously trained for object recognition tasks.
Vinyals et al. (2014) treat the problem similarly to a machine translation problem. The authors propose an encoder/decoder (CNN/LSTM networks) system that is trained to maximize the likelihood of the target description sentence given a training image. Kiros et al. (2014) also consider an encoder/decoder pipeline, but use a combination of CNN and LSTM networks for encoding and a language model for decoding. Karpathy & Fei-Fei (2014) propose an approach that combines a CNN, bidirectional recurrent neural networks over sentences and a structured objective responsible for a multimodal embedding. They then propose a second recurrent neural network architecture to generate new sentences. Similar to the previous works, Mao et al. (2014) and Donahue et al. (2014) propose systems that use a CNN to extract image features and a deep recurrent neural network for sentences. The two networks interact with each other in a multimodal common layer.
Fang et al. (2014) propose a different approach to the problem that does not rely on recurrent neural networks. Their solution can be divided into three steps: (i) visual detectors for commonly occurring words are trained using multiple instance learning, (ii) a set of sentences is generated using a maximum-entropy language model and (iii) the sentences are re-ranked using sentence-level features and a proposed deep multimodal similarity model.
This paper proposes a different approach to the problem. We propose a system that at the same time: (i) automatically generates a sentence describing a given scene and (ii) is relatively simpler than the recently proposed approaches. Our model shares some similarities with previously proposed deep approaches. For instance, we also use a pre-trained CNN to extract image features and we also consider a multimodal embedding. However, thanks to the phrase-based approach, we do not use any complex recurrent network for sentence generation.
We represent the ground-truth sentences as a collection of noun, verb and prepositional phrases. Each phrase is represented by the mean of the vector representations of the words that compose it. We then train a simple linear embedding model that transforms an image representation into a multimodal space that is common to the images and the phrases that are used to describe them. To automatically generate sentences at inference time, we (i) infer the phrases that correspond to the sample image and (ii) use a simple language model based on the statistics of the ground-truth sentences present in the corpus.
2 Phrase-based model for image descriptions
2.1 Understanding structures of image descriptions
The art of writing sentences can vary a lot according to the domain to which it is applied. When reporting news or reviewing an item, not only the choice of words might vary, but also the general structure of the sentence. Sentence structures used for describing images can therefore be identified.
They possess a very distinct structure, usually describing the different objects present in the scene and how they interact with each other. This interaction among objects is described as actions or relative positions between different objects. The sentence can be short or long, but it generally respects this process. This statement is illustrated with the ground-truth sentence descriptions of the image in Figure 1.
All the key elements in a given image are usually described with a noun phrase (NP). Interactions between these elements can then be explained using prepositional phrases (PP) or verb phrases (VP). Describing an image is therefore just a matter of identifying these constituents. We propose to train a model that predicts the phrases which are likely to describe a given image.
Noun phrases or verb phrases are often a combination of several words. Good word vector representations can be obtained very quickly with many different recent approaches (Mikolov et al., 2013b; Mnih & Kavukcuoglu, 2013; Pennington et al., 2014; Lebret & Collobert, 2014). Mikolov et al. (2013a) also showed that simple vector addition can often produce meaningful results, such as king - man + woman ≈ queen. By leveraging the ability of these word vector representations to compose, representations for phrases are easily computed with an element-wise addition.
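The composition step can be illustrated with a minimal sketch. It assumes the element-wise mean (the variant used in the experiments), toy 3-dimensional embeddings invented for illustration, and a hypothetical helper name `phrase_vector`; skipping unknown words is also an assumption, not something the paper specifies.

```python
from typing import Dict, List, Sequence


def phrase_vector(phrase: Sequence[str],
                  word_vectors: Dict[str, List[float]]) -> List[float]:
    """Return the element-wise mean of the vectors of the words in a phrase."""
    # Assumption: words without a vector are ignored.
    vectors = [word_vectors[w] for w in phrase if w in word_vectors]
    if not vectors:
        raise ValueError("no known word in phrase")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy embeddings, made up for illustration only.
toy_vectors = {
    "a": [0.0, 1.0, 0.0],
    "black": [1.0, 0.0, 0.0],
    "dog": [2.0, 2.0, 3.0],
}
np_vec = phrase_vector(["a", "black", "dog"], toy_vectors)
print(np_vec)  # [1.0, 1.0, 1.0]
```

Because the mean is just a rescaled sum, phrases of different lengths end up on a comparable scale while keeping the additive compositionality mentioned above.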
From phrases to sentence
After identifying the most likely constituents of the image, we propose to use a statistical language model to combine them and generate a proper description. A general framework is defined to reduce the total number of combinations and thus speed up the process of generating sentences. The constrained language model used is illustrated in Figure 2. In general, a noun phrase is always followed by a verb phrase or a prepositional phrase, and both are then followed by another noun phrase. This process is repeated $N$ times until reaching the end of a sentence (characterized by a period). This heuristic is based on the analysis of the syntax of the sentences (see Section 3.1).
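As a rough illustration of this constraint, the sketch below enumerates the phrase-type patterns it allows. It assumes that one repetition means one (VP or PP) link followed by one noun phrase; the function name `phrase_type_patterns` and the parameter `n_repeats` are hypothetical.

```python
from itertools import product
from typing import Iterator, Tuple


def phrase_type_patterns(n_repeats: int) -> Iterator[Tuple[str, ...]]:
    """Yield NP (VP|PP) NP ... '.' patterns with n_repeats link/NP pairs."""
    for links in product(("VP", "PP"), repeat=n_repeats):
        pattern = ["NP"]              # a description starts with a noun phrase
        for link in links:
            pattern.extend([link, "NP"])  # each link is followed by a noun phrase
        pattern.append(".")           # the period closes the sentence
        yield tuple(pattern)


patterns = list(phrase_type_patterns(2))
for p in patterns:
    print(" ".join(p))
```

With two repetitions there are only four admissible type sequences, which shows how strongly the constraint shrinks the search space before any lexical choice is made.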
2.2 A multimodal representation
For the representation of images, we choose to use a Convolutional Neural Network. CNNs have been widely used in many different vision domains and are currently the state of the art in many object recognition tasks. We consider a CNN that has been pre-trained for the task of object classification. We use the CNN solely for the purpose of feature extraction, that is, no learning is done in the CNN layers.
Learning of a common space for image and phrase representations
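Let $\mathcal{I}$ be the set of training images, $\mathcal{S}$ the set of all sentence descriptions for $\mathcal{I}$, $\mathcal{C}$ the set of all phrases occurring in $\mathcal{S}$, and $\theta$ the trainable parameters of the model. $\mathcal{S}_i$ is the set of sentences describing a given image $i \in \mathcal{I}$, and $\mathcal{C}_s$ is the set of phrases which compose a sentence description $s \in \mathcal{S}_i$. The training objective is to find the phrases $c$ that describe the images $i$ by maximizing the log probability:
$$\theta \mapsto \sum_{i \in \mathcal{I}} \sum_{s \in \mathcal{S}_i} \sum_{c \in \mathcal{C}_s} \log p(c \mid i, \theta)$$
Each image $i$ is represented by a vector $x_i \in \mathbb{R}^n$ thanks to a pre-trained CNN. Each phrase $c$ is composed of $K$ words $w_k$, which are represented by vectors $x_{w_k} \in \mathbb{R}^m$ thanks to another pre-trained model for word representations. A vector representation $z_c$ for a phrase $c = \{w_1, \ldots, w_K\}$ is then calculated by averaging its word vector representations:
$$z_c = \frac{1}{K} \sum_{k=1}^{K} x_{w_k}$$
Vector representations for all phrases $c \in \mathcal{C}$ can thus be obtained to build a matrix $V = [z_{c_1}, \ldots, z_{c_{|\mathcal{C}|}}] \in \mathbb{R}^{m \times |\mathcal{C}|}$. In general, $m \ll n$. An encoding function is therefore defined to map image representations $x_i$ into the same vector space as phrase representations $z_c$:
$$g_\theta(i) = U x_i$$
where $U \in \mathbb{R}^{m \times n}$ is initialized randomly and trained to encode images in the same vector space as the phrases used for their descriptions. Because representations of images and phrases are in a common vector space, similarities between a given image $i$ and all phrases can be calculated:
$$f_\theta(i) = V^\top g_\theta(i)$$
where $V$ is fine-tuned to incorporate other features coming from the images. By denoting $[f_\theta(i)]_j$ the score for the $j$-th phrase, this score can be interpreted as the conditional probability $p(c = c_j \mid i, \theta)$ by applying a softmax operation over all the phrases:
$$p(c = c_j \mid i, \theta) = \frac{e^{[f_\theta(i)]_j}}{\sum_{l=1}^{|\mathcal{C}|} e^{[f_\theta(i)]_l}}$$
In practice, this formulation is often impractical due to the large set of possible phrases $\mathcal{C}$.
Training with negative sampling
With $\theta = \{U, V\}$ and a negative sampling approach, we instead minimize the following logistic loss function with respect to $\theta$:
$$\theta \mapsto \sum_{i \in \mathcal{I}} \sum_{s \in \mathcal{S}_i} \sum_{c_j \in \mathcal{C}_s} \left( \log\left(1 + e^{-[f_\theta(i)]_j}\right) + \sum_{l=1}^{N_{\mathrm{neg}}} \log\left(1 + e^{[f_\theta(i)]_{j_l}}\right) \right)$$
Thus the task is to distinguish the target phrase $c_j$ from draws $c_{j_1}, \ldots, c_{j_{N_{\mathrm{neg}}}}$ from the noise distribution, where there are $N_{\mathrm{neg}}$ negative samples for each data sample. The model is trained using stochastic gradient descent.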
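The scoring and training criterion described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, phrase vectors are stored as rows of a list, the noise distribution is taken to be uniform over non-target phrases (the text does not specify it), and the toy numbers are invented.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]


def encode_image(U: Matrix, x: Sequence[float]) -> List[float]:
    """Linear encoding: project an image vector into the phrase space."""
    return [sum(u * xi for u, xi in zip(row, x)) for row in U]


def phrase_scores(phrases: Matrix, g: Sequence[float]) -> List[float]:
    """Dot product between the encoded image and every phrase vector."""
    return [sum(z * gi for z, gi in zip(phrase, g)) for phrase in phrases]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_negatives(n_phrases: int, target: int, k: int,
                     rng: random.Random) -> List[int]:
    """Assumption: uniform noise distribution over non-target phrases."""
    candidates = [j for j in range(n_phrases) if j != target]
    return [rng.choice(candidates) for _ in range(k)]


def negative_sampling_loss(scores: Sequence[float], target: int,
                           negatives: Sequence[int]) -> float:
    """Logistic loss: push the target score up and the sampled scores down."""
    loss = softplus(-scores[target])
    loss += sum(softplus(scores[j]) for j in negatives)
    return loss


# Toy example: a 3-d "image" vector encoded into a 2-d phrase space.
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = [2.0, -1.0, 5.0]
phrases = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one row per phrase

g = encode_image(U, x)
scores = phrase_scores(phrases, g)
probs = softmax(scores)
loss = negative_sampling_loss(scores, target=0, negatives=[1, 2])
negatives = sample_negatives(len(phrases), target=0, k=5, rng=random.Random(0))
```

The loss only touches the target phrase and the sampled negatives, so its cost per training example does not grow with the size of the phrase vocabulary, unlike the softmax normalisation.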
3.1 Experimental Setup
We validate our model on the recently proposed COCO dataset (Lin et al., 2014), which contains complex images with multiple objects. The dataset contains a total of 123,000 images, each of them with 5 human-annotated sentences. The testing images have not yet been released. We thus use two sets of 5,000 images from the validation images for validation and test, as in Karpathy & Fei-Fei (2014) (available at http://cs.stanford.edu/people/karpathy/deepimagesent/). We measure the quality of the generated sentences using the popular, yet controversial, BLEU score (Papineni et al., 2002).
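For readers unfamiliar with the metric, the following is a minimal sentence-level sketch of BLEU as defined by Papineni et al. (2002): clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is not the evaluation script used for the reported results; the function names are hypothetical, tokens are assumed to be pre-split, and no smoothing is applied.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: Sequence[str],
                       references: List[Sequence[str]], n: int) -> float:
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate: Sequence[str], references: List[Sequence[str]],
         max_n: int = 4) -> float:
    """Geometric mean of precisions up to max_n times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties: shortest).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
degenerate = "the the the the the the the".split()
p1 = modified_precision(degenerate, refs, 1)  # clipping gives 2 / 7
perfect = bleu(refs[0], refs)
```

Setting `max_n` to 1, 2, 3 or 4 gives the scores at the different n-gram levels; the clipping step is what prevents a degenerate caption that repeats a frequent word from scoring well.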
Following Karpathy & Fei-Fei (2014), the image features are extracted using the VGG CNN (Chatfield et al., 2014). This model generates image representations of dimension 4096 from RGB input images. For sentence features, we extract phrases from the 576,737 training sentences with the SENNA software (available at http://ml.nec-labs.com/senna/). Statistics reported in Figure 3 confirm the hypothesis that image descriptions have a simple syntactic structure. A large majority of sentences contain from two to four noun phrases. Two noun phrases then interact using a verb or prepositional phrase. Only phrases occurring at least ten times in the training set are considered. This results in 11,688 noun phrases, 3,969 verb phrases (pre-verbal and post-verbal adverb phrases are merged with verb phrases) and 219 prepositional phrases. Phrase representations are then computed by averaging vector representations of their words. We obtained word vector representations from the Hellinger PCA of a word co-occurrence matrix, following the method described in Lebret & Collobert (2014). The word co-occurrence matrix is built over the entire English Wikipedia (January 2014 version, available at http://download.wikimedia.org), with a symmetric context window of ten words coming from the 10,000 most frequent words. Words, and therefore also phrases, are represented in 400-dimensional vectors.
Learning multimodal representation
The trainable parameters are the image-encoding matrix $U$ and the phrase matrix $V$. The latter is initialized with the phrase representations. They are trained with negative samples and a learning rate set to 0.00025.
Generating sentences from the predicted phrases
The number of times the noun phrase / verb-or-prepositional phrase pattern is repeated is set according to the statistics of ground-truth sentence structures. As nodes, we consider only the top twenty predicted noun phrases, the top ten predicted verb phrases and the top five predicted prepositional phrases. A trigram language model is used for the transition probabilities between two nodes. The probability of each lexical phrase is calculated using the two previous phrases and the constraint described in Figure 2. In order to reduce the number of sentences generated, we only consider the transitions which are likely to happen (we discard any sentence which would have a trigram transition probability lower than 0.01). This thresholding also helps to discard sentences that are semantically incorrect.
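The generation procedure can be sketched as a depth-first search over the predicted phrases that prunes any transition whose trigram probability falls under the threshold. Everything specific here is an assumption for illustration: the function name `generate_sentences`, the `<s>` start symbol, the cap `max_noun_phrases`, and the toy probability table (in practice the probabilities would be estimated from the ground-truth sentences).

```python
from typing import Dict, List, Tuple

START, END = "<s>", "."


def generate_sentences(top_phrases: Dict[str, List[str]],
                       trigram_prob: Dict[Tuple[str, str, str], float],
                       max_noun_phrases: int,
                       threshold: float = 0.01) -> List[List[str]]:
    """Enumerate NP (VP|PP) NP ... sentences whose transitions all pass the threshold."""
    sentences: List[List[str]] = []

    def expand(history: List[str], n_nouns: int, expect_noun: bool) -> None:
        prev2, prev1 = history[-2], history[-1]
        if expect_noun:
            options = top_phrases["NP"]
        elif n_nouns >= max_noun_phrases:
            options = [END]  # the noun-phrase budget is used up
        else:
            options = top_phrases["VP"] + top_phrases["PP"] + [END]
        for phrase in options:
            if trigram_prob.get((prev2, prev1, phrase), 0.0) < threshold:
                continue  # prune unlikely transitions
            if phrase == END:
                sentences.append(history[2:])  # drop the two start symbols
            else:
                expand(history + [phrase],
                       n_nouns + (1 if expect_noun else 0),
                       not expect_noun)

    expand([START, START], 0, True)
    return sentences


# Toy predicted phrases and trigram probabilities, invented for illustration.
top_phrases = {"NP": ["a man", "a horse"], "VP": ["is riding"], "PP": ["on"]}
trigram_prob = {
    (START, START, "a man"): 0.5,
    (START, "a man", "is riding"): 0.4,
    ("a man", "is riding", "a horse"): 0.3,
    ("is riding", "a horse", END): 0.6,
    (START, "a man", END): 0.005,      # below the threshold, pruned
    (START, START, "a horse"): 0.2,
    (START, "a horse", "on"): 0.001,   # below the threshold, pruned
}
sentences = generate_sentences(top_phrases, trigram_prob, max_noun_phrases=2)
print(sentences)  # [['a man', 'is riding', 'a horse']]
```

Pruning at every transition, rather than scoring complete sentences, keeps the number of explored branches small even with twenty noun phrases, ten verb phrases and five prepositional phrases as nodes.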
Ranking generated sentences
Our final step consists of ranking the generated sentences and choosing the one with the highest score as the final output. For each test image $i$, we generate a set of $M$ sentence candidates using the proposed language model. For each sentence $s_j$ ($j \in \{1, \ldots, M\}$), we compute its vector representation $z_{s_j}$ by averaging the representations of the phrases that make up the sentence. The final score for each sentence is computed as the dot product between the sentence vector representation and the encoded representation $g_\theta(i)$ of the sample image $i$:
$$\mathrm{score}(s_j, i) = z_{s_j}^\top g_\theta(i)$$
The output of the system is the sentence with the highest score. This ranking helps the system to choose the sentence that is closest to the sample image.
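A minimal sketch of this ranking step is given here, assuming phrase vectors and the encoded image are already available as plain lists; the helper names and the 2-dimensional toy vectors are invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def rank_sentences(candidates: List[List[str]],
                   phrase_vectors: Dict[str, List[float]],
                   encoded_image: Sequence[float]) -> List[Tuple[float, List[str]]]:
    """Score each candidate (a list of phrases) and sort by decreasing score."""
    scored = []
    for sentence in candidates:
        sentence_vec = mean_vector([phrase_vectors[p] for p in sentence])
        scored.append((dot(sentence_vec, encoded_image), sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy phrase vectors and encoded image, made up for illustration.
phrase_vectors = {
    "a man": [1.0, 0.0],
    "is riding": [0.0, 1.0],
    "a horse": [1.0, 1.0],
    "a dog": [-1.0, 0.0],
}
encoded_image = [1.0, 1.0]
candidates = [
    ["a dog", "is riding", "a horse"],
    ["a man", "is riding", "a horse"],
]
ranked = rank_sentences(candidates, phrase_vectors, encoded_image)
best_score, best_sentence = ranked[0]
print(" ".join(best_sentence))  # a man is riding a horse
```

Because the sentence vector is a mean of phrase vectors that live in the same space as the encoded image, no additional parameters are needed for ranking.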
3.2 Experimental Results
Table 1 shows our sentence generation results on the COCO dataset. BLEU scores are reported up to 4-grams. Human agreement scores are computed by comparing one of the ground-truth descriptions against the others. For comparison, we include results from recently proposed models. Although we use the same test set as in Karpathy & Fei-Fei (2014), there are slight variations between the test sets chosen in other papers. Our model gives competitive results at all N-gram levels. It is interesting to note that our results are very close to the human agreement scores. Examples of fully automatically generated sentences can be found in Figure 4.
Table 1: Sentence generation results on the COCO dataset (BLEU scores up to 4-grams), compared with Karpathy & Fei-Fei (2014), Vinyals et al. (2014), Donahue et al. (2014) and Fang et al. (2014).
4 Conclusion and future work
In this paper, we propose a simple model that is able to automatically generate sentences from an image sample. Our model is considerably simpler than the current state of the art, which uses complex recurrent neural networks. We predict phrase components that are likely to describe a given image and use a simple statistical language model to generate sentences. Our model achieves promising first results. Future work includes applying the model to different datasets (Flickr8k, Flickr30k and the final COCO version for benchmarking), performing image-sentence ranking experiments and improving the language model used.
This work was supported by the HASLER foundation through the grant “Information and Communication Technology for a Better World 2020” (SmartWorld).
- Chatfield et al. (2014) Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. Return of the Devil in the Details: Delving Deep into Convolutional Nets. In British Machine Vision Conference, 2014.
- Donahue et al. (2014) Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.
- Fang et al. (2014) Fang, Hao, Gupta, Saurabh, Iandola, Forrest N., Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John C., Zitnick, C. Lawrence, and Zweig, Geoffrey. From captions to visual concepts and back. CoRR, abs/1411.4952, 2014.
- Karpathy & Fei-Fei (2014) Karpathy, Andrej and Fei-Fei, Li. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014.
- Kiros et al. (2014) Kiros, R., Salakhutdinov, R., and Zemel, R. S. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. CoRR, abs/1411.2539, 2014.
- Lebret & Collobert (2014) Lebret, Remi and Collobert, Ronan. Rehabilitation of Count-based Models for Word Vector Representations. CoRR, abs/1412.4930, 2014.
- Lin et al. (2014) Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C. Lawrence. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
- Mao et al. (2014) Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan L. Explain images with multimodal recurrent neural networks. CoRR, abs/1410.1090, 2014.
- Mikolov et al. (2013a) Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient Estimation of Word Representations in Vector Space. In ICLR Workshop, 2013a.
- Mikolov et al. (2013b) Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, 2013b.
- Mnih & Kavukcuoglu (2013) Mnih, A. and Kavukcuoglu, K. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS, 2013.
- Papineni et al. (2002) Papineni, Kishore, Roukos, Salim, Ward, Todd, and Zhu, Wei-Jing. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002.
- Pennington et al. (2014) Pennington, J., Socher, Richard, and Manning, C. D. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, 2014.
- Vinyals et al. (2014) Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.