Image Captioning with Deep Bidirectional LSTMs
This work presents an end-to-end trainable deep bidirectional LSTM (Long Short-Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. Two novel deep bidirectional variant models, in which we increase the depth of the nonlinearity transition in different ways, are proposed to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale and vertical mirror are proposed to prevent overfitting when training deep models. We visualize the evolution of bidirectional LSTM internal states over time and qualitatively analyze how our models "translate" images to sentences. Our proposed models are evaluated on caption generation and image-sentence retrieval tasks with three benchmark datasets: Flickr8K, Flickr30K and MSCOCO. We demonstrate that bidirectional LSTM models achieve highly competitive performance compared to state-of-the-art results on caption generation, even without integrating additional mechanisms (e.g. object detection, attention models), and significantly outperform recent methods on the retrieval task.
Automatically describing an image with sentence-level captions has received much attention in recent years [11, 10, 13, 17, 16, 23, 34, 39]. It is a challenging task that integrates visual and language understanding. It requires not only the recognition of visual objects in an image and the semantic interactions between objects, but also the ability to capture visual-language interactions and learn how to "translate" visual understanding into sensible sentence descriptions. The most important part of this visual-language modeling is to capture the semantic correlations across image and sentence by learning a multimodal joint model. While some previous models [20, 15, 26, 17, 16] have been proposed to address the problem of image captioning, they either rely on sentence templates or treat captioning as a retrieval task, ranking the best matching sentence in a database as the caption. Those approaches usually have difficulty generating variable-length and novel sentences. Recent work [11, 10, 13, 23, 34, 39] has indicated that embedding vision and language into a common semantic space with a relatively shallow recurrent neural network (RNN) can yield promising results.
In this work, we propose novel architectures for the problem of image captioning. Different from previous models, we learn a visual-language space where sentence embeddings are encoded with a bidirectional Long Short-Term Memory (Bi-LSTM) network and visual embeddings are encoded with a CNN. The Bi-LSTM is able to summarize long-range visual-language interactions in both forward and backward directions. Inspired by the architectural depth of the human brain, we also explore deep bidirectional LSTM architectures to learn higher-level visual-language embeddings. All proposed models can be trained end-to-end by optimizing a joint loss.
Why bidirectional LSTMs? In unidirectional sentence generation, one general way of predicting the next word w_t from the visual context I and the history textual context is to maximize P(w_t | I, w_1, ..., w_{t-1}). While such a unidirectional model includes past context, it cannot retain the future context w_{t+1}, ..., w_T that could be used for reasoning about a previous word by maximizing P(w_t | I, w_{t+1}, ..., w_T). A bidirectional model tries to overcome the shortcomings that each unidirectional (forward and backward) model suffers on its own, and exploits both past and future dependencies to make a prediction. As shown in Figure 1, two example images with bidirectionally generated sentences intuitively support our assumption that bidirectional captions are complementary; combining them can generate more sensible captions.
Why deeper LSTMs? The recent success of deep CNNs in image classification and object detection [14, 33] demonstrates that deep, hierarchical models can be more efficient at learning representations than shallower ones. This motivated our work to explore deeper LSTM architectures in the context of learning bidirectional visual-language embeddings. As claimed in , if we consider an LSTM as a composition of multiple hidden layers unfolded in time, the LSTM is already a deep network. But this is a way of increasing "horizontal depth", in which network weights are reused at each time step; it is limited in learning more representative features compared to increasing the "vertical depth" of the network. One straightforward way to design a deep LSTM is to stack multiple LSTM layers as hidden-to-hidden transitions. Alternatively, instead of stacking multiple LSTM layers, we propose to add a multilayer perceptron (MLP) as an intermediate transition between LSTM layers. This not only increases the network depth, but also prevents the parameter size from growing dramatically.
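A back-of-the-envelope parameter count illustrates this point (a sketch with hypothetical helper names, assuming the standard LSTM parameterization and the hidden size of 1000 used in our experiments):

```python
def lstm_layer_params(input_dim, hidden):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * hidden + hidden * hidden + hidden)

def mlp_params(input_dim, hidden):
    # one fully connected transition layer: weights plus bias
    return input_dim * hidden + hidden

H = 1000
extra_lstm = lstm_layer_params(H, H)   # cost of stacking one more LSTM layer
extra_mlp = mlp_params(H, H)           # cost of one fully connected transition
print(extra_lstm, extra_mlp)           # the LSTM layer is ~8x more parameters
```

Under these assumptions, an additional LSTM layer adds about 8M parameters while an MLP transition of the same width adds about 1M, which is the growth-rate argument made above.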
Figure 1: Example images with bidirectionally generated captions. The final caption is the sentence with the higher probability (histogram under each sentence). In both examples, the backward caption is selected as the final caption for the corresponding image.
The core contributions of this work are threefold:

We propose end-to-end trainable deep bidirectional LSTM architectures for image captioning, together with data augmentation techniques (multi-crop, multi-scale and vertical mirror) for training them.
We visualize the evolution of the hidden states of bidirectional LSTM units to qualitatively analyze and understand how sentences are generated conditioned on visual context information over time (see Sec.4.4).
We demonstrate the effectiveness of the proposed models on three benchmark datasets: Flickr8K, Flickr30K and MSCOCO. Our experimental results show that bidirectional LSTM models achieve highly competitive performance compared to the state-of-the-art on caption generation (see Sec.4.5) and perform significantly better than recent methods on the retrieval task (see Sec.4.6).
Multimodal representation learning [27, 35] has significant value in multimedia understanding and retrieval. The shared concept across modalities plays an important role in bridging the “semantic gap” of multimodal data. Image captioning falls into this general category of learning multimodal representations.
Recently, several approaches have been proposed for image captioning. We can roughly classify those methods into three categories. The first category is template-based approaches that generate caption templates based on detecting objects and discovering attributes within an image. For example, the work in  parses a whole sentence into several phrases and learns the relationships between phrases and objects within an image. In , a conditional random field (CRF) was used to correspond objects, attributes and prepositions of image content and predict the best labeling. Other similar methods were presented in [26, 17, 16]. These methods are typically hand-designed and rely on fixed templates, which mostly leads to poor performance in generating variable-length sentences. The second category is retrieval-based approaches, which treat image captioning as a retrieval task: a distance metric is used to retrieve similar captioned images, and the retrieved captions are then modified and combined to generate the caption . But these approaches generally need additional procedures such as modification and generalization to fit the image query.
Inspired by the successful use of CNNs [14, 45] and recurrent neural networks [24, 25, 1], a third category of neural network based methods has emerged [39, 42, 13, 10, 11]. Our work also belongs to this category. Among these works, Kiros et al.  can be seen as pioneering the use of neural networks for image captioning, with a multimodal neural language model. In their follow-up work , Kiros et al. introduced an encoder-decoder pipeline where the sentence is encoded by an LSTM and decoded with a structure-content neural language model (SC-NLM). Socher et al. presented a DT-RNN (Dependency Tree-Recursive Neural Network) to embed sentences into a vector space in order to retrieve images. Later on, Mao et al. proposed the m-RNN, which replaces the feed-forward neural language model of . Similar architectures were introduced in NIC  and LRCN ; both approaches use an LSTM to learn text context. But NIC only feeds visual information at the first time step, while the models of Mao et al.  and LRCN  consider the image context at each time step. Another group of neural network based approaches was introduced in [10, 11], where image captions are generated by integrating object detection with an R-CNN (region-CNN) and inferring the alignment between image regions and descriptions.
Most recently, Fang et al.  used multi-instance learning and a traditional maximum-entropy language model for description generation. Chen et al.  proposed to learn visual representations with an RNN for generating image captions. In , Xu et al. introduced the attention mechanism of the human visual system into the encoder-decoder framework. It is shown that the attention model can visualize what the model "sees" and yields significant improvements on image caption generation. Unlike those models, our deep LSTM model directly assumes that the mapping relationship between vision and language is antisymmetric, and dynamically learns long-term bidirectional and hierarchical visual-language interactions. This proves to be very effective on generation and retrieval tasks, as we demonstrate in Sec.4.5 and Sec.4.6.
In this section, we describe our multimodal bidirectional LSTM model (Bi-LSTM for short) and explore its deeper variants. We first briefly introduce the LSTM, which is at the core of our model. The LSTM we use is described in .
Our model builds on the LSTM cell, a particular form of traditional recurrent neural network (RNN) that has been successfully applied to machine translation , speech recognition  and sequence learning . As shown in Figure 2, reading and writing the memory cell is controlled by a group of sigmoid gates. At a given time step t, the LSTM receives inputs from different sources: the current input x_t, the previous hidden state h_{t-1} of all LSTM units, and the previous memory cell state c_{t-1}. The gates are updated at time step t for the given inputs x_t, h_{t-1} and c_{t-1} as follows:

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
g_t = φ(W_xg x_t + W_hg h_{t-1} + b_g)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ φ(c_t)

where the W are weight matrices learned by the network and the b are bias vectors. σ is the sigmoid activation function σ(x) = 1/(1 + e^{-x}), φ denotes the hyperbolic tangent, and ⊙ denotes the element-wise product with a gate value. The LSTM hidden output h_t will be used to predict the next word via the Softmax function with parameters W_s and b_s:

P(w_t) = Softmax(W_s h_t + b_s)

where P(w_t) is the probability distribution of the predicted word.
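The gate updates can be sketched in NumPy (a minimal single-step illustration; the fused weight layout and variable names are ours, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x_t; h_prev] to the 4 gate pre-activations."""
    z = W @ np.concatenate([x_t, h_prev]) + b   # (4H,) pre-activations
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell value
    c_t = f * c_prev + i * g       # new memory cell state
    h_t = o * np.tanh(c_t)         # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 8, 16
W = rng.standard_normal((4 * H, D + H)) * 0.08   # uniform-ish small init
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.standard_normal(D), h, c, W, b)
```

Because the hidden output is a gated tanh, every component of h stays strictly inside (-1, 1).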
Our key motivation for choosing the LSTM is that it can learn long-term temporal dependencies while avoiding the exploding and vanishing gradient problems that traditional RNNs suffer from during back-propagation.
In order to make use of both the past and future context of a sentence when predicting a word, we propose a bidirectional model that feeds sentences to the LSTM in forward and backward order. Figure 3 presents an overview of our model. It comprises three modules: a CNN for encoding image inputs, a Text-LSTM (T-LSTM) for encoding sentence inputs, and a Multimodal LSTM (M-LSTM) for embedding visual and textual vectors into a common semantic space and decoding them to a sentence. The bidirectional LSTM is implemented with two separate LSTM layers that compute the forward and backward hidden sequences; the forward LSTM starts at time t = 1 and the backward LSTM starts at time t = T. Formally, for a raw image input I, a forward-ordered sentence S_f and a backward-ordered sentence S_b, the encoding performs as

V = C(I; Θ_v),   T_f = T(S_f; Θ_f),   T_b = T(S_b; Θ_b)

where C and T represent the CNN and the T-LSTM respectively, Θ_v, Θ_f and Θ_b are their corresponding weights, and the forward and backward embedding matrices are learned by the network. The encoded visual and textual representations are then embedded by the multimodal LSTM:
h_t = M(V, T_t; Θ_m)

where M denotes the M-LSTM and Θ_m its weights. h_t aims to capture the correlation between the visual context and the word at each time step. We feed the visual vector V to the model at each time step to capture strong visual-word correlations. On top of the M-LSTM is a Softmax layer which computes the probability distribution of the next predicted word by

P(w_{t+1}) = Softmax(W_s h_t + b_s)

where P(w_{t+1}) ∈ R^K and K is the vocabulary size.
To design deeper LSTM architectures, in addition to directly stacking multiple LSTMs on top of each other, which we name Bi-S-LSTM (Figure 4(c)), we propose to use a fully connected layer as an intermediate transition layer. Our motivation comes from the findings of , in which a DT(S)-RNN (deep transition RNN with shortcut) is designed by adding a hidden-to-hidden multilayer perceptron (MLP) transition; such a network is arguably easier to train. Inspired by this, we extend Bi-LSTM (Figure 4(b)) with a fully connected layer, which we call Bi-F-LSTM (Figure 4(d)); a shortcut connection between input and hidden states is introduced to make the model easier to train. The aim of both extensions is to learn an extra hidden transition function. In Bi-S-LSTM,
h_t^{l+1} = U(h_t^l, h_{t-1}^{l+1})

where h_t^l denotes the hidden state of the l-th layer at time t, and U is parameterized by the matrices connecting to the transition layer (also see Figure 5(L)). For readability, we consider one-direction training and suppress bias terms. Similarly, Bi-F-LSTM learns a hidden transition function by

ĥ_t = φ(W_f (h_t ⊕ φ(W_r h_t)))

where ⊕ is the operator that concatenates h_t and its abstraction into a long hidden state (also see Figure 5(R)).
One of the most challenging aspects of training deep bidirectional LSTM models is preventing overfitting. Since our largest dataset has only 80K images , which can easily cause overfitting, we adopted several techniques commonly used in the literature, such as fine-tuning a pre-trained visual model, weight decay, dropout and early stopping. Additionally, it has been shown that data augmentation such as random cropping and horizontal mirroring [32, 22], and adding noise, blur and rotation , can effectively alleviate overfitting. Inspired by this, we designed new data augmentation techniques to increase the number of image-sentence pairs. Our implementation operates on the visual model as follows:
Multi-Crop: Instead of randomly cropping the input image, we crop its four corners and center region. We found that random cropping tends to select the center region and easily causes overfitting. By cropping the four corners and the center, the variation of network inputs is increased to alleviate overfitting.
Multi-Scale: To further increase the number of image-sentence pairs, we rescale the input image to multiple scales. Each input image is first resized to 256×256; we then randomly select a region of size s·256 × s·256, where s is the scale ratio (s = 1 means no multi-scale operation is performed on the given image). Finally we resize the selected region to the AlexNet input size 227×227 or the VggNet input size 224×224.
Vertical Mirror: Motivated by the effectiveness of the widely used horizontal mirror, it is natural to also consider the vertical mirror of an image for the same purpose.
These augmentation techniques are implemented in a real-time fashion: each input image is randomly transformed by one of the augmentations before being fed to the network for training. In principle, our data augmentation can increase the number of image-sentence training pairs by roughly 40 times (5 crops × 4 scales × 2 mirror states).
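The cropping and mirroring steps can be sketched in NumPy (a minimal sketch on H×W×C arrays; function names are ours):

```python
import numpy as np

def five_crops(img, size):
    """Four corner crops plus the center crop of an H x W x C image."""
    h, w = img.shape[:2]
    ch, cw = (h - size) // 2, (w - size) // 2
    return [img[:size, :size],                 # top-left
            img[:size, w - size:],             # top-right
            img[h - size:, :size],             # bottom-left
            img[h - size:, w - size:],         # bottom-right
            img[ch:ch + size, cw:cw + size]]   # center

def vertical_mirror(img):
    """Flip the image upside-down (rows reversed)."""
    return img[::-1]

img = np.zeros((256, 256, 3), dtype=np.float32)
crops = five_crops(img, 227)   # five 227x227 AlexNet-sized crops
```

Applied on top of the four scales, this yields the 5 × 4 × 2 = 40 variants per image mentioned above.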
The trained model is used to predict a word given the image context and the previous word context, i.e. by maximizing P(w_t | I, w_1, ..., w_{t-1}) in forward order, or P(w_t | I, w_{t+1}, ..., w_T) in backward order. We initialize the forward and backward directions with a start token respectively. Ultimately, given the sentences generated in the two directions, we decide the final sentence for a given image according to the summation of word probabilities within each sentence.
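The final-caption selection can be sketched as follows (a sketch under our assumptions: we sum log-probabilities for numerical stability, and whether the sum is length-normalized is an implementation choice we leave out here):

```python
import math

def final_caption(forward, backward):
    """Pick the final caption from the forward- and backward-generated
    candidates by the summed word log-probability. Each candidate is a
    (words, per-word probabilities) pair."""
    def score(probs):
        return sum(math.log(p) for p in probs)
    fw_words, fw_probs = forward
    bw_words, bw_probs = backward
    return fw_words if score(fw_probs) >= score(bw_probs) else bw_words

# Hypothetical candidates with made-up per-word probabilities:
fwd = (["a", "train", "is", "pulling", "into", "a", "station"],
       [0.9, 0.8, 0.7, 0.6, 0.8, 0.9, 0.7])
bwd = (["a", "train", "on", "the", "tracks"],
       [0.9, 0.85, 0.8, 0.9, 0.8])
best = final_caption(fwd, bwd)
```

Note that an unnormalized sum tends to favor shorter sentences, which is one reason length normalization is a common variant of this rule.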
In this section, we design several groups of experiments to accomplish the following objectives:
Qualitatively analyze and understand how the bidirectional multimodal LSTM learns to generate sentences conditioned on visual context information over time.

Measure the benefits and performance of the proposed bidirectional model and its deeper variants, which increase the nonlinearity depth in different ways.
Compare our approach with state-of-the-art methods in terms of sentence generation and image-sentence retrieval tasks on popular benchmark datasets.
Flickr8K. It consists of 8,000 images, each with 5 sentence-level captions. We follow the standard dataset divisions provided by the authors: 6,000/1,000/1,000 images for training/validation/testing respectively.
Flickr30K. An extended version of Flickr8K. It has 31,783 images, each with 5 captions. We follow the publicly accessible dataset divisions by Karpathy et al.  (http://cs.stanford.edu/people/karpathy/deepimagesent/). In these splits, 29,000/1,000/1,000 images are used for training/validation/testing respectively.
MSCOCO. This is a recently released dataset that covers 82,783 images for training and 40,504 images for validation, each with 5 sentence annotations. Since there are no standard splits, we also follow the splits provided by Karpathy et al. : 80,000 training images and 5,000 images each for validation and testing.
Visual feature. We use two visual models for encoding images: the Caffe reference model pre-trained with AlexNet , and the 16-layer VggNet model . We extract features from the last fully connected layer and feed them to train the visual-language model with the LSTM. Previous work [39, 23] has demonstrated that more powerful image models such as GoogleNet  and VggNet  can achieve promising improvements. To make a fair comparison with recent works, we select these two widely used models for our experiments.
Textual feature. We first represent each word within a sentence as a one-hot vector of dimension K, where K is the vocabulary size built on the training sentences and differs between datasets. By performing basic tokenization and removing words that occur fewer than 5 times in the training set, we obtain vocabularies of 2,028, 7,400 and 8,801 words for Flickr8K, Flickr30K and MSCOCO respectively.
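The vocabulary construction and one-hot encoding described above can be sketched as (a minimal sketch; real preprocessing would use proper tokenization rather than whitespace splitting):

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Build a word-to-index vocabulary, dropping words that occur
    fewer than min_count times in the training sentences."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}

def one_hot(word, vocab):
    """Represent a word as a K-dimensional one-hot vector."""
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v

# Toy corpus: "cat"/"sits" occur only twice and are pruned.
sents = ["a dog runs"] * 5 + ["a cat sits"] * 2
vocab = build_vocab(sents, min_count=5)
```

The resulting K varies per dataset, which is why the Softmax output dimension differs between Flickr8K, Flickr30K and MSCOCO.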
(g) Generated words and corresponding word index in vocabulary.

Figure 8: Example images with bidirectionally generated captions (forward / backward):
(a) "A woman in a tennis court holding a tennis racket." / "A woman getting ready to hit a tennis ball."
(b) "A living room with a couch and a table." / "Two chairs and a table in a living room."
(c) "A giraffe standing in a zoo enclosure with a baby in the background." / "A couple of giraffes are standing at a zoo."
(d) "A train is pulling into a train station." / "A train on the tracks at a train station."
Our work uses the LSTM implementation of  in the Caffe framework. All of our experiments were conducted on Ubuntu 14.04 with 16G RAM and a single Titan X GPU with 12G memory. Our LSTMs use 1000 hidden units, with weights initialized uniformly in [-0.08, 0.08]. The batch sizes are 150, 100, 100 and 32 for the Bi-LSTM, Bi-S-LSTM, Bi-F-LSTM and Bi-LSTM (VGG) models respectively. Models were trained with learning rate  (except for Bi-LSTM (VGG)), weight decay 0.0005 and momentum 0.9. Each model is trained for 18-35 epochs with early stopping. The code for this work can be found at https://github.com/deepsemantic/image_captioning.
We evaluate our models on two tasks: caption generation and image-sentence retrieval. For caption generation, we follow previous work in using BLEU-N (N = 1, 2, 3, 4) scores :

BLEU-N = min(1, e^{1 - r/c}) · exp( (1/N) Σ_{n=1}^{N} log p_n )

where r and c are the lengths of the reference sentence and the generated sentence, and p_n is the modified n-gram precision. We also report METEOR  and CIDEr  scores for further comparison. For image-sentence retrieval (image query sentence and vice versa), we adopt R@K (K = 1, 5, 10) and Med r as evaluation metrics: R@K is the recall rate at the top K candidates and Med r is the median rank of the first retrieved ground-truth image or sentence. All metric scores are computed by the MSCOCO caption evaluation server (https://github.com/tylin/coco-caption), which is commonly used for the image captioning challenge (http://mscoco.org/home/).
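A minimal single-reference, sentence-level BLEU sketch illustrates the metric (for exposition only; the scores reported in this paper come from the MSCOCO evaluation server, which uses multiple references and corpus-level statistics):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference sentence BLEU: geometric mean of modified
    n-gram precisions times the brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, and any missing n-gram drives the geometric mean sharply down, which is why B-3 and B-4 are much harsher than B-1.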
The aim of this set of experiments is to visualize the properties of the proposed bidirectional LSTM model and explain how it generates sentences word by word over time.
First, we examine the temporal evolution of the internal gate states to understand how bidirectional LSTM units retain valuable context information and attenuate unimportant information. Figure 6 shows the input and output data, the patterns of the three sigmoid gates (input, forget and output) as well as the cell states. We can clearly see that dynamic states are periodically distilled into the units over the time steps. At the first time step, the input data are sigmoid-modulated by the input gate to values lying within [0, 1], and the forget gate values of the different LSTM units are zero. As the time step increases, the forget gate starts to decide which unimportant information should be forgotten while retaining the useful information. The memory cell states and the output gate then gradually absorb valuable context information over time, yielding a rich representation of the output data.
Next, we examine how visual and textual features are embedded into the common semantic space and used to predict words over time. Figure 7 shows the evolution of the hidden units at different layers. At the T-LSTM layer, the units are conditioned on textual context from the past and the future; this layer acts as the encoder of forward and backward sentences. At the M-LSTM layer, LSTM units are conditioned on both visual and textual context; this layer learns the correlations between the input word sequence and the visual information encoded by the CNN. At a given time step, by removing unimportant information that contributes little to correlating the input words with the visual context, the units tend to exhibit sparse patterns and learn more discriminative representations of the inputs. At the top layer, the embedded multimodal representations are used to compute the probability distribution of the next predicted word with the Softmax. It should be noted that, for a given image, the number of words in the sentences generated in the forward and backward directions can differ.
Figure 8 presents some example images with generated captions, from which we found some interesting patterns in the bidirectional captions. (1) They cover different semantics: for example, in (b) the forward sentence captures "couch" and "table" while the backward one describes "chairs" and "table". (2) They describe static scenes and infer dynamics: in (a) and (d), one caption describes the static scene while the other presents a potential action or motion that could happen at the next time step. (3) They generate novel sentences: a significant proportion (88%, measured by randomly selecting 1,000 images from the MSCOCO validation set) of generated sentences are novel (they do not appear in the training set), yet they remain highly similar to the ground-truth captions. For example, in (d) the forward caption is similar to one ground-truth caption ("A passenger train that is pulling into a station") and the backward caption is similar to another ("a train is in a tunnel by a station"). This illustrates that our model has a strong capability for learning visual-language correlations and generating novel sentences.
Now we compare with state-of-the-art methods. Table 1 presents the comparison in terms of BLEU-N. Our approach achieves very competitive performance on the evaluated datasets, even with the less powerful AlexNet visual model. We can see that increasing the depth of the LSTM is beneficial for the generation task: the deeper variant models mostly obtain better performance than Bi-LSTM, although they are inferior to the latter in B-3 and B-4 on Flickr8K. We conjecture this is because Flickr8K is a relatively small dataset, making it difficult to train deep models with limited data. One interesting finding is that stacking multiple LSTM layers is generally superior to an LSTM with a fully connected transition layer, although Bi-S-LSTM needs more training time. Replacing AlexNet with VggNet brings significant improvements on all BLEU metrics. We note that a recent work  achieves the best results by integrating an attention mechanism [19, 42] into this task. Although we believe that incorporating such a mechanism into our framework could bring further improvements, our current Bi-LSTM model already achieves the best or second-best results on most metrics, with only a small performance gap remaining between our model and Hard-Attention .
A further comparison on METEOR and CIDEr scores is plotted in Figure 9. Without integrating object detection or a more powerful vision model, our model (Bi-LSTM) outperforms DeepVS by a certain margin: it achieves 19.4/49.6 METEOR/CIDEr on Flickr8K (compared to 16.7/31.8 for DeepVS) and 16.2/28.2 on Flickr30K (15.3/24.7 for DeepVS). On MSCOCO, our Bi-S-LSTM obtains 20.8/66.6 for METEOR/CIDEr, which exceeds DeepVS's 19.5/66.0.
|          |               | Image to Sentence          | Sentence to Image          |
| Datasets | Methods       | R@1  | R@5  | R@10 | Med r | R@1  | R@5  | R@10 | Med r |
|          | Kiros et al.  | 13.5 | 36.2 | 45.7 | 13    | 10.4 | 31.0 | 43.7 | 14    |
|          | Kiros et al.  | 18.0 | 40.9 | 55.0 | 8     | 12.5 | 37.0 | 51.5 | 10    |
|          | Kiros et al.  | 14.8 | 39.2 | 50.9 | 10    | 11.8 | 34.0 | 46.3 | 13    |
|          | Kiros et al.  | 23.0 | 50.7 | 62.9 | 5     | 16.8 | 42.0 | 56.5 | 8     |
For retrieval evaluation, we focus on image-to-sentence retrieval and vice versa. This is an instance of cross-modal retrieval [6, 30, 41], which has been a hot research subject in the multimedia field. Table 2 illustrates our results on the different datasets. The performance of our models exceeds the compared methods on most metrics, or matches existing results. On a few metrics, our model did not beat Mind's Eye , which combined image and text features in ranking (making the task closer to multimodal retrieval), or NIC , which employed a more powerful vision model, a large beam size and model ensembles. While adopting the more powerful VggNet visual model yields significant improvements across all metrics, even with the less powerful AlexNet our results remain competitive on some metrics, e.g. R@1 and R@5 on Flickr8K and Flickr30K. We also note that on the relatively small Flickr8K dataset, the shallow model performs slightly better than the deeper ones on the retrieval task, in contrast to the results on the other two datasets. As explained before, we think deeper LSTM architectures are better suited to ranking on large datasets, which provide enough training data for training more complex models; otherwise, overfitting occurs. Increasing data variation with our data augmentation techniques can alleviate this to a certain degree, but we foresee further significant gains as the number of training examples grows, reducing the reliance on augmentation. Figure 10 presents some examples from the retrieval experiments: for each caption (image) query, sensible images and descriptive captions are retrieved, showing that our models capture the visual-textual correlation needed for image and sentence ranking.
Figure 10: Examples of image retrieval (top) and caption retrieval (bottom) with Bi-S-LSTM on the Flickr30K validation set. Queries are marked in red and the top-4 retrieved results in green.
Efficiency. In addition to superior performance, our models also possess high computational efficiency. Table 3 presents the computational costs of the proposed models. We randomly select 10 images from the Flickr8K validation set and run the caption generation and image-to-sentence retrieval tests 5 times each; the table shows the average time costs over the 5 runs, excluding network initialization. The cost of caption generation includes computing the image feature, sampling bidirectional captions and computing the final caption. The cost of retrieval covers computing image-sentence pair scores (10×50 pairs in total) and ranking the sentences for each image query. As can be seen from Tables 1, 2 and 3, the deep models have only slightly higher time consumption while yielding significant improvements, and our proposed Bi-F-LSTM strikes a balance between performance and efficiency.
Challenges of exact comparison. It is challenging to make a direct, exact comparison with related methods due to differences in dataset division on MSCOCO. In principle, testing on a smaller validation set can lead to better results, particularly on the retrieval task. Since we strictly follow the dataset splits of , we compare to it in most cases. Another challenge is the visual model used for encoding image inputs: different works employ different models, so to make a fair and comprehensive comparison we use the commonly adopted AlexNet and VggNet.
We proposed a bidirectional LSTM model that generates descriptive sentences for images by taking both history and future context into account. We further designed deep bidirectional LSTM architectures to embed images and sentences in a high-level semantic space for learning visual-language models. We also qualitatively visualized the internal states of the proposed model to understand how the multimodal LSTM generates words over consecutive time steps. The effectiveness, generality and robustness of the proposed models were evaluated on numerous datasets, where they achieve highly competitive or state-of-the-art results on both generation and retrieval tasks. Our future work will focus on exploring more sophisticated language representations (e.g. word2vec) and incorporating multitask learning and attention mechanisms into our model. We also plan to apply our model to other sequence learning tasks such as text recognition and video captioning.
Cross-modal retrieval with correspondence autoencoder. In ACM MM, pages 7-16, 2014.
Composing simple image descriptions using web-scale n-grams. In CoNLL, pages 220-228, 2011.
Multimodal learning with deep Boltzmann machines. In NIPS, pages 2222-2230, 2012.