Code for paper "Image Caption Generation with Text-Conditional Semantic Attention"
Attention mechanisms have attracted considerable interest in image captioning due to their powerful performance. However, existing methods use only visual content as attention, and whether textual context can improve attention in image captioning remains unsolved. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our proposed method allows joint learning of the image embedding, text embedding, text-conditional attention and language model with one network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The experimental results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of our text-conditional attention in image captioning.
[20, 25, 36, 33]. Basically, it requires machines to automatically describe the content of an image using an English sentence. While this task seems effortless for humans, it is complicated for machines, since it requires the language model to capture various kinds of semantic information within an image, such as objects’ motions and actions. Another challenge for image captioning, especially for generative models, is that the generated output should be human-like natural sentences.
Recent literature in image captioning is dominated by neural network-based methods [6, 31, 36, 15]. The idea originates from the encoder-decoder architecture in Neural Machine Translation: a CNN encodes the input image into a feature vector, and a sequence model such as an LSTM decodes the feature vector into a sequence of words. Most recent work in image captioning relies on this structure and leverages image guidance, attributes, or region attention as extra input to the LSTM decoder for better performance. The intuition comes from visual attention, which has long been known in psychology and neuroscience. For image captioning, this means the image guidance to the language model should change over time according to the context.
However, attention-based methods overlook the following two aspects. First, attending to the image is only half of the story; watching what you just said comprises the other half. In other words, visual evidence can be inferred and interpreted from textual context, especially when the visual evidence is ambiguous. For example, in the sentence “After dinner, John is comfortably lying on the sofa and watching TV”, the objects “sofa” and “TV” are naturally inferred even with weak visual evidence (see Figure 1, image credits: http://www.mirror.co.uk/ and http://newyork.cbslocal.com/). Despite its importance, textual context has not been a focus of attention models. Existing attention-based methods such as [36, 33, 34] use implicit text guidance from an LSTM hidden layer to determine which image regions or attributes to attend to. However, as the previous example shows, the object of attention might be only partially observable, so the attention input could be misleading. This is not the case for our attention model, since the textual features are tightly coupled with the image features so that each compensates for the other. While Jia et al. use a joint embedding of the text and image as the guidance for the LSTM, their approach has pre-specified guidance that is fixed over time and has a linear form. In contrast, our method systematically incorporates time-dependent text-conditional attention, from 1-gram to n-gram and even to the sentence level.
Second, existing attention-based methods separate CNN feature learning (trained for a different task, i.e. image classification) from LSTM text generation. This leads to a representational disconnect between the features learned and the text generated. For instance, the attention model proposed by You et al. uses a weighted sum of visual attributes to guide the image captioning, while the attributes come from a dedicated predictor that is separate from the language model. As a result, the attribute guidance cannot adapt to the textual context, which ultimately compromises the end-to-end learning ability that the paper claims.
To overcome the above limitations, we propose a new text-conditional attention model based on the time-dependent gLSTM. Our model can interpret image features based on textual context, and it is end-to-end trainable. The model learns a text-conditional embedding matrix between CNN image features and previously generated text. Given a target image, the proposed model generates LSTM guidance by directly conditioning the image features on the current textual context. The model hence learns how to interpret image features given the textual content it has recently generated. If it conditions the image features on the one previous word, it is a 1-gram word-conditional model. If it conditions on the previous two words, we get a 2-gram word-conditional model. Similarly, we can construct an n-gram word-conditional model. The extreme version of our text-conditional model is the sentence-conditional model, which takes advantage of all the previously generated words.
We implement our model (code available at https://github.com/LuoweiZhou/e2e-gLSTM-sc) based on NeuralTalk2, an open-source implementation of Google NIC. We compare our methods with state-of-the-art methods on the commonly used MS-COCO dataset with publicly available splits of training, validation and testing sets. We evaluate methods on standard metrics as well as human evaluation. Our proposed methods outperform the state-of-the-art approaches across different evaluation metrics and yield reasonable attention outputs.
The main contributions of our paper are as follows. First, we propose text-conditional attention, which allows the language model to learn text-specified semantic guidance automatically. The proposed attention model learns how to focus on parts of the image feature given the textual content it has generated. Second, the proposed method demonstrates a less complicated way to achieve end-to-end training of attention-based captioning models, whereas state-of-the-art methods [15, 34, 36] involve LSTM hidden states or image attributes for attention, which compromises the possibility of end-to-end optimization.
Recent successes of deep neural networks in machine translation [29, 3] have catalyzed the adoption of neural networks for image captioning. Early neural network-based image captioning work includes the multimodal RNN and LSTM. In these methods, neural networks are used for both image-text embedding and sentence generation. Various methods have been shown to improve performance with region-level information [8, 16], external knowledge, and even question answering. Our method differs from them by considering attention from textual context in caption generation.
Attention mechanisms have recently attracted considerable interest in LSTM-based image captioning [34, 15, 36, 33]. Xu et al. propose a model that integrates visual attention through the hidden state of the LSTM model. You et al. and Wu et al. tackle the semantic attention problem by fusing visual attributes extracted from images with the input or the output of the LSTM. Even though these approaches achieve state-of-the-art performance, their performance relies heavily on the quality of the pre-specified visual attributes, i.e., better attributes usually lead to better results. Our method also uses an attention mechanism, but we consider explicit time-dependent text attention and use a clean architecture that eases end-to-end learning.
Early works in image captioning focus on either template-based methods or transfer-based methods. Template-based methods [20, 22, 35, 26, 7, 12] specify templates and fill them with visual evidence detected in target images. In Kulkarni et al., visual detections are first put into a graphical model with higher-order potentials from text corpora to reduce noise, then converted to language descriptions based on pre-specified templates. In Yang et al., a quadruplet consisting of noun, verb, scene and preposition is used to describe an image. The drawback of these methods is that the descriptions are not vivid and human-crafted templates do not work for all images. Transfer-based methods [9, 21, 4] rely on image retrieval to assign the target image descriptions of similar images in the training set. A common issue is that they are less robust to unseen images.
Sentences generated by the LSTM model may lose track of the original image content, since the model accesses the image content only once at the beginning of the learning process and forgets the image after even a short period of time. Therefore, Jia et al. propose an extension of the LSTM model, named the guiding LSTM (gLSTM), which extracts semantic information from the target image and feeds it into the LSTM model at every time step as extra information. The basic gLSTM unit is shown in Fig. 2. Its memory cell and gates are defined as follows:
where the $W$s denote weights, $\odot$ represents element-wise multiplication, $\sigma(\cdot)$ is the sigmoid function, $\tanh(\cdot)$ is the hyperbolic tangent function, $x$ stands for the input, $i$ for the input gate, $f$ for the forget gate, $o$ for the output gate, $c$ for the state of the memory cell, $h$ for the hidden state (also the output for a one-layer LSTM), and $g$ represents the guidance information, which is time-invariant. The subscripts denote time: $t$ is the current time step and $t-1$ is the previous time step.
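For reference, the gLSTM update described above can be written in the following form. This is a reconstruction in assumed notation, following the common gLSTM formulation rather than a verbatim copy of the original equation: the weight names $W_{\cdot x}$, $W_{\cdot h}$, $W_{\cdot g}$ and the omission of bias terms are assumptions, and some LSTM variants apply $\tanh$ to $c_t$ before the output gate.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{ix} x_t + W_{ih} h_{t-1} + W_{ig}\, g\right) \\
f_t &= \sigma\left(W_{fx} x_t + W_{fh} h_{t-1} + W_{fg}\, g\right) \\
o_t &= \sigma\left(W_{ox} x_t + W_{oh} h_{t-1} + W_{og}\, g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{cx} x_t + W_{ch} h_{t-1} + W_{cg}\, g\right) \\
h_t &= o_t \odot c_t
\end{aligned}
```

The only difference from a plain LSTM is the extra guidance term $W_{\cdot g}\, g$ added inside every gate and inside the cell update.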
Our text-conditional attention model is based on a time-dependent gLSTM (td-gLSTM). We first describe the td-gLSTM in Sec. 3.1 and show how to obtain semantic guidance through this structure. Then, we introduce our text-conditional attention model and its variants, e.g. n-gram word- and sentence-conditional models, in Sec. 3.2.
The gLSTM described in Sec. 2.1 has time-invariant guidance. Jia et al. show three ways of constructing such guidance, including an embedding of the joint image-text feature by linear CCA. However, the textual context in a sentence changes constantly while the caption generator is producing the sentence. The guidance therefore needs to evolve over time, and hence we propose the td-gLSTM. Notice that, despite the simple change in structure, the td-gLSTM is much more flexible in the way it incorporates guidance, e.g. time-series dynamic guidance such as tracking and actions in a video. Also notice that the gLSTM is a special case of the td-gLSTM in which the guidance is the same at every time step, i.e. $g_t = g$.
Our proposed td-gLSTM consists of three parts: 1) image embedding; 2) text embedding; and 3) LSTM language model. Figure 3 shows an overview of using the td-gLSTM for captioning. First, the image feature vector $I$ is extracted using a CNN, and each word in the caption is represented by a one-hot vector $S_t$, where $t$ indicates the index of the word in the sentence. We use the text embedding matrix $W_e$ to embed the text feature $S_t$ into a latent space, which is the input $x_t$ of the LSTM language model. The text embedding matrix is initialized from a zero-mean Gaussian distribution with standard deviation 0.01. In addition, the text feature is jointly embedded with the image feature to produce the time-dependent guidance $g_t$. Here, we do not specify the particular form of this joint embedding, to keep the framework general; its choices are discussed in Sec. 3.2.
Both the guidance and the embedded text features are used as inputs to the td-gLSTM, as shown in Fig. 2 (including the parts in red) and formulated as follows:
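As a rough illustration of the td-gLSTM update, the following NumPy sketch performs one time step in which the guidance vector changes with the time step. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `td_glstm_step` and `init_params`, the parameter layout, the absence of bias terms, and the output rule `h_t = o_t * c_t` are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def td_glstm_step(x_t, g_t, h_prev, c_prev, params):
    """One td-gLSTM step (hypothetical layout).

    params maps a gate name in "ifoc" to a tuple (W_x, W_h, W_g) of weight
    matrices applied to the input x_t, the previous hidden state h_prev and
    the time-dependent guidance g_t.
    """
    def pre_activation(gate):
        W_x, W_h, W_g = params[gate]
        return W_x @ x_t + W_h @ h_prev + W_g @ g_t

    i_t = sigmoid(pre_activation("i"))  # input gate
    f_t = sigmoid(pre_activation("f"))  # forget gate
    o_t = sigmoid(pre_activation("o"))  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(pre_activation("c"))  # memory cell
    h_t = o_t * c_t  # hidden state (assumed form, no tanh on c_t)
    return h_t, c_t


def init_params(d_x, d_h, d_g, rng):
    """Small random weights for every gate (illustrative initialization)."""
    def mat(cols):
        return 0.01 * rng.standard_normal((d_h, cols))

    return {gate: (mat(d_x), mat(d_h), mat(d_g)) for gate in "ifoc"}
```

Passing the same `g_t` at every step recovers the time-invariant gLSTM, which matches the special-case relationship described in the text.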
We back-propagate the error through the guidance $g_t$ to fine-tune the CNN. One significant benefit is that this allows the guidance information to become more similar to its corresponding text description. Note that the text-conditional guidance $g_t$ is time-dependent and changes at every time step. The outputs of the language model are the log likelihoods of each word in the target sentence, followed by a Softmax function for normalization. We use the regularized cross-entropy loss function:
where $I$ represents the image, $S$ represents the sentence, $S_t$ denotes the $t$-th word in the sentence, $S_N$ is the stop sign, $\theta$ denotes all the weights in the convolutional net, and $\lambda$ controls the importance of the regularization term. Finally, we back-propagate the loss to the LSTM language model, the text embedding matrix and the image embedding CNN. The training details are described in Sec. 4.1.
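The regularized cross-entropy loss described above can be sketched as follows. This is an illustrative version under assumptions: the name `caption_loss`, summing (rather than averaging) the per-word terms, and using a squared L2 penalty on the CNN weights are choices made here, not details confirmed by the text.

```python
import numpy as np


def caption_loss(logits, targets, cnn_weights, lam):
    """Negative log-likelihood of the target words plus an L2 penalty.

    logits:      (T, V) array of unnormalized word scores, one row per step.
    targets:     (T,) array of target word indices, ending with the stop sign.
    cnn_weights: iterable of arrays holding the CNN weights (hypothetical).
    lam:         weight of the regularization term.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Sum of the log-probabilities assigned to the ground-truth words.
    nll = -log_probs[np.arange(len(targets)), targets].sum()
    # Squared L2 norm of all CNN weights (assumed form of the regularizer).
    l2 = sum(float(np.sum(w ** 2)) for w in cnn_weights)
    return nll + lam * l2
```

With uniform scores over a vocabulary of size V, each word contributes log(V) to the loss, which gives a quick sanity check for an implementation.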
Recently, You et al. use visual attributes as semantic attention to guide image captioning. Their semantic guidance consists of the top visual attributes of the input image, and the weight of each attribute is determined by the current word, which is the previous output of the RNN. However, the attribute predictor adopted in their model has no learning ability and is separated from the encoder-decoder language model. In contrast, following the td-gLSTM model (see Sec. 3.1), we condition the guidance information on the current word (its one-hot vector representation), and use the text-conditional image feature as the semantic guidance. The benefits are twofold: first, the model can learn which part of the semantic image feature should be focused on when seeing a specific word; second, this structure is end-to-end tunable, such that the CNN weights are tuned for captioning rather than for image classification. For instance, when the caption generator has generated the sequence “a woman is washing”, its attention on the image feature should automatically switch to objects that can be washed, such as clothes and dishes.
We first consider modeling the text-conditional guidance feature $g_t$ as the weighted sum of the outer product of the image feature $I$ and the text feature $S_t$, so that each entry in $g_t$ is represented as:
where $I_i$ denotes the $i$-th entry of the image feature, $S_{t,j}$ denotes the $j$-th entry of the text feature, and $g_{t,k}$ is the $k$-th entry of the text-conditional guidance feature. For each $g_{t,k}$, the corresponding weights form a 2-D tensor; hence, the total weights for $g_t$ form a 3-D tensor. In this model, the image feature is fully coupled with the text feature through the 3-D tensor.
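The fully coupled model can be sketched with a single tensor contraction, which also makes the parameter cost visible. This is a minimal sketch: the names `bilinear_guidance` and `bilinear_param_count` and the index order of the weight tensor are assumptions, and the sizes in the usage note are hypothetical.

```python
import numpy as np


def bilinear_guidance(W, image_feat, text_feat):
    """Guidance from a 3-D weight tensor over the image-text outer product.

    Computes g[k] = sum_{i, j} W[k, i, j] * image_feat[i] * text_feat[j].
    W has shape (d_g, d_img, vocab); text_feat is a one-hot word vector.
    """
    return np.einsum("kij,i,j->k", W, image_feat, text_feat)


def bilinear_param_count(d_g, d_img, vocab):
    """Number of weights in the 3-D tensor."""
    return d_g * d_img * vocab
```

For hypothetical sizes of a 512-dimensional guidance, a 512-dimensional image feature and a 10,000-word vocabulary, `bilinear_param_count(512, 512, 10000)` gives 2,621,440,000 weights, which illustrates why the full tensor is impractical.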
Although Eq. 4 fully couples the two types of features, it results in a huge number of parameters, which prohibits its use in practice. To overcome this, we introduce an embedding matrix $W_c$, which contains various text-to-image masks. Furthermore, in practice, adding one non-linear transfer function layer after the image-text feature embedding boosts performance. Therefore, we model the text-conditional feature as a text-based mask on the image feature followed by a non-linear function:
where $W_c$ is the text-conditional embedding matrix and $\Phi(\cdot)$ is a non-linear transfer function. When $W_c$ is an all-one matrix, the conditioned feature is identical to $I$. We transfer the pre-trained gLSTM model to initialize the CNN, language model and word embedding of our attention model. We initialize the text-conditional matrix with all ones. We show the sensitivity of our model to various transfer functions in Sec. 4.2.
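The mask-based model can be sketched as follows, assuming the form $g_t = \Phi(I \odot W_c S_t)$, where the one-hot word vector selects one column of the embedding matrix as a mask on the image feature. The function name `word_conditional_guidance`, the matrix layout (one column per word) and the default choice of Tanh are assumptions for illustration.

```python
import numpy as np


def word_conditional_guidance(W_c, image_feat, word_idx, phi=np.tanh):
    """1-gram word-conditional guidance (assumed form).

    W_c:        (d_img, vocab) text-conditional embedding matrix; column j
                is the mask associated with word j.
    image_feat: (d_img,) CNN image feature.
    word_idx:   index of the previously generated word.
    phi:        non-linear transfer function applied element-wise.
    """
    # Selecting a column is equivalent to W_c @ one_hot(word_idx).
    mask = W_c[:, word_idx]
    return phi(image_feat * mask)
```

With an all-one `W_c` and an identity `phi`, the function returns the image feature unchanged, which mirrors the all-ones initialization described in the text. The matrix needs only `d_img * vocab` weights instead of the `d_g * d_img * vocab` of the full tensor.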
The above model is the 1-gram word-conditional semantic attention model, since the guidance feature is conditioned only on the previous word. Similarly, we develop the 2-gram word-conditional model, which utilizes the previous two words, and more generally the n-gram word-conditional model. The extreme version of the text-conditional model is the sentence-conditional model, which takes advantage of all the previously generated words:
One benefit of the text-conditional model is that it allows the language model to learn semantic attention automatically through back-propagation of the training loss. Attribute-based methods, in contrast, represent semantic guidance by some major components of an image, and other semantic information, such as objects’ motions and locations, is discarded.
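The n-gram and sentence-conditional variants described above can be sketched by combining the masks of several previous words. How the masks are combined is not shown in the text; averaging them is an assumption made here, as are the function name `text_conditional_guidance` and its parameters.

```python
import numpy as np


def text_conditional_guidance(W_c, image_feat, prev_words, n=None, phi=np.tanh):
    """Guidance conditioned on several previous words (illustrative).

    prev_words: list of indices of all previously generated words.
    n:          number of most recent words to use; None uses all of them,
                which corresponds to the sentence-conditional variant.
    The masks of the selected words are averaged (an assumption).
    """
    if len(prev_words) == 0:
        raise ValueError("at least one previous word is required")
    context = list(prev_words) if n is None else list(prev_words)[-n:]
    # Average the mask columns of the context words.
    mask = W_c[:, context].mean(axis=1)
    return phi(image_feat * mask)
```

Calling the function with `n=1` reduces to the 1-gram word-conditional case, and `n=None` uses every previously generated word.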
We use the MS-COCO dataset with the commonly adopted splits: 113,287 images for training, 5,000 images for validation and 5,000 images for testing. Three standard evaluation metrics, namely BLEU, METEOR and CIDEr, are used in addition to human evaluation. We implement our model based on NeuralTalk2, which is an open-source implementation of Google NIC. We use three different CNNs in our experiments: 34-layer and 200-layer ResNets and the 16-layer VGGNet. For a fair comparison, we use the 34-layer ResNet when analyzing the variants of our models in Tables 1 and 2, the 16-layer VGGNet when comparing to state-of-the-art methods in Tables 5 and 6, and the 200-layer ResNet for the leaderboard competition in Table 7. The variation in performance across different CNNs is also evaluated in Table 3.
We train our model in three steps: 1) train time-invariant gLSTM (ti-gLSTM) without CNN fine-tuning for 100,000 iterations; 2) train ti-gLSTM with CNN fine-tuning for 150,000 iterations; and 3) train td-gLSTM with initialized text-conditional matrix but without CNN fine-tuning for 150,000 iterations. The reason for this multiple-step training is described in Vinyals et al. : jointly training the system at the initial time causes noise in the initial gradients coming from LSTM that corrupts the CNN unrecoverably. For the hyper-parameters, we set the CNN weight decay rate ( in Eq. 3) to to avoid overfitting. The learning rate for CNN fine-tuning is set to and the learning rate for language model is set to . We use Adam optimizer  for updating weights with and . We adopt and for beam sizes during inference, as recommended by recent studies [32, 6]. The whole training process takes about one day on a single NVIDIA TITAN X GPU.
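The three-step training schedule can be written down as a small configuration. This is only a sketch: the `Stage` class and its field names are hypothetical, and hyper-parameters such as learning rates are omitted because their values are not given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure)."""
    model: str          # which captioning model is trained
    finetune_cnn: bool  # whether gradients flow into the CNN
    iterations: int     # number of training iterations


# Stage order and iteration counts follow the description in the text.
SCHEDULE = (
    Stage(model="ti-gLSTM", finetune_cnn=False, iterations=100_000),
    Stage(model="ti-gLSTM", finetune_cnn=True, iterations=150_000),
    Stage(model="td-gLSTM", finetune_cnn=False, iterations=150_000),
)


def total_iterations(schedule):
    """Total number of iterations across all stages."""
    return sum(stage.iterations for stage in schedule)
```

Keeping the CNN frozen in the first stage reflects the stated motivation: noisy early gradients from the LSTM would otherwise corrupt the CNN.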
N-gram vs. Sentence. Table 1 shows results with n-gram word- and sentence-conditional models. For conciseness, we only use BLEU@4, METEOR and CIDEr as the evaluation metrics, since they are more correlated with human judgments than low-level BLEU scores. In general, word-conditional models with higher grams yield better results, especially for METEOR. Notice that the -gram models achieve considerably better results than the 1-gram model, which is reasonable, as the 1-gram model has the least context, which limits the attention performance. Furthermore, the sentence-conditional model outperforms all word-conditional models in all metrics, which shows the importance of long-term word dependency in attention modeling.
Transfer Function. We use a non-linear transfer function in our attention model (see Eq. 5) and we test four different functions: Softmax, ReLU, Tanh and Sigmoid. In all cases, we initialize the text-conditional embedding matrix with noise from a one-mean Gaussian distribution with standard deviation . We base our experiments on the sentence-conditional model and conclude that the model achieves the best performance when the transfer function is Tanh or ReLU (see Table 2). Notice that transfer functions other than the four we tested may lead to better results.
Image Encoding. We study the impact of image encoding CNNs on captioning performance, as shown in Table 3. In general, the more sophisticated the image encoding architecture, the higher the captioning performance.
|Word|Nearest neighbors|
|---|---|
|dog|bear, three, woman, cat, girl, person|
|banana|it, carrots, fruits, six, onto, includes|
|red|UNK, blue, three, several, man, yellow|
|sitting|standing, next, are, sits, dog, woman|
|man|woman, person, his, three, are, dog|
It is essential to verify whether our learned text-conditional attention is semantically meaningful. Each column in the text-conditional matrix is an attention mask for image features, and it corresponds to a word in our dictionary. Similar words are expected to have similar masks (with some variations). To verify this, we calculate the similarities among masks using Euclidean distance. We show five randomly sampled words from different parts of speech (noun, verb and adjective). Table 4 shows their top few nearest words. Most of the neighbors are related to the original word, and some of them are strongly related, such as “cat” for “dog”, “blue” for “red”, “sits” for “sitting”, and “woman” for “man”. This is strong evidence that our model is learning meaningful text-conditional attention.
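The nearest-neighbor check described above can be sketched as follows. The function name `nearest_words`, the vocabulary representation (a list of words aligned with the matrix columns) and the default of six neighbors are assumptions for illustration.

```python
import numpy as np


def nearest_words(W_c, vocab, query, k=6):
    """Words whose masks are closest to the query word's mask.

    W_c:   (d_img, vocab_size) matrix; column j is the mask of vocab[j].
    vocab: list of words aligned with the columns of W_c.
    query: the word whose neighbors are requested.
    k:     number of neighbors to return.
    """
    q = vocab.index(query)
    # Euclidean distance between the query column and every column.
    dists = np.linalg.norm(W_c - W_c[:, [q]], axis=0)
    # Sort by distance and drop the query word itself.
    order = [j for j in np.argsort(dists) if j != q]
    return [vocab[j] for j in order[:k]]
```

Running this for every sampled word against the learned matrix would reproduce the kind of listing shown in Table 4.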
We use LSTM with time-invariant image guidance (img-gLSTM) and NeuralTalk2, an implementation of Google NIC, as baselines. We also compare to a state-of-the-art non-attention-based model, LSTM with semantic embedding guidance (emb-gLSTM). Furthermore, we compare our method to a set of state-of-the-art attention-based methods, including visual attention with soft- and hard-attention, and semantic attention with visual attributes (ATT-FCN). For a fair comparison among different attention models, we report our results with the 16-layer VGGNet, since it is similar to the image encodings used in other methods.
Table 5 shows the comparison results. Our methods, both 1-gram word-conditional and sentence-conditional, outperform our two baselines in all metrics by a large margin, ranging from 1% to 5%. The results are strong evidence that 1) our td-gLSTM is better suited for captioning than the time-invariant gLSTM; and 2) modeling textual context is essential for image captioning. Also, our methods yield much higher evaluation scores than emb-gLSTM, showing the effectiveness of using textual content in our model.
We further compare our text-conditional methods with state-of-the-art attention-based methods. For the 1-gram word-conditional method, the attention on the image feature guidance is determined only by the previously generated word, which results in a loss of semantic information. Even so, its performance is still on par with or better than state-of-the-art attention-based methods, such as Hard-Attention and ATT-FCN. We then upgrade the word-conditional model to the sentence-conditional model, which leads to improved performance in all metrics and outperforms state-of-the-art methods in most metrics. It is worth noting that the BLEU@1 score is related to single-word accuracy and is highly affected by word vocabularies. This might explain our relatively low BLEU@1 score compared with hard-attention.
We choose three methods for human evaluation: NeuralTalk2, img-gLSTM and our sentence-conditional attention model. A cohort of five well-trained human annotators performed the experiments. Each annotator was shown 500 pairs of randomly selected images and the three corresponding generated captions. The annotators rate the three captions from 0 to 3 regarding content quality and grammar (the higher the better). For content quality, a score of 3 is given if the caption describes all the important content, e.g. objects and actions, in the image; a score of 0 is given if the caption is totally wrong or irrelevant. For grammar, a score of 3 denotes human-level natural expression and a score of 0 means the caption is unreadable. The results are shown in Table 6. Our proposed sentence-conditional model leads the baseline img-gLSTM by a large margin of 28.2% in caption content quality, and leads the baseline NeuralTalk2 by 3.1%, showing the effectiveness of our attention mechanism in captioning. As for grammar, all the methods create human-like sentences with a few grammar mistakes, and adding sentence-conditional attention to the LSTM yields a slightly higher grammar score, due to the explicit textual information contained in the LSTM guidance input.
Figure 4 shows qualitative captioning results. The six images in the first three rows are positive examples and the last two are failed cases. Our proposed model can better capture details in the target image, such as “yellow fire hydrant” in the second image and “soccer” in the fifth image. Also, the text-conditional attention discovers rich context information in the image, such as “preparing food” followed by “kitchen” in the first image, and “in their hand” followed by “holding” in the sixth image. However, we also show the failed cases, where the objects are mistakenly inferred from the previous words. For the first failed image, when we feed in the word sequence “a man (is) sitting”, our text-conditional attention is triggered by things that a man can sit on; a sofa is a reasonable candidate according to the training data. Similarly, for the second failed image, the model is trained on some images of a stuffed animal held by a person, which in some sense biases the semantic attention model.
We test our model on the MS-COCO leaderboard competition and summarize the results in Table 7. Our method outperforms the baseline (NeuralTalk2) across all the metrics and is on par with state-of-the-art methods. It is worth noting that our baseline is an open-source implementation of Google NIC, shown as OriolVinyals in Table 7, but the latter performs much better due to better CNNs, inference methods, and more careful engineering. Also, several methods implausibly outperform human-annotated captions, which reveals a drawback of the existing evaluation metrics.
In this paper, we propose a semantic attention mechanism for image caption generation, called text-conditional semantic attention, which provides explicitly text-conditioned image features for attention. We also improve the existing gLSTM framework by introducing time-dependent guidance, opening up a new way to further boost image captioning performance. Our experiments show that the proposed methods significantly improve on the baseline method and outperform state-of-the-art methods, which supports our argument for explicitly modeling text-conditional attention.
Future Work. There are several ways in which we can further improve our method. First, text-conditional attention could be combined with region-based or attribute-based attention, so that the model can learn to attend to regions in feature maps or to attributes extracted from the image. Second, one common issue with supervised training is overfitting. As Vinyals et al. pointed out, we cannot access enough training samples, even with a relatively large dataset such as MS-COCO. One possible solution is to combine weakly annotated images with the current dataset. We leave these for future work.
IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Empirical Methods in Natural Language Processing, 2014.
Supervised sequence labelling with recurrent neural networks. 2012.
Densecap: Fully convolutional localization networks for dense captioning. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.