Naturalistic description of an image is one of the primary goals of computer vision and has recently received much attention in the field of artificial intelligence. It is a high-level task, far more complicated than fundamental recognition tasks such as image classification, object detection and recognition. It requires the system to comprehensively understand the content of an image and to bridge the gap between the image and natural language. Automatically generating image descriptions is useful in multimedia retrieval and image understanding.
Some pioneering research has been carried out in generating image descriptions. However, as has been pointed out, most of these models rely on hard-coded visual concepts and sentence templates, which limits their generalization capability. Recently, with the rapid development of deep learning in image recognition and natural language processing, the current trend in image captioning is to follow the encoder-decoder framework, which is similar to that used in neural machine translation. Most of these approaches represent the image as a single feature vector taken from the top layer of a pre-trained convolutional neural network (CNN) and cascade a recurrent neural network (RNN) to generate language.
In fact, tasks such as image captioning and machine translation can be considered structured output problems, in which the task is to map an input to an output that possesses its own structure. An inherent challenge in these tasks is that the structure of the output is closely related to the structure of the input. Hence, a key problem in these tasks is alignment. In neural machine translation, for example, a neural model has been trained to softly align the output to the input. Subsequent research applied the visual attention model to address this problem in image captioning, with much improvement. The visual attention mechanism dynamically selects the relevant receptive fields in the CNN features to facilitate the generation of the image description; in other words, it aligns the output words with spatial regions of the source image. In this paper, we also employ the visual attention mechanism for image captioning.
Nevertheless, natural language often contains very detailed descriptions, which correspond to fine-grained objects in an image. As has been pointed out, most existing neural model-based schemes are limited by their exclusive use of a global, image-level feature representation. Some fine-grained objects might not be recognized when relying only on global image features. In this paper, we propose to use a pre-trained object detection model, i.e., Faster RCNN, to retrieve fine-grained image features from the top detected objects. These fine-grained object features provide complementary information to the global image representation, as demonstrated in the experiments. In terms of the model structure, the object features are also processed by a visual attention mechanism and added to the original model to form a hierarchical feature representation, which enables the model to generate more detailed descriptions.
In addition to improving the image feature representation, we also seek to improve the current language model, which is widely used in neural machine translation and image captioning. An issue with most previous language models is the training framework, namely, an RNN trained with Maximum Likelihood Estimation (MLE) to generate image descriptions. As has been pointed out, MLE approaches suffer from the so-called exposure bias at the inference stage: the model generates a sequence iteratively and predicts the next token based on previously predicted tokens, which may never have been observed in the training data. In image description generation, MLE also suffers from the problem that the generated language does not correlate well with human assessment of quality.
Instead of relying only on MLE, an alternative scheme is the generative adversarial network (GAN). The GAN was first proposed to generate realistic images. It learns a generative model without explicitly defining a loss function from the target distribution. Instead, it introduces a discriminator network that tries to differentiate real samples from generated samples, and the whole network is trained using an adversarial training strategy. For image captioning, one can build a discriminator to judge how realistic the samples generated by the description generator are. The role of the caption generator in this model is similar to that of the generator in the conditional GAN, which is conditioned on the image features.
However, language generation is a discrete process. Directly providing discrete samples as inputs to the discriminator does not allow gradients to be back-propagated through them. The reinforcement learning (RL) framework provides a way to estimate the gradients of such discontinuous units. When dealing with sequence generation, however, the RL framework has the known problem of lacking an intermediate reward: the reward value can only be obtained once the whole sequence has been generated. This is not suitable, since what we want is the long-term reward of each intermediately generated token, so that the whole sequence is better optimized.
In the proposed scheme, the discriminator takes into account not only the differences between the generated captions and the reference captions but also the consistency between the captions and the image features. Through the evaluation of the discriminator, the networks can better compensate for unrealistic captions that might be generated under MLE training. To deal with the discreteness of language, we treat the image caption generator as an RL agent, and the feedback from the discriminator is treated as the reward for the generator. To update the parameters of the image description generator in this framework, we consider the generator as a stochastic parameterized policy. We train the policy network using Policy Gradient, which naturally solves the differentiability problem in the conventional GAN. Also, to solve the problem of lacking intermediate rewards, we borrow an idea from the famous “AlphaGo” program, in which a Monte Carlo roll-out strategy is applied to sample the expected long-term reward for an intermediate move. If we consider the generation of a sequence token as the action to be taken in RL, we can apply a similar Monte Carlo roll-out strategy to obtain the intermediate rewards. Monte Carlo roll-out has already been applied successfully in sequence generation. In this paper, we use a similar sampling method to deal with intermediate rewards during caption generation.
To summarize, our contribution in this paper is threefold:
We propose a hierarchical attention mechanism to reason on the global features and the local object features for image captioning.
The policy gradient algorithm combined with the GAN is proposed for the training and optimization of the language model, with improvements over the MLE training scheme.
Through comprehensive experiments, we validate the proposed algorithm and achieve results comparable with current state-of-the-art methods on the MSCOCO dataset.
II Related Work
II-A Deep Model-based Image Captioning
Promoted by the recent success of deep learning networks in image recognition and machine translation, research on generating image descriptions, or image captioning, has made remarkable progress. As mentioned above, most previously proposed approaches treat image description generation as a translation process, mainly by borrowing the encoder-decoder framework from neural machine translation. Generally, this paradigm uses a deep CNN model as the image encoder, which maps the image into a static feature representation, and an RNN as the decoder, which decodes this static representation into an image description. The whole framework is trained using supervised learning under MLE. The generated description should be grammatically correct and match the content of the image.
Specifically, Karpathy et al. proposed an alignment model based on a multi-modal embedding layer. This model is able to align parts of a description with the corresponding regions of the image, and it has attracted significant attention. Jia et al. proposed a variation of LSTM, called gLSTM, for the image captioning task, mainly to tackle the problem of losing track of the image content. This model takes semantic information along with the whole image as inputs to generate captions. Donahue et al. combined convolutional layers and recurrent layers to form a Long-term Recurrent Convolutional Network (LRCN) for visual recognition and description.
Bahdanau et al. pointed out that a potential problem with this approach is that the model must compress all the necessary information of a source sentence into a fixed-length representation, which may make it difficult for the neural network to cope with long sentences. The static feature representation in the encoder-decoder framework, for both machine translation and image captioning, cannot automatically retrieve relevant information from the source, which ultimately degrades performance. In neural machine translation, Bahdanau et al. proposed a soft attention mechanism that enables the decoder to automatically focus on the relevant parts of the source sentence. In computer vision, the attention mechanism has long been the focus of much research, since human perception does not tend to process a whole scene in its entirety at once but instead selectively focuses on the information needed. A comprehensive study of hard attention, trained with reinforcement learning, and soft attention for image captioning was published by Xu et al.
Yao et al. tackled the video captioning task by capturing global temporal structures among video frames with a temporal attention mechanism, which makes the model dynamically focus on the key frames that are most relevant to the predicted word. Attention Models (ATT), developed by You et al., first extracted semantic concept proposals and fused them with RNNs into hidden states and outputs. This method used K-NN and multi-label ranking to extract semantic concepts or attributes and fused these concepts into one vector using an attention mechanism. Similarly, Yao et al. embedded attributes with image features into an RNN using various methods to boost image captioning performance. Recently, Chen et al. proposed to combine spatial attention and channel-wise attention for image captioning, with improved results. Alternatively, Li et al. proposed a global-local attention mechanism that includes local features extracted from the top detected objects of a pre-trained object detector. Inspired by this work, we also include the local features from the top detected objects. However, we build a hierarchical model, whereas they treated local and global features equivalently.
II-B Policy Gradient Optimization for Image Captioning
Another approach to boosting the performance of language tasks is to compensate for the so-called exposure bias problem in RNN-based MLE learning. As has been pointed out, RNNs are trained by MLE, which essentially minimizes the KL-divergence between the distribution of target sequences and the distribution defined by the model. This KL-divergence objective tends to favour a model that overestimates its smoothness, which can lead to unrealistic samples.
To tackle these problems and generate more realistic image descriptions, some studies directly use evaluation metrics such as BLEU, METEOR and ROUGE as the reward signal and build the model under the RL framework. For instance, Ranzato et al. were the first to use the policy gradient algorithm in an RNN-based sequence model, in which a REINFORCE-based approach was used to calculate the sentence-level reward and a Monte Carlo technique was employed for training. Liu et al. studied several linear combinations of the evaluation metrics and proposed to use a linear combination of SPICE and CIDEr as the reward signal, applying a policy gradient algorithm to optimize the model, with improved results. This research used a Monte Carlo roll-out strategy to obtain the intermediate reward during description generation. More recently, Bahdanau et al. applied a token-level reward, instead of a sentence-level reward, in temporal difference training for sequence generation.
As discussed previously, the GAN estimates a difference measure using a binary classifier, called a discriminator, which discriminates between target samples and generated samples. GANs rely on back-propagating these difference estimates through the generated samples to train the generator to minimize the differences. Hence, the whole network in a GAN is trained in an adversarial way. The GAN was originally proposed to generate naturalistic images. Directly applying a GAN to language problems is problematic, since sequences are composed of discrete elements in many application areas such as machine translation and image captioning.
A possible solution to the discreteness problem of language is to use the Gumbel-Softmax approximation. For instance, Shetty et al. used a GAN to generate more realistic and accurate image descriptions, with the aid of Gumbel-Softmax to deal with the discontinuity issue in language processing. Another, more general solution is to borrow an idea from the RL framework, in which the feedback from the discriminator is considered as the reward for the language generator. Dai et al. built a model based on the conditional GAN to generate diverse and naturalistic image descriptions and paragraphs, using a policy gradient for optimization. Yu et al. proposed a model called SeqGAN, which unified the GAN framework and the RL learning problem and has recently received much attention. They proposed a three-step training strategy, which consists of pre-training the generator, pre-training the discriminator and the final adversarial training. In this paper, inspired by SeqGAN, we propose to use a discriminator to judge the fitness of the generated image descriptions with reference to the image content, and we apply the policy gradient optimization technique to train the model. Unlike the original SeqGAN, our discriminator not only considers the differences between the target language and the model-generated language but also the coherence of the language with the image content.
In this section, we describe the proposed method based on two parts: the hierarchical attention mechanism and the policy gradient optimization algorithm.
III-A Hierarchical Attention Mechanism
The hierarchical attention mechanism consists of two parts: a spatial attention mechanism which corresponds to global CNN features and a local attention mechanism which corresponds to object features.
The spatial attention mechanism is based on a previously proposed soft attention model. Specifically, the model comprises an encoder and a decoder. We use a convolutional neural network pre-trained on the ImageNet dataset to extract a set of convolutional features. These features, denoted as , correspond to certain portions of the 2-D image. We extract convolutional features instead of fully connected ones in order to build a spatial attention mechanism, since convolutional features have a spatial layout.
The Long Short-Term Memory (LSTM) network, originally proposed by Hochreiter and Schmidhuber, is applied as the language decoder because of its superior performance in natural language processing.
In Equation 1, , , , and are the input gate, forget gate, output gate, cell memory and hidden state of a LSTM network, respectively. and are the input and the output of the LSTM model. is the context vector, which can be processed by the soft attention mechanism and is able to capture visual information associated with a certain input location. The soft attention mechanism has to automatically allocate adaptive weights for the image locations to facilitate the task at hand.
where . Equation 2 actually maps the image features from each location, along with information from the hidden state, into an adaptive weight, which indicates the importance of each image location for the recognition.
Then, Equation 3 normalizes the adaptive weights into probability values between 0 and 1 using the Softmax function. Once these weights (which sum to 1) are computed, we multiply each image feature vector by its weight and sum the results to obtain the context vector, as expressed in Equation 4. This can be seen as the expectation of the feature maps under the attention weights.
Then the context vector is forwarded to the LSTM network to generate captions, as described in Equation 1. This soft attention mechanism is able to adaptively select the relevant visual parts of the given image features and thus facilitate the recognition.
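The normalization and weighted-sum steps described above can be illustrated with a minimal sketch. It assumes the unnormalized relevance scores have already been computed; the function names `softmax` and `soft_attention` are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, scores):
    """Return (weights, context) for a list of location features.

    features: one feature vector per image location
    scores:   one unnormalized relevance score per location (assumed given)
    """
    weights = softmax(scores)
    dim = len(features[0])
    # The context vector is the weight-averaged feature vector.
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context

# Two locations with equal scores receive equal weights.
weights, context = soft_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With equal scores the context vector reduces to the plain average of the location features; larger scores shift it towards the corresponding locations.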
The local attention mechanism is formulated using object features and another LSTM model. We use a pre-trained object detector to retrieve the top detected object features, which are denoted as . We then use another LSTM model with soft attention to allocate adaptive weights to each of these features.
where indicates the hidden state of the LSTM model for the local attention mechanism.
Similarly, Equation 6 normalizes the adaptive weights for local features to a probability value with the Softmax function.
Equation 7 shows that the context vector for the local attention model captures information from both the local features and the global attention mechanism, which are combined by concatenating the features. This context vector is then forwarded to a second LSTM model, as described by Equation 8.
The two LSTM models, denoted as for the global features and for the local features are jointly trained to map the hierarchical feature representation with language. is at a higher level, which can be used to decode the hidden states for the final outputs. However, the gradient vanishing problem cannot be avoided if we only use the hidden states from to decode information. Inspired by  in which a shortcut in network connections is applied to solve the gradient vanishing problem, we concatenate the hidden states from and to decode and map the hidden states to language vectors, which can be seen in Equation 9.
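As a rough illustration of the shortcut described above, the sketch below concatenates two hidden-state vectors and maps them to a distribution over words. The single linear layer followed by a Softmax, and the names `decode_word_distribution`, `weight` and `bias`, are assumptions made for this example rather than the exact form of Equation 9.

```python
import math

def decode_word_distribution(h_global, h_local, weight, bias):
    """Map the concatenated hidden states to word probabilities.

    weight has one row per vocabulary word and len(h_global) + len(h_local) columns.
    """
    joint = list(h_global) + list(h_local)  # shortcut: keep both hidden states
    logits = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weight, bias)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: one-dimensional hidden states and a two-word vocabulary.
probs = decode_word_distribution([1.0], [2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because the lower-level hidden state reaches the output layer directly, its gradient does not have to pass through the second LSTM.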
In MLE training, if the length of a sentence is , the loss function can be formulated as in Equation 10, which is the sum of the log likelihood of each word.
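A minimal sketch of this loss is given below. It assumes the decoder outputs a probability distribution over the vocabulary at every step, and it negates the summed log-likelihood so that lower values are better; the sign convention and the name `mle_loss` are assumptions of this example.

```python
import math

def mle_loss(step_probs, target_ids):
    """Negative sum of the log-likelihood of each reference word.

    step_probs: one probability distribution over the vocabulary per time-step
    target_ids: index of the reference word at each time-step
    """
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))

# Two-step caption over a two-word vocabulary.
loss = mle_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```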
III-B Policy Gradient Optimization
In addition to MLE training of the image caption generator, we apply a policy gradient optimization algorithm in the RL framework to alleviate the previously discussed exposure bias problem in RNN-based MLE training and to increase the quality of the generated descriptions.
We feed both the generated descriptions and the reference descriptions to the discriminator. The level of coherence between the descriptions and the image content is calculated by the dot product, which is forwarded to the discriminator, as described in Fig. 3. This operation takes into account the coherence between the captions (sequences) and the corresponding image features, which makes the generated captions more realistic and naturalistic. During the training of the discriminator, the reference sequences are labeled as true whilst the generated sequences are labeled as false. The model is also an LSTM network with a Softmax Cross Entropy loss. Hence, the discriminator outputs the probability of a sample being true. These probabilities are then used as the reward signal in the RL framework, and are utilized in the Policy Gradient algorithm for updating the parameters of the image caption generator.
Following previous work, the objective of the policy network (the image caption generator) is to generate a sequence from the start state that maximizes its expected long-term reward, as described by Equation 11:
where is the reward for a complete sequence. is the action-value function of a language sequence, which is defined as the expected accumulative reward starting from state , taking a certain action, and then following policy .
The action-value function is estimated using the REINFORCE algorithm and uses the probability of being real, as estimated by the discriminator, as the reward, as defined in Equation 12.
As can be seen in Equation 12, the discriminator only provides a reward for a complete sequence. We should care not only about the reward for a complete sequence but also about the long-term reward at future time-steps, since the long-term reward is what we actually want. Similar to the game of Go, in which the agent sometimes gives up an immediate interest in favour of the final victory, we apply a Monte Carlo roll-out strategy to an intermediate state, i.e., an unfinished sequence. We represent an N-time Monte Carlo search as in Equation 13.
where is the generated sequence tokens and is the Monte Carlo sampled based on a roll-out policy, which, in our case, is set as the same as the image caption generator for convenience. In reality, we can use any policy to perform the roll-out operation. is the output of the LSTM decoder. MC is defined as a sampling procedure from a Multinomial distribution.
If there is no intermediate reward, the Monte Carlo roll-out strategy can sample the possible future tokens N times and average the resulting rewards to estimate the reward, as described in Equation 14.
The Monte Carlo roll-out strategy can be better visualized in Fig. 4.
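The roll-out procedure can be sketched as follows. This is a toy illustration under stated assumptions: `sample_next` stands in for the roll-out policy, `reward_fn` stands in for the discriminator, and all function and parameter names are hypothetical.

```python
import random

def rollout_reward(prefix, sample_next, reward_fn, max_len, n_rollouts, end_token, rng):
    """Estimate the long-term reward of an unfinished sequence by Monte Carlo roll-out.

    prefix is a list of token ids; a finished sequence is scored directly.
    """
    finished = len(prefix) >= max_len or prefix[-1:] == [end_token]
    if finished:
        return reward_fn(list(prefix))
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        # Complete the sequence by sampling from the roll-out policy.
        while len(seq) < max_len and seq[-1:] != [end_token]:
            seq.append(sample_next(seq, rng))
        total += reward_fn(seq)
    return total / n_rollouts  # average reward over the sampled completions

def sample_token(probs, rng):
    """Draw one token id from a multinomial distribution over the vocabulary."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy usage: uniform roll-out policy over 4 tokens, reward = share of token 3.
rng = random.Random(0)
estimate = rollout_reward(
    prefix=[1, 3],
    sample_next=lambda seq, r: sample_token([0.25, 0.25, 0.25, 0.25], r),
    reward_fn=lambda seq: seq.count(3) / len(seq),
    max_len=6, n_rollouts=8, end_token=2, rng=rng,
)
```

Increasing `n_rollouts` reduces the variance of the estimate at the cost of more forward passes through the generator and the discriminator.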
Once the reward value from the discriminator is obtained, the generator is ready to be updated. The goal is to maximize the average reward starting from the initial state, as defined in Equation 15.
Since the expectation can be approximated by sampling, we can now update the parameters of the image caption generator using Equation 17.
In practice, we can use advanced gradient algorithms such as RMSprop and Adam  in training the caption generator.
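To make the update concrete, the sketch below computes a REINFORCE-style gradient for a toy policy that is parameterized directly by per-step logits. This parameterization, the plain gradient-ascent step and the function names are assumptions for illustration; the actual generator is an LSTM trained with an adaptive optimizer.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_logit_gradients(step_logits, actions, rewards):
    """Gradient of sum_t reward_t * log pi(action_t) with respect to each step's logits."""
    grads = []
    for logits, action, reward in zip(step_logits, actions, rewards):
        probs = softmax(logits)
        # d log pi(a) / d logit_k = 1[k == a] - pi(k)
        grads.append([reward * ((1.0 if k == action else 0.0) - p)
                      for k, p in enumerate(probs)])
    return grads

def ascent_step(step_logits, grads, learning_rate):
    """Plain gradient ascent on the expected reward."""
    return [[v + learning_rate * g for v, g in zip(logits, grad)]
            for logits, grad in zip(step_logits, grads)]

logits = [[0.0, 0.0]]
grads = reinforce_logit_gradients(logits, actions=[0], rewards=[1.0])
updated = ascent_step(logits, grads, learning_rate=0.1)
```

A positive reward raises the probability of the sampled token, and the size of the change scales with the reward.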
The image caption generator and discriminator are adversarially trained in the framework of GAN . In GAN , the discriminator can pass the gradient directly to the generator. Due to the discreteness of the sequence generation, we apply RL to estimate the gradient of the generator in our model.
Specifically, the training strategy is described in Algorithm 1. We initially pre-train the image caption generator using MLE. In practice, this is equivalent to minimizing the Cross Entropy loss. Hence, the pre-training step can be set up in the same way as in prior work. The trained model is used to generate captions, which are treated as fake samples and, along with the reference captions, are fed into the discriminator for training. Similarly, the discriminator is pre-trained for a certain number of steps. The next steps are the adversarial training steps, in which the image caption generator and the discriminator are trained alternately until the networks converge.
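The order of the three training stages can be sketched as a simple schedule. The phase names are hypothetical, and a fixed number of adversarial rounds replaces the convergence check purely for illustration.

```python
def adversarial_schedule(g_pretrain_steps, d_pretrain_steps, adv_rounds, d_steps, g_steps):
    """Yield the phase executed at every step of the three-stage training strategy."""
    for _ in range(g_pretrain_steps):
        yield "pretrain_G_mle"           # generator pre-training with MLE
    for _ in range(d_pretrain_steps):
        yield "pretrain_D"               # discriminator pre-training on real vs. generated captions
    for _ in range(adv_rounds):          # assumed fixed count instead of "until convergence"
        for _ in range(d_steps):
            yield "train_D"
        for _ in range(g_steps):
            yield "train_G_policy_gradient"

plan = list(adversarial_schedule(2, 3, 2, d_steps=5, g_steps=1))
```

Setting `d_steps=1` or `d_steps=5` with `g_steps=1` corresponds to the two adversarial settings compared in the experiments.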
In addition to the sentence comparison scheme introduced previously and shown in Fig. 2, we also employ a scheme to evaluate the coherence between the generated captions and the image content. Specifically, both the global features and the local object features are processed by average pooling to obtain a fixed-size feature representation. The captions, as in the sentence comparison scheme, are also encoded into a fixed-size vector using an LSTM model. The dot product of the two vectors is then passed through a logistic function to obtain the reward for RL training, as shown in Fig. 3.
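A minimal sketch of this coherence reward is shown below. It assumes that the pooled image vector and the caption encoding share the same dimensionality and that all image features are passed as one list of equally sized vectors; the names `average_pool`, `logistic` and `coherence_reward` are hypothetical.

```python
import math

def average_pool(features):
    """Average a list of equally sized feature vectors into one vector."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]

def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def coherence_reward(image_features, caption_vector):
    """Dot product of pooled image features and caption encoding, squashed to (0, 1)."""
    pooled = average_pool(image_features)
    score = sum(p * c for p, c in zip(pooled, caption_vector))
    return logistic(score)

reward = coherence_reward([[1.0, 0.0], [0.0, 1.0]], [2.0, 2.0])
```

A caption encoding that is orthogonal to the pooled image vector yields a reward of 0.5, and better-aligned encodings push the reward towards 1.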
IV Experimental Validation
IV-A Dataset Introduction
We conduct our experiments using the MSCOCO dataset. To be consistent with previous research, we use the MSCOCO 2014 release, which includes roughly 123,000 images in its training and validation sets combined. The dataset contains 82,783 images in the training set, 40,504 images in the validation set and 40,775 images in the test set. As the ground truth for the MSCOCO test set is not available, the validation set is further split into a validation subset for model selection and a test subset for local experiments. This is the “Karpathy” split. It utilizes all 82,783 training set images for training, and selects 5,000 images for validation and 5,000 images for testing from the official validation set. The standard evaluation protocol comprises BLEU, METEOR, CIDEr and ROUGE-L.
BLEU is the most popular metric for performance evaluation in machine translation. The metric is based only on n-gram statistics. BLEU-1, BLEU-2, BLEU-3 and BLEU-4 measure the performance for 1-grams, 2-grams, 3-grams and 4-grams, respectively. METEOR is based on the harmonic mean of unigram precision and recall, and seeks correlation at the corpus level. CIDEr evaluates the generated sentences against human consensus. ROUGE-L measures the longest common subsequence between the target sentence and the generated sentence.
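Two of the building blocks behind these metrics can be sketched in a few lines: the clipped n-gram precision used by BLEU and the longest common subsequence used by ROUGE-L. This is only an illustration with hypothetical function names; the full metrics add further components (for example, BLEU's brevity penalty and geometric mean over n-gram orders), and reported scores should come from the standard evaluation tools.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Share of candidate n-grams found in a reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

candidate = "a dog runs on the grass".split()
reference = "a dog is running on the grass".split()
p1 = clipped_precision(candidate, [reference], 1)
p2 = clipped_precision(candidate, [reference], 2)
lcs = lcs_length(candidate, reference)
```

In this example five of the six candidate unigrams and three of the five candidate bigrams appear in the reference, and the longest common subsequence has five words.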
IV-B Implementation Details
For all the images in the COCO dataset, we obtain global convolutional features (from the layer “res5c”) using a pre-trained Residual-152 network on the Caffe platform, with a dimensionality of . We also retrieve local object features using a Faster RCNN object detection network pre-trained on the MSCOCO dataset. Specifically, we obtain the top detected object features from the “FC6” layer of the VGG16 model used in Faster RCNN, with a dimensionality of . We build the hierarchical attention mechanism and policy gradient optimization on the TensorFlow platform.
IV-B1 Training the Faster RCNN on the MSCOCO dataset
In order to obtain better local object features, we train the Faster RCNN model on the MSCOCO object detection dataset. The model is first pre-trained on the ImageNet object detection dataset. The MSCOCO object detection dataset shares its images with the image caption task. Consequently, we keep the same splits as the image caption dataset for training. The training process on the MSCOCO dataset is almost the same as the pre-training on ImageNet. The initial learning rate is set to 0.001. The momentum of the stochastic gradient descent is set to 0.9 and the weight decay is set to 0.0005.
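For reference, one common formulation of a stochastic gradient descent step with momentum and weight decay is sketched below, using the hyper-parameters stated above as defaults. The exact update rule used by the training framework is not given here, so this form and the name `sgd_momentum_step` are assumptions.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum; weight decay is added to the gradient."""
    new_velocity = [momentum * v - lr * (g + weight_decay * w)
                    for w, g, v in zip(weights, grads, velocity)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Single parameter, starting from zero velocity.
w, v = sgd_momentum_step([1.0], [0.5], [0.0])
```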
IV-B2 Language Pre-processing
To pre-process the language, special symbols such as ‘.’, ‘,’, ‘(’, ‘)’ and ‘-’ are replaced with blank spaces, whilst ‘&’ is replaced with ‘and’. Since we set the maximum length of the descriptions to 20 words, we delete from the original dataset the reference captions that are longer than 20 words. To build the vocabulary, following publicly available open-source code, we include words that occur more than 5 times. We map the symbol ‘NULL’ to 0, ‘START’ to 1 and ‘END’ to 2.
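The pre-processing steps above can be sketched as follows. Splitting on whitespace, building the vocabulary after the length filter and assigning word ids in sorted order are assumptions of this example, and the function names are hypothetical.

```python
from collections import Counter

SPECIAL_TOKENS = {"NULL": 0, "START": 1, "END": 2}

def clean_caption(text):
    """Replace special symbols with spaces, '&' with 'and', then split into words."""
    for symbol in ".,()-":
        text = text.replace(symbol, " ")
    return text.replace("&", "and").split()

def build_vocab(token_lists, min_count=5):
    """Keep words seen more than min_count times; ids follow the special tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = dict(SPECIAL_TOKENS)
    for word in sorted(w for w, c in counts.items() if c > min_count):
        vocab[word] = len(vocab)
    return vocab

def prepare(captions, max_len=20, min_count=5):
    """Tokenize captions, drop those longer than max_len words, and build the vocabulary."""
    tokenized = [clean_caption(c) for c in captions]
    kept = [tokens for tokens in tokenized if len(tokens) <= max_len]
    return kept, build_vocab(kept, min_count)

# A lower threshold is used here only because the toy corpus is tiny.
kept, vocab = prepare(["a dog & a cat.", "a dog (small) on grass", "a " * 25], min_count=1)
```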
IV-B3 Training Details of the Model
The network is first pre-trained using MLE for 10 epochs. During training, the size of the hidden states of the two LSTM models is set to 512. We choose the same hidden-state size as in prior work, which achieved satisfactory performance with it. We set the batch size to 32 and the learning rate to 0.001, and we use the Adam algorithm to train the network. Subsequently, we train the discriminator for 2500 steps, followed by an adversarial training scheme in which the caption generator and the discriminator are trained alternately until convergence. The Adam algorithm is also applied during the pre-training of the discriminator and the policy gradient-based adversarial training described previously. The learning rate for these steps is set to 0.0001. Following publicly available open-source code, at training time we set the maximum length of the input sequence to 20 words. At test time, in contrast, we set the maximum length of a generated sequence to 30 words. During the training of the proposed model, we add a trainable word embedding layer from Google’s TensorFlow platform. All the experiments are conducted on a server equipped with an NVIDIA TITAN X GPU and running the Ubuntu 14.04 operating system.
IV-C1 Quantitative Evaluation
In this section, a comprehensive quantitative evaluation is conducted using different experimental settings on the MSCOCO dataset.
IV-C1a Comparison between the global attention, the local attention and the hierarchical attention model
We first obtain the results using only the global attention model, which is similar to the existing soft attention model. Since we use advanced CNN features from the Residual-152 model, the BLEU, METEOR, CIDEr and ROUGE-L results are all satisfactory, and are listed in Table I. We then test only the local attention model, using the detected object features from a Faster RCNN detector; its results, also listed in Table I, are much lower than those for the global attention model. One possible reason is that the Faster RCNN uses only the VGG16 model, which is not as powerful as the Residual-152 network. Another is that the local object features, despite their capability to provide complementary information to the global attention model, can miss many important features. Finally, we test our proposed hierarchical attention model under MLE training, which utilizes both global and local attention for image captioning. The results improve significantly on the baseline, as can be seen in Table I. Specifically, all seven evaluation metrics are improved by our hierarchical attention model.
IV-C1b The determination of the number of top detected objects
To determine the best number of top detected objects for the local attention model, we perform an ablation study. We extract the 10, 20 and 30 top detected object features and test them using the hierarchical attention model. The results can be seen in Table II. As the number increases from 10 to 30, the performance improves on six of the seven metrics, with only CIDEr peaking at 20 objects. Although the maximum length of our generated sentences is set to 30, not every word represents an object. Also, intuitively, an image contains at most about 30 objects. Hence, in the following experiments, we use the 30 top detected object features for the local attention model.
|Method||BLEU-1||BLEU-2||BLEU-3||BLEU-4||METEOR||CIDEr||ROUGE-L|
|Soft Attention||70.7||49.2||34.4||24.3||23.90||-||-|
|Hierarchical Attention with 10 Objects for Local Attention||70.601||50.423||36.643||25.389||24.633||87.316||55.241|
|Hierarchical Attention with 20 Objects for Local Attention||72.159||52.498||37.552||26.918||24.725||88.639||55.825|
|Hierarchical Attention with 30 Objects for Local Attention||72.611||52.769||37.802||27.243||24.731||88.140||56.048|
|Method||BLEU-1||BLEU-2||BLEU-3||BLEU-4||METEOR||CIDEr||ROUGE-L|
|MLE training only||72.611||52.769||37.802||27.243||24.731||88.140||56.048|
|PG with 2500 steps for pre-training D followed by 1 D and 1 G step||72.450||52.845||38.141||27.551||24.543||87.416||55.876|
|PG with 2500 steps for pre-training D followed by 5 D and 1 G step||72.104||52.739||38.122||27.602||24.928||89.072||56.063|
IV-C1c The performance of Policy Gradient with reward only from language comparison
Next, we start the reinforcement learning steps. We first train a discriminator that only compares the similarity between the reference sentence and the generated sentence. Specifically, we follow the model defined in Fig. 2. The discriminator is first trained for 2500 steps, which we find sufficient for it to converge. The loss curve of the image caption generator is shown in Fig. 8. After 2500 steps of pre-training the discriminator, the loss of the image caption generator starts to decline, which indicates that the policy gradient starts to work. We then train the generator and discriminator adversarially for one more epoch, and report the results in Table III. We also experiment with two different settings for the adversarial training steps. The first setting trains the discriminator for 1 step, followed by 1 step for the generator. The second setting trains the discriminator for 5 steps, followed by 1 step for the generator. We find that the final results of the two settings are similar, and both slightly improve on the MLE training baseline for several of the metrics. A likely reason for the improvement is that reinforcement learning alleviates the exposure bias problem of MLE training. However, this scheme lacks a measurement of the similarity between the generated descriptions and the image content, which prevents the image caption generator from generating more naturalistic and diverse descriptions.
|Method||BLEU-1||BLEU-2||BLEU-3||BLEU-4||METEOR||CIDEr||ROUGE-L|
|MLE training only||72.611||52.769||37.802||27.243||24.731||88.140||56.048|
|PG with similarity of global features (1 D and 1 G step)||72.250||52.290||37.099||26.331||23.815||84.516||55.238|
|PG with similarity of global features (5 D and 1 G step)||72.234||52.120||36.887||26.065||23.957||84.224||55.244|
|PG with similarity of global-local features (1 D and 1 G step)||73.036||53.688||39.069||28.551||25.324||92.449||56.539|
IV-C1d The performance of Policy Gradient with reward from the measurement of coherence between language and image content
To train the image caption generator to generate more naturalistic and diverse descriptions, we further test the model defined in Fig. 2. First, we extract only the global features and apply average pooling, resulting in a feature dimension of 2048. The discriminator then uses the dot product to measure the similarity between these image features and the language embedding features, and this similarity serves as the reward within the reinforcement learning framework. The experimental results from this model can be seen in Table IV.
However, the results on all seven metrics are even lower than the MLE training baseline. One possible reason is that the discriminator only uses the global features, which is inconsistent with the hierarchical attention model on the generator side. As can be seen from Table IV, the results from this model are similar to those of the global attention model, since the reward signal from the discriminator tends to force the generator to produce sentences that only match the global features.
We further build a model exactly as defined in Fig. 3. This model includes both the global image features and the local object features, and thus guarantees that the discriminator and the generator utilize the same information source. The final results can be seen in Table IV; they outperform all other experimental settings.
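The global-feature reward described above (average pooling to a 2048-dimensional vector, followed by a dot product with the language embedding) can be sketched as follows. This is a minimal illustration under stated assumptions, not the discriminator of the paper: the function name `global_feature_reward`, the 7x7 spatial grid, the assumption that the sentence embedding is also 2048-dimensional, and the sigmoid used to squash the score into (0, 1) are all introduced here for the example.

```python
import numpy as np


def global_feature_reward(conv_features: np.ndarray,
                          sentence_embedding: np.ndarray) -> float:
    """Score image/sentence coherence from average-pooled global features.

    conv_features: (H, W, 2048) CNN feature map.
    sentence_embedding: (2048,) vector (assumed to share the image
        feature dimension so that a dot product is defined).
    """
    # Average pooling over the spatial grid gives one 2048-d vector.
    pooled = conv_features.mean(axis=(0, 1))
    # Dot product measures the similarity between image and language features.
    score = float(pooled @ sentence_embedding)
    # Assumption: a sigmoid maps the raw score to a reward in (0, 1).
    return float(1.0 / (1.0 + np.exp(-score)))


# Toy inputs: a constant feature map and two opposite sentence embeddings.
features = np.ones((7, 7, 2048))
aligned = np.full(2048, 1e-3)

reward_aligned = global_feature_reward(features, aligned)
reward_opposed = global_feature_reward(features, -aligned)
print(reward_aligned > reward_opposed)
```

A global-local variant would need the detected object features as an additional input; the text does not specify how they are combined, so that part is left out of the sketch.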
|Method||BLEU-1||BLEU-2||BLEU-3||BLEU-4||METEOR||CIDEr||ROUGE-L|
|Google NIC||66.6||46.1||32.9||24.6||-||-||-|
|Spatial Attention||71.8||50.4||35.7||25.0||23.0||-||-|
|Semantic Attention||70.9||53.7||40.2||30.4||24.3||-||-|
|RL with G-GAN||-||-||30.5||29.7||22.4||79.5||47.5|
|RL with Embedding Reward||71.3||53.9||40.3||30.4||25.1||93.7||52.5|
To demonstrate the effectiveness of the proposed method, we compare our final results on the “Karpathy” test split with previously published results, which are shown in Table V. We list most of the published results on the “Karpathy” split, grouped into three categories. The first category corresponds to methods that use neither external information nor reinforcement learning. The best of them (SCA-CNN-ResNet) is the spatial and channel-wise attention model, in which both spatial and channel-wise attention mechanisms are utilized for image captioning. The methods in the second group use extra information during training. For instance, Semantic Attention utilizes rich extra data from social media to train the visual attribute predictor, and Deep Compositional Captioning (DCC) generates extra data to demonstrate its unique transfer capability. The third group corresponds to reinforcement learning techniques. RL with G-GAN applies a conditional GAN and policy gradient to generate image descriptions. Although their results on the evaluation metrics are not improved, they show that the generated captions are more diverse and naturalistic. Embedding Reward applies a policy network to generate captions and a value network to evaluate the reward. Additionally, they apply an advanced inference method called lookahead inference, together with beam search, during testing. They achieve the current state-of-the-art results on the “Karpathy” split. Although we use neither external knowledge nor any advanced inference technique such as beam search (we use greedy search in all of our experiments), we achieve results similar to the current state-of-the-art methods (Embedding Reward and SCA-CNN-ResNet), and obtain state-of-the-art results on three important metrics, BLEU-1, METEOR and ROUGE-L, on which we lead the other methods.
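For readers unfamiliar with the greedy search mentioned above, the following minimal sketch shows the idea: at each time step the single most probable word is kept, with no beam of alternatives. It is an illustration only. The names `greedy_decode` and `step_fn`, the four-word vocabulary, and the fixed transition table standing in for the caption generator are all hypothetical.

```python
import numpy as np
from typing import Callable, List, Sequence


def greedy_decode(step_fn: Callable[[Sequence[int]], np.ndarray],
                  start_id: int, end_id: int, max_len: int = 20) -> List[int]:
    """Pick the arg-max word at every step until the end token or max_len."""
    tokens = [start_id]
    for _ in range(max_len):
        probs = step_fn(tokens)          # distribution over the vocabulary
        next_id = int(np.argmax(probs))  # greedy choice: keep only the best word
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens[1:]  # drop the start token


# Toy stand-in for the generator: next-word probabilities depend only on
# the previous word. Vocabulary: 0=<s>, 1="a", 2="dog", 3=</s>.
TRANSITIONS = np.array([
    [0.0, 0.7, 0.2, 0.1],  # after <s>
    [0.0, 0.1, 0.8, 0.1],  # after "a"
    [0.0, 0.1, 0.1, 0.8],  # after "dog"
    [0.0, 0.0, 0.0, 1.0],  # after </s>
])


def toy_step(tokens: Sequence[int]) -> np.ndarray:
    return TRANSITIONS[tokens[-1]]


caption_ids = greedy_decode(toy_step, start_id=0, end_id=3)
print(caption_ids)  # [1, 2, 3]
```

Beam search would instead keep several partial captions per step, which costs more computation at test time.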
IV-C2 Qualitative Evaluation
In addition to the quantitative evaluation using the standard metrics, we qualitatively evaluate the proposed model by visualization. First, we plot some global attention maps corresponding to each generated word, as shown in Fig. 5. The figure shows that the attended regions generally correspond to the semantic meaning of the word generated at each time step. We then choose some examples to visualize the local attention weights on the detected objects, which are shown in Fig. 6. Because of limited space in the figure, we only show the top 10 detected objects and the corresponding attention weights obtained from the local attention mechanism. The detector can detect some fine-grained objects, which provide complementary information for the global attention mechanism. Finally, we show some of the sentences generated by different methods. Specifically, Fig. 7 shows the ground-truth sentences, the descriptions generated by the MLE training-based model, and those generated by the proposed model. The sentences in red are generated by the proposed model and are more accurate and naturalistic than those of the MLE-based model, which are shown in blue. In particular, the proposed model shows superior performance in finding the fine-grained properties of the image, since the RL model automatically measures the coherence between the sentences and the image content. For instance, in Fig. 7 (c), the proposed model correctly determines the gender of the person in the image, whilst the MLE training-based model gets it wrong.
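Selecting the objects to display for a local attention visualization can be sketched as below. The sketch assumes the objects are ranked by their local attention weight, which the text does not state explicitly; the helper name `top_k_attended_objects` and the example labels and weights are made up for illustration.

```python
import numpy as np
from typing import List, Sequence, Tuple


def top_k_attended_objects(labels: Sequence[str],
                           weights: Sequence[float],
                           k: int = 10) -> List[Tuple[str, float]]:
    """Return the k objects with the largest attention weights, best first."""
    w = np.asarray(weights, dtype=float)
    order = np.argsort(-w)[:k]  # indices sorted by descending weight
    return [(labels[i], float(w[i])) for i in order]


# Hypothetical detections and local attention weights for one time step.
labels = ["person", "dog", "frisbee", "grass"]
weights = [0.5, 0.3, 0.15, 0.05]

top2 = top_k_attended_objects(labels, weights, k=2)
print(top2)  # [('person', 0.5), ('dog', 0.3)]
```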
This paper targets the image captioning task, which is a fundamental problem in artificial intelligence. Building on the recent successes of deep learning, especially CNN feature representations and the LSTM with attention model, the paper proposes a hierarchical attention mechanism that considers not only the global image features but also detected object features, with improved results. A significant improvement over the current RNN-based MLE training has also been demonstrated. Specifically, a GAN framework with RL optimization is proposed for the image captioning task to generate more accurate and higher-quality captions. The discriminator evaluates the coherence and consistency between the generated sentences and the image content, thus providing the rewards for optimization. The whole model follows a three-step training strategy. Experimental analysis confirms the merits of the framework and identifies the key contributors to the improved performance. Results comparable with current state-of-the-art methods are achieved using only greedy inference, which supports the effectiveness of the training procedure.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  S. Tang, Y. T. Zheng, Y. Wang, and T. S. Chua, “Sparse ensemble learning for concept detection,” IEEE Transactions on Multimedia, vol. 14, no. 1, pp. 43–54, Feb 2012.
-  C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, “Learning consistent feature representation for cross-modal multimedia retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 370–381, March 2015.
-  S. Bu, Z. Liu, J. Han, J. Wu, and R. Ji, “Learning high-level feature by deep belief networks for 3-d model retrieval and recognition,” IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2154–2167, Dec 2014.
-  P. Liu, J. M. Guo, C. Y. Wu, and D. Cai, “Fusion of deep learning and compressed domain features for content-based image retrieval,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5706–5717, Dec 2017.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
-  R. Girshick, “Fast r-cnn,” in Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015, pp. 1440–1448.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  S. Tang, Y. Li, L. Deng, and Y. Zhang, “Object localization based on proposal fusion,” IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2105–2116, 2017.
-  G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, “Babytalk: Understanding and generating simple image descriptions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
-  H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt et al., “From captions to visual concepts and back,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1473–1482.
-  A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
-  K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014, pp. 103–111.
-  K. Cho, A. Courville, and Y. Bengio, “Describing multimedia content using attention-based encoder-decoder networks,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875–1886, 2015.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations (ICLR), 2015.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, 2015, pp. 2048–2057.
-  L. Li, S. Tang, Y. Zhang, L. Deng, and Q. Tian, “Gla: Global-local attention for image description,” IEEE Transactions on Multimedia, vol. PP, no. 99, pp. 1–1, 2017.
-  S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
-  B. Dai, S. Fidler, R. Urtasun, and D. Lin, “Towards diverse and natural image descriptions via a conditional gan,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2970–2979.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
-  M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
-  L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient.” in AAAI, 2017, pp. 2852–2858.
-  R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in neural information processing systems, 2000, pp. 1057–1063.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
-  J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-rnn),” arXiv preprint arXiv:1412.6632, 2014.
-  X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, “Guiding long-short term memory for image caption generation,” arXiv preprint arXiv:1509.04942, 2015.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625–2634.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
-  V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” in Advances in neural information processing systems, 2014, pp. 2204–2212.
-  J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
-  L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4507–4515.
-  Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.
-  T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 4904–4912.
-  L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5659–5667.
-  A. Goyal, N. R. Ke, A. Lamb, R. D. Hjelm, C. Pal, J. Pineau, and Y. Bengio, “Actual: Actor-critic under adversarial learning,” arXiv preprint arXiv:1711.04755, 2017.
-  I. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
-  A. Lavie and A. Agarwal, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation, 2005, pp. 65–72.
-  C.-Y. Lin and E. Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003, pp. 71–78.
-  M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” in International Conference on Learning Representations (ICLR), 2016.
-  S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of spider,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 873–881.
-  P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in European Conference on Computer Vision. Springer, 2016, pp. 382–398.
-  R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
-  D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, “An actor-critic algorithm for sequence prediction,” in International Conference on Learning Representations (ICLR), 2017.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
-  E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
-  C. J. Maddison, A. Mnih, and Y. W. Teh, “The concrete distribution: A continuous relaxation of discrete random variables,” arXiv preprint arXiv:1611.00712, 2016.
-  R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz, and B. Schiele, “Speaking the same language: Matching machine to human captions by adversarial training,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  M. J. Kusner and J. M. Hernández-Lobato, “Gans for sequences of discrete elements with the gumbel-softmax distribution,” arXiv preprint arXiv:1611.04051, 2016.
-  L. Wu, Y. Xia, L. Zhao, F. Tian, T. Qin, J. Lai, and T.-Y. Liu, “Adversarial neural machine translation,” arXiv preprint arXiv:1704.06933, 2017.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
-  T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
-  P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A tutorial on the cross-entropy method,” Annals of operations research, vol. 134, no. 1, pp. 19–67, 2005.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
-  X. Chen and C. L. Zitnick, “Mind’s eye: A recurrent visual representation for image caption generation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 2422–2431.
-  L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell, “Deep compositional captioning: Describing novel object categories without paired training data,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 1–10.
-  Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 290–298.