Joint Image Captioning and Question Answering

05/22/2018 ∙ by Jialin Wu, et al. ∙ The University of Texas at Austin

Answering visual questions requires acquiring everyday common knowledge and modeling the semantic connections among different parts of an image, which is too difficult for VQA systems to learn from images with answers as the only supervision. Meanwhile, image captioning systems that use beam search tend to generate similar captions and fail to describe images diversely. To address these issues, we present a system in which the two tasks complement each other and which jointly produces image captions and answers visual questions. In particular, we use question and image features to generate question-related captions, and use the generated captions as additional features that provide new knowledge to the VQA system. For image captioning, our system attains more informative results in terms of the relative improvement on VQA tasks, as well as competitive results on automated metrics. Applied to the VQA task, our system achieves 65.8% on the VQA v2 validation set using generated captions and 68.4% on the test-standard set, and an ensemble of 10 models reaches 69.7% on the test-standard set.




1 Introduction

In recent years, visual question answering (VQA) Antol et al. (2015) and image captioning Chen et al. (2015) have been widely but separately studied in both the computer vision and NLP communities. Most recent works Anderson et al. (2017); Singh et al. (2018); Lu et al. (2017); Rennie et al. (2017); Pedersoli et al. (2017) concentrate on designing attention modules that better gather image features for both tasks. These attention modules help the systems learn to focus on potentially relevant semantic parts of the images and improve performance to some extent.

However, the quality of the attention is not guaranteed, since there is no direct supervision on it. In top-down attention modules, the boundaries of the attended regions are often vague, so the modules fail to filter out noisy, irrelevant parts of the image. Even though the bottom-up attention mechanism Anderson et al. (2017) ensures clear object boundaries from object detection, it remains questionable whether we can attend to the appropriate semantic parts among multiple detected regions, given the limited number of detected object categories and the lack of supervision on the semantic connections among those objects. Fully understanding an image requires acquiring everyday common knowledge and modeling the semantic connections among its different parts, which goes beyond what attention modules can learn from images alone. Additional image descriptions can therefore be a helpful common-knowledge supplement to the attention modules. In fact, we find that captions are very useful for the VQA task. On the VQA v2 validation split, we answer questions using the ground truth captions without images and achieve 59.6% accuracy, which already outperforms a large number of VQA systems that use image features only.

Figure 1: Overall structure of our joint image captioning and VQA system. Our system takes questions, images, and captions as inputs and uses the joint representation of questions and images to generate question-related captions. We then use the joint representation of the three inputs to predict answers. The numbers around the rectangles indicate dimensions, and the operator symbols mark element-wise multiplication and element-wise addition. Blue arrows denote layers with learnable parameters and yellow arrows denote attention embedding.

On the other hand, image captioning systems that use beam search tend to generate short and general captions based on the most salient objects or scenes. The captions are therefore usually less informative, in that they fail to describe the images diversely and to capture complex relationships among different parts of the images. More heuristics, if provided, could help produce more specific and diverse captions. Since the VQA process requires awareness of various aspects of an image, visual questions can serve as additional heuristics that help the image captioner explore richer image content. To measure informativeness quantitatively, in addition to the automated evaluation metrics, we propose to use the relative improvement in VQA accuracy when the system takes the generated captions into consideration.

In this work, we propose to jointly generate question-related captions and answer visual questions, and we demonstrate that the two tasks complement each other well. On the one hand, the VQA task provides the image captioner with more heuristics during captioning. On the other hand, the captioning task feeds more common knowledge to the VQA system. Specifically, we use the joint representation of questions and images to generate question-related captions, which serve as additional inputs (hints) to the VQA system. Furthermore, generating these captions reduces the risk of learning from question bias Li et al. (2018a) and ignoring the image content, which can happen because high accuracy can sometimes be achieved from the questions alone. Meanwhile, questions, as novel heuristics, encourage the captioning system to generate captions that are question-related and helpful to the VQA process. To automatically choose question-related annotated captions as supervision during training, we propose an online algorithm that selects the captions maximizing the inner product between the gradients of the caption loss and of the answer loss with respect to the joint representation of images and questions.

For the evaluation of the joint system, we first evaluate the benefit of the additional caption inputs in the VQA process. Empirically, we observe a large improvement in answer accuracy over the BUTD Anderson et al. (2017) baseline model on the VQA v2 validation split Antol et al. (2015), from 63.2% to 69.1% with annotated captions from the COCO dataset Chen et al. (2015) and to 65.8% with generated captions. Our single model achieves 68.4% on the test-standard split, and an ensemble of 10 models reaches 69.7% on the same split. For the image captioning task, we show that our generated captions not only achieve promising results under the standard caption evaluation metrics but also provide more informative descriptions in terms of relative improvement on the VQA task.

2 Approach

We first describe the overall structure of our joint system in Sec. 2.1 and explain the underlying feature representation in Sec. 2.2. The VQA sub-system, which takes advantage of image captions to better understand the images and thus achieves better results on the VQA task, is introduced in Sec. 2.3. In Sec. 2.4, we explain the image captioning sub-system, which takes the question features as additional heuristics and generates more informative captions. Finally, the training details are provided in Sec. 2.5.

2.1 Overview

We introduce the overall structure of our joint system. As illustrated in Fig. 1, the system first produces the image features V using a bottom-up attention strategy, the question features q with a standard GRU, and the caption features c with our caption embedding module detailed in Sec. 2.2. The features q and c are then used to generate the visual attention that weights the image feature set V, producing the question-caption-attended image features. We sum these attended features with the caption features c and then perform element-wise multiplication with the question features q. Meanwhile, we use the question-attended image features as input to the image captioning sub-system to generate question-related captions.

2.2 Feature representation

We now describe the representation of the images, questions and captions, which serves as the foundation of the joint system. We write each learnable layer as a function of its input features and omit the weights and biases within the layers for simplicity; the layers do not share weights. The nonlinearity is Leaky ReLU He et al. (2015).

Image and Question Embedding
We adopt bottom-up attention Anderson et al. (2017), derived from the object detection task, to provide salient image regions with clear boundaries. In particular, we use a Faster R-CNN head Girshick (2015) in conjunction with a ResNet-101 base network He et al. (2016). To generate the output set of image features V, we take the final outputs of the model and perform non-maximum suppression (NMS) for each object category using an IoU threshold. A fixed number of detected objects is then extracted as the image features, as suggested in Teney et al. (2017).
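The per-category NMS step mentioned above can be illustrated with the standard greedy procedure. This is a minimal sketch, not the detector's actual implementation: the box format `(x1, y1, x2, y2)`, the function names, and the threshold value 0.5 in the example are assumptions chosen for illustration, since the threshold used in the paper is not recoverable from the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS for one object category; returns kept indices, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections of a single category.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)  # the second box overlaps the first and is dropped
```

In a full pipeline this routine would run once per object category, after which a fixed number of the highest-scoring surviving regions would be retained as the image feature set V.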

For the question embedding, we use a standard GRU and take the hidden state at the final time step as the question features q. Following Anderson et al. (2017), the question features q and the image feature set V are further embedded together to produce the question-attended image feature set via the question visual attention, as illustrated in Fig. 1.

Caption embedding
We present our novel caption embedding module in this section. It takes as inputs the question-attended image feature set, the question features q, and the captions of the image (each a sequence of words, indexed by caption), and generates the caption features c.

Figure 2: Overview of the caption embedding module. The Word GRU is used to generate attention that identifies the important and related words in each caption, and the Caption GRU outputs the caption embeddings. We use the question-attended image features to model the attention. Blue arrows denote layers with learnable parameters and yellow arrows denote attention embedding.

The goals of the caption module are to (1) provide additional descriptions that help the VQA system better understand the images, compensating for the insufficiency of detected objects and the lack of common knowledge; and (2) provide additional clues for modeling the semantic connections among the different image parts obtained from bottom-up attention.

To achieve that, as illustrated in Fig. 2, a Word GRU first sequentially encodes the word embeddings of each caption, producing a hidden state at every time step. We then design a caption attention module that uses the question-attended image feature set and the Word GRU hidden state to generate an attention weight on the current word, identifying its relevance in an online fashion. Specifically, for each input word, we use a standard GRU as the Word GRU to encode the word embedding (Eq. 2), and feed its output together with the question-attended image features to the attention module (Eq. 4).



In Eqs. 2-4, the attention uses the sigmoid function and runs over the image regions produced by bottom-up attention, and each word is represented by multiplying its one-hot encoding with the word embedding matrix.

After that, the attended words in each caption are used to produce the final caption representation via the Caption GRU (Eq. 5). Since the goal is to gather more information, we perform element-wise max pooling over the representations of all input captions of an image (Eq. 7) to obtain the caption features c.
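The pooling step can be sketched as follows. This is only an illustration of element-wise max pooling over per-caption vectors; the function name `max_pool_captions` and the toy three-dimensional vectors are assumptions, and in the actual system the inputs would be the Caption GRU outputs.

```python
def max_pool_captions(caption_vectors):
    """Element-wise max over a non-empty list of equal-length caption vectors."""
    if not caption_vectors:
        raise ValueError("need at least one caption representation")
    dim = len(caption_vectors[0])
    if any(len(v) != dim for v in caption_vectors):
        raise ValueError("all caption representations must share one dimension")
    # zip(*...) groups the values of each dimension across captions.
    return [max(column) for column in zip(*caption_vectors)]


# Hypothetical representations of three captions of the same image.
caption_reps = [
    [0.1, 0.9, -0.3],
    [0.4, 0.2, 0.0],
    [-0.5, 0.7, 0.6],
]
c = max_pool_captions(caption_reps)  # [0.4, 0.9, 0.6]
```

Max pooling keeps, for every feature dimension, the strongest response found in any caption, which matches the stated goal of gathering as much information as possible from the available captions.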

2.3 VQA sub-system

We describe our VQA sub-system in detail in this section. To better mine the semantic connections among different parts of an image, our VQA system takes the caption embeddings c as additional inputs to generate the caption-attended image attention (Eq. 9) and produce the question-caption-attended image features (Eq. 10).


To better incorporate knowledge from captions into the VQA reasoning process, we also sum the caption features c with the joint attended image features and then multiply the result element-wise with the question features to obtain the joint representation h (Eq. 11).
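The fusion described above (sum, then element-wise product) can be sketched in a few lines. This is a simplified illustration under the assumption that all three feature vectors already share one dimension; any learnable layers applied before or after the fusion are omitted, and the name `fuse` and the toy values are hypothetical.

```python
def fuse(v_qc, c, q):
    """Joint representation: (attended image features + caption features) * question features."""
    if not (len(v_qc) == len(c) == len(q)):
        raise ValueError("all feature vectors must share one dimension")
    return [(v + ci) * qi for v, ci, qi in zip(v_qc, c, q)]


# Hypothetical 3-dimensional features.
v_qc = [1.0, 2.0, 3.0]   # question-caption-attended image features
c = [0.5, -1.0, 0.0]     # caption features
q = [2.0, 0.5, -1.0]     # question features
h = fuse(v_qc, c, q)     # [3.0, 0.5, -3.0]
```

Setting `c` to all zeros in this sketch reduces the fusion to the product of image and question features, which is the configuration used as the reference point for the informativeness metric in Sec. 2.4.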

We frame answer prediction as a multi-label classification problem. In particular, we adopt soft scores, which are in line with the evaluation metric, as labels to supervise the sigmoid-normalized predictions, as shown in Eq. 13. When there are multiple feasible answers, the soft scores capture the occasional uncertainty in the ground truth annotations. As suggested in Teney et al. (2017), we collect as answer candidates the answers that appear more than a threshold number of times in the training splits.


In Eq. 13, the two indices run respectively over the training questions and the candidate answers, and the targets are the aforementioned soft answer scores.
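To make the supervision concrete, the sketch below computes soft scores and a binary cross-entropy loss on sigmoid outputs for a single question. It rests on two assumptions: the soft score follows the official VQA accuracy (min(matches / 3, 1), averaged over all leave-one-annotator-out subsets of 10 annotators), and the loss is the usual sum of per-candidate binary cross-entropies. The function names are hypothetical.

```python
import math


def soft_score(matches, annotators=10):
    """VQA-style soft score for an answer given by `matches` of the annotators."""
    if not 0 <= matches <= annotators:
        raise ValueError("matches must lie between 0 and the number of annotators")
    # Dropping a matching annotator leaves matches - 1 agreeing answers;
    # dropping a non-matching annotator leaves all of them.
    drop_match = min((matches - 1) / 3.0, 1.0) if matches > 0 else 0.0
    drop_other = min(matches / 3.0, 1.0)
    return (matches * drop_match + (annotators - matches) * drop_other) / annotators


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def vqa_loss(logits, targets):
    """Sum of binary cross-entropies between sigmoid(logits) and soft targets."""
    eps = 1e-12
    total = 0.0
    for z, s in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
    return total


# Three candidate answers given by 5, 1 and 0 of the 10 annotators.
targets = [soft_score(5), soft_score(1), soft_score(0)]
good = vqa_loss([4.0, -1.0, -4.0], targets)
bad = vqa_loss([-4.0, -1.0, 4.0], targets)
```

Under this definition an answer given by a single annotator receives a score of 0.3, and `good` is smaller than `bad` because its logits agree with the soft targets.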

2.4 Image Captioning sub-system

We adopt the same image captioning module as Anderson et al. (2017) to jointly take advantage of bottom-up and top-down attention; for the detailed module structure, refer to Anderson et al. (2017). The key difference between our scheme and theirs lies in the input features and the caption supervision. Specifically, we feed the question-attended image features as inputs for caption generation and use only the question-related annotated captions as supervision. During training, we compute the cross-entropy loss of Eq. 14 for each annotated caption of an image and back-propagate the gradients only from the most related caption, as detailed in the next subsection.


Selecting relevant captions for training
Because our goal is to generate question-related captions, we need to provide the image captioning system with relevant captions as supervision during training. Some offline methods based on word similarity Li et al. (2018b) have been proposed for this purpose. However, those offline methods are unaware of the semantic connections among image parts and thus run a higher risk of misunderstanding the images. To address this issue, we propose an online relevant-caption selection scheme that guarantees our system updates with a shared descent direction Wu et al. (2018) in both the VQA part and the image captioning part, ensuring the consistency of the two modules during optimization.

Specifically, we frame caption selection as a constrained maximization over the index of the selected caption. We require the inner product of the current gradients from the VQA loss and the captioning loss to be greater than a constant, and select the caption that maximizes this inner product.


Therefore, given the solution of the caption selection problem, the final loss of the multi-task learning is the sum of the VQA loss and the captioning loss of the selected caption, as shown in Eq. 16. If the caption selection problem has no feasible solution, we ignore the caption loss.
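The selection rule and the resulting loss can be sketched as below. This is a toy illustration, not the training code: the gradients are plain Python lists standing in for the gradients of each loss with respect to the joint image-question representation, and the names `select_caption`, `total_loss` and the threshold argument `xi` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_caption(grad_vqa, grads_caption, xi=0.0):
    """Index of the caption whose gradient best aligns with the VQA gradient.

    Only captions whose inner product exceeds the constant xi are feasible;
    returns None when no caption satisfies the constraint.
    """
    best_index, best_value = None, xi
    for k, grad in enumerate(grads_caption):
        value = dot(grad_vqa, grad)
        if value > best_value:
            best_index, best_value = k, value
    return best_index


def total_loss(loss_vqa, caption_losses, selected):
    """VQA loss plus the selected caption's loss, or the VQA loss alone."""
    if selected is None:
        return loss_vqa
    return loss_vqa + caption_losses[selected]


# Hypothetical gradients for one image-question pair with three annotated captions.
grad_vqa = [1.0, 0.0]
grads_caption = [[-1.0, 0.0], [0.5, 2.0], [2.0, -1.0]]
k = select_caption(grad_vqa, grads_caption)        # inner products: -1.0, 0.5, 2.0
loss = total_loss(1.2, [3.0, 2.5, 2.0], k)
```

Because a positive inner product means the two gradients point in a broadly similar direction, training on the selected caption does not work against the VQA objective, which is the sense in which the two sub-systems share a descent direction.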


Caption Evaluation on the informativeness
Most automated caption evaluation metrics (e.g., BLEU, METEOR) are based on word-level statistical analysis and originate from the machine translation (MT) task. Unlike MT, however, image captioning aims at generating informative descriptions. We therefore propose a new metric that measures informativeness by the relative performance improvement on the VQA task, where much common knowledge is required.

Formally, we define the metric (Eq. 17) as the relative improvement of the VQA score obtained with captions as additional features over the score of the VQA sub-system with the input caption features c manually set to zeros.
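As a worked example, the metric can be computed from the validation accuracies reported in Table 2. The sketch assumes the relative-improvement form (score with captions minus score without, divided by the score without) and uses the 63.2% baseline accuracy as the zero-caption score; that pairing is an assumption, chosen because it reproduces the informativeness values listed in Table 5. The function name is hypothetical.

```python
def informativeness(score_with_captions, score_zero_captions):
    """Relative VQA improvement contributed by the captions."""
    if score_zero_captions <= 0:
        raise ValueError("the zero-caption score must be positive")
    return (score_with_captions - score_zero_captions) / score_zero_captions


baseline = 63.2  # validation accuracy without caption information (Table 2)
butd_captions = round(100 * informativeness(64.6, baseline), 2)  # 2.22
our_captions = round(100 * informativeness(65.8, baseline), 2)   # 4.11
```

Both values agree with the last column of Table 5, which suggests that the informativeness figures there are derived from the validation accuracies in Table 2.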

2.5 Training

We train our joint system using the AdaMax optimizer with the batch size suggested in Teney et al. (2017). We hold out the official validation set of the VQA v2 dataset for monitoring, to tune the initial learning rate and determine the number of epochs that yields the highest overall VQA score. We find that training the joint model for 20 epochs is sufficient; more training epochs may lead to overfitting, resulting in a slight drop in performance.

To simplify our training process, we first extract the bottom-up detection attention as the image features V. Unlike Anderson et al. (2017), we do not require training data from any other dataset. We initialize the training process with annotated captions from the COCO dataset and pre-train our system for 20 epochs with the final loss in Eq. 16. After that, we use the current system to generate captions for all question-image pairs in the COCO train, validation and test sets. Finally, we fine-tune our system using the generated captions with 0.25 learning rate for 10 epochs.

Method                               Yes/No   Num     Other   All
Prior Goyal et al. (2017)            61.20    0.36    1.17    25.98
Language-only Goyal et al. (2017)    67.01    31.55   27.37   44.26
MCB Fukui et al. (2016)              78.82    38.28   53.36   62.27
BUTD Anderson et al. (2017)          82.20    43.90   56.26   65.32
VQA-E Li et al. (2018b)              83.22    43.58   56.79   66.31
Beyond Bilinear Yu et al. (2018)     84.50    45.39   59.01   68.09
ours-single                          84.69    46.75   59.30   68.37
ours-ensemble-10                     86.15    47.41   60.41   69.66
Table 1: Comparison of our results on the VQA task with state-of-the-art VQA methods on the validation set and test-standard set. Accuracies are reported in percent (%).

3 Experiments

We perform extensive experiments to evaluate our joint system in both VQA task and image captioning task.

3.1 Datasets

Image Captioning dataset
We use the MSCOCO 2014 dataset for the image captioning sub-system. During training, we use the dataset's official configuration but hold out the Karpathy test split.

Following Anderson et al. (2017), we perform only minimal text pre-processing: converting all sentences to lower case, tokenizing on white space, and filtering words that do not occur at least five times. For evaluation, we first use the standard automatic evaluation metrics, namely SPICE Anderson et al. (2016), CIDEr Vedantam et al. (2015), METEOR Banerjee and Lavie (2005), ROUGE-L Lin (2004), and BLEU Papineni et al. (2002). Further, we propose to measure informativeness via the relative improvement on the VQA task, as shown in Eq. 17, since that task requires much additional information.
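The pre-processing steps listed above can be sketched with the standard library. This is a minimal illustration; replacing filtered words with an `<unk>` token (rather than dropping them) is an assumption, and the function names and the toy corpus are hypothetical.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Lower-case, split on white space, and keep words seen at least min_count times."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return {tok for tok, n in counts.items() if n >= min_count}


def preprocess(sentence, vocab, unk="<unk>"):
    """Tokenize one sentence and map out-of-vocabulary words to the unk token."""
    return [tok if tok in vocab else unk for tok in sentence.lower().split()]


# Toy corpus: "a", "dog" and "runs" occur at least five times, "cat" and "sleeps" do not.
corpus = ["A dog runs"] * 5 + ["A cat sleeps"]
vocab = build_vocab(corpus)
tokens = preprocess("A cat runs", vocab)  # ['a', '<unk>', 'runs']
```

Splitting on white space leaves punctuation attached to words, which is consistent with the stated intent of keeping pre-processing minimal.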

VQA dataset
We use the VQA v2.0 dataset Antol et al. (2015) for the evaluation of our proposed joint system, where the answers are balanced in order to minimize the effectiveness of learning dataset priors. This dataset is used in the VQA 2018 challenge and contains over 1.1M questions about images from the MSCOCO 2015 dataset Chen et al. (2015).

Similar to the caption pre-processing, we perform standard text pre-processing and tokenization. Following Anderson et al. (2017), questions are trimmed to a fixed maximum number of words. To evaluate answer quality, we report accuracies using the official VQA metric, which accounts for the occasional disagreement between annotators on the ground truth answers.

3.2 Results on VQA task

We report the experimental results on the VQA task and compare them with state-of-the-art methods. As shown in Table 1, our system outperforms the state-of-the-art systems, which indicates the effectiveness of including caption features as additional inputs. In particular, our single model outperforms other methods especially in the 'Num' and 'Other' categories. The reason is that the additional captions provide more numerical clues for inferring answers to 'Num' questions and more common knowledge for answering 'Other' questions. Furthermore, an ensemble of 10 single models reaches 69.7% on the test-standard split.

Comparison between using the generated and annotated captions
We then analyze the difference between generated and annotated captions. As shown in Table 2, our system gains about 6% from annotated captions and about 2.5% from generated captions on the validation split. This evidence indicates that directly answering questions from a limited number of detections is insufficient and that incorporating additional knowledge about the images is necessary. It also shows that the generated captions are still not as informative as the human-annotated ones.

Method                               Accuracy
BUTD Anderson et al. (2017)          63.2
ours with BUTD captions              64.6
ours with our generated captions     65.8
ours with annotated captions         69.1
Table 2: Comparison of the performance when using generated and annotated captions. Both provide large improvements over the baseline model. However, there is still a large gap between generated captions and annotated captions.

Figure 3: Example of our joint system. The answer scores for the question are: the top answer receives the full score of 1, "video games" scores 0.3 and "tv" scores 0. Columns from left to right show BUTD, our system with generated captions, and our system with annotated captions.

Ablation study on the semantic connection modeling
In this section, we quantitatively analyze the effectiveness of incorporating captions to model the relationships among image parts, beyond the advantage of providing additional knowledge. In the variant without relationship modeling, we use the caption features only as inputs and do not involve them in the visual attention (i.e., we do not compute the caption-attended image attention). As shown in Tables 3 and 4, using caption features to model the attention over the image features V yields improvements of more than 0.5% on both the validation and test-standard splits. Both variants are reported as separate rows.

Method                                         All     Yes/No  Num     Other
BUTD                                           65.3    82.2    43.9    56.3
ours (without semantic connection modeling)    67.4    84.0    44.5    57.9
ours (with semantic connection modeling)       68.4    84.7    46.8    59.3
Table 3: Evaluation of the effectiveness of semantic connection modeling on the test-standard split. Accuracies are reported in percent (%).

Method                                         All     Yes/No  Num     Other
BUTD                                           63.2    80.3    42.8    55.8
ours (without semantic connection modeling)    65.2    82.1    43.6    55.8
ours (with semantic connection modeling)       65.8    82.6    43.9    56.4
Table 4: Evaluation of the effectiveness of semantic connection modeling on the validation split. Accuracies are reported in percent (%).

Qualitative Analysis
We qualitatively analyze the attention inside the caption embedding module and the joint attention on the image features. As illustrated in Fig. 3, the attended image regions and attended caption words are visualized to confirm the correctness of the attended areas. Specifically, we observe that our system is capable of concentrating on the related objects (the human hands and the TV monitor) and on meaningful words in both the annotated and the generated captions, where the BUTD baseline model Anderson et al. (2017) fails. With semantic connection modeling, our system also gains more confidence in its reasoning. As a result, with ground truth captions our system gets the full score by focusing on the relevant word in the captions, and even with generated captions (with or without semantic connection modeling) the system gets partial credit thanks to the related words in the captions. The reason is that captions provide additional common knowledge that is hard for the VQA system to learn directly from the images. This knowledge not only directly provides clues for the question answering process but also helps the VQA system better understand the semantic connections among different image parts.

Model (cross-entropy loss)   BLEU-1  BLEU-4  METEOR  ROUGE-L  CIDEr   SPICE   Informativeness (Eq. 17)
BUTD                         77.2    36.2    27.0    56.4     113.5   20.3    2.22%
ours                         77.4    36.7    25.8    56.4     110.2   20.8    4.11%
Table 5: Comparison on the image captioning task on the MSCOCO Karpathy test split. Our captioner obtains competitive results on automated evaluation metrics and more informative captions in terms of relative improvement on the VQA task.

3.3 Results on Image captioning task

In this section, we evaluate our joint system on the image captioning task. In particular, we not only compare our system with state-of-the-art systems via standard automated metrics, but also demonstrate that our generated question-related captions are more informative, as they provide more benefit to the VQA task under the measurement defined in Eq. 17.

In Table 5, we first report the standard automated caption metric scores and compare them with the BUTD Anderson et al. (2017) model. When determining the captions for the images, we pick the most likely question-related captions according to the caption selection problem of Sec. 2.4. Our system, though it uses only a fraction of the annotated captions as supervision, achieves similar standard scores. In terms of informativeness, however, we observe a large improvement in relative VQA accuracy, from 2.22% to 4.11%. This improvement indicates the importance and necessity of adopting question features as inputs to the image captioning sub-system and of using more question-specific captions as supervision.

4 Related Work

Visual question answering

Recently, a large number of attention-based deep learning methods have been proposed for the VQA task, including top-down attention methods Gao et al. (2015); Malinowski et al. (2015); Ren et al. (2015) and bottom-up methods Anderson et al. (2017); Li et al. (2018b), to incorporate different semantic parts of images. Specifically, the image features from a pre-trained CNN and the question features from RNNs are combined to produce image attention, and both the question features and the attended image features are used to predict the answers.

However, answering visual questions requires not only visual content information but also common knowledge about that content, which can be too hard to learn directly from a limited number of images with only QA supervision. Comparatively little previous research has worked on enriching the knowledge base when performing the VQA task. We are aware of three related papers. Li et al. (2018b) first use annotated captions to build an explanation dataset in an offline fashion and then adopt a multi-task learning strategy that simultaneously learns to predict answers and generate explanations. In contrast, we use the captions as inputs, which can provide richer features during prediction. Li et al. (2018a) first generate captions and attributes with a fixed annotator and then use them to predict answers. The captions they generate are therefore not necessarily related to the question, and they also drop the image features in the answer prediction process. Rajani and Mooney (2018) stack auxiliary features to robustly predict answers. However, none of the above exploit the complementarity between image captioning and the VQA task. We therefore propose to use captions to provide additional knowledge and to use questions to provide heuristics to the image captioner.

Image Captioning
Most modern image captioning systems are attention-based deep learning systems Donahue et al. (2015); Karpathy and Fei-Fei (2015); Vinyals et al. (2015). With the help of large image description datasets Chen et al. (2015), these image captioning systems have shown remarkable results under automatic evaluation metrics. Most of them take image embeddings from CNNs as inputs and build an attentional RNN (e.g., GRU Cho et al. (2014), LSTM Hochreiter and Schmidhuber (1997)) as the language model to generate image captions without making prefixed decisions (e.g., object categories).

However, deep network systems using beam search still tend to generate similar captions and fail to describe the images diversely Vijayakumar et al. (2016). In this work, rather than directly asking the system to mine diverse descriptions from images, we propose to provide more heuristics when generating the captions.

5 Conclusion

In this work, we explore the complementarity of the image captioning task and the VQA task. In particular, we present a joint system that generates question-related captions with question features as heuristics and uses the generated captions to provide additional common knowledge to the VQA system. We produce more informative captions while outperforming the current state of the art in terms of VQA accuracy. Furthermore, we demonstrate the importance and necessity of including additional common knowledge beyond the images in the VQA task and more heuristics in the image captioning task.