QACE: Asking Questions to Evaluate an Image Caption

by Hwanhee Lee et al.
Seoul National University

In this paper, we propose QACE, a new metric based on Question Answering for Caption Evaluation. QACE generates questions on the evaluated caption and checks its content by asking the questions on either the reference caption or the source image. We first develop QACE-Ref, which compares the answers of the evaluated caption to its reference, and report results competitive with state-of-the-art metrics. To go further, we propose QACE-Img, which asks the questions directly on the image instead of the reference. A Visual-QA system is necessary for QACE-Img. Unfortunately, standard VQA models are framed as a classification among only a few thousand categories. Instead, we propose Visual-T5, an abstractive VQA system. The resulting metric, QACE-Img, is multi-modal, reference-less, and explainable. Our experiments show that QACE-Img compares favorably w.r.t. other reference-less metrics. We will release the pre-trained models to compute QACE.




1 Introduction

Image captioning is a task that aims to generate a description containing the main content of a given image. The field of caption generation is prolific (vinyals2015show; anderson2018bottom), and it is therefore important to provide reliable evaluation metrics to compare systems. Most prior works still report n-gram similarity metrics such as BLEU (papineni2002bleu) or CIDEr (vedantam2015cider). However, these n-gram similarity metrics often fail to capture semantic errors in the generated captions (novikova-etal-2017-need).
To overcome this limitation, we propose QACE, a radically different evaluation framework from n-gram metrics. QACE first generates questions about the candidate caption, and then checks if the answers are consistent w.r.t. either the reference or the source image. We depict QACE in Figure 1.
Specifically, we propose two variants of QACE, depending on the content to which the evaluated caption is compared: QACE-Ref when it is compared to the reference, and QACE-Img when it is compared to the source image. QACE-Img has the desirable property of being reference-less, i.e., the score can be computed without requiring a gold reference.
In this reference-less setup, a Visual Question Answering (VQA) system is required to answer the questions. However, in the VQA literature (antol2015vqa), the task is usually framed as classification over 3k pre-defined answer choices (e.g., blue, sea, or banana). These VQA models are therefore not general QA systems; using them off-the-shelf for QACE-Img would limit the comparison to these few pre-defined categories, which is not satisfying. To solve this issue, we propose an abstractive VQA system, Visual-T5, as a new module for QACE-Img that can generate free-form abstractive answers given a textual question and an image. We conduct a human evaluation of Visual-T5 and show that it is capable of generating accurate abstractive answers. Using Visual-T5, we can compare the answers of the candidate caption directly with the answers obtained from the corresponding image.
Experimental results show that our proposed QACE-Ref and QACE-Img are promising compared to other reference-based and reference-less metrics on three benchmark datasets: Pascal50s (vedantam2015cider), Composite (aditya2015images) and Flickr8k (hodosh2013framing). Also, as shown in Figure 1, QACE has a natural form of interpretability through the visualization of the questions and answers.

Figure 1: The overall flow of QACE. QACE extracts possible answer spans and generates answer-aware questions for a given candidate caption. The VQA and TQA models answer these questions given the image and the reference captions, respectively. The correctness of the candidate caption is evaluated by comparing the answers.

2 Related Work

Image Captioning Metrics

Similar to other text generation tasks such as machine translation and summarization, n-gram similarity metrics such as BLEU, METEOR (banerjee2005meteor) and ROUGE (lin2004rouge) are arguably the standard in automatic evaluation. Among them, the most widely used metric is CIDEr (vedantam2015cider), which uses TF-IDF-weighted n-gram similarity. The SPICE metric (anderson2016spice) is based on scene graphs, while more recently, BERTScore (zhang2019bertscore) computes the similarity of contextualized embeddings. Different from prior works, we are the first to use Question Generation (QG) and Question Answering (QA) to evaluate image captions.

Question and Answering for Evaluation

fisch2020capwap propose a new method to generate informative captions that can answer visual questions. In our work, we focus on caption evaluation using QA systems, not on generating the captions. Several QA-based evaluation metrics (scialom2019answers; wang2020asking) have recently been proposed to evaluate abstractive summarization. However, all those prior works are limited to text-vs-text evaluation, while our work develops a multi-modal metric.

3 QACE

We propose QACE, a QG- and QA-based framework for evaluating an image caption. As shown in Figure 1, QACE first extracts answer candidates (i.e., 1) wave, 2) top, 3) surfboard) from a candidate caption and generates corresponding questions. With these questions, the visual-QA (VQA) and textual-QA (TQA) models produce answers given their respective contexts (i.e., the image and the reference). By comparing the answers from each source, we can directly judge the correctness of the candidate caption.

3.1 Question Generation

The goal of this component is to generate questions that ask about the primary information of the candidate caption. Our QG model is a text-to-text generation model (i.e., T5 (raffel2020exploring)), fine-tuned on SQuAD v2 (rajpurkar2018know) to generate answer-aware questions. Given a caption, we extract possible answer spans; in particular, we focus on extracting noun phrases, since they mostly contain salient information and can be easily foiled (shekhar2017foil). We argue that questions generated on this salient information should be answered similarly from the image or the captions if they share the same information.
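As a concrete illustration, the answer-aware QG step can be sketched as follows. The prompt format ("answer: ... context: ...") is an assumption borrowed from common T5 QG conventions, not from the paper, and `qg_model` is a stand-in for the fine-tuned T5.

```python
def make_qg_prompts(caption, noun_phrases):
    """Build one answer-aware prompt per candidate answer span, in the
    text-to-text style used by T5 QG models (format is an assumption)."""
    return [f"answer: {np} context: {caption}" for np in noun_phrases]

def generate_questions(caption, noun_phrases, qg_model):
    """qg_model: any callable mapping a prompt string to a question string,
    e.g. a T5 fine-tuned on SQuAD v2. Returns {answer_span: question}."""
    prompts = make_qg_prompts(caption, noun_phrases)
    return {np: qg_model(p) for np, p in zip(noun_phrases, prompts)}

# Toy stand-in for the fine-tuned model, for illustration only.
dummy_qg = lambda prompt: "What is the man riding?"
questions = generate_questions("a man riding a wave on a surfboard",
                               ["a man", "a wave", "a surfboard"], dummy_qg)
```

In the real pipeline the noun phrases would come from a chunker (the paper uses spaCy's Noun Chunks extractor) and `qg_model` from a decoded T5 generation.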

3.2 Question Answering

For QACE-Ref, we use a TQA model. We train T5 to answer the generated questions (see 3.1) with the reference captions as context. Conversely, QACE-Img requires a VQA model. We propose a new architecture, Visual-T5, that can generate abstractive answers given an image and a question, as opposed to standard multiple-choice VQA.

Figure 2: The overview of Visual-T5, an abstractive VQA model. We embed questions with an additional special separation token and concatenate the visual embeddings to form the inputs for T5.

3.3 Abstractive Visual Question Answering

When no reference captions are available, one of the most important parts of QACE is a VQA model that can produce correct answers. To move beyond VQA as a classification task, we are the first, to the best of our knowledge, to develop an abstractive VQA model that can generate free-form answers. Specifically, we enable multimodal encoding for T5, inspired by previous work on adapting pre-trained language models for multimodal tasks (scialom2020bert). We illustrate our proposed Visual-T5 in Figure 2. Based on the default T5 architecture, Visual-T5 has an additional visual embedding layer that encodes regional features of the image from Faster R-CNN (ren2015faster). This linear layer maps detection features to 768 dimensions, the same as the dimension of the textual embeddings. Each 768-d feature is then treated as a standard token in Visual-T5, which can thus encode an image and a question together. We provide more details in the Appendix.
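A minimal sketch of the multimodal input construction, assuming numpy in place of the trained projection layer: the 2054-d Faster R-CNN region features are linearly mapped into T5's 768-d embedding space and concatenated with the embedded question tokens. Dimensions follow the paper; the random weights are placeholders for learned parameters.

```python
import numpy as np

D_VIS, D_MODEL, N_BOXES = 2054, 768, 36  # region feature dim, T5 hidden dim, boxes

rng = np.random.default_rng(0)
W_vis = rng.standard_normal((D_VIS, D_MODEL)) * 0.02  # placeholder for the learned linear layer
b_vis = np.zeros(D_MODEL)

def visual_t5_inputs(token_embeddings, region_features):
    """token_embeddings: (n_tokens, 768) embedded question tokens.
    region_features:  (36, 2054) Faster R-CNN detection features.
    Returns the (n_tokens + 36, 768) sequence fed to the T5 encoder,
    where each projected region acts as one standard token."""
    visual_tokens = region_features @ W_vis + b_vis   # (36, 768)
    return np.concatenate([token_embeddings, visual_tokens], axis=0)

seq = visual_t5_inputs(rng.standard_normal((12, D_MODEL)),
                       rng.standard_normal((N_BOXES, D_VIS)))
```

Treating projected regions as ordinary tokens lets the unmodified T5 self-attention mix visual and textual information without any architectural change beyond the single linear layer.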

3.4 QACE Metric

For a given candidate caption $C$, we use our QG model to generate questions $Q_C = (q_1, \ldots, q_n)$ for all noun phrases of $C$. Then, we compare the answer to each question in $Q_C$ obtained on $C$ with the answer obtained on the reference source. We introduce two QACE variants: QACE-Ref, for which the reference caption is compared, and QACE-Img, for which the source image is compared. Denoting by $A(q, X)$ the answer to question $q$ given a context $X$, we compute QACE-Ref and QACE-Img as follows:

$$\mathrm{QACE}(C, S) = \frac{1}{|Q_C|} \sum_{q \in Q_C} \mathrm{Sim}\left(A(q, C), A(q, S)\right)$$

where $S$ corresponds to the image for QACE-Img and to the gold reference for QACE-Ref, and $\mathrm{Sim}$ is the function that measures the similarity between the two answers $A(q, C)$ and $A(q, S)$. The standard metric in QA is the F1 score, as introduced by rajpurkar2016squad. However, two abstractive answers can be similar but written in two different ways, limiting the effectiveness of a naive F1. Hence, in addition to the F1, we propose to use BERTScore. Finally, we also complete the similarity metrics with the answerability of the questions as a $\mathrm{Sim}$ function, in order to measure whether the question is answerable at all. The answerability corresponds to $1 - p(\textit{unanswerable})$, where $p(\textit{unanswerable})$ is the probability attributed by the QA model to the token unanswerable.² To consider all the different aspects, we use the average of the three values computed using each function as the default value of QACE.

²SQuAD v2 contains unanswerable questions, for which we associate the token unanswerable as the correct answer during training. Therefore, our QA model associates this token with the probability that the question is not answerable.
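The scoring above can be sketched in a few lines. Only the token-level F1 similarity is implemented here; BERTScore and the answerability term would be plugged in as additional `sim` functions, and the two `answer_on_*` callables stand in for the TQA/VQA models.

```python
def token_f1(pred, gold):
    """SQuAD-style token-overlap F1 between two answer strings."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def qace_score(questions, answer_on_candidate, answer_on_source, sims=(token_f1,)):
    """Average each similarity function over all questions, then average
    across the similarity functions (the paper's default averages F1,
    BERTScore and answerability)."""
    per_sim = [sum(s(answer_on_candidate(q), answer_on_source(q)) for q in questions)
               / len(questions) for s in sims]
    return sum(per_sim) / len(per_sim)
```

With real models, `answer_on_candidate` reads the candidate caption and `answer_on_source` reads either the reference (QACE-Ref) or the image via Visual-T5 (QACE-Img).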

4 Synthetic Data Generation for VQA

As discussed in 3.3, relying on a VQA dataset such as VQA v2 (goyal2017making) limits the possible answers to a small set of pre-defined categories. To train a general and abstractive VQA model, we create a synthetic abstractive VQA dataset. We generate question/answer pairs using the captions in the training set of MS-COCO (lin2014microsoft). Specifically, we extract noun phrases from a reference caption and generate an answer-aware question using our QG model. To increase the validity of these synthetic questions, we apply round-trip consistency (alberti2019synthetic), filtering out the questions for which the QA model predicts an answer different from the extracted noun phrase. We then convert this synthetic QA dataset into {question, answer, image} triples by attaching the image corresponding to each caption.
In addition, we randomly add 20% of unanswerable questions³ to the synthetic training set, so that the model learns to judge the answerability of a given question. Through this, if a candidate caption contains any hallucinated content that is not present in the image, questions about it can be marked as unanswerable by our VQA model, as shown in the second example of Figure 3. This synthetic dataset enables the training of the abstractive VQA model. We report the performance of the model through a human evaluation in Section 5.2.

³We consider an image and a question that are not paired to be unanswerable, and perform negative sampling accordingly.
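The two dataset-building steps above can be sketched as follows; `qa_model` is any callable standing in for the trained QA model, and the exact string-matching criterion for round-trip consistency is an assumption.

```python
import random

def round_trip_filter(synthetic, qa_model):
    """Keep only (question, answer, caption) triples for which the QA model,
    reading the caption, recovers the noun phrase the question was built from."""
    return [(q, a, c) for q, a, c in synthetic
            if qa_model(q, c).strip().lower() == a.strip().lower()]

def add_unanswerable(triples, all_images, ratio=0.2, seed=0):
    """Negative sampling: re-pair a sampled question with a non-matching image
    and label it `unanswerable`, so the model learns answerability."""
    rng = random.Random(seed)
    negatives = []
    for q, _a, img in rng.sample(triples, int(len(triples) * ratio)):
        other = rng.choice([i for i in all_images if i != img])
        negatives.append((q, "unanswerable", other))
    return triples + negatives
```

The filter discards questions the QA model cannot answer back consistently, and the negatives teach the model to emit the unanswerable token rather than hallucinate an answer.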

5 Experiments

5.1 Benchmark Datasets

We evaluate our proposed metric on three benchmark datasets of human annotations: PASCAL-50S, Composite and Flickr8k.

PASCAL-50S provides 4k caption triplets (A, B, C), where A is a set of 50 reference captions and B and C are two candidate captions for the given image. Human judgments indicate which of B or C is the more appropriate caption for the image compared to A.

Composite is composed of 11,985 human judgment scores ranging from 1 to 5, depending on the relevance between each candidate caption-image pair, with 5 reference captions per image.

Flickr8k provides three human-expert judgments for 5,822 candidate caption-image pairs. The scores range from 1 to 4, depending on the relevance of each caption-image pair.

5.2 Results and Discussions

Metric           | Ref? | Pascal50s | Composite | Flickr8k
BLEU-4           | yes  | 65.2      | 45.7      | 28.6
ROUGE-L          | yes  | 67.7      | 47.7      | 30.0
METEOR           | yes  | 80.5      | 46.6      | 40.3
CIDEr            | yes  | 77.8      | 47.4      | 41.9
SPICE            | yes  | 76.1      | 48.6      | 45.7
BERTScore        | yes  | 72.0      | 45.6      | 30.5
QACE-Ref (ours)  | yes  | 75.1      | 49.3      | 40.5
    F1           | yes  | 57.5      | 55.1      | 9.2
    BERTScore    | yes  | 76.4      | 46.0      | 30.9
    Answerability| yes  | 71.6      | 47.3      | 39.0
-Perplexity      | no   | 46.8      | 1.7*      | 10.1
VIFIDEL          | no   | 69.0      | 13.1      | 33.6
QACE-Img (ours)  | no   | 70.0      | 19.1      | 29.1
    F1           | no   | 62.0      | 12.5      | 27.3
    BERTScore    | no   | 65.9      | 12.8      | 27.1
    Answerability| no   | 74.5      | 15.7      | 27.8

Table 1: The "Ref?" column indicates whether a metric requires references. The Pascal50s column reports the accuracy of matches with human judgments on PASCAL-50S; the Composite and Flickr8k columns report the Kendall correlation between each metric and human judgments. All p-values are below 0.05, except for entries marked with *.

Computation Details

For all reference-based metrics reported in the paper, we compute the metric score against each reference separately and average over all references in each dataset.
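This averaging scheme is simple to state in code; `metric` is any pairwise reference-based metric, and this helper is illustrative rather than taken from the released code.

```python
def multi_ref_score(metric, candidate, references):
    """Score a candidate caption against each reference separately and
    average over all references, as done for every reference-based
    metric reported in the paper."""
    return sum(metric(candidate, ref) for ref in references) / len(references)
```

For example, with a toy exact-match metric, a candidate matching one of two references scores 0.5.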

QACE Performance

We compare our proposed method with the following widely used metrics: BLEU-4, ROUGE-L, METEOR, CIDEr, SPICE, and BERTScore. We present the experimental results for all three datasets in Table 1. Among the reference-based metrics, QACE-Ref shows the best results on Composite and is comparable to the best metrics on Pascal50s and Flickr8k, indicating the relevance of a QA-based metric for evaluating image captioning.
For the reference-less metrics, all the correlations are lower, showing the difficulty of evaluating captions without a reference. Nonetheless, among these metrics, QACE-Img shows the best results on Pascal50s and Composite and comparable results on Flickr8k. For Flickr8k, we found that more than half of the candidate captions receive human judgments below 0.2 on a 0-to-1 scale; in other words, most captions in this dataset are largely unrelated to their image. As a result, most of the generated questions are unanswerable from the image, which we believe explains the relatively lower performance of QACE-Img on Flickr8k compared to other metrics.
Furthermore, we investigate the independent contribution of each answer similarity function to QACE and present the results in Table 1 (note that the default QACE-Img uses the mean of F1, BERTScore and answerability). The table reveals that each similarity function captures a different aspect, and averaging the three yields the best performance on two of the three datasets.

VQA Model Performance

Visual-T5 is one of the main components of QACE-Img. Since it can generate free-form answers, its automatic evaluation is challenging. We therefore conduct a human evaluation on 200 examples randomly sampled from the test set. We hire three annotators to judge whether the generated answer is correct given the image. Based on the majority vote of the three annotators, the VQA model answers correctly on 69% of the examples. Among these correct answers, half were written differently from the original answer, showing that our model can generate abstractive answers.

Case study

Figure 3: Case study on the QACE metric. Human judgments are normalized to the range 0 to 1.

Different from previous metrics, QACE can be easily interpreted through the visualization of the generated questions and the corresponding answers, as shown in Figure 3. In the first example, we observe that the second question is answered differently by the VQA model (sand vs. beach). Although the answer itself is correct (it is true that the man is standing on the sand), it results in a lower score for QACE-Img compared to QACE-Ref. This emphasizes the importance of using similarity metrics other than the F1 when comparing two answers (see Section 3.4). For instance, BERTScore should consider sand closer to beach than to a random word.
The second example is very illustrative: for the first question, both TQA and VQA answer dog, hence detecting an error in the candidate caption, which talks about a cow. The second question refers to the cow, which makes it ambiguous. The VQA model considers it unanswerable, while the TQA model correctly answers grass. Following this study, we expect that QACE-Img can be improved through finer answer comparison methods in future work.

6 Conclusion

In this paper, we propose QACE, a captioning metric that directly compares each piece of content in the candidate caption with either the source image or a gold reference caption by asking questions. To enable asking questions directly on the source image, we introduce Visual-T5, an abstractive VQA model that generates free-form visual answers, for which we report strong results based on a human evaluation. Our proposed metric can be applied in both reference-based and reference-less settings. It is highly explainable and compares favorably to the state of the art in terms of correlation with human judgments.


We thank anonymous reviewers for their constructive and meaningful comments. K. Jung is with ASRI, Seoul National University, Korea. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (No. 2021R1A2C2008855). This work was partially funded by gifts from Adobe Research.

Ethical Considerations

We compensate the workers with competitive pay, above USD $10 per hour, for the human evaluation of the VQA model. Furthermore, we used public datasets to train the models.


Appendix A Experimental Details

A.1 Reproducibility Checklist

Source Code

We attach the source code for computing QACE and training Visual-T5. In the question generation component, we use the Noun Chunks extractor from spaCy.

Computing Infrastructure

We use AMD Ryzen Threadripper 2950X (3.50 GHz) with GeForce RTX 2080 Ti for the experiments. The software environments are Python 3.6.6 and PyTorch 1.3.1.

Average runtime for each approach

Generating all questions for a given candidate caption with the pre-trained question generation model takes one second on average. Computing the visual and textual answers and comparing them takes about 0.1 seconds. For training the VQA model, Visual-T5, each epoch takes 40 minutes on a single RTX 2080 Ti GPU.


We use the pre-trained t5-base for the question generation and TQA models from the huggingface model repository (wolf2020transformers). We use t5-small to fine-tune our VQA model: based on t5-small, we add a single linear layer to encode the visual features and train the model for 5 epochs with a batch size of 128. The synthetic training set contains 1 million examples, which we split into a 9:1 proportion for training and validation. For the maximum sequence lengths, we use 64 for the input sequence, including the visual tokens, and 32 for the output sequence.

Number of Model Parameters

The QG model has 222.9M parameters, the TQA model 222.9M, and the VQA model 61.6M.

A.2 Significance Test

For all of the correlation coefficients reported in the paper, we test significance in the standard way: a t-test with the null hypothesis of no association, reporting the p-value for each coefficient.
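As an illustration, the classic t-test for a correlation coefficient can be sketched as below; for large n, the t statistic is compared against the normal critical value. This sketch uses the Pearson-style formula t = r * sqrt((n - 2) / (1 - r^2)); the exact statistic the paper uses for the Kendall coefficients is our assumption.

```python
import math

def corr_t_statistic(r, n):
    """t statistic for H0: no association, given a sample correlation r
    computed over n pairs; under H0 it follows a t distribution with
    n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def significant_at_05(r, n):
    """Two-sided test at alpha = 0.05, using the normal approximation to
    the t distribution (adequate for the large n of these benchmarks)."""
    return abs(corr_t_statistic(r, n)) > 1.96
```

With thousands of caption-image pairs per dataset, even small correlations can reach significance, which is why only near-zero entries in Table 1 fail the test.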

Appendix B Abstractive Visual Question Answering

In this section, we provide the training details and additional output examples of our proposed abstractive VQA model, Visual-T5.

B.1 Visual Embedding

We extract the regional features for each object using Faster R-CNN (ren2015faster). We fix the number of boxes to 36; each regional feature consists of 2048 dimensions plus 6 additional dimensions encoding the location and size of the box. Concatenating these gives a 2054-d vector per region, and a single linear layer maps these 2054-d features to 768-d so that each can be treated as a token in T5.
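A sketch of the 2054-d region feature construction. The paper says only "location and size" for the 6 extra dimensions, so the exact choice here (normalized box corners plus normalized width and height) is an assumption:

```python
import numpy as np

def region_feature(appearance, box, image_w, image_h):
    """appearance: (2048,) Faster R-CNN feature for one detected box.
    box: (x1, y1, x2, y2) pixel coordinates.
    Returns the (2054,) vector that the linear layer later projects
    to 768-d for use as a visual token."""
    x1, y1, x2, y2 = box
    geometry = np.array([x1 / image_w, y1 / image_h,   # normalized top-left
                         x2 / image_w, y2 / image_h,   # normalized bottom-right
                         (x2 - x1) / image_w,          # normalized width
                         (y2 - y1) / image_h])         # normalized height
    return np.concatenate([appearance, geometry])
```

Stacking 36 such vectors gives the (36, 2054) input to the visual embedding layer.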

B.2 Answer Examples

Figure 4: Various output examples on the evaluation set of the abstractive VQA model, Visual-T5.

We provide more examples from our abstractive VQA model in Figure 4. We observe that many predicted answers are correct but expressed in a different form, as in the first and second examples. Also, the model outputs unanswerable for questions that cannot be answered from the given image, as in the third example.

B.3 Answerability

We make unanswerable visual questions by randomly sampling questions that belong to images other than the given image. We mix in 20% of these unanswerable questions, similar to the third example in Figure 4, to train the VQA model.

B.4 Human Evaluation

Figure 5: Full instructions and interface to workers for evaluating the answers of VQA model.

We hire workers located in the US, UK, CA, NZ, or AU to guarantee fluency in English. We restrict to workers whose HIT approval rate is higher than 95% and who have completed at least 500 HITs. We pay workers more than USD $10 per hour, calibrated through several preliminary experiments on the compensation. We provide the full instructions and the interface in Figure 5. We compute the annotator agreement using Krippendorff's alpha (krippendorff1970estimating). We observe a Krippendorff's alpha of 0.56, which indicates "moderate" agreement according to one of the referenced guidelines for kappa-like measures (landis1977measurement).