Can Open Domain Question Answering Systems Answer Visual Knowledge Questions?

by   Jiawen Zhang, et al.
Apple Inc.

The task of Outside Knowledge Visual Question Answering (OKVQA) requires an automatic system to answer natural language questions about pictures and images using external knowledge. We observe that many visual questions, which contain deictic referential phrases referring to entities in the image, can be rewritten as "non-grounded" questions and can be answered by existing text-based question answering systems. This allows for the reuse of existing text-based Open Domain Question Answering (QA) Systems for visual question answering. In this work, we propose a potentially data-efficient approach that reuses existing systems for (a) image analysis, (b) question rewriting, and (c) text-based question answering to answer such visual questions. Given an image and a question pertaining to that image (a visual question), we first extract the entities present in the image using pre-trained object and scene classifiers. Using these detected entities, the visual questions can be rewritten so as to be answerable by open domain QA systems. We explore two rewriting strategies: (1) an unsupervised method using BERT for masking and rewriting, and (2) a weakly supervised approach that combines adaptive rewriting and reinforcement learning techniques to use the implicit feedback from the QA system. We test our strategies on the publicly available OKVQA dataset and obtain a competitive performance with state-of-the-art models while using only 10


1 Introduction

Within the realm of question answering systems, Visual Question Answering (VQA) seeks to answer questions pertaining to given pictures or images. Broadly, VQA can be categorized into three types: (a) direct (e.g., “How many slices of pizza are there?”), (b) visual common sense reasoning (e.g., “Why is the person running?”), and (c) outside knowledge (OK) (e.g., “Where does this food originate from?”). In this work, we focus on studying “outside knowledge” queries, which require external knowledge from either free text documents (e.g., Wikipedia), knowledge graphs (e.g., DBpedia), or large language models (e.g., BERT) to answer questions about contents of the image.

Existing approaches to OKVQA tackle this task by combining text and image representations in a joint QA system (Su et al., 2020; Lu et al., 2019). However, they require large amounts of high-quality multi-modal training data containing images, a question for each image that draws upon external knowledge, and human-annotated answers for each question. Such high-quality data is often harder to collect (Marino et al., 2019) than the individual datasets used for well-known problems in computer vision and NLP (e.g., entity extraction from images, text-based QA). Furthermore, according to Jain et al. (2021), a surprisingly large fraction of the queries in the available OKVQA dataset (Marino et al., 2019) do not need external knowledge to answer. Instead, some are independent of the image, some depend on speculation (e.g., “Can you guess what it is?”), and some require only object recognition (e.g., “What type of plant is it?”) or are otherwise answerable from the image alone. To simplify human annotation requirements, the OKVQA dataset treats the task as a classification task, selecting from among the most frequent answers in the training set rather than obtaining the answers from an external source (such as a knowledge graph or spans from a document corpus like Wikipedia).

Question How tall is this animal on average?
(1) Objects giraffe, stone, tree, park
(1) Rewrite How tall is giraffe on average?
(2) TextQA 15 feet
Figure 1: Example OKVQA image and question describing steps needed to solve using a text-based model

Since several pre-trained models are already available for image analysis He et al. (2016); Iandola et al. (2016); Wu et al. (2019) and for text-based question answering Roberts et al. (2020); Lewis et al. (2020); Chen et al. (2017), a potential solution to mitigate the data scarcity problem is to leverage such pre-trained models. We propose a potentially data-efficient approach that reuses existing systems for (a) image analysis, (b) question rewriting, and (c) text-based question answering to answer such visual questions.

Given an image and a question pertaining to that image (a visual question), we first extract the entities present in the image using pre-trained object and scene classifiers. Using these detected entities, the visual questions can be rewritten so as to be answerable by open domain QA systems. We explore two rewriting strategies using unsupervised and weakly supervised techniques that do not require any additional data for training a rewriter.

In the unsupervised approach, given a question and a list of entities as input, we find all possible replacements of a question span with an entity and leverage existing language models like BERT Devlin et al. (2018) and GPT-3 Brown et al. (2020) to select the semantically correct entity. While this method depends solely on the pre-trained language models’ syntactic ability to choose the best rewrite, the reformulated questions may not necessarily follow the vocabulary and syntactic structure seen by the text-based QA model during training. Thus, for the second rewriting strategy, our goal is to adapt to the input style of the text QA model. Here, we train a sequence-to-sequence rewriter such as BART Lewis et al. (2019) implicitly, without the need for a labeled dataset of original and rewritten questions, with the help of reinforcement learning (RL). We use feedback from the QA system and the correctness of the answer candidates to compute the RL reward.

We test the above two strategies on a portion of the publicly available OKVQA dataset Marino et al. (2019) that contains only knowledge-oriented visual questions. We reuse a T5-based closed-book question answering system Roberts et al. (2020) as our text-based model and BART as our rewriter model for implementing the two strategies mentioned above. We choose the T5 model purely because it is easy to deploy, does not depend on external resources or models (such as retrievers in other pipelines), and delivers state-of-the-art performance on knowledge-oriented question answering datasets. Furthermore, experimental results reveal that our proposed rewrite-based pipelines achieve exact match and semantic-similarity-based scores competitive with other end-to-end VQA models.

The primary contributions of this work are the following:

  • We explore two rewriting strategies, a QA model agnostic (unsupervised) and a QA model-driven (weakly supervised), to transform deictic knowledge-oriented visual questions into direct questions better understood by text-based QA models.

  • Unlike end-to-end outside knowledge VQA models that try to capture images, questions, and external knowledge through a single model with a limited number of parameters, our approach is modular. As a result, it can reuse richer state-of-the-art individual models for image label extraction, rewriting, and question answering.

  • Our QA model-based strategy uses an architecture of adaptive rewriting with reinforcement learning techniques and needs only a fraction (about 1/10) of the training data used by state-of-the-art systems.

2 Related Work

In this section, we discuss related work to our research concerning: (1) Text-based QA Models, (2) Visual Question Answering, (3) Adaptive rewriting techniques, and (4) Reinforcement learning for rewriting.

2.1 Text-based QA Models

Open-domain question answering has been predominantly based on the retrieve-then-read paradigm Zhu et al. (2021). Excerpts from large-scale web documents (e.g., Wikipedia) or knowledge representations from massive KGs are first retrieved and then comprehended by machine-learned models to yield potential answers from the extracted passages. Knowledge-oriented question answering, a type of open-domain question answering, often draws on two types of information Fu et al. (2020). The first relies on retrieving many documents from a structured or unstructured knowledge source, such as Wikipedia. The response is then extracted by conditioning on the query and the retrieved passages. REALM (Guu et al., 2020) and ORQA (Lee et al., 2019), for example, have demonstrated promising results when combining masked language models with a differentiable retriever. The second type relies on implicit knowledge stored in model parameters. For example, T5 (Roberts et al., 2020) distributes knowledge in its parameters in a possibly inexplicable way using a large language model pre-trained on unstructured text, and performs well on knowledge-oriented questions without having to mine knowledge from external sources. There have also been hybrid models that combine parametric and non-parametric (i.e., retrieval-based) memories, allowing information to be assessed, changed, and expanded. For example, Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) can update its internal knowledge based on newly obtained data.

Figure 2: End-to-end flow with entity extraction and input adaptation

2.2 Visual Question Answering

A growing interest in multi-modal research and the advent of high-capacity deep neural learning models has made visual question answering a trending research topic. Wu et al. (2017) provide a summary of VQA models and datasets, where they focus on deep learning models that perform reasonably well on public datasets such as VQA Antol et al. (2015) and OKVQA Marino et al. (2019). Early approaches to VQA combined recurrent networks with CNNs to integrate textual and visual data, transformer-based models to highlight the most relevant image region (Ben-younes et al., 2017; Wang et al., 2020), and zero-shot learning to counter the lack of labeled examples Demirel et al. (2019). None of these approaches, however, is built to use external knowledge, so they cannot handle situations where the image does not contain all of the information needed to answer the query. Some research studies (Cao et al., 2021; Ziaeefard and Lécué, 2020) have tackled knowledge-based visual queries, but they exclusively deal with structured knowledge sources represented as subject-relation-object or visual concept-relation-attribute triplets. Therefore, their capacity to answer open-ended and complex questions is limited.

Most relevant to us, Jain et al. (2021) have proposed to treat OKVQA as a task of fusing structured data from the image with the unstructured text rather than a visual recognition problem. The idea is to transform the multi-modal input (image + text) to a text-only input so that the text-based QA model can directly interpret and answer (Figure 1 shows a sample). The text-only version of the original question is constructed by: (a) selecting a referential expression (e.g., “this animal”), (b) substituting the expression with the most appropriate entity from the image, and (c) searching and augmenting additional knowledge with the help of a search engine. For selection and substitution, Jain et al. (2021) use supervised learning approaches for predicting referential spans and selecting appropriate entities for substitution. This approach, however, makes it harder to scale OKVQA systems to unseen domains and question types. Moreover, the output of such a rewriter may not always align with the input vocabulary of the text-based QA models used for answering. In our work, we emphasize using unsupervised and weakly supervised methods that extract grammatical and contextual cues from the visual questions to reformulate them.

2.3 Adaptive Rewriting Techniques

Since fine-tuning a large pre-trained model can be expensive when there are many downstream tasks, Houlsby et al. (2019) propose a new transfer mechanism: adding only a few trainable parameters per task through an adapter module, so that new tasks can be added without revisiting previous ones. Adapters serve the same purpose as fine-tuning but do so by stitching new layers into the main pre-trained model and updating the weights of these new layers while freezing the weights of the pre-trained model. Seq2Seq-based adapters (rewriters) are not new and have been applied to tasks like accent adaptation (Khandelwal et al., 2020) and response generation (Madotto et al., 2020), but not to knowledge-based visual QA. Based on the idea of adapters, Pfeiffer et al. (2020) introduce AdapterHub, a new framework built on top of the HuggingFace library. These adapter architectures adapt well to tasks built on top of generic encoders like BERT, but they cannot easily adapt to the different input styles of QA models.

2.4 Reinforcement Learning for rewriting

Reinforcement learning has been successfully applied to various NLP tasks, including summarization (Paulus et al., 2018), question answering (Hua et al., 2020), and many more. Only recently, Reinforcement Learning (RL) has been applied for query rewriting. Buck et al. (2017) propose an approach to rewrite a jeopardy style query (declarative) to a question answering system that is often trained on questions (interrogative). The reformulation system is trained end-to-end to maximize answer quality using policy gradient. The policy gradient approach has also been used for tasks like machine translation (Wu et al., 2018), which has shown decent performance improvement even when less labeled data is available.

Figure 3: QA Model Agnostic Rewriting with Mask-Replace-Rank based approach using BERT

3 Method

Inspired by the idea of adapters (Houlsby et al., 2019), our solution is to adaptively rewrite the query and produce a “non-grounded” question, which can be answered by the text-based QA system. As shown in Figure 2, given an input question (e.g., How tall is this animal on average?) and an image of two giraffes, we first extract the top entities from the image using off-the-shelf entity extractors such as SqueezeNet (Iandola et al., 2016) for object entities and Scene Classifier (Zhou et al., 2017) for scene labels. Those entities are then provided as input to a rewriter along with the original question, with the objective of rewriting the question into an independent form. We take a simple approach of concatenating the entities with the question (e.g., how tall is this animal on average . giraffe . stone . tree . park) and allowing the model to learn the rewrite transformations. Our adaptive rewriting model is a generative model that learns to rewrite this input into an “independent” question (e.g., how tall is giraffe on average?).
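The concatenated rewriter input described above can be illustrated with a minimal sketch. The function name `build_rewriter_input` and the lowercasing and question-mark handling are our assumptions, chosen only to reproduce the example string given in the text; the actual preprocessing may differ.

```python
from typing import List


def build_rewriter_input(question: str, entities: List[str]) -> str:
    """Concatenate a visual question with detected entity labels.

    Hypothetical helper: the " . " separator and lowercasing mirror the
    example in the text, not a documented preprocessing routine.
    """
    # Normalize the question: lowercase and drop the trailing question mark.
    normalized = question.strip().lower().rstrip("?").strip()
    # Append each entity label, separated by " . ".
    parts = [normalized] + [entity.strip().lower() for entity in entities]
    return " . ".join(parts)


rewriter_input = build_rewriter_input(
    "How tall is this animal on average?",
    ["giraffe", "stone", "tree", "park"],
)
print(rewriter_input)
```

The resulting string is what the rewriter would consume; the entity list itself would come from the object and scene classifiers.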

The rewritten question is then passed on to the text-based QA model to obtain an answer. Adaptive rewriting offers the following advantages. (1) Since we directly use an off-the-shelf QA system, the core QA model does not have to be retrained on this kind of input and is allowed to remain agnostic of the changes in the input structure. (2) The adapter rewriting module can be added to or removed from the execution pipeline on demand. This allows for a modular system that is easy to debug and analyze. (3) The predicted answer is more interpretable, since we can see the rewritten question that resulted in the final answer.

As mentioned in Section 1, we explore two approaches for adaptive rewriting: QA Model Agnostic Rewriting and QA Model Aware Rewriting. We reiterate that these approaches do not require additional labeled training data of original and rewritten questions.

3.1 QA Model Agnostic Adaptive Rewriting

For model agnostic rewriting, we implement a fairly straightforward Mask-Replace-Rank approach Havens and Stal (2019). As shown in Figure 3, given a question “how tall is this animal on average?”, we extract all N-grams from the question and generate the masked candidates by replacing the N-grams with [MASK] tokens. For each entity extracted from the image, a pre-trained BERT model (BERT-large Devlin et al. (2018)) is used to compute the probability of replacing the mask with the entity, $P(\text{entity} \mid \text{MASKed N-grams}, \text{context})$. Entities are ranked according to the language model probability of the token sequence, and the top-scoring token sequence is selected as the rewrite. In our example, $P(\text{giraffe} \mid \text{this animal}, \text{context})$ returns the highest score, since the language model prefers the sequence “how tall is giraffe on average?” to the other candidate sequences. This approach leverages the fill-in-the-blanks nature of the BERT model without requiring additional training data.
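The Mask-Replace-Rank idea can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the maximum N-gram length (`max_n=2`) is our choice, and `toy_score` is a hand-written stand-in for the BERT probability so that the sketch runs without a model. In practice the scorer would be a masked language model.

```python
from typing import Callable, Iterator, List, Tuple


def ngram_spans(num_tokens: int, max_n: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) token index pairs for all n-grams with n <= max_n."""
    for n in range(1, max_n + 1):
        for start in range(num_tokens - n + 1):
            yield start, start + n


def candidate_rewrites(question: str, entities: List[str], max_n: int = 2) -> List[str]:
    """Mask each n-gram in turn and replace it with each entity."""
    tokens = question.rstrip("?").split()
    candidates = []
    for start, end in ngram_spans(len(tokens), max_n):
        for entity in entities:
            rewritten = tokens[:start] + [entity] + tokens[end:]
            candidates.append(" ".join(rewritten) + "?")
    return candidates


def rank_rewrites(question: str, entities: List[str],
                  score: Callable[[str], float], max_n: int = 2) -> List[str]:
    """Rank all candidate rewrites by a language-model-style score."""
    return sorted(candidate_rewrites(question, entities, max_n), key=score, reverse=True)


# Hypothetical scorer: counts "plausible" bigrams from a tiny hand-written set.
# It only stands in for the language model probability used in the paper.
PLAUSIBLE_BIGRAMS = {("how", "tall"), ("tall", "is"), ("is", "giraffe"),
                     ("giraffe", "on"), ("on", "average")}


def toy_score(sentence: str) -> float:
    tokens = sentence.rstrip("?").split()
    return float(sum((a, b) in PLAUSIBLE_BIGRAMS for a, b in zip(tokens, tokens[1:])))


ranked = rank_rewrites("how tall is this animal on average?",
                       ["giraffe", "stone", "tree", "park"], toy_score)
best_rewrite = ranked[0]
print(best_rewrite)
```

Keeping the scorer as a plain callable means the ranking logic stays the same whichever language model supplies the scores.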

3.2 QA Model Aware Adaptive Rewriting

While the prior approach relies primarily on identifying the more probable token sequences for the language, we expect that including some signal from the question answering task would help the model sort out the more ambiguous cases. The model-aware method therefore uses the gold-standard answers and the QA model to learn to rewrite. We use a generative sequence-to-sequence rewriter that takes inputs over an input vocabulary and produces an output sequence over an output vocabulary, which in practice is the same as the vocabulary of the black-box QA model. To begin with, this rewriter can be any generic, pre-trained sequence-to-sequence model (such as BART), and is fine-tuned with noisy inputs Morris et al. (2020) that simulate our use-case scenarios.

The difference between a traditional sequence-to-sequence model and the rewriter is that the rewriter is not retrained with direct supervision from gold-standard sentences provided as output labels. Rather, it is fine-tuned based on the learning signals (gradients) derived from the black-box text-based QA model. This way of training not only helps train the rewriter with less data but also enables it to adapt to different QA systems during training. Since most of the submodules in the architecture are frozen, the rewriter should learn faster and mitigate the problem of catastrophic forgetting during adaptation.

Figure 4: End-to-end flow with Policy Gradient Optimization

We use BART Lewis et al. (2019) for our rewriter, keeping its input and output vocabulary shared and same as the text-based question answering model (T5). The rewriter is initialized with a pre-trained version of BART trained on a large corpus with a denoising auto-encoding objective. The model then undergoes a reinforcement learning process, comprised of three steps: (a) Exploration (b) Reward Computation (c) Optimization (Figure 4).


Exploration:

In this step, we allow the rewriter to produce a set of candidates. In reinforcement learning settings for sequence-to-sequence networks, beam-search decoding is often employed to generate candidates. Our BART decoder generates its top candidates through beam search. Initially, the pre-trained version of BART (which is trained with an auto-encoding objective) may not offer much variation with beam search and may end up copying the input. So, to allow some syntactic variation, we consider the following strategy:

  1. We prepare several temporary inputs following a method similar to that of Section 3.1. We first extract all N-grams from the question and replace them with entities from the image. For each entity, we keep the top candidates according to a GPT-2 language model Radford et al. (2019), using the likelihood (log probability) of the rewrites.

  2. We then pass these temporary rewrites through the BART encoder-decoder, which generates candidates for each temporary rewrite. For one entity, the total number of candidates is therefore the number of temporary rewrites multiplied by the number of candidates generated per rewrite.

  3. For each entity, we compute the average reward of the generated candidates following the Reward Computation step below. For optimization, we only use the candidate lot that has the maximum average reward.
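Step 3 above, keeping only the candidate lot with the highest average reward, can be sketched as below. The function name `select_best_lot`, the candidate strings, and the reward values are all hypothetical; in the actual pipeline the rewards would come from the QA model and the similarity-based reward.

```python
from typing import Callable, Dict, List, Tuple


def select_best_lot(lots: Dict[str, List[str]],
                    reward_fn: Callable[[str], float]) -> Tuple[str, List[str], float]:
    """Return (entity, candidates, average reward) for the best-scoring lot."""
    best_entity, best_avg = None, float("-inf")
    for entity, candidates in lots.items():
        if not candidates:
            continue  # an empty lot cannot be scored
        avg = sum(reward_fn(c) for c in candidates) / len(candidates)
        if avg > best_avg:
            best_entity, best_avg = entity, avg
    if best_entity is None:
        raise ValueError("no non-empty candidate lot to select from")
    return best_entity, lots[best_entity], best_avg


# Illustrative candidate lots (one per entity) with made-up rewards.
lots = {
    "giraffe": ["how tall is giraffe on average?", "how tall is a giraffe on average?"],
    "stone": ["how stone this animal on average?", "how tall is stone on average?"],
}
toy_rewards = {
    "how tall is giraffe on average?": 0.9,
    "how tall is a giraffe on average?": 0.8,
    "how stone this animal on average?": 0.1,
    "how tall is stone on average?": 0.2,
}

best_entity, best_lot, best_avg = select_best_lot(lots, lambda c: toy_rewards[c])
print(best_entity, round(best_avg, 2))
```

Only `best_lot` would then be passed on to the optimization step.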

Reward Computation:

Each candidate output is given to the QA model, which predicts a candidate answer. For $K$ candidates, we compute the average reward ($\bar{R}$) as follows:

$$\bar{R} = \frac{1}{K} \sum_{i=1}^{K} \cos\left(E(a_{gold}),\, E(a_{pred}^{(i)})\right) \qquad (1)$$

where $K$ is the number of candidates, $a_{gold}$ and $a_{pred}^{(i)}$ are the gold standard and predicted answers respectively, $E(\cdot)$ is the embedding representation of the answers, and $\cos(\cdot,\cdot)$ is the cosine similarity between two embeddings. As the answers may have multiple words, for embedding extraction, we consider a sentence embedder such as BERT.

The rationale behind choosing this reward function is two-fold: (a) it is continuous and offers smooth policy gradients, and (b) it indicates the degree to which the predicted answers are semantically similar to the gold-standard answers. With this design, the rewriter output “how stone this animal on average” should receive a low reward, as it does not carry adequate information to obtain a correct answer.
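The average similarity reward can be illustrated with a small, self-contained sketch. Note the substitution: instead of a sentence embedder such as BERT, `embed` below uses a bag-of-words count vector so that the example runs without any model, and it compares against a single gold answer. Both choices are our simplifications, and the example answers are illustrative.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a sentence embedder)."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_reward(gold_answer: str, predicted_answers: List[str]) -> float:
    """Mean cosine similarity between the gold answer and each predicted answer."""
    gold_vec = embed(gold_answer)
    similarities = [cosine(gold_vec, embed(pred)) for pred in predicted_answers]
    return sum(similarities) / len(similarities)


# One exact answer, one partially overlapping answer, one unrelated answer.
reward = average_reward("15 feet", ["15 feet", "16 feet", "stone"])
print(round(reward, 3))
```

Because the reward is a mean of similarities rather than a 0/1 match, partially correct answers still contribute a graded signal, which is the smoothness property motivated above.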


Optimization:

In the optimization step, the rewriter model learns to exploit the candidates that maximize the average reward ($\bar{R}$). In a traditional setting for BART, for training instances of inputs (concatenated question and entities) and rewrites, $(X, Y)$, the learning objective ($L$) is to maximize the likelihood of a target $Y$ given $X$, which is equivalent to minimizing the cross entropy between the target distribution and the predicted output distribution. For training the network, the gradient of the negative cross-entropy loss is used to update the model parameters. The gradient is given by:

$$\nabla_\theta L(\theta) = \nabla_\theta \log p_\theta(Y \mid X) \qquad (2)$$

where $p_\theta$ is the BART model with parameters $\theta$. In our reinforcement learning setting, for optimizing the reward that comes from an external source, we use the policy gradient mechanism Williams (1992). The generator of BART, operating with a policy of $p_\theta$ and producing an output $\hat{Y}$ with an expected reward ($\bar{R}$) computed using Equation 1, will thus have the following gradient:

$$\nabla_\theta L_{RL}(\theta) = \bar{R}\, \nabla_\theta \log p_\theta(\hat{Y} \mid X) \qquad (3)$$

where $L_{RL}$ is the modified learning objective, which has to be maximized. This way of optimizing ensures that, at the end of training, the BART model learns to translate the concatenated input into a rewritten question by dropping entities and performing the necessary re-ordering. For implementation, we use a cross-entropy loss for the policy in Equation 3, computing the cross entropy between the candidates and the input, using the input as the reference. This helps in penalizing too many edits (e.g., adding or dropping too many terms in the rewrite).
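To make the reward-weighted update concrete, the following toy sketch applies a policy gradient to a categorical policy over three fixed candidate rewrites. This is a deliberate simplification and not the BART training loop: the policy is a softmax over per-candidate logits, the gradient is the exact expectation $\mathbb{E}[R\,\nabla_\theta \log p_\theta]$ rather than a sampled estimate, and the reward values, step count, and learning rate are made up.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_policy_gradient(logits: List[float], rewards: List[float]) -> List[float]:
    """Exact E[R * grad log p] for a softmax policy over a fixed candidate set.

    For logit j this equals p_j * (R_j - sum_i p_i * R_i).
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected_reward) for p, r in zip(probs, rewards)]


def train_policy(rewards: List[float], steps: int = 200, lr: float = 1.0) -> List[float]:
    """Gradient ascent on the expected reward, starting from a uniform policy."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        grad = expected_policy_gradient(logits, rewards)
        logits = [theta + lr * g for theta, g in zip(logits, grad)]
    return softmax(logits)


candidates = [
    "how tall is giraffe on average?",
    "how tall is tree on average?",
    "how stone this animal on average?",
]
rewards = [0.9, 0.4, 0.1]  # hypothetical rewards from the QA model
final_probs = train_policy(rewards)
for candidate, prob in zip(candidates, final_probs):
    print(f"{prob:.3f}  {candidate}")
```

The policy shifts probability mass toward the candidate whose answer earns the highest reward, which is the behavior the rewriter is meant to learn at the sequence level.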

4 Experimental Setup

In this section, we describe the setup for our experiments, including the dataset, evaluation metrics, baselines, and training details that we use to assess the effectiveness of our approach.

4.1 Dataset

We test our method on the version of the OKVQA dataset released by Jain et al. (2021), which annotates four types of questions in OKVQA: (1) questions that require detecting objects and subsequent reasoning over an external knowledge source to arrive at the answer, (2) questions that require reading text from the image (OCR) (and no other information) to answer, (3) questions that are based on personal opinion or speculation, and (4) other. Since we focus on rewriting knowledge-based queries, we take only the questions annotated as the first type, which comprise 1,643 questions in the training set and 999 questions in the test set. We further removed the questions that require spatial reasoning (e.g., the person on the left). After this processing, we have 1,010 training examples and 363 test examples (Table 1).

Dataset                                              #Train   #Test
OKVQA (Marino et al., 2019)                           9,009   5,046
OKVQA, knowledge-oriented (Jain et al., 2021)         1,643     999
Our Final Dataset                                     1,010     363
Table 1: Statistics of the Knowledge VQA Datasets
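As a quick check of the data-efficiency claim, the short computation below relates the training-set sizes in Table 1. The variable names are ours; the counts are taken directly from the table.

```python
# Training-set sizes from Table 1.
full_okvqa_train = 9009      # full OKVQA training set
knowledge_train = 1643       # knowledge-oriented questions
final_train = 1010           # after removing spatial-reasoning questions

# Fraction of the full training data used by the model-aware rewriter.
fraction_of_full = final_train / full_okvqa_train
# Fraction of knowledge-oriented questions kept after filtering.
fraction_kept = final_train / knowledge_train

print(f"{fraction_of_full:.1%} of the full OKVQA training set")
print(f"{fraction_kept:.1%} of the knowledge-oriented training questions")
```

The first ratio is roughly 11%, which is where the "about 1/10 of the data" figure comes from.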

4.2 Systems for Comparison

We compare our approach with five existing VQA methods (we obtained their predictions for OKVQA Marino et al. (2019) from the author), along with two additional baselines: one uses a concatenated version (without rewrite) as input to the T5 model, and the other fine-tunes T5 with those concatenated inputs. The VQA baselines are:

  • MUTAN (Ben-Younes et al., 2017): A tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations.

  • MUTAN+AN (Marino et al., 2019): An attention-based version of MUTAN that fuses MUTAN and ArticleNet (MUTAN + AN) as a knowledge-base baseline.

  • BAN (Kim et al., 2018): Bilinear Attention Networks that use a co-attention mechanism between the question and image features.

  • BAN+AN (Marino et al., 2019): An attention-based version of BAN that concatenates the output of the memory network with the last BAN hidden state.

  • QOnly (Marino et al., 2019): A neural network with 3 hidden layers that only takes question features.

4.3 Evaluation Metrics

To evaluate the QA system, we employ three widely used metrics for QA systems: (a) Exact Match (EM), (b) BERT Similarity Score (BS), and (c) Human Evaluation (HE).

  • EM: We calculate the exact match score between the gold answers and the predicted answers, and take the average across all answers.

  • BS: We use SentenceBERT to calculate the similarity between the gold answers and the predicted answers, and take the average score across all answers.

  • HE: We manually evaluate the quality of the answers based on the question and the image. Each answer is judged as either correct or incorrect. We asked four annotators to grade the 363 test examples for all baselines. They were provided with a UI and guidelines (Section 4.4).
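The EM metric can be sketched as follows. Two details are our assumptions rather than stated in the text: answers are compared after lowercasing and whitespace normalization, and a prediction counts as a match if it equals any one of the gold answers. The example answers are taken from Table 3.

```python
from typing import List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparison (assumed normalization)."""
    return " ".join(answer.lower().split())


def exact_match(prediction: str, gold_answers: List[str]) -> float:
    """1.0 if the prediction equals any gold answer after normalization, else 0.0."""
    gold = {normalize(g) for g in gold_answers}
    return 1.0 if normalize(prediction) in gold else 0.0


def mean_exact_match(predictions: List[str], gold_lists: List[List[str]]) -> float:
    """Average exact match over all test questions."""
    scores = [exact_match(p, g) for p, g in zip(predictions, gold_lists)]
    return sum(scores) / len(scores)


gold_lists = [["grass", "plants", "leaves"], ["Benjamin Franklin"], ["snow", "snowfall"]]
predictions = ["wild plants", "Benjamin Franklin", "snowfall"]
em_score = mean_exact_match(predictions, gold_lists)
print(round(em_score, 3))
```

The first prediction, "wild plants", is arguably acceptable but scores zero, which illustrates why exact match tends to undercount correct answers.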

4.4 Grading Guidelines for Human Evaluation

There are 363 examples to be graded. For each system output, the human annotator marks whether the answer and rewritten output given by the system are correct or not (binary). Each example contains an image, a question about the image, a list of gold standard answers from human annotators, and answers from eight Question Answering systems. Annotators are told that, for certain questions, multiple answers are possible and that some correct answers may not appear in the gold standard answer list, so in some cases they mark correct vs. incorrect according to their best judgment. If required, they are allowed to look up the question on a Web search engine. Certain questions require numeric answers (such as the calorie content of a food or a year of invention). The predicted answer is considered correct if it is within a small window of the correct answer (e.g., 19th century for 1890).

4.5 Model and Training Details

For BERT-based scoring and ranking, we use FitBERT, and GPT-based scoring is done through LMScorer. The model optimization was carried out with a batch size of 16 for 9,600 steps. We allocated 1 GPU and 32 CPUs for all of our experiments, and on average each experiment used 8 percent of the allocated GPU, 60 percent of the allocated CPUs, and 35 percent of the GPU memory.

5 Results and Analysis

Models               Train Data   EM     BS     HE
VQA Models
MUTAN+AN             9,009        0.28   0.70   0.45
BAN                  9,009        0.29   0.70   0.47
BAN+AN               9,009        0.29   0.70   0.43
MUTAN                9,009        0.30   0.69   0.43
QOnly                9,009        0.16   0.62   0.24
Concatenated Input   -            0.32   0.71   0.54
Fine-Tuned T5        9,009        0.30   0.71   0.48
Our Methods
Model Agnostic       -            0.31   0.70   0.67
Model Aware          1,010        0.29   0.69   0.67
Table 2: Evaluation of answers and rewrites using three different metrics: EM: Exact Match, BS: BERT Similarity, HE: Human Evaluation
Example 1
Question: what do these animals eat?
Extracted Entities: zebras, tree, park
Model | Rewritten Output (Input to T5) | Answer
Gold | what do zebras eat? | grass, plants, leaves
Concat | what do these animals eat . zebras . tree . park | wild plants
Fine-Tuned T5 | - | hay
Model Agnostic | what do zebras eat? | leaves
Model Aware | what do zebras eat? | wild plants
MUTAN | - | plant
MUTAN+AN | - | grass
BAN | - | grass
BAN+AN | - | grass
QOnly | - | meat

Example 2
Question: what famous founding father was known for his association with this object?
Extracted Entities: kite, sky, beach
Model | Rewritten Output (Input to T5) | Answer
Gold | what famous founding father was known for his association with kite? | Benjamin Franklin
Concat | what famous founding father was known for his association with this object . kite . sky . beach | Benjamin Franklin
Fine-Tuned T5 | - | Benjamin Franklin
Model Agnostic | what famous founding father was known for his association with kite? | Benjamin Franklin
Model Aware | what famous founding father was known for his association with kite? | Benjamin Franklin
MUTAN | - | kite
MUTAN+AN | - | shawn white
BAN | - | beach boy
BAN+AN | - | Benjamin Franklin
QOnly | - | wright

Example 3
Question: what conditions are necessary for this sport?
Extracted Entities: skiing, dogsled, ski slope
Model | Rewritten Output (Input to T5) | Answer
Gold | what conditions are necessary for skiing? | snow, snowfall
Concat | what conditions are necessary for this sport . skiing . dogsled . ski slope | snow
Fine-Tuned T5 | - | ice
Model Agnostic | what conditions are necessary for dogsled? (bad rewrite, but it yields a good answer) | snow
Model Aware | what conditions are necessary for ski slope? (bad rewrite, but it yields a good answer) | snowfall
MUTAN | - | snow
MUTAN+AN | - | snow
BAN | - | snow
BAN+AN | - | snow
QOnly | - | ski
Table 3: Example outputs from different methods.

Table 2 shows the results of the different comparison systems on the 363 test examples using exact match (EM), BERT similarity scores (BS), and human evaluation (HE). We also present some anecdotal examples in Table 3. Our methods achieve exact match and BERT similarity scores that are competitive with the other VQA models and well above the QOnly model, which has much lower accuracy. Note that those VQA models are trained on significantly more human-labeled data. It is also worth noting that the exact match results contain many false negatives, because the metric assumes that the model generates exactly the same answers as the gold standard answers, whereas a model can produce other good answers, such as synonyms of the gold answers. Similarly, BERT similarity scores can contain false positives, because answers that are of the same type as the gold answers are rewarded even when they are not correct. We thus report answer correctness as assessed by humans as a more reliable metric. Based on the human evaluation results, our model agnostic and model aware methods seem to do better than the existing systems. We have several interesting findings:

  • The difference between the exact match accuracy and human evaluation is larger for our methods than for the comparison systems because our methods are more likely to generate synonyms of the gold answers, which the exact match metric wrongly counts as errors. T5 has a larger answer vocabulary because it is trained on at least 409K questions across three datasets: Natural Questions (Kwiatkowski et al., 2019), WebQuestions (Berant et al., 2013) and TriviaQA (Joshi et al., 2017). By contrast, the comparison systems are trained on only 9,009 examples from a single dataset, OKVQA, labeled by five MTurk workers.

  • The fine-tuned T5 baseline does not exceed the concatenated input baseline on any metric (it is lower on EM and HE and equal on BS), which suggests that fine-tuning does not help on this task.

  • Good rewrites usually generate good answers, whereas bad or grammatically incorrect rewrites do not reliably lead to good answers, although they occasionally still do (e.g., the last example in Table 3).
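The first finding above, the gap between human evaluation and exact match, can be checked directly against Table 2 with a short computation. The scores are copied from the table; the dictionary layout and variable names are ours.

```python
# (EM, HE) scores per system, copied from Table 2.
scores = {
    "MUTAN+AN": (0.28, 0.45),
    "BAN": (0.29, 0.47),
    "BAN+AN": (0.29, 0.43),
    "MUTAN": (0.30, 0.43),
    "QOnly": (0.16, 0.24),
    "Concatenated Input": (0.32, 0.54),
    "Fine-Tuned T5": (0.30, 0.48),
    "Model Agnostic": (0.31, 0.67),
    "Model Aware": (0.29, 0.67),
}

# Gap between human evaluation and exact match for each system.
gaps = {name: round(he - em, 2) for name, (em, he) in scores.items()}

# Systems ordered from largest to smallest gap.
ranked_by_gap = sorted(gaps, key=gaps.get, reverse=True)
for name in ranked_by_gap:
    print(f"{name:20s} HE - EM = {gaps[name]:.2f}")
```

The two rewrite-based methods show the largest gaps, consistent with the observation that exact match under-credits their answers more than those of the comparison systems.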

6 Conclusion

In this paper, we explore question rewriting strategies for knowledge-oriented VQA. We contend that, for certain types of VQA questions, rewriting them for a text-based question answering system can lower the need for training data. We explore unsupervised and weakly supervised, reinforcement learning-based question rewriting strategies, and demonstrate effective VQA using an existing pre-trained text-based QA model. Experiments with these types of questions on a public dataset illustrate that this technique is competitive with other end-to-end VQA approaches.

For future work, we will investigate other types of text-based QA systems within this framework. Since entity extraction and disambiguation are key steps, we plan to train an entity extractor with a larger label ontology for visual objects. Additionally, we would like to investigate the impact of visual features, such as image position, bounding box information, relative positions of objects, etc. on the quality of the entity extractor.

References


  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §2.2.
  • H. Ben-younes, R. Cadene, M. Cord, and N. Thome (2017) MUTAN: multimodal tucker fusion for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2612–2620. Cited by: §2.2, 1st item.
  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1533–1544. Cited by: 1st item.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. Cited by: §1.
  • C. Buck, J. Bulian, M. Ciaramita, W. Gajewski, A. Gesmundo, N. Houlsby, and W. Wang (2017) Ask the right questions: active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830. Cited by: §2.4.
  • Q. Cao, B. Li, X. Liang, K. Wang, and L. Lin (2021) Knowledge-routed visual question reasoning: challenges for deep representation embedding. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §2.2.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051. Cited by: §1.
  • B. Demirel, R. G. Cinbis, and N. Ikizler-Cinbis (2019) Image captioning with unseen objects. arXiv preprint arXiv:1908.00047. Cited by: §2.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Can Open Domain QA Models Answer Visual Knowledge Questions?, §1, §3.1.
  • B. Fu, Y. Qiu, C. Tang, Y. Li, H. Yu, and J. Sun (2020) A survey on complex question answering over knowledge base: recent advances and challenges. arXiv preprint arXiv:2007.13069. Cited by: §2.1.
  • K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) Realm: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909. Cited by: §2.1.
  • S. Havens and A. Stal (2019) Use bert to fill in the blanks. Cited by: §3.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790–2799. Cited by: §2.3, §3.
  • Y. Hua, Y. Li, G. Haffari, G. Qi, and T. Wu (2020) Few-shot complex knowledge base question answering via meta reinforcement learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 5827–5837. Cited by: §2.4.
  • F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360. Cited by: §1, §3.
  • A. Jain, M. Kothyari, V. Kumar, P. Jyothi, G. Ramakrishnan, and S. Chakrabarti (2021) Select, substitute, search: a new benchmark for knowledge-augmented visual question answering. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, New York, NY, USA, pp. 2491–2498. ISBN 9781450380379. Cited by: §1, §2.2, §4.1.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551. Cited by: 1st item.
  • K. Khandelwal, P. Jyothi, A. Awasthi, and S. Sarawagi (2020) Black-box adaptation of asr for accented speech. In INTERSPEECH, Cited by: §2.3.
  • J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. arXiv preprint arXiv:1805.07932. Cited by: 3rd item.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466. Cited by: 1st item.
  • K. Lee, M. Chang, and K. Toutanova (2019) Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300. Cited by: §2.1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §1, §3.2.
  • P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv preprint arXiv:2005.11401. Cited by: §1, §2.1.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Cited by: §1.
  • A. Madotto, Z. Lin, Y. Bang, and P. Fung (2020) The adapter-bot: all-in-one controllable conversational model. arXiv preprint arXiv:2008.12579. Cited by: §2.3.
  • K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi (2019) OK-vqa: a visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Can Open Domain QA Models Answer Visual Knowledge Questions?, §1, §1, §2.2, 2nd item, 4th item, 5th item, footnote 2.
  • J. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, and Y. Qi (2020) TextAttack: a framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 119–126. Cited by: §3.2.
  • R. Paulus, C. Xiong, and R. Socher (2018) A deep reinforced model for abstractive summarization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Cited by: §2.4.
  • J. Pfeiffer, A. Rücklé, C. Poth, A. Kamath, I. Vulić, S. Ruder, K. Cho, and I. Gurevych (2020) Adapterhub: a framework for adapting transformers. arXiv preprint arXiv:2007.07779. Cited by: §2.3.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: item 1.
  • A. Roberts, C. Raffel, and N. Shazeer (2020) How much knowledge can you pack into the parameters of a language model?. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §1, §2.1.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) VL-BERT: pre-training of generic visual-linguistic representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Cited by: §1.
  • Y. Wang, S. Joty, M. R. Lyu, I. King, C. Xiong, and S. C. Hoi (2020) Vd-bert: a unified vision and dialog transformer with bert. arXiv preprint arXiv:2004.13278. Cited by: §2.2.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3), pp. 229–256. Cited by: §3.2.
  • L. Wu, F. Tian, T. Qin, J. Lai, and T. Liu (2018) A study of reinforcement learning for neural machine translation. arXiv preprint arXiv:1808.08866. Cited by: §2.4.
  • Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel (2017) Visual question answering: a survey of methods and datasets. Computer Vision and Image Understanding 163, pp. 21–40. Cited by: §2.2.
  • Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Cited by: §1.
  • B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §3.
  • F. Zhu, W. Lei, C. Wang, J. Zheng, S. Poria, and T. Chua (2021) Retrieving and reading: a comprehensive survey on open-domain question answering. arXiv preprint arXiv:2101.00774. Cited by: §2.1.
  • M. Ziaeefard and F. Lécué (2020) Towards knowledge-augmented visual question answering. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 1863–1873. Cited by: §2.2.