FVQA: Fact-based Visual Question Answering

06/17/2016 ∙ by Peng Wang, et al. ∙ The University of Adelaide

Visual Question Answering (VQA) has attracted a lot of attention in both the Computer Vision and Natural Language Processing communities, not least because it offers insight into the relationships between two important sources of information. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of such questions that require no external information to answer is interesting, but very limited. It excludes questions which require common sense, or basic factual knowledge, to answer, for example. Here we introduce FVQA, a VQA dataset which requires, and supports, much deeper reasoning. FVQA only contains questions which require external information to answer. We thus extend a conventional visual question answering dataset, which contains image-question-answer triplets, through additional image-question-answer-supporting fact tuples. The supporting fact is represented as a structural triplet, such as <Cat,CapableOf,ClimbingTrees>. We evaluate several baseline models on the FVQA dataset, and describe a novel model which is capable of reasoning about an image on the basis of supporting facts.







I Introduction

Visual Question Answering (VQA) can be seen as a proxy task for evaluating a vision system’s capacity for deeper image understanding. It requires elements of image analysis, natural language processing, and a means by which to relate images and text. Distinct from many perceptual visual tasks such as image classification, object detection and recognition [1, 2, 3, 4], however, VQA requires that a method be prepared to answer a question that it has never seen before. In object detection the set of objects of interest is specified at training time, for example, whereas in VQA the set of questions which may be asked inevitably extends beyond those in the training set.

The set of questions that a VQA method is able to answer is one of its key features, and one of its key limitations. Asking a method a question outside its scope will lead to a failure to answer, or worse, to a random answer. Much of the existing VQA effort has focused on questions which can be answered by direct analysis of the question and image, on the basis of a large training set [5, 6, 7, 8, 9, 10]. This is a restricted set of questions, which require only relatively shallow image understanding to answer. It is possible, for example, to answer ‘How many giraffes are in the image?’ without understanding anything about giraffes.

The number of VQA datasets available has grown as the field progresses [5, 6, 7, 8, 9, 10]. They have contributed valuable large-scale data for training neural-network based VQA models and introduced various question types and tasks: from global association between QA pairs and images [5, 6, 9] to grounded QA in image regions [10]; from free-form answer generation [5, 7, 9, 10] to multiple-choice picking [5, 6] and blank filling [8]. For example, the questions defined in DAQUAR [6] are almost exclusively “Visual” questions, referring to “color”, “number” and “physical location of the object”. In the Toronto-QA dataset [9], questions are generated automatically from image captions which describe the major visible content of the image.

The VQA dataset in [5], for example, has been very well studied, yet only 5.5% of its questions require adult-level (18+) knowledge (28.4% and 11.2% of questions require older-child (9-12) and teenager (13-17) knowledge, respectively). This limitation means that the dataset does not pose a truly “AI-complete” problem, because it is not a realistic test for human beings. Humans inevitably use their knowledge to answer questions, even visual ones. For example, to answer the question given in Fig. 1, one not only needs to visually recognize the ‘red object’ as a ‘fire hydrant’, but also to know that ‘a fire hydrant can be used for fighting fires’.

Fig. 1: An example visual-based question from our FVQA dataset that requires both visual and common-sense knowledge to answer. The answer and mined knowledge are generated by our proposed method.

Developing methods that are capable of deeper image understanding demands a more challenging set of questions. We consider here the set of questions which may be answered on the basis of an external source of information, such as Wikipedia. This reflects our belief that reference to an external source of knowledge is essential to general VQA. This belief is based on the observation that the number of image-question-answer training examples that would be required to provide the background information necessary to answer general questions about images would be completely prohibitive. The number of concepts that would need to be illustrated is too high, and scales combinatorially.

In contrast to previous VQA datasets which only contain question-answer pairs for an image, we additionally provide a supporting-fact for each question-answer pair. The supporting-fact is a structural representation of information that is necessary for answering the given question. For example, given an image with a cat and a dog, and the question ‘Which animal in the image is able to climb trees?’, the answer is ‘cat’. The required supporting fact for answering this question is <Cat,CapableOf,ClimbingTrees>, which is extracted from an existing knowledge base. By providing supporting facts, the dataset supports answering complex questions, even if all of the information required to answer the question is not depicted in the image. Moreover, it supports explicit reasoning in visual question answering, i.e., it gives an indication as to how a method might derive an answer. This information can be used in answer inference, to search for other appropriate facts, or to evaluate answers which include an inference chain.
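Concretely, a supporting fact can be held as a subject-predicate-object triple and matched against the visual concepts detected in an image. The short Python sketch below walks through the cat/dog example above; the two-fact KB and the `Fact` type are purely illustrative, not the dataset's actual schema:

```python
from collections import namedtuple

# A supporting fact as a structural triple, as in <Cat,CapableOf,ClimbingTrees>.
# The schema here is illustrative, not the dataset's actual representation.
Fact = namedtuple("Fact", ["arg1", "rel", "arg2"])

kb = [
    Fact("Cat", "CapableOf", "ClimbingTrees"),
    Fact("Dog", "CapableOf", "GuardingHouses"),
]

detected = {"Cat", "Dog"}  # visual concepts detected in the image

# 'Which animal in the image is able to climb trees?' -> look for a fact
# whose subject is depicted and whose relation/object match the question.
answers = [f.arg1 for f in kb
           if f.arg1 in detected
           and f.rel == "CapableOf"
           and f.arg2 == "ClimbingTrees"]
print(answers)  # ['Cat']
```

Because the fact is structured rather than free text, the same triple can later be surfaced as the explanation for the answer.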

In demonstrating the value of the dataset in driving deeper levels of image understanding in VQA, we examine the performance of state-of-the-art LSTM (Long Short-Term Memory) models [5, 6, 9] on our FVQA dataset. We find that there are a number of limitations with this approach. The first is that there is no explicit reasoning process in these methods. This means that it is impossible to tell whether the method is answering the question based on image information or merely the prevalence of a particular answer in the training set. The second problem is that, because the model is trained on individual question-answer pairs, the range of questions that can be accurately answered is limited. It can only answer questions about concepts that have been observed in the training set, and there are millions of possible concepts and hundreds of millions of relationships between them. Capturing this amount of information would require an implausibly large LSTM, and a completely impractical amount of training data.

Our main contributions are as follows. A new VQA dataset (FVQA) with additional supporting facts is introduced in Sec. III, which requires and supports deeper reasoning. In response to the observed limitations of the current LSTM-based approach, we propose in Sec. IV a method based on explicit reasoning about the visual concepts detected in images. The proposed method first detects relevant content in the image, and relates it to information available in a pre-constructed knowledge base (we combine several publicly available large-scale knowledge bases). A natural language question is then automatically classified and mapped to a query which runs over the combined image and knowledge base information. The response of the query leads to the supporting fact, which is then processed so as to form the final answer to the question. We achieve state-of-the-art performance of 56.91% Top-1 accuracy (see Sec. V).


II Related Work

II-A Visual Question Answering Datasets

Several datasets designed for Visual Question Answering have been proposed. The DAQUAR [6] dataset is the first small benchmark dataset, built upon indoor scene RGB-D images, and is mostly composed of questions requiring only visual knowledge. Most of the other datasets [5, 7, 8, 9, 10] provide question-answer pairs for Microsoft COCO images [2], either generated automatically by NLP tools [9] or written by human workers [5, 7]. The Visual Genome dataset [11] contains 1.7 million questions, which are asked by human workers based on region descriptions. The MadLibs dataset [8] provides a large number of template-based text descriptions of images, which are used to answer multiple-choice questions about the images. Visual7W [10] established a semantic link between textual descriptions and image regions by object-level grounding, and its questions are asked based on these groundings.

II-B Visual Question Answering Methods

Malinowski et al. [12] were the first to study the VQA problem. They proposed a method that combines image segmentation and semantic parsing with a Bayesian approach to sampling from nearest neighbors in the training set. This approach requires human-defined predicates, which are inevitably dataset-specific. Tu et al. [13] built a query answering system based on a joint parse graph from text and videos. Geman et al. [14] proposed an automatic ‘query generator’ that is trained on annotated images and produces a sequence of binary questions from any given test image.

The current dominant trend within VQA is to combine Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to learn the mapping from input images and questions to answers. Both Gao et al. [7] and Malinowski et al. [15] used RNNs to encode the question and generate the answer. Whereas Gao et al. [7] used two networks, a separate encoder and decoder, Malinowski et al. [15] used a single network for both encoding and decoding. Ren et al. [9] focused on questions with a single-word answer and formulated the task as a classification problem using an LSTM. Inspired by Xu et al. [16], who encoded visual attention for image captioning, the authors of [10, 17, 18, 19, 20] proposed to use spatial attention to help answer visual questions. The works in [20, 21] formulated VQA as a classification problem and restricted answers to a fixed answer space; in other words, they cannot generate open-ended answers. Zhu et al. [22] investigated the video question answering problem using ‘fill-in-the-blank’ questions. However, either an LSTM or a GRU (Gated Recurrent Unit, similar to an LSTM) is still applied in these methods to model the questions. Irrespective of the finer details, we label this the LSTM approach.

II-C Knowledge Bases and VQA

Answering general questions posed by humans about images inevitably requires reference to information not contained in the image itself. To an extent this information may be provided by an existing training set such as ImageNet [3] or MS COCO [2], in the form of class labels or image captions. Such auxiliary information takes a number of forms, including, for instance, question/answer pairs which refer to objects that are not depicted (e.g., which reference people waiting for a train, when the train is not visible in the image) or which provide external knowledge that cannot be derived directly from the image (e.g., that the person depicted is the Mona Lisa).

Large-scale structured Knowledge Bases (KBs) [23, 24, 25, 26, 27, 28, 29], in contrast, offer an explicit, and typically larger-scale, representation of such external information. In structured KBs, knowledge is typically represented by a large number of triples of the form (arg1,rel,arg2), where arg1 and arg2 denote two concepts in the KB and rel is a predicate representing the relationship between them. A collection of such triples forms a large interlinked graph. Such triples are often described according to a Resource Description Framework [30] (RDF) specification, and housed in a relational database management system (RDBMS), or triple-store, which allows queries over the data. The information in KBs can be accessed efficiently using a query language. In this work we use SPARQL [31] to query the OpenLink Virtuoso [32] RDBMS.
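For illustration, a triple pattern of the form (arg1, rel, arg2) maps directly onto a SPARQL SELECT query. The sketch below builds such a query string from optionally bound positions; the `:Cat`-style prefixed names are placeholders, not the actual vocabulary of the KBs used here:

```python
def triple_query(subject=None, predicate=None, obj=None):
    """Build a SPARQL SELECT over (arg1, rel, arg2) triples.

    Unbound positions become query variables; bound positions are
    inlined as-is. The prefixed names passed in are placeholders.
    """
    s = subject or "?arg1"
    p = predicate or "?rel"
    o = obj or "?arg2"
    # Project only the unbound positions; '*' if everything is bound.
    variables = " ".join(v for v in (s, p, o) if v.startswith("?")) or "*"
    return f"SELECT {variables} WHERE {{ {s} {p} {o} . }}"

q = triple_query(subject=":Cat", predicate=":CapableOf")
print(q)  # SELECT ?arg2 WHERE { :Cat :CapableOf ?arg2 . }
```

In practice such a query string would be submitted to the RDBMS endpoint; the sketch only shows the shape of the query, not the connection.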

Large-scale structured KBs are constructed either by manual annotation (e.g., DBpedia [23], Freebase [25] and Wikidata [29]), or by automatic extraction from unstructured/semi-structured data (e.g., YAGO [33, 28], OpenIE [24, 34, 35], NELL [26], NEIL [27, 36], WebChild [37, 38], ConceptNet [39]). The KB that we use here is the combination of DBpedia, WebChild and ConceptNet, which contains structured information extracted from Wikipedia and unstructured online articles.

In the NLP and AI communities, there is an increasing interest in the problem of natural language question answering using structured KBs (referred to as KB-QA) [40, 41, 42, 43, 44, 45, 46, 47, 48]. However, VQA systems exploiting KBs are still relatively rare. Zhu et al. [49] used a KB and RDBMS to answer image-based queries. However, in contrast to our approach, they build a KB for the purpose, using an MRF model, with image features and scene/attribute/affordance labels as nodes. The links between nodes represent mutual compatibility relationships. The KB thus relates specific images to specified image-based quantities, which are all that exists in the database schema. This prohibits question answering that relies on general knowledge about the world. Most recently, Wu et al. [50] encoded text mined from DBpedia to a vector with the Word2Vec model, which they combined with visual features to generate answers using an LSTM model. However, their proposed method only extracts discrete pieces of text from the knowledge base, thus ignoring the power of its structural representation. Neither [49] nor [50] is capable of explicit reasoning, in contrast to the method we propose here.

The approach closest to that we propose here is that of Wang et al. [51], as it is capable of reasoning about an image based on information extracted from a knowledge base. However, their method largely relies on pre-defined templates, and only accepts questions in a pre-defined format. Our method does not suffer this constraint. Moreover, their model used only a single manually annotated knowledge source, whereas the method we propose uses this plus two additional automatically-learned knowledge bases. This is critical because manually constructing such KBs does not scale well; using automatically generated KBs thus enables the proposed method to answer more general questions.

III Creating the FVQA Dataset

Different from previous VQA datasets [5, 7, 8, 9, 10], which ask annotators to provide question-answer pairs without any restrictions, we want the questions in our dataset to be askable and answerable only with reference to some commonsense knowledge. This means that we cannot simply distribute only images to questioners, as in [5, 10]. We need to provide a large number of supporting facts (commonsense knowledge) which are related to the visual concepts in the image. We built our own on-line question collection system, which allows users to choose images, visual concepts and candidate supporting facts freely. The user can then ask questions based on his/her previous choices (all choices are recorded). We give each annotator a tutorial and restrict them to questions which can only be answered using both a visual concept in the image and the provided external commonsense knowledge. The following sections provide more details about the images, visual concepts, knowledge bases, and our question collection system and procedures. We also compare against other VQA datasets with some data statistics.

III-A Images and Visual Concepts

We sample images from the MS COCO [2] validation set and ImageNet [3] test set for collecting questions. Images from MS COCO can provide more context because they have more complicated scenes. Scenes of ImageNet images are much simpler but there are more object categories (200 in ImageNet vs. 80 in MS COCO).

Three types of visual concept extractors are applied to each image:

Object Detector: Two Fast-RCNN [52] models are trained on the MS COCO object detection dataset (train split) and the ImageNet object detection dataset (train+val split) respectively. After combining the two, the classes of objects which can be detected in each image are as listed in the Appendix.

Scene Classifier: The scene classifier trained on the MIT Places [53] dataset is adopted, which assigns each image scene labels drawn from its set of scene classes.

Attribute Classifier: The image attributes for training are obtained from the ground truth captions of MS COCO images, and are made up of actions, objects (without bounding box information) and scenes (see Appendix). A deep model trained by Wu et al. [54] on these training data is incorporated in this work. These object and scene classes differ from the concepts extracted by the above object detector and scene classifier, and the outputs are combined together.

In summary, the extracted visual concepts cover object, scene and action classes. These visual concepts are further linked to a variety of external knowledge, as described in the next section.

III-B Knowledge Bases

KB          Predicate         #Facts   Examples
DBpedia     Category                   (Wii,Category,VideoGameConsole)
ConceptNet  RelatedTo                  (Horse,RelatedTo,Zebra), (Wine,RelatedTo,Goblet)
            AtLocation                 (Bikini,AtLocation,Beach), (Tap,AtLocation,Bathroom)
            IsA                        (Broccoli,IsA,GreenVegetable)
            CapableOf                  (Monitor,CapableOf,DisplayImages)
            UsedFor                    (Lighthouse,UsedFor,SignalingDanger)
            Desires                    (Dog,Desires,PlayFrisbee), (Bee,Desires,Flower)
            HasProperty                (Wedding,HasProperty,Romantic)
            HasA                       (Giraffe,HasA,LongTongue), (Cat,HasA,Claw)
            PartOf                     (RAM,PartOf,Computer), (Tail,PartOf,Zebra)
            ReceivesAction             (Books,ReceivesAction,bought at a bookshop)
            CreatedBy                  (Bread,CreatedBy,Flour), (Cheese,CreatedBy,Milk)
WebChild    Smaller, Better,           (Motorcycle,Smaller,Car), (Apple,Better,VitaminPill),
            Slower, Bigger,            (Train,Slower,Plane), (Watermelon,Bigger,Orange),
            Taller                     (Giraffe,Taller,Rhino)
TABLE I: The predicates in different knowledge bases used for generating questions. The ‘#Facts’ column shows the number of facts which are related to the visual concepts described in Section III-A. The ‘Examples’ column gives some examples of extracted facts, in which the visual concept is underlined.

The knowledge about each visual concept is extracted from a range of existing structured knowledge bases, including DBpedia [23], ConceptNet [39] and WebChild [37, 38].

DBpedia: The structured information stored in DBpedia is extracted from Wikipedia by crowd-sourcing. In this KB, concepts are linked to their categories and super-categories based on the SKOS Vocabulary (http://www.w3.org/2004/02/skos/). In this work, the categories and super-categories of all aforementioned visual concepts are extracted transitively.

ConceptNet: This KB is made up of several commonsense relations, such as UsedFor, CreatedBy and IsA. Much of the knowledge is automatically generated from the sentences of the Open Mind Common Sense (OMCS) project (http://web.media.mit.edu/~push/Kurzweil.html). We adopt 11 common relations (predicates) in ConceptNet to generate questions and answers.

WebChild: The work in [37] considered a form of commonsense knowledge overlooked by most existing KBs, namely comparative relations such as Faster, Bigger and Heavier. In [37], this form of information is extracted automatically from the Web.

The predicates (relations) which we extract from each KB and the corresponding numbers of facts can be found in Table I. All the aforementioned structured information is stored in the form of RDF triples and can be accessed using SPARQL queries.

III-C Question Collection

In this work, we focus on collecting visual questions which need to be answered with the help of supporting-facts. To this end, we designed a specialized system, in which the procedure of asking questions is conducted in the following steps:

  1. Selecting Concept: Annotators are given an image and a number of visual concepts (object, scene and action). They need to choose one of the visual concepts that is related to this image.

  2. Selecting Fact: Once a visual concept is selected, the associated facts are displayed in the form of sentences with the two entities underlined. For example, the fact (Train,Slower,Plane) is expressed as ‘Train is slower than plane’. Annotators must select a fact that is both correct and relevant.

  3. Asking Question and Giving Answer: Annotators are required to ask a question whose answer requires information from both the image and the selected fact. The answer is limited to the two concepts in the supporting-fact. In other words, the source of the answer can be either the visual concept in the image (underlined in Table I) or the concept in the KB.

III-D Data Statistics

Dataset size and other statistics

In total, the questions (each corresponding to a supporting fact) were collected collaboratively by a group of individuals. In order to report significant statistics, we create 5 random splits of the dataset, each with separate sets of training and test images and corresponding training and test questions. (As each image contains a different number of questions, each split may contain a different number of questions for training and test; we report average numbers, and the error bars in Table II show the differences.) These questions can be categorized according to the type of visual concept being asked about (Object, Scene or Action), the source of the answer (Image or KB) and the knowledge base of the supporting-fact (DBpedia, ConceptNet or WebChild).

Table II shows the number of training/test questions falling into each of the above categories. We can see that most of the questions relate to the objects in the images and most of the answers are visual concepts (‘Answer-source’ is ‘Image’). As for knowledge bases, the collected questions rely on supporting-facts from ConceptNet, DBpedia and WebChild in differing proportions (see Table II).

Table III shows summary statistics of the dataset, such as the number of question categories and the average question/answer length. We have 32 question types in total (see Section IV-A for more details). Compared to VQA-real [5] and Visual Genome [11], our FVQA dataset provides longer questions, with an average length of 9.5 words.

Criterion        Categories                          Train   Test   Total
Visual Concept   Object / Scene / Action
Answer-Source    Image / KB
KB-Source        DBpedia / ConceptNet / WebChild
TABLE II: The classification of questions according to ‘the questioned visual concept’, ‘where the answer is from’ and ‘the KB-source of the supporting-fact’. The number of training/test questions in each category is also shown. The error bars are produced by the 5 different splits.
Dataset              Number of  Number of   Num. question  Avg. quest.  Avg. ans.  Knowledge  Supporting
                     images     questions   categories     length       length     bases      facts
DAQUAR [12]          1,449      12,468      4              11.5         1.2        -          -
COCO-QA [9]          117,684    117,684     4              8.6          1.0        -          -
VQA-real [5]         204,721    614,163     20+            6.2          1.1        -          -
Visual Genome [11]   108,000    1,445,322   7              5.7          1.8        -          -
Visual7W [10]        47,300     327,939     7              6.9          1.1        -          -
Visual Madlibs [8]   10,738     360,001     12             6.9          2.0        -          -
VQA-abstract [5]     50,000     150,000     20+            6.2          1.1        -          -
VQA-balanced [55]    15,623     33,379      1              6.2          1.0        -          -
KB-VQA [51]          700        2,402       23             6.8          2.0        1          -
Ours (FVQA)          2,190      5,826       32             9.5          1.2        3          ✓
TABLE III: Major datasets for VQA and their main characteristics.
Fig. 2: The distributions of the collected questions and the corresponding facts over different predicates. The top five predicates are UsedFor, Category, IsA, RelatedTo and CapableOf. There are fewer supporting facts than questions because one ‘fact’ can correspond to multiple ‘questions’.

Predicates distribution

The distributions of collected questions and facts over the different types of predicates are shown in Figure 2. The comparative predicates in WebChild are considered as a single type. We can see that the questions and facts are evenly distributed over the predicates Category, UsedFor, IsA, RelatedTo, CapableOf, AtLocation, HasProperty and HasA, although these predicates differ significantly in the total numbers of extracted facts (see Table I).

Human study of common-sense knowledge

In order to verify whether our collected questions require common-sense knowledge, and whether the supporting facts are helpful for answering such questions, we conducted two human studies, asking subjects:

  1. whether or not the given question requires external common-sense knowledge to answer, and, if ‘yes’,

  2. whether or not the given supporting fact provides the common-sense knowledge needed to answer the question.

The above study was repeated by 3 human subjects independently. We found that 2 or more of the 3 subjects voted ‘yes’ to ‘requires common-sense’ for 97.6% of questions. In the ‘supporting facts’ study, 99.8% of the supporting facts were considered valuable for answering the corresponding knowledge-requiring questions. Figure 3 shows the distribution.

Fig. 3: Human study of what percentage of the collected questions require common-sense knowledge, and whether the collected supporting facts are critical for the reasoning.

III-E Comparison

The most significant difference between the proposed dataset and existing VQA datasets is the provision of supporting-facts. A large portion of visual questions require not only the information in the image itself, but also often-overlooked but critical commonsense knowledge external to the image. It is shown in [5] that for a substantial fraction of questions in the VQA dataset, 3 or more subjects agreed that commonsense reasoning is required to answer (with a smaller fraction for which 6 or more subjects agreed). However, such external knowledge is not provided in any existing VQA dataset. To the best of the authors’ knowledge, this is the first VQA dataset providing supporting-facts.

In this dataset, the supporting-facts which are necessary for answering the corresponding visual questions are obtained from several large-scale structured knowledge bases. This dataset enables the development of approaches which utilize the information from both the image and the external knowledge bases. Different from [51], which applied only a single manually annotated knowledge source, we use two additional automatically-learned knowledge bases, which enable us to answer more general questions.

In a similar manner to ours, the Facebook bAbI [56] dataset also provides supporting-facts, but for purely textual questions. The problem posed in this work is more complex than that in Facebook bAbI, as the information needs to be extracted from both the image and external commonsense knowledge bases.

Another feature of the proposed dataset is that the answers are restricted to concepts from the image and knowledge bases, so ‘Yes’/‘No’ questions are excluded. In the VQA dataset [5], many questions can be answered with ‘Yes’ or ‘No’. It is difficult to measure the reasoning abilities of LSTM-based approaches via these ‘Yes’/‘No’ questions, because it is not clear how the LSTM arrives at the answer.

IV Approach

As shown in Section III, all the information extracted from images and KBs is stored as a graph of interlinked RDF triples. State-of-the-art LSTM approaches [7, 15, 10, 17, 18, 19, 20] directly learn the mapping between questions and answers, which does not scale well to the diversity of answers and cannot expose the key information that the reasoning is based on. In contrast, we propose to learn the mapping between questions and a set of KB-queries, such that there is no limitation on the vocabulary size of the answers (i.e., the answer to a test question does not have to have been observed in the training set) and the supporting facts used for reasoning can be provided.

Fig. 4: An example of the reasoning process of the proposed VQA approach. The visual concepts (objects, scene, attributes) of the input image are extracted using trained models, and are further linked to the corresponding semantic entities in the knowledge base. The input question is first mapped to one of the query types using the LSTM model described in Section IV-A. The types of predicate, visual concept and answer source can be determined accordingly. A specific query (see Section IV-B) is then performed to find all facts in the KB meeting the search conditions. These facts are further matched to the keywords extracted from the question sentence. The fact with the highest matching score is selected, and the answer is obtained accordingly.

IV-A Question-Query Mapping

In our approach, the KB-query is performed according to three properties of the question: its visual concept (referred to as VC), predicate (referred to as REL) and answer source (referred to as AS). As shown in Section III-D, there are a fixed number of types of visual concept, predicate and answer source respectively. In the training data, these properties of a question can be obtained from the annotated supporting-fact and the given answer, and a fixed number of distinct combinations of the three properties occur in the proposed dataset (see Appendix). Since both question and query are sequences, the question-query mapping problem can be treated as a sequence-to-sequence problem [57], which can be solved by a Recurrent Neural Network (RNN) [58]. In this work, we consider each distinct combination as a query type and learn a multi-class classifier using LSTM models [59], in order to identify the above three properties of an input question and perform the corresponding query.

The LSTM is a memory cell encoding knowledge at every time step about the inputs that have been observed up to that step. We follow the model used in [54]. Letting σ be the sigmoid nonlinearity, the LSTM updates for time step t, given inputs x_t, h_{t-1} and c_{t-1}, are:

  i_t = σ(W_{xi} x_t + W_{hi} h_{t-1})       (1)
  f_t = σ(W_{xf} x_t + W_{hf} h_{t-1})       (2)
  o_t = σ(W_{xo} x_t + W_{ho} h_{t-1})       (3)
  g_t = tanh(W_{xc} x_t + W_{hc} h_{t-1})    (4)
  c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t            (5)
  h_t = o_t ⊙ tanh(c_t)                      (6)
  p_{t+1} = Softmax(h_t)                     (7)

Here, i_t, f_t, c_t and o_t are the input, forget, memory and output states of the LSTM, the various W matrices are trained parameters, and ⊙ represents the element-wise product with a gate value. h_t is the hidden state at time step t and is fed to a Softmax, which produces a probability distribution p_{t+1} over all candidate labels.
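As a concrete reference, the standard LSTM update described above can be sketched in a few lines of numpy. This is a minimal single-step cell with bias terms omitted and illustrative weight shapes, not the trained model from [54]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step. W maps gate name -> (W_x, W_h) weight pair.
    Bias terms are omitted for brevity."""
    i = sigmoid(W["i"][0] @ x + W["i"][1] @ h_prev)   # input gate
    f = sigmoid(W["f"][0] @ x + W["f"][1] @ h_prev)   # forget gate
    o = sigmoid(W["o"][0] @ x + W["o"][1] @ h_prev)   # output gate
    g = np.tanh(W["g"][0] @ x + W["g"][1] @ h_prev)   # candidate memory
    c = f * c_prev + i * g                            # memory cell update
    h = o * np.tanh(c)                                # hidden state
    return h, c

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()  # probability distribution over candidate labels

# Tiny random demonstration: one step from zero states.
rng = np.random.default_rng(0)
d = 4
W = {k: (rng.standard_normal((d, d)) * 0.1,
         rng.standard_normal((d, d)) * 0.1) for k in "ifog"}
h, c = lstm_step(rng.standard_normal(d), np.zeros(d), np.zeros(d), W)
p = softmax(h)
print(p.sum())  # sums to 1 (up to float error)
```

In the full model, `h` for the last word is what feeds the Softmax over query types.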

The LSTM model for the question-to-query-type mapping is trained in an unrolled form. More formally, the LSTM takes the sequence of words in the given question Q = (q_0, …, q_T), where q_0 is a special start word. Each word is represented as a one-hot vector q_t. At time step t = 0, we set x_0 = W_e q_0 and h_0 = 0, where W_e denotes the learnable word embedding weights. From t = 1 to t = T, we set x_t = W_e q_t and the input hidden state h_{t-1} is given by the previous step. The cost function is

  C = -(1/N) Σ_{i=1}^{N} log p^{(i)}(l^{(i)}) + λ_θ · ‖θ‖²_2

where N is the number of training examples and l^{(i)} is the ground truth query type of the i-th training question. p^{(i)} is the probability distribution over all candidate query types that is computed by the last LSTM cell, given the previous hidden state and the last word of the question. θ represents the model parameters, and λ_θ · ‖θ‖²_2 is a regularization term.

During testing, the word sequence of the test question is fed forward through the trained LSTM to produce the probability distribution over all query types, via equations (1) to (7).

In Figure 4, the query type of the input question ‘Which animal in this image is able to climb trees?’ is classified by the LSTM classifier as REL=‘CapableOf’,VC=‘Object’,AS=‘Image’.

IV-B Answering by Querying KB

Query Construction Given that the types of predicate (REL) and visual concept (VC) of an input question are obtained using the aforementioned LSTM-based query-type classifier, a KB-query is constructed as shown in the following:

(ImgID,Contain,?X) and (?X,REL,?Y)

where ?X denotes the visual concept of type VC in image ImgID, and ?Y stands for the concept in the KB which is linked to ?X via predicate REL. All pairs ?X,?Y which satisfy the search conditions shown in the query are extracted from the KBs. Note that the query is performed over all the facts related to visual concepts in the KB (see Table I), not only over the facts in the proposed dataset.

As shown in the example of Figure 4, the query Query(‘Img1’,‘CapableOf’,‘Image’) returns all objects in Image ‘Img1’ which are linked to some KB concepts via predicate ‘CapableOf’.

Answering The answer source (referred to as AS) of a given question is again obtained via the LSTM classifier of Section IV-A, and a different answering strategy is used for each source (i.e., AS = ‘Image’ or AS = ‘KB’).

For answers from images (i.e., one of the visual concepts ?X), the KB concepts ?Y are matched against the keywords extracted from the question sentence by removing high-frequency words (such as ‘what’, ‘which’, ‘a’, ‘the’). The matching score is computed as the Jaccard similarity between the word set of ?Y (denoted Y) and the question keywords (denoted K): S(Y, K) = |Y ∩ K| / |Y ∪ K|. The visual concept ?X corresponding to the highest-scored ?Y is taken as the answer. In the example of Figure 4, all ?Ys are matched against the keywords ‘climb trees’, and the fact (‘Cat’,‘CapableOf’,‘Climbing Trees’) achieves the highest score, so the answer is ‘Cat’.
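A minimal sketch of this keyword-matching step, with an invented stopword list standing in for the paper's high-frequency-word filter:

```python
# Invented stopword list; the paper filters "high frequency words".
STOPWORDS = {'what', 'which', 'a', 'the', 'is', 'in', 'this', 'image', 'to'}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_visual_answer(question, xy_pairs):
    """Pick the visual concept ?X whose linked KB concept ?Y best
    matches the question keywords under Jaccard similarity."""
    keywords = [w for w in question.lower().replace('?', '').split()
                if w not in STOPWORDS]
    return max(xy_pairs,
               key=lambda xy: jaccard(xy[1].lower().split(), keywords))[0]

q = 'Which animal in this image is able to climb trees?'
pairs = [('Cat', 'Climbing Trees'), ('Sofa', 'Sitting')]
answer = best_visual_answer(q, pairs)  # 'Cat'
```

Note that with exact word sets, ‘climb’ and ‘climbing’ do not match; a real pipeline would stem or lemmatize before comparing.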

For answers from KBs (i.e., one of the KB concepts ?Y), we need to determine which visual concept (?X) the input question refers to. For questions asking about a scene or action (i.e., VC = ‘Scene’ or ‘Action’), the scene/action concept with the highest probability (obtained from the scene/attribute classifier described in Section III-A) is selected, and the corresponding KB concept ?Y is taken as the answer. For questions asking about objects (i.e., VC = ‘Object’), the visual concept ?X is selected based on location (such as ‘top’, ‘bottom’, ‘left’, ‘right’ or ‘center’) or size (such as ‘small’ and ‘large’) keywords in the question. Note that a single visual concept ?X may correspond to multiple KB entities ?Y, i.e., multiple answers. These answers are ordered according to their frequency among the answers in the training data.
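The frequency-based ordering of multiple KB answers can be sketched as follows (the training answers below are made up for illustration):

```python
from collections import Counter

def order_answers(candidates, training_answers):
    """Order candidate KB answers by how often they occurred as
    answers in the training data (most frequent first)."""
    freq = Counter(training_answers)
    return sorted(candidates, key=lambda a: -freq[a])

# Made-up training answers for the example.
train = ['cooking', 'cooking', 'washing', 'preparing food']
ranked = order_answers(['washing', 'cooking'], train)  # ['cooking', 'washing']
```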

V Experiment

In this section, we first evaluate the performance of our question-to-KB-query mapping. As a key component of our model, its accuracy affects the final visual question answering (VQA) accuracy. We then report the performance of several baseline models and compare them with our proposed method. Unlike all the baseline models, our method performs explicit reasoning for VQA, i.e., it can select the supporting fact from the knowledge base that leads to the answer. We also report the supporting-fact selection accuracy.

V-A Question-Query Mapping Experiment

Table IV reports the accuracy of our proposed question-query mapping (QQmapping) model of Section IV-A. The model is trained on the FVQA training splits and tested on the testing splits. To train the model, we use Stochastic Gradient Descent (SGD) with mini-batches of question-KB query type pairs. Both the word embedding size and the LSTM memory cell size are . The learning rate is set to and clip gradient is . The dropout rate is set to . It converged after epochs of training. We also provide results for the different KB sources. Questions based on facts from the WebChild knowledge base are much easier to map than questions based on the other two KBs. This is mainly because many of the facts in WebChild express ‘comparative’ relationships, such as ‘car is faster than bike’, which leads to user-generated questions that are more repetitive in form; for example, many questions are formulated as ‘Which object in the image is more ⟨comparative adjective⟩?’. Our Top-3 overall accuracy achieves .

Knowledge Q-Q Mapping Acc. (%)
Base Top-1 Top-3
TABLE IV: Question-Query mapping (QQmapping) accuracy on different knowledge bases on the FVQA testing splits. Top-1 and Top-3 returned mappings are evaluated.

V-B FVQA Experiments

Our FVQA task is formulated as open-ended answer generation, which means the model is required to produce open-ended text outputs. To measure accuracy, we simply calculate the proportion of correctly answered test questions. A predicted answer is determined to be correct if and only if it matches the ground-truth answer (all answers are pre-processed by the Python inflect package to eliminate singular/plural differences, etc.). We also report the accuracy when the top-3 and top-10 answers are provided by the given methods.
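The paper normalizes answers with the Python inflect package; the stand-in below uses a deliberately naive singularization heuristic of our own, far cruder than inflect, just to illustrate matching after normalization:

```python
def normalize(answer):
    """Crude normalization: lowercase, strip, naive singularization.
    A real pipeline would use inflect.engine().singular_noun() instead
    of these hand-rolled suffix rules."""
    w = answer.strip().lower()
    if w.endswith('ies'):
        return w[:-3] + 'y'
    if w.endswith('s') and not w.endswith('ss'):
        return w[:-1]
    return w

def is_correct(predicted, ground_truth):
    """Exact-match accuracy after normalization, as used for Top-1."""
    return normalize(predicted) == normalize(ground_truth)
```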

Additionally, we also report the Wu-Palmer similarity (WUPS) [60]. WUPS calculates the similarity between two words based on their common subsequence in the taxonomy tree. If the similarity between the two words is greater than a threshold, the candidate answer is considered correct. We report results for thresholds 0.9 and 0.0. All reported results are averaged over the 5 test splits (standard deviations are also provided).
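Wu-Palmer similarity between two concepts a and b is 2·depth(lcs)/(depth(a)+depth(b)), where lcs is their lowest common subsumer in the taxonomy. A self-contained sketch over an invented mini-taxonomy (real evaluations use the WordNet taxonomy, e.g. via NLTK's wup_similarity):

```python
def depth(node, parent):
    """Depth of a node in the taxonomy; the root has depth 1."""
    d = 0
    while node is not None:
        node = parent.get(node)
        d += 1
    return d

def ancestors(node, parent):
    """Path from a node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return path

def wup(a, b, parent):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    anc_b = set(ancestors(b, parent))
    lcs = next(n for n in ancestors(a, parent) if n in anc_b)
    return 2.0 * depth(lcs, parent) / (depth(a, parent) + depth(b, parent))

# Invented mini-taxonomy: entity -> animal -> {cat, dog}.
parent = {'animal': 'entity', 'cat': 'animal', 'dog': 'animal'}
score = wup('cat', 'dog', parent)  # 2*2/(3+3) = 0.666...
```

With threshold 0.9 this pair would be rejected; with threshold 0.0 any pair sharing a root is accepted, which is why WUPS@0.0 is far more permissive.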

Method Overall Acc. Std (%)
Top-1 Top-3 Top-10
Ours, gt-QQmapping
Ours, top-1-QQmapping
Ours, top-3-QQmapping
Human - -
TABLE V: Overall accuracy on our FVQA testing splits for different methods.  indicates that ground truth Question-Query mappings are used, which (in gray) will not participate in rankings.
Method WUPS@0.9. Std (%)
Top-1 Top-3 Top-10
Ours, gt-QQmapping
Ours, top-1-QQmapping
Ours, top-3-QQmapping
Human - -
TABLE VI: WUPS@0.9 on our FVQA testing splits for different methods.  indicates that ground truth Question-Query mappings are used, which (in gray) will not participate in rankings.
Method WUPS@0.0. Std (%)
Top-1 Top-3 Top-10
Ours, gt-QQmapping
Ours, top-1-QQmapping
Ours, top-3-QQmapping
Human - -
TABLE VII: WUPS@0.0 on our FVQA testing splits for different methods.  indicates that ground truth Question-Query mappings are used, which (in gray) will not participate in rankings.

We evaluate baseline models on the FVQA task in three settings: without images (Question), without questions (Image), and with both images and questions (Question+Image). As in [10], in the experiments without images (questions), we zero out the image (question) features. We briefly describe the three models used in the experiments:


A Support Vector Machine (SVM) model that predicts the answer from a concatenation of image and question features. For the image features, we use the fc7 features (4096-d) from VggNet-16 [4]. The questions are represented by 300-d averaged word embeddings from a pre-trained word2vec model [61]. We take the top-500 most frequent answers ( of the training set answers) as the class labels. At test time, we select the top-1, top-3 and top-10 scoring answer candidates. We use LibSVM [62] with parameter C set to 1.
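The feature side of this baseline can be sketched as below. The dimensions follow the text (4096-d fc7 + 300-d averaged word2vec, 500 answer classes), but the random weight matrix merely stands in for the trained classifier, and all names are ours:

```python
import numpy as np

def predict_topk(img_feat, q_feat, W, b, k=3):
    """Concatenate image and question features and return the indices
    of the k highest-scoring answer classes under a linear scorer."""
    x = np.concatenate([img_feat, q_feat])   # 4396-d joint feature
    scores = W @ x + b                       # one score per answer class
    return np.argsort(scores)[::-1][:k]      # top-k, best first

rng = np.random.default_rng(0)
n_classes = 500                              # top-500 frequent answers
W = rng.normal(size=(n_classes, 4096 + 300))  # placeholder for trained weights
b = np.zeros(n_classes)
top3 = predict_topk(rng.normal(size=4096), rng.normal(size=300), W, b, k=3)
```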

LSTM We compare our system with an approach [9] (which we label LSTM) that treats question answering as a classification problem. The LSTM outputs at the last time step are fed into a softmax layer to predict answers over a fixed answer space (the top-500 most frequent answers). This is also very similar to the ‘LSTM+MLP’ method proposed in [5]. Specifically, we use the fc7 layer (4096-d) of the pre-trained VggNet-16 model as the image features, and the LSTM is trained on our training split. The LSTM layer contains memory cells in each unit. The learning rate is set to and clip gradient is . The dropout rate is set to . As with the SVM models, we select the top-1, top-3 and top-10 scoring answer candidates at test time.

Human We also report human performance. The testing splits are given to 5 human subjects, who are allowed to use any medium (such as books, Wikipedia, Google, etc.) to gather the information or knowledge needed to answer the questions. Human subjects may provide only one answer per question, so there are no Top-3 or Top-10 evaluations for human performance. Note that these 5 subjects were not involved in the earlier question-collection procedure.

Ours Our KB-query based model is introduced in Section IV. To verify the effectiveness of our method, we implement three model variants. gt-QQmapping uses the ground-truth question-query mapping, while top-1-QQmapping and top-3-QQmapping use the top-1 and top-3 predicted question-query mappings (see Section IV-A), respectively.

Method KB-Source
DBpedia ConceptNet WebChild
Top-1 Top-3 Top-10 Top-1 Top-3 Top-10 Top-1 Top-3 Top-10
Ours, gt-QQmapping
Ours, top-1-QQmapping
Ours, top-3-QQmapping
Human - - - - - -
TABLE VIII: Accuracies on the questions that asked based on different Knowledge Base sources.  indicates that ground truth Question-Query mappings are used, which (in gray) will not participate in rankings.
Method Visual Concept
Object Scene Action
Top-1 Top-3 Top-10 Top-1 Top-3 Top-10 Top-1 Top-3 Top-10
Ours, gt-QQmapping
Ours, top-1-QQmapping
Ours, top-3-QQmapping
Human - - - - - -
TABLE IX: Accuracies on questions that focus on three different visual concepts.  indicates that ground truth Question-Query mappings are used, which (in gray) will not participate in rankings.
Method Answer-Source
Image KB
Top-1 Top-3 Top-10 Top-1 Top-3 Top-10
Ours, gt-QQmapping
Ours, top-1-QQmapping
Ours, top-3-QQmapping
Human - - - -
TABLE X: Accuracies for different methods according to different answer sources.  indicates that ground truth Question-Query mappings are used, which (in gray) will not participate in rankings.
Which furniture in this image What animal in this image Which animal in this image Which transportation way in this
     can I lie on? are pulling carriage? has stripes? image is cheaper than taxi?
Mined Facts: a sofa is usually horses sometimes pull carriages zebras have stripes bus are cheaper than taxi
to sit or lie on
Predicted Answer: sofa horse zebras bus
Ground Truth: sofa horse zebras bus
   Which object in this image What thing in this image is Which food in this image can What animal can be found in
   can I ride? helpful for a romantic dinner? be seen on a birthday party? this place?
Mined Facts: motorcycle is wine is good for a cake is related to You are likely to find a cow
used for riding romantic dinner birthday party in a pasture
Predicted Answer: motorcycle wine cake cow
Ground Truth: motorcycle wine cake cow
   What kind of people can we What does the animal in the right Which object in this image is related to sail? What thing in this image is
   usually find in this place? of this image have as a part? capable of hunting a mouse?
Mined Facts: skiiers can be on snails have shells boat is related to sailing a cat can hunt mice
a ski slope
Predicted Answer: skiiers shells boat cat
Ground Truth: skiiers shells boat cat
   Which object in this image is used Which object in this image Which object in this image Which instrument in this
    to measure the passage of time? is a very trainable animal? is related to wool? image is common in jazz?
Mined Facts: a clock is for measuring horses are very sheep is related a saxophone is a common
the passage of time trainable animals to wool instrument in jazz
Predicted Answer: clock horse sheep saxophone
Ground Truth: clock horse sheep saxophone
TABLE XI: Some example results generated by our method. The supporting-fact triplets have been translated into textual sentences for ease of understanding.
What animal in this image can rest standing up? What does the place in the image Which object in this image What can I do using this place?
can be used for? is utilized to chill food?
Predicted VC: Person, Cart, … Kitchen, … Refrigerator, Over, Stove, … Kitchen, Refrigerator, …
GT VC: Horse Bathroom Refrigerator Kitchen
Predicted QT: (CapableOf, Image, Object) (UsedFor, KB, Scene) (IsA, Image, Object) (UsedFor, KB, Scene)
GT QT: (CapableOf, Image, Object) (UsedFor, KB, Scene) (UsedFor, Image, Object) (UsedFor, KB, Scene)
Mined Fact: People can stand up for themselves A bathroom is for washing your hands An oven is a device to heat food A kitchenette is for cooking
GT Fact: Horses can rest standing up A kitchen is for cooking A refrigerator is used for chilling food A kitchenette is for preparing food
Predicted Answer: People Cooking Oven Cooking
Ground Truth: Horse Washing Refrigerator Preparing food
TABLE XII: Failure cases of our approach (GT: ground truth, QT: query type, VC: visual concept). The failure in the first two examples is that the visual concepts are not extracted correctly. Our method makes a mistake on the third example due to an incorrect question-to-query mapping. The reason for the fourth example is that the question has multiple answers (our method orders these answers according to their frequency in the training data; see Section IV-B for details).

Table V shows the overall accuracy of all baseline methods and our proposed models. In terms of Top-1 accuracy, our proposed top-3-QQmapping model performs best, doubling the accuracy of the best baseline (LSTM-Question+Image). top-3-QQmapping is better than top-1-QQmapping because it produces better question-query mapping results, as shown in Table IV. However, it is still not as good as gt-QQmapping, since question-query mapping errors remain. There is still a significant gap between our models and human performance. Among the baseline models, the LSTM methods perform slightly better than SVM. Question+Image models always predict more accurate answers than Question or Image alone, for both SVM and LSTM. Interestingly, in contrast to previous works [9, 5], which found that the ‘question’ played a more important role than the ‘image’ in visual question answering, our {SVM,LSTM}-Q does worse than {SVM,LSTM}-I, meaning that our questions rely more heavily on image content than those in the VQA [5] or COCO-QA [9] datasets. Indeed, if {SVM,LSTM}-Q performed too well, the corresponding questions would arguably be textual rather than visual questions. For Top-3 and Top-10 accuracy, our top-3-QQmapping model also performs best. However, as both our SVM and LSTM models are optimized for top-1 classification accuracy rather than for ranking performance, it is not entirely fair to compare the SVM and LSTM models in terms of Top-10 accuracy. Table XI shows some example results generated by our final model. Tables VI and VII report the WUPS@0.9 and WUPS@0.0 accuracy for the different methods.

Table VIII reports the accuracy of the methods on questions asked based on different knowledge-base sources. Our methods produce comparable accuracy on all three KB sources, which suggests that our models can generalize to many different KBs. DBpedia contains highly structured knowledge from Wikipedia; ConceptNet includes much commonsense knowledge; and WebChild contains many ‘comparative’ facts.

Table IX illustrates the performance on questions that focus on three different visual concepts: object, scene and action. The performance on object-related questions is much higher than on the other two types, especially when image features are given. This is not surprising, since the image features are extracted from VggNet, which was pre-trained on an object classification task. The accuracy on action- or scene-related questions is poorer than on object-related questions (even for human subjects), partly because the answers to many scene- or action-related questions can be expressed in different ways. For example, the answer to ‘What can I do in this place?’ (where the image scene is a kitchen) can be ‘preparing food’ or ‘cooking’. Moreover, action classification also performs worse than object classification, which further degrades VQA performance on these questions.

Table X presents the accuracy of the different methods according to answer source. If the answer is a visual concept in the image, we categorize the answer source as ‘Image’; otherwise, it is categorized as ‘KB’. From the table, we can see that accuracy is much higher when the answer comes from the ‘Image’ side than from the ‘KB’ side. This suggests that generating answers from a nearly unlimited answer space (where the answer does not appear directly in the image) is a very challenging task. Our proposed models perform better than the baseline models.

Table XI shows some examples in which our method produces the correct answer, and Table XII shows some failure cases. From Table XII, we can see that the failures fall into three categories: 1. The visual concepts of the input image are not extracted correctly; in particular, errors usually occur when the questioned visual concepts are missed. 2. The question-to-query mapping (via the LSTM) is incorrect, which means the question text is misunderstood. 3. Errors occur during the post-processing stage that generates the final answer from the queried KB facts; the approach must select the most relevant fact from the multiple facts matching the query conditions. In particular, for questions whose answers come from the KB (in other words, open-ended questions), our method may generate multiple answers (see Section IV-B), and sometimes the ground truth is not first in the ordered answers. In these cases, the top-1 answer is wrong, although a lower-ranked answer may be correct.

Unlike all other state-of-the-art VQA methods, our proposed models are capable of explicit reasoning, i.e., providing the supporting facts behind the predicted answer. Table XIII reports the accuracy of supporting-fact prediction. Our method has a good chance of predicting the correct supporting fact, which is a surprisingly good result given that there are millions of facts in the incorporated knowledge bases.

Method Facts Prediction Acc. Std (%)
Top-1 Top-3 Top-10
Ours, gt-QQmapping
Ours, top-1-QQmapping
Ours, top-3-QQmapping
TABLE XIII: Facts prediction accuracy for our proposed methods.  indicates that ground truth Question-Query mappings are used, which (in gray) will not participate in rankings.

VI Conclusion

In this work, we have built a new dataset and an approach for the task of visual question answering with external commonsense knowledge. The proposed FVQA dataset differs from existing VQA datasets in that it provides a supporting fact which is critical for answering each visual question. We have also developed a novel VQA approach, which is able to automatically find the supporting fact for a visual question in large-scale structured knowledge bases. Instead of directly learning the mapping from questions to answers, our approach learns the mapping from questions to KB queries, so it scales much better with the diversity of answers. Moreover, beyond giving the answer to a visual question, the proposed method also provides the supporting fact on which the answer is based, which uncovers the reasoning process.

Acknowledgement This work was in part supported by Australian Research Council project FT120100969, and Data to Decisions CRC Research Center.

C. Shen is the corresponding author.


  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012.
  • [2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comp. Vis., 2014.
  • [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009.
  • [4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2015.
  • [5] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual Question Answering,” in Proc. IEEE Int. Conf. Comp. Vis., 2015.
  • [6] M. Malinowski and M. Fritz, “Towards a Visual Turing Challenge,” arXiv:1410.8027, 2014.
  • [7] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, “Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering,” in Proc. Adv. Neural Inf. Process. Syst., 2015.
  • [8] L. Yu, E. Park, A. C. Berg, and T. L. Berg, “Visual Madlibs: Fill in the Blank Description Generation and Question Answering,” in Proc. IEEE Int. Conf. Comp. Vis., December 2015.
  • [9] M. Ren, R. Kiros, and R. Zemel, “Exploring Models and Data for Image Question Answering,” in Proc. Adv. Neural Inf. Process. Syst., 2015.
  • [10] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, “Visual7W: Grounded Question Answering in Images,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [11] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalanditis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei, “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,” arXiv preprint arXiv:1602.07332, 2016.
  • [12] M. Malinowski and M. Fritz, “A multi-world approach to question answering about real-world scenes based on uncertain input,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1682–1690.
  • [13] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu, “Joint video and text parsing for understanding events and answering queries,” MultiMedia, IEEE, vol. 21, no. 2, pp. 42–70, 2014.
  • [14] D. Geman, S. Geman, N. Hallonquist, and L. Younes, “Visual Turing test for computer vision systems,” Proceedings of the National Academy of Sciences, vol. 112, no. 12, pp. 3618–3623, 2015.
  • [15] M. Malinowski, M. Rohrbach, and M. Fritz, “Ask Your Neurons: A Neural-based Approach to Answering Questions about Images,” in Proc. IEEE Int. Conf. Comp. Vis., 2015.
  • [16] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in Proc. Int. Conf. Mach. Learn., 2015.
  • [17] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, “ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering,” arXiv preprint arXiv:1511.05960, 2015.
  • [18] A. Jiang, F. Wang, F. Porikli, and Y. Li, “Compositional Memory for Visual Question Answering,” arXiv preprint arXiv:1511.05676, 2015.
  • [19] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Neural Module Networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [20] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked Attention Networks for Image Question Answering,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [21] H. Noh, P. H. Seo, and B. Han, “Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [22] L. Zhu, Z. Xu, Y. Yang, and A. G. Hauptmann, “Uncovering Temporal Context for Video Question and Answering,” arXiv preprint arXiv:1511.04670, 2015.
  • [23] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, DBpedia: A nucleus for a web of open data.    Springer, 2007.
  • [24] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, “Open information extraction for the web,” in Proc. Int. Joint Conf. on Artificial Intell., 2007.
  • [25] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: a collaboratively created graph database for structuring human knowledge,” in Proc. ACM SIGMOD/PODS Conf., 2008, pp. 1247–1250.
  • [26] A. Carlson, J. Betteridge, B. Kisiel, and B. Settles, “Toward an Architecture for Never-Ending Language Learning.” in Proc. National Conf. Artificial Intell., 2010.
  • [27] X. Chen, A. Shrivastava, and A. Gupta, “Neil: Extracting visual knowledge from web data,” in Proc. IEEE Int. Conf. Comp. Vis., 2013.
  • [28] F. Mahdisoltani, J. Biega, and F. Suchanek, “YAGO3: A knowledge base from multilingual Wikipedias,” in CIDR, 2015.
  • [29] D. Vrandečić and M. Krötzsch, “Wikidata: a free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014.
  • [30] R. W. Group et al., “Resource description framework,” 2014, http://www.w3.org/standards/techs/rdf.
  • [31] E. Prud’Hommeaux, A. Seaborne et al., “SPARQL query language for RDF,” W3C recommendation, vol. 15, 2008.
  • [32] O. Erling, “Virtuoso, a Hybrid RDBMS/Graph Column Store.” IEEE Data Eng. Bull., vol. 35, no. 1, pp. 3–8, 2012.
  • [33] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, “YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia,” in Proc. Int. Joint Conf. on Artificial Intell., 2013.
  • [34] O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam, “Open Information Extraction: The Second Generation.” in Proc. Int. Joint Conf. on Artificial Intell., 2011.
  • [35] A. Fader, S. Soderland, and O. Etzioni, “Identifying relations for open information extraction,” in Proc. Conf. Empirical Methods Natural Language Processing, 2011.
  • [36] X. Chen, A. Shrivastava, and A. Gupta, “Enriching visual knowledge bases via object discovery and segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2014.
  • [37] N. Tandon, G. De Melo, and G. Weikum, “Acquiring Comparative Commonsense Knowledge from the Web.” in Proc. National Conf. Artificial Intell., 2014.
  • [38] N. Tandon, G. de Melo, F. Suchanek, and G. Weikum, “Webchild: Harvesting and organizing commonsense knowledge from the web,” in Proceedings of the 7th ACM international conference on Web search and data mining.    ACM, 2014, pp. 523–532.
  • [39] H. Liu and P. Singh, “ConceptNet—a practical commonsense reasoning tool-kit,” BT technology journal, vol. 22, no. 4, pp. 211–226, 2004.
  • [40] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic Parsing on Freebase from Question-Answer Pairs.” in Proc. Conf. Empirical Methods Natural Language Processing, 2013, pp. 1533–1544.
  • [41] A. Bordes, S. Chopra, and J. Weston, “Question answering with subgraph embeddings,” arXiv:1406.3676, 2014.
  • [42] Q. Cai and A. Yates, “Large-scale Semantic Parsing via Schema Matching and Lexicon Extension,” in Proc. Conf. the Association for Computational Linguistics, 2013.
  • [43] A. Fader, L. Zettlemoyer, and O. Etzioni, “Open question answering over curated and extracted knowledge bases,” in Proc. ACM Int. Conf. Knowledge Discovery & Data Mining, 2014.
  • [44] O. Kolomiyets and M.-F. Moens, “A survey on question answering technology from an information retrieval perspective,” Information Sciences, vol. 181, no. 24, pp. 5412–5434, 2011.
  • [45] T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer, “Scaling semantic parsers with on-the-fly ontology matching,” in Proc. Conf. Empirical Methods Natural Language Processing, 2013.
  • [46] X. Yao and B. Van Durme, “Information extraction over structured data: Question answering with Freebase,” in Proc. Conf. the Association for Computational Linguistics, 2014.
  • [47] P. Liang, M. I. Jordan, and D. Klein, “Learning dependency-based compositional semantics,” Computational Linguistics, vol. 39, no. 2, pp. 389–446, 2013.
  • [48] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano, “Template-based question answering over RDF data,” in WWW, 2012.
  • [49] Y. Zhu, C. Zhang, C. Ré, and L. Fei-Fei, “Building a large-scale multimodal Knowledge Base for Visual Question Answering,” arXiv:1507.05670, 2015.
  • [50] Q. Wu, P. Wang, C. Shen, A. van den Hengel, and A. Dick, “Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [51] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, “Explicit Knowledge-based Reasoning for Visual Question Answering,” arXiv preprint arXiv:1511.02570, 2015.
  • [52] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comp. Vis., 2015.
  • [53] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Proc. Adv. Neural Inf. Process. Syst., 2014.
  • [54] Q. Wu, C. Shen, A. van den Hengel, L. Liu, and A. Dick, “What Value Do Explicit High-Level Concepts Have in Vision to Language Problems?” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [55] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh, “Yin and yang: Balancing and answering binary visual questions,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [56] J. Weston, A. Bordes, S. Chopra, and T. Mikolov, “Towards ai-complete question answering: A set of prerequisite toy tasks,” arXiv preprint arXiv:1502.05698, 2015.
  • [57] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
  • [58] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur, “Recurrent neural network based language model.” in Interspeech, vol. 2, 2010, p. 3.
  • [59] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [60] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” in Proc. Conf. the Association for Computational Linguistics, 1994.
  • [61] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
  • [62] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.

Appendix A Appendix

Table XIV shows the query types, classified by the type of visual concept, predicate and answer source, along with the number of training/test questions for each query type. Table XV gives more examples from the proposed dataset. Tables XVI and XVII list the visual concepts that can be extracted by the object detectors and attribute classifiers. Figure 5 shows a snapshot of the system designed for collecting questions.

Query type (REL,VC,AS) Train Test Total
TABLE XIV: The query types. VC: type of visual concepts, REL: type of predicates, AS: answer source.
Which object in this image Can you name the beer that we Which instrument in this image Why do they need a bow tie?
is able to stop cars usually enjoy with the fruit in the image? is usually used in polka music
GT Fact: Traffic light can stop cars Lemon is related to corona Accordions are used in polka music Bow ties are worn at formal events
Ground Truth: Traffic light Corona Accordion Formal events
Whether this animal runs What drink is made with Whether the game is a summer How many times you should
slower or faster than horse? this fruit? or winter Olympic? use this stuff per day?
GT Fact: Camel are Grenadine is related to Balance beam belongs to the category A toothbrush should be used
slower than horse pomegranates of Summer Olympic disciplines twice a day
Ground Truth: Slower Grenadine Summer Olympic disciplines Used twice a day
What is the difference between Is there present in the image Can you identify any medical Can you describe the metal
the animal on the left and moth? any tool used for logging? equipment in the image? thing on the right?
GT Fact: Butterfly are usually Chain saw belongs to the Crutch belongs to the category A knife is a metal blade for cutting
more colorful than moth category of Logging of Medical equipment or as a weapon with usually one
long sharp edge fixed in a handle
Ground Truth: More colorful than moth Chain saw Crutches Metal blade for cutting or
as a weapon with usually one
long sharp edge fixed in a handle
TABLE XV: More examples of the constructed dataset with a diverse set of questions, supporting facts and answers.
Category Number Object Name categories
person person
vehicle bicycle, car, motorcycle, airplane, bus, train, truck, boat, cart, snowmobile, snowplow, unicycle
outdoor traffic light, fire hydrant, stop sign, parking meter, bench
animal cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, ant, antelope, armadillo, bee, butterfly, camel, centipede, dragonfly, fox, frog, giant panda, goldfish, hamster, hippopotamus, isopod, jellyfish, koala bear, ladybug, lion, lizard, lobster, monkey, otter, porcupine, rabbit, ray, red panda, scorpion, seal, skunk, snail, snake, squirrel, starfish, swine, tick, tiger, turtle, whale, bird
accessory backpack, umbrella, handbag, tie, suitcase, band aid, bathing cap, crutch, diaper, face powder, hat with a wide brim, helmet, maillot, miniskirt, neck brace, plastic bag, stethoscope, swimming trunks, bow tie, sunglasses, brassiere
sports frisbee, skis, snowboard, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, balance beam, baseball, basketball, croquet ball, golf ball, golfcart, horizontal bar, punching bag, racket, rugby ball, soccer ball, tennis ball, volleyball, ping-pong ball, puck, dumbbell
kitchen bottle, wine glass, cup, fork, knife, spoon, bowl, beaker, can opener, cocktail shaker, corkscrew, frying pan, ladle, milk can, pitcher, plate rack, salt or pepper shaker, spatula, strainer, water bottle
food banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, artichoke, bagel, bell pepper, burrito, cream, cucumber, fig, guacamole, hamburger, head cabbage, lemon, mushroom, pineapple, pomegranate, popsicle, pretzel, strawberry
furniture chair, couch, potted plant, bed, dining table, toilet, baby bed, filing cabinet
electronic tv, laptop, mouse, remote, keyboard, cell phone, iPod
appliance microwave, oven, toaster, sink, refrigerator, coffee maker, dishwasher, electric fan, printer, stove, tape player, vacuum, waffle iron, washer
indoor clock, vase, scissors, teddy bear, hair drier, toothbrush, binder, bookshelf, digital clock, hair spray, lamp, lipstick, pencil box, pencil sharpener, perfume, rubber eraser, ruler, soap dispenser, book
music accordion, banjo, cello, chime, drum, flute, french horn, guitar, harmonica, harp, oboe, piano, saxophone, trombone, trumpet, violin, maraca
tool axe, bow, chain saw, hammer, power drill, screwdriver, stretcher, syringe, nail
TABLE XVI: The objects which can be detected by object detectors.
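When a detector fires on one of these labels, the label can be resolved back to its category by inverting the table. A minimal sketch, using a hand-copied subset of Table XVI (the dictionary contents below are illustrative, not the full table):

```python
# Subset of Table XVI: category -> detectable object names (illustrative only).
OBJECT_CATEGORIES = {
    "vehicle": ["bicycle", "car", "bus", "train", "truck", "boat"],
    "animal": ["cat", "dog", "camel", "butterfly", "zebra"],
    "tool": ["axe", "chain saw", "hammer", "screwdriver"],
}

# Invert the table so each detected label resolves to its category.
LABEL_TO_CATEGORY = {
    name: category
    for category, names in OBJECT_CATEGORIES.items()
    for name in names
}

print(LABEL_TO_CATEGORY["chain saw"])  # tool
print(LABEL_TO_CATEGORY["camel"])      # animal
```

The same inversion applies to the attribute table (Table XVII), mapping predicted actions, scenes, and objects back to their super-categories.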
Super-category: Attribute categories
Action: playing, sitting, standing, swinging, catching, cutting, dining, driving, eating, flying, hitting, jumping, laying, racing, reads, swimming, running, sleeping, smiling, taking, talking, walking, wearing, wedding
Scene: road, snow, airport, bathroom, beach, city, court, forest, hill, island, lake, market, park, room, sea, field, zoo
Object: children, computer, drink, glass, monitor, tree, wood, basket, bathtub, beer, blanket, box, bread, bridge, buildings, cabinets, camera, candles, cheese, chicken, chocolate, church, clouds, coat, coffee, decker, desk, dishes, door, face, fence, fish, flag, flowers, foods, fruits, furniture, grass, hair, hands, head, hotdog, house, ice, jacket, kitten, lettuce, lights, luggage, meat, metal, mouth, onions, palm, pants, papers, pen, pillows, plants, plates, players, police, potatoes, racquet, railing, rain, rocks, salad, sand, seat, shelf, ship, shirt, shorts, shower, sofa, station, stone, suit, toddler, tomatoes, towel, tower, toys, tracks, vegetables, vehicles, wall, water, wii, windows, wine
TABLE XVII: The actions, scenes and objects detected by the attribute classifier.
Fig. 5: The system for collecting questions.