CVPR'17 Spotlight: What’s in a Question: Using Visual Questions as a Form of Supervision
Collecting fully annotated image datasets is challenging and expensive. Many types of weak supervision have been explored: weak manual annotations, web search results, temporal continuity, ambient sound and others. We focus on one particular unexplored mode: visual questions that are asked about images. The key observation that inspires our work is that the question itself provides useful information about the image (even without the answer being available). For instance, the question "what is the breed of the dog?" informs the AI that the animal in the scene is a dog and that there is only one dog present. We make three contributions: (1) providing an extensive qualitative and quantitative analysis of the information contained in human visual questions, (2) proposing two simple but surprisingly effective modifications to the standard visual question answering models that allow them to make use of weak supervision in the form of unanswered questions associated with images and (3) demonstrating that a simple data augmentation strategy inspired by our insights results in a 7.1% improvement on the standard VQA benchmark.
Supervised learning has shown great promise in developing visual AI. However, collecting manually annotated visual datasets is both challenging and expensive [36, 14, 28]. Using cheaper and weaker supervision is a growing research direction [46, 42, 8, 12, 35, 7, 34, 4]. As AI is increasingly integrated into our daily lives, computer vision systems will have access to increasingly diverse sources of information by constantly observing human-human, human-object, human-environment and human-AI interactions. Efficiently utilizing all this information will become increasingly critical as we aim to develop large-scale, accurate and adaptable visual systems.
Visual Question Answering (VQA) has become a new mode of interaction between humans and AI. Presently, VQA is mostly used as a means of evaluating the visual reasoning capabilities of computers. However, going forward this is likely to become a natural human-AI interaction paradigm. It will become commonplace for humans to ask computers visual questions, such as, "Where did I leave my keys?", "What breed of dog is this?", "Have I met this person before?" or "Why is she doing that?". Instead of viewing this as a one-sided interaction with humans soliciting information from AI systems, we consider how visual questions themselves can serve as a form of supervision to improve computer vision systems (Figure 1).
In contrast to existing works that focus on improving AI's VQA capabilities [16, 26, 33], we strive to understand how much information is contained within the question itself, even when the answer is not provided (as would be the case in human-AI conversations). E.g., the question "What breed of dog is this?" provides information that the animal in the scene is a dog and suggests that there is a single dog present. The question "Why is he doing that?" suggests that the depicted behavior is unusual or unexpected. This type of free, natural and open-ended supervision can pave the way to developing richer cognitive AI.
We set out to investigate the hypothesis that human questions can be effectively used to improve computer vision capabilities. We begin by providing extensive qualitative and quantitative analysis of the information contained in a visual question using the large-scale VQA dataset. We propose two simple but surprisingly effective modifications to the iBOWIMG VQA model that allow it to make use of weak supervision in the form of images associated with unanswered questions. This proves our hypothesis that unanswered questions can be effectively used as a form of visual supervision.
Inspired by the insights from our initial experiments, we then propose a simple data augmentation strategy. The key idea is that instead of using each image-question-answer triplet as a single training exemplar, we generate training exemplars incorporating all possible subsets of the questions associated with the image. This strategy yields a 7.1% improvement in accuracy on the standard VQA benchmark, which confirms that our analysis has important implications not just for future close AI-human interactions but also for the immediately relevant benchmarks.
Our code, models and additional details are available at http://sidgan.me/whats_in_a_question/.
Vision and language: Computer vision models are generally trained to recognize a fixed vocabulary of visual concepts [14, 28, 36]. But recently, there has been a trend towards more descriptive open-world image understanding. Efforts have included works on image [41, 15, 9, 40, 23, 25] and video captioning, image segmentation from natural language expressions, aligning videos and books, zero-shot recognition from natural text [5, 18], learning object models from noisy open-world human labels and other methods. While visual questions are certainly not meant to provide a complete description of the image, they still contain some open-world information about the scene encoded in natural text. In this work we take the initial steps towards extracting and harnessing this information.
Visual question answering: The literature on building visual question answering systems [6, 47, 45, 22, 18, 24, 31, 2, 38, 16, 33, 20, 43, 44, 48, 37, 26] is far too extensive to be covered in detail here. We do some analysis by building off of the iBOWIMG model of Zhou et al. But much of our investigation is orthogonal to the visual question answering pipeline. Our work on understanding the information embedded within a question is more similar to works such as Lin et al. on utilizing VQA knowledge to improve image captioning, or Goyal et al. on analyzing the relative informativeness of different words within a question. However, in contrast to these approaches, we focus on the knowledge that can be extracted from the question alone and not the question-answer combination.
Incidental supervision: As we move towards large-scale open-world visual understanding, collecting manually annotated datasets for every task and concept is quickly becoming infeasible. Developing ways of using natural and cost-effective forms of supervision is a growing research direction: weak manual annotations [46, 42], web search-based supervision [8, 12], or extra modalities like temporal continuity, depth, ambient sound or GPS signal. Along similar lines, we investigate whether visual questions associated with images provide sufficient supervision to train computer vision models. We argue that increasing integration of AI into everyday human environments will organically generate a large set of image-question pairs that can be used to improve visual AI systems.
We begin with qualitative and quantitative analysis of the information visual questions may contain. We consider a setting where we have an image and a question (or a set of questions) associated with it, but with no corresponding answer. We examine the information content of the questions from two perspectives: (1) whether these questions can provide a good image description and (2) whether we can learn what objects are present in the image, given these questions. The insights from our analysis provide inspiration for the method described in Section 4 for utilizing the unanswered questions in learning vision models.
We detail the setup for analysis here. We use the COCO dataset with 82,783 training and 40,504 validation images. Three types of annotations are associated with the dataset: (1) visual questions, where each image is associated with three human-generated questions about the visual scene, (2) image captions, where every image is associated with five human-generated natural language descriptions and (3) image classification labels, where every image is annotated with the presence or absence of 80 target object classes. In this section we do not make use of the answers to the visual questions.
Image captions are a natural open-world way to describe an image. Prior work qualitatively notes the difference in information between image captions and visual questions: questions tend to provide specific information regarding one object within the image, while captions naturally tend to be a richer source of information. However, when a human looks at a scene, it is rare that she will be compelled to provide a caption (except when posting the image on social media). In contrast, she may feel compelled to ask a question, such as, "Is this rice noodle soup?" or "Are the flowers real or artificial?". The fact that she asks these questions provides some information about the scene contents. We begin by analyzing whether the visual questions contain enough information to provide an accurate description of the image.
Table 1 (excerpt): the combined Image+Qs model (NT + Seq2Seq) achieves METEOR 0.305 and SPICE 0.256.
Quantitative results: We evaluate using visual questions as image captions in Table 1 using two standard captioning metrics: METEOR and SPICE. We first consider three baselines that do not use image information but generate a caption purely based on the visual questions that have been asked: (1) One Q: using one of the visual questions directly as a caption, (2) Three Qs: using all three visual questions concatenated together as a caption and (3) Seq2Seq: a model trained on the COCO training set that takes the three visual questions as input and learns to output an image caption based on the information contained in the questions. Three Qs outperforms One Q (SPICE score of 0.115 vs 0.058), indicating that different questions provide complementary information about the image content. (The SPICE metric considers both precision and recall of a caption, enabling a fair comparison between captions of different lengths.) Training the Seq2Seq model to generate more semantically meaningful captions from the three questions improves the SPICE score from 0.115 to 0.140. (Our analysis bears some similarity to the work of Lin et al. on re-ranking image captions using VQA; however, we do not use the answers or the image, and we evaluate captions generated solely from the questions.)
We additionally investigate whether visual questions can provide information complementary to what is contained in the image features. We use NeuralTalk2 (NT), a computer vision model that takes in an image and outputs an image caption. Directly concatenating this image-based caption with the caption generated from questions (NT + Seq2Seq) improves the SPICE score from 0.194 to 0.256, indicating that the signal from visual questions may be complementary to the information in the image.
Qualitative results: Finally, we qualitatively examine captions generated from visual questions. Figure 2 shows results of applying the Seq2Seq model on the validation set to convert the 3 visual questions associated with an image into a single image caption. The results demonstrate that visual questions can provide detailed information about the image content. The generated captions contain object category, human action, color and affordance information, indicating that this information can be readily extracted from the questions.
Figure 2 (examples of questions and the caption generated from them):
- "What are these two people doing in the scene? What color is the person on the right's hat? Was this picture taken during the day?" → "people during day with hat"
- "What is the baby chewing on? Is this child under 5? Is the child asleep?" → "chewing baby"
- "Is the street name on top an unusual name for a street? Is there a box for newspaper delivery? What color is the road?" → "street"
- "Is this rice noodle soup? What is to the right of the soup? What website copyrighted the picture?" → "copyrighted noodle soup"
- "Are the flowers in a vase? How many different color flowers are there? Are the flowers real or artificial?" → "many green flowers"
- "Is this a 3-D photo? How many lights on the lamppost? Is this building an unusual color?" → "lampost color is light"
Besides image description, another source of information that the visual questions can provide is the object classes that are present in the image. Some examples are shown in Table 2: e.g., asking “what color is the bus?” indicates the presence of a bus in the image.
Algorithm: To quantify how often this occurs, we extract object labels for the 80 COCO classes from visual questions. There are 64 question types in COCO: "how many," "is there", "what color is", etc. For each type, we manually determine whether questions of this type imply the presence of objects. For example, questions of the type "how many" do imply the presence of objects: "how many different flowers are on the table?" implies the presence of flowers and a table. In contrast, questions of the type "is there," such as "is there a zebra in the photo?", do not imply the presence of any object. For each question that implies the presence of objects, we extract which of the 80 COCO classes (if any) the question refers to. We use NLTK to disambiguate tenses and synonyms, as well as pattern.en (http://www.clips.ua.ac.be/pages/pattern-en) for singulars and plurals. For two-word categories such as "teddy bear" we use n-gram overlap. More details are in Section A.1.
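The rule-based extraction described above can be sketched in a few lines. This is a toy illustration, not the paper's actual implementation (which relies on NLTK and pattern.en for tenses, synonyms and plurals); the vocabulary and question-type lists here are hand-picked examples:

```python
# Toy sketch of question-type-based object extraction (illustrative only).
IMPLYING_TYPES = ("how many", "what color is", "what is", "what are")
NON_IMPLYING_TYPES = ("is there", "are there")
VOCAB = {"bus", "flower", "table", "dog", "zebra", "teddy bear"}

def implied_objects(question, vocab=VOCAB):
    q = question.lower().rstrip("?")
    if any(q.startswith(t) for t in NON_IMPLYING_TYPES):
        return set()  # e.g. "is there a zebra?" implies no object
    if not any(q.startswith(t) for t in IMPLYING_TYPES):
        return set()  # unknown question type: be conservative
    tokens = q.split()
    found = set()
    for obj in vocab:
        if " " in obj:               # two-word categories: n-gram overlap
            if obj in q:
                found.add(obj)
        elif obj in tokens or obj + "s" in tokens:  # naive singular/plural
            found.add(obj)
    return found
```

For instance, `implied_objects("How many different flowers are on the table?")` yields both the flowers and the table, while an "is there" question contributes nothing.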
Table 2: example questions that imply the presence of objects in the image:
- "What color is the bus?" (implies a bus)
- "Are people waiting for the food truck?"
- "How many umbrellas are in the image?"
- "Is the bird sitting on a plant?"
We compare the resulting object class vectors with ground-truth annotations of object presence in the image. Our conversion algorithm attains moderate mean per-class recall with high precision, indicating that while the three visual questions do not refer to all objects in the image, they nevertheless capture more than a quarter of the common objects with few false positives.
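The evaluation above reduces to standard per-class precision and recall over binary presence vectors; a minimal sketch (inputs are assumed to be image-by-class 0/1 rows):

```python
def per_class_precision_recall(pred, gt):
    """pred, gt: equal-shaped lists of per-image binary class-presence rows.
    Returns (mean per-class precision, mean per-class recall); a class
    with an empty denominator contributes 0 to the mean."""
    n_classes = len(gt[0])
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = sum(1 for p, g in zip(pred, gt) if p[c] and g[c])
        n_pred = sum(p[c] for p in pred)   # predicted positives
        n_pos = sum(g[c] for g in gt)      # ground-truth positives
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_pos if n_pos else 0.0)
    return sum(precisions) / n_classes, sum(recalls) / n_classes
```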
Figure 3 shows the per-class recall and precision as a function of the average size of the class in an image. As expected, larger objects tend to be asked about more frequently. For example, "baseball glove" occupies only a tiny fraction of the image on average and has near-zero recall, indicating that it is essentially never asked about (or that we are not able to parse it out with our algorithm). In contrast, "train" occupies a large fraction of the image area and has high recall, indicating that if a train appears in the image it is almost always asked about. A notable exception is "dining table," which occupies a large fraction of the image area on average but has low recall, since it is rarely the target object of interest. Overall, across all classes, the objects we do detect occupy a much larger share of the image area on average than the objects we fail to detect.
Combination with vision models: We additionally verify that knowing the questions provides extra information about the image beyond what can currently be extracted by modern computer vision models. We finetune an ILSVRC2012-pretrained GoogLeNet model [39, 36] on the training set of COCO to recognize the 80 target object classes and measure its image classification mAP on the validation set. We then combine the 80-dimensional classifier prediction vector with our object class vector extracted from the three visual questions. This combination significantly increases image classification mAP.
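The exact fusion rule is elided in this copy of the text, so the following is a hypothetical illustration rather than the paper's method: a per-class rule that mixes in the question-derived evidence but never lowers a classifier score.

```python
def fuse_scores(cnn_scores, question_objects, alpha=0.5):
    """cnn_scores: per-class classifier probabilities (length 80 in the paper).
    question_objects: binary per-class vector extracted from the questions.
    alpha: hypothetical mixing weight for the question evidence."""
    return [max(s, alpha * q + (1 - alpha) * s)
            for s, q in zip(cnn_scores, question_objects)]
```

Classes the questions mention get boosted (a class scored 0.2 by the CNN rises to 0.6 with alpha=0.5), while unmentioned classes keep their CNN scores.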
Figure 4 (examples of latent information carried by questions):
- Common sense: "Do the long shadows suggest it is well past morning?"
- Ambiguity: "How many identical pinkish-tan vases are on the top shelf?"
- Composition: "Is the woman's costume made of real fruit and leaves?"
- Visual relationship: "Is the person falling down?"
- History: "Was this picture taken over 100 years ago?"
- Affordances: "Is this cat lying on the sofa?"
Discussion: We showed that visual questions, even without answers, provide informative image descriptions and object classification information. In addition, we briefly note that visual questions can also provide additional latent information, as illustrated in Figure 4. We make no attempt to quantify this here, but note that it may also be extracted and exploited by future AI systems.
Armed with the conclusion that visual questions themselves provide important and useful information about the image content, we now set out to investigate how these questions can be used to aid the development of improved computer vision models. We focus on the VQA task and investigate how even unanswered questions can be effectively utilized to improve VQA capabilities. Since our proposed formulation is very simple, the empirical benefits demonstrated in Section 5 are even more striking.
Standard VQA systems [6, 47, 45, 22, 18, 24, 31, 2, 38, 16, 33, 20, 43, 44, 48, 37] take the image and its target question as input, with the expectation of producing an accurate answer for the question. In Section 3 we made two key observations: (1) different visual questions provide information complementary to each other and (2) visual questions can provide information about the scene that may be complementary to what can be extracted from the image using modern computer vision models. Thus it is natural to ask: can we build a better question answering system that benefits from having access not only to the image and the target question, but also to a set of other questions that may have been asked about the same image?
To investigate, we build upon the iBOWIMG model. This model is ideal for our investigation as it is very simple to modify and analyze, while achieving impressive results on the VQA task. iBOWIMG models the image using deep features extracted from an ILSVRC-pretrained CNN [39, 36], and the target question using a one-hot bag-of-words text feature which is transformed via a word embedding layer. The image and target question features are concatenated and sent through a softmax layer to predict the answer class amongst a set of choices.
We extend iBOWIMG to additionally take as input the other questions asked about the image. We model these extra questions the same way as the target question: the extra questions are concatenated together into a long string, a bag-of-words text feature is computed and then transformed via a word embedding layer. This additional feature vector is concatenated with the image and target question features, as in Figure 5. We refer to this model as iBOWIMG-2x due to the doubled textual dimensionality. (Our model bears some similarity to prior work that also doubles the dimensionality of the bag-of-words textual representation but explores a different setting: there, the question, image and answer features are concatenated to predict the correctness of image-question-answer triplets. In contrast, our feature vector combines the image features, the target question and the other questions about the image.)
During training the model is tasked with predicting the answer to the target question and the other questions can be thought of as a richer feature representation of the image.
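A minimal numpy sketch of the forward pass just described (toy dimensions, random weights and our own variable names; the real model learns the embedding and classifier jointly with softmax cross-entropy):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D_TXT, D_IMG, N_ANS = 1000, 64, 128, 10       # toy sizes (assumptions)
W_embed = 0.01 * rng.normal(size=(V, D_TXT))     # shared word embedding
W_cls = 0.01 * rng.normal(size=(D_IMG + 2 * D_TXT, N_ANS))
b_cls = np.zeros(N_ANS)

def bow(word_ids, vocab_size=V):
    """One-hot bag-of-words vector from a list of word indices."""
    v = np.zeros(vocab_size)
    for w in word_ids:
        v[w] += 1.0
    return v

def forward(img_feat, target_q_ids, extra_q_ids):
    """extra_q_ids may be empty: an all-zero BoW yields a zero feature,
    which is exactly the test-time behavior when no extra questions exist."""
    q_feat = bow(target_q_ids) @ W_embed
    extra_feat = bow(extra_q_ids) @ W_embed
    x = np.concatenate([img_feat, q_feat, extra_feat])
    logits = x @ W_cls + b_cls
    e = np.exp(logits - logits.max())
    return e / e.sum()                            # softmax over answers
```

Passing an empty list for `extra_q_ids` reproduces the zero-initialized extra-feature vector used at standard test time.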
To train the richer iBOWIMG-2x model, we need to generate new training exemplars out of the available training data. Concretely, every image I comes associated with a set of questions {Q_1, ..., Q_k} and corresponding answers {A_1, ..., A_k}. In addition, the image can also be associated with unanswered questions {Q_{k+1}, ..., Q_n}. Let Q = {Q_1, ..., Q_n} be the set of all questions associated with the image.
The training examples for iBOWIMG are of the form:
    (I, Q_i) -> A_i
In contrast, the training examples for iBOWIMG-2x are:
    (I, Q_i, E) -> A_i,  for every E in P(Q),
where P(Q) denotes the powerset of Q and E defines the extra information provided to the model in the form of additional questions asked about the same image.
For example, consider an image I with a question Q_1, a corresponding answer A_1 and two additional unanswered questions Q_2 and Q_3. For iBOWIMG, the single training example corresponding to this image would be (I, Q_1) -> A_1. For iBOWIMG-2x there would be eight training examples (I, Q_1, E) -> A_1, with E = {}, {Q_1}, {Q_2}, {Q_3}, {Q_1, Q_2}, {Q_1, Q_3}, {Q_2, Q_3} or {Q_1, Q_2, Q_3}, making use of the extra information available about this image during training in the form of the other asked questions. (While this model is formulated to make use of extra unanswered questions, an additional benefit is that it can be considered a form of data augmentation: if 3 answered questions are available for an image, iBOWIMG would have 3 training examples while iBOWIMG-2x would have 24.) The target label for all these exemplars is the answer A_i to the target question. After the new exemplars are generated, the model is trained using stochastic gradient descent exactly as iBOWIMG.
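The powerset-based exemplar generation can be written directly with itertools; the helper names here are our own:

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of items, from the empty set to the full set."""
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))

def make_exemplars(image, answered, unanswered):
    """answered: list of (question, answer) pairs; unanswered: list of
    questions. One exemplar per (target question, subset of all questions)."""
    all_qs = [q for q, _ in answered] + list(unanswered)
    return [(image, q, extra, a)
            for q, a in answered
            for extra in powerset(all_qs)]
```

One answered question plus two unanswered ones yields 2^3 = 8 exemplars, and three answered questions yield 3 x 8 = 24, matching the counts quoted above.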
One disadvantage of the method described so far is that it can only incorporate information from extra questions on images that have at least one answered question. However, we may have access to a large collection of images with only unanswered questions associated with them: e.g., a dataset of image-question pairs without ground-truth answers would naturally emerge from a deployed VQA system interacting with users in the real world. Motivated by the findings of Section 3.2, we use these unanswered questions to learn an image representation that may be better suited for the VQA task. Instead of using an ILSVRC-trained visual model, we use a visual model trained to recognize the words that appear within the questions. Intuitively, ILSVRC-trained models may not reflect the full spectrum of visual concepts or diverse visual scenes. This new image model can be incorporated into iBOWIMG-2x (or even iBOWIMG) as a better image representation.
The iBOWIMG-2x model can be evaluated in one of two ways. At test time in the standard VQA formulation, the model only has access to a novel image and a single target question. In this case, we simply pass a zero-initialized vector for the extra features, reducing iBOWIMG-2x back to iBOWIMG but trained differently. However, iBOWIMG-2x also allows additional flexibility by utilizing unanswered questions at test time. For example, when the test image is provided with several target questions, they can further help interpret the image: e.g., test questions "Who is to the left of the dog?" and "What is to the right of the person?" provide complementary information that might help answer both questions better.
We now empirically verify our intuition that even unanswered questions can significantly improve the accuracy of VQA systems. In particular, we evaluate our proposed iBOWIMG-2x model trained on subsets of the COCO dataset corresponding to two different settings: (1) where every image has at least one answered question and optional unanswered questions associated with it (Section 5.1), and (2) where some images have only unanswered questions associated with them (Section 5.2). We convincingly demonstrate that including extra questions significantly improves VQA accuracy. To conclude, we apply our insights to the standard VQA benchmark in Section 5.3.
Setup: We use the COCO dataset with 82,783 training and 40,504 validation images. Each image is associated with three questions and their corresponding answers, although we sometimes use only a subset of those in our experiments (details below). We evaluate the model on the multiple-choice VQA task. We normalize the visual features and the two textual features independently to unit L2 norm. We build upon the publicly released iBOWIMG code.
Dataset: Consider the setting where we have access to a set of training images, each with one answered question and optional unanswered questions. We simulate this using the VQA dataset, where each training image is associated with 3 questions and their respective answers. We randomly select a single question per image to be the target question and discard the other answers, leaving training images each with a question Q_1, an answer A_1 and two additional unanswered questions Q_2 and Q_3. We train the model on the COCO training images and evaluate on the validation set. Here we use GoogLeNet trained on ILSVRC2012 as the visual representation.
Key experiment: We begin by comparing our iBOWIMG-2x model trained with the extra unanswered questions against the iBOWIMG model, which does not use the available unanswered questions. After training on our dataset with one answered question per image, iBOWIMG obtains a baseline accuracy on the validation set. (Here we evaluate the model in the standard setting where at test time only the one target question is provided and the model is expected to produce an answer; we do this by inputting a zero-initialized vector as the second textual feature in the model, in place of the unanswered questions.) In contrast, our model makes effective use of the provided unanswered questions and achieves a significant boost in accuracy. We use bootstrapping to establish statistical significance: the improved accuracy obtained by including unanswered questions falls outside the baseline model's bootstrap confidence interval and is statistically significant. Figure 6 shows qualitative results.
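The bootstrap test referred to above can be sketched as follows: resample the per-example correctness indicators with replacement and read off a confidence interval on accuracy (the parameter values here are illustrative, not the paper's):

```python
import numpy as np

def bootstrap_ci(correct, n_boot=10000, alpha=0.05, seed=0):
    """correct: 0/1 array marking which test examples the model answered
    correctly. Returns a (1 - alpha) bootstrap confidence interval on
    accuracy."""
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    accs = correct[idx].mean(axis=1)   # accuracy of each resample
    return (float(np.quantile(accs, alpha / 2)),
            float(np.quantile(accs, 1 - alpha / 2)))
```

An improvement is judged significant when the competing model's accuracy falls outside this interval.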
Ablation studies: We investigate two components of our model in terms of accuracy improvement: (1) the impact of having access to extra unanswered questions at training time and (2) the impact of generating extra training examples per image with data augmentation based on the powerset in Eqn. 2. Table 3 shows the results.
Table 3 reports accuracy with and without augmentation as a function of the number of unanswered questions provided.
First, as observed above, adding the two additional unanswered questions provides a clear boost in accuracy over the baseline. It is further encouraging to note that using just one unanswered question achieves about half of that improvement. This suggests that adding more unanswered questions (which will become freely available in real-world settings) is likely to further improve accuracy.
Second, we investigate the extent of the improvement due to data augmentation. Instead of using the data augmentation strategy of Eqn. 2, we simply train iBOWIMG-2x with a single training exemplar per image where the two extra questions are concatenated together. This yields a lower accuracy than the full model. (A natural question is whether the augmentation improvement arises simply from seeing exemplars of the test-time form during training; a model trained with augmentation but without those exemplars loses little accuracy, indicating this effect is minor.)
This suggests that although most of the improvement comes from simply having access to the extra questions, the fact that the extra questions allow us to generate a diverse augmented training set is in itself a meaningful observation. We explore this further in Section 5.3.
Figure 6 shows qualitative examples; each cell lists a question followed by the two models' predicted answers:
- "Are these people exercising?" Yes / Yes
- "What object is in focus?" Fire Hydrant / 3
- "How many dolls are there?" No / 2
- "What is in the water?" Plastic Bag / Fish
- "What's the girl facing?" Wall / Wall
- "Is he being messy?" Yes / Red
- "What is the woman looking at so seriously?" Woman / Person
- "What type of goose is pictured?" Canadian / Red
- "What is stuck in the sandwich?" No / Toothpicks
- "What is on the back of the bike?" Helmet / Life Vests
- "What color the shower curtain?" White / 2
- "How many knives are in the knife holder?" 3 / 6
Analysis: Digging deeper, we seek to understand what makes iBOWIMG-2x more effective than iBOWIMG. First, we train text-only models which learn to answer questions without looking at the image. In this setting, iBOWIMG-2x achieves only a marginal improvement over iBOWIMG's accuracy. This suggests that much of the benefit of iBOWIMG-2x comes from learning to make better use of the image features. We investigate this further in Section 5.2.
Second, we note that iBOWIMG-2x is more likely to predict answers corresponding to actual words as opposed to a number or yes/no. In particular, iBOWIMG-2x predicts a word answer more often than iBOWIMG, and predicts number answers at about half the rate of iBOWIMG. This suggests that our model's richer representation better correlates the image appearance with the semantic textual features, making it more likely to predict a word answer instead of resorting to a simpler numerical or yes/no response.
Table 4 documents the breakdown of accuracy by answer type. Having access to the extra supervisory signal yields a small improvement on number questions, a bigger improvement on the yes/no questions, and a large improvement on the challenging word-response questions. Our model is unable to use unanswered questions to learn to count object instances much better than the baseline; however, it becomes significantly better at identifying the presence or absence of visual concepts and at answering more general visual questions.
Test-time supervision: Finally, an additional advantage of our model is that it can incorporate multiple questions at test time. Concretely, instead of asking a single test question Q_1 on test image I and passing the tuple (I, Q_1) with a zeroed extra-question vector to the model, we consider including the other test questions Q_2 and Q_3 and passing the tuple (I, Q_1, {Q_1, Q_2, Q_3}). This yields an additional improvement in accuracy over testing the standard way with only the target question available.
Dataset: In Section 5.1 we considered the setting where an answered question is available for every training image. In contrast, here we consider the real-world scenario where some images have only unanswered questions associated with them. To simulate this setting, we randomly select a fraction of the training images to keep their answered questions and use only unanswered questions on the rest. We evaluate on the full validation set.
Key experiment: We use all the available questions to train a visual representation better suited for the VQA task. We use the pretrained AlexNet [27, 36] as the baseline and compare it with the same network finetuned on the COCO training images to recognize words from the question vocabulary instead of the 1000 ILSVRC classes. We use these networks as the visual representation when training the iBOWIMG-2x model on the small set of available images with answered questions. The finetuned network effectively utilizes the information captured in the unanswered questions to improve accuracy over the baseline network.
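Finetuning a network to "recognize words from the question vocabulary" amounts to building a multi-label target per image from its question words; a minimal sketch of that target construction (the tokenization here is deliberately naive, unlike the paper's NLTK-based pipeline):

```python
def question_word_targets(questions, vocab):
    """Build a binary multi-label target over a fixed question vocabulary:
    1 for every vocabulary word appearing in any question about the image.
    A CNN finetuned against such targets learns from unanswered questions."""
    index = {w: i for i, w in enumerate(sorted(vocab))}
    target = [0] * len(index)
    for q in questions:
        for word in q.lower().rstrip("?").split():
            if word in index:
                target[index[word]] = 1
    return target
```

Each image's questions, answered or not, thus contribute a supervisory signal for the visual representation.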
Ablation studies: We evaluate two components of the framework. First, we check whether the full vocabulary is required, or whether filtering to 80 words (corresponding to the COCO annotated object categories, extracted from the questions as in Section 3.2) or 1024 words (the most relevant words according to tf-idf) would suffice. Figure 7 (left) demonstrates continuous improvement with larger vocabulary sizes. Second, we evaluate whether the full set of unanswered questions is necessary or whether a smaller subset would suffice. Figure 7 (center) demonstrates that using more questions for finetuning progressively improves accuracy.
Benefits of answered vs. unanswered questions: We ask one final question: how much does training a better visual representation help compared to collecting more answered questions? In Figure 7 (right) we progressively increase the number of available images with answered questions and compare the models with and without finetuning. Interestingly, a model finetuned using only a small fraction of the answered questions is on par with the model trained on all answered questions without finetuning. This suggests that perhaps much of the information is already captured in the questions themselves, even without the answers. However, further study is necessary to verify this claim.
Our findings demonstrate a very simple but effective way of improving VQA accuracy by adding extra unanswered questions. We take this one step further and ask a straightforward question: can we consider the full dataset but use our model as a form of data augmentation, where all questions are used as supervisory signals at training time? Thus, we train iBOWIMG-2x where every image-question-answer triplet is now represented by 8 training exemplars. We use the setup where the entire COCO training set and part of the validation set are used for training, with the finetuned GoogLeNet model providing the visual representation. We evaluate on the test-dev set as standard, with only one question at a time provided during testing.
iBOWIMG-2x outperforms the baseline iBOWIMG model by an impressive margin on test-dev, having access to the exact same training question-answer pairs but with data augmentation. (Zhou et al. report a different accuracy on test-dev using iBOWIMG; however, despite our best efforts, we were unable to replicate that result. Evaluating their released predictions file on test-dev obtains the same accuracy as our retrained iBOWIMG model.) Table 5 documents the breakdown by answer type. The results are consistent with the findings of Section 5.1; in fact, they are even more pronounced. By making effective use of all the questions jointly through data augmentation, the model improves on the number questions, on the yes/no questions, and most markedly on the other questions. This suggests that the data augmentation strategy may be even more beneficial for the open-ended VQA task, but we leave that for future work.
These experiments demonstrate that our findings provide important insights not only for the weakly supervised setting but also for the fully supervised VQA scenario. For completeness, we also evaluate our iBOWIMG-2x model on test-standard. While this is not state-of-the-art accuracy, the significant improvement over the very simple model we started with suggests that our insights may be beneficial for improving the current best models as well.
We study a previously unexplored setting of using visual questions themselves as a form of supervision to improve computer vision models. We provide both qualitative and quantitative analysis of how much information is contained within the questions. Our insights already yield significant improvements over baselines on standard benchmarks. More importantly, we believe that visual questions will become freely available as a result of human-AI interactions and can serve as a form of supervision for improving visual models. This work is an early step in this direction.
We would like to thank Peiyun Hu, Achal Dave, Arvind Ramachandran, Gunnar Atli Sigurdsson and Siddharth Santurkar for helpful discussions. This research was supported by ONR MURI N000141612007.
In order to investigate the nature and quantity of information provided by the question, we use the object classification task. The Microsoft Common Objects in COntext (MS COCO) dataset contains 91 common object categories and 82 of them have more than 5,000 labeled instances. The 2014 release considers a subset of 80 categories from the original 91 categories. The 11 excluded categories are: hat, shoe, eyeglasses (due to too many instances in the dataset), mirror, window, door, street sign (as they were ambiguous and difficult to label), plate, desk (due to confusion with bowl and dining table, respectively), blender and hair brush (too few instances in the dataset). We strategically extract words that indicate the 80 object categories.
Here we provide more details for Section 3.2. We classify the 64 question types in COCO into those that confirm the presence of an object (e.g., “how many different flowers are on the table?” implies the presence of flowers and a table) and those that do not (e.g., “is there a zebra in the photo?” does not confirm the presence of any object). Figure 6 shows the question types that do not confirm the presence of an object, while Figure 7 shows those that do. Some additional techniques to boost precision and recall, derived from a detailed analysis of the questions, are described below:
Super Category: The sports ball category covers a broad spectrum, as it includes all instances of the various kinds of sports balls. The questions are not annotated with ‘sports ball’, but rather with ‘football’, ‘basketball’ or ‘baseball’. For this reason, the synset of ‘sports ball’ is modified to include all of these entities. True positives are shown in Figure 10; however, none of the questions associated with these true positives indicate which type of sports ball is in the image. Figure 11 shows a false positive of the sports ball category. Similarly, the airplane category has annotations of ‘jet plane’, ‘plane’, ‘passenger plane’ and ‘private plane’; these are included in the synsets of ‘airplane’.
Spell Check: Many words in the English language have multiple spellings; due to this, category names like hair drier must also be linked to questions that contain the string ‘hair dryer’. Different words can also be used to convey the same object depending on the context. For instance, the category traffic light should cover questions that contain either ‘traffic signal’ or ‘traffic light’, so the synset for ‘traffic light’ is modified to include ‘traffic signal’.
To resolve the overlap between single-word and double-word categories (e.g., bear vs. teddy bear, dog vs. hot dog), we limit the possible signals from each detected word, ensuring that only one of the two categories is present in the final question vector. Hence, a question like ‘What does this teddy bear have on its neck?’ will not signal the bear category, while ‘Does the bear love you?’ will. For the latter example, signalling the bear category is a false positive for the animal ‘bear’ because the image actually contains a teddy bear (see Figure 9). For two-word categories the order of the words is important; although some questions permute the two words, for computational efficiency we match only the exact order present in the category name. Using an n-gram overlap for phrase categories helps differentiate cases like ‘hot dog’ and ‘dog’.
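A minimal sketch of this extraction logic, assuming a hand-built synset table like the ones described above; the category table here is a small illustrative subset of the 80 COCO categories:

```python
import re

# Synonym/synset table (illustrative subset; see Super Category and
# Spell Check above for how these lists are built).
SYNSETS = {
    "teddy bear": ["teddy bear"],
    "hot dog": ["hot dog"],
    "sports ball": ["sports ball", "football", "basketball", "baseball"],
    "traffic light": ["traffic light", "traffic signal"],
    "hair drier": ["hair drier", "hair dryer"],
    "bear": ["bear", "bears"],
    "dog": ["dog", "dogs"],
}

def detect_categories(question):
    """Return the set of categories signalled by a question, matching
    longer phrases first so 'teddy bear' does not also signal 'bear'."""
    cleaned = re.sub(r"[^a-z\s]", " ", question.lower())
    text = " " + " ".join(cleaned.split()) + " "
    detected = set()
    for category, synonyms in sorted(
            SYNSETS.items(), key=lambda kv: -max(len(s) for s in kv[1])):
        for phrase in sorted(synonyms, key=len, reverse=True):
            padded = " " + phrase + " "
            if padded in text:
                detected.add(category)
                # Blank out the match so its words cannot signal a
                # shorter, overlapping category afterwards.
                text = text.replace(padded, " ")
    return detected
```

Matching the padded, longest phrases first is what keeps ‘hot dog’ from also firing the ‘dog’ category.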
Question types that do not confirm the presence of an object (Figure 6): What is the; What is; Is the; Is this; Is this a; Is there a; Is it; Is there; Is; Is this an; Is that a.

Question types that confirm the presence of an object (Figure 7): How many; What; What color is the; Are the; What kind of; What type of; What are the; Where is the; Does the; What color are the; Are these; Are there; Which; What is the man; Are; How; Does this; What is on the; What does the; How many people are; What is in the; What is this; Do; What are; Are they; What time; What sport is; Are there any; What color is; Why; Where are the; What color; Who is; What animal is; Do you; How many people are in; What room is; Has; What is the woman; Can you; Why is the; What is the color of the; What is the person; Could; Was; What number is; What is the name; What brand; Is the person; Is he; Is the man; Is the woman; Is this person.
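The question-type lists above can be folded into a simple longest-prefix rule for deciding whether a question confirms object presence. This is a sketch under the assumption that plain prefix matching approximates the question-type assignment; the lists are truncated to a small subset of the 64 types:

```python
# Truncated subsets of the two question-type groups.
NON_CONFIRMING = ["what is the", "what is", "is the", "is this a",
                  "is this", "is there a", "is there", "is it",
                  "is this an", "is that a", "is"]
CONFIRMING = ["how many", "what color is the", "what kind of",
              "where is the", "does the", "are there any",
              "is the man", "is the woman", "what"]

def confirms_presence(question):
    """True/False for a recognized question type, None if unrecognized."""
    q = question.lower().strip()
    candidates = [(p, True) for p in CONFIRMING if q.startswith(p)]
    candidates += [(p, False) for p in NON_CONFIRMING if q.startswith(p)]
    if not candidates:
        return None
    # Longest prefix wins: "is the man" (confirming) beats "is the".
    return max(candidates, key=lambda c: len(c[0]))[1]
```

Taking the longest matching prefix is what separates ‘Is the man …’ (confirms a man) from the more generic ‘Is the …’ type.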
Context-Dependent Words: These include words with distinct meanings depending on the context. For instance, the orange category is often confused with the color orange: more questions use ‘orange’ as an attribute than as an object. Examples of false positives for the orange category are shown in Figure 8, where ‘orange’ is used as an adjective. Remote is another such category, used both as an adjective and as the target object. The animal category bear is sometimes confused with the food item ‘gummy bears’ and the sports team ‘Chicago Bears’ due to ambiguity in the human-written questions; the question ‘What is the percentage of yellow gummy bears?’ will signal the category bear and is a false positive. More false positives are shown in Figure 9.
Co-occurring Categories: Categories that frequently occur along with other categories are confusing. For example, ‘handbag’ instances co-occur with the person category, and remote co-occurs with the tv category and the ‘wii’ object. Similarly, the toaster category has 49 instances which co-occur with ‘microwave’, ‘bowl’ and ‘counter’. ‘Fork’ co-occurs with various food categories, causing confusion, and ‘dining table’ gets confused with other food-related categories.
Fewer Instances: Categories like toaster have few instances and are rarely asked about directly. The oven category often co-occurs with the microwave and toaster categories; examples include ‘Is there a vintage toaster oven in the photo?’ and ‘Where is the microwave oven?’. ‘Refrigerator’ is confused with various food-related categories and co-occurs with the word ‘magnet’ in most of its occurrences, through questions like ‘Are there magnets on the fridge?’.
False positives for the orange category, where ‘orange’ is used as an adjective (Figure 8): ‘What oil brand is on the building that is white, orange and green?’; ‘How many orange cones are there?’; ‘How many orange poles are there?’; ‘What do the people in orange do?’; ‘How many people are wearing orange jackets?’.

False positives for the bear category (Figure 9): ‘Are the bears vertical or horizontal?’; ‘What is the percentage of yellow gummy bears?’; ‘Does the bear love you?’; ‘Is the large bear a chairman?’.

True positives for the sports ball category (Figure 10): ‘Has the batter already hit the ball?’; ‘Are the men playing rugby or football?’; ‘Is the baseball player at home plate?’; ‘Which foot will kick the soccer ball?’; ‘Is this woman balancing herself as she hits the ball?’; ‘How many billiard balls are there?’; ‘What is the number on the shirt of the girl throwing the softball?’; ‘Are the people in the stadium basketball fans?’.

False positive for the sports ball category (Figure 11): ‘Are there enough slices of pizza to feed a football team?’.
J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In North American Chapter of the Association for Computational Linguistics (NAACL), 2016.
J. L. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In International Conference on Computer Vision (ICCV), 2015.
J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In Computer Vision and Pattern Recognition (CVPR), 2016.