Imagine a child who sees a crocodile for the first time. She will likely ask what the animal is called, or where it can be encountered outside the zoo, but probably does not need to be told that it is green, that it has four legs, and that its sharp teeth can pose danger. Children (and even adults) learn from teachers in an active way: they ask questions about concepts that they are unfamiliar or uncertain about. This makes learning more efficient for both sides: the child acquires exactly the information she is missing, and the teacher answers a question instead of explaining every aspect of a concept in full detail. As A.I. agents become more and more integrated into our everyday lives, be it as personal assistants or household robots [30, 18, 25], they too should actively seek out missing information from humans by asking questions in natural language that non-experts can understand and answer.
Most existing work on scene understanding tasks such as VQA [5, 27, 31, 6] and captioning [15, 23, 3] has focused on a closed-world setting, i.e. consuming the knowledge provided by a labeled dataset. The goal of active learning, on the other hand, is to continuously update the model by seeking out relevant data to be additionally labeled by a human. Most active learning approaches, however, ask the human to provide a full labeling of an example; the main challenge is in identifying which examples to label, to ensure annotation efficiency. In our work, we go beyond this by endowing the model with the ability to ask for a particular aspect of a label, and to do so in natural language in order to unambiguously identify the missing information.
We focus on image captioning as a proxy task for scene understanding. To describe an image, a model needs to generate words describing the objects, their attributes, actions, and possibly the relationships and interactions between objects. This is inherently a multi-task problem. In this paper, our goal is to allow a captioning agent to actively ask questions about the aspects of the image it is uncertain about, in a lifelong learning setting in which examples arrive sequentially and continually. Thus, instead of having humans provide captions for each new image, our agent aims to ask a minimal set of questions for the human to answer, and to learn to caption from these answers.
Our model consists of three modules: a captioning module, a decision-making module that learns whether to ask and what to ask about, and a question generation module. At training time, as the captioner produces each word, the decision module decides which concept, if any, to ask about. If the agent decides to ask, the question generation module produces a question, and the teacher answers it. All three modules are implemented as neural networks. They are updated continuously as data arrives in batches: the captioning module is updated using the captions improved by the teacher's answers, while the decision module is updated based on the current uncertainty of the captioning module. For efficiency, our question-answering teacher is implemented as a QA bot. At test time, the captioning model describes new images without asking questions. We showcase our method on the challenging MSCOCO dataset, provide insights into the behavior of our approach, and discuss open challenges ahead. To the best of our knowledge, this is the first time that natural-language question asking has been explored in a lifelong learning setting with real-world images. All our code will be released.
2 Related Work
We provide a short overview of active and interactive learning approaches, and outline our main contributions with respect to existing work.
The goal of active learning is to intelligently seek labels for unlabelled data from an oracle in order to maximize learning while reducing annotation cost. An agent predicts which sample, if labelled, will give the most useful learning signal, as measured by performance on the test set. Strategies for active learning include uncertainty sampling, query by committee, and expected model change. Unlike the typical active learning setting, where an agent asks the oracle for a full data label (which would be a full caption in our scenario), our method learns to ask pointed questions to retrieve partial labels, i.e. missing key words that compose a caption. Our model thus needs to learn not only when to ask, but also what to ask, and how to distill the received answer into a complex multi-task module (the captioner).
Learning by Asking Questions
is an exciting direction with notable contemporary work. Prior approaches typically differ in task, methodology (are questions natural or templated? how does the agent utilize the feedback?), and environment (synthetic vs real). One contemporary approach learns to answer questions by asking questions: the image and the generated question are treated as an unlabelled sample, and an oracle provides an answer to form a novel training pair. This simplifies the learning-by-asking framework by bypassing the challenges of free-form conversation and of interpreting the teacher's answer, because the QA pair can be used directly as training data. Our work generalizes over this framework by using question asking as a support task to the main task, in our case image captioning, which leads to a more general, and significantly more challenging, scenario. Furthermore, that approach operates in CLEVR, a synthetic environment, and its questions are limited to programs rather than natural language.
Another line of work explores question asking for visual recognition. Given an image, a graph of objects, attributes and relationships is continually updated as the agent asks questions. However, questions are limited to templates, and training is done in synthetic environments with a limited set of objects and relationships. A related approach uses questions to explore new object classes for image classification, but does not retrain the classifier. Our work differs from [33, 28] by proposing a way for the agent to learn in a lifelong setting.
In related work on dialogue, the agent learns whether to ask questions of the teacher in order to efficiently solve dialogue tasks. The student's goal is to maximize the accuracy of answering the teacher's questions while reducing the cost (to the teacher) of asking for hints. We extend this line of thinking by letting the agent learn what to ask about, in addition to whether to ask.
Vision and Language.
Our work tackles captioning [32, 23, 3], visual question answering (VQA) [27, 6, 10], and visual question generation (VQG) [13, 20]. However, most of these works have focused on a closed-dataset setting. Our main goal here is not to design a novel architecture for each module (captioning, VQG, VQA), but rather to focus on the interaction of the modules and the teacher in order to learn in a continual, active setting. Most related to ours is the setting in which a teacher observes the captioning agent in a continual setting and gives natural-language feedback when errors occur; the agent then learns to improve based on this signal. In our work, the agent is the one seeking advice, thus making the teaching process more efficient.
3 Our Approach
Our goal is to train an image captioning model in the active learning setting with minimal human supervision. We approach the problem by endowing the agent with the ability to ask questions and learn from the teacher's answers. However, question asking is only a tool for retrieving information during training; at test time, the captioner operates without asking questions. Our model consists of three interacting modules: a captioner, a question generator, and a decision maker. The question generator produces natural-language questions given an image and information provided by the captioner. The decision maker chooses which words, if any, the captioner is uncertain about and should query the teacher for. We assume the teacher, not necessarily an expert, is able to help the agent learn by answering questions and scoring captions. Our agent learns continually, receiving new images in sequential chunks. For each chunk, the agent iterates between interacting with the teacher and learning from the collected information.
In the following sections, we describe how the captioner learns over its lifetime by interacting with the teacher. First we describe the lifelong learning setting, namely how the agent learns from data arriving in a sequence of chunks. Next, we provide details of how the agent queries for answers and feedback from the teacher. Finally, we describe the implementation of our agent's modules.
3.1 Lifelong Learning
We imagine a lifelong learning setting where data arrives in chunks. This is analogous to a student who learns over multiple classes in a semester. The first chunk has complete ground truth (GT), i.e. human written captions. We refer to it as the warmup chunk. The agent learns from the remaining unlabelled chunks with partial supervision from the teacher. We first train the question generator and pretrain the captioner on the warmup chunk. After pretraining, the agent learns in two phases.
In the collection phase, the agent looks at each image in an unlabelled chunk, attempts to caption it, and decides whether to replace words with answers obtained by asking questions. The agent collects the improved captions and uses them to train the captioner in the update phase. During collection, the feedback from the teacher is also used to train the decision maker to make better decisions about whether and what to ask. The two phases are repeated once per chunk, until all available image data has been exhausted. This process is illustrated in Figure 2 and summarized in Algorithm 1.
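As a structure-only sketch, the collection/update alternation might look like the following; `collect_chunk` and `update_captioner` are hypothetical stand-ins for the real modules, so this illustrates the shape of Algorithm 1 rather than the actual implementation:

```python
# Structure-only sketch of the two-phase lifelong loop (Algorithm 1).
# collect_chunk and update_captioner are hypothetical placeholders.

def collect_chunk(chunk):
    # Placeholder: in the paper this asks questions, scores captions,
    # and returns improved (or GT, on give-up) captions for the chunk.
    return [f"caption for {img}" for img in chunk]

def update_captioner(buffer):
    # Placeholder: the captioner is retrained from scratch on the full buffer.
    pass

def lifelong_learning(chunks, warmup_captions):
    buffer = list(warmup_captions)            # warmup chunk comes with GT captions
    for chunk in chunks:                      # one collection + update round per chunk
        buffer.extend(collect_chunk(chunk))   # collection phase
        update_captioner(buffer)              # update phase
    return buffer

buffer = lifelong_learning([["img1", "img2"], ["img3"]], ["warmup caption"])
```

Note that the buffer only ever grows: each update phase retrains on all data seen so far.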
In the collection phase, the agent attempts to improve captions generated from its own policy by querying the teacher. In each round, the agent makes multiple passes over a chunk. Given an image, the agent generates a caption, and the decision maker decides whether and when (at which word) to ask the teacher a question. The teacher answers the question, which the agent uses to create a new caption (details in Section 3.3). The teacher scores both the new and old captions, and the agent stores the captions in a buffer. At the same time, the agent uses the scores from the teacher to make online updates to the decision maker so that it picks better time steps (words) for asking questions (Section 3.4).
The collected captions are used in the update phase to distill the teacher's knowledge back into the captioner. However, the agent can encounter difficult images whose captions cannot be improved by asking questions. Empirically, we find the agent cannot improve on images containing objects in unusual settings, or when the caption generated from the captioner's policy is missing multiple key concepts. Therefore, we allow the agent to “give up” if the improved caption is bad, in which case the teacher writes a new caption. The agent first selects from the buffer the top-scoring captions for each image. It then keeps a fixed percentage of images, ranked by the average reward of the captions for each image; for the remaining images, the agent is given GT captions, drawn in practice from the 5 MSCOCO captions. We define the KeepBestAndGiveUp subroutine in Algorithm 1 as this two-step process.
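The KeepBestAndGiveUp selection can be sketched as below; the buffer layout and the `m`/`keep_pct` parameter names are illustrative assumptions, not the paper's actual values:

```python
# Sketch of KeepBestAndGiveUp: keep the top-m captions per image, keep the
# top keep_pct% of images by mean caption reward, and fall back to GT
# captions for the rest. Parameter names and values are assumptions.

def keep_best_and_give_up(buffer, gt_captions, m=2, keep_pct=70):
    # buffer: {image_id: [(caption, reward), ...]}
    best = {img: sorted(caps, key=lambda c: c[1], reverse=True)[:m]
            for img, caps in buffer.items()}
    # Rank images by the mean reward of their kept captions.
    ranked = sorted(best,
                    key=lambda img: sum(r for _, r in best[img]) / len(best[img]),
                    reverse=True)
    n_keep = int(len(ranked) * keep_pct / 100)
    out = {}
    for i, img in enumerate(ranked):
        if i < n_keep:
            out[img] = [c for c, _ in best[img]]   # keep improved captions
        else:
            out[img] = gt_captions[img]            # "give up": teacher writes GT
    return out

improved = keep_best_and_give_up(
    {"a": [("c1", 0.9), ("c2", 0.5), ("c3", 0.1)], "b": [("d1", 0.2)]},
    {"a": ["gt_a"], "b": ["gt_b"]}, m=2, keep_pct=50)
```

With `keep_pct=50`, image "a" keeps its two best collected captions while low-reward image "b" falls back to its GT caption.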
After the collection phase, the agent trains the captioning module on the collected captions (details in Section 3.5). We assume the agent has full access to past data and is retrained from scratch. Future work can look at applying continual learning techniques to learn more efficiently on new data. The buffer contains the warmup GT captions, the collected captions, and the GT captions obtained from “giving up”.
Let w_{1:T} denote a caption of length T, and I an image. The captioning module computes a probability distribution over the words in a sentence, i.e. p(w_t | w_{1:t-1}, I). We further compute an array of contexts produced by the captioner (details in Sec 3.6). The context helps the decision maker decide what concepts to ask about, and the question generator ask relevant questions. We denote the contexts used by the decision maker and question generator by c_D and c_Q, respectively. The decision module computes a multinomial distribution over word positions, indicating the probability that a question should be asked at position t of the caption. We allow t to index a special <eos> position representing the case where no question should be asked. The question generation module computes a probability distribution over questions q. Details of the modules are presented in Sec 3.6.
3.3 Interacting with the Teacher
We now provide details of how the agent interacts with the teacher in the collection phase. Given an image, the captioner produces the complete initial caption and context by a greedy rollout from the captioning module. The decision module then makes a decision by sampling a position t from its distribution. Words other than nouns, verbs, and adjectives are masked out. Let w_t be the word for which the decision module decides to ask a question. The question generator produces a question q, and the agent receives an answer a. The agent then replaces word w_t with a and predicts a new caption by rolling out the rest of the caption from position t+1, using the previous hidden state of the captioner and the answer a. If the teacher's answer is a rare word for the agent, the rollout may diverge from any sensible trajectory. For this reason, we also give the agent the option of doing a one-word replace of the teacher's answer, i.e. substituting only word w_t and keeping the rest of the original caption.
Finally, the teacher scores the original and the two improved captions, giving each a numeric reward. The process can be repeated by asking a second question and replacing another word at a later step. In general, the agent can ask up to n questions for a single caption. In practice, we observe a single question to work best in our experiments, but we keep n in the following for generality of exposition. The interaction process is summarized in Algorithm 2.
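The two ways of injecting the teacher's answer (a pure one-word replace versus replacing and re-rolling the suffix) can be sketched as follows; `rollout` is a hypothetical stand-in for continuing generation from the captioner's hidden state:

```python
# Sketch of how the teacher's answer is injected into a caption.
# Captions are token lists; `rollout` is a placeholder for the captioner.

def one_word_replace(caption, t, answer):
    # Substitute only the queried word; keep the rest of the original caption.
    return caption[:t] + [answer] + caption[t + 1:]

def replace_and_rollout(caption, t, answer, rollout):
    # Keep the prefix, substitute the answer, then regenerate the suffix.
    prefix = caption[:t] + [answer]
    return prefix + rollout(prefix)

caption = ["a", "dog", "on", "a", "couch"]
swapped = one_word_replace(caption, 1, "cat")
rolled = replace_and_rollout(caption, 1, "cat", lambda p: ["on", "a", "mat"])
```

Both candidates, along with the original caption, would then be scored by the teacher.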
3.4 Learning When to Ask Questions
As the agent queries the teacher in the collection phase, it trains the decision maker online to ask better questions. The teacher provides a scalar, non-differentiable reward, so the decision maker is updated using REINFORCE. We baseline the reward with the greedy-decision reward (that is, what the improved-caption reward would have been had the decision maker sampled greedily), following the self-critical policy gradient. See line 11 in Algorithm 1. In the general case where n questions are asked, the gradient for the parameters φ of the decision maker is:

∇_φ J = Σ_{i=1}^{n} (r_i − r̄_i) ∇_φ log p(t_i | c_D),

where r_i is the teacher's reward for the caption improved at the sampled position t_i, and r̄_i is the reward obtained under greedy decisions.
In this work we did not update the question generator during lifelong learning, because jointly training the decision maker and question generator is a hierarchical RL problem. Credit assignment is challenging: the agent needs to differentiate the decision maker choosing a bad time step from the decision maker choosing a good time step but the question generator producing a bad question.
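A minimal sketch of the self-critical decision-maker update described above, in plain Python (in practice these would be tensor operations, and the argument layout is an assumption):

```python
# Self-critical REINFORCE loss for the decision maker: each sampled question
# position t_i is reinforced by how much its reward beats the greedy baseline.
import math

def dm_policy_gradient_loss(log_probs, rewards, greedy_rewards):
    # log_probs[i]: log p(t_i) under the decision maker for the i-th question
    # rewards[i]: teacher score of the caption improved at t_i
    # greedy_rewards[i]: score obtained when the position is chosen greedily
    return sum(-(r - b) * lp
               for lp, r, b in zip(log_probs, rewards, greedy_rewards))

# One question, sampled with probability 0.5, beating the baseline by 0.6.
loss = dm_policy_gradient_loss([math.log(0.5)], [1.0], [0.4])
```

Minimizing this loss increases the probability of positions whose questions improved the caption beyond the greedy baseline, and decreases it otherwise.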
3.5 Learning to Caption from Teacher’s Feedback
The captioning module is re-trained in the update phase using a joint loss over the collected and GT captions in the stored buffer:

L(θ) = − Σ_{c ∈ C} r(c) log p_θ(c) − λ Σ_{c ∈ G} log p_θ(c),

where C are the collected captions, G are the GT captions, r(c) is the score given by the teacher for caption c, and λ is a tuned hyperparameter. In practice, we set λ to a percentile of the rewards of the collected captions, assuming that ground-truth captions are generally better than collected captions.
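A sketch of this joint loss, assuming per-caption log-probabilities are available; the percentile helper only illustrates how the GT weight could be set from collected rewards (the exact percentile is not specified here):

```python
# Joint update-phase loss: reward-weighted NLL on collected captions plus a
# lam-weighted NLL on GT captions. Argument layout is an assumption.

def caption_loss(collected, gt, lam):
    # collected: list of (log_prob, reward) for teacher-improved captions
    # gt: list of log_prob values for ground-truth captions
    loss_collected = sum(-r * lp for lp, r in collected)
    loss_gt = sum(-lam * lp for lp in gt)
    return loss_collected + loss_gt

def percentile(rewards, q):
    # Simple percentile, used to set lam from the collected-caption rewards.
    s = sorted(rewards)
    idx = min(len(s) - 1, int(q / 100 * len(s)))
    return s[idx]

loss = caption_loss([(-1.0, 0.5)], [-2.0], lam=0.8)  # 0.5 + 1.6 = 2.1
```

Setting lam to a high percentile of collected rewards weights GT captions at least as strongly as the best collected captions.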
3.6 Implementation Details
Captioning module.
is implemented as an attention CNN-RNN model. We additionally predict a part-of-speech (POS) tag at each time step, informing the question generator what type of question should be asked, and the decision maker whether to ask. The captioner is trained using MLE with teacher forcing and scheduled sampling.
Question generation module.
is also implemented as a CNN-RNN, conditioned on the question-generation context at time t. Specifically, this context consists of: the POS distribution, which determines the “question type”; the attention weights predicted by the captioner, which guide where the question generator looks; an encoding of the caption, which provides global context and prevents asking about redundant concepts; and a position encoding for t. We found it helpful to allow the question generator to re-attend rather than fully rely on the captioner's attention. We train the question generator on a novel dataset, using MLE with teacher forcing and scheduled sampling, similar to the captioner (details in Appendix).
The decision maker
is implemented as a multilayer perceptron (MLP) with a Softmax output. Its context consists of the POS distribution, an encoding of the caption, and uncertainty metrics computed from the top-k words predicted by the captioner:
Cosine similarity between the embedding of the top-1 word and all other words.
Cosine similarity between each top-k word and the embedding of the entire sentence (implemented as the sum of word embeddings).
Minimum distance of each top-k word to another word.
Entropy is a natural way to measure the uncertainty of the captioner. However, the model can predict synonyms, which increase the entropy but do not suggest that the model is uncertain. Therefore, for each time step we take the word embeddings of the top-k words and compute their relative distances as a secondary measure of uncertainty. We use a small fixed k. In ablation studies, we show that these statistics alone can capture the uncertainty of the captioner; training a neural network on these statistics further improves performance.
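The uncertainty statistics can be sketched as below; the embeddings are toy vectors, and `topk_similarities` is an illustrative name rather than the paper's implementation:

```python
# Sketch of the uncertainty statistics: entropy over the top-k word
# distribution, plus cosine similarities among top-k word embeddings.
import math

def entropy(probs):
    # Shannon entropy of a probability distribution (nats).
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def topk_similarities(embeddings):
    # Similarity of the top-1 word's embedding to each other top-k word:
    # high similarity (synonyms) means high entropy need not imply uncertainty.
    top1 = embeddings[0]
    return [cosine(top1, e) for e in embeddings[1:]]

e = entropy([0.5, 0.5])
sims = topk_similarities([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

A flat distribution over near-synonyms would yield high entropy but high similarities, which is exactly the case the closeness metrics are meant to discount.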
We imagine our agent in a human-in-the-loop setting where a teacher answers natural-language questions, chooses the best caption out of a few alternatives, scores it, and writes GT captions if necessary. For efficiency, we use a synthetic teacher. It consists of two parts: a VQA bot, implemented following prior work, and a caption scorer composed of a linear combination of BLEU, ROUGE, METEOR, and CIDEr. The scorer uses additional supporting evidence, for which we use the reference captions from MSCOCO. We call the reward from the caption scorer the Mix score. We discuss challenges to using a synthetic teacher in Sections 4.3 and 4.6.
| Method | Collect % | GT % | Supervision % | Mix | CIDEr | METEOR | ROUGE | BLEU4 | BLEU2 |
|---|---|---|---|---|---|---|---|---|---|
| Equal GT | - | 45.2 % | 45.2 % | 98.9 | 91.5 | 24.7 | 52.3 | 28.0 | 53.4 |
| All GT | - | 100 % | 100 % | 101.7 | 96.4 | 25.1 | 52.9 | 28.8 | 54.9 |
| Inquisitive Student | 70% | 45.2 % | 73.5 % | 103.9 | 98.0 | 25.4 | 53.8 | 30.5 | 57.1 |
| Mute Student | 70% | 45.2 % | 72.6 % | 102.2 | 95.9 | 25.2 | 53.4 | 29.3 | 55.9 |
4 Experiments

We evaluate our approach on the challenging MSCOCO dataset and compare it to intelligent baselines. We perform detailed ablation studies that verify our choices and give insight into how our model behaves.
We follow the standard Karpathy split, which contains 117,843 training, 5K validation, and 5K test images. We randomly split the training set into a warmup chunk and lifelong learning chunks. In our experiments, we vary the size of the warmup and the number of lifelong chunks to analyze the model's behavior under different regimes. There are 5 GT captions for each image in the warmup set. At the end of lifelong learning, each image in the lifelong set has either collected or GT captions.
Image features are extracted with ResNet-101 trained on ImageNet. Vocabulary sizes for the captioner, question generator, and VQA module are 11253, 9755, and 3003, respectively. We use the Stanford NLP parser to obtain GT POS labels. The decision maker only considers a subset of tags (listed in the Appendix) when asking questions.
4.1 Training Details
We train the VQA model using a multi-answer binary cross-entropy loss; it achieves 64.2% on the VQA2.0 val split without ensembling. We train the question generator by combining data from MSCOCO and VQA2.0 (implementation details in the Appendix). A natural concern is that training the question generator on images the captioner sees during lifelong learning will cause the question generator to “look up” GT questions. We find this not to be the case (see Figure 8). In general, the questions generated for an image are diverse, generic, and rarely match GT questions (see the Appendix for more examples).
4.2 Cost of Human Supervision
We first perform a human study to understand the human cost associated with each type of interaction with the agent. We measure “human effort” by the time taken for a task. In our experiment, a human teacher has three possible tasks: write a full caption, answer a question, and score a caption. Table 4 shows that, on average, it takes 5.2 and 4.6 times longer to write a caption than to score a caption or to answer a question, respectively. To compute the cost of human supervision, we normalize the cost of each task to that of caption scoring. Hence the agent incurs one point of supervision for each caption scored, 1.13 for each question answered, and 5.2 for each full caption written. In practice, we assume no cost when the VQA module answers a question; a human teacher would charge the agent for answers, but would also give better answers. In the experiments to follow, we use Human Supervision as the metric for cost incurred by querying a human.
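The supervision accounting reduces to a small weighted sum; the function below is a sketch using the normalized costs reported above:

```python
# Supervision cost normalized to caption scoring: answering a question costs
# 5.2/4.6 ≈ 1.13 points, and writing a full caption costs 5.2 points.

COST_SCORE = 1.0
COST_ANSWER = 5.2 / 4.6   # ≈ 1.13
COST_WRITE = 5.2

def supervision_cost(n_scored, n_answered, n_written):
    return (n_scored * COST_SCORE
            + n_answered * COST_ANSWER
            + n_written * COST_WRITE)
```

For example, scoring two captions, answering one question, and writing one caption costs about 8.33 points of human supervision.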
4.3 Learning by Asking Questions
In Table 1 we evaluate our lifelong learner, the “inquisitive student” (IS), against training only on GT data, on the test split. All results are reported using greedy decoding. Our model was trained with a 10% warmup chunk, 3 unlabelled chunks, and a varying collect percentage. For each setting we report the best of three models with different random seeds. We report two GT baselines: Equal GT, with the same number of GT captions as our model but fewer total captions, and All GT, with the same number of captions as our model but only GT captions.
To evaluate the benefits of asking questions, we also introduce a lifelong learner, the “mute student” (MS), that can ask for caption scores but not question answers. MS is trained in exactly the same lifelong setting as ours, but samples from its own word distribution to explore new captions rather than asking questions. Specifically, MS makes several educated guesses for the caption by sampling from the captioning module. All models have the same hyperparameters and captioning architecture and are trained on all images to ensure fairness. GT and Supervision % are reported relative to All GT.
Compared to Equal GT, our lifelong model scores 5 Mix and 6.5 CIDEr points higher, which shows that, for an agent with a fixed budget of GT captions, additionally learning from collected captions can significantly improve performance. Compared to All GT, our model scores 2.2 Mix and 1.6 CIDEr points higher while using only 45.2% of the GT captions and 73.5% of the human supervision. This means that training on teacher-improved captions not only achieves greater efficiency but also leads to higher performance than training on GT captions. We find this to be a particularly strong and interesting result.
IS also beats MS, which demonstrates that question asking is beneficial. This is investigated further in Fig. 3, where we vary the number of GT captions by adjusting the percentage of collected captions. We call an agent that often trusts its teacher-improved captions (and rarely gives up) a “confident” learner; confident learners use less human supervision. An agent that begins lifelong learning earlier, with only a small warmup set, is an “eager” learner.
IS outperforms MS in almost all settings but the difference is greater if the agents are eager. Fig. 3 shows that at 10% warmup the gap is 1.4 CIDEr (97 vs 95.6) but as we reduce to 1% warmup, the gap becomes 12.7 CIDEr (77 vs 64.3). This supports the intuition that asking questions benefits learners with less experience. In addition, a more eager learner ultimately reaches lower performance for the same amount of supervision. For about 30% supervision IS achieves 93.9 CIDEr in the 10% warmup setting and 83.5 CIDEr in the 1% warmup setting. We hypothesize this is because the quality of sentence continuations, or rollouts after receiving the teacher’s answer, worsens if the agent pretrains on less data. Furthermore, a very eager learner may make too many mistakes to fix by asking only one question.
Selected examples are shown in Fig 4. The first four examples are positive and show asking questions helps fix incorrect words and retrieve novel concepts. In the fifth example, the reward is lower for the new caption even though it is good according to human judgment. Auto-eval metrics do not reward the agent for relevant, novel captions that don’t match words in the reference captions. A human teacher with more flexible scoring could encourage the agent to learn more diverse captions and a larger vocabulary.
4.4 Learning New Concepts
One way to measure the usefulness of teacher answers is to compute how often the teacher repeats a concept the captioner already knows. Table 4 shows how frequently the answer from the teacher appears in the top-k words predicted by the captioner at the time step where the question is asked (ATopk). Note that this is approximate, because the captioner may predict the answer at a different step. In the first round of lifelong training, 26.3% of teacher answers appeared in the top-5 words predicted by the captioner. Hence, 73.7% of the time, the agent is learning unfamiliar or novel concepts. Over the agent's lifetime, ATopk increases as the student's knowledge catches up to the teacher's.
Fig. 7 shows the number of unique words used by a captioner evaluated on the val split at the end of lifelong learning. We found a dependency between training epochs and vocabulary size, and therefore took all models at the same epoch. We baseline against the mute student. IS has a more diverse vocabulary than MS at all % GT, using more unique nouns, verbs, and total words.
In Table 4 we compare the vocabulary of lifelong learners to All GT. All GT has a larger vocabulary than lifelong learners. This is intuitive because All GT has more GT captions and therefore sees more varied data. IS only receives a single word answer given an image, whereas All GT receives a complete caption label containing on average 10.5 words. For the same reason, in Fig. 7 the agents’ vocabulary decreases as % GT decreases.
4.5 Analyzing the Modules
We conducted a human study (Fig. 11) using Amazon Mechanical Turk (AMT) to evaluate the quality of generated questions. Annotators rated 500 image-question pairs, answering questions they judged good or flagging them as “not understandable” or “irrelevant to the image”. The questions were randomly selected from those the question generator asked while trying to caption; the images were not seen by the question generator during its training. 82.4% of questions were rated “good” and answered. This is a promising result and suggests that learning by asking can be adapted to use human teachers instead of a QA bot.
Fig. 8 shows generated questions at different time steps in a caption. In general, generated questions tend to be diverse and generic. It is important for questions to be generic so that the teacher can answer with a wide range of possible concepts, including new ones. We also rarely observe generated questions that match the GT questions. More examples are in the Appendix.
To test the decision maker, we look directly at the scores of the refined captions it produces, rather than at those of the final captions after retraining the captioner. This lets us precisely observe the ablated performance of the DM. Table 11 evaluates different decision maker strategies. We first train the captioning and question generation modules. The baseline is the performance of the captioner without asking questions; the other settings use various decision maker models to ask a question and improve captions. Learned models are trained using RL on a single chunk of unlabelled data. Scores are shown for the val split.
The full model gives a 6.5 CIDEr improvement over no question asking. Picking the time step with maximum entropy is not a very good strategy: it is only 0.3 CIDEr better than picking a random step, because the model can predict synonyms, which increase the entropy but do not indicate that the model is uncertain. Adding closeness metrics yields a 1.0 CIDEr improvement over maximum entropy, showing that taking into account the closeness of words in embedding space gives a better measure of uncertainty. In all cases, learning improves performance, with the best learned model achieving 3.1 CIDEr more than the best non-learned model. We use the full model as our decision maker in all experiments.
4.6 Understanding the Model
Number of chunks.
Fig. 7 shows that as the number of chunks increases, performance increases (for similar human supervision). This is intuitive, because more chunks mean the agent sees fewer images before adapting the captioner. The number of chunks cannot be too large, however, because we retrain the captioner from scratch after every chunk.
Catching up to the teacher.
As suggested in Sec. 4.4, we find that asking questions becomes less useful as the agent consumes more chunks. Fig. 11 shows the percentage of collected captions that were improved by asking questions (left axis) and the average reward of collected captions (right axis) versus the number of consumed chunks. Over time, the agent is able to improve fewer and fewer captions by querying the teacher. Furthermore, the largest increase in collected reward occurs in the first round. Together, these observations suggest that the teacher's knowledge is exhausted over time. This is a limitation of using a static, synthetic, and noisy QA bot (which only achieves 64% accuracy). Learning may benefit from human teachers over more rounds, because they are more accurate and have a much wider pool of knowledge.
Types of answers.
In Fig. 7 we see the distribution of answer types from the teacher. Over time, the student asks for more nouns, and fewer verbs and adjectives. We hypothesize this is because the agent learns verbs and adjectives early on before moving on to nouns.
5 Conclusion

In this paper, we addressed the problem of active learning for image captioning. In particular, we allow the agent to ask about a particular concept related to the image that it is uncertain about, rather than requiring the full caption from the teacher. Our model is composed of three modules, i.e. captioning, decision making, and question posing, which interact with each other in a lifelong learning setting. We showed that this improves both learning and teaching efficiency on the challenging MS-COCO dataset.
Our work is a first step towards a more natural learning setting in which data arrives continuously, and robots learn from humans through natural-language questions and feedback. Many challenges remain in making lifelong model learning more efficient, and in incorporating real humans in the loop.
Supported by the DARPA Explainable AI (XAI) program. We thank NVIDIA for their donation of GPUs. We thank Relu Patrascu for infrastructure support, David Acuna, Seung Wook Kim, Makarand Tapaswi, Yuan-Hong Liao, for fruitful discussion and Atef Chaudhury, Harris Chan, Silviu Pitis for their helpful feedback in editing the paper.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
-  S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
-  B. Dai, S. Fidler, R. Urtasun, and D. Lin. Towards diverse and natural image descriptions via a conditional gan. arXiv preprint arXiv:1703.06029, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  A. K. Gupta. Survey of visual question answering: Datasets and techniques. arXiv preprint arXiv:1705.03865, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
-  S. W. Kim, M. Tapaswi, and S. Fidler. Progressive reasoning by module composition. arXiv preprint arXiv:1806.02453, 2018.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  J. Li, A. H. Miller, S. Chopra, M. Ranzato, and J. Weston. Learning through dialogue interactions by asking questions. arXiv:1612.04936, 2016.
-  Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou. Visual question generation as dual task of visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6116–6124, 2018.
-  C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  H. Ling and S. Fidler. Teaching machines to describe images via natural language feedback. arXiv preprint arXiv:1706.00130, 2017.
-  C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, 2014.
-  M. J. Matarić. Socially assistive robotics: Human augmentation versus automation. Science Robotics, 2(4):eaam5410, 2017.
-  I. Misra, R. Girshick, R. Fergus, M. Hebert, A. Gupta, and L. van der Maaten. Learning by asking questions. arXiv preprint arXiv:1712.01238, 2017.
-  N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. arXiv preprint arXiv:1603.06059, 2016.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
-  J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
-  S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.
-  B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
-  E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun. Neuroaesthetics in fashion: Modeling the perception of fashionability. In CVPR, volume 2, page 6, 2015.
-  R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
-  D. Teney, P. Anderson, X. He, and A. v. d. Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017.
-  K. Uehara, A. Tejero-De-Pablos, Y. Ushiku, and T. Harada. Visual question generation for class acquisition of unknown objects. arXiv preprint arXiv:1808.01821, 2018.
-  R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
-  O. Vinyals and Q. Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
-  Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 163:21–40, 2017.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
-  J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Visual curiosity: Learning to ask questions to learn visual recognition. arXiv preprint arXiv:1810.00912, 2018.
6 Supplementary Material
This supplementary contains details of the modules in our model, the training procedure, as well as additional qualitative examples. In Section 6.1 we discuss implementation details of the captioner, question generator, decision maker and VQA teacher. Furthermore, we describe the inquisitive student and mute student in the lifelong setting. In Section 6.2 we discuss the challenges with asking questions. In Section 6.3 we provide more detail on how human supervision is calculated for our experiments. In Section 6.4 we show an ablation study on the question generator. Our study highlights the importance of each feature in the context used by the question generator. In Section 6.5 we show more qualitative examples and describe the failure modes of our model.
6.1 Implementation Details
6.1.1 Lifelong Learning
In lifelong learning, data arrives in chunks. In the collection phase, the agent attempts to improve generated captions by querying the teacher. In the update phase, the captioning module is updated on collected captions.
In our experiments, we vary the collection percentage and the size of the warmup chunk. Note: the size of the warmup chunk is reported relative to the entire training split whereas % GT (reported in tables and figures) is relative to the total number of captions the baseline All GT is trained on. For example 10% warmup refers to a dataset with 11.3K/113K images and 57K/567K captions. We explored the following settings.
collection: 60, 70, 80, 90, 100%
warmup: 1, 3, 5, 10%
In the update phase, we train the captioner with ADAM, lr=, batchsize=, scheduled sampling, and learning rate (lr) decay. Scheduled sampling and learning rate decay are described in Section 6.1.2. We now outline details of the inquisitive student (IS) and mute student (MS) in the collection phase.
The inquisitive student samples captions and questions greedily. We found it helpful to keep the captioner and question generator in train mode so that dropout is still applied. This introduces a small amount of stochasticity and the agent generates more varied captions. IS makes 8 passes over each image in a chunk. However, because the captioner and question generator are sampled greedily, later rounds produce only a few novel captions. We found that 4 passes worked almost as well; more passes produce diminishing returns. We train the decision maker online using policy gradient. We use ADAM, lr= and batchsize=.
The mute student has the same captioning architecture and hyperparameters as IS. There are some differences in the collection phase. Instead of asking questions, MS samples from the captioning module to explore new captions. Specifically, MS samples captions with temperature 1.0. MS makes 4 passes over each image in a chunk. This ensures that IS and MS use a similar amount of human supervision.
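The collection/update cycle described above can be sketched as follows. All names here (`lifelong_loop`, `update_captioner`) are hypothetical stand-ins for illustration, not the paper's actual implementation; the caption-generation and teacher-querying steps are replaced with placeholders.

```python
# Minimal sketch of the lifelong collection/update cycle, assuming data arrives
# in chunks and the agent makes several passes over each chunk.

def lifelong_loop(chunks, n_passes=4):
    """Alternate a collection phase (generate and keep novel captions) with an
    update phase (retrain the captioner on everything collected so far)."""
    collected = []
    for chunk in chunks:
        # Collection phase: several passes over each image in the chunk; with
        # greedy decoding, later passes yield few novel captions.
        for _ in range(n_passes):
            for image in chunk:
                caption = f"caption-for-{image}"   # stand-in for greedy decoding
                if caption not in collected:       # keep only novel captions
                    collected.append(caption)
        # Update phase: placeholder for the ADAM update with scheduled
        # sampling and learning rate decay.
        update_captioner(collected)
    return collected

def update_captioner(captions):
    return len(captions)
```

In this sketch, duplicate captions produced by repeated passes are filtered before the update phase, mirroring the diminishing returns of extra passes noted above.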
6.1.2 Captioning Module
The captioning module is implemented as an attention encoder-decoder. It predicts both the next word and the next POS given the previous word. In our implementation, the CNN encoder is fixed. However, we project the image features using a fully connected (FC) layer before passing them to the decoder. The decoder is implemented as a single layer GRU with 512 units. We use dropout=0.5 in all layers.
The hidden state of the GRU is used to compute the next word and POS. More specifically, we first predict the POS distribution then condition the next word on either the predicted POS or ground truth POS. Scheduled sampling is used to control how often the predicted or GT POS is used. Words are embedded into 512 dimensional latent space and POS are embedded into 50 dimensional latent space. If the predicted POS is used to predict the next word, we embed the entire POS distribution and concatenate this embedding with the decoder hidden state. The resulting vector is passed into a FC layer to predict the next word. If the GT POS is used, we embed the one-hot vector and similarly predict the next word. The captioner is trained using a joint loss over next word and POS.
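The POS-conditioned word prediction described above can be sketched with plain numpy. Dimensions follow the text (512-d hidden state, 50-d POS embedding); the weights and vocabulary size here are random illustrative stand-ins, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, vocab = 12, 100
pos_embed = rng.normal(size=(n_pos, 50))      # POS embedding table (50-d)
W_out = rng.normal(size=(512 + 50, vocab))    # FC layer predicting the next word

def next_word_logits(hidden, pos_dist):
    """hidden: (512,) decoder GRU state; pos_dist: (n_pos,) POS distribution."""
    # Embed the entire POS distribution as a soft mixture of POS embeddings.
    pos_vec = pos_dist @ pos_embed            # (50,)
    # Concatenate with the decoder hidden state and apply the FC layer.
    return np.concatenate([hidden, pos_vec]) @ W_out  # (vocab,)

logits = next_word_logits(rng.normal(size=512), np.full(n_pos, 1 / n_pos))
```

When the ground-truth POS is used instead, the one-hot POS vector would be embedded in the same way (a one-hot `pos_dist` simply selects one row of the table).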
We tune and find to work the best. We limit the length of captions to 16 plus the end-of-sentence symbol.
Getting ground truth POS
We use the Stanford NLP parser to get ground truth POS for both GT and collected captions. On rare occasions, the Stanford NLP parser returns errors when parsing generated captions. In these cases, the agent collects the GT caption for the image rather than the generated one.
Training is the same for both the warmup chunk and the update phase in lifelong learning. Specifically, we train the captioner using MLE with teacher forcing, scheduled sampling (on both the words and POS) and learning rate decay. We start the learning rate at and decay it by a factor of 0.8 every 3 epochs. The probability of predicting the next word using the previously predicted word (rather than the GT word) starts at 0 and increases by 0.05 every 5 epochs. The probability of predicting the next word using the previously predicted POS (rather than the GT POS) starts at 0.2 and increases by 0.05 every 3 epochs.
6.1.3 Question Generating Module
The question generator generates a question given a context vector computed by the captioner. More specifically, the context consists of the full caption and the POS and attention weights (of the captioner) at a particular time step. The time step is determined by the decision maker.
Encoding the caption
The caption is used by the question generator in two ways. First, a pretrained (and fixed) captioning module is used to encode the caption. The hidden state of the captioner GRU (aka cap-hid) is used as a feature. Second, the question generator encodes the caption with its own single layer Bidirectional-GRU (BiGRU). The BiGRU has 256 units. Caption words are represented as 256 dimensional latent vectors. Finally, the time step computed by the decision maker is encoded as a 256 dimensional vector of 1’s. It is fed alongside the caption word embeddings into the BiGRU.
The question generator’s decoder is also a single layer GRU with 512 units. We embed the POS and use it along with cap-hid to initialize the decoder hidden state. We use the entire POS distribution rather than the max POS. The BiGRU encoded caption is passed into the decoder along with image features at every time step. We limit the length of questions to 14. Question words are embedded into 512 dimensional latent space.
We allow the question generator to re-attend to the image. Specifically, the question generator first computes its own attention weights independent of the captioner’s weights. The attended image features are then concatenated with features computed from the captioner’s attention. Finally a FC layer is used to compute the final image features.
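The re-attention step described above can be sketched as follows. Region count, feature dimension, and weights are illustrative assumptions; the real model operates on CNN feature maps with learned attention networks.

```python
import numpy as np

rng = np.random.default_rng(1)
R, D = 49, 64  # assumed number of image regions and feature dimension

def reattend(image_feats, qgen_scores, captioner_feats, W_fc):
    """image_feats: (R, D) region features; qgen_scores: (R,) the question
    generator's own raw attention logits; captioner_feats: (D,) features
    computed from the captioner's attention."""
    # The question generator computes its own attention weights,
    # independent of the captioner's weights (softmax over regions).
    att = np.exp(qgen_scores - qgen_scores.max())
    att /= att.sum()
    own = att @ image_feats                        # (D,) re-attended features
    # Concatenate both views and project with a final FC layer.
    return np.concatenate([own, captioner_feats]) @ W_fc

feats = reattend(rng.normal(size=(R, D)), rng.normal(size=R),
                 rng.normal(size=D), rng.normal(size=(2 * D, D)))
```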
We train the question generator with MLE, teacher forcing, learning rate decay and scheduled sampling. We use a batch size of 20 and the same schedules as for training the captioner: lr decay 0.8 every 3 epochs, scheduled sampling increase 0.05 every 5 epochs.
6.1.4 Dataset for Training Question Generator
We combine the MSCOCO and VQA2.0 datasets to train the question generator. The two datasets share images. Therefore, we can form training samples by matching answers from QA pairs of VQA2.0 to words in the MSCOCO captions. A training sample is a (caption, answer, question) tuple. We pass the caption through a pretrained captioner to compute the context vector. The “time step” is chosen to be the index of the word in the caption that matches the QA answer. The question generator is trained to predict the GT question given the context. Doing this gives us 135K samples for training and 6K for validation. We call this the “answer-matched” set. We make a second “pos-matched” dataset to increase the diversity of questions by taking the answer from QA and instead matching its POS to a random word in MSCOCO captions with the same POS. The pos-matched dataset contains 108K samples. When we train the question generator, we sample from both the answer-matched and pos-matched datasets (equally) in every minibatch.
To make the VQA vocabulary better match the captioning vocabulary, we convert numbers from digits to words (e.g. 7 → “seven”). For every image in the answer-matched dataset, we allow at most 2 questions with the same answer. This prevents the model from overfitting and asking questions about only a single concept in an image.
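The digit-to-word normalization above can be sketched minimally. This covers single-digit tokens only; whether the actual preprocessing handles larger numbers is not stated in the text.

```python
# Minimal sketch of the digit-to-word normalization, assuming whitespace
# tokenization and single-digit answers.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize_answer(answer):
    """Replace digit tokens so VQA answers match the captioning vocabulary."""
    return " ".join(DIGIT_WORDS.get(tok, tok) for tok in answer.split())
```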
6.1.5 Decision Making Module
The decision maker predicts which word in the generated caption the agent should ask about. It does so given a context vector from the captioner. The context vector consists of: the full caption, POS, attended image features, and the top-k words. The attended image features are computed by weighting the output of the CNN encoder by the captioner’s attention weights. The top-k words are the top-k words predicted by the captioner at every time step. They are used to compute closeness metrics which capture the captioner’s uncertainty. We use .
Encoding the caption
The decision maker encodes the full caption the same way as the question generator. First we pass the caption through a pretrained captioner. The hidden state of the GRU is used as a feature. Second, we encode the caption with a BiGRU with 256 units.
Masking out invalid POS
We mask out invalid POS (corresponding to words the agent can never ask questions about). We do this by computing the maximum of the POS distribution at each time step and comparing it to a predefined list of valid POS. See table 5 for the full list.
The top-k words predicted by the captioner are used to compute word closeness metrics. First, the words are embedded using the embedding layer from the captioner. Then, we compute the following features:
-  Cosine similarity between the embedding of the top-1 word and all other words.
-  Cosine similarity between each top-k word and the embedding of the entire sentence (implemented as the sum of word embeddings).
-  Minimum distance of each top-k word to another word.
The result is a tensor of shape (batch size) × (time steps) × (features per top-k word). We combine the features along the feature axis using a 1D CNN to obtain the final feature vector.
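The three closeness features described above can be sketched for a single time step with numpy. Shapes (k top-k words, d-dimensional embeddings) are illustrative assumptions.

```python
import numpy as np

def closeness_features(topk_emb, sent_emb):
    """topk_emb: (k, d) embeddings of the top-k words; sent_emb: (d,)
    sentence embedding (sum of word embeddings)."""
    unit = topk_emb / np.linalg.norm(topk_emb, axis=1, keepdims=True)
    # 1) cosine similarity between the top-1 word and every top-k word
    cos_top1 = unit @ unit[0]
    # 2) cosine similarity between each top-k word and the sentence embedding
    cos_sent = unit @ (sent_emb / np.linalg.norm(sent_emb))
    # 3) minimum euclidean distance of each top-k word to another top-k word
    dists = np.linalg.norm(topk_emb[:, None] - topk_emb[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)   # exclude distance to itself
    min_dist = dists.min(axis=1)
    return np.stack([cos_top1, cos_sent, min_dist], axis=1)  # (k, 3)

feats = closeness_features(np.random.default_rng(2).normal(size=(5, 16)),
                           np.ones(16))
```

Stacking these per-step feature vectors over time yields the tensor that the 1D CNN combines.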
Computing the probability of asking questions
We embed the POS distribution with an embedding layer. We project the image features with a FC layer. Finally, we pass the POS, image, caption and closeness features through a MLP to compute logits. We apply a Softmax across time to find the probability of asking a question at each time step.
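The final step above, combined with the POS masking, can be sketched as a masked softmax over time steps. The mask values and logits here are illustrative.

```python
import numpy as np

def ask_probabilities(logits, valid_mask):
    """logits: (T,) per-time-step MLP scores; valid_mask: (T,) bool, False
    where the POS is invalid (the agent can never ask there)."""
    masked = np.where(valid_mask, logits, -np.inf)   # mask out invalid steps
    exp = np.exp(masked - masked[valid_mask].max())  # stable softmax over time
    return exp / exp.sum()

p = ask_probabilities(np.array([1.0, 2.0, 0.5]), np.array([True, True, False]))
```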
Table 5 (excerpt): valid POS tags.

| POS tag | Description |
| --- | --- |
| NNPS | proper noun, plural |
| VBG | verb, gerund/present participle |
| VBD | verb, past tense |
| VBN | verb, past participle |
| VBP | verb, non-3rd person singular present |
| VBZ | verb, 3rd person singular present |
6.1.6 VQA Teacher
We use a VQA model as a synthetic teacher-answerer. We remove yes/no questions from the VQA dataset as they would not be useful for our captioning regime. This leaves 277K/444K questions in the training set and 133K/214K questions in the validation set.
We use a similar architecture to  but with PReLU activations instead of gated Tanh. We use dropout=0.5 in all layers. Question words are encoded using an embedding layer of 300 dimensions. The word embedding is initialized using GloVe embeddings , but we did not find a significant difference versus training from scratch. We limit questions to 14 words. The full question is passed through a single layer GRU with 512 units to get a vector representation. We use the final hidden state (step 14) of the GRU (padding short sentences) without sequence trimming or dynamic unrolling. Captions are used as supporting evidence to increase the performance of the VQA model. Captions are also embedded using an embedding layer with 300 units and then encoded using a GRU.
To fuse the question, image, and caption, we do an element-wise product between their vector representations. Specifically, we multiply question and image together, as well as question and caption together. The two feature vectors are concatenated and fed through a MLP to predict the logits.
We use batchsize=, lr= and ADAM to train the VQA model. Batch normalization is used to normalize image features. We use a multi-answer loss. The loss for a single sample is

$$\mathcal{L} = -\sum_{i=1}^{V} p_i \log \hat{p}_i,$$

where $i$ indexes over the answer vocabulary, $V$ is the size of the vocabulary, $p_i$ is the ground truth probability of answer $i$, and $\hat{p}_i$ is the probability predicted by the model. Each question in the VQA2.0 dataset is answered by 10 humans. This loss takes the full empirical human answer distribution as the target rather than only the most common answer.
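The multi-answer loss described above is a soft cross-entropy against the empirical distribution of the 10 human answers. A minimal sketch (with a small epsilon added for numerical stability, an implementation detail assumed here):

```python
import numpy as np

def multi_answer_loss(pred, target, eps=1e-12):
    """pred, target: (V,) distributions over the answer vocabulary.
    target is the empirical human answer distribution, not a one-hot vector."""
    return -np.sum(target * np.log(pred + eps))

# Example: of 10 humans, 7 gave answer 0 and 3 gave answer 1.
target = np.array([0.7, 0.3, 0.0])
loss = multi_answer_loss(np.array([0.6, 0.3, 0.1]), target)
```

Unlike a one-hot cross-entropy, credit is split across all answers humans actually gave.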
6.2 Asking Multiple Questions
In our reported experiments, the agent asked a single question for each generated caption in the collection phase. We experimented with asking multiple questions. However, this is challenging because the teacher’s answer is fed directly into the captioner to roll out the rest of the new sentence. If the answer is a rare or out-of-vocabulary word, the final words of the sentence may not follow a sensible trajectory. This problem is worsened when multiple questions are asked. One possible solution is to exploit hypernyms to keep the captioner on a sensible trajectory while inserting novel words. Another solution may be to learn an answer “absorption” module to better utilize the teacher’s answer. We leave these directions to future work.
6.3 Calculating Human Supervision
In our reported experiments, we computed the cost of human supervision by considering the completion time of various tasks. More specifically, every GT caption a model has access to costs 5.2 units of supervision and every caption scored during lifelong learning costs 1 unit of supervision. We make some other assumptions when calculating human supervision.
First, we filter out repeated captions and questions. Furthermore, we assume no cost when the VQA module answers a question. A human teacher would charge the agent for answers but would also give better answers. Finally, we only charge the agent once for picking the caption with the highest reward from the three alternatives (rollout, replace, and original) and then scoring it. This assumption can be relaxed by training a caption discriminator in future work.
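The supervision accounting above reduces to a simple weighted sum: 5.2 units per GT caption, 1 unit per caption scored during lifelong learning, and (by assumption) zero cost per VQA answer.

```python
# Sketch of the human-supervision cost computation described above.
def supervision_cost(n_gt_captions, n_scored_captions, n_vqa_answers=0):
    """Each GT caption costs 5.2 units, each scored caption 1 unit, and VQA
    answers are assumed free (a human teacher would charge for these)."""
    return 5.2 * n_gt_captions + 1.0 * n_scored_captions + 0.0 * n_vqa_answers
```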
6.4 Ablating the Question Generator
In table 6 we show how including various features affects the accuracy of the question generator. We use accuracy as a proxy for question quality. Accuracy is measured by passing a generated question through the VQA module and comparing the teacher’s answer with the ground truth answer. a@n means the GT answer appears in the top-n teacher answers. The baseline is a model trained only with POS and attention maps as context. Results are reported on the validation split of the dataset used to train the question generator. Both position and caption encoding boost accuracy; using both achieves a 14.2% accuracy gain over the baseline. We use the full model as our question generator in experiments.
6.5 Qualitative Examples
More qualitative examples are shown in the following pages. Fig. 12 shows the agent interacting with the teacher in the collection phase of lifelong learning. Figs. 13 and 14 show generated questions.
6.5.1 Failure Modes
Fig. 15 shows failure modes of our model. In the first image, the decision maker chooses a bad time to ask a question, and the agent gains no new information. In the second image, the question generator produces a bad question. The VQA teacher gives a nonsensical answer and the final caption is bad. The auto-eval metrics give the new caption a higher score than the original even though both captions are bad and it’s unclear which one is better. In the third image, the captioning module is not able to utilize the answer from the expert. It ignores the answer in rolling out the rest of the sentence. In the last image, the agent is rewarded for identifying the orange juice in the image. However, the final sentence doesn’t make grammatical sense. This is a limitation of using auto-eval metrics as the reward.