Lifelong Learning for Image Captioning by Asking Natural Language Questions

12/01/2018 ∙ by Kevin Shen, et al. ∙ UNIVERSITY OF TORONTO

In order to bring artificial agents into our lives, we will need to go beyond supervised learning on closed datasets to having the ability to continuously expand knowledge. Inspired by a student learning in a classroom, we present an agent that can continuously learn by posing natural language questions to humans. Our agent is composed of three interacting modules, one that performs captioning, another that generates questions and a decision maker that learns when to ask questions by implicitly reasoning about the uncertainty of the agent and expertise of the teacher. As compared to current active learning methods which query images for full captions, our agent is able to ask pointed questions to improve the generated captions. The agent trains on the improved captions, expanding its knowledge. We show that our approach achieves better performance using less human supervision than the baselines on the challenging MSCOCO dataset.




1 Introduction

Imagine a child who sees a crocodile for the first time. She will likely ask what the animal is called, or where it can be encountered outside the zoo, but probably does not need to be told that it is green or has four legs, or that its sharp teeth can pose danger. Children (and even adults) learn from teachers in an active way: they ask questions about concepts that they are unfamiliar with or uncertain about. In doing so, they make learning more efficient for both sides: the child acquires exactly the information they are missing, and the teacher answers a question instead of explaining many aspects of a concept in full detail. As A.I. becomes more and more integrated into our everyday lives, be it in the form of personal assistants or household robots [30, 18, 25], these systems too should actively seek out missing information from humans by asking questions in natural language that non-experts can understand and answer.

Most existing work on scene understanding tasks such as VQA [5, 27, 31, 6] and captioning [15, 23, 3] has focused on a closed-world setting, i.e. consuming the knowledge provided by a labeled dataset. On the other hand, the goal of active learning is to continuously update the model by seeking out relevant data to be additionally labeled by a human [24]. Most active learning approaches, however, ask the human to provide a full labeling of an example, and the main challenge is identifying which examples to label to ensure annotation efficiency. In our work, we go beyond this by endowing the model with the ability to ask about a particular aspect of a label, and to do so in natural language in order to unambiguously identify the missing information.

Figure 1: Learning to describe images by asking questions. Our model learns in a lifelong learning setting by actively seeking missing information. We jointly learn when and what to ask, and learn from the teacher’s answers. Our model poses questions in natural language.

We focus on the task of image captioning as a proxy task for scene understanding. In order to describe an image, a model needs to generate words describing the objects, their attributes, actions, and possibly relationships and interactions between objects. This is inherently a multi-task problem. In this paper, our goal is to allow a captioning agent to actively ask questions about the aspects of the image it is uncertain about, in a lifelong learning setting in which examples arrive sequentially and continually. Thus, instead of having humans provide captions for each new image, our agent aims to ask a minimal set of questions for the human to answer, and to learn to caption from these answers.

Our model consists of three modules: a captioning module, a decision making module that learns whether to ask and what to ask about, and a question generation module. At training time, as the captioner produces each word, the decision module decides which concept, if any, to ask about. If the agent decides to ask, the question generation module produces a question, and the teacher answers it. All three modules are implemented as neural networks. They are updated continuously with the data arriving in batches: the captioning module is updated using the captions improved by the answers from the teacher, while the decision module is updated based on the current uncertainty of the captioning module. For efficiency reasons, the teacher that answers questions is a QA bot. At test time the captioning model describes new images without asking questions. We showcase our method on the challenging MSCOCO dataset [15]. We provide insights into the behavior of our approach, and discuss open challenges ahead. To the best of our knowledge, this is the first time that natural language question asking has been explored in a lifelong learning setting with real-world images. All our code will be released.

2 Related Work

We provide a short overview of active and interactive learning approaches, and outline our main contributions with respect to existing work.

Active learning.

The goal of active learning is to intelligently seek labels for unlabelled data from an oracle in order to maximize learning while reducing the annotation cost. An agent predicts which sample, if labelled, will give the most useful learning signal as measured by performance on the test set. Strategies for active learning include uncertainty sampling, query by committee and expected model change [24]. Unlike the typical active learning setting where an agent asks the oracle for a full data label (which would be a full caption in our scenario), our method learns to ask pointed questions to retrieve partial labels, i.e. missing key words that compose a caption. Our model thus needs to not only learn when to ask, but also what to ask, and how to distill the received answer into a complex multi-task module (captioner).

Learning by Asking Questions.

Learning by asking questions is an exciting direction with notable contemporary work. Prior approaches typically differ in task, methodology (are questions natural or templated? how does the agent utilize the feedback?) and environment (synthetic vs real). [19] learns to answer questions by asking questions. The image and the generated question are treated as an unlabelled sample, and an oracle provides an answer to form a novel training pair. This simplifies the learning-by-asking framework by bypassing the challenges of free-form conversation and of interpreting the teacher’s answer, because the QA pair can be used directly as training data. Our work generalizes this framework by using question asking as a support task for the main task, in our case image captioning, which leads to a more general and significantly more challenging scenario. Furthermore, [19] operates in CLEVR [8], a synthetic environment, and its questions are limited to programs rather than natural language.

[33] explores question asking for visual recognition. Given an image, a graph of objects, attributes and relationships is continually updated as the agent asks questions. However, questions are limited to templates, and training is done in synthetic environments with a limited set of objects and relationships. [28] uses questions to explore new object classes for image classification. However, [28] does not retrain their classifier. Our work differs from [33, 28] by proposing a way for the agent to learn in a lifelong setting.

In [12], the agent learns whether to ask questions to the teacher to efficiently solve dialogue tasks. The student’s goal is to maximize the accuracy of answering the teacher’s questions while reducing the cost (to the teacher) of asking for hints. We extend this line of thinking by letting the agent learn what to ask about in addition to whether to ask.

Vision and Language.

Our work tackles captioning [32, 23, 3], visual question answering (VQA) [27, 6, 10], and visual question generation (VQG) [13, 20]. However, most of these works have focused on a closed dataset setting. Our main goal here is not to design a novel architecture for each module (captioning, VQG, VQA), but rather to focus on the interaction of the modules and the teacher in order to learn in a continual, active setting. Related to our work is [16], where a teacher observes the captioning agent in a continual setting, and gives natural language feedback when errors occur. The agent then learns to improve based on this signal. In our work, the agent is the one seeking advice, thus making the teaching process more efficient.

3 Our Approach

Figure 2: Modules being updated (green), modules being held fixed (grey), teacher (yellow). Writer is a teacher that produces full GT captions. The captioner begins by warming up on the first chunk containing all GT captions (left panel). Lifelong learning (right panel) occurs in two phases: collection and update. In the collection phase, the captioner generates a caption, the decision maker chooses when to ask a question, the question generator generates a question and the teacher provides an answer. The answer is used to create two new captions. Captions are collected and used to train the captioner in the update phase.

Our goal is to train an image captioning model in the active learning setting with minimal human supervision. We approach the problem by endowing the agent with the ability to ask questions, and learn from the teacher’s answers. However, question asking is only a tool for retrieving information during training; at test time, the captioner operates without needing to ask questions. Our model consists of three interacting modules, the captioner, question generator, and a decision maker. The question generator produces natural language questions given an image and information provided by the captioner. The decision maker chooses which words the captioner is uncertain about (if any), and should be queried from the teacher. We assume the teacher, not necessarily an expert, is able to help the agent learn by answering questions and scoring captions. Our agent learns continually, by receiving new images in sequential chunks. For each chunk, the agent iterates between interacting with the teacher, and learning from the collected information.

In the following sections, we describe how the captioner learns over its life time by interacting with the teacher. First we describe the lifelong learning setting, namely how the agent learns from data arriving in a sequence of batches. Next, we provide details of how the agent queries for answers and feedback from the teacher. Finally, we describe the implementation of our agent’s modules.

3.1 Lifelong Learning

We imagine a lifelong learning setting where data arrives in chunks. This is analogous to a student who learns over multiple classes in a semester. The first chunk has complete ground truth (GT), i.e. human written captions. We refer to it as the warmup chunk. The agent learns from the remaining unlabelled chunks with partial supervision from the teacher. We first train the question generator and pretrain the captioner on the warmup chunk. After pretraining, the agent learns in two phases.

In the collection phase, the agent looks at each image in an unlabelled chunk, attempts to caption it, and decides whether to replace words with answers obtained by asking questions. The agent collects the improved captions and uses them to train the captioner in the update phase. In the collection phase, the feedback from the teacher is also used to train the decision maker to make better decisions about whether and what to ask. The two phases are repeated once for each chunk, until all available image data has been exhausted. This process is illustrated in Figure 2, and summarized in Algorithm 1.

Collection phase

In the collection phase, the agent attempts to improve captions generated from its own policy by querying the teacher. For each round, the agent makes multiple passes over a chunk. Given an image, the agent generates a caption, and the decision maker decides whether and when (at which word) to ask the teacher a question. The teacher answers the question, which the agent uses to create a new caption (details in Section 3.3). The teacher scores both new and old captions and the agent stores the captions in a buffer. At the same time, the agent uses the scores from the teacher to make online updates to the decision maker so that it picks better time steps (words) for asking questions (Section 3.4).

The collected captions will be used in the update phase by the agent to distill the teacher’s knowledge back into the captioner. However, the agent could encounter difficult images that cannot be improved by asking questions. Empirically we find the agent cannot improve on images containing objects in unusual settings, or if the caption generated from the captioner’s policy is missing multiple key concepts. Therefore, we allow the agent to “give up” if the improved caption is bad, in which case the teacher writes a new caption. The agent first selects from the buffer the top-scoring captions for each image. Then it keeps a fixed top percentage of images, ranked by the average reward of the captions for that image. For the remaining images, the agent is given GT captions. In practice, we use 2 out of the 5 MSCOCO captions. We define the KeepBestAndGiveUp subroutine in Algorithm 1 as this two-step process.
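The two-step selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the buffer layout, and the parameters `top_m` and `keep_percent` are assumptions introduced here.

```python
from collections import defaultdict

def keep_best_and_give_up(buffer, gt_captions, top_m=2, keep_percent=70.0):
    """buffer: list of (image_id, caption, reward) tuples (assumed layout).
    gt_captions: dict mapping image_id to a list of GT captions."""
    per_image = defaultdict(list)
    for image_id, caption, reward in buffer:
        per_image[image_id].append((reward, caption))

    # Step 1: keep the top-m collected captions per image, ranked by reward.
    best = {}
    for image_id, scored in per_image.items():
        scored.sort(key=lambda rc: rc[0], reverse=True)
        best[image_id] = scored[:top_m]

    # Step 2: rank images by the average reward of their kept captions.
    def avg_reward(image_id):
        kept = best[image_id]
        return sum(r for r, _ in kept) / len(kept)

    ranked = sorted(best, key=avg_reward, reverse=True)
    n_keep = int(round(len(ranked) * keep_percent / 100.0))

    training = {}
    for rank, image_id in enumerate(ranked):
        if rank < n_keep:
            # Trust the collected (teacher-improved) captions.
            training[image_id] = [c for _, c in best[image_id]]
        else:
            # "Give up": fall back to GT captions for this image.
            training[image_id] = gt_captions[image_id][:top_m]
    return training
```

Ranking by average reward means a single lucky caption is not enough for an image to be kept; both kept captions have to score well.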

Update phase

After the collection phase, the agent trains the captioning module on the collected captions (details in Section 3.5). We assume the agent has full access to past data and is retrained from scratch. Future work can look at applying continual learning to learn more efficiently on new data. The training data contains warmup GT captions, collected captions, and GT captions from “giving up”.

3.2 Modules

Let a caption be a sequence of words of a given length, paired with an image. The captioning module computes a probability distribution over the words in a sentence, conditioned on the image. We further compute an array of contexts from the captioner, one per time step (details in Sec 3.6). The context helps the decision maker decide what concepts to ask about, and the question generator to ask relevant questions. The decision maker and the question generator each use their own context. The decision module computes a multinomial distribution giving the probability of each word position in the caption at which the question could be asked. We allow this distribution to index a special <eos> position representing the case where no question should be asked. The question generation module computes the probability distribution over a question. The details about the modules are presented in Sec 3.6.

3.3 Interacting with the Teacher

We now provide details of how the agent interacts with the teacher in the collection phase. Given an image, the captioner produces the complete initial caption and context by a greedy rollout from its own policy. The decision module then decides where to ask by sampling a word position from its distribution. Words other than nouns, verbs, and adjectives are masked out. The question generator produces a question about the word at the chosen position, and the agent receives an answer. The agent then replaces that word in the initial caption with the answer and predicts a new caption by rolling out the rest of the caption from the chosen position, using the previous hidden state of the captioner and the answer. If the teacher’s answer is a rare word for the agent, the agent may diverge from any sensible trajectory. For this reason, we give the agent the option of doing a one-word-replace of the expert’s answer, i.e. substituting the answer for the chosen word and leaving the rest of the initial caption unchanged.

Finally the teacher scores both the original and the two improved captions, giving each a numeric reward. The process can be repeated by asking a second question and replacing another word at a different step. In general, the agent can ask several questions for a single caption, up to a fixed maximum. In practice, we observe that asking a single question works best in our experiments. We keep the general formulation in the following for generality of exposition. The interaction process is summarized in Algorithm 2.
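As a rough illustration of how the two improved captions are formed from one teacher answer, consider the sketch below. The function name and the `rollout_fn` hook are hypothetical; in the real agent the continuation would come from the captioner's RNN state rather than from a plain callable.

```python
def improved_captions(caption, position, answer, rollout_fn):
    """caption: list of words from the greedy rollout.
    position: index of the word the decision maker chose to ask about.
    answer: the teacher's answer (a single word).
    rollout_fn(prefix): hypothetical hook returning the words that the
    captioner would generate after the given prefix."""
    prefix = caption[:position] + [answer]

    # Variant 1: re-generate the rest of the caption after the answer.
    rolled = prefix + rollout_fn(prefix)

    # Variant 2: one-word replace, keeping the rest of the caption as is.
    replaced = prefix + caption[position + 1:]
    return rolled, replaced
```

The one-word-replace variant is the safer of the two when the answer is a rare word, since it does not depend on the captioner continuing sensibly from an unfamiliar token.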

3.4 Learning When to Ask Questions

As the agent queries the teacher in the collection phase, it trains the decision maker online to ask better questions. The teacher provides a scalar, non-differentiable reward. Hence the decision maker is updated using REINFORCE [26]. We baseline the reward with the greedy decision reward (that is, the reward the improved caption would have received had the decision maker chosen its position greedily), following the self-critical policy gradient [23]. See line 11 in Algorithm 1. In the general case where several questions are asked, the gradient for the parameters of the decision maker accumulates this baselined policy-gradient term over all questions asked.

In this work we did not update the question generator during lifelong learning, because jointly training the decision maker and question generator is a hierarchical RL problem. Credit assignment is challenging because the agent needs to learn to distinguish the DM choosing a bad time step from the DM choosing a good time step but the question generator generating a bad question.
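The self-critical REINFORCE update for the decision maker can be sketched for a single question as follows. This is a minimal sketch under stated assumptions: the decision maker is reduced to a softmax over per-position logits, and the function names are introduced here for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_critical_grad(logits, sampled_pos, reward_sampled, reward_greedy):
    """Gradient (ascent direction) of the expected reward with respect to
    the logits, estimated from one sampled position.

    For a softmax policy, d log pi(t) / d logit_i = 1[i == t] - pi(i).
    The greedy decision's reward acts as the baseline."""
    probs = softmax(logits)
    advantage = reward_sampled - reward_greedy
    return [advantage * ((1.0 if i == sampled_pos else 0.0) - p)
            for i, p in enumerate(probs)]
```

When the sampled position earns more reward than the greedy one, its logit is pushed up and the others down; when both earn the same reward the update vanishes, which is what keeps the variance of the estimator low.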

Algorithm 1 Lifelong learning
1: procedure LIFELONG
2:     train the captioner, question generator and QA-bot
3:     initialize the decision maker (DM)
6:     for each unlabelled chunk do                      ▷ begin lifelong learning
7:         collection phase:
8:         for each pass over the chunk do
9:             for each image in the chunk do
12:                collect captions and rewards
13:                update the decision maker
15:        train the captioner on the collected data     ▷ update phase

Algorithm 2 Interacting with the teacher
1: procedure SeekTeacher(I, greedy=False)
2:     compute the caption and context
4:     for each question to be asked do
5:         DM samples a step
6:         generate a question
7:         teacher provides an answer
8:         roll out a new caption
10:        teacher scores the caption
13:    return

3.5 Learning to Caption from Teacher’s Feedback

The captioning module is retrained in the update phase using a joint loss over the collected and GT captions in the stored buffer. In this loss, each collected caption is weighted by the score the teacher gave it, and each GT caption is weighted by a tuned hyperparameter. In practice, we set this hyperparameter to a percentile of the rewards of the collected captions, assuming that ground truth captions are generally better than collected captions.
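One plausible reading of this joint loss is a reward-weighted negative log-likelihood, sketched below. The exact form of the loss and the percentile used are not recoverable from the text here, so the function names, the argument layout, and the nearest-rank percentile are all assumptions made for illustration.

```python
import math

def joint_loss(collected, ground_truth, gt_weight):
    """collected: list of (log_prob, teacher_score) pairs, one per collected
    caption, where log_prob is the captioner's log-likelihood of the caption.
    ground_truth: list of log_prob values, one per GT caption.
    gt_weight: the tuned weight applied to GT captions."""
    loss_collected = -sum(score * log_prob for log_prob, score in collected)
    loss_gt = -sum(gt_weight * log_prob for log_prob in ground_truth)
    return loss_collected + loss_gt

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Setting `gt_weight` from a high percentile of the collected rewards would make a GT caption count at least as much as most collected captions, in line with the stated assumption that GT captions are generally better.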

3.6 Implementation Details

Captioning module.

The captioner is implemented as an attention CNN-RNN model [32]. We additionally predict a part-of-speech (POS) tag at each time step to inform the question generator what type of question should be asked and the decision maker whether to ask. The captioner is trained using MLE with teacher forcing and scheduled sampling.

Question generation module.

The question generator is also implemented as a CNN-RNN and conditions on the context at the chosen time step. Specifically, this context consists of: the POS distribution, which determines the “question type”; the attention weights predicted by the captioner, which guide where the question generator looks; an encoding of the caption, which provides global context and prevents asking for redundant concepts; and the position encoding for the chosen time step. We found it helpful to allow the question generator to re-attend rather than fully rely on the captioner’s attention. We train the question generator on a novel dataset, using MLE with teacher forcing and scheduled sampling similar to the captioner (details in Appendix).

Decision module.

The decision maker is implemented as a multilayer perceptron (MLP) with Softmax output. Its context consists of the POS distribution, an encoding of the caption, and uncertainty metrics computed from top-k words predicted by the captioner:


  • Cosine similarity between the embedding of the top-1 word and all other words.

  • Cosine similarity between each top-k word and the embedding of the entire sentence (implemented as the sum of word embeddings).

  • Minimum distance of each top-k word to another word.

Entropy is a natural way to measure the uncertainty of the captioner. However, the model can predict synonyms, which increase the entropy but do not suggest that the model is uncertain. Therefore, for each time step we take the word embeddings of the top-k words and compute their relative distances as a secondary measure of uncertainty. We use a fixed value of k. In ablation studies, we show that these statistics alone can capture the uncertainty of the captioner. Training a neural network on these statistics further improves performance.
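The uncertainty statistics listed above can be computed from the top-k word probabilities and embeddings at one time step. The sketch below is illustrative only: the function names are assumptions, embeddings are plain lists of floats, and the minimum distance is taken among the top-k words themselves.

```python
import math

def entropy(probs):
    """Entropy of the top-k probabilities after renormalizing them."""
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top1_similarities(embeddings):
    """Cosine similarity between the top-1 word and every other top-k word."""
    return [cosine(embeddings[0], e) for e in embeddings[1:]]

def min_distances(embeddings):
    """Euclidean distance from each top-k word to its nearest other word."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return [min(dist(e, other) for j, other in enumerate(embeddings) if j != i)
            for i, e in enumerate(embeddings)]
```

High entropy together with high similarity among the top-k embeddings suggests the captioner is choosing between synonyms, whereas high entropy with low similarity points to genuine uncertainty about the concept.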

Teacher module.

We imagine our agent in a human-in-the-loop setting where a teacher answers natural language questions, chooses the best caption out of a few alternatives, scores it, and writes GT captions if necessary. For efficiency, we use a synthetic teacher. It consists of two parts: a VQA bot implemented following [27], and a caption scorer composed of a linear combination of BLEU [21], ROUGE [14], METEOR [2], and CIDEr [29]. The scorer also takes additional supporting evidence as input, for which we use captions from MSCOCO. We call the reward from the caption scorer the Mix score. We discuss challenges to using a synthetic teacher in Sections 4.3 and 4.6.

4 Experiments

Method                Collect %   GT %     Supervision %   Mix     CIDEr   METEOR   ROUGE   BLEU4   BLEU2
Equal GT              -           45.2 %   45.2 %          98.9    91.5    24.7     52.3    28.0    53.4
All GT                -           100 %    100 %           101.7   96.4    25.1     52.9    28.8    54.9
Inquisitive Student   70%         45.2 %   73.5 %          103.9   98.0    25.4     53.8    30.5    57.1
Mute Student          70%         45.2 %   72.6 %          102.2   95.9    25.2     53.4    29.3    55.9

Table 1: Evaluation on test (best of 3 runs). Our model was trained with 10% warmup and 4 unlabelled chunks. Methods see all images at least once for fairness. Note: 100% GT corresponds to 46% of the MSCOCO training captions because only 2 (out of 5) captions are used for each image in the lifelong chunks.
Figure 3: Caption quality on test. Both models are decoded greedily. Refer to 4.2 for how human supervision is calculated. For each plot, supervision is varied by changing the percentage of captions collected by the agent. % GT captions is reported relative to All GT.

We evaluate our approach on the challenging MSCOCO dataset [15], and compare it to intelligent baselines. We perform detailed ablation studies that verify our choices and give insight into how our model behaves.

We follow the standard Karpathy split [9] that contains 117,843 training, 5K validation and 5K test images. We randomly split the training set into warmup and lifelong learning chunks. In our experiments, we vary the size of the warmup, and the number of lifelong chunks, to analyze the model behavior under different regimes. There are 5 GT captions for each image in the warmup set. At the end of lifelong learning, there are 2 collected or GT captions for each image in the lifelong set.

Image features are extracted with ResNet-101 trained on ImageNet [4] [7]. Vocabulary sizes for the captioner, question generator and VQA are 11253, 9755 and 3003, respectively. We use the Stanford NLP parser to get GT POS labels [17]. The decision maker only considers a subset of tags (listed in Appendix) for asking questions.

4.1 Training Details

The synthetic teacher (VQA bot) was trained on the VQA2.0 dataset [1], following a simplified implementation of [27] using a multi-answer binary cross entropy loss function. The VQA model achieves 64.2% on the VQA2.0 val split without ensembling. We train the question generator by combining data from MSCOCO and VQA2.0 (implementation details in Appendix). A natural concern is that training the question generator on images the captioner sees during lifelong learning will cause the question generator to “look up” GT questions. We find this not to be the case (see Figure 8). In general, the questions generated for an image are diverse, generic and rarely match GT questions (see Appendix for more examples).

4.2 Cost of Human Supervision

We first perform a human study to understand the human cost associated with every type of interaction with the agent. We choose to measure “human effort” by the time taken for a task. For our experiment, a human teacher has three possible tasks: produce a full caption, answer a question, and score a caption. Table 4 shows that on average it takes 5.2 times longer to write a caption than to score one, and 4.6 times longer than to answer a question. To compute the cost of human supervision, we normalize the cost of each task to caption scoring. Hence the agent incurs one point of supervision for each caption scored, 1.13 for each question answered, and 5.2 for each full caption written. In practice, we assume no cost when the VQA module answers a question. A human teacher would charge the agent for answers but would also give better answers. In the experiments to follow, we use Human Supervision as a metric for the cost incurred by querying a human.
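The supervision cost described above is a weighted count of teacher interactions. A minimal sketch, using the per-task costs stated in the text (the function and constant names are assumptions introduced here):

```python
COST_SCORE = 1.0     # one caption scored (the unit of cost)
COST_ANSWER = 1.13   # one question answered by a human
COST_CAPTION = 5.2   # one full caption written

def human_supervision(n_scored, n_answered, n_written, answer_cost=COST_ANSWER):
    """Total supervision points. Pass answer_cost=0.0 to model the setting
    where a VQA bot, rather than a human, answers the questions."""
    return (COST_SCORE * n_scored
            + answer_cost * n_answered
            + COST_CAPTION * n_written)
```

For example, scoring 10 captions and writing 2 costs 10 + 2 × 5.2 = 20.4 points, and each question answered by a human adds a further 1.13.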

4.3 Learning by Asking Questions

Figure 4: T5C: top-5 words predicted by captioner at the word when question is asked. Rewards are in square brackets. Colors in OC indicate probability of the decision maker to ask about a word (scale is on right). Left are positive examples, right is failed (pointing to weaknesses of auto-eval metric).

In Table 1 we evaluate our lifelong learner, aka “inquisitive student” (IS), against training only on GT data on the test split. All results are reported using greedy decoding. Our model was trained with a 10% warmup chunk, 3 unlabelled chunks and varying collect percentage. For each setting we report the best model out of three with different random seeds. We report two GT baselines: Equal GT – the same number of GT captions as our model but fewer total captions, and All GT – the same number of captions as our model but only GT captions.

In order to evaluate the benefits of asking questions, we also introduce a lifelong learner, “mute student” (MS), that has the ability to ask for caption scores but not for answers to questions. MS is trained in exactly the same lifelong setting as ours but samples from its own word distribution to explore new captions rather than asking questions. Specifically, MS makes several educated guesses for the caption by sampling from the captioning module. All models have the same hyperparameters and captioning architecture and are trained on all images to ensure fairness. GT and Supervision % are reported relative to All GT.

Compared to Equal GT, our lifelong model scores 5.0 Mix and 6.5 CIDEr higher, which shows that for an agent with a fixed budget of GT captions, additionally learning from collected captions can significantly improve performance. Compared to All GT, our model scores 2.2 Mix and 1.6 CIDEr higher while using only 45.2% of GT captions and 73.5% of human supervision. This means that training on teacher-improved captions not only achieves greater efficiency but also leads to higher performance than training on GT captions. We find this to be a particularly strong and interesting result.

IS also beats MS, which demonstrates that question-asking is beneficial. This is investigated further in Fig. 3. We vary the amount of GT captions by adjusting the percentage of collected captions. We call an agent that trusts its teacher-improved captions often (and rarely gives up) a “confident” learner. Confident learners use less human supervision. An agent that begins lifelong learning earlier with only a small warmup set is an “eager” learner.

IS outperforms MS in almost all settings but the difference is greater if the agents are eager. Fig. 3 shows that at 10% warmup the gap is 1.4 CIDEr (97 vs 95.6) but as we reduce to 1% warmup, the gap becomes 12.7 CIDEr (77 vs 64.3). This supports the intuition that asking questions benefits learners with less experience. In addition, a more eager learner ultimately reaches lower performance for the same amount of supervision. For about 30% supervision IS achieves 93.9 CIDEr in the 10% warmup setting and 83.5 CIDEr in the 1% warmup setting. We hypothesize this is because the quality of sentence continuations, or rollouts after receiving the teacher’s answer, worsens if the agent pretrains on less data. Furthermore, a very eager learner may make too many mistakes to fix by asking only one question.

Selected examples are shown in Fig 4. The first four examples are positive and show asking questions helps fix incorrect words and retrieve novel concepts. In the fifth example, the reward is lower for the new caption even though it is good according to human judgment. Auto-eval metrics do not reward the agent for relevant, novel captions that don’t match words in the reference captions. A human teacher with more flexible scoring could encourage the agent to learn more diverse captions and a larger vocabulary.

4.4 Learning New Concepts

Figure 5: Num. of unique words used by captioner evaluated on val at the end of lifelong learning. Models trained with 10% warmup and 3 chunks.
Figure 6: Distribution of teacher answer types over rounds. The model was trained using 10% warmup, and 3 chunks.
Figure 7: Performance on val vs the number of total chunks (plus the warmup). Models were trained using 10% warmup.
Figure 8: Questions generated from different words in the generated caption (colors match words to questions). Highlighted questions retrieve answers that are novel to the caption. Left 2 images are seen by question gen. during training (GTQ are GT questions used for training), right 2 are not. Generated questions tend to be diverse and different from GT ones.
Round   ATop3   ATop5   ATop10
1       17.7    26.3    37.4
2       24.1    34.2    46.9
3       27.4    38.3    50.7

Table 2: Frequency (in %) of teacher answers that occur in captioning module’s predictions during lifelong training. Calculated from agent’s collected captions in each round.

Model    Nouns   Verbs   Adj.
IS       527     97      53
MS       491     86      48
All GT   680     127     47

Table 3: Number of unique words used by each model on val. Lifelong learners are trained with 10% warmup and 3 chunks.

Task         Avg. time (s)   Std. (s)   Time ratio
Captioning   34.4            21.8       1.0
Scoring      6.6             2.2        5.2
Answering    7.6             3.7        4.6

Table 4: Time taken by humans to perform tasks: captioning, scoring a caption, answering a question. Time ratio is relative to captioning.

One way to measure the usefulness of teacher answers is to compute how often the teacher is repeating a concept the captioner already knows. Table 2 shows how frequently the answer from the teacher appears in the top-k words predicted by the captioner at the time step where the question is asked (ATopk). Note that this is approximate because the captioner may predict the answer at a different step. In the first round of lifelong training, 26.3% of teacher answers appeared in the top-5 words predicted by the captioner. Hence, 73.7% of the time, the agent is learning unfamiliar or novel concepts. Over the lifetime, ATopk increases as the student’s knowledge catches up to that of the teacher.
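The ATopk statistic can be computed as below. This is a small sketch with assumed names and data layout: each entry of `predictions` is the captioner's ranked word list at the step where the corresponding question was asked.

```python
def a_top_k(answers, predictions, k):
    """Percentage of teacher answers found among the captioner's top-k words.

    answers: list of teacher answers (one word each).
    predictions: list of ranked word lists, most probable word first,
    aligned with answers."""
    hits = sum(1 for answer, ranked in zip(answers, predictions)
               if answer in ranked[:k])
    return 100.0 * hits / len(answers)
```

The complement, 100 minus ATopk, is then the share of answers that are unfamiliar to the captioner at that step.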

Fig. 5 shows the number of unique words used by a captioner evaluated on the val split at the end of lifelong learning. We found a dependency between training epochs and vocabulary size and therefore took all models at the same epoch. We baseline against the mute student. IS has a more diverse vocabulary than MS at all % GT, as it uses more unique nouns, verbs and total words than MS.

In Table 3 we compare the vocabulary of lifelong learners to All GT. All GT has a larger vocabulary than lifelong learners. This is intuitive because All GT has more GT captions and therefore sees more varied data. IS only receives a single word answer given an image, whereas All GT receives a complete caption label containing on average 10.5 words. For the same reason, in Fig. 5 the agents’ vocabulary decreases as % GT decreases.

Method Mix C B4 No questions 86.4 74.1 22.1 Random 88.3 76.2 22.2 Entropy 88.9 76.5 22.4 Unc. metrics 89.6 77.5 22.5 Unc. metrics learned 90.8 79.3 23.2 Full learned 91.9 80.6 23.7
Figure 9: Ablating the decision maker. Entropy picks the time step with the highest top-k word entropy. Unc. metrics includes entropy and word closeness (Sec. 3.6). Unc. metrics learned adds an MLP to predict the best time step for asking. Full learned additionally includes POS and an encoding of the caption as input.
Figure 10: Changes to collected captions over rounds. Model trained with 10% warmup, , 3 chunks.
Figure 11: AMT study to judge the quality of the generated questions. Given an image and a question, annotators were asked to answer the question if it is good, or flag it as “not understandable” or “not relevant”. Generally the questions were good.

4.5 Analyzing the Modules

Question Generator.

We conducted a human study (Fig. 11) using Amazon Mechanical Turk (AMT) to evaluate the quality of generated questions. Annotators rated 500 image-question pairs by answering questions if they were good or flagging questions as “not understandable” or “irrelevant to the image”. The questions were randomly selected from those that the question generator asked while trying to caption. The images were not seen by the question generator during its training. 82.4% of questions were rated “good” and answered. This is a promising result and suggests that learning by asking can be adapted to use human teachers instead of a QA bot.

Fig. 8 shows generated questions at different time steps in a caption. In general, generated questions tend to be diverse and generic. It is important for questions to be generic so that the teacher can answer with a wide range of concepts, including new ones. We also rarely observe the generated questions to be the same as the GT questions. More examples are given in the Appendix.

Decision Maker.

To test the decision maker, we look directly at the scores of the refined captions it produces, rather than those of the final captions after retraining the captioner. This lets us precisely observe the ablated performance of the DM. Table 11 evaluates different decision maker strategies. We first train captioning and question generation modules. The baseline is the performance of the captioner without asking questions. The other settings use various decision maker models to ask a question to improve captions. Learned models are trained using RL on a single chunk of unlabelled data. Scores are shown for the val split.

The full model gives 6.5 CIDEr improvement over no question asking. Picking the time step with maximum entropy is not a very good strategy. It is only 0.3 CIDEr better than picking a random step. This is because the model can predict synonyms which increase the entropy but do not indicate the model is uncertain. Adding closeness metrics yields 1.0 CIDEr improvement over maximum entropy, showing that taking into account the closeness of words in embedding space gives a better measure of uncertainty. In all cases, learning improves performance, with the best learned model achieving 3.1 CIDEr higher than the best non-learned model. We use the full model as our decision maker for all experiments.
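To make the maximum-entropy baseline concrete, here is a minimal sketch. It assumes the entropy is computed over the top-k probabilities after renormalizing them to sum to one; the text does not specify this detail, and the function names and toy probabilities are illustrative.

```python
import math


def topk_entropy(probs):
    """Shannon entropy (in nats) of top-k probabilities, renormalized to sum to 1."""
    total = sum(probs)
    normed = [p / total for p in probs]
    return -sum(p * math.log(p) for p in normed if p > 0)


def max_entropy_step(topk_probs_per_step):
    """Index of the time step whose top-k distribution has the highest entropy."""
    entropies = [topk_entropy(p) for p in topk_probs_per_step]
    return max(range(len(entropies)), key=entropies.__getitem__)


# Step 0: one clearly preferred word. Step 1: mass spread over near-synonyms
# (e.g. "couch", "sofa", "settee"), which raises entropy without real doubt.
confident = [0.85, 0.05, 0.05, 0.05]
synonyms = [0.30, 0.30, 0.30, 0.10]
chosen = max_entropy_step([confident, synonyms])
```

In this toy case the baseline selects the synonym step, which is the behaviour that motivates adding closeness metrics.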

4.6 Understanding the Model

Number of chunks.

Fig. 7 shows that as the number of chunks increases, performance increases (for similar human supervision). This is intuitive because more chunks means the agent sees fewer images before adapting the captioner. The number of chunks cannot be too large because we retrain the captioner from scratch after every chunk.

Catching up to the teacher.

As suggested in Sec. 4.4 we find that asking questions becomes less useful as the agent consumes more chunks. Fig. 11 shows the percent of collected captions that were improved by asking questions (left axis) and the average reward of collected captions (right axis) versus the number of consumed chunks. Over time, the agent is able to improve fewer and fewer captions by querying the teacher. Furthermore, the largest increase in collected reward occurs in the first round. Together these observations suggest that the teacher’s knowledge is exhausted over time. This is a limitation of using a static, synthetic, and noisy QA-bot (which only achieves 64% accuracy). Learning may benefit from human teachers over more rounds, because they are more accurate and have a much wider pool of knowledge.

Types of answers.

In Fig. 7 we see the distribution of answer types from the teacher. Over time, the student asks for more nouns, and fewer verbs and adjectives. We hypothesize this is because the agent learns verbs and adjectives early on before moving on to nouns.

5 Conclusion

In this paper, we addressed the problem of active learning for the task of image captioning. In particular, we allow the agent to ask about a particular concept related to the image that it is uncertain about, rather than requiring the full caption from the teacher. Our model is composed of three modules, i.e. captioning, decision making and question posing, which interact with each other in a lifelong learning setting. We show that this design improves learning and teaching efficiency on the challenging MS-COCO dataset.

Our work is the first step towards a more natural learning setting in which data arrives continuously, and robots learn from humans through natural language questions and feedback. There are many challenges ahead in making the lifelong model learning more efficient, and incorporating real humans in the loop.


Supported by the DARPA Explainable AI (XAI) program. We thank NVIDIA for their donation of GPUs. We thank Relu Patrascu for infrastructure support, David Acuna, Seung Wook Kim, Makarand Tapaswi, Yuan-Hong Liao, for fruitful discussion and Atef Chaudhury, Harris Chan, Silviu Pitis for their helpful feedback in editing the paper.


6 Supplementary Material

This supplementary contains details of the modules in our model, the training procedure, as well as additional qualitative examples. In Section 6.1 we discuss implementation details of the captioner, question generator, decision maker and VQA teacher. Furthermore, we describe the inquisitive student and mute student in the lifelong setting. In Section 6.2 we discuss the challenges with asking questions. In Section 6.3 we provide more detail on how human supervision is calculated for our experiments. In Section 6.4 we show an ablation study on the question generator. Our study highlights the importance of each feature in the context used by the question generator. In Section 6.5 we show more qualitative examples and describe the failure modes of our model.

6.1 Implementation Details

6.1.1 Lifelong Learning

In lifelong learning, data arrives in chunks. In the collection phase, the agent attempts to improve generated captions by querying the teacher. In the update phase, the captioning module is updated on collected captions.
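The alternation of collection and update phases can be sketched as a simple loop. This is an outline under stated assumptions: `train_captioner` and `collect_caption` are hypothetical callables standing in for the real modules, and the sketch assumes the captioner is retrained from scratch on the warmup data plus everything collected so far (Sec. 4.6 states only that retraining is from scratch after every chunk).

```python
def lifelong_learning(warmup_data, chunks, train_captioner, collect_caption):
    """Alternate collection and update phases over a stream of chunks.

    warmup_data: list of (image, caption) pairs with ground-truth captions.
    chunks: list of lists of unlabelled images.
    train_captioner(data) -> captioner            (hypothetical)
    collect_caption(captioner, image) -> caption  (hypothetical; may query a teacher)
    """
    collected = list(warmup_data)
    captioner = train_captioner(collected)  # warmup training
    for chunk in chunks:
        # Collection phase: improve generated captions by querying the teacher.
        collected.extend((img, collect_caption(captioner, img)) for img in chunk)
        # Update phase: retrain the captioner on everything collected so far.
        captioner = train_captioner(collected)
    return captioner, collected


# Toy stand-ins: the "model" is just the number of training pairs it saw.
toy_train = lambda data: len(data)
toy_collect = lambda model, img: f"{img}@{model}"
model, data = lifelong_learning([("w0", "gt")], [["a", "b"], ["c"]],
                                toy_train, toy_collect)
```

The toy stand-ins only demonstrate the control flow; each collected caption records which version of the captioner produced it.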

In our experiments, we vary the collection percentage and the size of the warmup chunk. Note: the size of the warmup chunk is reported relative to the entire training split whereas % GT (reported in tables and figures) is relative to the total number of captions the baseline All GT is trained on. For example 10% warmup refers to a dataset with 11.3K/113K images and 57K/567K captions. We explored the following settings.

  • collection percentage: 60, 70, 80, 90, 100%

  • warmup: 1, 3, 5, 10%

In the update phase, we train the captioner with ADAM [11], lr=, batchsize=, scheduled sampling, and learning rate (lr) decay. Scheduled sampling and learning rate decay are described in 6.1.2. We now outline details of inquisitive student (IS) and mute student (MS) in the collection phase.

Inquisitive student

The inquisitive student samples captions and questions greedily. We found it helpful to keep the captioner and question generator in training mode (rather than eval mode) so that dropout is still applied. This introduces a small amount of stochasticity and the agent generates more varied captions. IS makes 8 passes over each image in a chunk. However, because the captioner and question generator are sampled greedily, later rounds only produce a few novel captions. We found that 4 passes worked almost as well; more passes produce diminishing returns. We train the decision maker online using policy gradient. We use ADAM, lr= and batchsize=

Mute student

Mute student has the same captioning architecture and hyperparameters as IS. There are some differences in the collection phase. Instead of asking questions, MS samples from the captioning module to explore new captions. Specifically, MS samples captions with temperature 1.0. MS makes 4 passes over each image in a chunk. This is to ensure that IS and MS use a similar amount of human supervision.

6.1.2 Captioning Module

The captioning module is implemented as an attention encoder-decoder. It predicts both the next word and the next POS given the previous word. In our implementation, the CNN encoder is fixed. However, we project image features using a fully connected (FC) layer before passing it to the decoder. The decoder is implemented as a single layer GRU with 512 units. We use dropout=0.5 in all layers.

POS prediction

The hidden state of the GRU is used to compute the next word and POS. More specifically, we first predict the POS distribution then condition the next word on either the predicted POS or ground truth POS. Scheduled sampling is used to control how often the predicted or GT POS is used. Words are embedded into 512 dimensional latent space and POS are embedded into 50 dimensional latent space. If the predicted POS is used to predict the next word, we embed the entire POS distribution and concatenate this embedding with the decoder hidden state. The resulting vector is passed into a FC layer to predict the next word. If the GT POS is used, we embed the one-hot vector and similarly predict the next word. The captioner is trained using a joint loss over next word and POS.


We tune and find to work the best. We limit the length of captions to 16 plus the end-of-sentence symbol.

Getting ground truth POS

We use the Stanford NLP parser to get ground truth POS for both GT and collected captions. On rare occasions, the Stanford NLP parser returns errors when parsing generated captions. In these cases, the agent collects the GT caption for the image rather than the generated one.


Training is the same for both the warmup chunk and update phase in lifelong learning. Specifically, we train the captioner using MLE with teacher forcing, scheduled sampling (on both the words and POS) and learning rate decay. We start the learning rate at and decay it by 0.8 every 3 epochs. The probability of predicting the next word using the previous predicted word (rather than the GT word) starts at 0 and increases by 0.05 every 5 epochs. The probability of predicting the next word using the previous predicted POS (rather than the GT POS) starts at 0.2 and increases by 0.05 every 3 epochs.
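The schedules described above can be written as small step functions. This sketch makes several assumptions: epochs are counted from 0, "decay by 0.8" is read as multiplying the learning rate by 0.8, and sampling probabilities are capped at 1.0. The initial learning rate is left as a parameter (`base_lr`) rather than guessed.

```python
def lr_at_epoch(epoch, base_lr, factor=0.8, every=3):
    """Learning rate after step decay: multiplied by `factor` every `every` epochs."""
    return base_lr * factor ** (epoch // every)


def sampling_prob(epoch, start, step, every, cap=1.0):
    """Scheduled-sampling probability: grows by `step` every `every` epochs."""
    return min(cap, start + step * (epoch // every))


def word_sampling_prob(epoch):
    # Probability of feeding the previously predicted word instead of the GT word.
    return sampling_prob(epoch, start=0.0, step=0.05, every=5)


def pos_sampling_prob(epoch):
    # Probability of conditioning on the predicted POS instead of the GT POS.
    return sampling_prob(epoch, start=0.2, step=0.05, every=3)


# Print the three schedules for a few epochs, using a placeholder base_lr of 1.0.
for epoch in (0, 3, 5, 10):
    print(epoch, lr_at_epoch(epoch, base_lr=1.0),
          word_sampling_prob(epoch), pos_sampling_prob(epoch))
```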

6.1.3 Question Generating Module

The question generator generates a question given a context vector computed by the captioner. More specifically, the context consists of the full caption and the POS and attention weights (of the captioner) at a particular time step. The time step is determined by the decision maker.

Encoding the caption

The caption is used by the question generator in two ways. First, a pretrained (and fixed) captioning module is used to encode the caption. The hidden state of the captioner GRU (aka cap-hid) is used as a feature. Second, the question generator encodes the caption with its own single layer Bidirectional-GRU (BiGRU). The BiGRU has 256 units. Caption words are represented as 256 dimensional latent vectors. Finally, the time step computed by the decision maker is encoded as a 256 dimensional vector of 1’s. It is fed alongside the caption word embeddings into the BiGRU.


The question generator’s decoder is also a single layer GRU with 512 units. We embed the POS and use it along with cap-hid to initialize the decoder hidden state. We use the entire POS distribution rather than the max POS. The BiGRU encoded caption is passed into the decoder along with image features at every time step. We limit the length of questions to 14. Question words are embedded into 512 dimensional latent space.


We allow the question generator to re-attend to the image. Specifically, the question generator first computes its own attention weights independent of the captioner’s weights. The attended image features are then concatenated with features computed from the captioner’s attention. Finally a FC layer is used to compute the final image features.


We train the question generator with MLE, teacher forcing, learning rate decay and scheduled sampling. We use a batch size of 20 and the same schedules as for training the captioner: lr decay 0.8 every 3 epochs, scheduled sampling increase 0.05 every 5 epochs.

6.1.4 Dataset for Training Question Generator

We combine the MSCOCO and VQA2.0 datasets to train the question generator. The two datasets share images. Therefore, we can form training samples by matching answers from QA pairs of VQA2.0 to words in the MSCOCO captions. A training sample is a (caption, answer, question) tuple. We pass the caption through a pretrained captioner to compute the context vector. The “time step” is chosen to be the index of the word in the caption that matches the QA answer. The question generator is trained to predict the GT question given the context. Doing this gives us 135K samples for training and 6K for validation. We call this the “answer-matched” set. We make a second “pos-matched” dataset to increase the diversity of questions by taking the answer from QA and instead matching its POS to a random word in MSCOCO captions with the same POS. The pos-matched dataset contains 108K samples. When we train the question generator, we sample from both the answer-matched and pos-matched datasets (equally) in every minibatch.

To make the VQA vocabulary match the captioning vocabulary better, we convert numbers from digits to words (e.g. 7 → “seven”). For every image in the answer-matched dataset, we allowed at most 2 questions with the same answer. This is to prevent the model from overfitting and asking questions about only a single concept in an image.
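The construction of the answer-matched set can be sketched as follows. It is a simplified illustration with assumptions: captions are whitespace-tokenized, only single-word answers are matched, the digit-to-word table covers 0 to 10, and the cap of 2 questions per answer is applied per caption. All function and field names are hypothetical.

```python
NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
    "6": "six", "7": "seven", "8": "eight", "9": "nine", "10": "ten",
}


def normalize_answer(answer):
    """Lowercase the answer and convert a digit string to its word form."""
    answer = answer.lower().strip()
    return NUMBER_WORDS.get(answer, answer)


def answer_matched_samples(caption, qa_pairs, max_per_answer=2):
    """Build training samples by matching QA answers to words in the caption."""
    tokens = caption.lower().split()
    used = {}
    samples = []
    for question, answer in qa_pairs:
        word = normalize_answer(answer)
        if word not in tokens or used.get(word, 0) >= max_per_answer:
            continue
        used[word] = used.get(word, 0) + 1
        samples.append({
            "caption": caption,
            "time_step": tokens.index(word),  # index of the matching caption word
            "answer": word,
            "question": question,
        })
    return samples


qa = [
    ("how many birds are there?", "7"),
    ("what are the birds sitting on?", "wire"),
    ("what color is the sky?", "blue"),       # answer not in the caption
    ("how many animals are visible?", "7"),
    ("what number of birds is shown?", "7"),  # third question for "seven": dropped
]
samples = answer_matched_samples("seven birds sit on a wire", qa)
```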

6.1.5 Decision Making Module

The decision maker predicts which word in the generated caption the agent should ask about. It does so given a context vector from the captioner. The context vector consists of: the full caption, POS, attended image features, and the top-k words. The attended image features are computed by weighting the output of the CNN encoder by the captioner’s attention weights. The top-k words are the top-k words predicted by the captioner at every time step. They are used to compute closeness metrics which capture the captioner’s uncertainty. We use .

Encoding the caption

The decision maker encodes the full caption the same way as the question generator. First we pass the caption through a pretrained captioner. The hidden state of the GRU is used as a feature. Second, we encode the caption with a BiGRU with 256 units.

Masking out invalid POS

We mask out invalid POS (corresponding to words the agent can never ask questions about). We do this by computing the maximum of the POS distribution at each time step and comparing it to a predefined list of valid POS. See table 5 for the full list.
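A minimal sketch of this masking step is shown below. The valid tag set is copied from Table 5; representing each POS distribution as a dictionary from tag to probability is an assumption made for readability.

```python
VALID_POS = {
    "NN", "NNS", "NNP", "NNPS",
    "VB", "VBG", "VBD", "VBN", "VBP", "VBZ",
    "JJ", "JJS", "JJR",
    "RB", "RBS", "RBR",
    "RP", "CD",
}


def valid_step_mask(pos_distributions):
    """Return one boolean per time step: True if the most likely POS is askable.

    pos_distributions: list of dicts mapping POS tag -> probability.
    """
    mask = []
    for dist in pos_distributions:
        best_tag = max(dist, key=dist.get)  # argmax of the POS distribution
        mask.append(best_tag in VALID_POS)
    return mask


steps = [
    {"DT": 0.9, "NN": 0.1},  # determiner: cannot be asked about
    {"NN": 0.8, "JJ": 0.2},  # noun: valid
    {"IN": 0.6, "RB": 0.4},  # preposition: cannot be asked about
]
mask = valid_step_mask(steps)
```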

Closeness metrics

The top-k words predicted by the captioner are used to compute word closeness metrics. First words are embedded using the embedding layer from the captioner. Then, we compute the following features.

  • Cosine similarity between the embedding of the top-1 word and all other words.

  • Cosine similarity between each top-k word and the embedding of the entire sentence (implemented as the sum of word embeddings).

  • Minimum distance of each top-k word to another word.

The result is a tensor where is the batch size, the number of time steps, the number of channels/features. and . We combine the features along the axis using a 1D CNN. The final feature vector is a tensor.
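The three closeness features listed above can be sketched for a single time step as follows. The assumptions are: the minimum distance is Euclidean and taken to the other top-k words, embeddings are plain lists of floats, and at least two top-k words are given. The helper names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom > 0 else 0.0


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closeness_features(topk_embs, caption_embs):
    """Closeness features for one time step (requires k >= 2).

    topk_embs: embeddings of the top-k predicted words, best first.
    caption_embs: embeddings of all words in the caption.
    """
    top1 = topk_embs[0]
    # 1) Cosine similarity between the top-1 word and every other top-k word.
    sim_to_top1 = [cosine(top1, e) for e in topk_embs[1:]]
    # 2) Cosine similarity between each top-k word and the sentence embedding,
    #    where the sentence embedding is the sum of the caption word embeddings.
    sentence = [sum(e[d] for e in caption_embs) for d in range(len(top1))]
    sim_to_sentence = [cosine(e, sentence) for e in topk_embs]
    # 3) Minimum distance from each top-k word to another top-k word.
    min_dist = [
        min(euclidean(e, o) for j, o in enumerate(topk_embs) if j != i)
        for i, e in enumerate(topk_embs)
    ]
    return sim_to_top1 + sim_to_sentence + min_dist


feats = closeness_features(
    topk_embs=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    caption_embs=[[1.0, 0.0], [1.0, 0.0]],
)
```

With k top words this gives (k - 1) + k + k values per time step; stacking them over time steps gives the per-step feature channels that a 1D CNN can then combine.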

Computing the probability of asking questions

We embed the POS distribution with an embedding layer. We project the image features with a FC layer. Finally we pass the POS, image, caption and closeness features through a MLP to compute logits. We apply a softmax across time to find the probability of asking a question at each time step.
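The softmax over time steps can be sketched as below. Combining it with the invalid-POS mask by giving masked steps zero probability is an assumption about how the two pieces fit together; the function name is illustrative.

```python
import math


def ask_probabilities(logits, valid_mask):
    """Softmax over time steps, with invalid steps forced to probability 0."""
    valid = [l for l, ok in zip(logits, valid_mask) if ok]
    if not valid:
        return [0.0] * len(logits)
    shift = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) if ok else 0.0
            for l, ok in zip(logits, valid_mask)]
    total = sum(exps)
    return [e / total for e in exps]


# The last step has the largest logit but is masked out (e.g. a determiner).
probs = ask_probabilities([2.0, 1.0, 1.0, 5.0], [True, True, True, False])
```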

POS Description
NN noun, singular
NNS noun, plural
NNP proper noun
NNPS proper noun, plural
VB verb
VBG verb, gerund/present participle
VBD verb, past tense
VBN verb, past participle
VBP verb, non-3rd person singular present
VBZ verb, 3rd person singular present
JJ adjective
JJS adjective, superlative
JJR adjective, comparative
RB adverb
RBS adverb, superlative
RBR adverb, comparative
RP particle
CD cardinal number

Table 5: List of valid POS for the decision maker. All other POS are classified as “other”.

6.1.6 VQA

We use a VQA model as a synthetic teacher-answerer. We remove yes/no questions from the VQA dataset as they would not be useful for our captioning regime. This leaves 277K/444K questions in the training set and 133K/214K questions in the validation set.


We use a similar architecture to [27] but with PReLU activations instead of gated Tanh. We use dropout=0.5 in all layers. Question words are encoded using an embedding layer of 300 dimensions. The word embedding is initialized using GloVe embeddings [22], but we did not find a significant difference versus training from scratch. We limit questions to 14 words. The full question is passed through a single layer GRU with 512 units to get a vector representation. We use the final hidden state (step 14) of the GRU (padding short sentences) without sequence trimming or dynamic unrolling. Captions are used as supporting evidence to increase the performance of the VQA model. Captions are also embedded using an embedding layer with 300 units and then encoded using a GRU.

To fuse the question, image, and caption, we do an element-wise product between their vector representations. Specifically, we multiply question and image together, as well as question and caption together. The two feature vectors are concatenated and fed through a MLP to predict the logits.
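The fusion step amounts to two element-wise products followed by a concatenation. A minimal sketch, assuming the three representations have already been projected to the same dimensionality (function names are illustrative):

```python
def hadamard(u, v):
    """Element-wise product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a * b for a, b in zip(u, v)]


def fuse(question, image, caption):
    """Concatenate (question * image) with (question * caption)."""
    return hadamard(question, image) + hadamard(question, caption)


# Toy 2-dimensional vectors; the fused vector has twice the input dimension.
fused = fuse(question=[1.0, 2.0], image=[3.0, 4.0], caption=[5.0, 6.0])
```

The fused vector is what would then be fed to the MLP that predicts the answer logits.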


We use batchsize=, lr= and ADAM to train the VQA model. Batch normalization is used to normalize image features. We use a multianswer loss. The loss for a single sample is shown below.


Here indexes over the answer vocabulary. is the size of the vocabulary. is the ground truth probability of answer . is the probability predicted by the model. Each question in the VQA2.0 dataset is answered by 10 humans. This loss takes the full empirical human answer distribution as the target rather than only the most common answer.
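As an illustration of a loss of this kind, the sketch below assumes a soft cross-entropy between the empirical human answer distribution and the predicted distribution. The exact form used may differ (for example, a per-answer binary cross-entropy), and the function names are hypothetical.

```python
import math
from collections import Counter


def empirical_distribution(human_answers, vocab):
    """Ground-truth probability of each vocabulary answer from human votes."""
    counts = Counter(a for a in human_answers if a in vocab)
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in vocab]


def soft_cross_entropy(target, predicted, eps=1e-12):
    """-sum_i target_i * log(predicted_i), summed over the answer vocabulary."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(target, predicted) if t > 0)


vocab = ["dog", "cat", "bird"]
votes = ["dog"] * 7 + ["cat"] * 3  # ten human answers for one question
target = empirical_distribution(votes, vocab)
loss_uniform = soft_cross_entropy(target, [1 / 3, 1 / 3, 1 / 3])
loss_matched = soft_cross_entropy(target, target)
```

A prediction that matches the human answer distribution gets a lower loss than a uniform prediction.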

6.2 Asking Multiple Questions

In our reported experiments, the agent asked one question for each generated caption in the collection phase. We also experimented with asking multiple questions. However, this is challenging because the teacher’s answer is fed directly into the captioner to roll out the rest of the new sentence. If the answer is a rare or out of vocabulary word, the final words of the sentence may not follow a sensible trajectory. This problem is worsened when multiple questions are asked. One possible solution is to exploit hypernyms to keep the captioner on a sensible trajectory while inserting novel words. Another solution may be to learn an answer “absorption” module to utilize the teacher’s answer better. We leave these directions to future work.

6.3 Calculating Human Supervision

In our reported experiments, we computed the cost of human supervision by considering the completion time of various tasks. More specifically, every GT caption a model has access to costs 5.2 units of supervision and every caption scored during lifelong learning costs 1 unit of supervision. We make some other assumptions when calculating human supervision.

First, we filtered out repeated captions and questions. Furthermore, we assume no cost when the VQA module answers a question. A human teacher would charge the agent for answers but would also give better answers. Finally, we only charge the agent once for picking the caption with the highest reward from the three alternatives (rollout, replace, and original) and then scoring it. This assumption can be relaxed by training a caption discriminator in future work.
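The accounting described in this section reduces to a weighted sum. A minimal sketch, using the unit costs stated above (5.2 per GT caption, 1 per scored caption, answers from the VQA teacher free); the function name and example counts are illustrative.

```python
GT_CAPTION_COST = 5.2      # units per ground-truth caption
SCORED_CAPTION_COST = 1.0  # units per caption scored during lifelong learning


def human_supervision(num_gt_captions, num_scored_captions):
    """Total supervision cost; VQA-teacher answers are assumed to be free."""
    return (GT_CAPTION_COST * num_gt_captions
            + SCORED_CAPTION_COST * num_scored_captions)


# Example: 100 GT captions in the warmup chunk plus 50 scored captions.
cost = human_supervision(num_gt_captions=100, num_scored_captions=50)
```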

6.4 Ablating the Question Generator

Method a@1 a@3 a@5 a@10
Baseline 37.8 50.2 55.3 62.7
+CE 45.9 60.2 65.5 72.0
+PE 49.2 63.9 69.3 75.4
+PE +CE 52.0 67.2 73.0 79.4
Table 6: Comparing question generation models using different context inputs. (+PE) with position encoding, (+CE) with RNN encoding of the caption.

In Table 6 we show how including various features affects the accuracy of the question generator. We use accuracy as a proxy for question quality. Accuracy is measured by passing a generated question through the VQA module and comparing the teacher’s answer with the ground truth answer. a@n means the GT answer appears in the top-n teacher answers. The baseline is a model trained only with POS and attention maps as context. Results are reported on the validation split of the dataset used to train the question generator. Both position and caption encoding give a boost to the accuracy. Using both improves a@1 accuracy by 14.2 points over the baseline (52.0 vs. 37.8). We use the full model as our question generator in experiments.
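The a@n metric can be computed as in the following sketch. It assumes the teacher returns its answers ranked from most to least likely; the function name and toy data are illustrative.

```python
def accuracy_at_n(ranked_answers, gt_answers, n):
    """Fraction of questions whose GT answer is among the teacher's top-n answers.

    ranked_answers: for each generated question, the teacher's answers ranked
        from most to least likely.
    gt_answers: the ground-truth answer for each question.
    """
    if not gt_answers:
        return 0.0
    hits = sum(1 for ranked, gt in zip(ranked_answers, gt_answers)
               if gt in ranked[:n])
    return hits / len(gt_answers)


# Toy teacher outputs for three generated questions.
ranked = [
    ["dog", "cat", "horse"],
    ["red", "blue", "green"],
    ["two", "three", "one"],
]
gt = ["dog", "green", "four"]
a1 = accuracy_at_n(ranked, gt, n=1)  # only the first question is a top-1 hit
a3 = accuracy_at_n(ranked, gt, n=3)  # the first two questions are top-3 hits
```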

6.5 Qualitative Examples

More qualitative examples are shown in the following pages. Fig. 12 shows the agent interacting with the teacher in the collection phase of lifelong learning. Figs. 13 and 14 show generated questions.

6.5.1 Failure Modes

Fig. 15 shows failure modes of our model. In the first image, the decision maker chooses a bad time to ask a question, and the agent gains no new information. In the second image, the question generator produces a bad question. The VQA teacher gives a nonsensical answer and the final caption is bad. The auto-eval metrics give the new caption a higher score than the original even though both captions are bad and it’s unclear which one is better. In the third image, the captioning module is not able to utilize the answer from the expert. It ignores the answer in rolling out the rest of the sentence. In the last image, the agent is rewarded for identifying the orange juice in the image. However, the final sentence doesn’t make grammatical sense. This is a limitation of using auto-eval metrics as the reward.

Figure 12: Examples of the agent interacting with the teacher in the collection phase. T5C: top-5 words predicted by captioner at the word when question is asked. Rewards are in square brackets. First two rows show warmup dataset size of 10%. The last row shows warmup dataset size of 1%. For 1% warmup, some generated captions (OC) have too many errors to fix with a single question.
Figure 13: Examples of generated questions at different time steps of a generated caption. The images were used to train the question generator. Not all generated questions are useful. This shows the importance of learning the decision maker to decide when and whether to ask questions. Up to six generated questions are shown. GTQ are ground truth questions used to train the question generator. Generated questions are almost always different from GT ones. Questions are asked for bolded words in the caption. The order of questions corresponds to the order of bolded words, i.e. Q0 corresponds to the first bolded word, Q1 corresponds to the second bolded word, and so on. The caption is generated from a model trained on 10% warmup data.
Figure 14: Random examples of generated questions at different time steps of a generated caption. The images were unseen, i.e. not used to train the question generator. Questions tend to be diverse and generic. Up to six generated questions are shown. Questions are asked for bolded words in the caption. The order of questions corresponds to the order of bolded words, i.e. Q0 corresponds to the first bolded word, Q1 corresponds to the second bolded word, and so on. The caption is generated from a model trained on 10% warmup data.
Figure 15: Failure modes of our model. From left to right these images highlight the failures of: the decision maker, the question generator and VQA teacher, the captioner rolling out the rest of the sentence after receiving the answer, using auto-eval metrics as reward.