Judge a man by his questions rather than by his answers.
Although Visual Question Answering (VQA) [3, 30, 31] has attracted considerable attention, Visual Question Generation (VQG) is a much more difficult task. Generating facile, repetitive questions represents no challenge at all; generating a series of questions that draw out useful information towards an overarching goal, however, demands consideration of the image content, the goal, and the conversation thus far. More generally, it can also be seen as requiring consideration of the abilities and motivation of the other participant in the conversation.
A well-posed question extracts the most informative answer towards achieving a particular goal, and thus reflects the knowledge of the asker, and their estimate of the capabilities of the answerer. Although the information would be beneficial in identifying a particular object in an image, there is little value in an agent asking a human about the exact values of particular pixels, the statistics of their gradients, or the aspect ratio of the corresponding bounding box. The fact that the answerer is incapable of providing the requested information makes such questions pointless. Selecting a question that has a significant probability of generating an answer that helps achieve a particular goal is a complex problem.
Asking questions is an essential part of the way humans communicate and learn. Any intelligent agent that seeks to interact flexibly and effectively with humans thus needs to be able to ask questions. The ability to ask intelligent questions is even more important than receiving intelligent, actionable answers. A robot, such as that in Fig. 1, that has been given a task and realizes it is missing critical information required to carry it out needs to ask a question. It will have a limited number of attempts before the human gets frustrated and carries out the task themselves. This scenario applies equally to any intelligent agent that seeks to interact with humans, as we have surprisingly little tolerance for agents that are unable to learn by asking questions, and for those that ask too many.
As a result, Visual Question Generation (VQG) has started to receive research attention, but primarily as a vision-to-language problem [16, 20, 32]. Methods that approach the problem in this manner tend to generate arbitrary sequences of questions that are somewhat related to the image, but which bear no relationship to the goal. This reflects the fact that these methods have no means of measuring whether the generated questions assist in making progress towards the goal. Instead, in this paper, we ground the VQG problem in a goal-oriented version of the game GuessWhat?!, introduced in [10]. The method presented in [10] to play the GuessWhat?! game is made up of three components: the Questioner asks questions to the Oracle, and the Guesser tries to identify the object that the Oracle is referring to, based on its answers. The quality of the generated questions is thus directly related to the success rate of the final task.
Goal-oriented training that uses a game setting has been used in visual dialogue generation previously [7, 8]. However, these works focus on generating more human-like dialogues, not on helping the agent achieve the goal through better question generation. Moreover, previous work only uses the final goal as the reward to train the dialogue generator, which might be suitable for dialogue generation but is a rather weak and undirected signal by which to control the quality, effectiveness, and informativeness of the generated questions in a goal-oriented task. In other words, in some cases we want to talk to a robot because we want it to finish a specific task, not to hold a meaningless chat. Therefore, in this paper, we use intermediate rewards to encourage the agent to ask short but informative questions to achieve the goal. In contrast to previous works that only consider the overall goal as the reward, we assign different intermediate rewards to each posed question to control its quality.
This is achieved by fitting the goal-oriented VQG into a reinforcement learning (RL) paradigm and devising three different intermediate rewards, which are our main contributions in this paper, to explicitly optimize the question generation. The first, the goal-achieved reward, is designed to encourage the agent to achieve the final goal (picking out the object that the Oracle is 'thinking' of) by asking multiple questions. However, rather than only considering whether the goal is achieved, additional reward is given if the agent can achieve it with fewer questions. This is a reasonable setting because one does not want a robot that can finish a task but has to ask hundreds of questions to do so. The second reward is the progressive reward, which encourages questions generated by the agent to progressively increase the probability of the right answer. This is an intermediate reward for each individual question, decided by the change in the probability of the ground-truth object; a negative reward is given if the probability decreases. The last reward is the informativeness reward, which restricts the agent from asking 'useless' questions, for example, a question that leads to an identical answer for all the candidate objects (such a question cannot eliminate any ambiguity). We show the whole framework in Fig. 2.
We evaluate our model on the GuessWhat?! dataset [10]. With the pre-trained standard Oracle and Guesser, we show that our novel Questioner model outperforms the baseline and the state-of-the-art model by a large margin. We also evaluate each reward separately to measure its individual contribution. Qualitative results show that we can produce more informative questions.
2 Related Work
Visual Question Generation
Recently, the visual question generation problem has been brought to the computer vision community; it aims at generating visually related questions. Most works treat VQG as a standalone problem and follow an image-captioning-style framework, i.e., translating an image into a sentence, in this case a question. For example, Mora et al. use a CNN-LSTM model to generate questions and answers directly from the image visual content. Zhang et al. focus on generating questions grounded in image regions; they use DenseCap [13] as a region-caption generator to guide the question generation. Mostafazadeh et al. propose a dataset for generating natural questions about images, which go beyond the literal description of image content. Li et al. [16] view VQA and VQG as a dual learning process by jointly training them in an end-to-end framework. Although these works can generate meaningful questions that are related to the image, the motivation for asking these questions is rather weak because they are not related to any goal. Moreover, it is hard to measure the quality of this type of question. Instead, in our work, we aim to develop an agent that can learn to ask realistic questions that contribute to achieving a specific goal.
Goal-oriented visual dialogue generation has attracted much attention recently. Das et al. [8] introduce a reinforcement learning mechanism for visual dialogue generation. They establish two RL agents, corresponding to question and answer generation respectively, to finally locate an unseen image from a set of images. The question agent predicts the feature representation of the image, and the reward function measures how close that representation is to the true feature. In contrast, we focus on encouraging the agent to generate questions directed towards the final goal, and we adopt different kinds of intermediate rewards in the question generation process to achieve that. Moreover, the question generation agent in their model only asks questions based on the dialogue history, without involving visual information. Strub et al. propose to employ reinforcement learning to solve question generation in the GuessWhat?! game by introducing the final success status as the sole reward. We share a similar backbone idea, but there are several technical differences. The most significant is that the previous work only uses whether the final goal is achieved as the reward, while we assign different intermediate rewards to each posed question to push the VQG agent to ask short but informative questions. The experimental results and analysis in Section 4 show that our model not only outperforms the state of the art but also achieves higher intelligence, i.e., it uses as few questions as possible to finish the task.
Question Generation in NLP
There is a long history of work on question generation from text in natural language processing (NLP) [6, 9, 25, 29]. In [1, 4], the authors focus on automatically generating gap-fill questions, while crowdsourced and manually built templates have also been used for question generation. These works focus on constructing formatted questions from a text corpus.
Reinforcement Learning for V2L
Reinforcement learning [14, 27] has been adopted in several vision-to-language (V2L) problems, including image captioning [17, 23, 24], VQA [2, 12, 33], and the aforementioned visual dialogue systems [8, 18]. Ren et al. use a policy network and a value network to collaboratively generate image captions, while different optimization methods for RL in image captioning, namely SPIDEr and self-critical sequence training, have also been explored. Zhu et al. introduce a knowledge source into iterative VQA and employ RL to learn the query policy. RL has also been used to learn the parameters of a QA model over both images and structured knowledge bases. These works employ RL as an optimization method for V2L-related problems, while we focus on using RL with carefully designed intermediate rewards to train the VQG agent for goal-oriented tasks.
3 Goal-Oriented VQG
We ground our goal-oriented VQG problem in the GuessWhat?! game, specifically on the GuessWhat?! dataset [10]. GuessWhat?! is a three-role interactive game in which all roles observe the same image of a rich visual scene that contains multiple objects. We view this game as three parts: Oracle, Questioner, and Guesser. In each game, a random object in the scene is assigned to the Oracle; this process is hidden from the Questioner. The Questioner can then ask a series of yes/no questions to locate this object. The list of objects is also hidden from the Questioner during the question-answer rounds. Once the Questioner has gathered enough information, the Guesser can start to guess. The game is considered successful if the Guesser selects the right object.
The Questioner part of the game is a goal-oriented VQG problem; each question is generated based on the visual information of the image and the previous rounds of question-answer pairs. The goal of VQG is to successfully finish the game, in this case, to locate the right object. In this paper, we fit the goal-oriented VQG into a reinforcement learning paradigm and propose three different intermediate rewards, namely the goal-achieved reward, the progressive reward, and the informativeness reward, to explicitly optimize the question generation. The goal-achieved reward is established to lead the dialogue to achieve the final goal, the progressive reward is used to push the intermediate generation process in the optimal direction, while the informativeness reward is used to ensure the quality of the generated questions. To better express the generation process, we first introduce the notation of the GuessWhat?! game, which is used throughout the rest of the paper.
Each game is defined as a tuple (I, D, O, o*), where I is the observed image, D is the dialogue with J rounds of question-answer pairs (q_j, a_j), O is the list of objects in the image, and o* ∈ O is the target object. Each question q_j is a sequence of tokens sampled from the pre-defined vocabulary V. The vocabulary V is composed of word tokens, a question stop token ? and a dialogue stop token <End>. Each answer a_j ∈ {Yes, No, NA} is yes, no, or not applicable. Each object o_n has an object category and a segment mask.
3.1 Learning Environment
We build the learning environment to generate visual dialogues based on the GuessWhat?! dataset. Since we focus on goal-oriented VQG, for a fair comparison the Oracle and Guesser are produced by referring to the original baseline models in GuessWhat?! [10]. We also introduce the VQG supervised learning model, which is referred to as the baseline for the rest of the paper.
The Oracle is required to generate answers for all kinds of questions about any object within the image scene. We build the neural network architecture for the Oracle by referring to [10]. The bounding box of the object (obtained from the segment mask) is encoded into an eight-dimensional vector representing the spatial feature, containing the box coordinates, width, and height. The category is embedded using a learned look-up table, while the current question is encoded by an LSTM. All three features are concatenated into a single vector and fed into a one-hidden-layer MLP followed by a softmax layer to produce the answer probability.
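As a rough sketch of this architecture, the following toy implementation uses random projections in place of the learned look-up table and LSTM question encoder; all dimensions and helper names here are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_feature(x_min, y_min, x_max, y_max, img_w, img_h):
    """8-d spatial vector: normalized box coordinates, centre, width, height."""
    w, h = (x_max - x_min) / img_w, (y_max - y_min) / img_h
    return np.array([x_min / img_w, y_min / img_h, x_max / img_w, y_max / img_h,
                     (x_min + x_max) / (2 * img_w), (y_min + y_max) / (2 * img_h), w, h])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stand-ins for the learned components: a category look-up table, a
# question encoding (an LSTM in the paper; here a fixed-size vector), and a
# one-hidden-layer MLP over the concatenated features.
EMBED, HID, N_ANSWERS = 16, 32, 3               # answer classes: yes / no / NA
category_table = rng.normal(size=(90, EMBED))   # e.g. 90 object categories
W1 = rng.normal(size=(8 + EMBED + EMBED, HID)); b1 = np.zeros(HID)
W2 = rng.normal(size=(HID, N_ANSWERS)); b2 = np.zeros(N_ANSWERS)

def oracle(question_encoding, category_id, box, img_size):
    # Concatenate spatial, category, and question features, then MLP + softmax.
    feat = np.concatenate([spatial_feature(*box, *img_size),
                           category_table[category_id], question_encoding])
    hidden = np.tanh(feat @ W1 + b1)
    return softmax(hidden @ W2 + b2)            # probability over yes / no / NA

probs = oracle(rng.normal(size=EMBED), category_id=3,
               box=(10, 20, 60, 80), img_size=(320, 240))
```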
Given an image and a series of question-answer pairs, the Guesser is required to predict the right object from a list of objects. Following [10], we consider the generated dialogue as one flat sequence of tokens and encode it with an LSTM. The last hidden state is extracted as the feature representing the dialogue. We also embed each object's spatial features and category with an MLP. We perform a dot product between the dialogue and object features, followed by a softmax operation, to produce the final prediction.
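The Guesser's dot-product scoring can be sketched as follows; the feature dimensions and the random projection standing in for the learned LSTM and MLP are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 32

def guesser(dialogue_feature, object_feats, W):
    """Score each candidate object against the dialogue feature.

    In the paper the dialogue feature is the last LSTM hidden state and the
    object features (category + 8-d spatial vector) are embedded by an MLP;
    here both are toy random projections.
    """
    obj_emb = np.tanh(object_feats @ W)      # (n_objects, DIM) embeddings
    scores = obj_emb @ dialogue_feature      # dot product per object
    e = np.exp(scores - scores.max())
    return e / e.sum()                       # softmax over candidate objects

W = rng.normal(size=(24, DIM)) * 0.1         # 16-d category emb + 8-d spatial
objects = rng.normal(size=(5, 24))           # 5 candidate objects
p = guesser(rng.normal(size=DIM), objects, W)
prediction = int(p.argmax())                 # index of the guessed object
```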
Given an image and a history of question-answer pairs, the VQG module is required to generate a new question. We build the VQG baseline on an RNN generator. The RNN recurrently produces a series of state vectors by transitioning from the previous state given the current input token, using an LSTM as the transition function. In our case, the state vector is conditioned on the whole image and all the previous question-answer tokens. We add a softmax operation to produce the probability distribution over the vocabulary V. This baseline is trained in a supervised manner: we train the VQG by minimizing the negative log-likelihood loss.
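The loss itself is missing from this copy; a reconstruction consistent with the surrounding notation (image I, rounds of question-answer pairs (q, a), question tokens w) is the negative log-likelihood of the ground-truth question tokens:

```latex
\mathcal{L}(\theta) = -\sum_{j=1}^{J} \sum_{i=1}^{I_j}
  \log p_{\theta}\big( w^{j}_{i} \,\big|\, w^{j}_{1:i-1},\, (q, a)_{1:j-1},\, I \big)
```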
During the test stage, a question can be sampled from the model starting from the initial state; a new token is sampled from the probability distribution, then embedded and fed back into the LSTM. We repeat this operation until the end-of-question token is encountered.
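The sampling procedure can be sketched as follows; the tiny vocabulary and the random stand-in for the LSTM transition are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB = ["is", "it", "a", "person", "red", "left", "?", "<end>"]
STOP_Q = VOCAB.index("?")
MAX_LEN = 12

# Toy stand-in for the LSTM policy: any function state -> (next_state, distribution).
def step(state):
    next_state = state + 1
    logits = rng.normal(size=len(VOCAB))
    e = np.exp(logits - logits.max())
    return next_state, e / e.sum()

def sample_question(init_state):
    """Sample tokens from the model until the stop token '?' or MAX_LEN."""
    state, tokens = init_state, []
    for _ in range(MAX_LEN):
        state, probs = step(state)
        tok = int(rng.choice(len(VOCAB), p=probs))
        tokens.append(VOCAB[tok])
        if tok == STOP_Q:
            break
    return tokens

question = sample_question(0)
```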
3.2 Reinforcement Learning of VQG
We use our established Oracle, Guesser, and VQG baseline model to simulate a complete GuessWhat?! game. Given an image, an initial question is generated by sampling from the VQG baseline until the question stop token is encountered. The Oracle then receives the question along with the assigned object's category and spatial information, and outputs the answer; the question-answer pair is appended to the dialogue history. We repeat this loop until the dialogue stop token is sampled or the number of questions reaches the maximum. Finally, the Guesser takes the whole dialogue and the object list as inputs to predict the object. We consider the goal reached if the target object is selected; otherwise, the game fails.
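The simulation loop can be sketched as below, with the three trained components replaced by hypothetical stubs:

```python
# A minimal sketch of the simulated game loop; the component interfaces
# (vqg, oracle, guesser) are illustrative stand-ins for the trained models.
def play_game(image, target, objects, vqg, oracle, guesser,
              max_rounds=5, end_token="<end>"):
    history = []
    for _ in range(max_rounds):
        question = vqg(image, history)
        if question == end_token:               # VQG chose to stop the dialogue
            break
        answer = oracle(image, question, target)
        history.append((question, answer))      # append the QA pair to history
    guess = guesser(image, history, objects)
    return guess == target, history             # success flag and the dialogue

# Toy run: the oracle answers "yes" iff the question names the target's colour.
objects = ["red car", "blue car"]
vqg = lambda img, h: "is it red ?" if not h else "<end>"
oracle = lambda img, q, t: "yes" if "red" in q and "red" in t else "no"
guesser = lambda img, h, objs: objs[0] if ("is it red ?", "yes") in h else objs[1]
success, history = play_game(None, "red car", objects, vqg, oracle, guesser)
```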
To more efficiently optimize the VQG towards the final goal and generate informative questions, we adopt three intermediate rewards (which will be introduced in the following sections) into the RL framework.
3.2.1 State, Action & Policy
We view the VQG as a Markov Decision Process (MDP) and denote the VQG module as the agent. For the dialogue generated for an image at time step t, the state of the agent is defined as the image visual content together with the history of question-answer pairs and the tokens of the current question generated so far. The action of the agent is to select the next output token from the vocabulary V. Depending on the action the agent takes, the transition between two states falls into one of the following cases:
1) The question stop token ? is sampled: the current question is finished, and the Oracle from the environment answers it; the question-answer pair is appended to the dialogue history to form the next state.
2) The dialogue stop token <End> is sampled: the dialogue is finished, and the Guesser from the environment selects an object from the list.
3) Otherwise, the newly generated token is appended to the current question to form the next state.
The maximum length of a question and the maximum number of dialogue rounds are both fixed, so the number of time steps of any dialogue is bounded. We model the VQG under a stochastic policy parameterized by the deep neural network used in the VQG baseline, which produces a probability distribution over actions for each state. The goal of policy learning is to estimate these parameters.
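The three transition cases above can be sketched as a single function over a simplified state (history, current_question); the token names and state encoding are illustrative only:

```python
def transition(state, token, oracle_answer=None):
    """Return the next state after the agent emits `token`.

    `state` is a pair (history, current_question): the QA pairs so far and
    the tokens of the question being built.
    """
    history, current = state
    if token == "?":              # case 1: question finished, Oracle answers
        return history + [(current + ["?"], oracle_answer)], []
    if token == "<end>":          # case 2: dialogue finished, Guesser acts
        return history, None
    return history, current + [token]   # case 3: keep building the question
```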
Having set up the components of the MDP, the most significant aspect of RL is to define an appropriate reward function for each state-action pair. As emphasized before, goal-oriented VQG aims to generate questions that lead to achieving the final goal. Therefore, we build three kinds of intermediate rewards to push the VQG agent to be optimized in the optimal direction. The whole framework is shown in Fig. 2.
3.2.2 Goal-Achieved Reward
One basic rule for an appropriate reward function is that it cannot conflict with the final optimal policy. The primary purpose of the VQG agent is to gather enough information as quickly as possible to help the Guesser locate the object. Therefore, we define the first reward to reflect whether the final goal is achieved. Moreover, we take the number of rounds into consideration to accelerate the questioning, and we make the reward nonzero only when the game is successful.
Given the final state, where the <End> token has been sampled or the maximum number of rounds has been reached, the reward of the state-action pair is defined as:
We set the reward to one plus the weighted ratio of the maximum number of rounds to the actual number of rounds of the current dialogue if the dialogue is successful, and to zero otherwise. This reflects that we want the final goal to motivate the VQG to generate useful questions. Moreover, the intermediate process is factored into the reward function through the number of question-answer rounds, which encourages efficiency in the generation process: the fewer questions generated, the more reward the VQG agent receives at the end of the game (if and only if the game succeeds). This is a useful setting in practice because we want to use as few instructions as possible to guide a robot to finish as many tasks as possible. A weight λ balances the contributions of the success reward and the dialogue-round reward.
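One plausible reconstruction of the missing equation, reading "the weighted maximum number of rounds against the actual rounds" as a ratio, with λ the balancing weight, J the actual number of rounds, and J_max the maximum:

```latex
r_g(s_t, a_t) =
\begin{cases}
  1 + \lambda \cdot \dfrac{J_{\max}}{J} & \text{if the Guesser selects the target } o^{*} \\[6pt]
  0 & \text{otherwise}
\end{cases}
```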
3.2.3 Progressive Reward
Based on intuition and the observation of human interactive dialogues, we find that the questions of a successful game are ones that progressively achieve the final goal, i.e., as questions are asked and answered, the confidence in the target object becomes higher and higher. Therefore, at each round, we define an intermediate reward for the state-action pair as the improvement in the target probability output by the Guesser. More specifically, we interact with the Guesser at each round to obtain the probability of predicting the target object. If the probability increases, the generated question is a positive question that leads the dialogue in the right direction.
We set an intermediate reward, called the progressive reward, to encourage the VQG agent to progressively generate such positive questions. At each round, we record the target probability returned by the Guesser and compare it with that of the previous round. The difference between the two probabilities is used as the intermediate reward. That is:
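The missing equation can be reconstructed from the description: writing p(o* | I, D_{1:j}) for the Guesser's probability of the target after round j, the reward is the round-to-round difference:

```latex
r_p(s_t, a_t) = p\big(o^{*} \mid I, D_{1:j}\big) - p\big(o^{*} \mid I, D_{1:j-1}\big)
```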
In this way, a question is considered high-quality and receives a positive reward if it leads to a higher probability of guessing the right object. Otherwise, the reward is negative.
3.2.4 Informativeness Reward
When we humans ask questions (especially in a guessing game), we expect an answer that can help us eliminate confusion and distinguish the candidate objects. Hence, a posed question that leads to the same answer for all the candidate objects is useless. For example, if all the candidate objects are red and we pose the question 'Is it red?', we will get the answer 'Yes', but this question-answer pair cannot help us identify the target. We want to avoid questions of this kind because they are non-informative. In this case, we need to evaluate the question based on the answer from the Oracle.
Given a generated question, we interact with the Oracle to answer it. Since the Oracle takes the image, the current question, and the target object as inputs and outputs the answer, we let the Oracle answer the question for every object in the image. If the answers differ from each other, we consider the question useful for locating the right object; otherwise, it does not contribute to the final goal. Therefore, we assign a positive reward, which we call the informativeness reward, to these useful questions.
Formally, during each round, the Oracle receives the image, the current question, and the list of objects, and then outputs an answer set in which each element corresponds to one object. The informativeness reward is then defined as:
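A reconstruction of the missing definition: with a^{(n)} denoting the Oracle's answer to the current question for the n-th object, and η a positive constant (the hyper-parameter chosen by grid search):

```latex
r_i(s_t, a_t) =
\begin{cases}
  \eta & \text{if } \exists\, m, n \text{ such that } a^{(m)} \neq a^{(n)} \\
  0 & \text{otherwise}
\end{cases}
```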
By giving a positive reward to the state-action pair, we improve the quality of the dialogue by encouraging the agent to generate more informative questions.
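This check is simple to sketch in code; the function name and the value of eta are illustrative:

```python
def informativeness_reward(oracle_answers, eta=0.1):
    """Positive reward iff the Oracle's answers over the candidate objects
    are not all identical, i.e. the question can eliminate some candidates."""
    return eta if len(set(oracle_answers)) > 1 else 0.0
```

For example, a question answered "yes" for every candidate object earns no reward.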
3.2.5 Training with Policy Gradient
Now that we have three kinds of rewards that take the intermediate process into consideration, for each state-action pair we add the three rewards together to form the final reward function:
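The combined reward is then simply the sum of the three components described above:

```latex
r(s_t, a_t) = r_g(s_t, a_t) + r_p(s_t, a_t) + r_i(s_t, a_t)
```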
Considering the large action space in the game setting, we adopt the policy gradient method to train the VQG agent with the proposed intermediate rewards. The goal of policy gradient is to update the policy parameters with respect to the expected return by gradient ascent. Since we are in an episodic environment, given the policy, which is the generative network of the VQG agent, the policy objective function takes the form:
The parameters can then be optimized by following the gradient update rule. In the REINFORCE algorithm, the gradient can be estimated from a batch of episodes sampled from the policy:
where the state-action value function returns the expectation of the cumulative reward from the current state-action pair onwards:
By substituting the notation of the VQG agent, we obtain the following policy gradient:
Here b is a baseline function that helps reduce the gradient variance and can be chosen arbitrarily. We use a one-layer MLP that takes the state of the VQG agent as input and outputs the expected reward. The baseline is trained with a mean squared error loss:
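The equations in this subsection are missing from this copy; they follow the standard REINFORCE formulation, and a reconstruction, with B the batch size, b_φ the learned baseline, and T the episode length, is:

```latex
J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=1}^{T} r(s_t, a_t) \right],
\qquad
\nabla_\theta J(\theta) \approx \frac{1}{B} \sum_{k=1}^{B} \sum_{t=1}^{T_k}
  \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
  \big( Q(s_t, a_t) - b_\varphi(s_t) \big),
\\[6pt]
Q(s_t, a_t) = \mathbb{E}\!\left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \right],
\qquad
\mathcal{L}(\varphi) = \left\| b_\varphi(s_t) - \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \right\|^{2}
```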
The whole training procedure is shown in Alg.1.
4 Experiments
In this section, we present our VQG results and conduct a comprehensive ablation analysis of each intermediate reward. As mentioned above, the proposed method is evaluated on the GuessWhat?! game dataset [10] with the pre-trained standard Oracle and Guesser. By comparing with the baseline and the state-of-the-art model, we show that the proposed model can efficiently generate informative questions that serve the final goal.
4.1 Dataset & Evaluation Metric
The GuessWhat?! dataset [10] is composed of 155,281 dialogues grounded on 66,537 images with 134,074 unique objects. There are 821,955 question-answer pairs in the dialogues, with a vocabulary size of 4,900. We use the standard split of training, validation, and test sets in [10, 26].
We report the accuracy of the games as the evaluation metric. Given a completed dialogue, if the target object is located by the Guesser, the game is noted as successful, which indicates that the VQG agent has generated questions that serve the final goal. There are two kinds of test runs, on the training set and the test set respectively, named NewObject and NewImage. NewObject randomly samples target objects from the training images (restricted to objects not seen before), while NewImage samples objects from the (unseen) test images. We report three inference methods, namely sampling, greedy, and beam search (beam size 5), for these two test runs.
4.2 Implementation Details
The standard Oracle, Guesser, and VQG baseline are reproduced by referring to [10]. The errors of the trained Oracle and Guesser on the test set are 21.1% and 35.8%, respectively. The VQG baseline is referred to as Baseline in Tab. 1 and 2 (these results are reported at https://github.com/GuessWhatGame by the original authors).
We initialize the training environment with the standard Oracle and Guesser. The learning rate and batch size are 0.001 and 64, respectively. The baseline function is trained with SGD at the same time. During each epoch, each training image is sampled once, and one of the objects inside it is randomly assigned as the target. We fix the maximum number of rounds and the maximum question length. The reward hyper-parameters are selected by a grid search; we find that 0.1 produces the best results.
4.3 Results & Ablation Analysis
In this section, we give an overall analysis of the proposed intermediate reward functions. To better show the effectiveness of each reward, we conduct comprehensive ablation studies. Moreover, we carry out a human interpretability study to evaluate whether human subjects can understand the generated questions and how well they can use these question-answer pairs to achieve the final goal. We denote the VQG agent trained with the goal-achieved reward as VQG-rg, with the goal-achieved and progressive rewards as VQG-rg+rp, and with the goal-achieved and informativeness rewards as VQG-rg+ri. The final agent trained with all three rewards is denoted VQG-rg+rp+ri.
Tab. 1 and 2 show the comparisons between the VQG agents optimized with the proposed intermediate rewards and the state-of-the-art model, noted as Sole-r, which uses the indicator of whether the final goal is reached as its sole reward function. With the proposed intermediate rewards and their combinations, our VQG agents outperform both compared models on all evaluation metrics. More specifically, our final VQG-rg+rp+ri agent surpasses Sole-r by 4.7%, 3.3%, and 3.7% accuracy on NewObject sampling, greedy, and beam search respectively, and obtains 3.3%, 2.3%, and 2.4% higher accuracy on NewImage sampling, greedy, and beam search respectively. Moreover, all of our agents outperform the supervised baseline by a significant margin.
To fully show the effectiveness of the proposed intermediate rewards, we train three VQG agents using the rg, rg+rp, and rg+ri rewards respectively, and conduct an ablation analysis. VQG-rg already outperforms both the baseline and the state-of-the-art model, which means that controlling the dialogue rounds can push the agent to ask wiser questions. Adding the rp and ri rewards further improves the performance of the VQG agent. We find that the improvement gained from the rp reward is higher than that from the ri reward, which suggests that the intermediate progressive reward contributes more in our experiments. Our final agent combines all three rewards and achieves the best results. Fig. 3 shows some qualitative results. More results, including some failure cases, can be found in the supplementary material.
We conduct an experiment to investigate the relationship between the number of dialogue rounds and the game success ratio. More specifically, we let the Guesser select the object at each round and calculate the success ratio at that round; comparisons of the different models are shown in Fig. 4. Our agent achieves the goal in fewer rounds than the other models, especially at round three.
To show that our VQG agent learns a progressive trend in the generated questions, we count the percentage of successful games that have a progressive (ascending) trend in the target-object probability, by observing the probability distributions generated by the Guesser at each round. Our agent achieves 60.7%, while the baseline and Sole-r achieve 50.8% and 57.3% respectively, which indicates that our agent is better at generating questions in a progressive trend, as expected given the progressive reward rp. Some qualitative examples of this progressive trend are shown in Fig. 3, i.e., the probability of the right object progressively increases.
We also investigate the informativeness of the questions generated by the different models. We let the Oracle answer questions for all the objects at each round and count the percentage of high-quality questions in the successful games. We define a high-quality question as one that does not lead to the same answer for all the candidate objects. The experimental results show that our VQG agent produces 87.7% high-quality questions, which is higher than the baseline (84.7%) and Sole-r (86.3%). This confirms the contribution of the ri reward.
4.4 Human Study
We conduct a human study to see how well humans can guess the target object based on the questions generated by these models. We show human subjects 50 images with generated question-answer pairs from the baseline, Sole-r, and our final VQG agent, and let them guess the objects, i.e., replacing the AI Guesser with a real human. We ask three human subjects to play on the same split, and a game is recognized as successful if at least two of them give the right answer. On average, subjects achieve the highest accuracy, 76%, with our agent's questions, compared to 52% and 70% with the baseline and Sole-r questions respectively. These results indicate that our agent generates higher-quality questions that help humans achieve the final goal.
The ability to devise concise questions that lead two parties in a dialogue to satisfy a shared goal as effectively as possible has important practical applications and theoretical implications. By introducing suitably crafted intermediate rewards into a deep reinforcement learning framework, we have shown that it is possible to achieve this, at least for a particular class of goals.
The method we have devised not only achieves the GuessWhat?! goal reliably and succinctly but also outperforms the state of the art. However, since the Oracle and Guesser are fixed, they are inaccurate to a certain extent. As the main objective of this paper is to show the effectiveness of the proposed intermediate rewards on the VQG problem, we leave jointly training the three components within a reinforcement learning framework as future work.
-  M. Agarwal and P. Mannem. Automatic gap-fill question generation from text books. In Proc. Workshop Inno. Use of NLP for Buil. Educ. Appl., pages 56–64. Association for Computational Linguistics, 2011.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
-  L. Becker, S. Basu, and L. Vanderwende. Mind the gap: learning to choose gaps for question generation. In Proc. Conf. North Amer. Chap. Asso. for Comp. Ling.: Human Lang. Tech., pages 742–751. Association for Computational Linguistics, 2012.
-  L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
-  J. C. Brown, G. A. Frishkoff, and M. Eskenazi. Automatic question generation for vocabulary assessment. In Proc. Conf. Human Lang. Tech. and Empi. Meth. in Natu. Lang. Process, pages 819–826. Association for Computational Linguistics, 2005.
-  A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual dialog. arXiv preprint arXiv:1611.08669, 2016.
-  A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585, 2017.
-  B. Davey and S. McBride. Effects of question-generation training on reading comprehension. Journal of Educational Psychology, 78(4):256, 1986.
-  H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
-  F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. 1999.
-  R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. arXiv preprint arXiv:1704.05526, 2017.
-  J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4565–4574, 2016.
-  L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. J. Arti. Intell. Research, 4:237–285, 1996.
-  I. Labutov, S. Basu, and L. Vanderwende. Deep questions without deep understanding. In ACL (1), pages 889–898, 2015.
-  Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, and X. Wang. Visual question generation as dual task of visual question answering. arXiv preprint arXiv:1709.07192, 2017.
-  S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy. Optimization of image description metrics using policy gradient methods. arXiv preprint arXiv:1612.00370, 2016.
-  J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. arXiv preprint arXiv:1706.01554, 2017.
-  K. Mazidi and R. D. Nielsen. Linguistic considerations in automatic question generation. In ACL (2), pages 321–326, 2014.
-  I. M. Mora, S. P. de la Puente, and X. Giro-i Nieto. Towards automatic generation of question answer pairs from images, 2016.
-  N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. arXiv preprint arXiv:1603.06059, 2016.
-  A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proc. Int. Conf. Mach. Learn., volume 99, pages 278–287, 1999.
-  Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li. Deep reinforcement learning-based image captioning with embedding reward. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
-  S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.
-  H. Singer and D. Donlan. Active comprehension: Problem-solving schema with question generation for comprehension of complex short stories. Reading Research Quarterly, pages 166–186, 1982.
-  F. Strub, H. de Vries, J. Mary, B. Piot, A. C. Courville, and O. Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proc. Int. Joint Conf. Artificial Intell., 2017.
-  R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
-  R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
-  W. J. Therrien, K. Wickstrom, and K. Jones. Effect of a combined repeated reading and question generation intervention on reading achievement. Learning Disabilities Research & Practice, 21(2):89–97, 2006.
-  Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2016.
-  H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In Proc. Eur. Conf. Comp. Vis., pages 451–466. Springer, 2016.
-  S. Zhang, L. Qu, S. You, Z. Yang, and J. Zhang. Automatic generation of grounded visual questions. arXiv preprint arXiv:1612.06530, 2016.
-  Y. Zhu, J. J. Lim, and L. Fei-Fei. Knowledge acquisition for visual question answering via iterative querying. Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.