
Jointly Learning to See, Ask, and GuessWhat

by Aashish Venkatesh et al.

We are interested in understanding how the ability to ground language in vision interacts with other abilities at play in dialogue, such as asking a series of questions to obtain the information necessary to perform a certain task. With this aim, we develop a Questioner agent in the context of the GuessWhat?! game. Our model exploits a neural network architecture to build a continuous representation of the dialogue state that integrates information from the visual and linguistic modalities and conditions future action. To play the GuessWhat?! game, the Questioner agent has to be able both to ask questions and to guess a target object in the visual environment. In our architecture, these two capabilities are considered jointly as a supervised multi-task learning problem, to which cooperative learning can be further applied. We show that the introduction of our new architecture combined with these learning regimes yields an accuracy increase of 19.5 percentage points over a baseline model that treats the submodules independently. With this increase, we reach an accuracy comparable to state-of-the-art models that use reinforcement learning, with the advantage that our architecture is entirely differentiable and thus easier to train. This suggests that combining our approach with reinforcement learning could lead to further improvements in the future. Finally, we present a range of analyses that examine the quality of the dialogues and shed light on the internal dynamics of the model.





1 Introduction

Over the last few decades, substantial progress has been made in developing dialogue systems that address the various abilities that need to be put to work while having a conversation, such as understanding and generating natural language, planning actions, and tracking the information exchanged by the dialogue participants. The latter is particularly critical since, for communication to be effective, participants need to represent the state of the dialogue and the common ground established through the conversation [Stalnaker1978, Lewis1979, Clark1996].

In this work, we develop a dialogue agent that builds a representation of the context and the dialogue state by integrating information from the visual and linguistic modalities. We are interested in understanding how the ability to combine language and vision interacts with the ability to ask a series of questions to obtain the information necessary to perform a certain task. Given the high complexity of these abilities, we focus on a simplified task-oriented scenario that makes the challenges manageable while maintaining key properties of dialogue interaction, in particular the GuessWhat?! game [de Vries et al.2017]. The game involves two participants—a questioner and an answerer—who collaborate to identify a target object in a visual scene. We model the agent in the questioner’s role, who is faced with the tasks of asking questions and guessing the target.

Figure 1: Our questioner model with a single visually grounded dialogue state encoder.

The system we present exploits a neural network architecture to build a continuous representation of the dialogue state that can be learned end-to-end directly from data. Figure 1 shows a diagram illustrating the overall architecture of the proposed system, with our visually-grounded dialogue state encoder at its core (more details are provided in Section 4). In our system, the dialogue state representations are learned by jointly optimising several key abilities: vision and language understanding as well as action and language generation. This contrasts with previous work on the GuessWhat?! task, which has addressed these aspects independently via autonomous modules [de Vries et al.2017, Strub et al.2017, Zhu, Zhang, and Metaxas2017, Lee, Heo, and Zhang2017b, Shekhar et al.2018]. Our study shows that:


  • The introduction of a single visually-grounded dialogue state encoder jointly trained with the other modules not only addresses foundational issues about how to integrate visual grounding with the components of a dialogue system, but also yields a 10 percentage-point improvement on task success with our best performing settings.

  • The new architecture allows us to leverage a cooperative learning regime whereby the questioning and guessing modules adapt to each other using self-generated dialogues. This method yields an additional accuracy increase of 9.5 percentage points with our best settings.

  • Further modules can be readily integrated into our architecture: in particular, adding a decision making component improves the quality of the dialogue by avoiding unnecessary questions.

2 Related Work

Our work is connected to, and brings together, research on task-oriented dialogue modelling and visual dialogue agents.

2.1 Task-oriented dialogue systems

The conventional architecture of task-oriented dialogue systems includes several components: most typically, a natural language understanding and a natural language generation module, plus a dialogue manager (consisting of a dialogue state tracker and policy) that connects the two of them. These components are usually trained in a data-driven manner using statistical methods. For instance, a prominent line of work treats the task of tracking the dialogue state as a partially-observable Markov decision process 

[Williams et al.2013, Young et al.2013, Kim et al.2014] that operates on a symbolic dialogue state consisting of predefined variables. The use of symbolic representations to characterise the state of the dialogue has some advantages (e.g., ease of interfacing with knowledge bases), but also some key disadvantages: the variables to be tracked have to be defined in advance and the system needs to be trained on data annotated with explicit state configurations.

Given these limitations, recent work has started to investigate how different components of the traditional dialogue system pipeline may be replaced with neural network models that learn their own representations. Following approaches to non-goal-oriented chatbots [Vinyals and Le2015, Sordoni et al.2015, Serban et al.2016, Li et al.2016a, Li et al.2016b] that model dialogue as a sequence-to-sequence problem [Sutskever, Vinyals, and Le2014], Bordes et al. [2017] propose to fully replace the components of a task-oriented dialogue system with a memory network. In contrast, Williams et al. [2017] put forward a hybrid system that learns state representations through a Recurrent Neural Network (RNN) and integrates them with symbolic information, yielding a dialogue system suitable for practical applications. Similarly, Zhao and Eskenazi [2016] propose a neural model that replaces the NLU and dialogue manager components of a traditional system: the dialogue state representation is created by an RNN that processes raw utterances and functions as a dialogue state tracker, while the dialogue manager is implemented as a Multilayer Perceptron that selects a system action.

Our work pushes further the idea of learning the dialogue state representation from raw data by means of an RNN that is jointly optimised with other components of the dialogue system — in our case, we focus on the interplay between having to ground language on vision, on the one hand, and having to generate natural language questions and guessing actions, on the other.

2.2 Visual dialogue agents

In recent years, researchers in computer vision have proposed tasks that combine visual processing with dialogue interaction. Das et al. [2017] and de Vries et al. [2017] have created large datasets (VisDial and GuessWhat?!, respectively) where two participants ask and answer questions about an image. While impressive progress has been made in combining vision and language, current models tend to focus on specific abilities. For example, the models proposed by Das et al. [2017] concern a cooperative image guessing game where one of the agents does not see the image (and thus performs no multimodal understanding) and the other agent does see the image, but only responds to questions without having to perform additional actions. The models so far proposed for the GuessWhat?! task do take other functions into consideration besides generating and understanding visually-grounded language: asking a sequence of related questions to achieve a task and guessing a target object. But these abilities are modelled independently [de Vries et al.2017, Strub et al.2017, Zhu, Zhang, and Metaxas2017, Lee, Heo, and Zhang2017b]. In contrast, we propose a model of a multimodal dialogue agent in the context of the GuessWhat?! task where all components are integrated into a joint architecture that has at its core a visually-grounded dialogue state encoder (see Figure 1 in the Introduction).

3 Task and Data

As a testbed for our dialogue agent architecture, we focus on the GuessWhat?! game introduced by de Vries et al. [2017]. The game is a simplified instance of a referential communication task where two players collaborate to identify a referent — a setting used extensively to study human-human collaborative dialogue [Clark and Wilkes-Gibbs1986, Yule1997, Zarrieß et al.2016].

The GuessWhat?! dataset was collected via Amazon Mechanical Turk by de Vries et al. [2017]. The task involves two human participants who see a real-world image taken from the MS-COCO dataset [Lin et al.2014]. One of the participants (the Oracle) is assigned an object in the image, and the other participant (the Questioner) has to guess it by asking Yes/No questions to the Oracle. There are no time constraints to play the game. Once the Questioner is ready to make a guess, the list of candidate objects is provided, and the game is considered successful if the Questioner picks the target object.

The GuessWhat?! dataset consists of around 155k dialogues about approximately 66k different images. Dialogues contain an average of 5.2 question-answer pairs. 84.6% of games are completed successfully by the human participants; 8.4% are unsuccessful, and 7% are incomplete (the Questioner participant did not make a guess).

4 Models

We focus on developing an agent who plays the role of the Questioner in the GuessWhat?! game. As a baseline model, we consider our own implementation of the best performing systems put forward by de Vries et al. [2017], who develop models of the Questioner and Oracle roles.

4.1 Baseline model

The Questioner model by de Vries et al. [2017] consists of two independent modules: a Question Generator (QGen) and a Guesser. For the sake of simplicity, QGen asks a fixed number of questions before the Guesser predicts the target object.

QGen is implemented as an RNN with a Long Short-Term Memory (LSTM) transition function, on top of which a probabilistic sequence model is built with a softmax classifier. At each time step in the dialogue, the model receives as input the raw image and the dialogue history and generates the next question one word at a time. The image is encoded by extracting its VGG-16 features [Simonyan and Zisserman2014]. In our new architecture (described below in Section 4.2), we use ResNet152 features instead of VGG, because they tend to yield better performance in image classification and are more efficient to compute. However, using ResNet152 features for the baseline QGen module leads to a decrease in performance (concretely, 37.3% accuracy vs. the 41.8% achieved with VGG-16 for the baseline model). We thus keep the original configuration by de Vries et al. [2017] with VGG features, which provides a stronger baseline.

The Guesser module does not have access to the raw image. Instead, it exploits the annotations in the MS-COCO dataset [Lin et al.2014] to represent candidate objects by their object category and their spatial coordinates. As reported by de Vries et al. [2017], this yields better performance than using raw image features in this case. The objects’ categories and coordinates are passed through a Multi-Layer Perceptron (MLP) to obtain an embedding for each object. The Guesser also takes as input the dialogue history, processed by its own dedicated LSTM. A dot product between the hidden state of the LSTM and each of the object embeddings returns a score for each candidate object.
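The scoring step just described — a dot product per candidate followed by a normalisation — can be sketched in a few lines. This is a toy illustration, not the authors' code; the dimensions and values are invented:

```python
import math

def softmax(scores):
    """Normalise raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def guesser_probs(dialogue_encoding, object_embeddings):
    """Dot product between the dialogue encoding and each candidate
    object embedding, followed by a softmax over the candidates."""
    scores = [sum(d * o for d, o in zip(dialogue_encoding, obj))
              for obj in object_embeddings]
    return softmax(scores)

# Toy 3-d dialogue encoding and two candidate objects.
state = [0.5, -0.2, 0.8]
objects = [[1.0, 0.0, 1.0],   # candidate 0
           [0.0, 1.0, 0.0]]   # candidate 1
probs = guesser_probs(state, objects)
guess = max(range(len(probs)), key=probs.__getitem__)
```

The candidate whose embedding is most aligned with the dialogue encoding receives the highest probability and becomes the guess.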

The model playing the role of the Oracle is informed about the target object. Like the Guesser, the Oracle does not have access to the raw image features. It receives as input embeddings of the target object’s category, its spatial coordinates, and the current question asked by the Questioner, encoded by a dedicated LSTM. These three embeddings are concatenated into a single vector and fed to an MLP that outputs an answer (Yes or No).

4.2 Visually-grounded dialogue state encoder

Figure 2: Question Generation and Guesser modules.

In line with the baseline model, our Questioner agent includes two sub-modules, a QGen and a Guesser. Our agent architecture, however, introduces major changes to the setup of de Vries et al. [2017]: rather than operating independently, the language generation and guessing modules are connected through a common grounded dialogue state encoder (GDSE), which combines linguistic and visual information as a prior for the two modules. Given this feature, we will refer to our Questioner agent as GDSE.

As illustrated in Figure 1 in the Introduction, the Encoder receives as input representations of the visual and linguistic context. The visual representation v consists of the second-to-last layer of ResNet152 trained on ImageNet [He et al.2016]; we do not update the visual feature parameters during training. The linguistic representation is obtained by an LSTM (LSTM_e) which processes each new question-answer pair in the dialogue. After each question-answer pair QA_t, the last hidden state h_t of LSTM_e is concatenated with the image features v, passed through a linear layer, and then through a tanh activation, producing the final layer of the Encoder:

s_t = tanh(W · [h_t ; v])

where [· ; ·] represents concatenation, h_t is the last hidden state of LSTM_e, and v the image features. We refer to this final layer s_t as the dialogue state, which is given as input to both QGen and Guesser.
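The state computation just described — a tanh over a linear map of the concatenated language and vision summaries — can be sketched numerically. Everything below is a toy illustration with invented dimensions and weights:

```python
import math

def dialogue_state(h_t, v, W, b):
    """Toy version of the encoder's final layer: concatenate the
    LSTM summary of the dialogue (h_t) with the image features (v),
    apply a linear layer (W, b), then a tanh activation."""
    x = h_t + v  # list concatenation plays the role of [h_t ; v]
    return [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bias)
            for row, bias in zip(W, b)]

# Toy sizes: 2-d language summary, 2-d visual features, 2-d state.
h = [0.1, -0.3]
v = [0.7, 0.2]
W = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
b = [0.0, 0.0]
s = dialogue_state(h, v, W, b)
```

In the real model these vectors are high-dimensional and W is learned jointly with the rest of the network; the sketch only shows the shape of the computation.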

As illustrated in Figure 2, our QGen and Guesser modules are like the corresponding modules of de Vries et al. [2017], except for the crucial fact that they receive as input the same grounded dialogue state representation. QGen employs an LSTM (LSTM_q) to generate the token sequence of each question conditioned on the dialogue state s_t, which is used to initialise the hidden state of LSTM_q. As input at every time step k, QGen receives a dense embedding w_{k-1} of the previously generated token together with the image features v:

h^q_k = LSTM_q([w_{k-1} ; v], h^q_{k-1})

We optimise QGen by minimising the Negative Log-Likelihood (NLL) of the human questions, using the Adam optimiser [Kingma and Ba2015]:

L_QGen = − Σ_k log p(w_k | w_{<k}, s_t, v)

Thus, in our architecture the LSTM of QGen in combination with the LSTM of the Encoder forms a sequence-to-sequence model [Sutskever, Vinyals, and Le2014], conditioned on the visual and linguistic context, in contrast to the baseline model, where question generation is performed by a single LSTM on its own.

The Guesser consists of an MLP which is evaluated for each candidate object in the image. It takes the dense embedding of an object’s category and its spatial information to build a representation r_j of each candidate object. A score is computed for each object by taking the dot product between the dialogue state s_t and the object representation:

score_j = s_t · r_j

Finally, a softmax over the scores yields a probability distribution over the candidate objects:

p(o_j) = softmax(score_j)

We pick the object o* with the highest probability, and the game is successful if o* is the target object.

As with QGen, we optimise the Guesser by minimising the NLL of the target object, again using Adam:

L_Guesser = − log p(o_target)
The resulting architecture is fully differentiable. In addition, the GDSE agent faces a multi-task optimisation problem: while QGen optimises L_QGen and the Guesser optimises L_Guesser, the parameters of the Encoder (LSTM_e and W) are optimised via both losses. Hence, both tasks faced by the Questioner agent contribute to the optimisation of the dialogue state s_t, and thus to a more effective encoding of the input context.
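The multi-task objective can be made concrete with toy numbers. In the real model the two NLL terms come from the question-word and object predictions, and only the shared Encoder parameters receive gradients from both; the distributions below are invented:

```python
import math

def nll(probs, gold_index):
    """Negative log-likelihood of the gold symbol/object."""
    return -math.log(probs[gold_index])

# Toy output distributions for one training step.
qgen_probs = [0.1, 0.6, 0.3]      # next-word distribution, gold = 1
guesser_probs = [0.2, 0.7, 0.1]   # object distribution,   gold = 1
loss_qgen = nll(qgen_probs, 1)        # drives QGen + Encoder
loss_guesser = nll(guesser_probs, 1)  # drives Guesser + Encoder
# The shared Encoder is optimised against the sum of both losses.
encoder_objective = loss_qgen + loss_guesser
```

Because both losses backpropagate through the same state s_t, improving either task reshapes the representation used by the other.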

5 Learning Approach and Task Success

Like de Vries et al. [2017], we train the Oracle with supervised learning (SL) using human dialogues: the model receives human questions and has to learn to provide the human Oracle’s answer. Our re-implementation of the Oracle obtains an accuracy of 78.47%, matching the result reported by de Vries et al. [2017].

As for the Questioner agent, our new architecture makes possible a different learning approach from the one used by de Vries et al. [2017]. We first train our GDSE agent with SL in a multi-task setting and then exploit its differentiable architecture to update the parameters of the Guesser, QGen, and Encoder components through cooperative learning (CL). We use the same train (70%), validation (15%), and test (15%) splits as de Vries et al. [2017], where the test set contains new images not seen during training.

5.1 Supervised learning

In the model by de Vries et al. [2017], the QGen and Guesser modules are trained autonomously with SL on human data: QGen is trained to replicate human questions and, independently, the Guesser is trained to predict the target object. Our new architecture with a common dialogue state Encoder allows us to formulate these two tasks as a multi-task problem, in which the two losses defined in Section 4.2 (the QGen and Guesser NLL objectives) are optimised in parallel.

These two tasks are not equally difficult: while the Guesser has to learn a probability distribution over the set of candidate objects in the image, QGen needs to fit the distribution of natural language words. QGen thus has a harder task to optimise and requires more parameters and training iterations. We address this issue by making the learning schedule task-dependent. We call this setup modulo-n training, where n indicates after how many epochs of QGen training the Guesser is updated together with QGen. We experimented with n from 5 to 15 and found that updating the Guesser every 7 QGen epochs worked best. With this optimal configuration, we train QGen for 91 epochs and the Guesser for 13 epochs. We use a batch size of 1024, Adam as optimiser, and a learning rate of 0.0001.
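The modulo-n schedule can be made concrete with a few lines (a sketch under the settings above): with n = 7 over 91 QGen epochs, the Guesser is updated in exactly 13 of them, matching the reported epoch counts.

```python
def modulo_n_schedule(num_epochs, n):
    """Per-epoch module updates for modulo-n training: QGen is
    updated every epoch, the Guesser only every n-th epoch."""
    schedule = []
    for epoch in range(1, num_epochs + 1):
        modules = ["qgen"]
        if epoch % n == 0:
            modules.append("guesser")
        schedule.append(modules)
    return schedule

# n = 7 over 91 epochs: the Guesser is trained 13 times.
sched = modulo_n_schedule(91, 7)
guesser_epochs = sum("guesser" in m for m in sched)
```

The same helper covers the cooperative phase described below, which simply uses a different n.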

5.2 Cooperative learning

Once the model has been trained with SL, new training data can be generated by letting the agent play new games. Given an image from the training set used in the SL phase, we generate a new training instance by randomly sampling a target object from all objects in the image. We then let our Questioner agent and the Oracle play the game with that object as target, and further train the common Encoder using the generated dialogues by backpropagating the error with gradient descent through the Guesser. After training the Guesser and the Encoder with generated dialogues, QGen needs to ‘readapt’ to the newly arranged Encoder parameters. To achieve this, we re-train QGen on human data with SL, but using the new Encoder states. Also here, the error is backpropagated with gradient descent through the common Encoder.
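The cooperative loop described above can be sketched schematically. This is not the authors' code: the three callables stand in for the real steps (playing a game against the Oracle, updating Guesser + Encoder on generated dialogues, and re-training QGen on human data against the new Encoder states):

```python
import random

def cooperative_round(play_game, train_guesser, retrain_qgen, images):
    """One round of the cooperative regime (a sketch):
      1. sample a target object per training image and let the
         Questioner and Oracle play, yielding new dialogues;
      2. update Guesser + Encoder on the generated dialogues;
      3. re-train QGen on human data so it readapts to the Encoder."""
    generated = []
    for image in images:
        target = random.choice(image["objects"])
        generated.append(play_game(image, target))
    train_guesser(generated)   # error backpropagated through Encoder
    retrain_qgen()             # QGen readapts to new Encoder params
    return generated

# Toy run with recording stubs.
calls = []
images = [{"objects": ["cat", "dog"]}, {"objects": ["car"]}]
dialogues = cooperative_round(
    play_game=lambda img, tgt: (img, tgt),
    train_guesser=lambda d: calls.append("guesser"),
    retrain_qgen=lambda: calls.append("qgen"),
    images=images,
)
```

The ordering matters: the Guesser/Encoder update on self-generated games precedes the QGen re-training that adapts the generator to the shifted Encoder parameters.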

Thus, with this learning scheme, the different components of the Questioner agent learn to better perform the overall Questioner’s task in a cooperative manner. Das et al. [2017] have explored the use of cooperative learning to train two visual dialogue agents that receive joint rewards when they play a game successfully. To our knowledge, ours is the first approach where cooperative learning is applied to the internal components of a grounded conversational agent.

As in the SL phase described in the previous section, we use modulo-n training, setting n to the optimal value of 4 in this case. The GDSE previously trained with SL is further trained with this cooperative regime for 6 and 2 epochs for the Guesser and the QGen, respectively. We use a batch size of 256, Adam as optimiser, and a learning rate of 0.0001.

5.3 Results

Model Accuracy
Baseline 41.2
GDSE-SL 51.2
GDSE-CL 60.7
Strub et al. 2017 58.4
Zhang et al. 2017 60.7
Table 1: Test set accuracy scores on task success for each model with its best performing settings.

We report accuracy results on task success for our agent trained with supervised learning (GDSE-SL) and with cooperative learning (GDSE-CL). We set the number of questions to be asked by QGen to 10, after which the Guesser makes a guess. This number was selected as a result of a parameter search in the range from 5 to 12, using the validation set. The best performing baseline model asks 5 questions.

As shown in Table 1, the introduction of the new architecture trained with SL yields an increase in accuracy of 10 percentage points (from 41.2 to 51.2) with respect to the Baseline system. (While de Vries et al. [2017] originally report an accuracy of 46.8%, this result was later revised to 40.8%, as clarified on the first author’s GitHub page; our own implementation of the baseline system achieves an accuracy of 41.2%.) The introduction of the cooperative learning approach (GDSE-CL) brings a further improvement of 9.5 percentage points over GDSE-SL (from 51.2 to 60.7).

Other models have been evaluated on the GuessWhat?! dataset and also outperform the baseline. In particular, Strub et al. [2017] and Zhang et al. [2017] obtain 58.4% and 60.7% accuracy, respectively (see Table 1). (Lee, Heo, and Zhang [2017b] report an accuracy of 78.2%, but their Questioner retrieves questions from the human training dialogues instead of generating them; this result is therefore not comparable to any of the models in Table 1.) These models use reinforcement learning. Our best results obtained with GDSE-CL are on a par with the state-of-the-art scores obtained by these models, with the added value that our architecture uses an entirely differentiable learning procedure. This is an advantage of our model over reinforcement learning approaches in terms of ease of training and speed of convergence. (Strub et al. [2017] and Zhang et al. [2017] also evaluate their models on a second test set, New Object, besides the test set of de Vries et al. [2017], which they refer to as New Image. We have evaluated our model on the New Object setting too and obtained comparable results; see the Supplementary Material for details.)

This suggests that combining our approach—which addresses foundational architectural aspects—with reinforcement learning could lead to further improvements in the future.

6 Analysis

In this section, we present a range of analyses that aim to shed light on the performance of our model and its inner workings. All analyses are carried out on the test set data.

6.1 Quantitative analysis of linguistic output

We analyse the language produced by our Questioner agent with respect to three factors: (1) the size of the vocabulary, (2) the number of unique questions, and (3) the number of repetitions. We compute these factors on the test set dialogues for our model with supervised learning (GDSE-SL) and with cooperative learning (GDSE-CL), for the baseline model (BL), and for the human data (H).

As we can see in Table 2, with respect to factors (1) and (2) the linguistic output of our model is richer than that of the baseline model and closer to the language used by humans: our agent learns a much larger vocabulary than the baseline model (1569 words with SL and 1272 with CL vs. 423 for the baseline), makes use of many more unique questions overall (36577 with SL and 35317 with CL vs. 3545 for the baseline), and repeats the very same question within the same dialogue less often than the baseline (98.08% of the games played by the baseline contain at least one verbatim question repetition, whereas for SL and CL this happens in only 66.24% and 59.97% of the games, respectively). (Note that the upper bound for the models’ vocabulary is the size of the training vocabulary: 4901 words, i.e., all words with at least 3 occurrences in the training data. This upper bound is thus lower than the size of the human vocabulary in the test set (5255 words) reported in Table 2.)
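The three statistics can be computed directly from a set of test dialogues; a sketch over a toy corpus (the helper and the data are invented for illustration):

```python
def dialogue_stats(dialogues):
    """Vocabulary size, number of unique questions, and the share of
    games containing a verbatim repeated question (toy version of
    the statistics in Table 2)."""
    vocab, unique_qs = set(), set()
    repeated_games = 0
    for game in dialogues:
        seen = set()
        repeated = False
        for q in game:
            if q in seen:
                repeated = True
            seen.add(q)
            unique_qs.add(q)
            vocab.update(q.lower().split())
        repeated_games += repeated
    return len(vocab), len(unique_qs), 100 * repeated_games / len(dialogues)

games = [["is it a person ?", "is it red ?", "is it red ?"],
         ["is it a vehicle ?", "on the left ?"]]
vocab_size, n_unique, pct_repeat = dialogue_stats(games)
```

Here the first game repeats a question verbatim and the second does not, so half of the games count as containing a repetition.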

6.2 Dialogue policy: types of questions

To further understand the variety of questions asked by the agents, we classify questions into different types (see examples in the table on the left in Figure 3). We distinguish between questions that aim at identifying the category of the target object (entity questions) and questions about properties of the queried objects (attribute questions). Within attribute questions, we make a distinction between colour, shape, size, texture, location, and action questions. Within entity questions, we distinguish questions whose focus is an object category from those whose focus is a super-category. The classification is done by manually extracting keywords for each question type from the human dialogues, and then applying an automatic heuristic that assigns a class to a question based on the presence of the relevant keywords. (A question may be tagged with several attribute classes if keywords of different types are present; e.g., “Is it the white one on the left?” would be classified as both colour and location.) This procedure allows us to classify 91.41% of the questions asked by humans. The coverage is higher for the questions asked by the models: 98.57% (BL), 93.64% (GDSE-SL), 98.55% (GDSE-CL). (The Supplementary Material provides details on the question classification procedure: the keyword lists by class, the procedure used to obtain these lists, and the pseudo-code of the classification heuristic.)
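A sketch of such a keyword heuristic follows. The keyword lists here are illustrative stand-ins, not the actual lists extracted from the human dialogues:

```python
# Illustrative keyword lists; the real lists are derived from the
# human dialogues (see the Supplementary Material).
KEYWORDS = {
    "color":          {"red", "blue", "white", "black", "green"},
    "location":       {"left", "right", "front", "behind", "on"},
    "action":         {"standing", "sitting", "wearing", "holding"},
    "super-category": {"person", "vehicle", "animal", "furniture"},
}

def classify(question):
    """Assign every type whose keywords occur in the question; a
    question may receive several attribute tags at once."""
    tokens = set(question.lower().replace("?", " ").split())
    tags = {t for t, kws in KEYWORDS.items() if tokens & kws}
    return tags or {"not-classified"}

tags = classify("Is it the white one on the left?")
```

As in the paper's example, a question mentioning both a colour and a location keyword receives both tags, while a question with no known keyword remains unclassified.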

                            BL      SL      CL      H
Vocabulary size             423     1569    1272    5255
Unique questions            3545    36577   35317   54467
% games w. repeated Q's     98.08   66.24   59.97
super-cat to obj/att        66.44   71.58   96.44   89.56
object to attribute         86.70   77.82   97.95   88.70
Table 2: Statistics of the linguistic output of the baseline (BL), GDSE-SL (SL), GDSE-CL (CL), and humans (H). Last two rows: proportion of consecutive question pairs showing a question-type shift (vs. no shift) after question Q_t received a Yes answer.

Question type      Example                 BL     SL     CL     H
entity                                     40.05  44.12  28.57  38.11
- super-category   Is it a vehicle?        11.55  10.39   9.34  14.51
- object           Is it a skateboard?     28.50  33.73  19.23  23.61
attribute                                  58.52  49.53  69.98  53.29
- color            Is he wearing blue?      5.12  14.25  17.67  15.50
- shape            Is it square?            0.00   0.02   0.03   0.30
- size             The bigger one?          0.03   0.41   0.71   1.38
- texture          Is it wood?              0.00   0.14   0.08   0.89
- location         The one from the left?  54.72  39.65  66.62  40.00
- action           Are they standing?       1.87   8.28   3.14   7.59
Not classified                              1.43   6.35   1.45   8.6
chi-square vs. H                           20.86*  7.25  21.98*  0.00
Figure 3: Left: Percentage of questions per question type over all test set games played by humans (H) and by the models asking 10 questions. Bottom row: chi-square statistics indicating the level of divergence from the human distribution of fine-grained question types (* marks a statistically significant divergence). Right: Structure of the dialogues in terms of question types for our GDSE-CL model. Cells show the percentage of a type over all questions in the test set; each time step on the x-axis corresponds to a question-answer pair.

The statistics are shown in Figure 3, left. We observe that the distribution of fine-grained question classes produced by our SL model is statistically indistinguishable from the human distribution (chi-square = 7.25, not significant), while the CL model’s distribution exhibits the highest divergence (chi-square = 21.98, significant). For example, we see that the CL model asks more location questions and fewer action questions than humans. This suggests that the agent trained with the cooperative learning regime achieves its high performance by effectively learning its own dialogue policies.

This is also apparent when we analyse in more detail the structure of the dialogues in terms of the sequences of question types asked. As expected, both humans and models almost always start with an entity question (around 97% of dialogues for all models and 78.48% for humans), in particular a super-category question (around 70% for all models and 52.32% for humans). In some cases, humans start by asking directly about an attribute that may easily distinguish an object from the others, while this is very uncommon for the models. The heat map on the right in Figure 3 illustrates the structure of the dialogues of our CL model. (The heat maps for humans and the other models, as well as statistics on the question types at the start of the dialogues, are reported in the Supplementary Material.)

Figure 4: Example of a successful game played by our CL model. The model guesses the target correctly, although some of the questions and answers are difficult to interpret with respect to the image.

To further analyse whether the models have learned a common-sense dialogue policy, we check how the answer to a given question type affects the type of the follow-up question. In principle, we expect question types that are answered positively to be followed by questions of a more specific type. This is indeed what we observe in the human dialogues, as shown in the bottom rows of Table 2: for example, when a super-category question is answered positively, humans follow up with an object or attribute question 89.56% of the time. This trend is mirrored by the supervised models (BL and SL), albeit to a lesser extent: for instance, the BL model transitions to a more specific question type only 66.44% of the time after a positively answered super-category question. The CL model, in contrast, follows this common-sense pattern to a larger extent than humans: almost always (96.44% and 97.95% in Table 2), the agent moves on to ask a more specific question after receiving a Yes answer to a more general question type. Thus, our CL model seems to learn strategies that are somewhat simplistic, but end up being more effective than those learned only via SL. Given the intrinsic limitations of the agents compared to humans, trying to emulate human data through strict supervision may be detrimental.

6.3 Dialogue state representations

A key innovation of our agent model is the encoding and training of the dialogue state: thanks to the end-to-end training, the state representations not only encode the visually-grounded dialogue context, but are also progressively updated during training to optimise the objectives of the question generation and guessing components.

Since our visually-grounded dialogue state is not symbolic, its inner workings are not transparent. To gain insight in this respect, here we analyse how the representations learned by the system evolve over the course of a dialogue. In particular, we are interested in understanding how the dialogue state and the probability mass assigned to the target object change after each question-answer pair as a function of the type of information exchanged in that turn. For each question-answer pair QA_t, we compute:

  • [leftmargin=13pt]

  • the increase in the probability mass assigned to the target object from to , and

  • the cosine distance between the dialogue state at turn t−1 and the dialogue state at turn t, once ⟨q_t, a_t⟩ has been processed. (The dialogue state at the beginning of a game is taken to be the hidden state of the encoder after processing the <start> token. The initial probability assigned to the target object depends on the number of candidate objects in the image.)

In addition, for each pair ⟨q_t, a_t⟩, we extract:


  • the type of q_t: super-cat, object, or attribute

  • the type of a_t: Yes or No

  • the index t: the position of the pair in the dialogue
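The two turn-level quantities above can be computed directly from the sequence of encoder states and Guesser outputs. A minimal sketch (the helper name and input layout are our assumptions, not the paper's code):

```python
import numpy as np

def turn_features(states, target_probs):
    """For each question-answer pair t, compute (i) the cosine distance
    between the dialogue state at t-1 and at t and (ii) the increase in
    probability mass assigned to the target object. Hypothetical helper:
    `states` is a (T+1, d) array of encoder states (index 0 = the state
    after <start>); `target_probs` has the Guesser's probability for the
    target object after each turn."""
    cos_dist, prob_delta = [], []
    for t in range(1, len(states)):
        a, b = states[t - 1], states[t]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        cos_dist.append(1.0 - cos)
        prob_delta.append(target_probs[t] - target_probs[t - 1])
    return cos_dist, prob_delta
```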

Using these three features as independent predictors, we fit two linear regression models: reg_state_change, with the dialogue state cosine distance as the dependent variable, and reg_target_prob, with the increase in target object probability as the dependent variable. We fit reg_state_change on all the dialogues in the test set, separately for our SL agent and our CL agent. For reg_target_prob, we restrict the analysis to successful dialogues, where the target object is correctly identified.

The model fitting results show the same trends for both the SL and CL agents. Questions at the beginning of the dialogue bring about more change to the dialogue state and more increase in probability assigned to the target object (significant negative regression coefficient for index across the board). There is an interaction effect between question and answer types: positively answered entity questions bring about the most change to the state, even when controlling for temporal index. object questions that receive a Yes answer are the ones that lead to the highest target probability increase.
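Such a regression, with dummy-coded question and answer types, the turn index, and question-type × answer-type interaction terms, can be fitted by ordinary least squares. A toy sketch on synthetic data (not the paper's data; coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Dummy-coded predictors (baseline question type: attribute)
is_object = rng.integers(0, 2, n)                 # question type: object
is_supercat = rng.integers(0, 2, n) * (1 - is_object)  # question type: super-cat
is_yes = rng.integers(0, 2, n)                    # answer type: Yes
index = rng.integers(1, 11, n)                    # position in the dialogue
# Synthetic dependent variable, e.g. dialogue-state cosine distance:
# large effect of super-cat questions, decay with turn index, and a
# Yes x object interaction, plus small noise.
y = 0.3 * is_supercat - 0.04 * index + 0.1 * is_yes * is_object \
    + rng.normal(0, 0.01, n)

# Design matrix: intercept, main effects, and interactions
X = np.column_stack([np.ones(n), is_object, is_supercat, is_yes, index,
                     is_yes * is_object, is_yes * is_supercat])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

On data generated this way, the fitted coefficient for index is negative and the super-cat coefficient is large and positive, mirroring the pattern reported above.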

Note that the agents have not been explicitly trained on any of the distinctions expressed by the features in the regression models (question type, answer type, and index). Thus, any significant relationships we find between these variables and the model representations shed light on what has autonomously been learned by the models.

A robust observation arising from this analysis is that later turns in the dialogue tend to bring in little information. This is perhaps not surprising: the constraint of always asking 10 questions forces the models to keep asking even when they have collected enough information to guess the object. In the next section, we remove this artificial restriction and extend our model with a decision-making component that decides when to stop asking and make a guess. (In the Supplementary Material we provide details on the regression models, together with plots showing how the cosine distance between dialogue state representations and the probability of the target object evolve over the course of a dialogue.)

7 Adding a Decision-Making Module

[Shekhar et al.2018] modify the Questioner model of [de Vries et al.2017] by adding a decision-making component (DM) that decides whether to ask a follow-up question or to stop the conversation to make a guess. Two versions of this model are proposed: the decider receives as input either the hidden state computed by the QGen or the one computed by the Guesser, obtaining 40.02% and 41.2% accuracy, respectively, when allowing up to 10 questions. With the common Encoder in our model, which merges the language and vision inputs and interacts with both the QGen and the Guesser, we can train the DM jointly with the other modules. When allowing a maximum of 10 questions, adding a decider results in lower task success accuracy (48.11% with GDSE-SL and 55.24% with GDSE-CL), but improves the quality of the dialogues by significantly reducing the number of repetitions: only 34.82% and 23.91% of the games played by the SL and CL models, respectively, contain at least one repeated question. See the Supplementary Material for further details on this extension.
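The decision-making extension can be sketched as a small binary classifier over the shared encoder state; this is our reading of the design, with sizes and names as assumptions rather than the authors' code:

```python
import torch
import torch.nn as nn

class DecisionMaker(nn.Module):
    """Hedged sketch of a DM module: a small MLP over the shared
    visually-grounded dialogue state that decides whether to ask a
    follow-up question or to stop and let the Guesser guess.
    The 512-d state size matches the encoder output described in the
    Supplementary Material; the hidden size is an assumption."""
    def __init__(self, state_dim=512, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2))  # logits over {ask, guess}

    def forward(self, dialogue_state):
        return self.mlp(dialogue_state)

dm = DecisionMaker()
state = torch.zeros(1, 512)               # placeholder dialogue state
logits = dm(state)
decision = "guess" if logits.argmax(dim=-1).item() == 1 else "ask"
```

Because the DM shares the encoder with the QGen and the Guesser, its cross-entropy loss can simply be added to the multi-task objective and backpropagated through the common state.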

8 Conclusion

We have developed a Questioner model for goal-oriented dialogue that jointly optimises the different tasks the agent faces. Our agent has to learn to ground language into vision, ask questions, and guess a target object in an image to play the GuessWhat?! game. By using a visually-grounded dialogue state encoder common to both the question generation and the guessing components, we have been able to apply a two-phase learning approach: a supervised regime followed by cooperative learning whereby the Questioner’s sub-modules jointly learn to play the overall agent’s role. By addressing a foundational weakness of previous models for the GuessWhat?! task, we have increased task accuracy by 20%. Analysis of the dialogues shows that our best model adapts its strategy to its own skills, while still being reasonably close to human behaviour. Analysis of the dialogue state representations suggests that the model distinguishes between different categories as expressed by different question types and links them to visual information. Using attention mechanisms [Zhuang et al.2018] to investigate this link further is a promising line for future research.


  • [Bordes, Boureau, and Weston2017] Bordes, A.; Boureau, Y.-L.; and Weston, J. 2017. Learning end-to-end goal oriented dialog. In Proceedings of ICLR.
  • [Clark and Wilkes-Gibbs1986] Clark, H. H., and Wilkes-Gibbs, D. 1986. Referring as a collaborative process. Cognition 22(1):1–39.
  • [Clark1996] Clark, H. H. 1996. Using Language. Cambridge University Press.
  • [Das et al.2017a] Das, A.; Kottur, S.; Gupta, K.; Singh, A.; Yadav, D.; Moura, J. M.; Parikh, D.; and Batra, D. 2017a. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [Das et al.2017b] Das, A.; Kottur, S.; Moura, J. M.; Lee, S.; and Batra, D. 2017b. Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision (ICCV).
  • [de Vries et al.2017] de Vries, H.; Strub, F.; Chandar, S.; Pietquin, O.; Larochelle, H.; and Courville, A. C. 2017. Guesswhat?! Visual object discovery through multi-modal dialogue. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • [Kim et al.2014] Kim, D.; Breslin, C.; Tsiakoulis, P.; Gašić, M.; Henderson, M.; and Young, S. 2014. Inverse reinforcement learning for micro-turn management. In Fifteenth Annual Conference of the International Speech Communication Association.
  • [Kingma and Ba2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.
  • [Lee, Heo, and Zhang2017a] Lee, S.-W.; Heo, Y.-J.; and Zhang, B.-T. 2017a. Answerer in questioner’s mind for goal-oriented visual dialogue. In NIPS Workshop on Visually-Grounded Interaction and Language (ViGIL). arXiv:1802.03881. Last version Feb. 2018.
  • [Lee, Heo, and Zhang2017b] Lee, S.-W.; Heo, Y.; and Zhang, B.-T. 2017b. Answerer in questioner’s mind for goal-oriented visual dialogue. In NIPS Workshop on Visually-Grounded Interaction and Language (ViGIL).
  • [Lewis1979] Lewis, D. 1979. Scorekeeping in a language game. Journal of Philosophical Logic 8(1):339–359.
  • [Li et al.2016a] Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-2016, 110–119.
  • [Li et al.2016b] Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; and Jurafsky, D. 2016b. Deep reinforcement learning for dialogue generation. In Proceedings of EMNLP.
  • [Lin et al.2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In Proceedings of ECCV (European Conference on Computer Vision).
  • [Serban et al.2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
  • [Shekhar et al.2018] Shekhar, R.; Baumgärtner, T.; Venkatesh, A.; Bruni, E.; Bernardi, R.; and Fernandez, R. 2018. Ask no more: Deciding when to guess in referential visual dialogue. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018) (to appear). arXiv:1805.06960.
  • [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [Sordoni et al.2015] Sordoni, A.; Galley, M.; Auli, M.; Brockett, C.; Ji, Y.; Mitchell, M.; Nie, J.-Y.; Gao, J.; and Dolan, B. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of NAACL-HLT, 196–205.
  • [Stalnaker1978] Stalnaker, R. 1978. Assertion. In Cole, P., ed., Pragmatics, volume 9 of Syntax and Semantics. New York Academic Press.
  • [Strub et al.2017] Strub, F.; de Vries, H.; Mary, J.; Piot, B.; Courville, A.; and Pietquin, O. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Joint Conference on Artificial Intelligence.
  • [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
  • [Vinyals and Le2015] Vinyals, O., and Le, Q. V. 2015. A neural conversational model. In ICML Deep Learning Workshop.
  • [Williams, Asadi, and Zweig2017] Williams, J.; Asadi, K.; and Zweig, G. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of ACL-2017. Association for Computational Linguistics.
  • [Williams et al.2013] Williams, J.; Raux, A.; Ramachandran, D.; and Black, A. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, 404–413.
  • [Young et al.2013] Young, S.; Gašić, M.; Thomson, B.; and Williams, J. D. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE 101(5).
  • [Yule1997] Yule, G. 1997. Referential communication tasks. Routledge.
  • [Zarrieß et al.2016] Zarrieß, S.; Hough, J.; Kennington, C.; Manuvinakurike, R.; DeVault, D.; Fernández, R.; and Schlangen, D. 2016. Pentoref: A corpus of spoken references in task-oriented dialogues. In 10th edition of the Language Resources and Evaluation Conference.
  • [Zhang et al.2017] Zhang, J.; Wu, Q.; Shen, C.; Zhang, J.; and Lu, J. 2017. Asking the difficult questions: Goal-oriented visual question generation via intermediate rewards. arXiv:1711.07614.
  • [Zhao and Eskenazi2016] Zhao, T., and Eskenazi, M. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In Proceedings of SIGDIAL-2016.
  • [Zhu, Zhang, and Metaxas2017] Zhu, Y.; Zhang, S.; and Metaxas, D. 2017. Interactive reinforcement learning for object grounding via self-talking. In NIPS Workshop on Visually-Grounded Interaction and Language (ViGIL).
  • [Zhuang et al.2018] Zhuang, B.; Wu, Q.; Shen, C.; Reid, I.; and van den Hengel, A. 2018. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Details on the GuessWhat?! dataset

The GuessWhat?! dataset contains 77,973 images with 609,543 objects. It consists of around 155K dialogues with around 821K question/answer pairs, composed from a vocabulary of 4,900 words (each occurring at least 5 times), on 66,537 unique images and 134,073 objects. Answers are Yes (52.2%), No (45.6%) and NA (not applicable, 2.2%); dialogues contain on average 5.2 questions, and there are on average 2.3 dialogues per image. Dialogues are successful (84.6%), unsuccessful (8.4%) or not completed (7.0%).

Models and Experimental Setting


Figure 5 illustrates the architecture of the Oracle model by [de Vries et al.2017], which we re-implemented for our study.

Details on our GDSE model

As explained in Section 4.2, our architecture consists of a visually-grounded dialogue state Encoder, a Question Generator (QGen) module and a Guesser module. Both the QGen and the Guesser are trained using ground-truth data. The Encoder is updated during the training of both the QGen and the Guesser through backpropagation.

In the Encoder, the dialogue history is represented by word embeddings of dimension 512, which are processed by an LSTM with a hidden layer of 1024 dimensions. The LSTM output is then concatenated with visual features pre-computed from the penultimate layer of ResNet-152, which have a dimension of 2048. To pre-compute the visual features, all images are resized to 224×224 and passed through ResNet-152. The combined features (1024 + 2048 = 3072) are then passed through a linear layer that scales them down to 512 dimensions, followed by a tanh non-linearity; the resulting representation is the final encoder output, which is the input to the QGen and the Guesser.
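The Encoder just described can be sketched in PyTorch as follows; this is a minimal reading of the text above (class and argument names are our assumptions), with ResNet-152 features assumed pre-computed:

```python
import torch
import torch.nn as nn

class GroundedDialogueEncoder(nn.Module):
    """Sketch of the visually-grounded dialogue state Encoder: 512-d
    word embeddings -> LSTM(1024) -> concat with 2048-d ResNet-152
    features -> linear down to 512 -> tanh."""
    def __init__(self, vocab_size=4900, emb_dim=512, lstm_dim=1024,
                 visual_dim=2048, state_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True)
        self.project = nn.Linear(lstm_dim + visual_dim, state_dim)

    def forward(self, token_ids, visual_feats):
        # token_ids: (batch, seq_len); visual_feats: (batch, 2048)
        _, (h, _) = self.lstm(self.embed(token_ids))
        combined = torch.cat([h[-1], visual_feats], dim=-1)  # (batch, 3072)
        return torch.tanh(self.project(combined))            # (batch, 512)

enc = GroundedDialogueEncoder()
state = enc(torch.zeros(2, 7, dtype=torch.long), torch.zeros(2, 2048))
```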

Figure 5: Oracle model.

The QGen acts as the decoder in a seq2seq model: an LSTM with a hidden layer of 512 dimensions. As in standard seq2seq models, the Encoder representation is used as the initial hidden state of the decoder. At each step, the QGen receives as input the word embedding (size 512) concatenated with image features scaled down from 2048 to 512 dimensions. For each candidate object, the Guesser receives the representation of the object category, viz., a dense category embedding obtained from its one-hot class vector using a learned look-up table, together with its spatial representation, viz., an 8-dimensional vector.
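A hedged sketch of the Guesser is given below. The object representation (category embedding plus 8-d spatial vector) follows the text above; the MLP sizes and the dot-product scoring against the dialogue state are our assumptions, as the text does not fully specify them:

```python
import torch
import torch.nn as nn

class Guesser(nn.Module):
    """Sketch: each candidate object is a learned category embedding
    concatenated with an 8-d spatial vector, mapped by an MLP into the
    dialogue-state space; objects are scored by dot product with the
    encoder's dialogue state and normalised with a softmax."""
    def __init__(self, n_categories=90, cat_dim=256, spatial_dim=8,
                 state_dim=512):
        super().__init__()
        self.cat_embed = nn.Embedding(n_categories, cat_dim)
        self.mlp = nn.Sequential(
            nn.Linear(cat_dim + spatial_dim, state_dim), nn.ReLU(),
            nn.Linear(state_dim, state_dim))

    def forward(self, dialogue_state, categories, spatial):
        # dialogue_state: (batch, 512); categories: (batch, n_obj);
        # spatial: (batch, n_obj, 8)
        objs = torch.cat([self.cat_embed(categories), spatial], dim=-1)
        obj_repr = self.mlp(objs)                       # (batch, n_obj, 512)
        scores = (obj_repr @ dialogue_state.unsqueeze(-1)).squeeze(-1)
        return scores.softmax(dim=-1)                   # prob. per object

g = Guesser()
probs = g(torch.zeros(1, 512),
          torch.zeros(1, 5, dtype=torch.long),
          torch.zeros(1, 5, 8))
```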

The QGen and the Guesser are optimised during training by minimising the negative log-likelihood of the generated question words and of the correct object selection, respectively. Both are trained using the Adam optimizer with a learning rate of 0.0001. All hyperparameters are tuned on the validation set. Training is stopped when there is no improvement in validation game accuracy for 10 consecutive epochs, and the best epoch is kept.
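The stopping criterion above amounts to patience-based early stopping; a minimal sketch with hypothetical helper names (`train_epoch`, `validate` are placeholders, not the paper's code):

```python
def train_with_early_stopping(train_epoch, validate, patience=10):
    """Run training epochs, tracking validation game accuracy; stop
    once `patience` consecutive epochs bring no improvement and return
    the best epoch and its accuracy."""
    best_acc, best_epoch, epoch = -1.0, -1, 0
    while epoch - best_epoch <= patience:
        train_epoch()
        acc = validate()
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        epoch += 1
    return best_epoch, best_acc
```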

Additional Results

Here, we report additional results for the New Object setting and show the utility of the decision-making component (DM). As shown in Table 3, the introduction of the new architecture trained with SL (GDSE-SL) yields an increase in accuracy of 9.5% (from 43.5 to 53.0) with respect to the Baseline model. The introduction of the cooperative learning approach (GDSE-CL) brings a further improvement over GDSE-SL: 10.3% (from 53.0 to 63.3). Overall, with respect to the Baseline we obtain an improvement of 19.8% (from 43.5 to 63.3). Our GDSE-CL model's performance (63.3%) is better than or comparable to that of reinforcement-learning-based models: [Strub et al.2017] (60.3%) and [Zhang et al.2017] (63.9%), respectively.

In Table 4, we report detailed results for all our models, including those extended with a decision-making component (DM), as explained in Section 7 of the paper. Here, we analyse the utility of the DM module. We report performance with a maximum of 5 and 10 questions, because the Baseline model performs best with a maximum of 5 questions (5Q), while our GDSE performs best with a maximum of 10 questions (10Q). From Table 4, we can observe that the GDSE model with the DM component outperforms the Baseline in all cases while asking comparatively fewer questions. In the SL setting, it outperforms the Baseline by 4.3% for 5Q (41.2 vs 45.5) with on average only 3.93 questions, and by 8.3% for 10Q (39.8 vs 48.1) with on average only 5.32 questions. Similarly, in the CL setting, it outperforms the Baseline by 11.0% for 5Q (41.2 vs 52.2) with on average only 4.03 questions, and by 15.4% for 10Q (39.8 vs 55.2) with on average only 5.77 questions. With respect to the GDSE models without DM, there is a drop in accuracy. However, looking at the games (see Figures 8 and 9), the quality of the dialogues is better with DM than without: the GDSE-DM dialogues are more natural, as the model stops the dialogue when enough information has been acquired.

Model Accuracy
Baseline 43.5
GDSE-SL 53.0
GDSE-CL 63.3
Strub et al. 2017 60.3
Zhang et al. 2017 63.9
Table 3: Test set accuracy scores on task success in the New Objects setting for each model with its best performing settings.
Model 5Q 10Q
Baseline 41.2 39.8
GDSE-SL 47.8 51.2
GDSE-CL 54.8 60.7
GDSE-DM-SL 45.5 (3.93) 48.1 (5.32)
GDSE-DM-CL 52.2 (4.03) 55.2 (5.77)
Table 4: Accuracy scores on the test set (New Image). For models with a decision-making component (DM), the average number of questions is given in brackets.


We provide further details and visualizations related to the analyses carried out in Section 6 of the main paper.

Input: Question Q and annotated words (from Table 1).
Output: Question classification T

Let W denote the set of words in question Q.
Let K_color, K_shape, K_size, K_texture, K_location, K_action, K_object and K_super denote the keyword lists for ‘Color’, ‘Shape’, ‘Size’, ‘Texture’, ‘Location’, ‘Action’, ‘Object’ and ‘Super-category’, respectively.
Let T be an empty list.
for each attribute class c in (Color, Shape, Size, Texture, Location, Action) do
    for each word w in W do
        if w ∈ K_c then
            append c to T
            break
        end if
    end for
end for
if T is empty then
    for each word w in W do
        if w ∈ K_object then
            assign object to T
            break
        end if
    end for
end if
if T is empty then
    for each word w in W do
        if w ∈ K_super then
            assign super-category to T
            break
        end if
    end for
end if
if T is empty then
    assign not-classified to T
end if
Algorithm 1: Question Classification.

Question Classification

Question classification makes use of keywords. These keywords have been annotated using information in the MS-COCO dataset plus manual annotation. First, we created the possible question categories by inspecting the human dialogues. As explained in the paper, the resulting categories are Entity, subdivided into Super-category and Object, and Attribute, sub-divided into Color, Location, Shape, Size, Texture and Action. We exploited the super-category and object annotations from MS-COCO. To further enrich these annotations, we manually annotated the words in the human dialogues that occur at least 40 times in the training and testing sets. In Table 7, we report the complete list of keywords highlighting those obtained from COCO. Algorithm 1 provides the pseudo-code of the question classification heuristics we used. Table 5 provides some examples of the resulting classification.
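Algorithm 1 can be made runnable directly; the sketch below uses small illustrative subsets of the keyword lists in Table 7 (the action keywords are our assumption, as that list is not reproduced here), not the full annotation:

```python
# Keyword subsets (illustrative; see Table 7 for the full lists).
ATTRIBUTE_KEYWORDS = {
    "color": {"white", "red", "black", "blue", "color"},
    "shape": {"round", "square", "shape"},
    "size": {"small", "little", "big", "large"},
    "texture": {"wooden", "metal", "striped"},
    "location": {"left", "right", "middle", "front"},
    "action": {"wearing", "holding", "sitting"},  # assumed subset
}
OBJECT_KEYWORDS = {"basket", "plate", "shirt", "bench", "chair"}
SUPER_KEYWORDS = {"human", "animal", "vehicle", "food", "furniture"}

def classify_question(question):
    """Classify a question following Algorithm 1: collect all matching
    attribute types; if none, fall back to object, then super-category,
    then not-classified."""
    words = question.lower().rstrip("?").split()
    types = [cls for cls, keywords in ATTRIBUTE_KEYWORDS.items()
             if any(w in keywords for w in words)]
    if not types and any(w in OBJECT_KEYWORDS for w in words):
        types = ["object"]
    if not types and any(w in SUPER_KEYWORDS for w in words):
        types = ["super-category"]
    return types or ["not-classified"]
```

On the examples of Table 5, this reproduces the reported classes, e.g. "is it a basket?" → object and "is it a human?" → super-category.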

Question Question type
is it a basket? object
is it a human? super-category
is it the person in the middle? location
is the person wearing a white shirt? color
is it the round table? shape
is it the little plate? size
is he wearing a striped shirt? texture

Table 5: Examples of questions from the human dialogues with keywords in italics and the types assigned through our classification procedure.
entity 97.26 97.26 96.64 78.48
attribute 1.72 1.39 2.54 13.95
object 22.03 28.52 28.22 26.16
super-cat 75.24 68.74 68.43 52.32
Table 6: Percentages of question type of the first question in the dialogues.
Super-category ‘person’, ‘vehicle’, ‘outdoor’, ‘animal’, ‘accessory’, ‘sports’, ‘kitchen’, ‘food’, ‘furniture’, ‘electronic’, ‘appliance’, ‘indoor’, ‘utensil’, ‘human’, ‘cloth’, ‘cloths’, ‘clothing’, ‘people’, ‘persons’
object ‘bicycle’, ‘car’, ‘motorcycle’, ‘airplane’, ‘bus’, ‘train’, ‘truck’, ‘boat’, ‘traffic light’, ‘fire hydrant’, ‘stop sign’, ‘parking meter’, ‘bench’, ‘bird’, ‘cat’, ‘dog’, ‘horse’, ‘sheep’, ‘cow’, ‘elephant’, ‘bear’, ‘zebra’, ‘giraffe’, ‘backpack’, ‘umbrella’, ‘handbag’, ‘tie’, ‘suitcase’, ‘frisbee’, ‘skis’, ‘snowboard’, ‘sports ball’, ‘kite’, ‘baseball bat’, ‘baseball glove’, ‘skateboard’, ‘surfboard’, ‘tennis racket’, ‘bottle’, ‘wine glass’, ‘cup’, ‘fork’, ‘knife’, ‘spoon’, ‘bowl’, ‘banana’, ‘apple’, ‘sandwich’, ‘orange’, ‘broccoli’, ‘carrot’, ‘hot dog’, ‘pizza’, ‘donut’, ‘cake’, ‘chair’, ‘couch’, ‘potted plant’, ‘bed’, ‘dining table’, ‘toilet’, ‘tv’, ‘laptop’, ‘mouse’, ‘remote’, ‘keyboard’, ‘cell phone’, ‘microwave’, ‘oven’, ‘toaster’, ‘sink’, ‘refrigerator’, ‘book’, ‘clock’, ‘vase’, ‘scissors’, ‘teddy bear’, ‘hair drier’, ‘toothbrush’, ‘meter’, ‘bear’, ‘cell’, ‘phone’, ‘wine’, ‘glass’, ‘racket’, ‘baseball’, ‘glove’, ‘hydrant’, ‘drier’, ‘kite’, ‘sofa’, ‘fork’, ‘adult’, ‘arms’, ‘baby’, ‘bag’, ‘ball’, ‘bananas’, ‘basket’, ‘bat’, ‘batter’, ‘bike’, ‘birds’, ‘board’, ‘body’, ‘books’, ‘bottles’, ‘box’, ‘boy’, ‘bread’, ‘brush’, ‘building’, ‘bunch’, ‘cabinet’, ‘camera’, ‘candle’, ‘cap’, ‘carrots’, ‘cars’, ‘cart’, ‘case’, ‘catcher’, ‘cell phone’, ‘chairs’, ‘child’, ‘chocolate’, ‘coat’, ‘coffee’, ‘computer’, ‘controller’, ‘counter’, ‘cows’, ‘cupboard’, ‘cups’, ‘curtain’, ‘cycle’, ‘desk’, ‘device’, ‘dining table’, ‘dish’, ‘doll’, ‘door’, ‘dress’, ‘driver’, ‘equipment’, ‘eyes’, ‘fan’, ‘feet’, ‘female’, ‘fence’, ‘fire’, ‘flag’, ‘flower’, ‘flowers’, ‘foot’, ‘frame’, ‘fridge’, ‘fruit’, ‘girl’, ‘girls’, ‘glasses’, ‘guy’, ‘guys’, ‘hair drier’, ‘handle’, ‘hands’, ‘hat’, ‘helmet’, ‘house’, ‘jacket’, ‘jar’, ‘jeans’, ‘kid’, ‘kids’, ‘lady’, ‘lamp’, ‘leg’, ‘legs’, ‘luggage’, ‘machine’, ‘male’, ‘man’, ‘meat’, ‘men’, ‘mirror’, ‘mobile’, ‘monitor’, ‘mouth’, ‘mug’, ‘napkin’, ‘pan’, ‘pants’, ‘paper’, ‘pen’, ‘picture’, ‘pillow’, ‘plant’, ‘plate’, ‘player’, ‘players’, ‘pole’, ‘pot’, ‘purse’, ‘rack’, ‘racket’, ‘road’, ‘roof’, ‘screen’, ‘shelf’, ‘shelves’, ‘shirt’, ‘shoe’, ‘shoes’, ‘short’, ‘shorts’, ‘shoulder’, ‘signal’, ‘sign’, ‘silverware’, ‘skate’, ‘ski’, ‘sky’, ‘snow’, ‘soap’, ‘speaker’, ‘stairs’, ‘statue’, ‘stick’, ‘stool’, ‘stove’, ‘street’, ‘suit’, ‘sunglasses’, ‘suv’, ‘teddy’, ‘tennis’, ‘tent’, ‘tomato’, ‘towel’, ‘tower’, ‘toy’, ‘traffic’, ‘tray’, ‘tree’, ‘trees’, ‘t-shirt’, ‘tshirt’, ‘vegetable’, ‘vest’, ‘wall’, ‘watch’, ‘wheel’, ‘wheels’, ‘window’, ‘windows’, ‘woman’, ‘women’

Color ‘white’, ‘red’, ‘black’, ‘blue’, ‘green’, ‘yellow’, ‘orange’, ‘brown’, ‘pink’, ‘grey’, ‘gray’, ‘dark’, ‘purple’, ‘color’, ‘colored’, ‘colour’, ‘blond’, ‘beige’, ‘bright’
Size ‘small’, ‘little’, ‘long’, ‘large’, ‘largest’, ‘big’, ‘tall’, ‘smaller’, ‘bigger’, ‘biggest’, ‘tallest’
Texture ‘metal’, ‘silver’, ‘wood’, ‘wooden’, ‘plastic’, ‘striped’, ‘liquid’
Shape ‘circle’, ‘rectangle’, ‘round’, ‘shape’, ‘square’, ‘triangle’
Location ‘1st’, ‘2nd’, ‘third’, ‘3’, ‘3rd’, ‘four’, ‘4th’, ‘fourth’, ‘5’, ‘5th’, ‘five’, ‘first’, ‘second’, ‘last’, ‘above’, ‘across’, ‘after’, ‘around’, ‘at’, ‘away’, ‘back’, ‘background’, ‘before’, ‘behind’, ‘below’, ‘beside’, ‘between’, ‘bottom’, ‘center’, ‘close’, ‘closer’, ‘closest’, ‘corner’, ‘directly’, ‘down’, ‘edge’, ‘end’, ‘entire’, ‘facing’, ‘far’, ‘farthest’, ‘floor’, ‘foreground’, ‘from’, ‘front’, ‘furthest’, ‘ground’, ‘hidden’, ‘in’, ‘inside’, ‘left’, ‘leftmost’, ‘middle’, ‘near’, ‘nearest’, ‘next’, ‘next to’, ‘off’, ‘on’, ‘out’, ‘outside’, ‘over’, ‘part’, ‘right’, ‘rightmost’, ‘row’, ‘side’, ‘smaller’, ‘top’, ‘towards’, ‘under’, ‘up’, ‘upper’, ‘with’
Table 7: Lists of keywords used to classify questions with the corresponding class according to Algorithm 1. Words in italics come from COCO object category/super-category.

Dialogue Structure and Policy

Table 6 shows how dialogues start, i.e., the percentages of the type of questions used right at the beginning of a game.

The heat maps in Figure 6 show the structure of the dialogue policies followed by humans and by the different models over the course of the dialogues. The maps have been built by computing the percentages of questions per type and position in the dialogue over all games in the test set.

(a) Baseline (b) Human (c) GDSE-SL (d) GDSE-CL
Figure 6: Structure of the dialogues by the Baseline model (Fig. 6(a)) compared with Human (Fig. 6(b)), our GDSE-SL (Fig. 6(c)) and GDSE-CL (Fig. 6(d)). Each cell in the heat map shows the percentage over all questions in the test set. Each time step on the x-axis corresponds to a question-answer pair.

Question Repetition

In Table 8, we look at question repetitions, detected by exact string matching. We see that in the games played by the Baseline (BL), more than 98% contain some form of repetition. For GDSE, games played in the supervised learning setting have more repetition than in the cooperative learning one (66.24% vs. 59.97%). Further, the DM module reduces repetition by more than 31 percentage points.

We further examined where repetitions occur, i.e., at the start or towards the end of the dialogue. For this, we computed whether there is any repetition after a given question in the dialogue. For the Baseline, around 75% of the dialogues have repetitions towards the end of the dialogue. This also explains why 5 is the optimal number of questions for the Baseline. Even for the GDSE-based models, around 50% of the repetitions happen towards the end of the dialogue. Interestingly, the majority of the repetitions are consecutive question repetitions. These consecutive questions do not add any new information, which speaks in favour of the DM, where repetition is comparatively very low (see Table 8).
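The counts in Table 8 can be reproduced with simple string matching over the generated questions; a sketch with a hypothetical helper (the input layout is our assumption):

```python
def repetition_stats(dialogues):
    """Compute the percentage of games with at least one repeated
    question and with at least one consecutive repetition, using exact
    string matching as in the analysis above. `dialogues` is a list of
    question lists, one per game."""
    n = len(dialogues)
    any_rep = sum(1 for qs in dialogues if len(set(qs)) < len(qs))
    consec_rep = sum(1 for qs in dialogues
                     if any(a == b for a, b in zip(qs, qs[1:])))
    return {"% games with a repetition": 100 * any_rep / n,
            "% games with a consecutive repetition": 100 * consec_rep / n}
```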

Model | % games having at least 1 repetition | % games having at least 1 consecutive repetition | % games having at least 1 repetition after a given question | % games having at least 1 consecutive repetition after a given question
Baseline 98.08 72.26 74.75 59.75
GDSE-SL 66.24 42.83 30.25 24.34
GDSE-CL 59.97 44.44 32.62 26.86
GDSE-DM-SL 34.82 24.97 14.44 12.51
GDSE-DM-CL 23.91 16.83 10.04 7.77
Table 8: Percentages of repeated questions in all games for different models.

Dialogue State Representations

In Table 9, we provide the regression coefficients of the linear regression models described in Section 6.3.

reg_state_change reg_target_prob
type of object 0.0252836 0.0571792 0.0122921 0.0224780
type of super-category 0.2970974 0.3024038 0.0045200 0.0395073
type of Yes 0.0431628 0.0552425 0.0061232 0.0113060
index -0.0444708 -0.0417552 -0.0052097 -7.035e-03
type of Yes : type of object 0.0902804 3.002e-01 0.0303864 1.005e-01
type of Yes : type of super-category 0.1193137 2.803e-01 -0.0099801 7.391e-02
Table 9: Estimated regression coefficients for the linear regression models. A positive/negative coefficient indicates a positive/negative correlation between a predictor and the dependent variable. The contribution of all predictors is statistically significant in all models.

In Figure 7, we show the cosine distance and the increase in the probability of the target object at every new question-answer pair. We can observe that after the initial questions, the target object probability is already high. Subsequently, the change in probability depends on the answer to the next question: a Yes answer brings a larger positive change than a No. If questions are repeated, as in the GDSE-SL dialogue, there is very little change in the probability of the target object. As for state change, initially there is a large distance between the states of consecutive questions. When questions are repeated (GDSE-SL), the state remains almost the same, whereas with new questions the state always changes, and changes are comparatively steep when the answer goes from No to Yes.

(a) Image and corresponding dialogue produced by GDSE-SL and GDSE-CL, respectively.


Figure 7: Cosine distance (Blue) and change in target object probability (Green) between consecutive questions.

Quality of the Dialogues

Though the accuracy of GDSE decreases when the decision-making component is added (see Table 4), the quality of the games improves. As Figures 8 and 9 show, the dialogues become more natural and the model stops asking questions when enough information has been acquired.

Figure 8: Example of a game in which the fixed number of questions is a disadvantage and GDSE-DM-SL properly decides when to stop. Note also the Attribute questions asked by all the models after identifying the target Object category.
Figure 9: Example of a dialogue where GDSE-DM-SL stopped asking questions after gathering enough information to find the target object, while the other models had to ask the maximum number of questions. In particular, GDSE-SL and GDSE-CL should have stopped after fewer questions.