The focus of this paper is visually-grounded conversational artificial intelligence (AI). Specifically, we would like to develop agents that can ‘see’ (i.e., understand the contents of an image) and ‘communicate’ that understanding in natural language (i.e., hold a dialog involving questions and answers about that image). We believe the next generation of intelligent systems will need to possess this ability to hold a dialog about visual content for a variety of applications: e.g., helping visually impaired users understand their surroundings or social media content (‘Who is in the photo? Dave. What is he doing?’), enabling analysts to sift through large quantities of surveillance data (‘Did anyone enter the vault in the last month? Yes, there are 103 recorded instances. Did any of them pick something up?’), and enabling users to interact naturally with intelligent assistants (either embodied as a robot or not) (‘Did I leave my phone on my desk? Yes, it’s here. Did I miss any calls?’).
Despite rapid progress at the intersection of vision and language, in particular, in image/video captioning [34, 3, 37, 12, 32, 33] and question answering [1, 24, 21, 30, 31], it is clear we are far from this grand goal of a visual dialog agent.
Two recent works [4, 5] have proposed studying this task of visually-grounded dialog. Perhaps somewhat counterintuitively, both these works treat dialog as a static supervised learning problem, rather than the interactive agent learning problem that it naturally is. Specifically, both works [4, 5] first collect a dataset of human-human dialog, i.e., a sequence of question-answer pairs $(q_1, a_1), \ldots, (q_T, a_T)$ about an image $I$. Next, a machine (a deep neural network) is provided with the image $I$, the human dialog recorded till round $t-1$, $(q_1, a_1), \ldots, (q_{t-1}, a_{t-1})$, the follow-up question $q_t$, and is supervised to generate the human response $a_t$. Essentially, at each round $t$, the machine is artificially ‘injected’ into the conversation between two humans and asked to answer the question $q_t$; but the machine’s answer $\hat{a}_t$ is thrown away, because at the next round $t+1$, the machine is again provided with the ‘ground-truth’ human-human dialog that includes the human response $a_t$ and not the machine response $\hat{a}_t$. Thus, the machine is never allowed to steer the conversation because that would take the dialog out of the dataset, making it non-evaluable.
In this paper, we generalize the task of Visual Dialog beyond the necessary first stage of supervised learning – by posing it as a cooperative ‘image guessing’ game between two dialog agents. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end – from pixels to multi-agent multi-round dialog to the game reward.
Our setup is illustrated in Fig. 1. We formulate a game between a questioner bot (Q-bot) and an answerer bot (A-bot). Q-bot is shown a 1-sentence description (a caption) of an unseen image, and is allowed to communicate in natural language (discrete symbols) with the answering bot (A-bot), who is shown the image. The objective of this fully-cooperative game is for Q-bot to build a mental model of the unseen image purely from the natural language dialog, and then retrieve that image from a lineup of images.
Notice that this is a challenging game. Q-bot must ground the words mentioned in the provided caption (‘Two zebra are walking around their pen at the zoo.’), estimate which images from the provided pool contain this content (there will typically be many such images since captions describe only the salient entities), and ask follow-up questions (‘Any people in the shot? Are there clouds in the sky? Are they facing each other?’) that help it identify the correct image.
Analogously, A-bot must build a mental model of what Q-bot understands, and answer questions (‘No, there aren’t any. I can’t see the sky. They aren’t.’) in a precise enough way to allow discrimination between similar images from a pool (that A-bot does not have access to) while being concise enough to not confuse the imperfect Q-bot.
At every round of dialog, Q-bot listens to the answer provided by A-bot, updates its beliefs, and makes a prediction about the visual representation of the unseen image (specifically, the fc7 vector of $I$), and receives a reward from the environment based on how close Q-bot’s prediction is to the true fc7 representation of $I$. The goal of Q-bot and A-bot is to communicate to maximize this reward. One critical issue is that both agents are imperfect and noisy – both ‘forget’ things in the past, sometimes repeat themselves, may not stay consistent in their responses, A-bot does not have access to an external knowledge-base so it cannot answer all questions, etc. Thus, to succeed at the task, they must learn to play to each other’s strengths.
An important question to ask is – why force the two agents to communicate in discrete symbols (English words) as opposed to continuous vectors? The reason is twofold. First, discrete symbols and natural language are interpretable. By forcing the two agents to communicate and understand natural language, we ensure that humans can not only inspect the conversation logs between two agents, but more importantly, communicate with them. After the two bots are trained, we can pair a human questioner with A-bot to accomplish the goals of visual dialog (aiding visually/situationally impaired users), and pair a human answerer with Q-bot to play a visual 20-questions game. The second reason to communicate in discrete symbols is to prevent cheating – if Q-bot and A-bot are allowed to exchange continuous vectors, then the trivial solution is for A-bot to ignore Q-bot’s question and directly convey the fc7 vector for $I$, allowing Q-bot to make a perfect prediction. In essence, discrete natural language is an interpretable low-dimensional “bottleneck” layer between these two agents.
Contributions. We introduce a novel goal-driven training for visual question answering and dialog agents. Despite significant popular interest in VQA (over 200 works citing  since 2015), all previous approaches have been based on supervised learning, making this the first instance of goal-driven training for visual question answering / dialog.
We demonstrate two experimental results.
First, as a ‘sanity check’ demonstration of pure RL (from scratch), we show results on a diagnostic task where perception is perfect – a synthetic world with ‘images’ containing a single object defined by three attributes (shape/color/style). In this synthetic world, for Q-bot to identify an image, it must learn about these attributes. The two bots communicate via an ungrounded vocabulary, i.e., symbols with no pre-specified human-interpretable meanings (‘X’, ‘Y’, ‘1’, ‘2’). When trained end-to-end with RL on this task, we find that the two bots invent their own communication protocol – Q-bot starts using certain symbols to query for specific attributes (‘X’ for color), and A-bot starts responding with specific symbols indicating the value of that attribute (‘1’ for red). Essentially, we demonstrate the automatic emergence of grounded language and communication among ‘visual’ dialog agents with no human supervision!
Second, we conduct large-scale real-image experiments on the VisDial dataset. With imperfect perception on real images, discovering a human-interpretable language and communication strategy from scratch is both tremendously difficult and an unnecessary re-invention of English. Thus, we pretrain with supervised dialog data in VisDial before ‘fine tuning’ with RL; this alleviates a number of challenges in making deep RL converge to something meaningful. We show that these RL fine-tuned bots significantly outperform the supervised bots. Most interestingly, while the supervised Q-bot attempts to mimic how humans ask questions, the RL trained Q-bot shifts strategies and asks questions that the A-bot is better at answering, ultimately resulting in more informative dialog and a better team.
2 Related Work
Vision and Language. A number of problems at the intersection of vision and language have recently gained prominence, e.g., image captioning [7, 34, 13, 6], and visual question answering (VQA) [20, 1, 24, 9, 21]. Most related to this paper are two recent works on visually-grounded dialog [4, 5]. Das et al. proposed the task of Visual Dialog, collected the VisDial dataset by pairing two subjects on Amazon Mechanical Turk to chat about an image (with assigned roles of ‘Questioner’ and ‘Answerer’), and trained neural visual dialog answering models. De Vries et al. extended the Referit game to a ‘GuessWhat’ game, where one person asks questions about an image to guess which object has been ‘selected’, and the second person answers questions in ‘yes’/‘no’/NA (natural language answers are disallowed). One disadvantage of GuessWhat is that it requires bounding box annotations for objects; our image guessing game does not need any such annotations and thus an unlimited number of game plays may be simulated. Moreover, as described in Sec. 1, both these works unnaturally treat dialog as a static supervised learning problem. Although both datasets contain thousands of human dialogs, they still only represent an incredibly sparse sample of the vast space of visually-grounded questions and answers. Training robust, visually-grounded dialog agents via supervised techniques is still a challenging task.
In our work, we take inspiration from the AlphaGo approach of supervision from human-expert games and reinforcement learning from self-play. Similarly, we perform supervised pretraining on human dialog data and fine-tune in an end-to-end goal-driven manner with deep RL.
20 Questions and Lewis Signaling Game. Our proposed image-guessing game is naturally the visual analog of the popular 20-questions game. More formally, it is a generalization of the Lewis Signaling (LS) game, widely studied in economics and game theory. LS is a cooperative game between two players – a sender and a receiver. In the classical setting, the world can be in one of a finite number of discrete states $\{1, 2, \ldots, N\}$, which is known to the sender but not the receiver. The sender can send one of $N$ discrete symbols/signals to the receiver, who upon receiving the signal must take one of $N$ discrete actions. The game is perfectly cooperative, and one simple (though not unique) Nash Equilibrium is the ‘identity mapping’, where the sender encodes each world state with a bijective signal, and similarly the receiver has a bijective mapping from a signal to an action.
Our proposed ‘image guessing’ game is a generalization of LS with Q-bot being the receiver and A-bot the sender. However, in our proposed game, the receiver (Q-bot) is not passive. It actively solicits information by asking questions. Moreover, the signaling process is not ‘single shot’, but proceeds over multiple rounds of conversation.
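As a minimal sketch of the classical single-shot LS game, the snippet below enumerates sender and receiver mappings for a hypothetical $N = 3$ and checks which pairs achieve full payoff. The payoff rule (1 when the receiver's action equals the world state, 0 otherwise), the uniform distribution over states, and the name `expected_payoff` are assumptions made for illustration only.

```python
from itertools import permutations

N = 3  # hypothetical number of world states, signals, and actions


def expected_payoff(sender, receiver):
    """Average payoff over uniformly drawn states.

    sender[state] -> signal, receiver[signal] -> action.
    Assumed payoff: 1 if the action equals the state, else 0.
    """
    hits = sum(1 for state in range(N) if receiver[sender[state]] == state)
    return hits / N


# The 'identity mapping': state i is sent as signal i, which triggers action i.
identity = tuple(range(N))
identity_payoff = expected_payoff(identity, identity)

# Any bijective sender code paired with its inverse receiver map is equally good,
# which is why the identity mapping is a simple but not unique solution.
optimal_pairs = [
    (s, r)
    for s in permutations(range(N))
    for r in permutations(range(N))
    if expected_payoff(s, r) == 1.0
]
```

Under these assumptions there is one optimal receiver for each of the $N!$ bijective sender codes, so the agents only need to agree on a code; which one they pick is arbitrary.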
Text-only or Classical Dialog. Li et al. have proposed using RL for training dialog systems. However, they hand-define what a ‘good’ utterance/dialog looks like (non-repetition, coherence, continuity, etc.). In contrast, taking a cue from adversarial learning [10, 19], we set up a cooperative game between two agents, such that we do not need to hand-define what a ‘good’ dialog looks like – a ‘good’ dialog is one that leads to a successful image-guessing play.
Emergence of Language. There is a long history of work on language emergence in multi-agent systems. The more recent resurgence has focused on deep RL [8, 16, 11, 22]. The high-level ideas of these concurrent works are similar to our synthetic experiments. For our large-scale real-image results, we do not want our bots to invent their own uninterpretable language, so we use pretraining on VisDial to achieve ‘alignment’ with English.
3 Cooperative Image Guessing Game: In Full Generality and a Specific Instantiation
Players and Roles. The game involves two collaborative agents – a questioner bot (Q-bot) and an answerer bot (A-bot) – with an information asymmetry. A-bot sees an image $I$, Q-bot does not. Q-bot is primed with a 1-sentence description $c$ of the unseen image and asks ‘questions’ (sequences of discrete symbols over a vocabulary $V$), which A-bot answers with another sequence of symbols. The communication occurs for a fixed number of rounds.
Game Objective in General. At each round, in addition to communicating, Q-bot must provide a ‘description’ $\hat{y}$ of the unknown image $I$ based only on the dialog history, and both players receive a reward from the environment inversely proportional to the error in this description under some metric $\ell(\hat{y}, y^{gt})$, where $y^{gt}$ is the true description. We note that this is a general setting where the ‘description’ $\hat{y}$ can take on varying levels of specificity – from image embeddings (or fc7 vectors of $I$) to textual descriptions to pixel-level image generations.
Specific Instantiation. In our experiments, we focus on the setting where Q-bot is tasked with estimating a vector embedding of the image $I$. Given some feature extractor (i.e., a pretrained CNN model, say VGG-16), no human annotation is required to produce the target ‘description’ $y^{gt}$ (simply forward-prop the image through the CNN). Reward/error can be measured by simple Euclidean distance, and any image may be used as the visual grounding for a dialog. Thus, an unlimited number of ‘game plays’ may be simulated.
4 Reinforcement Learning for Dialog Agents
In this section, we formalize the training of two visual dialog agents (Q-bot and A-bot) with Reinforcement Learning (RL) – describing formally the action, state, environment, reward, policy, and training procedure. We begin by noting that although there are two agents (Q-bot, A-bot), since the game is perfectly cooperative, we can without loss of generality view this as a single-agent RL setup where the single “meta-agent” comprises two “constituent agents” communicating via a natural language bottleneck layer.
Action. Both agents share a common action space consisting of all possible output sequences under a token vocabulary $V$. This action space is discrete and, in principle, infinitely large, since arbitrary-length sequences may be produced and the dialog may go on forever. In our synthetic experiment, the two agents are given different vocabularies to coax a certain behavior to emerge (details in Sec. 5). In our VisDial experiments, the two agents share a common vocabulary of English tokens. In addition, at each round of the dialog $t$, Q-bot also predicts $\hat{y}_t$, its current guess about the visual representation of the unseen image. This component of Q-bot’s action space is continuous.
State. Since there is information asymmetry (A-bot can see the image $I$, Q-bot cannot), each agent has its own observed state. For a dialog grounded in image $I$ with caption $c$, the state of Q-bot at round $t$ is the caption and dialog history so far, $s^Q_t = [c, q_1, a_1, \ldots, q_{t-1}, a_{t-1}]$, and the state of A-bot also includes the image, $s^A_t = [I, c, q_1, a_1, \ldots, q_{t-1}, a_{t-1}, q_t]$.
Policy. We model Q-bot and A-bot operating under stochastic policies $\pi_Q(q_t \mid s^Q_t; \theta_Q)$ and $\pi_A(a_t \mid s^A_t; \theta_A)$, such that questions and answers may be sampled from these policies conditioned on the dialog/state history. These policies will be learned by two separate deep neural networks parameterized by $\theta_Q$ and $\theta_A$. In addition, Q-bot includes a feature regression network $f(\cdot)$ that produces an image representation prediction after listening to the answer at round $t$, i.e., $\hat{y}_t = f(s^Q_t, q_t, a_t; \theta_f) = f(s^Q_{t+1}; \theta_f)$. Thus, the goal of policy learning is to estimate the parameters $\theta_Q, \theta_A, \theta_f$.
Environment and Reward. The environment is the image $I$ upon which the dialog is grounded. Since this is a purely cooperative setting, both agents receive the same reward. Let $\ell(\cdot, \cdot)$ be a distance metric on image representations (Euclidean distance in our experiments), let $y^{gt}$ be the true representation of $I$, and let $\hat{y}_t$ be Q-bot’s prediction after round $t$ (with $\hat{y}_0$ the prediction before any dialog). At each round $t$, we define the reward for a state-action pair as:
$$r_t\big(s^Q_t, (q_t, a_t, \hat{y}_t)\big) = \ell(\hat{y}_{t-1}, y^{gt}) - \ell(\hat{y}_t, y^{gt}),$$
i.e., the change in distance to the true representation before and after a round of dialog. In this way, we consider a question-answer pair to be low quality (i.e., have a negative reward) if it leads the questioner to make a worse estimate of the target image representation than if the dialog had ended.
Note that the total reward summed over all time steps of a dialog is a function of only the initial and final states due to the cancellation of intermediate terms, i.e.,
$$\sum_{t=1}^{T} r_t\big(s^Q_t, (q_t, a_t, \hat{y}_t)\big) = \ell(\hat{y}_0, y^{gt}) - \ell(\hat{y}_T, y^{gt}).$$
This is again intuitive – ‘How much do the feature predictions of Q-bot improve due to the dialog?’ The details of policy learning are described in Sec. 4.2, but before that, let us describe the inner working of the two agents.
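The per-round reward and its telescoping sum can be illustrated with a short sketch. The 2-D vectors below are hypothetical stand-ins for fc7 features, and the function names are chosen for illustration; only the reward rule (distance before the round minus distance after it, under Euclidean distance) comes from the text.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def per_round_rewards(predictions, target):
    """Reward at round t = distance before the round minus distance after it.

    predictions[0] is the guess before any dialog; predictions[t] is the
    guess after round t.
    """
    dists = [euclidean(p, target) for p in predictions]
    return [dists[t - 1] - dists[t] for t in range(1, len(dists))]


# Hypothetical 2-D 'image features' standing in for fc7 vectors.
target = [0.0, 0.0]
predictions = [[3.0, 4.0], [0.0, 4.0], [0.0, 5.0], [0.0, 1.0]]

rewards = per_round_rewards(predictions, target)  # one reward per dialog round
total = sum(rewards)                              # telescopes to first minus last distance
```

In this toy run the second round moves the guess away from the target and so earns a negative reward, while the total depends only on the initial and final distances.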
4.1 Policy Networks for Q-bot and A-bot
Fig. 2 shows an overview of our policy networks for Q-bot and A-bot and their interaction within a single round of dialog. Both the agent policies are modeled via Hierarchical Recurrent Encoder-Decoder neural networks, which have recently been proposed for dialog modeling [26, 25, 4].
Q-bot consists of the following four components:
Fact Encoder: Q-bot asks a question $q_t$: ‘Are there any animals?’ and receives an answer $a_t$: ‘Yes, there are two elephants.’ Q-bot treats this concatenated $(q_t, a_t)$-pair as a ‘fact’ it now knows about the unseen image. The fact encoder is an LSTM whose final hidden state $F^Q_t$ is used as an embedding of $(q_t, a_t)$.
State/History Encoder is an LSTM that takes the encoded fact $F^Q_t$ at each time step to produce an encoding of the prior dialog including time $t$ as $S^Q_t$. Notice that this results in a two-level hierarchical encoding of the dialog, $(q_t, a_t) \rightarrow F^Q_t$ and $(F^Q_1, \ldots, F^Q_t) \rightarrow S^Q_t$.
Question Decoder is an LSTM that takes the state/history encoding from the previous round $S^Q_{t-1}$ and generates question $q_t$ by sequentially sampling words.
Feature Regression Network $f(\cdot)$ is a single fully-connected layer that produces an image representation prediction $\hat{y}_t$ from the current encoded state, $\hat{y}_t = f(S^Q_t)$.
Each of these components and their relation to each other are shown on the left side of Fig. 2. We collectively refer to the parameters of the three LSTM models as $\theta_Q$ and those of the feature regression network as $\theta_f$.
A-bot has a similar structure to Q-bot with slight differences since it also models the image $I$ via a CNN:
Question Encoder: A-bot receives a question $q_t$ from Q-bot and encodes it via an LSTM as $Q^A_t$.
Fact Encoder: Similar to Q-bot, A-bot also encodes the $(q_t, a_t)$-pairs via an LSTM to get $F^A_t$. The purpose of this encoder is for A-bot to remember what it has already told Q-bot and be able to understand references to entities already mentioned.
State/History Encoder is an LSTM that takes as input at each round $t$ – the encoded question $Q^A_t$, the image features $y$ from VGG, and the previous fact encoding $F^A_{t-1}$ – to produce a state encoding, i.e., $\big((y, Q^A_1, F^A_0), \ldots, (y, Q^A_t, F^A_{t-1})\big) \rightarrow S^A_t$. This allows the model to contextualize the current question w.r.t. the history while looking at the image to seek an answer.
Answer Decoder is an LSTM that takes the state encoding $S^A_t$ and generates $a_t$ by sequentially sampling words.
Our code will be publicly available.
To recap, a dialog round at time $t$ consists of 1) Q-bot generating a question $q_t$ conditioned on its state encoding $S^Q_{t-1}$, 2) A-bot encoding $q_t$, updating its state encoding $S^A_t$, and generating an answer $a_t$, 3) Q-bot and A-bot both encoding the completed exchange as $F^Q_t$ and $F^A_t$, and 4) Q-bot updating its state to $S^Q_t$ based on $F^Q_t$ and making an image representation prediction $\hat{y}_t$ for the unseen image.
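The control flow of one round can be sketched without any neural networks. In the snippet below the LSTM encoders and decoders are replaced by plain callables, each agent's state is simply its list of stored facts, and the 'prediction' is a string rather than a feature vector; all names (`play_round`, `toy_ask`, `toy_answer`, `toy_predict`) are hypothetical and serve only to show the order of the four steps.

```python
from typing import Callable, List, Tuple

Fact = Tuple[str, str]  # one completed (question, answer) exchange


def play_round(
    q_history: List[Fact],
    a_history: List[Fact],
    image: str,
    ask: Callable[[List[Fact]], str],
    answer: Callable[[str, str, List[Fact]], str],
    predict: Callable[[List[Fact]], str],
) -> str:
    """Run one dialog round and return Q-bot's updated guess."""
    question = ask(q_history)                   # 1) Q-bot asks, given its state so far
    reply = answer(image, question, a_history)  # 2) A-bot answers while seeing the image
    q_history.append((question, reply))         # 3) both bots store the completed exchange
    a_history.append((question, reply))
    return predict(q_history)                   # 4) Q-bot updates its guess


# Toy stand-ins for the learned components (assumptions, not the paper's models).
def toy_ask(history: List[Fact]) -> str:
    return "color?" if not history else "shape?"


def toy_answer(image: str, question: str, history: List[Fact]) -> str:
    color, shape = image.split()
    return color if question == "color?" else shape


def toy_predict(history: List[Fact]) -> str:
    return " ".join(reply for _, reply in history)


q_hist: List[Fact] = []
a_hist: List[Fact] = []
guesses = [
    play_round(q_hist, a_hist, "red square", toy_ask, toy_answer, toy_predict)
    for _ in range(2)
]
```

Note that only `toy_answer` receives the image, mirroring the information asymmetry between the two agents.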
4.2 Joint Training with Policy Gradients
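In order to train these agents, we use the REINFORCE algorithm that updates policy parameters in response to experienced rewards. In this section, we derive the expressions for the parameter gradients for our setup.
Recall that our agents take actions – communication $(q_t, a_t)$ and feature prediction $\hat{y}_t$ – and our objective is to maximize the expected reward under the agents’ policies, summed over the entire dialog:
$$J(\theta_A, \theta_Q, \theta_f) = \mathbb{E}_{\pi_Q, \pi_A}\left[\sum_{t=1}^{T} r_t\big(s^Q_t, (q_t, a_t, \hat{y}_t)\big)\right]$$
While the above is a natural objective, we find that considering the entire dialog as a single RL episode does not differentiate between individual good or bad exchanges within it. Thus, we update our model based on per-round rewards,
$$J(\theta_A, \theta_Q, \theta_f) = \mathbb{E}_{\pi_Q, \pi_A}\left[r_t\big(s^Q_t, (q_t, a_t, \hat{y}_t)\big)\right]$$
Following the REINFORCE algorithm, we can write the gradient of this expectation as an expectation of a quantity related to the gradient. For $\theta_Q$, we derive this explicitly, writing $r_t(\cdot)$ for $r_t\big(s^Q_t, (q_t, a_t, \hat{y}_t)\big)$:
$$\begin{aligned}
\nabla_{\theta_Q} J &= \nabla_{\theta_Q} \mathbb{E}_{\pi_Q, \pi_A}\left[r_t(\cdot)\right] \\
&= \nabla_{\theta_Q} \sum_{q_t, a_t} \pi_Q(q_t \mid s^Q_t)\, \pi_A(a_t \mid s^A_t)\, r_t(\cdot) \\
&= \sum_{q_t, a_t} \pi_Q(q_t \mid s^Q_t)\, \nabla_{\theta_Q} \log \pi_Q(q_t \mid s^Q_t)\, \pi_A(a_t \mid s^A_t)\, r_t(\cdot) \\
&= \mathbb{E}_{\pi_Q, \pi_A}\left[r_t(\cdot)\, \nabla_{\theta_Q} \log \pi_Q(q_t \mid s^Q_t)\right]
\end{aligned}$$
Similarly, the gradient w.r.t. $\theta_A$, i.e., $\nabla_{\theta_A} J$, can be derived as
$$\nabla_{\theta_A} J = \mathbb{E}_{\pi_Q, \pi_A}\left[r_t(\cdot)\, \nabla_{\theta_A} \log \pi_A(a_t \mid s^A_t)\right]$$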
As is standard practice, we estimate these expectations with sample averages. Specifically, we sample a question from Q-bot (by sequentially sampling words from the question decoder LSTM till a stop token is produced), sample its answer from A-bot, compute the scalar reward for this round, multiply that scalar reward by the gradient of the log-probability of this exchange, and propagate backward to compute gradients w.r.t. all parameters $\theta_Q, \theta_A$. This update has an intuitive interpretation – if a particular $(q_t, a_t)$ is informative (i.e., leads to positive reward), its probabilities will be pushed up (positive gradient). Conversely, a poor exchange leading to negative reward will be pushed down (negative gradient).
Finally, since the feature regression network $f(\cdot)$ forms a deterministic policy, its parameters $\theta_f$ receive ‘supervised’ gradient updates for differentiable $\ell(\cdot, \cdot)$.
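To make the sampled update concrete, here is a heavily simplified sketch: a single softmax policy over three candidate utterances, updated with one-sample REINFORCE. The real agents sample word sequences from LSTM decoders and there are two coupled policies, so this is only an illustration of the 'reward times gradient of log-probability' rule. The reward function, learning rate, and number of steps are assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, lr, rng):
    """Sample one action, then move the logits along reward * grad log pi(action)."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - probs[k].
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def toy_reward(action):
    """Hypothetical reward: utterance 2 is informative, the others are not."""
    return 1.0 if action == 2 else -1.0


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(500):
    logits = reinforce_step(logits, toy_reward, lr=0.1, rng=rng)
final_probs = softmax(logits)
```

Positive rewards raise the probability of the sampled utterance and negative rewards lower it, so the probability mass drifts toward the informative utterance over training.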
5 Emergence of Grounded Dialog
To succeed at our image guessing game, Q-bot and A-bot need to accomplish a number of challenging sub-tasks – they must learn a common language (do you understand what I mean when I say ‘person’?) and develop mappings between symbols and image representations (what does ‘person’ look like?), i.e., A-bot must learn to ground language in visual perception to answer questions and Q-bot must learn to predict plausible image representations – all in an end-to-end manner from a distant reward function. Before diving into the full task on real images, we conduct a ‘sanity check’ on a synthetic dataset with perfect perception to ask – is this even possible?
Setup. As shown in Fig. 3, we consider a synthetic world with ‘images’ represented as a triplet of attributes – 4 shapes, 4 colors, 4 styles – for a total of 64 unique images. A-bot has perfect perception and is given direct access to this representation for an image. Q-bot is tasked with deducing two attributes of the image in a particular order – e.g., if the task is (shape, color), Q-bot would need to output (square, purple) for a (purple, square, filled) image seen by A-bot (see Fig. 3b). We form all 6 such tasks per image.
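The synthetic world is small enough to enumerate directly. In the sketch below, the attribute values other than those named in the text (square, purple, green, blue, red, filled) are hypothetical placeholders, as are the names `ATTRIBUTES` and `target`.

```python
from itertools import permutations, product

# Only 'square', 'filled', and the four colors appear in the text;
# the remaining values are placeholders chosen for illustration.
ATTRIBUTES = {
    "shape": ["square", "circle", "triangle", "star"],
    "color": ["purple", "green", "blue", "red"],
    "style": ["filled", "dashed", "dotted", "solid"],
}

# Every image is one combination of (shape, color, style).
images = [dict(zip(ATTRIBUTES, values)) for values in product(*ATTRIBUTES.values())]

# A task is an ordered pair of distinct attributes, e.g. ('shape', 'color').
tasks = list(permutations(ATTRIBUTES, 2))


def target(image, task):
    """Attribute values Q-bot must output for this image, in task order."""
    return tuple(image[name] for name in task)


example = {"shape": "square", "color": "purple", "style": "filled"}
example_target = target(example, ("shape", "color"))
```

Enumerating ordered pairs of the three attributes gives the 6 tasks per image mentioned above, and the Cartesian product of the attribute values gives the 64 images.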
Vocabulary. We conducted a series of pilot experiments and found the choice of the vocabulary size to be crucial for coaxing non-trivial ‘non-cheating’ behavior to emerge. For instance, we found that if the A-bot vocabulary $V_A$ is large enough, say $|V_A| \geq 64$ (#images), the optimal policy learnt simply ignores what Q-bot asks and A-bot conveys the entire image in a single token (e.g., one token standing for (red, square, filled)). As with human communication, an impoverished vocabulary that cannot possibly encode the richness of the visual sensor is necessary for non-trivial dialog to emerge. To ensure at least 2 rounds of dialog, we restrict each agent to only produce a single symbol utterance per round from ‘minimal’ vocabularies $V_A = \{1, 2, 3, 4\}$ for A-bot and $V_Q = \{X, Y, Z\}$ for Q-bot. Since $|V_A|^{\#\text{rounds}} < \#\text{images}$, a non-trivial dialog is necessary to succeed at the task.
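The counting argument behind this restriction can be checked in a few lines. The A-bot vocabulary size of 4 used below is an assumption (one symbol per attribute value), and `distinct_messages` is a hypothetical helper name.

```python
def distinct_messages(vocab_size, rounds):
    """Number of distinct answer sequences A-bot can emit with one symbol per round."""
    return vocab_size ** rounds


NUM_IMAGES = 4 * 4 * 4      # 4 shapes x 4 colors x 4 styles
NUM_VALUE_PAIRS = 4 * 4     # value combinations for the two attributes in a task
VOCAB_A = 4                 # assumed size of A-bot's 'minimal' vocabulary

one_round = distinct_messages(VOCAB_A, 1)
two_rounds = distinct_messages(VOCAB_A, 2)

# With a vocabulary as large as the image set, a single token could name the image outright.
cheating_possible = distinct_messages(NUM_IMAGES, 1) >= NUM_IMAGES
```

Under this assumption, two single-symbol answers are too few to name one of the 64 images outright but exactly enough to convey the values of the two queried attributes, which is why A-bot has to attend to what Q-bot asks.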
Policy Learning. Since the action space is discrete and small, we instantiate Q-bot and A-bot as fully specified tables of Q-values (state, action, future reward estimate) and apply tabular Q-learning with Monte Carlo estimation over episodes to learn the policies. Updates are done alternately, where one bot is frozen while the other is updated. During training, we use $\epsilon$-greedy policies, which assign a fixed probability to the greedy action and split the remaining probability uniformly across the other actions. At test time, we default to the greedy, deterministic policy obtained from these $\epsilon$-greedy policies. The task requires outputting the correct attribute value pair based on the task and image. Since there are a total of $4 + 4 + 4 = 12$ unique values across the $3$ attributes, Q-bot’s final action selects one of $12 \times 12 = 144$ attribute-pairs. We use one fixed reward value for right predictions and another for wrong predictions.
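The action distribution described here, a fixed probability on the greedy action with the remainder split uniformly, can be sketched as follows. The Q-values and the `greedy_prob` value are placeholders chosen for illustration, not the settings used in the experiments.

```python
def epsilon_greedy_probs(q_values, greedy_prob):
    """Action distribution: `greedy_prob` on the best action, the rest split uniformly."""
    n = len(q_values)
    if n == 1:
        return [1.0]
    best = max(range(n), key=lambda a: q_values[a])
    rest = (1.0 - greedy_prob) / (n - 1)
    return [greedy_prob if a == best else rest for a in range(n)]


# Placeholder Q-values for three actions in some state.
probs = epsilon_greedy_probs([0.2, 0.9, -0.1], greedy_prob=0.7)

# At test time the policy is greedy: always pick the highest-valued action.
greedy_action = max(range(len(probs)), key=lambda a: probs[a])

# Size of Q-bot's final action space: an ordered pair of attribute values.
num_values = 4 + 4 + 4
num_final_actions = num_values * num_values
```

Keeping some probability on non-greedy actions during training lets each bot keep exploring alternative symbols while the other bot is frozen.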
Results. Fig. 3d shows the reward achieved by the agents’ policies vs. number of RL iterations (each with 10k episodes/dialogs). We can see that the two bots quickly learn the optimal policy. Fig. 3b,c show some example exchanges between the trained bots. We find that the two invent their own communication protocol – Q-bot consistently uses a specific symbol to query for each specific attribute: color, shape, style. And A-bot consistently responds with specific symbols to indicate the value of the inquired attribute, e.g., if Q-bot emits the symbol that asks for color, A-bot responds with a distinct symbol for each of purple, green, blue, and red. Similar mappings exist for responses to other attributes. Essentially, we find the automatic emergence of grounded language and a communication protocol among ‘visual’ dialog agents without any human supervision!
|Image + Caption||Human-Human dialog ||SL-pretrained Q-bot-A-bot dialog||RL-full-QAf Q-bot-A-bot dialog|
A couple of people in the snow on skis.
|What are their genders? 1 man 1 woman Are they both adults? Yes Do they wear goggles? Looks like sunglasses Do they have hats on? Man does Are there any other people? No What color is man’s hat? Black Is it snowing now? No What is woman wearing? Blue jacket and black pants Are they smiling? Yes Do you see trees? Yes||How many people? 2 Male or female? I can’t tell What are they wearing? I can’t tell, they are far away What color are skis? I can’t tell Are they wearing goggles? Yes Are they wearing goggles? Yes Are they wearing goggles? Yes Are they wearing goggles? Yes Are they wearing hats? Yes Are they wearing goggles? Yes||Are people male or female or both? I think 1 is male, can’t see other 1 … What are they wearing? They are all wearing snow pants and jackets Is it snowing? It does not appear to be snowing at moment Can you see any trees? Yes, there are lot of trees in background Can you see any buildings in background? No, I can not Can you see any trees? Yes, there are lot of trees in background Can you see any buildings in background? No, I do not see any buildings Can you see any trees? Yes, there are lot of trees in background Can you see any buildings in background? No, I do not see any buildings Does it look like they are in park? It does not appear to be|
Two men giving thumbs up holding large slices of pizza.
|How old are men? They look to be 30’s Are they in restaurant? I can’t tell, its close up Do men look they might be related? Maybe Are they each holding slice of pizza in 1 hand and giving thumbs up with other? 1 holding slice, other holding box… What kind of pizza do they have? cheese Does box have pizza company name on it? No Are there any pictures or graphics on box? No Are men slender or heavy set? Slender Do either of men wear glasses? Yes 1 or both? 1||How old are men? They look like they are in their 20’s Are they male or female? 1 male and 1 female What are they wearing? T shirt and shorts What kind of pizza is it? Looks like pepperoni Does it look good? Yes it does Can you tell what kind of pizza it is? Looks like pepperoni Are there any toppings on pizza? Pepperoni Are there any toppings on pizza? Pepperoni Are there any toppings on pizza? Looks like cheese, but I can’t tell… Are there any other toppings on pizza? No||How old are men? They look like they are in their 20’s Are they wearing chef’s hat? No Do they have long hair? No What kind of pizza is it? Looks like pepperoni Does it look good? Yes Are there any other people around? No Are they inside or outside? Inside Can you tell what time of day it is? No Are there any other people around? No What color are walls? White|
A man riding a skateboard down the side of a ramp.
|How old is man? In his teens Is he wearing helmet? No Is he wearing any protective gear? No What color is his hair? Dirty blonde or dark brown What color is his shirt? Black with red letters Is he wearing shorts? Yes What color are his shorts? Denim Are there other people visible? No What color are his shoes? I can’t see them Is this at skate park? Yes||How old is man? I can’t see his face, I don’t know What is he wearing? shorts and t shirt What color is his shirt? white what color are his pants? black what color is his shirt? blue What color are his shoes? Black What color are his shoes? Black What color are his shoes? Black What color are his shoes? Black What color are his shoes? Black||What color is skateboard? It is hard to tell but I think it’s black Is he wearing helmet? Yes Is he wearing knee pads? No, no knee pads Is it day or night? It is daytime Is it sunny? It is bright out, but I see no shadows, so I can not Can you see any other people? There are 2 other people in background Is it day or night? It is daytime Is it sunny? Yes, it is sunny What color is skateboard? It is hard to tell but I think it’s black Is he wearing helmet? Yes, he is wearing helmet|
An airplane sitting on an asphalt landing strip.
|What color is plane? White and blue Is there writing on plane? Yes it says lufthansa What color is ground? Grey tarmac Any people? No people visible Can you see airport? Yes What color is it? Grey Is there sunshine? No it’s kind of overcast Any other planes? No other planes Any grass? No grass Any trees? No tree||What color is plane? White Is it large or small? Large Any people? No Any writing on plane? Yes Any writing on plane? Yes Any writing on plane? Yes Any writing on plane? Yes What does it say? UNK What color is plane? White Any writing on it? Yes||What color is airplane? White with red and blue stripes Can you tell what airline it is? No, it’s too far away What time of day is it? It is daytime Can you see any people? No Can you see any buildings? Yes, in background Is it day or night time? It is daytime Can you see sky? Yes Are there clouds in sky? Yes, few Does it look like it’s going to rain? Yes it does Does it look like it’s going to rain? Yes it does|
A dog watching food cook in an oven.
|What color is dog? Brown and white What breed is dog? Look like brittany spaniel Is dog wearing collar? Not able to see Can you tell what is cooking? Look like beans Is oven stainless steel? Yes Does it have see through glass? Yes How many racks in oven? 2 Can you see knobs? No Does oven have wooden handle? Not able to see Are beans in pot? Glass dish||What kind of dog is it? I can’t tell What color is it? White Is it large or small? Large What kind of food is it? I can’t tell Is it large or small? Small What color is it? White Is it large or small? Large What color are walls? White What kind of floor? Can’t see floor What kind of floor? Can’t see floor||What kind of dog is it? Terrier of some kind What color is it? White and grey Can you tell what kind of food it is? No Are there people around? No people Is it inside or outside? Inside Can you tell what time of day it is? No I can’t What color is room? I can’t see walls Do you see any windows? No Do you see any windows? No Do you see any windows? No|
A man prepares to swing at the tennis ball.
|Are there more people in this photo or just man? No Just tennis player man in this photo? Yes Is it outdoor court? Can’t tell, it’s close up What is man wearing? White shirt and shorts Can you see net? No Can you see bottom of court? Yes Can you see ball in photo? Yes Is it close to man’s racquet? Yes Is man about to take swing? Yes Is he left handed or right handed? Right||How old is man? I can’t see his face, I don’t know What color is his hair? Brown What is he wearing? Shorts and t shirt What color is his shirt? White What color is his shirt? White What color is his shirt? White What color is his shirt? White What color is his shirt? White What color is his shirt? White What color is his shirt? White||How old is man? He looks to be in his 20’s What color is his hair? His hair is brown Is he wearing glasses? No he is not Is he wearing hat? No he is not Is he wearing glasses? No he is not Are there other people around? No it’s just him Can you see net? No I can’t What color is court? Court is green Is it sunny? Yes it is Do you see any trees? No I do not|
A giraffe is resting its head in the grass near a rock.
|Is giraffe adult? I cannot tell, I only see face Is grass at higher altitude than its feet? No Is rock big? Yes How is weather? Sunny What color is rock? Tan color Is grass green or dry? Green but looks little dry Does giraffe look healthy? Yes Does giraffe look happy? It looks tired Is giraffe laying down? Yes Is giraffe’s mouth closed? Yes||Is there only 1 giraffe? Yes Are there any other animals? No Are there any people? No Are there any other animals? No Is it day or night? Day Is it sunny? Yes What time of day it is? I can’t tell Are there any other animals? No Are there any trees? No Are there any other animals? No||Is there only 1 giraffe? Yes Are there any other animals? No Are there any people? No Does this look like zoo? Can’t tell Is there fence? No Is it day or night? Day Is it sunny? Yes What time of day is it? Morning Are there any trees? No Can you see sky? No|
A cracker with shredded cheese and leafy greens on a white plate.
|Are there any people in picture? No, there are no people Is cheese white? Yes, cheese is white Are crackers white? No, cracker is brownish Are there lots of greens? No, only 1 sprig Was picture taken in kitchen? I cannot tell, it is too zoomed in How many crackers are there? There is 1 Is there design on plate? No, plate is plane white Is there good lighting? Yes, lighting is decent Is plate on table? I cannot tell I can only see plate as background Does cracker look delicious? Absolutely||Are there any people? No What color is plate? White Is plate on table? I can’t tell What color is plate? White Is plate on table? I can’t tell Is there anything else on table? No Is there anything else on table? No Is there anything else on table? No Is there anything else on table? No Is there anything else on table? No||Is plate large or small? Large Is it on table? I can’t tell What kind of donut? Plain with sugar on it What kind of chips? Potato chips Any people? No Is this inside or outside? I can’t tell, it’s close up What color is table? White Any chairs? No What color is table? I can’t tell Is it large or small? I can’t see plate|
A man making the live long and prosper sign from star trek.
|How old is this man? He looks like late teens Does he have short hair or long hair? Medium What color is his hair? Brown Is he wearing glasses? No Does he have facial hair? No How is he dressed? He’s white dress shirt with red tie and matching red vest over it looks like school uniform Is he larger man or is he skinny? He’s skinny Is this picture taken outside? Inside Can you tell if it’s daytime? I’m not sure Can you tell what room he is standing in? He’s in hallway||How old is man? He looks like he’s in his 20s What is he wearing? T shirt and jeans What color is his hair? Black What is he wearing? T shirt and jeans What color is his hair? Black What is he wearing? T shirt and jeans What color is his hair? Black Is it short or long? Short What color is his hair? Black Is it short or long? Short||Where is man located? Looks like classroom of some sort What is man wearing? Black t shirt and jeans Are there any other people in picture? Yes, quite few people in background What color is table? Picture is black and white, but it’s wood table Is there anything else on table? Not that I can see What are other people doing? They are standing in front of him, but I don’t know what they Are there any windows? Not that I can see What color are walls? I can’t see walls What time of day do you think it is? I can’t tell What is man wearing? Black t shirt and jeans|
Our synthetic experiments in the previous section establish that when faced with a cooperative task where information must be exchanged, two agents with perfect perception are capable of developing a complex communication protocol.
In general, with imperfect perception on real images, discovering human-interpretable language and communication strategy from scratch is both tremendously difficult and an unnecessary re-invention of English. We leverage the recently introduced VisDial dataset  that contains (as of the publicly released v0.5) human dialogs (10 rounds of question-answer pairs) on 68k images from the COCO dataset, for a total of 680k QA-pairs. Example dialogs from the VisDial dataset are shown in Tab. 1.
Image Feature Regression. We consider a specific instantiation of the visual guessing game described in Sec. 3 – specifically at each round , Q-bot needs to regress to the vector embedding of image corresponding to the fc7 (penultimate fully-connected layer) output from VGG-16 . The distance metric used in the reward computation is , .
Training Strategies. We found two training strategies to be crucial to ensure/improve the convergence of the RL framework described in Sec. 4, to produce any meaningful dialog exchanges, and to ground the agents in natural language.
Supervised Pretraining. We first train both agents in a supervised manner on the train split of VisDial v0.5 under an MLE objective. Thus, conditioned on human dialog history, Q-bot is trained to generate the follow-up question by human1, A-bot is trained to generate the response by human2, and the feature network is optimized to regress to the fc7 image embedding. The CNN in A-bot is pretrained on ImageNet. This pretraining ensures that the agents can generally recognize some objects/scenes and emit English questions/answers. The space of possible question-answer exchanges is tremendously large, and without pretraining most exchanges result in no information gain about the image.
Curriculum Learning. After supervised pretraining, we ‘smoothly’ transition the agents to RL training according to a curriculum. Specifically, we continue supervised training for the first K (say 9) rounds of dialog and transition to policy-gradient updates for the remaining 10 − K rounds. We then gradually anneal K to 0. This curriculum ensures that the agent team does not suddenly diverge off policy if one incorrect question or answer is generated.
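The curriculum can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `curriculum_schedule`, the starting value of 9 supervised rounds (taken from the "say 9" above) and the decrement of one round per epoch are all assumptions made for illustration.

```python
def curriculum_schedule(num_rounds=10, start_k=9, anneal_step=1):
    """Yield, for each epoch, the training mode applied to every dialog round.

    The first k rounds are trained with supervision ("SL") and the remaining
    rounds with policy-gradient updates ("RL"). k shrinks after every epoch
    until it reaches 0, at which point all rounds are trained with RL.
    start_k and anneal_step are hypothetical defaults, not reported values.
    """
    k = start_k
    while True:
        yield ["SL"] * k + ["RL"] * (num_rounds - k)
        if k == 0:
            break
        k = max(0, k - anneal_step)


schedule = list(curriculum_schedule())
for epoch, modes in enumerate(schedule):
    print(epoch, " ".join(modes))
```

Under these assumptions the first epoch applies RL only to the last round, and the final epoch applies RL to all 10 rounds.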
Models are pretrained for 15 epochs on VisDial, after which we transition to policy-gradient training by annealing the number of supervised rounds down every epoch. All LSTMs are -layered with -d hidden states. We use Adam with a learning rate of , and clamp gradients to to avoid explosion. All our code will be made publicly available. There is no explicit state-dependent baseline in our training as we initialize from supervised pretraining and have zero-centered reward, which ensures a good proportion of random samples are both positively and negatively reinforced.
Model Ablations. We compare to a few natural ablations of our full model, denoted RL-full-QAf. First, we evaluate the purely supervised agents (denoted SL-pretrained), i.e., trained only on VisDial data (no RL). Comparison to these agents establishes how much RL helps over supervised learning. Second, we fix one of Q-bot or A-bot to the supervised pretrained initialization and train the other agent (and the regression network) with RL; we label these as Frozen-Q or Frozen-A respectively. Comparing to these partially frozen agents tells us the importance of coordinated communication. Finally, we freeze the regression network to the supervised pretrained initialization while training Q-bot and A-bot with RL. This measures improvements from language adaptation alone.
We quantify performance of these agents along two dimensions – how well they perform on the image guessing task (image retrieval) and how closely they emulate human dialogs (performance on VisDial dataset).
bots (and other ablations). Error bars show standard error of means. (c) shows qualitative results on this predicted fc7-based image retrieval. Left column shows true image and caption, right column shows dialog exchange, and a list of images sorted by their distance to the ground-truth image. The image predicted by Q-bot is highlighted in red. We can see that the predicted image is often semantically quite similar. (b) VisDial Evaluation. Performance of A-bot on VisDial v0.5 test, under mean reciprocal rank (MRR), recall@k and mean rank metrics. Higher is better for MRR and recall@k, while lower is better for mean rank. We see that our proposed Frozen-Q-multi outperforms all other models on VisDial metrics by 3% relative gain. This improvement is entirely ‘for free’ since no additional annotations were required for RL.
Evaluation: Guessing Game. To assess how well the agents have learned to cooperate at the image guessing task, we set up an image retrieval experiment based on the test split of VisDial v0.5 ( images), which were never seen by the agents in RL training. We present each image and an automatically generated caption to the agents, and allow them to communicate over 10 rounds of dialog. After each round, Q-bot predicts a feature representation . We sort the entire test set in ascending distance to this prediction and compute the rank of the source image.
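The ranking step of this retrieval experiment can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, Euclidean distance is assumed as the distance metric, the toy 2-d vectors stand in for fc7 features, and the percentile rank is computed as the share of pool images that lie farther from the prediction than the source image.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_of_source(prediction, pool, source_index):
    """1-based rank of the source image when the pool is sorted by ascending distance."""
    d_source = l2_distance(prediction, pool[source_index])
    closer = sum(1 for feat in pool if l2_distance(prediction, feat) < d_source)
    return closer + 1


def percentile_rank(rank, pool_size):
    """Percentage of pool images that are farther from the prediction than the source."""
    return 100.0 * (pool_size - rank) / pool_size


# Toy pool of four "image features"; a real pool would hold fc7 vectors.
pool = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0], [10.0, 10.0]]
prediction = [0.9, 0.1]

rank = rank_of_source(prediction, pool, source_index=1)
print(rank, percentile_rank(rank, len(pool)))
```

Averaging the percentile rank over all test images at a given round gives one point on a per-round retrieval curve.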
Fig. 3(a) shows the mean percentile rank of the source image for our method and the baselines across the rounds (shaded region indicates standard error). A percentile rank of 95% means that the source image is closer to the prediction than 95% of the images in the set. Tab. 1 shows example exchanges between two humans (from VisDial), the SL-pretrained and the RL-full-QAf agents. We make a few observations:
RL improves image identification. We see that RL-full-QAf significantly outperforms SL-pretrained and all other ablations (e.g., at round 10, improving percentile rank by over ), indicating that our training framework is indeed effective at training these agents for image guessing.
All agents ‘forget’; RL agents forget less. One interesting trend we note in Fig. 3(a) is that all methods significantly improve from round (caption-based retrieval) to rounds 2 or 3, but beyond that all methods with the exception of RL-full-QAf get worse, even though they have strictly more information. As shown in Tab. 1, agents will often get stuck in infinite repeating loops but this is much rarer for RL agents. Moreover, even when RL agents repeat themselves, it is after longer gaps (2-5 rounds). We conjecture that the goal of helping a partner over multiple rounds encourages longer term memory retention.
RL leads to more informative dialog. SL A-bot tends to produce ‘safe’ generic responses (‘I don’t know’, ‘I can’t see’) but RL A-bot responses are much more detailed (‘It is hard to tell but I think it’s black’). These observations are consistent with recent literature in text-only dialog . Our hypothesis for this improvement is that human responses are diverse and SL trained agents tend to ‘hedge their bets’ and achieve a reasonable log-likelihood by being non-committal. In contrast, such ‘safe’ responses do not help Q-bot in picking the correct image, thus encouraging an informative RL A-bot.
Evaluation: Emulating Human Dialogs. To quantify how well the agents emulate human dialog, we evaluate A-bot on the retrieval metrics proposed by Das et al. Specifically, every question in VisDial is accompanied by 100 candidate responses. We use the log-likelihood assigned by the A-bot answer decoder to sort these candidates and report the results in Tab. 5. We find that despite the RL A-bot’s answers being more informative, the improvements on VisDial metrics are minor. We believe this is because while the answers are correct, they may not necessarily mimic human responses (which is what the answer retrieval metrics check for). In order to dig deeper, we train a variant of Frozen-Q with a multi-task objective – simultaneous (1) ground truth answer supervision and (2) image guessing reward, to keep A-bot close to human-like responses. We use a weight of 1.0 for the SL loss and 10.0 for RL. This model, denoted Frozen-Q-multi, performs better than all other approaches on VisDial answering metrics, improving the best reported result on VisDial by 0.7 mean rank (relative improvement of ). Note that this gain is entirely ‘free’ since no additional annotations were required for RL.
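The retrieval metrics used in this evaluation can be computed from the rank of the human response among the candidates, as in the sketch below. The function names and the cutoffs `ks=(1, 5, 10)` are assumptions made for illustration, and the scores are made-up log-likelihoods rather than model outputs.

```python
def rank_of_ground_truth(scores, gt_index):
    """1-based rank of the ground-truth candidate when sorted by descending score."""
    gt_score = scores[gt_index]
    return 1 + sum(1 for s in scores if s > gt_score)


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Mean reciprocal rank, mean rank and recall@k over a list of 1-based ranks."""
    n = len(ranks)
    metrics = {
        "mrr": sum(1.0 / r for r in ranks) / n,  # higher is better
        "mean_rank": sum(ranks) / n,             # lower is better
    }
    for k in ks:
        # Fraction of questions whose ground-truth answer is in the top k.
        metrics[f"recall@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# One question with three candidates (VisDial uses 100); index 0 is the human answer.
example_rank = rank_of_ground_truth([-1.2, -0.3, -2.0], gt_index=0)

# Ranks of the human answer for three hypothetical questions.
metrics = retrieval_metrics([1, 2, 4])
print(example_rank, metrics)
```

Because these metrics only reward ranking the single human response highly, an answer that is correct but phrased differently from the human one does not improve them.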
Human Study. We conducted a human interpretability study to measure (1) whether humans can easily understand the Q-bot-A-bot dialog, and (2) how image-discriminative the interactions are. We show human subjects a pool of 16 images and the agent dialog (10 rounds), and ask them to pick their top-5 guesses for the image the two agents are talking about. We find that the mean rank of the ground-truth image is 3.70 for SL-pretrained agent dialog vs. 2.73 for RL-full-QAf dialog. In terms of MRR, the comparison is 0.518 vs. 0.622 respectively. Thus, under both metrics, humans find it easier to guess the unseen image based on RL-full-QAf dialog exchanges, which shows that agents trained within our framework (1) successfully develop image-discriminative language, and (2) this language is interpretable; they do not deviate off English.
To summarize, we introduce a novel training framework for visually-grounded dialog agents by posing a cooperative ‘image guessing’ game between two agents. We use deep reinforcement learning to learn the policies of these agents end-to-end – from pixels to multi-agent multi-round dialog to game reward. We demonstrate the power of this framework in a completely ungrounded synthetic world, where the agents communicate via symbols with no pre-specified meanings (X, Y, Z). We find that two bots invent their own communication protocol without any human supervision. We go on to instantiate this game on the VisDial dataset, where we pretrain with supervised dialog data. We find that the RL ‘fine-tuned’ agents not only significantly outperform SL agents, but learn to play to each other’s strengths, all the while remaining interpretable to outside human observers.
We thank Devi Parikh for helpful discussions. This work was funded in part by the following awards to DB – NSF CAREER award, ONR YIP award, ONR Grant N00014-14-1-0679, ARO YIP award, ICTAS Junior Faculty award, Google Faculty Research Award, Amazon Academic Research Award, AWS Cloud Credits for Research, and NVIDIA GPU donations. SK was supported by ONR Grant N00014-12-1-0903, and SL was partially supported by the Bradley Postdoctoral Fellowship. Views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.
-  J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. VizWiz: Nearly Real-time Answers to Visual Questions. In UIST, 2010.
-  X. Chen and C. L. Zitnick. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. In CVPR, 2015.
-  A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual Dialog. In CVPR, 2017.
-  H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. Courville. GuessWhat?! visual object discovery through multi-modal dialogue. In CVPR, 2017.
-  J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In CVPR, 2015.
-  H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From Captions to Visual Concepts and Back. In CVPR, 2015.
-  J. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2016.
-  H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. In NIPS, 2015.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. In NIPS, 2014.
-  S. Havrylov and I. Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In ICLR Workshop, 2017.
-  J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In CVPR, 2016.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
-  S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to Objects in Photographs of Natural Scenes. In EMNLP, 2014.
-  D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
-  A. Lazaridou, A. Peysakhovich, and M. Baroni. Multi-agent cooperation and the emergence of (natural) language. In ICLR, 2017.
-  D. Lewis. Convention: A philosophical study. John Wiley & Sons, 2008.
-  J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep Reinforcement Learning for Dialogue Generation. In EMNLP, 2016.
-  J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
-  M. Malinowski and M. Fritz. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, 2014.
-  M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015.
-  I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.
-  S. Nolfi and M. Mirolli. Evolution of Communication and Language in Embodied Agents. Springer Publishing Company, Incorporated, 1st edition, 2009.
-  M. Ren, R. Kiros, and R. Zemel. Exploring Models and Data for Image Question Answering. In NIPS, 2015.
-  I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI, 2016.
-  I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. arXiv preprint arXiv:1605.06069, 2016.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
-  M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. MovieQA: Understanding Stories in Movies through Question-Answering. In CVPR, 2016.
-  K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S. C. Zhu. Joint Video and Text Parsing for Understanding Events and Answering Queries. IEEE MultiMedia, 2014.
-  S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to Sequence - Video to Text. In ICCV, 2015.
-  S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In NAACL HLT, 2015.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
-  R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
-  S. Wu, H. Pique, and J. Wieland. Using artificial intelligence to help blind people ‘see’ facebook. http://newsroom.fb.com/news/2016/04/using-artificial-intelligence-to-help-blind-people-see-facebook/, 2016.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, 2015.