Natural language interaction is arguably the most efficient and natural way for humans to acquire knowledge and exchange information. In artificial intelligence, developing intelligent agents that can ground linguistic concepts in visual sensor inputs and communicate via goal-ended dialogue is likewise a fundamental problem.
Recently, two visually grounded conversation tasks, along with datasets/environments, have been proposed: VisDial [2] and GuessWhat?! [4]. VisDial [2] collects a large dataset of free-form conversations about natural images. Based on the VisDial dataset, Das et al. [3] further proposed an image-level grounding task and applied reinforcement learning to train a goal-ended conversation system. The generated conversation was mainly evaluated on an image retrieval task.
In contrast, GuessWhat?! proposed an environment focusing on object-instance-level grounding. The object grounding task implicitly requires agents to detect and recognize object instances, understand the spatial layout, and use positional and attribute words to reason about and distinguish among candidate regions. In this paper, we focus on the GuessWhat?! task, since object-level grounding is more relevant to real-world applications such as interactive navigation robots.
Beyond supervised training using human dialog, reinforcement learning has recently been adopted in visual conversation [3, 10]. In particular, on the GuessWhat?! task, Strub et al. [10] used RL to tune the question generator while keeping the answer agent and guesser unchanged. However, human collaboration normally requires both parties to engage and adapt to each other. In this paper, we propose to interactively train all three models in a dynamic environment. The question generator and the answer model are jointly tuned by a common reward function using reinforcement learning. The reward function, which is parameterized by the guesser model, is also dynamically updated to cooperate with the other two models. Our result significantly outperforms the previous best result on the GuessWhat?! task and achieves near-human performance. Despite the improved task success rate, we observe that the generated conversations suffer from a language drifting problem. We therefore also propose a reward engineering technique to help improve the interpretability of the generated conversations.
Our main result shows that agents can achieve near-human-level performance on the visual object grounding task by drifting from natural language towards a contrived language. It also indicates that current goal-ended visual conversation tasks require more semantically grounded evaluation metrics beyond task completion rate. In summary, the contribution of this paper is two-fold:
- We introduce an interactive training method for the object instance grounding task. The proposed training method significantly outperforms previous best results on the GuessWhat?! task.
- To balance interpretability and task success rate, we propose a reward engineering technique that intervenes in training and improves the readability of the generated conversations.
2 Related Work
Visual Conversation: Existing visual conversation datasets can be divided into task-oriented dialogue [4] and free-form (chit-chat) dialogue [2]. Recently, Das et al. [3] also introduced a goal-ended conversation task based on VisDial. One salient difference is that GuessWhat?! focuses on object-instance-level grounding, while the task in [3] focuses on image-level grounding. Beyond supervised training baselines, Das et al. [3] showed that cooperative RL improves on supervised baselines in the VisDial image retrieval task. On GuessWhat?!, Strub et al. [10] showed that RL can improve the task success rate by tuning only the question generator and keeping the other two models static. Notably, the reward function in the GuessWhat?! task depends on the subjective parametric guesser model rather than on an objective metric. In this work, we focus on the object grounding task in GuessWhat?! and extend reinforcement training towards a more interactive setting: both conversation bots are jointly trained using RL, and the parameterized reward function (the guesser model) also takes an active part in the update dynamics.
Artificial Language vs Natural Language: Recently, [8, 6, 3] found that multiple agents can develop their own communication protocols (artificial languages) during cooperative training. The emergent language is very effective between AI agents, but not interpretable to humans. To narrow the semantic gap between artificial language and natural language, [7, 9] explored different techniques to retain interpretability, including constraining the vocabulary size [7] and iteratively updating different agents [9]. These findings are based on synthetic environments rather than on tasks with natural images. Echoing these findings, we observe similar language drifting problems during interactive training. To improve interpretability, we choose to explicitly enforce the desired dialogue properties in the reward function, thus balancing the trade-off between task success rate and interpretability.
3 Model Architecture and Interactive Training
We generally follow the architecture design of prior GuessWhat?! work for the answer model and the guesser. The only salient architectural difference is that we use a seq2seq model for the question generator instead of a vanilla LSTM.
seq2seq Question Generator: The seq2seq model was originally introduced for machine translation and later adopted in conversation systems. Compared with a vanilla LSTM, seq2seq can be extended with attention modules for long-distance reasoning. We use a global dot-product attention layer to combine the visual context with the language embeddings.
The seq2seq model first encodes the previous conversation history using the LSTM encoder; the language embeddings are then mixed with the image feature (we also use VGG features) in the attention module. During reinforcement training, we feed only the most recent conversation rounds (at most two) into the seq2seq model, instead of the whole conversation history.
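The attention step described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the choice to append the projected image feature as one extra memory slot, and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_attention(query, memory):
    """Global dot-product attention.

    query:  (d,)   current decoder state
    memory: (n, d) slots to attend over
    Returns the context vector (d,) and the attention weights (n,).
    """
    scores = memory @ query        # one dot product per memory slot
    weights = softmax(scores)
    context = weights @ memory     # weighted sum of the memory slots
    return context, weights

def build_memory(encoder_states, image_feature, proj):
    # Assumption: the image feature is linearly projected to the language
    # dimension and appended as one extra slot next to the encoder states.
    visual_slot = proj @ image_feature
    return np.vstack([encoder_states, visual_slot])

def recent_rounds(history, max_rounds=2):
    # Keep at most the last `max_rounds` question-answer rounds.
    return history[-max_rounds:]

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))  # 5 encoded tokens, dimension 8
image_feature = rng.normal(size=16)       # stand-in for a VGG feature
proj = rng.normal(size=(8, 16))           # hypothetical projection matrix
memory = build_memory(encoder_states, image_feature, proj)
context, weights = dot_product_attention(rng.normal(size=8), memory)
```

Treating the image as an additional memory slot is only one way to mix the two modalities; concatenating the image feature to every encoder state would be an equally plausible reading of the description.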
Interactive Reinforcement Training: After supervised pre-training, the three models have only acquired knowledge from the static dataset and have not yet learned to cooperate with each other through interaction. We argue that successful communication requires all parties to learn to adapt to one another. Based on this intuition, we let the three models actively learn in a self-talking environment in an interactive manner.
In [10], the reward function for the question generator is a binary score that depends on whether the guesser completes the task: the reward at round $t$ is $r_t = r(s_t; \theta_G)$, where $\theta_G$ is the guesser's parameter and $s_t$ is the state of the question generator and answer model. We use the same reward definition to update the answer model. For the answer model, we also add a score branch (a single FC + ReLU layer) that estimates the reward value, in order to stabilize the policy gradient updates.
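The binary reward and the score branch used as a baseline can be sketched as below. This is a hedged illustration: the function names are hypothetical, and using the score branch output as a subtracted baseline trained with a squared error is a common choice that the text does not confirm in detail.

```python
import numpy as np

def binary_reward(guess_idx, target_idx):
    # 1 if the guesser picks the target object, 0 otherwise.
    return 1.0 if guess_idx == target_idx else 0.0

def score_branch(features, w, b):
    # A single fully connected layer followed by ReLU, estimating the reward.
    return float(max(0.0, np.dot(w, features) + b))

def policy_loss(log_probs, reward, baseline):
    # REINFORCE surrogate loss for one sampled sequence:
    # minimizing it raises the log-probability of tokens that led to
    # a reward above the baseline.
    advantage = reward - baseline
    return -advantage * float(np.sum(log_probs))

def baseline_loss(reward, baseline):
    # Assumed squared-error objective for fitting the score branch.
    return (reward - baseline) ** 2
```

Subtracting an estimated reward leaves the expected gradient unchanged while reducing its variance, which is the usual motivation for such a branch.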
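With the above extension, we want to maximize the expected reward over the two models' policies, parameterized by $\theta_Q$ (question generator) and $\theta_A$ (answer model): $J(\theta_Q, \theta_A) = \mathbb{E}_{\pi_{\theta_Q}, \pi_{\theta_A}}\big[\sum_t r_t\big]$, where $r_t$ is the reward at round $t$. Both the question generator and the answer model are then updated by policy gradient on this objective.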
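In a standard REINFORCE form, and under notation assumed here for illustration, these updates can be sketched as follows. Here $\theta_Q$ and $\theta_A$ are the parameters of the question generator and the answer model, $q_t$ and $a_t$ are the question and answer sampled at round $t$, $s_t$ is the dialogue state, $r_t$ is the reward at round $t$, $R$ is the total reward of a dialogue of $T$ rounds, $b_Q$ is a baseline for the question generator, and $b_A(s_t)$ is the output of the answer model's score branch.

```latex
\begin{align}
J(\theta_Q, \theta_A) &= \mathbb{E}_{\pi_{\theta_Q},\,\pi_{\theta_A}}\Big[\sum_{t=1}^{T} r_t\Big],
\qquad R = \sum_{t=1}^{T} r_t, \\
\nabla_{\theta_Q} J &= \mathbb{E}\Big[\sum_{t=1}^{T} \nabla_{\theta_Q} \log \pi_{\theta_Q}(q_t \mid s_t)\,\big(R - b_Q\big)\Big], \\
\nabla_{\theta_A} J &= \mathbb{E}\Big[\sum_{t=1}^{T} \nabla_{\theta_A} \log \pi_{\theta_A}(a_t \mid s_t, q_t)\,\big(R - b_A(s_t)\big)\Big].
\end{align}
```

Because the reward is binary and depends on the guesser's final decision, $R$ is either 0 or 1 for each dialogue in this sketch.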
Note that evaluating the reward depends on the guesser model. The guesser is far from perfect even on human dialogue, and the mismatch between generated and human conversation further enlarges its error and affects the policy gradient. Therefore, we let the guesser tune itself on the generated dialogue: the guesser's parameter is updated by minimizing the cross-entropy loss of its prediction on generated conversations.
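The guesser update can be illustrated with a softmax cross-entropy over candidate objects. This is a sketch under assumptions: the guesser is taken to output one unnormalized score per candidate, and the function names are hypothetical.

```python
import numpy as np

def guesser_log_probs(scores):
    # Log-softmax over the candidate objects (numerically stable).
    shifted = scores - np.max(scores)
    return shifted - np.log(np.sum(np.exp(shifted)))

def guesser_cross_entropy(scores, target_idx):
    # Negative log-probability assigned to the true target object.
    return -float(guesser_log_probs(scores)[target_idx])

def guesser_grad(scores, target_idx):
    # Gradient of the loss with respect to the scores: softmax minus one-hot.
    grad = np.exp(guesser_log_probs(scores))
    grad[target_idx] -= 1.0
    return grad
```

In the interactive setting, `scores` would be computed from a generated dialogue rather than a human one, so the guesser gradually adapts to the conversations the other two models actually produce.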
Reward Engineering: The interactive learning above effectively improves the task success rate, but the generated conversations diverge from natural language towards an effective but uninterpretable communication protocol. One explanation is that, during interactive training, the guesser model learns to tolerate the gradually shifting conversation and gives positive reward “over-generously”.
Based on this assumption, we use simple heuristics to prune unnatural generated questions before feeding the generated conversation into the guesser. The intention is to let the guesser read only the “natural” part of the conversation, and to explicitly discourage it from squeezing signal out of unnatural QAs. Specifically, we use two heuristics: 1) removing questions that contain repetitive words or phrases (e.g. “is it in front left front left front left?”); 2) removing near-duplicates of questions asked earlier in the conversation (e.g. “Is it on the left? … On the left?”). In the extreme case where most generated QAs are unreadable and pruned, the guesser does not receive enough input and is forced to fail, so the reward is zero.
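The two pruning heuristics can be sketched as follows. The exact rules are not specified in the text, so everything concrete here is an assumption made for illustration: the tokenizer, the immediate n-gram repetition test, the overlap-coefficient test for near-duplicates, and the 0.9 threshold.

```python
import re

def tokenize(question):
    # Lowercase and keep alphanumeric tokens only.
    return re.findall(r"[a-z0-9']+", question.lower())

def has_repeated_phrase(tokens, max_n=3):
    # True if some n-gram (n <= max_n) is immediately followed by itself.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - 2 * n + 1):
            if tokens[i:i + n] == tokens[i + n:i + 2 * n]:
                return True
    return False

def is_near_duplicate(tokens, earlier, threshold=0.9):
    # Overlap coefficient between token sets: |A & B| / min(|A|, |B|).
    a = set(tokens)
    for prev in earlier:
        b = set(prev)
        if a and b and len(a & b) / min(len(a), len(b)) >= threshold:
            return True
    return False

def prune_dialogue(qa_pairs):
    # Return only the QA pairs the guesser is allowed to read.
    kept, seen = [], []
    for question, answer in qa_pairs:
        tokens = tokenize(question)
        natural = not (has_repeated_phrase(tokens) or is_near_duplicate(tokens, seen))
        seen.append(tokens)  # pruned questions still count as "asked earlier"
        if natural:
            kept.append((question, answer))
    return kept

dialogue = [
    ("Is it on the left?", "yes"),
    ("On the left?", "yes"),
    ("is it in front left front left front left?", "n/a"),
    ("Is it a person?", "no"),
]
kept = prune_dialogue(dialogue)  # keeps the first and the last pair
```

If `prune_dialogue` returns few or no pairs, the guesser has little to condition on and will usually fail, which is how the pruning translates into a low reward for unnatural dialogues.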
Implementation details: We use data preprocessing similar to prior GuessWhat?! work. We also perform supervised pre-training and obtain error rates comparable to those previously reported (answer model: 0.213, guesser: 0.380, question generator: 0.581). In the interactive learning stage, we use the Adam optimizer with batch size 64 and learning rate 1e-4. To generate questions, we use multinomial sampling in both training and testing.
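Multinomial sampling, as opposed to greedy decoding, draws each token from the model's predicted distribution. The sketch below illustrates the difference for a single decoding step; the function names and the toy logits are hypothetical.

```python
import numpy as np

def token_probs(logits):
    # Softmax over the vocabulary (numerically stable).
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def sample_token(logits, rng):
    # Multinomial sampling: draw a token index according to its probability.
    probs = token_probs(logits)
    return int(rng.choice(len(probs), p=probs))

def greedy_token(logits):
    # Greedy decoding: always pick the most likely token.
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, -1.0])  # toy scores for a 3-word vocabulary
sampled = sample_token(logits, rng)
greedy = greedy_token(logits)
```

Sampling keeps exploration alive during policy gradient training, since the same dialogue state can yield different questions across episodes.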
Task Success Rate: The task success rates for the baseline models and the different variants of interactively trained models are shown in Table (a). In Table (a), we use IRL to denote interactive reinforcement learning, and the superscript denotes which models are actively tuned during training (Q: question generator, A: answer model, G: guesser). All interactively trained models consistently outperform the baseline RL model [10] (the best score for the baseline [10] is copied from the authors' GitHub page). As expected, the most successful model is IRL$^{QAG}$, which is very close to the previously measured human score.
Quantifying Semantic Gap: Although it effectively improves the task success rate, IRL$^{QAG}$ tends to generate uninterpretable conversations. Typical examples are 1) repetitive words or phrases in questions and 2) the answer model using “n/a” more frequently, even when the rational answer is “yes” or “no” (top-left example in Figure (b)). We set up human studies to quantify this semantic gap. First, we asked independent human subjects to evaluate the quality of the generated answers. We randomly generated 100 cases and asked 20 human subjects to decide whether the generated answers agree with their judgments. Each QA pair is evaluated by 3 subjects with binary scores (3rd row in Table (a)). When the guesser and answer model are jointly updated (IRL$^{AG}$), the generated answers tend to disagree with humans. However, when the question generator is also jointly updated, the answer quality improves, probably because the generator adapts its questions to make them easier for the answer model.
Second, we asked human subjects to evaluate question quality in terms of 1) whether the question is interpretable and 2) whether it is relevant to the image content. We asked the subjects to rank the quality of the generated questions from different models (the best rank is 1). The average ranking is shown in the 2nd row of Table (a). IRL$^{QAG}$ widens the semantic gap the most, despite improving the success rate.
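The averaged ranking can be computed as in the sketch below. The data layout (one dictionary of model-to-rank per judged case), the model labels, and the ranks themselves are hypothetical placeholders, not results from the study.

```python
from collections import defaultdict

def average_rank(rankings):
    """Average the rank each model receives across judged cases.

    rankings: list of dicts mapping a model name to its rank (1 is best).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        for model, rank in ranking.items():
            totals[model] += rank
            counts[model] += 1
    return {model: totals[model] / counts[model] for model in totals}

# Placeholder judgments for two hypothetical models "A" and "B".
example = [{"A": 1, "B": 2}, {"A": 2, "B": 1}, {"A": 1, "B": 2}]
avg = average_rank(example)
```

A lower average rank indicates questions that subjects found more interpretable and more relevant to the image.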
The reward engineering version, IRL-prune$^{QAG}$, improves interpretability compared with IRL$^{QAG}$, but its task success rate is slightly lower. Some qualitative examples are shown in Figure (b).
We proposed an interactive training method for the object-instance-level visual grounding conversation task and significantly improved the task success rate. Observing the language drifting problem during interactive learning, we proposed a reward engineering technique applied during training and improved interpretability. The major remaining problem of our method is still language drifting. Our result also suggests that visual goal-ended conversation needs semantic evaluation metrics beyond task success rate.
References
- [1] P Chattopadhyay, D Yadav, V Prabhu, A Chandrasekaran, A Das, S Lee, D Batra, and D Parikh. Evaluating visual conversational agents via cooperative human-AI games. HCOMP, 2017.
- [2] A Das, S Kottur, K Gupta, A Singh, D Yadav, J Moura, D Parikh, and D Batra. Visual dialog. CVPR, 2017.
- [3] A Das, S Kottur, J Moura, S Lee, and D Batra. Learning cooperative visual dialog agents with deep RL. ICCV, 2017.
- [4] H de Vries, F Strub, S Chandar, O Pietquin, H Larochelle, and A Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. CVPR, 2017.
- [5] D Dewey. Reinforcement learning and the reward engineering principle. 2014 AAAI Spring Symposium Series, 2014.
- [6] K Evtimova, A Drozdov, D Kiela, and K Cho. Emergent language in a multi-modal, multi-step referential game. arXiv:1705.10369, 2017.
- [7] S Kottur, J Moura, S Lee, and D Batra. Natural language does not emerge naturally in multi-agent dialog. EMNLP, 2017.
- [8] I Mordatch and P Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv:1703.04908, 2017.
- [9] S Milli, P Abbeel, and I Mordatch. Interpretable and pedagogical examples. arXiv:1711.00694, 2017.
- [10] F Strub, H de Vries, J Mary, B Piot, A Courville, and O Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. IJCAI, 2017.