Visual Dialogue is a task that requires an agent to answer a series of questions grounded in an image, which demands that the agent reason about both the visual content and the dialogue history. There are two typical approaches to this task: discriminative and generative. The discriminative approach learns to select the best response from a candidate list, while the generative approach may generate new responses that are not provided in the pre-constructed repository. The discriminative approach is relatively easier, since grammaticality and accuracy are guaranteed by the human-written responses. However, the retrieved responses are limited by the capacity of the pre-constructed repository. Even the best-matched response may not be exactly appropriate, since most candidates are not tailored to the ongoing question. Therefore, generative ability is crucial for achieving human-like conversation by synthesizing more factual and flexible responses.
The typical solution for the generative visual dialogue system is based on the encoder-decoder framework. The encoder aims to capture the semantics of the image, question and dialogue history as embeddings, while the decoder decodes these embeddings into a response with recurrent neural networks (RNNs). Due to the difficulty of generation, the majority of previous works have focused on designing more comprehensive encoder structures to make use of different aspects of the input information. Though these methods achieve promising improvements, they still have obvious limitations, such as generating inaccurate details and repetitive words or phrases.
To tackle the above problems, we propose to adaptively incorporate more detailed information from the encoder when generating each word in the decoding process. Specifically, we propose a recurrent Deliberation, Abandon and Memory (DAM) module, a novel generative decoder architecture that addresses the above two issues. As shown in Figure 1, on the one hand, DAM incorporates global information at the response level to maintain semantic coherence. On the other hand, DAM captures related and unique details at the word level through a Deliberation Unit guided by the currently generated word. To further reduce repetition, we devise an Abandon Unit that selects the unique information for the current word. Finally, a Memory Unit integrates the derived word-level and response-level semantics into the memory state for word generation, which unifies semantic coherence with richness of detail. With recurrent connections between the DAM cells, inspired by LSTM, the network is capable of generating visually grounded details in a progressive manner and substantially reduces repetition through coverage control. Note that DAM is a universal architecture that can be combined with existing visual dialogue models by adapting the Deliberation Unit to the corresponding encoder. To show the effectiveness of DAM, we propose three models that combine DAM with three typical visual dialogue encoders: the Late Fusion encoder for general feature fusion, the Memory Network encoder for dialogue history reasoning, and the DualVD encoder for visual-semantic image understanding. We show that the performance of the baseline models is consistently improved by combining them with DAM.
The main contributions are summarized as follows: (1) We propose a novel generative decoder, DAM, to generate more detailed and less repetitive responses. DAM has a compositive structure that leverages complementary information from both the response level and the word level, which improves the accuracy and richness of the responses. (2) DAM is universal: it cooperates with existing visual dialogue encoders by constraining the information selection mode to adapt to different encoder structures. (3) We demonstrate the module’s capability, generality and interpretability on the VisDial v1.0 dataset. DAM consistently improves the performance of existing models and achieves a new state-of-the-art NDCG of 60.93% on the generative task.
2 Related Work
Most previous works focused on discriminative approaches [28, 19, 11, 10] and achieved great progress. However, generative approaches, which are more practical in realistic applications, typically perform worse than discriminative approaches. Among generative methods, previous works combined reinforcement learning with generative adversarial networks to generate human-like answers, introduced negative responses into the generative model to reduce safe responses, and proposed a multi-hop reasoning model to generate more accurate responses. However, how to generate less repetitive and more detailed responses has been less studied. Our work is devoted to reducing repetition and improving richness in responses by designing a universal generative decoder that selects the information most related to the current word from response-level and word-level semantics.
Generation-based Dialogue Systems.
The typical solution adopts the sequence-to-sequence (seq2seq) framework [14, 25] and uses an RNN to generate responses. Existing works have studied diverse aspects of generation, including expressing specific emotions [22, 18], introducing new topics, generating robust task-oriented responses [16, 12], improving richness and reducing repetition. Pointer-generator networks use pointing to copy words from the source text to improve the richness of sentences, and a coverage mechanism to reduce repetition. The problem of reducing repetition in responses has been less studied in visual dialogue. Moreover, methods in visual dialogue cannot adopt pointing to copy words directly, since the pointing clues come from the image and the dialogue history. One limitation of the coverage mechanism is that it reduces repetition through rigid constraints in the loss function, which may cause essential words to be missed. Intuitively, understanding the input information comprehensively and capturing word-specific semantics can also reduce repetition. Inspired by this intuition, we propose a novel visual dialogue decoder that generates less repetitive and more detailed responses by considering the encoder structure and adaptively selecting and decoding information from the encoder.
The visual dialogue task can be described as follows: given an image and its caption, the dialogue history up to the previous round, and the current question, the task aims to generate an accurate response. Our work mainly focuses on the design of a novel generative decoder architecture, DAM. To prove the effectiveness of DAM, we combine it with three typical encoders: Late Fusion (LF), Memory Network (MN) and the state-of-the-art Dual-coding Visual Dialogue (DualVD). In this section, we introduce (1) the typical encoder-decoder generative model in visual dialogue, (2) the structure of our proposed generative decoder, and (3) the strategies for combining our decoder with the three typical encoders.
3.1 Encoder-Decoder Generative Model
Our proposed DAM network advances the typical generative decoder with deliberation and control abilities. In this section, we first introduce the typical generative visual dialogue encoder-decoder model. The encoder encodes the image, the dialogue history and the current question into a hidden state called the knowledge vector. On each decoding step, the decoder, typically a single-layer unidirectional LSTM, receives the word embedding of the previous generated word and the previous hidden state (the knowledge vector output by the encoder serves as the initial hidden state) and outputs a decoded vector. The probability distribution over the vocabulary is then computed from this decoded vector.
The word with the highest probability is selected as the predicted word and the model is trained by log-likelihood loss.
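The prediction and training steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the per-step scores over the vocabulary are taken as given (the projection from the decoded vector to these scores is omitted), the distribution is assumed to be a softmax, and the helper names `softmax`, `greedy_word` and `sequence_nll` as well as the toy numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of vocabulary scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_word(probs):
    """Index of the vocabulary entry with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)

def sequence_nll(step_logits, target_ids):
    """Negative log-likelihood of a target response, summed over decoding steps."""
    loss = 0.0
    for logits, target in zip(step_logits, target_ids):
        loss -= math.log(softmax(logits)[target])
    return loss

# Toy vocabulary of 4 words and a 2-step response (hypothetical numbers).
step_logits = [[2.0, 0.5, -1.0, 0.0], [0.1, 0.2, 3.0, -0.5]]
predicted = [greedy_word(softmax(scores)) for scores in step_logits]
loss = sequence_nll(step_logits, target_ids=[0, 2])
```

Minimizing this negative log-likelihood is equivalent to maximizing the log-likelihood of the ground-truth response.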
3.2 The DAM Decoder
DAM is a novel compositive decoder that can be incorporated into the standard sequence-to-sequence generation framework. It improves the richness of semantic details and discourages repetition in the responses. As shown in Figure 2, DAM consists of a response-level semantic decode layer (RSL), a word-level detail decode layer (WDL) and an information fusion module (Memory Unit). RSL is responsible for capturing the global information to ensure the fluency and correctness of the response. However, the global information lacks detailed semantics for the current word, and the rigid decoding mode of the LSTM tends to generate repeated words. WDL incorporates the essential and unique visual dialogue contents (i.e. question, dialogue history and image) into the generation of the current word to enrich the word-level details. WDL consists of an LSTM, a Deliberation Unit and an Abandon Unit. Finally, the Memory Unit is responsible for adaptively fusing the response-level and word-level information.
Response-Level Semantic Decode Layer (RSL)
When answering a question about an image, a human needs to capture the global semantic information to decide the main ideas and content of the response. In our model, we regard the embedded information from the encoder as the global semantic information and denote it as the knowledge vector, which provides the response-level semantics in the generation process. The response-level information for generating the current word is computed by an LSTM from the previous generated word and the memory state of the LSTM.
Word-Level Detail Decode Layer (WDL)
On the one hand, the response-level information lacks the details of the image and dialogue history, providing rigid clues for generating different words. On the other hand, the response-level information changes only slightly during the recurrent word generation process, which results in repetitive words or phrases. To solve these problems, it is critical to enrich the decoding vector with more detailed question-relevant information that is unique to the currently generated word.
To generate the current word, we first adaptively capture word-relevant information from the encoded knowledge information, along with the previous generated word and the previous hidden state, via an LSTM. The LSTM input is the concatenation of the previous generated word and the updated knowledge vector, and the LSTM maintains its own memory state. Since this step only captures the global semantics from the encoder, we further incorporate the structure-adaptive local semantics from the encoder via the Deliberation Unit. Finally, we propose the Abandon Unit to filter out redundant information while enhancing the word-specific information from both global and local clues. The Deliberation Unit and the Abandon Unit are detailed below.
The Deliberation Unit aims to adaptively leverage the encoder structure to extract the most related and detailed information for current word generation. Specifically, we first capture the significant information in the question under the guidance of the global semantic vector. Guided by the updated question representation, we apply structure-adaptive strategies for different encoder structures to select image and dialogue history information. Finally, we obtain the detailed question-related information by fusing the information of the question, dialogue history and image. Compared with most existing decoders, which merely use the encoded embedding without considering the diverse encoder structures, our proposed Deliberation Unit provides a flexible strategy to derive more detailed information by taking advantage of the elaborate encoders. To prove the effectiveness of DAM, we combine it with three typical encoders: the LF encoder for general feature fusion, the MN encoder for dialogue history reasoning and the DualVD encoder for visual-semantic image understanding. The details of the Deliberation Unit adapted to these three encoders are introduced in Section 3.3.
The Abandon Unit further filters out redundant information while enhancing the word-specific information from both the global and local encoded clues. Specifically, the Abandon Unit updates the current decoding vector by combining it with the detailed knowledge information via a gate operation based on the element-wise product, which yields the final word-level embedding.
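To make the idea of a gate operation with an element-wise product concrete, the following is a generic sketch and not the exact formulation of the Abandon Unit. It assumes that the gate is a sigmoid of a linear map over the concatenation of the decoding vector and the detailed knowledge vector; the function `gate_filter`, the parameters `weights` and `bias`, and the toy values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_filter(decoded, knowledge, weights, bias):
    """Gate the concatenation [decoded; knowledge] element by element.

    `weights` is a square matrix (list of rows) over the concatenated vector.
    Both `weights` and `bias` stand in for learned parameters (assumption).
    """
    joint = decoded + knowledge  # list concatenation
    gate = [
        sigmoid(sum(w * v for w, v in zip(row, joint)) + b)
        for row, b in zip(weights, bias)
    ]
    # Element-wise product: gate values near 0 abandon a feature, near 1 keep it.
    return [g * v for g, v in zip(gate, joint)], gate

# With all-zero parameters every gate value is 0.5, so each feature is halved.
zero_weights = [[0.0] * 4 for _ in range(4)]
filtered, gate = gate_filter([1.0, -2.0], [0.5, 3.0], zero_weights, [0.0] * 4)
```

In a trained model the gate values would differ per feature and per decoding step, which is what allows redundant features to be suppressed for one word and retained for another.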
Two Level Information Fusion
The information from RSL and WDL is complementary. We design the Memory Unit to combine the two kinds of information for word prediction. Via a gate operation, the Memory Unit selects response-level information to control the global semantics of the response and tracks the word-level information to generate a more detailed and less repetitive response.
The word with the maximum value in the probability distribution is selected as the predicted word. The probability distribution is computed from the fused output of the Memory Unit.
3.3 Variants of Deliberation Unit
Guided by the question and the state of the currently generated word, the Deliberation Unit captures more detailed information from encoder-specific structures. The Deliberation Unit mainly contains three steps: (1) word-guided question information update, (2) question-guided information update, and (3) general feature fusion. The last two steps are adapted to different encoders, while the first step remains unchanged. To select the information most related to the currently generated word, we first update the question information under the guidance of the current word state.
We will introduce the next two steps adaptive to LF, MN and DualVD encoders below. It should be noted that the parameters of the Deliberation Unit are independent of its encoder.
Deliberation Unit Adaptive to LF Encoder.
The LF encoder focuses on multi-modal information fusion without complex information reasoning. In our decoder, we simply fuse the updated question information with the dialogue history and image from the encoder, without the question-guided information update step, as shown in Figure 3(a).
Deliberation Unit Adaptive to MN Encoder.
The MN encoder focuses on dialogue history reasoning. Compared with the Deliberation Unit for the LF encoder, we add the question-guided information update step, which reasons over the dialogue history via an attention mechanism before the general feature fusion, as shown in Figure 3(b).
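A question-guided attention step over the dialogue history can be illustrated as follows. This is a minimal sketch that assumes plain dot-product attention with a softmax over history rounds; the scoring function actually used by the MN encoder or by DAM may differ, and the function `attend` and the toy vectors are hypothetical.

```python
import math

def attend(query, items):
    """Dot-product attention: weight each item by its similarity to the query."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(items[0])
    # Attention-weighted sum of the items, one value per feature dimension.
    summary = [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]
    return summary, weights

# Updated question vector and three history rounds (hypothetical 2-d features).
question = [1.0, 0.0]
history = [[2.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
history_summary, history_weights = attend(question, history)
```

Here the first history round is the most similar to the question and therefore receives the largest weight in the summary.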
Deliberation Unit Adaptive to DualVD Encoder.
The DualVD encoder focuses on visual-semantic image understanding. As shown in Figure 3(c), for the question-guided information update step, we first concatenate the updated question and the dialogue history to form a query vector, which guides the update of the image representation from the visual and semantic aspects respectively. For the feature fusion step, we apply a gate operation between the visual and semantic image representations to obtain the updated image representation.
We conduct extensive experiments on the VisDial v1.0 dataset, which is constructed from MSCOCO images and captions. VisDial v1.0 is split into training, validation and test sets. The training set consists of dialogues on 120k images from COCO-trainval, while the validation and test sets consist of dialogues on an additional 10k COCO-like images from Flickr.
Following previous work, we rank the 100 candidate answers based on their posterior probabilities and evaluate performance with retrieval metrics: mean reciprocal rank (MRR), recall@k (k = 1, 5, 10), mean rank of the human response (Mean) and normalized discounted cumulative gain (NDCG). Lower values are better for Mean, and higher values are better for the other metrics.
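For reference, these retrieval metrics can be computed as in the sketch below. It uses the standard textbook definitions and is not the official evaluation script; the function names, the toy ranks, and the choice to leave the NDCG cutoff `k` as a parameter (the cutoff is benchmark-specific) are assumptions.

```python
import math

def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """gt_ranks: 1-based rank of the human response among the candidates, one per question."""
    n = len(gt_ranks)
    metrics = {
        "MRR": sum(1.0 / r for r in gt_ranks) / n,
        "Mean": sum(gt_ranks) / n,
    }
    for k in ks:
        metrics[f"R@{k}"] = sum(r <= k for r in gt_ranks) / n
    return metrics

def ndcg(ranked_relevance, k):
    """NDCG@k for one question; relevance scores are listed in predicted rank order."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0

# Hypothetical ranks of the human response for four questions.
scores = retrieval_metrics([1, 3, 12, 2])
```

MRR and recall@k reward placing the single human response near the top, whereas NDCG takes graded relevance of all candidates into account, so a model can rank a different but equally valid answer first without being penalized.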
To build the vocabulary, we retain words in the dataset with a frequency greater than 5, which yields a vocabulary of 10366 words. The hidden and cell states of the LSTM in WDL are randomly initialized, while the LSTM in RSL uses the output knowledge vector from the encoder as its initial hidden state and a randomly initialized cell state. The maximum sentence length of the responses is set to 20. The hidden state size of all the LSTM blocks is set to 512 and the dimension of each gate is set to 1024. The Adam optimizer is used with an initial learning rate of 1e-3 and a final learning rate of 3.4e-4, following a cosine annealing strategy over 16 epochs. The batch size is set to 15.
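The learning rate schedule can be sketched with the standard cosine annealing formula. The sketch assumes that the rate is updated once per epoch and that the final value of 3.4e-4 is reached exactly at the last of the 16 epochs; the text does not state the update granularity, and the function name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs=16, lr_start=1e-3, lr_end=3.4e-4):
    """Learning rate at a given epoch under cosine annealing from lr_start to lr_end."""
    progress = epoch / (total_epochs - 1)  # 0 at the first epoch, 1 at the last
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

# One learning rate per epoch, decreasing from 1e-3 to 3.4e-4.
schedule = [cosine_lr(epoch) for epoch in range(16)]
```

The rate decays slowly at the beginning and end of training and fastest in the middle, which is the characteristic shape of a half cosine.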
4.1 State-of-the-Art Comparison
As shown in Table 1, we compare our models (third block) with SOTA generative models (first block) and baseline models (second block, re-trained by us). ReDAN-G and DMRM adopt complex multi-step reasoning, while HCIAE-G, CoAtt-G and Primary-G are attention-based models. For fairness, we only compare the original generative ability without re-ranking. We simply replace the decoders of the baseline models with our proposed DAM. Our models outperform the corresponding baseline models on all the metrics, which indicates the complementary advantages of DAM and existing encoders in visual dialogue. Though DualVD-G performs worse than DMRM on Mean, DualVD-DAM outperforms DMRM on all the other metrics without multi-step reasoning, which is the advantage DMRM has over our models.
4.2 Ablation Study
The Effectiveness of Each Unit
We consider the following ablation models to illustrate the effectiveness of each unit of our model: 1) 2L-DAM: our full model, which adaptively selects related information for decoding. 2) 2L-DM: the full model w/o the Abandon Unit. 3) 2L-M: 2L-DM w/o the Deliberation Unit. 4) 2-LSTM: 2L-M w/o the Memory Unit. As shown in Table 2, taking DualVD-DAM as an example, the MRR values increase by 0.37%, 0.11% and 0.31% respectively when the Memory Unit (2L-M), the Deliberation Unit (2L-DM) and the Abandon Unit (2L-DAM) are progressively added to the baseline model (2-LSTM). A similar trend exists for LF-DAM and MN-DAM, which indicates the effectiveness of each unit in DAM. Due to space limitations and the similar observations, we report the ablation studies on DualVD-DAM only in the following experiments.
The Effectiveness of Two-Level Decode Structure
|Model||M1||M2||Repetition||Richness|
|RSL (DualVD-G): RSL only||0.60||0.47||0.20||0.03|
|WDL: WDL only||0.69||0.54||0.07||0.15|
To demonstrate the complementary advantages of the response-level semantic decode layer (RSL) and the word-level detail decode layer (WDL), and to understand the information selection mode, we first conduct a human study on the effectiveness of RSL, WDL and the full model DualVD-DAM, and then visualize the gate values of the Memory Unit to reveal the information selection mode.
In the human study, we follow previous work and sample 100 results from the VisDial v1.0 validation set, asking three people to evaluate the quality of the last response in each dialogue. Distinct from previous works, we add Repetition and Richness metrics, and for all metrics we record the score when at least two people agree. As shown in Table 3, WDL performs best on Richness and reduces Repetition by 0.13 compared to RSL, which indicates that WDL contributes to the increase of detailed information and the decrease of repetition in the response. After incorporating RSL and the Memory Unit with WDL, Repetition is further reduced by 0.06, while M1 and M2 improve by 0.06 and 0.07 respectively, which demonstrates the complementary advantages of the two levels of information. We also notice that Richness decreases slightly. This is mainly because the information from RSL concentrates more on the global information than on detailed information.
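The agreement rule used in the human study can be sketched as follows. The sketch assumes that each annotator gives a categorical judgment per response and that a metric is reported as the fraction of samples whose agreed judgment is positive; the text does not spell out this aggregation, so `agreed_label`, `metric_score` and the toy votes are hypothetical.

```python
from collections import Counter

def agreed_label(judgments, min_agree=2):
    """Return the label chosen by at least `min_agree` annotators, or None if there is none."""
    label, count = Counter(judgments).most_common(1)[0]
    return label if count >= min_agree else None

def metric_score(all_judgments, positive=1):
    """Fraction of samples whose agreed label equals `positive`."""
    agreed = [agreed_label(j) for j in all_judgments]
    return sum(a == positive for a in agreed) / len(agreed)

# Hypothetical binary judgments from three annotators on four responses.
repetition_votes = [(1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 1, 1)]
repetition_score = metric_score(repetition_votes)
```

With three annotators and binary judgments, at least two always agree, so every sample contributes to the score.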
Information Selection Mode.
We visualize the gate values in the Memory Unit for DualVD-DAM to demonstrate the information selection mode of the two levels of information. As shown in Figure 4, the ratio of gate values for RSL is always higher than that for WDL. This indicates that the response-level information in RSL plays the predominant role in guiding the response generation. Another clear phenomenon is that the ratio of gate values for WDL increases rapidly when generating the last word, which can be viewed as a signal to stop the response generation in time when the response already covers the complete semantics. This may be due to the fact that WDL captures word-level information and is sensitive to word repetition, which helps to avoid generating repetitive words.
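One plausible way to obtain such ratios from a gate vector is sketched below. It assumes that the gate is defined over the concatenation of the RSL and WDL vectors and that the ratio of each layer is its share of the total gate mass; the exact definition behind Figure 4 is not given in the text, and `gate_ratios` and the toy gate values are hypothetical.

```python
def gate_ratios(gate_values, rsl_dim):
    """Split a gate vector into its RSL and WDL parts and return each part's share."""
    rsl_mass = sum(gate_values[:rsl_dim])
    wdl_mass = sum(gate_values[rsl_dim:])
    total = rsl_mass + wdl_mass
    return rsl_mass / total, wdl_mass / total

# Hypothetical gate vectors for three decoding steps (2 RSL dims followed by 2 WDL dims).
steps = [[0.9, 0.8, 0.3, 0.2], [0.8, 0.9, 0.2, 0.3], [0.6, 0.7, 0.5, 0.6]]
ratios = [gate_ratios(gate, rsl_dim=2) for gate in steps]
```

In this toy example the RSL share stays above the WDL share at every step, while the WDL share rises at the last step, mirroring the qualitative pattern described above.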
The Effectiveness of Each Operation in Deliberation Unit
We conduct experiments on DualVD-DAM to reveal the influence of the essential operations in the Deliberation Unit: 1) I-S only uses semantic-level image information for information selection. 2) I-V only uses visual-level image information for information selection. 3) I-SV jointly exploits semantic and visual information for information selection. 4) H only leverages the dialogue history for information selection.
As shown in Table 4, I-S and I-V update information from the semantic and visual aspects respectively, while I-SV updates information from both aspects and outperforms both of these models. The relatively higher results of the H model indicate that the history information plays a more important role in the decoder. By jointly incorporating all the structure-aware information from the encoder, DualVD-DAM achieves the best performance on all the metrics. This demonstrates the advantage of DAM in fully utilizing the information from an elaborate encoder, which is beneficial for enhancing existing generation models by adaptively combining their encoders with DAM.
4.3 Qualitative Analysis
Response generation quality.
Figure 5 shows two examples of three-round dialogues and the corresponding responses generated by DualVD-G and DualVD-DAM. When answering Q3 in the first example, DualVD-DAM generates the accurate and non-repetitive response “having fun”, in contrast to DualVD-G. Compared to the human response, DualVD-DAM further provides the detailed description “he is posing for the picture”, which increases the richness of the response. A similar observation holds for the second example.
Information selection quality.
We further visualize the evidence captured from the image and dialogue history for generating the essential words, i.e. “there”, “buildings” and “background”. As shown in Figure 6, it is difficult to answer the question “Is there a building?” accurately, since the buildings are distant and small. DualVD-DAM accurately focuses on the visual and dialogue clues. Taking the word “background” as an example, our model focuses on the background in the image and highlights the clear day in the dialogue history. This shows that DAM can adaptively focus on the exact visual and textual clues for generating each word, which contributes to the high quality of the responses.
In this paper, we propose a novel generative decoder, DAM, consisting of the Deliberation Unit, the Abandon Unit and the Memory Unit. The decoder adopts a compositive decoding mode to model information at both the response level and the word level, so as to discourage repetition in the generated responses. DAM is a universal decoding architecture that can be incorporated with existing visual dialogue encoders to improve their performance. Extensive experiments combining DAM with the LF, MN and DualVD encoders verify that DAM effectively improves the generation performance of existing models and achieves new state-of-the-art results on the popular benchmark dataset.
This work is supported by the National Key Research and Development Program (Grant No.2017YFB0803301).
[1] (2020) DMRM: a dual-channel multi-hop reasoning model for visual dialog. In AAAI, pp. 7504–7511.
[2] (2017) Visual dialog. In CVPR, pp. 1080–1089.
[3] (2019) Multi-step reasoning via recurrent dual attention for visual dialog. In ACL, pp. 6463–6474.
[4] (2014) Generative adversarial nets. In NIPS, pp. 2672–2680.
[5] (2019) Image-question-answer synergistic network for visual dialog. In CVPR, pp. 10434–10443.
[6] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[7] (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America 79 (8), pp. 2554–2558.
[8] (2020) DualVD: an adaptive dual encoding model for deep visual understanding in visual dialogue. In AAAI, pp. 11125–11132.
[9] (2015) Adam: a method for stochastic optimization. In ICLR.
[10] (2018) Visual dialog with multi-turn attentional memory network. In PCM, pp. 611–621.
[11] (2018) Visual coreference resolution in visual dialog using neural module networks. In ECCV, pp. 153–169.
[12] (2018) Sequicity: simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In ACL, pp. 1437–1447.
[13] (2017) Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model. In NIPS, pp. 314–324.
[14] (2018) Mem2Seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In ACL, pp. 1468–1478.
[15] (2019) Recursive visual attention in visual dialog. In CVPR, pp. 6679–6688.
[16] (2018) Deep Dyna-Q: integrating planning for task-completion dialogue policy learning. In ACL, pp. 2182–2192.
[17] (2020) Two causal principles for improving visual dialog. In CVPR, pp. 10860–10869.
[18] (2019) Towards empathetic open-domain conversation models: a new benchmark and dataset. In ACL, pp. 5370–5381.
[19] (2019) Factor graph attention. In CVPR, pp. 2039–2048.
[20] (2017) Get to the point: summarization with pointer-generator networks. In ACL, pp. 1073–1083.
[21] (2017) Generating high-quality and informative conversation responses with sequence-to-sequence models. In EMNLP, pp. 2210–2219.
[22] (2019) Generating responses with a specific emotion in dialog. In ACL, pp. 3685–3695.
[23] (2019) Learning to abstract for memory-augmented conversational response generation. In ACL, pp. 3816–3825.
[24] (2018) Are you talking to me? Reasoned visual dialog generation through adversarial learning. In CVPR, pp. 6106–6115.
[25] (2019) Neural response generation with meta-words. In ACL, pp. 5416–5426.
[26] (2019) Making history matter: gold-critic sequence training for visual dialog. In ICCV, pp. 2561–2569.
[27] (2019) Generative visual dialogue system via adaptive reasoning and weighted likelihood estimation. In IJCAI, pp. 1025–1031.
[28] (2019) Reasoning visual dialogs with structural and partial observations. In CVPR, pp. 6669–6678.