Unlike the Visual Question Answering task, which requires answering only one question about an image, Visual Dialogue involves multiple questions that cover a broad range of visual content and may relate to any objects, relationships or semantics. The key challenge in the Visual Dialogue task is thus to learn a more comprehensive and semantically rich image representation that can adapt its attention over the image to different questions. In this research, we propose a novel model to depict an image from both visual and semantic perspectives. Specifically, the visual view helps capture appearance-level information, including objects and their relationships, while the semantic view enables the agent to understand high-level visual semantics from the whole image down to local regions. Furthermore, on top of such multi-view image features, we propose a feature selection framework that adaptively and hierarchically captures question-relevant information at a fine-grained level. The proposed method achieves state-of-the-art results on benchmark Visual Dialogue datasets. More importantly, by visualizing the gate values we can tell which modality (visual or semantic) contributes more to answering the current question, which offers insight into human cognition in Visual Dialogue.
Understanding the real world by analyzing vision and language together is a priority for AI to achieve human-like abilities, and it enables the development of diverse applications, such as Visual Question Answering (VQA) [Agrawal et al.2017], Referring Expressions [Wang et al.2019] [Johnson, Karpathy, and Fei-Fei2016], etc. To move a step further, this work focuses on the Visual Dialogue [Das et al.2017] problem, which requires the agent to answer a series of questions in natural language regarding an image. It is more challenging because it demands that the agent adaptively focus on diverse visual content with respect to the current question, while other vision-language problems mostly attend to some specific objects or regions. Consider the dialogue in Figure 1. Given “Q1: Is the man on the skateboard?”, the agent should be aware of the foreground visual content, i.e. the man and the skateboard, while “Q5: Is there sky in the picture?” shifts the agent’s attention to the sky in the background. Besides appearance-level questions like Q1 and Q5, “Q4: Is he young or older?” requires the agent to reason about the visual content for higher-level semantics. How to adaptively capture the desired visual content throughout the dialogue is therefore one of the most critical challenges in visual dialogue.
The typical solution for visual dialogue is to first fuse visual (i.e. image) features and textual (i.e. dialogue history, current question) features and then infer the correct answer. Most approaches focus on enhancing the textual representations by recovering the dialogue relational structure [Zheng et al.2019], handling imperfect dialogue history [Yang, Zha, and Zhang2019], and maintaining dialogue consistency [Qi et al.2018]. However, the role of visual information has so far been less studied. Existing models simply use a CNN [Simonyan and Zisserman2014] or R-CNN [Ren et al.2017] to extract visual features and focus on the question-relevant content. Such visual features have limited expressive ability due to their monolithic representations [Wang et al.2019]. On the one hand, questions in a visual dialogue refer to a wide range of visual content, including objects, relationships and high-level semantics, which cannot be covered by monolithic features. On the other hand, the referred visual content may change remarkably from visual appearance to high-level semantics over the course of the dialogue, which is difficult for monolithic features to capture.
Our work is inspired by the Dual-coding theory [Paivio1971] of the human cognitive process. Dual-coding theory postulates that our brain encodes information in two ways: visual imagery and textual associations. When asked to act upon a concept, our brain retrieves either images or words, or both simultaneously. The ability to encode a concept in two different ways strengthens the capacity of memory and understanding. Inspired by this cognitive process, we first propose a novel scheme to comprehensively depict an image from both visual and semantic perspectives, where the major objects and their relationships are kept in the visual view while the higher-level abstraction is provided in the semantic view. We then propose a model called Dual Encoding Visual Dialogue (DualVD) to adaptively select question-relevant information from the image in a hierarchical mode: intra-modal selection first captures the visual and semantic information individually from the object-relational visual features and global-local semantic features; inter-modal selection then obtains the joint visual-semantic knowledge by correlating vision and semantics. This hierarchical framework imitates the human cognitive process to capture targeted visual clues from multiple perceptual views and semantic levels.
The main contributions are summarized as follows: (1) We explore the possibility of cognition in visual dialogue by depicting an image from both visual and semantic views, which covers a broad range of visual content referred to by most questions in the visual dialogue task; (2) We propose a hierarchical visual information selection model, which is able to progressively select question-adaptive clues from intra-modal and inter-modal information for answering diverse questions. It supports explicit visualization of visual-semantic knowledge selection and reveals which modality contributes more to answering the question; (3) The proposed model outperforms state-of-the-art approaches on benchmark visual dialogue datasets, which demonstrates the feasibility and effectiveness of the proposed model. The code is available at https://github.com/JXZe/DualVD.
Visual Question Answering focuses on answering arbitrary natural language questions conditioned on an image. The typical solutions in VQA build multi-modal representations upon a CNN-RNN architecture [Ren, Kiros, and Zemel2015, Qi et al.2017]. Existing approaches incorporate context-aware visual features. For example, [Ren, Kiros, and Zemel2015] applies CNN features of the whole image as global context, [Xu and Saenko2016, Anderson et al.2018] adopt patches and salient objects learned by an attention mechanism as the region context, and [Gao et al.2018, Li et al.2019b] exploit inter-object relationships via graph attention networks or convolutional networks to model the relational context. However, how to leverage external visual-semantic knowledge to learn more informative relational representations for better semantic understanding has not been well explored yet. Another emerging line of work represents visual content explicitly in natural language and solves VQA as a reading comprehension problem. In [Li et al.2019a], the image is wholly converted into descriptive captions, which preserve semantic-level information in the textual domain. However, approaches of this kind rely on generated captions, which may not be as accurate as desired, and they entirely discard the informative and subtle visual features. Apart from the difference in task, our model makes notable progress over the above approaches. We adopt a dual encoding mechanism to provide both appearance-level and semantic-level visual information, so that it combines the strengths of the above two kinds of approaches.
Visual Dialogue aims to answer the current question conditioned on an image and the dialogue history. Most existing works are based on the late fusion framework and focus on modeling the dialogue history. The sequential co-attention mechanism [Qi et al.2018] enables the model to identify question-relevant image regions and dialogue history to keep the dialogue consistent. [Yang, Zha, and Zhang2019] introduces false responses into the dialogue history for an adverse critic on historic errors. [Zheng et al.2019] introduces an Expectation Maximization algorithm to infer the dialogue structure and the answers via graph neural networks. In contrast to the extensive study on modeling dialogue history, the image content has been less studied. Although some works devise attention mechanisms to focus on the essential visual features most relevant to the question and dialogue history, such monolithic visual representations still have limited expressive ability. In this work, we explore the role of visual information in visual dialogue. Unlike existing works that merely model appearance, our model adaptively captures visual and semantic information in a hierarchical mode inspired by the Dual-coding theory of the human cognitive process, providing adequate visual clues for diverse questions in visual dialogue.
The visual dialogue task can be described as follows: given an image $I$ and its caption $C$, a dialogue history till round $t-1$, $H_t = \{C, (Q_1, A_1), \ldots, (Q_{t-1}, A_{t-1})\}$, and the current question $Q_t$, the task is to rank a list of 100 candidate answers and return the best answer to $Q_t$. In this section, we first introduce the idea of depicting an image from both visual and semantic perspectives. It covers a broad range of visual content like objects, relationships, global semantics and local semantics. Then we introduce a hierarchical feature selection approach to adaptively capture question-relevant visual-semantic information. Our model is based on the late fusion (LF) framework [Das et al.2017], which will be described at the end of this section.
In visual dialogue, two types of information play the primary role in depicting an image and answering diverse questions: visual information and semantic information (Figure 2). For visual information, the major objects and relationships should be kept. For semantic information, a higher-level abstraction of the image content should be provided, which involves prior knowledge and complex cognition. In this section, we introduce a dual encoding scheme to generate both visual and semantic representations to depict an image. A scene graph is used to represent the visual information, while multi-level captions in natural language are leveraged to represent the semantic information. These representations serve as the input of our DualVD model.
Each image is represented as a scene graph. Its nodes represent objects detected by a pre-trained object detector, and its edges represent the semantic visual relationships embedded by our visual relationship encoder. We use a pre-trained Faster-RCNN [Ren et al.2017] to detect objects in an image and describe each object as a 2048-dimensional feature vector. The visual relationship encoder [Zhang et al.2019], which is pre-trained on a visual relationship benchmark, i.e. GQA [Hudson and Manning2019], encodes the relationship between a subject and an object as a relation embedding. We assume that some relationship exists between any pair of objects by treating “unknown-relationship” as a special kind of relationship. Therefore, the scene graph we construct is fully connected.
The visual relationship encoder embeds the relationships between objects into a semantic space which is aligned with their corresponding descriptions in natural language. Such continuous representations, unlike discrete labels, preserve discriminative capability and contextual awareness. Inspired by recent work [Zhang et al.2019], our encoder consists of a visual part and a textual part. The visual part takes as input three CNN feature maps corresponding to the visual regions of the subject, the object and their union, and outputs three encoded embeddings, one per region. The textual part uses a shared GRU to encode the annotations and yield textual embeddings. The loss function is designed to maximize the cosine similarity between the embeddings of positive visual-textual pairs and to push negative pairs apart. The embedding of the union region serves as the visual relationship representation between the subject and the object.
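The training objective described above can be illustrated with a small sketch. It assumes a margin-based triplet form over cosine similarity; the margin value, the function names and the toy vectors are hypothetical, and the loss actually used in [Zhang et al.2019] may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_cosine_loss(visual, text_pos, text_neg, margin=0.2):
    """Hinge loss (assumed form): zero once the positive pair is more
    similar than the negative pair by at least `margin`."""
    pos = cosine_similarity(visual, text_pos)
    neg = cosine_similarity(visual, text_neg)
    return max(0.0, margin - pos + neg)

# Toy embeddings (hypothetical): a union-region embedding and two texts.
union_embedding = [1.0, 0.0]
matching_text = [0.9, 0.1]
mismatched_text = [0.0, 1.0]

loss_good = triplet_cosine_loss(union_embedding, matching_text, mismatched_text)
loss_bad = triplet_cosine_loss(union_embedding, mismatched_text, matching_text)
```

With these toy vectors the loss vanishes when the matching text plays the positive role and is large when the roles are swapped, which is the behavior the objective is meant to encourage.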
The advantage of captions over visual features is that captions are expressed in natural language with high-level semantics, which can provide straightforward clues for the questions without a “heterogeneous gap”. The global image caption (provided by the dataset) is beneficial for responding to questions that explore the scene. Meanwhile, the dense captions [Johnson, Karpathy, and Fei-Fei2016] of an image provide a set of local-level semantics, including object properties (e.g. position, color, shape), prior knowledge related to the objects (e.g. weather, species, emotion), and relationships between objects (e.g. interactions, spatial positions, comparison). The words in both the global caption and the dense captions are represented by concatenated GloVe [Pennington, Socher, and Manning2014] and ELMo [Peters et al.2018] word embeddings. The global caption and the dense captions are then encoded separately with two different LSTMs.
On top of the visual and semantic image representations, we propose a novel feature selection framework to adaptively select question-relevant information from the image. Under the guidance of the current question, the feature selection process is devised in a hierarchical mode: intra-modal selection first captures the visual and semantic information from the visual module and the semantic module, respectively; inter-modal selection then obtains the desired visual knowledge from both modules via selective visual-semantic fusion. The advantage of such a hierarchical framework is that it can explicitly reveal the progressive feature selection mode and preserve fine-grained information as much as possible.
The visual module is presented at the top of Figure 2. Based on the scene graph introduced in Scene Graph Construction, we aim to select question-relevant relation information and object information. For relation information, we propose a relation-based graph attention network to enrich the object representations with question-aware relationships. It mainly consists of two units: Question-Guided Relation Attention highlights the critical relationships, and Question-Guided Graph Convolution enriches each object’s features with those of its relation-critical neighbors. For object information, we highlight the objects that are most informative for answering the question. Finally, the clues from objects and relationships are fused in Object-Relation Information Fusion to obtain the question-relevant visual content.
Question-Guided Relation Attention: The question-guided relation attention examines all the relationships to highlight the ones most relevant to the question. First, we select question-relevant information from the dialogue history and merge it into the question representation via a gate operation. Each word is represented by concatenating the embeddings extracted from pre-trained GloVe and ELMo models, and the dialogue history and the current question are then encoded separately with two different LSTMs. The gate operation combines concatenation and the element-wise product: it computes a vector of gate values over the question and history encodings and, together with a linear transformation layer, produces the encoded history-aware question features.
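A gate of this kind can be sketched as follows. The exact formula is not reproduced in this text, so the sketch assumes a sigmoid gate computed from the concatenated encodings, applied by element-wise product and followed by a linear projection; all weight names, shapes and toy values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def gated_fusion(a, b, w_gate, b_gate, w_out):
    """Assumed gate: g = sigmoid(W_g [a; b] + b_g); out = W_o (g * [a; b])."""
    concat = a + b  # list concatenation plays the role of [a; b]
    gate = [sigmoid(z + bias) for z, bias in zip(matvec(w_gate, concat), b_gate)]
    gated = [g * x for g, x in zip(gate, concat)]  # element-wise product
    return gate, matvec(w_out, gated)

# Toy encodings (hypothetical): 2-d question and 2-d history.
question = [1.0, 2.0]
history = [3.0, 4.0]
w_gate = [[0.0] * 4 for _ in range(4)]   # zero weights -> every gate value is 0.5
b_gate = [0.0] * 4
w_out = [[1.0, 0.0, 1.0, 0.0],           # projects the gated 4-d vector back to 2-d
         [0.0, 1.0, 0.0, 1.0]]

gate, history_aware_question = gated_fusion(question, history, w_gate, b_gate, w_out)
```

The same gate pattern appears again in the model wherever two information sources are fused (object and relation features, global and local captions, visual and semantic modules), with different inputs each time.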
The attention weights of all the visual relationships are then calculated under the guidance of the history-aware question features. Each relation embedding is updated according to its attention weight, which yields the question-guided relation embedding.
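The relation attention step can be sketched as below. It assumes that each relation is scored by a weighted element-wise product with the question vector, that the scores are normalized with a softmax, and that each relation embedding is rescaled by its weight; the scoring form, the parameter `w_score` and the toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relation_attention(question, relations, w_score):
    """Score every relation embedding against the question, then rescale it."""
    scores = [
        sum(w * q * r for w, q, r in zip(w_score, question, rel))
        for rel in relations
    ]
    weights = softmax(scores)
    updated = [[a * r for r in rel] for a, rel in zip(weights, relations)]
    return weights, updated

# Toy example (hypothetical): three relation embeddings for one subject.
question = [1.0, 0.0]
relations = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
w_score = [1.0, 1.0]

weights, guided_relations = relation_attention(question, relations, w_score)
```

In this toy case the first relation aligns with the question and receives the largest weight, while the other two receive equal, smaller weights.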
Question-Guided Graph Convolution: This module further updates each object’s representation under the guidance of the question by aggregating information from its neighborhood and the corresponding relationships. Given the feature of a neighboring object and the question-guided relation embedding that links it to the target object, an attention value of the neighbor with respect to the target object is calculated. The attention values obtained for all the neighbors of an object are used to compute a linear combination of their features, which serves as the updated representation of that object. Since the scene graph is fully connected, the number of neighbors of each object is equal to the number of objects detected in the image.
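The aggregation over a fully connected scene graph can be sketched as follows. The scoring function is an assumption (a weighted element-wise product between the question and the concatenated neighbor feature and relation embedding); only the softmax-weighted linear combination of neighbor features follows directly from the description. All names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def graph_convolution(question, objects, relations, w_score):
    """Update every object as an attention-weighted sum of its neighbors.

    objects[j]      : feature vector of object j
    relations[i][j] : question-guided relation embedding from object i to j
    question        : vector as long as an object feature plus a relation embedding
    """
    dim = len(objects[0])
    updated = []
    for i in range(len(objects)):
        scores = []
        for j, neighbor in enumerate(objects):  # fully connected: every object is a neighbor
            pair = neighbor + relations[i][j]   # list concatenation
            scores.append(sum(w * q * x for w, q, x in zip(w_score, question, pair)))
        beta = softmax(scores)
        updated.append(
            [sum(b * obj[d] for b, obj in zip(beta, objects)) for d in range(dim)]
        )
    return updated

# Toy graph (hypothetical): two objects with 2-d features and 1-d relations.
objects = [[1.0, 0.0], [0.0, 1.0]]
relations = [[[0.0], [0.0]], [[0.0], [0.0]]]
w_score = [1.0, 1.0, 1.0]

uniform = graph_convolution([0.0, 0.0, 0.0], objects, relations, w_score)
focused = graph_convolution([5.0, 0.0, 0.0], objects, relations, w_score)
```

A zero question vector gives uniform attention, so every object becomes the average of all objects; a question that emphasizes the first feature dimension pulls every updated representation towards the first object.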
Object-Relation Information Fusion: In visual dialogue, both the object appearance and the visual relationships help infer the answer, but to different degrees. In this module, we adaptively fuse question-relevant object features from the original object feature and the relation-aware object feature, again by a gate, to obtain the updated representation of each object. The whole image representation is obtained as the weighted sum of the object representations. In order to strengthen the influence of the current question and the original object features on the retrieved visual clues, we calculate an attention value for each object under the guidance of the question, and the representation of the whole image is then updated as the sum of the updated object representations weighted by these attention values.
This module aims to select and merge question-relevant semantic information from global and local captions with a Question-Guided Semantic Attention module and a Global-Local Information Fusion module. The semantic module is located in the middle of Figure 2.
Question-Guided Semantic Attention: The semantic attention mechanism highlights relevant captions at both the global level and the local level. This type of attention is guided by the current question, enhanced with the corresponding information from the dialogue history (as introduced above). According to the attention distribution, we enrich the caption representations so that they better adapt to the question. An attention value is calculated for each caption, global or local, under the guidance of this history-aware question, and the representations of the global caption and the local captions are then updated according to their attention values.
Global-Local Information Fusion: Some questions are global-related while others are local-related. This step adaptively selects information from the global caption and the local captions via a gate of the same kind as described above; its output is the textual representation of the abstract visual semantics.
When asked to answer a question, the agent will retrieve either the visual information or the semantic information individually, or both simultaneously. In the inter-modal selection module, we design a gate operation to decide the contributions of the two modalities to the answer prediction. The gate values are computed from the outputs of the visual module and the semantic module, and the gated combination of the two gives the final visual knowledge representation.
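How the gate values can be read as modality contributions is sketched below. It assumes that the gate vector is laid out over the concatenation of the visual and semantic representations, visual part first, and that each modality’s contribution is its share of the total gate mass; the layout, the function name and the toy gate values are hypothetical.

```python
def modality_shares(gate, visual_dim):
    """Split a gate vector over [visual; semantic] into per-modality shares."""
    visual_mass = sum(gate[:visual_dim])
    semantic_mass = sum(gate[visual_dim:])
    total = visual_mass + semantic_mass
    return visual_mass / total, semantic_mass / total

# Toy gate values (hypothetical): the first two entries gate the visual
# representation, the last two gate the semantic representation.
gate = [0.9, 0.7, 0.2, 0.2]
visual_share, semantic_share = modality_shares(gate, visual_dim=2)
```

For this toy gate, about 80% of the gate mass falls on the visual part, which would be read as the question relying mostly on visual information.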
The full model consists of a late fusion encoder and a discriminative (softmax) decoder. The encoder first embeds each part of the dialogue tuple (image, dialogue history and current question). We then concatenate the history and question embeddings with the visual knowledge representation into a joint input embedding for answer prediction. The decoder ranks all the answers in a set of 100 candidates. It first encodes each candidate via a common LSTM. A dot product followed by a softmax operation is then calculated between the joint input embedding and the candidates to get the posterior probability of each candidate. We obtain the predicted answer by ranking the candidates according to their posterior probabilities. Our model can also be combined with more complex decoders and fusion strategies, such as memory networks, co-attention, adversarial networks, etc. In this paper, we use the simple late fusion and discriminative decoder to highlight the advantages of our visual encoder.
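The discriminative decoding step (dot product, softmax, ranking) can be sketched as follows. The embeddings are toy values and the function name is hypothetical; in the model the candidate embeddings would come from the shared LSTM.

```python
import math

def rank_candidates(joint_embedding, candidate_embeddings):
    """Score candidates by dot product, normalize with softmax, and rank them."""
    scores = [
        sum(x * y for x, y in zip(joint_embedding, cand))
        for cand in candidate_embeddings
    ]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    posterior = [e / total for e in exps]
    # Candidate indices sorted from most to least probable.
    ranking = sorted(range(len(posterior)), key=lambda i: posterior[i], reverse=True)
    return posterior, ranking

# Toy example (hypothetical): three candidates instead of 100.
joint = [1.0, 0.0]
candidates = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0]]
posterior, ranking = rank_candidates(joint, candidates)
```

Here the second candidate has the largest dot product with the joint embedding and is therefore ranked first.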
Datasets: We conduct extensive experiments on two datasets [Das et al.2017]: VisDial v0.9 and VisDial v1.0. For both datasets, the examples are split into “train”, “val” and “test”, and each dialogue contains 10 rounds of question-answer pairs. VisDial v1.0 is an upgraded version of VisDial v0.9. For VisDial v0.9, all the splits are built on MSCOCO images. For VisDial v1.0, all the splits of VisDial v0.9 serve as “train” (120k), while “val” (2k) and “test” (8k) consist of dialogues on an extra 10k COCO-like images from Flickr.
Evaluation Metrics: We follow the metrics in [Das et al.2017] to evaluate the response performance. In the test stage, the model is asked to rank 100 candidate answer options and is evaluated by Mean Reciprocal Rank (MRR), Recall@k and the Mean Rank of the human response (Mean) on both datasets. For VisDial v1.0, Normalized Discounted Cumulative Gain (NDCG) is added as an extra metric for more comprehensive analysis. Lower values are better for Mean, and higher values are better for the other metrics.
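The rank-based metrics can be computed from the rank that each model assigns to the human response, as in the sketch below. The cutoffs k = 1, 5, 10 are an assumption (the values of k are not given in this text), NDCG is omitted because it needs per-candidate relevance annotations, and the function name and toy ranks are hypothetical.

```python
def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute MRR, Recall@k and Mean Rank from 1-based ranks of the human response."""
    n = len(gt_ranks)
    metrics = {
        "MRR": sum(1.0 / r for r in gt_ranks) / n,   # higher is better
        "Mean": sum(gt_ranks) / n,                   # lower is better
    }
    for k in ks:
        # Fraction of questions whose human response is ranked within the top k.
        metrics[f"R@{k}"] = sum(1 for r in gt_ranks if r <= k) / n
    return metrics

# Toy ranks (hypothetical) of the human response for four questions.
metrics = retrieval_metrics([1, 2, 4, 10])
```

For these four toy ranks, MRR is 0.4625, Mean is 4.25, and Recall@1, Recall@5 and Recall@10 are 0.25, 0.75 and 1.0, respectively.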
Implementation Details: For the textual part, the maximum sentence lengths of the dialogue history, the dense captions and the current question are all set to 20. The hidden state size of all the LSTM blocks is set to 512. We use Faster-RCNN with ResNet-101 to detect object regions and extract the 2048-dimensional region features. Since captions with low confidence are likely to introduce unexpected noise and too many captions decrease the computational efficiency, we select the top 6 (the mean value of the caption distribution) dense captions in our model. We train all of our models with the Adam optimizer for 16 epochs, with a mini-batch size of 15 and a dropout ratio of 0.5. For the learning rate, we first apply a warm-up strategy for 2 epochs with a warm-up factor of 0.2, and then adopt a cosine annealing strategy that decays from the initial learning rate to the termination learning rate over the remaining epochs.
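The learning rate strategy can be sketched as a function of the epoch index. The epoch counts and the warm-up factor follow the description above; the learning rate values `base_lr` and `final_lr` are hypothetical placeholders because the actual values are not given in this text, and the linear warm-up shape is an assumption.

```python
import math

def learning_rate(epoch, total_epochs=16, warmup_epochs=2, warmup_factor=0.2,
                  base_lr=1e-3, final_lr=1e-5):
    """Linear warm-up followed by cosine annealing (base_lr and final_lr are placeholders)."""
    if epoch < warmup_epochs:
        # Ramp linearly from warmup_factor * base_lr up to base_lr.
        alpha = epoch / warmup_epochs
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    # Cosine decay from base_lr down to final_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(17)]
```

Under these placeholder values, the schedule starts at 20% of the initial learning rate, reaches the initial learning rate after the two warm-up epochs, and decays smoothly to the termination learning rate at epoch 16.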
In Table 1 and Table 2, we compare DualVD with state-of-the-art discriminative models, namely LF [Das et al.2017], HRE [Das et al.2017], MN [Das et al.2017], SAN-QI [Yang et al.2016], HieCoAtt-QI [Lu et al.2016], AMEM [Seo et al.2017], HCIAE [Lu et al.2017], SF [Jain, Lazebnik, and Schwing2018], CoAtt [Qi et al.2018], CorefMN [Kottur et al.2018], VGNN [Zheng et al.2019], LF-Att [Das et al.2017], MN-Att [Das et al.2017], RvA [Niu et al.2019] and DL-61 [Guo, Xu, and Tao2019]. Our model outperforms the compared approaches on most metrics, which highlights the importance of visual understanding from the visual and semantic modules in visual dialogue. CoAtt and HieCoAtt-QI are related to our model in the sense that they leverage an attention mechanism to identify question-relevant visual features. However, they ignore the semantically rich relationships and language priors. It should be noted that our model and the compared approaches are all single-step models. With the success of multi-step reasoning, ReDAN [Gan et al.2019] achieves a 1% boost over our model on most metrics. We believe that stacking our visual encoder to achieve multi-step visual understanding is promising future work. DL-61 [Guo, Xu, and Tao2019] is a two-stage network for candidate selection and re-ranking, while FGA [Schwartz et al.2019] conducts attention across all the data parts; both achieve relatively high performance on some metrics compared with our model. We believe that our model for the visual part and existing works for the dialogue or answer parts have complementary advantages.
An ablation study on the VisDial v1.0 validation set examines the influence of the essential components of DualVD. We use the same discriminative decoder for all the following variants:
Object Representation (ObjRep): this model uses the averaged object features to represent an image. Object representations are enhanced by question-driven attention.
Relation Representation (RelRep): this model applies averaged relation-aware object representations via question-guided relation attention and question-guided graph convolution as the image representation.
Visual Module without Relationships (VisNoRel): this is our full visual module except that the relation embeddings are replaced by unlabeled edges and the convolution is conducted via the intra-modal attention [Gao et al.2019].
Visual Module (VisMod): this is our full visual module, which fuses objects and relation features.
Global Caption (GlCap): this model uses LSTM to encode the global caption to represent the image.
Local Caption (LoCap): this model uses LSTM to encode the local captions to represent the image.
Semantic Module (SemMod): this is our full semantic module, which fuses global and local features.
DualVD (full model): this is our full model, which incorporates both the visual module and semantic module.
In Table 3, the models in the first block are designed to evaluate the influence of key components in the visual module. ObjRep only considers isolated objects and ignores the relational information, and it performs worse than VisMod. RelRep considers the relationships by introducing relation embeddings. However, the empirical study indicates that enhancing visual relationships while weakening object appearance is still not sufficient for better performance. VisNoRel fuses the information from both object appearance and neighborhoods without relational semantics, which achieves a slight improvement over ObjRep. On top of VisNoRel, VisMod moves a step further by aggregating all the neighborhood features with relational information, and it achieves the best performance compared to the above three models.
Orthogonal to the visual part, the models in the second block evaluate the influence of key components in the semantic part. The overall performance of GlCap and LoCap is lower by 1% and 0.15%, respectively, than that of their integrated version SemMod, which adaptively selects and fuses the task-specific descriptive clues from both global-level and local-level captions.
DualVD yields a large boost over SemMod and a relatively slight boost over VisMod. This unbalanced boost indicates that the visual module provides comparatively richer clues than the semantic module. Combining the two modules gains an extra boost because of their complementary information. The performance of DualVD without ELMo embeddings decreases only slightly, which suggests that the improvement of DualVD mainly comes from the novel visual representation.
A critical advantage of DualVD lies in its interpretability: DualVD is capable of predicting the attention weights in the visual module and the semantic module, and the gate values in the visual-semantic fusion. It supports explicit visualization and can reveal DualVD’s mode of information selection. Figure 3 shows three examples with varying dependence on the visual and semantic modules. The third example (third and fourth rows in Figure 3) shows three rounds of dialogue about an image. In each round, DualVD is capable of capturing the most relevant visual and semantic information regarding the current question. For the first question, the visual module highlights the face of a boy and the relationships to his body and the other boy, while the semantic module puts more attention on the captions describing the two boys, all of which provide useful clues to infer the correct answer. In the second and third rounds, DualVD attends to the whole grass and the discs, respectively. In this example, the attended information changes adaptively through the dialogue, which explains why the correct answer is selected.
We further show two more examples with a current question and the dialogue history (first two rows in Figure 3) to reveal DualVD’s mode of information selection. We observe that the amount of information derived from each module depends highly on the complexity of the question and the relevance of the content. More information comes from the semantic module when the question involves complex relationships or when the semantic module explicitly contains question-relevant clues. In Figure 3, the ratio of total gate values reveals the amount of information derived from each module. In the first example, more visual information is required. A similar observation holds for the second question in the third example. Such questions, which refer to object appearance, draw more clues from the visual module. In the second example, the current question is about the relationship between the girl and the hair. The amount of semantic information increases remarkably since there exists explicit evidence, “The girl has long hair”. This observation also holds for the third question in the third example. Since language is a higher-level encoding of the visual content after complex reasoning involving prior knowledge, it provides more useful clues for semantic-level questions.
In this paper, inspired by the dual-coding theory in cognitive science, we propose a novel DualVD model for visual dialogue. DualVD mainly consists of a visual module and a semantic module, which encode image information at the appearance level and the semantic level, respectively. The desired clues for answer inference are adaptively selected from the two modules via a gate mechanism. Results from extensive experiments on benchmarks demonstrate that deriving visual information from visual-semantic representations can achieve superior performance compared to other state-of-the-art approaches. Another major advantage of DualVD is its interpretability via progressive visualization, which gives insight into how information from different modalities is used for inferring answers.
This work is supported by the National Key Research and Development Program (Grant No.2017YFB0803301).
Johnson, J.; Karpathy, A.; and Fei-Fei, L. 2016. Densecap: Fully convolutional localization networks for dense captioning. In CVPR, 4565–4574.