A human dialogue by its nature exhibits highly complex structures. Specifically, in a dialogue, some utterances are semantically dependent on previous ones (i.e., context), while others are independent because of an abrupt change in topic. Previous topics may also be readdressed later in the dialogue. Furthermore, we take advantage of multimodal inputs, including visual, linguistic, and auditory information, to capture the temporal topics of conversation. Notably, Visual Dialog (VisDial) [das2017visual], which is an extended version of visual question answering (VQA) [antol2015vqa, goyal2017making], reflects the complex and multimodal nature of dialogue. Unlike VQA, it is designed to answer a sequence of questions given an image, utilizing the dialog history as context. For example, to answer an ambiguous question like “Do you think they are her parents?” (D4 in Fig. 1), a dialog agent should attend to the meaningful context from the dialog history as well as visual information. This task demands a rich set of abilities: understanding a sequence of multimodal contents (i.e., an input image, questions, and dialog history) and reasoning about the semantic structures among them.
Previous approaches in visual dialog have explored the problem of reasoning about semantic structures in dialogs by employing the soft-attention mechanism [bahdanau2014neural, xu2015show]. Most previous research has focused on extracting rich question-relevant representations from the given image and dialog history, while finding their relationships only implicitly [das2017visual, lu2017best, wu2018you, guo2019image, gan2019multi, schwartz2019factor]. Another line of research has tackled the problem of visual coreference resolution [seo2017visual, niu2018recursive, kang2019dual], and one further approach [zheng2019reasoning] attempts to find the inherent structures of the dialog. However, all previous work relies on the soft-attention mechanism, and we argue that applying it to the previous utterances severely limits a dialog agent's ability to learn various types of semantic relationships. Specifically, soft attention, which is based on a softmax function, always assigns a non-zero weight to every previous utterance, which results in dense (i.e., fully-connected) relationships. Herein lies the problem: even for questions that are partially dependent on (Q5 in Fig. 1) or independent of (Q6 in Fig. 1) the dialog history, all previous utterances are still taken into consideration and integrated into the contextual representations. As a consequence, the dialog agent relies on all previous utterances, even when they are irrelevant to the given question. This may hurt both performance and interpretability.
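The density argument above can be checked numerically. The following is a minimal sketch, not part of our model: the relevance scores are hypothetical, and only the first utterance is meant to be relevant to the question.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores of five previous utterances for one question.
# Only the first utterance is intended to be relevant.
scores = [4.0, -2.0, -2.0, -2.0, -2.0]
weights = softmax(scores)

# Softmax outputs are strictly positive, so every utterance is mixed into
# the contextual representation, however low its score.
all_positive = all(w > 0.0 for w in weights)

# Attention mass spent on the four utterances that should be ignored.
leaked_mass = sum(weights[1:])
```

However low the irrelevant scores are set, `leaked_mass` stays above zero, so a softmax-based agent cannot represent an isolated question.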
In this paper, we propose Sparse Graph Learning Networks (SGLNs) that explicitly discover the sparse structures of visually-grounded dialogs. We represent a dialog as a graph in which each node corresponds to a round of dialog and the edges represent the semantic dependencies between the nodes, as shown in Fig. 1. The proposed SGLNs infer the graph structure and predict the answer simultaneously. SGLNs involve two novel modules: a multimodal node embedding module and a sparse graph learning module. Inspired by a bottom-up and top-down attention mechanism [Anderson2017up-down], the node embedding module embeds the given image and each round of dialog jointly, yielding multimodal joint embeddings. We represent each embedding vector as a node of the graph. The sparse graph learning module infers two kinds of edge weights: binary (i.e., 0 or 1) edges and score edges. It then discovers the sparse and weighted structure by combining them. Note that the sparse graph learning module yields an isolated node when all elements of the binary edge weights are zero. It updates each node by integrating the neighboring nodes via a message passing framework and feeds the updated node features to the answer decoder. Furthermore, we introduce a new structural loss function to encourage our model to infer explicit and reliable dialog structures by leveraging supervision that is readily obtainable. Consequently, as shown in Fig. 1(c), our model learns various types of semantic relationships: (1) dense relationships as in D1-D4, (2) sparse relationships as in D5, and (3) no relationships as in D6. The main contributions of our paper are as follows:
We propose Sparse Graph Learning Networks (SGLNs) that consider the sparse nature of a visually-grounded dialog. By using a multimodal node embedding module and a sparse graph learning module, our proposed model circumvents the conceptual shortcoming of dense structures by pruning unnecessary relationships.
We propose a new structural loss function to encourage SGLNs to learn the aforementioned semantic relationships explicitly. SGLNs are the first approach that predicts the sparse structures of the visually-grounded dialog with the structural loss function.
SGLNs achieve new state-of-the-art results on the Visual Dialog v1.0 dataset using only 10.95% of the dialog history. We also compare SGLNs with baseline models to demonstrate the effectiveness of the proposed method. Finally, we perform a qualitative analysis of our proposed model, showing that SGLNs reasonably infer the underlying sparse structures and improve interpretability compared to a baseline model.
2 Related Work
Visual Dialog. The visual dialog task [das2017visual] was recently introduced as a temporal extension of VQA [antol2015vqa, goyal2017making]. In this task, a dialog agent should answer a sequence of questions by using an image and the dialog history as clues. We categorize the previous studies on visual dialog into three groups: (1) soft attention-based methods that compute attended representations of the image and the history [das2017visual, lu2017best, wu2018you, guo2019image, gan2019multi, schwartz2019factor, nguyen2019efficient], (2) visual coreference resolution methods [seo2017visual, kottur2018visual, niu2018recursive, kang2019dual] that clarify ambiguous expressions (e.g., it, them) in the question and link them to a specific entity in the image, and (3) a structural learning method [zheng2019reasoning] that attempts to infer dialog structures. Our approach belongs to the third group. Zheng et al. [zheng2019reasoning] designed a model that infers the structure while predicting the answer, in the context of an expectation-maximization (EM) algorithm. Specifically, they proposed a model based on graph neural networks (GNNs) that approximates the EM procedure. However, similar to the soft attention-based methods, they inferred dense semantic structures using a softmax function in the GNNs. Moreover, they recovered the structures only implicitly, using supervision for the given questions alone. To address these two aspects, we propose SGLNs that explicitly infer sparse structures with a definite objective (i.e., a structural loss function).
Meanwhile, a few studies [niu2018recursive, kim2020modality] have noticed the sparse property of visual dialog, but their reasoning capability is still quite limited. CDF [kim2020modality] randomly extracted up to three elements of the dialog history to avoid excessive exploitation of the whole history. For visual coreference resolution, RvA [niu2018recursive] backtracked through the history and selectively retrieved the visual attention maps of the previous dialog rounds that were determined to be useful.
Graph Neural Networks (GNNs) [gori2005new, scarselli2008graph] have sparked tremendous interest at the intersection of deep neural networks and structural learning approaches. Existing methods involving GNNs fall into two groups: (1) methods that operate on graph-structured data [kipf2016semi, battaglia2018relational, hamilton2017inductive, niepert2016learning, xu2018powerful], and (2) methods that construct a graph with neural networks to approximate the learning or inference process of graphical models [sukhbaatar2016learning, battaglia2016interaction, gilmer2017neural, kipf2018neural]. More recently, graph learning networks (GLNs), which are an extension of the second group, were proposed [pilco2019graph, on2020cut] with the goal of reasoning about the underlying structures of input data. Note that GLNs consider unstructured data and dynamic domains (e.g., time-varying domains). For example, CB-GLNs [on2020cut] attempted to discover the compositional structure of long video data by using a normalized graph-cut algorithm [shi2000normalized]. Our method belongs to GLNs. However, SGLNs differ significantly from previous studies in that they learn to build sparse structures adaptively, without relying on a predefined algorithm, and the dataset we use is highly multimodal.
3 Sparse Graph Learning Networks
In this section, we formulate the visual dialog task using graph structures, then describe our proposed model, Sparse Graph Learning Networks (SGLNs). The visual dialog task [das2017visual] is defined as follows: given an image , a caption describing the image, a dialog history until round , and a question at round , the goal is to find an appropriate answer to the question among the answer candidates, = . Following the previous work [das2017visual], we use the ground-truth answers for the dialog history.
In our approach, we consider the task as a graph with nodes, where each node corresponds to the multimodal feature for the previous dialog history and the current question . The semantic dependencies among the nodes are represented as weighted edges .
Fig. 2 provides an overview of our proposed model, Sparse Graph Learning Networks (SGLNs). Specifically, the SGLNs consist of three components: a multimodal node embedding module, a sparse graph learning module, and an answer decoder. The multimodal node embedding module aims to learn rich visual-linguistic representations for each round of dialog by employing a simple attention mechanism. We represent the multimodal joint feature vector for each round of dialog as a node of the graph. The sparse graph learning module estimates the binary and score edges among the nodes and combines these two edge weights into sparse weighted edges. Then, the sparse graph learning module aggregates the neighboring node feature vectors for the current question via the message passing algorithm [gilmer2017neural]. The aggregated hidden feature is fed into the answer decoder, which yields the most likely answer. Furthermore, the binary edges (i.e., 0 or 1) that represent the semantic relevance among the nodes are fed into the structural loss function so that the model predicts reliable dialog structures at test time. Drawing a comparison to human cognition, the multimodal node embedding module acts similarly to human episodic memory [baddeley2000episodic], where each node corresponds to a unit of episodic memory that contains visual and linguistic information for one round of dialog. Likewise, the sparse graph learning module mimics the behavior of a human who adaptively recalls relevant multimodal information from episodic memory.
In the following sub-sections, we will introduce input features for SGLNs, then describe the detailed architectures of the multimodal node embedding module, the sparse graph learning module, and the answer decoder. Finally, we present the objective function for SGLNs.
3.1 Input Features
Visual Features. In the given image , we extract the -dimensional visual features of objects by employing the pre-trained Faster R-CNN model [ren2015faster, Anderson2017up-down], which are denoted as .
Language Features. We first encode the question which is a word sequence of length , , by using a bidirectional LSTM [hochreiter1997long] as follows:
where and denote the forward and backward hidden states of the -th word, respectively. Note that we use the concatenation of the last hidden states from each LSTM, followed by a projection matrix , which results in . Likewise, each round of the dialog history is encoded into , and all the answer candidates at the -th round are also embedded to with additional LSTMs.
3.2 Multimodal Node Embedding Module
As shown in Fig. 2, the multimodal node embedding module embeds the visual-linguistic joint representations associated with each node , by performing visual grounding of each language feature. To implement this process, we take inspiration from a bottom-up and top-down attention mechanism [Anderson2017up-down, kim2016hadamard]. For the object-level visual features and the corresponding language feature , the node embedding module first finds the spatial objects that the language feature describes with the soft attention mechanism. Formally,
where and are non-linear functions that transform inputs to a -dimensional space, such as multi-layer perceptrons (MLPs). denotes the Hadamard product (i.e., element-wise multiplication), and is a vector whose elements are all one. The attention function is parametrized by the vector . Then, the multimodal feature is obtained from the attended visual feature and the language feature as follows:
where and are projection functions. As a consequence, we obtain visual-linguistic joint representations for all nodes which can be represented in the matrix-form .
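The two steps of the node embedding (language-guided attention over objects, then fusion by a Hadamard product) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than our implementation: each non-linear function is replaced by a single `tanh` layer, and the names `embed_node`, `project`, `Wv`, `Wq`, `w_att`, `Wv_out`, and `Wq_out` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]

def project(W, x):
    """One-layer stand-in for an MLP (assumption): tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def embed_node(objects, lang, Wv, Wq, w_att, Wv_out, Wq_out):
    # Step 1: score each object against the language feature.
    q = project(Wq, lang)
    logits = []
    for v in objects:
        joint = hadamard(project(Wv, v), q)
        logits.append(sum(w * j for w, j in zip(w_att, joint)))
    alpha = softmax(logits)
    # Attended visual feature: attention-weighted sum of object features.
    dim = len(objects[0])
    attended = [sum(a * v[d] for a, v in zip(alpha, objects)) for d in range(dim)]
    # Step 2: fuse the attended visual feature with the language feature.
    node = hadamard(project(Wv_out, attended), project(Wq_out, lang))
    return node, alpha

# Toy inputs: three 2-d "object" features and one 2-d language feature.
identity = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
lang = [2.0, 0.0]
node, alpha = embed_node(objects, lang, identity, identity,
                         [1.0, 1.0], identity, identity)
```

In this toy case the language feature is aligned with the first object, so the first object receives the largest attention weight. Running the function once per dialog round produces one node vector per round.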
3.3 Sparse Graph Learning Module
The sparse graph learning module infers the underlying sparse and weighted graph structure between nodes, where the edge weights are estimated based on the node features. To make the graph structure sparse, we propose two types of edges on the graph : binary edges and score edges , whose corresponding adjacency matrices are and , respectively. To simplify the notation, we omit the subscript in the following equations.
Binary Edges. We first define the binary edge between two nodes and as a binary random variable, for all and . The sparse graph learning module estimates the likelihood of the binary variables given the node features, where the probability indicates whether the two nodes are semantically related or not. We regard the binary variable as a two-class categorical variable and define the probability distribution as follows:
where is a learnable parameter and is the softmax temperature. Since is discrete and non-differentiable, we employ the Straight-Through Gumbel-Softmax estimator (i.e., ST-Gumbel) [jang2016categorical] to ensure end-to-end training. During forward propagation, ST-Gumbel makes a discrete decision by using the Gumbel-Max trick [maddison2016concrete]:
where the random variables are drawn from the standard Gumbel distribution [jang2016categorical]. In the backward pass, ST-Gumbel uses the derivative of the continuous probabilities in place of the derivative of the discrete sample, thus enabling back-propagation and end-to-end training.
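The forward pass of ST-Gumbel for a single binary edge can be sketched as below. This is a minimal sketch under stated assumptions: the function names and the probability `p_on` are hypothetical, and plain Python has no automatic differentiation, so the backward pass is only described in comments.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one sample from Gumbel(0, 1) by inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # clamp away from 0 so log(u) is finite
    return -math.log(-math.log(u))

def st_gumbel_edge(p_on, tau, rng):
    """Sample one binary edge whose 'on' probability is p_on (0 < p_on < 1)."""
    logits = [math.log(1.0 - p_on), math.log(p_on)]
    noisy = [(l + sample_gumbel(rng)) / tau for l in logits]
    # Forward pass: Gumbel-Max trick, a hard argmax over the perturbed logits.
    hard = 1 if noisy[1] > noisy[0] else 0
    # Relaxed sample: softmax of the same perturbed logits. In an autograd
    # framework, gradients would flow through this value instead of `hard`.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    soft_on = exps[1] / sum(exps)
    return hard, soft_on

rng = random.Random(0)
samples = [st_gumbel_edge(0.8, 1.0, rng) for _ in range(20000)]
on_rate = sum(hard for hard, _ in samples) / len(samples)
```

Because the Gumbel-Max trick samples exactly from the underlying categorical distribution, `on_rate` should be close to the assumed probability of 0.8.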
Score Edges. We also define the score edges that measure the extent to which the two nodes are relevant, and the weighted adjacency matrix is computed as:
with a learnable parameter . Following the relational graph learning algorithm [yang2018glomo], we compute the score edges using a squaring operation to stabilize training.
Sparse Weighted Edges. The sparse graph learning module multiplies the binary edges and score edges, finally yielding the sparse and weighted adjacency matrix as:
With the above edge weight estimations, the sparse graph learning module is able to model three types of relationships on : (1) dense relationships similar to the previous conventional softmax-based approaches if (i.e., all entries in are one), (2) sparse relationships if , and (3) no relationships if (i.e., isolated).
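The combination of the two edge types, and the three resulting regimes, can be sketched as follows. This is a sketch under stated assumptions: the score is taken to be a squared ReLU of a dot product, normalised over the history (our reading of [yang2018glomo]), the learnable parameter is omitted, and all function names are hypothetical.

```python
def score_edges(question, nodes):
    """Assumed score form: squared ReLU of a dot product, normalised to sum to 1."""
    raw = [max(0.0, sum(q * n for q, n in zip(question, node))) ** 2
           for node in nodes]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw

def sparse_weighted_edges(binary, scores):
    """Element-wise product: a binary edge of 0 prunes the corresponding score."""
    return [b * s for b, s in zip(binary, scores)]

def relationship_type(binary):
    """Classify the question node by its binary edges to the history."""
    n_on = sum(binary)
    if n_on == 0:
        return "none"    # isolated question node
    if n_on == len(binary):
        return "dense"   # every history node is kept
    return "sparse"      # only a subset is kept

question = [1.0, 0.0]
history = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
scores = score_edges(question, history)
edges = sparse_weighted_edges([1, 0, 1], scores)
```

In this toy case the second history node has the highest score but is pruned by its binary edge, which a softmax alone could not do.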
Message-passing and Update. Based on the sparse weighted adjacency matrix , the sparse graph learner updates the hidden states of all nodes through a message-passing framework [gilmer2017neural]. Similar to graph convolutional networks [kipf2016semi], we implement the message-passing layer as a linear projection of node features, followed by a normalized weighted sum according to the adjacency weights.
Note that is the degree matrix of . The hidden features of nodes are calculated via the update layer that adds the input feature and aggregated messages then feeds them into a non-linear function .
Notice that the sparse graph structure inference followed by the hidden state update can be viewed as dialog reasoning. Moreover, the model is able to perform multi-step reasoning by repeating the inference and update based on the hidden states. In this paper, for the sake of simplicity, we assume that only the edges connected to the question node exist (i.e., ). For the question node , the message vector and the hidden state vector are given by:
The sparse graph learner outputs the hidden state vector for the question node to predict the answer.
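The update of the question node can be sketched as below. This is a minimal sketch under stated assumptions: `tanh` stands in for the non-linear function, the degree is taken as the sum of the question node's edge weights, and the names `question_update` and `W` are hypothetical.

```python
import math

def question_update(h_q, history, edge_weights, W):
    """Aggregate projected history nodes into the question node, then update."""
    dim = len(h_q)
    degree = sum(edge_weights)  # normaliser for the weighted sum
    message = [0.0] * dim
    if degree > 0:
        for w, h in zip(edge_weights, history):
            # Linear projection of the neighbour's feature.
            projected = [sum(W[r][c] * h[c] for c in range(len(h)))
                         for r in range(dim)]
            for d in range(dim):
                message[d] += (w / degree) * projected[d]
    # Update layer: add the input feature and the message, then apply f.
    return [math.tanh(a + b) for a, b in zip(h_q, message)]

identity = [[1.0, 0.0], [0.0, 1.0]]
history = [[9.0, 9.0], [1.0, 0.0]]

# Sparse case: only the second history node is connected.
sparse_out = question_update([0.0, 0.0], history, [0.0, 1.0], identity)
# Isolated case: no edges, so the message is zero.
isolated_out = question_update([0.5, -0.5], history, [0.0, 0.0], identity)
```

When every edge weight is zero the message vanishes, and the answer is predicted from the question node's own multimodal feature alone.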
3.4 Answer Decoder
3.4.1 Discriminative Decoder.
The discriminative decoder computes the likelihood of the answer candidates by dot-product operations between the hidden vector and the feature vectors of the answer candidates . Then, the SGLNs are optimized by minimizing the negative log-likelihood of the ground-truth answer as:
is the one-hot encoded label vector. For evaluation, the answer candidates are ranked according to the likelihood.
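The discriminative decoding step can be sketched as follows. This is a sketch under stated assumptions: the likelihood is taken to be a softmax over the dot-product scores, and the function name and toy vectors are hypothetical.

```python
import math

def discriminative_decode(hidden, candidates, gt_index):
    """Score candidates by dot product; return the NLL loss and a ranking."""
    logits = [sum(h * c for h, c in zip(hidden, cand)) for cand in candidates]
    # Log-softmax over the candidate scores (numerically stable).
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - log_z for l in logits]
    # Negative log-likelihood of the ground-truth candidate.
    loss = -log_probs[gt_index]
    # Candidate indices sorted from most to least likely.
    ranking = sorted(range(len(candidates)), key=lambda i: -logits[i])
    return loss, ranking

hidden = [1.0, 0.0]
candidates = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0]]
loss, ranking = discriminative_decode(hidden, candidates, gt_index=1)
```

Here the ground-truth candidate (index 1) has the largest dot product with the hidden vector, so it is ranked first and the loss is small.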
3.4.2 Generative Decoder.
Similarly to the sequence-to-sequence model, the generative decoder aims to generate the ground-truth answer’s word sequence auto-regressively via LSTMs:
where is the output of the sparse graph learning module, and denotes the ground-truth answer consisting of words . We initialize the hidden states of the LSTMs with (i.e., ). Following the Visual Dialog task [das2017visual], we utilize the log-likelihood scores to determine the rank of candidate answers for the process of evaluation.
3.5 Objective Function
3.5.1 Structural Loss Function.
Along with the two loss functions, and , we introduce a structural loss function to encourage the SGLNs to infer explicit, reliable dialog structures. Inspired by the visual coreference resolution model [kottur2018visual], our method utilizes structural supervision in addition to the ground-truth answer at each round. Specifically, we automatically obtain the semantic dependencies among the rounds of dialog in the form of a lower triangular binary matrix from an off-the-shelf neural coreference resolution tool (https://github.com/huggingface/neuralcoref) based on the work [clark2016deep], and use this information as the structural supervision. Consequently, the SGLNs minimize the distance between the structural supervision and the binary matrix that is predicted by our model:
where denotes the element-wise mean squared error. Here, encourages the SGLNs to predict a reliable adjacency matrix (i.e., dialog structure). Note that the SGLNs use the structural supervision only during training, and infer the dialog structures on their own at test time. We note that the effectiveness of coreference resolution for the visual dialog task was explored in previous work [kottur2018visual]; however, their gain is limited as they use a different approach from ours.
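The structural loss can be sketched as an element-wise mean squared error between two binary matrices. This is a minimal sketch: the matrices below are hypothetical toy data, and averaging over all entries (rather than only the lower triangle) is an assumption.

```python
def structural_loss(predicted, supervision):
    """Element-wise mean squared error between two equally shaped matrices."""
    total, count = 0.0, 0
    for p_row, s_row in zip(predicted, supervision):
        for p, s in zip(p_row, s_row):
            total += (p - s) ** 2
            count += 1
    return total / count

# Hypothetical lower triangular supervision for a three-round dialog:
# entry (i, j) is 1 if round i depends on the earlier round j.
supervision = [[0, 0, 0],
               [1, 0, 0],
               [0, 1, 0]]
# Predicted binary edges that miss the dependency of round 2 on round 1.
predicted = [[0, 0, 0],
             [1, 0, 0],
             [0, 0, 0]]
loss = structural_loss(predicted, supervision)
```

Since both matrices are binary, the loss equals the fraction of entries on which the prediction and the supervision disagree.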
3.5.2 Multi-task Learning.
To predict both the dialog structure and the answer to the given question, the SGLNs are trained to minimize the sum of the structural loss and the loss of the answer decoder: or where are weights for each loss. Optionally, the SGLNs take the dual decoder strategy by minimizing the three losses simultaneously: . Unless stated otherwise, the default loss is . The implementation details and results will be discussed in Section 4.
4 Experiments
In this section, we describe the details of our experiments on the Visual Dialog dataset. We first introduce the Visual Dialog dataset, evaluation metrics, and implementation details. Then, we compare the SGLNs with baseline models and state-of-the-art methods. Note that the qualitative analysis of our proposed model is described in Sec. 5.
4.1 Experimental Setup
Dataset. We benchmark our proposed model on the Visual Dialog (i.e., VisDial) v1.0 dataset. The VisDial dataset [das2017visual] was collected in a two-player chatting environment, where a questioner tries to figure out an unseen image by asking free-form questions, and an answerer responds to the questions based on the image. The VisDial v1.0 dataset contains 1.2M, 20k, and 44k question-answer pairs in the train, validation, and test splits, respectively. The dialogs for these splits were collected on 123,287 images from COCO [lin2014microsoft], and on 2,064 and 8k images from Flickr, respectively. A list of 100 answer candidates accompanies each question-answer pair.
Evaluation. We follow the standard protocol for evaluating the visual dialog model, as proposed in the earlier work [das2017visual]. Specifically, the visual dialog model ranks a list of 100 candidate answers and returns the ranked list for further evaluation. There are four kinds of evaluation metrics in the Visual Dialog task: (1) mean reciprocal rank (MRR) of the ground-truth answer in the ranked list, (2) recall@k (R@k), which is the existence of the ground-truth answer in the top-k list, (3) mean rank (Mean) of the ground-truth answer, and (4) normalized discounted cumulative gain (NDCG). Contrary to the classical retrieval metrics (MRR, R@k or mean rank), which are only based on a single ground-truth answer, NDCG takes into account all relevant answers from the 100-answer list by using the densely annotated relevance scores. It penalizes the lower-ranked answers with high relevance scores, and swapping candidates with the same relevance does not affect NDCG. Due to these properties, NDCG is regarded as the primary metric and used to evaluate methods for the VisDial v1.0 dataset.
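The four metrics can be sketched as follows. This is a sketch under stated assumptions: the NDCG truncation at the number of candidates with non-zero relevance reflects our understanding of the VisDial evaluation, and all function names and toy inputs are hypothetical.

```python
import math

def retrieval_metrics(rankings, gt_indices, ks=(1, 5, 10)):
    """MRR, mean rank, and R@k from ranked candidate lists (best first)."""
    # 1-based rank of the ground-truth answer in each ranked list.
    ranks = [ranking.index(gt) + 1 for ranking, gt in zip(rankings, gt_indices)]
    n = len(ranks)
    out = {"mrr": sum(1.0 / r for r in ranks) / n, "mean": sum(ranks) / n}
    for k in ks:
        out["r@%d" % k] = sum(1 for r in ranks if r <= k) / n
    return out

def ndcg(ranking, relevance):
    """NDCG@k with k = number of candidates whose relevance is non-zero."""
    k = sum(1 for r in relevance if r > 0)
    if k == 0:
        return 0.0
    dcg = sum(relevance[idx] / math.log2(pos + 2)
              for pos, idx in enumerate(ranking[:k]))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal))
    return dcg / idcg

# Two toy questions with three candidates each; ground truth is candidate 0.
metrics = retrieval_metrics([[1, 0, 2], [0, 1, 2]], [0, 0], ks=(1,))

# Candidates 0 and 1 are equally relevant, so swapping them leaves NDCG unchanged.
tie_a = ndcg([0, 1, 2], [1.0, 1.0, 0.0])
tie_b = ndcg([1, 0, 2], [1.0, 1.0, 0.0])
# Placing a half-relevant candidate above a fully relevant one lowers NDCG.
worse = ndcg([2, 0, 1], [1.0, 0.0, 0.5])
```

The tie example illustrates why NDCG can reward several acceptable answers, whereas MRR, R@k, and mean rank depend on a single ground-truth answer.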
Implementation Details. The SGLNs embed all the language inputs into 300-dimensional vectors initialized by GloVe [pennington2014glove]. All three BiLSTMs used for encoding the word embedding vectors are single-layer with 512 hidden units. We also use the bottom-up attention features [Anderson2017up-down] from Faster R-CNN [ren2015faster] pre-trained on Visual Genome [krishna2017visual]. The number of object features per image is , and the dimension of each feature is . The dimension of is 512. The hyperparameters for the multi-task learning are , , and . We employ the Adam optimizer [kingma2014adam] with initial learning rate . The learning rate is warmed up to until epoch 4 and is halved every two epochs from epoch 5 to epoch 10. We train our proposed model on the VisDial v1.0 training split and evaluate it on the validation and test splits.
4.2 Quantitative Results
4.2.1 Comparison with Baselines.
We compare SGLNs to the baseline models to demonstrate the effectiveness of our method. We define two models as baselines: Dense and Sparse-hard. The Dense model utilizes a softmax attention mechanism, which results in a fully-connected graph. In contrast, the Sparse-hard model picks exactly one element of the dialog history by applying the Gumbel-Softmax to the whole dialog history. Note that the structural supervision is provided to the Sparse-hard model. The results are summarized in Table 1. The SGLNs achieve better performance than the baseline models on the NDCG metric while maintaining competitive performance on the ground-truth dependent metrics (i.e., MRR, R@k, and Mean rank). We also observe that the Dense model, which exploits the entire dialog history, shows the best performance on the ground-truth dependent metrics. We argue that the Dense model mainly focuses on finding the single ground-truth answer with a rich set of dialog history, at the cost of sacrificing the ability to provide ‘flexible’ answers (i.e., NDCG). Similarly, the NDCG performance of the Sparse-hard model tends to increase as the sparsity increases.
4.2.2 Question-type Analysis.
Using the same setup as in the above experiment, we conduct a question-type analysis of the NDCG scores to verify our hypothesis discussed in Sec. 1. Based on the semantic dependency information introduced in Sec. 3, we categorize all questions in the VisDial v1.0 validation split into three groups: (1) independent questions that can be answered without dialog history, (2) partially dependent questions that demand a few elements of dialog history, and (3) densely dependent questions that require all previous dialogs. As illustrated in Fig. 3, we compare our proposed model with the softmax-based Dense model, showing that the SGLNs outperform the Dense model on all types of questions. The performance gap between the two models is 3.74%, 2.61%, and 0.83% for the three types of questions, respectively. The gap is largest for independent questions, where the Dense model struggles most to find relevant answers. This supports the claim that excessive exploitation of the dialog history can act as a distraction for such questions.
4.2.3 Comparison with the State-of-the-art.
We compare our proposed model with the state-of-the-art methods on the VisDial v1.0 dataset. As shown in Table 2, SGLNs with the discriminative decoder outperform all other methods with respect to the NDCG metric, including the concurrent work, Transformer [nguyen2019efficient]. That work demonstrated the effectiveness of training the discriminative and generative decoders simultaneously (i.e., ). Accordingly, we also apply the dual decoder strategy as described in Sec. 3 for a fair comparison, lifting our model’s NDCG to 61.27%. The results of the dual decoder models are obtained from the output of the discriminative decoder. Note that the sparsity of the SGLNs is 89.05%, which means that our proposed model utilizes only 10.95% of the dialog history. The sparsity is calculated as the percentage of zero-valued edges in the graph. We consider these results encouraging as they indicate that the SGLNs adaptively attend to the dialog history while achieving new state-of-the-art performance on the primary metric. Furthermore, we report the performance of the generative decoder-based models on the VisDial v1.0 validation split. As shown in Table 3, the SGLNs achieve new state-of-the-art performance on NDCG with a sparsity of 87.03%. Note that all entries in Table 3 are re-implemented by [gan2019multi], utilizing the object-level visual features from Faster R-CNN [ren2015faster] and GloVe [pennington2014glove] vectors for a fair comparison.
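The sparsity figure can be computed as sketched below. This is a minimal sketch: the binary edges are hypothetical toy data, and counting only the edges from each question node to its history is an assumption.

```python
def sparsity_percent(binary_edge_rows):
    """Percentage of zero-valued edges over all question-to-history edges."""
    total = sum(len(row) for row in binary_edge_rows)
    zeros = sum(1 for row in binary_edge_rows for edge in row if edge == 0)
    return 100.0 * zeros / total

# Hypothetical binary edges of two question nodes, each with two history nodes.
rows = [[0, 0],
        [0, 1]]
sparsity = sparsity_percent(rows)
used_history = 100.0 - sparsity  # share of history edges actually used
```

By the same complement, a sparsity of 89.05% corresponds to using 10.95% of the dialog history.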
Visualization of the Inferred Graph Structures.
For qualitative analysis, in Fig. 4 we visualize the images, the corresponding dialogs in the validation split, and the inferred adjacency matrices, together with those from the Dense model as a counterpart.
Compared to the dense structure of the baseline, the proposed SGLNs indeed learn the innate sparse structures, and the question nodes receive information from the other nodes in a selective fashion.
For instance, in the first dialog example, the questions from Q3 to Q10 have non-zero binary edges to all previous contexts except D1 and D2, which do not contain relevant information about ‘the woman’.
In contrast, Q1 and Q2 are not connected to any other node, not even the caption node, because they can be answered without additional context.
Knowledge Transfer of Semantic Structure. The structural loss function in Section 3.5 can be seen as a knowledge distillation loss [Hinton2014] that transfers knowledge from the pre-trained neural coreference resolution model to our sparse graph learning module. Even though we employ ST-Gumbel to mitigate the instability of training the binary edges, this structural loss was decisive in speeding up the early stage of training.
In this paper, we formulate the visual dialog task as a graph structure learning task where the edges represent the semantic dependencies among the multimodal embedding nodes learned from the given image, caption, question, and dialog history.
The proposed Sparse Graph Learning Networks (SGLNs) learn the sparse dialog structures by incorporating binary and score edges, leveraging structural supervision.
Our experiments demonstrate the efficacy of SGLNs, which achieve state-of-the-art NDCG performance of 61.27 on the test-std split of the VisDial v1.0 dataset while using only 10.95% of the dialog history.
Qualitatively, the visualized analysis with the inferred graph structures shows adaptive mechanisms depending on the type of the questions.
Acknowledgements. The authors would like to thank SK T-Brain for sharing GPU resources. This work was partly supported by the Korea government (2015-0-00310-SW.StarLab, 2017-0-01772-VTT, 2019-0-01367-BabyMind).