ORD: Object Relationship Discovery for Visual Dialogue Generation

06/15/2020 · by Ziwei Wang et al. · The University of Queensland

With the rapid advancement of image captioning and visual question answering at the single-round level, the question of how to generate multi-round dialogue about visual content has not yet been well explored. Existing visual dialogue methods encode the image directly into a fixed feature vector, which is concatenated with the question and history embeddings to predict the response. Some recent methods tackle the co-reference resolution problem using a co-attention mechanism to cross-refer relevant elements from the image, history and target question. However, it remains challenging to reason about visual relationships, since the fine-grained object-level information is omitted before co-attentive reasoning. In this paper, we propose an object relationship discovery (ORD) framework to preserve object interactions for visual dialogue generation. Specifically, a hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally to obtain the final graph embeddings. A graph attention is further incorporated to dynamically attend to this graph-structured representation at the response reasoning stage. Extensive experiments have shown that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships. The model achieves superior performance over the state-of-the-art methods on the Visual Dialog dataset, increasing MRR from 0.6222 to 0.6447, and recall@1 from 48.48




1. Introduction

Multimedia content understanding has attracted increasing attention in the multimedia, vision and language fields in recent years. Emerging research directions such as image captioning (Karpathy and Fei-Fei, 2017; Bin et al., 2017) and visual question answering (VQA) (Antol et al., 2015) demonstrate promising reasoning ability towards the ultimate goal of multimedia intelligent systems. In brief, image captioning models describe an image with a single sentence, whilst VQA models answer an individual question conditioned on the visual content. However, in reality, a human is able to initiate several rounds of highly interrelated conversation about the visual content rather than a one-off response. Motivated by this, Visual Dialog (Das et al., 2017) has recently been proposed as a challenging extension of captioning and VQA. In visual dialogue, a dialogue agent is built to continually answer multiple rounds of highly coherent questions grounded in the visual content and conditioned on the dialogue history.

Figure 1.

The proposed model detects the salient regions, reasons the object interactions, and encodes the visual relationships as a scene graph. The neural network is trained to reason over the representations of image, scene graph, question and dialogue history, and to predict a suitable answer.

The multi-modal nature of the input question, image and history requires the dialogue model to simultaneously understand textual questions, extract salient visual features and retrieve the relevant dialogue history for response prediction. In general, existing visual dialogue methods consist of a multi-modal encoder and a language response decoder. The basic multi-modal encoder contains three sub-modules: a convolutional neural network (CNN) image encoder, a question long short-term memory (LSTM) language model, and a history LSTM language model. Some recent work tackles the co-reference resolution problem, a task in natural language processing of finding all expressions that refer to the same entity in a given text. These methods utilise a co-attention mechanism to cross-refer both visual and textual elements from the multi-modal sources for a comprehensive feature representation.

Despite the fact that co-reference resolution methods couple multi-modal cues logically at a high level, the fine-grained visual cues are not thoroughly exploited. Recent methods simply use CNN visual features without considering object regions and visual relationships, and therefore lack recognition of the locative relationships between regions of interest. The information loss in object-object interactions may be forgivable in simple scenes such as "a pencil on the table" or "sheep on the grass", in which the number of objects is limited. However, such "pure" scenes are not common in real-world scenarios: real images unavoidably contain multiple instances, overlapping spatial arrangements and noisy backgrounds. For example, in Figure 1, the picture contains many closely arranged or overlapping objects (e.g. bike-wheel, human-shirt-short) and a complex background with criss-crossing trees, grass and bright sky.

Although it is trivial for a neural dialogue agent to answer naïve questions such as “Is it a sunny day?” given general visual context, it is still challenging to answer the fine-grained object-level details such as “What is he wearing?”, or “Where is the bike located?”. The global visual feature and high-level co-attention methods are not sufficient to reason object attributes and visual relationships.

To alleviate the above issues, we propose an attentive graph-structured hierarchical model for visual dialogue. To the best of our knowledge, this is the first paper to leverage scene graphs in a visual dialogue task. A scene graph is a graph structure representing the scene, where the nodes are objects (e.g. human, shirt) and the edges are the relationships between the objects. In this work, the scene graph contains only objects and relationships, without attributes, following (Lu et al., 2016a; Li et al., 2017; Xu et al., 2017; Li et al., 2018). The object attributes could be further encoded in the object embeddings as auxiliary information, but we do not consider them for simplicity.

The proposed model utilises the scene graph to preserve the object interactions in the image, whilst retaining the reasoning capability of co-reference resolution models. Figure 2 illustrates an overview of the proposed visual dialogue framework. Technically, a tri-stream neural network serves as the encoder, jointly learning the representations of the image, question and history features. In the visual stream, the proposed framework adopts Faster-RCNN (Ren et al., 2017) to extract regional visual features, which are utilised in both scene graph generation and visual feature representation.

To preserve visual relationships in the generated scene graph, we propose a Hierarchical Graph Convolutional Network (HierGCN) to integrate the graph structure into the neural visual dialogue framework. The HierGCN captures the object nodes and their neighbour relationship edges locally, and then a global GCN refines the graph representations to preserve the connections between objects.

Furthermore, the graph attention mechanism is incorporated to capture the salient object nodes in the scene graph based on the current context. Similarly, history and visual attention modules are equipped to reason about the relevance to the target question. Finally, the fused representation of the attended graph, visual and history features and the question embedding is decoded by a standard discriminative or generative decoder to predict the answer. Different from all existing visual dialogue methods, the proposed ORD model detects fine-grained object-level regions and encodes the scene with a hierarchical understanding of visual relationships, rather than simply summarising visual features from convolutional layers, and therefore significantly improves dialogue response reasoning.

The contributions in this paper are three-fold:

  1. To the best of our knowledge, this is a novel attempt to leverage object relationships encoded in a graph representation for visual dialogue tasks, with which the generated dialogues reveal visual details, sensitivity to object-object relationships and robustness to background noise.

  2. Different from previous work, the fine-grained object detection is utilised to extract details from the visual content. A novel Hierarchical Graph Convolutional Network (HierGCN) is proposed to better preserve graph-structured object interactions, and a graph attention mechanism is equipped to enable an intelligent shift among focused objects.

  3. Extensive experiments are conducted on the Visual Dialog dataset illustrating the effectiveness of the proposed scene graph-structured framework, the HierGCN and the graph attention module.

The rest of this paper is organised as follows: Section 2 reviews visual dialogue and object relationship detection methods. Section 3 introduces a basic scene graph-structured dialogue generation model and the full model with graph attention. Section 4 presents the experiments, and Section 5 concludes the proposed framework.

Figure 2. Architecture of the visual dialogue generation framework. In general, given an image, a question and the dialogue history, the proposed model encodes all the features into a combined feature representation to predict the answer. For the image, the object regions are first detected, followed by two embedding streams: scene graph and visual region features. In the scene graph embedding stream, the visual relationships are detected from the region proposals, and the graph structure is preserved via a novel Hierarchical Graph Convolutional Network (HierGCN). A novel graph attention mechanism is designed to selectively focus on the relevant object nodes, conditioned on the question and attended history, to produce the attended graph embeddings. Similarly, the attended history and visual embeddings are obtained from the individual attention modules. Finally, the question and attended context embeddings are fused and forwarded for answer decoding.

2. Related Work

2.1. Visual Dialogue

Multimedia content understanding tasks such as captioning (Karpathy and Fei-Fei, 2017; Vinyals et al., 2015; Xu et al., 2015; Krause et al., 2017; Wang et al., 2018) and visual question answering (VQA) (Anderson et al., 2018; Yang et al., 2016; Lu et al., 2016b; Gao et al., 2019) have been intensively studied in the multimedia community recently. Even though understanding visual content with a single sentence (image captioning) or a question-answer pair (VQA) has achieved great success, a large amount of underlying information, such as object-object interactions and the geometric structure of the scene, is omitted when the visual scene is condensed into a single round of summary. As an extension of image captioning and VQA, multi-round visual dialogue has been proposed in (Das et al., 2017) to tackle the above deficiencies of a single-round brief. Specifically, Das et al. (Das et al., 2017) introduced a visual-based dialogue task named VisDial. In VisDial, the questioner (e.g. a human) asks questions about the visual content, and the dialogue agent composes a human-understandable response based on the question, the visual content and the dialogue history. The visual content and the conversation history are held only by the dialogue agent.

The emerging work mainly focuses on the dependencies between the question, image and history to solve the visual dialogue task, following an encoder-decoder framework. Lu et al. (Lu et al., 2017) proposed a history-conditioned image attentive encoder (HCIAE) to attend to the relevant history and salient image regions; this work further improves the generative decoder by transferring knowledge from the discriminative model. Following the same motivation, a sequential co-attention mechanism is proposed in (Wu et al., 2018). The co-attention jointly generates attention maps between the input question, image and history, and all the co-attended features are combined into the final feature as the encoder output. An adversarial learning strategy is adopted to self-assess the quality of the response during learning, which further improves the performance. A recent graph neural network (GNN) based method (Zheng et al., 2019) reasons about the dependencies between the textual questions and answers. Different from this GNN method, however, the focus of the proposed work is on the visual relationships underlying the image.

In general, existing visual dialogue methods (Kottur et al., 2018; Niu et al., 2019) rely heavily on high-level co-reference resolution, knowledge transfer and adversarial training strategies. The under-exploited visual details (e.g. region features, object-object interactions) inevitably make the generated answers vague and lacking in discrimination of the relative spatial relationships between objects. The proposed ORD, on the other hand, enriches the feature representation by modelling the scene as a structured graph with relationships, generating more meaningful embeddings that facilitate the logical reasoning process effectively.

2.2. Visual Relationship Detection

Visual relationships between objects have received increasing attention recently (Xu et al., 2017; Li et al., 2017; Han et al., 2018) and have proven beneficial to a variety of vision tasks (Johnson et al., 2015; Teney et al., 2017; Yao et al., 2018; Johnson et al., 2018).

In general, visual relationship detection methods can be divided into two trends: category-specific and generic. Early category-specific relation detectors exclusively target a specific category of relations, such as spatial relations (Choi et al., 2013; Gupta and Davis, 2008) and actions (Gupta et al., 2009; Gkioxari et al., 2015; Yao and Li, 2010). These models utilise visual features and geometry, but can hardly achieve satisfactory performance because the nature of relationships is not limited to a single category. Generic visual relationship detection (Lu et al., 2016a; Dai et al., 2017; Xu et al., 2017), in contrast, aims to predict various types of relationships with promising performance. Lu et al. (Lu et al., 2016a) first introduced a generic visual relationship task, treating object detection as the first stage and then recognising predicates between objects with a language prior. Furthermore, Xu et al. (Xu et al., 2017) formalised the visual predicates into the scene graph proposed in (Johnson et al., 2015); this model explicitly generates a scene graph representation from an image with an iterative message passing algorithm. Since the proposed visual dialogue model is agnostic to the scene graph generation method, this paper adopts the state-of-the-art subgraph-based scene graph generation framework Factorizable Net (Li et al., 2018) to predict scene graphs for visual dialogue.

3. The Proposed Approach

In this section, we introduce the proposed Object Relationship Discovery (ORD) framework, which generates visual dialogue by explicitly leveraging a scene graph to preserve visual relationships. As shown in Figure 2, ORD first detects object regions (e.g. with Faster-RCNN (Ren et al., 2017)) from the input RGB image; the detected region features are then forwarded into the scene graph embedding and visual embedding branches. In the graph embedding channel, the scene graph is constructed from the detected object regions to preserve semantic relationships before being encoded by the novel Hierarchical Graph Convolutional Network. In the HierGCN, the local GCN first captures the local relationships between the object nodes and their neighbour relationship edges, and then the global GCN refines the node features to preserve the object-object connections. Moreover, the graph attention dynamically focuses on the most relevant nodes given the question and attended history to generate the attended graph embedding. Simultaneously, to retain the regional visual content, the visual attention attends to the salient regions based on the context. Finally, the attended graph, visual and history embeddings, combined with the question embedding, are concatenated and forwarded to a fusion layer. The fused feature is the output of the encoder, which is decoded by a discriminative or generative decoder to obtain the dialogue response.

3.1. Problem Formulation

In this section, we formally define the visual dialogue generation task as introduced by Das et al. (Das et al., 2017), followed by a brief introduction of the conventional framework.

Formally, we denote the input RGB image as $I$, and the ground-truth multi-round dialogue history up to round $t$ (including the image caption $C$) as $H_t = \{C, (Q_1, A_1), \dots, (Q_{t-1}, A_{t-1})\}$, where $Q_i$ and $A_i$ are the questions and answers in the history, respectively.

The follow-up target question at round $t$ is denoted as $Q_t$. The objective of the proposed model is to return a valid answer $\hat{A}_t$ given $Q_t$, $I$ and $H_t$:

$$\hat{A}_t = \arg\max_{A} P(A \mid Q_t, I, H_t; \theta), \tag{1}$$

where $\theta$ denotes the model parameters. The encoder-decoder model is expected to return the most probable answer given the question, image and dialogue history.

Following (Das et al., 2017), the visual dialogue model is trained to return a ranking of 100 candidate answers. Given this problem setup, two different settings are introduced: discriminative and generative. For the discriminative setting, $Q_t$, $I$ and $H_t$ are encoded into a combined feature representation $e_t$. Based on $e_t$, the discriminative decoder directly produces a ranked list of the 100 candidate answers, of which the top-ranked candidate is chosen as the response. The metric-learning multi-class N-pair loss (Lu et al., 2017) is adopted to train the model to maximise the ground-truth answer score, whilst encouraging the model to score options similar to the ground truth higher than dissimilar ones. For the generative setting, the combined feature is extracted using the same encoder as in the discriminative setting, but the generative decoder is a word sequence generator which can generate open-ended answers. For evaluation purposes, the recurrent decoder uses the log-likelihood scores of all the candidate answers and ranks them accordingly.
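In both settings the decoder ultimately assigns one score per candidate answer (a confidence in the discriminative case, a log-likelihood in the generative case), and the response is read off a ranked list. A minimal sketch of that ranking step; the helper name and toy scores are illustrative, not from the paper:

```python
import numpy as np

def rank_candidates(scores):
    """Return 1-based ranks of candidate answers, highest score = rank 1.

    `scores` holds one confidence (discriminative) or log-likelihood
    (generative) value per candidate answer.
    """
    order = np.argsort(-np.asarray(scores))       # indices sorted by descending score
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)  # position in the sorted list
    return ranks

# Toy example with 4 candidates instead of 100: the candidate at index 2
# has the highest score, so it receives rank 1.
scores = [0.1, 0.5, 0.9, 0.3]
print(rank_candidates(scores))  # -> [4 2 1 3]
```

The evaluation metrics in Section 4 only require the rank assigned to the human response, so the same ranking routine serves both decoders.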

In both the discriminative and generative settings, the quality of the combined feature representation is critical to the dialogue generation task. In the basic Late Fusion (Das et al., 2017) framework, the global image CNN features, the question representation from the last hidden state of the question LSTM, and the history representation from the last hidden state of the history LSTM are simply concatenated and linearly transformed into a combined feature. In this fashion, the semantic relationships between the visual and textual modalities can hardly be captured. The follow-up work (Lu et al., 2017) proposes a history-conditioned image attentive encoder (HCIAE) to perform co-reference resolution using a co-attention mechanism, which refines the history and visual features by focusing only on the relevant context. Theoretically, the refined features contain only the visual and textual information most relevant to the given question. However, in practice, the image features from convolutional layers are not sufficient to eliminate background noise, and the object regions and relationships cannot be explicitly represented.

3.2. Object Relationship Discovery

The recent methods only consider CNN visual features extracted from the full image, without exploiting the fine-grained object regions and relationships.

In this section, we propose an Object Relationship Discovery (ORD) framework that exploits the scene graph for visual reasoning. Specifically, we explicitly recognise object regions in the full image and identify the relationships between objects to construct a scene graph. Since the graph structure is difficult to integrate directly into the existing visual dialogue framework, we propose a novel Hierarchical Graph Convolutional Network (HierGCN) to preserve the graph structure, and the relationship-aware graph embeddings are further attended via a novel graph attention module.

To exploit the scene graph in the proposed framework, the objective function in Equation 1 can be rewritten as:

$$\hat{A}_t = \arg\max_{A} P(A \mid Q_t, I, G, H_t; \theta), \tag{2}$$

where $G$ denotes the scene graph. Correspondingly, the combined representation preserves the image, history, question and scene graph features. The following sub-sections introduce the scene graph generation, the hierarchical GCN, the graph attention module, and the co-attention encoder in detail.

3.2.1. Preliminary Work on Scene Graph Generation

In this section, we briefly introduce how to extract image region features and build the scene graph from the given image $I$. First, Faster-RCNN is utilised to detect object regions in $I$. The region features $V = \{v_1, \dots, v_K\}$ of the top-$K$ salient object regions are extracted, where $v_i$ denotes the $d_v$-dimensional visual feature of each object region. Next, the scene graph $G = (\mathcal{O}, \mathcal{R})$ is constructed from a set of object nodes (vertices) $\mathcal{O}$ and relationship edges $\mathcal{R}$. As mentioned in Section 1, we do not consider object attributes in the scene graph.

To build the scene graph, the scene graph generation model (e.g. F-Net (Li et al., 2018)) groups the detected regions into pairs and constructs a fully-connected graph, in which each pair of object nodes is connected by directed edges (in and out). The model further represents the numerous relationships with fewer sub-graphs and object features to reduce the computational cost, and messages are passed through the graph to refine the features for more accurate relationship prediction. Finally, the top-ranked relationships are kept to construct the final scene graph.
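The output of this stage can be summarised as a list of scored relationship triplets, of which only the most confident ones are kept. A toy sketch of this filtering step; the object names, scores and helper function are hypothetical (the actual F-Net pipeline operates on sub-graph features, not strings):

```python
# Hypothetical sketch of scene-graph assembly from detector output:
# keep the top-K predicted relationship triplets by confidence, and
# collect the object nodes those triplets touch.
def build_scene_graph(triplets, top_k=50):
    """triplets: list of (subject, predicate, object, confidence)."""
    kept = sorted(triplets, key=lambda t: t[3], reverse=True)[:top_k]
    nodes = sorted({t[0] for t in kept} | {t[2] for t in kept})
    edges = [(s, p, o) for s, p, o, _ in kept]
    return nodes, edges

triplets = [
    ("man", "riding", "bike", 0.95),
    ("man", "wearing", "shirt", 0.90),
    ("bike", "has", "wheel", 0.85),
    ("tree", "behind", "man", 0.40),
]
# Keep the 3 most confident relationships; the low-scoring
# background relation ("tree", "behind", "man") is dropped.
nodes, edges = build_scene_graph(triplets, top_k=3)
```

This mirrors the paper's setting where the top-50 detected predicates are retained for scene graph construction (Section 4.1.2).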

3.2.2. Hierarchical Graph Convolutional Networks for Scene Graph Embedding

In this section, we discuss how to learn the graph embedding to integrate the scene graph in the proposed ORD framework.

Inspired by the Graph Convolutional Network (GCN) (Kipf and Welling, 2017) for node classification, we propose a Hierarchical Graph Convolutional Network (see Figure 3) to refine the region features while preserving the relationships between objects. We argue that the original GCN (Kipf and Welling, 2017) is only directly applicable to undirected graphs without edge features, which is not sufficient for our directed scene graph with relationship edges. To overcome this issue, the proposed hierarchical GCN first captures the connections between the object nodes and their neighbour relationship nodes locally, and then refines the relationship-aware node connections globally at the scene level.

Figure 3. An illustration of the Hierarchical Graph Convolutional Network. Given the above image and its scene graph, the Hierarchical GCN preserves the directed graph with edge features in two stages: (a) Local GCN. The nodes and their neighbour relationship nodes are fed into local GCN to preserve local graph structure. (b) Global GCN. The relationship-aware nodes are embedded via global GCN to maintain object-object connection. The local and global GCN are trained end-to-end to preserve both local neighbour structure and global scene-level node connections.

(a) Local GCN. The local GCN focuses only on objects and their neighbour relationships, as illustrated in Figure 3(a); the function of this layer is to preserve the relationship edge embeddings for each object node. After the graph propagation in the local GCN, the edge features are integrated into the updated node embeddings. We discuss the details of the local GCN step by step as follows:

Given the scene graph $G$, we first represent the directed graph as a bipartite graph, so that each directed edge can be represented by two in- and out- undirected edges:

$$\langle o_i, r, o_j \rangle \;\Rightarrow\; \{(o_i, r),\ (r, o_j)\}, \tag{3}$$

where the relationship $r$ is treated as a node. For example, after transforming to an undirected graph, the directed edge $\langle o_i, r, o_j \rangle$ becomes the two undirected edges $(o_i, r)$ and $(r, o_j)$. Therefore, the directed scene graph can be represented in the GCN.

Furthermore, we still want to preserve edges with features in the proposed local GCN. Therefore, we define the relationship edges as relationship nodes to enable the GCN to utilise edge features. However, if we directly mixed object nodes and relationship nodes together in one full graph, the model would conflate the two node types. As a result, we deploy the local GCN to focus only on the neighbour relationship nodes locally; the object-object connections are then further embedded by the global GCN.
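The edge-to-node rewrite described above can be sketched as follows; the triplet format and the relationship-node naming scheme are illustrative assumptions:

```python
# Sketch of the directed-edge rewrite: each relationship triplet becomes a
# relationship node connected to its subject and object by two undirected
# edges, so a plain GCN can consume edge (predicate) features as node features.
def to_bipartite(triplets):
    """triplets: list of (subject, predicate, object) directed edges."""
    rel_nodes, und_edges = [], []
    for i, (subj, pred, obj) in enumerate(triplets):
        rel = f"{pred}#{i}"            # unique id for the new relationship node
        rel_nodes.append(rel)
        und_edges.append((subj, rel))  # out-edge from the subject
        und_edges.append((rel, obj))   # in-edge into the object
    return rel_nodes, und_edges

rels, und_edges = to_bipartite([("man", "wearing", "shirt")])
# rels      -> ['wearing#0']
# und_edges -> [('man', 'wearing#0'), ('wearing#0', 'shirt')]
```

Keeping the in/out edge order distinguishes the subject side from the object side, so the direction of the original edge is still recoverable from the bipartite form.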

Formally, the input local sub-graphs contain the object nodes and their neighbour relationship nodes (both in and out), and the adjacency matrix $A_{loc}$ represents the connected relationship nodes w.r.t. the object nodes. The local GCN outputs the embeddings of all the nodes:

$$Z_{loc} = \sigma\left( A_{loc} X W_{loc} \right), \tag{4}$$

where $K$ is the number of object nodes, $M$ is the number of relationship types, $X \in \mathbb{R}^{(K+2M) \times d}$ contains all the object nodes and in-/out-relationship nodes, and $d$ is the dimension of the node features. $W_{loc} \in \mathbb{R}^{d \times d'}$ is the parameter to be learned, where $d'$ is the dimension of the GCN embedding space; here we set $d' = d$ for consistency. The adjacency matrix $A_{loc}$ preserves the graph structure, with the links between unconnected nodes set to zero. $\sigma$ is a non-linear activation function (e.g. ReLU).

Furthermore, during the graph propagation, the embeddings of the neighbour nodes of each node are summed to form its new node embedding, as shown in Eq. 4. Therefore, after the propagation, each object node contains the information of its relationship neighbour nodes. After the local updating, we keep only the updated object nodes for the global GCN computation and discard the relationship nodes. We denote the refined object node embeddings as $X_o \in \mathbb{R}^{K \times d}$.

(b) Global GCN. Since the local GCN has already captured the neighbour relationships for each node, the relationship-aware nodes and their undirected connections to each other are embedded by the global GCN:

$$Z = \sigma\left( A_o X_o W_g \right), \tag{5}$$

where $W_g \in \mathbb{R}^{d \times d}$ is the parameter to be learned and $A_o$ is the adjacency matrix over the object nodes.

Finally, we obtain the relationship-aware node embeddings $Z = \{z_1, \dots, z_K\}$, where $z_i$ denotes the $d$-dimensional node embedding of each object node and $K$ is the number of object nodes. If an object is absent from the image, its node embedding is set to zeros.
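The two-stage HierGCN propagation can be sketched in NumPy as below. The toy adjacency matrices, untrained random weights and small dimensions are all illustrative; the real model learns the weight matrices end-to-end and uses the adjacency produced by the generated scene graph:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One propagation step: aggregate neighbours, project, apply ReLU."""
    return np.maximum(A @ X @ W, 0.0)

rng = np.random.default_rng(0)
K, M, d = 3, 2, 8                       # 3 object nodes, 2 relationship nodes, dim 8
X = rng.normal(size=(K + M, d))         # stacked object + relationship node features

# Local GCN over the bipartite graph: self-loops plus toy
# object <-> relationship-node links.
A_local = np.eye(K + M)
A_local[0, K] = A_local[K, 1] = 1.0
W_local = rng.normal(size=(d, d))
Z_local = gcn_layer(A_local, X, W_local)
X_obj = Z_local[:K]                     # keep refined object nodes, drop rel. nodes

# Global GCN over object-object connections only.
A_obj = np.eye(K)
A_obj[0, 1] = A_obj[1, 0] = 1.0
W_global = rng.normal(size=(d, d))
Z_obj = gcn_layer(A_obj, X_obj, W_global)
# Z_obj: K relationship-aware object embeddings of dimension d
```

Discarding the relationship rows after the local stage matches the description above: edge information survives only as part of the updated object embeddings.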

3.2.3. Graph Attention for Visual Dialogue

Traditionally, visual attention (Xu et al., 2015) is a mechanism to reason about the most salient visual regions that the language model needs to focus on during word sequence generation. Inspired by visual attention, we propose a graph attention module to infer which object nodes should currently be focused on.

Figure 4. An illustration of the graph attention module.

In the graph attention module (Figure 4), given the learned graph embeddings $Z$, the graph attention attends to different relationship-aware object nodes given the current states, providing a context-aware scene graph representation.

Concretely, we use the relationship-aware graph embeddings $Z$ from the proposed HierGCN in Section 3.2.2, the question feature $q_t$ at round $t$ generated by an LSTM, and the attended dialogue history embedding $\hat{h}_t$ obtained following Section 3.3.

Given $Z$ and the context features $q_t$ and $\hat{h}_t$, we feed them through a fully-connected (FC) neural network layer. The outputs of the FC layer are passed through a softmax function to obtain the attention scores over the object nodes:

$$\alpha_t = \mathrm{softmax}\left( w_a^{\top} \tanh\left( W_z Z^{\top} + (W_q q_t)\,\mathbb{1}^{\top} + (W_h \hat{h}_t)\,\mathbb{1}^{\top} \right) \right), \tag{6}$$

where $\mathbb{1} \in \mathbb{R}^{K}$ is a vector filled with 1s to repeat the current state to match the number of nodes, $W_z, W_q, W_h \in \mathbb{R}^{k \times d}$ project the inputs into a latent $k$-dimensional embedding space for dimension reduction, and $w_a \in \mathbb{R}^{k}$ is a model parameter to learn. $\alpha_t \in \mathbb{R}^{K}$ is the attention score over the object node features. The context vector (attended graph embedding) is then calculated as a weighted sum of all the nodes:

$$\hat{z}_t = \sum_{i=1}^{K} \alpha_{t,i}\, z_i, \tag{7}$$

where the context vector $\hat{z}_t$ is forwarded and combined with the other features into the encoder feature $e_t$.
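A NumPy sketch of this graph attention step, where Z holds the K node embeddings and q and h stand for the question and attended-history features; the random untrained parameters and toy dimensions are illustrative only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def graph_attention(Z, q, h, Wz, Wq, Wh, w):
    """Score each node from its embedding plus the broadcast question/history
    state, then return the attention-weighted sum of node embeddings."""
    K = Z.shape[0]
    # broadcast q and h to every node (the "vector of ones" in the text)
    S = np.tanh(Z @ Wz.T + np.tile(q @ Wq.T, (K, 1)) + np.tile(h @ Wh.T, (K, 1)))
    alpha = softmax(S @ w)   # one normalised score per object node
    return alpha, alpha @ Z  # scores and attended graph embedding

rng = np.random.default_rng(1)
K, d, k = 4, 8, 5            # 4 nodes, node dim 8, latent attention dim 5
Z = rng.normal(size=(K, d))
q = rng.normal(size=d)
h = rng.normal(size=d)
alpha, z_att = graph_attention(Z, q, h,
                               rng.normal(size=(k, d)),
                               rng.normal(size=(k, d)),
                               rng.normal(size=(k, d)),
                               rng.normal(size=k))
# alpha sums to 1 over the K nodes; z_att has dimension d
```

The same scoring pattern, with different inputs, drives the visual and history attention modules in Section 3.3.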

3.3. Co-Attention Network Encoder

The co-attention strategy (Lu et al., 2016b, 2017; Wu et al., 2018) is commonly adopted for reasoning about the correlations between multiple feature inputs.

In this section, we introduce a visual relationship-aware co-attention structure (Figure 2) to solve the co-reference resolution between the question, image, scene graph and history, and to encode all the features into a combined representation $e_t$ for the final response generation.

First, we recapitulate the unattended (raw) feature embeddings of the image, scene graph, question and history. For the image representation, we use the top-$K$ region features $V$ extracted as introduced in Section 3.2.1. The relationship-aware graph node embeddings $Z$ over the $K$ object nodes are obtained following the proposed HierGCN (Section 3.2.2). For the textual content, the question at round $t$ is encoded with a standard language model (e.g. an LSTM) to obtain the feature vector $q_t$; similarly, the previous history is encoded with another history language model. Given the question $q_t$, we first reason which history dialogues are relevant to the current question: the history attention predicts attention scores, and the dialogue embeddings are summed, weighted by these scores, to obtain the attended history $\hat{h}_t$. For example, given the question "Does it look old or new?", it is difficult to precisely resolve the pronoun it directly; but by reasoning over the dialogue history, we can effortlessly locate the referent from the previous question "What colour is the bike?".

Moreover, the image representation contains the top-$K$ object regions, but not all of the visual regions are related to the current question and attended history. The visual attention swiftly shifts focus onto the salient object regions to decide which parts of the image are most important. Intuitively, if the questioner asks "What is he wearing?" in Figure 1, the visual attention scores should place more weight on the shirt region. We use a linear layer to reduce the region features from $d_v$ to $d$ dimensions. The final weighted sum of the region features is denoted as $\hat{v}_t$.

Furthermore, the proposed graph attention (Section 3.2.3) outputs the attended graph embedding $\hat{z}_t$ to preserve the visual relationships.

Finally, $q_t$, $\hat{v}_t$, $\hat{z}_t$ and $\hat{h}_t$ are concatenated and fused through a fully-connected fusion layer into the encoder output $e_t$:

$$e_t = \tanh\left( W_f \left[ q_t;\ \hat{v}_t;\ \hat{z}_t;\ \hat{h}_t \right] \right), \tag{8}$$

where $W_f$ is the parameter of the fusion layer. The encoder output is then forwarded to the discriminative or generative decoder.
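A minimal NumPy sketch of this fusion step, with illustrative dimensions and untrained random weights (the small weight scaling merely keeps tanh away from saturation in the toy example):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 512  # hidden size, matching the LSTM hidden size in Section 4.1.2

# Stand-ins for the question embedding and the attended visual,
# graph and history embeddings (all assumed to share dimension d).
q_t, v_att, z_att, h_att = (rng.normal(size=d) for _ in range(4))

# One fully-connected fusion layer over the concatenation, with tanh.
W_f = rng.normal(size=(d, 4 * d)) * 0.01
e_t = np.tanh(W_f @ np.concatenate([q_t, v_att, z_att, h_att]))
# e_t is the encoder output forwarded to the decoder
```

In the full model, W_f is learned jointly with the attention modules and the decoder, so the fusion layer decides how much each modality contributes to the final response.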

4. Experiments

4.1. Experimental Settings

4.1.1. Dataset

We conduct the experiments on the benchmark Visual Dialog dataset v0.9 (Das et al., 2017), which was created for the visual dialogue generation task. The training set contains 82,783 image-dialogue pairs and the validation set 40,504 pairs, with images from MSCOCO (Lin et al., 2014). Each dialogue contains 10 rounds of question-answer pairs. Besides, pronouns (e.g. "he", "she", "it") appear frequently in the dataset (98% of dialogues, 38% of questions and 19% of answers contain at least one pronoun), unavoidably causing co-reference resolution problems.

4.1.2. Implementation Details

Given an image, we extract region features using Faster-RCNN with a ResNet-101 backbone (He et al., 2016) following the procedure in (Anderson et al., 2018); the top-36 regions with the highest confidence are selected, each represented by a 2048-dimensional region feature. The scene graph is generated by a pre-trained Factorizable Net (Li et al., 2018), and the top-50 detected predicates are chosen for the scene graph construction. The language models used for question and history encoding are single-layer LSTMs with a hidden size of 512. The dimension of the word embeddings is empirically set to 300, with random initialisation. For optimisation, we use the stochastic optimisation method Adam (Kingma and Ba, 2015) with a learning rate of 4e-4 and momentum coefficients of 0.8 and 0.999. The batch size is fixed at 128. All the experiments are run on a server with a 40-core Intel(R) Xeon(R) E5-2660 CPU and 2 Nvidia GeForce GTX 1080 Ti GPUs.

4.1.3. Evaluation Metrics

We report the performance of dialogue generation using the common evaluation protocols from (Das et al., 2017; Lu et al., 2017; Wu et al., 2018). In testing, the visual dialogue model is given an image, a question, the multi-round dialogue history and a list of candidate answers (N = 100), and the model returns a ranking of the candidate answers. The returned ranked list is evaluated with retrieval metrics w.r.t. the human response: Mean Reciprocal Rank (MRR), Recall@k (k = 1, 5, 10) and Mean Rank (MR). Both the discriminative and generative dialogue response decoders are compatible with this retrieval setting: the discriminative decoder directly scores the confidence of the candidate answers, whilst the generative model assigns a log-likelihood score to each candidate answer.
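Given the 1-based rank of the human response in each returned list, these retrieval metrics reduce to simple averages; a small self-contained sketch:

```python
def retrieval_metrics(gt_ranks):
    """gt_ranks: 1-based rank of the human response in each returned list."""
    n = len(gt_ranks)
    return {
        "MRR": sum(1.0 / r for r in gt_ranks) / n,   # mean reciprocal rank
        "R@1": sum(r <= 1 for r in gt_ranks) / n,    # recall at k = 1, 5, 10
        "R@5": sum(r <= 5 for r in gt_ranks) / n,
        "R@10": sum(r <= 10 for r in gt_ranks) / n,
        "MR": sum(gt_ranks) / n,                     # mean rank (lower is better)
    }

# Toy example: four test questions whose human responses were ranked
# 1st, 3rd, 7th and 20th out of the 100 candidates.
m = retrieval_metrics([1, 3, 7, 20])
# MRR ≈ 0.3815, R@1 = 0.25, R@5 = 0.5, R@10 = 0.75, MR = 7.75
```

Higher MRR and Recall@k indicate better performance, while a lower Mean Rank is better, which is why the tables in Section 4 report MR separately.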

4.2. Compared Methods

The experiments on both discriminative (-D) and generative (-G) decoders are conducted. In this section, we briefly introduce the compared visual dialogue methods.

SAN-QI-D (Yang et al., 2016): This Stacked Attention Networks (SAN) baseline is a VQA model, given only the question and image (without dialogue history) for answer prediction.

HieCoAtt-QI-D (Lu et al., 2016b): The Hierarchical Question-Image Co-Attention model utilises visual and hierarchical representation of the question in a joint framework for VQA.

LF (Das et al., 2017): The Late Fusion (LF) encodes the image, question and dialogue history in three separate streams. Next, the model simply combines the image features, question embeddings and history embeddings into a joint representation for answer prediction.

HREA (Das et al., 2017): In addition to LF model, the Hierarchical Recurrent Encoder with Attention model (HREA) adopts a hierarchical recurrent model to encode dialogue history in hierarchy equipped with a history attention.

MN (Das et al., 2017): Memory Network is applied in visual dialogue task to enhance historical questions and answers memorising.

HCIAE (Lu et al., 2017): The History-Conditioned Image Attentive Encoder (HCIAE) introduces an attentive framework to localise the image regions and the history relevant to the target question for co-reference resolution. In the discriminative setting, we compare with HCIAE trained with MLE loss (-MLE) and with the N-pair discriminative loss plus a self-attentive answer encoder (-NP-ATT). In the generative setting, we focus on HCIAE trained with the maximum likelihood estimation (-MLE) loss, since knowledge transfer (-DIS) is out of the scope of our method.

CoAtt (Wu et al., 2018): The Sequential Co-attention (CoAtt) encoder attends to each input feature conditioned on the other features in sequence, capturing the correlations among all inputs. We compare our method with the sequential co-attention encoder trained with the MLE objective, without any adversarial learning strategies. The result of the discriminative CoAtt model (CoAtt-D) with the MLE objective is not reported in the original paper, so the reported scores are based on our implementation. We also compare with their best model, which is equipped with an adversarial loss, intermediate rewards and a teacher-forcing strategy (-GAN-TF).
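The sequential co-attention idea can be illustrated with simplified dot-product attention (the actual model uses learned projections, which this toy NumPy sketch omits; all shapes and names here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memory):
    # score each memory vector against the query, then take the
    # attention-weighted sum -- one co-attention step
    weights = softmax(memory @ query)
    return weights @ memory

d = 8
question = np.random.rand(d)
image_regions = np.random.rand(36, d)   # one vector per detected region
history_turns = np.random.rand(10, d)   # one vector per dialogue round

v = attend(question, image_regions)       # image summary given the question
h = attend(question + v, history_turns)   # history summary given both
```

Each modality is summarised conditioned on the summaries produced so far, which is what lets the encoder cross-refer elements between question, image and history.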

GNN-SPO (Zheng et al., 2019): Reasoning Visual Dialogs with Structural and Partial Observations (GNN-SPO) represents the dialogue as a graph whose nodes are dialogue entities and whose edges are semantic dependencies between them. The method treats the dialogue history as a partial observation of the graph, and the goal is to infer the values of the unobserved answer nodes and the graph structure.

4.3. Quantitative Analysis

Model MRR R@1 R@5 R@10 MR
SAN-QI-D (Yang et al., 2016) 0.5764 43.44 74.26 83.72 5.88
HieCoAtt-QI-D (Lu et al., 2016b) 0.5788 43.51 74.49 83.96 5.84
LF-D (Das et al., 2017) 0.5807 43.82 74.68 84.07 5.78
HREA-D (Das et al., 2017) 0.5868 44.82 74.81 84.36 5.66
MN-D (Das et al., 2017) 0.5965 45.55 76.22 85.37 5.46
HCIAE-D-MLE (Lu et al., 2017) 0.6140 47.73 77.50 86.35 5.15
HCIAE-D-NP-ATT (Lu et al., 2017) 0.6222 48.48 78.75 87.59 4.81
CoAtt-D-MLE (Wu et al., 2018) 0.6135 47.49 77.92 86.75 5.04
CoAtt-D-GAN (Wu et al., 2018) 0.6398 50.29 80.71 88.81 4.47
GNN-SPO (Zheng et al., 2019) 0.6285 48.95 79.65 88.36 4.57
ORD-D w/ SG 0.6340 49.93 79.70 88.20 4.66
ORD-D w/ SG+Rel 0.6383 50.46 80.09 88.54 4.56
ORD-D w/ SG+Rel+Attn 0.6447 51.22 80.67 89.01 4.44
Table 1. Discriminative - Performance on Visual Dialog validation dataset (Das et al., 2017)
Model MRR R@1 R@5 R@10 MR
LF-G (Das et al., 2017) 0.5199 41.83 61.78 67.59 17.07
HREA-G (Das et al., 2017) 0.5242 42.28 62.33 68.17 16.79
MN-G (Das et al., 2017) 0.5259 42.29 62.85 68.88 17.06
HCIAE-G-MLE (Lu et al., 2017) 0.5382 44.07 63.42 69.03 16.06
HCIAE-G-DIS (Lu et al., 2017) 0.5467 44.35 65.28 71.55 14.23
CoAtt-G-MLE (Wu et al., 2018) 0.5411 44.32 63.82 69.75 16.47
CoAtt-G-GAN-TF (Wu et al., 2018) 0.5578 46.10 65.69 71.74 14.43
ORD-G w/ SG 0.5438 45.05 63.52 69.01 16.17
ORD-G w/ SG+Rel 0.5480 45.47 64.04 69.59 15.83
ORD-G w/ SG+Rel+Attn 0.5502 45.63 64.38 70.23 15.56
Table 2. Generative - Performance on Visual Dialog validation dataset (Das et al., 2017)

The main results in the discriminative setting on the VisDial dataset are shown in Table 1. In general, the proposed full ORD model outperforms most of the state-of-the-art methods on all metrics, demonstrating the effectiveness of leveraging the scene graph. Compared to LF-D, ORD improves MRR and Recall@1 by a relative 11% and 17%, respectively. This observation indicates that visual reasoning and co-attention are critical to dialogue generation.

Moreover, the scene graph improves answer prediction by a significant margin compared to co-reference resolution methods (e.g. MN, HCIAE, CoAtt). Specifically, ORD outperforms HCIAE-D-NP-ATT by a relative 4% on MRR, from 0.6222 to 0.6447, confirming that visual relationships help the model recognise visual cues for dialogue reasoning. A similar trend can be found in recall, where the proposed method achieves a 6% relative improvement on Recall@1. We can also observe that exploiting fine-grained object regions and relationships outperforms the GNN-based method built on a dialogue-entity graph (GNN-SPO). In this regard, the graph structure provides strong reasoning performance where the underlying relationships are important.

Furthermore, we report the performance of ORD and the compared methods in the generative setting in Table 2. On all metrics, ORD performs best among the compared MLE-trained methods, outperforming the previous state-of-the-art CoAtt-G-MLE by a relative 2% and 3% in terms of MRR and Recall@1, respectively. This shows that the object relationships are well preserved in the graph-structured multi-modal encoding framework and benefit the dialogue generator.

The performance of CoAtt-G-GAN-TF is slightly higher than ours because of its GAN and teacher-forcing training strategies. Since these strategies are not the main focus of this work, they are out of the scope of our method; in principle, they are applicable to our framework and could further improve the final results.

4.4. Ablative evaluation

In this section, we present several variants of the proposed method to study the contribution of each component in our ORD framework. Two baseline models, LF and HCIAE-MLE, are included for reference. The variants are described as follows:

ORD w/o CoAttn: This model keeps the visual features, scene graph features, question embeddings and history features, but removes the co-attention mechanism between them. It is similar to LF (Das et al., 2017), but with region-level features and the scene graph.

ORD w/o Vis: This model keeps only the scene graph, question and history, without visual features.

ORD w/o SG: This model keeps only the visual features, question and history, without the scene graph.

ORD w/ SG: The base model, which leverages the scene graph at a high level. It keeps only undirected links between object nodes, without fine-grained relationship edge features (e.g. man-shirt, bus-tree). The nodes are embedded via a single-layer GCN (the global GCN), and the average of the node embeddings forms the scene graph feature.

ORD w/ SG+Rel: This model keeps directed links between object nodes, with relationship edge features (e.g. man-wear-shirt, shirt-on-man). The nodes and edges are embedded by the proposed Hierarchical GCN (global + local GCN), and the average of the relationship-aware node embeddings forms the scene graph feature.

ORD w/ SG+Rel+Attn: Our full model. Like SG+Rel, it keeps directed links between object nodes with relationship edge features, embedded by the proposed Hierarchical GCN, but the scene graph feature is formed by attending over the relationship-aware node embeddings instead of averaging them.
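The difference between the averaged and the attended variants can be sketched with a toy single-layer GCN in NumPy. This is a simplified illustration only: it omits the edge features and local GCN of the full Hierarchical GCN, and all shapes, weights and the attention logits are hypothetical:

```python
import numpy as np

def gcn_layer(A, H, W):
    # Kipf & Welling-style layer: add self-loops, symmetrically
    # normalise the adjacency, propagate, then ReLU
    A_hat = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

def graph_feature(H, attn_logits=None):
    # pool node embeddings into one scene-graph feature: plain mean
    # (SG / SG+Rel variants) or attention-weighted sum (+Attn variant)
    if attn_logits is None:
        return H.mean(axis=0)
    w = np.exp(attn_logits - attn_logits.max())
    return (w / w.sum()) @ H

n, d_in, d_out = 36, 2048, 512
A = (np.random.rand(n, n) > 0.8).astype(float)  # toy object adjacency
nodes = gcn_layer(A, np.random.rand(n, d_in), np.random.rand(d_in, d_out) * 0.01)
g = graph_feature(nodes)   # averaged variant; pass logits for +Attn
```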

Model MRR R@1 R@5 R@10 MR
LF (Das et al., 2017) 0.5807 43.82 74.68 84.07 5.78
HCIAE-MLE (Lu et al., 2017) 0.6140 47.73 77.50 86.35 5.15
ORD w/o CoAttn 0.6129 47.46 77.79 86.66 5.04
ORD w/o Vis 0.5990 45.75 76.47 85.46 5.40
ORD w/o SG 0.6266 49.10 79.08 87.79 4.77
ORD w/ SG 0.6340 49.93 79.70 88.20 4.66
ORD w/ SG+Rel 0.6383 50.46 80.09 88.54 4.56
ORD w/ SG+Rel+Attn 0.6447 51.22 80.67 89.01 4.44
Table 3. Ablative Study - Discriminative Setting

In the discriminative setting, our base model ORD w/ SG outperforms the maximum-likelihood-trained (-MLE) models such as LF and HCIAE, as shown in Table 3. This result shows that the fine-grained object regions and the connections between them can significantly improve the visual reasoning capability.

Figure 5. Example visual dialogue results. Each example is associated with the ground-truth (GT), state-of-the-art method HCIAE (Lu et al., 2017) and the proposed ORD (ours).

Whilst this variant shows the effectiveness of the proposed global GCN, we also want to evaluate the contribution of the local GCN. However, it is not meaningful to use the local GCN without the global GCN: without the global GCN layer linking object nodes, the nodes are not connected to each other and cannot pass messages along the edges of the graph. We therefore add relationship edge features on top of ORD w/ SG, so that ORD w/ SG+Rel illustrates the additional contribution of the local GCN. As shown in Table 3, the fine-grained relationships (edge features) considered in ORD w/ SG+Rel improve the performance slightly, showing that the local GCN is effective at exploiting the edge features. Furthermore, our full model ORD w/ SG+Rel+Attn further boosts the proposed scene-graph-based reasoning framework and achieves 0.6447 on MRR, which demonstrates that the graph attention is able to dynamically focus on different semantic object nodes during answer reasoning.

In addition, we remove the co-attention, the visual features and the graph embeddings respectively to evaluate the importance of each building block in the full framework. First, when we remove the co-attention module, the structure becomes very similar to the LF model; however, in ORD w/o CoAttn, the additional region-level visual features and semantic scene graph still improve the performance. Nevertheless, the co-attention module remains an essential part of the model, since without it the correlations between the different features cannot be well discovered. Second, in ORD w/o Vis, removing the visual features degrades the performance dramatically, as the visual dialogue task discusses topics related to the given image. Interestingly, the MRR is still slightly better than the LF model even though no visual content is presented. We attribute this to the co-attention and the scene graph: the co-attention captures the underlying interactions within the dialogue history, while the scene graph preserves the semantic object relationships from the visual content. Third, when ORD w/o SG completely removes the scene graph, the performance is still better than HCIAE-MLE thanks to the fine-grained region features, but the structured visual relationships are indispensable for further improvement.

In the generative setting, the proposed full model outperforms most of the state-of-the-art models, as shown in Table 2. However, the improvement of the base model ORD-G w/ SG is not as significant as in the discriminative setting, which indicates that the sequential generative decoder needs a further feature-refinement strategy at the sequence decoding stage. Surprisingly, in the generative model the contribution of the relationship features (ORD-G w/ SG+Rel over ORD-G w/ SG) is slightly larger than that of the graph attention (ORD-G w/ SG+Rel+Attn over ORD-G w/ SG+Rel), which differs from the trend in the discriminative setting. How to optimise the graph attention for the generative decoder should be considered in future work.

4.5. Qualitative Analysis

We present qualitative results from the proposed ORD and state-of-the-art HCIAE (Lu et al., 2017) in Figure 5. The following properties can be observed from the results.

Fine-grained visual cues. The region visual features are preserved by integrating object detection into the visual dialogue framework, retaining detailed object attributes that enable the model to answer visually grounded questions (e.g. about the colour of objects). For example, in Figure 5(b), the question is "What colour is her coat?". The dominant colour in this snow scene is white, and the coat is also white. This is a challenging question for the CNN-based HCIAE model, since the CNN features lose object-level details in the visual representation; by utilising the fine-grained region features, however, the colour of the coat is easily observed by the ORD model.

Geometric object relationships. By preserving the visual relationships in the scene graph, the ORD model can precisely reason about the interactions between objects, which is omitted in previous work, especially in complex scenes like Figure 5(d) with many intricately arranged colourful objects. Compared to the ground truth, ORD inevitably has some failure cases, such as recognising the colour of the chair, but compared to the state-of-the-art method, the proposed object relationship discovery model still performs strongly. Take the first question as an example: the child is wearing a pale green top, which ORD recognises seamlessly. Similarly, in Figure 5(d), for the question "Is she wearing a hat?", since the relevant relationship is preserved in the scene graph, ORD is able to effortlessly predict the correct answer.

Robust to background noise. Because the ORD model recognises and preserves the objects in the visual encoding stream, background noise can be separated from the salient foreground objects. Taking Figure 5(c) as an example, the strong reflection on the water surface is very noisy, leaving HCIAE unable to recognise the number of people, the colour of the craft, and the existence of waves. The proposed ORD, however, is able to locate the people, the craft and the water regardless of the background reflection and low lighting, therefore providing precise dialogue responses in most cases.

5. Conclusion and Future Work

In this paper, we propose an Object Relationship Discovery framework to generate fine-grained dialogue by discovering visual relationships in the image, where the detected locative and subtle object relationships help the model to understand visual cues accurately. To preserve the discovered local and global object relationships, a hierarchical graph convolutional network is constructed, followed by a graph attention module that selectively focuses on relevant relationships, thereby filtering out background noise to ease visual evidence reasoning. The experiments have demonstrated the superiority of the proposed method over the state-of-the-art. Furthermore, object relationships are also important cues for visual reasoning in related tasks such as image captioning and VQA; our future direction is to utilise visual relationships in other vision-language tasks.


  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: visual question answering. In ICCV, pp. 2425–2433.
  • Y. Bin, Y. Yang, J. Zhou, Z. Huang, and H. T. Shen (2017) Adaptively attending to visual attributes and linguistic knowledge for captioning. In ACM MM, pp. 1345–1353.
  • W. Choi, Y. Chao, C. Pantofaru, and S. Savarese (2013) Understanding indoor scenes using 3d geometric phrases. In CVPR, pp. 33–40.
  • B. Dai, Y. Zhang, and D. Lin (2017) Detecting visual relationships with deep relational networks. In CVPR, pp. 3298–3308.
  • A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In CVPR, pp. 1080–1089.
  • L. Gao, P. Zeng, J. Song, Y. Li, W. Liu, T. Mei, and H. T. Shen (2019) Structured two-stream attention network for video question answering. In AAAI, pp. 6391–6398.
  • G. Gkioxari, R. B. Girshick, and J. Malik (2015) Contextual action recognition with R*CNN. In ICCV, pp. 1080–1088.
  • A. Gupta and L. S. Davis (2008) Beyond nouns: exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, pp. 16–29.
  • A. Gupta, A. Kembhavi, and L. S. Davis (2009) Observing human-object interactions: using spatial and functional compatibility for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31 (10), pp. 1775–1789.
  • C. Han, F. Shen, L. Liu, Y. Yang, and H. T. Shen (2018) Visual spatial attention network for relationship detection. In ACM MM, pp. 510–518.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
  • J. Johnson, A. Gupta, and L. Fei-Fei (2018) Image generation from scene graphs. In CVPR, pp. 1219–1228.
  • J. Johnson, R. Krishna, M. Stark, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li (2015) Image retrieval using scene graphs. In CVPR, pp. 3668–3678.
  • A. Karpathy and L. Fei-Fei (2017) Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39 (4), pp. 664–676.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
  • S. Kottur, J. M. F. Moura, D. Parikh, D. Batra, and M. Rohrbach (2018) Visual coreference resolution in visual dialog using neural module networks. In ECCV, pp. 153–169.
  • J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei (2017) A hierarchical approach for generating descriptive image paragraphs. In CVPR, pp. 3337–3345.
  • Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang (2018) Factorizable net: an efficient subgraph-based framework for scene graph generation. In ECCV, pp. 346–363.
  • Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang (2017) Scene graph generation from objects, phrases and region captions. In ICCV, pp. 1270–1279.
  • T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755.
  • C. Lu, R. Krishna, M. S. Bernstein, and F. Li (2016a) Visual relationship detection with language priors. In ECCV, pp. 852–869.
  • J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra (2017) Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model. In NeurIPS, pp. 313–323.
  • J. Lu, J. Yang, D. Batra, and D. Parikh (2016b) Hierarchical question-image co-attention for visual question answering. In NeurIPS, pp. 289–297.
  • Y. Niu, H. Zhang, M. Zhang, J. Zhang, Z. Lu, and J. Wen (2019) Recursive visual attention in visual dialog. In CVPR, pp. 6679–6688.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), pp. 1137–1149.
  • D. Teney, L. Liu, and A. van den Hengel (2017) Graph-structured representations for visual question answering. In CVPR, pp. 3233–3241.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In CVPR, pp. 3156–3164.
  • Z. Wang, Y. Luo, Y. Li, Z. Huang, and H. Yin (2018) Look deeper see richer: depth-aware image paragraph captioning. In ACM MM, pp. 672–680.
  • Q. Wu, P. Wang, C. Shen, I. D. Reid, and A. van den Hengel (2018) Are you talking to me? Reasoned visual dialog generation through adversarial learning. In CVPR, pp. 6106–6115.
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In CVPR, pp. 3097–3106.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, pp. 2048–2057.
  • Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola (2016) Stacked attention networks for image question answering. In CVPR, pp. 21–29.
  • B. Yao and F. Li (2010) Grouplet: a structured image representation for recognizing human and object interactions. In CVPR, pp. 9–16.
  • T. Yao, Y. Pan, Y. Li, and T. Mei (2018) Exploring visual relationship for image captioning. In ECCV, pp. 711–727.
  • Z. Zheng, W. Wang, S. Qi, and S. Zhu (2019) Reasoning visual dialogs with structural and partial observations. In CVPR.