Multi-modal Learning with Prior Visual Relation Reasoning

by Zhuoqian Yang et al.
Beihang University

Visual relation reasoning is a central component in recent cross-modal analysis tasks; it aims to reason about the visual relationships between objects and their properties. These relationships convey rich semantics and help to enhance the visual representations used in cross-modal analysis. Previous works have succeeded in designing strategies for modeling latent or rigid-categorized relations and have achieved performance gains. However, such methods leave out the ambiguity inherent in the relations, which arises from the diverse relational semantics of different visual appearances. In this work, we explore modeling relations with context-sensitive embeddings based on human prior knowledge. We propose a novel plug-and-play relation reasoning module, injected with the relation embeddings, to enhance the image encoder. Specifically, we design an upgraded Graph Convolutional Network (GCN) that exploits the relation embeddings and the relation directionality between objects to generate relation-aware image representations. We demonstrate the effectiveness of the relation reasoning module by applying it to both Visual Question Answering (VQA) and Cross-Modal Information Retrieval (CMIR) tasks. Extensive experiments are conducted on the VQA 2.0 and CMPlaces datasets, and superior performance is reported in comparison with state-of-the-art work.






1 Introduction

Figure 1: Ambiguous relation samples with varied appearances. (a) and (b) are two samples with distinct visual semantics that belong to the same relation “ride”. (c) is a sample in which the same appearance implicates two relations, i.e. kick and look at.
Figure 2: Illustration of our module for visual relational reasoning. We first use an object detector to detect instances in images to form the nodes of the graph. Then a relation encoder is used to generate a relation embedding between each pair of instances. Finally, we use our anisotropic graph convolution to process information in the graph.

Vision and natural language are the most typical modalities for humans to describe the real world. A large amount of information on webpages, social networks and E-commerce websites is conveyed by both visual and textual content. With the advances in Computer Vision (CV) and Natural Language Processing (NLP), researchers have made a further step towards breaking the boundary between vision and natural language, with tasks such as visual question answering (VQA) [34, 31, 1], cross-modal information retrieval (CMIR) [33, 30, 19] and image captioning [32]. All these tasks require fine-grained visual processing, or even content-specific visual reasoning, to achieve semantic-rich visual representations, which is a top priority for Artificial Intelligence (AI) to reach human ability.

One of the recent advances in visual representation for cross-modal learning is visual relational reasoning [27]. It aims to reason about the semantic relations (e.g., wearing, holding, riding), positional relations (e.g., above, below, inside, around), or even latent relations between visual objects in an image. The relations can be formally defined as triples ⟨subject, relation, object⟩, e.g. ⟨woman, ride, horse⟩ or ⟨man, ride, motorcycle⟩. Such relations work in conjunction with deep neural networks to generate relation-aware visual representations. State-of-the-art works have shown that reasoning about visual relations is crucial to improving the performance of VQA [27] and image captioning [32]. The typical solutions are based on graph models: they treat visual relations as edges between two object nodes in spatial or semantic graphs. Given the graphs, Graph Convolutional Networks (GCN) [15] or Relation Networks (RN) [27] are leveraged to reason about more complex relations and generate enhanced representations for generating a caption or answering a question.

However, modeling the relations can be a challenging problem. Previous work [32] primarily studies appearance-based models that detect visual relations categorically: they learn the relations as a classification task and output a rigid category for each predicate as the corresponding relation. Such relation information is typically integrated into deep architectures by using specific parameters for each category. Unfortunately, visual appearances are too varied and semantic-rich to be modeled by rigid categories. On the one hand, the visual appearance of the same relation varies significantly with the different objects and subjects involved. For example, the appearance of the relation ride is quite different in the following two scenes: “a woman rides a horse” in Figure 1 (a) and “a man rides a motorcycle” in Figure 1 (b). On the other hand, the same appearance can implicate diverse relations. Take Figure 1 (c) for example: there exist two kinds of relations, i.e. kick and look at, between the boy in the white T-shirt and the football. In summary, rigid predicate categories can hardly convey the diverse relational semantics of different appearances.

In this paper, we bypass this challenge by describing each relation in the image as a relation embedding conditioned on the instances involved in the relation (top of Figure 2). Compared with previous work on relation detection [32], our proposed relation embedding models more effectively and accurately the fine-grained semantics inherent in the object, the subject and their interaction. Furthermore, we propose a plug-and-play relation reasoning module (bottom of Figure 2), which explores the use of relation embeddings to enhance the image encoder for multimodal learning. Our basic design is to model the image as a directed semantic graph. Each node represents an object (salient region) detected by Faster R-CNN, while each edge denotes the relation between two objects. The relation is represented by a relation embedding obtained from a relation encoder learned on Visual Genome [16]. An upgraded Graph Convolutional Network (GCN) [15] is then proposed to exploit the directional and embedded visual relations for reasoning about more complex relations and enriching the object representations in the semantic graph. Finally, the relation-aware object representations are injected into the multimodal learning model for the downstream task. We evaluate our proposed relation reasoning module on two representative multimodal learning tasks: Visual Question Answering and Cross-Modal Information Retrieval. Experimental results indicate superior improvements over the state-of-the-art methods when integrating our module, verifying the benefits of the relation embedding and the reasoning module.

The rest of this paper is organized as follows: Section 2 briefly reviews related work on visual relation reasoning, VQA and CMIR. We then detail our proposed relation reasoning module, named the Anisotropic Graph Convolutional Network (AGCN), in Section 3. Two models integrated with AGCN are implemented for cross-modal reasoning and retrieval, and applied to VQA in Section 4 and CMIR in Section 5. We report the experimental results in Section 6 and conclude our work in Section 7.

2 Related Work

2.1 Visual Relation Reasoning

Visual relation reasoning is about learning relationship representations and inferring relations from known building blocks in a given image. It is a top priority for AI to achieve human intelligence [3]. Since visual data is one of the most common modalities, research on reasoning about visual objects and their relations, which we call visual relation reasoning, has drawn much attention. Early works [9] study shallow geometric relations based on spatial information (e.g. below, above) to enhance visual segmentation. Later on, in [7], interactions (e.g. wear, carry) between paired objects are exploited, and visual relation reasoning is formulated as a classification task. Afterwards, relationships were extended to a richer definition [5], including geometric, comparative, compositional, action, etc. The most recent works propose to model visual relations by scene graphs based on prior human knowledge [16], effectively improving the performance of image captioning [23].

One limitation of the above approaches is that they merely represent the relations between two objects by rigid-categorized labels; they fail to leverage the richer semantics inherent in the visual context, which leads to ambiguity. For instance, Figure 1(c) shows that multiple relationships (i.e. kick and look at) may exist between two objects, and Figure 1(a)(b) shows that even relationships in the same category (i.e. ride) may have diverse semantics. To solve this problem, quite a few works attempt to design deep models for inferring more complex relations among multiple objects. [27] proposes to infer relations between all the implicit object-like representation pairs via a plug-and-play MLP module for visual question answering. It shows that implicit relational reasoning, without prior knowledge about explicit objects and particular relations, is able to reason correctly by virtue of the structure of the network alone. However, this method has only been proven to work on synthesized 3D datasets with primary geometric objects; problems involving real-world objects and relations are yet to be better handled. [32] treats visual relations as labeled directional edges between two object nodes in spatial and semantic graphs and applies Graph Convolutional Networks (GCN) to reason about their implicit relations for image captioning. Although [32] makes relation reasoning sensitive to relation types, it only applies different biases for “rigid-categorized” relation types and ignores the influence of the connected objects in the reasoning process. In contrast, we attempt to learn relational embeddings that “softly” represent various relations and design a modified GCN module that seamlessly supports embedded relations for inference.

2.2 Visual Question Answering (VQA)

VQA aims to answer a question posed in natural language about a natural image. It is quite a challenging task since it requires understanding of and reasoning over both visual and textual content. A typical solution for VQA is to fuse visual and textual features into a joint representation and infer the answer based on the fused image-question representation. The most typical methods for feature fusion are element-wise summation/multiplication or direct concatenation. Zhou et al. [34] propose a typical baseline that predicts the answer from the concatenation of image features extracted from a pre-trained CNN and question features represented by bag-of-words. Beyond these straightforward solutions, several works apply bilinear pooling [28, 8, 14] or more complex fusion methods [22]. Noh et al. [22] explore a novel CNN model for feature fusion with a dynamic parameter layer whose weights are learned adaptively from the question.

However, the above approaches are based on global features of both images and questions, which fail to provide fine-grained information and possibly introduce noise. Several works adopt the attention mechanism to focus on image regions semantically relevant to a given question. Yang et al. [31] perform visual attention multiple times via stacked attention networks, and Anderson et al. [1] use top-down attention on pre-detected salient regions. These models explore the fine-grained correlations between visual and textual content and eliminate noisy information. Recently, visual relation reasoning has been introduced into VQA and achieves better answers for questions that require a logical understanding of the question and the image [27]. These approaches mimic human thinking, which has not been thoroughly studied yet.

2.3 Cross-Modal Information Retrieval (CMIR)

CMIR is the task of using a query from one modality to retrieve information in another modality. The typical solution for CMIR is to project data from different modalities into a common semantic space so that their similarity can be compared directly. Several statistical methods are based on Canonical Correlation Analysis (CCA) [25, 24] to maximize the pairwise correlations. However, these methods ignore high-level semantic priors and can be hard to extend to large-scale data [21]. Another research trend is based on deep learning [33, 30, 19, 18], leveraging existing techniques to provide rich semantics through nonlinear transformations. Typically, [30] proposes a two-branch neural network with two layers of nonlinearities on top of visual and textual features. Instead of an MLP for feature mapping, [19] proposes a 2-stage CNN-LSTM network to generate and refine the cross-modal features progressively. [18] leverages an attention mechanism to focus on essential image regions and words for correlation learning. Recently, [33] explored the relationships between words, proving their effectiveness for representing texts and eventually improving CMIR accuracy.

To the best of our knowledge, no study has attempted to model visual relations for the CMIR task. Our method automatically detects and reasons about visual relations to obtain more informative image representations, while embedding them into the same semantic space as texts, where cross-modal similarity is computed directly.

3 Methodology

We propose a plug-and-play module, named the Anisotropic Graph Convolutional Network (AGCN), for visual relation reasoning. To demonstrate the effectiveness of AGCN, we present two structures combined with AGCN for cross-modal learning: v-AGCN for visual question answering (Section 4) and c-AGCN for cross-modal information retrieval (Section 5). They share the common module of image modeling via AGCN but differ in the way they associate the visual and textual modalities. In this section, we introduce the architecture of AGCN. It mainly contains three parts: (1) Image representation (Section 3.1): each image is represented by a structured semantic graph with relation-aware edges; (2) Relation encoder (Section 3.2): the learnt relation encoder encodes the interaction between two objects as a relation embedding, which describes the semantics and directionality of the edges in the constructed graph; (3) Anisotropic Graph Convolution (Section 3.3): an upgraded Graph Convolutional Network (GCN) that explores the directional and embedded visual relations to reason about more complex relations and enrich the object representations in the semantic graph.

3.1 Image Representation

Each image is represented by a scene graph, where the nodes, denoted as V, represent salient regions (also called instances) detected by a pre-trained object detector, while the edges, denoted as E, represent the semantic visual relations predicted by our proposed relation encoder. We utilize a pre-trained Faster R-CNN [26] in conjunction with ResNet-101 [12] to detect instances in an image and describe each instance with a 2,048-dimensional feature. We assume that a relation exists between every pair of instances, by considering “non-relation” as a specific relation. Therefore, we construct a fully connected graph based on the instance nodes and the relations predicted by the relation encoder. The relation encoder is pre-trained on a visual relationship benchmark, i.e. Visual Genome [16], and encodes each relation between two objects as a 128-dimensional relation embedding. The details of the relation encoder are described in the next section.
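As an illustration of this construction, the following sketch builds the fully connected graph from detected-region features. `toy_encoder` is a hypothetical stand-in for the learned relation encoder; only the 2,048-d region features and 128-d relation embeddings follow the text, everything else is illustrative.

```python
import numpy as np

def build_scene_graph(instance_feats, encode_relation):
    """Build a fully connected directed scene graph.

    instance_feats: (N, 2048) array of detected-region features.
    encode_relation: callable mapping (subject_feat, object_feat)
                     to a 128-d relation embedding.
    Returns the node features and a dict mapping each directed edge
    (i, j) to a relation embedding; "non-relation" and the self-loop
    relation are just particular embeddings, so every pair gets one.
    """
    n = instance_feats.shape[0]
    edges = {}
    for i in range(n):
        for j in range(n):  # includes i == j (self-connections)
            edges[(i, j)] = encode_relation(instance_feats[i],
                                            instance_feats[j])
    return instance_feats, edges

def toy_encoder(subj, obj):
    """Hypothetical stand-in for the trained relation encoder."""
    rng = np.random.default_rng(abs(int(subj[0] * 1000 + obj[0])) % 2**32)
    return rng.standard_normal(128)
```

In the real module the callable would wrap the Visual Genome-trained encoder of Section 3.2.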

Figure 3: Schematic illustration of the relation encoder.

3.2 Relation Encoder

Inspired by recent works on Visual Relationship Detection [6, 13] and Scene Graph Generation [20], we train a relation classifier to generate relationship embeddings on Visual Genome [16]. Visual Genome is a large-scale image dataset that provides detailed relationship annotations based on human prior knowledge. The visual relationships are represented as triples ⟨subject, predicate, object⟩. The relation predicate can be a preposition, an action or a combination of both.

The structure of the visual relationship encoder is illustrated in Figure 3. The relationship encoder infers the type of relationship between two objects based on three feature vectors extracted from three regions: the subject region, the object region and the union of the two regions. The three feature vectors are projected into lower dimensions through a dense layer, where the subject feature and the object feature share one set of parameters while the union feature uses its own exclusive set of parameters. Finally, the projected feature vectors are concatenated and propagated through a few more dense layers to generate a relation embedding, from which the relation category is inferred.

Visual Genome originally provides relationship annotations for a very large number of categories, and the distribution of relation instances among them is very uneven, ranging from only a few to tens of thousands. Therefore, we manually selected and grouped the 30 most frequent and representative relation categories and added two additional classes for this task. We add a none relationship for any two objects that are not labeled with a relation predicate. We also add an is-a relationship for the self-connections of nodes. The relation encoder thus classifies 32 categories in total. In our implementation, the feature descriptors of image regions are obtained from the pre-trained Faster R-CNN classifier, which uses RoI Pooling [26] to sample from the Res4b22 feature map of a ResNet-101 [12] to predict the category of proposed image regions.

3.3 Anisotropic Graph Convolution

Figure 4: Schematic illustration of the proposed anisotropic graph convolution with a single attention head.

We propose a new graph convolution module that not only absorbs the advantages of previous models but also carries a new characteristic that mimics the anisotropic property of CNNs: data from different directions are encoded with different parameters. We first incorporate fine-grained prior knowledge to encode relationships between visual components and produce relation embeddings. The intuition is that an embedding vector is more capable of providing fine-grained information about a relationship than a single label. The relation embeddings are then projected into multiple directions. Specifically, a set of attention weights is computed between each pair of visual components, one per attention head in our network. The attention heads then use distinct sets of parameters to process the input data. Additionally, the relation embeddings are combined with the visual features and propagated through the graph convolution to infuse relational information.

We define our anisotropic graph convolution on a single node as:

$$h_i^{(l+1)} = \sigma\Big(\big\Vert_{k=1}^{K}\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(k)}\, W_k\big[h_j^{(l)} \,\Vert\, r_{ij}\big]\Big)$$

where $h_i^{(l)}$ denotes the hidden state of node $i$ at the $l$-th layer, the subscript $k$ means that the subscripted parameter or attention weight is used by the $k$-th attention head, $\alpha_{ij}^{(k)}$ is an attention weight, i.e. a scalar that describes the extent to which node $j$ correlates to node $i$, and $\Vert$ denotes concatenation. Specifically,

$$\alpha_{ij}^{(k)} = \underset{j}{\mathrm{softmax}}\big(w_k^{\top} r_{ij}\big)$$
This process is illustrated in Figure 4.

The introduction of a prior relation embedding into graph convolution brings three advantages: (i) it provides a larger volume of information about relationships between objects; (ii) it avoids the risk of losing useful information, which haunts GCNs that rely on relationship labels to determine connectivity between nodes; (iii) the number of parameters is significantly reduced, because a few sets of parameters are sufficient to account for many different types of relationships.

Using multi-head attention to learn and distinguish semantic “directions” enables the graph convolution to anisotropically examine relationships and thereby extract patterns that underlie object relationships just as CNNs are able to discover spatial patterns in images.
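The layer above can be sketched in numpy as follows. This is a minimal reading under stated assumptions: the per-head attention logits are taken to be a linear projection of the relation embeddings, messages concatenate the neighbour feature with the relation embedding, and all shapes are illustrative rather than the trained model's.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def anisotropic_graph_conv(H, R, W_heads, w_att):
    """One anisotropic graph convolution layer on a fully connected graph.

    H: (N, d) node hidden states.
    R: (N, N, r) relation embeddings for every directed pair.
    W_heads: (K, d + r, d_out) one parameter set per attention head.
    w_att: (K, r) per-head projections turning relation embeddings
           into scalar attention logits.
    Returns (N, K * d_out): head outputs concatenated per node.
    """
    n = H.shape[0]
    k_heads, _, d_out = W_heads.shape
    out = np.zeros((n, k_heads * d_out))
    for i in range(n):
        head_outs = []
        for k in range(k_heads):
            # attention over neighbours, computed from relation embeddings
            logits = np.array([R[i, j] @ w_att[k] for j in range(n)])
            alpha = softmax(logits)
            # messages carry both neighbour features and relation embeddings,
            # processed by this head's own parameters (anisotropy)
            msg = sum(alpha[j] * (np.concatenate([H[j], R[i, j]]) @ W_heads[k])
                      for j in range(n))
            head_outs.append(np.maximum(msg, 0.0))  # ReLU nonlinearity
        out[i] = np.concatenate(head_outs)
    return out
```

Each head carries its own `W_heads[k]` and `w_att[k]`, which is exactly the "different directions, different parameters" property described above.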

4 v-AGCN for Visual Question Answering

Figure 5: The architecture of our visual-question-answering model. The symbols in the figure denote a linear transformation, the ReLU activation and the element-wise product, respectively. The element-wise product used in “cross-modal attention” is slightly different in that it multiplies the question embedding q with every object feature vector.

Visual question answering is a task that requires joint reasoning over questions in natural language and images. We plug our module into the state-of-the-art VQA model of Anderson et al. [1] to demonstrate the effect of visual relational reasoning. An overview of our proposed model is shown in Figure 5. It treats visual question answering as a classification problem and takes three types of data as input: feature vectors of objects detected by the Faster R-CNN, relation embeddings generated by the aforementioned relation encoder, and tokenized question vectors. The architecture mainly contains three parts: question modeling, image modeling and cross-modal attention, which enhances the image features by attending to the relevant content in the question. v-AGCN and the model of [1] share the first two parts and differ in image modeling: [1] simply utilizes a fully connected layer instead of our AGCN module in the blue dashed box labeled “Image modeling” in Figure 5.

In the image modeling path, we first detect a bounded number of objects per image (between fixed lower and upper bounds) to construct the scene graph. Then the graph structure, object features and relation features are fed into the anisotropic graph convolution module (in the blue dashed box in Figure 5) for relational reasoning. Specifically, two residual blocks with anisotropic graph convolution are used to extract image semantics. Empirically, we add a residual connection from the input of the AGC to the feature combination after the AGC to provide enhanced feature alignment, that is, to prevent the updated object features from drifting drastically.

Cross-modal attention is then applied to align the output image features with the question embedding q. A scalar attention weight is obtained from the question features and object features. Formally,

$$a_i = \underset{i}{\mathrm{softmax}}\big(w^{\top}(q \circ \hat{v}_i)\big), \qquad u_i = a_i\, \hat{v}_i$$

where, for the $i$-th object, $v_i$ is the original feature, $\hat{v}_i$ is the feature enhanced by the AGCN, and $u_i$ is the corresponding question-aware feature. We employ a two-layer MLP to predict the scores for the candidate answers. Formally,

$$\hat{y} = \sigma\big(W_2\,\mathrm{ReLU}(W_1 u)\big)$$

where $u$ is the aggregation of the question-aware features $u_i$. We denote the predicted score for the $j$-th answer by $\hat{y}_j$ and the benchmark label by $y_j$. The cross-entropy loss function is formally defined by:

$$\mathcal{L} = -\frac{1}{B}\sum_{b=1}^{B}\sum_{j}\big[y_{bj}\log\hat{y}_{bj} + (1-y_{bj})\log(1-\hat{y}_{bj})\big]$$

where $B$ denotes the batch size and $b$ subscripts the question-answer entries in training batches.
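The attention-and-classification pipeline above can be sketched as follows. This is a hedged numpy reading, not the trained model: `question_aware_pool` assumes the scalar weights come from a learned projection `w` of the question-object element-wise products, and the loss is the multi-label binary cross-entropy form over candidate answers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def question_aware_pool(q, V_hat, w):
    """Score each AGCN-enhanced object feature against the question
    embedding q (element-wise product, then a learned projection w),
    and pool the objects with the resulting attention weights."""
    logits = (q * V_hat) @ w   # one scalar per object
    alpha = softmax(logits)
    return alpha @ V_hat       # attended, question-aware image feature

def vqa_bce_loss(scores, labels):
    """Multi-label binary cross-entropy over candidate-answer scores."""
    p = sigmoid(scores)
    eps = 1e-9
    return -np.mean(labels * np.log(p + eps)
                    + (1 - labels) * np.log(1 - p + eps))
```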

Figure 6: The architecture of our cross-modal information retrieval model. As in Figure 5, the symbols denote a linear transformation, the ReLU activation and the element-wise product, and a dedicated symbol denotes the self-attention mechanism explained in Equation 7 and Equation 8.

5 c-AGCN for Cross-Modal Information Retrieval

Cross-modal information retrieval typically projects textual and visual features into a common semantic space to measure their similarity directly. Following the idea of [30], we design a simple dual-path neural network to learn multi-modal representations and compute their similarity by metric learning, as illustrated in Figure 6. We model the texts by bag-of-words features followed by two layers of nonlinearities. The image modeling is similar to that in v-AGCN but differs in the following two aspects.

First, the cross-modal attention in v-AGCN is replaced by self-attention with four attention heads. To aggregate the AGCN-enhanced object features, we define a virtual node that collects the features of all the other nodes by graph convolution as follows:

$$g = \big\Vert_{k=1}^{4}\sum_{i} \beta_i^{(k)}\, h_i \qquad (7)$$

Attention weights in this layer are not generated from relation embeddings. Instead, they are generated by a linear transformation of the nodes' hidden states:

$$\beta_i^{(k)} = \underset{i}{\mathrm{softmax}}\big(w_k^{\top} h_i\big) \qquad (8)$$
The reason that cross-modal attention is not used when merging nodes is that the efficiency required by information retrieval does not tolerate pairwise operations: each path must be able to independently produce description vectors so that retrieval can be performed efficiently.

Distance metric learning is used to combine features from the image and text paths by element-wise product, followed by a linear transformation layer to obtain the similarity score. We adopt the pairwise similarity loss function in [17] as our optimization objective. Specifically, we maximize the mean similarity score $\mu_{+}$ between matching text-image pairs and minimize the mean similarity score $\mu_{-}$ for non-matching pairs. Meanwhile, a variance loss, which minimizes the similarity variances $\sigma_{+}^{2}$ and $\sigma_{-}^{2}$ of both matching and non-matching pairs, is added to the loss function to accelerate convergence. The loss function is defined as:

$$\mathcal{L} = \max\big(0,\; m - (\mu_{+} - \mu_{-})\big) + \lambda\,(\sigma_{+}^{2} + \sigma_{-}^{2})$$

where $m$ is the margin between the mean distributions of matching and non-matching similarity and $\lambda$ is used to balance the weight of the mean loss and the variance loss.
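The loss described above can be sketched directly; the hinge-on-means plus variance-penalty form is our reading of the description, and the 0.6/0.35 defaults are the margin and balance weight reported in Section 6.

```python
import numpy as np

def pairwise_mean_variance_loss(pos_scores, neg_scores,
                                margin=0.6, lam=0.35):
    """Push the mean matching similarity above the mean non-matching
    similarity by `margin`, and penalise the variance of both score
    distributions to accelerate convergence."""
    mean_gap = np.maximum(0.0,
                          margin - (pos_scores.mean() - neg_scores.mean()))
    var_term = pos_scores.var() + neg_scores.var()
    return mean_gap + lam * var_term
```

Once the means are separated by more than the margin, only the variance term drives the optimization.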

6 Experiments

In this section, we first conduct a qualitative visualization to verify that the relation embedding is capable of disambiguating the semantics of visual relations. Then we test our AGCN module on both the VQA and CMIR tasks, fixing the hyper-parameters for both v-AGCN and c-AGCN. The dimension of every hidden layer is shown in Figure 5 and Figure 6. We train our models with the Adamax solver for 20 epochs with mini-batch size 256. We adopt a learning rate of 0.001 and a dropout ratio of 0.5. The number of possible answers for VQA is set to 3,129 by filtering out answers that appear fewer than a threshold number of times. The margin and the balance weight in the loss of c-AGCN are set to 0.6 and 0.35, respectively. All the experiments are implemented with Tensorflow and conducted on NVIDIA Tesla V100 GPUs.

Figure 9: Samples of relation-embedding-based image retrieval. Each row shows the top-5 nearest retrieved images containing the same relational semantics as the query image. In each image, the subject is framed with a red box and the object with a green box.

6.1 What is Captured by Relation Embeddings?

Since the relation embedding is designed to model fine-grained relations between instances and their interactions, it must acquire the ability to disambiguate the diverse relational semantics of different visual appearances, the challenging problem explained in Section 1. To examine this ability, we perform relation-embedding-based image retrieval to search for the k nearest images containing the same relational semantics as a query image. Specifically, Faster R-CNN is first utilized to detect salient regions in 1,000 images randomly sampled from the MSCOCO dataset. We then generate a relation embedding for each pair of salient regions in an image and predict its relation category accordingly. Afterwards, we randomly sample a set of 50,000 relation embeddings from all the obtained relation embeddings, each of which belongs to its predicted relation category. From this embedding set, we randomly select 32 sample embeddings as queries, one from each of the 32 predefined categories. For each query, we search for its 5 nearest neighbors in the embedding set and visualize the images containing the query and the retrieved relation embeddings.

Some of the retrieved images are shown in Figure 9. In Figure 9(a), we visualize four queries and their retrieved images in the category ride. Though the four queries belong to the same relation category, it is obvious that they each convey the semantics of ride from a different view, such as ride skateboard, ride horse or ride bike. The results demonstrate that, based on the relation embeddings, we can accurately retrieve relation-relevant images containing the same fine-grained relations as the queries. Similarly, Figure 9(b) shows four queries and their retrieved results in the category on, where the relationship might indicate that the object is supporting the subject or that the subject is part of the object. We make the same observation as for the first four queries, and it also holds in other categories. In summary, there can be divergent visual semantics even within the same relation category, and our relation encoder is capable of making this distinction.
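The retrieval step itself reduces to a nearest-neighbour search in embedding space. A minimal sketch follows; the Euclidean metric is our assumption, since the text does not name the distance function.

```python
import numpy as np

def k_nearest_relations(query, embeddings, k=5):
    """Return the indices of the k relation embeddings closest to
    `query` (Euclidean distance), as used in the retrieval experiment."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    return np.argsort(dists)[:k]
```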

6.2 Evaluation on Visual Question Answering

6.2.1 Dataset and Evaluation Metrics

We evaluate v-AGCN on the VQA 2.0 [10] dataset, an upgraded version of the VQA dataset. VQA 2.0 is more balanced in its question design to relieve overfitting: each question corresponds to two different images with distinct annotated answers. VQA 2.0 provides at least three questions per image and ten ground-truth answers per question. There are three types of questions: yes/no, number and other. The accuracy for each answer is based on a vote among annotators:

$$\mathrm{Acc}(a) = \min\Big(\frac{\#\mathrm{votes}(a)}{3},\; 1\Big)$$

where $\#\mathrm{votes}(a)$ is the number of times the answer $a$ is given by different annotators. We follow the standard split of the dataset and use the tool provided by [2] to evaluate accuracy. Only the training and validation sets are available for model optimization. Following previous work, we train our model on the training set and report results on the validation set in our ablation study. For the state-of-the-art comparison, we train our model on both the training and validation sets and report the test-dev and test-standard results from the VQA evaluation server.
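The consensus metric can be written down directly; this is the standard VQA accuracy rule, where an answer counts as fully correct once at least three of the ten annotators gave it.

```python
def vqa_accuracy(answer, annotator_answers):
    """VQA consensus accuracy: min(#matching annotators / 3, 1)."""
    votes = sum(a == answer for a in annotator_answers)
    return min(votes / 3.0, 1.0)
```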

6.2.2 Ablation Study

The architecture of v-AGCN is composed of multiple essential components. We conduct several ablation experiments to evaluate the contribution of each component. We first train our model and several ablated versions on the training set and compare their accuracy on the validation set. The variant models of v-AGCN include:

baseline model: we replace the image modeling component in Figure 5 with a single linear layer, which is the model introduced in [1].

GCN model: In this model, a layer of traditional GCN [15] replaces the image modeling component in Figure 5. The model does not consider the relation types among objects. Specifically, an edge connects two object nodes only when the relation detector trained on Visual Genome [32] predicts an existing relation, regardless of the relation type.

GCN+att model: this model shares common structures with the GCN model but is additionally equipped with a multi-head attention mechanism. The attention weights in this model are computed from object features and do not use any relation label or embeddings. Different attention heads share the same attention weights.

GCN+att+re model: this model employs a multi-head attention mechanism that generates attention weights based on relation embeddings and uses the same parameters for different attention heads.

v-AGCN: our proposed model employs an attention mechanism computed from relation embeddings and uses distinct sets of attention weights for different attention heads.

Model Validation Accuracy
baseline 62.41
GCN 62.44
GCN+att 63.32
GCN+att+re 62.84
v-AGCN (ours) 63.88
Table 1: Ablation study on each central component for the VQA task on the VQA 2.0 dataset.

The experimental results are shown in Table 1, from which we make the following observations:

First, v-AGCN clearly outperforms baseline, GCN, GCN+att, and GCN+att+re. Compared with the models without relation information, v-AGCN surpasses baseline and GCN by about 1.5% and outperforms GCN+att by 0.6%, indicating the effectiveness of relation-aware visual representations in improving VQA performance. Moreover, by setting distinct weights for the multiple attention heads in v-AGCN, we obtain about a 1% improvement over GCN+att+re, which uses the same attention weights for all heads. We infer that different attention heads capture different views of the correlations between visual regions based on the same relation embedding. In summary, incorporating relation attributes and the multi-head attention mechanism significantly enriches the visual representation and thus promotes VQA accuracy.

Second, progressively incorporating the baseline model with different components brings varying degrees of improvement. Although it constructs connections between visual regions, adding the typical GCN structure does not improve the results over the baseline model: merely introducing mutual connections, without specific relation types or characterizations, is not informative enough to enrich the visual representation. Equipping the GCN model with an attention mechanism based on object appearance raises the accuracy by 0.9%. When we inject relation embeddings into the attention computation, we observe a slight decrease in accuracy, mostly because, in the VQA task, the relation embeddings are only effective in cooperation with the anisotropic attention. The final model, v-AGCN, therefore achieves the best validation accuracy of 63.9%.

6.2.3 State-of-the-Art Comparison

Model test-dev test-standard
Overall Other Number Yes/No Overall Other Number Yes/No
Prior [11] - - - - 25.98 01.17 00.36 61.20
Language only [11] - - - - 44.26 27.37 31.55 67.01
LSTM+CNN[11] - - - - 54.22 41.83 35.18 73.46
MCB [11] - - - - 62.27 53.36 38.28 78.82
Adelaide [29] 65.32 56.05 44.21 81.82 65.67 56.26 43.90 82.20
baseline 65.49 56.41 43.73 81.96 65.76 56.63 43.57 82.09
v-AGCN (ours) 65.94 56.46 45.93 82.39 66.17 56.71 45.12 82.58
Table 2: Accuracy comparison of VQA tasks with state-of-the-art approaches on the VQA 2.0 dataset.

Table 2 reports our performance on the VQA 2.0 dataset. We train v-AGCN on the combination of the training and validation sets and report its performance on the test sets. We compare v-AGCN with the baseline model (the model without the AGCN module for visual relation reasoning) and the top-ranked state-of-the-art models from the VQA challenge 2017. Our method outperforms the state-of-the-art methods by at least 0.5% in overall accuracy.

Our single v-AGCN model surpasses all the top-ranked models, slightly outperforming the 1st-place model of Adelaide. Compared with the baseline model, v-AGCN yields about a 0.4% improvement in overall performance, achieving 65.94% on the test-dev set and 66.17% on the test-standard set. Notably, our method significantly improves the performance on Number questions.

6.2.4 Qualitative Evaluation

Figure 10: Predicted results and the corresponding attention visualizations of the visual question answering task on the VQA 2.0 dataset. Each row shows the results for an image-question pair. In each row, the first column shows the groundtruth answer and the original image. The second column visualizes the cross-modal attention map of the baseline model, and the third column shows that of the v-AGCN model. The attention maps of the first and second attention heads in v-AGCN, which correlate regions from different views, are shown in columns 4 and 5, respectively.

To examine the behavior of the AGCN module, we visualize the attention maps of both the cross-modal attention and the inter-node attention in v-AGCN. Specifically, we show samples of input image-question pairs together with the attention maps generated by v-AGCN and the baseline model; examples are shown in Figure 10. Each row contains the visualization for one question. The first column in each row shows the groundtruth answer and the original image. The second column visualizes the cross-modal attention used in the baseline model, and the third column shows the cross-modal attention in the v-AGCN model. The attention maps of the first and second attention heads in v-AGCN, which correlate regions from different views, are shown in columns 4 and 5, respectively. For each attention map, the red bounding box indicates the region with the maximum attention weight, and the predicted answer is given above the attention maps.

Compared with the baseline model, we observe that the cross-modal attention in v-AGCN focuses on the image regions more relevant to the questions. When answering the question “What material are the pants in the foreground?”, the agent should distinguish between the two people, which the v-AGCN model succeeds in doing. Another property of v-AGCN is that it is less prone to answering questions from partial information. When answering the question “What are the kids in?”, although the baseline model attends to a correct image region, the partial information in that region causes it to generate the answer suitcase, while the v-AGCN model observes other image regions and concludes that the kids are in fact in a train. Most importantly, the v-AGCN model is shown to be capable of discovering and using visual relationships. For the question “What is the color of the shirt of the man reaching for the frisbee”, v-AGCN correctly answers blue by finding a correlation between the man and the frisbee.

6.3 Evaluation on Cross-Modal Information Retrieval

Figure 11: Retrieval samples of our proposed c-AGCN model and the baseline model on the CMPlaces dataset. Relevant results are highlighted with green boxes while irrelevant results are marked with red boxes.

6.3.1 Dataset and Evaluation Metrics

In this section, we test our models on the most recent CMIR benchmark dataset: Cross-Modal Places (CMPlaces) [4]. CMPlaces is one of the largest cross-modal datasets, providing weakly aligned data in five modalities divided into 205 categories. In our experiments, we use the natural images (about 1.5 million) and text descriptions (11,802) for evaluation. We randomly sample 250 images from each category and split them into training, validation, and test sets with a proportion of 8:1:1; the text descriptions are randomly split into training, validation, and test sets with a proportion of 14:3:3. For evaluation, MAP@100 is used to measure query performance, and the overall MAP is computed by averaging the scores for text queries and image queries.
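The MAP@100 metric can be sketched as follows (a minimal hypothetical helper, not the authors' evaluation code; this common variant normalizes each query's average precision by the number of hits in the top-k list): for each query, precision is accumulated at every relevant position among the top-100 ranked results, and the per-query scores are then averaged.

```python
def average_precision_at_k(ranked_relevance, k=100):
    """AP over the top-k results; ranked_relevance is a list of 0/1 flags."""
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_relevance[:k]):
        if rel:
            hits += 1
            precision_sum += hits / (i + 1)   # precision at this rank
    return precision_sum / hits if hits else 0.0

def map_at_k(all_rankings, k=100):
    """Mean AP over all queries."""
    return sum(average_precision_at_k(r, k) for r in all_rankings) / len(all_rankings)

# Two toy queries: relevance flags of their ranked results (1 = same category).
queries = [[1, 0, 1, 0], [0, 1, 1, 1]]
print(map_at_k(queries))  # mean of AP = 5/6 and AP = 23/36
```

A text-query MAP and an image-query MAP are computed separately in this way and then averaged to obtain the overall score reported in the tables below.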

6.3.2 Ablation Study

The architecture of c-AGCN is composed of the same essential components as v-AGCN. We conduct several ablation experiments to evaluate the contribution of each component, training our model and its ablated versions on the training set and comparing their accuracy on the test set. The ablation models of c-AGCN, including baseline, GCN, GCN+att, and GCN+att+re, are designed in the same way as those of v-AGCN. The only difference lies in the basic architecture: the ablation models in this section are based on the cross-modal information retrieval model illustrated in Section 5.

The MAP scores are shown in Table 3, including the results for text queries, image queries, and the average performance. We draw the same conclusion as for the v-AGCN model: our proposed c-AGCN achieves the best overall performance. The baseline shows the lowest text-query score and the highest image-query score; we hypothesize this is because dense fully-connected layers are prone to overfitting. The traditional GCN model shows the lowest image-query score, but adding the attention mechanism to GCN yields a remarkable improvement on the image-query task. By taking advantage of the relation embeddings, GCN+att+re gains another 2.1% on the image-query task and a slight improvement on the text-query task. Going one step further, setting distinct weights for different attention heads lets the c-AGCN model gain another 2.3% in average performance.

Method Text Query Image Query Avg.
baseline 29.7 40.5 35.1
GCN 34.5 21.0 27.5
GCN+att 34.5 35.7 35.1
GCN+att+re 34.6 37.8 36.2
c-AGCN (ours) 37.7 39.3 38.5
Table 3: Ablation study on each component for image-text retrieval on the CMPlaces dataset.

6.3.3 State-of-the-Art Comparison

Method Text Query Image Query Avg.
BL-Ind [4] 0.6 0.8 0.7
BL-ShFinal [4] 3.3 12.7 8.0
BL-ShAll [4] 0.6 0.8 0.7
Tune(Free) [4] 5.2 18.1 11.7
TuneStatReg [4] 15.1 22.1 18.6
GIN [33] 25.9 23.9 24.8
baseline 29.7 40.5 35.1
c-AGCN (ours) 37.7 39.3 38.5
Table 4: MAP score comparison of image-text retrieval on the CMPlaces dataset.

The comparison with the state of the art is shown in Table 4. We compare our model with state-of-the-art approaches, including GIN [33] and the models of Castrejon et al. [4], which were proposed along with the introduction of the CMPlaces dataset. Both GIN [33] and Castrejon et al.'s models [4] use grid-structured CNN features to model images. The results for GIN are newly obtained on the CMPlaces dataset in this work, while the results for [4] are taken from the published paper. The MAP scores in Table 4 indicate that our model outperforms the state-of-the-art approaches by a considerable margin. In particular, the image queries benefit more from the anisotropic graph convolution than the text queries, since our AGCN module is mainly designed to enhance the expressive capacity and robustness of the visual representation.

6.3.4 Qualitative Evaluation

For qualitative evaluation, two examples of the image-query-text and text-query-image tasks are shown in Figure 11. In the top of Figure 11, the text query is a description of a pantry. Compared with the baseline model, more of the results retrieved by c-AGCN are images belonging to “pantry”, and they are semantically relevant to the textual content at a fine-grained level. For example, the top five retrieved images all contain visual content highly related to the textual descriptions, including “…stores goods used for cooking…” and “…filled with non perishable goods such as canned food or jars.” In contrast, the top results of the baseline model are merely similar to “pantry” in appearance, e.g., the first two images contain a closet; the model fails to reason about the relations between the closet and the goods or to infer the purpose of the closet.

The bottom sample in Figure 11 shows the top six retrieved texts given an image query belonging to the category “train railway”. Similar to the text-query case, the top texts retrieved by c-AGCN are semantically relevant to both the category and the fine-grained visual content of the image query. The baseline model, which lacks visual relation reasoning, easily confuses scenes such as subway platform, train platform, and railroad track, which are semantically similar to “train railway” to some extent. In summary, the qualitative analysis shows that c-AGCN, injected with visual relation information, is effective in fine-grained cross-modal correlation learning.

7 Conclusion

In this paper, we propose a graph-based module, the Anisotropic Graph Convolutional Network (AGCN), which explores visual relation reasoning for improving cross-modal learning. Specifically, by incorporating prior knowledge from a visual knowledge base, we propose to model the relations between visual regions in an image as context-aware embeddings that enrich the relation representations. An upgraded Graph Convolutional Network, i.e., the Anisotropic Graph Convolutional Network, is then introduced to reason about richer semantics based on the learned visual relation embeddings and to produce a more informative image representation. To evaluate the effectiveness of our module, we inject AGCN into models for the Visual Question Answering and Cross-Modal Information Retrieval tasks. Extensive experiments show that existing models can be enhanced by our proposed module, resulting in significant improvements over state-of-the-art approaches.


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, volume 3, page 6, 2018.
  • [2] S. Antol, A. Agrawal, J. Lu, and M. Mitchell. Vqa: visual question answering. In ICCV, 2015.
  • [3] P. W. Battaglia, J. B. Hamrick, and V. e. a. Bapst. Relational inductive biases, deep learning, and graph networks. In arXiv: 1806.01261, 2018.
  • [4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In CVPR. IEEE, 2016.
  • [5] L. Cewu, K. Ranjay, B. Michael, and F. Li. Visual relationship detection with language priors. In ECCV, page 852–869, 2016.
  • [6] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In CVPR, pages 3298–3308. IEEE, 2017.
  • [7] S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, pages 3270–3277, 2014.
  • [8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrel, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, page 457–468, 2016.
  • [9] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-class segmentation with relative location prior. In IJCV, 2008.
  • [10] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, volume 1, page 3, 2017.
  • [11] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: elevating the role of image understanding in visual question answering. In CVPR, 2017.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [13] S. Jae Hwang, S. N. Ravi, Z. Tao, H. J. Kim, M. D. Collins, and V. Singh. Tensorize, factorize and regularize: Robust visual relationship learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1014–1023, 2018.
  • [14] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang. Hadamard product for low-rank bilinear pooling. In ICLR, 2017.
  • [15] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [16] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016.
  • [17] V. B. G. Kumar, G. Carneiro, and I. Reid. Learning local image descriptors with deep siamese and triplet convolutional networks by minimizing global loss functions. In CVPR, page 5385–5394, 2016.
  • [18] K.-h. Lee, X. Chen, G. Hua, H. Hu, and X. He. Stacked cross attention for image-text matching. In ECCV, page arXiv:1803.08024, 2018.
  • [19] S. Li, T. Xiao, H. Li, W. Yang, and X. Wang. Identity-aware textual-visual matching with latent co-attention. In ECCV, pages 1908–1917, 2017.
  • [20] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang. Scene graph generation from objects, phrases and region captions. In ICCV, 2017.
  • [21] Z. Ma, Y. Lu, and D. Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In ICML, pages 169–178, 2015.
  • [22] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016.
  • [23] A. Peter, F. Basura, J. Mark, and G. Stephen. Spice: Semantic propositional image caption evaluation. In ECCV, page 382–398, 2016.
  • [24] V. Ranjan, N. Rasiwasia, and C. V. Jawahar. Multi-label cross-modal retrieval. In ICCV, pages 4094–4102.
  • [25] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimedia retrieval. In ACMMM, pages 251–260. ACM, 2010.
  • [26] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [27] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
  • [28] J. Tenebaum and W. Freeman. Separating style and content. In NIPS, page 662–668, 1997.
  • [29] D. Teney, P. Anderson, X. He, and A. Hengel. Tips and tricks for visual question answering: learnings from the 2017 challenge. In arXiv:1708.02711v1, 2017.
  • [30] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, pages 5005–5013, 2016.
  • [31] Z. Yang, X. He, and J. Gao. Stacked attention networks for image question answering. In CVPR, 2016.
  • [32] T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. arXiv preprint arXiv:1809.07041, 2018.
  • [33] J. Yu, Y. Lu, Z. Qin, W. Zhang, Y. Liu, J. Tan, and L. Guo. Modeling text with graph convolutional network for cross-modal information retrieval. In Pacific Rim Conference on Multimedia, pages 223–234. Springer, 2018.
  • [34] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. In arXiv preprint arXiv:1512.02167v2, 2015.