Image-to-Image Retrieval by Learning Similarity between Scene Graphs

As a scene graph compactly summarizes the high-level content of an image in a structured and symbolic manner, the similarity between scene graphs of two images reflects the relevance of their contents. Based on this idea, we propose a novel approach for image-to-image retrieval using scene graph similarity measured by graph neural networks. In our approach, graph neural networks are trained to predict a proxy image relevance measure, computed from human-annotated captions using a pre-trained sentence similarity model. We collect and publish a dataset of image relevance judged by human annotators to evaluate retrieval algorithms. The collected dataset shows that our method agrees better with the human perception of image similarity than other competitive baselines.




Image-to-image retrieval, the task of finding similar images to a query image from a database, is one of the fundamental problems in computer vision and is the core technology in visual search engines. The application of image retrieval systems has been most successful in problems where each image has a clear representative object, such as landmark detection and instance-based retrieval Gordo et al. (2016); Mohedano et al. (2016); Radenović et al. (2016), or has explicit tag labels Gong et al. (2014).

However, performing image retrieval with complex images that have multiple objects and various relationships between them remains challenging for two reasons. First, deep convolutional neural networks (CNNs), on which most image retrieval methods rely heavily, tend to be overly sensitive to low-level and local visual features Zheng et al. (2017); Zeiler and Fergus (2014); Chen et al. (2018). As shown in Figure 1, nearest-neighbor search in the ResNet-152 penultimate-layer feature space returns images that are superficially similar but have completely different content. Second, there is no publicly available labeled data to train and evaluate an image retrieval system for complex images, partly because quantifying similarity between images with multiple objects as label information is difficult. Furthermore, a similarity measure for such complex images should reflect the semantics of images, i.e., the context and relationships of entities in images.

Figure 1: Image retrieval examples from ResNet and IRSGS. ResNet retrieves images with superficial similarity, e.g., grayscale or vertical lines, while IRSGS successfully returns images with correct context, such as playing tennis or skateboarding.

In this paper, we address these challenges and build an image retrieval system capable of finding semantically similar images to a query from a complex scene image database. First of all, we propose a novel image retrieval framework, Image Retrieval with Scene Graph Similarity (IRSGS), which retrieves images whose scene graphs are similar to the scene graph of a query. A scene graph represents an image as a set of objects, attributes, and relationships, summarizing the content of a complex image. Therefore, scene graph similarity can be an effective tool to measure semantic similarity between images. IRSGS utilizes a graph neural network to compute the similarity between two scene graphs, which makes it more robust to confounding low-level features (Figure 1).

Also, we conduct a human experiment to collect human decisions on image similarity. In the experiment, annotators are given a query image along with two candidate images and asked to select which candidate image is more similar to the query. With 29 annotators, we collect more than 10,000 annotations over more than 1,700 image triplets. Thanks to the collected dataset, we can quantitatively evaluate the performance of image retrieval methods. Our dataset is available online.

However, it is costly to collect enough ground truth annotations from humans to supervise the image retrieval algorithm for a large image dataset, because the number of pairwise relationships to be labeled grows in $O(N^2)$ for the number of data points $N$. Instead, we utilize human-annotated captions of images to define a proxy image similarity, inspired by Gordo and Larlus (2017), which used term frequencies of captions to measure image similarity. As a caption tends to cover important objects, attributes, and relationships between objects in an image, the similarity between captions is likely to reflect the contextual similarity between two images. Also, obtaining captions is more feasible, as the number of required captions grows in $O(N)$. We use the state-of-the-art sentence embedding method Reimers and Gurevych (2019) to compute the similarity between captions. The computed similarity is used to train a graph neural network in IRSGS and to evaluate the retrieval results.

Tested on real-world complex scene images, IRSGS shows higher agreement with human judgment than other competitive baselines. The main contributions of this paper can be summarized as follows:

  • We propose IRSGS, a novel image retrieval framework that utilizes the similarity between scene graphs computed from a graph neural network to retrieve semantically similar images;

  • We collect more than 10,000 human annotations for evaluating semantic-based image retrieval methods and release the dataset publicly;

  • We propose to train the proposed retrieval framework with the surrogate relevance measure obtained from image captions and a pre-trained language model;

  • We empirically evaluate the proposed method and demonstrate its effectiveness over other baselines.

Figure 2: An overview of IRSGS. Images are converted into vector representations through scene graph generation (SGG) and graph embedding. The graph embedding function is learned to minimize mean squared error to surrogate relevance, i.e., the similarity between captions. The bold red bidirectional arrows indicate trainable parts. For retrieval, the learned scene graph similarity function is used to rank relevant images.

Related Work

Image Retrieval

Conventional image retrieval methods use visual feature representations, object categories, or text descriptions Zheng et al. (2017); Babenko et al. (2014); Chen et al. (2019); Wei et al. (2016); Zhen et al. (2019); Gu et al. (2018); Vo et al. (2019); Gordo et al. (2017). The activation of intermediate layers of CNN is shown to be effective as a representation of an image for image retrieval tasks. However, as shown in Figure 1, CNN often fails to capture semantic contents of images and is confounded by low-level visual features.

Image retrieval methods that reflect more of the semantic content of images are investigated in Gordo and Larlus (2017); Johnson et al. (2015). Gordo and Larlus (2017) used term frequencies in regional captions to supervise a CNN for image retrieval, but they did not utilize scene graphs. Johnson et al. (2015) proposed an algorithm that retrieves images given a scene graph query. However, their approach does not employ graph-to-graph comparison and is not scalable.

Scene Graphs

A scene graph Johnson et al. (2015) represents the content of an image in the form of a graph whose nodes represent objects, their attributes, and the relationships between them. After the release of Visual Genome Krishna et al. (2017), a large-scale real-world scene graph dataset manually annotated by humans, a number of applications such as image captioning Wu et al. (2017); Lu et al. (2018); Milewski et al. (2020), visual question answering Teney et al. (2017), and image-grounded dialog Das et al. (2017) have shown the effectiveness of scene graphs. Furthermore, various works, such as GQA Hudson and Manning (2019), VRD Lu et al. (2016), and VrR-VG Liang et al. (2019), have provided human-annotated scene graph datasets. Also, recent studies Yang et al. (2018); Xu et al. (2017); Li et al. (2017) have suggested methods to generate scene graphs automatically. A detailed discussion of scene graph generation is given in the Experimental Setup section.

Graph Similarity Learning

Many algorithms have been proposed for solving the isomorphism test or the (sub-)graph matching task between two graphs. However, such methods are often not scalable to huge graphs or not applicable in settings where node features are provided. Here, we review several state-of-the-art algorithms that are related to our application, image retrieval by graph matching. From the graph pooling perspective, we focus on two recent algorithms, the Graph Convolutional Network (GCN; Kipf and Welling (2016)) and the Graph Isomorphism Network (GIN; Xu et al. (2018)). GCN utilizes neural network-based spectral convolutions in the Fourier domain to perform the convolution operation on a graph. GIN uses injective aggregation and graph-level readout functions. Both networks transform a graph into a fixed-length vector, enabling distance computation between two graphs in the vector space, so the learned graph representations can be used to measure the similarity of two graphs. Other studies view the graph similarity learning problem as an optimal transport problem Solomon et al. (2016); Maretic et al. (2019); Alvarez-Melis and Jaakkola (2018); Xu et al. (2019a, b); Titouan et al. (2019). In particular, in Gromov-Wasserstein Learning (GWL; Xu et al. (2019b)), node embeddings are learned from associated node labels, so the method can reflect both the graph structure and the node features. Graph Matching Network (GMN; Li et al. (2019)) uses a cross-graph attention mechanism, which yields different node representations for different pairs of graphs.

Image Retrieval with Scene Graph Similarity

In this section, we describe our framework, Image Retrieval with Scene Graph Similarity (IRSGS). Given a query image, IRSGS first generates a query scene graph from the image and then retrieves images with a scene graph highly similar to the query scene graph. Figure 2 illustrates the retrieval process. The similarity between scene graphs is computed through a graph neural network trained with surrogate relevance measure as a supervision signal.

Scene Graphs and Their Generation

Formally, a scene graph $\mathcal{S} = \{\mathcal{O}, \mathcal{A}, \mathcal{R}\}$ of an image $\mathcal{I}$ is defined as a set of objects $\mathcal{O}$, attributes of objects $\mathcal{A}$, and relations on pairs of objects $\mathcal{R}$. All objects, attributes, and relations are associated with a word label, for example, "car", "red", and "in front of". We represent a scene graph as a set of nodes and edges, i.e., in the form of a conventional graph. All objects, attributes, and relations are treated as nodes, and associations among them are represented as undirected edges. Word labels are converted into 300-dimensional GloVe vectors Pennington et al. (2014) and treated as node features.
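The conversion into a conventional graph can be illustrated with a minimal sketch. The input layout (index-based attribute and relation tuples) and the function name `scene_graph_to_graph` are assumptions made for illustration only; the GloVe lookup for node features is omitted.

```python
def scene_graph_to_graph(objects, attributes, relations):
    """Turn a scene graph into (node_labels, undirected_edges).

    objects:    list of object labels, e.g. ["car", "tree"]
    attributes: list of (object_index, attribute_label) pairs
    relations:  list of (subject_index, relation_label, object_index) triples
    The input layout is hypothetical, chosen only for this sketch.
    """
    nodes = list(objects)  # object nodes keep their original indices
    edges = set()

    for obj_idx, attr_label in attributes:
        attr_idx = len(nodes)
        nodes.append(attr_label)  # each attribute becomes its own node
        edges.add(frozenset((obj_idx, attr_idx)))

    for subj_idx, rel_label, obj_idx in relations:
        rel_idx = len(nodes)
        nodes.append(rel_label)  # each relation becomes its own node
        edges.add(frozenset((subj_idx, rel_idx)))
        edges.add(frozenset((rel_idx, obj_idx)))

    # Sorted index pairs give a deterministic, direction-free edge list.
    return nodes, sorted(tuple(sorted(e)) for e in edges)


nodes, edges = scene_graph_to_graph(
    objects=["car", "tree"],
    attributes=[(0, "red")],
    relations=[(0, "in front of", 1)],
)
```

In this toy example the relation "in front of" becomes a node linked to both "car" and "tree", so the subject-relation-object triple is encoded without edge labels.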

Generating a scene graph from an image is equivalent to detecting objects, attributes, and relationships in the image. We employ a recently proposed method Anderson et al. (2018) in our IRSGS framework to generate scene graphs. While end-to-end training of scene graph generation module is possible in principle, a fixed pre-trained algorithm is used in our experiments to reduce the computational burden. We shall provide details of our generation process in Experimental Setup Section. Note that IRSGS is compatible with any scene graph generation algorithm and is not bound to the specific one we used in this paper.

Retrieval via Scene Graph Similarity

Given a query image $\mathcal{I}_q$, an image retrieval system ranks candidate images $\{\mathcal{I}_i\}_{i=1}^{N}$ according to their similarity to the query image, $\mathrm{sim}(\mathcal{I}_i, \mathcal{I}_q)$. IRSGS casts this image retrieval task into a graph retrieval problem by defining the similarity between images as the similarity between the corresponding scene graphs. Formally,

$$\mathrm{sim}(\mathcal{I}_i, \mathcal{I}_q) = f(\mathcal{S}_i, \mathcal{S}_q), \tag{1}$$

where $\mathcal{S}_i, \mathcal{S}_q$ are the scene graphs for $\mathcal{I}_i, \mathcal{I}_q$, respectively. We shall refer to $f(\mathcal{S}_i, \mathcal{S}_q)$ as the scene graph similarity.

We compute the scene graph similarity from the inner product of two representation vectors of scene graphs. Given a scene graph, a graph neural network is applied, and the resulting node representations are pooled to generate a unit $d$-dimensional vector $\phi(\mathcal{S}) \in \mathbb{R}^d$. The scene graph similarity is then given as follows:

$$f(\mathcal{S}_1, \mathcal{S}_2) = \phi(\mathcal{S}_1)^\top \phi(\mathcal{S}_2). \tag{2}$$

We construct $\phi$ by computing the forward pass of a graph neural network to obtain node representations and then applying average pooling. We implement $\phi$ with either GCN or GIN, yielding two versions, IRSGS-GCN and IRSGS-GIN, respectively.
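The pooling and inner-product step can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: the node vectors stand in for graph neural network outputs, and the function names are hypothetical.

```python
import math


def pool_to_unit_vector(node_vectors):
    """Average node representations and rescale the mean to unit length."""
    dim = len(node_vectors[0])
    count = len(node_vectors)
    mean = [sum(v[k] for v in node_vectors) / count for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        raise ValueError("mean node representation has zero norm")
    return [x / norm for x in mean]


def scene_graph_similarity(nodes_a, nodes_b):
    """Inner product of the two pooled unit vectors (a cosine similarity)."""
    phi_a = pool_to_unit_vector(nodes_a)
    phi_b = pool_to_unit_vector(nodes_b)
    return sum(x * y for x, y in zip(phi_a, phi_b))
```

Because both pooled vectors have unit norm, the score lies in [-1, 1], and every database image can be embedded once and compared to a query with a single dot product.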

Learning to Predict Surrogate Relevance

We define the surrogate relevance measure between two images as the similarity between their captions. Let $c_i$ and $c_j$ be captions of images $\mathcal{I}_i$ and $\mathcal{I}_j$. To compute the similarity between the captions, we first apply Sentence-BERT (SBERT; Reimers and Gurevych (2019)), using the publicly released code and the pre-trained model bert-large-nli-mean-tokens, and project the output onto the surface of a unit sphere to obtain representation vectors $\psi(c_i)$ and $\psi(c_j)$. The surrogate relevance measure is then given by their inner product: $s(c_i, c_j) = \psi(c_i)^\top \psi(c_j)$. When there is more than one caption for an image, we compute the surrogate relevance of all caption pairs and take the average. With the surrogate relevance, we are able to compute a proxy score for any pair of images in the training set, given their human-annotated captions. To validate the proposed surrogate relevance measure, we collect human judgments of semantic similarity between images by conducting a human experiment (details in the Human Annotation Collection section).
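A minimal sketch of this averaging step is given below. It assumes the caption embeddings have already been produced by a sentence encoder and are passed in as plain lists of floats; the function names are hypothetical and no SBERT call is shown.

```python
import math


def normalize(vector):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def surrogate_relevance(caption_vecs_i, caption_vecs_j):
    """Average inner product over all cross-image caption pairs.

    caption_vecs_i, caption_vecs_j: lists of caption embeddings, one list
    per image (assumed to be precomputed by a sentence encoder).
    """
    unit_i = [normalize(v) for v in caption_vecs_i]
    unit_j = [normalize(v) for v in caption_vecs_j]
    total = sum(
        sum(a * b for a, b in zip(u, w)) for u in unit_i for w in unit_j
    )
    return total / (len(unit_i) * len(unit_j))
```

With five captions per image, as in the datasets used here, each image pair contributes 25 caption pairs to the average.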

We train the scene graph similarity $f$ by directly minimizing the mean squared error from the surrogate relevance measure, formulating the learning as a regression problem. The loss function for the $i$-th and $j$-th images is given as $L_{ij} = \left( f(\mathcal{S}_i, \mathcal{S}_j) - s(c_i, c_j) \right)^2$, where $s(c_i, c_j)$ is the surrogate relevance of their captions. Other losses, such as triplet loss or contrastive loss, can be employed as well. However, we could not find clear performance gains with those losses and therefore adhere to the simplest solution.
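For concreteness, the regression objective over a mini-batch of image pairs could be written as below. The function name is hypothetical, and averaging over the batch is an assumption of this sketch.

```python
def pairwise_mse_loss(predicted_similarity, surrogate_relevance):
    """Mean squared error between predicted scene graph similarities and
    surrogate relevance targets, one entry per image pair in the batch."""
    if len(predicted_similarity) != len(surrogate_relevance):
        raise ValueError("one target is needed per predicted similarity")
    squared_errors = [
        (p - t) ** 2 for p, t in zip(predicted_similarity, surrogate_relevance)
    ]
    return sum(squared_errors) / len(squared_errors)
```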

| Method | Data | nDCG@5 | nDCG@10 | nDCG@20 | nDCG@30 | nDCG@40 | nDCG@50 | Human Agreement |
|---|---|---|---|---|---|---|---|---|
| Inter Human | - | - | - | - | - | - | - | 0.730 ± 0.05 |
| Caption SBERT | Cap(HA) | 1 | 1 | 1 | 1 | 1 | 1 | 0.700 |
| Random | - | 0.136 | 0.138 | 0.143 | 0.147 | 0.149 | 0.152 | 0.472 ± 0.01 |
| Gen. Cap. SBERT | Cap(Gen) | 0.609 | 0.628 | 0.657 | 0.681 | 0.703 | 0.726 | 0.473 |
| ResNet | I | 0.687 | 0.689 | 0.691 | 0.692 | 0.693 | 0.693 | 0.494 |
| ResNet-FT | I | 0.642 | 0.656 | 0.682 | 0.703 | 0.724 | 0.745 | 0.478 |
| Object Count | I+SG | 0.736 | 0.749 | 0.770 | 0.788 | 0.804 | 0.819 | 0.587 |
| GMN | I+SG | 0.721 | 0.735 | 0.755 | 0.771 | 0.786 | 0.801 | 0.535 |
| IRSGS-GIN | I+SG | 0.751 | 0.768 | 0.790 | 0.808 | 0.824 | 0.839 | 0.576 |
| IRSGS-GCN | I+SG | 0.784 | 0.795 | 0.814 | 0.829 | 0.844 | 0.856 | 0.602 |

Table 1: Image retrieval results on VG-COCO with human-annotated scene graphs. The Data column indicates which data modalities are used. Cap(HA): human-annotated captions. Cap(Gen): machine-generated captions. I: image. SG: scene graphs.

Human Annotation Collection

We collect semantic similarity annotations from humans to validate the proposed surrogate relevance measure and to evaluate image retrieval methods. Through our web-based annotation system, a human labeler is asked whether two candidate images are semantically similar to a given query image. The labeler may choose one of four answers: either of the two candidate images is more similar than the other, images in the triplet are semantically identical, or neither of the candidate images is relevant to the query. We collect 10,712 human annotations from 29 human labelers for 1,752 image triplets constructed from the test set of the VG-COCO, the dataset we shall define in Experimental Setup Section.

A query image of a triplet is randomly selected from the query set defined in the following section. Two candidate images are randomly selected from the rest of the test set, subject to two constraints. First, the rank of a candidate image should be less than or equal to 100 when the whole test set is sorted according to cosine similarity to the query image in the ResNet-152 representation. Second, the surrogate relevance of one query-candidate pair in a triplet should be larger than that of the other pair, and the difference should be greater than 0.1. This selection criterion produces visually close yet semantically different image triplets.

We define the human agreement score to measure the agreement between the decisions of an algorithm and those of the human annotators, in a manner similar to Gordo and Larlus (2017). The score is the average portion of human annotators who made the same decision as the algorithm for each triplet. Formally, given a triplet, let $s_1$ (or $s_2$) be the number of human annotators who chose the first (or the second) candidate image as more semantically similar to the query, $s_3$ be the number of annotators who answered that all three images are identical, and $s_4$ be the number of annotators who marked the candidates as irrelevant. If an algorithm chooses one of the candidate images as more relevant, the human agreement score for a triplet is $\frac{s_i + 0.5 s_3}{s_1 + s_2 + s_3 + s_4}$, where $i = 1$ if the algorithm determines that the first image is semantically closer and $i = 2$ otherwise. The score is averaged over triplets with $s_1 + s_2 \geq 2$. Randomly selecting one of the two candidate images produces an average human agreement of 0.472 with a standard deviation of 0.01. Note that the agreement of random decisions is lower than 0.5 due to the existence of the human choices "both" ($s_3$) and "neither" ($s_4$).
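The per-triplet score can be sketched in a few lines. The exact expression is not legible in every rendering of the text, so this sketch assumes the form (s_i + 0.5·s_3) / (s_1 + s_2 + s_3 + s_4), where s_1 and s_2 count votes for the first and second candidate, s_3 counts "both", and s_4 counts "neither"; the function names are hypothetical.

```python
def human_agreement(choice, s1, s2, s3, s4):
    """Agreement between an algorithm's choice and the annotators on one triplet.

    choice: 1 or 2, the candidate the algorithm ranks as closer to the query.
    s1, s2: annotators who picked candidate 1 / candidate 2.
    s3:     annotators who answered "both" (counted with weight 0.5, assumed).
    s4:     annotators who answered "neither".
    """
    picked = s1 if choice == 1 else s2
    return (picked + 0.5 * s3) / (s1 + s2 + s3 + s4)


def expected_random_agreement(s1, s2, s3, s4):
    """Expected score when the candidate is chosen by a fair coin flip."""
    return 0.5 * human_agreement(1, s1, s2, s3, s4) + 0.5 * human_agreement(
        2, s1, s2, s3, s4
    )
```

For a triplet with votes (4, 1, 2, 1), choosing the first candidate scores 0.625, choosing the second scores 0.25, and a coin flip scores 0.4375 in expectation under the assumed formula.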

The alignment between labelers is also measured with the human agreement score in a leave-one-out fashion. If a human answers that both candidate images are relevant, the score for the triplet is $\frac{0.5 s_1 + 0.5 s_2 + s_3}{s_1 + s_2 + s_3 + s_4}$, where $s_1, s_2, s_3, s_4$ are computed from the rest of the annotators. If a human marks that neither of the candidates is relevant for a triplet, the triplet is not counted in the human agreement score. The mean human agreement score among the annotators is 0.727, and the standard deviation is 0.05. We will make the human annotation dataset public after the review.

Experimental Setup


In experiments, we use two image datasets involving diverse semantics. The first dataset is the intersection of the Visual Genome Krishna et al. (2017) and MS-COCO Lin et al. (2014), which we will refer to as VG-COCO. In VG-COCO, each image has a scene graph annotation provided by Visual Genome and five captions provided by MS-COCO. We utilize the refined version of scene graphs provided by Xu et al. (2017) and their train-test split. After removing the images with empty scene graphs, we obtain fully annotated 35,017 training images and 13,203 test images. We randomly select a fixed set of 1,000 images among the test set and define them as a query set. For each query image, a retrieval algorithm is asked to rank the other 13,202 images in the test set according to the semantic similarity. Besides the annotated scene graphs, we automatically generate scene graphs for all images and experiment with our approach to both human-labeled and machine-generated scene graphs.

The second dataset is Flickr30K Plummer et al. (2017), where five captions are provided per image. Flickr30K contains 30,000 training images, 1,000 validation images, and 1,000 testing images. For Flickr30K, the whole test set is the query set. During the evaluation, an algorithm ranks the other 999 images in the test set given a query image. Scene graphs are generated in the same manner as for the VG-COCO dataset.

Scene Graph Generation Detail

Since we focus on learning graph embeddings when two scene graphs are given for the image-to-image retrieval task, we use a conventional scene graph generation process. Following Anderson et al. (2018), objects in images are detected by Faster R-CNN, and the names and attributes of the objects are predicted based on the ResNet-101 features from the detected bounding boxes. We keep up to 100 objects with a confidence threshold of 0.3. To predict relation labels between objects after extracting information about the objects, we use the frequency prior constructed from the GQA dataset, which covers 309 kinds of relations. (Footnote: We also tried to predict relation labels using recently suggested SGG algorithms, such as Yang et al. (2018); Xu et al. (2017); Li et al. (2017). However, we could not achieve any improvement in image retrieval tasks. The reasons might be that 1) small vocabularies for objects and relations are used in the conventional SGG setting (only 150/50 kinds of objects/relations), 2) the algorithms do not predict attributes, and 3) the annotated scene graphs used for training the methods have very sparse relations.) For each pair of detected objects, relationships are predicted based on the frequency prior with a confidence threshold of 0.2. To give position-specific information, the coordinates of the detected bounding boxes are used. Note that even though this method of generating a scene graph is much simpler than other methods Yang et al. (2018); Xu et al. (2017); Li et al. (2017), it outperforms all the others.

Two-Step Retrieval using Visual Features

In information retrieval, it is common practice to take a two-step approach Wang et al. (2019); Bai and Bai (2016): retrieving roughly relevant items first and then sorting (or "re-ranking") the retrieved items according to their relevance. We also employ this approach in our experiments. For a query image, we first retrieve the $K$ images that are closest to the query in a ResNet-152 feature representation space formed by the 2048-dimensional activation vector of the last hidden layer. Closeness is measured by cosine similarity. This procedure generates a set of good candidate images which have a high probability of having strong semantic similarity. This retrieval step can be further accelerated by using an approximate nearest neighbor engine such as Faiss Johnson et al. (2017) and is critical if the following re-ranking step is computationally involved. We use this approximate pre-ranking for all experiments with $K = 100$ unless otherwise mentioned. Although there is large flexibility in designing this step, we leave other possibilities for future exploration, as the re-ranking step is our focus.
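The two-step procedure can be sketched as follows, using exact cosine similarity in place of an approximate nearest neighbor engine. The function and parameter names are hypothetical, `rerank_score` stands in for the learned scene graph similarity, and the default shortlist size of 100 is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def two_step_retrieval(query_id, visual_feats, rerank_score, k=100):
    """Pre-rank by visual features, then re-rank the shortlist.

    visual_feats: dict mapping image id -> visual feature vector.
    rerank_score: callable (query_id, candidate_id) -> float, higher is better.
    """
    query_vec = visual_feats[query_id]
    candidates = [i for i in visual_feats if i != query_id]
    # Step 1: keep the k candidates closest to the query in feature space.
    candidates.sort(key=lambda i: cosine(query_vec, visual_feats[i]), reverse=True)
    shortlist = candidates[:k]
    # Step 2: order the shortlist by the (more expensive) re-ranking score.
    return sorted(shortlist, key=lambda i: rerank_score(query_id, i), reverse=True)


feats = {"q": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.9, 0.5], "c": [0.0, 1.0]}
graph_scores = {"a": 0.2, "b": 0.9, "c": 1.0}
ranking = two_step_retrieval("q", feats, lambda q, i: graph_scores[i], k=2)
```

In the toy example, image "c" has the highest re-ranking score but never reaches the second step because it falls outside the visual shortlist, which illustrates the trade-off of the pre-ranking stage.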

Training Details

We use the Adam optimizer with an initial learning rate of 0.0001. We multiply the learning rate by 0.9 every epoch. We set the batch size to 32, and models are trained for 25 epochs. In each training step, a mini-batch of pairs is formed by randomly drawing samples. When drawing the second sample in a pair, we employ an oversampling scheme to reinforce the learning of pairs with large similarity values. With a probability of 0.5, the second sample in a pair is drawn from the 100 most relevant samples, i.e., those with the largest surrogate relevance score to the first sample. Otherwise, we select the second sample from the whole training set. Oversampling improves both quantitative and qualitative results and is applied identically to all methods except for GWL, where the scheme is not applicable.
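The pair sampling scheme and the learning rate schedule can be sketched as below. The names `sample_pair`, `top_relevant`, and `learning_rate` are hypothetical, and the list of most relevant samples per image is assumed to be precomputed from the surrogate relevance.

```python
import random


def sample_pair(n_items, top_relevant, rng, p_oversample=0.5):
    """Draw the indices (i, j) of one training pair.

    top_relevant: dict mapping i -> indices of the samples most relevant to i
                  (e.g. the 100 with the largest surrogate relevance).
    rng:          a random.Random instance, passed in for reproducibility.
    """
    i = rng.randrange(n_items)
    if rng.random() < p_oversample:
        j = rng.choice(top_relevant[i])  # oversample highly relevant partners
    else:
        j = rng.randrange(n_items)       # uniform draw from the training set
    return i, j


def learning_rate(epoch, initial=1e-4, decay=0.9):
    """Learning rate after `epoch` multiplicative decays."""
    return initial * decay ** epoch
```

A mini-batch is then 32 calls to `sample_pair`; note that the uniform branch of this sketch may occasionally pair an image with itself.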

| Method | nDCG@5 | nDCG@10 | nDCG@20 | nDCG@40 | Human Agreement |
|---|---|---|---|---|---|
| Inter Human | - | - | - | - | 0.730 |
| Caption SBERT | 1 | 1 | 1 | 1 | 0.700 |
| Random | 0.136 | 0.138 | 0.143 | 0.149 | 0.472 |
| Gen. Cap. SBERT | 0.609 | 0.628 | 0.657 | 0.703 | 0.473 |
| ResNet | 0.687 | 0.689 | 0.691 | 0.693 | 0.494 |
| ResNet-FT | 0.642 | 0.656 | 0.682 | 0.724 | 0.478 |
| Object Count | 0.730 | 0.743 | 0.761 | 0.794 | 0.581 |
| GWL | 0.748 | 0.758 | 0.774 | 0.803 | 0.598 |
| GMN | 0.728 | 0.740 | 0.755 | 0.781 | 0.539 |
| IRSGS-GIN | 0.764 | 0.781 | 0.802 | 0.834 | 0.612 |
| IRSGS-GCN | 0.771 | 0.784 | 0.805 | 0.836 | 0.611 |

Table 2: Image retrieval results on VG-COCO with machine-generated scene graphs. Baselines which do not use scene graphs are identical to the corresponding rows of Table 1.



We benchmark IRSGS and other baselines on VG-COCO and Flickr30K. Images in the query set are presented as queries, and the relevance of the images ranked by an image retrieval algorithm is evaluated with two metrics. First, we compute the normalized discounted cumulative gain (nDCG) with the surrogate relevance as gain. A larger nDCG value indicates stronger enrichment of relevant images in the retrieval result. In the nDCG computation, surrogate relevance is clipped at zero to ensure its non-negativity. Second, the agreement between a retrieval algorithm and the decisions of human annotators is measured as described in the Human Annotation Collection section.
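A minimal nDCG sketch with clipping is shown below. It assumes the common formulation with linear gain and a log2 rank discount, which may differ in detail from the variant used in the experiments; the function names are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (rank from 1)."""
    return sum(g / math.log2(position + 2) for position, g in enumerate(gains))


def ndcg_at_k(ranked_relevance, all_relevance, k):
    """nDCG@k with surrogate relevance clipped at zero.

    ranked_relevance: relevance of the retrieved items, in ranked order.
    all_relevance:    relevance of every candidate, used for the ideal ranking.
    """
    clipped_ranked = [max(0.0, r) for r in ranked_relevance]
    clipped_all = [max(0.0, r) for r in all_relevance]
    actual = dcg(clipped_ranked[:k])
    ideal = dcg(sorted(clipped_all, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Ranking by the surrogate relevance itself reproduces the ideal ordering, which is why a method that retrieves with the surrogate relevance directly attains an nDCG of 1 by construction.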

Baseline Methods

ResNet-152 Features

Image retrieval is performed based on the cosine similarity in the last hidden representation of ResNet-152 pre-trained on ImageNet.

Generated Caption

To test whether machine-generated captions can be an effective means for semantic image retrieval, we generate captions of images with the soft attention model Xu et al. (2015) pretrained on the Flickr30K dataset Plummer et al. (2017). We obtain SBERT representations of the generated captions, and their cosine similarity is used to perform image retrieval.

Object Count (OC) Ignoring relation information given in a scene graph, we transform a scene graph into a vector of object counts. Then, we compute the cosine similarity of object count vectors to perform image retrieval.
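The Object Count baseline is simple enough to sketch directly. The function name is hypothetical, and object labels are assumed to be plain strings taken from the object nodes of a scene graph.

```python
import math
from collections import Counter


def object_count_similarity(objects_a, objects_b):
    """Cosine similarity between the object-count vectors of two scene graphs."""
    counts_a, counts_b = Counter(objects_a), Counter(objects_b)
    # Only labels present in both graphs contribute to the dot product.
    dot = sum(counts_a[label] * counts_b[label] for label in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty graph is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Since relations and attributes are discarded, two images with the same objects in very different arrangements receive a score of 1 under this baseline.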

ResNet Finetune (ResNet-FT) We test whether a ResNet-152 can be fine-tuned to capture semantic similarity. Similarly to Siamese Network Bromley et al. (1994), ResNet feature extractor is trained to produce cosine similarity between images close to their surrogate relevance measure.

Gromov-Wasserstein Learning (GWL) Based on Gromov-Wasserstein Learning (GWL) framework Xu et al. (2019b), we obtain a transport map using a proximal gradient method Xie et al. (2018). A transport cost, a sum of Gromov-Wasserstein discrepancy and Wasserstein discrepancy, is calculated with the transport map and the cost matrix, and used for retrieval. The method is computationally demanding, and we only tested the method for VG-COCO with generated scene graphs setting in Table 2.

Graph Matching Networks (GMN) GMNs are implemented based on the publicly available code. We use four propagation layers with shared weights. Propagation in the reverse direction is allowed, and the propagated representation is updated using a gated recurrent unit. Final node representations are aggregated by summation, resulting in a 128-dimensional vector, which is then fed to a multi-layer perceptron to produce the final scalar output. As GMN is capable of handling edge features, we leave relations as edges instead of transforming them into nodes. To indicate object-attribute connections, we append an additional dimension to the edge feature vectors and define the feature vector of an edge between an object and an attribute as a one-hot vector where only the last dimension is non-zero.

Graph Embedding Methods in IRSGS

Here, we describe implementation details of graph neural networks used in IRSGS.

IRSGS-GCN A GCN is applied to a scene graph, and the final node representations are aggregated via mean pooling and scaled to unit norm, yielding a representation vector $\phi(\mathcal{S})$. We use three graph convolution layers with 300 hidden neurons in each layer. The first two layers are followed by a ReLU nonlinearity. Stacking more layers does not bring a clear improvement. We always symmetrize the adjacency matrix before applying GCN.
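A single graph convolution step on a small dense graph can be sketched as follows. The sketch assumes the renormalized propagation rule of Kipf and Welling, D^{-1/2}(A + I)D^{-1/2} H W, on an adjacency matrix without self-loops; the helper names are hypothetical and the weights in the example are arbitrary rather than learned.

```python
import math


def symmetrize(adj):
    """Make a 0/1 adjacency matrix undirected."""
    n = len(adj)
    return [[1.0 if adj[i][j] or adj[j][i] else 0.0 for j in range(n)] for i in range(n)]


def normalize_adjacency(adj):
    """Compute D^{-1/2} (A + I) D^{-1/2} for an adjacency matrix without self-loops."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(norm_adj, features, weight, relu=True):
    """One propagation step: aggregate neighbours, apply weights, optionally ReLU."""
    out = matmul(matmul(norm_adj, features), weight)
    if relu:
        out = [[max(0.0, x) for x in row] for row in out]
    return out


adjacency = normalize_adjacency(symmetrize([[0, 1], [0, 0]]))
hidden = gcn_layer(adjacency, [[1.0], [3.0]], [[1.0]])
```

Stacking three such layers, with ReLU on the first two only, and then mean-pooling the node rows would mirror the layer configuration described above.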

IRSGS-GIN Similarly to GCN, we stack three GIN convolution layers with 300 hidden neurons in each layer. For multi-layer perceptrons required for each layer, we use one hidden layer with 512 neurons with ReLU nonlinearity. Other details are the same as that of the GCN case.

Figure 3: Four most similar images retrieved by six algorithms. OC: Object Count, GIN: IRSGS-GIN, GCN: IRSGS-GCN. The visual genome ids for the query images are 2323522 and 2316427.
| Method | nDCG@5 | nDCG@10 | nDCG@20 | nDCG@40 |
|---|---|---|---|---|
| Caption SBERT | 1 | 1 | 1 | 1 |
| Random | 0.195 | 0.209 | 0.223 | 0.245 |
| Gen. Cap. SBERT | 0.556 | 0.576 | 0.610 | 0.659 |
| ResNet | 0.539 | 0.541 | 0.541 | 0.542 |
| ResNet-FT | 0.368 | 0.393 | 0.433 | 0.502 |
| Object Count | 0.511 | 0.530 | 0.560 | 0.615 |
| IRSGS-GIN | 0.564 | 0.584 | 0.618 | 0.673 |
| IRSGS-GCN | 0.567 | 0.590 | 0.623 | 0.672 |

Table 3: Image retrieval results on Flickr30K with machine-generated scene graphs.

Quantitative Results

As shown in Table 1, Table 2, and Table 3, IRSGS achieves larger nDCG scores than the baselines across datasets (VG-COCO and Flickr30K) and methods of obtaining scene graphs (human-annotated and machine-generated). IRSGS also achieves the best agreement with human annotators' perception of semantic similarity, as can be seen from Table 1 and Table 2.

Comparing Table 1 and Table 2, we found that using machine-generated scene graphs instead of human-annotated ones does not deteriorate the retrieval performance. This result shows that IRSGS does not need human-annotated scene graphs to perform successful retrieval and can be applied to a dataset without scene graph annotation. In fact, Flickr30K is the dataset without scene graph annotation, and IRSGS still achieves excellent retrieval performance in Flickr30K with machine-generated scene graphs.

On the other hand, using machine-generated captions for retrieval results in significantly lower nDCG scores and human agreement scores. Unlike human-annotated captions, machine-generated captions are crude in quality and tend to miss important details of an image. We suspect that scene graph generation is more stable than caption generation since it can be done in a systematic manner, i.e., predicting objects, attributes, and relations sequentially.

While not showing the optimal performance, GWL and GMN also show competitive performance over other methods based on generated captions and ResNet. This overall tendency of competence of graph-based method is interesting and implies the effectiveness of scene graphs in capturing semantic similarity between images.

Note that in Caption SBERT, retrieval is performed with the surrogate relevance itself, so its human agreement score indicates the agreement between surrogate relevance and human annotations. Its human agreement score is higher than that of any retrieval algorithm, which supports the claim that the proposed surrogate relevance reflects the human perception of semantic similarity well.

Qualitative Results

Figure 1 and Figure 3 show the example images retrieved from the retrieval methods we test. Pitfalls of baseline methods that are not based on scene graphs can be noted. As mentioned in Introduction, retrieval with ResNet features often neglects the semantics and focuses on the superficial visual characteristics of images. On the contrary, OC only accounts for the presence of objects, yielding images with misleading context. For example, in the left panel of Figure 3, OC simply returns images with many windows. IRSGS could retrieve images containing similar objects with similar relations to the query image, for example, an airplane on the ground, or a person riding a horse.


Ablation Study We also perform an ablation experiment for effectiveness of each scene graph component (Table 4). In this experiment, we ignore attributes or randomize relation information from IRSGS-GCN framework. In both cases, nDCG and Human agreement scores are higher than the Object Count that uses only object information. This indicates that both attributes and relation information are useful to improve the image retrieval performance of the graph matching-based algorithm. Further, randomizing relations drops performance more than ignoring attribute information, which means that relations are important for capturing the human perception of semantic similarity.

Comparison to Johnson et al. (2015) We exclude Johnson et al. (2015) from our experiments because their CRF-based algorithm is not feasible for large-scale image retrieval. One of our goals is to tackle a large-scale retrieval problem where a query is compared against more than ten thousand images. Thus, we mainly consider methods that generate a compact vector representation of an image or a scene graph (Eq. (2)). In contrast, the method of Johnson et al. (2015) requires object detection results to be stored additionally and extra computation to be performed for every query-candidate pair in the retrieval phase. Note that Johnson et al. (2015) tested their algorithm on only 1,000 test images, while we benchmark algorithms using 13,203 candidate images.

Effectiveness of Mean Pooling and Inner Product One possible explanation for the competitive performance of IRSGS-GCN and IRSGS-GIN is that mean pooling and the inner product are particularly effective in capturing similarity between two sets. Given two sets of node representations $\{a_1, \dots, a_N\}$ and $\{b_1, \dots, b_M\}$, the inner product of their means is given as $\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} a_i^\top b_j$, the sum of the inner products between all pairs, scaled by $1/(NM)$. This expression is proportional to the number of common elements in the two sets, especially when $a_i^\top b_j$ is 1 if $a_i = b_j$ and 0 otherwise, and thus measures the similarity between the two sets. If the inner product values are not binary, the expression measures the set similarity in a "soft" way.
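The identity behind this argument can be checked numerically. The following is a minimal sketch, not the paper's implementation: the array names, sizes, and the one-hot "labels" are hypothetical and chosen only for illustration. It verifies that the inner product of two mean-pooled sets equals the average of all pairwise inner products, and that in the binary case the score reduces to the number of matching pairs divided by N·M.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical sets of node representations (one row per node).
A = rng.normal(size=(4, 8))  # N = 4 nodes, dimension 8
B = rng.normal(size=(6, 8))  # M = 6 nodes, dimension 8

# Inner product of the mean-pooled vectors.
pooled = A.mean(axis=0) @ B.mean(axis=0)

# Average of the inner products over all N * M node pairs.
pairwise = (A @ B.T).mean()

# Binary case: one-hot node vectors, so the inner product of two nodes
# is 1 if they carry the same (hypothetical) label and 0 otherwise.
labels_a = np.array([0, 1, 2])
labels_b = np.array([1, 2, 2, 3])
onehot = np.eye(4)
A1, B1 = onehot[labels_a], onehot[labels_b]

matches = int((A1 @ B1.T).sum())           # number of matching pairs
score = A1.mean(axis=0) @ B1.mean(axis=0)  # equals matches / (N * M)
```

With learned, non-binary embeddings the same computation yields the "soft" set similarity described above.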

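| Method | nDCG@5 | nDCG@10 | nDCG@20 | nDCG@40 | Human Agreement |
|---|---|---|---|---|---|
| IRSGS-GCN | 0.771 | 0.784 | 0.805 | 0.836 | 0.611 |
| No Attribute | 0.767 | 0.782 | 0.803 | 0.834 | 0.606 |
| Random Relation | 0.764 | 0.777 | 0.797 | 0.828 | 0.604 |
| Object Count | 0.730 | 0.743 | 0.761 | 0.794 | 0.581 |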
Table 4: Scene graph component ablation experiment results on VG-COCO. Machine-generated scene graphs are used.


In this paper, we tackle the image retrieval problem for complex scene images where multiple objects are present in various contexts. We propose IRSGS, a novel image retrieval framework that leverages scene graph generation and a graph neural network to capture semantic similarity between complex images. IRSGS is trained to approximate a surrogate relevance measure, which we define as the similarity between captions. By collecting real human data, we show that both surrogate relevance and IRSGS agree well with the human perception of semantic similarity. Our results show that an effective image retrieval system can be built by using scene graphs with graph neural networks. As both scene graph generation and graph neural networks are rapidly advancing techniques, we believe that the proposed approach is a promising research direction to pursue.


Sangwoong Yoon is partly supported by the National Research Foundation of Korea Grant (NRF/MSIT2017R1E1A1A03070945) and MSIT-IITP (No. 2019-0-01367, BabyMind).


  • D. Alvarez-Melis and T. Jaakkola (2018) Gromov-wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1881–1890. Cited by: Graph Similarity Learning.
  • P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) Spice: semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398. Cited by: Comparison to SPICE.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: Scene Graphs and Their Generation, Scene Graph Generation Detail.
  • A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky (2014) Neural codes for image retrieval. In European conference on computer vision, pp. 584–599. Cited by: Image Retrieval.
  • S. Bai and X. Bai (2016) Sparse contextual activation for efficient visual re-ranking. IEEE Transactions on Image Processing 25 (3), pp. 1056–1069. Cited by: Two-Step Retrieval using Visual Features.
  • J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pp. 737–744. Cited by: Baseline Methods.
  • B. Chen, L. S. Davis, and S. Lim (2019) An analysis of object embeddings for image retrieval. arXiv preprint arXiv:1905.11903. Cited by: Image Retrieval.
  • X. Chen, L. Li, L. Fei-Fei, and A. Gupta (2018) Iterative visual reasoning beyond convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7239–7248. Cited by: Introduction.
  • A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 326–335. Cited by: Scene Graphs.
  • Y. Gong, Q. Ke, M. Isard, and S. Lazebnik (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. International journal of computer vision 106 (2), pp. 210–233. Cited by: Introduction.
  • A. Gordo, J. Almazán, J. Revaud, and D. Larlus (2016) Deep image retrieval: learning global representations for image search. In European conference on computer vision, pp. 241–257. Cited by: Introduction.
  • A. Gordo, J. Almazán, J. Revaud, and D. Larlus (2017) End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision 124 (2), pp. 237–254. Cited by: Image Retrieval.
  • A. Gordo and D. Larlus (2017) Beyond instance-level image retrieval: leveraging captions to learn a global visual representation for semantic retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6589–6598. Cited by: Introduction, Image Retrieval, Human Annotation Collection.
  • J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. CVPR. Cited by: Image Retrieval.
  • D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Scene Graphs.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: Computational Property, Two-Step Retrieval using Visual Features.
  • J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3668–3678. Cited by: Image Retrieval, Scene Graphs, Discussion.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: Graph Similarity Learning.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: Scene Graphs, Data.
  • Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang (2017) Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1261–1270. Cited by: Scene Graphs, Scene Graph Generation Detail, footnote 4.
  • Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli (2019) Graph matching networks for learning the similarity of graph structured objects. arXiv preprint arXiv:1904.12787. Cited by: Graph Similarity Learning.
  • Y. Liang, Y. Bai, W. Zhang, X. Qian, L. Zhu, and T. Mei (2019) VrR-vg: refocusing visually-relevant relationships. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10403–10412. Cited by: Scene Graphs.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: Data.
  • C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. In European Conference on Computer Vision, pp. 852–869. Cited by: Scene Graphs.
  • J. Lu, J. Yang, D. Batra, and D. Parikh (2018) Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7219–7228. Cited by: Scene Graphs.
  • H. P. Maretic, M. E. Gheche, G. Chierchia, and P. Frossard (2019) GOT: an optimal transport framework for graph comparison. arXiv preprint arXiv:1906.02085. Cited by: Graph Similarity Learning.
  • V. Milewski, M. Moens, and I. Calixto (2020) Are scene graphs good enough to improve image captioning?. arXiv preprint arXiv:2009.12313. Cited by: Scene Graphs.
  • E. Mohedano, K. McGuinness, N. E. O’Connor, A. Salvador, F. Marques, and X. Giro-i-Nieto (2016) Bags of local convolutional features for scalable instance search. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pp. 327–331. Cited by: Introduction.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: Scene Graphs and Their Generation.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2017) Flickr30K entities: collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV 123 (1), pp. 74–93. Cited by: Data, Baseline Methods.
  • F. Radenović, G. Tolias, and O. Chum (2016) CNN image retrieval learns from bow: unsupervised fine-tuning with hard examples. In European conference on computer vision, pp. 3–20. Cited by: Introduction.
  • N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. Cited by: Introduction, Learning to Predict Surrogate Relevance.
  • J. Solomon, G. Peyré, V. G. Kim, and S. Sra (2016) Entropic metric alignment for correspondence problems. ACM Transactions on Graphics (TOG) 35 (4), pp. 72. Cited by: Graph Similarity Learning.
  • D. Teney, L. Liu, and A. van den Hengel (2017) Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: Scene Graphs.
  • V. Titouan, N. Courty, R. Tavenard, C. Laetitia, and R. Flamary (2019) Optimal transport for structured data with application on graphs. In International Conference on Machine Learning, pp. 6275–6284. Cited by: Graph Similarity Learning.
  • N. Vo, L. Jiang, C. Sun, K. Murphy, L. Li, L. Fei-Fei, and J. Hays (2019) Composing text and image for image retrieval-an empirical odyssey. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6439–6448. Cited by: Image Retrieval.
  • L. Wang, X. Qian, Y. Zhang, J. Shen, and X. Cao (2019) Enhancing sketch-based image retrieval by cnn semantic re-ranking. IEEE transactions on cybernetics. Cited by: Two-Step Retrieval using Visual Features.
  • Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan (2016) Cross-modal retrieval with cnn visual features: a new baseline. IEEE transactions on cybernetics 47 (2), pp. 449–460. Cited by: Image Retrieval.
  • Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel (2017) Image captioning and visual question answering based on attributes and external knowledge. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1367–1381. Cited by: Scene Graphs.
  • Y. Xie, X. Wang, R. Wang, and H. Zha (2018) A fast proximal point method for computing exact wasserstein distance. arXiv preprint arXiv:1802.04307. Cited by: Baseline Methods.
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5419. Cited by: Scene Graphs, Data, Scene Graph Generation Detail, footnote 4.
  • H. Xu, D. Luo, and L. Carin (2019a) Scalable gromov-wasserstein learning for graph partitioning and matching. arXiv preprint arXiv:1905.07645. Cited by: Graph Similarity Learning.
  • H. Xu, D. Luo, H. Zha, and L. Carin (2019b) Gromov-wasserstein learning for graph matching and node embedding. arXiv preprint arXiv:1901.06003. Cited by: Graph Similarity Learning, Baseline Methods.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: Baseline Methods.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: Graph Similarity Learning.
  • J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh (2018) Graph r-cnn for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 670–685. Cited by: Scene Graphs, Scene Graph Generation Detail, footnote 4.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: Introduction.
  • L. Zhen, P. Hu, X. Wang, and D. Peng (2019) Deep supervised cross-modal retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Image Retrieval.
  • L. Zheng, Y. Yang, and Q. Tian (2017) SIFT meets cnn: a decade survey of instance retrieval. IEEE transactions on pattern analysis and machine intelligence 40 (5), pp. 1224–1244. Cited by: Introduction, Image Retrieval.


Computational Property

IRSGS is scalable in terms of both computing time and memory, adding only marginal overhead to a conventional image retrieval system. For candidate images in a database, their graph embeddings and ResNet features are pre-computed and stored. Generating a scene graph for a query image relies mainly on object detection, which can run almost in real time. Searching over the database is essentially a nearest neighbor search, which is fast for a small number of images (fewer than 100,000) and can be accelerated for a larger database with an approximate nearest neighbor search engine such as Faiss (Johnson et al., 2017). In contrast, algorithms that use explicit graph matching, such as GWL and GMN, are significantly less scalable than IRSGS because their representation vectors cannot be pre-computed. Given generated scene graphs, processing a pair of images takes approximately 15 seconds for GWL and 0.002 seconds for GMN. When retrieving from a database of 10,000 images, 0.002 seconds per pair amounts to 20 seconds per query, which is impractical for a retrieval system. IRSGS, on the other hand, takes less than 0.001 seconds per pair of images when the graph embeddings are not pre-computed, and is more than 10 times faster when the embeddings are pre-computed and only the inner products with the query are computed.
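To make the cost argument concrete, here is a minimal sketch of exact inner-product search over pre-computed embeddings, together with the per-query arithmetic for a pairwise method. The function name `top_k_by_inner_product`, the embedding dimension, and the random data are assumptions for illustration only; an approximate nearest neighbor library would replace the exhaustive matrix-vector product for larger databases.

```python
import numpy as np

def top_k_by_inner_product(query_emb, db_embs, k):
    """Indices of the k database rows with the largest inner product
    to the query, ordered from best to worst."""
    scores = db_embs @ query_emb           # one matrix-vector product
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
    return top[np.argsort(-scores[top])]       # sort only the k winners

# Hypothetical database of pre-computed, unit-normalized graph embeddings.
rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 64))
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = db[7]  # reuse a database item as the query for illustration
top = top_k_by_inner_product(query, db, k=5)

# Per-query cost of a method that must process every query-candidate
# pair at retrieval time, using the per-pair figure quoted in the text.
seconds_per_pair = 0.002
seconds_per_query = seconds_per_pair * db.shape[0]  # about 20 seconds
```

Because the database embeddings are fixed, only the query embedding has to be computed at retrieval time; the rest is a single matrix-vector product followed by a partial sort.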

Two-Stage Retrieval

The initial retrieval using ResNet is beneficial in two respects: retrieval quality and speed. ResNet-based retrieval does introduce a bias, but a beneficial one: the ResNet-based stage increases human agreement for all retrieval methods, possibly by excluding visually irrelevant images. Some baselines, such as graph matching networks, are not computationally feasible without the initial retrieval. IRSGS, however, is computationally feasible without ResNet-based retrieval because the representations of images can be pre-computed and indexed. We empirically found that k=100 offers a good trade-off between computational cost and performance.
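The two-stage procedure can be sketched as follows. This is an illustrative outline under stated assumptions rather than the authors' code: the function name `two_stage_retrieval` and the toy arrays are hypothetical, and scoring the ResNet stage by inner product is an assumption made here for simplicity.

```python
import numpy as np

def two_stage_retrieval(q_resnet, q_graph, db_resnet, db_graph,
                        k=100, n_results=10):
    """Shortlist k candidates by visual similarity, then re-rank the
    shortlist by graph-embedding similarity."""
    # Stage 1: keep the k candidates most similar in ResNet feature space
    # (inner product is an assumed similarity for this sketch).
    visual_scores = db_resnet @ q_resnet
    k = min(k, visual_scores.shape[0])
    shortlist = np.argpartition(-visual_scores, k - 1)[:k]

    # Stage 2: re-rank only the shortlist by graph-embedding inner product.
    graph_scores = db_graph[shortlist] @ q_graph
    order = np.argsort(-graph_scores)[:n_results]
    return shortlist[order]

# Toy data: six candidates with 2-d "ResNet" and 1-d "graph" features.
db_resnet = np.array([[1.0, 0.0], [0.9, 0.0], [0.8, 0.0],
                      [0.7, 0.0], [0.1, 0.0], [0.0, 0.0]])
db_graph = np.array([[0.1], [0.2], [0.9], [0.3], [0.5], [1.0]])

result = two_stage_retrieval(np.array([1.0, 0.0]), np.array([1.0]),
                             db_resnet, db_graph, k=4, n_results=2)
```

In this toy example, candidate 5 has the highest graph similarity but is visually dissimilar, so the first stage removes it before re-ranking. This is the kind of bias the paragraph above describes.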

Comparison to SPICE

We initially excluded SPICE (Anderson et al., 2016) from the experiments, not because of its computational properties but because of the exact matching mechanism on which it is based. By definition, SPICE treats two semantically similar yet distinct words as different. In contrast, IRSGS is able to match similar words because it uses continuous word embeddings. Still, SPICE can be an interesting baseline, and we will consider adding it for comparison.

Full Resolution Figures

Here, we provide figures presented in the main manuscript in their full scale.

Figure 4: Image retrieval examples from ResNet and IRSGS. ResNet retrieves images with superficial similarity, e.g., grayscale or vertical lines, while IRSGS successfully returns images with correct context, such as playing tennis or skateboarding.
Figure 5: An overview of IRSGS. Images are converted into vector representations through scene graph generation (SGG) and graph embedding. The graph embedding function is learned to minimize mean squared error to surrogate relevance, i.e., the similarity between captions. The bold red bidirectional arrows indicate trainable parts. For retrieval, the learned scene graph similarity function is used to rank relevant images.
Figure 6: The four most similar images retrieved by six algorithms. OC: Object Count, GIN: IRSGS-GIN, GCN: IRSGS-GCN. The Visual Genome IDs of the query images are 2323522 and 2316427.