Vision-and-language refers to a range of tasks that bridge vision and natural language, e.g. automatically describing visual content with text. Early neural approaches to vision-and-language [22, 46, 35, 49, 10] often encode visual information with pre-trained classification networks such as VGG-Net and ResNet. Recently, Anderson et al. demonstrated that instance-level image understanding can provide valuable prior knowledge that helps vision-and-language models focus on salient objects and stuff. This “bottom-up attention” approach (so called because the attention on salient objects and stuff comes bottom-up from perceptual priors instead of from textual context) has proven very successful across various tasks, including visual question answering, caption generation, image-text matching, and text-to-image synthesis [2, 25, 27].
Despite this progress, vision-and-language remains challenging, partly because the interplay between objects and stuff is not taken into account. Unlike attributes and actions (of single objects), which may be inferred from individual object/region features, visual relations are not considered at all in many state-of-the-art models (e.g. the top-down captioner and the SCAN model for image-text matching). To illustrate the crux of the matter, Fig. 1 presents two examples where “a baseball player swinging a bat” and “a baseball player holding a bat” are captions of two different images. An image-text matching model that does not consider visual relations (“holding” vs. “swinging”) could score both captions equally well for both images and thus fail to align the captions to their corresponding images. The same issue arises in models for other applications such as caption generation and visual question answering.
Detecting visual relations between objects and stuff is an emerging research problem that has drawn significant attention recently [50, 54, 6, 29, 34, 40, 48, 28, 30, 36, 39, 55, 56, 58] after the release of several large-scale visual-relation-labeled datasets such as Visual Genome and HICO. In particular, researchers have developed scene graph generators by combining region and relationship detection models. Given an image, scene graph generators predict relation triplets. We hypothesize that training neural scene graph generators for relation detection, by necessity, also learns embedding features with rich semantics of visual relations. If this holds, these embedding features could provide prior knowledge of visual relations for various vision-and-language applications. To empirically verify this hypothesis, we incorporate scene graph generator features into state-of-the-art models for image-text matching and image caption generation. For image-text matching, we propose a new relation-based Stacked Cross Attention Network (R-SCAN) based on SCAN. R-SCAN additionally encodes visual relations and employs a gating mechanism to adaptively select region and relation features. Similarly, we extend the top-down captioner with an additional attention component for relation features.
Previous scene graph generators are usually trained and evaluated on Visual Genome splits that consist of the most common visual relationships (e.g. the VG150 dataset). However, such datasets are problematic in that they mainly contain common relations whose corresponding predicates can be easily detected by statistical counting based on textual context, without truly understanding visual relations [54, 31]. For example, such models would predict that the relation between “a baseball player” and “a bat” is most likely “swing” rather than “throw”, because “swing” co-occurs more often with “baseball player” and “bat” in the data. In other words, scene graph generators learn to take the easy way out on such datasets. In light of this, Liang et al., in work parallel to ours, created the VrR-VG dataset, which contains much richer categories of relations that cannot be easily detected solely by statistical counting. We therefore resort to VrR-VG for training scene graph generators.
The experimental results show that R-SCAN significantly improves bi-directional retrieval metrics over SCAN, the current state of the art (e.g. it relatively improves image-retrieval recall@1 on Flickr30K by 12.2%). Our relation-based top-down captioner also improves the CIDEr score from 113.5 to 114.9 and the SPICE score from 20.3 to 20.9 on MSCOCO.
A major difference between this work and recent works [52, 14, 51] that also explore visual relations for vision-and-language is that we do NOT use graph convolution networks (GCN). For example, Yao et al. drew connections between regions with scene graph generators or proximity-based heuristics (with no semantic information attached to these connections) and built complicated graph convolutions to implicitly capture visual relations from caption data. We argue that directly transferring knowledge of visual relations from scene graph generators, instead of discarding it, is a much simpler and equally effective alternative to GCN. We show that our approach is generic and applicable to both metric learning (image-text matching) and sequence prediction (image caption generation) tasks.
2 Related Work
Image-Text Matching. The goal of image-text matching is to learn a similarity between images and text descriptions; it is usually evaluated on bi-directional image and text retrieval tasks. There has been an extensive line of work addressing image-text matching with neural networks [22, 45, 3, 47, 23, 26, 57, 10, 38, 12, 9, 7, 11, 19, 37, 16, 15, 35, 25]. In particular, the R-SCAN model proposed in this paper is built on the Stacked Cross Attention Network (SCAN), which uses a two-stage attention mechanism to discover fine-grained correspondence between objects/stuff and words. R-SCAN additionally encodes visual relations and employs a gating mechanism to select between region and relation features.
Image Captioning. Image captioning refers to automatic image description generation and has been widely studied over the years [46, 11, 49, 41, 33]. Recently, Anderson et al. proposed “bottom-up attention”, which extracts and encodes salient object and stuff regions bottom-up from the perceptual priors that region detectors (e.g. Faster R-CNN) learn through pre-training. Bottom-up attention dramatically improves various vision-and-language tasks including image captioning. We extend the top-down captioner proposed by Anderson et al., adding relation features from scene graph generators alongside region features from bottom-up attention to help the captioning model capture visual relations.
Scene Graph Generation. The scene graph generation task has recently attracted significant interest from the vision community [42, 8, 50, 54, 6, 29, 34, 40, 48, 28, 30, 36, 39, 55, 56, 58]. In most of these works, visual relationships are treated as edges between two objects in the scene graph, and many previously proposed approaches use context propagation mechanisms. Xu et al. presented an iterative message-passing framework to predict objects and their relationships jointly using two separate networks, one for edges and one for nodes. In , Zellers et al. designed the Stacked Motif Network to capture higher-order substructures in scene graphs. The Stacked Motif Network encodes each relation triplet into an embedding vector, which we use in this work as a source of visual relation priors for downstream applications.
Scene graph generators are usually trained and evaluated on Visual Genome splits dominated by the most common visual relationships (e.g. the VG150 dataset). It is pointed out in [54, 31] that such relation data leads scene graph generators to fit to statistical counting based on textual context instead of truly understanding visual relations. This finding implies that existing scene graph generation benchmarks are potentially not ideal.
Bringing Visual Relations to Vision-and-Language. Johnson et al. proposed a framework using ground-truth scene graphs as queries for image retrieval, but in practice it is still very difficult to construct accurate scene graphs from either text or images. Several recent works proposed GCN-based models for image captioning and visual question answering [14, 51, 52]. For example, Yao et al. designed a GCN-based captioning model that employs either spatial-proximity heuristics or scene graph generators to propose possible connections between objects (with no semantic information attached to these connections). These approaches discard the semantic labels and representations of visual relations produced by scene graph generators and instead implicitly infer relationships from caption data. Directly utilizing scene graph generator features avoids expensive graph convolution, and we argue it is still effective in capturing visual relations. Separately, Liang et al. show that an additional visual relation prediction objective enriches region features and improves downstream image captioning and visual question answering models, but their proposal does not actually model visual relations and thus lacks explainability.
Sec. 3.1 describes the R-SCAN model, which leverages visual relation features from scene graph generators to improve image-text matching. Sec. 3.2 presents the proposed relation-based top-down captioner, which extends the top-down captioner of Anderson et al. Sec. 3.3 explains how we pre-train scene graph generators to learn effective relation features.
3.1 R-SCAN for Image-Text Matching
The architecture of R-SCAN is presented in Fig. 2. R-SCAN consists of three components: (1) a text encoder, (2) a visual encoder for image region and visual relation features, and (3) an attention module that aligns image regions and visual relations to words and computes the image-text similarity.
Text encoder. The text encoder is identical to SCAN. It takes as input a sequence of $n$ words, each represented as a one-hot vector, and maps each word into a 300-dimensional vector as
$$x_j = W_e w_j, \quad j \in \{1, \dots, n\},$$
where $W_e$ is a randomly initialized embedding matrix and $w_j$ is the one-hot representation of the $j$-th word. We then use a bi-directional GRU to generate for each word a contextual embedding vector by infusing contextual information from both sides of the word in the text. The bi-directional GRU contains a forward GRU, which reads the word sequence from left to right to produce the hidden states
$$\overrightarrow{h}_j = \overrightarrow{\mathrm{GRU}}(x_j), \quad j \in \{1, \dots, n\},$$
and similarly a backward GRU, which reads from right to left to produce the hidden states $\overleftarrow{h}_j$. The contextual embedding vector $e_j$ of word $j$ is obtained by averaging the forward hidden state $\overrightarrow{h}_j$ and the backward hidden state $\overleftarrow{h}_j$:
$$e_j = \frac{\overrightarrow{h}_j + \overleftarrow{h}_j}{2}, \quad j \in \{1, \dots, n\}.$$
Visual encoder. We use a pre-trained Faster R-CNN (identical to SCAN) to extract object and stuff representations, denoted as $\{o_1, \dots, o_k\}$, where $k$ is the number of regions detected in an image. In addition, we use a pre-trained Stacked Motif Network (a scene graph generator proposed by Zellers et al.) to extract visual relation representations, denoted as $\{r'_1, \dots, r'_m\}$, where $m$ is the number of visual relations detected in an image. $\{o_i\}$ and $\{r'_l\}$ are subsequently transformed into $h$-dimensional vectors $\{v_i\}$ and $\{r_l\}$ via learned linear projections.
Attention module for similarity inference. The attention module generalizes the previously proposed SCAN t-i model: it softly aligns representations of regions and relations in the image with words in the text, and infers the similarity between image and text. (Attention on visual representations, of regions or relations, w.r.t. a word fuses the visual representations differentially based on their relevance to the word; this can be considered a soft alignment of relevant visual representations w.r.t. the word. We further introduce a “visual feature fusion gate” conditioned on the word to fuse the attended region and relation representations differentially; this can be considered a soft decision of whether to align a region or a relation to the word.)
Given region features $\{v_i\}$, relation features $\{r_l\}$ and word features $\{e_j\}$, the attention weights $\alpha_{ij}$ (over regions) and $\beta_{lj}$ (over relations) are computed as
$$\alpha_{ij} = \frac{\exp(\lambda s_{ij})}{\sum_{i'=1}^{k} \exp(\lambda s_{i'j})}, \qquad \beta_{lj} = \frac{\exp(\lambda s'_{lj})}{\sum_{l'=1}^{m} \exp(\lambda s'_{l'j})},$$
where $\lambda$ is the softmax temperature. The similarity $s_{ij}$ between the $i$-th region and the $j$-th word (and analogously $s'_{lj}$ for the $l$-th relation) is computed as the cosine similarity
$$s_{ij} = \frac{v_i^{\top} e_j}{\|v_i\| \, \|e_j\|}.$$
Given the attention weights $\beta_{lj}$, the attended relation representation of the image w.r.t. word $e_j$ is defined as
$$a_j^{r} = \sum_{l=1}^{m} \beta_{lj} \, r_l,$$
where $a_j^{r}$ can be viewed as a summarized relation vector of the image, generated by a fusing process in which all relation vectors are weighted by their attention weights w.r.t. $e_j$ and aggregated. $a_j^{r}$ can also be considered as representing the soft alignment between $e_j$ and the relations in the image; a special case of this soft alignment is hard alignment, where only one relation has a non-zero attention weight w.r.t. $e_j$.
Similarly, given the attention weights $\alpha_{ij}$, the attended region representation of the image w.r.t. word $e_j$ is defined as
$$a_j^{v} = \sum_{i=1}^{k} \alpha_{ij} \, v_i.$$
We now define the attended representation of the image w.r.t. word $e_j$, denoted as $a_j$, which combines $a_j^{v}$ and $a_j^{r}$. Considering that entities/nouns often attend to objects and stuff in an image while predicates attend to relations, we introduce a visual feature fusion gate that conditions on each word (type) to fuse $a_j^{v}$ and $a_j^{r}$ using a mixture model:
$$a_j = g_j \, a_j^{v} + (1 - g_j) \, a_j^{r}, \qquad g_j = \sigma(p^{\top} e_j + b),$$
where $\sigma$ is the sigmoid function, $b$ is a bias and $p$ is a trainable projection vector with the same dimension as $e_j$.
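To make the mechanism concrete, the word-conditioned attention and the visual feature fusion gate can be sketched in a few lines of NumPy. This is a minimal sketch with our own function names; in the actual model the gate parameters $p$ and $b$ are trained end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(feats, word, lam=9.0):
    """Softly align a set of visual features (regions or relations) to one
    word embedding via temperature-scaled cosine-similarity attention."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    w = word / np.linalg.norm(word)
    alpha = softmax(lam * (f @ w))   # attention weights over visual features
    return alpha @ feats             # attended visual representation

def fuse(v_att, r_att, word, p, b=0.0):
    """Visual feature fusion gate: a sigmoid gate conditioned on the word
    mixes the attended region (v_att) and relation (r_att) vectors."""
    g = 1.0 / (1.0 + np.exp(-(p @ word + b)))
    return g * v_att + (1.0 - g) * r_att
```

The same `attend` routine produces both $a_j^{v}$ (from region features) and $a_j^{r}$ (from relation features); `fuse` then implements the mixture model above.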
The similarity between the whole image and the $j$-th word is computed as
$$R(e_j, a_j) = \frac{a_j^{\top} e_j}{\|a_j\| \, \|e_j\|}.$$
Finally, we need to compute the similarity between the image and the text, which consists of a set of words. In SCAN t-i, this is achieved by averaging or LogSumExp pooling the word-image similarity over all words in the text. In R-SCAN, we instead assign each word an importance weight using a machine-learned importance gate, similar to the visual feature fusion gate:
$$u_j = \sigma(q^{\top} e_j + c),$$
where $q$ is a trainable projection vector and $c$ is a bias. The similarity between image $I$ and text $T$ is defined as the sum of the norms of the importance-weighted word-image similarities over all words in the text:
$$S(I, T) = \sum_{j=1}^{n} \left\| u_j \, R(e_j, a_j) \right\|.$$
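This importance-gated pooling admits a very short sketch (variable names are ours; `q` and the bias `c` are learned parameters in the actual model):

```python
import numpy as np

def pooled_similarity(word_sims, word_embs, q, c=0.0):
    """Pool per-word word-image similarities into one image-text score.
    word_sims: (n,) similarities R(e_j, a_j); word_embs: (n, d); q: (d,)."""
    gates = 1.0 / (1.0 + np.exp(-(word_embs @ q + c)))  # importance per word
    # sum of the norms of the gated scalar similarities
    return float(np.sum(np.abs(gates * word_sims)))
```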
Learning objective. Following [25, 10], we use a hinge-based triplet ranking loss and focus on the hardest negatives in a mini-batch. For a positive pair $(I, T)$, we generate two negative pairs by picking the hardest mismatched image $\hat{I}$ and the hardest mismatched text $\hat{T}$, respectively. The loss function is defined as
$$\mathcal{L}(I, T) = \max\!\big(0, \, \gamma - S(I, T) + S(I, \hat{T})\big) + \max\!\big(0, \, \gamma - S(I, T) + S(\hat{I}, T)\big),$$
where $\hat{T} = \arg\max_{T' \neq T} S(I, T')$, $\hat{I} = \arg\max_{I' \neq I} S(I', T)$, and $\gamma$ is the margin.
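The hardest-negative mining over a mini-batch can be sketched as follows, operating on a precomputed similarity matrix (a simplified NumPy version; the real model backpropagates through the similarities):

```python
import numpy as np

def hardest_negative_triplet_loss(S, margin=0.2):
    """Hinge-based triplet loss with hardest in-batch negatives.
    S[i, j] = similarity(image_i, text_j); the diagonal holds positives."""
    n = S.shape[0]
    pos = np.diag(S)
    neg = np.where(np.eye(n, dtype=bool), -np.inf, S)  # mask positives
    hardest_text = neg.max(axis=1)   # hardest mismatched text per image
    hardest_img = neg.max(axis=0)    # hardest mismatched image per text
    loss = np.maximum(0.0, margin + hardest_text - pos).sum() \
         + np.maximum(0.0, margin + hardest_img - pos).sum()
    return float(loss)
```

Note the margin value here (0.2) is only a placeholder default, not necessarily the value used in the paper.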
3.2 Relation-based Top-Down Captioner
For image captioning, we propose a simple extension of the top-down captioner, adding relation features from Stacked Motif Networks as shown in Fig. 3. Most parts of the model definition are identical to the original, as described in Sec. 3.2 and Fig. 3 of that work. To include relation features, we change the input vector to the attention LSTM at each time step to the concatenation of the mean-pooled relation feature $\bar{r}$, the mean-pooled image region feature $\bar{v}$, the previous output of the language LSTM, and an encoding of the previously generated word. The attended relation feature is obtained in the same way as the attended image feature, and is then concatenated to the input of the language LSTM, in addition to the attended image feature and the output of the attention LSTM. The learning objectives for cross-entropy training and subsequent self-critical CIDEr optimization are identical to the top-down captioner. We also employ the same Faster R-CNN model for bottom-up attention.
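The modified per-step input to the attention LSTM is just a concatenation, which can be illustrated as follows (a sketch with assumed feature shapes; the variable names are ours, not from the original implementation):

```python
import numpy as np

def attention_lstm_input(rel_feats, reg_feats, h_lang_prev, word_emb_prev):
    """Build the attention-LSTM input: mean-pooled relation features,
    mean-pooled region features, the previous language-LSTM output,
    and the previous word embedding, concatenated into one vector."""
    r_mean = rel_feats.mean(axis=0)   # mean-pooled relation feature
    v_mean = reg_feats.mean(axis=0)   # mean-pooled region feature
    return np.concatenate([r_mean, v_mean, h_lang_prev, word_emb_prev])
```

Relative to the original top-down captioner, the only change at this step is the extra `r_mean` term.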
3.3 Scene Graph Generator as Feature Learner
In our framework, relation features are extracted from neural scene graph generators. Specifically, we use the Stacked Motif Network as the default scene graph generator in all experiments. Stacked Motif Networks predict graph elements by staging bounding box prediction, object classification and relationship prediction, such that the global context encoding of all previous stages establishes rich context for predicting subsequent stages. We take the 4096-d relation representation before the final projection and softmax to represent relation triplets (see Sec. 4.3 of ).
Previous scene graph generators are usually trained and evaluated on Visual Genome splits consisting of the most frequent visual relationships in Visual Genome. VG150 is one of the most widely used benchmarks, but this data can be problematic: it consists of the 50 most frequent relation predicates and 150 most frequent object categories in Visual Genome, and these common relation predicates can often be detected by statistical counting without any understanding of the images [54, 31]. As a result, although scene graph generators developed on VG150 are often reported to perform well on the VG150 test set, the high performance does not translate into visible gains in end applications such as captioning and visual question answering, as reported in Sec. 4 and in prior work. Other similar benchmarks suffer from the same problem.
In light of this, we resort to training Stacked Motif Networks on the VrR-VG dataset. VrR-VG was created by choosing a subset of Visual Genome and removing the predicates that can be easily predicted solely with language models. We carefully avoid training data contamination by excluding from our training split any images that are in the MSCOCO validation and test sets (created by Karpathy et al.).
In this work, we do not draw conclusions connecting visual relation detection results to numbers on end applications, although we did find that features from  lead to better results than those from , whose results on VG150 are inferior. First, as mentioned in Sec. 2, it has been found that VG150 and similar benchmarks might not be ideal for evaluating visual relations [54, 31]. Second, the common mAP metric becomes problematic on datasets like VrR-VG that contain a large number of object and/or relation classes: most classes lie in the tail of the distribution and can barely be predicted accurately, yet mAP averages per-class precision.
We would also like to clarify that using additional relation features does not mean including additional training images. We pre-train Stacked Motif Networks on VrR-VG or VG150, which are subsets of Visual Genome, while the baseline methods SCAN and the top-down captioner also use Visual Genome for pre-training their bottom-up attention Faster R-CNN models.
Datasets. We evaluate R-SCAN on the MSCOCO and Flickr30K datasets. The relation-based top-down captioner is evaluated only on MSCOCO, following prior work. Flickr30K contains 31,000 images collected from Flickr, each with five captions. Following the splits of [19, 10], we use 1,000 images for validation, 1,000 images for testing and the rest for training. MSCOCO contains 123,287 images, each annotated with five text descriptions. In , the dataset is split into 82,783 training images, 5,000 validation images and 5,000 test images. We follow [10, 25] and add to the training set the 30,504 images that were originally in the MSCOCO validation set but were left out of this split. Results are reported either on the full 5K test images or averaged over 5 folds of 1K test images. As is common in information retrieval, we measure the performance of sentence retrieval (image query) and image retrieval (sentence query) by recall at K (R@K), defined as the fraction of queries for which the correct item is retrieved among the K points closest to the query. Also following prior work, we evaluate captioning with the CIDEr score, which captures syntactic correctness, and the SPICE score, which reflects whether our models generate correct descriptions of the scene.
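The R@K metric used throughout can be computed directly from the rank of each query's ground-truth item, e.g. as in this minimal sketch:

```python
def recall_at_k(ranks, k):
    """ranks[q] is the 0-based rank of the correct item for query q;
    R@K is the fraction of queries whose correct item is in the top K."""
    return sum(1 for r in ranks if r < k) / len(ranks)
```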
Implementation details. As mentioned in Sec. 3.3, we use Stacked Motif Networks to learn relation features. The top relation features are chosen based on the triplet confidence score following . We fix the number of relation features to 36 in the following experiments; other choices yield similar performance in most cases, but increasing the number to 72 degrades performance due to noisy information. The Stacked Motif Networks model pre-trained on VG150 that we used in experiments is publicly available at https://github.com/rowanz/neural-motifs. We train Stacked Motif Networks on VrR-VG and match the results reported in  for object detection, scene graph classification and scene graph detection. The object detector for Stacked Motif Networks is Faster R-CNN with a VGG backbone. We also experimented with a ResNet-101 backbone but did not observe a difference in performance.
To detect and encode image regions, we adopt the same Faster R-CNN model from  as our bottom-up attention model. The top 36 regions are selected per image following the same criterion as in [2, 25]. Although region features from Stacked Motif Networks could also be used, we choose the bottom-up attention model for fair comparison with SCAN and the top-down captioner.
For R-SCAN, the softmax temperature and other hyperparameters are selected on the validation set. We use the Adam optimizer to train the models. R-SCAN models are trained with a learning rate of 0.0005 for 10 epochs and then 0.00005 for another 10 epochs, following SCAN. For captioning, we follow the training and evaluation configurations of .
4.1 The Effectiveness of Visual Relations
In the following analysis, we investigate the quality of image-text matching specifically on captions that describe visual relations (and the corresponding images), and compare Stacked Motif Network features pre-trained on VrR-VG and on VG150. The motivation is to focus only on relation-relevant predicates so as to best quantify the improvements coming from relation features. Zellers et al. analyzed the Visual Genome dataset and concluded that the predominant relations are geometric (above, behind, under) and possessive (has, part of, wearing). Such relations are often obvious, e.g., houses tend to have windows. The VrR-VG dataset rules out relations that can be easily predicted with a language prior, and clusters the remaining high-frequency relations into 117 predicates based on semantic similarity. By comparing VrR-VG with the original Visual Genome, those 117 relations can be mapped back to 259 relation predicates in Visual Genome, 164 of which we identify as semantic relations (leaning on, walking towards, jumping on): relations that correspond to activities and are less frequent and less obvious (the definition of semantic visual relations can be found in ). We found 3,403 images in the MSCOCO 5K test set with at least one ground-truth caption containing one of the 164 semantic predicates. We use those images and randomly sample one corresponding relation-describing caption per image to construct a new COCO caption test split with visually relevant relations (COCO-test-VrR), which allows us to focus on improvements in image-text matching that involve visual relations.
|Table 1 fragment (COCO-test-VrR)||Img R@1||Img R@5||Img R@10||Txt R@1||Txt R@5||Txt R@10|
|SCAN t-i AVG||37.9||69.4||80.8||38.5||70.7||82.5|

|Table 2 fragment||Flickr30K 1K Test Images (Img R@1/R@5/R@10, Txt R@1/R@5/R@10)||MSCOCO 5-fold 1K Test Images (Img R@1/R@5/R@10, Txt R@1/R@5/R@10)|
|SCAN ensemble ||48.6||77.7||85.2||67.4||90.3||95.8||58.8||88.4||94.8||72.7||94.8||98.4|
|SCAN i-t AVG ||44.0||74.2||82.6||67.7||88.9||94.0||54.4||86.0||93.6||69.2||93.2||97.5|
|SCAN t-i AVG ||45.8||74.4||83.0||61.8||87.5||93.7||56.4||87.0||93.9||70.9||94.5||97.8|

|Table 3 fragment (MSCOCO 5K Test Images)||Img R@1||Img R@5||Img R@10||Txt R@1||Txt R@5||Txt R@10|
|SCAN ens ||38.6||69.3||80.4||50.4||82.2||90.0|
|SCAN t-i AVG ||34.4||63.7||75.7||46.4||77.4||87.2|
In Table 1, we report the results of the baseline SCAN t-i AVG model and of R-SCAN trained on MSCOCO and evaluated on COCO-test-VrR. We consider R-SCAN models with relation features pre-trained on VG150 and on VrR-VG. Compared to the SCAN t-i baseline, the improvement in bi-directional retrieval with R-SCAN-VG150 is limited. Pre-training on VrR-VG (R-SCAN-VrRVG), on the other hand, leads to significant improvements. Our hypothesis is that VG150 mainly contains relations whose predicates can be easily predicted by statistical counting and thus does not require genuine visual understanding, while VrR-VG preserves semantically valuable relations that cannot be inferred from counting alone; learning on VrR-VG therefore requires forming features that truly embed visually relevant information. Based on this finding, we use VrR-VG to train relation features for all the following experiments. In Fig. 4, we present qualitative examples of bi-directional image-text retrieval using R-SCAN and the baseline SCAN t-i AVG model.
4.2 Cross-Modal Retrieval Results
In Table 2, we compare R-SCAN with the baseline SCAN t-i AVG model as well as other state-of-the-art methods on Flickr30K and MSCOCO (tested on the 5-fold 1K test set). On Flickr30K, R-SCAN achieves the best single-model image retrieval, with recall@1 at 51.4; compared to SCAN t-i AVG (45.8), the relative improvement is 12.2%. R-SCAN even outperforms the SCAN ensemble on Flickr30K. On the MSCOCO 5-fold test set, R-SCAN achieves the best single-model image retrieval recall@1 at 57.6. Table 3 presents the results on the full MSCOCO 5K test set, where R-SCAN achieves better performance than all previous single models on most cross-modal retrieval metrics.
It can be observed that the relative improvement on image retrieval is more significant than on text retrieval. We hypothesize that the underlying causes are the composition of the Flickr30K and MSCOCO test sets and the definition of the recall@K metric: as opposed to image retrieval, where only one ground-truth image exists for each text query, in text retrieval each image query corresponds to five ground-truth captions, any of which counts as a correctly retrieved item. However, not all five captions describe visual relations in the image. For example, “a male in a blue shirt and a laptop and couch” and “a man is sitting on a couch with a dog using a laptop” are both ground-truth captions for an image in MSCOCO, but the former can be retrieved without understanding the semantic visual relations between the man and other major objects (e.g. sitting on a couch). In the MSCOCO 5K test set, 68.0% of the images correspond to at least one caption that has one of the 164 COCO-test-VrR predicates, yet only 2.8% of the images have five such captions (although relations could be rephrased in other captions with predicates outside the COCO-test-VrR set). Another observation supporting our hypothesis is that R-SCAN shows similar improvements on image and text retrieval on COCO-test-VrR (Table 1), where there is only one ground-truth caption per image.
4.3 Image Captioning on MSCOCO
Table 4 shows the image captioning results of the proposed relation-based top-down captioner and the baseline top-down captioner on MSCOCO. We report results for optimizing the model with the cross-entropy loss and for subsequent policy-gradient fine-tuning using CIDEr scores as rewards, following . Compared with the top-down captioner, our cross-entropy model improves the CIDEr score from 113.5 to 114.9 and the SPICE score from 20.3 to 20.9. We also present the results reported by Yao et al. (GCN-LSTM), which exploits a complicated GCN.
Interestingly, we found that fine-tuning the original top-down captioner with self-critical CIDEr optimization for 120 epochs (several days on a single GPU), rather than for less than one hour as reported in the original paper, can significantly boost CIDEr from 120.1 to 126.9 and SPICE from 21.4 to 21.8 (on-policy reinforcement learning algorithms can take many epochs to converge). This finding suggests that the documented baseline results are not comparable with models whose training takes many more epochs. For example, the performance gap between  and  might not be as large as indicated by the results reported in the two original papers. For fairness, we compare the bottom-up baseline with our models, with both optimized for CIDEr for 30 epochs. It can be observed in Table 4 that using relation features improves the CIDEr score from 125.5 to 126.1 and the SPICE score from 21.6 to 21.8. The corresponding qualitative examples are presented in the appendix (Fig. 9). These results show that, without GCN, our method is still effective in capturing visual relationships and improving image captioning.
In this study, we explored learning visual relation features with neural scene graph generators for image-text matching and image caption generation. By additionally capturing the interplay between objects and stuff, the proposed R-SCAN model achieves new state-of-the-art results on image-text cross-modal retrieval on the Flickr30K and MSCOCO benchmarks. Similarly, the relation-based top-down captioner significantly improves image captioning. Scene graph generator features are indeed effective in helping downstream models ground language to visual relations, but the crux of the matter lies in pre-training scene graph generators with visually relevant relation data. We hope this work sheds light on the connection between scene graph generators and vision-and-language, and facilitates future research.
The authors thank Arun Sacheti and Pengchuan Zhang for their thoughtful feedback and discussions.
The supplementary material is structured as follows. Sec. A presents in detail how the COCO-test-VrR test set is constructed. Sec. B presents additional qualitative examples of cross-modal retrieval between image and text to demonstrate the effectiveness of R-SCAN and the use of visual relations for image-text matching. We also present image captioning examples to qualitatively demonstrate the effectiveness of using visual relations for the task.
Appendix A COCO-test-VrR
COCO-test-VrR is a subset of the MSCOCO Karpathy 5K test split introduced in Sec. 4.1 of the main paper. COCO-test-VrR focuses the evaluation of image-text matching on captions that describe semantic visual relations and the corresponding images. We describe in detail how COCO-test-VrR is constructed in this section.
Zellers et al.  and Liang et al.  have shown that a majority of the prevalent visual relations in Visual Genome  could be predicted without visual information. Liang et al.  constructed the Visually-Relevant Relationship Dataset (VrR-VG) which excludes the relations that could be easily predicted using language models and positional information. They clustered the remaining high-frequency relations into 117 relation predicates based on semantic similarities. By comparing visual relation triplets in VrR-VG and the original Visual Genome metadata, those 117 predicates can be mapped back to 259 relation predicates in the original Visual Genome.
On the other hand, Zellers et al. analyzed visual relations in Visual Genome and grouped them into four categories: geometric (e.g. above, behind, under), possessive (e.g. has, part of, wearing), semantic (e.g. carrying, eating, using), and miscellaneous (e.g. for, from, made of) (see Sec. 3.1 and Table 1 of  for details). The majority of the high-frequency relations in Visual Genome are geometric and possessive, and many of them can be easily predicted without visual information [54, 31]. In contrast, semantic relations, which correspond to activities, are less frequent and hard to predict without visual information. Among the aforementioned 259 relation predicates, we identify 164 as semantic relations:
adorning, appearing in, approaching, are attached to, are sitting on, attached, attached to, attached to a, balancing on, biting, boarding, bordering, built into, catching, chasing, coming out of, crashing on, decorating, displayed on, displaying, draped over, drawn on, dressed in, drinking from, driving, driving down, driving on, eating from, entering, filled with, floating in, floating on, flying, flying a, flying above, flying in, flying over, flying through, going down, grabbing, grazing, grazing in, grazing on, gripping, hanging, hanging above, hanging from, hanging in, hanging off, hanging on, hanging on a, hanging out of, hanging over, hangs from, hangs on, hits, hitting, hung on, jumping, jumping on, laying, laying in, laying on, laying on a, leaning on, leaning over, licking, looking out, lying in, lying inside, lying next to, lying on, lying on top of, marking, mounted on, mounted to, moving, overlooking, painted, painted on, petting, playing, playing in, playing on, playing with, plays, pointing, printed on, reflected in, reflected on, reflecting, reflecting in, reflecting off, reflecting on, resting on, running in, running on, securing, selling, served on, serving, sewn on, sits in, sits on, sitting, sitting at, sitting behind, sitting in, sitting in a, sitting inside, sitting near, sitting next to, sitting on, sitting on a, sitting with, skiing, skiing down, skiing in, skiing on, sleeping on, sniffing, stacked on, standing inside, standing near, standing with, sticking out, sticking out of, stopped at, stuck in, stuck on, supporting, supports, surfing, surfing in, surfing on, swimming in, swinging, swinging a, swings, talking on, talking to, tied around, tied to, touching, waiting at, waiting on, walking, walking across, walking along, walking behind, walking down, walking in, walking near, walking next to, walking on, walking on a, walking through, walking to, walking up, walking with, working on, wrapped around, wrapped in, written on.
COCO-test-VrR focuses on semantic relations. We select 3,403 images from the MSCOCO Karpathy 5K test split such that each image has at least one ground-truth caption containing at least one of the 164 semantic relation predicates. For each image, one ground-truth caption that describes semantic relations is randomly sampled.
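The selection procedure above can be sketched as follows. This is a minimal illustration, not the released split-construction code: the predicate names come from the list above (a small subset is used here for brevity), the image/caption loading is left abstract, and whole-phrase matching on word boundaries is our assumption, since the text does not specify how predicates are matched inside captions.

```python
import random

# Hypothetical subset of the 164 semantic relation predicates listed above;
# using the full list is a straightforward extension.
SEMANTIC_PREDICATES = ["swinging a", "sitting on", "eating from", "walking with"]

def mentions_semantic_relation(caption):
    """True if the caption contains any semantic predicate as a whole phrase.
    Whole-phrase matching on word boundaries is our assumption."""
    padded = " " + " ".join(caption.lower().split()) + " "
    return any(" " + pred + " " in padded for pred in SEMANTIC_PREDICATES)

def build_coco_test_vrr(images, seed=0):
    """images: iterable of (image_id, captions) pairs, e.g. from the Karpathy
    5K test split. Keeps images with at least one matching ground-truth
    caption and randomly samples one such caption per retained image."""
    rng = random.Random(seed)
    split = []
    for image_id, captions in images:
        matching = [c for c in captions if mentions_semantic_relation(c)]
        if matching:
            split.append((image_id, rng.choice(matching)))
    return split
```

Applied to the full Karpathy test split with all 164 predicates, this kind of filter yields the 3,403-image subset described above.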
Appendix B Additional Qualitative Examples
In this section, we present additional cross-modal retrieval examples to qualitatively demonstrate the effectiveness of incorporating visual relations for image-text matching. Figure 5 shows examples of image retrieval given text queries using R-SCAN and the baseline SCAN t-i AVG model, and Figure 6 shows examples of text retrieval given image queries with the same two models. Figure 7 shows the top-5 ranked images for text queries on MSCOCO using R-SCAN; similarly, Figure 8 shows the top-5 ranked sentences for image queries.
- (2016) SPICE: semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398.
- (2018) Bottom-up and top-down attention for image captioning and VQA. In CVPR.
- (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
- (2015) HICO: a benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1017–1025.
- (2015) Attention-based models for speech recognition. In NIPS.
- (2017) Detecting visual relationships with deep relational networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3076–3086.
- (2015) Language models for image captioning: the quirks and what works. In ACL.
- (2014) Learning everything about anything: webly-supervised visual concept learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3270–3277.
- (2017) Linking image and text with 2-way nets. In CVPR.
- (2017) VSE++: improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612.
- (2015) From captions to visual concepts and back. In CVPR.
- (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In CVPR.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2019) Relational reasoning using prior knowledge for visual captioning. arXiv preprint arXiv:1906.01290.
- (2017) Instance-aware image and sentence matching with selective multimodal LSTM. In CVPR.
- (2018) Learning semantic concepts and order for image and sentence matching. In CVPR.
- (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910.
- (2015) Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3668–3678.
- (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2018) Illustrative language understanding: large-scale visual grounding with image search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 922–933.
- (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.
- (2015) Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR.
- (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73.
- (2018) Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216.
- (2016) RNN Fisher vectors for action recognition and image annotation. In ECCV.
- (2019) Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12174–12182.
- (2017) ViP-CNN: visual phrase guided convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1347–1356.
- (2017) Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1261–1270.
- (2017) Deep variation-structured reinforcement learning for visual relationship and attribute detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 848–857.
- (2019) Rethinking visual relationships for high-level image understanding. arXiv preprint arXiv:1902.00313.
- (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- (2017) Improved image captioning via policy gradient optimization of SPIDEr. In Proceedings of the IEEE International Conference on Computer Vision, pp. 873–881.
- (2016) Visual relationship detection with language priors. In European Conference on Computer Vision, pp. 852–869.
- (2017) Dual attention networks for multimodal reasoning and matching. In CVPR.
- (2017) Pixels to graphs by associative embedding. In Advances in Neural Information Processing Systems, pp. 2171–2180.
- (2017) Hierarchical multimodal LSTM for dense visual-semantic embedding. In ICCV.
- (2017) CM-GANs: cross-modal generative adversarial networks for common representation learning. arXiv preprint arXiv:1710.05106.
- (2017) Weakly-supervised learning of visual relations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5179–5188.
- (2017) Phrase localization and visual relationship detection with comprehensive image-language cues. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1928–1937.
- (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024.
- (2011) Recognition using visual phrases. In CVPR 2011, pp. 1745–1752.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2015) CIDEr: consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575.
- (2016) Order-embeddings of images and language. In ICLR.
- (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.
- (2016) Learning deep structure-preserving image-text embeddings. In CVPR.
- (2017) Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5419.
- (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.
- (2018) Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 670–685.
- (2018) Scene graph reasoning with prior visual relationship for visual question answering. arXiv preprint arXiv:1812.09681.
- (2018) Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 684–699.
- (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78.
- (2018) Neural Motifs: scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831–5840.
- (2017) Visual translation embedding network for visual relation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5532–5540.
- (2017) PPR-FCN: weakly supervised visual relation detection via parallel pairwise R-FCN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4233–4241.
- (2017) Dual-path convolutional image-text embedding. arXiv preprint arXiv:1711.05535.
- (2017) Towards context-aware interaction recognition for visual relationship detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 589–598.