Understanding natural languages and their relationship with visual information is a foundation of AI systems that bridge humans and machines in the physical world. This problem involves many challenging tasks, among which grounding referring expressions [9, 23] is a fundamental one. Grounding referring expressions attempts to locate the target visual object in an image by comprehending the relationships between referring natural language expressions (e.g., “the man with glasses”, “the dog near a car”) and the image.
Grounding referring expressions is typically formulated as the task of identifying, from a set of proposals in an image, the proposal referred to by the expression. Recent work combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs) to process the multimodal information in images and referring expressions. CNNs are used for extracting visual features of single objects, global visual contexts [23, 28] and pairwise visual differences [19, 35, 36, 37], while LSTMs encode global language contexts [19, 22, 23, 36, 37] and language features of decomposed phrases [7, 35, 41, 33]. CNNs cooperating with LSTMs can also capture the context of object pairs [7, 26, 41]. However, such work cannot accurately extract first-order or multi-order relationships from referring expressions, and the captured contexts also exhibit discrepancies with the contexts described by the referring expressions.
A feasible solution for grounding referring expressions needs to extract all the required information (i.e., objects and the relationships among them in both the image and the referring expressions) accurately for any given expression. However, as the expressions generated from the scene in an image are often unpredictable and flexible, the proposed model needs to extract information adaptively. For example, if the target is to locate “The man holding a balloon” in an image with two or more men, the required information from the natural language expression includes the nouns (“man” and “balloon”) and the relationship word “holding”; on the other hand, the proposals for “man” and “balloon” and the visual relationship (“holding”) linking them should be identified in the image. If the expression is more complicated, such as “The man on the left of the man holding a balloon”, the additional relation information we need is “left”. In this example, we need to recognize the second-order relationship between the target and the “balloon” through the other “man”. Unfortunately, existing work either does not support relationship modeling or only considers first-order relationships among objects [7, 26, 41]. Theoretically, visual relation detectors [3, 21, 38] and natural language parsers could help achieve that goal by detecting the relational information in the image and parsing the expressions on the language side. However, existing visual relation detectors cannot deliver satisfactory results for highly unrestricted scene compositions, and existing language parsers have adverse effects on the performance of grounding referring expressions due to their parsing errors.
Moreover, it is vital to represent the contextual information of referring expressions and target object proposals accurately and consistently, because the context of an expression helps distinguish the target from other objects [26, 36, 41]. Nevertheless, existing methods for context modeling either cannot represent the contexts accurately or cannot achieve high-level consistency between both types of contexts mentioned above, for the reasons given below. First, existing work on global language context modeling [19, 22, 23, 36, 37] and global visual context modeling [23, 28] introduces noisy information and makes it hard to match these two types of contexts. Second, pairwise visual differences computed in existing work [19, 35, 36, 37] can only represent instance-level visual differences among objects of the same category. Third, existing work on context modeling for object pairs [7, 26, 41] only considers first-order relationships but not multi-order relationships (e.g., they directly extract the context between the target “man” and the “balloon” without considering that the other “man” is “holding the balloon”). In addition, multi-order relationships are actually structured information, and the context encoders adopted by existing work on grounding referring expressions are simply incapable of modeling them.
In order to overcome the aforementioned difficulties, we propose an end-to-end Cross-Modal Relationship Inference Network (CMRIN). CMRIN consists of two modules, i.e., the Cross-Modal Relationship Extractor (CMRE) and the Gated Graph Convolutional Network (GGCN). An example is illustrated in Figure 1. The CMRE extracts all the required information adaptively (i.e., nouns and relationship words from the expressions, and object proposals and their visual relationships from the image) for constructing a language-guided visual relation graph with cross-modal attention. First, CMRE constructs a spatial relation graph for the image. Second, it learns to classify the words in the expression into four types and further assign the words to the vertices and edges in the spatial relation graph. Finally, it constructs the language-guided visual relation graph from the normalized attention distribution of words over vertices and edges of the spatial relation graph. The GGCN fuses information from different modes and propagates the fused information in the language-guided visual relation graph to obtain the semantic context referred to by the expression. We have tested our proposed CMRIN on three common benchmark datasets, including RefCOCO, RefCOCO+ and RefCOCOg. Experimental results show that our proposed network outperforms all the other state-of-the-art methods.
In summary, this paper has the following contributions:
A Cross-Modal Relationship Extractor (CMRE) is proposed to convert the pair of input expression and image into a language-guided visual relation graph. For any given expression, CMRE highlights objects as well as the relationships among them with a cross-modal attention mechanism.
A Gated Graph Convolutional Network (GGCN) is proposed to capture multimodal semantic contexts with multi-order relationships. GGCN fuses information from different modes and propagates the fused information in the language-guided visual relation graph.
CMRE and GGCN are integrated into the Cross-Modal Relationship Inference Network (CMRIN), which outperforms all existing state-of-the-art methods on grounding referring expressions.
2 Related Work
2.1 Grounding Referring Expressions
Grounding referring expressions and referring expression generation are dual tasks. The latter generates an unambiguous text expression for a target object in an image, and the former selects the corresponding object according to the context in an image referred to by a text expression.
To address grounding referring expressions, some previous work [19, 22, 23, 37, 36] extracts visual object features with a CNN and treats an expression as a whole, encoding the language feature through an LSTM. Among them, some methods [22, 23, 36] learn to maximize the posterior probability of the target object given the expression and the image, and the others [19, 37] model the joint probability of the target object and the expression directly. Different from the methods above, the Context Modeling between Objects Network feeds the concatenation of the visual object representation, the visual context representation and the word embedding to an LSTM model. Some recent methods [7, 35, 41] learn to decompose an expression into different components and compute a language-vision matching score of each module for the objects, while others [42, 4] adopt co-attention mechanisms to build up interactions between the expression and the objects in the image. Our Cross-Modal Relationship Extractor also learns to classify the words in the expression, but we treat the classified words as guidance to automatically highlight all the objects and relationships described in the expression, building the language-guided visual relation graph without extra visual relationship detection [3, 21, 38].
2.2 Context modeling
Context modeling has been utilized in many visual recognition tasks, e.g., object detection [2, 31, 39, 14], saliency detection [16, 17, 15] and semantic segmentation [25, 40]. Recently, the Structure Inference Network formulates the context modeling task as a graph structure inference problem [8, 11, 24]. Previous work on grounding referring expressions also attempts to capture contexts. Some work [23, 28] encodes the entire image as a visual context, but that global contextual information usually cannot match the local context described by the expression. Some work [19, 35, 36, 37] captures the visual difference between objects belonging to the same category in an image, but the visual difference in object appearance is often insufficient to distinguish the target from other objects. Instead, the visual difference between contexts, including both appearance and relationships, is essential; e.g., for “Man holding a balloon”, the information necessary to locate the “man” is not only the appearance of the “man” but also the “holding” relation with the “balloon”. Though some works [7, 26, 41] attempt to model the context from the relationships of object pairs, they only consider contexts with first-order relationships between objects. Inspired by the Graph Convolutional Network for classification, our Gated Graph Convolutional Network flexibly captures the context referred to by the expression through message passing, so that contexts with multi-order relationships can be captured.
3 Cross-Modal Relationship Inference Network
Our proposed Cross-Modal Relationship Inference Network (CMRIN) relies on cross-modal relationships among objects and contexts captured in the multimodal relation graph to choose the target object proposal (in the input image) referred to by the input expression. First, CMRIN constructs a language-guided visual relation graph using the Cross-Modal Relationship Extractor. Second, it captures multimodal contexts from the relation graph using the Gated Graph Convolutional Network. Finally, a matching score is computed for each object proposal according to its multimodal context and the context of the input expression. The overall architecture of our CMRIN for grounding referring expressions is illustrated in Figure 2. In the rest of this section, we elaborate on all the modules in this network.
3.1 Cross-Modal Relationship Extractor
The Cross-Modal Relationship Extractor (CMRE) adaptively constructs the language-guided visual relation graph for any given pair of image and expression using a cross-modal attention mechanism. Our CMRE softly classifies the words in the expression into four types (i.e., entity, relation, absolute location, and unnecessary words) according to the context of every word. The context of the entire expression can be calculated from the contexts of the individual words. Meanwhile, a spatial relation graph of the image is constructed by linking object proposals in the image according to their sizes and locations. Next, CMRE generates the language-guided visual relation graph by highlighting the vertices and edges of the spatial relation graph. Highlighting is implemented as computing cross-modal attention between the words in the expression and the vertices and edges in the spatial relation graph.
3.1.1 Spatial Relation Graph
Exploring relative spatial relations among object proposals within an image is necessary for grounding referring expressions. On the one hand, spatial information frequently occurs in referring expressions; on the other hand, spatial relationships are an important aspect of visual relationships in images [3, 38]. We explore the spatial relationship between each pair of proposals according to their sizes and locations, which bears resemblance to prior approaches.
For a given image with object proposals (bounding boxes), the location of each proposal o_i is denoted as p_i, where (x_i, y_i) are the normalized coordinates of the center of proposal o_i, and w_i and h_i are the normalized width and height respectively. The spatial feature of o_i is composed of these normalized location quantities. For any pair of proposals o_i and o_j, the spatial relationship between them is defined as follows. We compute the relative distance d_ij between their centers, the relative angle θ_ij (i.e., the angle between the horizontal axis and the vector connecting the two centers) and the Intersection over Union IoU_ij between them. If o_j includes o_i, the relation r_ij is set to “inside”; if o_j is covered by o_i, r_ij is set to “cover”; if none of the above two cases is true and IoU_ij is larger than a threshold, r_ij is set to “overlap”; otherwise, when the ratio between d_ij and the diagonal length of the image is larger than a threshold, r_ij is set to “no relationship”. In the remaining cases, r_ij is assigned to one of the following spatial relations, “right”, “top right”, “top”, “top left”, “left”, “bottom left”, “bottom” and “bottom right”, according to the relative angle θ_ij. The details are shown in Figure 3.
The directed spatial relation graph is constructed from the set of object proposals and the set of pairwise relationships. Each vertex corresponds to a proposal; each edge carries the index label of the spatial relationship between its endpoints; the feature at each vertex is the visual feature of the corresponding proposal, extracted using a pretrained CNN model. A valid index label ranges from 1 to 11 (the label of “no relationship” is 0).
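The pairwise relation assignment described above can be sketched in code. This is a minimal illustration rather than the paper's implementation: the IoU and distance thresholds (0.5 each) are assumed placeholder values, boxes are taken as (x1, y1, x2, y2) in image coordinates with y growing downward, and the exact direction convention for “inside”/“cover” is an assumption.

```python
import math

def box_center(b):
    x1, y1, x2, y2 = b
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def iou(a, b):
    # Intersection over Union of two boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def contains(a, b):
    # True if box a fully contains box b.
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]

DIRECTIONS = ["right", "top right", "top", "top left",
              "left", "bottom left", "bottom", "bottom right"]

def spatial_relation(bi, bj, diag, iou_thresh=0.5, dist_thresh=0.5):
    """Label the spatial relation from proposal bi to proposal bj;
    `diag` is the diagonal length of the image."""
    if contains(bj, bi):
        return "inside"            # bi lies inside bj
    if contains(bi, bj):
        return "cover"             # bi covers bj
    if iou(bi, bj) > iou_thresh:
        return "overlap"
    (cxi, cyi), (cxj, cyj) = box_center(bi), box_center(bj)
    dist = math.hypot(cxj - cxi, cyj - cyi)
    if dist / diag > dist_thresh:
        return "no relationship"
    # Relative angle from the horizontal axis; image y grows downward,
    # hence the negated dy. Each direction covers a 45-degree sector.
    angle = math.degrees(math.atan2(-(cyj - cyi), cxj - cxi)) % 360.0
    return DIRECTIONS[int(((angle + 22.5) % 360.0) // 45.0)]
```

With this convention, a box directly to the right of another at the same height gets the label “right”, and two distant boxes fall into “no relationship”.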
3.1.2 Language Context
Inspired by the attention-weighted sum of word vectors over different modules in [7, 41, 35], our CMRE defines attention distributions of words over the vertices and edges of the spatial relation graph. In addition, different words in a referring expression may play different roles. For referring expressions, words can usually be classified into four types, i.e., entity, relation, absolute location and unnecessary words. By classifying the words into different types and distributing them over the vertices and edges of the graph, the language context of every vertex and edge can be captured, and the global language context can also be obtained.
For a given expression, CMRE uses a bi-directional LSTM to encode the context of every word. The context of each word is the concatenation of its forward and backward hidden vectors. The weight of each type (i.e., entity, relation, absolute location and unnecessary word) for a word is computed by feeding its context through a learnable transformation followed by a softmax over the four types. The weights of entity, relation and absolute location are the first three elements of the resulting distribution. The global language context of the graph is calculated by aggregating the word contexts weighted by these type weights.
Next, on the basis of the word contexts and the entity weight of every word, a weighted normalized attention distribution over the vertices of the spatial relation graph is defined: the transformed context of each word is matched against the transformed visual feature of each vertex, the resulting scores are normalized over the vertices, and the normalized attention is scaled by the word's entity weight, indicating the probability that the word refers to that vertex. The language context at each vertex is calculated by aggregating all attention-weighted word contexts.
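A minimal numpy sketch of this word-type classification and cross-modal attention, with random matrices standing in for learned parameters; the dimensions (5 words, 3 proposals, hidden size 8) and the single-linear-layer type classifier are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T, N, d = 5, 3, 8                      # words, proposals, hidden size
H = rng.standard_normal((T, d))        # bi-LSTM word contexts
V = rng.standard_normal((N, d))        # visual features of proposals

# Word-type weights: a linear layer followed by a softmax over the four
# types (entity, relation, absolute location, unnecessary).
W_type = rng.standard_normal((d, 4))
type_weights = softmax(H @ W_type, axis=1)       # (T, 4)
entity_w = type_weights[:, 0]                    # entity weight per word

# Each word's attention over vertices: transformed word contexts matched
# against transformed visual features, normalized over vertices, then
# scaled by the word's entity weight.
W_l = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))
logits = (H @ W_l) @ (V @ W_v).T                 # (T, N)
attn = entity_w[:, None] * softmax(logits, axis=1)

# Language context at each vertex: attention-weighted aggregation of
# word contexts, normalized by the total attention the vertex receives.
ctx = (attn.T @ H) / (attn.sum(axis=0)[:, None] + 1e-9)   # (N, d)
```

Each row of `attn` sums to the corresponding word's entity weight, so words classified as non-entities contribute little to any vertex's language context.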
3.1.3 Language-Guided Visual Relation Graph
Different object proposals and different relationships between proposals do not contribute equally to grounding referring expressions. The proposals and relationships mentioned in the referring expression should be given more attention. Our CMRE highlights the vertices and edges of the spatial relation graph that have connections with the referring expression, to generate the language-guided visual relation graph. The highlighting operation is implemented by designing a gate for each vertex and edge in the graph.
The gate for each vertex is defined as the sum, over all words in the expression, of the weighted probabilities that each word refers to that vertex.
Each edge has its own type, and the gates for edges are formulated as gates for the edges' types. The weighted normalized distribution of words over the edge types of the graph is defined analogously to the vertex attention: the context of each word is matched, with learnable parameters, against an embedding of every edge type, the scores are normalized over the types, and the normalized attention is scaled by the word's relation weight, giving the weighted probability that the word refers to each edge type. The gate for edges of a given type is the sum, over all words in the expression, of the weighted probabilities that each word refers to that edge type.
The language-guided visual relation graph is then defined as the spatial relation graph with its vertex features and edge weights modulated by the corresponding gates.
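The two kinds of gates can be sketched as follows. The word-to-vertex attention and per-word relation weights are assumed given (random stand-ins here), and realizing the word-to-edge-type distribution with learnable edge-type embeddings is an assumption about the parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T, N, K, d = 4, 3, 11, 8          # words, vertices, edge types, hidden size

# Assumed given: word-to-vertex attention already scaled by entity
# weights, and a per-word relation weight from the word-type classifier.
attn = softmax(rng.standard_normal((T, N)), axis=1) * 0.5   # (T, N)
rel_w = softmax(rng.standard_normal((T, 2)), axis=1)[:, 0]  # (T,)

# Vertex gate: total attention each vertex receives from all words.
vertex_gate = attn.sum(axis=0)                          # (N,)

# Edge gates are shared per edge type: match word contexts against
# edge-type embeddings, normalize over types, scale by each word's
# relation weight, and sum over words.
H = rng.standard_normal((T, d))                         # word contexts
E = rng.standard_normal((K, d))                         # edge-type embeddings
edge_attn = rel_w[:, None] * softmax(H @ E.T, axis=1)   # (T, K)
edge_gate = edge_attn.sum(axis=0)                       # one gate per type
```

Vertices and edge types never mentioned by the expression receive small gates, so they are suppressed in the language-guided graph.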
3.2 Multimodal Context Modeling
Our proposed Gated Graph Convolutional Network (GGCN) further fuses the language contexts into the language-guided visual relation graph to generate a multimodal relation graph, and computes a multimodal semantic context for every vertex by performing gated graph convolutional operations on this graph.
3.2.1 Language-Vision Feature
As suggested by work on visual relationship detection [3, 38], the spatial locations together with the appearance features of objects are the key indicators of visual relationships, and the categories of objects are highly predictive of relationships. Our GGCN fuses the language context of every vertex into the language-guided visual relation graph (which encodes the spatial relationships and appearance features of proposals) to generate the multimodal relation graph, which forms the basis for computing the semantic context of every vertex.
We define the feature at each vertex of the multimodal graph to be the concatenation of the visual feature at that vertex in the language-guided visual relation graph and the language context at that vertex. The multimodal graph thus shares the structure and gates of the language-guided graph, with these concatenated features at its vertices.
3.2.2 Semantic Context Modeling
Multi-order relationships may exist in referring expressions. We obtain semantic contexts representing multi-order relationships through message passing. On the one hand, semantic features are obtained by learning to fuse the spatial relations, visual features and language features. On the other hand, contexts representing multi-order relationships are computed by propagating pairwise contexts in the graph.
Inspired by the Graph Convolutional Network (GCN) for classification [11, 32], our GGCN adopts graph convolutional operations in the multimodal relation graph for computing semantic contexts. Different from GCN, which operates on unweighted graphs, GGCN operates on weighted directed graphs with extra gate operations. Each gated graph convolution operation at a vertex computes, with learnable parameters, three components: an encoded feature aggregated over outgoing relationships, an encoded feature aggregated over incoming relationships (both scaled by the corresponding edge gates), and an updated feature for the vertex itself. The final encoded feature is the sum of these three components, passed through an activation function. By performing the gated graph convolution operation for multiple iterations, semantic contexts representing multi-order relationships among vertices can be computed.
Finally, for each vertex, we concatenate its encoded spatial feature mentioned before and its language-guided semantic context to obtain the multimodal context.
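One gated graph convolution step can be sketched with dense matrices; the ReLU activation, the tiny chain graph, and representing the gates as a dense gated adjacency matrix are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 4, 6                                      # vertices, feature size

X = rng.standard_normal((N, d))                  # multimodal vertex features
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)        # directed edges i -> j
G = A * rng.uniform(0.0, 1.0, size=(N, N))       # edge-gated adjacency

W_out = rng.standard_normal((d, d))
W_in = rng.standard_normal((d, d))
W_self = rng.standard_normal((d, d))

def ggcn_layer(F):
    """One gated graph convolution: separately transformed messages
    along outgoing and incoming edges, scaled by the edge gates, plus
    a self-update, summed and passed through ReLU."""
    out_msg = G @ (F @ W_out)    # aggregate from each vertex's out-neighbors
    in_msg = G.T @ (F @ W_in)    # aggregate from each vertex's in-neighbors
    self_msg = F @ W_self        # the vertex's own update
    return np.maximum(out_msg + in_msg + self_msg, 0.0)

# Stacking two layers lets second-order relationships reach each vertex.
ctx = ggcn_layer(ggcn_layer(X))
```

After two layers, a vertex's context depends on neighbors two hops away, which mirrors how second-order relationships such as “the man on the left of the man holding a balloon” are propagated.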
3.3 Loss Function
The matching score between each proposal and the expression is computed by matching the proposal's multimodal context against the global language context through learnable transformation matrices.
Inspired by deep metric learning algorithms for face recognition, we adopt the triplet loss with online hard negative sample mining to train our CMRIN model. The triplet loss penalizes the gap between the matching scores of the ground-truth proposal and a negative proposal whenever that gap is smaller than a margin. The negative proposal is randomly chosen from the set of online hard negative proposals, i.e., proposals whose matching scores violate the margin with respect to the ground-truth proposal. During testing, we predict the target object by choosing the object proposal with the highest matching score.
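A sketch of this training objective. Defining the hard-negative set as the non-ground-truth proposals whose scores come within the margin of the ground truth's is our reading of the description above, and `margin=0.1` is an illustrative value rather than the paper's setting.

```python
import random

def triplet_loss(scores, gt_idx, margin=0.1, seed=0):
    """Hinge triplet loss with online hard negative mining (a sketch).
    `scores` holds the matching scores of every proposal for one
    expression; a hard negative is any non-ground-truth proposal whose
    score comes within `margin` of the ground truth's score."""
    gt = scores[gt_idx]
    hard = [s for i, s in enumerate(scores)
            if i != gt_idx and gt - s < margin]
    if not hard:                        # no margin violators: zero loss
        return 0.0
    neg = random.Random(seed).choice(hard)   # sample one hard negative
    return max(0.0, margin + neg - gt)
```

For scores [0.9, 0.85, 0.2] with the first proposal as ground truth, only the 0.85 proposal violates the margin, giving a loss of 0.1 + 0.85 - 0.9 = 0.05.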
4 Experiments
4.1 Datasets
In RefCOCO, there are 50,000 target objects, collected from 19,994 images in MSCOCO, and 142,210 referring expressions. RefCOCO is split into train, validation, test A, and test B, which have 120,624, 10,834, 5,657 and 5,095 expression-target pairs, respectively. Test A includes images with multiple people, and test B includes images with multiple objects of other categories.
RefCOCO+ has 49,856 target objects collected from 19,992 images in MSCOCO, and 141,564 expressions collected from an interactive game interface. Different from RefCOCO, RefCOCO+ forbids absolute location descriptions in the expressions. It is split into train, validation, test A, and test B, which have 120,191, 10,758, 5,726 and 4,889 expression-target pairs, respectively.
RefCOCOg includes 49,822 target objects from 25,799 images in MSCOCO, and 95,010 long referring expressions collected in a non-interactive setting. RefCOCOg  has 80,512, 4,896 and 9,602 expression-target pairs for training, validation, and testing, respectively.
| # | Model variant | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val | RefCOCOg test |
|---|---|---|---|---|---|---|---|---|---|
| 1 | global langcxt + vis instance | 79.05 | 81.47 | 77.86 | 63.85 | 69.82 | 57.80 | 70.78 | 71.26 |
| 2 | global langcxt + global viscxt(2) | 82.61 | 83.22 | 82.36 | 67.75 | 73.21 | 63.06 | 74.29 | 75.23 |
| 3 | weighted langcxt + guided viscxt(2) | 85.29 | 86.09 | 84.12 | 73.70 | 79.60 | 67.52 | 78.47 | 79.39 |
| 4 | weighted langcxt + guided viscxt(1) + fusion | 85.80 | 86.09 | 83.98 | 73.95 | 78.43 | 67.21 | 79.37 | 78.90 |
| 5 | weighted langcxt + guided viscxt(3) + fusion | 86.55 | 87.50 | 84.53 | 75.29 | 80.46 | 68.79 | 80.11 | 80.45 |
| 6 | weighted langcxt + guided viscxt(2) + fusion | 86.99 | 87.63 | 84.73 | 75.52 | 80.93 | 68.99 | 80.45 | 80.66 |
Ablation study on variants of our proposed CMRIN on RefCOCO, RefCOCO+ and RefCOCOg. The number following “viscxt” is the number of gated graph convolutional layers used in the model.
4.2 Evaluation and Implementation
The Precision@1 metric (the fraction of correct predictions) is used for performance evaluation. A prediction is considered to be a true positive if the top predicted proposal is the ground-truth one w.r.t. the referring expression.
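The metric itself is a one-liner; as a minimal sketch:

```python
def precision_at_1(predicted_ids, ground_truth_ids):
    """Fraction of expressions for which the top-scoring proposal is
    the ground-truth one (Precision@1)."""
    correct = sum(p == g for p, g in zip(predicted_ids, ground_truth_ids))
    return correct / len(ground_truth_ids)
```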
For a given dataset, we count the number of occurrences of each word in the training set. If a word appears more than five times, we add it to the vocabulary. Each word in the expression is initially a one-hot vector, which is further converted into a 512-dimensional embedding. Annotated regions of object instances are provided in RefCOCO, RefCOCO+ and RefCOCOg. The target objects in the three datasets belong to the 80 object categories in MSCOCO, but the referring expressions may mention objects beyond the 80 categories. In order to make the scope of target objects consistent with referring expressions, it is necessary to recognize objects in expressions, even when they are not within the 80 categories.
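The vocabulary-building step can be sketched as follows; mapping out-of-vocabulary words to an `<unk>` token and lowercase whitespace tokenization are assumptions, not details given in the text.

```python
from collections import Counter

def build_vocab(training_expressions, min_count=5):
    """Keep words that appear more than `min_count` times in the
    training expressions; all other words map to the <unk> token."""
    counts = Counter(w for expr in training_expressions
                     for w in expr.lower().split())
    vocab = {"<unk>": 0}
    for word in sorted(counts):          # sorted for a stable indexing
        if counts[word] > min_count:
            vocab[word] = len(vocab)
    return vocab
```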
Inspired by the Bottom-Up Attention Model for image captioning and visual question answering, we train a ResNet-101 based Faster R-CNN [5, 27] over 1,460 selected object categories in the Visual Genome dataset, excluding the images in the training, validation and testing sets of RefCOCO, RefCOCO+ and RefCOCOg. We combine the detected objects and the ground-truth objects provided by MSCOCO to form the final set of objects in the images. We extract the visual features of objects as the 2,048-dimensional output from the pool5 layer of the ResNet-101 based Faster R-CNN model. Since some previous methods use VGG-16 as the feature extractor, we also extract the 4,096-dimensional output from the fc7 layer of VGG-16 for fair comparison. We set the mini-batch size to 64. The Adam optimizer is adopted to update network parameters, with the learning rate initially set to 0.0005. The margin is fixed in all our experiments.
4.3 Comparison with the State of the Art
We compare the performance of our proposed CMRIN against state-of-the-art methods, including MMI, Neg Bag, CG, Attr, CMN, Speaker, Listener, VariContext, AccumulateAttn, ParallelAttn and MAttNet.
4.3.1 Quantitative Evaluation
Table 1 shows quantitative evaluation results on the RefCOCO, RefCOCO+ and RefCOCOg datasets. Our proposed CMRIN consistently outperforms existing methods across all the datasets by a large margin. Specifically, CMRIN improves the average Precision@1 over the validation and testing sets achieved by the existing best-performing algorithm by 2.44%, 5.54% and 3.29% respectively on the RefCOCO, RefCOCO+ and RefCOCOg datasets when VGG-16 is used as the backbone. Our CMRIN significantly improves on the person category (testA of RefCOCO and RefCOCO+), which indicates that casting appearance attributes (e.g., shirt, glasses and shoes) of a person as external relationships between the person and those attributes can effectively distinguish the target person from other persons. After we switch to the visual features extracted by the ResNet-101 based Faster R-CNN, the Precision@1 of our CMRIN is further improved by another 3.61%. It improves the average Precision@1 over the validation and testing sets achieved by MAttNet by 1.29%, 4.38% and 2.45% respectively on the three datasets. Note that our CMRIN only uses the 2048-dimensional features from pool5, but MAttNet uses multi-scale feature maps generated from the last convolutional layers of both the third and fourth stages.
4.3.2 Qualitative Evaluation
Visualizations of some samples along with their attention scores and matching scores are shown in Figure 4. They are generated from our CMRIN using ResNet-101 based Faster R-CNN features. Even without relationship modeling, our CMRIN can identify the proposals appearing in the given expression (second column of each example), and it achieves this on the basis of single objects (e.g., the parking meter in Figure 4(a) and the elephant in full view in Figure 4(d) have higher attention scores). After fusing information from different modes and propagating multimodal information in the structured relation graph, it is capable of learning semantic contexts and locating target proposals (third column) even when the target objects do not attract the most attention at the beginning. It is worth noting that our CMRIN learns semantic relations (“behind”) for pairs of proposals with different spatial relations (“bottom right” between “car” and “parking meter” in Figure 4(a); “top” between “green plant” and “lady's head” in Figure 4(b)), which indicates that CMRIN is able to infer semantic relationships from the initial spatial relationships. In addition, CMRIN learns the context for the target “elephant” (Figure 4(d)) from the “two other elephants” by considering the relations among multiple elephants together. Moreover, multi-order relationships are learned through propagation in CMRIN; e.g., the relationships (“right” in Figure 4(c)) between object pairs are propagated gradually to the target proposal (the rightmost “man”).
4.4 Ablation Study
Our proposed CMRIN includes the CMRE and GGCN modules. To demonstrate the effectiveness and necessity of each module, and to further compare each module against its variants, we have trained five additional models using the ResNet-101 based Faster R-CNN features. The results are shown in Table 2. As a baseline (row 1), we use the concatenation of instance-level visual features of objects and the location features as the visual features, use the last hidden state of the expression-encoding LSTM as the language feature, and then compute a matching score between every visual feature and the language feature. In comparison, a simple variant (row 2) that relies on a global visual context, computed by applying graph convolutional operations to the spatial relation graph, already outperforms the baseline. This demonstrates the importance of visual contexts. Another variant (row 3), with visual contexts computed in the language-guided visual relation graph, outperforms the above two versions. It captures the contexts by considering cross-modal information. By fusing the context of every word into the language-guided visual relation graph, semantic contexts can be captured by applying gated graph convolutional operations (row 6, the final version of CMRIN). Finally, we explore the number of gated graph convolutional layers used in CMRIN. The 1-layer CMRIN (row 4) performs worse than the 2-layer CMRIN because it only captures contexts with first-order relationships. The 3-layer CMRIN (row 5) does not further improve the performance. One possible reason is that third-order relationships rarely occur in the expressions.
5 Conclusions
In this paper, we have proposed an end-to-end Cross-Modal Relationship Inference Network (CMRIN) to compute and represent multimodal contexts for the task of grounding referring expressions in images. It consists of a Cross-Modal Relationship Extractor (CMRE) and a Gated Graph Convolutional Network (GGCN). CMRE extracts all the required information adaptively for constructing a language-guided visual relation graph with cross-modal attention. GGCN fuses information from different modes and propagates the fused information in the language-guided relation graph to obtain semantic contexts. Experimental results on three commonly used benchmark datasets show that our proposed method outperforms all existing state-of-the-art methods.
-  Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
-  Sean Bell, C Lawrence Zitnick, Kavita Bala, and Ross Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2874–2883, 2016.
-  Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3298–3308. IEEE, 2017.
-  Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. Visual grounding via accumulated attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7746–7755, 2018.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 4418–4427. IEEE, 2017.
-  Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5308–5317, 2016.
-  Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
-  Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017.
-  Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Guanbin Li, Yukang Gan, Hejun Wu, Nong Xiao, and Liang Lin. Cross-modal attentional context learning for rgb-d object detection. IEEE Transactions on Image Processing, 28(4):1591–1601, 2019.
-  Guanbin Li, Yuan Xie, Liang Lin, and Yizhou Yu. Instance-level salient object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 2386–2395, 2017.
-  Guanbin Li and Yizhou Yu. Visual saliency detection based on multiscale deep cnn features. IEEE Transactions on Image Processing, 25(11):5012–5024, 2016.
-  Guanbin Li and Yizhou Yu. Contrast-oriented deep neural networks for salient object detection. IEEE Transactions on Neural Networks and Learning Systems, 29(12):6038–6051, 2018.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  Jingyu Liu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  Yong Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Structure inference net: Object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6985–6994, 2018.
-  Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.
-  Ruotian Luo and Gregory Shakhnarovich. Comprehension-guided referring expressions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), volume 2, 2017.
-  Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 11–20, 2016.
-  Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. The more you know: Using knowledge graphs for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 891–898, 2014.
-  Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression understanding. In European Conference on Computer Vision, pages 792–807. Springer, 2016.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer, 2016.
-  Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 815–823, 2015.
-  Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
-  Abhinav Shrivastava and Abhinav Gupta. Contextual priming and feedback for faster r-cnn. In European Conference on Computer Vision, pages 330–348. Springer, 2016.
-  Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6857–6866, 2018.
-  Xian Wu, Guanbin Li, Qingxing Cao, Qingge Ji, and Liang Lin. Interpretable video captioning via trajectory structured localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2018.
-  Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In European Conference on Computer Vision, 2018.
-  Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
-  Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L Berg. A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), volume 2, 2017.
-  Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5831–5840, 2018.
-  Xingyu Zeng, Wanli Ouyang, Bin Yang, Junjie Yan, and Xiaogang Wang. Gated bi-directional cnn for object detection. In European Conference on Computer Vision, pages 354–369. Springer, 2016.
-  Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. Grounding referring expressions in images by variational context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4158–4166, 2018.
-  Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton van den Hengel. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4252–4261, 2018.