Referring expression comprehension [Yu et al.2017, Yu et al.2018b, Wang et al.2019] has attracted much attention in recent years. A referring expression is a natural language description of a particular object in an image. Given a referring expression, the goal of referring expression comprehension is to localize the corresponding object instance in the image. It is a key task in machine intelligence, with applications in human-computer interaction, robotics and early education.
Conventional methods for referring expression comprehension mostly formulate this problem as an object retrieval task, where the object that best matches the referring expression is retrieved from a set of object proposals. These methods [Yu et al.2018b, Yang, Li, and Yu2019b, Yang, Li, and Yu2019a, Wang et al.2019] are mainly composed of two stages. In the first stage, given an input image, a pre-trained object detection network is applied to generate a set of object proposals. In the second stage, given an input expression, the best matching region is selected from the detected object proposals. Although two-stage methods have achieved great progress, several problems remain. 1) The performance of two-stage methods is strictly limited by the quality of the object proposals generated in the first stage: if the target object is not accurately detected, it cannot be matched to the language in the second stage. 2) In the first stage, large amounts of extra object detection data, i.e., COCO [Lin et al.2014] and Visual Genome [Krishna et al.2017], are indispensable to achieve satisfactory results. 3) Two-stage methods are usually computationally costly: for each object proposal, both feature extraction and cross-modality similarity computation must be conducted, yet only the proposal with the highest similarity is finally selected. As shown in Figure 1, the accuracy of current two-stage methods is reasonable, but their inference speed is still far from real-time.
The three aforementioned problems are difficult to solve within existing two-stage frameworks. We reformulate referring expression comprehension as a cross-modality template matching problem, where the language serves as the template (filter kernel) and the image feature map is the search space on which correlation filtering is performed. Mathematically, referring expression comprehension aims to learn a function that compares an expression to a candidate image and returns a high score in the corresponding regions. The region is represented by a 2-dim center point, a 2-dim object size (height and width) and a 2-dim offset that recovers the discretization error [Law and Deng2018, Zhou, Wang, and Krähenbühl2019, Duan et al.2019]. Our proposed RCCF is end-to-end trainable. The language embedding is used as a correlation filter and applied to the image feature map to produce the heatmap of the object center. For more accurate localization, we compute correlation maps on multi-level image features and fuse them to produce the final center heatmap. The width, height and offset maps are regressed from the visual features only. During inference, the expression is first embedded, and the resulting filter slides over the image feature maps. The peak point in the center heatmap is selected as the center of the target, and the corresponding width, height and offset are collected to form the target bounding box, which is the referring expression comprehension result.
The advantages of our proposed RCCF method include:
The inference speed of our method reaches real-time (40 FPS, i.e., 25ms per image) on a single GPU, about 12 times faster than existing two-stage methods, which gives it great practical value.
Our method can be trained with referring expression data only, with no need for any additional object detection data. Moreover, our one-stage model avoids the error accumulation from the object detector in traditional two-stage methods.
RCCF achieves state-of-the-art performance on the RefClef, RefCOCO, RefCOCO+ and RefCOCOg datasets. In particular, on RefClef, our method outperforms the state-of-the-art by a very large margin, from 34.70 to 63.79, almost doubling the previous best performance.
Referring Expression Comprehension
Conventional methods for referring expression comprehension are mostly composed of two stages. In the first stage, given an input image, a pre-trained object detection network or an unsupervised method is applied to generate a set of object proposals. In the second stage, given an input expression, the best matching region is selected from the detected object proposals. Most two-stage methods focus on improving the second stage. Many of them [Mao et al.2016, Hu et al.2017, Zhang, Niu, and Chang2018, Yu et al.2018b, Wang et al.2019, Yang, Li, and Yu2019a] explore how to mine context information from the language and image or how to model the relationships between referents.
Though two-stage methods have achieved strong performance, they share some common problems. Firstly, their performance is limited by the object detector. Secondly, they spend a lot of time on object proposal generation and proposal feature extraction. Therefore, we propose to localize the target object directly from the expression with our correlation filtering based method.
Correlation filtering was originally proposed to train a linear template that discriminates between images and their translations, and it is widely used in different areas of computer vision. Object classification [Krizhevsky, Sutskever, and Hinton2012, He et al.2016, Simonyan and Zisserman2014] can be seen as a correlation filtering task, where the output image feature vector acts as a filter kernel that performs correlation filtering on the weight matrix of the last MLP. In single object tracking, which aims to localize an object in a video given the object region in the first frame, correlation filtering serves to compare the first frame with the remaining ones. Early tracking works [Bolme et al.2010, Henriques et al.2014] first transfer the image into the Fourier domain and perform correlation filtering there. Siamese FC [Bertinetto et al.2016] instead learns a correlation layer directly in the spatial domain, comparing two image features extracted from a Siamese network.
Inspired by the human visual perception mechanism, we believe that language-based visual grounding can be analogized to filter-based visual response activation. Specifically, people generally comprehend the semantic information of a sentence globally, form a feature template of the sentence description in their mind, and then quickly perform attention matching on the image based on the template, where the salient region with the highest response is taken as the matching region. To this end, we formulate referring expression comprehension as a cross-modality correlation filtering process and solve it with a single-stage joint optimization paradigm.
Let $Q$ represent a query sentence and $I$ denote the image of width $W$ and height $H$. Our aim is to find the object region described by the expression. The target object region is represented by its center point $(x, y)$ and the object size $(w, h)$. Additionally, to recover the discretization error caused by the output stride, we predict a local offset $(\delta x, \delta y)$ for the center point. To sum up, referring expression comprehension can be formulated as a mapping function $(Q, I) \mapsto (x, y, w, h, \delta x, \delta y)$.
As shown in Figure 2, our proposed RCCF is composed of three modules, i.e., the expression and image encoder, the correlation filtering module, and the size and offset regression module. The expression and image encoder includes a language feature extractor and a visual feature extractor; the extracted features are denoted $f_e$ and $\{F_1, F_2, F_3\}$ respectively. The expression feature is mapped from the language domain to the visual domain by a cross-modality mapping function. The correlation filtering module treats the mapping result as the filter (kernel) to convolve with the visual feature maps, producing a heatmap $\hat{Y} \in [0, 1]^{\frac{W}{s} \times \frac{H}{s}}$, where $s$ is the output stride. The peak of $\hat{Y}$ indicates the center point of the object depicted by the expression. Moreover, the size and offset regression module predicts the object size $(w, h)$ and the local offset $(\delta x, \delta y)$ of the center point. Next, we introduce the three modules in detail.
Expression and Image Encoder
The expression encoder takes the expression as input and produces a 512-D feature vector. We first embed the expression into a 1024-D vector, followed by a fully connected layer that transforms it into 512-D. We then feed the transformed feature into a Bi-LSTM to obtain the expression feature $f_e$.
The image encoder adopts the Deep Layer Aggregation (DLA) [Yu et al.2018a] architecture with deformable convolution [Dai et al.2017]. DLA is an image classification network with hierarchical skip connections. Following CenterNet [Zhou, Wang, and Krähenbühl2019], we use the modified DLA network with 34 layers, which replaces the skip connections with deformable convolutions. Because a referring expression may contain various kinds of semantic information, such as attributes, relationships and spatial locations, we use three levels of visual features to match the expression well. As shown in Figure 2, we extract features from three levels of the DLA net and transform them into a unified spatial size. When computing the correlation map, all three levels are utilized. During regression, only the feature with the highest resolution is used, for computational efficiency.
Cross-modality Correlation Filtering
The aim of cross-modality correlation filtering is to localize the center of the target box. It contains three steps. Firstly, we utilize three different linear functions to generate three filter kernels $k_1$, $k_2$, $k_3$ from the expression feature $f_e$. The three fully connected layers serve as the cross-modality mapping function, projecting $f_e$ from the expression space to the visual space. Each kernel is a 64-D feature vector, which is reshaped into a filter for the subsequent operations. Secondly, we perform the correlation operation on the three levels of visual features with their corresponding language-mapped kernels, $M_i = F_i * k_i$, where $*$ denotes convolution. Thirdly, the three correlation maps are pixel-wise averaged and fed into an activation function to produce the heatmap $\hat{Y}$. The location with the highest score in $\hat{Y}$ is the center point of the target object.
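The per-level correlation step can be sketched in numpy as below. This is an illustrative sketch, not the authors' implementation: `proj_weights` is a hypothetical stand-in for the learned fully connected mapping layers, the correlation is written as a 1x1 per-pixel dot product, and a sigmoid is assumed as the activation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def correlation_heatmap(expr_feat, visual_feats, proj_weights):
    """Cross-modality correlation filtering (illustrative sketch).

    expr_feat:    (512,) expression feature from the language encoder.
    visual_feats: list of three (64, H, W) visual feature maps.
    proj_weights: list of three (64, 512) matrices standing in for the
                  fully connected cross-modality mapping layers.
    """
    maps = []
    for feat, weight in zip(visual_feats, proj_weights):
        kernel = weight @ expr_feat  # 64-D language kernel
        # A 1x1 correlation is a dot product between the kernel and
        # each pixel's 64-D feature vector, yielding an (H, W) map.
        maps.append(np.tensordot(kernel, feat, axes=([0], [0])))
    fused = np.mean(maps, axis=0)    # pixel-wise average fusion
    return sigmoid(fused)            # heatmap with values in (0, 1)
```

The peak of the returned heatmap would then give the predicted object center.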
We train the center point prediction network following [Law and Deng2018, Zhou, Wang, and Krähenbühl2019]. For the ground-truth center point $p$, we compute a low-resolution equivalent $\tilde{p} = \lfloor p/s \rfloor$ given the output stride $s$. We use a Gaussian kernel to splat the ground-truth center point into a heatmap $Y$, where $Y_{xy} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$ is the value of $Y$ at the spatial location $(x, y)$, and the training objective is a penalty-reduced pixel-wise focal loss [Lin et al.2017]:

$$L_{center} = -\frac{1}{N} \sum_{xy} \begin{cases} (1 - \hat{Y}_{xy})^{\alpha} \log(\hat{Y}_{xy}) & \text{if } Y_{xy} = 1 \\ (1 - Y_{xy})^{\beta} (\hat{Y}_{xy})^{\alpha} \log(1 - \hat{Y}_{xy}) & \text{otherwise} \end{cases}$$

where $\alpha$ and $\beta$ are hyper-parameters of the focal loss and $N$ is the number of center points. We empirically set $\alpha$ to 2 and $\beta$ to 4 in our experiments.
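A minimal numpy sketch of the Gaussian splat and the penalty-reduced focal loss follows. The helper names are hypothetical, and the Gaussian standard deviation is passed in directly rather than derived from the object size.

```python
import numpy as np

def gaussian_splat(shape, center, sigma):
    """Splat a ground-truth center point (cx, cy) into an (h, w) heatmap."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def center_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss on the center heatmap."""
    pos = (gt == 1)  # exact center points
    pos_loss = ((1 - pred[pos]) ** alpha * np.log(pred[pos] + eps)).sum()
    # Negatives near the center are down-weighted by (1 - gt)^beta.
    neg_loss = ((1 - gt[~pos]) ** beta * pred[~pos] ** alpha
                * np.log(1 - pred[~pos] + eps)).sum()
    n = max(pos.sum(), 1)  # normalize by the number of positives
    return -(pos_loss + neg_loss) / n
```

A prediction close to the splatted ground truth yields a much smaller loss than a uniform prediction.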
Size and Offset Regression
As shown in Figure 2, this module contains two parallel branches. The size regression branch predicts the width $\hat{w}$ and height $\hat{h}$, while the offset regression branch estimates the offsets $\delta\hat{x}$ and $\delta\hat{y}$. The regressed size and offset maps correspond pixel-wise to the estimated center point heatmap $\hat{Y}$.
Both branches take the visual feature as input. The regression is conducted without using any expression features, because spatial structure information is important to the regression, and injecting expression features may destroy the rich spatial information in the visual features. Both the size and offset regression branches consist of a convolutional layer with ReLU followed by a second convolutional layer.
An L1 loss function is used during training. The object size loss $L_{size}$ and the local offset regression loss $L_{off}$ are defined as:

$$L_{size} = |\hat{w}_{\tilde{p}} - w| + |\hat{h}_{\tilde{p}} - h|, \qquad L_{off} = \left|\hat{O}_{\tilde{p}} - \left(\frac{p}{s} - \tilde{p}\right)\right|$$

where $w$ and $h$ are the ground-truth width and height of the target box and $\frac{p}{s} - \tilde{p}$ is the ground-truth offset vector. $\hat{w}_{\tilde{p}}$ is the value of the predicted width map at the spatial location $\tilde{p}$, and $\hat{h}_{\tilde{p}}$ and $\hat{O}_{\tilde{p}}$ are defined similarly. Note that the regression losses act only at the location of the center point $\tilde{p}$; all other locations are ignored.
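The "evaluate only at the center" rule can be sketched in numpy; the array layouts (size and offset maps stored as (2, H, W) arrays) and function name are illustrative assumptions:

```python
import numpy as np

def size_offset_losses(size_map, offset_map, center, gt_size, gt_offset):
    """L1 regression losses, evaluated only at the ground-truth center.

    size_map, offset_map: (2, H, W) predicted maps.
    center:               (cx, cy) low-resolution center location.
    gt_size, gt_offset:   ground-truth (w, h) and (dx, dy).
    """
    cx, cy = center
    pred_w, pred_h = size_map[:, cy, cx]      # read out a single pixel
    pred_dx, pred_dy = offset_map[:, cy, cx]  # all other pixels ignored
    l_size = abs(pred_w - gt_size[0]) + abs(pred_h - gt_size[1])
    l_off = abs(pred_dx - gt_offset[0]) + abs(pred_dy - gt_offset[1])
    return l_size, l_off
```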
Loss and Inference
The final loss is the weighted sum of the three losses:

$$L = L_{center} + \lambda_{size} L_{size} + \lambda_{off} L_{off}$$

where we set $\lambda_{size}$ to 0.1 and $\lambda_{off}$ to 1. $\lambda_{size}$ is equivalent to a normalization coefficient for the object size.
During inference, we select the point $(\hat{x}, \hat{y})$ with the highest confidence score in the heatmap $\hat{Y}$ as the target center point. The target size $(\hat{w}, \hat{h})$ and offset $(\delta\hat{x}, \delta\hat{y})$ are read from the corresponding position in the size and offset maps. The coordinates of the top-left and bottom-right corners of the target box are obtained by:

$$\left(\hat{x} + \delta\hat{x} - \frac{\hat{w}}{2},\ \hat{y} + \delta\hat{y} - \frac{\hat{h}}{2}\right), \qquad \left(\hat{x} + \delta\hat{x} + \frac{\hat{w}}{2},\ \hat{y} + \delta\hat{y} + \frac{\hat{h}}{2}\right)$$
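The inference-time decoding can be sketched as follows; the stride value, map layouts, and the choice to scale all quantities back to input resolution by the stride are illustrative assumptions:

```python
import numpy as np

def decode_box(heatmap, size_map, offset_map, stride=4):
    """Pick the peak center and assemble the predicted box.

    heatmap:    (H, W) center heatmap.
    size_map:   (2, H, W) predicted widths and heights.
    offset_map: (2, H, W) predicted local offsets.
    Returns (x1, y1, x2, y2) in input-image coordinates.
    """
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    w, h = size_map[:, cy, cx]
    dx, dy = offset_map[:, cy, cx]
    x, y = cx + dx, cy + dy  # offset-refined low-resolution center
    # Scale back to the input resolution with the output stride.
    return ((x - w / 2) * stride, (y - h / 2) * stride,
            (x + w / 2) * stride, (y + h / 2) * stride)
```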
Dataset: The experiments are conducted and evaluated on four common referring expression benchmarks, i.e., RefClef [Kazemzadeh et al.2014], RefCOCO [Kazemzadeh et al.2014], RefCOCO+ [Kazemzadeh et al.2014] and RefCOCOg [Mao et al.2016]. RefClef, also known as ReferItGame, is a subset of the ImageCLEF dataset; the other three are all built on MS COCO images. RefCOCO and RefCOCO+ were collected in an interactive game, where the referring expressions tend to be short phrases. Compared to RefCOCO, RefCOCO+ forbids absolute location words and focuses more on appearance descriptions. To produce longer expressions, RefCOCOg was collected in a non-interactive setting. RefClef has expressions for objects in images. RefCOCO has expressions for objects in images, RefCOCO+ has expressions for objects in images, and RefCOCOg has expressions for objects in images.
Both RefCOCO and RefCOCO+ are divided into four subsets: "train", "val", "testA" and "testB". The foci of "testA" and "testB" differ: images in "testA" contain multiple people, while images in "testB" contain multiple objects. For RefCOCOg, we follow the split in [Yu et al.2018b]. For fair comparison, we use the split released by [Zhang, Niu, and Chang2018] for RefClef.
Evaluation Metric: Following the detection proposal setting in previous works, we use Prec@0.5 to evaluate our method, where a predicted region is considered correct if its intersection over union (IoU) with the ground-truth bounding box is greater than 0.5.
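The metric amounts to a simple IoU threshold over box pairs; a self-contained sketch (the function names are our own):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def prec_at_05(preds, gts):
    """Fraction of predictions whose IoU with the ground truth exceeds 0.5."""
    hits = sum(iou(p, g) > 0.5 for p, g in zip(preds, gts))
    return hits / len(gts)
```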
Implementation Details: We set hyper-parameters following CenterNet [Zhou, Wang, and Krähenbühl2019]; our RCCF method is robust to these hyper-parameters. All experiments are conducted on a Titan Xp GPU with CUDA 9.0 and an Intel Xeon E5 CPU.
The resolution of the input image is fixed, and the output resolution is determined by the output stride. Our proposed model is trained with Adam [Kingma and Ba2014] on 8 GPUs with a batch size of 128, with a learning rate of 5e-4 that is decreased twice at later epochs. We use random shift and random scaling as data augmentation; no augmentation is applied during inference. The visual encoder is initialized with a COCO object detection pretrained model, while the language encoder and the output heads are randomly initialized. For the ablation study, we also conduct experiments with the visual encoder initialized with an ImageNet [Deng et al.2009] pretrained model.
Comparison to the State-of-the-art
We compare RCCF to the state-of-the-art methods. The comparison results on the RefClef dataset are shown in Table 2, while the results on the other three datasets are presented in Table 3. Previous methods use a 16-layer VGGNet [Simonyan and Zisserman2014] or a 101-layer ResNet [He et al.2016] as the image encoder, while our proposed RCCF adopts DLA-34 [Yu et al.2018a], because it has been shown [Law and Deng2018, Duan et al.2019] that VGG16 and ResNet-101 are not suitable for keypoint-estimation-like tasks. For fair comparison, we compare the two backbone networks with DLA-34 from three aspects in Table 1. DLA-34 has the fewest parameters and computations (FLOPs), and its image classification performance is worse than that of ResNet-101. Therefore, the performance gain of RCCF comes from the framework itself, not from more parameters or a more complex backbone network. The baselines we compare with mainly use Faster R-CNN [Ren et al.2015], pretrained on object detection datasets, i.e., COCO and Visual Genome, to generate object proposals first, and then match the expression against all object proposals.
RefClef: The results on RefClef are presented in Table 2. Our method outperforms all previous state-of-the-art methods on RefClef by a very large margin, from 34.70 to 63.79, almost doubling the precision.
RefCOCO, RefCOCO+ and RefCOCOg: As shown in Table 3, our method outperforms existing methods on all evaluation sets of RefCOCO and RefCOCO+, and achieves comparable performance with the state-of-the-art on RefCOCOg. Our result is slightly lower than MAttNet [Yu et al.2018b] on RefCOCOg because MAttNet uses extra supervision, such as attributes and class labels of region proposals, and therefore better understands the expression, especially the long and complex sentences common in RefCOCOg. Additionally, MAttNet uses the more complex backbone ResNet-101, while we only use DLA-34.
In conclusion, our method achieves strong performance on all four datasets. In addition, the two-stage methods achieve much higher precision on the three RefCOCO-series datasets than on RefClef. This is because the three RefCOCO-series datasets are subsets of COCO, so the two-stage methods can train a very accurate detector on the COCO object detection dataset, while RefClef has no such large corresponding object detection dataset. Traditional two-stage methods are therefore heavily dependent on the object detector and the object detection dataset, whereas our RCCF framework avoids the explicit object detection stage and tackles the referring expression problem directly. In short, our proposed method is a better solution to referring expression comprehension.
| Method | Prec@0.5 |
| --- | --- |
| SCRC [Hu et al.2016] | 17.93 |
| GroundR [Rohrbach et al.2016] | 26.93 |
| MCB [Chen et al.2017] | 26.54 |
| CMN [Hu et al.2017] | 28.33 |
| VC [Zhang, Niu, and Chang2018] | 31.13 |
| GGRE [Luo and Shakhnarovich2017] | 31.85 |
| MNN [Chen et al.2017] | 32.21 |
| CITE [Plummer et al.2018] | 34.13 |
| IGOP [Yeh et al.2017] | 34.70 |
| # | Method | Backbone | RefCOCO testA | RefCOCO testB | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | MMI [Mao et al.2016] | VGG16 | 64.90 | 54.51 | 54.03 | 42.81 | - |
| 2 | NegBag [Nagaraja, Morariu, and Davis2016] | VGG16 | 58.60 | 56.40 | - | - | 49.50 |
| 3 | CG [Luo and Shakhnarovich2017] | VGG16 | 67.94 | 55.18 | 57.05 | 43.33 | - |
| 4 | Attr [Liu, Wang, and Yang2017] | VGG16 | 72.08 | 57.29 | 57.97 | 46.20 | - |
| 5 | CMN [Hu et al.2017] | VGG16 | 71.03 | 65.77 | 54.32 | 47.76 | - |
| 6 | Speaker [Yu et al.2016] | VGG16 | 67.64 | 55.16 | 55.81 | 43.43 | - |
| 7 | Speaker+Listener+Reinforcer [Yu et al.2017] | VGG16 | 72.94 | 62.98 | 58.68 | 47.68 | - |
| 8 | Speaker+Listener+Reinforcer [Yu et al.2017] | VGG16 | 72.88 | 63.43 | 60.43 | 48.74 | - |
| 9 | VC [Zhang, Niu, and Chang2018] | VGG16 | 73.33 | 67.44 | 58.40 | 53.18 | - |
| 10 | ParallelAttn [Zhuang et al.2018] | VGG16 | 75.31 | 65.52 | 61.34 | 50.86 | - |
| 11 | LGRANs [Wang et al.2019] | VGG16 | 76.6 | 66.4 | 64.0 | 53.4 | - |
| 12 | DGA [Yang, Li, and Yu2019b] | VGG16 | 78.42 | 65.53 | 69.07 | 51.99 | 51.99 |
| 13 | Speaker+Listener+Reinforcer [Yu et al.2017] | ResNet-101 | 73.71 | 64.96 | 60.74 | 48.80 | 59.63 |
| 14 | Speaker+Listener+Reinforcer [Yu et al.2017] | ResNet-101 | 73.10 | 64.85 | 60.04 | 49.56 | 59.21 |
| 15 | MAttNet [Yu et al.2018b] | ResNet-101 | 80.43 | 69.28 | 70.26 | 56.00 | 67.01 |
| # | Method | testA | testB | Inference Time (ms) |
| --- | --- | --- | --- | --- |
| 4 | Single Language Filter | 77.66 | 68.87 | 24 |
| 5 | Single Level Visual Feature | 77.14 | 68.50 | 23 |
| 8 | Glove Expression Encoder | 81.05 | 71.17 | 25 |
| 9 | Hourglass Image Encoder | 78.12 | 69.38 | 80 |
In this section, we perform ablation studies from five different aspects. Results are shown in Table 4.
Fusion Strategy: In the first two rows, we report the results of two alternative fusion schemes for the output correlation maps. In the first row, we replace the original pixel-wise averaging with maximum fusion: the three output correlation maps are concatenated, and the pixel-wise maximum is taken across channels. In the second row, we generate the output heatmap by concatenating the three correlation maps, followed by a convolutional layer. From the results, we conclude that both maximum fusion and concatenation are inferior to average fusion.
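The average and maximum variants compared above differ only in how the per-level maps are merged, which can be sketched as follows (the concatenation-plus-convolution variant needs learned weights and is omitted):

```python
import numpy as np

def fuse(maps, mode="avg"):
    """Fuse three per-level correlation maps into one heatmap."""
    stacked = np.stack(maps)       # (3, H, W)
    if mode == "avg":              # pixel-wise average (the default in RCCF)
        return stacked.mean(axis=0)
    if mode == "max":              # pixel-wise maximum across levels
        return stacked.max(axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")
```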
Filter Kernel Setting: Here we perform ablation studies on different variations of the language filters (kernels). The larger-filter variant (row 3) expands the channels of the language filter and reshapes the result into a larger kernel before performing the correlation. The result is almost the same as "Ours" with the default kernel (row 11). Considering the additional computational cost, we choose the default kernel.
In row 4, we generate only a single filter from the language feature and perform correlation filtering on the three levels of visual features with this same kernel. In this case, the precision drops about 3 points, which shows that the diversity of the language kernels is important for matching visual features at different levels.
Single Level Visual Feature: In row 5, we perform correlation filtering on only the last level of visual features with a single language kernel. The performance drops a lot from "Ours", but only a little from the single-language-filter, multi-level setting in row 4. It can therefore be concluded that the different language filters are sensitive to the different levels of visual features.
Language-guided Regression: To verify whether features filtered by the language filter are suitable for regression, we feed the concatenation of the three correlation maps into the two convolutional layers of each regression branch. As shown in row 6, the performance drops a lot, about 6 points. Therefore, using language-guided features to regress the object size and offset is not a good choice in our RCCF framework.
Expression & Image Encoder: Rows 7 to 9 of Table 4 show our method with various encoders. In row 7, to explore the effect of the visual encoder's pretraining on performance, we initialize DLA-34 with an ImageNet pretrained model. The result drops about 2 points but is still comparable to the state-of-the-art, which shows that our method also works well without any prior knowledge from object detection. In row 8, we use GloVe [Pennington, Socher, and Manning2014] as the word embedding; the performance barely changes, so our method is robust to the two different language embeddings. In row 9, we replace the visual encoder with the deeper network Hourglass-104 [Law and Deng2018] in a single-level setting. Compared to row 5, this setting improves only slightly, but it is much slower than our basic setting with DLA-34 during both inference and training: much more time is needed for training and the inference speed is much lower.
Inference: our model runs at 25ms per image on a single Titan Xp GPU, while the state-of-the-art two-stage method MAttNet needs 314ms; our method is about 12 times faster. In more detail, the per-image inference times of the first and second stages of MAttNet are 262ms and 52ms respectively, so either stage alone costs more than the total inference time of our method. More comparisons of timing and precision can be found in Figure 1.
Training: our method is also fast to train. Training with DLA-34 on RefCOCO takes 35 hours with our synchronized 8-GPU implementation (1.78s per mini-batch of 128 image-language pairs).
Qualitative Results Analyses
Correlation Map: Figure 4 shows the correlation maps of the object center. Given different expressions for the same image, the correlation map responds at different locations: the response is very high in areas near the center of the object described by the expression, and very small elsewhere. This shows that our model matches the expression and visual features well.
Comparison to the State-of-the-art: In the first row of Figure 3, we compare our method with the state-of-the-art method MAttNet. Our method accurately localizes the target objects under the guidance of the language, even when the objects are hard for a common object detector to detect. For example, the described objects "piece" (Figure 3a) and "space" (Figure 3f) are very abstract and not included in the COCO categories, yet our method still finds them through the expression. This shows that our method matches expression and visual features well. In contrast, MAttNet depends on the object detector and fails if the object category is beyond the scope of the detector's category set.
Failure Case Analysis: The second row of Figure 3 illustrates some failure cases. In Figure 3g, we find the right object but fail to accurately locate the bounding box. In Figure 3h, the target object is heavily occluded, so the model cannot capture enough appearance information. Ground-truth errors also occur: in Figure 3j, more than one object in the image matches the expression. Some failures are caused by the target object lying in the background, making it difficult to find the appearance features described by the expression. In addition, when the expression is very long and complex, our model may fail to understand it well, as in Figure 3l. We leave these failure cases as interesting future work.
Conclusion & Future Works
In this paper, we propose a real-time and accurate framework for referring expression comprehension. Completely different from previous two-stage methods, our proposed RCCF directly localizes the object described by an expression, predicting the object center from a correlation map computed between the expression and the image. RCCF achieves state-of-the-art performance on four referring expression datasets at real-time speed. For future work, we plan to explore how to capture more context information from the expression and the image, and thus understand the expression better.
- [Bertinetto et al.2016] Bertinetto, L.; Valmadre, J.; Henriques, J. F.; Vedaldi, A.; and Torr, P. H. 2016. Fully-convolutional siamese networks for object tracking. In ECCV.
- [Bolme et al.2010] Bolme, D. S.; Beveridge, J. R.; Draper, B. A.; and Lui, Y. M. 2010. Visual object tracking using adaptive correlation filters. In CVPR.
- [Chen et al.2017] Chen, K.; Kovvuri, R.; Gao, J.; and Nevatia, R. 2017. Msrc: Multimodal spatial regression with semantic context for phrase grounding. In ICMR.
- [Dai et al.2017] Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable convolutional networks. In ICCV.
- [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
- [Duan et al.2019] Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; and Tian, Q. 2019. Centernet: Object detection with keypoint triplets. arXiv preprint arXiv:1904.08189.
- [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
- [Henriques et al.2014] Henriques, J. F.; Caseiro, R.; Martins, P.; and Batista, J. 2014. High-speed tracking with kernelized correlation filters. TPAMI.
- [Hu et al.2016] Hu, R.; Xu, H.; Rohrbach, M.; Feng, J.; Saenko, K.; and Darrell, T. 2016. Natural language object retrieval. In CVPR.
- [Hu et al.2017] Hu, R.; Rohrbach, M.; Andreas, J.; Darrell, T.; and Saenko, K. 2017. Modeling relationships in referential expressions with compositional modular networks. In CVPR.
- [Kazemzadeh et al.2014] Kazemzadeh, S.; Ordonez, V.; Matten, M.; and Berg, T. 2014. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Krishna et al.2017] Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV.
- [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.
- [Law and Deng2018] Law, H., and Deng, J. 2018. Cornernet: Detecting objects as paired keypoints. In ECCV.
- [Lin et al.2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In ECCV.
- [Lin et al.2017] Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In ICCV.
- [Liu, Wang, and Yang2017] Liu, J.; Wang, L.; and Yang, M.-H. 2017. Referring expression generation and comprehension via attributes. In ICCV.
- [Luo and Shakhnarovich2017] Luo, R., and Shakhnarovich, G. 2017. Comprehension-guided referring expressions. In CVPR.
- [Mao et al.2016] Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A. L.; and Murphy, K. 2016. Generation and comprehension of unambiguous object descriptions. In CVPR.
- [Nagaraja, Morariu, and Davis2016] Nagaraja, V. K.; Morariu, V. I.; and Davis, L. S. 2016. Modeling context between objects for referring expression understanding. In ECCV.
- [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In EMNLP.
- [Plummer et al.2018] Plummer, B. A.; Kordas, P.; Hadi Kiapour, M.; Zheng, S.; Piramuthu, R.; and Lazebnik, S. 2018. Conditional image-text embedding networks. In ECCV.
- [Ren et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.
- [Rohrbach et al.2016] Rohrbach, A.; Rohrbach, M.; Hu, R.; Darrell, T.; and Schiele, B. 2016. Grounding of textual phrases in images by reconstruction. In ECCV.
- [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [Wang et al.2019] Wang, P.; Wu, Q.; Cao, J.; Shen, C.; Gao, L.; and Hengel, A. v. d. 2019. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In CVPR.
- [Yang, Li, and Yu2019a] Yang, S.; Li, G.; and Yu, Y. 2019a. Cross-modal relationship inference for grounding referring expressions. In CVPR.
- [Yang, Li, and Yu2019b] Yang, S.; Li, G.; and Yu, Y. 2019b. Dynamic graph attention for referring expression comprehension. In ICCV.
- [Yeh et al.2017] Yeh, R.; Xiong, J.; Hwu, W.-M.; Do, M.; and Schwing, A. 2017. Interpretable and globally optimal prediction for textual grounding using image concepts. In NIPS.
- [Yu et al.2016] Yu, L.; Poirson, P.; Yang, S.; Berg, A. C.; and Berg, T. L. 2016. Modeling context in referring expressions. In ECCV.
- [Yu et al.2017] Yu, L.; Tan, H.; Bansal, M.; and Berg, T. L. 2017. A joint speaker-listener-reinforcer model for referring expressions. In CVPR.
- [Yu et al.2018a] Yu, F.; Wang, D.; Shelhamer, E.; and Darrell, T. 2018a. Deep layer aggregation. In CVPR.
- [Yu et al.2018b] Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; and Berg, T. L. 2018b. Mattnet: Modular attention network for referring expression comprehension. In CVPR.
- [Zhang, Niu, and Chang2018] Zhang, H.; Niu, Y.; and Chang, S.-F. 2018. Grounding referring expressions in images by variational context. In CVPR.
- [Zhou, Wang, and Krähenbühl2019] Zhou, X.; Wang, D.; and Krähenbühl, P. 2019. Objects as points. arXiv preprint arXiv:1904.07850.
- [Zhuang et al.2018] Zhuang, B.; Wu, Q.; Shen, C.; Reid, I.; and van den Hengel, A. 2018. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In CVPR.