Our goal is to detect and localize objects in images that are referred to, or queried by, natural language phrases. This objective is commonly referred to as "phrase grounding". The key difference from conventional object detection approaches is that the categories of objects are not pre-specified; furthermore, the queries may contain attributes (such as a "red car") and relations between objects ("baby holding a pacifier"). Phrase grounding finds applications in tasks such as Visual Dialog, Visual Search and image-text co-reference resolution.
Grounding faces several challenges beyond those present when learning detectors for pre-specified categories. These include generalizing the models from limited data, resolving semantic ambiguities arising from an open-ended vocabulary and localizing small, hard-to-detect visual entities. Current approaches adopt a two-stage process for phrase grounding. In the first stage, a region proposal module generates proposals to identify regions likely to contain objects or groups of objects in an image. In the second stage, the grounding system employs a multimodal subspace that projects queries (textual modality) and their corresponding proposals (visual modality) so that matching pairs have a high correlation score.
Various approaches have been suggested to learn this subspace, such as knowledge transfer from image captioning, Canonical Correlation Analysis and query-specific attention over region proposals. To provide visual context, some approaches augment the proposals' visual features with bounding box features, while others employ neighboring proposal features. Query phrases are often generated from image descriptions, and neighboring phrases from a description can provide useful context, such as relative locations and relationships, to reduce semantic ambiguity. Using this context, some methods ground multiple queries for the same image jointly to resolve ambiguity and reduce conflicting predictions.
Given a query phrase, the above approaches consider all the proposals extracted from the image for grounding. This can result in inter-class errors, especially for object classes with few corresponding query phrases in the training dataset. The existing approaches are also upper-bounded by the accuracy of the proposal generator, since a query phrase can only choose from pre-generated proposals. Further, the existing approaches provide the bounding box location and/or the image's visual representation as context for a proposal. However, many queries are defined by attributes, and the suitability of an attribute to a proposal is relative. For example, for the query phrase "short boy" we need to compare each boy with all the other boys in the image to figure out which one is the "short boy".
To address the above challenges, we present a framework that uses Proposal Indexing, Relationship and Context for Grounding ("PIRC Net"), whose overview is presented in Figure 1. We propose an architecture with three modules; each succeeding module analyzes the region proposals for a query phrase at an increasing level of detail. The first module, Proposal Indexing Network (PIN), is a two-stage network that is trained to reduce the generalization errors and the volume of proposals required per query. The second module, Inter-phrase Regression Network (IRN), functions as an enhancement module by generating positive proposals for the cases where PIN does not generate one. The third and final module, Proposal Ranking Network (PRN), scores the proposals generated by PIN and IRN by incorporating query-specific context. A brief description of each module is provided below.
In the first stage, the PIN module classifies the region proposals into higher-level phrase categories. The phrase categories are generated by grouping semantically similar query phrases together using clustering. Intuitively, if the training data has no query phrases from the object class "bus", queries such as "red bus" should consider proposals with similar appearance to those of other queries from the same phrase category, such as "sports car" and "yellow truck". Skip-thought vectors are used to encode the phrases, since they latently identify the important nouns in a phrase and map semantically similar phrases to similar vectors. In the next stage, PIN learns to attend to the proposals most relevant to the query phrase, further reducing the volume of region proposals. The IRN module employs a novel method to estimate the location of one neighboring phrase from another, given the relationship between them. For example, for the phrase tuple 'small baby', 'holds', 'a pacifier', while it is difficult to detect 'a pacifier' alone, the relationship 'holds' can help localize 'a pacifier' from the location of its neighboring phrase 'small baby'. The PRN module trains a bimodal network which uses learned embeddings to score the region proposals (visual modality) given a query phrase (textual modality). To encode the visual context for a query phrase, we compare a region proposal to the other proposals from the same phrase category by encoding their relative appearance.
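The indexing idea above can be sketched end to end. The snippet below is a minimal NumPy illustration, not the paper's implementation: random vectors stand in for high-dimensional skip-thought embeddings, and a small k-means groups training phrases into categories to which a new query is then assigned.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means: returns cluster centers and assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Recompute each center; keep the old center if a cluster empties.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

# Stand-in for skip-thought embeddings of training query phrases.
rng = np.random.default_rng(1)
phrase_embeddings = rng.normal(size=(200, 64)).astype(np.float32)
centers, categories = kmeans(phrase_embeddings, k=8)

# A new query phrase is indexed to its nearest phrase category.
query = rng.normal(size=(64,)).astype(np.float32)
category = np.linalg.norm(centers - query, axis=1).argmin()
```

In the full system the cluster assignment selects which phrase-category proposals PIN considers for the query.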
During training, IRN and PRN learn to predict the proposals with high overlap with the ground-truth bounding box. However, ground-truth annotation is costly and most vision-and-language datasets do not provide it. While it is difficult to train IRN and PRN without this information, we provide methodologies to train PIN both with ground truth (supervised) and without ground truth (weakly-supervised) annotations. For the supervised setting, PIN is trained using an RPN to predict the proposals close to the ground-truth bounding boxes for a phrase category. For the weakly-supervised setting, we propose to use knowledge transfer from an object detection system to train PIN and retrieve region proposals that may belong to the query phrase category.
We evaluate our framework on two common phrase grounding datasets: Flickr30K Entities and the ReferIt Game. For the supervised setting, experiments show that our framework outperforms the existing state-of-the-art, achieving 6%/8% and 10%/15% improvements using VGG/ResNet architectures on the Flickr30K and ReferIt datasets respectively. For the weakly-supervised setting, our framework achieves 5% and 4% improvements over the state-of-the-art on the two datasets.
Our contributions are: (a) we design a query-guided Proposal Indexing Network that reduces generalization errors for grounding; (b) we introduce novel Inter-phrase Regression and Proposal Ranking Networks that leverage the context provided by multiple phrases in a caption; (c) we propose knowledge transfer mechanisms that employ object detection systems to index proposals in the weakly-supervised setting.
2 Related Work
Phrase grounding Improving greatly on early phrase grounding attempts that used a limited vocabulary, Karpathy et al. employ a bidirectional RNN to align sentence fragments and image regions in a common embedding space. Hu et al. proposed to rank proposals using knowledge transfer from image captioning. Rohrbach et al. employ attention to rank proposals in a latent subspace. Chen et al. extended this approach to account for regression based on query semantics. Plummer et al. suggest using Canonical Correlation Analysis (CCA) and Wang et al. suggest Deep CCA to learn similarity between the visual and language modalities. Wang et al. employ structured matching and boost performance using partial matching of phrase pairs. Plummer et al. further augment the CCA model to take advantage of extensive linguistic cues in the phrases. All these approaches are upper-bounded by the performance of the proposal generation systems.
Recently, Chen et al. proposed QRC, which overcomes some limitations of region proposal generators by regressing proposals based on the query and employs reinforcement learning techniques to punish conflicting predictions.
Visual and Semantic context Context provides broader information that can be leveraged to resolve semantic ambiguities and rank proposals. Hu et al. used global image context and a bounding box encoding as context to augment visual features. Yu et al. further encode size information and jointly predict all query regions to boost performance in a referring task. Plummer et al. jointly optimize neighboring phrases by encoding their relations for better grounding performance. Chen et al. employ semantic context to jointly ground multiple phrases and filter conflicting predictions among neighboring phrases. The existing approaches for grounding do not take full advantage of the rich information provided by visual context, semantic context and inter-phrase relationships.
Knowledge Transfer Knowledge transfer involves solving a target task by learning from a different but related source task. Li et al. employ knowledge transfer in learning to recognize visual object classes, while Rohrbach et al. use linguistic knowledge bases to automatically transfer information from source to target classes. Deselaers et al. and Rochan et al. employ knowledge transfer in weakly-supervised object detection and localization respectively, using skip vectors as a knowledge base. In this work, we propose to use knowledge transfer from the object detection task to index region proposals for weakly-supervised phrase grounding.
3 Our network
In this section, we present the architecture of our PIRC network (Figure 2). First, we provide an overview of the entire framework, followed by detailed descriptions of each of the three subnetworks: Proposal Indexing Network (PIN), Inter-phrase Regression Network (IRN) and Proposal Ranking Network (PRN).
3.1 Framework Overview
Given an image and query phrases, the goal of our system is to predict the locations of the visual entities specified by the queries. PIN takes each query phrase and the image as input and, in two stages, retrieves a set of "indexed region proposals". IRN generates region proposals for each query phrase by predicting its location from its relationship with neighboring query phrases of the same image (note: this information is only available if multiple query phrases are generated from the same image description). The union of the proposals generated by PIN and IRN is referred to as the "candidate proposals". Finally, PRN uses context-incorporated features to rank the candidate proposals and choose the one that is most relevant to the query phrase.
3.2 Proposal Indexing Network (PIN)
PIN retrieves, in two stages, a subset of region proposals that are likely correlated to a given query phrase. In Stage 1, PIN uses classification to categorize the region proposals. In Stage 2, PIN ranks the proposals and chooses a subset on which more complex analysis is performed in later stages by the IRN and PRN subnetworks.
Stage 1 The architecture of Stage 1 of PIN is analogous to Faster RCNN and has two subnetworks: a proposal generator (RPN) and a proposal classifier (RCNN). For proposal generation, we finetune a pretrained object detection network to generate region proposals instead of object proposals. For classification (RCNN), since each query phrase is distinct from every other, we propose to group them into a fixed number of higher-level phrase categories. To achieve this, we encode the phrases as skip-thought vectors and cluster them into a set number of phrase categories. Skip-thought vectors employ a data-driven encoder-decoder framework to embed semantically similar sentences into similar vector representations. Given a query phrase $q$, it is embedded into a skip-thought vector $\phi(q)$ and then categorized as follows:

$c^{*}(q) = \arg\min_{j} \lVert \phi(q) - \mu_{j} \rVert_{2}$   (1)

where $\mu_{j}$ is the center of cluster $j$.
Stage 2 In Stage 1, classification chooses the region proposals that likely belong to the query phrase category (Eq. 1). In Stage 2, these proposals are ranked based on their relevance to the query phrase.
For Stage 2, we employ visual attention on each region proposal from Stage 1 to rank its similarity to the query phrase. For each region proposal, its visual feature is concatenated with the query phrase embedding and these multimodal features are projected through an FC network to get a 5-dimensional prediction vector $p = (s, t_x, t_y, t_w, t_h)$. The first element is the similarity score between the proposal and the query embedding, and the next four elements are the regression parameters of the proposal. Visual features of proposals are obtained from the penultimate layer of the classification network in Stage 1. An LSTM is used to encode a query phrase as an embedding vector. For a region proposal with visual feature $v$ and query feature $q$ generated for a query phrase, the loss function is calculated as a combination of the ranking and regression losses:

$L_{PIN} = L_{rank} + \sum_{i \in \{x,y,w,h\}} \mathrm{smooth}_{L1}(t_i - t_i^{*})$

where $t^{*}$ are the regression parameters for the proposal relative to the ground truth and $\mathrm{smooth}_{L1}$ is the smooth-L1 loss function. The region proposals with the highest similarity to the query phrase are chosen as indexed region proposals for further inspection.
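As an illustration, this objective can be sketched as follows, assuming a binary cross-entropy ranking term on the similarity score and an unweighted smooth-L1 term on the four offsets (the exact form and weighting in the paper may differ):

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1 (Huber) loss applied elementwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x**2, ax - 0.5)

def stage2_loss(pred, label, t_star, lam=1.0):
    """pred: 5-vector (similarity logit, tx, ty, tw, th).
    label: 1 if the proposal matches the query phrase, else 0.
    t_star: ground-truth regression targets, shape (4,)."""
    s = 1.0 / (1.0 + np.exp(-pred[0]))           # similarity score in (0, 1)
    rank = -(label * np.log(s) + (1 - label) * np.log(1 - s))
    # Regression is only supervised for positive proposals.
    reg = smooth_l1(pred[1:] - t_star).sum() if label else 0.0
    return rank + lam * reg

pred = np.array([2.0, 0.1, -0.2, 0.05, 0.0])
loss = stage2_loss(pred, label=1, t_star=np.zeros(4))
```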
While the proposals chosen by PIN have fairly high accuracy, they still do not consider any relative attributes or relationships while ranking the proposals. The next modules, IRN and PRN, incorporate inter-phrase relations and context knowledge to improve the ranking among these indexed region proposals.
3.3 Inter-phrase Regression Network (IRN)
The Inter-phrase Regression Network uses a novel architecture that takes advantage of the relationship between two neighboring query phrases (from an image description) to estimate the relative location of a target phrase from a source phrase. Given a phrase tuple of source phrase, relationship and target phrase, IRN estimates the regression parameters to predict the location of the target phrase given the location of the source phrase, and vice versa. To model the visual features for regression, the representation of the source phrase must encode not only its visual appearance but also its spatial configuration and its interdependence with the target phrase. For example, the interdependence of 'person-in-clothes' is different from that of 'person-in-vehicle' and depends on where the 'person' is. To encode the spatial configuration, we employ the 5D vector $[x_{min}/W, y_{min}/H, x_{max}/W, y_{max}/H, wh/WH]$, where $(W, H)$ are the width and height of the image respectively. To encode the interdependence of the phrases, we use the phrase categories (from PIN) of the source and target phrases embedded as a one-hot vector. The relation between the two phrases is encoded using an LSTM; this is concatenated with the visual feature and projected using a fully connected layer to obtain the regression parameters for the target phrase location. For a query phrase and its neighboring phrases, the regression is estimated as follows:
where '$\oplus$' denotes the concatenation operator, $q$ is the query phrase whose regression parameters are predicted, $N(q)$ is the set of region proposals chosen by PIN for the neighboring phrases of $q$, $W$ and $b$ are projection parameters, and $\sigma$ is the non-linear activation function.
During training, the regression loss for the predicted regression parameters is calculated with respect to the ground-truth regression parameters, which are computed from the ground-truth location of the target phrase.
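Although the paper does not spell out the box parameterization, a standard R-CNN-style encoding consistent with this description can be sketched as follows; the boxes below are hypothetical values for the 'small baby' / 'a pacifier' example:

```python
import numpy as np

def decode_box(src, t):
    """Apply R-CNN-style offsets t = (tx, ty, tw, th) to a source box
    src = (x_center, y_center, w, h); returns the estimated target box."""
    x, y, w, h = src
    tx, ty, tw, th = t
    return np.array([x + tx * w,        # shift the center by fractions of size
                     y + ty * h,
                     w * np.exp(tw),    # scale width/height in log space
                     h * np.exp(th)])

def encode_box(src, tgt):
    """Inverse of decode_box: ground-truth offsets t* from src to tgt."""
    x, y, w, h = src
    gx, gy, gw, gh = tgt
    return np.array([(gx - x) / w, (gy - y) / h,
                     np.log(gw / w), np.log(gh / h)])

src = np.array([100.0, 100.0, 50.0, 80.0])   # e.g. box of "small baby"
tgt = np.array([120.0, 90.0, 20.0, 15.0])    # e.g. box of "a pacifier"
t_star = encode_box(src, tgt)
recovered = decode_box(src, t_star)
```

Decoding the predicted offsets from the source-phrase box yields the estimated target-phrase box; encoding the ground-truth pair yields the regression targets.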
The proposals estimated from both of the neighboring phrases (subject and object) are added to the proposals generated by PIN. IRN thus enhances the proposal set and is especially useful for smaller objects, which are often missed by proposal generators (Figure 4, query 3). The candidate proposal set is next passed to the Proposal Ranking Network.
3.4 Proposal Ranking Network (PRN)
The Proposal Ranking Network (PRN) is designed to incorporate visual and semantic context while ranking the region proposals. PRN employs a bimodal network to generate a confidence score for each region proposal using a discriminative correlation metric. For the visual modality, we employ contrastive visual features to differentiate among region proposals from the same phrase category. Each proposal is encoded by aggregating its relative appearance with respect to the other candidate proposals (Eq. 6). The relative appearance between a given proposal and any other candidate proposal is computed as the L2-normalized difference of the visual features of the two proposals. This relative appearance is then aggregated for a given proposal using average pooling:

$\bar{v}_{i} = \frac{1}{|R| - 1} \sum_{j \neq i} \frac{v_i - v_j}{\lVert v_i - v_j \rVert_2}$   (6)

where $R$ is the set of candidate proposals and $v_i$ is the visual feature of proposal $i$.
For the location and size feature representation, we encode the relative position and size of a given proposal with respect to all the other candidate proposals. This representation helps in capturing the attributes of a proposal compared with other candidate proposals and is especially helpful for relative references. For a pair of proposals $i$ and $j$, this feature is encoded as a 5D vector

$\big[\frac{x_i - x_j}{w_j}, \frac{y_i - y_j}{h_j}, \frac{w_i}{w_j}, \frac{h_i}{h_j}, \frac{w_i h_i}{w_j h_j}\big]$

using the relative distance between centers, relative width, relative height and relative size of the given proposal and any other candidate proposal respectively. The final visual representation is the concatenation of all the above representations (Eq. 7).
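Under our reading of this description, the contrastive appearance encoding can be sketched as:

```python
import numpy as np

def contrastive_features(V):
    """V: (n, d) visual features of the n candidate proposals.
    Returns (n, d) contrastive features: for each proposal, the average
    of the L2-normalized differences to every other candidate."""
    n, d = V.shape
    out = np.zeros_like(V)
    for i in range(n):
        diffs = V[i] - np.delete(V, i, axis=0)          # (n-1, d) differences
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        diffs = diffs / np.maximum(norms, 1e-8)         # L2-normalize each diff
        out[i] = diffs.mean(axis=0)                     # average pooling
    return out

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 16))      # placeholder features for 6 candidates
C = contrastive_features(V)
```

A proposal whose appearance differs strongly from its peers (e.g. the one "short boy" among taller ones) receives a distinctive contrastive vector.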
To encode the text modality (Eq. 8), we concatenate the query embedding with the embedding of the entire image description.
To compute the cross-modal similarity, the textual representation is first projected into the same dimensionality as the visual representation. Then the discriminative confidence score is computed, accounting for the bias between the two modalities.
To learn the projection weights and biases of both modalities during training, we employ a max-margin ranking loss that assigns higher scores to positive proposals. To account for multiple positive proposals in the candidate set, we experimented with both maximum and average pooling to obtain a representative positive score from the proposals; in our experiments, maximum pooling performed better. The ranking loss implies that the score of the highest-scoring positive proposal should be greater than that of each negative proposal by a margin.
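A minimal sketch of this ranking loss with max pooling over the positives (symbol names and the margin value are ours):

```python
import numpy as np

def prn_ranking_loss(pos_scores, neg_scores, margin=0.1):
    """Max-margin ranking loss: the best positive proposal should
    outscore every negative proposal by at least `margin`."""
    best_pos = np.max(pos_scores)                 # max pooling over positives
    hinge = np.maximum(0.0, margin - best_pos + np.asarray(neg_scores))
    return hinge.sum()

# best positive = 0.9; only the 0.85 negative violates the margin.
loss = prn_ranking_loss([0.9, 0.7], [0.3, 0.85])
```

Swapping `np.max` for `np.mean` gives the average-pooled variant the paper also tried.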
3.5 Supervised Training and Inference
The fully-connected network of PIN is alternately optimized with the RPN to index the proposals. Stage 2 of PIN is trained independently for 30 epochs with a learning rate of 1e-3. IRN and PRN are trained for 30 epochs with a starting learning rate of 1e-3 that is reduced by a factor of 10 every 10 epochs. During testing, the region proposal with the highest score from PRN is chosen as the prediction for a query phrase.
4 Weakly Supervised Training
We present our framework for weakly-supervised training in this section.
4.1 Weak Proposal Indexing Network (WPIN)
For weakly-supervised grounding, to overcome the lack of ground-truth information, we employ knowledge transfer from object detection systems to index relevant proposals for a query phrase. The knowledge transfer is two-fold: data-driven knowledge transfer and appearance-based knowledge transfer. We describe both methodologies below.
Data-driven Knowledge Transfer For data-driven knowledge transfer, our training objective is to learn representations of the phrase categories from a pre-trained object detection network. The pre-trained network's relevant object classes provide a strong initialization for generating proposals for the phrase categories (defined as in Section 3.2). For a region proposal $r$, the network is trained to predict the probability $p_c(r)$ for each phrase category $c$. The final scoring layer of the pre-trained network is replaced to predict the probabilities of the phrase categories, and each region proposal is represented by its distribution of phrase category probabilities. For training, the representations of the region proposals of an image are summed into image-level scores $s_c$, and the loss is a multi-label sigmoid classification loss:

$L = -\sum_{c} \big[ y_c \log \sigma(s_c) + (1 - y_c) \log(1 - \sigma(s_c)) \big]$

where $\sigma$ denotes the sigmoid function and $y_c = 1$ if the image contains phrase category $c$ and 0 otherwise. At test time, the region proposals with the highest scores for a query phrase are chosen as the indexed query proposals.
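A sketch of this training objective, assuming image-level scores are obtained by summing per-proposal scores (the data below is a placeholder):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wpin_image_loss(proposal_scores, labels):
    """proposal_scores: (n_proposals, n_categories) per-proposal logits.
    labels: (n_categories,) 1.0 if the image contains that phrase category.
    Sums proposal scores over the image, then applies a multi-label
    sigmoid cross-entropy loss over categories."""
    s = proposal_scores.sum(axis=0)               # image-level scores
    p = sigmoid(s)
    return -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=(12, 5))                 # 12 proposals, 5 categories
labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0])      # image-level supervision only
loss = wpin_image_loss(scores, labels)
```

Note that only image-level labels are needed, which is what makes the setting weakly supervised.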
Appearance-based Knowledge Transfer Appearance-based knowledge transfer is based on the expectation that semantically related object classes have visually similar appearances. While this may not hold universally and could mislead the system in a few cases, it provides strong generalization for the classes where it does. Given the probability scores of a set of source classes for a region proposal, the goal of the knowledge transfer is to learn the correlation score of a query phrase for that region proposal. To measure the correlation among different classes, we employ skip vectors, which embed semantically related words in similar vector representations. For a query phrase, we use its constituent nouns, extracted with the Stanford POS tagger, along with its phrase category as its semantic representation. For a set of proposals given by an object detection system with source class probability scores, we measure their correlation to the target phrase category and the phrase's constituent nouns.
The average of the appearance-based correlation and the data-driven probability is employed as the final correlation score of a proposal with the query phrase.
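The appearance-based transfer can be sketched as follows, assuming cosine similarity between word embeddings as the class-correlation measure; the embeddings, class names and probabilities below are placeholders:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def appearance_transfer_score(src_probs, src_embs, tgt_emb):
    """src_probs: detector probabilities for the source classes on a proposal.
    src_embs: word embeddings of the source class names.
    tgt_emb: embedding of the target phrase category / constituent noun.
    Score = detection probability weighted by semantic similarity."""
    sims = np.array([cosine(e, tgt_emb) for e in src_embs])
    return float(np.sum(src_probs * sims))

rng = np.random.default_rng(0)
src_embs = rng.normal(size=(4, 32))     # e.g. "car", "truck", "dog", "cat"
tgt_emb = src_embs[1] + 0.1 * rng.normal(size=32)   # phrase close to "truck"
src_probs = np.array([0.1, 0.7, 0.1, 0.1])
score = appearance_transfer_score(src_probs, src_embs, tgt_emb)
```

A proposal that the detector believes is a "truck" thus scores highly for a semantically nearby target phrase.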
4.2 Training and Inference
A Faster RCNN system pretrained on the MSCOCO dataset with the VGG architecture is employed for knowledge transfer. For training the weakly-supervised grounding system, an encoder-decoder network with attention is used to compute a reconstruction loss, similar to prior work. The learning rate of the network is set to 1e-4.
5 Experiments and Results
5.1 Datasets
Flickr30k Entities We use the standard split of 30,783 training and 1,000 testing images. Each image has 5 captions; 360K query phrases are extracted from these captions, referring to 276K manually annotated bounding boxes. Each phrase is assigned one of eight pre-defined phrase categories. We treat the connecting words between two phrases in a caption as a 'relation' and use the relations occurring at least 5 times for training IRN.
ReferIt Game We use the standard split of 9,977 training and 9,974 testing images. A total of 130K query phrases are annotated, referring to 96K distinct objects. Unlike Flickr30K, the query phrases are not extracted from a caption and do not come with an associated phrase category. Hence, we skip training IRN for ReferIt.
5.2 Experimental Setup
Proposal Indexing Network (PIN) A Faster RCNN pre-trained on PASCAL VOC 2007 is finetuned on the respective datasets for proposal generation. For Flickr30k and ReferIt Game, we use 10 and 20 cluster centers respectively, obtained by clustering the training query phrases, as target classes. Vectors from the last fc layer are used as the visual representation of each proposal. For Stage 2, the hidden size and dimension of the bi-LSTM are set to 1024.
Inter-phrase Regression Network (IRN) Since the query phrases of ReferIt Game are annotated individually, IRN is only applicable to the Flickr30k dataset. The visual features from PIN are concatenated with the 5D spatial representation (for regression) and an 8×8 one-hot embedded vector for the source and target phrase categories to generate the representation vector. Both the left and right neighboring phrases, if available, are used for regression prediction.
Proposal Ranking Network (PRN) For the visual stream, the visual features from PIN are augmented with the contrastive visual features and the 5D relative location features, generating an augmented feature vector. For the text stream, LSTM features are generated for both the query phrase and the corresponding caption. Each stream has 2 fully connected layers, each followed by a ReLU non-linearity and a Dropout layer with probability 0.5. The intermediate and output dimensions of the visual and text streams are [8192, 4096].
|Approach||Accuracy (%)|
|Deep Fragments ||21.78|
|Structured Matching ||42.08|
|CCA embedding ||50.89|
|PIN (VGG Net)||66.27|
|PIN + IRN (VGG Net)||70.17|
|PIN + PRN (VGG Net)||70.97|
|PIRC Net (VGG Net)||71.16|
|PIN (Res Net)||69.37|
|PIN + IRN (Res Net)||71.42|
|PIN + PRN (Res Net)||72.27|
|PIRC Net (Res Net)||72.83|
All convolutional and fully connected layers are initialized using MSRA and Xavier initialization respectively. All features are L2-normalized and batch normalization is employed before similarity computation. The training batch size is set to 40 and the learning rate to 1e-3 for both Flickr30k and ReferIt. The VGG architecture is used for PIN to be comparable to existing approaches; the ResNet architecture is further used to establish a new state-of-the-art with an improved visual representation. In the experiments, we use VGG Net for comparison with other methods and ResNet for performance analysis. We set the number of indexed proposals to 10 and 20 for VGG Net and ResNet respectively.
Accuracy is adopted as the evaluation metric: a predicted proposal is considered positive if it has an overlap (IoU) of at least 0.5 with the ground-truth location. To evaluate the efficiency of the indexing network, Top 3 and Top 5 accuracy are also presented.
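The evaluation criterion can be sketched as a plain IoU check:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts as correct if IoU >= 0.5.
pred, gt = (0, 0, 100, 100), (50, 0, 150, 100)
correct = iou(pred, gt) >= 0.5
```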
|Retrieval Rate||Top 1||Top 3||Top 5|
|Proposal Limit (no regression)||83.12|
5.3 Results on Flickr30k Entities
Performance We compare the performance of PIRC Net to existing approaches on the Flickr30k dataset. As shown in Table 1, PIN alone achieves a 1.13% improvement while using a fraction of the proposals compared to existing approaches. Adding IRN with the ranking objective, we achieve a 5.05% increase in accuracy. Adding PRN along with PIN, we achieve a 5.83% improvement. Our overall framework achieves a 6.02% improvement over QRC, the existing state-of-the-art approach. Further, employing the ResNet architecture for PIN gives an additional 1.67% improvement.
Weakly-supervised Grounding Performance We compare the performance of PIRC Net to two other existing approaches for weakly-supervised grounding. As shown in Table 2, we achieve a 5.34% improvement over the existing state-of-the-art.
|Context-aware RL ||36.18|
|PIN (VGG Net)||51.67|
|PIRC Net (PIN+PRN) (VGG Net)||54.32|
|PIN (Res Net)||56.67|
|PIRC Net (PIN+PRN) (Res Net)||59.13|
|LRCN (reported in )||8.59|
|CAFFE-7K (reported in )||10.38|
PIN indexing performance The effectiveness of PIN is measured by its ability to index proposals for a query phrase. For this, we measure the accuracy of proposals at Top 1, Top 3 and Top 5 ranks. We compare the results to QRC (results provided by the authors), the current state-of-the-art, in Table 3. The upper bound, i.e., the maximum accuracy possible with the RPN, is given in the last row. PIN consistently outperforms QRC in Top 3 and Top 5 retrieval, showcasing its effectiveness in indexing the proposals. Our method employs regression to move proposals toward positive locations, improving over this upper bound.
Per-category performance For the predefined phrase categories of Flickr30K Entities, PIRC Net consistently performs better across all categories.
5.4 Results on ReferIt Game
Performance Table 4 shows the performance of PIRC Net compared to existing approaches. The PIN network gives a 7.6% improvement over state-of-the-art approaches. The higher gains could be attributed to the richer diversity of objects in ReferIt Game compared to the Flickr30k Entities dataset. Employing PRN in addition to PIN gives a 10.25% improvement over QRC. Using the ResNet architecture gives an additional 4.81% improvement, leading to a 15.06% improvement over the state-of-the-art.
Weakly-supervised Grounding Performance We compare the performance of PIRC Net to two other existing approaches for weakly-supervised grounding on ReferIt. As shown in Table 5, we achieve a 3.70% improvement over the existing state-of-the-art.
PIN indexing performance As for Flickr30k, we analyze the effectiveness of PIN. The results are presented in Table 6. PIN consistently performs better in Top 3 and Top 5 retrieval.
|Retrieval Rate||Top 1||Top 3||Top 5|
5.5 Qualitative Results
We present qualitative results for a few samples from the Flickr30K Entities and ReferIt Game datasets (Fig. 4). For Flickr30K (top row), we show an image with its caption and five associated query phrases. We can see the importance of context in localizing the queries. For ReferIt (middle row), we show query phrases and the associated results. The bottom row shows failure cases.
6 Conclusion
In this paper, we addressed the problem of phrase grounding using PIRC Net, a framework that incorporates semantic and contextual cues to rank visual proposals. By incorporating these cues, our framework outperforms other baselines for phrase grounding. Further, we demonstrated the benefit of knowledge transfer from object detection systems for weakly-supervised grounding.
This paper is based, in part, on research sponsored by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under agreement number FA8750-16-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and the Defense Advanced Research Projects Agency or the U.S. Government.
-  Agharwal, A., Kovvuri, R., Nevatia, R., Snoek, C.G.M.: Tag-based video retrieval by embedding semantic content in a continuous word space. In: WACV (2016)
-  Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
-  Chen, K., Kovvuri, R., Gao, J., Nevatia, R.: MSRC: Multimodal spatial regression with semantic context for phrase grounding. In: ICMR (2017)
-  Chen, K., Kovvuri, R., Nevatia, R.: Query-guided regression network with context policy for phrase grounding. In: ICCV (2017)
-  Deselaers, T., Alexe, B., Ferrari, V.: Weakly supervised localization and learning with generic knowledge. In: IJCV (2012)
-  Donahue, J., Hendricks, L., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)
-  Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge. In: IJCV (2010)
-  Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP (2016)
-  Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: Learning global representations for image search. In: ECCV (2016)
-  Guadarrama, S., Rodner, E., Saenko, K., Darrell, T.: Understanding object descriptions in robotics by open-vocabulary object retrieval and detection. In: IJRR (2016)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
-  Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: CVPR (2016)
-  Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.L.: ReferItGame: Referring to objects in photographs of natural scenes. In: EMNLP (2014)
-  Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping. In: NIPS (2014)
-  Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R., Torralba, A., Urtasun, R., Fidler, S.: Skip-thought vectors. In: NIPS (2015)
-  Kong, C., Lin, D., Bansal, M., Urtasun, R., Fidler, S.: What are you talking about? Text-to-image coreference. In: CVPR (2014)
-  Wang, L., Li, Y., Lazebnik, S.: Learning deep structure-preserving image-text embeddings. In: CVPR (2016)
-  Fei-Fei, L.: Knowledge transfer in learning to recognize visual object classes. In: ICDL (2006)
-  Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
-  Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)
-  Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: ECCV (2016)
-  Plummer, B.A., Mallya, A., Cervantes, C.M., Hockenmaier, J., Lazebnik, S.: Phrase localization and visual relationship detection with comprehensive image-language cues. In: ICCV (2017)
-  Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: IJCV (2016)
-  Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: ICCV (2015)
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS (2015)
-  Rochan, M., Wang, Y.: Weakly supervised localization of novel objects using appearance transfer. In: CVPR (2015)
-  Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: ECCV (2016)
-  Rohrbach, A., Rohrbach, M., Tang, S., Oh, S.J., Schiele, B.: Generating descriptions with grounded and co-referenced people. In: CVPR (2017)
-  Rohrbach, M., Stark, M., Schiele, B.: Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In: CVPR (2011)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR (2014)
-  Uijlings, J.R., van de Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. In: IJCV (2013)
-  de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., Courville, A.: GuessWhat?! Visual object discovery through multi-modal dialogue. In: CVPR (2017)
-  Wang, M., Azab, M., Kojima, N., Mihalcea, R., Deng, J.: Structured matching for phrase localization. In: ECCV (2016)
-  Wu, F., Xu, Z., Yang, Y.: An end-to-end approach to natural language object retrieval via context-aware deep reinforcement learning. arXiv (2017)
-  Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: ECCV (2016)
-  Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In: ECCV (2014)