Image retrieval is a primary task in computer vision and a vital precursor for a wide range of topics in visual search. The core element of each retrieval algorithm is a procedure that queries an image database and returns a ranked list of images that are close to the query image. The ranking is defined with respect to a retrieval objective and a corresponding distance metric. The retrieval objective can be any underlying property of an image, such as its category (e.g., car, table, dog) [10, 17] or its visual attributes (e.g., metallic, hairy, soft) [7, 42]. This objective is expressed during training with a suitable criterion (e.g., a cross entropy loss for categories or a margin-based contrastive loss for attribute similarity) to encode the relevant information in a learned low-dimensional feature space (i.e., the embedding space) [18, 35]. The distance metric is typically a Euclidean distance or a learned metric that captures the pairwise similarity in the respective embedding space.
While this generic approach works well in practice, it is also static and inflexible. Once the model is trained, we can only retrieve images based on the single retrieval objective chosen at training time, e.g., images with similar attributes or images with similar categories. However, the space of image-associated metadata is rich. Each new type of hand-annotated or machine-inferred data constitutes an independent axis of interest that retrieval algorithms should account for (e.g., objects, captions, and other types of structured data related to the local and global composition of a scene). A straightforward way to incorporate such diverse objectives during the retrieval process is to learn a joint embedding with a hyper approach. This is not desirable for the following reasons: (1) it reduces the amount of available training data to the point where standard training techniques become infeasible; (2) the semantic axes of interest are often orthogonal: a plane can be white, but not all planes are white and not all white objects are planes, so augmenting the label space through such a coupling is neither flexible nor scalable; and (3) the contributions of the individual objectives are fixed and unweighted.
Instead, we propose a novel retrieval concept that retrieves images based on a convex combination of multiple retrieval objectives in a continuous and smooth manner, such that any valid weighting on this simplex of retrieval objectives can be chosen at test time. We refer to a specific point on this simplex as the simplicial retrieval operating point (SROP). In this work, we explore a setting with two retrieval objectives: categories and attributes. We propose a novel approach that allows targeting a specific SROP at test time. The resulting model can thus be viewed as a continuous slider that can retrieve images based on the query image's category, its attributes, or any other SROP between those two extremes (e.g., an equal weighting of both). The representation disentangling between categories and attributes is achieved through parallel memory networks that learn corresponding prototypes. In particular, we assign one memory for generalization, where we capture categorical prototypes, and another for specification, where we capture attribute-based prototypes. Both memories are learned end-to-end based on gated backpropagation of gradients from custom losses, one for each respective objective. Crucially, both the gates and the losses are SROP-weighted. At test time, our model can dynamically predict the suitable input embedding of an image for a targeted SROP by retrieving and mixing the relevant prototypes learned in both memories.
We make the following contributions in this work: (1) we introduce a novel retrieval task, namely spanning the continuous spectrum of image retrieval between different operating points; (2) we propose a multi-stream retrieval module allowing test-time selection of a simplicial retrieval operating point; (3) we introduce and validate novel optimization techniques that are necessary for efficient and effective learning of the retrieval modules; (4) we evaluate the advantages of our approach against common deep retrieval models and demonstrate its effectiveness in a real world application of visual fashion retrieval.
2 Related Work
The bulk of image retrieval research can be split into two main groups: instance- and category-based retrieval. In instance-based retrieval, we want to retrieve images of the exact same object instance presented in a query (potentially in a different pose or from a different viewing angle). Early work in this direction focused on matching low-level descriptors [43, 29, 36, 20], learning a suitable similarity metric [19, 27], or compact representations [18, 35, 5]. More recently, deep neural networks became predominant in the retrieval literature. CNNs, in particular, were used as off-the-shelf image encoders in many modern instance retrieval models [3, 11, 1, 39, 40, 33, 34, 37, 2, 46, 30]. Moreover, siamese and triplet networks [53, 12, 15, 48] demonstrated impressive performance as they provide an end-to-end framework for both embedding and metric learning.
Category-based retrieval methods are on the other end of the retrieval spectrum. Here, the models target the semantic content of the query image rather than the specific instance. For example, given a query image of a house on a river bank, we are interested in retrieving images with a similar semantic configuration (i.e., house + river) rather than images of the exact blue house with a red door in the query. This type of model learns a mapping between visual representations and semantic content, which can be captured by image tags [9, 10, 17, 38], word vector embeddings [8, 31], image captions, semantic graphs, or semantic attributes [7, 42, 16]. Most recently, hyper-models [47, 26, 25] have been proposed to learn an image embedding by jointly optimizing for multiple criteria, such as category and attribute classification. However, these types of models operate in one of the two retrieval spaces (i.e., categories or attributes) and not in between.
To the best of our knowledge, this work is the first to phrase the retrieval task as a continuous operation. Moreover, we propose a novel deep model that not only can operate at the two extremes of the retrieval spectrum but is also capable of dynamically traversing the simplex in between, generating intermediate retrieval rankings. Hence, it effectively provides an infinite set of retrieval models in one and can target the desired operating point at test-time using a specificity control parameter.
Our model is based on memory networks, which were later extended with an end-to-end learning framework. Memory networks have proven useful in many applications, including text-based [4, 23, 28, 44, 49] and vision-based [50, 51, 45] question answering and visual dialogs [6, 41]. We present a novel memory network architecture that learns and stores visual concept prototypes. We propose two different types of memories to distinguish between categorical and instance-based information. Moreover, we provide key insights on how to improve the learning of these concepts in memory using a novel dropout-based optimization approach.
3 Image Retrieval Along a Simplex
We propose an image retrieval approach that can operate along the retrieval simplex by distilling the information in a query image into category- and attribute-relevant signals. Using a novel memory network architecture (Fig. 1), our model learns the various visual prototypes present in the data and stores them in memory. The prototypes can be understood as non-linear sets of basis vectors separately encoding category- and attribute-specific information. Given a query image and a control parameter, our model learns to retrieve the relevant prototypes from memory and combine them in appropriate proportions. Thus, it constructs a custom embedding of the query to target a specific operating point on the retrieval simplex (SROP).
Given an image encoded with a deep model (e.g., a CNN), our proposed memory retrieval network (MRN) has the following main components: 1) a query module that projects the generic query embedding into category- and attribute-specific representations to address our parallel memory module; 2) a memory module that learns visual concept prototypes in a generalization and a specification memory, whose outputs are combined to form a category and an attribute representation for the query image; and 3) an output module that mixes the constructed representations in different proportions given a specificity control parameter α. The query embedding derived through the output module is then compared (e.g., using Euclidean distance) to similarly computed embeddings of each image in the dataset to arrive at the ranked list of retrieved images. Next, we provide a detailed description of our MRN and its various components.
3.1 Query module
Given an image encoded with a deep model, we query each memory module with a dedicated query vector: a category query for the generalization memory and an attribute query for the specification memory. These two queries are learned with two separate linear projection layers, each with its own weights and bias, so that each query adapts to the different representations stored in the two memories.
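As a concrete illustration, the two projections can be sketched in numpy as plain linear layers (function and variable names here are illustrative, not the paper's notation):

```python
import numpy as np

def make_queries(x, W_g, b_g, W_s, b_s):
    """Project a generic image embedding x into a category query (for the
    generalization memory) and an attribute query (for the specification
    memory) via two independent linear layers."""
    q_g = W_g @ x + b_g
    q_s = W_s @ x + b_s
    return q_g, q_s
```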
3.2 Memory Modules
In order to operate at an arbitrary SROP, we need to control the information embedded in the query representation. Hence, we propose to factorize the representation of an input query into a category- and an attribute-based representation. While the category-based representation captures the information shared between the query and all samples of the same category, the attribute-based representation captures the information that distinguishes the query from the rest (i.e., its visual attributes). By separating the two signals, we can then control how to mix these two representations to construct a new embedding of the query that targets the desired SROP on the retrieval simplex. Hence, our memory module is made up of two parallel units: a) the generalization memory, which learns category-based prototypes; and b) the specification memory, which learns attribute-based prototypes.
In the generalization memory, we would like to learn concept prototypes that capture information shared among all samples of the same category. Based on the intuitive assumption that samples usually belong to a single (or a few) base categories, we can use a softmax layer to attend to the memory cells. Given the sparse properties of a softmax layer, this allows us to learn sparse and discriminative category-based representations and attend to the most suitable ones when constructing the input embedding. That is, given a category query, the category-based embedding is constructed as a softmax-attention-weighted sum of the generalization memory cells, and this weighted sum forms the output of the generalization memory module.
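A minimal numpy sketch of this memory read, assuming a dot-product similarity between the query and each cell (the exact scoring function is not fixed in the text above):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def generalization_read(q_g, M_g):
    """Read from the generalization memory M_g (cells x dim): softmax
    attention over the cells yields a sparse mix of category prototypes."""
    attn = softmax(M_g @ q_g)   # one attention weight per cell, sums to 1
    g = attn @ M_g              # attention-weighted sum of the prototypes
    return g, attn
```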
Dropout Decay Layer.
While the sparsity of the softmax is beneficial for learning discriminative prototypes, it may result in stagnation during the optimization process, since the backpropagated error signals get channeled to only a few memory cells. To counter these optimization difficulties caused by the softmax layer, we propose a dropout decay layer (DDL) over the attentions. The DDL is a dropout layer with an initial dropout probability and a decay factor. Starting with an initially high dropout rate, the DDL pushes our model to activate and initialize many cells in the generalization memory rather than relying on just a few. As training progresses, the dropout probability is gradually dampened by the decay factor, enabling the model to learn a more discriminative representation in memory.
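One plausible reading of the DDL is sketched below, assuming a geometric decay of the dropout probability with the training epoch; the exact schedule and hyperparameter values are assumptions here:

```python
import numpy as np

def ddl_mask(attn, p0, decay, epoch, rng):
    """Dropout decay layer: drop each attention weight with a probability
    that starts at p0 and shrinks geometrically as training progresses."""
    p = p0 * (decay ** epoch)              # assumed decay schedule
    keep = rng.random(attn.shape) >= p     # Bernoulli keep mask
    return attn * keep
```

Early in training the mask zeroes out many attention weights, forcing other cells to receive gradient; later the mask converges to the identity.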
While each sample typically belongs to a single base category, it usually exhibits a set of multiple visual attributes that are instance specific and distinguish this instance from others within the same category. Hence, we use a sigmoid layer to attend to the cells of the specification memory, which allows us to select and compose multiple attributes. Given an attribute query, we construct the attribute-based representation as a sigmoid-attention-weighted sum of the specification memory cells. Unlike the softmax, the sigmoid layer does not produce a sparse attention over the memory, and hence we do not need a special activation mechanism as in the generalization memory.
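The specification read mirrors the generalization read, with a sigmoid replacing the softmax; the dot-product scoring is again an assumption of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def specification_read(q_s, M_s):
    """Read from the specification memory M_s: independent sigmoid gates
    let the query select several attribute prototypes at once."""
    attn = sigmoid(M_s @ q_s)   # per-cell weight in (0, 1), not normalized
    s = attn @ M_s
    return s, attn
```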
3.3 Output Module
Given the output representations of the two memory modules, we construct the representation of the input sample as a weighted linear combination of the category-based and attribute-based outputs, followed by an embedding of this combination. The specificity control parameter α weights the contribution of each memory module in the final representation. As α approaches zero, the category-based information is emphasized in the output embedding. By increasing α, we incorporate more instance-specific information (with attributes as a proxy) from the specification memory into the output embedding.
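The mixing step reduces to a convex combination; the sketch below assumes an optional linear embedding layer on top of the mixture:

```python
import numpy as np

def output_embedding(g, s, alpha, W=None):
    """Convex combination of the generalization output g and the
    specification output s: alpha = 0 is pure category, alpha = 1 is pure
    attribute. W is an optional linear embedding of the mixture."""
    o = (1.0 - alpha) * g + alpha * s
    return o if W is None else W @ o
```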
3.4 Multitask Training
To learn the representations in the memory cells of the generalization and specification memories, we need a learning objective that can distill the error signal into category- and attribute-based signals. We achieve this using multitask learning, i.e., we jointly optimize for two criteria to capture both signals. Additionally, both criteria are weighted by the specificity parameter α, which controls the contributions of the backpropagated error signals to the memory modules. We consider two options:
1) Classification-based Memory Retrieval Network (CbMRN).
Here, we optimize the model jointly for category and attribute classification:

L = (1 − α) · L_cat + α · L_att + L_reg,

where L_cat and L_att are cross entropy losses for the categories and attributes, respectively, and L_reg is a regularization loss over the model parameters.
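Assuming the (1 − α)/α weighting described above, the CbMRN objective can be sketched as follows; the binary-cross-entropy form of the attribute term is an assumption of this sketch:

```python
import numpy as np

def cbmrn_loss(cat_probs, cat_label, att_probs, att_labels, alpha, reg=0.0):
    """SROP-weighted multitask objective: category cross entropy scaled by
    (1 - alpha), attribute loss (binary cross entropy here) scaled by alpha."""
    l_cat = -np.log(cat_probs[cat_label])
    l_att = -np.mean(att_labels * np.log(att_probs)
                     + (1 - att_labels) * np.log(1 - att_probs))
    return (1 - alpha) * l_cat + alpha * l_att + reg
```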
2) Similarity-based Memory Retrieval Network (SbMRN).
This model optimizes jointly for category classification and pairwise instance similarity:

L = (1 − α) · L_cat + α · L_sim + L_reg,

where L_cat and L_reg are defined as above and L_sim is a margin-based contrastive loss.
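The margin-based contrastive loss, in its classic form, can be sketched as:

```python
import numpy as np

def contrastive_loss(e1, e2, similar, margin=1.0):
    """Margin-based contrastive loss: pull similar pairs together and push
    dissimilar pairs at least `margin` apart in the embedding space."""
    d = np.linalg.norm(e1 - e2)
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```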
The previous two models have different characteristics: since CbMRN is optimized with cross entropy losses, we expect a strong discriminative error signal from the object and attribute losses, which helps in learning more discriminative prototypes. On the other hand, SbMRN leverages a contrastive loss that captures the generic pairwise similarity of samples in the attribute embedding space. Experimentally, we find that this lends itself well to cases where it is desirable to maintain the category as α changes. Intuitively, SbMRN can be considered more generic than CbMRN, since arbitrary similarity metrics can be used in this formulation (e.g., caption similarity) to define the extrema of the retrieval simplex and, consequently, the SROPs that we wish to traverse.
For each training image, we sample α randomly from a uniform distribution. Note that α not only controls the mixing of category- and attribute-based prototypes, but also controls the error signal coming from the category- and attribute-based losses. It acts as a gating layer that controls the flow of information to each memory module and allows us to distill the backpropagated error signal during training. At test time, we can select α freely to control the SROP, which can be anywhere between pure category-based prototypes and pure attribute-based ones.
4 Experiments
Our experiments evaluate our model's ability to operate along the retrieval simplex. We first conduct a thorough evaluation of the modules proposed in Sec. 3 by validating our design choices and their impact on the performance and the learned concept prototypes (Sec. 4.1). Next, we evaluate the performance of both CbMRN and SbMRN on the proposed retrieval task and highlight their distinct properties. A comparison to popular baselines demonstrates the advantages of our approach (Sec. 4.2). We conclude our evaluation with an analysis of our model in a fashion retrieval application (Sec. 4.3).
We evaluate our model on MNIST Dialog, a recent variant of MNIST in which each image is augmented with binary visual attributes: foreground colors of the digit, background colors, and style-related attributes (stroke, flat). Unlike MNIST Dialog, which arranges the digits in grids, we evaluate our model on individual images, extracting separate training, validation, and test splits. Henceforth, we refer to this subset as the MNIST Attributes dataset (see Fig. 7).
In order to guarantee a fair comparison, we use the same core CNN (green box in Fig. 1) in all our experiments: a stack of convolutional layers followed by fully-connected layers with ReLU activations. We combine this core CNN with the modules illustrated in Fig. 1 to obtain our full model, and with various classifiers for our baseline comparisons. We train all models using Adam with a fixed initial learning rate.
4.1 Learning Visual Concept Prototypes
We start by analyzing the properties of our memory modules. We train a CNN model augmented with our generalization memory, followed by a softmax layer for category classification, and use a cross-entropy loss. We fix the number of memory cells, the initial dropout probability, and its decay factor for this experiment.
Dropout Decay Layer (DDL).
Fig. 1(a) shows the performance of our model with and without the proposed DDL. The model without DDL suffers from long periods of stagnation in performance (red curve). To better understand this phenomenon, we track, during learning, the activation history of the memory cells across the entire training dataset: examining Fig. 1(b), we notice that overcoming these stages of stagnation corresponds to the activation of a new memory cell, i.e., a new category prototype. DDL significantly improves the performance of the model by pushing it to activate multiple memory cells early in the training process rather than relying on a few initial prototypes (Fig. 1(c)). Moreover, the early epochs of training are now characterized by high activations of multiple cells, which gradually get dampened in the later stages. In summary, DDL not only counteracts the stagnation in training but also results in faster convergence and better overall performance.
To gauge what the model actually captures in memory, we validate it on test data by accumulating a histogram of each cell’s activations over the categories: Fig. 3(right) shows the average activations per category for each cell. We see that our model learns a clean prototype in each cell for each of the categories. In comparison to a model without DDL (Fig. 3(left)), the learned representations in memory are substantially more discriminative.
Interestingly, DDL is also more effective than standard dropout with a fixed probability (i.e., without decay): in addition to improved performance (Fig. 4(a)), we see in Fig. 4(b) that such a fixed dropout probability leads to learning mixed category prototypes in the memory cells. For some categories, the prototypes are split across multiple cells, while other cells capture a generic prototype that represents multiple categories. This is expected: due to the constant dropout during training, multiple cells are forced to capture similar category prototypes in order to cover the information loss incurred by dropping out some of the cells.
Our memory module is parametrized by the number of cells and their dimensionality. We analyze the impact of these hyperparameters on memory utilization and model performance in the following experiments:
Number of Cells: We vary the number of cells in the memory and check how many of them are utilized. We measure utilization as the percentage of cells activated for at least a minimum fraction of the samples. Fig. 4(left) shows that utilization steadily drops as cells are added, while the number of active cells remains stable at a value close to the number of categories in the dataset. We conclude that our model learns a sparse representation of the categories and uses the memory in a way that is consistent with the visual concepts present in the data. In other words, the model is not sensitive to this parameter and can effectively choose an appropriate number of active cells.
Dimensionality: We analyze the impact of cell dimensionality on the final performance of the model. We fix the number of memory cells and vary their dimensionality. Fig. 4(right) shows the resulting change in classification accuracy. Our model exhibits good robustness with regard to memory capacity and maintains high accuracy even with a low-dimensional representation.
While we analyzed the performance of the generalization memory, the analysis of the specification memory looks similar, apart from the denser activation patterns caused by the sigmoid (vs. softmax) attention for cell referencing.
4.2 From Category- to Attribute-based Retrieval
So far, we have evaluated our memory module and the impact of its configuration on utilization and accuracy. Next, we examine the performance and properties of our models along the retrieval simplex.
We train our full models with both generalization and specification memories using multitask learning. The CNN encoder is identical to the one used in Sec. 4.1. However, in this experiment, we train our model for objectives that capture category- and attribute-based information, as explained in Sec. 3.4. While CbMRN is trained for a joint objective of category and attribute classification, the SbMRN model is trained for category classification and pairwise similarity, i.e., a sample pair is deemed similar if both samples share the same attributes. During training, we sample α uniformly to enable the models to learn how to mix the visual concept prototypes at different operating points.
We measure the performance of category-based retrieval using C-TopK accuracy, the percentage of the top-K most similar samples whose category label matches that of the query. We measure attribute-based performance using A-TopK accuracy, the percentage of matching attributes among the top-K samples most similar to the query. At test time, we randomly sample queries from each category and rank the rest of the test data with respect to each query. The samples are embedded by the dynamic output of our MRN while gradually changing α from 0 to 1 to target the different SROPs. We use the Euclidean distance to measure the similarity between the embedding of the query and the other samples.
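The two metrics can be sketched as follows, where `ranked_cats` and `ranked_atts` hold the labels of the retrieved list sorted by distance (names are illustrative):

```python
import numpy as np

def c_topk(query_cat, ranked_cats, k):
    """Fraction of the top-k retrieved samples sharing the query's category."""
    return float(np.mean(np.asarray(ranked_cats[:k]) == query_cat))

def a_topk(query_atts, ranked_atts, k):
    """Mean fraction of binary attributes that match the query's attributes
    over the top-k retrieved samples."""
    top = np.asarray(ranked_atts[:k])
    return float(np.mean(top == np.asarray(query_atts)))
```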
Fig. 6 shows the retrieval performance of the two proposed models (CbMRN and SbMRN): as α goes from 0 to 1, category-based performance decreases and attribute-based performance increases in the classification-based model, as expected (Fig. 5(a)). The similarity-based model, on the other hand, shows stable, high performance for category-based retrieval as α varies and shares the attribute-based behavior of the previous model, albeit with a larger dynamic range (Fig. 5(b)). In summary, the classification-based model helps us cross the category boundary of the query to retrieve instances with similar attributes but from different categories. The similarity-based model traverses the manifold within the category boundaries and retrieves instances from the same category as the query with controllable attribute similarity. These conclusions are reinforced by our qualitative results, which we discuss next.
Fig. 7 shows qualitative results of both models: the first row (Q) is the query, and each column displays the top retrieved instances as α goes from 0 (left-most column) to 1 (right-most column). In the case of the CbMRN model (Fig. 6(a)), we see that the left-most column contains samples with the correct category of the query but with diverse attributes. As α increases, we retrieve instances with more matching attributes but also from increasingly diverse categories. With the SbMRN model (Fig. 6(b)), we can traverse the simplex while keeping the categorical information of the query, with increasing similarity of instance-specific information. We note that both of these behaviors may be desirable, depending on the application. For example, for generic image retrieval, control over the specificity of results may be desired, as achieved with the SbMRN model; however, for exploring fashion apparel (Sec. 4.3) with a given style, the CbMRN model may be more appropriate.
Comparison to Baselines.
We compare our approach to well-established retrieval models:
Siamese Network (SiamNet): a siamese model that captures the pairwise similarity of samples.
Hyper Network (HyperNet): a deep model trained for the joint objective of predicting both categories and attributes with equal weighting.
We implement these models and adapt them to our problem and dataset. All baseline models use the same core CNN structure as our own models (see Sec. 4). We represent each image using the output of the last hidden layer of the respective model. The retrieval is conducted as in the previous experiment, i.e., the query image is compared against all samples in the dataset and ranked using the Euclidean distance.
To the best of our knowledge, our model is the first one allowing retrieval at different operating points of the retrieval simplex. Since there are no standard performance measures for this task, we introduce a new measure of performance along the continuous retrieval simplex, namely the α-weighted average of category and attribute accuracy for the top-K retrieved samples:

Acc_α@K = (1 − α) · C-TopK + α · A-TopK.
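Under this reading of the measure (a reconstruction of the elided formula, so treat it as a sketch), the combined score is a straightforward weighted mean:

```python
def srop_accuracy(c_acc, a_acc, alpha):
    """Alpha-weighted average of category (c_acc) and attribute (a_acc)
    top-k accuracy at operating point alpha."""
    return (1.0 - alpha) * c_acc + alpha * a_acc
```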
Fig. 8 shows the performance of our models compared to the classical retrieval models. Note that each of the competitor models excels at a specific point on the retrieval simplex; these optimal operating points are highlighted with a marker. We see that SiamNet and AttNet show their best performance at α = 1, because their embeddings are optimized with the objective of attribute similarity. CatNet, on the other hand, operates best at α = 0, since it is optimized with the objective of category classification. The HyperNet model shows an intermediate performance among the competitors, in accordance with its joint optimization of category and attribute objectives. Our SbMRN model is the best-performing model at α = 0 but declines faster than HyperNet as α goes to 1. We believe this is because SbMRN tries to incorporate attribute information in addition to categorical information, which is substantially harder than a trade-off between both objectives. Despite lower performance at certain SROPs, SbMRN does exhibit useful behavior in practice. Finally, our CbMRN model shows the best overall performance along the retrieval simplex, exhibiting consistent performance as it traverses between the two extremes of α = 0 and α = 1.
4.3 Application: Traversing Visual Fashion Space
We conclude our evaluation with an application of our model to the real-world task of fashion exploration, where it allows the interactive traversal between the category and the visual style (i.e., attributes) of a query fashion item.
We test our model using the UT-Zappos50K dataset. The dataset contains footwear images collected from the online commerce website zappos.com. The items fall into four main categories (Sandals, Boots, Shoes, and Slippers). Each item is tagged with a set of labels provided by the vendors. We select the most frequent tags as attribute labels for the images. These attributes describe properties such as gender (e.g., women), closure (e.g., lace-up, zipper), heel height (e.g., 1in, 4in), and toe style (e.g., closed-toe, open-toe). The dataset is split into training and test sets.
We train our CbMRN model with a deeper backbone CNN consisting of convolutional layers followed by fully-connected layers. Additionally, we train baselines that share the same backbone architecture with our model, similar to the setup of our previous experiments.
Fig. 9 shows the performance of our model compared to the baselines in terms of top-5 and top-20 accuracy. As before, we see that overall our CbMRN model outperforms its competitors by a significant margin. Comparing Fig. 9 to Fig. 8, it is also evident that attribute retrieval on UT-Zappos50K is substantially harder than on MNIST Attributes. Even AttNet, which should operate optimally at α = 1, shows a performance closer to HyperNet. Upon further inspection, we notice that a large number of attributes have a strong correlation with category labels. For instance, attributes related to heel heights or toe styles appear almost exclusively with women's shoes, which makes learning an attribute representation that is disentangled from the associated categories challenging. Our model handles these cases well: at α = 1 it performs on par with HyperNet (top-5) and significantly better than all competing methods (top-20). Similar to Fig. 7, we show qualitative examples on UT-Zappos50K in Fig. 10, illustrating our model's ability to smoothly interpolate between categories and styles in fashion space.
We introduced a novel retrieval task, namely traversing the retrieval simplex smoothly between different operating points (SROPs). We proposed a novel memory-based deep model that can target the desired SROP along the simplex using a control parameter. Moreover, we presented key insights on how to optimize training and improve the learning of such a model using the proposed dropout decay layer. Additionally, we demonstrated the properties and differences between two learning schemes and highlighted the performance of our model against a set of established deep retrieval models. Finally, we hope that our findings will be a stepping stone for further research in this direction.
-  H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson. From generic to specific deep representations for visual recognition. In CVPR Workshops, 2015.
-  A. Babenko and V. Lempitsky. Aggregating local deep features for image retrieval. In ICCV, 2015.
-  A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In ECCV, 2014.
-  A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
-  Y. Cao, M. Long, B. Liu, and J. Wang. Deep cauchy hashing for hamming space retrieval. In CVPR, June 2018.
-  A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual dialog. arXiv preprint arXiv:1611.08669, 2016.
-  X. Y. Felix, R. Ji, M.-H. Tsai, G. Ye, and S.-F. Chang. Weak attributes for large-scale image retrieval. In CVPR, 2012.
-  A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. In NIPS, 2013.
-  Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 106(2):210–233, 2014.
-  Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. T-PAMI, 35(12):2916–2929, 2013.
-  Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, 2014.
-  A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In ECCV, 2016.
-  A. Gordo and D. Larlus. Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval. In CVPR, 2017.
-  R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
-  E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, 2015.
-  J. Huang, R. S. Feris, Q. Chen, and S. Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. In ICCV, 2015.
-  S. J. Hwang and K. Grauman. Learning the relative importance of objects from tagged images for retrieval and cross-modal search. IJCV, 100(2):134–153, 2012.
-  H. Jégou and O. Chum. Negative evidences and co-occurences in image retrieval: The benefit of pca and whitening. In ECCV, 2012.
-  H. Jégou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. IJCV, 87(3):316–336, 2010.
-  H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
-  J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015.
-  D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  C. Li, C. Deng, N. Li, W. Liu, X. Gao, and D. Tao. Self-supervised adversarial hashing networks for cross-modal retrieval. In CVPR, 2018.
-  H. Liu, R. Wang, S. Shan, and X. Chen. Learning multifunctional binary codes for both category and attribute oriented retrieval tasks. In CVPR, 2017.
-  A. Mikulik, M. Perdoch, O. Chum, and J. Matas. Learning vocabularies over a fine quantization. IJCV, 2013.
-  A. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126, 2016.
-  D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
-  H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han. Large-scale image retrieval with attentive deep local features. In ICCV, 2017.
-  M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
-  H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
-  M. Paulin, M. Douze, Z. Harchaoui, J. Mairal, F. Perronin, and C. Schmid. Local convolutional features with unsupervised training for image retrieval. In ICCV, 2015.
-  F. Perronnin and D. Larlus. Fisher vectors meet neural networks: A hybrid classification architecture. In CVPR, 2015.
-  F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed fisher vectors. In CVPR, 2010.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
-  F. Radenović, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV, 2016.
-  V. Ranjan, N. Rasiwasia, and C. V. Jawahar. Multi-label cross-modal retrieval. In ICCV, 2015.
-  A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, 2014.
-  A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki. Visual instance retrieval with deep convolutional networks. ITE Transactions on Media Technology and Applications, 4(3):251–258, 2016.
-  P. H. Seo, A. Lehrmann, B. Han, and L. Sigal. Visual reference resolution using attention memory for visual dialog. In NIPS, 2017.
-  B. Siddiquie, R. S. Feris, and L. S. Davis. Image ranking and retrieval based on multi-attribute queries. In CVPR, 2011.
-  J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
-  S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In NIPS, 2015.
-  M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. MovieQA: Understanding stories in movies through question-answering. In CVPR, 2016.
-  G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of CNN activations. In ICLR, 2015.
-  A. Veit, S. J. Belongie, and T. Karaletsos. Conditional similarity networks. In CVPR, 2017.
-  J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
-  J. Weston, S. Chopra, and A. Bordes. Memory networks. In ICLR, 2015.
-  C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.
-  H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.
-  A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.
-  T. Yu, J. Yuan, C. Fang, and H. Jin. Product quantization network for fast image retrieval. In ECCV, 2018.