1 Introduction
Deep learning with end-to-end problem formulations has reshaped visual recognition methods over the past few years. The core problems of high-level vision, e.g. recognition, detection and segmentation, are commonly formulated as classification tasks. Classifiers are applied image-wise for recognition [19], region-wise for detection [30], and pixel-wise for segmentation [22]. Classification in deep neural networks is usually implemented as a multi-way parametric softmax, and assumes that the categories are fixed between learning and evaluation.
However, such a “closed-world” assumption does not hold for the open world, where new categories can appear, often with very few training examples. For example, in face recognition [41, 40], new identities should be recognized after just a single occurrence. Due to this open-set nature, one may want to generalize the feature embedding instead of learning another parametric classifier. A common practice for embedding is to simply chop off the softmax classification layer of a pre-trained network and take the last-layer features. However, such a transfer learning scheme is not optimal, because these features only make sense for a linear classification boundary in the training space, most likely not in the new testing space. Instead of learning parametric classifiers, we can learn an embedding that directly optimizes a feature representation which preserves distance metrics in a non-parametric fashion. Numerous works have investigated various loss functions (e.g. contrastive loss [10], triplet loss [14, 26]) and data sampling strategies [47] for improving embedding performance.

Non-parametric embedding approaches have also been applied to computer vision tasks other than face recognition. Exemplar-based models have been shown to be effective for learning object classes [2] and object detection [25]. These non-parametric approaches build associations between data instances [23], and turn out to be useful for meta-knowledge transfer [25], which would not be readily possible for parametric models. So far, none of these non-parametric methods have become competitive on state-of-the-art image recognition benchmarks such as ImageNet classification [31] and MS-COCO object detection [21]. However, we argue that the time is right to revisit non-parametric methods, to see whether they can provide the generalization capabilities lacking in current approaches.

We investigate a neighborhood approach to image classification by learning a feature embedding through deep neural networks. The core of our approach is a metric learning model based on Neighborhood Component Analysis (NCA) [8]. For each training image, NCA computes its distance to all the other images in the embedding space. The distances can then be used to define a classification distribution according to the class labels. Batch training with all the images is computationally expensive, making the original NCA algorithm difficult to scale to large datasets. Inspired by prior works [48, 49], we propose to store the embeddings of all images in the dataset in an augmented non-parametric memory. The non-parametric memory is not learned by stochastic gradient descent, but simply updated after each training image is visited. During testing, we build a k-nearest-neighbor (kNN) classifier based on the learned metric.
Our work makes three main contributions. 1) We scale up NCA to handle large-scale datasets and deep neural networks by using an augmented memory to store non-parametric embeddings. 2) We demonstrate that a nearest neighbor classifier can achieve remarkable performance on the challenging ImageNet classification benchmark, nearly on par with parametric methods. 3) Our learned feature, trained with the same embedding method, delivers improved generalization for new categories, which is desirable for sub-category discovery and few-shot recognition.
2 Related Works
Object Recognition. Object recognition is one of the holy-grail problems in computer vision. Most prior works cast recognition either as a category naming problem [3, 4] or as a data association problem [23]. Category naming assumes that all instances belonging to the same category are similar and that category membership is binary (either all-in or all-out). Most research in this area focuses on designing better invariant category representations (e.g. bag-of-words [45], pictorial models [5]). On the other hand, data association approaches [2, 50, 23, 24] regard categories as data-driven entities emergent from connections between individual instances. Such non-parametric paradigms are informative and powerful for transferring knowledge which may not be explicitly present in the labels. In the era of deep learning, however, the performance of exemplar-based approaches hardly reaches the state of the art on standard classification benchmarks. Our work revisits the direction of data association models, learning an embedding representation that is tailored for nearest neighbor classifiers.
Learning with Augmented Memory. Since the formulation of the LSTM [13], the idea of using memory in neural networks has been widely adopted for various tasks [12]. Recent approaches to augmented memory fall into two camps. One camp incorporates memory into neural networks as an end-to-end differentiable module [9, 46], with an automatic attention mechanism [33, 43] for reading and writing. These models are usually applied to knowledge-based reasoning [9, 43] and sequential prediction tasks [38]. The other camp treats memory as a non-parametric representation [42, 48, 49], where the memory size grows with the dataset size. Matching networks [42] explore few-shot recognition using augmented memory, but their memory only holds the representations in the current mini-batch of images. Our memory is also non-parametric, in a similar manner to storing instances for unsupervised learning [48]. The key distinction is that our approach learns a memory representation with millions of entries for supervised large-scale recognition.

Metric Learning. There are many metric learning approaches [17, 8], some achieving state-of-the-art performance in image retrieval [47], face recognition [35, 40, 44], and person re-identification [49]. In such problems, since the classes during testing are disjoint from those encountered during training, one can only make inferences based on the feature representation, not on the subsequent linear classifier. Metric learning encourages the minimization of intra-class variations and the maximization of inter-class variations, e.g. via contrastive loss [1, 37] or triplet loss [14]. Recent works on few-shot learning [42, 36] also show the utility of metric learning, since it is difficult to optimize a parametric classifier with very few examples.

NCA. Our work is built upon the original proposal of Neighborhood Component Analysis (NCA) [8] and its non-linear extension [32]. In these formulations [32], the features of the entire dataset need to be computed at every step of the optimization, making the method computationally expensive and not scalable to large datasets. Consequently, NCA has mainly been applied to small datasets such as MNIST, or for dimensionality reduction [32]. Our work is the first to demonstrate that NCA can be applied successfully to large-scale datasets.
3 Approach
We adopt a feature embedding framework for image recognition. Given a query image $x$, we embed it into the feature space as $v = f_\theta(x)$. The embedding function $f_\theta$ is formulated as a deep neural network with parameters $\theta$ learned from a training set $D$. The embedding $v$ is then queried against a set of images in a search database $D'$, according to a similarity metric. Images with the highest similarity scores are retrieved, and information from these retrieved images can be transferred to the query image $x$.

Since the classification process does not rely on extra model parameters, the non-parametric framework naturally extends to images in novel categories without any model fine-tuning. Consider three settings of $D'$:

1) When $D' = D$, i.e. the search database is the same as the training set, we have closed-set recognition, as in the ImageNet challenge.

2) When $D'$ is annotated with labels different from those in $D$, we have open-set recognition, such as sub-category discovery and few-shot recognition.

3) Even when $D'$ is completely unannotated, the metric can be useful for general content-based image retrieval.

The key question is how to learn such an embedding function $f_\theta$. Our approach builds upon NCA [8] with our modifications.
3.1 Neighborhood Component Analysis
3.1.1 Non-parametric formulation of classification.
Suppose we are given a labeled dataset of $n$ examples $x_1, \ldots, x_n$ with corresponding labels $y_1, \ldots, y_n$. Each example $x_i$ is embedded into a feature vector $v_i = f_\theta(x_i)$. We first define the similarity $s_{ij}$ between instances $i$ and $j$ in the embedded space as cosine similarity. We further assume that the feature $v_i$ is $\ell_2$-normalized. Then

$$s_{ij} = v_i^\top v_j = \cos(\theta_{ij}), \quad (1)$$

where $\theta_{ij}$ is the angle between the vectors $v_i$ and $v_j$. Each example $x_i$ selects example $x_j$ as its neighbor with probability $p_{ij}$, defined as

$$p_{ij} = \frac{\exp(s_{ij} / \sigma)}{\sum_{k \neq i} \exp(s_{ik} / \sigma)}, \qquad p_{ii} = 0. \quad (2)$$

Note that an example cannot select itself as its neighbor, i.e. $p_{ii} = 0$; the probability $p_{ij}$ is thus called the leave-one-out distribution on the training set. Since the range of the cosine similarity is $[-1, 1]$, we add an extra parameter $\sigma$ to control the scale of the neighborhood.

Let $\Omega_i = \{ j \mid y_j = y_i \}$ denote the indices of training images which share the same label with example $x_i$. Then the probability of example $x_i$ being correctly classified is

$$p_i = \sum_{j \in \Omega_i} p_{ij}. \quad (3)$$

The overall objective is to minimize the expected negative log-likelihood over the dataset,

$$J = -\frac{1}{n} \sum_{i=1}^{n} \log(p_i). \quad (4)$$

Learning proceeds by directly optimizing the embedding without introducing additional model parameters. Note that each training example depends on all the other examples in the dataset. The gradient of the per-example objective $J_i = -\log(p_i)$ with respect to $v_i$ is

$$\frac{\partial J_i}{\partial v_i} = \frac{1}{\sigma} \sum_{k} p_{ik} v_k - \frac{1}{\sigma} \sum_{k \in \Omega_i} \tilde{p}_{ik} v_k, \quad (5)$$

and the gradient with respect to $v_j$ for $j \neq i$ is

$$\frac{\partial J_i}{\partial v_j} = \frac{1}{\sigma} p_{ij} v_i - \mathbb{1}[j \in \Omega_i] \, \frac{1}{\sigma} \tilde{p}_{ij} v_i, \quad (6)$$

where

$$\tilde{p}_{ij} = \frac{p_{ij}}{\sum_{k \in \Omega_i} p_{ik}}$$

is the normalized distribution within the ground-truth category.
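To make the formulation concrete, the leave-one-out distribution and objective of Eqns 1–4 can be sketched in a few lines of NumPy. This is our own illustrative implementation on random toy features, not a released codebase; the function and variable names are our assumptions.

```python
import numpy as np

def nca_loss(features, labels, sigma=0.05):
    """Leave-one-out NCA objective on L2-normalized features.

    features: (n, d) array; labels: (n,) integer array.
    Returns the mean negative log-likelihood J of Eqn 4.
    """
    v = features / np.linalg.norm(features, axis=1, keepdims=True)
    s = v @ v.T                        # cosine similarities s_ij (Eqn 1)
    logits = s / sigma
    np.fill_diagonal(logits, -np.inf)  # p_ii = 0: an example never picks itself
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    p = e / e.sum(axis=1, keepdims=True)         # leave-one-out p_ij (Eqn 2)
    same = labels[:, None] == labels[None, :]
    p_i = (p * same).sum(axis=1)       # prob. of correct classification (Eqn 3)
    return -np.log(p_i + 1e-12).mean() # objective J (Eqn 4)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(nca_loss(x, y))
```

As a sanity check, perfectly separated clusters (identical features within a class, orthogonal across classes) drive the objective toward zero.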
3.1.2 Differences from parametric softmax.
The traditional parametric softmax distribution is formulated as

$$p(c \mid x) = \frac{\exp(w_c^\top v)}{\sum_{c'=1}^{C} \exp(w_{c'}^\top v)}, \quad (7)$$

where each category $c$ has a parametrized prototype $w_c$ to represent itself. Maximum likelihood learning aligns all examples in the same category with the category prototype. In the above NCA formulation, however, the optimal solution is reached when the probability of the negative examples ($j \notin \Omega_i$) vanishes. The learning signal does not force all the examples in the same category to align with the current training example: the probability of some positive examples ($j \in \Omega_i$) can also vanish, so long as some other positives align well enough with the $i$-th example. In other words, the non-parametric formulation does not assume a single prototype per category, and this flexibility allows learning to discover inherent structures when there are significant intra-class variations in the data. Eqn 5 explains how each example contributes to the learning gradients.
3.1.3 Computational challenges for learning.
Learning NCA even for a single objective term requires the embeddings, as well as the gradients (Eqn 5 and Eqn 6), of the entire dataset. This computational demand quickly becomes impossible to meet for a large-scale dataset with a deep neural network learned via stochastic gradient descent. Sampling-based methods such as the triplet loss [40] can drastically reduce the computation by selecting a few neighbors. However, hard-negative mining turns out to be crucial, and the typical batch size of 1800 examples [40] could still be impractical.
We take an alternative approach to reduce the amount of computation, introducing two crude approximations.

1) We only perform gradient descent on $\partial J_i / \partial v_i$ (Eqn 5), ignoring the gradients with respect to the other examples (Eqn 6).

2) Computing the gradient for $v_i$ still requires the embedding of the entire dataset, which would be prohibitively expensive for each mini-batch update. We introduce an augmented memory to store the embeddings for approximation. More details follow.
3.2 Learning with Augmented Memory
We store the feature representations of the entire dataset as an augmented non-parametric memory, and learn our feature embedding network through stochastic gradient descent. At the beginning of the $(t+1)$-th iteration, suppose the network parameters have the state $\theta^{(t)}$, and the non-parametric memory is in the form of $\{\bar{v}_1^{(t)}, \ldots, \bar{v}_n^{(t)}\}$. Suppose that the memory is roughly up-to-date with the parameters at iteration $t$, i.e. the non-parametric memory is close to the features extracted from the data using parameters $\theta^{(t)}$:

$$\bar{v}_i^{(t)} \approx f_{\theta^{(t)}}(x_i). \quad (8)$$

During the $(t+1)$-th iteration, for a training instance $x_i$, we forward it through the embedding network to obtain the feature $v_i = f_{\theta^{(t)}}(x_i)$, and calculate the gradient as in Eqn 5, but using the approximated embeddings in the memory:

$$\frac{\partial J_i}{\partial v_i} = \frac{1}{\sigma} \sum_{k} p_{ik} \bar{v}_k^{(t)} - \frac{1}{\sigma} \sum_{k \in \Omega_i} \tilde{p}_{ik} \bar{v}_k^{(t)}. \quad (9)$$

Then the gradients with respect to the network parameters can be back-propagated,

$$\frac{\partial J_i}{\partial \theta} = \frac{\partial J_i}{\partial v_i} \cdot \frac{\partial v_i}{\partial \theta}. \quad (10)$$

Since we have already forwarded $x_i$ to get the feature $v_i$, we update the memory for the training instance $x_i$ by the empirical weighted average [49],

$$\bar{v}_i^{(t+1)} \leftarrow (1 - m)\, \bar{v}_i^{(t)} + m\, v_i, \quad (11)$$

where $m$ is a momentum parameter. Finally, the network parameters $\theta$ are updated through stochastic gradient descent. If the learning rate is small enough, the memory stays up-to-date with the changing parameters. The non-parametric memory slot of each training image is only updated once per learning epoch. Though the embedding is only approximately estimated, we have found it to work well in practice.
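One optimization step of Eqns 9–11 can be sketched as below. This is our own NumPy sketch, not released code, under stated assumptions: it processes a single training instance, and it renormalizes the updated memory slot, which the text does not specify; the helper name `nca_memory_step` is illustrative.

```python
import numpy as np

def nca_memory_step(v_i, i, label, memory, labels, sigma=0.05, m=0.5):
    """One training-instance step against a non-parametric memory bank.

    v_i: (d,) normalized embedding of instance i from the current network.
    memory: (n, d) array of stored, normalized embeddings (Eqn 8).
    Returns dJ_i/dv_i (Eqn 9) and updates memory[i] in place (Eqn 11).
    """
    s = memory @ v_i                 # similarities to every stored embedding
    s[i] = -np.inf                   # leave-one-out: p_ii = 0
    logits = s / sigma
    logits -= logits[np.isfinite(logits)].max()  # numerical stability
    p = np.exp(logits)
    p /= p.sum()                     # p_ik over the whole dataset (Eqn 2)
    same = labels == label
    p_tilde = np.where(same, p, 0.0)
    p_tilde /= p_tilde.sum()         # normalized within the ground-truth class
    grad = (p[:, None] * memory).sum(0) / sigma \
         - (p_tilde[:, None] * memory).sum(0) / sigma   # Eqn 9
    # Eqn 11: empirical weighted average; renormalization is our assumption
    memory[i] = (1 - m) * memory[i] + m * v_i
    memory[i] /= np.linalg.norm(memory[i])
    return grad
```

In a real training loop, `grad` would be back-propagated through the network (Eqn 10) by the autograd engine rather than applied by hand.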
3.3 Discussion on Complexity
In our model, the non-parametric memory $\bar{v}$, the similarity metric $s$, and the probability density $p$ may potentially require large storage and pose computational bottlenecks. We give an analysis of the model complexity below.

Suppose our final embedding is of size $d$, and we train our model on a large-scale dataset of $n$ images with a batch size of $b$. The non-parametric memory requires $O(nd)$ storage. The similarity metric and the probability density each require $O(bn)$ storage for the value and the gradient. In our current implementation, other intermediate variables used for computing the intra-class distribution require another $O(bn)$ storage. Together, these terms dominate the memory footprint of the NCA module.
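Plugging in representative numbers gives a sense of the scale. The figures below are our assumptions for illustration: an ImageNet-sized training set of $n = 1.28$M images, the $d = 128$ embedding used in our experiments, batch size $b = 256$, and 32-bit floats.

```python
n, d, b = 1_280_000, 128, 256   # assumed: dataset size, embedding dim, batch size
bytes_per_float = 4             # 32-bit floats

memory_gb = n * d * bytes_per_float / 1e9   # non-parametric memory, O(nd)
sim_gb = b * n * bytes_per_float / 1e9      # one b-by-n similarity matrix, O(bn)

print(f"memory bank: {memory_gb:.2f} GB")                       # ~0.66 GB
print(f"similarity matrix (value or gradient): {sim_gb:.2f} GB")  # ~1.31 GB
```

Under these assumptions the memory bank itself is modest, while each dense $b \times n$ similarity buffer is the larger cost, which is why the $O(bn)$ terms dominate.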
In terms of time complexity, the summation in Eqn 2 and Eqn 3 across the whole dataset becomes the bottleneck of NCA. In practice, however, with a GPU implementation the NCA module takes only a reasonable amount of extra time relative to the backbone network. During testing, exhaustive nearest neighbor search against one million entries is also reasonably fast; the time it takes is negligible relative to the forward pass through the backbone network.
The complexity of our model scales linearly with the training set size. Our current implementation can handle datasets at the ImageNet scale, but cannot scale up to 10 times more data based on the above calculations. A possible strategy for handling bigger data is to subsample a few neighbors instead of using the entire training set; sampling would reduce the linear time complexity to a constant. For nearest neighbor search at run time, the computational cost can be mitigated with proper data structures such as ball trees [7] and quantization methods [16].
4 Experiments
We conduct experiments to investigate whether our non-parametric feature embedding performs well in the closed-world setting and, more importantly, whether it improves generalization in the open-world setting.

First, we evaluate the learned metric on the large-scale ImageNet ILSVRC challenge [31]. Our embedding achieves competitive recognition accuracy with k-nearest-neighbor classifiers using the same ResNet architecture. Second, we study an important property of our representation for sub-category discovery, where a model trained with only coarse annotations is transferred for fine-grained label prediction. Lastly, we study how our learned metric can be transferred and applied to unseen object categories for few-shot recognition.
4.1 Image Classification
We study the effectiveness of our non-parametric representation for visual recognition on the ImageNet ILSVRC dataset. We use parametric softmax classification networks as our baselines.
Network Configuration. We use the ConvNet architecture ResNet [11] as the backbone of the feature embedding network. We remove the last linear classification layer of the original ResNet and append a linear layer which projects the feature to a low-dimensional 128-d space. The 128-d feature vector is then normalized and fed to NCA learning. Our approach does not induce extra parameters for the embedding network.
Learning Details. During training, we use an initial learning rate of 0.1, decayed by a factor of 10 every 40 epochs, for a total of 130 epochs. Our network converges a bit more slowly than the baseline network, partly due to the approximate updates of the non-parametric memory. We gradually increase the momentum for updating the memory over the course of learning. We use a temperature parameter $\sigma = 0.05$ in the main results. All other optimization details and hyper-parameters remain the same as in the baseline approach; we refer the reader to the PyTorch implementation [28] of ResNet for details. During testing, we use a weighted k-nearest-neighbor classifier for classification. Our results are insensitive to the parameter $k$; values in a broad range give very similar results. We report the accuracy with $k = 1$ and $k = 30$, using single center crops.
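The weighted kNN classifier used at test time can be sketched as follows. This is our own minimal version: weighting each neighbor's vote by $\exp(s/\sigma)$ is an assumption consistent with the NCA training objective, and the function name is illustrative.

```python
import numpy as np

def weighted_knn(query, memory, labels, num_classes, k=30, sigma=0.05):
    """Classify a normalized query embedding against stored embeddings.

    Each of the top-k neighbors votes for its own class, weighted by
    exp(cosine_similarity / sigma).
    """
    sims = memory @ query                    # cosine similarity to all entries
    topk = np.argpartition(-sims, k)[:k]     # indices of the k nearest neighbors
    weights = np.exp(sims[topk] / sigma)
    votes = np.zeros(num_classes)
    np.add.at(votes, labels[topk], weights)  # accumulate weighted votes per class
    return int(votes.argmax())
```

On a toy memory of two orthogonal clusters, a query aligned with either cluster is assigned to that cluster's class.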


Table 1: Evaluation of the learned features with weighted kNN classifiers on ImageNet (top-1 accuracy, %).

ResNet18:
  Feature    d     k=1    k=30
  Baseline   512   62.91  68.41
  +PCA       128   60.43  66.26
  Ours       128   67.39  70.58

ResNet34:
  Feature    d     k=1    k=30
  Baseline   512   67.73  72.32
  +PCA       128   65.58  70.67
  Ours       128   71.81  74.43

ResNet50:
  Feature    d     k=1    k=30
  Baseline   2048  71.35  75.09
  +PCA       128   69.72  73.69
  Ours       128   74.34  76.67

Table 2: Comparison with the parametric softmax classifier on ImageNet (accuracy, %).

  Feature    baseline top-1  baseline top-5  ours top-1  ours top-5
  ResNet18   69.64           88.98           70.58       89.38
  ResNet34   73.27           91.43           74.43       91.35
  ResNet50   76.01           92.93           76.67       92.84

Table 3: Ablations on the embedding size d and the temperature σ (top-1 accuracy, %).

  d     k=1    k=30
  256   67.54  70.71
  128   67.39  70.59
  64    65.32  69.54
  32    64.83  68.01

  σ      k=1    k=30
  0.1    63.87  67.93
  0.05   67.39  70.59
  0.03   66.98  70.33
  0.02   N/A    N/A

Main Results. Table 1 and Table 2 summarize our results in comparison with the features learned by parametric softmax. For the baseline networks, we extract the last-layer feature and evaluate it with the same k-nearest-neighbor classifiers, measuring the similarity between features by cosine similarity. Evaluating the baseline features with nearest neighbor classification decreases accuracy, with a small drop at $k = 30$ and a much larger drop at $k = 1$. We also project the baseline features to 128 dimensions with PCA for evaluation; this reduction leads to a further decrease in performance, suggesting that features learned by parametric classifiers do not work equally well with nearest neighbor classifiers. With our model, we achieve a clear improvement over the baseline features at $k = 1$. At $k = 30$, we even obtain slightly better results than the parametric classifier on ResNet34 and ResNet50. We also find that the predictions from our model disagree with the baseline on a sizable portion of the validation set, indicating that a significantly different representation has been learned.
Figure 2 shows nearest neighbor retrieval comparisons. The upper four examples are successful retrievals and the lower four are failure cases. For the failure cases, our model has trouble either when there are multiple objects in the same scene, or when the task requires very fine-grained categorization. For the four failure cases, our model predicts “paddle boat”, “tennis ball”, “angora rabbit”, and “appenzeller”, respectively.
Ablation study on model parameters. We investigate the effect of the feature size $d$ and the temperature parameter $\sigma$ in Table 3. For the feature size, $d = 128$ and $d = 256$ produce very similar results, and we start to see performance degradation as the size drops below 64. For the temperature, a lower $\sigma$, which induces smaller neighborhoods, generally produces better results. However, the network does not converge if the temperature is too low, e.g. $\sigma = 0.02$.


Table 4: Induction accuracy (top-1, %) when training on coarse labels and evaluating on fine labels with k nearest neighbors.

CIFAR:
  Task       20 classes  100 classes
  Baseline   81.53       54.17
  Ours       81.42       62.32

ImageNet:
  Task       127 classes  1000 classes
  Baseline   81.48        48.07
  Ours       81.62        52.75

4.2 Discovering Sub-Categories
Our non-parametric formulation of classification does not assume a single prototype per category. Each training image only has to look for a few supporting neighbors [34] to embed the features. We refer to the nearest neighbors whose probability density $p_{ij}$ sums above a given threshold as the support set for example $x_i$. In Figure 3, we plot histograms of the support set size for several support density thresholds. Most images depend on only a small number of neighbors, far fewer than the 1,000 images per category in ImageNet. These statistics suggest that our learned representation allows sub-categories to develop automatically.
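The support set size can be read off the leave-one-out distribution as the smallest number of top neighbors whose probability mass reaches the threshold. A minimal sketch (our own helper; the name and threshold argument are illustrative):

```python
import numpy as np

def support_set_size(p_row, threshold=0.9):
    """Smallest number of neighbors whose probability mass exceeds threshold.

    p_row: leave-one-out probabilities p_ij of one example (sums to 1).
    """
    order = np.argsort(-p_row)             # neighbors by decreasing density
    cumulative = np.cumsum(p_row[order])
    return int(np.searchsorted(cumulative, threshold) + 1)
```

Applying this per training example and histogramming the sizes reproduces the kind of statistics plotted in Figure 3.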
The ability to discover sub-categories is of great importance for feature learning, as there are always intra-class variations no matter how we define categories. For example, even at the finest level of object species, we can further define object pose as sub-categories.
To quantitatively measure the performance of sub-category discovery, we learn the feature embedding using coarse-grained object labels and evaluate the embedding using fine-grained object labels. We can then measure how well feature learning discovers variations within categories. Following [15], we refer to this classification performance as induction accuracy. We train the network with the baseline parametric softmax and with our non-parametric NCA using the same network architecture. To be fair to the baseline, we evaluate the feature from the penultimate layer of both networks. We conduct the experiments on CIFAR and ImageNet; the results are summarized in Table 4.
CIFAR Results. CIFAR100 [18] images have both fine-grained annotations in 100 categories and coarse-grained annotations in 20 categories, making it a proper testbed for evaluating sub-category discovery. We study sub-category discovery by transferring representations learned on the 20 categories to the 100 categories. The two approaches exhibit similar classification performance in the 20-category setting. However, when transferred to CIFAR100 using k nearest neighbors, the baseline features suffer a big loss, reaching only 54.17% top-1 accuracy on the 100 classes. Fitting a linear classifier on the baseline features improves the top-1 accuracy. Using k-nearest-neighbor classifiers, our features are substantially better than the baseline, achieving 62.32% recognition accuracy.
ImageNet Results. As in [15], we use 127 coarse categories obtained by clustering the 1000 categories in a top-down fashion, fixing the distances of the nodes from the root node in the WordNet tree. 65 of the 127 classes are present in the original 1000 classes; the other 62 classes are parental nodes in the ImageNet hierarchical word tree. The two models achieve similar classification performance on the original 127 categories. When evaluated with the 1000-class annotations, our representation is about 4.7% better than the baseline features (52.75% vs. 48.07%). The baseline performance can be improved by fitting another linear classifier on the 1000 classes.
Discussions. Our approach preserves visual structures which are not explicitly present in the supervisory signal. In Figure 4, we show nearest neighbor examples compared with the baseline features. For all the examples shown, the ground-truth fine-grained category does not exist among the training categories, so the model has to discover sub-categories in order to recognize the objects. Our representation preserves apparent visual similarity (such as color and pose) better, and is able to associate the query with correct exemplars for accurate recognition. For example, our model finds similar birds hovering above water in the third row, and butterflies of the same color in the last row. In Figure 5 we further show the prediction gains for each class. Our model is particularly stronger on sub-categories with rich intra-class variations.


Table 5: Few-shot recognition accuracy (%) on miniImagenet.

  Method             Network  Fine-tune  5-way 1-shot  5-way 5-shot  20-way 1-shot  20-way 5-shot
  NN Baseline [42]   Small    No         41.1 ± 0.7    51.0 ± 0.7    –              –
  Meta-LSTM [29]     Small    No         43.4 ± 0.8    60.1 ± 0.7    16.7 ± 0.2     26.1 ± 0.3
  MAML [6]           Small    Yes        48.7 ± 0.7    63.2 ± 0.9    16.5 ± 0.6     19.3 ± 0.3
  Meta-SGD [20]      Small    No         50.5 ± 1.9    64.0 ± 0.9    17.6 ± 0.6     28.9 ± 0.4
  Matching Net [42]  Small    Yes        46.6 ± 0.8    60.0 ± 0.7    –              –
  Prototypical [36]  Small    No         49.4 ± 0.8    68.2 ± 0.7    –              –
  RelationNet [39]   Small    No         51.4 ± 0.8    61.1 ± 0.7    –              –
  Ours               Small    No         50.3 ± 0.7    64.1 ± 0.8    23.7 ± 0.4     36.0 ± 0.5

  SNAIL [27]         Large    No         55.7 ± 1.0    68.9 ± 0.9    –              –
  RelationNet [39]   Large    No         57.0 ± 0.9    71.1 ± 0.7    –              –
  Ours               Large    No         57.8 ± 0.8    72.8 ± 0.7    30.5 ± 0.5     44.8 ± 0.5

4.3 Few-shot Recognition
Our feature embedding method learns a meaningful metric among images. Such a metric can be directly applied to new image categories that have not been seen during training. We study the generalization ability of our method for few-shot object recognition.
Evaluation Protocol. We use the miniImagenet dataset [42], which consists of 60,000 colour images in 100 classes (600 examples per class). We follow the split introduced previously [29], with 64, 16, and 20 classes for training, validation, and testing. We use the validation set only for tuning model parameters. During testing, we create testing episodes by randomly sampling a set of observation and query pairs. The observation consists of $N$ classes (“way”) and $K$ images (“shot”) per class, and the query is an image from one of the $N$ classes. Each testing episode poses the task of predicting the class of the query image given the few-shot observations. We create many episodes for testing and report the average results.
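The episode construction described above can be sketched as follows. This is our own NumPy sketch; the helper name and the way a single query is drawn per episode are illustrative assumptions.

```python
import numpy as np

def sample_episode(labels, n_way=5, k_shot=1, rng=None):
    """Sample one N-way K-shot episode: support indices plus one query index.

    labels: (n,) integer class labels of the test split.
    Returns (support_idx, query_idx, query_class).
    """
    if rng is None:
        rng = np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    per_class = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        # draw k_shot support examples plus one held-out candidate query
        per_class.append(rng.choice(idx, size=k_shot + 1, replace=False))
    per_class = np.stack(per_class)          # (n_way, k_shot + 1)
    q_row = rng.integers(n_way)              # which class the query comes from
    query_idx = per_class[q_row, -1]
    support_idx = per_class[:, :-1].reshape(-1)
    return support_idx, query_idx, classes[q_row]
```

The query is guaranteed to be disjoint from the support set, since indices within each class are drawn without replacement.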
Network Architecture. We conduct experiments on two network architectures. One is a shallow network which receives small input images. It has 4 convolutional blocks, each with a convolutional layer, a batch normalization layer, a ReLU layer, and a max pooling layer; a final fully connected layer maps the feature for classification. This architecture is widely used in previous works [6, 42] for evaluating few-shot recognition. The other is a deeper version with ResNet18 and larger image inputs; two previous works [27, 39] have reported their performance with similar ResNet18 architectures.

Results. We summarize our results in Table 5. We train our embedding on the training set and apply the representation from the penultimate layer for evaluation. Our current experiment does not fine-tune a local metric per episode, though such adaptation could bring additional improvement. As in the previous experiments, we use k nearest neighbors for classification, with a single neighbor for the 1-shot scenario and more neighbors for the 5-shot scenario.
In the shallow network setting, our model is on par with the prototypical network [36] and RelationNet [39], while being far more generic.
In the deeper network setting, we achieve state-of-the-art results for this task. MAML [6] suggests that going deeper does not necessarily bring better results for meta-learning. Our approach provides a counterexample: deeper network architectures can in fact bring significant gains with proper metric learning.
Figure 6 shows visual examples of our predictions compared with the baseline trained with softmax classifiers.
5 Summary
We present a non-parametric neighborhood approach to visual recognition. We learn a CNN to embed images into a low-dimensional feature space, where the distance metric between images preserves the semantic structure of categorical labels according to the NCA criterion. We address NCA's computational demand by learning with an external augmented memory, thereby making NCA scalable for large datasets and deep neural networks. Our experiments deliver not only remarkable performance on ImageNet classification for such a simple non-parametric method, but, most importantly, a more generalizable feature representation for sub-category discovery and few-shot recognition. In the future, it is worthwhile to re-investigate non-parametric methods for other visual recognition problems such as detection and segmentation.
Acknowledgements
This work was supported in part by Berkeley DeepDrive. ZW would like to thank Yuanjun Xiong for helpful discussions.
References
 [1] Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a “siamese” time delay neural network. In: NIPS (1994)
 [2] Chum, O., Zisserman, A.: An exemplar model for learning object classes. In: CVPR. IEEE (2007)
 [3] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., FeiFei, L.: Imagenet: A largescale hierarchical image database. In: CVPR. IEEE (2009)
 [4] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. IJCV (2010)
 [5] Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. IJCV (2005)
 [6] Finn, C., Abbeel, P., Levine, S.: Modelagnostic metalearning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400 (2017)
 [7] Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS) (1977)
 [8] Goldberger, J., Hinton, G.E., Roweis, S.T., Salakhutdinov, R.R.: Neighbourhood components analysis. In: NIPS (2005)
 [9] Graves, A., Wayne, G., Danihelka, I.: Neural turing machines. arXiv preprint arXiv:1410.5401 (2014)
 [10] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: CVPR. IEEE (2006)
 [11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
 [12] Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine (2012)

 [13] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation (1997)
 [14] Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: International Workshop on Similarity-Based Pattern Recognition. Springer (2015)
 [15] Huh, M., Agrawal, P., Efros, A.A.: What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614 (2016)
 [16] Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. PAMI (2011)
 [17] Koestinger, M., Hirzer, M., Wohlhart, P., Roth, P.M., Bischof, H.: Large scale metric learning from equivalence constraints. In: CVPR. IEEE (2012)
 [18] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)

 [19] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)
 [20] Li, Z., Zhou, F., Chen, F., Li, H.: Metasgd: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835 (2017)
 [21] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. Springer (2014)
 [22] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
 [23] Malisiewicz, T., Efros, A.A.: Recognition by association via learning perexemplar distances. In: CVPR. IEEE (2008)
 [24] Malisiewicz, T., Efros, A.: Beyond categories: The visual memex model for reasoning about object relationships. In: NIPS (2009)
 [25] Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplarsvms for object detection and beyond. In: ICCV. IEEE (2011)
 [26] Mensink, T., Verbeek, J., Perronnin, F., Csurka, G.: Distancebased image classification: Generalizing to new classes at nearzero cost. PAMI (2013)
 [27] Mishra, N., Rohaninejad, M., Chen, X., Abbeel, P.: Metalearning with temporal convolutions. arXiv preprint arXiv:1707.03141 (2017)

 [28] Paszke, A., Chintala, S., Collobert, R., Kavukcuoglu, K., Farabet, C., Bengio, S., Melvin, I., Weston, J., Mariethoz, J.: PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration (May 2017)
 [29] Ravi, S., Larochelle, H.: Optimization as a model for fewshot learning (2016)
 [30] Ren, S., He, K., Girshick, R., Sun, J.: Faster rcnn: Towards realtime object detection with region proposal networks. In: NIPS (2015)
 [31] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. IJCV (2015)

 [32] Salakhutdinov, R., Hinton, G.: Learning a nonlinear embedding by preserving class neighbourhood structure. In: Artificial Intelligence and Statistics (2007)

 [33] Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: Meta-learning with memory-augmented neural networks. In: International Conference on Machine Learning (2016)
 [34] Schölkopf, B., Platt, J.C., ShaweTaylor, J., Smola, A.J., Williamson, R.C.: Estimating the support of a highdimensional distribution. Neural computation (2001)
 [35] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: CVPR (2015)
 [36] Snell, J., Swersky, K., Zemel, R.: Prototypical networks for fewshot learning. In: NIPS (2017)
 [37] Sohn, K.: Improved deep metric learning with multiclass npair loss objective. In: NIPS (2016)
 [38] Sukhbaatar, S., Weston, J., Fergus, R., et al.: Endtoend memory networks. In: NIPS (2015)
 [39] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for fewshot learning. arXiv preprint arXiv:1711.06025 (2017)
 [40] Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to humanlevel performance in face verification. In: CVPR (2014)
 [41] Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: CVPR. IEEE (1991)
 [42] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: NIPS (2016)
 [43] Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: NIPS (2015)
 [44] Wang, H., Wang, Y., Zhou, Z., Ji, X., Li, Z., Gong, D., Zhou, J., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414 (2018)
 [45] Weber, M., Welling, M., Perona, P.: Unsupervised learning of models for recognition. In: ECCV. Springer (2000)
 [46] Weston, J., Chopra, S., Bordes, A.: Memory networks. arXiv preprint arXiv:1410.3916 (2014)
 [47] Wu, C.Y., Manmatha, R., Smola, A.J., Krähenbühl, P.: Sampling matters in deep embedding learning. arXiv preprint arXiv:1706.07567 (2017)
 [48] Wu, Z., Xiong, Y., Stella, X.Y., Lin, D.: Unsupervised feature learning via nonparametric instance discrimination. In: CVPR (2018)
 [49] Xiao, T., Li, S., Wang, B., Lin, L., Wang, X.: Joint detection and identification feature learning for person search. In: CVPR (2017)
 [50] Zhang, H., Berg, A.C., Maire, M., Malik, J.: Svmknn: Discriminative nearest neighbor classification for visual category recognition. In: CVPR. IEEE (2006)