Recent progress in machine learning, especially the emergence of deep learning, has advanced image classification performance to an unprecedented level. Error rates on large-scale benchmark datasets have been halved and halved again, even approaching human-level performance on some object recognition benchmarks. Despite this success, state-of-the-art models are notoriously data hungry, requiring large numbers of samples for parameter learning. In real cases, however, visual phenomena follow a long-tail distribution (Zhu et al., 2014), where only a few sub-categories are data-rich and the rest have limited training samples. How to learn a classifier from as few samples as possible is critical for real applications and fundamental for exploring new learning mechanisms.
Compared with machines, people are far better learners: they can learn a model from very limited samples of a new category and make accurate predictions and judgements accordingly. An intuitive example is that a baby can learn to recognize a wolf from only a few sample images, provided that he or she is already able to recognize a dog. The key difference is that people have strong prior knowledge with which to generalize across different categories (Lake et al., 2016). People do not need to learn a new classifier (e.g. wolf) from scratch, as most machine learning methods do, but instead generalize and adapt previously learned classifiers (e.g. dog) to the new category. A major way to acquire this prior knowledge is learning to learn from previous experience. In the image classification scenario, learning to learn refers to the mechanism by which learning to recognize a new concept is accelerated by previous learning of other related concepts.
A typical image classifier consists of a representation step and a classification step, leading to two fundamental problems in learning to learn image classifiers: (1) how to generalize the representations from previous concepts to a new concept, and (2) how to generalize the classification parameters of previous concepts to a new concept. In the literature, transfer learning and domain adaptation methods (Patricia and Caputo, 2014) have been proposed with a similar notion, mainly focusing on representation generalization across different domains and tasks. With the development of CNN-based image classification models, the high-level representations learned from very large-scale labeled datasets, e.g. the fc7 layer in AlexNet, have been shown to transfer well across different concepts or even different datasets (Tzeng et al., 2015), which significantly alleviates the representation generalization problem. However, how to generalize the classification parameters in deep models (e.g. the fc8 layer in AlexNet) from well-trained concepts to a new concept (with only a few samples) is largely ignored by previous studies.
In this paper, we target the following problem. Given a well-trained N-class CNN model for N base classes, how can we learn a binary classifier for a new class with only a few samples? More specifically, we constrain the setting so that the new class shares the same representation space as the N base classes, i.e. we directly copy the representation layers of the N-class CNN model to the new class, which is a common practice in deep representation transfer (Vinyals et al., 2016; Kwitt et al., 2016; Wang and Hebert, 2016). Such a setting provides a reasonable and fair foundation for investigating how to optimally generalize classification parameters. Given a new class, the key problem is to identify which base classes' classification parameters should be transferred.
Learning by analogy has been shown to be a fundamental building block of the human learning process (Gentner and Holyoak, 1997), and it shares a similar context with our problem. When we face a new situation, we recall a similar situation by matching the two up, and then we learn from it. In the earlier example of dog and wolf, a plausible explanation for the fast learning of wolf is that a human learner selects dog from the base classes by visual analogy and transfers its classification parameters to wolf classification. In this sense, visual analogy provides an effective and informative clue for generalizing image classifiers in a human-like way. However, the limited number of samples in the new class causes inaccurate and unstable measurements of visual analogy in the high-dimensional representation space, and how to transfer the classification parameters from the selected base classes to a new class is also highly non-trivial for generalization efficacy.
To address the above problems, we first propose a novel Visual Analogy Network Embedded Regression (VANER) model, which jointly learns a low-dimensional embedding space and a linear mapping function from the embedding space to the classification parameters of base classes. In particular, we learn a low-dimensional embedding for each base class, with the constraint that the embedding similarity between two base classes reflects their visual analogy in the original representation space. Meanwhile, we learn a linear mapping function from the embedding of a base class to its previously learned classification parameters (i.e. the logistic regression parameters). The VANER model thus enables the transformation from the original representation space to the embedding space and further into classification parameters. We then propose an out-of-sample embedding method to learn the embedding of a new class, represented by a few samples, through its visual analogy with base classes. By feeding the learned embedding into VANER, we can derive the classification parameters for the new class. Note that these classification parameters are purely generalized from base classes (i.e. transferred classification parameters), while the samples in the new class, although only a few, can also be exploited to generate a set of classification parameters (i.e. model classification parameters). Therefore, we further investigate strategies for fusing the two kinds of parameters so that both the prior knowledge and the data knowledge can be fully leveraged. The framework of the proposed method is illustrated in Figure 1.
We extensively evaluate the proposed method, and the results show that it reaches an average AUC of 0.87 over 200 new classes from ImageNet in the one-shot setting (i.e. each new class has only 1 image sample). In contrast, the AUC of logistic regression with only endogenous parameters is 0.77. We also find that the improvement margins (between our method and the baselines) in different new classes have a significant positive correlation with the relative similarity ratio between a new class and the base classes, indicating that our method is consistent with human-like learning.
The technical contributions of this paper are threefold.
We study the problem of learning to learn from a new angle: given a fixed representation space, how can the classification parameters of base classes be generalized to a new class? This problem setting can promote new research attempts towards human-like learning mechanisms.
We propose to use visual analogy as the bridge for generalizing classification parameters across different classes, and propose a novel VANER method to achieve the transformation from the original representation to classification parameters for any new class.
We extensively evaluate the proposed method, and the results show that it consistently and significantly outperforms the baselines and, more importantly, is more consistent with human-like learning.
The rest of the paper is organized as follows. In Section 2, we briefly review related work on image classification, especially the low-shot problem, and on network embedding. In Section 3, we present the framework of our VANER model. In Section 4, we discuss the experimental results. Finally, we conclude the paper with a discussion of our findings and open issues in Section 5.
2. Related Work
The related work falls into three lines: image classification with deep learning, one/low-shot image classification, and network embedding. We briefly review and discuss them as follows.
Image Classification with Deep Learning. The first work to apply deep convolutional neural networks to image classification on a large-scale image dataset dates back to AlexNet (Krizhevsky et al., 2012), which reaches a top-5 error rate of 17.0% on the ILSVRC2010 dataset. After the huge success of deep neural networks on image classification, increasingly complex network structures have been put forward. Among them, VGGNet (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2016) are well known; they reached top-5 error rates of 6.8%, 6.67% and 3.57% respectively on the same dataset. All of them are end-to-end models with huge numbers of parameters, which makes them data hungry.
One/Low-shot Image Classification. The one/low-shot image classification problem focuses on how to learn as much as possible about a category from just one or a handful of images, instead of a large-scale training dataset. Most one/low-shot image classification algorithms take advantage of transfer learning. In early work, Fei-Fei et al. (2006) proposed a Bayesian transfer method on the low-level features of images. Owing to the effectiveness of SVMs in image classification, many methods have been proposed to combine the SVM parameters of the base classes to learn the transferred parameters of one-shot classes. Yao and Doretto (2010) and Qi et al. (2011) propose transfer mechanisms based on AdaBoost: both construct a set of weak classifiers from the data of the base classes and learn a new classifier by linearly combining the weak classifiers. Tommasi et al. (2014) propose an adaptive Least-Squares SVM method that directly combines the base-class SVM models and learns the weights automatically. These methods do not work well on the one-shot problem, as they require sufficient supervised information to learn the weights of the combined model. Moreover, these methods are based on hand-crafted features, which seriously limits their performance.
After deep learning was introduced into large-scale image classification, researchers turned to investigating the one-shot problem with deep learning. Some methods learn a better image representation adapted to the one-shot image classification problem. Koch (2015) introduces a two-way Siamese neural network that learns the similarity of two input images using base classes and predicts the most likely one-shot class for test images. Hariharan and Girshick propose a Squared Gradient Magnitude loss that considers both the multi-class logistic loss and the small-dataset training loss. Other methods combine traditional deep neural network structures with new transfer learning algorithms. Santoro et al. (2016) use Memory-Augmented Neural Networks with a Least Recently Used Access module, which can be seen as an external memory storing previously learned information, and later Vinyals et al. (2016) propose an improved method called Matching Network. Both capture the similarity between novel classes and base classes and use this information for cross-class transfer, but they optimize the transfer process in the representation learning step rather than the classification step. Wang and Hebert (2016) propose a Model Regression Network for intra-class transfer, which learns a nonlinear mapping from model parameters trained on small samples to model parameters trained on large samples. This mapping can be used to infer the classification parameters from only a few shots (i.e. small samples) in the new classes. However, the correlation patterns between small-sample and large-sample parameters are not always notable, as demonstrated in our experiments. More recently, a few works exploit generative models to create more data for training. Rezende et al. (2016) use deep generative models to produce images similar to a given image. Hariharan and Girshick then propose another algorithm that completes transformation analogies in high-level image features and uses this mechanism to expand the images in low-shot classes. Data generation is a feasible way to address the problem of sparse training samples. In contrast, our paper attempts to address the problem from the angle of new learning mechanisms.
Network Embedding. In this paper, we exploit network embedding to model visual analogy among different classes, so here we briefly review recent advances in network embedding. Network embedding is used to extract a formalized representation of each node in a large-scale graph. The low-dimensional hidden embeddings can capture not only the characteristics of the whole network (e.g. the relationship between two nodes) but also the features of each node itself. Network embedding methods are now widely used in the social network area to solve problems such as node clustering and link prediction. Many algorithms have been proposed to learn the embeddings better and faster. Ahmed et al. (2013) use a matrix factorization technique optimized by SGD. Tang et al. (2015) propose the LINE method, which preserves both the first-order and second-order proximities of each node and improves the quality of the embeddings. Network embedding has proved to be an effective method for graph analysis.
3. The Method
3.1. Notations and Problem Formulation
Suppose that we have an image set that is divided into a base-class set, in which each class has sufficient training samples, and a novel-class set, in which each class has only a few training samples. We train an AlexNet (Krizhevsky et al., 2012) on the base-class set as our base CNN model and extract its fc7 layer as the high-level features of images. For each image in the base-class set, we obtain its fc7-layer feature, indexed by the class it belongs to and by its position within that class. We use the same CNN model to derive high-level representations for images in novel classes.
A typical binary classifier can be represented as a mapping function, parametrized by a parameter vector, whose input is an image feature vector and whose output is the probability that the image belongs to the class. Each base class and each novel class has its own parameter vector. Based on the above notations, our problem is defined as follows.
Problem 1 (Learning to learn image classifiers).
Given the image features of base classes, the well-trained base classifier parameters, and the image features of a novel class with only a few positive samples, learn the classification parameters for the novel class, so that the learned classifier can precisely predict labels for the novel class.
Note that the problem of learning to learn image classifiers differs from traditional image classification problems in that the learning of a classifier for a novel class depends on the previously learned base-class classifiers and the image representations in base classes, in addition to the image samples in the novel class.
3.2. Framework of Learning to Learn Image Classifiers
The main idea of our method is to generalize the classification parameters of well-trained base classes to a novel class with only a few training samples. In order to realize this, we propose a framework for learning to learn image classifiers (as shown in Figure 1), which consists of two major steps including (1) learning the mapping function from representation space to classification parameters in base classes and (2) generalizing the base classification parameters to a novel class.
For the first step, we propose a novel VANER model to learn the mapping function. After acquiring the high-level representations from fc7 layer in AlexNet for all images in base classes, we calculate the mean feature vector for each class, and generate a visual analogy network for base classes by measuring their pair-wise class similarity. From the visual analogy network, we learn a low-dimensional embedding for each base class with the constraints that the embeddings of classes should preserve the visual analogy network structures, and, at the same time, the embedding of a base class can be transformed into the classification parameters of the base class through a linear mapping function. By training in base classes, we can derive the embeddings of base classes, and a mapping function from embeddings to classification parameters.
Given a novel class with only one or a few samples, we obtain its high-level representations through the same AlexNet trained on base classes. By comparing its feature vector with those of base classes, we construct a visual analogy network incorporating the novel class and the base classes, from which we infer the embedding of the novel class through an out-of-sample embedding method. With the inferred embedding of the novel class and the mapping function learned in VANER, we obtain the classification parameters generalized from base classes. Meanwhile, we also learn classification parameters for the new class from its samples (although only a few). After that, we conduct late fusion on these two kinds of parameters so that both prior knowledge and data are fully leveraged. Finally, we use the fused classification parameters to classify the novel class.
The notion behind this framework is that the classifier for a novel class should be similar to that of a base class if and only if the novel class is visually analogous to the base class. In the example of Figure 1, the novel class wolf is similar to the base class dog in high-level representations, so the link between them has a high weight in the visual analogy network. This high-weight link forces the embedding of the wolf class to be similar to that of the dog class, and the similar embeddings result in similar classification parameters, as they share the same mapping function. In this way, the classification parameters of the dog class, which is well trained with sufficient training data, can be successfully transferred to the new wolf class.
3.3. The VANER Model
We define a network $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the vertex set of the graph, with each vertex representing a base class and $|\mathcal{V}| = N$. $\mathcal{E}$ is the edge set of the graph; each edge represents the visual analogy relationship between two classes, with the edge weight depicting the degree of similarity. We use $A \in \mathbb{R}^{N \times N}$ to represent the adjacency matrix of the network, where $A_{ij}$ is the edge weight from vertex $i$ to vertex $j$, and $A_{i,:}$ and $A_{:,j}$ stand for the $i$-th row and the $j$-th column of $A$ respectively. In our classification problem, we construct the visual analogy network as an undirected, fully connected graph, and the edge weight (i.e. the degree of visual analogy) between two classes is calculated by:

$$A_{ij} = \frac{\bar{x}_i^{\top} \bar{x}_j}{\|\bar{x}_i\|_2 \, \|\bar{x}_j\|_2} \tag{1}$$

Here $\bar{x}_i$ denotes the average feature vector of class $i$, and the equation is the cosine similarity between two base classes. Because the graph is undirected, the adjacency matrix $A$ is symmetric.
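As an illustration of how such a network could be built, the following is a minimal sketch assuming the features of each class are available as a NumPy array of shape (number of images, feature dimension). The function name `visual_analogy_matrix` and the random toy features are hypothetical stand-ins, not part of the original method description.

```python
import numpy as np

def visual_analogy_matrix(features_by_class):
    """Cosine similarity between class-mean feature vectors.

    features_by_class: list of arrays, one per class, each of shape (n_i, p).
    Returns an (N, N) symmetric matrix with ones on the diagonal.
    """
    means = np.stack([f.mean(axis=0) for f in features_by_class])  # (N, p)
    norms = np.linalg.norm(means, axis=1, keepdims=True)
    unit = means / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

rng = np.random.default_rng(0)
# Toy stand-in for fc7 features: 4 classes, 8 dimensions, non-negative like ReLU outputs.
toy = [np.abs(rng.normal(size=(n, 8))) for n in (5, 7, 3, 6)]
A = visual_analogy_matrix(toy)
```

Normalizing the class means first and then taking a single matrix product computes all pairwise similarities at once, which is convenient when the number of base classes is in the hundreds.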
To make the visual analogy measurement robust in sparse scenarios, we need to reduce the dimensionality of the representation space. Our basic hypothesis in generalizing classification parameters is that if two classes are visually similar, they should share similar classification parameters. We realize this by imposing a linear mapping function from the embedding space to the classification parameter space, so that similar embeddings result in similar classification parameters. Motivated by this, we propose a Visual Analogy Network Embedded Regression model.
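Let $V \in \mathbb{R}^{N \times q}$ be the embeddings of all nodes in the network, where each row of $V$, of dimension $q$, is the embedding of one vertex. Let $W \in \mathbb{R}^{N \times p}$ represent all parameters of the base classifiers, one row per base class, where $p$ is the dimension of the classification parameters. There is also a common linear transformation matrix $T \in \mathbb{R}^{q \times p}$, shared by all base classes, that converts the embedding space to the classification parameter space. Then the loss function is defined as:

$$\mathcal{L}(V, T) = \|W - VT\|_F^2 + \|A - VV^{\top}\|_F^2 \tag{2}$$

where $A$ is the adjacency matrix of the visual analogy network and $\|\cdot\|_F$ is the Frobenius norm of a matrix.
The first term forces the embeddings to be convertible into the classification parameters through a linear transformation. The second term constrains the embeddings to preserve the structure of the visual analogy network. Our goal is to find the matrices $V$ and $T$ that minimize this loss function.
This is an unconstrained optimization problem in two variables, and we use alternating coordinate descent to find the solution for $V$ and $T$, where the gradients are calculated by:

$$\frac{\partial \mathcal{L}}{\partial V} = 2(VT - W)T^{\top} + 4(VV^{\top} - A)V, \qquad \frac{\partial \mathcal{L}}{\partial T} = 2V^{\top}(VT - W) \tag{3}$$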
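The alternating optimization can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes an unweighted sum of a regression term and a structure-preserving term, both squared Frobenius norms, with a symmetric adjacency matrix. The names `vaner_loss`, `vaner_grads` and `fit_vaner`, the step size, the step count and the toy data are all hypothetical.

```python
import numpy as np

def vaner_loss(V, T, W, A):
    """Regression term plus structure-preserving term (squared Frobenius norms)."""
    return float(np.sum((W - V @ T) ** 2) + np.sum((A - V @ V.T) ** 2))

def vaner_grads(V, T, W, A):
    """Gradients of vaner_loss with respect to V and T (A is assumed symmetric)."""
    R = V @ T - W        # regression residual, shape (N, p)
    S = V @ V.T - A      # structure residual, shape (N, N)
    grad_V = 2.0 * R @ T.T + 4.0 * S @ V
    grad_T = 2.0 * V.T @ R
    return grad_V, grad_T

def fit_vaner(W, A, q, steps=500, lr=1e-3, seed=0):
    """Alternate one gradient step on V with one gradient step on T."""
    rng = np.random.default_rng(seed)
    N, p = W.shape
    V = 0.1 * rng.normal(size=(N, q))
    T = 0.1 * rng.normal(size=(q, p))
    history = [vaner_loss(V, T, W, A)]
    for _ in range(steps):
        V = V - lr * vaner_grads(V, T, W, A)[0]   # update V with T fixed
        T = T - lr * vaner_grads(V, T, W, A)[1]   # update T with V fixed
        history.append(vaner_loss(V, T, W, A))
    return V, T, history

# Toy problem: 6 base classes, 5-dimensional classifiers, 3-dimensional embeddings.
rng = np.random.default_rng(1)
means = np.abs(rng.normal(size=(6, 10)))
unit = means / np.linalg.norm(means, axis=1, keepdims=True)
A_toy = unit @ unit.T                 # symmetric cosine-similarity matrix
W_toy = rng.normal(size=(6, 5))       # stand-in base classifier parameters
V_fit, T_fit, losses = fit_vaner(W_toy, A_toy, q=3)
```

Recording the loss after every alternation makes it easy to check that the chosen step size is small enough for the objective to decrease.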
3.4. Embedding Inference for Novel Classes
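By training the VANER model on base classes, we obtain the embedding of each base class and the mapping function from embeddings to classification parameters. Given a new class with only a few samples, we need to infer its embedding. Suppose the embedding of the novel class is $v_{new} \in \mathbb{R}^{q}$. We calculate the similarity of the novel class with all base classes by Equation 1, and we denote this similarity vector by $a_{new} \in \mathbb{R}^{N}$.
Then we define the objective function for the novel class embedding inference, where $A$ is the adjacency matrix of the base classes and $V$ is the matrix of base-class embeddings (one row per class), and our goal is to minimize the following function:

$$\mathcal{L}(v_{new}) = \left\| \begin{bmatrix} A & a_{new} \\ a_{new}^{\top} & 1 \end{bmatrix} - \begin{bmatrix} V \\ v_{new}^{\top} \end{bmatrix} \begin{bmatrix} V^{\top} & v_{new} \end{bmatrix} \right\|_F^2 \tag{4}$$

After we delete the terms that are independent of $v_{new}$, the final minimization problem to solve is:

$$\min_{v_{new}} \; 2\,\|a_{new} - V v_{new}\|_2^2 + \left(v_{new}^{\top} v_{new} - 1\right)^2 \tag{5}$$

In fact, the second term of Equation 5 is a regularization term. We omit the second term, so that the remaining first term takes the form of a linear regression loss. Then we can get an explicit solution for $v_{new}$ without using gradient descent. The solution is:

$$v_{new} = V^{+} a_{new} \tag{6}$$

where $V^{+}$ is the Moore-Penrose pseudo-inverse of the matrix $V$, defined by $V^{+} = (V^{\top} V)^{-1} V^{\top}$. Note that we can speed up the algorithm by pre-computing the pseudo-inverse of $V$.
After deriving the embedding of the new class, we can easily obtain its transferred classification parameters by multiplying by the transformation matrix $T$:

$$w_{new}^{tr} = T^{\top} v_{new} \tag{7}$$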
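The closed-form inference step can be sketched in a few lines. This is an illustrative sketch under assumed conventions: base-class embeddings are stored as rows of a matrix, the similarity vector has one entry per base class, and the regularization term is dropped so that the problem is ordinary least squares. The name `infer_novel_parameters` and the toy matrices are hypothetical.

```python
import numpy as np

def infer_novel_parameters(V, T, a_new, V_pinv=None):
    """Least-squares embedding of a novel class, then its transferred parameters.

    V: (N, q) base-class embeddings, T: (q, p) linear map, a_new: (N,) similarities.
    V_pinv may be passed in so that it is computed once and reused across novel classes.
    """
    if V_pinv is None:
        V_pinv = np.linalg.pinv(V)    # shape (q, N)
    v_new = V_pinv @ a_new            # minimises ||a_new - V v||_2
    w_new = T.T @ v_new               # shape (p,)
    return v_new, w_new

rng = np.random.default_rng(0)
V_demo = rng.normal(size=(6, 3))
T_demo = rng.normal(size=(3, 5))
v_true = np.array([0.5, -1.0, 2.0])
a_demo = V_demo @ v_true              # similarities that an exact embedding would produce
v_hat, w_hat = infer_novel_parameters(V_demo, T_demo, a_demo)
```

Passing a precomputed pseudo-inverse reflects the speed-up noted above: the expensive factorization is done once, and each novel class then needs only two matrix-vector products.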
3.5. Late Fusion
As mentioned above, we can also learn the classification parameters of a new class from its samples (although only a few), and we call them model classification parameters. Then we need to fuse the transferred classification parameters and model classification parameters into the final classifier. Here we present three strategies for late fusion: Initializing, Tuning, and Voting.
Let the binary classifier for a new class be trained on a mixed set of positive and negative samples, each carrying a binary label that marks it as positive or negative.
Initializing. We use the transferred classification parameters as an initialization and then re-learn the parameters of the new classifier from the new class samples. The training loss function is defined as the common loss function for classification. That is:
where the loss consists of a prediction error term, for which we use the cross-entropy loss in our experiments, and a regularization term, for which we use the L2-norm. To learn the parameters, we use mini-batch Stochastic Gradient Descent (SGD), initialized with the transferred classification parameters.
Tuning. We train the model classification parameters with new class samples, while adding a loss term that constrains the final parameters to stay similar to the transferred classification parameters:
Here, the transferred parameters are those obtained from the previous steps (i.e. in Equation 7). We still use mini-batch SGD, with a random initialization, to solve for the final parameters.
Voting. This method takes a weighted average of the transferred classification parameters and the learned model classification parameters. First, we learn the model classification parameters using Equation 8 with random initialization. Then we get the final parameters by:
The hyper-parameter in this combination serves as a voting weight.
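The three fusion strategies can be contrasted in a short sketch. Several details here are assumptions made only for illustration: labels are coded as 0/1, full-batch gradient descent stands in for mini-batch SGD, the Tuning penalty is a squared L2 distance weighted by a hypothetical `mu`, and Voting is written as a convex combination with a hypothetical weight `lam` on the model parameters. The function names are likewise hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, w_init, w_anchor=None, mu=0.0, l2=1e-2, lr=0.1, steps=300):
    """Gradient descent on mean cross-entropy + (l2/2)||w||^2 + (mu/2)||w - w_anchor||^2."""
    w = np.array(w_init, dtype=float)
    n = len(y)
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / n + l2 * w
        if w_anchor is not None:
            grad = grad + mu * (w - w_anchor)   # pull towards the transferred parameters
        w = w - lr * grad
    return w

def fuse(X, y, w_transfer, strategy, lam=0.5, mu=1.0, seed=0):
    """Combine transferred parameters with parameters learned from the few new samples."""
    rng = np.random.default_rng(seed)
    w_random = 0.01 * rng.normal(size=w_transfer.shape)
    if strategy == "initializing":
        return train_logistic(X, y, w_init=w_transfer)
    if strategy == "tuning":
        return train_logistic(X, y, w_init=w_random, w_anchor=w_transfer, mu=mu)
    if strategy == "voting":
        w_model = train_logistic(X, y, w_init=w_random)
        return (1.0 - lam) * w_transfer + lam * w_model
    raise ValueError(f"unknown strategy: {strategy}")

# Toy data: 20 samples with 4 features; the label depends on the first feature.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(20, 4))
y_demo = (X_demo[:, 0] > 0).astype(float)
w_transfer_demo = np.array([1.0, 0.0, 0.0, 0.0])
fused = {s: fuse(X_demo, y_demo, w_transfer_demo, s) for s in ("initializing", "tuning", "voting")}
```

In this sketch Initializing and Tuning both change the training procedure, whereas Voting leaves training untouched and only mixes two finished parameter vectors, which makes its single weight cheap to tune.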
3.6. Complexity Analysis
During the training process of our VANER model, the main cost is to calculate the gradient of the loss function . For calculating the first derivative of with respect to , the complexity per iteration is . As to the first derivative of with respect to , the complexity per iteration is . While predicting the novel class, if we use Equation 6 for accelerating, we are able to pre-compute the for and for each novel class, the complexity of the predicting process is .
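As a rough sketch of where such costs come from, assume dense matrices with $N$ base classes, embedding dimension $q$ and parameter dimension $p$, a loss of the form $\|W - VT\|_F^2 + \|A - VV^{\top}\|_F^2$, and plain matrix products. These shapes and the naive multiplication costs are assumptions made for illustration, not figures stated in the text.

```latex
% Assumed shapes: V is N x q, T is q x p, W is N x p, A is N x N.
\begin{align*}
\text{gradient w.r.t. } V:&\quad (VT - W)T^{\top} \text{ costs } O(Npq), \quad (VV^{\top} - A)V \text{ costs } O(N^{2}q) \\
\text{gradient w.r.t. } T:&\quad V^{\top}(VT - W) \text{ costs } O(Npq) \\
\text{prediction with precomputed } V^{+}:&\quad V^{+}a_{new} \text{ costs } O(Nq), \quad T^{\top}v_{new} \text{ costs } O(pq)
\end{align*}
```

Under these assumptions the pseudo-inverse is computed once after training, so the per-class prediction cost involves only matrix-vector products.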
4.1. Data and Experimental Settings
In our experiments, we mainly use the ILSVRC2015 dataset (Russakovsky et al., 2015), whose training set contains over 1.2 million images in 1,000 categories. We randomly divide the ILSVRC training dataset into 800 base classes and 200 novel classes. We retrain AlexNet on the 800 base classes as our base CNN model. Before training, each image is cropped to a fixed size, and all training settings are the same as in Krizhevsky et al. (2012), except that we do not use data augmentation. After training, we use the fc7 layer of AlexNet as the high-level representation of images.
Our goal is to learn the classifier for a new class given the base classifiers. We therefore set our problem to be a binary classification problem, where the new classifier is learned to separate the novel class (as positive samples) from all the base classes (as negative samples). In the training phase, we randomly select $k$ images as the training set for each novel class to simulate the $k$-shot learning scenario. In the testing phase, given a novel class, we randomly select 500 of its images (with no overlap with the training set) as positive examples and randomly select 5 images from each base class of the ILSVRC2015 validation set as negative samples. To reduce the effect of randomness, for any $k$-shot setting, we run the experiment 10 times and report the average result in the following experiments.
The evaluation metrics in our experiments are the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) and the F1-score, both of which are widely used in binary classification.
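For reference, both metrics can be computed from scratch in a few lines. The sketch below uses the standard rank-based interpretation of AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half) and the usual harmonic-mean definition of F1. The function names and the 0/1 label coding are illustrative choices.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(predictions, labels):
    """Harmonic mean of precision and recall for 0/1 predictions and labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

auc_demo = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])   # 3 of 4 pairs ranked correctly
f1_demo = f1_score([1, 1, 0, 0], [1, 0, 1, 0])           # precision = recall = 0.5
```

AUC depends only on the ranking of scores, so it needs no decision threshold, whereas F1 is computed on thresholded predictions.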
We compare our method with the baselines below, which fall into three categories. The first algorithm is the traditional method used for image classification. The next two algorithms are methods mainly used in one-shot image classification; in accordance with our setting, we choose algorithms that learn a new classifier while keeping the image features unchanged, among which MRN (Wang and Hebert, 2016) is the state of the art. The last three algorithms are variants of our own framework in which certain parts of the whole algorithm are excluded for comparison; we implement these variants to demonstrate the characteristics and advantages of our method.
Logistic Regression (LR). We directly use the novel class images as positive training samples and randomly selected base class images as negative samples to train a logistic regression classifier. It is regarded as the null model without any generalization from the base classifiers.
Weighted Logistic Regression (Weighted-LR). Here we use the weighted average of the base classifiers' parameters as the classification parameters for the new class. The weights are calculated by an L2-normalization of the cosine similarities between the feature vector of the novel class and those of all base classes. This method shares a similar notion of transferring base classifiers to novel classes, but the transferring process is heuristic.
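A minimal sketch of this heuristic baseline follows, assuming that each class is summarized by its mean feature vector and that the base classifier parameters are stacked as rows of a matrix. The name `weighted_lr_parameters` and the toy inputs are hypothetical.

```python
import numpy as np

def weighted_lr_parameters(x_new_mean, class_means, W_base):
    """Similarity-weighted combination of base classifier parameters.

    x_new_mean: (p,) mean feature of the novel class.
    class_means: (N, p) mean features of the base classes.
    W_base: (N, d) base classifier parameters, one row per class.
    """
    sims = class_means @ x_new_mean / (
        np.linalg.norm(class_means, axis=1) * np.linalg.norm(x_new_mean)
    )
    weights = sims / np.linalg.norm(sims)   # L2-normalised cosine similarities
    return weights @ W_base

# Toy check: the novel class points exactly along base class 0 and is orthogonal to the rest.
class_means_demo = np.eye(3)
W_base_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w_weighted = weighted_lr_parameters(np.array([2.0, 0.0, 0.0]), class_means_demo, W_base_demo)
```

Because the weights are a fixed function of the similarities, nothing in this rule is fitted to the labels of the new samples.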
Model Regression Networks (MRN) (Wang and Hebert, 2016). This method supposes that there is a mapping function from the classification parameters trained with small samples to those trained with large samples within the same class, and that this mapping function can be learned from base classes. Then, given a new classifier trained with small samples, the learned MRN is reused to predict the classification parameters trained with large samples.
VANER. We only use the classification parameters transferred from base classes to classify the new classes, and do not consider the parameters generated from new class samples. This method is designed to demonstrate the importance of late fusion.
VANER(Mapping). We directly learn the embedding by Equation 2 without the first regression term. Then we use the above Weighted-LR method in the embedding space instead of the original feature space. This method is used to evaluate the effectiveness of the mapping function.
VANER(Embedding). We directly train a regression model from the original feature space to the classification parameter space without the network embedding. This method is used to demonstrate the effectiveness of class node embedding on the visual analogy network.
4.3.1. Classification Performance on Novel Classes
In this section, we evaluate how well the classifiers learned by our method and the baselines perform on new classes. The results are shown in Table 1. In all the low-shot settings, our method consistently performs the best in both the AUC and F1 metrics. In contrast, LR performs the worst in the 1-shot setting, which demonstrates the importance of generalization from base classes when the new class has very few samples. MRN does not work well in most settings, demonstrating that its basic hypothesis, namely that the classification parameters trained on large samples and on small samples are correlated, does not necessarily hold in real data. By comparing with the three variants of our method, we can conclude that the major ingredients of our method, including network embedding for low-dimensional representations, the mapping function for transforming the embedding space to the classification parameter space, and the late fusion strategy, are all necessary and effective. We also compare different late fusion strategies. From the results shown in Table 2, we find that the Voting strategy fits our scenario best.
Furthermore, we compare the performance of these methods in different low-shot settings, and the results are shown in Figure 2. Our method consistently performs the best in all settings, and its advantage is more obvious when the new classes have fewer training samples. In particular, the baseline compared against needs about 20 shots to reach 0.9, while our method needs only 2 shots, indicating that our method can save training data. An interesting phenomenon is that the performance of Weighted-LR does not change as the number of shots increases. The main reason is that its heuristic rule is not flexible enough to incorporate new information. This demonstrates the importance of learning to learn, rather than rule-based learning.
4.3.2. Insightful Analysis
Although our method performs the best in various settings, the failure cases are easy to find. We are interested in the following questions: (1) what are the typical failure cases? (2) what is the driving factor that controls the success of generalization? and (3) is the generalization process explainable?
To answer the above questions, we conduct a further in-depth analysis. First, we randomly select 10 novel classes and list the performance of our method and LR in the one-shot setting on these classes, as shown in Table 3. The effect of generalization is very obvious in 9 classes, while in the bubble class, generalization plays a negative role.
To discover the driving factor controlling success and failure, we define and calculate the similarity ratio (SR) of a novel class with the base classes by:
Here the similarity of two classes is calculated by Equation 1. Intuitively, if a new class is very similar to its top-$k$ most similar base classes, while dissimilar to the remaining base classes, its Similarity Ratio will be high.
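One plausible reading of this intuition is sketched below: the average similarity to the top-k base classes divided by the average similarity to the remaining base classes. This exact form is an assumption made for illustration and may differ from the definition used in the experiments. The function name `similarity_ratio` is hypothetical, and similarities are assumed to be positive.

```python
def similarity_ratio(similarities, k):
    """Mean of the k largest similarities divided by the mean of the remaining ones."""
    if not 0 < k < len(similarities):
        raise ValueError("k must be between 1 and len(similarities) - 1")
    ranked = sorted(similarities, reverse=True)
    top, rest = ranked[:k], ranked[k:]
    return (sum(top) / len(top)) / (sum(rest) / len(rest))

# A class with two close neighbours and two distant ones gets a high ratio.
sr_demo = similarity_ratio([0.9, 0.8, 0.2, 0.1], k=2)
```

Under this reading, the ratio grows either when the nearest base classes become more similar to the new class or when the remaining base classes become less similar to it.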
In this experiment, we do a linear regression of the relative improvement in of our method over the non-transfer method in 1-shot setting on the Similarity Ratio for each novel class. The dependent variable indicates the success degree of generalization. And we use as our experiment setting. We plot the similarity ratio and relative improvement of all new classes in Figure 3. We can see the relative improvement in a new class is positively correlated with the similarity ratio of the new class, with confidence interval for the correlation coefficient range between and .
The results suggest that our method is consistent with human-like learning in two respects. First, a new concept can be learned faster if it is more similar to some previously learned concepts (i.e., the numerator of the Similarity Ratio increases). Second, a new concept can be learned faster if more diverse concepts have already been learned (i.e., the denominator of the Similarity Ratio decreases). This principle can also be used to guide the generalization process and to help determine whether a new class is suitable for generalization.
| Category | LR (No Transfer) | VANER (Transfer) |
Finally, we examine whether the generalization process is explainable. We randomly select 6 novel classes and, for each, visualize the top-3 base classes that are most similar to it, as shown in Figure 4. In our method, these base classes have a large impact on the formation of the new classifier. The top-3 base classes are visually related to the novel classes, which makes the generalization process intuitive and explainable.
4.3.3. Parameter Analysis
Our method has two important parameters: the voting parameter and the number of embedding dimensions. The voting parameter determines the relative weights of the transfer parameters and the model parameters in the fusion stage. We fix a 1-shot, 5-shot, or 20-shot setting and observe how the performance changes as we tune the voting parameter. The result is shown in Figure 5. The best value of the voting parameter is relatively stable across the different settings, so we use 0.2 for all k-shot settings. We also tune the number of embedding dimensions and observe the change in performance. The results are shown in Figure 6. There is a large stable range from which to choose, and we use 600 in our experiments.
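To illustrate the role of the voting parameter, the sketch below fuses two parameter vectors by a convex combination. This is only an assumed form of the fusion stage: the function name is hypothetical, and which of the two parameter sets the value 0.2 weights is not specified in this section, so here it is assumed to weight the model parameters learned from the few novel samples.

```python
import numpy as np


def fuse_parameters(w_transfer, w_model, voting=0.2):
    """Hypothetical fusion of classifier parameters for a novel class.

    w_transfer: parameters generalized from the base classes.
    w_model: parameters learned directly from the few novel samples.
    voting: assumed weight on w_model; (1 - voting) goes to w_transfer.
    """
    if not 0.0 <= voting <= 1.0:
        raise ValueError("voting must lie in [0, 1]")
    w_transfer = np.asarray(w_transfer, dtype=float)
    w_model = np.asarray(w_model, dtype=float)
    if w_transfer.shape != w_model.shape:
        raise ValueError("parameter vectors must have the same shape")
    return voting * w_model + (1.0 - voting) * w_transfer
```

Under this assumption, sweeping `voting` over a grid and evaluating each fused classifier on held-out data would produce a curve of the kind the text describes for Figure 5.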
5. Conclusions and Future Works
In this paper, we investigate the problem of learning to learn image classifiers and explore a new human-like learning mechanism that fully leverages previously learned concepts to assist the learning of new concepts. In particular, we combine the ideas of learning to learn and learning by analogy and propose a novel VANER model to carry out the generalization process from base classes to novel classes. From the extensive experiments, we conclude that the proposed method performs much better than the baselines, is consistent with human-like learning, and provides an insightful and intuitive generalization process.
References
- Ahmed et al. (2013) Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J Smola. 2013. Distributed large-scale natural graph factorization. In Proceedings of the 22nd international conference on World Wide Web. ACM, 37–48.
- Fei-Fei et al. (2006) Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28, 4 (2006), 594–611.
- Gentner and Holyoak (1997) Dedre Gentner and Keith J Holyoak. 1997. Reasoning and learning by analogy: Introduction. American Psychologist 52, 1 (1997), 32.
- Hariharan and Girshick (n.d.) Bharath Hariharan and Ross Girshick. n.d. Low-shot Visual Recognition by Shrinking and Hallucinating Features.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
- Koch (2015) Gregory Koch. 2015. Siamese neural networks for one-shot image recognition. Ph.D. Dissertation. University of Toronto.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
- Kwitt et al. (2016) Roland Kwitt, Sebastian Hegenbart, and Marc Niethammer. 2016. One-shot learning of scene locations via feature trajectory transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 78–86.
- Lake et al. (2016) Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. 2016. Building machines that learn and think like people. arXiv preprint arXiv:1604.00289 (2016).
- Patricia and Caputo (2014) Novi Patricia and Barbara Caputo. 2014. Learning to Learn, from Transfer Learning to Domain Adaptation: A Unifying Perspective. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Qi et al. (2011) Guo-Jun Qi, Charu Aggarwal, Yong Rui, Qi Tian, Shiyu Chang, and Thomas Huang. 2011. Towards cross-category knowledge propagation for learning visual concepts. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 897–904.
- Rezende et al. (2016) Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. 2016. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106 (2016).
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. DOI:http://dx.doi.org/10.1007/s11263-015-0816-y
- Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065 (2016).
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
- Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. ACM, 1067–1077.
- Tommasi et al. (2014) Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. 2014. Learning categories from few examples with multi model knowledge transfer. IEEE transactions on pattern analysis and machine intelligence 36, 5 (2014), 928–941.
- Tzeng et al. (2015) Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. 2015. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision. 4068–4076.
- Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, and others. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems. 3630–3638.
- Wang and Hebert (2016) Yu-Xiong Wang and Martial Hebert. 2016. Learning to learn: Model regression networks for easy small sample learning. In European Conference on Computer Vision. Springer, 616–634.
- Yao and Doretto (2010) Yi Yao and Gianfranco Doretto. 2010. Boosting for transfer learning with multiple sources. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on. IEEE, 1855–1862.
- Zhu et al. (2014) Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan. 2014. Capturing Long-tail Distributions of Object Subcategories. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).