Introduction
Even as deep neural networks continue to exhibit excellent performance on large scale data, they suffer from severe overfitting when learning with very low sample complexity. The growing complexity and size of these networks, the main factor behind their effectiveness on large scale data, is also the reason for their failure to generalize from limited data. Learning from very few training samples, or few-shot learning, is an important learning paradigm that is widely believed to be how humans learn new concepts, as discussed in [Thorpe, Fize, and Marlot 1996] and [Li et al. 2002]. However, few-shot learning still remains a key challenge in machine learning.
Finetuning methods seek to overcome this limitation by leveraging networks that have been pretrained on large scale data. Starting from such networks and carefully adapting their parameters has enabled deep neural networks to remain effective when learning from few samples. This procedure affords several advantages: (1) it exploits good feature representations learned from large scale data; (2) it is very efficient, often involving only a few quick iterations over the small-scale data; (3) it scales linearly to a large number of few-shot learning tasks; and (4) it is applicable to any existing pretrained network, without the need to search for optimal architectures or train from scratch.
Unfortunately, finetuning can be unstable, especially when the amount of training data is small. Large deep neural networks typically comprise many redundant parameters, with the parameters within each layer being highly correlated with each other. For instance, consider the first-layer filters of LeNet [LeCun et al. 1998] learned on the MNIST dataset, shown in Fig. 1. A number of the filters are similar to others, i.e., they functionally play the same role and tend to produce similar activations. The presence of a large number of correlated filters can lead to overfitting, especially when learning in a small-sample regime.
To stabilize the finetuning process, we propose a simple yet effective procedure to regularize finetuning based few-shot learning approaches. The key idea of our approach is to identify the redundancies in the parameters and constrain their updates during finetuning. We achieve this by first clustering the parameters in each layer of the network into multiple groups, based on the similarity of their activations on the specific few-shot learning task. The parameters in each group share a common update, while we ensure intra-group similarity and inter-group diversity of the activations. By grouping the model parameters and guiding the finetuning process with more supervisory signals, our approach reduces the capacity of the network, mitigating overfitting and improving the effectiveness of pretrained networks for few-shot learning.
We make the following contributions in this paper: (1) grouping neurons by activations for layer-wise clustering of parameters, while enforcing intra-group similarity and inter-group orthogonality of the group activations; (2) a hybrid loss function for few-shot learning consisting of a cross-entropy loss as well as a triplet loss among the few-shot data, the latter providing more supervision for optimizing the model; and (3) a reinforcement learning based mechanism to efficiently search for the optimal clustering of the parameters across all the layers of the model. Our proposed few-shot learning approach affords the following advantages: (1) it is a task-agnostic approach that does not rely on any task-specific prior knowledge; (2) it is applicable to any network without changing the original network structure; and (3) it is a general-purpose technique for decomposing the parameter space of high-capacity deep neural networks. To demonstrate the effectiveness of our approach, we evaluate it experimentally on two tasks: a one-shot domain-adaptation task for matching images across three different domains and a
few-shot transfer learning task. Our experimental results show that the proposed approach yields significant performance improvements over task-agnostic finetuning approaches for small-sample learning, without the need for any task-specific prior knowledge.
Related Work
Few-shot Learning: One of the earliest works on one-shot learning for object categories was by Fei-Fei et al. [Fei-Fei, Fergus, and Perona 2006]. The authors developed a Bayesian learning framework with the premise that previously learned classes can inform a prior on the model parameters for a new class. Among recent work, powerful generative models have been developed that compose characters from a dictionary of parts [Wong and Yuille 2015] or strokes [Lake, Salakhutdinov, and Tenenbaum 2013]; such generative models have shown great promise on datasets with limited intra-class variation. Siamese networks [Koch, Zemel, and Salakhutdinov 2015] have been used to automatically learn feature representations in which objects of the same class are closer together. Santoro et al. [Santoro et al. 2016] proposed memory-augmented neural networks with an external content-based memory. Wang and Hebert [Wang and Hebert 2016b; Wang and Hebert 2016a] propose a regression from classifiers trained on small datasets to classifiers trained on large datasets. Vinyals et al. [Vinyals et al. 2016] proposed matching networks that learn a non-parametric nearest neighbor classifier end-to-end, with the weights for the nearest neighbors provided by an LSTM. Ravi and Larochelle [Ravi and Larochelle 2016] proposed an LSTM-based meta-learner that uses its state to represent the learning updates of a classifier's parameters for few-shot learning. Hariharan and Girshick [Hariharan and Girshick 2016] suggest a novel squared gradient magnitude regularization technique along with techniques to hallucinate additional training examples for small-data classes. While these approaches achieve state-of-the-art performance on few-shot learning problems, they often rely on architectures specifically designed for these problems. In contrast, we explore a more general method that can reuse existing networks by finetuning them for few-shot learning.
Domain Adaptation: These methods seek to adapt a model trained on one domain (the source domain) to another domain (the target domain). [Daumé III 2009]
proposed an adaptation method through feature augmentation, creating feature vectors with a source component, a target component, and a shared component. A Support Vector Machine (SVM) is then trained on this augmented feature vector.
[Hoffman et al. 2013] used the feature representation of a pretrained network, such as AlexNet trained on the 2012 ImageNet 1000-way classification dataset [Krizhevsky, Sutskever, and Hinton 2012]. The authors replace the source-domain classification layer with a domain-adaptive classification layer that takes the activations of one of the existing network's layers as input features. We are also interested in adapting a model learned on large scale data from the source domain to a target domain with few examples. However, unlike these approaches, we propose a task-adaptive regularization approach that improves the adaptability of existing pretrained networks to new target domains with limited training samples.

Proposed Approach
Our focus in this paper is few-shot learning by finetuning an existing pretrained network. We consider the setting where the pretrained network was learned on a source domain with large amounts of data, while the few-shot target domain provides very few samples. To avoid the pitfalls of overfitting when training with only a few examples, we propose the following strategy. (1) We first search for similar activations in the source domain to identify redundant filters and group them. (2) Having identified the redundant parameters, we finetune the pretrained network in the target domain with group-wise backpropagation to regularize it. Together, the layer-wise grouping and the group-wise backpropagation make finetuning on few-shot samples more stable. However, our grouping method has a significant hyperparameter, the number of groups, and deciding it for each layer is nontrivial since the optimal number may differ across layers. (3) We therefore propose a hyperparameter search method based on reinforcement learning to find the optimal group numbers. We now describe the three subcomponents of our approach: (1) grouping neurons by activations, (2) model finetuning for few-shot learning, and (3) a reinforcement learning based policy for searching over the optimal grouping of the parameters.
Grouping Neurons by Activations (GNA)
To identify redundant parameters to group together for more stable finetuning, we define correlated filters as filters that have similar activations conditioned on a set of training images. We would like to group these correlated filters as a means of regularizing the network. Fig. 2 illustrates a toy example of two convolutional filters with highly correlated activations (heatmaps): since the two filters have similar patterns, their outputs are very similar.
Now consider the fully connected layer of a neural network illustrated in Fig. 3. Given a batch of data as input, we can pass each data element (image) through the network to compute the activations a_i^l at layer l, where a_i^l is the output of the nonlinear activation function of the i-th neuron in layer l. If we compare one activation a_i^l to another activation a_j^l over the input data, we can measure the correlation between the two neurons. In our example (Fig. 3), some pairs of neurons have similar output patterns over the batch of image data, whereas other pairs have different output patterns; the former are good candidates for grouping. In our proposed approach, we use a clustering algorithm to group similar neurons based on their activations over the few-shot training data (e.g., one image for each category). In particular, we use k-means clustering to group the neurons, and the number of clusters for each layer is learned via a reinforcement learning procedure described later.
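As a concrete illustration, this grouping step can be sketched as plain k-means over the rows of an activation matrix. The function name and the deterministic centroid initialization below are our own choices for this sketch, not the paper's implementation.

```python
import numpy as np

def group_neurons_by_activation(activations, n_groups, n_iters=20):
    """Cluster neurons whose activation patterns over a batch are similar.

    activations: (n_neurons, batch_size); row i holds neuron i's activations
    over the few-shot batch. Returns one group label per neuron.
    """
    n_neurons = activations.shape[0]
    # Deterministic init: centroids from evenly spaced neurons (a simplification).
    idx = np.linspace(0, n_neurons - 1, n_groups).astype(int)
    centroids = activations[idx].astype(float)
    for _ in range(n_iters):
        # Assign each neuron to its nearest centroid (squared Euclidean distance).
        dists = ((activations[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster becomes empty.
        for g in range(n_groups):
            if (labels == g).any():
                centroids[g] = activations[labels == g].mean(axis=0)
    return labels
```

Neurons whose rows are nearly identical land in the same group, which is exactly the redundancy the regularizer exploits.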
Backpropagation with Groups
Once the redundant parameter groups in each layer are identified, an effective regularization method is required during finetuning to prevent overfitting. To restrain the redundant parameters from overfitting, we can update the parameters in a group with a single gradient, since the gradients of the redundant weights in the same group are expected to be very similar to each other. Based on this insight, we update the parameters of each group with a shared gradient during learning to regularize the network. The shared update is computed as the average gradient of all the filters in the group, i.e., ḡ = (1/|G|) Σ_{i∈G} g_i, where g_i is the gradient of the i-th filter in group G. We demonstrate the feasibility of this backpropagation with an averaged gradient through the domain adaptation and transfer learning experiments described later.
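The shared update above can be sketched in a few lines; `groupwise_gradient` is our own name for this illustrative helper.

```python
import numpy as np

def groupwise_gradient(grads, labels):
    """Replace each filter's gradient with the average gradient of its group,
    so all filters in a group receive the same (shared) update."""
    shared = np.empty_like(grads, dtype=float)
    for g in np.unique(labels):
        mask = labels == g
        shared[mask] = grads[mask].mean(axis=0)
    return shared
```

In practice this would be applied to each layer's per-filter gradients just before the optimizer step.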
Loss Functions
The low sample complexity of typical few-shot learning results in extremely noisy gradient updates for each few-shot entity. To provide more supervisory signals to the learning process, we introduce a triplet loss, similar to the one introduced by Schroff et al. [Schroff, Kalenichenko, and Philbin 2015], into the network optimization objective. The triplet loss serves the twin purposes of providing more supervisory signals to increase the separation between the few-shot entities and of reducing the noise in the gradient signal by averaging over a larger number of loss terms.
We define the triplet loss for the few-shot learning problem as:

L_triplet = Σ_{i,j,k} [ D(f(x_i), f(x_j)) − D(f(x_i), f(x_k)) + α ]_+    (1)

where f(x) is the output of the network for input x; i, j are indices of samples belonging to the same class and k is the index of a sample belonging to a different class; D(·,·) is the distance between its two arguments; α is the margin; and [·]_+ = max(0, ·) denotes the margin-maximizing hinge. The distance D can be the Euclidean distance for regression tasks and the total variation for classification tasks, respectively. We note that the triplet loss reduces to a margin loss for one-shot learning, since no positive pairs are available. The margin loss is defined as:

L_margin = Σ_{i,k} [ α − D(f(x_i), f(x_k)) ]_+    (2)
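The triplet loss of Eq. (1) can be sketched as a brute-force sum over all valid (anchor, positive, negative) triples. The squared Euclidean distance, the averaging over triples, and the default margin value here are our own assumptions for the sketch.

```python
import numpy as np

def triplet_loss(embeddings, labels, margin=0.2):
    """Hinge-style triplet loss (Eq. 1) over all valid triples, averaged.

    embeddings: (n, d) network outputs; labels: class label per sample.
    Uses squared Euclidean distance; margin is an arbitrary choice here.
    """
    total, count = 0.0, 0
    n = len(labels)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if i == j or labels[i] != labels[j] or labels[i] == labels[k]:
                    continue  # not a valid (anchor, positive, negative) triple
                d_pos = float(np.sum((embeddings[i] - embeddings[j]) ** 2))
                d_neg = float(np.sum((embeddings[i] - embeddings[k]) ** 2))
                total += max(0.0, d_pos - d_neg + margin)
                count += 1
    return total / max(count, 1)
```

When the classes are already well separated, every hinge term is zero and the loss vanishes, so the extra supervision only activates where separation is insufficient.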
In addition to the classification loss described above, it is important to ensure that the intra-group activations are similar to each other, while the inter-group activations are orthogonal to each other. We augment the few-shot learning loss function with these two criteria during training. Let the activation of filter i in the l-th layer be a_i^l. The intra-group similarity loss is defined as:
L_intra = Σ_l Σ_g Σ_{i,j∈g} ‖ a_i^l − a_j^l ‖²    (3)
The inter-group orthogonality loss is defined as:

L_inter = Σ_l Σ_{g≠h} ‖ A_g^l (A_h^l)ᵀ ‖_F²    (4)

where A_g^l and A_h^l are matrices whose rows are the activations of the filters in groups g and h, respectively, at the l-th layer, and ‖·‖_F² is the squared Frobenius norm.
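The two group regularizers above can be sketched directly from their definitions. The function names and the exact normalization (plain sums, no averaging) are our own choices for this sketch.

```python
import numpy as np

def intra_group_loss(acts, labels):
    """Eq. (3) sketch: squared distances between activation vectors of
    filters that belong to the same group (should be small)."""
    total = 0.0
    for g in np.unique(labels):
        A = acts[labels == g]
        for i in range(len(A)):
            for j in range(i + 1, len(A)):
                total += float(np.sum((A[i] - A[j]) ** 2))
    return total

def inter_group_loss(acts, labels):
    """Eq. (4) sketch: squared Frobenius norm of cross-group activation
    products, which is zero when groups are mutually orthogonal."""
    total = 0.0
    groups = np.unique(labels)
    for gi in range(len(groups)):
        for gj in range(gi + 1, len(groups)):
            A = acts[labels == groups[gi]]  # (n_i, batch)
            B = acts[labels == groups[gj]]  # (n_j, batch)
            total += float(np.sum((A @ B.T) ** 2))
    return total
```

Identical activations within a group drive the intra-group term to zero, while orthogonal activations across groups drive the inter-group term to zero.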
Our few-shot learning task is trained to optimize a combination of the loss functions described above. The total loss is:

L = L_cls + λ₁ L_triplet + λ₂ L_intra + λ₃ L_inter    (5)

where λ₁, λ₂, and λ₃ are hyperparameters that control the importance of each of the loss terms.
Hyperparameter Search Through Reinforcement Learning
The performance of the proposed approach depends critically on the number of clusters into which the weights in each layer are grouped. Manually selecting the number of clusters can lead to suboptimal performance, while an exhaustive search is prohibitively expensive, and the problem is exacerbated as the number of layers in the network grows. Common methods for determining hyperparameters are brute-force search, grid search and random search. While brute-force search is guaranteed to find the optimal solution, it is very time consuming and usually intractable. Grid search is the most commonly used method for hyperparameter selection, but it is limited by the granularity of the search and can still be computationally expensive. Perhaps surprisingly, Bergstra and Bengio [Bergstra and Bengio 2012] show that random search is more effective than grid search. Recently, Hansen [Hansen 2016] proposed a reinforcement learning approach for determining hyperparameters, and building upon this, Zoph and Le [Zoph and Le 2016] used a neural network to find optimal hyperparameters of a neural architecture through reinforcement learning. In this work, we adopt a similar approach to determine the optimal number of clusters in each layer of the network for few-shot learning.
We pose the hyperparameter search as a reinforcement learning problem of finding a locally optimal layer-wise group size for the entire network. Figure 4(a) shows the setup: the environment is a pretrained network that we wish to finetune for few-shot learning. Intuitively, the policy network implicitly learns the relation between different groupings of the layer weights and the performance of the network. We model the problem as a fixed-horizon episodic reinforcement learning problem in which all actions (layer-wise predictions of the number of clusters) have an equal effect on the final outcome. We represent the sequence of actions as a_1, …, a_L, where a_l is the action at the l-th layer, predicting the number of clusters in that layer. We define the state as a vector holding the number of groups in each layer:
s = [g_1, g_2, …, g_L]    (6)

where g_l is the number of groups in layer l.
Our agent's policy network is a Long Short-Term Memory (LSTM) network [Hochreiter and Schmidhuber 1997], as shown in Fig. 4(b), and is learned through a policy gradient method. The time horizon of the LSTM equals the number of layers in the pretrained network. The output of the LSTM at the l-th step is passed through a fully connected layer followed by a softmax layer to predict the probabilities of the action a_l. The input of the policy network at the l-th step is the concatenation of the number of filters in the l-th layer (a single integer) and the action at the previous layer (a one-hot encoding of length N, where N is the number of actions). We adopt the policy gradient method of [Zoph and Le 2016] to learn the agent's policy that maximizes the expected accuracy of the proposed finetuning process through parameter clustering, since the cumulative reward is non-differentiable. We define the agent's reward returned by the environment as the accuracy of the finetuned model on the validation set for a valid action, and −1 for an invalid action (an impossible group size). That is, the reward R is the validation accuracy of the finetuned network whose parameters have been clustered according to the predicted actions. In each episode the agent predicts a list of actions corresponding to the number of groups in each layer of the network. The parameters in each layer of the pretrained network are clustered into the number of groups determined by the action, and the network is then finetuned on the few-shot data until convergence, after which its validation accuracy is recorded and used as the reward for the agent. The agent's policy network is then updated by backpropagating the gradients computed from the loss in Eq. 9. As episodes are repeated, the policy network's predictions inch closer to the optimal number of parameter clusters in each layer, in turn resulting in a gradual increase in the accuracy of the finetuning process.
To estimate the optimal clustering of the network's parameters, the policy network's parameters θ are optimized to maximize the expected reward J(θ), computed over all future episodes from the current state:

J(θ) = E_{P(a_{1:L}; θ)}[R]    (7)
Since the reward signal is non-differentiable, we use an approximate policy gradient method to iteratively update the policy network. In this work, we use the REINFORCE rule from [Williams 1992]:

∇_θ J(θ) = E_{P(a_{1:L}; θ)}[ ∇_θ log P(a_{1:L}; θ) R ]    (8)

The above quantity can be empirically approximated as:

∇_θ J(θ) ≈ (1/m) Σ_{k=1}^{m} ∇_θ log P(a_{1:L}^{(k)}; θ) R_k    (9)

where R_k is the reward of the k-th episode, m is the number of episodes, and P(a_{1:L}; θ) denotes the probability of a history of actions a_{1:L} given policy-defining weights θ. Our complete algorithm is presented in Algorithm 1.
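The REINFORCE update of Eq. (9) can be sketched for a simplified, stateless softmax policy over group-size actions (the paper's policy is an LSTM; collapsing it to a single softmax is our simplification for illustration).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_update(theta, actions, rewards, lr=0.1):
    """One gradient-ascent step of Eq. (9) for a stateless softmax policy.

    theta: (n_actions,) logits; actions/rewards hold one sampled action and
    one scalar reward per episode; lr is an arbitrary learning rate.
    """
    grad = np.zeros_like(theta)
    for a, r in zip(actions, rewards):
        p = softmax(theta)
        g = -p           # gradient of log softmax w.r.t. the logits, part 1
        g[a] += 1.0      # part 2: indicator of the taken action
        grad += g * r    # scale by the episode's reward
    return theta + lr * grad / len(actions)
```

Actions that were followed by high validation accuracy have their logits, and hence their sampling probability, increased, which is the mechanism driving the search toward better groupings.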
Experiment
The usefulness of the proposed method is verified through experiments on two tasks, domain adaptation and transfer learning. In both tasks, we show how our approach can be used to learn a new model from only a few examples. We present results for multiple variants of our proposed approach: (1) FineTuning: the standard approach of updating a pretrained network on the few-shot data with a cross-entropy loss; (2) FineTuning + Triplet Loss: updating a pretrained network on the few-shot data with cross-entropy and triplet losses; (3) GNA: the proposed orthogonal grouping of parameters with cross-entropy loss and manual hyperparameter search; (4) GNA + Triplet Loss: the proposed grouping with cross-entropy and triplet losses and manual hyperparameter search; (5) GNA + Triplet Loss + Greedy: the proposed grouping with cross-entropy and triplet losses and greedy hyperparameter selection; and (6) GNA + Triplet Loss + RL: the proposed grouping with cross-entropy and triplet losses and RL-based hyperparameter search.

Domain Adaptation
For this task, we consider the Office dataset introduced by [Saenko et al. 2010], consisting of a collection of images from three distinct domains: Amazon, DSLR and Webcam. The dataset consists of 31 objects commonly encountered in office settings, such as keyboards, file cabinets and laptops. We follow the experimental protocol used in [Hoffman et al. 2013] and consider domain adaptation between the Amazon (source) and the Webcam (target) images. The experiment is conducted on 16 of the 31 objects that are also present in the ImageNet dataset. Our pretrained network is the ResNet18 architecture [He et al. 2016] trained on the ImageNet dataset. Our action space for this experiment is the set of possible numbers of clusters in each layer, or equivalently the number of possible groups per layer. The minimum number of groups is one, and the maximum number of groups equals the number of filters n in the layer. In this work, we restrict the actions (numbers of possible clusters) to reduce the size of the action space and speed up the search process; in general, however, the action space could also be continuous over [1, n].
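One natural way to discretize the per-layer action space, consistent with the doubling and halving moves used by the search baselines later in this section, is powers of two up to the number of filters; this particular choice is our assumption for the sketch.

```python
def candidate_group_counts(n_filters):
    """Enumerate candidate group counts for a layer with n_filters filters,
    assuming a powers-of-two discretization (1, 2, 4, ...) that keeps the
    search space small. One group means no splitting; n_filters groups
    means every filter is its own group."""
    counts, c = [], 1
    while c <= n_filters:
        counts.append(c)
        c *= 2
    return counts
```

This shrinks the per-layer action space from n options to about log2(n), which keeps the RL search tractable.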
We use 20 source images per class for clustering the parameters and finetune the model with 16 one-shot examples, one image per class. The performance of our proposed approach is compared to the baselines in Table 1, and Figure 5 shows the progression of the reinforcement learning based hyperparameter search in terms of few-shot learning accuracy. Late fusion [Hoffman et al. 2013] and Daume [Daumé III 2009] serve as external baselines. Both originally use DeCAF7 features; we additionally apply their methods with ResNet18 features for a fair comparison with our method. For finetuning, the learning rate is 0.01 and is reduced to 0.001 after 1000 iterations. We report the average performance over 10 runs with randomly selected datasets.
We note that the clustering hyperparameter search through reinforcement learning is able to efficiently explore the hyperparameter space and find better parameter groupings than both manual and greedy search. For the manual baseline, we initialize the number of groups in all the layers to two and compute the accuracy of the network. We then compute the accuracy of the network after doubling and after halving the number of groups in a layer, and keep the action (doubling or halving) that results in higher accuracy. We repeat this process, updating the number of groups iteratively.
For the greedy baseline (Greedy), we set the number of groups in the first layer to two and compute the accuracy of the network. If the accuracy is greater than before, the number of groups is doubled; otherwise we revert to the previous number of groups and move to the next layer. We repeat this procedure until the last layer.
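The greedy baseline described above can be sketched as follows; `evaluate` stands in for the expensive fine-tune-and-validate call, and the function name and `max_doublings` cap are our own choices.

```python
def greedy_group_search(n_layers, evaluate, max_doublings=6):
    """Greedy baseline: start every layer at two groups, then, layer by
    layer, keep doubling that layer's group count while validation accuracy
    improves; revert and move on once it stops improving.

    evaluate(groups) is assumed to return validation accuracy for a full
    per-layer group assignment (fine-tuning + evaluation in practice).
    """
    groups = [2] * n_layers
    best = evaluate(groups)
    for layer in range(n_layers):
        for _ in range(max_doublings):
            trial = list(groups)
            trial[layer] *= 2
            acc = evaluate(trial)
            if acc > best:       # keep the doubling only if accuracy improved
                groups, best = trial, acc
            else:                # otherwise revert and move to the next layer
                break
    return groups, best
```

Unlike the RL search, this procedure commits to each layer's group count in order and never revisits earlier layers, which is why it can get stuck in worse local optima.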
Method                              Feature type  Accuracy (%)
Late fusion [Hoffman et al. 2013]   DeCAF7        64.29
Late fusion [Hoffman et al. 2013]   ResNet18      71.08
Daume [Daumé III 2009]              DeCAF7        72.09
Daume [Daumé III 2009]              ResNet18      76.25
FineTuning                          ResNet18      70.07
FineTuning + margin loss            ResNet18      70.34
GNA                                 ResNet18      79.94
GNA + margin loss                   ResNet18      82.16
GNA + margin loss + Greedy          ResNet18      83.16
GNA + margin loss + RL              ResNet18      85.04
Transfer Learning
In the domain adaptation experiment, where the source and target domains share the same classes, we showed that our proposed method outperforms the baseline approaches. The grouping method can also be applied when the source classes differ from the target classes. We therefore consider the task of transfer learning, where the few-shot learning task (target) is different from the pretrained network's task (source). Our pretrained network is the ResNet18 architecture trained on classes from the CIFAR100 dataset, while the few-shot learning task uses the ten classes of the CIFAR10 dataset. For the transfer learning setting, we select source classes that are different from the ten target classes. Our action space for this experiment is again the set of possible numbers of clusters in each layer. We consider two different shot settings, one with K = 1 and another with K = 10. The shot data are chosen randomly from the target training set for finetuning, and we evaluate on the entire target test set. The performance of our proposed approach is compared to the baselines in Table 2, both for one-shot learning and for 10-shot learning. Our proposed margin loss improves the accuracy of both the grouping method and plain finetuning, and the accuracies of our grouping methods are higher than the finetuning results. Overall, the proposed method with RL search outperforms the baseline finetuning approach by 6.37% in 1-shot learning and 4.3% in 10-shot learning.
Method                      Accuracy (%)
FineTuning                  29.58
FineTuning + margin loss    33.44
GNA                         32.70
GNA + margin loss           34.43
GNA + margin loss + greedy  33.50
GNA + margin loss + RL      35.95
(a) 1-shot learning
Method                       Accuracy (%)
FineTuning                   56.00
FineTuning + margin loss     57.32
FineTuning + triplet loss    58.17
GNA                          57.96
GNA + margin loss            59.05
GNA + triplet loss           58.56
GNA + triplet loss + greedy  58.56
GNA + triplet loss + RL      60.30
(b) 10-shot learning
Effect of Sample Size K
In this experiment we compare the performance of our proposed approach as we vary the difficulty of the few-shot learning problem. We consider different values of K, the number of samples per category. Table 3 presents the results of GNA with clustering and of standard finetuning without clustering as we vary K. Unsurprisingly, performance decreases and uncertainty grows as K is lowered toward one-shot learning. But we observe a consistent improvement in performance with our clustering approach in comparison to the standard finetuning procedure.
Number of shots  Accuracy (%)
                 w/o clustering  w/ clustering
25 shot          80.41           84.48
20 shot          76.90           82.09
15 shot          81.63           84.01
10 shot          70.64           72.88
5 shot           68.84           68.93
1 shot           52.25           53.77
(a) Accuracy
Number of shots  Standard deviation
                 w/o clustering  w/ clustering
25 shot          5.68            0.76
20 shot          2.66            1.44
15 shot          5.16            0.79
10 shot          5.31            3.79
5 shot           6.95            2.72
1 shot           10.02           10.66
(b) Standard deviation
Effect of Clustering Across Layers
It is commonly believed that most deep convolutional neural networks have highly redundant filters only at the initial layers. If this is indeed the case, applying our clustering method to layers beyond the initial few should not be helpful. To test this hypothesis, we apply clustering to an increasing number of layers, starting from the initial layers of the network. For this experiment we considered a ResNet18 network pretrained on a few categories of the CIFAR10 dataset and used the other categories as the few-shot learning task. The results of GNA in Table 4, surprisingly, do not confirm the hypothesis. We found that all layers of the deep network contained filters that were redundant for the few-shot learning task; in fact, applying our method to all the layers of the network resulted in the best performance. This experiment suggests that large convolutional neural networks may contain redundant parameters even in the higher layers, necessitating a search over the entire hyperparameter space of parameter groupings and motivating efficient search techniques like the one proposed in this paper.
Number of layers  Accuracy (%)
                  w/o clustering  w/ clustering
1 layer           80.41           82.87
3 layers          80.41           81.68
5 layers          80.41           82.98
7 layers          80.41           83.03
all               80.41           84.08
(a) Accuracy
Number of layers  Standard deviation
                  w/o clustering  w/ clustering
1 layer           5.68            4.47
3 layers          5.68            4.01
5 layers          5.68            3.65
7 layers          5.68            3.11
all               5.68            0.76
(b) Standard deviation
Conclusion
In this paper we proposed a new regularization method for finetuning a pretrained network for few-shot learning. The key idea of our approach is to effectively reduce the dimensionality of the network parameter space by clustering the weights in each layer while ensuring intra-group similarity and inter-group orthogonality. To provide additional supervision for the few-shot learning problem, we introduced a triplet loss that maximizes the separation between the few-shot samples. Lastly, we introduced a reinforcement learning based approach to efficiently search over the hyperparameters of our clustering approach. The experimental results demonstrate that our proposed regularization technique can significantly improve the performance of finetuning based few-shot learning approaches.
References
 [Bergstra and Bengio 2012] Bergstra, J., and Bengio, Y. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13:281–305.
 [Daumé III 2009] Daumé III, H. 2009. Frustratingly easy domain adaptation. arXiv preprint.
 [Fei-Fei, Fergus, and Perona 2006] Fei-Fei, L.; Fergus, R.; and Perona, P. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4):594–611.
 [Hansen 2016] Hansen, S. 2016. Using deep Q-learning to control optimization hyperparameters. arXiv preprint.
 [Hariharan and Girshick 2016] Hariharan, B., and Girshick, R. 2016. Low-shot visual recognition by shrinking and hallucinating features. arXiv preprint arXiv:1606.02819.

 [He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Identity mappings in deep residual networks. In European Conference on Computer Vision, 630–645. Springer.
 [Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
 [Hoffman et al. 2013] Hoffman, J.; Tzeng, E.; Donahue, J.; Jia, Y.; Saenko, K.; and Darrell, T. 2013. One-shot adaptation of supervised deep convolutional models. arXiv preprint arXiv:1312.6204.
 [Williams 1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3–4):229–256.
 [Koch, Zemel, and Salakhutdinov 2015] Koch, G.; Zemel, R.; and Salakhutdinov, R. 2015. Siamese neural networks for one-shot image recognition. In 32nd International Conference on Machine Learning, 2252–2259.
 [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
 [Lake, Salakhutdinov, and Tenenbaum 2013] Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. 2013. One-shot learning by inverting a compositional causal process. In NIPS.
 [LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 [Li et al. 2002] Li, F. F.; VanRullen, R.; Koch, C.; and Perona, P. 2002. Rapid natural scene categorization in the near absence of attention. Proceedings of the National Academy of Sciences 99(14):9596–9601.
 [Ravi and Larochelle 2016] Ravi, S., and Larochelle, H. 2016. Optimization as a model for few-shot learning.
 [Saenko et al. 2010] Saenko, K.; Kulis, B.; Fritz, M.; and Darrell, T. 2010. Adapting visual category models to new domains. In Computer Vision–ECCV 2010, 213–226.
 [Santoro et al. 2016] Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. One-shot learning with memory-augmented neural networks. arXiv preprint.

 [Schroff, Kalenichenko, and Philbin 2015] Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [Thorpe, Fize, and Marlot 1996] Thorpe, S.; Fize, D.; and Marlot, C. 1996. Speed of processing in the human visual system. Nature 381(6582):520–522.
 [Vinyals et al. 2016] Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching networks for one shot learning. arXiv preprint.
 [Wang and Hebert 2016a] Wang, Y.-X., and Hebert, M. 2016a. Learning from small sample sets by combining unsupervised meta-training with CNNs. In Advances in Neural Information Processing Systems, 244–252.
 [Wang and Hebert 2016b] Wang, Y., and Hebert, M. 2016b. Learning to learn: Model regression networks for easy small sample learning. In ECCV.
 [Wong and Yuille 2015] Wong, A., and Yuille, A. L. 2015. One shot learning via compositions of meaningful patches. In Proceedings of the IEEE International Conference on Computer Vision, 1197–1205.
 [Zoph and Le 2016] Zoph, B., and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv preprint.