NIPS2016
This project collects the different accepted papers and their link to Arxiv or Gitxiv
One-shot learning is usually tackled by using generative models or discriminative embeddings. Discriminative methods based on deep learning, which are very effective in other learning scenarios, are ill-suited for one-shot learning as they need large amounts of training data. In this paper, we propose a method to learn the parameters of a deep model in one shot. We construct the learner as a second deep network, called a learnet, which predicts the parameters of a pupil network from a single exemplar. In this manner we obtain an efficient feedforward one-shot learner, trained end-to-end by minimizing a one-shot classification objective in a learning-to-learn formulation. In order to make the construction feasible, we propose a number of factorizations of the parameters of the pupil network. We demonstrate encouraging results by learning characters from single exemplars in Omniglot, and by tracking visual objects from a single initial exemplar in the Visual Object Tracking benchmark.
Deep learning methods have taken areas such as computer vision, natural language processing, and speech recognition by storm. One of their key strengths is the ability to leverage large quantities of labelled data and extract meaningful and powerful representations from it. However, this capability is also one of their most significant limitations, since using large datasets to train deep neural networks is not just an option, but a necessity. It is well known, in fact, that these models are prone to overfitting.
Thus, deep networks seem less useful when the goal is to learn a new concept on the fly, from a few or even a single example, as in one-shot learning. These problems are usually tackled by using generative models rezende2016one ; lake2015human or, in a discriminative setting, using ad hoc solutions such as exemplar support vector machines (SVMs) malisiewicz11ensemble . Perhaps the most common discriminative approach to one-shot learning is to learn offline a deep embedding function and then to define online simple classification rules such as nearest neighbors in the embedding space fan2014learning ; parkhi2015deep ; lin2015bilinear . However, computing an embedding is a far cry from learning a model of the new object.
In this paper, we take a very different approach and ask whether we can induce, from a single supervised example, a full, deep discriminative model to recognize other instances of the same object class. Furthermore, we do not want our solution to require a lengthy optimization process, but to be computable on the fly, efficiently and in one go. We formulate this problem as that of learning a deep neural network, called a learnet, that, given a single exemplar of a new object class, predicts the parameters of a second network that can recognize other objects of the same type.
Our model has several elements of interest. Firstly, if we consider learning to be any process that maps a set of images to the parameters of a model, then it can be seen as a “learning to learn” approach. Clearly, learning from a single exemplar is only possible given sufficient prior knowledge on the learning domain. This prior knowledge is incorporated in the learnet in an offline phase by solving millions of small one-shot learning tasks and backpropagating errors end-to-end. Secondly, our learnet provides a feedforward learning algorithm that extracts from the available exemplar the final model parameters in one go. This is different from iterative approaches such as exemplar SVMs or complex inference processes in generative modeling. It also demonstrates that deep neural networks can learn at the “meta-level” of predicting filter parameters for a second network, which we consider to be an interesting result in its own right. Thirdly, our method provides a competitive, efficient, and practical way of performing one-shot learning using discriminative methods.
The rest of the paper is organized as follows. Sect. 1.1 discusses the works most related to ours. Sect. 2 describes the learnet approach and nuances in its implementation. Sect. 3 demonstrates empirically the potential of the method in image classification and visual tracking tasks. Finally, sect. 4 summarizes our findings.
Our work is related to several others in the literature. However, we believe we are the first to look at methods that can learn the parameters of complex discriminative models in one shot.
One-shot learning has been widely studied in the context of generative modeling, which unlike our work is often not focused on solving discriminative tasks. One very recent example is by Rezende et al. rezende2016one , which uses a recurrent spatial attention model to generate images, and learns by optimizing a measure of reconstruction error using variational inference kingma2013auto . They demonstrate results by sampling images of novel classes from this generative model, not by solving discriminative tasks. Another notable work is by Lake et al. lake2015human , which instead uses a probabilistic program as a generative model. This model constructs written characters as compositions of pen strokes, so although more general programs can be envisioned, they demonstrate it only on Optical Character Recognition (OCR) applications.
A different approach to one-shot learning is to learn an embedding space, which is typically done with a siamese network bromley1993signature . Given an exemplar of a novel category, classification is performed in the embedding space by a simple rule such as nearest-neighbor. Training is usually performed by classifying pairs according to distance fan2014learning , or by enforcing a distance ranking with a triplet loss parkhi2015deep . A variant is to combine embeddings using the outer product, which yields a bilinear classification rule lin2015bilinear .
The literature on zero-shot learning (as opposed to one-shot learning) has a different focus, and thus different methodologies. It consists of learning a new object class without any example image, but based solely on a description such as binary attributes or text. It is usually framed as a modality transfer problem and solved through transfer learning socher2013zero .
The general idea of predicting parameters has been explored before by Denil et al. denil2013predicting , who showed that it is possible to linearly predict as much as 95% of the parameters in a layer given the remaining 5%. This is a very different proposition from ours, which is to predict all of the parameters of a layer given an external exemplar image, and to do so non-linearly.
Our proposal allows generating all the parameters from scratch, generalizing across tasks defined by different exemplars, and can be seen as a network that effectively “learns to learn”.
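Since we consider one-shot learning as a discriminative task, our starting point is standard discriminative learning. It generally consists of finding the parameters $W$ that minimize the average loss $\mathcal{L}$ of a predictor function $\varphi(x; W)$, computed over a dataset of $n$ samples $x_i$ and corresponding labels $\ell_i$:
$$\min_{W} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(\varphi(x_i; W), \ell_i\big) \qquad (1)$$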
Unless the model space is very small, generalization also requires constraining the choice of model, usually via regularization. However, in the extreme case in which the goal is to learn from a single exemplar of the class of interest, called one-shot learning, even regularization may be insufficient and additional prior information must be injected into the learning process. The main challenge in discriminative one-shot learning is to find a mechanism to incorporate domain-specific information in the learner, i.e. learning to learn. Another challenge, which is of practical importance in applications of one-shot learning, is to avoid a lengthy optimization process such as eq. 1.
We propose to address both challenges by learning the parameters $W$ of the predictor from a single exemplar $z$ using a meta-prediction process, i.e. a non-iterative feedforward function $\omega$ that maps $(z; W')$ to $W$. Since in practice this function will be implemented using a deep neural network, we call it a learnet. The learnet depends on the exemplar $z$, which is a single representative of the class of interest, and contains parameters $W'$ of its own. Learning to learn can now be posed as the problem of optimizing the learnet meta-parameters $W'$ using an objective function defined below. Furthermore, the feedforward learnet evaluation is much faster than solving the optimization problem (1).
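In order to train the learnet, we require the latter to produce good predictors given any possible exemplar $z$, which is empirically evaluated as an average over $n$ training samples $z_i$:
$$\min_{W'} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(\varphi(x_i; \omega(z_i; W')), \ell_i\big) \qquad (2)$$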
In this expression, the performance of the predictor extracted by the learnet from the exemplar $z_i$ is assessed on a single “validation” pair $(x_i, \ell_i)$, comprising another exemplar and its label $\ell_i$. Hence, the training data consists of triplets $(x_i, z_i, \ell_i)$. Notice that the meaning of the label $\ell_i$ is subtly different from eq. 1 since the class of interest changes depending on the exemplar $z_i$: $\ell_i$ is positive when $x_i$ and $z_i$ belong to the same class and negative otherwise. Triplets are sampled uniformly with respect to these two cases. Importantly, the parameters of the original predictor $\varphi$ of eq. 1 now change dynamically with each exemplar $z_i$.
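Note that the training data is reminiscent of that of siamese networks bromley1993signature , which also learn from labeled sample pairs. However, siamese networks apply the same model $\varphi(x; W)$ with shared weights $W$ to both $x_i$ and $z_i$, and compute their inner product to produce a similarity score:
$$\min_{W} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(\langle \varphi(x_i; W), \varphi(z_i; W) \rangle, \ell_i\big) \qquad (3)$$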
There are two key differences with our model. First, we treat $x_i$ and $z_i$ asymmetrically, which results in a different objective function. Second, and most importantly, the output of $\omega(z; W')$ is used to parametrize linear layers that determine the intermediate representations in the network $\varphi$. This is significantly different from computing a single inner product in the last layer (eq. 3). A similar argument can be made of bilinear networks lin2015bilinear .
Eq. 2 specifies the optimization objective of one-shot learning as dynamic parameter prediction. By application of the chain rule, backpropagating derivatives through the computational blocks of $\varphi(x; W)$ and $\omega(z; W')$ is no more difficult than through any other standard deep network. Nevertheless, when we dive into concrete implementations of such models we face a peculiar challenge, discussed next.
In order to analyse the practical difficulties of implementing a learnet, we will begin with one-shot prediction of a fully-connected layer, as it is simpler to analyse. This is given by
$$y = W x + b \qquad (4)$$
given an input $x \in \mathbb{R}^{d}$, output $y \in \mathbb{R}^{k}$, weights $W \in \mathbb{R}^{k \times d}$ and biases $b \in \mathbb{R}^{k}$.
We now replace the weights and biases with their functional counterparts, $w(z)$ and $b(z)$, representing two outputs of the learnet $\omega(z; W')$ given the exemplar $z$ as input (to avoid clutter, we omit the implicit dependence on $W'$):
$$y = w(z)\, x + b(z) \qquad (5)$$
While eq. 5 seems to be a drop-in replacement for linear layers, careful analysis reveals that it scales extremely poorly. The main cause is the unusually large output space of the learnet, $w(z) \in \mathbb{R}^{k \times d}$. For a comparable number of input and output units in a linear layer ($d \simeq k$), the output space of the learnet grows quadratically with the number of units.
While this may seem to be a concern only for large networks, the problem is in fact severe even for networks with few units. Consider a simple linear learnet $w(z) = W' z$. Even for a very small fully-connected layer of only 100 units ($d = k = 100$), and an exemplar $z$ with 100 features ($z \in \mathbb{R}^{m}$, $m = 100$), the learnet already contains 1M parameters that must be learned. Overfitting and space and time costs make learning such a regressor infeasible. Furthermore, reducing the number of features in the exemplar can only achieve a small constant-size reduction on the total number of parameters. The bottleneck is the quadratic size of the output space $dk$, not the size of the input space $m$.
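The parameter count quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `naive_learnet_params` and the argument names `m`, `d`, `k` (exemplar features, layer inputs, layer outputs) are labels chosen here, and the learnet is assumed to be a single linear map with no bias.

```python
def naive_learnet_params(m: int, d: int, k: int) -> int:
    """Weights of a linear learnet that maps m exemplar features
    to every entry of a full d-by-k layer weight matrix."""
    return m * d * k


# The case from the text: 100 exemplar features, a 100-unit layer.
print(naive_learnet_params(m=100, d=100, k=100))  # 1000000

# Halving the exemplar features only halves the count ...
print(naive_learnet_params(m=50, d=100, k=100))   # 500000

# ... whereas halving the layer width divides it by four.
print(naive_learnet_params(m=100, d=50, k=50))    # 250000
```

The last two lines show why shrinking the exemplar representation helps little: the count is linear in the exemplar size but quadratic in the layer width.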
A simple way to reduce the size of the output space is to consider a factorized set of weights, by replacing eq. 5 with:
$$y = M' \, \mathrm{diag}\big(w(z)\big) \, M x + b(z) \qquad (6)$$
The product $M' \, \mathrm{diag}(w(z)) \, M$ can be seen as a factorized representation of the weights, analogous to the Singular Value Decomposition. The matrix $M \in \mathbb{R}^{d \times d}$ projects $x$ into a space where the elements of $w(z)$ represent disentangled factors of variation. The second projection $M' \in \mathbb{R}^{k \times d}$ maps the result back from this space.
Both $M$ and $M'$ contain additional parameters to be learned, but they are modest in size compared to the case discussed in sect. 2.1. Importantly, the one-shot branch $w(z)$ now only has to predict a set of diagonal elements (see eq. 6), so its output space grows linearly with the number of units in the layer (i.e. $w(z) : \mathbb{R}^{m} \to \mathbb{R}^{d}$).
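As a rough illustration of the factorized layer described above, the NumPy sketch below projects the input, rescales each coordinate by a predicted diagonal, projects back, and adds a predicted bias. Everything named here is an assumption made for the example: the sizes, the variable names, and in particular the two linear maps `A_w` and `A_b`, which stand in for the learnet branches (the paper uses deep networks, not linear maps).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 6, 4, 5  # layer inputs, layer outputs, exemplar features (illustrative)

M = rng.standard_normal((d, d))        # learned input projection
M_prime = rng.standard_normal((k, d))  # learned output projection
A_w = rng.standard_normal((d, m))      # hypothetical linear branch predicting the diagonal
A_b = rng.standard_normal((k, m))      # hypothetical linear branch predicting the bias


def factorized_layer(x, z):
    """Apply the factorized layer to input x, with weights predicted from exemplar z."""
    w_z = A_w @ z                      # d predicted diagonal elements
    b_z = A_b @ z                      # k predicted biases
    # Multiplying by diag(w_z) is an elementwise rescaling of the projected input.
    return M_prime @ (w_z * (M @ x)) + b_z


x = rng.standard_normal(d)
z = rng.standard_normal(m)
y = factorized_layer(x, z)

# Same computation with the diagonal matrix written out explicitly.
y_explicit = M_prime @ np.diag(A_w @ z) @ M @ x + A_b @ z
print(np.allclose(y, y_explicit))
```

Note that the predicting branch here outputs only `d + k` numbers per exemplar, whereas predicting the full weight matrix would require `d * k + k`.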
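The factorization of eq. 6 can be generalized to convolutional layers as follows. Given an input tensor $x \in \mathbb{R}^{r \times c \times d}$, weights $W \in \mathbb{R}^{f \times f \times d \times k}$ (where $f$ is the filter support size), and biases $b \in \mathbb{R}^{k}$, the output $y$ of the convolutional layer is given by
$$y = W * x + b \qquad (7)$$
where $*$ denotes convolution, and the biases $b$ are applied to each of the $k$ channels.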
Projections analogous to $M$ and $M'$ in eq. 6 can be incorporated in the filter bank in different ways and it is not obvious which one to pick. Here we take the view that $M$ and $M'$ should disentangle the feature channels (i.e. third dimension of $x$), allowing $w(z)$ to choose which filter to apply to each channel. As such, we consider the following factorization:
$$y = M' * w(z) *_d M * x + b(z) \qquad (8)$$
where $M \in \mathbb{R}^{1 \times 1 \times d \times d}$, $M' \in \mathbb{R}^{1 \times 1 \times d \times k}$, and $w(z) \in \mathbb{R}^{f \times f \times d}$. Convolution with subscript $d$ denotes independent filtering of $d$ channels, i.e. each channel of $x *_d y$ is simply the convolution of the corresponding channel in $x$ and $y$. In practice, this can be achieved with filter tensors that are diagonal in the third and fourth dimensions, or using filter groups krizhevsky2012imagenet , each group containing a single filter. An illustration is given in fig. 1. The predicted filters $w(z)$ can be interpreted as a filter basis, as described in the supplementary material (sec. A).
Notice that, under this factorization, the number of elements to be predicted by the one-shot branch $w(z)$ is only $f^2 d$ (the filter size $f$ is typically very small, e.g. 3 or 5 fan2014learning ; wang2015transferring ). Without the factorization, it would be $f^2 d k$ (the number of elements of $W$ in eq. 7). Similarly to the case of fully-connected layers (sect. 2.2), when $d \simeq k$ this keeps the number of predicted elements from growing quadratically with the number of channels, allowing them to grow only linearly.
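The saving can be made concrete with a small count. The sketch below compares the number of filter coefficients the one-shot branch must output with and without the factorization; the function name and the sizes (5×5 filters, 64 input and 64 output channels) are illustrative assumptions, not a configuration stated for any particular layer in the text.

```python
def predicted_elements(f: int, d: int, k: int, factorized: bool) -> int:
    """Filter coefficients the one-shot branch must output for one conv layer
    with f-by-f filters, d input channels and k output channels."""
    return f * f * d if factorized else f * f * d * k


f, d, k = 5, 64, 64
print(predicted_elements(f, d, k, factorized=True))   # 1600
print(predicted_elements(f, d, k, factorized=False))  # 102400
```

With these sizes the factorized branch predicts one 5×5 filter per channel, a factor of `k` fewer outputs than the unfactorized layer.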
We evaluate learnets against baseline one-shot architectures (sect. 3.1) on two one-shot learning problems in Optical Character Recognition (OCR; sect. 3.2) and visual object tracking (sect. 3.3).
As noted in sect. 2, the closest competitors to our method in discriminative one-shot learning are embedding-learning methods based on siamese architectures. Therefore, we structure the experiments to compare against this baseline. In particular, we choose to implement learnets using similar network topologies for a fairer comparison.
The baseline siamese architecture comprises two parallel streams $\varphi(x; W)$ and $\varphi(z; W)$ composed of a number of layers, such as convolution, max-pooling, and ReLU, sharing parameters $W$ (fig. 2.a). The outputs of the two streams are compared by a layer $\Gamma$ computing a measure of similarity or dissimilarity. We consider in particular: the dot product $\langle a, b \rangle$ between vectors $a$ and $b$, the Euclidean distance $\|a - b\|$, and the weighted $\ell_1$-norm $\|w \odot (a - b)\|_1$ (where $w$ is a vector of learnable weights and $\odot$ the Hadamard product).
The first modification to the siamese baseline is to use a learnet to predict some of the intermediate shared stream parameters (fig. 2.b). In this case $W = \omega(z; W')$ and the siamese architecture writes $\Gamma(\varphi(x; \omega(z; W')), \varphi(z; \omega(z; W')))$. Note that the siamese parameters are still the same in the two streams, whereas the learnet is an entirely new subnetwork whose purpose is to map the exemplar image to the shared weights. We call this model the siamese learnet.
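The three comparison functions listed for the siamese baseline can be sketched in NumPy as follows. This is a minimal illustration on plain vectors: the function names are chosen here, and reading the "weighted norm" as a weighted L1 distance is an assumption of this sketch. In the real model `w` would be learned jointly with the streams.

```python
import numpy as np


def dot_similarity(a, b):
    """Inner product: larger means more similar."""
    return float(a @ b)


def euclidean_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return float(np.linalg.norm(a - b))


def weighted_l1_distance(a, b, w):
    """L1 distance after elementwise (Hadamard) weighting by w (assumed form)."""
    return float(np.sum(np.abs(w * (a - b))))


a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 0.0, 4.0])
w = np.array([0.5, 1.0, 2.0])

print(dot_similarity(a, b))           # 13.0
print(euclidean_distance(a, b))       # sqrt(5)
print(weighted_l1_distance(a, b, w))  # 4.0
```

The dot product is a similarity while the other two are distances, so a distance has to be negated (or the loss sign flipped) before it can be used as a matching score.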
The second modification is a single-stream learnet configuration, using only one stream $\varphi$ of the siamese architecture and predicting its parameters using the learnet, $W = \omega(z; W')$. In this case, the comparison block is reinterpreted as the last layer of the stream (fig. 2.c). Note that: i) the single predicted stream and the learnet are asymmetric and have different parameters, and ii) the learnet predicts both the final comparison layer parameters and the intermediate filter parameters.
The single-stream learnet architecture can be understood to predict a discriminant function from one example, and the siamese learnet architecture to predict an embedding function for the comparison of two images. These two variants demonstrate the versatility of the dynamic convolutional layer from eq. 6.
Finally, in order to ensure that any difference in performance is not simply due to the asymmetry of the learnet architecture or to the induced filter factorizations (sect. 2.2 and sect. 2.3), we also compare unshared siamese nets, which use distinct parameters for each stream, and factorized siamese nets, where convolutions are replaced by factorized convolutions as in learnet.
[Figure: panels show predicted filters and activations.]
| Architecture | Inner-product (%) | Euclidean dist. (%) | Weighted dist. (%) |
|---|---|---|---|
| Siamese (shared) | 48.5 | 37.3 | 41.8 |
| Siamese (unshared) | 47.0 | 41.0 | 34.6 |
| Siamese (unshared, factorized) | 48.4 | – | 33.6 |
| Siamese learnet (shared) | 51.0 | 39.8 | 31.4 |
| Learnet | 43.7 | 36.7 | 28.6 |
This section describes our experiments in one-shot learning on OCR. For this, we use the Omniglot dataset lake2015human , which contains images of handwritten characters from 50 different alphabets. These alphabets are divided into 30 background and 20 evaluation alphabets. The associated one-shot learning problem is to develop a method for determining whether, given any single exemplar of a character in an evaluation alphabet, any other image in that alphabet represents the same character or not. Importantly, all methods are trained using only background alphabets and tested on the evaluation alphabets.
Character images are resized to $28 \times 28$ pixels in order to be able to explore efficiently several variants of the proposed architectures. There are exactly 20 sample images for each character, and an average of 32 characters per alphabet. The dataset contains a total of 19,280 images in the background alphabets and 13,180 in the evaluation alphabets.
Algorithms are evaluated on a series of recognition problems. Each recognition problem involves identifying the image in a set of 20 that shows the same character as an exemplar image (there is always exactly one match). All of the characters in a single problem belong to the same alphabet. At test time, given an exemplar $z$ and a collection of candidate characters $x_1, \dots, x_{20}$, the function is evaluated on each pair $(z, x_i)$ and the candidate with the highest score is declared the match. In the case of the learnet architectures, this can be interpreted as obtaining the parameters $W = \omega(z; W')$ and then evaluating a static network $\varphi(x_i; W)$ for each $x_i$.
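The evaluation protocol above amounts to an argmax over pairwise scores. The sketch below shows that logic on toy data; `one_shot_match`, `error_rate` and `toy_score` are hypothetical names, and the toy score (negative Euclidean distance on raw vectors) merely stands in for a trained siamese or learnet model.

```python
import numpy as np


def one_shot_match(score_fn, exemplar, candidates):
    """Return the index of the candidate with the highest score against the exemplar."""
    scores = [score_fn(exemplar, x) for x in candidates]
    return int(np.argmax(scores))


def error_rate(score_fn, problems):
    """problems: list of (exemplar, candidates, index_of_true_match) tuples."""
    wrong = sum(one_shot_match(score_fn, z, xs) != t for z, xs, t in problems)
    return wrong / len(problems)


def toy_score(z, x):
    """Stand-in for a trained model: higher score for closer vectors."""
    return -float(np.linalg.norm(z - x))


rng = np.random.default_rng(0)
candidates = [rng.standard_normal(8) for _ in range(20)]  # 20 candidates per problem
true_index = 7
exemplar = candidates[true_index] + 0.01 * rng.standard_normal(8)  # noisy copy of the match

problems = [(exemplar, candidates, true_index)]
print(one_shot_match(toy_score, exemplar, candidates))
print(error_rate(toy_score, problems))
```

For a learnet, `score_fn` would first compute the predicted parameters from the exemplar once and then reuse them for all 20 candidates, which is what makes the per-problem cost a single learnet pass plus 20 evaluations of a static network.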
The baseline stream $\varphi$ for the siamese, siamese learnet, and single-stream learnet architecture consists of 3 convolutional layers, with $2 \times 2$ max-pooling layers of stride 2 between them. The filter sizes are $5 \times 5 \times 1 \times 16$, $5 \times 5 \times 16 \times 64$ and $4 \times 4 \times 64 \times 512$. For both the siamese learnet and the single-stream learnet, $\omega$ consists of the same layers as $\varphi$, except the number of outputs is 1600, one for each element of the 64 predicted filters (of size $5 \times 5$). To keep the experiments simple, we only predict the parameters of one convolutional layer. We conducted cross-validation to choose the predicted layer and found that the second convolutional layer yields the best results for both of the proposed variants.
Siamese nets have previously been applied to this problem by Koch et al. koch2016siamese using much deeper networks applied to images of size $105 \times 105$. However, we have restricted this investigation to relatively shallow networks to enable a thorough exploration of the parameter space. A more powerful algorithm for one-shot learning, Hierarchical Bayesian Program Learning lake2015human , is able to achieve human-level performance. However, this approach involves computationally expensive inference at test time, and leverages extra information at training time that describes the strokes drawn by the human author.
Learning involves minimizing the objective function specific to each method (e.g. eq. 2 for learnet and eq. 3 for siamese architectures) and uses stochastic gradient descent (SGD) in all cases. As noted in sect. 2, the objective is obtained by sampling triplets $(x_i, z_i, \ell_i)$ where exemplars $x_i$ and $z_i$ are congruous ($\ell_i = +1$) or incongruous ($\ell_i = -1$) with 50% probability. We consider 100,000 random pairs for training per epoch, and train for 60 epochs. We conducted a random search to find the best hyperparameters for each algorithm (initial learning rate and geometric decay, standard deviation of Gaussian parameter initialization, and weight decay).
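The balanced sampling of congruous and incongruous pairs described above can be sketched with the standard library. The function `sample_triplet`, the dictionary layout, and the toy string "images" are assumptions made for illustration; the sketch also excludes the exemplar itself from positive pairs, which the text does not specify.

```python
import random


def sample_triplet(images_by_class, rng):
    """Draw (x, z, label): label is +1 if x and z share a class, -1 otherwise,
    with each case chosen with probability 0.5."""
    classes = list(images_by_class)
    z_class = rng.choice(classes)
    z = rng.choice(images_by_class[z_class])
    if rng.random() < 0.5:
        # Congruous pair: another image of the same class (assumed to exclude z itself).
        pool, label = [im for im in images_by_class[z_class] if im != z], +1
    else:
        # Incongruous pair: any image of a different class.
        other = rng.choice([c for c in classes if c != z_class])
        pool, label = images_by_class[other], -1
    x = rng.choice(pool)
    return x, z, label


rng = random.Random(0)
toy = {"a": ["a0", "a1", "a2"], "b": ["b0", "b1", "b2"], "c": ["c0", "c1", "c2"]}
triplets = [sample_triplet(toy, rng) for _ in range(1000)]
positives = sum(label == 1 for _, _, label in triplets)
print(positives / len(triplets))  # close to 0.5
```

Sampling the label first and the pair second is what keeps the two cases balanced; drawing two images uniformly at random would make matching pairs rare when there are many classes.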
Tab. 1 shows the classification error obtained using variants of each architecture. A dash indicates a failure to converge given a large range of hyperparameters. The two learnet architectures combined with the weighted distance are able to achieve significantly better results than other methods. The best architecture reduced the error from 37.3% for a siamese network with shared parameters to 28.6% for a single-stream learnet.
While the Euclidean distance gave the best results for siamese networks with shared parameters, better results were achieved by learnets (and siamese networks with unshared parameters) using a weighted distance. In fact, none of the alternative architectures are able to achieve lower error under the Euclidean distance than the shared siamese net. The dot product was, in general, less effective than the other two metrics.
The introduction of the factorization in the convolutional layer might be expected to improve the quality of the estimated model by reducing the number of parameters, or to worsen it by diminishing the capacity of the hypothesis space. For this relatively simple task of character recognition, the factorization did not seem to have a large effect.
The task of single-target object tracking requires locating an object of interest in a sequence of video frames. A video frame can be seen as a collection of image windows; then, in a one-shot setting, given an exemplar of the object in the first frame, the goal is to identify the same window in the other frames.
The method is trained using the ImageNet Large Scale Visual Recognition Challenge 2015 ILSVRC15 , with 3,862 videos totalling more than one million annotated frames. Instances of objects of thirty different classes (mostly vehicles and animals) are annotated throughout each video with bounding boxes. For tracking, instance labels are retained but object class labels are ignored. We use 90% of the videos for training, while the other 10% are held out to monitor validation error during network training. Testing uses the VOT 2015 benchmark kristan2015visual .
We experiment with siamese and siamese learnet architectures (fig. 2) where the learnet predicts the parameters of the second (dynamic) convolutional layer of the siamese streams. Each siamese stream has five convolutional layers and we test three variants of those: variant (A) has the same configuration as AlexNet krizhevsky2012imagenet but with stride 2 in the first layer, and variants (B) and (C) reduce to 50% the number of filters in the first two convolutional layers and, respectively, to 25% and 12.5% the number of filters in the last layer.
In order to train the architecture efficiently from many windows, the data is prepared as follows. Given an object bounding box sampled at random, a crop $z$ double the size of the box is extracted from the corresponding frame, padding with the average image color when needed. The border is included in order to incorporate some visual context around the exemplar object. Next, $\ell \in \{+1, -1\}$ is sampled at random with 75% probability of being positive. If $\ell = -1$, an image $x$ is extracted by choosing at random a frame that does not contain the object. Otherwise, a second frame containing the same object and within 50 temporal samples of the first is selected at random. From that, a patch $x$ centered around the object and four times bigger is extracted. In this way, $x$ contains both subwindows that do and do not match $z$. Images $z$ and $x$ are resized to $127 \times 127$ and $255 \times 255$ pixels, respectively, and the triplet $(z, x, \ell)$ is formed. All subwindows in $x$ are considered to not match $z$ except for the central ones when $\ell = +1$.
All networks are trained from scratch using SGD for 50 epochs of 50,000 sample triplets $(z_i, x_i, \ell_i)$. The multiple windows contained in $x$ are compared to $z$ efficiently by making the comparison layer convolutional (fig. 2), accumulating a logistic loss across spatial locations. The same hyperparameters (learning rate of $10^{-2}$ geometrically decaying to $10^{-5}$, weight decay of 0.005, and small mini-batches of size 8) are used for all experiments, which we found to work well for both the baseline and proposed architectures. The weights are initialized using the improved Xavier he2015delving method, and we use batch normalization ioffe15batch after all linear layers.
Adopting the initial crop as exemplar, the object is sought in a new frame within a radius of the previous position, proceeding sequentially. This is done by evaluating the pupil net convolutionally, as well as searching at five possible scales in order to track the object through scale space.
[Figure: panels show predicted filters and activations.]
Tab. 3 compares the methods in terms of the official metrics (accuracy and number of failures) for the VOT 2015 benchmark kristan2015visual . The ranking plot produced by the VOT toolkit is provided in the supplementary material (fig. B.1). From tab. 3, it can be observed that factorizing the filters in the siamese architecture significantly diminishes its performance, but using a learnet to predict the filters in the factorization recovers this gap and in fact achieves better performance than the original siamese net. The performance of the learnet architectures is not adversely affected by using the slimmer prediction networks B and C (with fewer channels).
An elementary tracker based on learnet compares favourably against recent tracking systems, which make use of different features and online model update strategies: DAT possegger2015defense , DSST danelljan2014accurate , MEEM zhang2014meem , MUSTer hong2015multi and SODLT wang2015transferring . SODLT in particular is a good example of direct adaptation of standard batch deep learning methodology to online learning, as it uses SGD during tracking to fine-tune an ensemble of deep convolutional networks. However, the online adaptation of the model comes at a big computational cost and affects the speed of the method, which runs at 5 frames per second (FPS) on a GPU. Due to the feedforward nature of our one-shot learnets, they can track objects in real time at frame rates in excess of 60 FPS, while incurring fewer tracking failures. We consider, however, that our implementation serves mostly as a proof of concept, using tracking as an interesting demonstration of one-shot learning, and is orthogonal to many technical improvements found in the tracking literature kristan2015visual .
In this work, we have shown that it is possible to obtain the parameters of a deep neural network using a single, feedforward prediction from a second network. This approach is desirable when iterative methods are too slow, and when large sets of annotated training samples are not available. We have demonstrated the feasibility of feedforward parameter prediction in two demanding one-shot learning tasks in OCR and visual tracking. Our results hint at a promising avenue of research in “learning to learn” by solving millions of small discriminative problems in an offline phase. Possible extensions include domain adaptation and sharing a single learnet between different pupil networks.
bromley1993signature: International Journal of Pattern Recognition and Artificial Intelligence, 1993.
krizhevsky2012imagenet: ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
parkhi2015deep: Deep face recognition. BMVC, 2015.
This appendix provides an additional interpretation for the role of the predicted filters in a factorized convolutional layer (Section 2.3).
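To make the presentation succinct, we will use a notation that is slightly different from the main text. Let $x$ be a tensor of activations, then $x_q$ denotes channel $q$ of $x$. If $a$ is a multi-channel filter, then $a_{pq}$ denotes the filter for output channel $p$ and input channel $q$. That is, if $a$ is $f \times f \times Q \times P$ then $a_{pq}$ is $f \times f$ for $p \in [P]$, $q \in [Q]$. The set $\{1, \dots, n\}$ is denoted $[n]$.
The factorised convolution is
$$y = M' * w *_d M * x \qquad (9)$$
where $M$ and $M'$ are pixelwise projections (with scalar entries $m_{rq}$ and $m'_{pr}$) and $w$ is a diagonal convolution. While a general convolution computes
$$y_p = \sum_{q} a_{pq} * x_q \qquad (10)$$
where each $a_{pq}$ is a single-channel filter, a diagonal convolution computes
$$y_p = w_p * x_p \qquad (11)$$
where each $w_p$ is a single-channel filter, and a pixelwise projection computes
$$y_p = \sum_{q} m_{pq} \, x_q \qquad (12)$$
where each $m_{pq}$ is a scalar.
Let $Q$ be the number of channels of $x$, let $P$ be the number of channels of $y$ and let $R$ be the number of channels of the intermediate activations. Combining the above gives
$$y_p = \sum_{r \in [R]} m'_{pr} \Big( w_r * \sum_{q \in [Q]} m_{rq} \, x_q \Big) \qquad (13)$$
$$\phantom{y_p} = \sum_{q \in [Q]} \Big( \sum_{r \in [R]} m'_{pr} \, m_{rq} \, w_r \Big) * x_q \qquad (14)$$
This is therefore equivalent to a general convolution where each filter is a combination of single-channel basis filters
$$a_{pq} = \sum_{r \in [R]} m'_{pr} \, m_{rq} \, w_r \qquad (15)$$
The predictions used in the dynamic convolution (Section 2.3) essentially modify these basis filters.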
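The equivalence claimed in this appendix (a factorised convolution equals a general convolution whose filters are combinations of single-channel basis filters) can be checked numerically. The NumPy sketch below is only an illustration under assumed names and sizes: `correlate2d_valid` is a helper written here, filtering is implemented as cross-correlation without padding, and the channel counts `Q`, `R`, `P` are arbitrary.

```python
import numpy as np


def correlate2d_valid(x, f):
    """Single-channel sliding-window filtering without padding (cross-correlation)."""
    fh, fw = f.shape
    out_h, out_w = x.shape[0] - fh + 1, x.shape[1] - fw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + fh, j:j + fw] * f)
    return out


rng = np.random.default_rng(0)
Q, R, P, f = 3, 4, 2, 3                # input, intermediate, output channels; filter size
x = rng.standard_normal((Q, 8, 8))     # channels-first input
m = rng.standard_normal((R, Q))        # first pixelwise projection
w = rng.standard_normal((R, f, f))     # one single-channel filter per intermediate channel
m_prime = rng.standard_normal((P, R))  # second pixelwise projection

# Factorised path: project, filter each channel independently, project again.
v = np.einsum("rq,qhw->rhw", m, x)
u = np.stack([correlate2d_valid(v[r], w[r]) for r in range(R)])
y_factorised = np.einsum("pr,rhw->phw", m_prime, u)

# Equivalent general convolution: a[p, q] = sum_r m'[p, r] * m[r, q] * w[r].
a = np.einsum("pr,rq,rhw->pqhw", m_prime, m, w)
y_general = np.stack([
    sum(correlate2d_valid(x[q], a[p, q]) for q in range(Q)) for p in range(P)
])

print(np.allclose(y_factorised, y_general))
```

The check relies only on linearity, so it holds equally for true convolution (flipped filters) and for any padding scheme applied consistently to both paths.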
| Architecture variant | Displacement error, validation (training) | Classification error, validation (training) | Objective, validation (training) | VOT2015 accuracy | VOT2015 failures |
|---|---|---|---|---|---|
| Siamese ($\varphi$=B) | 7.40 (6.26) | 0.426 (0.0766) | 0.156 (0.0903) | 0.465 | 105 |
| Siamese ($\varphi$=B; unshared) | 9.29 (6.95) | 0.514 (0.120) | 0.137 (0.0910) | 0.447 | 131 |
| Siamese ($\varphi$=B; factorised) | 8.58 (7.85) | 0.564 (0.160) | 0.141 (0.104) | 0.444 | 138 |
| Siamese learnet ($\varphi$=B; $\omega$=A) | 7.19 (5.86) | 0.356 (0.0627) | 0.137 (0.0763) | 0.500 | 87 |
| Siamese learnet ($\varphi$=B; $\omega$=B) | 7.11 (5.89) | 0.351 (0.0515) | 0.141 (0.0762) | 0.497 | 93 |
| Siamese ($\varphi$=C) | 8.13 (7.5) | 0.589 (0.192) | 0.157 (0.112) | 0.466 | 120 |
| Siamese ($\varphi$=C; factorised) | 9.80 (8.96) | 0.539 (0.277) | 0.141 (0.117) | 0.435 | 132 |
| Siamese learnet ($\varphi$=C; $\omega$=A) | 7.51 (6.49) | 0.389 (0.0863) | 0.134 (0.0856) | 0.483 | 105 |
| Siamese learnet ($\varphi$=C; $\omega$=C) | 7.47 (6.96) | 0.326 (0.118) | 0.142 (0.0940) | 0.491 | 106 |