Dynamic Input Structure and Network Assembly for Few-Shot Learning

by Nathan Hilliard, et al.

The ability to learn from a small number of examples has been a difficult problem in machine learning since its inception. While methods have succeeded with large amounts of training data, research is underway on how to achieve similar performance with fewer examples, known as one-shot or, more generally, few-shot learning. This technique has shown promising performance, but in practice it requires fixed-size inputs, making it impractical for production systems where class sizes can vary. This impedes training and the final utility of few-shot learning systems. This paper describes an approach to constructing and training a network that can handle arbitrary example sizes dynamically as the system is used.





1 Introduction

Few-shot learning aims to alleviate the difficulty of learning a classifier from few examples – or even a single example. Traditional classification learns from hundreds or thousands of examples per class. Instead of hoping to learn a classifier that can look at an input and classify it directly, a more robust technique is to provide the network with examples of each class and have it explicitly compare the input to the reference objects in each class. This typically takes the form of learning representations for both the reference examples shown to the network and the query input that is ultimately being classified. A similarity metric between the representations is then either learned, as described by Triantafillou et al. (2017), or an out-of-the-box technique is used more explicitly, as in Vinyals et al. (2016). The reference class with the highest similarity to the query image provides the label.
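The comparison-based approach above can be sketched in a few lines. This is a minimal illustration, not the paper's method: `classify` and `cosine_similarity` are hypothetical helpers, the class summary is a simple mean of example embeddings, and cosine similarity stands in for whatever learned or out-of-the-box metric a real system would use.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(query, reference_classes):
    """Return the label of the reference class most similar to the query.

    reference_classes: dict mapping label -> list of example feature vectors.
    Each class is summarized here by the mean of its example embeddings.
    """
    scores = {
        label: cosine_similarity(query, np.mean(examples, axis=0))
        for label, examples in reference_classes.items()
    }
    return max(scores, key=scores.get)

refs = {
    "cat": [np.array([1.0, 0.1]), np.array([0.9, 0.2])],
    "dog": [np.array([0.1, 1.0]), np.array([0.2, 0.8])],
}
print(classify(np.array([0.95, 0.15]), refs))  # -> "cat"
```

The key point is that the classifier never sees class labels directly; it only compares the query's representation against each reference class's representation.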

However, a practical concern is that these networks generally require all inputs to have the same number of examples per reference class. This is largely unrealistic and unworkable for production systems, where each reference class could have a different number of images (for example, a varying number of images for each family member in a facial recognition system). Dynamic input sizing remains a challenging problem for high-performance techniques that utilize statically compiled graphs, such as TensorFlow and Caffe. Clever workarounds have thus been developed for different learning tasks, such as masking in sequence-based learning, where sequences of varied length are padded to a fixed length. The network learns to watch for the padding and act accordingly, allowing variable-length sequences to share the same network inputs. However, it must explicitly learn to ignore the padding, placing an extra burden on the training process.
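The padding-plus-masking workaround mentioned above can be made concrete with a small sketch. `pad_and_mask` is a hypothetical helper, not part of any particular framework; it shows how variable-length inputs are forced into a fixed shape and how a mask lets downstream computation ignore the padded positions.

```python
import numpy as np

def pad_and_mask(sequences, max_len, pad_value=0.0):
    """Pad variable-length sequences to a fixed length and return a mask.

    The mask is 1.0 for real timesteps and 0.0 for padding, so a network
    (or a loss function) can zero out contributions from padded positions.
    """
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=np.float32)
    for row, seq in enumerate(sequences):
        batch[row, : len(seq)] = seq
        mask[row, : len(seq)] = 1.0
    return batch, mask

batch, mask = pad_and_mask([[1.0, 2.0], [3.0, 4.0, 5.0]], max_len=4)
# A masked mean ignores the padded zeros entirely.
masked_mean = (batch * mask).sum(axis=1) / mask.sum(axis=1)
```

Note that the mask must be threaded through every operation that touches the padded positions, which is exactly the extra burden the text describes.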

Our contribution in this paper is a novel technique for a system that leverages dynamic network assembly using shared weights to provide batch-wise size agnosticism in a static graph, meaning the example size changes from batch to batch. We additionally describe a training regimen that can be used to train the network to generalize and maintain similar performance across example sizes. We demonstrate the architecture’s effectiveness on a 1-way classification benchmark and compare against fixed-size networks. We show that our contribution produces significantly higher performance on test tasks than a traditional static class-size approach.

2 Related Work

Since its inception, few-shot learning has been implemented with a variety of architectures and components. For example, Vinyals et al. (2016) implemented one such network using memory augmentation with attention. This builds on advances made by other few-shot systems, such as the one developed by Koch (2015), which used a siamese network with two convolutional tails and achieved good performance on the Omniglot dataset by Lake et al. (2015). Other architectures, such as the pairwise networks demonstrated by Mehrotra and Dukkipati (2017), have been used successfully as well. Additionally, researchers have looked at meta-learning regimes for training few-shot learning networks, as in Ravi and Larochelle (2016), where the goal is to train an LSTM that provides updates to the weights of a network during a few-shot training regimen.

Other tools for developing neural networks, such as Chainer and PyTorch (https://github.com/pytorch/pytorch), both of which use dynamic graph construction, can also be used to address this problem. However, these tools are primarily aimed at research rather than production (although production use is not impossible). Our approach differs in that we define a straightforward and useful way to reuse weights in statically compiled graphs with a more production-ready library such as TensorFlow (Abadi et al. (2015)), giving us boosts in predictive accuracy as well as utility in production-grade applications.

3 Methods

3.1 Architecture

Figure 1: Our architecture, where $g_\theta$ and $f_\phi$ are both siamese networks. For simplicity, we represent the class examples and the query as images and omit the feature extraction step.

We have a siamese two-stage network architecture where the first stage is based on work done by Koch (2015) that leverages pairwise relational style networks most similar to work done by Santoro et al. (2017) but also similar to Mehrotra and Dukkipati (2017). The second stage of the network focuses on learning an internal distance metric, similar to metric learning as in Bellet et al. (2013).

To reduce the dimensionality of our inputs and bootstrap our network, we use transfer learning as described by Bengio et al. (2011). Specifically, we extract features from the layer prior to the final classification layer of the residual network described in He et al. (2015); this yields a vector of size 2048 in place of each image. This has several added benefits: we offload much of the feature learning to pre-trained networks, and we are consequently able to focus each stage of our network on a separate task. Our architecture is similar to the work done by Koch (2015), except that our image features are extracted rather than learned. A diagram visually describing our network can be seen in Figure 1.

3.1.1 Relational Stage

First and foremost, we have a pairwise relational network $g_\theta$ that takes in a class $C$ with examples $x_1, \dots, x_n$. $g_\theta$ outputs a class embedding $E_C$ representing the set of examples.

Ultimately, $E_C$ can be described as a function of the $\binom{n}{2}$ unique pairs (or combinations) in $C$:

$$E_C = \frac{1}{\binom{n}{2}} \sum_{i < j} g_\theta(e_i, e_j) \qquad (1)$$

Where $g_\theta$ is a neural network parameterized by $\theta$, and $e_i$ and $e_j$ are the embeddings of the $i$-th and $j$-th members of class $C$, with $i < j$. The $e$-embeddings could be provided by a pre-trained network or learned end-to-end. It is important to note that while we use a pairwise comparison similar to that used by Santoro et al. (2017), we crucially take the average of the resulting comparisons instead of their sum. Averaging helps to enforce that the characteristic embedding of the class is not explicitly related to the class size. As such, we can use Equation 1 to learn an embedding describing the class by explicitly considering the relationships between its members.
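The averaging over unique pairs can be sketched directly. This is an illustrative stand-in, not the trained system: `class_embedding` is a hypothetical helper, and the toy `g` (a fixed random projection of the concatenated pair) merely stands in for the learned relational network; the point is that the output size is fixed no matter how many examples the class contains.

```python
from itertools import combinations
import numpy as np

def class_embedding(embeddings, g):
    """Average g over all unique pairs of example embeddings.

    `g` stands in for the relational network: any function mapping a pair
    of vectors to a fixed-size output vector. Averaging (rather than
    summing) keeps the result independent of the number of examples.
    """
    pair_outputs = [g(embeddings[i], embeddings[j])
                    for i, j in combinations(range(len(embeddings)), 2)]
    return np.mean(pair_outputs, axis=0)

# Toy "relational network": concatenate the pair, apply a fixed projection.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))  # maps a concatenated pair (2 + 2 dims) to 3 dims
g = lambda a, b: np.concatenate([a, b]) @ W

examples = [np.ones(2), 2 * np.ones(2), 3 * np.ones(2)]
E_C = class_embedding(examples, g)  # one 3-dim vector, regardless of class size
```

Whether the class has 2 examples (1 pair) or 5 examples (10 pairs), the class embedding keeps the same shape, which is what makes the dynamic assembly in Section 3.2 possible.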

3.1.2 Metric Stage

The second stage network focuses on learning a distance metric between the query image $q$ and the given class, described by the output $E_C$ of $g_\theta$. Because the second stage and first stage are connected, we can learn better class embeddings via backpropagation. In essence, this stage learns the probability that $q \in C$.

We describe the second stage network as $f_\phi$, defined as:

$$P(q \in C) = f_\phi(E_C, e_q) \qquad (2)$$

Where $f_\phi$ is another neural network parameterized by $\phi$, and $e_q$ is the embedding of the query.
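A minimal stand-in for the second stage can make Equation 2 concrete. This is a sketch under assumptions: the real $f_\phi$ is a learned multi-layer network, whereas `metric_stage` here is a single linear layer (with hypothetical placeholder parameters `W`, `b`) over the concatenated class and query embeddings, followed by a two-unit softmax as used in the experiments.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def metric_stage(class_embedding, query_embedding, W, b):
    """Toy stand-in for the second-stage network.

    Concatenates the class embedding E_C with the query embedding e_q and
    applies one linear layer with two output units plus a softmax, giving
    [P(q not in C), P(q in C)].
    """
    x = np.concatenate([class_embedding, query_embedding])
    return softmax(W @ x + b)

E_C = np.array([0.5, -0.2])
e_q = np.array([0.4, -0.1])
W = np.zeros((2, 4))
b = np.zeros(2)  # untrained parameters: the output is uniform
probs = metric_stage(E_C, e_q, W, b)
```

Because both inputs to this stage have fixed sizes, $f_\phi$ itself never needs to know how many examples produced $E_C$.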

3.2 Dynamic Assembly

Given an input number of examples $n$ and a feature vector size $d$:
1. Initialize a list $L$.
2. Initialize an input tensor $X$ of size $n \times d$.
3. Initialize an input tensor $e_q$ of size $d$.
4. For each unique combination of indices $(i, j)$ with $i < j$: create a new instance of $g$ with shared weights, i.e. $g_\theta(e_i, e_j)$, and store the resulting output tensor in $L$.
5. Store the concatenation of $L$ in an intermediary tensor $T$.
6. Store the element-wise average of $T$ in an intermediary tensor $E_C$.
7. Connect $E_C$ and $e_q$ to a new instance of $f$ reusing weights, i.e. $f_\phi(E_C, e_q)$.
8. Return the inputs $X$, $e_q$ and the output $f_\phi(E_C, e_q)$.

Algorithm 1: Dynamic Assembly for Training & Inferring on New Example Sizes

Because the reduction step in Equation 1 gives us a fixed-size vector regardless of class size, we can use an arbitrary class size in the first stage. We dynamically recreate $g_\theta$ for as many unique pairs as exist in $C$; this is exemplified in the pair-creation step of Algorithm 1. The result is that we ultimately only need to create intermediary operations between learned weights to accommodate new example sizes. These operations, along with an input for the query, are then finally wired to $f_\phi$ in Equation 2 to complete the model creation.

Though we ultimately incur an overhead cost for creating these operations, it is far outweighed by the deployment cost of training separate networks for each example size and either storing them in memory simultaneously or swapping models every time class sizes change. By storing the resulting input and output tensors from each assembly, we can create an in-memory lookup table indexed by example size, based on the result of Algorithm 1. This can then easily be incorporated into the batching step: the batcher is fed multiple classes, each with varying example counts, and returns that information with each batch so the appropriate inputs are used.
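The lookup table indexed by example size can be sketched as a small cache. This is an illustration under assumptions: `ModelCache` and `assemble_fn` are hypothetical names, and `assemble_fn(n)` stands in for Algorithm 1, which in a real system would return the graph's input and output tensors for class size $n$ while reusing the shared weights.

```python
class ModelCache:
    """In-memory lookup table of assembled models, keyed by example size.

    `assemble_fn(n)` is called at most once per distinct n; afterwards the
    stored endpoints are returned directly, so no model is rebuilt or
    swapped when the batch's example count changes.
    """

    def __init__(self, assemble_fn):
        self._assemble_fn = assemble_fn
        self._models = {}

    def get(self, n_examples):
        if n_examples not in self._models:
            self._models[n_examples] = self._assemble_fn(n_examples)
        return self._models[n_examples]

calls = []
def fake_assemble(n):
    calls.append(n)
    return f"model_for_{n}"  # placeholder for (inputs, outputs) tensors

cache = ModelCache(fake_assemble)
cache.get(3)
cache.get(5)
cache.get(3)  # cache hit: assembly does not run again
```

A batcher that reports each batch's example count can then call `cache.get(n)` to fetch the matching inputs, which is the integration the paragraph above describes.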

4 Experiments

4.1 Experimental Design

Our experiments were carried out by training the models end-to-end. The baseline models are simply the same network trained on a fixed example size of training data, whereas the dynamic input model was trained on example counts varied batch-to-batch.

For both cases, we trained via stochastic gradient descent with momentum, using the same learning rate, momentum value, and batch size throughout. We used a final output layer with two units prior to a softmax activation function to determine whether $q \in C$, for a straightforward 1-way problem.

4.2 Caltech-UCSD Birds

Evaluation class size:    2       3       4       5
2-shot Network          62.3%   62.5%   62.7%   62.7%
3-shot Network          66.9%   67.1%   67.1%   67.2%
4-shot Network          71.9%   72.2%   72.3%   72.3%
5-shot Network          74.3%   74.4%   74.5%   74.7%
Dynamic Input           89.7%   90.0%   90.1%   90.3%

Table 1: 1-way Classification Results on the Caltech-UCSD Birds Dataset. Our experiments show fixed-size model performance on other example sizes in order to demonstrate how well they generalize. We create separate class sizes, reusing the networks $g_\theta$ and $f_\phi$ that were trained on fixed example sizes, to evaluate the models on larger/smaller example sizes.

Each model was trained on the same task, using a dataset consisting of portions of the Caltech 256 and Visual Genome datasets developed by Griffin et al. (2007) and Krishna et al. (2016). As mentioned earlier, for the feature extraction step in this experiment we used the residual network built by He et al. (2015). We evaluate the models on an unrelated fine-grained classification task, the Caltech-UCSD Birds dataset described by Welinder et al. (2010). In order to train the dynamic model to generalize better across class sizes, we feed it a random example size each batch: the first batch could contain 2-shot data, the next batch 5-shot data, and so on. An indirect, yet important, contribution of this work is the observation that randomly changing class sizes significantly reduces overfitting.
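The randomized training regimen can be sketched as a simple batch generator. This is a toy under assumptions: `batches` and `dataset_by_shot` are hypothetical names, and the pre-built string "batches" merely stand in for real image batches; the point is only that the shot count is re-sampled on every step.

```python
import random

def batches(dataset_by_shot, n_batches, seed=0):
    """Yield batches with a randomly chosen example size each step.

    `dataset_by_shot` maps an example count (e.g. 2..5) to batches prepared
    at that size; in practice a batcher would sample the images itself.
    Varying the shot count batch-to-batch trains the dynamic model to
    generalize across class sizes.
    """
    rng = random.Random(seed)
    sizes = sorted(dataset_by_shot)
    for _ in range(n_batches):
        n = rng.choice(sizes)  # e.g. 2-shot now, maybe 5-shot next step
        yield n, rng.choice(dataset_by_shot[n])

data = {n: [f"batch_of_{n}_shot"] for n in (2, 3, 4, 5)}
for n, batch in batches(data, 10):
    pass  # feed (n, batch) to the model assembled for example size n
```

Each yielded example count would select the matching assembled inputs from the lookup table described in Section 3.2 before the training step runs.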

In our experiments we consider baselines using the same architecture, each trained on a fixed example size. Each model in Table 1 was trained with the same number of training steps. As the table shows, our technique far outperforms networks trained on fixed example sizes. The results also show that even models trained on small example counts improve when evaluated on higher example counts. This is largely unsurprising: the more examples the network is shown, the better it should perform in general. That networks trained on higher example counts outperform networks trained on lower example counts is likewise unsurprising, and is likely due in large part to the element-wise mean operation we use. As example counts grow, the resulting class vector should get sharper, making it easier for the network to distinguish it from the query image.

4.3 Omniglot

For the Omniglot dataset (Lake et al. (2015)), we use a very simple network trained on MNIST digits as our feature extractor, as opposed to ResNet. All other experimental parameters are kept the same as in the Caltech-Birds experiment, except that we use Nesterov momentum instead of classic momentum, as described by Sutskever et al. (2013). Our model outperforms the baselines on this task by a narrower margin, as can be seen in Table 2.

Evaluation class size:    2       3       4       5
2-shot Network          52.2%   52.4%   52.5%   52.5%
3-shot Network          80.5%   81.4%   81.6%   81.8%
4-shot Network          82.3%   84.0%   84.5%   84.8%
5-shot Network          82.8%   84.4%   85.2%   85.7%
Dynamic Input           83.1%   84.9%   85.8%   86.2%

Table 2: 1-way Classification Results on the Omniglot dataset.

5 Conclusion

In this paper we presented a technique for bringing few-shot learning into the dynamic setting necessary for production applications. By generalizing a network to multiple example sizes, a single network can perform few-shot classification on varied example sizes at runtime with comparatively minimal overhead. We demonstrated that our dynamic model performs much better than its fixed-size counterparts on a fine-grained task unseen during training.