Few-Shot Learning with Metric-Agnostic Conditional Embeddings

by Nathan Hilliard, et al.

Learning high quality class representations from few examples is a key problem in metric-learning approaches to few-shot learning. To accomplish this, we introduce a novel architecture where class representations are conditioned for each few-shot trial based on a target image. We also deviate from traditional metric-learning approaches by training a network to perform comparisons between classes rather than relying on a static metric comparison. This allows the network to decide what aspects of each class are important for the comparison at hand. We find that this flexible architecture works well in practice, achieving state-of-the-art performance on the Caltech-UCSD birds fine-grained classification task.


1 Introduction

The goal of few-shot learning is to generalize a classifier’s performance to new classes given relatively small amounts of data. Although both adults and children are capable of efficiently making these generalizations (Swingley, 2010; Lake et al., 2015), few-shot classification has remained a difficult problem for machine learning algorithms. A key insight from psychology is that human few-shot generalization only occurs when new classes can be understood in the context of old ones (Carey & Bartlett, 1978; Swingley, 2010). In essence, the ability to rapidly understand a new category (e.g., a new word) can only be accomplished when the learner already has an idea of what the space of categories looks like. As Carey puts it, “there must be powerful processes that establish and maintain lexical entries of newly heard words, locating their meanings in some relevant part of semantic space, while the nuanced meaning gets worked out.” (Carey, 2010).

This notion of semantic, or conceptual, space underlies many approaches to few-shot learning, most directly the set of deep learning models which fall under the umbrella of metric-learning (Kulis et al., 2013; Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017). Metric-learning approaches attempt to solve the few-shot problem by rapidly placing new categories within a learned metric space where classes can be easily separated, most often through a pre-defined distance metric such as Euclidean or cosine distance. These systems have achieved strong performance on many few-shot tasks (Vinyals et al., 2016; Snell et al., 2017), but it is unclear what aspects of their structure are most important for good few-shot generalization.

One important aspect of these models is the relation between learned class representations and the placement of a query image within the metric space. Consider, for instance, a case where a query image shares similarities to a number of few-shot classes based on attributes such as shape or color. While children, and some neural networks, share a bias to classify based on shape (Landau et al., 1988; Ritter et al., 2017), we can also recognize the potential that some classifications would require a re-weighting of attribute importance, as in the classification of non-solids where shape is less important (Soja et al., 1991). Accomplishing this feature re-weighting requires an interaction between query and class representations which has not been thoroughly investigated in the literature.

Another area left largely unexplored is whether or not the use of pre-defined metrics, such as cosine or Euclidean distance, is necessary for the strong performance seen in existing metric-based few-shot learning systems. While Snell et al. (2017) explore how switching between different distances affects model performance, it is unclear whether similar, or better, results could be obtained by allowing a parameterized network to perform classification directly.

To explore whether few-shot performance necessarily depends on a true metric space, we introduce a novel few-shot architecture where the metric space comparison is replaced by a learned neural network architecture. This means that our architecture is not a true metric-learning approach, as the output of the model is a softmax probability distribution, and not a true distance or similarity metric. We find, however, that our network performs quite well in practice. We also show that this metric-agnostic aspect is not enough for good performance. The network achieves its best performance only when it is allowed to condition each class representation based on the target, or query, image. Because this final network is both metric-agnostic and creates conditional representations, we refer to it as a Metric-Agnostic Conditional network (MACO).

2 Architecture

We describe our network in the traditional fashion used for few-shot learning. Each few-shot trial is made up of a so-called query image, $\hat{x}$, where the goal of the network is to decide which of $N$ classes the query belongs to. Each class is represented by a set of images and we refer to the entirety of these as the support set, $S$. For a given class $c$, $S_c$ refers to the images in the support set belonging to $c$. An experiment with $N$ classes and $k$ images per class is referred to as an $N$-way, $k$-shot experiment.
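The trial structure described above can be sketched in a few lines. This is only an illustration: the function name `sample_trial`, the toy class names, and the choice to draw the query from a class's images not used in the support set are assumptions, not details given in the text.

```python
import random

def sample_trial(images_by_class, n_way, k_shot, rng):
    """Sample one N-way, k-shot trial: a support set plus a single query image."""
    # Pick N distinct classes; sample() returns them in random order.
    classes = rng.sample(sorted(images_by_class), n_way)
    # The query is drawn from one of the chosen classes.
    target_index = rng.randrange(n_way)
    support = []
    query = None
    for index, name in enumerate(classes):
        # Draw one extra image for the target class so the query is not in the support set.
        extra = 1 if index == target_index else 0
        drawn = rng.sample(images_by_class[name], k_shot + extra)
        support.append(drawn[:k_shot])
        if extra:
            query = drawn[k_shot]
    return support, query, target_index

# Toy data: 8 hypothetical classes with 10 image identifiers each.
toy = {f"class_{c}": [f"class_{c}/img_{i}" for i in range(10)] for c in range(8)}
support, query, target = sample_trial(toy, n_way=5, k_shot=5, rng=random.Random(0))
```

Here `support` holds five lists of five image identifiers, and `target` is the position of the query's class within the trial.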

The network is made up of four distinct stages which are trained fully end-to-end. We explicitly separate these components of the model so that individual pieces of the network are not forced to encode multiple complex relationships which might interfere with one another.

  1. Feature Stage – A convolutional architecture represents each image as a single vector. A single set of parameters is used for all images.

  2. Relational Stage – Images within a class are compared in a pairwise fashion as in Santoro et al. (2017) resulting in a single vector to represent the class. Parameters are shared across all classes.

  3. Conditioning Stage – The query image is used to augment the representation for each class.

  4. Classifier – Information is combined across class vectors and a final softmax classification is made based on the query image.

Figure 1:

Full model architecture. Blue rectangles indicate fully connected blocks, while orange rectangles indicate convolutional blocks. The final classification architecture makes use of two 1D convolutional blocks followed by a single fully connected block and dense softmax layer.

The full model architecture is detailed in Figure 1. Each rectangle represents a series of blocks, described in more detail below, where orange represents convolutional and blue represents fully connected layers. Note that the query image is incorporated into both the conditioning stage and the final classification stage.

2.1 Feature Stage

To extract a feature vector from each image, we use the same convolutional architecture as in Ravi & Larochelle (2017). [Footnote 1: We make use of this smaller architecture in order to more fairly compare against baseline models in the literature. Initial experiments indicate the model also works well with a larger, pre-trained ResNet architecture.] The network is made up of four convolutional blocks, where each block begins with a 2D convolutional layer with a $3 \times 3$ kernel and 32 filters. Each convolutional layer is followed by batch normalization, an ELU activation (Clevert et al., 2015), and a max pooling layer. [Footnote 2: Initial experiments with dropout as a regularizer showed poor convergence during training, so dropout is excluded.] After the fourth convolutional block, a linear layer produces a vector of size 800 to represent the image. This feature architecture is used with the same parameters for all images in each few-shot trial, regardless of whether the image is a query or from the support set. This encourages the network to use the feature stage to learn generic visual features which are useful regardless of the status of the image in the few-shot trial.
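As a rough check on the sizes involved, the spatial resolution after the four blocks can be worked out by hand. The sketch below assumes 84×84 inputs (the resolution used in the experiments), same-padded convolutions, and non-overlapping 2×2 max pooling; the padding and pooling size are assumptions, since the text does not state them.

```python
def feature_map_size(input_size, num_blocks=4, pool=2):
    """Spatial size after `num_blocks` blocks of (same-padded conv, then pool x pool max pooling)."""
    size = input_size
    for _ in range(num_blocks):
        # A same-padded convolution keeps the size; max pooling floors the division.
        size //= pool
    return size

side = feature_map_size(84)      # 84 -> 42 -> 21 -> 10 -> 5
flattened = side * side * 32     # 32 filters in the final block
```

Under these assumptions the final feature map flattens to 5 · 5 · 32 = 800 values, which matches the size of the image vector described above.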

2.2 Relational Stage

Figure 2: Relational network $R_\theta$. $x_i$ and $x_j$ are image vectors of size 800 processed by the convolutional feature extractor. Each block represents a fully connected layer, batch normalization, and ELU non-linearity. The final output of the network is the summation of the first and final block outputs. A final class output is created by averaging the output of each image comparison.

A key problem in few-shot learning is how to efficiently learn class representations from a small set of images, and previous approaches make use of a variety of techniques. Matching networks take advantage of the set-to-set framework with an attention kernel over images in the support set $S$ (Vinyals et al., 2015, 2016). Prototypical networks simply use a feature stage to embed images into the metric space and use their average to represent the class.

To combine information across images in a class, we turn to a different set-based network architecture, relational networks (Santoro et al., 2017). Relational networks take a set as input and output a single representation that is order insensitive. While the original relational network was intended to process multiple areas within a single image, in the case of few-shot learning, we can treat the images of a single class in the support set, $S_c$, as input to a relational network because their ordering is irrelevant. Compared to prototypical networks, this allows us to learn a more complex relationship between the images in $S_c$ and the vector representation for class $c$. This method also allows us to avoid imposing an arbitrary ordering onto $S_c$, as in the case of matching networks with full context embeddings (Vinyals et al., 2016).

In the original formulation, a relational network is a network that takes in two items at a time and produces a single vector. Pair-wise comparisons are made using the same network for every pair of items within the input set. For a few-shot class with $k$ images, this results in on the order of $k^2$ comparisons. [Footnote 3: Although this scales quadratically with $k$, we note that a fixed number of sampled comparisons could also be used as an approximation in cases where full calculations would be problematic.] Relational networks then combine information from these comparisons using a summation. We differ in that our relational stage makes use of an average, following the approach of Hilliard et al. (2017). This has the added benefit that the scale of the average does not depend on the number of images per class. In cases where $k$ is always the same, as in our experimental results, this method differs from summation only in terms of the scale of the output.

We formalize this as a function $f$ in Equation 1, where $R_\theta$ is a relational network with parameters $\theta$ and $k$ is the number of images in $S_c$. Note that this produces a vector for a single class only. The process is repeated for all $N$ classes to produce $N$ class vectors.
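The averaging described here can be written out explicitly. The following is a sketch rather than the paper's own typesetting: it assumes comparisons are made over all ordered pairs of images in the class, including an image paired with itself (so that the 1-shot case is well defined), and the symbol names are chosen for illustration.

```latex
f(S_c) = \frac{1}{k^2} \sum_{i=1}^{k} \sum_{j=1}^{k} R_\theta(x_i, x_j),
\qquad x_i, x_j \in S_c
```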


As in the original relational network paper (Santoro et al., 2017), we parameterize the relational network $R_\theta$ as a network with fully connected layers. The network is structured similarly to the feature extractor, but using fully connected instead of convolutional blocks. As before, within a block each fully connected layer is followed by batch normalization and an ELU activation. Fully connected layers have dimension 128. We also make use of a skip connection which links the output of the first and final fully connected blocks. Skip, or residual, connections are additional connections made between two layers in a network that “skip” over one or more intermediary layers and were an important step in efficiently training very deep networks (He et al., 2016). We connect the first and final blocks by summing their individual outputs. This allows later layers of the network to focus on processing information which is not fully captured by the first layer.

We treat the number of blocks within the relational stage as a hyperparameter which can be modified based on the complexity of the modeling task. Unless otherwise noted, we make use of 4 fully connected blocks. This relational architecture is used for every pair-wise comparison of images within a few-shot category with the final output of the relational stage being the average output across all comparisons. The output vectors can be thought of as a class embedding into a 128-dimensional embedding space.
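The effect of averaging pairwise comparisons can be illustrated with a toy example. In this sketch `relate` is a hypothetical stand-in for the trained pairwise network, and comparing all ordered pairs (self-pairs included) is an assumption; the point is only that the result ignores image order and that summing instead changes the scale.

```python
from itertools import product

def relate(x_i, x_j):
    """Toy stand-in for the pairwise network: any function of two image vectors."""
    return [a * b + abs(a - b) for a, b in zip(x_i, x_j)]

def class_vector(images, reduce="mean"):
    """Combine all ordered pairwise comparisons (self-pairs included) into one vector."""
    pairs = list(product(images, repeat=2))
    outputs = [relate(x_i, x_j) for x_i, x_j in pairs]
    # Sum each output dimension across every comparison.
    totals = [sum(column) for column in zip(*outputs)]
    if reduce == "sum":
        return totals
    return [total / len(pairs) for total in totals]

images = [[1.0, 2.0], [3.0, 0.0], [0.5, 4.0]]
mean_vec = class_vector(images)
reversed_vec = class_vector(list(reversed(images)))
sum_vec = class_vector(images, reduce="sum")
```

With three images there are nine comparisons, so `sum_vec` is nine times `mean_vec`, while `reversed_vec` matches `mean_vec` because the order of images within the class does not matter.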

2.3 Conditioning Stage

Figure 3: Conditioning network. The layer level design is the same as in Figure 2.

Because the relational stage outputs a single vector for each few-shot category, we could simply pass its output directly to the final classification, which would follow the method used in prototypical networks (Snell et al., 2017). Instead, we introduce an intermediate stage where the network has a chance to use information about the query image to condition each individual class representation. We therefore refer to this as our conditioning stage. Conceptually, the output of this stage is a modification of the original class embedding which better takes into account which features of the class might be most relevant for a few-shot trial.

While the class vector $f(S_c)$ learns to represent the similarities between images in class $c$, we would also like to learn class vectors that take into account the query image, $\hat{x}$. Doing so gives the network flexibility in learning what aspects of an image class might be relevant to a particular query. For instance, although the relational stage must encode as many features of the class as might be relevant, our conditioning stage might be able to specifically encode a similarity of color or dissimilarity in shape between the class and query. To achieve this, we concatenate $f(S_c)$ and $\hat{x}$ for each class $c$ in $S$ and then allow each concatenated vector to be separately processed by the rest of the conditioning architecture.

The conditioning network is described in Equation 2, where $C_\phi$ is a neural network parameterized by $\phi$.
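One way to write the conditioning step, based on the concatenation described in the preceding paragraph, is given below. This is a sketch under assumptions: the name $g$ for the conditioned class vector and the bracket notation for concatenation are introduced here for illustration, and $\hat{x}$ is taken to stand for the query's feature vector.

```latex
g(S_c, \hat{x}) = C_\phi\big( \left[\, f(S_c) \,;\, \hat{x} \,\right] \big)
```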


The conditioning network $C_\phi$ is structured similarly to the relational network, with a series of fully connected blocks with batch normalization and ELU activations. Unless otherwise mentioned, we make use of 4 blocks in the conditioning network. A skip connection again sums the output of the first and final blocks. $C_\phi$ differs only in that its input is the concatenation of two vectors, the class vector $f(S_c)$ and the query $\hat{x}$. Fully connected layers again have a dimension of 128, resulting in a conditioned embedding for each class within a 128-dimensional space.

This structure enables us to produce a single conditioned vector describing the group of images in the context of the query image, allowing the network to adapt its class representations for the given query. This representation is produced for each class in a traditional $N$-way problem structure, producing $N$ corresponding conditioned vectors. By updating the class representation in a separate block rather than in the final classification stage, we allow the model to separate the problem of understanding the context in a particular experimental trial from the problem of choosing which class the query image belongs to.

To ensure that this portion of the network is being utilized as intended, we also consider a modification of the algorithm where the input to the conditioning stage is simply the class vector $f(S_c)$. This removes the ability of the model to condition class representations on the query, but largely retains the additional number of parameters added by the stage. We refer to this as the Metric-Agnostic without conditioning model (MA w/o cond.).
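The difference between the full model and the ablation comes down to what is fed into the conditioning stage. The sketch below builds those inputs only; it assumes the query enters as its 800-dimensional feature vector and that class vectors are 128-dimensional, and the helper name `conditioning_inputs` is hypothetical.

```python
def conditioning_inputs(class_vectors, query_vector, use_query=True):
    """Build one conditioning-stage input per class.

    With use_query=True each class vector is concatenated with the query vector;
    with use_query=False (the ablation) the class vector is passed through alone.
    """
    if use_query:
        return [list(c) + list(query_vector) for c in class_vectors]
    return [list(c) for c in class_vectors]

# Assumed sizes: 128-d class vectors and an 800-d query feature vector.
class_vectors = [[0.0] * 128 for _ in range(5)]
query_vector = [1.0] * 800

conditioned = conditioning_inputs(class_vectors, query_vector)
ablated = conditioning_inputs(class_vectors, query_vector, use_query=False)
```

Under these assumptions each class yields a 928-dimensional input in the full model and a 128-dimensional input in the ablation, so only the first layer of the stage changes size.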

2.4 Classification Stage

Figure 4: Final classification stage. The first two blocks are one dimensional convolutions with batch normalization and a non-linearity. The flattened output of the last convolution is then fed into a fully connected layer with batch normalization and a non-linearity prior to the final $N$-way output layer with a softmax activation.

Now that we have a vector representing each class in $S$, it would be possible to simply embed the query $\hat{x}$ within this space and use a pre-defined metric, e.g. Euclidean distance, in order to perform classification. Instead, we opt to replace this with a parameterized neural network which takes in the representations of the support set and query and creates a softmax classification. We note that this classification architecture should not be thought of as a true metric, since it simply learns to point towards the correct class and does not necessarily satisfy all the properties of an actual metric. Notably, the query image is never embedded within the same space as the support set.

One goal for the classification architecture is that it should combine information across all five classes in order to make its final decision, rather than an individual decision being made for each class in isolation. To accomplish this we make use of a convolutional architecture without padding that learns to combine information across classes in an order agnostic manner. The input to the classification stage is $N$ 128-dimensional class vectors, which we pass into a 1D convolutional block with a kernel size of 3, 128 filters, batch normalization, and an ELU non-linearity. A second 1D convolutional block of the same specifications reduces all information about a 5-way comparison into a single 128-dimensional vector. A fully connected block of size 128, again with batch normalization and an ELU activation, performs a final non-linear operation before passing the vector to the final dense softmax layer.

Because the order of classes is randomized for each trial, this final softmax learns to point to the most similar class regardless of its arbitrary ordering.
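The reduction from five class vectors to a single vector follows from the output length of an unpadded convolution. A minimal sketch, assuming a stride of 1 (the stride is not stated in the text):

```python
def valid_conv_length(length, kernel_size):
    """Output length of an unpadded, stride-1 1D convolution."""
    if kernel_size > length:
        raise ValueError("kernel is longer than the input")
    return length - kernel_size + 1

lengths = [5]                      # five class vectors enter the classifier
for _ in range(2):                 # two 1D convolutional blocks
    lengths.append(valid_conv_length(lengths[-1], kernel_size=3))
```

The sequence of lengths is 5, 3, 1, so two blocks with kernel size 3 collapse exactly a 5-way trial; a different number of classes would need a different kernel size or number of blocks to end at a single position.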

3 Related Work

Our work draws on a number of previous approaches to few-shot learning, predominantly those referred to under the umbrella of metric-learning (Kulis et al., 2013; Vinyals et al., 2016; Snell et al., 2017). The goal of such approaches is to embed the input into a vector space where a simple distance function can be used for classification. Our work differs from traditional metric-learning in that we allow a neural network (our classification stage) to learn both the embedding space and the comparison metric, rather than using a static distance function such as cosine or Euclidean distance.

Architecturally, we take inspiration from methods such as Siamese (Koch et al., 2015) and relational networks (Santoro et al., 2017). Similar to the Siamese network approach, we apply a single network to process images from all classes. Once features for each image have been processed, we make use of the relational network approach to combine the images within each class. Although Santoro et al. (2017) create a single representation by summing across all comparisons in the input set, the use of summation limits the model to cases where the number of object comparisons is always identical. Although this is the case in our experiments, we choose to follow Hilliard et al. (2017) in averaging across image comparisons, avoiding this fundamental limitation of the sum-based relational net.

Our model also differs substantially from previous attempts in the way it combines information about exemplar images in each class and the query image. Matching networks (Vinyals et al., 2016) learn to embed individual positive examples by taking into account the full support set $S$, which they label full context embeddings. Prototypical networks (Snell et al., 2017) can achieve better results by making careful architecture choices but represent each class based only on its own images. This is similar to our relational stage, which creates a class vector taking into account only the images within the class. Instead of using the query only at the very end, as in Matching and Prototypical networks, we introduce an intermediary conditioning stage where positive class vectors and the query image can be combined to create an updated class vector, warping the metric space to accommodate the task at hand.

As noted previously, one of the largest differences between our approach and many other metric-learning models is that we do not make use of a pre-defined distance metric. Whereas Vinyals et al. (2016) make use of cosine distance and Snell et al. (2017) make use of Euclidean distance, we allow our network to make use of a classification stage that performs a similar function, but which has learnable parameters. While our parameterized classification stage does not calculate a formal similarity metric per se, it does offer increased flexibility in the modeling task and we find that it performs well in practice.

An alternative approach to few-shot learning is known as meta-learning. Models such as MAML (Finn et al., 2017) and Meta-LSTM (Ravi & Larochelle, 2017) learn how to adapt a network to each new few-shot trial rather than how to compare images directly. Meta-LSTM frames the problem as one involving two separate, but cooperative, networks. A high-level network, the meta-learner, learns to adapt the weights of a low-level network, the learner, which makes actual task decisions. For every few-shot learning example, the learner is initialized and then trained for a short duration by the meta-learner with error backpropagated from the learner on to the meta-learner. MAML instead learns an initialization of the learner’s weights from which a few steps of ordinary gradient descent are enough to adapt to a new trial. Although our approach differs quite substantially from these meta-learning models, they represent some of the currently strongest baselines in few-shot learning and for this reason we compare our model against them.

4 Experiments

4.1 Experimental Design

For each of our experiments we make use of the same architecture and training hyperparameters. Architectural details are given in Section 2. All networks are trained in Keras using the Nadam optimizer with a learning rate of 0.001. Our models trained best using LeCun normal initialization (LeCun et al., 1998) for all fully connected layers and Glorot normal initialization (Glorot & Bengio, 2010) for convolutional layers. We train each model for 50 epochs with 60,000 few-shot trials per epoch and a batch size of 32. In all cases we evaluate the model with the highest validation accuracy.
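The training setup above can be collected into a small configuration, which also makes the implied batch counts easy to check. The dictionary keys are hypothetical names chosen for this sketch, and treating one batch as 32 few-shot trials is an assumption; the values themselves come from the text.

```python
TRAINING = {
    "optimizer": "nadam",
    "learning_rate": 0.001,
    "dense_initializer": "lecun_normal",   # fully connected layers
    "conv_initializer": "glorot_normal",   # convolutional layers
    "epochs": 50,
    "trials_per_epoch": 60_000,
    "batch_size": 32,
}

# Derived quantities, assuming each batch holds `batch_size` few-shot trials.
batches_per_epoch = TRAINING["trials_per_epoch"] // TRAINING["batch_size"]
total_trials = TRAINING["epochs"] * TRAINING["trials_per_epoch"]
```

Under that assumption an epoch is 1,875 batches and a full run covers 3,000,000 trials.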

Because the network sees the query image twice (by design), the network is prone to overfitting by memorizing the relationships between images in the training set. It is important to prevent the network from memorizing the small sets of images that make up each training class. To deal with these issues, we explored a number of regularization techniques but settled on using only data augmentation. For training we began with an aggressive data augmentation scheme that included randomized rotation, translation, zooming, and horizontal flipping. We found that, in practice, this prevents the models from overfitting without the need for additional regularization such as dropout.

We include baselines from other well known techniques in this field including Meta-LSTM, MAML, matching networks and prototypical networks. All baseline models were evaluated on the same train/val/test splits as the MACO networks in order to ensure equivalency among the results.

[Footnote 4: Baseline implementations were used from:
Meta-LSTM & matching networks: https://github.com/twitter/meta-learning-lstm
MAML: https://github.com/cbfinn/maml
Prototypical networks: https://github.com/jakesnell/prototypical-networks]

4.2 Caltech-UCSD Birds

Model                     1-shot    5-shot
Matching Network          29.34%    35.48%
Matching Network (FCE)    49.34%    59.31%
Prototypical Network      45.27%    34.35%
Meta-Learner LSTM         40.43%    49.65%
MAML                      38.43%    59.15%
MACO                      60.76%    74.96%
MA w/o cond.              55.86%    69.49%

Table 1: Average test set classification accuracy on Caltech-UCSD Birds.

We first look at a fine-grained classification task in the Caltech-UCSD Birds 200 (CUB-200) dataset. This dataset includes 200 fine-grained categories of birds which we randomly divide into 100 for training, 50 for validation, and 50 for testing. Each image is resized to 84×84 pixels and put through the data augmentation process described previously to reduce overfitting. Initial experiments indicated that the deeper networks, as used on our other datasets, generalized poorly even with data augmentation. To combat this, we reduced the depth of the relational and conditioning stage blocks from 4 to 2. We present our results on this dataset in Table 1.

We find that our model is able to easily outperform the previous state-of-the-art. For the 5-shot experiments, we are able to achieve 74.96% test accuracy, over 15 percentage points higher than the best performing baseline, matching networks. As expected, performance is much worse in the 1-shot case, with performance at 60.76%, but is still approximately 11 percentage points higher than the matching networks baseline (49.34%).

To understand what leads to this level of performance, we investigated whether or not the inclusion of the conditioning stage was necessary. In order to keep model parameters roughly comparable, we leave in the conditioning stage but remove the query image from its input. We refer to this as our Metric-Agnostic network without conditioning (MA w/o cond.). The important aspect of this network is that information about the query image can only be included at the final classification stage, which is also where information about all classes in the support set is combined. We find that the full MACO network achieves much higher performance, gaining approximately 5 percentage points of accuracy on both 1- and 5-shot tasks.
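The margins quoted in this section can be recomputed directly from Table 1. The sketch below copies the reported accuracies; the function and variable names are chosen for illustration.

```python
# Accuracies from Table 1 as (1-shot, 5-shot) percentages.
TABLE_1 = {
    "Matching Network": (29.34, 35.48),
    "Matching Network (FCE)": (49.34, 59.31),
    "Prototypical Network": (45.27, 34.35),
    "Meta-Learner LSTM": (40.43, 49.65),
    "MAML": (38.43, 59.15),
    "MACO": (60.76, 74.96),
    "MA w/o cond.": (55.86, 69.49),
}
OURS = {"MACO", "MA w/o cond."}

def margin_over_best_baseline(shot_index):
    """Percentage-point gap between MACO and the strongest baseline."""
    best = max(acc[shot_index] for name, acc in TABLE_1.items() if name not in OURS)
    return round(TABLE_1["MACO"][shot_index] - best, 2)

def conditioning_gain(shot_index):
    """Percentage-point gap between MACO and the model without conditioning."""
    return round(TABLE_1["MACO"][shot_index] - TABLE_1["MA w/o cond."][shot_index], 2)

margins = [margin_over_best_baseline(i) for i in (0, 1)]
gains = [conditioning_gain(i) for i in (0, 1)]
```

This gives margins of 11.42 (1-shot) and 15.65 (5-shot) points over matching networks with full context embeddings, and conditioning gains of 4.90 and 5.47 points.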

In Figure 5 we show how model loss and accuracy change over the course of training for the 5-shot task. In blue we show the accuracy of the model, with training accuracy as a dashed line and validation accuracy as a solid line. Model loss is shown in red. We note that validation accuracy and loss start off much better than their training counterparts because of the aggressive data augmentation applied to the training set.

Figure 5: Loss and accuracy across epochs for 5-shot, 5-way experiments on the Caltech-UCSD Birds dataset. Epochs are arbitrarily defined as 60k iterations. Training loss and accuracy are represented as dashed lines, while validation scores are solid.

4.3 miniImageNet

Model                     1-shot    5-shot
Matching Network          43.74%    52.90%
Matching Network (FCE)    45.91%    57.66%
Meta-Learner LSTM         49.26%    59.59%
MAML                      32.05%    61.55%
MACO                      41.09%    58.32%

Table 2: Average test set classification accuracy on miniImageNet.

To test the ability of our architecture to learn relatively broad categories, we evaluate the model on miniImageNet. The miniImageNet task, originally defined in Vinyals et al. (2016), uses a random set of 100 classes from the ILSVRC ImageNet dataset. We used the same class splits used in Ravi & Larochelle (2017) with 64 for training, 16 for validation, and the remaining 20 classes as a test set. The original ImageNet images are downsampled to a smaller 84×84 resolution. When generating few-shot trials, we randomly sample our own images from the 600 images provided in each class.

We report our results in conjunction with baselines in Table 2 with the best performing model for a given task identified in bold. We conduct 5-way experiments with both 1-shot and 5-shot trials. Of the baseline models we find that meta-learning approaches are most successful, with Meta-LSTM achieving 49.26% on the 1-shot trials and MAML achieving 61.55% on the 5-shot.

Our MACO network achieves a 5-shot test accuracy of 58.32%, higher than the matching network baseline but somewhat below the two meta-learning algorithms. We perform more poorly on the 1-shot case with only 41.09% accuracy.

4.4 miniDogsNet

Model                     1-shot    5-shot
Matching Network          45.00%    57.08%
Matching Network (FCE)    46.01%    57.38%
Meta-Learner LSTM         38.37%    53.65%
MAML                      31.52%    59.66%
MACO                      39.10%    54.45%

Table 3: Average test set classification accuracy on miniDogsNet.

While the CUB results are promising, it is important to test the model’s fine-grained abilities on a variety of data. For an alternate dataset we created a miniDogsNet task, mirroring the miniImageNet dataset proposed in Vinyals et al. (2016). This dataset consists entirely of images from the ImageNet dog categories listed in that paper. We randomly selected 100 of those classes and used the same 64/16/20 random class split for training, validation, and testing. [Footnote 5: We note that this differs from the task described in that paper, which involved training on non-dog classes and testing on dog-specific classes.]

Results for the miniDogsNet task are presented in Table 3. The fine-grained task is more difficult than miniImageNet for most models, with both meta-learning baselines and MACO scoring lower than on miniImageNet. Whereas meta-learning approaches dominated the other baselines for broad classification, we find that this does not hold for miniDogsNet. Matching networks with full context embeddings are the highest performing baseline on the 1-shot experiments (46.01%), while MAML still performs best on the 5-shot (59.66%).

Looking at the MACO results, we again find that we are able to compete with these state-of-the-art approaches. We achieve 39.10% on the 1-shot task and 54.45% on the 5-shot task. For 1-shot learning, this places us above both meta-learning baselines, but below matching networks.

5 Conclusion

We have introduced a novel Metric-Agnostic Conditional architecture for few-shot learning and evaluated its effectiveness across three image datasets. Our architecture deviates from previous approaches both in that it replaces a pre-defined distance metric with a learnable classifier and in that it explicitly conditions class representations to take into account the query image. We achieve state-of-the-art performance for the Caltech-UCSD Birds dataset on both 1- and 5-shot experiments and show that the ability to condition is responsible for approximately a 5 percentage point boost in performance on that dataset. The success of our approach on this fine-grained task also raises questions as to whether previous metric-based approaches might benefit from decisions being made by learned classifiers rather than pre-defined metrics.


This work was funded by the U.S. Government.