
Adaptive Cross-Modal Few-Shot Learning

by   Chen Xing, et al.

Metric-based meta-learning techniques have successfully been applied to few-shot classification problems. However, leveraging cross-modal information in a few-shot setting has yet to be explored. When the support from visual information is limited in few-shot image classification, semantic representations (learned from unsupervised text corpora) can provide strong prior knowledge and context to help learning. Based on this intuition, we design a model that is able to leverage visual and semantic features in the context of few-shot classification. We propose an adaptive mechanism that is able to effectively combine both modalities conditioned on categories. Through a series of experiments, we show that our method boosts the performance of metric-based approaches by effectively exploiting language structure. Using this extra modality, our model surpasses current unimodal state-of-the-art methods by a large margin on two important benchmarks: mini-ImageNet and tiered-ImageNet. The improvement in performance is particularly large when the number of shots is small.




1 Introduction

Deep learning methods have achieved major advances in areas such as speech, language and vision (LeCun et al., 2015). These systems, however, usually require a large amount of labeled data, which can be impractical or expensive to acquire. Limited labeled data lead to overfitting and generalization issues in classical deep learning approaches.

On the other hand, existing evidence suggests that the human visual system is capable of operating effectively in the small-data regime: humans can learn new concepts from very few samples by leveraging prior knowledge and context (Landau et al., 1988; Markman, 1991; Smith & Slone, 2017). The problem of learning new concepts with a small number of labeled data points is usually referred to as few-shot learning (Bart & Ullman, 2005; Fink, 2005; Li et al., 2006; Lake et al., 2011).

Most approaches addressing this problem are based on the meta-learning paradigm (Schmidhuber, 1987; Bengio et al., 1992; Thrun, 1998; Hochreiter et al., 2001), a class of algorithms and models focused on learning how to (quickly) learn new concepts. Meta-learning approaches work by learning a parameterized function that embeds a variety of learning tasks and can generalize to new ones. They are usually trained by sampling different (small) sets from a large universe of labeled samples, imitating the test scenario, and optimizing the parameters of the model on each set. At test time, they are able to solve new learning tasks using only a small number of samples.

Figure 1: Concepts have different visual and semantic feature spaces. (Left) Some categories may have similar visual features and dissimilar semantic features. (Right) Others can share the same semantic label but have very distinct visual features. Our method adaptively exploits both modalities to improve classification performance in the low-shot regime.

Recent progress in few-shot classification has primarily been made in the context of unimodal learning. In contrast, strong evidence supports the hypothesis that language helps toddlers learn new concepts (Jackendoff, 1987; Smith & Gasser, 2005). This suggests that semantic features can be a powerful source of information in the context of few-shot image classification. While visual and semantic embeddings have been combined in many applications to improve visual recognition tasks (e.g. in zero-shot learning (Frome et al., 2013) or image retrieval (Weston et al., 2011)), exploiting semantic language structure in the meta-learning framework has been mostly unexplored.

In this paper, we argue that few-shot classification can be considerably improved by leveraging semantic information from labels (learned, for example, from unsupervised text corpora). Visual and semantic feature spaces have heterogeneous structures by definition. For certain concepts, visual features might be richer and more discriminative than semantic ones, while for others the inverse might be true. Figure 1 illustrates this remark. Moreover, when the support from visual information is limited, semantic features can provide strong prior knowledge and context to help learning. Based on this idea, we propose the Adaptive Modality Mixture Mechanism (AM3), an approach that effectively and adaptively combines information from the visual and semantic spaces.

AM3 is built on top of metric-based meta-learning approaches. These approaches perform classification by comparing distances in a metric space learned from visual data. Different from previous metric-based methods, our model is able to exploit both visual and semantic feature spaces for classification. The semantic representation is learned from unsupervised text corpora and is easy to acquire. Our proposed mechanism performs classification in a feature space that is a convex combination of the two modalities. Moreover, we design the mixing coefficient to be adaptive w.r.t. different categories.

Our main contributions can be summarized as follows: (i) we propose an adaptive cross-modality mixing mechanism for few-shot classification that effectively leverages visual and semantic information, (ii) we show that our approach achieves a considerable boost in performance over different metric-based meta-learning approaches across datasets and numbers of shots, and (iii) we perform an empirical investigation to analyze how the model leverages semantic information. Moreover, our method beats the current (single-modality) state of the art by a large margin.

The rest of the paper is organized as follows: Section 2 presents related work, Section 3 describes how we leverage semantic information to improve few-shot classification, and Section 4 describes our experiments on two important few-shot learning datasets: miniImageNet (Vinyals et al., 2016) and tieredImageNet (Ren et al., 2018). We conclude in Section 5.

2 Related Work

Few-shot learning.

Meta-learning or “learning to learn” is a problem with a prominent history in machine learning (Schmidhuber, 1987; Bengio et al., 1992; Thrun, 1998). Due to advances in representation learning methods (Goodfellow et al., 2016) and the creation of new few-shot learning datasets (Lake et al., 2011; Vinyals et al., 2016), many deep meta-learning approaches have been proposed to address the few-shot learning problem. These methods can be roughly divided into two main types: metric-based and gradient-based approaches.

Metric-based approaches aim at learning representations that minimize intra-class distances while maximizing the distance between different classes. These approaches tend to rely on an episodic training framework: the model is trained with sub-tasks (episodes) in which there are only a few training samples for each category. For example, matching networks (Vinyals et al., 2016) follow a simple nearest-neighbour framework. In each episode, they use an attention mechanism (over the encoded support) as a similarity measure for one-shot classification.

In prototypical networks (Snell et al., 2017), a metric space is learned where embeddings of queries of one category are close to the centroid (or prototype) of supports of the same category, and far away from the centroids of other classes in the episode. Due to the simplicity and performance of this approach, many methods have extended this work. For instance, Ren et al. (2018) propose a semi-supervised few-shot learning approach and show that leveraging unlabeled samples outperforms purely supervised prototypical networks. Wang et al. (2018) propose to augment the support set by generating hallucinated examples. Task-dependent adaptive metric (TADAM) (Oreshkin et al., 2018) relies on conditional batch normalization (Dumoulin et al., 2018) to provide task adaptation (based on task representations encoded by visual features) and learn a task-dependent metric space.

Gradient-based meta-learning methods aim at training models that can generalize well to new tasks with only a few fine-tuning updates. Most of these methods are built on top of the model-agnostic meta-learning (MAML) framework (Finn et al., 2017). Given the universality of MAML, many follow-up works have been proposed to improve its performance on few-shot learning (Nichol et al., 2018; Lacoste et al., 2017). Kim et al. (2018) and Finn et al. (2018) propose probabilistic extensions to MAML trained with variational approximation. Conditional class-aware meta-learning (CAML) (Jiang et al., 2019) conditionally transforms embeddings based on a metric space that is trained with prototypical networks to capture inter-class dependencies. Latent embedding optimization (LEO) (Rusu et al., 2019) aims to tackle MAML’s problem of using only a few updates, in a low-data regime, to train models in a high-dimensional parameter space. The model employs a low-dimensional latent embedding space for the updates and then decodes the actual model parameters from the low-dimensional latent representations. This simple yet powerful approach achieves current state-of-the-art results on different few-shot classification benchmarks.

Other meta-learning approaches for few-shot learning include using memory architectures to either store exemplar training samples (Santoro et al., 2016) or directly encode a fast adaptation algorithm (Ravi & Larochelle, 2017). Mishra et al. (2017) use temporal convolution to achieve the same goal.

The approaches mentioned above rely solely on visual features for few-shot classification. Our contribution is orthogonal to current metric-based approaches and can be integrated into them to boost performance in few-shot classification. One concurrent work (Tokmakov et al., 2018) also leverages side information to help few-shot learning. That method relies heavily on human-annotated attributes of images, while ours relies on semantic embeddings of category labels (learned from unsupervised text corpora). This makes our approach easily applicable to many different (few-shot) image classification datasets with no human-annotated attributes.

Zero-shot learning.

Zero-shot learning aims at recognizing objects whose instances have not been seen during training (Larochelle et al., 2008; Palatucci et al., 2009). We point the reader to (Xian et al., 2018a) for an overview of current zero-shot learning methods. Classic approaches to this problem encode classes with a set of numeric attributes. Frome et al. (2013) and Socher et al. (2013) propose the first methods to use label semantic features in zero-shot learning. They transform visual features into the semantic space and force this transformation to keep the same structure as that of the semantic space (pre-trained on text corpora). More recently, Hubert Tsai et al. (2017) propose a method that uses maximum mean discrepancy (MMD) (Gretton et al., 2007) to learn joint embeddings for semantic and visual features. These embeddings are then used to perform zero- and few-shot learning. Xian et al. (2018b) propose a GAN-based (Goodfellow et al., 2014) approach that generates visual features conditioned on the semantic label, as a means of mapping a label to a distribution of visual features. In (Schönfeld et al., 2018), the authors propose a method that encodes information from both sides using two VAEs (Kingma & Welling, 2014), adding regularization to align the two latent spaces.

Because zero-shot learning does not have access to any visual feature support, joint-embedding approaches are reasonable: the model has no choice but to force the visual representation space to have the same structure as the semantic space. This way, at test time, the image query’s similarity with the semantic information of candidate classes can be computed for classification. However, this explicit visual-semantic alignment may be harmful when we have access to a supporting image set. The examples presented in Section 1 illustrate that visual and semantic spaces have different structures. Therefore, forcing them to align blindly may result in information loss in both modalities, ultimately weakening both of them. Instead, we propose an adaptive mixture mechanism that is able to combine both modalities (conditioned on categories) in an effective and efficient way. In Section 4, we show that our proposed model is able to leverage label semantic information much more effectively than jointly embedding the visual and semantic spaces.

3 Method

In this section, we explain how we leverage language structure to improve few-shot image classification. We start with a brief explanation of episodic training for few-shot learning and a summary of prototypical networks followed by a description of the proposed adaptive modality mixture mechanism.

3.1 Preliminaries

3.1.1 Episodic Training

In few-shot learning, we are interested in training a classifier on a labeled dataset $\mathcal{D}_{train}$ that generalizes well to a test dataset $\mathcal{D}_{test}$. The class sets of $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ are disjoint. The test set has only a few labeled samples per category. Most successful approaches rely on an episodic training paradigm: the few-shot regime faced at test time is simulated by sampling small subsets from the large labeled set $\mathcal{D}_{train}$ during training.

In general, models are trained on $K$-shot, $N$-way episodes. Each episode is created by first sampling $N$ categories from the training set and then sampling two sets of images from these categories: (i) the support set $\mathcal{S}_e$ containing $K$ examples for each of the $N$ categories and (ii) the query set $\mathcal{Q}_e$ containing different examples from the same categories.
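This episode construction can be sketched in a few lines of Python (a minimal illustration only; the dictionary-based dataset layout and the function name are our own assumptions, not the authors' code):

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query):
    """Sample an N-way, K-shot episode from a dict mapping class -> samples.

    Returns a support set with k_shot examples per class and a query set
    with n_query different examples per class.
    """
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for c in classes:
        # Draw support and query examples without replacement, so they differ.
        samples = random.sample(dataset[c], k_shot + n_query)
        support += [(x, c) for x in samples[:k_shot]]
        query += [(x, c) for x in samples[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 dummy samples each.
data = {c: list(range(20)) for c in range(10)}
S, Q = sample_episode(data, n_way=5, k_shot=1, n_query=3)
```

Support and query examples of a class never overlap within an episode, since they are drawn from one sample without replacement.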

The episodic training for few-shot classification is achieved by minimizing, for each episode, the loss of the prediction on samples in the query set, given the support set. The model is a parameterized function and the loss is the negative log-likelihood of the true class of each query sample:

$$\mathcal{L}(\theta) = \mathbb{E}_{(\mathcal{S}_e, \mathcal{Q}_e)} \Big[ -\sum_{(q_i, y_i) \in \mathcal{Q}_e} \log p_\theta(y_i \mid q_i, \mathcal{S}_e) \Big], \qquad (1)$$

where $\mathcal{Q}_e$ and $\mathcal{S}_e$ are, respectively, the sampled query and support set at episode $e$, and $\theta$ are the parameters of the model.

3.1.2 Prototypical Networks

We build our model on top of metric-based meta-learning methods. We choose prototypical networks (Snell et al., 2017) to explain our model due to their simplicity. We note, however, that the proposed method can potentially be applied to any metric-based approach.

Prototypical networks use the support set to compute a centroid (prototype) for each category (in the sampled episode), and query samples are classified based on the distance to each prototype. The model is a convolutional neural network (LeCun et al., 1998) $f$, parameterized by $\theta_f$, that learns a $d$-dimensional embedding space where samples of the same category are close and those of different categories are far apart.

For every episode $e$, each embedding prototype $p_c$ (of category $c$) is computed by averaging the embeddings of all support samples of class $c$:

$$p_c = \frac{1}{|S_c|} \sum_{(s_i, y_i) \in S_c} f(s_i), \qquad (2)$$

where $S_c \subset \mathcal{S}_e$ is the subset of the support set belonging to class $c$.

The model produces a distribution over the categories of the episode based on a softmax (Bridle, 1990) over the (negative) distances of the embedding of the query $q$ to the embedded prototypes:

$$p(y = c \mid q, \mathcal{S}_e, \theta) = \frac{\exp(-d(f(q), p_c))}{\sum_{c'} \exp(-d(f(q), p_{c'}))}. \qquad (3)$$

We consider $d$ to be the Euclidean distance. The model is trained by minimizing Equation 1 and the parameters are updated with stochastic gradient descent.
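The prototype averaging and the softmax over negative distances can be sketched in a few lines of NumPy (an illustration only, with toy 2-D vectors standing in for the output of the embedding network; squared Euclidean distance is used here, which gives the same ranking as the Euclidean distance):

```python
import numpy as np

def prototypes(support_emb, support_lab, classes):
    """Average the support embeddings of each class to get its prototype."""
    return np.stack([support_emb[support_lab == c].mean(axis=0) for c in classes])

def class_probabilities(query_emb, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Toy 2-way episode with 2-D embeddings (two supports per class).
sup = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
lab = np.array([0, 0, 1, 1])
P = prototypes(sup, lab, classes=[0, 1])
probs = class_probabilities(np.array([[0.1, 0.1]]), P)  # query near class 0
```

The query close to the first cluster gets most of its probability mass on class 0, as expected from the distance-based softmax.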

3.2 Adaptive Modality Mixture Mechanism

The information contained in semantic concepts can significantly differ from visual information content. For instance, ‘Siberian husky’ and ‘wolf’, or ‘komondor’ and ‘mop’, might be difficult to discriminate with visual features, but might be easier to discriminate with language semantic features.

In zero-shot learning, where no visual information is given at test time (that is, the support set is empty), algorithms need to rely on side information. Current state-of-the-art zero-shot learning methods rely on joint embedding of the image feature space and the class label embedding space (Frome et al., 2013; Xian et al., 2018a). On the other extreme, when the number of labeled image samples is considerable, neural network models tend to ignore the semantic information, as they are able to generalize well with a large number of samples (Krizhevsky et al., 2012).

Figure 2: Adaptive modality mixture model. The final category prototype is a convex combination of the visual and the semantic feature representations. The mixing coefficient is conditioned on the semantic label embedding.

In the few-shot learning scenario, we hypothesize that both visual and semantic information can be useful for classification. Because we assume the visual and the semantic spaces have different structures, it is desirable that the proposed model exploits both modalities as effectively as possible.

We augment prototypical networks to incorporate language structure learned by a word-embedding model (pre-trained on large unsupervised text corpora), containing the label embeddings of all categories in $\mathcal{D}_{train} \cup \mathcal{D}_{test}$. In our model, we modify the prototype representation of each category by taking its label embedding into account.

More specifically, we model the new prototype representation as a convex combination of the two modalities. That is, for each category $c$, the new prototype is computed as:

$$p'_c = \lambda_c \cdot p_c + (1 - \lambda_c) \cdot w_c, \qquad (4)$$

where $\lambda_c$ is the adaptive mixture coefficient (conditioned on the category) and $w_c = g(e_c)$ is a transformed version of the label embedding $e_c$ for class $c$. This transformation $g$, parameterized by $\theta_g$, is important to guarantee that both modalities lie on a space of the same dimension and can be combined.

There are many different ways to adaptively calculate $\lambda_c$ to mix the two modalities. In this work, we choose to condition the mixing coefficient on the category. A highly structured semantic space is a good choice for conditioning. Therefore, we chose a simple model for modulation conditioned on the semantic embedding space:

$$\lambda_c = \frac{1}{1 + \exp(-h(w_c))}, \qquad (5)$$

where $h$ is the adaptive mixing network, with parameters $\theta_h$. Figure 2 illustrates the proposed model. In Section 4.3 we show how performance changes when the mixing coefficient is conditioned on different variables.
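A minimal NumPy sketch of the cross-modal prototype computation described above (the convex combination and the sigmoid mixing coefficient): here the transformation and mixing networks are represented by randomly initialized single-hidden-layer weight matrices, purely for illustration, not the trained networks of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D_SEM, D_VIS, HID = 300, 512, 300  # semantic dim, visual dim, hidden units

# Hypothetical stand-ins for the transformations g (label -> visual space)
# and h (transformed label -> scalar mixing logit).
W_g1 = rng.standard_normal((D_SEM, HID)) * 0.05
W_g2 = rng.standard_normal((HID, D_VIS)) * 0.05
W_h1 = rng.standard_normal((D_VIS, HID)) * 0.05
W_h2 = rng.standard_normal((HID, 1)) * 0.05

def relu(x):
    return np.maximum(x, 0.0)

def cross_modal_prototype(p_c, e_c):
    """Mix a visual prototype p_c with a label embedding e_c."""
    w_c = relu(e_c @ W_g1) @ W_g2                    # transformed label embedding
    lam = 1.0 / (1.0 + np.exp(-(relu(w_c @ W_h1) @ W_h2)))  # sigmoid coefficient
    lam = lam.item()
    return lam * p_c + (1.0 - lam) * w_c, lam

p_c = rng.standard_normal(D_VIS)  # dummy visual prototype
e_c = rng.standard_normal(D_SEM)  # dummy GloVe label embedding
p_prime, lam = cross_modal_prototype(p_c, e_c)
```

Because the coefficient passes through a sigmoid, the mixed prototype is always a proper convex combination of the two modalities.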

  Input: Training set $\mathcal{D} = \{(x_i, y_i)\}$, $y_i \in \{1, \dots, N_{tot}\}$. Pre-trained label embedding dictionary $\mathcal{E} = \{e_c\}$.
  Output: Episodic loss $\mathcal{L}(\theta)$ for sampled episode $e$.
  // Select $N$ classes for episode
  $C \leftarrow \text{RandomSample}(\{1, \dots, N_{tot}\}, N)$
  // Compute cross-modal prototypes
  for $c$ in $C$ do
     $S_c \leftarrow \text{RandomSample}(\mathcal{D}_c, K)$;  $Q_c \leftarrow \text{RandomSample}(\mathcal{D}_c \setminus S_c, K_q)$
     $p_c \leftarrow \frac{1}{|S_c|} \sum_{(s_i, y_i) \in S_c} f(s_i)$
     $w_c \leftarrow g(e_c)$;  $\lambda_c \leftarrow \big(1 + \exp(-h(w_c))\big)^{-1}$
     $p'_c \leftarrow \lambda_c \cdot p_c + (1 - \lambda_c) \cdot w_c$
  end for
  // Compute loss
  $\mathcal{L}(\theta) \leftarrow 0$
  for $c$ in $C$ do
     for $q$ in $Q_c$ do
        $\mathcal{L}(\theta) \leftarrow \mathcal{L}(\theta) + \frac{1}{N K_q} \big[ d(f(q), p'_c) + \log \sum_{c'} \exp(-d(f(q), p'_{c'})) \big]$
     end for
  end for
Algorithm 1 Training episode loss computation for adaptive cross-modality few-shot learning. $N_{tot}$ is the total number of classes in the training set, $N$ is the number of classes in every episode, $K$ is the number of supports for each class, $K_q$ is the number of queries for each class, and $\mathcal{E}$ is the pretrained label embedding dictionary.

The training procedure is similar to that of the original prototypical networks. However, the distances (used to calculate the distribution over classes for every image query) are computed between the query and the cross-modal prototype $p'_c$:

$$p(y = c \mid q, \mathcal{S}_e, \mathcal{W}, \theta) = \frac{\exp(-d(f(q), p'_c))}{\sum_{c'} \exp(-d(f(q), p'_{c'}))}, \qquad (6)$$

where $\theta = \{\theta_f, \theta_g, \theta_h\}$ is the set of parameters. Once again, the model is trained by minimizing Equation 1. Note that in this case the probability is also conditioned on the word embeddings $\mathcal{W} = \{w_c\}$.
Figure 3 illustrates an example on how the proposed method works. Algorithm 1 shows the pseudocode for calculating episode loss.

Figure 3: Qualitative example of how AM3 works. Assume the query sample belongs to category $c_i$. (a) The closest visual prototype to the query sample is that of a different category. (b) The semantic prototypes. (c) The mixture mechanism modifies the positions of the prototypes, given the semantic embeddings. (d) After the update, the closest prototype to the query is the one of category $c_i$, correcting the classification.

4 Experiments

We first describe the experimental setup including datasets, different cross-modal baselines and implementation details. Then, we compare the performance of our model, adaptive modality mixture mechanism (AM3), with other methods on the problem of few-shot classification. Finally, we perform a series of ablation studies to better understand the model.

4.1 Experimental Setup

4.1.1 Datasets

We experiment with two widely used few-shot learning datasets: miniImageNet (Vinyals et al., 2016) and tieredImageNet (Ren et al., 2018).


miniImageNet. This dataset is a subset of the ImageNet ILSVRC12 dataset (Russakovsky et al., 2015). It contains 100 randomly sampled categories, each with 600 images of size 84×84. For fair comparison with other methods, we use the same split proposed by Ravi & Larochelle (2017), which contains 64 categories for training, 16 for validation and 20 for test.


tieredImageNet. This dataset is a larger subset of ImageNet than miniImageNet. It contains 34 high-level category nodes (779,165 images in total) that are split into 20 for training, 6 for validation and 8 for test. This leads to 351 actual categories for training, 97 for validation and 160 for test. There are more than 1,000 images for each class. The train/val/test split is done according to the higher-level label hierarchy. According to Ren et al. (2018), splitting near the root of the ImageNet hierarchy results in a more realistic (and challenging) scenario, with training and test categories that are less similar.

Word embeddings.

We use GloVe (Pennington et al., 2014) to extract the semantic embeddings for the category labels. GloVe is an unsupervised approach based on word-word co-occurrence statistics from large text corpora. We use the Common Crawl version trained on 840B tokens; the embeddings are of dimension 300. When a category has multiple (synonym) annotations, we consider the first one. If the first one is not present in GloVe’s vocabulary, we use the second. If there is no annotation in GloVe’s vocabulary for a category (4 cases in tieredImageNet), we randomly sample each dimension of the embedding from a uniform distribution in the range (-1, 1). If an annotation contains more than one word, the embedding is generated by averaging the word embeddings. We also experimented with fastText embeddings (Joulin et al., 2016) and observed similar performance.
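The label-embedding rules above (first available synonym wins, uniform random fallback, word averaging for multi-word annotations) can be sketched as follows, with a tiny toy vocabulary standing in for the real 300-dimensional GloVe table; skipping unknown words inside a multi-word annotation is our own assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 3  # 300 in the paper; kept small here for illustration

# Toy stand-in for the GloVe lookup table.
glove = {"husky": np.ones(DIM), "wolf": -np.ones(DIM), "mop": np.full(DIM, 2.0)}

def embed_annotation(annotation):
    """Average the embeddings of a (possibly multi-word) annotation."""
    parts = [glove[w] for w in annotation.split() if w in glove]
    return np.mean(parts, axis=0) if parts else None

def label_embedding(synonyms):
    """Use the first synonym found in the vocabulary; else sample U(-1, 1)."""
    for s in synonyms:
        e = embed_annotation(s)
        if e is not None:
            return e
    return rng.uniform(-1.0, 1.0, size=DIM)

e1 = label_embedding(["siberian husky"])       # only "husky" is known
e2 = label_embedding(["unknownword", "wolf"])  # falls back to second synonym
e3 = label_embedding(["unknownword"])          # random fallback in (-1, 1)
e4 = label_embedding(["husky wolf"])           # average of the two words
```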

4.1.2 Baselines

The current state of the art in few-shot learning relies on visual embeddings only. We introduce four baselines that leverage cross-modal embeddings in different ways. Note that none of them has been published before, but they are natural candidates given the current research literature.

Model                                               Test Accuracy
                                                    1-shot          5-shot          10-shot
Matching Network (Vinyals et al., 2016)             43.56 ± 0.84%   55.31 ± 0.73%   -
Prototypical Network (Snell et al., 2017)           49.42 ± 0.78%   68.20 ± 0.66%   74.30 ± 0.52%
Discriminative k-shot (Bauer et al., 2017)          56.30 ± 0.40%   73.90 ± 0.30%   78.50 ± 0.00%
Meta-Learner LSTM (Ravi & Larochelle, 2017)         43.44 ± 0.77%   60.60 ± 0.71%   -
Meta-SGD (Li et al., 2017)                          50.47 ± 1.87%   64.03 ± 0.94%   -
MAML (Finn et al., 2017)                            48.70 ± 1.84%   63.11 ± 0.92%   -
Proto. Nets w Soft k-Means (Ren et al., 2018)       50.41 ± 0.31%   69.88 ± 0.20%   -
SNAIL (Mishra et al., 2018)                         55.71 ± 0.99%   68.80 ± 0.92%   -
CAML (Jiang et al., 2019)                           59.23 ± 0.99%   72.35 ± 0.71%   -
LEO (Rusu et al., 2019)                             61.76 ± 0.08%   77.59 ± 0.12%   -
ProtoNets++ -MBR                                    56.99 ± 1.33%   72.63 ± 0.72%   76.70 ± 0.53%
ProtoNets++ -MMD                                    57.23 ± 0.76%   73.85 ± 0.63%   77.21 ± 0.31%
ProtoNets++ -CMC                                    57.63 ± 0.71%   66.23 ± 0.45%   68.59 ± 0.38%
TADAM -CBNlabel                                     57.17 ± 0.32%   77.35 ± 0.30%   80.46 ± 0.44%
ProtoNets++                                         56.52 ± 0.45%   74.28 ± 0.20%   78.31 ± 0.44%
AM3-ProtoNets++                                     65.21 ± 0.30%   75.20 ± 0.27%   78.52 ± 0.28%
TADAM (Oreshkin et al., 2018)                       58.56 ± 0.39%   76.65 ± 0.38%   80.83 ± 0.37%
AM3-TADAM                                           65.30 ± 0.49%   78.10 ± 0.36%   81.57 ± 0.47%

Table 1: Few-shot classification accuracy on the test split of miniImageNet. Results at the top use only visual features. Cross-modal baselines are shown in the middle and our results (and their backbones) in the bottom part.

ProtoNets++-MBR. Our first baseline is a natural cross-modal extension of prototypical networks, borrowing ideas from the zero-shot learning (ZSL) literature (Frome et al., 2013; Socher et al., 2013): we force the visual embedding space to keep a structure similar to that of the semantic space. This is achieved by adding a metric-based regularization (MBR) term to the loss of the original prototypical network (Equation 1):

$$\mathcal{L}_{MBR}(\theta) = \mathcal{L}(\theta) + \gamma \sum_{c} d(p_c, e_c), \qquad (7)$$

where $\gamma$ is a regularization coefficient. In our preliminary experiments, we also tried this regularization with the transformed semantic space (replacing $e_c$ with $w_c$ in Equation 7), which resulted in worse performance.


ProtoNets++-MMD. This baseline relies on a maximum mean discrepancy (MMD) regularizer (Gretton et al., 2007) instead of a metric-based one. This approach forces the visual and textual feature distributions to match. As shown in (Hubert Tsai et al., 2017), this regularization seems to be more effective (at least on some tasks) than the metric-based one.


ProtoNets++-CMC. Here, we consider a constant mixture coefficient (CMC) to disentangle the effectiveness of the adaptive component of the proposed mechanism. We set $\lambda_c$ to a fixed constant for all categories $c$.


TADAM-CBNlabel. Some few-shot classification methods (Oreshkin et al., 2018; Jiang et al., 2019) learn a metric space that is conditioned on each task, using visual features as auxiliary meta-information and conditional batch normalization (CBN) (Dumoulin et al., 2018). Inspired by these approaches (and by the first use of CBN), our last baseline is a version of TADAM (Oreshkin et al., 2018) with its CBN conditioned on GloVe embeddings instead of its original task encoding.

4.1.3 Implementation Details

We model the visual feature extractor $f$ with a ResNet-12 (He et al., 2016), which has been shown to be very effective for few-shot classification (Oreshkin et al., 2018). This network produces embeddings of dimension 512. We use this backbone in the baselines mentioned above and in the AM3 implementations. We call ProtoNets++ the prototypical network (Snell et al., 2017) implementation with this more powerful backbone.

The semantic transformation $g$ is a neural network with one hidden layer of 300 units that outputs a 512-dimensional representation. The transformation $h$ of the mixture mechanism also contains one hidden layer with 300 units and outputs a single scalar, $\lambda_c$. In both the $g$ and $h$ networks, we use ReLU non-linearities (Glorot et al., 2011) and dropout (Srivastava et al., 2014) (we set the dropout coefficient to 0.7 on miniImageNet and 0.9 on tieredImageNet).

The model is trained with stochastic gradient descent with momentum (Sutskever et al., 2013). We use an initial learning rate of 0.1 and a fixed momentum coefficient of 0.9. On miniImageNet, we train every model for 30,000 iterations and anneal the learning rate by a factor of ten at iterations 15,000, 17,500 and 19,000. On tieredImageNet, models are trained for 80,000 iterations and the learning rate is reduced by a factor of ten at iterations 40,000, 50,000 and 60,000.
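The step schedule described above can be expressed as a small helper (a sketch of the miniImageNet schedule; the function name and signature are ours, not part of the released code):

```python
def learning_rate(iteration, base_lr=0.1, milestones=(15_000, 17_500, 19_000),
                  factor=0.1):
    """Anneal the learning rate by `factor` at each milestone iteration."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= factor
    return lr
```

For the tieredImageNet runs, the same helper would be called with `milestones=(40_000, 50_000, 60_000)`.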

Model                                               Test Accuracy
                                                    1-shot          5-shot
MAML (Finn et al., 2017)                            51.67 ± 1.81%   70.30 ± 0.08%
Proto. Nets with Soft k-Means (Ren et al., 2018)    53.31 ± 0.89%   72.69 ± 0.74%
Relation Net (Sung et al., 2018)                    54.48 ± 0.93%   71.32 ± 0.78%
Transductive Prop. Nets (Liu et al., 2019)          54.48 ± 0.93%   71.32 ± 0.78%
LEO (Rusu et al., 2019)                             66.33 ± 0.05%   81.44 ± 0.09%
ProtoNets++ -MBR                                    61.78 ± 0.43%   77.17 ± 0.81%
ProtoNets++ -MMD                                    62.77 ± 0.31%   77.27 ± 0.42%
ProtoNets++ -CMC                                    61.52 ± 0.93%   68.23 ± 0.23%
TADAM -CBNlabel                                     62.74 ± 0.63%   81.94 ± 0.55%
ProtoNets++                                         58.47 ± 0.64%   78.41 ± 0.41%
AM3-ProtoNets++                                     67.23 ± 0.34%   78.95 ± 0.22%
TADAM (Oreshkin et al., 2018)                       62.13 ± 0.31%   81.92 ± 0.30%
AM3-TADAM                                           69.08 ± 0.47%   82.58 ± 0.31%

Table 2: Few-shot classification accuracy on the test split of tieredImageNet. Results at the top use only visual features. Cross-modal baselines are shown in the middle and our results (and their backbones) in the bottom part. Deeper net, evaluated in (Liu et al., 2019).

The training procedure composes a few-shot training batch from several tasks, where a task is a fixed selection of 5 classes. We found empirically that the best number of tasks per batch is 5, 2 and 1 for 1-shot, 5-shot and 10-shot, respectively. The number of queries per batch is 24 for 1-shot, 32 for 5-shot and 64 for 10-shot. All our experiments are evaluated following the standard approach of few-shot classification: we randomly sample 1,000 tasks from the test set, each having 100 random query samples, and average the performance of the model over them.

All hyperparameters were chosen based on accuracy on the validation set. All our results are reported as an average over five independent runs (with a fixed architecture and different random seeds) and with confidence intervals. Source code for reproducing the methods in this paper will be released.

4.2 Comparison to Other Methods

Table 1 and Table 2 show classification accuracy on miniImageNet and on tieredImageNet, respectively. In the top part of each table, we show recent methods exploiting only visual features. We show our cross-modality baselines described in Section 4.1.2 in the middle part and at the bottom we show results of our method, AM3, with two different backbone architectures: ProtoNets++ and TADAM.

We draw multiple conclusions from these experiments. First, our approach outperforms its backbone methods by a large margin in all cases tested. This indicates that language can be effectively leveraged to boost classification performance with a low number of shots.

Second, AM3 (with the TADAM backbone) achieves results superior to the current (single-modality) state of the art (Rusu et al., 2019). The margin in performance is particularly remarkable in the 1-shot scenario. Although our approach exploits semantic embeddings, we note that they were learned from unlabeled text corpora.

Finally, we show that none of the cross-modal baselines described in Section 4.1.2 outperforms the current uni-modal state of the art. This indicates that exploiting semantic information in few-shot learning is not a trivial task. ProtoNets++-MBR and ProtoNets++-MMD (extensions of zero-shot learning methods to metric-based few-shot learning) do not help in this situation. We argue the reason might be that they force the two modalities to have the same structure, which can cause information loss. By comparing the performance of ProtoNets++ and ProtoNets++-CMC, we conclude that an adaptive mixture mechanism is important to leverage semantic features.

In summary, our method boosts the performance of metric-based algorithms and beats state-of-the-art methods for few-shot learning. This is achieved by adaptively exploiting visual and semantic information, while other cross-modal baselines fail to do so. In both backbone architectures tested, ProtoNets++ and TADAM, our model is able to achieve much better performance when compared to the base methods, particularly when the number of shots is reduced.

4.3 Ablation Studies

4.3.1 Number of Shots

Figure 4(a-b) shows the accuracy of our model compared to the two backbones tested (ProtoNets++ and TADAM) on miniImageNet for 1-10 shot scenarios. It is clear from the plots that the gap between AM3 and the corresponding backbone shrinks as the number of shots increases. Figure 4(c-d) shows the mean and standard deviation (over the whole validation set) of the mixing coefficient $\lambda$ for different shots and backbones. We observe that $\lambda$ keeps increasing as the number of shots increases. This means that AM3 weighs semantic information more (and visual information less) as the number of shots (hence the number of visual data points) decreases.

Figure 4: (a-b) Comparison of AM3 and its corresponding backbone (ProtoNets++ and TADAM) for different numbers of shots. (c-d) Average value of λ (over the whole validation set) for different numbers of shots, for both backbones.

These trends corroborate our intuition that semantic representations become more useful as the number of support images decreases, since the visual support information is reduced in such cases. They also indicate that AM3 can automatically learn the importance of both information sources in different scenarios.

4.3.2 Adaptive Mechanism

We also perform an ablation study to see how the adaptive mechanism behaves with respect to different features. Table 3 shows the results, for both backbones, of our method with three alternative inputs for the adaptive mixing network: (i) the raw GloVe embedding, (ii) the visual representation, and (iii) a concatenation of the query feature and the transformed semantic embedding.

We observe that conditioning on the transformed GloVe features performs better than conditioning on the raw features. Also, conditioning on semantic features performs better than conditioning on visual ones, suggesting that the semantic space has a more appropriate structure for the adaptive mechanism than the visual one. Finally, we note that conditioning on both the query and the semantic embedding helps with the ProtoNets++ backbone but not with TADAM.

Conditioning input            ProtoNets++           TADAM
                              1-shot    5-shot      1-shot    5-shot
Raw GloVe embedding           61.23     74.77       57.47     72.27
Visual representation         64.48     74.80       64.93     77.60
Query + semantic embedding    66.12     75.83       53.23     56.70
Transformed semantic (AM3)    65.21     75.20       65.30     78.10

Table 3: Performance of our method when the adaptive mixing network is conditioned on different features. The last row is the original model.
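The four variants in Table 3 differ only in which feature is fed to the adaptive mixing network. A minimal sketch of the selection logic (function and variant names are hypothetical, not from the paper):

```python
import numpy as np

def conditioning_input(variant, raw_glove, visual_proto, transformed_sem, query):
    """Select the feature fed to the adaptive mixing network for each
    Table 3 variant (all arguments are 1-D feature vectors)."""
    if variant == "raw_glove":        # untransformed GloVe label embedding
        return raw_glove
    if variant == "visual":           # visual prototype
        return visual_proto
    if variant == "query_semantic":   # query feature concatenated with semantic feature
        return np.concatenate([query, transformed_sem])
    if variant == "am3":              # transformed semantic feature (original model)
        return transformed_sem
    raise ValueError(f"unknown variant: {variant}")
```

Only the conditioning input changes between rows; the rest of the pipeline (mixing coefficient, convex combination, prototype classifier) stays fixed, which isolates the effect of the conditioning space.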

5 Conclusion

In this paper, we propose a method that can efficiently and effectively leverage cross-modal information for few-shot classification. Our method, AM3, adaptively combines visual and semantic features conditioned on the instance categories. AM3 boosts the performance of metric-based approaches by a large margin on different datasets and settings. Moreover, by leveraging unsupervised textual data, AM3 outperforms the current unimodal state of the art on few-shot classification by a large margin. We also show that the semantic features are particularly helpful in the very low (visual) data regime (e.g., one-shot).


The authors thank Konrad Zolna, Dmitriy Serdyuk and Hugo Larochelle for helpful discussions and Thomas Boquet and Jean Raby for help with computational infrastructure.


  • Bart & Ullman (2005) Bart, E. and Ullman, S. Cross-generalization: learning novel classes from a single example by feature replacement. In CVPR, 2005.
  • Bauer et al. (2017) Bauer, M., Rojas-Carulla, M., Świątkowski, J. B., Schölkopf, B., and Turner, R. E. Discriminative k-shot learning using probabilistic models. In NIPS Bayesian Deep Learning, 2017.
  • Bengio et al. (1992) Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. On the optimization of a synaptic learning rule. In Conference on Optimality in Biological and Artificial Networks, 1992.
  • Bridle (1990) Bridle, J. Probabilistic interpretation of feedforward classification network outputs with relationships to statistical pattern recognition. Neurocomputing: Algorithms, Architectures and Applications, 1990.
  • Dumoulin et al. (2018) Dumoulin, V., Perez, E., Schucher, N., Strub, F., Vries, H. d., Courville, A., and Bengio, Y. Feature-wise transformations. Distill, 2018.
  • Fink (2005) Fink, M. Object classification from a single example utilizing class relevance metrics. In NIPS, 2005.
  • Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • Finn et al. (2018) Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning. In NeurIPS, 2018.
  • Frome et al. (2013) Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al. Devise: A deep visual-semantic embedding model. In NIPS, 2013.
  • Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In AISTATS, 2011.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, 2014.
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT Press, 2016.
  • Gretton et al. (2007) Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample-problem. In NIPS, 2007.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
  • Hochreiter et al. (2001) Hochreiter, S., Younger, A. S., and Conwell, P. R. Learning to learn using gradient descent. In ICANN, 2001.
  • Hubert Tsai et al. (2017) Hubert Tsai, Y.-H., Huang, L.-K., and Salakhutdinov, R. Learning robust visual-semantic embeddings. In CVPR, 2017.
  • Jackendoff (1987) Jackendoff, R. On beyond zebra: the relation of linguistic and visual information. Cognition, 1987.
  • Jiang et al. (2019) Jiang, X., Havaei, M., Varno, F., Chartrand, G., Chapados, N., and Matwin, S. Learning to learn with conditional class dependencies. In ICLR, 2019.
  • Joulin et al. (2016) Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., and Mikolov, T. Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
  • Kim et al. (2018) Kim, T., Yoon, J., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In NeurIPS, 2018.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In ICLR, 2014.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • Lacoste et al. (2017) Lacoste, A., Boquet, T., Rostamzadeh, N., Oreshki, B., Chung, W., and Krueger, D. Deep prior. NIPS workshop, 2017.
  • Lake et al. (2011) Lake, B., Salakhutdinov, R., Gross, J., and Tenenbaum, J. One shot learning of simple visual concepts. In Annual Meeting of the Cognitive Science Society, 2011.
  • Landau et al. (1988) Landau, B., Smith, L. B., and Jones, S. S. The importance of shape in early lexical learning. Cognitive development, 1988.
  • Larochelle et al. (2008) Larochelle, H., Erhan, D., and Bengio, Y. Zero-data learning of new tasks. In AAAI, 2008.
  • Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 2015.
  • Li et al. (2006) Li, F.-F., Fergus, R., and Perona, P. One-shot learning of object categories. PAMI, 2006.
  • Li et al. (2017) Li, Z., Zhou, F., Chen, F., and Li, H. Meta-sgd: Learning to learn quickly for few shot learning. In arXiv, 2017.
  • Liu et al. (2019) Liu, Y., Lee, J., Park, M., Kim, S., and Yang, Y. Transductive propagation network for few-shot learning. ICLR, 2019.
  • Markman (1991) Markman, E. M. Categorization and naming in children: Problems of induction. MIT Press, 1991.
  • Mishra et al. (2017) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. Meta-learning with temporal convolutions. In ICLR, 2017.
  • Mishra et al. (2018) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In ICLR, 2018.
  • Nichol et al. (2018) Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv, 2018.
  • Oreshkin et al. (2018) Oreshkin, B. N., Lacoste, A., and Rodriguez, P. Tadam: Task dependent adaptive metric for improved few-shot learning. In NeurIPS, 2018.
  • Palatucci et al. (2009) Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. Zero-shot learning with semantic output codes. In NIPS, 2009.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In EMNLP, 2014.
  • Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In ICLR, 2017.
  • Ren et al. (2018) Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., and Zemel, R. S. Meta-learning for semi-supervised few-shot classification. In ICLR, 2018.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • Rusu et al. (2019) Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. In ICLR, 2019.
  • Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In ICML, 2016.
  • Schmidhuber (1987) Schmidhuber, J. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-meta…-hook. Diploma thesis, Technische Universität München, Germany, 1987.
  • Schönfeld et al. (2018) Schönfeld, E., Ebrahimi, S., Sinha, S., Darrell, T., and Akata, Z. Generalized zero- and few-shot learning via aligned variational autoencoders. arXiv, 2018.
  • Smith & Gasser (2005) Smith, L. and Gasser, M. The development of embodied cognition: Six lessons from babies. Artificial life, 2005.
  • Smith & Slone (2017) Smith, L. B. and Slone, L. K. A developmental approach to machine learning? Frontiers in psychology, 2017.
  • Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In NIPS, 2017.
  • Socher et al. (2013) Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
  • Sung et al. (2018) Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
  • Sutskever et al. (2013) Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In ICML, 2013.
  • Thrun (1998) Thrun, S. Lifelong learning algorithms. Kluwer Academic Publishers, 1998.
  • Tokmakov et al. (2018) Tokmakov, P., Wang, Y.-X., and Hebert, M. Learning compositional representations for few-shot recognition. arXiv preprint arXiv:1812.09213, 2018.
  • Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In NIPS, 2016.
  • Wang et al. (2018) Wang, Y.-X., Girshick, R. B., Hebert, M., and Hariharan, B. Low-shot learning from imaginary data. In CVPR, 2018.
  • Weston et al. (2011) Weston, J., Bengio, S., and Usunier, N. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
  • Xian et al. (2018a) Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. PAMI, 2018a.
  • Xian et al. (2018b) Xian, Y., Lorenz, T., Schiele, B., and Akata, Z. Feature generating networks for zero-shot learning. In CVPR, 2018b.