Variational Metric Scaling for Metric-Based Meta-Learning

12/26/2019 ∙ by Jiaxin Chen, et al. ∙ Hong Kong Polytechnic University

Metric-based meta-learning has attracted a lot of attention due to its effectiveness and efficiency in few-shot learning. Recent studies show that metric scaling plays a crucial role in the performance of metric-based meta-learning algorithms. However, a principled method for learning the metric scaling parameter automatically is still lacking. In this paper, we recast metric-based meta-learning from a Bayesian perspective and develop a variational metric scaling framework for learning a proper metric scaling parameter. First, we propose a stochastic variational method to learn a single global scaling parameter. To better fit the embedding space to a given data distribution, we extend our method to learn a dimensional scaling vector that transforms the embedding space. Furthermore, to learn task-specific embeddings, we generate task-dependent dimensional scaling vectors with amortized variational inference. Our method is end-to-end, requires no pre-training, and can be used as a simple plug-and-play module for existing metric-based meta-algorithms. Experiments on miniImageNet show that our methods consistently improve the performance of existing metric-based meta-algorithms, including prototypical networks and TADAM.


1. Introduction

Few-shot learning [13] aims to assign unseen samples (queries) to the categories they belong to, given very few labeled samples (supports) in each category. A promising paradigm for few-shot learning is meta-learning, which learns general patterns from a large number of tasks for fast adaptation to unseen tasks. Recently, metric-based meta-learning algorithms [6, 11, 23, 25] have demonstrated great potential in few-shot classification. Typically, they learn a general mapping which projects queries and supports into an embedding space. These models are trained in an episodic manner [25] by minimizing the distances between a query and same-labeled supports in the embedding space. Given a new task in the testing phase, a nearest neighbour classifier is applied to assign a query to its nearest class in the embedding space.

Many metric-based meta-algorithms (short for meta-learning algorithms) employ a softmax classifier with cross-entropy loss, where the logits are the distances between a query and supports in the embedding (metric) space. However, it has been shown that the scale of the logits, i.e., the metric scaling parameter, is critical to the performance of the learned model. Snell et al. [23] found that Euclidean distance significantly outperforms cosine similarity in few-shot classification, while Oreshkin et al. [19] and Wang et al. (2018) pointed out that there is no clear difference between them if the logits are scaled properly. They supposed that there exists an optimal metric scaling parameter which is data and architecture related, but they only used cross validation to manually set the parameter, which requires pre-training and cannot find an ideal solution.

In this paper, we aim to design an end-to-end method that can automatically learn an accurate metric scaling parameter. Given a set of training tasks, to learn a data-dependent metric scaling parameter that can generalize well to a new task, Bayesian posterior inference over learnable parameters is a theoretically attractive framework [7, 21]. We propose to recast metric-based meta-algorithms from a Bayesian perspective and take the metric scaling parameter as a global parameter. As exact posterior inference is intractable, we introduce a variational approach to efficiently approximate the posterior distribution with stochastic variational inference.

While a proper metric scaling parameter can improve classification accuracy via adjusting the cross-entropy loss, it simply rescales the embedding space but does not change the relative locations of the embedded samples. To transform the embedding space to better fit the data distribution, we propose a dimensional variational scaling method to learn a scaling parameter for each dimension, i.e., a metric scaling vector. Further, in order to learn task-dependent embeddings [19], we propose an amortized variational approach to generate task-dependent metric scaling vectors, accompanied by an auxiliary training strategy to avoid time-consuming pre-training or co-training.

Our metric scaling methods can be used as pluggable modules for metric-based meta-algorithms. For example, they can be incorporated into prototypical networks (PN) [23] and all PN-based algorithms to improve their performance. To verify this, we conduct a progressive series of experiments on the miniImageNet benchmark for few-shot classification. First, we show that the proposed stochastic variational approach consistently improves on PN, and the improvement is large for PN with cosine similarity. Second, we show that the dimensional variational scaling method further improves upon the method with a single scaling parameter, and the task-dependent metric scaling method with amortized variational inference achieves the best performance. We also incorporate the dimensional metric scaling method into TADAM [19] in conjunction with other tricks proposed by the authors and observe notable improvement. Remarkably, after incorporating our method, TADAM achieves highly competitive performance compared with state-of-the-art methods.

To sum up, our contributions are as follows:

  • We propose a generic variational approach to automatically learn a proper metric scaling parameter for metric-based meta-algorithms.

  • We extend the proposed approach to learn dimensional and task-dependent metric scaling vectors to find a better embedding space by fitting the dataset at hand.

  • As a pluggable module, our method can be efficiently used to improve existing metric-based meta-algorithms.

2. Related Work

Metric-based meta-learning.

Koch et al. [11] proposed the first metric-based meta-algorithm for few-shot learning, in which a siamese network [2] is trained with the triplet loss to compare the similarity between a query and supports in the embedding space. Matching networks [25] proposed the episodic training strategy and used the cross-entropy loss where the logits are the distances between a query and supports. Prototypical networks [23] improved Matching networks by computing the distances between a query and the prototype (mean of supports) of each class. Many metric-based meta-algorithms [19, 5, 24, 14] extended prototypical networks in different ways.

Some recent methods proposed to improve prototypical networks by extracting task-conditioning features. Oreshkin et al. [19] trained a network to generate task-conditioning parameters for batch normalization. Li et al. [14] extracted task-relevant features with a category traversal module. Our methods can be incorporated into these methods to improve their performance.

In addition, there are some works related to our proposed dimensional scaling methods. Kang et al. [10] trained a meta-model to re-weight features obtained from the base feature extractor and applied it to few-shot object detection. Lai et al. [12] proposed a generator to generate task-adaptive weights to re-weight the embeddings, which can be seen as a special case of our amortized variational scaling method.

Metric scaling.

Cross-entropy loss is widely used in many machine learning problems, including metric-based meta-learning and metric learning [1, 20, 15, 27, 28, 26]. In metric learning, the influence of metric scaling on the cross-entropy loss was first studied by Wang et al. [27] and Ranjan et al. [20]. They treated the metric scaling parameter either as a trainable parameter updated with the model parameters or as a fixed hyperparameter. Zhang et al. (2018) proposed a “heating-up” scaling strategy, where the metric scaling parameter decays manually during the training process. The scaling of logits in cross-entropy loss for model compression was also studied by Hinton et al. (2015), where it is called temperature scaling. The temperature scaling parameter has also been used in confidence calibration [9].

The effect of metric scaling for few-shot learning was first discussed by Snell et al. [23] and Oreshkin et al. [19]. The former found that Euclidean distance outperforms cosine similarity significantly in prototypical networks, and the latter argued that the superiority of Euclidean distance could be offset by imposing a proper metric scaling parameter on cosine similarity and using cross validation to select the parameter.

3. Preliminaries

3.1. Notations and Problem Statement

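Let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ be a domain where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space. Assume we observe a meta-sample $\mathbf{S} = \{D_i = D_i^{tr} \cup D_i^{ts}\}_{i=1}^{n}$ including $n$ training tasks, where the $i$-th task consists of a support set of size $m$, $D_i^{tr} = \{z_{i,j}\}_{j=1}^{m}$, and a query set of size $q$, $D_i^{ts} = \{z_{i,j}\}_{j=m+1}^{m+q}$. Each training data point $z_{i,j} = (x_{i,j}, y_{i,j})$ belongs to the domain $\mathcal{Z}$. Denote by $\theta$ the model parameters and by $\alpha$ the metric scaling parameter. Given a new task and a support set $D^{tr}$ sampled from the task, the goal is to predict the label $y$ of a query $x$.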

3.2. Prototypical Networks

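Prototypical networks (PN) [23] is a popular and highly effective metric-based meta-algorithm. PN learns a mapping $\phi_\theta$ which projects queries and supports to an $M$-dimensional embedding space. For each class $k \in \{1, \dots, K\}$, the mean vector of the supports of class $k$ in the embedding space is computed as the class prototype $c_k$. The embedded query is compared with the prototypes and assigned to the class of the nearest prototype. Given a similarity metric $d: \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}^+$, the probability of a query $x_{i,j}$ of the $i$-th task, with support set $D_i^{tr}$, belonging to class $k$ is

$$p_\theta(y_{i,j} = k \mid x_{i,j}, D_i^{tr}) = \frac{\exp\big(-d(\phi_\theta(x_{i,j}), c_k)\big)}{\sum_{k'=1}^{K} \exp\big(-d(\phi_\theta(x_{i,j}), c_{k'})\big)}. \tag{1}$$

Training proceeds by minimizing the cross-entropy loss, i.e., the negative log-probability of the true class $y_{i,j}$ of each query. After introducing the metric scaling parameter $\alpha$, the classification loss of the $i$-th task, summed over its queries $j = m+1, \dots, m+q$, becomes

$$\mathcal{L}(\theta; D_i) = -\sum_{j=m+1}^{m+q} \log \frac{\exp\big(-\alpha\, d(\phi_\theta(x_{i,j}), c_{y_{i,j}})\big)}{\sum_{k'=1}^{K} \exp\big(-\alpha\, d(\phi_\theta(x_{i,j}), c_{k'})\big)}. \tag{2}$$

The metric scaling parameter $\alpha$ has been found to affect the performance of PN significantly.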
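As a minimal sketch of the classifier described in this section, the following pure-Python example computes class prototypes as support means and applies a softmax over negative scaled distances. The function names, the toy two-dimensional embeddings, and the choice of squared Euclidean distance are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math

def squared_euclidean(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def prototypes(support_embeddings, support_labels):
    """Class prototype = mean of the embedded supports of that class."""
    sums, counts = {}, {}
    for z, y in zip(support_embeddings, support_labels):
        if y not in sums:
            sums[y] = [0.0] * len(z)
            counts[y] = 0
        sums[y] = [s + a for s, a in zip(sums[y], z)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def class_probabilities(query, protos, alpha=1.0, dist=squared_euclidean):
    """Softmax over the logits -alpha * d(query, prototype)."""
    logits = {k: -alpha * dist(query, c) for k, c in protos.items()}
    shift = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - shift) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def scaled_loss(queries, labels, protos, alpha=1.0):
    """Cross-entropy loss of one task under metric scaling alpha."""
    return -sum(math.log(class_probabilities(q, protos, alpha)[y])
                for q, y in zip(queries, labels))

# Toy task: two classes with two supports each (already embedded).
support = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
support_y = [0, 0, 1, 1]
protos = prototypes(support, support_y)

query = [0.3, 0.2]
p1 = class_probabilities(query, protos, alpha=1.0)
p10 = class_probabilities(query, protos, alpha=10.0)
```

In this toy task the predicted class is the same for both values of `alpha`, but the larger scale yields a much sharper distribution and therefore a different loss and gradient signal, which is the effect the scaling parameter controls.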

4. Variational Metric Scaling

4.1. Stochastic Variational Scaling

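In the following, we recast metric-based meta-learning from a Bayesian perspective. The predictive distribution can be parameterized as

$$p_\theta(y \mid x, D^{tr}, \mathbf{S}) = \int p_\theta(y \mid x, D^{tr}, \alpha)\, p_\theta(\alpha \mid \mathbf{S})\, d\alpha. \tag{3}$$

The conditional distribution $p_\theta(y \mid x, D^{tr}, \alpha)$ is the discriminative classifier parameterized by $\theta$. Since the posterior distribution $p_\theta(\alpha \mid \mathbf{S})$ is intractable, we propose a variational distribution $q_\psi(\alpha)$ parameterized by parameters $\psi$ to approximate $p_\theta(\alpha \mid \mathbf{S})$. By minimizing the KL divergence between the approximator $q_\psi(\alpha)$ and the real posterior distribution $p_\theta(\alpha \mid \mathbf{S})$, we obtain the objective function

$$\mathcal{L}(\psi, \theta; \mathbf{S}) = -\sum_{i=1}^{n} \sum_{j=m+1}^{m+q} \int q_\psi(\alpha) \log p_\theta(y_{i,j} \mid x_{i,j}, D_i^{tr}, \alpha)\, d\alpha + \mathrm{KL}\big(q_\psi(\alpha) \,\|\, p(\alpha)\big). \tag{4}$$

We want to optimize $\mathcal{L}(\psi, \theta; \mathbf{S})$ w.r.t. both the model parameters $\theta$ and the variational parameters $\psi$. The gradient and the optimization procedure of the model parameters are similar to the original metric-based meta-algorithms [25, 23] as shown in Algorithm 1.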

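To derive the gradients of the variational parameters, we leverage the re-parameterization trick proposed by Kingma and Welling (2013), which yields a practical estimator of the variational lower bound and its derivatives w.r.t. the variational parameters. In this paper, we use this trick to estimate the derivatives of $\mathcal{L}(\psi, \theta; \mathbf{S})$ w.r.t. $\psi$. For a distribution $q_\psi(\alpha)$, we can re-parameterize $\alpha \sim q_\psi(\alpha)$ using a differentiable transformation $\alpha = g_\psi(\epsilon)$, if it exists, of an auxiliary random variable $\epsilon \sim p(\epsilon)$. For example, given a Gaussian distribution $q_{\mu,\sigma}(\alpha) = \mathcal{N}(\mu, \sigma^2)$, the re-parameterization is $g_{\mu,\sigma}(\epsilon) = \mu + \sigma\epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. Hence, the first term in (4) is formulated as $-\sum_{i}\sum_{j} \mathbb{E}_{\epsilon \sim p(\epsilon)}\big[\log p_\theta(y_{i,j} \mid x_{i,j}, D_i^{tr}, g_\psi(\epsilon))\big]$.

We apply a Monte Carlo integration with a single sample $\alpha_i = g_\psi(\epsilon_i)$ for each task to get an unbiased estimator. Note that $\alpha_i$ is sampled for the task $D_i$ rather than for each instance, i.e., all data points $\{z_{i,j}\}_{j=1}^{m+q}$ of the task share the same $\alpha_i$. The second term in (4) can be computed with a given prior distribution $p(\alpha)$. Then, the final objective function is

$$\mathcal{L}(\psi, \theta; \mathbf{S}) = -\sum_{i=1}^{n} \sum_{j=m+1}^{m+q} \log p_\theta(y_{i,j} \mid x_{i,j}, D_i^{tr}, g_\psi(\epsilon_i)) + \mathrm{KL}\big(q_\psi(\alpha) \,\|\, p(\alpha)\big). \tag{5}$$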

Estimation of gradients.

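The objective function (5) is a general form. Here, we consider $q_\psi(\alpha)$ as a Gaussian distribution $q_{\mu,\sigma}(\alpha) = \mathcal{N}(\mu, \sigma^2)$. The prior distribution is also a Gaussian distribution $p(\alpha) = \mathcal{N}(\mu_0, \sigma_0^2)$. Because the KL divergence between two Gaussian distributions has a closed form, we obtain the following objective function

$$\mathcal{L}(\mu, \sigma, \theta; \mathbf{S}) = \sum_{i=1}^{n} \sum_{j=m+1}^{m+q} \ell_{i,j}(\alpha_i) + \log\frac{\sigma_0}{\sigma} + \frac{\sigma^2 + (\mu - \mu_0)^2}{2\sigma_0^2} - \frac{1}{2}, \tag{6}$$

where $\ell_{i,j}(\alpha_i) = -\log p_\theta(y_{i,j} \mid x_{i,j}, D_i^{tr}, \alpha_i)$ and $\alpha_i = \mu + \sigma\epsilon_i$. The derivatives of $\mathcal{L}$ w.r.t. $\mu$ and $\sigma$ respectively are

$$\frac{\partial \mathcal{L}}{\partial \mu} = \sum_{i=1}^{n} \sum_{j=m+1}^{m+q} \frac{\partial \ell_{i,j}}{\partial \alpha_i} + \frac{\mu - \mu_0}{\sigma_0^2}, \tag{7}$$

$$\frac{\partial \mathcal{L}}{\partial \sigma} = \sum_{i=1}^{n} \sum_{j=m+1}^{m+q} \frac{\partial \ell_{i,j}}{\partial \alpha_i}\, \epsilon_i - \frac{1}{\sigma} + \frac{\sigma}{\sigma_0^2}, \tag{8}$$

where, for the softmax classifier with logits $-\alpha_i\, d(\phi_\theta(x_{i,j}), c_k)$, $\partial \ell_{i,j} / \partial \alpha_i = d(\phi_\theta(x_{i,j}), c_{y_{i,j}}) - \sum_{k} p_\theta(y_{i,j} = k \mid x_{i,j}, D_i^{tr}, \alpha_i)\, d(\phi_\theta(x_{i,j}), c_k)$.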
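The gradient computation can be checked numerically. The sketch below assumes a single task, a Gaussian variational distribution $\mathcal{N}(\mu, \sigma^2)$, a Gaussian prior $\mathcal{N}(\mu_0, \sigma_0^2)$, and a softmax classifier whose logits are $-\alpha$ times precomputed query-to-prototype distances. It evaluates the single-sample objective with $\alpha = \mu + \sigma\epsilon$, computes its derivatives w.r.t. $\mu$ and $\sigma$ by the chain rule, and compares them with central finite differences. All names and the toy distance values are hypothetical.

```python
import math

def task_nll(alpha, dists, labels):
    """Cross-entropy of one task; dists[j][k] is d(query j, prototype k)."""
    total = 0.0
    for d, y in zip(dists, labels):
        logits = [-alpha * dk for dk in d]
        shift = max(logits)
        log_z = shift + math.log(sum(math.exp(v - shift) for v in logits))
        total += -(logits[y] - log_z)
    return total

def task_nll_grad_alpha(alpha, dists, labels):
    """d(nll)/d(alpha) = sum_j [ d_{j,y_j} - sum_k p_{j,k} d_{j,k} ]."""
    grad = 0.0
    for d, y in zip(dists, labels):
        logits = [-alpha * dk for dk in d]
        shift = max(logits)
        weights = [math.exp(v - shift) for v in logits]
        z = sum(weights)
        expected_dist = sum(w / z * dk for w, dk in zip(weights, d))
        grad += d[y] - expected_dist
    return grad

def gaussian_kl(mu, sigma, mu0, sigma0):
    """Closed-form KL( N(mu, sigma^2) || N(mu0, sigma0^2) )."""
    return (math.log(sigma0 / sigma)
            + (sigma ** 2 + (mu - mu0) ** 2) / (2 * sigma0 ** 2) - 0.5)

def objective(mu, sigma, eps, dists, labels, mu0, sigma0):
    """Single-sample estimate: nll at alpha = mu + sigma * eps, plus KL."""
    return task_nll(mu + sigma * eps, dists, labels) + gaussian_kl(mu, sigma, mu0, sigma0)

def gradients(mu, sigma, eps, dists, labels, mu0, sigma0):
    """Analytic derivatives of the objective w.r.t. mu and sigma."""
    g_alpha = task_nll_grad_alpha(mu + sigma * eps, dists, labels)
    d_mu = g_alpha + (mu - mu0) / sigma0 ** 2
    d_sigma = g_alpha * eps - 1.0 / sigma + sigma / sigma0 ** 2
    return d_mu, d_sigma

def finite_difference(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Toy task: two queries, three prototypes, fixed noise sample eps.
dists = [[0.1, 0.9, 0.5], [0.8, 0.2, 0.6]]
labels = [0, 1]
mu, sigma, eps, mu0, sigma0 = 2.0, 0.5, 0.7, 1.0, 1.0

d_mu, d_sigma = gradients(mu, sigma, eps, dists, labels, mu0, sigma0)
fd_mu = finite_difference(lambda m: objective(m, sigma, eps, dists, labels, mu0, sigma0), mu)
fd_sigma = finite_difference(lambda s: objective(mu, s, eps, dists, labels, mu0, sigma0), sigma)
```

Because the per-query term depends on the embeddings only through the distances, the gradient w.r.t. $\alpha$ reuses quantities that are already available from the forward pass of the classifier.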

In particular, we apply the proposed variational metric scaling method to Prototypical Networks with feature extractor $\phi_\theta$. The details of the gradients and the iterative update procedure are shown in Algorithm 1. It can be seen that the gradients of the variational parameters are computed using the intermediate quantities in the computational graph of the model parameters during back-propagation, hence the computational cost is very low.

For meta-testing, we use $\mu$ (the mean of the variational distribution) as the metric scaling parameter for inference.

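Input: Meta-sample $\mathbf{S} = \{D_i\}_{i=1}^{n}$, learning rates $l_\theta$, $l_\mu$ and $l_\sigma$.
Initialize: $\theta$ and $\psi = (\mu, \sigma)$ randomly.
for $i$ in $\{1, \dots, n\}$ do
    // Sample $\alpha_i$ for the task.
    $\epsilon_i \sim \mathcal{N}(0, 1)$, $\alpha_i = \mu + \sigma\epsilon_i$
    for $k$ in $\{1, \dots, K\}$ do
        // Compute prototypes.
        $c_k$ = mean of $\phi_\theta(x_{i,j})$ over the supports of class $k$ in $D_i^{tr}$
    for $j$ in $\{m+1, \dots, m+q\}$ do
        compute $p_\theta(y_{i,j} = k \mid x_{i,j}, D_i^{tr}, \alpha_i)$ from the logits $-\alpha_i\, d(\phi_\theta(x_{i,j}), c_k)$
    // Update the model parameters $\theta$.
    $\theta \leftarrow \theta - l_\theta \nabla_\theta \mathcal{L}$
    // Update the variational parameters $\mu$ and $\sigma$.
    $\mu \leftarrow \mu - l_\mu\, \partial\mathcal{L}/\partial\mu$, $\sigma \leftarrow \sigma - l_\sigma\, \partial\mathcal{L}/\partial\sigma$
Algorithm 1 Stochastic Variational Scaling for Prototypical Networks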

Figure 1: The middle figure shows a metric space in which the query (blue) and the support samples (red) are normalized to a unit ball. The left and right figures show the spaces scaled by a single parameter $\alpha$ and a two-dimensional vector $\boldsymbol{\alpha}$, respectively. The query keeps its class assignment in the left figure but is assigned to a different class in the right one.

The proposed variational scaling framework is general. Note that training the scaling parameter together with the model parameters [20] is a special case of our framework, when $q_\psi(\alpha)$ is defined as $\mathcal{N}(\mu, 0)$, the variance of the prior distribution is $\sigma_0^2 \to \infty$, and the learning rate $l_\sigma$ is fixed as $0$.

4.2. Dimensional Stochastic Variational Scaling

Metric scaling can be seen as a transformation of the metric (embedding) space. Multiplying the distances by the scaling parameter $\alpha$ amounts to re-scaling the embedding space. From this point of view, we generalize the single scaling parameter $\alpha$ to a dimensional scaling vector $\boldsymbol{\alpha}$ which transforms the embedding space to fit the data.

If the dimension of the embedding space is too low, the data points cannot be projected to a linearly-separable space. Conversely, if the dimension is too high, there may be many redundant dimensions. The optimal number of dimensions is data-dependent and difficult to select as a hyperparameter before training. Here, we address this problem by learning a data-dependent dimensional scaling vector to modify the embedding space, i.e., learning different weights for each dimension to highlight the important dimensions and reduce the influence of the redundant ones. Figure 1 shows a two-dimensional example. It can be seen that the single scaling parameter simply changes the scale of the embedding space, but the dimensional scaling changes the relative locations of the query and the supports.

The proposed dimensional stochastic variational scaling method is similar to Algorithm 1, with the variational parameters $\boldsymbol{\mu} = (\mu_1, \dots, \mu_M)$ and $\boldsymbol{\sigma} = (\sigma_1, \dots, \sigma_M)$. Accordingly, the metric scaling operation is changed to

$$d_{\boldsymbol{\alpha}}(z, c_k) = d(\boldsymbol{\alpha} \odot z,\ \boldsymbol{\alpha} \odot c_k), \tag{9}$$

where $\odot$ denotes the element-wise product.

The gradients of the variational parameters are still easy to compute and the additional computational cost is negligible.
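The difference between a single scaling parameter and a dimensional scaling vector can be illustrated with a small example. The sketch below assumes that dimensional scaling multiplies each embedding coordinate by its own weight before the squared Euclidean distance is computed; the prototypes, the query, and the weights are hypothetical toy values.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def scale(vec, weights):
    """Element-wise product of an embedding with a scaling vector."""
    return [w * a for w, a in zip(weights, vec)]

def nearest_prototype(query, protos, weights):
    """Class whose scaled prototype is closest to the scaled query."""
    scaled_query = scale(query, weights)
    return min(
        protos,
        key=lambda k: squared_euclidean(scaled_query, scale(protos[k], weights)),
    )

protos = {"A": [1.0, 0.0], "B": [0.4, 0.9]}
query = [0.5, 0.2]

unscaled = nearest_prototype(query, protos, [1.0, 1.0])     # no scaling
scalar = nearest_prototype(query, protos, [3.0, 3.0])       # same weight on every dimension
dimensional = nearest_prototype(query, protos, [2.0, 0.2])  # per-dimension weights
```

With equal weights every distance is multiplied by the same factor, so the nearest prototype cannot change. With unequal weights the second coordinate is down-weighted, and the query moves from class "A" to class "B".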

4.3. Amortized Variational Scaling

The proposed stochastic variational scaling methods above consider the metric scale as a global scalar or vector parameter, i.e., the entire meta-sample shares the same embedding space. However, the tasks randomly sampled from the task distribution may have specific task-relevant feature representations [10, 12, 14]. To adapt the learned embeddings to the task-specific representations, we propose to apply amortized variational inference to learn the task-dependent dimensional scaling parameters.

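For amortized variational inference, $\boldsymbol{\alpha}_i$ is a local latent variable dependent on the task $D_i$ instead of a global parameter. Similar to stochastic variational scaling, we apply the variational distribution $q_{\psi_i}(\boldsymbol{\alpha}_i \mid D_i)$ to approximate the posterior distribution $p_\theta(\boldsymbol{\alpha}_i \mid D_i)$. In order to learn the dependence between $\boldsymbol{\alpha}_i$ and $D_i$, amortized variational scaling learns a mapping, approximated by a neural network $G_w$ with weights $w$, from the task $D_i$ to the distribution parameters $\psi_i = (\boldsymbol{\mu}_i, \boldsymbol{\sigma}_i)$ of $\boldsymbol{\alpha}_i$.

By leveraging the re-parameterization trick, we obtain the objective function of amortized variational scaling:

$$\mathcal{L}(w, \theta; \mathbf{S}) = \sum_{i=1}^{n} \Big[ -\sum_{j=m+1}^{m+q} \log p_\theta(y_{i,j} \mid x_{i,j}, D_i^{tr}, \boldsymbol{\alpha}_i) + \mathrm{KL}\big(q_{\psi_i}(\boldsymbol{\alpha}_i \mid D_i) \,\|\, p(\boldsymbol{\alpha})\big) \Big], \tag{10}$$

where $\boldsymbol{\alpha}_i = \boldsymbol{\mu}_i + \boldsymbol{\sigma}_i \odot \boldsymbol{\epsilon}_i$. Note that the local parameters $\psi_i$ are functions of $w$, i.e., $(\boldsymbol{\mu}_i, \boldsymbol{\sigma}_i) = G_w(D_i)$. We iteratively update $\theta$ and $w$ by minimizing the loss function (10) during meta-training.

During meta-testing, for each task, the generator produces the parameters of a variational distribution, and we still use the mean vector as the metric scaling vector for inference.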
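The amortized step can be sketched as follows. This is an illustrative example only: it assumes the task is summarized by the mean of its embedded samples, that the generator is a multi-layer perceptron with one hidden ReLU layer, and that a softplus keeps the standard deviations positive. The softplus, the layer sizes, the initialization, and all function names are assumptions, not details given in the text.

```python
import math
import random

def task_prototype(embeddings):
    """Mean vector of all embedded samples (supports and queries) of a task."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_params(in_dim, hidden_dim, out_dim, rng):
    """Small random weights for a one-hidden-layer generator (hypothetical init)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "w1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(2 * out_dim, hidden_dim), "b2": [0.0] * (2 * out_dim),
    }

def generator(task_proto, params):
    """Map a task prototype to the mean and std of the scaling vector."""
    hidden = [max(0.0, h) for h in linear(task_proto, params["w1"], params["b1"])]
    out = linear(hidden, params["w2"], params["b2"])
    dim = len(out) // 2
    mu = out[:dim]
    sigma = [math.log1p(math.exp(s)) for s in out[dim:]]  # softplus, assumed here
    return mu, sigma

def sample_scaling(mu, sigma, rng):
    """Re-parameterized sample: alpha = mu + sigma * eps, eps ~ N(0, 1) per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)
embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
proto = task_prototype(embeddings)
params = init_params(in_dim=4, hidden_dim=8, out_dim=4, rng=rng)

mu, sigma = generator(proto, params)
alpha_train = sample_scaling(mu, sigma, rng)  # sampled vector used during meta-training
alpha_test = mu                               # mean vector used during meta-testing
```

During training the sampled vector would feed the scaled distance computation, so gradients can flow back into the generator weights through `mu` and `sigma`.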

Auxiliary loss.

To learn the mapping from a set $D_i$ to the variational parameters of the local random variable $\boldsymbol{\alpha}_i$, we compute the mean vector of the embedded queries and the embedded supports as the task prototype and use it to generate the variational parameters. A problem is that the embeddings are not ready to generate good scaling parameters during early epochs. Existing approaches including co-training [19] and pre-training [14] can alleviate this problem at the expense of computational efficiency. They pre-train or co-train an auxiliary classifier in a traditional supervised manner over the meta-sample $\mathbf{S}$, and then apply the pre-trained embeddings to generate the task-specific parameters and fine-tune the embeddings during meta-training. Here, we propose an end-to-end algorithm which improves training efficiency in comparison with pre-training or co-training. Instead of minimizing (10) in Algorithm 2, we optimize the following loss function (11), in which an auxiliary weight $\lambda$ is used, i.e.,

$$\mathcal{L}_{aux}(w, \theta; \mathbf{S}) = \mathcal{L}(w, \theta; \mathbf{S}) + \lambda\, \mathcal{L}(\theta; \mathbf{S}, \boldsymbol{\alpha} = \mathbf{1}), \tag{11}$$

where $\boldsymbol{\alpha} = \mathbf{1}$, i.e., no scaling is used. Given a decay step size $\gamma$, $\lambda$ starts from $1$ and linearly decays to $0$ as the number of epochs increases, i.e., $\lambda \leftarrow \lambda - 1/\gamma$. During the first epochs, the weight of the gradients of the unscaled loss is high and the algorithm learns the embeddings of PN. As the training proceeds, $\theta$ is updated to tune the learned embedding space. See the details in Algorithm 2.
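The linearly decaying auxiliary weight can be sketched as a simple schedule. The example below is hypothetical: it assumes the weight starts at 1, reaches 0 after `decay_steps` epochs, and multiplies an unscaled prototypical-network loss that is added to the scaled loss. The endpoints, the additive combination, and the function names are assumptions made for illustration.

```python
def auxiliary_weight(epoch, decay_steps, start=1.0, end=0.0):
    """Linearly interpolate from `start` to `end` over `decay_steps` epochs, then hold."""
    fraction = min(max(epoch / decay_steps, 0.0), 1.0)
    return start + (end - start) * fraction

def combined_loss(scaled_loss, unscaled_loss, lam):
    """Scaled objective plus the auxiliary (no-scaling) loss weighted by lam."""
    return scaled_loss + lam * unscaled_loss

# Weight at the start, halfway through, at the end, and after the decay window.
schedule = [auxiliary_weight(e, decay_steps=100) for e in (0, 50, 100, 150)]
example = combined_loss(scaled_loss=2.0, unscaled_loss=1.0, lam=schedule[1])
```

Early in training the unscaled term dominates the embedding updates; once the weight reaches zero only the scaled objective remains.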

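Input: Meta-sample $\mathbf{S} = \{D_i\}_{i=1}^{n}$, learning rates $l_\theta$, $l_w$, prior $p(\boldsymbol{\alpha})$ and step size $\gamma$.
Initialize: $\theta$ and $w$ randomly, $\lambda = 1$.
for $i$ in $\{1, \dots, n\}$ do
    // Compute the task prototype.
    task prototype = mean of $\phi_\theta(x_{i,j})$ over all supports and queries of $D_i$
    // Generate $\boldsymbol{\mu}_i$ and $\boldsymbol{\sigma}_i$ for the task.
    $(\boldsymbol{\mu}_i, \boldsymbol{\sigma}_i) = G_w(\text{task prototype})$, $\boldsymbol{\alpha}_i = \boldsymbol{\mu}_i + \boldsymbol{\sigma}_i \odot \boldsymbol{\epsilon}_i$
    for $k$ in $\{1, \dots, K\}$ do
        $c_k$ = mean of $\phi_\theta(x_{i,j})$ over the supports of class $k$ in $D_i^{tr}$
    for $j$ in $\{m+1, \dots, m+q\}$ do
        compute $p_\theta(y_{i,j} = k \mid x_{i,j}, D_i^{tr}, \boldsymbol{\alpha}_i)$ with the dimensionally scaled metric
    // Update the model parameters $\theta$.
    $\theta \leftarrow \theta - l_\theta \nabla_\theta \mathcal{L}_{aux}$
    // Update the parameters $w$ of the generator.
    $w \leftarrow w - l_w \nabla_w \mathcal{L}_{aux}$
    if a training epoch has been completed then
        decay $\lambda$ by one step, keeping $\lambda \geq 0$
Algorithm 2 Dimensional Amortized Variational Scaling for Prototypical Networks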

5. Experiments

To evaluate our methods, we plug them into two popular algorithms, prototypical networks (PN) [23] and TADAM [19], implemented with both Conv-4 and ResNet-12 backbone networks. Table 1, discussed in detail later, shows our main results in comparison to state-of-the-art meta-algorithms, where it can be seen that our dimensional stochastic variational scaling algorithm outperforms other methods substantially. For TADAM, we incorporate our methods in conjunction with all the techniques proposed in the original paper and still observe notable improvement.

miniImageNet test accuracy
Backbone    Model                              5-way 1-shot    5-way 5-shot
Conv-4      Matching networks [25]
            Relation Net [24]
            Meta-learner LSTM [22]
            MAML [3]
            LLAMA [8]
            REPTILE [18]
            PLATIPUS [4]
ResNet-12   adaResNet [17]
            SNAIL [16]
            TADAM [19]
            TADAM Euclidean + D-SVS (ours)
            PN Euclidean [23] *
            PN Cosine [23] *
            PN Euclidean + D-SVS (ours) *
            PN Cosine + D-SVS (ours) *
Table 1: Test accuracies of 5-way classification tasks on miniImageNet using Conv-4 and ResNet-12 respectively. * indicates results by our re-implementation.

              5-way 1-shot            5-way 5-shot
              Euclidean    Cosine     Euclidean    Cosine
PN
PN + SVS
PN + D-SVS
PN + D-AVS
Table 2: Results of prototypical networks (the first row) and prototypical networks with SVS, D-SVS and D-AVS respectively by our re-implementation using Conv-4.

5.1. Dataset and Experimental Setup

miniImageNet.

The miniImageNet [25] consists of 100 classes with 600 images per class. We follow the data split suggested by ravi2016optimization ravi2016optimization, where the dataset is separated into a training set with 64 classes, a testing set with 20 classes and a validation set with 16 classes.

Model architecture.

To evaluate our methods with different backbone networks, we re-implement PN with the Conv-4 architecture proposed by snell2017prototypical snell2017prototypical and the ResNet-12 architecture adopted by oreshkin2018tadam oreshkin2018tadam, respectively.

The Conv-4 backbone contains four convolutional blocks, where each block is sequentially composed of a 3×3 kernel convolution with 64 filters, a batch normalization layer, a ReLU nonlinear layer and a 2×2 max-pooling layer.

The ResNet-12 architecture contains 4 Res blocks, where each block consists of 3 convolutional blocks followed by a 2×2 max-pooling layer.

Training details.

We follow the episodic training strategy proposed by [25]. In each episode, classes and shots per class are selected from the training set, the validation set or the testing set. For fair comparisons, the number of queries, the sampling strategy of queries, and the testing strategy are designed in line with PN or TADAM.

For Conv-4, we use the Adam optimizer with a learning rate of without weight decay. The total number of training episodes is for Conv-4. For ResNet-12, we use the SGD optimizer with momentum , weight decay and episodes in total. The learning rate is initialized as and decayed at episode steps , and . Besides, we use gradient clipping when training ResNet-12. The reported results are the mean accuracies with confidence intervals estimated by runs.
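Reporting a mean accuracy with a confidence interval over repeated runs can be sketched as below. The confidence level and the number of runs are not recoverable from the text above, so this example assumes a 95% interval under a normal approximation (z = 1.96) and uses a handful of hypothetical accuracy values.

```python
import math
import statistics

def mean_with_confidence_interval(accuracies, z=1.96):
    """Return (mean, half_width) using the sample std and a normal approximation."""
    mean = statistics.mean(accuracies)
    std_error = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, z * std_error

# Hypothetical per-run test accuracies.
accuracies = [0.50, 0.52, 0.48, 0.51, 0.49]
mean_acc, half_width = mean_with_confidence_interval(accuracies)
report = f"{100 * mean_acc:.2f} ± {100 * half_width:.2f}%"
```

With many evaluation runs the normal approximation is usually adequate; with only a few runs, a Student-t quantile would give a wider and more appropriate interval.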

We normalize the embeddings before computing the distances between them. As shown in Eq. (7) and (8), the gradient magnitude of variational metric scaling parameters is proportional to the norm of embeddings. Therefore, to foster the learning process of these parameters, we adopt a separate learning rate for all variational metric scaling parameters.

5.2. Evaluation

The effectiveness of our proposed methods is illustrated in Table 2 progressively, including stochastic variational scaling (SVS), dimensional stochastic variational scaling (D-SVS) and dimensional amortized variational scaling (D-AVS). On both 5-way 1-shot and 5-way 5-shot classification, noticeable improvement can be seen after incorporating SVS into PN. Compared to SVS, D-SVS is more effective, especially for -way -shot classification. D-AVS performs even better than D-SVS by considering task-relevant information.

Performance of SVS.

We study the performance of SVS by incorporating it into PN. We consider both 5-way and 20-way training scenarios. The prior distribution of the metric scaling parameter is set as and the variational parameters are initialized as , . The learning rate is set to be .

Results in Table 3 show the effect of the metric scaling parameter (SVS). Particularly, significant improvement is observed for the case of PN with cosine similarity and for the case of -way -shot classification. Moreover, it can be seen that with metric scaling there is no clear difference between the performance of Euclidean distance and cosine similarity.

We also compare the performance of a fixed $\sigma$ with a trainable $\sigma$. We add a shifted ReLU activation function on the learned $\sigma$ to ensure it is positive. Nevertheless, in our experiments, we observe that the training is very stable and the variance is always positive even without the ReLU activation function. We also find that there is no significant difference between the two settings. Hence, we treat $\sigma$ as a fixed hyperparameter in other experiments.

                                    5-way 1-shot                        5-way 5-shot
                                    5-way training   20-way training    5-way training   20-way training
PN Euclidean
PN Cosine
PN Euclidean + SVS (fixed $\sigma$)
PN Cosine + SVS (fixed $\sigma$)
PN Euclidean + SVS (learned $\sigma$)
PN Cosine + SVS (learned $\sigma$)
Table 3: Results of prototypical networks and prototypical networks with SVS by our re-implementation using Conv-4.

                                5-way 1-shot            5-way 5-shot
Auxiliary training    Prior     Euclidean    Cosine     Euclidean    Cosine
Table 4: Ablation study of prototypical networks with D-AVS by our re-implementation using Conv-4.

Performance of D-SVS.

We validate the effectiveness of D-SVS by incorporating it into PN and TADAM, with the results shown in Table 1 and Table 2. On 5-way-1-shot classification, for PN, we observe about and absolute increase in test accuracy with Conv-4 and ResNet-12 respectively; for TADAM, absolute increase in test accuracy is observed. The learning rate for D-SVS is set to be . Here we use a large learning rate since the gradient magnitude of each dimension of the metric scaling vector is extremely small after normalizing the embeddings.

Performance of D-AVS.

We evaluate the effectiveness of D-AVS by incorporating it into PN. We use a multi-layer perceptron (MLP) with one hidden layer as the generator. The learning rate is set to be . In Table 2, on both 5-way 1-shot and 5-way 5-shot classification, we observe about absolute increase in test accuracy for dimensional amortized variational scaling (D-AVS) over SVS with a single scaling parameter. In our experiments, the hyperparameter is selected from the range of with 200 training epochs in total.

Ablation study of D-AVS.

To assess the effects of the auxiliary training strategy and the prior information, we provide an ablation study as shown in Table 4. Without the auxiliary training and the prior information, D-AVS degenerates to a task-relevant weight generating approach [12]. Noticeable performance drops can be observed after removing the two components. Removing either one of them also leads to performance drop, but not as significant as removing both. The empirical results confirm the necessity of the auxiliary training and a proper prior distribution for amortized variational metric scaling.

5.3. Robustness Study

We also design experiments to show: 1) The convergence speed of existing methods does not slow down after incorporating our methods; 2) Given the same prior distribution, the variational parameters converge to the same values in spite of different learning rates and initializations.

For the iterative update of the model parameters $\theta$ and the variational parameters $\psi$, a natural question is whether it slows down the convergence of the algorithm. Figure 2 shows the learning curves of PN and PN+D-SVS on both 5-way 1-shot and 5-way 5-shot classification. It can be seen that incorporating D-SVS does not reduce the convergence speed.

We plot the learning curves of the variational parameter $\mu$ w.r.t. different initializations and different learning rates. Given the same prior distribution, Fig. 3(a) shows that the variational parameter $\mu$ with different initializations converges to the same value. Fig. 3(b) shows that $\mu$ is robust to different learning rates.

Figure 2: Learning curves of prototypical networks and prototypical networks with D-SVS on (a) 5-way 1-shot and (b) 5-way 5-shot classification.
Figure 3: Learning curves of $\mu$ (a) for different initializations and (b) for different learning rates.

Conclusion

In this paper, we have proposed a generic variational metric scaling framework for metric-based meta-algorithms, under which three efficient end-to-end methods are developed. To learn an embedding space that better fits the data distribution, we have considered the influence of metric scaling on the embedding space by taking into account data-dependent and task-dependent information progressively. Our methods are lightweight and can be easily plugged into existing metric-based meta-algorithms to improve their performance.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. This research was supported by the grants of DaSAIL projects P0030935 and P0030970 funded by PolyU (UGC).

References

  • [1] A. Babenko and V. Lempitsky (2015)

    Aggregating deep convolutional features for image retrieval

    .
    arXiv preprint arXiv:1510.07493. Cited by: Metric scaling..
  • [2] S. Chopra, R. Hadsell, Y. LeCun, et al. (2005) Learning a similarity metric discriminatively, with application to face verification. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    ,
    Cited by: Metric-based meta-learning..
  • [3] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: Table 1.
  • [4] C. Finn, K. Xu, and S. Levine (2018) Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527. Cited by: Table 1.
  • [5] S. Fort (2017) Gaussian prototypical networks for few-shot learning on omniglot. arXiv preprint arXiv:1708.02735. Cited by: Metric-based meta-learning..
  • [6] V. Garcia and J. Bruna (2017) Few-shot learning with graph neural networks. International Conference on Learning Representations. Cited by: 1. Introduction.
  • [7] J. Gordon, J. Bronskill, M. Bauer, S. Nowozin, and R. E. Turner (2018) Meta-learning probabilistic inference for prediction. International Conference on Learning Representations. Cited by: 1. Introduction.
  • [8] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths (2018) Recasting gradient-based meta-learning as hierarchical bayes. International Conference on Learning Representations. Cited by: Table 1.
  • [9] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. Cited by: Metric scaling..
  • [10] B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell (2019) Few-shot object detection via feature reweighting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8420–8429. Cited by: 4.3. Amortized Variational Scaling.
  • [11] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In

    ICML Deep Learning Workshop

    ,
    Vol. 2. Cited by: 1. Introduction.
  • [12] N. Lai, M. Kan, S. Shan, and X. Chen (2018) Task-adaptive feature reweighting for few shot classification. In Asian Conference on Computer Vision, pp. 649–662. Cited by: 4.3. Amortized Variational Scaling, Ablation study of D-AVS..
  • [13] F. Li, R. Fergus, and P. Perona (2006) One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28 (4), pp. 594–611. Cited by: 1. Introduction.
  • [14] H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang (2019) Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–10. Cited by: Metric-based meta-learning., Auxiliary loss., 4.3. Amortized Variational Scaling.
  • [15] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017)

    Sphereface: deep hypersphere embedding for face recognition

    .
    In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220. Cited by: Metric scaling..
  • [16] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2018) A simple neural attentive meta-learner. International Conference on Learning Representations. Cited by: Table 1.
  • [17] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler (2017)

    Rapid adaptation with conditionally shifted neurons

    .
    arXiv preprint arXiv:1712.09926. Cited by: Table 1.
  • [18] A. Nichol and J. Schulman (2018) Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999 2. Cited by: Table 1.
  • [19] B. Oreshkin, P. R. López, and A. Lacoste (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 719–729. Cited by: 1. Introduction, 1. Introduction, Metric-based meta-learning., Auxiliary loss., Table 1, 5. Experiments.
  • [20] R. Ranjan, C. D. Castillo, and R. Chellappa (2017) L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507. Cited by: Metric scaling., Estimation of gradients..
  • [21] S. Ravi and A. Beatson (2019) Amortized bayesian meta-learning. International Conference on Learning Representations. Cited by: 1. Introduction.
  • [22] S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. Cited by: Table 1.
  • [23] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: 1. Introduction, 1. Introduction, Metric-based meta-learning., 3.2. Prototypical Networks, 4.1. Stochastic Variational Scaling, Table 1, 5. Experiments.
  • [24] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: Metric-based meta-learning., Table 1.
  • [25] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: 1. Introduction, Metric-based meta-learning., 4.1. Stochastic Variational Scaling, miniImageNet., Training details., Table 1.
  • [26] W. Wan, Y. Zhong, T. Li, and J. Chen (2018) Rethinking feature distribution for loss functions in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9117–9126. Cited by: Metric scaling..
  • [27] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille (2017) Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1041–1049. Cited by: Metric scaling..
  • [28] X. Zhang, F. X. Yu, S. Karaman, W. Zhang, and S. Chang (2018) Heated-up softmax embedding. arXiv preprint arXiv:1809.04157. Cited by: Metric scaling..

Appendix A. SVS

A.1. Comparison with a Special Case

Ranjan et al. [20] proposed to train the single scaling parameter together with the model parameters. Their method can be seen as a special case of our stochastic variational scaling method (SVS) under the conditions of , and . We compare our method with theirs by varying the initialization of ().

Notably, our method achieves absolute improvements of , , and for four different initializations respectively. As shown in Table 5, our method is stable w.r.t. the initialization of , thanks to the prior information introduced in our Bayesian framework, which may counteract the influence of the initialization.

Method 1 10 100 1000
PN+SVS
PN (Training together)
Table 5: Comparison of PN (Training together) and PN+SVS, both implemented with a Conv-4 backbone.

A.2. Sensitivity to the Prior and Initialization

In a Bayesian framework, the prior distribution has a significant impact on learning the posterior distribution. For stochastic variational inference, initialization is another key factor in learning the variational parameters. Here, we conduct experiments with PN+SVS under different prior distributions and initializations. The results of -way -shot classification are summarized in Table 6. It can be observed that our method is not sensitive to the prior and initialization as long as either one of them is not too small.
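To illustrate how a prior typically enters a variational objective, the sketch below computes the closed-form KL divergence between a Gaussian variational distribution and a Gaussian prior over a scalar scaling parameter. That the prior is Gaussian is an assumption made here for illustration, and the function and argument names are hypothetical; this is not the authors' implementation.

```python
import math

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalar Gaussians.

    q is the variational distribution, p is the prior (assumed Gaussian here).
    """
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )

# Hypothetical values: the same variational distribution under a narrow and a broad prior.
narrow = kl_gaussians(mu_q=10.0, sigma_q=1.0, mu_p=1.0, sigma_p=1.0)
broad = kl_gaussians(mu_q=10.0, sigma_q=1.0, mu_p=1.0, sigma_p=100.0)
```

In this form, the squared distance between the two means is divided by the prior variance, so a broad prior penalizes a mean that has moved away from the prior mean only weakly.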

Appendix B. D-SVS

B.1. Distributions of the Mean Vector

Figure 4 illustrates the distributions of the mean vector during the meta-training procedure of -way -shot and -way -shot classification respectively. Darker colour means more frequent occurrence.

At step , all dimensions of are initialized as . They diverge as the meta-training proceeds, which shows D-SVS successfully learns different scaling parameters for different dimensions. It is also worth noting that for both tasks, the distribution of converges eventually (after steps).

Appendix C. D-AVS

C.1. Viewing the Learned Metric Scaling Parameters of PN+D-AVS

D-AVS generates variational distributions for different tasks, from which the task-specific scaling parameters are sampled. Below we print out the scaling vectors learned by D-AVS on two different testing tasks for 5-way 5-shot classification, where only the first ten dimensions are displayed. It can be seen that D-AVS successfully learns tailored scaling vectors for different tasks.

Scaling parameters for Task 1: [64.9170, 22.4030, 13.4468, 2.3949, 28.2470, 29.7770, 54.4221, 60.9279, 2.3008, 147.5304].

Scaling parameters for Task 2: [60.2564, 21.1672, 12.8457, 2.3603, 26.6194, 27.9963, 50.6127, 56.5965, 2.2672, 135.2510].
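As a rough illustration of how such a scaling vector could act on a prototypical-network-style classifier, the sketch below weights each dimension of the squared Euclidean distance before the softmax. The exact form of the scaled distance is not stated in this appendix, so the per-dimension weighting used here is an assumption, and the function name and toy inputs are hypothetical.

```python
import numpy as np

def scaled_class_probs(query, prototypes, alpha):
    """Class probabilities from a per-dimension scaled squared Euclidean distance.

    query:      (D,) embedding of one query sample
    prototypes: (K, D) one prototype per class
    alpha:      (D,) dimensional scaling vector (assumed non-negative)
    """
    sq_diff = (prototypes - query) ** 2          # (K, D) squared differences
    logits = -np.sum(alpha * sq_diff, axis=1)    # weight each dimension, sum over D
    logits = logits - logits.max()               # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: the same query and prototypes under two different scaling vectors.
query = np.zeros(3)
prototypes = np.array([[0.1, 0.0, 2.0],
                       [1.0, 0.0, 0.0]])
uniform = scaled_class_probs(query, prototypes, np.ones(3))
masked = scaled_class_probs(query, prototypes, np.array([1.0, 1.0, 0.0]))
```

With uniform scaling the second prototype is closer to the query, whereas a vector that suppresses the third dimension makes the first prototype closer. This shows how a task-specific scaling vector can change the prediction.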

Appendix D. Implementation Details

D.1. Sampling from the Variational Distribution

We adopt the following sampling strategy for all three proposed approaches. For meta-training, we sample the metric scaling parameter once per task from the variational distribution; for meta-testing, we use the mean of the learned Gaussian distribution as the metric scaling parameter. The computational overhead of sampling is negligible.
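The sampling strategy above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, its arguments, and the use of a reparameterized draw (mean plus standard deviation times standard normal noise) are assumptions.

```python
import numpy as np

def scaling_parameter(mu, sigma, training, rng=None):
    """Return the metric scaling parameter (scalar or vector) for one task.

    mu, sigma: mean and standard deviation of the Gaussian variational distribution.
    training:  True during meta-training, False during meta-testing.
    """
    mu = np.asarray(mu, dtype=float)
    if not training:
        # Meta-testing: use the mean of the learned Gaussian, no sampling.
        return mu
    rng = np.random.default_rng() if rng is None else rng
    # Meta-training: a single draw per task (assumed reparameterized form).
    eps = rng.standard_normal(mu.shape)
    return mu + np.asarray(sigma, dtype=float) * eps

# Hypothetical variational parameters for a 3-dimensional scaling vector.
mu = np.array([1.0, 2.0, 3.0])
sigma = np.array([0.1, 0.1, 0.1])
alpha_train = scaling_parameter(mu, sigma, training=True, rng=np.random.default_rng(0))
alpha_test = scaling_parameter(mu, sigma, training=False)
```

Because only one draw is made per task and the test-time path involves no sampling at all, the extra cost relative to a fixed scaling parameter is one small random vector per episode.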

Table 6: Results of PN+SVS w.r.t. different initializations and priors, implemented with a Conv-4 backbone.
Figure 4: Distributions of the learned w.r.t. the number of training steps. The horizontal and vertical axes are the number of training steps and values of , respectively. The top is for -way -shot classification and the bottom is for -way -shot.