1. Introduction
Few-shot learning [13] aims to assign unseen samples (queries) to their categories given only a few labeled samples (supports) per category. A promising paradigm for few-shot learning is meta-learning, which learns general patterns from a large number of tasks for fast adaptation to unseen tasks. Recently, metric-based meta-learning algorithms [6, 11, 23, 25] have demonstrated great potential in few-shot classification. Typically, they learn a general mapping that projects queries and supports into an embedding space. These models are trained in an episodic manner [25] by minimizing the distances between a query and same-labeled supports in the embedding space. Given a new task in the testing phase, a nearest-neighbour classifier assigns a query to its nearest class in the embedding space.
Many metric-based meta-algorithms (short for meta-learning algorithms) employ a softmax classifier with cross-entropy loss, computed with the logits being the distances between a query and supports in the embedding (metric) space. However, it has been shown that the scale of the logits, i.e., the metric scaling parameter, is critical to the performance of the learned model. Snell et al. [23] found that Euclidean distance significantly outperforms cosine similarity in few-shot classification, while Oreshkin et al. [19] and Wang et al. (2018) pointed out that there is no clear difference between them if the logits are scaled properly. They supposed that there exists an optimal metric scaling parameter that is data- and architecture-dependent, but they only used cross-validation to set the parameter manually, which requires pretraining and cannot find an ideal solution.
In this paper, we aim to design an end-to-end method that can automatically learn an accurate metric scaling parameter. Given a set of training tasks, to learn a data-dependent metric scaling parameter that can generalize well to a new task, Bayesian posterior inference over learnable parameters is a theoretically attractive framework [7, 21]. We propose to recast metric-based meta-algorithms from a Bayesian perspective and take the metric scaling parameter as a global parameter. As exact posterior inference is intractable, we introduce a variational approach to efficiently approximate the posterior distribution with stochastic variational inference.
While a proper metric scaling parameter can improve classification accuracy by adjusting the cross-entropy loss, it simply rescales the embedding space and does not change the relative locations of the embedded samples. To transform the embedding space to better fit the data distribution, we propose a dimensional variational scaling method that learns a scaling parameter for each dimension, i.e., a metric scaling vector. Further, in order to learn task-dependent embeddings [19], we propose an amortized variational approach to generate task-dependent metric scaling vectors, accompanied by an auxiliary training strategy that avoids time-consuming pretraining or co-training.
Our metric scaling methods can be used as pluggable modules for metric-based meta-algorithms. For example, they can be incorporated into prototypical networks (PN) [23] and all PN-based algorithms to improve their performance. To verify this, we conduct extensive experiments on the miniImageNet benchmark for few-shot classification progressively. First, we show that the proposed stochastic variational approach consistently improves on PN, and the improvement is large for PN with cosine similarity. Second, we show that the dimensional variational scaling method further improves upon the one with a single scaling parameter, and the task-dependent metric scaling method with amortized variational inference achieves the best performance. We also incorporate the dimensional metric scaling method into TADAM [19] in conjunction with other tricks proposed by the authors and observe notable improvement. Remarkably, after incorporating our method, TADAM achieves highly competitive performance compared with state-of-the-art methods.
To sum up, our contributions are as follows:

We propose a generic variational approach to automatically learn a proper metric scaling parameter for metric-based meta-algorithms.

We extend the proposed approach to learn dimensional and task-dependent metric scaling vectors to find a better embedding space by fitting the dataset at hand.

As a pluggable module, our method can be efficiently used to improve existing metric-based meta-algorithms.
2. Related Work
Metric-based meta-learning.
Koch et al. [11] proposed the first metric-based meta-algorithm for few-shot learning, in which a siamese network [2] is trained with the triplet loss to compare the similarity between a query and supports in the embedding space. Matching networks [25] proposed the episodic training strategy and used the cross-entropy loss where the logits are the distances between a query and supports. Prototypical networks [23] improved matching networks by computing the distances between a query and the prototype (mean of supports) of each class. Many metric-based meta-algorithms [19, 5, 24, 14] extended prototypical networks in different ways.
Some recent methods proposed to improve prototypical networks by extracting task-conditioning features. Oreshkin et al. [19] trained a network to generate task-conditioning parameters for batch normalization. Li et al. [14] extracted task-relevant features with a category traversal module. Our methods can be incorporated into these methods to improve their performance.
In addition, there are some works related to our proposed dimensional scaling methods. Kang et al. [10] trained a meta-model to reweight features obtained from the base feature extractor and applied it to few-shot object detection. Lai et al. [12] proposed a generator to generate task-adaptive weights to reweight the embeddings, which can be seen as a special case of our amortized variational scaling method.
Metric scaling.
Cross-entropy loss is widely used in many machine learning problems, including metric-based meta-learning and metric learning [1, 20, 15, 27, 28, 26]. In metric learning, the influence of metric scaling on the cross-entropy loss was first studied by Wang et al. [27] and Ranjan et al. [20]. They treated the metric scaling parameter either as a trainable parameter updated together with the model parameters or as a fixed hyperparameter. Zhang et al. [28] proposed a “heating-up” scaling strategy, where the metric scaling parameter is decayed manually during the training process. The scaling of logits in cross-entropy loss for model compression was also studied by Hinton et al. (2015), where it is called temperature scaling. The temperature scaling parameter has also been used in confidence calibration [9].
The effect of metric scaling for few-shot learning was first discussed by Snell et al. [23] and Oreshkin et al. [19]. The former found that Euclidean distance outperforms cosine similarity significantly in prototypical networks, and the latter argued that the superiority of Euclidean distance could be offset by imposing a proper metric scaling parameter on cosine similarity and using cross-validation to select the parameter.
3. Preliminaries
3.1. Notations and Problem Statement
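Let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ be a domain where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space. Assume we observe a meta-sample $\mathbf{S} = \{D_i\}_{i=1}^{n}$ including $n$ training tasks, where the $i$th task $D_i = D_i^{tr} \cup D_i^{ts}$ consists of a support set of size $m$, $D_i^{tr} = \{(x_{i,j}, y_{i,j})\}_{j=1}^{m}$, and a query set of size $q$, $D_i^{ts} = \{(x_{i,j}, y_{i,j})\}_{j=m+1}^{m+q}$. Each training data point $(x_{i,j}, y_{i,j})$ belongs to the domain $\mathcal{Z}$. Denote by $\theta$ the model parameters and by $\alpha$ the metric scaling parameter. Given a new task and a support set $D^{tr}$ sampled from the task, the goal is to predict the label $y$ of a query $x$.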
3.2. Prototypical Networks
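Prototypical networks (PN) [23] is a popular and highly effective metric-based meta-algorithm. PN learns a mapping $f_\theta$ which projects queries and supports to an $M$-dimensional embedding space. For each class $k$, the mean vector of the supports of class $k$ in the embedding space is computed as the class prototype $c_k$. The embedded query is compared with the prototypes and assigned to the class of the nearest prototype. Given a similarity metric $d$, the probability of a query $x$ belonging to class $k$, given the support set $D^{tr}$, is
$$p_\theta(y = k \mid x, D^{tr}) = \frac{\exp\big(-d(f_\theta(x), c_k)\big)}{\sum_{k'} \exp\big(-d(f_\theta(x), c_{k'})\big)}. \tag{1}$$
Training proceeds by minimizing the cross-entropy loss, i.e., the negative log-probability of the true class $y$ of each query. After introducing the metric scaling parameter $\alpha$, the classification loss of the $i$th task becomes
$$\mathcal{L}(\theta; D_i) = -\sum_{(x, y) \in D_i^{ts}} \log \frac{\exp\big(-\alpha\, d(f_\theta(x), c_y)\big)}{\sum_{k'} \exp\big(-\alpha\, d(f_\theta(x), c_{k'})\big)}, \tag{2}$$
where the sum runs over the queries in the query set $D_i^{ts}$ of the task and the prototypes are computed from its support set $D_i^{tr}$.
The metric scaling parameter has been found to affect the performance of PN significantly.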
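The classification rule of this subsection can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the helper names (`class_prototypes`, `class_probabilities`, `scaled_loss`), the toy 2-D embeddings, and the choice of squared Euclidean distance as the metric are all assumptions made for the example.

```python
import math


def squared_euclidean(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def class_prototypes(embeddings, labels):
    """Mean embedded support per class (the class prototype)."""
    sums, counts = {}, {}
    for e, y in zip(embeddings, labels):
        if y not in sums:
            sums[y] = [0.0] * len(e)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], e)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def class_probabilities(query, prototypes, alpha=1.0, metric=squared_euclidean):
    """Softmax over negative scaled distances; alpha is the metric scaling parameter."""
    logits = {k: -alpha * metric(query, c) for k, c in prototypes.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}


def scaled_loss(query, label, prototypes, alpha=1.0):
    """Cross-entropy loss of one query under metric scaling."""
    return -math.log(class_probabilities(query, prototypes, alpha)[label])


# Toy embedded supports (hypothetical values) for two classes.
support = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
support_labels = ["a", "a", "b", "b"]
protos = class_prototypes(support, support_labels)
query = [0.3, 0.2]

for alpha in (1.0, 10.0):
    print(alpha, class_probabilities(query, protos, alpha))
```

In this toy case the nearest prototype is the same for both values of `alpha`; only the sharpness of the softmax, and therefore the loss, changes.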
4. Variational Metric Scaling
4.1. Stochastic Variational Scaling
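In the following, we recast metric-based meta-learning from a Bayesian perspective. Given the meta-sample $\mathbf{S}$ of training tasks, the predictive distribution can be parameterized as
$$p(y \mid x, D^{tr}, \mathbf{S}; \theta) = \int p_\theta(y \mid x, D^{tr}, \alpha)\, p(\alpha \mid \mathbf{S})\, d\alpha. \tag{3}$$
The conditional distribution $p_\theta(y \mid x, D^{tr}, \alpha)$ is the discriminative classifier parameterized by $\theta$. Since the posterior distribution $p(\alpha \mid \mathbf{S})$ is intractable, we propose a variational distribution $q_\psi(\alpha)$ parameterized by parameters $\psi$ to approximate $p(\alpha \mid \mathbf{S})$. By minimizing the KL divergence between the approximator $q_\psi(\alpha)$ and the real posterior distribution $p(\alpha \mid \mathbf{S})$, we obtain the objective function
$$\mathcal{L}(\psi, \theta; \mathbf{S}) = -\mathbb{E}_{q_\psi(\alpha)}\Big[\sum_{i} \sum_{(x, y) \in D_i^{ts}} \log p_\theta(y \mid x, D_i^{tr}, \alpha)\Big] + \mathrm{KL}\big(q_\psi(\alpha) \,\|\, p(\alpha)\big), \tag{4}$$
where $D_i^{tr}$ and $D_i^{ts}$ are the support set and the query set of the $i$th task.
We want to optimize $\mathcal{L}$ w.r.t. both the model parameters $\theta$ and the variational parameters $\psi$. The gradient and the optimization procedure of the model parameters are similar to the original metric-based meta-algorithms [25, 23], as shown in Algorithm 1.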
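To derive the gradients of the variational parameters, we leverage the reparameterization trick proposed by Kingma and Welling (2013), which yields a practical estimator of the variational lower bound and its derivatives w.r.t. the variational parameters. In this paper, we use this trick to estimate the derivatives of $\mathcal{L}$ w.r.t. $\psi$. For a distribution $q_\psi(\alpha)$, we can reparameterize $\alpha \sim q_\psi(\alpha)$ using a differentiable transformation $\alpha = g_\psi(\epsilon)$, if it exists, of an auxiliary random variable $\epsilon \sim p(\epsilon)$. For example, given a Gaussian distribution $q_\psi(\alpha) = \mathcal{N}(\mu, \sigma^2)$, the reparameterization is $g_\psi(\epsilon) = \mu + \sigma \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. Hence, the first term in (4) is formulated as $-\mathbb{E}_{p(\epsilon)}\big[\sum_{i} \sum_{(x, y) \in D_i^{ts}} \log p_\theta(y \mid x, D_i^{tr}, g_\psi(\epsilon))\big]$.
We apply Monte Carlo integration with a single sample $\epsilon_i$ for each task to get an unbiased estimator. Note that $\epsilon_i$ is sampled for the task rather than for each instance, i.e., all queries in $D_i^{ts}$ share the same $\epsilon_i$. The second term in (4) can be computed with a given prior distribution $p(\alpha)$. Then, the final objective function is
$$\mathcal{L}(\psi, \theta; \mathbf{S}) = -\sum_{i} \sum_{(x, y) \in D_i^{ts}} \log p_\theta\big(y \mid x, D_i^{tr}, g_\psi(\epsilon_i)\big) + \mathrm{KL}\big(q_\psi(\alpha) \,\|\, p(\alpha)\big). \tag{5}$$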
Estimation of gradients.
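The objective function (5) is a general form. Here, we consider $q_\psi(\alpha)$ as a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$. The prior distribution $p(\alpha)$ is also a Gaussian distribution $\mathcal{N}(\mu_0, \sigma_0^2)$. By the fact that the KL divergence of two Gaussian distributions has a closed-form solution, we obtain the following objective function
$$\mathcal{L}(\mu, \sigma, \theta; \mathbf{S}) = -\sum_{i} \sum_{(x, y) \in D_i^{ts}} \log p_\theta(y \mid x, D_i^{tr}, \alpha_i) + \log \frac{\sigma_0}{\sigma} + \frac{\sigma^2 + (\mu - \mu_0)^2}{2 \sigma_0^2} - \frac{1}{2}, \tag{6}$$
where $\alpha_i = \mu + \sigma \epsilon_i$. The derivatives of $\mathcal{L}$ w.r.t. $\mu$ and $\sigma$ respectively are
$$\frac{\partial \mathcal{L}}{\partial \mu} = -\sum_{i} \sum_{(x, y) \in D_i^{ts}} \frac{\partial \log p_\theta(y \mid x, D_i^{tr}, \alpha_i)}{\partial \alpha_i} + \frac{\mu - \mu_0}{\sigma_0^2}, \tag{7}$$
$$\frac{\partial \mathcal{L}}{\partial \sigma} = -\sum_{i} \epsilon_i \sum_{(x, y) \in D_i^{ts}} \frac{\partial \log p_\theta(y \mid x, D_i^{tr}, \alpha_i)}{\partial \alpha_i} - \frac{1}{\sigma} + \frac{\sigma}{\sigma_0^2}. \tag{8}$$
In particular, we apply the proposed variational metric scaling method to prototypical networks with feature extractor $f_\theta$. The details of the gradients and the iterative update procedure are shown in Algorithm 1. The gradients of the variational parameters are computed from intermediate quantities that are already available in the computational graph of the model parameters during backpropagation, hence the additional computational cost is very low.
For meta-testing, we use $\mu$ (the mean) as the metric scaling parameter for inference.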
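The single-sample objective and its gradients with respect to the Gaussian variational parameters can be sketched as follows. This is a hedged illustration, not the authors' Algorithm 1: the function names, the toy distance table, and the prior values are assumptions, and the classifier is a softmax over negative scaled distances as in Section 3.2.

```python
import math
import random


def gaussian_kl(mu, sigma, mu0, sigma0):
    """Closed-form KL( N(mu, sigma^2) || N(mu0, sigma0^2) )."""
    return math.log(sigma0 / sigma) + (sigma ** 2 + (mu - mu0) ** 2) / (2 * sigma0 ** 2) - 0.5


def task_nll(alpha, distances, labels):
    """Sum over queries of -log softmax(-alpha * d)[true class]."""
    total = 0.0
    for d, y in zip(distances, labels):
        logits = [-alpha * dk for dk in d]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total -= logits[y] - log_z
    return total


def nll_grad_alpha(alpha, distances, labels):
    """d(task_nll)/d(alpha) = sum over queries of d_true - E_p[d]."""
    grad = 0.0
    for d, y in zip(distances, labels):
        logits = [-alpha * dk for dk in d]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        grad += d[y] - sum(wk / z * dk for wk, dk in zip(w, d))
    return grad


def svs_objective(mu, sigma, eps, distances, labels, mu0, sigma0):
    """One-task, one-sample objective: NLL at alpha = mu + sigma * eps, plus the KL term."""
    alpha = mu + sigma * eps
    return task_nll(alpha, distances, labels) + gaussian_kl(mu, sigma, mu0, sigma0)


def svs_gradients(mu, sigma, eps, distances, labels, mu0, sigma0):
    """Analytic gradients of svs_objective w.r.t. mu and sigma."""
    g = nll_grad_alpha(mu + sigma * eps, distances, labels)
    d_mu = g + (mu - mu0) / sigma0 ** 2
    d_sigma = eps * g - 1.0 / sigma + sigma / sigma0 ** 2
    return d_mu, d_sigma


# Toy task (hypothetical numbers): rows are queries, columns are
# query-to-prototype distances for three classes.
random.seed(0)
distances = [[0.1, 0.9, 1.2], [1.0, 0.2, 0.7]]
labels = [0, 1]
mu, sigma, mu0, sigma0 = 5.0, 1.0, 10.0, 1.0
eps = random.gauss(0.0, 1.0)  # one sample per task, shared by all its queries
d_mu, d_sigma = svs_gradients(mu, sigma, eps, distances, labels, mu0, sigma0)
print(d_mu, d_sigma)
```

The only quantity needed beyond the KL term is the derivative of the task loss with respect to the sampled scale, which is why the extra cost over ordinary backpropagation is small.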
4.2. Dimensional Stochastic Variational Scaling
Metric scaling can be seen as a transformation of the metric (embedding) space. Multiplying the distances by the scaling parameter amounts to rescaling the embedding space. From this point of view, we generalize the single scaling parameter to a dimensional scaling vector that transforms the embedding space to fit the data.
If the dimension of the embedding space is too low, the data points cannot be projected to a linearly separable space. Conversely, if the dimension is too high, there may be many redundant dimensions. The optimal number of dimensions is data-dependent and difficult to select as a hyperparameter before training. Here, we address this problem by learning a data-dependent dimensional scaling vector to modify the embedding space, i.e., learning different weights for each dimension to highlight the important dimensions and reduce the influence of the redundant ones. Figure 1 shows a two-dimensional example. It can be seen that the single scaling parameter simply changes the scale of the embedding space, but the dimensional scaling changes the relative locations of the query and the supports.
The proposed dimensional stochastic variational scaling method is similar to Algorithm 1, with vector-valued variational parameters $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$. Accordingly, the metric scaling operation is changed to
(9) 
The gradients of the variational parameters are still easy to compute and the computational cost can be ignored.
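The contrast between a single scale and a per-dimension scaling vector can be illustrated with a small sketch. The exact scaling operation is not reproduced here; the code assumes, purely for illustration, that the vector is applied elementwise to both the query and the prototypes before the squared Euclidean distance is taken. The helper names and the toy points are hypothetical.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def scale(vec, alpha):
    """Elementwise product of an embedding with a scaling vector."""
    return [a * x for a, x in zip(alpha, vec)]


def nearest_prototype(query, prototypes, alpha):
    """Index of the closest prototype after scaling every dimension by alpha."""
    dists = [squared_euclidean(scale(query, alpha), scale(c, alpha)) for c in prototypes]
    return min(range(len(dists)), key=dists.__getitem__)


prototypes = [[1.0, 0.0], [0.0, 0.6]]
query = [0.0, 0.0]

# A uniform scale leaves the nearest prototype unchanged ...
print(nearest_prototype(query, prototypes, [1.0, 1.0]))  # 1
print(nearest_prototype(query, prototypes, [3.0, 3.0]))  # 1
# ... while a per-dimension scale can change it.
print(nearest_prototype(query, prototypes, [0.5, 2.0]))  # 0
```

With a uniform scale all distances are multiplied by the same factor, so the ranking of prototypes is preserved; down-weighting the first dimension and up-weighting the second reverses the ranking in this example.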
4.3. Amortized Variational Scaling
The proposed stochastic variational scaling methods above consider the metric scale as a global scalar or vector parameter, i.e., the entire meta-sample shares the same embedding space. However, the tasks randomly sampled from the task distribution may have specific task-relevant feature representations [10, 12, 14]. To adapt the learned embeddings to the task-specific representations, we propose to apply amortized variational inference to learn task-dependent dimensional scaling parameters.
For amortized variational inference, $\boldsymbol{\alpha}_i$ is a local latent variable dependent on the task $D_i$ instead of a global parameter. Similar to stochastic variational scaling, we apply the variational distribution $q_{\psi_i}(\boldsymbol{\alpha}_i \mid D_i)$ to approximate the posterior distribution $p(\boldsymbol{\alpha}_i \mid D_i)$. In order to learn the dependence between $\boldsymbol{\alpha}_i$ and $D_i$, amortized variational scaling learns a mapping, approximated by a neural network $G_\phi$, from the task $D_i$ to the distribution parameters $\psi_i$ of $\boldsymbol{\alpha}_i$.
By leveraging the reparameterization trick, we obtain the objective function of amortized variational scaling:
$$\mathcal{L}(\phi, \theta; \mathbf{S}) = \sum_{i} \Big[ -\sum_{(x, y) \in D_i^{ts}} \log p_\theta(y \mid x, D_i^{tr}, \boldsymbol{\alpha}_i) + \mathrm{KL}\big(q_{\psi_i}(\boldsymbol{\alpha}_i \mid D_i) \,\|\, p(\boldsymbol{\alpha}_i)\big) \Big], \tag{10}$$
where $\boldsymbol{\alpha}_i = g_{\psi_i}(\epsilon_i)$, and $D_i^{tr}$ and $D_i^{ts}$ are the support set and the query set of the $i$th task. Note that the local parameters $\psi_i$ are functions of $\phi$, i.e., $\psi_i = G_\phi(D_i)$. We iteratively update $\phi$ and $\theta$ by minimizing the loss function (10) during meta-training.
During meta-testing, for each task, the generator produces the parameters of a variational distribution, and we still use the mean vector as the metric scaling vector for inference.
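The amortized scheme can be sketched as a small generator that maps a task summary to the mean and standard deviation of a per-task scaling vector. Everything specific in this sketch is an assumption made for illustration: the class name `ScaleGenerator`, the layer sizes, the random initialization, the softplus used to keep the standard deviation positive, and the use of the mean embedding as the task summary.

```python
import math
import random


def mean_vector(vectors):
    """Mean of a list of equally sized embedding vectors (the task summary)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softplus(v):
    """Numerically stable softplus; strictly positive for moderate inputs."""
    return math.log1p(math.exp(-abs(v))) + max(v, 0.0)


class ScaleGenerator:
    """One-hidden-layer MLP producing (mu, sigma) of a task-specific scaling vector."""

    def __init__(self, dim, hidden, rng):
        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w1, self.b1 = init(hidden, dim), [0.0] * hidden
        self.w_mu, self.b_mu = init(dim, hidden), [1.0] * dim  # bias 1.0 is an assumed starting scale
        self.w_sigma, self.b_sigma = init(dim, hidden), [0.0] * dim

    def __call__(self, task_embeddings):
        summary = mean_vector(task_embeddings)
        h = [max(0.0, v) for v in linear(summary, self.w1, self.b1)]  # ReLU
        mu = linear(h, self.w_mu, self.b_mu)
        sigma = [softplus(v) for v in linear(h, self.w_sigma, self.b_sigma)]
        return mu, sigma


def sample_scaling(mu, sigma, rng, training=True):
    """Reparameterized sample during meta-training; the mean during meta-testing."""
    if not training:
        return list(mu)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(0)
generator = ScaleGenerator(dim=4, hidden=8, rng=rng)
task = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.1, 0.0, 0.2]]  # toy embedded samples of one task
mu, sigma = generator(task)
alpha_train = sample_scaling(mu, sigma, rng)
alpha_test = sample_scaling(mu, sigma, rng, training=False)
print(alpha_train, alpha_test)
```

Because the sample is written as a deterministic function of the generator outputs and an independent noise draw, gradients of the task loss can flow back into the generator parameters.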
Auxiliary loss.
To learn the mapping from a set $D_i$ to the variational parameters of the local random variable $\boldsymbol{\alpha}_i$, we compute the mean vector of the embedded queries and the embedded supports as the task prototype and use it to generate the variational parameters. A problem is that the embeddings are not ready to generate good scaling parameters during early epochs. Existing approaches, including co-training [19] and pretraining [14], can alleviate this problem at the expense of computational efficiency. They pretrain or co-train an auxiliary classifier in a traditional supervised manner over the meta-sample $\mathbf{S}$, and then apply the pretrained embeddings to generate the task-specific parameters and fine-tune the embeddings during meta-training. Here, we propose an end-to-end algorithm that improves training efficiency in comparison with pretraining or co-training. Instead of minimizing (10) in Algorithm 2, we optimize the following loss function, in which an auxiliary weight is used, i.e.,
(11)
where , i.e., no scaling is used. Given a decay step size , starts from and linearly decays to as the number of epochs increases, i.e., . During the first epochs, the weight of the gradients is high and the algorithm learns the embeddings of PN. As the training proceeds, is updated to tune the learned embedding space. See the details in Algorithm 2.
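The linearly decaying auxiliary weight described above can be sketched as a simple schedule. The start and end values, the parameter name `decay_epochs`, and the additive way the two losses are combined in `combined_loss` are assumptions made for this illustration and are not taken from the loss in (11).

```python
def auxiliary_weight(epoch, decay_epochs, start=1.0, end=0.0):
    """Linear decay from `start` to `end` over `decay_epochs` epochs, then constant."""
    frac = min(max(epoch / decay_epochs, 0.0), 1.0)
    return start + (end - start) * frac


def combined_loss(scaled_loss, unscaled_loss, weight):
    """Assumed combination: scaled objective plus a weighted no-scaling loss."""
    return scaled_loss + weight * unscaled_loss


for epoch in (0, 50, 100, 150):
    print(epoch, auxiliary_weight(epoch, decay_epochs=100))
```

Early in training the unscaled term dominates, so the embeddings are learned much as in plain PN; as the weight decays, the scaled objective takes over and the generator tunes the embedding space.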
5. Experiments
To evaluate our methods, we plug them into two popular algorithms, prototypical networks (PN) [23] and TADAM [19], implemented with both Conv4 and ResNet12 backbone networks. As elaborated later, Table 1 shows our main results in comparison to state-of-the-art meta-algorithms, where it can be seen that our dimensional stochastic variational scaling algorithm outperforms other methods substantially. For TADAM, we incorporate our methods in conjunction with all the techniques proposed in their paper and still observe notable improvement.
Table 1: miniImageNet test accuracy

Backbones  Model  5-way 1-shot  5-way 5-shot
Conv4  Matching networks [25]
Relation Net [24]
Meta-learner LSTM [22]
MAML [3]
LLAMA [8]
REPTILE [18]
PLATIPUS [4]
ResNet12  adaResNet [17]
SNAIL [16]
TADAM [19]
TADAM Euclidean + DSVS (ours)
PN Euclidean [23] *
PN Cosine [23] *
PN Euclidean + DSVS (ours) *
PN Cosine + DSVS (ours) *

Table 2

Model  5-way 1-shot (Euclidean)  5-way 1-shot (Cosine)  5-way 5-shot (Euclidean)  5-way 5-shot (Cosine)
PN
PN + SVS
PN + DSVS
PN + DAVS


5.1. Dataset and Experimental Setup
miniImageNet.
The miniImageNet dataset [25] consists of 100 classes with 600 images per class. We follow the data split suggested by Ravi and Larochelle [22], where the dataset is separated into a training set with 64 classes, a testing set with 20 classes and a validation set with 16 classes.
Model architecture.
To evaluate our methods with different backbone networks, we re-implement PN with the Conv4 architecture proposed by Snell et al. [23] and the ResNet12 architecture adopted by Oreshkin et al. [19], respectively.
The Conv4 backbone contains four convolutional blocks, where each block is sequentially composed of a 3 × 3 convolution with 64 filters, a batch normalization layer, a ReLU nonlinearity and a 2 × 2 max-pooling layer.
The ResNet12 architecture contains 4 Res blocks, where each block consists of 3 convolutional blocks followed by a 2 × 2 max-pooling layer.
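As a quick sanity check on the Conv4 description, the following sketch traces the spatial size through four conv-plus-pool blocks. The 84 × 84 input resolution and the size-preserving ("same") padding of the 3 × 3 convolutions are assumptions, since neither is stated in this section; the function name is hypothetical.

```python
def conv4_output_shape(height, width, filters=64, blocks=4):
    """Trace (channels, height, width) through `blocks` conv + 2x2 max-pool stages.

    Assumes each 3x3 convolution keeps the spatial size (padding 1) and each
    2x2 max-pool halves it, rounding down.
    """
    for _ in range(blocks):
        height, width = height // 2, width // 2
    return filters, height, width


channels, h, w = conv4_output_shape(84, 84)
print(channels, h, w, channels * h * w)  # 64 5 5 1600
```

Under these assumptions the flattened embedding has 1600 dimensions, which is the number of entries a dimensional scaling vector would need for this backbone.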
Training details.
We follow the episodic training strategy proposed by [25]. In each episode, classes and shots per class are selected from the training set, the validation set or the testing set. For fair comparisons, the number of queries, the sampling strategy of queries, and the testing strategy are designed in line with PN or TADAM.
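Episodic sampling can be sketched as follows. This is a generic illustration of drawing one N-way K-shot episode, not the exact sampler used in the experiments: the function name, the toy dataset, and the choice of 15 queries per class are assumptions.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw n_way classes, then k_shot supports and n_query queries per class."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query


rng = random.Random(0)
# Toy dataset: 10 classes with 20 image identifiers each (hypothetical names).
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=15, rng=rng)
print(len(support), len(query))  # 5 75
```

Supports and queries of a class are drawn without replacement from the same pool, so no image appears in both sets of an episode.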
For Conv4, we use the Adam optimizer with a learning rate of without weight decay. The total number of training episodes is for Conv4. And for ResNet12, we use the SGD optimizer with momentum , weight decay and episodes in total. The learning rate is initialized as and decayed at episode steps , and . Besides, we use gradient clipping when training ResNet12. The reported results are the mean accuracies with confidence intervals estimated by runs.
We normalize the embeddings before computing the distances between them. As shown in Eq. (7) and (8), the gradient magnitude of the variational metric scaling parameters is proportional to the norm of the embeddings. Therefore, to foster the learning of these parameters, we adopt a separate learning rate for all variational metric scaling parameters.
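The mean accuracy with a confidence interval mentioned in the training details can be computed as in the sketch below. The normal-approximation interval and the default `z=1.96` (a 95% level) are assumptions for illustration; the function name is hypothetical.

```python
import math


def mean_with_confidence(accuracies, z=1.96):
    """Mean and half-width of a normal-approximation confidence interval."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / (n - 1)  # sample variance
    half_width = z * math.sqrt(variance / n)
    return mean, half_width


mean, half_width = mean_with_confidence([0.4, 0.6])
print(f"{mean:.3f} +/- {half_width:.3f}")
```

With many test episodes the half-width shrinks as one over the square root of the number of runs, which is why results are usually averaged over a large number of episodes.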
5.2. Evaluation
The effectiveness of our proposed methods is illustrated progressively in Table 2, including stochastic variational scaling (SVS), dimensional stochastic variational scaling (DSVS) and dimensional amortized variational scaling (DAVS). On both 5-way 1-shot and 5-way 5-shot classification, noticeable improvement can be seen after incorporating SVS into PN. Compared to SVS, DSVS is more effective, especially for way shot classification. DAVS performs even better than DSVS by considering task-relevant information.
Performance of SVS.
We study the performance of SVS by incorporating it into PN. We consider both way and way training scenarios. The prior distribution of the metric scaling parameter is set as and the variational parameters are initialized as , . The learning rate is set to be .
Results in Table 3 show the effect of the metric scaling parameter (SVS). Particularly, significant improvement is observed for the case of PN with cosine similarity and for the case of way shot classification. Moreover, it can be seen that with metric scaling there is no clear difference between the performance of Euclidean distance and cosine similarity.
We also compare the performance of a fixed $\sigma$ with a trainable $\sigma$. We add a shifted ReLU activation function on the learned $\sigma$ to ensure it is positive. Nevertheless, in our experiments, we observe that the training is very stable and the variance is always positive even without the ReLU activation function. We also find that there is no significant difference between the two settings. Hence, we treat $\sigma$ as a fixed hyperparameter in other experiments.

Table 3

Model  5-way 1-shot (5-way training)  5-way 1-shot (20-way training)  5-way 5-shot (5-way training)  5-way 5-shot (20-way training)
PN Euclidean
PN Cosine
PN Euclidean + SVS ()
PN Cosine + SVS ()
PN Euclidean + SVS (learned $\sigma$)
PN Cosine + SVS (learned $\sigma$)

Table 4

Auxiliary training  Prior  5-way 1-shot (Euclidean)  5-way 1-shot (Cosine)  5-way 5-shot (Euclidean)  5-way 5-shot (Cosine)

Performance of DSVS.
We validate the effectiveness of DSVS by incorporating it into PN and TADAM, with the results shown in Table 1 and Table 2. On 5-way 1-shot classification, for PN, we observe about and absolute increase in test accuracy with Conv4 and ResNet12 respectively; for TADAM, absolute increase in test accuracy is observed. The learning rate for DSVS is set to be . Here we use a large learning rate since the gradient magnitude of each dimension of the metric scaling vector is extremely small after normalizing the embeddings.
Performance of DAVS.
We evaluate the effectiveness of DAVS by incorporating it into PN. We use a multilayer perceptron (MLP) with one hidden layer as the generator . The learning rate is set to be . In Table 2, on both 5-way 1-shot and 5-way 5-shot classification, we observe about absolute increase in test accuracy for dimensional amortized variational scaling (DAVS) over SVS with a single scaling parameter. In our experiments, the hyperparameter is selected from the range of with 200 training epochs in total.
Ablation study of DAVS.
To assess the effects of the auxiliary training strategy and the prior information, we provide an ablation study, shown in Table 4. Without the auxiliary training and the prior information, DAVS degenerates to a task-relevant weight-generating approach [12]. Noticeable performance drops can be observed after removing the two components. Removing either one of them also leads to a performance drop, but a smaller one than removing both. The empirical results confirm the necessity of the auxiliary training and a proper prior distribution for amortized variational metric scaling.
5.3. Robustness Study
We also design experiments to show that: 1) the convergence speed of existing methods does not slow down after incorporating our methods; 2) given the same prior distribution, the variational parameters converge to the same values in spite of different learning rates and initializations.
For the iterative update of the model parameters and the variational parameters , a natural question is whether it will slow down the convergence of the algorithm. Figure 2 shows the learning curves of PN and PN+DSVS on both 5-way 1-shot and 5-way 5-shot classification. It can be seen that the incorporation of SVS does not reduce the convergence speed.
We plot the learning curves of the variational parameter w.r.t. different initializations and different learning rates . Given the same prior distribution , Fig. 3(a) shows that the variational parameter with different initializations will converge to the same value. Fig. 3(b) shows that is robust to different learning rates.
Conclusion
In this paper, we have proposed a generic variational metric scaling framework for metric-based meta-algorithms, under which three efficient end-to-end methods are developed. To learn a better embedding space to fit the data distribution, we have considered the influence of metric scaling on the embedding space by taking into account data-dependent and task-dependent information progressively. Our methods are lightweight and can be easily plugged into existing metric-based meta-algorithms to improve their performance.
Acknowledgements
We would like to thank the anonymous reviewers for their helpful comments. This research was supported by the grants of DaSAIL projects P0030935 and P0030970 funded by PolyU (UGC).
References

[1] (2015) Aggregating deep convolutional features for image retrieval. arXiv preprint arXiv:1510.07493.
[2] (2005) Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[3] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135.
[4] (2018) Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527.
[5] (2017) Gaussian prototypical networks for few-shot learning on Omniglot. arXiv preprint arXiv:1708.02735.
[6] (2017) Few-shot learning with graph neural networks. International Conference on Learning Representations.
[7] (2018) Meta-learning probabilistic inference for prediction. International Conference on Learning Representations.
[8] (2018) Recasting gradient-based meta-learning as hierarchical Bayes. International Conference on Learning Representations.
[9] (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1321–1330.
[10] (2019) Few-shot object detection via feature reweighting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8420–8429.
[11] (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
[12] (2018) Task-adaptive feature reweighting for few shot classification. In Asian Conference on Computer Vision, pp. 649–662.
[13] (2006) One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4), pp. 594–611.
[14] (2019) Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–10.
[15] (2017) SphereFace: deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220.
[16] (2018) A simple neural attentive meta-learner. International Conference on Learning Representations.
[17] (2017) Rapid adaptation with conditionally shifted neurons. arXiv preprint arXiv:1712.09926.
[18] (2018) Reptile: a scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999.
[19] (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 719–729.
[20] (2017) L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507.
[21] (2019) Amortized Bayesian meta-learning. International Conference on Learning Representations.
[22] (2016) Optimization as a model for few-shot learning.
[23] (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
[24] (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208.
[25] (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
[26] (2018) Rethinking feature distribution for loss functions in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9117–9126.
[27] (2017) NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 1041–1049.
[28] (2018) Heated-up softmax embedding. arXiv preprint arXiv:1809.04157.
Appendix A. SVS
A.1. Comparison with a Special Case
Ranjan et al. [20] proposed to train the single scaling parameter together with the model parameters. Their method can be seen as a special case of our stochastic variational scaling method SVS under the conditions of , and . We compare our method with theirs by varying the initialization of ().
Noticeably, our method achieves absolute improvements of , , and for four different initializations respectively. As shown in Table 5, our method is stable w.r.t. the initialization of , thanks to the prior information introduced in our Bayesian framework which may counteract the influence of initialization.
Method  1  10  100  1000 

PN+SVS  
PN (Training together) 
A.2. Sensitivity to the Prior and Initialization
In a Bayesian framework, the prior distribution has a significant impact on learning the posterior distribution. For stochastic variational inference, initialization is another key factor for learning the variational parameters. Here, we conduct experiments of PN+SVS with different prior distributions and initializations. The results of way shot classification are summarized in Table 6. It can be observed that our method is not sensitive to the prior and the initialization as long as neither of them is too small.
Appendix B. DSVS
B.1. Distributions of the Mean Vector
Figure 4 illustrates the distributions of the mean vector during the meta-training procedure of way shot and way shot classification respectively. Darker colour means more frequent occurrence.
At step , all dimensions of are initialized as . They diverge as the meta-training proceeds, which shows that DSVS successfully learns different scaling parameters for different dimensions. It is also worth noting that for both tasks, the distribution of converges eventually (after steps).
Appendix C. DAVS
C.1. Viewing the Learned Metric Scaling Parameters of PN+DAVS
DAVS generates variational distributions for different tasks, from which the task-specific scaling parameters are sampled. Below we print out the scaling vectors learned by DAVS on two different testing tasks for 5-way 5-shot classification, where only the first ten dimensions are displayed. It can be seen that DAVS successfully learns tailored scaling vectors for different tasks.
Scaling parameters for Task 1: [64.9170, 22.4030, 13.4468, 2.3949, 28.2470, 29.7770, 54.4221, 60.9279, 2.3008, 147.5304].
Scaling parameters for Task 2: [60.2564, 21.1672, 12.8457, 2.3603, 26.6194, 27.9963, 50.6127, 56.5965, 2.2672, 135.2510].
Appendix D. Implementation Details
D.1. Sampling from the Variational Distribution
We adopt the following sampling strategy for the three proposed approaches. For meta-training, we sample once per task from the variational distribution for the metric scaling parameter; for meta-testing, we use the mean of the learned Gaussian distribution as the metric scaling parameter. The computational overhead is very small and can be ignored.