AdaptationAgnosticMetaLearning
source code to ICLR'19, 'A Closer Look at Few-shot Classification'
Many meta-learning algorithms can be formulated as an interleaved process, in the sense that task-specific predictors are learned during inner-task adaptation and meta-parameters are updated during meta-update. The standard meta-training strategy needs to differentiate through the inner-task adaptation procedure to optimize the meta-parameters. This imposes the constraint that the inner-task algorithm must be solved analytically. Under this constraint, only simple algorithms with analytical solutions can be used as inner-task algorithms, which limits model expressiveness. To lift this limitation, we propose an adaptation-agnostic meta-training strategy. With the proposed strategy, we can apply stronger algorithms (e.g., an ensemble of different types of algorithms) as the inner-task algorithm and achieve superior performance compared with popular baselines. The source code is available at https://github.com/jiaxinchen666/AdaptationAgnosticMetaLearning.
Meta-learning is a promising approach to endowing machines with the ability to adapt quickly to new environments from only a few experiences. From a unified view, the meta-training procedure commonly used by existing meta-algorithms is an interleaved process consisting of inner-task adaptation and meta-update. During inner-task adaptation, the inner-task algorithm runs through the support set and outputs a predictor parameterized by task-specific parameters. During meta-update, the loss of the task-specific predictor over the query set is minimized to update the meta-parameters that are shared by all tasks. The update rule for the meta-parameters is formulated as follows.
(1) |
In this formulation, optimizing the meta-parameters requires differentiating through the inner-task adaptation. To obtain an explicit, differentiable meta-objective function and its gradient w.r.t. the meta-parameters in Eq. (1), we must compute the derivative of the task-specific parameters with respect to the meta-parameters. Hence, the inner-task algorithm must be solvable analytically, i.e., the task-specific parameters must be an analytical expression of the meta-parameters,
(2) |
To satisfy this requirement, only simple algorithms with closed-form solvers can be applied as the inner-task algorithm, such as nearest neighbor classification (Snell et al., 2017), ridge regression (Bertinetto et al., 2019), SVM (Lee et al., 2019) or gradient descent with a learned initialization (Finn et al., 2017), which significantly limits their expressive power.
To enrich the choices of inner-task algorithms and improve the expressive power of models, we propose an Adaptation-Agnostic Meta-training strategy (A2M) that removes the analytical dependency between the task-specific parameters and the meta-parameters. For inner-task adaptation, we fix the meta-parameters and use the support set to optimize the task-specific parameters. For meta-update, we fix the task-specific parameters and optimize the meta-parameters using the query set. The meta-parameters are updated by minimizing the predictor's loss over the embedded query set. Because it does not differentiate through the inner-task optimization process, the meta-training strategy is agnostic to the inner-task algorithm, which only needs to supply the solution for the task-specific parameters.
The generality and flexibility of the proposed meta-training strategy make it easy to combine different types of algorithms into an ensemble inner-task algorithm that exploits their advantages and alleviates their drawbacks. We introduce an instantiation of A2M and conduct extensive experiments on standard and cross-domain few-shot classification tasks over miniImageNet and CUB to evaluate its effectiveness. Experiments show that A2M achieves superior performance at low computational cost in comparison with popular baselines.
Meta-learning algorithms can be broadly categorized based on the type of the inner-task algorithm, namely, metric-based, gradient-based, model-based and meta-algorithms with closed-form solutions. Metric-based meta-algorithms learn a mapping from the data space to an embedding space, where the inner-task algorithm is a comparison algorithm based on a similarity metric (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Garcia and Bruna, 2018). Similarly, over the embedding space, meta-algorithms with closed-form solutions apply simple inner-task algorithms with a closed-form solution such as ridge regression (Bertinetto et al., 2019) or SVM (Lee et al., 2019). Gradient-based meta-learning (Finn et al., 2017) uses a gradient descent optimizer with a learned initialization of a deep neural network as the inner-task algorithm.
Decoupled training strategies have been explored in the meta-learning and multi-task learning literature (Franceschi et al., 2018; Frans et al., 2017; Yang and Hospedales, 2016; Zintgraf et al., 2019; Rajeswaran et al., 2019). Franceschi et al. (2018) proposed a bilevel programming framework for hyperparameter optimization and meta-learning, which is essentially similar to our proposed training strategy. Similar training strategies are adopted in Frans et al. (2017) and Yang and Hospedales (2016) for reinforcement learning and multi-task learning respectively, which divide a network into shared layers and task-specific layers and update them iteratively. However, the motivation and use of the training strategy in these works are entirely different from ours. We motivate the decoupled training procedure from an adaptation-agnostic perspective for meta-learning, which leads to a flexible inner-task algorithm with high expressiveness. Rajeswaran et al. (2019) decouple the meta-training procedure by using implicit gradients, which can be seen as an approximation of MAML.
The goal of a meta-algorithm is to learn an inner-task algorithm that can adapt quickly to new tasks drawn from a task distribution. During meta-training, given a set of training tasks, the meta-algorithm observes one meta-sample per task, consisting of the training (support) set and the test (query) set of that task. Each data sample pairs a feature from the feature space with a label from the label space. The support set and the query set of each task have fixed sizes. After being trained on these tasks, the meta-algorithm outputs an inner-task algorithm.
We observe that existing meta-algorithms share a meta-training procedure in which the gradient of the meta-parameters is computed through the inner-task adaptation: the meta-algorithm updates the meta-parameters by minimizing the loss of the task-specific predictor over the query set of each task. The update rule for the meta-parameters is
(3) |
where the predictor is parameterized by the task-specific parameters. Under update rule (3), the gradients of the meta-parameters are back-propagated through the task-specific parameters. To make this back-propagation feasible, most existing meta-algorithms follow a common meta-training procedure, shown in Fig. 1. First, the task-specific parameters are computed directly from the meta-parameters, i.e.,
(4) |
where the mapping from the meta-parameters to the task-specific parameters is an analytical expression. Then, the task-specific parameters are plugged back into the meta-objective function (3), and the meta-parameters are optimized by the gradient propagated through them. For example, the task-specific parameters of MAML (Finn et al., 2017), a typical gradient-based meta-algorithm, are
(5) |
Then, the gradient of the meta-parameters includes a second-order gradient, because the task-specific parameters are themselves obtained by a gradient step on the meta-parameters. It turns out that the choice of inner-task algorithm, each with a different analytical expression, is the key difference among existing meta-algorithms. Besides gradient-based meta-algorithms such as MAML (Finn et al., 2017), the other popular meta-algorithms can be unified from this perspective.
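To make the second-order term concrete, the derivation below writes out one inner gradient step and the resulting chain rule. The notation is assumed for illustration and is not taken from the source: $\theta$ for the meta-parameters, $\phi_i$ for the task-specific parameters of task $i$, $\alpha$ for the inner step size, and $\mathcal{L}^{S}_i$, $\mathcal{L}^{Q}_i$ for the support and query losses of task $i$.

```latex
\begin{align}
\phi_i(\theta) &= \theta - \alpha \nabla_\theta \mathcal{L}^{S}_i(\theta), \\
\frac{\partial \phi_i}{\partial \theta} &= I - \alpha \nabla^2_\theta \mathcal{L}^{S}_i(\theta), \\
\nabla_\theta \mathcal{L}^{Q}_i\big(\phi_i(\theta)\big)
  &= \Big(I - \alpha \nabla^2_\theta \mathcal{L}^{S}_i(\theta)\Big)^{\top}
     \nabla_{\phi}\mathcal{L}^{Q}_i(\phi)\Big|_{\phi=\phi_i(\theta)} .
\end{align}
```

The Hessian $\nabla^2_\theta \mathcal{L}^{S}_i$ in the last line is the second-order term; it appears only because $\phi_i$ is treated as a differentiable function of $\theta$.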
Metric-based meta-algorithms. The inner-task algorithm of a metric-based meta-algorithm is a nearest neighbor algorithm with a distance function in the metric space. For matching networks (Vinyals et al., 2016), the nearest neighbor algorithm is non-parametric, so there is no explicit training in inner-task adaptation. For prototypical networks (Snell et al., 2017), the task-specific parameters are the mean vectors of same-class support samples, which can be computed as
(6) |
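As an illustration of the mean-centroid computation, the following NumPy sketch builds one prototype per class from already-embedded support samples and scores queries by negative squared Euclidean distance. The function names, the toy embeddings, and the choice of squared Euclidean distance are assumptions made for this example, not details taken from the source.

```python
import numpy as np

def class_prototypes(support_emb, support_labels, n_classes):
    """Return the mean embedding of the support samples of each class."""
    return np.stack(
        [support_emb[support_labels == k].mean(axis=0) for k in range(n_classes)]
    )

def nearest_prototype_logits(query_emb, prototypes):
    """Score each query by negative squared Euclidean distance to each prototype."""
    diff = query_emb[:, None, :] - prototypes[None, :, :]
    return -np.sum(diff ** 2, axis=-1)

# Toy embedded support set: two classes, two samples each (hypothetical values).
support_emb = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]])
support_labels = np.array([0, 0, 1, 1])
prototypes = class_prototypes(support_emb, support_labels, n_classes=2)

# Each query is assigned to the class of its closest prototype.
query_emb = np.array([[0.5, 1.0], [5.0, 5.0]])
pred = nearest_prototype_logits(query_emb, prototypes).argmax(axis=1)
```

Because the prototypes are plain averages of embeddings, they are an explicit function of the embedding network's output, which is what makes this inner-task algorithm analytically differentiable.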
Model-based meta-algorithms. Some model-based meta-algorithms avoid inner-task training by learning a meta amortization network, with its own parameters, that generates the task-specific parameters from the support set (Gordon et al., 2018b, a), i.e.,
(7) |
Both the meta-parameters and the parameters of the amortization network are global parameters to be optimized in meta-training.
Meta-algorithms with closed-form solvers. Several meta-algorithms adopt a simple algorithm with a convex objective function as the inner-task algorithm, so that the task-specific parameters have a closed-form solution (Bertinetto et al., 2019; Lee et al., 2019). For example, Bertinetto et al. (2019) use ridge regression as the inner-task algorithm, and the closed-form solution is
(8) |
For brevity, and , where (Bertinetto et al., 2019).
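For reference, a closed-form ridge regression solve can be sketched in a few lines of NumPy. This uses the textbook form W = (XᵀX + λI)⁻¹XᵀY on embedded support features with one-hot targets; the variable names, the regularization value, and the random toy data are assumptions, and the exact formulation used by Bertinetto et al. (2019) may differ.

```python
import numpy as np

def ridge_closed_form(features, targets, lam):
    """Solve min_W ||features @ W - targets||^2 + lam * ||W||^2 in closed form."""
    dim = features.shape[1]
    gram = features.T @ features + lam * np.eye(dim)
    # Solving the linear system is preferable to forming an explicit inverse.
    return np.linalg.solve(gram, features.T @ targets)

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 4))                # 10 embedded support samples
targets = np.eye(5)[rng.integers(0, 5, size=10)]   # one-hot labels, 5 classes
weights = ridge_closed_form(features, targets, lam=1.0)

# At the optimum the gradient of the ridge objective vanishes.
residual_grad = features.T @ (features @ weights - targets) + 1.0 * weights
```

Every operation here is differentiable with respect to `features`, so gradients can flow from the query loss back into the embedding network, which is the property the closed-form family relies on.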
From this unified perspective, the key constraint in designing a meta-algorithm is that the inner-task algorithm must have an explicit analytical solution, which significantly limits the expressiveness of inner-task algorithms. This constraint is caused by the standard meta-training strategy, in which optimizing the meta-parameters during meta-update must differentiate through the inner-task adaptation. To relax this constraint, we propose an adaptation-agnostic meta-training strategy that makes no assumption on such a dependency.
In particular, we do not require the optimization of the meta-parameters to back-propagate through the inner-task adaptation; instead, we conduct the two steps separately and iteratively. In inner-task adaptation, the meta-parameters are fixed, and the support set is fed to the shared embedding network and used to train the task-specific predictor. In meta-update, the task-specific parameters are fixed and the query set is used to optimize the meta-parameters. The iteration scheme is formulated as follows:
(9) | |||
(10) |
where the meta-parameters are the global parameters shared by all tasks, and the task-specific parameters are the local parameters that differ among tasks. We call this training strategy adaptation-agnostic because, in Eq. (9), it allows any inner-task algorithm with any optimization algorithm to be used, as long as the meta loss function is differentiable w.r.t. the meta-parameters given the task-specific parameters, regardless of whether the task-specific parameters have an analytical expression in terms of the meta-parameters.
Without the requirement of an analytical solution, the choice of the inner-task algorithm is highly flexible. A natural choice is a neural network trained with a non-convex loss function, namely the cross-entropy loss. Since there is no restriction on the optimization algorithm or the network architecture, we simply use a multilayer perceptron (MLP) trained by SGD as an inner-task algorithm. The inner-task adaptation can be formulated as:
(11) |
where the task-specific parameters are randomly initialized for each task. This may make the inner-task algorithm more flexible and less prone to overfitting, as verified in our experiments.
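The alternating scheme can be sketched in NumPy with hand-written gradients. This is a minimal illustration under stated simplifications, not the authors' implementation: the embedding is a single linear map `embed_w`, the inner-task predictor is a linear softmax head standing in for the MLP, and `sample_task`, the step sizes and the synthetic Gaussian tasks are all hypothetical.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(logits, labels):
    return -np.mean(log_softmax(logits)[np.arange(len(labels)), labels])

def inner_adapt(embed_w, x_support, y_support, n_classes, rng, steps=200, lr=0.1):
    """Inner-task adaptation: train a freshly initialized head; embed_w stays fixed."""
    z = x_support @ embed_w
    head = 0.01 * rng.normal(size=(embed_w.shape[1], n_classes))
    onehot = np.eye(n_classes)[y_support]
    for _ in range(steps):
        probs = np.exp(log_softmax(z @ head))
        head -= lr * (z.T @ (probs - onehot)) / len(y_support)
    return head

def query_grad(embed_w, head, x_query, y_query):
    """Gradient of the query loss w.r.t. embed_w with the head held fixed.

    Nothing is differentiated through inner_adapt.
    """
    onehot = np.eye(head.shape[1])[y_query]
    probs = np.exp(log_softmax(x_query @ embed_w @ head))
    dlogits = (probs - onehot) / len(y_query)
    return x_query.T @ dlogits @ head.T

def sample_task(rng, n_classes=3, shots=5, queries=10, dim=8):
    """Hypothetical synthetic task: Gaussian blobs around random class centers."""
    centers = 3.0 * rng.normal(size=(n_classes, dim))
    def draw(n_per_class):
        y = np.repeat(np.arange(n_classes), n_per_class)
        return centers[y] + rng.normal(size=(len(y), dim)), y
    return draw(shots), draw(queries)

rng = np.random.default_rng(0)
embed_w = 0.1 * rng.normal(size=(8, 4))   # meta-parameters
for _ in range(20):                        # meta-training iterations
    (x_s, y_s), (x_q, y_q) = sample_task(rng)
    head = inner_adapt(embed_w, x_s, y_s, n_classes=3, rng=rng)    # adaptation
    embed_w = embed_w - 0.05 * query_grad(embed_w, head, x_q, y_q)  # meta-update
```

Because `inner_adapt` only returns the trained head, it could be swapped for any other fitting routine without changing `query_grad`, which is the sense in which the strategy is agnostic to the inner-task algorithm.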
An Ensemble Inner-Task Algorithm. The generality and flexibility of the proposed adaptation-agnostic meta-training strategy enable us to apply a powerful algorithm as the inner-task algorithm. We propose an ensemble that combines the advantages of different types of inner-task algorithms. As an instantiation, we combine the mean-centroid classification algorithm of Snell et al. (2017), the initialization-based inner-task algorithm of ANIL (Raghu et al., 2020; see footnote 1), and the MLP we propose (Eq. (11)) as the inner-task algorithm. As illustrated in Fig. 2, during inner-task adaptation, we train a bag of diverse algorithms separately with the embedded support set over the embedding space and obtain three independent predictors. Meta-update is performed by aggregating the predictions of all the predictors on the query set to obtain final predictions and then using the final predictions to update the shared meta-parameters. See details in Appendix A.
Footnote 1: Raghu et al. (2020) show that the simplified MAML, i.e., ANIL, achieves the same performance as MAML (Finn et al., 2017). Hence, it suffices to compare only with MAML.
The experiments are designed to evaluate the instantiation of A2M introduced in Sec. 4.2 on standard and cross-domain few-shot classification tasks. We evaluate our method on miniImageNet (Vinyals et al., 2016) with Conv-4 and ResNet-18 backbones under the standard 5-way 1-shot and 5-way 5-shot settings. See implementation details and the ablation study in Appendix B.
miniImageNet test accuracy | ||
---|---|---|
Model | -way -shot | -way -shot |
Matching Net (Vinyals et al., 2016) | ||
Relation Net (Sung et al., 2018) | ||
MAML (Finn et al., 2017) | ||
Protonet (Snell et al., 2017) | ||
MetaOptNet (Lee et al., 2019) | ||
A2M (Mean-centroid + MLP+ Init-based) | ||
Matching Net (Vinyals et al., 2016) | ||
Relation Net (Sung et al., 2018) | ||
MAML (Finn et al., 2017) | ||
Protonet (Snell et al., 2017) | ||
MetaOptNet (Lee et al., 2019) | ||
A2M (Mean-centroid + MLP+ Init-based) |
(1) Standard few-shot classification. Referring to Table 1, for both 1-shot and 5-shot tasks, A2M achieves comparable or superior performance compared with state-of-the-art meta-algorithms. Remarkably, using ResNet-18 in Table 1, A2M outperforms the best meta-algorithms with absolute accuracy increases in both the 1-shot and 5-shot tasks, demonstrating the effectiveness of A2M. (2) Cross-domain few-shot classification. As shown in Table 2, A2M achieves absolute increases over both MAML and PN. The results verify the generalization ability of A2M, which can even generalize to new tasks with a domain shift.
In this paper, we provided a unified view of the commonly used meta-training strategy and proposed an adaptation-agnostic meta-training strategy that is more general, more flexible and less prone to overfitting. In future work, we aim to analyze the theoretical properties of the adaptation-agnostic meta-training strategy and explore more powerful inner-task algorithms.
Finn et al. (2017). Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135.
Gordon et al. (2018). Third Workshop on Bayesian Deep Learning.
He et al. (2016). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Yang and Hospedales (2016). Deep multi-task representation learning: a tensor factorisation approach. arXiv preprint arXiv:1605.06391.
As illustrated in Fig. 2, during inner-task adaptation, we train a bag of diverse algorithms separately with the embedded support set and obtain one predictor per algorithm. Next, meta-update is performed by aggregating the predictions of all the predictors on the query set to obtain final predictions and then using the final predictions to update the shared meta-parameters. Formally, the meta-training procedure of A2M is formulated as follows,
(12) |
An instantiation of A2M. A2M (Eq. (12)) is a very general framework that can integrate essentially any algorithm as the inner-task algorithm for diverse purposes. Since this paper focuses on few-shot classification, we instantiate A2M with an ensemble of the mean-centroid classification algorithm of ProtoNets (Snell et al., 2017), the initialization-based inner-task algorithm as in MAML (Finn et al., 2017), and a two-layer MLP inner-task classifier proposed by us (Eq. (11)) as the ensemble inner-task algorithm for meta-learning. Note that for the inner-task algorithm of MAML, we use the version from ANIL (Raghu et al., 2020), which combines naturally with the A2M framework. In addition, we choose to combine the mean-centroid classification algorithm and the initialization-based algorithm because of their complementary capabilities. The former has low model capacity but is stable, while the latter has high model expressiveness but can easily overfit. The effectiveness of the proposed ensemble inner-task algorithm is empirically verified by our experiments in Sec. 5.
For inner-task adaptation, as illustrated in Fig. 2, the three algorithms are trained over the embedded support set independently, i.e.,
(13) |
where the three inner-task algorithms are the mean-centroid classification algorithm, the initialization-based algorithm and the two-layer MLP, respectively. Note that for the initialization-based algorithm, the initialization is shared by all tasks and updated during meta-update.
In meta-update, as shown in Fig. 2, for any query sample, the predictions of the three predictors are aggregated to produce the final prediction. In our instantiation, we sum all the predictions and use the result as the query's logits for computing the cross-entropy loss. Specifically, given the task-specific parameters of the three predictors, the meta-update minimizes the cross-entropy loss of the summed predictions, where the mean-centroid prediction is computed from the distance between the query's embedding and each class prototype.
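The aggregation step can be illustrated with a small NumPy sketch: per-predictor logits on the query set are summed and the cross-entropy of the sum is the meta loss. The three logit arrays below are hypothetical stand-ins for the outputs of the mean-centroid, initialization-based and MLP predictors; using negative distances as the mean-centroid logits is an assumption made for this example.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def ensemble_query_loss(logits_list, labels):
    """Sum the predictors' logits and return (summed logits, cross-entropy loss)."""
    total = np.sum(logits_list, axis=0)
    loss = -np.mean(log_softmax(total)[np.arange(len(labels)), labels])
    return total, loss

# Hypothetical outputs for 2 query samples and 3 classes.
centroid_logits = -np.array([[0.2, 3.0, 4.0], [5.0, 0.5, 2.0]])  # negative distances
init_logits = np.array([[1.0, 0.0, -1.0], [0.0, 0.5, 0.2]])
mlp_logits = np.array([[0.5, 0.1, 0.0], [-0.3, 1.0, 0.4]])
labels = np.array([0, 1])

total, loss = ensemble_query_loss([centroid_logits, init_logits, mlp_logits], labels)
pred = total.argmax(axis=1)
```

Summing logits keeps the aggregate differentiable with respect to whatever each predictor shares with the embedding network, so a single backward pass on `loss` would update the shared meta-parameters.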
The miniImageNet dataset (Vinyals et al., 2016) consists of 100 classes with 600 images per class. The dataset is split into a training set with 64 classes, a testing set with 20 classes and a validation set with 16 classes (Ravi and Larochelle, 2017). Following convention, the images are cropped to 3×84×84 and 3×224×224 when using CNN-based (Vinyals et al., 2016) and ResNet-based model architectures (Chen et al., 2019) respectively.
To achieve a fair comparison, we employ the consistent experimental environment proposed by Chen et al. (2019) and strictly follow its training details. Specifically, we compare performance using the widely used Conv-4 backbone as in Snell et al. (2017) and the ResNet-18 backbone adopted in their environment. We have not applied any high-way or high-shot training strategy. For the optimizer, we use Adam (Kingma and Ba, 2014) as the meta-optimizer with a fixed learning rate. For the cross-domain tasks, we train models on the entire miniImageNet dataset. The meta-validation and meta-test of the models use the validation set and test set of the CUB dataset respectively.
For a fair comparison, we compare with state-of-the-art methods that have an implementation similar to ours (e.g., using the same backbone network). In this paper, we use a standard ResNet-18 backbone (He et al., 2016). In contrast, MetaOptNet (Lee et al., 2019) and TADAM (Oreshkin et al., 2018) use a ResNet-12 backbone, and LEO (Rusu et al., 2018) uses a WRN-28-10 backbone. Besides, we do not use techniques such as DropBlock regularization, label smoothing and weight decay, which MetaOptNet (Lee et al., 2019) adopts to increase performance. Hence, we do not compare with these methods.
miniImageNet test accuracy | ||
---|---|---|
Model | -way -shot | -way -shot |
Matching Net (Vinyals et al., 2016) | ||
Relation Net (Sung et al., 2018) | ||
Meta LSTM (Ravi and Larochelle, 2017) | ||
SNAIL (Mishra et al., 2018) | ||
LLAMA (Grant et al., 2018) | ||
REPTILE (Nichol and Schulman, 2018) | ||
PLATIPUS (Finn et al., 2018) | ||
GNN (Garcia and Bruna, 2018) | ||
R2-D2 (high) (Bertinetto et al., 2019) | ||
MAML (Finn et al., 2017) | ||
Protonet (Snell et al., 2017) | ||
MetaOptNet (Lee et al., 2019) | ||
A2M (Mean-centroid + MLP+ Init-based) | ||
Matching Net (Vinyals et al., 2016) | ||
Relation Net (Sung et al., 2018) | ||
MAML (Finn et al., 2017) | ||
Protonet (Snell et al., 2017) | ||
MetaOptNet (Lee et al., 2019) | ||
A2M (Mean-centroid + MLP+ Init-based) |
For the standard few-shot scenario, we conduct experiments on 5-way 1-shot and 5-way 5-shot classification on miniImageNet with the Conv-4 and ResNet-18 backbones. The results are shown in Table 1. For both 1-shot and 5-shot tasks, our model achieves comparable or superior performance compared with state-of-the-art meta-algorithms. Remarkably, on the ResNet-18 backbone in Table 1, A2M outperforms the best meta-algorithms with absolute accuracy increases in both the 1-shot and 5-shot tasks, demonstrating the effectiveness of A2M.
To further examine the generalization ability of our method, we conduct experiments on the challenging cross-domain classification task proposed by Chen et al. (2019). The results are shown in Table 2. Here, D-MLP denotes decoupled meta-training with the MLP inner-task algorithm proposed in Sec. 4.2. On this task, D-MLP achieves an absolute increase over MAML (Finn et al., 2017), which indicates that D-MLP is less prone to overfitting than MAML. Our ensemble framework A2M achieves absolute increases over D-MLP, MAML (Finn et al., 2017) and PN (Snell et al., 2017). The results show that our adaptation-agnostic ensemble framework helps the meta-net learn more general structures that adapt better to new tasks with a domain shift.
Mean-centroid | MLP | Init-based | -way -shot | -way -shot |
---|---|---|---|---|
To further study our proposed A2M, we provide an ablation study using the Conv-4 backbone. In Table 4, we observe that A2M (mean-centroid + MLP + init-based) achieves the best results compared with each individual component or an ensemble of any two components. This further demonstrates that the ensemble method is effective.
Besides, we observe that A2M combines the strengths of the individual components while mitigating their drawbacks. On the one hand, consider the results of 5-way 1-shot classification: all variants of A2M achieve better results than the individual components. It is well known that models overfit extremely easily in the 1-shot scenario, and our framework is clearly capable of reducing classification variance in such cases. On the other hand, in 5-way 5-shot classification, the ensembles that include the mean-centroid component outperform (MLP + init-based), i.e., the ensemble without the mean-centroid component. This demonstrates the power of the mean-centroid component in preventing overfitting as the number of shots increases, and the ensemble gains this advantage once it incorporates the mean-centroid classification algorithm.
We provide a quantitative comparison by measuring the meta-training and meta-testing time over episodes, shown in Fig. 3. Our results are obtained on 5-way 1-shot models with the ResNet-18 backbone. Fig. 3 shows that A2M increases the running time by only a small margin even when combining three components in the ensemble, which supports our claim that A2M is efficient.