1 Introduction
Humans possess an innate ability to draw upon past experiences to efficiently solve new problems. From exposure to multiple environments, we are able to learn general concepts that can be leveraged to solve tasks in novel domains. This concept of transfer between problem domains is a core tenet of human intelligence. To endow intelligent learning systems with this ability, one can adhere to Multi-Task Learning (MTL), a paradigm that enables learning over multiple frameworks concurrently. A particular instance of MTL is Meta-Learning. This set of algorithms is designed to quickly learn and adapt to novel tasks. In general, this is done by exposing the learner to several tasks concurrently, allowing it to learn a common source of information between the tasks, as well as their differences.
This ability to discriminate between tasks is at the core of meta-learning models. A single task can be arbitrarily complex, while a set of related tasks can instead form a well-defined structure. In most cases, such a structure may be defined by only a small number of variables. That is, the space of related tasks can generally be assumed to be equipped with a very low intrinsic dimension. Effectively exploiting this structure enables the meta-learner to make efficient use of its acquired knowledge. Adapting to novel tasks then essentially amounts to finding and exploiting these intrinsic factors of variation in the task space.
To give an example of such a structure, imagine the regression problem of meta-learning a collection of tasks defined as the sum of sine waves of varying amplitude and phase. Learning a single task can be made arbitrarily complex by considering additional sine waves. However, when learning multiple tasks in conjunction, it suffices to identify the common form of the tasks and then learn only the relatively simple task structure induced by the varying task parameters. Figure 1 depicts such a structure learnt by GBML. As shown, the task structure is embedded in the task-adapted parameters themselves.
In this paper we aim to study this task structure more explicitly through the lens of Gradient-Based Meta-Learning (GBML) (Finn et al., 2017). This particular family of meta-learning methods adapts to a new task by learning how to generate a set of optimal parameters specific to that task. Recent work has shown that, in the context of classification, GBML requires updating only a subset of its parameters (Raghu et al., 2019; Oh et al., 2020). We expand on this perspective by demonstrating that GBML implicitly learns a subspace that reflects the inherent structure of the tasks for both classification and regression. To investigate this phenomenon, we borrow notions from dimensionality estimation and study the distribution of the parameters found by GBML after adaptation to each one of the tasks considered. By representing a task as the parameters of a task-adapted neural network, we obtain a high-dimensional representation of it. We can then leverage representation learning techniques such as dimensionality reduction to estimate the intrinsic dimensionality of the task space. We empirically show that GBML finds a subspace of a dimensionality that is optimal for the set of tasks at hand. This, in turn, provides us with an insight into the inherent subspace structure of common datasets.
Our contributions are as follows:

Empirical evidence that GBML methods naturally tend to a solution where the adapted parameters lie on a subspace of the smallest dimensionality possible given the meta-dataset.

A method to analyze the dimensionality of the task space of common datasets.
2 Related Work
Meta-Learning. The general idea of meta-learning tackles the setting of learning a new task given only a small set of labeled examples. To achieve this, one of the most prominent approaches is Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017). It learns a common parameter initialization across tasks that lies close to the optimal solution of each task. Extensions to MAML couple the optimization procedure with a learned set of parameters to increase its expressivity (Lee and Choi, 2018; Flennerhag et al., 2019; Park and Oliva, 2019).
Few-shot Classification. GBML for image classification has been extensively studied. Raghu et al. (2019) showed that GBML models naturally converge to the solution of adapting only the parameters of the last layer of an image classifier for N-way K-shot classification. They suggest that the test performance is correlated with the quality of the representation learned by the first part of the network. This has been confirmed by Oh et al. (2020), who also propose to freeze the adaptation of the last layer to incentivize the meta-learner to find a solution that makes use of more layers. Further analyses have compared the behavior of MAML with a contrastive learning framework, where the learned features exhibit invariance to the samples used in adaptation (Kao et al., 2021; Goldblum et al., 2020; Li et al., 2018a).
Dimensionality Reduction. A commonly accepted notion in machine learning is that high-dimensional data lies on a submanifold of much lower dimension than that of the ambient space. Dimensionality reduction techniques can be utilized to find this intrinsic dimensionality of data. Examples of such techniques are Principal Component Analysis (PCA) (Pearson, 1901) and Isomap (Tenenbaum et al., 2000). Isomap is a nonlinear dimensionality reduction technique that aims to preserve geodesics of the original space in a space of lower dimension d. It works by minimizing the reconstruction error, that is, the difference between distances in the original space and the embedded space w.r.t. the Isomap kernel. Previous work has utilized such techniques in an attempt to estimate the true intrinsic dimensionality of images (Pope et al., 2021).
Subspace Structure. Several works have investigated the geometry of the loss landscape of neural networks (Li et al., 2018b; Lengyel et al., 2020). In particular, it has been shown that there exist paths in the parameter space on which the loss remains low (Wortsman et al., 2021; Garipov et al., 2018; Frankle, 2020). Other types of structure, such as hyperplanes, have been explored in (Lengyel et al., 2020). Recently, subspace structure has been considered in online reinforcement learning by learning a subspace of policies, which enables better generalization during testing (Gaya et al., 2021). In this work, we extend this analysis to the meta-learning setting by identifying submanifold structure in the learnt parameter space.
3 Preliminaries
The general formulation of gradient-based meta-learning follows from (Finn et al., 2017). We consider a set of tasks τ sampled from a distribution p(τ), together with a parameterized function f_θ. Given a new task τ, GBML seeks to find a set of parameters θ such that applying one gradient step on the loss computed on a few datapoints of the task (the support set D_τ^s) results in the optimal set of parameters θ_τ*:
θ_τ* = θ − α ∇_θ L(f_θ; D_τ^s).    (1)
To find such a θ, GBML optimizes the expectation over all tasks of the loss of the model after adaptation, computed on the rest of the task's data (the query set D_τ^q):
min_θ E_{τ∼p(τ)} [ L(f_{θ_τ*}; D_τ^q) ].    (2)
Let T be the support of the distribution of tasks p(τ). GBML essentially learns a differentiable map Φ from the support T to the space of parameters of a neural network. Let M = Φ(T) denote the image of this map, that is, the space induced by the task-adapted parameters found through GBML. Since Φ is differentiable, we have that dim(M) ≤ dim(T). Assuming the tasks carry no redundancy, the two dimensions are expected to be equal. The intrinsic complexity of the tasks can then be reduced to the study of the properties of M.
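The two-level optimization of Equations 1 and 2 can be illustrated on a deliberately simple example. The snippet below is a minimal first-order sketch (not the full second-order MAML of Finn et al., 2017), assuming scalar linear tasks y = a·x with the slope a as the task parameter; all hyperparameters and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.01   # inner (Eq. 1) and outer (Eq. 2) learning rates

def loss_and_grad(w, X, y):
    # Mean-squared error of the linear model f_w(x) = w * x, and its gradient in w.
    pred = X * w
    return np.mean((pred - y) ** 2), np.mean(2 * (pred - y) * X)

w = rng.normal()                       # meta-parameters (theta)
for _ in range(2000):
    a = rng.uniform(-2, 2)             # sample a task: y = a * x
    Xs, Xq = rng.normal(size=5), rng.normal(size=5)   # support / query inputs
    ys, yq = a * Xs, a * Xq
    # Inner step (Eq. 1): one gradient step on the support loss.
    _, g_inner = loss_and_grad(w, Xs, ys)
    w_task = w - alpha * g_inner
    # Outer step (Eq. 2), first-order approximation: the query-loss gradient
    # at the adapted parameters is applied directly to the meta-parameters.
    _, g_outer = loss_and_grad(w_task, Xq, yq)
    w = w - beta * g_outer
```

After meta-training, a single inner step on a handful of support points already reduces the query loss of a new, unseen task.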
4 Evaluation Method
We aim to investigate the geometry of the parametric model space of gradient-based meta-learning methods for different meta-problems. We hypothesize that, since the space of tasks holds an intrinsic dimension d, our task-adapted parametric model space should be embedded in a low-dimensional space of the same dimension. To investigate this hypothesis, we make use of dimensionality reduction techniques. Given a set of task-adapted parameters Θ* = {θ_τ*}, we make the assumption that Θ* is sampled from a lower-dimensional manifold of intrinsic dimension d. A common way to estimate the intrinsic dimensionality is to reduce the dimensions of Θ* to a varying number of dimensions k. The intrinsic dimensionality would then correspond to the smallest k which still preserves the features of the original data space, as measured by the reconstruction error. We utilize Principal Component Analysis (PCA) (Pearson, 1901) and Isomap (Tenenbaum et al., 2000) to estimate this intrinsic dimensionality.
5 Experiments
We conduct a number of experiments to analyze the parameter space that emerges when applying GBML to a number of regression and classification problems. For our experiments we used Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), one of the most prominent members of the GBML family. To confirm that a structure of the parameters emerges independently of the expressiveness of the learner, we also used Meta-Curvature (MC) (Park and Oliva, 2019) in the experiments. MC relies on learning a matrix that preconditions the gradients during adaptation. We perform our analysis on a toy classification and regression task, and furthermore investigate few-shot classification on the miniImagenet and Omniglot datasets. For the toy experiments we consider a two-layer neural network with ReLU activations and a varying number of hidden units. For miniImagenet and Omniglot, we use a convolutional neural network with batch normalization, ReLU activations and max-pooling. We train the models with a fixed meta-batch size; for Isomap, we use a fixed neighborhood size.
5.1 Classification Task
In few-shot classification, it has been noted that GBML algorithms find a solution where only the last layers of the meta-learner get updated during adaptation (Raghu et al., 2019). In our first experiment, we expand upon this insight by studying the entire space of task-adapted parameters for a toy N-way K-shot classification problem. To create this task, we construct a set of prototype classes that are equally spaced on a grid. We uniformly sample a subset of N prototypes, each assigned to a distinct class. From each prototype, we then sample a set of examples. Adapting to a new task now involves changing the decision boundaries of a multinomial classification problem. We consider different datasets obtained by varying the number of classes N. To show that our method is architecture-agnostic, we train our model with varying numbers of hidden units in the last layer.
Since we are interested in identifying the intrinsic dimension beyond which no further increase in performance can be found, we normalize the Isomap scores by dividing them by the first (largest) value. The plots are shown in Figure 2. As the number of classes increases, the number of dimensions required for adaptation increases accordingly. For PCA, it can be seen that for a lower number of classes, more weight is put on a single principal component. This is reflected in the Isomap results, where a lower number of classes corresponds to a steeper curve and thus a lower dimension. Furthermore, the results are consistent across different architectures, as shown by the error bars. This shows that the subspace structure is agnostic to the model used and is entirely dependent on the task itself. We can note that for two classes, a one-dimensional embedded task space seems to be the predominant result. In the appendix, Figure 6 confirms that only a small subset of parameters in the last layers are updated for the binary classification problem, while for more classes we note a larger change. While Figure 6 reconfirms the results of (Raghu et al., 2019), we raise the point that it is not only a matter of which subset of parameters is updated, but of which subspace they lie on. Although the subset of parameters varies with the choice of architecture, the subspace is instead dependent only on the complexity of the problem itself.
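The dimensionality estimates used throughout this section can be reproduced in a few lines. The sketch below is a hypothetical stand-in: instead of real task-adapted networks, it generates synthetic "parameter vectors" with a known intrinsic dimension of two, then computes the PCA explained-variance spectrum and the Isomap reconstruction error normalized by its first value, as done for the curves in Figure 2.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)

# Stand-in for task-adapted parameters: 200 vectors in R^50 that actually
# vary along only d_true = 2 latent directions (plus a little noise).
d_true, n_params, n_tasks = 2, 50, 200
latent = rng.normal(size=(n_tasks, d_true))       # intrinsic task variables
embed = rng.normal(size=(d_true, n_params))       # fixed linear embedding
params = latent @ embed + 0.01 * rng.normal(size=(n_tasks, n_params))

# PCA: the explained-variance spectrum collapses after d_true components.
ratios = PCA().fit(params).explained_variance_ratio_

# Isomap: reconstruction error for embedding dimensions k = 1..5,
# normalized by the first (largest) value.
errors = []
for k in range(1, 6):
    iso = Isomap(n_neighbors=10, n_components=k).fit(params)
    errors.append(iso.reconstruction_error())
errors = np.array(errors) / errors[0]
```

The elbow of both curves at k = 2 recovers the intrinsic dimension the data was generated with.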
5.2 Regression Task
Next, we investigate whether the same results hold true for regression tasks as well. We construct a regression task by considering the sum of sine waves with different amplitudes. To put it explicitly, for each task τ we sample a set of amplitudes A_k for k = 1, …, K and construct a regression task as the sum:
f_τ(x) = ∑_{k=1}^{K} A_k sin(ω_k x),    (3)
where the frequencies ω_k are fixed across tasks.
Here, the number of sines K defines the intrinsic dimension of the task space, as the tasks vary only by the amplitudes A_k. For our experiments, we consider several values of K. We investigate the explained variance of PCA to compute an estimate of the dimensionality of the parametric model space. The results for PCA are shown in Figure 3, while the ones for Isomap are included in the appendix (Figure 7). From the figure, the strongest principal components coincide with the intrinsic dimensionality of the task. The effect declines as the complexity of the task increases, since there is an increased ambiguity for the meta-learner in learning complex functions from a limited amount of support data. This confirms our hypothesis that the intrinsic dimensionality of the task space is preserved in the space of parameters induced by the meta-learner.
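To make the structure of Equation 3 concrete, the sketch below samples tasks of this form and applies PCA directly to the task outputs on a fixed input grid, used here as a proxy for the adapted parameters since it requires no trained meta-learner. The frequencies and all other values are illustrative assumptions. Since each task is a linear combination of K fixed basis functions, exactly K principal components carry non-negligible variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
K = 3                                  # number of sines = intrinsic task dimension
omegas = np.array([1.0, 2.0, 3.0])     # fixed frequencies (assumed; see Eq. 3)
x = np.linspace(-5, 5, 100)            # fixed input grid

# Sample tasks: each task is identified by its amplitudes A_1..A_K.
n_tasks = 500
A = rng.uniform(0.1, 5.0, size=(n_tasks, K))
tasks = A @ np.sin(np.outer(omegas, x))   # each row: f_tau evaluated on the grid

# The sampled tasks span a K-dimensional linear subspace of R^100,
# so PCA finds exactly K non-negligible components.
ratios = PCA().fit(tasks).explained_variance_ratio_
```

The same procedure applied to the task-adapted parameters, rather than the task outputs, underlies the curves in Figure 3.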
5.3 Analysis of Common Datasets
We now turn to the more complex setting of estimating the intrinsic dimensionality of common classification tasks. We consider an N-way, K-shot few-shot learning setting with data from miniImagenet and Omniglot. For the architecture, we consider the convolutional architecture specified in (Vinyals et al., 2016). Figure 4 depicts the reconstruction error of Isomap for MAML and Meta-Curvature on both the miniImagenet and Omniglot tasks. The dimensionality reduction is performed at different epochs throughout training. The results for miniImagenet show both a higher variability during training and a seemingly lower estimated dimensionality than Omniglot, which is reflected in their performance on the test set as well. In the appendix we elaborate upon this observation.
5.4 Reconstruction Experiment
To further substantiate our findings, we perform an experiment involving reconstructing the task from the task-adapted parameters. We perform dimensionality reduction using PCA on the task-adapted parameters to find a representation of reduced dimension k. From this embedding, we train a learner in a supervised manner to reconstruct the task. For the sine experiment, we attempt to regress the amplitudes defined in Equation 3. For the classification experiments, we do not have access to the ground-truth task parameters. In this case, we consider classifying an image from a specific task given the embedding of the learnt parameters for that task. We perform the experiments for varying dimension k and evaluate the results by considering the regression or classification performance. The results for the sine experiment are shown in Figure 5. As can be seen, the performance does not increase further as the embedding size is increased beyond the intrinsic dimension of the task. We also evaluate the classification performance for different numbers of classes in the toy classification experiment. The results can be seen in Figure 10 in the appendix. However, we argue that the discrete nature of classification tasks poses a limit on the analysis of the dimensionality of the manifold.
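The reconstruction experiment can be sketched as follows. This is a hypothetical stand-in: instead of real task-adapted networks, it generates parameter vectors as a noisy linear function of K = 3 ground-truth amplitudes, then checks that regressing the amplitudes from a PCA embedding stops improving once the embedding dimension reaches the intrinsic one.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-in for task-adapted parameters in R^60, driven by K = 3 amplitudes.
K, n_tasks, dim = 3, 400, 60
A = rng.uniform(0.1, 5.0, size=(n_tasks, K))      # ground-truth amplitudes
W = rng.normal(size=(K, dim))                     # fixed embedding of the task space
params = A @ W + 0.05 * rng.normal(size=(n_tasks, dim))

scores = {}
for k in range(1, 7):
    z = PCA(n_components=k).fit_transform(params)  # k-dimensional embedding
    # Regress the amplitudes from the embedding; score is the (multi-output) R^2.
    scores[k] = LinearRegression().fit(z, A).score(z, A)
```

As in Figure 5, the score saturates at k = K: embedding dimensions beyond the intrinsic one add no information about the task.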
6 Conclusion and Future Work
In this work we proposed a method for analyzing the intrinsic dimensionality of the task space learnt by gradient-based meta-learning methods. We provided an empirical analysis revealing that GBML learns a low-dimensional subspace on which it performs adaptation. Furthermore, we related this observation to the generalization performance of the model. We believe our analysis adds a valuable contribution to recent work that employs subspace structure in GBML. A possible future line of work is applying our method to a larger set of models and datasets in order to gain further insight. As a second possible direction, the space of parameters can be studied using additional geometric techniques such as topological data analysis. In our experiments we observe that the dimension of the subspace changes during training and could possibly be correlated with performance. This opens up future directions of research further exploring this correlation. We hypothesize that, if the dimensionality of the subspace is known a priori, developing a regularizer that enforces it could improve the performance of GBML methods.
7 Acknowledgements
We thank Giovanni Luca Marchetti, Vladislav Polianskii and Marco Moletta for useful discussions. This work has been supported by the European Research Council (BIRD: 884807), the Swedish Research Council, the Knut and Alice Wallenberg Foundation, and H2020 CANOPIES.
References
Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: §1, §2, §3, §5.
Meta-learning with warped gradient descent. arXiv preprint arXiv:1909.00025. Cited by: §2.
Revisiting "Qualitatively characterizing neural network optimization problems". arXiv preprint arXiv:2012.06898. Cited by: §2.
Loss surfaces, mode connectivity, and fast ensembling of DNNs. Advances in Neural Information Processing Systems 31. Cited by: §2.
Learning a subspace of policies for online adaptation in reinforcement learning. arXiv preprint arXiv:2110.05169. Cited by: §2.
Unraveling meta-learning: understanding feature representations for few-shot tasks. In International Conference on Machine Learning, pp. 3607–3616. Cited by: §2.
MAML is a noisy contrastive learner. arXiv preprint arXiv:2106.15367. Cited by: §2.
Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, pp. 2927–2936. Cited by: §2.
GENNI: visualising the geometry of equivalences for neural network identifiability. arXiv preprint arXiv:2011.07407. Cited by: §2.

Learning to generalize: meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems 31. Cited by: §2.
BOIL: towards representation change for few-shot learning. arXiv preprint arXiv:2008.08882. Cited by: §1, §2.
Meta-Curvature. Advances in Neural Information Processing Systems 32. Cited by: §2, §5.
LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11), pp. 559–572. Cited by: §4.
The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894. Cited by: §2.
Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157. Cited by: §A.1, §1, §2, §5.1.
A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §4.
Matching networks for one shot learning. Advances in Neural Information Processing Systems 29. Cited by: §5.3.
Learning neural network subspaces. In International Conference on Machine Learning, pp. 11217–11227. Cited by: §2.
Appendix A Appendix
A.1 Classification
In Figure 6, we empirically investigate how the parameters change across different tasks. We take all the tasks in the test set and compute, for each parameter, the mean absolute difference across all task-adapted models. This gives an indication of which parameters change between tasks. As can be seen in the figure, for classification, mostly the parameters in the later layers get updated, which confirms the results of (Raghu et al., 2019).
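A minimal sketch of this computation, assuming a hypothetical stack of task-adapted parameter vectors (here random, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stack of task-adapted parameter vectors, one row per test task.
# In the paper these would come from adapting the meta-learner to each task.
thetas = rng.normal(size=(8, 10))     # (n_tasks, n_params)
thetas[:, 0] = 1.0                    # a parameter that never changes across tasks

# Mean absolute pairwise difference of each parameter across tasks:
# high values indicate parameters that adapt strongly between tasks.
pairwise = np.abs(thetas[:, None, :] - thetas[None, :, :])  # (n, n, n_params)
mean_abs_diff = pairwise.mean(axis=(0, 1))                  # one score per parameter
```

Plotting `mean_abs_diff` grouped by layer yields a figure of the kind shown in Figure 6.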
A.2 Sine Experiment
A.3 Omniglot
We also investigate few-shot classification on Omniglot. The results are depicted in Figure 8.
A.4 miniImagenet
In Figure 9, we provide a comparison between the estimated dimension and the test-set performance of the models at different stages of training. The estimated dimensionality varies during training, with MAML seemingly starting off with a low-dimensional parameter space which grows larger in the first epochs and then settles down again as performance increases. For Meta-Curvature, we can observe clear overfitting on the test set, with performance peaking early in training. This is reflected in the dimensionality estimation, as Isomap tends to estimate a lower dimension for later epochs.