1 Introduction
Deep neural networks that optimize for effective representations have enjoyed tremendous success over human-engineered representations. Meta-learning takes this one step further by optimizing for a learning algorithm that can effectively acquire representations. A common approach to meta-learning is to train a recurrent or memory-augmented model, such as a recurrent neural network, to take a training dataset as input and then output the parameters of a learner model
(Schmidhuber, 1987; Bengio et al., 1992; Li & Malik, 2017a; Andrychowicz et al., 2016). Alternatively, some approaches pass the dataset and test input into the model, which then outputs a corresponding prediction for the test example (Santoro et al., 2016; Duan et al., 2016; Wang et al., 2016; Mishra et al., 2018). Such recurrent models are universal learning procedure approximators, in that they have the capacity to approximately represent any mapping from dataset and test datapoint to label. However, depending on the form of the model, it may lack statistical efficiency.

In contrast to the aforementioned approaches, more recent work has proposed methods that incorporate the structure of optimization problems into the meta-learner (Ravi & Larochelle, 2017; Finn et al., 2017a; Husken & Goerick, 2000). In particular, model-agnostic meta-learning (MAML) optimizes only for the initial parameters of the learner model, using standard gradient descent as the learner's update rule (Finn et al., 2017a). At meta-test time, the learner is then trained via gradient descent. By incorporating prior knowledge about gradient-based learning, MAML improves on the statistical efficiency of black-box meta-learners and has successfully been applied to a range of meta-learning problems (Finn et al., 2017a, b; Li et al., 2017). But does it do so at a cost? A natural question that arises with purely gradient-based meta-learners such as MAML is whether it is sufficient to learn only an initialization, or whether representational power is lost by not learning the update rule. Intuitively, we might surmise that learning an update rule is more expressive than simply learning an initialization for gradient descent. In this paper, we seek to answer the following question: does simply learning the initial parameters of a deep neural network have the same representational power as arbitrarily expressive meta-learners that directly ingest the training data at meta-test time?
Or, more concisely, does deep representation combined with standard gradient descent have sufficient capacity to constitute any learning algorithm?
We analyze this question from the standpoint of the universal function approximation theorem. We compare the theoretical representational capacity of two meta-learning approaches: a deep network updated with one gradient step, and a meta-learner that directly ingests a training set and test input and outputs predictions for that test input (e.g., using a recurrent neural network). In studying the universality of MAML, we find that, for a sufficiently deep learner model, MAML has the same theoretical representational power as recurrent meta-learners. We therefore conclude that, when using deep, expressive function approximators, there is no theoretical disadvantage in terms of representational power to using MAML over a black-box meta-learner represented, for example, by a recurrent network.
Since MAML has the same representational power as any other universal meta-learner, the next question we might ask is: what is the benefit of using MAML over any other approach? We study this question by analyzing the effect of continuing optimization on MAML performance. Although MAML optimizes a network's parameters for maximal performance after a fixed small number of gradient steps, we analyze the effect of taking substantially more gradient steps at meta-test time. We find that initializations learned by MAML are extremely resilient to overfitting to tiny datasets, in stark contrast to more conventional network initialization, even when taking many more gradient steps than were used during meta-training. We also find that the MAML initialization is substantially better suited for extrapolation beyond the distribution of tasks seen at meta-training time, when compared to meta-learning methods based on networks that ingest the entire training set. We analyze this setting empirically and provide some intuition to explain this effect.
2 Preliminaries
In this section, we review the universal function approximation theorem and its extensions that we will use when considering the universal approximation of learning algorithms. We also overview the model-agnostic meta-learning algorithm and an architectural extension that we will use in Section 4.
2.1 Universal Function Approximation
The universal function approximation theorem states that a neural network with one hidden layer of finite width can approximate any continuous function on compact subsets of $\mathbb{R}^n$ up to arbitrary precision (Hornik et al., 1989; Cybenko, 1989; Funahashi, 1989). The theorem holds for a range of activation functions, including the sigmoid (Hornik et al., 1989) and ReLU (Sonoda & Murata, 2017) functions. A function approximator that satisfies the definition above is often referred to as a universal function approximator (UFA). Similarly, we will define a universal learning procedure approximator to be a UFA with input $\{\mathcal{D}, x^\star\}$ and output $\hat{y}^\star$, where $\mathcal{D}$ denotes the training dataset and $x^\star$ the test input, while $\hat{y}^\star$ denotes the desired test output. Furthermore, Hornik et al. (1990) showed that a neural network with a single hidden layer can simultaneously approximate any function and its derivatives, under mild assumptions on the activation function used and the target function's domain. We will use this property in Section 4 as part of our meta-learning universality result.

2.2 Model-Agnostic Meta-Learning with a Bias Transformation
Model-agnostic meta-learning (MAML) is a method that learns an initial set of parameters $\theta$ such that one or a few gradient steps on $\theta$, computed using a small amount of data for one task, lead to effective generalization on that task (Finn et al., 2017a). Tasks typically correspond to supervised classification or regression problems, but can also correspond to reinforcement learning problems. The MAML objective is computed over many tasks $\{\mathcal{T}_j\}$ as follows:
$$\min_\theta \sum_j \mathcal{L}\big(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\mathrm{tr}}_{\mathcal{T}_j}),\ \mathcal{D}^{\mathrm{test}}_{\mathcal{T}_j}\big)$$
where $\mathcal{D}^{\mathrm{tr}}_{\mathcal{T}_j}$ corresponds to a training set for task $\mathcal{T}_j$ and the outer loss evaluates generalization on test data in $\mathcal{D}^{\mathrm{test}}_{\mathcal{T}_j}$. The inner optimization that computes the adapted parameters can use multiple gradient steps, though in this paper we will focus on the single-gradient-step setting. After meta-training on a wide range of tasks, the model can quickly and efficiently learn new, held-out test tasks by running gradient descent starting from the meta-learned representation $\theta$.
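To ground the objective in something executable, here is a minimal numpy sketch of MAML meta-training on toy sinusoid tasks. The model, its features, the finite-difference meta-gradient, and all hyperparameters are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(theta, x):
    """Tiny learner: a linear model on fixed sine/cosine features of x."""
    feats = np.stack([np.sin(x), np.cos(x), np.ones_like(x)], axis=1)
    return feats @ theta

def mse(theta, x, y):
    return np.mean((predict(theta, x) - y) ** 2)

def grad_mse(theta, x, y):
    feats = np.stack([np.sin(x), np.cos(x), np.ones_like(x)], axis=1)
    return 2.0 / len(x) * feats.T @ (predict(theta, x) - y)

def inner_adapt(theta, x_tr, y_tr, alpha=0.1):
    """One inner gradient step: theta' = theta - alpha * grad L(theta, D_tr)."""
    return theta - alpha * grad_mse(theta, x_tr, y_tr)

def sample_task():
    """A sinusoid task y = A sin(x + phase), echoing the paper's domain."""
    A, phase = rng.uniform(0.1, 5.0), rng.uniform(0.0, np.pi)
    x = rng.uniform(-5.0, 5.0, size=10)
    return x[:5], A * np.sin(x[:5] + phase), x[5:], A * np.sin(x[5:] + phase)

# Meta-train: descend sum_j L(theta - alpha grad L(theta, D_tr), D_test),
# here with a crude finite-difference estimate of the meta-gradient.
theta = np.zeros(3)
eps = 1e-4
for _ in range(500):
    x_tr, y_tr, x_te, y_te = sample_task()
    meta_grad = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        meta_grad[i] = (mse(inner_adapt(theta + d, x_tr, y_tr), x_te, y_te)
                        - mse(inner_adapt(theta - d, x_tr, y_tr), x_te, y_te)) / (2 * eps)
    theta -= 0.01 * meta_grad

# At meta-test time, one gradient step from theta should help on average.
pre, post = [], []
for _ in range(200):
    x_tr, y_tr, x_te, y_te = sample_task()
    pre.append(mse(theta, x_te, y_te))
    post.append(mse(inner_adapt(theta, x_tr, y_tr), x_te, y_te))
print(np.mean(pre) > np.mean(post))
```

The point of the sketch is only the shape of the computation: an inner gradient step inside an outer optimization over the initial parameters.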
While MAML is compatible with any neural network architecture and any differentiable loss function, recent work has observed that some architectural choices can improve its performance. A particularly effective modification, introduced by
Finn et al. (2017b), is to concatenate a vector of parameters, $\theta_b$, to the input. As with all other model parameters, $\theta_b$ is updated in the inner loop via gradient descent, and the initial value of $\theta_b$ is meta-learned. This modification, referred to as a bias transformation, increases the expressive power of the error gradient without changing the expressivity of the model itself. While Finn et al. (2017b) report empirical benefit from this modification, we will use this architectural design as a symmetry-breaking mechanism in our universality proof.

3 Meta-Learning and Universality
We can broadly classify RNN-based meta-learning methods into two categories. In the first approach (Santoro et al., 2016; Duan et al., 2016; Wang et al., 2016; Mishra et al., 2018), there is a meta-learner model $g(\cdot; \theta)$ with parameters $\theta$ which takes as input the dataset $\mathcal{D}_\mathcal{T}$ for a particular task $\mathcal{T}$ and a new test input $x^\star$, and outputs the estimated output $\hat{y}^\star$ for that input:
$$\hat{y}^\star = g(\mathcal{D}_\mathcal{T}, x^\star; \theta)$$
The meta-learner $g$ is typically a recurrent model that iterates over the dataset $\mathcal{D}_\mathcal{T}$ and the new input $x^\star$. For a recurrent neural network model that satisfies the UFA theorem, this approach is maximally expressive, as it can represent any function on the dataset $\mathcal{D}_\mathcal{T}$ and test input $x^\star$.
In the second approach (Hochreiter et al., 2001; Bengio et al., 1992; Li & Malik, 2017b; Andrychowicz et al., 2016; Ravi & Larochelle, 2017; Ha et al., 2017), there is a meta-learner $g(\cdot; \theta)$ that takes as input the dataset $\mathcal{D}_\mathcal{T}$ for a particular task and the current weights $\phi$ of a learner model $f(\cdot; \phi)$, and outputs new parameters $\phi_\mathcal{T}$ for the learner model. Then, the test input $x^\star$ is fed into the learner model to produce the predicted output $\hat{y}^\star$. The process can be written as follows:
$$\phi_\mathcal{T} = g(\mathcal{D}_\mathcal{T}, \phi; \theta), \qquad \hat{y}^\star = f(x^\star; \phi_\mathcal{T})$$
Note that, in the form written above, this approach can be as expressive as the previous approach, since the meta-learner could simply copy the dataset into some of the predicted weights, reducing to a model that takes as input the dataset and the test example. (For this to be possible, the learner model must be a neural network with at least two hidden layers, since the dataset can be copied into the first layer of weights and the predicted output must be a universal function approximator of both the dataset and the test input.) Several versions of this approach, i.e. Ravi & Larochelle (2017); Li & Malik (2017b), have the recurrent meta-learner operate on order-invariant features such as the gradient and objective value averaged over the datapoints in the dataset, rather than operating on the individual datapoints themselves. This induces a potentially helpful inductive bias that disallows coupling between datapoints, ignoring the ordering within the dataset. As a result, the meta-learning process can only produce permutation-invariant functions of the dataset.
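To make the order-invariance point concrete, the sketch below (our own illustration, not from the paper) contrasts a batch gradient update, which averages per-example gradients and is therefore permutation-invariant, with a sequential per-example update, whose result depends on dataset ordering:

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(w, x, y):
    """Per-example gradient of the squared error 0.5 * (w.x - y)^2."""
    return (w @ x - y) * x

def batch_update(w, data, alpha=0.1):
    """One step on the averaged loss: the ordering of the dataset cannot matter."""
    g = np.mean([grad(w, x, y) for x, y in data], axis=0)
    return w - alpha * g

def sequential_update(w, data, alpha=0.1):
    """One pass of per-example steps: each step sees the previous step's w,
    so the result depends on the ordering."""
    for x, y in data:
        w = w - alpha * grad(w, x, y)
    return w

w0 = rng.normal(size=3)
data = [(rng.normal(size=3), rng.normal()) for _ in range(5)]
perm = [data[i] for i in [3, 0, 4, 1, 2]]

print(np.allclose(batch_update(w0, data), batch_update(w0, perm)))
print(np.allclose(sequential_update(w0, data), sequential_update(w0, perm)))
```

The first comparison holds for any permutation; the second generically fails, which is the coupling between datapoints that order-invariant features rule out.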
In model-agnostic meta-learning (MAML), instead of using an RNN to update the weights of the learner $f$, standard gradient descent is used. Specifically, the prediction $\hat{y}^\star$ for a test input $x^\star$ is:
$$\hat{y}^\star = f_{\mathrm{MAML}}(\mathcal{D}_\mathcal{T}, x^\star; \theta) = f(x^\star; \theta'), \qquad \theta' = \theta - \alpha \nabla_\theta \sum_{(x, y) \in \mathcal{D}_\mathcal{T}} \ell\big(y, f(x; \theta)\big)$$
where $\theta$ denotes the initial parameters of the model and also corresponds to the parameters that are meta-learned, and $\ell$ corresponds to a loss function with respect to the label and prediction. Since the RNN approaches can approximate any update rule, they are clearly at least as expressive as gradient descent. It is less obvious whether or not the MAML update imposes any constraints on the learning procedures that can be acquired. To study this question, we define a universal learning procedure approximator to be a learner which can approximate any function of the set of training datapoints $\mathcal{D}_\mathcal{T}$ and the test point $x^\star$. It is clear how $f_{\mathrm{MAML}}$ can approximate any function on $x^\star$, as per the UFA theorem; however, it is not obvious if $f_{\mathrm{MAML}}$ can represent any function of the set of input/output pairs in $\mathcal{D}_\mathcal{T}$, since the UFA theorem does not consider the gradient operator.
The first goal of this paper is to show that $f_{\mathrm{MAML}}(\mathcal{D}_\mathcal{T}, x^\star; \theta)$ is a universal function approximator of $(\mathcal{D}_\mathcal{T}, x^\star)$ in the one-shot setting, where the dataset $\mathcal{D}_\mathcal{T}$ consists of a single datapoint $(x, y)$. Then, we will consider the case of $K$-shot learning, showing that $f_{\mathrm{MAML}}(\mathcal{D}_\mathcal{T}, x^\star; \theta)$ is universal in the set of functions that are invariant to the permutation of datapoints. In both cases, we will discuss meta supervised learning problems with both discrete and continuous labels and the loss functions under which universality does or does not hold.
4 Universality of the One-Shot Gradient-Based Learner
We first introduce a proof of the universality of gradient-based meta-learning for the special case with only one training point, corresponding to one-shot learning. We denote the training datapoint as $(x, y)$, and the test input as $x^\star$. A universal learning algorithm approximator corresponds to the ability of a meta-learner to represent any function $f_{\mathrm{target}}(x, y, x^\star)$ up to arbitrary precision.
We will proceed by construction, showing that there exists a neural network function $\hat{f}(\cdot; \theta)$ such that $\hat{f}(x^\star; \theta')$ approximates $f_{\mathrm{target}}(x, y, x^\star)$ up to arbitrary precision, where $\theta' = \theta - \alpha \nabla_\theta \ell(y, \hat{f}(x; \theta))$ and $\alpha$ is the nonzero learning rate. The proof holds for a standard multilayer ReLU network, provided that it has sufficient depth. As we discuss in Section 6, the loss function $\ell$ cannot be arbitrary, but the standard cross-entropy and mean-squared error objectives are both suitable. In this proof, we will start by presenting the form of $\hat{f}$ and deriving its value after one gradient step. Then, to show universality, we will construct a setting of the weight matrices that enables independent control of the information flow coming forward from $x$ and $x^\star$, and backward from $y$.
We will start by constructing $\hat{f}(\cdot; \theta)$, which, as shown in Figure 1, is a generic deep network with $N+2$ layers and ReLU nonlinearities. Note that, for a particular weight matrix $W_i$ at layer $i$, a single gradient step can only represent a rank-1 update to the matrix $W_i$. That is because the gradient of $W_i$ is the outer product of two vectors, $\nabla_{W_i} \ell = a_i b_{i-1}^T$, where $a_i$ is the error gradient with respect to the pre-synaptic activations at layer $i$, and $b_{i-1}$ is the forward post-synaptic activations at layer $i-1$. The expressive power of a single gradient update to a single weight matrix is therefore quite limited. However, if we sequence $N$ weight matrices as $\prod_{i=1}^{N} W_i$, corresponding to multiple linear layers, it is possible to acquire a rank-$N$ update to the linear function represented by $W = \prod_{i=1}^{N} W_i$. Note that deep ReLU networks act like deep linear networks when the input and pre-synaptic activations are non-negative. Motivated by this reasoning, we will construct $\hat{f}(\cdot; \theta)$ as a deep ReLU network where a number of the intermediate layers act as linear layers, which we ensure by showing that the input and pre-synaptic activations of these layers are non-negative. This allows us to simplify the analysis. The simplified form of the model is as follows:
$$\hat{f}(x; \theta) = f_{\mathrm{out}}\Big(\Big(\prod_{i=1}^{N} W_i\Big) \phi(x; \theta_{\mathrm{ft}}, \theta_b);\ \theta_{\mathrm{out}}\Big)$$
where $\phi(\cdot; \theta_{\mathrm{ft}}, \theta_b)$ represents an input feature extractor with parameters $\theta_{\mathrm{ft}}$ and a scalar bias transformation variable $\theta_b$, $\prod_{i=1}^{N} W_i$ is a product of square linear weight matrices, $f_{\mathrm{out}}(\cdot; \theta_{\mathrm{out}})$ is a function at the output, and the learned parameters are $\theta := \{\theta_{\mathrm{ft}}, \theta_b, \{W_i\}, \theta_{\mathrm{out}}\}$. The input feature extractor and output function can be represented with fully connected neural networks with one or more hidden layers, which we know are universal function approximators, while $\prod_{i=1}^{N} W_i$ corresponds to a set of linear layers with non-negative input and activations.
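The rank argument can be checked numerically. Below is an illustrative numpy sketch (random matrices standing in for the construction): under squared error, the gradient with respect to any single weight matrix of a deep linear network is exactly rank 1, while one gradient step on all $N$ matrices changes the overall linear map by a matrix of numerical rank $N$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, alpha = 6, 3, 1e-8

Ws = [rng.normal(size=(d, d)) for _ in range(N)]   # W_1, W_2, W_3
x = rng.normal(size=(d, 1))
y = rng.normal(size=(d, 1))

def prod(ms):
    out = np.eye(d)
    for m in ms:
        out = out @ m
    return out

z = prod(Ws) @ x        # z = W_1 W_2 W_3 x
e = z - y               # error gradient of the loss 0.5 * ||z - y||^2

# Gradient w.r.t. W_i is the outer product
# (W_1 ... W_{i-1})^T e  x^T (W_{i+1} ... W_N)^T, hence rank 1.
grads = [prod(Ws[:i]).T @ e @ x.T @ prod(Ws[i + 1:]).T for i in range(N)]
ranks = [int(np.linalg.matrix_rank(g)) for g in grads]
print(ranks)  # [1, 1, 1]

# One gradient step on every W_i changes the overall linear map by a
# rank-N matrix, to first order in alpha.
Ws_new = [W - alpha * g for W, g in zip(Ws, grads)]
delta = (prod(Ws_new) - prod(Ws)) / alpha
s = np.linalg.svd(delta, compute_uv=False)
num_rank = int(np.sum(s > s[0] * 1e-5))
print(num_rank)
```

With a tiny $\alpha$, the higher-order terms are negligible and the singular values of the change cleanly separate into $N$ large values and the rest.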
Next, we derive the form of the post-update prediction $\hat{f}(x^\star; \theta')$. Let $z = \big(\prod_{i=1}^{N} W_i\big) \phi(x; \theta_{\mathrm{ft}}, \theta_b)$, and denote the error gradient by $\nabla_z \ell = e(x, y)$. Then, the gradient with respect to each weight matrix $W_i$ is:
$$\nabla_{W_i} \ell\big(y, \hat{f}(x; \theta)\big) = \Big(\prod_{j=1}^{i-1} W_j\Big)^{T} e(x, y)\, \phi(x)^T \Big(\prod_{j=i+1}^{N} W_j\Big)^{T}$$
Therefore, the post-update value of $\prod_{i=1}^{N} W_i' = \prod_{i=1}^{N} \big(W_i - \alpha \nabla_{W_i} \ell\big)$ is given by
$$\prod_{i=1}^{N} W_i - \alpha \sum_{i=1}^{N} \Big(\prod_{j=1}^{i-1} W_j\Big) \Big(\prod_{j=1}^{i-1} W_j\Big)^{T} e(x, y)\, \phi(x)^T \Big(\prod_{j=i+1}^{N} W_j\Big)^{T} \Big(\prod_{j=i+1}^{N} W_j\Big) + O(\alpha^2)$$
where we will disregard the last term, assuming that $\alpha$ is comparatively small such that the $\alpha^2$ and all higher-order terms vanish. In general, these terms do not necessarily need to vanish, and likely would further improve the expressiveness of the gradient update, but we disregard them here for the sake of the simplicity of the derivation. Ignoring these terms, we now note that the post-update value of $z^\star$ when $x^\star$ is provided as input into $\hat{f}(\cdot; \theta')$ is given by
$$z^\star = \Big(\prod_{i=1}^{N} W_i\Big) \phi(x^\star) - \alpha \sum_{i=1}^{N} \Big(\prod_{j=1}^{i-1} W_j\Big) \Big(\prod_{j=1}^{i-1} W_j\Big)^{T} e(x, y)\, \phi(x)^T \Big(\prod_{j=i+1}^{N} W_j\Big)^{T} \Big(\prod_{j=i+1}^{N} W_j\Big) \phi(x^\star) \qquad (1)$$
and $\hat{y}^\star = f_{\mathrm{out}}(z^\star; \theta_{\mathrm{out}}')$.
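The claim that the ignored terms scale as $\alpha^2$ is easy to sanity-check numerically. In the sketch below (illustrative, with random matrices standing in for the weights and gradients), halving $\alpha$ roughly quarters the gap between the exact post-update product and its first-order approximation:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 4, 3
Ws = [rng.normal(size=(d, d)) for _ in range(N)]
Gs = [rng.normal(size=(d, d)) for _ in range(N)]  # stand-ins for the gradients

def prod(ms):
    out = np.eye(d)
    for m in ms:
        out = out @ m
    return out

def first_order_gap(alpha):
    """Norm of prod(W_i - a G_i) minus its first-order-in-alpha expansion."""
    exact = prod([W - alpha * G for W, G in zip(Ws, Gs)])
    lin = prod(Ws) - alpha * sum(
        prod(Ws[:i]) @ Gs[i] @ prod(Ws[i + 1:]) for i in range(N))
    return np.linalg.norm(exact - lin)

g1, g2 = first_order_gap(1e-3), first_order_gap(5e-4)
print(g1 / g2)  # close to 4: the neglected terms are O(alpha^2)
```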
Our goal is to show that there exists a setting of the $W_i$, $\phi$, and $f_{\mathrm{out}}$ for which the above function, $\hat{f}(x^\star; \theta')$, can approximate any function of $(x, y, x^\star)$. To show universality, we will aim to independently control the information flow from $x$, from $y$, and from $x^\star$ by multiplexing forward information from $x$ and backward information from $y$. We will achieve this by decomposing $\phi$, each $W_i$, and the error gradient into three parts, as follows:
$$\phi(\cdot; \theta_{\mathrm{ft}}, \theta_b) = \begin{bmatrix} \tilde{\phi}(\cdot; \theta_{\mathrm{ft}}) \\ \theta_b \\ \mathbf{0} \end{bmatrix} \qquad W_i = \begin{bmatrix} \tilde{W}_i & 0 & 0 \\ 0 & \check{w}_i & 0 \\ 0 & 0 & \bar{W}_i \end{bmatrix} \qquad e(x, y) = \begin{bmatrix} \mathbf{0} \\ \check{e}(x, y) \\ \bar{e}(x, y) \end{bmatrix} \qquad (2)$$
where the initial value of $\theta_b$ will be $0$. The top components all have equal numbers of rows, as do the middle components. As a result, we can see that $z$ will likewise be made up of three components, which we will denote as $\tilde{z}$, $\check{z}$, and $\bar{z}$. Lastly, we construct the top component of the error gradient to be $\mathbf{0}$, whereas the middle and bottom components, $\check{e}(x, y)$ and $\bar{e}(x, y)$, can be set to be any linear (but not affine) function of $y$. We will discuss how to achieve this gradient in the latter part of this section when we define $f_{\mathrm{out}}$ and in Section 6.
In Appendix A.3, we show that we can choose a particular form of $\tilde{W}_i$, $\check{w}_i$, and $\bar{W}_i$ that will simplify the products of matrices in Equation 1, such that we get the following form for the bottom component $\bar{z}^\star$:
$$\bar{z}^\star = -\alpha \sum_{i=1}^{N} B_i B_i^T\, \bar{e}(x, y)\, \tilde{\phi}(x)^T A_i^T A_i\, \tilde{\phi}(x^\star) \qquad (3)$$
where $A_i := \prod_{j=i+1}^{N} \tilde{W}_j$ and $B_i := \prod_{j=1}^{i-1} \bar{W}_j$; each $A_i^T A_i$ can be chosen to be any symmetric positive-definite matrix, and each $B_i B_i^T$ can be chosen to be any positive-definite matrix. In Appendix D, we further show that these definitions of the weight matrices satisfy the condition that the activations are non-negative, meaning that the model can be represented by a generic deep network with ReLU nonlinearities.
Finally, we need to define the function $f_{\mathrm{out}}$ at the output. When the training input $x$ is passed in, we need $f_{\mathrm{out}}$ to propagate information about the label $y$ as defined in Equation 2. And, when the test input $x^\star$ is passed in, we need a different function defined only on $z^\star$. Thus, we will define $f_{\mathrm{out}}$ as a neural network that approximates the following multiplexer function and its derivatives (as shown possible by Hornik et al. (1990)):
$$f_{\mathrm{out}}(z; \theta_{\mathrm{out}}) = \mathbb{1}(\bar{z} = \mathbf{0})\, g_{\mathrm{pre}}(z; \theta_g) + \mathbb{1}(\bar{z} \neq \mathbf{0})\, h_{\mathrm{post}}(z; \theta_h) \qquad (4)$$
where $g_{\mathrm{pre}}$ is a linear function with parameters $\theta_g$ such that $\nabla_z \ell = e(x, y)$ satisfies Equation 2 (see Section 6) and $h_{\mathrm{post}}(\cdot; \theta_h)$ is a neural network with one or more hidden layers. As shown in Appendix A.4, the post-update value of $f_{\mathrm{out}}(z^\star; \theta_{\mathrm{out}}')$ is
$$f_{\mathrm{out}}(z^\star; \theta_{\mathrm{out}}') = h_{\mathrm{post}}(z^\star; \theta_h) \qquad (5)$$
Now, combining Equations 3 and 5, we can see that the post-update value $\hat{y}^\star$ is the following:
$$\hat{y}^\star = h_{\mathrm{post}}\Big(-\alpha \sum_{i=1}^{N} B_i B_i^T\, \bar{e}(x, y)\, \tilde{\phi}(x)^T A_i^T A_i\, \tilde{\phi}(x^\star);\ \theta_h\Big) \qquad (6)$$
In summary, so far, we have chosen a particular form of weight matrices, feature extractor, and output function to decouple forward and backward information flow and recover the post-update function above. Now, our goal is to show that the above function is a universal learning algorithm approximator, as a function of $(x, y, x^\star)$. For notational clarity, we will use $k_i(x, x^\star) := \tilde{\phi}(x)^T A_i^T A_i\, \tilde{\phi}(x^\star)$ to denote the inner product in the above equation, noting that it can be viewed as a type of kernel with the RKHS defined by $A_i \tilde{\phi}(\cdot)$. (Due to the symmetry of kernels, this requires interpreting $\theta_b$ as part of the input, rather than a kernel hyperparameter, so that the left input is $(x, \theta_b)$ and the right one is $(x^\star, \theta_b)$.) The connection to kernels is not in fact needed for the proof, but provides for convenient notation and an interesting observation. We then define the following lemma:

Lemma 4.1. Let us assume that $\bar{e}(x, y)$ can be chosen to be any linear (but not affine) function of $y$. Then, we can choose $\theta_{\mathrm{ft}}$, $\theta_b$, $\{A_i\}$, and $\{B_i\}$ such that the function
$$h_{\mathrm{post}}\Big(-\alpha \sum_{i=1}^{N} B_i B_i^T\, \bar{e}(x, y)\, k_i(x, x^\star);\ \theta_h\Big) \qquad (7)$$
can approximate any continuous function of $(x, y, x^\star)$ on compact subsets of $\mathbb{R}^{\dim(y)}$. (The assumption with regard to compact subsets of the output space is inherited from the UFA theorem.)
Intuitively, Equation 7 can be viewed as a sum of basis vectors $B_i B_i^T\, \bar{e}(x, y)$ weighted by the kernel values $k_i(x, x^\star)$, which is passed into $h_{\mathrm{post}}$ to produce the output. There are likely a number of ways to prove Lemma 4.1. In Appendix A.1, we provide a simple though inefficient proof, which we will briefly summarize here. We can define $k_i(x, x^\star)$ to be an indicator function, indicating when $x^\star$ takes on a particular value indexed by $i$. Then, we can define $B_i B_i^T\, \bar{e}(x, y)$ to be a vector containing the information of $x$ and $y$. As a result, the summation produces a vector containing information about the label $y$ and the value of $x$, which is indexed by the value of $x^\star$. Finally, $h_{\mathrm{post}}$ defines the output for each value of $x^\star$. The bias transformation variable $\theta_b$ plays a vital role in our construction, as it breaks the symmetry within $k_i(x, x^\star)$. Without such asymmetry, it would not be possible for our constructed function to represent any function of $x$ and $x^\star$ after one gradient step.
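The flavor of this construction can be illustrated with a much simpler toy (our own, not the appendix construction): with one-hot indicator features, a single gradient step on a linear model literally writes the label into a memory slot indexed by the training input, and the test-time prediction reads that slot back out:

```python
import numpy as np

n_bins = 10

def phi(x):
    """One-hot indicator of which bin x falls into on [0, 1)."""
    v = np.zeros(n_bins)
    v[min(int(x * n_bins), n_bins - 1)] = 1.0
    return v

def adapted_prediction(x, y, x_star, alpha=0.5):
    """Start from w = 0, take one step on l = (w.phi(x) - y)^2, predict at x_star."""
    w = np.zeros(n_bins)
    grad = 2.0 * (w @ phi(x) - y) * phi(x)   # equals -2 y phi(x) at w = 0
    w = w - alpha * grad                     # writes y into the bin of x
    return w @ phi(x_star)                   # reads the slot indexed by x_star

# If x_star lands in the same bin as the training point, the label is
# recovered exactly; otherwise the prediction is untouched by the step.
print(adapted_prediction(0.42, 3.0, 0.43))  # 3.0 (same bin)
print(adapted_prediction(0.42, 3.0, 0.90))  # 0.0 (different bin)
```

The real construction is far more involved, since the "write" must happen through the fixed gradient operator inside a generic deep ReLU network, but the lookup-table intuition is the same.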
In conclusion, we have shown that there exists a neural network structure for which $\hat{f}(x^\star; \theta')$ is a universal approximator of $f_{\mathrm{target}}(x, y, x^\star)$. We chose a particular form of $\hat{f}$ that decouples forward and backward information flow. With this choice, it is possible to impose any desired post-update function, even in the face of adversarial training datasets and loss functions, e.g., when the gradient points in the wrong direction. If we make the assumption that the inner loss function and training dataset are not chosen adversarially and the error gradient points in the direction of improvement, it is likely that a much simpler architecture will suffice, one that does not require multiplexing of forward and backward information in separate channels. That informative loss functions and training data allow for simpler functions is indicative of the inductive bias built into gradient-based meta-learners, which is not present in recurrent meta-learners.
Our result in this section implies that a sufficiently deep representation combined with just a single gradient step can approximate any one-shot learning algorithm. In the next section, we will show the universality of MAML for $K$-shot learning algorithms.
5 General Universality of the Gradient-Based Learner
Now, we consider the more general $K$-shot setting, aiming to show that MAML can approximate any permutation-invariant function of a dataset and test datapoint $\big(\{(x_k, y_k);\ k = 1, \dots, K\}, x^\star\big)$ for $K > 1$. Note that $K$ does not need to be small. To reduce redundancy, we will only overview the differences from the one-shot setting in this section. We include a full proof in Appendix B.

In the $K$-shot setting, the parameters of $\hat{f}(\cdot; \theta)$ are updated according to the following rule:
$$\theta' = \theta - \alpha \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta\, \ell\big(y_k, \hat{f}(x_k; \theta)\big)$$
Defining the form of $\hat{f}$ to be the same as in Section 4, the post-update function is the following:
$$\hat{y}^\star = h_{\mathrm{post}}\Big(-\frac{\alpha}{K} \sum_{k=1}^{K} \sum_{i=1}^{N} B_i B_i^T\, \bar{e}(x_k, y_k)\, k_i(x_k, x^\star);\ \theta_h\Big)$$
In Appendix C, we show one way in which this function can approximate any function of $\big(\{(x_k, y_k)\}, x^\star\big)$ that is invariant to the ordering of the training datapoints $\{(x_k, y_k);\ k = 1, \dots, K\}$. We do so by showing that we can select a setting of $\tilde{\phi}$ and of each $A_i$ and $B_i$ such that the summation yields a vector containing a discretization of $x^\star$ and frequency counts of the discretized datapoints. (With continuous labels and mean-squared error $\ell$, we require the mild assumption that no two datapoints may share the same input value $x$: the input datapoints must be unique.) Since this vector completely describes $\big(\{(x_k, y_k)\}, x^\star\big)$ without loss of information, and because $h_{\mathrm{post}}$ is a universal function approximator, $\hat{y}^\star$ can approximate any continuous function of $\big(\{(x_k, y_k)\}, x^\star\big)$ on compact subsets of $\mathbb{R}^{\dim(y)}$. It is also worth noting that the form of the above equation greatly resembles a kernel-based function approximator around the training points, and a substantially more efficient universality proof can likely be obtained starting from this premise.
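The frequency-count idea is easy to illustrate. In the toy sketch below (our own discretization, not the appendix construction), the histogram of discretized $(x, y)$ pairs is identical for any reordering of the training set, so any function computed from it is automatically permutation-invariant:

```python
import numpy as np

n_bins = 8

def bin_index(v):
    """Discretize a value in [0, 1) into one of n_bins buckets."""
    return min(int(v * n_bins), n_bins - 1)

def dataset_summary(data):
    """Frequency counts of discretized (x, y) pairs: a permutation-invariant
    description of the training set, lossless up to the discretization."""
    counts = np.zeros((n_bins, n_bins))
    for x, y in data:
        counts[bin_index(x), bin_index(y)] += 1
    return counts

rng = np.random.default_rng(3)
data = [(rng.uniform(), rng.uniform()) for _ in range(20)]
shuffled = list(data)
rng.shuffle(shuffled)

print(np.array_equal(dataset_summary(data), dataset_summary(shuffled)))  # True
```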
6 Loss Functions
In the previous sections, we showed that a deep representation combined with gradient descent can approximate any learning algorithm. In this section, we will discuss the requirements that the loss function must satisfy in order for the results in Sections 4 and 5 to hold. As one might expect, the main requirement will be for the label to be recoverable from the gradient of the loss.
As seen in the definition of $f_{\mathrm{out}}$ in Equation 4, the pre-update function is given by $g_{\mathrm{pre}}(z; \theta_g)$, where $g_{\mathrm{pre}}$ is used for backpropagating information about the label(s) to the learner. As stated in Equation 2, we require the error gradient with respect to $z$ to be:
$$\nabla_z \ell = e(x, y) = \begin{bmatrix} \mathbf{0} \\ \check{e}(x, y) \\ \bar{e}(x, y) \end{bmatrix}$$
where $\check{e}(x, y)$ and $\bar{e}(x, y)$ must be able to represent [at least] any linear function of the label $y$.

We define $g_{\mathrm{pre}}$ as follows: $g_{\mathrm{pre}}(z; \theta_g) = W_g z + b_g$, with $\theta_g = \{W_g, b_g\}$ and $W_g = [\tilde{W}_g\ \ \check{W}_g\ \ \bar{W}_g]$ partitioned according to the components of $z$.

To make the top term of the gradient equal to $\mathbf{0}$, we can set $\tilde{W}_g$ to be $\mathbf{0}$, which causes the pre-update prediction to be $\hat{y} = \mathbf{0}$, since $\check{z}$ and $\bar{z}$ are zero before the update and we can choose $b_g = \mathbf{0}$. Next, note that $\check{e}(x, y) = \check{W}_g^T \nabla_{\hat{y}} \ell$ and $\bar{e}(x, y) = \bar{W}_g^T \nabla_{\hat{y}} \ell$. Thus, for $\check{e}$ and $\bar{e}$ to be any linear function of $y$, we require a loss function for which $\nabla_{\hat{y}} \ell(y, \hat{y})$ is a linear function $A y$, where $A$ is invertible. Essentially, $y$ needs to be recoverable from the loss function's gradient. In Appendices E and F, we prove the following two theorems, thus showing that the standard mean-squared error and cross-entropy losses allow for the universality of gradient-based meta-learning.
Theorem 6.1
The gradient of the standard mean-squared error objective evaluated at $\hat{y} = \mathbf{0}$ is a linear, invertible function of $y$.
Theorem 6.2
The gradient of the softmax cross-entropy loss with respect to the pre-softmax logits is a linear, invertible function of $y$, when evaluated at $\mathbf{0}$.

Now consider other popular loss functions whose gradients do not satisfy this label-linearity property. The gradients of the $\ell_1$ and hinge losses are piecewise constant, and thus do not allow for universality. The Huber loss is also piecewise constant in some areas of its domain. These error functions effectively lose information, because simply looking at their gradient is insufficient to determine the label. Recurrent meta-learners that take the gradient as input, rather than the label, e.g. Andrychowicz et al. (2016), will also suffer from this loss of information when using these error functions.
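These claims are easy to check numerically. The sketch below (our own illustration) recovers the label from the mean-squared error and softmax cross-entropy gradients evaluated at zero, while the $\ell_1$ gradient collapses every positive label to the same value:

```python
import numpy as np

def mse_grad_at_zero(y):
    """d/dy_hat (y_hat - y)^2 at y_hat = 0  ->  -2 y  (invertible in y)."""
    return -2.0 * y

def softmax_xent_grad_at_zero(onehot):
    """Gradient w.r.t. pre-softmax logits at logits = 0: uniform - onehot."""
    c = len(onehot)
    return np.full(c, 1.0 / c) - onehot

def l1_grad_at_zero(y):
    """d/dy_hat |y_hat - y| at y_hat = 0  ->  -sign(y): the magnitude is lost."""
    return -np.sign(y)

# MSE: the label is exactly recoverable from the gradient.
print(mse_grad_at_zero(3.7) / -2.0)                       # 3.7

# Cross-entropy: the label index is recoverable (it is the argmin).
onehot = np.array([0.0, 0.0, 1.0, 0.0])
print(int(np.argmin(softmax_xent_grad_at_zero(onehot))))  # 2

# l1: labels 1.0 and 5.0 produce identical gradients.
print(l1_grad_at_zero(1.0) == l1_grad_at_zero(5.0))       # True
```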
7 Experiments
Now that we have shown that meta-learners that use standard gradient descent with a sufficiently deep representation can approximate any learning procedure, and are equally expressive as recurrent learners, a natural next question is: is there empirical benefit to using one meta-learning approach versus another, and in which cases? To answer this question, we next aim to empirically study the inductive bias of gradient-based and recurrent meta-learners. Then, in Section 7.2, we will investigate the role of model depth in gradient-based meta-learning, as the theory suggests that deeper networks lead to increased expressive power for representing different learning procedures.
7.1 Empirical Study of Inductive Bias
First, we aim to empirically explore the differences between gradient-based and recurrent meta-learners. In particular, we aim to answer the following questions: (1) can a learner trained with MAML further improve from additional gradient steps when learning new tasks at test time, or does it start to overfit? and (2) does the inductive bias of gradient descent enable better few-shot learning performance on tasks outside of the training distribution, compared to learning algorithms represented as recurrent networks?
To study both questions, we consider two simple few-shot learning domains. The first is 5-shot regression on a family of sine curves with varying amplitude and phase. We trained all models on a uniform distribution of tasks with amplitudes $A \in [0.1, 5.0]$ and phases $\gamma \in [0, \pi]$. The second domain is 1-shot character classification using the Omniglot dataset (Lake et al., 2011), following the training protocol introduced by Santoro et al. (2016). In our comparisons to recurrent meta-learners, we use two state-of-the-art meta-learning models: SNAIL (Mishra et al., 2018) and meta-networks (Munkhdalai & Yu, 2017). In some experiments, we also compare to a task-conditioned model, which is trained to map from both the input and the task description to the label. Like MAML, the task-conditioned model can be fine-tuned on new data using gradient descent, but it is not trained for few-shot adaptation. We include more experimental details in Appendix G.

To answer the first question, we fine-tuned a model trained using MAML with many more gradient steps than were used during meta-training. The results on the sinusoid domain, shown in Figure 2, show that a MAML-learned initialization trained for fast adaptation in five steps can continue to improve with many additional gradient steps, especially on out-of-distribution tasks. In contrast, a task-conditioned model trained without MAML can easily overfit to out-of-distribution tasks. With the Omniglot dataset, as seen in Figure 4, a MAML model trained with a small number of inner gradient steps can be fine-tuned for many more gradient steps without any drop in test accuracy. As expected, a model initialized randomly and trained from scratch quickly reaches perfect training accuracy, but overfits massively to the few provided examples.
Next, we investigate the second question, aiming to compare MAML with state-of-the-art recurrent meta-learners on tasks that are related to, but outside of, the distribution of the training tasks. All three methods achieved similar performance within the distribution of training tasks for 5-way, 1-shot Omniglot classification and 5-shot sinusoid regression. In the Omniglot setting, we compare each method's ability to distinguish characters that have been sheared or scaled by varying amounts. In the sinusoid regression setting, we compare on sinusoids with amplitudes and phases extrapolated beyond the training ranges. The results in Figure 3 and Appendix G show a clear trend: MAML recovers more generalizable learning strategies. Combined with the theoretical universality results, these experiments indicate that deep gradient-based meta-learners are not only equivalent in representational power to recurrent meta-learners, but should also be considered a strong contender in settings that contain domain shift between meta-training and meta-testing tasks, where their strong inductive bias for reasonable learning strategies provides substantially improved performance.
7.2 Effect of Depth
The proofs in Sections 4 and 5 suggest that gradient descent with deeper representations results in more expressive learning procedures. In contrast, the universal function approximation theorem only requires a single hidden layer to approximate any function. Now, we seek to empirically explore this theoretical finding, aiming to answer the question: is there a scenario in which model-agnostic meta-learning requires a deeper representation to achieve good performance, compared to the depth of the representation needed to solve the underlying tasks being learned?
Figure 5: Comparison of depth while keeping the number of parameters constant. Task-conditioned models do not need more than one hidden layer, whereas meta-learning with MAML clearly benefits from additional depth. Error bars show standard deviation over three training runs.
To answer this question, we study a simple regression problem, where the meta-learning goal is to infer a polynomial function from 40 input/output datapoints. We use polynomials of degree 3, where the coefficients and bias are sampled uniformly at random and the input values range over a fixed interval. Similar to the conditions in the proof, we meta-train and meta-test with one gradient step, use a mean-squared error objective, use ReLU nonlinearities, and use a bias transformation variable of dimension 10. To compare the relationship between depth and expressive power, we compare models with a fixed number of parameters while varying the network depth from 1 to 5 hidden layers. As a point of comparison to the models trained for meta-learning using MAML, we trained standard feedforward models to regress from the input and the 4-dimensional task description (the 3 coefficients of the polynomial and the scalar bias) to the output. These task-conditioned models act as an oracle and are meant to empirically determine the depth needed to represent these polynomials, independent of the meta-learning process. Theoretically, we would expect the task-conditioned models to require only one hidden layer, as per the universal function approximation theorem. In contrast, we would expect the MAML model to require more depth. The results, shown in Figure 5, demonstrate that the task-conditioned model does indeed not benefit from having more than one hidden layer, whereas the MAML model clearly achieves better performance with more depth, even though the model capacity, in terms of the number of parameters, is fixed. This empirical effect supports the theoretical finding that depth is important for effective meta-learning using MAML.
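Holding the parameter count fixed while varying depth requires solving for the hidden width. Below is a small helper of our own (the layer structure and the parameter budget are illustrative, not the paper's exact configuration) that picks the widest hidden layer fitting a budget for a given depth:

```python
def mlp_param_count(in_dim, hidden, depth, out_dim):
    """Parameter count of a fully connected net with `depth` hidden layers
    of width `hidden`, including biases."""
    total = in_dim * hidden + hidden                   # input layer
    total += (depth - 1) * (hidden * hidden + hidden)  # hidden-to-hidden layers
    total += hidden * out_dim + out_dim                # output layer
    return total

def width_for_budget(budget, in_dim, depth, out_dim):
    """Largest hidden width whose parameter count stays within the budget."""
    h = 1
    while mlp_param_count(in_dim, h + 1, depth, out_dim) <= budget:
        h += 1
    return h

budget = 40_000  # illustrative budget, held roughly constant across depths
for depth in range(1, 6):
    h = width_for_budget(budget, in_dim=1, depth=depth, out_dim=1)
    print(depth, h, mlp_param_count(1, h, depth, 1))
```

A single hidden layer gets a very wide layer, while five hidden layers get width on the order of the square root of the budget; in all cases the total stays just under the budget, making the depth comparison capacity-matched.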
8 Conclusion
In this paper, we show that there exists a form of deep neural network such that the initial weights combined with gradient descent can approximate any learning algorithm. Our findings suggest that, from the standpoint of expressivity, there is no theoretical disadvantage to embedding gradient descent into the meta-learning process. In fact, in all of our experiments, we found that the learning strategies acquired with MAML are more successful when faced with out-of-domain tasks compared to recurrent learners. Furthermore, we show that the representations acquired with MAML are highly resilient to overfitting. These results suggest that gradient-based meta-learning has a number of practical benefits, and no theoretical downsides in terms of expressivity when compared to alternative meta-learning models. Independent of the type of meta-learning algorithm, we formalize what it means for a meta-learner to be able to approximate any learning algorithm in terms of its ability to represent functions of the dataset and test inputs. This formalism provides a new perspective on the learning-to-learn problem, which we hope will lead to further discussion and research on the goals and methodology surrounding meta-learning.
Acknowledgments
We thank Sharad Vikram for detailed feedback on the proof, as well as Justin Fu, Ashvin Nair, and Kelvin Xu for feedback on an early draft of this paper. We also thank Erin Grant for helpful conversations and Nikhil Mishra for providing code for SNAIL. This research was supported by the National Science Foundation through IIS-1651843 and a Graduate Research Fellowship, as well as NVIDIA.
References
 Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Neural Information Processing Systems (NIPS), 2016.
 Bengio et al. (1992) Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Optimality in Artificial and Biological Neural Networks, 1992.

 Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.
 Duan et al. (2016) Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

 Finn et al. (2017a) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning (ICML), 2017a.
 Finn et al. (2017b) Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. Conference on Robot Learning (CoRL), 2017b.
 Funahashi (1989) Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 1989.
 Ha et al. (2017) David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. International Conference on Learning Representations (ICLR), 2017.
 Hochreiter et al. (2001) Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks. Springer, 2001.
 Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 1989.
 Hornik et al. (1990) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks, 1990.
 Husken & Goerick (2000) Michael Husken and Christian Goerick. Fast learning for problem classes using knowledge based network initialization. In International Joint Conference on Neural Networks (IJCNN), 2000.
 Lake et al. (2011) Brenden M Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B Tenenbaum. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.
 Li et al. (2017) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to generalize: Meta-learning for domain generalization. arXiv preprint arXiv:1710.03463, 2017.
 Li & Malik (2017a) Ke Li and Jitendra Malik. Learning to optimize. International Conference on Learning Representations (ICLR), 2017a.
 Li & Malik (2017b) Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017b.
 Mishra et al. (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. International Conference on Learning Representations (ICLR), 2018.
 Munkhdalai & Yu (2017) Tsendsuren Munkhdalai and Hong Yu. Meta networks. International Conference on Machine Learning (ICML), 2017.
 Ravi & Larochelle (2017) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
 Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
 Schmidhuber (1987) Jürgen Schmidhuber. Evolutionary principles in self-referential learning. (On learning how to learn: The meta-meta-... hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.
 Sonoda & Murata (2017) Sho Sonoda and Noboru Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 2017.
 Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
Appendix A Supplementary Proofs for the 1-Shot Setting
A.1 Proof of Lemma 4.1
While there are likely a number of ways to prove Lemma 4.1, here we provide a simple, though inefficient, proof.
To prove this lemma, we will proceed by showing that we can choose θ_ft, θ_b, and each W̃_i and W̄_i such that the summation contains a complete description of the values of x, y, and x⋆. Then, because f_out is a universal function approximator, the post-update function will be able to approximate any function of x, y, and x⋆.
Since the first and last terms of the sum involve Ā_1 = I and B̃_N = I, and thus cannot be shaped by the remaining weights, we will essentially ignore them by scaling the corresponding weight matrices by a small positive constant ε, chosen to also ensure positive definiteness. We can then rewrite the summation with the first and last terms omitted, keeping the terms i = 2, …, N − 1.
Next, we will reindex this sum using two indexing variables, j and l, where j will index over the discretization of x and l will index over the discretization of x⋆.
Next, we will define our chosen form of the summation in Equation 8. We show how to acquire this form in the next section.
Lemma A.1
We can choose θ_ft and each W_i such that
(8)
where discr(·) denotes a function that produces a one-hot discretization of its input and e_j denotes the j-th standard basis vector (0-indexed).
Now that we have defined the function in Equation 8, we will next define the other terms in the sum. Our goal is for the summation to contain complete information about (x, y, x⋆). To do so, we will choose φ̃ to be the linear function that outputs stacked copies of discr(x), one copy per index pair (j, l). Then, we will define B̃_i to be a matrix that selects the copy of discr(x) in the position corresponding to (j, l). This can be achieved using a diagonal matrix with diagonal values of 1 + ε at the positions corresponding to the (j, l)-th copy, and ε elsewhere, where ε is small and is used to ensure that B̃_i is positive definite.
As a result, in the post-update function, the middle component z̄′ contains the label information −α ē(y), up to positive scaling, at the position indexed jointly by (j, l), where j satisfies discr(x) = e_j and l satisfies discr(x⋆) = e_l. Note that the vector z̄′ is a complete description of (x, y, x⋆), in that x, y, and x⋆ can be decoded from it. Therefore, since f_out is a universal function approximator and because its input contains all of the information of (x, y, x⋆), the post-update function is a universal function approximator with respect to its inputs (x, y, x⋆).
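The discretization argument above can be made concrete with a small sketch (the bin count, the input range, and the direct use of y in place of ē(y) are illustrative assumptions):

```python
import numpy as np

def discr(v, lo=-1.0, hi=1.0, n=8):
    # One-hot discretization of a scalar into n bins; the bin count and
    # input range are illustrative choices, not values from the paper.
    j = min(n - 1, max(0, int((v - lo) / (hi - lo) * n)))
    out = np.zeros(n)
    out[j] = 1.0
    return out

def encode(x, y, x_star, n=8):
    # Place the label information y at the single position indexed jointly
    # by the discretizations of x and x_star, as in the construction above.
    return np.kron(discr(x, n=n), discr(x_star, n=n)) * y

def decode(z_bar, n=8):
    # Recover the bin index of x, the bin index of x_star, and y itself;
    # this requires the vector to be nonzero (i.e. y != 0 here).
    pos = int(np.argmax(np.abs(z_bar)))
    j, l = divmod(pos, n)
    return j, l, float(z_bar[pos])
```

A single vector therefore suffices to communicate (x, y, x⋆) to the universal approximator f_out; the requirement z̄′ ≠ 0 in Appendix A.4 corresponds to the nonzero label information here.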
A.2 Proof of Lemma A.1
In this section, we show one way of proving Lemma A.1. Recall that ē(y) denotes the middle component of the error gradient ∇_z ℓ. Since the gradient with respect to z̄ can be chosen to be any linear function of the label y (see Section 6), we can assume without loss of generality a convenient linear form for ē(y).
We will choose θ_ft and the weight-matrix blocks as follows: each block is of the form E_{ab} + εI, where we use E_{ab} to denote the matrix with a 1 at position (a, b) and 0 elsewhere, and εI is added to ensure the positive definiteness of each block, as required in the construction.
Using the above definitions, we can see by direct computation that the summation reduces to the desired form. Thus, we have proved the lemma, showing that we can choose a θ_ft and each W_i such that Equation 8 holds.
A.3 Form of the linear weight matrices
The goal of this section is to show that we can choose a form of φ(·), the weight matrices W_i, and the error gradient ∇_z ℓ such that we can simplify the form of z′ in Equation 1 into the following:
(9)
where Ā_1 = I, Ā_i = ∏_{j=1}^{i−1} W̄_j for i > 1, B̃_i = ∏_{j=i+1}^{N} W̃_j for i < N, and B̃_N = I.
Recall that we decomposed φ(·), z, and the error gradient ∇_z ℓ into three parts, as follows:
(10) φ(x; θ_ft, θ_b) = [φ̃(x; θ_ft); 0; θ_b], z = [z̃; z̄; ž], ∇_z ℓ = [0; ē(y); ě(y)],
where the initial value of θ_b will be 0. The top components, φ̃(x; θ_ft) and z̃, have equal dimensions, as do the middle components, 0 and z̄. The bottom components, θ_b and ž, are scalars. As a result, we can see that z′ will likewise be made up of three components, which we will denote as z̃′, z̄′, and ž′, where, before the gradient update, z̃′ = z̃, z̄′ = z̄ = 0, and ž′ = ž. Lastly, we construct the top component of the error gradient to be 0, whereas the middle and bottom components, ē(y) and ě(y), can be set to be any linear (but not affine) function of y.
Using the above definitions, and noting that θ_b = 0, we can simplify the form of z′ in Equation 1 and read off its middle component, z̄′.
We aim to independently control the backward information flow from the gradient and the forward information flow from φ̃(x). Thus, choosing all W̃_i and W̄_i to be square and full rank, we will set each W_i to be block-diagonal in the three components, so that the products A_i and B_i decompose blockwise as well, with middle blocks Ā_i and top blocks B̃_i for i = 1, …, N, where Ā_i collects the product of the middle blocks W̄_j and B̃_i the product of the top blocks W̃_j. Then we can again simplify the form of z̄′:
(11) z̄′ = −α ∑_{i=1}^{N} Ā_i Ā_i^T ē(y) φ̃(x)^T B̃_i^T B̃_i φ̃(x⋆)
where Ā_1 = I, Ā_i = ∏_{j=1}^{i−1} W̄_j for i > 1, B̃_i = ∏_{j=i+1}^{N} W̃_j for i < N, and B̃_N = I.
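The block structure described in this section can be checked numerically. The sketch below (dimensions, random values, and the first-order expansion are illustrative assumptions, not the paper's exact construction) builds block-diagonal weight matrices, applies one gradient step to their product, and verifies that the middle component of the resulting activation, which was exactly zero before the update, now carries the gradient information while matching the first-order expansion of the updated product:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 3, 2  # number of linear layers and block size (illustrative)

def block_diag(top, mid, bot):
    # W_i = diag(top block, middle block, scalar bottom block).
    W = np.zeros((2 * d + 1, 2 * d + 1))
    W[:d, :d], W[d:2 * d, d:2 * d], W[-1, -1] = top, mid, bot
    return W

def prod(mats):
    out = np.eye(2 * d + 1)
    for M in mats:
        out = out @ M
    return out

Ws = [block_diag(rng.normal(size=(d, d)), rng.normal(size=(d, d)), 1.0)
      for _ in range(N)]
# Features [phi~(x); 0; theta_b] with theta_b = 0, and an error gradient
# [0; e-(y); e^(y)] whose top component is zero, as in the decomposition.
phi = np.concatenate([rng.normal(size=d), np.zeros(d), [0.0]])
phi_star = np.concatenate([rng.normal(size=d), np.zeros(d), [0.0]])
e = np.concatenate([np.zeros(d), rng.normal(size=d), [1.0]])

alpha = 1e-6
A = [prod(Ws[:i]) for i in range(N)]       # A_i = W_1 ... W_{i-1}
B = [prod(Ws[i + 1:]) for i in range(N)]   # B_i = W_{i+1} ... W_N
grads = [A[i].T @ np.outer(e, phi) @ B[i].T for i in range(N)]

z0 = prod(Ws) @ phi_star                   # pre-update activation
z1 = prod([W - alpha * G for W, G in zip(Ws, grads)]) @ phi_star
# First-order expansion: z1 = z0 - alpha * sum_i A_i G_i B_i phi* + O(alpha^2)
delta = -alpha * sum(A[i] @ grads[i] @ B[i] for i in range(N)) @ phi_star
```

Because the top component of the error gradient is zero and the middle component of the features is zero, the forward information in the top block and the backward information in the middle block do not interfere, which is exactly the independent control that the construction aims for.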
A.4 Output function
In this section, we will derive the post-update version of the output function f_out(·; θ_out). Recall that f_out is defined as a neural network that approximates the following multiplexer function and its derivatives (as shown possible by Hornik et al. (1990)):
(12) f_out(z; θ_out) = 1(z̄ = 0) g(z; θ_g) + 1(z̄ ≠ 0) h(z̄; θ_h),
where 1(·) denotes the indicator function. The parameters θ_g and θ_h are a part of θ_out, in addition to the parameters required to estimate the indicator functions and their corresponding products. Since z̄ = 0 and ž = 0 when the gradient step is taken, we can see that the error gradients with respect to the parameters in the last term in Equation 12 will be approximately zero. Furthermore, as seen in the definition of g in Section 6, the value of g evaluated at the pre-update z is also zero, resulting in a gradient of approximately zero for the first indicator function. (To guarantee that g and h are zero when evaluated at these pre-update values, we make the assumption that g and h are neural networks with no biases and nonlinearity functions that output zero when evaluated at zero.)
The post-update value of the output function is therefore:
(13) f_out(z′; θ_out) ≈ h(z̄′; θ_h),
as long as z̄′ ≠ 0. In Appendix A.1, we can see that z̄′ is indeed not equal to zero.
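The multiplexer behavior in Equations 12 and 13 can be illustrated with a minimal sketch (the indicator is an exact comparison here, whereas the construction approximates it and its derivatives with a neural network; the branch functions g and h below are illustrative stand-ins):

```python
import numpy as np

def f_out(z_tilde, z_bar, g, h):
    # Multiplexer: before the update z_bar = 0 and the g-branch is active;
    # after the update z_bar != 0 and the h-branch takes over (Equation 13).
    if np.allclose(z_bar, 0.0):
        return g(np.concatenate([z_tilde, z_bar]))
    return h(z_bar)

# Bias-free branch functions whose nonlinearities vanish at zero, so that
# both branches output zero on all-zero inputs (the footnote assumption).
g = lambda z: float(np.tanh(z).sum())
h = lambda z: float(np.maximum(z, 0.0).sum())

pre = f_out(np.array([0.5, -0.2]), np.zeros(2), g, h)            # g-branch
post = f_out(np.array([0.5, -0.2]), np.array([1.0, 2.0]), g, h)  # h-branch
```

After the gradient step fills z̄′ with the encoded (x, y, x⋆) information, only the h-branch matters, which is why the post-update output can approximate an arbitrary function of the dataset and the test input.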
Appendix B Full K-Shot Proof of Universality
In this appendix, we provide a full proof of the universality of gradient-based meta-learning in the general case with K datapoints. This proof shares much of its content with the proof for the 1-shot setting, but we include it in full for completeness.
We aim to show that a deep representation combined with one step of gradient descent can approximate any permutation-invariant function of a dataset of K datapoints and a test datapoint. Note that K does not need to be small.
We will proceed by construction, showing that there exists a neural network function f̂(·; θ) such that the post-update function f̂(·; θ′), obtained after one gradient step on the K training datapoints with learning rate α, approximates the target function up to arbitrary precision. As we discuss in Section 6, the loss function ℓ cannot be any loss function, but the standard cross-entropy and mean-squared error objectives are both suitable. In this proof, we will start by presenting the form of f̂ and deriving its value after one gradient step. Then, to show universality, we will construct a setting of the weight matrices that enables independent control of the information flow coming forward from the inputs x_k and x⋆, and backward from the labels y_k.
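Concretely, the claim above can be written as follows, restating the one-step update in symbols (θ denotes all meta-learned parameters and {(x_k, y_k)} the training set; the notation is a reconstruction consistent with the construction in Section 4):

```latex
\theta' \;=\; \theta - \alpha \,\nabla_\theta \sum_{k=1}^{K} \ell\bigl(y_k,\; \hat{f}(x_k;\theta)\bigr),
\qquad
\hat{f}(x^\star;\theta') \;\approx\; f_{\mathrm{target}}\bigl(\{(x_k, y_k)\}_{k=1}^{K},\; x^\star\bigr).
```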
We will start by constructing f̂. With the same motivation as in Section 4, we will construct f̂ as the composition of an output function with a product of linear weight matrices applied to the input features: f̂(x; θ) = f_out((∏_{i=1}^{N} W_i) φ(x; θ_ft, θ_b); θ_out).