1 Introduction
A key ingredient of human intelligence is the ability to generalize beyond models that have already been learned, and to quickly propose new models in new environments from few examples. For example, after seeing several moving objects accelerated by different forces in different environments, humans are able not only to develop models for each environment, but also to develop a meta-model that can generalize to a continuum of unseen accelerated objects. Upon arriving at a new environment and seeing an object moving for only a short time, a person can quickly propose a new model for the moving object, including a good estimate of its acceleration. Incorporating this ability to generalize beyond the training datasets and environments, for quick recognition from few examples, remains an important challenge in machine learning.
In this paper, we focus on learning a series of prediction/regression models with continuous targets, where each class of problems has similar underlying mechanisms. Algorithms are compared by how well and how quickly they can generalize to unseen tasks from few examples. This class of problems is important in many areas, for example, learning and predicting physics Fraccaro et al. (2017); Wu et al. (2017); Watters et al. (2017); Chang et al. (2016), reinforcement learning of games Mnih et al. (2015); Kansky et al. (2017), unsupervised learning of videos Srivastava et al. (2015), and applications such as self-driving, where we cannot enumerate and train with all environments that the algorithm will encounter.
To tackle such problems, we propose a novel class of neural networks that we term Meta-Learning Autoencoders (MeLA), schematically illustrated in Figure 1. At its core, a MeLA consists of a learnable meta-recognition model that can, for each (unseen) task, distill a few input-output examples into a model code vector parametrizing the task’s functional relationship, and a learnable meta-generative model that maps this model code into the weight and bias parameters of a neural network implementing this function. This architecture forces the meta-recognition model to discover and encode the important variations of the functional mappings for different tasks, and the meta-generative model to decode the model codes to corresponding task-specific models with a common model-generating network. This brings the key innovation of MeLA: for a class of tasks, MeLA does not attempt to learn a single good initialization for multiple tasks as in Finn et al. (2017), or learn an update function Schmidhuber (1987); Bengio et al. (1992); Andrychowicz et al. (2016), or learn an update function together with a single good initialization Li et al. (2017); Ravi & Larochelle (2016). Instead, it learns to map the few examples from different datasets into different models, which not only allows for more diverse model parameters tailored to each individual task, but also obviates the need for fine-tuning. Moreover, by encoding each function as a vector in a single low-dimensional latent space, MeLA is able to generalize beyond the training datasets, by both interpolating between and extrapolating beyond learned models into a continuum of models. We will demonstrate that the meta-learning autoencoder has the following 3 important capabilities:

Augmented model recognition: MeLA strategically builds on a pre-existing, single-task trained network, augmenting it with a second network used for meta-recognition. It achieves lower loss in unseen environments at zero and few gradient steps, compared with both the original architecture upon which it is based and state-of-the-art meta-learning algorithms.

Influence identification: MeLA can identify which examples are most useful for determining the model (for example, a rectangle’s vertices have far greater influence in determining its position and size than its interior points).

Interactive learning: MeLA can actively request new samples which maximize its ability to learn models.
2 Methods
Meta-learning problem setup
We are interested in modeling a set of vector-valued functions (which we will refer to as models), each of which maps an input vector of fixed dimension into an output vector of fixed dimension. Let’s first consider the case of a single dataset. Given many input-output pairs linked by the same function, we group the corresponding vectors into an input matrix and an output matrix whose rows are the individual input and output vectors. In this paper, we focus on regression problems where the target is continuous, but the generalization to classification problems is straightforward. This class of problems includes a wide range of scenarios, e.g., modeling time series data, learning physics and dynamics, and frame-to-frame prediction of videos.
The meta-learning problem we tackle is as follows. Suppose that we are given an ensemble of datasets, each of which is generated by a corresponding function. In the single-task scenario, we want to train a model that predicts all output vectors from the corresponding input vectors of a single dataset, so as to minimize some loss function that quantifies the prediction errors. The meta-learning goal is, after training on an ensemble of training datasets, to be able to quickly learn from few examples from held-out datasets, adapt to them, and obtain a low loss on them.
Meta-learning autoencoder architecture
The architecture of our Meta-Learning Autoencoder (MeLA) is illustrated in Figure 1. It is defined by three vector-valued functions, the main model, the meta-recognition model and the meta-generative model, each implemented by a feedforward neural network with its own parameter vector. In contrast to prior methods for learning to quickly adapt to different datasets Finn et al. (2017) or using a memory-augmented setup Santoro et al. (2016); Vinyals et al. (2016), MeLA takes full advantage of the prior that the datasets are generated by a hidden model class, where the functions lie in a relatively low-dimensional submanifold of the space of all functions. Based on this prior, we use a meta-recognition model that maps a whole dataset to a model code vector, and a meta-generative model that maps the model code to the parameter vector of the network implementing the main model. In other words, the parameters of the main model are the output of the meta-generative model applied to the model code, and for a specific dataset, the model code is obtained by applying the meta-recognition model to that dataset's inputs and outputs:
(1) 
This architecture is designed so that it can easily transform a neural network that is originally intended to learn from a single task into an architecture that can perform meta/few-shot learning on a number of tasks, combining the knowledge of individual task architectures with MeLA’s meta-learning power. If we start from an original single-task model, then without changing its architecture, we can simply attach a meta-recognition model and a meta-generative model that generates its parameters, and train on an ensemble of tasks.
Network architecture examples
Although the MeLA architecture described above can be implemented with any choice of the three feedforward neural networks that define the main model, the meta-recognition model and the meta-generative model, let us consider simple specific implementations to build intuition and get ready for the numerical experiments.
Suppose we implement the main model as a network with two hidden layers. The meta-recognition model takes as input the input and output matrices of a dataset, concatenated horizontally into a single matrix with one row per training example at hand. The feedforward neural network implementing the meta-recognition model has two parts. The first is a series of layers that collectively transform the input matrix into a matrix with one row per example and one column per output neuron of this first block (we typically use 200 to 400 output neurons below). Then a max-pooling operation is applied over the examples, transforming this matrix into a single vector whose length equals the number of output neurons of the first block. The meta-recognition model is thus defined independently of the number of training examples. As will be explained in the “Influence identification" subsection below, max-pooling is key to MeLA, forcing the meta-recognition model to learn to capture key characteristics in a few representative examples. The second block of the network is a multilayer feedforward neural network, which takes as input the max-pooled vector and transforms it into a model code that parametrizes the functional relationship between the inputs and the outputs.
The meta-generative model takes as input the model code, and for each layer in the main model, it has two separate neural networks that map the model code to all the weight and bias parameters of that layer. We typically implement each of these subnetworks using 2-3 hidden layers with 60 neurons each. Compared with the original main model, this implies only a linear increase in the number of parameters, independent of the number of tasks.
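To make the data flow concrete, the following is a minimal NumPy sketch of a forward pass through the three networks, with random (untrained) weights. The layer sizes, the code dimension, the weight scale and all function names (`recognize`, `generate`, `predict`) are hypothetical choices for illustration only and are not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(a, slope=0.3):
    return np.where(a > 0, a, slope * a)

def init_mlp(sizes):
    """Random (weight, bias) pairs for an MLP with the given layer sizes."""
    return [(rng.normal(0, 0.3, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, a):
    """Apply an MLP: leaky-ReLU on hidden layers, linear last layer."""
    for W, b in params[:-1]:
        a = leaky_relu(a @ W + b)
    W, b = params[-1]
    return a @ W + b

# Hypothetical sizes, chosen only for illustration.
IN_DIM, OUT_DIM, POOL_DIM, CODE_DIM = 1, 1, 200, 4
MAIN_SIZES = [IN_DIM, 40, 40, OUT_DIM]

block1 = init_mlp([IN_DIM + OUT_DIM, 60, 60, 60, POOL_DIM])  # per-example features
block2 = init_mlp([POOL_DIM, 60, 60, CODE_DIM])              # pooled vector -> code
# One generator MLP per weight matrix and one per bias vector of the main model.
generators = [(init_mlp([CODE_DIM, 60, 60, 60, m * n]),
               init_mlp([CODE_DIM, 60, 60, 60, n]))
              for m, n in zip(MAIN_SIZES[:-1], MAIN_SIZES[1:])]

def recognize(X, Y):
    """Meta-recognition: dataset -> model code."""
    feats = mlp(block1, np.hstack([X, Y]))   # shape (N, POOL_DIM)
    pooled = feats.max(axis=0)               # max over examples -> (POOL_DIM,)
    return mlp(block2, pooled)               # shape (CODE_DIM,)

def generate(code):
    """Meta-generation: model code -> weights and biases of the main model."""
    shapes = zip(MAIN_SIZES[:-1], MAIN_SIZES[1:])
    return [(mlp(gw, code).reshape(m, n), mlp(gb, code))
            for (gw, gb), (m, n) in zip(generators, shapes)]

def predict(code, X_new):
    """Run the generated main model on new inputs."""
    return mlp(generate(code), X_new)

X = rng.uniform(-5, 5, (10, IN_DIM))
Y = np.sin(X)
code = recognize(X, Y)
Y_hat = predict(code, X)
```

Because the pooling is a maximum over rows, `recognize` returns the same code for any reordering of the examples and accepts any number of them, which is the property the text relies on.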
MeLA’s meta-training and evaluation
The extension from training on a single task to MeLA is straightforward. Suppose that we are given the loss function for the single task, together with its expected risk. Then the meta-expected risk for MeLA is
(2) 
where the datasets are drawn from the distribution generated by the hidden model class. The goal of meta-training is to learn the parameters of the meta-recognition model and the meta-generative model such that the meta-expected risk is minimized:
(3) 
Algorithm 1 illustrates the step-by-step meta-training process for MeLA implementing an empirical meta-risk minimization for Eq. (3). In each iteration, the training dataset ensemble is randomly permuted, from which each dataset is selected once for inner-loop task-specific training. Inside the task-specific training, the training examples of each dataset are used for calculating the model code, after which the resulting model parameter vector and the testing examples are used to calculate the task-specific testing loss, from which the gradients w.r.t. the parameters of the meta-recognition and meta-generative models are computed and used for one step of gradient descent on these two models. Note that here the task-specific testing loss on the training datasets serves as the training loss in the meta-training.
During the evaluation of MeLA, we use the held-out datasets unseen during the meta-training. For each held-out dataset, we split it into training and testing examples. The training examples are fed to MeLA and a task-specific model is generated without any gradient descent. Then we evaluate the task-specific model on the testing examples in the held-out datasets. We also evaluate whether the task-specific model can further improve with a few more steps of gradient descent.
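The loop structure described above can be sketched as follows. This is only an illustration of how the empirical meta-risk is assembled (shuffle the ensemble, split each dataset, recognize, generate, score on the test split); it omits the gradient step, and the function name `empirical_meta_risk`, the mean-squared-error loss and the toy stand-ins for the two meta-models are assumptions rather than the paper's Algorithm 1.

```python
import numpy as np

def empirical_meta_risk(datasets, recognize, generate, predict, n_train, rng):
    """Average task-specific test loss over a shuffled ensemble of datasets."""
    losses = []
    for k in rng.permutation(len(datasets)):         # each dataset visited once
        X, Y = datasets[k]
        X_tr, Y_tr = X[:n_train], Y[:n_train]        # examples shown to the recognizer
        X_te, Y_te = X[n_train:], Y[n_train:]        # examples used for the loss
        code = recognize(X_tr, Y_tr)                 # dataset -> model code
        params = generate(code)                      # model code -> parameters
        Y_hat = predict(params, X_te)                # task-specific model, no fine-tuning
        losses.append(np.mean((Y_hat - Y_te) ** 2))  # mean squared error (assumed loss)
    return float(np.mean(losses))

# Stand-ins (NOT trained networks): tasks are lines y = a*x + b, the "recognizer"
# is a least-squares fit and the "generator" is the identity map.
def recognize(X, Y):
    A = np.hstack([X, np.ones_like(X)])
    return np.linalg.lstsq(A, Y, rcond=None)[0].ravel()

def generate(code):
    return code

def predict(params, X):
    return params[0] * X + params[1]

rng = np.random.default_rng(0)
datasets = []
for _ in range(5):
    a, b = rng.uniform(-2, 2, size=2)
    X = rng.uniform(-5, 5, (20, 1))
    datasets.append((X, a * X + b))

risk = empirical_meta_risk(datasets, recognize, generate, predict, n_train=10, rng=rng)
```

In actual meta-training, the per-dataset loss inside the loop would be differentiated with respect to the parameters of the two meta-models and followed by one optimizer step; the same function without any update corresponds to the zero-gradient-step evaluation on held-out datasets.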
Influence identification
The max-pooling over examples in the meta-recognition model is key to MeLA, and also provides a natural way to identify the influence of each example on the generated model. Typically, some examples are more useful than others in determining the model. For example, suppose that we try to learn a function defined on the plane that equals 1 inside a polygon and 0 outside, with different polygons corresponding to different models. Then data points near the polygon vertices carry far more information about the model than do points in the deep interior, and the max-pooling over the dimension of examples forces the meta-recognition model to recognize those most useful points, and based on them perform computation that returns a model code that determines the whole polygon. Recall that max-pooling compresses the feature matrix, which has one row per example, into a single feature vector, which means that for each column of the matrix, only one of the examples takes the maximum value and hence contributes to this feature. We therefore define the influence of an example as
(4) 
The influence of each example can be interpreted as a percentage, since it lies between 0 and 1, and the influences sum to 1 over all the examples in the dataset fed to the meta-recognition model.
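One reading of this definition that is consistent with the properties just stated (each value between 0 and 1, all values summing to 1) is the fraction of pooled features whose maximum is attained by a given example. The sketch below implements that reading; the function name `influence` and the toy feature matrix are illustrative assumptions.

```python
import numpy as np

def influence(features):
    """Fraction of pooled features whose maximum is attained by each example.

    features: array of shape (n_examples, n_features), the per-example
    activations right before max-pooling over the example dimension.
    """
    n_examples, n_features = features.shape
    winners = features.argmax(axis=0)                    # winning example per column
    counts = np.bincount(winners, minlength=n_examples)  # columns won by each example
    return counts / n_features

feats = np.array([[0.9, 0.1, 0.8, 0.2],
                  [0.1, 0.7, 0.3, 0.1],
                  [0.2, 0.3, 0.1, 0.0]])
scores = influence(feats)  # example 0 wins columns 0, 2 and 3; example 1 wins column 1
```

Ties are resolved here in favor of the first example attaining the maximum, which is a convention of `argmax` rather than something specified in the text.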
Interactive learning
In some situations, measurements are hard or costly to obtain. It is then helpful if we can do better than merely acquiring random examples, and instead determine in advance at which data points to collect measurements to glean as much information as possible about the correct function. Specifically, suppose that we want to predict the output as accurately as possible at a given target input point where we have no training data. If, before making our prediction, we have the option to measure the output at one of several candidate input points, then which point should we choose?
The MeLA architecture provides a natural way to answer this question. We can first calculate the current predictions at the candidate points based on the current model generated by the meta-recognition and meta-generative models from the examples that are already given. Then we can fix the meta-parameters of these two models, and calculate the sensitivity matrix of the prediction at the target point w.r.t. each current prediction at a candidate point:
(5) 
We can select the candidate point whose sensitivity matrix has the largest determinant, i.e., the point for which the measured data carries the most information about the answer that we want:
(6) 
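The selection rule above can be sketched numerically as follows. The sketch treats the learner as a black box `fit_predict(X, Y, x_query)`, estimates the sensitivity matrix by central finite differences instead of backpropagation, and uses a Gaussian-kernel weighted average as a stand-in learner; all of these, as well as taking the absolute value of the determinant, are assumptions made for illustration and not MeLA itself.

```python
import numpy as np

def sensitivity(fit_predict, X, Y, x_cand, x_target, eps=1e-5):
    """Finite-difference Jacobian of the target prediction w.r.t. a hypothetical
    measurement at x_cand, evaluated at the current prediction there."""
    y_cand = fit_predict(X, Y, x_cand)          # current prediction at the candidate
    X_aug = np.vstack([X, x_cand])              # dataset with the candidate appended
    m = Y.shape[1]
    J = np.zeros((m, m))
    for j in range(m):
        d = np.zeros(m)
        d[j] = eps
        up = fit_predict(X_aug, np.vstack([Y, y_cand + d]), x_target)
        dn = fit_predict(X_aug, np.vstack([Y, y_cand - d]), x_target)
        J[:, j] = (up - dn) / (2 * eps)
    return J

def select_candidate(fit_predict, X, Y, candidates, x_target):
    """Index of the candidate with the largest |det| of its sensitivity matrix."""
    dets = [abs(np.linalg.det(sensitivity(fit_predict, X, Y, c, x_target)))
            for c in candidates]
    return int(np.argmax(dets)), dets

# Stand-in learner (NOT MeLA): Gaussian-kernel weighted average of the outputs.
def kernel_predict(X, Y, x_query, width=1.0):
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * width ** 2))
    return (w[:, None] * Y).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (8, 2))
Y = np.stack([np.sin(X[:, 0]), np.cos(X[:, 1])], axis=1)
x_target = np.array([0.0, 0.0])
candidates = [np.array([2.5, 2.5]), np.array([0.3, -0.2]), np.array([-1.5, 1.0])]
best, dets = select_candidate(kernel_predict, X, Y, candidates, x_target)
```

For this stand-in learner the sensitivity is largest for the candidate closest to the target point, so the second candidate is selected. The finite-difference loop costs two evaluations per output dimension and candidate, and does not exploit any factorization of the Jacobian.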
If we model our uncertainty about the output to be predicted as a multivariate Gaussian distribution, then this criterion maximizes the entropy reduction, i.e., the number of bits of information learned about it from the new measurement. Note that with the meta-parameters fixed and for a given target point, one Jacobian factor of the sensitivity matrix is independent of the different candidate inquiry inputs. This means that we can simply select the candidate point whose remaining factor has the largest “projection" onto it, requiring in total only one forward and one backward pass for all the candidate examples to obtain the gradient. This factorization emerges naturally from MeLA’s architecture.
3 Related work
MeLA addresses the problem of meta-learning Thrun & Pratt (2012); Schmidhuber (1987); Naik & Mammone (1992), where an important subfield is to quickly adapt to new tasks with one-shot or few-shot examples. A recent innovative meta-learning method, MAML Finn et al. (2017), optimizes the parameters of the model so that it is easy to fine-tune to individual tasks in a few gradient steps. Another class of methods focuses on learning a learning rule or update function Schmidhuber (1987); Bengio et al. (1992); Andrychowicz et al. (2016), or learning an update function from a single good initialization Li et al. (2017); Ravi & Larochelle (2016). Compared to these methods that only learn a single good initialization point or how to update from a single initialization point, our method learns recognition and generative models that can quickly determine the model code for the model, and directly propose the appropriate neural network parameters tailored for each task without the need for fine-tuning.
Another interesting class of few-shot learning methods uses memory-augmented networks. Vinyals et al. (2016) propose matching nets for one-shot classification, which generate the probability distribution for the test example based on the support set using attention mechanisms, essentially learning a “similarity" metric between the test example and the support set. Santoro et al. (2016) utilize a neural Turing machine for few-shot learning, and Duan et al. (2016); Wang et al. (2016) learn fast reinforcement learning agents with recurrent policies using memory-augmented nets. In contrast to memory-augmented approaches, our model learns to distill features from representative examples and produces a model code, based on which it directly generates the parameters of the main model. This eliminates the need to store the examples for the support set, and allows a continuous generation of models, which is especially suitable for generating a continuum of regression models. Other few-shot learning techniques include using Siamese structures Koch et al. (2015) and evolutionary methods Mengistu et al. (2016).
Autoencoders are typically used for representation learning in a single dataset, and have only recently been applied to multiple datasets. The recent neural statistician work Edwards & Storkey (2016) applies the variational autoencoder approach to the encoding and generation of datasets. MeLA differs from this work in the following aspects. Firstly, the problem is different. While in the neural statistician each example in the dataset is an instance of a class, in MeLA we are dealing with datasets whose examples are input-output pairs, where we don’t know a priori where the input will be at testing time. Direct autoencoding of datasets is therefore not enough for prediction, especially for regression tasks. Instead of using autoencoding to generate the dataset, MeLA uses autoencoding to generate the model that can generate the dataset given test inputs, which is a more compact way to express the relationship between inputs and outputs.
The idea of using an indirect encoding for the weights of another network originates from the neuroevolution algorithm HyperNEAT Stanley et al. (2009). Both the structure and the weights are updated by evolutionary algorithms. Fernando et al. (2016) improve this method by making the weights differentiable, and Hypernetworks Ha et al. (2016) further make both the network generator and the network differentiable. MeLA draws inspiration from these prior works, but also differs in several key aspects: while the above works focus on learning a single task, MeLA is endowed with a recognition model, and is designed for meta- and few-shot learning on unseen tasks.
4 Experiments
Let us now test the core desiderata of MeLA: can it transform a model that is originally intended for single-task learning into one that can quickly adapt to new tasks with few examples without training, and continue to improve with a few gradient steps? The baseline we compare with is a single network pretrained to fit all tasks, which during testing is fine-tuned to each individual task through further training. MeLA has the same main network architecture as this baseline network, supplemented by the meta-recognition and meta-generative models trained via Algorithm 1. We also compare with the state-of-the-art meta-learning algorithm MAML Finn et al. (2017), with the same main network architecture. In addition, we explore the two other MeLA capabilities: influence identification and interactive learning.
For all experiments, the true model and its parameters are hidden from all algorithms, except for an oracle model which “cheats" by getting access to the true model parameters for each example, thus providing an upper bound on performance. The performance of each algorithm is then evaluated on previously unseen test datasets. For all experiments, the Adam optimizer Kingma & Ba (2014) with default parameters is used for training and fine-tuning during evaluation.
Footnote 1: The code for MeLA and experiments will be open-sourced upon acceptance of the paper.
Simple regression problem
We first demonstrate the 3 capabilities of MeLA via the same simple regression problem previously studied with MAML Finn et al. (2017), where the parameters of the hidden function class are randomly generated for each dataset. For each dataset, 10 input points are randomly sampled as training examples and another 10 are sampled as testing examples. 100 such datasets are presented to the algorithms during training. The baseline model is a 3-layer network where each hidden layer has 40 neurons with leakyReLU activation.
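As an illustration of how such an ensemble of datasets might be generated, the sketch below assumes the sinusoid family of the MAML regression benchmark (random amplitude and phase, inputs drawn uniformly from an interval). The specific ranges, the sign convention of the phase and the function name `sample_sinusoid_dataset` are assumptions and may differ from the settings used in these experiments.

```python
import numpy as np

def sample_sinusoid_dataset(rng, n_train=10, n_test=10):
    """One task y = A * sin(x + phi) with randomly drawn amplitude and phase."""
    amplitude = rng.uniform(0.1, 5.0)   # range assumed from the MAML benchmark
    phase = rng.uniform(0.0, np.pi)     # range assumed from the MAML benchmark
    x = rng.uniform(-5.0, 5.0, size=(n_train + n_test, 1))
    y = amplitude * np.sin(x + phase)
    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])

rng = np.random.default_rng(0)
train_ensemble = [sample_sinusoid_dataset(rng) for _ in range(100)]
(x_tr, y_tr), (x_te, y_te) = train_ensemble[0]
```

Each element of `train_ensemble` holds the 10 training and 10 testing examples of one task; held-out evaluation tasks would be drawn the same way with fresh random parameters.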
The results are shown in Fig. 2. Panel a) plots the mean squared error vs. number of gradient steps on unseen randomly generated testing datasets, showing that MeLA outperforms the baseline model at all stages. It also shows that MeLA asymptotes to the same performance as MAML but learns much faster, starting with a low loss that MAML needs 5 gradient steps to surpass. Panel b) compares predictions with 0 gradient steps. MeLA not only proposes a model that accurately matches the true model, but also identifies each example’s influence on the model generation, and obtains a good prediction even if only the top 3 influential examples are given. Panels c) and d) show MeLA’s capability of actively requesting informative examples by predicting which additional example will help improve the prediction the most.
Ball bouncing with state representation
Next, we test MeLA’s capability in simple but challenging physical environments, where it is desirable that an algorithm quickly adapts to each new environment with few observations of states or frames. Each environment consists of a room with 4 walls, whose frictionless floor is a random 4-sided convex polygon inside the 2-dimensional unit square (Fig. 3(a)), and a ball of radius 0.075 that bounces elastically off these walls and otherwise moves with constant velocity. Because the different room geometries give the ball conflicting bouncing dynamics in different environments, a model trained well in one environment may not necessarily perform well in another, providing an ideal test bed for meta- and few-shot learning. During training, all models take as input 3 consecutive time steps of the ball’s state (its two position coordinates), recorded every time it has moved a distance 0.1. The oracle model is also given as input the coordinates of the floor’s 4 corners.
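The input representation described above (3 consecutive 2-dimensional states predicting the next state) can be built with a simple sliding window. The sketch below is illustrative only: the helper name `make_examples` is hypothetical, and the toy trajectory moves in a straight line without any bounce.

```python
import numpy as np

def make_examples(trajectory, n_steps=3):
    """Stack n_steps consecutive 2-D states as input; the next state is the target."""
    T = len(trajectory)
    X = np.stack([trajectory[t:t + n_steps].ravel() for t in range(T - n_steps)])
    Y = trajectory[n_steps:]
    return X, Y

# Toy trajectory: constant velocity, one record per 0.1 of distance traveled.
direction = np.array([0.6, 0.8])  # unit vector
trajectory = np.array([0.2, 0.1]) + 0.1 * np.arange(8)[:, None] * direction
X, Y = make_examples(trajectory)
```

With 8 recorded states this yields 5 examples, each with an input of size 6 and a target of size 2.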
Fig. 3(b) plots the mean Euclidean distance of the models’ predictions vs. rollout distance traveled. We can see that MeLA outperforms the pretrained baseline and MAML for both 0 and 5 gradient steps. Moreover, the examples that MeLA identifies as influential (Fig. 4) lie near the vertices of the polygon, showing that MeLA essentially learns to capture the convex hull of all the trajectories when proposing the model.
Video prediction
To test MeLA’s ability to integrate into other end-to-end architectures that deal with high-dimensional inputs, we present it with an ensemble of video prediction tasks, each of which has a ball bouncing inside randomly generated polygon walls. The environment setup is the same as in the ball bouncing experiment above, except that the inputs are 3 consecutive frames of 39 x 39 pixel snapshots, and the target is a 39 x 39 snapshot of the next time step. For all the models, a convolutional autoencoder is used for autoencoding the frames, and the models differ only in the latent dynamics model that predicts the future latent variable based on the 3 steps of latent variables encoded by the autoencoder. For the pretrained model, a single 4-layer network with 40 neurons in each hidden layer is used for the latent dynamics model, trained on all tasks. MAML and MeLA also have/generate the same architecture for the latent dynamics model. For the oracle model, the coordinates of the vertices are concatenated with the latent variables as inputs.
Fig. 4(b) plots the mean Euclidean distance of the center of mass of the models’ predictions vs. rollout distance. We see that MeLA again greatly reduces the prediction error compared to the baseline model which has to use a single model to predict the trajectory in all environments. MeLA’s accuracy is seen to be near that of the oracle, demonstrating that MeLA is learning to quickly recognize and model each environment and propose reasonable models.
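The error metric used here, the Euclidean distance between the centers of mass of predicted and true frames, can be sketched as below. The intensity-weighted definition of the center of mass, the pixel units and the function names are assumptions for illustration; the frames are assumed to be non-negative and not identically zero.

```python
import numpy as np

def center_of_mass(frame):
    """Intensity-weighted mean (row, col) position of a 2-D frame."""
    rows, cols = np.indices(frame.shape)
    total = frame.sum()
    return np.array([(rows * frame).sum(), (cols * frame).sum()]) / total

def com_error(pred_frame, true_frame):
    """Euclidean distance between the two centers of mass, in pixels."""
    return float(np.linalg.norm(center_of_mass(pred_frame) - center_of_mass(true_frame)))

true_frame = np.zeros((39, 39))
true_frame[10, 20] = 1.0
pred_frame = np.zeros((39, 39))
pred_frame[13, 24] = 1.0
err = com_error(pred_frame, true_frame)  # offset of (3, 4) pixels
```

Averaging this quantity over test rollouts at each rollout distance would give a curve of the kind described for Fig. 4(b).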
5 Conclusions
In this paper, we have proposed MeLA, an algorithm for rapid recognition and determination of models in meta- and few-shot learning. We have demonstrated that MeLA can transform a model originally intended for single-task learning into one that can quickly adapt to new tasks with few examples, without training, and continue to improve with a few gradient steps. It learns better and faster than both the original model it is based on and the state-of-the-art meta-learning algorithm MAML. We have also demonstrated two additional capabilities of MeLA: its ability to identify influential examples, and its ability to interactively request informative examples to optimize learning.
A core enabler of the human ability to handle novel tasks is our ability to quickly recognize and propose models in new environments, based on previously learned models. We believe that by incorporating this ability, machine learning models will become more adaptive and capable in new environments.
References
Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.
Bengio et al. (1992) Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Univ. of Texas, 1992.
Chang et al. (2016) Chang, M. B., Ullman, T., Torralba, A., and Tenenbaum, J. B. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.
Duan et al. (2016) Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
Edwards & Storkey (2016) Edwards, H. and Storkey, A. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
Fernando et al. (2016) Fernando, C., Banarse, D., Reynolds, M., Besse, F., Pfau, D., Jaderberg, M., Lanctot, M., and Wierstra, D. Convolution by evolution: Differentiable pattern producing networks. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 109–116. ACM, 2016.
Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
Fraccaro et al. (2017) Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 3601–3610. Curran Associates, Inc., 2017.
Ha et al. (2016) Ha, D., Dai, A., and Le, Q. V. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
Kansky et al. (2017) Kansky, K., Silver, T., Mély, D. A., Eldawy, M., Lázaro-Gredilla, M., Lou, X., Dorfman, N., Sidor, S., Phoenix, D. S., and George, D. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In ICML, 2017.
Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Koch et al. (2015) Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
Li et al. (2017) Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835, 2017.
Mengistu et al. (2016) Mengistu, H., Lehman, J., and Clune, J. Evolvability search: directly selecting for evolvability in order to study and produce it. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 141–148. ACM, 2016.
Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
Naik & Mammone (1992) Naik, D. K. and Mammone, R. J. Meta-neural networks that learn by learning. In [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, volume 1, pp. 437–442, Jun 1992.
Ravi & Larochelle (2016) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. 2016.
Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1842–1850, New York, New York, USA, 20–22 Jun 2016. PMLR.
Schmidhuber (1987) Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
Srivastava et al. (2015) Srivastava, N., Mansimov, E., and Salakhutdinov, R. Unsupervised learning of video representations using LSTMs. In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, ICML’15, pp. 843–852. JMLR.org, 2015.
Stanley et al. (2009) Stanley, K. O., D’Ambrosio, D. B., and Gauci, J. A hypercube-based encoding for evolving large-scale neural networks. Artificial life, 15(2):185–212, 2009.
Thrun & Pratt (2012) Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.
Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. Matching networks for one shot learning. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3630–3638. Curran Associates, Inc., 2016.
Wang et al. (2016) Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
Watters et al. (2017) Watters, N., Zoran, D., Weber, T., Battaglia, P., Pascanu, R., and Tacchetti, A. Visual interaction networks: Learning a physics simulator from video. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 4539–4547. Curran Associates, Inc., 2017.
Wu et al. (2017) Wu, J., Lu, E., Kohli, P., Freeman, B., and Tenenbaum, J. Learning to see physics via visual de-animation. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 153–164. Curran Associates, Inc., 2017.
Appendix A MeLA architectural details
As described in Section 2, MeLA consists of a meta-recognition model and a meta-generative model that generates the task-specific model. The meta-recognition model consists of two blocks. The first block is an MLP with 3 hidden layers, each of which has 60 neurons with leakyReLU activation (unless otherwise specified, all leakyReLU activations in this paper have a slope of 0.3 when the activation is below 0). The last layer of this block has linear activation. Then max-pooling is performed along the example dimension, collapsing the per-example feature matrix into a single row, which feeds into the second block. The second block is an MLP with two hidden layers, each of which has 60 neurons with leakyReLU activation, and a last layer with linear activation whose output is the model code.
The meta-generative model takes as input the model code, and for each layer in the main model, it has two separate MLPs that map the model code to all the weight and bias parameters of that layer. For all the experiments in this paper, the MLPs in the meta-generative model have 3 hidden layers, each of which has 60 neurons with leakyReLU activation. The last layer of each MLP has linear activation, and has an output size equal to the size of the corresponding weight or bias in the main network. The output of each MLP in the meta-generative model is then reshaped into the shape of the corresponding weight matrix or bias vector, and directly used as the parameters of the main network.
The architecture of the main network depends on the specific application, to which MeLA’s architecture is agnostic. For the simple regression problem in this paper, we implement the main network as an MLP with 2 hidden layers, each of which has 40 neurons with leakyReLU activation. The last layer has linear activation with output size of 1. For the ball bouncing with state representation experiment, the main network is an MLP with input size of 6 and 3 hidden layers, each of which has 40 neurons with leakyReLU activation. The last layer has linear activation with output size of 2. For the video prediction task, the latent dynamics network uses the same architecture. The convolutional autoencoder used in this experiment is as follows. The encoder has 3 convolutional layers, each with 32 kernels, stride 2 and leakyReLU activation. Its output is flattened into 512 neurons, which feed into a dense layer with 2 neurons and linear activation. In the decoder, the first layer is a dense layer with 512 neurons and linear activation, whose output is reshaped to a tensor with 32 channels. The tensor then goes into 3 convolutional-transpose layers with 32 kernels, each with size of 3, stride of 2 and leakyReLU activation. For the leakyReLU activation in the convolutional autoencoder, we use a slope of 0.01 when the activation is below 0.
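The layer sizes quoted for the convolutional autoencoder can be cross-checked with the standard output-size formulas for strided convolutions. The sketch below assumes square kernels of size 3 and no padding in both the encoder and the decoder; these are assumptions (only the decoder kernel size is stated explicitly), and the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=0):
    """Spatial output size of a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel=3, stride=2, padding=0):
    """Spatial output size of a strided transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel

encoder_sizes = [39]                       # 39 x 39 input frames
for _ in range(3):
    encoder_sizes.append(conv_out(encoder_sizes[-1]))
flattened = encoder_sizes[-1] ** 2 * 32    # 32 channels after the last conv layer

decoder_sizes = [encoder_sizes[-1]]
for _ in range(3):
    decoder_sizes.append(conv_transpose_out(decoder_sizes[-1]))
```

Under these assumptions the spatial size shrinks from 39 to 19, 9 and 4, giving 4 x 4 x 32 = 512 flattened units, and the three transposed convolutions bring it back from 4 to 39, which agrees with the 512-neuron layers and the 39 x 39 frames mentioned in the text.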