1 Introduction
For the most part, learning algorithms have been handcrafted. The deep learning community has largely converged on almost exclusively gradient-based approaches for learning a model's parameters. Such gradient-based approaches generally impose limitations in terms of the loss landscape, choice of network architecture, and training dynamics. A non-exhaustive list of examples: an inherent tendency to overfit to training sets, catastrophic forgetting, the requirement of a smooth loss landscape, and vanishing or exploding gradients in recurrent or large architectures. Moreover, while the mechanics of artificial neurons are inspired by their biological counterparts, they have been greatly simplified to be scalable and differentiable, rendering them far less powerful than their biological counterparts. Perhaps the simplicity of its building blocks is the reason why most deep learning research occurs on layered deep networks that require increasing amounts of computation and memory, limiting the exploration of fundamentally different architectures. Nevertheless, backpropagation is still the best tool in our toolkit for optimising models with an extensive set of parameters.
In this work, we leverage gradient-based learning to find a new learning protocol for tuning an arbitrary computational graph to adapt to a task from a given family of tasks. We show how this learning protocol and its associated meta-learner can be used to train traditional neural networks. This is accomplished by rethinking neural networks as self-organising systems: a graph composed of nodes representing operations such as synapses (individual weights and biases), activations and losses, which have to communicate in order to solve a given task.
We therefore propose to learn a Message Passing Learning Protocol (MPLP): given a directed graph composed of (sparsely) connected nodes, we let these nodes communicate with each other by passing k-dimensional vectors along directed edges. The meta-training phase consists of learning an MPLP that, given an initial configuration/initialization and a training set, is able to adapt to a given task. The kinds of graphs we explore in this work are all end-to-end differentiable; we therefore meta-optimize MPLP through gradient-based approaches.
We show how MPLP can be applied to feedforward neural networks as a meta-learned replacement for gradient-based approaches. In fact, we can consider gradient-based learning algorithms as a specific instance of MPLP. Gradient-based approaches do the following: (1) in the forward pass, store the input and compute an output with your function; (2) in the backward pass, receive the gradient of a loss from upstream, compose it with your own gradient function, and update your weight with the resulting gradient and some autoregressive behaviour; (3) pass the gradient of the loss, modified according to your function, further backwards. Hence, we can specialize MPLP as follows: every time an operation occurs, we consider it a node. Examples of such nodes are single weights (or biases) multiplied by an input, or a value modified by an activation function. The forward pass remains unchanged. In the backward pass, instead of receiving the simple gradient of a loss scalar, each node receives a multidimensional message vector, updates its weight through a parameterized function g, and backpropagates a modified message computed through a parameterized function f. These functions are trainable neural networks. Given that MLPs have been shown to work as universal function approximators, we propose that this property would allow our learning protocol to implement traditional gradient descent arbitrarily well, but it would not be limited to, or encouraged to do, exactly that. While for real-world applications it would be desirable to backpropagate the gradient of the loss alongside a learned message, we decided to never pass any gradient, to showcase the properties of a pure MPLP. As a proof of concept, we show how a gradient-free MPLP can be used to train feedforward neural networks for few-shot sinusoidal fitting and MNIST classification. In the sinusoidal case, we also show how we can enhance Model-Agnostic Meta-Learning (MAML) (Finn et al. (2017)) approaches by jointly learning both priors and learning rules. While this paper limits the exploration of this framework to traditional feedforward neural networks, the framework can be used on any graph of agents, as long as their communication protocol remains end-to-end differentiable. We briefly discuss some possible applications of MPLP to non-traditional neural networks in Section 5. The code for this framework and for reproducing the following experiments can be found at https://github.com/google-research/self-organising-systems/tree/master/mplp.
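To make the protocol concrete, the following minimal sketch shows a single weight node whose backward pass consumes a message vector and emits both a new message and a weight update through two small MLPs standing in for f and g. The message size, network shapes and initializations here are illustrative assumptions, not the ones used in our experiments:

```python
import numpy as np

K = 8  # message dimensionality (an assumption for this sketch)
rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Random (W, b) pairs for a small tanh MLP."""
    return [(rng.normal(0, 0.1, (o, i)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params:
        x = np.tanh(W @ x + b)
    return x

# f: (incoming message, stored input, weight) -> outgoing message
f_params = init_mlp([K + 2, 16, K], rng)
# g: (incoming message, stored input, weight) -> scalar weight update
g_params = init_mlp([K + 2, 16, 1], rng)

def weight_node_backward(w, stored_x, msg_in):
    """Backward pass of one synapse node: no gradients involved."""
    inp = np.concatenate([msg_in, [stored_x, w]])
    msg_out = mlp(f_params, inp)   # message passed further backwards
    dw = mlp(g_params, inp)[0]     # local weight update
    return msg_out, w + dw

msg_out, w_new = weight_node_backward(w=0.3, stored_x=1.5, msg_in=np.ones(K))
```

Note that the backward computation never touches a derivative of the forward operation: the update is whatever the (meta-trained) networks f and g decide to emit.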
2 Related Work
Meta-learning. Our work finds extensive common ground in the field of meta-learning, which has recently re-exploded in popularity. Given the velocity and size of the field, we must limit this section to work we believe is very strongly related to, or has served as a direct inspiration for, this work. For a more extensive overview of meta-learning, we recommend Clune (2019) and Hospedales et al. (2020).
Bengio et al. (1991, 1997) introduced the idea of discovering local learning rules instead of gradient-based optimization. We consider our work a generalization of their approach. Schmidhuber (1993) and follow-up work (Hochreiter et al. (2001); Younger et al. (2001)) demonstrated how LSTMs (Long Short-Term Memory networks) can be used for meta-learning. We find LSTMs useful in our meta-learning approach, and their use is well documented in similar approaches, such as Andrychowicz et al. (2016) and Ravi and Larochelle (2017), who present a method for optimally applying an incoming gradient signal to parameters by using an LSTM. Further novel work on non-traditional learning rules can be found in the field of feedback alignment (Lillicrap et al. (2014); Nøkland (2016)), where the authors show that backpropagation can still work with an altered weight matrix in the backward pass, instead of the transpose of the weight matrix that traditional backpropagation would prescribe. Our work is to the largest degree inspired by the original MAML paper (Finn et al. (2017), and follow-up works Li et al. (2017); Antoniou et al. (2018)), and our training regime can be seen as an enhancement of it. An approach implementing Hebbian learning was recently meta-learned through end-to-end backpropagation (Miconi et al. (2020)). A black-box evolutionary approach was recently explored in AutoML-Zero (Real et al. (2020)). We find such a clean-slate approach desirable, and we believe it could be used to meta-learn MPLP-based models too. Our method was inspired by the recent Neural CA work (Mordvintsev et al. (2020)): end-to-end differentiable cellular automata can be seen as a special case of a locally connected computation graph being trained with meta-learning. The recent talk by Aguera y Arcas (2019) demonstrating bacteria-inspired lifeforms and evolutionary approaches to meta-learning was also a direct inspiration for our work.

Methods that fulfill meta-learning objectives such as ours can often be used for solving few-shot learning tasks. Triantafillou et al. (2019) provide a valuable overview of few-shot learning and introduce a large dataset for evaluating architectures. Recently, it is becoming apparent that feature extraction as a preprocessing step may be very beneficial to scaling and improving performance on few-shot learning tasks (Gidaris and Komodakis (2018); Qi et al. (2017); Chen et al. (2019)). We suspect our approach can be directly applied to learned representations of data; see Section 5 for more details.

Recent work in meta-learning has independently yielded approaches that share some key properties with ours. We hereby highlight the similarities and differences. Metz et al. (2018) train meta-learned unsupervised learning rules through backpropagation. During the inner update, they backpropagate information through a learned matrix, and instead of receiving the gradient of a loss, each synapse is responsible for accumulating useful statistics in an unsupervised fashion. In summary, the information sent across layers is scalar for each synapse, as opposed to our work; furthermore, they do not use losses in the inner loop.

Gregor (2020) uses k-dimensional messages as a communication protocol and is conceptually very similar to our work. However, they specifically set out to solve the task of learning to remember, and do not attempt to generalize to a swap-in replacement for gradient-based approaches as we do. Bertens and Lee (2019) also use k-dimensional message passing and a variant of an LSTM/GRU as the underlying building block. Unlike our approach, they rely on evolution-based learning and do not set out to find a replacement for gradient-based approaches for traditional NNs.

Continual learning is another field where we believe there are opportunities to make use of our framework. Historically, meta-learning approaches were not common in continual learning; see Parisi et al. (2018) for an overview of the field. However, we are aware of some meta-learning work in recent years. Follow the Meta Leader (FTML) (Finn et al. (2019)) adapts MAML to a continual learning scenario; Online-aware Meta-learning (OML) (Javed and White (2019)) meta-learns sparse representations aiding in continual learning; A Neuromodulated Meta-Learning Algorithm (ANML) (Beaulieu et al. (2020)) meta-learns architectures resistant to catastrophic forgetting by adding a gating mechanism. In Section 5, we discuss how MPLP could be applied to continual learning, and where we see possible limitations and drawbacks.
Graph Neural Networks. From a Graph Neural Network (GNN) perspective, our framework could be considered an example of a Message Passing Neural Network (Gilmer et al. (2017)). However, GNNs are generally not applied to meta-learning; instead, they are used to ingest graph-structured data. We recommend Wu et al. (2020) for a survey on GNNs.
3 Model
We designed our approach around a directed computational multidigraph. We refer to computations happening within and across nodes as arrows. Arrows are responsible for computing messages and updating internal states of nodes. For a more detailed explanation of the actual implementation, we point the reader to Appendix A.
In this work, we focus on specialising our models to mimic the forward and backward passes of traditional feedforward neural networks in supervised learning scenarios.
3.1 Feedforward Neural Networks
A typical feedforward neural network is composed of two data pathways: a forward pass, where given some input x and parameters θ we compute ŷ = F(x; θ), and a backward pass, where a given loss and some stored intermediate data, typically computed during the forward pass, are used to compute a gradient-based update on θ. Our computational model can be used in a forward/backward routine for architectures equivalent to traditional NNs, where the forward pass is effectively unchanged. However, instead of relying on gradients for the backward pass, we meta-learn an MPLP.
We now discuss the logical implementation of both forward and backward passes by focusing on the local interactions of each node. For instance, instead of describing procedures in terms of a weight matrix W, we show what happens to each individual synapse locally. The actual implementation computes updates in bulk; Figure 1 shows a higher-level example of the dataflow.
In the forward pass, we construct nodes for each operation occurring in affine transforms, activations, and loss functions. Each weight and bias is stored in the state of the respective node. We define a forward arrow to compute one-dimensional outputs from one-dimensional inputs, just as in a traditional forward pass of a neural network; for instance, a forward pass of a weight is o_{ij} = w_{ij} x_i, and the node stores x_i for future use in the backward pass. The outputs o_{ij} are further aggregated to y_j = Σ_i o_{ij} + b_j. The result of a matrix multiplication can therefore be constructed from these local operations to obtain the more familiar y = Wx + b. Likewise, a sigmoid layer can be deconstructed locally for each scalar input x and output y as follows: y = σ(x) = 1/(1 + e^{-x}), storing x for further use in the backward pass.

In the backward pass, every node computes a message to send back, given its stored forward input, the message being passed from the successive layer, and any internal states. We refer to this function as the message passing rule, or f. For a given node indexed i, we compute the message m_i from the incoming message m_{i+1}, forward input x_i and internal state s_i, which consists of any parameters specific to the node's forward pass as well as any recurrent hidden state.
m_i = f(m_{i+1}, x_i, s_i)    (1)
Operations such as the loss function and activation functions have no parameters, so the internal states of these nodes only store the pre-activation inputs^1 and a hidden recurrent state.

^1 Some nodes take more inputs. This is purely done to reduce the total number of nodes necessary for complex operations. For instance, Softmax nodes store every intermediate operation result and the common denominator across all inputs.
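As a minimal illustration of the forward decomposition above (shapes chosen arbitrarily), the per-synapse forward operations reassemble into the usual affine transform:

```python
import numpy as np

rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
x = rng.normal(size=4)

# Each weight node computes o_ij = w_ij * x_i and stores its input x_i
# for later use in the backward pass.
stored_inputs = np.broadcast_to(x, W.shape)    # one remembered input per node
o = W * stored_inputs                           # one scalar output per node
y = o.sum(axis=1) + b                           # aggregate per output neuron

assert np.allclose(y, W @ x + b)                # matches the familiar affine map
```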
Seeding the backward pass requires passing an initial message to the loss node. We typically pass multidimensional messages between nodes; however, we treat the message being externally passed into the loss node as a special case of a one-dimensional message:
m_loss = 1    (2)
The output message of a loss node being passed backwards is multidimensional.
Nodes representing parameterized operations, such as an affine transform, additionally undergo a weight update rule, defined by g, during the backward pass.
Δw_i = g(m_{i+1}, x_i, s_i)    (3)
For an input batch of size B, at step t, the update is the average of the updates computed over every element in the batch.
Δw_{i,t} = (1/B) Σ_{b=1..B} Δw_{i,t,b}    (4)
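A sketch of Equations 3 and 4, with a toy stand-in for the learned rule g (which in the real system is a trained network):

```python
import numpy as np

K, B = 4, 8                          # message size and batch size (illustrative)
rng = np.random.default_rng(2)
msgs = rng.normal(size=(B, K))       # one incoming message per batch element
xs = rng.normal(size=B)              # stored forward inputs, one per element

def g(msg, x, w):
    """Toy update rule; in the paper this is a learned neural network."""
    return 0.01 * np.tanh(np.dot(msg, msg) * x)

w = 0.3
# Equation 4: average the per-element updates over the batch.
dw = np.mean([g(m, x, w) for m, x in zip(msgs, xs)])
w = w + dw
```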
Given a traditional dense layer with a weight matrix of cardinality n × m, with n inputs and m output neurons, and forward and backward arrows for each individual weight and bias, it is evident that some of the messages have to be replicated and others aggregated. Each of the m neurons has its own bias, which sends the same message to each of its inputs, as each neuron is connected to n incoming nodes and associated weights. This is not unlike how the backward-flowing gradient is sent along all paths. Similarly, the output of the backward pass of a dense layer is expected to be n messages, one for each input. We accomplish this by averaging the messages being passed backwards to each input i, denoting by m_{ij} the message sent backwards along the synapse connecting input i to neuron j.
m_i = (1/m) Σ_{j=1..m} m_{ij}    (5)
Other strategies may be worth exploring, such as summing the backward-flowing messages, or even more complex and stateful aggregation, such as feeding messages into an RNN.
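The fan-in/fan-out bookkeeping for a dense layer can be sketched as follows, averaging as in Equation 5 (the summation or RNN variants mentioned above would replace the `mean`); all sizes are illustrative:

```python
import numpy as np

n_in, n_out, K = 5, 3, 4
rng = np.random.default_rng(3)

# One k-dimensional message flows backwards along every synapse.
synapse_msgs = rng.normal(size=(n_out, n_in, K))

# Each bias node sends one message, replicated to all of its neuron's inputs.
bias_msgs = rng.normal(size=(n_out, K))
replicated = np.repeat(bias_msgs[:, None, :], n_in, axis=1)

# The layer must emit exactly n_in messages, one per input:
# average everything flowing back towards each input.
msg_per_input = (synapse_msgs + replicated).mean(axis=0)
assert msg_per_input.shape == (n_in, K)
```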
Stateful and stateless learners. We parameterize the backward arrows, f and g, using deep neural networks. We experiment with both stateful and stateless implementations of these arrows. We consider the backward network stateless when there is no recurrent hidden state between messages or updates being computed: the backward pass has no memory of previous iterations. To implement a backward pass with memory, we incorporate a hidden state at each node and implement f and g using a Gated Recurrent Unit (GRU). We observed that a deep network is beneficial for the computation of the next state update:
s_{t+1} = GRU(s_t, MLP([m_{i+1}, x_i]))    (6)
m_i = s_{t+1}[0:k]    (7)
Δw_i = s_{t+1}[k]    (8)
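A stateful arrow might look like the following sketch, where a minimal GRU cell carries the node's hidden state and parts of the carry are read out as the outgoing message and the weight update. The exact wiring, sizes, and read-out convention here are our assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRUArrow:
    """Minimal GRU cell standing in for a stateful backward arrow."""
    def __init__(self, in_dim, state_dim, rng):
        d = in_dim + state_dim
        self.Wz = rng.normal(0, 0.1, (state_dim, d))
        self.Wr = rng.normal(0, 0.1, (state_dim, d))
        self.Wh = rng.normal(0, 0.1, (state_dim, d))

    def __call__(self, state, inp):
        xs = np.concatenate([inp, state])
        z = sigmoid(self.Wz @ xs)                  # update gate
        r = sigmoid(self.Wr @ xs)                  # reset gate
        h = np.tanh(self.Wh @ np.concatenate([inp, r * state]))
        return (1 - z) * state + z * h             # next carry

K = 8
rng = np.random.default_rng(4)
arrow = GRUArrow(in_dim=K + 2, state_dim=K + 1, rng=rng)

state = np.zeros(K + 1)
inp = np.ones(K + 2)                   # (incoming message, stored x, weight)
state = arrow(state, inp)
msg_out, dw = state[:K], state[K]      # carry reinterpreted as the outputs
```

The carry persists across inner-loop steps, giving the backward pass a memory of previous iterations.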
The MLPs have two hidden layers of size 80 and 40, respectively, with ReLU activations. We consider some of the carry states as the output messages, and as the weight updates for weights and biases. In Experiment 4.1.2, we used a stateless version of the above, where we use an MLP for f and an MLP for g, both with two hidden layers of size 80 and 40, respectively, with ReLU activations and a tanh activation for the final layer. We did not explore the possible space of architectures further.

Normalization. Meta-learning is notoriously hard to train. We found initializations to be critical for successful training. In particular, for any inputs to f and g, it is evident that input messages, carry states, forward-pass inputs, and optional weights all have different means and magnitudes. We mitigate this problem by standardizing these inputs individually over initial minibatches. The mean and standard deviation from these initial minibatches are recorded, and reused throughout the network's lifetime to standardize subsequent batches. Outputs need to be translated and scaled as well: the outputs of f and g are inherently bounded in (-1, 1) by the tanh activation function, while the typical magnitude and standard deviation of weights and biases is usually orders of magnitude smaller. This empirically results in overly large weight and bias updates during early training, rendering meta-training particularly unstable. Moreover, having a mean output significantly different from zero causes suboptimal training. To mitigate these problems, before starting training, for each dense layer, we store the output means of g (respectively for the weights and biases) and use them to obtain initial zero-mean outputs. Furthermore, we scale outputs down to be at most a fixed fraction of the standard deviation of the weight matrix.^2 We keep all these normalization variables fixed during meta-training, except for the output scaling, which we meta-train, as we observed it to be beneficial for fast adaptation. We believe other standardization techniques such as batch normalization may prove fruitful, but we do not explore them in this work.

^2 This means that weights and biases of the same dense layer are initialized to have g outputs of similar magnitudes. This is of particular importance, since we zero-initialize biases, and therefore could not compute their initial standard deviations.
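The standardization described above can be sketched as follows; the warm-up procedure and the 0.1 scaling factor are illustrative assumptions:

```python
import numpy as np

class Standardizer:
    """Record mean/std on initial minibatches; reuse them for the
    network's lifetime to standardize subsequent batches."""
    def fit(self, warmup_batches):
        data = np.concatenate(warmup_batches)
        self.mu = data.mean(axis=0)
        self.sigma = data.std(axis=0) + 1e-8

    def __call__(self, x):
        return (x - self.mu) / self.sigma

rng = np.random.default_rng(5)
std = Standardizer()
std.fit([rng.normal(5.0, 2.0, (32, 8)) for _ in range(10)])
z = std(rng.normal(5.0, 2.0, (256, 8)))

# Output side: tanh-bounded raw updates are re-centred and scaled down
# relative to the weight matrix's spread (the 0.1 factor is an assumption).
W = rng.normal(0, 0.05, (20, 10))
raw = np.tanh(rng.normal(size=W.shape))
update = (raw - raw.mean()) * 0.1 * W.std()
```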
Parameter sharing. Across all our experiments, all weight nodes for a given weight matrix, and all bias nodes for a given bias vector, share f and g.^3 Likewise, all nodes for a given activation function, and all nodes for a given loss, share f. In addition, we experimented with two more configurations: sharing f and g across all layers of the same kind (so we have one f and one g for all weight nodes in the network, and another set of f and g for all bias nodes), and sharing f, g and the standardization parameters described above.

^3 Weights and biases do not share f and g among themselves. The same reasoning applies to loss functions and activations: a cross-entropy loss node does not share f with a ReLU node.

3.1.1 Training regime
Cross-validation loss. We compute a cross-validation loss after performing k steps of adaptation, similarly to the meta-training procedure used in MAML.
Let us call F and B the forward-pass and backward-pass functions of a neural network. In our use case, B can be interpreted as the MPLP. The neural network is parameterized by θ and the MPLP is parameterized by φ. Let L be an arbitrary loss function. We adapt the neural network on a training set D_train and keep a held-out set D_val.
We update θ with the training set:
ŷ_t = F(x_t; θ_t)    (9)
θ_{t+1} = B(θ_t, ŷ_t, y_t; φ)    (10)
This is repeated k times. Afterwards, we compute a cross-validation loss on the held-out set:
L_cv = L(F(x_val; θ_k), y_val)    (11)
While we do not show hidden states in the equations above, the reader can assume every function ingests and returns hidden states.
Hint losses. In addition to the cross-validation loss described above, we have observed it is beneficial to add a hint loss: at each step of the inner loop, after Equations 9 and 10, we compute a loss on the same training data just observed:
L_hint,t = L(F(x_t; θ_{t+1}), y_t)    (12)
This meta-loss is scaled such that the sum of all hint losses across all inner-loop steps has the same magnitude as the final cross-validation loss. We find better convergence by weighting the total hint loss to be larger than the global meta-loss. We suspect such an intermediate loss produces a smoother local loss landscape, but its inherent greediness may also hamper our model in finding an optimal convergence path. In Experiment 4.1.2 we show what can happen when we do not add a hint loss.
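Putting the meta-objective together, the sketch below mimics Equations 9 to 12 with a toy stand-in for the learned backward pass B (here a plain least-squares step, so the loop actually converges); in the real system B is the MPLP and the total loss would be backpropagated into its parameters φ:

```python
import numpy as np

def F(theta, x):                        # forward pass: a linear-model stand-in
    return x @ theta

def B(theta, x, y, lr=0.1):             # stand-in for the learned backward pass
    # Here: a plain least-squares step; the MPLP would replace this rule.
    grad = x.T @ (x @ theta - y) / len(x)
    return theta - lr * grad

def loss(pred, y):
    return float(np.mean((pred - y) ** 2))

rng = np.random.default_rng(6)
theta = rng.normal(size=3)
w_true = np.array([1.0, -2.0, 0.5])
x_tr, x_val = rng.normal(size=(20, 3)), rng.normal(size=(10, 3))
y_tr, y_val = x_tr @ w_true, x_val @ w_true

hint_losses = []
for _ in range(5):                      # k inner-loop steps (Eqs. 9 and 10)
    theta = B(theta, x_tr, y_tr)
    hint_losses.append(loss(F(theta, x_tr), y_tr))   # hint loss (Eq. 12)

cv_loss = loss(F(theta, x_val), y_val)  # cross-validation loss (Eq. 11)

# In the text, hint losses are rescaled so that their sum has the same
# magnitude as the final cross-validation loss; here we simply add them.
meta_loss = cv_loss + sum(hint_losses)
```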
Outer batches. We meta-learn by accumulating gradients across multiple tasks, generally referred to as outer batches. In this work, a task refers to: (a) a selection of amplitude and phase for sinusoidals; (b) a randomly initialized network to optimize for either sinusoidals or MNIST. Regarding the latter, for all experiments except Experiment 4.1.2, we train MPLP to adapt any randomly initialized network. We do so by initializing a pool of networks before training,^4 and using a different randomly sampled prior for every outer batch. Even very small pools are generally sufficient for MPLP to generalize to unseen random initializations. However, a pool of size 1 would have stateful learners overfit to it. We meta-learn through Adam and normalize the gradients for each optimizer parameter. We further describe training setups specific to each experiment in their respective sections.

^4 We initialize weights sampled from a normal distribution with mean 0 and standard deviation 0.05. We zero-initialize biases.
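A sketch of the pool-based sampling; the pool size, network shape, and outer-batch pairing are illustrative, while the initialization scheme follows the footnote above:

```python
import numpy as np

def init_network(rng, sizes=(1, 20, 20, 1)):
    """One prior: weights ~ N(0, 0.05), zero biases, as described above."""
    return [(rng.normal(0, 0.05, (o, i)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(7)
pool = [init_network(rng) for _ in range(8)]   # small pool of network priors

outer_batch_size = 4
# Each element of the outer batch draws a random prior from the pool;
# gradients w.r.t. the MPLP parameters are accumulated across the batch.
outer_batch = [pool[rng.integers(len(pool))] for _ in range(outer_batch_size)]
```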
4 Experiments
4.1 Sinusoidal fitting
Following the example in MAML, we show convergence on tasks drawn from a family of sinusoidal functions. We sample a set of sinusoidal tasks for each training step, where a sinusoidal task is defined as a regression on samples from the underlying function y = A sin(x + φ), with amplitude A and phase φ sampled per task. We sample inner batches with x drawn uniformly from a fixed interval. For an inner training loop consisting of k steps, we sample k inner batches. We show convergence of our approach both when trained from a randomly initialized prior, and when combined with MAML to learn the prior weights of the optimizee network. We use 8-dimensional messages.
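A sketch of the task and inner-batch sampling; the amplitude, phase, and input ranges here are taken from the MAML sinusoid benchmark and are our assumption for this sketch:

```python
import numpy as np

rng = np.random.default_rng(8)

def sample_task(rng):
    # Ranges assumed from the MAML sinusoid benchmark.
    A, phi = rng.uniform(0.1, 5.0), rng.uniform(0.0, np.pi)
    return lambda x: A * np.sin(x + phi)

def sample_inner_batch(task, n, rng):
    x = rng.uniform(-5.0, 5.0, size=(n, 1))
    return x, task(x)

k, inner_batch_size = 5, 10
task = sample_task(rng)
inner_batches = [sample_inner_batch(task, inner_batch_size, rng)
                 for _ in range(k)]    # one fresh batch per inner step
```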
4.1.1 Randomly initialized priors
In this experiment, we train MPLP to generalize to arbitrary sinusoidals and arbitrary network initializations. Thus, for each outer batch, we sample a random sinusoidal task and a randomly initialized network prior. The optimizee network is a traditional deep neural network with two hidden layers of 20 units each and ReLU activations on the hidden layers. We use stateful learners for MPLP and do not share any parameters across layers.
Training. We perform 5-step learning with an inner batch size of 10. We use an outer batch size of 4. We use a cross-validation loss at the end of the 5-step learning, and a hint loss at every step. We use an L2 loss for both the cross-validation and hint losses. The L2 loss is also the final node of the network (that is, the loss seeding the initial message in the backward stage is L2).
Figure 2 shows a comparison between MPLP and Adam.^5 The MPLP optimizer was trained for ~40,000 steps. The optimized MPLP is able to fit the task, and it vastly outperforms what Adam can achieve in 5 steps.

^5 We use out-of-the-box Adam parameters, except for the learning rate, which we fine-tuned to 0.01, since it gave the best results.
Training MPLP showed great variance in its convergence results. Anecdotally, we observed a trend of obtaining better and more reliable results the larger the number of parameters MPLP has. However, all parameter-sharing configurations can, albeit seldom, converge to successful MPLPs similar to what is shown in Figure 2. We give more details in Appendix B.

4.1.2 Learning with a prior
In this experiment, we show how it is possible to meta-learn the priors (as in MAML) and MPLP jointly. To showcase properties of MPLP different from those observed in Experiment 4.1.1, we choose to use a stateless learning algorithm, perform 2-shot learning, and use only a cross-validation loss at the end (i.e., no hint loss). We also do not share any parameters across layers.
Figure 3 showcases how adding MPLP to MAML can drastically change the behaviour of the learner. While both models successfully solve the task, MAML runs always approximate a sinusoidal, starting from the learned priors. Moreover, since they are forced to follow the gradient at every step, it is no surprise that they increasingly approximate a sinusoidal. Using MPLP, instead, the learning algorithm is not restricted to performing incremental improvements at every step. In this case, the result resembles nothing like a sinusoidal approximation until the last step. We also verified that the learned MPLP is tightly coupled with the learned priors: it does not succeed with a randomly initialized network. Meta-training on this experiment will yield different MPLPs, and hence different visual results. Therefore, what we show in Figure 3 is just one example of many possibilities.
4.2 MNIST classification
The goal of this experiment is to learn an MPLP capable of generalizing to arbitrary network initializations when adapting to MNIST. To do so, we perform 20-step training with an inner batch size of 8 on a scaled-down (12x12) MNIST. We scale down MNIST mostly for performance reasons, but this also allows us to evaluate our trained learners on full-scale MNIST afterwards. We also standardize the inputs by computing the mean and standard deviation on the training dataset. The architecture used is a (144,50,10) network with a sigmoid activation for the hidden layer, a softmax for the final layer, and a cross-entropy loss node afterwards. Every task in the outer batch is initialized with a different network prior. We used a stateful learner with a message size of 4 as a compromise between quality and computational efficiency.
Parameter sharing. We share the learners for all the dense layers (all weights share one pair of f and g, and all biases share one pair of f and g), and keep normalization parameters layer-specific. We could not achieve comparable results when sharing normalization parameters across layers, and we defer more robust exploration to future work.
Training regime. We use an outer batch size of 1 and meta-learn for 5000 steps. We do so due to computational constraints, and it is by no means optimal. We compute a hint loss at every step and a cross-validation loss after 20 steps. The losses used are cross-entropy.
Model            | Architecture           | Accuracy @ 20 (%)
-----------------|------------------------|------------------
MPLP             | (144,50,10), Sigmoid   |
MPLP (transfer)  | (144,50,10), Step      |
MPLP (transfer)  | (784,100,10), Sigmoid  |
SGD              | (144,50,10), Sigmoid   |
Adam             | (144,50,10), Sigmoid   |
Results. Figure 4 and Table 1 show evaluations on the test set, with randomly initialized priors, of our MPLP compared to Adam and SGD,^6 averaged over 100 runs. Most evaluations are performed on the scaled-down 12x12 MNIST with a (144,50,10) architecture. The exception is "MPLP on bigger network", where we transfer the learnt parameters to a larger network of size (784,100,10). We can see how MPLP outperforms Adam and SGD, on average, at every step. We also experimented with replacing the sigmoid function in the hidden layer with a step function, since the two are closely related. While a step function would only backpropagate zero gradients with traditional means, we are able to train a network with it seamlessly. Finally, we ported the trained MPLP to a larger network of dimensions (784,100,10), with a sigmoid hidden layer, and evaluated its performance on 28x28 MNIST. Even though we did not share any standardizers, we observe that the model is still able to train meaningfully.

^6 Here as well, we fine-tune the learning rates of Adam and SGD, to 0.01 and 0.1 respectively.
5 Discussion
In this article, we introduced MPLP and applied it to fairly small feedforward neural networks, with the aim of inquiring whether it is a viable generalization of gradient-based approaches. In this section, we share our thoughts and discuss some of the promising results, as well as some of the notable hardships we encountered.
Overall, for few-shot learning, a properly normalized MPLP appears to be a promising path to pursue. All the experiments explored suggest that meta-learning an MPLP for few-shot tasks should work for small networks. If we wanted to scale this up to larger networks, the algorithm could quickly become overly expensive. For instance, some of our early experiments implemented convolutions for stateless learners, and we saw how that can very quickly become too computationally expensive. We therefore hypothesize that scaling up might require some significant architectural enhancement or, more likely, finding a compromise between locality and scalability. One other viable path could be rethinking traditional NN blocks, making them inherently more powerful through MPLP and thereby decreasing architectural depth.
In this work, we never passed gradients in the backward step. This may not necessarily be optimal for real-world experiments, and we observed some good arguments in favour of passing gradients whenever possible: a typical meta-training regime tends to start with the learners not knowing what to do at all. This results in an early loss plateau that the learners have to escape, and this plateau gets harder to escape with badly initialized networks. We hypothesize this would be less of a problem when also passing the gradient of the loss, since the model could easily escape such plateaus by simply scaling the received gradients.
We also showed a toy example of how an MPLP trained for sigmoids transfers to step functions too. While we would not recommend reading too much into that specific result, we could imagine meta-meta-training regimes that could improve MPLP to eventually remove differentiability requirements entirely, while transferring well to non-differentiable operations.
MPLP can be applied to arbitrary graphs. As long as the building blocks are end-to-end differentiable, MPLP can be learned efficiently. We expect that different tasks may benefit from non-traditional NN blocks, and recommend trying MPLP. The code is open-sourced at https://github.com/google-research/self-organising-systems/tree/master/mplp.

Continual learning is also a great fit for meta-learning and MPLP: we can train learners that follow cross-validation losses, or losses that discourage forgetfulness. However, we observed one drawback of our loss-driven approach: meta-learning MPLP suffers from all the well-known problems of training NNs. In particular, we should consider MPLP as being composed of small neural networks, which can quickly fall into bad local minima. We observed these problems in nearly every experiment we performed, including some we did not include in this work. One solution may be to overparameterize MPLP, but this is likely inefficient. Another solution is to use the right set of losses to hint MPLP towards a proper direction; this is why we introduced hint losses. Hint losses are even more useful for longer time series, such as 50 steps or more. Catastrophic-forgetting avoidance would likewise require a hint loss for remembering over time. However, finding the right hint loss for the right task is not straightforward. We consider this the biggest limitation we observed in our experiments.
In summary, MPLP can be a viable way to overcome some of the problems gradient-based approaches have: adding a cross-validation loss can drive the learners to optimize for generalization rather than overfitting; catastrophic-forgetting avoidance can likewise be encoded in the meta-loss used; the learned MPLP does not require differentiable architectures, if we are able to transfer it effectively; and vanishing and exploding gradients may be less of an issue, given that the messages passed do not need to decrease or increase in magnitude to express certain changes to be performed downstream.
More broadly, one can learn an MPLP on any end-to-end differentiable configuration of agents. This may lead to exploring novel architectures, where each agent is inherently more complex and powerful than traditional operations, or where computations happen asynchronously and in a distributed fashion. MPLP could also be used to devise implicit reward signals in continual learning or reinforcement learning scenarios.
Acknowledgements
We thank Iulia Comșa, Hugo Larochelle, Blaise Aguera y Arcas, and Max Vladymyrov for their insightful feedback.
Appendix A Computational Graph details
For ease of implementation, we designed our approach around a directed computational multidigraph. Nodes in our computational graph are stateful and store arbitrary data structures, which we refer to as "state". Arrows (directed edges) in our graph represent a series of operations to be executed, each taking as input the state of the source node, potentially modifying this state or part of it, and then constructing and passing a message into the state of the destination node. This message contains any information the source node wants to pass on to the destination node. The received message can be processed immediately by the destination node by executing a successive arrow computation, or stored for future use. It is important to note that the mapping of a node's received messages to the inputs of its outgoing arrows is defined during the construction of the graph. Likewise, the order of execution of the edges is arbitrary, but in the specific case of traditional networks we emulate the forward-backward paradigm: we execute forward arrows in sequence until the final message passed is the loss value, then execute backward arrows in the reverse sequence to update the parameters stored in the state. Since the graph is a multidigraph, two nodes can have more than one edge defined between them, as is the case with the aforementioned forward and backward arrows. A node can additionally have reflexive arrows, allowing updates to its own state without receiving any external inputs. Examples of how we use internal states in our experiments include storing weight or bias values, Gated Recurrent Unit (GRU)/LSTM hidden states, or a node's embedding.
Such a framework allows for a variety of possible architectures for arbitrary computation and communication within a network.
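The node/arrow mechanics described above can be sketched in a few lines. This is a minimal toy illustration, not our implementation: the class names `Node` and `Edge`, the `inbox` list, and the scalar messages are assumptions (in the actual system, messages are k-dimensional vectors and arrow operations are learned networks).

```python
class Node:
    """A stateful graph node: stores arbitrary data and received messages."""
    def __init__(self, name, state=None):
        self.name = name
        self.state = state if state is not None else {}
        self.inbox = []  # messages delivered by incoming arrows

class Edge:
    """A directed arrow: reads (and may mutate) the source node's state,
    then delivers a message into the destination node's inbox."""
    def __init__(self, src, dst, op):
        self.src, self.dst, self.op = src, dst, op

    def fire(self):
        self.dst.inbox.append(self.op(self.src))

# Emulate the forward-backward paradigm on a toy two-node chain.
w = Node("synapse", state={"w": 2.0})
out = Node("output")

forward = Edge(w, out, lambda n: n.state["w"] * 3.0)  # forward message
backward = Edge(out, w, lambda n: n.inbox[-1])        # feedback message

forward.fire()   # out.inbox now holds 6.0
backward.fire()  # w.inbox now holds the feedback value 6.0
w.state["w"] -= 0.1 * w.inbox[-1]  # local state update from the message
```

A reflexive arrow would simply be an `Edge` whose `src` and `dst` are the same node, letting it update its own state without external input.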
Appendix B Sinusoidal Ablation study
We performed ablations on the message size and on the amount of shared parameters (see Section 3 for the meaning of the possible configurations). Unfortunately, we observed a great amount of variance in each experiment, and it was computationally prohibitive to perform several repetitions of each instance. What follows is therefore anecdotal evidence. Ablating the message size (we tried 1, 4, and 8), we observed that training converges faster and the final result is better with a larger message size. Ablating the amount of shared parameters, we observed that models converge faster with no shared parameters, but the end results are all comparable even when all parameters are shared. A note on reliability of results: the less powerful the models, the more likely they are to get stuck in local minima, and we observed every model get stuck in some runs; the most common failure mode was a model classifying only a central arc correctly, while regressing to a constant value outside of the center.