MPLP: Learning a Message Passing Learning Protocol

07/02/2020 · Ettore Randazzo et al. · Google

We present a novel method for learning the weights of an artificial neural network: a Message Passing Learning Protocol (MPLP). In MPLP, we abstract every operation occurring in an ANN as an independent agent. Each agent is responsible for ingesting incoming multidimensional messages from other agents, updating its internal state, and generating multidimensional messages to be passed on to neighbouring agents. We demonstrate the viability of MPLP, as opposed to traditional gradient-based approaches, on simple feed-forward neural networks, and present a framework capable of generalizing to non-traditional neural network architectures. MPLP is meta-learned using end-to-end gradient-based meta-optimisation. We further discuss the observed properties of MPLP and hypothesize its applicability in various fields of deep learning.


1 Introduction

For the most part, learning algorithms have been hand-crafted. The deep learning community has largely converged on almost exclusively gradient-based approaches for learning a model’s parameters. Such gradient-based approaches generally impose limitations in terms of the loss landscape, choice of network architecture, and training dynamics. A non-exhaustive list of examples: their inherent tendency to overfit to training sets, their catastrophic-forgetting behaviour, their requirement of a smooth loss landscape, and vanishing or exploding gradients in recurrent or large architectures. Moreover, while the mechanics of artificial neurons are inspired by their biological counterparts, they have been greatly simplified to be scalable and differentiable, rendering them far less powerful than their biological counterparts. Perhaps the simplicity of its building blocks is the reason why most deep learning research occurs on layered deep networks that require increasing amounts of computation and memory, limiting exploration of fundamentally different architectures. Nevertheless, backpropagation is still the best tool in our toolkit for optimising models with an extensive set of parameters.

In this work, we leverage gradient-based learning to find a new learning protocol for tuning an arbitrary computational graph to adapt to a task from a given family of tasks. We show how this learning protocol and its associated meta-learner can be used to train traditional neural networks. This is accomplished by rethinking neural networks as self-organising systems: graphs composed of nodes representing operations such as synapses (individual weights and biases), activations, and losses, which have to communicate in order to solve a given task.

We therefore propose to learn a Message Passing Learning Protocol (MPLP): given a directed graph composed of (sparsely) connected nodes, we let these nodes communicate with each other by passing k-dimensional vectors along directed edges. The meta-training phase consists of learning an MPLP that, given an initial configuration/initialization and a training set, is able to adapt to a given task. The kinds of graphs we explore in this work are all end-to-end differentiable; we therefore meta-optimize MPLP through gradient-based approaches.

We show how MPLP can be applied to feed-forward neural networks as a meta-learned replacement for gradient-based approaches. In fact, we can consider gradient-based learning algorithms a specific instance of MPLP. Gradient-based approaches do the following: (1) in the forward pass, store the input and compute the output of your function; (2) in the backward pass, receive the gradient of a loss from upstream, compose it with your own gradient function, and update your weight using the gradient and some autoregressive behavior; (3) pass on the gradient of the loss, modified according to your function. Hence, we can specialize MPLP as follows: every time an operation occurs, we consider it a node. Examples of such nodes are single weights (or biases) multiplied by an input, or a value modified by an activation function. The forward pass remains unchanged. In the backward pass, instead of receiving the simple gradient of a loss scalar, each node receives a multidimensional message vector, updates its weight through a parameterized function g, and backpropagates a modified message through a parameterized function f. These functions are trainable neural networks. Given that MLPs have been shown to be universal function approximators, this property would allow our learning protocol to implement traditional gradient descent arbitrarily well, while being neither limited to nor encouraged to do exactly that. While for real-world applications it would be desirable to backpropagate the gradient of the loss alongside a learned message, we decided to never pass any gradient, to showcase the properties of a pure MPLP. As a proof of concept, we show how a gradient-free MPLP can be used to train feed-forward neural networks for few-shot sinusoidal fitting and MNIST classification. In the sinusoidal case, we also show how we can enhance Model-Agnostic Meta-Learning (MAML) (Finn et al. (2017)) approaches by jointly learning both priors and learning rules.
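As an illustration of this specialization, the following is a minimal NumPy sketch of a single synapse node whose backward pass is driven by learned functions f and g instead of gradients. All names, sizes, and the tiny stand-in MLPs are illustrative and not the paper's implementation.

```python
# A sketch (not the authors' code) of one MPLP synapse node: the forward
# pass is the usual w * x; the backward pass replaces the scalar gradient
# with a K-dimensional message transformed by learned networks f and g.
import numpy as np

K = 8  # message dimensionality

def mlp(params, x):
    """Tiny 2-layer MLP standing in for the learned f and g."""
    (w1, b1), (w2, b2) = params
    h = np.maximum(0.0, x @ w1 + b1)   # ReLU hidden layer
    return np.tanh(h @ w2 + b2)

def init_mlp(rng, n_in, n_hidden, n_out):
    w1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    w2 = rng.normal(0, 0.1, (n_hidden, n_out)); b2 = np.zeros(n_out)
    return (w1, b1), (w2, b2)

class SynapseNode:
    def __init__(self, w, f_params, g_params):
        self.w = w                 # the single scalar weight (node state)
        self.f_params = f_params   # message-passing network parameters
        self.g_params = g_params   # weight-update network parameters
        self.x = 0.0               # stored forward input

    def forward(self, x):
        self.x = x                 # store input for the backward pass
        return self.w * x          # unchanged traditional forward pass

    def backward(self, m_in):
        # Node-local information available to f and g: incoming message,
        # stored input, and current weight.
        z = np.concatenate([m_in, [self.x, self.w]])
        self.w += float(mlp(self.g_params, z)[0])  # learned update rule
        return mlp(self.f_params, z)               # learned backward message

rng = np.random.default_rng(0)
node = SynapseNode(w=0.5,
                   f_params=init_mlp(rng, K + 2, 16, K),
                   g_params=init_mlp(rng, K + 2, 16, 1))
y = node.forward(1.5)              # forward: y = w * x
m_out = node.backward(np.ones(K))  # backward: message in, message out
```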

While this paper limits its exploration of this framework to traditional feed-forward neural networks, the framework can be used on any graph of agents, as long as their communication protocol remains end-to-end differentiable. We briefly discuss some possible applications of MPLP on non-traditional neural networks in Section 5. The code for this framework and for reproducing the following experiments can be found at https://github.com/google-research/self-organising-systems/tree/master/mplp.

2 Related Work

Meta-learning. Our work finds extensive common ground in the field of meta-learning, which has recently exploded back into popularity. Given the velocity and size of the field, we must limit this section to work we believe is most strongly related or has served as a direct inspiration to this work. For a more extensive overview of meta-learning, we recommend Clune (2019) and Hospedales et al. (2020).

Bengio et al. (1991, 1997) introduced the idea of discovering local learning rules instead of gradient-based optimization. We consider our work a generalization of their approach. Schmidhuber (1993) and follow-up work (Hochreiter et al. (2001); Younger et al. (2001)) demonstrated how LSTMs (Long Short-Term Memory networks) can be used for meta-learning. We find LSTMs useful in our meta-learning approach, and their use is well documented in similar approaches, such as Andrychowicz et al. (2016) and Ravi and Larochelle (2017), who present methods for optimally applying an incoming gradient signal to parameters by using an LSTM. Further novel work on non-traditional learning rules can be found in the field of feedback alignment (Lillicrap et al. (2014); Nøkland (2016)), where the authors show that backpropagation can still work with an altered weight matrix in the backward pass, instead of the transpose of the weight matrix that would be derived in traditional backpropagation. Our work is to the largest degree inspired by the original MAML paper (Finn et al. (2017); see also follow-up works Li et al. (2017); Antoniou et al. (2018)), and our training regime can be seen as an enhancement of it. A Hebbian learning rule was recently meta-learned through end-to-end backpropagation (Miconi et al. (2020)). A black-box evolutionary approach was recently explored in AutoML-Zero (Real et al. (2020)). We find such a clean-slate approach desirable, and we believe it could be used to meta-learn MPLP-based models too. Our method was also inspired by the recent Neural CA work (Mordvintsev et al. (2020)): end-to-end differentiable cellular automata can be seen as a special case of a locally connected computation graph being trained with meta-learning. The recent talk by Aguera y Arcas (2019), demonstrating bacteria-inspired lifeforms and evolutionary approaches to meta-learning, was also a direct inspiration to our work.

Methods that fulfill meta-learning objectives such as ours can often be used for solving few-shot learning tasks. Triantafillou et al. (2019) provide a valuable overview of few-shot learning and introduce a large dataset for evaluating architectures. Recently, it has become apparent that feature extraction as a pre-processing step may be very beneficial to scaling and improving performance on few-shot learning tasks (Gidaris and Komodakis (2018); Qi et al. (2017); Chen et al. (2019)). We suspect our approach can be directly applied to learned representations of data. See Section 5 for more details.

Recent work in meta-learning has independently yielded approaches that share some key properties with ours. We hereby highlight the similarities and differences. Metz et al. (2018) meta-learn unsupervised learning rules through backpropagation. During the inner update, they backpropagate information through a learned matrix, and instead of receiving the gradient of a loss, each synapse is responsible for accumulating useful statistics in an unsupervised fashion. In summary, the information sent across layers is scalar for each synapse, as opposed to our work, and furthermore, they do not use losses in the inner loop.

Gregor (2020) uses k-dimensional messages as a communication protocol and is conceptually very similar to our work. However, they specifically set out to solve the task of learning to remember, and do not attempt to generalize to a swap-in replacement for gradient-based approaches as we do. Bertens and Lee (2019) also use k-dimensional message passing and a variant of an LSTM/GRU as the underlying building block. Unlike our approach, they rely on evolution-based learning and do not set out to find a replacement for gradient-based approaches for traditional NNs.

Continual learning is another field where we believe there are opportunities to make use of our framework. Historically, meta-learning approaches were not common in continual learning; see Parisi et al. (2018) for an overview of the field. However, we are aware of some meta-learning work in recent years. Follow the Meta Leader (FTML) (Finn et al. (2019)) adapts MAML to work in a continual-learning scenario; Online-aware Meta-learning (OML) (Javed and White (2019)) meta-learns sparse representations aiding in continual learning; A Neuromodulated Meta-Learning Algorithm (ANML) (Beaulieu et al. (2020)) meta-learns architectures resistant to catastrophic forgetting by adding a gating mechanism. In Section 5, we discuss how MPLP could be applied to continual learning, and where we see possible limitations and drawbacks.

Graph Neural Networks. From a Graph Neural Network (GNN) perspective, our framework could be considered an example of a Message Passing Neural Network (Gilmer et al. (2017)). However, GNNs are generally not applied to meta-learning; instead, they are used to ingest graph-structured data. We recommend Wu et al. (2020) for a survey on GNNs.

3 Model

Figure 1: Diagram of simple nodes being trained with MPLP, representing a 1-dimensional matrix multiplication and a loss. g and f are the weight-update and message-generating networks, respectively. Each maintains its own recurrent hidden state.

We designed our approach around a directed computational multidigraph. We refer to computations happening within and across nodes as arrows. Arrows are responsible for computing messages and updating internal states of nodes. For a more detailed explanation of the actual implementation, we point the reader to Appendix A.

In this work, we focus on specialising our models to mimic the forward and backward passes of traditional feed-forward neural networks in supervised learning scenarios.

3.1 Feed-forward Neural Networks

A typical feed-forward neural network is composed of two data pathways: a forward pass, where, given some input x and parameters θ, we compute a prediction ŷ; and a backward pass, where a given loss and some stored intermediate data, typically computed during the forward pass, are used to compute a gradient-based update of θ. Our computational model can be used in a forward/backward routine for architectures equivalent to traditional NNs, where the forward pass is effectively unchanged. However, instead of relying on gradients for the backward pass, we meta-learn an MPLP.

We now discuss the logical implementation of both forward and backward passes by focusing on the local interactions of each node. For instance, instead of describing procedures in terms of a weight matrix W, we show what happens to each individual synapse locally. The actual implementation computes updates in bulk; Figure 1 shows a higher-level example of the dataflow.

In the forward pass, we construct nodes for each operation happening in affine transforms, activations, and loss functions. Each weight and bias is stored in the state of the respective node. We define a forward arrow to compute one-dimensional outputs from one-dimensional inputs, just like what occurs in a traditional forward pass of neural networks; for instance, a forward pass of a weight node is y_{ij} = w_{ij} x_j, and the node stores x_j for future use in the backward pass. The outputs y_{ij} are further aggregated to y_i = Σ_j w_{ij} x_j + b_i. The result of a matrix multiplication can therefore be constructed by these local operations to obtain the more familiar y = Wx + b. Likewise, a sigmoid layer can be deconstructed locally for each scalar input x_i and output y_i as y_i = σ(x_i), storing x_i for further use in the backward pass.
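As a sanity check of this local decomposition, here is a short NumPy sketch (illustrative, not the repository's code) showing that the per-node products and sums recover the familiar dense-layer forward pass:

```python
# Dense-layer forward pass built from node-local operations.
import numpy as np

def local_dense_forward(W, b, x):
    N, M = W.shape
    stored = np.tile(x, (N, 1))       # each weight node stores its own x_j
    partial = W * stored              # node-local products y_ij = w_ij * x_j
    y = partial.sum(axis=1) + b       # aggregation y_i = sum_j y_ij + b_i
    assert np.allclose(y, W @ x + b)  # recovers the familiar y = Wx + b
    return y, stored

rng = np.random.default_rng(0)
y, stored = local_dense_forward(rng.normal(size=(5, 3)),
                                rng.normal(size=5),
                                rng.normal(size=3))
```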

In the backward pass, every node computes a message to send back, given its stored forward input, the message being passed from the successive layer, and any internal state. We refer to this function as the message passing rule, or f. For a given node indexed i, we compute the message m_i from the incoming message m_{i+1}, the forward input x_i, and an internal state s_i, which consists of any parameters specific to the node’s forward pass as well as any recurrent hidden state.

m_i = f(m_{i+1}, x_i, s_i)    (1)

Operations such as the loss function and activation functions have no parameters, so the internal states of these nodes only store the pre-activation inputs and a hidden recurrent state. (Some nodes take more inputs; this is purely done to reduce the total number of nodes necessary for complex operations. For instance, softmax nodes store every intermediate operation result and the common denominator across all inputs.)

Seeding the backward pass requires passing an initial message to the loss node. We typically pass multidimensional messages between nodes; however, we treat the message being externally passed into the loss node as a special case of a one-dimensional message:

m_seed = 1    (2)

The output message of a loss node being passed backwards is multidimensional.

Nodes representing parameterized operations, such as an affine transform, further undergo a weight update rule, defined by g, during the backward pass.

Δw_i = g(m_{i+1}, x_i, s_i)    (3)

For an input batch of size B, at step t, the update is the average of the updates computed over every element in the batch:

w_i^{t+1} = w_i^t + (1/B) Σ_{b=1}^{B} Δw_{i,b}^t    (4)

Given a traditional dense layer with a weight matrix of cardinality N×M, and forward and backward arrows for each individual weight and bias, it is evident that some of the messages have to be replicated and others aggregated. Each of the N neurons has its own bias, which sends the same message to each of its inputs, as they are connected to M incoming nodes and associated weights. This is not unlike how the backward-flowing gradient is sent along all paths. Similarly, the output of the backward pass of a dense layer is expected to be M messages, one for each input. We accomplish this by averaging the messages being passed backwards to each input.

m_j = (1/N) Σ_{i=1}^{N} m_{i,j}    (5)
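To make Equations 3–5 concrete, the following NumPy sketch runs a dense layer's backward pass with stand-in f and g: weight updates are averaged over the batch (Eq. 4), and backward messages to each input are averaged over the layer's outputs (Eq. 5). All shapes and function bodies are illustrative assumptions.

```python
# Sketch of a dense layer's backward pass under MPLP (Eqs. 3-5).
import numpy as np

N, M, K, B = 3, 4, 8, 10   # outputs, inputs, message size, batch size
rng = np.random.default_rng(0)

def f(m_in, x, w):
    # stand-in for the learned message-passing network f
    return np.tanh(m_in + x + w)

def g(m_in, x, w):
    # stand-in for the learned weight-update network g (scalar output)
    return float(np.tanh(m_in.sum() + x * w))

W = rng.normal(size=(N, M))
X = rng.normal(size=(B, M))        # stored forward inputs, per batch element
m_in = rng.normal(size=(B, N, K))  # incoming messages, one per output neuron

# Eq. 4: each weight's update is the batch average of g's proposals
dW = np.zeros_like(W)
for i in range(N):
    for j in range(M):
        updates = [g(m_in[b, i], X[b, j], W[i, j]) for b in range(B)]
        dW[i, j] = np.mean(updates)
W += dW

# Eq. 5: messages to input j are averaged over the N weight nodes
# (shown here for a single batch element)
msgs = np.stack([[f(m_in[0, i], X[0, j], W[i, j]) for j in range(M)]
                 for i in range(N)])  # shape (N, M, K)
m_out = msgs.mean(axis=0)             # shape (M, K): one message per input
```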

Other strategies may be worth exploring, such as summing backward flowing messages, or even more complex and stateful aggregation such as feeding messages into an RNN.

Stateful and stateless learners. We parameterize the backward arrows f and g using deep neural networks. We experiment with both stateful and stateless implementations of these arrows. We consider the backward network stateless when no recurrent hidden state is carried between computed messages or updates: the backward pass has no memory of previous iterations. To implement a backward pass with memory, we incorporate a hidden state at each node and implement f and g using a Gated Recurrent Unit (GRU). We observed that a deep network is beneficial for the computation of the next state update:

z_t = σ(W_z [x_t, h_{t-1}])    (6)
r_t = σ(W_r [x_t, h_{t-1}])    (7)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ MLP([x_t, r_t ⊙ h_{t-1}])    (8)

The MLPs have two hidden layers of size 80 and 40, respectively, with ReLU activations. We consider some of the carry states as output messages, and as weight updates for weights and biases. In Experiment 4.1.2, we used a stateless version of the above, with one MLP for f and one MLP for g, both with two hidden layers of size 80 and 40 respectively, ReLU activations, and a tanh activation on the final layer. We did not explore the possible space of architectures further.
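The following is a minimal NumPy sketch of such a stateful learner, assuming the GRU-with-MLP-candidate form reconstructed above; sizes and initialization are illustrative and not taken from the released code.

```python
# Sketch of a stateful learner: a GRU-style cell whose candidate state
# is computed by a deep MLP (hidden layers of 80 and 40 ReLU units).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRULearner:
    def __init__(self, rng, n_in, n_h):
        cat = n_in + n_h
        self.Wz = rng.normal(0, 0.1, (cat, n_h))  # update gate
        self.Wr = rng.normal(0, 0.1, (cat, n_h))  # reset gate
        # deep candidate network: cat -> 80 -> 40 -> n_h
        self.W1 = rng.normal(0, 0.1, (cat, 80))
        self.W2 = rng.normal(0, 0.1, (80, 40))
        self.W3 = rng.normal(0, 0.1, (40, n_h))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.Wz)             # Eq. 6
        r = sigmoid(xh @ self.Wr)             # Eq. 7
        xrh = np.concatenate([x, r * h])
        a = np.maximum(0.0, xrh @ self.W1)    # deep MLP candidate state
        a = np.maximum(0.0, a @ self.W2)
        h_cand = np.tanh(a @ self.W3)
        return (1.0 - z) * h + z * h_cand     # Eq. 8: next carry state

rng = np.random.default_rng(0)
cell = GRULearner(rng, n_in=10, n_h=32)
h = np.zeros(32)
h = cell.step(rng.normal(size=10), h)  # slices of h serve as messages/updates
```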

Normalization. Meta-learning is notoriously hard to train. We found initializations to be critical for successful training. In particular, the inputs to f and g (input messages, carry states, inputs to the forward pass, and optionally weights) all have different means and magnitudes. We mitigated this problem by standardizing these inputs individually over initial minibatches. The mean and standard deviation from these initial minibatches are recorded, and reused throughout the network’s lifetime to standardize subsequent batches. Outputs need to be translated and scaled as well: the outputs of g are inherently bounded in (-1, 1) by the tanh activation function, while the typical magnitude and standard deviation of weights and biases is usually orders of magnitude smaller. This empirically results in too-large weight and bias updates during early training, rendering meta-training particularly unstable. Moreover, having a mean output significantly different from zero causes suboptimal training. To mitigate these problems, before starting training, for each dense layer, we store the output means of g (respectively for the weights and biases) and use them to obtain initial zero-mean outputs. Furthermore, we scale outputs down to be at most a fixed factor times the standard deviation of the weight matrix. (This means that weights and biases of the same dense layer are initialized to have g outputs of similar magnitudes. This is of particular importance, since we zero-initialize biases, and therefore could not compute their initial standard deviations.) We keep all these normalization variables fixed during meta-training, except for the output scaling, which we meta-train, as we observed it to be beneficial for fast adaptation. We believe other standardization techniques such as batch normalization may prove fruitful, but we do not explore them in this work.
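A minimal sketch of this normalization scheme follows, assuming frozen input standardizers recorded on warm-up minibatches and a rescaling of g's tanh-bounded outputs; the exact scale factor and helper names are assumptions.

```python
# Sketch of input standardization and output rescaling for MPLP learners.
import numpy as np

class FrozenStandardizer:
    """Record mean/std on initial minibatches, then reuse them forever."""
    def __init__(self, warmup_batches):
        data = np.concatenate(warmup_batches, axis=0)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0) + 1e-8  # recorded once, then frozen

    def __call__(self, batch):
        return (batch - self.mean) / self.std

def rescale_update(g_out, g_out_mean, weight_std, scale):
    # center g's output with its recorded initial mean, then scale it
    # relative to the weight matrix's standard deviation
    return (g_out - g_out_mean) * scale * weight_std

rng = np.random.default_rng(0)
std = FrozenStandardizer([rng.normal(size=(32, 8)) for _ in range(4)])
normed = std(rng.normal(size=(32, 8)))  # later batches reuse frozen stats
```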

Parameter sharing. Across all our experiments, all weight nodes for a given weight matrix, and all bias nodes for a given bias vector, share f and g. (Weights and biases do not share f and g among themselves. The same reasoning applies to loss functions and activations: a cross-entropy loss node does not share f with a ReLU node.) Likewise, all nodes for a given activation function, and all nodes for a given loss, share f. In addition, we experimented with two more configurations: sharing f and g across all layers of the same kind (so we have one f and one g for all weight nodes in the network, and another f and g for all bias nodes), and sharing f, g, and the standardization parameters described above.

3.1.1 Training regime

Cross-validation loss. We compute a cross-validation loss after performing k steps of adaptation, similar to the meta-training procedure used in MAML.

Let us call F and B the forward-pass and backward-pass functions of a neural network. In our use case, B can be interpreted as MPLP. The neural network is parameterized by θ and MPLP is parameterized by φ. Let L be an arbitrary loss function. We adapt the neural network on a training set D_train and keep a heldout set D_val.

We update θ with the training set:

ŷ_t = F(x_t; θ_t),  (x_t, y_t) ∈ D_train    (9)
θ_{t+1} = B(θ_t, x_t, y_t; φ)    (10)

This is repeated k times. Afterwards, we compute a cross-validation loss with the heldout set:

L_val(φ) = L(F(x_val; θ_k), y_val),  (x_val, y_val) ∈ D_val    (11)

While we do not pass hidden states in the equations above, every function can be assumed to ingest and return hidden states.

Hint losses. In addition to the cross-validation loss described above, we have observed that it is beneficial to add a hint loss: at each step of the inner loop, after Equations 9 and 10, we compute a loss on the same training data just observed:

L_hint^t(φ) = L(F(x_t; θ_{t+1}), y_t)    (12)

This meta-loss is scaled such that the sum of all hint losses across the inner-loop steps has the same magnitude as the final cross-validation loss. We find better convergence by weighting the total hint loss to be larger than the global meta-loss. We suspect such an intermediate loss produces a smoother local loss landscape, but its inherent greediness may also hamper our model in finding an optimal convergence path. In Experiment 4.1.2 we show what can happen when we do not add a hint loss.
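To make the training regime concrete, here is a minimal sketch of the meta-objective over one task; forward_fn, backward_fn, loss_fn, task, and hint_weight are hypothetical stand-ins, not names from the released code.

```python
# Sketch of the inner loop and meta-losses (Eqs. 9-12).
def meta_loss(phi, theta, task, forward_fn, backward_fn, loss_fn,
              k, hint_weight=1.0):
    hint_total = 0.0
    for _ in range(k):
        x, y = task.sample_train_batch()
        y_hat = forward_fn(theta, x)            # Eq. 9: forward pass
        # (the forward pass also stores node inputs used by MPLP below)
        theta = backward_fn(phi, theta, x, y)   # Eq. 10: MPLP inner update
        hint_total += loss_fn(forward_fn(theta, x), y)  # Eq. 12: hint loss
    x_val, y_val = task.sample_val_batch()
    val_loss = loss_fn(forward_fn(theta, x_val), y_val)  # Eq. 11: CV loss
    # scale hints so their sum is comparable to the cross-validation loss
    return val_loss + hint_weight * hint_total / k
```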

Outer batches. We meta-learn by accumulating gradients across multiple tasks, generally referred to as outer batches. In this work, a task refers to: (a) a selection of amplitude and phase for sinusoidals, (b) a randomly initialized network to optimize for either sinusoidals or MNIST. Regarding the latter, for all experiments except Experiment 4.1.2, we train MPLPs able to adapt any randomly initialized network. We do so by initializing a pool of networks before training (weights are sampled from a normal distribution with mean 0 and standard deviation 0.05; biases are zero-initialized), and using a different randomly sampled prior for every outer batch. Even very small pools are generally sufficient for MPLP to generalize to unseen random initializations. However, a pool of size 1 would have stateful learners overfit to it.

We meta-learn through Adam and normalize the gradients for each optimizer parameter. We further describe training setups specific to each experiment in their respective sections.

4 Experiments

4.1 Sinusoidal fitting

Following the example in MAML, we show convergence on tasks drawn from a family of sinusoidal functions. We sample a set of sinusoidal tasks for each training step, where a sinusoidal task is defined as a regression on samples from the underlying function y = A·sin(x + φ₀), with amplitude A and phase φ₀ sampled per task as in MAML. We sample inner batches with x drawn uniformly from the input domain. For an inner training loop consisting of k steps, we sample k inner batches. We show convergence of our approach both when trained from a randomly initialized prior, and when combined with MAML to learn prior weights of the optimizee network. We use 8-dimensional messages.
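For concreteness, here is a small sketch of the task sampling just described; the amplitude, phase, and input ranges follow the original MAML setup and are assumptions here.

```python
# Sketch of sinusoidal task sampling.
import numpy as np

def sample_sinusoid_task(rng, inner_batch_size=10, k_steps=5):
    """Sample one task: k inner batches from y = A * sin(x + phase)."""
    A = rng.uniform(0.1, 5.0)        # amplitude range as in MAML (assumed)
    phase = rng.uniform(0.0, np.pi)  # phase range as in MAML (assumed)
    def batch():
        x = rng.uniform(-5.0, 5.0, size=inner_batch_size)  # assumed domain
        return x, A * np.sin(x + phase)
    return [batch() for _ in range(k_steps)]  # one batch per inner step

rng = np.random.default_rng(0)
task_batches = sample_sinusoid_task(rng)
```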

4.1.1 Randomly initialized priors

In this experiment, we train MPLP to adapt arbitrary network initializations to arbitrary sinusoidal tasks. Thus, for each outer batch, we sample a random sinusoidal task and a randomly initialized network prior. The optimizee network is a traditional deep neural network with two hidden layers of 20 units each, and ReLU activations on the hidden layers. We use stateful learners for MPLP and do not share any parameters across layers.

Training. We perform 5-step learning with an inner batch size of 10. We use an outer batch size of 4. We use a cross-validation loss at the end of the 5-step learning, and a hint loss at every step. We use an L2 loss for both cross-validation and hint losses. The L2 loss is also the final node of the network (that is, the loss seeding the initial message in the backward stage is L2).

Figure 2: Comparison of MPLP (left column) with Adam, learning rate 0.01 (right column). The top row shows a typical example run for the evaluation of 5-step training with an inner batch size of 10. The bottom row shows means and standard deviations of the losses over 5 steps, across 100 runs; the eval loss is the L2 loss averaged across the entire domain, the train loss is the L2 loss averaged across all training points observed so far.

Figure 2 shows a comparison between MPLP and Adam (we use out-of-the-box Adam parameters, except for the learning rate, which we fine-tuned to 0.01 since it gave the best results). The MPLP optimizer was trained for ~40,000 steps. The optimized MPLP is able to fit the task, and it vastly outperforms what Adam can achieve in 5 steps.

Training MPLP showed great variance in its convergence results. Anecdotally, we observed a trend of obtaining better and more reliable results the larger the number of parameters MPLP has. However, all parameter-sharing configurations can, albeit seldom, converge to a successful MPLP similar to what is shown in Figure 2. We give more details in Appendix B.

4.1.2 Learning with a prior

In this experiment, we show how it is possible to meta-learn the priors (as in MAML) and MPLP jointly. To showcase properties of MPLP different from those observed in Experiment 4.1.1, we choose a stateless learning algorithm, perform 2-shot learning, and use only a cross-validation loss at the end (i.e., no hint loss). We also do not share any parameters across layers.

Figure 3: Comparison of models trained jointly with MPLP and MAML (left), and with standalone MAML (right).

Figure 3 showcases how adding MPLP to MAML can drastically change the behaviour of the learner. While both models successfully solve the task, MAML runs always approximate a sinusoidal, starting from the priors. Moreover, since they are forced to follow the gradient at every step, it is no surprise that they increasingly approximate the target sinusoidal. With MPLP, instead, the learning algorithm is not restricted to performing incremental improvements at every step: in this case, the intermediate steps look nothing like a sinusoidal approximation until the last step. We also verified that the learned MPLP is tightly coupled with the learned priors, as it does not succeed with a randomly initialized network. Meta-training on this experiment yields different MPLPs, and hence different visual results; what we show in Figure 3 is just one example of many possibilities.

4.2 MNIST classification

The goal of this experiment is to learn an MPLP capable of generalizing to arbitrary network initializations and adapting them to MNIST. To do so, we perform 20-step training with an inner batch size of 8 on a scaled-down (12x12) MNIST. We scale down MNIST mostly for performance reasons, but this also allows us to evaluate our trained learners on full-scale MNIST afterwards. We also standardize the inputs by computing the mean and standard deviation on the train dataset. The architecture used is a (144,50,10) network with a sigmoid activation for the hidden layer, a softmax for the final layer, and a cross-entropy loss node afterwards. Every task in the outer batch is initialized with a different network prior. We use a stateful learner with a message size of 4 as a compromise between quality and computational efficiency.

Parameter sharing. We share the learners for all the dense layers (all weights share one pair of f and g, all biases share one pair of f and g), and keep normalization parameters layer-specific. We could not achieve comparable results when sharing normalization parameters across layers, and we defer more robust explorations to future work.

Training regime. We use an outer batch size of 1 and meta-learn for 5000 steps. We do so due to computational constraints; it is by no means optimal. We compute a hint loss at every step and a cross-validation loss after 20 steps. The losses used are cross-entropy.
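For reference, the setup described in this section can be condensed into a hypothetical configuration sketch (field names are ours, not the repository's; values are taken from the text):

```python
# Hypothetical configuration summary of the MNIST experiment.
mnist_experiment = {
    "dataset": "MNIST, scaled down to 12x12, inputs standardized",
    "architecture": (144, 50, 10),     # dense layers
    "hidden_activation": "sigmoid",
    "head": "softmax + cross-entropy loss node",
    "learner": "stateful (GRU-based f and g)",
    "message_size": 4,                 # quality/efficiency compromise
    "inner_steps": 20,
    "inner_batch_size": 8,
    "outer_batch_size": 1,
    "meta_steps": 5000,
    "parameter_sharing": "f and g shared across dense layers; "
                         "normalization parameters layer-specific",
}
```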

Figure 4: Evaluations of MPLP, SGD, and Adam on the MNIST test set. For all models except "MPLP on bigger network", we run this on the 12x12 MNIST dataset. Every plot point is computed by averaging 100 test points for each of the 100 runs. Table 1 shows standard deviations at the last step.
Model             Architecture            Accuracy @ 20 (%)
MPLP              (144,50,10), Sigmoid
MPLP (transfer)   (144,50,10), Step
MPLP (transfer)   (784,100,10), Sigmoid
SGD               (144,50,10), Sigmoid
Adam              (144,50,10), Sigmoid

Table 1: MNIST accuracy at 20 steps

Results. Figure 4 and Table 1 show evaluations on the test set, with randomly initialized priors, of our MPLP compared to Adam and SGD (here as well, we fine-tuned the learning rates of Adam and SGD, to 0.01 and 0.1 respectively), averaged over 100 runs. Most evaluations are performed on the scaled-down 12x12 MNIST with a (144,50,10) architecture. The exception is "MPLP on bigger network", where we transfer the learnt parameters to a larger network of size (784,100,10). We can see how MPLP outperforms Adam and SGD, on average, at every step. We also experimented with replacing the sigmoid function in the hidden layer with a step function, since the two are closely related. While a step function would only backpropagate zero gradients with traditional means, we are able to train a network with it seamlessly. Finally, we ported the trained MPLP to the larger (784,100,10) network, with a sigmoid hidden layer, and evaluated its performance on 28x28 MNIST. Even though we did not share any standardizers, the model is still able to train meaningfully.

5 Discussion

In this article, we introduced MPLP and applied it to fairly small feed-forward neural networks, with the aim of inquiring whether it is a viable generalization of gradient-based approaches. In this section, we share our thoughts and discuss some of the promising results, as well as some notable difficulties we encountered.

Overall, for few-shot learning, a properly normalized MPLP appears to be a promising path to pursue. All the experiments explored suggest that meta-learning MPLP for few-shot tasks should work for small networks. Scaling this up to larger networks, however, can quickly become overly expensive: for instance, some of our early experiments implemented convolutions for stateless learners, and these very quickly became too computationally expensive. We therefore hypothesize that scaling up might require some significant architectural enhancement or, more likely, finding a compromise between locality and scalability. One other viable path could be rethinking traditional NN blocks, making them inherently more powerful through MPLP and thereby decreasing architectural depth.

In this work, we never passed gradients in the backward step. This may not necessarily be optimal for real-world experiments, and we observed some good arguments in favour of passing gradients whenever possible: a typical meta-training regime tends to start with the learners not knowing what to do at all. This results in an early loss plateau that the learners have to get out of, and the plateau gets harder to escape with badly initialized networks. We hypothesize this would be less of a problem when also passing the gradient of the loss, since the model could easily escape such plateaus by simply scaling the received gradients.

We also showed a toy example of how an MPLP trained for sigmoids transfers to step functions too. While we would not recommend reading too much into that specific result, we can imagine meta-meta-training regimes that could improve MPLP to eventually do away with differentiability requirements entirely, while transferring well to non-differentiable operations.

MPLP can be applied to arbitrary graphs. As long as the building blocks are end-to-end differentiable, MPLP can be learned efficiently. We expect that different tasks may benefit from non-traditional NN blocks, and recommend trying MPLP. The code is open-sourced at https://github.com/google-research/self-organising-systems/tree/master/mplp.

Continual learning is also a great fit for meta-learning and MPLP: we can train learners that follow cross-validation losses, or losses that discourage forgetfulness. However, we observed one drawback of our loss-driven approach: meta-learning MPLP suffers from all the well-known problems of training NNs. In particular, we should consider MPLP as being composed of small neural networks, and these can quickly fall into bad local minima. We observed these problems in nearly every experiment we performed, including some not included in this work. One solution may be to overparameterize MPLP, but this is likely inefficient. Another is to use the right set of losses to hint MPLP towards a proper direction; this is why we introduced hint losses. Hint losses are even more useful for longer time series, such as 50 steps or more. Avoiding catastrophic forgetting would likewise require a hint loss for remembering over time. However, finding the right hint loss for the right task is not straightforward. We consider this the biggest limitation we observed in our experiments.

In summary, MPLP can be a viable way to overcome some of the problems gradient-based approaches have: adding a cross-validation loss can drive the learners to optimize for generalization rather than overfitting; avoiding catastrophic forgetting can likewise be encoded in the meta-loss; the learned MPLP does not require differentiable architectures, if we are able to transfer it effectively; and vanishing and exploding gradients may be less of an issue, given that the messages passed do not need to decrease or increase in magnitude to express the changes to be performed downstream.

More broadly, one can learn an MPLP on any end-to-end differentiable configuration of agents. This may lead to exploring novel architectures, where each agent is inherently more complex and powerful than traditional operations, or where computations happen asynchronously and in a distributed fashion. MPLP could also be used for devising implicit reward signals in continual learning or reinforcement learning scenarios.

Acknowledgements

We thank Iulia Comșa, Hugo Larochelle, Blaise Aguera y Arcas, and Max Vladymyrov for their insightful feedback.

References

  • B. Aguera y Arcas (2019). Social intelligence. Talk at NeurIPS.
  • M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas (2016). Learning to learn by gradient descent by gradient descent. arXiv:1606.04474.
  • A. Antoniou, H. Edwards, and A. Storkey (2018). How to train your MAML. arXiv:1810.09502.
  • S. Beaulieu, L. Frati, T. Miconi, J. Lehman, K. O. Stanley, J. Clune, and N. Cheney (2020). Learning to continually learn. arXiv:2002.09571.
  • S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei (1997). On the optimization of a synaptic learning rule.
  • Y. Bengio, S. Bengio, and J. Cloutier (1991). Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, Vol. ii, pp. 969.
  • P. Bertens and S. Lee (2019). Network of evolvable neural units: evolving to learn at a synaptic level. arXiv:1912.07589.
  • W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019). A closer look at few-shot classification. arXiv:1904.04232.
  • J. Clune (2019). AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv:1905.10985.
  • C. Finn, P. Abbeel, and S. Levine (2017). Model-agnostic meta-learning for fast adaptation of deep networks. arXiv:1703.03400.
  • C. Finn, A. Rajeswaran, S. Kakade, and S. Levine (2019). Online meta-learning. arXiv:1902.08438.
  • S. Gidaris and N. Komodakis (2018). Dynamic few-shot visual learning without forgetting. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017). Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1263–1272.
  • K. Gregor (2020). Finding online neural update rules by learning to remember. arXiv:2003.03124.
  • S. Hochreiter, A. S. Younger, and P. R. Conwell (2001). Learning to learn using gradient descent. In Lecture Notes in Computer Science 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN-2001), pp. 87–94.
  • T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey (2020). Meta-learning in neural networks: a survey. arXiv:2004.05439.
  • K. Javed and M. White (2019). Meta-learning representations for continual learning. arXiv:1905.12588.
  • Z. Li, F. Zhou, F. Chen, and H. Li (2017). Meta-SGD: learning to learn quickly for few-shot learning. arXiv:1707.09835.
  • T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman (2014). Random feedback weights support learning in deep neural networks. arXiv:1411.0247.
  • L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein (2018). Meta-learning update rules for unsupervised representation learning. arXiv:1804.00222.
  • T. Miconi, A. Rawal, J. Clune, and K. O. Stanley (2020). Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. arXiv:2002.10585.
  • A. Mordvintsev, E. Randazzo, E. Niklasson, and M. Levin (2020). Growing neural cellular automata. Distill. https://distill.pub/2020/growing-ca.
  • A. Nøkland (2016). Direct feedback alignment provides learning in deep neural networks. arXiv:1609.01596.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2018). Continual lifelong learning with neural networks: a review. arXiv:1802.07569.
  • H. Qi, M. Brown, and D. G. Lowe (2017). Low-shot learning with imprinted weights. arXiv:1712.07136.
  • S. Ravi and H. Larochelle (2017). Optimization as a model for few-shot learning. In ICLR.
  • E. Real, C. Liang, D. R. So, and Q. V. Le (2020). AutoML-Zero: evolving machine learning algorithms from scratch. arXiv:2003.03384.
  • J. Schmidhuber (1993). A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks, pp. 407–412.
  • E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P. Manzagol, and H. Larochelle (2019). Meta-Dataset: a dataset of datasets for learning to learn from few examples. arXiv:1903.03096.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2020). A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21.
  • A. S. Younger, S. Hochreiter, and P. R. Conwell (2001). Meta-learning with backpropagation. In IJCNN'01 International Joint Conference on Neural Networks, Vol. 3, pp. 2001–2006.

Appendix A Computational Graph details

For ease of implementation, we designed our approach around a directed computational multidigraph. Nodes in our computational graph are stateful and store arbitrary data structures, which we refer to as "state". Arrows (directed edges) in our graph represent a series of operations to be executed, each taking as input the state of the source node, potentially modifying this state or part of it, and then constructing and passing a message into the state of the destination node. This message contains any information the source node wants to pass on to the destination node. The received message can be instantly processed by the destination node by executing a successive arrow computation, or stored for future use. It is important to note that the mapping of a node’s received messages to the inputs of its outgoing arrows is defined during the construction of the graph. Likewise, the order of execution of the edges is arbitrary, but in the specific case of traditional networks we emulate the forward-backward paradigm: executing forward arrows in sequence until the final message being passed is the loss value, then executing backward arrows in the reverse sequence to update the parameters stored in the states. By the properties of a multidigraph, two nodes can have more than one edge defined between them, as is the case with the aforementioned forward and backward arrows. A node can additionally have reflexive arrows, allowing updates to its own state without receiving any external inputs. Examples of how we utilize internal states in our experiments include storing weight or bias values, Gated Recurrent Unit (GRU)/LSTM hidden states, and a node's embedding.
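A minimal sketch of this node/arrow abstraction in Python follows; class and field names are illustrative, not the repository's API.

```python
# Sketch of stateful nodes connected by arrows in a computational
# multidigraph; multiple arrows (e.g. forward and backward) may connect
# the same pair of nodes.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Node:
    state: Dict = field(default_factory=dict)  # weights, hidden states, inbox

@dataclass
class Arrow:
    src: Node
    dst: Node
    op: Callable  # reads/updates src.state and returns a message for dst

class Graph:
    def __init__(self):
        self.forward_arrows: List[Arrow] = []
        self.backward_arrows: List[Arrow] = []

    def run(self, arrows: List[Arrow]):
        # execute arrows in sequence; each message is stored in the
        # destination node's state for later arrows to consume
        for a in arrows:
            msg = a.op(a.src.state)
            a.dst.state.setdefault("inbox", []).append(msg)
```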

Such a framework allows for a variety of possible architectures for arbitrary computation and communication within a network.

Appendix B Sinusoidal Ablation study

We performed ablations on message size and the amount of shared parameters (see Section 3 for the meaning of the possible configurations). Unfortunately, we observed a great amount of variance for each experiment, and it was computationally prohibitive to perform several repetitions for each instance. What follows is therefore anecdotal evidence. Ablating the message size (we tried 1, 4, and 8), we observed that training converges faster and the final result is better with a larger message size. Ablating the amount of shared parameters, we observed the models to converge faster with no shared parameters, but the end results are all comparable even when sharing all parameters. A note on reliability: the less powerful the model, the more likely it is to get stuck in some local minimum, and we observed every model get stuck in some runs; the most common failure we observed was fitting only a central arc of the sinusoid, while regressing to a constant value outside of the center.