The computational substrate that the human brain employs to carry out its functions is given by networks of spiking neurons, i.e. spiking neural networks (SNNs). There appear to be numerous reasons for evolution to have branched off towards such a design. For example, networks of such neurons facilitate a distributed scheme of computation that is tightly intertwined with memory, thereby overcoming known disadvantages of contemporary computer designs such as the von Neumann bottleneck. Importantly, the human brain serves as a blueprint for a power-efficient learning machine, solving demanding computational tasks while consuming few resources. A characteristic property that makes energy-efficient computation possible is the distinct communication scheme among these neurons. In particular, neurons do not need to produce an output at all times. Instead, information is integrated over time and communicated sparsely in the form of discrete events, "spikes".
The connectivity structure, the development of computational functions in specific brain regions, as well as the learning algorithms at work are all subject to an evolutionary process. In particular, evolution has shaped the human brain and successfully formed a learning machine capable of carrying out a range of complex computations. In close connection to this, a characteristic property of learning processes in humans is the ability to take advantage of previous, related experiences in novel tasks. Indeed, humans show both the ability to quickly adapt to new challenges in various domains and the ability to transfer previously acquired knowledge about different, but related, tasks to new, potentially unseen ones (Taylor and Stone, 2009; Robert Canini et al., 2010; Wang and Zheng, 2015).
One strategy to investigate the benefit of knowledge transfer between different, but related, learning tasks is to impose a so-called Learning-to-Learn (L2L) optimization. L2L employs task-specific learning algorithms, but also tries to mimic the slow evolutionary and developmental processes that have prepared brains for the learning tasks humans have to face. In particular, L2L introduces a nested optimization procedure consisting of an inner loop and an outer loop. In the inner loop, specific tasks are learned, while the outer loop aims to optimize the learning performance over a range of different tasks. This concept gave rise to an interesting body of work (Hochreiter et al., 2001; Finn et al., 2017; Wang et al., 2016), which showed that artificial learning systems can be endowed with transfer learning capabilities. Recently, this concept was also extended to networks of spiking neurons: in (Bellec et al., 2018), it was shown that a biologically inspired circuit can encode prior assumptions about the tasks it will encounter.
Whereas one usually takes advantage of the availability of gradient information to facilitate optimization, we instead employ powerful gradient-free optimization algorithms in the outer loop that emulate the evolutionary process. In particular, we demonstrate the benefits of evolution strategies (ES) (Rechenberg, 1973) and cross-entropy methods (CE) (Rubinstein, 1997), as they are able to deal with noisy function evaluations and perform well in high-dimensional spaces. In the inner loop, on the other hand, we consider reinforcement learning (RL) problems, such as Markov Decision Processes and Multi-armed bandits. Problems of this type arise frequently, and a rich literature has therefore emerged. Nevertheless, learning from rewards remains particularly inefficient, as the feedback is given by a single scalar quantity, the reward. We show that by employing the concept of L2L, we can produce agents that learn efficiently from rewards and exploit previous experiences on new, related tasks.
As another novelty, we implement the learning agent on neuromorphic hardware (NM hardware). Specialized hardware of this type has emerged by taking inspiration from principles of brain computation, with the intent to port the advantages of distributed and power-efficient computation to silicon chips (Mead, 1990). This holds great promise for bringing artificial intelligence to devices without cloud connection and/or with limited resources. Numerous architectures have been proposed that are based on analog, digital or mixed-signal approaches (Ambrogio et al., 2018; Furber, 2016; Schemmel et al., 2010; Aamir et al., 2018; Furber et al., 2014; Pantazi et al., 2016; Davies et al., 2018). We refer to (Schuman et al., 2017) for a survey on neuromorphic systems.
In order to further enhance the learning capabilities of NM hardware, we exploit the adjustability of the employed neuromorphic chip and consider the use of meta-plasticity. In other words, we evolve a highly configurable plasticity rule that is responsible for learning in the network of spiking neurons. To this end, we represent the plasticity rule as a multilayer perceptron and demonstrate that this approach can significantly boost learning performance beyond the level achieved by plasticity rules derived from general-purpose algorithms.
NM hardware is especially well suited for L2L because it renders the required large number of simulations feasible. Spiking neurons simulated on NM hardware typically exhibit accelerated dynamics compared to their biological counterparts. In addition, the chosen neuromorphic hardware allows emulating both the RL environment and the learning algorithm at the same acceleration factor, and hence one can unlock the full potential of the specialized neuromorphic chip.
First, in Section 2 we discuss our approaches and methods, as well as the set of tools used in our experiments; in particular, the employed NM hardware is described. Then, in Section 3.1 we exhibit the increase in performance obtained on NM hardware for the conducted tasks, and we discuss which gradient-free algorithms worked best for us. Subsequently, we show in Section 3.3 that performance can be further increased by the adoption of a highly customizable learning rule, i.e. meta-plasticity, that is shaped through L2L, and discuss its relevance for transfer learning. We further discuss the reduction in simulation time afforded by the underlying NM hardware. Finally, we summarize our findings and conclude in Section 4.
2 Methods and Materials
This section provides the technical details of the conducted experiments. First, we describe the background of L2L in Section 2.1 and discuss the gradient-free optimization techniques that are employed. Subsequently, we provide details of the reinforcement learning tasks that we considered (Section 2.2).
Since the agent that interacts with the RL environments is implemented on NM hardware, we discuss the corresponding chip in Section 2.3. We present the network structure used throughout all our experiments in Section 2.4. Subsequently, we provide details of the learning algorithms we used in Section 2.5 and discuss methods for analysis (Section 2.6).
2.1 Learning-to-Learn and Gradient-free Optimization
The goal of Learning-to-Learn is to enhance a learning system's capability to learn. In models of neural networks, learning performance can be enhanced by several measures. For example, one can optimize hyperparameters that affect the learning procedure, or optimize the learning procedure as such. Often, this optimization is carried out manually and involves a lot of domain knowledge. Here, we approach this problem with L2L and evolve suitable hyperparameters as well as learning algorithms automatically.
In particular, L2L introduces a nested optimization that consists of two loops: an inner loop and an outer loop, as displayed in Figure 1. In the inner loop, one is solely concerned with learning particular tasks T, which are sampled at random from a family of learning tasks that share some abstract concepts. The outer loop, on the other hand, is responsible for increasing the learning fitness over many tasks. We express the learning fitness as a function f(T, Θ) that depends on the task T to be learned, as well as on hyperparameters Θ that characterize the learning algorithm. Formally, we write the goal of L2L as an optimization problem:

max_Θ E_T [ f(T, Θ) ]    (1)

and emphasize that f(T, Θ) includes the learning process of a task T.
In practice, the expectation in Equation (1) is approximated using batches of N different tasks: E_T [ f(T, Θ) ] ≈ (1/N) Σ_{i=1}^{N} f(T_i, Θ).
As a result of considering different tasks in the inner loop each time, the hyperparameters can only capture task-independent concepts that are shared throughout the family. In fact, one can consider L2L as an optimization that takes place on two different timescales: fast learning of single tasks in the inner loop, and a slower learning process that adapts the hyperparameters in order to boost learning on the entire family of tasks.
The L2L scheme allows separating the learning process in the inner loop from the optimization algorithms that operate in the outer loop. We used Q-Learning and meta-plasticity to implement learning in the inner loop (discussed in Section 2.5), while we considered several gradient-free optimization techniques in the outer loop. A well-suited outer loop algorithm has to operate in a high-dimensional parameter space, deal with noisy fitness evaluations, find a good final solution, and do so using a small number of fitness evaluations. Due to this broad set of requirements, the choice of the outer loop algorithm is non-trivial and needs to be adjusted to the task family considered in the inner loop. We selected a set of gradient-free optimization techniques comprising cross-entropy methods, evolution strategies, numerical gradient-descent, and a parallelized variant of simulated annealing. In the following, we provide a brief outline of the algorithms used and refer to the corresponding literature. For the concrete implementation, we employ an L2L software framework that provides several such optimization methods (Subramoney et al., 2019). In particular, the L2L optimization is carried out on a Linux-based host computer, whereas the inner loop is simulated in its entirety on the neuromorphic hardware discussed in Section 2.3.
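The nested scheme described above can be condensed into a few lines of code. The following is a minimal sketch under simplifying assumptions: the inner loop is hidden inside a `fitness` callback that learns one task with the given hyperparameters, and the outer loop is reduced to a simple keep-the-best search (the actual outer loop algorithms are discussed below). All names are illustrative.

```python
import random

def l2l_optimize(fitness, sample_task, theta, iterations=50, pop=8, sigma=0.3, batch=3):
    """Minimal sketch of the nested L2L scheme: the inner loop lives inside
    `fitness` (learning one task with hyperparameters theta), the outer loop
    searches for hyperparameters that work well across sampled tasks."""
    def batch_fitness(params):
        # approximate the expectation over tasks with a small batch
        return sum(fitness(params, sample_task()) for _ in range(batch)) / batch

    current = batch_fitness(theta)
    for _ in range(iterations):
        # outer loop: propose perturbed hyperparameters, keep the best if it improves
        candidates = [[p + random.gauss(0.0, sigma) for p in theta] for _ in range(pop)]
        scored = [(batch_fitness(c), c) for c in candidates]
        best_f, best = max(scored, key=lambda t: t[0])
        if best_f > current:
            theta, current = best, best_f
    return theta
```

Any of the gradient-free optimizers below can be substituted for the inner keep-the-best step without changing the overall two-loop structure.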
Cross-entropy (CE) (Rubinstein, 1997)
In each iteration, this algorithm fits a parameterized distribution to the set of best-performing hyperparameters in terms of maximum likelihood. In the subsequent step, new hyperparameters are sampled from this distribution and evaluated. The procedure then starts over until a stopping criterion is met. Through this process, the algorithm tries to find a region of hyperparameters where the performance is high on average. We used a Gaussian distribution with a dense covariance matrix.
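The sample-rank-refit cycle can be sketched as follows, here in a one-dimensional form with a Gaussian (the experiments use a full covariance matrix; the names and constants are illustrative):

```python
import random
import statistics

def cross_entropy_search(fitness, mu=0.0, sigma=3.0, pop=40, n_elite=8, iters=30):
    """Sketch of the CE method with a one-dimensional Gaussian: sample a
    population, keep the best-performing elites, refit the distribution to
    them by maximum likelihood, and repeat."""
    for _ in range(iters):
        samples = [random.gauss(mu, sigma) for _ in range(pop)]
        elites = sorted(samples, key=fitness, reverse=True)[:n_elite]
        mu = statistics.mean(elites)                  # maximum-likelihood mean
        sigma = max(statistics.pstdev(elites), 1e-3)  # keep a little spread
    return mu
```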
Evolution strategies (ES) (Rechenberg, 1973)
In each iteration, this algorithm maintains base hyperparameters, which are perturbed by random deviations to form a new set of candidate hyperparameters. These candidates are then evaluated and ranked by their fitness. In a subsequent step, the perturbations are weighted according to their rank to produce a direction of increasing fitness, which is used to update the base hyperparameters. Similar to the cross-entropy method, ES finds a region of hyperparameters with high fitness, rather than just a single point. Note that many variations of this algorithm have been proposed, differing for example in how the ranking or the perturbations are computed (Salimans et al., 2017). In particular, we used Algorithm 1 from (Salimans et al., 2017).
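One ES iteration can be sketched as below, in the spirit of Salimans et al. (2017). For brevity this sketch weights perturbations by raw (baseline-subtracted) fitness rather than by rank; the baseline and constants are illustrative choices:

```python
import random

def es_step(theta, fitness, sigma=0.1, lr=0.05, pop=20):
    """One evolution-strategies update: evaluate Gaussian perturbations of
    the base hyperparameters and move them along the fitness-weighted
    noise directions."""
    eps = [[random.gauss(0.0, 1.0) for _ in theta] for _ in range(pop)]
    fs = [fitness([t + sigma * e for t, e in zip(theta, ep)]) for ep in eps]
    base = sum(fs) / pop  # a simple baseline reduces estimator variance
    return [t + lr * sum((f - base) * ep[i] for f, ep in zip(fs, eps)) / (pop * sigma)
            for i, t in enumerate(theta)]
```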
Simulated annealing (SA) (Kirkpatrick et al., 1983)
In each iteration, the algorithm maintains hyperparameters and a temperature. The hyperparameters are perturbed by a random deviation, whose size depends on the temperature, and are subsequently evaluated. The fitness of the unperturbed hyperparameters is then compared with that of the perturbed ones, which replace the current hyperparameters with a probability that depends on the fitness difference and the temperature. In the next step, the temperature is decreased following a predefined schedule and the new hyperparameters are perturbed again. In contrast to the other methods discussed before, the result is a single set of hyperparameters. In our experiments, we simultaneously performed a number of parallel SA optimizations, using a linear temperature decay.
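A single SA run with a linear temperature decay can be sketched as follows. The Metropolis-style acceptance probability exp(ΔF / T) is a standard choice and an assumption here, as are all constants:

```python
import math
import random

def simulated_annealing(fitness, theta, sigma=0.5, t0=1.0, iters=500):
    """Sketch of one simulated-annealing run: improvements are always
    accepted, deteriorations with a probability that shrinks as the
    temperature decays linearly towards zero."""
    f = fitness(theta)
    for k in range(iters):
        temp = t0 * (1.0 - k / iters) + 1e-9      # linear temperature decay
        cand = theta + random.gauss(0.0, sigma)   # perturb the hyperparameter
        fc = fitness(cand)
        if fc >= f or random.random() < math.exp((fc - f) / temp):
            theta, f = cand, fc                   # accept the candidate
    return theta
```

Running several such loops in parallel, as done in the experiments, simply means launching this function with different random seeds.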
Numerical gradient-descent (GD)
In each iteration, the algorithm maintains hyperparameters, which are perturbed randomly in many directions and then evaluated. Subsequently, the gradient is estimated numerically and an ascending step on the fitness landscape is performed.
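As a sketch, the numerical gradient can be estimated by central finite differences along each coordinate (the perturbation scheme and constants here are illustrative):

```python
def numerical_gradient_ascent(fitness, theta, eps=1e-4, lr=0.1, steps=100):
    """Sketch of numerical gradient ascent: estimate the gradient with
    central finite differences, then step uphill on the fitness landscape."""
    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            up, dn = theta[:], theta[:]
            up[i] += eps
            dn[i] -= eps
            grad.append((fitness(up) - fitness(dn)) / (2.0 * eps))
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta
```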
2.2 Reinforcement Learning Problems
In all our experiments, we considered reinforcement learning problems. Tasks of this type usually require many trials and sophisticated algorithms in order to produce a well-performing agent, since a teacher signal is only available in the form of a scalar quantity, the reward. Worse still, a reward does not arrive at every time step, but is often given very sparsely and only for certain events. Figure 2 (A) depicts a generic reinforcement learning loop. The agent observes the current state s of the environment and has to decide on an action a. In particular, the agent samples an action according to its policy π(a|s), which is a probability distribution over actions a given a state s. Upon executing the action, the environment advances to a new state s′ and the agent receives a reward r. In all our experiments, the RL environment was simulated on the neuromorphic chip.
2.2.1 Markov Decision Process
Markov Decision Processes (MDPs) are a well-known and established model for decision making processes in the literature. An MDP is defined by a five-tuple (S, A, P, R, γ), with S representing the state space, A the action space, P the state transition function, R the reward function, and γ a discount factor that weights future rewards differently from present ones. In particular, we are concerned here with MDPs that exhibit discrete and finite state and action spaces. In addition, rewards are given in the range [0, 1]. Figure 2 (B) shows a simple example of such an MDP.
The goal of solving an MDP is to find a policy over actions that yields the largest discounted cumulative reward, defined as:

R = Σ_{t=0}^{T} γ^t r_t    (2)

where r_t denotes the reward received at time step t.
In order to perform well on MDPs, the agent has to keep track of the rewarding transitions and must therefore represent the transition probabilities. Furthermore, the agent has to trade off exploring new transitions against exploiting already known ones. Such problems have been studied intensively in the literature, and a mathematical framework was developed to solve them optimally (Bellman et al., 1954). The so-called Value Iteration (VI) algorithm emerged from this framework and yields an optimal policy. This algorithm is therefore used as the optimal baseline in all following MDP results.
In order to apply the L2L scheme, we introduce a family of tasks consisting of MDPs with a fixed size of the action and state spaces. MDPs of this family are generated according to the following sampling procedure: whenever a new task is required, the rewards and the transition probabilities are sampled uniformly at random from [0, 1]. In addition, the entries of the transition function are normalized such that the outgoing probabilities of each action in every state sum up to 1.
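The sampling procedure for this task family can be sketched as follows (the state/action-space sizes are illustrative placeholders):

```python
import random

def sample_mdp(n_states=6, n_actions=2, rng=None):
    """Sketch of the MDP task-family generator: rewards are drawn uniformly
    from [0, 1] and the outgoing transition probabilities of every
    (state, action) pair are normalized to sum to 1."""
    rng = rng or random.Random()
    R = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    P = [[None] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            raw = [rng.random() for _ in range(n_states)]
            z = sum(raw)
            P[s][a] = [x / z for x in raw]  # normalize the outgoing probabilities
    return P, R
```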
We report our results in the form of a normalized discounted cumulative reward, where we scale between the performance of random action selection and that of optimal action selection, given by the policy produced by VI.
2.2.2 Multi-armed Bandits
As a second category of RL problems, we consider multi-armed bandit (MAB) problems. A MAB is best described as a collection of several one-armed bandits, each of which stochastically produces a reward when pulled. In other words, one can view MAB problems as MDPs with a single state and multiple actions. Despite the deceptive simplicity of such problems, they have been studied extensively, and the celebrated result of (Gittins and Gittins, 1979) established an optimal index-based learning strategy.
For the sake of brevity, we use the same notation for MABs as for MDPs. In particular, we say that the environment always remains in a single state, and the agent is given the opportunity to pull one of several bandit arms, which corresponds to choosing among actions. In all experiments regarding MABs, we considered two-armed bandits, where each arm i produces a reward of either 1 or 0 with a fixed reward probability p_i. We investigate the impact of L2L on the basis of two different families of MAB tasks:
unstructured bandits: A task of this family is generated by sampling each of the two reward probabilities p_1 and p_2 independently and uniformly in [0, 1].
structured bandits: A task of this family is generated by sampling the reward probability p_1 uniformly in [0, 1] and setting p_2 = 1 − p_1.
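The two task families differ only in how the second reward probability is drawn, which can be sketched as:

```python
import random

def sample_bandit(structured, rng=random):
    """Sketch of the two MAB task families: independent reward
    probabilities (unstructured) versus the coupling p2 = 1 - p1
    (structured)."""
    p1 = rng.random()
    p2 = 1.0 - p1 if structured else rng.random()
    return p1, p2

def pull_arm(p, rng=random):
    # a pull yields reward 1 with the arm's fixed probability, else 0
    return 1 if rng.random() < p else 0
```

In the structured family, discovering that one arm pays off poorly immediately implies that the other arm is good; this is exactly the kind of prior knowledge that the outer loop can bake into the agent.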
Similar to MDPs, we report our results for MABs in the form of a normalized cumulative reward, where we scale between the performance of random action selection and that of an oracle that always picks the best possible bandit arm. As a comparison baseline, we employ the Gittins index policy and note that the Gittins index values are calculated in the same way for both families, i.e. assuming that the reward probabilities are independent (unstructured bandits).
2.3 Neuromorphic hardware - HICANN DLSv2
Various approaches for specialized hardware systems implementing spiking neural networks have emerged. They differ fundamentally in their realization, ranging from purely digital over purely analog solutions using optical fibers up to mixed-signal devices (Indiveri et al., 2011; Nawrocki et al., 2016; Schuman et al., 2017). Whereas every such NM hardware comes with certain advantages and limitations, one promising platform is the HICANN-DLS (Friedmann et al., 2017), used herein in its prototype version 2.
The hardware is a prototype of the second-generation BrainScaleS-2 system, currently under development as part of the Human Brain Project neuromorphic platform (Markram et al., 2011). It represents a scaled-down version of the future full-size chip and is used to evaluate and demonstrate new features, as illustrated in this work.
Conceptually, the chip is a mixed-signal design with analog circuits for neurons and synapses, spike-based continuous-time communication, and an embedded microprocessor. The NM hardware is realized in a 65 nm CMOS process node by the company TSMC. It features 32 neurons of the leaky integrate-and-fire (LIF) type, connected by a 32×32 crossbar array of synapses such that each neuron can receive inputs from a column of 32 synapses. Synaptic weights can be set with a precision of 6 bits and can be configured row-wise to deliver excitatory or inhibitory inputs. Synapses feature local short-term plasticity (STP) and spike-timing-dependent plasticity (STDP). All analog time constants are scaled down by a factor of 1000 compared to biological time scales, rendering it an accelerated neuromorphic system, a feature that is strongly exploited in this paper.
The embedded microprocessor is a 32-bit CPU implementing the PowerPC instruction set with custom vector extensions. It is primarily used as a plasticity processing unit (PPU) for arbitrary operations on synaptic weights and labels. As a general-purpose processor, it can also act on any other on-chip data, such as neuron and synapse parameters, as well as on the network connectivity, and it can send and receive off-chip signals like rewards or other control signals. We make use of the freely programmable PPU and investigate different learning algorithms, which are explained in Section 2.5. They all exploit the network structure proposed in Section 2.4 and have in common that the reward information of the state transitions is encoded in the synaptic efficacies.
In addition to learning algorithms, the plasticity processing unit also allows implementing environments for an agent. Since the system features a high speedup factor, any environment must run at the same speedup; with this closed-loop setup, the full potential of the neuromorphic hardware is unlocked.
Some of the basic design rationales behind the second-generation BrainScaleS-2 system, with special emphasis on the PPU, are described in (Friedmann et al., 2017). Figure 3 (A) shows the measurement setup and (B) shows a micrograph of the hardware. In addition to other components, the measurement setup hosts the neuromorphic chip, a USB interface connecting the baseboard to a host computer, as well as a separate FPGA board to control the experiments. The micrograph of the neuromorphic chip shows the different components and where they are located. A description of the actual prototype used in this work, including details on the neuron implementation and the synaptic array, can be found in (Aamir et al., 2016).
2.4 Network Structure and Action Selection
As discussed in Section 2.2, the agent is required to select an appropriate action given a particular state of the environment. In this section, we discuss how the agent can be implemented using a network of spiking neurons on neuromorphic hardware. Since our experiments were concerned with either Multi-armed bandits or Markov Decision Processes, we designed the network structure for the more general MDP problems. In particular, the design is based on the Markov property of MDPs, i.e. the fact that the next state s′ solely depends on the chosen action a and the current state s, similar to (Friedrich and Lengyel, 2016).
Concretely, we use a feed-forward network of spiking neurons with two populations, as illustrated in Figure 4 (A). One population encodes the state of the environment (state population, marked in red) and the second population encodes all possible action choices (action population, marked in blue). We assume that all states exhibit the same number of possible actions. Under this assumption, the agent commits to specific actions by the following action selection protocol: given that the agent finds itself in state s, the corresponding state neuron receives stimulating input and produces output spikes that are transmitted to the neurons of the action population by excitatory synapses with weights w(s, a). Eventually, this stimulation will trigger a spike in the action population, depending on the synaptic strengths. The action taken is determined by the neuron of the action population that emits a spike first. In addition, the neurons coding for actions inhibit each other, through which a winner-take-all (WTA)-like network structure arises. Due to this mutual inhibition, mostly a single neuron of the action population will emit a spike and hence trigger the corresponding action.
In practice, additional measures are required to implement the proposed scheme on the neuromorphic device, see Figure 4 (B). To continually excite the active state neuron, we send a single spike that triggers persistent firing through strong excitatory autapses (marked in green). If a neuron from the action population eventually emits a spike, the active state neuron needs to be prevented from further spiking. For this purpose, we use inhibitory synapses projecting from the action neurons to the state neurons. Due to synaptic delays, more than one action neuron may emit a spike; in such a case, an action is randomly selected among the set of active neurons. Note that smaller inhibition weights between action neurons therefore lead to more random exploration, because insufficient inhibition fails to prevent spikes of the other action neurons.
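Abstracting away the spike timing, the readout behaves roughly like the following sketch: the action neuron driven by the strongest synapse tends to fire first, while weak drive or near-ties fall back to random selection. This is a deliberately coarse software caricature of the analog WTA circuit, with an illustrative `threshold` parameter:

```python
import random

def select_action(weights, threshold, rng=random):
    """Hedged sketch of the spike-based WTA readout: the action neuron with
    the strongest synapse from the active state neuron fires first. If no
    synapse is strong enough, or several are equally strong, selection
    falls back to a random choice."""
    strongest = max(weights)
    if strongest < threshold:
        return rng.randrange(len(weights))   # no action neuron spiked in time
    winners = [a for a, w in enumerate(weights) if w == strongest]
    return rng.choice(winners)               # simultaneous spikes: random pick
```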
One further implementation detail stems from the limited weight resolution (6 bits) available on the NM hardware. Weights might saturate at either zero or the maximum representable value and thereby prevent efficient learning. To avoid this problem, the weights are periodically rescaled linearly into an interval whose upper and lower boundaries are given by dedicated rescale parameters.
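One plausible form of such a rescaling is a linear map of the current weight range into the target interval. This is an assumption for illustration; the exact on-chip rule may differ:

```python
def rescale_weights(w, low, high):
    """Hedged sketch of the periodic rescaling: map the current weight
    range linearly into [low, high] so that the 6-bit weights do not
    saturate at zero or at the maximum value."""
    w_min, w_max = min(w), max(w)
    if w_max == w_min:
        return [0.5 * (low + high)] * len(w)  # degenerate case: all equal
    return [low + (x - w_min) * (high - low) / (w_max - w_min) for x in w]
```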
Figure 4 (C) depicts typical examples of the action selection process for three cases commonly occurring throughout learning. In case 1 (usually before training), a state neuron, e.g. the one corresponding to state 2, is active and persistently emits spikes. However, none of the synapses connecting to the action neurons is strong enough to cause a spike. In this case, after a predefined time, the state neuron is inhibited and a random action is selected. In case 2 (likely during learning), another state neuron is active, but all synapses to the action neurons are strong enough to cause every action neuron to spike before the mutual inhibition sets in. In this case, a random action among the active action neurons is selected. Eventually, the system reaches case 3 (after learning), where only a single action neuron is excited by a given state neuron.
Learning in this network structure is implemented by synaptic plasticity rules that act upon the excitatory weights projecting from the state to the action population. In particular, these weights determine which action has the highest priority in each state.
2.5 Learning Algorithms
MDPs have been studied intensively in computer science, and a rigorous framework for solving problems of this kind optimally was introduced by Bellman (Bellman et al., 1954). An important quantity in MDPs is the so-called Q-Function, or action-value function. The Q-Function Q^π(s, a) expresses the expected discounted cumulative reward when the agent starts in state s, takes action a, and subsequently proceeds according to its policy π. Formally, one writes this as:

Q^π(s, a) = E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_0 = a, π ]

where γ is the discount factor of the MDP and r_t is the immediate reward at time step t. As discussed in Section 2.2.1, we consider only discrete MDPs, and the Q-Functions can therefore be represented in tabular form. This property suits our network structure, since the synapses that project from the state population to the action population can represent all Q-values. Hence, we define w(s, a) ≡ Q(s, a).
To solve an MDP, the goal is to determine the optimal policy π*. A common approach is to infer the Q-Function Q* of an optimal policy and then reconstruct the policy according to:

π*(s) = argmax_a Q*(s, a)

Indeed, as we aim to encode the Q-values in the synaptic weights w(s, a), we emphasize that the argmax operation is naturally carried out by the spiking neural network proposed in Section 2.4. To infer the Q-values of the optimal policy, we derive rules of synaptic plasticity based on temporal difference algorithms as proposed by (Sutton and Barto, 1998).
Temporal Difference Learning (TD(1)-Learning) was developed as a method to obtain the optimal policy. The estimate of the optimal Q-Function is improved based on single interactions with the environment, and TD(1)-Learning is guaranteed to converge to the correct solution, as shown in (Watkins and Dayan, 1992; Dayan and Sejnowski, 1994). Based on TD(1), the synaptic weight updates take on the following form:

Δw(s, a) = η ( r + γ max_{a′} w(s′, a′) − w(s, a) )    (8)

where η denotes a learning rate.
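In software, one such TD(1) update of the synapse between state neuron s and action neuron a can be sketched as (variable names are illustrative; on the chip the rule runs on the PPU):

```python
def q_update(w, s, a, r, s_next, eta=0.1, gamma=0.9):
    """One TD(1) (Q-learning) update of the synapse w[s][a]: a minimal
    software sketch of the on-chip plasticity rule with learning rate eta
    and discount factor gamma."""
    td_error = r + gamma * max(w[s_next]) - w[s][a]
    w[s][a] += eta * td_error
    return td_error
```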
The convergence speed of TD(1)-Learning can be further improved by maintaining an additional eligibility trace e(s, a) per synapse. The resulting algorithm is then referred to as TD(λ)-Learning. In particular, the trace indicates to what extent a current reward makes an earlier visited state-action pair more valuable. Convergence proofs of TD(λ) are given in (Dayan, 1992; Dayan and Sejnowski, 1994). To implement the algorithm, we update the eligibility traces at every time step according to a schedule of the form

e(s, a) ← γ λ e(s, a) for all pairs, with e(s_t, a_t) set to 1 for the visited pair.

In addition, we define an error

δ = r + γ max_{a′} w(s′, a′) − w(s, a)

which enables us to express the resulting plasticity rule as a product of the eligibility trace and the error δ. This update is carried out for every synapse:

Δw(s, a) = η δ e(s, a)    (11)

The parameter λ controls how many state transitions are taken into account, and TD(1)-Learning is recovered as a corner case.
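A sketch of one TD(λ) step over all synapses follows. The use of a replacing trace (resetting the visited pair to 1 rather than incrementing it) is an implementation assumption:

```python
def td_lambda_step(w, e, s, a, r, s_next, eta=0.1, gamma=0.9, lam=0.8):
    """Sketch of the TD(lambda) rule: decay all eligibility traces, tag the
    visited synapse, and apply the TD error to every synapse in proportion
    to its trace."""
    for row in e:
        for ai in range(len(row)):
            row[ai] *= gamma * lam          # traces fade over time
    e[s][a] = 1.0                           # replacing trace for the visited pair
    delta = r + gamma * max(w[s_next]) - w[s][a]
    for si in range(len(w)):
        for ai in range(len(w[si])):
            w[si][ai] += eta * delta * e[si][ai]
    return delta
```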
In order to tailor the specific update rule to the actual task family at hand, we also approached the problem from the perspective of meta-plasticity. That is, we represent the synaptic weight update by a parameterized function approximator and optimize its parameters with L2L in such a way that a useful learning rule for a given task family emerges. We used a multilayer perceptron, the architecture of which is visualized in Figure 5. The perceptron receives five inputs, computes seven hidden units with sigmoidal activation, and provides one output, the weight update Δw. Effectively, the input-to-output mapping of this approximator is specified by its free parameters (the weights of the multilayer perceptron), which are considered as hyperparameters and optimized as part of the L2L procedure. Since the multilayer perceptron is a type of artificial neural network, this plasticity rule is referred to as the ANN learning rule. The update of each synaptic weight thus takes the general form of the perceptron applied to its inputs.
The specific choice of inputs is salient for the set of learning rules that can emerge. In the case of the ANN learning rule, we only considered structured MABs, where each of the two synapses is updated at every time step. We set the inputs in this case to a vector composed of the current time step t, a binary flag indicating whether the synapse was responsible for the action taken at step t, the obtained reward r, the weight w of the synapse itself, and the weight of the synapse associated with the other bandit arm.
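A 5-7-1 perceptron of this kind can be sketched as below. The flat parameter layout (hidden weights and biases followed by output weights and a bias, 50 values in total) is an assumption for illustration; the actual parameterization on the chip may differ:

```python
import math

def ann_plasticity(params, inputs, hidden=7):
    """Sketch of the meta-plastic ANN rule: a 5-7-1 multilayer perceptron
    with sigmoidal hidden units maps the five inputs described in the text
    to a single synaptic weight update. `params` holds the outer-loop
    hyperparameters in a flat list."""
    n_in = len(inputs)
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    k = 0
    h = []
    for _ in range(hidden):
        pre = sum(params[k + i] * inputs[i] for i in range(n_in)) + params[k + n_in]
        h.append(sigmoid(pre))
        k += n_in + 1
    # linear output unit produces the weight update
    return sum(params[k + j] * h[j] for j in range(hidden)) + params[k + hidden]
```

The outer loop then treats the 50 entries of `params` as the hyperparameter vector and optimizes them exactly like any other hyperparameters.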
2.6 Analysis of Meta-Plasticity
Since our meta-plasticity approach uses an artificial neural network, which can represent fairly general nonlinear functions, it is hard to understand what the arising plasticity rule actually expresses. To investigate the emerged functionality, we use an approach from the literature called functional Analysis of Variance (fANOVA), presented in (Hutter et al., 2014). This method originally aims to assess the importance of hyperparameters in the machine learning domain. It does so by fitting a random forest to the performance data of a machine learning model, gathered using different hyperparameters.
We adopted this method but applied it to a slightly different, but related, problem. Our goal is to assess the impact of each input of the ANN rule on its output. To this end, the weights of the plasticity network remain fixed, while the inputs to the plasticity network as well as its output are passed as data to the fANOVA framework. Based on these data, a random forest with 30 trees is fitted, and the fraction of the variance of the output explained by each input variable is obtained.
3 Results

This section presents the results of our approach implemented on the described neuromorphic hardware. First, we report how L2L can improve performance and learning speed in Section 3.1. Then, we investigate the impact of the outer loop optimization algorithm in Section 3.2 and demonstrate in Section 3.3 that meta-plasticity yields competitive performance, while also enhancing transfer learning capabilities. Finally, we investigate the speedup gained from the neuromorphic hardware by comparing our implementation on the NM hardware to a pure software implementation of the same model in Section 3.4.
3.1 Learning-to-Learn improves Learning Speed and Performance
Here, we first demonstrate the generality of our network structure when applied to Markov Decision Processes. Then, we examine the effects of an imposed task structure more closely by investigating Multi-armed bandit problems. To efficiently train the network of spiking neurons, we employed Q-Learning and derived corresponding plasticity rules, as described in Section 2.5. The plasticity rules themselves, as well as the concrete implementation on NM hardware, are influenced by hyperparameters that we optimized by L2L, such that the cumulative discounted reward for a given family of tasks is improved on average, see Section 2.1.
We implemented a neuromorphic agent that learns MDPs. The network structure proposed in Section 2.4 is particularly designed for such tasks, and we concretely applied TD(λ)-Learning, see Equation (11).
Hyperparameters included all occurring parameters of the employed TD(λ)-Learning rule, the inhibition strength among the action neurons, the strength of the inhibitory weights connecting the action neurons to the state neurons, as well as the variables influencing the hardware-specific rescaling. Together, these constituted the complete hyperparameter vector. We used the discounted cumulative reward, Equation 2, as the fitness function and optimized using CE.
We used a batch size of .
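The CE optimizer used in the outer loop can be sketched generically as iteratively refitting a diagonal Gaussian search distribution to the best-scoring hyperparameter samples of each generation. The toy fitness function, dimensionality, and all constants below are illustrative stand-ins for the actual hardware-in-the-loop fitness evaluation:

```python
import numpy as np

def cross_entropy_opt(fitness, dim, iters=30, pop=50, elite_frac=0.2, seed=0):
    """Cross-entropy method: sample a population from a diagonal Gaussian,
    keep the elite fraction, and refit mean and spread to the elites."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([fitness(x) for x in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-3  # noise floor avoids premature collapse
    return mu

# Toy noisy fitness landscape with its maximum at (1, -2).
rng = np.random.default_rng(1)
target = np.array([1.0, -2.0])
best = cross_entropy_opt(
    lambda x: -np.sum((x - target) ** 2) + 0.01 * rng.normal(), dim=2)
```

Because the search distribution covers a whole region rather than a single point, the method tolerates the noisy fitness evaluations discussed in Section 3.2.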
The results for the MDP tasks are depicted in Figure 6 (A), where we report the discounted cumulative reward. It is normalized such that value iteration (VI) marks the upper reference and the random policy the lower one. As a reference, we used a TD(λ)-Learning implementation from a software library (https://pymdptoolbox.readthedocs.io/en/latest/index.html) without a spiking neural network (green line).
We found that applying L2L improved the discounted cumulative reward (red solid line) compared to the case where the hyperparameters were chosen randomly (blue line). In addition, the learning speed increased, as can be seen in the zoomed view in Figure 6 (B).
In the case of MABs, we focused on small networks and two-armed bandits, which allowed us to complement the results obtained for general MDPs of larger size. We considered two families of MABs, unstructured bandits and structured bandits (Section 2.2.2), which the neuromorphic agent had to learn using the TD(1)-Learning rule, see Equation (8). In addition, we introduced a decaying learning rate. We then used L2L to carry out a hyperparameter optimization separately for both MAB families, optimizing the parameters of the TD(1)-Learning rule, the inhibition strength among action neurons, and the inhibitory weights of synapses connecting the action population to the state population; together these formed the hyperparameter vector. We used the cumulative reward as the fitness function and optimized using CE. We used a batch size of .
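As a rough sketch of the inner loop task, the following shows value tracking on a two-armed Bernoulli bandit with a decaying learning rate and ε-greedy action selection. The function name, the exponential decay schedule, and all constants are illustrative assumptions, not the values evolved for the hardware:

```python
import random

def run_bandit(p_arms, steps=500, eta0=0.5, decay=0.995, eps=0.1, seed=0):
    """Two-armed Bernoulli bandit learned by simple value tracking with a
    decaying learning rate eta_t = eta0 * decay**t and epsilon-greedy
    action selection."""
    rng = random.Random(seed)
    q = [0.5, 0.5]            # initial value estimates for the two arms
    total_reward = 0.0
    for t in range(steps):
        if rng.random() < eps:
            a = rng.randrange(2)              # explore
        else:
            a = 0 if q[0] >= q[1] else 1      # exploit current best estimate
        r = 1.0 if rng.random() < p_arms[a] else 0.0
        q[a] += eta0 * decay ** t * (r - q[a])  # TD(1)-style value update
        total_reward += r
    return total_reward, q

total, q = run_bandit([0.9, 0.1])
```

The decaying learning rate freezes the estimates over time, so early evidence dominates, which is the role the decay factor plays in our setup as well.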
In Figure 7 we report the performance obtained before and after applying L2L. The agent interacted with a single MAB, and we compare against a baseline given by the Gittins index policy, as described in Section 2.2.2. We found that performing a L2L optimization enhanced the performance, even more markedly for structured bandits. In particular, L2L endowed the agent with a better learning speed, reflected in a faster rise of the performance curve. This can only be achieved when the hyperparameters of the learning system are well tailored to the tasks that are likely to be encountered, which was the responsibility of L2L. We also observed that the agent could still learn a MAB task to a reasonable level even without L2L optimization, which is expected, since TD(1)-Learning is designed for RL tasks in general. However, this raises the question of how well such a general plasticity rule can adapt to the level of variation exhibited by analog circuitry. We consider extensions in Section 3.3.
3.2 Performance comparison of Gradient-free Optimization Algorithms in the outer loop
The results presented so far suggest that the concept of L2L can improve the overall performance and enables abstract knowledge about the task family at hand to be integrated into an agent. However, the choice of a proper outer loop optimization algorithm is crucial for this scheme to work well. The modular structure of the L2L approach used in this paper allows one to interchange different types of optimization algorithms in the outer loop for the same inner loop task. To demonstrate their impact on performance, we investigated several such algorithms for both general MDPs and specialized MAB tasks. Figure 8 shows a comparison of the final discounted cumulative reward at the end of the tasks for the different outer loop optimization algorithms.
We found that the cross-entropy (CE) method, as well as evolution strategies (ES), work well across the inner loop tasks considered, because both aim to find a region in the hyperparameter space where the fitness is high. This property is particularly desirable given the noise in the fitness landscape caused by imperfections of the underlying neuromorphic hardware: both methods cope with noisy fitness evaluations and do not overrate a single evaluation, which could otherwise steer the search in a wrong direction.
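For comparison with the CE sketch above, evolution strategies in the style of Salimans et al. estimate a search gradient from fitness-weighted Gaussian perturbations; averaging over a whole population is what makes the estimate robust to noisy fitness evaluations. All names and constants below are illustrative:

```python
import numpy as np

def evolution_strategies(fitness, dim, iters=200, pop=40, sigma=0.1, lr=0.1, seed=0):
    """Evolution strategies: perturb the current parameter vector with
    Gaussian noise, evaluate the population, and move along the
    fitness-weighted average of the perturbations."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(iters):
        noise = rng.normal(size=(pop, dim))
        scores = np.array([fitness(theta + sigma * n) for n in noise])
        scores -= scores.mean()                  # baseline for variance reduction
        theta += lr / (pop * sigma) * noise.T @ scores
    return theta

# Toy fitness with its maximum at (0.5, -0.5).
target = np.array([0.5, -0.5])
theta = evolution_strategies(lambda x: -np.sum((x - target) ** 2), dim=2)
```

Like CE, ES effectively searches for a well-performing region: parameters whose whole Gaussian neighborhood scores highly are favored over isolated sharp optima.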
However, a simpler algorithm such as simulated annealing (SA) can also find hyperparameter sets with rather high fitness. Especially when running multiple separate annealing processes in parallel with different starting points, the results can almost compete with those found by CE or ES. However, SA does not aim at finding a good parameter region, but only a single good set of working hyperparameters, which offers less robustness than an entire region of well-performing hyperparameters. A simple numerical gradient-based approach did not yield good results at all because of the noisy fitness landscape. In general, the developer is free to choose any optimization algorithm in the outer loop when using L2L; new algorithms tailored to a particular problem class can also be devised, which may open up a research direction of its own.
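For completeness, plain simulated annealing with Gaussian proposals and a geometric cooling schedule can be sketched as follows; note that it tracks a single point rather than a distribution over a region, which is the limitation discussed above. Function names and constants are illustrative:

```python
import math
import random

def simulated_annealing(fitness, x0, iters=2000, step=0.5, t0=1.0,
                        cooling=0.995, seed=0):
    """Simulated annealing: accept improvements always, worse points with
    Boltzmann probability, while the temperature cools geometrically.
    Keeps track of the best point ever visited."""
    rng = random.Random(seed)
    x, fx = list(x0), fitness(x0)
    best, f_best = list(x), fx
    temp = t0
    for _ in range(iters):
        cand = [xi + rng.gauss(0.0, step) for xi in x]
        fc = fitness(cand)
        if fc >= fx or rng.random() < math.exp((fc - fx) / max(temp, 1e-12)):
            x, fx = cand, fc
            if fx > f_best:
                best, f_best = list(x), fx
        temp *= cooling
    return best, f_best

# Toy fitness with its maximum at (2, -1).
best, f_best = simulated_annealing(
    lambda v: -(v[0] - 2.0) ** 2 - (v[1] + 1.0) ** 2, [0.0, 0.0])
```

Running several such chains from different starting points, as described above, amounts to calling this function with different seeds and keeping the overall best result.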
3.3 Performance improvement through Meta-Plasticity
We further asked whether one can install plasticity rules in spiking neural networks that improve learning from rewards beyond the level of Q-Learning on specific tasks. We conjectured that the TD(1)-Learning rule used on the MAB tasks offers only a limited surface for incorporating task structure or imperfections of analog hardware. Hence, we pursued a different approach to implement learning from rewards using meta-plasticity; that is, we specified the entire learning rule through hyperparameters. In particular, we used a multilayer perceptron with 7 hidden neurons, as depicted in Figure 9 (A), whose input-output behavior implements the plasticity rule of synapses in the spiking neural network. To the best of our knowledge, this is the first example of meta-plasticity on neuromorphic hardware, where a rule for synaptic plasticity is evolved through optimization by L2L.
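The shape of such an MLP plasticity rule can be sketched as follows: a flat parameter vector, found by the outer loop, is unpacked into one hidden layer of 7 units that maps synapse-local quantities to a weight update. The choice and ordering of the inputs (current weight, responsibility flag, reward, trace) are illustrative assumptions, as is the tanh nonlinearity:

```python
import numpy as np

def make_mlp_rule(theta, n_in=4, n_hidden=7):
    """Build a plasticity rule from a flat hyperparameter vector `theta`:
    a perceptron with one hidden layer of `n_hidden` units mapping
    synapse-local inputs to a scalar weight update."""
    k = n_in * n_hidden
    W1 = theta[:k].reshape(n_hidden, n_in)
    b1 = theta[k:k + n_hidden]
    W2 = theta[k + n_hidden:k + 2 * n_hidden]
    b2 = theta[k + 2 * n_hidden]

    def delta_w(weight, responsible, reward, trace):
        x = np.array([weight, responsible, reward, trace])
        hidden = np.tanh(W1 @ x + b1)
        return float(W2 @ hidden + b2)

    return delta_w

n_params = 4 * 7 + 7 + 7 + 1  # 43 parameters to be optimized by the outer loop
rng = np.random.default_rng(0)
rule = make_mlp_rule(rng.normal(scale=0.1, size=n_params))
dw = rule(weight=0.3, responsible=1.0, reward=1.0, trace=0.5)
```

From the outer loop's perspective, the entire vector `theta` is just another set of hyperparameters, which is what makes the rule so adjustable compared to the handful of constants in TD(1)-Learning.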
To test the approach, we used L2L to optimize all occurring hyperparameters on the task family of structured bandits. In particular, the hyperparameter vector was composed of the parameters of the plasticity rule and the inhibition strengths. We used the cumulative reward, Equation 2, as the fitness function and optimized using CE with a batch size of .
Our results are summarized in Figure 9: we observed a drastic increase in the learning performance of agents employing the evolved learning rule (B). In contrast to a tuned version of the TD(1)-Learning rule, the evolved rule achieves performance on the same level as the Gittins index policy. This highlights that a plasticity rule tailor-made for a family of tasks can counteract the negative effects of imperfections in the neuromorphic hardware.
Even though the resulting learning rule performs well on average on the family of tasks it was trained on, there is no theoretical guarantee for this behavior. Hence, we analyzed the optimized learning rule and examined the importance of the inputs provided to the update rule for the resulting output, see Figure 9 (C). The most important inputs are the flag that indicates whether the current weight was responsible for the last action, and the obtained reward. Since both inputs can assume only two values, the four possible cases can be visualized as four curves. We report the expected weight change as a function of the current weight, averaging over the remaining, unspecified inputs, see Figure 9 (D).
Updates for weights which were responsible for the previous action follow the direction of the obtained reward. Hence, the meta-plasticity rule reinforced actions depending on the reward outcome, similar to Q-learning rules. Interestingly, however, the update of a synaptic weight which had not caused the last action was always negative, independent of the reward. We believe that L2L simply found that it does not matter what happens to a weight that did not cause the action: as long as it does not increase, it will not disturb the current belief about the best bandit arm.
To test whether the reinforcement learning agent on the neuromorphic hardware had been optimized for a particular range of tasks, we carried out another experiment, asking whether the agent can take advantage of abstract task structure if it is present. To do so, we always tested learning performance on structured bandits, while the hyperparameter optimization with L2L was carried out either on unstructured or on structured bandits. This experimental protocol (Figure 10 (A)) allowed us to determine to what extent abstract task structure can be encoded in hyperparameters. We report the results for neuromorphic agents in Figure 10 (B), where we considered the TD(1)-Learning rule and the meta-plasticity learning rule. Consistently, we observed that optimizing hyperparameters on the appropriate task family enhances performance. Moreover, we conjecture that the greater adjustability of the meta-plasticity learning rule makes it better suited for transfer learning than the TD(1)-Learning rule.
3.4 Exploiting the benefit of accelerated hardware for L2L
One of the main features of neuromorphic hardware devices is the ability to simulate spiking neural networks with high speed and efficiency. To make this explicit for the MDP tasks, a software implementation with the same network structure and the same plasticity rule was run on a standard desktop PC, using a single core of an Intel™ Xeon™ CPU X5690 running at 3.47 GHz. The spiking neural network was implemented using NEST (NEural Simulation Tool) (Gewaltig and Diesmann, 2007) with a Python interface; the plasticity rule as well as the environment were likewise implemented in Python.
To allow a better comparison, two families of MDP tasks of different sizes were defined: a small MDP family and a large MDP family.
Figure 11 (A) shows a comparison of the simulation time needed for a single randomly selected MDP task, averaged over several MDPs from each of the two families. The simulation times include implementation-specific overheads, for example, the communication overhead with the neuromorphic hardware.
One can see that the simulation time needed for MDP tasks of both sizes is shorter on the neuromorphic hardware and, in addition, that the simulation time needed to solve the larger task does not increase. This indicates, first, that the neuromorphic hardware can carry out the simulation of the spiking neural network faster and, second, that a larger network structure incurs no additional cost as long as the network fits on the NM hardware. In contrast, using more neurons requires longer simulation times in pure software.
A similar message emerges from Figure 11 (B), where instead of a single MDP run, an entire L2L run is evaluated on the neuromorphic hardware as well as with the software implementation. Both the L2L run on neuromorphic hardware and the one in software can in principle be easily parallelized using more hardware systems or more CPU cores, which would decrease the overall simulation time. Note that scheduler overheads are not taken into consideration.
4 Discussion
Outstanding successes have been achieved in the field of deep learning, ranging from scientific theories and demonstrators to real-world applications. Despite these impressive results, deep neural networks are not suitable out of the box for low-power or resource-limited applications. Spiking neural networks, in contrast, are inspired by the brain, an arguably very power efficient computing machine. The neuromorphic hardware used in this work was designed to port key aspects of the astounding properties of this biological circuitry to silicon devices.
The human brain has been endowed, through a long evolutionary process, with a set of hyperparameters and learning algorithms that cover a large variety of computing and learning tasks. Indeed, humans are able to generalize task concepts and port them to new, similar tasks, which provides them with a tremendous advantage over most contemporary neural networks. To mimic this behavior, we employed gradient-free optimization techniques, such as the cross-entropy method or evolution strategies (see Section 2.1), applied in a Learning-to-Learn setting. This two-looped scheme combines task-specific learning with a slower, evolution-like process that yields a good set of hyperparameters, as demonstrated in Section 3.1. The approach is generic in the sense that both the algorithm mimicking the slower evolutionary process and the learning agent can be exchanged: in principle, any agent with learning capabilities and any optimization algorithm can be used. We found that some outer loop optimization algorithms perform better than others, and that the algorithm should ideally be chosen with the inner loop task in mind. Outer loop optimization algorithms need to operate in a high-dimensional parameter space, cope with noisy fitness evaluations, find a good final solution, and require only a low number of evaluations to reach it. Algorithms that aim to find a region of hyperparameters with high performance, such as evolution strategies or cross-entropy, worked best for us, see Section 3.2.
L2L can be used both to find optimal hyperparameters for a fixed individual task and to boost the transfer learning capabilities of an agent across a family of tasks. In addition, new optimization algorithms can be developed to further improve performance in the outer loop of L2L. In this work, we used reinforcement learning problems in connection with NM hardware to demonstrate these benefits.
In particular, the concept of L2L makes it possible to shape highly adjustable plasticity rules for specific task families. Its use is not limited to spiking neural networks, but extends to artificial neural networks, which may open a direction for future research. To the best of our knowledge, this is the first time that the ideas of L2L and Meta-Plasticity have been applied to neuromorphic hardware, see Section 3.3.
Neuromorphic hardware makes it possible to emulate a spiking neural network with a significant speedup compared to the biological equivalent, rendering the large number of evaluations required by the L2L scheme feasible. To quantify the overall speedup of the accelerated neuromorphic hardware, a comparison with a pure software simulation on a conventional computer was carried out (see Figure 11). We conclude that the two-looped L2L scheme is especially well suited for accelerated neuromorphic hardware.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Author contributions
WM, TB and FS developed the theory and experiments. TB implemented and conducted experiments with regard to MDPs, benchmarked the performance impact of outer loop optimization algorithms and probed the performance benefit of NM hardware. FS implemented and conducted experiments with regard to MABs. FS, CP and WM conceived meta-plasticity; FS and CP implemented it. FS tested the benefits in transfer learning. TB, FS, CP, WM and KM wrote the paper.
Funding
This research/project was supported by the HBP Joint Platform, funded from the European Union’s Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 785907 (Human Brain Project SGA2).
Acknowledgments
We thank Anand Subramoney for his support and his contributions to the Learning-to-Learn framework. We are also grateful for the support during the experiments with the neuromorphic hardware; in particular, we would like to thank David Stöckel, Benjamin Cramer, Aaron Leibfried, Timo Wunderlich, Yannik Stradmann, Christian Mauch and Eric Müller. Furthermore, we would like to thank Elias Hajek for useful comments on earlier versions of the manuscript.
References
- A highly tunable 65-nm CMOS LIF neuron for a large scale neuromorphic system. In ESSCIRC Conference 2016: 42nd European Solid-State Circuits Conference, pp. 71–74.
- An Accelerated LIF Neuronal Network Array for a Large Scale Mixed-Signal Neuromorphic Architecture.
- Equivalent-accuracy accelerated neural-network training using analogue memory. Nature 558 (7708), pp. 60–67.
- Long short-term memory and learning-to-learn in networks of spiking neurons.
- The Theory of Dynamic Programming as Applied to a Smoothing Problem. Journal of the Society for Industrial and Applied Mathematics 2 (2), pp. 82–88.
- Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. IEEE Micro 38 (1), pp. 82–99.
- TD(λ) Converges with Probability 1. Machine Learning 14 (3), pp. 295–301.
- The Convergence of TD(λ) for General λ. Machine Learning 8, pp. 341–362.
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.
- Demonstrating hybrid learning in a flexible neuromorphic hardware system. IEEE Transactions on Biomedical Circuits and Systems 11, pp. 128–142.
- Goal-Directed Decision Making with Spiking Neurons. The Journal of Neuroscience 36 (5), pp. 1529–1546.
- The SpiNNaker Project. Proceedings of the IEEE 102 (5), pp. 652–665.
- Large-scale neuromorphic computing systems. Journal of Neural Engineering 13 (5), pp. 051001.
- NEST (NEural Simulation Tool). Scholarpedia 2 (4), pp. 1430.
- Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, pp. 148–177.
- Learning to learn using gradient descent. In ICANN, Lecture Notes in Computer Science, Vol. 2130, pp. 87–94.
- An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 32, Bejing, China, pp. 754–762.
- Neuromorphic Silicon Neuron Circuits. Frontiers in Neuroscience 5, pp. 73.
- Optimization by simulated annealing. Science 220 (4598), pp. 671–680.
- Introducing the human brain project. Procedia Computer Science 7, pp. 39–42. Note: Proceedings of the 2nd European Future Technologies Conference and Exhibition 2011 (FET 11).
- Neuromorphic electronic systems. Proceedings of the IEEE 78 (10), pp. 1629–1636.
- A Mini Review of Neuromorphic Architectures and Implementations. IEEE Transactions on Electron Devices 63 (10), pp. 3819–3829.
- All-memristive neuromorphic computing with level-tuned neurons. Nanotechnology 27 (35), pp. 355205.
- Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Problemata, Frommann-Holzboog, Stuttgart-Bad Cannstatt.
- Modeling Transfer Learning in Human Categorization with the Hierarchical Dirichlet Process.
- Optimization of computer simulation models with rare events. European Journal of Operational Research 99 (1), pp. 89–112.
- Evolution Strategies as a Scalable Alternative to Reinforcement Learning.
- A wafer-scale neuromorphic hardware system for large-scale neural modeling. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pp. 1947–1950.
- A Survey of Neuromorphic Computing and Neural Networks in Hardware.
- IGITUGraz/l2l: v0.4.3.
- Reinforcement Learning: An Introduction. MIT Press.
- Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10 (Jul), pp. 1633–1685.
- Transfer Learning for Speech and Language Processing.
- Learning to reinforcement learn.
- Q-learning. Machine Learning 8 (3-4), pp. 279–292.