Embodied Synaptic Plasticity with Online Reinforcement Learning

03/03/2020 · Jacques Kaiser et al.

The endeavor to understand the brain involves multiple collaborating research fields. Classically, synaptic plasticity rules derived by theoretical neuroscientists are evaluated in isolation on pattern classification tasks. This contrasts with the biological brain, whose purpose is to control a body in closed loop. This paper contributes to bringing the fields of computational neuroscience and robotics closer together by integrating open-source software components from these two fields. The resulting framework allows the validity of biologically plausible plasticity models to be evaluated in closed-loop robotics environments. We demonstrate this framework by evaluating Synaptic Plasticity with Online REinforcement learning (SPORE), a reward-learning rule based on synaptic sampling, on two visuomotor tasks: reaching and lane following. We show that SPORE is capable of learning performing policies within the course of simulated hours for both tasks. Preliminary parameter explorations indicate that the learning rate and the temperature driving the stochastic processes that govern synaptic learning dynamics need to be regulated for performance improvements to be retained. We conclude by discussing recent deep reinforcement learning techniques that could increase the functionality of SPORE on visuomotor tasks.


1 Introduction

The brain evolved over millions of years for the sole purpose of controlling the body in a goal-directed fashion. Computations are performed relying on neural dynamics and asynchronous communication. Spiking neural network models base their computations on these principles. Biologically plausible synaptic plasticity rules for functional learning in spiking neural networks are regularly proposed ([46, 17, 32, 35, 41]). In general, these rules are derived to minimize a distance (referred to as error) between the output of the network and a target. Therefore, the evaluation of these rules is usually carried out on open-loop pattern classification tasks. By neglecting the embodiment, this type of evaluation disregards the closed-loop dynamics the brain has to handle with the environment. Indeed, the decisions taken by the brain have an impact on the environment, and this change is sensed back by the brain. To get a deeper understanding of the plausibility of these rules, an embodied evaluation is necessary. This evaluation is technically complicated, since spiking neurons are dynamical systems that must be synchronized with the environment. Additionally, as in biological bodies, sensory information and motor commands need to be encoded and decoded, respectively.

In this paper, we bring the fields of computational neuroscience and robotics closer together by integrating open-source software components from these two fields. The resulting framework is capable of learning online the control of simulated and real robots with a spiking network in a modular fashion. This framework is demonstrated by evaluating the promising neural reward-learning rule SPORE ([21, 19, 22, 45]) on two closed-loop robotic tasks. SPORE is an instantiation of the synaptic sampling scheme introduced in [21, 19]. It incorporates a policy sampling method which models the growth of dendritic spines with respect to dopamine influx. Unlike current state-of-the-art reinforcement learning methods implemented with conventional neural networks ([30, 29, 28]), SPORE learns online from precise spike times and is entirely implemented with spiking neurons. We evaluate this learning rule in a closed-loop reaching and a lane following ([4, 18]) setup.

In both tasks, an end-to-end visuomotor policy is learned, mapping visual input to motor commands. In recent years, important progress has been made on learning control from visual input with deep learning. However, deep learning approaches are computationally expensive and rely on biologically implausible mechanisms such as dense synchronous communication and batch learning. For networks of spiking neurons, learning visuomotor tasks online with synaptic plasticity rules remains challenging. In this paper, visual input is encoded in Address Event Representation with a Dynamic Vision Sensor (DVS) simulation ([27, 18]). This representation drastically reduces the redundancy of the visual input, as only motion is sensed, allowing more efficient learning. It agrees with the two-pathways hypothesis, which states that motion is processed separately from color and shape in the visual cortex ([25]).

The main contribution of this paper is the embodiment of SPORE and its evaluation on two neurorobotic tasks using a combination of open-source software components. This embodiment allowed us to identify crucial techniques to regulate the learning dynamics of SPORE, which were not discussed in previous works, where this learning rule was only evaluated on simple proof-of-concept learning problems ([21, 19, 22, 45]). Our results suggest that an external mechanism such as learning rate annealing is beneficial for retaining a performing policy on the more advanced lane following task.

This paper is structured as follows. We provide a review of the related work in Section 2. In Section 3, we give a brief overview of SPORE and discuss the contributed techniques required for its embodiment. The implementation and evaluation on the two chosen neurorobotic tasks are carried out in Section 4. Finally, we discuss in Section 5 how the method could be improved.

2 Related Work

The year 2015 marked a significant breakthrough in deep reinforcement learning. Artificial neural networks of analog neurons are now capable of solving a variety of tasks, ranging from playing video games ([30]) to controlling multi-joint robots ([39, 28]) and lane following ([44]). Most recent methods ([39, 38, 28, 29]) are based on policy gradients. Specifically, policy parameters are updated by performing ascending gradient steps with backpropagation to maximize the probability of taking rewarding actions. While functional, these methods are not based on biologically plausible processes. First, a large part of the neural dynamics is ignored. Importantly, unlike SPORE, these methods do not learn online – weight updates are performed with respect to entire trajectories stored in rollout memory. Second, learning is based on backpropagation, which is not a biologically plausible learning mechanism, as stated in [3].

Spiking network models inspired by deep reinforcement learning techniques were introduced in [40] and [2]. In both papers, the spiking networks are implemented with deep learning frameworks (PyTorch and TensorFlow, respectively) and rely on automatic differentiation. Their policy-gradient approach is based on Proximal Policy Optimization (PPO) ([39]). As the learning mechanism consists of backpropagating the PPO loss (through time in the case of [2]), most biological constraints stated in [3] are still violated. Indeed, the computations are based on spikes (4), but the backpropagation is purely linear (1), the feedback paths require precise knowledge of the derivatives (2) and weights (3) of the corresponding feedforward paths, and the feedforward and feedback phases alternate synchronously (5) (the enumeration refers to [3]).

Only a small body of work has focused on reinforcement learning with spiking neural networks while addressing the previous points. Groundwork on reinforcement learning with spiking networks was presented in [16, 10, 26]. In these works, a mathematical formalization is introduced characterizing how dopamine-modulated spike-timing-dependent plasticity (DA-STDP) solves the distal reward problem with eligibility traces. Specifically, since the reward is received only after a rewarding action is performed, the brain needs a form of memory to reinforce previously chosen actions. This problem is solved with the introduction of eligibility traces, which assign credit to recently active synapses. This concept has been observed in the brain ([11, 34]), and SPORE also relies on eligibility traces. Fewer works have evaluated DA-STDP in an embodiment for reward maximization – a recent survey encompassing this topic is available in [5].

The closest previous works related to this paper are [18, 4] and [6]. In [18], a neurorobotic lane following task is presented, where a simulated vehicle is controlled end-to-end from event-based vision to motor commands. The task is solved with a hard-coded spiking network of 16 neurons implementing a simple Braitenberg vehicle. The performance is evaluated with respect to the distance and orientation differences to the middle of the lane. In this paper, these performance metrics are combined into a reward signal which the spiking network maximizes with the SPORE learning rule.

In [4], the authors evaluate DA-STDP (referred to as R-STDP for reward-modulated STDP) in a similar lane following environment. Their approach outperforms the hard-coded Braitenberg vehicle presented in [18]. The two motor neurons controlling the steering receive different (mirrored) reward signals depending on whether the vehicle is on the left or on the right of the lane. This way, the reward indicates which motor command should be taken, similar to a supervised learning setup. Conversely, the approach presented in this paper is more generic, since a global reward is distributed to all synapses and does not indicate which action the agent should take.

A similar plasticity rule implementing a policy-gradient approach is derived in [6]. Also relying on eligibility traces, this reward-learning rule uses a “slow” noise term to drive the exploration. This rule is demonstrated on a target reaching task comparable to the one discussed in Section 4.1.1 and achieves impressive learning times (on the order of 100 s) with proper tuning of the noise term.

In [31], a spiking version of the free-energy-based reinforcement learning framework proposed in [33] is introduced. In this framework, a spiking Restricted Boltzmann Machine (RBM) is trained with a reward-modulated plasticity rule which decreases the free energy of rewarding state-action pairs. The approach is evaluated on discrete-action tasks where the observations consist of MNIST digits processed by a pre-trained feature extractor. However, some characteristics of RBMs are biologically implausible and make their implementation cumbersome: symmetric synapses and clocked network activity. With our approach, network activity does not have to be manually synchronized into observation and action phases of arbitrary duration for learning to take place.

In [13], a supervised synaptic learning rule named Feedback-based Online Local Learning Of Weights (FOLLOW) is introduced. This rule is used to learn the inverse dynamics of a two-link arm – the model predicts control commands (torques) for a given arm trajectory. The loop is closed in [14] by feeding the predicted torques back as control commands. In contrast, SPORE learns from a reward signal and can solve a variety of tasks.

3 Method

In this section, we give a brief overview of the reward-based learning rule SPORE. We then discuss how SPORE was embodied in a closed loop, along with our modifications to increase the robustness of the learned policy.

3.1 Synaptic Plasticity with Online Reinforcement Learning (SPORE)

Throughout our experiments we use an implementation of the reward-based online learning rule for spiking neural networks named synaptic sampling, which was introduced in [21]. The learning rule employs synaptic updates that are modulated by a global reward signal to maximize the expected reward. More precisely, the learning rule does not converge to a local maximum of the expected reward with respect to the synaptic parameter vector $\boldsymbol\theta$, but continuously samples different solutions from a target distribution that peaks at parameter vectors which likely yield high reward. A temperature parameter $T$ allows the distribution to be made flatter (high exploration) or more peaked (high exploitation).

SPORE ([20]) is an implementation of the reward-based synaptic sampling rule [21] that uses the NEST neural simulator ([12]). SPORE is optimized for closed-loop applications and forms an online policy-gradient approach. We briefly review the main features of the synaptic sampling algorithm here.

We consider the goal of reinforcement learning to maximize the expected future discounted reward

$$\mathcal{V}(\boldsymbol\theta) \;=\; \mathbb{E}_{p(\boldsymbol{r} \mid \boldsymbol\theta)}\!\left[ \int_0^\infty e^{-\frac{\tau}{\tau_e}}\, r(\tau)\, d\tau \right] \qquad (1)$$

where $r(\tau)$ denotes the reward at time $\tau$ and $\tau_e$ is a time constant that discounts remote rewards. We consider non-negative rewards $r(\tau) \geq 0$ at any time, such that $\mathcal{V}(\boldsymbol\theta) \geq 0$ for all $\boldsymbol\theta$. The distribution $p(\boldsymbol{r} \mid \boldsymbol\theta)$ denotes the probability of observing the sequence of rewards $\boldsymbol{r}$ under a given parameter vector $\boldsymbol\theta$. Note that computing this expectation involves averaging over a number of experimental trials and network responses.

As proposed in [21] we replace the standard goal of reinforcement learning, to maximize the objective function in Equation 1, by a probabilistic framework that generates samples of the parameter vector $\boldsymbol\theta$ according to some target distribution $p^*(\boldsymbol\theta)$. We focus on sampling from a target distribution of the form

$$p^*(\boldsymbol\theta) \;\propto\; p_S(\boldsymbol\theta) \times \mathcal{V}(\boldsymbol\theta) \qquad (2)$$

where $p_S(\boldsymbol\theta)$ is a prior distribution over the network parameters that allows us, for example, to introduce constraints on the sparsity of the network parameters. It has been shown in [21] that the learning goal in Equation 2 is achieved if all synaptic parameters $\theta_i$ obey the stochastic differential equation

$$d\theta_i \;=\; \beta \left( \frac{\partial}{\partial \theta_i} \log p_S(\boldsymbol\theta) \;+\; \frac{\partial}{\partial \theta_i} \log \mathcal{V}(\boldsymbol\theta) \right) dt \;+\; \sqrt{2\beta T}\; d\mathcal{W}_i \qquad (3)$$

where $\beta$ is a scaling parameter that functions as a learning rate, $d\mathcal{W}_i$ are the stochastic increments and decrements of a Wiener process, and $T$ is the temperature parameter. $\frac{\partial}{\partial \theta_i}$ denotes the partial derivative with respect to the synaptic parameter $\theta_i$. The stochastic process in Equation 3 generates samples of $\boldsymbol\theta$ that are with high probability close to the local optima of the target distribution $p^*(\boldsymbol\theta)$.
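To build intuition for the roles of $\beta$ and $T$ in Equation 3, the following sketch applies Euler–Maruyama integration to a toy problem where the prior and the value function are Gaussians with known gradients. This is purely illustrative: the closed-form gradients stand in for the local estimates that SPORE computes from spikes and rewards, and all numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_prior, sigma_prior = 0.0, 2.0   # Gaussian prior p_S(theta)
theta_star = 1.5                   # parameter value maximizing the toy value function
beta, T = 0.05, 0.5                # learning rate and temperature of Eq. (3)
dt = 0.01                          # Euler-Maruyama time step

def grad_log_prior(theta):
    return -(theta - mu_prior) / sigma_prior**2

def grad_log_value(theta):
    # Toy value function V(theta) = exp(-(theta - theta_star)^2 / 2)
    return -(theta - theta_star)

theta = 0.0
samples = []
for _ in range(200_000):
    drift = beta * (grad_log_prior(theta) + grad_log_value(theta)) * dt
    diffusion = np.sqrt(2.0 * beta * T * dt) * rng.standard_normal()
    theta += drift + diffusion
    samples.append(theta)

# The trajectory keeps wandering (it samples rather than converges),
# concentrating near a compromise between the prior mean and theta_star.
# Lower T -> tighter concentration (exploitation); higher T -> flatter (exploration).
print(np.mean(samples), np.std(samples))
```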

It has been further shown in [21] that Equation 3 can be implemented using a synapse model with local update rules. The state of each synapse $i$ consists of the dynamic variables $y_i(t)$, $e_i(t)$, $g_i(t)$ and $\theta_i(t)$. The variable $y_i(t)$ is the pre-synaptic spike train filtered with a postsynaptic-potential kernel. $e_i(t)$ is the eligibility trace that maintains a brief history of pre-/post-synaptic neural activity. $g_i(t)$ is a variable that estimates the reward gradient, i.e. the gradient of the objective function in Equation 1 with respect to the synaptic parameter $\theta_i$. $w_i(t)$ denotes the weight of synapse $i$ at time $t$. In addition, each synapse has access to the global reward signal $r(t)$. The variables $e_i(t)$, $g_i(t)$ and $\theta_i(t)$ are updated by solving the differential equations:

$$\frac{d}{dt}\, e_i(t) \;=\; -\frac{1}{\tau_e}\, e_i(t) \;+\; w_i(t)\, y_i(t)\, \big(S_{post_i}(t) - \rho_{post_i}(t)\big) \qquad (4)$$

$$\frac{d}{dt}\, g_i(t) \;=\; -\frac{1}{\tau_g}\, g_i(t) \;+\; r(t)\, e_i(t) \qquad (5)$$

$$d\theta_i(t) \;=\; \beta\, \big( c_p\, (\mu - \theta_i(t)) \;+\; c_g\, g_i(t) \big)\, dt \;+\; \sqrt{2\beta T}\; d\mathcal{W}_i \qquad (6)$$

where $S_{post_i}(t)$ is a sum of Dirac delta pulses placed at the firing times of the post-synaptic neuron, $\mu$ is the prior mean of the synaptic parameters ($p_S(\boldsymbol\theta)$ in Eq. (2)) and $\rho_{post_i}(t)$ is the instantaneous firing rate of the post-synaptic neuron at time $t$. The constants $c_p$ and $c_g$ are tuning parameters of the algorithm that scale the influence of the prior distribution against the influence of the reward-modulated term. Setting $c_p = 0$ corresponds to a non-informative (flat) prior. In general, the prior distribution is modeled as a Gaussian centered around $\mu$. The variance of the reward gradient estimation (Equation 5) could be reduced by subtracting a baseline from the reward, as introduced in [43], although this was not investigated in this paper.

Finally, the synaptic weights are given by the projection

$$w_i(t) \;=\; w_0\, \exp\big(\theta_i(t) - \theta_0\big) \qquad (7)$$

with scaling and offset parameters $w_0$ and $\theta_0$, respectively.

In SPORE, the differential Equations 4 to 6 are solved using the Euler method with a time step of 1 ms. The dynamics of the postsynaptic term $y_i(t)$, the eligibility trace $e_i(t)$ and the reward gradient $g_i(t)$ are updated at each time step. The dynamics of $\theta_i(t)$ and $w_i(t)$ are updated on a coarser time grid with a step width of 100 ms for the sake of simulation speed. The synaptic weights remain constant between two updates. Synaptic parameters are clipped at $\theta_{min}$ and $\theta_{max}$, and parameter gradients are clipped as well. The parameters used in our evaluation are stated in Tables 1 to 3.
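This discretization can be summarized in a few lines of numpy. The sketch below performs Euler steps of the trace dynamics (Equations 4 and 5) every 1 ms and of the parameter dynamics (Equation 6) every 100 ms, followed by the weight projection of Equation 7 and clipping. All constants and the stand-in spike statistics are illustrative placeholders, not the values or internals of the actual NEST implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretization and (placeholder) constants
dt, dt_theta = 1e-3, 0.1          # 1 ms trace step, 100 ms parameter step
tau_e, tau_g = 0.5, 50.0          # eligibility / gradient time constants (illustrative)
beta, T = 1e-4, 0.1               # learning rate and temperature
c_p, c_g, mu = 0.0, 1.0, 0.0      # flat prior (c_p = 0)
w0, theta0 = 1.0, 3.0             # weight mapping parameters (illustrative)
theta_min, theta_max = -2.0, 5.0

n_syn = 288 * 8                   # e.g. input neurons x motor neurons
theta = rng.normal(0.0, 0.5, n_syn)
e = np.zeros(n_syn)               # eligibility traces
g = np.zeros(n_syn)               # reward gradient estimates
w = w0 * np.exp(theta - theta0)   # Eq. (7): parameter-to-weight projection

steps_per_update = round(dt_theta / dt)
for step in range(10 * steps_per_update):       # 1 s of simulated time
    # Placeholder signals standing in for the network simulation
    y = rng.poisson(0.05, n_syn)                # filtered pre-synaptic activity
    S_post = rng.poisson(0.02, n_syn)           # post-synaptic spikes
    rho_post = np.full(n_syn, 0.02)             # instantaneous post-synaptic rate
    r = 0.5                                     # global reward signal

    # Eq. (4) and (5): fast trace dynamics, Euler step of 1 ms
    e += dt * (-e / tau_e + w * y * (S_post - rho_post))
    g += dt * (-g / tau_g + r * e)

    # Eq. (6) and (7): slow parameter dynamics, every 100 ms
    if (step + 1) % steps_per_update == 0:
        noise = np.sqrt(2.0 * beta * T * dt_theta) * rng.standard_normal(n_syn)
        theta += beta * (c_p * (mu - theta) + c_g * g) * dt_theta + noise
        theta = np.clip(theta, theta_min, theta_max)
        w = w0 * np.exp(theta - theta0)
```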

3.2 Closed-Loop Embodiment Implementation

Usually, synaptic learning rules are solely evaluated on open-loop pattern classification tasks [46, 32, 35, 41]. An embodied evaluation is technically more involved and requires a closed-loop environment simulation. A core contribution of this paper is the implementation of a framework allowing the validity of biologically plausible plasticity models to be evaluated in closed-loop robotics environments. We rely on this framework to evaluate the synaptic sampling rule SPORE ([20]), as depicted in Figure 1. This framework is tailored for evaluating spiking network learning rules in an embodiment. Visual sensory input is sensed, encoded as spikes, processed by the network, and output spikes are converted to motor commands. The motor commands are executed by the agent, which modifies the environment. This modification of the environment is sensed by the agent. Additionally, a continuous reward signal is emitted by the environment. SPORE tries to maximize this reward signal online by steering the ongoing synaptic plasticity processes of the network towards configurations which are expected to yield more overall reward. Unlike in a classical reinforcement learning setup, the spiking network is treated as a dynamical system continuously receiving input and outputting motor commands. This allows us to report learning progress with respect to (biological) simulated time, unlike classical reinforcement learning, which reports learning progress in numbers of iterations. Similarly, we reset the agent only when the task is completed (in the reaching task) or when the agent goes off-track (in the lane following task). We do not enforce finite-time episodes, and neither the agent nor SPORE is notified of the reset.
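Conceptually, the closed loop reduces to the skeleton below. The adapter objects (`env`, `encoder`, `network`, `decoder`) are hypothetical placeholders for the roles that Gazebo/ROS, the DVS plugin, NEST/SPORE and the ROS-MUSIC adapters play in the actual framework; only the structure of the loop is meant to reflect the paper.

```python
def closed_loop(env, encoder, network, decoder, sim_dt=0.02, duration=3600.0):
    """Generic closed-loop evaluation loop in the spirit of Figure 1.

    env, encoder, network and decoder are hypothetical adapter objects;
    the real framework delegates these roles to Gazebo/ROS, the DVS plugin,
    NEST/SPORE and the ROS-MUSIC adapters, respectively.
    """
    t = 0.0
    while t < duration:
        events, reward = env.observe()            # DVS address events + reward signal
        in_spikes = encoder.to_spikes(events)     # address events -> input spike trains
        out_spikes = network.step(in_spikes, reward, sim_dt)  # SPORE plasticity runs online
        command = decoder.to_command(out_spikes)  # output spikes -> motor command
        env.apply(command, sim_dt)                # act; the environment resets itself
        t += sim_dt                               # report progress in simulated time
```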

This framework relies on several open-source software components: as the neural simulator we use NEST ([12]) combined with the open-source implementation of SPORE ([21], https://github.com/IGITUGraz/spore-nest-module). The robotic simulation is managed by Gazebo ([24]) and ROS ([36]), and visual perception is realized using the open-source DVS plugin for Gazebo ([18], https://github.com/HBPNeurorobotics/gazebo_dvs_plugin). This plugin emits polarized address events when variations in pixel intensity cross a threshold. The robotic simulator and the neural network run in different processes. We rely on MUSIC ([7, 8]) to communicate and transform the spikes, and we employ the ROS-MUSIC tool-chain by [42] to bridge between the two communication frameworks. The latter also synchronizes ROS time with spiking network time. Most of these components are also integrated in the Neurorobotics Platform (NRP) [9], except for MUSIC and the ROS-MUSIC tool-chain. Therefore, the NRP does not support streaming a reward signal to all synapses, as required in our experiments.

As part of this work, we contributed to the Gazebo DVS plugin by integrating it with ROS-MUSIC, and to the SPORE module by integrating it with MUSIC. These contributions enable researchers to design new ROS-MUSIC experiments using event-based vision to evaluate SPORE or their own biologically plausible learning rules. A clear advantage of this framework is that the robotic simulation can be seamlessly substituted with a real robot. However, the human supervision required for real robots, coupled with the many hours SPORE needs to learn a performing policy, is currently prohibitive. The simulation of the whole framework was conducted in real time on a quad-core Intel Core i7-4790K with 16 GB RAM.

Figure 1: Implementation of the embodied closed-loop evaluation of the reward-based learning rule SPORE. Left: our asynchronous framework based on open-source software components. The spiking network is implemented with the NEST neural simulator ([12]), which communicates spikes with MUSIC ([7, 8]). The reward is streamed to all synapses in the spiking network learning with SPORE ([20]). Spikes are encoded from address events and decoded to motor commands with ROS-MUSIC tool-chain adapters ([42]). Address events are emitted by the DVS plugin ([18]) within the simulated robotic environment Gazebo ([24]), which communicates with ROS ([36]). Right: Encoding visual information to spikes for the lane following experiment, see Section 4.1.2 for more information. Address events (red and blue pixels on the rendered image) are downscaled and fed to visual neurons as spikes.

3.3 Learning Rate Annealing

In the original works presenting SPORE ([21, 19, 22, 45]), the learning rate and the temperature were kept constant throughout the learning process. Note that in deep learning, learning rates are often regulated by the optimization process ([23]). We found that the learning rate of SPORE plays an important role in learning and benefits from an annealing mechanism. This regulation allows the synaptic weights to converge to a stable configuration and prevents the network from forgetting previous policy improvements. For the lane following experiment presented in this paper, the learning rate is decreased over time, which also reduces the temperature (random exploration), see Equation 3. Specifically, we decay the learning rate exponentially with respect to time:

$$\beta(t) \;=\; \beta(0)\, e^{-\lambda t} \qquad (8)$$

where $\lambda$ is the decay constant. The learning rate is updated following this equation every 10 minutes. Independently decaying the temperature term was not investigated; however, we expect a minor impact on performance, because the high variance of the reward gradient estimation intrinsically leads the agent to explore.
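A minimal sketch of the annealing schedule of Equation 8 follows, with the learning rate held constant between the 10-minute updates; the initial learning rate and the decay constant below are illustrative values, not the ones from Table 2.

```python
import numpy as np

beta_0 = 1e-4              # initial learning rate (illustrative value)
decay = 1e-4               # decay constant lambda in 1/s (illustrative value)
update_interval = 600.0    # the learning rate is updated every 10 minutes

def learning_rate(t_seconds):
    """Exponentially annealed learning rate of Equation 8,
    held piecewise constant between 10-minute updates."""
    t_quantized = np.floor(t_seconds / update_interval) * update_interval
    return beta_0 * np.exp(-decay * t_quantized)

# After a few hours the rate has dropped to a fraction of its initial value,
# so later synaptic updates perturb the policy much less.
for hours in (0, 1, 3, 5):
    print(hours, learning_rate(hours * 3600.0))
```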

4 Evaluation

We evaluate our approach on two neurorobotic tasks: a reaching task and the lane following task presented in [18, 4]. In the following sections, we describe these tasks and the ability of SPORE to solve them. Additionally, we analyze the performance and stability of the learned policies with respect to the prior distribution $p_S(\boldsymbol\theta)$ and the learning rate $\beta$, see Equation 3.

4.1 Experimental Setup

Figure 2: Visualization of the setup for the two experiments. Left: reaching experiment. The goal of the task is to steer the ball to the center of the plane. Visual input is provided by a DVS simulation above the plane looking downward. The ball is controlled with Cartesian velocity vectors. Right: lane following experiment. The goal of the task is to keep the vehicle on the right lane of the road. Visual input is provided by a DVS simulation attached to the vehicle, looking forward at the road. The vehicle is controlled with steering angles.

The tasks used for our evaluation are depicted in Figure 2. In both tasks, a feedforward, all-to-all connected, two-layer network of spiking neurons is trained with SPORE to maximize a task-specific reward. Previous work has shown that this architecture is sufficient for the task complexity considered [18, 4, 6]. The network operates end-to-end and maps the address events of a simulated DVS to motor commands. The parameters used for the evaluation are presented in Tables 1 to 3. In the next paragraphs, we describe the tasks together with their decoding schemes and reward functions.

4.1.1 Reaching Task

The reaching task is a natural extension of the open-loop blind reaching task on which SPORE was evaluated in [45]. A similar visual tracking task was presented in [6], with a different visual input encoding. In our setup, the agent controls a ball of 2 m radius which has to move towards the 2 m radius center of a 20 m x 20 m plane enclosed by walls. Sensory input is provided by a simulated DVS with a resolution of 16x16 pixels located above the center, which perceives the ball and the entire plane. There is one visual neuron corresponding to each DVS pixel – we make no distinction between ON and OFF events. We additionally enhance the input space with an axis feature neuron for each row and each column. These neurons fire for every spike in the respective row or column of visual neurons they cover. Both the 16x16 visual neurons and the 2x16 axis feature neurons are connected to all 8 motor neurons with 10 plastic SPORE synapses each, resulting in 23040 learnable parameters. The network controls the ball with instantaneous velocity vectors through the Gazebo Planar Move Plugin. Velocity vectors are decoded from output spikes with the linear decoder:

$$\boldsymbol{v}(t) \;=\; \sum_{k=1}^{N} a_k(t) \begin{pmatrix} \cos\left(\frac{2\pi k}{N}\right) \\ \sin\left(\frac{2\pi k}{N}\right) \end{pmatrix} \qquad (9)$$

with $a_k(t)$ the activity of motor neuron $k$, obtained by applying a low-pass filter with time constant $\tau$ to its spikes. This decoding scheme equally distributes the motor neurons on a circle representing their contribution to the displacement vector. For our experiment, we use $N = 8$ motor neurons. We add an additional exploration neuron to the network, which excites the motor neurons and is inhibited by the visual neurons. This neuron prevents long periods of immobility. Indeed, when the agent decides to stay motionless, it does not receive any sensory input, as the DVS simulation only senses change. Since the network is feedforward, the absence of sensory input causes the neural activity to drop, leading to more immobility.
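The decoding of Equation 9 can be sketched as follows; the filter time constant, the decoding step and the spike counts are illustrative, and the unit directions assigned to the motor neurons are one possible arrangement on the circle.

```python
import numpy as np

N = 8                      # number of motor neurons on the circle
tau = 0.5                  # low-pass filter time constant in s (illustrative)
dt = 0.02                  # decoding time step in s (illustrative)

# Unit direction assigned to each motor neuron, equally spaced on a circle
directions = np.stack([np.cos(2 * np.pi * np.arange(N) / N),
                       np.sin(2 * np.pi * np.arange(N) / N)], axis=1)

activity = np.zeros(N)     # low-pass filtered spike counts a_k(t)

def decode_velocity(spike_counts):
    """Update the filtered activities and return the planar velocity vector."""
    global activity
    activity += dt / tau * (spike_counts - activity)   # first-order low-pass filter
    return activity @ directions                        # weighted sum of unit vectors

# Example: motor neuron 0 fires most, so the ball is pushed along direction 0
print(decode_velocity(np.array([5, 1, 0, 0, 0, 0, 0, 1])))
```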

The ball is reset to a random position on the plane once it has reached the center. This reset is not signaled to the network – aside from the abrupt change in visual input – and does not mark the end of an episode. Let $\alpha$ denote the absolute value of the angle between the straight line to the goal and the direction taken by the ball. The agent is rewarded if the ball moves towards the goal at a sufficient velocity. Specifically, the reward is computed as:

(10)

This signal is smoothed with an exponential filter before being streamed to the agent. This formulation provides continuous feedback to the agent, unlike delivering a discrete terminal reward upon reaching the goal state. In our experiments, discrete terminal rewards did not suffice for the agent to learn performing policies in a reasonable amount of time. On the other hand, distal rewards are supported by SPORE through eligibility traces, as was demonstrated in [21, 45] for open-loop tasks with clearly delimited episodes. This suggests that additional mechanisms or hyperparameter tuning would be required for SPORE to learn from distal rewards online.
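Since the exact expression of Equation 10 is not reproduced above, the sketch below implements one plausible reading of the description: reward the ball only when it moves towards the goal faster than a minimum velocity, then smooth the signal with an exponential filter. The cosine shaping, the velocity threshold and the smoothing constant are assumptions, not the paper's formula.

```python
import numpy as np

v_min = 0.05        # velocity threshold (assumed value)
tau_r = 1.0         # reward smoothing time constant in s (assumed value)
dt = 0.02           # control time step in s (assumed value)
smoothed_reward = 0.0

def reaching_reward(ball_pos, ball_vel, goal=np.zeros(2)):
    """One plausible reading of Equation 10: reward motion towards the goal
    above a minimum speed, then smooth with an exponential filter."""
    global smoothed_reward
    to_goal = goal - ball_pos
    speed = np.linalg.norm(ball_vel)
    if speed > v_min and np.linalg.norm(to_goal) > 0:
        # alpha: angle between the direction taken and the straight line to the goal
        cos_alpha = ball_vel @ to_goal / (speed * np.linalg.norm(to_goal))
        raw = max(cos_alpha, 0.0)
    else:
        raw = 0.0
    smoothed_reward += dt / tau_r * (raw - smoothed_reward)
    return smoothed_reward

print(reaching_reward(np.array([3.0, 2.0]), np.array([-0.3, -0.2])))
```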

4.1.2 Lane following Task

The lane following task was already used to demonstrate spiking neural controllers in [18] and [4]. The goal of the task is to steer a vehicle so that it stays on the right lane of a track. Sensory input is provided by a simulated DVS with a resolution of 128x32 pixels, mounted on top of the vehicle and showing the track ahead. There are 16x4 visual neurons covering the pixels, each neuron being responsible for an 8x8 pixel window. Each visual neuron spikes at a rate correlated with the amount of events in its window, see Figure 1. The vehicle starts driving from a fixed starting point with a constant velocity on the right lane of the track. As soon as the vehicle leaves the track, it is reset to the starting point. As in the reaching task, this reset is not explicitly signaled to the network and does not mark the end of a learning episode.
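A sketch of the event-to-rate encoding described above: DVS address events are binned into 8x8 pixel windows, and each of the 16x4 visual neurons is driven at a rate proportional to the event count in its window. The gain factor and the event format are assumptions.

```python
import numpy as np

WIDTH, HEIGHT, WIN = 128, 32, 8      # DVS resolution and window size per visual neuron

def events_to_rates(events, gain=10.0):
    """Bin DVS address events into 8x8 windows and return one firing rate per
    visual neuron (shape 4x16). `events` is an iterable of (x, y, polarity)
    tuples collected over one update interval; `gain` (Hz per event) is assumed."""
    counts = np.zeros((HEIGHT // WIN, WIDTH // WIN))
    for x, y, _polarity in events:   # polarity is not distinguished
        counts[y // WIN, x // WIN] += 1
    return gain * counts

# Example: three events near the lower-left corner of the sensor
print(events_to_rates([(3, 30, 1), (5, 28, 0), (6, 31, 1)]))
```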

The network controls the steering angle of the vehicle, while its linear velocity is kept constant. The output layer is separated into two neural populations, and the steering command sent to the agent is derived from the difference of activity between these two populations. Specifically, steering commands are decoded from output spikes as a ratio between the following linear decoders:

(11)

The first population pulls the steering to one side, while the second population pulls the steering to the other side, with the motor neurons split equally between left and right. The steering command is obtained by discretizing the ratio into five possible commands: hard left (-30°), left (-15°), straight (0°), right (15°) and hard right (30°), using fixed decision boundaries on the ratio. This discretization is similar to the one used in [44]. It yielded better performance than directly using the ratio (multiplied by a scaling constant) as a continuous-space steering command, as in [18].
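A sketch of this decoding step is given below. The two activities stand for low-pass filtered spike counts of the left and right populations; the sign convention and the decision boundary values are assumptions (the boundaries used in the paper are not reproduced here).

```python
import numpy as np

STEERING_ANGLES = (-30.0, -15.0, 0.0, 15.0, 30.0)   # degrees: hard left ... hard right

def decode_steering(left_activity, right_activity, boundaries=(-0.5, -0.1, 0.1, 0.5)):
    """Decode a discrete steering command from the two output populations.

    left_activity / right_activity are filtered spike counts; `boundaries` are
    assumed decision thresholds on the population ratio."""
    total = left_activity + right_activity
    ratio = 0.0 if total == 0 else (right_activity - left_activity) / total
    index = int(np.digitize(ratio, boundaries))     # map the ratio to one of 5 bins
    return STEERING_ANGLES[index]

print(decode_steering(12.0, 3.0))   # left population dominates -> hard left (-30.0)
```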

The reward signal delivered to the vehicle is equivalent to the performance metric used in [18] to evaluate the policy. As in the reaching task, the reward depends on two terms – the angular error $\alpha$ and the distance error $d$. The angular error is the absolute value of the angle between the right lane and the vehicle. The distance error is the distance between the vehicle and the center of the right lane. The reward is computed as:

(12)

The constants are chosen so that the score is halved for every 0.1 m of distance error or 5° of angular error. Note that this reward function is bounded and less informative than the error used in [4]. In our case, the same reward is delivered to all synapses, and a particular reward value does not indicate whether the vehicle is on the left or on the right of the lane. The decay of the learning rate is given in Table 2.
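Equation 12 is not reproduced exactly; a simple sketch consistent with the description is a product of exponential decay terms whose constants are set so the reward halves per 0.1 m of distance error or 5° of angular error. The multiplicative form is an assumption.

```python
import numpy as np

D_HALF = 0.1                      # distance error halving the reward, in m
A_HALF = np.deg2rad(5.0)          # angular error halving the reward, in rad

def lane_reward(distance_error, angular_error):
    """Sketch of a bounded reward that halves per 0.1 m or 5 deg of error
    (multiplicative exponential form assumed, not taken from the paper)."""
    return 0.5 ** (abs(distance_error) / D_HALF) * 0.5 ** (abs(angular_error) / A_HALF)

print(lane_reward(0.0, 0.0))                  # perfectly centered -> 1.0
print(lane_reward(0.1, 0.0))                  # 0.1 m off-center   -> 0.5
print(lane_reward(0.1, np.deg2rad(5.0)))      # both errors        -> 0.25
```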

4.2 Results

Our results show that SPORE is capable of learning policies online for moderately difficult embodied tasks within a few simulated hours. We first discuss the results on the reaching task, where we evaluated the impact of the prior distribution. We then present the results on the lane following task, where the impact of the learning rate was evaluated.

4.2.1 Impact of Prior Distribution

For the reaching task, a flat prior yielded the policy with the highest performance, see Figure 3. In this case, the performance improves rapidly within a few hours of simulated time, and the ball reaches the center about 90 times every 250 s. Conversely, a strong prior forcing the synaptic weights to remain weak prevented performing policies from emerging. In this case, after 13 h of learning, the ball reaches the center only about 10 times on average every 250 s, a performance comparable to the random policy. Less constraining priors also affected the performance of the learned policies compared to the unconstrained case, but allowed learning to happen: the ball then reaches the center about 60 times on average every 250 s. Additionally, the number of retracting synapses increases over time – even in the flat prior case – reducing the computational overhead, which is important for a neuromorphic hardware implementation ([1]). Indeed, the number of weak synaptic weights (below 0.07) increased from 3329 initially to 7557 after 1 h of learning and to 14753 after 5 h of learning (out of 23040 synapses in total). In other words, only 36% of all synapses are active. The weight distribution for the less constraining prior is similar to the no-prior case. The strong prior prevented strong weights from forming, trading off performance. The same trend is observed for the lane following task, where only 33% of all synapses are active after 4 h of learning, see Figure 5.

Figure 3: Results for the reaching task. Left: comparison of the effect of different prior configurations on the overall learning performance. The results were averaged over 8 trials. The performance is measured as the rate at which the target is reached (the ball moves to the center and is reset to a random position). Right: development of the synaptic weights over the course of learning for two trials: no prior (top) and strong prior (bottom). In both cases, the number of weak synaptic weights (below 0.07) increases significantly over time.

The analysis of a single well-performing trial is depicted in Figure 4. The performance does not converge, but rather rises and drops while the network samples configurations. On initialization (b), the policy employs weak actions with random directions.

At the first local maximum (c), the vector directions have largely turned towards the grid center (see inner pixel colors). Additionally, the overall magnitude of the weights has largely increased, as could be expected from the weight histogram in Figure 3. In particular, patterns of single rows and columns emerge, due to the 2x16 axis feature neurons described in Section 4.1.1. One drawback of the axis feature neurons can be seen in the center column of pixels: the axis feature neuron responsible for this column learned to push the ball down, since the ball mostly visited the upper part of the grid. However, at the center, the correct direction to push the ball towards the center is flipped.

At (d), the performance has increased further. The policy at this second peak has grown even stronger for many pixels, which also point in the right direction. The pixels pointing in the wrong direction mostly have a low vector strength.

At (e), the performance has dropped to half its previous value. As we can see from the policy, the weights have grown even stronger. Some strong pixel vectors pointing towards each other have emerged, which can lead to the ball constantly moving up and down without receiving any reward.

After this valley, the performance rises slowly again, and at (f) the policy has reached the maximum performance of this trial. Around the whole grid, strong motion vectors push the ball towards the center, and the ball reaches the center at the highest rate of the trial.

Just before the end of the trial, the performance drops again (g). Most vectors still point in the right direction; however, their overall strength has largely decreased.

Figure 4: Policy development at selected points in time in a single trial. At the top, the performance over time for a single, well-performing trial is depicted. The red lines indicate the points in time for which the policies are shown in the bottom figures. Each policy plot consists of a 2D grid representing the DVS pixels. Every pixel contains a vector which indicates the motion corresponding to the contribution of an event emitted by this pixel. The magnitude of the contribution (vector strength) is indicated by the outer pixel area. The inner circle color represents the assessment of the vector direction (angular correctness).

4.2.2 Impact of Learning Rate

For the lane following experiment, we show that the learning rate plays an important role in retaining policy improvements. Specifically, when the learning rate remains constant over the course of learning, the policy does not improve compared to random, see Figure 5. In the random case, the vehicle remains on the right lane for about 10 seconds until triggering a reset. After about 3 h of learning, the learning rate has decreased to 40% of its initial value and the policy starts to improve. After 5 h of learning, the learning rate approaches 20% of its initial value and the performance improvements are retained. Indeed, while the weights are not frozen, the amplitude of subsequent synaptic updates is drastically reduced. In this case, the policy is significantly better than random and the vehicle remains on the right lane for about 60 s on average.

Figure 5: Results for the lane following task with a medium prior. Left: comparison of the effect of annealing on the overall learning performance. The results were averaged over 6 trials. Without annealing, performance improvements are not retained and the network does not learn to perform the task. With annealing, the learning rate decreases over time and performance improvements are retained. Right: development of the synaptic weights over the course of learning for a medium prior with annealing. The number of weak synaptic weights increases from 41 initially to 231 after 1 h of learning and to 342 after 4 h of learning (out of 512 synapses in total).

5 Conclusion

The endeavor to understand the brain spans multiple research fields. Collaborations allowing synaptic learning rules derived by theoretical neuroscientists to be evaluated in closed-loop embodiments are an important milestone of this endeavor. In this paper, we successfully implemented a framework enabling this evaluation by relying on open-source software components for spiking network simulation [12, 20], synchronization and communication [7, 8, 42, 36], and robotic simulation [24, 18]. The resulting framework is capable of learning online the control of simulated and real robots with a spiking network in a modular fashion. This framework was used to evaluate the reward-learning rule SPORE ([21, 19, 22, 45]) on two closed-loop visuomotor tasks. Overall, we have shown that SPORE is capable of learning shallow feedforward policies online for moderately difficult embodied tasks within a few simulated hours. This evaluation allowed us to characterize the influence of the prior distribution on the learned policy. Specifically, constraining priors deteriorate the performance of the learned policy but prevent strong synaptic weights from emerging, see Figure 3. Additionally, for the lane following experiment, we have shown how learning rate regulation enables policy improvements to be retained. Inspired by simulated annealing, we presented a simple method that decreases the learning rate over time. This method does not model a particular biological mechanism, but works well in practice. On the other hand, novelty is known to modulate plasticity through a number of mechanisms ([37, 15]). Therefore, a decrease in learning rate after familiarization with the task is reasonable.

On a functional scale, deep learning methods still outperform biologically plausible learning rules such as SPORE. For future work, the performance gap between SPORE and deep learning methods should be tackled by taking inspiration from deep learning techniques. Specifically, the online learning method inherent to SPORE is impacted by the high variance of the policy evaluation. This problem is alleviated in policy-gradient methods by introducing a critic trained to estimate the expected return of a given state. This expected return is used as a baseline, which reduces the variance of the policy evaluation. Decreasing the variance could also be achieved by considering action-space noise as in [6] instead of the parameter-space noise implemented by the Wiener process in Equation 3. Lastly, an automatic mechanism to regulate the learning rate would be beneficial for more complex tasks. Such a mechanism could be inspired by trust-region methods ([38]), which constrain weight updates to alter the policy little by little. These improvements should increase the performance of SPORE so that more complex tasks, such as multi-joint effector control and discrete terminal rewards – supported by design by the proposed framework – can be considered.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Author Contributions

All the authors participated in writing the paper. JK, MH, AK, JCVT and DK conceived the experiments and analyzed the data.

Funding

This research has received funding from the European Union’s Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 720270 (Human Brain Project SGA1) and No. 785907 (Human Brain Project SGA2), as well as a fellowship within the FITweltweit programme of the German Academic Exchange Service (DAAD) [MH]. In addition, this work was supported by the H2020-FETPROACT project Plan4Act (#732266) [DK].

Acknowledgments

The collaboration between the different institutes that led to the results reported in the present paper was carried out under CoDesign Project 5 (CDP5 – Biological Deep Learning) of the Human Brain Project.

Data Availability Statement

No datasets were generated for this study.

      time-step / resolution
      synapse update interval
      (reaching) exploration noise
      (reaching) noise to exploration exc.
      (reaching) visual to exploration inh.
      (reaching) exploration to motor exc.
Table 1: NEST Parameters
      visual to motor exc. (clipped)
      visual to motor mul.
      temperature ($T$)
      initial learning rate ($\beta(0)$)
      learning rate decay ($\lambda$)
      integration time
      max synaptic parameter ($\theta_{max}$)
      min synaptic parameter ($\theta_{min}$)
      (reaching) episode length
      (lane following) episode length
Table 2: SPORE Parameters
      MUSIC time-step
      DVS adapter time-step
      decoder time constant
Table 3: ROS-MUSIC Parameters

References

  • [1] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. arXiv preprint arXiv:1711.05136, 2017.
  • [2] Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. In Conference on Neural Information Processing Systems (NIPS), 03 2018.
  • [3] Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, Thomas Mesnard, and Zhouhan Lin. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.
  • [4] Zhenshan Bing, Claus Meschede, Kai Huang, Guang Chen, Florian Rohrbein, Mahmoud Akl, and Alois Knoll. End to end learning of spiking neural network based on r-stdp for a lane keeping vehicle. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
  • [5] Zhenshan Bing, Claus Meschede, Florian Röhrbein, Kai Huang, and Alois C. Knoll. A survey of robotics control based on learning-inspired spiking neural networks. Frontiers in Neurorobotics, 12:35, 2018.
  • [6] Emmanuel Daucé. A model of neuronal specialization using hebbian policy-gradient with “slow” noise. In International Conference on Artificial Neural Networks, pages 218–228. Springer, 2009.
  • [7] Mikael Djurfeldt, Johannes Hjorth, Jochen M. Eppler, Niraj Dudani, Moritz Helias, Tobias C. Potjans, Upinder S. Bhalla, Markus Diesmann, Jeanette Hellgren Kotaleski, and Örjan Ekeberg. Run-Time Interoperability Between Neuronal Network Simulators Based on the MUSIC Framework. Neuroinformatics, 8(1):43–60, mar 2010.
  • [8] Örjan Ekeberg and Mikael Djurfeldt. MUSIC – Multisimulation Coordinator: Request For Comments. 2008.
  • [9] Egidio Falotico, Lorenzo Vannucci, Alessandro Ambrosano, Ugo Albanese, Stefan Ulbrich, Juan Camilo Vasquez Tieck, Georg Hinkel, Jacques Kaiser, Igor Peric, Oliver Denninger, et al. Connecting artificial brains to robots in a comprehensive simulation framework: the neurorobotics platform. Frontiers in neurorobotics, 11:2, 2017.
  • [10] Răzvan V Florian. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19(6):1468–1502, 2007.
  • [11] Uwe Frey, Richard GM Morris, et al. Synaptic tagging and long-term potentiation. Nature, 385(6616):533–536, 1997.
  • [12] Marc-Oliver Gewaltig and Markus Diesmann. Nest (neural simulation tool). Scholarpedia, 2(4):1430, 2007.
  • [13] Aditya Gilra and Wulfram Gerstner. Predicting non-linear dynamics by stable local learning in a recurrent spiking neural network. Elife, 6:e28295, 2017.
  • [14] Aditya Gilra and Wulfram Gerstner. Non-linear motor control by local learning in spiking neural networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1773–1782, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
  • [15] Arif A Hamid, Jeffrey R Pettibone, Omar S Mabrouk, Vaughn L Hetrick, Robert Schmidt, Caitlin M Vander Weele, Robert T Kennedy, Brandon J Aragona, and Joshua D Berke. Mesolimbic dopamine signals the value of work. Nature neuroscience, 19(1):117, 2016.
  • [16] Eugene M Izhikevich. Solving the distal reward problem through linkage of stdp and dopamine signaling. Cerebral cortex, 17(10):2443–2452, 2007.
  • [17] Jacques Kaiser, Hesham Mostafa, and Emre Neftci. Synaptic plasticity dynamics for deep continuous local learning. arXiv preprint arXiv:1811.10766, 2018.
  • [18] Jacques Kaiser, J Camilo Vasquez Tieck, Christian Hubschneider, Peter Wolf, Michael Weber, Michael Hoff, Alexander Friedrich, Konrad Wojtasik, Arne Roennau, Ralf Kohlhaas, et al. Towards a framework for end-to-end control of a simulated vehicle with spiking neural networks. In 2016 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR), pages 127–134. IEEE, 2016.
  • [19] David Kappel, Stefan Habenschuss, Robert Legenstein, and Wolfgang Maass. Network Plasticity as Bayesian Inference. PLOS Computational Biology, 11(11):e1004485, nov 2015.
  • [20] David Kappel, Michael Hoff, and Anand Subramoney. IGITUGraz/spore-nest-module: SPORE version 2.14.0. Nov 2017.
  • [21] David Kappel, Robert Legenstein, Stefan Habenschuss, Michael Hsieh, and Wolfgang Maass. A Dynamic Connectome Supports the Emergence of Stable Computational Function of Neural Circuits through Reward-Based Learning. eneuro, pages ENEURO.0301–17.2018, apr 2018.
  • [22] David Kappel, Bernhard Nessler, and Wolfgang Maass. STDP Installs in Winner-Take-All Circuits an Online Approximation to Hidden Markov Model Learning. PLoS Computational Biology, 10(3):e1003511, mar 2014.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [24] Nathan Koenig and Andrew Howard. Design and use paradigms for gazebo, an open-source multi-robot simulator. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), volume 3, pages 2149–2154. IEEE, 2004.
  • [25] Norbert Kruger, Peter Janssen, Sinan Kalkan, Markus Lappe, Ales Leonardis, Justus Piater, Antonio J. Rodriguez-Sanchez, and Laurenz Wiskott. Deep hierarchies in the primate visual cortex: What can we learn for computer vision? IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1847–1871, 2013.
  • [26] Robert Legenstein, Dejan Pecevski, and Wolfgang Maass. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLOS Computational Biology, 4(10):1–27, 10 2008.
  • [27] Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008.
  • [28] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • [29] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • [30] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • [31] Takashi Nakano, Makoto Otsuka, Junichiro Yoshimoto, and Kenji Doya. A spiking neural network model of model-free reinforcement learning with high-dimensional sensory input and perceptual ambiguity. PloS one, 10(3):e0115620, 2015.
  • [32] Emre Neftci. Stochastic synapses as resource for efficient deep learning machines. In Electron Devices Meeting (IEDM), 2017 IEEE International, pages 11–1. IEEE, 2017.
  • [33] Makoto Otsuka, Junichiro Yoshimoto, and Kenji Doya. Free-energy-based reinforcement learning in a partially observable environment. In ESANN, 2010.
  • [34] Wei-Xing Pan, Robert Schmidt, Jeffery R Wickens, and Brian I Hyland. Dopamine cells respond to predicted events during classical conditioning: evidence for eligibility traces in the reward-learning network. Journal of Neuroscience, 25(26):6235–6242, 2005.
  • [35] Jean-Pascal Pfister, Taro Toyoizumi, David Barber, and Wulfram Gerstner. Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural computation, 18(6):1318–1348, 2006.
  • [36] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y Ng. Ros: an open-source robot operating system. In ICRA workshop on open source software, volume 3, page 5. Kobe, Japan, 2009.
  • [37] Mauricio Rangel-Gomez and Martijn Meeter. Neurotransmitters and novelty: a systematic review. Journal of psychopharmacology, 30(1):3–12, 2016.
  • [38] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • [39] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [40] Juan Camilo Vasquez Tieck, Marin Vlastelica Pogančić, Jacques Kaiser, Arne Roennau, Marc-Oliver Gewaltig, and Rüdiger Dillmann. Learning continuous muscle control for a multi-joint arm by extending proximal policy optimization with a liquid state machine. In International Conference on Artificial Neural Networks, pages 211–221. Springer, 2018.
  • [41] Robert Urbanczik and Walter Senn. Learning by the dendritic prediction of somatic spiking. Neuron, 81(3):521–528, 2014.
  • [42] Philipp Weidel, Mikael Djurfeldt, Renato C. Duarte, and Abigail Morrison. Closed Loop Interactions between Spiking Neural Network and Robotic Simulators Based on MUSIC and ROS. Frontiers in Neuroinformatics, 10(31):1–19, aug 2016.
  • [43] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • [44] P. Wolf, C. Hubschneider, M. Weber, A. Bauer, J. Härtl, F. Dürr, and J. M. Zöllner. Learning how to drive in a real world simulation with deep q-networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 244–250, June 2017.
  • [45] Zhaofei Yu, David Kappel, Robert Legenstein, Sen Song, Feng Chen, and Wolfgang Maass. Camkii activation supports reward-based neural network optimization through hamiltonian sampling. arXiv preprint arXiv:1606.00157, 2016.
  • [46] Friedemann Zenke and Surya Ganguli. Superspike: Supervised learning in multilayer spiking neural networks. Neural computation, 30(6):1514–1541, 2018.