Improved robustness of reinforcement learning policies upon conversion to spiking neuronal network platforms applied to ATARI games

by Devdhar Patel, et al.

Various implementations of Deep Reinforcement Learning (RL) have demonstrated excellent performance on tasks that can be solved by a trained policy, but they are not without drawbacks. Deep RL is highly sensitive to noisy and missing input and to adversarial attacks. To mitigate these deficiencies of deep RL solutions, we suggest involving spiking neural networks (SNNs). Previous work has shown that standard neural networks trained using supervised learning for image classification can be converted to SNNs with negligible deterioration in performance. In this paper, we convert Q-Learning ReLU-Networks (ReLU-N) trained using reinforcement learning into SNNs. We provide a proof of concept for the conversion of ReLU-N to SNN, demonstrating improved robustness to occlusion and better generalization than the original ReLU-N. Moreover, we show promising initial results with converting full-scale Deep Q-networks to SNNs, paving the way for future research.




1 Introduction

Recent advancements in deep reinforcement learning (RL) have achieved astonishing results, surpassing human performance on various ATARI games (Mnih et al., 2015; Hasselt et al., 2016; Wang et al., 2016). However, deep RL is susceptible to adversarial attacks, similarly to deep learning (Huang et al., 2017). The vulnerability to adversarial attacks is due to the fact that deep RL uses gradient descent to train the agent. Another consequence of the gradient descent algorithm is that the trained agent learns to focus on a few sensitive areas, and when these areas are occluded or perturbed, the performance of the RL agent deteriorates. Moreover, there is evidence that the policies learned by the networks in deep RL algorithms do not generalize well, and the performance of the agent deteriorates when it encounters a state that it has not seen before, even if it is similar to other states (Witty et al., 2018).

Biological systems tend to be very noisy by nature (Richardson & Gerstner, 2006; Stein et al., 2005), but they can still operate well even under harsh conditions that affect their internal state and input. Spiking Neural Networks (SNNs) are considered to be closer to biological neurons due to their event-based nature; they are often termed the third generation of neural networks (Maass, 1996). A spike is the quantification of the internal and external process of the neuron and is always equal to other spikes. Therefore, the individual neuron can serve as a small bottleneck that gives the ability to sustain low intermittent noise and not propagate the noise further. Moreover, spiking neurons as a group in a network can damp the noise even further due to their collective effect and their architectural connectivity (Hazan & Manevitz, 2012). However, SNNs are typically harder to train using backpropagation due to the non-differentiable nature of the spikes (Pfeiffer & Pfeil, 2018).

Much of the recent work with SNNs has focused on implementing methods similar to backpropagation (Huh & Sejnowski, 2018; Wu et al., 2018) or using biologically inspired learning rules like spike-timing-dependent plasticity (STDP) to train the network (Bengio et al., 2015; Diehl & Cook, 2015; Gilra & Gerstner, 2018; Ferré et al., 2018). One of the benefits of using SNNs is their potential to be more energy efficient and faster than rectified linear unit networks (ReLU-N), particularly so on dedicated neuromorphic hardware (Martí et al., 2016).

Using SNNs in an RL environment seems almost natural, since many animals learn to perform certain tasks using a variation of semi-supervised and reinforcement learning. Moreover, there is evidence that biological neurons also learn using evaluative feedback from neurotransmitters such as dopamine (Wang et al., 2018), e.g., in the postulated dopamine reward prediction-error signal (Schultz, 2016). However, since spiking neurons are fundamentally different from artificial neurons, it is not clear whether SNNs are as capable as ReLU-Ns in machine learning domains. This raises the questions: Do SNNs have the capability to represent the same functions as ReLU-Ns? To be more specific, can SNNs represent complex policies that can successfully play Atari games? If so, do they have any advantages in handling noisy inputs?

We answer these questions by demonstrating that ReLU-networks trained using existing reinforcement learning algorithms can be converted to SNNs with similar performance on the reinforcement learning task of playing the Atari game Breakout. Furthermore, we show that such converted SNNs are more robust than the original ReLU-Ns. Finally, we demonstrate that full-sized deep Q-networks (DQN) (Mnih et al., 2015) can also be converted to SNNs and maintain their better-than-human performance, paving the way for future research in robustness and RL with SNNs.

2 Background

2.1 Arcade learning environment

The Arcade learning environment (ALE) (Bellemare et al., 2013) is a platform that enables researchers to test their algorithms on over 50 Atari 2600 games. The agent sees the environment through image frames of the game, interacts with the environment with 18 possible actions, and receives feedback in the form of the change in the game score. The games were designed for humans and thus are free from experimenter bias. The games span many different genres that require the agent/algorithm to generalize well over various tasks, difficulty levels, and timescales. ALE thus has become a popular test-bed for reinforcement learning.

Figure 1: Screenshot of Atari 2600 Breakout game

Breakout: We demonstrate our results on the game of Breakout. Breakout is a game similar to the popular game Pong. The player controls a paddle at the bottom of the screen. There are rows of colored bricks on the upper part of the screen. A ball bounces between the bricks and the player-controlled paddle. If the ball hits a brick, the brick breaks and the score of the game is increased. However, if the ball falls below the paddle, the player loses a life. The game starts with five lives, and the player/agent is supposed to break all the bricks before running out of lives. Figure 1 shows a frame of the game.

2.2 Deep Q-Networks

Reinforcement learning algorithms train a policy $\pi$ to maximize the expected cumulative reward received over time. Formally, this process is modelled as a Markov decision process (MDP). Given a state-space $\mathcal{S}$ and an action-space $\mathcal{A}$, the agent starts in an initial state $s_0$ from a set of possible start states $\mathcal{S}_0 \subseteq \mathcal{S}$. At each time-step $t$, starting from $t = 0$, the agent takes an action $a_t$ to transition from $s_t$ to $s_{t+1}$. The probability of transitioning from state $s$ to state $s'$ by taking action $a$ is given by the transition function $P(s, a, s')$. The reward function $R(s, a)$ defines the expected reward received by the agent after taking action $a$ in state $s$.

A policy $\pi(a \mid s)$ is defined as the conditional distribution of actions given the state. The Q-value or action-value of a state-action pair for a given policy, $q_\pi(s, a)$, is the expected return following policy $\pi$ after taking the action $a$ from state $s$:

$$q_\pi(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s,\ a_t = a,\ \pi\right],$$

where $\gamma$ is the discount factor. The action-value function follows a Bellman equation that can be written as:

$$q_\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a').$$
Many widely used reinforcement learning algorithms first approximate the Q-value and then select the policy that maximizes the Q-value at each step to maximize returns (Sutton & Barto, 2018). Deep Q-networks (DQN) (Mnih et al., 2015) are one such algorithm that uses deep artificial neural networks to approximate the Q-value. The neural network can learn policies from only the pixels of the screen and the game score and has been shown to surpass human performance on many of the Atari 2600 games.
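As a minimal sketch of how a policy is read off approximate Q-values, the snippet below implements epsilon-greedy action selection and the one-step Q-learning target. The function names and the default discount factor of 0.99 are assumptions made for illustration; they are not taken from the paper's code.

```python
import random

def epsilon_greedy(q_values, epsilon=0.05, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    gamma=0.99 is an assumed default, not a value stated in the text.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)
```

Setting `epsilon=0.0` recovers the purely greedy policy, while `epsilon=0.05` corresponds to the 0.05 epsilon-greedy evaluation used later in the paper.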

Figure 2: Architecture of Deep Q-networks, following Mnih et al. (2015); ReLU nonlinear units are emphasized by red circles.

2.3 Spiking neurons

SNNs may use any of various neuron models (W. Gerstner, 2002; Tuckwell, 1988). For our experiments, we use four different variations of the spiking neuron. We use the notation below to describe the various neurons:

$\tau$ is the time constant. $V$ is the membrane potential voltage. $V_{rest}$ is the resting membrane potential. $V_{thresh}$ is the neuron threshold.

  1. Integrate-and-fire (IF) neuron: The IF neuron is the simplest form of spiking neuron model. The neuron simply integrates input until the membrane potential $V$ exceeds the voltage threshold $V_{thresh}$ and a spike is generated. Once the spike is generated, the membrane potential is reset to $V_{rest}$.

  2. Subtractive integrate-and-fire (SubIF) neuron: The SubIF neuron (Cassidy et al., 2013; Diehl et al., 2016; Rueckauer et al., 2017) behaves similarly to the IF neuron with one small change: when the membrane potential exceeds the threshold value, the neuron emits a spike and resets its membrane voltage to $V_{rest} + (V - V_{thresh})$. By adding the overshoot voltage, the neuron "remembers" the excess voltage from the last spike and is more easily excited by the next incoming inputs. This reduces the information lost when spiking in SNNs converted from ReLU-N.

  3. Leaky integrate-and-fire (LIF) neuron: The LIF neuron behaves similarly to the IF neuron. However, for every time-step that its membrane potential is above the resting potential, the neuron leaks a constant amount of current.

  4. Stochastic leaky integrate-and-fire neuron: The stochastic leaky integrate-and-fire neuron is based on the LIF neuron. However, the neuron may spike even if its membrane potential is below the threshold, with a probability that grows with its membrane potential (escape noise). The escape noise ($\sigma$) is described here:

$$\sigma = \frac{1}{\tau_\sigma} \exp\left(\beta_\sigma \left(V - V_{thresh}\right)\right),$$

where $\tau_\sigma$ and $\beta_\sigma$ are constant positive parameters. For this paper, we set both $\tau_\sigma$ and $\beta_\sigma$ to 1.

In all the spiking models listed above, the neuron normally enters a refractory period after every spike, during which it is unable to spike or integrate input. For this paper, we ignore the refractory period for simplicity in the conversion from artificial neurons. For a complete list of the parameters used for LIF and stochastic LIF, see supplementary materials, Table 2.
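The deterministic neuron variants above can be compared with a short discrete-time simulation. This is only a sketch under stated assumptions: a constant input current, a threshold of 1, a resting potential of 0, a spike when the potential reaches or exceeds the threshold, a hypothetical leak of 0.25 per step applied before the threshold check, and no refractory period. The stochastic variant is omitted.

```python
def simulate(neuron, current, steps, v_thresh=1.0, v_rest=0.0, leak=0.25):
    """Count spikes of one neuron driven by a constant input current.

    neuron is "IF", "SubIF", or "LIF". The leak value and the update order
    are illustrative assumptions, not parameters from the paper.
    """
    v = v_rest
    spikes = 0
    for _ in range(steps):
        v += current                          # integrate the input
        if neuron == "LIF" and v > v_rest:
            v = max(v_rest, v - leak)         # leak a constant amount toward rest
        if v >= v_thresh:
            spikes += 1
            if neuron == "SubIF":
                v = v_rest + (v - v_thresh)   # keep the overshoot voltage
            else:
                v = v_rest                    # hard reset discards the overshoot
    return spikes
```

With a constant current of 0.75, the SubIF neuron's firing rate equals the input (3 spikes every 4 steps), whereas the hard-reset IF neuron fires only every second step because the overshoot is thrown away. This is the information loss that the subtractive reset is meant to reduce.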

3 Related work

Much of the recent work has focused on ReLU-N to SNN conversion. Pérez-Carrasco et al. (2013) first introduced the idea of converting CNNs to spiking neurons with the aim of processing inputs from event-based sensors. Cao et al. (2015) suggested that the frequency of spikes of a spiking neuron is closely related to the activation of a rectified linear unit (ReLU) and reported good performance on computer vision benchmarks.

Diehl et al. (2015) proposed a method of weight normalization that re-scales the weights of the SNN to reduce the errors due to excessive or insufficient firing of neurons. They also showed near-lossless conversion of ReLU-Ns for the MNIST classification task. Rueckauer et al. (2016, 2017) demonstrated spiking equivalents of a variety of common operations used in deep convolutional networks, like max-pooling, softmax, batch-normalization and inception modules. This allowed them to convert popular CNN architectures like VGG-16 (Simonyan & Zisserman, 2014), Inception-V3 (Szegedy et al., 2016), BinaryNet (Courbariaux et al., 2016), etc. They achieved near-lossless conversion of these networks. To our knowledge, there has been no previous work on the conversion of Deep Q-networks into SNNs. Figure 3 shows the network in Figure 2 converted to a spiking neural network (SNN).

Figure 3: Network architecture following Mnih et al. (2015), after converting the ReLU nonlinearity to a spiking network.

4 Methods

We trained each network using the DQN algorithm (Mnih et al., 2015). We started by testing our methods on shallow ReLU-Ns with one hidden layer and then moved on to full-sized deep Q-networks with the same architecture as Mnih et al. (2015). We trained the smaller networks using a replay memory size of 200000 and an initial replay memory size of 50000. We trained the network over 30000 episodes. The rest of the hyper-parameters we used are the same as in Mnih et al. (2015). For a complete list of the hyper-parameters used, see supplementary materials, Table 1.

The trained ReLU-Ns are then converted to SNNs. For the converted SNN, the firing frequency of the spiking neurons in the output layer is proportional to the Q-value of the corresponding action.
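The rate-coded readout can be illustrated with a tiny two-layer network written in plain Python. This is a sketch, not the BindsNET implementation: it assumes subtractive-reset neurons with threshold 1, a constant input current at every step, no biases, and hypothetical toy weights. The action is chosen as the output neuron with the most spikes.

```python
def relu_forward(x, w1, w2):
    """Reference ReLU network: one hidden ReLU layer, linear output (Q-values)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def snn_forward(x, w1, w2, steps=500, v_thresh=1.0):
    """Same weights with spiking neurons; returns output spike counts."""
    drive_h = [sum(w * xi for w, xi in zip(row, x)) for row in w1]
    v_h = [0.0] * len(w1)
    v_o = [0.0] * len(w2)
    counts = [0] * len(w2)
    for _ in range(steps):
        spikes_h = []
        for j in range(len(w1)):
            v_h[j] += drive_h[j]              # constant input current
            fired = v_h[j] >= v_thresh
            if fired:
                v_h[j] -= v_thresh            # subtractive reset
            spikes_h.append(1 if fired else 0)
        for k, row in enumerate(w2):
            v_o[k] += sum(w * s for w, s in zip(row, spikes_h))
            if v_o[k] >= v_thresh:
                v_o[k] -= v_thresh
                counts[k] += 1
    return counts

def select_action(counts):
    """Greedy action: the output neuron that fired most often."""
    return max(range(len(counts)), key=lambda a: counts[a])
```

For inputs and weights that keep all activations between 0 and 1, the spike counts divided by the number of steps approximate the ReLU network's outputs, so the greedy action agrees between the two networks.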

We simulate spiking neurons using the PyTorch-based open-source library BindsNET (Hazan et al., 2018). Testing SNN-based agents in the ALE is a computationally heavy task. We use BindsNET as it allows users to leverage GPUs to simulate the SNN and speed up testing.

4.1 Network architecture

Typically, the network used to train on Atari games using the DQN algorithm consists of multiple convolutional layers followed by fully connected layers (Mnih et al., 2015). However, to reduce the complexity of the network and reduce the number of parameters, we choose a shallow fully connected network with one hidden layer for our initial experiments.

The ReLU-N consists of 80x80 pixel input followed by a fully connected hidden layer with 1000 ReLU neurons. The output layer is a fully connected layer with 4 neurons that give the estimate of the optimal action-value of each of the 4 possible actions in the Breakout game.

Figure 4: Network architecture: The input to the network consists of an 80x80 image produced by preprocessing the frames of the game. The hidden layer consists of 1000 neurons followed by the output layer. The size of the output layer is equal to the number of possible actions for the game.

Figure 4 shows the network architecture of the shallow SNN. The network architecture of the shallow SNN is similar to the shallow ReLU-N except that the neurons are replaced by spiking neurons and the ReLU non-linearity is removed.
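For a sense of scale, the parameter count of the shallow network can be worked out directly from the layer sizes given above. The sketch assumes one bias per neuron, which the text does not state; the helper name is hypothetical.

```python
def dense_params(n_in, n_out, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

n_inputs = 80 * 80                        # flattened 80x80 input image
hidden = dense_params(n_inputs, 1000)     # input -> 1000 hidden neurons
output = dense_params(1000, 4)            # hidden -> 4 action values
total = hidden + output
print(n_inputs, hidden, output, total)
```

Almost all parameters sit in the first layer, which is why the conversion only needs two weight scale factors, one per layer, rather than one per weight.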

4.2 Experiments

The ReLU-N can be converted to an SNN by replacing the ReLU neurons with spiking neurons. However, this straightforward conversion usually results in very little spiking activity in the network. Therefore, the network needs to run for a large number of time steps on a given input to generate enough meaningful activity for a good estimate of the Q-values. In order to expedite the process and increase the spiking activity, the weights need to be scaled up. Generally, the weights of deeper layers need to be scaled up more than the weights of shallower layers. We can treat the scales of the weights as parameters that need to be adjusted for a constant run-time on each input. All the weights of the same layer are scaled by the same factor, thus preserving the learned filters. Many different methods can be used to search for the weight scale parameters. While Rueckauer et al. (2017) showed one way of normalizing the weights, we also employed other methods such as grid search and particle swarm optimization (Clerc, 2012) to search for the optimal parameters. For our experiments, we run the SNN for 500 time-steps on each input.
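The effect of weight scaling, and a grid search over per-layer scale factors, can be sketched as follows. The neuron model (subtractive reset, threshold 1) and the `evaluate` callback are assumptions for illustration; in practice `evaluate` would run the scaled SNN in the game and return its average reward.

```python
from itertools import product

def subif_spike_count(drive, steps=500, v_thresh=1.0):
    """Spikes emitted by one subtractive-reset neuron under constant drive."""
    v, spikes = 0.0, 0
    for _ in range(steps):
        v += drive
        if v >= v_thresh:
            v -= v_thresh
            spikes += 1
    return spikes

def grid_search(evaluate, candidates, n_layers=2):
    """Try every combination of per-layer scales and keep the best one.

    evaluate is a hypothetical callback mapping a tuple of scales to a score.
    """
    best_scales, best_score = None, float("-inf")
    for scales in product(candidates, repeat=n_layers):
        score = evaluate(scales)
        if score > best_score:
            best_scales, best_score = scales, score
    return best_scales, best_score
```

A neuron driven by 1/1024 per step never reaches threshold within 500 steps, while the same neuron with its input scaled by 16 fires 7 times, which is why unscaled networks produce too little activity for a usable Q-value estimate in a fixed run-time.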

Binary input

The first part of our experiments uses binary pixel inputs. Each state consists of an 80x80 image of binary pixels. The frames from the Gym environment are pre-processed to create the state.

Each frame from the Gym environment is cropped to remove the text above the screen displaying the score and the number of lives left. The image is then resized to an 80x80 image and converted to a binary image. The previous frame is then subtracted from the current frame while clamping all the negative values to 0. We then add the most recent four such difference frames to create a state for the RL environment. Thus, a state is an 80x80 binary image containing the movement information of the last four frames. However, it is not possible to detect the direction of the movement from this image. This, we believe, restricts the performance of the agent on the game.
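The difference-and-sum step can be sketched on frames that are already cropped, resized and binarized (those steps would need an image library and are left out). Frames are plain nested lists here, and clamping the sum back to 1 is an assumption made so that the state stays binary.

```python
def difference_frame(current, previous):
    """Subtract the previous binary frame, clamping negative values to 0."""
    return [[max(0, c - p) for c, p in zip(crow, prow)]
            for crow, prow in zip(current, previous)]

def binary_state(frames):
    """Build a state from five consecutive binary frames (oldest first)."""
    diffs = [difference_frame(frames[i], frames[i - 1])
             for i in range(1, len(frames))]
    last_four = diffs[-4:]
    height, width = len(last_four[0]), len(last_four[0][0])
    # Sum the four difference frames; clamp to 1 (assumed) to keep it binary.
    return [[min(1, sum(d[r][c] for d in last_four)) for c in range(width)]
            for r in range(height)]
```

For a single pixel moving one step to the right per frame, the resulting state is a streak of ones. The same streak would result from leftward motion, which illustrates why the direction of movement cannot be recovered from this input.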

(a) 0.05 epsilon greedy binary input

(b) Greedy binary input

(c) 0.05 epsilon greedy grayscale input

(d) Greedy grayscale input
Figure 5: Performance of the networks for Binary and Grayscale inputs. Each plot shows the reward distribution over 100 episodes using 0.05 epsilon greedy policy.

Grayscale input

The binary input does not contain information about the direction of the ball movement, which we believe can confuse the agent. To alleviate this problem, we weighted each frame according to time and added them to create the state. The most recent frame has the highest weight, and the least recent frame has the lowest weight. At time $t$, the state is made up of the weighted sum of the most recent 4 frames as follows:

$$S_t = \sum_{i=0}^{3} w_i F_{t-i}, \qquad w_0 > w_1 > w_2 > w_3 > 0,$$

where $S_t$ and $F_t$ are the state and the frame at time $t$, respectively, and $w_i$ is the weight of the frame $i$ steps in the past.
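A minimal sketch of the time-weighted state follows. The specific weights (1, 0.75, 0.5, 0.25) are hypothetical and chosen only to satisfy the stated ordering, with the most recent frame weighted highest; frames are nested lists with the most recent first.

```python
def grayscale_state(frames, weights=(1.0, 0.75, 0.5, 0.25)):
    """Weighted sum of the four most recent frames (most recent first).

    The default weights are illustrative assumptions, not the paper's values.
    """
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(w * f[r][c] for w, f in zip(weights, frames))
             for c in range(width)]
            for r in range(height)]
```

For a pixel moving one step to the right per frame, the state becomes a ramp that is brightest at the current position, so the direction of motion can be read from the intensity gradient.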


Recent work has shown that deep Q-networks are vulnerable to white-box and black-box adversarial attacks (Huang et al., 2017). Witty et al. (2018) also showed that the policies learned by the DQN algorithm generalize poorly for the states of the game that the agent has not seen during training. To test the robustness of the SNN against the ReLU-N, we test the performance of each network when a 3-pixel thick horizontal bar of pixels spanning the entire width of the input is occluded. The thickness of the occlusion bar corresponds to the thickness of the paddle on the screen after preprocessing. We tested the performance for every possible position of the bar on the screen. The position of the occlusion bar does not change during the episode. This is a challenging task since the bar sometimes partially or wholly occludes the position of the ball or the paddle; however, it tests the robustness and generalization of the policies represented by both the networks.
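The occlusion test can be sketched as a function that blanks out a horizontal bar in the preprocessed frame. Row 0 is taken to be the top of the image and occluded pixels are set to 0; both conventions, and the function name, are assumptions.

```python
def occlude(frame, bottom_row, thickness=3, fill=0):
    """Return a copy of frame with a full-width horizontal bar occluded.

    bottom_row is the index of the bottom-most occluded row (row 0 = top).
    """
    top_row = bottom_row - thickness + 1
    return [[fill] * len(row) if top_row <= r <= bottom_row else list(row)
            for r, row in enumerate(frame)]
```

Because the bar position stays fixed for a whole episode, the same `bottom_row` would be applied to every state in that episode, with one experiment per bar position.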

5 Results

The experiments below show the results of 100 episodes on two different inputs (Binary or Grayscale) using two policies (greedy and 0.05 epsilon-greedy). We tested the shallow SNN using LIF neurons and stochastic LIF neurons. We refer to the SNN using LIF neurons as SNN and the SNN using stochastic LIF neurons as stochastic SNN.

Binary input

The results demonstrate that SNNs are capable of representing policies that perform even better than the ReLU-N they are converted from. Figure 5 shows the performance of the ReLU-N against the performance of the SNN and the stochastic SNN for binary input. We can see that the stochastic SNN performs better on average than the ReLU-N it is converted from. The optimal parameters for the binary-input spiking neural networks were found using grid search.

Grayscale input

The results show that the performance for the grayscale input is higher than for the binary input for both networks (SNN and ReLU-N), as shown in Figure 5. Table 1 summarizes the best performance for each method of input for each network. We also see that the standard deviation of the rewards gained by the SNN is lower and the behavior is less random than for the binary input, due to proper weight normalization. We employed particle swarm optimization (Clerc, 2012) to search for the optimal weight scaling parameters. The scale of each of the two layers was treated as a parameter; thus the dimension of the search space is 2. The swarm size was set to 13. The stochastic LIF network has a smoother surface of performance over the parameter space than the LIF network. This suggests that the stochastic LIF network is more robust to changes in the scaling of its weights. The escape noise of the stochastic LIF neuron can be tuned to improve the performance further; however, we leave that to future work.
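The search over the two layer scales can be sketched with a basic global-best particle swarm. This is a textbook variant and not necessarily the one described by Clerc (2012); the inertia and acceleration coefficients, the bounds, and the `objective` callback (which would return the average game reward for a pair of scales) are all assumptions.

```python
import random

def pso(objective, bounds, swarm_size=13, iterations=50,
        inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize objective over a box; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    best_pos = [p[:] for p in pos]                 # personal bests
    best_val = [objective(p) for p in pos]
    g = max(range(swarm_size), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]     # global best
    for _ in range(iterations):
        for i in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val > best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val > g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val
```

With a two-dimensional search space, each particle is one candidate pair of layer scales, and each objective call corresponds to evaluating the converted network with those scales.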

Input ReLU-N SNN Stochastic
0.05 Epsilon greedy
Table 1: Best performance achieved for different inputs and networks. Each value represents an average of 100 episodes.


(a) Pixel-wise robustness ReLU-N vs SNN
Figure 6: Performance of ReLU-N and SNN for the robustness test. The x-axis represents the position of the bottom-most occluded pixels of the 3-pixel thick horizontal occlusion bar. The y-axis represents the average reward. The standard deviation of the reward distribution is shown using the shaded region. The two critical areas are marked by the black bars A and B at the bottom of the plot. A shows the area near the paddle, while B marks the region of the screen occupied by the brick wall.

Figure 6 shows the performance of the ReLU-N and SNN for the robustness task. The x-axis represents the vertical position of the bottom-most occluded pixels. Thus, as we move from left to right on the plot, the 3-pixel thick occlusion bar moves from the bottom to the top of the screen. Figure 6 shows the result of 77 experiments, one for each possible position of the horizontal occlusion bar. Each experiment was run for 100 episodes using a 0.05 epsilon-greedy policy.

Our experiments on robustness show that SNNs are much more robust to occlusions of the input than the ReLU-N, even though they share the same weights. The ReLU-N trained using backpropagation is very sensitive to occlusions and perturbations at a few places in the input. When these areas are occluded, the ReLU-N performs poorly. One such area is near the bottom of the screen (marked in Figure 6 by A). Occlusion in this area results in a drastic decrease in the performance of the ReLU-N. This is understandable, as this area contains the paddle and also shows the position of the ball just before it hits the paddle or falls below the screen. Surprisingly, occluding the neighborhood of area A has a much smaller negative impact on the performance of the SNN than on that of the ReLU-N. Once the paddle is visible, we see that the SNN has no significant loss in performance.

Another sensitive area for the ReLU-N corresponds to the position of the brick wall, marked by B in Figure 6. We see that occluding some of the positions in this area results in a sharp drop in performance for the ReLU-N. This can be explained by the nature of the gradient descent updates. Since the score changes when the ball hits the bricks, and the backpropagation loss calculated using the TD-error is highest when the score changes, the filters of the network learn to discriminate these areas. Thus, when these areas are occluded, the performance drops. Interestingly, these sudden drops in performance are not observed in the SNN. This suggests that the SNN is more robust to occlusions in the input than the ReLU-N it is converted from. We also see that the SNN performs better than the ReLU-N in most of the experiments and has a lower standard deviation in the reward distribution.

For a detailed list of results for each position of the occlusion, see supplementary materials, Table 3.

6 Deep Q-networks

(a) DQN vs SNN
Figure 7: Performance of Deep Q-network vs. Deep Spiking Network. Each plot shows the reward distribution over 100 episodes using 0.05 epsilon greedy policy.

To demonstrate that our approach is applicable to state-of-the-art, large-scale networks, we trained the Deep Q-network (Mnih et al., 2015) and converted the weights to an SNN with a similar network architecture (see Figure 3). Since converting the DQN to an SNN requires a parameter search over a larger number of parameters, we used the parameter normalization method (Rueckauer et al., 2017). This approach shows reasonable performance, although its performance could clearly be improved using a systematic parameter optimization method. The deep Q-SNN was tested using the subtractive-IF neurons. We used the OpenAI baseline implementation of DQN to train the network (Dhariwal et al., 2017). We show that the DQN can be converted to a spiking Q-network without significant loss in performance; see Figure 7 for the full distribution of rewards using the two networks. At the present stage of the work, we have not conducted robustness tests for the trained networks. We leave a systematic robustness study and comparison to future work.

7 Conclusion

In this paper, we demonstrate that ReLU-Ns trained on the game Breakout can be converted to SNNs without degradation of performance. Moreover, we show that SNNs are more robust to occlusion attacks and can outperform traditional ReLU networks on reinforcement learning tasks. In some cases, SNNs perform better than ReLU-N on previously unseen states. These results, combined with other benefits of SNNs, such as energy efficiency on neuromorphic hardware, make SNNs an ideal framework for reinforcement learning tasks when resources are limited and the environment is noisy.

In summary:

  1. SNNs can perform reinforcement learning tasks like playing Atari games.

  2. SNNs can be trained on reinforcement learning tasks by conversion from trained ReLU-Ns.

  3. SNNs can outperform the ReLU-Ns from which they have been converted on reinforcement learning tasks, like playing Atari games.

  4. SNNs are robust to attacks and perturbations in the input. They have improved generalization on states that they have not encountered before.