## 1 Introduction

Most reinforcement learning (RL) algorithms fall into one of two categories: value-function-based and policy-based algorithms. The former, such as Q-learning qlearning, express the value function as a mapping from the state space to a real value that indicates how good it is for an agent to be in a state. The latter, such as REINFORCE williams, learn a policy as a mapping from the state space to the action space, which tells the agent the best action in each state to maximize its reward. Both types of algorithms thus need to learn a representation of the state space to solve the task at hand. As the task gets more complex, so does the representation to be learned.

Deep networks trained with backpropagation backprop have a potential for rich feature representations afforded by the hierarchical structure of these networks deeprl. Nevertheless, these methods face certain key challenges (power consumption, inference speed, robustness to adversarial attacks, online learning, etc.) which could potentially be addressed by revisiting the computational models of learning and decision making in the brain. Similar to deep networks, the brain is composed of multiple layers, from the neurons that represent the stimulus to the neurons that encode the action. But unlike in deep networks, backpropagation is not considered a viable optimization technique in the brain crick: the backward propagation of error signals from upstream neurons is deemed biologically implausible. Instead, reinforcement learning models, supported by the evidence of neural coding of reward prediction errors in the brain rlmodel, provide a plausible theory of learning by facilitating local learning rules.

In this study, we demonstrate that a multi-agent RL framework, with each agent modeled after the generalized linear model (GLM) of a spiking neuron pillow, can learn complex stimulus-action mappings with local learning rules and a global high-level feedback. The policy of a spiking agent is defined as its firing probability conditioned on the stimulus and the spiking activity of other agents. Each agent updates its firing policy by following its local policy gradient modulated by a global reward prediction error. A hierarchical structure is imposed on the network of spiking agents to enable rich feature representations similar to those of deep networks. We further explore architectural techniques inspired by the brain, such as modularity and population coding, that can overcome the convergence challenges posed by local learning rules.

## 2 Reinforcement learning with a network of spiking agents

### 2.1 Notation

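An RL domain expressed as a Markov Decision Process (MDP) is defined by the state space $\mathcal{S}$, the action space $\mathcal{A}$, a state transition matrix $\mathcal{P}$, and a reward function $\mathcal{R}$. A policy is a distribution of action probabilities conditioned on the state and is defined as $\pi(a \mid s, \theta) = \Pr(A_t = a \mid S_t = s, \theta)$, where $\theta$ denotes the parameters of the policy. $G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$ is the discounted return, with discount factor $\gamma \in [0, 1]$.

### 2.2 Agent and architecture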

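Consider a GLM spiking agent shown in Figure 1. The agent’s instantaneous conditional probability of spiking at any time instant $\tau$, denoted by $\lambda(\tau)$, depends on the input $x(\tau)$, its own spiking history $y_{h}(\tau)$, and the spike train histories of the other agents at time $\tau$, $y_{1}(\tau), \dots, y_{n}(\tau)$, through the parameters $\theta = \{k, h, c_{1}, \dots, c_{n}, b\}$ as

$$\lambda(\tau) = f\Big(k \cdot x(\tau) + h \cdot y_{h}(\tau) + \sum_{i=1}^{n} c_{i} \cdot y_{i}(\tau) + b\Big) \tag{1}$$

where $f$ is the sigmoidal non-linearity, $b$ is the bias, and $k$, $h$ and $c_{i}$ are the stimulus, post-spike and coupling filters as shown in Figure 1. The agent produces a spike at any instant with probability $\lambda(\tau)$ and remains silent with probability $1 - \lambda(\tau)$. Given a spike train of length $T$, the probability of producing the spike train is the product of the probabilities of generating independent spikes, $\prod_{\tau \in \mathcal{T}_{\text{spike}}} \lambda(\tau) \prod_{\tau \in \mathcal{T}_{\text{nospike}}} \big(1 - \lambda(\tau)\big)$, where $\mathcal{T}_{\text{spike}}$ denotes the times at which a spike occurs and $\mathcal{T}_{\text{nospike}}$ denotes the times at which there is no spike. At each instant $t$ in the MDP scale, the action $a_t$ of the agent is a spike train response of length $T$. The policy of the agent is then expressed as the probability of producing a given spike train in a given state, as shown below.

$$\pi(a_t \mid s_t, \theta) = \prod_{\tau \in \mathcal{T}_{\text{spike}}} \lambda(\tau) \prod_{\tau \in \mathcal{T}_{\text{nospike}}} \big(1 - \lambda(\tau)\big) \tag{2}$$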

Thus the policy of each agent represents a mapping from its state space to its action space through its parameters. We now consider a hierarchical network of GLM spiking agents in which the first layer of agents receives the stimulus from the state space of the MDP and generates a spike train response as its action at a given time $t$. The action space of the first layer of agents becomes the state space of the succeeding layer of agents. The action space of the final layer of agents is the action space of the MDP. We now describe how a network of spiking agents learns to represent a complex mapping from the state space to the action space of the MDP to solve the RL task.
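The firing rule of Equation (1) and the spike-train policy of Equation (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names, the filter values and the toy inputs are all assumptions, and the firing probabilities are held fixed over the spike train instead of being recomputed from the evolving spike histories.

```python
import math
import random


def sigmoid(z):
    """Logistic non-linearity mapping a real-valued drive to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(weights, signal):
    """Inner product of a filter with an equally long window of a signal."""
    return sum(w * v for w, v in zip(weights, signal))


def firing_probability(stim_filter, stimulus, post_filter, own_history,
                       coupling_filters, other_histories, bias):
    """Spiking probability at one instant, in the spirit of Equation (1)."""
    drive = bias + dot(stim_filter, stimulus) + dot(post_filter, own_history)
    for c_i, y_i in zip(coupling_filters, other_histories):
        drive += dot(c_i, y_i)
    return sigmoid(drive)


def spike_train_probability(probabilities, spikes):
    """Probability of a binary spike train, in the spirit of Equation (2)."""
    p = 1.0
    for lam, y in zip(probabilities, spikes):
        p *= lam if y == 1 else 1.0 - lam
    return p


def sample_spike_train(probabilities, rng):
    """Draw one independent Bernoulli spike per instant."""
    return [1 if rng.random() < lam else 0 for lam in probabilities]


# Hypothetical toy values, chosen only for illustration.
lam = firing_probability(
    stim_filter=[0.5, -0.2], stimulus=[1, 0],
    post_filter=[-1.0], own_history=[0],
    coupling_filters=[[0.3]], other_histories=[[1]],
    bias=0.0,
)
train = sample_spike_train([lam, lam, lam], random.Random(0))
print(lam, train, spike_train_probability([lam, lam, lam], train))
```

In a layered network, the sampled spike trains of one layer would serve as the stimulus of the next layer, so the same three functions could be reused layer by layer.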

### 2.3 Learning updates

We derive our learning framework based on comdp, which proves that, in a network of modular agents describing a policy, following the policy gradient on the network as a whole is equivalent to following the policy gradient on each of the agents separately. We follow the formulation of a modular actor-critic described in pgcn, where each of the agents in the network receives the reward prediction error from a global temporal difference (TD) critic. For a given agent, the gradient of the expected discounted return $J(\theta)$ rl at time $t$ can be expressed as

$$\nabla_{\theta} J(\theta) = \mathbb{E}\big[\delta_t \, \nabla_{\theta} \log \pi(a_t \mid s_t, \theta)\big] \tag{3}$$

where $\delta_t$ is the TD error delivered by the critic and $\pi(a_t \mid s_t, \theta)$ is the policy of the agent given by Equation (2), where $s_t$ is the state of the agent and $a_t$ is the action/spike train response at time $t$. Gradient ascent in the policy parameters is performed to maximize the expected discounted return, $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$, where $\alpha$ is the step size. A stochastic gradient ascent step can be performed as $\theta \leftarrow \theta + \alpha \, \delta_t \, \nabla_{\theta} \log \pi(a_t \mid s_t, \theta)$.
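To make the role of the global critic concrete, the following sketch computes a one-step TD error from a tabular value estimate and uses it to update the critic. The text only states that a global TD critic delivers the reward prediction error, so the one-step TD(0) form, the function names, the critic step size `beta` and the two-state example are assumptions made for illustration.

```python
def td_error(values, state, reward, next_state, gamma, terminal=False):
    """One-step TD error: reward + gamma * V(next_state) - V(state)."""
    bootstrap = 0.0 if terminal else gamma * values[next_state]
    return reward + bootstrap - values[state]


def critic_update(values, state, delta, beta):
    """Move the critic's estimate for `state` along the TD error."""
    values[state] += beta * delta


# Hypothetical two-state example.
values = [0.0, 1.0]
delta = td_error(values, state=0, reward=0.0, next_state=1, gamma=0.9)
critic_update(values, state=0, delta=delta, beta=0.5)
print(delta, values)
```

The same scalar `delta` would be broadcast to every agent in the network, which is what keeps the feedback global while the parameter updates stay local.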

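The log derivative of the policy can be written as

$$\nabla_{\theta} \log \pi(a_t \mid s_t, \theta) = \sum_{\tau \in \mathcal{T}_{\text{spike}}} \nabla_{\theta} \log \lambda(\tau) + \sum_{\tau \in \mathcal{T}_{\text{nospike}}} \nabla_{\theta} \log \big(1 - \lambda(\tau)\big) \tag{4}$$

where $\lambda(\tau)$ is the firing probability of Equation (1) and the two sums run over the spiking and silent instants of the spike train, respectively.

By the updates $\theta \leftarrow \theta + \alpha \, \delta_t \, \nabla_{\theta} \log \pi(a_t \mid s_t, \theta)$, we increase the probability of producing the spike trains that result in a positive TD error and decrease the probability of those that result in a negative TD error. These updates take into account only the local information of the agent and, unlike in backpropagation, receive no feedback from the upstream agents regarding the agent's contribution to the global action selection. Hence these updates tend to exhibit high variance, as the agent is unaware of the correlation between its spiking policy and the global critic feedback.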
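The local update can be sketched as follows for an agent that, for simplicity, has only a stimulus filter and a bias. For a sigmoidal firing probability, the derivative of the log-probability of a spike (or of silence) with respect to the pre-activation is the observed spike minus the firing probability, which gives the familiar `(spike - probability) * input` form used below. The function name, the step size `alpha` and the toy inputs are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic non-linearity mapping a real-valued drive to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def local_policy_gradient_step(weights, bias, inputs, spikes, delta, alpha):
    """One TD-modulated ascent step on the log-probability of a spike train.

    inputs: one input vector per instant of the spike train
    spikes: the emitted 0/1 spike at each of those instants
    delta:  scalar TD error broadcast by the global critic
    """
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in zip(inputs, spikes):
        lam = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
        err = y - lam  # d log p(y) / d drive for a Bernoulli-sigmoid unit
        for j, v in enumerate(x):
            grad_w[j] += err * v
        grad_b += err
    new_weights = [w + alpha * delta * g for w, g in zip(weights, grad_w)]
    new_bias = bias + alpha * delta * grad_b
    return new_weights, new_bias


# Hypothetical example: one instant, the agent spiked, the critic was pleased.
w, b = local_policy_gradient_step([0.0, 0.0], 0.0, [[1, 0]], [1],
                                  delta=1.0, alpha=0.1)
print(w, b)
```

With a positive `delta` the spike that was emitted becomes more likely under the same input, and with a negative `delta` it becomes less likely. Nothing in the step requires information from any other agent beyond the shared TD error.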

### 2.4 Variance reduction

The structural credit assignment problem in deep hierarchical networks is to identify the amount of blame to assign to a particular agent for an error in action selection. One way to reduce the variance in a local learning setup is to introduce global high-level feedback from the network regarding the contribution of the agent to the action selection. Instead of transporting precise error signals based on the network connectivity, one can use random synaptic feedback correlated to the error signals, as in random. However, in this study, we focus on variance reduction solely through architectural design.

Variance reduction through a modular connectionist architecture: Brain networks have been demonstrated to have the property of hierarchical modularity, i.e., each module is composed of sub-modules which are in turn composed of several sub-modules. This modular structure is claimed to be responsible for faster adaptation and evolution of the system with changing stimulus conditions modular. jacobs showed that incorporating a modular architecture in neural networks results in faster learning compared to a fully connected architecture by decomposing the task into many functionally independent tasks. In this study we demonstrate that such a modular architecture is conducive to local learning rules.

Figure 2 shows a modular connectionist architecture with sparse modular connections instead of a fully connected architecture. In this network the spiking of an agent in a layer affects only a few of the agents in the succeeding layer, which enables us to identify and update the agents that are responsible when an erroneous action is chosen, thereby reducing the variance.
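One simple way to obtain such sparse modular connectivity is a block-structured binary mask applied to the connections between two consecutive layers. The sketch below is only an illustration of the idea: the equal-sized, non-overlapping modules, the function name and the layer sizes are assumptions and are not taken from Figure 2.

```python
def modular_mask(n_pre, n_post, n_modules):
    """Return mask[i][j] = 1 if agent i of one layer feeds agent j of the next.

    Agents of both layers are split into `n_modules` equal groups and
    connections are kept only between groups with the same index.
    """
    if n_pre % n_modules or n_post % n_modules:
        raise ValueError("layer sizes must be divisible by n_modules")
    pre_size = n_pre // n_modules
    post_size = n_post // n_modules
    return [
        [1 if i // pre_size == j // post_size else 0 for j in range(n_post)]
        for i in range(n_pre)
    ]


# Hypothetical sizes: 4 agents feeding 6 agents, split into 2 modules.
mask = modular_mask(n_pre=4, n_post=6, n_modules=2)
for row in mask:
    print(row)
```

In this toy case each agent influences 3 of the 6 agents in the next layer, so half of the connections of a fully connected layer are removed and an error in one module cannot be attributed to agents of the other.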

Variance reduction through population coding: In the primary motor cortex, it was demonstrated that the decision of the movement is generated by a population of neurons through a weighted vector addition of the preferred directions of individual neurons population. This population coding produces a more robust action selection that is unaffected by the volatility in the spiking activity of a single neuron. To achieve a similar effect, we incorporate population coding by decomposing an agent into a population of agents or, equivalently, by considering a population of networks. We average the action probabilities across the population of networks, and the final action is chosen with a softmax action selection. The individual networks are then updated in an off-policy manner with a positive TD error if the action chosen by the network is the same as the final action of the ensemble and a negative TD error otherwise.
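The ensemble procedure described above can be sketched as follows. This is one possible reading of the text and not the authors' code: the softmax temperature, the function names and the interpretation of "positive" and "negative" TD error as keeping or flipping the sign of the critic's TD error are all assumptions.

```python
import math
import random


def average_probabilities(per_network_probs):
    """Average the action probabilities proposed by each network."""
    n = len(per_network_probs)
    n_actions = len(per_network_probs[0])
    return [sum(p[a] for p in per_network_probs) / n for a in range(n_actions)]


def softmax(preferences, temperature=1.0):
    """Numerically stable softmax over a list of action preferences."""
    m = max(preferences)
    exps = [math.exp((p - m) / temperature) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(per_network_probs, rng, temperature=1.0):
    """Sample the ensemble action from a softmax over the averaged probabilities."""
    probs = softmax(average_probabilities(per_network_probs), temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def signed_td_errors(network_actions, ensemble_action, delta):
    """Keep the TD error for agreeing networks and flip its sign for the rest."""
    return [delta if a == ensemble_action else -delta for a in network_actions]


# Hypothetical population of three networks choosing between two actions.
proposals = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
action = select_action(proposals, random.Random(0))
print(action, signed_td_errors([0, 1, 0], ensemble_action=action, delta=2.0))
```

Each network would then apply its own local update using the signed TD error it receives, so networks that disagreed with the ensemble are pushed away from the action they proposed.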

### 2.5 Case studies

In this section, we apply our framework to solve two sample RL tasks, a delayed reward task (gridworld) and a continuous control task (cartpole).

Figure 3: Learning in gridworld and cartpole tasks. The curves are presented with standard error bars over 100 independent trials.

Gridworld: Consider a gridworld domain with four possible actions (Up, Down, Left, Right) at state. Transition to the terminal state at has a reward of 10 and every other transition has a reward of zero. The states of gridworld are encoded using neurons with a spike train length of . The hidden layer has agents each with a spike train length of . The stimulus filter of an agent is a kernel of parameters which produces the hidden layer spike train responses upon convolution with the spike train stimuli from the previous layer. For simplicity we ignore the other filters. The output layer has agents each corresponding to an action of the domain, and the activity of each output agent is encoded in a single spike. A population of such networks is used concurrently to select the actions. In Figure 3(a), the learning curve of the above network is compared against a tabular actor-critic (AC) and an AC parameterized by neural networks (15 input - 50 hidden - 4 output) and trained with backpropagation. The hyperparameters of each method are tuned separately. The comparison between the neural network AC and the spiking agent AC is not a fair one (as the former has a neural network critic and the latter has a tabular critic), but it gives a rough sense of learning in the spiking agent AC.

Cartpole: We apply the spiking agent network with a GLM spiking agent of spike train response 1 to the cartpole balancing task cartpole to demonstrate the efficacy of the variance reduction methods employed. The task is to balance the pole for time steps with two possible actions and reward of for each time step that the pole remains balanced. We use input neurons to represent the value of each of the state variables. The hidden layer has agents and the output layer has agents for the two actions. Actions are chosen using a population of such networks. Figure 3(b,c) demonstrates that incorporation of modularity and population coding makes the framework conducive to local learning.

## 3 Conclusion

In this paper, we extended the concept of a hedonistic neuron hedonistic by formulating a spiking neuron as an RL agent. A noisy spiking neuron with temporal coding has been proved to have more computational power than a sigmoidal neuron noisy, a capability which is yet to be harnessed. A group of such agents, organized in a hierarchical and modular fashion and interacting with each other in cooperation and competition, has a potential for rich representational learning. We have demonstrated one such learning framework that is capable of solving RL tasks, thus underscoring the relevance of neuroscientific principles for the advancement of artificial intelligence.
