1 Introduction
Reinforcement learning (RL) algorithms share qualitative similarities with the algorithms implemented by animal brains. However, there remain clear differences between these two types of algorithms. For example, while RL algorithms using artificial neural networks require information to flow backwards through the network via the backpropagation algorithm, there is currently debate about whether this is feasible in biological neural implementations (Werbos and Davis, 2016). Policy gradient coagent networks (PGCNs) are a class of RL algorithms that were introduced to remove this possibly biologically implausible property of RL algorithms—they use artificial neural networks but do not use the backpropagation algorithm (Thomas, 2011).
Since their introduction, PGCN algorithms have proven to be not only a possible improvement in biological plausibility, but also a practical tool for improving RL agents. They were used to solve RL problems with high-dimensional action spaces (Thomas and Barto, 2012), are the RL precursor to the more general stochastic computation graphs (Schulman et al., 2015), and, as we will show in this paper, generalize the recently proposed option-critic architecture (Bacon et al., 2017), while drastically simplifying key derivations.
The paper introducing PGCNs claims that each node (neuron) in a network can perform all of its updates given only local information, that is, information that would be available to a neuron in an animal brain. However, this is not the case, since PGCNs still require an implicit signal that was overlooked: a clock to determine when each node should produce its output and update its weights. In this paper we show how PGCNs can be extended to operate without a clock signal (or with a noisy clock signal), resulting in a new class of RL algorithms that 1) do not require the backpropagation of information through an artificial neural network, and 2) do not require a clock signal to be broadcast to any nodes in the network. Furthermore, removing the need for a clock has important ramifications beyond biological plausibility: it allows distributed implementations of large neural networks to operate without requirements of synchronicity, and provides an alternate view of temporal abstraction for RL algorithms. We clarify this second point later by discussing the relationship between PGCN algorithms and the options framework (Sutton et al., 1999).
The contributions of this paper are: 1) a complete and formal proof of a key result related to PGCN algorithms that this paper relies on, and for which prior work provides only an informal and incomplete proof, 2) a generalization of the PGCN framework to handle asynchronous networks, 3) a proof that asynchronous PGCNs generalize the option-critic framework, and 4) empirical support of our theoretical claims regarding the gradients of asynchronous PGCN algorithms.
2 Related Work
Klopf (1982) theorized that traditional models of classical and operant conditioning could be explained by modeling biological neurons as hedonistic, that is, seeking excitation and avoiding inhibition. The ideas motivating coagent networks bear a deep resemblance to Klopf’s proposal.
Stochastic neural networks were first applied to RL at the dawn of machine learning itself, with applications dating back at least to Marvin Minsky’s stochastic neural analog reinforcement calculator (SNARC), built in 1951 (Russell and Norvig, 2016). Interest in their usage has continued throughout the history of RL. The well-known REINFORCE algorithm was originally proposed with the intent of training stochastic networks (Williams, 1992), though it has since been primarily applied to conventional networks. Other rules, like adaptive reward-penalty (Barto, 1985), were proposed exclusively for training stochastic networks.
Multiagent reinforcement learning (MARL) is the application of RL in multiagent systems. MARL differs from coagent RL in that agents typically have separate manifestations within the environment; additionally, the goals of the agents may or may not be aligned. Despite these differences, many results from the study of MARL are relevant to the study of coagent networks. For instance, Liu et al. (2014) showed that multiagent systems sometimes learn more quickly when agents are given individualized rewards, rather than only receiving team-wide rewards. An overview of MARL is given by Busoniu et al. (2010).
Deep reinforcement learning, the application of conventional neural networks to RL, has recently become an active area of research, following its successful application to challenging domains such as real-time games (Mnih et al., 2015), board games (Silver et al., 2017), and robotic control (Andrychowicz et al., 2018). While conventional deep networks have dominated recent RL research (and machine learning research more broadly), stochastic networks have also recently been a moderately popular research topic. The formalism of stochastic computation graphs was proposed to describe networks with a mixture of stochastic and deterministic nodes, with applications to supervised learning, unsupervised learning, and RL (Schulman et al., 2015). Policy gradient coagent networks (Thomas, 2011), the subject of this paper, were proposed for RL specifically, and have been used to discover “motor primitives” in simulated robotic control tasks (Thomas and Barto, 2012).
Several recently proposed approaches fit into the formalism of stochastic networks, but the relationship has frequently gone unnoticed. One notable example is the option-critic architecture (Bacon et al., 2017). The option-critic provides a framework for learning options (Sutton et al., 1999), a type of high-level and temporally extended action, and for learning how to choose between options. The motivations for the option-critic are largely similar to previous motivations for PGCNs: namely, achieving temporal abstraction and hierarchical control. We show that the option-critic architecture can be described by a particular coagent network architecture. The theory presented in this paper makes the resulting derivation of the option-critic policy gradients nearly trivial in contrast with the original derivations, and places the option-critic within a more general theoretical framework.
3 Background
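We consider an MDP, $M = (\mathcal{S}, \mathcal{A}, \mathcal{R}, P, R, d_0, \gamma)$, where $\mathcal{S}$ is the finite set of possible states, $\mathcal{A}$ is the finite set of possible actions, and $\mathcal{R}$ is the finite set of possible rewards. Let $t \in \{0, 1, 2, \dots\}$ denote the time step. $S_t$, $A_t$, and $R_t$ are the state, action, and reward at time $t$, and are random variables that take values in $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$, respectively. $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition function, given by $P(s, a, s') := \Pr(S_{t+1}=s' \mid S_t=s, A_t=a)$. $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathcal{R} \to [0,1]$ is the reward distribution, given by $R(s, a, s', r) := \Pr(R_t=r \mid S_t=s, A_t=a, S_{t+1}=s')$. The initial state distribution, $d_0: \mathcal{S} \to [0,1]$, is given by $d_0(s) := \Pr(S_0=s)$. The discount factor, $\gamma \in [0,1]$, is the reward discount parameter. An episode is a sequence of states, actions, and rewards, starting from $t=0$ and continuing indefinitely. We assume that the discounted sum of rewards over an episode is finite.
A policy, $\pi: \mathcal{S} \times \mathcal{A} \to [0,1]$, is a stochastic method of selecting actions, such that $\pi(s, a) := \Pr(A_t=a \mid S_t=s)$. A parameterized policy is a policy that takes a parameter vector $\theta \in \mathbb{R}^n$. Different parameter vectors result in different policies. More formally, we redefine the symbol $\pi$ to denote a parameterized policy, $\pi: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^n \to [0,1]$, such that for all $\theta \in \mathbb{R}^n$, $\pi(\cdot, \cdot, \theta)$ is a policy. We assume that $\partial \pi(s, a, \theta) / \partial \theta$ exists for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $\theta \in \mathbb{R}^n$. An agent’s goal is typically to choose a policy that maximizes the objective function, which is defined as $J(\pi) := \mathbf{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid \pi\right]$, where conditioning on $\pi$ denotes that, for all $t$, $A_t \sim \pi(S_t, \cdot)$. The state-value function, $v^{\pi}: \mathcal{S} \to \mathbb{R}$, is defined as $v^{\pi}(s) := \mathbf{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t=s, \pi\right]$. The discounted return, $G_t$, is defined as $G_t := \sum_{k=0}^{\infty} \gamma^k R_{t+k}$. We denote the objective function for a policy that has parameters $\theta$ as $J(\theta)$, and condition probabilities on $\theta$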
to denote that the parameterized policy uses parameter vector $\theta$.
Consider a parameterized policy that consists of an acyclic network of nodes, called coagents, which do not share parameters. Each coagent can have several inputs that may include the state at time $t$, a noisy and incomplete observation of the state at time $t$, and/or the outputs of other coagents. When considering the $i$th coagent, $\theta$ can be partitioned into two vectors, $\theta_i$ (the parameters used by the $i$th coagent) and $\bar{\theta}_i$ (the parameters used by all other coagents). From the point of view of the $i$th coagent, $A_t$ is produced from $S_t$ in three stages: execution of the nodes prior to the $i$th coagent (nodes whose outputs are required to compute the input to the $i$th coagent), execution of the $i$th coagent, and execution of the remaining nodes in the network to produce the final action. This process is depicted graphically in Figure 1 and described in detail below. First, we define a parameterized distribution $\pi_i^{\text{pre}}(S_t, \cdot, \bar{\theta}_i)$ to capture how the previous coagents in the network produce their outputs given the current state. The output of the previous coagents is a random variable, which we denote by $U_t^{\text{pre}}$, and which takes continuous and/or discrete values in some set $\mathcal{U}^{\text{pre}}$. $U_t^{\text{pre}}$ is sampled from the distribution $\pi_i^{\text{pre}}(S_t, \cdot, \bar{\theta}_i)$. Next, the $i$th coagent takes $S_t$ and $U_t^{\text{pre}}$ as input and produces the output $U_t^i$ (below, when it is unambiguously referring to the output of the $i$th coagent, we make the $i$ implicit and denote it as $U_t$). We denote this input, $(S_t, U_t^{\text{pre}})$, as $X_t$ (or $X_t^i$ if it is not unambiguously referring to the $i$th coagent). The conditional distribution of $U_t^i$ is given by the $i$th coagent’s policy, $\pi_i(X_t, \cdot, \theta_i)$. Although we allow the $i$th coagent’s output to depend directly on $S_t$, it may be parameterized to only depend on $U_t^{\text{pre}}$. Finally, $A_t$ is sampled according to a distribution $\pi_i^{\text{post}}(X_t, U_t^i, \cdot, \bar{\theta}_i)$, which captures how the subsequent coagents in the network produce $A_t$.
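The three-stage view of action selection can be illustrated with a small sketch. The probability tables below are hypothetical and chosen only for illustration (two upstream outputs, two coagent outputs, two actions); they are not taken from the paper. The sketch samples the upstream output, then the coagent's output, then the final action, and also computes the exact action distribution by marginalizing over the two intermediate outputs.

```python
import random

def sample_categorical(probs, rng):
    """Sample an index from a list of probabilities that sums to 1."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def pi_pre(state):
    """Hypothetical upstream coagents: distribution over their output given the state."""
    return [0.8, 0.2] if state == 0 else [0.3, 0.7]

def pi_i(state, u_pre):
    """Hypothetical policy of the coagent under consideration (ignores the state here)."""
    return [0.9, 0.1] if u_pre == 0 else [0.2, 0.8]

def pi_post(state, u_pre, u):
    """Hypothetical downstream coagents: distribution over the final action."""
    return [0.6, 0.4] if u == 0 else [0.1, 0.9]

def sample_action(state, rng):
    """Run the three stages in order and return (u_pre, u, action)."""
    u_pre = sample_categorical(pi_pre(state), rng)
    u = sample_categorical(pi_i(state, u_pre), rng)
    action = sample_categorical(pi_post(state, u_pre, u), rng)
    return u_pre, u, action

def action_distribution(state):
    """Exact distribution over actions, marginalizing over u_pre and u."""
    dist = [0.0, 0.0]
    for u_pre, p_pre in enumerate(pi_pre(state)):
        for u, p_u in enumerate(pi_i(state, u_pre)):
            for a, p_a in enumerate(pi_post(state, u_pre, u)):
                dist[a] += p_pre * p_u * p_a
    return dist
```

For example, `sample_action(0, random.Random(0))` draws one pass through the network, while `action_distribution(0)` gives the overall policy that the three stages induce in state 0.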
Below, we sometimes make and implicit and write the three policy functions as , , and . Also, following the work of Thomas and Barto (2011), we model the coagent’s environment (consisting of the original environment as well as all other coagents in the network) as an MDP called a conjugate Markov decision process (CoMDP).
4 The Coagent Policy Gradient Theorem
Consider what would happen if the $i$th coagent ignored all of the complexity in this problem setup and simply implemented an unbiased policy gradient algorithm, like REINFORCE (Williams, 1992). From the $i$th coagent’s point of view, the state would be $S_t$ and $U_t^{\text{pre}}$ together (the coagent may ignore components of this state, such as the $S_t$ component), its actions would be $U_t$, and the rewards would remain $R_t$. We refer to the expected update in this setting as the local policy gradient, $\Delta_i$, for the $i$th coagent. Note that although we eventually prove that, for all $i$, $\Delta_i$ is equivalent to the policy gradient of the $i$th CoMDP, we do not assume this equivalence. Formally, the local policy gradient of the $i$th coagent is:
$$\Delta_i(\theta_i) := \mathbf{E}\left[\sum_{t=0}^{\infty} \gamma^t G_t \frac{\partial \ln \pi_i(X_t, U_t, \theta_i)}{\partial \theta_i} \,\middle|\, \theta\right]. \tag{1}$$
What would happen if all coagents updated their parameters using a local and unbiased policy gradient algorithm? The coagent policy gradient theorem (CPGT) answers this question: If $\theta$ is fixed and all coagents update their parameters following unbiased estimates, $\widehat{\Delta}_i(\theta_i)$, of their local policy gradients, then the entire network will follow an unbiased estimator of $\nabla J(\theta)$, which we call the global policy gradient. For example, if every coagent performs the following update simultaneously at the end of each episode, then the entire network will be performing stochastic gradient ascent on $J$ (without using backpropagation):
$$\theta_i \leftarrow \theta_i + \alpha \sum_{t=0}^{\infty} \gamma^t G_t \frac{\partial \ln \pi_i(X_t, U_t, \theta_i)}{\partial \theta_i}. \tag{2}$$
In practice, one would use a more sophisticated policy gradient algorithm than this simple variant of REINFORCE.
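As a rough illustration of such a local update, the following sketch applies one end-of-episode REINFORCE step to a single coagent. It assumes a tabular softmax policy (a hypothetical choice, not prescribed by the paper), where `theta[x][u]` is the preference for output `u` given local input `x`. The function names and argument layout are illustrative. The coagent uses only its own inputs, its own outputs, and the rewards; nothing is propagated backward from other coagents.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_returns(rewards, gamma):
    """Return the list of discounted returns, one per time step."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def local_reinforce_update(theta, inputs, outputs, rewards, alpha, gamma):
    """One end-of-episode update for a single coagent.

    theta[x][u] holds tabular softmax preferences (an assumption of this sketch).
    inputs[t] and outputs[t] are the coagent's local input and output at step t.
    """
    returns = discounted_returns(rewards, gamma)
    grad = [[0.0] * len(row) for row in theta]
    for t, (x, u) in enumerate(zip(inputs, outputs)):
        probs = softmax(theta[x])
        for k in range(len(probs)):
            indicator = 1.0 if k == u else 0.0
            # Gradient of ln softmax with respect to preference k: indicator - prob.
            grad[x][k] += (gamma ** t) * returns[t] * (indicator - probs[k])
    return [[w + alpha * g for w, g in zip(row, grad_row)]
            for row, grad_row in zip(theta, grad)]
```

Every coagent in the network would run the same routine on its own local data, using the shared reward sequence.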
Although Thomas and Barto (2011) present the CPGT in their Theorem 3, the provided proof is lacking in two ways. First, it is not general enough for our purposes because it only considers networks with two coagents. Second, it is missing a crucial step. They define a new MDP, the CoMDP, which models the environment faced by a coagent. They show that the policy gradient for this new MDP is a component of $\nabla J(\theta)$. However, they do not show that the chosen definition of the CoMDP accurately describes the environment that the coagent faces. Without this step, Thomas and Barto (2011) have shown that there is a new MDP for which the policy gradient is a component of $\nabla J(\theta)$, but not that this MDP has any relation to the coagent network. In this section we provide a formal and generalized proof of the CPGT. Although this proof is an important contribution of this work, due to space restrictions this section is an abbreviated outline of the proof in Section A of the supplementary material.
4.1 Conjugate Markov Decision Process (CoMDP)
We model the coagent’s environment as an MDP, called the CoMDP, and begin by formally defining the CoMDP. Given , , , , and , we define a corresponding CoMDP, , as , where:

We write , , and to denote the state, action, and reward of at time . Below, we relate these random variables to the corresponding random variables in . Note that all random variables in the CoMDP are written with tildes to provide a visual distinction between terms from the CoMDP and original MDP. Additionally, when it is clear that we are referring to the CoMDP, we often make implicit and denote these as , , and .

. We often denote simply as . This is the input (analogous to a state set) to the coagent. Additionally, for , we denote the component as and the component as . We also sometimes denote an as . For example, represents the probability that has component and component .

(or simply ) is an arbitrary set that denotes the output of the coagent.

and .

Below, we make implicit and denote this as . Recall from the definition of an MDP and its relation to the transition function that this means: .

Like the transition function, we make implicit and write .

.
We write to denote the objective function of . Notice that although (the parameters of the other coagents) is not an explicit parameter of the objective function, it is implicitly included via the CoMDP’s transition function.
We assume that, given the same parameters , the coagent has the same policy in both the original MDP and the CoMDP. That is,
Assumption 1.
4.2 The CoMDP Models the Coagent’s Environment
Here we show that our definition of the CoMDP correctly models the coagent’s environment. We do so by presenting a series of properties and lemmas that each establish different components of the relationship between the CoMDP and the environment faced by a coagent. Figure 2 depicts the setup that we have described and makes relevant independence properties clear. The proofs of these properties and theorems are provided in the supplementary material.
In Properties 1 and 2, by manipulating the definitions of and , we show that and the distribution of capture the distribution of the inputs to the coagent.
Property 1.
Property 2.
For all ,
In Property 3, we show that captures the distributions of the inputs that the coagent will see given the input at the previous step and the output that it selected.
Property 3.
For all , and ,
In Property 4, we show that captures the distribution of the rewards that the coagent receives given the output that it selected and the inputs at the current and next steps.
Property 4.
For all , , , and ,
(3) 
In Properties 5 and 6, we show that the distributions of and capture the distribution of inputs to the coagent.
Property 5.
For all and ,
(4) 
Property 6.
For all ,
(5) 
In Property 7, we show that the distribution of given captures the distribution .
Property 7.
For all and ,
(6) 
In Property 8, we show that the distribution of given , , and captures the distribution of the component of the input that the coagent will see given the input at the previous step and the output that it selected.
Property 8.
For all , , , and ,
(7)  
(8) 
In Property 9, we use Property 8 to show that: Given the component of the input, the component of the input that the coagent will see is independent of the previous input and output.
Property 9.
For all , , , , and ,
In Property 10, we use Assumption 1 and Properties 6, 7, 8, 9, and 10 to show that the distribution of captures the distribution of the rewards that the coagent receives.
Property 10.
For all ,
Lemma 1.
is a Markov decision process.
Finally, in Lemma 2, we use the properties above to show that the CoMDP (built from , , , , and ) correctly models the local environment of the coagent.
Lemma 2.
For all , and , and given a policy parameterized by , the corresponding CoMDP satisfies Properties 1–6 and Property 10.
Lemma 2 is stated more formally in the supplementary material.
4.3 The Coagent Policy Gradient Theorem
We use Property 10 to show that, given the same , the objective functions produce the same output in the original MDP and all CoMDPs. More formally:
Property 11.
For all coagents , for any , .
Next, using Lemmas 1 and 2, we show that the local policy gradient, (the expected value of the naive REINFORCE update), is equivalent to the gradient of the CoMDP.
Lemma 3.
For all coagents , for all ,
We can now formally state and prove the CPGT. Using Property 11 and Lemma 3, we show that the local policy gradients are the components of the global policy gradient:
Theorem 1 (Coagent Policy Gradient Theorem).
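$\nabla J(\theta) = \left[\Delta_1(\theta_1)^\top, \Delta_2(\theta_2)^\top, \dots, \Delta_m(\theta_m)^\top\right]^\top$, where $m$ is the number of coagents and $\Delta_i$ is the local policy gradient of the $i$th coagent.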
Corollary 1.
If is a deterministic positive stepsize, , , additional technical assumptions are met (Bertsekas and Tsitsiklis, 2000, Proposition 3), and each coagent updates its parameters, , with an unbiased local policy gradient update , then converges to a finite value and .
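The claim that the local policy gradients are the components of the global policy gradient can be checked numerically on a toy problem. The sketch below assumes a hypothetical one-step, single-state task with two coagents in a chain: the first emits a binary output, and the second maps that output to a binary action. Both use sigmoid-parameterized Bernoulli policies, and the reward table is made up. It compares the exact expectation of each coagent's local REINFORCE update with a finite-difference gradient of the exact objective.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

REWARD = [0.0, 1.0]  # hypothetical reward for actions 0 and 1

def bernoulli(p_one):
    """Probabilities of outcomes 0 and 1."""
    return [1.0 - p_one, p_one]

def objective(theta):
    """Exact expected reward for theta = [theta_1, theta_2 given u=0, theta_2 given u=1]."""
    p_u = bernoulli(sigmoid(theta[0]))
    total = 0.0
    for u in (0, 1):
        p_a = bernoulli(sigmoid(theta[1 + u]))
        for a in (0, 1):
            total += p_u[u] * p_a[a] * REWARD[a]
    return total

def local_gradients(theta):
    """Exact expectation of each coagent's local REINFORCE update, by enumeration."""
    p1 = sigmoid(theta[0])
    p_u = bernoulli(p1)
    grad = [0.0, 0.0, 0.0]
    for u in (0, 1):
        p2 = sigmoid(theta[1 + u])
        p_a = bernoulli(p2)
        for a in (0, 1):
            prob = p_u[u] * p_a[a]
            ret = REWARD[a]
            # d/dz ln Bernoulli(k; sigmoid(z)) = k - sigmoid(z)
            grad[0] += prob * ret * (u - p1)      # first coagent's parameter
            grad[1 + u] += prob * ret * (a - p2)  # second coagent's parameter for input u
    return grad

def finite_difference_gradient(theta, eps=1e-6):
    """Central-difference approximation of the gradient of the exact objective."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((objective(up) - objective(down)) / (2 * eps))
    return grad
```

On this toy problem the concatenated output of `local_gradients` should agree with `finite_difference_gradient` up to numerical error, which is the behavior Theorem 1 describes.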
5 Asynchronous Recurrent Networks
Having formally established the CPGT, we now turn to extending the PGCN framework to asynchronous and cyclic networks—networks where the coagents execute, that is, look at their local state and choose actions, asynchronously and without any necessary order. This extension removes the necessity for a biologically implausible perfect clock, allowing the network to function with an imprecise clock, or with no clock at all. This also allows for distributed implementations, where nodes may not execute synchronously.
We first consider how we may modify an MDP to allow coagents to execute at arbitrary points in time, including at points in between our usual time steps. We make a simplifying assumption: Time is discrete (as opposed to continuous). We break a time step of the MDP into an arbitrarily large number of shorter steps, which we call atomic time steps. We assume that the environment performs its usual update regularly every atomic time steps, and that each coagent executes (chooses an output in its respective ) at each atomic time step with some probability, given by an arbitrary but fixed probability distribution. The duration of atomic time steps can be arbitrarily small to allow for arbitrarily close approximations to continuous time or to model, for example, a CPU cluster that performs billions of updates per second. The objective is still the expected value of , the discounted sum of rewards from all atomic time steps: .
Next, we extend the coagent framework to allow cyclic connections. Previously, we considered a coagent’s local state to be captured by , where is some combination of outputs from coagents that come before the coagent topologically. We now allow coagents to also consider the output of all coagents on the previous time step, . In the new setting, the local state at time is therefore given by . The corresponding local state set is given by . In this construction, when , we must consider some initial output of each coagent, . For the coagent, we define to be drawn from some independent initial distribution, , such that for all .
We redefine how each coagent selects actions in the asynchronous setting. First, we define a random variable, , the value of which is if the coagent executes on atomic time step , and otherwise. One useful factor for deciding whether a coagent should update or not is the number of time steps since its last execution. To this end, we define a counter for each coagent given at time for the coagent by (define to be the set of nonnegative integers). The counter increments by one at each atomic time step, and resets to zero when the coagent executes. We define a function, , that captures this behavior, such that . Each coagent has a fixed execution function, , which defines the probability of the coagent executing on time step , given the coagent’s local state and its counter. That is, for all and ). Finally, the action, , that the coagent selects at time is sampled from if , and is otherwise. That is, if the agent does not execute on atomic time step , then it should repeat its action from time .
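The execution rule described above can be sketched as a single atomic time step of one coagent. The names `step_coagent` and `simulate` are hypothetical, and the local state is reduced to the atomic step index as a stand-in for the real local state. The execution function and policy are passed in as plain callables. The counter resets to zero when the coagent executes and increments otherwise, and a non-executing coagent repeats its previous output.

```python
import random

def step_coagent(local_state, prev_output, counter, execution_prob, policy, rng):
    """One atomic time step of a single asynchronous coagent.

    execution_prob(local_state, counter) gives the probability of executing.
    policy(local_state, rng) samples a fresh output when the coagent executes.
    Returns (executed, output, next_counter).
    """
    executed = rng.random() < execution_prob(local_state, counter)
    if executed:
        output = policy(local_state, rng)
        next_counter = 0  # reset when the coagent executes
    else:
        output = prev_output  # repeat the output from the previous atomic step
        next_counter = counter + 1
    return executed, output, next_counter

def simulate(num_steps, initial_output, execution_prob, policy, seed=0):
    """Run one coagent for num_steps atomic steps and record (executed, output)."""
    rng = random.Random(seed)
    output, counter = initial_output, 0
    trace = []
    for t in range(num_steps):
        # The step index t stands in for the local state in this sketch.
        executed, output, counter = step_coagent(
            t, output, counter, execution_prob, policy, rng)
        trace.append((executed, output))
    return trace
```

An execution function such as `lambda x, c: 1.0 if c >= 2 else 0.0` makes the coagent behave like a perfect clock that fires every third atomic step. Replacing it with a probability strictly between 0 and 1 gives a noisy clock.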
This asynchronous setting does not match the usual MDP description because the policy represented by the network is non-Markovian; that is, we cannot determine the distribution over the output of the network given only the current state, $S_t$. Therefore, we cannot apply the CPGT. However, we show that the asynchronous setting can be reduced to the usual acyclic, synchronous setting using formulaic changes to the state set, transition function, and network structure. This allows us to derive an expression for the gradient with respect to the parameters of the original, asynchronous network, and therefore to train such a network. We prove a result similar to the CPGT that allows us to update the parameters of each coagent using only states and actions from atomic time steps when the coagent executes.
5.1 The CPGT for Asynchronous Networks
We first extend the definition of the local policy gradient, , to the asynchronous setting. In the synchronous setting, the local policy gradient captures the update that a coagent would perform if it were following an unbiased policy gradient algorithm using its local inputs and outputs. In the asynchronous setting, we capture the update that a coagent would perform if it were to consider only the local inputs and outputs it sees when it executes. Formally, we define the asynchronous local policy gradient:
(9) 
The only change from the synchronous version is the introduction of . Note that when the coagent does not execute (), the entire inner expression is 0. In other words, these states and actions can be ignored. An algorithm estimating would still need to consider the rewards from every atomic time step, including time steps where the coagent does not execute. However, the algorithm may still be designed such that the coagents only perform a computation when executing. For example, during execution, coagents may be given the discounted sum of rewards since their last execution. The important question is then: does something like the CPGT hold for the asynchronous local policy gradient? If each coagent executes a policy gradient algorithm using unbiased estimates of , does the network still perform gradient ascent on the asynchronous-setting objective, ? The answer turns out to be yes.
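A single-episode estimate of the asynchronous local policy gradient might be computed as in the following sketch. It assumes the per-step score vectors (the gradient of the log-probability of the coagent's output with respect to its own parameters) have already been computed. The function name and argument layout are hypothetical. Returns are built from the rewards of every atomic time step, but only the steps on which the coagent executed contribute a term.

```python
def discounted_returns(rewards, gamma):
    """Discounted return at every atomic time step."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def async_local_gradient_estimate(executed, scores, rewards, gamma):
    """Sum of gamma^t * G_t * score_t over the steps where the coagent executed.

    executed[t] is 1 if the coagent executed at atomic step t and 0 otherwise.
    scores[t] is the score vector for step t (ignored when executed[t] is 0).
    """
    returns = discounted_returns(rewards, gamma)
    grad = [0.0] * len(scores[0])
    for t, (e, score) in enumerate(zip(executed, scores)):
        if not e:
            continue  # non-executing steps contribute zero
        weight = (gamma ** t) * returns[t]
        for k, s in enumerate(score):
            grad[k] += weight * s
    return grad
```

The score at a non-executing step never enters the estimate, so a coagent does not need to record its inputs or outputs at those steps.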
Theorem 2 (Asynchronous Coagent Policy Gradient Theorem).
where is the number of coagents and is the asynchronous local policy gradient of the coagent.
Proof.
The general approach is to show that for any MDP , with an asynchronous network represented by with parameters , there is an augmented MDP, , with objective and an acyclic, synchronous network, , with the same parameters , such that . Thus, we reduce the asynchronous problem to an equivalent synchronous problem. Applying the CPGT to this reduced setting allows us to derive Theorem 2.
The original MDP, , is given by the tuple . We define the augmented MDP, , as the tuple, . We would like to hold all of the information necessary for each coagent to compute its next output, including the previous outputs and counters of all coagents. This will allow us to construct an acyclic version of the network to which we may apply the CPGT. We define to be the combined output set of all coagents in , to be the set of possible counter values, and to be the set of possible combinations of coagent executions. We define the state set to be , and the action set to be . We write the random variables representing the state, action, and reward at time as and respectively. Additionally, we refer to the components of values and and the components of the random variables and using the same notation as for the components of above (for example, is the component of , is the component of , etc.). For vector components, we write the component of the vector using a subscript (for example, is the component of ).
The transition function, , captures the original transition function, the fact that , and the behavior of the counters. For all and , is given by and , and otherwise. For all , and , the reward distribution is simply given by . The initial state distribution, , captures the original state distribution, the initialization of each coagent, and the initialization of the counters to zero. For all , it is given by if is the zero vector, and otherwise. The discount parameter is . The objective is the usual: , where .
We next define the synchronous network, , in terms of components of the original asynchronous network, —specifically, each , , and . We must modify the original network to accept inputs in and produce outputs in . Recall that in the asynchronous network, the local state at time of the coagent is given by . In the augmented MDP, the information in is contained in , so the local state of the coagent in the synchronous network is , with accompanying state set . To produce the component of the action, , we append the output of each coagent to the action. In doing so, we have removed the need for cyclic connections, but still must deal with the asynchronous execution.
The critical step is as follows: We represent each coagent in the asynchronous network by two coagents in the synchronous network, the first of which represents the execution function, , and the second of which represents the original policy, . At time step , the first coagent accepts and outputs with probability , and 0 otherwise. We append the output of every such coagent to the action in order to produce the component of the action, . Because the coagent representing executes before the coagent representing , from the latter’s perspective, the output of the former is present in , that is, . If , the coagent samples a new action from . Otherwise, it repeats its previous action, which can be read from its local state (that is, ). Formally, for all and , the probability of the latter coagent producing action is given by:
(10) 
This completes the description of . In the supplementary material, we prove that this network exactly captures the behavior of the asynchronous network—that is, for all possible values of in their appropriate sets.
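The behavior of the second coagent in each pair can be written as a short function. This is a sketch under the assumption of a discrete output set, with hypothetical names. If the execution coagent emitted 1, the output probability comes from the original policy. Otherwise all probability mass sits on the previous output.

```python
def synchronous_policy_prob(executed, prev_output, output, policy_prob):
    """Probability that the policy coagent emits `output` in the synchronous network.

    executed: output of the paired execution coagent (1 or 0).
    prev_output: the coagent's previous output, read from its local state.
    policy_prob(output): probability of `output` under the original policy
        for the current local state (passed in as a callable in this sketch).
    """
    if executed:
        return policy_prob(output)
    # Not executing: deterministically repeat the previous output.
    return 1.0 if output == prev_output else 0.0
```

In both branches the probabilities over the output set sum to one, so the result is a valid policy for the synchronous coagent.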
The proof that is given in Section B of the supplementary material, but it follows intuitively from the fact that 1) the “hidden” state of the network is now captured by the state set, 2) accurately captures the dynamics of the hidden state, and 3) this hidden state does not materially affect the transition function or the reward distribution with respect to the original states and actions.
Having shown that the expected return in the asynchronous setting is equal to the expected return in the synchronous setting, we turn to deriving the asynchronous local policy gradient, . It follows from that . Since is a synchronous, acyclic network, and is an MDP, we can apply the CPGT to find an expression for . This gives us for the coagent in the synchronous network:
(11) 
Consider , which we abbreviate as . When , we know that the action is regardless of . Therefore, in these local states, is zero. When , we see from the definition of that . Therefore, we see that in all cases, . Substituting this into the above expression yields:
(12) 
In the proof given in Section B of the supplementary material, we show that the distribution over all of the analogous random variables is equivalent in both settings (for example, for all , ). Substituting each of the random variables of into the above expression yields precisely the asynchronous local policy gradient, . ∎
6 Case Study: Option-Critic
The option-critic framework (Bacon et al., 2017) aspires to many of the same goals as coagent networks: namely, hierarchical learning and temporal abstraction. In this section, we show that the architecture is equivalently described in terms of a simple, three-node coagent network, depicted in Figure 3. We show that the policy gradients derived by Bacon et al. (2017) are equivalent to the gradients suggested by the CPGT. While the original derivation required several steps, we show that the derivation is nearly trivial using the CPGT, demonstrating its utility as a tool for analyzing a variety of important RL algorithms.
In this section, we adhere mostly to the notation given by Bacon et al. (2017), with some minor changes used to enhance conceptual clarity regarding the inputs and outputs of each policy. In the option-critic framework, the agent is given a set of options, . The agent selects an option, , by sampling from a policy . An action, , is then selected from a policy which considers both the state and the current option: . A new option is not selected at every time step; rather, an option is run until a termination function, , selects the termination action, 0. If the action 1 is selected, then the current option continues. is parameterized by weights , and by weights . Bacon et al. (2017) gave the corresponding policy gradients, which we rewrite as:
(13)  
(14) 
where is a discounted weighting of state-option pairs, given by , is the expected return from choosing option