Challenging problems for reinforcement learning are those in which observations do not reveal the Markov states of the environment. Such problems are also known as partially observable Markov decision problems (POMDPs) [sondik1971optimal]. One solution is to use the history of observations as the state, transforming the problem into an MDP and thus enabling RL to learn a state-action value function [mnih2015human, sutton2018reinforcement]. LSTMs were successfully used in deep recurrent Q-learning to solve some types of POMDPs such as flickering Atari [hausknecht2015deep, heess2015memory, mirowski2016learning].
The history of observations, however, is not always sufficient to efficiently solve a POMDP. In fact, if the observations derive from a large set, a sequence of observations is unlikely to repeat, and therefore the derived MDP is very large; moreover, if the time gap between actions and reward has a variable delay, intervening observations also result in a large space of histories. These conditions can be intuitively understood as problems in which input data across a time window is not equally important to inform the decision process and maximize reward. Thus, rather than the observation history, other approaches to solve POMDPs rely on a belief system. For example, [Rasmussen2017] propose an approach based on the well-known model-based PILCO algorithm for continuous control. A variational autoencoder is used to update a belief system in [Igl2018], which reported improved performance over [hausknecht2015deep] in POMDPs such as flickering Atari. [Azizzadenesheli2017] introduced spectral methods in reinforcement learning to solve POMDPs. [Doshi2015] used Bayesian nonparametric representations for distinguishing between separate states to help with appropriate action selection.
In this paper, we propose a new way to cope with confounding stimuli and delayed rewards by augmenting a DQN architecture with a single-layer reward-modulated Hebbian network with neural eligibility traces. This new deep RL architecture can discriminate between pivotal decision points and irrelevant or confounding observations, thus gaining a learning advantage in RL problems with delayed rewards and confounding observations. A modulated Hebbian network with neural eligibility traces (MOHN) is employed to reconstruct observation-action-reward sequences with variable delays in the cause-effect chain of events. The ability of MOHN to bridge temporal gaps between causing events and rewards was demonstrated in a spiking neural model in [izhikevich2007solving], and an equivalent model for rate-based neurons was shown to be effective in simulation [soltoggio2013solving] and robotics applications [soltoggio2013rare, soltoggio2013learning].
Traces are not a new concept and have been explored in the past reinforcement learning literature [sutton2018reinforcement, Munos2016]. The main difference between traces in reinforcement learning and neural traces in MOHN [izhikevich2007solving, soltoggio2013solving] lies in the computational framework that creates and uses them: in TD($\lambda$), traces are incremented by the value of the gradient that, with linear approximation, corresponds to the feature vector. In MOHN, increments are instead derived from cause-effect neural dynamics inspired by the STDP learning rule [markram1997regulation]. An augmented version of STDP with neuromodulation [izhikevich2007solving] can be used in a Hebbian network with rate-based neurons [soltoggio2013rare] to derive cause-effect relationships. Finally, the TD error is not easily computed in a POMDP because states might not be defined or correctly identified; therefore, the MOHN uses the raw reward signal, and not the difference between actual and expected reward. Recurrent modulated networks [izhikevich2007solving, soltoggio2013rare] can learn to compute the difference between actual and expected reward, but they are nevertheless fed with raw reward signals.
Exploiting the properties of reward-modulated Hebbian networks, we propose a modulated Hebbian plus Q network architecture (MOHQA) for deep reinforcement learning problems. The architecture is composed of two parts: a standard DQN network, and a modulated Hebbian network (MOHN), parallel to the Q-network head, that contributes to decision making. The key idea is that the modulated Hebbian network can learn distal associations, which allow the network to ignore time gaps and confounding observations that occur between actions and rewards. In other words, the MOHN infers associations between inputs, outputs, and rewards across multiple time steps, thus bypassing the need to compute a TD error, which may be uncertain due to partially observable states. The DQN shares the feature extraction convolutional layers with the Hebbian network. Thus, while DQN [mnih2015human] can learn useful features even in problems that it cannot solve because of partial observability, the MOHN contributes to the decisions of the DQN when the TD error is misleading due to confounding observations.
Two unique aspects of the traces used in the MOHQA are: (i) traces are used only in the network head and (ii) traces are sparse (or rare). The first point (i) implies that traces are used to associate high level features with decisions: this principle implicitly assumes that distal cause-effect relationships exist at high level features, rather than at the input pixel level. Point (ii) implies that the eligibility of weights is increased only in a small percentage of the total number, ensuring that credit is assigned to a small subset of weights. This is critical to maintain stability with raw reward signals when TD errors cannot be computed.
The architecture is tested in a set of generalized reward-based decision problems that include POMDPs with delayed rewards and confounding observations. Tests include comparisons with a baseline DQN, QRDQN+LSTM [hausknecht2015deep] and A2C [mnih2016asynchronous]. Initial tests show that the MOHQA solves POMDPs where the baseline algorithms fail. The proposed approach is the first application of a combined Hebbian and DQN network to implicitly learn a belief system and address challenging TD computation in the presence of partially observable Markov states.
A set of POMDP problems with distal rewards and confounding observations
Assume an environment where key decision points, in which actions are critical, occur occasionally and are separated by wait states during which a simple and fixed policy is required, e.g., a wait action. This is a fairly common case in robotics applications and games. This scenario is implemented in this work with a configurable tree graph (CT-graph) that encodes a partially observable MDP and returns observations of two main types: decision points (where the graph branches into multiple sub-graphs) and wait states (where the tree graph does not branch). Actions are also of two types: wait-actions and act-actions. The CT-graph is designed to be a challenging problem for current state-of-the-art RL approaches. The CT-graph results in (1) sparsity of the reward, (2) non-observability of the MDP and (3) a large number of confounding observations.
A reward is located at one particular leaf node in the tree graph. The agent is required to perform wait-actions while in a wait state and choose from a set of act-actions while at a decision point. The choice of a specific act-action determines the path that the agent follows along the tree. While at a wait state, the agent receives observations that can be seen as confounding observations because only the wait-action is ever required in a wait state (only one wait-action is used in the current setup). Wait states lead to themselves with a delay probability $p$, or to the next state in the sequence with probability $1 - p$. At each decision point there are $b$ options corresponding to the branches of the tree graph. The number of consecutive decision points is the depth $d$ of the tree graph. The environment has an optimal sequence of actions that leads to one unique leaf of the tree graph that returns a reward of one. The branch factor $b$, the depth $d$, the delay probability $p$, and the sets of observations are configurable parameters, making this problem a blueprint for a large set of benchmarks, from simple to extremely difficult problems for medium to large graphs (Fig. 1).
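The transition rules described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' CT-graph implementation: the class name `TinyCTGraph`, the string observations, the default `goal_path`, and the rule that a wrong action type ends the episode with zero reward are all assumptions made for this example, and image observations are omitted.

```python
import random

class TinyCTGraph:
    """Toy CT-graph: alternating wait states and decision points (illustrative only)."""

    WAIT = 0  # the single wait-action; act-actions are 1..branch

    def __init__(self, branch=2, depth=2, delay_prob=0.5, goal_path=None, seed=0):
        self.branch, self.depth, self.delay_prob = branch, depth, delay_prob
        # Act-actions (one per decision point) leading to the rewarded leaf (assumed default).
        self.goal_path = goal_path or [1] * depth
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.path = []            # act-actions taken so far
        self.at_decision = False  # episodes start in a wait state
        return self._observe()

    def _observe(self):
        # Observations reveal only the state type, not the position in the tree.
        return "decision" if self.at_decision else "wait"

    def step(self, action):
        """Return (observation, reward, done)."""
        if self.at_decision:
            if action == self.WAIT or action > self.branch:
                return "fail", 0.0, True        # assumed: wrong action type ends the episode
            self.path.append(action)
            self.at_decision = False            # each branch begins with a wait state
            return self._observe(), 0.0, False
        if action != self.WAIT:
            return "fail", 0.0, True            # act-action taken in a wait state
        if self.rng.random() < self.delay_prob:
            return self._observe(), 0.0, False  # self-loop: keep waiting
        if len(self.path) == self.depth:        # the last wait state leads to a leaf
            reward = 1.0 if self.path == self.goal_path else 0.0
            return "leaf", reward, True
        self.at_decision = True
        return self._observe(), 0.0, False
```

Because every wait state returns the same observation here, an agent cannot tell from the observation alone which part of the tree it is in, which is the source of partial observability discussed in the text.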
For each type of state, the environment provides either deterministic or stochastic observations, but all drawn from the specific subset of observations associated with one state type: thus, different states of the same type can provide the same observations. Referring to Fig. 1(B), S1, S3 and S4 (wait states) may all provide the same observation or different observations from the same set, which makes the problem a POMDP (source code for the CT-graph is provided at https://github.com/–anonymised). While a standard input size for DRL is the image size used with the Atari platform, we implemented a set of smaller images that allows for a considerably faster simulation while nevertheless requiring a feature extraction phase. The images are single-channel, scaled-up and rotated checker patterns of low-medium-high values (Fig. 1(C)). This choice of images was made to require feature learning and to test deep RL algorithms while maintaining low computational requirements.
While the problem is simple in its mathematical formulation, it can be made very difficult for RL to solve due to large reward sparsity and confounding observations. In fact, in the most complex graph that we tested, random exploration results in a reward only once in 84,000 episodes, or 2.7M steps (depth 2, branch 2, probability 0.1). Such a condition was sufficient to make most advanced RL algorithms currently known ineffective.
Non-observability and wait.
Wait states and decision points alternate while following a branch of the tree graph, and similarly the observations that depend on such states alternate and reoccur along the graph. As a result, neither the observations nor the state types provide an observable MDP. The delay probability of transitioning from a wait state to a decision point determines the expected delay between decision points, and between the last decision point and the reward at the end of the episode. Such a setup represents a decision-making process in which confounding observations (wait states) are more common than decision points. This situation can occur, for example, when a robot is traversing a new building. Corridors of a building may result in similar, seemingly random observations due to simple similarities or noise in the various sensors, while the critical decisions are taken at specific landmarks. A list of all states used in the CT-graph for this experiment is available in the supplementary materials.
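The link between the delay probability and the expected delay can be made concrete with a short calculation. The sketch below assumes that a wait state loops back to itself with a fixed probability at every step, independently of previous steps, so the number of steps spent in one wait state is geometrically distributed; the function names are hypothetical and chosen only for this example.

```python
import random

def expected_wait_steps(delay_prob):
    """Mean number of steps spent in one wait state (geometric distribution)."""
    return 1.0 / (1.0 - delay_prob)

def simulate_wait_steps(delay_prob, trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        steps = 1                       # at least one step is spent in the wait state
        while rng.random() < delay_prob:
            steps += 1                  # self-loop: one more step of waiting
        total += steps
    return total / trials
```

Under this assumption, a self-loop probability of 0.5 gives two steps per wait state on average, and values closer to one lengthen the gap between a decision and its reward accordingly.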
Comparison to other RL benchmarks.
Compared to other RL benchmarks such as various games (Atari, Minecraft) or control benchmarks (MuJoCo, lunar lander), the CT-graph differs in two aspects: (i) decision making is significantly more complex in a CT-graph and (ii) the CT-graph is visually simpler. For example, the majority of actions in the CT-graph result in episode termination with a reward of 0, while in other benchmarks the agent can take several actions before a termination occurs. Moreover, in some Atari games (e.g., Breakout) the agent is likely to stumble across a reward with a relatively high probability and a short sequence of actions. On the contrary, in the CT-graph, the agent has to complete a longer sequence of actions to reach a reward, which is consequently more sparse. The high sparsity of the reward makes exploration less effective. Secondly, while our visual input is simpler than in other benchmarks, some important aspects of visual complexity are retained: states associated with decisions are similar to states associated with waiting points, so very similar-looking states require different decisions.
The modulated Hebbian plus Q network architecture (MOHQA)
The proposed architecture (MOHQA) is composed of two main parts: a deep Q-network (DQN) [mnih2015human], and a modulated Hebbian network (MOHN) that is plugged into the Q-network as a parallel unit to the DQN head (Fig. 2).
The DQN follows the standard implementation [mnih2015human]. A standard DQN is used to approximate the optimal Q-value function, or optimal action-value function, defined as
$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s,\ a_t = a,\ \pi \right],$$
where $r_t$ is a reward at time $t$, $s_t$ is a state at time $t$, $a_t$ is an action at time $t$, $\gamma$ is the discount factor, and $\pi$ is the policy. To solve instability issues caused by representing action-value pairs with the network, [mnih2015human] proposed two innovations: (i) experience replay and (ii) a target network that is only periodically updated. The parameters of the target and base network are $\theta^-$ and $\theta$. The parameters $\theta^-$ are updated with $\theta$ every $C$ steps.
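The role of the periodically updated target network can be illustrated with a small NumPy sketch. It is a generic illustration of the standard DQN bootstrap target and parameter copy, not code from this work; the names `td_targets` and `TargetSync`, the dictionary of parameters, and the default discount factor are assumptions for the example.

```python
import numpy as np

def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """Bootstrap targets r + gamma * max_a' Q_target(s', a'); no bootstrap at terminal steps."""
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

class TargetSync:
    """Copies the base-network parameters into the target parameters every `period` steps."""

    def __init__(self, params, period):
        self.period = period
        self.steps = 0
        self.target = {k: v.copy() for k, v in params.items()}

    def step(self, params):
        """Call once per training step; returns the current target parameters."""
        self.steps += 1
        if self.steps % self.period == 0:
            self.target = {k: v.copy() for k, v in params.items()}
        return self.target
```

Keeping the target parameters frozen between copies means the regression target does not move with every gradient step, which is the stabilizing effect attributed to the target network.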
The output of the DQN body is a set of features that are used as input to both the MOHN and DQN heads. The layers and sizes of the DQN body are summarized in Table 1.
The DQN head is the final layer of the DQN network. It interprets the features to produce Q-values. Both the DQN head and the DQN body are trained by backpropagation to minimize the TD error [mnih2015human].
Associating stimulus-action pairs with distal rewards by means of MOHN
The MOHN in this work is an adaptation of a bio-inspired, unsupervised and modulated Hebbian network proposed in [izhikevich2007solving, soltoggio2013solving]. In those studies, such networks were shown to cope with sparse rewards outside an RL framework, i.e., when neither states nor TD errors are defined. These properties are useful when an agent cannot infer the state of the MDP, and thus a TD error cannot be correctly computed and used to propagate the Q function updates. This is generally true even with memory-augmented DQN such as RDQN: in fact, if the history of observations maps to a large space due to stochastic observations, learning history-action values is ineffective. The confusion results from memorized observations appearing at different positions throughout the graph in different episodes, so that memorized observations cannot be used to determine which action to take in the next state. For example, an observation that appears just before a decision point in one episode (implying that the agent should next take an act-action) can appear before another wait state in the next episode, confusing the decision making. The MOHN solves this problem by using a learning rule based on modulated Hebbian plasticity with eligibility traces that uses raw reward signals. The result is that observation-action pairs are associated with later rewards by means of traces that update the weights that caused an action and “ignore” intervening events between actions and rewards.
STDP-inspired plasticity and neural eligibility traces (MOHN)
The neural traces used in this study implement two principles: (i) causal relationships between observations and actions, and (ii) sparse correlations. The causal relationships are derived by applying the Hebbian multiplication rule to successive, rather than simultaneous, simulation steps, so that the Hebbian term captures the contribution of presynaptic activity to the activity of a postsynaptic neuron, similarly to the STDP rule. This is also sometimes called an asymmetrical learning window [kempter1999hebbian]. Sparse correlations are explicitly imposed by selecting only a small top and bottom percentage of correlations/decorrelations [soltoggio2013rare]. A modified Hebbian term between a presynaptic neuron $i$ and a postsynaptic neuron $j$ is updated according to the equation:
$$\mathrm{Hebb}_{ij}(t) = v_i(t - \Delta t)\, v_j(t),$$
where $v_i$ and $v_j$ are the output values of the pre- and postsynaptic neurons, equivalent to the input and output layers in the MOHN. The input to the MOHN is the output of the DQN body (Fig. 2) minus its own running average, to enhance the detection of changes in the feature space. The traces $E_{ij}$ decay with time with a time constant $\tau_E$.
The modulatory signal is the reward plus a small baseline modulation, i.e., $m(t) = r(t) + b_m$, and is used to multiply the traces to obtain the weight update:
$$\Delta w_{ij}(t) = m(t)\, E_{ij}(t).$$
The weights are clipped in the interval [-1,1] to contain Hebbian updates [miller1994role, soltoggio2012modulated].
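The plasticity steps described in this subsection can be put together in a short NumPy sketch. It is one possible reading of the description, not the authors' implementation: the function name `mohn_update`, the unit increments for selected correlations, the percentile-based selection, the learning rate `lr`, and all default values (`tau`, `baseline`, `top_pct`) are assumptions introduced for illustration.

```python
import numpy as np

def mohn_update(w, traces, v_in_prev, v_out, reward,
                tau=10.0, baseline=0.0, top_pct=1.0, lr=0.1):
    """One plasticity step; returns the updated (weights, traces)."""
    # Hebbian term across successive steps: earlier presynaptic x current postsynaptic.
    hebb = np.outer(v_out, v_in_prev)             # shape (n_out, n_in)
    # Sparse correlations: only the extreme percentiles change eligibility.
    hi = np.percentile(hebb, 100.0 - top_pct)
    lo = np.percentile(hebb, top_pct)
    rare = np.where(hebb > hi, 1.0, 0.0) - np.where(hebb < lo, 1.0, 0.0)
    # Traces decay exponentially and are incremented by the rare events.
    traces = traces * np.exp(-1.0 / tau) + rare
    # The modulatory signal is the raw reward plus a small baseline (value assumed).
    modulation = reward + baseline
    # Weights are clipped to [-1, 1] to contain the Hebbian updates.
    w = np.clip(w + lr * modulation * traces, -1.0, 1.0)
    return w, traces
```

Because the traces persist across steps, a reward that arrives several steps after an action still changes the weights that were eligible when the action was taken, while steps with zero modulation leave the weights untouched.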
It is worth noting that the MOHN is capable of intrinsic exploration and exploitation dynamics. Exploitation derives from the typical Hebbian dynamics that reinforce established behaviors [hebb2005organization] by leading weights to saturation [miller1994role]. Behaviors that do not lead to reward, instead, cause weight depression that consequently leads to noise-driven exploration. Noise in the system can be added either at the input level (noise on the pixels of the image) or at the feature level. In both cases, noise facilitates the exploration of weight configurations [soltoggio2012modulated].
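Finally, the output of the MOHN, $\mathbf{v}^{\text{out}}$, is computed as:
$$\mathbf{v}^{\text{out}} = H\big(\sigma(\mathbf{w}\,\mathbf{v}^{\text{in}})\big),$$
where $\mathbf{w}$ is the MOHN weight matrix, $\mathbf{v}^{\text{in}}$ is the MOHN input, $H$ is a function that returns a one-hot vector with the 1 value at the index of the maximum value, and $\sigma$ is a sigmoid function. The one-hot function has the purpose of facilitating the increase of the traces for the weights that are afferent to the action-triggering neuron.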
The idea of summing the output of the DQN head with the output of the MOHN is to provide the overall architecture with a set of additional high-level hints that highlight cause-effect relationships across time gaps, to help decisions when standard DQN fails. Thus, the MOHQA Q-values, $Q_{\text{MOHQA}}$, are defined as
$$Q_{\text{MOHQA}}(s_t \approx o_t, a) = Q_{\text{DQN}}(o_t, a) + v^{\text{out}}_a(t),$$
where $v^{\text{out}}_a$ is the MOHN output for action $a$ and $s_t \approx o_t$ indicates that an observation is used to approximate the state, even when this is incorrect due to partial observability. As in standard approaches, the action is chosen as
$$a_t = \arg\max_{a} Q_{\text{MOHQA}}(o_t, a).$$
Crucially, the loss function is computed from the difference between the Q-value of the best action, as indicated by the sum of the DQN and MOHN outputs, and the Q-value indicated by the DQN.
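The way the two heads are combined can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the MOHN output is taken to be the one-hot of a sigmoid applied to the weighted features, and the loss shown is only one plausible reading of the verbal description, in which the bootstrap value comes from the combined Q-values and the prediction comes from the DQN head alone.

```python
import numpy as np

def mohn_output(w, features):
    """One-hot vector marking the most active MOHN output neuron."""
    activation = 1.0 / (1.0 + np.exp(-(w @ features)))  # sigmoid
    one_hot = np.zeros_like(activation)
    one_hot[np.argmax(activation)] = 1.0
    return one_hot

def mohqa_q_values(q_dqn, w, features):
    """Sum of the DQN-head Q-values and the MOHN output."""
    return q_dqn + mohn_output(w, features)

def select_action(q_dqn, w, features):
    """Greedy action with respect to the combined Q-values."""
    return int(np.argmax(mohqa_q_values(q_dqn, w, features)))

def mohqa_loss(q_dqn, action, reward, done, q_mohqa_next, gamma=0.99):
    """Squared error between a target built from the combined Q-values and the DQN Q-value."""
    target = reward + gamma * (1.0 - done) * np.max(q_mohqa_next)  # assumed form of the target
    return (target - q_dqn[action]) ** 2
```

In this reading, when the DQN head is indifferent between act-actions, the unit bonus from the MOHN breaks the tie, and the DQN is then trained toward the value of the action that the combined system prefers.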
This section reports the analysis of how the MOHQA uses the features to solve the POMDP problems and then presents a comparison of the performance with the DQN, QRDQN+LSTM, REINFORCE and A2C algorithms.
Analysis of feature learning.
To better understand the need for a new loss function, three experiments are performed (Fig. 3) to show feature learning by DQN (A) in a one-decision-point CT-graph, (B) in a two-decision-point CT-graph with the traditional loss function, and (C) in a two-decision-point CT-graph with the newly proposed loss function. In the first experiment, the baseline DQN attempted to learn a CT-graph with one decision point and two wait states, one before and one after the decision point. Fig. 3(A) shows the learned feature space output throughout one episode. As expected, DQN learns similar high-level features from different observations if those require the same action. In particular, wait states that require the wait-action are distinguishable from decision points that require act-actions. However, TD learning cannot propagate the reward values backwards because the wait states (before and after the decision point) are not distinguished.
A similar situation is observed in a longer CT-graph with two decision points in Fig. 3(B). In this case, the two decision points had unique observations, thus making the problem observable, but only at decision points. DQN learns the same features across wait states and the same features across decision points. This is reasonable because these two state types require either the wait-action (wait states) or one of the act-actions (decision points). However, not only do the observations prevent propagation of the TD error, but the feature space also does not allow the first and second decision points to be distinguished. This confusion between the first and second decision point highlights an interesting consideration: if DQN cannot learn the path to the reward, it also cannot learn the separate features that would enable correct decisions.
Finally, in the third experiment, the MOHQA is used on the same two-decision-point CT-graph. In Fig. 3(C) the feature space clearly shows a difference between the first and second decision point. The MOHQA, by suggesting optimal actions to the DQN, was also able to lead the DQN to learn different features for different decision points, which DQN alone could not achieve. In this last test, the output values reveal the inner workings of the MOHQA architecture: DQN suggests the wait-action at wait states, and expresses equal preferences for both act-actions at decision points. The MOHN contributes by biasing the decision towards the act-action that is associated with the future reward.
Simulations were performed for a range of CT-graphs and compared with benchmark algorithms (the simulation code will be made available at https://github.com/pladosz/MOHQA.git). For all simulations, we used the same baseline modulation and sparse correlation threshold, and a decay factor of 14 and 48 steps for the one-decision-point and two-decision-point CT-graphs, respectively.
In a one-decision-point CT-graph with variable delay (Fig. 4), all RL algorithms solve an MDP version in which all observations are unique (Fig. 4(a)). In Fig. 4(b), the CT-graph is made a POMDP by removing uniqueness from the wait states. This is the case where confounding observations affect the problem. The MOHQA is able to solve the problem while the other approaches are not. Note that the simplest of the CT-graphs (Fig. 4(a) and (b) and Fig. 4(c)(a)) do not require a feature extractor due to the limited number of states used in the problem. The feature extractor is kept for generality of the approach and for fair comparison with more complex CT-graphs.
For the two-decision-point CT-graph, three tests are performed: (i) a fully observable MDP (Fig. 4(c)(a)), (ii) a POMDP in which a wait state produces one of 64 observations (Fig. 4(c)(b)), and (iii) a POMDP in which a wait state produces one of the 64 observations, with a different delay setting (Fig. 4(c)(c)). The two-decision-point CT-graph (Fig. 4(c)) shows very similar trends to the one-decision-point CT-graph, but reveals the increased complexity of the problem, with the baseline algorithms failing.
The deep reinforcement learning problems posed in this paper appear trivial, and yet, state-of-the-art RL algorithms struggle to find good solutions. The particular and yet common challenge derives from a non-Markov problem in which the history might not be useful to reconstruct the hidden MDP due to the large space of possible histories. It is worth noting that, although it is a POMDP, the CT-graph was configured to allow for solutions with agents that do not have memory. This is an interesting consideration because it is commonly assumed that DQN cannot solve a hidden MDP without a memory unit (e.g., an LSTM), but the QRDQN+LSTM baseline shows that such an approach also struggles. The memory-based approach is likely to fail to learn correctly due to the challenge of propagating the TD error, which is essential to train the LSTM, and therefore it fails to learn distinguishing features for different observations.
The current proof-of-concept MOHQA is tested on a limited set of POMDP problems and compared with a limited number of benchmark algorithms. Further tests with other POMDP problems, e.g., the Morris water maze and some Atari games, will be essential to test the full potential of the MOHQA. Yet, the present work suggests that a fundamentally simple test of POMDP deep RL problems offers insights into the challenging problem of simultaneously learning a feature space and a policy with delayed rewards and confounding observations.
The proposed architecture, while proving effective and posing a new learning paradigm, has some limitations. The MOHQA is more complex than a standard DQN network. However, to the best of the authors’ knowledge, this is the first successful attempt to combine a modulated unsupervised Hebbian network with a DQN network. The MOHQA solves a POMDP without memory because non-observability is limited to the wait states, while the observations from decision points correspond to states. In future developments, a promising direction is to add memory to the MOHQA to enable it to solve even more complex POMDP problems. The time constant of the traces determines how far back rewards can be associated with observation-action pairs. A fast-decaying trace cannot capture long delays; a slow-decaying trace means that too many observation-action pairs are credited, resulting in either slow or unstable learning.
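The trade-off in choosing the trace time constant can be quantified with a one-line calculation. The sketch assumes a purely exponential decay with time constant `tau` measured in simulation steps; the function name is hypothetical.

```python
import math

def trace_fraction_left(delay_steps, tau):
    """Fraction of an eligibility trace remaining after `delay_steps` steps."""
    return math.exp(-delay_steps / tau)
```

Under this assumption, a trace with a time constant of 14 steps retains about 37% of its value after 14 steps but only about 3% after 48 steps, which illustrates why longer action-reward delays call for slower-decaying traces.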
Finally, an interesting consideration is that the MOHN appears to be instrumental in guiding DQN to learn useful features. Due to the confounding observations that are provided during wait states, the baseline algorithms struggle to learn useful features of the decision points, which are instrumental to inform an optimal policy. Thus, learning the appropriate features depends on the actions, which in turn depend on the features. While this chicken-and-egg problem is typical in deep reinforcement learning, the proposed MOHQA appears to facilitate the learning of useful features by discovering cause-effect relationships and offering guidance to the underlying DQN architecture.
This paper proposed a new neural architecture (MOHQA) for deep reinforcement learning. The key novelty is the integration of a standard DQN architecture and a modulated Hebbian learning network with neural eligibility traces. The objective was to provide an RL agent with the ability to ignore confounding observations during delays and associate key cause-effect relationships with delayed rewards. The parallel architecture was shown to enhance the standard DQN with the ability to learn in challenging POMDPs where a number of state-of-the-art approaches fail (DQN, A2C, QRDQN+LSTM). While this is the first proof-of-concept study to propose a combined Hebbian and backpropagation-learned architecture for deep reinforcement learning, the promising results encourage further tests on a wider range of standard deep RL benchmarks.
This material is based upon work supported by the United States Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-18-C-0103.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA).