Model-free deep reinforcement learning algorithms have been applied in a range of challenging domains, from games to robotic control [that_matters]. The combination of RL with high-capacity function approximators such as neural networks holds the promise of automating a wide range of decision making and control tasks, but widespread adoption of these methods in real-world domains has been hampered by three major challenges [world_model].
First, these methods assume the problem is a Markov Decision Process (MDP), but real-world problems seldom satisfy the Markov property: they often require dealing with partially observed states, and it is in general very challenging to construct and infer hidden states, as they often depend on the agent's entire interaction history and may require substantial domain knowledge.
Second, model-free methods suffer from low sample efficiency: even simple tasks require millions of interactions with the environment, and complex behaviors with high-dimensional observations may need substantially more.
Third, these methods are often brittle with respect to their hyper-parameters, which means the parameters must be tuned carefully; most importantly, they often get stuck in local optima. In many cases they fail to find a reward signal, and even when the reward signal is relatively dense, they may still fail to find the optimal solution, so the reward function often has to be handcrafted; some researchers design a complex reward function for each environment they want to solve [design].
In this paper, we propose a new hybrid approach to using deep learning to tackle complex tasks. Recurrent neural networks (RNNs) for reinforcement learning (RL) have shown distinct advantages, e.g., solving memory-dependent tasks and meta-learning. However, little effort has been spent on improving RNN architectures and on understanding the underlying neural mechanisms for the performance gain.
We investigate a deep-learning approach to learning the representation of states in partially observable tasks, with minimal prior knowledge of the domain. In particular, we propose a new family of hybrid models that combines the strengths of both supervised learning and reinforcement learning, trained in a joint fashion: the supervised learning component can be a recurrent neural network (RNN) combined with different heads, providing an effective way of learning the representation of hidden states. The RL component is a soft actor-critic [SAC] that learns to optimize the control for maximizing long-term rewards. Furthermore, we design the model together with a curiosity bonus, which leads to better representations and exploration. Extensive experiments on both POMDP and MDP tasks demonstrate the effectiveness and advantages of the proposed approach, which performs the best among a set of previous state-of-the-art methods.
First, we employ recurrent neural networks with GRU cells to learn the representation of the state for RL. Since these recurrent models can aggregate partial information from the past and capture long-term dependencies in sequential information, their performance is expected to be superior to the context-window-based approach used in the DQN model of [DQN].
Second, in order to best leverage supervision signals in the training data, the proposed hybrid approach combines the strengths of both supervised learning and RL. In particular, the model in our hybrid approach is jointly learned using stochastic gradient descent (SGD): in each iteration, the representation of hidden states is first inferred using supervision signals (i.e., the next observation and reward) in the training data; then, the Q-function is updated using SAC, which takes the learned hidden states as input. The superiority of the hybrid approach is validated in extensive experiments on a benchmark suite.
Third, by jointly training the model and the RL algorithm, we obtain a good representation that captures the underlying state, which effectively turns the POMDP into an MDP.
Last, to avoid local optima and encourage our agent to explore more, we add a curiosity bonus to the standard RL objective, i.e., we use both intrinsic and extrinsic rewards; having a separate reward signal makes it easier for our agent to capture the reward and find the optimal solution.
Our reinforcement learning problem can be defined as policy search in a Partially Observable Markov Decision Process (POMDP) given by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \Omega, \mathcal{O})$ [planet]. The underlying Markov Decision Process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where:
$\mathcal{S}$ is the set of states;
$\mathcal{A}$ is the set of actions;
$\mathcal{T}(s' \mid s, a)$ is the transition function mapping state-action pairs to probability distributions over next states;
$\mathcal{R}(s, a)$ is the reward function, with $r_t = \mathcal{R}(s_t, a_t)$.
In addition, $\Omega$ gives the set of observations potentially received by the agent, and $\mathcal{O}(o \mid s)$ is the observation function mapping (unobserved) states to probability distributions over observations.
Within this framework, the agent receives an observation $o_t \in \Omega$, which may only contain partial information about the underlying state $s_t \in \mathcal{S}$. When the agent takes an action $a_t$, the environment responds by transitioning to state $s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t)$ and giving the agent a new observation $o_{t+1} \sim \mathcal{O}(\cdot \mid s_{t+1})$ and reward $r_{t+1}$.
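To make these definitions concrete, here is a minimal toy POMDP sketch in NumPy. All transition, observation, and reward tables are illustrative inventions for exposition, not part of the paper's benchmarks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy POMDP: 3 hidden states, 2 actions, 2 observations (numbers are illustrative).
# T[s, a] is a distribution over next states; O[s] is a distribution over observations.
T = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]],
              [[0.0, 0.1, 0.9], [0.3, 0.3, 0.4]]])
O = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]])
R = np.array([[0.0, 0.1], [0.5, 0.5], [1.0, 0.9]])  # reward for each (state, action)

def step(s, a):
    """One environment step: the agent receives (o, r) but never sees s."""
    s_next = rng.choice(3, p=T[s, a])
    o_next = rng.choice(2, p=O[s_next])
    return s_next, o_next, R[s, a]

s, history = 0, []
for t in range(5):
    a = int(rng.integers(2))
    s, o, r = step(s, a)
    history.append((o, r))  # the agent's information is only this observation history
```

Because only `o` and `r` reach the agent, any state representation must be inferred from `history`, which is exactly the role the RNN plays later in the paper.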
Although there are many approaches to RL in POMDPs, we focus on using recurrent neural networks (RNNs) with backpropagation through time (BPTT) to learn a representation that disambiguates the true state of the POMDP. The Deep Q-Network agent (DQN) [DQN] learns to play games from the Atari-57 benchmark by stacking 4 consecutive frames as observations and training a convolutional network to represent a value function with Q-learning, from data continuously collected in a replay buffer. Other algorithms, like A3C, use an LSTM and are trained directly on the online stream of experience without using a replay buffer. The authors of [RDPG] combined DDPG with an LSTM by storing sequences in replay and initializing the recurrent state to zero during training.
We consider a partially observable Markov decision process (POMDP). We define a discrete time step $t$, hidden states $s_t$, image observations $o_t$, continuous action vectors $a_t$, and scalar rewards $r_t$, that follow the stochastic dynamics
$$s_t \sim p(s_t \mid s_{t-1}, a_{t-1}), \qquad o_t \sim p(o_t \mid s_t), \qquad r_t \sim p(r_t \mid s_t).$$
Whatever the choice of return measure (whether infinite-horizon discounted, or finite-horizon undiscounted), and whatever the choice of policy, the goal in RL is to select a policy which maximizes expected return when the agent acts according to it.
To talk about expected return, we first have to talk about probability distributions over trajectories. Let's suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a $T$-step trajectory is:
$$P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} \mathcal{T}(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t),$$
where $\rho_0$ is the start-state distribution. The expected return (for whichever measure), denoted by $J(\pi)$, is then:
$$J(\pi) = \int_\tau P(\tau \mid \pi)\, R(\tau) = \mathop{\mathbb{E}}_{\tau \sim \pi}\big[R(\tau)\big].$$
2.2 Entropy-Regularized Reinforcement Learning
Standard RL maximizes the expected sum of rewards as given in Equation 2. We consider a more general maximum entropy objective [entropy_rl], which favors stochastic policies by augmenting the objective: in entropy-regularized reinforcement learning, the agent gets a bonus reward at each time step proportional to the entropy of the policy at that time step. This changes the RL problem to:
$$\pi^* = \arg\max_\pi \mathop{\mathbb{E}}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(R(s_t, a_t, s_{t+1}) + \alpha H(\pi(\cdot \mid s_t))\big)\Big],$$
where $\alpha > 0$ is the trade-off coefficient. (Note: we assume an infinite-horizon discounted setting here, and we do the same for the rest of this section.) We can now define the slightly different value functions in this setting. $V^\pi$ is changed to include the entropy bonuses from every time step:
$$V^\pi(s) = \mathop{\mathbb{E}}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(R(s_t, a_t, s_{t+1}) + \alpha H(\pi(\cdot \mid s_t))\big) \,\Big|\, s_0 = s\Big].$$
$Q^\pi$ is changed to include the entropy bonuses from every time step except the first:
$$Q^\pi(s, a) = \mathop{\mathbb{E}}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^t H(\pi(\cdot \mid s_t)) \,\Big|\, s_0 = s, a_0 = a\Big].$$
With these definitions, $V^\pi$ and $Q^\pi$ are connected by:
$$V^\pi(s) = \mathop{\mathbb{E}}_{a \sim \pi}\big[Q^\pi(s, a)\big] + \alpha H\big(\pi(\cdot \mid s)\big),$$
and the Bellman equation for $Q^\pi$ is
$$Q^\pi(s, a) = \mathop{\mathbb{E}}_{s' \sim \mathcal{T},\, a' \sim \pi}\Big[R(s, a, s') + \gamma \big(Q^\pi(s', a') + \alpha H(\pi(\cdot \mid s'))\big)\Big] = \mathop{\mathbb{E}}_{s' \sim \mathcal{T}}\big[R(s, a, s') + \gamma V^\pi(s')\big].$$
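The connection between the soft value functions can be checked numerically. The sketch below uses a made-up discrete policy and Q-values in a single state (illustrative numbers only):

```python
import numpy as np

alpha = 0.2  # illustrative trade-off coefficient
Q = np.array([1.0, 2.0, 0.5])   # Q(s, a) for 3 actions in one state
pi = np.array([0.2, 0.7, 0.1])  # stochastic policy pi(a | s)

entropy = -np.sum(pi * np.log(pi))
# V(s) = E_{a~pi}[Q(s, a)] + alpha * H(pi(. | s))
V = np.sum(pi * Q) + alpha * entropy
# Same identity with the definition of entropy pushed inside the expectation:
V_alt = np.sum(pi * (Q - alpha * np.log(pi)))
```

The two expressions agree term by term, which is the rewrite used later when deriving the Q-function targets.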
There are arguably five key elements of intelligence: prediction, curiosity, intuition, memory, and inference. We can realize these functions within a deep learning framework using an RNN-based architecture. In the real world, the full state is rarely provided to the agent; in other words, the Markov property rarely holds in real-world environments. A Partially Observable Markov Decision Process (POMDP) better captures the dynamics of many real-world environments by explicitly acknowledging that the sensations received by the agent are only a partial glimpse of the underlying system state. Real-world tasks often feature incomplete and noisy state information resulting from partial observability.
The main structure of our agent contains three parts: the RNN, the model head, and the intuition head; each part has its own function.
We use the RNN for inference and memory, i.e., learning $h_t = f(h_{t-1}, o_t, a_{t-1})$. Since at each time step we only provide $o_t$ instead of $s_t$, it is crucial for the RNN to store past information in a hidden state $h_t$; we call this process inference. Most importantly, $h_t$ is both the hidden state of the RNN and the state we use as the input of our RL algorithm. It is hard to say that $h_t$ is the "true" state satisfying the Markov property, but we push $h_t$ toward the "true" state by jointly optimizing with the intuition head and the model head (discussed further in Section 3.4).
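The hidden-state recurrence can be sketched as follows. We use a vanilla tanh RNN cell for brevity (the paper's agent uses a GRU), and all dimensions and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, hid_dim = 4, 2, 8

# Single-layer cell: h_t = tanh(W @ [h_{t-1}; o_t; a_{t-1}] + b)
W = rng.normal(0.0, 0.1, (hid_dim, hid_dim + obs_dim + act_dim))
b = np.zeros(hid_dim)

def rnn_step(h_prev, o_t, a_prev):
    x = np.concatenate([h_prev, o_t, a_prev])
    return np.tanh(W @ x + b)

h = np.zeros(hid_dim)  # zero start state
for t in range(10):
    o = rng.normal(size=obs_dim)   # partial observation at step t
    a = rng.normal(size=act_dim)   # previous action
    h = rnn_step(h, o, a)          # h_t summarizes the whole interaction history
```

The final `h` is the learned stand-in for the unobserved state, and it is this vector that the model head and intuition head consume.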
The core idea of the model head is to provide a prediction of the future and use the prediction error as a curiosity bonus. Understanding and predicting the future is fundamental to humans; by putting a model in our agent we can get a better representation of the state and avoid getting stuck in local optima.
As for the intuition head, its key function is decision making, which we achieve using SAC, an algorithm based on the actor-critic framework with entropy regularization; we combine the original algorithm with our framework so that it can adapt to the POMDP setting, having a model and an intrinsic reward.
3.1.1 Inference and Memory
The main function of the RNN is to provide memory and inference. Due to partial observability, we cannot use the observation at step $t$ directly to make decisions or predictions, so we need an inference model to encode observations and actions into a state.
Since the model is non-linear, we cannot directly compute the state posteriors that are needed for parameter learning, but we can optimize the function approximator by backpropagating losses through the RNN cell. Gradients coming from the policy head are blocked, and only gradients originating from the Q-network head and the model head are allowed to back-propagate into the RNN. We block gradients from the policy head for increased stability, as this avoids positive feedback loops between $\pi$ and $Q$ caused by shared representations.
We further discuss the choice of backing up the value loss and the model loss in Section 3.4; in a word, training the RNN jointly with the model and the value function captures a better representation of the state.
3.1.2 Initialization Strategy
In order to achieve good performance in a partially observed environment, an RL agent requires a state representation that encodes information about its state-action trajectory in addition to its current observation. The most common way to achieve this is by using an RNN as part of the agent's state encoding. To train an RNN from replay and enable it to learn meaningful long-term dependencies, whole state-action trajectories need to be stored in replay and used for training the network. Recent work [R2D2] compared four strategies for training an RNN from replayed experience:
Using a zero start state to initialize the network at the beginning of sampled sequences.
Replaying whole episode trajectories.
Stored state: Storing the recurrent state in replay and using it to initialize the network at training time. This partially remedies the weakness of the zero start state strategy, however it may suffer from the effect of ‘representational drift’ leading to ‘recurrent state staleness’, as the stored recurrent state generated by a sufficiently old network could differ significantly from a typical state produced by a more recent version.
Burn-in: Allow the network a ‘burn-in period’ by using a portion of the replay sequence only for unrolling the network and producing a start state, and update the network only on the remaining part of the sequence. We hypothesize that this allows the network to partially recover from a poor start state (zero, or stored but stale) and find itself in a better initial state before being required to produce accurate outputs.
The zero start state strategy’s appeal lies in its simplicity and it allows independent decorrelated sampling of relatively short sequences, which is important for robust optimization of a neural network. On the other hand, it forces the RNN to learn to recover meaningful predictions from an atypical initial recurrent state (‘initial recurrent state mismatch’), which may limit its ability to fully rely on its recurrent state and learn to exploit long temporal correlations. The second strategy, on the other hand, avoids the problem of finding a suitable initial state, but creates a number of practical, computational, and algorithmic issues due to varying and potentially environment-dependent sequence length, and higher variance of network updates because of the highly correlated nature of states in a trajectory when compared to training on randomly sampled batches of experience tuples.[RDPG] observed little difference between the two strategies for empirical agent performance on a set of Atari games, and therefore opted for the simpler zero start state strategy. One possible explanation for this is that in some cases, an RNN tends to converge to a more ‘typical’ state if allowed a certain number of ‘burn-in’ steps, and so recovers from a bad initial recurrent state on a sufficiently long sequence. We also hypothesize that while the zero start state strategy may suffice in the most fully observable Atari domain, it prevents a recurrent network from learning actual long-term dependencies in more memory-critical domains
In all our experiments we use the proposed agent architecture with replay sequences of a fixed length, with an optional burn-in prefix.
The burn-in strategy on its own partially mitigates the staleness problem on the initial part of replayed sequences. Empirically, this translates into noticeable performance improvements, as the only difference between the pure zero-state and the burn-in strategy lies in the fact that the latter unrolls the network over a prefix of states on which the network does not receive updates. The beneficial effect of burn-in is that it prevents 'destructive updates' to the RNN parameters resulting from highly inaccurate initial outputs on the first few time steps after a zero state initialization. The stored-state strategy, on the other hand, proves to be overall much more effective at mitigating state staleness in terms of the Q-value discrepancy, which also leads to clearer and more consistent improvements in empirical performance. Finally, the combination of both methods consistently yields the smallest discrepancy in the last sequence states and the most robust performance gains.
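A minimal sketch of how a replayed sequence is split under the burn-in strategy (the sequence length and burn-in length here are arbitrary stand-ins, not the paper's settings):

```python
import numpy as np

seq_len, burn_in = 40, 10

def split_burn_in(seq, burn_in):
    """The prefix is only used to unroll the RNN and warm up its recurrent state
    (no parameter updates); losses are computed on the remainder of the sequence."""
    return seq[:burn_in], seq[burn_in:]

sequence = np.arange(seq_len)  # stand-in for a stored state-action trajectory slice
prefix, train_part = split_burn_in(sequence, burn_in)
```

In the stored-state variant, the recurrent state saved in replay would initialize the unroll instead of (or in addition to) the zero state before this split is applied.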
3.2 Model Learning (Model Head)
Humans develop a mental model of the world based on what they are able to perceive with their limited senses, and the decisions and actions we make are based on this internal model. While the role of the intuition head is to compress what the agent sees at each time frame, we also want to compress what happens over time. For this purpose, the role of the model is to predict the future: the model head serves as a predictive model of the future state.
The transition dynamics are modeled with a feed-forward neural network, using the standard practice of training the network to predict the change in state (rather than the next state) given a state and an action as inputs. This relieves the network from memorizing the input state, especially when the change is small [metrpo]. We denote the function approximator for the next state, which is the sum of the input state and the output of the neural network, as $\hat{f}_\theta(s_t, a_t) = s_t + f_\theta(s_t, a_t)$.
The loss for model learning is the one-step prediction loss
$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \big\| s_{t+1} - \hat{f}_\theta(s_t, a_t) \big\|^2,$$
where $\mathcal{D}$ is the training dataset that stores the transitions the agent has experienced.
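A sketch of the delta-state model and its one-step prediction loss, with a linear map standing in for the feed-forward network (all dimensions and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, act_dim = 3, 1

# Illustrative linear "network" f(s, a) predicting the CHANGE in state.
Wf = rng.normal(0.0, 0.1, (state_dim, state_dim + act_dim))

def predict_next(s, a):
    delta = Wf @ np.concatenate([s, a])  # network output: predicted state change
    return s + delta                     # next-state estimate = input state + delta

def one_step_loss(batch):
    # batch: list of (s, a, s_next) transitions drawn from the replay dataset
    errs = [np.sum((s_next - predict_next(s, a)) ** 2) for s, a, s_next in batch]
    return float(np.mean(errs))

batch = [(rng.normal(size=state_dim), rng.normal(size=act_dim),
          rng.normal(size=state_dim)) for _ in range(8)]
loss = one_step_loss(batch)
```

Because the network only outputs the residual, a zero output already yields the identity prediction `s_next = s`, which is the point of the delta parameterization.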
As human agents, we are accustomed to operating with rewards that are so sparse that we only experience them once or twice in a lifetime, if at all. To a three-year-old enjoying a sunny Sunday afternoon on a playground, most trappings of modern life (college, a good job, a house, a family) are so far into the future that they provide no useful reinforcement signal. Yet, the three-year-old has no trouble entertaining herself on that playground using what psychologists call intrinsic motivation [rnd] or curiosity [ICM]. Motivation and curiosity have been used to explain the need to explore the environment and discover novel states. More generally, curiosity is a way of learning new skills which might come in handy for pursuing rewards in the future.
There are many types of curiosity, but the most fundamental one is curiosity about things we cannot predict correctly [curiosity]. Seeking a good model to predict the future is close to the origin of science itself, so it is a natural idea to let our agent predict the future and use the prediction error as a curiosity bonus, a so-called curiosity-driven intrinsic reward signal.
Our agent is composed of two subsystems:
a reward generator that outputs a curiosity-driven intrinsic reward signal
and a policy that outputs a sequence of actions to maximize that reward signal.
In addition to intrinsic rewards, the agent optionally may also receive some extrinsic reward from the environment. Let the intrinsic curiosity reward generated by the agent at time $t$ be $r^i_t$ and the extrinsic reward be $r^e_t$. The policy sub-system is trained to maximize the sum of these two rewards.
In practice we scale the intrinsic reward by a parameter $\beta$, i.e., $r_t = r^e_t + \beta\, r^i_t$. As learning progresses, the model loss should decay to zero, but this is often not the case: due to complex environment dynamics, it is very hard to make perfect predictions, if that is possible at all. But we cannot let our agent keep seeking such unpredictable states forever, so we decay $\beta$ to make sure we end up with a good policy.
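The reward combination and the decay of the intrinsic-reward strength can be sketched as follows; the exponential schedule and its constants are our own illustrative choice, not a prescription from the paper:

```python
import numpy as np

def total_reward(r_ext, r_int, beta):
    """Training reward: extrinsic reward plus a beta-scaled curiosity bonus."""
    return r_ext + beta * r_int

def beta_schedule(step, beta0=0.5, decay=0.999):
    """Decay the intrinsic strength over training: explore early, exploit late."""
    return beta0 * decay ** step

# The curiosity bonus is the model's one-step prediction error.
pred_next = np.array([0.1, 0.2])
true_next = np.array([0.0, 0.25])
r_int = float(np.sum((true_next - pred_next) ** 2))  # 0.01 + 0.0025 = 0.0125
r = total_reward(r_ext=1.0, r_int=r_int, beta=beta_schedule(0))
```

As `beta_schedule(step)` shrinks, the training reward converges to the extrinsic reward alone, matching the explore-then-exploit behavior described above.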
3.3 Intuition Head
We use function approximators for both the soft Q-function and the policy, and instead of running evaluation and improvement to convergence, we alternate between optimizing both networks with stochastic gradient descent. We consider two parameterized soft Q-functions $Q_{\phi_1}, Q_{\phi_2}$ and a tractable policy $\pi_\theta$. The parameters of these networks are $\phi_1, \phi_2$ and $\theta$. For example, the soft Q-functions can be modeled as expressive neural networks, and the policy as a Gaussian with mean and covariance given by neural networks. We next derive update rules for these parameter vectors.
3.3.1 Learning Q-Functions
The Q-functions are learned by minimizing the mean-squared Bellman error (MSBE), using target networks to form the Bellman backups; both Q-functions use the same target, as in TD3. The target networks, like those in DDPG and TD3, are obtained by Polyak-averaging the corresponding network parameters over the course of training. Before we go into the learning rule, let's first rewrite the connection equation by using the definition of entropy to obtain:
$$V^\pi(s) = \mathop{\mathbb{E}}_{a \sim \pi}\big[Q^\pi(s, a) - \alpha \log \pi(a \mid s)\big].$$
The value function is implicitly parameterized through the soft Q-function parameters via Equation 14. We use the clipped double-Q trick, as in TD3 [TD3] and SAC [SAC], to express the TD target, taking the minimum Q-value between the two approximators. The loss for the Q-function parameters is:
$$L(\phi_i) = \mathop{\mathbb{E}}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(Q_{\phi_i}(s,a) - y(r,s')\big)^2\Big], \qquad y(r,s') = r + \gamma \Big(\min_{j=1,2} Q_{\bar\phi_j}(s', \tilde a') - \alpha \log \pi_\theta(\tilde a' \mid s')\Big), \quad \tilde a' \sim \pi_\theta(\cdot \mid s').$$
The update makes use of target soft Q-functions, obtained as an exponential moving average of the soft Q-function weights, which has been shown to stabilize training. Importantly, we do not use actions from the replay buffer here: the next actions are sampled fresh from the current version of the policy.
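The target computation and the moving-average update can be sketched with scalars (the gamma, alpha, and tau values are illustrative):

```python
import numpy as np

gamma, alpha, tau = 0.99, 0.2, 0.005

def td_target(r, q1_targ, q2_targ, logp_next, done):
    """Clipped double-Q soft Bellman target:
    y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    return r + gamma * (1.0 - done) * (np.minimum(q1_targ, q2_targ) - alpha * logp_next)

def polyak_update(target_params, online_params, tau):
    """Target weights track the online weights as an exponential moving average."""
    return (1.0 - tau) * target_params + tau * online_params

# min(5.0, 4.0) = 4.0, so y = 1 + 0.99 * (4.0 - 0.2 * (-1.0)) = 5.158
y = td_target(r=1.0, q1_targ=5.0, q2_targ=4.0, logp_next=-1.0, done=0.0)
```

Taking the minimum of the two target Q-values counteracts overestimation bias, while the small `tau` keeps the target networks changing slowly.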
3.3.2 Learning the Policy
The policy should, in each state, act to maximize the expected future return plus expected future entropy. That is, it should maximize $V^\pi(s)$, which we expand out (as before) into
$$V^\pi(s) = \mathop{\mathbb{E}}_{a \sim \pi}\big[Q^\pi(s,a)\big] + \alpha H\big(\pi(\cdot \mid s)\big).$$
The target density is the Q-function, which is represented by a neural network and can be differentiated; it is thus convenient to apply the reparameterization trick, resulting in a lower-variance estimate, in which a sample from $\pi_\theta(\cdot \mid s)$ is drawn by computing a deterministic function of the state, policy parameters, and independent noise. Following the authors of the SAC paper [SAC1], we use a squashed Gaussian policy, which means that samples are obtained according to
$$\tilde a_\theta(s, \xi) = \tanh\big(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\big), \qquad \xi \sim \mathcal{N}(0, I).$$
The reparameterization trick allows us to rewrite the expectation over actions (which contains a pain point: the distribution depends on the policy parameters) into an expectation over noise (which removes the pain point: the distribution now has no dependence on parameters):
$$\mathop{\mathbb{E}}_{a \sim \pi_\theta}\big[Q^\pi(s,a) - \alpha \log \pi_\theta(a \mid s)\big] = \mathop{\mathbb{E}}_{\xi \sim \mathcal{N}}\big[Q^\pi(s, \tilde a_\theta(s,\xi)) - \alpha \log \pi_\theta(\tilde a_\theta(s,\xi) \mid s)\big].$$
To get the policy loss, the final step is to substitute $Q^\pi$ with one of our function approximators. The same as in TD3, we use $\min_{j=1,2} Q_{\phi_j}$. The policy is thus optimized according to
$$\max_\theta\ \mathop{\mathbb{E}}_{s \sim \mathcal{D},\, \xi \sim \mathcal{N}}\Big[\min_{j=1,2} Q_{\phi_j}\big(s, \tilde a_\theta(s,\xi)\big) - \alpha \log \pi_\theta\big(\tilde a_\theta(s,\xi) \mid s\big)\Big].$$
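A sketch of the squashed-Gaussian sampling with the tanh log-density correction (the dimensions and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def squashed_gaussian_sample(mu, log_std):
    """Reparameterized sample a = tanh(mu + std * xi), xi ~ N(0, I), together
    with log pi(a | s) including the tanh change-of-variables correction."""
    std = np.exp(log_std)
    xi = rng.normal(size=mu.shape)
    pre_tanh = mu + std * xi
    a = np.tanh(pre_tanh)
    # Gaussian log-density of the pre-squash sample, summed over action dims
    logp = np.sum(-0.5 * xi ** 2 - log_std - 0.5 * np.log(2.0 * np.pi))
    # Subtract log|det d tanh(u)/du| = sum log(1 - tanh(u)^2)
    logp -= np.sum(np.log(1.0 - a ** 2 + 1e-6))
    return a, logp

a, logp = squashed_gaussian_sample(mu=np.zeros(2), log_std=np.log(0.5) * np.ones(2))
```

Because the sample is a deterministic function of `(mu, log_std, xi)`, the gradient of the policy objective can flow through `a` into the policy parameters, which is the whole point of the reparameterization.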
As proposed in [SAC1], we also learn the temperature $\alpha$ by minimizing the dual objective:
$$J(\alpha) = \mathop{\mathbb{E}}_{a_t \sim \pi_t}\big[-\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{H}\big],$$
where $\bar{H}$ is a target entropy.
This can be done by approximate dual gradient descent [opt_convex]. Dual gradient descent alternates between optimizing the Lagrangian with respect to the primal variables to convergence, and then taking a gradient step on the dual variables. While optimizing with respect to the primal variables fully is impractical, a truncated version that performs incomplete optimization (even a single gradient step) can be shown to converge under convexity assumptions [opt_convex]. While such assumptions do not apply to the case of nonlinear function approximators such as neural networks, we found this approach to still work in practice. Thus, we compute gradients for $\alpha$ with this dual objective.
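A single temperature-update step can be sketched as below. The plain SGD step and the target-entropy value (the common $-|\mathcal{A}|$ heuristic) are our own illustrative assumptions:

```python
import numpy as np

target_entropy = -2.0     # assumption: -dim(action space) heuristic
log_alpha = np.log(0.2)   # optimize log alpha so alpha stays positive

def alpha_step(log_alpha, logp_batch, target_entropy, lr=3e-4):
    """One descent step on J(alpha) = E[-alpha * log pi(a|s) - alpha * H_target].
    dJ/d(log alpha) = alpha * (-mean(log pi) - H_target)."""
    alpha = np.exp(log_alpha)
    grad = alpha * (-np.mean(logp_batch) - target_entropy)
    return log_alpha - lr * grad

logp_batch = np.array([-1.0, -1.5, -0.5])  # log-probs of freshly sampled actions
new_log_alpha = alpha_step(log_alpha, logp_batch, target_entropy)
# here the current entropy (~1.0) exceeds the target (-2.0), so alpha shrinks
```

The update raises $\alpha$ when the policy's entropy falls below the target and lowers it otherwise, automatically balancing exploration against exploitation.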
The final algorithm is listed in Algorithm 1. The method alternates between collecting experience from the environment with the current policy and updating the function approximators using the stochastic gradients from batches sampled from a replay pool. Using off-policy data from a replay pool is feasible because both value estimators and the policy can be trained entirely on off-policy data. The algorithm is agnostic to the parameterization of the policy, as long as it can be evaluated for any arbitrary state-action tuple.
3.4 Better Representations
When we talk about parameter updates, there are several options: there are three types of losses and four kinds of networks; we list them with the update options in Table 1.
It’s obvious we have to update each head according to their own loss, and it won’t make any sense if we back-propagate loss from one head to another head, thus our question become which part of loss should we back-propagate into RNN, in order to understand this question, in many prior papers they choose a method arbitrary, we will analyses different way to address this problem
As noted above, gradients coming from the policy head are blocked, and only gradients originating from the value head and the model head are allowed to back-propagate into the RNN; we block gradients from the policy head for increased stability, as this avoids positive feedback loops between $\pi$ and $Q$ caused by shared representations.
We hypothesize that by jointly training on the model loss and the intuition loss, we can get a better representation: for a POMDP, we aim to find a state sufficient to predict the next state, which is exactly what training the RNN on the model loss does. Meanwhile we need powerful intuition, and the state is correlated with the value as well, so we can combine the two parts and get a better representation. We give experimental results supporting this hypothesis in Section 4.3.
We designed our experiments to investigate the following questions:
Can RMCSAC be used to solve challenging continuous control problems? How does our agent compare with other methods when applied to these problems, with regard to final performance, computation time, and sample complexity?
Can RMCSAC handle POMDPs? How well does it deal with the absence of information, and how well does it generalize?
We optimize the RNN on the model loss and the intuition loss jointly; does this really improve performance?
We add a curiosity bonus and a model head to our agent; do these parts give us a more powerful agent?
To answer (1), we compare the performance of our agent with other methods in Section 4.1. To answer (2), we propose a modified MuJoCo environment, so-called flickering MuJoCo, which is a POMDP; we discuss the details of the environment and the experimental setting in Section 4.2. With regard to (3) and (4), we conduct an ablation study in Section 4.3, testing how different update schemes and network designs influence performance.
The results show that overall our agent outperforms the baselines by a large margin, both in terms of learning speed and final performance. The quantitative results attained by our agent in our experiments also compare very favorably to results reported for other methods in prior work, indicating that both the sample efficiency and the final performance of our agent on these benchmark tasks exceed the state of the art.
The goal of this experimental evaluation is to understand how the sample complexity and stability of our method compare with prior off-policy and on-policy deep reinforcement learning algorithms. We compare our method to prior techniques on a range of challenging continuous control tasks from the OpenAI Gym benchmark suite. Although the easier tasks can be solved by a wide range of different algorithms, the more complex benchmarks, such as the 21-dimensional Humanoid, are exceptionally difficult to solve with off-policy algorithms. The stability of the algorithm also plays a large role in performance: easier tasks make it more practical to tune hyper-parameters to achieve good results, while the already narrow basins of effective hyper-parameters become prohibitively small for the more sensitive algorithms on the hardest benchmarks, leading to poor performance [QProp].
We compare our method to deep deterministic policy gradient (DDPG) [ddpg], an algorithm regarded as one of the more efficient off-policy deep RL methods; proximal policy optimization (PPO) [ppo], a stable and effective on-policy policy gradient algorithm; and soft actor-critic (SAC) [SAC], a recent off-policy algorithm for learning maximum entropy policies. We additionally compare to the twin delayed deep deterministic policy gradient algorithm (TD3) [TD3].
We conducted the robotic locomotion experiments using the MuJoCo simulator [mujoco].
The states of the robots are their generalized positions and velocities, and the controls are joint torques. Underactuation, high dimensionality, and non-smooth dynamics due to contacts make these tasks very challenging. To allow for a reproducible and fair comparison, we evaluate all the algorithms with similar network structures: for the off-policy algorithms we use a two-layer feed-forward neural network of 400 and 300 hidden units respectively, with rectified linear units (ReLU) between the layers, for both the actor and the critic; for the on-policy algorithm we use a feed-forward neural network with 64 hidden units, using the parameters shown to be superior in prior work [that_matters]. Network parameters are updated using Adam [adam], with no modifications to the environment or reward.
The training curves compare five individual runs of both variants, initialized with different random seeds. RMCSAC performs much better: our agent significantly outperforms the baseline, indicating substantially better stability. As evident from the figure, with joint training and intrinsic reward we achieve stable training. This becomes especially important on harder tasks, where tuning hyper-parameters is challenging.
This shows that our agent outperforms the other baseline methods by a large margin, indicating that both the efficiency and the stability of the method are superior.
4.2 Flickering MuJoCo
To address this problem, we introduce the Flickering MuJoCo POMDP, a modification of the classic MuJoCo benchmark such that at each time step the screen is either fully revealed or fully obscured with some probability. Obscuring frames in this manner probabilistically induces an incomplete memory of observations, turning MuJoCo into a POMDP. In order to succeed at Flickering MuJoCo, it is necessary to integrate information across frames to estimate relevant variables such as the positions and velocities of the joints. Since half of the frames are obscured in expectation, a successful agent must be robust to the possibility of several potentially contiguous obscured inputs.
When dealing with partial observability, a choice exists between using a non-recurrent deep network with a long history of observations or using a recurrent network trained with a single observation at each timestep. The results in this section show that recurrent networks can integrate information through time and serve as a viable alternative to stacking frames in the input layers.
As shown in Figure 2(f), our agent outperforms standard SAC combined with frame stacking; it performs well at this task even when given only one input frame per time step, showing that RMC successfully integrates information through time. Our agent is capable of integrating noisy single-frame information over time to detect events. Thus, given the same length of history, the recurrent net can better adapt at evaluation time if the quality of observations changes.
4.2.2 Generalization Performance
Our agent's performance increases when trained on a POMDP and then evaluated on an MDP. Arguably the more interesting question is the reverse: can a recurrent network be trained on a standard MDP and then generalize to a POMDP at evaluation time? To address this question, we evaluate the highest-scoring policies of our agent and SAC on the flickering equivalents of all three environments. Figure 3(a) shows that while both algorithms incur significant performance decreases on account of the missing information, our agent retains more of its previous performance than SAC across all levels of flickering. We conclude that recurrent controllers have a certain degree of robustness against missing information, even when trained with full state information.
4.3 Ablation Study
4.3.1 The Impact of Different Training Methods
As shown in Table 1, we have at most six different update schemes. Figure 3(b) shows how learning performance changes as the update scheme changes. For schemes 4-6, the policy becomes nearly random and consequently fails to exploit the reward signal, resulting in substantial degradation of performance. For scheme 2, the value function is enhanced with the capacity of the RNN, so the model learns quickly at first, but the policy then becomes nearly deterministic, leading to a poor local minimum due to the lack of adequate exploration and a worse state representation. As for scheme 3, due to the weak capacity of the value function, although the representation of the state may be better, it still cannot achieve strong performance. With scheme 1, the model balances exploration and exploitation: the model head makes sure the state is good enough to predict the next state, pushing the POMDP toward an MDP by back-propagating the model loss into the RNN, while the value head adjusts the representation and achieves strong capacity by jointly optimizing the RNN with the model head. Just as our brain has two different thinking patterns, so-called intuition and reasoning, our agent takes advantage of joint optimization by learning a representation of the state from observations both for prediction and for decision making.
4.3.2 The Impact of Curiosity Strength
As discussed in Section 3, the model head provides a prediction of the future and, in the meantime, a curiosity bonus and a better representation. We already analyzed the influence of the model update and showed that joint training can improve performance, but what about the curiosity part? What if we train jointly but set $\beta$ to zero (removing the intrinsic reward)? To test the design of our algorithm, we choose update scheme 1, which was shown to be superior in the previous experiment, and change the scale of $\beta$. As illustrated in Figure 3(c), when we set $\beta$ to zero, the model cannot explore well, so both the sample efficiency and the final score suffer. If we use a huge $\beta$, the intrinsic reward influences the agent too much, so it is hard to utilize the extrinsic reward and the policy becomes nearly random.
Theoretically speaking, as learning proceeds the model loss should become smaller and smaller until it reaches zero, so $\beta$ would not have much impact at the end, and all the different choices would lead to a similar final policy. But in practice, due to the stochastic environment, the model loss is almost never zero, so we decay $\beta$ from large to small, making our agent explore more at the beginning and exploit more at the end, leading to fast learning at the beginning and stable learning at the end.
We found that the impact of the RNN and joint training goes beyond providing the agent with memory. It also serves a role not previously studied in RL, potentially by enabling better representation learning, and thereby improves performance even on domains that are fully observable and do not obviously require memory.
Empirical results show that RMCSAC outperforms state-of-the-art model-free deep RL methods, including the off-policy SAC algorithm and the on-policy PPO algorithm, by a substantial margin, providing a promising avenue for improved robustness and stability. Meanwhile, with the memory and inference functionality of our agent, we can solve POMDP problems as well, which sheds light on real-world applications. Future work includes methods that incorporate a stochastic transition function (e.g., PlaNet [planet], World Models [world_model]) and more theoretical analysis.