1 Introduction
Model-free deep reinforcement learning algorithms have been applied in a range of challenging domains, from games to robotic control
[that_matters]. The combination of RL and high-capacity function approximators such as neural networks holds the promise of automating a wide range of decision-making and control tasks, but widespread adoption of these methods in real-world domains has been hampered by three major challenges.
[world_model] First, these methods only promise to solve MDPs, but real-world problems seldom satisfy the Markov property: they often require dealing with partially observed states, and it is in general very challenging to construct and infer hidden states, as they often depend on the agent's entire interaction history and may require substantial domain knowledge.
Second, model-free methods suffer from low sample efficiency: even simple tasks need millions of interactions with the environment, and complex behaviors with high-dimensional observations might need substantially more.
Third, these methods are often brittle with respect to their hyperparameters, which means the parameters need careful tuning; most importantly, they often get stuck in local optima. In many cases they fail to exploit the reward signal; even when the reward signal is relatively dense, they still fail to find the optimal solution, so the reward function often has to be handcrafted, and some researchers design a complex reward function for each environment they want to solve [design].
In this paper, we propose a new hybrid approach that uses deep learning to tackle complex tasks. Recurrent neural networks (RNNs) for reinforcement learning (RL) have shown distinct advantages, e.g., solving memory-dependent tasks and meta-learning. However, little effort has been spent on improving RNN architectures and on understanding the underlying neural mechanisms for performance gains.
We investigate a deep-learning approach to learning the representation of states in partially observable tasks, with minimal prior knowledge of the domain. In particular, we propose a new family of hybrid models that combines the strengths of both supervised learning and reinforcement learning, trained in a joint fashion: the supervised learning component is a recurrent neural network (RNN) combined with different heads, providing an effective way of learning the representation of hidden states. The RL component is a soft actor-critic
[SAC] that learns to optimize the control for maximizing long-term rewards. Furthermore, we design a model head together with a curiosity bonus, which leads to better representation and exploration. Extensive experiments on both POMDP and MDP tasks demonstrate the effectiveness and advantages of the proposed approach, which performs the best among a set of previous state-of-the-art methods.
First, we employ recurrent neural networks and GRU models to learn the representation of the state for RL. Since these recurrent models can aggregate partial information from the past and can capture long-term dependencies in sequential information, their performance is expected to be superior to the contextual window-based approach used in the DQN model of [DQN].

Second, in order to best leverage supervision signals in the training data, the proposed hybrid approach combines the strengths of both supervised learning and RL. In particular, the model in our hybrid approach is jointly learned using stochastic gradient descent (SGD): in each iteration, the representation of hidden states is first inferred using supervision signals (i.e., the next observation and reward) in the training data; then, the Q-function is updated using SAC, which takes the learned hidden states as input. The superiority of the hybrid approach is validated in extensive experiments on a benchmark dataset.

Third, by jointly training the model and the RL algorithm, we obtain a good representation that captures the underlying state, which effectively turns the POMDP into an MDP.

Last, to avoid local optima and encourage our agent to explore more, we add a curiosity bonus to the standard RL objective, i.e., we use both an internal reward and the external reward. Having a separate reward signal makes it easier for our agent to capture the reward and find the optimal solution.
2 Background
2.1 POMDP
Our reinforcement learning problem can be defined as policy search in a Partially Observable Markov Decision Process (POMDP),
given by the tuple (S, A, R, O, \Omega) [planet]. The underlying Markov Decision Process (MDP) is defined by the tuple (S, A, R):

S is the set of states,

A is the set of actions,

R : S \times A \to \mathbb{R} is the reward function, with r_t = R(s_t, a_t),

O gives the set of observations potentially received by the agent,

\Omega : S \to P(O) is the observation function mapping (unobserved) states to probability distributions over observations.
Within this framework, the agent receives an observation o_t, which may only contain partial information about the underlying state s_t. When the agent takes an action a_t, the environment responds by transitioning to state s_{t+1} and giving the agent a new observation o_{t+1} and reward r_{t+1}.
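This interaction loop can be sketched with a toy environment in which the agent never sees the true state, only a noisy reading of it. The environment, its noise scale, and the naive policy below are all illustrative assumptions, not part of the proposed method:

```python
import random

class NoisyCartEnv:
    """Toy POMDP: the true state is a scalar position, but the agent
    only observes it with additive Gaussian noise (hypothetical env)."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self._observe()

    def _observe(self):
        # The observation reveals only partial/noisy state information.
        return self.state + self.rng.gauss(0.0, 0.1)

    def step(self, action):
        self.state += action               # transition to s_{t+1}
        reward = -abs(self.state)          # reward depends on true state
        return self._observe(), reward     # agent gets (o_{t+1}, r_{t+1})

env = NoisyCartEnv()
obs = env.reset()
history = [obs]
for _ in range(5):
    action = -0.5 * obs    # naive policy acting on the raw observation
    obs, reward = env.step(action)
    history.append(obs)
```

Because each observation is noisy, a good policy must aggregate the whole history, which is exactly the role the RNN plays later in this paper.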
Although there are many approaches to RL in POMDPs, we focus on using recurrent neural networks (RNNs) with backpropagation through time (BPTT) to learn a representation that disambiguates the true state of the POMDP. The Deep Q-Network agent (DQN) [DQN] learns to play games from the Atari-57 benchmark by using frame-stacking of 4 consecutive frames as observations, and training a convolutional network to represent a value function with Q-learning, from data continuously collected in a replay buffer. Other algorithms, like A3C, use an LSTM and are trained directly on the online stream of experience without using a replay buffer. The paper [RDPG] combined DDPG with an LSTM by storing sequences in replay and initializing the recurrent state to zero during training.
We consider a partially observable Markov decision process (POMDP). We define a discrete time step t, hidden states s_t, image observations o_t, continuous action vectors a_t, and scalar rewards r_t, that follow the stochastic dynamics:

Transition function: s_t \sim p(s_t | s_{t-1}, a_{t-1})

Observation function: o_t \sim p(o_t | s_t)

Reward function: r_t \sim p(r_t | s_t)

Policy: a_t \sim \pi(a_t | o_{\le t}, a_{< t})

Inference model: s_t \sim q(s_t | o_{\le t}, a_{< t})
Whatever the choice of return measure (whether infinitehorizon discounted, or finitehorizon undiscounted), and whatever the choice of policy, the goal in RL is to select a policy which maximizes expected return when the agent acts according to it.
To talk about expected return, we first have to talk about probability distributions over trajectories. Let's suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a T-step trajectory is:

(1) P(\tau|\pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t)

The expected return (for whichever measure), denoted by J(\pi), is then:

(2) J(\pi) = \int_\tau P(\tau|\pi) R(\tau) = \mathbb{E}_{\tau\sim\pi}[R(\tau)]
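As a concrete check of the return measure, the discounted return of a single sampled trajectory can be computed directly; a Monte-Carlo estimate of J(\pi) is just the average of this quantity over many trajectories sampled from the policy. The reward sequence below is made up for illustration:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for one finite trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A Monte-Carlo estimate of J(pi) averages discounted_return over
# trajectories sampled by running the policy in the environment.
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```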
2.2 EntropyRegularized Reinforcement Learning
Standard RL maximizes the expected sum of rewards, as posed in Equation 2. We will consider a more general maximum entropy objective [entropy_rl], which favors stochastic policies by augmenting the objective: in entropy-regularized reinforcement learning, the agent gets a bonus reward at each time step proportional to the entropy of the policy at that timestep. This changes the RL problem to:

(3) \pi^* = \arg\max_{\pi} \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(R(s_t,a_t,s_{t+1}) + \alpha H(\pi(\cdot|s_t))\big)\Big]

where \alpha > 0 is the trade-off coefficient. (Note: we are assuming an infinite-horizon discounted setting here, and we will do the same for the rest of this section.) We can now define the slightly different value functions in this setting. V^\pi is changed to include the entropy bonuses from every timestep:

(4) V^\pi(s) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(R(s_t,a_t,s_{t+1}) + \alpha H(\pi(\cdot|s_t))\big) \,\Big|\, s_0 = s\Big]

Q^\pi is changed to include the entropy bonuses from every timestep except the first:

(5) Q^\pi(s,a) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t R(s_t,a_t,s_{t+1}) + \alpha\sum_{t=1}^{\infty}\gamma^t H(\pi(\cdot|s_t)) \,\Big|\, s_0 = s, a_0 = a\Big]

With these definitions, V^\pi and Q^\pi are connected by:

(6) V^\pi(s) = \mathbb{E}_{a\sim\pi}[Q^\pi(s,a)] + \alpha H(\pi(\cdot|s))

and the Bellman equation for Q^\pi is:

(7) Q^\pi(s,a) = \mathbb{E}_{s'\sim P,\,a'\sim\pi}\big[R(s,a,s') + \gamma\big(Q^\pi(s',a') + \alpha H(\pi(\cdot|s'))\big)\big]

(8) Q^\pi(s,a) = \mathbb{E}_{s'\sim P}\big[R(s,a,s') + \gamma V^\pi(s')\big]
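A sample-based estimate of the connection in Eq. (6) takes only a few lines: given Q-values and log-probabilities of actions sampled from the policy, the soft state value is the average of Q minus \alpha times the log-probability. The numbers below are illustrative:

```python
import math

def soft_state_value(q_values, log_probs, alpha):
    """Sample-based estimate of Eq. (6):
    V(s) ~= mean over sampled actions of [ Q(s,a) - alpha * log pi(a|s) ],
    since H(pi) = E[-log pi]."""
    n = len(q_values)
    return sum(q - alpha * lp for q, lp in zip(q_values, log_probs)) / n

# Two actions sampled from a uniform policy over two actions:
v = soft_state_value([1.0, 2.0], [math.log(0.5), math.log(0.5)], alpha=1.0)
```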
3 Method
There are five key elements of intelligence: prediction, curiosity, intuition, memory, and inference; we can realize these functions within a deep learning framework using an RNN-based architecture. In the real world, the full state is rarely provided to the agent; in other words, the Markov property rarely holds in real-world environments. A Partially Observable Markov Decision Process (POMDP) better captures the dynamics of many real-world environments by explicitly acknowledging that the sensations received by the agent are only partial glimpses of the underlying state. Real-world tasks often feature incomplete and noisy state information resulting from partial observability.
The main structure of our agent contains three parts: the RNN, the model head, and the intuition head; each part has its own function.
We use the RNN for inference and memory, i.e., learning p(s_t | o_{\le t}, a_{< t}). Since we only provide o_t at each timestep rather than the full history, it is crucial for the RNN to store past information in a hidden state h_t; we call this process inference. Importantly, h_t is both the hidden state of the RNN and the state we use as input to our RL algorithm. It is hard to claim that h_t is the "true" state satisfying the Markov property, but we push h_t toward the "true" state by jointly optimizing the RNN with the intuition head and the model head (discussed further in Section 3.4).
The core idea of the model head is to predict the future and use the prediction error as a curiosity bonus. Understanding and prediction of the future are fundamental to humans; by putting a model in our agent, we obtain a better representation of the state and avoid getting stuck in local optima.
As for the intuition head, its key function is decision making. We achieve this by using SAC, which is based on the actor-critic framework and uses entropy regularization; we combine the original algorithm with our framework so that it can adapt to POMDPs, equipped with a model and an internal reward.
3.1 RNN
3.1.1 Inference and Memory
The main function of the RNN is to provide memory and inference. Due to partial observability, we cannot use the observation o_t at step t directly to make decisions or predictions, so we need an inference model that encodes observations and actions into a state:

(9) h_t = f(h_{t-1}, a_{t-1}, o_t)
Since the model is nonlinear, we cannot directly compute the state posteriors needed for parameter learning, but we can optimize the function approximator by backpropagating losses through the RNN cell. Gradients coming from the policy head are blocked, and only gradients originating from the Q-network head and the model head are allowed to backpropagate into the RNN. We block gradients from the policy head for increased stability, as this avoids positive feedback loops between \pi and Q caused by shared representations.
We will further discuss the choice of which value and model losses to backpropagate in Section 3.4; in short, training the RNN jointly with the model and the value function captures a better representation of the state.
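The recurrent update of Eq. (9) can be sketched with a single tanh cell over randomly initialized weights. The cell type and the sizes below are illustrative assumptions; any gated RNN such as a GRU would play the same role:

```python
import numpy as np

def rnn_step(h_prev, obs, act, W):
    """One step of Eq. (9): h_t = f(h_{t-1}, a_{t-1}, o_t).
    A minimal tanh-RNN cell; W packs all weights (assumed shapes)."""
    x = np.concatenate([h_prev, obs, act])
    return np.tanh(W @ x)

rng = np.random.default_rng(0)
H, O, A = 8, 4, 2                                # hidden/obs/action sizes
W = 0.1 * rng.standard_normal((H, H + O + A))
h = np.zeros(H)
# Aggregate a short observation/action history into the hidden state,
# which then serves as the input state for the RL algorithm.
for _ in range(3):
    h = rnn_step(h, rng.standard_normal(O), rng.standard_normal(A), W)
```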
3.1.2 Initialization Strategy
In order to achieve good performance in a partially observed environment, an RL agent requires a state representation that encodes information about its state-action trajectory in addition to its current observation. The most common way to achieve this is by using an RNN as part of the agent's state encoding. To train an RNN from replay and enable it to learn meaningful long-term dependencies, whole state-action trajectories need to be stored in replay and used for training the network. Recent work [R2D2] compared four strategies of training an RNN from replayed experience:

Using a zero start state to initialize the network at the beginning of sampled sequences.

Replaying whole episode trajectories.

Stored state: Storing the recurrent state in replay and using it to initialize the network at training time. This partially remedies the weakness of the zero start state strategy, however it may suffer from the effect of ‘representational drift’ leading to ‘recurrent state staleness’, as the stored recurrent state generated by a sufficiently old network could differ significantly from a typical state produced by a more recent version.

Burn-in: Allow the network a 'burn-in period' by using a portion of the replay sequence only for unrolling the network and producing a start state, and update the network only on the remaining part of the sequence. We hypothesize that this allows the network to partially recover from a poor start state (zero, or stored but stale) and find itself in a better initial state before being required to produce accurate outputs.
The zero start state strategy’s appeal lies in its simplicity and it allows independent decorrelated sampling of relatively short sequences, which is important for robust optimization of a neural network. On the other hand, it forces the RNN to learn to recover meaningful predictions from an atypical initial recurrent state (‘initial recurrent state mismatch’), which may limit its ability to fully rely on its recurrent state and learn to exploit long temporal correlations. The second strategy, on the other hand, avoids the problem of finding a suitable initial state, but creates a number of practical, computational, and algorithmic issues due to varying and potentially environmentdependent sequence length, and higher variance of network updates because of the highly correlated nature of states in a trajectory when compared to training on randomly sampled batches of experience tuples.
[RDPG] observed little difference between the two strategies for empirical agent performance on a set of Atari games, and therefore opted for the simpler zero start state strategy. One possible explanation is that in some cases an RNN tends to converge to a more 'typical' state if allowed a certain number of 'burn-in' steps, and so recovers from a bad initial recurrent state on a sufficiently long sequence. We also hypothesize that while the zero start state strategy may suffice in the mostly fully observable Atari domain, it prevents a recurrent network from learning actual long-term dependencies in more memory-critical domains. In all our experiments we use the proposed agent architecture with replay sequences of a fixed length, with an optional burn-in prefix.
The burn-in strategy on its own partially mitigates the staleness problem on the initial part of replayed sequences. Empirically, this translates into noticeable performance improvements; since the only difference between the pure zero-state and the burn-in strategy lies in the fact that the latter unrolls the network over a prefix of states on which the network does not receive updates, the beneficial effect of burn-in lies in the fact that it prevents 'destructive updates' to the RNN parameters resulting from highly inaccurate initial outputs on the first few time steps after a zero state initialization. The stored state strategy, on the other hand, proves to be overall much more effective at mitigating state staleness in terms of the Q-value discrepancy, which also leads to clearer and more consistent improvements in empirical performance. Finally, the combination of both methods consistently yields the smallest discrepancy in the last sequence states and the most robust performance gains.
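Mechanically, burn-in reduces to sequence slicing: the prefix only warms up the recurrent state, and gradient updates flow only through the remainder. A schematic sketch, with integers standing in for stored transitions:

```python
def split_burn_in(sequence, burn_in):
    """Split a replayed trajectory into a burn-in prefix (used only to
    unroll the RNN and warm up its hidden state, no gradient updates)
    and a training segment (the only part that receives updates)."""
    return sequence[:burn_in], sequence[burn_in:]

seq = list(range(10))           # stand-in for 10 stored transitions
warmup, train = split_burn_in(seq, burn_in=4)
```

During training one would unroll the RNN over `warmup` with gradients disabled, then continue over `train` with gradients enabled.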
3.2 Model Learning (Model Head)
3.2.1 Prediction
Humans develop a mental model of the world based on what they are able to perceive with their limited senses, and the decisions and actions we make are based on this internal model. While the role of the intuition head is to compress what the agent sees at each time frame, we also want to compress what happens over time. For this purpose, the role of the model is to predict the future: the model head serves as a predictive model of the future state.
The transition dynamics are modeled with a feedforward neural network, following the standard practice of training the network to predict the change in state (rather than the next state) given a state and an action as inputs. This relieves the neural network from memorizing the input state, especially when the change is small
[metrpo]. We denote the function approximator for the next state, which is the sum of the input state and the output of the neural network, as \hat{f}_\theta. The model learning loss is the one-step prediction loss over the training dataset D that stores the transitions the agent has experienced:

(10) \min_\theta \frac{1}{|D|} \sum_{(s_t, a_t, s_{t+1}) \in D} \frac{1}{2}\, \| s_{t+1} - \hat{f}_\theta(s_t, a_t) \|^2
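The one-step loss of Eq. (10) is straightforward to compute from a batch of transitions. The sketch below assumes a hypothetical `predict_delta` network and checks it on a toy linear system where the true dynamics are s' = s + a:

```python
import numpy as np

def model_loss(states, actions, next_states, predict_delta):
    """One-step prediction loss of Eq. (10): the network predicts the
    *change* in state, and f_hat(s, a) = s + predict_delta(s, a)."""
    preds = states + predict_delta(states, actions)
    return 0.5 * np.mean(np.sum((next_states - preds) ** 2, axis=-1))

# Hypothetical perfect model of the linear system s' = s + a:
loss = model_loss(
    states=np.array([[0.0], [1.0]]),
    actions=np.array([[1.0], [-1.0]]),
    next_states=np.array([[1.0], [0.0]]),
    predict_delta=lambda s, a: a,
)
```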
3.2.2 Curiosity
As human agents, we are accustomed to operating with rewards that are so sparse that we only experience them once or twice in a lifetime, if at all. To a three-year-old enjoying a sunny Sunday afternoon on a playground, most trappings of modern life – college, a good job, a house, a family – are so far into the future that they provide no useful reinforcement signal. Yet, the three-year-old has no trouble entertaining herself in that playground using what psychologists call intrinsic motivation [rnd] or curiosity [ICM]. Motivation/curiosity has been used to explain the need to explore the environment and discover novel states. More generally, curiosity is a way of learning new skills which might come in handy for pursuing rewards in the future.
There are many types of curiosity, but the most fundamental one is curiosity about things we cannot predict correctly [curiosity]. Seeking a good model to predict the future is close to the origin of science itself, so it is a natural idea to let our agent predict the future and use the prediction error as a curiosity bonus, a so-called curiosity-driven intrinsic reward signal.
Our agent is composed of two subsystems:

a reward generator that outputs a curiosity-driven intrinsic reward signal,

and a policy that outputs a sequence of actions to maximize that reward signal.
In addition to intrinsic rewards, the agent optionally may also receive some extrinsic reward from the environment. Let the intrinsic curiosity reward generated by the agent at time t be r_t^i and the extrinsic reward be r_t^e. The policy subsystem is trained to maximize the sum of these two rewards:

(11) r_t = r_t^i + r_t^e
In practice we use a parameter \beta to represent the strength of the intrinsic reward. As the learning process continues, the model loss should decay to zero, but this is often not the case: due to complex environment dynamics, it is very hard to make perfect predictions, if possible at all. But we cannot let our agent keep seeking novel states all the time, thus we need to decay \beta to make sure we end up with a good policy.
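The combined reward of Eq. (11), with the scale \beta applied to the intrinsic term, can be sketched as follows. The exponential decay schedule is an assumption for illustration; the text only specifies that \beta shrinks over training:

```python
def total_reward(r_ext, r_int, beta):
    """Eq. (11) with an intrinsic-reward scale: r = r_e + beta * r_i."""
    return r_ext + beta * r_int

def decayed_beta(beta0, step, decay=0.999):
    """Anneal the curiosity strength so late training exploits the
    external reward (the decay schedule itself is an assumption)."""
    return beta0 * decay ** step

# At the first step beta is at full strength:
r = total_reward(r_ext=1.0, r_int=0.5, beta=decayed_beta(0.2, step=0))
```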
3.3 Intuition Head
We will use function approximators for both the soft Q-function and the policy, and instead of running evaluation and improvement to convergence, alternate between optimizing both networks with stochastic gradient descent. We consider two parameterized soft Q-functions Q_\theta(s_t, a_t) and a tractable policy \pi_\phi(a_t|s_t). The parameters of these networks are \theta and \phi. For example, the soft Q-function can be modeled as an expressive neural network, and the policy as a Gaussian with mean and covariance given by neural networks. We next derive update rules for these parameter vectors.
3.3.1 Learning Q-Functions
The Q-functions are learned by MSBE minimization, using a target value network to form the Bellman backups. They both use the same target, as in TD3, and have loss functions:

(12) L(\theta_i, D) = \mathbb{E}_{(s,a,r,s',d)\sim D}\Big[\big(Q_{\theta_i}(s,a) - (r + \gamma(1-d)\, V_{\bar\psi}(s'))\big)^2\Big]
The target value network, like the target networks in DDPG and TD3, is obtained by Polyak averaging the value network parameters over the course of training. Before we go into the learning rule, let's first rewrite the connection equation by using the definition of entropy to obtain:
(13) V^\pi(s) = \mathbb{E}_{a\sim\pi}[Q^\pi(s,a)] + \alpha H(\pi(\cdot|s))

(14) V^\pi(s) = \mathbb{E}_{a\sim\pi}[Q^\pi(s,a) - \alpha \log \pi(a|s)]
The value function is implicitly parameterized through the soft Q-function parameters via Equation 14. We use a clipped double-Q technique, as in TD3 [TD3] and SAC [SAC], to express the TD target, taking the minimum Q-value between the two approximators. The loss for the Q-function parameters is thus:

(15) L(\theta_i, D) = \mathbb{E}_{(s,a,r,s',d)\sim D}\Big[\big(Q_{\theta_i}(s,a) - y(r,s',d)\big)^2\Big], \quad y(r,s',d) = r + \gamma(1-d)\Big(\min_{j=1,2} Q_{\bar\theta_j}(s',\tilde a') - \alpha \log \pi_\phi(\tilde a'|s')\Big), \quad \tilde a' \sim \pi_\phi(\cdot|s')
The update makes use of target soft Q-functions, obtained as an exponentially moving average of the soft Q-function weights, which has been shown to stabilize training. Importantly, we do not use actions from the replay buffer here: these actions are sampled fresh from the current version of the policy.
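Putting Eq. (15) into code, the target computation for one transition looks roughly like the sketch below (scalar inputs for clarity; in practice these are batched tensors and the next action is sampled fresh from the policy):

```python
def td_target(r, done, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """Clipped double-Q soft TD target of Eq. (15):
    y = r + gamma * (1 - done) * (min(Q1, Q2) - alpha * log pi(a'|s')),
    with a' sampled fresh from the current policy, not the buffer."""
    soft_v = min(q1_next, q2_next) - alpha * logp_next
    return r + gamma * (1.0 - done) * soft_v

# Terminal transition: the bootstrap term is masked out entirely.
y_terminal = td_target(r=1.0, done=1.0, q1_next=5.0, q2_next=4.0,
                       logp_next=-1.0)
# Non-terminal, entropy term disabled for an easy hand check:
y = td_target(r=1.0, done=0.0, q1_next=5.0, q2_next=4.0, logp_next=-1.0,
              gamma=0.5, alpha=0.0)
```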
3.3.2 Learning the Policy
The policy should, in each state, act to maximize the expected future return plus expected future entropy. That is, it should maximize V^\pi(s), which we expand out (as before) into:

(16) V^\pi(s) = \mathbb{E}_{a\sim\pi}[Q^\pi(s,a) - \alpha \log \pi(a|s)]
The target density is the Q-function, which is represented by a neural network and can be differentiated; it is thus convenient to apply the reparameterization trick instead, resulting in a lower-variance estimate, in which a sample from
\pi_\phi(\cdot|s) is drawn by computing a deterministic function of the state, policy parameters, and independent noise. Following the authors of the SAC paper [SAC1], we use a squashed Gaussian policy, which means that samples are obtained according to:

(17) \tilde a_\phi(s, \xi) = \tanh\big(\mu_\phi(s) + \sigma_\phi(s) \odot \xi\big), \quad \xi \sim \mathcal{N}(0, I)
The reparameterization trick allows us to rewrite the expectation over actions (which contains a pain point: the distribution depends on the policy parameters) into an expectation over noise (which removes the pain point: the distribution now has no dependence on the parameters):

(18) \mathbb{E}_{a\sim\pi_\phi}\big[Q^{\pi_\phi}(s,a) - \alpha \log \pi_\phi(a|s)\big] = \mathbb{E}_{\xi\sim\mathcal{N}}\big[Q^{\pi_\phi}(s, \tilde a_\phi(s,\xi)) - \alpha \log \pi_\phi(\tilde a_\phi(s,\xi)|s)\big]
To get the policy loss, the final step is to substitute Q^{\pi_\phi} with one of our function approximators. As in TD3, we use the minimum of the two. The policy is thus optimized according to:

(19) \max_\phi \mathbb{E}_{s\sim D,\, \xi\sim\mathcal{N}}\Big[\min_{j=1,2} Q_{\theta_j}(s, \tilde a_\phi(s,\xi)) - \alpha \log \pi_\phi(\tilde a_\phi(s,\xi)|s)\Big]
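The squashed-Gaussian sampling of Eq. (17), which underlies the objective in Eq. (19), can be sketched in a few lines; the noise \xi is passed in explicitly so the sample is a deterministic, differentiable function of the policy parameters:

```python
import numpy as np

def squashed_gaussian_sample(mu, log_std, xi):
    """Reparameterized sample of Eq. (17):
    a = tanh(mu + sigma * xi), xi ~ N(0, I) supplied externally, so the
    sample is a deterministic function of (mu, sigma) given the noise."""
    return np.tanh(mu + np.exp(log_std) * xi)

# With zero noise the sample sits at tanh(mu); with very large noise it
# saturates toward the action bound +/- 1 enforced by the squashing.
a = squashed_gaussian_sample(mu=np.zeros(2), log_std=np.zeros(2),
                             xi=np.array([0.0, 100.0]))
```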
3.3.3 Learning \alpha
As proposed in [SAC1], we also learn the temperature \alpha by minimizing the dual objective:

(20) \min_{\alpha \ge 0} \mathbb{E}_{a_t\sim\pi_t^*}\big[-\alpha \log \pi_t^*(a_t|s_t) - \alpha \bar{H}\big]

where \bar{H} is the target entropy.
This can be done by approximate dual gradient descent [opt_convex]. Dual gradient descent alternates between optimizing the Lagrangian with respect to the primal variables to convergence, and then taking a gradient step on the dual variables. While optimizing with respect to the primal variables fully is impractical, a truncated version that performs incomplete optimization (even for a single gradient step) can be shown to converge under convexity assumptions [opt_convex]. While such assumptions do not apply to the case of nonlinear function approximators such as neural networks, we found this approach to still work in practice. Thus, we compute gradients for \alpha with the following objective:

(21) J(\alpha) = \mathbb{E}_{a_t\sim\pi_t}\big[-\alpha \log \pi_t(a_t|s_t) - \alpha \bar{H}\big]
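A sample-based version of the temperature objective in Eq. (21): the loss (and its gradient with respect to \alpha) vanishes exactly when the policy's average entropy, -log \pi, matches the target entropy \bar{H}, which is what keeps \alpha stable at that point:

```python
import math

def alpha_loss(log_probs, log_alpha, target_entropy):
    """Temperature objective of Eq. (21) on a batch of log pi(a_t|s_t)
    samples: J(alpha) = mean[ -alpha * (log pi + target_entropy) ]."""
    alpha = math.exp(log_alpha)   # optimize log_alpha to keep alpha > 0
    return sum(-alpha * (lp + target_entropy)
               for lp in log_probs) / len(log_probs)

# At the target entropy (-log pi == target on average) the loss is zero:
loss_at_target = alpha_loss([-1.0, -1.0], log_alpha=0.0, target_entropy=1.0)
# Too little entropy gives a negative loss; descending it raises alpha:
loss_low_entropy = alpha_loss([-0.5], log_alpha=0.0, target_entropy=1.0)
```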
The final algorithm is listed in Algorithm 1. The method alternates between collecting experience from the environment with the current policy and updating the function approximators using the stochastic gradients from batches sampled from a replay pool. Using offpolicy data from a replay pool is feasible because both value estimators and the policy can be trained entirely on offpolicy data. The algorithm is agnostic to the parameterization of the policy, as long as it can be evaluated for any arbitrary stateaction tuple.
3.4 Better Representations
When we talk about parameter updates, there are several options. First of all, there are four types of losses and four kinds of networks; we list them with the update options in Table 1.
Parameters \ Loss | RNN | Model | Value | Policy
RNN | False | Option | Option | Option
Model | False | True | False | False
Value | False | False | True | False
Policy | False | False | False | True
Obviously, each head must be updated according to its own loss, and it makes no sense to backpropagate the loss from one head into another head. Thus our question becomes: which part of the loss should we backpropagate into the RNN? Many prior papers choose a method arbitrarily; here we analyze different ways of addressing this problem.
Gradients coming from the policy head are blocked, and only gradients originating from the value head and the model head are allowed to backpropagate into the RNN. We block gradients from the policy head for increased stability, as this avoids positive feedback loops between \pi and Q caused by shared representations.
We hypothesize that by jointly training on the model loss and the intuition loss, we obtain a better representation: for a POMDP, we aim to find a state sufficient to predict the next state, which is what training the RNN on the model loss achieves. Meanwhile, we need a powerful intuition head, and the state is correlated with the value as well, so we can combine the two parts and get a better representation. We give experimental results supporting this hypothesis in Section 4.3.
4 Experiment
We designed our experiments to investigate the following questions:

Can RMC-SAC be used to solve challenging continuous control problems? How does our agent compare with other methods when applied to these problems, with regard to final performance, computation time, and sample complexity?

Can RMC-SAC handle POMDPs? How well does it deal with the absence of information, and how well does it generalize?

We optimize the RNN on the model loss and the intuition loss jointly; does this really improve performance?

We add a curiosity bonus and a model head to our agent; do these parts give us a more powerful agent?
To answer (1), we compare the performance of our agent with other methods in Section 4.1. To answer (2), we propose a modified MuJoCo environment, so-called flickering MuJoCo, which is a POMDP; we discuss the details of the environment and the experimental setting in Section 4.2. With regard to (3) and (4), we conduct an ablation study on our algorithm in Section 4.3, testing how different update schemes and network designs influence performance.
The results show that overall our agent outperforms the baselines by a large margin, both in terms of learning speed and final performance. The quantitative results attained by our agent in our experiments also compare very favorably to results reported by other methods in prior work, indicating that both the sample efficiency and the final performance of our agent on these benchmark tasks exceed the state of the art.
4.1 MuJoCo
The goal of this experimental evaluation is to understand how the sample complexity and stability of our method compare with prior off-policy and on-policy deep reinforcement learning algorithms. We compare our method to prior techniques on a range of challenging continuous control tasks from the OpenAI Gym benchmark suite. Although the easier tasks can be solved by a wide range of different algorithms, the more complex benchmarks, such as the 21-dimensional Humanoid, are exceptionally difficult to solve with off-policy algorithms. The stability of the algorithm also plays a large role in performance: easier tasks make it more practical to tune hyperparameters to achieve good results, while the already narrow basins of effective hyperparameters become prohibitively small for the more sensitive algorithms on the hardest benchmarks, leading to poor performance [QProp].
We compare our method to deep deterministic policy gradient (DDPG) [ddpg], a popular and widely used off-policy algorithm; proximal policy optimization (PPO) [ppo], a stable and effective on-policy policy gradient algorithm; and soft actor-critic (SAC) [SAC], a recent off-policy algorithm for learning maximum entropy policies. We additionally compare to the twin delayed deep deterministic policy gradient algorithm (TD3) [TD3].
We conducted the robotic locomotion experiments using the MuJoCo simulator [mujoco].
The states of the robots are their generalized positions and velocities, and the controls are joint torques. Underactuation, high dimensionality, and non-smooth dynamics due to contacts make these tasks very challenging. To allow for a reproducible and fair comparison, we evaluate all algorithms with similar network structures: for the off-policy algorithms we use a two-layer feedforward neural network of 400 and 300 hidden nodes respectively, with rectified linear units (ReLU) between each layer, for both the actor and the critic; for the on-policy algorithm we use a feedforward neural network with 64 hidden nodes. We use the hyperparameters shown to be superior in prior work
[that_matters] for the comparison methods. All network parameters are updated using Adam [adam], with no modifications to the environment or reward. Figure 3
compares five individual runs of both variants, initialized with different random seeds. RMC-SAC performs much better, showing that our agent significantly outperforms the baseline and indicating substantially better stability. As evident from the figure, with joint training and the internal reward, we achieve stable training. This becomes especially important on harder tasks, where tuning hyperparameters is challenging.
This shows that our agent outperforms the other baseline methods by a large margin, indicating that both the efficiency and the stability of the method are superior.
4.2 Flicker Mujoco
To address this problem, we introduce the flickering MuJoCo POMDP: a modification of the classic MuJoCo benchmark such that at each timestep the screen is either fully revealed or fully obscured with some probability. Obscuring frames in this manner probabilistically induces an incomplete memory of observations, turning MuJoCo into a POMDP. In order to succeed at flickering MuJoCo, it is necessary to integrate information across frames to estimate relevant variables such as the positions and velocities of the joints. Since half of the frames are obscured in expectation, a successful agent must be robust to the possibility of several, potentially contiguous, obscured inputs.
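The flickering wrapper itself amounts to a one-line transformation of the raw observation stream. A schematic sketch, where a plain list stands in for a MuJoCo state vector and the visibility probability is a free parameter:

```python
import random

def flicker(obs, p_visible, rng):
    """Flickering-observation wrapper: with probability p_visible the
    frame is fully revealed, otherwise it is fully obscured (zeroed)."""
    return obs if rng.random() < p_visible else [0.0] * len(obs)

rng = random.Random(0)
# Stream 100 copies of the same frame through the wrapper; roughly
# half should come out obscured when p_visible = 0.5.
frames = [flicker([1.0, 2.0], p_visible=0.5, rng=rng) for _ in range(100)]
visible = sum(f == [1.0, 2.0] for f in frames)
```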
4.2.1 Result
When dealing with partial observability, a choice exists between using a nonrecurrent deep network with a long history of observations or using a recurrent network trained with a single observation at each timestep. The results in this section show that recurrent networks can integrate information through time and serve as a viable alternative to stacking frames in the input layers.
As shown in Figure 2(f), our agent outperforms standard SAC combined with frame stacking. Our agent performs well at this task even when given only one input frame per timestep: RMC successfully integrates noisy single-frame information through time to detect events. Thus, given the same length of history, the recurrent net can better adapt at evaluation time if the quality of observations changes.
4.2.2 Generalization Performance
Our agent's performance increases when trained on a POMDP and then evaluated on an MDP. Arguably the more interesting question is the reverse: can a recurrent network be trained on a standard MDP and then generalize to a POMDP at evaluation time? To address this question, we evaluate the highest-scoring policies of our agent and SAC over the flickering equivalents of all 3 games. Figure 3(a) shows that while both algorithms incur significant performance decreases on account of the missing information, our agent retains more of its previous performance than SAC across all levels of flickering. We conclude that recurrent controllers have a certain degree of robustness against missing information, even when trained with full state information.
4.3 Ablation Study
4.3.1 The Impact of Different Training Method
As shown in Table LABEL:table:_scheme, we have at most six different update schemes. Figure 3(b) shows how learning performance changes when the update scheme is changed. For schemes 4-6, the policy becomes nearly random and consequently fails to exploit the reward signal, resulting in substantial degradation of performance. For scheme 2, the value function is enhanced with the capacity of the RNN, so the model learns quickly at first, but the policy then becomes nearly deterministic, leading to poor local minima due to the lack of adequate exploration and a worse state representation. As for scheme 3, due to the weak capacity of the value function, although the representation of the state may be better, it still cannot achieve strong performance. With scheme 1, the model balances exploration and exploitation: the model head makes sure the state is good enough to predict the next state, pushing the POMDP toward an MDP by backpropagating the model loss into the RNN; at the same time, the value head can adjust the representation and achieve strong capacity by jointly optimizing the RNN with the model head. Just as our brain has two different thinking patterns, so-called intuition and reasoning, our agent can take advantage of joint optimization by learning a representation of the state from observations both for prediction and for decision making.
Scheme | Model | Value | Policy
1 | True | True | False
2 | False | True | False
3 | True | False | False
4 | True | True | True
5 | True | False | True
6 | False | True | True
4.3.2 The Impact of Curiosity Strength
As we discussed in Section 3, the model head provides a prediction of the future and, at the same time, a curiosity bonus and a better representation. We already analyzed the influence of the model update and showed that joint training improves performance, but what about the curiosity part? What if we only train jointly but set \beta to zero (removing the internal reward)? To test this design choice, we choose update scheme 1, which was shown to be superior in the previous experiment, and change the scale of \beta. As illustrated in Figure 3(c), when we set \beta to zero, the model cannot explore well, so both the sample efficiency and the final score suffer. If we use a huge \beta, the internal reward influences the agent too much, so it is hard to utilize the external reward and the policy becomes nearly random.
Theoretically speaking, as the learning process goes on, the model loss should become smaller and smaller until it reaches zero, so \beta would not have much impact at the end, and all the different choices would lead to a similar final policy. In practice, however, due to the stochastic environment, the model loss is almost never zero, so we decay \beta from large to small, making our agent explore more at the beginning and exploit more at the end, leading to fast learning early on and stable learning at the end.
5 Conclusion
We found that the impact of the RNN and joint training goes beyond providing an agent with memory. Instead, it also serves a role not previously studied in RL, potentially by enabling better representation learning, and thereby improves performance even on domains that are fully observable and do not obviously require memory.
Empirical results show that RMC-SAC outperforms state-of-the-art model-free deep RL methods, including the off-policy SAC algorithm and the on-policy PPO algorithm, by a substantial margin, providing a promising avenue for improved robustness and stability. Meanwhile, with the memory and inference functionality of our agent, we can solve POMDP problems as well, which sheds light on real-world applications. Future work includes methods that incorporate stochastic transition functions (e.g., deep planning networks [planet], world models [world_model]) and more theoretical analysis.