1 Introduction
The general idea of episodic memory in the reinforcement learning setting is to leverage a long-term memory, a data structure containing information about past episodes, to improve agent performance. Existing works [2, 15] show that episodic control (EC) can be beneficial for the decision-making process. In the context of discrete-action problems, episodic memory stores information about states and the corresponding returns in a table-like data structure. With only a few actions, the proposed methods store past experiences in separate memory buffers, one per action. During action selection, the value estimate of each action is reconsidered with respect to the stored memories and may be corrected to take past experience into account. The motivation for using episodic-like structures is to latch quickly onto rare but promising experiences that slow gradient-based models cannot reflect from a small number of samples.
Using episodic memory in continuous control is not trivial. Since the action space may be high-dimensional, methods that operate over a discrete action space are not suitable. Another challenge is the complexity of the state space. The study of [2] shows that some discrete-action environments (e.g. Atari) may have a high ratio of repeating states, which is not the case for complex continuous control environments.
Our work proposes an algorithm that leverages episodic memory through a modification of the Actor-Critic objective. Our algorithm is based on DDPG, a model-free actor-critic algorithm that operates over a continuous action space. We propose to store the representation of state-action pairs in the memory module, so that memory association is performed not only from the environment state but also from the performed action. Our experiments show that the modified objective provides greater sample efficiency compared with off-policy algorithms. We further improve agent performance by introducing a novel form of prioritized experience replay based on episodic experiences. The proposed algorithm, Episodic Memory Actor-Critic (EMAC), exploits episodic memory during training, distilling the Monte-Carlo (MC) discounted return signal through the critic to the actor and resulting in a strong policy that provides greater sample efficiency than other models.
We draw a connection between the proposed objective and the alleviation of Q-value overestimation, a common problem in Actor-Critic methods [18]. We show that the use of retrieved episodic data results in more realistic critic estimates, which in turn provides faster training convergence. In contrast, the proposed prioritized experience replay aims to focus more on optimistic, promising experiences from the past. Frequently exploiting only the high-reward transitions from the whole replay buffer may result in unstable behaviour. However, the stabilizing effect of the new objective allows us to decrease the priority of non-informative transitions with low return without a loss in stability.
We evaluate our model on a set of OpenAI gym environments [4] (Figure 1) and show that it achieves greater sample efficiency compared with the state-of-the-art off-policy algorithms (TD3, SAC). We open-sourced our algorithm for reproducibility. All code and learning curves can be accessed at: http://github.com/schatty/EMAC.
2 Related Work
Episodic Memory.
First introduced in [11], episodic control is studied within a regular tree-structured MDP setup. Episodic control aims to memorize highly rewarding experiences and replay sequences of actions that achieved high returns in the past. The successful application of episodic memory to discrete-action problems has been studied in [2]. Reflecting the model of hippocampal episodic control, that work shows that table-based structures containing state representations with the corresponding returns can improve model-free algorithms on environments with a high ratio of repeating states. A semi-tabular differentiable version of memory is proposed in [15] to store past experiences. The work of [14] is reminiscent of the ideas presented in our study. The authors modify the DQN objective to mimic the relationship between two learning systems of the human brain. Their experiments show that this modification improves sample efficiency in the arcade learning environment [1]. A framework that associates related experience trajectories in memory to reason about effective strategies is proposed in [19]. In the context of continuous control, the work of [22] exploits episodic memory to redesign experience replay for asynchronous DDPG. To the best of our knowledge, this is the only prior work that leverages episodic memory in continuous control.
Actor-Critic algorithms.
Actor-Critic methods represent a set of algorithms that compute a value function for the policy (critic) and improve the policy (actor) using this value function. Using deep approximators for the value function and the actor, [12] presents Deep Deterministic Policy Gradient (DDPG). This model-free off-policy algorithm aims to learn a policy in a high-dimensional, continuous action space. Recent works propose several modifications that stabilize and improve the performance of the original DDPG. The work of [5] addresses Q-value overestimation and proposes the Twin Delayed Deep Deterministic (TD3) algorithm, which greatly improves DDPG learning speed. The proposed modifications include multiple critics to reduce the critic's over-optimism, additional noise applied when calculating the target Q-estimate, and delayed policy updates. In [6, 7] the authors study maximum-entropy objectives, on the basis of which they achieve state-of-the-art performance on the OpenAI gym benchmark.
3 Background
The reinforcement learning setup consists of an agent that interacts with an environment $E$. At each discrete time step $t$ the agent receives an environment state $s_t$, performs an action $a_t$ and receives a reward $r_t$. An agent's action is defined by a policy $\pi$, which maps the state to a probability distribution over the possible actions. The return is defined as a discounted sum of rewards $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$ with a discount factor $\gamma \in [0, 1]$.
Reinforcement learning aims to find the optimal policy $\pi_\phi$, with parameters $\phi$, which maximizes the expected return from the initial distribution, $J(\phi) = \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}[R_0]$. The Q-function denotes the expected return when performing action $a$ from the state $s$ and following the current policy $\pi$ thereafter: $Q^\pi(s, a) = \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}[R_t \mid s, a]$.
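For continuous control problems the policy can be updated by taking the gradient of the expected return with the deterministic policy gradient algorithm [17]:
$$\nabla_\phi J(\phi) = \mathbb{E}_{s \sim p_\pi}\left[\nabla_a Q^\pi(s, a)\big|_{a = \pi(s)} \nabla_\phi \pi_\phi(s)\right] \tag{1}$$
In actor-critic methods, we operate with two parametrized functions. An actor represents the policy $\pi$ and a critic is the Q-function. The critic is updated with temporal difference learning by iteratively minimizing the error of the Bellman equation [21]:
$$Q^\pi(s, a) = r + \gamma\, \mathbb{E}_{s', a'}\left[Q^\pi(s', a')\right] \tag{2}$$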
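In deep Q-learning, the parameters $\theta$ of the Q-function are updated with an additional frozen target network $Q_{\theta'}$, which is updated by a proportion $\tau$ to match the current Q-function, $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$:
$$J_Q(\theta) = \mathbb{E}\left[\left(Q_\theta(s, a) - y\right)^2\right] \tag{3}$$
where
$$y = r + \gamma\, Q_{\theta'}(s', a'), \quad a' = \pi_{\phi'}(s') \tag{4}$$
The actor is trained to maximize the current Q-function:
$$J_\pi(\phi) = \mathbb{E}_{s}\left[Q_\theta(s, \pi_\phi(s))\right] \tag{5}$$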
The advantage of actor-critic methods is that they can be applied off-policy, a class of methods proven to have better sample complexity [12]. During training, the actor and critic are updated with mini-batches sampled from the experience replay buffer [13].
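As a minimal illustration of the critic update described in this section, the sketch below computes a one-step temporal-difference target and a soft target-network update with NumPy. The function names, the termination mask `done`, and the default values `gamma=0.99` and `tau=0.005` are assumptions made for this example and are not taken from the text.

```python
import numpy as np


def td_target(reward, next_q, done, gamma=0.99):
    """One-step TD target: r + gamma * Q_target(s', a').

    `done` (assumption) masks out the bootstrap term on terminal transitions.
    """
    return reward + gamma * (1.0 - done) * next_q


def soft_update(target_params, params, tau=0.005):
    """Move every target parameter a fraction tau towards the current one."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]


# Toy batch of two transitions; the second one is terminal.
reward = np.array([1.0, 0.0])
next_q = np.array([10.0, 5.0])
done = np.array([0.0, 1.0])
y = td_target(reward, next_q, done)

# Toy "networks" represented as lists of parameter arrays.
new_target = soft_update([np.zeros(2)], [np.ones(2)], tau=0.1)
```

A small `tau` keeps the target network close to frozen, which is what makes the target in the critic loss change slowly between updates.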
4 Method
We present Episodic Memory Actor-Critic, an algorithm that leverages episodic memory during training. Our algorithm builds on Deep Deterministic Policy Gradient [12] by modifying the critic's objective and introducing episodic-based prioritized experience replay.
4.1 Memory Module
The core of our algorithm features a table-based structure that stores the experience of past episodes. Following the approach of [2], we treat this memory table as a set of key-value pairs. The key is the representation of a concatenated state-action pair and the value is the true discounted Monte-Carlo return from the environment. We encode this state-action pair into a vector of smaller dimension by the projection operation $z = Wx$, $W \in \mathbb{R}^{v \times u}$, where $u$ is the dimension of the concatenated state-action vector $x$ and $v$ is a smaller projected dimension. As the Johnson-Lindenstrauss lemma [8] states, this transformation approximately preserves relative distances in the original space given that the entries of $W$ are drawn from a standard Gaussian distribution. The projection matrix is initialized once at the beginning of training.
The memory module supports two operations: add and lookup. As we can calculate the discounted return only at the end of an episode, we perform the add operation, adding pairs $(z_t, R_t)$ of projected keys and discounted returns, once the whole episode has been generated by the agent. We note that the true discounted return may be considered fair only when the episode ends naturally. In the case of a time step limit we should perform additional evaluation steps without using these transitions for training. This way the discounted return of later transitions is obtained from the same number of subsequent steps as for transitions at the beginning of the episode. Given the complexity and continuous nature of the environments, we assume that an incoming state-action pair has no duplicates among the stored representations. As a result, we do not search for stored duplicates to replace, and we simply append the incoming key-value pair to the end of the module. Thus, the add operation takes $O(1)$ time.
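The add operation described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the released implementation: the class name `EpisodicMemory`, its method names, the discount `gamma=0.99` and the list-based storage are hypothetical, and the default projection size of 4 is simply the smallest of the sizes examined in the experiments.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Monte-Carlo return for every step, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


class EpisodicMemory:
    """Hypothetical key-value table: projected (state, action) -> MC return."""

    def __init__(self, input_dim, proj_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        # Gaussian random projection, drawn once and then kept fixed.
        self.projection = rng.standard_normal((proj_dim, input_dim))
        self.keys = []
        self.values = []

    def add_episode(self, states, actions, rewards, gamma=0.99):
        """Append all pairs of a finished episode without a duplicate search."""
        pairs = np.concatenate([states, actions], axis=1)
        projected = pairs @ self.projection.T
        self.keys.extend(projected)
        self.values.extend(discounted_returns(rewards, gamma))
```

Because no search for duplicates is performed, each stored pair costs a single append, which matches the constant-time add described in the text.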
The lookup operation takes a state-action pair as input, projects it to the smaller dimension and searches for the $K$ most similar keys, returning the corresponding MC returns. We use the l2 distance as a metric of similarity between the projected vectors
$$d(z, z_i) = \lVert z - z_i \rVert_2^2 + \epsilon \tag{6}$$
where $\epsilon$ is a small constant. The episodic MC return for a projected query vector $z$ is formulated as a weighted sum of the $K$ closest elements
$$Q_M = \sum_{k=1}^{K} w_k q_k \tag{7}$$
where $q_k$ is a value stored in the memory module and $w_k$ is a weight proportional to the inverse distance to the query vector:
$$w_k = \frac{1 / d(z, z_k)}{\sum_{j=1}^{K} 1 / d(z, z_j)} \tag{8}$$
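One way to picture the lookup is the following NumPy sketch. It assumes a squared l2 distance offset by a small constant and weights obtained by normalizing the inverse distances of the k nearest keys; the function name `lookup`, the default `k=2` and the value of `eps` are illustrative choices rather than details given in the text.

```python
import numpy as np


def lookup(query, keys, values, k=2, eps=1e-3):
    """Weighted MC return of the k stored keys closest to `query`."""
    keys = np.asarray(keys, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    # Squared l2 distance to every stored key, offset by eps (assumption)
    # so that an exact match does not cause a division by zero.
    dists = np.sum((keys - query) ** 2, axis=1) + eps
    nearest = np.argsort(dists)[:k]
    # Weights proportional to the inverse distance, normalized to sum to one.
    inverse = 1.0 / dists[nearest]
    weights = inverse / inverse.sum()
    return float(np.dot(weights, values[nearest]))


keys = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
values = np.array([1.0, 3.0, 100.0])
estimate = lookup(np.zeros(2), keys, values, k=2)
```

With the query sitting exactly on the first key, almost all weight goes to that key, so the estimate stays close to its stored return while the distant third key is ignored entirely.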
We propose to leverage stored MC returns as a second, pessimistic estimate of the Q-function. In contrast to the approach of [15, 14], we use a weighted sum of all near returns rather than taking their maximum. Given that we are able to calculate the MC return for each transition from the off-policy replay buffer, we propose the following modification of the original critic objective (3):
$$J_Q(\theta) = (1 - \alpha)\left(Q_\theta(s, a) - y\right)^2 + \alpha\left(Q_\theta(s, a) - Q_M\right)^2 \tag{9}$$
where $y$ is the temporal-difference target of (4), $Q_M$ is the value returned by the lookup operation and $\alpha$ is a hyperparameter controlling the contribution of episodic memory. In the case of $\alpha = 0$ the objective becomes a common Q-learning loss function. In our experiments we found small values of $\alpha$ beneficial. In the evaluation results $\alpha$ is set to 0.1 for Walker, Hopper, InvertedPendulum and InvertedDoublePendulum, and to 0.15 for Swimmer.
The memory module is similar to the replay buffer. Both are needed to sample off-policy data during the Q-learning update. The difference is that the memory module stores low-dimensional representations of the state-action pairs for effective lookup, and discounted returns instead of rewards. An outline of the proposed architecture for calculating Q-estimates is presented in Figure 2.
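The combined objective can be illustrated with a short NumPy sketch. It assumes that the two squared-error terms are mixed as a convex combination with weights `1 - alpha` and `alpha`, which reduces to the usual Q-learning loss at `alpha = 0`; the function and argument names are hypothetical.

```python
import numpy as np


def emac_critic_loss(q_pred, td_target, q_memory, alpha=0.1):
    """Critic loss mixing the TD error with an episodic-memory error.

    q_pred    : critic predictions Q(s, a) for a mini-batch
    td_target : one-step TD targets for the same transitions
    q_memory  : episodic MC estimates returned by the memory lookup
    alpha     : contribution of the episodic term (mixing form is an assumption)
    """
    td_term = np.mean((q_pred - td_target) ** 2)
    memory_term = np.mean((q_pred - q_memory) ** 2)
    return (1.0 - alpha) * td_term + alpha * memory_term


q_pred = np.array([2.0, 4.0])
td_target = np.array([1.0, 4.0])
q_memory = np.array([0.0, 4.0])
loss = emac_critic_loss(q_pred, td_target, q_memory, alpha=0.1)
```

In this toy batch the memory estimate of the first transition lies below the critic prediction, so the episodic term adds a penalty that pulls the prediction downwards.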
4.2 Alleviating Q-value Overestimation
The issue of Q-value overestimation is common in Q-learning based methods [18, 12]. In the discrete-action setting, the value estimate is updated in a greedy fashion from a suboptimal Q-function that contains some error. As a result, the maximum over the actions along with its error will generally be greater than the true maximum [18]. This bias is then propagated through the Bellman residual, resulting in a consistent overestimation of the Q-function. The work of [5] studies the presence of the same overestimation in an actor-critic setting. The authors propose a clipped variant of Double Q-learning that reduces overestimation. In our work we show that an additional episodic Q-estimate in the critic loss can be used as a tool to reduce overestimation bias.
Following the procedure described in [5], we perform an experiment that compares the Q-value predicted by the critic with a true Q-estimate. We compare the value estimate of the critic, the true value estimate and the episodic MC return stored in the memory module. As the true value estimate we use the discounted return of the current policy starting from a state sampled randomly from the replay buffer. The discounted return is calculated from true rewards for a maximum of 1000 steps, or fewer if the episode ends. We measure the true value estimate every 5000 steps during training of 100000 steps. We use the batch average for the critic's Q-value estimate and the episodic return. The learning behaviour for the Walker2d-v3 and Hopper-v3 domains is shown in Figure 3. In (a) we show that the problem of Q-value overestimation exists for both the TD3 and EMAC algorithms, as both Q-value predictions are higher than their true estimates. The Q-value prediction of EMAC shows less tendency towards overestimation. In (b) we compare the true value estimate with the episodic MC return and show that the latter behaves more realistically than the critic's estimate. Here, the difference between the true estimate and the MC return is that episodic returns are obtained from past suboptimal policies, whereas the true value estimate is calculated using the current policy. The training curves in (b) show that the MC return is lower than the true estimate.
The experiment shows that episodic Monte-Carlo returns from the past are always more pessimistic than the corresponding critic value. This makes incorporating the MC return into the objective beneficial for training. As a result, the proposed objective with the episodic MC return shows less tendency towards overestimation compared with the TD3 model. Given a state-action pair, we state that the MC return produced by a suboptimal policy for the same state may be used as a second, stabilizing estimate for the critic. Therefore, the additional loss component, the MSE between the episodic return and the critic estimate, may be interpreted as a penalty for critic overestimation.
4.3 Episodic-based Experience Replay Prioritization
The prioritization of sampled experiences for the off-policy update of the Q-function approximation is a well-studied tool for improving the sample efficiency and stability of an agent [16, 20, 9]. The common approach to prioritization is formulated in [16], where the normalized TD-error is used as the criterion of transition importance during sampling. We propose a slightly different prioritization scheme that uses the episodic return as the criterion for sampling preference. We formulate the probability of sampling a transition as
$$P(i) = \frac{p_i^{\beta}}{\sum_{k} p_k^{\beta}} \tag{10}$$
where the priority $p_i$ of transition $i$ is the MC return stored in the memory module. The exponent $\beta$ controls the degree of prioritization, with $\beta = 0$ corresponding to the uniform, non-prioritized case. Such prioritization can be interpreted as an increased reuse of transitions that gave high returns in the past. Shifting the true reward distribution used for the off-policy update may result in divergent behaviour, so large values of $\beta$ may destabilize training. In our experiments a value of 0.5 gave promising results, improving on the non-prioritized version of the algorithm.
Environment  EMAC  EMAC-NoPr  DDPG  TD3  SAC
Walker2d-v3  129.73  1008.32  1787.28
Hopper-v3  1043.78  989.96  2411.02
Swimmer-v3  27.45  38.91  56.7
InvertedPendulum-v2  1000  1000  909.98  1000  1000
InvertedDoublePendulum-v2  9205.81  9309.73  9357.17
± corresponds to a standard deviation over 10 trials.
The stored sampling probabilities are recalculated each time a new episode is added to the memory module and are subsequently used when sampling from the off-policy replay buffer.
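The return-based prioritization of this section can be sketched in NumPy as follows. The example assumes that each priority is the stored MC return raised to an exponent `beta` and normalized over the buffer. Since returns can be negative, the sketch additionally shifts them to be positive before exponentiation; that shift, the function names and the constant `eps` are assumptions of this example, not details from the text.

```python
import numpy as np


def sampling_probabilities(mc_returns, beta=0.5, eps=1e-6):
    """Turn stored MC returns into replay sampling probabilities."""
    priorities = np.asarray(mc_returns, dtype=np.float64)
    # Assumption: shift so every priority is positive before exponentiation.
    priorities = priorities - priorities.min() + eps
    scaled = priorities ** beta
    return scaled / scaled.sum()


def sample_batch_indices(mc_returns, batch_size, beta=0.5, seed=0):
    """Draw replay-buffer indices according to the return-based priorities."""
    rng = np.random.default_rng(seed)
    probs = sampling_probabilities(mc_returns, beta)
    return rng.choice(len(probs), size=batch_size, p=probs)


probs = sampling_probabilities([-5.0, 0.0, 20.0])
indices = sample_batch_indices([-5.0, 0.0, 20.0], batch_size=8)
```

Setting `beta=0` recovers uniform sampling, while larger values concentrate the mini-batches on transitions with high stored returns.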
5 Experiments
The EMAC algorithm is shown in Algorithm 1. We compare our algorithm with the model-free off-policy algorithms DDPG [12], TD3 [5] and SAC [6]. For DDPG and TD3 we use the implementations of [5]. For SAC we implemented the model following [6]. All the networks have the same architecture in terms of the number of hidden layers, nonlinearities and size of the hidden dimensions. In our experiments we focus on the small-data regime, giving all algorithms 200000 time steps from an environment.
We evaluate our algorithm on a set of OpenAI gym domains [4]. Each environment is run for 200000 time steps with the corresponding number of network update steps. Evaluation is performed every 1000 steps, with the reported value being an average over 10 evaluation episodes from different seeds without any exploration. We report results from both the prioritized (EMAC) and non-prioritized (EMAC-NoPr) versions of the algorithm. The results are reported from 10 random seeds. The results in Table 1 are the average returns of the last training episode over all seeds. The training curves of the compared algorithms are presented in Figure 4. Our algorithm outperforms DDPG and TD3 on all tested environments and SAC on three out of five environments.
The networks' parameters are updated with the Adam optimizer [10] with a learning rate of 0.001. All models consist of two hidden layers of size 256 for both the actor and the critic, with a rectified linear unit (ReLU) as the nonlinearity. During an initial phase of training we do not use the actor for action selection and instead choose actions randomly for exploration.
During the architecture design we study the dependency between the size of the projected state-action pairs and the final metrics. We observe a small increase in efficiency with larger projected dimensions. Training curves for projection sizes of 4, 16 and 32 on the Walker2d-v3 and Hopper-v3 domains are depicted in Figure 5. Unfortunately, a larger projection size tends to slow down the lookup operation. In all our experiments we use the smallest projection size of 4 for faster computation. The work of [15] leverages a KD-tree data structure to store and search experiences. In contrast, we keep a table-based structure for the stored experiences but place the module on a CUDA device to perform vectorized l2-distance calculations. All our experiments are performed on a single NVIDIA 1080 Ti card.
Owing to the low-data regime, we are able to store all incoming transitions without replacement in both the replay buffer and the episodic memory module. We set the episodic memory size to 200000 transitions, although our experiments do not indicate a loss in performance for smaller sizes of 100 and 50 thousand records. The parameter k that determines the number of top-k nearest neighbours for the weighted episodic return calculation is set to 1 or 2, depending on which option achieved the best results. We notice that k larger than 3 leads to worse performance.
To evaluate episodic-based prioritization we compare the prioritized and non-prioritized versions of EMAC on multiple environments. The prioritization parameter is set to 0.5. The learning behaviour of the prioritized and non-prioritized versions of the algorithm is shown in Figure 6. The average return for both versions is provided in Table 1.
6 Discussion
We present Episodic Memory Actor-Critic (EMAC), a deep reinforcement learning algorithm that exploits episodic memory in continuous control problems. EMAC uses a non-parametric data structure to store and retrieve experiences from the past. The episodic memory module is involved in the training process via an additional term in the critic's loss. The motivation behind this approach is that the actor depends directly on the critic, so improving the critic's quality yields a stronger and more efficient policy. We do not exploit episodic memory during policy evaluation, which means that the memory module is used only within the network update step. The loss modification demonstrates a sample-efficiency improvement over the DDPG baseline. We further show that introducing prioritization based on episodic memories improves our results. Our experimental study of Q-value overestimation shows that the proposed approach has less tendency towards critic overestimation, thus providing faster and more stable training.
Our experiments show that leveraging episodic memory gives superior results in comparison with the baseline algorithms DDPG and TD3 on all tested environments and also outperforms SAC on 3 out of 5 environments. We hypothesize that the applicability of the proposed method depends on environment complexity. As a result, we struggle to outperform SAC on an environment as complicated as Humanoid, which has a larger action space than the other OpenAI gym domains.
We now outline directions for future work. As noted in [3], both animal and human brains exploit multiple memory systems while solving various tasks. Episodic memory is often associated with hippocampal involvement [2, 14] as a long-term memory. Although the gradient-update-based policy may be seen as a working memory, it may be beneficial to study the role of separate short-term memory mechanisms alongside episodic memory for better decision making. Another interesting direction is to use a different state-action representation for storing experiences. Although the random projection provides a way to transfer the distance relation from the original space to the smaller one, it does not reflect topological similarity between the state-action records. One possible way to overcome this issue is to use a differentiable embedding for the state-action representation. Unfortunately, the changing nature of embeddings entails the need for constant or periodic memory updates, which is in itself an engineering challenge. We believe that our work demonstrates the benefit of using episodic memory for continuous control tasks and opens further research directions in this area.
References

[1] (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
[2] (2016) Model-free episodic control. arXiv preprint arXiv:1606.04460.
[3] (2005) Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience 8 (12), pp. 1704–1711.
[4] (2017) OpenAI baselines. GitHub. https://github.com/openai/baselines
[5] (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
[6] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
[7] (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
[8] (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics 26 (189–206), pp. 1.
[9] (2018) Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations.
[10] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[11] (2007) Hippocampal contributions to control: the third way. Advances in neural information processing systems 20, pp. 889–896.
[12] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[13] (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8 (3-4), pp. 293–321.
[14] (2018) Episodic memory deep Q-networks. arXiv preprint arXiv:1805.07603.
[15] (2017) Neural episodic control. arXiv preprint arXiv:1703.01988.
[16] (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952.
[17] (2014) Deterministic policy gradient algorithms. In International conference on machine learning, pp. 387–395.
[18] (1993) Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum.
[19] (2017) Episodic contributions to model-based reinforcement learning. In Annual conference on cognitive computational neuroscience, CCN.
[20] (2016) Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
[21] (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292.
[22] (2019) Asynchronous episodic deep deterministic policy gradient: toward continuous control in computationally complex environments. IEEE Transactions on Cybernetics.