Human-regulated environments often rely on legislation and complex sets of rules. At present, Reinforcement Learning (RL) methods are usually tested in environments with relatively sparse rules and exceptions. Denser regulations appear in applications of RL for autonomous vehicles research, but such rulesets are often fixed in terms of complexity. We are interested in observing how learning can be affected by the depth of the rulesets governing such systems. With large numbers of corner cases arising as a consequence of dense rulesets, generating a sufficiently diverse set of experiences and exposing these exceptions to an RL agent can be challenging. Some works in the literature propose to sample past experiences related to those exceptions, heuristically revisiting potentially important events. Among them, Prioritised Experience Replay (PER) over-samples the experiences that are most poorly captured by the agent’s learned model. However, this mechanism does not necessarily focus on the cause of events or their exceptional nature.
In this letter, we pursue the intuition that explanations are a pivotal mechanism for human intelligence, and that this mechanism has the potential to boost the performance of RL agents in complex environments. This is why we draw inspiration from user-centred explanatory processes for humans, and design a set of heuristics and mechanisms for prioritised experience replay to explain complex regulations to a generic off-policy RL agent. A central design challenge towards this goal is integrating explanations into computational representations. Approaches such as encoding the ruleset (or part of it) into the agent’s observation space may incur severe re-training overhead even under minimal ruleset changes, as the semantics of the regulation are explicitly provided as input. This reduces compatibility with extant methods and may obscure whether differences in performance are due to changes to the architecture or to the complexity of the ruleset. We propose a solution that does not require explicitly engineering state and observation spaces, using an explanation-aware experience replay mechanism.
In our approach, we avoid explicit representations of the ruleset (i.e. rule-based explanations) by instead representing the meaning of the regulations as organised collections of examples (i.e. case-based explanations). These explanations do not need to be understood by the agent in the traditional sense, but can still convey meaning if the example was labelled/explained through a semantically meaningful process. As a playful example, suppose a young man, called Luke, is taking hyperspace flight lessons from his exasperated friend Chewbacca. However, he does not understand a single word of Shyriiwook, the tutor’s language. With sufficient repetition, Luke can associate distinct Wookiee growls (and punishments) with categories of experienced episodes, even if the content of the message is in an unknown language. Eventually, Luke would learn the meaning of the most relevant utterances by associating them with the experienced consequences. Hence, our approach modifies conventional experience replay structures by partitioning the replay buffer (or memory) into multiple clusters, each representing a distinct explanation associated with a collection of experiences that serve as examples. We call this process Explanation-Aware Experience Replay (XAER) (see Figure 1) and integrate this technique into three seminal learning algorithms: Deep Q-Networks (DQN), Twin-Delayed DDPG (TD3), and Soft Actor-Critic (SAC).
In summary, we state the following contributions:
We show how distinct types and instances of explanations can be used to partition replay buffers and improve the rule coverage of sampled experiences.
We design discrete and continuous environments (GridDrive and GraphDrive) compatible with modular rulesets of arbitrary complexity (cultures). This leads to 9 learning tasks involving both environments with different levels of rule complexity and reward sparsity. These serve as a platform to evaluate how RL agents react to changes in rulesets whilst keeping a consistent state and action space.
We introduce XAER-modified versions of traditional algorithms such as DQN, TD3, and SAC, and test the performance of those modified versions in our proposed environments.
Upon experimenting on the proposed continuous and discrete environments, our key insight is that organising experiences with XAER improves agent performance (compared to traditional PER) and can reach a better policy where traditional PER may fail to learn altogether.
2 Related Work
In this section we give the necessary background to understand our proposed solution and the following experiments, along with prior related work.
2.1 Model-Free Reinforcement Learning
A Reinforcement Learning problem is typically formalised as a Markov Decision Process (MDP). In this setting, an agent interacts at discrete time steps with an external environment. At each time step $t$, the agent observes a state $s_t$ and chooses an action $a_t$ according to some policy $\pi$, that is, a mapping (a probability distribution) from states to actions. As a result of its action, the agent obtains a reward $r_t$, and the environment passes to a new state $s_{t+1}$. The process is then iterated until a terminal state is reached.
The future cumulative reward $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated reward starting from time $t$. $\gamma \in [0, 1]$ is the discount factor, representing the difference in importance between present and future rewards. The goal of the agent is to maximise the expected cumulative return starting from an initial state $s = s_t$. The action value $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$ is the expected return for selecting action $a$ in state $s$ and then following policy $\pi$. Given a state $s$ and an action $a$, the optimal action value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ is the best possible action value achievable by any policy. Similarly, the value of state $s$ given a policy $\pi$ is $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s]$ and the optimal value function is $V^{*}(s) = \max_{\pi} V^{\pi}(s)$.
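To make the discounted cumulative reward concrete, the following is a minimal sketch in Python. The function name `discounted_return` and the argument `gamma` (standing for the discount factor) are illustrative choices rather than names taken from this work, and the episode is assumed to be finite.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        # Each step adds its own reward and discounts everything that follows it.
        total = reward + gamma * total
    return total


# Three unit rewards with gamma = 0.5: 1 + 0.5 * 1 + 0.25 * 1
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With `gamma` close to 0 the agent values almost only the immediate reward, while with `gamma` equal to 1 all rewards in the episode count equally.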
Two of the major approaches to RL are value-based and actor-critic algorithms. Value-based algorithms, such as Deep Q-Networks (DQN), use temporal difference learning, where policy extraction is done after an optimal value function is found. Actor-critic methods, such as Twin-Delayed DDPG (TD3) and Soft Actor-Critic (SAC), rely on evaluating and improving a policy (via gradient descent) together with a state-value function. DQN is one of the very first value-based deep RL algorithms, designed to work on discrete action-spaces only. Adaptations for continuous action-spaces followed with DDPG and then TD3, the latter addressing value overestimation problems by means of clipped double Q-learning, delayed updates of target and policy networks, and target policy smoothing. However, one of the main limitations of TD3 is that it randomly samples actions using a pre-defined distribution. To overcome this limitation, SAC gives the agent the ability to also learn the distribution from which actions are sampled, allowing it to explore a wider range of strategies through entropy maximisation.
2.2 Explanations in RL
The most important field studying explanations in AI and RL is eXplainable AI (XAI). Among the many surveys on XAI, a common dimension used to classify explanations is the representative format used to convey them. Within this domain, explanations are commonly conveyed via textual/visual descriptive representations of the decision criteria (i.e. rule-based), or with similar examples (i.e. case-based). An example of a rule-based explanation is ‘you will get a penalty for reaching 75, which is above the speed limit of 50’, based on the rule ‘if speed is above 50, you will get a penalty’. An example of a case-based explanation is instead ‘you get a penalty because you are in a situation similar to this other vehicle that reached speed 74 and was previously penalised’.
Dietterich and Flann frame explanation-based RL as a case-based explanatory process where prototypical trajectories of state-transitions are used to tackle similar but unseen situations, while Chow et al. implement a rule-based method, constraining the Markov Decision Process by means of Lyapunov functions.
Generally speaking, many rule-based methods for explaining to RL agents fall under the umbrella of a sub-discipline called Safe RL. Safe RL includes techniques both for encoding rules in the optimality criterion [8, 27] and for incorporating such external knowledge into the action/state space. Although not generating explicit explanations, those methods engineer safety rules into the learning process, implicitly explaining to the agent what not to do. Alternatively, a famous example of case-based methods for explaining to RL agents is that of Imitation Learning, where demonstrations (as trajectories of state-transitions generated by a human or expert algorithms) are used to train the RL agent. These can be seen as high-quality cases/examples provided by an expert human or algorithm. However, access to human expert data may not scale well to every domain, and not all problems have accessible expert algorithms.
We are interested in sampling the most useful experiences to cover a particular agent’s gap in knowledge. An agent-centred explanatory process is an iterative process that follows the agent through the process of learning and selects the most useful explanations for it, at every time-step. Below, we look at how experience replay techniques tackle this issue in off-policy RL.
2.3 Prioritised Experience Replay
Algorithms such as DQN, TD3, and SAC aim to find a policy that maximises the cumulative return, by keeping and learning from a set of expected-return estimates for some past policy. This set of expected returns is kept in an experience buffer, enabling experience replay. Experience replay consists of re-using information from the space of sampled experiences. The agent’s experiences at each time-step are stored as transitions $(s_t, a_t, r_t, s_{t+1})$, where $s_t$, $a_t$, $r_t$ represent the state, action, and reward at time $t$, followed by the next state $s_{t+1}$. These transitions are pooled over many episodes into a replay memory, which is usually randomly sampled for a mini-batch of experiences.
Experience sampling can be improved by differentiating important transitions from unimportant ones. In Prioritised Experience Replay (PER), the importance of a transition, understood as its expected learning value, is measured by the magnitude of its temporal-difference (TD) error. Experiences with larger TD error are sampled more frequently, as the TD error quantifies the unexpectedness of a given transition. This prioritisation can lead to a loss of diversity and introduce biases. Bias in prioritised experience replay occurs when the sampling distribution is changed without control, which in turn changes the solution that the estimates will converge to. This bias can be corrected through importance-sampling (IS) weights.
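The prioritisation and bias correction described above can be sketched as follows. This is a minimal, standard-library illustration of proportional prioritisation, not the implementation used in this work: the function name `per_sample` and the parameters `alpha`, `beta`, and `eps` (with their default values) are assumptions introduced for the example, and normalising weights by the largest weight in the batch is one common convention among several.

```python
import random


def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, rng=None):
    """Sample indices proportionally to |TD error|**alpha, with IS weights."""
    rng = rng or random.Random()
    # Priority grows with the magnitude of the TD error; eps keeps it non-zero.
    priorities = [(abs(err) + eps) ** alpha for err in td_errors]
    total = sum(priorities)
    probs = [p / total for p in priorities]
    indices = rng.choices(range(len(td_errors)), weights=probs, k=batch_size)
    # Importance-sampling weights counteract the bias of non-uniform sampling.
    n = len(td_errors)
    weights = [(n * probs[i]) ** (-beta) for i in indices]
    # Scale so that the largest weight in the batch equals 1.
    max_weight = max(weights)
    return indices, [w / max_weight for w in weights]


indices, is_weights = per_sample([0.1, 2.0, 0.5], batch_size=8, rng=random.Random(0))
```

Transitions that are sampled often (large TD error) receive smaller weights, so their gradient contribution is scaled down accordingly.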
Many approaches to Prioritised Experience Replay (PER) in RL can be re-framed as mechanisms for achieving agent-centrality, re-ordering experience by relevance in the attempt of explaining to the agent and selecting the most useful experience, as indirectly suggested by Li et al. Over the years, many human-inspired intuitions behind PER drove researchers towards improved, more sophisticated and agent-centred mechanisms for RL [31, 32, 33]. Among these works, the closest to a fully agent-centred explanatory process is Experience Replay Optimisation, which moves towards agent-centrality by providing an external black-box mechanism (or experience sampler) for extracting arbitrary sequences of information out of a flat (no abstraction involved) experience buffer. The experience sampler is trained to select the most ‘useful’ ones for the learning agent. However, due to its non-explainable nature, it is not clear whether the benefits given by Experience Replay Optimisation are due to the overhead given by the experience sampler increasing the number of neurons in the agent’s network.
Another work trying to achieve agent-centrality in this sense is Attentive Experience Replay, which suggests prioritising uncommon experience that is also on-distribution (related to the agent’s current task). However, this work, like the previous one, falls short of explicitly organising experience in an abstract-enough way by conveying human-readable explanations to the agent. Hierarchical Experience Replay has attempted to address the abstraction issue in order to simplify the task for the agent, decomposing it into sub-tasks. However, it does not do so in an agent-centred and goal-oriented way, given that its sub-task selection is uniform and not curricular.
On the other hand, a curricular approach for training RL agents was proposed by Ren et al., exploiting PER and the intuition that simplicity is inversely proportional to TD-errors, but not exploiting any abstract and hierarchical representation of tasks. Similarly to ours, another prior work aims to organise experience abstractly, based on its explanatory content, framed as the ability to answer how good/bad a sequence of state-transitions is with respect to average experience. That work only considers explanations about the immediate performance of the agent (i.e. HOW explanations), and lacks any consideration of other and richer types (i.e. WHY), as well as curricular prioritisation facilities.
Our use of explanations is aligned to Holland’s and Achinstein’s philosophical theories of explanations. In fact, in the former, the act of explaining is framed as a process of revising belief whenever new experience challenges it. In the latter, explaining is the attempt to answer questions (such as ‘why’, ‘what’, etc.) in an agent-centred way. Specifically, we propose a transformation of rule-based explanations (e.g. given by a ruleset/culture) to case-based explanations (experience), which are compatible with experience replay. Leaning on the concept of Explanation-Awareness (XA), our heuristics facilitate information acquisition via the organisation of experience buffers.
Drawing from an epistemic interpretation of explanations, we argue that a central aspect of providing case-based explanations to an RL agent comes from meaningfully re-ordering experience to a greater degree. The intuition behind how we construct our case-based explanations is: ‘a simple set of relevant state-transitions representing abstract-enough aspects of the problem to be solved.’ This intuition motivates the heuristics of abstraction, relevance, and simplicity (ARS, in short). We adapt these heuristics from prior work in the HCI domain, where they are presented in greater abstraction to form a higher-level taxonomy and knowledge graph for an interactive explanatory process.
Consider a problem where an RL agent has to learn a policy to optimally navigate through an environment with sophisticated rules and exceptions (e.g. a real traffic regulation with exceptions for special types of vehicles). Let the state-transition $(s_t, a_t, r_t, s_{t+1})$ denote the transition from state $s_t$ to state $s_{t+1}$ by means of action $a_t$, yielding a reward $r_t$. We assume the environment is imbued with explanatory capabilities via an explainer. Note that the explanations generated by the explainer can have virtually any representation, be it human-understandable or not, provided they are distinct and serve the purpose of labelling different clusters.
Definition 1 (Explainer)
The explainer is a function that maps a list of state-transition tuples, drawn from the space of possible state-transitions, to an explanation in the explanatory space, i.e., the space of all possible explanations.
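As an illustration only, the sketch below shows what a toy explainer could look like for the speed-limit example of Section 2.2. The `Transition` tuple, the `explainer` signature, the `speed` field, and the label strings are all hypothetical; the explainers used in this work are cultures, which are not reproduced here. The explanation is returned as a hashable value so that it can directly label a cluster.

```python
from typing import NamedTuple, Sequence


class Transition(NamedTuple):
    state: dict
    action: dict
    reward: float
    next_state: dict


def explainer(transitions: Sequence[Transition], speed_limit: float = 50.0) -> frozenset:
    """Map a list of state-transitions to an explanation (a set of argument labels)."""
    arguments = set()
    for transition in transitions:
        # Hypothetical rule: 'if speed is above the limit, you will get a penalty'.
        if transition.next_state["speed"] > speed_limit:
            arguments.add("above_speed_limit")
        else:
            arguments.add("within_speed_limit")
    return frozenset(arguments)


penalised = Transition({"speed": 40.0}, {"accelerate": 35.0}, -1.0, {"speed": 75.0})
explanation = explainer([penalised])
```

Any representation would do here, as long as distinct explanations remain distinguishable and can serve as cluster labels.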
An agent who has more diverse experiences with regard to the reasons (explanations) associated with rewards will have a better chance of converging towards a policy that better represents the underlying ruleset. Therefore, we posit that the more complex the environment is in terms of rules, the more useful Explanation-Awareness (XA) should be, as it would ensure a more even distribution of experiences with regard to the different reasons justifying rewards. This diversity of explanations culminates in a clustering that is semantic by nature, in which transitions are partitioned according to the explanation that represents their reward.
Definition 2 (XA Clusters)
An XA state-transition is a state-transition represented by an explanation drawn from the explanatory space. Given the set of all state-transitions in an episode, the XA clusters seen in that episode are the groups of state-transitions that share the same explanation, with one cluster for each of the different explanations seen in that episode.
We introduce our adaptation of ARS, below.
3.1 Abstraction: Clustering Strategies
The purpose of the abstraction heuristic is to regulate the level of granularity of the explanations, hence of the experience clusters. Our abstractions are based on the understanding that explanations are indeed answers to questions. Hence, explanations may have different granularity defined by the level-of-detail of the question they answer.
More in detail, the HOW explanations we consider answer the question ‘How well is the agent performing with this reward?’. This type of explanation can be produced by studying the average behaviour of an agent (e.g. if an episode has a cumulative reward that is greater than the running mean, then the explanation indicates that the agent is behaving better than average). Interestingly, these explanations can be obtained without an explainer function. On the other hand, the WHY explanations we consider answer the question ‘Why did the agent achieve this reward?’. This specific information concerns why the state-transition has the reward it has. It can be generated by a culture acting as the explainer function and producing verified arguments that compose an explanation from the state of the environment. In our proposed environments (see Section 4), WHY explanations contain a set of the verified surviving arguments in the culture, i.e., the current rules that compose an explanation for the given reward. Furthermore, WHY and HOW explanations (or any other type) can be easily combined, so that the explanation answers both of the associated questions. However, if done systematically, such combinations might cause a combinatorial explosion in the number of clusters in the experience buffer.
In order to compose the experience buffer, represented as a set of experience clusters, we devise the following clustering strategies, one for each explanation type:
HOW: The experience buffer is divided into 2 clusters. One contains batches with rewards greater than the running mean of rewards (computed over a sliding window of a defined size), and the other contains the remaining batches.
WHY: The number of clusters is equivalent to the number of distinct explanations available. If a batch can be explained by multiple explanations simultaneously, we select the explanation associated with the smallest cluster (the most under-represented one) and the batch is assigned to the corresponding cluster. (Since buffers will be prioritised and clusters will be fairly represented, there is no need to duplicate the batch across multiple clusters.)
HOW+WHY: A combination of the HOW and WHY strategies. There are two clusters for every WHY explanation, one for each HOW outcome, formed by concatenating the two explanations.
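The three clustering strategies can be sketched as a function that returns the cluster a batch is stored in. This is a simplified illustration under stated assumptions: the names `HowLabeller` and `cluster_key`, the labels `"better"` and `"worse"`, the default window size, and the choice of comparing a batch against the mean of previously seen rewards are all hypothetical and not taken from the XARL library.

```python
from collections import deque


class HowLabeller:
    """Labels a batch 'better' or 'worse' than the running mean of rewards."""

    def __init__(self, window=100):  # the window size is an assumed default
        self.recent = deque(maxlen=window)

    def label(self, batch_reward):
        # Compare against the mean of earlier rewards in the sliding window;
        # the very first batch has nothing to beat, so it is labelled 'worse'.
        mean = sum(self.recent) / len(self.recent) if self.recent else batch_reward
        self.recent.append(batch_reward)
        return "better" if batch_reward > mean else "worse"


def cluster_key(strategy, how_label, explanations, cluster_sizes):
    """Return the cluster of a batch under the HOW, WHY or HOW+WHY strategy."""
    if strategy == "HOW":
        return how_label

    def candidate(explanation):
        return explanation if strategy == "WHY" else (how_label, explanation)

    # Several explanations may apply: keep the most under-represented cluster.
    return min((candidate(e) for e in explanations),
               key=lambda key: cluster_sizes.get(key, 0))


labeller = HowLabeller(window=3)
how_labels = [labeller.label(r) for r in [1.0, 2.0, 0.5]]
sizes = {"speeding": 10, "roadworks": 2}
why_key = cluster_key("WHY", "better", ["speeding", "roadworks"], sizes)
both_key = cluster_key("HOW+WHY", "better", ["speeding", "roadworks"], {})
```

Because HOW+WHY keys are pairs, the number of clusters doubles with respect to WHY alone, which illustrates how combining explanation types multiplies the number of clusters.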
After clustering state-transitions using the prior clustering strategies, we propose mechanisms for assessing the relevance of specific state-transitions during learning.
3.2 Relevance: Intra-Cluster Prioritisation
Prioritisation mechanisms are used for organising information given their relevance to the agent’s objectives.
The priority of a batch is usually estimated by computing its loss with respect to the agent’s objective. In DQN, TD3, and SAC, relevance is estimated by the absolute TD-error of the agent. The closer the TD-error is to 0, the lower the loss and the relevance. The intuition is that batches with TD-error equal to zero are of no use since they represent an already solved challenge. In our method, this relevance heuristic can be combined with the aforementioned clustering strategy by first sampling a cluster in a prioritised way (where the priority of a cluster is the sum of the priorities of all its batches) and then performing prioritised sampling of batches from the sampled cluster.
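The two-stage sampling described above can be sketched as follows. The data layout (a dictionary from cluster key to a list of `(batch, priority)` pairs) and the function name `sample_batch` are assumptions made for this example; priorities are taken to be non-negative, with at least one positive value.

```python
import random


def sample_batch(clusters, rng=None):
    """Pick a cluster by summed priority, then a batch by its own priority."""
    rng = rng or random.Random()
    keys = [key for key, items in clusters.items() if items]
    # The priority of a cluster is the sum of the priorities of its batches.
    cluster_priorities = [sum(p for _, p in clusters[key]) for key in keys]
    key = rng.choices(keys, weights=cluster_priorities, k=1)[0]
    # Prioritised sampling of one batch inside the chosen cluster.
    batches, priorities = zip(*clusters[key])
    batch = rng.choices(batches, weights=priorities, k=1)[0]
    return key, batch


clusters = {
    "solved": [("s1", 0.0)],               # zero TD-error: never replayed
    "unsolved": [("u1", 1.0), ("u2", 0.0)],
}
key, batch = sample_batch(clusters, rng=random.Random(0))
```

In this toy buffer only `u1` carries a non-zero priority, so it is the only batch that can be drawn.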
3.3 Simplicity: (Curricular) Inter-Cluster Prioritisation
Occam’s Razor states that when presented with two explanations for the same phenomenon, the simplest explanation should be preferred. In human explanations, simplicity is a common heuristic [16, 24]. We will adhere to those principles and select minimal and simple explanations, following a curricular approach.
Clustered prioritised experience replay changes the real distribution of tasks by means of over-sampling. Assume that the whole experience buffer has a fixed and constant size and contains a number of different clusters, and consider the minimum and maximum size that a cluster can have. Any new experience is added to a full buffer by removing the oldest one among the clusters having more elements than the minimum cluster size.
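A minimal sketch of this eviction rule is given below. It assumes that the threshold for eviction is the minimum cluster size, and it adds a fallback (evicting from any non-empty cluster when none exceeds that size) that is not stated in the text. The class name `ClusteredBuffer` and its attributes are hypothetical.

```python
from collections import deque


class ClusteredBuffer:
    """Fixed-size buffer whose experiences are partitioned into clusters."""

    def __init__(self, capacity, min_cluster_size):
        self.capacity = capacity
        self.min_cluster_size = min_cluster_size
        self.clusters = {}  # cluster key -> deque of (insertion time, experience)
        self.clock = 0

    def __len__(self):
        return sum(len(cluster) for cluster in self.clusters.values())

    def add(self, key, experience):
        if len(self) >= self.capacity:
            self._evict_oldest()
        self.clusters.setdefault(key, deque()).append((self.clock, experience))
        self.clock += 1

    def _evict_oldest(self):
        # Only clusters above the minimum size may lose an element.
        candidates = [c for c in self.clusters.values() if len(c) > self.min_cluster_size]
        if not candidates:  # assumed fallback, not described in the text
            candidates = [c for c in self.clusters.values() if c]
        # Each deque is in insertion order, so its head is its oldest element.
        min(candidates, key=lambda cluster: cluster[0][0]).popleft()


buffer = ClusteredBuffer(capacity=4, min_cluster_size=1)
buffer.add("rare", "x0")
for name in ["y1", "y2", "y3", "y4"]:
    buffer.add("common", name)
```

Here the single experience in the rare cluster is the oldest in the buffer, yet it survives because its cluster is not above the minimum size; the oldest common experience is dropped instead.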
If all the clusters have the same size (so that the minimum and maximum cluster sizes are equal), replaying the task’s cluster with the highest (TD-error) priority might push the agent to tackle the exceptions before the most common tasks, preventing the agent from learning an optimal policy faster. The assumption here is that exceptional tasks (exceptions) are less frequent.
On the other hand, if and , the size of a cluster would depend only on the real distribution of tasks within a small sliding window, as in traditional PER, thus preventing over-sampling. The presence of clusters helps over-sampling batches likely related to under-represented tasks, and learning to tackle potentially hard cases more efficiently.
Consequently, we posit that the minimum cluster size shall be large enough for effective over-sampling, while the maximum cluster size remains dependent on the real distribution of tasks. This will push the agent towards tackling the most frequent and relevant tasks first, analogously to curricular learning. We define a hyperparameter to control the cluster size proportion.
Definition 3 (Cluster Size Proportion)
In order for all clusters to have a size , we set , where represents the cluster size proportion.
Therefore, can be easily controlled by modifying . We enforce when . Consequently, for curricular prioritisation, if the cluster’s priority is (for example) computed as the sum of its batch’s priorities and is not too large (e.g. ), the resulting cluster’s priorities will reflect the real distribution of tasks while smoothly over-sampling the most relevant tasks, thus avoiding over-estimation of the priority of a task. The degree of smoothness can be easily controlled with .
With those mechanisms in place, we propose new environments to evaluate the performance of agents when subjected to complex rulesets.
Real-life air/sea/road traffic regulations are often complex, and their mastery is a crucial aspect of orderly navigation. Many realistic settings have a number of exceptions that must be taken into consideration (e.g. ambulances are not subjected to some rules when in emergencies, sailing boats have different priorities if on wind power, etc.). To implement our rulesets, we use cultures: a mechanism to encode human rulesets as machine-compatible argumentation frameworks, imbued with fact-checking mechanisms. These can be used to produce rule-based explanations from an agent’s behaviour. As traditional environments do not possess the ability to yield explanations and are not amenable to variable rulesets, we motivate our work by creating two rule-aware environments (one discrete and one continuous) with explainable capabilities, where cultures/rulesets with different levels of complexity can be plugged in and change the criteria for rewards.
The environments are:
4.1 GridDrive - Discrete
A 15×15 grid of cells, where every cell represents a different type of road (see Figure 2, left), with base types (e.g. motorway, school road, city) combined with other modifiers (roadworks, accidents, weather). Each vehicle has a set of properties that define which type of vehicle it is (emergency, civilian, worker, etc.). Complex combinations of these properties define a strict speed limit for each cell, according to the culture.
Actions. A sample in the action space consists of a direction and a speed where .
Observations. A sample in the observation space is a tuple of four elements: the concatenation of the vehicle’s properties (including speed), the concatenation of all neighbouring roads’ properties, a boolean matrix keeping track of visited cells, and the vehicle’s current global coordinates.
Rewards. Let denote the normalised speed of the agent in that step. Rewards are given at every step, given the following criteria:
4.2 GraphDrive - Continuous
A Euclidean representation of a planar graph with vertices and edges (see Figure 2, right). The agent starts at the coordinates of one of those vertices and has to drive between vertices (called ‘junctions’) in continuous space with Ackermann-based non-holonomic motion. Edges represent roads and are subject to the same rules and properties as those seen in GridDrive, plus a few extra rules to encourage the agent to stay close to the edges. The incentive is to drive as long as possible without committing speed infractions. In this setting, the agent not only has to master the rules of the roads, but also the control dynamics to steer and accelerate correctly. We test two variations of this environment: one with dense and another with sparse rewards.
Observations. A sample in the observation space for GraphDrive is a tuple of three elements: a concatenation of the vehicle’s properties (car features, position, speed/angle, distance to path, junction status, number of visited junctions), the concatenation of the properties of the closest road to the agent (likely to be the one the agent is driving on), and the concatenation of the properties of the roads connected to the next junction.
Rewards (dense version). Let denote the normalised speed of the agent in that frame, and let be the number of unique junctions visited in the episode. Rewards are given at every frame, given the following criteria:
Rewards (sparse version). In this version, the agent will get null (zero) reward when moving correctly. Positive rewards only appear when the agent manages to acquire a new junction. Therefore, the agent will have to drive entire roads correctly to get any positive reward. Rewards are given according to the following criteria:
Every episode begins with an initialisation of the grid or graph (for GridDrive or GraphDrive, respectively) with random roads, along with randomly-sampled agent properties. The agent is encouraged to drive for as long as possible until it either reaches a maximum number of steps or breaks a rule (terminal state). All environments are instantiated in versions with 3 different cultures (rulesets), according to their levels of complexity: Easy, Medium, and Hard.
Easy: 3 properties (2 for roads, 1 for agents), 5 distinct explanations.
Medium: 7 properties (5 for roads, 2 for agents), 12 distinct explanations.
Hard: 15 properties (9 for roads, 6 for agents), 20 distinct explanations.
In this section we describe our experimental setup and present results obtained in our proposed environments with XAER versus traditional PER. We trained 3 baseline agents with traditional PER (DQN/Rainbow, SAC, and TD3). For each of the 3 baseline algorithms, we train 3 XAER versions with different clustering strategies, using HOW, WHY, and HOW+WHY explanations (see Section 3). Additionally, we show results for HOW+WHY explanations without the simplicity heuristic (prioritised clustering), i.e. with clusters sampled uniformly. This gives a total of 12 XA agents. We call the XAER-equipped versions of DQN, SAC, and TD3 XADQN, XASAC, and XATD3, respectively. DQN and XADQN agents are applied to GridDrive (discrete), whilst SAC, TD3, XASAC, and XATD3 were trained separately on GraphDrive with dense and sparse rewards (continuous). The implementations of these algorithms come from RLlib, an open-source library for RL agents. We developed the XARL Python library, which can be easily integrated in RLlib and provides XA facilities for obtaining XADQN, XATD3 and XASAC.
The neural network adopted for all the experiments is the default one implemented in the respective baselines (although better ones can certainly be devised), and it is characterised by fully connected layers of few units (e.g. 256) followed by the output layers for actors and/or critics, depending on the algorithm’s architecture. XAER methods introduce the cluster size proportion hyperparameter. We perform ablation experiments to choose appropriate values for it, settling on one value for XADQN and XATD3 and another for XASAC. We omit the detailed ablation study for brevity, but full plots and auxiliary results can be found on our GitHub page (https://github.com/Francesco-Sovrano/XARL).
As the environments presented in Section 4 have different levels of rule density/complexity, we are interested in observing if XAER exhibits superior performance compared to traditional PER in tasks that involve learning sophisticated and exception-heavy regulations. We trained all agents up to the same number of steps on all environments. Our reported scores are obtained by segmenting the curve of mean episode rewards into 20 regions containing 5% of steps each. We select the best region (highest median) for each agent to compare agents at their respective best performances. We report those medians in Table 1, as well as the 25-75% inter-quartile range for the selected region.
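The scoring procedure described above can be sketched with the Python standard library. This is an approximation under stated assumptions: the curve is given as a list of logged mean episode rewards evenly spaced in training steps, regions are cut by number of logged points, trailing points that do not fill a region are dropped, and quartiles use the default method of `statistics.quantiles`, which may differ from the one used for Table 1. The function name `best_region_score` is hypothetical.

```python
import statistics


def best_region_score(mean_episode_rewards, n_regions=20):
    """Median and inter-quartile range of the region with the highest median."""
    size = len(mean_episode_rewards) // n_regions
    if size < 2:
        raise ValueError("need at least two points per region")
    regions = [mean_episode_rewards[i * size:(i + 1) * size] for i in range(n_regions)]
    # Compare agents at their best: keep the region with the highest median.
    best = max(regions, key=statistics.median)
    q1, _, q3 = statistics.quantiles(best, n=4)
    return statistics.median(best), (q1, q3)


# Synthetic, steadily improving curve: the last 5% of points is the best region.
median, (q1, q3) = best_region_score([float(x) for x in range(100)])
```

Using the median of a region rather than a single peak value makes the reported score less sensitive to isolated lucky episodes.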
Table 1: Median of the mean episode rewards in each agent’s best region, with the 25-75% inter-quartile range in parentheses. SR denotes the sparse-reward version of GraphDrive.

| Algorithm | Environment | Baseline (PER) | HOW | WHY | HOW+WHY | HOW+WHY (no simplicity) |
|---|---|---|---|---|---|---|
| DQN | Grid Easy | 16.48 (15.38-17.43) | 16.40 (15.35-17.33) | 13.57 (12.17-14.84) | 14.98 (13.52-16.19) | 14.64 (12.88-16.01) |
| DQN | Grid Medium | 7.76 (6.63-9.88) | 8.25 (6.94-9.90) | 7.60 (6.70-8.59) | 12.19 (11.14-13.16) | 9.22 (8.1-10.28) |
| DQN | Grid Hard | 2.01 (1.74-2.29) | 1.85 (1.64-2.07) | 1.85 (1.62-2.12) | 3.64 (3.23-4.11) | 2.1 (1.78-2.41) |
| SAC | Graph Easy | 82.18 (74.83-90.25) | 76.12 (67.8-84.08) | 133.48 (129.08-138.89) | 131.50 (124.90-138.41) | 127.0 (121.0-132.96) |
| SAC | Graph Medium | 67.79 (60.98-75.36) | 74.25 (67.32-81.38) | 106.55 (100.60-112.21) | 115.31 (110.16-121.28) | 94.95 (89.27-100.72) |
| SAC | Graph Hard | 29.57 (27.49-31.55) | 23.85 (21.74-25.86) | 33.39 (31.38-35.16) | 32.63 (30.84-34.73) | 19.19 (16.98-21.87) |
| SAC | Graph Easy (SR) | 4.17 (3.55-4.53) | 3.5 (2.88-3.91) | 4.86 (4.58-5.03) | 4.75 (4.43-4.99) | 2.95 (2.74-3.15) |
| SAC | Graph Medium (SR) | 3.09 (2.85-3.37) | 3.14 (2.9-3.4) | 2.4 (2.3-2.5) | 2.8 (2.68-2.91) | 2.05 (1.84-2.18) |
| SAC | Graph Hard (SR) | -0.04 (-0.05-(-0.02)) | 1.36 (1.24-1.5) | 0.69 (0.61-0.77) | -0.12 (-0.15-(-0.11)) | 0.62 (0.51-0.79) |
| TD3 | Graph Easy | 81.36 (75.93-86.72) | 64.21 (57.48-70.65) | 96.73 (90.59-102.00) | 96.66 (89.44-101.84) | 101.47 (94.49-107.34) |
| TD3 | Graph Medium | 0.0 (-0.01-(0.02)) | 0.0 (-0.01-(0.03)) | 77.85 (70.55-84.29) | 73.75 (69.38-79.12) | 61.52 (57.48-66.59) |
| TD3 | Graph Hard | -0.01 (-0.03-(0.0)) | 22.12 (20.40-23.63) | 25.95 (23.81-27.96) | 24.38 (22.38-26.32) | 7.79 (6.76-8.82) |
| TD3 | Graph Easy (SR) | 2.3 (1.87-2.66) | 2.14 (1.81-2.39) | 2.79 (2.53-3.03) | 3.17 (2.98-3.38) | 2.13 (1.97-2.29) |
| TD3 | Graph Medium (SR) | -0.03 (-0.05-(-0.02)) | -0.04 (-0.06-(-0.03)) | 2.23 (1.94-2.45) | 2.12 (1.95-2.27) | 1.92 (1.75-2.05) |
| TD3 | Graph Hard (SR) | -0.03 (-0.05-(-0.02)) | -0.03 (-0.04-(-0.03)) | -0.04 (-0.05-(-0.03)) | -0.03 (-0.05-(-0.02)) | -0.05 (-0.06-(-0.03)) |

Note on TD3, Graph Hard (SR): all agents failed to learn a policy and thus we do not highlight results.
Results in Table 1 show that, across all tasks and methods, the XAER versions lose to the PER baseline only for DQN/Rainbow in GridDrive Easy, by 0.4%. For GridDrive Medium and Hard, XADQN with HOW+WHY explanations exhibits significantly higher performance than the baseline (57% and 81% higher, respectively). WHY and HOW+WHY exhibit similar performance in GraphDrive, being bested by HOW in the Medium and Hard Sparse cases only. Although HOW+WHY explanations have consistently good results across environments, the version without the simplicity heuristic exhibited consistently inferior results. Neither baseline SAC nor baseline TD3 managed to learn a policy in GraphDrive Hard Sparse (our hardest environment). XATD3 also failed to learn a policy in this environment, but XASAC was able to achieve positive results.
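The percentages quoted above can be checked directly against the GridDrive medians in Table 1. The short computation below assumes that the first value of each row is the PER baseline, the second is HOW, and the fourth is HOW+WHY, as the reported figures suggest; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(xaer_score, baseline_score):
    """Relative improvement of an XAER score over the baseline, in percent."""
    return 100.0 * (xaer_score - baseline_score) / baseline_score


grid_medium = relative_gain(12.19, 7.76)  # HOW+WHY against the baseline
grid_hard = relative_gain(3.64, 2.01)     # HOW+WHY against the baseline
grid_easy = relative_gain(16.40, 16.48)   # best XAER version (HOW) against the baseline
```

Under these assumptions the gains are roughly 57% and 81% for GridDrive Medium and Hard, while the loss in GridDrive Easy is a little under 0.5%, consistent with the quoted 0.4% when truncated to one decimal.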
6 Discussion and Conclusion
Our results indicate a significant benefit achieved via explanation-aware experience replay. In some cases, endowing an agent with XAER enables it to learn in environments where it would otherwise fail entirely. XAER allowed TD3 agents to learn in GraphDrive Medium and Hard (dense), and SAC agents to learn in GraphDrive Hard (sparse), obtaining significantly higher rewards with the same hyper-parameters and number of learning steps.
The choice of explanation type also affected results: when superior, HOW+WHY explanations exhibited larger margins of improvement over other XAER methods. When bested by WHY explanations, HOW+WHY still achieved very close results, and was thus consistently satisfactory in most cases. Importantly, although HOW explanations exhibited lower performance than their XAER counterparts in most environments, they do not require an explainer and could in theory be used in any environment. The consistency of HOW+WHY results suggests that the act of explaining may involve answering more archetypal questions, not just causal ones, as hypothesised also in .
The frequency and magnitude of rewards are important factors to consider in XAER clustering. When negative rewards are more frequent (with similar magnitude to positive rewards), and there are more negative than positive clusters, oversampling may cause the agent to tackle situations with negative rewards more frequently, preventing it from maximising cumulative rewards. This effect can be particularly pronounced with very sparse rewards, such as the ones seen in the sparse version of GraphDrive.
Intuitively, this is akin to the notion that if there are few opportunities to explain, one must choose one's explanations well. Explanation engineering thus surfaces as a mechanism to orient the learning agent by selecting, through abstractions, which experiences (and explanations) are more important to the task at hand. Being explainable by design, explanation engineering can be an intuitive and semantically-grounded alternative to reward engineering, as the meaning of the rewards matters just as much as their magnitude. A few examples include increasing the number of positive clusters, or organising the clusters hierarchically.
With regard to relevance, if the cumulative priority of the state-transitions of a whole cluster is low, it may indicate that the agent has already learned to handle the task represented by the cluster, so it may not need it as an explanation (making the cluster less relevant). Conversely, if the cumulative priority is high, it could indicate a need for additional explanations. The cluster might represent either a generic or a non-generic task. If the agent needs explanations for a generic task, it should also need them for a non-generic task; in that case, the generic task is prioritised over the non-generic one. The benefits of inter-cluster prioritisation (simplicity) are higher in environments with harder rulesets, and proportional to the complexity of the culture . This suggests that uniformly selecting an explanation type to replay to an agent is less beneficial than selecting the simplest and most relevant explanation.
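The role of cumulative cluster priority described above can be illustrated with a minimal sketch. It is not the XAER implementation: the class `ClusteredReplaySketch`, its method names, and the cluster labels are hypothetical. The sampling rule (pick a cluster in proportion to its cumulative priority, then draw uniformly within it) is an assumption chosen for brevity, and the simplicity heuristic (preferring generic over non-generic clusters) is deliberately left out.

```python
import random
from typing import Any, Dict, Hashable, List, Tuple


class ClusteredReplaySketch:
    """Toy buffer grouping transitions by explanation cluster (illustrative only)."""

    def __init__(self, seed: int = 0) -> None:
        # cluster label -> list of (priority, transition) pairs
        self._clusters: Dict[Hashable, List[Tuple[float, Any]]] = {}
        self._rng = random.Random(seed)

    def add(self, cluster: Hashable, transition: Any, priority: float) -> None:
        """Store a transition under its cluster with a strictly positive priority."""
        if priority <= 0:
            raise ValueError("priority must be positive")
        self._clusters.setdefault(cluster, []).append((float(priority), transition))

    def cumulative_priority(self, cluster: Hashable) -> float:
        """Sum of the priorities of all transitions stored in one cluster."""
        return sum(p for p, _ in self._clusters.get(cluster, []))

    def sample(self) -> Tuple[Hashable, Any]:
        """Pick a cluster in proportion to its cumulative priority, then a transition."""
        if not self._clusters:
            raise IndexError("cannot sample from an empty buffer")
        labels = list(self._clusters)
        weights = [self.cumulative_priority(c) for c in labels]
        cluster = self._rng.choices(labels, weights=weights, k=1)[0]
        # Assumption: uniform draw inside the chosen cluster, for simplicity.
        _, transition = self._rng.choice(self._clusters[cluster])
        return cluster, transition


# Hypothetical usage: one cluster the agent already handles, one it does not.
buffer = ClusteredReplaySketch(seed=0)
buffer.add("rule_already_learned", {"step": 1}, priority=0.1)
buffer.add("rule_already_learned", {"step": 2}, priority=0.1)
buffer.add("rule_still_surprising", {"step": 3}, priority=5.0)
buffer.add("rule_still_surprising", {"step": 4}, priority=4.0)

draws = [buffer.sample()[0] for _ in range(1000)]
print(draws.count("rule_still_surprising"), draws.count("rule_already_learned"))
```

Under these assumed priorities, the cluster with low cumulative priority is replayed rarely, mirroring the idea that the agent no longer needs it as an explanation, while the high-priority cluster dominates the replayed batch.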
This work opens diverse avenues for further investigation. First, further experiments could include the development of explainer functions to evaluate the performance of WHY explanations in popular benchmarks. Additionally, future work may observe the effect of XAER with on-policy algorithms, such as PPO. Lastly, the illocutionary effect of explanations deriving from further archetypal questions (i.e. HOW, WHY, WHAT, WHERE, WHEN) could be explored in advanced explanation engineering for experience clustering.
-  Agnar Aamodt and Enric Plaza. Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications, 7:39–59, 1994.
-  Peter Achinstein. The nature of explanation. Oxford University Press on Demand, 1983.
-  Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, 2020.
-  Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe Model-based Reinforcement Learning with Stability Guarantees. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
-  Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Occam’s razor. Information processing letters, 24(6):377–380, 1987.
-  P S Bokare and A K Maurya. Acceleration-Deceleration Behaviour of Various Vehicle Types. Transportation Research Procedia, 25:4733–4749, 2017.
-  L. Karl Branting. Building explanations from rules and structured cases. International Journal of Man-Machine Studies, 34(6):797–837, 1991.
-  Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708, 2018.
-  Thomas G Dietterich and Nicholas S Flann. Explanation-based learning and reinforcement learning: A unified view. Machine Learning, 28(2):169–210, 1997.
-  Daniel B Fambro, Rodger J Koppa, Dale L Picha, and Kay Fitzpatrick. Driver Braking Performance in Stopping Sight Distance Situations. Transportation Research Record, 1701(1):9–16, 1 2000.
-  Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596. PMLR, 4 2018.
-  Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
-  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870, Stockholmsmässan, Stockholm Sweden, 3 2018. PMLR.
-  John H Holland, Keith J Holyoak, Richard E Nisbett, and Paul R Thagard. Induction: Processes of inference, learning, and discovery. MIT press, 1989.
-  Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
-  Samuel GB Johnson, JJ Valenti, and Frank C Keil. Simplicity and complexity preferences in causal explanation: An opponent heuristic account. Cognitive psychology, 113:101222, 2019.
-  Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, and others. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
-  B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Transactions on Intelligent Transportation Systems, pages 1–18, 2021.
-  Ang A Li, Zongqing Lu, and Chenglin Miao. Revisiting Prioritized Experience Replay: A Value Perspective. arXiv preprint arXiv:2102.03261, 2021.
-  Changjian Li and Krzysztof Czarnecki. Urban Driving with Multi-Objective Deep Reinforcement Learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, pages 359–367, Richland, SC, 2019. International Foundation for Autonomous Agents and Multiagent Systems.
-  Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Joseph Gonzalez, Ken Goldberg, and Ion Stoica. Ray rllib: A composable and scalable reinforcement learning library. arXiv preprint arXiv:1712.09381, page 85, 2017.
-  GR Mayes. Theories of explanation. The Internet Encyclopedia of Philosophy, 2005.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
-  Jakob Nielsen. Enhancing the explanatory power of usability heuristics. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 152–158, 1994.
-  Alex Raymond, Hatice Gunes, and Amanda Prorok. Culture-Based Explainable Human-Agent Deconfliction. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’20, pages 1107–1115, Richland, SC, 2020. International Foundation for Autonomous Agents and Multiagent Systems.
-  Zhipeng Ren, Daoyi Dong, Huaxiong Li, and Chunlin Chen. Self-paced prioritized curriculum learning with coverage penalty in deep reinforcement learning. IEEE transactions on neural networks and learning systems, 29(6):2216–2226, 2018.
-  Jikun Rong and Nan Luan. Safe Reinforcement Learning with Policy-Guided Planning for Autonomous Driving. In 2020 IEEE International Conference on Mechatronics and Automation (ICMA), pages 320–326, 2020.
-  Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
-  Francesco Sovrano. Combining experience replay with exploration by random network distillation. In 2019 IEEE Conference on Games (CoG), pages 1–8. IEEE, 2019.
-  Francesco Sovrano and Fabio Vitali. From philosophy to interfaces: an explanatory method and a tool based on achinstein’s theory of explanation. In Proceedings of the 26th International Conference on Intelligent User Interfaces, 2021.
-  Peiquan Sun, Wengang Zhou, and Houqiang Li. Attentive experience replay. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5900–5907, 2020.
-  Haiyan Yin and Sinno Pan. Knowledge transfer for deep reinforcement learning with hierarchical experience replay. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
-  Daochen Zha, Kwei-Herng Lai, Kaixiong Zhou, and Xia Hu. Experience replay optimization. arXiv preprint arXiv:1906.08387, 2019.