During the last decade, deep reinforcement learning (DRL) algorithms have shown notable results in a variety of applications, ranging from games to autonomous vehicles Arulkumaran et al. (2017). However, DRL algorithms rely on neural networks to represent the agent’s policy, making their decisions difficult to understand and interpret. This lack of transparency stands in the way of wider adoption of DRL methods in high-risk areas, such as healthcare or finance. Additionally, enabling agents to explain their decision-making process to humans is necessary to facilitate trust and collaboration between the RL agent and the user.
To address this issue, various approaches for interpreting the behavior of DRL agents have been proposed in recent years. Depending on their scope, methods for explaining a DRL system can be either local or global Du et al. (2019). Local methods interpret a single decision of the RL model Greydanus et al. (2018); Fukuchi et al. (2017), while global approaches explain policy behavior as a whole Amir and Amir (2018); Hayes and Shah (2017). Although these methods have been shown to increase human understanding of the agent’s decision-making process, they have been mostly limited to explaining a single RL policy. However, comparing and interpreting differences in policy behavior is necessary in situations where the user is confronted with a choice between alternative policies. Additionally, enabling developers to discover differences between policies could help them debug imperfect models by locating situations where they differ from a known expert model. Finally, comparing policies corresponds with the human tendency to prefer contrasting explanations, which analyze differences between alternative scenarios Miller (2019).
In this work we distinguish between two ways in which the behaviors of two RL policies can differ: ability-based and preference-based. If one policy is trained to a higher standard and exhibits generally superior behavior compared to the other one, then the differences between the two are ability-based. On the other hand, if both policies perform the task competently, but follow different strategies based on their individual inclinations, then we consider their differences to be preference-based. Current methods for comparing RL policies assume that differences between policies are ability-based and completely disregard the notion of preference Sequeira and Gervasio (2020); Amitai and Amir (2021). However, in complex domains where defining the optimal behavior is not straightforward, it is possible to train multiple RL policies that perform the task adequately, but rely on different strategies. Interpreting differences between those policies is necessary from the perspective of personalisation: if a user needs to choose the policy that suits them best from a set of offered alternatives, understanding the differences between them is crucial for making an informed decision. Furthermore, recognizing differences between multiple policies could aid developers in understanding the different behaviors that stem from various reward functions and hyperparameter combinations Amitai and Amir (2021).
In RL, multiple different strategies can be the result of different reward functions in a task where the objective cannot be precisely defined, and certain trade-offs between goals have to be made Huang et al. (2019). For example, consider the problem of training an RL policy for driving an autonomous vehicle. Defining the reward function for this task is not straightforward and can involve multiple objectives that need to be satisfied. Consequently, multiple capable policies could be trained on different reward functions that slightly favor a specific objective over the others. These policies will achieve a high average reward according to their individual reward functions, but due to their opposing preferences in terms of objectives they may also exhibit different strategies for performing the task. Specifically, a policy trained on a reward function that severely penalizes close contact with another car will develop a safety-oriented strategy, and prefer slower driving and keeping its distance from other vehicles. On the other hand, a policy that was penalized for not arriving at the destination on time will likely prefer faster driving, and will have a more relaxed understanding of safety concerns. If a user is given a choice between these two policies, they need to recognize the differences between the two strategies in order to choose the one that fits them best.
In this work, we focus on comparing two policies trained on the same task and interpret their preference-based differences. We choose the global approach to explainability, and attempt to discover the differences in the overall behavior of two RL agents as opposed to analysing their differences in a single state. Our approach attempts to uncover differences between two policies by analysing situations where the policies disagree on the best strategy to follow. We propose an algorithm for distinguishing between disagreements that stem from a difference in ability between two agents and those that arise from their opposing preferences. We then analyse only the preference-based disagreements in order to extract conditions, in terms of state features, that specific agents favour, and generate global explanations contrasting the agents’ behavior. We test and evaluate our approach in an autonomous driving environment, where the agent’s task is to merge into another lane currently occupied by a non-autonomous vehicle, and we compare the policies of a safety-oriented and a speed-oriented agent.
Our contributions are as follows:
We propose a method for distinguishing between ability-based and preference-based differences in behavior between two RL agents.
We present a method for generating contrasting explanations based on the contrast in preferred state feature values between two policies.
We test and evaluate our approach in an autonomous driving environment.
2 Related work
In recent years, various methods for explaining either one decision or the entire behavior of an RL system have been proposed Puiutta and Veith (2020). However, despite the fact that research shows humans tend to seek contrasting explanations when reasoning about an event Miller (2019), there is still limited work in the field of comparing alternative behaviors of reinforcement learning systems.
Madumal et al. (2020b, a) approached the problem of explainability from a causal perspective and proposed a method for generating local explanations that contrast alternative actions in a specific state. The approach, however, requires a hand-crafted causal model of the environment, which may be difficult to obtain and requires expert knowledge. Additionally, the authors focus on generating local explanations, while we interpret and compare the global behavior of agents.
Summarisation methods, one of the most notable global approaches for condensing and explaining agent behavior, have also been used to compare multiple RL policies. Sequeira and Gervasio (2020) generated contrasting summaries of agents’ behavior in order to highlight the differences in their capabilities. Similar to our work, Amitai and Amir (2021) used the notion of disagreement between policies to detect and analyse situations where two policies pick different actions, but opted for explaining the policies’ differences through contrasting summaries. However, both approaches focus only on extracting discrepancies between agents’ abilities, and disregard the potential difference in their strategies. Additionally, the applicability of summarisation methods is limited to tasks with visual input.
Most relevant to our work, van der Waa et al. (2018) compare the outcomes of following different policies from a specific state to justify the agent’s choice of action. However, their work focuses only on local explanations, and requires manual encodings of states and outcomes. In contrast, in our work we aim to provide global comparisons of policies and do not rely on hand-crafted interpretable features.
3 Preference-based contrastive explanations
In this section we propose a set of conditions for distinguishing between ability-based and preference-based differences between two policies, and offer a method for extracting explanations that highlight the feature values that specific agents favour. Throughout this section we assume oracle access to the two policies, their action-value functions, and the transition function of the environment. Our approach consists of three steps. Firstly, the policies are unrolled in the environment to collect data on situations where they disagree on the best course of action (Section 3.1). Afterwards, the collected data is filtered so that only data illustrating preference-based differences between the policies is retained (Section 3.2). Finally, we analyse and compare the preference-based disagreement data from both policies to extract explanations that indicate which conditions the agents prefer to end up in (Section 3.3).
3.1 Disagreement data
We adopt the definition of disagreement state from Amitai and Amir (2021) and consider two policies to disagree in a state if they do not choose the same action in that state.
With that in mind, we collect three different types of disagreement data from the policies’ interaction with the environment. We follow the method for gathering disagreement data presented in Amitai and Amir (2021), which unrolls the first policy in the environment and at every step compares the decisions of both policies until a disagreement is reached, then follows both policies separately for a set number of steps, and finally returns control to the first policy.
Specifically, we start by executing the first policy in the environment. In each state that it encounters we compare the decisions of both policies and record those states in which the policies choose different actions:
Definition (Disagreement states) Given two policies and , is a disagreement state if:
After encountering a disagreement state , we also unroll both policies for a set number of steps starting from and record the resulting pair of trajectories:
Definition (Disagreement trajectories): Given two policies and and a disagreement state , a pair of disagreement trajectories is a tuple , where:
Finally, upon collecting disagreement trajectories, we also record the last states in each trajectory pair:
Definition (Disagreement outcomes): Given two policies and and a pair of disagreement trajectories , where and , pair of disagreement outcomes is a tuple where:
In other words, an outcome is the state in which an agent ends up after following its policy for a set number of steps from a disagreement state. After individually unrolling the two policies from the disagreement state for that number of steps, control is returned to the first policy, which continues to progress in the environment until a new disagreement state is reached or the episode terminates. The entire collection process is repeated for a set number of episodes. The approach is further detailed in Algorithm 1.
Throughout this section we use the term disagreement to denote a tuple consisting of a disagreement state, the pair of disagreement trajectories starting in that state, and their outcomes. The output of this stage of the approach is the set of collected disagreements.
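The collection procedure described above can be sketched as follows. This is a minimal illustration rather than a reproduction of Algorithm 1: the names `collect_disagreements`, `unroll` and `Disagreement`, the `reset`/`step` interface and the toy one-dimensional environment are all assumptions introduced for the example, and the environment is assumed to be deterministic and to terminate.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[int, ...]
Policy = Callable[[State], int]
StepFn = Callable[[State, int], Tuple[State, bool]]


@dataclass
class Disagreement:
    state: State          # state in which the two policies pick different actions
    traj_a: List[State]   # states visited by the first policy from `state`
    traj_b: List[State]   # states visited by the second policy from `state`

    @property
    def outcomes(self) -> Tuple[State, State]:
        # An outcome is the last state of each trajectory.
        return self.traj_a[-1], self.traj_b[-1]


def unroll(policy: Policy, step: StepFn, state: State, k: int):
    """Follow `policy` for at most k steps; return (visited states, done flag)."""
    traj, done = [state], False
    for _ in range(k):
        state, done = step(state, policy(state))
        traj.append(state)
        if done:
            break
    return traj, done


def collect_disagreements(pi_a, pi_b, reset, step, k, n_episodes):
    """Execute pi_a; at each disagreement, unroll both policies for k steps."""
    if k < 1:
        raise ValueError("k must be at least 1")
    found = []
    for _ in range(n_episodes):
        state, done = reset(), False
        while not done:
            if pi_a(state) != pi_b(state):
                traj_a, done = unroll(pi_a, step, state, k)
                traj_b, _ = unroll(pi_b, step, state, k)
                found.append(Disagreement(state, traj_a, traj_b))
                state = traj_a[-1]  # control returns to the first policy
            else:
                state, done = step(state, pi_a(state))
    return found


# Hypothetical toy environment: move right along a line until x >= 10.
def reset() -> State:
    return (0,)


def step(state: State, action: int):
    x = state[0] + (2 if action == 1 else 1)  # action 1 moves two cells
    return (x,), x >= 10


def slow(state: State) -> int:
    return 0


def bursty(state: State) -> int:
    return 1 if state[0] % 3 == 0 else 0


data = collect_disagreements(slow, bursty, reset, step, k=2, n_episodes=2)
print(len(data), data[0].state, data[0].outcomes)
```

Passing the transition function as `step(state, action)` mirrors the oracle-access assumption: both policies can be unrolled from the same disagreement state without resetting the environment.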
3.2 Ability vs. preference-based disagreement
In order to generate preference-based explanations using the gathered disagreement data, we need to distinguish between disagreement that comes from the different abilities of two agents and disagreement that is a consequence of their different preferences. Specifically, we aim to detect a particular type of disagreement that is representative of a difference in preferences between two policies. Intuitively, we would like to select only those disagreements where both agents see the same potential in the disagreement state and fulfil that potential to the same extent over the following steps, but disagree strongly on the course of action in that state. In other words, we select disagreements where the agents feel similarly optimistic in the disagreement state and each policy is as satisfied with the outcome it reaches as the other policy is with its own, while both agents have high confidence in their chosen path. This would indicate that the agents estimate and realize the same potential in the environment, but strongly disagree on their preferred way to do so.
To precisely define these conditions, we need to introduce a metric for measuring how strongly two policies disagree in a specific state. For this purpose we define state importance as follows:
Definition (State importance) Given two policies, a disagreement state, and the vectors of Q action-values in that state according to each of the two policies, the importance of the state with regard to the two policies can be defined as:
We compute the softmax over a vector of Q action-values in order to emphasise the contrast between the first-ranked action and the others, and select the maximum value to represent how sure the policy is in its decision.
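The per-policy confidence described in this paragraph can be illustrated with a short sketch. The function names `softmax` and `confidence` are assumptions made for the example, and the sketch covers only the softmax-and-maximum step; it does not reproduce how the definition combines the values of the two policies.

```python
import math
from typing import List, Sequence


def softmax(q_values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a vector of Q action-values."""
    top = max(q_values)
    exps = [math.exp(q - top) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(q_values: Sequence[float]) -> float:
    """Largest softmax probability: how sure the policy is of its top action."""
    return max(softmax(q_values))


print(confidence([1.0, 1.0, 1.0, 1.0]))  # all actions look equally good
print(confidence([10.0, 0.0, 0.0]))      # one action clearly dominates
```

A value close to one over the number of actions means the policy sees several actions as similarly promising, whereas a value close to one means the first-ranked action stands out.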
Furthermore, in order to quantify the potential that policy sees in a specific state we employ the idea of a state-value function. Since we do not assume direct access to state-value functions of policies, we simulate them with the help of the available Q action-value functions:
Before using them for estimating state-value function, Q action-values are normalized to range, so that they can be compared between different agents.
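One possible way to derive a comparable state-value estimate from Q action-values is sketched below. Both choices in it are assumptions made for illustration and are not taken from the definitions above: the Q-values are min-max scaled to the interval from 0 to 1 using per-agent bounds `q_min` and `q_max` (hypothetical parameters, for example the extremes observed over the collected data), and the value of a state is taken to be the largest normalized Q-value, as for a greedy policy.

```python
from typing import List, Sequence


def normalize(q_values: Sequence[float], q_min: float, q_max: float) -> List[float]:
    """Min-max scale Q-values with per-agent bounds (ASSUMPTION: target range [0, 1])."""
    if q_max <= q_min:
        raise ValueError("q_max must be greater than q_min")
    return [(q - q_min) / (q_max - q_min) for q in q_values]


def state_value(q_values: Sequence[float], q_min: float, q_max: float) -> float:
    """ASSUMPTION: greedy estimate, V(s) = max over actions of normalized Q(s, a)."""
    return max(normalize(q_values, q_min, q_max))


# Two agents whose reward scales differ by a factor of ten see the same potential.
v_a = state_value([5.0, 8.0], q_min=0.0, q_max=10.0)
v_b = state_value([50.0, 80.0], q_min=0.0, q_max=100.0)
print(v_a, v_b)
```

Normalizing with bounds that are fixed per agent, rather than per state, keeps the estimate informative: scaling each state's vector separately would make every state-value equal to one.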
Finally, we formalize the conditions for considering disagreement to be preference-based:
Definition (Preference-based disagreement) Given two policies and and disagreement between them , disagreement is considered to be preference-based if the following conditions are fulfilled:
Both policies are highly confident in their decision in the disagreement state :
Both policies have similar evaluations of the disagreement state :
To estimate this similarity we evaluate the expression:
After unrolling policies in the environment for steps from state , both policies have similar evaluations of their outcomes:
To estimate this similarity we evaluate the expression:
where , and are threshold values.
Selected values for threshold parameters , and affect how selective the algorithm is. Some suitable values to use in practice would be: , and . Decreasing would result in including disagreements where policies are not confident in their choice and evaluate some alternative actions as similarly promising. These situations indicate that the disagreement is not a consequence of strong preferences of the agents. On the other hand increasing parameters or would result in allowing more ability-based disagreements into the end result. For example, consider a situation where two agents encounter state which they evaluate similarly, but disagree strongly on the best course of action. If after unrolling these policies separately for a number of steps they arrive at different states, and one policy is far more satisfied with its outcome than the other, that indicates that the disagreement in was not a consequence of opposing preferences, but rather of inferior abilities of the second policy.
Finally, we can use the defined measures to filter the set of disagreements to obtain only those that are preference-based:
Output from this stage is the set of preference-based disagreements .
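The filtering step can be sketched as a simple predicate over precomputed scores. The record type `ScoredDisagreement`, the parameter names `alpha`, `beta` and `gamma`, the threshold values in the example, and the use of absolute differences to measure similarity are all assumptions made for illustration. Only the direction of each threshold follows the discussion above: a lower bound on confidence and upper bounds on the two evaluation gaps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredDisagreement:
    conf_a: float       # confidence of the first policy in the disagreement state
    conf_b: float       # confidence of the second policy in the disagreement state
    v_state_a: float    # first policy's evaluation of the disagreement state
    v_state_b: float    # second policy's evaluation of the disagreement state
    v_outcome_a: float  # first policy's evaluation of its own outcome
    v_outcome_b: float  # second policy's evaluation of its own outcome


def is_preference_based(d: ScoredDisagreement, alpha: float, beta: float, gamma: float) -> bool:
    confident = min(d.conf_a, d.conf_b) >= alpha                 # both sure of their choice
    same_start = abs(d.v_state_a - d.v_state_b) <= beta          # similar view of the state
    same_end = abs(d.v_outcome_a - d.v_outcome_b) <= gamma       # similarly satisfied after
    return confident and same_start and same_end


def filter_preference_based(ds: List[ScoredDisagreement], alpha: float, beta: float, gamma: float):
    return [d for d in ds if is_preference_based(d, alpha, beta, gamma)]


candidates = [
    ScoredDisagreement(0.95, 0.90, 0.70, 0.72, 0.80, 0.78),  # kept: preference-based
    ScoredDisagreement(0.55, 0.90, 0.70, 0.72, 0.80, 0.78),  # dropped: first policy unsure
    ScoredDisagreement(0.95, 0.92, 0.70, 0.71, 0.90, 0.30),  # dropped: outcomes diverge
]
kept = filter_preference_based(candidates, alpha=0.8, beta=0.1, gamma=0.1)
print(len(kept))
```

The third record corresponds to the ability-based case discussed above: the agents evaluate the disagreement state similarly, but one is far less satisfied with where it ends up.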
3.3 Generating contrastive explanations
The set of preference-based disagreements is rich with information on preference-based differences between the two policies. Therefore, there are multiple ways to approach generating contrastive explanations from it: we could exploit differences in disagreement trajectories or in disagreement outcomes. In this work we choose the latter and focus on analysing and comparing the outcomes of following the two policies to uncover which states the agents prefer to reach. More specifically, we address the question: “What conditions (in terms of state features) does one agent prefer to end up in, compared to the other agent?”. To answer this question, we start by creating two sets of values for each state feature, storing the values of that feature in the outcomes of the first and the second policy respectively:
Finally, we include in our explanations only those features for which there is a significant difference between the distributions of the two sets of values. Since we deal with continuous state features in the example presented in Section 4, we use a paired t-test to assess the difference between the two distributions. If the feature is not continuous, alternative appropriate statistical tests can be used to determine the statistical relationship McNemar (1947); Pearson (1900). Provided that there is a significant difference in the distribution of a feature in the outcomes of the two policies, we consider that feature to be indicative of the agent’s preference and we use it in the explanations. For example, if there is a significant difference between the two sets of values for some feature, and the mean value in the outcomes of the first policy is larger than the mean value in the outcomes of the second policy, then the explanation will state that the first policy prefers to end up in states where that feature has larger values. The final explanation is generated by combining all feature-specific preferences using a natural-language template.
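The explanation step can be sketched with the standard library alone. The sketch computes the paired t-statistic for each feature and compares it against a critical value supplied by the caller, instead of computing a p-value; the function names, the feature names, the toy outcome values and the policy labels "A" and "B" are assumptions made for the example, and the template wording is only one possible choice.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def paired_t_statistic(xs: Sequence[float], ys: Sequence[float]) -> float:
    """t-statistic of the paired differences xs[i] - ys[i]."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0:  # all differences identical
        return 0.0 if m == 0 else math.copysign(math.inf, m)
    return m / (sd / math.sqrt(len(diffs)))


def explain(outcomes_a: List[Tuple[float, ...]], outcomes_b: List[Tuple[float, ...]],
            feature_names: Sequence[str], t_crit: float,
            name_a: str = "A", name_b: str = "B") -> str:
    """Template-based explanation built from significantly different features."""
    parts = []
    for i, feature in enumerate(feature_names):
        t = paired_t_statistic([o[i] for o in outcomes_a], [o[i] for o in outcomes_b])
        if abs(t) > t_crit:  # significant at the level implied by t_crit
            parts.append(f"{'larger' if t > 0 else 'smaller'} {feature}")
    if not parts:
        return f"No significant preference differences between policies {name_a} and {name_b}."
    return f"Policy {name_a} prefers states with {', '.join(parts)} compared to policy {name_b}."


# Toy outcomes as (velocity, heading); the i-th entries come from the same disagreement.
out_a = [(10.0, 0.1), (11.0, 0.2), (9.0, 0.1), (10.0, 0.3), (12.0, 0.2)]
out_b = [(20.0, 0.2), (22.0, 0.1), (19.0, 0.2), (21.0, 0.2), (23.0, 0.2)]

# 2.776 is the two-sided 5% critical value of the t-distribution with 4 degrees of freedom.
print(explain(out_a, out_b, ["velocity", "heading"], t_crit=2.776))
```

Pairing is natural here because each disagreement yields exactly one outcome per policy. In practice a statistics library could be used to obtain p-values directly.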
To evaluate the method proposed in Section 3, we employ a simplified version of the merging task in autonomous driving presented in Huang et al. (2019). In this environment, the autonomous vehicle navigates a three-lane road. Agent begins the episode in the center lane, and is tasked with merging safely into the right lane, currently occupied by a non-autonomous vehicle. Episode ends upon successful completion of the task, or if agent fails catastrophically by crashing into another car. Reward of is awarded for successfully merging in the right lane, while penalty is received for crashing into the non-autonomous vehicle. Additionally, driving off the road yields penalty. There are two suitable ways to approach this task. Agent can either employ a safety-oriented strategy and merge behind the non-autonomous vehicle, minimizing its chances of collision or it can speed up and merge in front of the other car, depending on its preference in the trade-off between speed and safety.
Agent’s observation is a vector describing both the agent and the non-autonomous vehicle. Specifically, agent observes location of its rear axle , as well as its heading , velocity and steering wheel angle . Additionally, agent can observe the same features for the non-autonomous vehicle denoted by and . At each step agent chooses between discrete actions – agent can increase or decrease its speed by , it can change its steering angle by degrees in any direction, or it can choose to alter nothing. Non-autonomous vehicle, however, drives straight ahead in the right lane with the same velocity throughout the episode. Action space is significantly simplified compared to one described in Huang et al. (2019), but it is still rich enough to enable the desired behavior. Additional environment parameters are available in Table 1. Finally, the reward function consists of multiple features that agent needs to optimize:
Distance from the goal: , where is the x coordinate of the center of the right lane.
Distance to the other car: , where is the location of the agent and is the location of the non-autonomous car at time step .
Deviation from initial speed:
Deviation from initial heading:
All reward features are normalized to range. Ultimately, the reward function takes the form of a linear combination of features:
where each feature is weighed by a parameter from the vector to determine its importance in the overall objective.
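As a small illustration of a linear reward of this form, the sketch below computes the weighted sum of already-normalized feature values. The feature values and weights are hypothetical placeholders and are not the parameters used in the experiments.

```python
from typing import Sequence


def linear_reward(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of reward features, one weight per feature."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical values for four normalized features and their weights.
features = [0.5, 1.0, 0.0, 0.25]
weights = [1.0, 2.0, 1.0, 4.0]
print(linear_reward(features, weights))
```

Changing a single weight while keeping the others fixed shifts the trade-off between objectives without altering the features themselves.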
We start by training a baseline policy with reward function parameters . This policy is safety-oriented – it prefers to keep a distance to the non-autonomous vehicle and chooses to slow down and merge behind it. All policies in this section are trained using DQN algorithm Mnih et al. (2013).
Furthermore, to obtain policies with different strategies we vary the value of the parameter which affects how important progress is to the agent. Small values of this parameter indicate a safety-oriented policy which prefers to slow down and merge behind the non-autonomous vehicle, decreasing its chances of collision. This strategy is identical to that of the baseline policy. On the other hand, increasing the value of this parameter results in a more aggressive policy which values progress over keeping a greater-than-necessary distance to the other vehicle. Such a policy prefers to speed up and merge in front of the non-autonomous car. For each value we train different models and using the reward function with parameters . In other words, we keep all other reward feature parameters the same as in the baseline model, and change only the parameter corresponding to progress. Models and are trained for the same number of steps and achieve near-optimal performance on the task. To ensure we also generate a policy with inferior capabilities, is trained for significantly less time, and does not fully learn the task. Training parameters are given in Table 2.
We set up experiments to show that the method presented in Section 3 captures and explains only preference-based differences in behavior of the two agents. In other words, we do not want our method to detect difference in behavior when comparing two policies employing the same strategy, or two policies with significantly different capabilities. Therefore, we set up three different evaluation goals:
The method detects differences between two policies with different preferences.
The method does not detect differences between two policies with same preference.
The method does not detect differences between two policies of significantly different capabilities.
To test all three evaluation goals, we set up three different evaluation scenarios. To evaluate our method against the first goal, we compare the behavior of policies and for each . Since we expect to exhibit more aggressive behavior compared to for larger values of , this scenario tests whether this difference in strategy will be captured by the proposed approach. Furthermore, to evaluate the second goal we use the method in Section 3 to compare policies and for every . These policies should not differ in their abilities or preferences, since they are trained for the same number of time steps and on the same reward function. Finally, to investigate the third evaluation goal we compare the behavior of and for every . These two policies differ in their abilities, as has been trained for a shorter time, and fails to learn any meaningful strategy in the environment. For all three evaluation scenarios we record the total number of disagreements encountered as well as the number of preference-based disagreements according to the method presented in Section 3, and generate contrasting explanations provided that the number of gathered disagreements exceeds . Further parameters for this approach are given in Table 3.
Results for the three evaluation scenarios are presented in Table 4. Firstly, we can notice that in each scenario the policies have encountered a number of disagreement states, despite the fact that some policies, such as and , were trained on the same reward function for the same amount of time. From Table 4 we can clearly see that our method abides by the second and third evaluation goals, because comparing two capable policies trained with the same preferences ( vs. ) or two policies with different abilities ( vs. ) yields no preference-based disagreements. In order to confirm that the method also satisfies the first evaluation goal, we must determine whether the actual behavior of the policies corresponds to the number of discovered preference-based disagreements. In other words, according to Table 4 the proposed approach concluded that only policy follows the same safety-oriented strategy as the baseline policy and merges behind the non-autonomous vehicle, while all other policies exhibit a different, more aggressive behavior. To verify these findings, we explore the behavior of these policies and for each of them record the average y-coordinate distance at the moment of merging into the right lane (Figure 3). Similarly, we record the average velocity of the car during the episode for each of the trained policies (Figure 3), since this feature is the most indicative of the agent’s strategy: in order to merge in front of the non-autonomous vehicle the agent must speed up, and for safety-oriented merging the agent will need to slow down. Figures 3 and 3 show that the behavior of policy differs significantly from the other policies, and is most similar to . Specifically, is the only policy which merges behind the non-autonomous vehicle (Figure 3) and maintains an average speed lower than the initial velocity (Figure 3). From this we can conclude that our method indeed satisfies the first evaluation goal.
Finally, we use the preference-based disagreement data to generate explanations about the contrasting behavior of two policies with different strategies using the method described in Section 3.3. The explanations identify the state feature values that one agent prefers compared to the other agent. Specifically, after comparing two policies and , where is safety-oriented, while prefers more aggressive driving, we obtain the following explanations:
Policy prefers states with smaller, smaller, smaller, smaller, larger compared to policy .
6 Discussion and future work
In this work we focused on the problem of explaining differences in behavior of two RL agents that stem from their opposing preferences. We proposed a method for distinguishing between ability-based and preference-based differences and generated contrasting explanations about state feature values that agents prefer. We also evaluated our approach on a merging task in autonomous driving.
Although we have shown that our method can successfully differentiate between ability-based and preference-based differences in behavior, our approach relies on the choice of the three threshold values. Additionally, our approach focuses on comparing only two RL policies. In future work we hope to address these two limitations, and extend the method to allow for an end-to-end approach for learning the threshold parameters and to support multiple policies.
This publication has emanated from research supported in part by a grant from Science Foundation Ireland under Grant number 18/CRT/6223. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
- Highlights: summarizing agent behavior to people. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1168–1176. Cited by: §1.
- ”I don’t think so”: disagreement-based policy summaries for comparing agents. arXiv preprint arXiv:2102.03064. Cited by: §1, §2, §3.1, §3.1.
- Deep reinforcement learning: a brief survey. IEEE Signal Processing Magazine 34 (6), pp. 26–38. Cited by: §1.
- Techniques for interpretable machine learning. Communications of the ACM 63 (1), pp. 68–77. Cited by: §1.
- Autonomous self-explanation of behavior for interactive reinforcement learning agents. In Proceedings of the 5th International Conference on Human Agent Interaction, pp. 97–101. Cited by: §1.
- Visualizing and understanding atari agents. In International Conference on Machine Learning, pp. 1792–1801. Cited by: §1.
- Improving robot controller transparency through autonomous policy explanation. In 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 303–312. Cited by: §1.
- Enabling robots to communicate their objectives. Autonomous Robots 43 (2), pp. 309–326. Cited by: §1, Figure 1, §4, §4.
- Distal explanations for explainable reinforcement learning agents. arXiv preprint arXiv:2001.10284. Cited by: §2.
- Explainable reinforcement learning through a causal lens. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 2493–2500. Cited by: §2.
- Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12 (2), pp. 153–157. Cited by: §3.3.
- Explanation in artificial intelligence: insights from the social sciences. Artificial intelligence 267, pp. 1–38. Cited by: §1, §2.
- Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §4.
- X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50 (302), pp. 157–175. Cited by: §3.3.
- Explainable reinforcement learning: a survey. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pp. 77–95. Cited by: §2.
- Interestingness elements for explainable reinforcement learning: understanding agents’ capabilities and limitations. Artificial Intelligence 288, pp. 103367. Cited by: §1, §2.
- Contrastive explanations for reinforcement learning in terms of expected consequences. arXiv preprint arXiv:1807.08706. Cited by: §2.