
MAVIPER: Learning Decision Tree Policies for Interpretable Multi-Agent Reinforcement Learning

Many recent breakthroughs in multi-agent reinforcement learning (MARL) require the use of deep neural networks, which are challenging for human experts to interpret and understand. On the other hand, existing work on interpretable reinforcement learning (RL) has shown promise in extracting more interpretable decision tree-based policies from neural networks, but only in the single-agent setting. To fill this gap, we propose the first set of algorithms that extract interpretable decision-tree policies from neural networks trained with MARL. The first algorithm, IVIPER, extends VIPER, a recent method for single-agent interpretable RL, to the multi-agent setting. We demonstrate that IVIPER learns high-quality decision-tree policies for each agent. To better capture coordination between agents, we propose a novel centralized decision-tree training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees, and uses resampling to focus on states that are critical for its interactions with other agents. We show that both algorithms generally outperform the baselines and that MAVIPER-trained agents achieve better-coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments.


1 Introduction

Multi-agent reinforcement learning (MARL) is a promising technique for solving challenging problems, such as air traffic control [5], train scheduling [27], cyber defense [22], and autonomous driving [4]. In many of these scenarios, we want to train a team of cooperating agents. Other settings, like cyber defense, involve an adversary or set of adversaries with goals that may be at odds with the team of defenders. To obtain high-performing agents, most of the recent breakthroughs in MARL rely on neural networks (NNs) [10, 35], which have thousands to millions of parameters and are challenging for a person to interpret and verify. Real-world risks necessitate learning interpretable policies that people can inspect and verify before deployment, while still performing well at the specified task and being robust to a variety of attackers (if applicable).

Decision trees [34] (DTs) are generally considered to be an intrinsically interpretable model family [28]: sufficiently small trees can be contemplated by a person at once (simulatability), have subparts that can be intuitively explained (decomposability), and are verifiable (algorithmic transparency) [18]. In the RL setting, DT-like models have been successfully used to model transition functions [40], reward functions [8], value functions [32, 43], and policies [24]. Although learning DT policies for interpretability has been investigated in the single-agent RL setting [24, 33, 37], it has yet to be explored in the multi-agent setting.

To address this gap, we propose two algorithms, IVIPER and MAVIPER, which combine ideas from model compression and imitation learning to learn DT policies in the multi-agent setting. Both algorithms extend VIPER [2], which extracts DT policies for single-agent RL. IVIPER and MAVIPER work with most existing NN-based MARL algorithms: the policies generated by these algorithms serve as “expert policies” and guide the training of a set of DT policies.

The main contributions of this work are as follows. First, we introduce the IVIPER algorithm as a novel extension of the single-agent VIPER algorithm to multi-agent settings. Indeed, IVIPER trains DT policies that achieve high individual performance in the multi-agent setting. Second, to better capture coordination between agents, we propose a novel centralized DT training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees. To train each agent’s policy, MAVIPER uses a novel resampling scheme to find states that are considered critical for its interactions with other agents. We show that MAVIPER-trained agents achieve better coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments.

2 Background and Preliminaries

We focus on the problem of learning interpretable DT policies in the multi-agent setting. We first describe the formalism of our multi-agent setting, then discuss DT policies and review the single-agent version of VIPER.

2.1 Markov Games and MARL Algorithms

In MARL, agents act in an environment defined by a Markov game [19, 38]. A Markov game for $N$ agents consists of a set of states $\mathcal{S}$ describing all possible configurations for all agents, the initial state distribution $\rho$, and the set of actions $\mathcal{A}_i$ and observations $\mathcal{O}_i$ for each agent $i \in \{1, \dots, N\}$. Each agent aims to maximize its own total expected return $R_i = \sum_{t=0}^{\infty} \gamma^t r_i^t$, where $r_i^t$ is the reward of agent $i$ at timestep $t$ and $\gamma$ is the discount factor that weights the relative importance of future rewards. To do so, each agent selects actions using a policy $\pi_i : \mathcal{O}_i \to \mathcal{A}_i$. After the agents simultaneously execute their actions in the environment, the environment produces the next state according to the state transition function $P$. Each agent receives a reward according to a reward function $r_i$ and a private observation $o_i$, consisting of a vector of features, correlated with the state $s \in \mathcal{S}$.

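Given a policy profile $\pi = (\pi_1, \dots, \pi_N)$, agent $i$'s value function is defined as $V_i^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_i^t \mid s^0 = s\right]$ and its state-action value function as $Q_i^{\pi}(s, a_1, \dots, a_N) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_i^t \mid s^0 = s,\ a^0 = (a_1, \dots, a_N)\right]$, with $r_i^t$ the reward of agent $i$ at timestep $t$ and $\gamma$ the discount factor. We refer to a policy profile excluding agent $i$ as $\pi_{-i}$.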

Figure 1: A decision tree of depth two that MAVIPER learns in the Cooperative Navigation environment. The learned decision tree captures the expert’s behavior of going to one of the landmarks.

MARL algorithms fall into two categories: value-based [35, 39, 41] and actor-critic [11, 16, 20, 48]. Value-based methods often approximate $Q$-functions for individual agents in the form of $Q_i(o_i, a_i)$ and derive the policies by taking actions with the maximum Q-values. In contrast, actor-critic methods often follow the centralized training and decentralized execution (CTDE) paradigm [30]. They train agents in a centralized manner, enabling agents to leverage information beyond their private observation during training; however, agents must behave in a decentralized manner during execution. Each agent $i$ uses a centralized critic network $Q_i$, which takes as input some state information (including the observations of all agents) and the actions of all agents. This design addresses the stationarity issue in MARL training: without access to the actions of other agents, the environment appears non-stationary from the perspective of any one agent. Each agent also has a policy network $\pi_i$ that takes as input its observation $o_i$.

2.2 Decision Tree Policies

DTs are tree-like models that recursively partition the input space along a specific feature using a cutoff value. These models produce axis-parallel partitions: internal nodes are the intermediate partitions, and leaf nodes are the final partitions. When used to represent policies, the internal nodes represent the features and values of the input state that the agent uses to choose its action, and the leaf nodes correspond to chosen actions given some input state. For an example of a DT policy, see Figure 1.
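To make the structure concrete, the following is a minimal sketch of a depth-two DT policy written in plain Python. The observation layout `(dx, dy)`, the action ids, and the split thresholds are all hypothetical and chosen only for illustration; they are not taken from the tree in Figure 1.

```python
from dataclasses import dataclass
from typing import Sequence, Union

LEFT, RIGHT, DOWN, UP = 0, 1, 2, 3  # hypothetical discrete action ids


@dataclass
class Leaf:
    action: int  # action chosen for every observation that reaches this leaf


@dataclass
class Node:
    feature: int               # index of the observation feature to test
    threshold: float           # cutoff value of the axis-parallel split
    low: Union["Node", Leaf]   # subtree for obs[feature] <= threshold
    high: Union["Node", Leaf]  # subtree for obs[feature] > threshold


def act(tree: Union[Node, Leaf], obs: Sequence[float]) -> int:
    """Walk from the root to a leaf and return that leaf's action."""
    while isinstance(tree, Node):
        tree = tree.low if obs[tree.feature] <= tree.threshold else tree.high
    return tree.action


# Depth-two tree over a made-up observation (dx, dy): the position of a
# landmark relative to the agent. Each internal node tests one feature.
policy = Node(0, 0.0,
              low=Node(1, 0.0, low=Leaf(LEFT), high=Leaf(UP)),
              high=Node(1, 0.0, low=Leaf(DOWN), high=Leaf(RIGHT)))
```

Calling `act(policy, (-1.0, 1.0))` follows the `low` branch at the root and the `high` branch at the second level, so every decision can be read off as two feature comparisons.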

2.3 VIPER

VIPER [2] is a popular algorithm [7, 21, 25] that extracts DT policies for a finite-horizon Markov decision process given an expert policy trained using any single-agent RL algorithm. It combines ideas from model compression [6, 13] and imitation learning [1], specifically a variation of the DAGGER algorithm [36]. It uses a high-performing deep NN that approximates the state-action value function to guide the training of a DT policy.

VIPER trains a DT policy $\hat{\pi}^m$ in each iteration $m$; the final output is the best policy among all iterations. More concretely, in iteration $m$, it samples $K$ trajectories following the DT policy $\hat{\pi}^{m-1}$ trained at the previous iteration. Then, it uses the expert policy $\pi^*$ to suggest actions for each visited state, leading to the dataset $\mathcal{D}^m = \{(s, \pi^*(s))\}$ (Line 4, Alg. 3). VIPER adds these relabeled experiences to a dataset $\mathcal{D}$ consisting of experiences from previous iterations. Let $V^{\pi^*}$ and $Q^{\pi^*}$ be the state value function and state-action value function given the expert policy $\pi^*$. VIPER resamples points $(s, a) \in \mathcal{D}$ according to weights: $\tilde{\ell}(s) = V^{\pi^*}(s) - \min_{a \in \mathcal{A}} Q^{\pi^*}(s, a)$. See Algorithm 3 in Appendix A for the full VIPER algorithm.
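The resampling step can be sketched as follows. This is an illustrative toy, not the authors' implementation: the Q-table, the state names, and the helper names `viper_weight` and `resample` are assumptions, and the expert is assumed to be greedy so that $V^{\pi^*}(s) = \max_a Q^{\pi^*}(s, a)$ and the weights are non-negative.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, int]  # (state, expert action)


def viper_weight(state: str, actions: Sequence[int],
                 v: Callable[[str], float],
                 q: Callable[[str, int], float]) -> float:
    """Gap between the expert's value and the worst action's Q-value."""
    return v(state) - min(q(state, a) for a in actions)


def resample(dataset: List[Sample], actions: Sequence[int],
             v: Callable[[str], float], q: Callable[[str, int], float],
             n: int, rng: random.Random) -> List[Sample]:
    """Draw n samples with probability proportional to the weight."""
    weights = [viper_weight(s, actions, v, q) for s, _ in dataset]
    if sum(weights) <= 0.0:  # every action is equally good everywhere
        return [rng.choice(dataset) for _ in range(n)]
    return rng.choices(dataset, weights=weights, k=n)


# Toy expert Q-values (made up): in "safe" the action does not matter,
# in "critical" a wrong action is very costly.
Q_TABLE = {"safe": {0: 1.0, 1: 1.0}, "critical": {0: 1.0, 1: -9.0}}


def q(state: str, action: int) -> float:
    return Q_TABLE[state][action]


def v(state: str) -> float:
    return max(Q_TABLE[state].values())  # greedy expert assumption


dataset = [("safe", 0), ("critical", 0)]
resampled = resample(dataset, [0, 1], v, q, n=5, rng=random.Random(0))
```

In this toy, the "safe" state has weight zero and is never drawn, so the tree is fit only on the state where a mistake matters.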

3 Approach

We present two algorithms: IVIPER and MAVIPER. Both are general policy extraction algorithms for the multi-agent setting inspired by the single-agent VIPER algorithm. At a high level, given an expert policy profile $\pi^*$ with associated state-action value functions $Q^{\pi^*}$ trained by an existing MARL algorithm, both algorithms produce a DT policy $\hat{\pi}_i$ for each agent $i$. These algorithms work with various state-of-the-art MARL algorithms, including value-based and multi-agent actor-critic methods. We first discuss IVIPER, the basic version of our multi-agent DT learning algorithm. We then introduce additional changes that form the full MAVIPER algorithm.

3.1 IVIPER

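Input: Markov game, expert policy profile $\pi^*$, expert state-action value functions $Q^{\pi^*} = (Q_1^{\pi^*}, \dots, Q_N^{\pi^*})$, number of trajectories $K$, number of iterations $M$

1: for $i = 1$ to $N$ do
2:     Initialize dataset $\mathcal{D}_i \leftarrow \emptyset$ and policy $\hat{\pi}_i^0 \leftarrow \pi_i^*$
3:     for $m = 1$ to $M$ do
4:         Sample $K$ trajectories: $\mathcal{D}_i^m \leftarrow \{(s, \pi_1^*(s), \dots, \pi_N^*(s)) \sim d^{(\hat{\pi}_i^{m-1}, \pi_{-i}^*)}\}$
5:         Aggregate dataset $\mathcal{D}_i \leftarrow \mathcal{D}_i \cup \mathcal{D}_i^m$
6:         Resample dataset according to loss: $\mathcal{D}_i' \leftarrow \{(s, a) \sim p((s, a)) \propto \tilde{\ell}_i(s)\, \mathbb{1}[(s, a) \in \mathcal{D}_i]\}$
7:         Train decision tree $\hat{\pi}_i^m \leftarrow$ TrainDecisionTree($\mathcal{D}_i'$)
8:     Get best policy $\hat{\pi}_i \leftarrow$ BestPolicy($\hat{\pi}_i^1, \dots, \hat{\pi}_i^M$)
9: return Best policies for each agent $\hat{\pi} = (\hat{\pi}_1, \dots, \hat{\pi}_N)$
Algorithm 1 IVIPER in Multi-Agent Setting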

Motivated by the practical success of single-agent RL algorithms in the MARL setting [23, 3], we extend single-agent VIPER to the multi-agent setting by independently applying the single-agent algorithm to each agent, with a few critical changes described below. Algorithm 1 shows the full IVIPER pseudocode.

First, we ensure that each agent has sufficient information for training its DT policy. Each agent has its own dataset $\mathcal{D}_i$ of training tuples. When using IVIPER with multi-agent actor-critic methods that leverage a per-agent centralized critic network $Q_i$, we ensure that each agent's dataset has not only its observation and actions, but also the complete state information (which consists of the observations of all of the agents) and the expert-labeled actions of all of the other agents $a_{-i}$. By providing each agent with the information about all other agents, we avoid the stationarity issue that arises when the policies of all agents are changing throughout the training process (like in MARL).

Second, we account for important changes that emerge from moving to a multi-agent formalism. When we sample and relabel trajectories for training each agent's DT policy, we sample from the distribution induced by agent $i$'s policy at the previous iteration $\hat{\pi}_i^{m-1}$ and the expert policies of all other agents $\pi_{-i}^*$. We only relabel the action for agent $i$ because the other agents choose their actions according to $\pi_{-i}^*$. This is equivalent to treating all other expert agents as part of the environment and only using a DT policy for agent $i$.

Third, we incorporate the actions of all agents when resampling the dataset to construct a new, weighted dataset (Line 6, Algorithm 1). If the MARL algorithm uses a centralized critic $Q_i^{\pi^*}(s, a_i, a_{-i})$, we resample points according to:

$$p((s, a)) \propto V_i^{\pi^*}(s) - \min_{a_i \in \mathcal{A}_i} Q_i^{\pi^*}(s, a_i, a_{-i}) \qquad (2)$$

Crucially, we include the actions of all other agents $a_{-i}$ in Equation 2 to select agent $i$'s minimum Q-value from its centralized state-action value function.

When applied to value-based methods, IVIPER is more similar to single-agent VIPER. In particular, in Line 4, Algorithm 1, it is sufficient to only store agent $i$'s observation $o_i$ and expert action $\pi_i^*(o_i)$ in the dataset $\mathcal{D}_i$, although we still must sample trajectories according to $\hat{\pi}_i^{m-1}$ and $\pi_{-i}^*$. In Line 6, we use the resampling weights from single-agent VIPER, removing the reliance of the loss on a centralized critic.

Taken together, these algorithmic changes form the basis of the IVIPER algorithm. This algorithm can be viewed as transforming the multi-agent learning problem to a single-agent one, in which other agents are folded into the environment. This approach works well if i) we only want an interpretable policy for a single agent in a multi-agent setting or ii) agents do not need to coordinate with each other. When coordination is needed, this algorithm does not reliably capture coordinated behaviors, as each DT is trained independently without consideration for what the other agent’s resulting DT policy will learn. This issue is particularly apparent when trees are constrained to have a small maximum depth, as is desired for interpretability.

3.2 MAVIPER

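Input: Markov game, expert policy profile $\pi^*$, expert state-action value functions $Q^{\pi^*} = (Q_1^{\pi^*}, \dots, Q_N^{\pi^*})$, number of trajectories $K$, number of iterations $M$

1: Initialize dataset $\mathcal{D} \leftarrow \emptyset$ and policy $\hat{\pi}_i^0 \leftarrow \pi_i^*$ for each agent $i$
2: for $m = 1$ to $M$ do
3:     Sample $K$ trajectories: $\mathcal{D}^m \leftarrow \{(s, \pi_1^*(s), \dots, \pi_N^*(s)) \sim d^{(\hat{\pi}_1^{m-1}, \dots, \hat{\pi}_N^{m-1})}\}$
4:     Aggregate dataset $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}^m$
5:     For each agent $i$, resample $\mathcal{D}_i$ according to loss: $\mathcal{D}_i \leftarrow \{(s, a) \sim p((s, a)) \propto \tilde{\ell}_i(s)\, \mathbb{1}[(s, a) \in \mathcal{D}]\}$
6:     Jointly train DTs: $(\hat{\pi}_1^m, \dots, \hat{\pi}_N^m) \leftarrow$ TrainJointTrees($\mathcal{D}_1, \dots, \mathcal{D}_N$)
7: return Best set of agents $\hat{\pi} = (\hat{\pi}_1, \dots, \hat{\pi}_N)$ among all iterations

9: function TrainJointTrees($\mathcal{D}_1, \dots, \mathcal{D}_N$)
10:     Initialize decision trees $\hat{\pi}_1^m, \dots, \hat{\pi}_N^m$.
11:     repeat
12:         Grow one more level for agent $i$'s tree: $\hat{\pi}_i^m \leftarrow$ Build($\hat{\pi}_1^m, \dots, \hat{\pi}_N^m$, $\mathcal{D}_i$)
13:         Move to the next agent: $i \leftarrow (i + 1) \bmod N$
14:     until all trees have grown to the maximum depth allowed
15:     return decision trees $\hat{\pi}_1^m, \dots, \hat{\pi}_N^m$

17: function Build($\hat{\pi}_1^m, \dots, \hat{\pi}_N^m$, $\mathcal{D}_i$)
18:     for each data point $(s, a) \in \mathcal{D}_i$ do
19:         // Will agent $j$'s (projected) final DT predict its action correctly?
20:         $v_j \leftarrow \mathbb{1}[\text{Predict}(\hat{\pi}_j^m, s) = a_j]$ for each agent $j$
21:         // This data point is useful only if many agents' final DTs predict correctly.
22:         if $\frac{1}{N} \sum_j v_j <$ threshold then Remove $(s, a)$ from dataset: $\mathcal{D}_i \leftarrow \mathcal{D}_i \setminus \{(s, a)\}$
23:     $\hat{\pi}_i^m \leftarrow$ Calculate best next feature split for DT $\hat{\pi}_i^m$ using $\mathcal{D}_i$.
24:     return $\hat{\pi}_i^m$

26: function Predict($\hat{\pi}_j^m$, $s$)
27:     Use $s$ to traverse $\hat{\pi}_j^m$ until leaf node
28:     Train a projected final DT on the data $\mathcal{D}_j'$ at that leaf node: $\hat{\pi}_j' \leftarrow$ TrainDecisionTree($\mathcal{D}_j'$)
29:     return $\hat{\pi}_j'$.predict($s$)
Algorithm 2 MAVIPER (Joint Training)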

To address the issue of coordination, we propose MAVIPER, our novel algorithm for centralized training of coordinated multi-agent DT policies. For expository purposes, we describe MAVIPER in a fully cooperative setting, then explain how to use MAVIPER for mixed cooperative-competitive settings. At a high level, MAVIPER trains all of the DT policies, one for each agent, in a centralized manner. It jointly grows the trees of each agent by predicting the behavior of the other agents in the environment using their anticipated trees. To train each DT policy, MAVIPER employs a new resampling technique to find states that are critical for its interactions with other agents. Algorithm 2 shows the full MAVIPER algorithm. Specifically, MAVIPER is built upon the following extensions to IVIPER that aim to address the issue of coordination.

First, MAVIPER does not calculate the probability of a joint observation by viewing the other agents as stationary experts. Instead, MAVIPER focuses on the critical states where a good joint action can make a difference. Specifically, MAVIPER aims to measure how much worse off agent $i$ would be, taking expectation over all possible joint actions of the other agents, if it acts in the worst way possible compared with when it acts in the same way as the expert agent. So, we redefine the resampling probability $p((s, a))$ from Equation 2 as:

$$p((s, a)) \propto \mathbb{E}_{a_{-i}}\left[ Q_i^{\pi^*}(s, \pi_i^*(o_i), a_{-i}) - \min_{a_i \in \mathcal{A}_i} Q_i^{\pi^*}(s, a_i, a_{-i}) \right] \qquad (3)$$

MAVIPER uses the DT policies from the last iteration to perform rollouts and collect new data.
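The difference between the two resampling weights can be illustrated with a toy centralized critic. The sketch below is an assumption-laden illustration rather than the authors' code: the critic `q_i`, the value function `v_i`, and both function names are hypothetical, and the expectation over the other agents' joint actions is taken under a uniform distribution, which the text does not specify.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

JointAction = Tuple[int, ...]
Critic = Callable[[str, int, JointAction], float]


def iviper_weight(state: str, own_actions: Sequence[int],
                  others_expert: JointAction,
                  v_i: Callable[[str], float], q_i: Critic) -> float:
    """Expert value minus the worst own action, with the other agents
    fixed to their expert-labeled actions."""
    return v_i(state) - min(q_i(state, a, others_expert) for a in own_actions)


def maviper_weight(state: str, own_actions: Sequence[int], own_expert: int,
                   others_joint_actions: Sequence[JointAction],
                   q_i: Critic) -> float:
    """Average gap between the expert's own action and the worst own
    action, over the other agents' joint actions (uniform: an assumption)."""
    return mean(
        q_i(state, own_expert, a_others)
        - min(q_i(state, a, a_others) for a in own_actions)
        for a_others in others_joint_actions
    )


# Toy critic (made up): agent i is rewarded only when it picks a
# different target than the single other agent.
def q_i(state: str, a_i: int, a_others: JointAction) -> float:
    return 1.0 if a_i != a_others[0] else 0.0


def v_i(state: str) -> float:
    return 1.0  # value under the expert profile (agent i: 1, other: 0)


w_iviper = iviper_weight("s", [0, 1], (0,), v_i, q_i)
w_maviper = maviper_weight("s", [0, 1], 1, [(0,), (1,)], q_i)
```

Here the IVIPER-style weight is 1.0 because the other agent is assumed to follow its expert, while the MAVIPER-style weight is 0.5 because agent $i$'s expert action only helps for half of the other agent's possible actions.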

Second, we add a prediction module to the DT training process to increase the joint accuracy, as shown in the Predict function. The goal of the prediction module is to predict the actions that the other DTs might make, given their partial observations. To make the most of the prediction module, MAVIPER grows the trees evenly using a breadth-first ordering to avoid biasing towards the result of any specific tree. Since the trees are not complete at the time of prediction, we use the output of another DT trained with the full dataset associated with that node for the prediction. Following the intuition that the correct prediction of one agent alone may not yield much benefit if the other agents are wrong, we use this prediction module to remove all data points whose proportion of correct predictions is lower than a predefined threshold. We then calculate the splitting criteria based on this modified dataset and continue iteratively growing the tree.
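The filtering step of the prediction module can be sketched as below. This is a simplified illustration under stated assumptions: each predictor is a plain callable standing in for an agent's projected final DT, the data layout (per-agent observations paired with per-agent expert actions) is hypothetical, and the name `filter_by_joint_accuracy` is not from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Observation = float  # toy one-dimensional observation per agent
DataPoint = Tuple[Tuple[Observation, ...], Tuple[int, ...]]
Predictor = Callable[[Observation], int]


def filter_by_joint_accuracy(dataset: List[DataPoint],
                             predictors: Sequence[Predictor],
                             threshold: float) -> List[DataPoint]:
    """Keep data points whose proportion of agents with a correct
    predicted action is at least the threshold."""
    kept = []
    for observations, expert_actions in dataset:
        correct = sum(
            predict(obs) == action
            for predict, obs, action
            in zip(predictors, observations, expert_actions)
        )
        if correct / len(predictors) >= threshold:
            kept.append((observations, expert_actions))
    return kept


def sign_predictor(obs: Observation) -> int:
    """Stand-in for a projected final DT: action 1 if obs > 0, else 0."""
    return 1 if obs > 0 else 0


predictors = [sign_predictor, sign_predictor]
dataset = [
    ((1.0, -1.0), (1, 0)),  # both agents predicted correctly
    ((1.0, 1.0), (1, 0)),   # second agent predicted incorrectly
]
kept = filter_by_joint_accuracy(dataset, predictors, threshold=0.75)
```

With a threshold of 0.75, only the first data point survives, so the next split for the current agent's tree is computed on states where the team's projected trees jointly agree with the experts.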

In some mixed cooperative-competitive settings, agents in a team share goals and need to coordinate with each other, but they face other agents or other teams whose goals are not fully aligned with theirs. In these settings, MAVIPER follows a similar procedure to jointly train policies for agents in the same team to ensure coordination. More specifically, for a team $Z$, the Build and Predict functions are constrained to only make predictions for the agents in the same team. Equation 3 now takes the expectation over the joint actions $a_{-Z}$ of the agents outside the team and becomes:

$$p((s, a)) \propto \mathbb{E}_{a_{-Z}}\left[ Q_i^{\pi^*}(s, \pi_Z^*(o_Z), a_{-Z}) - \min_{a_Z} Q_i^{\pi^*}(s, a_Z, a_{-Z}) \right] \qquad (4)$$
Taken together, these changes comprise the MAVIPER algorithm. Because we explicitly account for the anticipated behavior of other agents in both the predictions and the resampling probability, we hypothesize that MAVIPER will better capture coordinated behavior.

4 Experiments

We now investigate how well MAVIPER and IVIPER agents perform in a variety of environments. Because the goal is to learn high-performing yet interpretable policies, we evaluate the quality of the trained policies in three multi-agent environments: two mixed competitive-cooperative environments and one fully cooperative environment. We measure how well the DT policies perform in the environment because our goal is to deploy these policies, not the expert ones.

Since small DTs are considered interpretable, we constrain the maximum tree depth to be at most 6. The expert policies used to guide the DT training are generated by MADDPG [20], for which we use the PyTorch [31] implementation. We compare to two baselines:

  1. Fitted Q-Iteration. We iteratively approximate the Q-function with a regression DT [9]. We discretize states to account for continuous state values; more details are in Appendix B.2. We derive the policy by taking the action associated with the highest estimated Q-value for that input state.

  2. Imitation DT. Each DT policy is directly trained using a dataset collected by running the expert policies for multiple episodes. No resampling is performed. The observations for an agent are the features, and the actions for that agent are the labels.

We detail the hyperparameters and the hyperparameter-selection process in Appendix B.3. We train a high-performing MADDPG expert, then run each DT-learning algorithm multiple times with different random seeds. We evaluate all policies over a fixed number of episodes. Error bars correspond to confidence intervals. Our code is available through our project website.

4.1 Environments

We evaluate our algorithms on three multi-agent particle world environments [20], described below. Episodes terminate when the maximum number of timesteps is reached. We choose the primary performance metric based on the environment (detailed below), and we also provide results using expected return as the performance metric in Appendix C.

Physical Deception.

In this environment, a team of defenders must protect a set of targets from one adversary. One of the targets is the true target, which is known to the defenders but not to the adversary. Our experiments use two defenders and two targets. Defenders succeed during an episode if they split up to cover all of the targets simultaneously; the adversary succeeds if it reaches the true target during the episode. Covering and reaching a target are defined as being within a small distance threshold of the target for at least one timestep during the episode. We use the defenders' and the adversary's success rate as the primary performance metric in this environment.

Cooperative Navigation.

This environment consists of a team of agents who must learn to cover all targets while avoiding collisions with each other. Agents succeed during an episode if they split up to cover all of the targets without colliding. Our primary performance metric is the sum, over all targets, of the distance from each target to its closest agent. Low values of the metric indicate that the agents correctly learn to split up.
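As a small worked example of this metric, the sketch below computes the summed closest-agent distance for one timestep. The function name `coverage_distance` and the two-dimensional positions are illustrative assumptions; how the metric is aggregated over the timesteps of an episode is not specified here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def coverage_distance(agents: Sequence[Point],
                      targets: Sequence[Point]) -> float:
    """Sum over targets of the Euclidean distance to the closest agent."""
    return sum(min(math.dist(agent, target) for agent in agents)
               for target in targets)


# Both agents crowd the first target: the second target is left 3 away.
crowded = coverage_distance([(0.0, 0.0), (0.0, 0.0)],
                            [(0.0, 0.0), (3.0, 0.0)])
# The agents split up, one per target: the metric drops to zero.
split = coverage_distance([(0.0, 0.0), (3.0, 0.0)],
                          [(0.0, 0.0), (3.0, 0.0)])
```

The metric only reaches zero when every target has its own nearby agent, which is why low values indicate that the agents have split up.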


Predator-prey.

This variant involves a team of slower, cooperating predators that chase faster prey. There are landmarks impeding the way. We assume that each agent has a restricted observation space mostly consisting of binarized relative positions and velocity (if applicable) of the landmarks and other agents in the environment. See Appendix B.1 for full details. Our primary performance metric is the number of collisions between predators and prey. For prey, lower is better; for predators, higher is better.

4.2 Results

For each environment, we compare the DT policies generated by different methods and check if IVIPER and MAVIPER agents achieve better performance ratio than the baselines overall. We also investigate whether MAVIPER learns better coordinated behavior than IVIPER. Furthermore, we investigate which algorithms are the most robust to different types of opponents. We conclude with an ablation study to determine which components of the MAVIPER algorithm contribute most to its success.

4.2.1 Individual Performance Compared to Experts

Figure 2: Individual performance ratio: Relative performance when only one agent adopts DT policy and all other agents use expert policy. Panels: (a) Physical Deception, (b) Cooperative Navigation, (c) Predator-prey.

We analyze the performance of the DT policies when only one agent adopts the DT policy while all other agents use the expert policies. Given a DT policy profile $\hat{\pi}$ and the expert policy profile $\pi^*$, if agent $i$ who belongs to team $Z$ uses its DT policy, then the individual performance ratio is defined as $U_Z(\hat{\pi}_i, \pi_{-i}^*) / U_Z(\pi^*)$, where $U_Z(\cdot)$ is team $Z$'s performance given the agents' policy profile (since we define our primary performance metrics at the team level). A performance ratio of 1 means that the DT policies perform as well as the expert ones. We can get a ratio above 1, since we compare the performance of the DT and the expert policies in the environment, not the similarity of the DT and expert policies.

We report the mean individual performance ratio for each team in Figure 2, averaged over all trials and all agents in the team. As shown in Figure 2(a), individual MAVIPER and IVIPER defenders outperform the two baselines for all maximum depths in the physical deception environment. However, MAVIPER and IVIPER adversaries perform similarly to the Imitation DT adversary, indicating that the correct strategy may be simple enough to capture with a less-sophisticated algorithm. Agents also perform similarly on the cooperative navigation environment (Figure 2(b)). As mentioned in the original MADDPG paper [20], this environment has a less stark contrast between success and failure, so these results are not unexpected.

In predator-prey, we see the most notable performance difference when comparing the predator. When the maximum depth is 2, only MAVIPER achieves near-expert performance. When the maximum depths are 4 and 6, MAVIPER and IVIPER agents achieve similar performance and significantly outperform the baselines. The preys achieve similar performance across all algorithms. We suspect that the complexity of this environment makes it challenging to replace even a single prey’s policy with a DT.

Furthermore, MAVIPER achieves a performance ratio above 0.75 in all environments with a maximum depth of 6. The same is true for IVIPER, except for the adversaries in physical deception. That means DT policies generated by IVIPER and MAVIPER lead to a performance degradation of less than 25% compared to the less interpretable NN-based expert policies. These results show that IVIPER and MAVIPER generate reasonable DT policies and outperform the baselines overall when adopted by a single agent.

4.2.2 Joint Performance Compared to Experts

Figure 3: Joint performance ratio: Relative performance when all agents in a team adopt DT policy and other agents use expert policy. Panels: (a) Physical Deception, (b) Cooperative Navigation, (c) Predator-prey.

A crucial aspect in multi-agent environments is agent coordination, especially when agents are on the same team with shared goals. To ensure that the DT policies capture this coordination, we analyze the performance of the DT policies when all agents in a team adopt DT policies, while other agents use expert policies. We define the joint performance ratio as $U_Z(\hat{\pi}_Z, \pi_{-Z}^*) / U_Z(\pi^*)$, where $U_Z(\hat{\pi}_Z, \pi_{-Z}^*)$ is the utility of team $Z$ when using their DT policies against the expert policies of the other agents $-Z$. Figure 3 shows the mean joint performance ratio for each team, averaged over all trials.
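Both the individual and the joint performance ratio reduce to the same computation once the team metric has been measured under each policy profile. The sketch below assumes that team performance is the mean of a per-episode score (for example, a success indicator); the helper names and the episode outcomes are made up for illustration.

```python
from statistics import mean
from typing import Sequence


def team_performance(episode_scores: Sequence[float]) -> float:
    """Team-level metric: mean per-episode score (an assumption)."""
    return mean(episode_scores)


def performance_ratio(dt_scores: Sequence[float],
                      expert_scores: Sequence[float]) -> float:
    """Team performance with DT policies relative to the all-expert profile."""
    expert = team_performance(expert_scores)
    if expert == 0:
        raise ValueError("expert performance is zero; ratio is undefined")
    return team_performance(dt_scores) / expert


# Hypothetical success indicators over four evaluation episodes.
expert_episodes = [1, 1, 1, 1]      # every agent uses its expert policy
individual_episodes = [1, 0, 1, 1]  # one teammate swapped for its DT
joint_episodes = [1, 0, 0, 1]       # whole team swapped for DTs

individual_ratio = performance_ratio(individual_episodes, expert_episodes)
joint_ratio = performance_ratio(joint_episodes, expert_episodes)
```

A gap between the two numbers, as in this toy where the joint ratio is lower, is the kind of signal that the trees imitate each expert well in isolation but fail to coordinate as a team.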

Figure 4: Features used by the two defenders in the physical deception environment. Actual features are the relative positions of that agent and the labeled feature. Darker squares correspond to higher feature importance. MAVIPER defenders most commonly split importance across the two targets.

Figure 3(a) shows that MAVIPER defenders outperform IVIPER and the baselines, indicating that MAVIPER better captures the coordinated behavior necessary to succeed in this environment. Fitted Q-Iteration struggles to achieve coordinated behavior, despite obtaining non-zero success for individual agents, which we suspect is due to poor Q-value estimates. We hypothesize that the superior performance of MAVIPER is partially due to the defender agents correctly splitting their “attention” to the two targets to induce the correct behavior of covering both targets. To investigate this, we inspect the normalized average feature importances of the DT policies for both IVIPER and MAVIPER across trials, as shown in Figure 4. Each of the MAVIPER defenders (top) most commonly focuses on the attributes associated with one of the targets. More specifically, defender 1 focuses on target 2 and defender 2 focuses on target 1. In contrast, both IVIPER defenders (bottom) mostly focus on the attributes associated with the goal target. Not only does this overlap in feature space mean that defenders are unlikely to capture the correct covering behavior, but it also leaves them more vulnerable to an adversary, as it is easier to infer the correct target.

Figure 3(b) shows that MAVIPER agents significantly outperform all other algorithms in the cooperative navigation environment for all maximum depths. IVIPER agents significantly outperform the baselines for one of the maximum depths but achieve similar performance to the Imitation DT for the other maximum depths (where both algorithms significantly outperform the Fitted Q-Iteration baseline). MAVIPER better captures coordinated behavior, even as we increase the complexity of the problem by introducing another cooperating agent.

Figure 3(c) shows that the prey teams trained by IVIPER and MAVIPER outperform the baselines for all maximum depths. The predator teams trained by IVIPER and MAVIPER similarly outperform the baselines for all maximum depths. Also, MAVIPER leads to better performance than IVIPER in two of the settings (prey with depth 2 and predator with depth 4) while having no statistically significant advantage in other settings. Taken together, these results indicate that IVIPER and MAVIPER better capture the coordinated behavior necessary for a team to succeed in different environments, with MAVIPER significantly outperforming IVIPER in several environments.

4.2.3 Robustness to Different Opponents

Environment         Team       MAVIPER      IVIPER       Imitation DT  Fitted Q-Iteration
Physical Deception  Defender   .77 (.01)    .33 (.01)    .24 (.03)     .004 (.00)
Physical Deception  Adversary  .42 (.03)    .41 (.03)    .42 (.03)     .07 (.01)
Predator-prey       Predator   2.51 (0.72)  1.98 (0.58)  1.14 (0.28)   0.26 (0.11)
Predator-prey       Prey       1.76 (0.80)  2.16 (1.24)  2.36 (1.90)   1.11 (0.82)

Table 1: Robustness results. We report mean team performance and standard deviation (in parentheses) of DT policies for each team, averaged across a variety of opponent policies.

We investigate the robustness of the DT policies when a team using DT policies plays against a variety of opponents in the mixed competitive-cooperative environments. For this set of experiments, we use a single fixed maximum depth. Given a DT policy profile $\hat{\pi}$, team $Z$'s performance against an alternative policy profile $\pi'_{-Z}$ used by the opponents is $U_Z(\hat{\pi}_Z, \pi'_{-Z})$. We consider a broad set of opponent policies $\pi'_{-Z}$, including the policies generated by MAVIPER, IVIPER, Imitation DT, Fitted Q-Iteration, and MADDPG. We report the mean team performance averaged over all opponent policies in Table 1. See Tables 3 and 4 in Appendix C for the full results.

For physical deception, MAVIPER defenders outperform all other algorithms, with a gap of .44 between its performance and the next-best algorithm, IVIPER. This result indicates that MAVIPER learns coordinated defender policies that perform well against various adversaries. MAVIPER, IVIPER, and Imitation DT adversaries perform similarly on average, with a similar standard deviation, which supports the idea that the adversary’s desired behavior is simple enough to capture with a less-sophisticated algorithm. For predator-prey, MAVIPER predators and prey outperform all other algorithms. The standard deviation of the performance of all algorithms is high due to this environment’s complexity.

4.2.4 Ablation Study

Figure 5: Ablation study for MAVIPER at a fixed maximum depth. MAVIPER (No Prediction) does not utilize the predicted behavior of the anticipated DTs of the other agents to grow each agent’s tree. MAVIPER (IVIPER Resampling) uses the same resampling method as IVIPER.

As discussed in Section 3.2, MAVIPER improves upon IVIPER with a few critical changes. First, we utilize the predicted behavior of the anticipated DTs of the other agents to grow each agent’s tree. Second, we alter the resampling probability to incorporate the average Q-values over all actions for the other agents. To investigate the contribution of these changes to the performance, we run an ablation study at a fixed maximum depth on the physical deception environment. We report both the mean individual and joint performance ratios for the defender team in Figure 5, comparing MAVIPER and IVIPER to two variants of MAVIPER without one of the two critical changes. Results show that both changes contributed to the improvement of MAVIPER over IVIPER, especially in the joint performance ratio.

5 Related Work

Most work on interpretable RL is in the single-agent setting [26]. We first discuss techniques that directly learn DT policies. CUSTARD [42] extends the action space of the MDP to contain actions for constructing a DT, i.e., choosing a feature to branch a node. Training an agent in the augmented MDP yields a DT policy for the original MDP while still enabling training using any function approximator, like NNs, during training. By redefining the MDP, the learning problem becomes more complex, which is problematic in multi-agent settings where the presence of other agents already complicates the learning problem. A few other works directly learn DT policies [9, 24, 44] for single-agent RL but not for the purpose of interpretability. Further, these works have custom learning algorithms and cannot utilize a high-performing NN policy to guide training.

VIPER [2] is considered to be a post-hoc DT-learning method [2]; however, we use it to produce intrinsically interpretable policies for deployment. MOET [45] extends VIPER by learning a mixture of DT policies trained on different regions of the state space. The resulting policy is a linear combination of multiple trees with non-axis-parallel partitions of the state. We find that the performance difference between VIPER and MOET is not significant enough to increase the complexity of the policy structure, which would sacrifice interpretability.

Despite increased interest in interpretable single-agent RL, interpretable MARL is less commonly explored. One line of work generates explanations from non-interpretable policies. Some work uses attention [14, 17, 29] to select and focus on critical factors that impact agents in the training process. Other work generates verbal explanations with predefined rules [47] or explanations based on Shapley values [12]. The line of work most similar to ours [15] approximates non-interpretable MARL policies with interpretable ones using the framework of abstract argumentation. This work constructs argument preference graphs given manually-provided arguments. In contrast, our work does not need these manually-provided arguments for interpretability. Instead, we generate DT policies.

6 Discussion and Conclusion

We proposed IVIPER and MAVIPER, the first algorithms, to our knowledge, that train interpretable DT policies for MARL. We evaluated these algorithms on both cooperative and mixed competitive-cooperative environments. We showed that they can achieve individual performance of at least 75% of expert performance in most environment settings and over 90% in some of them, given a maximum tree depth of 6. We also empirically validated that MAVIPER effectively captures coordinated behavior by showing that teams of MAVIPER-trained agents outperform the agents trained by IVIPER and several baselines. We further showed that MAVIPER generally produces more robust agents than the other DT-learning algorithms.

Future work includes learning these high-quality DT policies from fewer samples, e.g., by using dataset distillation [46]. We also note that our algorithms can work in some environments where the experts and DTs are trained on different sets of features. Since DTs can be easier to learn with a simpler set of features, future work includes augmenting our algorithm with an automatic feature selection component that constructs simplified yet still interpretable features for training the DT policies.

6.0.1 Acknowledgements

This material is based upon work supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program. This research was sponsored by the U.S. Army Combat Capabilities Development Command Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the funding agencies or government agencies. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.


References

  • [1] P. Abbeel and A. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In ICML, Cited by: §2.3.
  • [2] O. Bastani et al. (2018) Verifiable reinforcement learning via policy extraction. In NeurIPS, Cited by: Appendix 0.A, §1, §2.3, §5.
  • [3] C. Berner et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint 1912.06680. Cited by: §3.1.
  • [4] S. Bhalla et al. (2020) Deep multi agent reinforcement learning for autonomous driving. In Canadian Conf. Artif. Intell., Cited by: §1.
  • [5] M. Brittain and P. Wei (2019) Autonomous air traffic controller: a deep multi-agent reinforcement learning approach. arXiv preprint arXiv:1905.01303. Cited by: §1.
  • [6] C. Buciluǎ et al. (2006) Model compression. In KDD, Cited by: §2.3.
  • [7] Z. Chen et al. (2021) ReLACE: reinforcement learning agent for counterfactual explanations of arbitrary predictive models. arXiv preprint arXiv:2110.11960. Cited by: §2.3.
  • [8] T. Degris et al. (2006) Learning the structure of factored Markov decision processes in reinforcement learning problems. In ICML, Cited by: §1.
  • [9] D. Ernst et al. (2005) Tree-based batch mode reinforcement learning. JMLR 6. Cited by: item 1, §5.
  • [10] J. Foerster et al. (2017) Stabilising experience replay for deep multi-agent reinforcement learning. In ICML, Cited by: §1.
  • [11] J. Foerster et al. (2018) Counterfactual multi-agent policy gradients. In AAAI, Cited by: §2.1.
  • [12] A. Heuillet et al. (2022) Collective explainable ai: explaining cooperative strategies and agent contribution in multiagent reinforcement learning with shapley values. IEEE Comput. Intell. Magazine 17. Cited by: §5.
  • [13] G. Hinton et al. (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.3.
  • [14] S. Iqbal and F. Sha (2019) Actor-attention-critic for multi-agent reinforcement learning. In ICML, Cited by: §5.
  • [15] D. Kazhdan et al. (2020) MARLeME: a multi-agent reinforcement learning model extraction library. In IJCNN, Cited by: §5.
  • [16] S. Li et al. (2019) Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In AAAI, Cited by: §2.1.
  • [17] W. Li et al. (2019) SparseMAAC: sparse attention for multi-agent reinforcement learning. In Int. Conf. Database Syst. for Adv. Appl., Cited by: §5.
  • [18] Z. Lipton (2018) The mythos of model interpretability. ACM Queue 16 (3). Cited by: §1.
  • [19] M. Littman (1994) Markov games as a framework for multi-agent reinforcement learning. In Mach. Learning, Cited by: §2.1.
  • [20] R. Lowe et al. (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275. Cited by: §0.B.1, §0.B.1, §2.1, §4.1, §4.2.1, §4.
  • [21] R. Luss et al. (2022) Local explanations for reinforcement learning. arXiv preprint arXiv:2202.03597. Cited by: §2.3.
  • [22] K. Malialis and D. Kudenko (2015) Distributed response to network intrusions using multiagent reinforcement learning. Eng. Appl. Artif. Intell.. Cited by: §1.
  • [23] L. Matignon et al. (2012) Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. Knowledge Eng. Review 27 (1). Cited by: §3.1.
  • [24] R. McCallum (1997) Reinforcement learning with selective perception and hidden state. PhD Thesis, Univ. Rochester, Dept. of Comp. Sci.. Cited by: §1, §5.
  • [25] Z. Meng et al. (2020) Interpreting deep learning-based networking systems. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, Cited by: §2.3.
  • [26] S. Milani et al. (2022) A survey of explainable reinforcement learning. arXiv preprint arXiv:2202.08434. Cited by: §5.
  • [27] S. Mohanty et al. (2020) Flatland-rl: multi-agent reinforcement learning on trains. arXiv preprint arXiv:2012.05893. Cited by: §1.
  • [28] C. Molnar (2019) Interpretable machine learning. Cited by: §1.
  • [29] Y. Motokawa and T. Sugawara (2021) MAT-dqn: toward interpretable multi-agent deep reinforcement learning for coordinated activities. In ICANN, Cited by: §5.
  • [30] F. Oliehoek et al. (2008) Optimal and approximate q-value functions for decentralized pomdps. JAIR 32. Cited by: §2.1.
  • [31] A. Paszke et al. (2017) Automatic differentiation in pytorch. Cited by: footnote *.
  • [32] L. Pyeatt and A. Howe (2001) Decision tree function approximation in reinforcement learning. In Int. Symp. on Adaptive Syst.: Evol. Comput. and Prob. Graphical Models, Cited by: §1.
  • [33] L. Pyeatt (2003) Reinforcement learning with decision trees. In Appl. Informatics, Cited by: §1.
  • [34] J. Quinlan (1986) Induction of decision trees. Mach. Learning. Cited by: §1.
  • [35] T. Rashid et al. (2018) Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, Cited by: §1, §2.1.
  • [36] S. Ross et al. (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, Cited by: §2.3.
  • [37] A. Roth et al. (2019) Conservative q-improvement: reinforcement learning for an interpretable decision-tree policy. arXiv preprint arXiv:1907.01180. Cited by: §1.
  • [38] L. Shapley (1953) Stochastic games. PNAS 39 (10). Cited by: §2.1.
  • [39] K. Son et al. (2019) Qtran: learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408. Cited by: §2.1.
  • [40] A. Strehl et al. (2007) Efficient structure learning in factored-state mdps. In AAAI, Cited by: §1.
  • [41] P. Sunehag et al. (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296. Cited by: §2.1.
  • [42] N. Topin et al. (2021) Iterative bounding mdps: learning interpretable policies via non-interpretable methods. In AAAI, Cited by: §5.
  • [43] K. Tuyls et al. (2002) Reinforcement learning in large state spaces. In Robot Soccer World Cup, Cited by: §1.
  • [44] W. Uther and M. Veloso (2000) The lumberjack algorithm for learning linked decision forests. In Int. Symp. Abstract., Reformulation, and Approx., Cited by: §5.
  • [45] M. Vasic et al. (2019) Moët: interpretable and verifiable reinforcement learning via mixture of expert trees. arXiv preprint arXiv:1906.06717. Cited by: §5.
  • [46] T. Wang et al. (2018) Dataset distillation. arXiv preprint arXiv:1811.10959. Cited by: §6.
  • [47] X. Wang et al. (2020) Explanation of reinforcement learning model in dynamic multi-agent system. arXiv preprint arXiv:2008.01508. Cited by: §5.
  • [48] C. Yu et al. (2021) The surprising effectiveness of mappo in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955. Cited by: §2.1.

Appendix 0.A Omitted Algorithm

Algorithm 3 shows the full pseudocode for the single-agent version of VIPER [2].

Input: , , , ,

1:Initialize dataset
2:Initialize policy
3:for  to  do
4:     Sample trajectories:
5:     Aggregate dataset
6:     Resample dataset according to loss:
7:     Train decision tree TrainDecisionTree()
8:return Best policy on cross validation
Algorithm 3 VIPER for Single-Agent Setting
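The resampling step (line 6) is what distinguishes VIPER from plain DAgger-style imitation. In the original VIPER formulation [2], each state is weighted by the gap between the expert's value and its worst-case Q-value, so states in which a wrong action is costly are sampled more often. The following is a minimal Python sketch of that step only. It assumes the expert acts greedily with respect to its Q-function, so that the state value equals the maximum Q-value; the function names and the toy data are hypothetical and chosen purely for illustration.

```python
import random


def viper_weights(q_values):
    """Per-state weight: max_a Q(s, a) - min_a Q(s, a).

    Assumption: the expert is greedy w.r.t. Q, so V(s) = max_a Q(s, a).
    """
    return [max(q) - min(q) for q in q_values]


def resample(dataset, q_values, n_samples, rng):
    """Draw (state, action) pairs with probability proportional to their weight."""
    weights = viper_weights(q_values)
    if sum(weights) == 0:
        # Every action is equally good in every state: fall back to uniform sampling.
        weights = [1.0] * len(dataset)
    return rng.choices(dataset, weights=weights, k=n_samples)


# Toy data (hypothetical): three states labeled with the expert's action,
# and the expert's Q-values for two actions in each state.
dataset = [("s0", 0), ("s1", 1), ("s2", 0)]
q_values = [[1.0, 1.0], [5.0, -5.0], [2.0, 1.0]]

rng = random.Random(0)
sample = resample(dataset, q_values, n_samples=1000, rng=rng)
```

In this toy example, state `s0` has identical Q-values for both actions and therefore receives zero weight, while `s1`, where the wrong action is very costly, dominates the resampled dataset that would be passed to the decision-tree learner.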

Appendix 0.B Experimental Details

0.B.1 Environments

For all environments, we use the initialization and reward scheme described in the original MADDPG paper [20] and the PyTorch implementation. The only change we make is to the predator-prey environment, which we describe below.


Predator-prey

We follow the definition of the original environment proposed in the multi-agent particle environment [20], changing only the partial observation provided to each agent. The observations of the adversary and the agents consist of the concatenation of the following vectors:

  1. binarized relative positions and relative velocity (if applicable) of the landmarks and other agents using as the binarizing function

  2. binarized relative distance between all pairs of agents on the other team. If the opponent team has agent , then it will be .

  3. binarized relative distance between all pairs of agents on the same team.

For an environment with , the observation size will be and respectively for the adversary and the agents.

0.B.2 Implementation Details

To optimize running speed, MAVIPER adopts a caching mechanism to avoid training a new decision tree for each data point being predicted. It also parallelizes the Predict function starting from Line 26 by precomputing all of the prediction information upfront: each prediction is delayed until all data points have been looped over. In this way, MAVIPER can gather all of the predictions that a particular tree needs to make and perform them in parallel.
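The idea of delaying predictions and then answering them one batch per tree can be sketched as follows. This is a hypothetical illustration rather than our actual implementation: the `DeferredPredictor` class, its method names, and the stand-in "trees" (plain Python callables that map a batch of observations to a batch of actions) are assumptions made for the example.

```python
from collections import defaultdict


class DeferredPredictor:
    """Collects prediction requests and answers them with one batched call per tree."""

    def __init__(self, trees):
        # trees: dict mapping a tree key to a callable(list of obs) -> list of actions
        self.trees = trees
        self.pending = defaultdict(list)  # tree key -> [(request id, obs), ...]
        self.calls = 0  # number of batched predict calls made so far

    def request(self, request_id, tree_key, obs):
        """Record a prediction to be made later instead of predicting immediately."""
        self.pending[tree_key].append((request_id, obs))

    def flush(self):
        """Run one batched prediction per tree and return {request id: action}."""
        results = {}
        for tree_key, items in self.pending.items():
            ids = [request_id for request_id, _ in items]
            batch = [obs for _, obs in items]
            actions = self.trees[tree_key](batch)
            self.calls += 1
            results.update(zip(ids, actions))
        self.pending.clear()
        return results


# Stand-in "trees" (hypothetical): each thresholds one observation feature.
trees = {
    "agent_0": lambda batch: [int(obs[0] > 0) for obs in batch],
    "agent_1": lambda batch: [int(obs[1] > 0) for obs in batch],
}

predictor = DeferredPredictor(trees)
data = [(-1.0, 2.0), (3.0, -4.0), (0.5, 0.5)]
for i, obs in enumerate(data):
    predictor.request((i, "agent_0"), "agent_0", obs)
    predictor.request((i, "agent_1"), "agent_1", obs)

results = predictor.flush()
```

Six predictions are requested here, but each stand-in tree is invoked only once. With real decision trees, the per-tree batches are independent of one another and could be dispatched to separate workers.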

Since IVIPER is fully decentralized, training of each DT can be performed in parallel.

For Fitted Q-Iteration, we bin the states into 10 (mostly) evenly-spaced bins:

We note that Fitted Q-Iteration may perform better with a better choice of bins; however, choosing the correct bin values requires either domain knowledge or extensive manual tuning to find the right balance between granularity, number of timesteps for training, and performance.
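To illustrate the kind of state discretization described above for Fitted Q-Iteration, the sketch below maps each continuous state feature to one of 10 evenly spaced bins. The value range `[-1.0, 1.0]`, the clipping of out-of-range values, and the helper names are assumptions made for the example; they are not the bin edges used in our experiments.

```python
def make_bin_index(low, high, n_bins=10):
    """Return a function mapping a scalar to a bin index in [0, n_bins - 1]."""
    width = (high - low) / n_bins

    def bin_index(x):
        # Values outside [low, high] are clipped to the first or last bin.
        if x <= low:
            return 0
        if x >= high:
            return n_bins - 1
        return min(int((x - low) / width), n_bins - 1)

    return bin_index


def discretize(state, bin_index):
    """Discretize every feature of a state; the result is hashable."""
    return tuple(bin_index(x) for x in state)


bin_index = make_bin_index(-1.0, 1.0)  # hypothetical feature range
key = discretize((-0.95, 0.1, 0.42, 3.0), bin_index)
```

The resulting tuple can serve as a key into a tabular Q-function. Using more bins gives a finer representation at the cost of more table entries to visit during training, which is the trade-off noted above.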

0.B.3 Hyperparameters

We vary the hyperparameters that would impact training performance of these algorithms 2-3 times and choose the hyperparameters that yield the agents with the best performance. For all environments, we vary the number of rollouts to be and the number of iterations to be , while the threshold is fixed at . For the baselines, we also vary the maximum number of samples used for training each agent between . For Imitation DT, we did not see much of a performance increase between and samples, so we pick the maximum value for fairness of comparison. For Fitted Q-Iteration, we also did not see much performance increase after samples, so we chose samples due to time constraints. Table 2 shows the values of the hyperparameters that are utilized by all algorithms. Although we set a maximum number of training iterations for MAVIPER, we stop training early when there is no noticeable performance gain to further improve runtime.

Algorithm           Environment    Max Training Iterations  Number of Rollouts  Threshold  Max Samples
IVIPER              Physical       50                       50                  N/A        300,000
IVIPER              Cooperative    100                      50                  N/A        300,000
IVIPER              Predator-prey  100                      100                 N/A        300,000
MAVIPER             All            100                      50                             300,000
Imitation DT        All            20                       N/A                 N/A        100,000
Fitted Q-Iteration  All            10                       N/A                 N/A        30,000
Table 2: Hyperparameter values used for all algorithms.

Appendix 0.C Additional Results

In Sections 0.C.2 and 0.C.1, we further present results using the defined environment reward. This reward is not as intuitive as the primary metric for many of these environments, but we present the results here for the sake of completeness. The individual and joint performance ratios are defined in the same way as in Section 4.2 in the main body of the paper, with one caveat. Since we are now measuring reward, which may be negative or positive, we take to report how much more or less is than .

0.C.1 Individual Performance

(a) Physical Deception
(b) Cooperative Navigation
(c) Predator-prey
Figure 6: Individual performance ratio measured by reward. Individual performance ratio of DT agents compared to expert agents for different maximum depths. Higher is better. Error bars represent the 95% confidence interval.

Figure 6 shows the individual performance ratio measured by reward on all three environments.

Physical Deception

Results for physical deception are shown in Figure 6(a). Interestingly, MAVIPER and IVIPER defenders only significantly outperform both baselines when the maximum depth is 2. When the maximum depth is 4, MAVIPER, IVIPER, and Imitation DT perform similarly, with Fitted Q-Iteration barely reaching above 0.50 for all depths. When the maximum depth is 6, Imitation DT actually achieves the highest defender reward, significantly outperforming all other algorithms. However, as shown in Figure 2 in the main body, MAVIPER significantly outperforms all algorithms for all maximum depths when reporting the success ratio. This is because the reward is in part dependent on the performance of the adversary. In other words, a poorly-performing defender can achieve similar performance to a high-performing defender if the poorly-performing defender is paired with a high-performing adversary. We see a similar pattern for the adversary performance: IVIPER, MAVIPER, and Imitation DT all significantly outperform the Fitted Q-Iteration baseline for different maximum depths. We note that MAVIPER tends to perform better for lower maximum depths, which is desirable for interpretability.

Cooperative Navigation

Results for cooperative navigation are shown in Figure 6(b). IVIPER and MAVIPER significantly outperform the Fitted Q-Iteration baseline for all maximum depths. However, MAVIPER only significantly outperforms Imitation DT when the maximum depth is 2. Otherwise, IVIPER, MAVIPER, and Imitation DT all perform similarly.


Predator-prey

Results for predator-prey are shown in Figure 6(c). MAVIPER prey significantly outperform all other algorithms for maximum depths of 2 and 6. For a maximum depth of 4, MAVIPER and IVIPER both significantly outperform the baselines. In contrast, MAVIPER predators only significantly outperform all other algorithms for a maximum depth of 4. For a maximum depth of 2, MAVIPER and IVIPER significantly outperform the baselines. For a maximum depth of 6, MAVIPER, IVIPER, and Imitation DT significantly outperform Fitted Q-Iteration. Again, we note that this performance is not necessarily reflected in the results using the collision metric in Figure 2(c).

0.C.2 Joint Performance

(a) Physical Deception
(b) Cooperative Navigation
(c) Predator-prey
Figure 7: Joint performance ratio, measured by reward, of DT agents compared to expert agents for different maximum depths. DT agents are evaluated jointly. Error bars represent the 95% confidence interval. Higher is better.

Figure 7 shows the joint performance ratio measured by reward on all three environments.

Physical Deception

Figure 7(a) shows the results on physical deception. MAVIPER defenders significantly outperform all other algorithms on this environment for maximum depths of 2 and 4. For a maximum depth of 6, IVIPER, MAVIPER, and Imitation DT all perform similarly. Note that MAVIPER again achieves good performance for lower maximum depths, demonstrating its promise as an algorithm for producing interpretable policies. We also note that, again, the reward metric is somewhat deceptive: when measuring the success conditions in the environment (as in Figure 3 in the main body of the paper), MAVIPER significantly outperforms all other algorithms for all maximum depths.

Cooperative Navigation

Figure 7(b) depicts the joint agent performance on the cooperative navigation environment. Interestingly, we see that MAVIPER significantly outperforms all other algorithms for all maximum depths, despite obtaining similar individual performance. This indicates that MAVIPER captures the desired coordinated behavior better than all of the other algorithms.


Predator-prey

Figure 7(c) shows the results for the predator-prey environment. MAVIPER prey significantly outperform other algorithms for a maximum depth of 2. For maximum depths of 4 and 6, they achieve slightly better (but not statistically significant) performance than IVIPER, and significantly outperform the two baselines. MAVIPER and IVIPER predators enjoy similar performance for maximum depths of 2 and 6. For a maximum depth of 4, MAVIPER significantly outperforms all other algorithms. Note that the correct behavior in this environment is challenging to capture with a small decision tree, as the number of features is either or , depending on the agent type.

0.C.3 Robustness to Different Opponents

We present the full robustness results for the predator-prey and physical deception environments. For space reasons, we only report the average over the trials; however, we only label the best-performing agent of each type in either red or blue if the 95% confidence intervals do not overlap, unless otherwise mentioned. We exclude MADDPG from this calculation, since we know that MADDPG agents will outperform all other agent types, and we are mostly interested in how well the decision tree policies perform.
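The labeling rule above can be made concrete with a small sketch. We do not state here how the 95% confidence intervals are computed, so the normal-approximation interval used below (mean plus or minus 1.96 standard errors), the function names, and the toy scores are all assumptions made for illustration.

```python
import math
import statistics


def ci95(samples):
    """Normal-approximation 95% confidence interval for the mean (an assumption)."""
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def best_if_separated(results, higher_is_better=True):
    """Return the best agent type, or None if its interval overlaps another's.

    results: dict mapping an agent type to its list of per-trial scores.
    """
    means = {name: statistics.mean(scores) for name, scores in results.items()}
    pick = max if higher_is_better else min
    best = pick(means, key=means.get)
    best_ci = ci95(results[best])
    for name, scores in results.items():
        if name != best and intervals_overlap(best_ci, ci95(scores)):
            return None
    return best


# Hypothetical per-trial scores for two agent types "A" and "B".
clear = {"A": [3.0, 3.1, 2.9, 3.0], "B": [2.0, 2.1, 1.9, 2.0]}
tied = {"A": [3.0, 2.0, 4.0, 3.0], "B": [2.9, 2.0, 3.9, 3.0]}

label_clear = best_if_separated(clear)
label_tied = best_if_separated(tied)
```

In the first toy comparison the intervals are well separated, so "A" would be labeled as the best-performing agent type; in the second, the intervals overlap and no agent type would be labeled.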

                    Prey
Predator            MAVIPER       IVIPER        Imitation DT  Fitted Q-Iteration  MADDPG
MAVIPER             (2.28, 2.28)  (3.49, 3.49)  (2.41, 2.41)  (3.01, 3.01)        (1.37, 1.37)
IVIPER              (1.95, 1.95)  (2.46, 2.46)  (2.17, 2.17)  (2.44, 2.44)        (0.88, 0.88)
Imitation DT        (1.32, 1.32)  (1.17, 1.17)  (1.18, 1.18)  (1.40, 1.40)        (0.61, 0.61)
Fitted Q-Iteration  (0.46, 0.46)  (0.30, 0.30)  (0.24, 0.24)  (0.18, 0.18)        (0.14, 0.14)
MADDPG              (2.78, 2.78)  (3.36, 3.36)  (5.82, 5.82)  (4.98, 4.98)        (2.54, 2.54)
Table 3: Robustness results of DT agents on predator-prey. Results are presented as: average number of touches in an episode. Higher is better for predator, and lower is better for prey. Excluding MADDPG, the best-performing prey (lowest in value) for each predator type is in blue and the best-performing predator (highest in value) for each prey type is in red.

MAVIPER predators are strictly more robust than all other agents (except MADDPG) to different types of prey. MAVIPER prey are the most or second most robust to different types of predators. In this environment, predator coordination is more critical, as predators must strategically catch the prey. The prey, on the other hand, does not require much coordination, which explains the robustness of the Imitation DT prey: it simply imitates the actions of the single-agent expert.

                    Defender
Adversary           MAVIPER     IVIPER      Imitation DT  Fitted Q-Iteration  MADDPG
MAVIPER             (.42, .76)  (.45, .33)  (.45, .23)    (.37, .01)          (.40, .93)
IVIPER              (.39, .78)  (.45, .32)  (.40, .23)    (.38, .00)          (.43, .92)
Imitation DT        (.40, .79)  (.42, .34)  (.46, .26)    (.38, .01)          (.46, .92)
Fitted Q-Iteration  (.07, .77)  (.06, .33)  (.07, .19)    (.08, .00)          (.08, .79)
MADDPG              (.71, .76)  (.77, .32)  (.77, .26)    (.58, .00)          (.62, .90)
Table 4: Robustness results of DT agents on physical deception. Results are presented as: (adversary success ratio, defender success ratio). Higher is better. Excluding MADDPG, the best-performing defender for each adversary type is in blue and the best-performing adversary for each defender type is in red.
Physical Deception

MAVIPER defenders are more robust than all other agents (except MADDPG) to different types of adversaries. Interestingly, MAVIPER, IVIPER, and Imitation DT adversaries all perform similarly. Indeed, their performance is often not statistically significantly different from one another, as measured by the 95% confidence interval. However, we still highlight the best-performing adversary in red to more easily show the attained performance. Note that MADDPG adversaries can occasionally achieve success greater than around , which means that these adversaries can take advantage of some information about the defenders to correctly choose the target to visit. In contrast, adversaries trained with any of the DT-learning algorithms never achieve greater than , which indicates that they may need a more complex representation to capture important details about the defenders.

0.C.4 Exploitability

Figure 8: Exploitability of DT defenders and adversaries in the physical deception environment. Lower exploitability is better.

In this set of experiments, we evaluate the exploitability of DT policies in the physical deception environment. Formally, we define the exploitability of a team as:


where the optimal is the best response policy profile to team ’s policies. Practically, to measure exploitability, we fix the policies of the agents in team under evaluation and calculate its approximate best response by training a new neural network policy to convergence using MADDPG.

We evaluate the exploitability of the adversary and defenders in the physical deception environment and report the results in Figure 8. For the defending team, MAVIPER exhibits the lowest exploitability at all three depths, showing its effectiveness in learning coordinated policies. It is worth noting that in such a multi-agent learning setting, Imitation DT no longer performs well due to the complexity of the expert and the necessity of cooperation between agents. For the adversary, Imitation DT performs well, while IVIPER and MAVIPER perform similarly as the second best. This may be because imitation learning quickly imitates a near-optimal expert adversary starting from a depth of four. Since the adversary consists of only a single agent, MAVIPER and IVIPER reduce to the same method.