While leveraging neural networks for learning state representations has enabled deep reinforcement learning (DRL) agents to learn policies for tasks with large state spaces, the policy decisions made by these agents are not interpretable, which hinders their use in safety-critical applications.
Some recent works leverage programs and decision trees as representations for interpreting the learned agent policies. Pirl Verma et al. (2018) uses program synthesis techniques to automatically generate a program in a domain-specific language (DSL) that is close to the DRL agent policy. Designing the DSL with the desired operators is a tedious manual effort, and the enumerative search algorithm used for synthesis is difficult to scale to larger programs. In contrast, Viper Bastani et al. (2018) learns a decision tree (DT) to interpret the DRL agent policy, which not only provides a general representation for different policies, but also allows these policies to be verified using integer linear programming solvers.
Viper uses the DAgger Ross et al. (2011) imitation learning approach to collect state-action pairs for training the student DT policy given the teacher DRL policy. It modifies the DAgger algorithm to also take into account the Q-function of the teacher policy in order to prioritize states of critical importance during learning. However, learning a single DT for the complete policy leads to some key shortcomings: i) a less faithful representation of the original agent policy, as measured by the number of mispredictions, ii) lower overall performance (reward), and iii) larger DTs that are harder to interpret.
In this paper, we present MoËT (Mixture of Expert Trees), a technique based on mixture of experts (MoE) Jacobs et al. (1991); Jordan and Xu (1995); Yuksel et al. (2012), and reformulate its learning procedure to support DT experts. MoE models can typically use any expert as long as it is a differentiable function of model parameters, which unfortunately does not hold for DTs. Similar to MoE training with the EM algorithm, we first observe that MoËT can be trained by alternately optimizing the weighted log likelihood for experts (independently from one another) and optimizing the gating function with respect to the obtained experts. Then, we propose a procedure for DT learning in the specific context of MoE. To the best of our knowledge, we are the first to combine standard non-differentiable DT experts, which are interpretable, with the MoE model. Existing combinations, which rely on differentiable tree or tree-like models such as soft decision trees Irsoy et al. (2012) and hierarchical mixture of experts Zhao et al. (2019), are not interpretable.
We adapt the imitation learning technique of Viper to use MoËT policies instead of DTs. MoËT creates multiple local DTs that specialize on different regions of the input space, allowing for simpler (shallower) DTs that more accurately mimic the DRL agent policy within their regions, and combines the local trees into a global policy using a gating function. We use a simple and interpretable linear model with a softmax function as the gating function, which returns a distribution over DT experts for each point in the input space. While standard MoE uses this distribution to average the predictions of DTs, we also consider selecting just the single most likely expert tree to improve interpretability. While decision boundaries of Viper DT policies must be axis-perpendicular, the softmax gating function supports boundaries with hyperplanes of arbitrary orientations, allowing MoËT to more faithfully represent the original policy.
We evaluate our technique on four different environments: CartPole, Pong, Acrobot, and Mountaincar. We show that MoËT consistently achieves better reward and lower misprediction rate with shallower trees. We also visualize the Viper and MoËT policies for Mountaincar, demonstrating the differences in their learning capabilities. Finally, we demonstrate how a MoËT policy can be translated into an SMT formula and show an example translation for verifying properties for CartPole game using the Z3 theorem prover De Moura and Bjørner (2008) under similar assumptions made in Viper.
In summary, this paper makes the following key contributions: 1) We propose MoËT, a technique based on MoE to learn a mixture of expert decision trees, and present a learning algorithm to train MoËT models. 2) We use MoËT models with a softmax gating function for interpreting DRL policies, and adapt the imitation learning approach used in Viper to learn MoËT models. 3) We evaluate MoËT on different environments and show that it leads to smaller, more faithful, and more performant representations of DRL agent policies compared to Viper while preserving verifiability.
2 Related Work
Imitation Learning. Imitation learning generates labeled data using an existing teacher policy and trains a student policy in a supervised manner. Imitation learning using only trajectories observed by a teacher leads to high error that grows quadratically Ross et al. (2011) in the number of decision steps. Ross et al. (2011) proposed DAgger (Dataset Aggregation) to solve this issue: intermediate student policies are also used for sampling trajectories, while data is always labeled using the teacher. Viper modifies the DAgger algorithm to prioritize states of critical importance (measured by the difference in Q-values of available actions), which leads to smaller decision trees. We follow a similar imitation learning approach, but change the model used for learning student policies.
Explainable Machine Learning. There has been a lot of recent interest in explaining decisions of black-box models Guidotti et al. (2018a); Doshi-Velez and Kim (2017). For image classification, activation maximization techniques can be used to sample representative input patterns Erhan et al. (2009); Olah et al. (2017). TCAV Kim et al. (2017) uses human-friendly high-level concepts and quantifies their importance to the decision. Some recent works also generate contrastive robust explanations to help users understand a classifier decision based on a family of neighboring inputs Zhang et al. (2018); Dhurandhar et al. (2018). LORE Guidotti et al. (2018b) explains the behavior of a black-box model around an input of interest by sampling the black-box model in the neighborhood of the input, and training a local DT over the sampled points. Our model presents an approach that combines local trees into a global policy.
Tree-Structured Models. Irsoy et al. (2012) propose a novel decision tree architecture with soft decisions at the internal nodes, where both children are chosen with probabilities given by a sigmoid gating function. Similarly, the binary tree-structured hierarchical routing mixture of experts (HRME) model, which has classifiers as non-leaf node experts and simple regression models as leaf node experts, was proposed in Zhao et al. (2019). Unfortunately, neither model is interpretable.
In this section we describe two relevant methods that we build upon: (1) Viper, an approach for interpretable imitation learning, and (2) the MoE learning framework.
Viper. Viper (Algorithm 1) is an instance of the DAgger imitation learning approach, adapted to prioritize critical states based on Q-values. Inputs to the Viper training algorithm are (1) an environment $e$, which is a finite-horizon ($T$-step) MDP $(S, A, P, R)$ with states $S$, actions $A$, transition probabilities $P$, and rewards $R$; (2) a teacher policy $\pi_t$; (3) its Q-function $Q^{\pi_t}$; and (4) the number of training iterations $N$. The distribution of states after $T$ steps in environment $e$ using a policy $\pi$ is $d^{(\pi)}(e)$ (assuming a randomly chosen initial state). Viper uses the teacher as an oracle to label the data (states with actions). It initially uses the teacher policy to sample trajectories (states) to train a student (DT) policy. It then uses the student policy to generate more trajectories. Viper samples training points from the collected dataset $D$, giving priority to states $s$ with higher importance $I_s$, where $I_s = \max_{a \in A} Q^{\pi_t}(s, a) - \min_{a \in A} Q^{\pi_t}(s, a)$. This sampling of states leads to faster learning of the optimal policy and to shallower DTs. The process of sampling trajectories and training students is repeated for $N$ iterations, and the best student policy is chosen using reward as the criterion.
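The prioritized resampling step described above can be sketched in a few lines. This is a minimal illustration rather than the Viper implementation: it assumes the teacher's Q-values are available as one list per state, takes the importance of a state to be the gap between its largest and smallest Q-value, and uses the hypothetical helper names `importance` and `resample`.

```python
import random

def importance(q_values):
    """Gap between the best and worst action value in one state (assumed form)."""
    return max(q_values) - min(q_values)

def resample(dataset, q_table, k, rng=random):
    """Draw k (state, action) pairs with probability proportional to importance.

    dataset: list of (state, teacher_action) pairs; states must be hashable
    q_table: dict mapping state -> list of teacher Q-values, one per action
    """
    weights = [importance(q_table[s]) for s, _ in dataset]
    if sum(weights) == 0:
        # Assumption: if no state stands out, fall back to uniform sampling.
        weights = [1.0] * len(dataset)
    return rng.choices(dataset, weights=weights, k=k)
```

States in which all actions have nearly the same value are rarely drawn, so the student spends its capacity on states where a wrong action is costly.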
Mixture of Experts. MoE is an ensemble model Jacobs et al. (1991); Jordan and Xu (1995); Yuksel et al. (2012) that consists of expert networks and a gating function. The gating function divides the input (feature) space into regions for which different experts are specialized and responsible. MoE is flexible with respect to the choice of expert models as long as they are differentiable functions of model parameters (which is not the case for DTs).
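In the MoE framework, the probability of outputting $y$ given an input $x$ is given by:

$$P(y \mid x, \theta) = \sum_{i=1}^{E} P(i \mid x, \theta_g)\, P(y \mid x, \theta_i) = \sum_{i=1}^{E} g_i(x, \theta_g)\, P(y \mid x, \theta_i) \qquad (1)$$

where $E$ is the number of experts, $g_i(x, \theta_g)$ is the probability of choosing expert $i$ (given input $x$), and $P(y \mid x, \theta_i)$ is the probability of expert $i$ producing output $y$ (given input $x$). The learnable parameters are $\theta = (\theta_g, \theta_e)$, where $\theta_g$ are the parameters of the gating function and $\theta_e = (\theta_1, \dots, \theta_E)$ are the parameters of the experts. The gating function can be modeled using a softmax function over a set of linear models. Let $\theta_g$ consist of parameter vectors $(\theta_{g1}, \dots, \theta_{gE})$; then the gating function can be defined as $g_i(x, \theta_g) = \exp(\theta_{gi}^{\top} x) / \sum_{j=1}^{E} \exp(\theta_{gj}^{\top} x)$.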
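A softmax gating function over linear models can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `gating` is hypothetical, each expert has one weight vector, and a bias term, if desired, is assumed to be handled by appending a constant feature to the input.

```python
import math

def gating(x, theta_g):
    """Softmax over linear scores; theta_g holds one weight vector per expert."""
    scores = [sum(w * xi for w, xi in zip(theta, x)) for theta in theta_g]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned list is a probability distribution over experts for the given input, so its entries are positive and sum to one.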
In the case of classification, an expert $i$ outputs a vector $y_i$ of length $C$, where $C$ is the number of classes. Expert $i$ associates a probability with each output class $c$ (given by $y_{ic}$) using a softmax function. The final probability of a class $c$ is the gate-weighted sum of $y_{ic}$ over all experts $i \in \{1, \dots, E\}$, that is, $\bar{y}_c = \sum_{i=1}^{E} g_i(x, \theta_g)\, y_{ic}$. This creates a probability vector $\bar{y} = (\bar{y}_1, \dots, \bar{y}_C)$, and the output of MoE is $\arg\max_c \bar{y}_c$.
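MoE is commonly trained using the EM algorithm, where instead of directly optimizing the likelihood one optimizes an auxiliary function $Q$ defined in the following way. Let $z$ denote the expert chosen for instance $x$. Then the joint likelihood of $x$ and $z$ can be considered. Since $z$ is not observed in the data, the log likelihood of samples $(x, z, y)$ cannot be computed, but the expected log likelihood can be considered instead, where the expectation is taken over $z$. Since the expectation has to rely on some distribution of $z$, in the iterative process the distribution with respect to the current estimate of the parameters $\theta$ is used. More precisely, the function $Q$ is defined by Jordan and Xu (1995):

$$Q(\theta, \theta^{(k)}) = \mathbb{E}_{z}\left[\log P(x, z, y) \mid x, y, \theta^{(k)}\right] \qquad (2)$$

where $\theta^{(k)}$ is the estimate of the parameters $\theta$ in iteration $k$. Then, for a specific sample $D = \{(x_i, y_i) \mid i = 1, \dots, N\}$, the following formula can be derived Jordan and Xu (1995):

$$Q(\theta, \theta^{(k)}) = \sum_{i=1}^{N} \sum_{j=1}^{E} h_{ij}^{(k)} \log g_j(x_i, \theta_g) + \sum_{i=1}^{N} \sum_{j=1}^{E} h_{ij}^{(k)} \log P(y_i \mid x_i, \theta_j) \qquad (3)$$

where it holds

$$h_{ij}^{(k)} = \frac{g_j(x_i, \theta_g^{(k)})\, P(y_i \mid x_i, \theta_j^{(k)})}{\sum_{l=1}^{E} g_l(x_i, \theta_g^{(k)})\, P(y_i \mid x_i, \theta_l^{(k)})} \qquad (4)$$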
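The posterior weights that enter this auxiliary function can be illustrated for a single sample. The sketch below assumes the standard EM form for mixtures of experts, in which the weight of an expert is its gating probability times the likelihood it assigns to the observed output, normalized over all experts. The function name `responsibilities` and the zero-likelihood fallback are assumptions made for the example.

```python
def responsibilities(gate_probs, expert_likelihoods):
    """Posterior weight of each expert for one sample.

    gate_probs[j]         : gating probability of expert j under current parameters
    expert_likelihoods[j] : probability expert j assigns to the observed output
    """
    joint = [g * p for g, p in zip(gate_probs, expert_likelihoods)]
    total = sum(joint)
    if total == 0.0:
        # Assumption: if no expert explains the output, fall back to the gate.
        return list(gate_probs)
    return [v / total for v in joint]
```

For example, with equal gating probabilities and likelihoods 0.9 and 0.1, the first expert receives nine times the weight of the second.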
4 Mixture of Expert Trees
In this section we explain the adaptation of the original MoE model to a mixture of decision trees, and present both training and inference algorithms.
Considering that the coefficients $h_{ij}^{(k)}$ (Eq. 4) are fixed with respect to $\theta$, and that in Eq. 3 the gating part (first double sum) and each expert part depend on disjoint subsets of the parameters $\theta$, training can be carried out by alternately optimizing the weighted log likelihood for experts (independently from one another) and optimizing the gating function with respect to the obtained experts. The training procedure for MoËT, described by Algorithm 2, is based on this observation. First, the parameters of the gating function are randomly initialized (line 2). Then the experts are trained one by one. Each expert is trained on a dataset of instances weighted by the gating function value for that expert (line 5), by applying a specific DT learning algorithm (line 6) that we adapted for the MoE context (described below). After the experts are trained, the gating function is optimized (line 7) by maximizing the gating part of Eq. 3. At the end, the parameters are returned (line 8).
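The alternation described above can be sketched as a small higher-order function. This is a structural sketch only, not a transcription of Algorithm 2: the tree learner, the gate, and the gate optimizer are passed in as callables with hypothetical names, and the outer loop with a fixed number of `iterations` is an assumption.

```python
def train_moet(X, y, n_experts, init_gate, gate, fit_tree, fit_gate, iterations=10):
    """Alternate between fitting the experts and refitting the gate.

    init_gate(n_experts)              -> initial gating parameters
    gate(x, theta_g)                  -> list of n_experts gating probabilities
    fit_tree(X, y, weights)           -> one trained expert
    fit_gate(X, y, experts, theta_g)  -> updated gating parameters
    """
    theta_g = init_gate(n_experts)
    experts = []
    for _ in range(iterations):
        gates = [gate(x, theta_g) for x in X]
        # Each expert sees every instance, weighted by its own gating value.
        experts = [
            fit_tree(X, y, [g[i] for g in gates])
            for i in range(n_experts)
        ]
        # With the experts fixed, update only the gating parameters.
        theta_g = fit_gate(X, y, experts, theta_g)
    return theta_g, experts
```

Keeping the learners as parameters makes it explicit that the experts never need gradients: only `fit_gate` has to optimize a differentiable objective.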
In order to complete this algorithm description, we propose the following tree learning procedure. Our technique modifies the original MoE algorithm in that it uses DTs as experts. The fundamental difference with respect to the traditional model comes from the fact that DTs do not rely on an explicit and differentiable loss function that can be optimized by gradient descent or Newton's methods. Instead, due to their discrete structure, they rely on a specific greedy training procedure. Therefore, the training of DTs has to be modified in order to take into account the weights that the gating function gives to each instance. If the gating were hard, meaning that each instance is assigned to strictly one expert, such weighting would result in partitioning the feature space into disjoint regions belonging to different experts. For soft gating, we consider the weighting as fractionally distributing each instance to different experts. The higher the association of an instance $x_i$ and an expert $j$, reflected by the value of the gating function $g_j(x_i, \theta_g)$, the higher the influence of that instance on that expert's training. In order to formulate this principle, we consider the ways in which an instance influences the construction of a tree. First, it affects the impurity measure computed when splitting the nodes, and second, it influences the probability estimates in the leaves of the tree. We address these two issues next.
A commonly used impurity measure to determine splits in the tree is the Gini index. Let $U$ be the set of indices of instances assigned to the node for which the split is being computed, and $D_U$ the set of corresponding instances. Let the categorical outcomes of $y$ be $1, \dots, C$, and for $k = 1, \dots, C$ let $p_k$ denote the fraction of assigned instances for which $y = k$ holds. More formally:

$$p_k = \frac{\sum_{i \in U} I[y_i = k]}{|U|} \qquad (5)$$

where $I$ denotes the indicator function of its argument expression and equals $1$ if the expression is true. Then the Gini index $G$ of the set $D_U$ is defined by $G(p_1, \dots, p_C) = 1 - \sum_{k=1}^{C} p_k^2$. Considering that the assignments of instances to experts are fractional, as defined by the gating function $g$, this definition has to be modified so that the instances assigned to the node are not counted; instead, their weights are summed. Hence, for expert $j$ we propose the following definition:

$$\hat{p}_k = \frac{\sum_{i \in U} g_j(x_i, \theta_g)\, I[y_i = k]}{\sum_{i \in U} g_j(x_i, \theta_g)} \qquad (6)$$

and compute the Gini index for the set $D_U$ as $G(\hat{p}_1, \dots, \hat{p}_C)$. A similar modification can be performed for other impurity measures relying on the distribution of outcomes of a categorical variable, like entropy. Note that while the instance assignments to experts are soft, instance assignments to nodes within an expert are hard (meaning that the sets of instances assigned to different nodes are disjoint), since splitting is based on the values of the variables, not on the values of the gating function.
The probability estimate for $y$ in a leaf node is usually obtained by computing the fractions of instances belonging to each class. In our case, the modification is the same as the one presented in Eq. 6. That way, the estimates of the probabilities $P(y \mid x, \theta_j)$ needed by MoE are defined. In Algorithm 2, the tree-fitting function performs decision tree training using the above modifications.
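The weighted impurity and leaf estimates can be sketched as follows. This is a minimal illustration, assuming class labels are integers in `0..n_classes-1` and that `weights` holds the gating values of one expert for the instances in a node; the function names are hypothetical.

```python
def weighted_class_fractions(labels, weights, n_classes):
    """Weight-normalized class fractions for the instances in one node."""
    total = sum(weights)
    if total == 0.0:
        raise ValueError("node has zero total weight")
    sums = [0.0] * n_classes
    for label, w in zip(labels, weights):
        sums[label] += w  # sum weights instead of counting instances
    return [s / total for s in sums]

def weighted_gini(labels, weights, n_classes):
    """Gini index computed from weighted fractions instead of counts."""
    fractions = weighted_class_fractions(labels, weights, n_classes)
    return 1.0 - sum(p * p for p in fractions)
```

With all weights equal, both functions reduce to the usual count-based definitions. The same fractions can serve as the class-probability estimates stored in a leaf.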
We consider two ways to perform inference with the obtained model. The first, which we call MoËT, is performed by maximizing $P(y \mid x, \theta)$ with respect to $y$, where this probability is defined by Eq. 1. The second, which we call MoËTh, performs inference as $\arg\max_y P(y \mid x, \theta_{\arg\max_j g_j(x, \theta_g)})$, meaning that we rely only on the most probable expert.
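The two inference modes can be contrasted in a short sketch. It assumes that each expert is a callable returning a class-probability vector and that the gate is a softmax over linear scores; `predict_moet` and `predict_moet_h` are hypothetical names for the soft and hard variants.

```python
import math

def softmax_gate(x, theta_g):
    """Softmax over one linear score per expert."""
    scores = [sum(w * xi for w, xi in zip(theta, x)) for theta in theta_g]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def predict_moet(x, theta_g, experts):
    """Soft inference: argmax of the gate-weighted mixture of expert outputs."""
    g = softmax_gate(x, theta_g)
    outputs = [expert(x) for expert in experts]  # class-probability vectors
    n_classes = len(outputs[0])
    mixture = [sum(g[i] * outputs[i][c] for i in range(len(experts)))
               for c in range(n_classes)]
    return argmax(mixture)

def predict_moet_h(x, theta_g, experts):
    """Hard inference: consult only the most probable expert."""
    g = softmax_gate(x, theta_g)
    return argmax(experts[argmax(g)](x))
```

The two modes can disagree when a less probable expert is very confident: the mixture may then favor its class, while the hard variant follows the single selected tree.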
Expressiveness and interpretability. Standard decision trees used by Viper are easily interpretable, but they make their decisions by partitioning the feature space into regions whose borders are perpendicular to coordinate axes. In order to approximate borders that are not perpendicular to coordinate axes, very deep trees are usually necessary. MoËTh mitigates this shortcoming by exploiting hard softmax partitioning of the feature space using borders which are still hyperplanes, but need not be perpendicular to coordinate axes. This in turn improves the expressiveness while still maintaining interpretability: first, the gating function is interpretable, as it is implemented by a linear model whose hyperplane decision boundaries are easily computable from the model parameters; second, MoËTh uses a single DT for inference (instead of a weighted average). In addition, we also show a technique to translate a MoËT policy to a logical formula for analysis and verification using Z3.
In this section we present evaluation results comparing performance of MoËT and Viper on four OpenAI Gym environments: CartPole, Pong, Acrobot and Mountaincar (brief environment description provided in supplementary material). For CartPole, we use policy gradient model used in Viper, for other environments we use a *dqn network Mnih et al. (2015) (parameters used for training are provided in supplementary material). The rewards obtained by the agents on CartPole, Pong, Acrobot and Mountaincar are , , and , respectively (higher reward is better). Rewards are averaged across runs ( in CartPole).
Comparison of MoËT, MoËTh, and Viper policies. For CartPole, Acrobot, and Mountaincar environments, we train Viper *dts with maximum depths of , while in the case of Pong we use maximum depths of as the problem is more complex and requires deeper trees. For experts in MoËT policies we use the same maximum depths as in Viper and we train the policies for to experts (in case of Pong we train for experts). We train all policies using iterations of Viper algorithm, and choose the best performing policy in terms of rewards (and lower misprediction rate in case of equal rewards).
We use two criteria to compare policies: rewards and mispredictions (number of times the student performs an action different from what a teacher would do). High reward indicates that the student learned more crucial parts of the teacher’s policy, while a low misprediction rate indicates that in most cases student performs the same action as the teacher. In order to measure mispredictions, we run the student for number of runs, and compare actions it took to the actions teacher would perform.
Tables 1, 2, 3, and 4 compare the performance of Viper, MoËT, and MoËTh. The first column shows the maximum depth of decision trees, rewards are shown in R columns, and mispredictions in M columns. Additionally, we show the number of experts used (E) for MoËT, where we select the configuration with the best performance. The best configuration is chosen by selecting the highest reward; in case of equal rewards, we choose the one with fewer mispredictions.
For CartPole (Table 1), MoËT and MoËTh both achieve perfect reward () with a *dt depth of only , while Viper needs *dt with depth at least to achieve the perfect reward. Moreover, the misprediction rates for Viper with *dt depths of 1 and 2 are and respectively, which are significantly higher than the misprediction rates of less than for both MoËT and MoËTh for similar depths. Even with depth , Viper could only achieve a misprediction rate of . MoËT and MoËTh perform similarly with a slightly lower misprediction rate for MoËT.
The results for the Pong environment are shown in Table 2. For Pong as well, we observe a similar trend that Viper could only achieve a perfect reward of with *dt depth of 12, whereas both MoËT and MoËTh models achieve the perfect reward with *dt depth of 8. For depths of 9, 10, and 11, Viper achieves the rewards of , , and respectively (additional depths not shown in the table). Moreover, MoËT model achieves significantly lower misprediction rates compared to that of Viper, ranging from a decrease of for depth 4 to a decrease of for depth 16.
For Acrobot (Table 3), we notice that both MoËT and MoËTh models lead to better rewards and misprediction rates compared to Viper for different *dt depths, where the improvements in misprediction rates are less dramatic ranging from to improvement. However, we observe that the improvements in rewards are quite significant. Moreover, we observe that for some depths MoËTh outperforms even MoËT in terms of both better reward and misprediction rate.
Finally, the results for Mountaincar are shown in Table 4. In this case as well, we observe that MoËT achieves the best performance in terms of both reward and mispredictions, while MoËTh also performs significantly better than Viper and only slightly worse than MoËT.
Additional results with different depth and experts are provided in supplementary material.
Analyzing the learned Policies.
We analyze the learned student policies (Viper and MoËTh) by visualizing their state-action space, the differences between them, and their differences with the teacher policy. We use the Mountaincar environment for this analysis because of the ease of visualizing its 2-dimensional state space, comprising car position ($p$) and car velocity ($v$) features, and its 3 allowed actions: left, right, and neutral. We visualize the DRL, Viper, and MoËTh policies in Figure 1, showing the actions taken in different parts of the state space (additional visualizations are in supplementary material).
The state space is defined with feature bounds $p \in [-1.2, 0.6]$ and $v \in [-0.07, 0.07]$, which represent the sets of allowed feature values in Mountaincar. We sample the space uniformly at a fixed resolution. The actions left, neutral, and right are colored in green, yellow, and blue, respectively. Recall that MoËTh can cover regions whose borders are hyperplanes of arbitrary orientation, while Viper, i.e., a DT, can only cover regions whose borders are perpendicular to coordinate axes. This manifests in the MoËTh policy containing slanted borders in the yellow and green regions, which capture the geometry of the DRL policy more precisely, while the Viper policy contains only axis-aligned borders.
Furthermore, we visualize mispredictions for the Viper and MoËTh policies. While in the previous section we calculated mispredictions by using the student policy for playing the game, in this analysis we visualize mispredictions across the whole state space. Note that the student might never encounter some of the states in the whole state space, so mispredictions in some parts of the state space might not be of great importance. In order to account for this, we note that the Viper algorithm prioritizes states that are of greater importance by calculating a score $\max_{a \in A} Q(s, a) - \min_{a \in A} Q(s, a)$, where $Q(s, a)$ denotes the value of action $a$ in state $s$, and $A$ is the set of all possible actions. Using a similar scoring function, we visualize mispredictions weighted by the action importance, as that is more informative than the mispredictions themselves.
We create a vector $\mathbf{i}$ consisting of importance scores for the sampled points, and normalize it to the range $[0, 1]$. We also create a binary vector $\mathbf{z}$ which is $1$ in the case of a misprediction (the student policy decision is different from the DRL decision) and $0$ otherwise. We multiply $\mathbf{z}$ and $\mathbf{i}$ element-wise to compute and visualize the vector $\mathbf{m}$, where a higher value indicates a misprediction of higher importance and is denoted by a red color of higher intensity. The mispredictions weighted by their importance scores for the Viper and MoËTh policies are shown in Figure 1(d) and Figure 1(e), respectively. We can observe that the MoËTh policy has fewer high-intensity regions, leading to fewer overall mispredictions.
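The importance-weighted misprediction vector can be sketched as follows. The sketch makes two assumptions that the text does not fix: importance scores are scaled to the unit interval with min-max normalization, and actions are compared by equality. The function name `weighted_mispredictions` is hypothetical.

```python
def weighted_mispredictions(importance, student_actions, teacher_actions):
    """Element-wise product of normalized importance and misprediction flags."""
    lo, hi = min(importance), max(importance)
    span = hi - lo
    # Assumption: min-max scaling; the text only states the target range.
    norm = [(v - lo) / span if span > 0 else 0.0 for v in importance]
    flags = [1 if s != t else 0 for s, t in zip(student_actions, teacher_actions)]
    return [n * f for n, f in zip(norm, flags)]
```

Entries are zero wherever the student agrees with the teacher, and grow with the importance of the state wherever it does not.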
To provide a quantitative difference between the mispredictions of two policies, we compute , which is a measure in bounds such that its value is in the case of no mispredictions, and in the case of all mispredictions. For the policies shown in Figure 1(d) and Figure 1(e), we obtain for Viper and for MoËTh policies. We also show differences in mispredictions between Viper and MoËTh (Figure 1(f)), by subtracting the importance-weighted misprediction vector of MoËTh from that of Viper. The positive values are shown in blue and the negative values are shown in red. The higher-intensity blue regions denote states where the MoËTh policy gets a more important action right and Viper does not (and vice versa for high-intensity red regions).
Translating MoËT to SMT. We now show the translation of MoËT policy to SMT constraints for verifying policy properties. We present an example translation of MoËT policy on CartPole environment with the same property specification that was proposed for verifying Viper policies Bastani et al. (2018). The goal in CartPole is to keep the pole upright, which can be encoded as a formula:
where represents state after steps, is the deviation of pole from the upright position. In order to encode this formula it is necessary to encode the transition function which models environment dynamics: given a state and action it returns the next state of the environment. Also, it is necessary to encode the policy function that for a given state returns action to perform. There are two issues with verifying : (1) infinite time horizon; and (2) the nonlinear transition function . To solve this problem, Bastani et al. (2018) use a finite time horizon and linear approximation of the dynamics and we make the same assumptions.
To encode the policy, we need to translate both the gating function and the DT experts to logical formulas. Since the gating function in MoËTh uses the exponential function, it is difficult to encode directly in Z3, as SMT solvers do not have efficient decision procedures for non-linear arithmetic. A direct encoding of exponentiation therefore leads to prohibitively complex Z3 formulas. We exploit the following simplification of the gating function, which is sound when hard prediction is used:

$$e = \arg\max_i \frac{\exp(\theta_{gi}^{\top} x)}{\sum_{j=1}^{E} \exp(\theta_{gj}^{\top} x)} = \arg\max_i \exp(\theta_{gi}^{\top} x) = \arg\max_i \theta_{gi}^{\top} x$$
The first simplification is possible since the denominators of the gatings of all experts are the same, and the second is due to the monotonicity of the exponential function. For encoding DTs we use the same encoding as in Viper. To verify that the property holds, we need to show that its negation is unsatisfiable. We run the verification with our MoËTh policies and show that the negation is indeed unsatisfiable.
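The gating simplification can be checked numerically with a short sketch. It assumes a softmax gate over linear scores with one weight vector per expert; the function names are hypothetical. The last helper returns the difference vectors whose sign conditions describe the region where a given expert is selected, which is the kind of purely linear constraint an SMT solver handles well.

```python
import math

def hard_gate_softmax(x, theta_g):
    """Index of the expert with the largest softmax gating value."""
    scores = [sum(w * xi for w, xi in zip(theta, x)) for theta in theta_g]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return max(range(len(probs)), key=probs.__getitem__)

def hard_gate_linear(x, theta_g):
    """Same choice using only the linear scores (no exponentials)."""
    scores = [sum(w * xi for w, xi in zip(theta, x)) for theta in theta_g]
    return max(range(len(scores)), key=scores.__getitem__)

def selection_halfspaces(theta_g, e):
    """Vectors d_j = theta_e - theta_j; expert e wins where d_j . x >= 0 for all j."""
    return [[a - b for a, b in zip(theta_g[e], theta_j)]
            for j, theta_j in enumerate(theta_g) if j != e]
```

Ties on a boundary are resolved here in favor of the lower index; an actual encoding would need to fix a tie-breaking rule explicitly.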
To better understand the scalability of our verification procedure, we report on the verification times needed to verify policies for different number of experts and different expert depths in Figure 2. We observe that while MoËTh policies with 2 experts take from s to s for verification, the verification times for 8 experts can go up to as much as s. This directly corresponds to the complexity of the logical formula obtained with an increase in the number of experts.
We introduced MoËT, a technique based on MoE with expert decision trees, and presented a learning algorithm to train MoËT models. We then used MoËT models for interpreting DRL agent policies, where different local DTs specialize on different regions of the input space and are combined into a global policy using a gating function. We showed that MoËT models lead to smaller, more faithful, and more performant representations of DRL agents compared to previous state-of-the-art approaches like Viper, while still maintaining interpretability and verifiability.
- Silver et al.  David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- Brown and Sandholm  Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
- Verma et al.  Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically Interpretable Reinforcement Learning. In International Conference on Machine Learning, pages 5052–5061, 2018.
- Bastani et al.  Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Verifiable reinforcement learning via policy extraction. In Advances in Neural Information Processing Systems, pages 2499–2509, 2018.
- Ross et al.  Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.
- Jacobs et al.  Robert A Jacobs, Michael I Jordan, Steven J Nowlan, Geoffrey E Hinton, et al. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
- Jordan and Xu  Michael I Jordan and Lei Xu. Convergence results for the EM approach to mixtures of experts architectures. Neural networks, 8(9):1409–1431, 1995.
- Yuksel et al.  Seniha Esen Yuksel, Joseph N Wilson, and Paul D Gader. Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177–1193, 2012.
- Irsoy et al.  Ozan Irsoy, Olcay Taner Yıldız, and Ethem Alpaydın. Soft decision trees. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pages 1819–1822. IEEE, 2012.
- Zhao et al.  Wenbo Zhao, Yang Gao, Shahan Ali Memon, Bhiksha Raj, and Rita Singh. Hierarchical Routing Mixture of Experts. arXiv preprint arXiv:1903.07756, 2019.
- De Moura and Bjørner  Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.
- Guidotti et al. [2018a] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5):93, 2018a.
- Doshi-Velez and Kim  Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
- Erhan et al.  Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. 2009.
- Olah et al.  Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
- Kim et al.  Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). arXiv preprint arXiv:1711.11279, 2017.
- Zhang et al.  Xin Zhang, Armando Solar-Lezama, and Rishabh Singh. Interpreting neural network judgments via minimal, stable, and symbolic corrections. In Advances in Neural Information Processing Systems, pages 4874–4885, 2018.
- Dhurandhar et al.  Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, pages 592–603, 2018.
- Guidotti et al. [2018b] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Dino Pedreschi, Franco Turini, and Fosca Giannotti. Local rule-based explanations of black box decision systems. arXiv preprint arXiv:1805.10820, 2018b.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
-  OpenAI Baselines. OpenAI Baselines. https://github.com/openai/baselines.
- Wang et al.  Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
- Barto et al.  Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, (5):834–846, 1983.
- Sutton  Richard S Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems, pages 1038–1044, 1996.
- Moore  Andrew William Moore. Efficient memory-based learning for robot control. 1990.
Appendix A DRL Agent Training Parameters
Here we present the parameters we used to train DRL agents for different environments. For CartPole, we use the policy gradient model used in Viper. While we use the same model, we had to retrain it from scratch as the trained Viper agent was not available. For Pong, we use a deep Q-network (DQN) Mnih et al. (2015), and we use the same model as in Viper, which originates from OpenAI Baselines. For Acrobot and Mountaincar, we implement our own version of a dueling DQN following Wang et al. (2015). We use hidden layers with neurons in each layer. We set the learning rate to , batch size to , step size to and number of epochs to . We checkpoint a model every steps and pick the best performing one in terms of achieved reward.
Appendix B Environments
In this section we provide a brief description of environments we used in our experiments. We used four environments from OpenAI Gym: CartPole, Pong, Acrobot and Mountaincar.
CartPole. This environment consists of a cart and a rigid pole hinged to the cart, based on the system presented by Barto et al. (1983). At the beginning the pole is upright, and the goal is to prevent it from falling over. The cart is allowed to move horizontally within predefined bounds, and the controller chooses to apply either a left or a right force to the cart. The state is defined by four variables: cart position, cart velocity, pole angle, and pole angular velocity. The game is terminated when the absolute value of pole angle exceeds , cart position is more than units away from the center, or after successful steps; whichever comes first. In each step reward of is given, and the game is considered solved when the average reward is over in over 100 consecutive trials.
Pong. This is a classical Atari game of table tennis with two players. The minimum possible score is $-21$ and the maximum is $21$.
Acrobot. This environment is analogous to a gymnast swinging on a horizontal bar, and consists of two links and two joints, where the joint between the links is actuated. The environment is based on the system presented by Sutton (1996). Initially both links are pointing downwards, and the goal is to swing the end-point (feet) above the bar by at least the length of one link. The state consists of six variables: four variables consisting of the $\sin$ and $\cos$ values of the joint angles, and two variables for the angular velocities of the joints. The action is applying either negative, neutral, or positive torque on the joint. At each time step reward of is received, and episode is terminated upon successful reaching the height, or after steps, whichever comes first. Acrobot is an unsolved environment in that there is no reward threshold at which it is considered solved, but the goal is to achieve high reward.
Mountaincar. This environment consists of a car positioned between two hills, with the goal of reaching the hill in front of the car. The environment is based on the system presented by Moore (1990). The car can move on a one-dimensional track, but does not have enough power to reach the hill in one go, so it needs to build momentum by going back and forth to finally reach the hill. The controller can choose the left, right, or neutral action to apply left, right, or no force to the car. The state is defined by two variables, describing car position and car velocity. In each step reward of is received, and episode is terminated upon reaching the hill, or after steps, whichever comes first. The game is considered solved if average reward over consecutive trials is no less than .
Appendix C Additional Visualizations
In this section we provide visualization of a gating function. Figure 3 shows how gating function partitions the state space for which different experts specialize. Gatings of MoËTh policy with experts and depth are shown.
Appendix D Ablation Results
In this section we show results for all DT depths and numbers of experts used for training Viper and MoËT policies. Mispredictions and rewards are shown for all configurations. Tables 5, 6, and 7 show results for CartPole. Tables 8, 9, and 10 show results for Pong. Tables 11, 12, and 13 show results for Acrobot. Tables 14, 15, and 16 show results for Mountaincar.