Conservative Q-Improvement: Reinforcement Learning for an Interpretable Decision-Tree Policy

There is a growing desire in the field of reinforcement learning (and machine learning in general) to move from black-box models toward more "interpretable AI." We improve interpretability of reinforcement learning by increasing the utility of decision tree policies learned via reinforcement learning. These policies consist of a decision tree over the state space, which requires fewer parameters to express than traditional policy representations. Existing methods for creating decision tree policies via reinforcement learning focus on accurately representing an action-value function during training, but this leads to much larger trees than would otherwise be required. To address this shortcoming, we propose a novel algorithm which only increases tree size when the estimated discounted future reward of the overall policy would increase by a sufficient amount. Through evaluation in a simulated environment, we show that its performance is comparable or superior to traditional tree-based approaches and that it yields a more succinct policy. Additionally, we discuss tuning parameters to control the tradeoff between optimizing for smaller tree size or for overall reward.




1 Introduction

Many powerful machine learning algorithms are “black boxes” (Knight, 2017). They produce models which can be used to make predictions or policies that can be used to perform tasks, but these models are often opaque or require significant analysis to penetrate. “Black box” algorithms are less likely to be accepted for use by an organization, and adoption occurs more slowly than for processes which are directly interpretable (Hihn and Menzies, 2015). It is difficult for humans to trust decisions that cannot be verified, and thus much research has gone into creating interpretable models (Wu et al., 2017) as part of a general growing interest in interpretable AI (Bloomberg, 2018).

We look specifically at the field of Reinforcement Learning (RL), which in recent years has demonstrated impressive results on tasks such as the games of Go and Dota2, walking as a simulated human-like agent, and accomplishing a variety of real-world robotic tasks (Gitau, 2019; Arulkumaran et al., 2017). However, many advanced RL methods today result in policies which perform well but which are not directly interpretable. We are motivated to build an RL algorithm that can learn a policy that is in a human understandable form. However, in contrast to after-the-fact explanations or approximations, we use an interpretable representation that is simultaneously the policy used to perform the task.

A human-understandable RL model is useful in a scenario where an end-user wishes to continually tweak a policy, or change it after it has been learned. Consider a situation where a household assistant robot learns the preferences of its owner and how to perform certain kinds of tasks in their home. If the owner goes on a diet, or otherwise changes some specific aspect of their behavior, it would be useful to communicate this to the robot by simply accessing the robot’s settings, selecting the task model, and changing the relevant specific aspect of the policy. This would be faster than the agent having to learn the simple change over time, and less complicated than requiring the robot to understand a human’s request and make the change itself.

To improve the interpretability of RL policies, we focus on learning a decision tree policy. Decision trees are frequently used to explain processes to humans, such as when a deep network is trained to explain itself (Alaniz and Akata, 2019; Huk Park et al., 2018). Decision trees can be represented by text or graphically in a way that a human can readily comprehend. The tree, once constructed, also has useful information about which states are treated identically and which features yield the highest information gain (Buntine and Niblett, 1992; Leroux et al., 2018). A reasonably-sized tree with labeled attributes enables a human to understand the behavior of a learned policy without having to observe its execution.

Our main contribution is the Conservative Q-Improvement (CQI) reinforcement learning algorithm. CQI learns a policy in the form of a decision tree, is applicable to any domain in which the state space can be represented as a vector of features, and generates a human-interpretable decision tree policy. Additionally, we evaluate CQI on a modified RobotNav environment (Mitchell and Thrun, 1993).

The rest of our paper is organized as follows: Section 2 discusses relevant background. Section 3 discusses previous related work. Section 4 describes our approach. Section 5 describes the environment we use to evaluate our method. We compare our algorithm to a reference baseline method in Section 6 and discuss conclusions in Section 7.

2 Background

Reinforcement Learning

Reinforcement learning is a field of algorithms for obtaining policies, i.e., state-to-action mappings, through experience in an environment (Sutton and Barto, 2018). We focus on Q-learning: given a set of states S, a set of actions A, and an immediate reward r for executing an action in a state, the expected value (Q-value) of performing an action in a state can be learned over time by experimenting with actions in states, observing the reward, and updating the Q-value estimate via the Bellman equation:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]

where Q(s, a) is the current Q-value estimate for a state-action pair, s and a are the current state and chosen action, α is the learning rate, r is the immediate reward for choosing action a when in state s (as experienced immediately by interacting with the environment), s′ is the next state the agent is in after executing a, and a′ is the best possible next action, such that the maximum expected future reward from being in state s′ is added after being discounted by the factor γ.
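As a concrete illustration, the tabular update above can be written in a few lines of Python. This is a minimal sketch, not the paper's implementation; the dictionary-of-dictionaries Q-table and the name `q_update` are our own choices.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.8):
    """One Bellman update: move Q[s][a] toward r + gamma * max_a' Q[s'][a'].

    Q is a hypothetical table mapping state -> {action: Q-value};
    alpha is the learning rate, gamma the discount factor.
    """
    best_next = max(Q[s_next].values())  # max expected future reward from s'
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]
```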

Decision Trees

Decision trees have often been used to create easy-to-understand solutions to classification problems, among others. They are trees that start at a root node and branch based on conditions (Quinlan, 1986). An additional benefit to decision trees is that they can be represented graphically, which aids in human understanding (Mulrow, 2002). Some work has explored combining neural nets and decision tree forms (Tanno et al., 2018; Zhao, 2001), including using decision trees to explain neural nets (Zhang et al., 2018). We seek to create a reinforcement learning policy that is in decision tree form.

In an AI context specifically, it has been suggested that huge decision trees are insufficiently comprehensible to humans, and that simplifying trees is important for humans to interact with the representation (Quinlan, 1987). Chess experts had difficulty understanding an early chess-playing algorithm that fully “explained” itself with a complex decision tree—too complex a tree becomes opaque (Michie, 1983, 1987). It has been shown that, all else being equal, humans prefer simpler explanations compared to more complex ones (Lombrozo, 2007). Based on the foregoing, we assume for the remainder of the paper that smaller decision trees are more interpretable by humans.

3 Related Work

We are not the first to attempt to learn a decision tree policy. The “G algorithm” (Chapman and Kaelbling, 1991) learns a decision tree incrementally as new examples are added. The Lumberjack algorithm  (Uther and Veloso, 2000) constructs a Linked Decision Forest—like a decision tree but without the need for repeated internal structure which would otherwise occur. TTree (Uther and Veloso, 2003; Uther, 2002) solves MDPs or SMDPs and involves both state abstraction (as in our and other methods) and temporal abstraction. (Pyeatt et al., 2001) is a work which uses a tree as an internal structure and incrementally builds the tree, splitting on Q-values. Unlike (Hester et al., 2010), an approach that uses decision trees as part of the policy representation, we use a single decision tree to represent the entire policy. (Wu et al., 2012) create decision trees using a post-learning transformation, but we instead maintain our policy as a tree at all stages of training and merely update our tree.

The UTree algorithm (McCallum and Ballard, 1996; Uther and Veloso, 1998) learns a tree, starting with a single abstract state and then continually splitting as appropriate (based on a splitting criterion), ending when a stopping criterion is reached. This is similar to our method, except that in UTree splits occur based on how accurately the Q-function is represented rather than based on what improves the policy. An effort to produce a decision tree via RL is found in (Gupta et al., 2015), but the final product is not directly interpretable: it consists of a tree with linear Gibbs softmax sub-policies as leaves. In contrast, our approach creates a tree with action choices as leaves.

The best decision-tree-via-RL approach of which we are aware is the Pyeatt Method (Pyeatt, 2003) (PM), which we implement and use as a comparison. PM differs from our method in terms of its splitting criteria and the manner in which splits are performed. Like our method, PM starts with a single root node and adds branches and nodes over time. PM maintains a history of changes in Q-values for each leaf, and splits when it believes that this history represents two distributions. We instead use a lookahead approach that predicts which split will produce the largest increase in reward. The trees resulting from the Pyeatt Method tend to be larger than those that result from our approach, which is designed to keep the size of the tree small and manageable.

Indeed, in none of the above cases are there constraints on the size of the decision trees or merit placed on having a smaller tree (beyond gains in performance). Our work is aimed at enabling interpretability; this requires more than simply using a tree structure as an underlying mechanism. Certainly we want to learn a good policy, but no less important for us is that the resulting form of the policy be easy to understand.

4 Method

1  T ← a tree consisting of a single leaf node
2  θ_max ← starting threshold for the split potential required to trigger a split
3  θ ← θ_max
4  d_v, d_θ ← decay factors for the visit-frequency and threshold values
5  for each episode do
6      while episode not done do
7          s ← current state at timestep t
8          L ← leaf of T corresponding to s
9          a, r, s′ ← TakeAction(L, s)
10         UpdateLeafQValues(L, a, r, s′)
11         UpdateVisitFrequency(T, L)
12         UpdatePossibleSplits(L, s, s′)
13         bestSplit, bestValue ← BestSplit(T, L)
14         if bestValue > θ then
15             SplitNode(L, bestSplit)
16             θ ← θ_max
17         else
18             θ ← θ · d_θ
19         end if
20     end while
21 end for
Algorithm 1 Conservative Q-Improvement

4.1 Method Summary

We introduce Conservative Q-Improvement (CQI), an algorithm for learning an interpretable decision-tree policy. CQI learns a policy in the form of a decision tree while only adding to the tree when doing so would improve the policy.

Each node in the tree is a branch node or leaf node. Branch nodes have two children and a condition based on a feature of the state space. Leaf nodes correspond to abstract states, and indicate actions that are to be taken.

The tree is initialized with a single leaf node, representing a single abstract state. Our method is strictly additive. Over time, it creates branches and nodes by replacing existing leaf nodes with a branch node and two child leaf nodes.

The tree is split only when the expected discounted future reward of the new policy would increase by more than a dynamic threshold. This threshold decreases slowly over the course of training, so the split which induces the largest increase in reward is eventually chosen. Additionally, the threshold resets to its initial value after every split: immediately after a change, when the current tree represents the Q-values less accurately, a greater increase in reward is required to justify a further split. In this way, the method is “conservative” in performing splits.

The high-level algorithm is shown in Algorithm 1. An overview is given here, with more in-depth explanations following in Section 4.2.

At a given timestep, the environment will be in some state s. This state will correspond to some leaf node L. When there is a single leaf node, all states correspond to this node. When there are multiple leaf nodes, the tree must be traversed to determine the corresponding L (each branch node having a boolean condition that operates on s, indicating which of its two children to traverse).

Once L is identified, an action is chosen. If exploring, a random action is taken. If exploiting, the action with the highest Q-value on that leaf is chosen. Executing the action yields a reward r and next state s′. A series of updates are then performed.

The Q-values on leaf L are updated using the standard Bellman update. Each node keeps track of how often it has been visited, and each leaf node maintains a list of possible splits: potential ways of converting the leaf node into a branch node with two child nodes. These hypothetical child nodes have their own estimated visit frequencies and Q-values, which are updated at this time.

Then, the possible splits are checked to determine which would yield the greatest split value, where split value is the increase in expected reward moderated by visit frequency. If this value exceeds a threshold θ, the split occurs. Otherwise, the threshold is decreased slightly by a decay factor. In this manner, the split with the best expected reward gain is chosen, and the time between splits is longer when the increase is small, thereby allowing the leaves to better represent the Q-function before a split occurs. Splits can occur at any timestep. The decay factor loosely controls the rate at which splits occur.
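The decay-and-reset schedule for the split threshold can be sketched as follows. This is an illustrative reading of the mechanism described above, with invented names (`SplitThreshold`, `should_split`); it is not the authors' code.

```python
class SplitThreshold:
    """Dynamic threshold: decays each step, resets to its max after a split."""

    def __init__(self, max_thresh, decay):
        self.max_thresh = max_thresh  # initial (and reset) threshold value
        self.decay = decay            # per-step multiplicative decay
        self.thresh = max_thresh

    def should_split(self, best_split_value):
        if best_split_value > self.thresh:
            self.thresh = self.max_thresh  # reset: next split must re-earn trust
            return True
        self.thresh *= self.decay          # slowly lower the bar otherwise
        return False
```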

4.2 Method Details

A detailed explanation of the algorithm follows in this section. The tree itself consists of two node types, “branching nodes” and “leaf nodes”.

A branching node has the following attributes: a visit frequency, a dimension to split on (our method assumes a multidimensional state space), and a value to split on. Each state falls within a single leaf node. To determine the node for a state s, the tree is traversed, starting at the root. If the value of the state in the node's split dimension is less than the node's split value, the left child branch is traversed; otherwise, the right branch is traversed. When a leaf node is reached, traversal ends: that leaf node corresponds to the abstract state containing s. This is the process referenced in Algorithm 1, line 8.
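The traversal just described amounts to a short loop. The `Branch` and `Leaf` classes below are hypothetical stand-ins for the paper's node types, shown only to make the less-than-goes-left rule concrete.

```python
class Leaf:
    """Abstract state: holds an action -> Q-value mapping."""
    def __init__(self, q_values):
        self.q_values = q_values

class Branch:
    """Internal node: routes on one state dimension at a split value."""
    def __init__(self, dim, value, left, right):
        self.dim, self.value = dim, value
        self.left, self.right = left, right

def find_leaf(node, s):
    """Walk from the root; go left when s[dim] < value, right otherwise."""
    while isinstance(node, Branch):
        node = node.left if s[node.dim] < node.value else node.right
    return node
```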

A leaf node has the following attributes: a visit frequency, a mapping of actions to Q-values, and a list of possible-split objects.

Each possible-split object (one per potential split) records a dimension number, a value to split on, and, for each of the two potential children, a Q-value for each action and a visit frequency.

The TakeAction method in line 9 of Algorithm 1 is detailed in Algorithm 2. This is standard RL behavior, where an ε value or schedule controls whether to explore (randomly choose a valid action) or exploit (use the action-to-Q-value map on L to determine the action with the highest expected value). An action, reward, and next state are returned.

1 TakeAction(L, s):
2 if rand([0, 1]) < ε then
3     a ← random valid action
4 else
5     a ← the action with the largest Q-value Q(L, a) on L
6 end if
7 r ← reward after executing action a
8 s′ ← state after executing action a
9 Return a, r, s′
Algorithm 2 Take Action
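The selection step of Algorithm 2 is plain epsilon-greedy choice over the leaf's Q-values. A sketch, with an invented `take_action` helper (executing the action against the environment to obtain r and s′ is omitted):

```python
import random

def take_action(leaf_q, epsilon, actions, rng=random):
    """Explore with probability epsilon, otherwise act greedily on the leaf.

    leaf_q: hypothetical action -> Q-value mapping stored on the leaf.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)            # explore: random valid action
    return max(actions, key=lambda a: leaf_q[a])  # exploit: argmax Q
```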

The Q-values on the leaves (for each leaf, one Q-value per valid action) are updated using the standard Bellman update, as shown in Algorithm 3.

1 UpdateLeafQValues(L, a, r, s′):
2 Q(L, a) ← Q(L, a) + α [ r + γ max_{a′} Q(leaf(s′), a′) − Q(L, a) ]
Algorithm 3 UpdateLeafQValues

We keep track of how often each node is visited. (A node is visited if it is the leaf node corresponding to the state the agent is in, or if it is an ancestor of such a leaf node.) In Algorithm 4, these metrics are updated. Note that the update rule corresponds to an exponential average, so it is a natural choice to accompany Q-learning. v(n) denotes the visit frequency of node n, and path(T, root → L) denotes the set of nodes along the direct path from the root to leaf L in tree T. When determining splits later, it is important to weight potential gain by relative visit frequency; this serves to convert the value increase within a leaf to an estimated policy-wide increase. The reference to a “sibling of n” in line 4 refers to the other child of n's parent. In our method, a node in the tree has either two children or none.

1 UpdateVisitFrequency(T, L):
2 for each node n ∈ path(T, root → L) do
3     v(n) ← v(n) · d_v + (1 − d_v)
4     n_sib ← sibling of n
5     v(n_sib) ← v(n_sib) · d_v
6 end for
d_v: visit decay factor
Algorithm 4 UpdateVisitFrequency
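The exponential-average visit update can be sketched as below. Representing the path as (node, sibling) pairs and storing frequencies in a flat `freqs` dictionary are our simplifications, not the paper's data structures.

```python
def update_visit_frequencies(freqs, path, visit_decay=0.999):
    """Exponential-average visit update along the root-to-leaf path.

    freqs: node id -> visit frequency.
    path: list of (visited node, its sibling or None) pairs.
    Visited nodes move toward 1; their siblings decay toward 0.
    """
    for node, sib in path:
        freqs[node] = freqs[node] * visit_decay + (1 - visit_decay)
        if sib is not None:
            freqs[sib] = freqs[sib] * visit_decay
    return freqs
```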

Each leaf keeps track of potential splits that could be performed (turning the leaf node into a branch node with two child leaves). As shown in Algorithm 5, the possible-split objects maintain and update their own estimates of the visit frequency and Q-values for the case that a specific split is performed. Each possible split p has a left and right side; ¬side denotes the side opposite side. Q(n) refers to the {action → Q-value} mapping on node n, and Q(n, a) refers to the Q-value of action a on node n.

1 UpdatePossibleSplits(L, s, s′):
2 for each possible split p of L do
3     side ← ‘left’ or ‘right’, depending on which side of split p state s falls on
4     v(p_side) ← v(p_side) · d_v + (1 − d_v)
5     v(p_¬side) ← v(p_¬side) · d_v
6     Q(p_side, a) ← Q(p_side, a) + α [ r + γ max_{a′} Q(leaf(s′), a′) − Q(p_side, a) ]
7 end for
Algorithm 5 UpdatePossibleSplits

After the updates occur, it is necessary to determine which of the potential splits is currently the best possible split, and what the expected gain would be if that split occurred. This process is shown in Algorithm 6. In line 2, the path weight is computed as the product of the visit frequencies of all the nodes along the path from the root to L. A mapping from splits to split values is maintained, where a split value is the sum of the differences between each child's best Q-value and the leaf's best Q-value (on both the left and right side, each weighted by visits). Each split is investigated to determine which has the largest split value. This value, along with the split itself, is returned to the parent algorithm (where, if the value exceeds the threshold θ, a split is performed, and if not, θ is decayed).

“Performing a split” means to take a leaf node and turn it into a branch node with two children leaf nodes. This process is described in Algorithm 7.

1 BestSplit(T, L):
2 w ← product of v(n) for all nodes n on the path from the root of T to L
3 % SQ is a mapping of splits to split values (indicating the expected increase in reward if a split is used)
4 SQ ← {}
5 for each possible split p of L do
6     Δ_left ← v(p_left) · (max_a Q(p_left, a) − max_a Q(L, a))
7     Δ_right ← v(p_right) · (max_a Q(p_right, a) − max_a Q(L, a))
8     SQ(p) ← w · (Δ_left + Δ_right)
9 end for
10 Return arg max and max of SQ
Algorithm 6 BestSplit
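A sketch of the split-value computation in Algorithm 6, under an assumed representation of candidate splits as dictionaries with hypothetical keys (`q_left`, `visits_left`, and so on); this is an illustration of the weighting scheme, not the authors' code.

```python
def best_split(leaf_best_q, candidates, path_weight):
    """Pick the candidate split with the largest expected policy-wide gain.

    leaf_best_q: the leaf's current best Q-value (max over actions).
    candidates: dicts with each hypothetical child's best Q and visit freq.
    path_weight: product of visit frequencies from root to this leaf,
                 converting a within-leaf gain to a policy-wide estimate.
    """
    best, best_value = None, float('-inf')
    for c in candidates:
        gain = (c['visits_left'] * (c['q_left'] - leaf_best_q)
                + c['visits_right'] * (c['q_right'] - leaf_best_q))
        value = path_weight * gain
        if value > best_value:
            best, best_value = c, value
    return best, best_value
```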
1 SplitNode(L, bestSplit):
2 Let B be a new internal branching node
3 Let L_left, L_right be the left and right children of B, respectively
4 B takes the split dimension and value of bestSplit, and v(B) ← v(L)
5 v(L_left) ← v(bestSplit_left); Q(L_left) ← Q(bestSplit_left)
6 v(L_right) ← v(bestSplit_right); Q(L_right) ← Q(bestSplit_right)
7 for each dimension of the state space do
8     add a set of left-right possible-split pairs to L_left and L_right, one for each of the m candidate split values in that dimension, with Q-values initialized to Q_init
9 end for
10 Replace L with B in the tree
11 Return B, L_left, L_right
Q_init: default Q-value
m: the number of splits to check for in each dimension
Algorithm 7 SplitNode (splits leaf node L into nodes B, L_left, L_right)
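Performing a split, per Algorithm 7, can be sketched as constructing a branch whose children inherit the chosen candidate's hypothetical statistics. The dictionary representation is illustrative only, and re-seeding the children's own candidate splits is omitted here.

```python
def split_node(split, leaf_visits):
    """Turn a leaf into a branch with two child leaves.

    split: chosen candidate, with hypothetical keys for the split dimension,
    split value, and each child's Q-values and visit frequency.
    leaf_visits: visit frequency of the leaf being replaced (kept on the branch).
    """
    branch = {'dim': split['dim'], 'value': split['value'],
              'visits': leaf_visits}
    branch['left'] = {'q_values': dict(split['q_left']),
                      'visits': split['visits_left']}
    branch['right'] = {'q_values': dict(split['q_right']),
                       'visits': split['visits_right']}
    return branch
```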

5 Environment

Figure 1: An image of RobotNav, the domain in which we evaluate CQI. As noted in (Mitchell and Thrun, 1993): “a) The simulated robot world, b) Actions”.

We evaluate our approach on the RobotNav environment from (Mitchell and Thrun, 1993). We have also written an open-source OpenAI Gym version of the environment for anyone else to use. [Link to code available after review.]

In this 2D environment, there is a robot, a goal location, and one or more obstacles. The robot must navigate to the goal, avoiding the obstacles, as shown in Figure 1.

In addition to the action space in the original work, we also add the option of using the following action set (the robot automatically turns to face the goal after every action):

  1. move directly towards goal

  2. move directly away from goal

  3. move perpendicular to direct-line-to-goal (right)

  4. move perpendicular to direct-line-to-goal (left)

6 Results

6.1 Direct Comparison

We evaluate the Pyeatt Method and Conservative Q-Improvement Trees on the RobotNav environment. Overall, we find that CQI creates trees with an order of magnitude fewer nodes and greater average reward per episode than the Pyeatt Method. Preliminary experiments on diverse alternative domains exhibit the same behavior.

To fairly compare both methods, we perform a grid search for each method to determine the best hyperparameters. We search over alpha values of 0.005, 0.01, and 0.1 through 0.8 by increments 0.1. We test hist-mins for the Pyeatt Method of 1000, 3000, 5000, 8000, and 10,000. For CQI, we try split-thresh-max values from to with a step of . We try num-splits values of 2 through 9. The visit decay and split threshold decay parameters for CQI are held constant at 0.999 and 0.9999, respectively. All of the best hyperparameter configurations fall within the ranges we search.

The hyperparameters that yield the highest average reward-per-episode for PM are history-list-min-size of 5000 and alpha of 0.3. CQI is optimized twice, once for highest reward and once for smallest tree size. We note that optimizing CQI for smallest tree size still yields a policy with higher reward than the Pyeatt Method. The hyperparameters that yield the best reward for CQI are alpha of 0.01, split-thresh-max of 10, and num-splits of 3. The hyperparameters that result in the smallest tree size are alpha of 0.2, split-thresh-max of , and num-splits of 7. The results of ten trials for each of these parameter configurations are shown in Table 1. Gamma is set to 0.8 as in (Mitchell and Thrun, 1993). Policies are trained for 500,000 steps and evaluated across 50,000 steps, recording average reward per episode and final tree size. We use epsilon-greedy exploration with parameter at the start of training and decay epsilon linearly (to a minimum of ) over the course of steps.

Method             | Tree Size (Avg. / Std. Dev.) | Avg. Reward Per Episode (Avg. / Std. Dev.)
Pyeatt Method      | 194.2 / 2.35                 | -75.73 / 59.05
CQI (reward opt)   | 20.2 / 1.40                  | -18.59 / 0.45
CQI (size opt)     | 7 / 0                        | -27.20 / 26.50

Table 1: Comparison of CQI (optimized for reward or size) to Pyeatt on RobotNav.

CQI outperforms PM while producing substantially smaller trees. Additionally, the variance for CQI is smaller. Even with the best hyperparameters we found, the Pyeatt Method’s reward at the end of training varied between -182.57 and -19.58. In contrast, we find that CQI (optimized for reward) consistently learns good policies. CQI (optimized for policy size) failed to learn a good policy in one trial, but otherwise achieves better reward than the best Pyeatt Method trial; the average reward across the other nine trials is -18.825 and the variance is 0.47.

We believe that CQI manages to achieve smaller tree size and greater reward than the Pyeatt Method because it creates a split only when the policy would improve, not just when the Q-values can be more faithfully represented if a split were added. As a result, CQI avoids extraneous splits, so the agent can benefit from greater abstraction over the state space.

Figure 2: Comparison of Reward vs Tree Size for CQI and Pyeatt on RobotNav.

We plot the reward obtained by each policy as a function of tree size (calculating reward by evaluating the policy as it existed before the subsequent split) in Figure 2. The “only successful runs” PM line refers to trials where PM achieved more than reward by the end of training. No trials were excluded when plotting the CQI performance.

It is possible to force the Pyeatt Method to produce smaller trees at the end of training by increasing history-list-min-size, but doing so impairs performance. Increasing this parameter to 50,000 yields trees of size 19 and 21 on RobotNav. The average reward over 10 trials was -102.62 and -102.59, with variances of 0.80 and 0.01, respectively. Essentially, when the Pyeatt Method is forced to make trees as small as CQI normally produces, the Pyeatt Method fails to learn.

6.2 Parameter Sensitivity

Figure 3: Performance of CQI on RobotNav as a function of split threshold maximum.

We also investigated the sensitivity of CQI to changes in hyperparameter values. There are a number of hyperparameters that can be adjusted to influence desired behavior. Optimizing for either size or reward is one such case. We considered the effect of num-split and split-thresh-max on reward and size, varying each while keeping other parameters constant.

The num-splits parameter has little effect on reward. Across 10 trials, reward varies only slightly, from -18.84 to -17.88 (excluding one outlier). With other parameters set to the values which minimize tree size, there is no discernible effect on tree size: it ranges between 7 and 9; with parameters set to the values which maximize reward, size ranges from 15 to 21. (At the same time, during the grid search, we observed a trend of higher num-splits yielding smaller trees for hyperparameter configurations far from those minimizing size or maximizing reward.)

Increasing split-threshold-maximum has little effect on reward over most of the range we tested (with the other parameters optimized for reward), with final episode reward ranging from -19.49 to -17.75. Note that these are all greater than the best recorded Pyeatt trial, which yielded a reward of -19.58. For larger values, the reward drops precipitously, passing below -100, as shown in Figure 3. This follows intuition, since split-threshold-maximum controls how long to wait before making a split. With a fixed training period, too long an interval between splits means that it is impossible to split enough times to create a successful policy. Another intuitive result of increasing split-threshold-maximum is the very clear trend of decreasing tree size shown in Figure 4. Waiting longer to perform a split means that the earlier splits are better informed and therefore higher quality. When this happens, it is more difficult to improve the policy further, so it takes more steps for potential splits to reach the split threshold. This result highlights the importance of this parameter, since larger values can yield smaller trees. Moreover, increasing training time allows for smaller trees without sacrificing final reward: as long as the training time is sufficient for the chosen split-threshold-maximum, the final policy will have sufficient time to train and avoid the “cliff” which results in poor policies.

Figure 4: Policy size of CQI on RobotNav as a function of split threshold maximum.

7 Conclusion and Future Work

We have introduced Conservative Q-Improvement. CQI is a reinforcement learning method applicable to any environment with discrete actions and multidimensional state space. It produces a policy in the form of a decision tree, and we have designed it to produce smaller trees than existing methods while not sacrificing policy performance.

We investigated the nature of the CQI method, discussing ways that one might optimize for tradeoffs between smaller trees and strictly better policies. We found that CQI outperformed the Pyeatt method in terms of both tree size and reward achieved. This is due to CQI’s conservative nature, whereby it only creates a new node when it will result in an improvement in the policy.

The tradeoffs of CQI can be tuned depending on the use case, and depending on domain. For example, if there were a domain where even CQI produces large trees, a practitioner could settle for lower reward in order to have a smaller tree that can be understood and inspected. On the other hand, if all policies for an environment were relatively small, or the policy only had to be manually inspected occasionally or without hurry, then one could choose parameter values which result in a larger, better-performing policy.

Future work could entail tree-building methods that are not strictly additive but rebalance along the way so as to reduce final policy size. Additionally, a regularization term which allows tree size to directly affect the policy’s perceived reward would allow CQI to optimize directly for a trade-off between size and performance on the original task.


  • Alaniz and Akata [2019] Stephan Alaniz and Zeynep Akata. Xoc: Explainable observer-classifier for explainable binary decisions. arXiv preprint arXiv:1902.01780, 2019.
  • Arulkumaran et al. [2017] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. CoRR, 2017.
  • Bloomberg [2018] Jason Bloomberg. Don’t trust artificial intelligence? Time to open the AI ‘black box’, 2018.
  • Buntine and Niblett [1992] Wray Buntine and Tim Niblett. A further comparison of splitting rules for decision-tree induction. Machine Learning, 8(1):75–85, 1992.
  • Chapman and Kaelbling [1991] David Chapman and Leslie Pack Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In IJCAI, volume 91, pages 726–731, 1991.
  • Gitau [2019] Catherine Gitau. Success stories of reinforcement learning, 2019.
  • Gupta et al. [2015] Ujjwal Das Gupta, Erik Talvitie, and Michael Bowling. Policy tree: Adaptive representation for policy gradient. In AAAI, pages 2547–2553, 2015.
  • Hester et al. [2010] Todd Hester, Michael Quinlan, and Peter Stone. Generalized model learning for reinforcement learning on a humanoid robot. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 2369–2374. IEEE, 2010.
  • Hihn and Menzies [2015] Jairus Hihn and Tim Menzies. Data mining methods and cost estimation models: Why is it so hard to infuse new ideas? In Automated Software Engineering Workshop (ASEW), 2015 30th IEEE/ACM International Conference on, pages 5–9. IEEE, 2015.
  • Huk Park et al. [2018] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8779–8788, 2018.
  • Knight [2017] Will Knight. The dark secret at the heart of ai: no one really knows how the most advanced algorithms do what they do-that could be a problem, 2017.
  • Leroux et al. [2018] Antonin Leroux, Matthieu Boussard, and Remi Dès. Information gain ratio correction: Improving prediction with more balanced decision tree splits. arXiv preprint arXiv:1801.08310, 2018.
  • Lombrozo [2007] Tania Lombrozo. Simplicity and probability in causal explanation. Cognitive Psychology, 55(3):232–257, 2007.
  • McCallum and Ballard [1996] Andrew Kachites McCallum and Dana Ballard. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester. Dept. of Computer Science, 1996.
  • Michie [1983] Donald Michie. Inductive rule generation in the context of the fifth generation. In Machine Learning Workshop, page 65, 1983.
  • Michie [1987] Donald Michie. Current developments in expert systems. In Proceedings of the Second Australian Conference on Applications of expert systems, pages 137–156. Addison-Wesley Longman Publishing Co., Inc., 1987.
  • Mitchell and Thrun [1993] Tom M Mitchell and Sebastian B Thrun. Explanation-based neural network learning for robot control. In Advances in Neural Information Processing Systems, pages 287–294, 1993.
  • Mulrow [2002] Edward J Mulrow. The visual display of quantitative information, 2002.
  • Pyeatt et al. [2001] Larry D Pyeatt, Adele E Howe, et al. Decision tree function approximation in reinforcement learning. In Proceedings of the Third International Symposium on Adaptive Systems: Evolutionary Computation and Probabilistic Graphical Models, volume 2, pages 70–77, Cuba, 2001.
  • Pyeatt [2003] Larry D Pyeatt. Reinforcement learning with decision trees. In Applied Informatics, pages 26–31, 2003.
  • Quinlan [1986] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
  • Quinlan [1987] J. Ross Quinlan. Simplifying decision trees. International journal of man-machine studies, 27(3):221–234, 1987.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tanno et al. [2018] Ryutaro Tanno, Kai Arulkumaran, Daniel C Alexander, Antonio Criminisi, and Aditya Nori. Adaptive neural trees. CoRR, 2018.
  • Uther and Veloso [1998] William TB Uther and Manuela M Veloso. Tree based discretization for continuous state space reinforcement learning. In AAAI/IAAI, pages 769–774, 1998.
  • Uther and Veloso [2000] William TB Uther and Manuela M Veloso. The lumberjack algorithm for learning linked decision forests. In International Symposium on Abstraction, Reformulation, and Approximation, pages 219–232. Springer, 2000.
  • Uther and Veloso [2003] William TB Uther and Manuela M Veloso. Ttree: Tree-based state generalization with temporally abstract actions. In Adaptive agents and multi-agent systems, pages 260–290. Springer, 2003.
  • Uther [2002] William T Uther. Tree-based hierarchical reinforcement learning. Technical report, Carnegie Mellon, Univ Pittsburgh, PA, Dept of Computer Science, 2002.
  • Wu et al. [2012] Min Wu, Atsushi Yamashita, and Hajime Asama. Rule abstraction and transfer in reinforcement learning by decision tree. In System Integration (SII), 2012 IEEE/SICE International Symposium on, pages 529–534. IEEE, 2012.
  • Wu et al. [2017] Huijun Wu, Chen Wang, Jie Yin, Kai Lu, and Liming Zhu. Interpreting shared deep learning models via explicable boundary trees. arXiv preprint arXiv:1709.03730, 2017.
  • Zhang et al. [2018] Quanshi Zhang, Yu Yang, Ying Nian Wu, and Song-Chun Zhu. Interpreting cnns via decision trees. CoRR, 2018.
  • Zhao [2001] Qiangfu Zhao. Evolutionary design of neural network tree-integration of decision tree, neural network and ga. In Evolutionary Computation, 2001. Proceedings of the 2001 Congress on, volume 1, pages 240–244. IEEE, 2001.