Reinforcement learning (RL) with neural network function approximators, known as “Deep RL,” has achieved tremendous results in recent years. Deep RL uses multi-layered neural networks to represent policies that are trained to maximize an agent’s expected future reward. However, these neural-network-based approaches are largely uninterpretable due to the millions of parameters involved.
In safety-critical domains, such as healthcare, aviation, and military operations, interpretability is of utmost importance. Human operators must be able to interpret and follow step-by-step procedures and checklists [10, 11, 5]. Of the class of machine learning methods that can generate such a set of procedures, decision tree algorithms are perhaps the most highly developed. While interpretable machine learning methods offer the promise of revolutionizing safety-critical domains, they are generally unable to match the singular performance seen in Deep RL [21, 8]. Decision trees have often been viewed as the de facto technique for interpretable machine learning [18, 14], as they can learn compact representations of underlying relationships within data. In prior work, decision trees have been applied to RL problems where they served as function approximators, compactly representing information about which action to take in which state [7, 8, 20, 17].
The challenge with applying decision trees as function approximators in RL lies in the online nature of the learning problem. As such, the decision tree model (or any model) must be able to adapt to handle the non-stationary distribution of the observed data. The two primary techniques for RL function approximation are Q-learning and policy gradient (PG). Underlying these learning mechanisms are variants of stochastic gradient descent (SGD); at each time-step, the RL agent takes an action based on the prediction of the approximated policy, receives a reward from the environment, and computes how to update the setting of each of the policy’s parameters [2, 9]. Decision trees are not typically amenable to gradient descent due to their Boolean nature; they are a collection of nested if-then rules. Therefore, researchers have used heuristic, non-gradient-descent-based methods for training decision trees [7, 8, 17]. A common approach when applying decision trees to RL is to perform online state aggregation using heuristics to update or expand a decision tree’s terminal (i.e., leaf) nodes rather than seeking to update the entire model with respect to a global loss function. Researchers have also attempted to use decision trees for RL by training them in batch mode, completely re-learning the tree from scratch to account for the non-stationarity introduced by an improving policy. While effective, this approach is inefficient when seeking algorithms that scale to realistic situations. Despite these attempts, success comparable to that of modern deep learning approaches has been elusive.
Seeking to develop a decision tree formulation amenable to gradient descent, Suárez and Lutsko formulated a continuous and fully differentiable decision tree (DDT), in which they adopted a sigmoidal representation of each decision node’s splitting criterion. Suárez and Lutsko applied their approach to offline, supervised learning but not to RL. Researchers have continued to explore continuous decision tree formulations [15, 12], but there has been limited success in applying such fuzzy trees to RL.
In this paper, we develop and demonstrate an end-to-end framework for RL with function approximation via DDTs. We provide three key contributions: first, we examine the theoretical properties for gradient descent over DDTs, motivating the need for PG based learning. To the best of our knowledge, this is the first investigation of the optimization surfaces of Q-learning and PGs for DDTs. Second, we introduce a regularization formulation to ensure interpretability via sparsity in the tree structure. Third, we demonstrate the novel ability to seamlessly update an entire decision tree via PG in canonical RL domains to produce an interpretable, sharp policy.
In this section, we highlight the traditional decision tree and describe how Suárez and Lutsko augmented this model to be fully differentiable for gradient descent. We also review RL, Q-Learning, and PG.
II-A Decision Trees
A decision tree is a directed, acyclic graph, with nodes and edges, that takes an example as input, performs a forward recursion, and returns a label, as shown in Equations 1-3. There are two types of nodes: decision nodes and leaf nodes. Decision and leaf nodes have an outdegree of two and zero, respectively. All nodes have an indegree of one except for the root node, which has an indegree of zero. Each decision node is represented as a Boolean expression (Equation 3) defined by the selected feature and splitting threshold for that node. For each decision node, the left outgoing edge is labeled “true” and the right outgoing edge is labeled “false.” If the expression evaluates to true (or false) for an example, then the left (or right) child node is considered next. If the child node is a decision node, the process is repeated until a leaf node is reached. Once a leaf node is reached, the tree returns the label represented by that leaf node. The problem of finding the optimal decision tree is to determine the best feature and splitting criterion for each decision node; the label for each leaf node; and the structure of the tree (i.e., whether, for each node, there exists a corresponding child node).
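The forward recursion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the formulation of Equations 1-3: the class names `Leaf` and `DecisionNode`, the function `predict`, and the direction of the Boolean test (true when the feature value exceeds the threshold) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: str  # label returned once this leaf is reached


@dataclass
class DecisionNode:
    feature: int        # index of the selected feature
    threshold: float    # splitting threshold
    left: "Node"        # child followed when the test is true
    right: "Node"       # child followed when the test is false


Node = Union[Leaf, DecisionNode]


def predict(node: Node, x: Sequence[float]) -> str:
    """Forward recursion: descend until a leaf is reached, then return its label."""
    if isinstance(node, Leaf):
        return node.label
    # Assumed test direction: true when the feature value exceeds the threshold.
    if x[node.feature] > node.threshold:
        return predict(node.left, x)
    return predict(node.right, x)


# Hypothetical two-level tree used only to exercise the recursion.
tree = DecisionNode(
    feature=0,
    threshold=0.5,
    left=Leaf("A"),
    right=DecisionNode(feature=1, threshold=2.0, left=Leaf("B"), right=Leaf("C")),
)
```

Because every test is a hard comparison, the output is piecewise constant in the thresholds, which is why standard gradient updates do not apply to this form.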
There are many heuristic techniques for learning decision trees with a batch data set that reason about the entropy, r-squared error, or other loss function. A limitation of these Boolean decision trees is that one cannot readily apply standard gradient descent update rules since a tree is fixed after generation. Some researchers have tried heuristic approaches to iteratively grow trees in an RL context; however, these approaches do not allow for a natural update of the entire structure of the tree in online learning environments.
Suárez and Lutsko provide one of the first DDT models. In their approach, they employ a sigmoid formulation for Equation 3, with a weighted linear combination of features and a steepness parameter. Suárez and Lutsko demonstrate that one can then compute the gradient of the tree with respect to the tree’s parameters for all nodes to approximately solve classification and regression problems.
While there are limitations to this approach (i.e., whether a weighted, linear combination of many features is interpretable), we believe this model is at least a strong building block towards developing interpretable, machine learning models amenable to gradient descent.
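To make the sigmoid formulation concrete, the following sketch evaluates a single soft decision node and a depth-one soft tree. The exact functional form is an assumption here: we take the node activation to be a sigmoid of the steepness times the difference between the weighted feature combination and the splitting criterion. The names `soft_split`, `soft_tree`, `beta`, `phi`, and `alpha` are illustrative and not taken from the original equations.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_split(x: Sequence[float], beta: Sequence[float], phi: float, alpha: float) -> float:
    """Weight given to the 'true' branch.

    Assumed form: sigmoid(alpha * (beta . x - phi)).
    """
    combo = sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(alpha * (combo - phi))


def soft_tree(x, beta, phi, alpha, true_value: float, false_value: float) -> float:
    """Depth-one soft tree: a convex combination of the two leaf values."""
    mu = soft_split(x, beta, phi, alpha)
    return mu * true_value + (1.0 - mu) * false_value
```

As the steepness grows, the soft node approaches the Boolean test, so the same parameters can be read back as a classical split.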
II-C Reinforcement Learning
RL is an approach within machine learning where an agent is tasked with learning the optimal sequence of actions that will lead to the maximal expected future reward. The actions and observations of the agent are traditionally abstracted as a Markov Decision Process (MDP). Formally, an MDP is a five-tuple consisting of the set of states; the set of actions; the transition matrix describing the probability that taking a given action in a given state will result in a given next state; the discount factor, which states how important it is to receive a reward in the current state versus a future state; and the reward function dictating the reward an agent receives by taking a given action in a given state. In RL, the goal is to learn a policy that prescribes which action to take in each state in order to maximize the agent’s future expected reward, as defined in Equation 5.
Equation 5 defines the optimal policy in terms of the value of a policy when starting in a given state. There are two widely practiced approaches to learn such a policy: Q-learning and PG.
In Q-learning, one seeks to learn a Q-function, which returns the expected future reward when taking a given action in a given state and following a given policy thereafter. Since enumerating the state space is intractable for problems of a realistic nature, the Q-function is typically approximated by some parameterization (e.g., a weighted linear combination of features describing the states). To learn these parameters and, in turn, an approximation of the Q-function, one seeks to minimize the Bellman residual by applying the update rule in Equation 6, which gives the change in the parameters at each time step under Q-learning in terms of the current state and action, the state the agent arrives in after applying that action, and the step size.
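As an illustration of this kind of update, the sketch below performs one semi-gradient Q-learning step for a linear Q-function. It is the textbook form and is only assumed to correspond to Equation 6; the function names and the feature-vector representation of state-action pairs are hypothetical.

```python
from typing import List, Sequence


def q_value(theta: Sequence[float], features: Sequence[float]) -> float:
    """Linear Q-function: dot product of parameters and state-action features."""
    return sum(t * f for t, f in zip(theta, features))


def q_learning_step(
    theta: Sequence[float],
    features_sa: Sequence[float],
    reward: float,
    next_features: Sequence[Sequence[float]],
    gamma: float,
    step_size: float,
    terminal: bool = False,
) -> List[float]:
    """One semi-gradient Q-learning update (assumed textbook form).

    next_features holds one feature vector per action available in the next state.
    """
    best_next = 0.0 if terminal else max(q_value(theta, f) for f in next_features)
    td_error = reward + gamma * best_next - q_value(theta, features_sa)
    # For a linear model, the gradient of Q with respect to theta is the feature vector.
    return [t + step_size * td_error * f for t, f in zip(theta, features_sa)]
```

For a DDT approximator, the feature vector in the last line would be replaced by the gradient of the tree output with respect to its parameters.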
II-C2 Policy Gradient
In PG, the objective is to directly learn a parameterized policy, as opposed to learning a Q-function as in Q-learning. The update rule seeks to maximize the expected reward of a policy, as shown in Equation 7, which gives the change in the policy parameters at each time step under PG in terms of the state and action chosen at that time, the reward received at that time, and the PG coefficient. This update considers an entire episode to compute an estimate of the expected value of the policy for each time step.
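A minimal Monte Carlo policy gradient (REINFORCE-style) update is sketched below to show how an entire episode feeds each time step's update. It is an assumed textbook form, not a transcription of Equation 7: the policy is a state-independent softmax over action preferences, no baseline is used, and all names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def softmax(prefs: Sequence[float]) -> List[float]:
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return-to-go for every time step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def reinforce_update(
    theta: Sequence[float],
    episode: Sequence[Tuple[int, float]],
    gamma: float,
    step_size: float,
) -> List[float]:
    """episode is a list of (action index, reward) pairs from one rollout."""
    returns = discounted_returns([r for _, r in episode], gamma)
    probs = softmax(theta)  # policy used to generate the episode
    new_theta = list(theta)
    for (action, _), g in zip(episode, returns):
        for k in range(len(theta)):
            # Gradient of log softmax: indicator of the chosen action minus its probability.
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            new_theta[k] += step_size * g * grad_log
    return new_theta
```

The return-to-go weights each log-probability gradient, so actions followed by high reward become more likely under the updated policy.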
III Assessing Online Methods for DDTs: Q-Learning and Policy Gradient
III-A Problem Set Up
For our investigation into using a DDT for RL, we consider an MDP with four states and two actions as depicted in Figure 1. The rewards are for each state, respectively. The transition matrix, defined in Equation 8, indicates that the agent moves to a state with a higher index (i.e., ) when taking action and a state with a lower index (i.e., ) when taking action . Actions are taken successfully with probability . We note that and are terminal states.
We optimistically assume deterministic transitions; despite this hopeful assumption, we show unfavorable results for Q-learning- and PG-based agents using DDTs as function approximators. Further, we note that the Q-learning and PG updates (i.e., Equations 6 and 7) do not explicitly consider transition probabilities. We assume the agent operating on this MDP learns episodically, meaning that the agent takes a sequence of actions until either time expires or a terminal state is reached. The agent begins at the initial time step and takes at most four actions.
Given these assumptions, one can see by inspection (the derivation is given in Appendix A) that the optimal policy applies a different action in each of the two non-terminal states.
Remark 1 (Analogy to Cart Pole).
We note that this MDP determines that when the agent is in one portion of the state space, one action should be applied; when the agent is in a different portion of the state space, a different action should be applied. This behavior is analogous to many canonical RL problems, such as Cart Pole (Figure 2). In the Cart Pole problem, there is a point mass located at the end of a pole, connected to a cart through an un-actuated joint. Gravity causes the pole to fall to the left when the pole is leaning left and right when leaning right. The RL agent must provide a counteracting force to balance the pole.
III-B Decision Trees as Function Approximators
We aim to learn a decision tree that can serve as a function approximator for Q-learning or PG. For simplicity, we consider a decision tree with one decision node and two leaf nodes, as shown in Figure 3 and defined in Equation 9. This decision tree bifurcates the state space into two: states with an index less than or equal to the splitting criterion and those with an index greater than it.
Under Q-learning, the leaf nodes return an estimate of the expected future reward (i.e., the Q-value) for applying each action when in the portion of the state space dictated by the decision node’s criterion. For example, for a given setting of the splitting criterion and a given state, each leaf supplies one Q-value per action. Under PG, the leaves represent an estimate of the optimal probability distribution over actions the RL agent should take to maximize its future expected reward. Therefore, the values at these leaves represent the probability of selecting the corresponding action. We note here that one would impose the constraint that the action probabilities at each leaf sum to one.
For our investigation, we assume that the decision tree’s parameters are initialized to the optimal setting. As noted previously, and regardless of whether Q-learning or PG is used, the optimal policy applies a different action in each of the two non-terminal states under the deterministic-transition assumption. As such, for Q-learning, we set the leaf values to the Q-values of taking each action in the non-terminal states when otherwise following the optimal policy starting in a non-terminal state. When generating results in Section V for PG, we set the leaf values so that the decision tree focuses on exploiting the current policy. While varying the setting of leaf node values using PG is outside of the scope of this paper due to space considerations, we note that the results generalize to other settings of these parameters.
III-C Decision Tree Function Approximator Policies
There can be five qualitatively unique policies using a Boolean tree (i.e., Equation 3). These policies correspond to . For each policy, we can generate the sequence of states and the associated rewards the agent would receive, shown in Table I, assuming the agent starts in . Based on this information, we compute in Table II the value function (Equation 5) for each setting . For simplicity, we assume .
Remark 2 (Boolean vs. Continuous Decision Trees).
A key difference between a Boolean and differentiable decision tree is that the output of the differentiable tree is a weighted, nonlinear combination of the leaves (Equation 9). Using PG, one samples actions probabilistically from . The probability of applying the “wrong” action (i.e., one resulting in a negative reward) is in state and in state . Assuming it equally likely to be in states and , the overall probability is . These probabilities are depicted in Figure 4, which shows how the optimal setting, , for should be using PG.
III-D Computing Critical Points
To apply Q-learning and PG updates, we must compute the gradient of the DDT formulation from Equations 2 and 4. As we focus primarily on the splitting criterion, we need only consider the gradient with respect to that parameter, as shown in Equation 10.
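The gradient with respect to the splitting criterion can be checked numerically. The sketch below assumes a one-dimensional state, a sigmoid node of the form sigmoid(alpha * (x - phi)), and a depth-one tree whose output is the activation-weighted combination of two leaf values; these forms and all names are assumptions for illustration and may differ in detail from Equation 10.

```python
import math


def mu(x: float, phi: float, alpha: float) -> float:
    """Assumed soft decision: sigmoid(alpha * (x - phi))."""
    return 1.0 / (1.0 + math.exp(-alpha * (x - phi)))


def tree_output(x: float, phi: float, alpha: float, y_true: float, y_false: float) -> float:
    m = mu(x, phi, alpha)
    return m * y_true + (1.0 - m) * y_false


def dtree_dphi(x: float, phi: float, alpha: float, y_true: float, y_false: float) -> float:
    """Analytic derivative of the tree output with respect to phi.

    d mu / d phi = -alpha * mu * (1 - mu), by the chain rule on the sigmoid.
    """
    m = mu(x, phi, alpha)
    return -alpha * m * (1.0 - m) * (y_true - y_false)


def finite_difference(x, phi, alpha, y_true, y_false, eps=1e-6):
    """Central-difference estimate used only to validate the analytic form."""
    hi = tree_output(x, phi + eps, alpha, y_true, y_false)
    lo = tree_output(x, phi - eps, alpha, y_true, y_false)
    return (hi - lo) / (2.0 * eps)
```

The factor mu * (1 - mu) vanishes when the state is far from the splitting criterion, so only states near the boundary contribute meaningfully to the update.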
The full Q-learning and PG updates are then given by Equations 11 and 12, respectively, where each update depends on a specific time step with an associated state-action pair.
Recall that the agent experiences episodes with four time steps. Each step generates its own update, and these updates are combined to give the overall update in Equation 13.
Pseudo-critical points exist, then, whenever the overall update equals zero. A gradient descent algorithm would treat these as extrema because the gradient update would push towards these points. As such, we consider them critical points in our analyses.
The critical points given by are shown in Figures 4(a) and 4(b) for Q-learning and PG, respectively. For each curve, there are five critical points. We note that the curve is piece-wise curvilinear, with breaks at , and . The only true critical point exists at for Q-learning and, suboptimally, at for PG.
III-E Inferring the Optimality Curve
By integrating the update (Equation 14) with respect to the splitting criterion, we infer the “optimality curve,” which should equal the value of the policy implied by Q-learning and PG. We numerically integrate using Riemann’s method, as shown in Figure 6, and normalize the result.
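The numerical integration step can be sketched as a cumulative left Riemann sum followed by min-max normalization. The integration bounds, the number of sub-intervals, and normalization to the unit interval are assumptions for this example, as are the function names.

```python
from typing import Callable, List, Tuple


def riemann_cumulative(
    f: Callable[[float], float], lo: float, hi: float, n: int
) -> Tuple[List[float], List[float]]:
    """Cumulative left Riemann sum of f on [lo, hi] using n equal sub-intervals.

    Returns the grid points and the running integral from lo to each grid point.
    """
    width = (hi - lo) / n
    grid, integral, total = [lo], [0.0], 0.0
    for i in range(n):
        total += f(lo + i * width) * width
        grid.append(lo + (i + 1) * width)
        integral.append(total)
    return grid, integral


def normalize(values: List[float]) -> List[float]:
    """Min-max normalization to the unit interval (assumed target range)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [0.0 for _ in values]
    return [(v - vmin) / (vmax - vmin) for v in values]
```

Applying `riemann_cumulative` to an update curve and then `normalize` yields a curve whose extrema can be compared directly against the policy value.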
III-F Evaluation of Gradient-Based Methods
Figures 4(a) and 4(b) depict the Q-learning and PG updates, respectively. We can see that the Q-learning update introduces multiple critical points as a function of the splitting criterion, whereas the PG update includes only a single critical point for finite values of the splitting criterion.
Figure 5(a) depicts the value of the DDT policy when trained using Q-learning and PG. This figure shows the expected behavior, with both training methods attaining their maximum at the same setting of the splitting criterion. However, Figure 5(b), derived using Equation 13 from the update curves in Figure 4, stands in contradiction.
One would expect that the respective curves for the policy value and integrated gradient updates (i.e., Equation 13) would be identical; however, this does not hold for Q-learning. As we saw in Figure 4(a), Q-learning with DDTs introduces undesired extrema (critical points in Figure 4(a)), depicted by the blue curve in Figure 5(b). PG, on the other hand, maintains a single maximum, coinciding with the expected optimum.
This finding does not imply that PG is always superior to Q-learning as a training method for DDTs. However, this analysis does provide evidence that, even for toy problems, Q-learning with DDTs exhibits weaknesses. As such, we conclude that PG serves as a more promising approach for training DDTs. Based on this evidence, the results reported below are for DDTs trained with PG.
IV Interpretability in Full-gradient Learning
Given a training algorithm (i.e., PG as per Section III’s conclusion), we now seek to address the two key drawbacks of the original DDT formulation of Suárez and Lutsko in making the tree interpretable. First, the operation at each node produces a linear combination of the features (rather than a single feature). Second, there is a smooth transition on the feature space between the true and false states of a node (rather than a step function transition occurring at the splitting criterion). To overcome these limitations, we propose regularizing the feature weights to generate decisions on a single feature and tuning the steepness parameter to encourage steepness (i.e., “crispness”) of the tree.
IV-A Modification and Regularization for Interpretability
To achieve our goal of interpretability while maintaining competitive performance, we applied three modifications to the original DDT formulation: 1) sparsity-inducing regularization, 2) unbiased (i.e., uniform) tree initialization, and 3) discretization of the tree (i.e., to obtain the interpretable, classical version of a decision tree) at each time step to assess model performance.
IV-A1 Sparsity-inducing Regularization
During online RL via stochastic gradient descent, we added a regularization term to the loss that encouraged a sparse feature representation in β. Although regularization is often applied at the beginning of a training episode and increased according to some schedule, we found that allowing the model to converge in performance and then beginning the regularization procedure improved results in our tests.
Although Equation 15 improves sparsity, if used alone, it likely multiplies the chosen feature by some amount other than the desired unit weighting (i.e., a weight of one for a single feature and zero otherwise). To mitigate this problem, we applied a softmax operator to the learned β parameter, such that the resulting decision node was governed by Equations 16 and 17. This modeling choice encouraged emphasis on a single feature.
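The combination of a sparsity penalty and a softmax over the feature weights can be sketched as follows. This is an illustration under stated assumptions: the penalty is taken to be an L1 norm scaled by a hypothetical `strength` coefficient, and the node activation is taken to be a sigmoid of the softmax-weighted features; neither is confirmed to match Equations 15-17.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def l1_penalty(beta: Sequence[float], strength: float) -> float:
    """Assumed sparsity-inducing term added to the loss."""
    return strength * sum(abs(b) for b in beta)


def node_activation(x: Sequence[float], beta: Sequence[float], phi: float, alpha: float) -> float:
    """Soft decision using softmax-normalized feature weights.

    The softmax keeps the weights non-negative and summing to one, so a
    dominant entry of beta acts like a unit weight on a single feature.
    """
    weights = softmax(beta)
    combo = sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(alpha * (combo - phi))
```

With one entry of `beta` much larger than the rest, the node effectively compares a single feature against the splitting criterion.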
IV-A2 Tree Initialization
By employing a sparsity-inducing regularization term, we sought to improve the tree’s interpretability; however, this regularization can result in undesired under-fitting of the model by overpowering the reward signal. Through experimentation, we found that a random tree initialization introduced bias into the training process from which the tree could not recover even when allowed to converge before regularizing. To avoid this, we initialized the feature weights uniformly at each node.
IV-A3 Decision Sharpening
Due to the nature of the sigmoid function, even a sparse β was not sufficient to guarantee a discrete decision at each node. Thus, to obtain a truly discrete tree, we converted the fuzzy tree into a discrete tree at each iteration by employing an argmax to obtain the index of the feature that the node will use, as in Equation 18.
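The sharpening step can be illustrated with a short sketch that selects the dominant feature by argmax and then applies a hard threshold test. Keeping the splitting criterion unchanged (with no rescaling by the selected weight) and testing with a strict greater-than comparison are assumptions of this example; the names `sharpen_node` and `crisp_decision` are hypothetical.

```python
from typing import Sequence, Tuple


def sharpen_node(beta: Sequence[float], phi: float) -> Tuple[int, float]:
    """Return (feature index, threshold) for the crisp version of one node.

    The feature is the argmax of the learned weights; the threshold is kept
    as-is, which is a simplifying assumption.
    """
    feature_index = max(range(len(beta)), key=lambda i: beta[i])
    return feature_index, phi


def crisp_decision(x: Sequence[float], feature_index: int, threshold: float) -> bool:
    """Boolean test of the discretized node on a single feature."""
    return x[feature_index] > threshold
```

Applying `sharpen_node` to every decision node yields a classical tree that can be read as nested if-then rules.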
V-A Supervised Learning
Next, as a proof of concept, we evaluated our algorithm using the Tic-Tac-Toe Endgame, the Breast Cancer Wisconsin (Diagnostic), and the Caesarian Section Classification data sets from the UCI repository. For comparison, we used a decision tree trained with C4.5 using the Gini coefficient as the splitting criterion. For evaluation, we computed the area under the precision-recall curve (AUC) over three-fold cross-validation, holding out a portion of the data for validation on each fold. The goal of this validation was not to achieve on-par performance with traditional decision-tree-learning mechanisms, which have the advantage of offline learning; rather, the goal of this intermediate investigation was to confirm that the model could learn competent mappings from inputs to outputs. Given such confirmation, we move on to our target task: RL.
For this intermediate supervised-learning task, we initialized the DDT to be a full binary tree of depth equal to the maximum depth of the tree generated by C4.5: a depth of 13 for Tic-Tac-Toe, 8 for Cancer, and 11 for Caesarian.
| Sharp DDT AUC | 0.6141 | 0.6577 | 0.7948 |
V-B Reinforcement Learning
Our ultimate goal is to show that DDTs can learn competent, interpretable policies online for RL tasks. To show this, we evaluated our DDT algorithm on the CartPole, LunarLander, and Acrobot OpenAI Gym environments, using PPO as our chosen optimization algorithm. Since our interest is interpretability, the policies are trained on the observed environment states. For comparison, we used an MLP with one hidden layer. We compared the performance of full binary trees of depth 4 and depth 6. To show the variance of the policy being trained, we ran 5 seeds for each policy-environment combination. Finally, to assess the performance of each sharp tree, we computed an average reward over 10 episodes initialized to different seeds. Figure 7 shows that the DDTs achieved performance comparable to the MLP in terms of sample complexity.
After training, we retained the best performing seeds for the MLP and the crisp DDTs as measured on the evaluation dataset. The performance achieved by these policies for each environment is reported in Figure 8.
V-B1 Resulting Trees
For our most challenging domain for learning (Acrobot), we were still able to find a good, crisp policy as depicted in Figure 9. Due to space limitations, we were unable to report the crisp version of each DDT we learned.
In this paper, we provide a theoretical argument in favor of PG over Q-learning as a training method for DDTs in the RL setting, based on analysis of a simple, single-criterion, stochastic gradient descent update on a four-state MDP analogous to the Cart Pole domain. We also provide results for fuzzy and discrete DDTs for several reinforcement learning and classification settings.
Given the flexibility of MLPs and their large number of parameters, we anticipated that MLPs would have an advantage in raw performance. After all, our DDTs have a much smaller number of parameters, were regularized to encourage sparsity, and were constrained even further through discretization. Nevertheless, we provide promising results showing that, even after converting the trained DDT into a discretized (i.e., interpretable) tree, the training process yielded tree policies that outperformed even the best MLP. Before triggering regularization, DDTs even managed to improve on the MLP’s sample efficiency, as is the case with CartPole and LunarLander. Similarly, in supervised learning, we found that discretized DDTs sometimes outperformed traditional decision trees.
Choosing Monte Carlo sampling of seeds to explore better trees was a potential limitation, although this practice is ubiquitous in the field of RL. As a result of sampling policies every certain number of epochs, one policy could have happened to perform well on the evaluation environments by chance, a possibility given the high variance DDTs exhibited across seeds during training. Although this is possible for a very expressive policy class such as MLPs, it is exceedingly unlikely in the case of DDTs, as they are designed to underfit in achieving interpretability. Further, we evaluated each discretized tree across 10 random episodes.
As a hypothesis for future work, we suspect DDTs perform better in the Acrobot and CartPole scenarios in part because bang-bang controllers would likely also perform well in those scenarios. Although MLPs are more expressive in a continuous domain of policies, they may be poor approximators for discrete decision boundaries. In the future, we would also like to explore ways of pruning DDTs. As seen in Figure 9, there are redundant nodes in the policy that could be removed. Finally, we aim to further analyze the convergence and performance of DDT policies.
Ultimately, we show how differentiable decision trees can be used in the context of reinforcement learning to generate interpretable policies. We provide a motivating example for why PG should be used to train this particular policy class and demonstrate results in both classification and reinforcement learning settings.
- Arulkumaran et al.  K Arulkumaran, MP Deisenroth, M Brundage, and AA Bharath. A brief survey of deep reinforcement learning. IEEE Signal Processing Magazine, 34(6):26–38, 2017.
- Bottou  L Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
- Breiman et al.  L Breiman, J Friedman, CJ Stone, and RA Olshen. Classification and regression trees. CRC press, 1984.
- Brockman et al.  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
- Clay-Williams and Colligan  R Clay-Williams and L Colligan. Back to basics: checklists in aviation and healthcare. BMJ Qual Saf, 24(7):428–431, 2015.
- Dheeru and Karra Taniskidou  Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
- Ernst et al.  D Ernst, P Geurts, and L Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
- Finney et al.  S Finney, NH Gardiol, LP Kaelbling, and T Oates. The thing that we tried didn’t work very well: deictic representation in reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 154–161. Morgan Kaufmann Publishers Inc., 2002.
- Fletcher and Powell  R Fletcher and MJD Powell. A rapidly convergent descent method for minimization. The computer journal, 6(2):163–168, 1963.
- Gawande  A Gawande. Checklist Manifesto, The (HB). Penguin Books India, 2010.
- Haynes et al.  AB Haynes, TG Weiser, WR Berry, SR Lipsitz, AHS Breizat, EP Dellinger, T Herbosa, S Joseph, PL Kibatala, MCM Lapitan, et al. A surgical safety checklist to reduce morbidity and mortality in a global population. New England Journal of Medicine, 360(5):491–499, 2009.
- Kontschieder et al.  P Kontschieder, M Fiterau, A Criminisi, and S Rota-Bulo. Deep neural decision forests. In Proceedings of the IEEE International Conference on Computer Vision, pages 1467–1475, 2015.
- Kostrikov  Ilya Kostrikov. Pytorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr, 2018.
- Letham et al.  B Letham, C Rudin, TH McCormick, D Madigan, et al. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
- Olaru and Wehenkel  C Olaru and L Wehenkel. A complete fuzzy decision tree technique. Fuzzy sets and systems, 138(2):221–254, 2003.
- Paszke et al.  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- Pyeatt et al.  LD Pyeatt, AE Howe, et al. Decision tree function approximation in reinforcement learning. In Proceedings of the third international symposium on adaptive systems: evolutionary computation and probabilistic graphical models, volume 2, pages 70–77. Cuba, 2001.
- Rudin  C Rudin. Algorithms for interpretable machine learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1519–1519. ACM, 2014.
- Schulman et al.  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
- Shah and Gopal  H Shah and M Gopal. Fuzzy decision tree function approximation in reinforcement learning. International Journal of Artificial Intelligence and Soft Computing, 2(1-2):26–45, 2010.
- Silver et al.  D Silver, A Huang, CJ Maddison, A Guez, L Sifre, G Van Den Driessche, J Schrittwieser, I Antonoglou, V Panneershelvam, M Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Suárez and Lutsko  A Suárez and JF Lutsko. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1297–1311, 1999.
- Sutton and Barto  RS Sutton and AG Barto. Reinforcement learning: An introduction. MIT Press, 1998.
- Sutton et al.  RS Sutton, DA McAllester, SP Singh, and Y Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 2000.
- Watkins  CJCH Watkins. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
- Weiss and Indurkhya  SM Weiss and N Indurkhya. Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research, 3:383–403, 1995.
Appendix A: Derivation of the Optimal Policy
In this section, we provide a derivation of the optimal policy for the MDP in Figure 1. For this derivation, we use the definition of the Q-function described in Equation 19, where is the state resulting from applying action in state . In keeping with the investigation in this paper, we assume deterministic transitions between states (i.e., from Equation 8). As such, we can ignore and simply apply Equation 20.
We begin by asserting in Equation 21 that the Q-values for are given and for any action . This result is due to the definition that states and are terminal states and the reward for those states is regardless of the action applied.
For the Q-value of state-action pair, , we must determine whether is less than or equal to . If the agent were to apply action in state , we can see from Equation 31 that the agent would receive at a minimum , because , must be the maximum from Equation 30. We can make a symmetric argument for in Equation 31. Given this relation, we arrive at Equations 32 and 33.
Recall that given our definition of the MDP in Figure 1. Therefore, . If the RL agent is non-myopic, i.e., , then we have the strict inequality . For these non-trivial settings of , we can see that the optimal policy for the RL agent is to apply action in state and action in state . Lastly, because and are terminal states, the choice of action is irrelevant, as seen in Equation 21. ∎
The optimal policy is then given by Equation 37.
Appendix B: Q-learning Leaf Values
For the decision tree in Figure 3, there are four leaf values: , , , and . Table IV contains the settings of those parameters. In Table IV, the first column depicts the leaf parameters; the second column depicts the Q-function state-action pair; the third column contains the equation reference to Appendix A, where the Q-value is calculated; and the fourth column contains the corresponding Q-value. These Q-values assume that the agent begins in a non-terminal state (i.e., or ) and follows the optimal policy represented by Equation 37.