1 Introduction: Mimic a Deep Reinforcement Learner
Deep Reinforcement Learning has mastered human-level control policies in a wide variety of tasks . Despite excellent performance, the learned knowledge remains implicit in neural networks and hard to explain. There exists a trade-off between model performance and interpretability . One of the methods to address this trade-off is mimic learning , which trains an interpretable mimic model to match the predictions of a highly accurate model. Many works [5, 2, 7]
have applied types of mimic learning to distill knowledge from deep models to a mimic model with tree representation. Current methods focus only on interpreting deep models for supervised learning. However, DRL is an unsupervised process, where agents continuously interact with an environment, instead of learning from a static training/testing dataset.
This work develops a novel mimic learning framework for Reinforcement Learning. We examine two different approaches to generating data for RL mimic learning. Within the first Experience Training setting, which allows applying traditional batch learning methods to train a mimic model, we record all state action pairs during the training process of DRL and complement them with Q values as soft supervision labels. Storing and reading the training experience of a DRL model consumes much time and space, and the training experience may not even be available to a mimic learner. Therefore our second Active Play setting generates streaming data through interacting with the environment using the mature DRL model. The active play setting requires an on-line algorithm to dynamically update the model as more learning data is generated.
is a classic online reinforcement learning method which represents a Q function using a tree structure. To strengthen its generalization ability, we add a linear model to each leaf node, which defines a novel Linear Model U-Tree (LMUT). To support the active play setting, we introduce a novel on-line learning algorithm for LMUT, which applies Stochastic Gradient Descent to update the linear models, given some memory of recent input data stored on each leaf node. We conducted an empirical evaluation in three benchmark environments with five baseline methods. Two natural evaluation metrics for an RL mimic learner are: 1) fidelity: how well the mimic model matches the predictions of the neural net, as in supervised learning, and 2) play performance: how well the average return achieved by a controller based on the mimic model matches the return achieved by the neural net. Play performance is the most relevant metric for reinforcement learning. Perfect fidelity implies a perfect match in play performance. However, our experiments show that approximate fidelity does not imply a good match in play performance. This is because RL mimic learning must strike a balance between coverage: matching the neural net across a large section of the state space, and optimality: matching the neural net on the states that are most important for performance. In our experiments, LMUT learning achieves a good balance: the best match to play performance among the mimic methods, and competitive fidelity to the neural net predictions. The transparent tree structure of LMUT makes the DRL neural net interpretable. To analyze the mimicked knowledge, we calculate the importance of input features and extract rules for typical examples of agent behavior. For image inputs, the super-pixels in input images are highlighted to illustrate the key regions.
The main contributions of this paper are as follow:
1) To our best knowledge, the first work that extends interpretable mimic learning to Reinforcement Learning.
2) A novel on-line learning algorithm for LMUT, a novel model tree to mimic a DRL model.
3) We show how to interpret a DRL model by analyzing the knowledge stored in the tree structure of LMUT.
The paper is organized as follow. Section 2 covers the background and related work of DRL, mimic learning and U-tree. Section 3 introduces the mimic learning framework and Section 4 shows how to learn a LMUT. Empirical evaluation is performed in section 5 and section 6 discusses the interpretability of LMUT.
2 Background and Related Work
Reinforcement Learning and the Q-function. Reinforcement Learning constructs a policy for agents to interact with environment and maximize cumulative reward 
. Such an environment can be formalized as a Markov Decision Process (MDP) with 4-tuple, where at timestep t, an agent observes a state , chooses a action and receives a reward and the next observation from environment. A Q function represents the value of executing action under state . Given a policy , the value is the expectation of the sum of discounted reward . -learning is similar to temporal difference methods that update the current Q value estimates towards the observed reward and estimated utility of the resulting state . An advanced model Deep Q-Network (DQN)  was proposed, which uses neural network to approximate the function approximation. Parameter (
) updates minimize the differentiable loss function:
U-Tree Learning. A tree structure is transparent and interpretable, allowing rule extraction and measuring feature influence . U-tree  learning was developed as an online reinforcement learning algorithm with a tree structure representation. A U-tree takes a set of observed feature/action values as input and maps it to a state value (or Q-value). 
introduces the continuous U-tree (CUT) for continuous state features. CUT learning dynamically generates a tree-based discretization of the input signal and estimates state transition probabilities by retaining transitions in every leaf node. CUT learning applies dynamic programming to solve it to solve the resulting Markov Decision Process (MDP). Although CUT has been successfully applied in test environments like Corridor and Hexagonal Soccer, constructing Continuous U-tree from raw data is rather slow and consumes much computing time and space.
3 Mimic Learning for Deep Reinforcement Learning
Unlike supervised learning, a DRL model is not trained with static input/output data pairs; instead it interacts with the environment by selecting actions to perform and adjusting its policy to maximize the expectation of cumulative reward. We now present two settings to mimic the Q functions in DRL models.
3.0.1 Experience Training
is a vector of continuous features that represents a state (for discrete features we use one-hot representation). Then, by inputting them to a mature DRL model, we obtain their corresponding soft outputand use the entire input/output pairs as the experience training dataset. .
3.0.2 Active Play
generates mimic data by applying a mature DRL model to interact with the environment. Similar to , our active learner has three components: . The first component is a querying function that gives the current observed signal , selects an action . The querying function controls ’s interaction with the environment so it must consider the balance between exploration and exploitation. Here the -greedy scheme  ( decaying from 1 to 0) is used as our querying function. The second component is the deep model that produces Q values: .
As shown in Figure 2, the mimic training data is generated in the following steps: Step 1: Given a starting observation signal on time step , we select an action , and obtain a soft output Q value . Step 2: After performing , the environment provides a reward and the next state observation . We record a labelled transition where the soft label comes from the well trained DRL model. Step 3: We set as the next starting observation signal, repeat above steps until we have training data for the active learner to finish sufficient updates over mimic model . This process will produce an infinite data stream (transitions ) in sequential order. We use minibatch online learning, where the learner returns a mimic model after some fixed batchsize of queries.
Compared to Experience Training, Active Play does not require recording data during the training process of DRL models. This is important because: (1) Many mimic learners have access only to the trained deep models. (2) Training a DRL model often generates a large amount of data, which requires much memory and is computationally challenging to process. (3) The Experience Training data includes frequent visits to suboptimal states, which makes it difficult for the mimic learner to obtain an optimal return.
4 Learning Linear Model U-Trees
A neural network with continuous activation functions computes a continuous function. A regression tree can approximate a continuous function arbitrarily closely, given enough leaves. Continuous U-Trees (CUTs) are essentially regression trees for value functions, and therefore a natural choice for a tree structure representation of a DRL Q function. However, their ability to generalize is limited, and CUT learning converges slowly. In this paper, we introduce a novel extension of CUT, Linear Model U-Tree (LMUT), that allows CUT leaf nodes to contain a linear model, rather than simple constants. Being strictly more expressive than a regression tree, a linear model tree can also approximate a continuous function arbitrarily closely, with typically with many fewer leaves. Smaller trees are more interpretable, and therefore more suitable for mimic learning.
As shown in Figure (a)a and Table 2, each leaf node of a LMUT defines a partition cell of the input space, which can be interpreted as a discrete state for the decision process. Within each partition cell, LMUT also records the reward and the transition probabilities of performing action on the current state , as shown in the Leaf Node 5 of Figure (a)a. So LMUT builds a Markov Decision Process (MDP) from the interaction data between environment and deep model. Compared to a linear Q-function approximator , a LMUT defines an ensemble of linear Q-function models, one for each partition cell. Since each Q-value prediction comes from a single linear model, the prediction can be explained by the feature weights of the model.
We now discuss how to train an LMUT. Similar to , we separate the training into two phases: 1) Data Gathering Phase and 2) Node Splitting Phase.
4.1 Data Gathering Phase
Data Gathering Phase collects transitions on leaf nodes and prepares them for fitting linear models and splitting nodes. Given an input transition , we pass it through feature splits down to a leaf node. As an option, an LMUT can dynamically build an MDP, in which case it updates transition probabilities, rewards and average Q values on the leaf nodes. The complete Data Gathering Phase process is detailed in part I (the first for loop) of Algorithm 1.
4.2 Node Splitting Phase
After node updating, LMUT scans the leaf nodes and updates their linear model with Stochastic Gradient Descent (SGD). If SGD achieves insufficient improvement on node N, LMUT determines a new split and adds the resulting leaves to the current partition cell. For computational efficiency, our node splitting phase considers only a single split for each leaf given a single minibatch of new transitions. Part II of Alg.1 shows the detail of the node splitting phase. LMUT applies a minibatch stagewise fitting approach to learn linear models in the leaves of the U-tree. Like other stagewise approaches , this approach provides smoothed weight estimates where nearby leaves tend to have similar weights. We use Stochastic Gradient Descent to implement the weight updates.
4.2.1 Stochastic Gradient Descent (SGD) Weight Updates
is a straightforward well-established online weight learning method for a single linear regression model. The weights and bias of linear regression on leaf nodeare updated by applying SGD over all Transitions assigned to . For a transition , we take as input and as label. We build a separate LMUT for each action, so the linear model on is function of the state features: . We update the weights on leaf node by applying SGD with loss function . The updates are computed with a single pass over each minibatch.
4.2.2 Splitting Criterion
is used to find the best split on the leaf node, if SGD achieves limited improvement. We have tried three splitting criteria including working response of SGD, Kolmogorov–Smirnov (KS) test and Variance Test. The first method aims to find the best split to improve working response of the parent linear model on the data for its children. But as reported in
, the final result becomes less intelligible. The second method Kolmogorov–Smirnov (KS) test is a non-parametric statistical test that measures the differences in empirical cumulative distribution functions between the child data. The final Variance criterion selects a split that generates child nodes whose Q values contain the least variance. The idea is similar to the variance reduction method applied in CART tree. Like, we found that Variance test works well with less time complexity than KS test ( v.s. ), so we select Variance test as the splitting criterion. Exploring the different possible splits efficiently is the main scalability challenge in LMUT learning (cf. ).
5 Empirical Evaluation
We evaluate the mimic performance of LMUT by comparing it with five other baseline methods under three evaluation environments. Empirical evaluation measures both regression and game playing matches under experience training and active play learning.
5.1 Evaluation Environment
The evaluation environments include Mountain Car, Cart Pole and Flappy Bird. Our environments are simulated by OpenAI Gym toolkit . Mountain Car and Cart Pole are two benchmark tasks for reinforcement learning . Mountain Car is about accelerating a car to the top of the hill and Cart Pole is about balancing a pole in the upright position. Mountain Car and Cart Pole have a discrete action space and continuous feature space. Flappy Bird is a mobile game that controls a bird to fly between pipes. Flappy Bird has two discrete actions, and its observation consists of four consecutive images . We follow the Deep Q-Learning (DQN) method to play this game. During the image preprocessing, the input images are first rescaled to 80*80, transferred to gray image and then binary images. With 6,400 features, the state space of Flappy Bird is substantially more complex than that for Cart Pole and Mountain Car.
5.2 Baseline Methods
CART is our first baseline method, where we fit the input/output training pairs into a CART regression tree . A CART tree predicts the mean of sample on each leaf node. M5  is a tree training algorithm with more generalization ability. It first constructs a piecewise constant tree and then prunes to build a linear regression model for the instances in each leaf node. The WEKA toolkit  provides an implementation of M5. We include M5 with Regression-Tree option (M5-RT) and M5 tree with Model-Tree option (M5-MT) in our baselines. M5-MT builds a linear function on each leaf node, while M5-RT has only a constant value. To compare the online training performance, a recently proposed online learning model Fast Incremental Model Tree (FIMT)  is applied. Similar to M5-MT, it builds a linear model tree, but can perform explicit change detection and informed adaption for evolving data stream. We experiment with a basic version of FIMT and an advanced version with Adaptive Filters on leaf nodes (named FIMT-AF).
5.3 Fidelity: Regression Performance
We evaluate how well our LMUT approximates the soft output ( values) from Q function in a Deep Q-Network (DQN). We report the standard regression metrics Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). Under the Experience Training setting, we compare the performance of CART, M5-RT, M5-MT, FIMT and FIMT-AF with our LMUT. The dataset sizes are 150K transitions for Mountain Car, 70K transitions for Car Pole, and 20K transitions for Flappy Bird. Because of the high dimensionality of the Flappy Bird state space, 32GB main memory fits only 20K transitions. Given an experience training dataset, we apply 10 fold cross evaluation to train and test our model. For the Active Play setting, batch training algorithms like CART and M5 are not applicable, so we experiment only with online methods, including FIMT, FIMT-AF and LMUT. We first train the mimic models with 30k consecutive transitions from evaluation environments, and evaluate them with another 10k transitions. The result for the three evaluation environments are shown in Table 4, Table 4 and Table 5
. A t-test demonstrates that the differences between the results of LMUT and the results of other models are significant ().
Compared to the other two online learning methods (FIMT and FIMT-AF), LMUT achieves a better fit to the neural net predictions with a much smaller model tree, especially in the active play online setting. This is because both FIMT and FIMT-AF update their model tree continuously after each datum, whereas LMUT fits minibatches of data at each leaf. Neither FIMT nor FIMT-AF terminate on high-dimensional data.111For example, in the Flappy Bird environment, FIMT takes 29 minutes and 10.8GB main memory to process 10 transitions on a machine using i7-6700HQ CPU. So we omit the result of applying FIMT and FIMT-AF in the Flappy Bird environment. We observe that the CART tree model has significantly more leaves than our LMUT, but not better fit to the DQN than M5-RT, M5-MT and LMUT, which suggests overfitting. In the Mountain Car and Flappy Bird environments, model tree batch learning (M5-RT and M5-MT) performs better than LMUT, while LMUT achieves comparable fidelity, and leads in the Cart Pole environment. In conclusion, (1) our LMUT learning algorithm outperforms the state-of-the-art online model tree learner FIMT. (2) Although LMUT is an online learning method, it showed competitive performance to batch methods even in the batch setting.
Learning Curves. We apply consecutive testing  to analyze the performance of LMUT learning in more detail. We compute the correlation and testing error of LMUT as more transitions for learning are provided (From 0 to 30k) under the active play setting. To adjust the error scale across different game environments, we use Relative Absolute Error (RAE) and Relative Square Error (RSE). We repeat the experiment 10 times and plot the shallow graph in Figure 6. In the Mountain Car environment, LMUT converges quickly with its performance increasing smoothly in 5k transitions. But for complex environments like Cart Pole and Flappy Bird, the evaluation metrics fluctuate during the learning process but will approximate to the optimum within 30k transitions.
5.4 Matching Game Playing Performance
We now evaluate how well a model mimics Q functions in DQN by directly playing the games with them and computing the average reward per episode. (The games in OpenAI Gym toolkit are divided into episodes that start when a game begins and terminate when: (1) the player reaches the goal, (2) fails for a fixed number of times or (3) the game time passes a preset threshold). Specifically, given an input signal , we obtain values from mimic models and select an action . By executing in the current game environment, we receive a reward and next observation signal . This process is repeated until a game episode terminates. This experiment uses Average Reward Per Episodes (ARPE), a common evaluation metric that has been applied by both DRL models  and OpenAI Gym tookit , to evaluate mimic models. In the Experience Training setting, the play performance of CART, M5-RT, M5-MT, FIMT, FIMT-AF and our LMUT are evaluated and compared by partial 10-fold cross evaluation, where we select 9 sections of data to train the mimic models and test them by directly playing another 100 games. For the Active play, only the online methods FIMT and FIMT-AF are compared, without the Flappy Bird environment (as discussed in Section 5.3). Here we train the mimic models with 30k transitions, and test them in another 100 games.
The result of game playing performance is shown in Table 7. We first experiment with learning a Continuous U-Tree (CUT) directly using reinforcement learning  instead of mimic learning. CUT converges slowly with limited performance, especially in the high-dimensional Flappy Bird environment. This shows the difficulty of directly constructing a tree model from the environment.
We find that among all mimic methods, LMUT achieves the Game Play Performance APER closest to the DQN. Although the batch learning models have strong fidelity in regression, they do not perform as well in game playing as the DQN. Game playing observation shows that the batch learning models (CART, M5-RT, M5-MT) are likely to choose sub-optimal actions under some key scenarios (e.g., when a pole tilts to one side with high velocity in Cart Pole.). This is because the neural net controller selects many sub-optimal actions at the beginning of training, so the early training experience contains many sub-optimal state-action pairs. The batch models fit the entire training experience equally, while our LMUT fits more closely the most recently generated transitions from a mature controller. More recent transitions tend to correspond to optimal actions. The FIMT algorithms keep adapting to the most recent input only, and fail to build adequate linear models on their leaf nodes. Compared to them, LMUT achieves a sweet spot between optimality and coverage (Figure (a)a).
|Mountain Car||Cart Pole||Flappy Bird|
In this section, we discuss how to interpret a DRL model through analyzing the knowledge stored in the transparent tree structure of LMUT: computing feature influence, analyzing the extracted rules and highlighting the super-pixels.
6.1 Feature Influence
Feature importance is one of the most common interpretation tools for tree-based models [5, 21]. In a LMUT model, feature values are used as splitting thresholds to form partition cells for input signals. We evaluate the influence of a splitting feature by the total variance reduction of the Q values. The absolute weight values from linear regression provide extra knowledge of feature importance. So we compute a weight importance rate and multiply it by Variance Reduction, and measure the influence of splitting feature on node by:
where is the weight of feature on node N, is the number of Q values on node and is the variance of Q values on node . We quantify the influence of a splitting feature by summing for all nodes split by in our LMUT. For Mountain Car and Cart Pole, we report the feature influences in table 8. The most important feature for Mountain Car and Cart Pole are Velocity and Pole Angle respectively, which matches the common understanding of the domains. For Flappy Bird whose observations are 80*80 images, LMUT uses pixels as splitting features. Figure 8 illustrates the pixels with above-average feature influences (the mean of all feature influences). The most influential pixels are located on the top left where the bird is likely to stay, which reflects the importance of locating the bird.
|Pole Velocity At Tip||2953.73|
6.2 Rule Extraction
Rule extraction is a common method to extract knowledge from tree models [7, 6, 2]. We extract and analyze rules for the Mountain Car and Cart Pole environment. Figure 13 (top) shows three typical examples of extracted rules in Mountain Car environment. The rules are presented in the form of partition cells (constructed by the splitting features in LMUT). Each cell contains the range of velocity, position and a Q vector () representing the average -value in the cell. The top left example is a state where the cart is moving toward the left hill with very small velocity. The extracted rule suggests pushing right ( has the largest value -29.4): the cart is almost stopped on the left, and by pushing right, it can increase its momentum (or Kinetic Energy). The top middle example illustrates a state where the car is approaching the top of the left hill with larger left side velocity (compared to the first example). In this case, however, the cart should be pushed left ( is the largest), in order to store more Gravitational Potential Energy and prepare for the final rush to the target. The rush will lead to the state shown in the top right image, where the cart is rushing up the right hill. In this state, the cart should be pushed right to reach the target. We also observe if the cart can reach the target in fewer steps, its Q values are larger.
Figure 13 (bottom) shows three examples of extracted rules in the Cart Pole environment, where each cell contains the scope of cart position, cart velocity, pole angle, pole velocity and a Q vector (). The key for Cart Pole is using inertia and acceleration to balance the pole. In the bottom left example, the cart should be pushed right, according to the rules (), if the pole tilts to the right with a velocity less than 0.5. A similar scenario happens in the second example, where the pole is also tilting to the right but has velocity towards the left. We should push right () to maintain this trend even if the cart is close to the right side border, which makes its Q values smaller than that in the first example. The third example describes a case where a pole tilts to the left with velocity towards the right. This time we need a left acceleration so the model selects pushing cart left ().
6.3 Super-pixel Explanation
In video games, DRL models take the raw pixels from four consecutive images as input. To mimic the deep models, our LMUT also learns on four continuous images and performs splits directly on raw pixels. Deep models for image input can be explained by super-pixels . We highlight the pixels that have feature influence (the mean of all feature influences) along the splitting path from root to the target partition cell. Figure 22 provides two examples of input images with their highlighted pixels at the beginning of game (top) and in the middle of game (bottom). We find 1) most splits are made on the first image which reflects the importance of the most recent image input 2) the first image is often used to locate the pipes (obstacles) and the bird, while the remaining three images provide further information about the bird’s location and velocity.
This work introduced a mimic learning framework for a Reinforcement Learning Environment. A novel Linear Model U-tree represents an interpretable model with the expressive power to approximate a Q value function learned by a deep neural net. We introduced a novel on-line LMUT mimic learning algorithm based on stochastic gradient descent. Empirical evaluation compared LMUT with five baseline methods on three different Reinforcement Learning environments. The LMUT model achieved clearly the best match to the neural network in terms of its performance on the RL task. We illustrated the abillity of LMUT to extract the knowledge implicit in the neural network model, by (1) computing the influence of features, (2) analyzing the extracted rules and (3) highlighting the super-pixels. A direction for future work is to explore variants of our LMUT, for example by adding tree pruning, and by experimenting more extensively with hyper-parameters. Another important topic is sampling strategies for the active play setting, which would illuminate the difference we observed between matching the neural net’s play performance, vs. matching the function it represents.
-  Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in neural information processing systems. pp. 2654–2662 (2014)
-  Boz, O.: Extracting decision trees from trained neural networks. In: Proceedings SIGKDD. pp. 456–461. ACM (2002)
-  Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
-  Chaudhuri, P., Huang, M.C., Loh, W.Y., Yao, R.: Piecewise-polynomial regression trees. Statistica Sinica pp. 143–167 (1994)
-  Che, Z., et al.: Interpretable deep models for ICU outcome prediction. In: AMIA Annual Symposium Proceedings. vol. 2016, p. 371. AMIA (2016)
-  Craven, M., Shavlik, J.W.: Extracting tree-structured representations of trained networks. In: Advances in neural information processing systems. pp. 24–30 (1996)
-  Dancey, D., Bandar, Z.A., McLean, D.: Logistic model tree extraction from artificial neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37(4), 794–802 (2007)
-  Hall, M., Frank, E., et al.: The weka data mining software: an update. ACM SIGKDD explorations newsletter 11(1), 10–18 (2009)
-  Ikonomovska, E., Gama, J., Džeroski, S.: Learning model trees from evolving data streams. Data mining and knowledge discovery 23(1), 128–168 (2011)
Landwehr, N., Hall, M., Frank, E.: Logistic model trees. Machine learning59(1-2), 161–205 (2005)
-  Lipton, Z.C.: The mythos of model interpretability. arXiv preprint arXiv:1606.03490 (2016)
-  Loh, W.Y.: Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(1), 14–23 (2011)
-  McCallum, A.K., et al.: Learning to use selective attention and short-term memory in sequential tasks. In: Proceedings SAB. vol. 4, pp. 315–325 (1996)
-  Mnih, V., Kavukcuoglu, K., Silver, D., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Quinlan, R.J.: Learning with continuous classes. In: 5th Australian Joint Conference on Artificial Intelligence. pp. 343–348. World Scientific, Singapore (1992)
Ribeiro, M.T., Singh, S., Guestrin, C.: Why should I trust you?: Explaining the predictions of any classifier. In: Proceedings SIGKDD. pp. 1135–1144. ACM (2016)
-  Riedmiller, M.: Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In: ECML. pp. 317–328. Springer (2005)
-  Sutton, R.S., Barto, A.G.: Introduction to reinforcement learning, vol. 135. MIT press Cambridge (1998)
-  2(Nov), 45–66 (2001)
-  Uther, W.T., Veloso, M.M.: Tree based discretization for continuous state space reinforcement learning. In: AAAI/IAAI. pp. 769–774 (1998)
-  Wu, M., Hughes, M., et al.: Beyond sparsity: Tree regularization of deep models for interpretability. In: AAAI (2018)