1 Introduction
Deep Q-Network (DQN) is an off-policy learning algorithm that uses a Convolutional Neural Network (CNN; Krizhevsky et al., 2012) to represent the action-value function. Agents trained with DQN show superior performance on a wide range of problems (Mnih et al., 2015). Their success, and that of Deep Neural Networks (DNNs) in general, is explained by their ability to learn good representations automatically. Unfortunately, this high expressiveness is also the source of their opacity, making them very hard to analyze. Visualization methods for DNNs try to tackle this problem by analyzing and interpreting the learned representations (Zeiler and Fergus, 2014; Erhan et al., 2009; Yosinski et al., 2014). However, these methods were developed for supervised learning tasks and assume the data is i.i.d., thus overlooking the temporal structure of the learned representation.
A major challenge in Reinforcement Learning (RL) is scaling to higher dimensions in order to solve real-world applications. Spatial abstractions, such as state aggregation (Bertsekas and Castanon, 1989), try to tackle this problem by grouping states with similar characteristics, such as policy behaviour, value function, or dynamics. On the other hand, temporal abstractions (i.e., options or skills (Sutton et al., 1999a)) can help an agent to focus less on the lower-level details of a task and more on high-level planning (Dietterich, 2000; Parr, 1998). The problem with these methods is that finding good abstractions is typically done manually, which hampers their wide use. The internal model principle (Francis and Wonham, 1975), "every good key must be a model of the lock it opens", was formulated mathematically for control systems by Sontag (2003), who claimed that if a system is solving a control task, it must necessarily contain a subsystem which is capable of predicting the dynamics of the system. In this work we follow the same line of thought and claim that DQNs learn an underlying spatio-temporal model of the problem, without explicitly being trained to. We identify this model as a Semi-Aggregated Markov Decision Process (SAMDP), an approximation of the true MDP that allows human interpretability.
Zahavy et al. (2016) used hand-crafted features in order to interpret policies learned by DQN agents. They revealed that DQNs automatically learn spatio-temporal representations such as hierarchical state aggregation and skills. The main drawback of their approach is that they used manual reasoning over a t-Distributed Stochastic Neighbor Embedding (t-SNE) map (Van der Maaten and Hinton, 2008), a tedious process that requires careful inspection as well as an experienced eye. Moreover, their claim to observe skills is not supported by any quantitative evidence. In contrast, we use temporally-aware clustering algorithms in order to aggregate the state space, and automatically reveal the underlying spatio-temporal structure of the t-SNE map. The aggregated states uniquely identify skills and allow us to estimate the SAMDP transition probabilities and reward signal empirically. In particular, our main contributions are:

- SAMDP: a model that gives a simple explanation of how DRL agents solve a task, by hierarchically decomposing it into a set of sub-problems and learning specific skills for each.

- Automatic analysis: we suggest quantitative criteria that allow us to select good models and evaluate their consistency.

- Interpretation: we develop a novel visualization tool that gives a qualitative understanding of the learned policy.

- Shared autonomy: the SAMDP model allows us to predict situations where the DQN agent is not performing well. On such occasions, we suggest taking control from the agent and asking for expert advice.
2 Background
We briefly review the standard reinforcement learning framework of discrete-time, finite Markov Decision Processes (MDPs). In this framework, the goal of an RL agent is to maximize its expected return by learning a policy $\pi: S \to \Delta_A$, a mapping from states $s \in S$ to probability distributions over actions $a \in A$. At time $t$ the agent observes a state $s_t \in S$, selects an action $a_t \in A$, and receives a reward $r_t$. Following the agent's action choice, it transitions to the next state $s_{t+1} \in S$. We consider infinite-horizon problems where the cumulative return at time $t$ is given by $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$, and $\gamma \in [0,1]$ is the discount factor. The action-value function $Q^\pi(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$ represents the expected return after observing state $s$ and taking action $a$, thereafter following policy $\pi$. The optimal action-value function obeys a fundamental recursion known as the optimal Bellman equation:
$$Q^*(s_t, a_t) = \mathbb{E}\big[\, r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\big].$$
Deep Q Networks: The DQN algorithm (Mnih et al., 2015) approximates the optimal Q-function using a CNN. The training objective is to minimize the expected TD error of the optimal Bellman equation:
$$\mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \big\| Q_\theta(s_t, a_t) - y_t \big\|_2^2, \qquad y_t = r_t + \gamma \max_{a'} Q_{\theta_{\text{target}}}(s_{t+1}, a').$$
DQN is an offline learning algorithm that collects experience tuples $\{s_t, a_t, r_t, s_{t+1}\}$ and stores them in the Experience Replay (ER) (Lin, 1993). At each training step, a minibatch of experience tuples is sampled at random from the ER. The DQN maintains two separate Q-networks: the current Q-network with parameters $\theta$, and the target Q-network with parameters $\theta_{\text{target}}$. The parameters $\theta_{\text{target}}$ are set to $\theta$ every fixed number of iterations. In order to capture the MDP dynamics, the final DQN representation is a concatenation of several consecutive states.
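The TD target above can be sketched in a few lines of NumPy. The function names and the minibatch layout here are illustrative assumptions, not the original DQN implementation:

```python
import numpy as np

def dqn_td_targets(rewards, next_q_target, dones, gamma=0.99):
    """Compute DQN targets y = r + gamma * max_a' Q_target(s', a') for a minibatch.

    rewards:       (B,)   rewards from the sampled experience tuples
    next_q_target: (B, A) target-network Q-values for the next states
    dones:         (B,)   1.0 where the episode terminated, else 0.0
    """
    bootstrap = next_q_target.max(axis=1)           # max over actions a'
    return rewards + gamma * (1.0 - dones) * bootstrap

def td_loss(q_selected, targets):
    """Mean squared TD error between online-network values and targets."""
    return float(np.mean((targets - q_selected) ** 2))
```

In the full algorithm, `targets` would be held fixed (computed with the target network) while gradients flow only through `q_selected`.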
Skills, options, or macro-actions (Sutton et al., 1999a) are temporally extended control structures, denoted by $\sigma$. A skill is defined by a triplet $\sigma = \langle I, \pi, \beta \rangle$: $I$ defines the set of states where the skill can be initiated, $\pi$ is the intra-skill policy, and $\beta$ is the set of termination probabilities determining when a skill will stop executing. $\beta$ is typically either a function of state $s$ or time $t$. Any MDP with a fixed set of skills is a Semi-Markov Decision Process (SMDP). Planning with skills can be performed by learning for each state the value of choosing each skill. More formally, an SMDP can be defined by a five-tuple $\langle S, \Sigma, P, R, \gamma \rangle$, where $S$ is the set of states, $\Sigma$ is the set of skills, $P$ is the SMDP transition matrix, $\gamma$ is the discount factor, and the SMDP reward is defined by:
$$R_s^\sigma = \mathbb{E}[r_s^\sigma] = \mathbb{E}\big[\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid s_t = s, \sigma \,\big]. \tag{1}$$
The skill policy $\mu: S \to \Delta_\Sigma$ is a mapping from states to a probability distribution over skills. The action-value function $Q_\Sigma^\mu(s, \sigma)$ represents the value of choosing skill $\sigma \in \Sigma$ at state $s \in S$, and thereafter selecting skills according to policy $\mu$. The optimal skill value function is given by $Q_\Sigma^*(s, \sigma) = \mathbb{E}\big[ r_s^\sigma + \gamma^k \max_{\sigma' \in \Sigma} Q_\Sigma^*(s', \sigma') \big]$ (Stolle and Precup, 2002).
3 Semi Aggregated Markov Decision Processes
Reinforcement Learning problems are typically modeled using the MDP formulation. The abundant theory developed for MDPs throughout the years gave rise to various algorithms for efficiently solving MDPs and finding good policies. The MDP, however, is not the optimal modeling choice when one wishes to analyze a given policy. Policy analysis methods typically suffer from the cardinality of the state space and the length of the planning horizon. For example, a graphical model that explains the policy will be too large (in terms of states) and too complex (in terms of planning horizon) for a human to comprehend. If the policy one wishes to analyze is known to plan using temporally-extended actions (i.e., skills), then one may resort to SMDP modeling. The SMDP model reduces the planning horizon dramatically and simplifies the graphical model. There are, however, two problems with this approach. First, it requires identifying the set of skills used by the policy, a long-standing challenging problem with no easy solution. Second, one still faces the high complexity of the state space.
A different modeling approach is to aggregate similar states first. This is useful when there is reason to believe that groups of states share common attributes, such as similar policy, value function, or dynamics. State aggregation is a well-studied problem that can be solved by applying clustering to the MDP state representation. These models are not necessarily Markovian, but they can greatly simplify the state space. With a slight abuse of notation we denote this model as an Aggregated MDP (AMDP). Under the right state representation, the AMDP can also help to identify skills (if they exist). We argue that this is possible if the AMDP dynamics are such that the majority of transitions occur within the clusters, punctuated by rare transitions between clusters. As we will show in the experiments section, DQN indeed provides a good state representation that allows skill identification.
If the state representation contains both spatial and temporal hierarchies, then the AMDP model can be further simplified into an SAMDP model. Under SAMDP modeling, both the state-space cardinality and the planning horizon are reduced, making policy reasoning more feasible. We summarize our observations about the different modeling approaches in Figure 1.
In the remainder of this section we explain SAMDP modeling in detail, focusing on how to build an SAMDP model empirically from experience. To do so, we explain how to aggregate states, identify skills, and estimate the transition probabilities and reward measures. Finally, we discuss how to evaluate the fitness of an empirical SAMDP model to the data.
3.1 State aggregation
We evaluate a DQN agent, by letting it play multiple trajectories with an greedy policy. During evaluation we record all visited states, neural activations, value estimations, and index them by their visitation order. We treat the neural activations as the state representation that the DQN agent has learned. Zahavy et al. (2016) showed that this state representation captures a spatiotemporal hierarchy and therefore makes a good candidate for state aggregation. We then apply tSNE on the neural activations data, a nonlinear dimensionality reduction method that is particularly good at creating a single map that reveals structure at many different scales. tSNE reduces the tendency of points to crowd together in the center of the map by using a heavy tailed Studentt distribution in the low dimensional space. The result is a compact, well separated representation, that is easy to visualize and interpret.
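As a rough sketch of this dimensionality-reduction step, scikit-learn's PCA and t-SNE implementations can be chained as follows; the function name and the parameter clamping are our own illustrative choices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_activations(activations, pca_dims=50, perplexity=30, seed=0):
    """Project recorded DQN activations (N, D) to a 2-D map with PCA + t-SNE.

    PCA first reduces noise and dimensionality; t-SNE then builds the 2-D map.
    Parameters are clamped so the sketch also runs on small toy inputs.
    """
    n, d = activations.shape
    x = PCA(n_components=min(pca_dims, n, d)).fit_transform(activations)
    return TSNE(n_components=2, perplexity=min(perplexity, n - 1),
                random_state=seed).fit_transform(x)
```

On the actual data described in the experiments section (120k states of 512 features each), this corresponds to PCA to 50 dimensions followed by Barnes-Hut t-SNE with perplexity 30.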
We represent an MDP state $s_t$ by a feature vector $\phi_t$, comprised of the two t-SNE coordinates and the DQN value estimate. Using this representation, we aggregate the state space by applying clustering algorithms and define the AMDP states as the resulting clusters. Standard clustering algorithms assume that the data is drawn from an i.i.d. distribution; however, our data is generated from an MDP, which violates this assumption. In order to alleviate this problem, we suggest two versions of K-means (Algorithm 1) that take into account the temporal structure of the data. (1) Spatio-Temporal Cluster Assignment, which encourages temporal coherency by modifying the assignment step in the following way:
$$c(x_t) = \arg\min_{c \in C} \sum_{x_i \in X_t^w} \| x_i - \mu_c \|^2, \qquad X_t^w = \{ x_{t-w}, \ldots, x_{t+w} \}, \tag{2}$$
where $t$ is the time index of observation $x_t$, and $X_t^w$ is the set of $w$ points before and after $x_t$ along the trajectory. In this way, a point $x_t$ is assigned to a cluster $c$ if its neighbours along the trajectory are also close to $\mu_c$.
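A minimal NumPy sketch of this modified assignment step; the names and the clamping of the temporal window at trajectory boundaries are our own assumptions:

```python
import numpy as np

def spatiotemporal_assign(X, centroids, w=2):
    """Assignment step of spatio-temporal K-means (a sketch of Eq. 2).

    A point x_t is assigned to the cluster whose centroid is closest to its
    whole temporal window X_t^w = {x_{t-w}, ..., x_{t+w}} along the trajectory,
    which encourages temporally coherent cluster labels.
    """
    n = len(X)
    labels = np.empty(n, dtype=int)
    for t in range(n):
        lo, hi = max(0, t - w), min(n, t + w + 1)   # window clamped at the edges
        window = X[lo:hi]
        # summed squared distance of the whole window to each centroid
        costs = ((window[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=(0, 2))
        labels[t] = int(np.argmin(costs))
    return labels
```

The update step (recomputing centroids as cluster means) is unchanged from standard K-means.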
(2) Entropy Regularization Cluster Assignment, which creates simpler models by adding an entropy regularization term to the K-means assignment step:
$$c(x_t) = \arg\min_{c \in C} \big\{ \| x_t - \mu_c \|^2 + \lambda \cdot e(c, x_t) \big\}, \tag{3}$$
where $\lambda$ is a penalty weight, and $e(c, x_t)$ indicates the entropy gain (as defined in Section 3.3) of changing the assignment of $x_t$ to cluster $c$ in the SMDP obtained at the previous iteration. This is equivalent to minimizing an energy function that is the sum of the K-means objective and an entropy term.
We also considered Agglomerative Clustering, a bottom-up hierarchical approach. Starting with a mapping from points to clusters (e.g., each point is a singleton cluster), the algorithm advances by merging pairs of clusters such that a linkage criterion is minimized. In order to encourage temporal coherency in cluster assignments, we define a new linkage criterion based on Ward (1963):
$$L(c_i, c_j) = d_{\text{Ward}}(c_i, c_j) + \lambda \cdot \Delta e(c_i, c_j), \tag{4}$$
where $\Delta e(c_i, c_j)$ measures the difference between the entropy of the corresponding SMDP before and after merging clusters $c_i$ and $c_j$.
3.2 Temporal abstractions
We define the SAMDP skills by their initiation and termination AMDP states $\langle c_i, c_j \rangle$:
$$\sigma_{i,j}: \quad I_{i,j} = \{ s \mid s \in c_i \}, \qquad \beta_{i,j}(s) = \mathbf{1}_{\{ s \in c_j \}}. \tag{5}$$
More explicitly, once the DQN agent enters an AMDP state $c_i$ at an MDP state $s \in c_i$, it follows the skill policy $\pi_{i,j}$ for $k$ steps, until it reaches a state $s' \in c_j$, s.t. $c_i \neq c_j$. Note that we do not define the skill policy explicitly, but we will observe later that our model successfully captures spatio-temporally defined skill policies. We set the SAMDP discount factor to the same $\gamma$ that was used to train the DQN. We now turn to estimating the SAMDP probability matrix and reward signal. To that end, we make the following assumptions:
Definition 1. A deterministic probability matrix is a probability matrix such that each of its rows contains one element that equals 1 while the others equal 0.
Assumption 1. The MDP transition matrices are deterministic.
This assumption limits our analysis for environments with deterministic dynamics. However, many interesting problems are in fact deterministic, e.g., Atari2600 benchmarks, Go, Chess etc.
Assumption 2.
The policy played by the DQN agent is deterministic.
Although DQN chooses actions deterministically (by selecting the action that corresponds to the maximal Q value in each state), we allow stochastic exploration. This introduces errors into our model that we will later analyze.
Given the DQN policy, the MDP is reduced to a Markov Reward Process (MRP) with probability matrix $P^\pi$. Note that by Assumptions 1 and 2, this is also a deterministic probability matrix.
The SAMDP transition probability matrix indicates the probability of moving from state $c_i$ to $c_j$ given that skill $\sigma_{i,j}$ is chosen; it is also a deterministic probability matrix by our definition of skills (Equation 5). Our goal is to estimate the probability matrix that the DQN policy induces on the SAMDP model, $P^\pi_{i,j} = \Pr(c_j \mid c_i, \pi)$.
We do not require this policy to be deterministic for two reasons. First, we evaluate the DQN agent with an $\epsilon$-greedy policy. While almost deterministic from the view of a single time step, the variance of its behaviour increases as more moves are played. Second, the aggregation process is only an approximation. For example, a given AMDP state may contain more than one "real" state and therefore hold more than one skill with different transitions. A stochastic policy can resolve this disagreement by choosing skills at random.
This type of modeling does not guarantee that our SAMDP model is Markovian, and we are not claiming it to be. The SAMDP is an approximation of the true dynamics that simplifies them over space and time to allow human interpretation. Finally, we estimate the skill length $k$ and the SAMDP reward for each skill from the data using Equation 1. In the experiments section we show that this model is in fact consistent with the data by evaluating its value function:
$$v^{SAMDP}(c_i) = \sum_j P^\pi_{i,j} \big[ R^\sigma_{i,j} + \gamma^{k_{i,j}} \, v^{SAMDP}(c_j) \big], \tag{6}$$
and the greedy policy with respect to it:
$$\pi_{\text{greedy}}(c_i) = \arg\max_{\sigma_{i,j}} \big\{ R^\sigma_{i,j} + \gamma^{k_{i,j}} \, v^{SAMDP}(c_j) \big\}. \tag{7}$$
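The empirical estimation of the SAMDP transition probabilities, skill rewards, and skill lengths described above can be sketched as follows. This is a simplification that processes a single cluster-labeled trajectory and drops the final, unterminated skill; all names are illustrative:

```python
import numpy as np

def estimate_samdp(cluster_labels, rewards, n_clusters, gamma=0.99):
    """Empirically estimate SAMDP quantities from one trajectory.

    cluster_labels: cluster index of each visited MDP state, in visitation order
    rewards:        reward received at each step
    Returns (P, R, K): empirical transition probabilities P[i, j] between
    clusters, average discounted skill reward R[i, j], and average skill
    length K[i, j], per Equation 1.
    """
    counts = np.zeros((n_clusters, n_clusters))
    R = np.zeros((n_clusters, n_clusters))
    K = np.zeros((n_clusters, n_clusters))
    start = 0                                        # step where the current skill began
    for t in range(1, len(cluster_labels)):
        i, j = cluster_labels[t - 1], cluster_labels[t]
        if i != j:                                   # skill sigma_{i,j} terminates here
            k = t - start                            # skill length in MDP steps
            disc = sum(gamma ** n * rewards[start + n] for n in range(k))
            counts[i, j] += 1
            R[i, j] += disc
            K[i, j] += k
            start = t
    nz = counts > 0
    R[nz] /= counts[nz]                              # average over skill executions
    K[nz] /= counts[nz]
    P = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return P, R, K
```

In practice, the counts would be accumulated over all recorded trajectories before normalizing.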
3.3 Evaluation criteria
We follow the analysis of Hallak et al. (2013) and define criteria to measure the fitness of a model empirically. We define the Value Mean Square Error (VMSE) as the normalized distance between two value estimations: $\mathrm{VMSE} = \| v_{DQN} - v_{SAMDP} \| / \| v_{DQN} \|$. The SAMDP value is given by Equation 6, and the DQN value is evaluated by averaging the DQN value estimates over all MDP states in a given cluster (SAMDP state): $v_{DQN}(c_j) = \frac{1}{|c_j|} \sum_{s \in c_j} \max_a Q(s, a)$.
The Minimum Description Length (MDL; (Rissanen, 1978)) principle is a formalization of the celebrated Occam’s Razor. It copes with the overfitting problem for the purpose of model selection. According to this principle, the best hypothesis for a given data set is the one that leads to the best compression of the data. Here, the goal is to find a model that explains the data well, but is also simple in terms of the number of parameters. In our work we follow a similar logic and look for a model that best fits the data but is still “simple”.
Instead of considering "simple" in terms of the number of parameters, we measure the simplicity of the spatio-temporal state aggregation. For spatial simplicity we define the Inertia, $I = \sum_i \min_{\mu_j \in C} \| x_i - \mu_j \|^2$, which measures the variance of MDP states inside a cluster (AMDP state). For temporal simplicity we define the entropy, $e = -\sum_i |c_i| \sum_j P_{i,j} \log P_{i,j}$, and the Intensity Factor, which measures the fraction of in-cluster versus out-of-cluster transitions.
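Plausible implementations of these evaluation criteria are sketched below. The exact functional forms used for the entropy and intensity factor are our own assumptions consistent with the definitions above, so treat them as illustrative:

```python
import numpy as np

def vmse(v_samdp, v_dqn):
    """Value Mean Square Error: normalized distance between value estimates."""
    return float(np.linalg.norm(v_dqn - v_samdp) / np.linalg.norm(v_dqn))

def inertia(X, labels, centroids):
    """Spatial simplicity: mean squared distance of states to their centroid."""
    return float(np.mean(np.sum((X - centroids[labels]) ** 2, axis=1)))

def transition_entropy(P):
    """Temporal simplicity: average entropy of the SAMDP transition rows."""
    with np.errstate(divide="ignore", invalid="ignore"):
        logs = np.where(P > 0, np.log2(P), 0.0)     # define 0 * log 0 = 0
    return float(-np.mean(np.sum(P * logs, axis=1)))

def intensity_factor(labels):
    """Fraction of in-cluster transitions; higher means fewer cluster switches."""
    labels = np.asarray(labels)
    return float(np.mean(labels[1:] == labels[:-1]))
```

A good SAMDP model scores low on inertia, entropy, and VMSE, and high on the intensity factor.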
To summarize, the stages of building an SAMDP model are:

1. Evaluate: run the trained (DQN) agent; record visited states, representations, and Q-values.

2. Reduce: apply t-SNE to the state representations to obtain a low-dimensional map.

3. Aggregate: cluster the states in the map.

4. Model: fit an SAMDP model and select the best model.

5. Visualize: visualize the SAMDP on top of the t-SNE map.
4 Experiments
Setup. We evaluate our method on three Atari2600 games: Breakout, Pacman, and Seaquest. For each game we collect 120k game states (each represented by 512 features) and the Q-values of all actions. We apply PCA to reduce the data to 50 dimensions, then apply t-SNE using the Barnes-Hut approximation to reach the desired low dimension. We run the t-SNE algorithm for 3000 iterations with a perplexity of 30. We use spatio-temporal K-means clustering (Section 3.1) to create the AMDP states (clusters), and evaluate the transition probabilities between them using the trajectory data. We overlook flicker transitions, where a cluster is visited for fewer than a threshold number of time steps before transiting out. Finally, we truncate transitions with less than 0.1 probability.
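The flicker filtering and probability truncation described above can be sketched as follows. The `min_stay` threshold is an assumed parameter, since the exact flicker threshold is not given here, and the reassignment rule (fold short visits into the preceding cluster) is our own simplification:

```python
import numpy as np

def filter_flicker(labels, min_stay=2):
    """Drop 'flicker' visits: cluster visits shorter than min_stay steps are
    re-assigned to the previous cluster's label."""
    out = list(labels)
    i = 0
    while i < len(out):
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1                                   # find the end of this visit
        if j - i < min_stay and i > 0:
            for k in range(i, j):
                out[k] = out[i - 1]                  # fold into the previous cluster
        i = j
    return out

def truncate_transitions(P, eps=0.1):
    """Zero-out transitions with probability below eps and renormalize rows."""
    Q = np.where(P >= eps, P, 0.0)
    sums = Q.sum(axis=1, keepdims=True)
    return np.divide(Q, sums, out=np.zeros_like(Q), where=sums > 0)
```

Both operations sparsify the resulting SAMDP graph, which is what makes it legible to a human.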
Model Selection. We perform a grid search over two parameters: i) the number of clusters, and ii) the window size $w$. We found that models outside a moderate range of these parameters are either too cumbersome or too simplistic to analyze. We select the best model in the following way. Let $E_c, I_c, V_c, F_c$ be the entropy, inertia, VMSE, and intensity-factor measures of configuration $c$ in the grid search, and let $E, I, V, F$ be the corresponding sets grouped over all grid-search configurations. We sort each set from good to bad, i.e., from minimum to maximum (except for the intensity factor, where larger values are considered better). We then iteratively intersect the p-prefixes of all sets (i.e., the first p elements of each set), starting with the 1-prefix. We stop when the intersection is non-empty and choose the configuration at the intersection. Figure 2 shows the correlation between pairs of criteria (for Breakout).
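The p-prefix intersection rule can be sketched as follows. Breaking ties by best average rank, when the first non-empty intersection contains several configurations, is our own assumption; criteria where larger is better (the intensity factor) are assumed to be negated by the caller so that lower is uniformly better:

```python
def select_model(scores):
    """Pick a grid-search configuration by iterative p-prefix intersection.

    scores: dict criterion_name -> list of (config, value), lower is better.
    """
    # Rank configurations per criterion, best (smallest value) first.
    ranked = [[cfg for cfg, _ in sorted(vals, key=lambda cv: cv[1])]
              for vals in scores.values()]
    n = len(ranked[0])
    for p in range(1, n + 1):
        common = set(ranked[0][:p])                  # p best configs per criterion
        for r in ranked[1:]:
            common &= set(r[:p])
        if common:                                   # first non-empty intersection wins
            return min(sorted(common),
                       key=lambda c: sum(r.index(c) for r in ranked))
    return None
```

The returned configuration is simultaneously near the top of every criterion's ranking.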
Overall, we see a trade-off between spatial and temporal complexity. For example, in the bottom-left plot, we observe a correlation between the Inertia and the Intensity Factor: a small window size leads to well-defined clusters in space (low Inertia) at the expense of a complex transition matrix (small Intensity Factor). A large window size causes the clusters to be more spread out in space (large Inertia), but has the positive effect of intensifying the in-cluster transitions (high Intensity Factor). We also measure the p-value of the chosen model, with the null hypothesis being an SAMDP model constructed from randomly clustered states. We tested 10,000 random SAMDP models, none of which scored better than the chosen model (on any of the evaluation criteria).
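The reported significance check can be approximated by a simple permutation test over random clusterings. The scoring function, the uniform random relabeling, and the add-one smoothing are our illustrative choices:

```python
import numpy as np

def permutation_pvalue(score_fn, labels, n_clusters, n_perm=1000, seed=0):
    """Null-hypothesis check: how often does a random clustering score at least
    as well as the chosen one?

    score_fn maps a label sequence to a scalar where higher is better
    (e.g. the intensity factor, or a negated entropy).
    """
    rng = np.random.RandomState(seed)
    observed = score_fn(labels)
    hits = 0
    for _ in range(n_perm):
        random_labels = rng.randint(0, n_clusters, size=len(labels))
        if score_fn(random_labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)                 # add-one smoothed p-value
```

A p-value near the smoothing floor indicates that no random clustering matched the chosen model, mirroring the 10,000-model test above.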
Qualitative Evaluation. Examining the resulting SAMDP (Figure 3), it is interesting to note the sparsity of transitions. This indicates that the clusters are well located in time. Inspecting the mean image of each cluster also reveals some insights about the nature of the skills hiding within. We also see evidence for the "tunnel-digging" option described in Zahavy et al. (2016) in the transitions between clusters 11, 12, 14, and 4.
Model Evaluation. We evaluate our model using three different methods. First, the VMSE criterion (Figure 4, top): the high correlation between the DQN values and the SAMDP values gives a clear indication of the fitness of the model to the data. Second, we evaluate the correlation between the transitions induced by the policy improvement step and the trajectory reward. To do so, we measure, for each trajectory, the empirical frequency of choosing the greedy policy at each state, and present the correlation coefficients at each state (Figure 4, center). Positive correlation indicates that following the greedy policy leads to high reward, and indeed for most of the states we observe positive correlation, supporting the consistency of the model. The third evaluation is close in spirit to the second one. We create two transition matrices, using the k top-rewarded trajectories and the k least-rewarded trajectories respectively. We measure the correlation of the greedy policy with each of the transition matrices for different values of k (Figure 4, bottom). As clearly seen, the correlation of the greedy policy with the top trajectories is higher than its correlation with the bad trajectories.
Eject Button: Performance improvement. In the following experiment we show how the SAMDP model can help to improve the performance of a trained policy. The motivation for this experiment stems from the idea of shared autonomy (Pitzer et al., 2011). There are domains where errors are not permitted and performance must be as high as possible. The idea of shared autonomy is to allow an operator to intervene in the decision loop at critical times. For example, it is known that in 20% of commercial flights, the autopilot returns control to the human pilots. For this experiment we first build an SAMDP model, and then let the agent play new (unseen) trajectories. We project the online state visitations onto our model and monitor its transitions along it. We build the transition matrices of the top-rewarded and least-rewarded trajectories as above. If the likelihood of the online trajectory with respect to the least-rewarded transition matrix is greater than its likelihood with respect to the top-rewarded one, we press the Eject button and terminate the execution (a procedure inspired by option interruption (Sutton et al., 1999b)). We measure the average performance of the unterminated trajectories with respect to all trajectories. The performance improvement achieved with and without using the Eject button is presented in Table 1.
Game     | Average score without eject | Average score with eject | Improvement
Breakout |                         293 |                      400 |        +36%
Seaquest |                        5641 |                     6780 |        +20%
Pacman   |                         230 |                      241 |       +4.7%
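The Eject rule (comparing the likelihood of the online trajectory under the transition models of the best and worst trajectories) can be sketched as follows. Treating only cluster switches as SAMDP transitions, and the additive smoothing constant, are our modeling assumptions:

```python
import numpy as np

def trajectory_loglik(P, cluster_labels, eps=1e-12):
    """Log-likelihood of the observed cluster-to-cluster switches under P."""
    ll = 0.0
    for i, j in zip(cluster_labels[:-1], cluster_labels[1:]):
        if i != j:                       # only cluster switches are SAMDP transitions
            ll += np.log(P[i, j] + eps)  # eps avoids log(0) for unseen transitions
    return ll

def should_eject(P_good, P_bad, cluster_labels):
    """Press the Eject button when the online trajectory looks more like the
    transition model of the worst trajectories than of the best ones."""
    return bool(trajectory_loglik(P_bad, cluster_labels)
                > trajectory_loglik(P_good, cluster_labels))
```

When `should_eject` fires, control would be returned to an operator (or the episode terminated, as in the experiment above).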
5 Discussion
In this work we considered the problem of automatically building an SAMDP model for analyzing trained policies. We start from a t-SNE map of neural activations and end up with a compact model that gives a clear interpretation of complex RL tasks. We showed how the SAMDP can help in identifying skills that are well defined in terms of initiation and termination sets. However, the SAMDP does not offer much information about the skill policies themselves, and we suggest investigating them further in future work. It would also be interesting to see whether the skills of different states actually represent the same behaviour. Most importantly, the skills we find are determined by the state aggregation and are therefore affected by the artifacts of the clustering method used. In future work we will consider other clustering methods that better relate to the topology (such as spectral clustering), to see if they lead to better skills.
In the Eject experiment we showed how the SAMDP model can help to improve the policy at hand without the need to retrain it. It would be even more interesting to use the SAMDP model to improve the training phase itself. The strength of SAMDP in identifying spatial and temporal hierarchies could be harnessed by hierarchical DRL algorithms (Tessler et al., 2016; Kulkarni et al., 2016), for example by automatically detecting sub-goals or skills.
Another question we are interested in answering is whether a global control structure exists. Motivated by the success of policy distillation (Rusu et al., 2015), it would be interesting to see how well an SAMDP built for game A explains game B. Finally, we would like to use this model to interpret other DRL agents that are not explicitly trained to approximate a value function, such as deep policy gradient methods.
References
 Bertsekas and Castanon [1989] Dimitri P Bertsekas and David A Castanon. Adaptive aggregation methods for infinite horizon dynamic programming. Automatic Control, IEEE Transactions on, 34(6):589–598, 1989.
 Dietterich [2000] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res.(JAIR), 13:227–303, 2000.
 Erhan et al. [2009] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higherlayer features of a deep network. Dept. IRO, Université de Montréal, Tech. Rep, 4323, 2009.
 Francis and Wonham [1975] Bruce A Francis and William M Wonham. The internal model principle for linear multivariable regulators. Applied mathematics and optimization, 2(2), 1975.
 Hallak et al. [2013] Assaf Hallak, Dotan DiCastro, and Shie Mannor. Model selection in markovian processes. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 Kulkarni et al. [2016] Tejas D Kulkarni, Karthik R Narasimhan, Ardavan Saeedi, and Joshua B Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. arXiv preprint arXiv:1604.06057, 2016.
 Lin [1993] LongJi Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
 MacQueen et al. [1967] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540), 2015.

 Parr [1998] Ronald Parr. Flexible decomposition algorithms for weakly coupled Markov decision problems. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 422–430. Morgan Kaufmann Publishers Inc., 1998.
 Pitzer et al. [2011] Benjamin Pitzer, Michael Styer, Christian Bersch, Charles DuHadway, and Jan Becker. Towards perceptual shared autonomy for robotic mobile manipulation. In IEEE International Conference on Robotics and Automation (ICRA), 2011.
 Rissanen [1978] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
 Rusu et al. [2015] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
 Sontag [2003] Eduardo D Sontag. Adaptation and regulation with signal detection implies internal model. Systems & control letters, 50(2):119–126, 2003.
 Stolle and Precup [2002] Martin Stolle and Doina Precup. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation. Springer, 2002.
 Sutton et al. [1999a] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semiMDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1), August 1999a.
 Sutton et al. [1999b] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semiMDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1):181–211, 1999b.
 Tessler et al. [2016] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in minecraft. arXiv preprint arXiv:1604.07255, 2016.

 Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 Ward [1963] Joe H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
 Yosinski et al. [2014] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
 Zahavy et al. [2016] Tom Zahavy, Nir Ben Zrihem, and Shie Mannor. Graying the black box: Understanding dqns. arXiv preprint arXiv:1602.02658, 2016.
 Zeiler and Fergus [2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833, 2014.