
Deep Reinforcement Learning Discovers Internal Models

06/16/2016
by   Nir Baram, et al.

Deep Reinforcement Learning (DRL) is a trending field of research, showing great promise in challenging problems such as playing Atari, solving Go and controlling robots. While DRL agents perform well in practice, we still lack the tools to analyze their performance. In this work we present the Semi-Aggregated MDP (SAMDP), a model best suited to describe policies exhibiting both spatial and temporal hierarchies. We describe its advantages for analyzing trained policies over other modeling approaches, and show that under the right state representation, like that of DQN agents, SAMDP can help to identify skills. We detail the automatic process of creating it from recorded trajectories, up to presenting it on t-SNE maps. We explain how to evaluate its fitness and show surprising results indicating high compatibility with the policy at hand. We conclude by showing how, using the SAMDP model, an extra performance gain can be squeezed from the agent.



1 Introduction

Deep Q Network (DQN) is an off-policy learning algorithm that uses a Convolutional Neural Network (CNN; Krizhevsky et al., 2012) to represent the action-value function. Agents trained using DQN show superior performance on a wide range of problems (Mnih et al., 2015). Their success, and that of Deep Neural Networks (DNNs) in general, is explained by their ability to learn good representations automatically. Unfortunately, this high expressiveness is also the source of their opacity, making them very hard to analyze. Visualization methods for DNNs try to tackle this problem by analyzing and interpreting the learned representations (Zeiler and Fergus, 2014; Erhan et al., 2009; Yosinski et al., 2014). However, these methods were developed for supervised learning tasks, assuming the data is i.i.d., thus overlooking the temporal structure of the learned representation.

A major challenge in Reinforcement Learning (RL) is scaling to higher dimensions in order to solve real-world applications. Spatial abstractions such as state aggregation (Bertsekas and Castanon, 1989) try to tackle this problem by grouping states with similar characteristics such as policy behaviour, value function or dynamics. Temporal abstractions (i.e., options or skills (Sutton et al., 1999a)), on the other hand, can help an agent focus less on the lower-level details of a task and more on high-level planning (Dietterich, 2000; Parr, 1998). The problem with these methods is that finding good abstractions is typically done manually, which hampers their wide use. The internal model principle (Francis and Wonham, 1975), "Every good key must be a model of the lock it opens", was formulated mathematically for control systems by Sontag (2003), claiming that if a system is solving a control task, it must necessarily contain a subsystem capable of predicting the dynamics of the system. In this work we follow the same line of thought and claim that DQNs learn an underlying spatio-temporal model of the problem, without explicitly being trained to. We identify this model as a Semi-Aggregated Markov Decision Process (SAMDP), an approximation of the true MDP that allows human interpretability.

Zahavy et al. (2016) used hand-crafted features in order to interpret policies learned by DQN agents. They revealed that DQNs automatically learn spatio-temporal representations such as hierarchical state aggregation and skills. The main drawback of their approach is that it relies on manual reasoning over a t-Distributed Stochastic Neighbour Embedding (t-SNE) map (Van der Maaten and Hinton, 2008), a tedious process that requires careful inspection as well as an experienced eye. Moreover, their claim to observe skills is not supported by any quantitative evidence. In contrast, we use temporally aware clustering algorithms to aggregate the state space and automatically reveal the underlying spatio-temporal structure of the t-SNE map. The aggregated states uniquely identify skills and allow us to estimate the SAMDP transition probabilities and reward signal empirically. In particular, our main contributions are:

  1. SAMDP: a model that gives a simple explanation of how DRL agents solve a task, by hierarchically decomposing it into a set of sub-problems and learning a specific skill for each.

  2. Automatic analysis: we suggest quantitative criteria that allow us to select good models and evaluate their consistency.

  3. Interpretation: we developed a novel visualization tool that gives a qualitative understanding of the learned policy.

  4. Shared autonomy: the SAMDP model allows us to predict situations in which the DQN agent is not performing well. On such occasions we suggest taking control away from the agent and asking for expert advice.

2 Background

We briefly review the standard reinforcement learning framework of discrete-time, finite Markov Decision Processes (MDPs). In this framework, the goal of an RL agent is to maximize its expected return by learning a policy $\pi: S \rightarrow \Delta_A$, a mapping from states $s \in S$ to probability distributions over actions $a \in A$. At time $t$ the agent observes a state $s_t \in S$, selects an action $a_t \in A$, and receives a reward $r_t$. Following the agent's action choice, it transitions to the next state $s_{t+1} \in S$. We consider infinite-horizon problems where the cumulative return at time $t$ is given by $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$, and $\gamma \in [0,1]$ is the discount factor. The action-value function $Q^{\pi}(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$ represents the expected return after observing state $s$, taking action $a$ and thereafter following policy $\pi$. The optimal action-value function obeys a fundamental recursion known as the optimal Bellman equation:

$$Q^{*}(s_t, a_t) = \mathbb{E}\Big[ r_t + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \Big].$$
Deep Q Networks: The DQN algorithm approximates the optimal Q function using a CNN. The training objective is to minimize the expected TD error of the optimal Bellman equation:

$$\mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \Big\| Q_{\theta}(s_t, a_t) - \big( r_t + \gamma \max_{a'} Q_{\theta_{target}}(s_{t+1}, a') \big) \Big\|_2^2$$

(Mnih et al., 2015). DQN is an offline learning algorithm that collects experience tuples $(s_t, a_t, r_t, s_{t+1})$ and stores them in the Experience Replay (ER) (Lin, 1993). At each training step, a mini-batch of experience tuples is sampled at random from the ER. The DQN maintains two separate Q-networks: the current Q-network with parameters $\theta$, and the target Q-network with parameters $\theta_{target}$. The parameters $\theta_{target}$ are set to $\theta$ every fixed number of iterations. In order to capture the MDP dynamics, the final DQN representation is a concatenation of several consecutive states.
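
To make the objective concrete, here is a minimal PyTorch-style sketch of the TD error with a separate target network; the module and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dqn_td_loss(q_net, target_net, batch, gamma=0.99):
    """TD-error loss of the optimal Bellman equation for one replayed mini-batch (sketch)."""
    s, a, r, s_next, done = batch                               # tensors sampled from the ER
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q_theta(s_t, a_t)
    with torch.no_grad():                                       # target network is held fixed
        q_next = target_net(s_next).max(dim=1).values           # max_a' Q_target(s_{t+1}, a')
        target = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q_sa, target)

# Every fixed number of iterations the target parameters are refreshed:
#   target_net.load_state_dict(q_net.state_dict())
```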

Skills, options, or macro-actions (Sutton et al., 1999a) are temporally extended control structures, denoted by $\sigma$. A skill is defined by a triplet $\sigma = \langle I, \pi, \beta \rangle$: $I$ defines the set of states where the skill can be initiated, $\pi$ is the intra-skill policy, and $\beta$ is the set of termination probabilities determining when a skill will stop executing. $\beta$ is typically either a function of state $s$ or time $t$. Any MDP with a fixed set of skills is a Semi-Markov Decision Process (SMDP). Planning with skills can be performed by learning, for each state, the value of choosing each skill. More formally, an SMDP can be defined by a five-tuple $\langle S, \Sigma, P, R, \gamma \rangle$, where $S$ is the set of states, $\Sigma$ is the set of skills, $P$ is the SMDP transition matrix, $\gamma$ is the discount factor, and the SMDP reward is defined by:

$$R^{\sigma}_{s} = \mathbb{E}\big[ r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} \mid s_t = s, \sigma \big] \qquad (1)$$

The skill policy $\mu: S \rightarrow \Delta_\Sigma$ is a mapping from states to a probability distribution over skills. The action-value function $Q^{\mu}_{\Sigma}(s, \sigma)$ represents the value of choosing skill $\sigma$ at state $s$ and thereafter selecting skills according to policy $\mu$. The optimal skill value function is given by $Q^{*}_{\Sigma}(s, \sigma) = \mathbb{E}\big[ R^{\sigma}_{s} + \gamma^{k} \max_{\sigma'} Q^{*}_{\Sigma}(s', \sigma') \big]$ (Stolle and Precup, 2002).

3 Semi-Aggregated Markov Decision Processes

Reinforcement Learning problems are typically modeled using the MDP formulation. The abundant theory developed for MDPs over the years gave rise to various algorithms for efficiently solving them and finding good policies. The MDP, however, is not the optimal modeling choice when one wishes to analyze a given policy. Policy analysis methods typically suffer from the cardinality of the state space and the length of the planning horizon. For example, a graphical model that explains the policy would be too large (in terms of states) and too complex (in terms of planning horizon) for a human to comprehend. If the policy one wishes to analyze is known to plan using temporally-extended actions (i.e., skills), then one may resort to SMDP modeling. The SMDP model reduces the planning horizon dramatically and simplifies the graphical model. There are, however, two problems with this approach. First, it requires identifying the set of skills used by the policy, a long-standing challenge with no easy solution. Second, one still faces the high complexity of the state space.

Figure 1: Left: Illustration of state aggregation and skills. Primitive actions (orange arrows) cause transitions between MDP states (black dots), while skills (red arrows) induce transitions between SAMDP states (blue circles). Right: Modeling approaches for analyzing policies. MDP (top-left): a policy is analyzed in the original MDP state space with the original set of primitive actions. SMDP (top-right): using the set of identified skills, the policy is easier to analyze. AMDP (bottom-left): state aggregation reduces the state-space complexity. SAMDP (bottom-right): identifying skills in the AMDP model also reduces the planning horizon.

A different modeling approach is to aggregate similar states first. This is useful when there is reason to believe that groups of states share common attributes such as a similar policy, value function or dynamics. State aggregation is a well-studied problem that can be solved by applying clustering to the MDP state representation. These models are not necessarily Markovian; however, they can greatly simplify the state space. With a slight abuse of notation we denote this model as an Aggregated MDP (AMDP). Under the right state representation, the AMDP can also help to identify skills (if they exist). We argue that this is possible if the AMDP dynamics are such that the majority of transitions occur within clusters, with only rare transitions between clusters. As we show in the experiments section, DQN indeed provides a good state representation that allows skill identification.

If the state representation contains both spatial and temporal hierarchies, then the AMDP model can be further simplified into an SAMDP model. Under SAMDP modeling, both the state-space cardinality and the planning horizon are reduced, making policy reasoning more feasible. We summarize our observations about the different modeling approaches in Figure 1.
In the remainder of this section we explain SAMDP modeling in detail and focus on how to empirically build an SAMDP model from experience. To do so, we explain how to aggregate states, identify skills, and estimate the transition probabilities and reward measures. Finally, we discuss how to evaluate the fitness of an empirical SAMDP model to the data.

3.1 State aggregation

We evaluate a DQN agent by letting it play multiple trajectories with an $\epsilon$-greedy policy. During evaluation we record all visited states, neural activations and value estimates, and index them by their visitation order. We treat the neural activations as the state representation that the DQN agent has learned. Zahavy et al. (2016) showed that this state representation captures a spatio-temporal hierarchy and therefore makes a good candidate for state aggregation. We then apply t-SNE to the neural activation data, a non-linear dimensionality reduction method that is particularly good at creating a single map that reveals structure at many different scales. t-SNE reduces the tendency of points to crowd together in the center of the map by using a heavy-tailed Student-t distribution in the low-dimensional space. The result is a compact, well-separated representation that is easy to visualize and interpret.
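
A minimal sketch of this stage with scikit-learn, assuming `activations` holds one row of last-hidden-layer activations per visited state and `values` the corresponding DQN value estimates (both recorded during evaluation); the hyper-parameters follow Section 4, and the function name is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_states(activations, values):
    """Map recorded DQN activations to the 3-D feature vectors used for aggregation."""
    reduced = PCA(n_components=50).fit_transform(activations)   # pre-reduction, as in Section 4
    coords = TSNE(n_components=2, perplexity=30,
                  method="barnes_hut").fit_transform(reduced)   # ~3000 iterations in the paper
    # Feature vector per MDP state: two t-SNE coordinates plus the DQN value estimate.
    return np.column_stack([coords, values])
```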

We represent an MDP state $s$ by a feature vector $\phi(s)$, comprised of the two t-SNE coordinates and the DQN value estimate. Using this representation we aggregate the state space by applying clustering algorithms and define the AMDP states as the resulting clusters. Standard clustering algorithms assume that the data is drawn from an i.i.d. distribution; however, our data is generated from an MDP, which violates this assumption.

Input: MDP state feature representations $\{\phi(s_i)\}_{i=1}^{N}$
Output: SAMDP states (clusters) $c_1, \ldots, c_K$
Objective: minimize the within-cluster sum of squares:

$$\arg\min_{C} \sum_{j=1}^{K} \sum_{\phi(s) \in c_j} \| \phi(s) - \mu_j \|^2,$$

where $\mu_j$ is the mean of the points in $c_j$.
Repeat until convergence:

  1. Assignment step: each observation is assigned to its closest cluster center:

$$c_j^{(t)} = \big\{ \phi(s_i) : \| \phi(s_i) - \mu_j^{(t)} \|^2 \le \| \phi(s_i) - \mu_{j'}^{(t)} \|^2 \ \forall j' \big\}$$

  2. Update step: each cluster center is updated to be the mean of its constituent instances:

$$\mu_j^{(t+1)} = \frac{1}{|c_j^{(t)}|} \sum_{\phi(s_i) \in c_j^{(t)}} \phi(s_i)$$

Algorithm 1: K-means (MacQueen et al., 1967) for state aggregation
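
For reference, this vanilla assignment/update loop is what scikit-learn's KMeans implements; a minimal sketch, assuming `phi` is the array of per-state feature vectors defined above.

```python
from sklearn.cluster import KMeans

def vanilla_kmeans(phi, n_clusters):
    """Plain K-means (Algorithm 1) over the state features phi: an (N, 3) array of
    t-SNE coordinates and value estimates. Returns cluster labels and centers."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(phi)
    return km.labels_, km.cluster_centers_
```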

In order to alleviate this problem, we suggest two versions of K-means (Algorithm 1) that take the temporal structure of the data into account. (1) Spatio-Temporal Cluster Assignment encourages temporal coherency by modifying the assignment step in the following way:

$$c_j^{(t)} = \Big\{ \phi(s_i) : \sum_{s' \in X_i} \| \phi(s') - \mu_j^{(t)} \|^2 \le \sum_{s' \in X_i} \| \phi(s') - \mu_{j'}^{(t)} \|^2 \ \forall j' \Big\} \qquad (2)$$

where $i$ is the time index of observation $\phi(s_i)$ and $X_i$ is the set of the $w$ points before and after $\phi(s_i)$ along the trajectory. In this way, a point $\phi(s_i)$ is assigned to a cluster $c_j$ only if its neighbours along the trajectory are also close to $c_j$.
(2) Entropy Regularization Cluster Assignment creates simpler models by adding an entropy regularization term to the K-means assignment step:

$$c_j^{(t)} = \Big\{ \phi(s_i) : \| \phi(s_i) - \mu_j^{(t)} \|^2 + \lambda \, \Delta e_{i,j} \le \| \phi(s_i) - \mu_{j'}^{(t)} \|^2 + \lambda \, \Delta e_{i,j'} \ \forall j' \Big\} \qquad (3)$$

where $\lambda$ is a penalty weight and $\Delta e_{i,j}$ indicates the entropy gain (as defined in Section 3.3) of changing the assignment of observation $i$ to cluster $j$ in the SMDP obtained at iteration $t$. This is equivalent to minimizing an energy function that is the sum of the K-means objective and an entropy term.
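
For concreteness, here is a sketch of the spatio-temporal assignment step of Equation 2, assuming `X` holds the per-state feature vectors in visitation order and `centers` the current cluster means; the default window size and the helper name are illustrative.

```python
import numpy as np

def spatio_temporal_assign(X, centers, w=2):
    """Assign each point to the cluster closest to its whole temporal window (Eq. 2 sketch)."""
    n = len(X)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # squared distances, shape (n, K)
    labels = np.empty(n, dtype=int)
    for i in range(n):
        window = list(range(max(0, i - w), min(n, i + w + 1)))  # i and its w neighbours on each side
        labels[i] = d2[window].sum(axis=0).argmin()             # minimize the summed window distance
    return labels
```

Alternating this assignment with the usual mean-update step until the labels stabilize yields the Spatio-Temporal K-means used to form the AMDP states; windows that cross trajectory boundaries should be truncated accordingly.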
We also considered Agglomerative Clustering, a bottom-up hierarchical approach. Starting with a mapping from points to clusters (e.g., each point is a singleton cluster), the algorithm advances by merging the pair of clusters that minimizes a linkage criterion. In order to encourage temporal coherency in the cluster assignments we define a new linkage criterion based on Ward (1963):

(4)

where $\Delta e_{c_1, c_2}$ measures the difference between the entropy of the corresponding SMDP before and after merging clusters $c_1$ and $c_2$.

3.2 Temporal abstractions

We define the SAMDP skills by their initiation and termination AMDP states: a skill $\sigma_{i,j}$ is initiated in cluster $c_i$ and terminates upon entering cluster $c_j$,

$$\sigma_{i,j} = \langle I_{i,j}, \pi_{i,j}, \beta_{i,j} \rangle : \quad I_{i,j} = c_i, \qquad \beta_{i,j}(s) = \mathbb{1}\{ s \in c_j \}. \qquad (5)$$

More explicitly, once the DQN agent enters an AMDP state $c_i$ at an MDP state $s_t \in c_i$, it follows the skill policy $\pi_{i,j}$ for $k$ steps, until it reaches a state $s_{t+k} \in c_j$ such that $c_j \neq c_i$. Note that we do not define the skill policy explicitly, but we will observe later that our model successfully captures spatio-temporally defined skill policies. We set the SAMDP discount factor to the same $\gamma$ used to train the DQN. We now turn to estimating the SAMDP probability matrix and reward signal. For that goal we make the following assumptions:

Definition 1. A deterministic probability matrix is a probability matrix in which each row contains one element equal to 1 and all other elements equal to 0.

Assumption 1. The MDP transition matrices are deterministic.
This assumption limits our analysis to environments with deterministic dynamics. However, many interesting problems are in fact deterministic, e.g., the Atari2600 benchmarks, Go, Chess, etc.

Assumption 2. The policy played by the DQN agent is deterministic.
Although DQN chooses actions deterministically (by selecting the action that corresponds to the maximal Q value in each state), we allow stochastic exploration. This introduces errors into our model that we will later analyze.
Given the DQN policy, the MDP is reduced to a Markov Reward Process (MRP) with a transition probability matrix induced by the policy. Note that by Assumptions 1 and 2, this is also a deterministic probability matrix.

The SAMDP transition probability matrix indicates the probability of moving from SAMDP state $c_i$ to $c_j$ given that skill $\sigma_{i,j}$ is chosen; it is also a deterministic probability matrix by our definition of skills (Equation 5). Our goal is to estimate the transition probability matrix that the DQN policy induces on the SAMDP model.

We do not require this policy to be deterministic, for two reasons. First, we evaluate the DQN agent with an $\epsilon$-greedy policy. While it is almost deterministic over a single time step, the variance of its behaviour increases as more moves are played. Second, the aggregation process is only an approximation. For example, a given SAMDP state may contain more than one "real" state and therefore hold more than one skill with different transitions. A stochastic policy can resolve this disagreement by allowing skills to be chosen at random.

This type of modeling does not guarantee that our SAMDP model is Markovian, and we are not claiming it to be. SAMDP is an approximation of the true dynamics that simplifies them over space and time to allow human interpretation. Finally, we estimate the skill length $k$ and the SAMDP reward for each skill from the data using Equation 1. In the experiments section we show that this model is in fact consistent with the data by evaluating its value function:

$$v_{SAMDP}(c_i) = \mathbb{E}\big[ R^{\sigma}_{c_i} + \gamma^{k}\, v_{SAMDP}(c_j) \big] \qquad (6)$$

and the greedy policy with respect to it:

$$\pi_{greedy}(c_i) = \arg\max_{\sigma_{i,j}} \big\{ R^{\sigma_{i,j}}_{c_i} + \gamma^{k}\, v_{SAMDP}(c_j) \big\} \qquad (7)$$
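
Under Assumptions 1 and 2, all of the above quantities can be estimated by simple counting over the recorded, cluster-labeled trajectories. The sketch below segments a trajectory into skills (Equation 5), estimates the empirical transition matrix, skill rewards and skill lengths, and then evaluates Equations 6 and 7 by iterating the aggregated Bellman operator; the data layout and helper names are our own assumptions, not the authors' code.

```python
import numpy as np

def segment_skills(labels, rewards, gamma):
    """Split one cluster-labeled trajectory into skill executions sigma_{i->j} (Eq. 5)."""
    segments, start = [], 0
    for t in range(1, len(labels)):
        if labels[t] != labels[start]:                     # left cluster c_i, entered c_j
            k = t - start                                  # skill length in primitive steps
            R = sum(gamma ** m * rewards[start + m] for m in range(k))   # Eq. 1, empirically
            segments.append((labels[start], labels[t], k, R))
            start = t
    return segments

def estimate_samdp(segments, num_clusters, gamma, iters=1000):
    """Empirical SAMDP transitions, rewards, lengths, value (Eq. 6) and greedy policy (Eq. 7)."""
    K = num_clusters
    counts, R_sum, k_sum = np.zeros((K, K)), np.zeros((K, K)), np.zeros((K, K))
    for i, j, k, R in segments:
        counts[i, j] += 1
        R_sum[i, j] += R
        k_sum[i, j] += k
    P = counts / counts.sum(axis=1, keepdims=True).clip(min=1)          # P induced by the DQN policy
    R = np.divide(R_sum, counts, out=np.zeros_like(R_sum), where=counts > 0)
    k = np.divide(k_sum, counts, out=np.ones_like(k_sum), where=counts > 0)
    v = np.zeros(K)
    for _ in range(iters):                                              # fixed-point iteration of Eq. 6
        v = (P * (R + gamma ** k * v[None, :])).sum(axis=1)
    q = np.where(counts > 0, R + gamma ** k * v[None, :], -np.inf)      # restrict to observed skills
    return P, R, k, v, q.argmax(axis=1)                                 # greedy skill per cluster (Eq. 7)
```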

3.3 Evaluation criteria

We follow the analysis of Hallak et al. (2013) and define criteria to measure the fitness of a model empirically. We define the Value Mean Square Error (VMSE) as the normalized distance between two value estimates, $\mathrm{VMSE} = \| v_{DQN} - v_{SAMDP} \| \,/\, \| v_{DQN} \|$. The SAMDP value is given by Equation 6, and the DQN value is evaluated by averaging the DQN value estimates over all MDP states in a given cluster (SAMDP state): $v_{DQN}(c_j) = \frac{1}{|c_j|} \sum_{s \in c_j} v_{DQN}(s)$.
The Minimum Description Length (MDL; (Rissanen, 1978)) principle is a formalization of the celebrated Occam’s Razor. It copes with the over-fitting problem for the purpose of model selection. According to this principle, the best hypothesis for a given data set is the one that leads to the best compression of the data. Here, the goal is to find a model that explains the data well, but is also simple in terms of the number of parameters. In our work we follow a similar logic and look for a model that best fits the data but is still “simple”.
Instead of considering "simple" in terms of the number of parameters, we measure the simplicity of the spatio-temporal state aggregation. For spatial simplicity we define the Inertia, $I = \sum_{i} \min_{\mu_j \in C} \| \phi(s_i) - \mu_j \|^2$, which measures the variance of MDP states inside a cluster (AMDP state). For temporal simplicity we define the entropy of the SAMDP transition matrix, and the Intensity Factor, which measures the fraction of in-cluster (as opposed to out-of-cluster) transitions.
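
A sketch of how these criteria can be computed, assuming `phi` are the clustered feature vectors, `labels` and `centers` their cluster assignments and means, and `P` the empirical SAMDP transition matrix; the precise weighting inside the entropy and intensity-factor terms is our reading of the text and should be treated as an assumption.

```python
import numpy as np

def vmse(v_dqn, v_samdp):
    """Normalized distance between the per-cluster DQN and SAMDP value estimates."""
    return float(np.linalg.norm(v_dqn - v_samdp) / np.linalg.norm(v_dqn))

def inertia(phi, labels, centers):
    """Within-cluster sum of squares: spatial simplicity (lower is better)."""
    return float(((phi - centers[labels]) ** 2).sum())

def transition_entropy(P, cluster_sizes):
    """Out-going transition entropy per cluster, weighted by cluster size (assumed form)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        H = -np.nansum(P * np.log(P), axis=1)      # row-wise entropy, treating 0*log(0) as 0
    return float((cluster_sizes * H).sum())

def intensity_factor(P):
    """Fraction of in-cluster transitions: temporal simplicity (higher is better)."""
    return float(np.trace(P) / P.sum())
```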
To summarize, the stages of building an SAMDP model are:

  1. Evaluate: run the trained (DQN) agent, recording visited states, representations and Q-values.

  2. Reduce: apply t-SNE on the state representations to obtain a low-dimensional map.

  3. Aggregate: cluster the states in the map.

  4. Model: fit an SAMDP model and select the best one.

  5. Visualize: visualize the SAMDP on top of the t-SNE map.

4 Experiments

Setup. We evaluate our method on three Atari2600 games: Breakout, Pacman and Seaquest. For each game we collect 120k game states (each represented by 512 features) and the Q-values of all actions. We apply PCA to reduce the data to 50 dimensions, then apply t-SNE using the Barnes-Hut approximation to reach the desired low dimension. We run the t-SNE algorithm for 3000 iterations with a perplexity of 30. We use Spatio-Temporal K-means clustering (Section 3.1) to create the AMDP states (clusters), and evaluate the transition probabilities between them using the trajectory data. We discard flicker transitions, where a cluster is visited for fewer than a threshold number of time steps before transiting out. Finally, we truncate transitions with probability less than 0.1.

Figure 2: Model Selection: Correlation between criteria pairs for the SAMDP model of Breakout.

Model Selection. We perform a grid search on two parameters: i) the number of clusters $K$, and ii) the window size $w$. We found that larger models are too cumbersome to analyze, while smaller ones are too simplistic. We select the best model in the following way. For each configuration in the grid search we compute its entropy, inertia, VMSE and intensity factor, and group the results into one set per criterion over all grid-search configurations. We sort each set from good to bad, i.e., from minimum to maximum (except for the intensity factor, where larger values are considered better). We then iteratively intersect the p-prefix of all sets (i.e., the first p elements of each set), starting with the 1-prefix. We stop when the intersection is non-empty and choose the configuration at the intersection. Figure 2 shows the correlation between pairs of criteria (for Breakout).
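
A sketch of the prefix-intersection rule, assuming the four criteria have already been computed for every grid configuration; all names are illustrative.

```python
def select_model(configs, entropy, inertia, vmse, intensity):
    """Return the configuration that appears earliest in all four rankings (prefix intersection)."""
    rankings = [
        sorted(range(len(configs)), key=lambda i: entropy[i]),     # lower entropy is better
        sorted(range(len(configs)), key=lambda i: inertia[i]),     # lower inertia is better
        sorted(range(len(configs)), key=lambda i: vmse[i]),        # lower VMSE is better
        sorted(range(len(configs)), key=lambda i: -intensity[i]),  # higher intensity factor is better
    ]
    for p in range(1, len(configs) + 1):
        common = set.intersection(*(set(r[:p]) for r in rankings)) # p-prefix intersection
        if common:
            return configs[common.pop()]                           # first non-empty intersection
```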

Overall, we see a tradeoff between spatial and temporal complexity. For example, in the bottom-left plot we observe a correlation between the Inertia and the Intensity Factor: a small window size leads to well-defined clusters in space (low Inertia) at the expense of a complex transition matrix (small intensity factor). A large window size causes the clusters to be more spread out in space (large Inertia), but has the positive effect of intensifying the in-cluster transitions (high intensity factor). We also measure the p-value of the chosen model, with the null hypothesis being an SAMDP model constructed from randomly clustered states. We tested 10000 random SAMDP models, none of which scored better than the chosen model (on any of the evaluation criteria).


Qualitative Evaluation. Examining the resulting SAMDP (Figure 3), it is interesting to note the sparsity of transitions, which indicates that the clusters are well localized in time. Inspecting the mean image of each cluster also reveals some insights about the nature of the skills hiding within. We also see evidence for the "tunnel-digging" option described in Zahavy et al. (2016) in the transitions between clusters 11, 12, 14 and 4.

Figure 3: SAMDP visualization for Breakout over the t-SNE map colored by value estimates (low values in blue and high in red).

Model Evaluation. We evaluate our model using three different methods. First, the VMSE criterion (Figure 4, top): high correlation between the DQN values and the SAMDP values gives a clear indication of the fitness of the model to the data. Second, we evaluate the correlation between the transitions induced by the policy improvement step and the trajectory reward. To do so, for each trajectory we measure the empirical frequency of choosing the greedy skill at each SAMDP state, and present the correlation coefficient between this frequency and the trajectory reward at each state (Figure 4, center). Positive correlation indicates that following the greedy policy leads to high reward; indeed, for most of the states we observe positive correlation, supporting the consistency of the model. The third evaluation is close in spirit to the second one. We create two transition matrices, using the k top-rewarded and the k least-rewarded trajectories respectively, and measure the correlation of the greedy policy with each of the transition matrices for different values of k (Figure 4, bottom). As clearly seen, the correlation of the greedy policy with the top trajectories is higher than its correlation with the bad trajectories.
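
A sketch of the second evaluation, assuming `greedy_freq[j][i]` holds the empirical frequency with which trajectory j takes the greedy skill at cluster i, and `returns[j]` its total reward (both names are illustrative):

```python
import numpy as np

def greedy_reward_correlation(greedy_freq, returns):
    """Per-cluster correlation between following the greedy SAMDP policy and trajectory reward."""
    greedy_freq = np.asarray(greedy_freq, dtype=float)   # shape: (num_trajectories, num_clusters)
    returns = np.asarray(returns, dtype=float)           # shape: (num_trajectories,)
    return np.array([np.corrcoef(greedy_freq[:, i], returns)[0, 1]
                     for i in range(greedy_freq.shape[1])])
```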

Figure 4: Model Evaluation. Top: value function consistency. Center: greedy policy correlation with trajectory reward. Bottom: correlation with the top-rewarded (blue) and least-rewarded (red) trajectories.

Eject Button: Performance improvement. In the following experiment we show how the SAMDP model can help to improve the performance of a trained policy. The motivation for this experiment stems from the idea of shared autonomy (Pitzer et al., 2011). There are domains where errors are not permitted and performance must be as high as possible. The idea of shared autonomy is to allow an operator to intervene in the decision loop at critical times. For example, it is known that in 20% of commercial flights the auto-pilot returns control to the human pilots. For this experiment we first build an SAMDP model and then let the agent play new (unseen) trajectories. We project the online state visitations onto our model and monitor its transitions along it. We build the transition matrices of the top-rewarded and least-rewarded trajectories as above. If the likelihood of the online trajectory under the least-rewarded transition matrix is greater than its likelihood under the top-rewarded one, we press the Eject button and terminate the execution (a procedure inspired by option interruption (Sutton et al., 1999b)). We then measure the average performance of the un-terminated trajectories with respect to all trajectories. The performance improvement achieved with and without the Eject button is presented in Table 1.
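
A sketch of the Eject rule, assuming `P_top` and `P_least` are the transition matrices estimated from the top- and least-rewarded trajectories and `online_clusters` is the cluster sequence of the currently running trajectory; the additive smoothing constant is an assumption.

```python
import numpy as np

def should_eject(online_clusters, P_top, P_least, eps=1e-6):
    """Press Eject if the online trajectory is better explained by the least-rewarded model."""
    ll_top = ll_least = 0.0
    for c_prev, c_next in zip(online_clusters[:-1], online_clusters[1:]):
        ll_top += np.log(P_top[c_prev, c_next] + eps)      # log-likelihood under the 'good' model
        ll_least += np.log(P_least[c_prev, c_next] + eps)  # log-likelihood under the 'bad' model
    return ll_least > ll_top
```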

Game        Average score without Eject   Average score with Eject   Improvement
Breakout    293                           400                        +36%
Seaquest    5641                          6780                       +20%
Pacman      230                           241                        +4.7%
Table 1: Performance gain using the Eject button, averaged over 60 trajectories. Numbers are reported for DQN agents we trained ourselves.

5 Discussion

In this work we considered the problem of automatically building an SAMDP model for analyzing trained policies. We started from a t-SNE map of neural activations and ended with a compact model that gives a clear interpretation of complex RL tasks. We showed how SAMDP can help in identifying skills that are well defined in terms of initiation and termination sets. However, the SAMDP does not offer much information about the skill policies themselves, and we suggest investigating them further in future work. It would also be interesting to see whether the skills of different states actually represent the same behaviour. Most importantly, the skills we find are determined by the state aggregation; therefore, they are impaired by the artifacts of the clustering method used. In future work we will consider other clustering methods that better relate to the topology (such as spectral clustering), to see whether they lead to better skills.

In the Eject experiment we showed how the SAMDP model can help to improve the policy at hand without the need to re-train it. It would be even more interesting to use the SAMDP model to improve the training phase itself. The strength of SAMDP in identifying spatial and temporal hierarchies could be harnessed by hierarchical DRL algorithms (Tessler et al., 2016; Kulkarni et al., 2016), for example by automatically detecting sub-goals or skills.

Another question we’re interested in answering is whether a global control structure exists? Motivated by the success of policy distillation ideas (Rusu et al., 2015), it would be interesting to see how well an SAMDP built for game A, explains game B? Finally we would like to use this model to interpret other DRL agents that are not specifically trained to approximate value such as deep policy gradient methods.

References

  • Bertsekas and Castanon [1989] Dimitri P Bertsekas and David A Castanon. Adaptive aggregation methods for infinite horizon dynamic programming. Automatic Control, IEEE Transactions on, 34(6):589–598, 1989.
  • Dietterich [2000] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res.(JAIR), 13:227–303, 2000.
  • Erhan et al. [2009] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. Dept. IRO, Université de Montréal, Tech. Rep, 4323, 2009.
  • Francis and Wonham [1975] Bruce A Francis and William M Wonham. The internal model principle for linear multivariable regulators. Applied mathematics and optimization, 2(2), 1975.
  • Hallak et al. [2013] Assaf Hallak, Dotan Di-Castro, and Shie Mannor. Model selection in Markovian processes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Kulkarni et al. [2016] Tejas D Kulkarni, Karthik R Narasimhan, Ardavan Saeedi, and Joshua B Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. arXiv preprint arXiv:1604.06057, 2016.
  • Lin [1993] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
  • MacQueen et al. [1967] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.
  • Parr [1998] Ronald Parr. Flexible decomposition algorithms for weakly coupled Markov decision problems. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 422–430. Morgan Kaufmann Publishers Inc., 1998.
  • Pitzer et al. [2011] Benjamin Pitzer, Michael Styer, Christian Bersch, Charles DuHadway, and Jan Becker. Towards perceptual shared autonomy for robotic mobile manipulation. In IEEE International Conference on Robotics Automation (ICRA), 2011.
  • Rissanen [1978] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
  • Rusu et al. [2015] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
  • Sontag [2003] Eduardo D Sontag. Adaptation and regulation with signal detection implies internal model. Systems & control letters, 50(2):119–126, 2003.
  • Stolle and Precup [2002] Martin Stolle and Doina Precup. Learning options in reinforcement learning. Springer, 2002.
  • Sutton et al. [1999a] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1), August 1999a.
  • Sutton et al. [1999b] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1):181–211, 1999b.
  • Tessler et al. [2016] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in minecraft. arXiv preprint arXiv:1604.07255, 2016.
  • Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • Ward [1963] Joe H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
  • Yosinski et al. [2014] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
  • Zahavy et al. [2016] Tom Zahavy, Nir Ben Zrihem, and Shie Mannor. Graying the black box: Understanding dqns. arXiv preprint arXiv:1602.02658, 2016.
  • Zeiler and Fergus [2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.