I Introduction
Reinforcement Learning (RL) has been successfully applied to numerous challenging problems that require autonomous agents to behave intelligently in unstructured real-world environments. One interesting area of research in RL, which motivates this work, is the goal-directed reinforcement learning problem (GDRLP) [1] [2]. In GDRLP, the learning process takes place in two stages. The first stage focuses on solving the goal-directed exploration problem (GDEP), which allows an agent to determine a viable path from an initial state to a goal state in an unknown or only partially known state space. The path found in this stage need not be the optimal one. In the second stage, the agent takes advantage of the previously learned knowledge to optimize the path to the goal state. The two stages iterate in order to converge to the action policy.
RL methods are generally divided into Model-Free (MF) and Model-Based (MB) approaches. MF methods can learn complex policies with minimal feature and policy engineering work. However, the convergence of such methods may require millions of trials, and hence they are sample inefficient [3] [4]. MB methods require a much smaller number of real-world trials to converge, but need an accurate model of the real-world physical system and the environment, which can be challenging to obtain [5]. Also, relying on an accurate model can be problematic because small glitches in the model may lead to catastrophic consequences.
In this paper, we leverage the benefits of both approaches (MB and MF) with an aim to improve the sample efficiency of the training process. We achieve this via a two-step iterative learning mechanism. In the first step, we learn an approximate model of the physical system using an MF scheme. Note that our approach does not need to construct an accurate model; instead, a rough model obtained at little cost is sufficient, and is used to characterize the structure of the problem. In the second step, we leverage this approximate model along with the notion of reachability using Mean First Passage Times (MFPT), where the result is used to guide subsequent MF exploration and learning.
Contribution: The main contributions of our work are:

We propose a hybrid RL approach that introduces a model-based characterization into state-of-the-art RL algorithms to improve sample efficiency.

The model-based characterization is achieved via a model-based RL algorithm that is robust to an approximate model for learning complex policies.
We demonstrate our proposed method on two tasks from the path-planning domain, in 2D and 3D simulation environments respectively. The goal of the agent in both tasks is to learn an optimal policy to reach a goal state from a given start location. Our results show that the proposed hybrid algorithms with model-based characterization were able to learn the optimal policy in very few trials, thereby improving sample efficiency and accelerating the learning process.
II Related Work
Earlier works have demonstrated various approaches to solving the goal-directed reinforcement learning problem. Braga et al. presented a solution to GDRLP for an unknown indoor environment: first, using a temporal difference learning method, they find an initial solution to reach the goal, and then they improve upon this initial solution by employing a variable learning rate [1]. Several earlier works also focused on goal-directed exploration, as it provides insight into the corresponding RL problems. It has been proved that goal-directed exploration with RL methods can be intractable, thereby demonstrating that solving RL problems can be intractable as well [6]. This work also showed that the behavior of an agent follows a random walk until it reaches the goal for the first time. Koenig et al. also studied the effect of representation and knowledge on the tractability of goal-directed exploration with RL algorithms [2]. In our work, however, the primary focus is on the second stage of GDRLP, where the aim is to accelerate convergence to the optimal policy via model characterization.
MF approaches in reinforcement learning can learn complex policies but require many trials to converge. The most widely used model-free reinforcement learning algorithm is perhaps Q-Learning [7], which is detailed in the next section. Another model-free learning algorithm similar to Q-Learning is SARSA [8]. The main difference between the two is that a SARSA agent learns the action-value function of the policy it actually follows, while a Q-Learning agent learns the action-value function of the greedy (exploitation) policy regardless of the exploratory actions it takes. On the other hand, MB approaches can generalize to new tasks and environments in fewer trials; however, an accurate model is necessary. We recently also investigated reachability heuristics and showed that the computational performance for standard and accurate MDP models can be improved [9]. Another interesting research direction focused on reducing the size of the problem space in MB approaches. Boutilier et al. proposed structured reachability analysis of MDPs in order to remove variables from the problem description, thereby reducing the size of the MDP and eventually making it easier to solve [10]. It is therefore very intuitive to investigate approaches that combine the advantages of MF and MB methods [11] [12]. There have been multiple previous works that combined the two paradigms, with the primary objective of speeding up the learning process for MF reinforcement learning approaches. A broad area of research including the DYNA framework [13] [14] leveraged a learned model to generate synthetic experience for MF learning.
Along a similar direction, several prior works focused on devising a model as an initialization for the MF component [15] [16]. One challenge this leads to is inaccuracy in the learned model, which causes the issue of model bias. A suggested solution to overcome model bias is to directly train the dynamics in a goal-oriented manner [17] [18]. Our work is also motivated by this approach in order to deal with model bias.
Unlike prior works on combining MB and MF reinforcement learning methods, we integrate the benefits of both approaches (MB and MF) with an aim to improve the sample efficiency of the training process. The primary objective of this work is to incorporate a model-based characterization using MFPT into a reinforcement learning algorithm (model-free approaches like Q-Learning, or RL frameworks like DYNA), so that the characterization result can be used to guide subsequent MF exploration and learning. Our approach differs from existing hybrid models in that it does not need to construct an accurate model; a rough model obtained at little cost is enough for capturing high-level features. By comparing with state-of-the-art baseline approaches, our evaluations reveal that the proposed hybrid algorithms are able to learn the optimal policy in very few trials with high sample efficiency, and significantly accelerate the practical learning process.
III Preliminaries
III-A Model-Based Reinforcement Learning
Model-based reinforcement learning needs to first build a model, and then use it to derive a policy. The underlying mechanism is the Markov Decision Process (MDP), which is a tuple $(S, A, T, R)$, where $S$ is a set of states and $A$ is a set of actions. The state transition function $T(s, a, s')$ is a probability function such that $T(s, a, s')$ is the probability that action $a$ in state $s$ will lead to state $s'$, and $R$ is a reward function where $R(s, a, s')$ returns the immediate reward received on taking action $a$ in state $s$ that will lead to state $s'$. A policy is of the form $\pi: S \rightarrow A$. We denote $\pi(s)$ as the action associated to state $s$. If the policy of an MDP is fixed, then the MDP behaves as a Markov chain [19].

To solve an MDP, the most widely used approach is value iteration (VI). VI is an iterative procedure that calculates the value (or utility, in some literature) of each state based on the values of the neighbouring states to which it can directly transit. The value $V(s)$ of state $s$ at each iteration can be calculated by the Bellman equation shown below:

$$V(s) = \max_{a \in A} \sum_{s' \in S} T(s, a, s')\big[R(s, a, s') + \gamma V(s')\big], \qquad (1)$$

where $\gamma \in [0, 1)$ is a reward discounting parameter. The stopping criterion for the algorithm is that the values calculated on two consecutive iterations are close enough, i.e., $\max_{s \in S} |V_{t+1}(s) - V_t(s)| \le \epsilon$, where $\epsilon$ is an optimization tolerance/threshold value, which determines the level of convergence accuracy.
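As a concrete illustration, a minimal tabular value-iteration sketch could look as follows (the array shapes, tolerance handling, and function name are our own choices, not the paper's):

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-6):
    """Repeat Bellman backups, Eq. (1), until values on two
    consecutive sweeps differ by less than eps.
    T: (S, A, S) transition probabilities; R: (S, A, S) rewards."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_s' T[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)  # values and greedy policy
        V = V_new
```

On a toy two-state chain where state 0 deterministically transits to an absorbing state 1 with reward 1, this returns $V(0) = 1$ and $V(1) = 0$ for $\gamma = 0.5$.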
Relevant to this work, the prioritized sweeping mechanism is an important heuristic-based approach for efficiently solving MDPs and further speeding up the value iteration process [20]. This heuristic evaluates each state and assigns a score based on the state's contribution to the convergence, and then prioritizes/sorts all states based on their scores (e.g., states with a larger difference in value between two consecutive iterations get higher scores) [21, 22]. Then, in the next dynamic programming iteration, the values of states are evaluated following the newly prioritized order.
Given a model, methods proposed for solving MDPs can be easily extended to the context of MB learning methods [23]. The model-based characterization in our proposed approach is also built on top of this notion.
III-B Model-Free Reinforcement Learning
Model-free reinforcement learning aims at learning a policy without learning a model. The most widely used model-free reinforcement learning algorithm is perhaps Q-Learning [7], which is a special algorithm of the Temporal Difference (TD) learning family [24]. This approach is able to compare the expected utility of the available actions at a given state without requiring a model of the environment. To learn the expected utility of taking a given action $a$ in a given state $s$, it learns an action-value function $Q(s, a)$. The Q-Learning rule is

$$Q(s, a) \leftarrow Q(s, a) + \alpha\big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big], \qquad (2)$$

where $\langle s, a, s', r \rangle$ is an experience tuple, $\alpha$ is the learning rate, and $\gamma$ is the discount factor. After the action-value function is learned, the optimal policy can be constructed by greedily selecting the action with the highest action-value in each state.
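Eq. (2) translates directly into a one-line tabular update. The sketch below is illustrative (the function names and the $\epsilon$-greedy exploration policy are our own additions, commonly paired with Q-Learning but not mandated by the rule itself):

```python
import random

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One application of the Q-Learning rule, Eq. (2).
    Q is a table indexed as Q[state][action]."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """A typical exploration policy used while collecting experience."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])
```

For example, starting from an all-zero table, one update with $\alpha = 0.5$, $\gamma = 0.9$, and reward $r = 1$ moves $Q(s, a)$ halfway toward the TD target of 1.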
III-C Synthesis of Model-Based and Model-Free
There are several frameworks that integrate the model-based and model-free paradigms, the most well-known architecture probably being DYNA [13, 14]. DYNA exploits a middle ground, yielding strategies that are both more effective than model-free learning and more computationally efficient than the certainty-equivalence approach. The DYNA architecture comprises two phases. In the first phase, the agent carries out actions in the environment and performs regular reinforcement learning to learn the value function and adjust the policy. It also uses the real experience to explicitly build up the transition model and/or the reward function associated with the environment. The second phase involves planning updates, where simulated experiences are used to update the policy and value function.
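The two phases can be sketched as a minimal tabular DYNA loop. This is only an illustrative sketch: the `env_step` interface, the deterministic one-step model, and all hyperparameters below are our assumptions, not the paper's Alg. 2:

```python
import random

def dyna_q(env_step, n_states, n_actions, n_steps=1000,
           n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular DYNA: each real step triggers one direct RL update
    (phase 1) plus n_planning simulated updates (phase 2).
    env_step(s, a) -> (s_next, r) is an assumed environment interface."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}  # (s, a) -> (s_next, r); deterministic model for brevity
    s = 0
    for _ in range(n_steps):
        a = random.randrange(n_actions) if random.random() < eps else \
            max(range(n_actions), key=lambda b: Q[s][b])
        s_next, r = env_step(s, a)
        # Phase 1: direct RL update from real experience, plus model learning
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        model[(s, a)] = (s_next, r)
        # Phase 2: planning updates from simulated experience
        for _ in range(n_planning):
            (ps, pa), (pn, pr) = random.choice(list(model.items()))
            Q[ps][pa] += alpha * (pr + gamma * max(Q[pn]) - Q[ps][pa])
        s = s_next
    return Q
```

The planning loop replays remembered transitions, so a single real reward can propagate through many states per environment step, which is the source of DYNA's sample efficiency.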
III-D Mean First Passage Times
Before we elaborate on what we mean by model characterization in our work, we will describe a key concept called Mean First Passage Times (MFPT).
The first passage time (FPT), $T_{ij}$, is defined as the number of state transitions involved in reaching state $s_j$ for the first time when starting from state $s_i$. The mean first passage time (MFPT), $\mu_{ij}$, from state $s_i$ to $s_j$ is the expected number of hopping steps for reaching state $s_j$ given that the chain starts in state $s_i$ [25]. The MFPT analysis is built on the Markov chain and is independent of the agent's actions. Remember that when an MDP is associated with a fixed policy, it behaves as a Markov chain [19].

Formally, let us define a Markov chain with $n$ states and transition probability matrix $P = (p_{ik})$. If the transition probability matrix is regular, then each MFPT, $\mu_{ij}$, satisfies the conditional expectation formula below:

$$\mu_{ij} = E[T_{ij}] = \sum_{k=1}^{n} E[T_{ij} \mid B_k]\, p_{ik}, \qquad (3)$$

where $p_{ik}$ represents the transition probability from state $s_i$ to $s_k$, and $B_k$ is the event where the first state transition happens from $s_i$ to $s_k$. From the definition of mean first passage times, we have $E[T_{ij} \mid B_k] = 1 + \mu_{kj}$ for $k \neq j$, and $E[T_{ij} \mid B_j] = 1$. So we can rewrite Eq. (3) as follows:

$$\mu_{ij} = p_{ij} + \sum_{k \neq j} p_{ik}\,(1 + \mu_{kj}). \qquad (4)$$

Since $\sum_{k=1}^{n} p_{ik} = 1$, Eq. (4) can be formulated as per the equation below:

$$\mu_{ij} = 1 + \sum_{k \neq j} p_{ik}\, \mu_{kj}. \qquad (5)$$

Solving all MFPT variables with respect to a fixed goal state $s_j$ can be viewed as solving a system of linear equations:

$$(I - \tilde{P})\,\boldsymbol{\mu}_j = \mathbf{1}, \qquad (6)$$

where $\tilde{P}$ is the transition matrix $P$ with its $j$-th column set to zero, $\boldsymbol{\mu}_j = (\mu_{1j}, \ldots, \mu_{nj})^{T}$, and $\mathbf{1}$ is a vector of ones. The values $\mu_{1j}, \mu_{2j}, \ldots, \mu_{nj}$ represent the MFPTs calculated for state transitions from states $s_1, s_2, \ldots, s_n$ to $s_j$. To solve the above equation, efficient decomposition methods [26] may help to avoid a direct matrix inversion.
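For a fixed goal state, Eq. (6) reduces to a single linear solve. A minimal sketch (zeroing the goal column of the chain's transition matrix and calling a dense solver are implementation choices on our part; decomposition-based solvers [26] would be preferred at scale):

```python
import numpy as np

def mfpt_to_goal(P, goal):
    """Solve Eq. (6): (I - P_tilde) mu = 1, where P_tilde is the
    chain's transition matrix with its goal column zeroed out.
    Returns the vector of MFPTs from every state to the goal."""
    n = P.shape[0]
    P_tilde = P.copy()
    P_tilde[:, goal] = 0.0
    mu = np.linalg.solve(np.eye(n) - P_tilde, np.ones(n))
    mu[goal] = 0.0  # by definition, the goal is reached in zero steps
    return mu
```

As a sanity check, for a two-state chain that reaches the goal with probability 0.5 per step, the MFPT is the mean of a geometric distribution, i.e., 2 steps.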
IV Technical Approach
In this work, we are interested in goal-directed autonomy, where the agent is given a goal or terminal state to arrive at. Note, a Markov system is defined as absorbing if from every non-terminal state it is possible to eventually enter a goal state [27]. We restrict our attention to absorbing Markov systems so that the agent finally terminates at a goal.
IV-A Reachability Characterization using Mean First Passage Times
The notion of MFPT allows us to define the reachability of a state. By “reachability of a state” we mean how hard it is, under the current fixed policy, for the agent to transit from that state to the given goal/absorbing state. With all MFPT values obtained, we can construct a reachability landscape, which is essentially a “map” measuring the degree of difficulty of all states transiting to the goal.
Fig. 1 shows a series of landscapes represented as heatmaps in our simulated environment. The values in the heatmap range from 0 (cold color) to 600 (warm color). In order to better visualize the low-MFPT spectrum that we are most interested in, any value greater than 600 has been clipped to 600. Fig. 1 shows the change of landscapes as the learning algorithm proceeds. Initially, all states except the goal state are initialized as unreachable, as shown by the high-MFPT color in Fig. 1.
We observe that the reachability landscape conveys very useful information on the potential impacts of different states. More specifically, a state with better reachability (a smaller MFPT value) is more likely to make a larger change during the MDP convergence procedure, leading to a bigger convergence step. With this observation and the new metric, we can design the state prioritization schemes for value iteration that we use in our proposed approaches.
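One such prioritization scheme can be sketched as an in-place (Gauss-Seidel) value-iteration sweep that visits states in order of increasing MFPT, i.e., the most reachable states first. This is an illustrative sketch of the idea, not the paper's MFPT-VI pseudocode:

```python
import numpy as np

def mfpt_vi_sweep(V, T, R, mu, gamma=0.95):
    """One in-place value-iteration sweep that processes states in
    increasing-MFPT order, so high-impact (most reachable) states are
    backed up first and their new values feed later backups.
    T: (S, A, S) transitions; R: (S, A, S) rewards; mu: MFPT vector."""
    for s in np.argsort(mu):
        # Bellman backup for state s, Eq. (1)
        V[s] = np.max(np.einsum('jk,jk->j', T[s], R[s] + gamma * V[None, :]))
    return V
```

Because the sweep is in place, states near the goal (small MFPT) are refreshed first, and their updated values immediately propagate to the harder-to-reach states later in the same sweep.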
IV-B Mean First Passage Time based Q-Learning (MFPT-Q)
Classic Q-Learning converges to the optimal solution through the Q-Learning rule shown in Eq. (2), which essentially learns the state-action value function representing the expected utility of the available actions at a given state.
Our proposed hybrid algorithm, MFPT-Q, performs two main operations every iteration, involving a model-free and a model-based update. Firstly, given an experience $\langle s, a, s', r \rangle$, it builds an approximate model by updating the transition function $T$ and reward function $R$. To update $T$, it uses count statistics, i.e., the counts of transitions occurring from state $s$ to $s'$ under action $a$, followed by normalization between 0 and 1.
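The count-statistics model update described above can be sketched as follows (the class and method names are our own, hypothetical choices):

```python
import numpy as np

class ApproximateModel:
    """Count-based estimates of T and R from experience tuples
    (s, a, s', r), as used for the model-based characterization."""
    def __init__(self, n_states, n_actions):
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions, n_states))

    def update(self, s, a, s_next, r):
        # increment the transition count and accumulate observed reward
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a, s_next] += r

    def transition_probs(self):
        # normalize counts per (s, a) pair into probabilities in [0, 1];
        # unvisited pairs keep all-zero rows
        totals = self.counts.sum(axis=2, keepdims=True)
        return np.divide(self.counts, totals,
                         out=np.zeros_like(self.counts), where=totals > 0)
```

After observing a transition $s \to s'$ twice and $s \to s$ once under the same action, for instance, the estimated probability of $s \to s'$ is 2/3.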
The second step leverages the approximate model computed earlier to perform a model-based update using the MFPT-VI algorithm (see lines 8-15 of Alg. 1). The MFPT-VI method is built on a metric using reachability (MFPT values), since the reachability characterization of each state reflects the potential impact/contribution of that state. Hence, such characterization provides a natural basis for state prioritization while performing a version of value iteration for Q values during the model-based update.
Note that, since the MFPT computation is relatively expensive, and the purpose of using MFPT is to characterize global and general (instead of fine) features of all states, it is not necessary to compute the MFPT at every iteration, but rather only after every few iterations. For the iterations between two adjacent MFPT updates, the values of all states converge from a “local refinement” perspective. The computational process of MFPT-Q is pseudocoded in Alg. 1.
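Since Alg. 1 is not reproduced here, the following is a hedged, end-to-end sketch of the MFPT-Q idea: model-free Q updates every step, a count-based approximate model, and (only every `mfpt_period` iterations) an MFPT solve under the current greedy policy followed by an MFPT-prioritized model-based sweep. The `env_step` interface, the fallback on a singular system, and all hyperparameters are illustrative assumptions:

```python
import numpy as np

def mfpt_q(env_step, n_states, n_actions, goal, n_iters=2000,
           alpha=0.2, gamma=0.95, eps=0.2, mfpt_period=100):
    """Sketch of MFPT-Q. env_step(s, a) -> (s_next, r) is assumed."""
    Q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions, n_states))
    r_sum = np.zeros((n_states, n_actions, n_states))
    s = 0
    for t in range(n_iters):
        # model-free step: epsilon-greedy action, Q-Learning update, Eq. (2)
        a = np.random.randint(n_actions) if np.random.rand() < eps \
            else int(Q[s].argmax())
        s_next, r = env_step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        counts[s, a, s_next] += 1
        r_sum[s, a, s_next] += r
        if t and t % mfpt_period == 0:
            # normalize counts into an approximate model (T, R)
            tot = counts.sum(axis=2, keepdims=True)
            T = np.divide(counts, tot, out=np.zeros_like(counts),
                          where=tot > 0)
            R = np.divide(r_sum, counts, out=np.zeros_like(r_sum),
                          where=counts > 0)
            # Markov chain induced by the current greedy policy
            P = T[np.arange(n_states), Q.argmax(axis=1)]
            P_tilde = P.copy()
            P_tilde[:, goal] = 0.0
            try:  # solve Eq. (6) for the reachability landscape
                mu = np.linalg.solve(np.eye(n_states) - P_tilde,
                                     np.ones(n_states))
            except np.linalg.LinAlgError:
                mu = np.ones(n_states)  # goal not yet reachable in the model
            # model-based sweep, most reachable states first
            for i in np.argsort(mu):
                for b in range(n_actions):
                    Q[i, b] = (T[i, b] * (R[i, b] + gamma * Q.max(axis=1))).sum()
        s = s_next
    return Q
```

The expensive linear solve runs only once per `mfpt_period` iterations, matching the sparse-characterization schedule described above.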
IV-C Mean First Passage Time based DYNA (MFPT-DYNA)
The classic DYNA architecture balances between real and simulated experience to speed up the training process. As mentioned earlier, the agent learns a value function and updates the policy using both sets of experiences. In addition, the agent also learns a model of the environment using real experiences. This notion of a learned model in DYNA makes it intuitive and easy to integrate our proposed model-based characterization. Here we further extend this framework and propose an upgraded hybrid algorithm, Mean First Passage Time based DYNA (MFPT-DYNA).
Remember that the classic DYNA algorithm includes two steps involving real experience and simulated experience. For the simulated experience, we employ the classic DYNA procedure where the simulation performs additional updates, i.e., it chooses state-action pairs at random and updates the state-action values according to the rule in Eq. (2) (see lines 9-13 of Alg. 2).
Different from the standard DYNA mechanism, our MFPT-DYNA utilizes the real experience $\langle s, a, s', r \rangle$ both to update the state-action values according to the rule in Eq. (2) and to build the approximate model. Again, for updating $T$, it increments the count statistics for the transitions occurring from state $s$ to $s'$ under action $a$, followed by normalization between 0 and 1. It updates $R$ based on the reward received for taking action $a$ in state $s$.
Analogous to the MFPT-Q case, we propose a model-based update using the MFPT-VI algorithm. The MFPT-VI algorithm leverages the approximate model (represented by the transition and reward functions) computed earlier to perform a version of the value-iteration update for Q values, as shown in lines 14-21 of Alg. 2.
One advantage of this framework is the introduction of the model-based characterization by MFPT-VI. MFPT-VI assesses the importance of states based on their MFPT values and thus provides a natural basis for effectively prioritizing states while updating state-action values. This mechanism for updating state-action values allows the algorithm to converge within a very small number of iterations, which in practice decreases the overall training time by a significant margin. The evaluation details are presented in Section V.
IV-D Time Complexity Analysis
Q-Learning and DYNA have a sample complexity that is polynomial in the number of states $n$ for obtaining a policy arbitrarily close to the optimal one with high probability [28] [29]. Calculation of the MFPT requires solving a linear system, which involves a matrix decomposition of cost $O(n^3)$ with standard methods (lower exponents are possible if state-of-the-art algorithms are employed [26]), given that the size of the matrix is the number of states $n$. Therefore, for each iteration, the worst-case time complexity of both the MFPT-Q and MFPT-DYNA algorithms is dominated by this decomposition. Note that in practical applications, since the expensive MFPT computation is used for summarizing global features, it is invoked sparsely (less frequently), which also saves much computation time.
V Experimental Results
In this section, we compare the performance of our proposed MFPT-Q and MFPT-DYNA with their corresponding baseline methods, Q-Learning and DYNA, respectively. More importantly, through this comparison, we wish to demonstrate that our proposed feature characterization mechanism can be used as a module in other existing popular frameworks (not limited to Q-Learning or DYNA) in order to further speed up their practical learning processes.
V-A Experimental Setting
Task Details
We validated our method through numerical evaluations with two types of simulation suites running on a Linux machine.
For the first task, we developed a simulator in C++ using OpenGL. To obtain discrete MDP states, we tessellate the agent's workspace into a 2D grid map, and represent the center of each grid cell as a state. In this way, hopping between two states represents the corresponding motion in the workspace. Each non-boundary state has a total of nine actions, i.e., it can move in the directions N, NE, E, SE, S, SW, W, NW, plus an idle action returning to itself. A demonstration is shown in Fig. 1. This environmental setting allows us to better visualize the characterized reachability landscape, with a small number of states in 2D space.
For the second task, we developed a simulator in C++ using ROS and Rviz [30]. The agent's workspace is partitioned into a 3D voxel map where the center of each voxel denotes an MDP state. Each non-boundary state has a total of seven actions, i.e., it can move in the directions N, E, S, W, TOP, BOTTOM, plus an idle action returning to itself. Such a 3D environment setting is more complex than the 2D setting. Moreover, it can be leveraged to simulate various robotic path-planning application scenarios, such as quadrotor trajectory planning, and to demonstrate the practicality of our proposed algorithms for such tasks. A demonstration of the agent flying in the 3D simulation environment is shown in Fig. 2.
In both tasks, the objective of the agent is to reach a goal state from an initial start location in the presence of obstacles. The reward function for both setups assigns a high positive reward at the goal state and −1 for all other states. All experiments were performed on a laptop computer with 8GB RAM and a 2.6 GHz quad-core Intel i7 processor.
Evaluation Methods
We are concerned about both computational performance and real-world training performance. Thus, we designed two evaluation metrics:

For the first metric, we look into the computational costs of the proposed and baseline approaches, where we investigate the number of iterations required to converge as well as the computational runtime used to generate the result.

For the second evaluation metric, we evaluate and compare the actual time used for training and completing a task. We do this because the robot needs to interact with the real world, and the time spent on training with real-world experience can be much greater than the computational time cost. Unsurprisingly, such savings also extend to other costs, such as energy, if the task can be completed more quickly.
V-B 2D Grid Setup
In this setup, we compare our proposed algorithms with their corresponding baseline algorithms in terms of their computational runtimes as well as the numbers of iterations required to converge.
Computational Time Cost: We compare the computational time taken by the algorithms as the number of states changes. The times taken by the Q-Learning and MFPT-Q algorithms to converge to the optimal solution (with the same convergence error threshold) are shown in Fig. 3. The results reveal that our proposed MFPT-Q converges faster than Q-Learning when the number of states is fewer than 5,000. When the number of states exceeds 5,000, Q-Learning takes less time to converge than our proposed MFPT-Q. Fig. 3 also compares the time taken by DYNA and MFPT-DYNA. Here, we observe that our proposed MFPT-DYNA takes more time to converge than DYNA. The reason is the increased time taken for the MFPT calculation, which dominates the convergence time in the execution of MFPT-DYNA.
Number of Iterations: We then evaluate the number of iterations taken by the algorithms to converge to the optimal policy, as this well reflects the sample efficiency.
Fig. 3 compares the number of iterations taken by Q-Learning and MFPT-Q. The plot clearly shows that MFPT-Q takes far fewer iterations to converge than standard Q-Learning. Fig. 3 also compares the number of iterations taken by DYNA and MFPT-DYNA, from which we can observe that MFPT-DYNA takes far fewer iterations to converge than DYNA.
This again demonstrates the merit of model-based characterization via MFPT as a means for faster convergence in significantly fewer iterations.
V-C 3D Grid Setup
To evaluate a larger number of states as well as more complex environments, we compare our proposed algorithms, MFPT-Q and MFPT-DYNA, against their baselines in the 3D simulation environment.
Computational Time Cost: Fig. 4 presents the computational time taken by the Q-Learning and MFPT-Q algorithms to converge to the optimal solution. We can see that the computational time cost trends are very similar to the 2D case, where for a large number of states the MFPT variants take longer than the baseline methods. We attribute this to the fact that as the number of states increases, the time taken for the MFPT calculation also increases and dominates the computational time cost of the MFPT variants.
Number of Iterations: Similarly, we also compare the number of iterations taken by Q-Learning and MFPT-Q. As shown in Fig. 4, our proposed MFPT variant converges in fewer trials than Q-Learning. Next, we compare the number of iterations taken by DYNA and MFPT-DYNA. Again, the advantage of our proposed hybrid RL approach that introduces a model-based characterization into DYNA is clearly visible in Fig. 4, as the results show that MFPT-DYNA requires a much smaller number of iterations to converge than DYNA.
V-D Training Runtime Performance
As previously discussed, the objective of RL for an agent is to learn an optimal policy in a given environment in order to reach a goal state from a given starting location. Here we present our second evaluation metric that considers the total time involved during the agent’s training process in the simulation environment.
Fig. 5 shows the progress of an agent in the 3D environment using Q-Learning. During the initial stages of the learning process, the agent could hardly overcome the first obstacle, as shown in Fig. 5. After 5,000 trials, the agent could overcome the first obstacle but was unable to overcome the next one. At the end of the training process, the agent had learned the optimal policy, with which it could overcome all obstacles and reach the goal, as shown in Fig. 5.
We trained an agent in the 3D environment under varying voxel sizes. Fig. 6 shows that the agent takes much less overall time to learn the optimal policy when MFPT-Q is employed in comparison to classical Q-Learning. Similarly, the agent takes much less time to learn and complete the task using MFPT-DYNA in comparison to the DYNA algorithm, as shown in Fig. 6.
Since the training and learning efforts are, in practice, much more expensive and important than the computational time cost, these results re-establish the benefits of our hybrid algorithms towards improving sample efficiency in goal-directed reinforcement learning. Such faster convergence and reduced training time are owing to the underlying mechanism of model-based characterization via MFPT introduced into the existing reinforcement learning schemes.
VI Conclusions
In this paper, we proposed a hybrid approach in which we introduced a new model-based characterization that can be added to reinforcement learning techniques in order to improve their sample efficiency. We achieved this by synthesizing the advantages of both the model-free and model-based reinforcement learning paradigms. The proposed hybrid framework can further accelerate reinforcement learning approaches via an integration of the MFPT feature characterization mechanism. The experimental results show the remarkable merit of model-based characterization in our hybrid algorithms, which learn much faster with fewer samples than their state-of-the-art reinforcement learning counterparts.
References

[1] Arthur P. de S. Braga and Aluizio F. R. Araújo. Goal-directed reinforcement learning using variable learning rate. In Flávio Moreira de Oliveira, editor, Advances in Artificial Intelligence, pages 131-140. Springer, Berlin, Heidelberg, 1998.
[2] Sven Koenig and Reid G. Simmons. The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22(1):227-250, 1996.
[3] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238-1274, 2013.
[4] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. arXiv e-prints, 2015.
[5] Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.
[6] Steven D. Whitehead and Dana H. Ballard. Learning to perceive and act by trial and error. Machine Learning, 7(1):45-83, 1991.
[7] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, pages 279-292, 1992.
[8] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical report, 1994.
[9] Shoubhik Debnath, Lantao Liu, and Gaurav Sukhatme. Reachability and differential based heuristics for solving Markov decision processes. In Proceedings of the 2017 International Symposium on Robotics Research, forthcoming.
[10] Craig Boutilier, Ronen I. Brafman, and Christopher Geib. Structured reachability analysis for Markov decision processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI '98), pages 24-32. Morgan Kaufmann, 1998.
[11] Y. Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. arXiv e-prints, 2017.
[12] S. Bansal, R. Calandra, K. Chua, S. Levine, and C. Tomlin. MBMF: Model-based priors for model-free reinforcement learning. arXiv e-prints, 2017.
[13] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216-224. Morgan Kaufmann, 1990.
[14] Richard S. Sutton. Planning by incremental dynamic programming. In Proceedings of the Eighth International Workshop on Machine Learning, pages 353-357. Morgan Kaufmann, 1991.
[15] F. Farshidian, M. Neunert, and J. Buchli. Learning of closed-loop motion control. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1441-1446, 2014.
[16] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv e-prints, 2017.
[17] P. L. Donti, B. Amos, and J. Zico Kolter. Task-based end-to-end model learning in stochastic optimization. arXiv e-prints, 2017.
[18] S. Bansal, R. Calandra, T. Xiao, S. Levine, and C. J. Tomlin. Goal-driven dynamics learning via Bayesian optimization. arXiv e-prints, 2017.
[19] John G. Kemeny, Hazleton Mirkil, J. Laurie Snell, and Gerald L. Thompson. Finite Mathematical Structures. Prentice-Hall, 1959.
[20] Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, pages 103-130, 1993.
[21] David Andre, Nir Friedman, and Ronald Parr. Generalized prioritized sweeping. In Advances in Neural Information Processing Systems, 1998.
[22] David Wingate and Kevin D. Seppi. Prioritization methods for accelerating MDP solvers. Journal of Machine Learning Research, 6:851-881, 2005.
[23] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4(1):237-285, 1996.
[24] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.
[25] David Assaf, Moshe Shaked, and J. George Shanthikumar. First-passage times with PFr densities. Journal of Applied Probability, 22(1):185-196, 1985.
[26] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, 1996.
[27] Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1):49-107, 2000.
[28] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 11, pages 996-1002. MIT Press, 1999.
[29] Luis C. Cobo, Charles L. Isbell, and Andrea L. Thomaz. Object focused Q-learning for autonomous agents. In Proceedings of the 2013 International Conference on Autonomous Agents and Multiagent Systems (AAMAS '13), pages 1061-1068, 2013.
[30] Rviz: 3D visualization tool for ROS. http://wiki.ros.org/rviz. Accessed: 2018-02-27.