Model-free Reinforcement Learning (rl) requires robots to collect large amounts of data to evaluate and improve their decision-making policies. For marine robots, this can be challenging, since learning must be performed online and their acoustics-based sensors produce data in low volumes. Here, we present an algorithm that supports the learning of navigation policies with very little data.
Our algorithm belongs to the class of value estimation methods. Such methods are rooted in Bellman’s equation, which describes the value of taking an action in a given state as the sum of the expected reward and the forecasted value over a random transition:
The contractive nature of Bellman’s equation motivates many standard algorithms, which apply the recursion to obtain better estimates of the true value [1, 2, 3]. Our method takes a less conventional but more data-efficient approach: it reduces Bellman updates to Gaussian Process (gp) regression.
gp regression is a flexible Bayesian method for learning unknown functions from data. It uses a kernel-based covariance structure to promote a high degree of data association, which makes it well suited for learning when data is scarce. The main drawback is the prediction complexity, which scales prohibitively as $\mathcal{O}(N^3)$, where $N$ is the number of data points. To address this issue, we appeal to a sparse approximation.
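To make the cost of exact prediction concrete, the following is a minimal sketch of textbook gp regression in NumPy. It is not the authors' implementation: the squared-exponential kernel, the noise variance, and the names `sq_exp_kernel` and `gp_predict` are assumptions chosen for illustration.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A (n, d) and B (m, d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(diff**2, axis=-1) / length_scale**2)

def gp_predict(X, y, X_star, noise_var=1e-2):
    """Exact gp posterior mean and latent variance at X_star (zero prior mean)."""
    K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))   # N x N
    K_star = sq_exp_kernel(X, X_star)                      # N x N_star
    L = np.linalg.cholesky(K)                              # the cubic-cost step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, K_star)
    mean = K_star.T @ alpha
    # k(x, x) = 1 for this kernel, so the prior variance is one everywhere.
    var = np.ones(X_star.shape[0]) - np.sum(V**2, axis=0)
    return mean, var

# Hypothetical data: eight noisy-free samples of a sine wave.
X = np.linspace(0.0, 5.0, 8)[:, None]
y = np.sin(X).ravel()
mean, var = gp_predict(X, y, X)
```

The factorization of the N x N matrix dominates the runtime, which is the step a sparse approximation aims to avoid.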
Sparse approximations have been proposed to reduce the complexity of exact predictions [6, 4, 7, 8]. Many methods use an information criterion as the basis for rejecting redundant data. For gp-rl, Engel et al. employ the conditional covariance as a measure of relative error [4, 7], and they discard points that contribute little error to predictions. By limiting the active set to points, their predictions never cost more than operations.
Besides discarding potentially useful information, the most prevalent issue with rejection sparsification is how it interferes with model selection. It can cause sharp changes to appear in the marginal likelihood and complicate evidence maximization with common optimizers, such as l-bfgs.
In what follows, we describe a new method for gp value estimation that induces sparsity without discarding data. We approximate the exact td regression framework of Engel et al. with a smaller active set containing adjustable support points. With this change, our method achieves the same prediction complexity as the state-of-the-art approximation while incurring less approximation error. Our method is based on the Sparse Pseudo-input Gaussian Process (spgp). Therefore, it inherits many of spgp’s favorable characteristics, all of which we show support the unique needs of robots learning to navigate in a marine environment.
2 TD Value Estimation as GP Regression
td algorithms recover the latent value function with data gathered in the standard rl fashion: at each step, the robot selects an action based on its current state, after which it transitions to the next state and collects a reward. The repeated interaction is described as a Markov Decision Process, associated with a transition distribution, a stationary policy, and a discount factor. As the name suggests, td algorithms update a running estimate of the value function to minimize its difference from the Bellman estimate, which is formed from the observed reward and the discounted value of the next state-action pair. Once the estimate converges, a robot can navigate by selecting actions from the greedy policy, which picks the action with the highest estimated value in each state.
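For reference, the frequentist td update described above can be sketched in tabular form. This is the textbook sarsa rule rather than anything specific to this paper; the table size, step size, and discount factor below are hypothetical.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.9, step_size=0.5):
    """Move Q[s, a] toward the Bellman estimate r + gamma * Q[s_next, a_next]."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += step_size * td_error
    return td_error

def greedy_action(Q, s):
    """Greedy policy: the action with the highest estimated value in state s."""
    return int(np.argmax(Q[s]))

# Hypothetical problem with three states and two actions.
Q = np.zeros((3, 2))
td = sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
best = greedy_action(Q, 0)
```

Each update touches a single table entry, which is why such methods typically need many transitions before the estimate becomes useful.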
The Gaussian Process Temporal Difference (gptd) framework improves upon the data efficiency of frequentist td estimation by departing from the contractive nature of Bellman’s equation, in favor of a convergence driven by non-parametric Bayesian regression. The data model is based on the random return, expressed as the sum of its mean and a zero-mean residual. Model inputs are state-action vectors, and value differences are used to describe the observation process:
Our work makes the simplifying assumption that the noise levels are i.i.d. random variables with constant parameters. Under this assumption, transitions exhibit no serial correlation. Although it is straightforward to model serially-correlated noise, doing so would preclude application of our desired approximation. We elaborate on this point in Section 3.
Given a time-indexed sequence of transitions , we stack the model variables into vectors, resulting in the complete data model: , where , and
Notice the commonality Equation 2 has with a standard gp likelihood model. Both models assume the outputs are noisy observations of a latent function. What distinguishes td estimation is the presence of value correlations, imposed by Bellman’s equation and encoded as temporal difference coefficients. Used for exact gp regression, Equation 2 leads to the gp-sarsa algorithm: a non-parametric Bayesian method for recovering latent values.
As a Bayesian method, gp-sarsa computes a predictive posterior over the latent values by conditioning on observed rewards. The corresponding mean and variance are used for policy evaluation:
Here, is the covariance matrix with elements , , and , where . Subscripts denote dimensionality, e.g. .
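The posterior can be illustrated with a short NumPy sketch that follows the gptd formulation of Engel et al. under the i.i.d. noise assumption. The matrix name `H`, its shape convention (one row per transition), the kernel, and all numerical values are assumptions made here for illustration and may differ from the notation of Equations 2 and 3.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A (n, d) and B (m, d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(diff**2, axis=-1) / length_scale**2)

def td_matrix(num_transitions, gamma):
    """Temporal difference coefficients: row t encodes value[t] - gamma * value[t + 1]."""
    H = np.zeros((num_transitions, num_transitions + 1))
    idx = np.arange(num_transitions)
    H[idx, idx] = 1.0
    H[idx, idx + 1] = -gamma
    return H

def gp_sarsa_posterior(X, r, X_star, gamma=0.9, noise_var=0.1):
    """Posterior mean and variance of latent values at X_star given rewards r."""
    H = td_matrix(len(r), gamma)
    G = H @ sq_exp_kernel(X, X) @ H.T + noise_var * np.eye(len(r))
    HK_star = H @ sq_exp_kernel(X, X_star)
    mean = HK_star.T @ np.linalg.solve(G, r)
    prior_var = np.ones(len(X_star))            # k(x, x) = 1 for this kernel
    var = prior_var - np.sum(HK_star * np.linalg.solve(G, HK_star), axis=0)
    return mean, var

# Hypothetical trajectory: six state-action inputs, hence five transitions.
X = np.linspace(0.0, 1.0, 6)[:, None]
r = -np.ones(5)
X_star = np.array([[0.0], [0.5], [1.0]])
mean, var = gp_sarsa_posterior(X, r, X_star)
```

The linear solve against `G` involves a dense matrix whose size grows with the number of transitions, which motivates the sparse treatment in the next section.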
3 Sparse Pseudo-input Gaussian Process Temporal Difference Learning
Figure 1: Training samples (magenta) are taken from the prior. The predictive mean is shown in black, and two standard deviations of uncertainty are shown in gray. We plot spgp-sarsa before (center) and after (right) pseudo-input optimization. After training, the sparse method is nearly identical to the exact method.
The gp-sarsa method requires an expensive matrix inversion, costing $\mathcal{O}(N^3)$ for $N$ data points. Since robots must estimate values online to improve their navigation policies, we appeal to an approximate method which induces sparsity in the standard data model (Equation 2). We expand the probability space with additional pseudo values. Here, pseudo values serve a parametric role in a non-parametric setting; they are free variables that provide probability mass at support locations. The extra latent variables obey the same data model as the latent values but are predetermined and thus exhibit no noise. By conditioning upon the pseudo values, we can collapse the predictive probability space such that all dense matrix inversions have rank equal to the number of pseudo values. Finally, we optimize the support locations to maximize the likelihood and produce a high-quality fit. This strategy is called Sparse Pseudo-input gp regression (Figure 1).
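The pseudo-input strategy can be sketched for ordinary regression, following the standard spgp predictive equations of Snelson and Ghahramani. This sketch omits the temporal difference coefficients, so it is not the spgp-sarsa method derived below; the kernel, jitter, and variable names are assumptions for illustration.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A (n, d) and B (m, d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(diff**2, axis=-1) / length_scale**2)

def spgp_predict(X, y, X_bar, X_star, noise_var=0.1, jitter=1e-10):
    """spgp regression mean and latent variance using pseudo inputs X_bar."""
    K_M = sq_exp_kernel(X_bar, X_bar) + jitter * np.eye(len(X_bar))   # M x M
    K_MN = sq_exp_kernel(X_bar, X)                                    # M x N
    K_Ms = sq_exp_kernel(X_bar, X_star)                               # M x N_star
    # Diagonal of K_N - K_NM K_M^{-1} K_MN, using k(x, x) = 1 for this kernel.
    lam = 1.0 - np.sum(K_MN * np.linalg.solve(K_M, K_MN), axis=0)
    d = np.maximum(lam, 0.0) + noise_var     # N diagonal entries, trivial to invert
    Q = K_M + (K_MN / d) @ K_MN.T            # the only dense system is M x M
    mean = K_Ms.T @ np.linalg.solve(Q, K_MN @ (y / d))
    var = (1.0
           - np.sum(K_Ms * np.linalg.solve(K_M, K_Ms), axis=0)
           + np.sum(K_Ms * np.linalg.solve(Q, K_Ms), axis=0))
    return mean, var

# Hypothetical data: eight samples, four pseudo inputs placed on every other sample.
X = np.linspace(0.0, 5.0, 8)[:, None]
y = np.sin(X).ravel()
X_bar = X[::2]
X_star = np.array([[1.0], [2.5]])
mean, var = spgp_predict(X, y, X_bar, X_star)
```

When the pseudo inputs coincide with the training inputs, this sketch recovers the exact gp mean up to the jitter, which is a convenient sanity check.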
Although spgp regression is well-known, it has never been applied to td estimation, where latent variables exhibit serial correlation. Therefore, current results from the sparse gp literature do not apply. In what follows we apply the spgp principles to derive a new method for td value estimation, which we call spgp-sarsa. The method is data-efficient and fast enough for online prediction.
3.1 Latent Value Likelihood Model
Given an input location, , the likelihood of the latent value, , is the conditional probability
where , . The complete data likelihood is obtained by stacking the independent single transition likelihoods
where we defined , , with .
3.2 Conditioned Data Likelihood Model
Here we derive the likelihood distribution of the observed rewards conditioned on the pseudo values. Consider the transformed joint distribution over values, rewards, and pseudo values
The likelihood is obtained by conditioning on and invoking transition independence:
3.3 Posterior of Pseudo Values
To obtain the posterior , we use Bayes’ rule. Given the marginal and the conditional for given (Equation 6), the posterior for given is
3.4 Latent Value Predictive Posterior
The predictive posterior is obtained by marginalizing the pseudo values:
Let . Our new predictive is Gaussian with mean and variance
Equation 8 represents our main contribution. The parameters and are independent of the input, and their expressions depend on two inverses. The first is the dense -rank matrix, . The second is the -rank diagonal matrix . When , both matrices are easier to invert than a dense -rank matrix. Thus, Equation 8 provides motivation for estimating and predicting latent values efficiently.
3.5 Assumptions and Related Work
There are several key assumptions underpinning our results. To guarantee the likelihood can be factored, we need to assume that transitions are uncorrelated. Had we modeled serial correlation in the noise process, the noise would be distributed with a tridiagonal covariance, which cannot be factored directly. It is possible to obtain a factorable model by applying a whitening transform. However, this changes the observation process to Monte Carlo samples, which are known to be noisy. Under our simpler set of assumptions, we obtain an efficiently-computable, sparse representation of the value posterior that is amenable to smooth evidence maximization. Section 4.1 details how to select the hyperparameters and pseudo inputs with gradient-based optimization.
The most relevant methods to our work apply gp regression to estimate the latent value function in a td setting [4, 7, 12, 13]. This class of algorithms is distinct from those which apply gp regression in the absence of sequential correlation, with Monte Carlo sample returns, and from methods whose convergence behavior is driven primarily by the Bellman contraction [15, 16]. As a gp-td algorithm, the policy update process (Algorithm 2) depends only on gp regression. This convergence is known to be data efficient and asymptotically unbiased. We do not prove convergence for gp-td algorithms here; however, we mention the convergence behavior to underscore our method’s relevance to robots learning with limited volumes of data. We also note these methods have been scaled to high-dimensional systems with complex, continuously-varying dynamics.
When it comes to other approximate gp-td methods, the state-of-the-art uses a low-rank approximation to the full covariance matrix, constructed with a projection matrix. Before adding new data to the active set, lowrank-sarsa checks whether the new data point increases the conditional covariance by more than a desired error threshold. In Section 5.1, we compare the approximation quality of lowrank-sarsa and spgp-sarsa.
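The rejection test used by this kind of sparsification can be sketched as follows. This is an illustrative reading of the conditional-covariance criterion, not the lowrank-sarsa implementation; the kernel, the jitter, and the function names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A (n, d) and B (m, d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(diff**2, axis=-1) / length_scale**2)

def conditional_variance(active, x_new, jitter=1e-10):
    """Prior variance at x_new that the current active set cannot explain."""
    K = sq_exp_kernel(active, active) + jitter * np.eye(len(active))
    k = sq_exp_kernel(active, x_new[None, :]).ravel()
    k_new = sq_exp_kernel(x_new[None, :], x_new[None, :])[0, 0]
    return k_new - k @ np.linalg.solve(K, k)

def build_active_set(X, threshold):
    """Keep a point only if it adds more than `threshold` conditional variance."""
    active = [X[0]]
    for x in X[1:]:
        if conditional_variance(np.array(active), x) > threshold:
            active.append(x)
    return np.array(active)

# Hypothetical stream of 50 one-dimensional inputs.
X_demo = np.linspace(0.0, 5.0, 50)[:, None]
active = build_active_set(X_demo, threshold=0.05)
```

Because a point is either kept or discarded, the size of the active set changes in discrete jumps as the threshold or the kernel parameters vary.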
4 New Algorithms for Robot Navigation
Navigation tasks are specified through the reward function. As a robot transitions through its operating space, it should assign the highest value to states and actions that bring it closer to the goal, and the lowest values around obstacles and other forbidden regions. We provide examples of such functions in Section 5.
Given a suitable reward function, Algorithm 1 implements spgp-sarsa regression, where the posterior value function parameters are learned with sequentially-observed data. The policy is updated using standard policy iteration, described in Algorithm 2.
4.1 Finding the Hyper-parameters and Pseudo-inputs
We use the marginal likelihood to fit the hyper-parameters and pseudo inputs to the observed data, . Unlike rejection-based sparsification, our method has the benefit of being continuous in nature. This allows for the variables to be tuned precisely to achieve a high-quality fit. The marginal likelihood is a Gaussian, given by
Instead of optimizing Equation 9 directly, we maximize its logarithm, . Given the gradient of , we can use an iterative method, such as l-bfgs, to find the optimal parameters. Full details of the gradient computation are provided in the supplement.
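As a rough illustration of evidence maximization, the sketch below evaluates a zero-mean Gaussian log-likelihood and compares a few candidate length scales. It uses an exact gp covariance and a coarse grid in place of Equation 9 and l-bfgs, so the covariance, the candidate values, and the function names are all assumptions.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A (n, d) and B (m, d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(diff**2, axis=-1) / length_scale**2)

def gaussian_log_likelihood(r, Sigma):
    """log N(r | 0, Sigma), evaluated through a Cholesky factor for stability."""
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, r)
    log_det_half = np.sum(np.log(np.diag(L)))    # equals 0.5 * log|Sigma|
    return -0.5 * z @ z - log_det_half - 0.5 * len(r) * np.log(2.0 * np.pi)

# Hypothetical observations and candidate length scales.
X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X).ravel()
candidates = [0.1, 0.3, 1.0, 3.0]
scores = [
    gaussian_log_likelihood(y, sq_exp_kernel(X, X, ls) + 0.01 * np.eye(len(X)))
    for ls in candidates
]
best_length_scale = candidates[int(np.argmax(scores))]
```

A gradient-based optimizer would replace the grid with iterative updates driven by the derivative of this same quantity.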
4.2 Training Considerations
The frequency with which model parameters are optimized can greatly influence the runtime of Algorithm 1. spgp-sarsa must fit the pseudo inputs in addition to the covariance hyperparameters, so its number of optimization variables grows with the input dimension and the number of pseudo inputs, whereas gp-sarsa must fit only the hyperparameters. Although it is not strictly necessary to refit model parameters at each time step, the frequency with which updates are needed will depend on how well the fitted parameters and pseudo inputs reflect the support of the operating space. As new regions are explored, the model will need to be refit.
Even with a strategy to fit model parameters, we must still choose the number of pseudo inputs, . In Figure 2, we examine the tradeoff between efficiency and accuracy in relation to . As increases, begins to coincide with , and prediction efficiency reduces, but the approximation improves to match the exact predictive posterior.
Another consideration, when training gp models, is preventing overfitting. Overfitting occurs when the predictive variance collapses around the training data. It can be prevented by adding a regularization term to the log of Equation 9, penalizing the magnitude of covariance parameters and pseudo inputs.
4.3 SPGP TD Learning
To reduce spgp-sarsa to its model-based equivalent, we let and swap the state-action transition process for the associated state transition process (Table 1); the analysis from Section 2 and 3 follows directly. The new input variable, , simply reduces the latent function space to one over variables. Equation 8 describes the state value posterior moments, analogous to frequentist td  and standard gp-td [4, 12]. We call the resulting algorithm spgp-td.
|Target value||Input space||Transition Dist.|
5 Experimental Results
We presented spgp-sarsa as a method for value estimation and explained how it applies to learning navigation policies. Now, we empirically verify several prior assertions: spgp-sarsa is data efficient; it provides a more flexible and accurate approximation to gp-sarsa than lowrank-sarsa; it is suitable for online applications to marine robots. Evidence to support these claims is provided with several targeted simulation studies and a physical experiment using a BlueROV underwater robot.
All experiments use the same covariance function: , where , and is a diagonal matrix of length scales.
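A covariance function with a diagonal matrix of length scales is commonly the squared-exponential kernel with one length scale per input dimension. The sketch below assumes that form; the signal variance parameter and the convention that the diagonal holds length scales rather than their squares are assumptions, since the exact expression is not recoverable from the text here.

```python
import numpy as np

def ard_sq_exp_kernel(A, B, signal_var, length_scales):
    """Squared-exponential kernel with a separate length scale per input dimension."""
    A_scaled = A / length_scales
    B_scaled = B / length_scales
    diff = A_scaled[:, None, :] - B_scaled[None, :, :]
    return signal_var * np.exp(-0.5 * np.sum(diff**2, axis=-1))

# Hypothetical two-dimensional inputs.
pts = np.array([[0.0, 0.0], [1.0, 2.0]])
K = ard_sq_exp_kernel(pts, pts, signal_var=2.0, length_scales=np.array([1.0, 2.0]))
```

A long length scale along one dimension makes the value estimate vary slowly in that direction, which lets the model discount inputs that matter less.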
5.1 Comparing Approximation Quality
To facilitate comparison between the approximation quality of spgp-sarsa and lowrank-sarsa, we analyzed a measure of evidence maximization: the ratio of sparse-to-complete log-likelihood, . We found the low-rank approximation causes sharp changes in the likelihood, and its magnitude often varied around . Although this precluded visual comparisons, we are still able to show spgp-sarsa provides a tight approximation. The ratio varies smoothly in relation to different levels of sparsity and improves further as pseudo inputs are optimized (Figure 1(a)).
In a second study, we examined the range of lowrank-sarsa’s adjustability. In principle, the error threshold can be tuned to any positive number. However, results show the available range can be limited and result in extreme amounts of data retention or rejection (Figure 1(b)).
As marine robots require learning algorithms that are simultaneously data-efficient and online-viable, it can be problematic if one quality is missing. Retaining an excessive amount of data reduces the computational benefit of the low-rank approximation. Conversely, rejecting too much data is counterproductive when very little arrives. spgp-sarsa offers a good middle ground, because it retains all observations while still achieving fast predictions at any level of sparsity.
5.2 Simulated Navigation Tasks
For this experiment, we solve the complete rl problem. Employing Algorithm 2, we compare the performance of each value estimator (gp-sarsa, spgp-sarsa, and lowrank-sarsa) as it informs policy updates on simulated navigation tasks. The purpose is to understand each algorithm’s learning performance.
From the data (Figure 4), it is clear that all algorithms converge quickly, in around fifty episodes. These results are consistent with prior work by Engel et al., where gp-sarsa was applied to control a high dimensional octopus arm having 88 continuous state variables and 6 action variables. Performance differences between the two approximate methods are due to their approximation quality (Figure 2). As expected, spgp-sarsa is able to learn on par with the exact method, because it can replicate the predictive posterior better than lowrank-sarsa.
In each experiment, robots learn over 100 episodes and select actions with unique ε-greedy policies. The number of pseudo inputs was selected for a fair comparison. Specifically, after finding an error threshold that induced approximately 50% sparsity, we chose the number of pseudo inputs so that both methods converged with active sets of the same size. All pseudo inputs were initialized randomly.
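Action selection of this kind can be sketched in a few lines. The example assumes a finite set of candidate actions scored by the posterior value mean; the function name and the numbers are hypothetical.

```python
import numpy as np

def epsilon_greedy(values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))

rng = np.random.default_rng(0)
action_values = np.array([0.1, 0.9, 0.3])   # e.g. posterior means per action
action = epsilon_greedy(action_values, epsilon=0.1, rng=rng)
```

Setting epsilon to zero recovers the greedy policy, while larger values trade exploitation for exploration.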
First, we considered a canonical rl navigation problem, the Mountain Car . With limited power, the robot must learn to exploit its dynamics to reach the crest of a hill. The state is given by , where , and . Rewards are , with , until the goal is reached, where . Episodes start at , and the goal is . We let , , and learning evolve over transitions. One action, controls the robot’s motion (Figure 3) .
Our second system is a planar Unmanned Surface Vehicle (usv), which has been considered in prior learning experiments [17, 18]. The robot must navigate within of using transitions. States contain position, , heading. , and heading rate . The speed is held constant at m/s, and the angular rate controls the robot through
We use time steps of s, and a step time delay for the command to be realized. The delay models resistance of surface currents and actuator limitations. Rewards are assigned with , where , , as before, , and . We chose a linear policy, , where . was updated with a line search, to maximize the first moment of the value posterior. We selected and .
For the third system we consider a common Unmanned Underwater Vehicle (uuv) design, with differential control. The robot commands forward acceleration, , and turn rate, , through port, , and starboard, , actions. The dynamics are an extension of the usv with the additional dimension, . The policy uses Fourier basis functions, with defined as before, and :
These map to actions through and , where we let . Policy updates select parameters and to maximize the value posterior mean - found through a grid search. Learning occurs over 100 episodes of 200 transitions, with and .
5.3 Learning to Navigate with a BlueROV
Solving the complete rl problem with a BlueROV (Figure 5) presented a unique set of challenges. States derived from a Doppler velocity log, with estimates that drifted on the order of 1m every 1-2 min; this ultimately bounded the number of transitions per episode. While initiating the learning process at depth, disturbances from the data tether introduced uncertainty in the initial position and heading. The robot’s speed is also constrained to facilitate accurate localization. To move, the robot can yaw and translate back and forth for a variable length of time.
With numerous limitations imposed on the learning process, we evaluated what could be learned from a single demonstration of only eight transitions. In practice, learning from demonstrations can reduce trial and error [19, 20]. Results show, even with minimal information, that spgp-sarsa was able to recover a near-optimal value function and policy (Figure 5). The experiment was repeated twenty times and the average policy update time took s with a 2.8GHz i7 processor. We achieve only a modest improvement in prediction time, since , and we use pseudo inputs. Despite this fact, our results confirm spgp-sarsa can support efficient robot learning.
This paper presented an algorithm that supports learning navigation policies with very little data. We argued for the use of gp-td algorithms to replace standard Bellman recursion, because non-parametric regression can be more data-efficient for learning value functions. We derived spgp-sarsa as a sparse approximation to gp-sarsa and showed it is more flexible and its predictions are more accurate than the state-of-the-art low-rank method. spgp-sarsa was applied to a physical marine robot and learned a near-optimal value function from a single demonstration. In closing, we believe our results highlight the efficiency of gp-td algorithms and the utility of spgp-sarsa as a marine robot learning method.
We thank the anonymous reviewers for their excellent feedback. This research has been supported in part by the National Science Foundation, grant number IIS-1652064. This work was also supported in part by the U.S. Department of Homeland Security under Cooperative Agreement No. 2014-ST-061-ML0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security.
[1] C. Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010. URL http://www.sztaki.hu/~szcsaba/papers/RLAlgsInMDPs-lecture.pdf.
[2] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, pages 9–44. Kluwer Academic Publishers, 1988.
[3] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical report, Cambridge University, 1994.
[4] Y. Engel, S. Mannor, and R. Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.
[5] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[6] L. Csato and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 2002.
[7] Y. Engel, S. Mannor, and R. Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.
[8] H. Jakab and L. Csato. Improving Gaussian process value function approximation in policy gradient algorithms. In Artificial Neural Networks and Machine Learning, 2011.
[9] D. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989.
[10] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 2006.
[11] R. Sutton and A. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
[12] Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, The Hebrew University, 2005.
[13] Y. Engel, P. Szabo, and D. Volkinshtein. Learning to control an octopus arm with Gaussian process temporal difference methods. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.
[14] M. Deisenroth. Efficient reinforcement learning using Gaussian processes. PhD thesis, Karlsruhe Institute of Technology, 2010.
[15] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian process dynamic programming. Neurocomput., 2009.
[16] C. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
[17] M. Ghavamzadeh, Y. Engel, and M. Valko. Bayesian policy gradient and actor-critic algorithms. J. Mach. Learn. Res., 2016.
[18] J. Martin and B. Englot. Extending model-based policy gradients for robots in heteroscedastic environments. In 1st Annual Conference on Robot Learning, 2017.
[19] B. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 2009.
[20] J. Kober, A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. International Journal of Robotics Research, 2013.
7.1 Model Parameter Optimization
Most optimization packages require an objective function and its gradient. In the full paper, we described the objective function as the log likelihood of the value posterior. Below, we provide the full details of the corresponding gradient computation.
7.2 Objective Gradient
Let and be the -th optimization variable. The gradient with respect to is
Here, is the tangent matrix of with respect to . The full equations for computing are described in the next section.
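The general shape of such a gradient can be checked numerically. The sketch below uses the standard identity for a zero-mean Gaussian log-likelihood, namely one half of r' S^-1 (dS) S^-1 r minus one half of the trace of S^-1 (dS), where S is the covariance and dS is its tangent matrix. The covariance here is a plain squared-exponential kernel plus noise with a single length scale as the optimization variable, which is an assumption and simpler than the covariance in Equation 10.

```python
import numpy as np

def covariance_and_tangent(X, length_scale, noise_var=0.1):
    """Covariance S and its tangent matrix dS/d(length_scale) for 1-D inputs."""
    sq_dist = (X[:, None, 0] - X[None, :, 0]) ** 2
    K = np.exp(-0.5 * sq_dist / length_scale**2)
    Sigma = K + noise_var * np.eye(len(X))
    dSigma = K * sq_dist / length_scale**3
    return Sigma, dSigma

def log_likelihood(r, Sigma):
    """log N(r | 0, Sigma)."""
    _, logdet = np.linalg.slogdet(Sigma)
    quad = r @ np.linalg.solve(Sigma, r)
    return -0.5 * quad - 0.5 * logdet - 0.5 * len(r) * np.log(2.0 * np.pi)

def log_likelihood_gradient(r, Sigma, dSigma):
    """Derivative of log N(r | 0, Sigma) along the tangent matrix dSigma."""
    alpha = np.linalg.solve(Sigma, r)
    return 0.5 * alpha @ dSigma @ alpha - 0.5 * np.trace(np.linalg.solve(Sigma, dSigma))

# Hypothetical data and a central finite-difference check.
X = np.linspace(0.0, 3.0, 6)[:, None]
r = np.cos(X).ravel()
ell, h = 0.8, 1e-5
Sigma, dSigma = covariance_and_tangent(X, ell)
analytic = log_likelihood_gradient(r, Sigma, dSigma)
numeric = (log_likelihood(r, covariance_and_tangent(X, ell + h)[0])
           - log_likelihood(r, covariance_and_tangent(X, ell - h)[0])) / (2.0 * h)
```

Agreement between `analytic` and `numeric` is a useful check before handing a hand-derived gradient to an optimizer.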
7.3 Likelihood Covariance Tangent Matrix
Denote to be a generic optimization variable. Then the matrix used in Equation 10 is given by