1 Introduction – Optimal Reinforcement Learning
Reinforcement learning is about doing two things at once: Optimizing a function while learning about it. These two objectives must be balanced: Ignorance precludes efficient optimization; time spent hunting after irrelevant knowledge incurs unnecessary loss. This dilemma is famously known as the exploration-exploitation tradeoff. Classic reinforcement learning often considers time cheap; the tradeoff then plays a subordinate role to the desire for learning a “correct” model or policy. Many classic reinforcement learning algorithms thus rely on ad hoc methods to control exploration, such as “ε-greedy” [1] or “Thompson sampling” [2]. However, at least since a thesis by Duff [3] it has been known that Bayesian inference allows optimal balance between exploration and exploitation. It requires integration over every possible future trajectory under the current belief about the system’s dynamics, all possible new data acquired along those trajectories, and their effect on decisions taken along the way. This amounts to optimization and integration over a tree, of exponential cost in the size of the state space [4]. The situation is particularly dire for continuous spacetimes, where both depth and branching factor of the “tree” are uncountably infinite. Several authors have proposed approximating this lookahead through samples [5, 6, 7, 8], or ad hoc estimators that can be shown to be in some sense close to the Bayes-optimal policy [9].
In a parallel development, recent work by Todorov [10], Kappen [11] and others introduced an idea to reinforcement learning long commonplace in other areas of machine learning: Structural assumptions, while restrictive, can greatly simplify inference problems. In particular, a recent paper by Simpkins et al. [12] showed that it is actually possible to solve the exploration-exploitation tradeoff locally, by constructing a linear approximation using a Kalman filter. Simpkins and colleagues further assumed that the loss function is known, as are the dynamics up to Brownian drift. Here, I use their work as inspiration for a study of general optimal reinforcement learning of dynamics and loss functions of an unknown, nonlinear, time-varying system (note that most reinforcement learning algorithms are restricted to time-invariant systems). The core assumption is that all uncertain variables are known up to Gaussian process uncertainty. The main result is a first-order description of optimal reinforcement learning in the form of infinite-dimensional differential statements. This kind of description opens up new approaches to reinforcement learning. As a first example of such treatments, Section 4 presents an approximate Ansatz that affords an explicit reinforcement learning algorithm, tested in some simple but instructive experiments (Section 5).
An intuitive description of the paper’s results is this: From prior and corresponding choice of learning machinery (Section 2), we construct statements about the dynamics of the learning process (Section 3). The learning machine itself provides a probabilistic description of the dynamics of the physical system. Combining both dynamics yields a joint system, which we aim to control optimally. Doing so amounts to simultaneously controlling exploration (controlling the learning system) and exploitation (controlling the physical system).
Because large parts of the analysis rely on concepts from optimal control theory, this paper will use notation from that field. Readers more familiar with the reinforcement learning literature may wish to mentally replace coordinates with states, controls with actions, dynamics with transitions and utilities with losses (negative rewards). The latter is potentially confusing, so note that optimal control in this paper will attempt to minimize values, rather than to maximize them, as usual in reinforcement learning (these two descriptions are, of course, equivalent).
2 A Class of Learning Problems
We consider the task of optimally controlling an uncertain system whose states lie in a dimensional Euclidean phase spacetime: A cost (cumulated loss) is acquired at with rate , and the first inference problem is to learn this analytic function . A second, independent learning problem concerns the dynamics of the system. We assume the dynamics separate into free and controlled terms affine to the control:
(1) 
where is the control function we seek to optimize, and are analytic functions. To simplify our analysis, we will assume that either or are known, while the other may be uncertain (or, alternatively, that it is possible to obtain independent samples from both functions). See Section 3 for a note on how this assumption may be relaxed. W.l.o.g., let be uncertain and known. Information about both and is acquired stochastically: A Poisson process of constant rate produces mutually independent samples
(2) 
The noise levels and are presumed known. Let our initial beliefs about and be given by Gaussian processes and independent Gaussian processes , respectively, with kernels over , and mean / covariance functions / . In other words, samples over the belief can be drawn using an infinite vector of i.i.d. Gaussian variables, as
(3)
The second equation demonstrates a compact notation for inner products that will be used throughout. It is important to note that are unknown, but deterministic. At any point during learning, we can use the same samples to describe uncertainty, while change during the learning process.
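The idea of drawing samples from the belief through a fixed vector of i.i.d. Gaussian variables can be sketched on a finite grid. This is only an illustration under stated assumptions: the infinite vector is truncated to five grid points, the kernel is a square exponential with unit length scale, and a Cholesky factor is used as one possible square root of the covariance. The names `se_kernel`, `cholesky`, `draw`, `grid` and `omega` are hypothetical and do not appear in the text.

```python
import math
import random

def se_kernel(a, b, length=1.0, amp=1.0):
    """Square-exponential kernel (assumed form, for illustration only)."""
    return amp ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a small positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def draw(mean, cov, omega, jitter=1e-9):
    """Map a fixed vector omega of i.i.d. N(0, 1) variables to a sample
    with the given mean vector and covariance matrix."""
    n = len(mean)
    A = [[cov[i][j] + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(A)
    return [mean[i] + sum(L[i][k] * omega[k] for k in range(i + 1))
            for i in range(n)]

grid = [0.0, 0.5, 1.0, 1.5, 2.0]
rng = random.Random(0)
omega = [rng.gauss(0.0, 1.0) for _ in grid]   # drawn once, then kept fixed

prior_mean = [0.0] * len(grid)
prior_cov = [[se_kernel(a, b) for b in grid] for a in grid]
sample_prior = draw(prior_mean, prior_cov, omega)

# A later belief (made-up numbers): the mean and covariance have changed,
# but the same omega still describes the remaining uncertainty.
later_mean = [0.3] * len(grid)
later_cov = [[0.25 * c for c in row] for row in prior_cov]
sample_later = draw(later_mean, later_cov, omega)
```

The point of the sketch is that `omega` is drawn once; only the mean and covariance passed to `draw` change as the belief is updated.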
To ensure continuous trajectories, we also need to regularize the control. Following control custom, we introduce a quadratic control cost with control cost scaling matrix . Its units relate the cost of changing location to the utility gained by doing so.
The overall task is to find the optimal discounted horizon value
(4) 
where is the trajectory generated by the dynamics defined in Equation (1), using the control law (policy) . The exponential definition of the discount gives the unit of time to .
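To make the role of the discount concrete, the following sketch numerically integrates a discounted cost along a given trajectory. It rests on assumptions not fixed by the text above: an exponential discount of the form exp(-t/γ), a scalar control with quadratic cost u²/(2R), and a loss rate supplied directly as a function of time rather than generated by the dynamics of Equation (1). The function name `discounted_cost` and its arguments are hypothetical.

```python
import math

def discounted_cost(loss_rate, control, R, gamma, horizon, dt=1e-3):
    """Midpoint-rule approximation of
        integral_0^horizon exp(-t / gamma) * (loss_rate(t) + u(t)^2 / (2 R)) dt
    (assumed form of the discounted cost; scalar control for simplicity)."""
    steps = int(round(horizon / dt))
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt               # midpoint of the i-th interval
        u = control(t)
        rate = loss_rate(t) + 0.5 * u * u / R
        total += math.exp(-t / gamma) * rate * dt
    return total

# Constant unit loss rate and zero control: the discounted cost approaches
# gamma as the horizon grows, which is why gamma carries the unit of time.
value = discounted_cost(lambda t: 1.0, lambda t: 0.0,
                        R=1.0, gamma=2.0, horizon=50.0)
```

Under these assumptions a constant unit loss accumulates to roughly γ, which is one way to read the remark that the exponential definition of the discount gives it the unit of time.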
Before beginning the analysis, consider the relative generality of this definition: We allow for a continuous phase space. Both loss and dynamics may be uncertain, of rather general nonlinear form, and may change over time. The specific choice of a Poisson process for the generation of samples is somewhat ad hoc, but some measure is required to quantify the flow of information through time. The Poisson process is in some sense the simplest such measure, assigning uniform probability density. An alternative is to assume that datapoints are acquired at regular intervals of width . This results in a quite similar model but, since the system’s dynamics still proceed in continuous time, can complicate notation. A downside is that we had to restrict the form of the dynamics. However, Eq. (1) still covers numerous physical systems studied in control, for example many mechanical systems, from classics like cart-and-pole to realistic models for helicopters [13].
3 Optimal Control for the Learning Process
The optimal solution to the exploration-exploitation tradeoff is formed by the dual control [14] of a joint representation of the physical system and the beliefs over it. In reinforcement learning, this idea is known as a belief-augmented POMDP [3, 4], but is not usually construed as a control problem. This section constructs the Hamilton-Jacobi-Bellman (HJB) equation of the joint control problem for the system described in Sec. 2, and analytically solves the equation for the optimal control. This necessitates a description of the learning algorithm’s dynamics:
At time , let the system be at phase spacetime and have the Gaussian process belief over the function (all derivations in this section will focus on , and we will drop the subscript from many quantities for readability. The forms for , or , are entirely analogous, with independent Gaussian processes for each dimension ). This belief stems from a finite number of samples collected at spacetimes (note that to need not be equally spaced, ordered, or ). For arbitrary points , the belief over is a Gaussian with mean function , and covariance function [15]
(5)
where is the Gram matrix with elements . We will abbreviate from here on. The covector has elements and will be shortened to . How does this belief change as time moves from to ? If , the chance of acquiring a datapoint in this time is . Marginalising over this Poisson stochasticity, we expect one sample with probability , two samples with and so on. So the mean after is expected to be
(6) 
where we have defined the map , the vector with elements , and the scalar . Algebraic reformulation yields
(7) 
Note that , the mean prediction at and , the marginal variance there. Hence, we can define scalars and write
(8) 
So the change to the mean consists of a deterministic but uncertain change whose effects accumulate linearly in time, and a stochastic change, caused by the independent noise process, whose variance accumulates linearly in time (in truth, these two points are considerably subtler; a detailed proof is left out for lack of space). We use the Wiener [16] measure to write
(9) 
where we have implicitly defined the innovation function . Note that is a function of both and . A similar argument finds the change of the covariance function to be the deterministic rate
(10) 
So the dynamics of learning consist of a deterministic change to the covariance, and both deterministic and stochastic changes to the mean, both of which are samples from Gaussian processes with covariance function proportional to . This separation is a fundamental characteristic of GPs (it is the nonparametric version of a more straightforward notion for finite-dimensional Gaussian beliefs, for data with known noise magnitude).
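The separation between covariance and mean updates can be illustrated with the standard Gaussian process update for a single noisy observation. This is a minimal sketch, not the continuous-time rates derived above: it assumes a zero prior mean, a square-exponential kernel and a known noise variance, and the helper names `k` and `condition_on_point` are hypothetical.

```python
import math

def k(a, b, length=1.0):
    """Square-exponential prior covariance (assumed, for illustration)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def condition_on_point(mean, cov, x_new, y_new, noise_var):
    """Posterior mean and covariance functions after observing y_new at
    x_new with Gaussian noise of known variance noise_var."""
    s = cov(x_new, x_new) + noise_var      # predictive variance of y_new
    resid = y_new - mean(x_new)            # innovation: depends on the data

    def new_mean(x):
        return mean(x) + cov(x, x_new) / s * resid

    def new_cov(a, b):
        # Rank-one reduction: depends on the location x_new only.
        return cov(a, b) - cov(a, x_new) * cov(x_new, b) / s

    return new_mean, new_cov

def prior_mean(x):
    return 0.0

# Two different observed values at the same location x_new = 1.0.
m1, c1 = condition_on_point(prior_mean, k, 1.0, 2.0, 0.1)
m2, c2 = condition_on_point(prior_mean, k, 1.0, -3.0, 0.1)

same_covariance = c1(0.5, 0.5) == c2(0.5, 0.5)   # True
different_means = m1(0.5) != m2(0.5)             # True
```

The covariance after the update is the same for both observed values, while the mean moves with the (as yet unknown) datum. In this simplified discrete setting, that is the sense in which the covariance change is deterministic and the mean change is uncertain.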
We introduce the belief-augmented space containing states . Since the means and covariances are functions, is infinite-dimensional. Under our beliefs, obeys a stochastic differential equation of the form
(11) 
with free dynamics , controlled dynamics , uncertainty operator , and noise operator
(12)  
(13) 
The value – the expected cost to go – of any state is given by the Hamilton-Jacobi-Bellman equation, which follows from Bellman’s principle and a first-order expansion, using Eq. (4):
(14) 
Integration over can be performed with ease, and removes the stochasticity from the problem; the uncertainty over is a lot more challenging. Because the distribution over future losses is correlated through space and time, , are functions of , and the integral is nontrivial. But there are some obvious approximate approaches. For example, if we (inexactly) swap integration and minimisation, draw samples and solve for the value for each sample, we get an “average optimal controller”. This overestimates the actual sum of future rewards by assuming the controller has access to the true system. It has the potential advantage of considering the actual optimal controller for every possible system, and the disadvantage that the average of optima need not be optimal for any actual solution. On the other hand, if we ignore the correlation between and , we can integrate (17) locally; all terms in drop out and we are left with an “optimal average controller”, which assumes that the system locally follows its average (mean) dynamics. This cheaper strategy was adopted in the following. Note that it is myopic, but not greedy in a simplistic sense – it does take the effect of learning into account. It amounts to a “global one-step lookahead”. One could imagine extensions that consider the influence of on to a higher order, but these will be left for future work. Under this first-order approximation, analytic minimisation over can be performed in closed form, and yields
(15) 
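As a sanity check on why a quadratic control cost combined with control-affine dynamics admits a closed-form minimiser, consider a scalar sketch. It assumes (this is not taken from the text above) that the control-dependent part of the Hamilton-Jacobi-Bellman right-hand side is u²/(2R) + V'·g·u, where V' stands for the derivative of the value with respect to the physical state. Under that assumption the minimiser is u* = -R·g·V'. The names `control_terms` and `optimal_control` are hypothetical.

```python
def control_terms(u, R, g, dV):
    """Control-dependent part of the HJB right-hand side (assumed scalar
    form): quadratic control cost plus the effect of the controlled drift."""
    return 0.5 * u * u / R + dV * g * u

def optimal_control(R, g, dV):
    """Stationary point of control_terms in u: u / R + dV * g = 0."""
    return -R * g * dV

R, g, dV = 0.5, 2.0, -1.5          # made-up values for illustration
u_star = optimal_control(R, g, dV)
best = control_terms(u_star, R, g, dV)

# Any perturbation of the control increases the objective.
perturbed = [control_terms(u_star + d, R, g, dV) for d in (-1.0, -0.1, 0.1, 1.0)]
is_minimum = all(p > best for p in perturbed)
```

Substituting the minimiser back gives the value -R·g²·V'²/2 for these terms, which is the kind of negative, “beneficial” contribution of optimal control to the value.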
The optimal Hamilton-Jacobi-Bellman equation is then
(16) 
A more explicit form emerges upon reinserting the definitions of Eq. (12) into Eq. (16):
(17) 
Equation (17) is the central result: Given Gaussian priors on nonlinear control-affine dynamic systems, up to a first-order approximation, optimal reinforcement learning is described by an infinite-dimensional second-order partial differential equation. It can be interpreted as follows (labels in the equation, note the negative signs of “beneficial” terms): The value of a state comprises the immediate utility rate; the effect of the free drift through spacetime and the benefit of optimal control; an exploration bonus of learning, and a diffusion cost engendered by the measurement noise. The first two lines of the right hand side describe effects from the phase spacetime subspace of the augmented space, while the last line describes effects from the belief part of the augmented space. The former will be called exploitation terms, the latter exploration terms, for the following reason: If the first two lines dominate the right hand side of Equation (17) in absolute size, then future losses are governed by the physical subspace – caused by exploiting knowledge to control the physical system. On the other hand, if the last line dominates the value function, exploration is more important than exploitation – the algorithm controls the physical space to increase knowledge. To my knowledge, this is the first differential statement about reinforcement learning’s two objectives. Finally, note the role of the sampling rate : If is very low, exploration is useless over the discount horizon.
Even after these approximations, solving Equation (17) for remains nontrivial for two reasons: First, although the vector product notation is pleasingly compact, the mean and covariance functions are of course infinite-dimensional, and what looks like straightforward inner vector products are in fact integrals. For example, the average exploration bonus for the loss, writ large, reads
(18) 
(note that this object remains a function of the state ). For general kernels , these integrals may only be solved numerically. However, for at least one specific choice of kernel (square-exponentials) and parametric Ansatz, the required integrals can be solved in closed form. This analytic structure is so interesting, and the square-exponential kernel so widely used, that the “numerical” part of the paper (Section 4) will restrict the choice of kernel to this class.
The other problem, of course, is that Equation (17) is a nontrivial differential equation. Section 4 presents one initial attempt at a numerical solution that should not be mistaken for a definitive answer. Despite all this, Eq. (17) arguably constitutes a useful gain for Bayesian reinforcement learning: It replaces the intractable definition of the value in terms of future trajectories with a differential equation. This raises hope for new approaches to reinforcement learning, based on numerical analysis rather than sampling.
Digression: Relaxing Some Assumptions
This paper only applies to the specific problem class of Section 2. Any generalisations and extensions are future work, and I do not claim to solve them. But it is instructive to consider some easier extensions, and some harder ones: For example, it is intractable to simultaneously learn both and nonparametrically, if only the actual transitions are observed, because the beliefs over the two functions become infinitely dependent when conditioned on data. But if the belief on either or is parametric (e.g. a general linear model), a joint belief on and is tractable [see 15, §2.7], in fact straightforward. Both the quadratic control cost and the control-affine form () are relaxable assumptions – other parametric forms are possible, as long as they allow for analytic optimization of Eq. (14). On the question of learning the kernels for Gaussian process regression on and or , it is clear that standard ways of inferring kernels [15, 17] can be used without complication, but that they are not covered by the notion of optimal learning as addressed here.
4 Numerically Solving the Hamilton-Jacobi-Bellman Equation
Solving Equation (16) is principally a problem of numerical analysis, and a battery of numerical methods may be considered. This section reports on one specific Ansatz, a Galerkin-type projection analogous to the one used in [12]. For this we break with the generality of previous sections and assume that the kernels are given by square exponentials with parameters . As discussed above, we approximate by setting . We find an approximate solution through a factorizing parametric Ansatz: Let the value of any point in the belief space be given through a set of parameters and some nonlinear functionals , such that their contributions separate over phase space, mean, and covariance functions:
(19) 
This projection is obviously restrictive, but it should be compared to the use of radial basis functions for function approximation, a similarly restrictive framework widely used in reinforcement learning. The functionals have to be chosen conducive to the form of Eq. (17). For square exponential kernels, one convenient choice is
(20)
(21)  
(22) 
(the subtracted term in the first integral serves only numerical purposes). With this choice, the integrals of Equation (17) can be solved analytically (solutions left out due to space constraints). The approximate Ansatz turns Eq. (17) into an algebraic equation quadratic in , linear in all other :
(23) 
using covectors and a matrix with elements
(24)  
Note that and are both functions of the physical state, through . It is through this functional dependency that the value of information is associated with the physical phase spacetime. To solve for , we simply choose a number of evaluation points sufficient to constrain the resulting system of quadratic equations, and then find the least-squares solution by function minimisation, using standard methods, such as Levenberg-Marquardt [18]. A disadvantage of this approach is that it has a number of degrees of freedom, such as the kernel parameters, and the number and locations of the feature functionals. Our experiments (Section 5) suggest that it is nevertheless possible to get interesting results simply by choosing these parameters heuristically.
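The least-squares step can be sketched on a toy system. The residuals below are made up for illustration and are not the terms of Equation (23): they are quadratic in one parameter and linear in another, mimicking only the algebraic structure described above. The solver is a hand-rolled damped Gauss-Newton iteration in the spirit of Levenberg-Marquardt, written for two parameters so that the normal equations can be solved in closed form. All names (`residuals`, `jacobian`, `levenberg_marquardt`, `data`) are hypothetical.

```python
def residuals(w, data):
    """Toy residuals a * w1^2 + b * w2 - c at each evaluation point."""
    w1, w2 = w
    return [a * w1 * w1 + b * w2 - c for a, b, c in data]

def jacobian(w, data):
    """Partial derivatives of each residual with respect to (w1, w2)."""
    w1, _ = w
    return [(2.0 * a * w1, b) for a, b, _ in data]

def levenberg_marquardt(w, data, mu=1e-3, iters=100):
    """Damped Gauss-Newton: solve (J^T J + mu I) delta = -J^T r, accept the
    step if the squared error decreases, otherwise increase the damping."""
    cost = sum(r * r for r in residuals(w, data))
    for _ in range(iters):
        r = residuals(w, data)
        J = jacobian(w, data)
        a11 = sum(j[0] * j[0] for j in J) + mu
        a12 = sum(j[0] * j[1] for j in J)
        a22 = sum(j[1] * j[1] for j in J) + mu
        b1 = -sum(j[0] * ri for j, ri in zip(J, r))
        b2 = -sum(j[1] * ri for j, ri in zip(J, r))
        det = a11 * a22 - a12 * a12          # positive because mu > 0
        delta = ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)
        trial = (w[0] + delta[0], w[1] + delta[1])
        trial_cost = sum(ri * ri for ri in residuals(trial, data))
        if trial_cost < cost:
            w, cost, mu = trial, trial_cost, mu / 3.0
        else:
            mu *= 3.0
    return w, cost

# Evaluation points (a, b) with targets generated from known parameters.
true_w = (1.5, -0.7)
points = [(1.0, 0.5), (2.0, -1.0), (0.5, 2.0), (1.5, 1.0), (3.0, 0.2)]
data = [(a, b, a * true_w[0] ** 2 + b * true_w[1]) for a, b in points]

w_hat, final_cost = levenberg_marquardt((1.0, 0.0), data)
```

More evaluation points than parameters are used, as in the text, so that the quadratic system is over-constrained. A library implementation would normally replace the hand-written loop.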
5 Experiments
5.1 Illustrative Experiment on an Artificial Environment
As a simple example system with a one-dimensional state space, were sampled from the model described in Section 2, and set to the unit function. The state space was tiled regularly, in a bounded region, with square exponential (“radial”) basis functions (Equation 20), initially all with weight . For the information terms, only a single basis function was used for each term (i.e. one single , one single , and equally for , all with very large length scales , covering the entire region of interest). As pointed out above, this does not imply a trivial structure for these terms, because of the functional dependency on . Five times the number of parameters, i.e. evaluation points were sampled, at each time step, uniformly over the same region. It is not intuitively clear whether each should have its own belief (i.e. whether the points must cover the belief space as well as the phase space), but anecdotal evidence from the experiments suggests that it suffices to use the current beliefs for all evaluation points. A more comprehensive evaluation of such aspects will be the subject of a future paper. The discount factor was set to , the sampling rate at , the control cost at . Value and optimal control were evaluated at time steps of .
Figure 1 shows the situation after initialisation. The most noteworthy aspect is the nontrivial structure of exploration and exploitation terms. Despite the simplistic parameterisation of the corresponding functionals, their functional dependence on induces a complex shape. The system constantly balances exploration and exploitation, and the optimal balance depends nontrivially on location, time, and the actual value (as opposed to only uncertainty) of accumulated knowledge. This is an important insight that casts doubt on the usefulness of simple, local exploration bonuses, used in many reinforcement learning algorithms.
Secondly, note that the system’s trajectory does not necessarily follow what would be the optimal path under full information. The value estimate reflects this, by assigning low (good) value to regions behind the system’s trajectory. This amounts to a sense of “remorse”: Had the learner known about these regions earlier, it would have striven to reach them. But this is not a sign of suboptimality: Remember that the value is defined on the augmented space. The plots in Figure 1 are merely a slice through that space at some level set in the belief space.
5.2 Comparative Experiment – The Furuta Pendulum
Method  cumulative loss
Full Information (baseline)  .
TD(λ)  .
Kalman filter optimal learner  .
Gaussian process optimal learner  .
Figure 2: Cost to go achieved by different methods. Lower is better. Error measures are one standard deviation over five experiments.
The cart-and-pole system is an underactuated problem widely studied in reinforcement learning. For variation, this experiment uses a cylindrical version, the pendulum on the rotating arm [19]. The task is to swing up the pendulum from the lower resting point. The table in Figure 2 compares the average loss of a controller with access to the true , but otherwise using Algorithm 1, to that of an ε-greedy TD learner with linear function approximation, Simpkins et al.’s [12] Kalman method and the Gaussian process learning controller (Fig. 2). The linear function approximation of TD used the same radial basis functions as the three other methods. None of these methods is free of assumptions: Note that the sampling frequency influences TD in nontrivial ways rarely studied (for example through the coarseness of the ε-greedy policy). The parameters were set to , . Note that reinforcement learning experiments often quote total accumulated loss, which differs from the discounted task posed to the learner. Figure 2 reports actual discounted losses. The GP method clearly outperforms the other two learners, which barely explore. Interestingly, none of the tested methods, not even the informed controller, achieves a stable controlled balance, although the GP learner does swing up the pendulum. This is due to the random, non-optimal location of basis functions, which means resolution is not necessarily available where it is needed (in regions of high curvature of the value function), and demonstrates a need for better solution methods for Eq. (17).
There is of course a large number of other algorithms to potentially compare to, and these results are anything but exhaustive. They should not be misunderstood as a critique of any other method. But they highlight the need for units of measure on every quantity, and show how hard optimal exploration and exploitation truly is. Note that, for time-varying or discounted problems, there is no “conservative” option that could be adopted in place of the Bayesian answer.
6 Conclusion
Gaussian process priors provide a nontrivial class of reinforcement learning problems for which optimal reinforcement learning reduces to solving differential equations. Of course, this fact alone does not make the problem easier, as solving nonlinear differential equations is in general intractable. However, the ubiquity of differential descriptions in other fields raises hope that this insight opens new approaches to reinforcement learning. For intuition on how such solutions might work, one specific approximation was presented, using functionals to reduce the problem to finite least-squares parameter estimation.
The critical reader will have noted how central the prior is for the arguments in Section 3: The dynamics of the learning process are predictions of future data, thus inherently determined exclusively by prior assumptions. One may find this unappealing, but there is no escape from it. Minimizing future loss requires predicting future loss, and predictions are always in danger of falling victim to incorrect assumptions. A finite initial identification phase may mitigate this problem by replacing prior with posterior uncertainty – but even then, predictions and decisions will depend on the model.
The results of this paper raise new questions, theoretical and applied. The most pressing questions concern better solution methods for Eq. (14), in particular better means for taking the expectation over the uncertain dynamics to more than first order. There are also obvious probabilistic issues: Are there other classes of priors that allow similar treatments? (Note some conceptual similarities between this work and the BEETLE algorithm [4]). To what extent can approximate inference methods – widely studied in combination with Gaussian process regression – be used to broaden the utility of these results?
Acknowledgments
The author wishes to express his gratitude to Carl Rasmussen, Jan Peters, Zoubin Ghahramani, Peter Dayan, and an anonymous reviewer, whose thoughtful comments uncovered several errors and crucially improved this paper.
References
 [1] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998.
 [2] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of two samples. Biometrika, 25:275–294, 1933.

 [3] M.O.G. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, U of Massachusetts, Amherst, 2002.
 [4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006.

 [5] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. In Uncertainty in Artificial Intelligence, pages 150–159, 1999.
 [6] Malcolm Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000.
 [7] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for online reward optimization. In International Conference on Machine Learning, pages 956–963, 2005.
 [8] J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artificial Intelligence, 2009.
 [9] J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning. Morgan Kaufmann, 2009.
 [10] E. Todorov. Linearly-solvable Markov decision problems. Advances in Neural Information Processing Systems, 19, 2007.
 [11] H. J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. In 9th Granada seminar on Computational Physics: Computational and Mathematical Modeling of Cooperative Behavior in Neural Systems., pages 149–181, 2007.
 [12] A. Simpkins, R. De Callafon, and E. Todorov. Optimal tradeoff between exploration and exploitation. In American Control Conference, 2008, pages 33–38, 2008.
 [13] I. Fantoni and L. Rogelio. Nonlinear Control for Underactuated Mechanical Systems. Springer, 1973.
 [14] A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874–880, April 1961.
 [15] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 [16] N. Wiener. Differential space. Journal of Mathematical Physics, 2:131–174, 1923.
 [17] I. Murray and R.P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. arXiv:1006.0868, 2010.
 [18] D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
 [19] K. Furuta, M. Yamakita, and S. Kobayashi. Swing-up control of inverted pendulum using pseudo-state feedback. Journal of Systems and Control Engineering, 206(6):263–269, 1992.