I. Introduction
Reinforcement Learning (RL) is a powerful approach for learning control policies for tasks that require a sequence of decisions, such as walking, playing a game, or navigating a maze. Learning is accomplished by repeatedly taking actions and observing both their effect on the environment state and the scalar reward signal they elicit. The parameters of the policy are iteratively updated to increase the total expected reward over an entire sequence of observations and actions, called a trajectory. The weakness of RL is its high sample complexity: often millions of actions must be taken before an RL algorithm learns an efficient control policy, meaning a single training run can take days to complete. This problem is particularly pronounced in tasks where the space of possible actions and states is continuous and high-dimensional, a category that includes most real-world tasks, such as robotic object manipulation or autonomous driving.
One contributor to this high sample complexity is the exploration-exploitation trade-off [1]. This is a fundamental trade-off in any sequential decision-making problem between exploitation (using available knowledge to follow a trajectory that maximizes expected reward) and exploration (searching previously unseen parts of the state space to find better trajectories). Excessive exploitation slows learning, as effort is spent achieving high rewards without gaining new information that can be used to improve the policy. On the other hand, excessive exploration slows learning as effort is spent searching through regions of the state space where improved trajectories are unlikely to be found. Achieving good exploration is a challenging problem, as all but the simplest RL tasks have state spaces that are intractable to explore fully. The state space grows exponentially in the number of dimensions, and the relationship between actions and state changes is often complex, so that even reaching a specific state is not trivial.
In spite of this, it is common for state-of-the-art RL algorithms to handle exploration in a naive manner. A standard approach is ε-greedy, which consists of taking a random action with probability ε and following the learned policy with probability 1 − ε. The equivalent for continuous action spaces is adding Gaussian noise to policy actions. Consequently, exploration is often inefficient, spending much time searching through well-known regions of the state space that will provide little new information, or through regions that good trajectories are unlikely to intersect.

More recently, a class of algorithms has emerged that guides learning by use of intrinsic rewards [2], meaning a reward signal that comes from the model itself (intrinsic) rather than the environment (extrinsic). This is analogous to curiosity in humans, who are internally motivated to seek out novel experiences even when there is no external reward for doing so. However, such algorithms tend to require cumbersome additions to the policy model.
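As a concrete illustration, the two naive exploration strategies described above can be sketched in a few lines. This is a minimal sketch; the function names and the noise scale σ are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # Discrete actions: explore with probability epsilon, otherwise exploit.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def gaussian_exploration(policy_action, sigma=0.1):
    # Continuous actions: perturb the policy output with Gaussian noise.
    return policy_action + sigma * rng.normal(size=policy_action.shape)
```

Both strategies ignore what the agent already knows about the state space, which is precisely the inefficiency this paper targets.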
In this paper we propose a curiosity-based method that can be applied to an arbitrary RL algorithm in order to direct exploration towards parts of the state space that have not yet been visited. Using a Bayesian linear regression model with a learned latent space embedding, we compute the uncertainty of the model for arbitrary states. This uncertainty grows with the dissimilarity from previously seen points in the state space, and is therefore a good measure of the novelty of a given state observation. We use the model uncertainty to generate an intrinsic reward signal that encourages visiting new states far from previously explored regions. Applying this approach to state-of-the-art RL algorithms, we verify experimentally that it accelerates learning in classic control tasks as well as in challenging robotics tasks with high-dimensional state spaces, continuous action spaces, and even sparse rewards.
II. Related Work
Trading off exploration and exploitation is a fundamental problem in reinforcement learning, dating back to the multi-armed bandit problem [3] studied in statistics. The canonical algorithm for handling this trade-off was ε-greedy [1], which consists of taking a random action with probability ε and otherwise taking a greedy action that maximizes the expected reward. Although guaranteed to converge, ε-greedy is unable to incorporate information about the model's uncertainty. Many algorithms have been proposed to leverage such uncertainty information by estimating a posterior over the expected value, for example using Bayesian linear regression [4, 5] or Bayesian neural networks [6]. While provably efficient, these algorithms involve the use of an argmax operation, limiting them to problems with discrete actions. Our method seeks to address exploration in more realistic problems with continuous actions.

One powerful approach that can use uncertainty information to deal with the exploration-exploitation trade-off is the Upper Confidence Bound (UCB) algorithm [7]. Instead of greedily taking the action that maximizes the expected reward, the UCB algorithm takes the action that maximizes the sum of the reward's expectation and variance. A significant weakness, however, is that UCB does not extend naturally beyond the bandit setting. The EMU-Q algorithm [8] applies UCB principles to more general reinforcement learning problems: Bayesian linear regression with random Fourier features is used to estimate the Q-function as well as the uncertainty of the model, and actions are taken that greedily maximize the sum of both. An alternative approach is SARSA with an uncertainty Bellman equation (UBE) [9]. The authors derive a different Bellman backup for the uncertainty of the Q-function, and approximate a Bayesian linear regression with a DQN [10]. Actions are taken by Thompson sampling [11]. While the EMU-Q algorithm works well for continuous action spaces with low-dimensional actions and observations, UBE works well in problems with high-dimensional observations but is restricted to discrete action spaces. In contrast, the method we propose can solve problems that have both a continuous action space and high-dimensional observations such as images.

In recent years, intrinsic motivation algorithms have emerged as a popular solution to the exploration-exploitation trade-off. This class of algorithms is based on the idea that exploration can be driven by a separate reward signal that encourages visiting under-explored states. Such a reward signal can be derived from visitation counts [12, 13, 14], model prediction error [15, 16, 17], variational information gain [18], or entropy maximization [19, 20]. In contrast, we propose to use the uncertainty of a Bayesian linear regression, which is a well-understood mathematical mechanism that can explicitly separate uncertainty in the model from uncertainty in the data. In [21], the uncertainty of a Gaussian process was used as a reward term in a linear-quadratic regulator in order to encourage exploration. While Gaussian processes provide good uncertainty estimates, they are problematic when it comes to dealing with large data sets and high-dimensional data such as images, both of which our method can handle.
III. Background
III-A. The Markov Decision Process
A Markov Decision Process (MDP) is described by a tuple (S, A, π, P, r, γ, T) where: S is the set of all possible states; A is the set of all possible actions; π(a | s) is a stochastic policy mapping state-action pairs to the probability of choosing action a in state s; P(s' | s, a) is a state transition function mapping tuples (s, a, s') to the probability of arriving at state s' after taking action a at state s; r(s, a) is a reward function assigning a scalar reward value to each state-action pair; γ ∈ [0, 1) is a discount factor, while T is a time horizon.

An MDP can be used to generate a sequence of states and actions as follows: given an initial state s_0, iteratively select the next action a_t ∼ π(· | s_t) and evolve the state by sampling s_{t+1} ∼ P(· | s_t, a_t). The sequence τ = (s_0, a_0, …, s_T, a_T) is called a trajectory. The expected total discounted reward over an entire trajectory is:

J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\Big]    (1)

where the expectation is subscripted by τ ∼ π to denote the distribution over trajectories that π induces, and γ^t discounts rewards later in the trajectory. Let π_θ denote a parameterized policy whose parameters are θ. A reinforcement learning algorithm seeks to maximize the expected total discounted reward by optimizing over the policy parameters. This can be expressed succinctly as the following optimization problem:

\theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\Big]    (2)
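The discounted sum inside (1) and (2) is straightforward to compute for a single sampled trajectory. A minimal sketch (the function name is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    # Sum of gamma^t * r_t over one trajectory, as inside eq. (1).
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```

An RL algorithm estimates the expectation in (1) by averaging this quantity over many sampled trajectories.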
III-B. Bayesian Linear Regression
Linear regression is a form of regression analysis in which the data is explained using a linear model [22], that is, a model of the form y = w^T x + b, where x is the input, w is a vector of weights and b is the intercept. Bayesian linear regression (BLR) [23] places a prior on the model weights w. We choose a Gaussian prior:

p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0), \qquad \mathbf{m}_0 = \mathbf{0}, \quad \mathbf{S}_0 = \alpha^{-1}\mathbf{I}    (3)

where α is a hyperparameter representing the precision of the model, and m_0 and S_0 are a vector and matrix describing the prior mean and covariance over the model weights. Note that in order to simplify the notation we remove the intercept and instead add a dummy dimension to the input that always has a constant value of 1. Let there be a set of training data (X, y), where X is an N × d matrix whose rows are d-dimensional input vectors and y is an N-dimensional vector of targets. Although we restrict the targets to being scalar in order to simplify the notation, the following treatment can easily be extended to vector targets. In order to capture nonlinearity in the data, we transform the inputs of the BLR using a nonlinear function g : R^d → R^k. Note that it is possible that k ≠ d, meaning g can change the dimensionality of the data. The posterior distribution over w given the training data is:

p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N), \qquad \mathbf{S}_N = \big(\mathbf{S}_0^{-1} + \beta\,\Phi^{\top}\Phi\big)^{-1}, \quad \mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\Phi^{\top}\mathbf{y}\big)    (4)

where β is a hyperparameter representing the noise precision in the data, y is the vector of targets and Φ is an N × k matrix whose i-th row is g(x_i). Note that after we perform an update with one set of data, we can perform another update with a second set by assigning m_0 ← m_N and S_0 ← S_N.
Linear regression gives a point prediction \hat{y} = \mathbf{w}^{\top} g(x) for an arbitrary input x. BLR with the above formulation instead yields a posterior predictive distribution over the prediction:

p(y \mid x, \mathbf{X}, \mathbf{y}) = \mathcal{N}\big(y \mid \mathbf{m}_N^{\top} g(x),\ \sigma_N^{2}(x)\big), \qquad \sigma_N^{2}(x) = \beta^{-1} + g(x)^{\top}\mathbf{S}_N\, g(x)    (5)

where σ_N²(x) is the variance of the predictive posterior.
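The posterior update (4) and predictive variance (5) take only a few lines of numpy. The following is a minimal sketch assuming a zero prior mean and S_0 = I/α; the class name and default precisions are illustrative:

```python
import numpy as np

class BLR:
    """Sequential Bayesian linear regression, following eqs. (3)-(5)."""

    def __init__(self, dim, alpha=1.0, beta=1.0):
        self.beta = beta
        self.S_inv = alpha * np.eye(dim)  # prior precision S_0^{-1} = alpha * I
        self.b = np.zeros(dim)            # accumulates beta * Phi^T y (m_0 = 0)

    def update(self, Phi, y):
        # Eq. (4): S_N^{-1} = S_0^{-1} + beta Phi^T Phi ;  m_N = S_N b
        self.S_inv += self.beta * Phi.T @ Phi
        self.b += self.beta * Phi.T @ y

    def predict(self, phi):
        # Eq. (5): predictive mean m_N^T phi and variance 1/beta + phi^T S_N phi
        m_N = np.linalg.solve(self.S_inv, self.b)
        var = 1.0 / self.beta + phi @ np.linalg.solve(self.S_inv, phi)
        return m_N @ phi, var
```

Because each update only adds precision, the predictive variance at any input can only shrink as data accumulates, which is the property the curiosity signal in section IV exploits.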
IV. Bayesian Curiosity
We propose Bayesian Curiosity, a method that can extend any RL algorithm by adding a secondary model that produces an intrinsic reward signal. The pipeline of the method can be seen in fig. 2. In black is the standard RL machinery: a state observation s_t is passed to the policy network π_θ, which produces action a_t. This action is then passed to the environment, which evolves its state and emits both an environment reward r^e_t and the next state observation. In green is the proposed Bayesian Curiosity mechanism: s_t is given as input to a neural network g with parameters ν, which transforms it into a latent space. This latent space representation is passed to a BLR, which computes the uncertainty for the given input. The uncertainty is then used to construct a curiosity reward signal r^c_t, which together with r^e_t forms the reinforcement reward r_t.
IV-A. Latent Space Embedding
BLR is a linear model, and requires some transformation of the inputs in order to capture nonlinearity in the data. Further, it scales poorly with the dimensionality of the data. We therefore learn the transformation g, implemented as a neural network with weights ν, to capture nonlinearity and reduce dimensionality. Algorithm 1 introduces a procedure for jointly optimizing the parameters of the network and the BLR, given a set of training data (X, y). We acquire the training data by generating expert demonstrations with small Gaussian noise, meaning that our method is a form of learning from demonstration. In terms of the BLR, this means that the inputs X will be observations and the targets y will be actions.
To train g we follow the stochastic gradient descent procedure defined in algorithm 1. In each epoch, we randomly sample a subset (X', y') of the training data. We use (X', y') as the data for the BLR update in (4), denoting the parameters of the posterior over BLR weights as m_N and S_N. Note that m_N and S_N are expressions parameterized by ν. Plugging this into (5), we get the moments of the predictive posterior for an arbitrary demonstration (x*, y*):

\mu(x^{*}) = \mathbf{m}_N^{\top}\, g(x^{*}), \qquad \sigma^{2}(x^{*}) = \beta^{-1} + g(x^{*})^{\top}\mathbf{S}_N\, g(x^{*})    (6)
The loss for our SGD is the negative log-likelihood (NLL) of y* under the predictive posterior, given by:

\mathcal{L}(\nu) = \tfrac{1}{2}\log\sigma^{2}(x^{*}) + \frac{\big(y^{*} - \mu(x^{*})\big)^{2}}{2\,\sigma^{2}(x^{*})} + \text{const}    (7)

In each epoch, having chosen (X', y'), we repeatedly sample minibatches without replacement, and take stochastic gradient steps to minimize the NLL loss w.r.t. ν.
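Given the moments in (6), the per-demonstration loss in (7) reduces to a one-line function. This is a sketch of the loss value only; in algorithm 1 it would be minimized with automatic differentiation w.r.t. the network weights ν:

```python
import numpy as np

def nll(y_star, mu, var):
    # Eq. (7): negative log-likelihood of y* under N(mu, var), up to a constant.
    return 0.5 * np.log(var) + (y_star - mu) ** 2 / (2.0 * var)
```

Minimizing this loss pulls the predictive mean towards the demonstrated action while calibrating the predictive variance, so that the embedding g produces uncertainties that are meaningful for exploration.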
IV-B. Reinforcement Learning with Bayesian Curiosity
Given the latent space embedding g, we define the curiosity reward as:

r^{c}_{t} = \sigma^{2}\big(g(s_t)\big) - \beta^{-1}    (8)

where σ² is the variance of the predictive posterior in (5) and s_t is the observation at time t. Since σ² is bounded from below by β⁻¹, the lower bound of r^c_t is 0. Given extrinsic reward r^e_t, the final combined reward is:

r_t = r^{e}_{t} + \eta\, r^{c}_{t}    (9)

where η is a hyperparameter that controls the weight of the curiosity reward.
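Assuming the curiosity reward subtracts the irreducible noise floor β⁻¹ from the predictive variance (one form consistent with the stated lower bound of zero; the function names and exact form are illustrative), the reward combination might look like:

```python
def curiosity_reward(pred_var, beta):
    # Predictive variance minus its floor 1/beta: novel states score higher,
    # and the reward is bounded below by zero.
    return pred_var - 1.0 / beta

def combined_reward(r_env, r_cur, eta=0.1):
    # Extrinsic reward plus weighted intrinsic (curiosity) reward.
    return r_env + eta * r_cur
```

With this form, once a region of the state space has been visited often, its variance approaches β⁻¹ and the intrinsic bonus vanishes, leaving only the extrinsic reward.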
We proceed to perform reinforcement learning with the combined reward signal as shown in algorithm 2. Let A be some RL algorithm that can be chosen arbitrarily. Before RL begins, we reset the Bayesian linear regression. In every episode, we execute rollouts of the policy to obtain sequences of actions, observations, and rewards. The observations are used to update the BLR. Since the model is a BLR, uncertainty will, in expectation, be higher for new observations whose latent representations have low cosine similarity to those of previous observations. According to [9], states differing only in ways irrelevant to the task will be mapped to similar representations, and vice versa. Thus the curiosity reward will be higher for novel states. Note that we do not need to update the posterior mean m_N, since we only use the uncertainty of the BLR. This frees us from having to find target values for the observations, for which there is in general no intuitive candidate in the reinforcement learning setting. At the end of every episode the policy is updated according to the rules of the algorithm A. Note that throughout the RL stage the embedding g is not updated, as doing so would change the latent space and thus invalidate all previous updates to the BLR.

V. Experimental Results
We proceed to test the effectiveness of our method on a series of continuous control and robotics tasks. In the following, all neural networks are implemented using Theano [24]. Pretraining is done using ADAM [25] with regularization. For the RL algorithms we use the implementations in RLLab [26]. The hyperparameters are the BLR precisions α and β and the curiosity weight η. Source code is available at https://gitlab.com/tomblau/BayesianCuriosity.

V-A. Classic Control
TABLE I: Median total trajectory reward, with lower and upper quartiles in parentheses, and speedup relative to each baseline.

Task         Method     | TRPO Reward               | Speedup | DDPG Reward                | Speedup | REINFORCE Reward           | Speedup
Mountaincar  Curiosity  | 20.0 (16.0, 22.0)         |   --    | -50.0 (-200.0, -7.0)       |   --    | 30.15 (4.80, 31.70)        |   --
             VIME       | 22.0 (17.0, 24.0)         |  0.8    | --                         |   --    | 28.5 (15.75, 30.375)       |  3.36
             Vanilla    | 18.0 (12.25, 22.0)        |  1.43   | -11.0 (-200, 20.0)         |  0.67   | 29.85 (4.75, 30.9)         |  2.25
Swingup      Curiosity  | 1.39 (-15.80, 12.36)      |   --    | 33.11 (17.78, 44.33)       |   --    | -46.53 (-57.61, -35.05)    |   --
             VIME       | -46.66 (-67.93, -33.34)   |  6.76   | --                         |   --    | -68.57 (-81.67, -56.04)    |  5.33
             Vanilla    | -23.60 (-45.99, -9.94)    |  4.44   | 25.0 (20.71, 42.22)        |  1.61   | -48.09 (-57.99, -37.47)    |  1.11
Pendulum     Curiosity  | -80.33 (-87.39, -74.47)   |   --    | -27.13 (-50.2, -14.43)     |   --    | -80.53 (-86.22, -76.41)    |   --
             VIME       | -79.16 (-86.03, -72.03)   |  0.833  | --                         |   --    | -81.43 (-87.11, -77.19)    |  2.52
             Vanilla    | -79.63 (-88.25, -66.75)   |  0.89   | -53.51 (-100, -19.45)      |  13.33  | -83.91 (-100.0, -77.96)    |  9.09
Acrobot      Curiosity  | -182.7 (-241.33, -148.83) |   --    | -138.05 (-203.25, -110.13) |   --    | -135.5 (-190.08, -108.4)   |   --
             VIME       | -230.1 (-319.83, -183.58) |  2.47   | --                         |   --    | -137.15 (-191.33, -110.40) |  1.14
             Vanilla    | -197.8 (-286.88, -143.05) |  1.45   | -143.1 (-208.8, -111.95)   |  1.52   | -145.9 (-224.13, -115.0)   |  2.21
We begin by evaluating our method on classic control problems with a number of RL algorithms: REINFORCE [27], DDPG [28], and TRPO [29] in four continuous control tasks: Mountaincar, Cartpole Swingup, Pendulum, and Acrobot. For each algorithm and environment we evaluate the algorithm with and without Bayesian Curiosity. As an additional exploration baseline, we also evaluate the baseline algorithm combined with VIME [18]. Extrinsic rewards for the above tasks have been sparsified to emphasize the importance of exploration.
Table I shows the results for this set of experiments, with each entry aggregated over 10 random seeds. The "reward" columns show the median total trajectory reward, as well as the lower and upper quartiles (the 25th and 75th percentiles, respectively). The "speedup" columns show how much more quickly the algorithms with Bayesian Curiosity learned compared with their respective baselines. The value is the ratio of the number of timesteps the baseline took to achieve its best result and the number of timesteps the corresponding Bayesian Curiosity algorithm took to match the baseline. Results for DDPG-VIME are missing as the authors of the original paper have not made code available for this algorithm. For most tasks and algorithms, our method is able to achieve comparable or superior performance in significantly fewer timesteps compared with both the standard RL and intrinsic motivation baselines. This is in spite of the fact that these are relatively simple problems with low-dimensional state and action spaces. Notable exceptions are DDPG on the Mountaincar task and TRPO on the Pendulum task, in which Bayesian Curiosity achieves lower performance than the baselines. In these cases, "speedup" is the ratio of the number of timesteps Bayesian Curiosity took to achieve its best result and the number of timesteps the baseline took to match it. For DDPG on Mountaincar, learning for both the baseline and Bayesian Curiosity versions is unstable, and the agents will periodically forget and relearn how to achieve successful trajectories. Indeed, the lower quartile for both versions is −200 throughout the learning process, meaning that in every episode at least a quarter of agents have failed to find a successful trajectory. For TRPO on Pendulum, the baseline and Bayesian Curiosity agents achieve similar results.

Finally, we examine how the surface of curiosity rewards changes when novel states are discovered. Figure 3 shows heatmaps of curiosity rewards w.r.t. the state space before and after an episode of Mountaincar. The states visited in this episode are shown as pink dots in both plots. There is a noticeable decrease in curiosity in the range of the x-axis where new states were visited that had not been seen before.
V-B. Robotic Control
We now examine the effect of Bayesian Curiosity rewards on learning sensorimotor control policies for a 6-DOF robotic arm. This includes Cartesian control problems, where goal state information is represented by its 3D Cartesian coordinates, and visuomotor control problems, which instead have image data. All tasks are executed in the V-REP simulation environment [30] using a Kinova Jaco arm, as shown in fig. 1, except for the final task, which is executed on a real robot. Training data for the curiosity model are generated using an inverse kinematics solver. To speed up the experiments and to make comparisons with the baselines more fair, all policies are pretrained to imitate this training dataset. Experiments focus on TRPO because it was found to produce the most stable learning of all tested algorithms. All policies have Gaussian action noise determined by a neural network whose parameters are learned by the RL algorithm.
We begin with two Cartesian problems, reaching and grasping. The arm must be controlled to reach or grasp a cylinder that sits on a table. The cylinder is placed uniformly at random in a square region at the beginning of each episode. In the reaching task, the cylinder is a non-interactable object that serves as an objective marker. Collision of the arm with itself or other objects results in failure. Rewards are dense for the reaching task and sparse for the grasping task. Observations consist of the angles of the controllable joints as well as the 3D coordinates of the cylinder. Actions are a vector of angle deltas, prescribing a change in the joint angles.
The top row of fig. 4 shows performance on the Cartesian reaching and grasping tasks for TRPO with and without Bayesian Curiosity. We also include VIME [18] and RND [15] as intrinsic motivation baselines. Performance is measured as n_s / n_t, where n_s is the number of successes and n_t is the number of timesteps in the 10 most recent episodes. This metric directly captures the ability of the policy to complete the task successfully, and unlike the success rate it also assigns a higher value to completing the task in fewer timesteps. In the reaching task, although median performance is similar, the lower quartile improves much more quickly with our method than with the baselines. For the baselines, some initializations result in agents that take a very long time to converge, whereas with Bayesian Curiosity even the slowest agents converge quickly. This indicates that our method is more robust to random initialization.
For both the Cartesian and visuomotor cases, the grasping task exhibits a much starker difference between our method and the baselines than the reaching task does. The lower quartile of Bayesian Curiosity quickly rises above the upper quartile of the baselines. Further, whereas the trendlines for the baseline agents are almost linear, the curiosity agents exhibit a more sigmoidal trendline, improving rapidly at first before slowing down. There are two factors contributing to this difference. The first is that the grasping tasks have sparse rewards while the reaching tasks have dense rewards. The second has to do with the nature of the grasping task: any success state is very close to a failed state wherein the cylinder was knocked over rather than grasped. This means that there is a bottleneck in the state space, formed by failure states, which must be traversed to reach a success state. This kind of geometry is very difficult to explore using Gaussian action noise, and comparatively easier to explore with a curiosity reward that discourages revisiting explored states.
In the next set of experiments we investigate visuomotor versions of the reaching and grasping tasks. Observations no longer include the 3D coordinates of the cylinder, but instead contain RGB-D images taken from a fixed camera pose. The performance of our method is compared with the vanilla TRPO and RND baselines (VIME does not scale to high-dimensional observations) in the bottom row of fig. 4. In the reaching task, our method greatly outperforms the naive baseline, quickly achieving performance that vanilla TRPO cannot match even when given several times as many timesteps, and slightly improves on the RND baseline. Compared with the Cartesian case, the intrinsic motivation methods perform better, while the performance of the naive baseline degrades. This suggests that agents with intrinsic motivation can explore enough to leverage the additional information of the visual sensors, while agents that rely on Gaussian noise for exploration struggle to explore the enlarged observation space. In the grasping task, our method achieves performance that the baselines cannot match even when trained for several times as many timesteps.
Finally, we replicate the Cartesian reaching task on a physical robotic arm. The bottom-right plot in fig. 4 shows the results of this set of experiments. Due to the time requirements of robotic experiments, only RND, the strongest baseline in simulation, was included. As in the simulated reaching task, we see that median performance is comparable to the RND baseline, while the lower quartile grows more quickly, indicating higher robustness to random initialization. This robustness is particularly valuable when dealing with physical robots, as each run of the algorithm is expensive and time-consuming.
VI. Conclusions
In this work we introduced a new method that combines BLR with a learned latent space embedding to generate a curiosity signal that directs exploration towards novel states, and demonstrated its capability to augment a variety of standard RL algorithms. Compared with both naive Gaussian noise exploration and state-of-the-art intrinsic motivation methods, our method is able to accelerate exploration and achieve comparable or superior performance in fewer timesteps. This improvement is particularly noticeable when the state space has a geometry that makes it difficult to explore, or when the reward function is sparse and provides little information to help direct exploration. The ability to learn policies from sparse rewards is highly desirable, as it obviates the need for carefully designing reward functions using expert knowledge.
Future work can adapt Bayesian Curiosity to continually update the latent space embedding during the RL procedure, eliminating the need for demonstrations. Another possible extension is to combine the latent space representation with Random Fourier Features, which can approximate a Gaussian Process [31].
References
 [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT press Cambridge, 1998.
 [2] P. Y. Oudeyer and F. Kaplan, “How can we define intrinsic motivation?” in International Conference on Epigenetic Robotics, 2008.
 [3] H. Robbins, “Some aspects of the sequential design of experiments,” in Herbert Robbins Selected Papers. Springer, 1985.

 [4] I. Osband, B. Van Roy, and Z. Wen, “Generalization and exploration via randomized value functions,” in International Conference on Machine Learning, 2016.
 [5] K. Azizzadenesheli, E. Brunskill, and A. Anandkumar, “Efficient exploration through Bayesian deep Q-networks,” in IEEE Information Theory and Applications Workshop, 2018.
 [6] Z. Lipton, X. Li, J. Gao, L. Li, F. Ahmed, and L. Deng, “BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems,” in AAAI Conference on Artificial Intelligence, 2018.
 [7] P. Auer, “Using confidence bounds for exploitation-exploration trade-offs,” Journal of Machine Learning Research, 2002.
 [8] P. Morere and F. Ramos, “Bayesian RL for goal-only rewards,” in Conference on Robot Learning, 2018.
 [9] B. O’Donoghue, I. Osband, R. Munos, and V. Mnih, “The uncertainty bellman equation and exploration,” in International Conference on Machine Learning, 2018.
 [10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, 2015.
 [11] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, 1933.
 [12] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” in Advances in Neural Information Processing Systems, 2016.
 [13] M. C. Machado, M. G. Bellemare, and M. Bowling, “Count-based exploration with the successor representation,” arXiv preprint, 2018.
 [14] H. Tang, R. Houthooft, D. Foote, A. Stooke, et al., “#Exploration: A study of count-based exploration for deep reinforcement learning,” in NIPS, 2017.
 [15] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, “Exploration by random network distillation,” International Conference on Learning Representations, 2019.
 [16] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in International Conference on Machine Learning, 2017.
 [17] J. Achiam and S. Sastry, “Surprise-based intrinsic motivation for deep reinforcement learning,” arXiv preprint, 2017.
 [18] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel, “Vime: Variational information maximizing exploration,” in Advances in Neural Information Processing Systems, 2016.
 [19] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” in ICLR, 2019.
 [20] Z. Hong, T. Shann, S. Su, Y. Chang, et al., “Diversity-driven exploration strategy for deep reinforcement learning,” in NIPS, 2018.
 [21] S. Bechtle, A. Rai, Y. Lin, L. Righetti, and F. Meier, “Curious ilqr: Resolving uncertainty in modelbased rl,” in ICML Workshop on Reinforcement Learning for Real Life, 2019.
 [22] X. Yan and X. Su, Linear Regression Analysis: Theory and Computing. World Scientific, 2009.
 [23] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). SpringerVerlag, 2006.
 [24] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: a CPU and GPU math expression compiler,” in Python for Scientific Computing Conference (SciPy), 2010.
 [25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint, 2014.
 [26] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” International Conference on Machine Learning, 2016.
 [27] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, 2000.
 [28] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint, 2015.
 [29] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” International Conference on Machine Learning, 2015.
 [30] E. Rohmer, S. P. N. Singh, and M. Freese, “V-REP: a versatile and scalable robot simulation framework,” in International Conference on Intelligent Robots and Systems, 2013.
 [31] J. Quiñonero-Candela, C. E. Rasmussen, A. R. Figueiras-Vidal, et al., “Sparse spectrum Gaussian process regression,” Journal of Machine Learning Research, vol. 11, 2010.