1 Introduction
Reinforcement learning (RL) solves sequential decision-making problems by considering an agent that interacts with the environment and seeks the optimal policy (Sutton & Barto, 1998). During the learning process, the agent is required to evaluate its policies using a value function. In many real-world RL domains, such as robotics, games and autonomous driving, the state and action spaces are large, hence the value function is approximated, e.g., using a Deep Neural Network (DNN). A common approach is to optimize a set of parameters by minimizing the sum of squared Bellman Temporal Difference (TD) errors (Dann et al., 2014). There are two underlying assumptions in this approach: first, the value and its parameters are deterministic; second, the Bellman TD errors are independent Gaussian random variables (RVs) with zero mean and a fixed variance. Although this objective function is commonly used, its underlying assumptions may not be suitable for the policy evaluation task in RL. Distributional RL (Bellemare et al., 2017) addresses the second assumption and argues in favor of a full distribution perspective over the sum of discounted rewards for a fixed policy. In particular, learning this distribution is meaningful in the presence of value approximation. However, in their formulation the value parameters are still considered deterministic and they do not provide a measure of confidence for the value estimates.
Treating the value or its parameters as RVs has been investigated in the RL literature. Engel et al. (2003, 2005) used Gaussian Processes (GP) for the value and the return to capture uncertainties in policy evaluation. Geist & Pietquin (2010) proposed to use the Unscented Kalman filter (UKF) to learn the uncertainty in value parameters. Their formulation requires many samples of the parameters in each training step, which is not feasible in Deep Reinforcement Learning (DRL) with large state and action spaces.
Motivated by the works of Engel et al. (2003, 2005) and Geist & Pietquin (2010), we present in this work a unified framework for addressing uncertainties while approximating the value in DRL domains. Our framework combines the well-known Kalman filter estimation techniques with RL principles to improve value approximation. The Kalman filter (Kalman et al., 1960) and its variant for nonlinear approximations, the Extended Kalman filter (EKF) (Anderson & Moore, 1979; Gelb, 1974), are used for online tracking and for estimating states in dynamic environments through indirect noisy observations. These methods have been successfully applied to numerous dynamic control systems, such as navigation and target tracking (Särkkä, 2013). The Kalman filter can also be used for parameter estimation in approximation functions, where parameters replace the states of dynamic systems.
We develop a new optimization method for policy evaluation based on the EKF formulation. Figure 1 illustrates our Bayesian perspective over value parameters and noisy observed returns. Our proposed method has the following properties: (i) it forms a trust region over the value parameters, based on their uncertainty covariance; (ii) it is aimed at tracking the solution rather than converging to it; (iii) it incrementally updates the parameters and the error covariance matrix, hence it avoids sampling the parameters as is often required in Bayesian methods; (iv) it adjusts a suitable learning rate for each individual parameter through the Kalman gain, so the learning procedure does not depend on the parameterization of the value.
Our main contributions are: (1) Developing a new regularized objective function for approximating values in the policy evaluation task. The regularization term accounts for both parameter and observation uncertainties. (2) Presenting a novel optimization algorithm, Kalman Optimization for Value Approximation (KOVA), and proving that it minimizes the regularized objective function at each time step. This optimizer can be easily plugged into any policy optimization algorithm and improve it. (3) Beyond the RL context, presenting the connection between the EKF and the incremental Gauss-Newton method, the online natural gradient and the Kullback-Leibler (KL) divergence, and explaining how our objective function forms a trust region over the value parameters. (4) Demonstrating the improvement achieved by our optimizer on several control tasks with large state and action spaces.
2 Background
2.1 Reinforcement Learning and MDPs
The standard RL setting considers an interaction of an agent with an environment for a discrete number of time steps. The environment is modeled as a Markov Decision Process (MDP) $\{\mathcal{S}, \mathcal{A}, P, R, \gamma\}$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P$ gives the state transition probabilities for each state $s$ and action $a$, $R$ is a deterministic and bounded reward function and $\gamma$ is a discount factor. At each time step $t$, the agent observes state $s_t$ and chooses action $a_t$ according to a policy $\pi$. The agent receives an immediate reward $r_t = R(s_t, a_t)$ and the environment stochastically steps to state $s_{t+1}$ according to the probability distribution $P(s_{t+1} \mid s_t, a_t)$. The state value function and the state-action Q-function are used for evaluating the performance of a fixed policy $\pi$ (Sutton & Barto, 1998): $V^{\pi}(s) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s\right]$ and $Q^{\pi}(s,a) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s, a_0 = a\right]$, where $\mathbb{E}^{\pi}$ denotes the expectation with respect to the state (state-action) distribution induced by the transition law $P$ and the policy $\pi$.
2.2 Value Function Estimation
Policy evaluation, or value estimation, is a core element in RL algorithms. We will use the term value function (VF) to address the following functions: the state value function $V^{\pi}(s)$, the state-action Q-function $Q^{\pi}(s,a)$ and the advantage function $A^{\pi}(s,a)$. When the state or action space is large, a common approach is to approximate the VF using a function parameterized by $\theta$. We focus on general, possibly nonlinear approximation functions such as DNNs that can effectively learn complex approximations.
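Table 1: Examples of the Bellman TD error type according to the policy optimization algorithm.
| Algorithm type | Example | TD error type |
| Actor-critic | A3C (Mnih et al., 2016) | $k$-step V-evaluation |
| Actor-critic | DDPG (Lillicrap et al., 2015) | 1-step Q-evaluation |
| Policy gradient | PPO (Schulman et al., 2017), TRPO (Schulman et al., 2015a) | GAE (Schulman et al., 2015b) |
| $\epsilon$-greedy | DQN (Mnih et al., 2013) | Optimality equation |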
A common approach for optimizing the VF parameters is to minimize at each time step $t$ the empirical mean of the squared Bellman TD error $\delta(u;\theta)$, over a batch of $N$ samples generated from the environment under a given policy:
$$L^{\mathrm{MLE}}_t(\theta) = \frac{1}{N}\sum_{i=1}^{N} \delta^{2}(u_t^{i};\theta) = \frac{1}{N}\sum_{i=1}^{N} \left(y(u_t^{i}) - h(u_t^{i};\theta)\right)^{2} \tag{1}$$
We use the general notation $u_t$ to specify the input for the target label $y(u_t)$ and for the approximated value $h(u_t;\theta)$ at time $t$, so that $\delta(u_t;\theta) = y(u_t) - h(u_t;\theta)$. For example, for the state value function, $u_t = s_t$ is the state at a discrete time $t$; for the Q-function, $u_t = (s_t, a_t)$ is the state-action pair. In Table 1 we provide examples of several options for $y(u_t)$ and $h(u_t;\theta)$ which clarify how this general notation can be utilized in known policy optimization algorithms.
Traditionally, the VF is trained by stochastic gradient descent methods, estimating the loss on each experience as it is encountered, yielding the update: $\theta \leftarrow \theta - \alpha\, \mathbb{E}_{u \sim \mathcal{D}}\left[\nabla_{\theta}\, \delta^{2}(u;\theta)\right]$, where $\alpha$ is the learning rate and $\mathcal{D}$ is the experience distribution. Typically, the training procedure seeks a point estimate of the model parameters. We will show (Section 3) that the underlying assumption behind (1) is that the parameters are deterministic and that the target labels are independent Gaussian RVs with mean $h(u;\theta)$ and a fixed variance. In Section 2.3 we present the EKF approach, which generalizes the process of generating observations and adds flexibility to the model assumptions: the parameters may be viewed as RVs and the variance of the target label may change between observations.
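The squared TD error objective and its gradient update can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a linear value function `features @ theta`, treats the target labels as fixed numbers, and uses hypothetical synthetic data; the names `td_loss_and_grad` and `sgd_step` are introduced here only for the example.

```python
import numpy as np

def td_loss_and_grad(theta, features, targets):
    """Mean squared TD error and its gradient for a linear value function.

    Assumption: h(u; theta) = features @ theta, and the target labels are
    treated as constants (no gradient flows through them).
    """
    delta = targets - features @ theta            # TD errors, shape (N,)
    loss = np.mean(delta ** 2)                    # empirical mean of squared errors
    grad = -2.0 * features.T @ delta / len(delta)
    return loss, grad

def sgd_step(theta, features, targets, lr=0.1):
    """One gradient-descent step on the mean squared TD error."""
    _, grad = td_loss_and_grad(theta, features, targets)
    return theta - lr * grad

rng = np.random.default_rng(0)
features = rng.normal(size=(32, 4))               # hypothetical batch, N=32 inputs
true_theta = np.array([1.0, -2.0, 0.5, 3.0])      # hypothetical ground truth
targets = features @ true_theta + 0.1 * rng.normal(size=32)

theta = np.zeros(4)
loss_before, _ = td_loss_and_grad(theta, features, targets)
for _ in range(200):
    theta = sgd_step(theta, features, targets)
loss_after, _ = td_loss_and_grad(theta, features, targets)
```

Note that the update yields only a point estimate `theta`; nothing in it tracks how uncertain each parameter is, which is the gap the EKF formulation addresses.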
2.3 Extended Kalman Filter (EKF)
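In this section we briefly outline the Extended Kalman filter (Anderson & Moore, 1979; Gelb, 1974). The EKF is a standard technique for estimating the state of a nonlinear dynamic system or for learning the parameters of a nonlinear approximation function. In this paper we focus on its latter role, namely estimating the parameters $\theta$. The EKF considers the following model:
$$\theta_t = \theta_{t-1} + v_t, \qquad y(u_t) = h(u_t;\theta_t) + n_t \tag{2}$$
where $\theta_t$ are the parameters evaluated at time $t$ and $y(u_t)$ is the $N$-dimensional observations vector at time $t$:
$$y(u_t) = \left[y(u_t^{1}), \ldots, y(u_t^{N})\right]^{\top} \tag{3}$$
and $h(u_t;\theta_t)$ is an $N$-dimensional vector, where $h(u_t^{i};\theta_t)$ is a nonlinear observation function with input $u_t^{i}$ and parameters $\theta_t$:
$$h(u_t;\theta_t) = \left[h(u_t^{1};\theta_t), \ldots, h(u_t^{N};\theta_t)\right]^{\top} \tag{4}$$
$v_t$ is the evolution noise and $n_t$ is the observation noise, both modeled as additive white noises with covariances $P_{v_t}$ and $P_{n_t}$, respectively. As seen in the model presented in Equation (2), the EKF treats the parameters as RVs, similarly to Bayesian approaches. According to this perspective, the parameters belong to an uncertainty set governed by the mean and covariance of the parameter distribution.
The estimate at time $t$, denoted $\hat{\theta}_t$, is the conditional expectation of the parameters with respect to the observed data. The EKF formulation distinguishes between estimates that are based on observations up to time $t$, $\hat{\theta}_{t|t}$, and on observations up to time $t-1$, $\hat{\theta}_{t|t-1}$. With some abuse of notation, $y_{1:t'}$ are the observations gathered up to time $t'$: $y_{1:t'} = \{y(u_1), \ldots, y(u_{t'})\}$. The parameter errors are defined by $\tilde{\theta}_{t|t} = \theta_t - \hat{\theta}_{t|t}$ and $\tilde{\theta}_{t|t-1} = \theta_t - \hat{\theta}_{t|t-1}$. The conditional error covariances are given by $P_{t|t} = \mathbb{E}\left[\tilde{\theta}_{t|t}\tilde{\theta}_{t|t}^{\top} \mid y_{1:t}\right]$ and $P_{t|t-1} = \mathbb{E}\left[\tilde{\theta}_{t|t-1}\tilde{\theta}_{t|t-1}^{\top} \mid y_{1:t-1}\right]$.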
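The EKF considers several statistics of interest at each time step. The prediction of the observation function, the observation innovation, the covariance between the parameters error and the innovation, the covariance of the innovation and the Kalman gain are defined respectively in Equations (5)-(9):
$$\hat{y}_{t|t-1} = \mathbb{E}\left[h(u_t;\theta_t) \mid y_{1:t-1}\right] \tag{5}$$
$$\tilde{y}_{t|t-1} = y(u_t) - \hat{y}_{t|t-1} \tag{6}$$
$$P_{\tilde{\theta}_t,\tilde{y}_t} = \mathbb{E}\left[\tilde{\theta}_{t|t-1}\,\tilde{y}_{t|t-1}^{\top} \mid y_{1:t-1}\right] \tag{7}$$
$$P_{\tilde{y}_t} = \mathbb{E}\left[\tilde{y}_{t|t-1}\,\tilde{y}_{t|t-1}^{\top} \mid y_{1:t-1}\right] \tag{8}$$
$$K_t = P_{\tilde{\theta}_t,\tilde{y}_t}\, P_{\tilde{y}_t}^{-1} \tag{9}$$
The above statistics serve for the EKF updates:
$$\hat{\theta}_{t|t-1} = \hat{\theta}_{t-1|t-1}, \quad P_{t|t-1} = P_{t-1|t-1} + P_{v_t}, \quad \hat{\theta}_{t|t} = \hat{\theta}_{t|t-1} + K_t\,\tilde{y}_{t|t-1}, \quad P_{t|t} = P_{t|t-1} - K_t P_{\tilde{y}_t} K_t^{\top} \tag{10}$$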
In the next section we present how to use the EKF formulation in order to approximate VFs which consider uncertainty both in the parameters and in the noisy observations.
3 EKF for Value Function Approximation
We now derive a novel regularized objective function and argue in its favor for optimizing value functions in RL. We use general notation in order to enable integration of our proposed VF optimization method with any policy optimization algorithm. The main idea is to decompose the Bellman TD error vector into two parts: $\delta(u_t;\theta_t) = y(u_t) - h(u_t;\theta_t)$. (i) The observation at time $t$, $y(u_t)$, is a vector that contains $N$ target labels $y(u_t^{i})$. (ii) The observation function $h(u_t^{i};\theta_t)$ may be one of the following: the approximated state value function, state-action Q-function or advantage function.
The observation functions for $N$ inputs are concatenated into the $N$-dimensional vector $h(u_t;\theta_t)$, as presented in Equation (4). In Table 1 we provide several examples for the Bellman TD error decomposition according to the chosen policy optimization algorithm.
Our goal is to estimate the parameters $\theta_t$. One way is to learn them by maximum likelihood estimation (MLE) using stochastic gradient descent methods: $\theta^{\mathrm{MLE}}_t = \arg\max_{\theta} \log p(y_{1:t} \mid \theta)$. This forms the objective function in Equation (1). Another way is to learn them by a Bayesian approach, which uses Bayes' rule and adds prior knowledge over the parameters to calculate the maximum a-posteriori (MAP) estimator: $\theta^{\mathrm{MAP}}_t = \arg\max_{\theta} \log p(\theta \mid y_{1:t})$. Given the observations gathered up to time $t$, we can rewrite the MAP estimator:
$$\theta^{\mathrm{MAP}}_t = \arg\max_{\theta_t} \left[\log p(y(u_t) \mid \theta_t) + \log p(\theta_t \mid y_{1:t-1})\right] \tag{11}$$
Here, instead of using the parameters prior, we use an equivalent derivation for the parameters posterior conditioned on $y_{1:t}$, based on the likelihood of a single observation $y(u_t)$ and the posterior conditioned on $y_{1:t-1}$ (Van Der Merwe, 2004). This derivation is a key step for making the incremental Kalman updates and for defining the objective function of Theorem 3.1. In order to define the likelihood $p(y(u_t) \mid \theta_t)$ and the posterior $p(\theta_t \mid y_{1:t-1})$, we adopt the EKF model (2) and make the following assumptions:
Assumption 3.1.
The likelihood $p(y(u_t) \mid \theta_t)$ is assumed to be Gaussian: $y(u_t) \mid \theta_t \sim \mathcal{N}\left(h(u_t;\theta_t), P_{n_t}\right)$.
Assumption 3.2.
The posterior distribution $p(\theta_t \mid y_{1:t-1})$ is assumed to be Gaussian: $\theta_t \mid y_{1:t-1} \sim \mathcal{N}\left(\hat{\theta}_{t|t-1}, P_{t|t-1}\right)$.
These assumptions are common when using the EKF. In the context of RL, these assumptions add the flexibility we want: the value is treated as an RV and information is gathered on the uncertainty of its estimate. In addition, the noisy observations (the target labels) can have different variances and can even be correlated. Based on these Gaussian assumptions, we can derive the following theorem:
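Theorem 3.1.
Under Assumptions 3.1 and 3.2, the EKF estimate $\hat{\theta}_{t|t}$ in Equation (10) minimizes at each time step $t$ the regularized objective function $L^{\mathrm{EKF}}_t(\theta_t) = \frac{1}{2}\,\delta(u_t;\theta_t)^{\top} P_{n_t}^{-1}\, \delta(u_t;\theta_t) + \frac{1}{2}\,(\theta_t - \hat{\theta}_{t|t-1})^{\top} P_{t|t-1}^{-1}\, (\theta_t - \hat{\theta}_{t|t-1})$, where $\delta(u_t;\theta_t) = y(u_t) - h(u_t;\theta_t)$.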
The proof for Theorem 3.1 appears in the supplementary material. It is based on solving the maximization problem in Equation (11) using the EKF model (2) and the Gaussian Assumptions 3.1 and 3.2.
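We now explicitly write the expressions for the statistics of interest in Equations (5)-(9) (see the supplementary material for more detailed derivations). The derivations are based on the first-order Taylor series linearization of the observation function: $h(u_t;\theta_t) \approx h(u_t;\bar{\theta}) + \nabla_{\theta} h(u_t;\bar{\theta})^{\top} (\theta_t - \bar{\theta})$, where
$$\nabla_{\theta} h(u_t;\bar{\theta}) = \left[\nabla_{\theta} h(u_t^{1};\bar{\theta}), \ldots, \nabla_{\theta} h(u_t^{N};\bar{\theta})\right] \tag{13}$$
and $\bar{\theta}$ is typically chosen to be the previous estimation of the parameters at time $t-1$, $\hat{\theta}_{t|t-1}$. The prediction of the observation function is $\hat{y}_{t|t-1} = h(u_t;\hat{\theta}_{t|t-1})$, the covariance between the parameters error and the innovation is $P_{\tilde{\theta}_t,\tilde{y}_t} = P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1})$ and the covariance of the innovation is:
$$P_{\tilde{y}_t} = \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1})^{\top} P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1}) + P_{n_t} \tag{14}$$
The Kalman gain then becomes:
$$K_t = P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1}) \left(\nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1})^{\top} P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1}) + P_{n_t}\right)^{-1} \tag{15}$$
This Kalman gain is used in the parameters update and the error covariance update in Equation (10).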
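One prediction and update step built from the linearized statistics can be sketched as below. This is a hedged illustration rather than the authors' code: the function name `ekf_step`, its argument layout, and the linear test problem are assumptions made for the example, and the textbook random-walk prediction step is used.

```python
import numpy as np

def ekf_step(theta, P, y, h, jac, P_n, P_v):
    """One EKF prediction + update step for parameter estimation.

    theta: (d,) previous estimate; P: (d, d) previous error covariance.
    y: (N,) target labels; h(theta) -> (N,) predicted observations.
    jac(theta) -> (d, N) gradient matrix of h with respect to theta.
    P_n: (N, N) observation noise covariance; P_v: (d, d) evolution noise covariance.
    """
    # Prediction: the parameters follow a random walk.
    theta_pred = theta
    P_pred = P + P_v
    # Linearize the observation function around the predicted parameters.
    G = jac(theta_pred)                          # (d, N)
    innovation = y - h(theta_pred)               # (N,)
    P_y = G.T @ P_pred @ G + P_n                 # innovation covariance, (N, N)
    # Kalman gain K = P_pred G P_y^{-1}; P_y and P_pred are symmetric,
    # so solving the N x N system and transposing gives the same matrix.
    K = np.linalg.solve(P_y, G.T @ P_pred).T     # (d, N)
    theta_new = theta_pred + K @ innovation
    P_new = P_pred - K @ P_y @ K.T
    return theta_new, P_new

# Hypothetical linear problem: h(theta) = X @ theta, so the gradient matrix is X.T.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                      # N=5 inputs, d=3 parameters
y = rng.normal(size=5)
theta1, P1 = ekf_step(np.zeros(3), np.eye(3), y,
                      h=lambda th: X @ th, jac=lambda th: X.T,
                      P_n=0.5 * np.eye(5), P_v=np.zeros((3, 3)))
```

In the linear Gaussian case used here, the step reproduces the exact Bayesian posterior; for a nonlinear `h` it is only as accurate as the first-order linearization.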
3.1 Comparing $L^{\mathrm{EKF}}$ and $L^{\mathrm{MLE}}$ for Optimizing Value Functions
We argue in favor of using the regularized objective function $L^{\mathrm{EKF}}$ of Theorem 3.1 for optimizing VFs instead of the commonly used objective function $L^{\mathrm{MLE}}$ in Equation (1). Corollary 3.1 helps us compare the two objective functions:
Corollary 3.1.
The proof is given in the supplementary material. According to Corollary 3.1, the two objective functions are the same if we consider the parameters as deterministic and if we assume that the noisy target labels have a fixed variance.
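The fixed-variance case can be made concrete with a short worked equation. The following is a sketch under an assumption introduced here for illustration: the observation noise covariance is $\sigma^{2} I$ for some fixed $\sigma^{2}$ (a symbol not taken from the text), so the $N$ target labels are independent Gaussians centered at the approximated values.

```latex
-\log p\big(y(u_t) \mid \theta\big)
  = \frac{1}{2\sigma^{2}} \sum_{i=1}^{N} \big(y(u_t^{i}) - h(u_t^{i};\theta)\big)^{2}
  + \frac{N}{2}\log\big(2\pi\sigma^{2}\big)
```

The second term does not depend on $\theta$, so with deterministic parameters (no prior term) maximizing this likelihood amounts to minimizing the sum of squared Bellman TD errors, which differs from the empirical mean in Equation (1) only by a positive constant factor.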
So what are the differences between the two objective functions? First, $L^{\mathrm{EKF}}$ is a regularized version of $L^{\mathrm{MLE}}$: the regularization causes the parameters to track the most recent parameter estimate, $\hat{\theta}_{t|t-1}$, stabilizing the estimation process. The error between successive estimates is weighted with the inverse of the uncertainty information, $P_{t|t-1}^{-1}$. $L^{\mathrm{MLE}}$ does not include a regularization term, meaning it does not account for parameter uncertainties. Note that adding a standard $L_2$ regularization to $L^{\mathrm{MLE}}$, as is common in DNNs, reflects staying close to the zero vector, which is not always desired.
Second, $L^{\mathrm{EKF}}$ weights the squared Bellman TD error vector with $P_{n_t}^{-1}$, which can be interpreted as an additional regularization technique. $P_{n_t}$ can be viewed as the amount of confidence we have in the observations, as defined in the EKF model (2): if the observations are noisy, we should consider larger values for the diagonal elements of the covariance $P_{n_t}$. In addition, $P_{n_t}$ allows us to model correlations between observation errors, unlike the i.i.d. assumption in $L^{\mathrm{MLE}}$. In Section 5 we discuss possible options for $P_{n_t}$.
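The two weighted terms discussed above can be checked numerically. The sketch below assumes the regularized objective has the form described in this section (squared TD error weighted by the inverse observation noise covariance, plus the deviation from the previous estimate weighted by the inverse error covariance) and a linear observation function; all sizes and matrices are hypothetical. It verifies that the Kalman update lands on a stationary point of that objective.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 6                                      # hypothetical sizes
X = rng.normal(size=(N, d))                      # linear observation function h(theta) = X @ theta
y = rng.normal(size=N)                           # target labels
theta_prev = rng.normal(size=d)                  # previous parameter estimate
A = rng.normal(size=(d, d))
P = A @ A.T + np.eye(d)                          # error covariance (symmetric positive definite)
B = rng.normal(size=(N, N))
P_n = B @ B.T + np.eye(N)                        # correlated observation noise covariance

def objective_grad(theta):
    """Gradient of 0.5*delta^T P_n^{-1} delta + 0.5*(theta-theta_prev)^T P^{-1} (theta-theta_prev)."""
    delta = y - X @ theta
    return -X.T @ np.linalg.solve(P_n, delta) + np.linalg.solve(P, theta - theta_prev)

# Kalman update with gain K = P X^T (X P X^T + P_n)^{-1}.
K = P @ X.T @ np.linalg.inv(X @ P @ X.T + P_n)
theta_kalman = theta_prev + K @ (y - X @ theta_prev)

grad_norm = np.linalg.norm(objective_grad(theta_kalman))
```

Because the objective is a convex quadratic in the linear case, a vanishing gradient means the Kalman update is its minimizer; with a nonlinear observation function the same statement holds only for the linearized problem.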
Looking at the parameters update in Equation (10) and the definition of the Kalman gain in Equation (15), we can see that the Kalman gain propagates the new information from the noisy target labels back down into the parameters uncertainty set $P_{t|t-1}$, before combining it with the estimated parameter value. In fact, $K_t$ can be interpreted as an adaptive learning rate for each individual parameter that implicitly incorporates the uncertainty of each parameter. This approach resembles familiar stochastic gradient optimization methods such as Adagrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), for different choices of $P_{t|t-1}$ and $P_{n_t}$. We refer the reader to Ruder (2016) for an overview of these methods.
When looking at $L^{\mathrm{EKF}}$, the reader may ask what $\hat{\theta}_{t|t-1}$ and $P_{t|t-1}$ stand for. When we are estimating the VF parameters for a fixed policy $\pi$, our objective function imposes a trust region method in each iteration of a batch optimization procedure. The trust region helps us to avoid overfitting to the most recent batch of data. In this case $\hat{\theta}^{\pi}_{t|t-1}$ is the last evaluation of the VF parameters for the same fixed policy $\pi$, and $P^{\pi}_{t|t-1}$ is the conditional error covariance between the new parameters estimation and the previous one, again for the same fixed policy (we add the superscript $\pi$ to emphasize that the VF parameters correspond to evaluating the same policy). When we change policies and start to evaluate the VF parameters of a new policy, we set the initial estimate and error covariance anew, meaning we start a new estimation procedure for the new policy.
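The reading of the Kalman gain as a per-parameter learning rate can be illustrated with a small sketch. It assumes, purely for illustration, a diagonal error covariance and a single scalar observation; the function name `kalman_gain_single` and the numbers are hypothetical.

```python
import numpy as np

def kalman_gain_single(P_diag, grad, r):
    """Kalman gain for one scalar observation with a diagonal error covariance.

    P_diag: (d,) parameter variances; grad: (d,) gradient of h;
    r: observation noise variance.
    """
    innovation_var = grad @ (P_diag * grad) + r
    return P_diag * grad / innovation_var

P_diag = np.array([1.0, 0.1, 0.01])              # hypothetical variances: uncertain -> confident
grad = np.ones(3)                                # same gradient for every parameter
K_low_noise = kalman_gain_single(P_diag, grad, r=1.0)
K_high_noise = kalman_gain_single(P_diag, grad, r=10.0)
```

With equal gradients, the step taken for each parameter is proportional to its variance, so uncertain parameters move more than confident ones, and a larger observation noise shrinks every step.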
3.2 Connection between EKF, Natural Gradient and the Gauss-Newton Method
The EKF may be viewed as an online natural gradient algorithm (Amari, 1998) that uses the Fisher information matrix (Ollivier et al., 2018). In this setting, the error covariance matrix corresponds, up to a scaling factor, to the inverse of the Fisher matrix. This insight suggests that the regularization term in $L^{\mathrm{EKF}}$ is actually a second-order approximation of the KL divergence between the previous parameter estimate and the current one. Combining these insights, we conclude that our proposed method can be viewed as a natural gradient algorithm for VF approximation. Similarly, the EKF may be viewed as an incremental version of the Gauss-Newton method, which is a common iterative method for solving least squares problems (Bertsekas, 1996). When updating the parameters, the Gauss-Newton method uses the matrix $J^{\top} J$, where $J$ is the Jacobian of the observation function $h$ with respect to the parameters. When the observations are assumed to be Gaussian (as we assume in Assumption 3.1), this matrix is equivalent to the Fisher information matrix.
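The link to a Gauss-Newton-like matrix can be seen in the information (inverse covariance) form of the update. The sketch below is a numerical check of a standard matrix identity under hypothetical sizes and an isotropic observation noise; it is not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 5, 3                                      # hypothetical sizes
G = rng.normal(size=(d, N))                      # gradient matrix of h, one column per observation
A = rng.normal(size=(d, d))
P_pred = A @ A.T + np.eye(d)                     # predicted error covariance
P_n = 0.3 * np.eye(N)                            # observation noise covariance

# Covariance form (Kalman update): only an N x N matrix is inverted.
P_y = G.T @ P_pred @ G + P_n
K = P_pred @ G @ np.linalg.inv(P_y)
P_post = P_pred - K @ P_y @ K.T

# Information form: the precision grows by the term G P_n^{-1} G^T.
precision_post = np.linalg.inv(P_pred) + G @ np.linalg.inv(P_n) @ G.T

identity_holds = np.allclose(np.linalg.inv(P_post), precision_post)
```

Each update therefore adds a noise-weighted outer product of gradients to the inverse error covariance, which is how the accumulated precision comes to play the role of a Gauss-Newton or Fisher matrix.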
The following Theorem formalizes the connection between EKF and two separate KL divergences:
Theorem 3.2.
Assume the inputs are drawn independently from a training distribution with density function , and assume the corresponding observations are drawn from a conditional training distribution with density function . Let
be the joint distribution whose density is
, and let be the learned distribution, whose density is . Under Assumptions 3.1 and 3.2, consider a diagonal covariance with diagonal elements then:where .
Theorem 3.2 illustrates how the EKF is aimed at minimizing two separate KL divergences. The first is the KL divergence between the conditional training distribution and the conditional learned distribution. This term is equivalent to the loss in (1). The second is the KL divergence between two different parameterizations of the joint learned distribution. This is the term which imposes a trust region on the VF parameters in the objective function of Theorem 3.1. The proof of Theorem 3.2 appears in the supplementary material.
3.3 Practical algorithm: KOVA optimizer
We now derive a practical algorithm for approximating VFs by minimizing the objective function of Theorem 3.1. In practice we use the update Equations (10) and the Kalman gain Equations (14) and (15) in order to avoid inverting $P_{t|t-1}$. In addition, we add a fixed learning rate $\alpha$ to smooth the update. The KOVA optimizer is presented in Algorithm 1 and illustrated in Figure 2. Notice that the structure of the sample generator depends on the policy algorithm for which KOVA is used as a VF optimizer: it can contain trajectories from a fixed policy or it can be an experience replay which contains transitions from several different policies.
Algorithm complexity: For a $d$-dimensional parameter vector $\theta$, our algorithm requires $O(d^{2})$ extra space to store the covariance matrix and $O(d^{2}N)$ computations for matrix multiplications. Note that our update method does not require inverting the $d \times d$-dimensional matrix $P_{t|t-1}$ in the update process, but only requires inverting the $N \times N$-dimensional matrix $P_{\tilde{y}_t}$. Usually, $N \ll d$. The extra time and memory requirements can be tolerated for small to medium-sized networks. However, they can be considered a drawback of the algorithm for large network sizes. Fortunately, there are several options for overcoming these drawbacks: (a) The use of GPUs for matrix multiplications can accelerate the computation. (b) We can assume correlations only between blocks of parameters, for example, between parameters in the same DNN layer, and apply layer factorization. This can significantly reduce the computation and memory requirements (Puskorius & Feldkamp, 1991; Zhang et al., 2017; Wu et al., 2017). (c) We can apply the Kalman optimization method only on the last layer in large DNNs. This approach was used by Levine et al. (2017), where they optimized the last layer using linear least squares optimization methods. We emphasize, nevertheless, that our approach scales with large state and action spaces and is suitable for continuous control problems, which are considered hard domains.
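The memory saving from option (b) can be estimated with simple arithmetic. The sketch below counts stored covariance entries for a full matrix versus one block per layer; the helper `covariance_entries` and the layer sizes are hypothetical and chosen only to illustrate the scaling.

```python
def covariance_entries(layer_sizes, block_diagonal=False):
    """Number of stored covariance entries for parameters split into layers.

    layer_sizes: number of parameters in each layer (hypothetical values below).
    """
    if block_diagonal:
        return sum(n * n for n in layer_sizes)   # one square block per layer
    d = sum(layer_sizes)
    return d * d                                 # full d x d matrix

layer_sizes = [640, 4160, 65]                    # hypothetical small network
full = covariance_entries(layer_sizes)
blocks = covariance_entries(layer_sizes, block_diagonal=True)
```

The saving depends on how evenly the parameters are spread across layers: a single dominant layer keeps the block-diagonal cost close to the full cost, while many layers of similar size reduce it roughly by the number of layers.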
4 Related Work
Figure 3: PPO and TRPO are used as policy optimization algorithms. We compare the Adam and KOVA optimizers for policy evaluation. For Swimmer-v2, Hopper-v2 and HalfCheetah-v2 we trained over one million time steps and for Ant-v2 and Walker2d-v2 we trained over two million time steps. We present the average (solid lines) and standard deviation (shaded area) of the episode rewards over 8 runs, generated from random seeds.
Bayesian Neural Networks (BNNs): There are several works on Bayesian methods for placing uncertainty on the approximator parameters (Blundell et al., 2015; Gal & Ghahramani, 2016). Depeweg et al. (2016, 2017) have used BNNs for learning MDP dynamics in RL tasks. In these works a fully factorized Gaussian distribution on the parameters is assumed, while we consider possible correlations between parameters. In addition, BNNs require sampling the parameters and running several feed-forward passes, one for each parameter sample. Our incremental method avoids multiple samples of the parameters, since the uncertainty is propagated with every optimization update.
Kalman filters: Outside of the RL framework, the use of the Kalman filter as an optimization method is discussed in (Haykin et al., 2001; Vuckovic, 2018; Gomez-Uribe & Karrer, 2018). Wilson & Finkel (2009) solve the dynamics of each parameter with Kalman filtering. Wang et al. (2018) use a Kalman filter for normalizing batches. In our work we use Kalman filtering for VF optimization in the context of RL. The EKF is connected with the incremental Gauss-Newton method (Bertsekas, 1996) and with the online natural gradient (Ollivier et al., 2018). These methods require inverting the $d \times d$-dimensional Fisher information matrix (for a $d$-dimensional parameter) and thus require high computational resources. Our method avoids this inversion in the update step, which is more computationally efficient.
Trust region for policies: The natural gradient method, when applied to RL tasks, is mostly used in policy gradient algorithms to estimate the parameters of the policy (Kakade, 2002; Peters & Schaal, 2008; Schulman et al., 2015a). Trust region methods in RL have been developed for parameterized policies (Schulman et al., 2015a, 2017). In contrast, trust region methods for parameterized VFs are rarely presented in the RL literature. Recently, Wu et al. (2017) suggested applying the natural gradient method to the critic as well in the actor-critic framework, using Kronecker-factored approximations. Schulman et al. (2015b) suggested applying the Gauss-Newton method to estimate the VF. However, they did not analyze and formalize the underlying model and assumptions that lead to the regularization in the objective function, while this is the focus of our work.
Distributional perspective on values and observations: Distributional RL (Bellemare et al., 2017) treats the full (general) distribution of the total return and considers the VF parameters as deterministic. In our work we assume a Gaussian distribution over the total return and, in addition, a Gaussian distribution over the VF parameters.
Our work may be seen as a modern extension of GPTD (Engel et al., 2003, 2005) for DRL domains with continuous state and action spaces. GPTD uses Gaussian Processes (GPs) for both the VF and the total return, for solving the RL problem of value estimation. We introduce here several improvements and generalizations over their work: (1) Our formulation is adapted to learning nonlinear VF approximations, as is common in DRL; (2) We include a fading memory option for previous observations by using a decay factor in the error covariance prediction; (3) We allow for a general observation noise covariance (not necessarily diagonal) and for general noisy observations (not only 1-step TD errors); (4) Our observation vector has a fixed size (the batch size), as opposed to the observation vectors in GPTD, which grow with every new observation and make it difficult to train in DRL domains.
The use of Kalman filters to solve RL tasks was proposed by Geist & Pietquin (2010). Their formulation, called Kalman Temporal Difference (KTD), serves as the base for the formulation of the optimizer we propose. We introduce here several differences between their work and ours: (1) We reformulate the observation equation (10) to increase training stability by using a target network for the VF that appears in the target label (see Table 1). With this formulation, the observation function is simply the VF of the current input; (2) We use the Extended Kalman filter as opposed to their use of the Unscented Kalman filter to approximate nonlinear functions (Julier & Uhlmann, 1997; Wan & Van Der Merwe, 2000). In our formulation, the observation function is differentiable, allowing us to use a first-order Taylor expansion linearization. The UKF has shown superior performance in some applications (St-Pierre & Gingras, 2004; Van Der Merwe, 2004); however, its computational cost is much greater than that of the EKF, due to its requirement of sampling the parameters multiple times in each training step. Moreover, it requires evaluating the observation function at these samples at every training step. Unfortunately, this is not tractable in DNNs, where the parameters might be high-dimensional.
5 Experiments
In this section we present experiments that illustrate the performance attained by our KOVA optimizer (code can be found at https://github.com/KOVAtrustregion/KOVA; technical details on the policy and VF networks and on the hyperparameters we used are described in the supplementary material).
KOVA optimizer for policy evaluation: We tested the performance of KOVA in domains with large state and action spaces: the robotic task benchmarks implemented in OpenAI Gym (Brockman et al., 2016), which use the MuJoCo physics engine (Todorov et al., 2012). For the policy training we used PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015a) with their baselines implementations (Dhariwal et al., 2017). For VF training we replaced the originally used Adam optimizer (Kingma & Ba, 2014) with our KOVA optimizer (Algorithm 1) and compared their effect on the mean episode reward in each environment. The results are presented in Figure 3. When training with PPO, KOVA improved the agent's performance in four out of five environments. In Ant-v2 it kept approximately the same performance. When training with TRPO, KOVA improved the agent's performance mostly in Swimmer-v2 and HalfCheetah-v2. These improvements, both in PPO and in TRPO, demonstrate the importance of incorporating uncertainty estimation in value function approximation for improving the agent's performance.
Investigating the evolution and observation noises: The most interesting hyperparameters in KOVA are related to the covariances $P_{v_t}$ and $P_{n_t}$. As seen in Corollary 3.1, for a deterministic interpretation of the parameters we simply set $P_{v_t} = 0$. However, the more interesting setting is a nonzero evolution noise governed by a small number that controls the amount of fading memory (Ollivier et al., 2018). $P_{n_t}$ can be used for incorporating prior domain knowledge. For example, a diagonal matrix implies independent observations, while if observations are known to be correlated, additional non-diagonal elements can be added. We investigated the effect of different values of $P_{v_t}$ and $P_{n_t}$ in the Swimmer and HalfCheetah environments, where KOVA gained the most success. The results are depicted in Figure 4. We tested two different settings: the batch-size setting and the max-ratio setting. Interestingly, although using KOVA results in lower policy loss (which we try to maximize), it actually increases the policy entropy and encourages exploration, which we believe helps in gaining higher rewards during training. We can clearly see how the mean reward increases as the policy entropy increases across the tested values. This insight was observed in both tested MuJoCo environments and in both settings.
6 Conclusion
In this work we presented a novel regularized objective function for optimizing VFs in policy evaluation, which originates from a Bayesian perspective over both noisy observations and value parameters. Our empirical results illustrate how the KOVA optimizer can improve the performance of various RL agents in domains with large state and action spaces. For future work, it would be interesting to further investigate the connection between trust region over value parameters and trust region over policy parameters and how to use this connection to improve exploration.
References
 Amari (1998) Amari, S.-I. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Anderson & Moore (1979) Anderson, B. D. and Moore, J. B. Optimal filtering. Englewood Cliffs, 21:22–95, 1979.
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
 Bertsekas (1996) Bertsekas, D. P. Incremental least squares methods and the extended kalman filter. SIAM Journal on Optimization, 6(3):807–822, 1996.
 Blundell et al. (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 Dann et al. (2014) Dann, C., Neumann, G., and Peters, J. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.
 Depeweg et al. (2016) Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., and Udluft, S. Learning and policy search in stochastic dynamical systems with bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.
 Depeweg et al. (2017) Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., and Udluft, S. Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems. arXiv preprint arXiv:1710.07283, 2017.
 Dhariwal et al. (2017) Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. Openai baselines. https://github.com/openai/baselines, 2017.
 Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Engel et al. (2003) Engel, Y., Mannor, S., and Meir, R. Bayes meets bellman: The gaussian process approach to temporal difference learning. In Proceedings of the 20th International Conference on Machine Learning (ICML03), pp. 154–161, 2003.
 Engel et al. (2005) Engel, Y., Mannor, S., and Meir, R. Reinforcement learning with gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pp. 201–208. ACM, 2005.

 Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.
 Geist & Pietquin (2010) Geist, M. and Pietquin, O. Kalman temporal differences. Journal of Artificial Intelligence Research, 39:483–532, 2010.
 Gelb (1974) Gelb, A. Applied optimal estimation. MIT press, 1974.
 Gomez-Uribe & Karrer (2018) Gomez-Uribe, C. A. and Karrer, B. The decoupled extended kalman filter for dynamic exponential-family factorization models. arXiv preprint arXiv:1806.09976, 2018.
 Haykin et al. (2001) Haykin, S. S. et al. Kalman filtering and neural networks. Wiley Online Library, 2001.
 Julier & Uhlmann (1997) Julier, S. J. and Uhlmann, J. K. New extension of the kalman filter to nonlinear systems. In AeroSense’97, pp. 182–193. International Society for Optics and Photonics, 1997.
 Kakade (2002) Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.
 Kalman et al. (1960) Kalman, R. E. et al. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
 Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Levine et al. (2017) Levine, N., Zahavy, T., Mankowitz, D. J., Tamar, A., and Mannor, S. Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3135–3145, 2017.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Martens (2014) Martens, J. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937, 2016.
 Ollivier et al. (2018) Ollivier, Y. et al. Online natural gradient as a kalman filter. Electronic Journal of Statistics, 12(2):2930–2961, 2018.
 Peters & Schaal (2008) Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71(7–9):1180–1190, 2008.
 Puskorius & Feldkamp (1991) Puskorius, G. V. and Feldkamp, L. A. Decoupled extended kalman filter training of feedforward layered networks. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, volume 1, pp. 771–777. IEEE, 1991.
 Ruder (2016) Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
 Särkkä (2013) Särkkä, S. Bayesian filtering and smoothing, volume 3. Cambridge University Press, 2013.
 Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.
 Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 St-Pierre & Gingras (2004) St-Pierre, M. and Gingras, D. Comparison between the unscented kalman filter and the extended kalman filter for the position estimation module of an integrated navigation information system. In Intelligent Vehicles Symposium, 2004 IEEE, pp. 831–835. IEEE, 2004.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 2012.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Van Der Merwe (2004) Van Der Merwe, R. Sigma-point Kalman filters for probabilistic inference in dynamic state-space models. PhD thesis, Oregon Health & Science University, 2004.
 Vuckovic (2018) Vuckovic, J. Kalman gradient descent: Adaptive variance reduction in stochastic optimization. arXiv preprint arXiv:1810.12273, 2018.
 Wan & Van Der Merwe (2000) Wan, E. A. and Van Der Merwe, R. The unscented kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000. AS-SPCC. The IEEE 2000, pp. 153–158. IEEE, 2000.
 Wang et al. (2018) Wang, G., Peng, J., Luo, P., Wang, X., and Lin, L. Batch kalman normalization: Towards training deep neural networks with micro-batches. arXiv preprint arXiv:1802.03133, 2018.
 Wilson & Finkel (2009) Wilson, R. and Finkel, L. A neural implementation of the kalman filter. In Advances in neural information processing systems, pp. 2062–2070, 2009.
 Wu et al. (2017) Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pp. 5279–5288, 2017.
 Zeiler (2012) Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 Zhang et al. (2017) Zhang, R., Li, C., Chen, C., and Carin, L. Learning structural weight uncertainty for sequential decision-making. arXiv preprint arXiv:1801.00085, 2017.
Supplementary Material
Appendix A Theoretical Results
A.1 Extended Kalman Filter (EKF)
In this section we briefly outline the Extended Kalman filter (Anderson & Moore, 1979; Gelb, 1974). The EKF considers the following model:
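$$\theta_t = \theta_{t-1} + v_t, \qquad y(u_t) = h(u_t; \theta_t) + n_t \tag{A.1}$$
where $\theta_t$ are the parameters evaluated at time $t$, $y(u_t)$ is the $N$-dimensional observation vector at time $t$, and $h$ is a nonlinear observation function with input $u_t$ and parameters $\theta_t$.
The evolution noise $v_t$ is white ($\mathbb{E}[v_t] = 0$) with covariance $R_{v_t} \triangleq \mathbb{E}[v_t v_t^\top]$, $\mathbb{E}[v_t v_{t'}^\top] = 0$ for all $t \neq t'$.
The observation noise $n_t$ is white ($\mathbb{E}[n_t] = 0$) with covariance $R_{n_t} \triangleq \mathbb{E}[n_t n_t^\top]$, $\mathbb{E}[n_t n_{t'}^\top] = 0$ for all $t \neq t'$.
The EKF sets the estimate of the parameters at time $t$ according to the conditional expectation:
$$\hat{\theta}_{t|t} \triangleq \mathbb{E}[\theta_t \mid y_{1:t}], \qquad \hat{\theta}_{t|t-1} \triangleq \mathbb{E}[\theta_t \mid y_{1:t-1}] = \hat{\theta}_{t-1|t-1} \tag{A.2}$$
where, with some abuse of notation, $y_{1:t'}$ are the observations gathered up to time $t'$: $y(u_1), \ldots, y(u_{t'})$. The parameter errors are defined by:
$$\tilde{\theta}_{t|t} \triangleq \theta_t - \hat{\theta}_{t|t}, \qquad \tilde{\theta}_{t|t-1} \triangleq \theta_t - \hat{\theta}_{t|t-1} \tag{A.3}$$
The conditional error covariances are given by:
$$P_{t|t} \triangleq \mathbb{E}[\tilde{\theta}_{t|t} \tilde{\theta}_{t|t}^\top \mid y_{1:t}], \qquad P_{t|t-1} \triangleq \mathbb{E}[\tilde{\theta}_{t|t-1} \tilde{\theta}_{t|t-1}^\top \mid y_{1:t-1}] = P_{t-1|t-1} + R_{v_t} \tag{A.4}$$
The EKF considers several statistics of interest at each time step.
The prediction of the observation function: $\hat{y}_{t|t-1} \triangleq \mathbb{E}[h(u_t; \theta_t) \mid y_{1:t-1}]$.
The observation innovation: $\tilde{y}_{t|t-1} \triangleq y(u_t) - \hat{y}_{t|t-1}$.
The covariance between the parameter error and the innovation: $P_{\tilde{\theta}_t, \tilde{y}_t} \triangleq \mathbb{E}[\tilde{\theta}_{t|t-1} \tilde{y}_{t|t-1}^\top \mid y_{1:t-1}]$.
The covariance of the innovation: $P_{\tilde{y}_t} \triangleq \mathbb{E}[\tilde{y}_{t|t-1} \tilde{y}_{t|t-1}^\top \mid y_{1:t-1}]$.
The Kalman gain: $K_t \triangleq P_{\tilde{\theta}_t, \tilde{y}_t} P_{\tilde{y}_t}^{-1}$.
The above statistics serve for the update of the parameters and the error covariance:
$$\hat{\theta}_{t|t} = \hat{\theta}_{t|t-1} + K_t \tilde{y}_{t|t-1}, \qquad P_{t|t} = P_{t|t-1} - K_t P_{\tilde{y}_t} K_t^\top \tag{A.5}$$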
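The statistics and update above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes a random-walk evolution of the parameters (so the prediction keeps the mean and adds the evolution-noise covariance) and a first-order linearization of the observation function around the predicted parameters. The names `ekf_step`, `h`, `jac`, `R_v` and `R_n` are hypothetical and introduced here for illustration.

```python
import numpy as np


def ekf_step(theta, P, u, y, h, jac, R_v, R_n):
    """One EKF predict/update step for parameter estimation (illustrative sketch).

    theta : (d,)   previous parameter estimate
    P     : (d, d) previous error covariance
    u, y  : input and N-dimensional observation at the current time step
    h     : callable h(u, theta) -> (N,) observation function (assumed)
    jac   : callable jac(u, theta) -> (N, d) Jacobian of h w.r.t. theta (assumed)
    R_v   : (d, d) evolution-noise covariance
    R_n   : (N, N) observation-noise covariance
    """
    # Predict: under a random-walk model the mean is unchanged and the
    # error covariance grows by the evolution-noise covariance.
    theta_pred = theta
    P_pred = P + R_v

    # Linearize the observation function around the predicted parameters.
    H = jac(u, theta_pred)

    y_pred = h(u, theta_pred)      # prediction of the observation function
    innovation = y - y_pred        # observation innovation
    P_theta_y = P_pred @ H.T       # covariance between parameter error and innovation
    P_y = H @ P_pred @ H.T + R_n   # covariance of the innovation

    # Kalman gain K = P_theta_y P_y^{-1}; P_y is symmetric, so we solve a
    # linear system instead of forming the explicit inverse.
    K = np.linalg.solve(P_y, P_theta_y.T).T

    # Update of the parameters and of the error covariance.
    theta_new = theta_pred + K @ innovation
    P_new = P_pred - K @ P_y @ K.T
    return theta_new, P_new


# Usage on a linear observation function, where the EKF reduces to the
# standard Kalman filter (all values below are made up for illustration).
U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta_true = np.array([1.0, -2.0])
theta_hat, P_hat = ekf_step(
    theta=np.zeros(2),
    P=100.0 * np.eye(2),
    u=U,
    y=U @ theta_true,
    h=lambda u, th: u @ th,
    jac=lambda u, th: u,
    R_v=np.zeros((2, 2)),
    R_n=1e-4 * np.eye(3),
)
```

With a vague prior and nearly noise-free observations, as in this toy setup, a single step moves the estimate close to the least-squares solution and shrinks the error covariance. For a neural-network value function, `jac` would typically be obtained by automatic differentiation.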
A.2 EKF for Value Function Estimation
When applying the EKF formulation to value function approximation, the observation at time $t$ is the target label $y(u_t)$ (see Table 1 in the main article), and the observation function $h$ can be the state value function, the state-action value function or the advantage function.
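The EKF uses a first-order Taylor series linearization of the observation function:
$$h(u_t; \theta_t) = h(u_t; \bar{\theta}) + \nabla_\theta h(u_t; \bar{\theta})^\top (\theta_t - \bar{\theta}) \tag{A.6}$$
where $\nabla_\theta h(u_t; \bar{\theta})$ is the matrix of gradients of $h$ with respect to the parameters (one column per observation component), evaluated at $\bar{\theta}$, and $\bar{\theta}$ is typically chosen to be the previous estimate of the parameters at time $t-1$, $\hat{\theta}_{t|t-1}$. This linearization helps in computing the statistics of interest. Recall that the expectation here is over the random variable $\theta_t$, where $u_t$ is fixed. For simplicity, we keep writing $\mathbb{E}[\cdot]$ for this expectation. The prediction of the observation function: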