
Trust Region Value Optimization using Kalman Filtering

by   Shirli Di-Castro Shashua, et al.

Policy evaluation is a key process in reinforcement learning. It assesses a given policy using estimation of the corresponding value function. When using a parameterized function to approximate the value, it is common to optimize the set of parameters by minimizing the sum of squared Bellman Temporal Differences errors. However, this approach ignores certain distributional properties of both the errors and value parameters. Taking these distributions into account in the optimization process can provide useful information on the amount of confidence in value estimation. In this work we propose to optimize the value by minimizing a regularized objective function which forms a trust region over its parameters. We present a novel optimization method, the Kalman Optimization for Value Approximation (KOVA), based on the Extended Kalman Filter. KOVA minimizes the regularized objective function by adopting a Bayesian perspective over both the value parameters and noisy observed returns. This distributional property provides information on parameter uncertainty in addition to value estimates. We provide theoretical results of our approach and analyze the performance of our proposed optimizer on domains with large state and action spaces.





1 Introduction

Reinforcement learning (RL) solves sequential decision making problems by considering an agent that interacts with the environment and seeks the optimal policy (Sutton & Barto, 1998). During the learning process, the agent is required to evaluate its policies using a value function. In many real-world RL domains, such as robotics, games and autonomous driving, the state and action spaces are large, hence the value function is approximated, e.g., using a Deep Neural Network (DNN). A common approach is to optimize a set of parameters by minimizing the sum of squared Bellman Temporal Difference (TD) errors (Dann et al., 2014). There are two underlying assumptions in this approach: first, the value and its parameters are deterministic; second, the Bellman TD errors are independent Gaussian random variables (RVs) with zero mean and a fixed variance. Although this objective function is commonly used, its underlying assumptions may not be suitable for the policy evaluation task in RL. Distributional RL (Bellemare et al., 2017) addresses the second assumption and argues in favor of a full distribution perspective over the sum of discounted rewards for a fixed policy. In particular, learning this distribution is meaningful in the presence of value approximation. However, in their formulation the value parameters are still considered deterministic, and no measure of confidence in the value estimates is provided.

Treating the value or its parameters as RVs has been investigated in the RL literature. Engel et al. (2003, 2005) used Gaussian Processes (GP) for the value and the return to capture uncertainties in policy evaluation. Geist & Pietquin (2010) proposed to use the Unscented Kalman filter (UKF) to learn the uncertainty in value parameters. Their formulation requires many samples of parameters in each training step, which is not feasible in Deep Reinforcement Learning (DRL) with large state and action spaces.

Motivated by the works of Engel et al. (2003, 2005) and Geist & Pietquin (2010), we present in this work a unified framework for addressing uncertainties while approximating the value in DRL domains. Our framework incorporates the well-known Kalman filter estimation techniques with RL principles to improve value approximation. The Kalman filter (Kalman et al., 1960) and its variant for nonlinear approximations, the Extended Kalman filter (EKF) (Anderson & Moore, 1979; Gelb, 1974), are used for on-line tracking and for estimating states in dynamic environments through indirect noisy observations. These methods have been successfully applied to numerous control dynamic systems such as navigation and tracking targets (Särkkä, 2013). The Kalman filter can also be used for parameter estimation in approximation functions, where parameters replace the states of dynamic systems.

We develop a new optimization method for policy evaluation based on the EKF formulation. Figure 1 illustrates our Bayesian perspective over value parameters and noisy observed returns. Our proposed method has the following properties: (i) it forms a trust region over the value parameters, based on their uncertainty covariance; (ii) it is aimed at tracking the solution rather than converging to it; (iii) it incrementally updates the parameters and the error covariance matrix, hence it avoids sampling the parameters as is often required in Bayesian methods; (iv) it adjusts a suitable learning rate for each individual parameter through the Kalman gain, so the learning procedure does not depend on the parameterization of the value.

Our main contributions are: (1) Developing a new regularized objective function for approximating values in the policy evaluation task. The regularization term accounts for both parameter and observation uncertainties. (2) Presenting a novel optimization algorithm, Kalman Optimization for Value Approximation (KOVA), and proving that it minimizes the regularized objective function at each time step. This optimizer can be plugged into any policy optimization algorithm in place of its value function optimizer. (3) Beyond the RL context, presenting the connection between the EKF and the incremental Gauss-Newton method, the on-line natural gradient and the Kullback-Leibler (KL) divergence, and explaining how our objective function forms a trust region over the value parameters. (4) Demonstrating the improvement achieved by our optimizer on several control tasks with large state and action spaces.

Figure 1: Illustration of our proposed model: a Bayesian perspective for the policy evaluation problem in RL. The noisy observation $y(u)$ for an input $u$ (for example, $u$ is a state or a state-action pair and $y(u)$ is a sum of discounted n-step rewards from this state) is decomposed into its mean, the value $h(u;\theta)$, and a random zero-mean noise $n$. The randomness of $y(u)$ originates from two sources: (i) the random noise $n$, which relates to the stochasticity of the transitions in the trajectory and to the possibly random policy; (ii) the randomness of $h(u;\theta)$ through its dependency on the random parameters $\theta$. In the context of RL, this randomness can be related to uncertainty regarding the MDP model that generated the noisy observations.

2 Background

2.1 Reinforcement Learning and MDPs

The standard RL setting considers an interaction of an agent with an environment $\mathcal{E}$ for a discrete number of time steps. The environment is modeled as a Markov Decision Process (MDP) $\{\mathcal{S}, \mathcal{A}, P, R, \gamma\}$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P$ is the state transition probabilities for each state $s$ and action $a$, $R$ is a deterministic and bounded reward function and $\gamma$ is a discount factor. At each time step $t$, the agent observes state $s_t$ and chooses action $a_t$ according to a policy $\pi$. The agent receives an immediate reward $r_t = R(s_t, a_t)$ and the environment stochastically steps to state $s_{t+1}$ according to the probability distribution $P(s_{t+1} \mid s_t, a_t)$. The state value function and the state-action Q-function are used for evaluating the performance of a fixed policy $\pi$ (Sutton & Barto, 1998): $V^{\pi}(s) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$ and $Q^{\pi}(s,a) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$, where $\mathbb{E}^{\pi}$ denotes the expectation with respect to the state (state-action) distribution induced by transition law $P$ and policy $\pi$.
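As a small illustration of the value definition, the sketch below computes the discounted sum of rewards for one sampled trajectory and averages it over several trajectories to obtain a Monte Carlo estimate of the state value. The function names `discounted_return` and `monte_carlo_value` are hypothetical helpers introduced here, and the reward lists are assumed to come from rollouts of a fixed policy that all start in the same state.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory (rewards listed from t = 0)."""
    g = 0.0
    # Accumulate backwards so each step is g_t = r_t + gamma * g_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(reward_trajectories, gamma):
    """Average discounted return over trajectories started from the same state."""
    returns = [discounted_return(rs, gamma) for rs in reward_trajectories]
    return sum(returns) / len(returns)
```

Each such return is one noisy observation of the value, which is the quantity the rest of the paper models.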

2.2 Value Function Estimation

Policy evaluation, or value estimation, is a core element in RL algorithms. We will use the term value function (VF) to address the following functions: the state value function $V^{\pi}(s)$, the state-action Q-function $Q^{\pi}(s,a)$ and the advantage function $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$. When the state or action space is large, a common approach is to approximate the VF using a parameterized function, $h(\cdot;\theta)$. We focus on general, possibly non-linear approximation functions such as DNNs, which can effectively learn complex approximations.

Algorithm type     Example                                 Bellman TD error type
Actor-critic       A3C (Mnih et al., 2016)                 -step V-evaluation
Actor-critic       DDPG (Lillicrap et al., 2015)           -step Q-evaluation
Policy gradient    PPO (Schulman et al., 2017),            GAE (Schulman et al., 2015b)
                   TRPO (Schulman et al., 2015a)
$\epsilon$-greedy  DQN (Mnih et al., 2013)                 Optimality equation

Table 1: Different examples for policy optimization algorithms and their Bellman TD error type. The decomposition of $\delta(u_t;\theta_t)$ into the observation function $h(u_t;\theta_t)$ and the target label $y(u_t)$ in the EKF model (2) enables the integration of our KOVA optimizer with any policy optimization algorithm. $\theta'$ refers to the previous network or to a target network, different from the one being trained, $\theta_t$.

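A common approach for optimizing the VF parameters is to minimize at each time step $t$ the empirical mean of the squared Bellman TD error $\delta(u;\theta) = y(u) - h(u;\theta)$, over a batch of $N$ samples generated from the environment under a given policy:

$$ L_t^{MLE}(\theta_t) = \frac{1}{N}\sum_{i=1}^{N} \delta^2(u_t^i;\theta_t) = \frac{1}{N}\sum_{i=1}^{N} \left(y(u_t^i) - h(u_t^i;\theta_t)\right)^2. \tag{1} $$
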
We use the general notation $u_t$ to specify the input for the target label $y(u_t)$ and for the approximated value at time $t$, $h(u_t;\theta_t)$. For example, for the state value function, $u_t = s_t$ is the state at a discrete time $t$; for the Q-function, $u_t = (s_t, a_t)$ is the state-action pair. In Table 1 we provide examples of several options for $y(u_t)$ and $h(u_t;\theta_t)$ which clarify how this general notation can be utilized in known policy optimization algorithms.

Traditionally, the VF is trained by stochastic gradient descent methods, estimating the loss on each experience as it is encountered, yielding the update $\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}_{u \sim \mathcal{D}}\left[\delta(u;\theta_t) \nabla_{\theta} h(u;\theta_t)\right]$, where $\alpha$ is the learning rate and $\mathcal{D}$ is the experience distribution. Typically, the training procedure seeks a point estimate of the model parameters. We will show (Section 3) that the underlying assumption behind (1) is that the parameters are deterministic and that the target labels are independent Gaussian RVs with mean $h(u;\theta)$ and a fixed variance. In Section 2.3 we present the EKF approach, which generalizes the process of generating observations and adds flexibility to the model assumptions: the parameters may be viewed as RVs and the variance of the target label may change between observations.
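The squared TD error objective and its gradient step can be sketched for the simplest case of a linear value function. This is a minimal illustration, not the paper's implementation: the names `td_loss_and_grad` and `sgd_step` are hypothetical, the value is assumed to be `features @ theta`, and the targets are treated as fixed labels (as with a target network).

```python
import numpy as np


def td_loss_and_grad(theta, features, targets):
    """Mean squared TD error and its gradient for a linear value function.

    features: (N, d) array, one row per input u^i (assumed feature map).
    targets:  (N,) array of target labels y(u^i), treated as constants.
    """
    delta = targets - features @ theta            # Bellman TD errors
    loss = np.mean(delta ** 2)
    grad = -2.0 * features.T @ delta / len(targets)
    return loss, grad


def sgd_step(theta, features, targets, lr):
    """One plain gradient step on the squared TD error objective."""
    loss, grad = td_loss_and_grad(theta, features, targets)
    return theta - lr * grad, loss
```

Note that every parameter shares the single step size `lr`, and nothing in the update tracks how uncertain each parameter is.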

2.3 Extended Kalman Filter (EKF)

In this section we briefly outline the Extended Kalman filter (Anderson & Moore, 1979; Gelb, 1974). The EKF is a standard technique for estimating the state of a nonlinear dynamic system or for learning the parameters of a nonlinear approximation function. In this paper we focus on the latter role, namely estimating $\theta$. The EKF considers the following model:

$$ \theta_t = \theta_{t-1} + v_t, \qquad y(u_t) = h(u_t;\theta_t) + n_t, \tag{2} $$

where $\theta_t \in \mathbb{R}^d$ are the parameters evaluated at time $t$, $y(u_t)$ is the $N$-dimensional observations vector at time $t$,

$$ y(u_t) = \left[y(u_t^1), \ldots, y(u_t^N)\right]^{\top}, \tag{3} $$

and $h(u_t;\theta_t)$ is an $N$-dimensional vector, where $h(u;\theta)$ is a nonlinear observation function with input $u$ and parameters $\theta$:

$$ h(u_t;\theta_t) = \left[h(u_t^1;\theta_t), \ldots, h(u_t^N;\theta_t)\right]^{\top}. \tag{4} $$

$v_t$ is the evolution noise and $n_t$ is the observation noise, both modeled as additive white noises with covariances $P_v$ and $P_n$, respectively. As seen in the model presented in Equation (2), the EKF treats the parameters as RVs, similarly to Bayesian approaches. According to this perspective, the parameters belong to an uncertainty set governed by the mean and covariance of the parameters distribution.

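The estimation at time $t$, denoted as $\hat{\theta}_{t|\cdot}$, is the conditional expectation of the parameters with respect to the observed data. The EKF formulation distinguishes between estimates that are based on observations up to time $t$, $\hat{\theta}_{t|t} = \mathbb{E}[\theta_t \mid y_{1:t}]$, and observations up to time $t-1$, $\hat{\theta}_{t|t-1} = \mathbb{E}[\theta_t \mid y_{1:t-1}]$. With some abuse of notation, $y_{1:t'}$ are the observations gathered up to time $t'$: $y_{1:t'} = \{y(u_1), \ldots, y(u_{t'})\}$. The parameters errors are defined by $\tilde{\theta}_{t|t} = \theta_t - \hat{\theta}_{t|t}$ and $\tilde{\theta}_{t|t-1} = \theta_t - \hat{\theta}_{t|t-1}$. The conditional error covariances are given by $P_{t|t} = \mathbb{E}[\tilde{\theta}_{t|t}\tilde{\theta}_{t|t}^{\top} \mid y_{1:t}]$ and $P_{t|t-1} = \mathbb{E}[\tilde{\theta}_{t|t-1}\tilde{\theta}_{t|t-1}^{\top} \mid y_{1:t-1}]$.

The EKF considers several statistics of interest at each time step. The prediction of the observation function, the observation innovation, the covariance between the parameters error and the innovation, the covariance of the innovation and the Kalman gain are defined respectively in Equations (5) - (9):

$$ \hat{y}_{t|t-1} = \mathbb{E}[h(u_t;\theta_t) \mid y_{1:t-1}], \tag{5} $$
$$ \tilde{y}_{t|t-1} = y(u_t) - \hat{y}_{t|t-1}, \tag{6} $$
$$ P_{\tilde{\theta}_t,\tilde{y}_t} = \mathbb{E}[\tilde{\theta}_{t|t-1}\tilde{y}_{t|t-1}^{\top} \mid y_{1:t-1}], \tag{7} $$
$$ P_{\tilde{y}_t} = \mathbb{E}[\tilde{y}_{t|t-1}\tilde{y}_{t|t-1}^{\top} \mid y_{1:t-1}], \tag{8} $$
$$ K_t = P_{\tilde{\theta}_t,\tilde{y}_t} P_{\tilde{y}_t}^{-1}. \tag{9} $$

The above statistics serve for the EKF updates:

$$ \hat{\theta}_{t|t} = \hat{\theta}_{t|t-1} + K_t \tilde{y}_{t|t-1}, \qquad P_{t|t} = P_{t|t-1} - K_t P_{\tilde{y}_t} K_t^{\top}. \tag{10} $$
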
In the next section we present how to use the EKF formulation in order to approximate VFs which consider uncertainty both in the parameters and in the noisy observations.

3 EKF for Value Function Approximation

We now derive a novel regularized objective function and argue in its favor for optimizing value functions in RL. We use general notation in order to enable integration of our proposed VF optimization method with any policy optimization algorithm. The main idea is to decompose the Bellman TD error vector $\delta(u_t;\theta_t)$ into two parts: $\delta(u_t;\theta_t) = y(u_t) - h(u_t;\theta_t)$. (i) The observation at time $t$, $y(u_t)$, is a vector that contains $N$ target labels $y(u_t^i)$, $i = 1, \ldots, N$. (ii) The observation function $h(u;\theta)$ may be one of the following: the state value function $V(s;\theta)$, the state-action Q-function $Q(s,a;\theta)$ or the advantage function $A(s,a;\theta)$.

The observation functions for the $N$ inputs are concatenated into the $N$-dimensional vector $h(u_t;\theta_t)$, as presented in Equation (4). In Table 1 we provide several examples for the Bellman TD error decomposition according to the chosen policy optimization algorithm.

Our goal is to estimate the parameters $\theta$. One way is to learn them by maximum likelihood estimation (MLE) using stochastic gradient descent methods: $\hat{\theta}^{MLE} = \arg\max_{\theta} \log p(y_{1:t} \mid \theta)$. This forms the objective function in Equation (1). Another way is to learn them by a Bayesian approach, which uses Bayes' rule and adds prior knowledge over the parameters to calculate the maximum a-posteriori (MAP) estimator: $\hat{\theta}^{MAP} = \arg\max_{\theta} \log p(\theta \mid y_{1:t})$. Given the observations gathered up to time $t$, we can re-write the MAP estimator:

$$ \hat{\theta}_t^{MAP} = \arg\max_{\theta_t} p(\theta_t \mid y_{1:t}) = \arg\max_{\theta_t} p(y(u_t) \mid \theta_t)\, p(\theta_t \mid y_{1:t-1}). \tag{11} $$

Here, instead of using the parameters prior, we use an equivalent derivation for the parameters posterior conditioned on $y_{1:t}$, based on the likelihood of a single observation $p(y(u_t) \mid \theta_t)$ and the posterior conditioned on $y_{1:t-1}$ (Van Der Merwe, 2004). This derivation is a key step for making the incremental Kalman updates and for defining the objective function in Equation (12). In order to define the likelihood $p(y(u_t) \mid \theta_t)$ and the posterior $p(\theta_t \mid y_{1:t-1})$, we adopt the EKF model (2) and make the following assumptions:

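Assumption 3.1.

The likelihood $p(y(u_t) \mid \theta_t)$ is assumed to be Gaussian: $y(u_t) \mid \theta_t \sim \mathcal{N}\left(h(u_t;\theta_t), P_n\right)$.

Assumption 3.2.

The posterior distribution $p(\theta_t \mid y_{1:t-1})$ is assumed to be Gaussian: $\theta_t \mid y_{1:t-1} \sim \mathcal{N}\left(\hat{\theta}_{t|t-1}, P_{t|t-1}\right)$.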

These assumptions are common when using the EKF. In the context of RL, these assumptions add the flexibility we want: the value is treated as an RV and information is gathered on the uncertainty of its estimate. In addition, the noisy observations (the target labels) can have different variances and can even be correlated. Based on these Gaussian assumptions, we can derive the following theorem:

Theorem 3.1.

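Under Assumptions 3.1 and 3.2, $\hat{\theta}_{t|t}$ (10) minimizes at each time step $t$ the following regularized objective function:

$$ L_t^{EKF}(\theta_t) = \frac{1}{2}\,\delta(u_t;\theta_t)^{\top} P_n^{-1}\, \delta(u_t;\theta_t) + \frac{1}{2}\left(\theta_t - \hat{\theta}_{t|t-1}\right)^{\top} P_{t|t-1}^{-1} \left(\theta_t - \hat{\theta}_{t|t-1}\right), \tag{12} $$

where $\delta(u_t;\theta_t) = y(u_t) - h(u_t;\theta_t)$.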

The proof for Theorem 3.1 appears in the supplementary material. It is based on solving the maximization problem in Equation (11) using the EKF model (2) and the Gaussian Assumptions 3.1 and 3.2.
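As a reading aid, the step from the MAP problem to a regularized least-squares objective can be sketched as follows. The sketch assumes the Gaussian forms $y(u_t) \mid \theta_t \sim \mathcal{N}(h(u_t;\theta_t), P_n)$ and $\theta_t \mid y_{1:t-1} \sim \mathcal{N}(\hat{\theta}_{t|t-1}, P_{t|t-1})$ for the two assumptions (the symbols are our notation), and drops additive constants that do not depend on $\theta_t$. It is not a substitute for the proof in the supplementary material.

```latex
\begin{aligned}
\hat{\theta}_t^{MAP}
 &= \arg\max_{\theta_t}\; p(y(u_t) \mid \theta_t)\, p(\theta_t \mid y_{1:t-1}) \\
 &= \arg\min_{\theta_t}\; \big[ -\log p(y(u_t) \mid \theta_t) - \log p(\theta_t \mid y_{1:t-1}) \big] \\
 &= \arg\min_{\theta_t}\; \tfrac{1}{2}\big(y(u_t) - h(u_t;\theta_t)\big)^{\top} P_n^{-1} \big(y(u_t) - h(u_t;\theta_t)\big)
   + \tfrac{1}{2}\big(\theta_t - \hat{\theta}_{t|t-1}\big)^{\top} P_{t|t-1}^{-1} \big(\theta_t - \hat{\theta}_{t|t-1}\big).
\end{aligned}
```

The first term is a covariance-weighted squared TD error and the second penalizes moving away from the previous estimate. Linearizing $h$ around $\hat{\theta}_{t|t-1}$ makes the whole expression quadratic in $\theta_t$, and its minimizer then has a closed form.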

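We now explicitly write the expressions for the statistics of interest in Equations (5) - (9) (see the supplementary material for more detailed derivations). The derivations are based on the first order Taylor series linearization for the observation function: $h(u_t;\theta_t) \approx h(u_t;\bar{\theta}) + \nabla_{\theta} h(u_t;\bar{\theta})^{\top} (\theta_t - \bar{\theta})$, where

$$ \nabla_{\theta} h(u_t;\bar{\theta}) = \left[\nabla_{\theta} h(u_t^1;\bar{\theta}), \ldots, \nabla_{\theta} h(u_t^N;\bar{\theta})\right] \in \mathbb{R}^{d \times N}, \tag{13} $$

and $\bar{\theta}$ is typically chosen to be the previous estimation of the parameters at time $t-1$, $\hat{\theta}_{t|t-1}$. The prediction of the observation function is $\hat{y}_{t|t-1} = h(u_t;\hat{\theta}_{t|t-1})$, the covariance between the parameters error and the innovation is $P_{\tilde{\theta}_t,\tilde{y}_t} = P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1})$ and the covariance of the innovation is:

$$ P_{\tilde{y}_t} = \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1})^{\top} P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1}) + P_n. \tag{14} $$

The Kalman gain then becomes:

$$ K_t = P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1}) \left(\nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1})^{\top} P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1}) + P_n\right)^{-1}. \tag{15} $$

This Kalman gain is used in the parameters update and the error covariance update in Equation (10).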
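One linearized EKF step for parameter estimation can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: `ekf_step`, `h` and `jac` are hypothetical names, `h(theta)` is assumed to return the N predicted values for the current batch, `jac(theta)` is assumed to return the d x N matrix of their gradients, and the prediction step assumes the random-walk model with additive evolution covariance `P_v`.

```python
import numpy as np


def ekf_step(theta, P, y, h, jac, P_v, P_n):
    """One EKF parameter update from a batch of N noisy targets y.

    theta: (d,) previous estimate; P: (d, d) previous error covariance.
    h(theta) -> (N,) predictions; jac(theta) -> (d, N) gradient matrix.
    P_v: (d, d) evolution noise covariance; P_n: (N, N) observation noise covariance.
    """
    # Prediction: the parameters follow a random walk, so only P grows.
    theta_pred = theta
    P_pred = P + P_v

    # Linearize the observation function around the predicted parameters.
    G = jac(theta_pred)
    innovation = y - h(theta_pred)

    # Innovation covariance (N x N) and Kalman gain (d x N).
    P_y = G.T @ P_pred @ G + P_n
    K = np.linalg.solve(P_y, G.T @ P_pred).T   # equals P_pred @ G @ inv(P_y)

    # Update the estimate and its error covariance.
    theta_new = theta_pred + K @ innovation
    P_new = P_pred - K @ P_y @ K.T
    return theta_new, P_new
```

Only the N x N innovation covariance is solved against, never a d x d matrix. For a linear `h` this step coincides with the exact minimizer of the covariance-weighted, regularized least-squares problem, which is a convenient sanity check.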

3.1 Comparing between $L^{EKF}$ and $L^{MLE}$ for Optimizing Value Functions

We argue in favor of using the regularized objective function (12) for optimizing VFs instead of the commonly used objective function (1). Corollary 3.1 helps us compare the two objective functions:

Corollary 3.1.

Under Assumptions 3.1 and 3.2, consider a diagonal covariance with diagonal elements and assume , then: .

The proof is given in the supplementary material. According to Corollary 3.1, the two objective functions are the same if we consider the parameters as deterministic and if we assume that the noisy target labels have a fixed variance.

So what are the differences between the two objective functions? First, $L^{EKF}$ is a regularized version of $L^{MLE}$: the regularization causes the parameters to track the most recent parameters estimate, $\hat{\theta}_{t|t-1}$, stabilizing the estimation process. The error between the successive estimates is weighted with the inverse of the uncertainty information, $P_{t|t-1}^{-1}$. $L^{MLE}$ does not include a regularization term, meaning it does not account for parameter uncertainty. Note that adding a standard regularization term to $L^{MLE}$, as is common in DNNs, reflects staying close to the zero vector, which is not always desired.

Second, $L^{EKF}$ weights the squared Bellman TD error vector with $P_n^{-1}$, which can be interpreted as an additional regularization technique. $P_n$ can be viewed as the amount of confidence we have in the observations, as defined in the EKF model (2): if the observations are noisy, we should consider larger values for the diagonal elements in the covariance $P_n$. In addition, $P_n$ allows us to model correlations between observation errors, unlike the i.i.d. assumption in $L^{MLE}$. In Section 5 we discuss possible options for $P_n$.

Looking at the parameters update in Equation (10) and the definition of the Kalman gain in Equation (15), we can see that the Kalman gain propagates the new information from the noisy target labels back down into the parameters uncertainty set $P_{t|t-1}$, before combining it with the estimated parameter value. In fact, $K_t$ can be interpreted as an adaptive learning rate for each individual parameter that implicitly incorporates the uncertainty of each parameter. This approach resembles familiar stochastic gradient optimization methods such as Adagrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), for different choices of $P_{t|t-1}$ and $P_n$. For an overview of these methods, we refer the reader to Ruder (2016).
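The per-parameter learning rate interpretation can be seen in the simplest case of a single scalar observation. The sketch below is an illustration under assumptions: `kalman_gain_single` is a hypothetical helper, `g` is the gradient of the value with respect to the parameters, and `r` is the observation noise variance.

```python
import numpy as np


def kalman_gain_single(P, g, r):
    """Kalman gain for one scalar observation.

    P: (d, d) parameter error covariance, g: (d,) gradient, r: noise variance.
    Returns a (d,) vector: the step each parameter takes per unit of TD error.
    """
    return P @ g / (g @ P @ g + r)


# Two parameters with identical gradients but different uncertainty.
P = np.diag([1.0, 0.01])
g = np.array([1.0, 1.0])
K = kalman_gain_single(P, g, r=1.0)
```

With a diagonal covariance, the parameter whose variance is 100 times larger receives a step that is 100 times larger for the same TD error, and increasing `r` shrinks all steps together.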

When looking at $L^{EKF}$ the reader may ask what $\hat{\theta}_{t|t-1}$ and $P_{t|t-1}$ stand for. When we are estimating the VF parameters for a fixed policy $\pi$, our objective function imposes a trust region in each iteration of a batch optimization procedure. The trust region helps us avoid over-fitting to the most recent batch of data. In this case $\hat{\theta}^{\pi}_{t|t-1}$ is the last evaluation of the VF parameters for the same fixed policy $\pi$, and $P^{\pi}_{t|t-1}$ is the conditional error covariance between the new parameters estimation and the previous one, again for the same fixed policy (we added the superscript $\pi$ to emphasize that the VF parameters correspond to evaluating the same policy). When we change policies and start to evaluate the VF parameters of a new policy $\pi'$, we re-initialize the parameters estimate and the error covariance, meaning we start a new estimation procedure at $t = 0$ for the new policy.

3.2 Connection between EKF, Natural Gradient and the Gauss-Newton Method

The EKF may be viewed as an on-line natural gradient algorithm (Amari, 1998) that uses the Fisher information matrix (Ollivier et al., 2018). In this setting, the error covariance matrix $P_{t|t}$ corresponds, up to a time-dependent scaling, to the inverse of the Fisher matrix. This insight suggests that the regularization term in $L^{EKF}$ is actually a second order approximation of the KL-divergence between the previous parameter estimate and the current one. Combining these insights, we conclude that our proposed method can be viewed as a natural gradient algorithm for VF approximation. Similarly, the EKF may be viewed as an incremental version of the Gauss-Newton method, which is a common iterative method for solving least squares problems (Bertsekas, 1996). When updating the parameters, the Gauss-Newton method uses the matrix $J^{\top} J$, where $J$ is the Jacobian of $h$ with respect to the parameters. When the observations are assumed to be Gaussian (as we assume in Assumption 3.1), $J^{\top} J$ is equivalent to the Fisher information matrix.
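The link to Gauss-Newton can be checked numerically by writing the Kalman step in its information (inverse covariance) form. The sketch below is an illustration, not the authors' code; the function names are hypothetical, `G` is assumed to be the d x N gradient matrix (the transpose of the Jacobian $J$), and `delta` is the TD error vector.

```python
import numpy as np


def information_form_step(G, delta, P_inv, R_inv):
    """Parameter step (P^{-1} + G R^{-1} G^T)^{-1} G R^{-1} delta.

    G R^{-1} G^T is the Gauss-Newton matrix, which is also the Fisher
    information of a Gaussian observation model with covariance R.
    """
    F = G @ R_inv @ G.T
    return np.linalg.solve(P_inv + F, G @ R_inv @ delta)


def covariance_form_step(G, delta, P, R):
    """Same step written with the Kalman gain: P G (G^T P G + R)^{-1} delta."""
    S = G.T @ P @ G + R
    return P @ G @ np.linalg.solve(S, delta)
```

The two forms agree by the matrix inversion lemma. Setting `P_inv` to zero (no prior information) and `R_inv` to the identity reduces the first form to the plain Gauss-Newton step, so the prior precision acts as the regularizer that keeps the step inside a trust region.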

The following Theorem formalizes the connection between EKF and two separate KL divergences:

Theorem 3.2.

Assume the inputs are drawn independently from a training distribution with density function , and assume the corresponding observations are drawn from a conditional training distribution with density function . Let

be the joint distribution whose density is

, and let be the learned distribution, whose density is . Under Assumptions 3.1 and 3.2, consider a diagonal covariance with diagonal elements then:

where .

Theorem 3.2 illustrates how the EKF is aimed at minimizing two separate KL-divergences. The first is the KL divergence between two conditional distributions, the conditional training distribution and the conditional learned distribution. This term is equivalent to the loss $L^{MLE}$ in (1). The second is the KL divergence between two different parameterizations of the joint learned distribution. This is the term which imposes a trust region on the VF parameters in (12). The proof for Theorem 3.2 appears in the supplementary material.

3.3 Practical algorithm: KOVA optimizer

Figure 2: KOVA optimizer block diagram. KOVA receives as input the initial general prior $P_0$ and the covariances $P_v$ and $P_n$. It initializes $\hat{\theta}_{0|0}$ with small random values or with the VF parameters of the previous policy (see the discussion in Section 3.1). For every $t$, it samples $N$ target labels from $\mathcal{D}$ (see Table 1 for target label examples), constructs $y(u_t)$ (3) and $h(u_t;\hat{\theta}_{t|t-1})$ (4), and computes $\nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1})$ (13) and $P_{\tilde{y}_t}$, $K_t$ (14)-(15). Then it updates and outputs the MAP parameters estimator $\hat{\theta}_{t|t}$ and the error covariance matrix $P_{t|t}$ according to Equation (10).
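Algorithm 1 KOVA Optimizer
Input: learning rate $\alpha$, $P_0$, $P_v$, $P_n$, samples generator $\mathcal{D}$. Initialize: $\hat{\theta}_{0|0}$, $P_{0|0} = P_0$.
1:  for $t = 1, \ldots, T$ do
2:     Set predictions: $\hat{\theta}_{t|t-1} = \hat{\theta}_{t-1|t-1}$, $P_{t|t-1} = P_{t-1|t-1} + P_v$.
3:     Sample $N$ tuples from $\mathcal{D}$.
4:     Construct $N$-dim vectors $y(u_t)$ (3) and $h(u_t;\hat{\theta}_{t|t-1})$ (4).
5:     Compute $d \times N$-dim matrix $\nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1})$ (13).
6:     $P_{\tilde{y}_t} = \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1})^{\top} P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1}) + P_n$.
7:     $K_t = P_{t|t-1} \nabla_{\theta} h(u_t;\hat{\theta}_{t|t-1}) P_{\tilde{y}_t}^{-1}$.
8:     Set updates: $\hat{\theta}_{t|t}$ and $P_{t|t}$ according to Equation (10), smoothed by the learning rate $\alpha$.
9:  end for
Output: $\hat{\theta}_{t|t}$ and $P_{t|t}$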

We now derive a practical algorithm for approximating VFs by minimizing the objective function (12). In practice we use the update Equations (10) and the Kalman gain in Equations (14)-(15) in order to avoid inverting $P_{t|t-1}$. In addition, we add a fixed learning rate $\alpha$ to smooth the update. The KOVA optimizer is presented in Algorithm 1 and illustrated in Figure 2. Notice that $\mathcal{D}$ is a samples generator whose structure depends on the policy algorithm for which KOVA is used as a VF optimizer. $\mathcal{D}$ can contain trajectories from a fixed policy or it can be an experience replay which contains transitions from several different policies.
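The optimizer loop described above can be sketched as follows. This is a hedged illustration rather than the released implementation: `kova_fit`, `sample_batch`, `h` and `jac` are hypothetical names, the prediction step assumes an additive evolution covariance, and applying the learning rate `alpha` to both the parameter and the covariance updates is our assumption about how the smoothing enters.

```python
import numpy as np


def kova_fit(sample_batch, h, jac, theta0, P0, P_v, P_n, alpha=1.0, steps=100):
    """Kalman-style value function optimizer loop (illustrative sketch).

    sample_batch() -> (u, y): N inputs and their N target labels.
    h(u, theta) -> (N,) values; jac(u, theta) -> (d, N) gradient matrix.
    """
    theta = np.array(theta0, dtype=float)
    P = np.array(P0, dtype=float)
    for _ in range(steps):
        # Prediction: parameters unchanged, uncertainty inflated by P_v.
        P = P + P_v

        # Sample a batch and linearize around the current estimate.
        u, y = sample_batch()
        G = jac(u, theta)
        innovation = y - h(u, theta)

        # Innovation covariance (N x N) and Kalman gain (d x N).
        P_y = G.T @ P @ G + P_n
        K = np.linalg.solve(P_y, G.T @ P).T

        # Smoothed updates (alpha placement is an assumption, see lead-in).
        theta = theta + alpha * (K @ innovation)
        P = P - alpha * (K @ P_y @ K.T)
        P = 0.5 * (P + P.T)   # keep P numerically symmetric
    return theta, P
```

For `alpha` in [0, 1] the covariance update is a convex combination of the predicted and the fully updated covariance, so it stays positive semi-definite. A quick way to exercise the loop is a linear value function with features `[1, u]` and noiseless targets, where the estimate should approach the generating weights within a few dozen batches.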

Algorithm complexity: For a $d$-dimensional parameter vector $\theta$, our algorithm requires $O(d^2)$ extra space to store the covariance matrix and $O(d^2 N)$ computations for matrix multiplications. Note that our update method does not require inverting the $d \times d$-dimensional matrix $P_{t|t-1}$ in the update process, but only requires inverting the $N \times N$-dimensional matrix $P_{\tilde{y}_t}$. Usually, $N \ll d$. The extra time and memory requirements can be tolerated for small and medium-sized networks. However, they can be considered a drawback of the algorithm for large network sizes. Fortunately, there are several options for overcoming these drawbacks: (a) The use of a GPU for matrix multiplications can reduce the computation time. (b) We can assume correlations only between blocks of parameters, for example, between parameters in the same DNN layer, and apply layer factorization. This can significantly reduce the computation and memory requirements (Puskorius & Feldkamp, 1991; Zhang et al., 2017; Wu et al., 2017). (c) We can apply the Kalman optimization method only on the last layer in large DNNs. This approach was used by Levine et al. (2017), where they optimized the last layer using linear least squares optimization methods. We emphasize that our approach nevertheless scales to large state and action spaces, and is suitable for continuous control problems, which are considered hard domains.
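To give a feel for the memory trade-off between a full covariance matrix and a layer-wise block-diagonal one, the sketch below counts the stored entries in both cases. The helper name `covariance_memory_bytes` and the per-layer parameter counts are hypothetical, and 4 bytes per value assumes 32-bit floats.

```python
def covariance_memory_bytes(layer_sizes, bytes_per_value=4):
    """Memory for a full d x d covariance vs. one block per layer.

    layer_sizes: number of parameters in each layer (assumed given).
    Returns (full_bytes, block_diagonal_bytes).
    """
    d = sum(layer_sizes)
    full = d * d * bytes_per_value
    block = sum(n * n for n in layer_sizes) * bytes_per_value
    return full, block


# Example: a small network with three layers of parameters.
full_bytes, block_bytes = covariance_memory_bytes([4_096, 4_096, 64])
```

The block-diagonal form drops all cross-layer correlations, so the saving grows with the number of layers, at the price of a coarser uncertainty model.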

4 Related Work

Figure 3: Mean episode reward during training for Mujoco environments. (a) PPO or (b) TRPO are used as policy optimization algorithms. We compare the Adam and KOVA optimizers for policy evaluation. For Swimmer-v2, Hopper-v2 and HalfCheetah-v2 we trained over one million time steps and for Ant-v2 and Walker2d-v2 we trained over two million time steps. We present the average (solid lines) and standard deviation (shaded area) of the episode rewards over 8 runs, generated from random seeds.

Bayesian Neural Networks (BNNs): There are several works on Bayesian methods for placing uncertainty on the approximator parameters (Blundell et al., 2015; Gal & Ghahramani, 2016). Depeweg et al. (2016, 2017) have used BNNs for learning MDP dynamics in RL tasks. In these works a fully factorized Gaussian distribution on parameters is assumed while we consider possible correlations between parameters. In addition, BNNs require sampling the parameters, and running several feed-forward runs for each of the parameters samples. Our incremental method avoids multiple samples of the parameters, since the uncertainty is propagated with every optimization update.

Kalman filters: Outside of the RL framework, the use of the Kalman filter as an optimization method is discussed in (Haykin et al., 2001; Vuckovic, 2018; Gomez-Uribe & Karrer, 2018). Wilson & Finkel (2009) solve the dynamics of each parameter with Kalman filtering. Wang et al. (2018) use a Kalman filter for normalizing batches. In our work we use Kalman filtering for VF optimization in the context of RL. The EKF is connected with the incremental Gauss-Newton method (Bertsekas, 1996) and with the on-line natural gradient (Ollivier et al., 2018). These methods require inverting the $d \times d$-dimensional Fisher information matrix (for a $d$-dimensional parameter), and thus require high computational resources. Our method avoids this inversion in the update step, which is more computationally efficient.

Trust region for policies: The natural gradient method, when applied to RL tasks, is mostly used in policy gradient algorithms to estimate the parameters of the policy (Kakade, 2002; Peters & Schaal, 2008; Schulman et al., 2015a). Trust region methods in RL have been developed for parameterized policies (Schulman et al., 2015a, 2017). Despite that, trust region methods for parametrized VFs are rarely presented in the RL literature. Recently, Wu et al. (2017) suggested to apply the natural gradient method also on the critic in the actor-critic framework, using Kronecker-factored approximations. Schulman et al. (2015b) suggested to apply Gauss-Newton method to estimate the VF. However, they did not analyze and formalize the underlying model and assumptions that lead to the regularization in the objective function, while this is the focus in our work.

Figure 4: Mean episode reward, policy entropy and the policy loss for a PPO agent in the Mujoco environments Swimmer-v2 and HalfCheetah-v2. We compare between optimizing the VF with Adam vs. our KOVA optimizer. For KOVA, we present three different values for and two different values for the diagonal elements in : (a) max-ratio and (b) batch-size. We present the average (solid lines) and standard deviation (shaded area) of the episodes rewards over 8 runnings, generated from random seeds.

Distributional perspective on values and observations: Distributional RL (Bellemare et al., 2017) treats the full (general) distribution of total return, and considers VF parameters as deterministic. In our work we assume Gaussian distribution over the total return and in addition Gaussian distribution over the VF parameters.

Our work may be seen as a modern extension of GPTD (Engel et al., 2003, 2005) for DRL domains with continuous state and action spaces. GPTD uses Gaussian Processes (GPs) for both VF and total return, for solving the RL problem of value estimation. We introduce here several improvements and generalizations over their work: (1) Our formulation is adapted to learning nonlinear VF approximations, as common in DRL; (2) We include a fading memory option for previous observations by using a decay factor in the error covariance prediction (); (3) We allow for a general observation noise covariance (not necessarily diagonal) and for a general noisy observations (not only 1-step TD errors); (4) Our observation vector has a fixed size (the batch size) as opposed to the growing size vectors in GPTD which grow for any new observation and make it difficult to train in DRL domains.

The use of Kalman filters to solve RL tasks was proposed by Geist & Pietquin (2010). Their formulation, called Kalman Temporal Difference (KTD), serves as the base for the optimizer we propose. We introduce here several differences between their work and ours: (1) We re-formulate the observation equation (10) to increase training stability by using a target network for the VF that appears in the target label (see Table 1). With this formulation, the observation function is simply the VF of the current input; (2) We use the Extended Kalman filter as opposed to their use of the Unscented Kalman filter to approximate nonlinear functions (Julier & Uhlmann, 1997; Wan & Van Der Merwe, 2000). In our formulation, the observation function is differentiable, allowing us to use a first order Taylor expansion linearization. The UKF has shown superior performance in some applications (St-Pierre & Gingras, 2004; Van Der Merwe, 2004); however, its computational cost is much greater than that of the EKF, due to its requirement of sampling the parameters multiple times in each training step. Moreover, it requires evaluating the observation function at these samples at every training step. Unfortunately, this is not tractable in DNNs, where the parameters might be high-dimensional.

5 Experiments

In this section we present experiments that illustrate the performance attained by our KOVA optimizer. Technical details on the policy and VF networks and on the hyper-parameters we used are described in the supplementary material.

KOVA optimizer for policy evaluation: We tested the performance of KOVA in domains with large state and action spaces: the robotic task benchmarks implemented in OpenAI Gym (Brockman et al., 2016), which use the MuJoCo physics engine (Todorov et al., 2012). For the policy training we used PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015a) and used their Baselines implementations (Dhariwal et al., 2017). For VF training we replaced the originally used Adam optimizer (Kingma & Ba, 2014) with our KOVA optimizer (Algorithm 1) and compared their effect on the mean episode reward in each environment. The results are presented in Figure 3. When training with PPO, we can see that KOVA improved the agent's performance in four out of five environments. In Ant-v2 it kept approximately the same performance. When training with TRPO, we can see that KOVA improved the agent's performance mostly in Swimmer-v2 and HalfCheetah-v2. These improvements, both in PPO and in TRPO, suggest that incorporating uncertainty estimation in value function approximation is important for improving the agent's performance.

Investigating the evolution and observation noises: The most interesting hyper-parameters in KOVA are related to the covariances and . As seen in Corollary 3.1, for deterministic interpretation of the parameters we simply set . However, the more interesting setting would be with being a small number that controls the amount of fading memory (Ollivier et al., 2018). can be used for incorporating prior domain knowledge. For example, a diagonal matrix implies independent observations , while if observations are known to be correlated, additional non-diagonal elements can be added. We investigated the effect of different values of and in the Swimmer and HalfCheetah environments, where KOVA gained the most success. The results are depicted in Figure 4. We tested two different settings: the batch-size setting where and the max-ratio setting where . Interestingly, although using KOVA results in lower policy loss (which we try to maximize), it actually increases the policy entropy and encourages exploration, which we believe helps in gaining higher rewards during training. We can clearly see how the mean rewards increases as the policy entropy increases, for different values of . This insight was observed in both tested Mujoco environments and in both settings of .

6 Conclusion

In this work we presented a novel regularized objective function for optimizing VFs in policy evaluation, which originates from a Bayesian perspective over both noisy observations and value parameters. Our empirical results illustrate how the KOVA optimizer can improve the performance of various RL agents in domains with large state and action spaces. For future work, it would be interesting to further investigate the connection between trust region over value parameters and trust region over policy parameters and how to use this connection to improve exploration.


Supplementary Material

Appendix A Theoretical Results

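A.1 Extended Kalman Filter (EKF)

In this section we briefly outline the Extended Kalman filter (Anderson & Moore, 1979; Gelb, 1974). The EKF considers the following model:

$$ \theta_t = \theta_{t-1} + v_t, \qquad y(u_t) = h(u_t;\theta_t) + n_t, $$

where $\theta_t$ are the parameters evaluated at time $t$, $y(u_t)$ is the $N$-dimensional observation vector at time $t$, and $h(u_t;\theta_t)$ is the $N$-dimensional vector of values, where $h(u;\theta)$ is a nonlinear observation function with input $u$ and parameters $\theta$.

The evolution noise $v_t$ is white ($\mathbb{E}[v_t] = 0$) with covariance $P_v$, $\mathbb{E}[v_t v_{t'}^{\top}] = P_v\, \mathbb{1}\{t = t'\}$.

The observation noise $n_t$ is white ($\mathbb{E}[n_t] = 0$) with covariance $P_n$, $\mathbb{E}[n_t n_{t'}^{\top}] = P_n\, \mathbb{1}\{t = t'\}$.

The EKF sets the estimation of the parameters at time $t$ according to the conditional expectation:

$$ \hat{\theta}_{t|t} = \mathbb{E}[\theta_t \mid y_{1:t}], \qquad \hat{\theta}_{t|t-1} = \mathbb{E}[\theta_t \mid y_{1:t-1}], $$

where, with some abuse of notation, $y_{1:t'}$ are the observations gathered up to time $t'$: $y_{1:t'} = \{y(u_1), \ldots, y(u_{t'})\}$. The parameters errors are defined by:

$$ \tilde{\theta}_{t|t} = \theta_t - \hat{\theta}_{t|t}, \qquad \tilde{\theta}_{t|t-1} = \theta_t - \hat{\theta}_{t|t-1}. $$

The conditional error covariances are given by:

$$ P_{t|t} = \mathbb{E}[\tilde{\theta}_{t|t}\tilde{\theta}_{t|t}^{\top} \mid y_{1:t}], \qquad P_{t|t-1} = \mathbb{E}[\tilde{\theta}_{t|t-1}\tilde{\theta}_{t|t-1}^{\top} \mid y_{1:t-1}]. $$

The EKF considers several statistics of interest at each time step. The prediction of the observation function:

$$ \hat{y}_{t|t-1} = \mathbb{E}[h(u_t;\theta_t) \mid y_{1:t-1}]. $$

The observation innovation:

$$ \tilde{y}_{t|t-1} = y(u_t) - \hat{y}_{t|t-1}. $$

The covariance between the parameters error and the innovation:

$$ P_{\tilde{\theta}_t,\tilde{y}_t} = \mathbb{E}[\tilde{\theta}_{t|t-1}\tilde{y}_{t|t-1}^{\top} \mid y_{1:t-1}]. $$

The covariance of the innovation:

$$ P_{\tilde{y}_t} = \mathbb{E}[\tilde{y}_{t|t-1}\tilde{y}_{t|t-1}^{\top} \mid y_{1:t-1}]. $$

The Kalman gain:

$$ K_t = P_{\tilde{\theta}_t,\tilde{y}_t} P_{\tilde{y}_t}^{-1}. $$

The above statistics serve for the update of the parameters and the error covariance:

$$ \hat{\theta}_{t|t} = \hat{\theta}_{t|t-1} + K_t \tilde{y}_{t|t-1}, \qquad P_{t|t} = P_{t|t-1} - K_t P_{\tilde{y}_t} K_t^{\top}. $$
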
A.2 EKF for Value Function Estimation

When applying the EKF formulation to value function approximation, the observation at time $t$ is the target label $y(u_t)$ (see Table 1 in the main article), and the observation function $h(u_t;\theta_t)$ can be the state value function, the state-action value function or the advantage function.

The EKF uses a first order Taylor series linearization for the observation function:

$$ h(u_t;\theta_t) \approx h(u_t;\bar{\theta}) + \nabla_{\theta} h(u_t;\bar{\theta})^{\top} (\theta_t - \bar{\theta}), $$

where $\nabla_{\theta} h(u_t;\bar{\theta}) = \left[\nabla_{\theta} h(u_t^1;\bar{\theta}), \ldots, \nabla_{\theta} h(u_t^N;\bar{\theta})\right]$ and $\bar{\theta}$ is typically chosen to be the previous estimation of the parameters at time $t-1$, $\hat{\theta}_{t|t-1}$. This linearization helps in computing the statistics of interest. Recall that the expectation here is over the random variable $\theta_t$ where $u_t$ is fixed. For simplicity, we keep to write . The prediction of the observation function: