
Deep Robust Kalman Filter

03/07/2017
by Shirli Di-Castro Shashua, et al.

A Robust Markov Decision Process (RMDP) is a sequential decision making model that accounts for uncertainty in the parameters of dynamic systems. This uncertainty introduces difficulties in learning an optimal policy, especially for environments with large state spaces. We propose two algorithms, RTD-DQN and Deep-RoK, for solving large-scale RMDPs using nonlinear approximation schemes such as deep neural networks. The RTD-DQN algorithm incorporates the robust Bellman temporal difference error into a robust loss function, yielding robust policies for the agent. The Deep-RoK algorithm is a robust Bayesian method, based on the Extended Kalman Filter (EKF), that accounts for both the uncertainty in the weights of the approximated value function and the uncertainty in the transition probabilities, improving the robustness of the agent. We provide theoretical results for our approach and test the proposed algorithms on a continuous state domain.



1 Introduction

Sequential decision making in stochastic environments is often modeled as a Markov Decision Process (MDP) in order to optimize a policy that achieves maximal expected accumulated reward (Puterman, 2014; Bertsekas & Tsitsiklis, 1996). Given the MDP model parameters, namely the transition probabilities and the reward function, the aim of the agent is to find the optimal policy. In many cases the true MDP model is unknown in advance and its parameters are estimated from data. The deviation of the estimated model from the true model may cause a degradation in the performance of the learned policy (Mannor et al., 2007). This undesired behavior leads to the need for model-robust methods that consider a set of possible MDP models and find the optimal policy over this set.

Figure 1: Robustness in different domains.

Robustness is effective when environmental safety issues arise (García & Fernández, 2015; Lipton et al., 2016). For example, in autonomous cars (Figure 1(a)) it is important to account for environmental uncertainties such as weather or road conditions. A robust driving policy should account for all of these uncertainties and adjust the agent's driving behavior accordingly. Additional motivation for model-robust methods arises when an agent seeks to optimize a coherent risk measure (Artzner et al., 1999) or to follow a risk-sensitive policy (Petrik & Zilberstein, 2011; Chow et al., 2015). For example, investing in stock markets (Figure 1(b)) requires defining a model for the dynamics of stock prices. This model is based on historical data that may be noisy or insufficient, which induces uncertainty in the model. A robust investing policy accounts for the uncertain model in order to avoid dangerous decisions that can cause large losses. Control tasks can benefit from robustness as well, when state transitions depend on several parameters (Rajeswaran et al., 2016). A robust agent considers different possible values of these parameters to ensure satisfactory performance in the real world.

The robust-MDP (RMDP) framework (Nilim & El Ghaoui, 2005; Iyengar, 2005) was developed to account for uncertainties in the MDP model parameters when looking for an optimal policy. In this framework, each policy is associated with a known uncertainty set of transition probabilities. The optimal policy is the one that maximizes the worst case value function over the associated uncertainty set. When the state space is large, solving robust optimization problems can be a difficult task (Le Tallec, 2007).

We are looking for a method for solving large-scale RMDPs, with on-line nonlinear value function approximation. Existing methods for solving RMDPs (Iyengar, 2005; Nilim & El Ghaoui, 2005; Tamar et al., 2014; Rajeswaran et al., 2016) have several limitations such as linear approximation, off-line estimation and restrictive assumptions on the transition probabilities. We review these methods in Section 2 and compare them to the robust Bayesian approach we propose in this paper.

We distinguish between two types of uncertainty. The MDP model parameter uncertainty refers to the set of possible transition probability distributions for stepping from state $s$ to state $s'$ when taking action $a$. We denote this set by $\mathcal{P}_{s,a}$, and when it is clear from the context we omit the subscript and use $\mathcal{P}$. The other type of uncertainty originates from the Bayesian assumption over the weights of the approximated value function. We denote the weight uncertainty set by $\Theta$.

Inspired by the success of Deep Q-Network (DQN) agents (Mnih et al., 2013) in estimating large-scale nonlinear value functions, we propose the Robust Temporal Difference DQN (RTD-DQN) algorithm which replaces the nominal Bellman Temporal Difference (TD) error involved in the optimized objective function with the robust Bellman TD error. We show that this algorithm captures the MDP model uncertainty and improves the robustness of the agent.

The Kalman filter (Kalman, 1960) and its variant for nonlinear approximations, the Extended Kalman filter (EKF) (Anderson & Moore, 1979; Gelb, 1974), are used for on-line tracking and for estimating states in dynamic environments through indirect observations. These methods have been successfully applied to numerous dynamic control systems such as navigation and target tracking. The Kalman filter can also be used for estimating the weights of approximation functions, where the weights play the role of the states of a dynamic system. We suggest extending the RTD-DQN algorithm by using a Bayesian approach such as the EKF to account for the uncertainty in the weights of the value function approximation, in addition to the uncertainty in the transition probabilities. We present this approach in the Deep Robust Kalman filter (Deep-RoK) algorithm.

This approach may be surprising: how are the weights of the value function related to the MDP model parameters? To answer this question we refer to the work of Mannor et al. (2007). When estimating the MDP model parameters, the potential for error in the estimates, i.e., the uncertainty in the MDP model parameters, introduces variance in the estimate of the value function, which is governed by the value function weights. In Figure 2 we illustrate our robust Bayesian approach. The EKF serves as a Bayesian learning algorithm that receives new information from the transition probability uncertainty set $\mathcal{P}$ and propagates it into the weight uncertainty set $\Theta$. This approach provides a robust and efficient estimation, as we demonstrate in the experiments.

Figure 2: The robust Bayesian approach in Deep-RoK.

Our contributions are: (1) A Bayesian approach for on-line and nonlinear approximation of the value function in RMDPs. We connect the robust Bellman TD error to the EKF updates to achieve robust policies in RMDPs; (2) We propose two algorithms, RTD-DQN and Deep-RoK, to solve large scale RMDPs; (3) We provide theoretical guarantees for our proposed methods; (4) We demonstrate the performance of our two algorithms on a continuous state domain.

2 Related Work

Table 1: Comparison of different approaches to Deep-RoK and RTD-DQN. The comparison criteria are: state-space scalability, on-line estimation, nonlinear approximation, uncertainty in the MDP model (robust), uncertainty in the weights (Bayesian), use of a Kalman filter, and the underlying RL method. The compared approaches and their RL methods are: Deep-RoK (this paper; EKF; Deep Q-learning), RTD-DQN (this paper; Deep Q-learning), Iyengar (2005; DP), Nilim & El Ghaoui (2005; DP), Tamar et al. (2014; ADP), Rajeswaran et al. (2016; policy gradient), Blundell et al. (2015), Li et al. (2016), Geist & Pietquin (2010; UKF for Q-learning), and Singhal & Wu (1988; EKF). Only Deep-RoK combines all of the criteria.

Our paper is related to several areas of research, namely RMDPs, Deep Q-learning networks, the EKF and the Bayesian approach to weight uncertainty in Neural Networks (NNs). Our work is the first to solve RMDPs while combining scalability to large state spaces, on-line estimation, nonlinear Q-function approximations, robustness to uncertainty in the transition probabilities, and a Bayesian approach (EKF) that accounts for the uncertainty in the approximation weights. In Table 1 we compare different approaches with our proposed algorithms, Deep-RoK and RTD-DQN.

Tamar et al. (2014) used an approximate dynamic programming (ADP) method with linear value function approximation. Their convergence analysis is based on a restrictive assumption on the transitions of the exploration policy and the (uncertain) transitions of the policy under evaluation. Our work does not rely on such assumptions, which facilitates the convergence analysis of our proposed algorithms.

The use of Kalman filters to solve reinforcement learning (RL) problems was proposed by Geist & Pietquin (2010). Their formulation, called Kalman Temporal Difference (KTD), serves as the basis for the algorithms we propose. There are several differences between their work and ours: (1) We re-formulate the observation function such that the observation of the agent at time $t$ is the target label, i.e., the sum of the immediate reward and the discounted optimal Q-function of the next state. With this formulation, the observation function is simply the Q-function of the current state and action; (2) They used the nominal Bellman TD error, while we use its robust version; (3) We use the Extended Kalman filter as opposed to their use of the Unscented Kalman filter (UKF) for approximating nonlinear functions (Julier & Uhlmann, 1997; Wan & Van Der Merwe, 2000). In our formulation the observation function is differentiable, allowing us to use the first-order Taylor-expansion linearization used in the EKF and in gradient-descent optimization methods. The UKF has shown superior performance in some applications (St-Pierre & Gingras, 2004; Van Der Merwe, 2004); however, its computational cost is much greater than that of the EKF, since at every training step it must sample the weights a number of times equal to twice the weight dimension and evaluate the observation function at each of these samples. This is not tractable in deep NNs, where the weights are high-dimensional.

3 Background

In this section we describe formulations from different fields, towards combining them into our formulation for solving large scale RMDPs in Section 4.

3.1 Robust Markov Decision Processes (RMDPs)

An RMDP (Iyengar, 2005; Nilim & El Ghaoui, 2005) is a tuple $\langle S, A, r, \gamma, \mathcal{P} \rangle$, where $S$ is a finite set of states, $A$ is a finite set of actions, $r: S \times A \to \mathbb{R}$ is a deterministic and bounded reward function, $\gamma \in (0,1)$ is a discount factor, and $\mathcal{P}_{s,a}$ is a set of probability measures over $S$ which denotes the uncertainty set of the transition probabilities for each state $s$ and action $a$. At each discrete time step $t$ the system stochastically steps from state $s_t$ to state $s_{t+1}$ by taking an action $a_t$. Each transition is associated with an immediate reward $r_t = r(s_t, a_t)$. The agent chooses its actions according to a policy $\pi$ that maps each state to a probability distribution over the action set. The transitions in the system follow a probability distribution $p(\cdot \mid s, a)$, which is assumed to lie in the known uncertainty set $\mathcal{P}_{s,a}$.

The Q-function of a state-action pair $(s, a)$ under policy $\pi$ and state-transition model $p$ represents the expected sum of discounted returns when starting from $(s, a)$ and executing policy $\pi$: $Q^{\pi, p}(s, a) = \mathbb{E}^{\pi, p}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\big]$, where $\mathbb{E}^{\pi, p}$ denotes the expectation w.r.t. the state-action distribution induced by the transitions $p$ and the policy $\pi$. In RMDPs, we are interested in finding the policy that maximizes the worst-case Q-function: $\max_\pi \min_{p \in \mathcal{P}} Q^{\pi, p}(s, a)$. The optimal robust Q-function $Q^*$ is then the unique solution of the robust Bellman recursion:

(1) $Q^*(s, a) = r(s, a) + \gamma \min_{p \in \mathcal{P}_{s, a}} \mathbb{E}^{p}\big[\max_{a'} Q^*(s', a')\big],$

where $s'$ is the successor state when taking action $a$ in state $s$. Iyengar (2005) showed that the agent can be restricted to stationary deterministic Markov policies without affecting the achievable robust reward. In this paper we focus on the $\epsilon$-greedy exploration strategy, where the agent takes a random action with probability $\epsilon$ and follows the optimal policy with probability $1 - \epsilon$.

The solution of the minimization problem in Equation (1) may be computationally demanding. Fortunately, there are some families of sets for which the solution is tractable. Popular uncertainty sets, presented by Iyengar (2005) and Nilim & El Ghaoui (2005), are constructed from approximations of confidence regions associated with probability density estimation. This choice is natural when the uncertainty is due to statistical errors in estimating the state transition probabilities from historical data.
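To make the robust Bellman recursion (1) concrete, the following minimal sketch (ours, not the authors' code) performs one robust backup for a small tabular RMDP whose uncertainty set is represented by a finite collection of sampled transition models; all names and shapes are illustrative.

    import numpy as np

    def robust_bellman_backup(Q, R, P_samples, gamma):
        """One robust Bellman backup (Equation (1)) over a finite uncertainty set.

        Q:         (S, A) current Q-function estimate
        R:         (S, A) reward table r(s, a)
        P_samples: list of (S, A, S) transition tensors, each a candidate model in the set
        gamma:     discount factor
        """
        V = Q.max(axis=1)                                   # max_a' Q(s', a') for every s'
        expected_V = np.stack([P @ V for P in P_samples])   # E_p[max_a' Q] under each candidate model
        worst_case = expected_V.min(axis=0)                 # inner minimization over the uncertainty set
        return R + gamma * worst_case                       # robust Bellman operator applied to Q

    # Iterating this backup to (approximate) convergence yields the optimal robust Q-function.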

3.2 Deep Q-learning

Q-learning (Watkins & Dayan, 1992) is a TD method that aims at directly finding the optimal policy by updating the Q-function with the optimal greedy policy (Sutton, 1988). Therefore, learning the optimal policy reduces to learning the optimal Q-function. In many RL problems the state space is large, so the Q-function is typically approximated by a parametric model $Q(s, a; \theta)$, where $\theta$ denotes the weights of the approximation function.

In Deep Q-learning (Mnih et al., 2013), the agent improves the Q-function (and, in turn, the greedy policy) by minimizing at each time step the squared nominal Bellman TD error (nBTDe) $\delta_t$:

(2) $\min_\theta\; \mathbb{E}\big[\delta_t^2\big],$

where

(3) $\delta_t = y_t - Q(s_t, a_t; \theta).$

Here, $y_t$ is the nominal target label and is defined as:

(4) $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-).$

The weights $\theta^-$ are a fixed set of weights, normally called the target network; they are a more stable, periodically updated copy of the trained weights. The experiences $(s_t, a_t, r_t, s_{t+1})$ are drawn from the joint distribution induced by the current policy, and the observation at each time step is typically stored in an experience replay buffer $D$.

Traditionally, the Q-function is trained by stochastic gradient descent, estimating the loss on each experience as it is encountered, yielding the update:

(5) $\theta_{t+1} = \theta_t + \alpha\, \delta_t\, \nabla_\theta Q(s_t, a_t; \theta_t),$

where $\alpha$ is the learning rate.
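As a point of reference for the robust variant introduced later, here is a minimal PyTorch-style sketch of the nominal target label (4) and the resulting squared-nBTDe update; the terminal-state mask is an implementation detail not spelled out in Equation (4), and all names are illustrative rather than the authors' implementation.

    import torch
    import torch.nn.functional as F

    def nominal_dqn_step(q_net, target_net, optimizer, batch, gamma):
        """One nominal DQN step: build the target (4) and minimize the squared nBTDe (2)-(3)."""
        s, a, r, s_next, done = batch                       # a: int64 actions, done: {0, 1} mask
        with torch.no_grad():
            # y = r + gamma * max_a' Q(s', a'; theta^-), using the fixed target network
            y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q_sa, y)                          # mean squared nominal Bellman TD error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()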

3.3 Deep Q-learning: A Bayesian Perspective

The weights $\theta$ can be learned by maximum likelihood estimation (MLE) using stochastic gradient descent: $\hat\theta^{MLE} = \arg\max_\theta \log p(\mathcal{D} \mid \theta)$, where $\mathcal{D}$ denotes the observed data. A Bayesian approach uses Bayes' rule and adds prior knowledge over the weights to calculate the maximum a-posteriori (MAP) estimator:

(6) $\hat\theta^{MAP} = \arg\max_\theta \big[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\big].$

Placing a prior introduces regularization to the network. We will address the advantages of this regularization in Section 4.
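For intuition (a standard fact, not a result of this paper): with an isotropic Gaussian prior $p(\theta) = \mathcal{N}(0, \sigma_p^2 I)$, the MAP objective (6) reduces to the familiar L2-regularized loss,

$\hat{\theta}^{MAP} = \arg\max_\theta \big[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\big] = \arg\min_\theta \Big[-\log p(\mathcal{D} \mid \theta) + \tfrac{1}{2\sigma_p^2}\|\theta\|_2^2\Big],$

so the prior variance $\sigma_p^2$ plays the role of an inverse regularization coefficient.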

3.4 Extended Kalman Filter (EKF)

In this section we briefly outline the Extended Kalman filter (Anderson & Moore, 1979; Gelb, 1974). The EKF is a standard technique for estimating the state of a nonlinear dynamic system or for learning the weights of a nonlinear approximation function. In this paper we focus on the latter role, i.e., estimating the weights $\theta$. The EKF considers the following model:

(7) $\theta_t = \theta_{t-1} + v_t, \qquad y_t = h(x_t, \theta_t) + n_t,$

where $\theta_t$ are the weights evaluated at time $t$, $y_t$ is the observation at time $t$ and $h$ is a nonlinear observation function. $v_t$ is the evolution noise and $n_t$ is the observation noise, both modeled as additive white noises with covariance $P_v$ and variance $P_n$, respectively. As seen in the model in Equation (7), the EKF treats the weights $\theta_t$ as random variables, similarly to the Bayesian approach. According to this perspective, the weights belong to an uncertainty set $\Theta$ governed by the mean and covariance of the weight distribution.

The estimate at time $t$, denoted $\hat\theta_{t|t}$, is the conditional expectation of the weights with respect to the observed data. The EKF formulation distinguishes between estimates based on observations up to time $t$, $\hat\theta_{t|t} = \mathbb{E}[\theta_t \mid y_{1:t}]$, and estimates based on observations up to time $t-1$, $\hat\theta_{t|t-1} = \mathbb{E}[\theta_t \mid y_{1:t-1}]$. The weight errors are defined by $\tilde\theta_{t|t} = \theta_t - \hat\theta_{t|t}$ and $\tilde\theta_{t|t-1} = \theta_t - \hat\theta_{t|t-1}$.

The conditional error covariances are given by $P_{t|t} = \mathbb{E}\big[\tilde\theta_{t|t}\,\tilde\theta_{t|t}^\top \mid y_{1:t}\big]$ and $P_{t|t-1} = \mathbb{E}\big[\tilde\theta_{t|t-1}\,\tilde\theta_{t|t-1}^\top \mid y_{1:t-1}\big]$.

The EKF considers several statistics of interest at each time step: the prediction of the observation function, the observation innovation, the covariance between the weight error and the innovation, the covariance of the innovation, and the Kalman gain, defined respectively in Equations (8)-(12):

(8) $\hat y_{t|t-1} = \mathbb{E}\big[h(x_t, \theta_t) \mid y_{1:t-1}\big]$
(9) $\tilde y_t = y_t - \hat y_{t|t-1}$
(10) $P_{\theta y, t} = \mathbb{E}\big[\tilde\theta_{t|t-1}\, \tilde y_t \mid y_{1:t-1}\big]$
(11) $P_{y,t} = \mathbb{E}\big[\tilde y_t^2 \mid y_{1:t-1}\big]$
(12) $K_t = P_{\theta y, t}\, P_{y,t}^{-1}$

The above statistics serve for the update of the weights and the error covariance: $\hat\theta_{t|t} = \hat\theta_{t|t-1} + K_t\, \tilde y_t$ and $P_{t|t} = P_{t|t-1} - K_t P_{y,t} K_t^\top$.
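The following numpy sketch (ours, with illustrative names) implements one such EKF step for a generic scalar observation function, mirroring Equations (8)-(12) and the update above; it assumes the random-walk weight evolution of Equation (7) and a first-order linearization of $h$ through its gradient.

    import numpy as np

    def ekf_weight_step(theta, P, y, h, grad_h, P_v, P_n):
        """One EKF step for estimating the weights of a nonlinear observation function.

        theta:  (d,) current weight estimate
        P:      (d, d) weight error covariance
        y:      scalar observation
        h:      callable, h(theta) -> scalar predicted observation
        grad_h: callable, grad_h(theta) -> (d,) gradient of h w.r.t. the weights
        P_v:    (d, d) evolution noise covariance
        P_n:    scalar observation noise variance
        """
        P_pred = P + P_v                          # predict step: random-walk weight evolution
        g = grad_h(theta)                         # first-order linearization of h at theta
        innovation = y - h(theta)                 # observation innovation (8)-(9)
        P_ty = P_pred @ g                         # weight-error / innovation covariance (10)
        P_y = g @ P_pred @ g + P_n                # innovation covariance (11), a scalar
        K = P_ty / P_y                            # Kalman gain (12)
        theta_new = theta + K * innovation        # weight update
        P_new = P_pred - np.outer(K, K) * P_y     # error covariance update
        return theta_new, P_new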

4 Solving Large-Scale RMDPs

In this section we explain how to combine EKF as a Bayesian method with Deep Q-learning and the robust formulation to solve large scale RMDPs with uncertainty in the transition probabilities.

4.1 Robust Temporal Difference Deep Q-Network (RTD-DQN)

Our first step in solving large-scale RMDPs with on-line nonlinear approximations is to replace the nBTDe $\delta_t$ of Equation (3) with the robust Bellman TD error (rBTDe) $\delta_t^{robust}$ and to minimize the following objective function at each time step:

$\min_\theta\; \mathbb{E}\big[(\delta_t^{robust})^2\big],$

where

(13) $\delta_t^{robust} = y_t^{robust} - Q(s_t, a_t; \theta).$

Here, $y_t^{robust}$ is the robust target label:

(14) $y_t^{robust} = r_t + \gamma \min_{p \in \mathcal{P}_{s_t, a_t}} \sum_{s' \in S_{s_t, a_t}} p(s' \mid s_t, a_t)\, \max_{a'} Q(s', a'; \theta^-).$

The set $S_{s_t, a_t}$ is the set of all possible next states from state $s_t$ when taking action $a_t$, and the transition distribution over this set is drawn from the uncertainty set $\mathcal{P}_{s_t, a_t}$. Note that $y_t^{robust}$ is a variation of the robust Bellman operator for the optimal Q-function presented in Equation (1). It looks for worst-case transitions that may reduce the value of the expected Q-function and sets the robust target label according to the minimal expectation. In return, the agent that receives these robust target labels learns how to act optimally over these transitions.
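A minimal sketch (ours) of the robust target label (14) for a single transition, assuming the uncertainty set is represented by a finite number of candidate distributions over the observed next-state set; names are illustrative.

    import torch

    def robust_target(r, next_state_sets, probs_set, target_net, gamma):
        """Robust target label (14): worst-case expected max-Q over a finite uncertainty set.

        r:               scalar immediate reward
        next_state_sets: tensor (K, d_s) of possible next states for (s, a)
        probs_set:       tensor (M, K), each row a candidate distribution over the K next states
        target_net:      target Q-network with fixed weights theta^-
        gamma:           discount factor
        """
        with torch.no_grad():
            v_next = target_net(next_state_sets).max(dim=1).values   # max_a' Q(s', a'; theta^-), shape (K,)
            expected = probs_set @ v_next                             # expectation under each candidate model, (M,)
            worst_case = expected.min()                               # inner minimization over the set
        return r + gamma * worst_case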

This approach for solving RMDPs is presented in Algorithm 1, RTD-DQN. It is based on the DQN algorithm (Mnih et al., 2013) but incorporates the robust target label instead of the nominal one. RTD-DQN initializes the weights with small random values and holds an experience replay buffer $D$ with finite capacity $N$. The environment is initialized at the beginning of each episode. We index the weight-estimation steps with subscript $t$ and the mini-batch samples with subscript $j$. During an episode the agent observes the uncertainty set $\mathcal{P}_{s_t, a_t}$ over the set of next possible states $S_{s_t, a_t}$. For each mini-batch sample $j$, the agent computes the robust target label $y_j^{robust}$ (14) and updates the weights according to the gradient-descent step (5), with $\delta_j^{robust}$ replacing $\delta_j$. Note that the environment transitions to a state $s_{t+1}$ that is drawn from the true, unknown MDP transition model. However, the actions taken by the agent adhere to the robust policy, which is based on the robust target labels. The output of the RTD-DQN algorithm is the MLE weight estimator $\hat\theta^{MLE}$. The RTD-DQN algorithm incorporates the robust target label into the weight update, but it does not account for uncertainty in the weights. This uncertainty is important for robustness, and in the next section we suggest using the EKF for this purpose.

0:  Input: uncertainty set $\mathcal{P}$, discount factor $\gamma$, learning rate $\alpha$; Initialize: weights $\theta_0$, target weights $\theta^-$, replay buffer $D$ with capacity $N$.
1:  for episode $= 1, \dots, M$ do
2:     for $t = 1, \dots, T$ do
3:        With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \arg\max_a Q(s_t, a; \theta_t)$
4:        Observe $(s_t, a_t, r_t, s_{t+1})$ together with $\mathcal{P}_{s_t, a_t}$ and store them in $D$.
5:        Compute the robust target label $y_j^{robust}$ (14) for a random mini-batch $j$ from $D$.
6:        Compute the rBTDe $\delta_j^{robust}$ (13).
7:        Update the weights:
8:        $\theta_{t+1} = \theta_t + \alpha\, \delta_j^{robust}\, \nabla_\theta Q(s_j, a_j; \theta_t)$
9:     end for
10:  end for
Output: $\hat\theta^{MLE}$
Algorithm 1 RTD-DQN
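Putting the pieces together, the inner mini-batch update of Algorithm 1 can be sketched as follows, reusing the illustrative robust_target helper above together with a generic PyTorch Q-network; this is a sketch of the idea, not the authors' implementation.

    def rtd_dqn_minibatch_step(q_net, target_net, optimizer, minibatch, gamma):
        """Gradient step of RTD-DQN: the robust target (14) replaces the nominal target in the loss."""
        loss = 0.0
        for s, a, r, next_states, probs_set in minibatch:
            y_robust = robust_target(r, next_states, probs_set, target_net, gamma)
            q_sa = q_net(s.unsqueeze(0))[0, a]
            loss = loss + (y_robust - q_sa) ** 2          # squared rBTDe (13)
        loss = loss / len(minibatch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()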

4.2 EKF for Deep Q-learning

We propose to improve the performance of RTD-DQN by accounting for uncertainty in the weights in addition to the uncertainty in the transition probabilities. We suggest using the EKF for training Deep Q-learning networks. For this purpose, using the EKF formulation presented in Section 3.4, the observation at time $t$ is simply the nominal target label $y_t$ (4), and the observation function is the state-action Q-function, $h(x_t, \theta_t) = Q(s_t, a_t; \theta_t)$. With this formulation, the EKF model for DQN with a Bayesian approach becomes:

(15) $\theta_t = \theta_{t-1} + v_t, \qquad y_t = Q(s_t, a_t; \theta_t) + n_t.$

The EKF uses a first-order Taylor series linearization of the observation function (the Q-function): $Q(s_t, a_t; \theta) \approx Q(s_t, a_t; \hat\theta) + g_t^\top (\theta - \hat\theta)$, where $g_t = \nabla_\theta Q(s_t, a_t; \theta)\big|_{\theta = \hat\theta}$ and $\hat\theta$ is typically chosen to be the previous estimate of the weights, $\hat\theta_{t-1}$. This linearization helps in computing the statistics of interest (see the supplementary material for more detailed derivations):

(16) $P_{\theta y, t} = P_{t|t-1}\, g_t, \qquad P_{y, t} = g_t^\top P_{t|t-1}\, g_t + P_n, \qquad P_{t|t-1} = P_{t-1|t-1} + P_v.$

The Kalman gain then becomes:

(17) $K_t = P_{t|t-1}\, g_t\, \big(g_t^\top P_{t|t-1}\, g_t + P_n\big)^{-1},$

and the updates for the weights of the Q-function and for the error covariance are:

(18) $\hat\theta_t = \hat\theta_{t-1} + K_t \big(y_t - Q(s_t, a_t; \hat\theta_{t-1})\big), \qquad P_{t|t} = P_{t|t-1} - K_t P_{y,t} K_t^\top.$

It is interesting to note that the Kalman gain (17) can be interpreted as an adaptive learning rate for each individual weight that implicitly incorporates the uncertainty of each weight. This approach resembles familiar stochastic gradient optimization methods such as Adagrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), for different choices of $P_v$ and $P_n$. We refer the reader to Ruder (2016).
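Under the same illustrative notation, one EKF step for the Q-network weights (Equations (16)-(18)) can be sketched as below. The gradient of Q at the current estimate plays the role of the linearized observation function; the weights are kept as one flat vector theta that is assumed to be in sync with q_net's parameters, and writing the updated vector back into the network is omitted.

    import torch

    def ekf_dqn_step(q_net, theta, P, s, a, y, P_v, P_n):
        """EKF update for Q-network weights; y may be the nominal (4) or robust (14) target label."""
        P_pred = P + P_v                                    # predicted error covariance
        q_sa = q_net(s.unsqueeze(0))[0, a]                  # Q(s, a; theta), keeps the graph for grads
        g = torch.cat([p.reshape(-1) for p in
                       torch.autograd.grad(q_sa, list(q_net.parameters()))])  # flat gradient = linearization
        innovation = y - q_sa.detach()                      # target label minus predicted Q-value
        P_y = g @ P_pred @ g + P_n                          # scalar innovation covariance (16)
        K = (P_pred @ g) / P_y                              # Kalman gain (17)
        theta_new = theta + K * innovation                  # weight update (18)
        P_new = P_pred - torch.outer(K, K) * P_y            # error covariance update (18)
        return theta_new, P_new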

We now revisit the MAP estimator presented in Equation (6). Given the observations gathered up to time $t$ (denoted $y_{1:t}$) we can write:

(19) $\hat\theta_t^{MAP} = \arg\max_\theta\, p(\theta \mid y_{1:t}) = \arg\max_\theta\, p(y_t \mid \theta, y_{1:t-1})\, p(\theta \mid y_{1:t-1}).$

Here, instead of using the prior of the weights, we present an equivalent derivation for the posterior of the weights conditioned on $y_{1:t}$, based on the likelihood of a single observation $y_t$ and the posterior conditioned on $y_{1:t-1}$ (Van Der Merwe, 2004). When estimating $\theta$ using the EKF, it is common to make the following assumptions regarding the likelihood and the posterior:

Assumption 1.

The likelihood is assumed to be Gaussian: $p(y_t \mid \theta, y_{1:t-1}) = \mathcal{N}\big(y_t;\, Q(s_t, a_t; \theta),\, P_n\big)$.

Assumption 2.

The posterior distribution is assumed to be Gaussian: $p(\theta \mid y_{1:t-1}) = \mathcal{N}\big(\theta;\, \hat\theta_{t-1},\, P_{t|t-1}\big)$.

We use $y_t$ to denote a general target label (for example the nominal target (4) or the robust target (14)) that serves as an observation in the EKF formulation. Based on the Gaussian assumptions, we can derive the following Theorem:

Theorem 1.

Under Assumptions 1 and 2, the EKF update (18) minimizes at each time step the following regularized objective function:

$L_t(\theta) = \frac{1}{2 P_n}\big(y_t - Q(s_t, a_t; \theta)\big)^2 + \frac{1}{2}\big(\theta - \hat\theta_{t-1}\big)^\top P_{t|t-1}^{-1}\big(\theta - \hat\theta_{t-1}\big),$

where $P_{t|t-1} = P_{t-1|t-1} + P_v$.

The proof of Theorem 1 appears in the supplementary material. It is based on solving the maximization problem in Equation (19) using the EKF model (15) and the Gaussian assumptions in Assumptions 1 and 2.

Note that this objective function is a regularized version of the objective function in Equation (2), where the weights are weighted according to the error covariance matrix $P_{t|t-1}$. The nBTDe $\delta_t$ is the same as in Equation (3) and the nominal target label $y_t$ is the same as in Equation (4). The observation noise variance $P_n$ can be interpreted as the regularization coefficient (Rivals & Personnaz, 1998), and we can examine it from two points of view: (1) treating $P_n$ as the amount of confidence we have in the observations: if the observations are noisy, we should consider larger values for $P_n$; (2) treating $P_n$ as a regularization coefficient: when observations are noisy we would like to put a larger weight on the prior by increasing $P_n$.

4.3 Deep Robust Extended Kalman Filter (Deep-RoK)

0:  Input: prior error covariance $P_0$, evolution noise covariance $P_v$, observation noise variance $P_n$, uncertainty set $\mathcal{P}$, discount factor $\gamma$; Initialize: weights $\hat\theta_0$, target weights $\theta^-$, replay buffer $D$.
1:  for episode $= 1, \dots, M$ do
2:     for $t = 1, \dots, T$ do
3:        Set predictions: $\hat\theta_{t|t-1} = \hat\theta_{t-1|t-1}$, $P_{t|t-1} = P_{t-1|t-1} + P_v$.
4:        With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \arg\max_a Q(s_t, a; \hat\theta_{t|t-1})$
5:        Observe $(s_t, a_t, r_t, s_{t+1})$ together with $\mathcal{P}_{s_t, a_t}$ and store them in $D$.
6:        Compute the robust target label $y_j^{robust}$ (14) for a random mini-batch $j$ from $D$.
7:        Compute the rBTDe $\delta_j^{robust}$ (13).
8:        Compute the statistics (16) and the Kalman gain (17).
9:        Update the weights and error covariance:
10:        $\hat\theta_{t|t} = \hat\theta_{t|t-1} + K_t\, \delta_j^{robust}, \qquad P_{t|t} = P_{t|t-1} - K_t P_{y,t} K_t^\top$
11:     end for
12:  end for
Output: $\hat\theta^{MAP}$ and $P$
Algorithm 2 Deep-RoK

We are now ready to combine the EKF, as a Bayesian method, with the rBTDe (13). This approach incorporates uncertainty in the Q-function weights and allows propagating the uncertainty in the transition probabilities, $\mathcal{P}$, into the uncertainty set of the weights, $\Theta$. It can be used to solve large-scale RMDPs with nonlinear approximation and in an on-line fashion. Practically, we change the objective function presented in Equation (2) both by adding regularization according to the EKF formulation and by replacing the nBTDe $\delta_t$ with the rBTDe $\delta_t^{robust}$. This results in the following Theorem:

Theorem 2.

Under Assumptions 1 and 2, the Deep-RoK weight update (the EKF update (18) with the robust target label $y_t^{robust}$ in place of $y_t$) minimizes at each time step the following regularized robust objective function:

(20) $L_t^{robust}(\theta) = \frac{1}{2 P_n}\big(y_t^{robust} - Q(s_t, a_t; \theta)\big)^2 + \frac{1}{2}\big(\theta - \hat\theta_{t-1}\big)^\top P_{t|t-1}^{-1}\big(\theta - \hat\theta_{t-1}\big),$

where $y_t^{robust}$ is the robust target label (14) and $P_{t|t-1} = P_{t-1|t-1} + P_v$.

The proof of Theorem 2 follows the same arguments as the proof of Theorem 1, but uses the rBTDe $\delta_t^{robust}$ (13) instead of the nBTDe $\delta_t$. The proof appears in the supplementary material.

Looking at the weight update in Equation (18) and the definition of the Kalman gain in Equation (17), we can see that by combining the rBTDe $\delta_t^{robust}$ with the EKF formulation, the Kalman gain propagates the new information from the robust target label $y_t^{robust}$ (derived from the transition-probabilities uncertainty set $\mathcal{P}$) back into the weight uncertainty set $\Theta$ before combining it with the estimated weight value.

This Bayesian approach for solving RMDPs is presented in Algorithm 2, Deep-RoK. Deep-RoK receives as input the initial prior for the error covariance $P_0$, the evolution noise covariance $P_v$, the observation noise variance $P_n$, the uncertainty set $\mathcal{P}$ and the discount factor $\gamma$. Its observations are similar to the ones described for the RTD-DQN algorithm, but the weight update is different: Deep-RoK uses the EKF updates (18) with the robust target label $y_t^{robust}$, which is based on the uncertainty set $\mathcal{P}_{s,a}$.
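For completeness, the inner step of Algorithm 2 is essentially the composition of the two sketches above: the robust target label feeds the EKF update. Again, the names are illustrative and the bookkeeping that keeps theta and the network parameters in sync is omitted.

    def deep_rok_step(q_net, target_net, theta, P, transition, gamma, P_v, P_n):
        """One Deep-RoK step: robust target label (14) plugged into the EKF weight update (18)."""
        s, a, r, next_states, probs_set = transition
        y_robust = robust_target(r, next_states, probs_set, target_net, gamma)
        return ekf_dqn_step(q_net, theta, P, s, a, y_robust, P_v, P_n)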

The output of the Deep-RoK algorithm is the MAP weight estimator $\hat\theta^{MAP}$ and the error covariance matrix $P$. Deep-RoK is suitable for any prior $P_0$, including priors that assume correlations between the weights. Figure 3 presents a block diagram which illustrates the flow of weight updates in the Deep-RoK algorithm.

Figure 3: Block diagram of the Deep-RoK algorithm. The Kalman gain propagates the new information from the transition-probabilities uncertainty set $\mathcal{P}$ into the weight uncertainty set $\Theta$, using the rBTDe $\delta_t^{robust}$.

During the test phase, the output of the Deep-RoK algorithm provides flexibility to the agent. It can choose to use the point estimate $\hat\theta^{MAP}$ as a single NN on which it performs tests; recall that $\hat\theta^{MAP}$ incorporates the information regarding both the weight uncertainty set $\Theta$ and the transition uncertainty set $\mathcal{P}$. However, the agent can also take advantage of the additional output $P$ and test the results over an ensemble of NNs by sampling weights from a distribution with mean $\hat\theta^{MAP}$ and covariance $P$.
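A small sketch (ours) of this test-time ensemble option: weight vectors are sampled from a Gaussian with the MAP estimate as mean and the learned error covariance; a routine that loads a flat vector back into a network copy is assumed to be available.

    import numpy as np

    def sample_weight_ensemble(theta_map, P, num_members, rng=None):
        """Draw weight vectors from N(theta_map, P) to form an ensemble of Q-networks."""
        rng = np.random.default_rng() if rng is None else rng
        return rng.multivariate_normal(mean=theta_map, cov=P, size=num_members)

Each sampled vector is loaded into a copy of the Q-network, and the greedy actions (or Q-values) of the copies can be compared or averaged at test time.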

5 Experiments

We demonstrate the performance of the RTD-DQN and Deep-RoK algorithms on the classic RL Cart-Pole environment with nonlinear Q-function approximations. In the Cart-Pole domain the agent's goal is to balance a pole atop a cart while controlling the direction of the force applied to the cart. The action space is discrete and contains two possible actions: applying a constant force to the right or to the left. The state space is continuous, where each state is four-dimensional, consisting of the cart position, the cart velocity, the pole angle and the pole's angular velocity. At each time step the agent receives an immediate reward of 1 if the pole has not fallen down and the cart has not run off the right or left boundary of the screen; otherwise the agent receives a reward of 0 and the episode terminates. The transitions follow the dynamic model of the system and are based on the parameters {cart mass, pole length}. Additional technical details regarding this experiment can be found in the supplementary material.

In order to introduce robustness into the Cart-Pole domain, we assumed that the parameters {cart mass, pole length} are not precisely known in advance but rather lie in a known uncertainty set. This in turn induces uncertainty in the transition probabilities, defined by the set $\mathcal{P}$. We considered a range of values (in meters) for the pole length and a range of values (in kg) for the cart mass. At each episode we uniformly sampled 5 values from each parameter range; these samples were used to build $\mathcal{P}$. We used the CartPole-v0 implementation from the OpenAI Gym (Brockman et al., 2016) and added the uncertainty set parameters to the implementation.
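As an illustration of how such a parameter uncertainty set can be realized (our sketch; the exact ranges and the way the paper injects the parameters are not reproduced here), the classic gym CartPoleEnv exposes masscart and length attributes that can be perturbed after sampling:

    import numpy as np
    import gym

    def sample_cartpole_models(length_range, mass_range, num_samples, seed=0):
        """Create CartPole environments whose pole length and cart mass are drawn from given ranges."""
        rng = np.random.default_rng(seed)
        envs = []
        for _ in range(num_samples):
            env = gym.make("CartPole-v0").unwrapped
            env.length = rng.uniform(*length_range)        # half-pole length in the gym implementation
            env.masscart = rng.uniform(*mass_range)
            env.total_mass = env.masspole + env.masscart   # recompute derived quantities
            env.polemass_length = env.masspole * env.length
            envs.append(env)
        return envs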

We trained the agent with the RTD-DQN and the Deep-RoK algorithms, both using the rBTDe. We compare their performance with a Double-DQN agent (Van Hasselt et al., 2016) that uses the nBTDe. All agents were trained for the same number of episodes, and the results are drawn from testing the trained models over multiple test episodes.

In Figure 4(a)-(c) we show the performance of all three agents for different values of the pole length. In Figure 4(d)-(f) we show the success rate of all three agents for different values of the pole length and the cart mass. The Double-DQN agent performs well (high cumulative reward) on the nominal parameters it was trained on (marked in red in the graphs), but it performs poorly on more extreme values of these parameters: it learned a policy that is highly optimized for the specific parameters it was trained on but is brittle under parameter mismatch. The RTD-DQN agent also performs well on parameter values that are not the nominal values but are covered by the uncertainty set through the rBTDe. The Deep-RoK agent has the most robust results and maintains high performance over a large range of pole lengths and cart masses; the Bayesian approach in this algorithm provides the agent with additional robustness to uncertain parameters.

Figure 4: (a)-(c) Cumulative reward for different values of the pole length (mean and one standard deviation). (d)-(f) Success rate for different values of the pole length and the cart mass. A successful episode is defined by a cumulative reward above a fixed threshold. The nominal values for the Double-DQN agent are marked in red.

6 Discussion

We introduced two algorithms for solving large-scale RMDPs using on-line nonlinear estimation. The RTD-DQN algorithm incorporates the robust Bellman temporal difference error into a robust loss function, yielding robust policies for the agent. The Deep-RoK algorithm uses a robust Bayesian approach, based on the Extended Kalman Filter, to account for both the uncertainty in the weights of the value function and the uncertainty in the transition probabilities. We proved that the Deep-RoK algorithm outputs the weights that minimize the robust EKF loss function. We demonstrated the performance of our algorithms on the Cart-Pole domain and showed that our robust approach performs better compared to a nominal DQN agent. We believe that real-world domains, such as autonomous driving or investment strategies, can benefit from using a robust approach to improve their agents' performance by accounting for uncertainties in their models.

Future work should address accounting for changes in the confidence level during the evaluation procedure, directed exploration by using uncertainty estimates and robustness in policy gradient algorithms.

References

Supplementary Material

Appendix A Extended Kalman Filter (EKF)

In this section we briefly outline the formulation of the Extended Kalman filter (Anderson & Moore, 1979; Gelb, 1974). The EKF considers the following model:

(21) $\theta_t = \theta_{t-1} + v_t, \qquad y_t = h(x_t, \theta_t) + n_t,$

where $\theta_t$ are the weights evaluated at time $t$, $y_t$ is the observation at time $t$ and $h$ is a nonlinear observation function.

The evolution noise $v_t$ is white, $\mathbb{E}[v_t] = 0$, with covariance $\mathbb{E}[v_t v_t^\top] = P_v$.

The observation noise $n_t$ is white, $\mathbb{E}[n_t] = 0$, with variance $\mathbb{E}[n_t^2] = P_n$.

The EKF sets the estimate of the weights at time $t$ according to the conditional expectation:

(22) $\hat\theta_{t|t} = \mathbb{E}\big[\theta_t \mid y_{1:t}\big],$

where $y_{1:t}$ are the observations gathered up to time $t$. The weight errors are defined by:

(23) $\tilde\theta_{t|t} = \theta_t - \hat\theta_{t|t}, \qquad \tilde\theta_{t|t-1} = \theta_t - \hat\theta_{t|t-1}.$

The conditional error covariances are given by:

(24) $P_{t|t} = \mathbb{E}\big[\tilde\theta_{t|t}\, \tilde\theta_{t|t}^\top \mid y_{1:t}\big],$
(25) $P_{t|t-1} = \mathbb{E}\big[\tilde\theta_{t|t-1}\, \tilde\theta_{t|t-1}^\top \mid y_{1:t-1}\big].$

The EKF considers several statistics of interest at each time step:

The prediction of the observation function: $\hat y_{t|t-1} = \mathbb{E}\big[h(x_t, \theta_t) \mid y_{1:t-1}\big]$.

The observation innovation: $\tilde y_t = y_t - \hat y_{t|t-1}$.

The covariance between the weight error and the innovation: $P_{\theta y, t} = \mathbb{E}\big[\tilde\theta_{t|t-1}\, \tilde y_t \mid y_{1:t-1}\big]$.

The covariance of the innovation: $P_{y,t} = \mathbb{E}\big[\tilde y_t^2 \mid y_{1:t-1}\big]$.

The Kalman gain: $K_t = P_{\theta y, t}\, P_{y,t}^{-1}$.

The above statistics serve for the update of the weights and the error covariance: $\hat\theta_{t|t} = \hat\theta_{t|t-1} + K_t\, \tilde y_t$ and $P_{t|t} = P_{t|t-1} - K_t P_{y,t} K_t^\top$.

Appendix B EKF for Deep Q-learning

When applying the EKF formulation to Deep Q-learning networks, the observation at time $t$ is simply the nominal target label $y_t$:

(26) $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-),$

where $\theta^-$ is a fixed set of weights (the target network). The observation function is the state-action Q-function:

(27) $h(x_t, \theta_t) = Q(s_t, a_t; \theta_t).$

We use the notation