
MM-KTD: Multiple Model Kalman Temporal Differences for Reinforcement Learning

There has been a surge of interest in the development of advanced Reinforcement Learning (RL) systems as intelligent approaches to learn optimal control policies directly from smart agents' interactions with the environment. Objectives: In a model-free RL method with a continuous state-space, the value function of the states typically needs to be approximated. In this regard, Deep Neural Networks (DNNs) provide an attractive modeling mechanism to approximate the value function using sample transitions. DNN-based solutions, however, suffer from high sensitivity to parameter selection, are prone to overfitting, and are not very sample efficient. A Kalman-based methodology, on the other hand, could be used as an efficient alternative. Such an approach, however, commonly requires a priori information about the system (such as noise statistics) to perform efficiently. The main objective of this paper is to address this issue. Methods: As a remedy to the aforementioned problems, this paper proposes an innovative Multiple Model Kalman Temporal Difference (MM-KTD) framework, which adapts the parameters of the filter using the observed states and rewards. Moreover, an active learning method is proposed to enhance the sampling efficiency of the system. More specifically, the estimated uncertainty of the value functions is exploited to form the behaviour policy, leading to more visits to less certain values and, therefore, improving the overall learning sample efficiency. As a result, the proposed MM-KTD framework can learn the optimal policy with a significantly reduced number of samples compared to its DNN-based counterparts. Results: To evaluate the performance of the proposed MM-KTD framework, we have performed a comprehensive set of experiments on three RL benchmarks. Experimental results show the superiority of the MM-KTD framework over its state-of-the-art counterparts.




I Introduction

Inspired by the exceptional learning capabilities of human beings, Reinforcement Learning (RL) systems have emerged, aiming to form optimal control policies merely from knowledge of an agent's past interactions with its environment. Such a learning approach is particularly beneficial because, unlike supervised learning methods, an RL system does not require carefully labelled data, which is typically difficult to acquire. Instead, to choose the best possible action at each time step, an RL system relies on the reward (feedback) received from each move [1, 2, 3, 4]. Given their phenomenal potential, there has been a surge of interest in the advancement of RL systems, which is also the focus of this paper.

When an accurate model of the environment is available, methods such as dynamic programming [5] may be used to find the optimal policy. An accurate model of the environment, however, is rarely available in practice, leading to the need to resort to model-based RL or model-free approaches. In model-based RL approaches, the goal is to learn the model of the system using its past transitions and then apply a model-based control scheme (e.g., a model predictive controller [6]) to reach the desired goal [7]. While model-based RL approaches are proven to be sample efficient, their performance typically degrades over time due to uncertainties and bias in the constructed model of the environment. Model-free approaches, on the other hand, directly acquire the optimal policy for the system without developing a model of it. Due to their superior asymptotic performance, model-free approaches are commonly preferred over their model-based counterparts, especially when sampling the agent's past trajectories is inexpensive. This paper focuses on the development of a model-free RL approach to achieve the desired policy for the underlying system.

A common practice in model-free RL schemes is to calculate the state (or, alternatively, state-action) value function for each state. While this can be done with relative ease for a system with a limited number of states (e.g., via Monte Carlo sampling [8]), it is impractical to find the value function for each state when the number of states is large or the states are continuous (due to the curse of dimensionality [9]). In such cases, which encompass most realistic applications, the value function needs to be approximated. In a large group of works, e.g., [10, 11, 12, 13, 14], artificial neural networks were employed to approximate the value function over the entire state-space. Despite a few successful trials, most early attempts of this kind were not very successful due to overfitting. This problem was overcome in [15], where simple yet efficient techniques were employed to avoid overfitting. This work was later extended to systems with continuous actions [16]. It was subsequently shown, however, that such approaches are highly sensitive to parameter selection and therefore need further revision to be applicable to a larger variety of problems. Despite continuous attempts [17, 18, 19, 20] to overcome these problems, training of the neural network has remained an open problem in the field.

Another well-practiced approach to approximating the value function is to employ a set of weighted local estimators and convert the approximation problem into a weight estimation problem. Various local estimators have been proposed in the literature, among which Radial Basis Functions (RBFs) [21] and Cerebellar Model Articulation Controllers (CMACs) [22] are the most popular. It was shown that RBFs are more suitable than CMACs for systems with continuous states, due to their continuous nature [23]. More recently, the Fourier basis was proposed as the local estimator; however, the performance of the resulting system was shown to be comparable to that of systems using RBFs [24]. Due to these advantages, we exploit RBFs for value function approximation. The parameters of the RBFs are usually computed based on knowledge of the problem. It is possible, however, to adapt these parameters using the observed transitions to enhance the autonomy of the method. Cross-entropy and gradient descent methods were proposed in [25] for this purpose. The stability of the latter was later enhanced using a restrictive technique in [26], which is adopted in this work.

Once the structure of the value function is determined, a suitable algorithm has to be chosen to train the value function approximation. Various approaches have been proposed in the literature to gradually bring the value approximations close to their real values in all states. These methods are generally categorized as bootstrapping approaches (such as the Fixed-Point Kalman Filter, or FPKF [27]), residual approaches (e.g., Gaussian Process Temporal Difference, or GPTD [28]), and projected fixed-point approaches (e.g., Least Squares Temporal Difference, or LSTD [29]). A comprehensive comparison of these approaches is available in [30]. Among them, Kalman Temporal Difference (KTD) [31] stands out, as it provides not only the Minimum Mean Square Error (MMSE) estimate of the value function (given the selected structure) but also its uncertainty in terms of the error covariance, which can be further exploited to reach higher sample efficiency [31]. It was also shown in [31] that GPTD is a special case of KTD. Like other Kalman-based schemes, however, KTD requires the parameters of the filter (such as the process and measurement noise covariances) to be known a priori, which is not the case in most practical scenarios.

Filter parameter estimation is a well-studied problem for Kalman filters and has led to numerous adaptive schemes in the literature. These methods may be roughly categorized as innovation-based adaptive methods (e.g., [32]) and multiple model adaptive schemes (e.g., [33]). The latter has the advantage of fast adaptability when the mode of the system is changing. Most recently, effective methods in multiple model adaptive estimation (MMAE) were discussed in [34], where various averaging and weighting schemes were proposed and compared to achieve superior results. Multiple model approaches have previously been used in RL problems. A model-based multiple model approach was introduced in [35], where the uncertainty of the system model was addressed using a set of models for the system. Moreover, a model selection approach was proposed in [36] for a multiple model KTD to overcome the filter parameter uncertainty problem; however, the models and the selection scheme were naive and therefore not suitable for more general tasks.

In this work, various adaptive schemes are considered for value function approximation using KTD to address the parameter uncertainty problem. In particular, the paper makes the following contributions:

  • An innovative Multiple Model Adaptive Estimation (MMAE) scheme is adopted within the RL process, referred to as Multiple Model Kalman Temporal Differences (MM-KTD), to compensate for the lack of information about the measurement noise covariance, the most important parameter of the filter [37], while the mean and covariance of the RBFs are updated by exploiting the restricted gradient descent method of [26].

  • To improve the sample efficiency of the proposed MM-KTD, off-policy Q-learning is adopted to learn the optimal policy from the behaviour policy.

  • Within the proposed MM-KTD framework, the estimated uncertainty of the value functions is exploited to form the behaviour policy, leading to more visits to less certain values and, therefore, improving the overall learning sample efficiency.

The proposed approach is tested over multiple RL benchmarks and shows desirable results compared to its deep learning-based counterparts.

The remainder of the paper is organized as follows: In Section II, the basic techniques of RL are briefly outlined. Section III presents the proposed MM-KTD framework. Experimental results based on three RL benchmarks are presented in Section IV, illustrating the effectiveness and superiority of the proposed MM-KTD framework. Finally, Section V concludes the paper.

II Problem Formulation

In this section, we briefly present the required background on RL and formulate the problem at hand. Throughout the paper, the following notation is used: a non-bold letter denotes a scalar variable, a lowercase bold letter represents a vector, and a capital bold letter denotes a matrix. The transpose of a matrix is denoted by a superscript T.

II-A Reinforcement Learning (RL)

In a typical RL scenario, one deals with an agent placed in an unknown environment. At each time step, the agent is in a specific state within the available set of states and can take an action from the action set, using a specific policy, which takes the agent to another state. More specifically, at each time step the agent at the current state takes an action according to the policy, which takes the system to the next state with a certain transition probability, resulting in a reward. The tuple of the state set, the action set, the transition probability, the reward, and the discount factor defines the process known as a Markov Decision Process (MDP), which defines the domain of work in RL. The agent starts at an initial state and continues to explore the states until it reaches a terminal state, at which point an episode is completed.

Generally speaking, the goal of RL is to find a policy, through a number of experimental episodes, that maximizes the expected sum of discounted rewards over the future states. In doing so, it is useful to define the state value function as


where the expectation is taken over future state transitions. If the value function (and also the transition probability) is known for a policy, the policy may be improved by selecting, at each step, the greedy action leading to the next state with the highest value (policy improvement) [38]. If the transition probability is unknown, it is more useful to exploit the state-action value function, defined as follows


Using the state-action value function defined in Eq. (2) has the advantage of allowing direct selection of the greedy action, as compared to the state value function defined in Eq. (1), for which the leading states have to be identified.

II-B Off-Policy TD Learning

Following the Bellman update idea [39], sample transitions may be exploited to gradually update the value functions. In other words, at each transition (from one state to the next by taking an action), a one-step update may be performed, best known as a Temporal Difference (TD) update [39]. If the current policy is used to select actions for such an update, the procedure is called “on-policy learning”. For instance, in the SARSA method [40, 41], which is an on-policy learning method, the state-action value function is updated as follows


where the learning rate controls the update step. On-policy methods are usually not very sample efficient, as the value function is learned for the current policy rather than the optimal one. Besides, exploring new states is challenging in on-policy methods, as they follow a particular policy. “Off-policy learning” methods, on the other hand, allow the optimal policy to be updated using information gained from other policies, referred to as behaviour policies. In most cases, a stochastic (e.g., random) policy is selected as the behaviour policy to ensure sufficient exploration of new states. One of the most practiced off-policy methods is Q-learning [41, 42, 44, 46], which updates the value function using the Bellman optimality equation as follows

where the optimal policy appears in the maximization. For the greedy policy, the state value function is related to the state-action value function as follows


It is noteworthy that the action in Eq. (II-B) is selected based on the behaviour policy. Once the system has converged, the optimal policy may be used as follows


This completes description of off-policy TD learning. Next, we focus on value function approximation.
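As a minimal illustration of the off-policy TD update above, the following sketch applies the tabular Q-learning rule to a toy problem; the state and action counts, learning rate, and discount factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.95          # learning rate and discount factor (assumed)
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """One Bellman-optimality (TD) update; the action a may come from
    any behaviour policy, which is what makes the method off-policy."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap with the greedy value
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example transition: state 0, action 1, reward 1.0, next state 2.
q_learning_update(0, 1, 1.0, 2)
```

Because the TD target uses the greedy value of the next state rather than the action actually taken there, the learned value function tracks the optimal policy even while a random behaviour policy generates the samples.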

Ii-C Value Function Approximation

When the number of states is finite and small, it is relatively easy to update the value function by visiting every state of the system (e.g., using Q-learning). When the states are continuous, however, it is not viable to visit all the states, which requires the value function to be approximated. Deep learning techniques provide powerful supervised learning solutions for such purposes by approximating highly nonlinear functions using labelled data. However, neural networks (used in deep learning) are notorious for problems such as overfitting and brittleness to parameter tuning, and should therefore be used with extra care.

Alternatively, the value function may be approximated using basis functions. In this approach, the value function is estimated with a weighted sum of local basis functions, each of which is active in a local region of the state-space. The value function is then formed as follows


where the first factor is a vector of basis functions (described later in Section III-C) and the second is a weight vector. Various basis functions may be used for such approximations, among which RBFs are proven [23] to be one of the most suitable options; they are therefore selected in this work for value function approximation. This completes our background discussion on RL. Next, we present the proposed MM-KTD framework.
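Before proceeding, the weighted-sum approximation of Eq. (8) can be sketched as follows; the RBF centres, the shared covariance, and the weight values are illustrative assumptions for a one-dimensional state.

```python
import numpy as np

centers = np.array([[-1.0], [0.0], [1.0]])   # RBF mean vectors (assumed)
Sigma = np.array([[0.25]])                   # shared RBF covariance (assumed)

def rbf(x, mu):
    """A single Gaussian radial basis function, active near its centre mu."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))

def phi(x):
    """Feature vector: one RBF activation per centre."""
    return np.array([rbf(x, mu) for mu in centers])

w = np.array([0.5, 1.0, 0.2])                # weight vector (to be estimated)
value = phi(np.array([0.0])) @ w             # V(s) approximated as phi(s)^T w
```

Each RBF contributes only near its centre, so the weighted sum acts as a smooth interpolation of the value function across the continuous state-space.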

III MM-KTD: Multiple Model Kalman Temporal Difference

For RL tasks with a continuous state-space, the sample transitions of the system and the gained rewards are used as the data for value function approximation, i.e., for estimating the weights. To avoid overfitting, supervised learning methods such as deep learning require this data to be stored and then used in batches (batch learning) for training of a system (e.g., neural networks). Generally speaking, neural networks have considerably high memory requirements to store the input data, weight parameters, and activations as an input propagates through the network. During training, activations from a forward pass must be retained until they can be used to calculate the error gradients in the backward pass. As an example, the ResNet-50 network has millions of weight parameters and computes millions of activations in the forward pass. Roughly measuring the memory requirement associated with the training stage of ResNet-50 with a mini-batch shows that it typically requires a high-performance GPU and a considerable amount of local DRAM. In contrast, sequential data processing methods such as multiple-model filters [47, 48, 49, 50] can adapt the system with the last measurement (assuming a one-step Markov property), without the need to store the whole measurement history for learning, which results in a much lower memory requirement. Using a Kalman-based approach, the posterior of the weights can be calculated recursively using the Bayes rule as follows


where the former denotes the measurement vector of the system (i.e., the transition information) at the current time step and the latter denotes the set of all measurements up to that time step. Utilizing the probabilistic model in Eq. (9), this paper proposes a Kalman-based off-policy learning scheme, which is detailed below.

III-A Kalman Temporal Difference Method

The optimal value function may be approximated from its one-step approximation using the TD method shown in Eq. (II-B), i.e.,


With a change in the order of the variables, the reward at each time step may be considered a noisy measurement from the system as follows



where the noise term is assumed to be zero-mean Gaussian with a given variance. In this paper, the value function is approximated as discussed in Eq. (8), rendering Eq. (11) in the following form




where is found from

Eq. (12) defines the measurement of the system (the reward) as a linear function of the weight vector, which is to be estimated. Assuming the weight vector to be modelled by a linear dynamic system, i.e.,


where is the transition matrix and is a zero-mean Gaussian noise with the covariance of , the Kalman filtering formulation may be employed to estimate the weights in a minimum MSE sense. To be more precise, the weights and their error covariance are first initialized, i.e.,


Then at each time step, first the weights and their error covariance are predicted as


and then the estimates are updated using the reward observed from the transition from the current state to the next state as follows


The value function for each state-action pair is easily reconstructed through Eq. (8), and the optimal policy is to select the action with the highest state-action value at each state, similar to Eq. (7).
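A minimal sketch of one KTD predict/update cycle for the weight vector, following the predict and update steps described above; the dimension, transition matrix, noise covariances, and mapping vector are illustrative assumptions.

```python
import numpy as np

d = 4
F = np.eye(d)                  # identity state model for the weights
Qn = 1e-4 * np.eye(d)          # process noise covariance (assumed small)
R = 1.0                        # measurement noise variance (assumed)

w = np.zeros(d)                # weight estimate
P = np.eye(d)                  # its error covariance

def ktd_step(w, P, h, r):
    """Predict, then update with the observed reward r (a scalar)."""
    # Predict step
    w_pred = F @ w
    P_pred = F @ P @ F.T + Qn
    # Update step (scalar measurement, so the inverse is a division)
    S = h @ P_pred @ h + R            # innovation variance
    K = P_pred @ h / S                # Kalman gain
    w_new = w_pred + K * (r - h @ w_pred)
    P_new = P_pred - np.outer(K, h) @ P_pred
    return w_new, P_new

h = np.array([1.0, 0.0, 0.0, 0.0])    # mapping vector of the current transition
w, P = ktd_step(w, P, h, r=1.0)
```

Because the measurement is a scalar reward, no matrix inversion is needed; the filter also returns the updated error covariance, whose diagonal quantifies the remaining uncertainty of each weight.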

When the parameters of the filter and the system's initial values are known a priori, the system will provide accurate estimates. However, these values are usually not available and need to be approximated using the measurements obtained from the environment. Among them, the measurement mapping function and the measurement noise variance are of most importance, since they regulate the flow of information from new measurements. The adaptation of these parameters is the topic of the following two subsections. The other filter parameters may be treated as design parameters.

As a final note, we would like to point out that the basic component of the constructed state-space model is the measurement equation (Eq. (12)), which relates the reward to the basis functions. In this paper, following the formulation in Eq. (8), we are dealing with a linear combination of the basis functions with unknown (to be estimated) weight vectors. This formulation results in the linear measurement model of Eq. (12). An interesting direction for future work is to consider a nonlinear measurement model [31] for estimating the value function. With regard to the dynamics of the weight vectors, on the one hand, it is common practice [47] to use the linear model of Eq. (14) when an auxiliary and unknown dynamic is introduced for the variable of interest to be estimated. Intuitively speaking, the dynamical model of Eq. (14) is introduced to make it possible to use state-space-based solutions such as Kalman-based filters or particle filters. As its true nature is unknown a priori, the intuition is to consider it to be constant (hence having an identity-type state model) with changes being controlled by the covariance of the state model noise. An interesting direction for further investigation is to learn the dynamics of the introduced state model as well, e.g., using a separate neural-based module, which is the focus of our ongoing research.

III-B Multiple Model Adaptive Estimation

The Kalman filter is a powerful tool for accurate estimation of hidden variables when the estimation model is fully known. However, full knowledge of the filter parameters and initial values is usually not available in practical scenarios, which leads to deterioration of the system's performance. Adaptive estimation is one powerful way to remedy this problem. In this work, a multiple model adaptation scheme [47, 48, 49, 50] is adopted due to the simplicity and effectiveness of multiple-model solutions. In such schemes, multiple candidates are considered for each of the uncertain parameters and values, and the estimates made based on each set of candidates are weighted by exploiting the measurement likelihood function. The number of candidates increases exponentially with the number of uncertain parameters (curse of dimensionality); therefore, it is desirable to limit the adaptation to the most important parameters. In a Kalman-based estimation framework, the measurement noise variance is one of the most important parameters and is therefore considered for adaptation in this section using the multiple model technique. After observing the next state upon taking an action, the measurement model is calculated. Then, different candidate values for the measurement noise variance are considered in the proposed MM-KTD scheme, and a bank of mode-matched Kalman filters is implemented for adaptation of the observation noise variance. Eqs. (19)-(III-A) are therefore replaced with the following


where the superscript denotes the value of the Kalman filter that exploits the corresponding candidate as its measurement noise covariance. The posterior of each mode-matched filter is weighted based on its associated normalized weight; the weighted posteriors are then combined to form the system's posterior, i.e.,


where the former denotes the number of mode-matched filters within the MM-KTD framework and,

where is the normalized measurement likelihood for the filter and,


Using Eq. (25), the weight and its error covariance are then updated as follows


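The mode-weighting and fusion steps above can be sketched as follows; the Gaussian likelihood form and the candidate innovations are illustrative assumptions.

```python
import numpy as np

def mmae_weights(innovations, innovation_vars, prior_weights):
    """Weight each mode-matched filter by the Gaussian likelihood of
    its innovation; innovations[i] and innovation_vars[i] come from
    the filter running with measurement-noise candidate R_i."""
    lik = np.exp(-0.5 * innovations**2 / innovation_vars) \
          / np.sqrt(2.0 * np.pi * innovation_vars)
    w = prior_weights * lik
    return w / w.sum()                 # normalize to a proper distribution

def fuse(weights, estimates):
    """Combine the per-filter weight-vector estimates into the posterior."""
    return np.einsum('i,ij->j', weights, estimates)

# Two candidate filters: the first explains the measurement better.
mu = mmae_weights(np.array([0.1, 2.0]), np.array([0.5, 0.5]),
                  np.array([0.5, 0.5]))
est = fuse(mu, np.array([[1.0, 0.0], [0.0, 1.0]]))
```

The filter whose noise candidate best explains the observed innovation receives the larger weight, so the fused estimate is dominated by the best-matched mode without ever hard-selecting a single model.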
As a final note, it is interesting to highlight the connections between the above multiple model framework of the proposed MM-KTD and other methods, in particular the gating approach of Mixture of Experts (MoE) [45]. In a general setting, the goal of the weight update step in a multiple-model framework is to find the conditional probability density function (PDF) corresponding to each mode, given all the observations up to the current time, i.e.,


where the denominator is a normalizing factor ensuring a proper PDF. The likelihood term in the numerator of Eq. (30) corresponds to the given mode and is proportional to the exponential term in Eq. (III-B). Therefore, the weight for the filter matched to each mode can be simplified as


In contrast, the MoE typically uses softmax adaptation for Gaussian gating to obtain the associated weight for each model (expert). The weight is obtained by multiplying the input (in our case, the state vector estimated by each model) by a trainable weight matrix and then applying the softmax function as follows


Comparing Eq. (32) with Eq. (31) reveals the potential of the multiple model approach in comparison to the MoE: the multiple-model weights are driven directly by the measurement likelihoods, whereas the MoE gate must learn its weighting matrix. This completes our discussion on multiple model adaptive estimation. Next, we discuss the update process for computation of the basis functions.
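For concreteness, the MoE-style softmax gating discussed above can be sketched as follows; the gating matrix and the input vector are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_gate(W, x):
    """Softmax gating: the weights depend on a trainable matrix W,
    not on the filters' measurement likelihoods."""
    return softmax(W @ x)

W = np.eye(2)                       # trainable gating matrix (assumed)
g = moe_gate(W, np.array([2.0, 1.0]))
```

Unlike the likelihood-based weights of the multiple-model scheme, these gate values must be learned through training of the matrix W, which is the contrast the comparison above draws.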

III-C Basis Function Update

The vector defined in Eq. (13) is the measurement mapping function and must be computed for accurate estimation within Kalman-based filtering schemes. Such prior knowledge, however, is typically not available; the mapping should therefore be adapted toward its correct value. Since it is formed from the basis functions, its adaptation necessitates the adaptation of the basis functions themselves. The vector of basis functions shown in Eq. (8) is formed as follows,


where the number of basis functions per action and the total number of actions determine its length (the arguments of the basis functions are omitted for brevity). Each basis function is selected to be an RBF, which is defined by its mean vector and covariance matrix as follows


where the mean and covariance characterize this radial basis function. Due to the large number of parameters associated with the measurement mapping function, it is reasonable to adapt these parameters through a gradient descent-based adaptation scheme rather than the multiple model method. For this purpose, the Restricted Gradient Descent (RGD) method proposed in [26] is adopted in this work. Using RGD, the gradient of the objective function with respect to the parameters of each basis function is first calculated using partial derivatives. The goal is to minimize the objective function, which is defined as the difference between the estimated value function and its one-step (TD) update, i.e.,


The gradient of the objective function with respect to the parameters of the RBFs is calculated using the chain rule,


where the partial derivatives are calculated using Eqs. (8), (34), and (35) as follows


The mean and covariance of the RBFs are then adapted using the calculated partial derivatives as follows


where the adaptation rates control the step sizes. To make the system more stable, only one of the two updates above is performed at each step, as discussed in [26]. To be more precise, when the size of the covariance is decreasing, the covariances of the RBFs are updated; otherwise, their means are updated via Eq. (43). Using this approach, unlimited expansion of the RBF covariances is avoided.
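The restricted update switch can be sketched as follows; using the covariance trace as the "size" criterion, as well as the rates and the gradient arguments, are illustrative assumptions standing in for the paper's exact quantities.

```python
import numpy as np

eta_mu, eta_sigma = 0.01, 0.01     # adaptation rates (assumed)

def restricted_update(mu, Sigma, grad_mu, grad_Sigma):
    """Apply either the mean update or the covariance update, never both.
    grad_mu and grad_Sigma stand for the chain-rule gradients of the
    TD objective with respect to the RBF parameters."""
    Sigma_candidate = Sigma - eta_sigma * grad_Sigma
    if np.trace(Sigma_candidate) < np.trace(Sigma):
        # Covariance would shrink: accept only the covariance update,
        # which prevents unlimited expansion of the RBFs.
        return mu, Sigma_candidate
    # Otherwise adapt only the mean.
    return mu - eta_mu * grad_mu, Sigma
```

Allowing covariance updates only when they shrink the RBFs, and falling back to mean updates otherwise, keeps the basis functions localized and the adaptation stable.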

III-D Active Learning

In order to ensure sufficient exploration of the states in off-policy learning methods, the behaviour policy is usually chosen to be a stochastic policy (e.g., a random policy). However, such a choice commonly leads to sample inefficiency, which is already a major problem in model-free RL methods. One advantage that the proposed MM-KTD learning framework offers over other optimization-based techniques (e.g., gradient descent-based methods) is the calculation of the uncertainty of the weights, which is directly related to the uncertainty of the value function. This information can then be used at each step to select the actions that lead to the greatest reduction in the uncertainty of the weights. Using the information form of the Kalman filter (the information filter [48]), the information of the weights, denoted by the inverse of the error covariance, is updated as follows


Since only the second term on the right-hand side of Eq. (46) is affected by the choice of the action (as it changes the mapping vector), the action is selected such that this term is maximized. More specifically, the action is obtained by maximizing the information of the weights, i.e.,


The second equality in Eq. (47) holds because the quantity involved is a scalar. The proposed behaviour policy in Eq. (47) differs from that of Reference [31], where a random policy was introduced that favoured actions with less certainty in the value function. Even though favouring actions that reduce the uncertainty of the value function is a good idea, the random nature of such policies makes them less sample efficient than expected. The proposed MM-KTD framework is summarized in Algorithm 1. It is worth further clarifying the computation of Step 6 in Algorithm 1. For learning the optimal policy, the control action is selected based on the behaviour policy, which leads to the greatest reduction in the uncertainty of the weights in Eq. (47). Once the system has converged, the resulting optimal policy is used, based on Eq. (7), to select actions during the testing phase. Because the state-space is continuous, we approximate the value function of Eq. (7) using RBFs. The value function is estimated as a weighted sum of RBFs, whose weight vectors are to be estimated. For estimating the weights, the sample transitions of the system and the gained rewards are used in a Kalman-based approach, as in Eqs. (12) and (14). Therefore, the control action in Step 6 is found based on the proposed active learning behaviour policy in Eqs. (13) and (47). Because the matrix that generates the control action cannot be maximized directly, we maximize its trace. Finally, we note that the proposed MM-KTD algorithm is designed for systems with a finite number of actions. It is worth mentioning that having an infinite number of actions per state is typical of continuous control tasks [42, 43]. Extension of the proposed framework to infinite-dimensional action spaces is an interesting direction for future research.
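The information-maximizing behaviour policy can be sketched as follows; maximizing the scalar h^T P h / R over the candidate actions is a simplified stand-in for the trace criterion described above, and all names and values are illustrative.

```python
import numpy as np

def select_action(actions, P, R, h_of):
    """h_of(a) returns the measurement mapping vector for action a;
    P is the current weight error covariance and R the measurement
    noise variance."""
    def info_gain(a):
        h = h_of(a)
        return (h @ P @ h) / R      # scalar contribution of h h^T / R
    # Pick the action whose measurement adds the most information.
    return max(actions, key=info_gain)

# Two candidate actions: action 0 probes the high-uncertainty weight.
P = np.diag([2.0, 0.1])
h_map = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
a = select_action([0, 1], P, 1.0, lambda a: h_map[a])
```

The chosen action probes the direction of the weight space where the covariance (i.e., the uncertainty of the value function) is largest, which is exactly the exploration behaviour the active learning scheme aims for.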

1:Learning Phase:
2:Set for and
3:Repeat (for each episode):
4: Initialize
5:While do:
7:  Take action , observe
8:  Calculate via Eqs. (33) and (34)
12:  for do:
16:  end for
Compute the combined weight estimate and its error covariance via Eqs. (25) and (III-B)
20:  RBFs Parameters Update:
22:  if then:
23:   Update the RBF means via Eq. (43)
24:  else:
25:   Update the RBF covariances
26:  end if
27:end while
28:Testing Phase:
29:Repeat (for each trial episode):
30:  While do:
32:  Take action , and observe
33:  end while
Algorithm 1 The Proposed MM-KTD Framework

IV Experimental Results

Fig. 1: The Inverted Pendulum platform.

In this section, we evaluate the performance of the proposed MM-KTD framework. To demonstrate the effectiveness and sample efficiency of MM-KTD, which is a model-free and multiple model RL scheme, three popular RL benchmarks, i.e., Inverted Pendulum, Mountain Car, and Lunar Lander, are considered and different comparisons are performed. (Implementations of the MM-KTD model for all three RL benchmarks are available publicly for open access.) For the implementation of the proposed MM-KTD, hardware with a GHz Intel Core i7 processor and GB of RAM was used. To have a fair comparison, we used optimized parameters for the NFQ method and the KTD approach, as specified in [10] and [31], respectively. It is also worth mentioning that one benefit of the proposed MM-KTD framework (which comes from its multiple-model architecture) is its superior ability to deal with scenarios in which information about the underlying parameters is not fully available.

IV-A Inverted Pendulum

In the first experiment, the Inverted Pendulum platform, shown in Fig. 1, is considered. The weight of the base is kg, while the weight of the pendulum is assumed to be kg. The length of the pendulum is m. The pendulum is initially left at an angle (arbitrarily close to the upright position) and its base is then moved to keep its balance. The base may be moved to the left or right with a force of N, or not moved. The episode ends once the pendulum falls below a horizontal line, and the goal is to prevent this from happening for as long as possible. The state of the system consists of the pendulum angle from the upright position and its angular velocity. At each time step, given the state and the taken action, the next state of the system is determined through the dynamics of the system (which are not known to the agent). If the angle of the pendulum in the next state is beyond the horizontal line, then a penalty is fed back to the agent as the reward; otherwise, the reward is kept at its default value.
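The sparse reward signal just described can be sketched as below; the penalty value (-1) and horizontal-line threshold (pi/2) are illustrative assumptions, since the exact numbers did not survive extraction:

```python
import numpy as np

def pendulum_reward(theta, limit=np.pi / 2):
    """Reward described in the text: a penalty is fed back only when
    the pendulum angle passes the horizontal line; otherwise the reward
    stays at its default value. The -1 penalty and pi/2 limit are
    illustrative placeholders."""
    return -1.0 if abs(theta) > limit else 0.0
```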

Fig. 2:

The average (lines) and 95% confidence interval (error bars) of the number of successful trials over 50 trials based on the proposed MM-KTD scheme as compared with other RL methods for the Inverted Pendulum environment.

The proposed MM-KTD is employed in this problem using RBFs and a bias parameter. Since there are three possible actions (i.e., ), the size of the feature vector is . The mean and covariance of the RBFs are initialized as follows



is the identity matrix of appropriate size. The vector of basis functions in Eq. (33) is, therefore, given by


where the value of each basis function is calculated via Eq. (34). The initial values of and are selected as and , respectively, by trial and error to keep the system stable. The time step is selected to be milliseconds and the discount factor is selected to be close to one. The process noise covariance is selected to be small, and the transition matrix is selected to be the identity matrix. The measurement noise covariance candidates are selected from the following set,


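A minimal sketch of the Gaussian RBF feature map used for the value-function approximation, assuming a shared inverse covariance for all centers (the exact centers and widths are those initialized above; the function name and a prepended bias term are our conventions):

```python
import numpy as np

def rbf_features(state, means, cov_inv, bias=True):
    """Gaussian radial-basis features for a continuous state.
    `means` holds one RBF center per row and `cov_inv` is a shared
    inverse covariance; both are illustrative placeholders for the
    initialization given in the text."""
    diffs = means - state                       # (n_rbf, state_dim)
    # phi_i(s) = exp(-0.5 * (s - mu_i)^T Sigma^{-1} (s - mu_i))
    phi = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs))
    return np.concatenate(([1.0], phi)) if bias else phi
```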
The initial weights are selected to be zero, while the initial error covariance is chosen appropriately. Each experiment is started from an angle randomly selected from a normal distribution with zero mean and a given standard deviation. The agent is first trained through a set of episodes, then tested to find whether it can keep the pendulum in balance for at least a given number of seconds. Various numbers of episodes are used for training the system. To highlight the effectiveness of the proposed MM-KTD, the achieved results are compared with those of the KTD method of [31], MM-KTD with no active learning (denoted by MM-KTD (P)), and the Neural Fitted Q (NFQ) learning method of [10], which performs its update session by considering an entire set of transition experiences in an off-line fashion, instead of using an online approach for updating the neural value function. More specifically, the NFQ scheme optimizes a sequence of loss functions using the Rprop algorithm, a supervised batch-learning method, to update the parameters of the Q-network. Reusing the available historical observations without performing any exploration allows NFQ to require a substantially lower amount of training experience for learning optimal policies; in this approach, we simply loop over the stored training steps. NFQ uses the current policy to compute the target value, while DQN uses a separate target network for this purpose. Since the aim of this paper is to restrict the number of learning episodes and thereby reduce the required training data, which is critical for practical application in real scenarios, we focus our comparisons on algorithms that require the fewest learning episodes.
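The batch character of NFQ can be illustrated with a tabular stand-in for the Q-network: each sweep recomputes Bellman targets from the frozen previous Q-function over the entire stored transition set and refits, rather than updating online. The table below replaces the Rprop-trained network of [10] purely for illustration:

```python
import numpy as np

def fitted_q_iteration(transitions, n_states, n_actions, gamma=0.95, iters=100):
    """Batch fitted-Q iteration in the spirit of NFQ. `transitions` is
    a list of (s, a, r, s_next, done) tuples; a Q-table stands in for
    the neural network of [10], keeping only the batch-target
    structure."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        targets = Q.copy()
        for s, a, r, s_next, done in transitions:
            # Bellman target computed from the frozen previous Q.
            targets[s, a] = r + (0.0 if done else gamma * Q[s_next].max())
        Q = targets
    return Q
```

Note how no exploration happens inside the loop: the same stored experience is reused on every sweep, which is exactly what keeps the required amount of training experience low.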

Each testing scenario is repeated several times for each specific number of training episodes to minimize the randomness of the achieved number of successful trials out of the 50 total trials. For evaluation purposes, we formed the average and 95% confidence interval of the number of successful trials over the repetitions. Fig. 2 shows the resulting mean (lines) and 95% confidence interval (error bars) for different numbers of training episodes ranging from 10 to 50, which are also summarized in Table I. As expected, the proposed MM-KTD is the most sample-efficient algorithm of all. In addition, the proposed method offers the highest asymptotic performance in the performed experiments. This superior performance is a result of adaptive estimation of the value function through active Q-learning. The performance of MM-KTD (P) and NFQ is expected to improve with more training episodes and approach that of the proposed MM-KTD; the original KTD, however, cannot reach that level of performance, as it lacks accurate knowledge of the filter's parameters.
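The mean and 95% confidence interval reported in Fig. 2 can be computed as in the following sketch (normal approximation, 1.96 standard errors over the repetitions; the helper name is ours):

```python
import numpy as np

def mean_and_ci95(successes):
    """Mean and 95% confidence interval of successful-trial counts over
    repeated runs, using the normal approximation (1.96 standard
    errors of the mean)."""
    x = np.asarray(successes, dtype=float)
    mean = x.mean()
    half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))  # half-width of the CI
    return mean, (mean - half, mean + half)
```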

# Training Episodes KTD MM-KTD (P) MM-KTD NFQ
10 2.1/50 (4.2%) 12.3/50 (24.6%) 38.5/50 (77%) 1.9/50 (3.8%)
20 1/50 (2%) 17.4/50 (34.8%) 44/50 (88%) 1.5/50 (3%)
30 3.6/50 (7.2%) 17.6/50 (35.2%) 50/50 (100%) 21.6/50 (43.2%)
40 15/50 (30%) 20.7/50 (41.4%) 34/50 (68%) 15/50 (30%)
50 16.5/50 (33%) 22.0/50 (44%) 46.6/50 (93.2%) 23/50 (46%)
TABLE I: The number (percentage) of successful trials of the proposed MM-KTD scheme as compared with other RL methods for the Inverted Pendulum environment.
Fig. 3: The state value function of the greedy policy () after episodes of training using: (a) MM-KTD (b) MM-KTD (P) (c) NFQ, and (d) KTD

The state value functions of the methods after 10 training episodes are depicted in Fig. 3, where the two axes represent the pendulum angle and its angular velocity, respectively. It is worth mentioning that when the state value function has higher values around the vertical (origin) position of the Inverted Pendulum, it means that the selected actions, resulting from the behavioral policy, have led to states near the vertical position, which is the expected result, i.e., the pendulum staying above the horizontal line. Therefore, based on Fig. 3, the proposed active learning method yields higher sample efficiency for the MM-KTD scheme compared to the other techniques. Due to this higher sample efficiency, MM-KTD can quickly concentrate higher values on states closer to the origin, whereas the other tested methods are still far from this stage and, therefore, fail to perform acceptably with a restricted amount of experience.

Finally, to evaluate the stability of the utilized RBFs, we have conducted the following Monte-Carlo (MC) study. In particular, we have fixed the number of RBFs and their means as selected based on Eq. (48). Then, the proposed MM-KTD scheme has been performed on the Inverted Pendulum task over different trials. The entire process has been repeated through an MC simulation using three different values of the widths of the RBFs, and the number of steps to the goal was averaged over the runs. Fig. 4 illustrates the stability of the utilized RBFs. As can be observed from Fig. 4, steady-state performance can be achieved by using RBFs for the value function approximation.

Fig. 4: Stability analysis of the RBFs.

IV-B Mountain Car

In the second experiment, the Mountain Car platform, shown in Fig. 5, is chosen for the evaluation. Mountain Car is a classic RL problem in which the objective is to create an algorithm that learns to climb a steep hill to reach the goal marked by a flag. The car's engine is not powerful enough to drive up the hill without a head start; therefore, the car must drive up the left hill to obtain enough momentum to scale the steeper hill to the right and reach the goal. The state of the system is the position of the car and its velocity. The possible actions are "push left", "no push", and "push right". The road ends at a fixed position on the left, i.e., the position must remain greater than this boundary. The task is to reach the top, where the position must be larger than or equal to the goal position. A penalty reward is given for each step in which the car has not yet mounted the hill to reach the goal, and there is no penalty for climbing the left hill, which acts as a wall when its boundary is reached. Each episode starts with a random position within a given range and no velocity.

Fig. 5: The Mountain Car environment.

Similar to the Inverted Pendulum experiment presented in Section IV-A, RBFs and a bias parameter are used for the proposed MM-KTD algorithm. Therefore, the size of the feature vector is . The mean and covariance of the RBFs are initialized as follows


The initial values of and are selected as and , respectively. The time step is selected to be milliseconds and the discount factor is chosen as . The process noise covariance is set to , and the transition matrix is selected to be the identity matrix (). Candidate values for are selected from the same set as was used for the Inverted Pendulum environment.

Fig. 6: Similar to Fig. 2 but for the Mountain Car environment.

The initial weights are selected to be zero, whereas the initial error covariance is chosen appropriately. The agent is first trained through different numbers of episodes, then tested repeatedly to find whether it can reach the flag within the allotted samples of each episode. Fig. 6 shows the results of KTD [31], MM-KTD, MM-KTD (P), and the NFQ learning method of [10]. As expected, the performance of KTD, MM-KTD (P), and NFQ improves with more training episodes. However, KTD, MM-KTD (P), and NFQ cannot reach the level of performance achieved by MM-KTD even with the lowest number of training episodes in this experiment. Based on the achieved results shown in Table II, MM-KTD is the most sample-efficient of all tested algorithms. Fig. 7 depicts the trajectories of the system's states (position, velocity) for the Mountain Car environment at episode number 50. As can be observed from Fig. 7, each episode starts from zero velocity and a random position within the starting range, and ends when the goal position is reached. Fig. 8 shows the optimal agent's actions resulting from applying the MM-KTD scheme over a combination of positions and velocities. Based on Fig. 8, the agent moves left when its velocity is negative. Most of the time when the velocity is positive, the agent moves to the right, and it sometimes does nothing.

Fig. 7: State space trajectory of the Mountain Car environment at episode number 50.
Fig. 8: Actions resulted from optimal policy of MM-KTD algorithm at the episode number 50 for Mountain Car environment.
# Training Episodes KTD MM-KTD (P) MM-KTD NFQ
10 1.5/50 (3%) 1.9/50 (3.8%) 18/50 (36%) 2.0/50 (4%)
20 1.2/50 (2.4%) 2.3/50 (4.6%) 20.3/50 (40.6%) 3.5/50 (7%)
30 2.6/50 (5.2%) 3.4/50 (6.8%) 22.2/50 (44.4%) 4.72/50 (9.44%)
40 9.2/50 (18.4%) 9/50 (18%) 16.6/50 (33.2%) 6/50 (12%)
50 12.3/50 (24.6%) 12.0/50 (24%) 18.0/50 (36%) 12.6/50 (25.2%)
TABLE II: The number (percentage) of successful trials of the proposed MM-KTD scheme as compared with other RL methods for the Mountain Car environment.

IV-C Lunar Lander

In the third experiment, we focus on the Lunar Lander environment, which is more complicated than the two previous ones. In the Lunar Lander environment, the goal is for the RL agent to learn to land successfully on a landing pad located at a fixed coordinate on a randomly generated lunar surface, as shown in Fig. 9. The state space of the system consists of the agent's position in space, horizontal and vertical velocity, orientation in space, and angular velocity. The agent has four possible actions: do nothing; fire the left engine; fire the main engine; and fire the right engine. The reward for landing on the pad varies with the lander's placement on the pad. If the lander moves away from the landing pad, it loses reward. Each episode terminates when the lander lands or crashes, receiving an additional positive or negative reward, respectively. Each leg-ground contact is worth additional points, while firing the main engine incurs a penalty for each frame. The problem is considered solved if the agent receives a sufficiently high score over consecutive iterations. RBFs of order two are considered for each state variable, resulting in a grid of RBFs for each action; consequently, the size of the feature vector follows accordingly. Based on the expected range of each variable of the state vector, the initial mean and covariance of the RBFs are chosen as follows


The initial values of and are both selected so as to keep the system stable. The discount factor is selected to be close to one, and the process noise covariance is set to a small value. Candidate values for the measurement noise variance are selected from the same set as was used for the Inverted Pendulum environment. As in the two previous environments, the agent is first trained through different numbers of episodes, then tested repeatedly to find whether it can successfully land on the landing pad within the allotted steps. Fig. 10 depicts the results of applying the KTD [31], MM-KTD, MM-KTD (P), and NFQ learning [10] schemes. It can be observed that the proposed approach outperforms its counterparts.
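The feature-vector counting used above (a grid of order-two RBF centers along each state variable, replicated per action, plus a bias term) can be sketched as follows. The six-dimensional state and four actions follow the text; the helper name and the bias convention are our assumptions:

```python
def feature_vector_size(order, state_dim, n_actions, bias=1):
    """Number of value-function features when `order` RBF centers are
    placed along each of `state_dim` state dimensions and the feature
    map is replicated for each action. Illustrative of the counting in
    the text (the exact totals did not survive extraction)."""
    n_rbf = order ** state_dim          # grid of RBF centers per action
    return n_actions * n_rbf + bias
```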

Fig. 9: The Lunar Lander environment.
Fig. 10: Similar to Fig. 2 but for the Lunar Lander environment.

IV-D Parameters Selection

In this sub-section, we provide more details on the values assigned to the underlying parameters in the three experiments presented above. First, it is worth mentioning that one aspect of the proposed multiple-model framework is to reduce the dependence of RL performance on its parameter values. For adaptation of the RBFs' parameters, the initial centers of the RBFs are typically distributed evenly along each dimension of the state vector, leading to a grid of centers determined by the number of state variables and a given order. The initial variance of each state variable is set accordingly. For the Inverted Pendulum and Mountain Car, the order is selected following the KTD approach of [31] for a fair comparison; for the Lunar Lander, the order is selected to be two. Furthermore, the values for and are selected so as to keep the system stable. In addition, based on Fig. 4, it can be observed that the RBFs are fairly stable against the selection of their underlying parameters. The discount factor determines how much weight is given to future rewards in the value function. A discount factor of zero results in state/action values representing only the immediate reward, while a higher discount factor results in values representing the cumulative discounted future reward that an agent is expected to receive (behaving under a given policy). Commonly (as is the case in the implemented environments), a large portion of the reward is earned upon reaching the goal. To prioritize this final success, we expect an acceptable discount factor to be close to 1; in our experiments, it is set to high values.
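The effect of the discount factor on a sparse, goal-terminal reward can be checked with a one-line return computation; the episode length and reward value below are illustrative, not taken from the experiments:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Cumulative discounted return G = sum_t gamma^t * r_t. With a
    sparse terminal reward, a gamma close to 1 keeps the goal reward
    visible from early states, which is why high discount factors are
    used in these experiments."""
    discounts = np.power(gamma, np.arange(len(rewards)))
    return float(np.dot(discounts, rewards))
```

For instance, a reward of 100 delivered 50 steps away is worth about 61 under gamma = 0.99, but is essentially invisible under gamma = 0.5.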

Fig. 11: Success rate during 50 training episodes for Inverted Pendulum environment.

The prior weight vector should be initialized to a value close to the expected optimal value based on historical data, or to a default value, e.g., the zero vector. The prior error covariance quantifies the certainty that the user has in the initialized prior: the smaller it is, the higher the certainty. The process noise covariance is a design parameter; the proposed MM-KTD algorithm systematically allows a set of different values of the process noise covariance to be used. We have conducted a sensitivity-analysis experiment on the Inverted Pendulum environment to evaluate sensitivity to this design parameter for the scenario where only a single initial value can be assigned. Fig. 11 presents the agent's success rate during the training phase over 50 episodes based on three different assignments to the process noise covariance. It can be observed that different values of the process noise covariance affect the time the agent takes to complete the training process. Finally, as stated previously, the measurement noise variance is one of the most important parameters to be identified. To select this parameter, the intuition in the proposed MM-KTD framework is to cover the potential range of the measurement noise variance using mode-matched filters. The number of candidate measurement noise variances is set a-priori; in the experiments, this number of candidates is selected as described earlier.
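The multiple-model adaptation over the candidate measurement-noise variances can be sketched as a likelihood-based re-weighting of the mode probabilities: each mode-matched filter runs with its own candidate variance, and modes whose innovations are more plausible under their assumed noise gain weight. The names and the exact mixing form are illustrative of Section III-B rather than a verbatim implementation:

```python
import numpy as np

def mm_weights_update(weights, innovations, innovation_vars):
    """Re-weight the mode probabilities of a bank of mode-matched
    Kalman filters by the Gaussian likelihood of each filter's
    innovation (scalar case, illustrative sketch)."""
    likes = np.array([
        np.exp(-0.5 * nu ** 2 / s) / np.sqrt(2 * np.pi * s)
        for nu, s in zip(innovations, innovation_vars)
    ])
    w = weights * likes
    return w / w.sum()   # renormalize to a probability vector
```

A filter whose candidate variance matches the true measurement noise produces consistently small normalized innovations, so its weight grows and the fused estimate adapts without a-priori knowledge of the noise statistics.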

V Conclusion

This paper proposes a Multiple-Model Kalman-filter-based Temporal Differences framework, referred to as MM-KTD, which addresses the sample-efficiency, online-learning, prior-information, and memory problems of other RL algorithms. The proposed MM-KTD algorithm is evaluated on three RL experiments. Based on the achieved results, the proposed algorithm attained the highest number of successful trials while requiring considerably fewer training episodes to learn the best policy than its counterparts. Requiring only a restricted number of learning episodes reduces the amount of training data needed, which is critical for practical applications in real scenarios.


  • [1] S. Spanò, G.C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Matta, A. Nannarelli, and M. Re, “An Efficient Hardware Implementation of Reinforcement Learning: The Q-Learning Algorithm,” IEEE Access, vol. 7, pp. 186340-186351, 2019.
  • [2] M. Seo, L.F. Vecchietti, S. Lee, and D. Har, “Rewards Prediction-Based Credit Assignment for Reinforcement Learning With Sparse Binary Rewards,” IEEE Access, vol. 7, pp. 118776-118791, 2019.
  • [3] A. Toubman et al., “Modeling behavior of Computer Generated Forces with Machine Learning Techniques, the NATO Task Group approach,” IEEE Int. Con. Systems, Man, and Cyb. (SMC), Budapest, 2016, pp. 001906-001911.
  • [4] J. J. Roessingh et al., “Machine Learning Techniques for Autonomous Agents in Military Simulations - Multum in Parvo,” IEEE Int. Con. Systems, Man, and Cyb. (SMC), Banff, AB, 2017, pp. 3445-3450.
  • [5] R. Bellman, “The Theory of Dynamic Programming,” RAND Corp Santa Monica CA, Tech. Rep., 1954.
  • [6] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, “Information Theoretic MPC for Model-based Reinforcement Learning,” International Conference on Robotics and Automation (ICRA), 2017.
  • [7] S. Ross and J. A. Bagnell, “Agnostic System Identification for Model-based Reinforcement Learning,” arXiv:1203.1007, 2012.
  • [8] A. Lazaric, M. Restelli, and A. Bonarini, “Reinforcement Learning in Continuous Action Spaces Through Sequential Monte Carlo Methods,” Advances in Neural Information Processing Systems, 2008, pp. 833-840.
  • [9] E. Keogh and A. Mueen, “Curse of Dimensionality,” Encyclopedia of Machine Learning, Springer, 2011, pp. 257-258.
  • [10] M. Riedmiller, “Neural Fitted Q Iteration-first Experiences with a Data Efficient Neural Reinforcement Learning Method,” European Conference on Machine Learning, Springer, 2005, pp. 317-328.
  • [11] Y. Tang, H. Guo, T. Yuan, X. Gao, X. Hong, Y. Li, J. Qiu, Y. Zuo, and J. Wu, “Flow Splitter: A Deep Reinforcement Learning-Based Flow Scheduler for Hybrid Optical-Electrical Data Center Network,” IEEE Access, vol. 7, pp. 129955-129965, 2019.
  • [12] C. Hu, and M. Xu, “Adaptive Exploration Strategy With Multi-Attribute Decision-Making for Reinforcement Learning,” IEEE Access, vol. 8, pp. 32353-32364, 2020.
  • [13] M. Kim, S. Lee, J. Lim, J. Choi, and S.G. Kang, “Unexpected Collision Avoidance Driving Strategy Using Deep Reinforcement Learning,” IEEE Access, vol. 8, pp. 17243-17252, 2020.
  • [14] J. Xie, Z. Shao, Y. Li, Y. Guan, and J. Tan, “Deep reinforcement learning with optimized reward functions for robotic trajectory planning,” IEEE Access, vol. 7, pp. 105669-105679, 2019.
  • [15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” arXiv:1312.5602, 2013.
  • [16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous Control with Deep Reinforcement Learning,” arXiv:1509.02971, 2015.
  • [17] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-critic Methods,” arXiv:1802.09477, 2018.
  • [18] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Offpolicy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” arXiv:1801.01290, 2018.
  • [19] H. Van Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning with Double Q-learning.” AAAI, vol. 2. Phoenix, AZ, 2016, p. 5.
  • [20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” International Conference on Machine Learning, 2016, pp. 1928-1937.
  • [21] S. Haykin, “Neural Networks: A Comprehensive Foundation,” Prentice Hall PTR, 1994.
  • [22] W. T. Miller, F. H. Glanz, and L. G. Kraft, “CMAC: An Associative Neural Network Alternative to Backpropagation,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1561-1567, 1990.
  • [23] R. M. Kretchmar and C. W. Anderson, “Comparison of CMACs and Radial Basis Functions for Local Function Approximators in Reinforcement Learning,” International Conference on Neural Networks, vol. 2. IEEE, 1997, pp. 834-837.
  • [24] G. Konidaris, S. Osentoski, and P. S. Thomas, “Value Function Approximation in Reinforcement Learning using the Fourier Basis,” AAAI, vol. 6, 2011, p. 7.
  • [25] I. Menache, S. Mannor, and N. Shimkin, “Basis Function Adaptation in Temporal Difference Reinforcement Learning,” Annals of Operations Research, vol. 134, no. 1, pp. 215-238, 2005.
  • [26] A. d. M. S. Barreto and C. W. Anderson, “Restricted Gradient-descent Algorithm for Value-function Approximation in Reinforcement Learning,” Artificial Intelligence, vol. 172, no. 4-5, pp. 454-482, 2008.
  • [27] D. Choi and B. Van Roy, “A generalized Kalman filter for Fixed Point Approximation and Efficient Temporal-difference Learning,” Discrete Event Dynamic Systems, vol. 16, no. 2, pp. 207-239, 2006.
  • [28] Y. Engel, “Algorithms and Representations for Reinforcement Learning,” Hebrew University of Jerusalem, 2005.
  • [29] S. J. Bradtke and A. G. Barto, “Linear Least-squares Algorithms for Temporal Difference Learning,” Machine Learning, vol. 22, no. 1-3, pp. 33-57, 1996.
  • [30] M. Geist and O. Pietquin, “Algorithmic Survey of Parametric Value Function Approximation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 6, pp. 845-867, 2013.
  • [31] M. Geist and O. Pietquin, “Kalman Temporal Differences,” Journal of Artificial Intelligence Research, vol. 39, pp. 483-532, 2010.
  • [32] R. Mehra, “On the Identification of Variances and Adaptive Kalman Filtering,” IEEE Transactions on Automatic Control, vol. 15, no. 2, pp. 175-184, 1970.
  • [33] D. G. Lainiotis, “Partitioning: A Unifying Framework for Adaptive Systems, i: Estimation,” Proceedings of the IEEE, vol. 64, no. 8, pp. 1126-1143, 1976.
  • [34] A. Assa and K. N. Plataniotis, “Similarity-based Multiple Model Adaptive Estimation,” IEEE Access, vol. 6, pp. 36632-36644, 2018.
  • [35] K. Doya, K. Samejima, K.-i. Katagiri, and M. Kawato, “Multiple Model-based Reinforcement Learning,” Neural Computation, vol. 14, no. 6, pp. 1347-1369, 2002.
  • [36] T. Kitao, M. Shirai, and T. Miura, “Model Selection based on Kalman Temporal Differences Learning,” IEEE International Conference on Collaboration and Internet Computing (CIC), 2017, pp. 41-47.
  • [37] A. Assa and K. N. Plataniotis, “Adaptive Kalman Filtering by Covariance Sampling,” IEEE Signal Processing Letters, vol. 24, no. 9, pp. 1288-1292, 2017.
  • [38] R. S. Sutton, A. G. Barto, F. Bach et al., “Reinforcement Learning: An Introduction,” MIT Press, 1998.
  • [39] M. Hutter and S. Legg, “Temporal Difference Updating without a Learning Rate,” Advances in Neural Information Processing Systems, 2008, pp. 705-712.
  • [40] R. S. Sutton, “Generalization in Reinforcement Learning: Successful Examples using Sparse Coarse Coding,” Advances in Neural Information Processing Systems, 1996, pp. 1038-1044.
  • [41] W. Xia, C. Di, H. Guo, and S. Li, “Reinforcement Learning Based Stochastic Shortest Path Finding in Wireless Sensor Networks,” IEEE Access, vol. 7, pp.157807-157817, 2019.
  • [42] J. Li, T. Chai, F. L. Lewis, Z. Ding and Y. Jiang, “Off-Policy Interleaved Q-Learning: Optimal Control for Affine Nonlinear Discrete-Time Systems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 5, pp. 1308-1320, May 2019.
  • [43] A. Al-Tamimi, M. Abu-Khalaf, and F. L. Lewis, “Model-Free Q-Learning Designs for Discrete-Time Zero-Sum Games with Application to H-Infinity Control,” Automatica, vol. 43, pp. 473-481, 2007.
  • [44] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
  • [45] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” arXiv preprint, January 2017.
  • [46] Y. Ge, F. Zhu, X. Ling, and Q. Liu, “Safe Q-Learning Method Based on Constrained Markov Decision Processes,” IEEE Access, vol. 7, pp. 165007-165017, 2019.
  • [47] P. Malekzadeh, A. Mohammadi, M. Barbulescu and K. N. Plataniotis, “STUPEFY: Set-Valued Box Particle Filtering for Bluetooth Low Energy-Based Indoor Localization,” IEEE Signal Processing Letters, vol. 26, no. 12, pp. 1773-1777, Dec. 2019.
  • [48] A. Mohammadi and K. N. Plataniotis, “Event-Based Estimation With Information-Based Triggering and Adaptive Update,” IEEE Transactions on Signal Processing, vol. 65, no. 18, pp. 4924-4939, 15 Sept. 2017.
  • [49] A. Mohammadi and K. N. Plataniotis, “Improper Complex-Valued Bhattacharyya Distance,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 5, pp. 1049-1064, May 2016.
  • [50] A. Mohammadi and K. N. Plataniotis, “Distributed Widely Linear Multiple-Model Adaptive Estimation,” IEEE Transactions on Signal and Information Processing over Networks, vol. 1, no. 3, pp. 164-179, Sept. 2015.