I. Introduction
Inspired by the exceptional learning capabilities of human beings, Reinforcement Learning (RL) systems have emerged that aim to form optimal control policies by relying solely on knowledge of an agent's past interactions with its environment. Such a learning approach is particularly beneficial because, unlike supervised learning methods, an RL system does not require the availability of carefully labelled data, which is typically difficult to acquire. Instead, to choose the best possible action at each time step, an RL system relies on the reward (feedback) received for each move [1, 2, 3, 4]. Given their phenomenal potential, there has been an increasing surge of interest in the advancement of RL systems, which is also the focus of this paper.

When an accurate model of the environment is available, methods such as dynamic programming [5] may be used to find the optimal policy. An accurate model of the environment, however, is rarely available in practice, leading to the need to resort to model-based or model-free RL approaches. In model-based RL approaches, the goal is to learn a model of the system from its past transitions and then apply a model-based control scheme (e.g., a model predictive controller [6]) to reach the desired goal [7]. While model-based RL approaches are proven to be sample efficient, their performance typically degrades over time due to the uncertainties and bias in the constructed model of the environment. On the other hand, model-free approaches directly acquire the optimal policy for the system, without developing a model of the system. Due to their superior asymptotic performance, model-free approaches are commonly preferred over their model-based counterparts, especially when sampling the agent's past trajectories is inexpensive. This paper focuses on the development of a model-free RL approach to achieve the desired policy for the underlying system.
A common practice in model-free RL schemes is to calculate the state (or, alternatively, state-action) value function for each state. While this can be done with relative ease for a system with a limited number of states (e.g., via Monte Carlo sampling [8]), it is impractical to find the value function for each state when the number of states is large or the states are continuous (due to the curse of dimensionality [9]). In such cases, which entail most realistic applications, the value function needs to be approximated. In a large group of works, e.g., [10, 11, 12, 13, 14], artificial neural networks were employed to approximate the value function over the entire state-space. Despite a few successful trials, most early attempts of this kind were not very successful due to the overfitting problem. This problem was overcome in [15], where simple yet efficient techniques were employed to avoid overfitting. This work was later extended to systems with continuous actions [16]. However, it was later shown that such approaches are highly sensitive to parameter selection and therefore require further revision to be applicable to a larger variety of problems. Despite continuous attempts [17, 18, 19, 20] to overcome such problems, the training of the neural network has remained an open problem in the field.

Another well-practiced approach to approximating the value function is to employ a set of weighted local estimators and convert the approximation problem into a weight estimation problem. Various local estimators were proposed in the literature, among which Radial Basis Functions (RBFs)
[21] and Cerebellar Model Articulation Controllers (CMACs) [22] are the most popular. It was shown that RBFs are more suitable than CMACs for systems with continuous states, due to their continuous nature [23]. More recently, the Fourier basis was proposed as the local estimator; however, its performance was shown to be comparable to that of RBFs [24]. Due to these advantages, we exploit RBFs for value function approximation. The parameters of the RBFs are usually computed based on prior knowledge of the problem. However, it is possible to adapt these parameters using the observed transitions to enhance the autonomy of the method. Cross-entropy and gradient descent methods were proposed in [25] for this purpose. The stability of the latter was later enhanced using a restrictive technique in [26], which is adopted in this work.

Once the structure of the value function is determined, a suitable algorithm has to be selected to train the value function approximation. Various approaches were proposed in the literature to gradually bring the value approximations close to their real values in all states. These methods are generally categorized as bootstrapping approaches (such as the Fixed-Point Kalman Filter, or FPKF [27]), residual approaches (e.g., Gaussian Process Temporal Difference, or GPTD [28]), and projected fixed-point approaches (e.g., Least Squares Temporal Difference, or LSTD [29]). A comprehensive comparison of these approaches is available in [30]. Among these approaches, Kalman Temporal Difference (KTD) [31] stands out, as it provides not only the Minimum Mean Square Error (MMSE) estimate of the value function (given the selected structure), but also its uncertainty in terms of the error covariance, which can be exploited further to reach higher sample efficiency [31]. It was also shown in [31] that GPTD is a special case of KTD. However, like other Kalman-based schemes, KTD requires the parameters of the filter (such as the process and measurement noise covariances) to be known a priori, which is not the case in most practical scenarios.

Filter parameter estimation is a well-studied problem for Kalman filters and has led to numerous adaptive schemes in the literature. These methods may be roughly categorized as innovation-based adaptive methods (e.g., [32]) and multiple model adaptive schemes (e.g., [33]). The latter has the advantage of fast adaptability when the mode of the system is changing. Most recently, effective methods in Multiple Model Adaptive Estimation (MMAE) were discussed in [34], where various averaging and weighting schemes were proposed and compared to achieve superior results. Multiple model approaches have previously been used in RL problems. A model-based multiple model approach was introduced in [35], where the uncertainty of the system model was addressed using a set of models for the system. Moreover, a model selection approach was proposed in [36] for a multiple model KTD to overcome the filter parameter uncertainty problem, yet the models and the selection scheme were naive and therefore not suitable for more general tasks.
In this work, various adaptive schemes are considered for value function approximation using KTD to address the parameter uncertainty problem. In particular, the paper makes the following contributions:

An innovative Multiple Model Adaptive Estimation (MMAE) scheme is adopted within the RL process, referred to as Multiple Model Kalman Temporal Differences (MM-KTD), to compensate for the lack of information about the measurement noise covariance, the most important parameter of the filter [37], while the mean and covariance of the RBFs are updated by exploiting the restricted gradient descent method of [26].

To improve the sample efficiency of the proposed MM-KTD, off-policy Q-learning is adopted to learn the optimal policy from the behaviour policy.

Within the proposed MM-KTD framework, the estimated uncertainty of the value functions is exploited to form the behaviour policy, leading to more visits to less certain values and, therefore, improving the overall learning sample efficiency.
The proposed approach is tested over multiple RL benchmarks, showing desirable results compared to deep learning-based counterparts.
The remainder of the paper is organized as follows: Section II briefly outlines the basic techniques of RL. Section III presents the proposed MM-KTD framework. Experimental results on three RL benchmarks are presented in Section IV, illustrating the effectiveness and superiority of the proposed MM-KTD framework. Finally, Section V concludes the paper.
II. Problem Formulation
In this section, we briefly present the required background on RL and formulate the problem at hand. Throughout the paper, the following notation is used: a non-bold letter $x$ denotes a scalar variable, a lowercase bold letter $\boldsymbol{x}$ represents a vector, and a capital bold letter $\boldsymbol{X}$ denotes a matrix. The transpose of a matrix $\boldsymbol{X}$ is denoted by $\boldsymbol{X}^T$.

II-A. Reinforcement Learning (RL)
In a typical RL scenario, one deals with an agent placed in an unknown environment. At each time, the agent is considered to be in a specific state within the available set of states, denoted by $\mathcal{S}$, and can take an action from the action set $\mathcal{A}$ using a specific policy $\pi$, which takes the agent to another state. More specifically, at time step $k$, the agent at state $s_k$ takes an action $a_k$ exploiting the policy $\pi$, which takes the system to the state $s_{k+1}$ with the transition probability $P(s_{k+1} \mid s_k, a_k)$, resulting in a reward $r_k$. The tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, for $\gamma$ ($0 \le \gamma \le 1$) denoting the discount factor, defines the process known as the Markov Decision Process (MDP), which defines the domain of work in RL. The agent starts at an initial state denoted by $s_0$ and continues to explore the states until it reaches a terminal state, where an episode is completed.

Generally speaking, the RL goal is to find, through a number of experimental episodes, a policy that maximizes the expected sum of discounted rewards over the future states. In achieving so, it is useful to define the state value function as
$$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{k} \,\Big|\, s_0 = s\Big] \qquad (1)$$
where $\mathbb{E}_{\pi}[\cdot]$ represents the expectation over trajectories induced by the policy $\pi$. If the value function (and also the transition probability) is known for a policy, the policy may be improved by selecting, at each step, the greedy action leading to the next state with the highest value (policy improvement) [38]. If the transition probability is unknown, it is more useful to exploit the state-action value function, defined as follows
$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{k} \,\Big|\, s_0 = s,\ a_0 = a\Big] \qquad (2)$$
Using the state-action value function ($Q^{\pi}(s,a)$) defined in Eq. (2) has the advantage of allowing direct selection of the greedy action, as compared to the state value function ($V^{\pi}(s)$) defined in Eq. (1), for which the successor states have to be identified.
II-B. Off-Policy TD Learning
Following the Bellman update idea [39], sample transitions may be exploited to gradually update the value functions. In other words, at each transition (from one state to the next by taking an action), a one-step update may be performed, best known as a Temporal Difference (TD) update [39]. If the current policy is used to select actions for such an update, the procedure is called "on-policy learning". For instance, in the SARSA method [40, 41], which is an on-policy learning method, the state-action value function is updated as follows
$$Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \alpha \big[ r_k + \gamma Q(s_{k+1}, a_{k+1}) - Q(s_k, a_k) \big] \qquad (4)$$
where $\alpha$ is the learning rate. On-policy methods are usually not very sample efficient, as the value function is learned for the current policy and not for the optimal one. Besides, exploring new states is challenging in on-policy methods as they follow a particular policy. On the other hand, "off-policy learning" methods allow updating the optimal policy using information gained from other policies, referred to as behaviour policies. In most cases, a stochastic (e.g., random) policy is selected as the behaviour policy to ensure enough exploration of new states. One of the most practiced off-policy methods is known as Q-learning [44, 46, 41, 42], which updates the value function using the Bellman optimality equation as follows

$$Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \alpha \big[ r_k + \gamma \max_{a} Q(s_{k+1}, a) - Q(s_k, a_k) \big] \qquad \text{(II-B)}$$

where the maximization corresponds to the optimal policy $\pi^{*}$. For the greedy policy, the state value function is related to the state-action value function as follows
$$V(s) = \max_{a} Q(s, a) \qquad (6)$$
It is noteworthy to mention that the action $a_k$ in Eq. (II-B) is selected based on the behaviour policy. Once the system has converged, the optimal policy may be used as follows
$$\pi^{*}(s) = \arg\max_{a} Q(s, a) \qquad (7)$$
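The off-policy update of Eq. (II-B) and the greedy policy extraction of Eq. (7) can be sketched for a small finite MDP; the toy two-state environment and the learning-rate/discount values below are illustrative, not from the paper:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One off-policy TD update with the Bellman optimality target."""
    td_target = r + gamma * np.max(Q[s_next])   # max over actions: off-policy
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def greedy_policy(Q, s):
    """Eq. (7): after convergence, act greedily with respect to Q."""
    return int(np.argmax(Q[s]))

# Toy 2-state, 2-action chain: action 1 in state 0 yields reward 1 and
# moves to the terminal state 1 (whose values stay at zero).
Q = np.zeros((2, 2))
for _ in range(200):
    Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=1)
print(greedy_policy(Q, 0))  # -> 1
```

Note that the update target uses the maximizing action at $s_{k+1}$ regardless of which behaviour policy produced the transition, which is exactly what makes the method off-policy.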
This completes the description of off-policy TD learning. Next, we focus on value function approximation.
II-C. Value Function Approximation
When the number of states is finite and small, it is relatively easy to update the value function by visiting every state of the system (e.g., using Q-learning). When the states are continuous, however, it is not viable to visit all the states, and the value function must therefore be approximated. Deep learning techniques provide powerful supervised learning solutions for such purposes by approximating highly nonlinear functions using labelled data. However, the neural networks used in deep learning are notorious for problems such as overfitting and brittleness to parameter tuning and should, therefore, be used with extra care.
Alternatively, the value function may be approximated using basis functions. In this approach, the value function is estimated with a weighted sum of local basis functions, each of which is active in a local region of the state-space. The value function is then formed as follows
$$Q(s,a) = \boldsymbol{\phi}(s,a)^{T} \boldsymbol{\theta} \qquad (8)$$
where $\boldsymbol{\phi}(s,a)$ is a vector of basis functions (described later in Section III-C) and $\boldsymbol{\theta}$ is a weight vector. Various basis functions may be used for such approximations, among which RBFs are proven [23] to be one of the most suitable options; they are therefore selected in this work for value function approximation. This completes our background discussion on RL. Next, we present the proposed MM-KTD framework.
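The linear approximation of Eq. (8) with Gaussian RBF features can be sketched as follows; the centres, shared inverse covariance, weight values, and one-block-per-action feature layout below are illustrative assumptions:

```python
import numpy as np

def rbf_features(s, centers, inv_cov):
    """Evaluate Gaussian RBFs at state s (one value per centre)."""
    diffs = centers - s                                  # (N, dim)
    quad = np.einsum('ni,ij,nj->n', diffs, inv_cov, diffs)
    return np.exp(-0.5 * quad)

def q_value(s, a, centers, inv_cov, theta, n_actions):
    """Eq. (8): Q(s,a) = phi(s,a)^T theta, one RBF block per action."""
    n = len(centers)
    phi = np.zeros(n * n_actions)
    phi[a * n:(a + 1) * n] = rbf_features(s, centers, inv_cov)
    return phi @ theta, phi

centers = np.array([[-1.0], [0.0], [1.0]])   # 3 RBFs over a 1-D state-space
inv_cov = np.array([[4.0]])                  # shared inverse covariance
theta = np.ones(3 * 2)                       # weights for 2 actions
q, phi = q_value(np.array([0.0]), 0, centers, inv_cov, theta, 2)
```

Each basis function is active only near its centre, so the weighted sum acts as a smooth local interpolator over the continuous state-space.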
III. MM-KTD: Multiple Model Kalman Temporal Difference
For RL tasks with a continuous state-space, the sample transitions of the system and the gained rewards are used as the data for value function approximation, i.e., to estimate the weights. To avoid overfitting, supervised learning methods such as deep learning require this data to be stored and then used in batches (batch learning) for training of a system (e.g., neural networks). Generally speaking, neural networks have considerably high memory requirements to store the input data, weight parameters, and activations as an input propagates through the network. During training, activations from a forward pass must be retained until they can be used to calculate the error gradients in the backward pass. As an example, the 50-layer ResNet-50 network has roughly 26 million weight parameters and computes millions of activations in the forward pass; a rough measurement of the memory associated with training ResNet-50 on a typical mini-batch shows that it requires a high-performance GPU with several gigabytes of local DRAM. In contrast, sequential data processing methods such as multiple-model filters [47, 48, 49, 50] can adapt the system with the last measurement (assuming a one-step Markov property), without the need to store the whole measurement history for learning, which results in a much smaller memory requirement. Using a Kalman-based approach, the posterior of the weights can be calculated recursively using the Bayes rule as follows
$$p(\boldsymbol{\theta}_k \mid \boldsymbol{z}_{1:k}) \propto p(\boldsymbol{z}_k \mid \boldsymbol{\theta}_k)\, p(\boldsymbol{\theta}_k \mid \boldsymbol{z}_{1:k-1}) \qquad (9)$$
where $\boldsymbol{z}_k$ is the measurement vector of the system (i.e., the transition information) at time step $k$ and $\boldsymbol{z}_{1:k}$ is the set of all measurements from time step $1$ to time step $k$. Utilizing the probabilistic model in Eq. (9), the paper proposes a Kalman-based off-policy learning scheme, which is detailed below.
III-A. Kalman Temporal Difference Method
The optimal value function may be approximated from its one-step approximation using the TD method shown in Eq. (II-B), i.e.,

$$Q(s_k, a_k) \approx r_k + \gamma \max_{a} Q(s_{k+1}, a) \qquad (10)$$
With a change in the order of the variables, the reward at time step $k$ may be considered as a noisy measurement of the system as follows

$$r_k = Q(s_k, a_k) - \gamma \max_{a} Q(s_{k+1}, a) + n_k \qquad (11)$$

where $n_k$ is assumed to be a zero-mean Gaussian noise with variance $R$. In this paper, the value function is approximated as discussed in Eq. (8), rendering Eq. (11) in the following form

$$r_k = \big(\boldsymbol{\phi}(s_k, a_k) - \gamma\, \boldsymbol{\phi}(s_{k+1}, a^{*})\big)^{T} \boldsymbol{\theta}_k + n_k = \boldsymbol{h}_k^{T} \boldsymbol{\theta}_k + n_k \qquad (12)$$

considering

$$\boldsymbol{h}_k = \boldsymbol{\phi}(s_k, a_k) - \gamma\, \boldsymbol{\phi}(s_{k+1}, a^{*}) \qquad (13)$$

where $a^{*}$ is found from $a^{*} = \arg\max_{a} Q(s_{k+1}, a)$.
Eq. (12) defines the measurement of the system (the reward) as a linear function of the weight vector (i.e., $\boldsymbol{\theta}_k$), which is to be estimated. Assuming the weight vector to be modelled by a linear dynamic system, i.e.,
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{F}\, \boldsymbol{\theta}_k + \boldsymbol{v}_k \qquad (14)$$
where $\boldsymbol{F}$ is the transition matrix and $\boldsymbol{v}_k$ is a zero-mean Gaussian noise with covariance $\boldsymbol{\Sigma}_v$, the Kalman filtering formulation may be employed to estimate the weights in a minimum MSE sense. To be more precise, the weights and their error covariance are first initialized, i.e.,
$$\hat{\boldsymbol{\theta}}_{0|0} = \boldsymbol{\theta}_0 \qquad (15)$$
$$\boldsymbol{P}_{0|0} = \boldsymbol{P}_0 \qquad (16)$$
Then, at each time step, the weights and their error covariance are first predicted as
$$\hat{\boldsymbol{\theta}}_{k|k-1} = \boldsymbol{F}\, \hat{\boldsymbol{\theta}}_{k-1|k-1} \qquad (17)$$
$$\boldsymbol{P}_{k|k-1} = \boldsymbol{F}\, \boldsymbol{P}_{k-1|k-1}\, \boldsymbol{F}^{T} + \boldsymbol{\Sigma}_v \qquad (18)$$
and then the estimates are updated using the reward observed from the transition from state $s_k$ to the next state ($s_{k+1}$) as follows
$$\boldsymbol{K}_k = \boldsymbol{P}_{k|k-1}\, \boldsymbol{h}_k \big( \boldsymbol{h}_k^{T} \boldsymbol{P}_{k|k-1} \boldsymbol{h}_k + R \big)^{-1} \qquad (19)$$
$$\hat{\boldsymbol{\theta}}_{k|k} = \hat{\boldsymbol{\theta}}_{k|k-1} + \boldsymbol{K}_k \big( r_k - \boldsymbol{h}_k^{T} \hat{\boldsymbol{\theta}}_{k|k-1} \big) \qquad (20)$$
$$\boldsymbol{P}_{k|k} = \boldsymbol{P}_{k|k-1} - \boldsymbol{K}_k \boldsymbol{h}_k^{T} \boldsymbol{P}_{k|k-1} \qquad (21)$$
The value function for each state-action pair is easily reconstructed through Eq. (8), and the optimal policy is to select the action with the highest state-action value at each state, similar to Eq. (7).
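One predict-update cycle of the KTD recursion in Eqs. (17)-(21) can be sketched as follows; the feature vectors, reward, and parameter values are illustrative, and a scalar measurement noise variance is assumed as in the text:

```python
import numpy as np

def ktd_step(theta, P, phi_sa, phi_next_best, r, F, Sigma_v, R, gamma=0.99):
    """One KTD cycle: predict (Eqs. (17)-(18)), then update (Eqs. (19)-(21))."""
    # Predict
    theta_pred = F @ theta
    P_pred = F @ P @ F.T + Sigma_v
    # Measurement mapping (Eq. (13)): h_k = phi(s,a) - gamma * phi(s', a*)
    h = phi_sa - gamma * phi_next_best
    # Update
    S = h @ P_pred @ h + R                               # innovation variance
    K = P_pred @ h / S                                   # Kalman gain, Eq. (19)
    theta_new = theta_pred + K * (r - h @ theta_pred)    # Eq. (20)
    P_new = P_pred - np.outer(K, h) @ P_pred             # Eq. (21)
    return theta_new, P_new

n = 4
theta, P = np.zeros(n), 10.0 * np.eye(n)
F, Sigma_v, R = np.eye(n), 1e-3 * np.eye(n), 1.0
phi_sa = np.array([1.0, 0.0, 0.0, 0.0])      # features of (s_k, a_k)
phi_next = np.array([0.0, 1.0, 0.0, 0.0])    # features of (s_{k+1}, a*)
theta, P = ktd_step(theta, P, phi_sa, phi_next, r=1.0,
                    F=F, Sigma_v=Sigma_v, R=R)
```

A single observed reward simultaneously raises the weight tied to the visited feature and lowers the discounted successor weight, while the error covariance shrinks along the direction of $\boldsymbol{h}_k$.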
When the parameters of the filter and the system's initial values (i.e., $\boldsymbol{F}$, $\boldsymbol{\Sigma}_v$, $R$, $\boldsymbol{h}_k$, $\boldsymbol{\theta}_0$, and $\boldsymbol{P}_0$) are known a priori, the system will provide accurate estimates. However, these values are usually not available and need to be approximated using the measurements obtained from the environment. Among these, the measurement mapping function ($\boldsymbol{h}_k$) and the measurement noise variance ($R$) are of most importance, since they regulate the flow of information from new measurements. The adaptation of these parameters is the topic of the following two subsections. The other filter parameters may be treated as design parameters.
As a final note, we would like to point out that the basic component of the constructed state-space model is the measurement equation (Eq. (12)), which relates the reward to the basis functions. In this paper, following the formulation in Eq. (8), we are dealing with a linear combination of the basis functions with an unknown (to be estimated) weight vector $\boldsymbol{\theta}_k$. This formulation results in the linear measurement model of Eq. (12). An interesting direction for future work is to consider a nonlinear measurement model [31] for estimating the value function. With regards to the dynamics of the weight vector, on the one hand, it is a common practice [47] to use the linear model of Eq. (14) when an auxiliary and unknown dynamic is introduced for the variable of interest to be estimated. Intuitively speaking, the dynamical model of Eq. (14) is introduced to make it possible to use state-space based solutions such as Kalman-based filters or particle filters. As its true nature is unknown a priori, the intuition is to consider it to be constant (hence having an identity-type state model, $\boldsymbol{F} = \boldsymbol{I}$), with changes being controlled by the covariance of the state model noise ($\boldsymbol{\Sigma}_v$). An interesting direction for further investigation is to learn the dynamics of the introduced state model as well, e.g., using a separate neural-based module, which is the focus of our ongoing research.
III-B. Multiple Model Adaptive Estimation
The Kalman filter is a powerful tool for accurate estimation of hidden variables when the estimation model is fully known. However, full knowledge of the filter parameters and initial values is usually not available in practical scenarios, which leads to deterioration of the system performance. Adaptive estimation is one powerful way to remedy this problem. In this work, a multiple model adaptation scheme [47, 48, 49, 50] is adopted due to the simplicity and effectiveness of multiple-model solutions. In such schemes, multiple candidates are considered for each of the uncertain parameters and values, and the estimates made based on each set of candidates are weighted by exploiting the measurement likelihood function. The number of candidates increases exponentially with the number of uncertain parameters (curse of dimensionality); therefore, it is desirable to limit the adaptation to the most important parameters. In a Kalman-based estimation framework, the measurement noise variance ($R$) is one of the most important parameters and is, therefore, considered for adaptation in this section using the multiple model technique. After observing $s_{k+1}$ by taking action $a_k$, the measurement model ($\boldsymbol{h}_k$) is calculated. Then, $M$ different values ($R_i$, for $1 \le i \le M$) are considered for the measurement noise variance in the proposed MM-KTD scheme, and a bank of mode-matched Kalman filters is implemented for adaptation of the observation noise variance. Eqs. (19)-(21) are, therefore, replaced with the following
$$\boldsymbol{K}_k^{i} = \boldsymbol{P}_{k|k-1}\, \boldsymbol{h}_k \big( \boldsymbol{h}_k^{T} \boldsymbol{P}_{k|k-1} \boldsymbol{h}_k + R_i \big)^{-1} \qquad (22)$$
$$\hat{\boldsymbol{\theta}}_{k|k}^{i} = \hat{\boldsymbol{\theta}}_{k|k-1} + \boldsymbol{K}_k^{i} \big( r_k - \boldsymbol{h}_k^{T} \hat{\boldsymbol{\theta}}_{k|k-1} \big) \qquad (23)$$
$$\boldsymbol{P}_{k|k}^{i} = \boldsymbol{P}_{k|k-1} - \boldsymbol{K}_k^{i} \boldsymbol{h}_k^{T} \boldsymbol{P}_{k|k-1} \qquad (24)$$
where superscript $i$ denotes the value for the $i^{\text{th}}$ Kalman filter, which exploits $R_i$ as its measurement noise variance. The posterior of each mode-matched filter is weighted based on its associated normalized weight; the weighted posteriors are then combined to form the system's posterior, i.e.,
$$p(\boldsymbol{\theta}_k \mid \boldsymbol{z}_{1:k}) = \sum_{i=1}^{M} \omega_k^{i}\, \mathcal{N}\big( \hat{\boldsymbol{\theta}}_{k|k}^{i},\ \boldsymbol{P}_{k|k}^{i} \big) \qquad (25)$$
where $M$ is the number of mode-matched filters within the MM-KTD framework, and

$$\omega_k^{i} = \frac{\omega_{k-1}^{i}\, \Lambda_k^{i}}{\sum_{j=1}^{M} \omega_{k-1}^{j}\, \Lambda_k^{j}} \qquad (26)$$

where $\Lambda_k^{i}$ is the normalized measurement likelihood for the $i^{\text{th}}$ filter, and

$$\Lambda_k^{i} = \mathcal{N}\big( r_k;\ \boldsymbol{h}_k^{T} \hat{\boldsymbol{\theta}}_{k|k-1},\ \boldsymbol{h}_k^{T} \boldsymbol{P}_{k|k-1} \boldsymbol{h}_k + R_i \big) \qquad (27)$$
Using Eq. (25), the weight estimate and its error covariance are then updated as follows
$$\hat{\boldsymbol{\theta}}_{k|k} = \sum_{i=1}^{M} \omega_k^{i}\, \hat{\boldsymbol{\theta}}_{k|k}^{i} \qquad (28)$$
$$\boldsymbol{P}_{k|k} = \sum_{i=1}^{M} \omega_k^{i} \Big[ \boldsymbol{P}_{k|k}^{i} + \big( \hat{\boldsymbol{\theta}}_{k|k}^{i} - \hat{\boldsymbol{\theta}}_{k|k} \big) \big( \hat{\boldsymbol{\theta}}_{k|k}^{i} - \hat{\boldsymbol{\theta}}_{k|k} \big)^{T} \Big] \qquad (29)$$
As a final note, it is interesting to highlight the connections between the above multiple model framework of the proposed MM-KTD and other methods, in particular the gating approach of the Mixture of Experts (MoE) [45]. In a general setting, the goal of the weight update step in a multiple-model framework is to find the conditional probability density function (PDF) corresponding to mode $m_i$, for ($1 \le i \le M$), given all the observations up to the current time, i.e.,

$$p(m_i \mid \boldsymbol{z}_{1:k}) = \frac{ p(\boldsymbol{z}_k \mid m_i, \boldsymbol{z}_{1:k-1})\, p(m_i \mid \boldsymbol{z}_{1:k-1}) }{ \sum_{j=1}^{M} p(\boldsymbol{z}_k \mid m_j, \boldsymbol{z}_{1:k-1})\, p(m_j \mid \boldsymbol{z}_{1:k-1}) } \qquad (30)$$

where the denominator is a normalizing factor ensuring that $p(m_i \mid \boldsymbol{z}_{1:k})$ is a proper PDF. The term $p(\boldsymbol{z}_k \mid m_i, \boldsymbol{z}_{1:k-1})$ in the numerator of Eq. (30) is the likelihood function of mode $m_i$, which is proportional to the exponential term in Eq. (27). Therefore, the corresponding weight for the filter matched to mode $m_i$, for ($1 \le i \le M$), can be simplified as
$$\omega_k^{i} \propto \omega_{k-1}^{i}\, \exp\Big( -\frac{\big( r_k - \boldsymbol{h}_k^{T} \hat{\boldsymbol{\theta}}_{k|k-1} \big)^{2}}{2 \big( \boldsymbol{h}_k^{T} \boldsymbol{P}_{k|k-1} \boldsymbol{h}_k + R_i \big)} \Big) \qquad (31)$$
On the contrary, the MoE typically uses a Softmax adaptation for Gaussian gating to obtain the associated weight, denoted here by $g_k^{i}$, for each model (expert). The weight is obtained by multiplying the input (in our case, the state vector estimated by model $i$, for ($1 \le i \le M$)) by a trainable weight vector and then applying the Softmax function as follows

$$g_k^{i} = \frac{ \exp\big( \boldsymbol{w}_i^{T}\, \hat{\boldsymbol{\theta}}_{k|k}^{i} \big) }{ \sum_{j=1}^{M} \exp\big( \boldsymbol{w}_j^{T}\, \hat{\boldsymbol{\theta}}_{k|k}^{j} \big) } \qquad (32)$$
Comparing Eq. (32) with Eq. (31) reveals the potential of the multiple model approach in comparison with the MoE: the weights in Eq. (31) are driven directly by the measurement likelihood, without requiring a separately trained gating network. This completes our discussion on multiple model adaptive estimation. Next, we discuss the update process for computation of the basis functions.
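The contrast between the likelihood-driven weights of Eq. (31) and the trainable Softmax gating of Eq. (32) can be made concrete with a toy sketch; the innovation values and the untrained (equal) gating vectors below are illustrative:

```python
import numpy as np

def mm_weights(prev_w, innovations, S):
    """Eq. (31): weights driven by each mode's measurement likelihood."""
    lik = np.exp(-0.5 * innovations**2 / S) / np.sqrt(2 * np.pi * S)
    w = prev_w * lik
    return w / w.sum()

def moe_gating(expert_outputs, gating_vectors):
    """Eq. (32): Softmax over a trainable linear map of each expert's output."""
    logits = np.array([w @ x for w, x in zip(gating_vectors, expert_outputs)])
    e = np.exp(logits - logits.max())            # numerically stable Softmax
    return e / e.sum()

# Three modes: the middle one explains the reward best (smallest innovation).
innovations = np.array([2.0, 0.1, 1.5])
S = np.array([1.0, 1.0, 1.0])
w = mm_weights(np.ones(3) / 3, innovations, S)
print(int(np.argmax(w)))  # -> 1

# With untrained (equal) gating vectors, MoE gating stays uniform regardless
# of how well each expert explains the data.
g = moe_gating([np.ones(2)] * 3, [np.ones(2)] * 3)
```

The multiple-model weights respond immediately to which mode best explains the latest measurement, whereas the Softmax gate only differentiates the experts after its parameters have been trained.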
III-C. Basis Function Update
The vector $\boldsymbol{h}_k$, defined in Eq. (13), is the measurement mapping function and is required for accurate estimation within the context of Kalman-based filtering schemes. Such prior knowledge, however, is typically not available; the mapping should therefore be adapted towards its correct value. Since $\boldsymbol{h}_k$ is formed from the basis functions, its adaptation necessitates the adaptation of the basis functions. The vector of basis functions shown in Eq. (8) is formed as follows
$$\boldsymbol{\phi} = \big[ \underbrace{\phi_1, \ldots, \phi_N}_{\text{action } 1},\ \ldots,\ \underbrace{\phi_1, \ldots, \phi_N}_{\text{action } |\mathcal{A}|} \big]^{T} \qquad (33)$$

where $N$ is the number of basis functions per action and $|\mathcal{A}|$ is the total number of actions (the arguments of the basis functions are omitted for brevity). Each basis function is selected as an RBF, which is defined based on its mean vector and covariance matrix as follows

$$\phi_j(\boldsymbol{s}) = \exp\Big( -\frac{1}{2} (\boldsymbol{s} - \boldsymbol{\mu}_j)^{T} \boldsymbol{\Sigma}_j^{-1} (\boldsymbol{s} - \boldsymbol{\mu}_j) \Big) \qquad (34)$$
where $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ are the mean and covariance of this radial basis function. Due to the large number of parameters associated with the measurement mapping function (i.e., $\boldsymbol{h}_k$), it is reasonable to adapt these parameters through a gradient descent-based adaptation scheme rather than the multiple model method. For this purpose, the Restricted Gradient Descent (RGD) method proposed in [26] is adopted in this work. Using RGD, the gradient of the objective function with respect to the parameters of each basis function is first calculated using partial derivatives. The goal is to minimize the objective function, defined from the difference between the estimated value function and its one-step (TD) update, i.e.,
$$E_k = \frac{1}{2}\, \delta_k^{2}, \qquad \delta_k = r_k + \gamma \max_{a} Q(s_{k+1}, a) - Q(s_k, a_k) \qquad (35)$$
The gradient of the objective function with respect to the parameters of the RBFs is calculated using the chain rule,

$$\frac{\partial E_k}{\partial \boldsymbol{\mu}_j} = \frac{\partial E_k}{\partial Q} \frac{\partial Q}{\partial \phi_j} \frac{\partial \phi_j}{\partial \boldsymbol{\mu}_j} \qquad (36)$$
$$\frac{\partial E_k}{\partial \boldsymbol{\Sigma}_j} = \frac{\partial E_k}{\partial Q} \frac{\partial Q}{\partial \phi_j} \frac{\partial \phi_j}{\partial \boldsymbol{\Sigma}_j} \qquad (37)$$
where the partial derivatives are calculated using Eqs. (8), (34), and (35) as follows

$$\frac{\partial E_k}{\partial Q} = -\delta_k \qquad (38)$$
$$\frac{\partial Q}{\partial \phi_j} = \theta_j \qquad (39)$$
$$\frac{\partial \phi_j}{\partial \boldsymbol{\mu}_j} = \phi_j\, \boldsymbol{\Sigma}_j^{-1} (\boldsymbol{s} - \boldsymbol{\mu}_j) \qquad (40)$$
$$\frac{\partial \phi_j}{\partial \boldsymbol{\Sigma}_j} = \frac{1}{2}\, \phi_j\, \boldsymbol{\Sigma}_j^{-1} (\boldsymbol{s} - \boldsymbol{\mu}_j) (\boldsymbol{s} - \boldsymbol{\mu}_j)^{T} \boldsymbol{\Sigma}_j^{-1} \qquad (41)$$
The mean and covariance of the RBFs are then adapted using the calculated partial derivatives as follows

$$\boldsymbol{\mu}_j \leftarrow \boldsymbol{\mu}_j - \eta_{\mu}\, \frac{\partial E_k}{\partial \boldsymbol{\mu}_j} \qquad (42)$$
$$\boldsymbol{\Sigma}_j \leftarrow \boldsymbol{\Sigma}_j - \eta_{\Sigma}\, \frac{\partial E_k}{\partial \boldsymbol{\Sigma}_j} \qquad (43)$$

where $\eta_{\mu}$ and $\eta_{\Sigma}$ are the adaptation rates. To make the system more stable, only one of the updates shown in Eqs. (42) and (43) is performed at each step, as discussed in [26]. To be more precise, when the size of the covariance is decreasing, the covariances of the RBFs are updated using Eq. (43); otherwise, their means are updated using Eq. (42). Using this approach, unlimited expansion of the RBF covariances is avoided.
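One restricted gradient step for a single RBF can be sketched as follows; the learning rates, state, weight, and TD-error values are illustrative, and the gradients follow the chain rule structure of Eqs. (38)-(41):

```python
import numpy as np

def rgd_step(mu, Sigma, theta_j, s, td_error, eta_mu=0.01, eta_sigma=0.001):
    """Restricted gradient descent on one RBF: update either mean or covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    d = s - mu
    phi = np.exp(-0.5 * d @ Sigma_inv @ d)                # Eq. (34)
    dE_dphi = -td_error * theta_j                         # Eqs. (38)-(39)
    dphi_dmu = phi * (Sigma_inv @ d)                      # Eq. (40)
    dphi_dSigma = 0.5 * phi * np.outer(Sigma_inv @ d,
                                       Sigma_inv @ d)     # Eq. (41)
    grad_Sigma = dE_dphi * dphi_dSigma
    # Restriction: apply the covariance update only when it would shrink the
    # covariance (positive gradient trace); otherwise move the mean instead.
    if np.trace(grad_Sigma) > 0:
        Sigma = Sigma - eta_sigma * grad_Sigma
    else:
        mu = mu - eta_mu * dE_dphi * dphi_dmu
    return mu, Sigma

mu, Sigma = np.zeros(2), np.eye(2)
mu, Sigma = rgd_step(mu, Sigma, theta_j=1.0,
                     s=np.array([0.5, -0.2]), td_error=0.3)
```

The mutually exclusive branch is what prevents the covariances from inflating without bound: growth-inducing gradients are absorbed by moving the centre instead.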
III-D. Active Learning
In order to ensure enough exploration of the states in off-policy learning methods, the behaviour policy is usually chosen to be a stochastic policy (e.g., a random policy). However, such a choice commonly leads to sample inefficiency, which is already a significant problem in model-free RL methods. One advantage that the proposed MM-KTD learning framework offers over other optimization-based techniques (e.g., gradient descent-based methods) is the calculation of the uncertainty of the weights ($\boldsymbol{P}_{k|k}$), which is directly related to the uncertainty of the value function. This information can then be used at each step to select the actions that lead to the greatest reduction in the uncertainty of the weights. Using the information form of the Kalman filter (the information filter [48]), the information of the weights, denoted by the inverse of the error covariance, is updated as follows
$$\boldsymbol{P}_{k|k}^{-1} = \boldsymbol{P}_{k|k-1}^{-1} + \boldsymbol{h}_k\, R^{-1} \boldsymbol{h}_k^{T} \qquad (46)$$
Since only the second term (i.e., $\boldsymbol{h}_k R^{-1} \boldsymbol{h}_k^{T}$) on the right-hand side of Eq. (46) is affected by the choice of the action (as it changes $\boldsymbol{h}_k$), the action is selected such that this term is maximized. More specifically, the action is obtained by maximizing the information of the weights, i.e.,

$$a_k = \arg\max_{a}\ \mathrm{tr}\big( \boldsymbol{h}_k(a)\, R^{-1} \boldsymbol{h}_k(a)^{T} \big) = \arg\max_{a}\ R^{-1}\, \boldsymbol{h}_k(a)^{T} \boldsymbol{h}_k(a) \qquad (47)$$
The second equality in Eq. (47) holds because $\boldsymbol{h}_k(a)^{T} \boldsymbol{h}_k(a)$ is a scalar equal to the trace of the outer product. The proposed behaviour policy in Eq. (47) is different from that of Reference [31], where a random policy was introduced that favoured actions with less certainty of the value function. Even though favouring actions that reduce the uncertainty of the value function is a good idea, the random nature of such policies makes them less sample efficient than expected. The proposed MM-KTD framework is summarized in Algorithm 1. It is worth further clarifying the computation of Step 6 of Algorithm 1. For learning the optimal policy, the control action is selected based on the behaviour policy, which leads to the greatest reduction in the uncertainty of the weights in Eq. (47). Once the system has converged, the resulting optimal policy is used based on Eq. (7) to select actions during the testing phase. Because the state-space is continuous, we have approximated the value function of Eq. (7) using RBFs; the value function is estimated with a weighted sum of RBFs with the weight vector $\boldsymbol{\theta}$. For estimating the weights, the sample transitions of the system and the gained rewards are used in a Kalman-based approach, as in Eqs. (12) and (14). Therefore, the control action in Step 6 is found based on the proposed active learning behaviour policy in Eqs. (13) and (47). Because the information matrix generated by a candidate action cannot be maximized directly, we maximize its trace. Finally, we note that the proposed MM-KTD algorithm is designed for systems with a finite number of actions. It is worth mentioning that having an infinite number of actions per state is typical of continuous control tasks [42, 43]. Extension of the proposed framework to infinite-dimensional action spaces is an interesting direction for future research.
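The behaviour policy of Eq. (47) reduces to scoring each candidate action by the information its measurement vector would add. A simplified sketch follows, assuming (as an illustration) that the successor features are approximated by the current greedy estimate when forming $\boldsymbol{h}_k(a)$:

```python
import numpy as np

def active_action(phi_per_action, phi_next_best, R, gamma=0.99):
    """Eq. (47): pick the action maximizing h(a)^T h(a) / R."""
    scores = []
    for phi_sa in phi_per_action:
        h = phi_sa - gamma * phi_next_best       # Eq. (13)
        scores.append((h @ h) / R)               # trace of the rank-1 term
    return int(np.argmax(scores)), scores

# Two candidate actions with different feature vectors.
phi_a0 = np.array([1.0, 0.0, 0.0])
phi_a1 = np.array([0.0, 2.0, 0.0])
phi_next = np.array([0.0, 0.0, 1.0])
a, scores = active_action([phi_a0, phi_a1], phi_next, R=1.0)
print(a)  # -> 1 (a larger ||h|| carries more information)
```

Because the score is deterministic given the current features, this policy visits the least-certain directions of the weight space directly, rather than relying on random exploration to stumble upon them.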
[Algorithm 1: The proposed MM-KTD framework (pseudo-code not reproduced here).]
IV. Experimental Results
In this section, we evaluate the performance of the proposed MM-KTD framework. In order to demonstrate the effectiveness and sample efficiency of MM-KTD, which is a model-free and multiple model RL scheme, three popular RL benchmarks, i.e., Inverted Pendulum, Mountain Car, and Lunar Lander, are considered, and different comparisons are performed. Implementations of the MM-KTD model for all three RL benchmarks are publicly available at https://github.com/parvin95/MMKTD. For the implementation of the proposed MM-KTD, a machine with an Intel Core i7 processor was used. To have a fair comparison, we used optimized parameters for the NFQ method and the KTD approach, as specified in [10] and [31], respectively. It is also worth mentioning that one benefit of the proposed MM-KTD framework (which comes from its multiple-model architecture) is its superior ability to deal with scenarios where information about the underlying parameters is not fully available.
IV-A. Inverted Pendulum
In the first experiment, the Inverted Pendulum platform shown in Fig. 1 is considered. The pendulum is initially left at an angle arbitrarily close to the upright position, and its base is then moved to keep the pendulum balanced. The base may be moved to the left or right with a fixed force, or not moved, yielding three possible actions. The episode ends once the pendulum falls below a horizontal line, and the goal is to prevent this from happening for as long as possible. The state of the system is given by the pendulum angle from the upright position and its angular velocity (i.e., $\boldsymbol{s} = [\theta, \dot{\theta}]^{T}$). At each time step, given the state ($\boldsymbol{s}_k$) and the taken action ($a_k$), the next state of the system is determined through the dynamics of the system (which are not known to the agent). If the angle of the pendulum in the next state is beyond the horizontal line, a negative reward is fed back to the agent; otherwise, no penalty is incurred.
The proposed MM-KTD is employed in this problem using a set of RBFs plus a bias parameter per action. Since there are three possible actions, the feature vector is formed by stacking the three per-action blocks, as in Eq. (33). The means of the RBFs are initialized on a uniform grid over the state-space and their covariances are initialized as scaled identity matrices (Eqs. (48)-(49)); the vector of basis functions in Eq. (33) is then given per action block (Eqs. (50)-(52)), where the value of each $\phi_j$ is calculated via Eq. (34). The initial values of the adaptation rates $\eta_{\mu}$ and $\eta_{\Sigma}$ are selected by trial and error to keep the system stable. The discount factor $\gamma$ is selected close to one, the process noise covariance is selected small, and the transition matrix is selected to be the identity matrix ($\boldsymbol{F} = \boldsymbol{I}$). The measurement noise variance candidates $R_i$ are selected from a small discrete set spanning several orders of magnitude (Eq. (53)). The initial weights are selected to be zero (i.e., $\hat{\boldsymbol{\theta}}_{0|0} = \boldsymbol{0}$), while the initial error covariance $\boldsymbol{P}_{0|0}$ is chosen to be large. Each experiment is started from an angle randomly drawn from a zero-mean normal distribution with a small standard deviation.
The agent is first trained over a set of training episodes and then tested over a number of test episodes to determine whether it can keep the pendulum in balance for a predefined duration. Various numbers of training episodes are used. To highlight the effectiveness of the proposed MM-KTD, the achieved results are compared with those of the KTD method of [31], MM-KTD with no active learning (denoted by MM-KTD (P)), and the Neural Fitted Q (NFQ) learning method of [10], which performs the update session by considering an entire set of transition experiences in an offline fashion, instead of using an online approach for updating the neural value function. More specifically, the NFQ scheme optimizes a sequence of loss functions using the Rprop algorithm, a supervised batch learning method, to update the parameters of the Q-network. The use of stored historical observations, without performing any exploration, allows NFQ to learn optimal policies from a substantially smaller amount of training experience; training then reduces to a loop over the stored transitions. NFQ uses the current network to compute the target value, whereas DQN uses a separate target network. As the aim of this paper is to restrict the number of learning episodes with a reduced amount of training data, which is critical for practical applications in real scenarios, we focus our comparison on algorithms that require the fewest learning episodes.
Each testing scenario is repeated times for a specific number of training episodes to minimize the randomness of the achieved number of successful trials out of total trials. For evaluation purposes, we formed the average and % confidence interval of the number of successful trials over the repetitions. Fig. 2 shows the resulting mean (lines) and 95% confidence interval (error bars) for different numbers of training episodes ranging from to , which are also summarized in Table I. As expected, the proposed MM-KTD is the most sample-efficient algorithm of all. In addition, the proposed method offers the highest asymptotic performance in the performed experiments. This superior performance is a result of adaptive estimation of the value function through active Q-learning. It is expected that the performance of MM-KTD (P) and NFQ improves with more training episodes and approaches that of the proposed MM-KTD. The original KTD, however, cannot reach that level of performance as it lacks accurate knowledge of the filter's parameters.
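The mean and 95% confidence interval over repeated testing runs can be computed as in the following sketch, which uses the standard normal-approximation interval; the sample counts are hypothetical.

```python
import math

def mean_ci95(successes):
    """Mean and half-width of the normal-approximation 95% confidence
    interval for a list of success counts (one count per repetition)."""
    n = len(successes)
    mean = sum(successes) / n
    var = sum((x - mean) ** 2 for x in successes) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)  # z-score for a 95% interval
    return mean, half_width

# hypothetical success counts out of 50 test trials, over 5 repetitions
m, hw = mean_ci95([38, 40, 37, 39, 41])  # mean 39.0, half-width ~1.39
```

The error bars in a plot such as Fig. 2 would then span `m - hw` to `m + hw` for each number of training episodes.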
# Training Episodes  KTD  MM-KTD (P)  MM-KTD  NFQ 
10  2.1/50 (4.2%)  12.3/50 (24.6%)  38.5/50 (77%)  1.9/50 (3.8%) 
20  1/50 (2%)  17.4/50 (34.8%)  44/50 (88%)  1.5/50 (3%) 
30  3.6/50 (7.2%)  17.6/50 (35.2%)  50/50 (100%)  21.6/50 (43.2%) 
40  15/50 (30%)  20.7/50 (41.4%)  34/50 (68%)  15/50 (30%) 
50  16.5/50 (33%)  22.0/50 (44%)  46.6/50 (93.2%)  23/50 (46%) 
The state value functions of the methods after 10 training episodes are depicted in Fig. 3, where the and axes represent and of the pendulum, respectively. It is worth mentioning that when the state value function has higher values around the vertical (origin) position of the Inverted Pendulum (states close to ), the selected actions resulting from the behavioral policy have led to states near the vertical position, which is the expected result, i.e., the pendulum is above the horizontal line. Therefore, based on Fig. 3, the proposed active learning method yields higher sample efficiency for the MM-KTD scheme compared to the other techniques. Due to this higher sample efficiency, MM-KTD quickly concentrates higher values on states closer to the origin. The other methods, however, are still far from this stage and therefore fail to perform acceptably with a restricted amount of experience.
Finally, to evaluate the stability of the utilized RBFs, we have conducted the following Monte-Carlo (MC) study. In particular, we have fixed the number of RBFs to and selected their means based on Eq. (48). Then, the proposed MM-KTD scheme has been performed on the Inverted Pendulum task for different trials. The entire process has been repeated times (i.e., an MC simulation of runs) using three different values for the widths of the RBFs. The number of steps to the goal was averaged over the runs. Fig. 4 illustrates the stability of the utilized RBFs: as can be observed, steady-state performance can be achieved by using the RBFs for the value function approximation.
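The RBF value-function approximation used throughout these experiments can be sketched as follows; the feature layout (two Gaussian RBFs over a toy two-dimensional state plus a bias term) and all names are illustrative assumptions rather than the paper's exact configuration.

```python
import math

def rbf_features(state, centers, widths):
    """Gaussian RBF feature vector phi(s) with a constant bias term;
    the value estimate is then the inner product w^T phi(s)."""
    feats = [1.0]  # bias parameter
    for mu, sigma in zip(centers, widths):
        # squared Mahalanobis-like distance with a diagonal width per dimension
        d2 = sum((x - m) ** 2 / (2.0 * s ** 2)
                 for x, m, s in zip(state, mu, sigma))
        feats.append(math.exp(-d2))
    return feats

# toy 2-D state (e.g., angle and angular velocity) with two RBF centers
phi = rbf_features((0.1, 0.0),
                   centers=[(0.0, 0.0), (0.5, 0.5)],
                   widths=[(1.0, 1.0), (1.0, 1.0)])
```

With evenly spaced centers, widening or narrowing `widths` trades smoothness of the approximated value surface against locality, which is the sensitivity probed by the MC study above.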
IV-B Mountain Car
In the second experiment, the Mountain Car platform, shown in Fig. 5, is chosen for the evaluation. Mountain Car is a classic RL problem where the objective is to create an algorithm that learns to climb a steep hill to reach the goal marked by a flag. The car's engine is not powerful enough to drive up the hill without a head start; therefore, the car must drive up the left hill to gain enough momentum to scale the steeper hill to the right and reach the goal. The state of the system is the position of the car and its velocity (i.e., ). The possible actions are restricted to (), namely “push left”, “no push”, and “push right”, respectively. The road ends at position , i.e., the position must be greater than . The task is to reach the top, where the position must be larger than or equal to . A reward of is received for each step in which the car has not yet mounted the hill to reach the goal, and there is no penalty for climbing the left hill, which acts as a wall when is reached. Each episode starts with a random position ranging from to with no velocity.
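For reference, the transition just described can be sketched with the standard Sutton-Barto Mountain Car dynamics; the constants below (force 0.001, gravity scale 0.0025, left wall at -1.2, goal at 0.5, velocity bound 0.07) are the common benchmark values and are assumptions here, since the exact settings are elided in this excerpt.

```python
import math

def mountain_car_step(position, velocity, action):
    """One transition of the classic Mountain Car dynamics;
    action in {0: push left, 1: no push, 2: push right}."""
    velocity += 0.001 * (action - 1) - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))  # clip velocity
    position += velocity
    if position < -1.2:               # left hill acts as a wall
        position, velocity = -1.2, 0.0
    reward = -1.0                     # -1 per step until the goal is reached
    done = position >= 0.5            # goal flag on the right hill
    return position, velocity, reward, done

# one step from rest near the valley bottom, pushing right
p, v, r, d = mountain_car_step(-0.5, 0.0, 2)
```

Because the per-step reward is uniformly negative until the flag, the agent is implicitly pushed to discover the back-and-forth momentum-building strategy.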
Similar to the Inverted Pendulum experiment presented in Section IV-A, RBFs and a bias parameter are used for the proposed MM-KTD algorithm. Therefore, the size of the feature vector is . The mean and covariance of the RBFs are initialized as follows
(54)  
(55) 
The initial values of and are selected as and , respectively. The time step is selected to be milliseconds and the discount factor is chosen as . The process noise covariance is set to , and the transition matrix is selected as the identity matrix (). Candidate values for are selected from the same set as was used for the Inverted Pendulum environment.
The initial weights are selected to be zero (i.e., ), whereas the initial error covariance is chosen as . The agent is first trained through different numbers of episodes, then tested times for episodes to find whether it can reach the flag within samples of each episode. Fig. 6 shows the results of KTD [31], MM-KTD, MM-KTD (P), and the NFQ learning method of [10]. As expected, the performance of KTD, MM-KTD (P) and NFQ improves with increased training episodes. However, KTD, MM-KTD (P) and NFQ cannot provide the level of performance achieved by MM-KTD with the lowest number of training episodes in this experiment (). Based on the achieved results shown in Table II, MM-KTD is the most sample-efficient of all tested algorithms. Fig. 7 depicts the trajectories of the system's states (position, velocity) for the Mountain Car environment at episode number 50. As can be observed from Fig. 7, the episode starts from zero velocity and a random position in the range [], and ends when the position reaches . Fig. 8 shows the optimal agent's actions resulting from applying the MM-KTD scheme over a combination of positions and velocities. Based on Fig. 8, the agent moves left when its velocity is negative. Most of the time when the velocity is positive, the agent moves to the right, and occasionally it does nothing.
# Training Episodes  KTD  MM-KTD (P)  MM-KTD  NFQ 
10  1.5/50 (3%)  1.9/50 (3.8%)  18/50 (36%)  2.0/50 (4%) 
20  1.2/50 (2.4%)  2.3/50 (4.6%)  20.3/50 (40.6%)  3.5/50 (7%) 
30  2.6/50 (5.2%)  3.4/50 (6.8%)  22.2/50 (44.4%)  4.72/50 (9.44%) 
40  9.2/50 (18.4%)  9/50 (18%)  16.6/50 (33.2%)  6/50 (12%) 
50  12.3/50 (24.6%)  12.0/50 (24%)  18.0/50 (36%)  12.6/50 (25.2%) 
IV-C Lunar Lander
In the third experiment, we focus on the Lunar Lander environment, which is more complicated than the two previous ones. In the Lunar Lander environment, the goal is for the RL agent to learn to land successfully on a landing pad located at coordinate in a randomly generated surface on the moon, as shown in Fig. 9. The state space of the system consists of the agent's position in space, horizontal and vertical velocity , orientation in space , and angular velocity . The agent has four possible actions: do nothing, fire the left engine, fire the main engine, and fire the right engine (). The reward for landing on the pad is about to points, depending on the lander's placement on the pad. If the lander moves away from the landing pad, it loses reward. Each episode terminates if the lander lands or crashes, receiving an additional + or  points, respectively. Each leg-ground contact is worth + points. Firing the main engine incurs a  point penalty for each frame. The problem is considered solved if the agent receives + points over iterations. RBFs of order two are considered for each state variable, resulting in RBFs for each action. Consequently, the size of the feature vector is . Based on the expected range of each variable of the state vector, the initial mean and covariance of the RBFs are chosen as follows
(56)  
(57) 
The initial values of and are both selected as to keep the system stable. The discount factor is selected as . The process noise covariance is set to . Candidate values for are selected from the same set as was used for the Inverted Pendulum environment. As in the two previous environments, the agent is first trained through different numbers of episodes, ranging from to , then tested times for trials to find whether the agent can successfully land on the landing pad within steps. Fig. 10 depicts the results of applying the KTD [31], MM-KTD, MM-KTD (P), and NFQ [10] schemes. It can be observed that the proposed approach outperforms its counterparts.
IV-D Parameter Selection
In this subsection, we provide more details on the values assigned to the underlying parameters in the three simulations presented above. First, it is worth mentioning that one aspect of the proposed multiple-model framework is to reduce the dependence of the RL performance on its parameter values. For adaptation of the RBFs' parameters, the initial centers of the RBFs are typically distributed evenly along each dimension of the state vector, leading to centers for state variables and a given order . The initial variance of each state variable is often set to . For the Inverted Pendulum and Mountain Car, the value of is selected following the KTD approach of [31] for a fair comparison. For the Lunar Lander, the value of is selected to be . Furthermore, the values of and are selected so as to keep the system stable. In addition, based on Fig. 4, it can be observed that the RBFs are fairly stable with respect to the selection of their underlying parameters. The discount factor, denoted by , determines how much weight is given to future rewards in the value function. A discount factor of will result in state/action values representing only the immediate reward, while a higher discount factor will result in values representing the cumulative discounted future reward that an agent is expected to receive (behaving under a given policy). Commonly (as is the case in the implemented environments), a large portion of the reward is earned upon reaching the goal. To prioritize this final success, we expect an acceptable to be close to 1. In our manuscript, is set to the high values of and .
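The effect of the discount factor described above can be made concrete with a short sketch: when the reward arrives only at the goal, the discounted return of a state several steps away shrinks geometrically in the number of remaining steps, so a discount factor close to 1 is needed for that terminal reward to influence early states. The helper below is illustrative only.

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted return G = sum_t gamma^t * r_t,
    accumulated backwards for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# reward only at the goal, three steps away: return scales as gamma^3
g_high = discounted_return([0.0, 0.0, 0.0, 1.0], 0.9)  # → 0.729
g_zero = discounted_return([5.0, 1.0, 1.0], 0.0)       # only the immediate reward
```

With gamma near 1 the terminal reward propagates far back along the trajectory; with gamma = 0 the return collapses to the immediate reward, matching the discussion above.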
The prior should be initialized to a value close to the expected optimal value based on historical data, or to a default value, e.g., the zero vector. The prior quantifies the certainty that the user has in the initialized prior : the lower its value, the less the certainty. The process noise covariance is a design parameter; the proposed MM-KTD algorithm systematically allows a set of different values of the process noise covariance () to be used. We have conducted a sensitivity analysis experiment on the Inverted Pendulum environment to evaluate sensitivity to this design parameter in the scenario where only a single initial value can be assigned. Fig. 11 presents the agent's success rate during the training phase over episodes for three different assignments of . It can be observed that different values of the process noise covariance affect the time that the agent takes to complete the training process. Finally, as stated previously, the variable , i.e., the measurement noise variance, is one of the most important parameters to be identified. To select this parameter, our intuition in the proposed MM-KTD framework is to cover the potential range of the measurement noise variance using mode-matched filters. The parameter , denoting the number of candidates , for (), for the observation noise variance, is set a priori. For example, in the experiments we have selected to be equal to .
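The mode-matched weighting over candidate noise variances can be illustrated with one likelihood-reweighting step of the standard multiple-model adaptive estimation (MMAE) recipe: each candidate filter is scored by the Gaussian likelihood of its current innovation, and the mode weights are renormalized. This is a minimal sketch of the generic mechanism, not the paper's exact update; all names and numbers are illustrative.

```python
import math

def mmae_weights(innovations, innov_variances, prior_weights):
    """One MMAE-style update: weight each candidate filter by the
    Gaussian likelihood of its innovation, then renormalize."""
    likes = [math.exp(-e ** 2 / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)
             for e, s in zip(innovations, innov_variances)]
    posts = [w * l for w, l in zip(prior_weights, likes)]
    z = sum(posts)
    return [p / z for p in posts]

# two candidate filters with the same innovation but different
# innovation variances; the better-matched mode gains weight
w = mmae_weights(innovations=[0.1, 0.1],
                 innov_variances=[0.01, 1.0],
                 prior_weights=[0.5, 0.5])
```

Repeating this step over time concentrates the weight on the filter whose assumed measurement noise variance best explains the observed innovations, which is the intuition behind covering the range of with a bank of mode-matched filters.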
V Conclusion
The paper proposes a Multiple-Model Kalman-filter-based Temporal Difference framework, referred to as MM-KTD, which addresses the problems of sample efficiency, online learning, prior information, and memory encountered in other RL algorithms. The proposed MM-KTD algorithm is evaluated on three RL experiments. Based on the achieved results, the proposed algorithm attains the highest number of successful trials while requiring considerably fewer training episodes to learn the best policy than its counterparts. The restricted number of learning episodes translates into reduced training data requirements, which is critical for practical applications in real scenarios.
References
 [1] S. Spanò, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Matta, A. Nannarelli, and M. Re, “An Efficient Hardware Implementation of Reinforcement Learning: The Q-Learning Algorithm,” IEEE Access, vol. 7, pp. 186340-186351, 2019.
 [2] M. Seo, L. F. Vecchietti, S. Lee, and D. Har, “Rewards Prediction-Based Credit Assignment for Reinforcement Learning With Sparse Binary Rewards,” IEEE Access, vol. 7, pp. 118776-118791, 2019.
 [3] A. Toubman et al., “Modeling Behavior of Computer Generated Forces with Machine Learning Techniques, the NATO Task Group Approach,” IEEE Int. Conf. Systems, Man, and Cybernetics (SMC), Budapest, 2016, pp. 001906-001911.
 [4] J. J. Roessingh et al., “Machine Learning Techniques for Autonomous Agents in Military Simulations - Multum in Parvo,” IEEE Int. Conf. Systems, Man, and Cybernetics (SMC), Banff, AB, 2017, pp. 3445-3450.
 [5] R. Bellman, “The Theory of Dynamic Programming,” RAND Corp Santa Monica CA, Tech. Rep., 1954.
 [6] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, “Information Theoretic MPC for Model-Based Reinforcement Learning,” International Conference on Robotics and Automation (ICRA), 2017.
 [7] S. Ross and J. A. Bagnell, “Agnostic System Identification for Model-Based Reinforcement Learning,” arXiv:1203.1007, 2012.
 [8] A. Lazaric, M. Restelli, and A. Bonarini, “Reinforcement Learning in Continuous Action Spaces Through Sequential Monte Carlo Methods,” Advances in Neural Information Processing Systems, 2008, pp. 833-840.
 [9] E. Keogh and A. Mueen, “Curse of Dimensionality,” Encyclopedia of Machine Learning, Springer, 2011, pp. 257-258.
 [10] M. Riedmiller, “Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method,” European Conference on Machine Learning, Springer, 2005, pp. 317-328.
 [11] Y. Tang, H. Guo, T. Yuan, X. Gao, X. Hong, Y. Li, J. Qiu, Y. Zuo, and J. Wu, “Flow Splitter: A Deep Reinforcement Learning-Based Flow Scheduler for Hybrid Optical-Electrical Data Center Network,” IEEE Access, vol. 7, pp. 129955-129965, 2019.
 [12] C. Hu and M. Xu, “Adaptive Exploration Strategy With Multi-Attribute Decision-Making for Reinforcement Learning,” IEEE Access, vol. 8, pp. 32353-32364, 2020.
 [13] M. Kim, S. Lee, J. Lim, J. Choi, and S. G. Kang, “Unexpected Collision Avoidance Driving Strategy Using Deep Reinforcement Learning,” IEEE Access, vol. 8, pp. 17243-17252, 2020.
 [14] J. Xie, Z. Shao, Y. Li, Y. Guan, and J. Tan, “Deep Reinforcement Learning with Optimized Reward Functions for Robotic Trajectory Planning,” IEEE Access, vol. 7, pp. 105669-105679, 2019.
 [15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” arXiv:1312.5602, 2013.
 [16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous Control with Deep Reinforcement Learning,” arXiv:1509.02971, 2015.
 [17] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,” arXiv:1802.09477, 2018.
 [18] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” arXiv:1801.01290, 2018.
 [19] H. Van Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning with Double Q-Learning,” AAAI, vol. 2, Phoenix, AZ, 2016, p. 5.
 [20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” International Conference on Machine Learning, 2016, pp. 1928-1937.
 [21] S. Haykin, “Neural Networks: A Comprehensive Foundation,” Prentice Hall PTR, 1994.
 [22] W. T. Miller, F. H. Glanz, and L. G. Kraft, “CMAC: An Associative Neural Network Alternative to Backpropagation,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1561-1567, 1990.
 [23] R. M. Kretchmar and C. W. Anderson, “Comparison of CMACs and Radial Basis Functions for Local Function Approximators in Reinforcement Learning,” International Conference on Neural Networks, vol. 2, IEEE, 1997, pp. 834-837.
 [24] G. Konidaris, S. Osentoski, and P. S. Thomas, “Value Function Approximation in Reinforcement Learning using the Fourier Basis,” AAAI, vol. 6, 2011, p. 7.
 [25] I. Menache, S. Mannor, and N. Shimkin, “Basis Function Adaptation in Temporal Difference Reinforcement Learning,” Annals of Operations Research, vol. 134, no. 1, pp. 215-238, 2005.
 [26] A. d. M. S. Barreto and C. W. Anderson, “Restricted Gradient-Descent Algorithm for Value-Function Approximation in Reinforcement Learning,” Artificial Intelligence, vol. 172, no. 4-5, pp. 454-482, 2008.
 [27] D. Choi and B. Van Roy, “A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning,” Discrete Event Dynamic Systems, vol. 16, no. 2, pp. 207-239, 2006.
 [28] Y. Engel, “Algorithms and Representations for Reinforcement Learning,” Hebrew University of Jerusalem, 2005.
 [29] S. J. Bradtke and A. G. Barto, “Linear Least-Squares Algorithms for Temporal Difference Learning,” Machine Learning, vol. 22, no. 1-3, pp. 33-57, 1996.
 [30] M. Geist and O. Pietquin, “Algorithmic Survey of Parametric Value Function Approximation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 6, pp. 845-867, 2013.
 [31] M. Geist and O. Pietquin, “Kalman Temporal Differences,” Journal of Artificial Intelligence Research, vol. 39, pp. 483-532, 2010.
 [32] R. Mehra, “On the Identification of Variances and Adaptive Kalman Filtering,” IEEE Transactions on Automatic Control, vol. 15, no. 2, pp. 175-184, 1970.
 [33] D. G. Lainiotis, “Partitioning: A Unifying Framework for Adaptive Systems, I: Estimation,” Proceedings of the IEEE, vol. 64, no. 8, pp. 1126-1143, 1976.
 [34] A. Assa and K. N. Plataniotis, “Similarity-Based Multiple Model Adaptive Estimation,” IEEE Access, vol. 6, pp. 36632-36644, 2018.
 [35] K. Doya, K. Samejima, K.-i. Katagiri, and M. Kawato, “Multiple Model-Based Reinforcement Learning,” Neural Computation, vol. 14, no. 6, pp. 1347-1369, 2002.
 [36] T. Kitao, M. Shirai, and T. Miura, “Model Selection Based on Kalman Temporal Differences Learning,” IEEE International Conference on Collaboration and Internet Computing (CIC), 2017, pp. 41-47.
 [37] A. Assa and K. N. Plataniotis, “Adaptive Kalman Filtering by Covariance Sampling,” IEEE Signal Processing Letters, vol. 24, no. 9, pp. 1288-1292, 2017.
 [38] R. S. Sutton, A. G. Barto, F. Bach et al., “Reinforcement Learning: An Introduction,” MIT Press, 1998.
 [39] M. Hutter and S. Legg, “Temporal Difference Updating without a Learning Rate,” Advances in Neural Information Processing Systems, 2008, pp. 705-712.
 [40] R. S. Sutton, “Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding,” Advances in Neural Information Processing Systems, 1996, pp. 1038-1044.
 [41] W. Xia, C. Di, H. Guo, and S. Li, “Reinforcement Learning Based Stochastic Shortest Path Finding in Wireless Sensor Networks,” IEEE Access, vol. 7, pp. 157807-157817, 2019.
 [42] J. Li, T. Chai, F. L. Lewis, Z. Ding, and Y. Jiang, “Off-Policy Interleaved Learning: Optimal Control for Affine Nonlinear Discrete-Time Systems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 5, pp. 1308-1320, May 2019.
 [43] A. Al-Tamimi, M. Abu-Khalaf, and F. L. Lewis, “Model-Free Q-Learning Designs for Discrete-Time Zero-Sum Games with Application to H-Infinity Control,” Automatica, vol. 43, pp. 473-481, 2007.
 [44] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
 [45] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” arXiv preprint, January 2017.
 [46] Y. Ge, F. Zhu, X. Ling, and Q. Liu, “Safe Q-Learning Method Based on Constrained Markov Decision Processes,” IEEE Access, vol. 7, pp. 165007-165017, 2019.
 [47] P. Malekzadeh, A. Mohammadi, M. Barbulescu, and K. N. Plataniotis, “STUPEFY: Set-Valued Box Particle Filtering for Bluetooth Low Energy-Based Indoor Localization,” IEEE Signal Processing Letters, vol. 26, no. 12, pp. 1773-1777, Dec. 2019.
 [48] A. Mohammadi and K. N. Plataniotis, “Event-Based Estimation With Information-Based Triggering and Adaptive Update,” IEEE Transactions on Signal Processing, vol. 65, no. 18, pp. 4924-4939, Sept. 2017.
 [49] A. Mohammadi and K. N. Plataniotis, “Improper Complex-Valued Bhattacharyya Distance,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 5, pp. 1049-1064, May 2016.
 [50] A. Mohammadi and K. N. Plataniotis, “Distributed Widely Linear Multiple-Model Adaptive Estimation,” IEEE Transactions on Signal and Information Processing over Networks, vol. 1, no. 3, pp. 164-179, Sept. 2015.