## I Introduction

Reinforcement learning (RL) has attracted significant attention due to recent demonstrations in which RL agents outperform humans in certain tasks (video games [1], the game of Go [2]). Although these demonstrations show the great potential of RL, the game environments involved are confined and restrictive compared to what ordinary humans encounter in everyday life. One of the major differences between a game environment and real life is the presence of unknown factors, i.e., the observation of the state of the environment is incomplete. Most RL algorithms are based on the assumption that complete state observation is available and that the state transition depends only on the current state and action (the Markovian assumption). The Markov decision process (MDP) is a modeling framework built on the Markovian assumption, and the development and analysis of standard RL algorithms are based on the MDP. Applying those RL algorithms under incomplete observation may lead to poor performance. In [3], the authors showed that a standard policy evaluation algorithm can incur an arbitrarily large error due to incomplete state observation. In fact, the RL agent in [1] performs poorly on games where inferring the hidden context is the key to winning.

The partially observable Markov decision process (POMDP) is a generalization of the MDP that incorporates an incomplete state observation model. When the model parameters of a POMDP are given, the optimal policy is determined by dynamic programming on the belief-state MDP into which the POMDP is transformed [4]. The belief-state MDP has a continuous state space even when the corresponding POMDP has a finite state space, so solving the dynamic programming problem on the belief-state MDP is computationally challenging. A number of results exist for obtaining approximate solutions to the optimal policy when the model is given [5, 6]. When the model of the POMDP is not given (model-free), one option is the policy gradient approach, which does not rely on Bellman's optimality principle. For example, Monte Carlo policy gradient approaches [7, 8] are known to be less vulnerable to incomplete observation, since they do not require learning the optimal action-value function, which is defined in terms of the state of the environment. However, the Monte Carlo policy gradient estimate has high variance, so convergence to the optimal policy typically takes longer than for RL algorithms that exploit Bellman's optimality principle when full state observation is available.

A natural idea is to use a dynamic estimator of the hidden state and apply the optimality principle to the estimated state. Due to their universal approximation property, recurrent neural networks (RNNs) have been used to incorporate estimation of the hidden state into reinforcement learning. In [9], the authors use an RNN to approximate the optimal value function using the memory effect of the RNN. In [10], the authors propose an actor-critic algorithm where an RNN is used for the critic, which takes sequential data. However, the RNNs in [9, 10] are trained only on the basis of the Bellman optimality principle and do not account for how accurately the RNN can estimate the state, which is essential for applying the Bellman optimality principle. Without reasonable state estimation, making an optimal decision is not possible even when the correct optimal action-value function is given. To the best of the authors' knowledge, most RNNs used in reinforcement learning do not consider how accurately the RNN infers the hidden state.

In this paper, we aim to develop a recursive estimation algorithm for a POMDP that concurrently estimates the parameters of the model, predicts the hidden state, and determines the optimal action-value function. The idea of using a recursive state predictor (Bayesian state belief filter) in RL was investigated in [11, 12, 13, 14]. In [11], the author proposed using the Bayesian state belief filter for the estimation of the Q-function. In [12], the authors implemented the Bayesian state belief update with an approximation technique for ease of computation and analyzed its convergence. More recently, the authors in [13] combined the Bayesian state belief filter and QMDP [5]. However, the algorithms in [11, 12, 13] require the POMDP model parameters to be readily available (in [12], the algorithm needs full state observation for the system identification of the POMDP). A *model-free* reinforcement learning algorithm that uses an HMM formulation is presented in [14]. The result in [14] shares the same idea as ours: an HMM estimator with a fixed behavior policy is used to disambiguate the hidden state, learn the POMDP parameters, and find the optimal policy.
However, the algorithm in [14] involves multiple phases, including identification and design, which are hard to apply online to real-time learning tasks, for which recursive estimation is more suitable (e.g., DQN, DDPG, and Q-learning are online algorithms). The main contribution of this paper is to present and analyze a new *online* estimation algorithm that simultaneously estimates the POMDP model parameters and the corresponding optimal action-value function (Q-function), employing online HMM estimation techniques [15, 16].

The remainder of the paper is organized as follows. In Section II, the HMM interpretation of a POMDP excited by a behavior policy is presented. In Section III, the proposed recursive estimation of the HMM, the POMDP, and the Q-function is presented, and the convergence of the estimator is analyzed. In Section IV, a numerical example is presented. Section V concludes the paper.

## II A HMM: POMDP Excited by a Behavior Policy

We consider a partially observable Markov decision process (POMDP) on finite state and action sets. A fixed behavior policy (a term from reinforcement learning, analogous to the excitation of a plant for system identification) excites the POMDP so that all state-action pairs are realized infinitely often along the infinite time horizon.

### II-A POMDP on Finite State-Action Sets

The POMDP comprises: a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a state transition probability $P(s' \mid s, a)$ for $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, a reward model $r_t = \bar{r}(s_t, a_t) + w_t$, where $w_t$ denotes independent identically distributed (i.i.d.) Gaussian noise, a finite observation space $\mathcal{O}$, an observation probability $O(o \mid s)$, and the discount factor $\gamma \in (0, 1)$. At each time step $t$, the agent first observes $o_t$ from the environment at the state $s_t$, performs action $a_t$ on the environment, and receives the reward $r_t$ in accordance with the reward model.

### II-B Behavior Policy and HMM

A behavior policy is used to estimate the model parameters. As in other off-policy reinforcement learning (RL) algorithms, e.g., Q-learning [17], the behavior policy excites the POMDP, and the estimator uses the samples generated from the controlled POMDP. The behavior policy's purpose is system identification (in other words, estimation of the POMDP parameters). We denote the behavior policy by $\mu(a \mid o)$, the conditional probability of the action given the observation. Since we choose how to excite the system, the behavior policy is known and can be used in the estimation. The POMDP controlled by $\mu$ becomes a hidden Markov model (HMM), as illustrated in Fig. 1.

The HMM comprises the state transition probability over all pairs of states and the extended observation probability, i.e., the joint probability of the observation, action, and reward emitted at each state, which is determined by the POMDP model parameters and the behavior policy.

For ease of notation, we collect the model parameters into a tensor and matrices representing the state transition, observation, and reward models. The HMM estimator in Fig. 1 learns these model parameters, defined in Section II-A, and also provides the state estimate (or belief state) to the MDP and Q-function estimators. Given the transition of the state estimates and the action, the MDP estimator learns the transition model parameters. The optimal action-value function is also recursively estimated based on the transition of the state estimates, the reward sample, and the action taken.
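As an illustration of this construction, the sketch below simulates a toy POMDP excited by a uniform behavior policy and collects the sequence of extended observations that the HMM estimator would consume. The sizes and randomly drawn parameters are hypothetical, not the paper's example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small POMDP: 4 hidden states, 2 actions, 2 observations.
N, A, M = 4, 2, 2
P = rng.dirichlet(np.ones(N), size=(A, N))   # P[a, s, s'] transition probabilities
O = rng.dirichlet(np.ones(M), size=N)        # O[s, o] observation probabilities
R = rng.normal(size=(N, A))                  # mean reward for each (state, action)
mu = np.full((M, A), 1.0 / A)                # behavior policy mu(a | o): uniform

def step(s):
    """One step of the POMDP excited by the behavior policy.

    Returns the extended observation y_t = (o_t, a_t, r_t) emitted at the
    hidden state s, and the next hidden state.
    """
    o = rng.choice(M, p=O[s])
    a = rng.choice(A, p=mu[o])
    r = R[s, a] + 0.1 * rng.normal()         # reward with i.i.d. Gaussian noise
    s_next = rng.choice(N, p=P[a, s])
    return (o, a, r), s_next

# Generate a trajectory of extended observations; the hidden state itself
# is never exposed to the estimator.
s, traj = 0, []
for _ in range(1000):
    y, s = step(s)
    traj.append(y)
```

The resulting process over the extended observations is exactly the HMM described above: the hidden state evolves as a Markov chain, and each extended observation depends only on the current state.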

## III HMM Q-Learning Algorithm for POMDPs

The objective of this section is to present a new HMM-estimation-based Q-learning algorithm for POMDPs, called HMM Q-learning, which is the main outcome of this paper. Pseudocode of the recursive algorithm is given in Algorithm 1. The algorithm recursively computes the maximum likelihood estimate of the POMDP parameters and the Q-function using partial observations. It integrates (a) the HMM estimation, (b) the MDP transition model estimation, and (c) the Q-function estimation steps. In the remaining subsections, we prove the convergence of Algorithm 1. To this end, we first make the following assumptions.

###### Assumption 1

The transition probability matrix determined by the state transition model, the observation model, and the behavior policy is aperiodic and irreducible [18]. Furthermore, we assume that the visit probability of every state-action pair is strictly positive under the behavior policy.

We additionally assume the following.

###### Assumption 2

All elements of the observation probability matrix are strictly positive, i.e., every state emits every observation with positive probability.

Under these assumptions, we will prove the following convergence result.

###### Proposition 1 (Main convergence result)

(i) The parameter iterate in Algorithm 1 converges almost surely to a stationary point of the conditional log-likelihood density function based on the sequence of extended observations, i.e., a point at which the expected gradient of the log-likelihood lies in the normal cone [19, pp. 343] of the convex constraint set at that point, where the expectation is taken with respect to the invariant distribution of the extended Markov chain.

(ii) Let the limit model parameters be defined in the almost sure convergence sense. Then the Q-function iterate in Algorithm 1 converges in distribution to the optimal Q-function of the estimated model, i.e., the unique solution of the corresponding Bellman optimality equation.

### III-A HMM Estimation

We employ the recursive HMM estimators from [15, 16] for our estimation problem, where we estimate the true parameter with the model parameters parametrized as continuously differentiable functions of a vector of real numbers $\theta$ constrained to a compact convex set $\Theta$. In this paper, we consider the normalized exponential (softmax) function to parametrize the probability matrices: each entry of a probability matrix is the exponential of the corresponding unconstrained parameter, normalized by the row sum of exponentials. The reward model is parametrized by a real matrix of mean rewards and a scalar noise parameter.
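As a concrete illustration of the softmax parametrization (function and variable names are ours), each probability matrix can be generated row-wise from unconstrained real parameters, which guarantees strictly positive entries and rows that sum to one:

```python
import numpy as np

def softmax_rows(theta):
    """Map an unconstrained real matrix to a row-stochastic matrix.

    Each row of the result is the softmax of the corresponding row of
    theta, so every entry is strictly positive and each row sums to one,
    as required for a probability (transition or observation) matrix.
    """
    z = theta - theta.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

theta = np.zeros((4, 4))        # unconstrained parameters
T = softmax_rows(theta)         # all-zero parameters give uniform rows
```

Because the map is smooth, gradients of the likelihood with respect to the unconstrained parameters are well defined, which is what the recursive estimator requires.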

The iterate of the recursive estimator converges to the set of stationary points, where the gradient of the likelihood density function is zero [15, 16]. The conditional log-likelihood density function based on the sequence of extended observations $y_1, \dots, y_n$ (each $y_k$ collecting the observation, action, and reward at step $k$) is

$$\ell_n(\theta) = \log p_\theta(y_1, \dots, y_n) = \sum_{k=1}^{n} \log p_\theta(y_k \mid y_{1:k-1}). \qquad (1)$$

When the state transition and observation model parameters are available, the state prediction $\hat{x}_t$, whose $i$-th element is the conditional probability of the state being $i$ given the past extended observations (2), is calculated from the recursive state predictor (Bayesian state belief filter) [20]. The state predictor is given as follows:

$$\hat{x}_{t+1} = \frac{A^{\top}(\theta)\, B(y_t; \theta)\, \hat{x}_t}{\mathbf{1}^{\top} B(y_t; \theta)\, \hat{x}_t}, \qquad (3)$$

where $A(\theta)$ is the state transition probability matrix of the HMM, $b_i(y; \theta)$ denotes the likelihood of the extended observation $y$ at state $i$ (4), and $B(y; \theta)$ is the diagonal matrix with $[B(y; \theta)]_{ii} = b_i(y; \theta)$. Using the Markov property of the state transitions and the conditional independence of the observations given the states, it is easy to show that the conditional likelihood density (1) can be expressed with the state prediction and the observation likelihood as follows [15, 16]:

$$\ell_n(\theta) = \sum_{k=1}^{n} \log\!\left(\mathbf{1}^{\top} B(y_k; \theta)\, \hat{x}_k\right). \qquad (5)$$
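A minimal sketch of one filter step, assuming a prediction vector over hidden states, a vector of extended-observation likelihoods, and a row-stochastic HMM transition matrix (names are ours); the per-step normalizing constants are exactly the factors whose logs sum to the log-likelihood in (5):

```python
import numpy as np

def predictor_step(x_pred, b, A):
    """One step of the Bayesian state belief filter.

    x_pred : current state prediction (probabilities over hidden states)
    b      : likelihood of the current extended observation at each state
    A      : HMM state transition probability matrix (rows sum to 1)

    Returns the next state prediction and the per-step likelihood
    p(y_t | y_{1:t-1}), i.e., the normalizing constant of the Bayes update.
    """
    lik = float(b @ x_pred)            # normalizing constant 1^T B(y) x
    posterior = b * x_pred / lik       # Bayes correction with the observation
    x_next = A.T @ posterior           # propagate through the transition model
    return x_next, lik

# Accumulate the log-likelihood over a toy two-step observation sequence.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
x = np.array([0.5, 0.5])
loglik = 0.0
for b in [np.array([0.7, 0.2]), np.array([0.1, 0.6])]:
    x, lik = predictor_step(x, b, A)
    loglik += np.log(lik)
```

Since the rows of the transition matrix sum to one, the prediction vector remains a probability distribution after every step.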

###### Remark 1

Since the functional parametrization of the probability matrices uses the non-convex softmax functions, the log-likelihood is non-convex in general.

Roughly speaking, the recursive HMM estimation [15, 16] computes an online estimate of the gradient of the log-likelihood based on the current output, the state prediction, and the current parameter estimate, and adds this stochastic gradient to the current parameter estimate; i.e., it is a stochastic gradient *ascent* algorithm that maximizes the conditional likelihood.

We first introduce the HMM estimator [15, 16] and then apply the convergence result of [15] to our estimation task. The recursive HMM estimation in Algorithm 1 is given by:

$$\theta_{t+1} = \Pi_{\Theta}\!\left[\theta_t + \epsilon_t\, S_t\right], \qquad (6)$$

$$S_t = \nabla_{\theta} \log\!\left(\mathbf{1}^{\top} B(y_t; \theta)\, \hat{x}_t(\theta)\right)\Big|_{\theta = \theta_t}, \qquad (7)$$

where $\Pi_{\Theta}$ denotes the projection onto the convex constraint set $\Theta$, $\epsilon_t$ denotes the diminishing step size such that $\sum_t \epsilon_t = \infty$ and $\sum_t \epsilon_t^2 < \infty$, $B(y_t; \theta)$ is the diagonal matrix of extended-observation likelihoods, $\hat{x}_t$ is the state prediction, and $\omega_t = \partial \hat{x}_t / \partial \theta$ denotes the Jacobian of the state prediction vector with respect to the parameter vector, which is used to evaluate (7).

###### Remark 2

(i) The diminishing step size used above is standard in stochastic approximation algorithms (see Chapter 5.1 in [21]). (ii) The projection onto the convex constraint set has advantages such as guaranteed stability and convergence of the algorithm, prevention of numerical instability (e.g., floating point underflow), and avoidance of exploration in regions of the parameter space far from the true parameter. The useful parameter values in a properly parametrized practical problem are usually confined by constraints of physics or economics to some compact set [21]. The constraint set can usually be determined from solution analysis, depending on the problem structure.

Using calculus, equation (7) is written in terms of the observation likelihood matrix $B$, the state prediction $\hat{x}_t$, the Jacobian $\omega_t$, and their partial derivatives with respect to the parameters (8). The state prediction $\hat{x}_t$ is recursively updated using the state predictor in (3) evaluated at the current iterate $\theta_t$:

$$\hat{x}_{t+1} = \frac{A^{\top}(\theta_t)\, B(y_t; \theta_t)\, \hat{x}_t}{\mathbf{1}^{\top} B(y_t; \theta_t)\, \hat{x}_t}, \qquad (9)$$

with $\hat{x}_0$ initialized as an arbitrary distribution on the finite state set and $A(\theta_t)$ being the state transition probability matrix for the current iterate $\theta_t$. The state predictor (9) calculates the state estimate (or Bayesian belief) by normalizing the conditional likelihood and then multiplying it with the state transition probability. The predicted state estimate is used recursively to calculate the state prediction at the next step. Taking the derivative of the update law (9) with respect to each element of the parameter vector yields the update law for the Jacobian $\omega_t$ (10), where the initial $\omega_0$ is chosen arbitrarily.

At each time step, the HMM estimator defined by (6), (8), (9), and (10) updates the parameter estimate based on the current sample, while keeping track of the state estimate $\hat{x}_t$ and its partial derivative $\omega_t$.
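To make the ascent step concrete, here is a simplified sketch of one parameter update in the spirit of (6). It scores the parameters by finite differences of the per-step predictive log-likelihood and deliberately omits the Jacobian recursion (10) that propagates the sensitivity of the state prediction, so it is an illustration under our own simplifying assumptions rather than the full estimator:

```python
import numpy as np

def recursive_mle_step(theta, x_pred, y, eps, loglik_fn, lo, hi):
    """One simplified stochastic-gradient-ascent parameter update.

    loglik_fn(theta, x_pred, y) must return the per-step predictive
    log-likelihood of the extended observation y given the prediction
    x_pred. The gradient is approximated by central finite differences
    (the sensitivity of x_pred to theta is ignored in this sketch).
    """
    grad = np.zeros_like(theta)
    h = 1e-6
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d.flat[i] = h
        grad.flat[i] = (loglik_fn(theta + d, x_pred, y)
                        - loglik_fn(theta - d, x_pred, y)) / (2 * h)
    theta_new = theta + eps * grad       # stochastic gradient ascent step
    return np.clip(theta_new, lo, hi)    # projection onto a box constraint set
```

The box projection via `np.clip` stands in for the projection onto a general convex constraint set; the full algorithm instead carries the Jacobian forward to obtain the exact online score.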

Now we state the convergence of the estimator.

###### Proposition 2

(i) The extended Markov chain consisting of the hidden state, the extended observation, the state prediction, and its Jacobian is *geometrically ergodic*, i.e., its distribution converges to the invariant distribution at a geometric rate for any initial condition.

(ii) For each fixed parameter in the constraint set, the normalized log-likelihood in (1) almost surely converges to its stationary expectation $\ell(\theta)$ (11), where the expectation is taken with respect to the marginal of the invariant distribution of the extended Markov chain.

(iii) The iterate $\theta_t$ converges almost surely to the invariant set (set of equilibrium points) of the projected ODE

$$\dot{\theta} = \nabla_{\theta} \ell(\theta) + z, \qquad (12)$$

where the expectation defining $\ell$ is taken with respect to the invariant distribution, and $z$ is the projection term that keeps $\theta$ in the constraint set $\Theta$; $z$ lies in the tangent cone of $\Theta$ at $\theta$ [19, pp. 343].

###### Remark 3

The characterization of the projection term in (12) is due to [22, Appendix E]. Using the definitions of the tangent and normal cones [19, pp. 343], we can readily prove that the set of stationary points of (12) is the set of points at which the gradient of the expected log-likelihood lies in the normal cone of the constraint set at that point. Note that this set is identical to the set of KKT points of the constrained nonlinear program that maximizes the expected log-likelihood over the constraint set.

###### Remark 4

As with other maximum likelihood estimation algorithms, if we further assume that the expected log-likelihood is concave, it is possible to show that the iterate converges to the unique maximum likelihood estimate. However, the concavity of the expected log-likelihood is not known a priori. Similarly, asymptotic stability of the ODE (12) is assumed to obtain the desired convergence in [15]. We refer to [15] for the technical details regarding the convergence set.

### III-B Estimating the Q-function with the HMM State Predictor

In addition to the estimation of the HMM parameters, we aim to *recursively* estimate the optimal action-value function using *partial* state observation.

From Bellman's optimality principle, the optimal Q-function is defined as

$$Q^{*}(s, a) = \mathbb{E}[r \mid s, a] + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^{*}(s', a'), \qquad (13)$$

where $P(s' \mid s, a)$ is the state transition probability, which corresponds to the transition model of the POMDP. The standard Q-learning algorithm from [17] estimates the Q-function using the recursive form

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right).$$
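For reference, a minimal sketch of the standard tabular Q-learning update under full state observation (variable names are ours):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Standard tabular Q-learning update [17] with full state observation.

    Q      : array of shape (num_states, num_actions)
    s, a   : observed state and action taken
    r      : reward received
    s_next : observed next state
    """
    td_target = r + gamma * Q[s_next].max()     # bootstrapped target
    Q[s, a] += alpha * (td_target - Q[s, a])    # move toward the target
    return Q
```

This update presumes the true state is observed, which is exactly what fails in the POMDP setting addressed next.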

Since the state is not directly observed in a POMDP, the state estimate in (9) from the HMM estimator is used instead of the true state. The estimated state transition is defined as the pair of consecutive posterior state estimates (14), where the posterior state estimate is calculated from the state prediction and the current observation likelihood using Bayes' rule (15).

Using the estimated state transition as a surrogate for the true state transition in (13), a recursive estimator for the Q-function is proposed in (16), where the step size is diminishing. In the following proposition, we establish the convergence of (16).
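One plausible way to drive such an update from belief states (an illustrative form of ours, not verbatim the estimator in (16)) is to weight the temporal-difference correction by the posterior state estimates:

```python
import numpy as np

def hmm_q_update(Q, x_post, x_post_next, a, r, alpha, gamma):
    """Sketch of a Q-function update driven by belief states.

    Instead of the unobserved true state, the temporal-difference update
    is weighted by the posterior state estimates x_post (current step) and
    x_post_next (next step).
    """
    # Expected next-step value under the next belief state.
    v_next = x_post_next @ Q.max(axis=1)
    # TD error of the belief-averaged value for the taken action.
    td = r + gamma * v_next - x_post @ Q[:, a]
    # Distribute the correction over states in proportion to the belief.
    Q[:, a] += alpha * x_post * td
    return Q
```

When the belief concentrates on a single state, this reduces to the standard tabular update for that state; the update is asynchronous in the action, as only the column of the taken action changes.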

###### Proposition 3

Suppose that Assumptions 1 and 2 hold. Then the ODE corresponding to (16) has a unique globally asymptotically stable equilibrium point, where the gain of the ODE is determined by the expected frequency of recurrence to each action (for details, see Appendix V-B), and the expectations are taken with respect to the invariant distribution. As a result, the iterate of the recursive estimation law in (16) converges in distribution to the unique equilibrium point of the ODE, i.e., the unique solution of the corresponding Bellman equation.

###### Remark 5

Note that the estimated state transition is a continuous function of random variables that converge almost surely due to the ergodicity of the Markov chain and the convergence of the parameter estimate (proven above). By the continuous mapping theorem [23], a continuous function of converging random variables converges in the same sense. The update of the Q-function is asynchronous, since only the entries corresponding to the current action are updated at each step. A result on stochastic approximation from [21] is invoked to prove the convergence. The proof follows from the ergodicity of the underlying Markov chain and the contraction property of the Bellman operator. See Appendix V-B for the details.

### III-C Learning the State Transition given the Action with the HMM State Predictor

When full state observation is available, the transition model can be estimated by simply counting all incidents of each transition, and this count-based estimate is the maximum likelihood estimate. Since the state is only partially observed, we use the state estimate instead of counting transitions directly.

We aim to estimate the expectation of the indicator function of each transition triple of state, action, and next state (17), where the expectation is taken with respect to the stationary distribution corresponding to the true parameter. This expectation equals the expected count of the transition divided by the total number of transitions, i.e., the stationary probability of the transition triple.
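A sketch of this surrogate counting, under our illustrative assumption that the unobservable indicator is replaced by the outer product of consecutive posterior state estimates:

```python
import numpy as np

def transition_freq_update(phi, x_post, x_post_next, a, eps):
    """Recursive estimate of the joint transition frequencies.

    The indicator 1{s_t=i, a_t=a, s_{t+1}=j}, which cannot be evaluated
    without state observation, is replaced by the outer product of the
    consecutive posterior state estimates.

    phi : array of shape (N, A, N), running estimate of the joint
          probability of (state, action, next state).
    """
    surrogate = np.zeros_like(phi)
    surrogate[:, a, :] = np.outer(x_post, x_post_next)
    return phi + eps * (surrogate - phi)   # stochastic approximation step

def conditional_from_joint(phi):
    """Recover the conditional model P(s' | s, a) by normalizing the joint
    estimate over the next state; unvisited (s, a) pairs fall back to a
    uniform distribution."""
    denom = phi.sum(axis=2, keepdims=True)
    return np.divide(phi, denom,
                     out=np.full_like(phi, 1.0 / phi.shape[2]),
                     where=denom > 0)
```

The second function implements the division of joint probabilities by marginal probabilities described in the text.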

###### Remark 6

The proposed recursive estimation of the transition frequencies is given by the stochastic approximation recursion (18). We note that the estimation in (18) uses the state estimates as a surrogate for the indicator in (17). Following the same procedure as in the proof of Proposition 3 applied to the ODE corresponding to (18), we can show that the iterate converges to the stationary probability of each transition triple, i.e., the marginal distribution of the transition from one state to the next after taking the given action, computed with respect to the invariant distribution of the entire process. Since we estimate the joint distribution, the conditional transition distribution can be calculated by dividing the joint probabilities by the marginal probabilities.

## IV A Numerical Example

In this simulation, we implement HMM Q-learning on a finite-state POMDP example, in which 4 hidden states are observed through 2 observations, with the discount factor and model parameters as specified below:

The following behavior policy is used to estimate the HMM, the transition model, and the Q-function:

The diminishing step size is chosen to satisfy the standard stochastic approximation conditions.

### IV-A Estimation of the HMM and Q-function

Figure 1(a) shows that the mean of the sample conditional log-likelihood density increases over time. Figure 1(b) shows that the parameter estimate converges to the true parameter.

To validate the estimation of the Q-function in (16), we run three estimations of the Q-function in parallel: (i) Q-learning [17] with full state observation, (ii) Q-learning applied naively to the partial observation, and (iii) HMM Q-learning. Figure 3 shows the results for all three algorithms.

After 200,000 steps, the iterates of the three Q-function estimates are as follows, where the elements of the matrices are the estimates of the Q-function values for the corresponding state-action pairs. As in other HMM estimation problems (an unsupervised learning task), the labels of the inferred hidden states do not match the labels assigned to the true states. Permuting the state indices to better match the estimated Q-function to the true one, we compare the estimated Q-functions as follows. This permutation is consistent with the estimated observation matrix shown below:
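Such a label alignment can be found by brute force over permutations when the number of hidden states is small; a sketch (names are ours):

```python
import itertools
import numpy as np

def best_state_permutation(Q_est, Q_true):
    """Find the relabeling of inferred hidden states that best matches the
    estimated Q-function to a reference one.

    HMM estimation, being an unsupervised task, recovers the hidden states
    only up to a permutation of their labels, so comparison against the
    true Q-function requires aligning the state indices first.
    """
    n = Q_true.shape[0]
    best, best_err = None, np.inf
    for perm in itertools.permutations(range(n)):
        err = np.abs(Q_est[list(perm)] - Q_true).sum()  # row-permuted L1 error
        if err < best_err:
            best, best_err = perm, err
    return best, best_err
```

The factorial search is only practical for a handful of states, which suffices for the 4-state example here; larger problems would call for an assignment-problem solver instead.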
