# Sublinear Regret for Learning POMDPs

We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimation for hidden Markov models, belief error control in POMDPs, and upper-confidence-bound methods for online learning. We establish a regret bound of O(T^{2/3}√(log T)) for the proposed learning algorithm, where T is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.


## 1 Introduction

The partially observable Markov decision process (POMDP) is a framework for dynamic decision-making when some evolving state of the system cannot be observed. It extends the Markov decision process (MDP) and can be used to model a wide variety of real-world problems, ranging from healthcare to business. The solution to POMDPs is usually through a reduction to MDPs, whose state is the belief (a probability distribution) of the unobserved state of the POMDP; see, e.g., Krish2016 for an overview.

We study the problem of decision making when the environment of the POMDP, such as the transition probability of the hidden state and the probability distribution governing the observation, is unknown to the agent. Thus, the agent has to simultaneously learn the model parameters (we use “environment” and “parameters” interchangeably) and take optimal actions. Such an online learning framework has received considerable attention in the last decades (Sutton2011). Despite the practical relevance of POMDPs, the learning of POMDPs is considered much more challenging than that of finite-state MDPs, and few theoretical results are known. This is not surprising: even with a known environment, the corresponding belief MDP features a continuous state space. When the environment is unknown, we face the additional difficulty of not being able to calculate the belief accurately, since its updating formula is based on the environment. This is in contrast to the learning of standard MDPs, in which the state is always observed exactly.

To tackle this daunting task, we provide an algorithm that achieves sublinear regret, a popular measure for the performance of a learning algorithm relative to that of the oracle, i.e., the optimal policy in the known environment. This is, to our knowledge, the first algorithm that achieves sublinear regret in the general POMDP setup we consider. We summarize the three major contributions of this paper below.

In terms of problem formulation, we benchmark our algorithm against an oracle and measure the performance by the regret. The oracle we consider is the strongest in the recent literature (Azizzadenesheli2016, Fiez2019). In particular, the oracle is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. Such an oracle has higher average reward than oracles that use the best fixed action (Fiez2019) or the optimal memoryless policy, where the action depends only on the current observation (Azizzadenesheli2016). Still, our algorithm attains regret sublinear in the length of the learning horizon. This implies that as the learning horizon increases, the algorithm approximates the strong oracle increasingly accurately.

In terms of algorithmic design, the learning algorithm we propose (see Algorithm 1) has two key ingredients. First, it builds on recent advances in the estimation of the parameters of hidden Markov models (HMMs) using spectral method-of-moments techniques, which involve the spectral decomposition of certain low-order multivariate moments computed from the data (anandkumar2012method, anandkumar2014tensor, Azizzadenesheli2016). It benefits from the theoretical finite-sample bound of spectral estimators, while the finite-sample guarantees of alternatives such as maximum likelihood estimators remain an open problem (Lehericy2019). (There are recent advances on the EM algorithm applied to likelihood-based methods for HMMs with finite-sample analysis (balakrishnan2017statistical, yang2017statistical); however, the conditions imposed there and the resulting basin of attraction are hard to translate to our setting explicitly.)

Second, it builds on the well-known “upper confidence bound” (UCB) method in reinforcement learning (ortner2007logarithmic, Jaksch2010). We divide the horizon into nested exploration and exploitation phases. We use spectral estimators in the exploration phase to estimate the unknown parameters such as the transition matrix of the hidden state, which itself is a function of the action in the period. We apply the UCB method to control the regret in the exploitation phase based on the estimated parameters from the exploration phase and the associated confidence regions. Although the two components have been studied separately before, combining them in our setting poses a unique challenge: the belief of the hidden state is subject to the estimation error. We re-calibrate the belief at the beginning of each exploitation phase based on the most recent estimate of the parameters. This helps us achieve the sublinear regret.

In terms of regret analysis, we establish a regret bound of O(T^{2/3}√(log T)) for our proposed learning algorithm, where T is the learning horizon. Our regret analysis draws inspiration from Jaksch2010, Ortner2012 for learning MDPs and undiscounted reinforcement learning problems, but the analysis differs significantly from theirs because of two main technical challenges in our problem.

First, the belief in POMDPs, unlike the state in MDPs, is not directly observed and needs to be estimated. This is in stark contrast to learning MDPs (Jaksch2010, Ortner2012) with observed states. As a result, we need to bound the estimation error of the belief, which itself depends on the estimation error of the model parameters. In addition, we need to bound the error in the belief transition kernel, which depends on the model parameters in a complex way via Bayesian updating. We overcome these difficulties by extending the approach in DeCastro2017 for HMMs to POMDPs, together with a delicate analysis of the belief transition kernel to control the errors.

Second, to establish the regret bound, we need a uniform bound on the span of the bias function (also referred to as the relative value function) for the optimistic belief MDP, which has a continuous state space. Such a bound is often critical in the regret analysis of undiscounted reinforcement learning for continuous-state MDPs, but it is typically shown under restrictive assumptions such as Hölder continuity that do not hold for the belief state in our setting (Ortner2012, Lakshmanan2015). We develop a novel approach to bound the bias span by bounding the Lipschitz module of the bias function for discounted problems as the discount factor tends to one. Our key step is to uniformly bound, over the discount factors, the Lipschitz module of the belief transitions using the Kantorovich metric. Exploiting the connection with the infinite-horizon undiscounted problem via the vanishing-discount method then yields an explicit bound on the bias span for our problem.

### 1.1 Related Literature

Closest to our work is Azizzadenesheli2016. They also propose to use spectral estimators and upper confidence methods to learn POMDPs, and establish a sublinear regret bound with respect to their oracle. One main difference between our work and theirs is the choice of the oracle/benchmark. Specifically, their oracle is the optimal memoryless policy, i.e., a policy that depends only on the current observation instead of using all historical observations to form the belief of the underlying state. The performance gap between their oracle and ours is linear in T for general POMDPs. As a result, we need to design a new learning algorithm to achieve sublinear regret with our oracle. By considering belief-based policies, several new difficulties arise in our setting. First, the spectral method cannot be applied to samples generated from belief-based policies due to history dependency; second, the belief states cannot be observed and need to be calculated using the parameters, which is not an issue in Azizzadenesheli2016 because the observation in the current period can be regarded as the state; third, we need to bound the bias span for the optimistic belief MDP in the regret analysis. We tackle these difficulties by using an exploration-exploitation interleaving approach in the algorithm design, carefully controlling the belief error, and developing a new method to bound the bias span. We also note that our bound on the bias span differs from the diameter of the POMDP discussed in Azizzadenesheli2016: their diameter is defined only for observation-based policies, not for the belief-based policies we consider. This also explains why we need a new approach to bound the bias span, which is related to the diameter at a high level.

This paper extends the online learning framework popularized by multi-armed bandits to POMDPs. There is a large stream of literature on bandits; bubeck2012regret provide a comprehensive survey. In POMDPs, the rewards across periods are no longer independent. Our problem can be related to adversarial bandits (auer2002nonstochastic), in which the oracle is the best fixed arm and the reward can change arbitrarily over periods. In our setting, the oracle is the optimal policy of the POMDP, which performs better than any fixed action (the gap is linear in T). Therefore, popular algorithms such as EXP3 do not achieve sublinear regret in our setting. A stream of literature studies nonstationary/switching multi-armed bandits, including auer2002nonstochastic, Garivier2011, besbes2014stochastic, keskin2017chasing, cheung2018hedging, auer2019adaptively. There, the reward can change over periods subject to a bounded number of switches or a certain changing budget (the total magnitude of reward changes over the horizon), and the oracle is the best arm in each period. It should be noted that this oracle is stronger than ours. However, all the algorithms designed in this literature require finitely many switches or a sublinear (in T) changing budget. This is understandable, as there is no hope to learn such a strong oracle if the arms can be completely different across periods. In our setting, the number of changes (state transitions) is linear in T, and these algorithms are expected to fail to achieve sublinear regret even when measured against our oracle, which is weaker than the oracle in that stream of literature. There are a few exceptions, including zhu2020demands, chen2020learning, zhou2020regime, which study models with linear changing budget but specific structures. In chen2020learning, the rewards are cyclic, which can be leveraged to learn across cycles despite the linear change. In zhu2020demands, the reward grows over time according to a function. In zhou2020regime, the reward is modulated by an unobserved Markov chain. Another stream of literature investigates the so-called restless Markov bandit problem, in which the state of each arm evolves according to independent Markov chains, whose states may not be observable; see, for example, Slivkins2008, Guha2010, Ortner2014. The POMDP model we consider has a more complex structure, so the algorithms proposed in the above studies cannot achieve sublinear regret.

Our work is related to the rich literature on learning MDPs. Jaksch2010 propose the UCRL2 (Upper Confidence Reinforcement Learning) algorithm to learn finite-state MDPs and prove that the algorithm can achieve the optimal rate of regret measured against the optimal policy in terms of the undiscounted average reward. Follow-up papers have investigated various extensions to Jaksch2010, including posterior sampling (agrawal2017posterior), minimax optimal regret (azar2017minimax, zhang2019regret), and the model-free setting (jin2018q). cheung2019non consider the case where the parameters of the MDP, such as the transition matrix, may change over time. The algorithms are not applicable to our setting, because of the unobserved state in POMDPs. However, since a POMDP can be transformed to a continuous-state MDP, our setting is related to the literature, especially those papers studying MDPs with a continuous state space. Ortner2012, Lakshmanan2015 extend the algorithm in Jaksch2010 to a continuous state space. Furthermore, qian2018exploration, Gampa2019 improve the implementation of the algorithm and make it computationally more efficient. Still, our problem is not equivalent to the learning of continuous-state MDPs. First, in this literature Hölder continuity is typically assumed for the rewards and transition probabilities with respect to the state, in order to aggregate the state and reduce it to the discrete case. However, this assumption does not hold in general for the belief state of POMDPs, whose transition probabilities are not given but arise from the Bayesian updating. Second, even if the continuity holds, the state of the belief MDP in our problem, which is the belief of the hidden state, cannot be observed. It can only be inferred using the estimated parameters. This distinguishes our problem from those studied in this literature. The algorithm and analysis also deviate substantially as a result. 
There are also studies that focus on applications such as inventory management (zhang2018perishable, chen2019coordinating, zhang2020closing) and handle specific issues such as demand censoring and lost sales. We are not aware of any papers in this stream of literature that propose learning algorithms for POMDPs.

Moreover, our work is related to studies on reinforcement learning for POMDPs; see, e.g., ross2011bayesian, spaan2012partially and references therein. guo2016pac propose a learning algorithm for a class of episodic POMDPs, where the performance metric is the sample complexity, i.e., the time required to find an approximately optimal policy. Recently, jin2020sample give a sample-efficient learning algorithm for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of hidden states. Their algorithm does not necessarily lead to sublinear regret guarantees due to its large amount of strategic exploration. There is also a growing body of literature that applies deep reinforcement learning methods to POMDPs; see, e.g., hausknecht2015deep, igl2018deep. Our work differs from these papers in that we study the learning of ergodic POMDPs in an unknown environment and we focus on developing a learning algorithm with sublinear regret guarantees. A concurrent study (kwon2021rl) considers regret minimization for reinforcement learning in a special class of POMDPs called latent MDPs. The hidden state is static in their work, while it is dynamic in our setting.

Furthermore, our work is related to the literature on the spectral method for estimating HMMs and its application to POMDPs. For instance, anandkumar2012method, anandkumar2014tensor use the spectral method to estimate the unknown parameters in HMMs by constructing so-called multi-views from the observations. The spectral method is not readily applicable to POMDPs because of the dependence introduced by the actions. Azizzadenesheli2016 address the issue by restricting to memoryless policies, i.e., the action depends only on the observation in the current period instead of the belief state. They extend the spectral estimator to data generated from an arbitrary distribution other than the stationary distribution of the Markov chain, which is necessary in learning problems where policies need to be experimented with.

Finally, on the computational side, there is a large body of literature on the planning problem for POMDPs, i.e., computing the optimal belief-based policy for the average reward POMDP when the environment is known. Solving such planning problems using exact dynamic programming methods for the resulting belief MDP is challenging, because the belief MDP has a continuous state space. Various methods have been proposed to compute an approximately optimal policy for belief MDPs or more general continuous-state MDPs with the average reward criterion; see, e.g., ormoneit2002kernel, Yu2004, yu2008near, saldi2017asymptotic, sharma2020approximate and the references therein for details. In our work, we focus on learning the POMDP while assuming access to an optimization oracle for the planning problem to derive regret bounds.

The rest of the paper is organized as follows. In Section 2 we discuss the problem formulation. Section 3 presents our learning algorithm. In Section 4, we state our main results on the regret bounds for the learning algorithm. Finally, we conclude in Section LABEL:sec:conclusion. All the proofs of the results in the paper are deferred to the electronic companion.

## 2 Problem Formulation

We first introduce the notation for the POMDP. A POMDP model with horizon T consists of the tuple

 {M, I, O, P, Ω, R}, (1)

where

• M denotes the state space of the hidden state. We use Mt to denote the state at time t.

• I denotes the action space, with It representing the action chosen by the agent at time t.

• O is a finite set of possible observations, and Ot denotes the observation at time t.

• P = {P(i) : i ∈ I} describes a family of transition probability matrices, where P(i) is the transition probability matrix for states in M after the agent takes action i. That is, P(i)(m, m′) = P(Mt+1 = m′ | Mt = m, It = i) for m, m′ ∈ M.

• The observation density function Ω(· | m, i) is a distribution over observations that occur in state m after the agent takes action i in the last period, i.e., Ω(o | m, i) = P(Ot = o | Mt = m, It−1 = i).

• The reward function R(m, i) specifies the immediate reward for each state-action pair (m, i), and we assume |R(m, i)| ≤ Rmax for some constant Rmax > 0.

The following sequence of events occurs in order in each period. In period t, the underlying state transits to Mt. Then the agent observes Ot, whose distribution depends on Mt and It−1. The agent then chooses an action It and receives the reward R(Mt, It), which depends on the state Mt and the action It. Then time proceeds to t + 1 and the state transits to Mt+1, whose transition probability depends on the action It.

In the POMDP model, the agent does not observe the state Mt, but only the observation Ot, after which an action is chosen. Moreover, it is typical to assume that the action does not depend on the realized rewards either (cao2007partially), as if the rewards were not observed. Therefore, the action taken in period t, It, depends on the history up to time t, denoted by

 H0 \coloneqq {I0}, (2)
 Ht \coloneqq {I0, O1, ⋯, It−1, Ot}, t ≥ 1. (3)

The agent attempts to optimize the expected cumulative reward over a finite horizon T. The information structure is illustrated by the graph in Figure 1.
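The sequence of events above can be sketched as a small simulator. The instance sizes, the randomly drawn parameters, and the observation-based policy below are illustrative assumptions of ours, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small instance: 2 hidden states, 2 actions, 3 observations.
n_states, n_actions, n_obs = 2, 2, 3
# P[i][m, m']: probability of transiting from state m to m' under action i.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
# Omega[i][m, o]: probability of observing o in state m after action i
# was taken in the last period.
Omega = rng.dirichlet(np.ones(n_obs), size=(n_actions, n_states))
R = rng.random((n_states, n_actions))  # reward R(m, i) in [0, 1)

def simulate(T, policy, m0=0, i0=0):
    """Run T periods: observe O_t (depends on M_t and I_{t-1}),
    choose I_t, collect R(M_t, I_t), then transit to M_{t+1}."""
    m, i_prev, total = m0, i0, 0.0
    for _ in range(T):
        o = rng.choice(n_obs, p=Omega[i_prev, m])   # observe
        i = policy(o)                               # act
        total += R[m, i]                            # collect reward
        m = rng.choice(n_states, p=P[i, m])         # state transition
        i_prev = i
    return total
```

For instance, `simulate(100, lambda o: o % n_actions)` returns the cumulative reward of a memoryless policy over 100 periods.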

### 2.1 Reformulate POMDP as Belief MDP

If the model environment in (1) is known to the agent, then it is well known (see, e.g., Krish2016) that to maximize the expected reward, the agent can reformulate the POMDP as an MDP with a continuous state space. The state of the MDP reflects the belief, or the distribution, over the hidden states, and thus it is referred to as the belief MDP. More precisely, define an |M|-dimensional vector bt as the belief of the underlying state in period t:

 b0(m) \coloneqq P(M0 = m), (4)
 bt(m) \coloneqq P(Mt = m | Ht), t ≥ 1.

Note that we do not include the observed rewards in the history. This is a typical setting in the POMDP literature (cao2007partially).

Because of the Markovian structure of the belief, we can show (see, e.g., Krish2016, Puterman2014) that the belief in period t + 1 can be updated based on the current belief bt, the chosen action It and the observation Ot+1. In particular, the updating function HP,Ω determines

 bt+1 = HP,Ω(bt, It, Ot+1). (5)

We may omit the dependence on P and Ω if it does not cause confusion. By Bayes's theorem, we have

 bt+1(m) = Ω(o | m, i) ∑m′∈M P(i)(m′, m) bt(m′) / P(o | b, i), (6)

where P(o | b, i) is the distribution of the observation under belief b and action i, evaluated at b = bt, i = It and o = Ot+1.

We next introduce some notation to facilitate the discussion and analysis. Define the expected reward conditional on the belief b and action i:

 ¯R(b, i) \coloneqq ∑m∈M R(m, i) b(m). (7)

We can also define the transition kernel of the belief conditional on the action:

 ¯T(bt+1|bt,it) :=P(bt+1|bt,it)=∑ot+1∈O1{H(bt,it,ot+1)=bt+1}P(ot+1|bt,it). (8)
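A minimal sketch of the Bayes update (6), the expected reward (7), and the induced belief kernel (8); the array layout (`P[i]` rows indexed by the current state, `Omega[i]` rows by the state) and all function names are our own conventions.

```python
import numpy as np

def belief_update(b, i, o, P, Omega):
    """Bayes update (6): b'(m) ∝ Ω(o|m,i) Σ_{m'} P(i)(m',m) b(m').
    Returns the posterior belief and the likelihood P(o | b, i)."""
    pred = P[i].T @ b               # predicted next-state distribution
    unnorm = Omega[i][:, o] * pred  # weight by observation likelihood
    p_o = unnorm.sum()              # normalizing constant P(o | b, i)
    return unnorm / p_o, p_o

def expected_reward(b, i, R):
    """Expected reward (7): R̄(b, i) = Σ_m R(m, i) b(m)."""
    return R[:, i] @ b

def belief_kernel(b, i, P, Omega):
    """Kernel (8): the finitely many successor beliefs and their
    probabilities, one per observation with positive likelihood."""
    pred = P[i].T @ b
    out = []
    for o in range(Omega.shape[2]):
        unnorm = Omega[i][:, o] * pred
        p_o = unnorm.sum()
        if p_o > 0:
            out.append((unnorm / p_o, p_o))
    return out
```

Since the observation set is finite, the kernel (8) is supported on at most |O| successor beliefs, which is what `belief_kernel` enumerates.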

A policy μ for the belief MDP is a mapping from belief states to actions, i.e., the action chosen by the agent at time t is μ(bt). Following the literature (see, e.g., agrawal2017posterior), we define the gain of a policy and the optimal gain. The gain of a policy μ, given the initial belief state b, is defined as the long-run average reward for the belief MDP over an infinite horizon, given by:

 ρμb:=limsupT→∞1TE[T−1∑t=0R(Mt,μt)|b0=b], (9)

where the expectation is taken with respect to the interaction sequence when policy μ interacts with the belief MDP. The optimal gain is defined by

 ρ∗\coloneqqsupbsupμρμb. (10)

### 2.2 Assumptions

Next we provide the technical assumptions for the analysis.

The entries of all transition matrices are bounded away from zero: there exists ε > 0 such that P(i)(m′, m) ≥ ε for all m′, m ∈ M and i ∈ I.

Assumptions 2.2 and 2.2 can be strong in general, but they are required by the state-of-the-art method to bound the belief error caused by parameter miscalibration (see DeCastro2017 for the HMM setting), which is essential in learning POMDPs. Moreover, the two assumptions provide sufficient conditions to guarantee the existence of the solution to the Bellman optimality equation of the belief MDP and the boundedness of the bias span; see Propositions 2.2 and LABEL:prop:span-uni-bound. Note that Assumption 2.2 itself implies that for any fixed i, the Markov chain with transition matrix P(i) is geometrically ergodic with a unique stationary distribution, denoted by ω(i); see, e.g., Theorems 2.7.2 and 2.7.4 in Krish2016. This geometric ergodicity, which can hold under weaker assumptions, is needed for spectral estimation of the POMDP model as in Azizzadenesheli2016.

For each i ∈ I, the transition matrix P(i) is invertible.

For all i ∈ I, the observation distributions {Ω(· | m, i) : m ∈ M} are linearly independent.

Assumptions 2.2 and 2.2 are required for the finite-sample guarantee of spectral estimators (anandkumar2012method, anandkumar2014tensor); see Section 3.1 for more details. Since our learning algorithm uses the spectral estimator to estimate hidden Markov models, our approach inherits these assumptions.
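These two conditions can be checked numerically for a given model; the function name and the tolerance below are our own choices.

```python
import numpy as np

def check_spectral_assumptions(P, Omega, tol=1e-8):
    """Verify, for every action i, that P(i) is invertible and that the
    observation distributions {Ω(·|m,i)}_m are linearly independent.
    P has shape (actions, states, states); Omega has shape
    (actions, states, observations) with rows Ω(·|m,i)."""
    n_states = P.shape[1]
    for i in range(P.shape[0]):
        if abs(np.linalg.det(P[i])) < tol:
            return False      # P(i) is (numerically) singular
        if np.linalg.matrix_rank(Omega[i], tol=tol) < n_states:
            return False      # observation distributions are dependent
    return True
```

Note that linear independence of the |M| observation distributions requires |O| ≥ |M|, the undercomplete regime mentioned in the related-literature discussion.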

Before we proceed, we first state a result on the characterization of the optimal gain given in Definition 2.1 and the existence of stationary optimal policies for the belief MDP (9) under the average reward criterion. Note that in general (without the assumptions), there is no guarantee that a stationary optimal policy would exist for problem (9) (see e.g. Yu2004).

Suppose Assumptions 2.2 and 2.2 hold. Then there exist a bounded function v and a constant ρ∗ such that the Bellman optimality equation holds for problem (10):

 ρ∗+v(b)=maxi∈I[¯R(b,i)+∫Bv(b′)¯T(db′|b,i)],∀b∈B. (11)

Moreover, there exists a stationary deterministic optimal policy for problem (10), which prescribes an action that maximizes the right side of (11). The constant ρ∗ is the optimal gain defined in (10).

The function v is referred to as the bias function, or the relative value function, of the belief state for the undiscounted problem (9) (Chapter 8 of Puterman2014). To prove Proposition 2.2, it is known that the key is to establish a uniform bound on the bias functions of the associated discounted problems; see, e.g., ross1968arbitrary, Hsu2006. We achieve this by using a new approach based on analyzing the Lipschitz properties of belief transition kernels and value functions for the discounted problems. In general, solving the optimality equation (11) and finding the optimal policy for a POMDP with the average reward criterion in a known environment are computationally challenging due to the continuous belief states. In this work, we do not focus on this planning problem and assume access to an optimization oracle that solves the Bellman equation (11) and returns ρ∗, v and the optimal stationary policy.
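Since we only assume access to a planning oracle, one crude stand-in (restricted to a two-state chain, with a grid resolution and discount factor chosen by us) is discounted value iteration on a discretized belief space, in the spirit of the vanishing-discount argument; this is a toy sketch, not the paper's oracle.

```python
import numpy as np

def plan_on_grid(P, Omega, R, n_grid=41, gamma=0.95, iters=300):
    """Approximate the Bellman equation (11) for a 2-state belief MDP by
    value iteration on a uniform belief grid; each Bayes update is
    projected back to the nearest grid point."""
    n_actions, n_states, n_obs = Omega.shape
    assert n_states == 2, "sketch covers two hidden states for brevity"
    xs = np.linspace(0.0, 1.0, n_grid)        # belief b = (x, 1 - x)
    V = np.zeros(n_grid)
    for _ in range(iters):
        V_new = np.empty(n_grid)
        for k, x in enumerate(xs):
            b = np.array([x, 1.0 - x])
            best = -np.inf
            for i in range(n_actions):
                pred = P[i].T @ b             # predicted state distribution
                q = R[:, i] @ b               # expected reward (7)
                for o in range(n_obs):
                    unnorm = Omega[i][:, o] * pred
                    p_o = unnorm.sum()
                    if p_o > 1e-12:           # Bayes update, then project
                        x_next = (unnorm / p_o)[0]
                        q += gamma * p_o * V[np.argmin(np.abs(xs - x_next))]
                best = max(best, q)
            V_new[k] = best
        V = V_new
    return V
```

As the discount factor tends to one, the span of V approximates the bias span discussed above; the grid projection introduces an additional discretization error not analyzed here.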

### 2.3 Learning POMDP

We consider learning algorithms for the POMDP model when some model parameters are unknown. In particular, the agent knows the state space M, the action space I, the observation space O, and the reward function R, but has no knowledge about the underlying hidden state Mt, the transition matrices P(i) for all actions, or the observation density function Ω. The goal is to design a learning policy to decide which action to take in each period so as to maximize the expected cumulative reward over T periods even though the environment is unknown in advance. Note that the setting is slightly different from multi-armed bandits, in which the reward distribution of each arm is unknown. In POMDPs, it is typical to assume R to be a deterministic function, with the random noise coming mainly from the observation. Moreover, the realized reward is usually not observed or used to determine the action, as mentioned previously. Therefore, it is reasonable to set up the environment so as to learn the parameters related to the observations. Our approach can be used to learn the reward function as well, if the historical rewards can be observed.

For a learning policy π, the action taken in period t, which we denote by πt, is adapted to the history Hπt \coloneqq {Iπ0, Oπ1, ⋯, Iπt−1, Oπt}, where Oπt denotes the observation received under the learning policy in period t. Note that π maps the initial belief and the history to an action in period t. Similar to Definition 2.1, we may define the reward in period t for the policy π when the initial belief is b as

 Rπt(b)\coloneqqR(Mt,πt). (12)

Note that both Mt and πt depend on the initial belief b, which we omit in the notation.

To measure the performance of a learning policy, we follow the literature (see, e.g., Jaksch2010, Ortner2014, agrawal2017posterior) and set the optimal gain ρ∗ as the benchmark. In particular, we define the total regret of π in T periods as

 RπT\coloneqqmaxb{(T+1)ρ∗−T∑t=0Rπt(b)}. (13)

The objective is to design efficient learning algorithms whose regret grows sublinearly in T with theoretical guarantees. In the sequel, the dependency of Rπt on b may be dropped if it is clear from the context.

## 3 The SEEU Learning Algorithm

This section describes our learning algorithm for the POMDP, which is referred to as the Spectral Exploration and Exploitation with Upper Confidence Bound (SEEU) algorithm. We first provide a high-level overview of the algorithm and then elaborate on the details.

To devise a learning policy for the POMDP with unknown P (transition probabilities) and Ω (observation distributions), one needs a procedure to estimate those quantities from the history, i.e., the past actions and observations. anandkumar2012method, anandkumar2014tensor propose the spectral estimator for the unknown parameters in hidden Markov models (HMMs), with finite-sample theoretical properties. It serves as a major component in the SEEU algorithm.

However, the spectral estimator is not directly applicable to our setting, because there is no decision making involved in HMMs. In a POMDP, the action may depend on past observations, and such dependency violates the assumptions of the spectral estimator. To address the issue, we divide the horizon into nested “exploration” and “exploitation” phases. In the exploration phase, we choose each action successively for a fixed number of periods. This transforms the system into an HMM, so that we can apply the spectral method to estimate P and Ω from the observed actions and observations in that phase. In the exploitation phase, based on the confidence region of the estimators obtained from the exploration phase, we use a UCB-type policy to implement the optimistic policy (the optimal policy for the best-case estimators in the confidence region) for the POMDP.

The SEEU algorithm is presented in Algorithm 1. The algorithm proceeds in episodes of increasing length, similar to the UCRL2 algorithm in Jaksch2010 for learning MDPs. Each episode is divided into an exploration phase and an exploitation phase. The exploration phase lasts a fixed number of periods per action (Step 3), governed by a tunable hyperparameter and the total number of actions |I| in the action space. In this phase, the algorithm chooses each action successively for the fixed number of periods. In Step 7 it applies the spectral estimator (Algorithm 2, to be introduced in Section 3.1) to (re-)estimate P and Ω. Moreover, it constructs a confidence region based on Proposition 2 with a confidence level that vanishes over episodes (Step 8). The key information to extract from the exploration phase is

• the optimistic POMDP inside the confidence region (Step 9);

• the updated belief vector according to the new estimators (Step 11).

Then the algorithm enters the exploitation phase (Step 13), whose length grows with the episode index and is governed by another tunable hyperparameter. In the exploitation phase, an action is chosen according to the optimal policy associated with the optimistic estimators for P and Ω inside the confidence region. This is the principle of “optimism in the face of uncertainty” for UCB-type algorithms.

Before getting into the details, we comment on the major difficulties in designing and analyzing such an algorithm. To apply the spectral estimator, the actions in the exploration phase are chosen deterministically to “mimic” an HMM, as mentioned above. This is necessary as the spectral estimator requires fast convergence to a stationary distribution, guaranteed by Assumption 2.2. Moreover, at first sight, the re-calculation of the belief in Step 11 may deviate significantly from the actual belief computed with the exact parameters: the belief relies on the whole history, and a small error in the estimation may accumulate over periods and lead to an erroneous calculation. We show in Proposition LABEL:prop:lip_bt that the belief error can in fact be well controlled. This is important for the algorithm to achieve the sublinear regret.
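The phase structure can be summarized by a schedule skeleton. The specific phase-length rules below (a constant stretch per action for exploration, an exploitation length growing linearly in the episode index) are illustrative assumptions, and the estimation and planning steps are left as comments.

```python
def seeu_schedule(T, n_actions, tau=10, c=50):
    """Nested exploration/exploitation schedule over episodes.
    tau: periods per action in each exploration phase (assumed);
    c:   growth rate of the exploitation phase length (assumed)."""
    t, k, phases = 0, 0, []
    while t < T:
        k += 1
        # Exploration: play each action successively for tau periods,
        # then (re-)estimate P and Omega by the spectral method, build
        # the confidence region, and re-calibrate the belief (Steps 7-11).
        phases.append(("explore", k, tau * n_actions))
        t += tau * n_actions
        # Exploitation: follow the optimistic policy for c * k periods.
        phases.append(("exploit", k, c * k))
        t += c * k
    return phases
```

Under these assumed phase lengths, the exploitation phases grow linearly with the episode index, so the total fraction of time spent exploring vanishes as T grows.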

### 3.1 Exploration: Spectral Method

We next zoom in on the exploration phase of a particular episode, in order to show the details of the spectral estimator in Steps 7 and 8 (anandkumar2012method, anandkumar2014tensor, Azizzadenesheli2016). Suppose the exploration phase lasts from period 0 to N, with a fixed action i and realized observations sampled according to the observation density Ω(· | ·, i). When the action is fixed, the underlying state converges to the steady state geometrically fast due to Assumption 2.2. For ease of exposition, we assume that the system has reached the steady state at time 0. In Remark 3.1, we discuss how to control the error when the system starts from an arbitrary state distribution.

For $t \ge 2$, we consider three “views” $v^{(i)}_{1,t}$, $v^{(i)}_{2,t}$ and $v^{(i)}_{3,t}$, associated with the observations $o_{t-1}$, $o_t$ and $o_{t+1}$, respectively. (Here a view is simply a feature of the collected data, a term commonly used in data fusion (zhao2017multi). We stick to the term used in the original description of the spectral estimator.) We can see from Figure 1 that given $m_t$ and $i_t=i$, the three views are conditionally independent. Since the system has reached the steady state, the distribution of the views is also stationary. The key idea of the spectral estimator is to express the distribution of the views as a function of the parameters to learn; the relevant moments are then matched to the samples, similar in spirit to the method of moments.

We represent the views in vector form for convenience. Formally, we encode $o_{t-1}$ into a unit vector $v^{(i)}_{1,t}\in\mathbb{R}^{O}$ satisfying $v^{(i)}_{1,t}=e_{o_{t-1}}$, where $e_o$ denotes the $o$-th standard basis vector. Similarly, $o_t$ and $o_{t+1}$ can also be expressed as unit vectors $v^{(i)}_{2,t}$ and $v^{(i)}_{3,t}$. Define three matrices $A^{(i)}_1$, $A^{(i)}_2$ and $A^{(i)}_3$ for action $i$ such that:

$$A^{(i)}_1(o,m)=\mathbb{P}\big(v^{(i)}_{1,t}=e_o \mid m_t=m,\, i_t=i\big), \tag{14}$$
$$A^{(i)}_2(o,m)=\mathbb{P}\big(v^{(i)}_{2,t}=e_o \mid m_t=m,\, i_t=i\big), \tag{15}$$
$$A^{(i)}_3(o,m)=\mathbb{P}\big(v^{(i)}_{3,t}=e_o \mid m_t=m,\, i_t=i\big). \tag{16}$$

By stationarity, these matrices do not depend on $t$. We use $A^{(i)}_1(:,m)$, $A^{(i)}_2(:,m)$ and $A^{(i)}_3(:,m)$ to denote the $m$-th column of $A^{(i)}_1$, $A^{(i)}_2$ and $A^{(i)}_3$, respectively. Let $W^{(i)}_{p,q}\coloneqq\mathbb{E}\big[v^{(i)}_{p,t}\otimes v^{(i)}_{q,t}\big]$ be the correlation matrix between $v^{(i)}_{p,t}$ and $v^{(i)}_{q,t}$, for $p,q\in\{1,2,3\}$.²

² For any vectors $u\in\mathbb{R}^{a}$, $v\in\mathbb{R}^{b}$ and $w\in\mathbb{R}^{c}$, the tensor products are defined as follows: $(u\otimes v)(j,k)=u(j)\,v(k)$ with $u\otimes v\in\mathbb{R}^{a\times b}$, and $(u\otimes v\otimes w)(j,k,l)=u(j)\,v(k)\,w(l)$ with $u\otimes v\otimes w\in\mathbb{R}^{a\times b\times c}$.
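The tensor products in the footnote correspond directly to `numpy` outer products; a minimal illustration:

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0, 5.0])
w = np.array([6.0, 7.0])

# (u ⊗ v)[j, k] = u[j] * v[k], a matrix in R^{2x3}.
uv = np.einsum('j,k->jk', u, v)

# (u ⊗ v ⊗ w)[j, k, l] = u[j] * v[k] * w[l], a tensor in R^{2x3x2}.
uvw = np.einsum('j,k,l->jkl', u, v, w)
```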

The spectral estimator uses the following modified views, which are linear transformations of $v^{(i)}_{1,t}$ and $v^{(i)}_{2,t}$:

$$\tilde v^{(i)}_{1,t}\coloneqq W^{(i)}_{3,2}\big(W^{(i)}_{1,2}\big)^{\dagger} v^{(i)}_{1,t},\qquad \tilde v^{(i)}_{2,t}\coloneqq W^{(i)}_{3,1}\big(W^{(i)}_{2,1}\big)^{\dagger} v^{(i)}_{2,t}, \tag{17}$$

where $(\cdot)^{\dagger}$ represents the Moore–Penrose pseudoinverse of a matrix. It turns out that the second and third moments of the modified views,

$$M^{(i)}_2\coloneqq \mathbb{E}\big[\tilde v^{(i)}_{1,t}\otimes \tilde v^{(i)}_{2,t}\big],\qquad M^{(i)}_3\coloneqq \mathbb{E}\big[\tilde v^{(i)}_{1,t}\otimes \tilde v^{(i)}_{2,t}\otimes v^{(i)}_{3,t}\big], \tag{18}$$

can be compactly represented by the model parameters. More precisely, by Theorem 3.6 in anandkumar2014tensor, we have the following spectral decomposition:

$$M^{(i)}_2=\sum_{m\in\mathcal{M}}\omega^{(i)}(m)\,\theta^{(i)}_{3,m}\otimes\theta^{(i)}_{3,m},\qquad M^{(i)}_3=\sum_{m\in\mathcal{M}}\omega^{(i)}(m)\,\theta^{(i)}_{3,m}\otimes\theta^{(i)}_{3,m}\otimes\theta^{(i)}_{3,m}, \tag{19}$$

where $\theta^{(i)}_{3,m}\coloneqq A^{(i)}_3(:,m)$, and we recall that $\omega^{(i)}$ is the stationary distribution of the hidden state under the policy that fixes the action $i_t=i$ for all $t$.
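As a numerical sanity check of (17)–(19) at the population level, one can build the view matrices for a toy model and verify that the symmetrizing transformations map the first two view means onto the columns of $A^{(i)}_3$, and that $M^{(i)}_2$ admits the stated low-rank decomposition. All parameters below are made up for illustration.

```python
import numpy as np

# Toy parameters for one fixed action: M = 2 states, O = 3 observations.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
Omega = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
omega = np.array([2.0 / 3.0, 1.0 / 3.0])  # stationary distribution of P

# View matrices: A2 = law of o_t given m_t, A3 = law of o_{t+1},
# A1 = law of o_{t-1} (uses the time-reversed transition kernel).
A2 = Omega.T
A3 = (P @ Omega).T
P_back = (P * omega[:, None]).T / omega[:, None]  # P(m_{t-1} | m_t)
A1 = (P_back @ Omega).T

# Population correlation matrices W_{p,q} = A_p diag(omega) A_q^T.
D = np.diag(omega)
W12, W21 = A1 @ D @ A2.T, A2 @ D @ A1.T
W32, W31 = A3 @ D @ A2.T, A3 @ D @ A1.T

# (17): symmetrizations send both view means onto theta_3 = A3's columns.
B1 = W32 @ np.linalg.pinv(W12)
B2 = W31 @ np.linalg.pinv(W21)

# (19): M2 = sum_m omega(m) theta_{3,m} (x) theta_{3,m}.
M2 = B1 @ W12 @ B2.T
M2_direct = A3 @ D @ A3.T
```

The check passes because $W^{(i)}_{p,q}=A^{(i)}_p\,\mathrm{diag}(\omega^{(i)})\,(A^{(i)}_q)^{\top}$ and the view matrices have full column rank, so the pseudoinverses cancel the factors exactly.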

With the relationship (19), we can describe the procedure of the spectral estimator. Suppose a sample path of observations is collected under the policy $i_t=i$. It can be translated into samples of the views $v^{(i)}_{p,t}$, $p\in\{1,2,3\}$, which can be used to construct the sample averages of $W^{(i)}_{p,q}$ for $p,q\in\{1,2,3\}$:

$$\hat W^{(i)}_{p,q}=\frac{1}{N-1}\sum_{t=2}^{N} v^{(i)}_{p,t}\otimes v^{(i)}_{q,t}. \tag{20}$$

By (17) and (18), we can construct the following estimators:

$$\hat v^{(i)}_{1,t}=\hat W^{(i)}_{3,2}\big(\hat W^{(i)}_{1,2}\big)^{\dagger} v^{(i)}_{1,t},\qquad \hat v^{(i)}_{2,t}=\hat W^{(i)}_{3,1}\big(\hat W^{(i)}_{2,1}\big)^{\dagger} v^{(i)}_{2,t}, \tag{21}$$
$$\hat M^{(i)}_2=\frac{1}{N-1}\sum_{t=2}^{N}\hat v^{(i)}_{1,t}\otimes\hat v^{(i)}_{2,t},\qquad \hat M^{(i)}_3=\frac{1}{N-1}\sum_{t=2}^{N}\hat v^{(i)}_{1,t}\otimes\hat v^{(i)}_{2,t}\otimes v^{(i)}_{3,t}. \tag{22}$$
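A sketch of the empirical pipeline on simulated data follows: one-hot views from a trajectory, sample correlation matrices, modified views, and the empirical second moment, compared against its population counterpart. The rank-$M$ truncation of the pseudoinverse is a standard practical safeguard assumed here (the population correlation matrices have rank $M$, while their empirical counterparts are full rank), not a detail spelled out above; the model parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HMM for one fixed action (made-up parameters): M = 2, O = 3.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])            # transition P(m'|m)
Omega = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])   # observation Omega(o|m)
M, O = 2, 3

# Simulate a trajectory of observations under the fixed action.
N = 50_000
m = 0
obs = np.empty(N + 2, dtype=int)
for t in range(N + 2):
    obs[t] = rng.choice(O, p=Omega[m])
    m = rng.choice(M, p=P[m])

eye = np.eye(O)
v1, v2, v3 = eye[obs[:-2]], eye[obs[1:-1]], eye[obs[2:]]  # one-hot views

# Empirical correlation matrices (sample averages over all t).
avg = lambda a, b: a.T @ b / len(a)
W12, W21 = avg(v1, v2), avg(v2, v1)
W32, W31 = avg(v3, v2), avg(v3, v1)

def tpinv(W, k):
    """Rank-k truncated pseudoinverse."""
    U, s, Vt = np.linalg.svd(W)
    return Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T

# Modified views and empirical second moment.
v1_mod = v1 @ (W32 @ tpinv(W12, M)).T
v2_mod = v2 @ (W31 @ tpinv(W21, M)).T
M2_hat = avg(v1_mod, v2_mod)

# Population second moment from the true parameters, per (19).
omega = np.array([2.0 / 3.0, 1.0 / 3.0])  # stationary distribution of P
theta3 = (P @ Omega).T                    # theta_{3,m} = A_3(:, m)
M2_true = theta3 @ np.diag(omega) @ theta3.T
```

With tens of thousands of samples, `M2_hat` is close to `M2_true` entrywise, reflecting the finite-sample guarantee stated below.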

Plugging $\hat M^{(i)}_2$ and $\hat M^{(i)}_3$ into the left-hand sides of (19), we can apply the tensor decomposition method (anandkumar2014tensor) to solve (19) for $\theta^{(i)}_{3,m}$, whose solution is denoted by $\hat\theta^{(i)}_{3,m}$. It can also be shown that $\theta^{(i)}_{1,m}=W^{(i)}_{1,2}\big(W^{(i)}_{3,2}\big)^{\dagger}\theta^{(i)}_{3,m}$ and $\theta^{(i)}_{2,m}=W^{(i)}_{2,1}\big(W^{(i)}_{3,1}\big)^{\dagger}\theta^{(i)}_{3,m}$, which naturally lead to estimators $\hat\theta^{(i)}_{1,m}$ and $\hat\theta^{(i)}_{2,m}$. As a result, the unknown parameters can be estimated according to the following lemma: the unknown observation density satisfies $\Omega(\cdot\mid m,i)=A^{(i)}_2(:,m)$, and the unknown transition matrix satisfies $P^{(i)}=\big((A^{(i)}_2)^{\dagger} A^{(i)}_3\big)^{\top}$. We remark that our assumptions in Section 2 imply that all three matrices $A^{(i)}_1$, $A^{(i)}_2$ and $A^{(i)}_3$ are of full column rank (Azizzadenesheli2016), and hence the pseudoinverse in Lemma 3.1 is well defined. The subroutine that computes the POMDP estimators is summarized in Algorithm 2.
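Assuming the relation $A^{(i)}_3=A^{(i)}_2\,(P^{(i)})^{\top}$, which follows from the view definitions (14)–(16) by conditioning on $m_{t+1}$, the transition matrix can be recovered by a pseudoinverse. The following sketch with made-up parameters illustrates this recovery at the population level.

```python
import numpy as np

# Made-up parameters: M = 2 hidden states, O = 3 observations, one action.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
Omega = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

A2 = Omega.T          # A_2(:, m) = Omega(. | m, i)
A3 = (P @ Omega).T    # conditioning on m_{t+1} gives A_3 = A_2 P^T

# Recover P by inverting the relation; this requires A_2 to have
# full column rank, so that pinv(A_2) @ A_2 is the identity.
P_rec = (np.linalg.pinv(A2) @ A3).T
```

In practice $A^{(i)}_2$ and $A^{(i)}_3$ are replaced by their spectral estimates, so the recovered transition matrix inherits their estimation error.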

The stationary distribution $\omega^{(i)}$ for a fixed action $i$ is crucial for the spectral estimator, as it allows (18) and (19) to be independent of $t$. In our case, the spectral estimator is applied to a sequence of samples in the exploration phase, which does not start in the steady state. This is similar to the situation in Azizzadenesheli2016. Fortunately, Assumption 2.2 guarantees fast mixing, so that the state distribution converges to the stationary distribution at a sufficiently fast rate, and we can still use Algorithm 2, which is originally designed for stationary HMMs. The theoretical result in Proposition 2 already takes into account the error attributed to mixing.

The following result, adapted from Azizzadenesheli2016, provides the confidence regions of the estimators in Algorithm 2.

[Finite-sample guarantee of spectral estimators] Suppose the assumptions in Section 2 hold. For any $\delta\in(0,1)$, if for every action $i$ the number of samples $N^{(i)}$ is sufficiently large, then with probability at least $1-\delta$, the estimators $\hat\Omega$ and $\hat P^{(i)}$ returned by Algorithm 2 satisfy

$$\big\|\Omega(\cdot\mid m,i)-\hat\Omega(\cdot\mid m,i)\big\|_1 \le C_1\sqrt{\frac{\log\big(6(O^2+O)/\delta\big)}{N^{(i)}}}, \tag{23}$$
$$\big\|P^{(i)}(m,:)-\hat P^{(i)}(m,:)\big\|_2 \le C_2\sqrt{\frac{\log\big(6(O^2+O)/\delta\big)}{N^{(i)}}}, \tag{24}$$

for all $m\in\mathcal{M}$ and $i\in\mathcal{I}$. Here, $C_1$ and $C_2$ are constants that do not depend on $\delta$ or $N^{(i)}$.
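The right-hand sides of (23)–(24) shrink at the rate $1/\sqrt{N^{(i)}}$; the helper below evaluates the radius with the unspecified constants $C_1, C_2$ replaced by a placeholder $C=1$.

```python
import numpy as np

def conf_radius(N, O, delta, C=1.0):
    """Radius on the right-hand side of (23)-(24); C is a placeholder
    for the unspecified constants C1, C2."""
    return C * np.sqrt(np.log(6 * (O**2 + O) / delta) / N)

# The radius decreases by a factor of sqrt(10) per 10x more samples.
radii = [conf_radius(N, O=5, delta=0.05) for N in (100, 1_000, 10_000)]
```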

The explicit expressions of the constants are given in Section LABEL:sec:proof-prop-spectral in the appendix. Note that $\Omega$ and $P^{(i)}$ are identifiable only up to a permutation of the hidden state labels, because the exact indices of the states cannot be recovered. We do not explicitly mention the permutation in the statement of Proposition 2 for simplicity, consistent with the literature such as Azizzadenesheli2016.