1 Introduction
In this paper, we study Markov Decision Processes (hereafter MDPs) with arbitrarily varying rewards. The MDP model provides a general mathematical framework for modeling sequential decision making under uncertainty [7, 22, 30]. In the standard MDP setting, if the process is in some state $s$, the decision maker takes an action $a$ and receives an expected reward $r(s,a)$ before the process randomly transitions into a new state. The goal of the decision maker is to maximize the total expected reward. It is assumed that the decision maker has complete knowledge of the reward function $r$, which does not change over time.
Over the past two decades, there has been much interest in sequential learning and decision making in an unknown and possibly adversarial environment. A wide range of sequential learning problems can be modeled using the framework of Online Convex Optimization (OCO) [39, 18]. In OCO, the decision maker plays a repeated game against an adversary for a given number of rounds. At the beginning of each round, indexed by $t$, the decision maker chooses an action $x_t$ in some convex compact set $\mathcal{X}$ and the adversary chooses a concave reward function $f_t$; hence a reward of $f_t(x_t)$ is received. After observing the realized reward function, the decision maker chooses its next action, and so on. Since the decision maker does not know how the future reward functions will be chosen, its goal is to achieve a small regret; that is, the cumulative reward earned throughout the game should be close to the cumulative reward if the decision maker had been given the benefit of hindsight to choose one fixed action. We can express the regret after $T$ rounds as
$$\text{Regret}(T) := \max_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) - \sum_{t=1}^{T} f_t(x_t).$$
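To make the protocol concrete, here is a minimal sketch of Online Gradient Descent [39] for linear reward functions over the probability simplex; the function names and the toy adversary are ours, for illustration only (we use a gradient *ascent* step since rewards are maximized, whereas [39] states the algorithm for losses):

```python
import math

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def online_gradient_ascent(rewards, eta):
    """Play x_{t+1} = Proj(x_t + eta * grad f_t(x_t)) for f_t(x) = <r_t, x>."""
    n = len(rewards[0])
    x = [1.0 / n] * n                      # start at the uniform distribution
    total = 0.0
    for r in rewards:
        total += sum(ri * xi for ri, xi in zip(r, x))
        x = project_simplex([xi + eta * ri for xi, ri in zip(x, r)])
    return total

# Toy adversary: repeats the same linear reward, so the best fixed action earns T.
T = 100
rewards = [[1.0, 0.0] for _ in range(T)]
alg_reward = online_gradient_ascent(rewards, eta=1.0 / math.sqrt(T))
best_fixed = float(T)                      # playing x = (1, 0) every round
regret = best_fixed - alg_reward
```

On this toy sequence the iterates drift toward the best fixed action, so the regret stays far below the trivial bound of $T$.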
The OCO model has many applications such as universal portfolios [11, 24, 21], online shortest path [33], and online submodular minimization [20]. It also has close relations with areas such as convex optimization [19, 6] and game theory [9]. There are many algorithms that guarantee sublinear regret, e.g., Online Gradient Descent [39], Follow the Perturbed Leader [25], and Regularized Follow the Leader [32, 4]. Compared with the MDP setting, the main difference is that in OCO there is no notion of states; however, the payoffs may be chosen by an adversary.

In this work, we study a general problem that unites the MDP and the OCO frameworks, which we call the Online MDP problem. More specifically, we consider MDPs where the decision maker knows the transition probabilities, but the rewards are dynamically chosen by an adversary. The Online MDP model can be used for a wide range of applications, including multi-armed bandits with constraints [37], the paging problem in computer operating systems [15], the $k$-server problem [15], stochastic inventory control in operations research [30], and scheduling of queueing networks [12, 3].

1.1 Main Results
We propose a new computationally efficient algorithm that achieves near-optimal regret for the Online MDP problem. Our algorithm is based on the linear programming formulation of infinite-horizon average-reward MDPs, which uses the occupancy measure of state-action pairs as decision variables. This approach differs from those of other papers that have previously studied the Online MDP problem; see the review in §1.2.

We prove that the algorithm achieves regret bounded by $\tilde{O}\big(\tau\sqrt{T \ln(|\mathcal{S}||\mathcal{A}|)}\big)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. Notice that this regret bound depends only logarithmically on the sizes of the state and action spaces. The algorithm solves a regularized linear program in each period, with complexity polynomial in $|\mathcal{S}|$ and $|\mathcal{A}|$. The regret bound and the computational complexity compare favorably to the existing methods discussed in §1.2.
We then extend our results to the case where the state space is extremely large, so that the per-period computational complexity of the exact algorithm becomes impractical. We assume the state-action occupancy measures associated with stationary policies can be approximated with a linear architecture of dimension $d \ll |\mathcal{S}||\mathcal{A}|$. We design an approximate algorithm combining several innovative techniques for solving large-scale MDPs inspired by [2, 3]. A salient feature of this algorithm is that its computational complexity does not depend on the size of the state space, but instead on the number of features $d$. The algorithm achieves regret sublinear in $T$ relative to the policies representable by the linear architecture, with a bound that also involves a problem-dependent constant (see §5). To the best of our knowledge, this is the first regret result for large-scale Online MDPs.
1.2 Related Work
The history of MDPs goes back to the seminal work of Bellman [5] and Howard [22] from the 1950s. Classic algorithms for solving MDPs include policy iteration, value iteration, policy gradient, Q-learning, and their approximate versions (see [30, 7, 8] for an excellent discussion). In this paper, we focus on a relatively less used approach based on finding the occupancy measure using linear programming, as done recently in [10, 34, 2] to solve MDPs with static rewards (see Section 3.1 for more details). To deal with the curse of dimensionality,
[10] uses bilinear functions to approximate the occupancy measures, and [2] uses a linear approximation.

The Online MDP problem was first studied a decade ago by [37, 15]. In [15], the authors developed no-regret algorithms whose bound scales with the mixing time $\tau$ of the MDP (see §4). Their method runs an expert algorithm (e.g., Weighted Majority [26]) on every state, where the actions are the experts. However, the authors did not consider the case of a large state space. In [37], the authors provide a more computationally efficient algorithm using a variant of Follow the Perturbed Leader [25], but unfortunately their regret bound degrades to $O(T^{3/4+\epsilon})$. They also considered an approximation algorithm for large state spaces, but did not establish an exact regret bound. The work most closely related to ours is [13], where the authors also use a linear programming formulation of the MDP similar to ours. However, there seem to be some gaps in the proofs of their results.¹

¹In particular, we believe the proof of Lemma 1 in [13] is incorrect. Equation (8) in their paper states that the regret relative to a policy is equal to the sum of a sequence of vector products; however, the dimensions of the vectors involved in these dot products are incompatible.

The paper [27] also considers Online MDPs with large state space. Under some conditions, the authors show sublinear regret using a variant of approximate policy iteration, but the regret rate is left unspecified. [38] considers a special class of MDPs called episodic MDPs and designs algorithms using the occupancy-measure LP formulation. Following this line of work, [29]
shows that several reinforcement learning algorithms can be viewed as variants of Mirror Descent [23], and thus one can establish convergence properties of these algorithms. In [28], the authors consider Online MDPs with bandit feedback and provide an algorithm, based on that of [15], with sublinear regret.

A problem more general than the Online MDP setting considered here is one where the MDP transition probabilities also change in an adversarial manner; this is beyond the scope of this paper. It is believed that this problem is much less tractable computationally [see discussion in 14]. [36] studies MDPs with changing transition probabilities, although [28] questions the correctness of their result, as the regret obtained seems to break a known lower bound. In [17], the authors use a sliding-window approach under a particular definition of regret. [1] shows sublinear regret with changing transition probabilities when comparing against a restricted policy class.
2 Problem Formulation: Online MDP
We consider a general Markov Decision Process with known transition probabilities but unknown and adversarially chosen rewards. Let $\mathcal{S}$ denote the set of possible states and $\mathcal{A}$ denote the set of actions. (For notational simplicity, we assume the set of actions a player can take is the same for all states, but this assumption can be relaxed easily.) At each period $t$, if the system is in state $s_t$, the decision maker chooses an action $a_t$ and collects a reward $r_t(s_t, a_t)$. Here, $r_t$ denotes the reward function for period $t$. We assume that the sequence of reward functions $\{r_t\}_{t=1}^T$ is initially unknown to the decision maker. The function $r_t$ is revealed only after the action $a_t$ has been chosen. We allow the sequence $\{r_t\}$ to be chosen by an adaptive adversary, meaning $r_t$ can be chosen using the history of states $s_1, \dots, s_t$ and actions $a_1, \dots, a_{t-1}$; in particular, the adversary does not observe the action $a_t$ when choosing $r_t$. After $a_t$ is chosen, the system then proceeds to state $s'$ in the next period with probability $P(s' \mid s_t, a_t)$. We assume the decision maker has complete knowledge of the transition probabilities given by $P$.
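To fix ideas, the interaction protocol above can be sketched as a short simulation loop; the two-state instance and the random adversary below are our own toy constructions:

```python
import random

random.seed(0)

# Toy instance: 2 states, 2 actions; P[s][a][s'] are known transition probabilities.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]

def run_online_mdp(T, policy):
    """policy(t, s) -> action; reward table r_t is revealed only after a_t is chosen."""
    s, total = 0, 0.0
    for t in range(T):
        a = policy(t, s)
        # Adversary reveals the reward table for period t only now.
        r_t = [[random.random() for _ in range(2)] for _ in range(2)]
        total += r_t[s][a]
        s = random.choices([0, 1], weights=P[s][a])[0]
    return total

reward = run_online_mdp(1000, policy=lambda t, s: random.randrange(2))
```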
Suppose the initial state of the MDP is drawn as $s_1 \sim \mu_1$, where $\mu_1$ is a probability distribution over $\mathcal{S}$. The objective of the decision maker is to choose a sequence of actions, based on the history of states and rewards observed, such that the cumulative reward in $T$ periods is close to that of the optimal offline static policy. Formally, let $\pi$ denote a stationary (randomized) policy: $\pi : \mathcal{S} \to \Delta_{\mathcal{A}}$, where $\Delta_{\mathcal{A}}$ is the set of probability distributions over the action set $\mathcal{A}$. Let $\Pi$ denote the set of all stationary policies. We aim to find an algorithm that minimizes
$$\text{MDP-Regret}(T) := \max_{\pi \in \Pi}\ \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t^{\pi}, a_t^{\pi})\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big], \quad (1)$$
where $\{(s_t^{\pi}, a_t^{\pi})\}_{t=1}^{T}$ is the state-action sequence obtained by following policy $\pi$, and the expectations are taken with respect to the random transitions of the MDP and (possibly) external randomization of the algorithm.
3 Preliminaries
Next, we introduce additional notation for the MDP. Let $P^{\pi}(s' \mid s) := \sum_{a \in \mathcal{A}} \pi(a \mid s)\, P(s' \mid s, a)$ be the probability of transitioning from state $s$ to state $s'$ under policy $\pi$, and let $P^{\pi}$ be the matrix with entries $P^{\pi}(s' \mid s)$. We use the row vector $\mu_t$ to denote the probability distribution over states at time $t$. Let $\mu_t^{\pi}$ be the distribution over states at time $t$ under policy $\pi$, given by $\mu_{t+1}^{\pi} = \mu_t^{\pi} P^{\pi}$. Let $\mu^{\pi}$ denote the stationary distribution for policy $\pi$, which satisfies the linear equation $\mu^{\pi} = \mu^{\pi} P^{\pi}$. We assume the following condition on the convergence to the stationary distribution, which is commonly used in the MDP literature [see 37, 15, 28].
Assumption 1.
There exists a real number $\tau \ge 0$ such that for any policy $\pi$ and any pair of distributions $\mu, \mu'$ over $\mathcal{S}$, it holds that $\|\mu P^{\pi} - \mu' P^{\pi}\|_1 \le e^{-1/\tau} \|\mu - \mu'\|_1$.
We refer to $\tau$ in Assumption 1 as the mixing time; it measures the speed of convergence to the stationary distribution. In particular, the assumption implies that the stationary distribution $\mu^{\pi}$ is unique for a given policy $\pi$.
We use $\nu^{\pi}(s,a)$ to denote the proportion of time that the MDP visits state-action pair $(s,a)$ in the long run; we call $\nu^{\pi}$ the occupancy measure of policy $\pi$. Let $\rho^{\pi}(r)$ be the long-run average reward under policy $\pi$ when the reward function is fixed to be $r$ every period, i.e., $\rho^{\pi}(r) = \sum_{s,a} \nu^{\pi}(s,a)\, r(s,a) = \langle r, \nu^{\pi} \rangle$. We write $\pi_t$ for the policy selected by the decision maker for time $t$, and $\nu_t$ for the associated occupancy measure.
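For illustration, both quantities can be computed numerically for a small MDP; the transition kernel, policy, and reward table below are hypothetical:

```python
import numpy as np

# Toy 2-state, 2-action MDP; P[s, a, s'] is the transition kernel.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.5, 0.5], [0.3, 0.7]])   # pi[s, a] = probability of a in state s

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi[s,a] P[s,a,s'].
P_pi = np.einsum('sa,sat->st', pi, P)

# Stationary distribution: left eigenvector of P_pi with eigenvalue 1.
w, V = np.linalg.eig(P_pi.T)
mu = np.real(V[:, np.argmax(np.real(w))])
mu = mu / mu.sum()

# Occupancy measure nu(s,a) = mu(s) * pi(a|s); average reward rho = <r, nu>.
nu = mu[:, None] * pi
r = np.array([[1.0, 0.0], [0.5, 0.2]])
rho = (nu * r).sum()
```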
3.1 Linear Programming Formulation for the Average Reward MDP
Given a reward function $r$, suppose one wants to find a policy that maximizes the long-run average reward $\rho^{\pi}(r)$. Under Assumption 1, the Markov chain induced by any policy is ergodic, and the long-run average reward is independent of the starting state [7]. It is well known that the optimal policy can be obtained by solving the Bellman equation, which in turn can be written as a linear program (in the dual form):
$$\max_{\nu \ge 0}\ \sum_{s \in \mathcal{S}}\sum_{a \in \mathcal{A}} r(s,a)\,\nu(s,a) \quad (2)$$
$$\text{s.t.}\quad \sum_{a \in \mathcal{A}} \nu(s', a) = \sum_{s \in \mathcal{S}}\sum_{a \in \mathcal{A}} P(s' \mid s, a)\,\nu(s, a) \quad \forall s' \in \mathcal{S}, \qquad \sum_{s \in \mathcal{S}}\sum_{a \in \mathcal{A}} \nu(s,a) = 1.$$
Let $\nu^*$ be an optimal solution to the LP (2). We can construct an optimal policy of the MDP by defining $\pi^*(a \mid s) = \nu^*(s,a)/\sum_{a'} \nu^*(s,a')$ for all $s$ such that $\sum_{a'} \nu^*(s,a') > 0$; for states where the denominator is zero, the policy may choose arbitrary actions, since those states will not be visited in the stationary distribution. Let $\mu^*$ be the stationary distribution over states under this optimal policy.
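A sketch of this LP-based construction on a toy instance, assuming `scipy` is available (the instance itself is ours for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-state, 2-action MDP; the variable nu is indexed by (s, a), flattened row-major.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.5, 0.2]])
S, A = 2, 2

# Flow-balance rows: sum_a nu(s',a) = sum_{s,a} P(s'|s,a) nu(s,a) for every s'.
A_eq = np.zeros((S + 1, S * A))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = P[s, a, sp] - (1.0 if s == sp else 0.0)
A_eq[S, :] = 1.0                      # normalization: nu sums to one
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

# linprog minimizes, so negate the reward vector to maximize <r, nu>.
res = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (S * A))
nu = res.x.reshape(S, A)
policy = nu / nu.sum(axis=1, keepdims=True)   # pi*(a|s) on states with positive mass
```

On this instance the optimal policy plays the first action in both states, and most of the occupancy mass sits on the high-reward state.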
For simplicity, we will write the first set of constraints of LP (2) in the matrix form $A\nu = 0$, for an appropriately chosen matrix $A$. We denote the feasible set of the above LP by $\Delta$. The following definition will be used in the analysis later.
Definition 1.
Let $\delta_0 > 0$ be the largest real number such that for all $\delta \in [0, \delta_0]$, the set $\Delta_\delta := \{\nu \in \Delta : \nu(s,a) \ge \delta \text{ for all } (s,a)\}$ is nonempty.
4 A Sublinear Regret Algorithm for Online MDP
In this section, we present an algorithm for the Online MDP problem.
At the beginning of each round $t$, the algorithm starts with an occupancy measure $\nu_t$. If the MDP is in state $s_t$, we play action $a$ with probability $\pi_t(a \mid s_t) = \nu_t(s_t, a)/\sum_{a'} \nu_t(s_t, a')$. If the denominator is zero, the algorithm picks an action in $\mathcal{A}$ uniformly at random. After observing the reward function $r_t$ and collecting the reward $r_t(s_t, a_t)$, the algorithm updates the occupancy measure to $\nu_{t+1}$.
The new occupancy measure is chosen according to the Regularized Follow the Leader (RFTL) algorithm [32, 4]. RFTL chooses the occupancy measure that maximizes the cumulative reward observed so far, $\sum_{k=1}^{t} \langle r_k, \nu \rangle$, minus a regularization term $\frac{1}{\eta} R(\nu)$, where $R(\nu) = \sum_{s,a} \nu(s,a) \ln \nu(s,a)$ is the negative entropy. The regularization term forces the algorithm not to change the occupancy measure drastically from round to round.
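To build intuition for the update, note that when the feasible set is the plain probability simplex (ignoring the flow-balance constraints of the occupancy-measure polytope), entropy-regularized FTL has a closed-form multiplicative-weights solution; a sketch under that simplification:

```python
import math

def rftl_entropy(cumulative_reward, eta):
    """argmax_{x in simplex} <R, x> - (1/eta) * sum_i x_i ln x_i
    has the closed form x_i proportional to exp(eta * R_i)."""
    m = max(cumulative_reward)                              # stabilize the exponentials
    w = [math.exp(eta * (R - m)) for R in cumulative_reward]
    Z = sum(w)
    return [wi / Z for wi in w]

# Each round, feed in the cumulative reward vector observed so far.
R = [0.0, 0.0, 0.0]
for r in ([1.0, 0.2, 0.0], [0.9, 0.1, 0.3], [1.0, 0.0, 0.2]):
    x = rftl_entropy(R, eta=0.5)       # play x_t based on rewards up to round t-1
    R = [Ri + ri for Ri, ri in zip(R, r)]
x_final = rftl_entropy(R, eta=0.5)
```

The closed form follows from the first-order optimality conditions on the simplex; the regularization weight $1/\eta$ controls how aggressively the iterate concentrates on the leading coordinate.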
The complete algorithm is shown in Algorithm 1. The main result of this section is the following.
Theorem 1.
Suppose $\{r_t\}_{t=1}^{T}$ is an arbitrary sequence of rewards such that $r_t(s,a) \in [0,1]$ for all $t$ and all $(s,a) \in \mathcal{S} \times \mathcal{A}$. For $T$ sufficiently large, the MDP-RFTL algorithm with appropriately chosen parameters $\eta$ and $\delta$ (specified in Appendix A) guarantees
$$\text{MDP-Regret}(T) \le \tilde{O}\Big(\tau \sqrt{T\, \ln(|\mathcal{S}||\mathcal{A}|)}\Big).$$
The regret bound in Theorem 1 is near optimal: a lower bound of $\Omega(\sqrt{T \ln |\mathcal{A}|})$ exists for the problem of learning with expert advice [16, 18], a special case of Online MDP in which the state space is a singleton. We note that the bound depends only logarithmically on the sizes of the state space and action space. The previous state-of-the-art regret bound for Online MDPs is that of [15]. Compared to their result, our bound has a better dependence on the size of the action space, at the price of a different dependence on the mixing time $\tau$. Both algorithms require per-period computation time polynomial in $|\mathcal{S}|$ and $|\mathcal{A}|$, but they are based on different ideas: the algorithm of [15] runs an expert algorithm in every state and must update all of these subroutines at each time step, whereas our algorithm is based on RFTL. In the next section, we show how to extend our algorithm to the case of large state spaces.
4.1 Proof Idea for Theorem 1
The key to analyzing the algorithm is to decompose the regret with respect to a fixed policy $\pi$ as follows:
$$\mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t^{\pi}, a_t^{\pi})\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big] = \Big(\mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t^{\pi}, a_t^{\pi})\Big] - \sum_{t=1}^{T} \rho^{\pi}(r_t)\Big) + \Big(\sum_{t=1}^{T} \rho^{\pi}(r_t) - \sum_{t=1}^{T} \rho^{\pi_t}(r_t)\Big) + \Big(\sum_{t=1}^{T} \rho^{\pi_t}(r_t) - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big]\Big). \quad (3)$$
This decomposition was first used by [15]. We now give some intuition on why each term should be sublinear. By the mixing condition in Assumption 1, the state distribution at time $t$ under a fixed policy $\pi$ differs from the stationary distribution $\mu^{\pi}$ by at most $O(e^{-t/\tau})$ in $\ell_1$ distance. This result can be used to bound the first term of (3).
The second term of (3) can be related to the online convex optimization (OCO) problem through the linear programming formulation from Section 3.1. Notice that $\rho^{\pi}(r_t) = \langle r_t, \nu^{\pi} \rangle$ and $\rho^{\pi_t}(r_t) = \langle r_t, \nu_t \rangle$. Therefore, we have that
$$\sum_{t=1}^{T} \rho^{\pi}(r_t) - \sum_{t=1}^{T} \rho^{\pi_t}(r_t) = \sum_{t=1}^{T} \langle r_t, \nu^{\pi} \rangle - \sum_{t=1}^{T} \langle r_t, \nu_t \rangle, \quad (4)$$
which is exactly the regret quantity commonly studied in OCO. We are thus seeking an algorithm that can bound the right-hand side of (4). In order to achieve logarithmic dependence on $|\mathcal{S}|$ and $|\mathcal{A}|$ in Theorem 1, we apply the RFTL algorithm, regularized by the negative entropy function $R(\nu) = \sum_{s,a} \nu(s,a) \ln \nu(s,a)$. A technical challenge in the analysis is that $R$ is not Lipschitz continuous over $\Delta$, the feasible set of LP (2). So we design the algorithm to play in a shrunk set $\Delta_\delta$ for some $\delta > 0$ (see Definition 1), on which $R$ is indeed Lipschitz continuous.
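To see why the shrunk set restores Lipschitz continuity, note that on $\Delta_\delta$ every coordinate is bounded below by $\delta$, which bounds each partial derivative of the entropy regularizer:

```latex
\left| \frac{\partial R(\nu)}{\partial \nu(s,a)} \right|
  = \bigl| 1 + \ln \nu(s,a) \bigr|
  \;\le\; 1 + \ln\tfrac{1}{\delta}
  \qquad \text{whenever } \nu(s,a) \in [\delta, 1],
```

so the Lipschitz constant grows only logarithmically in $1/\delta$, whereas the derivative diverges as $\nu(s,a) \to 0$.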
For the last term in (3), note that it is similar to the first term, although more complicated: the comparator policy is fixed in the first term, whereas the policies $\pi_t$ used by the algorithm vary over time. To address this challenge, the key idea is to show that the policies do not change too much from round to round, so that the third term grows sublinearly in $T$. To this end, we use the stability property of the RFTL algorithm with a carefully chosen regularization parameter $\eta$. The complete proof of Theorem 1 can be found in Appendix A.
5 Online MDPs with Large State Space
In the previous section, we designed an algorithm for Online MDP with sublinear regret. However, its computational complexity is polynomial in $|\mathcal{S}|$ and $|\mathcal{A}|$ per round. In practice, MDPs often have extremely large state spaces due to the curse of dimensionality [7], so computing the exact solution becomes impractical. In this section, we propose an approximate algorithm that can handle large state spaces.
5.1 Approximating Occupancy Measures and Regret Definition
We consider an approximation scheme introduced in [3] for standard MDPs. The idea is to use $d$ feature vectors (with $d \ll |\mathcal{S}||\mathcal{A}|$) to approximate occupancy measures. Specifically, we approximate $\nu \approx \Phi\theta$, where $\Phi$ is a given matrix of dimension $|\mathcal{S}||\mathcal{A}| \times d$ and $\theta$ is a vector of feature weights with $\|\theta\| \le W$ for some positive constant $W$. As we will restrict the occupancy measures chosen by our algorithm to satisfy $\nu = \Phi\theta$, the definition of MDP-Regret (1) is too strong, as it compares against all stationary policies. Instead, we restrict the benchmark to $\Pi_\Phi$, the set of policies whose occupancy measures can be represented by the matrix $\Phi$.
Our goal will now be to achieve sublinear MDP-Regret relative to this restricted benchmark:
$$\text{MDP-Regret}_\Phi(T) := \max_{\pi \in \Pi_\Phi}\ \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t^{\pi}, a_t^{\pi})\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big], \quad (5)$$
where the expectation is taken with respect to the random state transitions of the MDP and the randomization used in the algorithm. Additionally, we want to make the computational complexity independent of $|\mathcal{S}|$ and $|\mathcal{A}|$.
Choice of Matrix $\Phi$ and Computational Efficiency. The columns of the matrix $\Phi$ represent probability distributions over state-action pairs. The choice of $\Phi$ is problem dependent, and a detailed discussion is beyond the scope of this paper. [3] shows that for many applications, such as the game of Tetris and queueing networks, $\Phi$ can be naturally chosen as a sparse matrix, which allows constant-time access to entries of $\Phi$ and efficient dot-product operations. We will assume such constant-time access throughout our analysis. We refer readers to [3] for further details.
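As an illustration of the computational point (with hypothetical dimensions and randomly generated sparse columns), a compressed-sparse-column representation supports the product $\Phi\theta$ in time proportional to the number of nonzeros, rather than $|\mathcal{S}||\mathcal{A}| \cdot d$:

```python
import numpy as np
from scipy.sparse import csc_matrix

n_sa, d = 10_000, 5                    # |S||A| rows, d feature columns (toy sizes)
rng = np.random.default_rng(0)

# Each column is a sparse probability distribution over state-action pairs.
rows, cols, vals = [], [], []
for j in range(d):
    support = rng.choice(n_sa, size=20, replace=False)
    weights = rng.random(20)
    weights /= weights.sum()
    rows.extend(support)
    cols.extend([j] * 20)
    vals.extend(weights)
Phi = csc_matrix((vals, (rows, cols)), shape=(n_sa, d))

theta = rng.random(d)
nu_approx = Phi @ theta                # O(nnz) time, independent of n_sa * d
```

Since each column sums to one, the total mass of $\Phi\theta$ equals $\sum_j \theta_j$, which is convenient when normalizing the approximate occupancy measure.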
5.2 The Approximate Algorithm
The algorithm we propose builds on MDP-RFTL but modifies it significantly in several aspects. In this section, we start with the key ideas behind these modifications and then formally present the new algorithm.
To aid our analysis, we make the following definition.
Definition 2.
Let $\delta_0^{\Phi}$ be the largest real number such that for all $\delta \in [0, \delta_0^{\Phi}]$, the set $\Delta_\delta^{\Phi} := \{\nu \in \Delta : \nu = \Phi\theta \text{ for some } \theta \text{ with } \|\theta\| \le W,\ \nu(s,a) \ge \delta \text{ for all } (s,a)\}$ is nonempty. We also write $\Delta^{\Phi} := \Delta_0^{\Phi}$.
As a first attempt, one could replace the shrunk set of occupancy measures $\Delta_\delta$ in Algorithm 1 with $\Delta_\delta^{\Phi}$ defined above, and then use the occupancy measures given by the RFTL algorithm restricted to this set. The same proof as that of Theorem 1 would apply and guarantee a sublinear MDP-Regret. Unfortunately, replacing $\Delta_\delta$ with $\Delta_\delta^{\Phi}$ does not reduce the time complexity of computing the iterates, which remains polynomial in $|\mathcal{S}|$ and $|\mathcal{A}|$.
To tackle this challenge, we will not apply the RFTL algorithm exactly, but will instead obtain an approximate solution in time independent of $|\mathcal{S}|$ and $|\mathcal{A}|$. We relax the constraints $A(\Phi\theta) = 0$ and $\mathbf{1}^{\top}\Phi\theta = 1$ that define the set $\Delta_\delta^{\Phi}$, and add the following penalty term to the objective function:
$$-H_t\Big(\big\|A(\Phi\theta)\big\|_1 + \big|\mathbf{1}^{\top}\Phi\theta - 1\big|\Big). \quad (6)$$

Here, $\{H_t\}$ is a sequence of tuning parameters that will be specified in Theorem 2. Let $c_t(\theta)$ denote the resulting penalized objective. Thus, the original RFTL step in Algorithm 1 now becomes
$$\theta_{t+1} \in \arg\max_{\|\theta\| \le W}\ \sum_{k=1}^{t} \langle r_k, \Phi\theta \rangle - H_t\Big(\big\|A(\Phi\theta)\big\|_1 + \big|\mathbf{1}^{\top}\Phi\theta - 1\big|\Big) - \frac{1}{\eta}\tilde{R}(\Phi\theta). \quad (7)$$
In the above objective, we use a modified entropy function $\tilde{R}$ as the regularization term, because the standard entropy function has an infinite gradient at the origin. More specifically, let $R(x) = x \ln x$ denote the (negative) entropy applied coordinatewise. We define $\tilde{R}(\nu) = \sum_{s,a} \tilde{r}(\nu(s,a))$, where $\tilde{r}$ agrees with $x \ln x$ for inputs above a small threshold and is extended below the threshold so that its gradient remains bounded at the origin. (8)
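One standard way to carry out such a modification, which we sketch here as an illustration (not necessarily the paper's exact choice in (8)), is to replace $x \ln x$ below a threshold $\epsilon$ by its second-order Taylor expansion at $\epsilon$; the resulting function matches the entropy above $\epsilon$ and has a bounded gradient at the origin:

```python
import math

EPS = 1e-3

def r_tilde(x):
    """x*ln(x) for x >= EPS; quadratic Taylor extension of it below EPS."""
    if x >= EPS:
        return x * math.log(x)
    f = EPS * math.log(EPS)            # value of x*ln(x) at EPS
    g = math.log(EPS) + 1.0            # first derivative at EPS
    h = 1.0 / EPS                      # second derivative at EPS
    return f + g * (x - EPS) + 0.5 * h * (x - EPS) ** 2

def r_tilde_grad(x):
    """Gradient of r_tilde; finite at 0, unlike ln(x) + 1."""
    if x >= EPS:
        return math.log(x) + 1.0
    return (math.log(EPS) + 1.0) + (x - EPS) / EPS
```

By construction the value and the first derivative match at the threshold, so the modified function is continuously differentiable everywhere on $[0, 1]$.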
Since computing an exact gradient of the objective in (7) would take time scaling with $|\mathcal{S}||\mathcal{A}|$, we solve problem (7) by stochastic gradient ascent (SGA). The following lemma shows how to efficiently generate stochastic subgradients of the objective via sampling.
Lemma 1.
Let $q$ be any probability distribution over state-action pairs with full support, and let $p$ be any probability distribution over all states with full support. Sample a pair $(s,a) \sim q$ and a state $s' \sim p$, and let $g(\theta)$ be the importance-weighted subgradient estimate built from the sampled entries (each sampled term scaled by the inverse of its sampling probability). Then $\mathbb{E}[g(\theta)]$ is a subgradient of the objective in (7) at $\theta$, for any $\theta$. Moreover, we have $\|g(\theta)\| \le G$ with probability 1, where the constant $G$, defined in (9), depends on the minimum probabilities of $q$ and $p$, on $\Phi$ and $W$, and on the tuning parameters $H_t$ and $\eta$.
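The mechanism behind such sampled subgradients is standard importance weighting: sample one coordinate from a full-support distribution and rescale by its inverse probability, which yields an unbiased estimate. A self-contained sketch of this principle (our own toy vector, not the paper's exact estimator):

```python
import random

random.seed(1)

v = [0.3, -1.2, 0.7, 0.2]              # a gradient vector we only probe entrywise
q = [0.4, 0.3, 0.2, 0.1]               # any sampling distribution with full support

def sampled_estimate():
    """Return g with E[g] = v: pick i ~ q, output (v[i]/q[i]) * e_i."""
    i = random.choices(range(len(v)), weights=q)[0]
    g = [0.0] * len(v)
    g[i] = v[i] / q[i]
    return g

# Empirically average many estimates; the mean converges to v.
N = 200_000
mean = [0.0] * len(v)
for _ in range(N):
    g = sampled_estimate()
    mean = [m + gi / N for m, gi in zip(mean, g)]
```

Note that the bound on $\|g\|$ degrades as the minimum sampling probability shrinks, which is why the constant $G$ in Lemma 1 depends on the choice of $q$ and $p$.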
5.3 Analysis of the Approximate Algorithm
We establish a regret bound for the LargeMDPRFTL algorithm as follows.
Theorem 2.
Suppose $\{r_t\}_{t=1}^{T}$ is an arbitrary sequence of rewards such that $r_t(s,a) \in [0,1]$ for all $t$ and all $(s,a) \in \mathcal{S} \times \mathcal{A}$. For $T$ sufficiently large, LargeMDP-RFTL with appropriately chosen parameters $\eta$, $\{H_t\}$, and SGA step sizes (specified in Appendix B) guarantees that $\text{MDP-Regret}_\Phi(T)$ is sublinear in $T$, with a bound whose leading term scales with the mixing time $\tau$, the feature dimension $d$, the weight bound $W$, and a problem-dependent constant $C$.
Here $C$ is a problem-dependent constant. The constant $G$ is defined in Lemma 1.
A salient feature of the LargeMDP-RFTL algorithm is that its computational complexity in each period is independent of the size of the state space $|\mathcal{S}|$ and the size of the action space $|\mathcal{A}|$, and thus the algorithm is amenable to large-scale MDPs. In particular, in Theorem 2, the number of SGA iterations per period is independent of $|\mathcal{S}|$ and $|\mathcal{A}|$.
Compared to Theorem 1, we achieve a regret bound with similar dependence on the number of periods $T$ and the mixing time $\tau$. The regret bound also depends on the feature dimension $d$ and the weight bound $W$, with an additional constant term $C$. The constant $C$ comes from a projection problem (see details in Appendix B) and may grow with $|\mathcal{S}|$ and $|\mathcal{A}|$ in general. But for some classes of MDP problems, $C$ is bounded by an absolute constant; an example is the Markovian multi-armed bandit problem [35].
Proof Idea for Theorem 2. Consider the exact RFTL iterates $\{\nu_t\} \subset \Delta_\delta^{\Phi}$ and the occupancy measures induced by following the corresponding policies. Since $\Delta_\delta^{\Phi} \subseteq \Delta_\delta$, it holds that $\nu_t \in \Delta_\delta$ for all $t$. Thus, following the proof of Theorem 1, we could obtain the same MDP-Regret bound as in Theorem 1 if we followed the policies induced by $\{\nu_t\}$. However, computing $\nu_t$ exactly takes time polynomial in $|\mathcal{S}|$ and $|\mathcal{A}|$.
The crux of the proof of Theorem 2 is to show that the iterates $\theta_t$ of Algorithm 2 induce occupancy measures that are close to the exact iterates $\nu_t$. Since the algorithm relaxes the constraints of $\Delta_\delta^{\Phi}$, in general we have $\Phi\theta_t \notin \Delta_\delta^{\Phi}$, and thus $\Phi\theta_t$ need not be a valid occupancy measure. So we need to show that the distance between $\Phi\theta_t$ and $\nu_t$ is small. Using the triangle inequality, we have
$$\|\Phi\theta_t - \nu_t\| \le \big\|\Phi\theta_t - \Pi_{\Delta_\delta^{\Phi}}(\Phi\theta_t)\big\| + \big\|\Pi_{\Delta_\delta^{\Phi}}(\Phi\theta_t) - \nu_t\big\|,$$
where $\Pi_{\Delta_\delta^{\Phi}}(\cdot)$ denotes the Euclidean projection onto $\Delta_\delta^{\Phi}$. We then proceed to bound each term individually; we defer the details to Appendix B, as bounding each term requires lengthy proofs.
6 Conclusion
We consider Markov Decision Processes (MDPs) where the transition probabilities are known but the rewards are unknown and may change in an adversarial manner. We provide an online algorithm, which applies Regularized Follow the Leader (RFTL) to the linear programming formulation of the average-reward MDP. The algorithm achieves a regret bound of $\tilde{O}\big(\tau\sqrt{T \ln(|\mathcal{S}||\mathcal{A}|)}\big)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|\mathcal{S}|$ and $|\mathcal{A}|$ per period.
We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. We approximate the state-action occupancy measures with a linear architecture of dimension $d$. We then propose an approximate algorithm which relaxes the constraints in the RFTL subproblem and solves the relaxed problem by stochastic gradient ascent. A salient feature of our algorithm is that its computational complexity is independent of the size of the state space $|\mathcal{S}|$ and the size of the action space $|\mathcal{A}|$. We prove a sublinear regret bound compared to the best static policy approximated by the linear architecture, where the bound involves a problem-dependent constant. To the best of our knowledge, this is the first regret bound for large-scale MDPs with changing rewards.

References
 [1] Y. Abbasi, P. L. Bartlett, V. Kanade, Y. Seldin, and C. Szepesvári. Online learning in markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pages 2508–2516, 2013.
 [2] Y. Abbasi-Yadkori, P. L. Bartlett, X. Chen, and A. Malek. Large-scale markov decision problems via the linear programming dual. arXiv preprint arXiv:1901.01992, 2019.

 [3] Y. Abbasi-Yadkori, P. L. Bartlett, and A. Malek. Linear programming for large-scale markov decision problems. In International Conference on Machine Learning, volume 32, pages 496–504. MIT Press, 2014.
 [4] J. D. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Conference on Learning Theory, 2009.
 [5] R. Bellman. A markovian decision process. Journal of Mathematics and Mechanics, pages 679–684, 1957.
 [6] A. Ben-Tal, E. Hazan, T. Koren, and S. Mannor. Oracle-based robust optimization via online learning. Operations Research, 63(3):628–638, 2015.
 [7] D. P. Bertsekas. Dynamic programming and optimal control, volume 2. Athena Scientific, Belmont, MA, 4 edition, 2012.
 [8] D. P. Bertsekas and J. N. Tsitsiklis. Neurodynamic programming. Athena Scientific, Belmont, MA, 1996.
 [9] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
 [10] Y. Chen, L. Li, and M. Wang. Scalable bilinear pi learning using state and action features. arXiv preprint arXiv:1804.10328, 2018.
 [11] T. M. Cover. Universal portfolios. Mathematical finance, 1(1):1–29, 1991.
 [12] D. P. De Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations research, 51(6):850–865, 2003.
 [13] T. Dick, A. Gyorgy, and C. Szepesvari. Online learning in markov decision processes with changing cost sequences. In International Conference on Machine Learning, pages 512–520, 2014.
 [14] E. Even-Dar, S. M. Kakade, and Y. Mansour. Experts in a markov decision process. In Advances in Neural Information Processing Systems, pages 401–408, 2005.
 [15] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
 [16] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1–2):79–103, 1999.
 [17] P. Gajane, R. Ortner, and P. Auer. A slidingwindow algorithm for markov decision processes with arbitrarily changing rewards and transitions. arXiv preprint arXiv:1805.10066, 2018.
 [18] E. Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016.
 [19] E. Hazan and S. Kale. An optimal algorithm for stochastic strongly-convex optimization. arXiv preprint arXiv:1006.2425, 2010.
 [20] E. Hazan and S. Kale. Online submodular minimization. Journal of Machine Learning Research, 13(Oct):2903–2922, 2012.
 [21] D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth. Online portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325–347, 1998.
 [22] R. A. Howard. Dynamic programming and markov processes. John Wiley, 1960.
 [23] A. Juditsky, A. Nemirovski, et al. First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods. Optimization for Machine Learning, pages 121–148, 2011.
 [24] A. Kalai and S. Vempala. Efficient algorithms for universal portfolios. Journal of Machine Learning Research, 3(Nov):423–440, 2002.
 [25] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
 [26] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
 [27] Y. Ma, H. Zhang, and M. Sugiyama. Online markov decision processes with policy iteration. arXiv preprint arXiv:1510.04454, 2015.
 [28] G. Neu, A. György, C. Szepesvári, and A. Antos. Online markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676–691, 2014.
 [29] G. Neu, A. Jonsson, and V. Gómez. A unified view of entropy-regularized markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
 [30] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 [31] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.
 [32] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
 [33] E. Takimoto and M. K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4(Oct):773–818, 2003.
 [34] M. Wang. Primaldual pi learning: Sample complexity and sublinear run time for ergodic markov decision problems. arXiv preprint arXiv:1710.06100, 2017.
 [35] P. Whittle. Multi-armed bandits and the gittins index. Journal of the Royal Statistical Society: Series B (Methodological), 42(2):143–149, 1980.
 [36] J. Y. Yu and S. Mannor. Online learning in markov decision processes with arbitrarily changing rewards and transitions. In 2009 International Conference on Game Theory for Networks, pages 314–322. IEEE, 2009.
 [37] J. Y. Yu, S. Mannor, and N. Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
 [38] A. Zimin and G. Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pages 1583–1591, 2013.
 [39] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML03), pages 928–936, 2003.
Appendix A Regret Analysis of MDP-RFTL: Proof of Theorem 1
To bound the regret incurred by MDP-RFTL, we bound each term in Eq. (3). We start with the first term, using the following lemma, which was first stated in [15] and was also used by [28].
Lemma 2.
For any $T \ge 1$ and any policy $\pi \in \Pi$, it holds that
$$\Big|\mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t^{\pi}, a_t^{\pi})\Big] - \sum_{t=1}^{T} \rho^{\pi}(r_t)\Big| \le 2\tau + 2.$$
Proof of Lemma 2.
Recall that $\rho^{\pi}(r_t) = \sum_{s,a} \mu^{\pi}(s)\,\pi(a \mid s)\, r_t(s,a)$, while $\mathbb{E}[r_t(s_t^{\pi}, a_t^{\pi})] = \sum_{s,a} \mu_t^{\pi}(s)\,\pi(a \mid s)\, r_t(s,a)$. Since $\pi(\cdot \mid s)$ defines a probability distribution over actions and $r_t(s,a) \in [0,1]$, we have by the Cauchy-Schwarz inequality
$$\big|\mathbb{E}[r_t(s_t^{\pi}, a_t^{\pi})] - \rho^{\pi}(r_t)\big| \le \|\mu_t^{\pi} - \mu^{\pi}\|_1.$$
Also, recall that $\mu^{\pi}$ is the stationary distribution over states by following policy $\pi$, so $\mu^{\pi} = \mu^{\pi} P^{\pi}$. Now, notice that by Assumption 1,
$$\|\mu_t^{\pi} - \mu^{\pi}\|_1 = \|\mu_{t-1}^{\pi} P^{\pi} - \mu^{\pi} P^{\pi}\|_1 \le e^{-1/\tau}\|\mu_{t-1}^{\pi} - \mu^{\pi}\|_1 \le e^{-(t-1)/\tau}\|\mu_1 - \mu^{\pi}\|_1 \le 2e^{-(t-1)/\tau}.$$
Finally, we have that
$$\Big|\mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t^{\pi}, a_t^{\pi})\Big] - \sum_{t=1}^{T} \rho^{\pi}(r_t)\Big| \le \sum_{t=1}^{T} 2e^{-(t-1)/\tau} \le 2\Big(1 + \int_0^{\infty} e^{-x/\tau}\,dx\Big) = 2\tau + 2,$$
which concludes the proof. ∎
We now bound the third term in (3). We use the following lemma, which bounds the difference of two stationary distributions by the difference of the corresponding occupancy measures.
Lemma 3.
Let $\mu$ and $\mu'$ be two arbitrary stationary distributions over $\mathcal{S}$, and let $\nu$ and $\nu'$ be the corresponding occupancy measures. It holds that $\|\mu - \mu'\|_1 \le \|\nu - \nu'\|_1$.
Proof of Lemma 3.
Since $\mu(s) = \sum_{a} \nu(s,a)$ and $\mu'(s) = \sum_{a} \nu'(s,a)$, we have
$$\|\mu - \mu'\|_1 = \sum_{s} \Big|\sum_{a} \big(\nu(s,a) - \nu'(s,a)\big)\Big| \le \sum_{s}\sum_{a} \big|\nu(s,a) - \nu'(s,a)\big| = \|\nu - \nu'\|_1. \qquad ∎$$
We are ready to bound the third term in (3).
Lemma 4.
Let $\{(s_t, a_t)\}_{t=1}^{T}$ be the random sequence of state-action pairs generated by the policies induced by the occupancy measures $\{\nu_t\}_{t=1}^{T}$. It holds that the third term of (3) satisfies
$$\sum_{t=1}^{T} \rho^{\pi_t}(r_t) - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big] \le O\Big(\tau \sum_{t=1}^{T-1} \|\nu_{t+1} - \nu_t\|_1 + \tau + 1\Big).$$
Proof of Lemma 4.
By the definition of $\rho^{\pi_t}$, we have $\rho^{\pi_t}(r_t) = \langle r_t, \nu_t \rangle = \sum_{s,a} \mu^{\pi_t}(s)\,\pi_t(a \mid s)\, r_t(s,a)$.
Now, recall that $\mu_{t+1} = \mu_t P^{\pi_t}$, where $\mu_t$ denotes the distribution over states at time $t$ under the algorithm. We now bound $\|\mu_t - \mu^{\pi_t}\|_1$ for all $t$ as follows: