A Markov decision process (MDP) is a discrete-time state-transition system in which the transition dynamics satisfy the Markov property (Puterman (1994), Bertsekas and Tsitsiklis (1996)). MDPs are a standard model for expressing uncertainty in reinforcement learning problems. In the classical MDP model, the transition dynamics and the reward functions are time-invariant. However, such fixed transition dynamics and reward functions are insufficient to model real-world problems in which parameters of the world change over time. To deal with such problems, we consider a setting in which both the transition dynamics and the reward functions may vary over time. These changes can be either abrupt or gradual. As a motivation, consider the problem of deciding which ads to place on a webpage. The instantaneous reward is the payoff when viewers are redirected to an advertiser, and the state captures the details of the current ad. With a heterogeneous group of viewers, an invariant state-transition function cannot accurately capture the transition dynamics. The instantaneous reward, dependent on external factors, is also better represented by changing reward functions. For additional motivation and further practical applications, see Yu and Mannor (2009a,b) and Abbasi et al. (2013).
1.1 Main Contribution
For reinforcement learning in MDPs with changes in the reward function and transition probabilities, we provide an algorithm, UCRL with Restarts, a version of UCRL (Jaksch et al., 2010) which restarts according to a schedule dependent on the variation in the MDP (defined in Section 2 below). We derive a high-probability upper bound on the cumulative regret of our algorithm which is optimal in its order of dependence on time and variation. Formulating the problem and analyzing the algorithm in terms of variation rather than the number of changes allows us to handle cases where the number of changes is high. To the best of our knowledge, these bounds are the first variational bounds for the general reinforcement learning setting. So far, variational regret bounds have been derived only for the simpler bandit setting (Besbes et al., 2014).
1.2 Related Work
MDPs in which the state-transition probabilities change arbitrarily but the reward functions remain fixed have been considered by Nilim and El Ghaoui (2005) and Xu and Mannor (2006). On the other hand, Even-Dar et al. (2005) and Dick et al. (2014) consider the problem of MDPs with fixed state-transition probabilities and changing reward functions. Moreover, Even-Dar et al. (2005, Theorem 11) also show that the case of MDPs with both changing state-transition probabilities and changing reward functions is computationally hard. Yu and Mannor (2009a,b) consider arbitrary changes in the reward functions and arbitrary, but bounded, changes in the state-transition probabilities. They also give regret bounds that scale with the proportion of changes in the state-transition kernel and that in the worst case grow linearly with time. Abbasi et al. (2013) consider MDP problems with (oblivious) adversarial changes in state-transition probabilities and reward functions and provide algorithms for minimizing the regret with respect to a comparison set of stationary (expert) policies.
The rest of the article is structured as follows. In Section 2, we describe the problem at hand formally and define the performance measure to be used. In Section 3, we present our algorithmic solution for the stated problem. We briefly summarize the main result of the article in Section 4. Next, in Section 5, we provide proofs for the two main theorems as well as other preliminary results required for the former. A few technical details required for these proofs are deferred until Section 6. In Section 7, we discuss this work, including possible future directions.
In our setting, while the state space $\mathcal{S}$ (with $S := |\mathcal{S}|$) and action space $\mathcal{A}$ (with $A := |\mathcal{A}|$) of the MDP in which the learner operates are assumed to be fixed, mean rewards and transition probabilities are time dependent. Accordingly, we write $r_t(s,a)$ for the mean reward for choosing action $a$ in state $s$ at time $t$, and $p_t(s'\,|\,s,a)$ for the probability of a transition from state $s$ to $s'$ when choosing action $a$ at time $t$. We assume that the MDP $M_t$ at each step $t$ is communicating, i.e., has a finite diameter $D_t$. (The diameter is the minimal expected time it takes to get from any state to any other state in the MDP, cf. Jaksch et al. (2010).) Further, we denote a common upper bound for all $D_t$ up to step $T$ by $D$. The average reward $\rho(M,\pi,s)$ of a stationary policy $\pi$ in an MDP $M$ with initial state $s$ is the limit of the expected average accumulated reward when following $\pi$, i.e.,
\[ \rho(M,\pi,s) := \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=1}^{T} r'_t \Big], \]
where $r'_t$ denotes the random reward the learner obtains at time $t$. The optimal average reward $\rho^*_t := \max_{\pi} \rho(M_t,\pi,s)$ of each $M_t$ is independent of the initial state $s$. The learner competes at each time step $t$ with the optimal average reward $\rho^*_t$. Accordingly, we define the regret after $T$ steps by
\[ R_T := \sum_{t=1}^{T} \rho^*_t \;-\; \sum_{t=1}^{T} r'_t . \]
Note that if there are no changes, this is equivalent to the standard notion of regret as used e.g. by Jaksch et al. (2010).
2.1 Definition of Variation
We consider individual terms for the variation in mean rewards and transition probabilities, that is,
\[ V_r := \sum_{t=1}^{T-1} \max_{s,a} \big| r_{t+1}(s,a) - r_t(s,a) \big|, \qquad V_p := \sum_{t=1}^{T-1} \max_{s,a} \big\| p_{t+1}(\cdot\,|\,s,a) - p_t(\cdot\,|\,s,a) \big\|_1 . \]
These “local” variation measures can also be used to bound a more “global” notion of variation in average reward, defined as
\[ V_\rho := \sum_{t=1}^{T-1} \big| \rho^*_{t+1} - \rho^*_t \big| . \]

Theorem 1.
\[ V_\rho \;\le\; V_r + D\, V_p . \]
The proof of Theorem 1 is given in Section 5.2. While $V_\rho$ is a more straightforward adaptation of the notion of variation of Besbes et al. (2014) from the bandit to the MDP setting, in the latter it seems more natural to work with the local variation measures for rewards and transition probabilities, as (unlike in the bandit setting) the learner only has direct access to the rewards and transition probabilities.
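Assuming the reward and transition sequences are available as arrays (the array layout and function names below are our own illustration, not the paper's), the two local variation measures can be computed directly from their definitions:

```python
import numpy as np

def reward_variation(r):
    # r[t, s, a]: mean reward at time t.
    # V_r sums, over consecutive time steps, the largest change in
    # mean reward over all state-action pairs.
    return np.abs(np.diff(r, axis=0)).max(axis=(1, 2)).sum()

def transition_variation(p):
    # p[t, s, a, s']: transition probabilities at time t.
    # The change at each step is measured in the L1 norm over next
    # states, maximized over state-action pairs.
    return np.abs(np.diff(p, axis=0)).sum(axis=3).max(axis=(1, 2)).sum()
```

For a constant MDP both measures are zero, while a single abrupt change contributes its full magnitude once; gradual drift accumulates step by step.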
For reinforcement learning in the changing MDP setting, we propose an algorithm based on the UCRL algorithm of Jaksch et al. (2010). UCRL follows the principle of optimism in the face of uncertainty and employs confidence intervals for rewards and transition probabilities to implement it. That is, when computing a new policy, it considers the set of all MDPs that are plausible with respect to the observations so far. From this set, the algorithm chooses the policy and the MDP that give the highest average reward.
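As an illustration of how the plausible set is built, the following sketch computes confidence interval widths of the same form as those used by UCRL2 in Jaksch et al. (2010); the exact constants and logarithmic terms are those of the original UCRL2 and may differ in the restarted variant considered here:

```python
import math

def reward_conf_width(n_visits, t, n_states, n_actions, delta):
    # Half-width of the confidence interval around the empirical mean
    # reward of a state-action pair visited n_visits times up to step t.
    return math.sqrt(7 * math.log(2 * n_states * n_actions * t / delta)
                     / (2 * max(1, n_visits)))

def transition_conf_width(n_visits, t, n_states, n_actions, delta):
    # Radius of the L1 confidence ball around the empirical transition
    # probabilities of a state-action pair.
    return math.sqrt(14 * n_states * math.log(2 * n_actions * t / delta)
                     / max(1, n_visits))
```

Any MDP whose mean rewards and transition probabilities lie within these intervals around the empirical estimates counts as plausible; the optimistic policy and MDP are then found over this set (in UCRL, via extended value iteration).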
For the changing MDP setting, we restart UCRL after a particular number of steps. The idea of restarting UCRL in the changing MDP setting has already been considered by Jaksch et al. (2010), who also showed respective regret bounds (see the next section). However, we modify the restarting schedule of the algorithm to provide a regret bound which is order optimal in terms of the time horizon and the variation parameters.
The algorithm is shown in detail as Algorithm 1. Each phase restarts UCRL, which maintains state-action counts (line 7) as well as estimates of rewards and transition probabilities (line 8). In each episode (UCRL-internal, not to be confused with the phases), UCRL computes an optimistic policy that maximizes the optimal average reward over all policies and all plausible MDPs (defined via confidence intervals for the estimated rewards and transition probabilities), see lines 9 and 10. This policy is played until the visits in some state-action pair double (lines 11 and 12), at which point a new episode starts and a new policy is computed.
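The doubling criterion for ending an episode can be sketched as follows (a minimal illustration with our own data layout, not the paper's pseudocode):

```python
def episode_ends(total_counts, episode_counts):
    # An episode ends as soon as some state-action pair has been visited
    # within the episode as often as before the episode started (with a
    # minimum count of 1, as in UCRL), i.e. its total count has doubled.
    return any(count >= max(1, total_counts.get(sa, 0))
               for sa, count in episode_counts.items())
```

Since every episode at least doubles some state-action count, the number of episodes grows only logarithmically in the number of steps, which keeps the number of policy recomputations small.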
If one considers a setting in which the MDP changes at most $\ell$ times, then UCRL with a restarting scheme adapted to the number of changes gives the following regret bound of order $\tilde{O}\big(\ell^{1/3} T^{2/3}\big)$. This bound is due to Jaksch et al. (2010).
Theorem 2 (Jaksch et al. (2010)).
Given that the number of changes in reward distributions and transition probabilities up to step $T$ is bounded by $\ell$, the regret (measured as the sum of missed rewards compared to the optimal policies in the periods during which the MDP remains constant) of UCRL restarted at steps $\big\lceil \tfrac{i^3}{\ell^2} \big\rceil$ for $i = 1, 2, 3, \ldots$ is upper bounded by
\[ 65 \cdot \ell^{1/3}\, T^{2/3}\, D\, S \sqrt{A \log\big(\tfrac{T}{\delta}\big)} \]
with probability of at least $1 - \delta$.
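The restart schedule in Theorem 2 is easy to enumerate; the helper below (our own sketch) lists the restart steps that fall within a given horizon:

```python
import math

def restart_steps(horizon, num_changes):
    # Restarts happen at steps ceil(i^3 / l^2) for i = 1, 2, 3, ...,
    # so phases grow longer over time and only on the order of
    # (num_changes^2 * horizon)^(1/3) restarts occur within the horizon.
    steps, i = [], 1
    while math.ceil(i ** 3 / num_changes ** 2) <= horizon:
        steps.append(math.ceil(i ** 3 / num_changes ** 2))
        i += 1
    return steps
```

With a horizon of 1000 steps and 2 changes, the schedule starts 1, 2, 7, 16, 32, ... and contains 15 restarts in total, matching the cube-root growth of the phase index.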
For our algorithm with the different restarting scheme adapted to the variation, we can show a regret upper bound as follows.
Theorem 3. The regret of UCRL restarting every $\big\lceil (T/V)^{2/3} \big\rceil$ steps, where $V$ denotes the total variation in rewards and transition probabilities, is bounded with probability at least $1-\delta$ as
\[ R_T \;=\; \tilde{O}\Big( V^{1/3}\, T^{2/3}\, D\, S \sqrt{A \log\big(\tfrac{T}{\delta}\big)} \Big). \]
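Concretely, this restart schedule uses phases of equal length; a sketch (assuming, as the theorem does, that the horizon T and the variation parameter V are known in advance):

```python
import math

def phase_length(horizon, variation):
    # Restart every ceil((T/V)^(2/3)) steps: the larger the variation,
    # the shorter the phases, so outdated observations are discarded
    # more quickly.
    return math.ceil((horizon / variation) ** (2 / 3))

def num_phases(horizon, variation):
    # Number of phases needed to cover the horizon.
    return math.ceil(horizon / phase_length(horizon, variation))
```

The phase length balances two costs: longer phases suffer more from drift within a phase, while shorter phases pay the cost of re-learning from scratch more often; equating the two terms yields the $V^{1/3} T^{2/3}$ rate.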
First, we state a few preliminary results which we will use later in the proofs of the main theorems.
Lemma 1. Let $\tilde r$ and $\tilde p$ be the optimistic values computed by our algorithm from samples collected in any time interval $[t_1, t_2]$. Further, let $V_r'$ and $V_p'$ be the variation of rewards and transition probabilities in the interval $[t_1, t_2]$. Let $\rho^*_t$ and $\tilde\rho_t$ be the optimal average reward at time $t$ and the optimistic average reward of the algorithm at time $t$, respectively. Then,
\[ \tilde\rho_t \;\ge\; \rho^*_t - \big( V_r' + D\, V_p' \big) \]
simultaneously for all intervals $[t_1, t_2]$ with probability at least $1 - \delta$.
The proof for this lemma is given in Section 6.1.
Lemma 2 (Azuma-Hoeffding inequality (Hoeffding (1963))).
Let $X_1, X_2, \ldots$ be a martingale difference sequence with $|X_i| \le c$ for all $i$. Then for all $\epsilon > 0$ and $n \in \mathbb{N}$,
\[ \Pr\Big( \sum_{i=1}^{n} X_i \ge \epsilon \Big) \;\le\; \exp\Big( -\frac{\epsilon^2}{2 n c^2} \Big). \]
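For intuition, the right-hand side of the Azuma-Hoeffding bound can be evaluated numerically (a small helper of ours, not part of the paper):

```python
import math

def azuma_tail_bound(n, c, eps):
    # Upper bound on P(X_1 + ... + X_n >= eps) for a martingale
    # difference sequence with |X_i| <= c: exp(-eps^2 / (2 n c^2)).
    return math.exp(-eps ** 2 / (2 * n * c ** 2))
```

For example, with n = 10000 increments bounded by c = 1, a deviation of eps = 300 (three times the sqrt(n) scale) already has probability at most exp(-4.5), i.e. below 2%.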
Lemma 3. For any sequence of numbers $z_1, \ldots, z_n$ with $0 \le z_k \le Z_{k-1} := \max\big\{1, \sum_{i=1}^{k-1} z_i\big\}$,
\[ \sum_{k=1}^{n} \frac{z_k}{\sqrt{Z_{k-1}}} \;\le\; \big(\sqrt{2}+1\big) \sqrt{Z_n} . \]
This lemma is proven in Appendix C.3 of Jaksch et al. (2010).
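Lemma 3 is the pigeonhole-style bound behind the square-root-of-counts terms in the regret analysis; the stated inequality can be sanity-checked numerically (our own check, with $Z_{k-1}$ as defined above):

```python
import math

def lemma3_lhs(zs):
    # sum_k z_k / sqrt(Z_{k-1}) with Z_{k-1} = max(1, z_1 + ... + z_{k-1}).
    total, acc = 0.0, 0.0
    for z in zs:
        total += z / math.sqrt(max(1.0, acc))
        acc += z
    return total

def lemma3_rhs(zs):
    # (sqrt(2) + 1) * sqrt(Z_n), with Z_n = max(1, z_1 + ... + z_n).
    return (math.sqrt(2) + 1) * math.sqrt(max(1.0, sum(zs)))
```

In the regret proofs, $z_k$ plays the role of the number of visits to a state-action pair in episode $k$; the doubling sequence used in the test below satisfies the condition $z_k \le Z_{k-1}$ with equality and is essentially the worst case.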
5.2 Proof of Theorem 1
We state the following lemma based on Lemma 8 in Ortner et al. (2014).
Lemma 4. Consider a pair of communicating MDPs $M$ and $M'$ with the same state-action space whose mean rewards and transition probabilities satisfy
\[ \big| r(s,a) - r'(s,a) \big| \le \varepsilon_r \quad\text{and}\quad \big\| p(\cdot\,|\,s,a) - p'(\cdot\,|\,s,a) \big\|_1 \le \varepsilon_p \quad \text{for all } s, a. \]
Assume that an optimal policy $\pi'$ for $M'$ is performed on $M$ for $n$ steps, and let $N(s)$ be the number of times state $s$ is visited in these $n$ steps. Then,
\[ n\, \rho^*(M') - \sum_{s} N(s)\, r\big(s, \pi'(s)\big) \;\le\; n \big( \varepsilon_r + D\, \varepsilon_p \big) + O\Big( D \sqrt{n \log \tfrac{n}{\delta}} \Big) \]
with probability at least $1 - \delta$.
We can divide the result of Lemma 4 by $n$, choose $\varepsilon_r$ and $\varepsilon_p$ to match the respective variations, and let $n \to \infty$ to get the statement of the lemma. ∎
Then one can write that,
for any policy $\pi$. Using the above,
5.3 Proof of Theorem 3
We first consider the regret in a fixed phase $i$. Let $V_r^{(i)}$, $V_p^{(i)}$, and $V_\rho^{(i)}$ be the variation in rewards, transition probabilities, and average rewards within this phase, respectively. Let $T_i$ be the number of time steps in phase $i$ and $\tau_k$ be the number of time steps in episode $k$. Let $K_i$ be the number of episodes in phase $i$. We can bound the regret in phase $i$ as,
Consider the sequence $X_t := r'_t - r_t(s_t, a_t)$, comprising the differences between the obtained and the mean rewards of the samples with $t$ in the phase. Thus defined, $(X_t)$ constitutes a martingale difference sequence. Because the mean rewards are bounded in $[0,1]$, $|X_t| \le 1$. Therefore, by Lemma 2,
Therefore with probability at least ,
The first term of Eq. (3). Let $\tilde{P}_k$ be the transition matrix of the policy $\tilde\pi_k$ on the optimistic MDP $\tilde{M}_k$, and let $\mathbf{v}_k$ be the row vector of visit counts for each state and the corresponding action chosen by $\tilde\pi_k$. Using this notation, by Section 4.3.1 in the UCRL analysis of Jaksch et al. (2010), we can rewrite the first term on the right-hand side as
\[ \mathbf{v}_k \big( \tilde{P}_k - I \big) \mathbf{w}_k \]
for a vector $\mathbf{w}_k$ with $\|\mathbf{w}_k\|_\infty \le \tfrac{D}{2}$.
The second term of Eq. (3).
The last inequality follows from Eq. (1).
With another application of Lemma 2,
Therefore with probability at least ,
with probability at least .
Simplifying $\mathbf{v}_k (\tilde{P}_k - I) \mathbf{w}_k$: Denote by $\mathbf{e}_i$ the unit vectors with $i$-th coordinate equal to 1 and all other coordinates 0. Then
The first term: Below we use that .
We can write,