1 Introduction
Recent advances in reinforcement learning (RL) methods have led to increased focus on finding practical RL applications. RL algorithms provide a set of tools for tackling sequential decision-making problems, with potential applications ranging from web advertising and portfolio optimization to healthcare applications like adaptive drug treatment. However, despite the empirical success of RL in simulated domains such as board games and video games, it has seen limited use in real-world applications because of the inherent trial-and-error nature of the paradigm. In addition to these concerns, for the applications listed above, the designer essentially has to design adaptive methods for a population of users instead of a single system. For example, consider the problem of optimizing adaptive drug treatment plans for a sequence of patients. In such a scenario, one has to quickly learn good policies for each user and also share the observed outcome data efficiently across patients. Intuitively, we expect that frequently seen patient types can be adequately dealt with by using adaptive learning methods, whereas difficult and rare cases could be referred to experts.
In this paper, we consider a setting where, at the start of every patient interaction, we have access to some contextual information. This information can be demographic, genomic, or pertain to measurements taken from lab tests. We model this setting using the framework of Contextual Markov Decision Processes (CMDPs) previously studied by Modi et al. (2018). Similar settings have been studied with slightly differing formalizations by Abbasi-Yadkori and Neu (2014); Hallak et al. (2015) and Dann et al. (2018). While the framework proposed in these works is innovative, there are a number of deficiencies in the available set of results. First, theoretical guarantees (PAC-style mistake bounds or regret bounds) sometimes hold only under a restrictive linearity assumption on the mapping between contexts and MDPs. Second, if nonlinear mappings are introduced, they do not always deal properly with the requirement that next-state distributions be properly normalized probability vectors.
We address these deficiencies using generalized linear models (GLMs) to model the mapping from context to MDP parameters. Our results provide new bounds for the logit link and improve existing results even in the simpler linear case. In addition, we focus on computational and space complexity concerns for learning in such CMDPs. Our proposed algorithm uses the popular optimism in the face of uncertainty (OFU) approach and relies on online updates that are efficient in terms of both memory and computation time. It therefore differs from typical OFU approaches, whose running time scales linearly with the number of observed contexts. Finally, the proposed algorithm also yields a cumulative policy certificate bound as studied by Dann et al. (2018).
Outline. In Section 2, we describe the setting formally with the required assumptions and notation. We present a method for confidence set construction in Section 3. Section 4 uses the confidence sets to develop our main algorithm GLORL and provides its theoretical analysis. In Section 5, we show a regret lower bound for our setting. After giving a general method for obtaining confidence sets under prior information (Section 6) and a discussion of related work (Section 7), we conclude in Section 8.
2 Setting and Notation
The agent in this setting interacts with a sequence of fixed horizon tabular MDPs which are indexed by contexts where . Each episode is a sequence of length with states , actions and . In each episode, the state and reward for each timestep are sampled from the contextual MDPs () parameters: and . The actions are chosen by the agent’s (contextual) policy computed at the beginning of each episode. We denote the size of MDP parameters by the usual notation: and .
We denote the value of a policy in an episode by the expected total return for steps:
The optimal policy for each episode is denoted by and its value as . We denote the instantaneous regret for episode as . The goal is to bound the cumulative regret , i.e., sum of these for any number of episodes . Note that this formulation allows an adversarial sequence of context vectors.
Additionally, for two matrices and , . For a vector and a matrix , we define . For matrices and , we have where is the row of the matrix. For simplicity, we remove the subscripts/superscripts from the notation when clear from the context. Any norm which appears without a subscript will denote the norm.
2.1 Generalized Linear Model for CMDPs
We assume that each contextual MDP is obtained by a set of generalized linear models. Specifically, for each pair , there exists a weight matrix where is a convex set. (Without loss of generality, we can set the last row of the weight matrix to be 0 to avoid an overparametrized system.) For any context , the next state distribution for the pair is specified by a GLM:
(1) 
where is the link function of the GLM. We will assume that this link function is convex, which is always the case for a canonical exponential family (Lauritzen, 1996). For rewards, we assume that each mean reward is given by a linear function of the context: where . In addition, we will make the following assumptions about the link function.
Assumption 2.1.
The function is strongly convex, that is:
(2) 
Assumption 2.2.
The function is strongly smooth, that is:
(3) 
We will see that this assumption is critical for constructing the confidence sets used in our algorithm. We make another assumption about the size of the weight matrices and contexts :
Assumption 2.3.
For all episodes , we have and for all state-action pairs , and . So, we have for all .
The regret bounds for the proposed algorithms will depend on these quantities.
2.2 Special Cases
We show that this setting covers contextual MDPs constructed through multinomial logit models or linear combinations of base MDPs.
Example 2.4 (Multinomial logit model, Agarwal (2013)).
Each next state is sampled from a multinomial distribution with probabilities:
The link function for this case can be given as which can be shown to be strongly convex with and smooth with .
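As an illustration of Example 2.4, the following minimal sketch assumes the standard log-sum-exp form of the multinomial logit link function, whose gradient is the softmax map. The function names, the matrix `W`, and the context `x` are hypothetical and chosen only for the example.

```python
import numpy as np

def log_sum_exp(u):
    """Assumed link function Phi(u) = log(sum_i exp(u_i)), computed stably."""
    m = np.max(u)
    return m + np.log(np.sum(np.exp(u - m)))

def next_state_distribution(W, x):
    """Softmax of the logits W @ x, i.e. the gradient of log_sum_exp at W @ x."""
    u = W @ x
    z = np.exp(u - np.max(u))
    return z / z.sum()

# Hypothetical sizes: 3 next states, 2 context features.
W = np.array([[1.0, -0.5],
              [0.2,  0.3],
              [0.0,  0.0]])  # last row set to 0 to avoid overparametrization
x = np.array([0.6, 0.4])
p = next_state_distribution(W, x)  # a normalized probability vector
```

Because the output is a softmax, every entry is strictly positive and the entries sum to one, so the normalization requirement on next-state distributions holds by construction.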
Example 2.5 (Linear combination of MDPs, Modi et al. (2018)).
Each MDP is obtained by a linear combination of base MDPs. Here, , and . The link function for this can be shown to be:
which is strongly convex and smooth with parameters . Moreover, here is the matrix containing each next state distribution in a column. We have, , and .
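The construction in Example 2.5 can be sketched as follows, under the assumption that the context is a weight vector on the probability simplex, so that the mixture of base MDPs is itself a valid MDP. The helper `mix_base_mdps` and the array layout are hypothetical.

```python
import numpy as np

def mix_base_mdps(base_P, x):
    """base_P: array of shape (d, S, A, S) holding the next-state
    distributions of d base MDPs; x: mixture weights (assumed to lie on the simplex)."""
    x = np.asarray(x, dtype=float)
    if np.any(x < 0) or not np.isclose(x.sum(), 1.0):
        raise ValueError("context must lie on the probability simplex")
    # Contract the context with the base-MDP axis; result has shape (S, A, S).
    return np.tensordot(x, base_P, axes=1)

rng = np.random.default_rng(0)
d, S, A = 3, 4, 2
base_P = rng.dirichlet(np.ones(S), size=(d, S, A))  # random base MDPs for illustration
x = np.array([0.5, 0.3, 0.2])
P_x = mix_base_mdps(base_P, x)
```

Each row of `P_x` is a convex combination of probability vectors, so it stays normalized without any further projection.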
3 Online Estimates and Confidence Set Construction
In order to obtain a no-regret algorithm for our setting, we will follow the popular optimism under uncertainty approach, which relies on the construction of confidence sets for MDP parameters at the beginning of each episode. We focus on deriving these confidence sets for the next state distributions for all state-action pairs. In this section, we will assume that, for all pairs , , , and are known a priori. Therefore, for each state-action pair, we have the following online estimation problem: For :
Propose an estimate and a set such that, with high probability.

Observe and a sample .
We consider this as an online optimization problem with the following loss sequence based on the negative log-likelihood:
(4) 
where denotes the one-hot vector with the value at the sampled state . The utility of this loss function is that it preserves strong convexity of with respect to and is a proper loss function (Agarwal, 2013):
(5)
Since our aim is computational and memory efficiency, we carefully follow the Online Newton Step (Hazan et al., 2007) based method proposed for rewards with logistic link function in Zhang et al. (2016). This extension to generalized linear models utilizes the structure of multinomial vectors in various places of the analysis. Let us focus on the estimation problem for a single stateaction pair. The update rule for the parameter matrix is given below.
Let and . We maintain an estimate by the following update rule:
(6) 
where .
We will see that strong convexity of plays an important role in the analysis and that the multinomial GLM assumption improves the dependence on . The method only stores the empirical covariance matrix and solves an optimization problem using only the current context. Since the set is convex, this is a tractable problem and can be solved via any off-the-shelf optimizer up to the desired accuracy. The computation time for computing these sets for each context is with no dependence on . Furthermore, we only store many matrices of size and covariance matrices of sizes . In Section 4, we will see that we can obtain a confidence set over the next state distribution for each context if we have access to a confidence set for . The following key result gives such a confidence set.
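To make the memory and computation claims concrete, here is a rough sketch of an Online-Newton-Step-style update for the softmax negative log-likelihood. It is not the update rule of equation 6: the projection onto the convex set is omitted, so the step has a closed form, and the step size `eta`, the curvature constant `alpha`, and the identity initialization of `Z` are illustrative assumptions.

```python
import numpy as np

def softmax(u):
    z = np.exp(u - np.max(u))
    return z / z.sum()

def log_loss(W, x, y):
    """Negative log-likelihood of a one-hot y under the softmax GLM."""
    u = W @ x
    return np.max(u) + np.log(np.sum(np.exp(u - np.max(u)))) - y @ u

def ons_step(W, Z, x, y, eta=1.0, alpha=0.1):
    """One unconstrained Newton-style step (projection step omitted).
    eta and alpha are placeholder constants, not those of the paper."""
    G = np.outer(softmax(W @ x) - y, x)             # gradient of log_loss w.r.t. W
    Z_new = Z + 0.5 * eta * alpha * np.outer(x, x)  # rank-one covariance update
    # Minimizer of 0.5 * ||W' - W||_Z^2 + eta * <G, W' - W> over all matrices W'.
    W_new = W - eta * G @ np.linalg.inv(Z_new)
    return W_new, Z_new

S, d = 3, 2
W, Z = np.zeros((S, d)), np.eye(d)
x = np.array([1.0, 0.0])
y = np.array([1.0, 0.0, 0.0])  # one-hot vector of the sampled next state
W1, Z1 = ons_step(W, Z, x, y)
```

Only the current estimate and one d-by-d matrix are kept per state-action pair, and each step touches only the current context, which is why the per-round cost does not grow with the number of rounds seen so far.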
Theorem 3.1 (Confidence set for ).
If is obtained by equation 6 for all timesteps , then for all timesteps, with probability at least , we have:
(7) 
with
(8) 
with .
The term now depends on the size of the true weight matrix, the strong convexity parameter, and the log determinant of the covariance matrix. We will see later that the term is of the order . Therefore, overall, this term scales as .
3.1 Proof of Theorem 3.1
We closely follow the analysis from Zhang et al. (2016) and use properties of the multinomial output to adapt it to our case. We will denote the derivative with respect to the matrix for loss as and the derivative with respect to the projection as . Now, using the strong convexity of the loss function , for all , we have:
Taking expectation with respect to the multinomial sample , we get:
(9) 
where the lhs is obtained by using the calibration property from eq. 5. Now, for the first term on rhs, we have:
where and . We bound the term using the following lemma:
Lemma 3.2.
(10) 
Proof.
To prove this, we go back to the update rule in (6) which has the following form:
with , , , and . For a solution to any such optimization problem, by the first order optimality conditions, we have:
Now,
(11)  
Noting that , we get
Substituting this and along with other terms in ineq. (11) proves the stated lemma (ineq. (10)). ∎
For bounding in ineq. 3.1, we note that is a martingale difference sequence for which we have:
Similarly, for martingale , we bound the conditional variance as
Now, using Bernstein’s inequality for martingales and the peeling technique as used in Lemma 5 of Zhang et al. (2016), we get
with probability at least . Using the RMS-AM inequality, with probability at least , we get
(12) 
From eqs. (9), (3.1) and (10), we have
Unwrapping the rhs over and substituting ineq. (12), we get
Using the following Lemma from Zhang et al. (2016), we arrive at the result in Theorem 3.1.
Lemma 3.3 (Lemma 6, Zhang et al. (2016)).
We have,
4 Optimistic Reinforcement Learning for GLM CMDP
In this section, we describe the OFU-based online learning algorithm which leverages the confidence sets described in the previous sections. Not surprisingly, our algorithm is similar to the algorithms of Dann et al. (2018) and Abbasi-Yadkori and Neu (2014) and follows the standard format for no-regret bounds in MDPs. As can be seen from the algorithm outline, we construct confidence sets for the next state distributions and rewards for each state-action pair . Using these, we compute an optimistic policy to run at the beginning of each episode. For deriving the confidence interval for , we use the results from the previous section:
(13)
where quantities with subscript denote the value at the beginning of episode . The pair of matrices is maintained for every state-action pair . For rewards, we assume the usual linear bandit structure, which has a standard confidence interval (Lattimore and Szepesvári (2018), Theorem 20.5).
(14) 
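For intuition, the generic linear-bandit interval for a mean reward can be sketched as a ridge-regression estimate plus a width proportional to the weighted norm of the context. This is a textbook form, not necessarily the exact expression in (14). The class name, the regularizer `lam`, and the width multiplier `beta` are hypothetical, and the example uses noiseless rewards so that a simple choice of `beta` is provably sufficient.

```python
import numpy as np

class LinearRewardEstimator:
    """Ridge-regression estimate of a linear mean reward with an ellipsoidal interval."""

    def __init__(self, d, lam=1.0):
        self.Z = lam * np.eye(d)  # regularized covariance matrix
        self.b = np.zeros(d)      # running sum of r * x

    def update(self, x, r):
        self.Z += np.outer(x, x)
        self.b += r * x

    def interval(self, x, beta):
        theta_hat = np.linalg.solve(self.Z, self.b)
        width = beta * np.sqrt(x @ np.linalg.solve(self.Z, x))
        mean = theta_hat @ x
        return mean - width, mean + width

rng = np.random.default_rng(0)
d, lam = 3, 1.0
theta = np.array([0.3, -0.2, 0.5])  # hypothetical true reward parameter
est = LinearRewardEstimator(d, lam)
for _ in range(50):
    x_obs = rng.normal(size=d)
    est.update(x_obs, theta @ x_obs)  # noiseless rewards, for illustration only
x_test = np.array([0.2, 0.1, -0.4])
# Without noise, beta = sqrt(lam) * ||theta|| already covers the regularization bias.
lo, hi = est.interval(x_test, beta=np.sqrt(lam) * np.linalg.norm(theta))
```

With noisy rewards, `beta` would also need a term accounting for the noise level and the log determinant of `Z`.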
For the given algorithm using the confidence sets from previous sections, we can show the following bound:
Theorem 4.1 (Regret of GLORL).
4.1 Proof of Theorem 4.1
We begin by showing that the computed policy’s value is optimistic.
Lemma 4.2 (Optimism).
If all the confidence intervals as computed in Algorithm 1 are valid for all episodes , then for all and and , we have:
Proof.
For every episode, the lemma is true trivially for . Assume that it is true for . Now, for , we have:
where the last line uses the guarantee on confidence intervals and the assumption for . ∎
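The induction above can be checked numerically with a small optimistic backward-induction sketch. This is not Algorithm 1 itself: it assumes rewards in [0, 1], an L1 confidence width `xi_p` on each next-state distribution, and an absolute width `xi_r` on each mean reward, and it uses Hölder's inequality to turn the L1 width into a value bonus. All names are hypothetical.

```python
import numpy as np

def optimistic_planning(P_hat, r_hat, xi_p, xi_r, H):
    """P_hat: (S, A, S) estimated transitions; r_hat: (S, A) estimated rewards;
    xi_p, xi_r: (S, A) confidence widths. Returns optimistic Q-values of
    shape (H, S, A) and a greedy policy of shape (H, S)."""
    S, A, _ = P_hat.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)  # value after the final step is zero
    for h in range(H - 1, -1, -1):
        # |(P - P_hat) @ V| <= ||P - P_hat||_1 * max(V) whenever V >= 0.
        bonus = xi_r + xi_p * V_next.max()
        Q[h] = np.minimum(H - h, r_hat + P_hat @ V_next + bonus)
        V_next = Q[h].max(axis=1)
    return Q, Q.argmax(axis=2)

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
P_true = rng.dirichlet(np.ones(S), size=(S, A))
P_hat = rng.dirichlet(np.ones(S), size=(S, A))
r_true = rng.uniform(size=(S, A))
r_hat = rng.uniform(size=(S, A))
xi_p = np.abs(P_true - P_hat).sum(axis=2)  # widths chosen so the confidence sets are valid
xi_r = np.abs(r_true - r_hat)
Q_opt, policy = optimistic_planning(P_hat, r_hat, xi_p, xi_r, H)
Q_true, _ = optimistic_planning(P_true, r_true, 0.0 * xi_p, 0.0 * xi_r, H)
```

When the widths are valid, `Q_opt` dominates `Q_true` entrywise, which mirrors the statement of Lemma 4.2.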
Using the optimism guarantee, we have
We can now bound the regret for an episode as:
(15)  
Therefore, we have:
We bound the first summation using Lemma 23 of Dann et al. (2018).
Lemma 4.3 (Dann et al. (2018), Lemma 23).
With probability at least , for all , we have
Now, we bound the second term of the regret bound decomposition:
Now, we can bound the individual terms, where we focus on the higher order terms arising due to :
where the last inequality follows from the Cauchy-Schwarz inequality. We now bound the elliptic potential inside the square root. Note that, instead of summing up the weighted operator norm with changing values of , we keep the matrix the same for all observations in an episode. Note that denotes the matrix at the beginning of episode and therefore does not include the terms . Therefore, we rederive the bound for this setting. Now, for any episode :
where in the last step, we have used the following:
Therefore, we can finally bound the term as:
The regret component for rewards will contain lower order terms and we ignore it for conciseness. Now, taking , we get the total failure probability for Lemma 4.3 and the confidence intervals to be at most . If we use the results from Section 3, we know that . Thus, by combining all terms, we get