In many practical settings, the actions of an agent have only a limited effect on the environment. For example, in a wireless cellular network, system performance is determined by a number of parameters that must be dynamically controlled to optimize performance. We can formulate this as a Markov Decision Process (MDP) in which the reward function is the negative of the number of users who are suffering from low bandwidth. However, the reward is heavily influenced by exogenous factors such as the number, location, and behavior of the cellular network customers. Customer demand varies stochastically as a function of latent factors (news, special events, traffic accidents). In addition, atmospheric conditions can affect the capacity of each wireless channel. This high degree of stochasticity can confuse reinforcement learning algorithms, because during exploration, the expected benefit of trying action $a$ in state $s$ is hard to determine. Many trials are required to average away the exogenous component of the reward so that the effect of the action can be measured. For temporal difference algorithms, such as Q Learning, the learning rate will need to be very small. For policy gradient algorithms, the number of Monte Carlo trials required to estimate the gradient will be very large (or equivalently, the step size will need to be very small). In this paper, we analyze this setting and develop algorithms for automatically detecting and removing the effects of exogenous state variables. This accelerates reinforcement learning (RL).
This paper begins by defining exogenous variables and rewards and presenting the exogenous-endogenous (Exo/Endo) MDP decomposition. We show that, under the assumption that the reward function decomposes additively into exogenous and endogenous components, the Bellman equation for the original MDP decomposes into two equations: one for the exogenous Markov reward process (Exo-MRP) and the other for the endogenous MDP (Endo-MDP). Importantly, every optimal policy for the Endo-MDP is an optimal policy for the full MDP. Next we study conditions under which solving the Endo-MDP is faster (in sample complexity) than solving the full MDP. To do this, we derive dynamic programming updates for the covariance between the $H$-horizon returns of the Exo-MRP and the Endo-MDP, which may be of independent interest. The third part of the paper presents an abstract algorithm for automatically identifying the Exo/Endo decomposition. We develop two approximations to this general algorithm. One is a global scheme that computes the entire decomposition at once; the second is a faster stepwise method that can scale to large problems. Finally, we present experimental results to illustrate cases where the Exo/Endo decomposition yields large speedups and other cases where it does not.
2 MDPs with Exogenous Variables and Rewards
We study discrete time MDPs with stochastic rewards (Puterman, 1994; Sutton & Barto, 1998); the state and action spaces may be either discrete or continuous. Notation: state space $S$, action space $A$, reward distribution $R: S \times A \mapsto D(\Re)$ (where $D(\Re)$ is the space of probability densities over the real numbers), transition function $P: S \times A \mapsto D(S)$ (where $D(S)$ is the space of probability densities or distributions over $S$), starting state $s_0$, and discount factor $\gamma \in (0,1)$. We assume that for all $(s,a) \in S \times A$, $R(s,a)$ has expected value $m(s,a)$ and finite variance $\sigma^2(s,a)$.
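Suppose the state space $S$ can be decomposed into two subspaces $X$ and $E$ according to $S = E \times X$. We will say that the subspace $X$ is exogenous if the transition function can be decomposed as $P(e', x' \mid e, x, a) = P(x' \mid x)\, P(e' \mid e, x, a)$, where $s' = (e', x')$ is the state that results from executing action $a$ in state $s = (e, x)$. We will say that an MDP is an Exogenous State MDP if its transition function can be decomposed in this way. We will say it is an Additively Decomposable Exogenous State MDP if its reward function can be decomposed as the sum of two terms as follows. Let $R_x: X \mapsto D(\Re)$ be the exogenous reward distribution and $R_e: E \times X \times A \mapsto D(\Re)$ be the endogenous reward distribution such that $r = r_x + r_e$, where $r_x \sim R_x(\cdot \mid x)$ with mean $m_x(x)$ and variance $\sigma_x^2(x) < \infty$ and $r_e \sim R_e(\cdot \mid e, x, a)$ with mean $m_e(e, x, a)$ and variance $\sigma_e^2(e, x, a) < \infty$.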
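Theorem 1. For any Additively Decomposable Exogenous State MDP with exogenous state space $X$, the $H$-step finite-horizon Bellman optimality equation can be decomposed into two separate equations, one for a Markov Reward Process involving only $X$ and $R_x$ and the other for an MDP (the endo-MDP) involving only $R_e$:
$$V(e, x; h) = V_x(x; h) + V_e(e, x; h) \quad (1)$$
$$V_x(x; h) = m_x(x) + \gamma\, E_{x' \sim P(x' \mid x)}\left[V_x(x'; h-1)\right] \quad (2)$$
$$V_e(e, x; h) = \max_a \left\{ m_e(e, x, a) + \gamma\, E_{x' \sim P(x' \mid x);\, e' \sim P(e' \mid e, x, a)}\left[V_e(e', x'; h-1)\right] \right\} \quad (3)$$
Proof. Proof by induction on the horizon $H$. Note that the expectations could be either sums (if $S$ is discrete) or integrals (if $S$ is continuous).
Base case: $H = 1$; we take one action and terminate.
$$V(e, x; 1) = m_x(x) + \max_a m_e(e, x, a).$$
The base case is established by setting $V_x(x; 1) = m_x(x)$ and $V_e(e, x; 1) = \max_a m_e(e, x, a)$.
Recursive case: $H = h$.
$$V(e, x; h) = m_x(x) + \max_a \left\{ m_e(e, x, a) + \gamma\, E_{x' \sim P(x' \mid x);\, e' \sim P(e' \mid e, x, a)}\left[ V_x(x'; h-1) + V_e(e', x'; h-1) \right] \right\}$$
Distribute the expectation over the sum in brackets and simplify. Because $V_x(x'; h-1)$ depends on neither $e'$ nor $a$, we obtain
$$V(e, x; h) = m_x(x) + \gamma\, E_{x' \sim P(x' \mid x)}\left[V_x(x'; h-1)\right] + \max_a \left\{ m_e(e, x, a) + \gamma\, E_{x' \sim P(x' \mid x);\, e' \sim P(e' \mid e, x, a)}\left[V_e(e', x'; h-1)\right] \right\}$$
The result is established by setting $V_x(x; h) = m_x(x) + \gamma\, E_{x' \sim P(x' \mid x)}[V_x(x'; h-1)]$ and $V_e(e, x; h) = \max_a \{ m_e(e, x, a) + \gamma\, E_{x', e'}[V_e(e', x'; h-1)] \}$. QED.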
Corollary 1. Any optimal policy for the endo-MDP of Equation 3 is an optimal policy for the full exogenous state MDP.
Proof. Because $V_x(x; H)$ does not depend on the policy, the optimal policy can be computed simply by solving the endo-MDP. QED.
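The decomposition and its corollary can be checked numerically on a small tabular problem. The sketch below is illustrative only: the random model, the array names (`P_x`, `P_e`, `m_x`, `m_e`), and the problem sizes are assumptions made for this example and do not come from the paper. It runs finite-horizon value iteration on the full MDP and on the exogenous and endogenous parts separately, then compares the resulting values and greedy policies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_e, n_a, gamma, H = 3, 4, 2, 0.9, 5

# Hypothetical random model: P_x[x, x'] and P_e[e, x, a, e'] are row-stochastic.
P_x = rng.random((n_x, n_x))
P_x /= P_x.sum(axis=1, keepdims=True)
P_e = rng.random((n_e, n_x, n_a, n_e))
P_e /= P_e.sum(axis=-1, keepdims=True)
m_x = rng.normal(size=n_x)               # mean exogenous reward m_x(x)
m_e = rng.normal(size=(n_e, n_x, n_a))   # mean endogenous reward m_e(e, x, a)


def backup(V_next):
    """E[V_next(e', x')] for every (e, x, a), with x' ~ P_x and e' ~ P_e."""
    return np.einsum("exaf,xy,fy->exa", P_e, P_x, V_next)


V_full = np.zeros((n_e, n_x))  # value of the full MDP
V_e = np.zeros((n_e, n_x))     # value of the endo-MDP
V_x = np.zeros(n_x)            # value of the exogenous Markov reward process
for _ in range(H):
    Q_full = m_x[None, :, None] + m_e + gamma * backup(V_full)
    Q_endo = m_e + gamma * backup(V_e)
    V_x = m_x + gamma * P_x @ V_x
    V_full, V_e = Q_full.max(axis=-1), Q_endo.max(axis=-1)
    pi_full, pi_endo = Q_full.argmax(axis=-1), Q_endo.argmax(axis=-1)

print("values decompose:", np.allclose(V_full, V_x[None, :] + V_e))
print("same greedy policy:", np.array_equal(pi_full, pi_endo))
```

Barring exact ties between actions, the greedy policies coincide because the exogenous term adds the same constant to every action value in a given state.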
In an unpublished manuscript, Bray (2017) proves a similar result. He also identifies conditions under which value iteration and policy iteration on the Endo-MDP can be accelerated by computing the eigenvector decomposition of the endogenous transition matrix. While such techniques are useful for MDP planning with a known transition matrix, we do not know how to exploit them in reinforcement learning where the MDP is unknown.
3 Analysis of the Exo/Endo Decomposition
Suppose we are given the decomposition of the state space into exogenous and endogenous subspaces. Under what conditions would reinforcement learning on the endogenous MDP be more efficient than on the original MDP? To explore this question, let us consider the problem of estimating the value of a fixed policy $\pi$ in a given start state $s_0$ via Monte Carlo trials of length $H$. We will compare the sample complexity of estimating $V^\pi(s_0; H)$ on the full MDP to the sample complexity of estimating $V_e^\pi(s_0; H)$ on the Endogenous MDP. Of course most RL algorithms must do more than simply estimate $V^\pi(s_0; H)$ for fixed $\pi$, but the difficulty of estimating $V^\pi(s_0; H)$ is closely related to the difficulty of fitting a value function approximator or estimating the gradient in a policy gradient method.
Define $B^\pi(s_0; H)$ to be a random variable for the $H$-step cumulative discounted return of starting in state $s_0$ and choosing actions according to $\pi$ for $H$ steps. To compute a Monte Carlo estimate of $V^\pi(s_0; H)$, we will generate $N$ realizations $b_1, \ldots, b_N$ of $B^\pi(s_0; H)$ by executing $N$ $H$-step trajectories in the MDP, each time starting in $s_0$.
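Theorem 2. For any $\epsilon > 0$ and any $0 < \delta < 1$, let $\hat{V}^\pi(s_0; H) = N^{-1} \sum_{i=1}^{N} b_i$ be the Monte Carlo estimate of the expected $H$-step return of policy $\pi$ starting in state $s_0$. If
$$N \geq \frac{Var[B^\pi(s_0; H)]}{\delta \epsilon^2}$$
then
$$P\left[\,|\hat{V}^\pi(s_0; H) - V^\pi(s_0; H)| > \epsilon\,\right] \leq \delta.$$
Proof. This is a simple application of the Chebychev Inequality,
$$P\left(|X - E[X]| > \epsilon\right) \leq \frac{Var[X]}{\epsilon^2},$$
with $X = \hat{V}^\pi(s_0; H)$. The variance of the mean of $N$ iid random variables is the variance of any single variable divided by $N$. Hence $Var[\hat{V}^\pi(s_0; H)] = Var[B^\pi(s_0; H)] / N$. To obtain the result, plug this into the Chebychev inequality, set the rhs equal to $\delta$, and solve for $N$. QED.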
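The bound obtained from the Chebychev argument is simple to evaluate. The helper below is a minimal sketch; the function name `chebyshev_trials` and its argument names are chosen for this example. It returns the smallest integer number of trials satisfying $N \geq Var[B] / (\delta \epsilon^2)$.

```python
import math


def chebyshev_trials(return_variance: float, epsilon: float, delta: float) -> int:
    """Smallest N with N >= Var[B] / (delta * epsilon^2)."""
    if return_variance < 0 or epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need variance >= 0, epsilon > 0, and 0 < delta < 1")
    return math.ceil(return_variance / (delta * epsilon ** 2))


# Halving the variance of the return halves the number of trials in the bound.
n_full = chebyshev_trials(2.0, epsilon=0.5, delta=0.25)
n_endo = chebyshev_trials(1.0, epsilon=0.5, delta=0.25)
print(n_full, n_endo)
```

The bound is linear in the variance of the return, which is why the analysis that follows concentrates on comparing variances.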
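Now let us consider the Exo/Endo decomposition of the MDP. Let $B_x^\pi(x; H)$ denote the $H$-step return of the exogenous MRP and $B_e^\pi(e, x; H)$ denote the return of the endogenous MDP. Then $B^\pi(e, x; H) = B_x^\pi(x; H) + B_e^\pi(e, x; H)$ is a random variable denoting the cumulative $H$-horizon discounted return of the original, full MDP. Let $Var[B^\pi(e, x; H)]$ be the variance of $B^\pi(e, x; H)$ and $Cov[B_x^\pi(x; H), B_e^\pi(e, x; H)]$ be the covariance between the two component returns.
Theorem 3. The Chebychev upper bound on the number of Monte Carlo trials required to estimate $V^\pi(s_0; H)$ using the endogenous MDP will be reduced compared to the full MDP iff
$$Var[B_x^\pi(x; H)] > -2\, Cov[B_x^\pi(x; H), B_e^\pi(e, x; H)].$$
Proof. From Theorem 2, we know that the sample size bound using the endogenous MDP will be less than the required sample size using the full MDP when $Var[B_e^\pi(e, x; H)] < Var[B_x^\pi(x; H) + B_e^\pi(e, x; H)]$. The variance of the sum of two random variables is
$$Var[B_x^\pi + B_e^\pi] = Var[B_x^\pi] + Var[B_e^\pi] + 2\, Cov[B_x^\pi, B_e^\pi].$$
Hence, $Var[B_e^\pi] < Var[B_x^\pi + B_e^\pi]$ iff $Var[B_x^\pi] > -2\, Cov[B_x^\pi, B_e^\pi]$. QED.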
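To evaluate this covariance condition, we need to compute the variance and covariance of the $H$-step returns. We derive dynamic programming formulas for these.
Theorem 4. The variance of the $H$-step return $B^\pi(s; H)$ can be computed as the solution to the following dynamic program, which applies for all $h > 0$:
$$V^\pi(s; 0) := 0; \qquad Var[B^\pi(s; 0)] := 0$$
$$V^\pi(s; h) := m(s, \pi(s)) + \gamma\, E_{s' \sim P(s' \mid s, \pi(s))}\left[V^\pi(s'; h-1)\right]$$
$$Var[B^\pi(s; h)] := \sigma^2(s, \pi(s)) - V^\pi(s; h)^2 + \gamma^2 E_{s'}\left[Var[B^\pi(s'; h-1)]\right] + E_{s'}\left[\left(m(s, \pi(s)) + \gamma V^\pi(s'; h-1)\right)^2\right] \quad (4)$$
Proof. Sobel (1982) analyzed the variance of infinite horizon discounted MDPs with deterministic rewards. We modify his proof to handle a fixed horizon and stochastic rewards. We proceed by induction on $H$. To simplify notation, we omit the dependence on $\pi$.
Base Case: $H = 0$. This is established by Equations 4.
Inductive Step: $H = h$. Write
$$B(s; h) = r + \gamma B(s'; h-1),$$
where the rhs involves the three random variables $r$, $s'$, and $B(s'; h-1)$. To obtain Equation 4, compute the expected value $V(s; h) = E_{s'}\left[E_r\left[E_{B(s'; h-1)}[r + \gamma B(s'; h-1)]\right]\right]$ and take each expectation in turn. To obtain the formula for the variance, write the standard formula for the variance:
$$Var[B(s; h)] = E[B(s; h)^2] - E[B(s; h)]^2.$$
Substitute $r + \gamma B(s'; h-1)$ in the first term and simplify the second term to obtain
$$Var[B(s; h)] = E_{s'}\left[E_r\left[E_{B(s'; h-1)}\left[(r + \gamma B(s'; h-1))^2\right]\right]\right] - V(s; h)^2.$$
Expand the square in the first term:
$$Var[B(s; h)] = E_{s'}\left[E_r\left[E_{B(s'; h-1)}\left[r^2 + 2\gamma\, r\, B(s'; h-1) + \gamma^2 B(s'; h-1)^2\right]\right]\right] - V(s; h)^2.$$
Distribute the two innermost expectations over the sum, using the fact that $r$ and $B(s'; h-1)$ are independent given $s$ and $s'$:
$$Var[B(s; h)] = E_{s'}\left[E_r[r^2] + 2\gamma\, m(s) V(s'; h-1) + \gamma^2 E_{B(s'; h-1)}[B(s'; h-1)^2]\right] - V(s; h)^2.$$
Apply the definition of variance in reverse to terms 1 and 3 in brackets:
$$Var[B(s; h)] = E_{s'}\left[m(s)^2 + \sigma^2(s) + 2\gamma\, m(s) V(s'; h-1) + \gamma^2 V(s'; h-1)^2 + \gamma^2 Var[B(s'; h-1)]\right] - V(s; h)^2.$$
Factor the quadratic involving terms 1, 3, and 4:
$$Var[B(s; h)] = E_{s'}\left[\left(m(s) + \gamma V(s'; h-1)\right)^2 + \sigma^2(s) + \gamma^2 Var[B(s'; h-1)]\right] - V(s; h)^2.$$
Finally, distribute the expectation with respect to $s'$ to obtain Equation 4. QED.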
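Theorem 5. The covariance between the $H$-step returns of the exogenous MRP and the endogenous MDP can be computed via the following dynamic program, which applies for all $h > 0$ (omitting the dependence on $\pi$ as above):
$$Cov[B_x(x; 0), B_e(e, x; 0)] := 0$$
$$Cov[B_x(x; h), B_e(e, x; h)] := \gamma^2 E_{e', x'}\left[Cov[B_x(x'; h-1), B_e(e', x'; h-1)]\right] + E_{e', x'}\left[\left(m_x(x) + \gamma V_x(x'; h-1)\right)\left(m_e(e, x) + \gamma V_e(e', x'; h-1)\right)\right] - V_x(x; h)\, V_e(e, x; h) \quad (5)$$
Proof. By induction. The base case is established by Equation 5. For the inductive case, we begin with the formula for non-centered covariance:
$$Cov[B_x(x; h), B_e(e, x; h)] = E[B_x(x; h)\, B_e(e, x; h)] - V_x(x; h)\, V_e(e, x; h).$$
Replace $B_x(x; h)$ by $r_x + \gamma B_x(x'; h-1)$ and $B_e(e, x; h)$ by $r_e + \gamma B_e(e', x'; h-1)$ and replace the expectations wrt $B_x$ and $B_e$ by expectations wrt the six variables $\{e', x', r_x, r_e, B_x(x'; h-1), B_e(e', x'; h-1)\}$. We will use the following abbreviations for these variables: $s' = (e', x')$, $r = (r_x, r_e)$, and $B' = (B'_x, B'_e)$ with $B'_x = B_x(x'; h-1)$ and $B'_e = B_e(e', x'; h-1)$. We will focus on the expectation term. Multiply out the two terms and distribute the expectations wrt $r$ and $B'$:
$$E_{s'}\left[m_x(x)\, m_e(e, x) + \gamma\, m_x(x) V_e(e', x'; h-1) + \gamma\, m_e(e, x) V_x(x'; h-1) + \gamma^2 E_{B'}[B'_x B'_e]\right].$$
Apply the non-centered covariance formula “in reverse” to term 4. Distribute expectation with respect to $s'$:
$$E_{s'}[m_x m_e] + \gamma E_{s'}[m_x V_e(e', x'; h-1)] + \gamma E_{s'}[m_e V_x(x'; h-1)] + \gamma^2 E_{s'}[V_x(x'; h-1) V_e(e', x'; h-1)] + \gamma^2 E_{s'}\left[Cov[B'_x, B'_e]\right].$$
Obtain Equation 5 by factoring the first four terms, writing the expectations explicitly, and including the $-V_x(x; h)\, V_e(e, x; h)$ term. QED.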
To gain some intuition for this theorem, examine the three terms on the right-hand side of Equation 5. The first is the “recursive” covariance for $h - 1$. The second is the one-step non-centered covariance, which is the expected value of the product of the backed-up values for $V_x$ and $V_e$. The third term is the product of $V_x$ and $V_e$ for the current state, which re-centers the covariance.
Theorems 4 and 5 allow us to check the covariance condition of Theorem 3 in every state, including the start state $s_0$, so that we can decide whether to solve the original MDP or the endo-MDP. Some special cases are easy to verify. For example, if the mean exogenous reward $m_x(x) = 0$ for all states, then $V_x$ and the covariance are zero everywhere, and the covariance condition reduces to $Var[B_x^\pi(x; H)] > 0$.
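The variance and covariance recursions can be sketched for a small tabular model under a fixed policy. Everything concrete below is an assumption made for illustration: the random model, the array names, and in particular the assumption that the exogenous and endogenous rewards are independent given the current state. The sketch computes the variance of the exogenous, endogenous, and full returns, computes the covariance, and evaluates the covariance condition in every state.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_e, gamma, H = 3, 4, 0.9, 6
shape = (n_e, n_x)  # all per-state quantities are indexed by (e, x)

# Hypothetical model under a fixed policy (the action is folded into P_e, m_e).
P_x = rng.random((n_x, n_x))
P_x /= P_x.sum(axis=1, keepdims=True)
P_e = rng.random((n_e, n_x, n_e))
P_e /= P_e.sum(axis=-1, keepdims=True)
m_x = np.broadcast_to(rng.normal(size=n_x), shape)   # mean exogenous reward
s2_x = np.broadcast_to(rng.random(n_x), shape)       # exogenous reward variance
m_e, s2_e = rng.normal(size=shape), rng.random(shape)


def expect(F):
    """E over s' = (e', x') of F[e', x'], for every current state (e, x)."""
    return np.einsum("exf,xy,fy->ex", P_e, P_x, F)


def cross_moment(m1, V1, m2, V2):
    """E_{s'}[(m1(s) + gamma V1(s')) (m2(s) + gamma V2(s'))]."""
    return (m1 * m2 + gamma * m1 * expect(V2) + gamma * m2 * expect(V1)
            + gamma ** 2 * expect(V1 * V2))


def value_and_variance(m, s2, horizon):
    """Mean and variance of the horizon-step return via the variance recursion."""
    V, Var = np.zeros(shape), np.zeros(shape)
    for _ in range(horizon):
        V_new = m + gamma * expect(V)
        Var = s2 - V_new ** 2 + gamma ** 2 * expect(Var) + cross_moment(m, V, m, V)
        V = V_new
    return V, Var


# Covariance recursion between the exogenous and endogenous returns.
Vx, Ve, Cov = np.zeros(shape), np.zeros(shape), np.zeros(shape)
for _ in range(H):
    Vx_new, Ve_new = m_x + gamma * expect(Vx), m_e + gamma * expect(Ve)
    Cov = gamma ** 2 * expect(Cov) + cross_moment(m_x, Vx, m_e, Ve) - Vx_new * Ve_new
    Vx, Ve = Vx_new, Ve_new

_, Var_x = value_and_variance(m_x, s2_x, H)
_, Var_e = value_and_variance(m_e, s2_e, H)
_, Var_full = value_and_variance(m_x + m_e, s2_x + s2_e, H)

# Covariance condition: the endo-MDP needs fewer trials where this is True.
endo_cheaper = Var_x > -2 * Cov
print(np.allclose(Var_full, Var_x + Var_e + 2 * Cov), endo_cheaper)
```

The final check confirms that the separately computed pieces are consistent with the variance of the full return, which is the identity used in the proof of the covariance condition.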
4 Algorithms for Decomposing an MDP into Exogenous and Endogenous Components
In some applications, it is easy to specify the exogenous state variables, but in others, we must discover them from training data. Suppose we are given a database $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ of sample transitions obtained by executing one or more exploration policies in the full MDP. Assume each $s_i$ and $s'_i$ is a $d$-dimensional real-valued vector and $a_i$ is a $d_a$-dimensional real-valued vector (possibly a one-hot encoding of $d_a$ discrete actions). In the following, we center $s_i$ and $s'_i$ by subtracting off the sample mean.
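We seek to learn three functions $F_{exo}$, $F_{endo}$, and $G$ parameterized by $w_x$, $w_e$, and $w_G$, such that $x = F_{exo}(s; w_x)$ is the exogenous state, $e = F_{endo}(s; w_e)$ is the endogenous state, and $s = G(x, e; w_G)$ recovers the original state from the exogenous and endogenous parts. We want to capture as much exogenous state as possible subject to the constraints that (1) we satisfy the conditional independence relationship $P(x' \mid x, e, a) = P(x' \mid x)$, and (2) we can recover the original state from the exogenous and endogenous parts. We formulate the decomposition problem as the following abstract optimization problem:
$$\arg\max_{w_x, w_e, w_G} E\left[\|F_{exo}(s'; w_x)\|\right] \quad \text{subject to} \quad \hat{I}\left(F_{exo}(s'; w_x);\, [F_{endo}(s; w_e), a] \mid F_{exo}(s; w_x)\right) < \epsilon, \quad E\left[\|G(F_{exo}(s'; w_x), F_{endo}(s'; w_e); w_G) - s'\|\right] < \epsilon'. \quad (9)$$
We treat the $(s_i, a_i, r_i, s'_i)$ observations as samples from the random variables $s$, $a$, $r$, $s'$, and the expectations are estimated from these samples. The objective is to maximize the expected “size” of $F_{exo}$; below we will instantiate this abstract norm. In the first constraint, $\hat{I}(\cdot\,; \cdot \mid \cdot)$ denotes the estimated conditional mutual information. Ideally, it should be 0, which implies that $P(x' \mid x, e, a) = P(x' \mid x)$. We only require it to be smaller than $\epsilon$. The second constraint encodes the requirement that the average reconstruction error of the state should be small. We make the usual assumption that $x'$ and $e'$ are independent conditioned on $x$, $e$, and $a$.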
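We now instantiate this abstract schema by making specific choices for the norm, $F_{exo}$, $F_{endo}$, and $G$. Instantiate the norm as the squared $L_2$ norm, so the objective is to maximize the variance of the exogenous state. Define $F_{exo}$ as a linear projection from the full $d$-dimensional state space to a smaller $d_x$-dimensional space, defined by the matrix $W_x$. The projected version of state $s$ is $W_x^\top s$. Define the endogenous state $e$ to contain the components of $s$ not contained within $x$. From linear decomposition theory (Jolliffe, 2002), we know that for a fixed $W_x$ consisting of orthonormal components ($W_x^\top W_x = I_{d_x}$), the exogenous state using all $d$ dimensions is $x = W_x W_x^\top s$, and the endogenous state is $e = s - x = s - W_x W_x^\top s$. Under this approach, $s = x + e$, so the reconstruction error is 0, and the second constraint in (9) is trivially satisfied. The endo-exo optimization problem becomes
$$\max_{0 \leq d_x \leq d,\; W_x \in \mathbb{R}^{d \times d_x}} E\left[\|W_x^\top s'\|_2^2\right] \quad \text{subject to} \quad W_x^\top W_x = I_{d_x}, \quad \hat{I}\left(W_x^\top s';\, [s - W_x W_x^\top s, a] \mid W_x^\top s\right) < \epsilon. \quad (10)$$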
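Formulation (10) involves simultaneous optimization over the unknown dimensionality $d_x$ and the projection matrix $W_x$. It is hard to solve, because of the conditional mutual information constraint. To tackle this, we approximate $\hat{I}(U; Y \mid Z)$ by the partial correlation coefficient $PCC(U; Y \mid Z)$, defined as the Frobenius norm of the normalized partial covariance matrix $V(U, Y, Z)$ (Baba et al., 2004; Fukumizu et al., 2008, 2004):
$$PCC(U; Y \mid Z) = tr\left(V^\top(U, Y, Z)\, V(U, Y, Z)\right), \quad \text{where} \quad V(U, Y, Z) = \Sigma_{UU}^{-1/2}\left(\Sigma_{UY} - \Sigma_{UZ} \Sigma_{ZZ}^{-1} \Sigma_{ZY}\right) \Sigma_{YY}^{-1/2}$$
and $tr$ is the trace. If the set of random variables $(U, Y, Z)$ follow a multivariate Gaussian distribution, then the partial correlation coefficient of a pair of variables given all the other variables is equal to the conditional correlation coefficient (Baba et al., 2004). With this change, we can express our optimization problem in matrix form as follows. Arrange the data into matrices $S$ and $S'$, each with $N$ rows and $d$ columns, where the $i$th row refers to the $i$th instance. Let $A$ be the action matrix with $N$ rows and $d_a$ columns:
$$\max_{0 \leq d_x \leq d,\; W_x \in \mathbb{R}^{d \times d_x}} tr\left(W_x^\top S'^\top S' W_x\right) \quad \text{subject to} \quad W_x^\top W_x = I_{d_x}, \quad PCC\left(S' W_x;\, [S - S W_x W_x^\top, A] \mid S W_x\right) < \epsilon. \quad (11)$$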
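A sample version of the partial correlation coefficient can be sketched with NumPy. The normalization used here (whitening the partial covariance by the marginal covariances of the first two arguments) is an assumption about the exact form of the normalized partial covariance matrix, and the function names and the small ridge term are choices made for this example.

```python
import numpy as np


def _inv_sqrt(C, ridge):
    """Inverse matrix square root of a symmetric PSD matrix (ridge-regularized)."""
    vals, vecs = np.linalg.eigh(C + ridge * np.eye(C.shape[0]))
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def pcc(X, Y, Z, ridge=1e-8):
    """Squared Frobenius norm of the normalized partial covariance of X, Y given Z.

    X, Y, Z are 2-D arrays with one row per sample.
    """
    X, Y, Z = (M - M.mean(axis=0) for M in (X, Y, Z))
    n = X.shape[0]

    def cov(A, B):
        return A.T @ B / n

    Szz_inv = np.linalg.inv(cov(Z, Z) + ridge * np.eye(Z.shape[1]))
    partial = cov(X, Y) - cov(X, Z) @ Szz_inv @ cov(Z, Y)
    V = _inv_sqrt(cov(X, X), ridge) @ partial @ _inv_sqrt(cov(Y, Y), ridge)
    return float(np.trace(V.T @ V))


rng = np.random.default_rng(0)
n = 5000
Z = rng.normal(size=(n, 1))
X_ci = Z + rng.normal(size=(n, 1))          # X and Y share only the cause Z
Y_ci = Z + rng.normal(size=(n, 1))
X_dep = rng.normal(size=(n, 1))
Y_dep = X_dep + 0.1 * rng.normal(size=(n, 1))  # Y depends directly on X

pcc_ci = pcc(X_ci, Y_ci, Z)
pcc_dep = pcc(X_dep, Y_dep, Z)
print(pcc_ci, pcc_dep)
```

In the first synthetic case the two variables are conditionally independent given `Z`, so the statistic is close to zero; in the second it is close to one.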
4.1 Global Algorithm
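For a fixed exogenous dimensionality $d_x$, we can instead search for the projection that minimizes the PCC:
$$\min_{W_x \in \mathbb{R}^{d \times d_x}} PCC\left(S' W_x;\, [S - S W_x W_x^\top, A] \mid S W_x\right) \quad \text{subject to} \quad W_x^\top W_x = I_{d_x}. \quad (12)$$
This formulation assumes a fixed exogenous dimensionality $d_x$ and computes the projection $W_x$ with the minimal PCC. The orthonormality condition constrains $W_x$ to lie on a Stiefel manifold (Stiefel, 1935). Several optimization algorithms exist for optimizing on Stiefel manifolds (Jiang & Dai, 2015; Absil et al., 2007; Edelman et al., 1999). We employ the algorithms implemented in the Manopt package (Boumal et al., 2014).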
For a fixed dimensionality $d_x$ of the exogenous state, we can solve (12) to minimize the PCC. Given that our true objective is (11) and the optimal value of $d_x$ is unknown, we must try all values $d_x = 0, \ldots, d$, and pick the $d_x$ that achieves the maximal variance, provided that the PCC from optimization problem (12) is less than $\epsilon$. This is costly since it requires solving $O(d)$ manifold optimization problems. We can speed this up somewhat by iterating $d_x$ from $d$ down to 1 and stopping with the first projection $W_x$ that satisfies the PCC constraint. If no such projection exists, then the exogenous projection is empty, and the state is fully endogenous. The justification for this early-stopping approach is that the exogenous decomposition that maximizes $d_x$ will contain any component that appears in exogenous decompositions of lower dimensionality. Hence, the $W_x$ that attains maximal variance is achieved by the largest $d_x$ satisfying the PCC constraint. Algorithm 1 describes this procedure. Unfortunately, this scheme must still solve $O(d)$ optimization problems in the worst case. We now introduce an efficient stepwise algorithm.
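The early-stopping search over the exogenous dimensionality can be sketched as a short loop. This is not the paper's Algorithm 1 verbatim: the function name `global_decomposition` and the callback `solve_fixed_dim`, which stands in for the manifold optimizer and is assumed to return a projection together with its PCC value, are hypothetical.

```python
from typing import Callable, List, Tuple

import numpy as np


def global_decomposition(
    d: int,
    epsilon: float,
    solve_fixed_dim: Callable[[int], Tuple[np.ndarray, float]],
) -> np.ndarray:
    """Return the first projection, scanning d_x = d, ..., 1, with PCC < epsilon.

    solve_fixed_dim(d_x) is a hypothetical stand-in for the fixed-dimension
    manifold optimization; it returns (W_x of shape (d, d_x), minimal PCC).
    """
    for d_x in range(d, 0, -1):
        W_x, pcc_value = solve_fixed_dim(d_x)
        if pcc_value < epsilon:
            return W_x
    return np.zeros((d, 0))  # empty projection: the state is fully endogenous


# Toy stand-in solver: pretend only projections of dimension <= 2 pass the test.
calls: List[int] = []


def fake_solver(d_x: int) -> Tuple[np.ndarray, float]:
    calls.append(d_x)
    return np.eye(4)[:, :d_x], (0.01 if d_x <= 2 else 0.5)


W = global_decomposition(4, epsilon=0.05, solve_fixed_dim=fake_solver)
print(W.shape, calls)
```

Scanning from the largest dimension downward means the first feasible projection is also the one with the largest dimensionality, at the cost of up to `d` optimizer calls.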
4.2 Stepwise Algorithm
Our stepwise algorithm extracts the components (vectors) of the exogenous projection matrix $W_x$ one at a time by solving a sequence of small optimization problems. Suppose we have already identified the first $k - 1$ components of $W_x$, $w_1, \ldots, w_{k-1}$, and we seek to identify $w_k$. These $k - 1$ components define $k - 1$ exogenous state variables $x_1, \ldots, x_{k-1}$. Recall that the original objective seeks to uncover the conditional independence $P(x' \mid x, e, a) = P(x' \mid x)$. This requires us to know $x$ and $e$, whereas we only know a portion of $x$, and we therefore do not know $e$ at all. To circumvent this problem, we make two approximations. First, we eliminate $e$ from the conditional independence. Second, we assume that $P(x' \mid x)$ has a lower-triangular form, so that $x'_k$ depends only on $x_1, \ldots, x_k$. This yields the simpler objective $P(x'_k \mid x_1, \ldots, x_k, a) = P(x'_k \mid x_1, \ldots, x_k)$.
What damage is done by eliminating $e$? Components with low values for $\hat{I}(x'; [e, a] \mid x)$ will also have low values for $\hat{I}(x'; a \mid x)$, so this objective will not miss true exogenous components. On the other hand, it may find components that are not exogenous, because they depend on $e$. To address this, after discovering a new component we add it to $W_x$ only if it satisfies the original PCC constraint conditioned on both $e$ and $a$. What damage is done by assuming a triangular dependence structure? We do not have a mathematical characterization of this case, so we will assess it experimentally.
To finish the derivation of the stepwise algorithm, we must ensure that each new component is orthogonal to the previously-discovered components. Consider the matrix $C_x = [w_1, w_2, \ldots, w_{k-1}]$. To ensure that the new component $w_k$ will be orthogonal to the previously-discovered vectors, we restrict $w_k$ to lie in the null space of $C_x^\top$. We compute an orthonormal basis $N$ for this null space. The matrix $[C_x\; N]$ is then an orthonormal basis for $\mathbb{R}^d$. Since we want $w_k$ to be orthogonal to the components in $C_x$, it must have the form $w_k = N u$, where $u \in \mathbb{R}^{d-k+1}$. The unit norm constraint $\|w_k\|_2 = 1$ is satisfied if and only if $\|u\|_2 = 1$.
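The null-space construction can be illustrated with a full singular value decomposition. The helper name `null_space_basis` and the matrix name `C_x` for the previously discovered components are assumptions made for this sketch, which also assumes those components are linearly independent.

```python
import numpy as np


def null_space_basis(C_x: np.ndarray) -> np.ndarray:
    """Orthonormal basis (as columns) for the null space of C_x.T.

    C_x has shape (d, k) and is assumed to have full column rank.
    """
    d, k = C_x.shape
    if k == 0:
        return np.eye(d)
    # Rows k..d-1 of Vt are right singular vectors with zero singular value.
    _, _, Vt = np.linalg.svd(C_x.T, full_matrices=True)
    return Vt[k:].T


rng = np.random.default_rng(0)
d, k = 5, 2
C_x, _ = np.linalg.qr(rng.normal(size=(d, k)))  # two orthonormal components
N = null_space_basis(C_x)                       # shape (d, d - k)

u = rng.normal(size=d - k)
u /= np.linalg.norm(u)   # unit-norm coordinates in the null space
w_new = N @ u            # candidate component

print(np.allclose(C_x.T @ w_new, 0.0), np.isclose(np.linalg.norm(w_new), 1.0))
```

Because the columns of `N` are orthonormal, searching over unit vectors `u` is equivalent to searching over unit-norm components orthogonal to everything found so far.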
Algorithm 2 incorporates the aforementioned observations. Line 5 computes an orthonormal basis for the null space of all components that have been discovered so far. Lines 6-8 compute the component in that null space that minimizes the PCC and add it to $C_x$. Line 9 computes the endogenous space assuming the new component was added to the current exogenous projection $W_x$. In Lines 10-13, if the PCC, conditioned on both the endogenous state and the action, is lower than $\epsilon$, then we add the newly discovered component to $W_x$. The algorithm terminates when the entire state space of dimensionality $d$ has been decomposed.
As in the global algorithm, Line 6 involves a manifold optimization problem; however, the Stiefel manifold corresponding to the constraint $\|u\|_2 = 1$ is just the unit sphere, which has a simpler form than the general manifold in formulation (12) in the Global algorithm 1. Note that the variance of the exogenous state can only increase as we add components to $W_x$. The stepwise scheme has the added benefit that it can terminate early, for instance once it has discovered a sufficient number of components or once it has exceeded a certain time threshold.
5 Experimental Studies
We report three experiments. (Several additional experiments are described in the Supplementary Materials.) In each experiment, we compare Q Learning (Watkins, 1989) on the full MDP (“Full MDP”) to Q Learning on the decomposed MDPs discovered by the Global (“Endo MDP Global”) and Stepwise (“Endo MDP Stepwise”) algorithms. Where possible, we also compare the performance of Q Learning on the true endogenous MDP (“Endo MDP Oracle”). The Q function is represented as a neural network with a single hidden layer of tanh units and a linear output layer, except for Problem 1 where 2 layers of units each are used. Q-learning updates are implemented with stochastic gradient descent. Exploration is achieved via Boltzmann exploration with temperature parameter $\beta$. Given the current Q values, the action $a_t$ is selected according to
$$\pi(a_t \mid s_t) = \frac{\exp\left(Q(s_t, a_t)/\beta\right)}{\sum_{a} \exp\left(Q(s_t, a)/\beta\right)}.$$
In each experiment, all Q learners observe the entire current state $s_t$, but the full Q learner is trained on the full reward, while the endogenous reward Q learners are trained on the (estimated) endogenous reward. All learners are initialized identically and employ the same random seed. For the first steps, the full reward is employed, and we collect a database of transitions. We then apply Algorithm 1 and Algorithm 2 to this database to estimate $W_x$ and $W_e$. The algorithms then fit a linear regression model $R_{exo}(W_x^\top s)$ to predict the reward $r$ as a function of the exogenous state $x = W_x^\top s$. The endogenous reward is defined as the residuals of this model: $R_{endo}(s) = r - R_{exo}(x)$. The endogenous Q learner then employs this endogenous reward for steps onward. Each experiment is repeated times. A learning curve is created by plotting one value every steps, which consists of the mean of immediate rewards, along with a 95% confidence interval for that mean.
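The residual construction of the endogenous reward can be sketched with ordinary least squares. The function name `endogenous_reward`, the inclusion of an intercept column, and the synthetic data below are assumptions made for illustration.

```python
import numpy as np


def endogenous_reward(S: np.ndarray, r: np.ndarray, W_x: np.ndarray) -> np.ndarray:
    """Residual of a linear regression of the reward on the exogenous state.

    S: (N, d) states, r: (N,) rewards, W_x: (d, d_x) exogenous projection.
    """
    X = S @ W_x                                       # exogenous state per sample
    design = np.column_stack([X, np.ones(len(X))])    # add an intercept column
    coef, *_ = np.linalg.lstsq(design, r, rcond=None)
    return r - design @ coef


# Synthetic example: the reward is dominated by the first (exogenous) coordinate.
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 3))
r = 3.0 * S[:, 0] + 0.1 * S[:, 1]
W_x = np.eye(3)[:, :1]       # hypothetical projection onto the first coordinate

r_endo = endogenous_reward(S, r, W_x)
print(r.var(), r_endo.var())
```

On this toy data, the residual reward has far lower variance than the full reward, which is the effect the decomposition is meant to exploit.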
In all Q Learners, we set the discount factor to be . The learning rates are set to for Problem 1, and for Problem 2, and for Problem 3. The temperature of Boltzmann exploration is set to for Problem 1 and for Problems 2 and 3. We employ steepest descent solving in Manopt. For the PCC constraint, is set to .
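Boltzmann exploration can be sketched as a softmax over the current Q values. The function names and the example values below are hypothetical, and the temperature used here is arbitrary rather than one of the settings from the experiments.

```python
import numpy as np


def boltzmann_probs(q_values, temperature: float) -> np.ndarray:
    """Action probabilities proportional to exp(Q / temperature)."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def boltzmann_action(q_values, temperature: float, rng: np.random.Generator) -> int:
    """Sample one action index from the Boltzmann distribution."""
    return int(rng.choice(len(q_values), p=boltzmann_probs(q_values, temperature)))


probs = boltzmann_probs([0.0, 1.0], temperature=0.5)
action = boltzmann_action([0.0, 1.0], temperature=0.5, rng=np.random.default_rng(0))
print(probs, action)
```

Lower temperatures concentrate probability on the greedy action, while higher temperatures approach uniform exploration.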
5.1 Problem 1: Wireless Network Parameter Configuration
We begin by applying our developed algorithm to the wireless network problem described in Section 1. The parameter to be configured in the time period is the threshold of monitored sign