Variance Reduction for Reinforcement Learning in Input-Driven Environments

07/06/2018 ∙ by Hongzi Mao, et al. ∙ MIT

We consider reinforcement learning in input-driven environments, where an exogenous, stochastic input process affects the dynamics of the system. Input processes arise in many applications, including queuing systems, robotics control with disturbances, and object tracking. Since the state dynamics and rewards depend on the input process, the state alone provides limited information for the expected future returns. Therefore, policy gradient methods with standard state-dependent baselines suffer high variance during training. We derive a bias-free, input-dependent baseline to reduce this variance, and analytically show its benefits over state-dependent baselines. We then propose a meta-learning approach to overcome the complexity of learning a baseline that depends on a long sequence of inputs. Our experimental results show that across environments from queuing systems, computer networks, and MuJoCo robotic locomotion, input-dependent baselines consistently improve training stability and result in better eventual policies.







1 Introduction

Deep reinforcement learning (RL) has emerged as a powerful approach for sequential decision-making problems, achieving impressive results in domains such as game playing (Mnih et al., 2015; Silver et al., 2017) and robotics (Levine et al., 2016; Schulman et al., 2015a; Lillicrap et al., 2015). This paper concerns RL in input-driven environments. Informally, input-driven environments have dynamics that are partially dictated by an exogenous, stochastic input process. Queuing systems (Kleinrock, 1976; Kelly, 2011) are an example; their dynamics are governed not only by the decisions made within the system (e.g., scheduling, load balancing) but also by the arrival process that brings work (e.g., jobs, customers, packets) into the system. Input-driven environments also arise naturally in many other domains: network control and optimization (Winstein & Balakrishnan, 2013; Mao et al., 2017), robotics control with stochastic disturbances (Pinto et al., 2017), locomotion in environments with complex terrains and obstacles (Heess et al., 2017), vehicular traffic control (Belletti et al., 2018; Wu et al., 2017), tracking moving targets, and more (see Figure 1).

We focus on model-free policy gradient RL algorithms (Williams, 1992; Mnih et al., 2016; Schulman et al., 2015a), which have been widely adopted and benchmarked for a variety of RL tasks (Duan et al., 2016; Wu & Tian, 2017). A key challenge for these methods is the high variance in the gradient estimates, as such variance increases sample complexity and can impede effective learning (Schulman et al., 2015b; Mnih et al., 2016). A standard approach to reduce variance is to subtract a “baseline” from the total reward (or “return”) to estimate the policy gradient (Weaver & Tao, 2001). The most common choice of a baseline is the value function — the expected return starting from the state.

Our main insight is that a state-dependent baseline — such as the value function — is a poor choice in input-driven environments, whose state dynamics and rewards are partially dictated by the input process. In such environments, comparing the return to the value function baseline may provide limited information about the quality of actions. The return obtained after taking a good action may be poor (lower than the baseline) if the input sequence following the action drives the system to unfavorable states; similarly, a bad action might end up with a high return with an advantageous input sequence. Intuitively, a good baseline for estimating the policy gradient should take the specific instance of the input process — the sequence of input values — into account. We call such a baseline an input-dependent baseline; it is a function of both the state and the entire future input sequence.

We formally define input-driven Markov decision processes, and we prove that an input-dependent baseline does not introduce bias in standard policy gradient algorithms such as Advantage Actor Critic (A2C) (Mnih et al., 2016) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a), provided that the input process is independent of the states and actions. We derive the optimal input-dependent baseline and a simpler one that is easier to work with in practice; the latter takes the form of a conditional value function — the expected return given the state and the future input sequence.

Input-dependent baselines are harder to learn than their state-dependent counterparts; they are high-dimensional functions of the sequence of input values. To learn input-dependent baselines efficiently, we propose a simple approach based on meta-learning (Finn et al., 2017; Vilalta & Drissi, 2002). The idea is to learn a “meta baseline” that can be specialized to a baseline for a specific input instantiation using a small number of training episodes with that input. This approach applies to applications in which an input sequence can be repeated during training, e.g., applications that use simulations or experiments with previously-collected input traces for training (McGough et al., 2017).

We compare our input-dependent baseline to the standard value function baseline for the five tasks illustrated in Figure 1. These tasks are derived from queuing systems (load balancing heterogeneous servers (Harchol-Balter & Vesilo, 2010)), computer networks (bitrate adaptation for video streaming (Mao et al., 2017)), and variants of standard continuous control RL benchmarks in the MuJoCo physics simulator (Todorov et al., 2012). We adapted three widely-used MuJoCo benchmarks (Duan et al., 2016; Clavera et al., 2018a; Heess et al., 2017) to add a stochastic input element that makes these tasks significantly more challenging. For example, we replaced the static target in a 7-DoF robotic arm target-reaching task with a randomly-moving target that the robot aims to track over time. Our results show that input-dependent baselines consistently provide improved training stability and better eventual policies. Input-dependent baselines are applicable to a variety of policy gradient methods, including A2C, TRPO, PPO, robust adversarial RL methods such as RARL (Pinto et al., 2017), and meta-policy optimization such as MB-MPO (Clavera et al., 2018b). Video demonstrations of our experiments are available at

Figure 1: Input-driven environments: (a) load-balancing heterogeneous servers (Harchol-Balter & Vesilo, 2010) with stochastic job arrival as the input process; (b) adaptive bitrate video streaming (Mao et al., 2017) with stochastic network bandwidth as the input process; (c) Walker2d in wind with a stochastic force (wind) applied to the walker as the input process; (d) HalfCheetah on floating tiles with the stochastic process that controls the buoyancy of the tiles as the input process; (e) 7-DoF arm tracking moving target with the stochastic target position as the input process. Environments (c)–(e) use the MuJoCo physics simulator (Todorov et al., 2012).

2 Preliminaries


We consider a discrete-time Markov decision process (MDP), defined by (𝒮, 𝒜, P, ρ, r, γ), where 𝒮 is a set of n-dimensional states, 𝒜 is a set of m-dimensional actions, P(s_{t+1} | s_t, a_t) is the state transition probability distribution, ρ(s_0) is the distribution over initial states, r(s, a) is the reward function, and γ ∈ (0, 1) is the discount factor. We denote a stochastic policy as π_θ(a | s), which aims to optimize the expected return E_τ[∑_t γ^t r(s_t, a_t)], where τ = (s_0, a_0, s_1, a_1, …) is the trajectory following s_0 ∼ ρ, a_t ∼ π_θ(· | s_t), s_{t+1} ∼ P(· | s_t, a_t). We use V^π(s) = E_τ[∑_t γ^t r(s_t, a_t) | s_0 = s] to define the value function, and Q^π(s, a) = E_τ[∑_t γ^t r(s_t, a_t) | s_0 = s, a_0 = a] to define the state-action value function. For any sequence x = (x_0, x_1, …), we use x to denote the entire sequence and x_{t:} to denote (x_t, x_{t+1}, …).
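To make the return and value definitions above concrete, here is a minimal sketch (function names are ours, not from the paper) of the discounted return ∑_t γ^t r_t and the per-step returns R_t = ∑_{t'≥t} γ^(t'−t) r_{t'} that serve as Monte Carlo estimates of Q^π:

```python
GAMMA = 0.99  # discount factor (gamma in the text); value is illustrative

def discounted_return(rewards, gamma=GAMMA):
    """Return sum_t gamma^t * r_t for one trajectory's reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def per_step_returns(rewards, gamma=GAMMA):
    """Return [R_0, R_1, ...] where R_t = sum_{t'>=t} gamma^(t'-t) r_t',
    the Monte Carlo estimate of Q(s_t, a_t) used throughout the paper."""
    out = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]
```

The backward recursion g ← r + γg is the standard O(T) way to compute all per-step returns in one pass.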

Policy gradient methods.

Policy gradient methods estimate the gradient of expected return with respect to the policy parameters (Sutton et al., 2000; Kakade, 2002; Gu et al., 2017). To train a policy π_θ parameterized by θ, the Policy Gradient Theorem (Sutton et al., 2000) states that

∇_θ E[∑_t γ^t r_t] = E_{s ∼ ρ^π, a ∼ π_θ(·|s)} [∇_θ log π_θ(a | s) Q^π(s, a)],  (1)

where ρ^π denotes the discounted state visitation frequency. Practical algorithms often use the undiscounted state visitation frequency (i.e., γ = 1 in ρ^π), which can make the estimation slightly biased (Thomas, 2014).

Estimating the policy gradient using Monte Carlo estimation for the Q^π function suffers from high variance (Mnih et al., 2016). To reduce variance, an appropriately chosen baseline b(s) can be subtracted from the Q-estimate without introducing bias (Greensmith et al., 2004). The policy gradient estimation with a baseline in Equation (1) becomes E_{s ∼ ρ^π, a ∼ π_θ(·|s)} [∇_θ log π_θ(a | s) (Q^π(s, a) − b(s))]. While an optimal baseline exists (Greensmith et al., 2004; Wu et al., 2018), it is hard to estimate and is often replaced by the value function V^π(s) (Sutton & Barto, 2017; Mnih et al., 2016).
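The baseline-subtracted estimator can be sketched for a one-parameter Bernoulli policy (this toy policy and the function names are our illustration, not the paper's implementation):

```python
def grad_log_pi(theta, a):
    """d/dtheta log pi_theta(a) for a Bernoulli policy with pi(a=1) = theta."""
    return 1.0 / theta if a == 1 else -1.0 / (1.0 - theta)

def pg_estimate(trajectory, theta, baseline, gamma=0.99):
    """Single-trajectory policy gradient estimate with a baseline.

    trajectory: list of (state, action, reward) tuples.
    baseline:   callable state -> float, subtracted from the return.
    """
    # Per-step Monte Carlo returns R_t (backward recursion).
    returns, g = [], 0.0
    for (_, _, r) in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    # Sum of grad log pi(a_t|s_t) * (R_t - b(s_t)) over the trajectory.
    grad = 0.0
    for (s, a, _), R in zip(trajectory, returns):
        grad += grad_log_pi(theta, a) * (R - baseline(s))
    return grad
```

Subtracting `baseline(s)` changes only the variance of this estimate, not its expectation, which is the property the paper exploits.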

3 Motivating Example

Figure 2: Load balancing over two servers. (a) Job sizes follow a Pareto distribution and jobs arrive as a Poisson process; the RL agent observes the queue lengths and picks a server for an incoming job. (b) The input-dependent baseline (blue) results in a 50× lower policy gradient variance (left) and a 33% higher test reward (right) than the standard, state-dependent baseline (green). (c) The probability heatmap of picking server 1 shows that using the input-dependent baseline (left) yields a more precise policy than using the state-dependent baseline (right).

We use a simple load balancing example to illustrate the variance introduced by an exogenous input process. As shown in Figure 2a, jobs arrive over time and a load balancing agent sends them to one of two servers. The jobs arrive according to a Poisson process, and the job sizes follow a Pareto distribution. The two servers process jobs from their queues at identical rates. On each job arrival, the load balancer observes the state s, denoting the queue lengths at the two servers. It then takes an action a, sending the job to one of the servers. The goal of the load balancer is to minimize the average job completion time. The reward corresponding to this goal is r = −τ × n, where τ is the time elapsed since the last action and n is the total number of enqueued jobs.
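A minimal environment in this spirit could look as follows; the class name, parameter values, and the approximation of the job count by the number of busy servers are all our illustrative choices, not the paper's exact setup:

```python
import random

class LoadBalanceEnv:
    """Toy two-server load-balancing environment.

    Jobs arrive as a Poisson process (exponential inter-arrival times) with
    Pareto-distributed sizes; both servers drain work at unit rate.
    """
    def __init__(self, seed=0, arrival_rate=1.0, pareto_shape=1.5):
        self.rng = random.Random(seed)
        self.rate = arrival_rate
        self.shape = pareto_shape
        self.queues = [0.0, 0.0]  # remaining work at each server

    def observe(self):
        return tuple(self.queues)

    def step(self, action):
        """Send the incoming job to server `action`. The reward is -tau * n;
        here n is approximated by the number of busy servers, since this
        sketch tracks outstanding work rather than individual jobs."""
        job = self.rng.paretovariate(self.shape)   # Pareto job size
        self.queues[action] += job
        tau = self.rng.expovariate(self.rate)      # time to next arrival
        # Both servers process work at unit rate during tau.
        self.queues = [max(0.0, q - tau) for q in self.queues]
        reward = -tau * sum(1 for q in self.queues if q > 0)
        return self.observe(), reward
```

The stochastic arrival process (inter-arrival times and job sizes) is exactly the exogenous input the section discusses: it drives the queue dynamics independently of the agent's actions.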

In this example, the optimal policy is to send the job to the server with the shortest queue (Daley, 1987). However, we find that a standard policy gradient algorithm, A2C (Mnih et al., 2016), trained using a value function baseline struggles to learn this policy. The reason is that the stochastic sequence of job arrivals creates huge variance in the reward signal, making it difficult to distinguish between good and bad actions. Consider, for example, an action at the state shown in Figure 2a. If the arrival sequence following this action consists of a burst of large jobs (e.g., input sequence 1 in Figure 2a), the queues will build up, and the return will be poor compared to the value function baseline (average return from the state). On the other hand, a light stream of jobs (e.g., input sequence 2 in Figure 2a) will lead to short queues and a better-than-average return. Importantly, this difference in return has little to do with the action; it is a consequence of the random job arrival process.

We train two A2C agents (Mnih et al., 2016), one with the standard value function baseline and the other with an input-dependent baseline tailored for each specific instantiation of the job arrival process (details of this baseline in §4). Since the input-dependent baseline takes each input sequence into account explicitly, it reduces the variance of the policy gradient estimation much more effectively (Figure 2b, left). As a result, even in this simple example, only the policy learned with the input-dependent baseline comes close to the optimal (Figure 2b, right). Figure 2c visualizes the policies learned using the two baselines. The optimal policy (pick-shortest-queue) corresponds to a clear divide between the chosen servers at the diagonal.
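The variance effect described above can be demonstrated with a few lines of synthetic data (the numbers and the additive return model are ours, purely for illustration): when returns are dominated by an input-driven term, an input-aware baseline removes that term from the advantage, while a single state-dependent baseline cannot.

```python
import random, statistics

rng = random.Random(0)

# Each episode: return = small action effect (the signal we want to credit)
# plus a large input-driven term shared by the episode's input sequence.
action_effects = [rng.choice([-1.0, 1.0]) for _ in range(2000)]
input_effects = [rng.gauss(0.0, 10.0) for _ in range(2000)]
returns = [a + z for a, z in zip(action_effects, input_effects)]

# State-dependent baseline: one average over all episodes.
state_baseline = statistics.fmean(returns)
adv_state = [ret - state_baseline for ret in returns]

# Input-dependent baseline: subtracts each episode's input contribution.
adv_input = [ret - z for ret, z in zip(returns, input_effects)]

var_state = statistics.pvariance(adv_state)  # ~ Var(action) + Var(input)
var_input = statistics.pvariance(adv_input)  # ~ Var(action) only
```

With these numbers the input-aware advantages have roughly 100× lower variance, mirroring the gap in Figure 2b.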

In fact, the variance of the standard baseline can be arbitrarily worse than an input-dependent baseline: we refer the reader to Appendix A for an analytical example on a 1D grid world.

4 Reducing Variance for Input-Driven MDPs

We now formally define input-driven MDPs and derive variance-reducing baselines for policy gradient methods in environments with input processes.

Figure 3: Graphical model of input-driven MDPs.
Definition 1.

An input-driven MDP is defined by (𝒮, 𝒜, 𝒵, P_s, P_z, ρ, ρ_z, r, γ), where 𝒵 is a set of k-dimensional input values, P_s(s_{t+1} | s_t, a_t, z_t) is the transition kernel of the states, P_z(z_{t+1} | z_{0:t}) is the transition kernel of the input process, ρ_z(z_0) is the distribution of the initial input, r(s, a, z) is the reward function, and 𝒮, 𝒜, ρ, γ follow the standard definition in §2.

An input-driven MDP adds an input process, z = (z_0, z_1, …), to a standard MDP. In this setting, the next state s_{t+1} depends on (s_t, a_t, z_t). We seek to learn policies that maximize cumulative expected rewards. We focus on two cases, corresponding to the graphical models shown in Figure 3:
Case 1: z is a Markov process, and z_t is observed at time t. The action a_t can hence depend on both s_t and z_t.
Case 2: z is a general process (not necessarily Markov), and z_t is not observed at time t. The action a_t hence depends only on s_t.

In Appendix B, we prove that case 1 corresponds to a fully-observable MDP. This is evident from the graphical model in Figure 3a by considering (s_t, z_t) to be the ‘state’ of the MDP at time t. Case 2, on the other hand, corresponds to a partially-observed MDP (POMDP) if we define the state to contain both s_t and the input history z_{0:t}, but leave the input unobserved at time t (see Appendix B for details).
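A sketch of Definition 1 and the two observability cases might look like the following; the 1D random-walk dynamics and all names are ours, purely illustrative:

```python
import random

class InputDrivenMDP:
    """Toy input-driven MDP: next state depends on (s_t, a_t, z_t), while the
    input z follows its own process independent of states and actions.

    observe_input=True  -> Case 1 (z_t observed, fully observable MDP)
    observe_input=False -> Case 2 (z_t hidden, POMDP)
    """
    def __init__(self, seed=0, observe_input=True):
        self.rng = random.Random(seed)
        self.observe_input = observe_input
        self.s = 0.0
        self.z = self.rng.choice([-1.0, 1.0])  # z_0 ~ rho_z

    def observation(self):
        return (self.s, self.z) if self.observe_input else (self.s,)

    def step(self, a):
        prev = self.s
        self.s = self.s + a + self.z            # P_s(s' | s, a, z)
        reward = self.s - prev                  # r(s, a, z) = a + z here
        self.z = self.rng.choice([-1.0, 1.0])   # P_z: i.i.d. input in this toy
        return self.observation(), reward
```

Note that `step` never lets the action influence `z`: that independence is exactly the condition under which the paper's bias-free result (Theorem 1) holds.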

4.1 Variance Reduction

In input-driven MDPs, the standard input-agnostic baseline is ineffective at reducing variance, as shown by our motivating example (§3). We propose to use an input-dependent baseline of the form b(ω_t, z_{t:}) — a function of both the observation ω_t at time t (ω_t = (s_t, z_t) in case 1 and ω_t = s_t in case 2) and the input sequence z_{t:} from t onwards. An input-dependent baseline uses information that is not available to the policy. Specifically, the input sequence z_{t:} cannot be used when taking an action at time t, because it has not yet been fully realized at that point. However, in many applications, the input sequence is known at training time. In some cases, we know the entire input sequence upfront, e.g., when training in a simulator. In other situations, we can record the input sequence on the fly during training. Then, after a training episode, we can use the recorded values, including those that occurred after time t, to compute the baseline for each step t.
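Recording the input sequence during a rollout, so that z_{t:} is available after the episode, can be sketched as follows (the callback-style interface and all names are our illustration):

```python
import random

def rollout_with_input_log(step_fn, sample_input, policy, obs0, horizon):
    """Run one episode, logging the realized inputs z_t alongside the
    trajectory. After the episode, z_trace[t:] is exactly the future input
    sequence that an input-dependent baseline b(obs_t, z_{t:}) may condition
    on at training time, even though the policy never saw it.

    step_fn(obs, a, z) -> (next_obs, reward); sample_input() draws z_t.
    """
    traj, z_trace, obs = [], [], obs0
    for _ in range(horizon):
        z = sample_input()
        z_trace.append(z)          # log the input before it is consumed
        a = policy(obs)            # the policy sees only obs, never z_{t:}
        next_obs, r = step_fn(obs, a, z)
        traj.append((obs, a, r))
        obs = next_obs
    return traj, z_trace

# Example: 1D dynamics s' = s + a + z with reward a + z.
rng = random.Random(0)
traj, z_trace = rollout_with_input_log(
    step_fn=lambda s, a, z: (s + a + z, a + z),
    sample_input=lambda: rng.choice([-1, 1]),
    policy=lambda s: 1,
    obs0=0,
    horizon=5,
)
```

After the episode, `z_trace[t:]` can be passed to the baseline at step t without affecting what the policy observed during the rollout.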

We now analyze input-dependent baselines. Our main result is that input-dependent baselines are bias-free. We also derive the optimal input-dependent baseline for variance reduction. All the results hold for both cases in Figure 3. We first state two useful lemmas required for our analysis. The first lemma shows that under the input-driven MDP definition, the input sequence is conditionally independent of the action given the observation, while the second lemma states the policy gradient theorem for input-driven MDPs.

Lemma 1.

P(a_t | ω_t, z_{t:}) = P(a_t | ω_t), i.e., z_{t:} → ω_t → a_t forms a Markov chain, where ω_t denotes the observation at time t ((s_t, z_t) in case 1 and s_t in case 2).

Proof. See Appendix C.

Lemma 2.

For an input-driven MDP, the policy gradient theorem can be rewritten as

∇_θ E[∑_t γ^t r_t] = E_{(ω, z_{t:}) ∼ ρ^π, a ∼ π_θ(·|ω)} [∇_θ log π_θ(a | ω) Q^π(ω, a, z_{t:})],  (2)

where ρ^π denotes the discounted visitation frequency of the observation ω and input sequence z_{t:}, and Q^π(ω, a, z_{t:}) = E[∑_{t'≥t} γ^{t'−t} r_{t'} | ω_t = ω, a_t = a, z_{t:}].

Proof. See Appendix D.

Equation (2) generalizes the standard Policy Gradient Theorem in Equation (1). ρ^π can be thought of as a joint distribution over observations and input sequences. Q^π(ω, a, z_{t:}) is a “state-action-input” value function, i.e., the expected return when taking action a after observing ω, with input sequence z_{t:} from that step onwards. The key ingredient in the proof of Lemma 2 is the conditional independence of the input process and the action given the observation (Lemma 1).

Theorem 1.

An input-dependent baseline does not bias the policy gradient:

E_{(ω, z_{t:}) ∼ ρ^π, a ∼ π_θ(·|ω)} [∇_θ log π_θ(a | ω) (Q^π(ω, a, z_{t:}) − b(ω, z_{t:}))] = ∇_θ E[∑_t γ^t r_t].  (3)

Proof. Using Lemma 2, we need to show: E_{(ω, z_{t:}) ∼ ρ^π, a ∼ π_θ(·|ω)} [∇_θ log π_θ(a | ω) b(ω, z_{t:})] = 0. We have:

E_{a ∼ π_θ(·|ω)} [∇_θ log π_θ(a | ω)] = ∑_a π_θ(a | ω) ∇_θ log π_θ(a | ω) = ∑_a ∇_θ π_θ(a | ω) = ∇_θ ∑_a π_θ(a | ω) = ∇_θ 1 = 0.

Since b(ω, z_{t:}) does not depend on the action a, the theorem follows. ∎
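The key identity in the proof, E_a[∇_θ log π_θ(a | ω)] = 0, can be checked numerically for a two-action (Bernoulli) policy; this arithmetic check is our illustration:

```python
# Bernoulli policy: pi_theta(a = +1) = theta, pi_theta(a = -1) = 1 - theta.
theta = 0.3
b = 7.5  # any baseline value that does not depend on the action

# E_a[ d/dtheta log pi(a) ] * b, computed exactly over the two actions:
#   theta * (1/theta) + (1 - theta) * (-1/(1 - theta)) = 1 - 1 = 0.
expectation = (theta * (1.0 / theta)
               + (1.0 - theta) * (-1.0 / (1.0 - theta))) * b
# Subtracting any action-independent baseline therefore adds no bias.
```

The same cancellation happens for any θ and any baseline value, which is why the baseline may depend on the entire future input sequence without biasing the gradient.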

Input-dependent baselines are also bias-free for policy optimization methods such as TRPO (Schulman et al., 2015a), as we show in Appendix F. Next, we derive the optimal input-dependent baseline for variance reduction. As the gradient estimates are vectors, we use the trace of the covariance matrix as the minimization objective (Greensmith et al., 2004).

Theorem 2.

The input-dependent baseline that minimizes the variance in the policy gradient is given by

b*(ω, z_{t:}) = E_{a ∼ π_θ(·|ω)} [‖∇_θ log π_θ(a | ω)‖² Q^π(ω, a, z_{t:})] / E_{a ∼ π_θ(·|ω)} [‖∇_θ log π_θ(a | ω)‖²].  (4)
Proof. See Appendix E.

Operationally, for observation ω_t at each step t, the input-dependent baseline takes the form b*(ω_t, z_{t:}). In practice, we use a simpler alternative to Equation (4): b(ω_t, z_{t:}) = E[∑_{t'≥t} γ^{t'−t} r_{t'} | ω_t, z_{t:}]. This can be thought of as a value function that provides the expected return given observation ω_t and input sequence z_{t:} from that step onwards. We discuss how to estimate input-dependent baselines efficiently in §5.
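For input sequences that repeat during training, this conditional value function can be estimated by simply averaging per-step returns over rollouts that share an input sequence. A tabular sketch (our own construction, not the paper's network-based estimator):

```python
from collections import defaultdict

def mc_input_dependent_baseline(rollouts, gamma=0.99):
    """rollouts: list of (input_seq_id, rewards) pairs, where input_seq_id
    identifies a repeatable input sequence. Returns a dict mapping
    (input_seq_id, t) -> average return, i.e. a tabular Monte Carlo estimate
    of E[ sum_{t'>=t} gamma^(t'-t) r_t' | z_{t:} ]."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq_id, rewards in rollouts:
        # Per-step returns via the usual backward recursion.
        g, rets = 0.0, []
        for r in reversed(rewards):
            g = r + gamma * g
            rets.append(g)
        for t, ret in enumerate(reversed(rets)):
            sums[(seq_id, t)] += ret
            counts[(seq_id, t)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Averaging within an input sequence, rather than across all of them, is precisely what makes the estimate input-dependent.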

Remark.  Input-dependent baselines are generally applicable to reducing variance for policy gradient methods in input-driven environments. In this paper, we apply input-dependent baselines to A2C (§6.2), TRPO (§6.1) and PPO (Appendix L). Our technique is complementary and orthogonal to adversarial RL (e.g., RARL (Pinto et al., 2017)) and meta-policy adaptation (e.g., MB-MPO (Clavera et al., 2018b)) for environments with external disturbances. Adversarial RL improves policy robustness by co-training an “adversary” to generate a worst-case disturbance process. Meta-policy optimization aims for fast policy adaptation to handle model discrepancy between training and testing. By contrast, input-dependent baselines improve policy optimization itself in the presence of stochastic input processes. Our work primarily focuses on learning a single policy in input-driven environments, without policy adaptation. However, input-dependent baselines can be used as a general method to improve the policy optimization step in adversarial RL and meta-policy adaptation methods. For example, in Appendix M, we empirically show that if an adversary generates high-variance noise, RARL with a standard state-based baseline cannot train good controllers, but the input-dependent baseline helps improve the policy’s performance. Similarly, input-dependent baselines can improve meta-policy optimization in environments with stochastic disturbances, as we show in Appendix N.

5 Learning Input-Dependent Baselines Efficiently

Input-dependent baselines are functions of the sequence of input values. A natural approach to train such baselines is to use models that operate on sequences (e.g., LSTMs (Gers et al., 1999)). However, learning a sequential mapping in a high-dimensional space can be expensive (Bahdanau et al., 2014). We considered an LSTM approach, but ruled it out when initial experiments showed that it fails to provide significant policy improvement over the standard baseline in our environments (Appendix G).

Fortunately, we can learn the baseline much more efficiently in applications where we can repeat the same input sequence multiple times during training. Input-repeatability is feasible in many applications: it is straightforward when using simulators for training, and it is also feasible when training a real system with previously-collected input traces outside simulation. For example, training a robot in the presence of exogenous forces might repeatedly apply a set of pre-recorded time-series traces of these forces to the physical robot. We now present two approaches that exploit input-repeatability to learn input-dependent baselines efficiently.

Multi-value-network approach. A straightforward way to learn b(ω_t, z_{t:}) for different input instantiations is to train one value network for each particular instantiation of the input process. Specifically, in the training process, we first generate N input sequences and restrict training only to those sequences. To learn a separate baseline function for each input sequence, we use N value networks with independent parameters and a single policy network. During training, we randomly sample an input sequence, execute a rollout based on that sequence with the current policy, and use the (state, action, reward) data to train the corresponding value network and the policy network (details in Appendix I).
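The bookkeeping of this approach can be sketched with running averages standing in for the N independent value networks (the class and its learning-rate update are our simplification):

```python
class MultiValueBaseline:
    """One value estimator per pre-generated input sequence. A running
    average per (sequence, step) stands in for each value network; a real
    implementation would hold N networks with independent parameters."""
    def __init__(self, num_sequences):
        self.tables = [dict() for _ in range(num_sequences)]

    def update(self, seq_id, t, observed_return, lr=0.1):
        """Move the stored value for (seq_id, t) toward an observed return."""
        table = self.tables[seq_id]
        v = table.get(t, 0.0)
        table[t] = v + lr * (observed_return - v)

    def value(self, seq_id, t):
        """Baseline b(omega_t, z_{t:}) for sequence seq_id at step t."""
        return self.tables[seq_id].get(t, 0.0)
```

Because each table is keyed by the sequence identity, updates for one input instantiation never contaminate the baseline of another, which is the point of the multi-value-network design.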

Meta-learning approach. The multi-value-network approach does not scale if the task requires training over a large number of input instantiations to generalize. The number of inputs needed is environment-specific, and can depend on a variety of factors, such as the time horizon of the problem, the distribution of the input process, the relative magnitude of the variance due to the input process compared to other sources of randomness (e.g., actions). Ideally, we would like an approach that enables learning across many different input sequences. We present a method based on meta-learning to train with an unbounded number of input sequences. The idea is to use all (potentially infinitely many) input sequences to learn a “meta value network” model. Then, for each specific input sequence, we first customize the meta value network using a few example rollouts with that input sequence. We then compute the actual baseline values for training the policy network parameters, using the customized value network for the specific input sequence. Our implementation uses Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017).

Require: α, β: meta value network step size hyperparameters

1:  Initialize policy network parameters θ and meta-value-network parameters φ
2:  while not done do
3:     Generate a new input sequence z
4:     Sample 2k rollouts using policy π_θ and input sequence z
5:     Adapt φ with the first k rollouts: φ₁ ← φ − α∇_φ L(φ; first k rollouts)
6:     Estimate baseline values for the second k rollouts using adapted φ₁
7:     Adapt φ with the second k rollouts: φ₂ ← φ − α∇_φ L(φ; second k rollouts)
8:     Estimate baseline values for the first k rollouts using adapted φ₂
9:     Update policy θ with Equation (2), using the values from lines (6) and (8) as baselines
10:    Update meta value network: φ ← φ − β∇_φ [L(φ₁; second k rollouts) + L(φ₂; first k rollouts)]
11:  end while
Algorithm 1 Training a meta input-dependent baseline for policy-based methods.

The pseudocode in Algorithm 1 depicts the training algorithm. We follow the notation of MAML, denoting the loss in the value function on a rollout as L. We perform 2k rollouts with the same input sequence (lines 3 and 4); we use the first k rollouts to customize the meta value network for this instantiation of the input sequence (line 5), and then apply the customized value network to the states of the other k rollouts to compute the baseline for those rollouts (line 6); similarly, we swap the two groups of rollouts and repeat the same process (lines 7 and 8). We use different rollouts to adapt the meta value network and to compute the baseline, to avoid introducing extra bias into the baseline. Finally, we use the baseline values computed for each rollout to update the policy network parameters (line 9), and we apply the MAML (Finn et al., 2017) gradient step to update the meta value network model (line 10).
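The inner-adapt/cross-evaluate structure of Algorithm 1 can be sketched with a scalar "value network" and a first-order approximation of the meta gradient (ignoring second derivatives); both simplifications and all names are ours:

```python
def inner_adapt(phi, rollout_returns, alpha):
    """One gradient step on the value loss L(phi) = mean (phi - R)^2 over
    rollouts sharing an input sequence (lines 5 and 7 of Algorithm 1).
    phi is a scalar standing in for the value network parameters."""
    grad = sum(2.0 * (phi - R) for R in rollout_returns) / len(rollout_returns)
    return phi - alpha * grad

def meta_update(phi, group_a, group_b, alpha, beta):
    """Cross-adapt: fit on group_a, evaluate on group_b, and vice versa
    (lines 5-10 of Algorithm 1), then take a first-order meta step."""
    phi_a = inner_adapt(phi, group_a, alpha)  # adapted on the first k rollouts
    phi_b = inner_adapt(phi, group_b, alpha)  # adapted on the second k rollouts
    # Held-out losses drive the meta gradient (first-order MAML-style).
    meta_grad = (sum(2.0 * (phi_a - R) for R in group_b) / len(group_b)
                 + sum(2.0 * (phi_b - R) for R in group_a) / len(group_a))
    return phi - beta * meta_grad

# group_a / group_b are the two halves of the 2k rollouts for one input
# sequence; the adapted phi_a, phi_b would provide the baselines in line 9.
```

Evaluating each adapted parameter on the held-out half mirrors the paper's rationale for splitting the rollouts: the baseline for a rollout is never fit on that same rollout.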

6 Experiments

Our experiments demonstrate that input-dependent baselines provide consistent performance gains across multiple continuous-action MuJoCo simulated robotic locomotions and discrete-action environments in queuing systems and network control. We conduct experiments for both policy gradient methods and policy optimization methods (see Appendix K for details). The videos for our experiments are available at

6.1 Simulated Robotic Locomotion

Figure 4: In continuous-action MuJoCo environments, TRPO (Schulman et al., 2015a) with input-dependent baselines achieves 25%–3× better testing reward than with a standard state-dependent baseline. Learning curves are on 100 testing episodes with unseen input sequences; shaded area spans one standard deviation.

We use the MuJoCo physics engine (Todorov et al., 2012) in OpenAI Gym (Brockman et al., 2016) to evaluate input-dependent baselines for robotic control tasks with external disturbance. We extend the standard Walker2d, HalfCheetah and 7-DoF robotic arm environments, adding a different external input to each (Figure 1).

Walker2d with random wind (Figure 1c). We train a 2D walker with varying wind, which randomly drags the walker backward or forward with different force at each step. The wind vector changes randomly, i.e., the wind forms a random input process. We add a force sensor to the state to enable the agent to quickly adapt. The goal is for the walker to walk forward while keeping balance.

HalfCheetah on floating tiles with random buoyancy (Figure 1d). A half-cheetah runs over a series of tiles floating on water (Clavera et al., 2018a). Each tile has different damping and friction properties, which moves the half-cheetah up and down and changes its dynamics. This random buoyancy is the external input process; the cheetah needs to learn running forward over varying tiles.

7-DoF arm tracking moving target (Figure 1e).

We train a simulated robot arm to track a randomly moving target (a red ball). The robotic arm has seven degrees of freedom and the target is doing a random walk, which forms the external input process. The reward is the negative squared distance between the robot hand (blue square) and the target.

The Walker2d and 7-DoF arm environments correspond to the fully observable MDP case in Figure 3, i.e., the agent observes the input z_t at time t. The HalfCheetah environment is a POMDP, as the agent does not observe the buoyancy of the tiles. In Appendix H, we show results for the POMDP version of the Walker2d environment.

Results. We build 10 value networks and a meta-baseline using MAML, both on top of OpenAI’s TRPO implementation (Dhariwal et al., 2017). Figure 4 compares the performance of the different baselines on 100 unseen testing input sequences at each training checkpoint. These learning curves show that TRPO with a state-dependent baseline performs worst in all environments. With the input-dependent baseline, by contrast, performance in unseen testing environments improves by up to 3×, as the agent learns a policy robust against disturbances. For example, it learns to lean into headwind and quickly place its leg forward to counter the headwind; it learns to apply different force on tiles with different buoyancy to avoid falling over; and it learns to co-adjust multiple joints to keep track of the moving target. The meta-baseline eventually outperforms the 10 value networks, as it effectively learns from a large number of input sequences and hence generalizes better.

The input-dependent baseline technique applies generally on top of policy optimization methods. In Appendix L, we show a similar comparison with PPO (Schulman et al., 2017). Also, in Appendix M we show that adversarial RL (e.g., RARL (Pinto et al., 2017)) alone is not adequate to solve the high variance problem, and the input-dependent baseline helps improve the policy performance (Figure 9).

6.2 Discrete-Action Environments

Our discrete-action environments arise from widely-studied problems in computer systems research: load balancing and bitrate adaptation.¹ As these problems often lack closed-form optimal solutions (Grandl et al., 2016; Yin et al., 2015), hand-tuned heuristics abound. Recent work suggests that model-free reinforcement learning can achieve better performance than such human-engineered heuristics (Mao et al., 2016; Evans & Gao, 2016; Mao et al., 2017; Mirhoseini et al., 2017). We consider a load balancing environment (similar to the example in §3) and a bitrate adaptation environment in video streaming (Yin et al., 2015). The detailed setup of these environments is in Appendix J.

¹We considered Atari games, often used as benchmark discrete-action RL environments (Mnih et al., 2015). However, Atari games lack an exogenous input process: a random seed perturbs the games’ initial state, but it does not affect the environmental changes (e.g., in “Seaquest”, the ships always come in a fixed pattern).

Results. We extend OpenAI’s A2C implementation (Dhariwal et al., 2017) for our baselines. The learning curves in Figure 5 illustrate that directly applying A2C with a standard value network as the baseline results in unstable test reward and underperforms the traditional heuristic in both environments. Our input-dependent baselines reduce the variance and improve test reward by 25–33%, outperforming the heuristic. The meta-baseline performs the best in all environments.

Figure 5: In environments with discrete action spaces, A2C (Mnih et al., 2016) with input-dependent baselines outperforms the best heuristic and achieves 25–33% better testing reward than vanilla A2C (Mnih et al., 2016). Learning curves are on 100 test episodes with unseen input sequences; shaded area spans one standard deviation.

7 Related Work

Policy gradient methods compute unbiased gradient estimates, but can experience a large variance (Sutton & Barto, 2017; Weaver & Tao, 2001). Reducing variance for policy-based methods using a baseline has been shown to be effective (Williams, 1992; Sutton & Barto, 2017; Weaver & Tao, 2001; Greensmith et al., 2004; Mnih et al., 2016). Much of this work focuses on variance reduction in a general MDP setting, rather than variance reduction for MDPs with specific stochastic structures. Wu et al. (2018)’s techniques for MDPs with multi-variate independent actions are closest to our work. Their state-action-dependent baseline improves training efficiency and model performance on high-dimensional control tasks by explicitly factoring out, for each action, the effect due to other actions. By contrast, our work exploits the structure of state transitions instead of stochastic policy.

Recent work has also investigated the bias-variance tradeoff in policy gradient methods. Schulman et al. (2015b) replace the Monte Carlo return with a λ-weighted return estimation (similar to TD(λ) with value function bootstrap (Tesauro, 1995)), improving performance in high-dimensional control tasks. Other recent approaches use more general control variates to construct variants of policy gradient algorithms. Tucker et al. (2018) compare these methods, both analytically on a linear-quadratic-Gaussian task and empirically on complex robotic control tasks. Analysis of control variates for policy gradient methods is a well-studied topic, and extending such analyses (e.g., Greensmith et al. (2004)) to the input-driven MDP setting could be interesting future work.

In other contexts, prior work has proposed new RL training methodologies for environments with disturbances. Clavera et al. (2018b) adapt the policy to different patterns of disturbance by training the RL agent with meta-learning. RARL (Pinto et al., 2017) improves policy robustness by co-training an adversary to generate a worst-case noise process. Our work is orthogonal and complementary to these approaches, as we seek to improve policy optimization itself in the presence of inputs like disturbances.

8 Conclusion

We introduced input-driven Markov Decision Processes in which stochastic input processes influence state dynamics and rewards. In this setting, we demonstrated that an input-dependent baseline can significantly reduce variance for policy gradient methods, improving training stability and the quality of learned policies. Our work provides an important ingredient for using RL successfully in a variety of domains, including queuing networks and computer systems, where an input workload is a fundamental aspect of the system, as well as domains where the input process is more implicit, like robotics control with disturbances or random obstacles.

We showed that meta-learning provides an efficient way to learn input-dependent baselines for applications where input sequences can be repeated during training. Investigating efficient architectures for input-dependent baselines for cases where the input process cannot be repeated in training is an interesting direction for future work.


We thank Ignasi Clavera for sharing the HalfCheetah environment, Jonas Rothfuss for the comments on meta-policy optimization and the anonymous ICLR reviewers for their feedback. This work was funded in part by NSF grants CNS-1751009, CNS-1617702, a Google Faculty Research Award, an AWS Machine Learning Research Award, a Cisco Research Center Award, an Alfred P. Sloan Research Fellowship and the sponsors of MIT Data Systems and AI Lab.


Appendix A Illustration of Variance Reduction in 1D Grid World

Consider a walker in a 1D grid world, where the state s_t at time t denotes the position of the walker, and action a_t ∈ {−1, 1} denotes the intent to either move forward (a_t = 1) or backward (a_t = −1). Additionally, let z_t ∈ {−1, 1} be a uniform i.i.d. "exogenous input" that perturbs the position of the walker. For an action a_t and input z_t, the state of the walker in the next step is given by s_{t+1} = s_t + a_t + z_t. The objective of the game is to move the walker forward; hence, the reward is r_t = a_t + z_t at each time step. γ ∈ (0, 1) is a discount factor.

While the optimal policy for this game is clear (a_t = 1 for all t), consider learning such a policy using policy gradient. For simplicity, let the policy be parametrized as π_θ(a_t = 1) = θ, π_θ(a_t = −1) = 1 − θ, with θ initialized to 1/2 at the start of training. In the following, we evaluate the variance of the policy gradient estimate at the start of training under (i) the standard value-function baseline, and (ii) a baseline that is the expected cumulative reward conditioned on all future inputs.

Variance under standard baseline. The value function in this case is identically 0 at all states. This is because E[∑_{t'=t}^∞ γ^{t'−t} r_{t'} | s_t] = 0, since both actions and inputs are i.i.d. with mean 0. Also note that ∇_θ log π_θ(a_t = 1) = 1/θ = 2 and ∇_θ log π_θ(a_t = −1) = −1/(1 − θ) = −2 at θ = 1/2; hence ∇_θ log π_θ(a_t) = 2a_t. Therefore the variance of the policy gradient estimate can be written as

Var[ ∑_{t=0}^∞ 2a_t ∑_{t'=t}^∞ γ^{t'−t} (a_{t'} + z_{t'}) ].  (5)

Variance under input-dependent baseline. Now, consider an alternative "input-dependent" baseline defined as b(s_t, z_{t:∞}) = E[∑_{t'=t}^∞ γ^{t'−t} r_{t'} | s_t, z_{t:∞}]. Intuitively, this baseline captures the average reward incurred when experiencing a particular fixed input sequence. We refer the reader to §4 for a formal discussion and analysis of input-dependent baselines. Evaluating the baseline, we get b(s_t, z_{t:∞}) = ∑_{t'=t}^∞ γ^{t'−t} z_{t'}, since E[a_{t'}] = 0. Therefore the variance of the policy gradient estimate in this case is

Var[ ∑_{t=0}^∞ 2a_t ∑_{t'=t}^∞ γ^{t'−t} a_{t'} ].  (6)

Reduction in variance. To analyze the variance reduction between the two cases (Equations (5) and (6)), write X = ∑_{t=0}^∞ 2a_t ∑_{t'=t}^∞ γ^{t'−t} a_{t'} and Y = ∑_{t=0}^∞ 2a_t ∑_{t'=t}^∞ γ^{t'−t} z_{t'}, so that Equation (5) is Var[X + Y] and Equation (6) is Var[X]. We note that

Var[X + Y] = Var[X] + Var[Y] + 2 Cov(X, Y).  (7)

This follows because

Cov(X, Y) = E[XY] − E[X] E[Y] = E[X · E[Y | a_0, a_1, …]] − E[X] · 0 = 0,

since the inputs z_{t'} are independent of the actions and have zero mean. Therefore the covariance term in Equation (7) is 0. Hence the variance reduction, Equation (8), can be written as

Var[X + Y] − Var[X] = Var[Y] = Var[ ∑_{t=0}^∞ 2a_t ∑_{t'=t}^∞ γ^{t'−t} z_{t'} ].  (8)

Thus the input-dependent baseline reduces the variance of the policy gradient estimate by an amount proportional to the variance of the external input. In this toy example, we have chosen z_t to be binary-valued, but more generally the variance of z_t could be arbitrarily large and might be a dominating factor of the overall variance in the policy gradient estimation.
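This variance gap can also be checked empirically. The following Monte Carlo simulation is our own sketch (the horizon T = 20, γ = 0.9, and episode count are simulation choices, not part of the example above, which uses an infinite horizon):

```python
import numpy as np

# Monte Carlo comparison of gradient-estimate variance in the 1D grid world.
rng = np.random.default_rng(0)
T, gamma, n_episodes = 20, 0.9, 5000
disc = gamma ** np.arange(T)  # discount weights gamma^0, ..., gamma^(T-1)

g_state, g_input = [], []
for _ in range(n_episodes):
    a = rng.choice([-1, 1], size=T)  # actions sampled from pi_theta at theta = 1/2
    z = rng.choice([-1, 1], size=T)  # uniform i.i.d. exogenous inputs
    # Return-to-go: G_t = sum_{t' >= t} gamma^(t'-t) * (a_t' + z_t')
    G = np.array([disc[: T - t] @ (a[t:] + z[t:]) for t in range(T)])
    # Input-dependent baseline: b_t = sum_{t' >= t} gamma^(t'-t) * z_t'
    b = np.array([disc[: T - t] @ z[t:] for t in range(T)])
    # grad_theta log pi_theta(a_t) = 2 * a_t at theta = 1/2
    g_state.append(np.sum(2 * a * G))        # state-dependent baseline (V = 0)
    g_input.append(np.sum(2 * a * (G - b)))  # input-dependent baseline

var_state, var_input = np.var(g_state), np.var(g_input)
# Both estimators are unbiased; var_input comes out substantially smaller.
```

With these parameters the empirical variance under the input-dependent baseline is well below that under the (zero) value-function baseline, consistent with the Var[z]-proportional gap derived above.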

Appendix B Markov properties of input-driven decision processes

Proposition 1.

An input-driven decision process satisfying the conditions of case 1 in Figure 3 is a fully observable MDP, with state s_t = (o_t, z_t) and action a_t.


Proposition 2.

An input-driven decision process satisfying the conditions of case 2 in Figure 3, with state s_t = (o_t, z_{0:t}) and action a_t, is a fully observable MDP. If only o_t is observed at time t, it is a partially observable MDP (POMDP).


Therefore, the decision process with state (o_t, z_{0:t}) is a fully observable MDP. If only o_t is observed, the decision process is a POMDP, since the z_{0:t} component of the state is not observed. ∎

Appendix C Proof of Lemma 1


From the definition of an input-driven MDP (Definition 1), we have


Notice that π_θ(a_t | s_t, z_{t:∞}) = π_θ(a_t | s_t) in both the MDP and POMDP cases in Figure 3, since the policy cannot depend on future inputs. By marginalizing over z_{t:∞} on both sides, we obtain the result:


Appendix D Proof of Lemma 2


Expanding the Policy Gradient Theorem (Sutton & Barto, 2017), we have


where the last step uses Lemma 1. Using the definition of the input-conditioned value function Q^π(s_t, a_t, z_{t:∞}), we obtain:


Appendix E Proof of Theorem 2


Let g_t denote ∇_θ log π_θ(a_t | s_t). For any input-dependent baseline b(s_t, z_{t:∞}), the variance of the policy gradient estimate is given by

Notice that the baseline is only involved in the last term, which is quadratic in b(s_t, z_{t:∞}) with a positive second-order coefficient. To minimize the variance, we set the baseline to the minimizer of this quadratic, i.e., b*(s_t, z_{t:∞}) = E_{a_t}[g_t² Q^π(s_t, a_t, z_{t:∞})] / E_{a_t}[g_t²], and hence the result follows. ∎
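As a numerical sanity check of this quadratic minimization (our own sketch on a hypothetical one-step problem, not part of the paper), we can scan constant baselines for a Bernoulli policy and verify that the variance of the score-function estimate is minimized at E[g² Q] / E[g²]:

```python
import numpy as np

# One-step toy problem: Bernoulli policy pi_theta(a=1) = theta over two actions.
# All numbers (theta, Q-values) are hypothetical.
theta = 0.6
probs = np.array([1 - theta, theta])             # P(a=0), P(a=1)
score = np.array([-1 / (1 - theta), 1 / theta])  # g(a) = d/dtheta log pi_theta(a)
Q = np.array([-1.0, 2.0])                        # action values

def grad_variance(b):
    """Variance over a ~ pi of the estimate g(a) * (Q(a) - b)."""
    g = score * (Q - b)
    return np.sum(probs * g**2) - np.sum(probs * g) ** 2

# Quadratic minimizer: b* = E[g^2 Q] / E[g^2]
b_star = np.sum(probs * score**2 * Q) / np.sum(probs * score**2)

# A brute-force grid scan over baselines recovers the same minimizer.
grid = np.linspace(-3, 3, 601)
b_grid = grid[np.argmin([grad_variance(b) for b in grid])]
```

Note that E[g] is independent of b (E[score] = 0 makes the baseline bias-free), so only the quadratic E[g²(Q − b)²] term varies with the baseline.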

Appendix F Input-Dependent Baseline for TRPO

We show that input-dependent baselines are bias-free for Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a).

Preliminaries. Stochastic gradient descent using Equation (1) does not guarantee consistent policy improvement in complex control problems. TRPO is an alternative approach that offers monotonic policy improvement, and it yields a practical algorithm with better sample efficiency and performance. TRPO maximizes a surrogate objective, subject to a KL divergence constraint:

maximize_θ  E_{s ∼ ρ_{θ_old}, a ∼ π_{θ_old}} [ (π_θ(a | s) / π_{θ_old}(a | s)) Q_{θ_old}(s, a) ]

subject to  E_{s ∼ ρ_{θ_old}} [ D_KL( π_{θ_old}(· | s) ∥ π_θ(· | s) ) ] ≤ δ,  (14)

in which δ serves as a step size for the policy update. Using a baseline in the TRPO objective, i.e., replacing Q_{θ_old}(s, a) with Q_{θ_old}(s, a) − b(s), empirically improves policy performance (Schulman et al., 2015b).

Similar to Theorem 2, we generalize TRPO to input-driven environments, with ρ_{θ_old}(s, z) denoting the discounted visitation frequency of the observation s and input sequence z, and Q_{θ_old}(s, a, z) the corresponding input-conditioned Q-function. The TRPO objective becomes E_{(s, z) ∼ ρ_{θ_old}, a ∼ π_{θ_old}} [ (π_θ(a | s) / π_{θ_old}(a | s)) Q_{θ_old}(s, a, z) ], and the constraint is E_{(s, z) ∼ ρ_{θ_old}} [ D_KL( π_{θ_old}(· | s) ∥ π_θ(· | s) ) ] ≤ δ.
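Intuitively, the baseline is harmless here because, for any fixed (s, z), E_{a ∼ π_{θ_old}}[(π_θ(a | s) / π_{θ_old}(a | s)) b(s, z)] = b(s, z) ∑_a π_θ(a | s) = b(s, z), a θ-independent constant. A minimal numerical illustration of this invariance (our own sketch with made-up numbers, not from the paper):

```python
import numpy as np

# One (observation, input) pair with four discrete actions; numbers are made up.
rng = np.random.default_rng(0)
pi_old = np.array([0.1, 0.2, 0.3, 0.4])  # behavior policy pi_{theta_old}(a | s)
Q = rng.normal(size=4)                   # stand-in for Q_{theta_old}(s, a, z)
b = 1.7                                  # an input-dependent baseline value b(s, z)

def surrogate(pi_new, values):
    """TRPO surrogate: E_{a ~ pi_old}[(pi_new / pi_old) * values]."""
    return np.sum(pi_old * (pi_new / pi_old) * values)

# For every candidate policy, subtracting the baseline shifts the surrogate
# by exactly b(s, z), a theta-independent constant, so the argmax is unchanged.
for _ in range(5):
    logits = rng.normal(size=4)
    pi_new = np.exp(logits) / np.exp(logits).sum()
    gap = surrogate(pi_new, Q) - surrogate(pi_new, Q - b)
    assert abs(gap - b) < 1e-9
```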

Theorem 3.

An input-dependent baseline does not change the optimal solution of the optimization problem in TRPO; that is, replacing Q_{θ_old}(s, a, z) with Q_{θ_old}(s, a, z) − b(s, z) in the surrogate objective leaves the maximizer unchanged.