Deep reinforcement learning (RL) has emerged as a powerful approach for sequential decision-making problems, achieving impressive results in domains such as game playing (Mnih et al., 2015; Silver et al., 2017) and robotics (Levine et al., 2016; Schulman et al., 2015a; Lillicrap et al., 2015). This paper concerns RL in input-driven environments. Informally, input-driven environments have dynamics that are partially dictated by an exogenous, stochastic input process. Queuing systems (Kleinrock, 1976; Kelly, 2011) are an example; their dynamics are governed by not only the decisions made within the system (e.g., scheduling, load balancing) but also the arrival process that brings work (e.g., jobs, customers, packets) into the system. Input-driven environments also arise naturally in many other domains: network control and optimization (Winstein & Balakrishnan, 2013; Mao et al., 2017), robotics control with stochastic disturbances (Pinto et al., 2017), locomotion in environments with complex terrains and obstacles (Heess et al., 2017), vehicular traffic control (Belletti et al., 2018; Wu et al., 2017), tracking moving targets, and more (see Figure 1).
We focus on model-free policy gradient RL algorithms (Williams, 1992; Mnih et al., 2016; Schulman et al., 2015a), which have been widely adopted and benchmarked for a variety of RL tasks (Duan et al., 2016; Wu & Tian, 2017). A key challenge for these methods is the high variance in the gradient estimates: such variance increases sample complexity and can impede effective learning (Schulman et al., 2015b; Mnih et al., 2016). A standard approach to reduce variance is to subtract a “baseline” from the total reward (or “return”) when estimating the policy gradient (Weaver & Tao, 2001). The most common choice of baseline is the value function — the expected return starting from the state.
Our main insight is that a state-dependent baseline — such as the value function — is a poor choice in input-driven environments, whose state dynamics and rewards are partially dictated by the input process. In such environments, comparing the return to the value function baseline may provide limited information about the quality of actions. The return obtained after taking a good action may be poor (lower than the baseline) if the input sequence following the action drives the system to unfavorable states; similarly, a bad action may yield a high return under an advantageous input sequence. Intuitively, a good baseline for estimating the policy gradient should take the specific instance of the input process — the sequence of input values — into account. We call such a baseline an input-dependent baseline; it is a function of both the state and the entire future input sequence.
We formally define input-driven Markov decision processes, and we prove that an input-dependent baseline does not introduce bias in standard policy gradient algorithms such as Advantage Actor Critic (A2C) (Mnih et al., 2016) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a), provided that the input process is independent of the states and actions. We derive the optimal input-dependent baseline and a simpler one that is easier to work with in practice; the latter takes the form of a conditional value function — the expected return given the state and the future input sequence.
Input-dependent baselines are harder to learn than their state-dependent counterparts; they are high-dimensional functions of the sequence of input values. To learn input-dependent baselines efficiently, we propose a simple approach based on meta-learning (Finn et al., 2017; Vilalta & Drissi, 2002). The idea is to learn a “meta baseline” that can be specialized to a baseline for a specific input instantiation using a small number of training episodes with that input. This approach applies to applications in which an input sequence can be repeated during training, e.g., applications that use simulations or experiments with previously-collected input traces for training (McGough et al., 2017).
We compare our input-dependent baseline to the standard value function baseline for the five tasks illustrated in Figure 1. These tasks are derived from queuing systems (load balancing heterogeneous servers (Harchol-Balter & Vesilo, 2010)), computer networks (bitrate adaptation for video streaming (Mao et al., 2017)), and variants of standard continuous control RL benchmarks in the MuJoCo physics simulator (Todorov et al., 2012). We adapted three widely-used MuJoCo benchmarks (Duan et al., 2016; Clavera et al., 2018a; Heess et al., 2017) to add a stochastic input element that makes these tasks significantly more challenging. For example, we replaced the static target in a 7-DoF robotic arm target-reaching task with a randomly-moving target that the robot aims to track over time. Our results show that input-dependent baselines consistently provide improved training stability and better eventual policies. Input-dependent baselines are applicable to a variety of policy gradient methods, including A2C, TRPO, PPO, robust adversarial RL methods such as RARL (Pinto et al., 2017), and meta-policy optimization such as MB-MPO (Clavera et al., 2018b). Video demonstrations of our experiments are available at https://sites.google.com/view/input-dependent-baseline/.
We consider a discrete-time Markov decision process (MDP), defined by $(\mathcal{S}, \mathcal{A}, P, \rho_0, r, \gamma)$, where $\mathcal{S} \subseteq \mathbb{R}^n$ is a set of $n$-dimensional states, $\mathcal{A} \subseteq \mathbb{R}^m$ is a set of $m$-dimensional actions, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition probability distribution, $\rho_0: \mathcal{S} \to [0, 1]$ is the distribution over initial states, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. We denote a stochastic policy as $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, which aims to optimize the expected return $\eta(\pi) = \mathbb{E}_\tau\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, s_1, \ldots)$ is the trajectory following $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. We use $V^\pi(s) = \mathbb{E}_\tau\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$ to define the value function, and $Q^\pi(s, a) = \mathbb{E}_\tau\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$ to define the state-action value function. For any sequence $(x_0, x_1, \ldots)$, we use $x$ to denote the entire sequence and $x_{i:j}$ to denote $(x_i, x_{i+1}, \ldots, x_j)$.
Policy gradient methods.
Policy gradient methods estimate the gradient of expected return with respect to the policy parameters (Sutton et al., 2000; Kakade, 2002; Gu et al., 2017). To train a policy $\pi_\theta$ parameterized by $\theta$, the Policy Gradient Theorem (Sutton et al., 2000) states that
$$\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right], \tag{1}$$
where $\rho^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s)$ denotes the discounted state visitation frequency. Practical algorithms often use the undiscounted state visitation frequency (i.e., $\gamma = 1$ in $\rho^\pi$), which can make the estimation slightly biased (Thomas, 2014).
Estimating the policy gradient using Monte Carlo estimation for the $Q$ function suffers from high variance (Mnih et al., 2016). To reduce variance, an appropriately chosen baseline $b(s)$ can be subtracted from the Q-estimate without introducing bias (Greensmith et al., 2004). The policy gradient estimation with a baseline in Equation (1) becomes $\mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\left(Q^{\pi_\theta}(s, a) - b(s)\right)\right]$. While an optimal baseline exists (Greensmith et al., 2004; Wu et al., 2018), it is hard to estimate and is often replaced by the value function $V^{\pi_\theta}(s)$ (Sutton & Barto, 2017; Mnih et al., 2016).
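To make the variance-reduction effect concrete, here is a minimal numpy sketch (ours, not from the paper) for a toy softmax policy with assumed, fixed Q-values: subtracting the value-function baseline leaves the gradient estimate unbiased while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.1, 0.3])      # softmax policy logits (toy values)
q = np.array([1.0, 5.0, 3.0])           # assumed Q-values, one per action

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pi = softmax(theta)
baseline = pi @ q                       # value function V = E_{a~pi}[Q(a)]

# Score function for a softmax policy: d log pi(a) / d theta = e_a - pi.
n = 200_000
actions = rng.choice(3, size=n, p=pi)
score = np.eye(3)[actions] - pi
no_base = score * q[actions][:, None]
with_base = score * (q[actions] - baseline)[:, None]

# Unbiased: both estimators agree in expectation...
assert np.allclose(no_base.mean(0), with_base.mean(0), atol=0.05)
# ...but the baseline reduces the estimator's variance.
assert with_base.var(0).sum() < no_base.var(0).sum()
```

The same cancellation argument as in the text applies: the baseline term has zero mean under the policy, so only the variance changes.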
3 Motivating Example
We use a simple load balancing example to illustrate the variance introduced by an exogenous input process. As shown in Figure 2a, jobs arrive over time and a load balancing agent sends them to one of two servers. The jobs arrive according to a Poisson process, and the job sizes follow a Pareto distribution. The two servers process jobs from their queues at identical rates. On each job arrival, the load balancer observes state $s = (q_1, q_2)$, denoting the queue length at the two servers. It then takes an action $a \in \{1, 2\}$, sending the job to one of the servers. The goal of the load balancer is to minimize the average job completion time. The reward corresponding to this goal is $r = -t \times j$, where $t$ is the time elapsed since the last action and $j$ is the total number of enqueued jobs.
In this example, the optimal policy is to send the job to the server with the shortest queue (Daley, 1987). However, we find that a standard policy gradient algorithm, A2C (Mnih et al., 2016), trained using a value function baseline struggles to learn this policy. The reason is that the stochastic sequence of job arrivals creates huge variance in the reward signal, making it difficult to distinguish between good and bad actions. Consider, for example, an action at the state shown in Figure 2a. If the arrival sequence following this action consists of a burst of large jobs (e.g., input sequence 1 in Figure 2a), the queues will build up, and the return will be poor compared to the value function baseline (average return from the state). On the other hand, a light stream of jobs (e.g., input sequence 2 in Figure 2a) will lead to short queues and a better-than-average return. Importantly, this difference in return has little to do with the action; it is a consequence of the random job arrival process.
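This effect is easy to reproduce in a minimal simulator of the example (our sketch with assumed arrival rate and Pareto shape, not the paper's environment): with the policy held fixed, the arrival process alone spreads returns widely, yet shortest-queue still wins once arrivals are matched seed-by-seed.

```python
import numpy as np

def rollout(policy, seed, n_jobs=200):
    """Two-server load balancer; returns the (negative) total completion time."""
    arr = np.random.default_rng(seed)           # exogenous input process
    act = np.random.default_rng(seed + 10_000)  # policy's own randomness
    free_at = np.zeros(2)                       # when each server next idles
    t, total = 0.0, 0.0
    for _ in range(n_jobs):
        t += arr.exponential(1.0)               # Poisson arrivals
        size = arr.pareto(2.5) + 1.0            # Pareto job sizes (assumed shape)
        q = np.maximum(free_at - t, 0.0)        # outstanding work per server
        a = policy(q, act)
        free_at[a] = t + q[a] + size
        total -= q[a] + size                    # job completion-time penalty
    return total

shortest = lambda q, rng: int(np.argmin(q))
random_pick = lambda q, rng: int(rng.integers(2))

# Same policy, different input sequences: the arrival process alone spreads
# returns widely -- variance a state-only baseline cannot account for.
returns = [rollout(shortest, s) for s in range(100)]
print(f"return spread across inputs: {np.std(returns):.1f}")

# With arrivals matched seed-by-seed, shortest-queue beats random routing.
gain = np.mean([rollout(shortest, s) - rollout(random_pick, s)
                for s in range(100)])
assert gain > 0
```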
We train two A2C agents (Mnih et al., 2016), one with the standard value function baseline and the other with an input-dependent baseline tailored for each specific instantiation of the job arrival process (details of this baseline in §4). Since the input-dependent baseline takes each input sequence into account explicitly, it reduces the variance of the policy gradient estimation much more effectively (Figure 2b, left). As a result, even in this simple example, only the policy learned with the input-dependent baseline comes close to the optimal (Figure 2b, right). Figure 2c visualizes the policies learned using the two baselines. The optimal policy (pick-shortest-queue) corresponds to a clear divide between the chosen servers at the diagonal.
In fact, the variance of the standard baseline can be arbitrarily worse than an input-dependent baseline: we refer the reader to Appendix A for an analytical example on a 1D grid world.
4 Reducing Variance for Input-Driven MDPs
We now formally define input-driven MDPs and derive variance-reducing baselines for policy gradient methods in environments with input processes.
Definition 1. An input-driven MDP is defined by $(\mathcal{S}, \mathcal{A}, \mathcal{Z}, P_s, P_z, \rho_z, r, \gamma)$, where $\mathcal{Z} \subseteq \mathbb{R}^k$ is a set of $k$-dimensional input values, $P_s: \mathcal{S} \times \mathcal{A} \times \mathcal{Z} \times \mathcal{S} \to [0,1]$ is the transition kernel of the states, $P_z(z_{t+1} \mid z_{0:t})$ is the transition kernel of the input process, $\rho_z: \mathcal{Z} \to [0,1]$ is the distribution of the initial input, $r: \mathcal{S} \times \mathcal{A} \times \mathcal{Z} \to \mathbb{R}$ is the reward function, and $\mathcal{S}$, $\mathcal{A}$, $\rho_0$, $\gamma$ follow the standard definition in §2.
An input-driven MDP adds an input process, $z = (z_0, z_1, \ldots)$, to a standard MDP. In this setting, the next state $s_{t+1}$ depends on $(s_t, a_t, z_t)$. We seek to learn policies that maximize cumulative expected rewards.
We focus on two cases, corresponding to the graphical models shown in Figure 3:
Case 1: $z_t$ is a Markov process, and $z_t$ is observed at time $t$. The action $a_t$ can hence depend on both $s_t$ and $z_t$.
Case 2: $z_t$ is a general process (not necessarily Markov), and $z_t$ is observed at time $t+1$. The action $a_t$ hence depends only on $s_t$.
In Appendix B, we prove that case 1 corresponds to a fully-observable MDP. This is evident from the graphical model in Figure 3a by considering $\omega_t = (s_t, z_t)$ to be the ‘state’ of the MDP at time $t$. Case 2, on the other hand, corresponds to a partially-observed MDP (POMDP) if we define the state to contain both $s_t$ and $z_t$, but leave $z_t$ unobserved at time $t$ (see Appendix B for details).
4.1 Variance Reduction
In input-driven MDPs, the standard input-agnostic baseline is ineffective at reducing variance, as shown by our motivating example (§3). We propose to use an input-dependent baseline of the form $b(\omega_t, z_{t:})$ — a function of both the observation $\omega_t$ at time $t$ and the input sequence $z_{t:}$ from $t$ onwards. An input-dependent baseline uses information that is not available to the policy. Specifically, the input sequence $z_{t:}$ cannot be used when taking an action at time $t$, because the inputs after time $t$ have not yet occurred. However, in many applications, the input sequence is known at training time. In some cases, we know the entire input sequence upfront, e.g., when training in a simulator. In other situations, we can record the input sequence on the fly during training. Then, after a training episode, we can use the recorded values, including those that occurred after time $t$, to compute the baseline for each step $t$.
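A hypothetical sketch of this record-then-reuse pattern (our illustration, not the paper's implementation): replay one recorded input trace across several rollouts, then average their per-step discounted returns into an empirical baseline that conditions on the recorded future inputs (for simplicity, this toy version ignores the observation $\omega_t$).

```python
import numpy as np

GAMMA = 0.99  # assumed discount factor

def discounted_returns(rewards, gamma=GAMMA):
    """G_t = sum_{t' >= t} gamma^(t'-t) * r_t', computed right to left."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return np.array(out[::-1])

def input_dependent_baseline(reward_traces):
    """Empirical baseline for one recorded input trace.

    `reward_traces` holds the rewards of several rollouts that replayed
    the SAME input sequence; averaging their per-step returns estimates
    the expected return given the future inputs from each step onwards.
    """
    returns = np.stack([discounted_returns(r) for r in reward_traces])
    return returns.mean(axis=0)     # one baseline value per step t

# Toy rewards from two rollouts under one recorded input trace.
traces = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]
b = input_dependent_baseline(traces)
print(b)    # one baseline entry per step
```

Section 5 describes how the paper learns such baselines with value networks rather than raw averaging.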
We now analyze input-dependent baselines. Our main result is that input-dependent baselines are bias-free. We also derive the optimal input-dependent baseline for variance reduction. All the results hold for both cases in Figure 3. We first state two useful lemmas required for our analysis. The first lemma shows that under the input-driven MDP definition, the input sequence is conditionally independent of the action given the observation , while the second lemma states the policy gradient theorem for input-driven MDPs.
Lemma 1. $\Pr(z_{t:} \mid \omega_t, a_t) = \Pr(z_{t:} \mid \omega_t)$, i.e., $z_{t:} \to \omega_t \to a_t$ forms a Markov chain.
Proof. See Appendix C.
Lemma 2. For an input-driven MDP, the policy gradient theorem can be rewritten as
$$\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{(\omega, z_{t:}) \sim \rho^\pi,\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid \omega)\, Q^{\pi_\theta}(\omega, a, z_{t:})\right], \tag{2}$$
where $\rho^\pi(\omega, z_{t:})$ denotes the discounted visitation frequency of the observation $\omega$ and input sequence $z_{t:}$, and $Q^{\pi_\theta}(\omega_t, a_t, z_{t:}) = \mathbb{E}\!\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \,\middle|\, \omega_t, a_t, z_{t:}\right]$.
Proof. See Appendix D.
$\rho^\pi(\omega, z_{t:})$ can be thought of as a joint distribution over observations and input sequences. $Q^{\pi_\theta}(\omega, a, z_{t:})$ is a “state-action-input” value function, i.e., the expected return when taking action $a$ after observing $\omega$, with input sequence $z_{t:}$ from that step onwards. The key ingredient in the proof of Lemma 2 is the conditional independence of the input process and the action given the observation (Lemma 1).
Theorem 1. An input-dependent baseline $b(\omega_t, z_{t:})$ does not bias the policy gradient.
Proof. Using Lemma 2, we need to show: $\mathbb{E}_{(\omega, z_{t:}) \sim \rho^\pi,\, a \sim \pi_\theta(\cdot \mid \omega)}\!\left[\nabla_\theta \log \pi_\theta(a \mid \omega)\, b(\omega, z_{t:})\right] = 0$. We have:
$$\mathbb{E}_{(\omega, z_{t:}) \sim \rho^\pi}\!\Big[\mathbb{E}_{a \sim \pi_\theta(\cdot \mid \omega)}\!\big[\nabla_\theta \log \pi_\theta(a \mid \omega)\, b(\omega, z_{t:})\big]\Big] = \mathbb{E}_{(\omega, z_{t:}) \sim \rho^\pi}\!\Big[b(\omega, z_{t:}) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid \omega)\, \mathrm{d}a\Big], \tag{3}$$
where the baseline $b(\omega, z_{t:})$ factors out of the inner expectation because, by Lemma 1, the action is conditionally independent of the input sequence given the observation. Since $\int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid \omega)\, \mathrm{d}a = \nabla_\theta \int_{\mathcal{A}} \pi_\theta(a \mid \omega)\, \mathrm{d}a = \nabla_\theta 1 = 0$, the theorem follows. ∎
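The zero-expectation step at the heart of this proof can be checked numerically; the sketch below (ours) verifies that, for a toy softmax policy, the score function multiplied by any fixed baseline value averages to zero over actions.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = rng.normal(size=4)              # toy softmax policy parameters
e = np.exp(theta - theta.max())
pi = e / e.sum()

def grad_log_pi(a):
    g = -pi.copy()
    g[a] += 1.0                         # d log softmax(theta)[a] / d theta
    return g

# For ANY baseline value b (here an arbitrary "input-dependent" scalar),
# E_{a~pi}[ grad log pi(a) * b ] = b * sum_a grad pi(a) = b * grad(1) = 0.
b = 17.3
expectation = sum(pi[a] * grad_log_pi(a) * b for a in range(4))
assert np.allclose(expectation, 0.0)
```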
Next, we derive the optimal input-dependent baseline for variance reduction. As the gradient estimates are vectors, we use the trace of the covariance matrix as the minimization objective (Greensmith et al., 2004).
Theorem 2. The input-dependent baseline that minimizes variance in policy gradient is given by
$$b^*(\omega_t, z_{t:}) = \frac{\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid \omega_t)}\!\left[\left\|\nabla_\theta \log \pi_\theta(a_t \mid \omega_t)\right\|^2 Q^{\pi_\theta}(\omega_t, a_t, z_{t:})\right]}{\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid \omega_t)}\!\left[\left\|\nabla_\theta \log \pi_\theta(a_t \mid \omega_t)\right\|^2\right]}. \tag{4}$$
Proof. See Appendix E.
Operationally, for observation $\omega_t$ at each step $t$, the input-dependent baseline takes the form $b(\omega_t, z_{t:})$. In practice, we use a simpler alternative to Equation (4): $b(\omega_t, z_{t:}) = \mathbb{E}\!\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \,\middle|\, \omega_t, z_{t:}\right]$. This can be thought of as a value function $V^{\pi_\theta}(\omega_t, z_{t:})$ that provides the expected return given observation $\omega_t$ and input sequence $z_{t:}$ from that step onwards. We discuss how to estimate input-dependent baselines efficiently in §5.
Remark. Input-dependent baselines are generally applicable to reducing variance for policy gradient methods in input-driven environments. In this paper, we apply input-dependent baselines to A2C (§6.2), TRPO (§6.1) and PPO (Appendix L). Our technique is complementary and orthogonal to adversarial RL (e.g., RARL (Pinto et al., 2017)) and meta-policy adaptation (e.g., MB-MPO (Clavera et al., 2018b)) for environments with external disturbances. Adversarial RL improves policy robustness by co-training an “adversary” to generate a worst-case disturbance process. Meta-policy optimization aims for fast policy adaptation to handle model discrepancy between training and testing. By contrast, input-dependent baselines improve policy optimization itself in the presence of stochastic input processes. Our work primarily focuses on learning a single policy in input-driven environments, without policy adaptation. However, input-dependent baselines can be used as a general method to improve the policy optimization step in adversarial RL and meta-policy adaptation methods. For example, in Appendix M, we empirically show that if an adversary generates high-variance noise, RARL with a standard state-based baseline cannot train good controllers, but the input-dependent baseline helps improve the policy’s performance. Similarly, input-dependent baselines can improve meta-policy optimization in environments with stochastic disturbances, as we show in Appendix N.
5 Learning Input-Dependent Baselines Efficiently
Input-dependent baselines are functions of the sequence of input values. A natural approach to train such baselines is to use models that operate on sequences (e.g., LSTMs (Gers et al., 1999)). However, learning a sequential mapping in a high-dimensional space can be expensive (Bahdanau et al., 2014). We considered an LSTM approach, but ruled it out when initial experiments showed that it fails to provide significant policy improvement over the standard baseline in our environments (Appendix G).
Fortunately, we can learn the baseline much more efficiently in applications where we can repeat the same input sequence multiple times during training. Input-repeatability is feasible in many applications: it is straightforward when using simulators for training, and also feasible when training a real system with previously-collected input traces outside simulation. For example, to train a robot in the presence of exogenous forces, one might apply a set of time-series force traces repeatedly to the physical robot. We now present two approaches that exploit input-repeatability to learn input-dependent baselines efficiently.
Multi-value-network approach. A straightforward way to learn $V^{\pi_\theta}(\omega_t, z_{t:})$ for different input instantiations is to train one value network for each particular instantiation of the input process. Specifically, in the training process, we first generate $N$ input sequences and restrict training only to those sequences. To learn a separate baseline function for each input sequence, we use $N$ value networks with independent parameters $(\phi_1, \ldots, \phi_N)$, and a single policy network with parameter $\theta$. During training, we randomly sample an input sequence $z^{(i)}$, execute a rollout based on $z^{(i)}$ with the current policy $\pi_\theta$, and use the (state, action, reward) data to train the value network parameter $\phi_i$ and the policy network parameter $\theta$ (details in Appendix I).
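A minimal sketch of the multi-value-network idea (our toy version with linear value functions in place of neural networks; the per-input return offsets are invented for illustration): each input sequence gets its own value parameters, so every baseline absorbs its input's contribution to the return.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 4, 3                                   # input sequences, obs dimension
phi = [np.zeros(D + 1) for _ in range(N)]     # one linear "value net" per input

def features(obs):
    return np.hstack([obs, np.ones((len(obs), 1))])   # append a bias term

def fit_value(i, obs, returns, lr=0.1, steps=500):
    """Regress value function i on returns from ITS input sequence only."""
    x = features(obs)
    for _ in range(steps):
        phi[i] += lr * x.T @ (returns - x @ phi[i]) / len(returns)

# Toy data: each input sequence shifts returns by a different offset,
# which one shared state-only baseline could not absorb.
obs = rng.normal(size=(64, D))
w_true = np.array([1.0, -2.0, 0.5])
for i in range(N):
    fit_value(i, obs, obs @ w_true + 10.0 * i)

# Each per-input baseline centers the advantages for its own input.
for i in range(N):
    adv = (obs @ w_true + 10.0 * i) - features(obs) @ phi[i]
    assert np.abs(adv).max() < 0.1
```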
Meta-learning approach. The multi-value-network approach does not scale if the task requires training over a large number of input instantiations to generalize. The number of inputs needed is environment-specific, and can depend on a variety of factors, such as the time horizon of the problem, the distribution of the input process, and the relative magnitude of the variance due to the input process compared to other sources of randomness (e.g., actions). Ideally, we would like an approach that enables learning across many different input sequences. We present a method based on meta-learning to train with an unbounded number of input sequences. The idea is to use all (potentially infinitely many) input sequences to learn a “meta value network” model. Then, for each specific input sequence, we first customize the meta value network using a few example rollouts with that input sequence. We then compute the actual baseline values for training the policy network parameters, using the customized value network for the specific input sequence. Our implementation uses Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017).
The pseudocode in Algorithm 1 depicts the training algorithm. We follow the notation of MAML, denoting the loss of the value function $V_\phi$ on a rollout $\tau$ as $\mathcal{L}_\tau(V_\phi)$. We perform $2k$ rollouts with the same input sequence (lines 3 and 4); we use the first $k$ rollouts to customize the meta value network for this instantiation of the input process (line 5), and then apply the customized value network to the states of the other $k$ rollouts to compute the baseline for those rollouts (line 6); similarly, we swap the two groups of rollouts and repeat the same process (lines 7 and 8). We use different rollouts to adapt the meta value network and to compute the baseline in order to avoid introducing extra bias into the baseline. Finally, we use the baseline values computed for each rollout to update the policy network parameters (line 9), and we apply the MAML (Finn et al., 2017) gradient step to update the meta value network model (line 10).
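The adapt-then-update loop can be sketched with first-order MAML on toy linear value functions (our simplification; the paper uses full MAML with neural networks, and the linear "tasks" below merely stand in for input sequences):

```python
import numpy as np

rng = np.random.default_rng(3)
D = 4
meta_phi = np.zeros(D)                  # meta value-function parameters
ALPHA, BETA = 0.5, 0.05                 # inner / outer learning rates

def loss_grad(phi, x, y):
    """Gradient of the mean-squared value-prediction error."""
    return 2.0 * x.T @ (x @ phi - y) / len(y)

def adapt(phi, x, y):
    """Inner step: specialize the meta network to one input sequence."""
    return phi - ALPHA * loss_grad(phi, x, y)

# Meta-training, first-order MAML: adapt on one group of rollouts for a
# task, then update meta parameters on a held-out group of the same task
# (mirroring Algorithm 1's two rollout groups).
for _ in range(300):
    task_w = rng.normal(size=D)         # stands in for one input sequence
    x1, x2 = rng.normal(size=(64, D)), rng.normal(size=(64, D))
    phi_i = adapt(meta_phi, x1, x1 @ task_w)
    meta_phi -= BETA * loss_grad(phi_i, x2, x2 @ task_w)

# At use time: a few rollouts with a NEW input sequence specialize the
# meta baseline before it scores that sequence's remaining rollouts.
w_new = rng.normal(size=D)
x_new = rng.normal(size=(128, D))
phi_new = adapt(meta_phi, x_new, x_new @ w_new)
err_adapted = np.mean((x_new @ phi_new - x_new @ w_new) ** 2)
err_meta = np.mean((x_new @ meta_phi - x_new @ w_new) ** 2)
assert err_adapted < err_meta           # adaptation pays off
```

The first-order variant drops MAML's second-order term; it is used here only to keep the sketch short.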
6 Experiments
Our experiments demonstrate that input-dependent baselines provide consistent performance gains across multiple continuous-action MuJoCo robotic locomotion tasks and discrete-action environments in queuing systems and network control. We conduct experiments with both policy gradient methods and policy optimization methods (see Appendix K for details). Videos of our experiments are available at https://sites.google.com/view/input-dependent-baseline/.
6.1 Simulated Robotic Locomotion
We use the MuJoCo physics engine (Todorov et al., 2012) in OpenAI Gym (Brockman et al., 2016) to evaluate input-dependent baselines for robotic control tasks with external disturbance. We extend the standard Walker2d, HalfCheetah and 7-DoF robotic arm environments, adding a different external input to each (Figure 1).
Walker2d with random wind (Figure 1c). We train a 2D walker with varying wind, which randomly drags the walker backward or forward with different force at each step. The wind vector changes randomly, i.e., the wind forms a random input process. We add a force sensor to the state to enable the agent to quickly adapt. The goal is for the walker to walk forward while keeping balance.
HalfCheetah on floating tiles with random buoyancy (Figure 1d). A half-cheetah runs over a series of tiles floating on water (Clavera et al., 2018a). Each tile has different damping and friction properties, which move the half-cheetah up and down and change its dynamics. This random buoyancy is the external input process; the cheetah needs to learn to run forward over the varying tiles.
7-DoF arm tracking moving target (Figure 1e).
We train a simulated robot arm to track a randomly moving target (a red ball). The robotic arm has seven degrees of freedom, and the target performs a random walk, which forms the external input process. The reward is the negative squared distance between the robot hand (blue square) and the target.
The Walker2d and 7-DoF arm environments correspond to the fully observable MDP case in Figure 3, i.e., the agent observes the input $z_t$ at time $t$. The HalfCheetah environment is a POMDP, as the agent does not observe the buoyancy of the tiles. In Appendix H, we show results for the POMDP version of the Walker2d environment.
Results. We build a 10-value-network baseline and a meta-baseline using MAML, both on top of OpenAI’s TRPO implementation (Dhariwal et al., 2017). Figure 4 compares the baselines on 100 unseen test input sequences at each training checkpoint. These learning curves show that TRPO with a state-dependent baseline performs worst in all environments. With the input-dependent baseline, by contrast, performance in unseen test environments improves by up to 3×, as the agent learns a policy robust against disturbances. For example, it learns to lean into headwind and quickly place its leg forward to counter it; it learns to apply different force on tiles with different buoyancy to avoid falling over; and it learns to co-adjust multiple joints to keep tracking the moving target. The meta-baseline eventually outperforms the 10 value networks, as it effectively learns from a large number of input processes and hence generalizes better.
The input-dependent baseline technique applies generally on top of policy optimization methods. In Appendix L, we show a similar comparison with PPO (Schulman et al., 2017). Also, in Appendix M we show that adversarial RL (e.g., RARL (Pinto et al., 2017)) alone is not adequate to solve the high variance problem, and the input-dependent baseline helps improve the policy performance (Figure 9).
6.2 Discrete-Action Environments
Our discrete-action environments arise from widely-studied problems in computer systems research: load balancing and bitrate adaptation.¹ As these problems often lack closed-form optimal solutions (Grandl et al., 2016; Yin et al., 2015), hand-tuned heuristics abound. Recent work suggests that model-free reinforcement learning can achieve better performance than such human-engineered heuristics (Mao et al., 2016; Evans & Gao, 2016; Mao et al., 2017; Mirhoseini et al., 2017). We consider a load balancing environment (similar to the example in §3) and a bitrate adaptation environment in video streaming (Yin et al., 2015). The detailed setup of these environments is in Appendix J.

¹We considered Atari games, often used as benchmark discrete-action RL environments (Mnih et al., 2015). However, Atari games lack an exogenous input process: a random seed perturbs the games’ initial state, but it does not affect the environmental changes (e.g., in “Seaquest”, the ships always come in a fixed pattern).
Results. We extend OpenAI’s A2C implementation (Dhariwal et al., 2017) for our baselines. The learning curves in Figure 5 illustrate that directly applying A2C with a standard value network as the baseline results in unstable test reward and underperforms the traditional heuristic in both environments. Our input-dependent baselines reduce the variance and improve test reward by 25–33%, outperforming the heuristic. The meta-baseline performs the best in all environments.
7 Related Work
Policy gradient methods compute unbiased gradient estimates but can suffer from large variance (Sutton & Barto, 2017; Weaver & Tao, 2001). Reducing variance for policy-based methods using a baseline has been shown to be effective (Williams, 1992; Sutton & Barto, 2017; Weaver & Tao, 2001; Greensmith et al., 2004; Mnih et al., 2016). Much of this work focuses on variance reduction in a general MDP setting, rather than variance reduction for MDPs with specific stochastic structures. Wu et al. (2018)’s techniques for MDPs with multi-variate independent actions are closest to our work. Their state-action-dependent baseline improves training efficiency and model performance on high-dimensional control tasks by explicitly factoring out, for each action, the effect due to other actions. By contrast, our work exploits the structure of the state transitions rather than that of the stochastic policy.
Recent work has also investigated the bias-variance tradeoff in policy gradient methods. Schulman et al. (2015b) replace the Monte Carlo return with a $\lambda$-weighted return estimation (similar to TD($\lambda$) with value function bootstrapping (Tesauro, 1995)), improving performance in high-dimensional control tasks. Other recent approaches use more general control variates to construct variants of policy gradient algorithms. Tucker et al. (2018) compare these recent methods, both analytically on a linear-quadratic-Gaussian task and empirically on complex robotic control tasks. Analysis of control variates for policy gradient methods is a well-studied topic, and extending such analyses (e.g., Greensmith et al. (2004)) to the input-driven MDP setting could be interesting future work.
In other contexts, prior work has proposed new RL training methodologies for environments with disturbances. Clavera et al. (2018b) adapt the policy to different patterns of disturbance by training the RL agent with meta-learning. RARL (Pinto et al., 2017) improves policy robustness by co-training an adversary to generate a worst-case noise process. Our work is orthogonal and complementary to these approaches, as we seek to improve policy optimization itself in the presence of inputs such as disturbances.
8 Conclusion
We introduced input-driven Markov decision processes in which stochastic input processes influence state dynamics and rewards. In this setting, we demonstrated that an input-dependent baseline can significantly reduce variance for policy gradient methods, improving training stability and the quality of learned policies. Our work provides an important ingredient for using RL successfully in a variety of domains, including queuing networks and computer systems, where an input workload is a fundamental aspect of the system, as well as domains where the input process is more implicit, like robotics control with disturbances or random obstacles.
We showed that meta-learning provides an efficient way to learn input-dependent baselines for applications where input sequences can be repeated during training. Investigating efficient architectures for input-dependent baselines for cases where the input process cannot be repeated in training is an interesting direction for future work.
Acknowledgments
We thank Ignasi Clavera for sharing the HalfCheetah environment, Jonas Rothfuss for the comments on meta-policy optimization and the anonymous ICLR reviewers for their feedback. This work was funded in part by NSF grants CNS-1751009, CNS-1617702, a Google Faculty Research Award, an AWS Machine Learning Research Award, a Cisco Research Center Award, an Alfred P. Sloan Research Fellowship and the sponsors of MIT Data Systems and AI Lab.
References

- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Belletti et al. (2018) Francois Belletti, Daniel Haziza, Gabriel Gomes, and Alexandre M. Bayen. Expert level control of ramp metering based on multi-task deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 19(4):1198–1207, 2018.
- Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. https://gym.openai.com/docs/, 2016.
- Chilimbi et al. (2014) Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 571–582, October 2014.
- Clavera et al. (2018a) Ignasi Clavera, Anusha Nagabandi, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt: Meta-learning for model-based control. arXiv preprint arXiv:1803.11347, 2018a.
- Clavera et al. (2018b) Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018b.
- Daley (1987) D.J. Daley. Certain optimality properties of the first-come first-served discipline for G/G/s queues. Stochastic Processes and their Applications, 25:301–308, 1987.
- DASH Industry Forum (2016) DASH Industry Forum. Reference Client 2.4.0. http://mediapm.edgesuite.net/dash/public/nightly/samples/dash-if-reference-player/index.html, 2016.
- Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.
- Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
- Evans & Gao (2016) Richard Evans and Jim Gao. DeepMind AI Reduces Google Data Centre Cooling Bill by 40%. https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/, 2016.
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135, 2017.
- Gers et al. (1999) Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. In Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN), 1999.
- Grandl et al. (2016) Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. Graphene: Packing and dependency-aware scheduling for data-parallel clusters. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 81–97, 2016.
- Greensmith et al. (2004) Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
- Gu et al. (2017) Shixiang Gu, Timothy P. Lillicrap, Richard E Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3849–3858, 2017.
- Harchol-Balter & Vesilo (2010) Mor Harchol-Balter and Rein Vesilo. To balance or unbalance load in size-interval task allocation. Probability in the Engineering and Informational Sciences, 24(2):219–244, April 2010.
- Harrison et al. (2017) James Harrison, Animesh Garg, Boris Ivanovic, Yuke Zhu, Silvio Savarese, Li Fei-Fei, and Marco Pavone. Adapt: zero-shot adaptive policy transfer for stochastic dynamical systems. arXiv preprint arXiv:1707.04674, 2017.
- Heess et al. (2017) Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
- Kakade (2002) Sham M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.
- Kelly (2011) Frank P. Kelly. Reversibility and stochastic networks. Cambridge University Press, 2011.
- Kleinrock (1976) Leonard Kleinrock. Queueing systems, volume 2: Computer applications, volume 66. Wiley, New York, 1976.
- Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334–1373, January 2016.
- Lillicrap et al. (2015) Timothy P. Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Mao et al. (2016) Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks (HotNets), November 2016.
- Mao et al. (2017) Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with pensieve. In Proceedings of the ACM SIGCOMM 2017 Conference, 2017.
- McGough et al. (2017) Stephen McGough, Noura Al Moubayed, and Matthew Forshaw. Using machine learning in trace-driven energy-aware simulations of high-throughput computing systems. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering (ICPE), pp. 55–60. ACM, 2017.
- Mirhoseini et al. (2017) Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. Device placement optimization with reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2017.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
- Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1928–1937, 2016.
- Nair & Hinton (2010) Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807–814, 2010.
- Pinto et al. (2017) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 2817–2826, 2017.
- Riiser et al. (2013) Haakon Riiser, Paul Vigmostad, Carsten Griwodz, and Pål Halvorsen. Commute Path Bandwidth Traces from 3G Networks: Analysis and Applications. In Proceedings of the 4th ACM Multimedia Systems Conference (MMSys), 2013.
- Schulman et al. (2015a) John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015a.
- Schulman et al. (2015b) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- Sutton & Barto (2017) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, Second Edition. MIT Press, 2017.
- Sutton et al. (2000) Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063. 2000.
- Tesauro (1995) Gerald Tesauro. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
- Thomas (2014) Philip Thomas. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pp. 441–448, 2014.
- Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.
- Tucker et al. (2018) George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E. Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. arXiv preprint arXiv:1802.10031, 2018.
- Vilalta & Drissi (2002) Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.
- Weaver & Tao (2001) Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pp. 538–545. Morgan Kaufmann Publishers Inc., 2001.
- Williams (1992) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Winstein & Balakrishnan (2013) Keith Winstein and Hari Balakrishnan. TCP ex machina: Computer-generated congestion control. In ACM SIGCOMM Computer Communication Review, volume 43, pp. 123–134. ACM, 2013.
- Wu et al. (2017) Cathy Wu, Aboudy Kreidieh, Kanaad Parvate, Eugene Vinitsky, and Alexandre M Bayen. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465, 2017.
- Wu et al. (2018) Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
- Wu & Tian (2017) Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In International Conference on Learning Representations (ICLR), 2017.
- Yin et al. (2015) Xiaoqi Yin, Abhishek Jindal, Vyas Sekar, and Bruno Sinopoli. A Control-Theoretic Approach for Dynamic Adaptive Video Streaming over HTTP. In Proceedings of the 2015 ACM SIGCOMM Conference, 2015.
Appendix A Illustration of Variance Reduction in 1D Grid World
Consider a walker in a 1D grid world, where the state $s_t$ at time $t$ denotes the position of the walker, and the action $a_t \in \{-1, +1\}$ denotes the intent to either move forward or backward. Additionally, let $z_t \in \{-1, +1\}$ be a uniform i.i.d. “exogenous input” that perturbs the position of the walker. For an action $a_t$ and input $z_t$, the state of the walker in the next step is given by $s_{t+1} = s_t + a_t + z_t$. The objective of the game is to move the walker forward; hence, the reward is $r_t = a_t + z_t$ at each time step. $\gamma \in (0, 1)$ is a discount factor.
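These dynamics can be sketched in a few lines of Python; the `step` helper and its interface are our own illustration, not code from the paper:

```python
import random

def step(s, a):
    """One transition of the 1D grid world: the walker's position moves
    by its action a in {-1, +1} plus an i.i.d. uniform exogenous input
    z in {-1, +1}; the reward is the forward displacement a + z."""
    z = random.choice([-1, +1])   # exogenous input, independent of the action
    s_next = s + a + z            # state transition
    r = a + z                     # reward rewards forward movement
    return s_next, r, z

random.seed(0)
print(step(0, +1))  # either (2, 2, 1) or (0, 0, -1), depending on the draw
```

Always choosing $a_t = +1$ yields an expected reward of $+1$ per step, yet the realized reward still fluctuates with the input $z_t$.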
While the optimal policy for this game is clear ($a_t = +1$ for all $t$), consider learning such a policy using policy gradient. For simplicity, let the policy be parametrized as $\pi_\theta(a_t = +1) = \theta$, with $\theta$ initialized to $1/2$ at the start of training. In the following, we evaluate the variance of the policy gradient estimate at the start of training ($\theta = 1/2$) under (i) the standard value function baseline, and (ii) a baseline that is the expected cumulative reward conditioned on all future inputs.
Variance under standard baseline. The value function in this case is identically $0$ at all states. This is because $V^{\pi_\theta}(s_t) = \mathbb{E}\big[\sum_{t'=t}^{\infty} \gamma^{t'-t} (a_{t'} + z_{t'})\big] = 0$, since both actions and inputs are i.i.d. with mean $0$ at $\theta = 1/2$. Also note that $\nabla_\theta \log \pi_\theta(a_t = +1) = 1/\theta$ and $\nabla_\theta \log \pi_\theta(a_t = -1) = -1/(1-\theta)$; hence $\nabla_\theta \log \pi_\theta(a_t)\big|_{\theta = 1/2} = 2a_t$. Therefore the variance of the policy gradient estimate can be written as
$$\mathrm{Var}\left(\sum_{t=0}^{\infty} \gamma^t \, 2a_t \sum_{t'=t}^{\infty} \gamma^{t'-t} (a_{t'} + z_{t'})\right).$$
Variance under input-dependent baseline. Now, consider an alternative “input-dependent” baseline defined as $b(s_t, z_{t:\infty}) = \mathbb{E}\big[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \,\big|\, z_{t:\infty}\big]$. Intuitively, this baseline captures the average reward incurred when experiencing a particular fixed input sequence. We refer the reader to §4 for a formal discussion and analysis of input-dependent baselines. Evaluating the baseline, we get $b(s_t, z_{t:\infty}) = \sum_{t'=t}^{\infty} \gamma^{t'-t} z_{t'}$, since the actions have mean $0$ at $\theta = 1/2$. Therefore the variance of the policy gradient estimate in this case is
$$\mathrm{Var}\left(\sum_{t=0}^{\infty} \gamma^t \, 2a_t \sum_{t'=t}^{\infty} \gamma^{t'-t} a_{t'}\right) = \mathrm{Var}\left(\sum_{t=0}^{\infty} \gamma^t \, 2a_t \sum_{t'=t}^{\infty} \gamma^{t'-t} (a_{t'} + z_{t'})\right) - \frac{4}{(1-\gamma^2)^2}.$$
This follows because the term removed by the baseline, $\sum_{t=0}^{\infty} \gamma^t \, 2a_t \sum_{t'=t}^{\infty} \gamma^{t'-t} z_{t'}$, has mean $0$ and is uncorrelated with the remaining term (every cross term contains a single factor $z_{t'}$, which is independent of the actions and has zero mean), and its variance is
$$\sum_{t=0}^{\infty} \gamma^{2t} \cdot 4\,\mathbb{E}\left[\left(\sum_{t'=t}^{\infty} \gamma^{t'-t} z_{t'}\right)^{2}\right] = \sum_{t=0}^{\infty} \gamma^{2t} \cdot \frac{4}{1-\gamma^2} = \frac{4}{(1-\gamma^2)^2},$$
using $\mathbb{E}[a_t a_s] = \mathbb{1}\{t = s\}$ and $\mathrm{Var}(z_t) = 1$.
Thus the input-dependent baseline reduces the variance of the policy gradient estimate by an amount proportional to the variance of the external input. In this toy example, we have chosen $z_t$ to be binary-valued, but more generally the variance of $z_t$ could be arbitrarily large and may be the dominating factor in the overall variance of the policy gradient estimate.
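A quick Monte Carlo check of this calculation (a minimal sketch; the horizon, sample count, and function names are our own choices):

```python
import random

GAMMA, T, N = 0.9, 20, 5000  # discount, horizon, and sample count (our choices)

def rollout():
    """Sample one episode at theta = 1/2: actions and inputs are
    independent uniform {-1, +1} sequences."""
    a = [random.choice([-1, 1]) for _ in range(T)]
    z = [random.choice([-1, 1]) for _ in range(T)]
    return a, z

def pg_estimate(a, z, input_dependent):
    """Single-trajectory policy gradient estimate. At theta = 1/2 the
    score is grad_theta log pi(a_t) = 2 a_t; the input-dependent
    baseline subtracts the discounted sum of future inputs."""
    g, ret, zsum = 0.0, 0.0, 0.0
    for t in reversed(range(T)):
        ret = a[t] + z[t] + GAMMA * ret   # discounted return from step t
        zsum = z[t] + GAMMA * zsum        # discounted future inputs from step t
        baseline = zsum if input_dependent else 0.0
        g += GAMMA ** t * 2 * a[t] * (ret - baseline)
    return g

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(0)
trajs = [rollout() for _ in range(N)]
v_standard = variance([pg_estimate(a, z, False) for a, z in trajs])
v_input = variance([pg_estimate(a, z, True) for a, z in trajs])
print(v_standard, v_input)  # the input-dependent baseline is markedly lower
```

With these settings the measured gap is close to the $4/(1-\gamma^2)^2$ predicted above (slightly smaller because of the finite horizon).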
Appendix B Markov Properties of Input-Driven Decision Processes
An input-driven decision process satisfying the conditions of case 1 in Figure 3 is a fully observable MDP, with state $(s_t, z_t)$ and action $a_t$.
An input-driven decision process satisfying the conditions of case 2 in Figure 3, with state $(s_t, z_{t:\infty})$ and action $a_t$, is a fully observable MDP. If only $s_t$ is observed at time $t$, it is a partially observable MDP (POMDP).
Therefore, the decision process with state $(s_t, z_{t:\infty})$ is a fully observable MDP. If only $s_t$ is observed, the decision process is a POMDP, since the $z_{t:\infty}$ component of the state is not observed. ∎
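Case 1 can be sketched concretely: folding a Markov input process into the state makes the augmented chain Markov by construction. The helper names and the `p_stay` parameter below are our own, and the dynamics reuse the grid world of Appendix A:

```python
import random

def make_markov_input(p_stay=0.8):
    """Hypothetical two-state Markov input process (case 1): the next
    input z' depends only on the current input z."""
    def next_z(z):
        return z if random.random() < p_stay else -z
    return next_z

def augmented_step(state, a, next_z):
    """One transition of the augmented chain. Because the input is part
    of the state, the distribution of (s', z') depends only on the
    current augmented state (s, z) and the action a, so the augmented
    process is a fully observable MDP by construction."""
    s, z = state
    z_next = next_z(z)
    return (s + a + z_next, z_next)

random.seed(3)
next_z = make_markov_input()
state = (0, 1)
for _ in range(5):
    state = augmented_step(state, +1, next_z)
print(state)
```

An agent that observes only the position $s_t$ would instead face a POMDP, since the input component of the state is hidden.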
Appendix C Proof of Lemma 1
Appendix D Proof of Lemma 2
Appendix E Proof of Theorem 2
Let $g(\tau)$ denote $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ and $R(\tau)$ the return of trajectory $\tau$. For any input-dependent baseline $b(z)$, the variance of the policy gradient estimate is given by
$$\mathrm{Var} = \mathbb{E}_{z}\Big[\mathbb{E}_{\tau \mid z}\big[g(\tau)^2 R(\tau)^2\big] - 2\,\mathbb{E}_{\tau \mid z}\big[g(\tau)^2 R(\tau)\big]\, b(z) + \mathbb{E}_{\tau \mid z}\big[g(\tau)^2\big]\, b(z)^2\Big] - \big(\mathbb{E}_{\tau}\big[g(\tau) R(\tau)\big]\big)^2,$$
where the mean term is unaffected by the baseline since $\mathbb{E}_{\tau \mid z}[g(\tau)] = 0$. Notice that the baseline is only involved in a quadratic form in $b(z)$, whose second-order coefficient $\mathbb{E}_{\tau \mid z}[g(\tau)^2]$ is positive. To minimize the variance, we set the baseline to the minimizer of this quadratic for each input sequence $z$, i.e.,
$$b^{*}(z) = \frac{\mathbb{E}_{\tau \mid z}\big[g(\tau)^2 R(\tau)\big]}{\mathbb{E}_{\tau \mid z}\big[g(\tau)^2\big]},$$
and hence the result follows. ∎
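The minimizer of a weighted squared error of this form can be estimated from samples; the sketch below uses our own hypothetical (score, return) samples for one fixed input sequence:

```python
import random

def optimal_baseline(samples):
    """Empirical minimizer over b of E[(g * (R - b))^2]:
    b* = E[g^2 R] / E[g^2], the input-conditioned optimal baseline."""
    num = sum(g * g * R for g, R in samples)
    den = sum(g * g for g, R in samples)
    return num / den

def weighted_sq_error(samples, b):
    """Empirical value of E[(g * (R - b))^2]."""
    return sum((g * (R - b)) ** 2 for g, R in samples) / len(samples)

random.seed(1)
# hypothetical (score, return) pairs drawn for one fixed input sequence
samples = [(random.gauss(0, 1), random.gauss(5, 2)) for _ in range(5000)]
b_star = optimal_baseline(samples)
print(b_star)
```

Because the objective is an exact quadratic in $b$ with positive leading coefficient, `b_star` minimizes the empirical weighted squared error, not merely approximately.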
Appendix F Input-Dependent Baseline for TRPO
We show that input-dependent baselines are bias-free for Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a).
Preliminaries. Stochastic gradient descent using Equation (1) does not guarantee consistent policy improvement in complex control problems. TRPO is an alternative approach that offers monotonic policy improvement, and derives a practical algorithm with better sample efficiency and performance. TRPO maximizes a surrogate objective, subject to a KL divergence constraint:
$$\max_{\theta} \; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, Q_{\theta_{\text{old}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\Big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\big\|\, \pi_\theta(\cdot \mid s)\big)\Big] \le \delta,$$
in which $\delta$ serves as a step size for the policy update. Using a baseline in the TRPO objective, i.e., replacing $Q_{\theta_{\text{old}}}(s, a)$ with $Q_{\theta_{\text{old}}}(s, a) - b(s)$, empirically improves policy performance (Schulman et al., 2015b).
Similar to Theorem 2, we generalize TRPO to input-driven environments, with $\rho_{\theta_{\text{old}}}(s, z)$ denoting the discounted visitation frequency of the observation $s$ and input sequence $z$, and $Q_{\theta_{\text{old}}}(s, z, a)$ the corresponding Q value. The TRPO objective becomes
$$\max_{\theta} \; \mathbb{E}_{(s, z) \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, Q_{\theta_{\text{old}}}(s, z, a)\right],$$
and the constraint is $\mathbb{E}_{(s, z) \sim \rho_{\theta_{\text{old}}}}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\big] \le \delta$.
An input-dependent baseline $b(s, z)$ does not change the optimal solution of the optimization problem in TRPO; that is, replacing $Q_{\theta_{\text{old}}}(s, z, a)$ with $Q_{\theta_{\text{old}}}(s, z, a) - b(s, z)$ in the objective leaves the maximizer unchanged.
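This invariance can be checked numerically on a toy problem: since $\sum_a \pi_\theta(a \mid s) = 1$, an action-independent baseline $b(s, z)$ shifts the surrogate objective by a $\theta$-independent constant (and the KL constraint does not involve $b$ at all). The problem sizes, sigmoid policy, and random tables below are our own illustration:

```python
import math
import random

random.seed(2)
S, Z, A = 2, 2, 2  # toy numbers of observations, inputs, actions (our choice)

# random visitation frequencies rho(s, z), normalized to sum to 1
rho = [[random.random() for _ in range(Z)] for _ in range(S)]
total = sum(sum(row) for row in rho)
rho = [[v / total for v in row] for row in rho]

# random Q values and a random input-dependent baseline b(s, z)
Q = [[[random.gauss(0, 1) for _ in range(A)] for _ in range(Z)] for _ in range(S)]
b = [[random.gauss(0, 1) for _ in range(Z)] for _ in range(S)]

def policy(theta, s):
    """Sigmoid policy over two actions, one parameter per observation."""
    p1 = 1.0 / (1.0 + math.exp(-theta[s]))
    return [1.0 - p1, p1]

def surrogate(theta, use_baseline):
    """TRPO surrogate; averaging the importance weight pi_theta/pi_old
    under pi_old collapses to an expectation under pi_theta."""
    val = 0.0
    for s in range(S):
        p = policy(theta, s)
        for z in range(Z):
            for a in range(A):
                adv = Q[s][z][a] - (b[s][z] if use_baseline else 0.0)
                val += rho[s][z] * p[a] * adv
    return val

grid = [(x / 10.0, y / 10.0) for x in range(-20, 21) for y in range(-20, 21)]
best_plain = max(grid, key=lambda th: surrogate(th, False))
best_base = max(grid, key=lambda th: surrogate(th, True))
print(best_plain, best_base)  # identical maximizer over the grid
```

The difference `surrogate(th, False) - surrogate(th, True)` equals $\sum_{s,z} \rho(s,z)\, b(s,z)$ for every $\theta$, which is exactly why the argmax is unchanged.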