1 Introduction
Deep reinforcement learning (RL) has emerged as a powerful approach for sequential decision-making problems, achieving impressive results in domains such as game playing (Mnih et al., 2015; Silver et al., 2017) and robotics (Levine et al., 2016; Schulman et al., 2015a; Lillicrap et al., 2015). This paper concerns RL in input-driven environments. Informally, input-driven environments have dynamics that are partially dictated by an exogenous, stochastic input process. Queuing systems (Kleinrock, 1976; Kelly, 2011) are an example; their dynamics are governed not only by the decisions made within the system (e.g., scheduling, load balancing) but also by the arrival process that brings work (e.g., jobs, customers, packets) into the system. Input-driven environments also arise naturally in many other domains: network control and optimization (Winstein & Balakrishnan, 2013; Mao et al., 2017), robotics control with stochastic disturbances (Pinto et al., 2017), locomotion in environments with complex terrains and obstacles (Heess et al., 2017), vehicular traffic control (Belletti et al., 2018; Wu et al., 2017), tracking moving targets, and more (see Figure 1).
We focus on model-free policy gradient RL algorithms (Williams, 1992; Mnih et al., 2016; Schulman et al., 2015a), which have been widely adopted and benchmarked for a variety of RL tasks (Duan et al., 2016; Wu & Tian, 2017). A key challenge for these methods is the high variance in their gradient estimates, as such variance increases sample complexity and can impede effective learning (Schulman et al., 2015b; Mnih et al., 2016). A standard approach to reduce variance is to subtract a “baseline” from the total reward (or “return”) when estimating the policy gradient (Weaver & Tao, 2001). The most common choice of baseline is the value function — the expected return starting from the state.

Our main insight is that a state-dependent baseline — such as the value function — is a poor choice in input-driven environments, whose state dynamics and rewards are partially dictated by the input process. In such environments, comparing the return to the value function baseline may provide limited information about the quality of actions. The return obtained after taking a good action may be poor (lower than the baseline) if the input sequence following the action drives the system to unfavorable states; similarly, a bad action might end up with a high return under an advantageous input sequence. Intuitively, a good baseline for estimating the policy gradient should take the specific instance of the input process — the sequence of input values — into account. We call such a baseline an input-dependent baseline; it is a function of both the state and the entire future input sequence.
We formally define input-driven Markov decision processes, and we prove that an input-dependent baseline does not introduce bias in standard policy gradient algorithms such as Advantage Actor Critic (A2C) (Mnih et al., 2016) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a), provided that the input process is independent of the states and actions. We derive the optimal input-dependent baseline and a simpler one that is easier to work with in practice; the latter takes the form of a conditional value function — the expected return given the state and the future input sequence.

Input-dependent baselines are harder to learn than their state-dependent counterparts; they are high-dimensional functions of the sequence of input values. To learn input-dependent baselines efficiently, we propose a simple approach based on meta-learning (Finn et al., 2017; Vilalta & Drissi, 2002). The idea is to learn a “meta baseline” that can be specialized to a baseline for a specific input instantiation using a small number of training episodes with that input. This approach applies to applications in which an input sequence can be repeated during training, e.g., applications that use simulations or experiments with previously-collected input traces for training (McGough et al., 2017).
We compare our input-dependent baseline to the standard value function baseline on the five tasks illustrated in Figure 1. These tasks are derived from queuing systems (load balancing heterogeneous servers (Harchol-Balter & Vesilo, 2010)), computer networks (bitrate adaptation for video streaming (Mao et al., 2017)), and variants of standard continuous control RL benchmarks in the MuJoCo physics simulator (Todorov et al., 2012). We adapted three widely-used MuJoCo benchmarks (Duan et al., 2016; Clavera et al., 2018a; Heess et al., 2017) to add a stochastic input element that makes these tasks significantly more challenging. For example, we replaced the static target in a 7-DoF robotic arm target-reaching task with a randomly-moving target that the robot aims to track over time. Our results show that input-dependent baselines consistently provide improved training stability and better eventual policies. Input-dependent baselines are applicable to a variety of policy gradient methods, including A2C, TRPO, PPO, robust adversarial RL methods such as RARL (Pinto et al., 2017), and meta-policy optimization methods such as MB-MPO (Clavera et al., 2018b). Video demonstrations of our experiments are available at https://sites.google.com/view/inputdependentbaseline/.
2 Preliminaries
Notation.
We consider a discrete-time Markov decision process (MDP), defined by $(\mathcal{S}, \mathcal{A}, P, \rho_0, r, \gamma)$, where $\mathcal{S}$ is a set of $n$-dimensional states, $\mathcal{A}$ is a set of $m$-dimensional actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition probability distribution, $\rho_0 : \mathcal{S} \to [0, 1]$ is the distribution over initial states, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1]$ is the discount factor. We denote a stochastic policy as $\pi(a \mid s)$, which aims to optimize the expected return $\mathbb{E}_\tau\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is the trajectory following $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. We use $V^\pi(s) = \mathbb{E}_\tau\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$ to define the value function, and $Q^\pi(s, a) = \mathbb{E}_\tau\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$ to define the state-action value function. For any sequence $(x_0, x_1, \ldots)$, we use $x$ to denote the entire sequence and $x_{i:j}$ to denote $(x_i, x_{i+1}, \ldots, x_j)$.

Policy gradient methods.
Policy gradient methods estimate the gradient of the expected return with respect to the policy parameters (Sutton et al., 2000; Kakade, 2002; Gu et al., 2017). To train a policy $\pi_\theta(a \mid s)$ parameterized by $\theta$, the Policy Gradient Theorem (Sutton et al., 2000) states that

$\nabla_\theta \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] = \mathbb{E}_{s \sim \rho^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right], \qquad (1)$

where $\rho^{\pi_\theta}$ denotes the discounted state visitation frequency. Practical algorithms often use the undiscounted state visitation frequency (i.e., $\gamma = 1$ in $\rho^{\pi_\theta}$), which can make the estimation slightly biased (Thomas, 2014).
Estimating the policy gradient using Monte Carlo estimation for the $Q^{\pi_\theta}(s, a)$ function suffers from high variance (Mnih et al., 2016). To reduce variance, an appropriately chosen baseline $b(s)$ can be subtracted from the Q-estimate without introducing bias (Greensmith et al., 2004). The policy gradient estimation with a baseline in Equation (1) becomes $\mathbb{E}_{s \sim \rho^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\left(Q^{\pi_\theta}(s, a) - b(s)\right)\right]$. While an optimal baseline exists (Greensmith et al., 2004; Wu et al., 2018), it is hard to estimate and often replaced by the value function $V^{\pi_\theta}(s)$ (Sutton & Barto, 2017; Mnih et al., 2016).
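As a minimal numerical illustration of why subtracting an action-independent baseline leaves the gradient estimate unbiased, the following sketch (our own toy, not from the paper) estimates Equation (1) for a two-action softmax bandit with illustrative Q-values:

```python
import math
import random

def softmax_policy(theta):
    """Action probabilities of a 2-action softmax policy with logits (theta, 0)."""
    e = math.exp(theta)
    return [e / (e + 1.0), 1.0 / (e + 1.0)]

def grad_log_pi(theta, a):
    """d/d_theta of log pi(a | theta) for the softmax policy above."""
    p0 = softmax_policy(theta)[0]
    return (1.0 - p0) if a == 0 else -p0

def pg_estimate(theta, q_values, baseline, n_samples, seed=0):
    """Monte Carlo score-function gradient estimate with a baseline subtracted."""
    rng = random.Random(seed)
    p0 = softmax_policy(theta)[0]
    total = 0.0
    for _ in range(n_samples):
        a = 0 if rng.random() < p0 else 1
        total += grad_log_pi(theta, a) * (q_values[a] - baseline)
    return total / n_samples
```

Any constant baseline here yields the same expected gradient, $p_0(1 - p_0)(Q_0 - Q_1)$ for the two-action case, but a well-chosen baseline shrinks the spread of the individual samples.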
3 Motivating Example
We use a simple load balancing example to illustrate the variance introduced by an exogenous input process. As shown in Figure 2a, jobs arrive over time and a load balancing agent sends them to one of two servers. The jobs arrive according to a Poisson process, and the job sizes follow a Pareto distribution. The two servers process jobs from their queues at identical rates. On each job arrival, the load balancer observes state $s = (q_1, q_2)$, denoting the queue lengths at the two servers. It then takes an action $a \in \{1, 2\}$, sending the job to one of the servers. The goal of the load balancer is to minimize the average job completion time. The reward corresponding to this goal is $r = -\tau \times J$, where $\tau$ is the time elapsed since the last action and $J$ is the total number of enqueued jobs.
In this example, the optimal policy is to send the job to the server with the shortest queue (Daley, 1987). However, we find that a standard policy gradient algorithm, A2C (Mnih et al., 2016), trained using a value function baseline struggles to learn this policy. The reason is that the stochastic sequence of job arrivals creates huge variance in the reward signal, making it difficult to distinguish between good and bad actions. Consider, for example, an action at the state shown in Figure 2a. If the arrival sequence following this action consists of a burst of large jobs (e.g., input sequence 1 in Figure 2a), the queues will build up, and the return will be poor compared to the value function baseline (average return from the state). On the other hand, a light stream of jobs (e.g., input sequence 2 in Figure 2a) will lead to short queues and a better-than-average return. Importantly, this difference in return has little to do with the action; it is a consequence of the random job arrival process.
We train two A2C agents (Mnih et al., 2016), one with the standard value function baseline and the other with an input-dependent baseline tailored for each specific instantiation of the job arrival process (details of this baseline in §4). Since the input-dependent baseline takes each input sequence into account explicitly, it reduces the variance of the policy gradient estimation much more effectively (Figure 2b, left). As a result, even in this simple example, only the policy learned with the input-dependent baseline comes close to the optimal (Figure 2b, right). Figure 2c visualizes the policies learned using the two baselines. The optimal policy (pick-shortest-queue) corresponds to a clear divide between the chosen servers at the diagonal.
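The variance phenomenon in this section can be reproduced with a minimal discrete-time sketch of the two-server system (our own simplification, not the paper's simulator: Bernoulli arrivals stand in for the Poisson process, job sizes are capped Pareto draws, and the per-step reward penalizes the number of enqueued jobs):

```python
import random

def simulate(policy, arrival_trace, drain=1.0):
    """Run one episode on a fixed arrival trace; return the episode reward.
    Each entry of the trace is a job size (0.0 means no arrival that step)."""
    jobs = [[], []]  # remaining work of the jobs enqueued at each server
    total_reward = 0.0
    for size in arrival_trace:
        if size > 0.0:
            jobs[policy(jobs)].append(size)
        for i in (0, 1):  # each server processes `drain` units of work per step
            budget = drain
            while jobs[i] and budget > 1e-9:
                done = min(jobs[i][0], budget)
                jobs[i][0] -= done
                budget -= done
                if jobs[i][0] <= 1e-9:
                    jobs[i].pop(0)
        total_reward -= len(jobs[0]) + len(jobs[1])  # penalize enqueued jobs
    return total_reward

def shortest_queue(jobs):
    """The (near-)optimal policy: send the job to the shorter queue."""
    return 0 if len(jobs[0]) <= len(jobs[1]) else 1

def random_route(jobs):
    return random.randrange(2)

def make_trace(rng, steps=200, p_arrival=0.5):
    """Bernoulli arrivals with capped-Pareto job sizes (input-process sketch)."""
    return [min(rng.paretovariate(1.5), 20.0) if rng.random() < p_arrival else 0.0
            for _ in range(steps)]
```

Running both policies on the same set of traces shows returns varying widely across traces (the input-driven variance), even though, averaged over the same inputs, shortest-queue routing is the better policy.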
In fact, the variance with the standard baseline can be arbitrarily worse than with an input-dependent baseline: we refer the reader to Appendix A for an analytical example on a 1D grid world.
4 Reducing Variance for Input-Driven MDPs
We now formally define input-driven MDPs and derive variance-reducing baselines for policy gradient methods in environments with input processes.
Definition 1.
An input-driven MDP is defined by $(\mathcal{S}, \mathcal{A}, \mathcal{Z}, P_s, P_z, \rho_z, r, \gamma)$, where $\mathcal{Z}$ is a set of $k$-dimensional input values, $P_s : \mathcal{S} \times \mathcal{A} \times \mathcal{Z} \times \mathcal{S} \to [0, 1]$ is the transition kernel of the states, $P_z$ is the transition kernel of the input process, $\rho_z$ is the distribution of the initial input $z_0$, $r : \mathcal{S} \times \mathcal{A} \times \mathcal{Z} \to \mathbb{R}$ is the reward function, and $\mathcal{S}$, $\mathcal{A}$, $\rho_0$, $\gamma$ follow the standard definition in §2.
An input-driven MDP adds an input process, $z = (z_0, z_1, z_2, \ldots)$, to a standard MDP.
In this setting, the next state $s_{t+1}$ depends on $(s_t, a_t, z_t)$, i.e., $s_{t+1} \sim P_s(\cdot \mid s_t, a_t, z_t)$.
We seek to learn policies that maximize cumulative expected rewards.
We focus on two cases, corresponding to the graphical models shown in Figure 3:
Case 1: $z$ is a Markov process, and $z_t$ is observed at time $t$. The action $a_t$ can hence depend on both $s_t$ and $z_t$.
Case 2: $z$ is a general process (not necessarily Markov), and $z_t$ is observed only at time $t+1$. The action $a_t$ hence depends only on $s_t$.
In Appendix B, we prove that case 1 corresponds to a fully-observable MDP. This is evident from the graphical model in Figure 3a by considering $(s_t, z_t)$ to be the ‘state’ of the MDP at time $t$. Case 2, on the other hand, corresponds to a partially-observed MDP (POMDP) if we define the state to contain both $s_t$ and $z_t$, but leave $z_t$ unobserved at time $t$ (see Appendix B for details).
4.1 Variance Reduction
In input-driven MDPs, the standard input-agnostic baseline is ineffective at reducing variance, as shown by our motivating example (§3). We propose to use an input-dependent baseline of the form $b(\omega_t, z_{t:\infty})$ — a function of both the observation $\omega_t$ at time $t$ ($\omega_t = (s_t, z_t)$ in case 1; $\omega_t = s_t$ in case 2) and the input sequence $z_{t:\infty}$ from $t$ onwards. An input-dependent baseline uses information that is not available to the policy. Specifically, the input sequence $z_{t:\infty}$ cannot be used when taking an action at time $t$, because $z_{t:\infty}$ has not yet occurred at time $t$. However, in many applications, the input sequence is known at training time. In some cases, we know the entire input sequence upfront, e.g., when training in a simulator. In other situations, we can record the input sequence on the fly during training. Then, after a training episode, we can use the recorded values, including those that occurred after time $t$, to compute the baseline for each step $t$.
We now analyze input-dependent baselines. Our main result is that input-dependent baselines are bias-free. We also derive the optimal input-dependent baseline for variance reduction. All the results hold for both cases in Figure 3. We first state two useful lemmas required for our analysis. The first lemma shows that under the input-driven MDP definition, the input sequence $z_{t:\infty}$ is conditionally independent of the action $a_t$ given the observation $\omega_t$, while the second lemma states the policy gradient theorem for input-driven MDPs.
Lemma 1.
$P(z_{t:\infty} \mid \omega_t, a_t) = P(z_{t:\infty} \mid \omega_t)$, i.e., $z_{t:\infty} \to \omega_t \to a_t$ forms a Markov chain.
Proof. See Appendix C.
Lemma 2.
For an input-driven MDP, the policy gradient theorem can be rewritten as

$\nabla_\theta \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] = \mathbb{E}_{(\omega, z_{t:\infty}) \sim \rho^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid \omega)\, Q^{\pi_\theta}(\omega, z_{t:\infty}, a)\right], \qquad (2)$

where $\rho^{\pi_\theta}(\omega, z_{t:\infty})$ denotes the discounted visitation frequency of the observation $\omega$ and input sequence $z_{t:\infty}$, and $Q^{\pi_\theta}(\omega_t, z_{t:\infty}, a_t) = \mathbb{E}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \,\middle|\, \omega_t, z_{t:\infty}, a_t\right]$.
Proof. See Appendix D.
Equation (2) generalizes the standard Policy Gradient Theorem in Equation (1).
$\rho^{\pi_\theta}(\omega, z_{t:\infty})$ can be thought of as a joint distribution over observations and input sequences.
$Q^{\pi_\theta}(\omega, z_{t:\infty}, a)$ is a “state-action-input” value function, i.e., the expected return when taking action $a$ after observing $\omega$, with input sequence $z_{t:\infty}$ from that step onwards. The key ingredient in the proof of Lemma 2 is the conditional independence of the input process and the action given the observation (Lemma 1).

Theorem 1.
An input-dependent baseline $b(\omega_t, z_{t:\infty})$ does not bias the policy gradient.

Proof.
Subtracting $b(\omega, z_{t:\infty})$ from $Q^{\pi_\theta}(\omega, z_{t:\infty}, a)$ in Equation (2) changes the gradient estimate by

$\mathbb{E}_{(\omega, z_{t:\infty}) \sim \rho^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid \omega)\, b(\omega, z_{t:\infty})\right] = \mathbb{E}_{(\omega, z_{t:\infty}) \sim \rho^{\pi_\theta}}\left[b(\omega, z_{t:\infty})\, \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid \omega)\right]\right] = 0, \qquad (3)$

where the first equality uses the fact that $b(\omega, z_{t:\infty})$ does not depend on the action $a$ (Lemma 1), and the second follows from $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid \omega)\right] = \nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[1] = 0$. ∎

Input-dependent baselines are also bias-free for policy optimization methods such as TRPO (Schulman et al., 2015a), as we show in Appendix F. Next, we derive the optimal input-dependent baseline for variance reduction. As the gradient estimates are vectors, we use the trace of the covariance matrix as the minimization objective (Greensmith et al., 2004).

Theorem 2.
The input-dependent baseline that minimizes variance in the policy gradient estimate is given by

$b^*(\omega_t, z_{t:\infty}) = \frac{\mathbb{E}_{a \sim \pi_\theta}\left[\left\|\nabla_\theta \log \pi_\theta(a \mid \omega_t)\right\|^2 Q^{\pi_\theta}(\omega_t, z_{t:\infty}, a)\right]}{\mathbb{E}_{a \sim \pi_\theta}\left[\left\|\nabla_\theta \log \pi_\theta(a \mid \omega_t)\right\|^2\right]}. \qquad (4)$
Proof. See Appendix E.
Operationally, for observation $\omega_t$ at each step $t$, the input-dependent baseline takes the form $b(\omega_t, z_{t:\infty})$. In practice, we use a simpler alternative to Equation (4): $b(\omega_t, z_{t:\infty}) = \mathbb{E}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \,\middle|\, \omega_t, z_{t:\infty}\right]$. This can be thought of as a value function that provides the expected return given observation $\omega_t$ and input sequence $z_{t:\infty}$ from that step onwards. We discuss how to estimate input-dependent baselines efficiently in §5.
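The benefit of conditioning the baseline on the future input sequence can be seen numerically in a stripped-down setting of our own construction (a single decision, a uniform two-action policy, and a return that is an action-dependent term plus an exogenous Gaussian input term; all constants are illustrative):

```python
import random

def gradient_samples(n, baseline_fn, seed=0):
    """Per-sample score-function gradient terms when the return is
    f(a) + z: f depends on the action, z is an exogenous input draw."""
    rng = random.Random(seed)
    f = [1.0, 2.0]
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 10.0)         # exogenous input term in the return
        a = rng.randrange(2)             # pi(a) = 0.5 for each action
        score = 0.5 if a == 0 else -0.5  # d/d_theta log pi(a) at logit 0
        out.append(score * (f[a] + z - baseline_fn(z)))
    return out

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

state_baseline = lambda z: 1.5        # V = E[f(a) + z] (input-agnostic)
input_baseline = lambda z: 1.5 + z    # V(z) = E_a[f(a)] + z (input-dependent)
```

Both estimators agree in expectation (the exact gradient is $-0.25$ in this toy), but the input-dependent baseline cancels the exogenous term entirely, so its per-sample variance is orders of magnitude smaller.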
Remark. Input-dependent baselines are generally applicable to reducing variance for policy gradient methods in input-driven environments. In this paper, we apply input-dependent baselines to A2C (§6.2), TRPO (§6.1) and PPO (Appendix L). Our technique is complementary and orthogonal to adversarial RL (e.g., RARL (Pinto et al., 2017)) and meta-policy adaptation (e.g., MB-MPO (Clavera et al., 2018b)) for environments with external disturbances. Adversarial RL improves policy robustness by co-training an “adversary” to generate a worst-case disturbance process. Meta-policy optimization aims for fast policy adaptation to handle model discrepancy between training and testing. By contrast, input-dependent baselines improve policy optimization itself in the presence of stochastic input processes. Our work primarily focuses on learning a single policy in input-driven environments, without policy adaptation. However, input-dependent baselines can be used as a general method to improve the policy optimization step in adversarial RL and meta-policy adaptation methods. For example, in Appendix M, we empirically show that if an adversary generates high-variance noise, RARL with a standard state-based baseline cannot train good controllers, but the input-dependent baseline helps improve the policy’s performance. Similarly, input-dependent baselines can improve meta-policy optimization in environments with stochastic disturbances, as we show in Appendix N.
5 Learning InputDependent Baselines Efficiently
Input-dependent baselines are functions of the sequence of input values. A natural approach to train such baselines is to use models that operate on sequences (e.g., LSTMs (Gers et al., 1999)). However, learning a sequential mapping in a high-dimensional space can be expensive (Bahdanau et al., 2014). We considered an LSTM approach, but ruled it out when initial experiments showed that it fails to provide significant policy improvement over the standard baseline in our environments (Appendix G).
Fortunately, we can learn the baseline much more efficiently in applications where we can repeat the same input sequence multiple times during training. Input-repeatability is feasible in many applications: it is straightforward when using simulators for training, and also feasible when training a real system with previously-collected input traces outside simulation. For example, training a robot in the presence of exogenous forces might apply a set of time-series traces of these forces repeatedly to the physical robot. We now present two approaches that exploit input-repeatability to learn input-dependent baselines efficiently.
Multi-value-network approach. A straightforward way to learn $b(\omega_t, z_{t:\infty})$ for different input instantiations is to train one value network for each particular instantiation of the input process. Specifically, in the training process, we first generate $N$ input sequences and restrict training only to those sequences. To learn a separate baseline function for each input sequence, we use $N$ value networks with independent parameters $(w_1, \ldots, w_N)$, and a single policy network with parameter $\theta$. During training, we randomly sample an input sequence $z^{(i)}$, execute a rollout based on $z^{(i)}$ with the current policy $\pi_\theta$, and use the (state, action, reward) data to train the value network parameter $w_i$ and the policy network parameter $\theta$ (details in Appendix I).
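A tabular toy version of this scheme (our own stand-in: running-mean return tables replace the per-sequence value networks, and a synthetic scalar return replaces the rollout) shows the mechanics of keeping one baseline per input sequence:

```python
import random

class MultiValueBaseline:
    """One running-mean return estimate per (input sequence, time step)."""
    def __init__(self, n_sequences):
        self.values = [{} for _ in range(n_sequences)]
        self.counts = [{} for _ in range(n_sequences)]

    def baseline(self, seq_id, t):
        return self.values[seq_id].get(t, 0.0)

    def update(self, seq_id, t, ret):
        c = self.counts[seq_id].get(t, 0) + 1
        self.counts[seq_id][t] = c
        v = self.values[seq_id].get(t, 0.0)
        self.values[seq_id][t] = v + (ret - v) / c  # incremental mean

# Training-loop sketch: sample a sequence, "roll out", compute the advantage
# against that sequence's baseline, then update the baseline.
baselines = MultiValueBaseline(n_sequences=2)
rng = random.Random(0)
for _ in range(1000):
    i = rng.randrange(2)                                    # sampled sequence
    ret = (10.0 if i == 0 else 0.0) + rng.gauss(0.0, 1.0)   # synthetic return
    advantage = ret - baselines.baseline(i, 0)              # for the policy step
    baselines.update(i, 0, ret)
```

Because each table only ever sees returns from its own input sequence, the two baselines converge to the two sequence-specific expected returns rather than their input-agnostic average.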
Meta-learning approach. The multi-value-network approach does not scale if the task requires training over a large number of input instantiations to generalize. The number of inputs needed is environment-specific, and can depend on a variety of factors, such as the time horizon of the problem, the distribution of the input process, and the relative magnitude of the variance due to the input process compared to other sources of randomness (e.g., actions). Ideally, we would like an approach that enables learning across many different input sequences. We present a method based on meta-learning to train with an unbounded number of input sequences. The idea is to use all (potentially infinitely many) input sequences to learn a “meta value network” model. Then, for each specific input sequence, we first customize the meta value network using a few example rollouts with that input sequence. We then compute the actual baseline values for training the policy network parameters, using the customized value network for the specific input sequence. Our implementation uses Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017).
The pseudocode in Algorithm 1 depicts the training algorithm. We follow the notation of MAML, denoting the loss in the value function on a rollout as $\mathcal{L}$. We perform multiple rollouts with the same input sequence (lines 3 and 4); we use the first half of the rollouts to customize the meta value network for this instantiation of the input (line 5), and then apply the customized value network on the states of the other rollouts to compute the baseline for those rollouts (line 6); similarly, we swap the two groups of rollouts and repeat the same process (lines 7 and 8). We use different rollouts to adapt the meta value network and to compute the baseline to avoid introducing extra bias into the baseline. Finally, we use the baseline values computed for each rollout to update the policy network parameters (line 9), and we apply the MAML (Finn et al., 2017) gradient step to update the meta value network model (line 10).
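The adapt-then-baseline pattern of Algorithm 1 can be sketched with a deliberately tiny model of our own: a single scalar value parameter fit by squared loss, an exact inner gradient step, and a first-order (Reptile-style) meta update in place of the full MAML second-order step. All constants are illustrative:

```python
import random

def adapt(w_meta, returns, lr=0.5):
    """Inner step: one gradient step on L(w) = mean((w - G)^2) specializes
    the meta value parameter to one input sequence's rollouts."""
    grad = 2.0 * (w_meta - sum(returns) / len(returns))
    return w_meta - lr * grad

rng = random.Random(0)
w_meta = 0.0
for _ in range(2000):
    # Each "input sequence" shifts the returns by a sequence-specific offset.
    offset = rng.choice([2.0, 8.0])
    adapt_returns = [offset + rng.gauss(0.0, 1.0) for _ in range(8)]
    eval_returns = [offset + rng.gauss(0.0, 1.0) for _ in range(8)]
    w_i = adapt(w_meta, adapt_returns)            # customize on one half
    advantages = [g - w_i for g in eval_returns]  # baseline the other half
    w_meta += 0.05 * (w_i - w_meta)               # first-order meta update
```

The meta parameter settles near the across-sequence mean return, while a single inner step snaps it to any particular sequence; the paper's version replaces the scalar with a value network and the first-order update with the MAML gradient.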
6 Experiments
Our experiments demonstrate that input-dependent baselines provide consistent performance gains across multiple continuous-action MuJoCo simulated robotic locomotion tasks and discrete-action environments in queuing systems and network control. We conduct experiments for both policy gradient methods and policy optimization methods (see Appendix K for details). The videos for our experiments are available at https://sites.google.com/view/inputdependentbaseline/.
6.1 Simulated Robotic Locomotion
We use the MuJoCo physics engine (Todorov et al., 2012) in OpenAI Gym (Brockman et al., 2016) to evaluate input-dependent baselines for robotic control tasks with external disturbances. We extend the standard Walker2d, HalfCheetah and 7-DoF robotic arm environments, adding a different external input to each (Figure 1).
Walker2d with random wind (Figure 1c). We train a 2D walker with varying wind, which randomly drags the walker backward or forward with a different force at each step. The wind vector changes randomly, i.e., the wind forms a random input process. We add a force sensor to the state to enable the agent to quickly adapt. The goal is for the walker to walk forward while keeping balance.
HalfCheetah on floating tiles with random buoyancy (Figure 1d). A half-cheetah runs over a series of tiles floating on water (Clavera et al., 2018a). Each tile has different damping and friction properties, which moves the half-cheetah up and down and changes its dynamics. This random buoyancy is the external input process; the cheetah needs to learn to run forward over varying tiles.
7-DoF arm tracking a moving target (Figure 1e).
We train a simulated robot arm to track a randomly moving target (a red ball). The robotic arm has seven degrees of freedom, and the target performs a random walk, which forms the external input process. The reward is the negative squared distance between the robot hand (blue square) and the target.
The Walker2d and 7-DoF arm environments correspond to the fully observable MDP case in Figure 3, i.e., the agent observes the input $z_t$ at time $t$. The HalfCheetah environment is a POMDP, as the agent does not observe the buoyancy of the tiles. In Appendix H, we show results for the POMDP version of the Walker2d environment.
Results. We build 10 value networks (the multi-value-network approach) and a meta-baseline using MAML, both on top of OpenAI’s TRPO implementation (Dhariwal et al., 2017). Figure 4 shows the performance comparison among the different baselines on 100 unseen testing input sequences at each training checkpoint. These learning curves show that TRPO with a state-dependent baseline performs worst in all environments. With the input-dependent baseline, by contrast, performance in unseen testing environments improves by up to 3×, as the agent learns a policy robust against disturbances. For example, it learns to lean into a headwind and quickly place its leg forward to counter it; it learns to apply different force on tiles with different buoyancy to avoid falling over; and it learns to co-adjust multiple joints to keep track of the moving object. The meta-baseline eventually outperforms the 10 value networks as it effectively learns from a large number of input instantiations and hence generalizes better.
The input-dependent baseline technique applies generally on top of policy optimization methods. In Appendix L, we show a similar comparison with PPO (Schulman et al., 2017). Also, in Appendix M we show that adversarial RL (e.g., RARL (Pinto et al., 2017)) alone is not adequate to solve the high-variance problem, and that the input-dependent baseline helps improve the policy performance (Figure 9).
6.2 DiscreteAction Environments
Our discrete-action environments arise from widely-studied problems in computer systems research: load balancing and bitrate adaptation.¹ As these problems often lack closed-form optimal solutions (Grandl et al., 2016; Yin et al., 2015), hand-tuned heuristics abound. Recent work suggests that model-free reinforcement learning can achieve better performance than such human-engineered heuristics (Mao et al., 2016; Evans & Gao, 2016; Mao et al., 2017; Mirhoseini et al., 2017). We consider a load balancing environment (similar to the example in §3) and a bitrate adaptation environment in video streaming (Yin et al., 2015). The detailed setup of these environments is in Appendix J.

¹We considered Atari games, often used as benchmark discrete-action RL environments (Mnih et al., 2015). However, Atari games lack an exogenous input process: a random seed perturbs the games’ initial state, but it does not affect the environmental changes (e.g., in “Seaquest”, the ships always come in a fixed pattern).

Results. We extend OpenAI’s A2C implementation (Dhariwal et al., 2017) for our baselines. The learning curves in Figure 5 illustrate that directly applying A2C with a standard value network as the baseline results in unstable test reward and underperforms the traditional heuristic in both environments. Our input-dependent baselines reduce the variance and improve test reward by 25–33%, outperforming the heuristic. The meta-baseline performs the best in all environments.
7 Related Work
Policy gradient methods compute unbiased gradient estimates, but can experience a large variance (Sutton & Barto, 2017; Weaver & Tao, 2001). Reducing variance for policy-based methods using a baseline has been shown to be effective (Williams, 1992; Sutton & Barto, 2017; Weaver & Tao, 2001; Greensmith et al., 2004; Mnih et al., 2016). Much of this work focuses on variance reduction in a general MDP setting, rather than variance reduction for MDPs with specific stochastic structures. Wu et al. (2018)’s techniques for MDPs with multivariate independent actions are closest to our work. Their state-action-dependent baseline improves training efficiency and model performance on high-dimensional control tasks by explicitly factoring out, for each action, the effect due to other actions. By contrast, our work exploits the structure of state transitions instead of the structure of the stochastic policy.
Recent work has also investigated the bias-variance tradeoff in policy gradient methods. Schulman et al. (2015b) replace the Monte Carlo return with a weighted return estimation (similar to TD($\lambda$) with a value function bootstrap (Tesauro, 1995)), improving performance in high-dimensional control tasks. Other recent approaches use more general control variates to construct variants of policy gradient algorithms. Tucker et al. (2018) compare these recent methods, both analytically on a linear-quadratic-Gaussian task and empirically on complex robotic control tasks. Analysis of control variates for policy gradient methods is a well-studied topic, and extending such analyses (e.g., Greensmith et al. (2004)) to the input-driven MDP setting could be interesting future work.
In other contexts, prior work has proposed new RL training methodologies for environments with disturbances. Clavera et al. (2018b) adapt the policy to different patterns of disturbance by training the RL agent using meta-learning. RARL (Pinto et al., 2017) improves policy robustness by co-training an adversary to generate a worst-case noise process. Our work is orthogonal and complementary to these approaches, as we seek to improve policy optimization itself in the presence of inputs like disturbances.
8 Conclusion
We introduced input-driven Markov Decision Processes, in which stochastic input processes influence state dynamics and rewards. In this setting, we demonstrated that an input-dependent baseline can significantly reduce variance for policy gradient methods, improving training stability and the quality of learned policies. Our work provides an important ingredient for using RL successfully in a variety of domains, including queuing networks and computer systems, where an input workload is a fundamental aspect of the system, as well as domains where the input process is more implicit, like robotics control with disturbances or random obstacles.
We showed that meta-learning provides an efficient way to learn input-dependent baselines for applications where input sequences can be repeated during training. Investigating efficient architectures for input-dependent baselines for cases where the input process cannot be repeated in training is an interesting direction for future work.
Acknowledgements.
We thank Ignasi Clavera for sharing the HalfCheetah environment, Jonas Rothfuss for the comments on meta-policy optimization, and the anonymous ICLR reviewers for their feedback. This work was funded in part by NSF grants CNS-1751009, CNS-1617702, a Google Faculty Research Award, an AWS Machine Learning Research Award, a Cisco Research Center Award, an Alfred P. Sloan Research Fellowship and the sponsors of MIT Data Systems and AI Lab.
References
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Belletti et al. (2018) Francois Belletti, Daniel Haziza, Gabriel Gomes, and Alexandre M. Bayen. Expert level control of ramp metering based on multi-task deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 19(4):1198–1207, 2018.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. https://gym.openai.com/docs/, 2016.

 Chilimbi et al. (2014) Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11^{th} USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 571–582, October 2014.
 Clavera et al. (2018a) Ignasi Clavera, Anusha Nagabandi, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt: Meta-learning for model-based control. arXiv preprint arXiv:1803.11347, 2018a.
 Clavera et al. (2018b) Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018b.
 Daley (1987) D.J. Daley. Certain optimality properties of the first-come first-served discipline for G/G/s queues. Stochastic Processes and their Applications, 25:301–308, 1987.
 DASH Industry Forum (2016) DASH Industry Forum. Reference Client 2.4.0. http://mediapm.edgesuite.net/dash/public/nightly/samples/dashifreferenceplayer/index.html, 2016.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.
 Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
 Evans & Gao (2016) Richard Evans and Jim Gao. DeepMind AI Reduces Google Data Centre Cooling Bill by 40%. https://deepmind.com/blog/deepmindaireducesgoogledatacentrecoolingbill40/, 2016.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135, 2017.
 Gers et al. (1999) Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.
 Grandl et al. (2016) Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. Graphene: Packing and dependency-aware scheduling for data-parallel clusters. In Proceedings of the 12^{th} USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 81–97, 2016.
 Greensmith et al. (2004) Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
 Gu et al. (2017) Shixiang Gu, Timothy P. Lillicrap, Richard E Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3849–3858, 2017.
 Harchol-Balter & Vesilo (2010) Mor Harchol-Balter and Rein Vesilo. To balance or unbalance load in size-interval task allocation. Probability in the Engineering and Informational Sciences, 24(2):219–244, April 2010.
 Harrison et al. (2017) James Harrison, Animesh Garg, Boris Ivanovic, Yuke Zhu, Silvio Savarese, Li Fei-Fei, and Marco Pavone. Adapt: zero-shot adaptive policy transfer for stochastic dynamical systems. arXiv preprint arXiv:1707.04674, 2017.
 Heess et al. (2017) Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 Kakade (2002) Sham M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.
 Kelly (2011) Frank P. Kelly. Reversibility and stochastic networks. Cambridge University Press, 2011.
 Kleinrock (1976) Leonard Kleinrock. Queueing systems, volume 2: Computer applications, volume 66. Wiley, New York, 1976.
 Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. Endtoend training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334–1373, January 2016.
 Lillicrap et al. (2015) Timothy P. Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Mao et al. (2016) Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15^{th} ACM Workshop on Hot Topics in Networks (HotNets), November 2016.
 Mao et al. (2017) Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with pensieve. In Proceedings of the ACM SIGCOMM 2017 Conference, 2017.
 McGough et al. (2017) Stephen McGough, Noura Al Moubayed, and Matthew Forshaw. Using machine learning in trace-driven energy-aware simulations of high-throughput computing systems. In Proceedings of the 8^{th} ACM/SPEC on International Conference on Performance Engineering (ICPE), pp. 55–60. ACM, 2017.
 Mirhoseini et al. (2017) Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. Device placement optimization with reinforcement learning. In Proceedings of the 34^{th} International Conference on Machine Learning (ICML), 2017.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
 Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33^{rd} International Conference on Machine Learning (ICML), pp. 1928–1937, 2016.
 Nair & Hinton (2010) Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27^{th} International Conference on Machine Learning (ICML), pp. 807–814, 2010.
 Pinto et al. (2017) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34^{th} International Conference on Machine Learning (ICML), pp. 2817–2826, 2017.
 Riiser et al. (2013) Haakon Riiser, Paul Vigmostad, Carsten Griwodz, and Pål Halvorsen. Commute Path Bandwidth Traces from 3G Networks: Analysis and Applications. In Proceedings of the 4^{th} ACM Multimedia Systems Conference (MMSys), 2013.
 Schulman et al. (2015a) John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015a.
 Schulman et al. (2015b) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Sutton & Barto (2017) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, Second Edition. MIT Press, 2017.
 Sutton et al. (2000) Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063. 2000.
 Tesauro (1995) Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
 Thomas (2014) Philip Thomas. Bias in natural actorcritic algorithms. In International Conference on Machine Learning, pp. 441–448, 2014.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for modelbased control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.
 Tucker et al. (2018) George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E. Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. arXiv preprint arXiv:1802.10031, 2018.
 Vilalta & Drissi (2002) Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.
 Weaver & Tao (2001) Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the 17^{th} Conference on Uncertainty in Artificial Intelligence, pp. 538–545. Morgan Kaufmann Publishers Inc., 2001.
 Williams (1992) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 Winstein & Balakrishnan (2013) Keith Winstein and Hari Balakrishnan. TCP ex machina: Computer-generated congestion control. In ACM SIGCOMM Computer Communication Review, volume 43, pp. 123–134. ACM, 2013.
 Wu et al. (2017) Cathy Wu, Aboudy Kreidieh, Kanaad Parvate, Eugene Vinitsky, and Alexandre M Bayen. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465, 2017.
 Wu et al. (2018) Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In Proceedings of the 6^{th} International Conference on Learning Representations (ICLR), 2018.
 Wu & Tian (2017) Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In Submitted to International Conference on Learning Representations, 2017.
 Yin et al. (2015) Xiaoqi Yin, Abhishek Jindal, Vyas Sekar, and Bruno Sinopoli. A Control-Theoretic Approach for Dynamic Adaptive Video Streaming over HTTP. In Proceedings of the 2015 ACM SIGCOMM Conference, 2015.
Appendix A Illustration of Variance Reduction in 1D Grid World
Consider a walker in a 1D grid world, where the state $s_t$ at time $t$ denotes the position of the walker, and the action $a_t \in \{-1, +1\}$ denotes the intent to either move forward or backward. Additionally, let $z_t \in \{-1, +1\}$ be a uniform i.i.d. “exogenous input” that perturbs the position of the walker. For an action $a_t$ and input $z_t$, the state of the walker in the next step is given by $s_{t+1} = s_t + a_t + z_t$. The objective of the game is to move the walker forward; hence, the reward is $r_t = s_{t+1} - s_t = a_t + z_t$ at each time step, and the return from step $t$ is $R_t = \sum_{t' \geq t} \gamma^{t'} r_{t'}$, where $\gamma \in (0, 1)$ is a discount factor.
While the optimal policy for this game is clear ($\pi(a_t = +1 \mid s_t) = 1$ for all $s_t$), consider learning such a policy using policy gradient. For simplicity, let the policy be parametrized as $\pi_\theta(a_t = +1) = \theta$, with $\theta = 1/2$ at the start of training. In the following, we evaluate the variance of the policy gradient estimate at the start of training under (i) the standard value function baseline, and (ii) a baseline that is the expected cumulative reward conditioned on all future inputs.
Variance under standard baseline. The value function in this case is identically $V(s_t) = 0$ at all states. This is because $V(s_t) = \mathbb{E}\big[\sum_{t' \geq t} \gamma^{t'} (a_{t'} + z_{t'})\big] = 0$, since both actions and inputs are i.i.d. with mean $0$ at $\theta = 1/2$. Also note that $\nabla_\theta \log \pi_\theta(a_t = +1) = 1/\theta = 2$ and $\nabla_\theta \log \pi_\theta(a_t = -1) = -1/(1 - \theta) = -2$; hence $\nabla_\theta \log \pi_\theta(a_t) = 2 a_t$. Therefore, the variance of the policy gradient estimate can be written as
$$\mathrm{Var}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t)\,\big(R_t - V(s_t)\big)\Big] = \mathrm{Var}\Big[\sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} (a_{t'} + z_{t'})\Big]. \tag{5}$$
Variance under input-dependent baseline. Now, consider an alternative “input-dependent” baseline defined as $b(s_t, z_{t:\infty}) = \mathbb{E}_{a}\big[R_t \mid z_{t:\infty}\big]$. Intuitively, this baseline captures the average return obtained when experiencing a particular fixed input sequence. We refer the reader to §4 for a formal discussion and analysis of input-dependent baselines. Evaluating the baseline, we get $b(s_t, z_{t:\infty}) = \sum_{t' \geq t} \gamma^{t'} z_{t'}$, since the actions have mean $0$ at $\theta = 1/2$. Therefore, the variance of the policy gradient estimate in this case is
$$\mathrm{Var}\Big[\sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} a_{t'}\Big]. \tag{6}$$
Reduction in variance. To analyze the variance reduction between the two cases (Equations (5) and (6)), we note that
$$\mathrm{Var}\Big[\sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} (a_{t'} + z_{t'})\Big] = \mathrm{Var}\Big[\sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} a_{t'}\Big] + \mathrm{Var}\Big[\sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} z_{t'}\Big] + 2\,\mathrm{Cov}\Big(\sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} a_{t'},\ \sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} z_{t'}\Big), \tag{7}$$
and hence
$$\mathrm{Var}\big[\text{Eq. (5)}\big] - \mathrm{Var}\big[\text{Eq. (6)}\big] = \mathrm{Var}\Big[\sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} z_{t'}\Big] + 2\,\mathrm{Cov}\Big(\sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} a_{t'},\ \sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} z_{t'}\Big). \tag{8}$$
This follows because the actions and inputs are independent and have mean $0$ at $\theta = 1/2$, so every term in the covariance has the form
$$4 \gamma^{t' + s'}\, \mathbb{E}\big[a_t\, a_{t'}\, a_s\big]\, \mathbb{E}\big[z_{s'}\big] = 0.$$
Therefore, the covariance term in Equation (7) is $0$. Hence, the variance reduction from Equation (8) can be written as
$$\mathrm{Var}\Big[\sum_{t} 2 a_t \sum_{t' \geq t} \gamma^{t'} z_{t'}\Big] = 4\,\mathrm{Var}(z_t) \sum_{t \geq 0} \sum_{t' \geq t} \gamma^{2 t'} = \frac{4\,\mathrm{Var}(z_t)}{(1 - \gamma^2)^2}.$$
Thus, the input-dependent baseline reduces the variance of the policy gradient estimate by an amount proportional to the variance of the external input. In this toy example, we have chosen $z_t$ to be binary-valued, but more generally, the variance of $z_t$ could be arbitrarily large and might be a dominating factor of the overall variance in the policy gradient estimate.
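The calculation above can be checked empirically. The sketch below (all constants — horizon, discount, and sample count — are illustrative choices, not from the paper) estimates the variance of the REINFORCE gradient at $\theta = 1/2$ under both baselines:

```python
import random

GAMMA, T = 0.9, 20  # illustrative discount and horizon

def rollout(theta, rng):
    """Sample one episode: actions a_t ~ pi_theta, inputs z_t uniform on {-1, +1}."""
    actions = [1 if rng.random() < theta else -1 for _ in range(T)]
    inputs = [rng.choice([-1, 1]) for _ in range(T)]
    return actions, inputs

def grad_estimate(theta, actions, inputs, baseline):
    """REINFORCE estimate: sum_t grad log pi(a_t) * (R_t - b_t), with r_t = a_t + z_t."""
    g = 0.0
    for t in range(T):
        # grad_theta log pi_theta(a_t): 1/theta for a=+1, -1/(1-theta) for a=-1
        score = 1.0 / theta if actions[t] == 1 else -1.0 / (1.0 - theta)
        ret = sum(GAMMA ** tp * (actions[tp] + inputs[tp]) for tp in range(t, T))
        g += score * (ret - baseline(t, inputs))
    return g

def value_baseline(t, inputs):
    return 0.0  # V(s) = 0 at theta = 1/2: actions and inputs are zero-mean

def input_dependent_baseline(t, inputs):
    # E_a[R_t | z_{t:}] = sum_{t' >= t} gamma^{t'} z_{t'}
    return sum(GAMMA ** tp * inputs[tp] for tp in range(t, T))

def grad_variance(baseline, n=5000, theta=0.5, seed=0):
    rng = random.Random(seed)
    gs = [grad_estimate(theta, *rollout(theta, rng), baseline) for _ in range(n)]
    m = sum(gs) / n
    return sum((x - m) ** 2 for x in gs) / n

v_std = grad_variance(value_baseline)
v_inp = grad_variance(input_dependent_baseline)
print(v_std > v_inp)  # the input-dependent baseline yields markedly lower variance
```

With $\gamma = 0.9$, the analytical reduction $4/(1-\gamma^2)^2 \approx 110$ dominates the sampling noise, so the gap is clearly visible even with a few thousand rollouts.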
Appendix B Markov properties of input-driven decision processes
Proposition 1.
An input-driven decision process satisfying the conditions of case 1 in Figure 3 is a fully observable MDP, with state $(s_t, z_t)$ and action $a_t$.
Proof.
By the conditional independence structure of case 1 in Figure 3, the transition kernel of the augmented process factorizes as $\mathbb{P}(s_{t+1}, z_{t+1} \mid s_t, z_t, a_t) = \mathbb{P}(s_{t+1} \mid s_t, a_t, z_t)\,\mathbb{P}(z_{t+1} \mid z_t)$, which depends only on the augmented state $(s_t, z_t)$ and the action $a_t$. The augmented process is therefore Markov, and since both components of the state are observed at time $t$, the decision process is a fully observable MDP. ∎
Proposition 2.
An input-driven decision process satisfying the conditions of case 2 in Figure 3, with state $(s_t, z_t)$ and action $a_t$, is a fully observable MDP. If only $s_t$ is observed at time $t$, it is a partially observable MDP (POMDP).
Proof.
As in the proof of Proposition 1, the conditional independence structure of case 2 in Figure 3 implies that the transition kernel of the augmented state $(s_t, z_t)$ depends only on $(s_t, z_t)$ and $a_t$, so the augmented process is Markov. Therefore, the decision process with state $(s_t, z_t)$ is a fully observable MDP. If only $s_t$ is observed, the decision process is a POMDP, since the $z_t$ component of the state is not observed. ∎
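The augmentation argument can be made concrete with a toy input-driven process (all states, actions, and probabilities below are made up for illustration): the augmented kernel over $(s, z)$ factorizes exactly as in the proofs above, and each of its rows is a valid distribution that depends only on $(s_t, z_t)$ and $a_t$.

```python
import itertools

# Hypothetical tiny example: 2 positions, 2 input values, 2 actions.
S, Z, A = [0, 1], [0, 1], [0, 1]

# Exogenous Markov input process: P_z[z][z'] (illustrative numbers)
P_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}

# State transition given (s, a, z): returns P_s(. | s, a, z) over next states
def P_s(s, a, z):
    p = 0.8 if a == z else 0.4  # arbitrary dynamics that depend on the input
    return {s: 1 - p, 1 - s: p}

# Augmented transition kernel over (s, z): P_s(. | s, a, z) * P_z(. | z)
def P_aug(s, z, a):
    return {(s2, z2): P_s(s, a, z)[s2] * P_z[z][z2]
            for s2, z2 in itertools.product(S, Z)}

# Each row of the augmented kernel is a probability distribution determined
# solely by the augmented state (s, z) and action a -- the Markov property.
for s, z, a in itertools.product(S, Z, A):
    assert abs(sum(P_aug(s, z, a).values()) - 1.0) < 1e-12
print("augmented kernel rows sum to 1")
```

Dropping the $z$ component from the state (observing only $s$) destroys this property whenever $z$ is persistent, which is precisely the POMDP case in Proposition 2.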
Appendix C Proof of Lemma 1
Appendix D Proof of Lemma 2
Appendix E Proof of Theorem 2
Proof.
Let $g_t$ denote $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$, and let $R_t$ denote the return from step $t$. For any input-dependent baseline $b(s_t, z_{t:\infty})$, the variance of the policy gradient estimate is given by
$$\mathrm{Var}\Big[\sum_t g_t \big(R_t - b(s_t, z_{t:\infty})\big)\Big] = \mathbb{E}\Big[\Big(\sum_t g_t \big(R_t - b(s_t, z_{t:\infty})\big)\Big)^2\Big] - \Big(\mathbb{E}\Big[\sum_t g_t R_t\Big]\Big)^2.$$
Notice that the baseline is only involved in the last term in a quadratic form,
$$\mathbb{E}\big[g_t^2 \mid s_t, z_{t:\infty}\big]\, b(s_t, z_{t:\infty})^2 - 2\,\mathbb{E}\big[g_t^2 R_t \mid s_t, z_{t:\infty}\big]\, b(s_t, z_{t:\infty}),$$
where the second-order term is positive. To minimize the variance, we set the baseline to the minimizer of the quadratic equation, i.e.,
$$b^*(s_t, z_{t:\infty}) = \frac{\mathbb{E}\big[g_t^2 R_t \mid s_t, z_{t:\infty}\big]}{\mathbb{E}\big[g_t^2 \mid s_t, z_{t:\infty}\big]},$$
and hence the result follows. ∎
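As a numeric sanity check of the quadratic-minimizer step (the sample values of the score $g$ and return $R$ below are arbitrary stand-ins for the conditional distribution over actions at one fixed $(s_t, z_{t:\infty})$), $\mathbb{E}[g^2 (R - b)^2]$ is indeed minimized at $b^* = \mathbb{E}[g^2 R] / \mathbb{E}[g^2]$:

```python
# Arbitrary samples of the score g_t and return R_t (illustrative only).
g = [2.0, -2.0, 1.0, -1.0, 2.0]
R = [1.3, -0.7, 0.4, 2.1, -1.5]

def quad(b):
    """Empirical E[g^2 (R - b)^2]: the only baseline-dependent part of the variance."""
    return sum(gi * gi * (Ri - b) ** 2 for gi, Ri in zip(g, R)) / len(g)

# Minimizer of the quadratic: b* = E[g^2 R] / E[g^2]
b_star = sum(gi * gi * Ri for gi, Ri in zip(g, R)) / sum(gi * gi for gi in g)

mean_R = sum(R) / len(R)  # the naive "average return" baseline, for comparison
for b in (b_star - 0.5, b_star + 0.5, mean_R, 0.0):
    assert quad(b_star) <= quad(b) + 1e-12
print(round(b_star, 4))
```

Note that $b^*$ weights returns by the squared score, so it generally differs from the plain mean return whenever the score magnitudes vary across actions.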
Appendix F Input-Dependent Baseline for TRPO
We show that input-dependent baselines are bias-free for Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a).
Preliminaries. Stochastic gradient descent using Equation (1) does not guarantee consistent policy improvement in complex control problems. TRPO is an alternative approach that offers monotonic policy improvement, and it yields a practical algorithm with better sample efficiency and performance. TRPO maximizes a surrogate objective, subject to a KL divergence constraint:
$$\underset{\theta}{\text{maximize}} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, Q_{\theta_{\text{old}}}(s, a)\right] \tag{13}$$
$$\text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s)\big)\right] \leq \delta, \tag{14}$$
in which $\delta$ serves as a step size for the policy update. Using a baseline in the TRPO objective, i.e., replacing $Q_{\theta_{\text{old}}}(s, a)$ with $Q_{\theta_{\text{old}}}(s, a) - b(s)$, empirically improves policy performance (Schulman et al., 2015b).
Similar to Theorem 2, we generalize TRPO to input-driven environments, with $\rho_{\theta_{\text{old}}}(s, z_{t:\infty})$ denoting the discounted visitation frequency of the observation $s$ and input sequence $z_{t:\infty}$, and $Q_{\theta_{\text{old}}}(s, a, z_{t:\infty})$ the corresponding state-action value. The TRPO objective becomes $\mathbb{E}_{(s, z_{t:\infty}) \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\big[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, Q_{\theta_{\text{old}}}(s, a, z_{t:\infty})\big]$, and the constraint is $\mathbb{E}_{(s, z_{t:\infty}) \sim \rho_{\theta_{\text{old}}}}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s)\big)\big] \leq \delta$.
Theorem 3.
An input-dependent baseline $b(s, z_{t:\infty})$ does not change the optimal solution of the optimization problem in TRPO; that is, replacing $Q_{\theta_{\text{old}}}(s, a, z_{t:\infty})$ with $Q_{\theta_{\text{old}}}(s, a, z_{t:\infty}) - b(s, z_{t:\infty})$ in the objective leaves the constrained maximizer unchanged.
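The mechanism behind Theorem 3 can be seen in a toy discrete example (all numbers below are illustrative): for a fixed $(s, z)$, the surrogate evaluates to $\sum_a \pi_\theta(a \mid s)\,(Q - b)$, so a baseline that does not depend on the action shifts the objective by the constant $-b$ for every candidate policy and cannot change which one maximizes it.

```python
# One fixed (s, z); three discrete actions. Illustrative values, not from the paper.
Q = [1.0, -0.5, 0.25]        # Q(s, a, z) for each action
pi_old = [0.5, 0.3, 0.2]     # behavior policy pi_old(a | s)

def surrogate(pi_new, b=0.0):
    """TRPO surrogate E_{a~pi_old}[ pi_new(a)/pi_old(a) * (Q(a) - b) ]."""
    return sum(po * (pn / po) * (q - b) for po, pn, q in zip(pi_old, pi_new, Q))

# Candidate updated policies (stand-ins for the KL-feasible set)
candidates = [[0.8, 0.1, 0.1], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]]
b = 0.7  # an action-independent baseline value b(s, z)

best_plain = max(candidates, key=lambda p: surrogate(p))
best_base = max(candidates, key=lambda p: surrogate(p, b))
assert best_plain == best_base            # same argmax with and without baseline
for p in candidates:                      # objective shifted by exactly -b
    assert abs(surrogate(p, b) - (surrogate(p) - b)) < 1e-12
print(best_plain)
```

Since the KL constraint in Equation (14) does not involve $Q$ or $b$ at all, the feasible set is also unchanged, which is the other half of the argument.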