On the Generalization Gap in Reparameterizable Reinforcement Learning

05/29/2019 ∙ by Huan Wang, et al. ∙ 8

Understanding generalization in reinforcement learning (RL) is a significant challenge, as many common assumptions of traditional supervised learning theory do not apply. We focus on the special class of reparameterizable RL problems, where the trajectory distribution can be decomposed using the reparametrization trick. For this problem class, estimating the expected return is efficient and the trajectory can be computed deterministically given peripheral random variables, which enables us to study reparametrizable RL using supervised learning and transfer learning theory. Through these relationships, we derive guarantees on the gap between the expected and empirical return for both intrinsic and external errors, based on Rademacher complexity as well as the PAC-Bayes bound. Our bound suggests the generalization capability of reparameterizable RL is related to multiple factors including "smoothness" of the environment transition, reward and agent policy function class. We also empirically verify the relationship between the generalization gap and these factors through simulations.




1 Introduction

Reinforcement learning (RL) has proven successful in a series of applications such as games (Silver et al., 2016, 2017; Mnih et al., 2015; Vinyals et al., 2017; OpenAI, 2018), robotics (Kober et al., 2013), recommendation systems (Li et al., 2010; Shani et al., 2005), resource management (Mao et al., 2016; Mirhoseini et al., 2018), neural architecture design (Baker et al., 2017), and more. However, some key questions in reinforcement learning remain unsolved. One that draws increasing attention is the issue of overfitting in reinforcement learning (Sutton, 1995; Cobbe et al., 2018; Zhang et al., 2018b; Packer et al., 2018; Zhang et al., 2018a). A model that performs well in the training environment may or may not perform well when used in the testing environment. There is also a growing interest in understanding the conditions for model generalization and developing algorithms that improve generalization.

In general, we would like to measure how accurately an algorithm predicts on previously unseen data. One metric of interest is the gap between the training and testing loss or reward. It has been observed that such gaps are related to multiple factors: initial state distribution, environment transition, the level of “difficulty” in the environment, model architectures, and optimization. Zhang et al. (2018b) split randomly sampled initial states into training and testing and evaluated the performance gap in deep reinforcement learning. They empirically observed overfitting caused by the randomness of the environment, even if the initial distribution and the transition in the testing environment are kept the same as training. On the other hand, Farebrother et al. (2018); Justesen et al. (2018); Cobbe et al. (2018) allowed the test environment to vary from training, and observed huge differences in testing performance. Packer et al. (2018) also reported very different testing behaviors across models and algorithms, even for the same RL problem.

Although overfitting has been empirically observed in RL from time to time, theoretical guarantees on generalization, especially finite-sample guarantees, are still missing. In this work, we focus on on-policy RL, where agent policies are trained based on episodes of experience that are sampled “on-the-fly” using the current policy in training. We identify two major obstacles in the analysis of on-policy RL. First, the episode distribution keeps changing as the policy gets updated during optimization. Therefore, episodes have to be continuously redrawn from the new distribution induced by the updated policy during optimization. For finite-sample analysis, this leads to a process with complex dependencies. Second, state-of-the-art research on RL tends to mix the errors caused by randomness in the environment and shifts in the environment distribution. We argue that actually these two types of errors are very different. One, which we call intrinsic error, is analogous to overfitting in supervised learning, and the other, called external error, looks more like the errors in transfer learning.

Our key observation is that there exists a special class of RL, called reparameterizable RL, where randomness in the environment can be decoupled from the transition and initialization procedures via the reparameterization trick (Kingma and Welling, 2014). Through reparameterization, an episode’s dependency on the policy is “lifted” to the states. Hence, as the policy gets updated, episodes are deterministic given peripheral random variables. As a consequence, the expected reward in reparameterizable RL is connected to the Rademacher complexity as well as the PAC-Bayes bound. The reparameterization trick also makes the analysis for the second type of errors, i.e., when the environment distribution is shifted, much easier since the environment parameters are also “lifted” to the representation of states.

Related Work

Generalization in reinforcement learning has been investigated extensively, both theoretically and empirically. Theoretical work includes bandit analysis (Agarwal et al., 2014; Auer et al., 2002, 2009; Beygelzimer et al., 2011), Probably Approximately Correct (PAC) analysis (Jiang et al., 2017; Dann et al., 2017; Strehl et al., 2009; Lattimore and Hutter, 2014) as well as minimax analysis (Azar et al., 2017; Chakravorty and Hyland, 2003). Most works focus on the analysis of regret and consider the gap between the expected value and optimal return. On the empirical side, besides the previously mentioned work, Whiteson et al. (2011) propose generalized methodologies that are based on multiple environments sampled from a distribution. Nair et al. (2015) also use random starts to test generalization.

Other research has also examined generalization from a transfer learning perspective. Lazaric (2012); Taylor and Stone (2009); Zhan and Taylor (2015); Laroche (2017) examine model generalization across different learning tasks, and provide guarantees on asymptotic performance.

There are also works in robotics for transferring policy from simulator to real world and optimizing an internal model from data (Kearns and Singh, 2002), or works trying to solve abstracted or compressed MDPs (Majeed and Hutter, 2018).

Our Contributions:

  • A connection between (on-policy) reinforcement learning and supervised learning through the reparameterization trick. It simplifies the finite-sample analysis for RL, and yields Rademacher and PAC-Bayes bounds on Markov Decision Processes (MDP).

  • Identifying a class of reparameterizable RL and providing a simple bound for “smooth” environments and models with a limited number of parameters.

  • A guarantee for reparameterized RL when the environment is changed during testing. In particular we discuss two cases in environment shift: change in the initial distribution for the states, or the transition function.

2 Notation and Formulation

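We denote a Markov Decision Process (MDP) as a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \mathcal{P}_0)$. Here $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(s, a, s')$ is the transition probability from state $s$ to $s'$ when taking action $a$, $r(s)$ represents the reward function, and $\mathcal{P}_0(s)$ is the initial state distribution. Let $\pi \in \Pi : \mathcal{S} \to \mathcal{A}$ be the policy map that returns the action $a$ at state $s$.

We consider episodic MDPs with a finite horizon. Given the policy map $\pi$ and the transition probability $\mathcal{P}$, the state-to-state transition probability is $\mathcal{P}(s, \pi(s), s')$. Without loss of generality, the length of the episode is $T + 1$. We denote a sequence of states $[s_0, s_1, \ldots, s_T]$ as $s$. The total reward in an episode is $R(s) = \sum_{t=0}^{T} \gamma^t r_t$, where $\gamma \in (0, 1]$ is a discount factor and $r_t = r(s_t)$.

Denote the joint distribution of the sequence of states in an episode $s = [s_0, s_1, \ldots, s_T]$ as $\mathcal{D}_\pi$. Note $\mathcal{D}_\pi$ is also related to $\mathcal{P}$ and $\mathcal{P}_0$. In this work we assume $\mathcal{P}$ and $\mathcal{P}_0$ are fixed, so $\mathcal{D}_\pi$ is a function of $\pi$. Our goal is to find a policy that maximizes the expected total discounted reward (return):

$$\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{s \sim \mathcal{D}_\pi} R(s) = \arg\max_{\pi \in \Pi} \mathbb{E}_{s \sim \mathcal{D}_\pi} \sum_{t=0}^{T} \gamma^t r_t. \qquad (1)$$

Suppose during training we have a budget of $n$ episodes, then the empirical return is

$$\hat{\pi} = \arg\max_{\pi \in \Pi,\; s^i \sim \mathcal{D}_\pi} \frac{1}{n} \sum_{i=1}^{n} R(s^i), \qquad (2)$$

where $s^i = [s^i_0, s^i_1, \ldots, s^i_T]$ is the $i$th episode of length $T + 1$. We are interested in the generalization gap

$$\Phi = \left| \frac{1}{n} \sum_{i=1}^{n} R(s^i) - \mathbb{E}_{s \sim \mathcal{D}'_{\hat{\pi}}} R(s) \right|. \qquad (3)$$

Note that in (3) the distribution $\mathcal{D}'_{\hat{\pi}}$ may be different from $\mathcal{D}_{\hat{\pi}}$ since in the testing environment $\mathcal{P}'$ as well as $\mathcal{P}'_0$ may be shifted compared to the training environment.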
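The quantities of this section can be illustrated with a minimal Python sketch. It computes the discounted return of one episode, the empirical return over a batch of episodes, and a Monte Carlo stand-in for the gap in (3). The function names and the toy reward sequences are hypothetical and not part of the original formulation, and the expectation over test episodes is approximated here by an average over a larger sample.

```python
import random


def discounted_return(rewards, gamma):
    """Total discounted reward of one episode: sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def empirical_return(episodes, gamma):
    """Average discounted return over a list of per-episode reward sequences."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)


def generalization_gap(train_episodes, test_episodes, gamma):
    """Absolute difference between the training return and a Monte Carlo
    estimate of the test return (an approximation of the expectation)."""
    train = empirical_return(train_episodes, gamma)
    test = empirical_return(test_episodes, gamma)
    return abs(train - test)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical reward sequences standing in for sampled episodes.
    train = [[rng.random() for _ in range(5)] for _ in range(10)]
    test = [[rng.random() for _ in range(5)] for _ in range(1000)]
    print(generalization_gap(train, test, gamma=0.9))
```

With few training episodes the printed gap is typically noticeably larger than with many, which is the finite-sample effect studied in the following sections.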

3 Generalization in Reinforcement Learning vs. Supervised Learning

Generalization has been well studied in the supervised learning scenario. A popular assumption is that samples are independent and identically distributed, $z_i \sim \mathcal{D}$. Similar to empirical return maximization discussed in Section 2, in supervised learning a popular algorithm is empirical risk minimization:

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f, z_i), \qquad (4)$$

where $f \in \mathcal{F}$ is the prediction function to be learned and $\ell$ is the loss function. Similarly, generalization in supervised learning concerns the gap between the expected loss $\mathbb{E}_{z \sim \mathcal{D}}\, \ell(f, z)$ and the empirical loss $\frac{1}{n} \sum_{i=1}^{n} \ell(f, z_i)$.

It is easy to find the correspondence between the episodes defined in Section 2 and the samples in supervised learning. Just like supervised learning where $z_i \sim \mathcal{D}$, in (episodic) reinforcement learning $s^i \sim \mathcal{D}_\pi$. Also the reward function $R$ in reinforcement learning is similar to the loss function $\ell$ in supervised learning. However, reinforcement learning is different because

  • In supervised learning, the sample distribution $\mathcal{D}$ is kept fixed, and the loss function changes as we choose different predictors $f$.

  • In reinforcement learning, the reward function $R$ is kept fixed, but the sample distribution $\mathcal{D}_\pi$ changes as we choose different policy maps $\pi$.

As a consequence, the training procedure in reinforcement learning is also different. Popular methods such as REINFORCE (Williams, 1992), Q-learning (Sutton and Barto, 1998), and actor-critic methods (Mnih et al., 2016) draw new states and episodes on the fly as the policy $\pi$ is being updated. That is, the distribution $\mathcal{D}_\pi$ from which episodes are drawn always changes during optimization. In contrast, in supervised learning we only update the predictor $f$ without affecting the underlying sample distribution $\mathcal{D}$.

4 Intrinsic vs External Generalization Errors

The generalization gap (3) can be bounded

$$\Phi \le \left| \frac{1}{n} \sum_{i=1}^{n} R(s^i) - \mathbb{E}_{s \sim \mathcal{D}_{\hat{\pi}}} R(s) \right| + \left| \mathbb{E}_{s \sim \mathcal{D}_{\hat{\pi}}} R(s) - \mathbb{E}_{s \sim \mathcal{D}'_{\hat{\pi}}} R(s) \right| \qquad (5)$$

using the triangle inequality. The first term in (5) is the concentration error between the empirical reward and its expectation. Since it is caused by intrinsic randomness of the environment, we call it the intrinsic error. Even if the test environment shares the same distribution with training, in the finite-sample scenario there is still a gap between training and testing. This is analogous to the overfitting problem studied in supervised learning. Zhang et al. (2018b) mainly focus on this aspect of generalization. In particular, their randomness is carefully controlled in experiments to come only from the initial states.

We call the second term in (5) external error, as it is caused by shifts of the distribution in the environment. For example, the transition distribution or the initialization distribution may change during testing, which leads to a different underlying episode distribution $\mathcal{D}'_{\hat{\pi}}$. This is analogous to the transfer learning problem. For instance, generalization as in Cobbe et al. (2018) is mostly external error, since the numbers of levels used for training and testing are different even though the difficulty-level parameters are sampled from the same distribution. The setting in Packer et al. (2018) covers both intrinsic and external errors.

5 Why Intrinsic Generalization Error?

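If $\pi$ is fixed, by concentration of measure, as the number of episodes $n$ increases, the intrinsic error decreases roughly with $1/\sqrt{n}$. For example, if the reward is bounded, $|R(s)| \le c/2$, by McDiarmid’s bound, with probability at least $1 - \delta$,

$$\left| \frac{1}{n} \sum_{i=1}^{n} R(s^i) - \mathbb{E}_{s \sim \mathcal{D}_\pi} R(s) \right| \le c \sqrt{\frac{\log(2/\delta)}{2n}}, \qquad (6)$$

where $c > 0$. Note the bound above also holds for the test samples if the distribution $\mathcal{D}_\pi$ is fixed and the test episodes are drawn from it.

For the population argument (1), $\pi^*$ is defined deterministically since the value $\mathbb{E}_{s \sim \mathcal{D}_\pi} R(s)$ is a deterministic function of $\pi$. However, in the finite-sample case (2), the policy map $\hat{\pi}$ is stochastic: it depends on the samples $s^i$. As a consequence, the underlying distribution $\mathcal{D}_{\hat{\pi}}$ is not fixed. In that case, the expectation in (6) becomes a random variable, so (6) no longer holds.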

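One way of fixing the issue caused by the random $\hat{\pi}$ is to prove a bound that holds uniformly for all policies $\pi \in \Pi$. If $\Pi$ is finite, by applying a union bound, it follows that:

Lemma 1.

If $\Pi$ is finite, and $|R(s)| \le c/2$, then with probability at least $1 - \delta$, for all $\pi \in \Pi$

$$\left| \frac{1}{n} \sum_{i=1}^{n} R(s^i) - \mathbb{E}_{s \sim \mathcal{D}_\pi} R(s) \right| \le c \sqrt{\frac{\log(2|\Pi|/\delta)}{2n}}, \qquad (7)$$

where $|\Pi|$ is the cardinality of $\Pi$.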
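To give a feel for how the union bound enters, the following Python sketch evaluates the standard two-sided Hoeffding radius for a bounded return, and the same radius after splitting the failure probability over a finite policy class. It is a sketch under assumptions: the names `hoeffding_radius`, `union_bound_radius` and `value_range` are hypothetical, and the constants follow the textbook Hoeffding inequality for i.i.d. returns confined to an interval of width `value_range`.

```python
import math


def hoeffding_radius(n, delta, value_range):
    """Deviation bound for the mean of n i.i.d. returns lying in an interval
    of width value_range, valid with probability at least 1 - delta."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def union_bound_radius(n, delta, value_range, num_policies):
    """Same bound holding simultaneously for num_policies fixed policies:
    each policy receives a failure budget of delta / num_policies."""
    return hoeffding_radius(n, delta / num_policies, value_range)


if __name__ == "__main__":
    for num_policies in (1, 10, 1000, 10 ** 6):
        radius = union_bound_radius(1000, 0.05, 1.0, num_policies)
        print(num_policies, round(radius, 4))
```

The radius grows only logarithmically with the number of policies, but it is undefined for an infinite class, which is the difficulty discussed next.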

Unfortunately, in most applications $\Pi$ is not finite. One difficulty in analyzing the intrinsic generalization error is that the policy changes during the optimization procedure. This leads to a change in the episode distribution $\mathcal{D}_\pi$. Usually $\pi$ is updated using episodes generated from some “previous” distributions, and the updated policy is then used to generate new episodes. In this case it is not easy to split episodes into a training and testing set, since during optimization samples always come from the updated policy distribution.

6 Reparameterization Trick

The reparameterization trick has been popular in the optimization of deep networks (Kingma and Welling, 2014; Maddison et al., 2017; Jang et al., 2017; Tokui and Sato, 2016) and used, e.g., for the purpose of optimization efficiency. In RL, suppose the objective (1) is reparameterizable:

Then under some weak assumptions


The reparameterization trick has already been used: for example, PGPE (Rückstieß et al., 2010) uses policy reparameterization, and SVG (Heess et al., 2015) uses policy and environment dynamics reparameterization. In this work, we will show the reparameterization trick can help to analyze the generalization gap. More precisely, we will show that since both and are fixed, even if they are unknown, as long as they satisfy some “smoothness” assumptions, we can provide theoretical guarantees on the test performance.

7 Reparameterized MDP

We start our analysis with reparameterizing a Markov Decision Process with discrete states. We will give a general argument on reparameterizable RL in the next section. In this section we slightly abuse notation by letting $\mathcal{P}_0$ and $\mathcal{P}(s, a)$ denote $|\mathcal{S}|$-dimensional probability vectors for multinomial distributions for initialization and transition respectively.

One difficulty in the analysis of generalization in reinforcement learning arises from the sampling steps in the MDP, where states are drawn from multinomial distributions specified by either $\mathcal{P}_0$ or $\mathcal{P}(s, a)$, because the sampling procedure does not explicitly connect the states and the distribution parameters. We can use standard Gumbel random variables to reparameterize sampling and get a procedure equivalent to classical MDPs but with slightly different expressions, as shown in Algorithm 1.

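  Initialization: Sample $g_0, g_1, \ldots, g_T \sim \mathcal{G}^{|\mathcal{S}|}$. $s_0 = \arg\max(g_0 + \log \mathcal{P}_0)$, $R = 0$.
  for $t$ in $0, \ldots, T$ do
    $R = R + \gamma^t r(s_t)$
    if $t < T$: $s_{t+1} = \arg\max(g_{t+1} + \log \mathcal{P}(s_t, \pi(s_t)))$
  end for
  return $R$.
Algorithm 1 Reparameterized MDP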
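The Gumbel-max construction behind Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the helper names, the nested-list layout `transition[s][a]`, the tabular `policy` list, and the toy two-state MDP in the demo are all assumptions made for the example. The point it shows is that once the Gumbel noise is drawn, the return is a deterministic function of the policy.

```python
import math
import random


def sample_gumbel(rng, size):
    """Draw `size` independent standard Gumbel variables via the inverse CDF."""
    out = []
    while len(out) < size:
        u = rng.random()
        if u > 0.0:  # u == 0 would make log(u) undefined
            out.append(-math.log(-math.log(u)))
    return out


def gumbel_max_sample(probs, gumbel):
    """Index distributed according to `probs`, computed from pre-sampled noise."""
    scores = [g + math.log(p) if p > 0.0 else -math.inf
              for g, p in zip(gumbel, probs)]
    return max(range(len(scores)), key=lambda i: scores[i])


def rollout(init_probs, transition, policy, reward, gamma, noise):
    """Discounted return of one episode of length len(noise).

    transition[s][a] is the next-state probability vector, policy[s] the
    action taken in state s, and noise[t] the Gumbel vector used at step t.
    """
    state = gumbel_max_sample(init_probs, noise[0])
    total = 0.0
    for t in range(len(noise)):
        total += (gamma ** t) * reward[state]
        if t + 1 < len(noise):
            next_probs = transition[state][policy[state]]
            state = gumbel_max_sample(next_probs, noise[t + 1])
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical two-state, two-action MDP.
    init_probs = [0.5, 0.5]
    transition = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]
    reward = [0.0, 1.0]
    noise = [sample_gumbel(rng, 2) for _ in range(6)]
    for policy in ([0, 0], [1, 1]):
        # The same noise is reused; only the policy changes.
        print(policy, rollout(init_probs, transition, policy, reward, 0.9, noise))
```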

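In the reparameterized MDP procedure, $\mathcal{G}^{|\mathcal{S}|}$ is an $|\mathcal{S}|$-dimensional Gumbel distribution. $g_0, \ldots, g_T$ are $|\mathcal{S}|$-dimensional vectors with each entry being a Gumbel random variable. Also $g_0 + \log \mathcal{P}_0$ and $g_{t+1} + \log \mathcal{P}(s_t, \pi(s_t))$ are entry-wise vector sums, so they are both $|\mathcal{S}|$-dimensional vectors. $\arg\max(v)$ returns the index of the maximum entry in the $|\mathcal{S}|$-dimensional vector $v$. In the reparameterized MDP procedure shown above, the states $s_t$ are represented as an index in $\{1, 2, \ldots, |\mathcal{S}|\}$. After reparameterization, we may rewrite the RL objective (2), again with a slight abuse of notation, as:

$$\hat{\pi} = \arg\max_{\pi \in \Pi,\; g^i \sim \mathcal{G}^{|\mathcal{S}| \times (T+1)}} \frac{1}{n} \sum_{i=1}^{n} R(s^i(g^i; \pi)), \qquad (9)$$

where $g^i = [g^i_0, g^i_1, \ldots, g^i_T]$, each $g^i_t$ is an $|\mathcal{S}|$-dimensional Gumbel random variable, and

$$R(s^i(g^i; \pi)) = \sum_{t=0}^{T} \gamma^t r(s^i_t(g^i; \pi)) \qquad (10)$$

is the discounted return for one episode of length $T + 1$.

The reparameterized objective (9) maximizes the empirical reward by varying the policy $\pi$. The distribution from which the random variables $g^i$ are drawn does not depend on the policy $\pi$ anymore, and the policy only affects the reward through the states $s^i_t$.

The objective (9) is a discrete function due to the $\arg\max$ operator. One way to circumvent this is to use Gumbel softmax to approximate the $\arg\max$ operator (Maddison et al., 2017; Jang et al., 2017). If we denote $s$ as a one-hot vector in $\mathbb{R}^{|\mathcal{S}|}$, and further relax the entries in $s$ to take positive values that sum up to one, we may use the softmax to approximate the $\arg\max$ operator. For instance, the reparameterized initial-state distribution becomes:

$$s_0 = \mathrm{softmax}\!\left( \frac{g_0 + \log \mathcal{P}_0}{\tau} \right), \qquad (11)$$

where $g_0$ is an $|\mathcal{S}|$-dimensional Gumbel random variable, $\mathcal{P}_0$ is an $|\mathcal{S}|$-dimensional probability vector in multinomial distribution, and $\tau$ is a positive scalar. As the temperature $\tau \to 0$, the softmax approaches $\arg\max$ in terms of the one-hot vector representation.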
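The effect of the temperature in the Gumbel softmax relaxation can be checked numerically. The sketch below is illustrative only: the function name `gumbel_softmax`, the fixed noise vector and the probability vector are hypothetical, and all probabilities are assumed strictly positive so that their logarithms are finite.

```python
import math


def gumbel_softmax(probs, gumbel, temperature):
    """Relaxed one-hot sample: softmax((g + log p) / temperature).

    Assumes every entry of probs is strictly positive.
    """
    logits = [(g + math.log(p)) / temperature for g, p in zip(gumbel, probs)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    probs = [0.2, 0.3, 0.5]      # hypothetical initial-state distribution
    gumbel = [0.1, 1.5, -0.3]    # one fixed draw of Gumbel noise
    for temperature in (100.0, 1.0, 0.01):
        relaxed = gumbel_softmax(probs, gumbel, temperature)
        print(temperature, [round(x, 3) for x in relaxed])
```

For a large temperature the output is close to uniform, while for a small temperature it concentrates on the entry that the argmax would select.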

8 Reparameterizable RL

In general, as long as the transition and initialization process can be reparameterized so that the environment parameters are separated from the random variables, the objective can always be reformulated so that the policy only affects the reward instead of the underlying distribution. The reparameterizable RL procedure is shown in Algorithm 2.

  Initialization: Sample $\xi_0, \xi_1, \ldots, \xi_T$. $s_0 = \mathcal{I}(\xi_0)$, $R = 0$.
  for $t$ in $0, \ldots, T$ do
    $R = R + \gamma^t r(s_t)$
    if $t < T$: $s_{t+1} = \mathcal{T}(s_t, \pi(s_t), \xi_{t+1})$
  end for
  return $R$.
Algorithm 2 Reparameterizable RL

In this procedure, the $\xi$s are $K$-dimensional random variables, but they are not necessarily sampled from the same distribution. (They may also have different dimensions. In this work, without loss of generality, we assume the random variables have the same dimension $K$.) In many scenarios they are treated as random noise. $\mathcal{I}$ is the initialization function. During initialization, the random variable $\xi_0$ is taken as input and the output is an initial state $s_0$. The transition function $\mathcal{T}$ takes the current state $s_t$, the action produced by the policy $\pi(s_t)$, and a random variable $\xi_{t+1}$ to produce the next state $s_{t+1}$.

In reparameterizable RL, the peripheral random variables can be sampled before the episode is generated. In this way, the randomness is decoupled from the policy function, and as the policy gets updated, the episodes can be computed deterministically.
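The decoupling described above can be made concrete with a small Python sketch in which the noise is drawn once and then reused for several policies. Everything specific here is an assumption made for illustration: the scalar state, the linear policy `theta * state`, and the toy initialization, transition and reward functions are hypothetical stand-ins for the initialization, transition and reward of a reparameterizable environment.

```python
import random


def init_fn(xi):
    """Toy initialization: the initial state is the first noise variable."""
    return xi


def transition_fn(state, action, xi):
    """Toy smooth transition driven by the state, the action and the noise."""
    return 0.5 * state + action + 0.1 * xi


def reward_fn(state):
    """Toy reward that penalizes distance from the origin."""
    return -state * state


def rollout(theta, noise, gamma):
    """Return of one episode, deterministic given the noise and theta."""
    state = init_fn(noise[0])
    total = 0.0
    for t in range(len(noise)):
        total += (gamma ** t) * reward_fn(state)
        if t + 1 < len(noise):
            action = theta * state  # hypothetical linear policy
            state = transition_fn(state, action, noise[t + 1])
    return total


def empirical_return(theta, noises, gamma=0.9):
    """Average return of the policy theta over pre-sampled noise sequences."""
    return sum(rollout(theta, noise, gamma) for noise in noises) / len(noises)


if __name__ == "__main__":
    rng = random.Random(0)
    # Peripheral random variables are sampled before any episode is generated.
    noises = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(100)]
    for theta in (0.0, -0.25, -0.5):
        print(theta, empirical_return(theta, noises))
```

Because the noise is fixed, `empirical_return` is an ordinary deterministic function of `theta`, which is what allows it to be treated like an empirical risk.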

The class of reparameterizable RL problems includes those whose initial state, transition, reward and optimal policy distribution can be reparameterized. Generally, a distribution can be reparameterized, e.g., if it has a tractable inverse CDF, is a composition of reparameterizable distributions (Kingma and Welling, 2014), or is a limit of smooth approximators (Maddison et al., 2017; Jang et al., 2017). Reparameterizable RL settings include LQR (Lewis et al., 1995) and physical systems (e.g., robotics) where the dynamics are given by stochastic partial differential equations (PDEs) with reparameterizable components over continuous state-action spaces.

9 Main Result

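For reparameterizable RL, if the environments and the policy are “smooth”, we can control the error between the expected and the empirical reward. In particular, we make the following assumptions, where $\|\cdot\|$ denotes the vector norm and $\theta$ is the policy parameter vector.

Assumption 1.

$\mathcal{T}(s, a, \xi)$ is $L_{t1}$-Lipschitz in terms of the first variable $s$, and $L_{t2}$-Lipschitz in terms of the second variable $a$. That is, for all $s, s', a, a', \xi$,

$$\|\mathcal{T}(s, a, \xi) - \mathcal{T}(s', a, \xi)\| \le L_{t1} \|s - s'\|, \qquad \|\mathcal{T}(s, a, \xi) - \mathcal{T}(s, a', \xi)\| \le L_{t2} \|a - a'\|.$$

Assumption 2.

The policy is parameterized as $\pi(s; \theta)$, and is $L_{\pi 1}$-Lipschitz in terms of the states, and $L_{\pi 2}$-Lipschitz in terms of the parameter $\theta$, that is,

$$\|\pi(s; \theta) - \pi(s'; \theta)\| \le L_{\pi 1} \|s - s'\|, \qquad \|\pi(s; \theta) - \pi(s; \theta')\| \le L_{\pi 2} \|\theta - \theta'\|.$$

Assumption 3.

The reward is $L_r$-Lipschitz:

$$|r(s) - r(s')| \le L_r \|s - s'\|.$$

If assumptions (1), (2) and (3) hold, we have the following: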

Theorem 1.

In reparameterizable RL, suppose the transition in the test environment satisfies , and suppose the initialization function in the test environment satisfies . If assumptions (1), (2) and (3) hold, the peripheral random variables for each episode are i.i.d., and the reward is bounded , then with probability at least , for all policies :

where , and is the Rademacher complexity of under the training transition , the training initialization , and is the number of training episodes.

Note the i.i.d. assumption on the peripheral variables is across episodes. Within the same episode, there could be correlations among the s at different time steps.

Similar arguments can also be made when the transition in the test environment stays the same as , but the initialization is different from . In the following sections we will bound the intrinsic and external errors respectively.

10 Bounding Intrinsic Generalization Error

After reparameterization, the objective (9) is essentially the same as an empirical risk minimization problem in the supervised learning scenario. According to classical learning theory, the following lemma is straight-forward (Shalev-Shwartz and Ben-David, 2014):

Lemma 2.

If the reward is bounded, , and are i.i.d. for each episode, with probability at least , for :


where is the Rademacher complexity of .

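The bound (12) holds uniformly for all $\pi \in \Pi$, so it also holds for $\hat{\pi}$. Unfortunately, in MDPs the Rademacher complexity is hard to control, mainly due to the recursive $\arg\max$ in the representation of the states, $s_{t+1} = \arg\max(g_{t+1} + \log \mathcal{P}(s_t, \pi(s_t)))$.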

On the other hand, for general reparameterizable RL we may control the intrinsic generalization gap by assuming some “smoothness” conditions on the transitions , as well as the policy . In particular, it is straight-forward to prove that the empirical return is “smooth” if the transitions and policies are all Lipschitz.

Lemma 3.

For reparameterizable RL, given assumptions 1, 2, and 3, the empirical return defined in (10), as a function of the parameter , has a Lipschitz constant of


where .

Also, if the number of parameters in is bounded, then the Rademacher complexity in Lemma 2 can be controlled (van der Vaart., 1998; Bartlett, 2013).

Lemma 4.

For reparameterizable RL, given assumptions 1, 2, and 3, if the parameters is bounded such that , and the function class of the reparameterized reward is closed under negations, then the Rademacher complexity is bounded by


where is the Lipschitz constant defined in (13), and is the number of episodes.

In the context of deep learning, deep neural networks are over-parameterized models that have proven to work well in many applications. However, the bound above does not explain why over-parameterized models also generalize well, since the Rademacher complexity bound (14) can be extremely large as the number of parameters grows. To ameliorate this issue, Arora et al. (2018) recently proposed a compression approach that compresses a neural network to a smaller one with fewer parameters but roughly the same training errors. Whether this also applies to reparameterizable RL is yet to be proven. There are also trajectory-based techniques proposed to sharpen the generalization bound (Li et al., 2018; Allen-Zhu et al., 2018; Arora et al., 2019; Cao and Gu, 2019).

10.1 PAC-Bayes Bound on Reparameterizable RL

We can also analyze the Rademacher complexity of the empirical return by making a slightly different assumption on the policy. Suppose $\pi$ is parameterized as $\pi(\theta)$, and $\theta$ is sampled from some posterior distribution $Q$. According to the PAC-Bayes theorem (McAllester, 1998, 2003; Neyshabur et al., 2018; Langford and Shawe-Taylor, 2002):

Lemma 5.

Given a “prior” distribution , with probability at least over the draw of episodes, :


where is the expected “Bayesian” reward.

The bound (15) holds for all posterior distributions $Q$. In particular it holds if $Q$ is the distribution of $\hat{\theta} + u$, where $\hat{\theta}$ could be any solution provided by empirical return maximization, and $u$ is a perturbation, e.g., a zero-centered uniform or Gaussian distribution. This suggests that maximizing a perturbed objective instead may lead to better generalization performance, which has already been observed empirically (Wang et al., 2018b).

The tricky part about perturbing the policy is choosing the level of noise. Suppose there is an empirical reward optimizer $\hat{\theta}$. When the noise level is small, the first term in (15) is large, but the second term may also be large since the posterior is too focused on $\hat{\theta}$ while the “prior” cannot depend on $\hat{\theta}$, and vice versa. On the other hand, if the reward function is “nice”, e.g., if some “smoothness” assumption holds in a local neighborhood of $\hat{\theta}$, then one can prove the optimal noise level roughly scales inversely as the square root of the local Hessian diagonals (Wang et al., 2018a).
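A perturbed objective of the kind suggested by the PAC-Bayes view can be estimated by Monte Carlo. The sketch below is a minimal illustration under assumptions: the function name `perturbed_objective`, the isotropic Gaussian perturbation with a single noise level `sigma`, and the quadratic toy objective are hypothetical choices, not the procedure of the cited works.

```python
import random


def perturbed_objective(objective, theta, sigma, num_samples, rng):
    """Monte Carlo estimate of the objective averaged over Gaussian
    perturbations of the parameter vector theta with noise level sigma."""
    total = 0.0
    for _ in range(num_samples):
        noisy = [t + rng.gauss(0.0, sigma) for t in theta]
        total += objective(noisy)
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_return(theta):
        # Hypothetical smooth return with its maximum at the origin.
        return -sum(x * x for x in theta)

    for sigma in (0.0, 0.1, 1.0):
        print(sigma, perturbed_objective(toy_return, [0.0, 0.0], sigma, 5000, rng))
```

A larger noise level lowers the averaged return at the optimizer of the toy objective, which mirrors the trade-off between the two terms discussed above.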

11 Bounding External Generalization Error

Another source of generalization error in RL comes from the change of environment. For example, in an MDP , the transition probability or the initialization distribution is different in the test environment. Cobbe et al. (2018) and Packer et al. (2018) show that as the distribution of the environment varies the gap between the training and testing could be huge.

Indeed, if the test distribution is drastically different from the training environment, there is no guarantee that the same model will perform well in testing. On the other hand, if the test distribution is not too far away from the training distribution then the test error can still be controlled. For example, for supervised learning, Mohri and Medina (2012) prove the expected loss of a drifting distribution is also bounded. In addition to Rademacher complexity and a concentration tail, there is one more term in the gap that measures the discrepancy between the training and testing distribution.

For reparameterizable RL, since the environment parameters are lifted into the reward function in the reformulated objective (9), the analysis becomes easier. For MDPs, a small change in environment could cause a large difference in the reward since $\arg\max$ is not continuous. However, if the transition function is “smooth”, the expected reward in the new environment can also be controlled, e.g., if we assume the transition function, the reward function, and the policy function are all Lipschitz, as in Section 10.

If the transition function is the same in the test environment and the only difference is the initialization, we can prove the following lemma:

Lemma 6.

In reparameterizable RL, suppose the initialization function in the test environment satisfies for , and the transition function in the test environment is the same as training. If assumptions (1), (2), and (3) hold, then:


Lemma 6 means that if the initialization in the test environment is not too different from the training one, and if the transition, policy and reward functions are smooth, then the expected reward in the test environment won’t deviate from that of training too much.

The other possible environment change is that the test initialization stays the same but the transition changes from the training transition to . Similar to before, we have:

Lemma 7.

In reparameterizable RL, suppose the transition in the test environment satisfies , and the initialization in the test environment is the same as training. If assumptions (1), (2) and (3) hold then


where .

The difference between (18) and (17) is that the change in transition is further enlarged during an episode: as long as , the gap in (18) is larger and can become huge as the length of the episode increases.

12 Simulation

Table 1: Intrinsic Gap versus Smoothness. Columns: Temperature; Policy (gap and smoothness metric); State (gap); Action (gap).

We now present empirical measurements in simulations to verify some claims made in section 10 and 11. The bound (14) suggests the gap between the expected reward and the empirical reward is related to the Lipschitz constant of , which according to equation (13) is related to the Lipschitz constant of a series of functions including , , and .

12.1 Intrinsic Generalization Gap

In (13), as the length of the episode increases, the dominating factors in are , and . Our first simulation fixes the environment and verifies . In the simulation, we assume the initialization and the transition are all known and fixed. is an identity function, and

is a vector of i.i.d. uniformly distributed random variables:

. The transit function is , where , , are row vectors, and , , and are matrices used to project the states, actions, and noise respectively. , , and are randomly generated and then kept fixed during the experiment. We use as the discounting constant throughout.

The policy is modeled using a multilayer perceptron (MLP) with rectified linear units as the activation. The last layer of the MLP is a linear layer followed by a softmax function with temperature:


By varying the temperature we are able to control the Lipschitz constant of the policy class and if we assume the bound on the parameters is unchanged.

We set the length of the episode , and randomly sample for training and testing episodes. Then we use the same random noise to evaluate a series of policy classes with different temperatures .

Since we assume the initialization and the transition are known, during training the computation graph is complete. Hence we can directly optimize the coefficients in the policy just as in supervised learning. (In real applications this is not doable since the initialization and the transition are unknown. Here we assume they are known just to investigate the generalization gap.) We use Adam (Kingma and Ba, 2015) to optimize with initial learning rates 1e-2 and 1e-3. When the reward stops increasing we halve the learning rate, and we analyze the gap between the average training and testing reward.

First, we observe the gap is affected by the optimization procedure. For example, different learning rates can lead to different local optima, even if we decrease the learning rate by half when the reward does not increase. Second, even if we know the environment, so that we can optimize the policy directly, we still experience unstable learning just like other RL algorithms. This suggests that the instability of RL algorithms may not arise from the estimation of the environment for the model based algorithms such as A2C and A3C (Mnih et al., 2016), since even if we know the environment the learning is still unstable.

Given the unstable training procedure, for each trial we ran the training for a fixed number of epochs with learning rates of 1e-2 and 1e-3, and the run with the higher training reward at the last epoch is used for reporting. Ideally, as we vary the temperature, the Lipschitz constant for the policy function class changes accordingly, given the assumption that the parameters are bounded. However, it is unclear whether the parameter norm changes across configurations. After all, the assumption that the parameters are bounded is artificial. To ameliorate this defect we also check a metric computed from $W_l$, the weight matrix of the $l$th layer of the MLP. In our experiment there is no bias term in the linear layers of the MLP, so this metric can be used as a measure of the Lipschitz constant at the solution point. We also vary the smoothness in the transition function as a function of states and actions, by applying softmax with different temperatures to the singular values of the randomly generated matrix.
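One common way to turn layer weights into a Lipschitz-type metric is the product of their spectral norms, which upper-bounds the Lipschitz constant (in the Euclidean norm) of a bias-free MLP with 1-Lipschitz activations such as the rectified linear unit. The sketch below shows that computation with NumPy. It is offered as an assumption about a plausible form of such a metric, not as the exact quantity used in the experiments, and the function name and the small weight matrices are hypothetical.

```python
import numpy as np


def spectral_norm_product(weights):
    """Product of the largest singular values of the layer weight matrices."""
    norms = [np.linalg.norm(w, ord=2) for w in weights]
    return float(np.prod(norms))


if __name__ == "__main__":
    # Hypothetical weights of a two-layer, bias-free MLP.
    w1 = np.array([[2.0, 0.0], [0.0, 3.0]])
    w2 = np.array([[0.5, 0.0], [0.0, 0.25]])
    print(spectral_norm_product([w1, w2]))
```

Dividing the final-layer logits by a softmax temperature scales the last factor accordingly, which is one way the temperature can influence such a metric.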

Table 1 shows the average generalization gap roughly decreases as decreases. The metric also decreases similarly as the average gap. In particular, the 2nd and 3rd column shows the average gap as the policy becomes “smoother”. The 4th column shows, if we fix the policy- as well as setting , the generalization gap decreases as we increase the transition- for (states). Similarly the last column is the gap as the transition- for actions () varies. In Table 4 the environment is fixed and for each parameter configuration the gap is averaged from trials with randomly initialized and then optimized policies.

Params   65.6k   131.3k   263.2k   583.4k   1.1m
Gap      0.204   0.183    0.214    0.336    0.418
Table 2: Empirical gap vs #policy params.

Shift magnitude   1       10      100     1,000
Gap               0.481   0.477   0.659   0.532
Table 3: Empirical generalization gap vs shift in initialization.

Shift magnitude   1    10    100     1,000
Gap               11   451   8,260   73,300
Table 4: Empirical generalization gap vs shift in transition.

12.2 External Generalization Gap

To measure the external generalization gap, we vary the transition as well as the initialization in the test environment. For that, we add a vector of Rademacher random variables, scaled to a prescribed magnitude, to the transition or to the initialization. We adjust the level of noise in the simulation and report the change of the average gap in Table 3 and Table 4. It is not surprising that the change in transition leads to a higher generalization gap, since the impact of the shift is accumulated across time steps. Indeed, comparing the bounds (18) and (17) shows that, under the condition stated in Section 11, the gap in (18) is larger.
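The perturbation itself is simple to generate. The following Python sketch draws a Rademacher vector and rescales it so that its Euclidean norm equals a chosen magnitude. The function name `rademacher_shift`, the parameter `magnitude`, and the use of the Euclidean norm are assumptions made for illustration.

```python
import math
import random


def rademacher_shift(rng, size, magnitude):
    """Vector of independent +1/-1 signs, rescaled to Euclidean norm `magnitude`."""
    scale = magnitude / math.sqrt(size)
    return [scale * rng.choice((-1.0, 1.0)) for _ in range(size)]


def apply_shift(values, shift):
    """Entry-wise sum of an environment output and the shift vector."""
    return [v + s for v, s in zip(values, shift)]


if __name__ == "__main__":
    rng = random.Random(0)
    shift = rademacher_shift(rng, 4, 2.0)
    # Hypothetical next state produced by the training transition.
    next_state = [0.3, -1.2, 0.0, 0.7]
    print(apply_shift(next_state, shift))
```

Adding the same kind of shift to the initialization affects only the first state, whereas adding it to the transition injects a new deviation at every time step.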

13 Discussion and Future Work

Even though a variety of distributions, discrete or continuous, can be reparameterized, and we have shown that the classical MDP with discrete states is reparameterizable, it is not clear in general under which conditions reinforcement learning problems are reparameterizable. Classifying particular cases where RL is not reparameterizable is an interesting direction for future work. Second, the transitions of discrete MDPs are inherently non-smooth, so Theorem 1 does not apply. In this case, the PAC-Bayes bound can be applied, but this requires a totally different framework. It will be interesting to see if there is a “Bayesian” version of Theorem 1. Finally, our analysis only covers “on-policy” RL. Studying generalization for “off-policy” RL remains an interesting future topic.


References

  • A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire (2014) Taming the monster: a fast and simple algorithm for contextual bandits. International Conference on Machine Learning.
  • Z. Allen-Zhu, Y. Li, and Y. Liang (2018) Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR abs/1811.04918.
  • S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. International Conference on Machine Learning.
  • S. Arora, R. Ge, B. Neyshabur, and Y. Zhang (2018) Stronger generalization bounds for deep nets via a compression approach. International Conference on Machine Learning.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning.
  • P. Auer, T. Jaksch, and R. Ortner (2009) Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems 21.
  • M. G. Azar, I. Osband, and R. Munos (2017) Minimax regret bounds for reinforcement learning. International Conference on Machine Learning.
  • B. Baker, O. Gupta, N. Naik, and R. Raskar (2017) Designing neural network architectures using reinforcement learning.
  • P. Bartlett (2013) Lecture notes on theoretical statistics.
  • A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. Schapire (2011) Contextual bandit algorithms with supervised learning guarantees. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
  • Y. Cao and Q. Gu (2019) A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. CoRR abs/1902.01384.
  • S. Chakravorty and D. C. Hyland (2003) Minimax reinforcement learning. American Institute of Aeronautics and Astronautics.
  • K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman (2018) Quantifying generalization in reinforcement learning. CoRR.
  • C. Dann, T. Lattimore, and E. Brunskill (2017) Unifying PAC and regret: uniform PAC bounds for episodic reinforcement learning. International Conference on Neural Information Processing Systems (NIPS).
  • J. Farebrother, M. C. Machado, and M. Bowling (2018) Generalization and regularization in DQN. CoRR.
  • N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa (2015) Learning continuous control policies by stochastic value gradients. Advances in Neural Information Processing Systems.
  • E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations.
  • N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire (2017) Contextual decision processes with low Bellman rank are PAC-learnable. International Conference on Machine Learning.
  • N. Justesen, R. R. Torrado, P. Bontrager, A. Khalifa, J. Togelius, and S. Risi (2018) Illuminating generalization in deep reinforcement learning through procedural level generation. NeurIPS Deep RL Workshop.
  • M. Kearns and S. Singh (2002) Near-optimal reinforcement learning in polynomial time. Machine Learning.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. International Conference on Learning Representations.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations.
  • J. Kober, J. A. Bagnell, and J. Peters (2013) Reinforcement learning in robotics: a survey. International Journal of Robotics Research.
  • J. Langford and J. Shawe-Taylor (2002) PAC-Bayes & margins. International Conference on Neural Information Processing Systems (NIPS).
  • R. Laroche (2017) Transfer reinforcement learning with shared dynamics.
  • T. Lattimore and M. Hutter (2014) Near-optimal PAC bounds for discounted MDPs. Theoretical Computer Science.
  • A. Lazaric (2012) Transfer in reinforcement learning: a framework and a survey. Reinforcement Learning - State of the Art, Springer.
  • F. L. Lewis and V. L. Syrmos (1995) Optimal control. A Wiley-Interscience publication, Wiley. ISBN 9780471033783.
  • L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th International Conference on World Wide Web.
  • Y. Li, T. Ma, and H. Zhang (2018)
  • C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. International Conference on Learning Representations.
  • S. J. Majeed and M. Hutter (2018) Performance guarantees for homomorphisms beyond Markov decision processes. CoRR abs/1811.03895.
  • H. Mao, M. Alizadeh, I. Menache, and S. Kandula (2016) Resource management with deep reinforcement learning.
  • D. A. McAllester (1998) Some PAC-Bayesian theorems. Conference on Learning Theory (COLT).
  • D. A. McAllester (2003) Simplified PAC-Bayesian margin bounds. Conference on Learning Theory (COLT).
  • A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean (2018) Hierarchical planning for device placement.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature.
  • M. Mohri and A. M. Medina (2012) New analysis and algorithm for learning with drifting distributions. Algorithmic Learning Theory.
  • A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. D. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver (2015) Massively parallel methods for deep reinforcement learning. CoRR abs/1507.04296.
  • B. Neyshabur, S. Bhojanapalli, and N. Srebro (2018) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. International Conference on Learning Representations (ICLR).
  • OpenAI (2018) OpenAI Five. https://blog.openai.com/openai-five/
  • C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song (2018) Assessing generalization in deep reinforcement learning. CoRR.
  • T. Rückstieß, F. Sehnke, T. Schaul, D. Wierstra, Y. Sun, and J. Schmidhuber (2010) Exploring parameter space in reinforcement learning. Paladyn.
  • S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press, New York, NY, USA. ISBN 1107057132, 9781107057135.
  • G. Shani, R. I. Brafman, and D. Heckerman (2005) An MDP-based recommender system. The Journal of Machine Learning Research.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature.
  • D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR.
  • A. L. Strehl, L. Li, and M. L. Littman (2009) Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research.
  • R. Sutton and A. Barto (1998) Reinforcement learning: an introduction. MIT Press.
  • R. S. Sutton (1995) Generalization in reinforcement learning: successful examples using sparse coarse coding.
  • M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research.
  • S. Tokui and I. Sato (2016) Reparameterization trick for discrete variables. CoRR.
  • A. van der Vaart (1998) Asymptotic statistics. Cambridge.
  • O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. P. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, and R. Tsing (2017) StarCraft II: a new challenge for reinforcement learning. CoRR.
  • H. Wang, N. S. Keskar, C. Xiong, and R. Socher (2018a) Identifying generalization properties in neural networks.
  • J. Wang, Y. Liu, and B. Li (2018b) Reinforcement learning with perturbed rewards. CoRR abs/1810.01032.
  • S. Whiteson, B. Tanner, M. E. Taylor, and P. Stone (2011) Protecting against evaluation overfitting in empirical reinforcement learning. 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
  • Y. Zhan and M. E. Taylor (2015) Online transfer learning in reinforcement learning domains. CoRR abs/1507.00436.
  • A. Zhang, N. Ballas, and J. Pineau (2018a) A dissection of overfitting and generalization in continuous reinforcement learning. CoRR.
  • C. Zhang, O. Vinyals, R. Munos, and S. Bengio (2018b) A study on overfitting in deep reinforcement learning. CoRR.

Appendix A Proof of Lemma 3


For Reparameterizable RL, given assumptions 1, 2, and 3, the empirical reward defined in (10), as a function of the parameter , has a Lipschitz constant of

where .


Let’s denote , and . We start by investigating the policy function across different time steps:


The first inequality is the triangle inequality, and the second is from our Lipschitz assumption 2.

If we look at the change of states as the episode proceeds:


Now combine both (19) and (20),

In the initialization, we know since the initialization process does not involve any computation using the parameter in the policy .

By recursion, we get

where .
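The recursion step can be checked numerically. The sketch below assumes the recursion has the linear form x_{t+1} = nu * x_t + c with x_0 = 0, where x_t stands for the bound on the state difference at step t; the names nu and c and their values are placeholders chosen for illustration. Under that assumption, unrolling the recursion agrees with the geometric-sum closed form.

```python
def unrolled_bound(nu, c, steps):
    """Iterate x_{t+1} = nu * x_t + c from x_0 = 0 and return [x_0, ..., x_steps]."""
    values = [0.0]
    for _ in range(steps):
        values.append(nu * values[-1] + c)
    return values


def closed_form_bound(nu, c, t):
    """Closed form of the same recursion: c * (nu**t - 1) / (nu - 1), or c * t if nu == 1."""
    if nu == 1.0:
        return c * t
    return c * (nu ** t - 1.0) / (nu - 1.0)


nu, c = 1.5, 0.2  # hypothetical constants, for illustration only
unrolled = unrolled_bound(nu, c, steps=10)
closed = [closed_form_bound(nu, c, t) for t in range(11)]
max_error = max(abs(u - v) for u, v in zip(unrolled, closed))
```

When nu exceeds 1 the bound grows geometrically with the episode length, which is consistent with perturbations to the transition having a larger effect than perturbations to the initialization.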

By assumption 3, is -Lipschitz, so

So the reward
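The chain of inequalities in this proof can be sketched in LaTeX as follows. The notation is an assumption introduced for illustration and may differ from the symbols used in the lemma: the policy is taken to be $P_s$-Lipschitz in the state and $P_\theta$-Lipschitz in the parameter, the transition $K_s$-Lipschitz in the state and $K_a$-Lipschitz in the action, the reward a $K_r$-Lipschitz function of the state, and primed quantities belong to a second trajectory generated with parameter $\theta'$ and the same peripheral random variables.

```latex
% Assumed notation (illustrative, not taken verbatim from the lemma):
%   P_s, P_theta : Lipschitz constants of the policy in state and parameter
%   K_s, K_a     : Lipschitz constants of the transition in state and action
%   K_r          : Lipschitz constant of the reward in the state
\begin{align*}
\|a_t - a'_t\| &\le P_s \|s_t - s'_t\| + P_\theta \|\theta - \theta'\|, \\
\|s_{t+1} - s'_{t+1}\| &\le K_s \|s_t - s'_t\| + K_a \|a_t - a'_t\| \\
  &\le \nu \|s_t - s'_t\| + K_a P_\theta \|\theta - \theta'\|,
  \qquad \nu = K_s + K_a P_s, \\
\|s_t - s'_t\| &\le K_a P_\theta \, \frac{\nu^t - 1}{\nu - 1} \, \|\theta - \theta'\|
  \qquad (s_0 = s'_0,\ \nu \neq 1), \\
\Bigl| \sum_{t=0}^{T} \gamma^t r(s_t) - \sum_{t=0}^{T} \gamma^t r(s'_t) \Bigr|
  &\le \sum_{t=0}^{T} \gamma^t K_r \|s_t - s'_t\|
  \le \Bigl( \sum_{t=0}^{T} \gamma^t K_r K_a P_\theta \, \frac{\nu^t - 1}{\nu - 1} \Bigr) \|\theta - \theta'\| .
\end{align*}
```

Under these assumptions the bracketed sum in the last line plays the role of the Lipschitz constant of the discounted return with respect to the policy parameter.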

Appendix B Proof of Lemma 6


In reparameterizable RL, suppose the initialization function in the test environment satisfies