
Provably efficient reconstruction of policy networks

by   Bogdan Mazoure, et al.

Recent research has shown that learning policies parametrized by large neural networks can achieve significant success on challenging reinforcement learning problems. However, when memory is limited, it is not always possible to store such models exactly for inference, and compressing the policy into a compact representation might be necessary. We propose a general framework for policy representation, which reduces this problem to finding a low-dimensional embedding of a given density function in a separable inner product space. Our framework allows us to derive strong theoretical guarantees, controlling the error of the reconstructed policies. Such guarantees are typically lacking in black-box models, but are very desirable in risk-sensitive tasks. Our experimental results suggest that the reconstructed policies can use less than 10% of the number of parameters in the original networks with no decrease in rewards.



1 Introduction

In the reinforcement learning (RL) framework, the goal of a rational agent is to maximize the expected rewards in a dynamical system by finding a suitable conditional distribution known as a policy. This conditional distribution can be found using policy iteration and any suitable function approximator, ranging from linear models to neural networks. Due to their high representation power and efficient scalability, neural networks are often used for this purpose (Lillicrap et al., 2015; Schulman et al., 2015). They have shown great success in simulated environments such as games (e.g. Atari) and robotic control (e.g. MuJoCo).

However, in situations with limited memory capacity, e.g. deployment on mobile devices (Liu et al., 2018) or field drones, finding a lower-dimensional embedding of a policy network with bounds on its performance and error rate is challenging. A typical policy network can contain redundant parameters that must be pruned to save storage, a process for which knowledge distillation (Rusu et al., 2015) is not well-suited due to its lack of guarantees. Other parameter reduction attempts include Mazoure et al. (2019) and Doan et al. (2019), but the authors do not provide theoretical guarantees or interpretations for this improvement.

It is in fact crucial to have theoretical guarantees when conducting policy reconstruction, especially for real-world applications. In simulated environments, the agent can explore freely and interact without incurring unexpected consequences; this is not the case in real-life scenarios. In many risk-sensitive domains, failures can cause severe damage, either in terms of human lives (robotic surgery (Ubee et al., 2019), self-driving cars (Shalev-Shwartz et al., 2017), air traffic control (Kim, 2015; Moon et al., 2011)) or financial consequences (warehouse robots (Bogue, 2017), advertisement campaigns). Before deploying the compressed models, one might request guarantees on the reconstructed policy's performance (Berkenkamp et al., 2017). In this case, black-box methods such as neural networks are disadvantaged due to the lack of theoretical guarantees, and are therefore unlikely to be applied to real-world scenarios (Ghavamzadeh et al., 2016).

In this paper, we address two drawbacks of policies represented by neural networks: lack of (1) performance guarantees and (2) control over which redundant parameters are pruned. We treat the policy embedding problem as finding a low-dimensional representation of a given density function over state action space, and propose a general framework for representing any given policy with provable guarantees. We show that the performance of the reconstructed policy depends on the projection algorithm, as well as on the policy’s long-term behaviour. To the best of our knowledge, this is the first work proposing a general framework to learn a compressed policy with performance and long-term behaviour guarantees.

The experiments section explores various policy representation methods based on our framework, which are compared on average returns as well as state coverage. More precisely, we perform experiments with a wide range of bases, including orthogonal bases such as singular value decomposition (SVD, Halko et al. 2011) and non-orthogonal bases such as Gaussian mixtures (GMM, Rasmussen 2000), which can be directly learned from data. In addition, we evaluate the performance of pre-defined orthogonal bases such as Fourier transformations (FFT, Stein & Shakarchi 2003) and the Daubechies wavelet (DB4, Daubechies 1988). Our empirical results suggest that the framework we propose is general enough to handle various classes of RL problems, whether the basis needs to be learned or arises naturally.

2 Preliminary

In this section, we provide a brief introduction to reinforcement learning, density approximation, as well as singular value decomposition.

2.1 Reinforcement learning

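We consider a problem modeled by a discounted Markov Decision Process (MDP) $M = \langle \mathcal{S}, \mathcal{A}, P, r, \mathcal{S}_0, \gamma \rangle$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; given states $s, s' \in \mathcal{S}$ and action $a \in \mathcal{A}$, $P(s'|s,a)$ is the probability of transitioning from $s$ to $s'$ under the action $a$ and $r(s,a)$ is the reward collected at state $s$ after executing the action $a$; $\mathcal{S}_0$ is the set of initial states and $\gamma$ is the discount factor.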

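The agent in an RL environment executes actions according to some policy. We define a stochastic policy $\pi$ such that $\pi(a|s)$ is the conditional probability that the agent takes action $a$ after observing the state $s$. The objective is to find a policy which maximizes the expected discounted reward:

$$J(\pi) = \mathbb{E}_{s_0, a_0, \dots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]. \tag{1}$$

We define the state value function $V^\pi$ and the state-action value function $Q^\pi$ as:

$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s\right], \qquad Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a\right].$$

The gap between $Q^\pi$ and $V^\pi$ is known as the advantage function:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s),$$

where $a_t \sim \pi(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. The coverage of policy $\pi$ is defined as the stationary state visitation probability under $\pi$.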

2.2 Function approximation in inner product spaces

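The theory of inner product spaces has been widely used to characterize approximate learning of Markov decision processes (Puterman, 2014; Tsitsiklis & Van Roy, 1999). In this subsection, we provide a short overview of inner product spaces of square-integrable functions on a closed interval, denoted $L^2([a,b])$, and mention drawbacks of working purely in the continuous setting.

The space $L^2([a,b])$ is known to be separable, i.e., it admits a countable orthonormal basis of functions $\{\varphi_k\}_{k=1}^{\infty}$ such that $\langle \varphi_i, \varphi_j \rangle = \delta_{ij}$ for all $i, j$, with $\delta$ being Kronecker's delta. This property makes possible computations with respect to the inner product

$$\langle f, g \rangle = \int_a^b f(x)\, g(x)\, dx \tag{2}$$

and corresponding norm $\|f\| = \sqrt{\langle f, f \rangle}$, for $f, g \in L^2([a,b])$.

That is, for any $f \in L^2([a,b])$ with basis $\{\varphi_k\}_{k=1}^{\infty}$, there exist scalars $\{\xi_k\}_{k=1}^{\infty}$ such that

$$f = \sum_{k=1}^{\infty} \xi_k \varphi_k, \tag{3}$$

where the coefficients are given by

$$\xi_k = \langle f, \varphi_k \rangle. \tag{4}$$

If $\xi_{K+1} \neq 0$, then adding $\xi_{K+1} \varphi_{K+1}$ to the partial sum $\sum_{k=1}^{K} \xi_k \varphi_k$ can be thought of as adding "non-redundant" information to the previous estimate.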

Eq. (3) suggests a simple compression rule, where one would project a density function onto a fixed basis and store only the first $K$ coefficients. Which components to pick depends on the nature of the basis: harmonic and wavelet bases are ranked according to amplitude, while singular vectors are sorted by corresponding singular values.
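The compression rule above can be sketched in a few lines. The snippet below is a minimal illustration only, assuming a discrete density on a uniform grid and NumPy's FFT as the fixed basis; the function name `compress_top_k` and the toy bimodal density are hypothetical and are not taken from the paper's codebase.

```python
import numpy as np

def compress_top_k(density, k):
    """Keep the k largest-amplitude DFT coefficients of a 1-D density.

    Returns (kept indices, kept coefficients, reconstruction).
    Illustrative helper, not the authors' implementation.
    """
    density = np.asarray(density, dtype=float)
    coeffs = np.fft.fft(density)
    # Rank components by amplitude, as described for harmonic bases.
    keep = np.argsort(np.abs(coeffs))[::-1][:k]
    truncated = np.zeros_like(coeffs)
    truncated[keep] = coeffs[keep]
    # The input is real, so only the real part of the inverse is kept.
    recon = np.fft.ifft(truncated).real
    return keep, coeffs[keep], recon

# Toy bimodal density over 64 bins (an assumption for this example).
x = np.linspace(0.0, 1.0, 64, endpoint=False)
p = np.exp(-((x - 0.3) ** 2) / 0.005) + 0.5 * np.exp(-((x - 0.7) ** 2) / 0.01)
p /= p.sum()

# L2 reconstruction error for an increasing number of stored coefficients.
errors = [np.linalg.norm(p - compress_top_k(p, k)[2]) for k in (2, 8, 32, 64)]
```

Storing `keep` and the matching coefficients is enough to rebuild the approximation, since the Fourier basis itself is fixed. Note that the truncated reconstruction is not guaranteed to be non-negative or to sum to one, so a renormalization step may be needed before using it as a density.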

The choice of the basis plays a crucial role in the quality of approximation and should be taken while keeping in mind properties of the compressed function such as smoothness and periodicity. Below, we provide some examples of well-known orthonormal bases of :

  • Fourier: ;

  • Haar: ;

In order to allow our learning framework to have strong convergence results in the inner product sense as well as a countable set of basis functions, we restrict ourselves to separable Hilbert spaces.

While convergence guarantees are mostly known for the closed interval $[a,b]$, previous works have studied properties of the above functions on the whole real line and in multiple dimensions (Egozcue et al., 2006). Weaker convergence results in Hilbert spaces can be stated with respect to the space's inner product. If, for every $g$ in the Hilbert space $\mathcal{H}$,

$$\lim_{n \to \infty} \langle f_n, g \rangle = \langle f, g \rangle, \tag{5}$$

then the sequence $\{f_n\}$ is said to converge weakly to $f$. Stronger theorems are known for specific bases but require stronger assumptions on $f$. Some are presented in Section 2.3.

2.3 Universal density approximators

Universal density approximators are a class of models which, given enough parameters, can approximate any smooth density up to an arbitrary error level. Over the separable Hilbert space of square-integrable real functions, such approximations typically rely on a countable set of orthonormal elementary functions and include but are not limited to Fourier, Hermite, Haar and Daubechies bases (Stein & Shakarchi, 2003; Haar, 1909; Daubechies, 1988). Moreover, it is known that a mixture model with countable number of components is a universal density approximator (McLachlan & Peel, 2004); in such case, the basis consists of Gaussian density functions and is not necessarily orthonormal.

Our approach for policy compression can be summarized in the following steps: (1) pick a known set of basis functions over the space, (2) project the policy onto the first $K$ basis functions and (3) store the vector of projection weights and optionally the corresponding basis functions.

Below, we mention existing results for the Fourier partial sum, as well as finite component smooth mixture model.

Rate of convergence of Fourier approximation

Let denote the density approximated by the first Fourier partial sums (Stein & Shakarchi, 2003), then a result from (Jackson, 1930) shows that


More recent results (Giardina & Chirlian, 1972) provide even tighter bounds for continuous and periodic functions with continuous derivatives and bounded derivative. In such case,

Rate of convergence of mixture approximation

Let denote a (finite) component mixture model. A result from (Rakhlin et al., 2005) shows the following result for in the class of mixtures with marginally-independent scaled density functions and

in the class of lower-bounded probability density functions:


where is the number of i.i.d. samples used in the learning of .

The constants hidden inside the big-O notation depend on the nature of the function and can become quite large, which can explain differences in empirical evaluation.

2.4 Low-rank matrix factorization

Matrix decomposition algorithms such as the truncated singular value decomposition (SVD) are popular compression methods due to their simplicity and robustness. Although SVD is commonly used in neural network compression schemes (Lu et al., 2017; Goetschalckx et al., 2018), most works apply it to the weights rather than to the outputs of the function. Applying it to the weights makes the process easy to implement, but error guarantees are harder to derive due to the non-linear activation functions of neural networks. Similarly to the density approximators above, a key appeal of SVD is that the error rate of truncation can be controlled efficiently through the magnitude of the truncated singular values (see the Eckart-Young-Mirsky theorem).
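The error control mentioned above can be checked numerically. The following is a small sketch, assuming NumPy and a randomly generated row-stochastic table standing in for a discretized policy; the helper name `truncated_svd` and the table sizes are illustrative assumptions.

```python
import numpy as np

def truncated_svd(matrix, k):
    """Rank-k approximation of `matrix`, plus all singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    return approx, s

# Hypothetical 20-state, 12-action table whose rows are action distributions.
rng = np.random.default_rng(0)
table = rng.random((20, 12))
table /= table.sum(axis=1, keepdims=True)

k = 4
approx, s = truncated_svd(table, k)

# Eckart-Young-Mirsky: the truncation error is given by the discarded
# singular values, in both the Frobenius and the spectral norm.
frob_err = np.linalg.norm(table - approx, "fro")
spec_err = np.linalg.norm(table - approx, 2)
frob_bound = np.sqrt(np.sum(s[k:] ** 2))
spec_bound = s[k]
```

Here `frob_err` matches `frob_bound` and `spec_err` matches `spec_bound` up to floating-point error, so inspecting the singular value spectrum is enough to choose a rank for a target reconstruction error.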

2.5 Time complexity comparison

The time complexity for a -truncated SVD of a matrix is as discussed in Du et al. (2017). The fast Fourier transform can be done in for the 2-dimensional case (Lohne, 2017). Hence, SVD is expected to be faster whenever . The discrete wavelet transform’s (DWT) time complexity depends on the choice of mother wavelets. When using Mallat’s algorithm, the 2-dimensional DWT is known to have complexities ranging from to as low as , for a square matrix of size and depending on the choice of mother wavelet (Resnikoff et al., 2012). Finally, we expect FFT and Daubechies wavelets to have fewer parameters than SVD, since the former use pre-defined orthonormal bases, while the latter must store the left and right singular vectors.

3 Reconstruction algorithm

In this section, we illustrate our method in the single-state setting (i.e., a bandit), before generalizing to multiple states.

3.1 Bandit example

Consider the special case of a multi-armed bandit over with reward function and some given policy . The average rewards collected by are a simplification of Eq. (1):


If is finite, is a vector which can be approximated with


The compression algorithm should therefore pick the coefficients which minimize the projection error onto .

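In an empirical setting, one can monitor the difference between average actions with respect to ground truth and approximate policies, which is upper-bounded by the 1-Wasserstein distance (Dudley, 2018):

$$\left| \mathbb{E}_{a \sim \pi}[a] - \mathbb{E}_{a \sim \hat{\pi}}[a] \right| \le W_1(\pi, \hat{\pi}) = \int_0^1 \left| F_{\pi}^{-1}(u) - F_{\hat{\pi}}^{-1}(u) \right| du, \tag{11}$$

where $F^{-1}$ is the quantile function and $F$ is the cumulative distribution function of the action random variable $a$. Therefore, it is sufficient to keep track of $W_1$, which can be efficiently computed for discrete distributions.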
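For discrete distributions on a common sorted support, the 1-Wasserstein distance reduces to a sum of c.d.f. gaps weighted by the spacing between support points. The sketch below uses only the standard library; the action grid and the two probability vectors are made-up values for illustration.

```python
from itertools import accumulate

def wasserstein_1(support, p, q):
    """1-Wasserstein distance between two pmfs on a common sorted 1-D support."""
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # Integrate |F_p - F_q| over each interval between consecutive atoms.
    return sum(
        abs(fp - fq) * (support[i + 1] - support[i])
        for i, (fp, fq) in enumerate(zip(cdf_p[:-1], cdf_q[:-1]))
    )

def mean_action(support, p):
    """Expected action under the pmf p."""
    return sum(x * w for x, w in zip(support, p))

# Hypothetical ground-truth and reconstructed bandit policies over 5 actions.
actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
pi = [0.1, 0.2, 0.4, 0.2, 0.1]
pi_hat = [0.05, 0.25, 0.35, 0.2, 0.15]

gap = abs(mean_action(actions, pi) - mean_action(actions, pi_hat))
w1 = wasserstein_1(actions, pi, pi_hat)
```

In this example the gap in average actions never exceeds `w1`, which is the inequality the text relies on when monitoring reconstruction quality.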

An extension of the multi-armed bandit problem to continuous action spaces implies that the probability of pulling each arm is given by the function . The extension of discrete quantities to their continuous counterparts is discussed in the next section.

3.2 Discretization of continuous policies

Consider the case when is a continuous, positive and decreasing function with a discrete sequence . For example, the reconstruction error as well as difference in returns fall under this family. Then, implies that . Under these assumptions, convergence guarantees (e.g. on monotonically decreasing reconstruction error) in continuous space imply convergence in the discrete (empirical) setting. Hence, we operate on discrete spaces rather than continuous ones.

We now discuss the construction of , the discretized policy. The core idea consists in grouping together similar state-action pairs (in terms of visitation frequency). To this end, we suggest using the quantile state visitation function and the state visitation distribution function , as well as their empirical counterparts, and . Quantile and distribution functions for the action space are defined analogously.

Algorithm 1 outlines the proposed discretization process allowing one to approximate any continuous policy (e.g., computed by a neural network) by a 2-dimensional table indexed by discrete states and actions. Binning is done via quantile discretization in order to have maximal resolution in frequently visited areas. It also allows for faster sampling during rollouts, since the probability of falling in each bin is uniform. The number of state and action bins per dimension is denoted and , respectively.

Input: Policy , number of state, action bins

Result: Discrete policy

Collect a set of states and set of actions via rollouts of

Build the empirical c.d.f. from set

Build the empirical c.d.f. from set

Find numbers s.t. using for all

Find numbers s.t. using for all

if  and  then

       Assign to
end if
Set to
Algorithm 1 Policy discretization
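The quantile-based discretization outlined in Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions: states and actions are one-dimensional, the table entries are estimated from rollout counts, and the rollout data is synthetic. The helper names (`quantile_edges`, `to_bin`, `discretize_policy`) are hypothetical.

```python
import random
from bisect import bisect_right

def quantile_edges(samples, n_bins):
    """Interior bin edges placing roughly equal sample mass in each bin."""
    ordered = sorted(samples)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def to_bin(value, edges):
    """Index of the bin containing `value`."""
    return bisect_right(edges, value)

def discretize_policy(pairs, n_state_bins, n_action_bins):
    """Build a row-stochastic table from (state, action) rollout samples."""
    s_edges = quantile_edges([s for s, _ in pairs], n_state_bins)
    a_edges = quantile_edges([a for _, a in pairs], n_action_bins)
    counts = [[0] * n_action_bins for _ in range(n_state_bins)]
    for s, a in pairs:
        counts[to_bin(s, s_edges)][to_bin(a, a_edges)] += 1
    table = []
    for row in counts:
        total = sum(row)
        # Unvisited state bins fall back to a uniform row.
        table.append([c / total if total else 1.0 / n_action_bins for c in row])
    return table, s_edges, a_edges

# Synthetic rollouts: Gaussian states, actions from a state-dependent Gaussian.
rng = random.Random(0)
states = [rng.gauss(0.0, 1.0) for _ in range(5000)]
pairs = [(s, rng.gauss(0.5 * s, 0.3)) for s in states]
table, s_edges, a_edges = discretize_policy(pairs, n_state_bins=8, n_action_bins=6)
```

Because the edges sit at empirical quantiles, each state bin receives about the same number of samples, which concentrates resolution in frequently visited regions.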

All further computations of distance between two discretized policies will have to be computed over the corresponding discrete grid . Similarly to Eq. 11, one can look at the average action taken by two MDP policies at state :


This relationship is one of the motivations behind using to assess goodness-of-fit in the experiments section. Moreover, since the support is finite and discrete, computation can be done efficiently (Rubner et al., 1998).

3.3 Pruning unvisited states

Some states might never be visited, and one might wonder how much performance is lost by pruning those states. Let denote the number of times visits state and the number of times action is taken from over trajectories. It can happen that and is never visited. Hence, some rarely visited state-action pairs are not required and can be pruned in order to reduce the number of parameters in the discretized policy. If, at inference time, one of these pruned states is encountered, we can simply act uniformly at random, knowing that the performance will only be slightly affected.

This leads us to define a pruned policy as one which acts randomly under some state visitation threshold. Precisely


Pruning rarely visited states can drastically reduce the number of parameters, while maintaining high probability performance guarantees for .
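A pruned tabular policy of this kind can be stored as a sparse mapping from kept state bins to action distributions. The sketch below is a minimal illustration, assuming visitation counts are available from rollouts; the threshold value, the table and the function names are hypothetical.

```python
def prune_policy(table, visit_counts, threshold):
    """Keep only rows of states visited at least `threshold` times."""
    return {
        state: row
        for state, (row, count) in enumerate(zip(table, visit_counts))
        if count >= threshold
    }

def action_probs(pruned, state_bin, n_actions):
    """Stored row for kept states, uniform distribution for pruned ones."""
    return pruned.get(state_bin, [1.0 / n_actions] * n_actions)

# Hypothetical 4-state, 3-action table with rollout visitation counts.
table = [
    [0.7, 0.2, 0.1],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 1.0, 0.0],
    [0.1, 0.1, 0.8],
]
visit_counts = [120, 0, 3, 57]
pruned = prune_policy(table, visit_counts, threshold=5)
```

With these counts only states 0 and 3 are stored; querying state 1 or 2 returns the uniform distribution, so the number of stored parameters drops from 12 to 6.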

Theorem 3.1.

(Policy pruning) Let be the number of rollout trajectories of the policy , be the largest reward. With high probability , we have a guarantee on the performance of the reconstructed policy :


The proof is shown in Theorem 2 of (Simão et al., 2019). The theorem allows us to discretize only visited states while still ensuring a strong performance guarantee with high probability.

3.4 Acting with tabular policies

In order to execute the learned policy, one has to be able to (1) generate samples conditioned on a given state and (2) evaluate the log-probability of state-action pairs. Acting with a tabular density can be done via various sampling techniques. For example, importance sampling of the atoms corresponding to each bin is possible when is small, while the inverse c.d.f. method is expected to be faster in larger dimensions. The process can be further optimized on a case-by-case basis for each method. For example, sampling directly from the real part of a characteristic function can be done with the algorithm defined in Devroye (1986).

Furthermore, it is possible to use any of the above algorithms to jointly sample pairs under the assumption that states were discretized by quantiles. First, uniformly sample a state bin and then use any of the conditional sampling algorithms to sample action . Optionally, one can add uniform noise (clipped to the range of the bin) to the sampled action. This naive trick would transform discrete actions into continuous approximations of the policy network.
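The inverse c.d.f. method with optional in-bin noise can be sketched as below. This is an illustrative example using only the standard library; `sample_action`, the bin edges and the probability row are assumptions made for the example.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def sample_action(row, action_edges, rng, jitter=True):
    """Inverse-c.d.f. sample of an action bin, optionally jittered in the bin.

    `row` holds the bin probabilities for one state and `action_edges`
    holds len(row) + 1 increasing bin boundaries.
    """
    cdf = list(accumulate(row))
    u = rng.random() * cdf[-1]
    # First bin whose cumulative mass exceeds u; zero-mass bins are skipped.
    idx = min(bisect_right(cdf, u), len(row) - 1)
    lo, hi = action_edges[idx], action_edges[idx + 1]
    # Uniform noise inside the bin turns the discrete choice into a
    # continuous action; without it the bin centre is returned.
    return rng.uniform(lo, hi) if jitter else 0.5 * (lo + hi)

rng = random.Random(0)
action_edges = [-1.0, 0.0, 0.5, 1.0]  # three action bins
row = [0.2, 0.3, 0.5]
actions = [sample_action(row, action_edges, rng) for _ in range(5)]
```

To sample state-action pairs jointly under quantile discretization, one would first draw a state bin uniformly and then call `sample_action` on the corresponding row.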

3.5 Method

We propose the following algorithm which projects the function described by a given neural network onto some inner product space:

Input: Policy , number of components , basis

Result: A set of coefficients

Rollout in the environment to estimate

Discretize into using Alg. 1

Prune with using Eq. 13

Project onto using the corresponding algorithm

Algorithm 2 Joint policy reconstruction

Different projection algorithms are expected to perform differently depending on the curvature of the policy. For example, if is a Gaussian distribution for each state in a state space of dimension , then the table will have exactly modes. Moreover, methods decomposing signals in probability space such as FFT and SVD should be able to find structural patterns in the probability domain.

As a rule of thumb, if contains few distinct modes, state-action space methods (e.g. mixture models) are expected to perform better than probability space methods.

4 Theoretical guarantees

In this section, we provide bounds controlling the impact of the proposed approximation method on collected rewards and policy coverage.

4.1 Performance bound (single state case)

Storing an approximation of the ground truth policy implies a trade-off between performance and model size. Our framework allows one to control the difference in collected reward as a function of reconstruction error in probability space.

Lemma 4.1.

Let be policies in , be the bandit’s reward function. Let be such that and s.t. . Then

Lemma 4.1 (whose proof can be found in appendix) guarantees that rewards collected by will be no further than away from rewards of .

4.2 Performance bounds (multi-state case)

Similar guarantees can be obtained when the state space is not reduced to a single state.

Theorem 4.2.

(Multi-state policy reconstruction) Let be square-summable policies. The following holds


where , is the advantage function, and .

The proof can be found in the Appendix and relies on techniques found in (Schulman et al., 2015). A similar relation is presented in the next section for stationary distributions of .

A key technical difference in the multi-state setting is that the discretization of the state space and pruning of the policy must be taken into account. Consequently, the performance error is now decomposed by triangle inequality into the discretization error (Eq. (12)), the pruning error (Thm 3.1) and the projection error which is controlled by specific function approximation theorems:


Loose bounds on the reconstruction error are given in Eq. (6) and Eq. (8), tighter versions can be found in the literature (Giardina & Chirlian, 1972; Rakhlin et al., 2005).

To estimate the total error, one can separately estimate each of the three terms of Eq. (16). Empirical results for these three error terms are shown in the next section.

4.3 Connection between reconstruction error and policy coverage

A useful metric to assess the long-term behaviour of a given policy in an MDP is the stationary distribution of the Markov chain induced by . For a given policy, measures the state coverage of the model averaged with respect to that policy.

If one wishes to switch between the MDP and Markov chain frameworks for a fixed policy, it is possible to define the expected transition model . In tensor form, it can be represented as , where is the policy matrix, is the Khatri-Rao product and is the second matricization of the transition tensor .

If the Markov chain defined above is irreducible and homogeneous, then its stationary distribution corresponds to the state occupancy probability (whose existence and uniqueness follow from the Perron-Frobenius theorem) given by the principal left eigenvector of


The following theorem bridges the error made on the reconstruction of long-run distribution as a function of the system’s transition dynamics and policy reconstruction error.

Theorem 4.3 (Approximate policy coverage).

Let be the Schatten -norm and be the vector -norm. If (, ) is an irreducible, homogeneous Markov chain, then the following holds:


where and is a vector of all ones.

A detailed proof can be found in the Appendix.

Note that the upper bound depends on both the environment’s structure as well as on the policy reconstruction quality. It is thus expected that, for MDPs with particularly large singular values of , the bound converges less quickly than for those with smaller singular values. A visualization of this bound is provided in Figure 1.
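The stationary distribution discussed in this section can be computed for small tabular problems by forming the expected transition matrix and iterating it. The sketch below uses plain power iteration, which additionally assumes the chain is aperiodic; the two-state MDP, the uniform policy and the function names are hypothetical.

```python
def expected_transition(policy, transitions):
    """P_pi[s][t] = sum over a of policy[s][a] * transitions[s][a][t]."""
    n_states = len(policy)
    return [
        [
            sum(policy[s][a] * transitions[s][a][t] for a in range(len(policy[s])))
            for t in range(n_states)
        ]
        for s in range(n_states)
    ]

def stationary_distribution(p_pi, iters=10_000, tol=1e-12):
    """Left fixed point rho = rho P_pi, found by power iteration."""
    n = len(p_pi)
    rho = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(rho[s] * p_pi[s][t] for s in range(n)) for t in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, rho)) < tol:
            return nxt
        rho = nxt
    return rho

# Toy MDP: transitions[s][a] is a distribution over next states.
transitions = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
policy = [[0.5, 0.5], [0.5, 0.5]]

p_pi = expected_transition(policy, transitions)
rho = stationary_distribution(p_pi)
```

Running the same two functions on a ground-truth table and on its reconstruction gives two stationary distributions whose distance can be compared against the bound of Theorem 4.3.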

5 Experimental results

In this section, we visualize the rate of convergence of a reconstructed stationary distribution via a toy example to illustrate Eq. (17). We then assess the validity of our method on several environments and compare the quality of the reconstructed policy for different projection methods, in terms of performance as well as computation time.

5.1 Rate of convergence of stationary distribution under random policy

We validate the rate of convergence of Thm. 4.3 in the toy experiment described below. Consider a deterministic chain MDP with states. A fixed policy in state transitions to with probability and to with probability . If in states or , the agent remains there with probability and , respectively. We consider the expected transition model obtained using the policy reconstructed through discrete Fourier transform. A visualization of Theorem 4.3 is shown in Fig. 1.

Figure 1: Upper bound on in the chain MDP task as a function of number of Fourier components and of the number of states.

5.2 Control domains

We consider a range of reconstruction tasks from three environments: a toy bandit problem inspired by Fellows et al. (2018), as well as the classical control problems of ContinuousMountainCar-v0 and Pendulum-v0 from OpenAI Gym (Brockman et al., 2016).

In all environments but the first, we omit GMM and fixed basis GMM methods, since the expectation-maximization algorithm runs into stability problems when dealing with an extremely high number of components. Instead, we use SVD for low-rank matrix factorization and the fourth order Daubechies wavelet (DB4; Daubechies, 1988) as an orthonormal basis alternative to Fourier (DFT). We use and as initial estimates for the discretization for all experiments except for Mountain Car ().

5.2.1 Bandit turntable

We consider a bandit problem in which rewards are spread in a circle as shown in Figure 2 a). Actions consist of angles in the interval . Given the corresponding Boltzmann policy, we compare the reconstruction quality of the discrete Fourier transform and GMM. In addition to Fourier and GMM, we also compare with fixed basis GMM, where the means are sampled uniformly over and variance is drawn uniformly from ; only mixing coefficients are learned. This provides insight into whether the reconstruction algorithms are learning the policy, or if is so simple that a random set of Gaussians can already approximate it well.

Figure 2: Bandit turntable environment in (a) polar coordinates. Intensity of modes is proportional to the reward. Actions are absolute angles in interval . b) shows the distance between ground truth and order approximations. Dotted line indicates the number of modes in the ground truth density. While GMM shows instability with more components, Fourier transform achieves low with only a few bases.

Results are reported in Figure 2 b), where the metric was obtained by first computing the row-wise (i.e. action-specific) scores for every state, and subsequently averaging over all possible states. As increases, EM has more trouble converging, as can be seen in the green curve’s trend and confidence intervals from to 100. Fixed basis GMM struggles to accurately approximate the policy. On the other hand, DFT shows a stable performance after a threshold of components.

5.2.2 Mountain Car

In this environment, the agent (a car) must reach the flag located at the top of a hill. It needs to go back and forth in order to generate sufficient momentum to reach the top. The agent is allowed to apply a speed motion in the interval . We use Soft Actor-Critic (SAC, Haarnoja et al. 2018) to train the agent until convergence ( steps).

We compare all reconstruction methods with a maximum likelihood estimate (MLE) of , which consists of approximating the discrete policy with samples of that discrete policy. This method is referred to as the “oracle” in the plots, since no reconstruction is performed. As shown in Figure 3, DFT, SVD and DB4 reach the same reconstruction error as the oracle method. All methods achieve good performance with one order of magnitude fewer parameters than the neural network. Note that DB4 shows slightly better performance than DFT.

Figure 3: Difference in returns between reconstructed and ground truth policies for the Mountain Car task with respect to the number of parameters used (X-axis). Our reconstructed models used drastically fewer parameters than the neural network.

5.2.3 Pendulum

This classical mechanics task consists of a pendulum which needs to swing up in order to stay upright. The actions represent the joint effort (torque) between and units. We train SAC until convergence and save snapshots of the actor and critic after and steps. The reconstruction task is to recover both the Gaussian policy (actor) and the Boltzmann-Q policy (critic, temperature set to 1). While the Gaussian policy is unimodal, the Boltzmann-Q policy can be multi-modal and is a good challenge to demonstrate the properties of each method. As shown in Figure 5, when the policy has converged to a spiky shape ( steps), all methods show comparable performance (in terms of convergence). DFT shows better convergence at early stages of training ( steps), that is, when the ground truth policy has a large variance. Note that all methods use fewer parameters than the original neural network.

Figure 4: Plots of Gaussian after training steps in polar coordinates. Each circle of different radius represents a state in the Pendulum environment as shown by the snapshots. Higher intensity colors (red) represent higher density mass on the given angular action.
Figure 5: Absolute difference in returns collected by discretized and reconstructed policies, averaged over 5 trials. Blue dots represent the number of parameters of the neural network policy. SVD, DFT and DB4 projections need an order of magnitude fewer parameters to reconstruct the original policy.
Figure 6: distance between the true and reconstructed stationary distributions, averaged over trials. SVD, DFT and DB4 methods show a fast convergence to the oracle’s stationary distribution using only of the neural network’s parameters.
Figure 7: Plots of discretized and pruned (a) Gaussian policy and (b) Boltzmann-Q policy in polar coordinates for thousands training steps of soft actor-critic. Rewards collected by their respective continuous networks are indicated in parentheses. Each circle of radius corresponds to for a discrete . All densities are on the same scale.
Figure 8: Plots of discretization and pruning errors for (a) the Gaussian policy of ContinuousMountainCar-v0 and (b) Boltzmann-Q policy of Pendulum-v0. As expected, more bins induce lower discretization error (green curve), while collecting more trajectories guarantees confidence for pruning unvisited states (blue curve).

Results for the Boltzmann-Q policy presented in the Appendix exhibit patterns similar to those of the Gaussian policy. The discretization and pruning errors are illustrated in Figure 8 and agree with the error bounds of Thm 3.1.

6 Conclusion

In this work, we introduced a general framework for provably efficient policy reconstruction. It allowed us to drastically compress policy densities in our experiments by a projection onto basis functions. Moreover, the reconstruction error has been shown to depend on the discretization, pruning and projection algorithms. We conducted experiments to demonstrate the behaviour of four sets of basis functions on a set of continuous control tasks, which exhibit desirable performance guarantees. A potential extension to our framework would be to operate directly in the continuous space, avoiding the discretization step.


7 Appendix

Reproducibility Checklist

We follow the reproducibility checklist “The machine learning reproducibility checklist” and point to relevant sections explaining them here.
For all algorithms presented, check if you include:

  • A clear description of the algorithm, see main paper and included codebase. The proposed approach is completely described by Alg. 2.

  • An analysis of the complexity (time, space, sample size) of the algorithm. The space complexity of our algorithm depends on the number of desired bases. It is for a 1-dimension component GMM with diagonal covariance, for a truncated SVD of an matrix, and only for the wavelet and Fourier bases. Note that a simple implementation of SVD requires finding the three matrices before truncation.

    The time complexity depends on the reconstruction, pruning and discretization algorithms, which involve conducting rollouts w.r.t. policy , training a quantile encoder on the state space trajectories, and finally embedding the pruned matrix using the desired algorithm. We found that the Gaussian mixture is the slowest to train, due to shortcomings of the expectation-maximization algorithm.

  • A link to a downloadable source code, including all dependencies. The code is included with Supplemental Files as a zip file; all dependencies can be installed using Python’s package manager. Upon publication, the code would be available on Github. Additionally, we include the model’s weights as well as the discretized policy for Pendulum-v0 environment.

For all figures and tables that present empirical results, check if you include:

  • A complete description of the data collection process, including sample size. We use standard benchmarks provided in OpenAI Gym (Brockman et al., 2016).

  • A link to downloadable version of the dataset or simulation environment. Not applicable.

  • An explanation of how samples were allocated for training / validation / testing.

    We do not use a training-validation-test split, but instead report the mean performance (and one standard deviation) of the policy at evaluation time across 5 trials.

  • An explanation of any data that were excluded. The only data exclusion was done during policy pruning, as outlined in the main paper.

  • The exact number of evaluation runs. 5 trials to obtain all figures, and 200 rollouts to determine .

  • A description of how experiments were run. See Section Experimental Results in the main paper and didactic example details in Appendix.

  • A clear definition of the specific measure or statistics used to report results. Undiscounted returns across the whole episode are reported, and in turn averaged across 5 seeds. Confidence intervals shown in Fig. 8 were obtained using the pooled variance formula from a difference of means test.

  • Clearly defined error bars. Confidence intervals are always the mean ± one standard deviation over 5 trials.

  • A description of results with central tendency (e.g. mean) and variation (e.g. stddev). All results use the mean and standard deviation.

  • A description of the computing infrastructure used. All runs used 1 CPU for all experiments with Gb of memory.
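The statistics named in the checklist (mean and standard deviation over trials, and the pooled variance of a difference-of-means test) can be sketched as follows. This is an illustration only: the return values are made up, and the use of the unbiased sample variance (`ddof=1`) is an assumption, since the checklist does not state which estimator was used.

```python
import numpy as np


def mean_and_std(returns):
    """Mean and sample standard deviation of per-trial returns."""
    returns = np.asarray(returns, dtype=float)
    return returns.mean(), returns.std(ddof=1)  # ddof=1 is an assumption


def pooled_variance(a, b):
    """Pooled variance of two samples (equal-variance two-sample test)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    weighted = (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    return weighted / (n_a + n_b - 2)


def difference_of_means(a, b):
    """Difference of sample means and its pooled standard error."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a.mean() - b.mean()
    std_err = np.sqrt(pooled_variance(a, b) * (1.0 / a.size + 1.0 / b.size))
    return diff, std_err


# Hypothetical undiscounted returns over 5 trials for two policies.
returns_true = [1.0, 2.0, 3.0, 4.0, 5.0]
returns_reconstructed = [2.0, 3.0, 4.0, 5.0, 6.0]

mean, std = mean_and_std(returns_true)
diff, std_err = difference_of_means(returns_true, returns_reconstructed)
print(f"mean = {mean:.2f}, std = {std:.2f}")
print(f"difference of means = {diff:.2f}, pooled standard error = {std_err:.2f}")
```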

7.1 Discretization techniques

Naively discretizing some function over a fixed interval can be done with equally spaced bins. To pass from a continuous value to a discrete one, one would find the bin whose interval contains that value. However, using a uniform grid for a non-uniform density wastes representation power. To re-allocate the bins that fall in low-density areas, we use the quantile binning method, which first computes the cumulative distribution function of the density. Then, it finds bin edges such that the probability of falling in each bin is equal. Quantile binning can be seen as uniform binning in probability space, and exactly corresponds to uniform discretization if the density is constant (uniform distribution).
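The quantile binning idea can be sketched in one dimension with NumPy alone. This is a minimal illustration, not the paper's code: the helper names `quantile_bin_edges` and `assign_bins` are hypothetical, and the empirical quantiles of the samples stand in for the cumulative distribution function.

```python
import numpy as np


def quantile_bin_edges(samples, n_bins):
    """Bin edges placed at equally spaced levels of the empirical CDF."""
    levels = np.linspace(0.0, 1.0, n_bins + 1)
    return np.quantile(samples, levels)


def assign_bins(values, edges):
    """Map each value to a bin index in {0, ..., n_bins - 1}."""
    # Only interior edges are used, so values outside the fitted range
    # fall into the first or last bin.
    return np.searchsorted(edges[1:-1], values, side="right")


rng = np.random.default_rng(0)
samples = rng.normal(size=4000)  # a non-uniform density

edges = quantile_bin_edges(samples, n_bins=10)
bins = assign_bins(samples, edges)

counts = np.bincount(bins, minlength=10)
widths = np.diff(edges)

print("samples per bin:", counts)
print("bin widths:", np.round(widths, 3))
```

Each bin receives (approximately) the same number of samples, while the bin widths shrink where the density is high and grow in the tails.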

Below, we present an example of discretizing 4,000 sample points taken from four distributions using quantile binning.

Figure 9: Quantile binning for the four Gaussians problem, using 10 bins per dimension.

In Figure 9, we are only allowed to allocate 10 bins per dimension. Note how the grid is denser around high-probability regions, while the furthermost bins are the widest. This uneven discretization, if applied to the policy density function, allows the agent to have higher detail around high-probability regions.
In Python, the class sklearn.preprocessing.KBinsDiscretizer (with strategy="quantile") can be used to discretize samples from the target density.
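A possible use of KBinsDiscretizer on a setup resembling the four Gaussians problem is sketched below. The centres and the standard deviation of the Gaussians are assumptions chosen for illustration; only the total of 4,000 samples and the 10 bins per dimension come from the text.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Hypothetical data: 1,000 points around each of four centres
# (centres and scale are assumed, not taken from the paper).
centres = np.array([[-2.0, -2.0], [-2.0, 2.0], [2.0, -2.0], [2.0, 2.0]])
samples = np.concatenate(
    [centre + 0.5 * rng.normal(size=(1000, 2)) for centre in centres]
)

# Quantile strategy: every bin of a given dimension holds roughly the
# same number of samples; "ordinal" returns integer-valued bin indices.
discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
bin_indices = discretizer.fit_transform(samples)

# One array of 11 edges per dimension.
edges = discretizer.bin_edges_

print("bin index of first sample:", bin_indices[0])
print("edges along dimension 0:", np.round(edges[0], 2))
```

The bins are fitted independently along each dimension, so the resulting grid is a product of two one-dimensional quantile grids.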

Figure 10: Quantile binning of unnormalized state visitation counts of a DQN policy in the discrete Mountain Car task. Bins closer to the starting states are visited more often than those at the end of the trajectory. Blue dots represent the bin limits.

Since policy densities are expected to be more complex than the previous example, we analyze quantile binning in the discrete Mountain Car environment. To do so, we train a DQN policy until convergence (i.e. collected rewards at least ), and perform 500 rollouts using the greedy policy. Trajectories are used to construct a 2-dimensional state visitation histogram. At the same time, all visited states are given to the quantile binning algorithm, which is used to assign observations to bins.

Figure 10 presents the state visitation histogram with bins, where color represents the state visitation count. Bins are shown by dotted lines.

7.2 Proofs


(Lemma 4.1)

where the second-to-last line is a direct application of Hölder's inequality.


(Theorem 4.2)

Following (Schulman et al., 2015), we take

Using inequality (45) from (Schulman et al., 2015), we have:

Expanding and using triangle inequality, we have:

Combining with inequality (45):


(Theorem 4.3) We first represent the approximate transition matrix induced by the policy as a perturbation of the true transition:


Then, the difference between stationary distributions and is equal to (Schweitzer, 1968; Cho & Meyer, 2001):


where is the fundamental matrix of the Markov chain induced by and is a vector of ones.
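The perturbation identity cited from Schweitzer (1968) and Cho & Meyer (2001) can be checked numerically. The sketch below uses assumed notation, since the symbols are not recoverable from the text: P is the true transition matrix, P_tilde = P + E its perturbed version, pi and pi_tilde their stationary distributions (as row vectors), and Z = (I - P + 1 pi)^(-1) the fundamental matrix. Under this convention the identity reads pi_tilde - pi = pi_tilde E Z. The random chains are hypothetical.

```python
import numpy as np


def stationary_distribution(P):
    """Stationary row vector pi of an irreducible chain: pi P = pi, sum(pi) = 1."""
    n = P.shape[0]
    # Stack the balance equations with the normalization constraint.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


def fundamental_matrix(P, pi):
    """Z = (I - P + 1 pi)^(-1), with 1 a column vector of ones."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))


rng = np.random.default_rng(0)
n = 6

# Hypothetical true chain with strictly positive entries (hence irreducible).
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

# Hypothetical approximate chain: a convex mix of P with another chain.
other = rng.random((n, n))
other /= other.sum(axis=1, keepdims=True)
P_tilde = 0.9 * P + 0.1 * other
E = P_tilde - P

pi = stationary_distribution(P)
pi_tilde = stationary_distribution(P_tilde)
Z = fundamental_matrix(P, pi)

lhs = pi_tilde - pi
rhs = pi_tilde @ E @ Z

# Induced-norm bound: ||pi_tilde - pi||_1 <= ||E||_inf * ||Z||_inf.
bound = np.linalg.norm(E, ord=np.inf) * np.linalg.norm(Z, ord=np.inf)

print("max identity residual:", np.abs(lhs - rhs).max())
print("L1 difference:", np.abs(lhs).sum(), "<= bound:", bound)
```

The norm bound checked here uses the induced infinity norm rather than the Schatten norms discussed in the text; it is included only as a simple sanity check of how the identity turns into an upper bound.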

In particular, the above result holds for Schatten norms (Baumgartner, 2011):


So far, this result is known for irreducible, homogeneous Markov chains and has no decision component.
Consider the matrix , which is the difference between expected transition models of true and approximate policies. It can be expanded into products of matricized tensors:


The norm of can also be upper bounded as follows:


Combining this result with that of (Schweitzer, 1968) yields


7.3 Additional plots

7.3.1 Pendulum (Gaussian policy)

Figure 11: Plots of discretized and pruned Gaussian policy in polar coordinates for and thousands training steps of soft actor-critic. Rewards collected by the continuous ground truth at every timestep are indicated in parentheses. Each circle of radius corresponds to for a discrete . All densities are on the same scale.

7.3.2 Pendulum (Boltzmann-Q policy)

Figure 12: Plots of discretized and pruned Boltzmann policy in polar coordinates for and thousands training steps of soft actor-critic. Each circle of radius corresponds to for a discrete .
Figure 13: Absolute difference in returns collected by discretized and reconstructed Boltzmann-Q policies, averaged over 5 trials. Blue dots represent the number of parameters of the neural network policy. SVD, DFT and DB4 projections need an order of magnitude fewer parameters to reconstruct the original policy.
Figure 14: distance between the true and reconstructed stationary Boltzmann-Q distributions, averaged over trials. SVD, DFT and DB4 methods show a fast convergence to the oracle’s stationary distribution using only of the neural network’s parameters.