1 Introduction
In the reinforcement learning (RL) framework, the goal of a rational agent is to maximize the expected rewards in a dynamical system by finding a suitable conditional distribution known as a policy. This conditional distribution can be found using policy iteration and any suitable function approximator, ranging from linear models to neural networks. Due to their high representation power and efficient scalability, neural networks are often used for this purpose (Lillicrap et al., 2015; Schulman et al., 2015). They have shown great success in simulated environments such as games (e.g. Atari) or robotic control (e.g. MuJoCo).
However, in situations with limited memory capacity, e.g. deployment on mobile devices (Liu et al., 2018) or field drones, finding a lower-dimensional embedding of a policy network with bounds on its performance and error rate is challenging. A typical policy network can contain redundant parameters that must be pruned to save storage, a process for which knowledge distillation (Rusu et al., 2015) is not well-suited due to its lack of guarantees. Other parameter reduction attempts include Mazoure et al. (2019) and Doan et al. (2019), but the authors do not provide theoretical guarantees or interpretations for this improvement.
It is in fact crucial to have theoretical guarantees when conducting policy reconstruction, especially for real-world applications. In simulated environments, the agent can explore freely and interact without incurring unexpected consequences; such a setup is not applicable to real-life scenarios. In multiple risk-sensitive domains, failures can lead to drastic damages either in terms of human lives (robotic surgery (Ubee et al., 2019), self-driving cars (Shalev-Shwartz et al., 2017), air traffic control (Kim, 2015; Moon et al., 2011)) or financial consequences (warehouse robots (Bogue, 2017), advertisement campaigns). Before deploying the compressed models, one might request guarantees on the reconstructed policy’s performance (Berkenkamp et al., 2017). In this case, black-box methods such as neural networks are disadvantaged due to the lack of theoretical guarantees, and are therefore unlikely to be applied to real-world scenarios (Ghavamzadeh et al., 2016).
In this paper, we address two drawbacks of policies represented by neural networks: the lack of (1) performance guarantees and (2) control over which redundant parameters are pruned. We treat the policy embedding problem as finding a low-dimensional representation of a given density function over the state-action space, and propose a general framework for representing any given policy with provable guarantees. We show that the performance of the reconstructed policy depends on the projection algorithm, as well as on the policy’s long-term behaviour. To the best of our knowledge, this is the first work proposing a general framework to learn a compressed policy with performance and long-term behaviour guarantees.
The experiments section explores various policy representation methods based on our framework, which are compared on average returns as well as state coverage. More precisely, we perform experiments with a wide range of bases, including orthogonal bases such as singular value decomposition (SVD, Halko et al. 2011) and non-orthogonal bases such as Gaussian mixtures (GMM, Rasmussen 2000), which can be learned directly from data. In addition, we evaluate the performance of predefined orthogonal bases such as Fourier transformations (FFT, Stein & Shakarchi 2003) and Daubechies wavelets (DB4, Daubechies 1988). Our empirical results suggest that the framework we propose is general enough to handle various classes of RL problems, whether the basis needs to be learned or arises naturally.
2 Preliminary
In this section, we provide a brief introduction to reinforcement learning, density approximation, as well as singular value decomposition.
2.1 Reinforcement learning
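We consider a problem modeled by a discounted Markov Decision Process (MDP) $M = (\mathcal{S}, \mathcal{A}, P, r, \mathcal{S}_0, \gamma)$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; given states $s, s' \in \mathcal{S}$ and action $a \in \mathcal{A}$, $P(s' \mid s, a)$ is the transition probability of transferring $s$ to $s'$ under the action $a$ and $r(s, a)$ is the reward collected at state $s$ after executing the action $a$; $\mathcal{S}_0$ is the set of initial states and $\gamma$ is the discount factor.
The agent in an RL environment executes actions according to some policy. We define a stochastic policy $\pi$ such that $\pi(a \mid s)$ is the conditional probability that the agent takes action $a$ after observing the state $s$. The objective is to find a policy which maximizes the expected discounted reward:
$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right] \qquad (1)$$
We define the state value function $V^{\pi}$ and the state-action value function $Q^{\pi}$ as:
$$V^{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l}, a_{t+l})\right], \qquad Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l}, a_{t+l})\right]$$
The gap between $Q^{\pi}$ and $V^{\pi}$ is known as the advantage function:
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$
where $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The coverage of policy $\pi$ is defined as the stationary state visitation probability under $\pi$.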
2.2 Function approximation in inner product spaces
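The theory of inner product spaces has been widely used to characterize approximate learning of Markov decision processes (Puterman, 2014; Tsitsiklis & Van Roy, 1999). In this subsection, we provide a short overview of inner product spaces of square-integrable functions on a closed interval $[a, b]$, denoted $L^2([a, b])$, and mention drawbacks of working purely in the continuous setting.
The space $L^2([a, b])$ is known to be separable, i.e., it admits a countable orthonormal basis of functions $\{\phi_k\}_{k=1}^{\infty}$, such that $\langle \phi_i, \phi_j \rangle = \delta_{ij}$ for all $i, j$ and $\delta_{ij}$ being Kronecker’s delta. This property makes possible computations with respect to the inner product
$$\langle f, g \rangle = \int_a^b f(x)\, g(x)\, dx \qquad (2)$$
and corresponding norm $\|f\| = \sqrt{\langle f, f \rangle}$, for $f, g \in L^2([a, b])$.
That is, for any $f \in L^2([a, b])$ with basis $\{\phi_k\}_{k=1}^{\infty}$, there exist scalars $\{c_k\}_{k=1}^{\infty}$ such that
$$f = \sum_{k=1}^{\infty} c_k \phi_k \qquad (3)$$
where the coefficients are given by
$$c_k = \langle f, \phi_k \rangle \qquad (4)$$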
If , then adding to can be thought of as adding "non-redundant" information to the previous estimate.
Eq. (3) suggests a simple compression rule, where one would project a density function onto a fixed basis and store only a fixed number of leading coefficients. Which components to pick depends on the nature of the basis: harmonic and wavelet bases are ranked according to amplitude, while singular vectors are sorted by corresponding singular values.
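As a minimal sketch of this compression rule, the snippet below projects a discretized density onto the discrete Fourier basis and keeps only the largest-amplitude coefficients. The function names `compress_dft` and `reconstruct_dft`, the test density and the use of NumPy are illustrative assumptions, not the implementation used in the experiments.

```python
import numpy as np

def compress_dft(density, k):
    """Keep the k largest-amplitude DFT coefficients of a 1-D density."""
    coeffs = np.fft.fft(density)
    keep = np.argsort(np.abs(coeffs))[-k:]      # indices of the k largest amplitudes
    return keep, coeffs[keep], density.size

def reconstruct_dft(keep, values, n):
    """Invert the truncated expansion and renormalize to a valid density."""
    coeffs = np.zeros(n, dtype=complex)
    coeffs[keep] = values
    approx = np.fft.ifft(coeffs).real
    approx = np.clip(approx, 0.0, None)         # truncation can produce small negatives
    return approx / approx.sum()

# Hypothetical smooth, periodic policy over 64 action bins.
grid = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
pi = np.exp(np.cos(grid))
pi /= pi.sum()

# Reconstruction error in the l2 norm for an increasing number of stored coefficients.
errors = []
for k in (1, 3, 5, 9):
    errors.append(np.linalg.norm(pi - reconstruct_dft(*compress_dft(pi, k))))
```

For a smooth density such as this one, the error shrinks quickly as more coefficients are stored; densities with sharp peaks would be expected to need more components.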
The choice of the basis plays a crucial role in the quality of approximation and should be made while keeping in mind properties of the compressed function such as smoothness and periodicity. Below, we provide some examples of well-known orthonormal bases of :

Fourier: ;

Haar: ;
In order to allow our learning framework to have strong convergence results in the inner product sense as well as a countable set of basis functions, we restrict ourselves to separable Hilbert spaces.
While convergence guarantees are mostly known for the closed interval $[a, b]$, previous works have studied properties of the above functions on the whole real line and in multiple dimensions (Egozcue et al., 2006). Weaker convergence results in Hilbert spaces can be stated with respect to the space’s inner product. If, for every $g$ in the Hilbert space $\mathcal{H}$,
$$\lim_{n \to \infty} \langle f_n, g \rangle = \langle f, g \rangle \qquad (5)$$
then the sequence $\{f_n\}$ is said to converge weakly to $f$. Stronger theorems are known for specific bases but require stronger assumptions on $f$. Some are presented in Section 2.3.
2.3 Universal density approximators
Universal density approximators are a class of models which, given enough parameters, can approximate any smooth density up to an arbitrary error level. Over the separable Hilbert space of square-integrable real functions, such approximations typically rely on a countable set of orthonormal elementary functions and include but are not limited to Fourier, Hermite, Haar and Daubechies bases (Stein & Shakarchi, 2003; Haar, 1909; Daubechies, 1988). Moreover, it is known that a mixture model with a countable number of components is a universal density approximator (McLachlan & Peel, 2004); in such a case, the basis consists of Gaussian density functions and is not necessarily orthonormal.
Our approach for policy compression can be summarized in the following steps: (1) pick a known basis over the space, (2) project the policy onto the first basis functions and (3) store the vector of projection weights and optionally the corresponding basis functions.
Below, we mention existing results for the Fourier partial sum, as well as for finite-component smooth mixture models.
Rate of convergence of Fourier approximation
Let denote the density approximated by the first Fourier partial sums (Stein & Shakarchi, 2003), then a result from (Jackson, 1930) shows that
(6) 
More recent results (Giardina & Chirlian, 1972) provide even tighter bounds for continuous and periodic functions with continuous derivatives and bounded derivative. In such case,
(7) 
Rate of convergence of mixture approximation
Let denote a (finite) component mixture model. A result from (Rakhlin et al., 2005) shows the following result for in the class of mixtures with marginally independent scaled density functions and in the class of lower-bounded probability density functions:
(8) 
where is the number of i.i.d. samples used in the learning of .
The constants hidden inside the big-O notation depend on the nature of the function and can become quite large, which can explain differences in empirical evaluation.
2.4 Low-rank matrix factorization
Matrix decomposition algorithms such as the truncated singular value decomposition (SVD) are popular compression methods due to their simplicity and robustness. Although SVD is commonly used in neural network compression schemes (Lu et al., 2017; Goetschalckx et al., 2018), most works apply it to the weights rather than the outputs of the function. Working on the weights makes the process easy to implement, but error guarantees are harder to derive due to the nonlinear activation functions of neural networks. Similarly to the density approximators above, a key appeal of SVD is that the error rate of truncation can be controlled efficiently through the magnitude of the truncated singular values (see the Eckart-Young-Mirsky theorem).
2.5 Time complexity comparison
The time complexity for a truncated SVD decomposition of a matrix is as discussed in Du et al. (2017). Fast Fourier transform can be done in for the 2-dimensional case (Lohne, 2017). Hence, SVD is expected to be faster whenever . The discrete wavelet transform’s (DWT) time complexity depends on the choice of mother wavelets. When using Mallat’s algorithm, the 2-dimensional DWT is known to have complexities ranging from to as low as , for a square matrix of size and depending on the choice of mother wavelet (Resnikoff et al., 2012). Finally, we expect FFT and Daubechies wavelets to have fewer parameters than SVD, since the former use predefined orthonormal bases, while the latter must store the left and right singular vectors.
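The storage cost and truncation error of the SVD can be illustrated with a short sketch. The table size, the rank `k` and the random policy table below are hypothetical; the code only checks the Eckart-Young-Mirsky identity mentioned in Section 2.4 and counts the parameters that a rank-`k` factorization has to store.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discretized policy table: 50 state bins x 40 action bins,
# each row a distribution over actions.
table = rng.random((50, 40))
table /= table.sum(axis=1, keepdims=True)

k = 5
u, s, vt = np.linalg.svd(table, full_matrices=False)
approx = (u[:, :k] * s[:k]) @ vt[:k]            # best rank-k approximation

# Eckart-Young-Mirsky: the Frobenius error equals the norm of the discarded singular values.
frob_error = np.linalg.norm(table - approx)
tail_norm = np.sqrt(np.sum(s[k:] ** 2))

# Stored parameters: k left vectors, k singular values and k right vectors.
svd_params = k * (table.shape[0] + table.shape[1] + 1)
dense_params = table.size
```

The error is therefore known exactly from the discarded singular values, without having to rebuild the table, whereas a predefined basis only needs the retained coefficients (and their indices) to be stored.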
3 Reconstruction algorithm
In this section, we illustrate our method in the single-state setting (i.e. bandit), before generalizing to multiple states.
3.1 Bandit example
Consider the special case of a multi-armed bandit over with reward function and some given policy . The average rewards collected by are a simplification of Eq. (1):
(9) 
If is finite, is a vector which can be approximated with
(10) 
The compression algorithm should therefore pick the coefficients which minimize the projection error onto .
In an empirical setting, one can monitor the difference between average actions with respect to ground truth and approximate policies, which is upper-bounded by the 1-Wasserstein distance (Dudley, 2018):
(11) 
where is the quantile function and is the cumulative distribution function of the action random variable . Therefore, it is sufficient to keep track of , which can be efficiently computed for discrete distributions.
An extension of the multi-armed bandit problem to continuous action spaces implies that the probability of pulling each arm is given by the function . The extension of discrete quantities to their continuous counterparts is discussed in the next section.
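For discrete action distributions on a sorted one-dimensional support, the 1-Wasserstein distance of Eq. (11) reduces to a sum of absolute c.d.f. differences weighted by the spacing between atoms. The sketch below assumes this setting; the helper names and the two example policies are hypothetical.

```python
from itertools import accumulate

def wasserstein_1(support, p, q):
    """1-Wasserstein distance between two distributions on a sorted 1-D support."""
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    gaps = [b - a for a, b in zip(support, support[1:])]
    # Integrate |F_p - F_q| between consecutive atoms; zip stops before the
    # last c.d.f. value, where both c.d.f.s equal 1.
    return sum(abs(fp - fq) * gap for fp, fq, gap in zip(cdf_p, cdf_q, gaps))

def mean_action(support, p):
    """Average action under a discrete distribution."""
    return sum(a * w for a, w in zip(support, p))

# Hypothetical action bins and two policies over them.
actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
pi = [0.1, 0.2, 0.4, 0.2, 0.1]
pi_hat = [0.0, 0.2, 0.4, 0.2, 0.2]

w1 = wasserstein_1(actions, pi, pi_hat)
mean_gap = abs(mean_action(actions, pi) - mean_action(actions, pi_hat))
```

In this example the gap between average actions attains the Wasserstein bound, because `pi_hat` only moves mass to the right of `pi`.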
3.2 Discretization of continuous policies
Consider the case when is a continuous, positive and decreasing function with a discrete sequence . For example, the reconstruction error as well as difference in returns fall under this family. Then, implies that . Under these assumptions, convergence guarantees (e.g. on monotonically decreasing reconstruction error) in continuous space imply convergence in the discrete (empirical) setting. Hence, we operate on discrete spaces rather than continuous ones.
We now discuss the construction of , the discretized policy. The core idea consists in grouping together similar state-action pairs (in terms of visitation frequency). To this end, we suggest using the quantile state visitation function and the state visitation distribution function , as well as their empirical counterparts, and . Quantile and distribution functions for the action space are defined analogously.
Algorithm 1 outlines the proposed discretization process allowing one to approximate any continuous policy (e.g., computed by a neural network) by a 2-dimensional table indexed by discrete states and actions. Binning is done via quantile discretization in order to have maximal resolution in frequently visited areas. It also allows for faster sampling during rollouts, since the probability of falling in each bin is uniform. The number of state and action bins per dimension is denoted and , respectively.
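The quantile binning step can be sketched for a one-dimensional state as follows. This is an illustration of quantile discretization in general, not a transcription of Algorithm 1; the function names, the Gaussian rollout data and the number of bins are assumptions.

```python
import bisect
import random
import statistics

def fit_quantile_bins(samples, n_bins):
    """Return the n_bins - 1 interior cut points at equally spaced quantiles."""
    return statistics.quantiles(samples, n=n_bins, method="inclusive")

def to_bin(value, edges):
    """Map a continuous value to its bin index in the range 0..len(edges)."""
    return bisect.bisect_right(edges, value)

# Hypothetical 1-D states visited during rollouts, concentrated around 0.
random.seed(0)
visited = [random.gauss(0.0, 1.0) for _ in range(10_000)]

edges = fit_quantile_bins(visited, n_bins=8)
counts = [0] * 8
for s in visited:
    counts[to_bin(s, edges)] += 1
```

Each bin receives roughly the same number of visited states, and the bins are narrower where states are dense, which is the "maximal resolution in frequently visited areas" property described above.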
All further distances between two discretized policies have to be computed over the corresponding discrete grid . Similarly to Eq. (11), one can look at the average action taken by two MDP policies at state :
(12) 
This relationship is one of the motivations behind using to assess goodness-of-fit in the experiments section. Moreover, since the support is finite and discrete, computation can be done efficiently (Rubner et al., 1998).
3.3 Pruning unvisited states
Some states might never be visited, and one might wonder how much performance is lost by pruning those states. Let denote the number of times visits state and the number of times action is taken from over trajectories. It can happen that and is never visited. Hence, some rarely visited state-action pairs are not required and can be pruned in order to reduce the number of parameters in the discretized policy. If, at inference time, one of these pruned states is encountered, we can simply act uniformly at random, knowing that the performance will only be slightly affected.
This leads us to define a pruned policy as one which acts randomly under some state visitation threshold. Precisely,
(13) 
Pruning rarely visited states can drastically reduce the number of parameters, while maintaining high probability performance guarantees for .
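A minimal sketch of such a pruned tabular policy is given below. The dictionary representation, the function names and the use of a non-strict threshold (a state is kept when its count is at least `threshold`) are assumptions made for illustration.

```python
def prune_policy(table, visit_counts, threshold):
    """Keep rows of states visited at least `threshold` times; drop the rest."""
    return {s: row for s, row in table.items() if visit_counts.get(s, 0) >= threshold}

def act_probs(pruned, state, n_actions):
    """Stored distribution for kept states, uniform for pruned or unseen ones."""
    return pruned.get(state, [1.0 / n_actions] * n_actions)

# Hypothetical discretized policy over 4 state bins and 3 action bins.
table = {
    0: [0.7, 0.2, 0.1],
    1: [0.1, 0.8, 0.1],
    2: [0.3, 0.3, 0.4],
    3: [0.0, 0.5, 0.5],
}
visits = {0: 120, 1: 45, 2: 1}  # state 3 was never visited

pruned = prune_policy(table, visits, threshold=5)
```

Only the rows of states 0 and 1 are stored here; querying state 2 or 3 falls back to the uniform distribution over the three action bins.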
Theorem 3.1.
(Policy pruning) Let be the number of rollout trajectories of the policy , be the largest reward. With high probability , we have a guarantee on the performance of the reconstructed policy :
(14) 
The proof is shown in Theorem 2 of (Simão et al., 2019). The theorem allows us to discretize only visited states while still ensuring a strong performance guarantee with high probability.
3.4 Acting with tabular policies
In order to execute the learned policy, one has to be able to (1) generate samples conditioned on a given state and (2) evaluate the log-probability of state-action pairs. Acting with a tabular density can be done via various sampling techniques. For example, importance sampling of the atoms corresponding to each bin is possible when is small, while the inverse c.d.f. method is expected to be faster in larger dimensions. The process can be further optimized on a case-by-case basis for each method. For example, sampling directly from the real part of a characteristic function can be done with the algorithm defined in (Devroye, 1986).
Furthermore, it is possible to use any of the above algorithms to jointly sample pairs under the assumption that states were discretized by quantiles. First, uniformly sample a state bin and then use any of the conditional sampling algorithms to sample action . Optionally, one can add uniform noise (clipped to the range of the bin) to the sampled action. This naive trick would transform discrete actions into continuous approximations of the policy network.
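The inverse c.d.f. method for one row of a tabular policy can be sketched as follows. The function names, the example row and the bin edges are hypothetical, and drawing a uniform point inside the selected bin is used here as one simple way to realize the optional noise step.

```python
import bisect
import random
from itertools import accumulate

def sample_action_bin(row, rng):
    """Inverse-c.d.f. sampling of an action bin from one row of the policy table."""
    cdf = list(accumulate(row))
    u = rng.random() * cdf[-1]                 # tolerate rows that sum only approximately to 1
    return min(bisect.bisect_right(cdf, u), len(row) - 1)

def sample_action(row, edges, rng):
    """Draw a bin, then a uniform point inside it to obtain a continuous action."""
    j = sample_action_bin(row, rng)
    return rng.uniform(edges[j], edges[j + 1])

# Hypothetical row of a tabular policy and the edges of its 4 action bins.
row = [0.1, 0.6, 0.3, 0.0]
edges = [-2.0, -1.0, 0.0, 1.0, 2.0]

rng = random.Random(0)
draws = [sample_action_bin(row, rng) for _ in range(20_000)]
freqs = [draws.count(j) / len(draws) for j in range(len(row))]
action = sample_action(row, edges, rng)
```

The empirical bin frequencies approach the stored probabilities, and a bin with zero probability is never selected. Evaluating the log-probability of a state-action pair then amounts to a table lookup followed by a logarithm.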
3.5 Method
We propose the following algorithm which projects the function described by a given neural network onto some inner product space:
Different projection algorithms are expected to perform differently depending on the curvature of the policy. For example, if is a Gaussian distribution for each state in a state space of dimension , then the table will have exactly modes. Moreover, methods decomposing signals in probability space such as FFT and SVD should be able to find structural patterns in the probability domain.
As a rule of thumb, if contains few distinct modes, state-action space methods (e.g. mixture models) are expected to perform better than probability space methods.
4 Theoretical guarantees
In this section, we provide bounds controlling the impact of the proposed approximation method on collected rewards and policy coverage.
4.1 Performance bound (single state case)
Storing an approximation of the ground truth policy implies a trade-off between performance and model size. Our framework allows one to control the difference in collected reward as a function of reconstruction error in probability space.
Lemma 4.1.
Let be policies in , be the bandit’s reward function. Let be such that and s.t. . Then
Lemma 4.1 (whose proof can be found in the appendix) guarantees that rewards collected by will be no further than away from rewards of .
4.2 Performance bounds (multi-state case)
Similar guarantees can be obtained when the state space is not reduced to a single state.
Theorem 4.2.
(Multi-state policy reconstruction) Let be square-summable policies. The following holds
(15) 
where , is the advantage function, and .
The proof can be found in the Appendix and relies on techniques found in (Schulman et al., 2015). A similar relation is presented in the next section for stationary distributions of .
A key technical difference in the multistate setting is that the discretization of the state space and pruning of the policy must be taken into account. Consequently, the performance error is now decomposed by triangle inequality into the discretization error (Eq. (12)), the pruning error (Thm 3.1) and the projection error which is controlled by specific function approximation theorems:
(16)  
Loose bounds on the reconstruction error are given in Eq. (6) and Eq. (8); tighter versions can be found in the literature (Giardina & Chirlian, 1972; Rakhlin et al., 2005).
To estimate the total error, one can separately estimate each of the three terms of Eq. (16). Empirical results for these three error terms are shown in the next section.
4.3 Connection between reconstruction error and policy coverage
A useful metric to assess the long-term behaviour of a given policy in an MDP is the stationary distribution of the Markov chain induced by . For a given policy, measures the state coverage of the model averaged with respect to that policy.
If one wishes to switch between the MDP and Markov chain frameworks for a fixed policy, it is possible to define the expected transition model . In tensor form, it can be represented as , where is the policy matrix, is the Khatri-Rao product and is the second matricization of the transition tensor .
If the Markov chain defined above is irreducible and homogeneous, then its stationary distribution corresponds to the state occupancy probability given by the principal left eigenvector of . (The existence and uniqueness of follow from the Perron-Frobenius theorem.)
The following theorem bounds the error made on the reconstruction of the long-run distribution as a function of the system’s transition dynamics and policy reconstruction error.
Theorem 4.3 (Approximate policy coverage).
Let be the Schatten norm and be the vector norm. If (, ) is an irreducible, homogeneous Markov chain, then the following holds:
(17) 
where and is a vector of all ones.
A detailed proof can be found in the Appendix.
Note that the upper bound depends on both the environment’s structure and the policy reconstruction quality. It is thus expected that, for MDPs with particularly large singular values of , the bound converges less quickly than for those with smaller singular values. A visualization of this bound is provided in Figure 1.
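The quantities involved in this section can be computed directly for a small tabular MDP. The sketch below builds the expected transition model by averaging the transition tensor over the policy, extracts the stationary distribution as the principal left eigenvector, and compares the coverage of a policy with that of a slightly perturbed copy. The random MDP, the perturbation size and the function names are assumptions; the code illustrates the objects being compared rather than the bound of Theorem 4.3 itself.

```python
import numpy as np

def expected_transition(policy, transitions):
    """P_pi[s, s'] = sum_a policy[s, a] * transitions[s, a, s']."""
    return np.einsum("sa,sat->st", policy, transitions)

def stationary_distribution(p_pi):
    """Principal left eigenvector of p_pi, normalized to sum to one."""
    eigvals, eigvecs = np.linalg.eig(p_pi.T)
    lead = eigvecs[:, np.argmax(eigvals.real)].real
    return lead / lead.sum()

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Hypothetical dense transition tensor and policy (strictly positive, hence irreducible).
transitions = rng.random((n_states, n_actions, n_states)) + 0.1
transitions /= transitions.sum(axis=2, keepdims=True)
policy = rng.random((n_states, n_actions)) + 0.1
policy /= policy.sum(axis=1, keepdims=True)

# Perturb the policy slightly to mimic a reconstruction error.
policy_hat = policy + 0.01 * rng.random((n_states, n_actions))
policy_hat /= policy_hat.sum(axis=1, keepdims=True)

d = stationary_distribution(expected_transition(policy, transitions))
d_hat = stationary_distribution(expected_transition(policy_hat, transitions))
coverage_gap = np.linalg.norm(d - d_hat)
policy_gap = np.linalg.norm(policy - policy_hat)
```

Comparing `coverage_gap` with `policy_gap` for different transition tensors gives an empirical feel for how the environment's dynamics modulate the effect of a policy reconstruction error on state coverage.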
5 Experimental results
In this section, we visualize the rate of convergence of a reconstructed stationary distribution via a toy example to illustrate Eq. (17). We then assess the validity of our method on several environments and compare the quality of the reconstructed policy for different projection methods, in terms of performance as well as computation time.
5.1 Rate of convergence of stationary distribution under random policy
We validate the rate of convergence of Thm. 4.3 in the toy experiment described below. Consider a deterministic chain MDP with states. A fixed policy in state transitions to with probability and to with probability . If in states or , the agent remains there with probability and , respectively. We consider the expected transition model obtained using the policy reconstructed through discrete Fourier transform. A visualization of Theorem 4.3 is shown in Fig. 1.
5.2 Control domains
We consider a range of reconstruction tasks from three environments: a toy bandit problem inspired by Fellows et al. (2018), as well as the classical control problems of ContinuousMountainCar-v0 and Pendulum-v0 from OpenAI Gym (Brockman et al., 2016).
In all environments but the first, we omit GMM and fixed-basis GMM methods, since the expectation-maximization algorithm runs into stability problems when dealing with an extremely high number of components. Instead, we use SVD for low-rank matrix factorization and the fourth-order Daubechies wavelet (Daubechies, 1988) (DB4) as an orthonormal basis alternative to Fourier (DFT). We use and as initial estimates for the discretization for all experiments except for Mountain Car ().
5.2.1 Bandit turntable
We consider a bandit problem in which rewards are spread in a circle as shown in Figure 2 a). Actions consist of angles in the interval . Given the corresponding Boltzmann policy, we compare the reconstruction quality of the discrete Fourier transform and GMM. In addition to Fourier and GMM, we also compare with fixed-basis GMM, where the means are sampled uniformly over and variance is drawn uniformly from ; only mixing coefficients are learned. This provides insight into whether the reconstruction algorithms are learning the policy, or if is so simple that a random set of Gaussians can already approximate it well.
Results are reported in Figure 2 b), where the metric was obtained by first computing the row-wise (i.e. action-specific) scores for every state, and subsequently averaging over all possible states. As increases, EM has more trouble converging, as can be seen in the green curve’s trend and confidence intervals from to 100. Fixed-basis GMM struggles to accurately approximate the policy. On the other hand, DFT shows a stable performance after a threshold of components.
5.2.2 Mountain Car
In this environment, the agent (a car) must reach the flag located at the top of a hill. It needs to go back and forth in order to generate sufficient momentum to reach the top. The agent is allowed to apply a speed motion in the interval . We use Soft Actor-Critic (SAC, Haarnoja et al. 2018) to train the agent until convergence ( steps).
We compare all reconstruction methods with a maximum likelihood estimate (MLE) of , which consists of approximating the discrete policy with samples of that discrete policy. This method is referred to as "oracle" in the plots, since no reconstruction is performed. As shown in Figure 3, DFT, SVD and DB4 reach the same reconstruction error as the oracle method. All methods achieve good performance with one order of magnitude fewer parameters than the neural network. Note that DB4 shows slightly better performance than DFT.
5.2.3 Pendulum
This classical mechanics task consists of a pendulum which needs to swing up in order to stay upright. The actions represent the joint effort (torque) between and units. We train SAC until convergence and save snapshots of the actor and critic after and steps. The reconstruction task is to recover both the Gaussian policy (actor) and the Boltzmann-Q policy (critic, temperature set to 1). While the Gaussian policy is unimodal, the Boltzmann-Q policy can be multimodal and is a good challenge to demonstrate properties of each method. As shown in Figure 5, when the policy has converged to a spiky shape ( steps), all methods show comparable performance (in terms of convergence). DFT shows better convergence at early stages of training ( steps), that is, when the ground truth policy has a large variance. Note that all methods use fewer parameters than the original neural network.
6 Conclusion
In this work, we introduced a general framework for provably efficient policy reconstruction. It allowed us to drastically compress policy densities in our experiments by a projection onto basis functions. Moreover, the reconstruction error has been shown to depend on the discretization, pruning and projection algorithms. We conducted experiments to demonstrate the behaviour of four sets of basis functions on a set of continuous control tasks, which exhibit desirable performance guarantees. A potential extension to our framework would be to operate directly in the continuous space, avoiding the discretization step.
References
 Baumgartner (2011) Baumgartner, B. An inequality for the trace of matrix products, using absolute values. arXiv preprint arXiv:1106.6189, 2011.
 Berkenkamp et al. (2017) Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pp. 908–918, 2017.
 Bogue (2017) Bogue, R. Robots that interact with humans: a review of safety technologies and standards. Industrial Robot, 44:395–400, 2017.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.
 Cho & Meyer (2001) Cho, G. E. and Meyer, C. D. Comparison of perturbation bounds for the stationary distribution of a markov chain. Linear Algebra and its Applications, 335(1-3):137–150, 2001.
 Daubechies (1988) Daubechies, I. Orthonormal bases of compactly supported wavelets. Communications on pure and applied mathematics, 41(7):909–996, 1988.
 Devroye (1986) Devroye, L. An automatic method for generating random variates with a given characteristic function. In SIAM journal on applied mathematics, 1986.
 Doan et al. (2019) Doan, T., Mazoure, B., Durand, A., Pineau, J., and Hjelm, R. D. Attraction-repulsion actor-critic for continuous control reinforcement learning. arXiv preprint arXiv:1909.07543, 2019.
 Du et al. (2017) Du, S. S., Wang, Y., and Singh, A. On the power of truncated svd for general high-rank matrix estimation problems. In Advances in neural information processing systems, pp. 445–455, 2017.
 Dudley (2018) Dudley, R. M. Real analysis and probability. Chapman and Hall/CRC, 2018.
 Egozcue et al. (2006) Egozcue, J. J., Díaz-Barrero, J. L., and Pawlowsky-Glahn, V. Hilbert space of probability density functions based on aitchison geometry. Acta Mathematica Sinica, 22(4):1175–1182, 2006.
 Fellows et al. (2018) Fellows, M., Ciosek, K., and Whiteson, S. Fourier Policy Gradients. In ICML, 2018.
 Ghavamzadeh et al. (2016) Ghavamzadeh, M., Petrik, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pp. 2298–2306, 2016.
 Giardina & Chirlian (1972) Giardina, C. and Chirlian, P. Bounds on the truncation error of periodic signals. IEEE Transactions on Circuit Theory, 19(2):206–207, 1972.
 Goetschalckx et al. (2018) Goetschalckx, K., Moons, B., Wambacq, P., and Verhelst, M. Efficiently combining svd, pruning, clustering and retraining for enhanced neural network compression. In Proceedings of the 2nd International Workshop on Embedded and Mobile Deep Learning, pp. 1–6, 2018.
 Haar (1909) Haar, A. Zur theorie der orthogonalen funktionensysteme. Georg-August-Universitat, Gottingen., 1909.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.
 Halko et al. (2011) Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
 Jackson (1930) Jackson, D. The theory of approximation. The American mathematical society, 1930.
 Kim (2015) Kim, H.-Y. Statistical notes for clinical researchers: Type i and type ii errors in statistical decision. Restorative dentistry & endodontics, 40(3):249–252, 2015.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Liu et al. (2018) Liu, S., Lin, Y., Zhou, Z., Nan, K., Liu, H., and Du, J. On-demand deep model compression for mobile devices: A usage-driven model selection framework. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pp. 389–400. ACM, 2018.
 Lohne (2017) Lohne, M. The computational complexity of the fast fourier transform. Technical report, 2017.
 Lu et al. (2017) Lu, Y., Kumar, A., Zhai, S., Cheng, Y., Javidi, T., and Feris, R. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5334–5343, 2017.
 Mazoure et al. (2019) Mazoure, B., Doan, T., Durand, A., Hjelm, R. D., and Pineau, J. Leveraging exploration in off-policy algorithms via normalizing flows. CoRR, abs/1905.06893, 2019. URL http://arxiv.org/abs/1905.06893.
 McLachlan & Peel (2004) McLachlan, G. and Peel, D. Finite mixture models. John Wiley & Sons, 2004.
 Moon et al. (2011) Moon, W.C., Yoo, K.E., Choi, Y.C., et al. Air traffic volume and air traffic control human errors. Journal of Transportation Technologies, 1(03):47, 2011.
 Puterman (2014) Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 Rakhlin et al. (2005) Rakhlin, A., Panchenko, D., and Mukherjee, S. Risk bounds for mixture density estimation. ESAIM: Probability and Statistics, 9:220–229, 2005.
 Rasmussen (2000) Rasmussen, C. E. The infinite gaussian mixture model. In Advances in neural information processing systems, pp. 554–560, 2000.
 Resnikoff et al. (2012) Resnikoff, H. L., Raymond Jr, O., et al. Wavelet analysis: the scalable structure of information. Springer Science & Business Media, 2012.
 Rubner et al. (1998) Rubner, Y., Tomasi, C., and Guibas, L. J. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pp. 59–66. IEEE, 1998.
 Rusu et al. (2015) Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1889–1897, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/schulman15.html.
 Schweitzer (1968) Schweitzer, P. J. Perturbation theory and finite markov chains. Journal of Applied Probability, 5(2):401–413, 1968.
 Shalev-Shwartz et al. (2017) Shalev-Shwartz, S., Shammah, S., and Shashua, A. On a formal model of safe and scalable self-driving cars. arXiv preprint arXiv:1708.06374, 2017.
 Simão et al. (2019) Simão, T. D., Laroche, R., and des Combes, R. T. Safe policy improvement with an estimated baseline policy, 2019.
 Stein & Shakarchi (2003) Stein, E. and Shakarchi, R. Fourier Analysis: An Introduction. Princeton University Press, 2003. ISBN 9780691113845. URL https://books.google.ca/books?id=I6CJngEACAAJ.
 Tsitsiklis & Van Roy (1999) Tsitsiklis, J. N. and Van Roy, B. Optimal stopping of markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10):1840–1851, 1999.
 Ubee et al. (2019) Ubee, S. S., Selvan, M., Chandrashekar, R., and Cooke, P. Safety considerations for performing robotic surgery in the presence of a permanent pacemaker. Journal of Perioperative Practice, 29(7–8):242–246, 2019.
7 Appendix
Reproducibility Checklist
We follow "The Machine Learning Reproducibility Checklist" and point to the relevant sections for each item here.
For all algorithms presented, check if you include:

A clear description of the algorithm, see main paper and included codebase. The proposed approach is completely described by Alg. 2.

An analysis of the complexity (time, space, sample size) of the algorithm. The space complexity of our algorithm depends on the number of desired basis functions. It is for a 1dimension component GMM with diagonal covariance, for a truncated SVD of an matrix, and only for the wavelet and Fourier bases. Note that a simple implementation of SVD requires computing all three matrices before truncation.
The time complexity depends on the reconstruction, pruning and discretization algorithms, which involve conducting rollouts w.r.t. policy , training a quantile encoder on the state space trajectories, and finally embedding the pruned matrix using the desired algorithm. We found that the Gaussian mixture is the slowest to train, due to shortcomings of the expectation-maximization algorithm.

A link to downloadable source code, including all dependencies. The code is included with the Supplemental Files as a zip file; all dependencies can be installed using Python's package manager. Upon publication, the code will be made available on GitHub. Additionally, we include the model's weights as well as the discretized policy for the Pendulum-v0 environment.
For all figures and tables that present empirical results, check if you include:

A complete description of the data collection process, including sample size. We use standard benchmarks provided in OpenAI Gym (Brockman et al., 2016).

A link to downloadable version of the dataset or simulation environment. Not applicable.

An explanation of how samples were allocated for training / validation / testing.
We do not use a training/validation/test split, but instead report the mean performance (and one standard deviation) of the policy at evaluation time across 5 trials.

An explanation of any data that were excluded. The only data exclusion was done during policy pruning, as outlined in the main paper.

The exact number of evaluation runs. 5 trials to obtain all figures, and 200 rollouts to determine .

A description of how experiments were run. See Section Experimental Results in the main paper and didactic example details in Appendix.

A clear definition of the specific measure or statistics used to report results. Undiscounted returns across the whole episode are reported, and in turn averaged across 5 seeds. Confidence intervals shown in Fig. 8 were obtained using the pooled variance formula from a difference of means test.

Clearly defined error bars. Confidence intervals are always mean ± one standard deviation over 5 trials.

A description of results with central tendency (e.g. mean) and variation (e.g. stddev). All results use the mean and standard deviation.

A description of the computing infrastructure used. All runs used 1 CPU for all experiments with Gb of memory.
7.1 Discretization techniques
Naively discretizing some function over a fixed interval can be done by equally spaced bins. To pass from a continuous value to discrete, one would find such that . However, using a uniform grid for a non-uniform wastes representation power. To reallocate bins away from low-density areas, we use the quantile binning method, which first computes the cumulative distribution function of , called . Then, it finds points such that the probability of falling in each bin is equal. Quantile binning can be seen as uniform binning in probability space, and exactly corresponds to uniform discretization if is constant (uniform distribution).
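The contrast between uniform and quantile binning can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the paper: the helper names (uniform_edges, quantile_edges, assign_bins) and the exponential test distribution are assumptions introduced here, and the empirical quantiles of the samples stand in for the cumulative distribution function.

```python
import numpy as np


def uniform_edges(samples, n_bins):
    """Equally spaced bin edges over the observed range (hypothetical helper)."""
    return np.linspace(samples.min(), samples.max(), n_bins + 1)


def quantile_edges(samples, n_bins):
    """Bin edges at equally spaced empirical quantiles, so that each bin
    holds roughly the same probability mass (hypothetical helper)."""
    return np.quantile(samples, np.linspace(0.0, 1.0, n_bins + 1))


def assign_bins(values, edges):
    """Map each value to a bin index in {0, ..., n_bins - 1}.
    Only interior edges are searched, so the extremes fall in the end bins."""
    return np.searchsorted(edges[1:-1], values, side="right")


rng = np.random.default_rng(0)
# Assumption: a skewed (exponential) density stands in for a non-uniform target.
samples = rng.exponential(scale=1.0, size=4000)
n_bins = 10

u_counts = np.bincount(assign_bins(samples, uniform_edges(samples, n_bins)),
                       minlength=n_bins)
q_counts = np.bincount(assign_bins(samples, quantile_edges(samples, n_bins)),
                       minlength=n_bins)

print("uniform bin counts: ", u_counts)
print("quantile bin counts:", q_counts)
```

With uniform edges most samples pile into the first few bins, whereas the quantile edges spread them almost evenly, which is the sense in which quantile binning is uniform in probability space.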
Below, we present an example of discretizing 4,000 sample points taken from four distributions, using the quantile binning.
In Figure 9, we are only allowed to allocate 10 bins per dimension. Note how the grid is denser around high-probability regions, while the outermost bins are the widest. This uneven discretization, if applied to the policy density function, allows the agent to have higher detail around high-probability regions.
In Python, the class sklearn.preprocessing.KBinsDiscretizer (with strategy="quantile") can be used to easily discretize samples from the target density.
Since policy densities are expected to be more complex than the previous example, we analyze quantile binning in the discrete Mountain Car environment. To do so, we train a DQN policy until convergence (i.e. collected rewards at least ), and perform 500 rollouts using the greedy policy. Trajectories are used to construct a 2-dimensional state visitation histogram. At the same time, all visited states are given to the quantile binning algorithm, which is used to assign observations to bins.
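A minimal usage sketch of KBinsDiscretizer with quantile binning follows. The two-dimensional Gaussian samples are an assumption made for illustration and are not the four distributions shown in Figure 9; only the sample size (4,000) and the number of bins per dimension (10) are taken from the text.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# Assumption: standard normal samples stand in for the target density.
samples = rng.normal(size=(4000, 2))

# 10 quantile bins per dimension; "ordinal" returns one integer bin id per feature.
disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
bin_ids = disc.fit_transform(samples).astype(int)

# Bin edges of the first dimension: narrow near the mode, wide in the tails.
widths = np.diff(disc.bin_edges_[0])
counts = np.bincount(bin_ids[:, 0], minlength=10)

print("bin widths (dim 0):", np.round(widths, 2))
print("bin counts (dim 0):", counts)
```

The fitted edges are stored per feature in bin_edges_, so the same discretizer can later be applied to new observations with transform.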
Figure 10 presents the state visitation histogram with bins, where color represents the state visitation count. Bins are shown by dotted lines.
7.2 Proofs
Proof.
(Theorem 4.2)
Following (Schulman et al., 2015), we take
Using inequality (45) from (Schulman et al., 2015), we have:
Expanding and using triangle inequality, we have:
Combining with inequality (45):
∎
Proof.
(Theorem 4.3) We first represent the approximate transition matrix induced by the policy as a perturbation of the true transition:
(18) 
Then, the difference between stationary distributions and is equal to (Schweitzer, 1968; Cho & Meyer, 2001):
(19) 
where is the fundamental matrix of the Markov chain induced by and is a vector of ones.
In particular, the above result holds for Schatten norms (Baumgartner, 2011):
(20) 
So far, this result is known for irreducible, homogeneous Markov chains and has no decision component.
Consider the matrix , which is the difference between expected transition models of true and approximate policies. It can be expanded into products of matricized tensors:
(21) 
The norm of can also be upper bounded as follows:
(22) 
Combining this result with that of Schweitzer (1968) yields
(23) 
∎
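The classical perturbation identity of Schweitzer (1968) invoked in the proof of Theorem 4.3 can be checked numerically. The sketch below is only an illustration under assumed notation, which may differ from the paper's: P is the true transition matrix, P_tilde = P + E its perturbation, pi and pi_tilde the corresponding stationary distributions (as row vectors), and Z = (I - P + 1 pi^T)^{-1} the fundamental matrix. It verifies that pi_tilde - pi = pi_tilde E Z on random chains with strictly positive entries, together with a simple bound in the induced infinity norm rather than the Schatten norms used in the proof.

```python
import numpy as np


def stationary_distribution(P):
    """Stationary row vector pi (pi P = pi, sum(pi) = 1) of an irreducible chain.
    Uses the fact that pi (I - P + 1 1^T) = 1^T."""
    n = P.shape[0]
    A = np.eye(n) - P + np.ones((n, n))
    return np.linalg.solve(A.T, np.ones(n))


def fundamental_matrix(P, pi):
    """Fundamental matrix Z = (I - P + 1 pi^T)^{-1}."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))


def random_stochastic_matrix(n, rng):
    """Row-stochastic matrix with strictly positive entries (hence irreducible)."""
    M = rng.random((n, n)) + 0.1
    return M / M.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
P = random_stochastic_matrix(5, rng)
# Assumption: the perturbed chain is a convex mixture with another random chain.
P_tilde = 0.9 * P + 0.1 * random_stochastic_matrix(5, rng)
E = P_tilde - P

pi = stationary_distribution(P)
pi_tilde = stationary_distribution(P_tilde)
Z = fundamental_matrix(P, pi)

lhs = pi_tilde - pi        # difference of stationary distributions
rhs = pi_tilde @ E @ Z     # perturbation identity

# Since ||pi_tilde||_1 = 1, the l1 distance is bounded by ||E||_inf * ||Z||_inf.
bound = np.linalg.norm(E, np.inf) * np.linalg.norm(Z, np.inf)

print("identity holds:", np.allclose(lhs, rhs))
print("l1 distance:", np.abs(lhs).sum(), "bound:", bound)
```

The identity follows from multiplying pi_tilde - pi by (I - P + 1 pi^T) and using pi_tilde P_tilde = pi_tilde, so the check should pass for any pair of irreducible chains, not only the random ones generated here.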