1 Introduction
A fundamental quantity of interest in RL is the state–action value ($Q$) function, which quantifies the expected return for taking action $a$ in state $s$. Many RL algorithms, notably Q-learning (Watkins, 1989), learn an approximation of the $Q$ function from environmental interactions. When using function approximation with Q-learning, the agent has a parameterized function class, and learning consists of finding a parameter setting $\theta$ for the approximate value function $\hat{Q}(s,a;\theta)$ that accurately represents the true $Q$ function. A core operation here is finding an optimal action with respect to the value function, $\arg\max_{a} \hat{Q}(s,a;\theta)$, or finding the highest action value, $\max_{a} \hat{Q}(s,a;\theta)$. The need for performing these operations arises not just when computing a behavior policy for action selection, but also during learning itself when using bootstrapping techniques (Sutton and Barto, 2018).
The optimization problem $\max_{a \in \mathcal{A}} \hat{Q}(s,a;\theta)$ is generally challenging if the action space $\mathcal{A}$ is continuous, in contrast to the discrete case where the operation is trivial if the number of discrete actions is not enormous. The challenge stems from the observation that the surface of the function $\hat{Q}(s,a;\theta)$ could have many local maxima and saddle points; therefore, naïve approaches such as finding the maximum through gradient ascent can lead to inaccurate answers (Ryu et al., 2019). In light of this technical challenge, recent work on solving continuous control problems has instead embraced policy-gradient algorithms, which typically compute $\nabla_a \hat{Q}(s,a;\theta)$, rather than solving $\max_{a} \hat{Q}(s,a;\theta)$, and follow the ascent direction to move an explicitly maintained policy towards actions with higher $\hat{Q}$ (Silver et al., 2014). However, policy-gradient algorithms have their own weaknesses, particularly in settings with sparse rewards where computing an accurate estimate of the gradient requires an unreasonable number of environmental interactions (Kakade and Langford, 2002; Matheron et al., 2019). Rather than adopting a policy-gradient approach, we focus on tackling the problem of efficiently computing $\max_{a} \hat{Q}(s,a;\theta)$ for value-function-based RL.

Previous work on value-function-based algorithms for continuous control has shown the benefits of using function classes that are conducive to efficient action maximization. For example, Gu et al. (2016) explored function classes that can capture an arbitrary dependence on the state, but only a quadratic dependence on the action. Given a function class with a quadratic dependence on the action, Gu et al. (2016) showed how to compute $\arg\max_{a} \hat{Q}(s,a;\theta)$ quickly and in constant time. A more general idea is to use input–convex neural networks (Amos et al., 2017) that restrict $\hat{Q}(s,a;\theta)$ to functions that are convex (or concave) with respect to $a$, so that for any fixed state $s$ the optimization problem $\max_{a} \hat{Q}(s,a;\theta)$ can be solved efficiently using convex-optimization techniques (Boyd and Vandenberghe, 2004). These solutions trade the expressiveness of the function class for easy action maximization.
While restricting the function class can enable easy maximization, it can be problematic if no member of the restricted class has low approximation error relative to the true $Q$ function (Lim et al., 2018). More concretely, when the agent cannot possibly learn an accurate $\hat{Q}$, the error could be significant even if the agent can solve $\max_{a} \hat{Q}(s,a;\theta)$ exactly. In the case of input–convex neural networks, for example, high error can occur if the true $Q$ function is completely non-convex. Thus, it is desirable to ensure that, for any true $Q$ function, there exists a member of the function class that approximates $Q$ up to any desired accuracy. Such a function class is said to be capable of universal function approximation (UFA) (Hornik et al., 1989; Benaim, 1994; Hammer and Gersmann, 2003). A function class that is both conducive to efficient action maximization and also capable of UFA would be ideal.
We introduce deep RBF value functions, which approximate $Q$ by a standard deep neural network equipped with an RBF output layer. We show that deep RBF value functions have the two desired properties outlined above. First, using deep RBF value functions enables us to approximate the optimal action up to any desired accuracy. Second, deep RBF value functions support universal function approximation.
Prior work in RL used RBF networks for learning the state-value function ($V$) in problems with discrete action spaces (see Section 9.5.5 of Sutton and Barto (2018) for a discussion). That said, to the best of our knowledge, our discovery of the action-maximization property of RBF networks is novel, and there has been no application of deep RBF networks to continuous control. We combine deep RBF networks with DQN (Mnih et al., 2015), a standard deep RL algorithm originally proposed for discrete actions, to produce a new algorithm called RBF–DQN. We evaluate RBF–DQN on a large set of continuous-action RL problems, and demonstrate its superior performance relative to standard deep-RL baselines.
2 Background
We study the interaction between an environment and an agent that seeks to maximize reward (Sutton and Barto, 2018), a problem typically formulated using Markov Decision Processes (MDPs) (Puterman, 2014). An MDP is usually specified by a tuple: $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$. In this work, $\mathcal{S}$ and $\mathcal{A}$ denote the continuous state space and the continuous action space of the MDP. The MDP model is comprised of two functions, namely the transition model $T$ and the reward model $R$. The discount factor, $\gamma$, determines the importance of immediate reward as opposed to rewards received in the future. The goal of an RL agent is to find a policy, $\pi$, that collects high sums of discounted rewards across timesteps.

For a state $s$, action $a$, and a policy $\pi$, we define the state–action value function:

$$Q^{\pi}(s,a) := \mathbb{E}_{\pi}\big[\, G_t \mid s_t = s,\ a_t = a \,\big],$$

where $G_t := \sum_{i=t}^{\infty} \gamma^{\,i-t} R(s_i, a_i)$ is called the return at timestep $t$. The state–action value function of an optimal policy, denoted by $Q^{*}(s,a)$, can be written recursively (Bellman, 1952):
$$Q^{*}(s,a) = R(s,a) + \gamma \int_{s'} T(s,a,s') \max_{a'} Q^{*}(s',a')\, ds'. \tag{1}$$

If the model of the MDP is available, standard dynamic programming approaches find $Q^{*}$ by solving for the fixed point of (1), known as the Bellman equation.
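In the absence of a model, a class of RL algorithms solve for the fixed point of the Bellman equation using environmental interactions and without learning a model. Q-learning (Watkins, 1989), a notable example of these so-called model-free algorithms, learns an approximation of $Q^{*}$, denoted by $\hat{Q}(s,a;\theta)$ and parameterized by $\theta$. When combined with function approximation, Q-learning updates parameters as follows:

$$\theta \leftarrow \theta + \alpha \Big( r + \gamma \max_{a'} \hat{Q}(s',a';\theta) - \hat{Q}(s,a;\theta) \Big) \nabla_{\theta} \hat{Q}(s,a;\theta), \tag{2}$$

using tuples of experience $\langle s, a, r, s' \rangle$ observed during environmental interactions, where $\alpha$ is the learning rate. The quantity $\delta := r + \gamma \max_{a'} \hat{Q}(s',a';\theta) - \hat{Q}(s,a;\theta)$ is often referred to as the temporal difference (TD) error (Sutton, 1988).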
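As a minimal sketch of the update rule (2), the snippet below applies one Q-learning step to a linear approximator. The feature map `phi`, the finite set `candidate_actions` used to approximate the maximum over next actions, and the step sizes are all assumptions made for illustration; they are not taken from the paper.

```python
def q_hat(theta, features):
    """Linear value estimate: dot product of parameters and features."""
    return sum(t * f for t, f in zip(theta, features))


def q_learning_update(theta, phi, s, a, r, s_next, candidate_actions,
                      alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning step for a linear Q-hat.

    phi(s, a) is a hypothetical feature map. The max over next actions is
    taken over a finite candidate set, which is an assumption of this sketch.
    """
    best_next = max(q_hat(theta, phi(s_next, b)) for b in candidate_actions)
    features = phi(s, a)
    td_error = r + gamma * best_next - q_hat(theta, features)
    # For a linear Q-hat, the gradient with respect to theta is the feature vector.
    new_theta = [t + alpha * td_error * f for t, f in zip(theta, features)]
    return new_theta, td_error


# Illustrative usage with a toy feature map [1, s, a].
phi = lambda s, a: [1.0, s, a]
theta, delta = q_learning_update([0.0, 0.0, 0.0], phi, s=2.0, a=0.5, r=1.0,
                                 s_next=3.0, candidate_actions=[-1.0, 0.0, 1.0])
```

The finite candidate set is exactly the step that becomes difficult in continuous action spaces, which is the problem the rest of the paper addresses.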
Note that Q-learning's update rule (2) is agnostic to the choice of function class, and so in principle any differentiable and parameterized function class could be used in conjunction with the above update to learn $\theta$ parameters. For example, Sutton (1996) used linear function approximation, Konidaris et al. (2011) used Fourier basis functions, and Mnih et al. (2015) chose the class of convolutional neural networks and showed remarkable results for learning to play Atari games.
3 Deep RBF Value Functions
Deep RBF value functions combine the practical advantages of deep networks (Goodfellow et al., 2016) with the theoretical advantages of radial-basis functions (RBFs) (Powell, 1987). A deep RBF network is comprised of a number of arbitrary hidden layers, followed by an RBF output layer, defined next. The RBF output layer, first introduced in a seminal paper by Broomhead and Lowe (1988), is sometimes used as a standalone single-layer function approximator, referred to as a (shallow) RBF network. We use an RBF network as the final, or output, layer of a deep network.
For a given input $a$, the RBF layer $f(a)$ is defined as:

$$f(a) := \sum_{i=1}^{N} g(a - a_i)\, v_i, \tag{3}$$

where each $a_i$ represents a centroid location, $v_i$ is the value of the centroid $a_i$, $N$ is the number of centroids, and $g$ is an RBF. A commonly used RBF is the negative exponential:

$$g(a - a_i) := e^{-\beta \lVert a - a_i \rVert}, \tag{4}$$

equipped with a smoothing parameter $\beta \geq 0$. (See Karayiannis (1999) for a thorough treatment of other RBFs.) Formulation (3) could be thought of as an interpolation based on the values and the weights of all centroids, where the weight of each centroid is determined by its proximity to the input. Proximity here is quantified by the RBF $g$, in this case the negative exponential (4).

As will be clear momentarily, it is theoretically useful to normalize centroid weights to ensure that they sum to 1 so that $f$ implements a weighted average. This weighted average is sometimes referred to as a normalized Gaussian RBF layer (Moody and Darken, 1989; Bugmann, 1998):

$$f_{\beta}(a) := \frac{\sum_{i=1}^{N} e^{-\beta \lVert a - a_i \rVert}\, v_i}{\sum_{i=1}^{N} e^{-\beta \lVert a - a_i \rVert}}. \tag{5}$$
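The normalized layer in (5) can be sketched in a few lines. This is an illustrative implementation that assumes negative-exponential weights of the form $e^{-\beta \lVert a - a_i \rVert}$ with a Euclidean norm; the function names, centroid locations, and values below are hypothetical.

```python
import math


def rbf_weights(a, centroids, beta):
    """Normalized negative-exponential weight of each centroid at input a."""
    dists = [math.dist(a, c) for c in centroids]
    # Shift by the smallest distance so the largest exponent is 0 (numerical
    # stability); the shift cancels in the normalization.
    d_min = min(dists)
    unnormalized = [math.exp(-beta * (d - d_min)) for d in dists]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def normalized_rbf(a, centroids, values, beta):
    """Weighted average of centroid values under the assumed weights."""
    weights = rbf_weights(a, centroids, beta)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical centroids and values in a two-dimensional input space.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
values = [1.0, 4.0, 5.0]
smooth = normalized_rbf((0.3, 0.4), centroids, values, beta=0.0)
sharp = normalized_rbf((0.9, 0.1), centroids, values, beta=50.0)
```

With `beta=0.0` every centroid receives the same weight, so the output is the plain mean of the values; with a large `beta` the output is dominated by the centroid closest to the input.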
As the smoothing parameter $\beta \to \infty$, the function implements a winner-take-all case where the value of the function at a given input is determined only by the value of the closest centroid location, nearest-neighbor style. This limiting case is sometimes referred to as a Voronoi decomposition (Aurenhammer, 1991). Conversely, $f_{\beta}$ converges to the mean of centroid values regardless of the input as $\beta$ gets close to 0; that is, $\forall a \ \lim_{\beta \to 0} f_{\beta}(a) = \frac{1}{N}\sum_{i=1}^{N} v_i$. Since an RBF layer is differentiable, it could be used in conjunction with (stochastic) gradient descent and backprop to learn the centroid locations and their values by optimizing for a loss function. Note that formulation (5) is different from the Boltzmann softmax operator (Asadi and Littman, 2017; Song et al., 2019), where the weights are determined, not by an RBF, but by the action values.

Finally, to represent the $Q$ function for RL, we use the following formulation:
$$\hat{Q}_{\beta}(s,a;\theta) := \frac{\sum_{i=1}^{N} e^{-\beta \lVert a - a_i(s;\theta) \rVert}\, v_i(s;\theta)}{\sum_{i=1}^{N} e^{-\beta \lVert a - a_i(s;\theta) \rVert}}. \tag{6}$$

A deep RBF $Q$ function (6) internally learns two mappings: a state-dependent set of centroid locations $a_i(s;\theta)$ and state-dependent centroid values $v_i(s;\theta)$. The role of the RBF output layer, then, is to use these learned mappings to form the output of the entire deep RBF $Q$ function. We illustrate the architecture of a deep RBF $Q$ function in Figure 1. In the experimental section, we demonstrate how to learn parameters $\theta$.

We now show that deep RBF $Q$ functions have the first desired property for value-function-based RL, namely that they enable easy action maximization.
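In light of the RBF formulation, it is easy to find the value of the deep RBF $Q$ function at each centroid location $a_i$, that is, to compute $\hat{Q}_{\beta}(s,a_i;\theta)$. Note that in general $\hat{Q}_{\beta}(s,a_i;\theta) \neq v_i(s;\theta)$ for a finite $\beta$, because the other centroids $a_j$, $j \neq i$, may have non-zero weights at $a_i$. In other words, the action-value function at a centroid $a_i$ can in general differ from the centroid's value $v_i$.

Therefore, to compute $\hat{Q}_{\beta}(s,a_i;\theta)$, we access the centroid location using $a_i(s;\theta)$, then input $a_i$ to get $\hat{Q}_{\beta}(s,a_i;\theta)$. Once we have $\hat{Q}_{\beta}(s,a_i;\theta)$ for every centroid $a_i$, we can trivially find the highest-valued centroid or its corresponding value:

$$\max_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta).$$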
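The centroid-based maximization described above can be sketched as follows. The sketch fixes a state, so the centroid locations and values are passed in directly; in a deep RBF network they would be outputs of the hidden layers. The weights are assumed to be $e^{-\beta \lVert a - a_i \rVert}$, and all names and numbers are hypothetical.

```python
import math


def q_rbf(a, centroids, values, beta):
    """Normalized RBF value at action a for a fixed state."""
    unnormalized = [math.exp(-beta * math.dist(a, c)) for c in centroids]
    weighted = sum(u * v for u, v in zip(unnormalized, values))
    return weighted / sum(unnormalized)


def max_over_centroids(centroids, values, beta):
    """Evaluate the function at every centroid and return the best one."""
    q_at_centroids = [q_rbf(c, centroids, values, beta) for c in centroids]
    best = max(range(len(centroids)), key=lambda i: q_at_centroids[i])
    return centroids[best], q_at_centroids[best]


# One-dimensional example with three hypothetical centroids.
centroids = [(0.0,), (1.0,), (3.0,)]
values = [1.0, 2.0, 0.5]
best_action, best_q = max_over_centroids(centroids, values, beta=1.0)
```

Note that `best_q` is lower than the centroid value `2.0`, because the neighbouring centroids still carry weight at a finite `beta`. This is why the function is evaluated at the centroids rather than reading off the values $v_i$ directly.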
While in general there may be a gap between the global maximum $\max_{a \in \mathcal{A}} \hat{Q}_{\beta}(s,a;\theta)$ and its easy-to-compute approximation $\max_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta)$, the following theorem shows that this gap is zero in one-dimensional action spaces. More importantly, Theorem 1 guarantees that in action spaces with an arbitrary number of dimensions, the gap gets exponentially small as the smoothing parameter $\beta$ increases, allowing us to reduce the gap very quickly and up to any desired accuracy by simply increasing $\beta$.
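Theorem 1.
Let $\hat{Q}_{\beta}$ be a member of the class of normalized Gaussian RBF value functions.

1. For a one-dimensional action space $\mathcal{A} = \mathbb{R}$:

$$\max_{a \in \mathcal{A}} \hat{Q}_{\beta}(s,a;\theta) = \max_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta).$$

2. For $\mathcal{A} = \mathbb{R}^{d}$ with $d > 1$:

$$0 \leq \max_{a \in \mathcal{A}} \hat{Q}_{\beta}(s,a;\theta) - \max_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta) \leq \mathcal{O}(e^{-\beta}).$$

Proof.
See Appendix. ∎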
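Figure 2 shows an example of the output of an RBF $Q$ function where there exists a gap between $\max_{a \in \mathcal{A}} \hat{Q}_{\beta}(s,a;\theta)$ and $\max_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta)$ for small values of $\beta$. Note also that, consistent with Theorem 1, we can quickly decrease this gap by increasing the value of $\beta$.

In light of the above theoretical result, to approximate $\max_{a \in \mathcal{A}} \hat{Q}_{\beta}(s,a;\theta)$ we compute $\max_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta)$. If the goal is to ensure that the approximation is sufficiently accurate, one can always increase the smoothing parameter to quickly get the desired accuracy.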
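The effect of the smoothing parameter on the maximization gap can be checked numerically. The sketch below, under the assumption of weights $e^{-\beta \lVert a - a_i \rVert}$, compares the best centroid against a brute-force grid search in a two-dimensional action space. The centroids, values, grid bounds, and grid resolution are hypothetical choices, and the grid maximum is only an estimate of the true global maximum.

```python
import math


def q_rbf(a, centroids, values, beta):
    """Normalized RBF value at action a for a fixed state."""
    unnormalized = [math.exp(-beta * math.dist(a, c)) for c in centroids]
    weighted = sum(u * v for u, v in zip(unnormalized, values))
    return weighted / sum(unnormalized)


def maximization_gap(centroids, values, beta, low, high, steps=40):
    """Grid-search estimate of max_a Q minus the best value at a centroid."""
    centroid_max = max(q_rbf(c, centroids, values, beta) for c in centroids)
    grid = [low + (high - low) * k / steps for k in range(steps + 1)]
    # Centroids are added to the candidates so the estimate is never negative.
    candidates = [(x, y) for x in grid for y in grid] + list(centroids)
    grid_max = max(q_rbf(a, centroids, values, beta) for a in candidates)
    return grid_max - centroid_max


centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
values = [1.0, 3.0, 5.0]
gaps = {beta: maximization_gap(centroids, values, beta, -1.0, 3.0)
        for beta in (0.5, 2.0, 20.0)}
```

Printing `gaps` shows how the estimated gap behaves for each `beta`; for the largest value it is numerically negligible in this example.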
Notice that this result holds for normalized Gaussian RBF networks, but not necessarily for the unnormalized case or for other types of RBFs. We believe that this observation is an interesting result in and of itself, regardless of its connection to valuefunctionbased RL.
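We finally note that, for the case where we are actually interested in an approximation for $\min_{a \in \mathcal{A}} \hat{Q}_{\beta}(s,a;\theta)$, we can get the following corollary akin to Theorem 1:

Corollary.
Let $\hat{Q}_{\beta}$ be a member of the class of normalized Gaussian RBF value functions as formulated in (6).

1. For $\mathcal{A} = \mathbb{R}$:

$$\min_{a \in \mathcal{A}} \hat{Q}_{\beta}(s,a;\theta) = \min_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta).$$

2. For $\mathcal{A} = \mathbb{R}^{d}$ with $d > 1$:

$$0 \leq \min_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta) - \min_{a \in \mathcal{A}} \hat{Q}_{\beta}(s,a;\theta) \leq \mathcal{O}(e^{-\beta}).$$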
We now move to the second desired property of RBF networks, namely that these networks are in fact capable of universal function approximation (UFA).
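Theorem 2.
Consider any state–action value function $Q^{\pi}(s,a)$ defined on a closed action space $\mathcal{A}$. Assume that $Q^{\pi}(s,a)$ is a continuous function. For a fixed state $s$ and for any $\epsilon > 0$, there exists a deep RBF value function $\hat{Q}_{\beta}(s,a;\theta)$ and a setting of the smoothing parameter $\beta_0$ for which:

$$\forall a \in \mathcal{A}\ \ \forall \beta \geq \beta_0 \qquad \big| Q^{\pi}(s,a) - \hat{Q}_{\beta}(s,a;\theta) \big| \leq \epsilon.$$

Proof.
See Appendix. ∎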
Collectively, Theorems 1 and 2 guarantee that deep RBF $Q$ functions preserve the desired UFA property while ensuring accurate and efficient action maximization. This combination of properties stands in contrast with prior work that used function classes that enable easy action maximization but lack the UFA property (Gu et al., 2016; Amos et al., 2017), as well as prior work that preserved the UFA property but did not guarantee arbitrarily low error when performing the maximization step (Lim et al., 2018; Ryu et al., 2019). The only important assumption in Theorem 2 is that the true value function is continuous, which is a standard assumption in the UFA literature (Hornik et al., 1989) and in RL (Asadi et al., 2018).
We note that, while using a large value of $\beta$ makes it theoretically possible to approximate any function up to any desired accuracy, there is a downside to using large $\beta$ values. Specifically, very large values of $\beta$ result in extremely local approximations, which ultimately increase sample complexity because experience is not generalized from centroid to centroid. The bias–variance tension between using large $\beta$ values that allow for greater accuracy and using smaller $\beta$ values that reduce sample complexity makes intermediate values of $\beta$ work best. This property could be examined formally through the lens of regularization (Bartlett and Mendelson, 2002).

As for scalability to large action spaces, note that the RBF formulation scales naturally owing to its freedom to come up with centroids that best minimize the loss function. As a thought experiment, suppose that some region of the action space has a high value, so an agent with greedy action selection frequently chooses actions from that region. The deep RBF $Q$ function would then move more centroids to the region, because the region heavily contributes to the loss function. It is unnecessary, then, to initialize centroid locations carefully, or to uniformly cover the action space a priori. In our RL experiments in Section 5, we achieved reasonable results with the number of centroids fixed across every problem, indicating that we need not rapidly increase the number of centroids as the action dimension increases.
4 Experiments: Continuous Optimization
To demonstrate the operation of an RBF network in the simplest and clearest setting, we start with a single-input continuous optimization problem, where the agent lacks access to the true reward function but can sample input–output pairs $\langle a, r \rangle$. This setting is akin to the action maximization step in RL for a single state or, stated differently, a continuous bandit problem. We are interested in evaluating approaches that use tuples of experience $\langle a, r \rangle$ to learn the surface of the reward function, and then optimize the learned function.
To this end, we chose the reward function:
(7) 
Figure 3 (left) shows the surface of this function. It is clearly non-convex and includes several local maxima (and minima). We are interested in two cases: first, the problem where the goal is to find $\arg\max_{a} r(a)$, and second, the converse problem where we desire to find $\arg\min_{a} r(a)$.

Exploration is challenging in this setting (Lattimore and Szepesvári, 2018). Here, our focus is not to find the most effective exploration policy, but to evaluate different approaches based on their effectiveness to represent and optimize a learned reward function $\hat{r}$. So, in the interest of fairness, we adopt the same random action-selection strategy for all approaches.

More concretely, we sampled 500 actions uniformly at random from the two-dimensional input domain and provided the agent with the reward associated with the actions according to (7). We then used this dataset for training. When learning ended, we computed the action that maximized (or minimized) the learned $\hat{r}$. Details of the function classes used in each case, as well as how to perform $\arg\max_{a} \hat{r}(a)$ and $\arg\min_{a} \hat{r}(a)$, are presented below for each individual approach.

For our first baseline, we discretized each action dimension to 7 bins, resulting in 49 bins that uniformly covered the two dimensions of the input space. For each bin, we averaged the rewards over $\langle a, r \rangle$ pairs for which the sampled action $a$ belonged to that bin. Once we had a learned $\hat{r}$, which in this case was just a table, we performed $\arg\max_{a} \hat{r}(a)$ and $\arg\min_{a} \hat{r}(a)$ by a simple table lookup. Discretization clearly fails to scale to problems with higher dimensionality, and we have included this baseline solely for completeness.
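The discretization baseline amounts to a per-bin average followed by a table lookup, which the following sketch illustrates. The domain bounds `low` and `high`, the function names, and the handful of sample pairs are hypothetical; only the 7 bins per dimension come from the text.

```python
def fit_table(samples, low, high, bins=7):
    """Average the observed rewards inside each bin of a uniform grid."""
    width = (high - low) / bins
    sums, counts = {}, {}
    for action, reward in samples:
        # Map each coordinate to a bin index; clip the upper boundary.
        key = tuple(min(int((x - low) / width), bins - 1) for x in action)
        sums[key] = sums.get(key, 0.0) + reward
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def lookup_extreme(table, low, high, bins=7, maximize=True):
    """Return the center of the bin with the highest (or lowest) average."""
    width = (high - low) / bins
    pick = max if maximize else min
    key = pick(table, key=table.get)
    return tuple(low + (k + 0.5) * width for k in key)


# Hand-made samples on an assumed domain [-3, 3] x [-3, 3].
samples = [((-2.9, -2.9), 1.0), ((-2.8, -2.9), 3.0),
           ((2.9, 2.9), 5.0), ((0.0, 0.0), -1.0)]
table = fit_table(samples, -3.0, 3.0)
best = lookup_extreme(table, -3.0, 3.0, maximize=True)
worst = lookup_extreme(table, -3.0, 3.0, maximize=False)
```

The table has at most `bins ** d` entries for a `d`-dimensional action, which is the scaling problem noted above.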
Our second baseline used the input–convex neural network architecture (Amos et al., 2017), where the neural network is constrained so that the learned reward function $\hat{r}$ is convex. Learning was performed by RMSProp optimization (Goodfellow et al., 2016) with mean-squared loss. Once $\hat{r}$ was learned, we used gradient ascent for finding the maximum, and gradient descent for finding the minimum. Note that this input–convex approach subsumes the excluded quadratic case proposed by Gu et al. (2016), because quadratic functions are just a special case of convex functions, but the converse is not necessarily true (Boyd and Vandenberghe, 2004).

Our next baseline was the wire-fitting method proposed by Baird and Klopf (1993). This method is similar to RBF networks in that it also learns a set of centroids. Similar to the previous case, we used the RMSProp optimizer and mean-squared loss, and finally returned the centroids with lowest (or highest) values according to the learned $\hat{r}$.
As the last baseline, we used a standard feed-forward neural network architecture with two hidden layers to learn $\hat{r}$. It is well known that this function class is capable of UFA (Hornik et al., 1989) and so can accurately learn the reward function in principle. However, once learning ends, we face a non-convex optimization problem for action maximization (or minimization). We simply initialized gradient descent (ascent) to a point chosen uniformly at random, and followed the corresponding direction until convergence.

To learn an RBF reward function, we used a fixed number of centroids $N$ and a fixed smoothing parameter $\beta$. We again used RMSProp and mean-squared loss minimization. Recall that Theorem 1 showed that with an RBF network the following approximations are well justified in theory: $\max_{a} \hat{r}(a) \approx \max_{i \in [1,N]} \hat{r}(a_i)$ and $\min_{a} \hat{r}(a) \approx \min_{i \in [1,N]} \hat{r}(a_i)$. As such, when the learning of $\hat{r}$ ends, we output the centroids with the highest and lowest reward.
For each individual case, we ran the corresponding experimental pipeline for 30 different random seeds. The solution found by each learner was fed to the true reward function (7) to get the true quality of the found solution. We report the average reward achieved by each function class in Figure 4. The RBF learner outperforms all baselines on both the maximization and the minimization problem. We further show the function learned by a sample run of RBF on the right side of Figure 3, which is an almost perfect approximation of the true reward function.
5 Experiments: Continuous Control
We now use deep RBF $Q$ functions for solving continuous-action RL problems. To this end, we learn a deep RBF $Q$ function using a learning algorithm similar to that of DQN (Mnih et al., 2015), but extended to the continuous-action case. DQN uses the following loss function for learning a deep state–action value function:

$$L(\theta) := \mathbb{E}_{s,a,r,s'} \Big[ \big( r + \gamma \max_{a'} \hat{Q}(s',a';\theta^{-}) - \hat{Q}(s,a;\theta) \big)^{2} \Big].$$

DQN adds tuples of experience $\langle s, a, r, s' \rangle$ to a buffer, and later samples a minibatch of tuples to compute $\nabla_{\theta} L(\theta)$. DQN maintains a second network parameterized by weights $\theta^{-}$. This second network, denoted $\hat{Q}(\cdot,\cdot;\theta^{-})$ and referred to as the target network, is periodically synchronized with the online network $\hat{Q}(\cdot,\cdot;\theta)$.
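A minimal sketch of this loss on a sampled minibatch is given below. It assumes the standard squared TD error with a bootstrap target taken from the target network. The callables `q_online`, `q_target`, and `candidate_actions` are hypothetical stand-ins; for RBF–DQN the candidates would be the centroids produced for the next state.

```python
def dqn_loss(batch, q_online, q_target, candidate_actions, gamma=0.99):
    """Mean squared TD error over a minibatch of (s, a, r, s_next) tuples."""
    squared_errors = []
    for s, a, r, s_next in batch:
        # Bootstrap from the target network, maximizing over candidate actions.
        bootstrap = max(q_target(s_next, b) for b in candidate_actions(s_next))
        td_error = r + gamma * bootstrap - q_online(s, a)
        squared_errors.append(td_error ** 2)
    return sum(squared_errors) / len(squared_errors)


# Toy stand-ins for the two networks and the candidate generator.
q_online = lambda s, a: 0.0
q_target = lambda s, a: a
candidates = lambda s: [0.0, 1.0]
batch = [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
loss = dqn_loss(batch, q_online, q_target, candidates, gamma=0.5)
```

In practice the gradient of this quantity with respect to the online parameters would be obtained by automatic differentiation rather than computed by hand.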
RBF–DQN uses the same loss function, but modifies the function class of DQN. Concretely, DQN learns a deep network that outputs one scalar action value per action, exploiting the discrete and finite nature of the action space. By contrast, RBF–DQN takes a state vector and an action vector as input, and outputs a single scalar using a deep RBF $Q$ function. Note that every operation in a deep RBF $Q$ function is differentiable, so the gradient of the loss function with respect to the parameters $\theta$ can be computed using standard deep-learning libraries. Specifically, we used Python's TensorFlow library (Abadi et al., 2016) with Keras (Chollet, 2015) as its interface.

In terms of action selection, with probability $\epsilon$, DQN chooses a random action, and with probability $1-\epsilon$ it chooses an action with the highest value. The value of $\epsilon$ is annealed so that the agent becomes more greedy as learning proceeds. To define an analog of this so-called $\epsilon$-greedy policy for RBF–DQN, we sample an action from a uniform distribution with probability $\epsilon$, and we take $\arg\max_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta)$ with probability $1-\epsilon$. We annealed the $\epsilon$ parameter, similar to DQN.

Additionally, we made a minor change to the original DQN algorithm in terms of updating $\theta^{-}$, the weights of the target network. Concretely, we update $\theta^{-}$ using an exponential moving average of all the previous $\theta$ values, as suggested by Lillicrap et al. (2015): $\theta^{-} \leftarrow (1-\tau)\,\theta^{-} + \tau\,\theta$ for a small step size $\tau$, which differs from the occasional periodic updates of the original DQN agent. We observed a significant performance increase with this simple modification.
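The two modifications described above, action selection over centroids and the moving-average target update, can be sketched as follows. The function names, the action bounds `low` and `high`, and the mixing coefficient `tau` are hypothetical; network parameters are represented as flat lists of floats purely for illustration.

```python
import random


def epsilon_greedy_action(q_at_centroids, centroids, epsilon, low, high,
                          rng=random):
    """Uniform random action with probability epsilon, else the best centroid."""
    if rng.random() < epsilon:
        dim = len(centroids[0])
        return tuple(rng.uniform(low, high) for _ in range(dim))
    best = max(range(len(centroids)), key=lambda i: q_at_centroids[i])
    return centroids[best]


def soft_update(target_params, online_params, tau):
    """Exponential moving average of the online parameters."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


centroids = [(0.0, 0.0), (1.0, -1.0), (0.5, 0.5)]
q_at_centroids = [0.2, 0.9, 0.4]
greedy = epsilon_greedy_action(q_at_centroids, centroids, 0.0, -1.0, 1.0)
explore = epsilon_greedy_action(q_at_centroids, centroids, 1.0, -1.0, 1.0,
                                rng=random.Random(0))
new_target = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
```

With `tau=1.0` the soft update reduces to copying the online parameters, which corresponds to the periodic synchronization used by the original DQN agent.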
For completeness, we provide pseudocode for RBF–DQN in Algorithm 1, include the code for the algorithm in the supplementary material, and will provide an open repository (github.com/kavosh8/RBFDQN).
We compared RBF–DQN's performance to other deep-RL baselines on a large set of standard continuous-action RL domains from Gym (Brockman et al., 2016). These domains range from simple tasks such as Inverted Pendulum with a one-dimensional action space, to more complicated domains such as Ant with a 7-dimensional action space. We used the same number of centroids for learning the deep RBF $Q$ function in every domain. We found the performance of RBF–DQN to be most sensitive to two hyperparameters, namely RMSProp's learning rate and the RBF smoothing parameter $\beta$. We tuned these two parameters via grid search (Goodfellow et al., 2016) for each individual domain, while all other hyperparameters were fixed across domains. See the Appendix for a complete explanation of the process of hyperparameter tuning for RBF–DQN, as well as for the baselines.
For a meaningful comparison, we performed roughly similar numbers of gradient-based updates for RBF–DQN and the baselines. Specifically, in all domains, we performed a fixed number of updates per episode on RBF–DQN's network parameters $\theta$. We used the same number of updates per episode for other value-function-based baselines, such as input–convex neural networks (Amos et al., 2017). Moreover, in the case of the policy-gradient baseline DDPG (Lillicrap et al., 2015), we performed 100 value-function updates and 100 policy updates per episode. This number of updates gave us reasonable results in terms of data efficiency, and it also helped us run all experiments on modern CPUs.
In choosing our baselines, our main goal was to compare RBF–DQN to other value-function-based deep-RL baselines that explicitly perform the maximization step. We did not perform comparisons with an exhaustive set of existing policy-gradient methods from the literature, since they work fundamentally differently from RBF–DQN and circumvent the action-maximization step. That said, deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) and its more advanced variant, TD3 (Fujimoto et al., 2018), are two very common baselines in continuous control, so we included them for completeness.
Moreover, in light of recent concerns with reproducibility in RL (Henderson et al., 2018), we ran each algorithm for 10 fixed random seeds and report average performance, we release our code, and we clearly explain our hyperparameter tuning process in the Appendix. Other than the input–convex neural network baseline (Amos et al., 2017), for which the authors released TensorFlow code, we chose to implement RBF–DQN and all other baselines ourselves in TensorFlow. This choice reflected a concern that comparing results across different deep-learning libraries is extremely difficult.
It is clear from Figure 5 that RBF–DQN is competitive with all baselines both in terms of data efficiency and final performance. Moreover, we report final mean performance with standard errors in the Appendix. RBF–DQN yields the highest-performing final policies in 8 out of the 9 domains.
6 Future Work
We envision several promising directions for future work. First, the RL literature has numerous examples of algorithmic ideas that help improve value-function-based agents (see Hessel et al. (2018) for some examples). These ideas are usually proposed for domains with discrete actions, so extending them to continuous-action domains using RBF–DQN could be an exciting direction to pursue.
A big advantage of value-function-based methods is the flexibility that they offer when tackling the exploration problem. Examples of exploration strategies for these methods include optimistic initialization (Sutton and Barto, 2018), softmax policies (Rummery and Niranjan, 1994; Sutton and Barto, 2018), uncertainty-based exploration (Osband et al., 2016), and PAC learning (Kakade, 2003; Strehl et al., 2006). We used a simple $\epsilon$-greedy policy, but a combination of the advanced exploration strategies with deep RBF $Q$ functions could be more effective.
Moreover, we solely focused on deep RBF value functions with negative exponentials, but various RBFs exist in the literature (Karayiannis, 1999). Further research into other types of RBFs can shed light on their strengths and weaknesses in the context of continuous-action RL problems. We also noticed that tuning the smoothing parameter of negative exponentials can be challenging, so methods that automatically learn this parameter, such as meta-gradient approaches (Xu et al., 2018), are a promising direction for future work.
Finally, we look forward to applying deep RBF $Q$ functions to key problems, such as robotics, real-time bidding, recommendation systems, and dialog systems.
7 Conclusion
We proposed, analyzed, and exhibited the strengths of deep RBF value functions in continuous control. These value functions facilitate easy action maximization, support universal function approximation, and scale to large continuous action spaces. Deep RBF value functions are thus an appealing choice for value function approximation in continuous control.
References
Abadi et al. (2016). TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, pp. 265–283.
Amos et al. (2017). Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 146–155.
Asadi and Littman (2017). An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243–252.
Asadi et al. (2018). Lipschitz continuity in model-based reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 264–273.
Aurenhammer (1991). Voronoi diagrams – a survey of a fundamental geometric data structure. ACM Computing Surveys 23 (3), pp. 345–405.
Baird and Klopf (1993). Reinforcement learning with high-dimensional, continuous actions. Technical report.
Bartlett and Mendelson (2002). Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482.
Bellman (1952). On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America 38 (8), pp. 716.
Benaim (1994). On functional approximation with normalized Gaussian units. Neural Computation 6 (2), pp. 319–333.
Boyd and Vandenberghe (2004). Convex optimization. Cambridge University Press.
Brockman et al. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
Broomhead and Lowe (1988). Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report.
Bugmann (1998). Normalized Gaussian radial basis function networks. Neurocomputing 20 (1–3), pp. 97–110.
Chollet (2015). Keras. GitHub. https://github.com/fchollet/keras
Fujimoto et al. (2018). Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596.
Goodfellow et al. (2016). Deep learning. MIT Press.
Gu et al. (2016). Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838.
Hammer and Gersmann (2003). A note on the universal approximation capability of support vector machines. Neural Processing Letters 17 (1), pp. 43–53.
Henderson et al. (2018). Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
Hessel et al. (2018). Rainbow: combining improvements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
Hornik et al. (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2 (5), pp. 359–366.
Kakade and Langford (2002). Approximately optimal approximate reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 267–274.
Kakade (2003). On the sample complexity of reinforcement learning. Ph.D. Thesis, University of London, London, England.
Karayiannis (1999). Reformulated radial basis neural networks trained by gradient descent. IEEE Transactions on Neural Networks 10 (3), pp. 657–671.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Konidaris et al. (2011). Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence.
Lattimore and Szepesvári (2018). Bandit algorithms. Preprint.
Lillicrap et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Lim et al. (2018). Actor-expert: a framework for using action-value methods in continuous action spaces. arXiv preprint arXiv:1810.09103.
Matheron et al. (2019). The problem with DDPG: understanding failures in deterministic environments with sparse rewards. arXiv preprint arXiv:1911.11679.
Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
Moody and Darken (1989). Fast learning in networks of locally-tuned processing units. Neural Computation 1 (2), pp. 281–294.
Osband et al. (2016). Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034.
Powell (1987). Radial basis functions for multivariable interpolation: a review. Algorithms for Approximation.
Puterman (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
Rummery and Niranjan (1994). On-line Q-learning using connectionist systems. Vol. 37, University of Cambridge, Department of Engineering, Cambridge, England.
Ryu et al. (2019). CAQL: continuous action Q-learning. arXiv preprint arXiv:1909.12397.
Silver et al. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395.
Song et al. (2019). Revisiting the softmax Bellman operator: new benefits and new perspective. In International Conference on Machine Learning, pp. 5916–5925.
Strehl et al. (2006). PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 881–888.
Sutton and Barto (2018). Reinforcement learning: an introduction. MIT Press.
Sutton (1988). Learning to predict by the methods of temporal differences. Machine Learning 3 (1), pp. 9–44.
Sutton (1996). Generalization in reinforcement learning: successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pp. 1038–1044.
Watkins (1989). Learning from delayed rewards. King's College, Cambridge.
Xu et al. (2018). Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396–2407.
8 Appendix
8.1 Proofs
See Theorem 1.
Proof.
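We begin by proving the first result. For an arbitrary action $a$, we can write:

$$\hat{Q}_{\beta}(s,a;\theta) = w_1 v_1(s;\theta) + \dots + w_N v_N(s;\theta),$$

where each weight $w_i := e^{-\beta |a - a_i|} / \sum_{j=1}^{N} e^{-\beta |a - a_j|}$ is determined via softmax. Without loss of generality, we sort all anchor points so that $a_i < a_{i+1}$ for all $i$. Take two neighboring centroids $a_L$ and $a_R$ with $R = L+1$, take an action $a \in [a_L, a_R]$, and notice that for all $i < L$:

$$\frac{w_i}{w_L} = \frac{e^{-\beta |a - a_i|}}{e^{-\beta |a - a_L|}} = \frac{e^{-\beta (a - a_i)}}{e^{-\beta (a - a_L)}} = e^{-\beta (a_L - a_i)} =: c_i .$$

In the above, we used the fact that all $a_i$ with $i < L$ are to the left of $a$ and $a_L$. Similarly, we can argue that $w_i / w_R = e^{-\beta (a_i - a_R)} =: c_i$ for all $i > R$. Intuitively, as long as the action is between $a_L$ and $a_R$, the ratio of the weight of a centroid to the left of $a_L$, over the weight of $a_L$ itself, remains constant and does not change with $a$. The same holds for the centroids to the right of $a_R$. In light of the above result, by renaming some variables we can now write:

$$\hat{Q}_{\beta}(s,a;\theta) = w_L k_L + w_R k_R, \qquad k_L := v_L + \sum_{i<L} c_i v_i, \quad k_R := v_R + \sum_{i>R} c_i v_i .$$

Moreover, note that the weights need to sum up to 1:

$$w_L m_L + w_R m_R = 1, \qquad m_L := 1 + \sum_{i<L} c_i, \quad m_R := 1 + \sum_{i>R} c_i,$$

and $w_L$ is at its peak when we choose $a = a_L$ and at its smallest value when we choose $a = a_R$. A converse statement is true about $w_R$. Moreover, the weights monotonically increase and decrease as we move the input $a$. We call the endpoints of the range of $w_L$ $w_{\min}$ and $w_{\max}$. As such, the problem

$$\max_{a \in [a_L, a_R]} \hat{Q}_{\beta}(s,a;\theta)$$

could be written as this linear program:

$$\max_{w_L, w_R}\ w_L k_L + w_R k_R \quad \text{s.t.} \quad w_L m_L + w_R m_R = 1, \quad w_{\min} \leq w_L \leq w_{\max}.$$

A standard result in linear programming is that every linear program has an extreme point that is an optimal solution (Boyd and Vandenberghe, 2004). Therefore, at least one of the points $w_L = w_{\min}$ or $w_L = w_{\max}$ is an optimal solution. It is easy to see that there is a one-to-one mapping between $a$ and $w_L$ in light of the monotonic property. As a result, the first point corresponds to the unique value $a = a_R$, and the second corresponds to the unique value $a = a_L$. Since no point in between two centroids can be bigger than the surrounding centroids, at least one of the centroids is a globally optimal solution in the range $[a_1, a_N]$, that is

$$\max_{a \in [a_1, a_N]} \hat{Q}_{\beta}(s,a;\theta) = \max_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta).$$

To finish the proof, we can show that $\hat{Q}_{\beta}(s,a;\theta) = \hat{Q}_{\beta}(s,a_1;\theta)$ for all $a < a_1$. The proof for $a > a_N$ follows similar steps. So,

$$\hat{Q}_{\beta}(s,a;\theta) = \frac{\sum_{i=1}^{N} e^{-\beta (a_i - a)} v_i}{\sum_{i=1}^{N} e^{-\beta (a_i - a)}} = \frac{e^{\beta a} \sum_{i=1}^{N} e^{-\beta a_i} v_i}{e^{\beta a} \sum_{i=1}^{N} e^{-\beta a_i}} = \frac{\sum_{i=1}^{N} e^{-\beta (a_i - a_1)} v_i}{\sum_{i=1}^{N} e^{-\beta (a_i - a_1)}} = \hat{Q}_{\beta}(s,a_1;\theta),$$

which concludes the proof of the first part.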
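We now move to the more general case with $\mathcal{A} = \mathbb{R}^{d}$. The lower bound holds because every centroid is itself a member of $\mathcal{A}$. For the upper bound, WLOG, we assume the first centroid is the one with the highest value, that is $v_1 = \max_{i} v_i$. Because $\hat{Q}_{\beta}(s,a;\theta)$ is a weighted average of the centroid values, it can never exceed $v_1$, and so:

$$\max_{a \in \mathcal{A}} \hat{Q}_{\beta}(s,a;\theta) - \max_{i \in [1,N]} \hat{Q}_{\beta}(s,a_i;\theta) \leq v_1 - \hat{Q}_{\beta}(s,a_1;\theta) = \frac{\sum_{i>1} e^{-\beta \lVert a_1 - a_i \rVert} (v_1 - v_i)}{1 + \sum_{i>1} e^{-\beta \lVert a_1 - a_i \rVert}} \leq \sum_{i>1} e^{-\beta \lVert a_1 - a_i \rVert} (v_1 - v_i),$$

where every term on the right-hand side decays exponentially in $\beta$. This concludes the proof. A related result was shown recently by Song et al. (2019).
∎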
See Theorem 2.
Proof.
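Since $Q^{\pi}$ is continuous, we leverage the fact that it is Lipschitz with a Lipschitz constant $L$:

$$\forall a_0, a_1 \in \mathcal{A} \qquad \big| Q^{\pi}(s,a_1) - Q^{\pi}(s,a_0) \big| \leq L\, \lVert a_1 - a_0 \rVert .$$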
As such, assuming that , we have that
(8) 
Consider a set of centroids , define the as:
and the radius as:
Assuming that is a closed set, there always exists a set of centroids for which . Now consider the following functional form:
Now suppose lies in a subset of cells, called the central cells :
We define a second neighboring set of cells:
and a third set of far cells:
We now have:
We now bound each of the three sums above. Starting from the first sum, it is easy to see that , simply because . As for the second sum, since is the centroid of a neighboring cell, using a central cell , we can write:
and so in this case . In the third case with the set of far cells , observe that for a far cell and a central cell we have:
For some . In the above, we used the fact that is always true.
Putting it all together, we have:
In order to have , it suffices to have