1 Introduction
A large body of reinforcement learning (RL) algorithms, based on approximate dynamic programming (ADP) (bertsekas1996neuro; scherrer2015approximate), operate in two steps: a greedy step, where the algorithm learns a policy that maximizes a Q-function, and an evaluation step, that (partially) updates the Q-values towards the Q-values of the policy. A common improvement to these techniques is to use regularization, which prevents the new policy from being too different from the previous one, or from a fixed "prior" policy. For example, Kullback-Leibler (KL) regularization keeps the policy close to the previous iterate (vieillard2020leverage), while entropy regularization keeps the policy close to the uniform one (haarnoja2018soft). Entropy regularization, often used in this context (ziebart2010modeling), modifies both the greedy step and the evaluation step so that the policy jointly maximizes its expected return and its entropy. In this framework, the solution to the policy optimization step is simply a softmax of the Q-values over the actions. When dealing with small discrete action spaces, the softmax can be computed exactly: one only needs to define a critic algorithm, with a single loss that optimizes a Q-function. However, in large multidimensional – or even continuous – action spaces, one needs to estimate it. This estimation is usually done by adding an actor loss, that optimizes a policy to fit this softmax. The result is an actor-critic algorithm, with two losses that are optimized simultaneously (haarnoja2018soft). This additional optimization step introduces errors on top of the ones already created by the approximation in the evaluation step.
To remove these extraneous approximations, we introduce the Implicit Q-values (IQ) algorithm, which deviates from classic actor-critics, as it optimizes a policy and a value in a single loss. The core idea is to implicitly represent the Q-function as the sum of a value function and a log-policy. This representation ensures that the policy is an exact softmax of the Q-value, despite the use of any approximation scheme. We use this to design a practical model-free deep RL algorithm that optimizes with a single loss a policy network and a value network, built on this implicit representation of a Q-value. To better understand it, we abstract this algorithm to an ADP scheme, IQ-DP, and use this point of view to provide a detailed theoretical analysis. It relies on a key observation that shows an equivalence between IQ-DP and a specific form of regularized Value Iteration (VI). This equivalence explains the role of the components of IQ: namely, IQ performs entropy and KL regularization. It also allows us to derive strong performance bounds for IQ-DP. In particular, we show that the errors made when following IQ-DP are compensated along iterations.
Parametrizing the Q-value as the sum of a log-policy and a value is reminiscent of the dueling architecture (wang2016dueling), which factorizes the Q-value as the sum of an advantage and a value. In fact, we show that the dueling architecture is a limiting case of IQ in a discrete-actions setting. This link highlights the role of the policy in IQ, which calls for a discussion on the necessary parametrization of the policy.
Finally, we empirically validate IQ. We evaluate our method on several classic continuous control benchmarks: locomotion tasks from OpenAI Gym (brockman2016openai), and hand manipulation tasks from the Adroit environment (rajeswaran2017learning). On these environments, IQ reaches performance competitive with state-of-the-art actor-critic methods.
2 Implicit Q-value parametrization
We consider the standard Reinforcement Learning (RL) setting, formalized as a Markov Decision Process (MDP). An MDP is a tuple M = {S, A, P, r, γ}. S and A are the finite state and action spaces^{1}^{1}1We restrict to finite spaces for the sake of analysis, but our approach applies to continuous spaces., γ ∈ [0, 1) is the discount factor and r is the bounded reward function. Write Δ_X the simplex over a finite set X. The dynamics of an MDP are defined by a Markovian transition kernel P ∈ Δ_S^{S×A}, where P(s'|s, a) is the probability of transitioning to state s' after taking action a in s. An RL agent acts through a stochastic policy π ∈ Δ_A^S, a mapping from states to distributions over actions. The quality of a policy is quantified by the value function, V_π(s) = E_π[Σ_{t≥0} γ^t r(s_t, a_t) | s_0 = s]. The Q-function is a useful extension, which notably allows choosing a (soft-)greedy action in a model-free setting, Q_π(s, a) = r(s, a) + γ E_{s'|s,a}[V_π(s')]. An optimal policy is one that achieves the highest expected return, π_* ∈ argmax_π V_π.
A classic way to design practical algorithms beyond the tabular setting is to adopt the actor-critic perspective. In this framework, an RL agent parametrizes a policy π_θ and a Q-value Q_ψ
with function approximation, usually through the use of neural networks, and aims at estimating an optimal policy. The policy and the Q-function are then updated by minimizing two losses: the actor loss corresponds to the greedy step, and the critic loss to the evaluation step. The weights of the policy and value networks are regularly frozen into target weights θ̄ and ψ̄. With entropy regularization, the greedy step amounts to finding the policy that maximizes ⟨π, Q_ψ̄⟩ + τ H(π) (maximize the value with a stochastic enough policy). The solution to this problem is simply π ∝ exp(Q_ψ̄ / τ), which is the result of the greedy step of regularized Value Iteration (VI) (geist2019theory) and, for example, how the optimization step of Soft Actor-Critic (haarnoja2018soft, SAC) is built. In a setting where the action space is discrete and small, it amounts to a simple softmax computation. However, on more complex action spaces (continuous, and/or with a higher number of dimensions: as a reference, the Humanoid-v2 environment from OpenAI Gym (brockman2016openai) has an action space of dimension 17), it becomes prohibitive to use the exact solution. In this case, the common practice is to resort to an approximation with a parametric distribution model. In many actor-critic algorithms (SAC, TD3 (fujimoto2018addressing), …), the policy is modelled as a Gaussian distribution over actions. This introduces approximation errors, resulting from the partial optimization process of the actor, and inductive bias, as a Gaussian policy cannot represent an arbitrary softmax distribution. We now turn to the description of our core contribution: the Implicit Q-value (IQ) algorithm, introduced to mitigate this discrepancy.
IQ implicitly parametrizes a Q-value via an explicit parametrization of a policy and a value. Precisely, from a policy network π_θ and a value network V_φ, we define our implicit Q-value as
Q_{θ,φ}(s, a) = τ ln π_θ(a|s) + V_φ(s).   (1)
Since π_θ is constrained to be a distribution over the actions, we have by construction that π_θ(·|s) = softmax(Q_{θ,φ}(s, ·)/τ), the solution of the regularized greedy step (see Appx. A.1 for a detailed proof). Hence, the consequence of using such a parametrization is that the greedy step is performed exactly, even in the function approximation regime. Compared to the classic actor-critic setting, it thus gets rid of the errors created by the actor. Note that calling V_φ a value makes sense, since following the same reasoning we have that V_φ(s) = τ ln Σ_a exp(Q_{θ,φ}(s, a)/τ), a soft version of the maximum of the Q-value. With this parametrization in mind, one could derive a deep RL algorithm from any value-based loss using entropy regularization. We conserve the fixed-point approach of the standard actor-critic framework, where θ and φ are regularly copied to target weights θ̄ and φ̄, and we design an off-policy algorithm, working on a replay buffer B of transitions (s, a, r, s') collected during training. Consider two hyperparameters, α ∈ [0, 1] and τ > 0, that we will show in Sec. 3 control two forms of regularization. The policy and value are optimized jointly by minimizing the loss
L_IQ(θ, φ) = Ê_B[ ( r + ατ ln π_θ̄(a|s) + γ V_φ̄(s') − (τ ln π_θ(a|s) + V_φ(s)) )² ],   (2)
where Ê_B denotes the empirical expectation over the dataset B of transitions. IQ thus consists of a single loss that jointly optimizes a policy and a value. This brings a notable remark on the role of Q-functions in RL. Indeed, Q-learning was introduced by watkins1992q – among other reasons – to make greediness possible without a model (using a value only, one needs to maximize over all possible successor states, which requires knowing the transition model), and consequently to derive practical, model-free RL algorithms. Here however, IQ illustrates how, with the help of regularization, one can derive a model-free algorithm that does not rely on an explicit Q-value.
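To make Eq. (2) concrete, here is a minimal NumPy sketch of the loss on a batch of transitions. The callables `log_pi`, `v` and their target counterparts are hypothetical stand-ins for the networks π_θ, V_φ and their frozen copies, not the paper's actual implementation.

```python
import numpy as np

def iq_loss(log_pi, v, log_pi_targ, v_targ, batch, alpha, tau, gamma):
    """Sketch of the IQ loss of Eq. (2) on a batch of transitions."""
    s, a, r, s_next = batch
    # Munchausen-augmented regression target:
    # r + alpha * tau * ln pi_targ(a|s) + gamma * V_targ(s')
    target = r + alpha * tau * log_pi_targ(s, a) + gamma * v_targ(s_next)
    # Implicit Q-value of Eq. (1): Q(s, a) = tau * ln pi(a|s) + V(s)
    q = tau * log_pi(s, a) + v(s)
    # Single squared loss, optimized jointly in theta and phi
    return np.mean((target - q) ** 2)
```

In a deep RL agent, the gradient of this scalar would be taken with respect to both the policy and the value parameters at once, which is the "single loss" property discussed above.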
3 Analysis
In this section, we explain the workings of the IQ algorithm defined by Eq. (2) and detail the influence of its hyperparameters. We abstract IQ into an ADP framework, and show that, from that perspective, it is equivalent to a Mirror Descent VI (MDVI) scheme (geist2019theory), with both entropy and KL regularization. Let us first introduce some useful notations. We make use of the partial dot-product notation over actions: for f₁, f₂ ∈ R^{S×A}, we define ⟨f₁, f₂⟩ ∈ R^S as ⟨f₁, f₂⟩(s) = Σ_a f₁(s, a) f₂(s, a). In particular, for any π ∈ Δ_A^S and v ∈ R^S, we have ⟨π, v⟩ = v. We will define regularized algorithms, using the entropy of a policy, H(π) = −⟨π, ln π⟩, and the KL divergence between two policies, KL(π₁‖π₂) = ⟨π₁, ln π₁ − ln π₂⟩. The Q-value of a policy π is the unique fixed point of its Bellman operator, defined for any q ∈ R^{S×A} as T_π q = r + γ P⟨π, q⟩. We denote q_* the optimal Q-value (the Q-value of the optimal policy). When the MDP is entropy-regularized with a temperature λ, a policy admits a regularized Q-value q^π_λ, the fixed point of the regularized Bellman operator T_{π,λ} q = r + γ P(⟨π, q⟩ + λ H(π)). A regularized MDP admits an optimal regularized policy and a unique optimal regularized Q-value q*_λ (geist2019theory).
3.1 Ideal case
First, let us look at the ideal case, i.e. when L_IQ is exactly minimized at each iteration (tabular representation, dataset covering the whole state-action space, expectation rather than sampling for transitions). In this context, IQ can be understood as a Dynamic Programming (DP) scheme that iterates on a policy π_k and a value v_k. They are respectively equivalent to the target networks π_θ̄ and V_φ̄, while the next iterate (π_{k+1}, v_{k+1}) matches the solution of the optimization problem in Eq. (2). We call this scheme IQ-DP, and one iteration is defined by choosing (π_{k+1}, v_{k+1}) such that the squared term in Eq. (2) is zero, that is
τ ln π_{k+1}(a|s) + v_{k+1}(s) = r(s, a) + ατ ln π_k(a|s) + γ E_{s'|s,a}[v_k(s')].   (3)
This equation is well-defined, due to the underlying constraint that π_{k+1}(·|s) ∈ Δ_A (the policy must be a distribution over actions), that is Σ_a π_{k+1}(a|s) = 1 for all s: summing the exponentiated equation over actions determines v_{k+1}, and π_{k+1} follows by normalization. The basis for our discussion will be the equivalence of this scheme to a version of regularized VI. Indeed, we have the following result, proved in Appendix A.3.
Theorem 1.
For any k ≥ 0, let (π_{k+1}, v_{k+1}) be the solution of IQ-DP at step k. We have that
π_{k+1} = argmax_π ( ⟨π, q_k⟩ − ατ KL(π‖π_k) + (1 − α)τ H(π) ),  with q_k = r + γ P v_k,   (4)
so IQ-DP(α, τ) produces the same sequence of policies as a value-based version of Mirror Descent VI, MDVI (vieillard2020leverage).
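The IQ-DP iteration of Eq. (3) has a closed-form tabular solution via a log-sum-exp. The sketch below is illustrative only (the array-shape conventions are our own, not the paper's code):

```python
import numpy as np

def iq_dp_step(pi_k, v_k, r, P, alpha, tau, gamma):
    """One exact IQ-DP iteration in the tabular case, solving Eq. (3).

    pi_k: (S, A) policy, v_k: (S,) value, r: (S, A) reward,
    P: (S, A, S) transition kernel.
    """
    q_k = r + gamma * np.einsum('sat,t->sa', P, v_k)      # q_k = r + gamma * P v_k
    logits = (q_k + alpha * tau * np.log(pi_k)) / tau
    # Stable log-sum-exp: v_{k+1}(s) = tau * ln sum_a exp(logits(s, a))
    m = logits.max(axis=1, keepdims=True)
    v_next = tau * (m[:, 0] + np.log(np.exp(logits - m).sum(axis=1)))
    # ln pi_{k+1} = logits - v_{k+1} / tau  (softmax normalization)
    pi_next = np.exp(logits - v_next[:, None] / tau)
    return pi_next, v_next
```

By construction, substituting (pi_next, v_next) back into Eq. (3) yields an exact identity, and each row of pi_next sums to one.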
Discussion.
The previous result sheds a first light on the nature of the IQ method. Essentially, IQ-DP is a parametrization of a VI scheme regularized with both entropy and KL divergence, MDVI. This first highlights the role of the hyperparameters, as it shows the interaction between the two forms of regularization. The value of α balances between those two: with α = 0, IQ-DP reduces to a classic VI regularized with entropy; with α = 1, only the KL regularization is taken into account. The value of τ then controls the amplitude of this regularization. In particular, in the limit α = 0, τ → 0, we recover the standard VI algorithm. This result also justifies the soundness of IQ-DP. Indeed, this MDVI scheme is known to converge to the optimal policy of the regularized MDP (vieillard2020leverage, Thm. 2), and this result readily applies to IQ^{2}^{2}2vieillard2020leverage show this for Q-functions, but it can straightforwardly be extended to value functions.. Another consequence is that it links IQ to Advantage Learning (AL) (bellemare2016increasing). Indeed, AL is a limiting case of MDVI when τ → 0 and α > 0 (vieillard2020munchausen). Therefore, IQ also generalizes AL, and the parameter α can be interpreted as the advantage coefficient. Finally, a key observation is that IQ performs KL regularization implicitly, the way it was introduced by Munchausen RL (vieillard2020munchausen), by augmenting the reward with the term ατ ln π_k (Eq. (3)). This observation will have implications discussed next.
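The equivalence stated in Thm. 1 can be checked numerically: the IQ-DP policy, a softmax of the Munchausen-augmented values, should attain a higher regularized objective than any other distribution. A small sketch under our own toy setup (single state, hypothetical values):

```python
import numpy as np

def mdvi_objective(pi, q, pi_k, alpha, tau):
    """Objective of Eq. (4): <pi, q> - alpha*tau*KL(pi||pi_k) + (1-alpha)*tau*H(pi)."""
    kl = np.sum(pi * (np.log(pi) - np.log(pi_k)), axis=-1)
    ent = -np.sum(pi * np.log(pi), axis=-1)
    return np.sum(pi * q, axis=-1) - alpha * tau * kl + (1 - alpha) * tau * ent

def iq_dp_policy(q, pi_k, alpha, tau):
    """Closed-form maximizer implied by the IQ parametrization."""
    logits = (q + alpha * tau * np.log(pi_k)) / tau
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)
```

Sampling random candidate distributions and comparing objectives confirms that the closed-form policy is indeed the maximizer of the KL- and entropy-regularized greedy step.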
3.2 Error propagation result
Now, we are interested in understanding how the errors introduced by function approximation propagate along iterations. At iteration k of IQ, denote π_k and v_k the target networks. In the approximate setting, we do not solve Eq. (3); instead, we minimize L_IQ with stochastic gradient descent. This means that π_{k+1} and v_{k+1} are the result of this optimization, and thus the next target networks. The optimization process introduces errors, that come from many sources: partial optimization, function approximation (policy and value are approximated with neural networks), finite data, etc. We study the impact of these errors on the distance between the optimal Q-value of the MDP and the regularized Q-value of the current policy used by IQ, q^{π_k}_{(1−α)τ}. We insist right away that this is not the learned, implicit Q-value, but the actual Q-value of the policy computed by IQ in the regularized MDP. We have the following result concerning the error propagation.
Theorem 2.
Write π_{k+1} and v_{k+1} the updates of respectively the target policy and value networks. Consider the error at step k+1, ε_{k+1}, as the difference between the ideal and the actual updates of IQ. Formally, we define the error as, for all (s, a),
ε_{k+1}(s, a) = τ ln π_{k+1}(a|s) + v_{k+1}(s) − (r(s, a) + ατ ln π_k(a|s) + γ E_{s'|s,a}[v_k(s')]),   (5)
and the moving average of the errors as
E_k = (1 − α) Σ_{j=1}^{k} α^{k−j} ε_j.   (6)
We have the following results for two different cases depending on the value of α. Note that when α < 1, we bound the distance to the regularized optimal Q-value.

General case, α ∈ (0, 1) and τ > 0, entropy and KL regularization together:
‖q*_{(1−α)τ} − q^{π_k}_{(1−α)τ}‖_∞ ≤ 2 Σ_{j=1}^{k} γ^{k−j} ‖E_j‖_∞ + O(γ^k),   (7)
Specific case α = 1, τ > 0, use of KL regularization alone:
‖q_* − q^{π_k}‖_∞ ≤ (2/(1 − γ)) ‖(1/k) Σ_{j=1}^{k} ε_j‖_∞ + O(1/k).   (8)
Sketch of proof.
The full proof is provided in Appendix A.4. We build upon the connection we established between IQ-DP and a VI scheme regularized by both KL and entropy in Thm. 1. By injecting the proposed representation into the classic MDVI scheme, we can build upon the analysis of vieillard2020leverage to provide these results. ∎
Impact of KL regularization.
The KL regularization term, and specifically its role in the MDVI framework, is discussed extensively by vieillard2020leverage, and we refer to them for an in-depth analysis of the subject. We recall here the main interests of KL regularization, as illustrated by the bounds of Thm. 2. In the second case, where it is the clearest (only KL is used), we observe a beneficial property of KL regularization: averaging of errors. Indeed, in a classic non-regularized VI scheme (scherrer2015approximate), the error would depend on a discounted sum of the norms of the errors ‖ε_j‖_∞, while with the KL it depends on the norm of the average of the errors ‖(1/k) Σ_j ε_j‖_∞. In a simplified case where the errors are i.i.d. and zero-mean, this would allow convergence of approximate MDVI, but not of approximate VI. In the case α < 1, where we introduce entropy regularization, the impact is less obvious, but we still transform a sum of norms of errors into a sum of norms of moving averages of errors, which can help by reducing the underlying variance.
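This averaging effect is easy to visualize with synthetic errors. The sketch below, under the simplifying i.i.d. zero-mean assumption from the text, contrasts the two quantities appearing in the bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(size=10_000)          # i.i.d. zero-mean errors eps_j
k = np.arange(1, eps.size + 1)

# Norm of the average (KL case, Eq. (8)): vanishes as k grows
norm_of_avg = np.abs(np.cumsum(eps) / k)
# Average of the norms (non-regularized VI): converges to E|eps| > 0
avg_of_norms = np.cumsum(np.abs(eps)) / k

print(norm_of_avg[-1], avg_of_norms[-1])  # the first is near 0, the second is not
```

The first quantity shrinks at the usual 1/√k Monte-Carlo rate, while the second stabilizes around a positive constant: errors that average out help MDVI but not plain VI.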
Link to Munchausen RL.
As stated in the proof sketch, Thm. 2 is a consequence of (vieillard2020leverage, Thm. 1 and 2). A crucial limitation of that work is that the analysis only applies when no errors are made in the greedy step. This is possible in a relatively simple setting, with tabular representation, or with a linear parametrization of the Q-function. However, in the general case with function approximation, exactly solving the optimization problem regularized by KL is not immediately possible: the solution of the greedy step of MDVI is π_{k+1} ∝ π_k^α exp(q_k/τ), so by unrolling this recursion, computing it exactly would require remembering every past q_j during the procedure, which is not feasible in practice. A workaround to this issue was introduced by vieillard2020munchausen as Munchausen RL: the idea is to augment the reward by the log-policy, to implicitly define a KL regularization term, while reducing the greedy step to a softmax. As mentioned before, in small discrete action spaces, this allows computing the greedy step exactly, but this is not the case in multidimensional or continuous action spaces, and thus Munchausen RL loses its interest in such domains. With IQ, we utilize the Munchausen idea to implicitly define the KL regularization; but with our parametrization, the exactness of the greedy step holds even for complex action spaces: recall that the parametrization defined in Eq. (1) enforces that the policy is a softmax of the (implicit) Q-value. Thus, IQ can be seen as an extension of Munchausen RL to multidimensional and continuous action spaces.
To sum up, IQ implements with function approximation an ADP scheme that is essentially VI with entropy and KL regularization. This type of regularization is known to be efficient, as it can compensate errors made during the evaluation step, but this compensation relies on the greedy step being exact. A way to have an exact greedy step while still using KL regularization is the Munchausen method, which avoids computing an explicit KL by simply augmenting the reward with a log-policy. This form of KL regularization reduces the greedy step to a softmax: this is sufficient to avoid errors in a discrete-actions setting, but not with continuous actions. IQ makes the softmax exact by implicitly defining the Q-value. And, using the Munchausen method to compute KL regularization, it extends it to continuous actions: IQ performs entropy and KL regularization with no approximation in the greedy step, even in continuous action domains.
3.3 Link to the dueling architecture
Now, we show a link between IQ and the dueling networks architecture as defined by wang2016dueling. We will first quickly describe the dueling architecture, and then show how it can be related to IQ.
Dueling Networks (DN) were introduced as a variation of the seminal Deep Q-Networks (DQN, mnih2015human), and have been empirically proven to be efficient (for example by hessel2018rainbow). The idea is to represent the Q-value as the sum of a value and an advantage. In this setting, we work with a notion of advantage defined over Q-functions (as opposed to defining the advantage as a function of a policy). For any q ∈ R^{S×A}, its advantage is defined as A(s, a) = q(s, a) − max_{a'} q(s, a'). The advantage encodes a sub-optimality constraint: it has negative values and its maximum over actions (for the action maximizing the Q-value) is 0. wang2016dueling propose to learn a Q-value by defining an advantage network A_Θ and a value network V_Φ, which in turn define a Q-value as
Q_{Θ,Φ}(s, a) = A_Θ(s, a) − max_{a'} A_Θ(s, a') + V_Φ(s).   (9)
Subtracting the maximum over the actions ensures that the advantage network indeed represents an advantage. Note that dueling DQN was designed for discrete action settings, where computing the maximum over actions is not an issue.
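A minimal sketch of the dueling combination of Eq. (9), using our own array convention rather than the original implementation:

```python
import numpy as np

def dueling_q(adv, v):
    """Combine an advantage head (S, A) and a value head (S,) as in Eq. (9).

    Subtracting the per-state max forces the advantage head to represent
    a true advantage: non-positive, with maximum 0 over actions.
    """
    return v[:, None] + adv - adv.max(axis=1, keepdims=True)
```

A direct consequence of the max-subtraction is that max_a Q(s, a) = V(s): the value head alone carries the state value, and the advantage head only ranks actions.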
In IQ, we need a policy network that represents a distribution over the actions. There are several practical ways to represent the policy, which are discussed in Sec. 4. For the sake of simplicity, let us for now assume that we are in a mono-dimensional discrete action space, and that we use a common scaled softmax representation. Specifically, our policy is represented by a neural network (e.g. fully connected) F_θ, that maps state-action pairs to logits F_θ(s, a). The policy is then defined as π_θ(a|s) = exp(F_θ(s, a)/τ) / Σ_{a'} exp(F_θ(s, a')/τ). Directly from the definition of the softmax, we observe that τ ln π_θ(a|s) = F_θ(s, a) − τ ln Σ_{a'} exp(F_θ(s, a')/τ). The second term is a classic scaled log-sum-exp over the actions, a soft version of the maximum: when τ → 0, we have that τ ln Σ_{a'} exp(F_θ(s, a')/τ) → max_{a'} F_θ(s, a'). Within the IQ parametrization, we have
Q_{θ,φ}(s, a) = F_θ(s, a) − τ ln Σ_{a'} exp(F_θ(s, a')/τ) + V_φ(s),   (10)
which makes a clear link between IQ and DN. In this case (scaled softmax representation), the IQ parametrization generalizes the dueling architecture, retrieved when τ → 0 (and with an additional AL term whenever α > 0, see Sec. 3). In practice, wang2016dueling use a different parametrization of the advantage, replacing the maximum by a mean, defining A(s, a) = q(s, a) − (1/|A|) Σ_{a'} q(s, a'). We could use a similar trick and replace the log-sum-exp by a mean in our policy parametrization, but in our case this did not prove to be efficient in practice.
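The τ → 0 limit invoked above is easy to check numerically. A short sketch of the scaled log-sum-exp (our own helper, with the usual max-shift for numerical stability):

```python
import numpy as np

def scaled_logsumexp(f, tau):
    """tau * ln sum_a exp(f(a)/tau): the soft maximum appearing in Eq. (10)."""
    m = f.max(axis=-1, keepdims=True)
    return (m + tau * np.log(np.exp((f - m) / tau).sum(axis=-1, keepdims=True))).squeeze(-1)
```

For any τ > 0 the soft maximum upper-bounds the hard maximum, and it converges to it as τ shrinks, which is exactly how Eq. (10) degenerates to the dueling combination of Eq. (9).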
We showed how the log-policy represents a soft version of the advantage. While this makes its role in the learning procedure clearer, it also raises the question of what sort of representation would be best suited for optimization.
4 Practical considerations
We now describe key practical issues encountered when choosing a policy representation. The main one comes from the delegation of the representation power of the algorithm to the policy network. In a standard actor-critic algorithm – take SAC for example, where the policy is parametrized as a Gaussian distribution – the goal of the policy is mainly to track the maximizing action of the Q-value. Thus, estimation errors can cause the policy to choose suboptimal actions, but the inductive bias caused by the Gaussian representation may not be a huge issue in practice, as long as the mean of the Gaussian policy is not too far from the maximizing action. In other words, the representation capacity of an algorithm such as SAC lies mainly in the representation capacity of its Q-network.
In IQ, we have a parametrization that enforces the policy to be a softmax of an implicit Q-value. By doing this, we trade estimation error – our greedy step is exact by construction – for representation power. More precisely, as the Q-value is not parametrized explicitly, but through the policy, the representation power of IQ lies in its policy network, and a “simple” representation might not be enough anymore. For example, if we parametrized the policy as a Gaussian, this would amount to parametrizing an advantage as a quadratic function of the action: this would drastically limit what IQ could represent.
Multicategorical policies.
To address this issue, we turn to other, richer, distribution representations. In practice, we consider a multicategorical discrete softmax distribution. Precisely, we are in the context of a multidimensional action space A of dimension d, each dimension being a bounded interval. We discretize each dimension of the space uniformly into n values a^i_j, for 1 ≤ j ≤ n. This effectively defines a discrete action space A_n ⊂ A, with |A_n| = n^d. A multidimensional action is a vector a = (a^1, …, a^d), and we denote a^i the i-th component of the action a. Assuming independence between action dimensions conditioned on states, a policy can be factorized as the product of marginal mono-dimensional policies, π(a|s) = Π_{i=1}^d π^i(a^i|s). We represent each policy π^i_θ as the softmax of the output of a neural network F^i_θ, and thus we get the full representation
π_θ(a|s) = Π_{i=1}^d exp(F^i_θ(s, a^i)) / Σ_{j=1}^{n} exp(F^i_θ(s, a^i_j)).   (11)
The functions F^i_θ can be represented as neural networks with a shared core, which only differ in the last layer. This type of multicategorical policy can represent any distribution (with n high enough) that does not encompass a dependency between the dimensions. The independence assumption is quite strong, and does not hold in general. From an advantage point of view, it assumes that the soft-advantage (i.e. the log-policy) decomposes additively across the action dimensions. While this somewhat limits the advantage representation, it is a much weaker constraint than parametrizing the advantage as a quadratic function of the action (which would be the case with a Gaussian policy). In practice, these types of multicategorical policies have been experimented with (akkaya2019solving; tang2020discretizing), and have proven to be efficient on continuous control tasks.
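A factorized policy of this form is straightforward to implement. The sketch below (our own convention: one row of logits per action dimension, a hypothetical stand-in for the networks F^i_θ at a fixed state) computes the joint log-probability of Eq. (11):

```python
import numpy as np

def multicat_log_prob(logits, action_bins):
    """Joint log pi(a|s) for a multicategorical policy, Eq. (11).

    logits: (d, n) array, one softmax per action dimension.
    action_bins: (d,) integer array, the chosen bin in each dimension.
    """
    # Per-dimension log-partition function (stable log-sum-exp)
    m = logits.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    # Per-dimension log-softmax evaluated at the chosen bins
    log_marginals = logits[np.arange(logits.shape[0]), action_bins] - log_z
    # Independence across dimensions: the joint log-prob is the sum
    return log_marginals.sum()
```

Summing the resulting probabilities over all n^d joint actions gives one, confirming the factorization defines a proper distribution.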
Even richer policy classes can be explored. To account for dependencies between dimensions, one could envision autoregressive multicategorical representations, used for example to parametrize a Q-value by metz2017discrete. Another approach is to use richer continuous distributions, such as normalizing flows (rezende2015variational; ward2019improving). In this work, we restrict ourselves to the multicategorical setting, which is sufficient to get satisfying results (Sec. 6.2), and we leave the other options for future work.
5 Related work
Similar parametrizations.
Other algorithms make use of a similar parametrization. First, Path Consistency Learning (PCL, nachum2017bridging) also parametrizes the Q-value as a sum of a log-policy and a value. Trust-PCL (nachum2017trust) builds on PCL by adding a trust-region constraint on the policy update, similar to our KL regularization term. A key difference with IQ is that (Trust-)PCL is a residual algorithm, while IQ works around a fixed-point scheme. Shortly, Trust-PCL can be seen as a version of IQ without the target value network. These entropy-regularized residual approaches are derived from the softmax temporal consistency principle, which allows considering extensions to a specific form of multistep learning (strongly relying on the residual aspect), but they also come with drawbacks, such as introducing a bias in the optimization when the environment is stochastic (geist2016bellman). Second, Quinoa (degrave2019quinoa) uses a similar loss to Trust-PCL and IQ (without reference to the former), but does not propose any analysis, and is evaluated only on a few tasks. Third, Normalized Advantage Function (NAF, gu2016continuous) is designed with similar principles. In NAF, a Q-value is parametrized as a value and an advantage, the latter being quadratic on the action. It matches the special case of IQ with a Gaussian policy, where we recover this quadratic parametrization.
Regularization.
Entropy and KL regularization are used by many other RL algorithms. Notably, from a dynamic programming perspective, IQ-DP(0, τ) (IQ with only entropy regularization) performs the same update as SAC – an entropy-regularized VI. This equivalence does however not hold in the function approximation regime. Due to the empirical success of SAC and its link to IQ, it will be used as our main baseline on continuous control tasks. Other algorithms also use KL regularization, notably Maximum a Posteriori Policy Optimization (MPO, abdolmaleki2018maximum). We refer to vieillard2020leverage for an exhaustive review of algorithms encompassed within the MDVI framework.
6 Experiments
Here, we describe our experimental setting and provide results evaluating the performance of IQ.
6.1 Setup
Environments and metrics.
We evaluate IQ first on the Mujoco environments from OpenAI Gym (brockman2016openai). It consists of locomotion tasks, with action spaces ranging from 3 (Hopper-v2) to 17 dimensions (Humanoid-v2). We use a rather long time-horizon setting, training for a large number of environment steps on each environment. We also provide results on the Adroit manipulation benchmark (rajeswaran2017learning), with a similar number of environment steps. Adroit is a collection of four hand manipulation tasks. This environment is often used in an offline RL setting, but here we use it only as a direct RL benchmark. Out of these four tasks, we only consider three of them: we could not find any working algorithm (baseline or new) on the “relocate” task. To summarize the performance of an algorithm, we report the baseline-normalized score along iterations: it normalizes the score so that 0 corresponds to a random score, and 1 to a given baseline. It is defined for one task as (score − random score) / (baseline score − random score), where the baseline is the best version of SAC on Mujoco and Adroit at the end of training. We then report aggregated results, showing the mean and median of these normalized scores over the tasks. Each score is reported as the average over random seeds. For each experiment, the corresponding standard deviation is reported in Appx. B.3.
IQ algorithms.
We implement IQ with the Acme (hoffman2020acme) codebase. It defines two deep neural networks, a policy network π_θ and a value network V_φ. IQ interacts with the environment through π_θ, and collects transitions that are stored in a FIFO replay buffer. At each interaction, IQ updates θ and φ by performing a step of stochastic gradient descent with Adam (kingma2014adam) on L_IQ (Eq. (2)). During each step, IQ updates a copy θ̄ of the weights θ with a smooth update θ̄ ← (1 − ρ)θ̄ + ρθ, with ρ ∈ (0, 1). It tracks a similar copy φ̄ of φ. We keep almost all common hyperparameters (network architectures, γ, etc.) the same as our main baseline, SAC. We only adjust the learning rate for two tasks, Humanoid and Walker, where we used a lower value: we found that IQ benefits from this, while for SAC we did not observe any improvement (we provide more details and complete results in Appx. B.3). Our value network has the same architecture as the SAC networks, except that the input size is only the state size (as it does not depend on the action). The policy network has the same architecture as the SAC policy network, and differs only by its output: the IQ policy outputs a multicategorical distribution (so n · d values, where d is the dimensionality of the action space and n is the number of discrete actions on each dimension), while the SAC policy outputs the mean and diagonal covariance matrix of a Gaussian. We use a single fixed value of n in all our experiments. IQ introduces two hyperparameters, α and τ. We tested several values of τ and selected one per task suite: one value for the Mujoco tasks and another for Adroit. We also tested several values of α. To make the distinction between the cases α = 0 and α > 0, we denote IQ with α > 0 as M-IQ, for Munchausen-IQ, since it makes use of the Munchausen regularization term. For M-IQ, we found a single value of α to be the best performing, which is consistent with the findings of vieillard2020munchausen. We report results for non-optimal values of α in the ablation study (Section 6.2). Extended explanations are provided in Appendix B.2.
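The smooth target update can be sketched as a Polyak average. This is an illustrative stand-in: the convention θ̄ ← (1 − ρ)θ̄ + ρθ with a small ρ is our assumption, matching common practice in SAC-style agents, not a detail confirmed by the text.

```python
import numpy as np

def soft_update(target_params, online_params, rho):
    """Smoothly track the online weights: theta_bar <- (1 - rho)*theta_bar + rho*theta."""
    return [(1.0 - rho) * t + rho * o for t, o in zip(target_params, online_params)]
```

With a small ρ, the target networks lag behind the online ones, which stabilizes the regression target of Eq. (2).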
Baselines.
On continuous control tasks, our main baseline is SAC, as it reaches state-of-the-art performance on Mujoco tasks. We compare to the version of SAC that uses an adaptive temperature for reference, but note that for IQ we keep a fixed-temperature setting. To reach its best performance, SAC either uses a specific temperature value per task, or an adaptive scheme that controls the entropy of the policy. This method could be extended to multicategorical policies, but we leave this for future work, and focus on a fixed-temperature setting, where we use the same value of τ for all tasks of an environment. On Gym, we use the default parameters from haarnoja2018soft2. On Adroit, we used a specifically tuned version of SAC. Remarkably, both SAC and IQ work with similar hyperparameter ranges on Mujoco and Adroit. We only found that using a lower learning rate gave better performance on Adroit. We also compare IQ to Trust-PCL. It is the closest algorithm to IQ, with a similar parametrization. To be fair, we compare to our version of Trust-PCL, which is essentially a residual version of IQ, where the target value network is removed (replaced by the online one). We use Trust-PCL with a fixed temperature, and we tuned this temperature to the environment. We found that Trust-PCL reaches its best performance with significantly lower values of τ compared to IQ. In the ablation (Fig. 1), we used the same tuned temperature for PCL and Trust-PCL.
6.2 Results
Comparison to baselines.
We report aggregated results of IQ and M-IQ on Gym in Figure 1 and on Adroit in Figure 2, with corresponding standard deviations in Appx. B.3. IQ reaches performance competitive with SAC. It is less sample-efficient on Gym (SAC reaches higher performance sooner), but faster on Adroit, and IQ reaches a close final performance on both environments. These results also show the impact of the α parameter. Although the impact of the Munchausen term (i.e. KL regularization) might not seem as impressive as in discrete-action settings, these results show that using this term is never detrimental, and can even bring a slight improvement on Gym, while it does not add any computational complexity to the algorithm. We also report scores on each individual task in Appx. B.3, along with an in-depth discussion of the performance and the impact of hyperparameters.
Influence of the temperature.
We study the influence of the temperature τ on the Mujoco tasks in Fig. 3. We report the score of IQ for several values of τ (with α = 0 here, and with α > 0 in Appx. B.3), on all environments of Mujoco. It shows that τ needs to be selected carefully: while it helps learning, too high values of τ can be detrimental to the performance, and it highlights that its optimal value might be dependent on the task. Another observation is that τ has a much stronger influence on IQ than α. This is a key empirical difference with respect to M-DQN (vieillard2020munchausen), which has the same parameters, but is evaluated on discrete-action settings. In these settings, the parameter α is shown to have a crucial importance in terms of empirical results: M-DQN with α > 0 largely outperforms M-DQN with α = 0 on the Atari benchmark. While this term still has an effect in IQ on some tasks, it is empirically less useful, even though it is never detrimental; this discrepancy is yet to be understood.
Ablation study.
We perform an ablation of important components of IQ in Fig. 1. (1) We replace the target network by its online counterpart in Eq. (2), which gives us Trust-PCL (and PCL is obtained by additionally removing the KL regularization term), a residual version of our method. IQ and M-IQ both outperform Trust-PCL and PCL on Mujoco. (2) We use a Gaussian parametrization of the policy instead of a multicategorical distribution. We observe in Figure 1 that this causes the performance to drop drastically. This empirically validates the considerations of Section 4 about the necessary complexity of the policy.
7 Conclusion
We introduced IQ, a parametrization of a Q-value that mechanically preserves the softmax relation between a policy and an implicit Q-function. Building on this parametrization, we derived an off-policy algorithm that learns a policy and a value by minimizing a single loss, in a fixed-point fashion. We provided an insightful analysis that justifies our parametrization and the algorithm. Specifically, IQ performs entropy and (implicit) KL regularization on the policy. While this kind of regularization had already been used and analyzed in RL, it was limited by the difficulty of estimating the softmax of the Q-function in continuous action settings. IQ removes this limitation by avoiding any approximation of this softmax, effectively extending the scope of this analysis. This parametrization comes at a cost: it shifts the representation capacity from the Q-network to the policy, which makes the usual Gaussian representation ineffective. We addressed this issue by considering simple multicategorical policies, which allowed IQ to reach performance comparable to state-of-the-art methods on classic continuous control benchmarks. Yet, we envision that studying even richer policy classes may result in even better performance. In the end, this work brings together theory and practice: IQ is a theory-consistent manner of implementing an algorithm based on regularized VI in continuous-action settings.
References
Appendix A Analysis
This Appendix provides details and proofs for the IQ parametrization.
Reminder on notations.
Throughout the Appendix, we use the following notations. Recall that we defined the action dot product as, for any $f_1, f_2 \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$,
(12) $\langle f_1, f_2 \rangle = \big( \sum_{a \in \mathcal{A}} f_1(s, a) f_2(s, a) \big)_{s \in \mathcal{S}} \in \mathbb{R}^{\mathcal{S}}$.
We also slightly overload the $+$ operator. Precisely, for any $v \in \mathbb{R}^{\mathcal{S}}$, $q \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, we define $q + v$ as
(13) $(q + v)(s, a) = q(s, a) + v(s)$.
Write $\mathbf{1}$ the constant function of value $1$. For any $q \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, we define the softmax operator as
(14) $\mathrm{softmax}(q) = \frac{\exp q}{\langle \mathbf{1}, \exp q \rangle}$,
where the fraction is overloaded similarly to the addition operator, that is, for any state-action pair $(s, a)$,
(15) $\mathrm{softmax}(q)(a|s) = \frac{\exp q(s, a)}{\sum_{b \in \mathcal{A}} \exp q(s, b)}$.
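These operators can be illustrated with a short NumPy sketch (ours; the tabular shapes, with states as rows and actions as columns, are an assumption):

```python
import numpy as np

def action_dot(f1, f2):
    # <f1, f2>(s) = sum_a f1(s, a) * f2(s, a): one scalar per state
    return np.sum(f1 * f2, axis=1)

def softmax(q):
    # softmax(q)(a|s) = exp(q(s, a)) / sum_b exp(q(s, b)), stabilized
    z = np.exp(q - q.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 3))        # 4 states, 3 actions
v = rng.normal(size=4)
pi = softmax(q)
assert np.allclose(pi.sum(axis=1), 1.0)          # a distribution per state
assert np.allclose(softmax(q + v[:, None]), pi)  # invariant to adding v
```

The second assertion (adding a state-dependent value leaves the softmax unchanged) is the mechanism the implicit parametrization exploits.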
a.1 About the softmax consistency
First, we provide a detailed explanation of the consistency of the IQ parametrization. In Section 2, we claim that parametrizing a Q-value as $q = v + \tau \ln \pi$ enforces the relation $\pi = \mathrm{softmax}(q / \tau)$. This relation comes mechanically from the constraint that $\pi$ is a distribution over actions. For the sake of precision, we provide a detailed proof of this claim, as formalized in the following lemma.
Lemma 1.
For any $\pi \in \Delta_{\mathcal{A}}^{\mathcal{S}}$, $v \in \mathbb{R}^{\mathcal{S}}$, $\tau > 0$, we have
(16) $q = v + \tau \ln \pi \;\Longrightarrow\; v = \tau \ln \langle \mathbf{1}, \exp(q / \tau) \rangle \text{ and } \pi = \mathrm{softmax}(q / \tau)$.
Proof.
Directly from the left hand side (l.h.s.) of Eq. (16), we have
(17) $\exp(q / \tau) = \exp(v / \tau) \, \pi$.
Since $\langle \mathbf{1}, \pi \rangle = 1$ ($\pi$ is a distribution over the actions), we have
(18) $\langle \mathbf{1}, \exp(q / \tau) \rangle = \exp(v / \tau) \langle \mathbf{1}, \pi \rangle$
(19) $= \exp(v / \tau)$,
(20) $\text{so } v = \tau \ln \langle \mathbf{1}, \exp(q / \tau) \rangle$.
And, for the policy, this gives
(21) $\pi = \frac{\exp(q / \tau)}{\exp(v / \tau)} = \frac{\exp(q / \tau)}{\langle \mathbf{1}, \exp(q / \tau) \rangle} = \mathrm{softmax}(q / \tau)$.
It concludes the proof. ∎
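The lemma can also be checked numerically. The sketch below (ours) builds an arbitrary pair (π, v), forms the implicit Q-value q = v + τ·ln π, and recovers both v (as a scaled log-sum-exp) and π (as a softmax):

```python
import numpy as np

tau = 0.5
rng = np.random.default_rng(1)
logits = rng.normal(size=(5, 4))
pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # any policy
v = rng.normal(size=(5, 1))                                      # any value

q = v + tau * np.log(pi)       # the implicit Q-value of the parametrization

# v is recovered as tau * ln <1, exp(q / tau)> ...
v_rec = tau * np.log(np.exp(q / tau).sum(axis=1, keepdims=True))
assert np.allclose(v_rec, v)

# ... and pi is exactly softmax(q / tau).
pi_rec = np.exp((q - v_rec) / tau)
assert np.allclose(pi_rec, pi)
```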
a.2 Useful properties of KLentropyregularized optimization
The following proofs rely on some properties of the KL divergence and of the entropy. Consider the greedy step of MDVI, defined in Thm. 1,
(22) 
Since the objective of this optimization problem is concave in π, it can be tackled using properties of the Legendre-Fenchel transform (see for example hiriart2004fundamentals for general definitions and properties, and vieillard2020leverage for the application to our setting). We quickly state two properties of interest for this work in the following lemma.
Lemma 2.
Consider the optimization problem of Eq. (22), and write π* for its maximizer. We have that
(23) 
We also get a relation between the maximizer and the maximum
(24) 
Proof.
See vieillard2020leverage.
∎
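As a numerical sanity check of the entropy-only special case of this lemma (our simplification, dropping the KL term): the maximizer of π ↦ ⟨π, q⟩ + τH(π) is the softmax of q/τ, and the maximum is the scaled log-sum-exp. Names in the sketch are ours:

```python
import numpy as np

tau = 0.7
rng = np.random.default_rng(2)
q = rng.normal(size=6)                 # q(s, .) for a single state

def objective(p):
    # <p, q> + tau * H(p), with H the Shannon entropy
    return np.dot(p, q) - tau * np.dot(p, np.log(p))

pi_star = np.exp(q / tau) / np.exp(q / tau).sum()   # claimed maximizer
max_val = tau * np.log(np.exp(q / tau).sum())       # claimed maximum

assert np.isclose(objective(pi_star), max_val)
for _ in range(100):                   # no sampled policy does better
    p = rng.dirichlet(np.ones(6))
    assert objective(p) <= max_val + 1e-8
```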
a.3 Equivalence to MDVI: proof of Theorem 1
We turn to the proof of Thm. 1. This result formalizes an equivalence, in the exact case, between the IQ-DP scheme and a VI scheme regularized with both entropy and a KL divergence. Recall that we define the update of IQ-DP at step $k$ as
(25) $v_{k+1} + \tau \ln \pi_{k+1} = r + \alpha \tau \ln \pi_k + \gamma P v_k$.
Note that we are for now considering the scenario where this update is computed exactly. We will consider errors later, in Thm 2. Recall Thm. 1.
Theorem 1.
For any $k \geq 0$, let $(\pi_{k+1}, v_{k+1})$ be the solution of IQ-DP at step $k$. We have that
(26) 
so IQ-DP(α, τ) produces the same sequence of policies as a value-based version of Mirror Descent VI, MDVI [vieillard2020leverage].
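The policy equivalence can be checked numerically on a small tabular MDP. The sketch below is our own reconstruction (the reward r, transitions P, and the exact IQ-DP target r + ατ·ln π_k + γPv_k are assumptions consistent with the loss discussed in Appx. A.5, not the paper's code): iterating the exact IQ-DP update via Lemma 1, and the KL-entropy-regularized MDVI scheme, from the same initialization, yields identical policies at every step.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma, tau, alpha = 5, 3, 0.9, 0.5, 0.5
r = rng.normal(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution

def logsumexp(x):
    m = x.max(axis=1, keepdims=True)
    return m.squeeze(1) + np.log(np.exp(x - m).sum(axis=1))

pi_iq = pi_md = np.full((S, A), 1.0 / A)     # same uniform initialization
v_iq = v_md = np.zeros(S)

for _ in range(30):
    # IQ-DP: solve v + tau*ln(pi) = target exactly, via Lemma 1
    target = r + alpha * tau * np.log(pi_iq) + gamma * P @ v_iq
    v_iq = tau * logsumexp(target / tau)
    pi_iq = np.exp((target - v_iq[:, None]) / tau)

    # MDVI: KL-entropy-regularized greedy step, then regularized evaluation
    q = r + gamma * P @ v_md
    logits = alpha * np.log(pi_md) + q / tau
    new_pi = np.exp(logits - logsumexp(logits)[:, None])
    v_md = (new_pi * (q - tau * np.log(new_pi)
                      + alpha * tau * np.log(pi_md))).sum(axis=1)
    pi_md = new_pi

assert np.allclose(pi_iq, pi_md)             # identical policy sequences
assert np.allclose(v_iq, v_md)
```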
a.4 Error propagation: proof of Theorem 2
Now we turn to the proof of Thm. 2. This theorem handles the IQ-DP scheme in the approximate case, where errors are made during the iterations. The considered scheme is
(32) $v_{k+1} + \tau \ln \pi_{k+1} = r + \alpha \tau \ln \pi_k + \gamma P v_k + \epsilon_{k+1}$.
Recall Thm. 2.
Theorem 2.
Denote the updates of the target policy and value networks, respectively. Consider the error at step $k$ as the difference between the ideal and the actual updates of IQ. Formally, we define the error as, for all $k$,
(33) 
and the moving average of the errors as
(34) $E_k = \frac{1}{k} \sum_{j=1}^{k} \epsilon_j$.
We have the following results for two different cases, depending on the value of $\alpha$. Note that when $\alpha < 1$, we bound the distance to the regularized optimal value.

General case, $0 < \alpha < 1$ and $\tau > 0$, entropy and KL regularization together:
(35) 
Specific case, $\alpha = 1$ and $\tau > 0$, use of KL regularization alone:
(36)
Proof.
To prove this error propagation result, we first show an extension of Thm. 1 that links approximate IQ-DP with a value-based version of MDVI. This new equivalence makes IQ-DP correspond exactly to a scheme extensively analyzed by vieillard2020leverage. Our result can then be derived as a consequence of [vieillard2020leverage, Thm. 1] and [vieillard2020leverage, Thm. 2].
Define a (KLregularized) implicit value as
(37) 
so that the IQ-DP update (Eq. (32)) can be written
(38) 
We then use the same method as for the proof of Thm. 1. Specifically, applying Lemma 1 to this definition gives, for the policy,
(39)  
(40)  
(41) 
For the value, applying Lemma 1 gives
(42) 
then, using Lemma 2, we have
(44) 
Injecting this in Eq. (38) gives
(45) 
Thus, we have proved the following equivalence between DP schemes
(46)  
(47)  
(48) 
with
(49) 
The above scheme in Eq. (48) is exactly the MDVI scheme studied by vieillard2020leverage (with their notations matched accordingly). We now use their analysis of MDVI to apply their results to IQ-DP, building on the equivalence between the schemes. Note that transferring this type of analysis between equivalent formulations of DP schemes is justified because the equivalence holds in terms of policies. Indeed, IQ-DP and MDVI compute different Q-values, but produce identical sequences of policies. Since [vieillard2020leverage, Thm. 1] and [vieillard2020leverage, Thm. 2] bound the distance between the optimal (regularized) value and the actual (regularized) value of the computed policy, the equivalence in terms of policies is sufficient to apply these theorems to IQ-DP. Specifically, [vieillard2020leverage, Thm. 1] applied to the formulation of IQ in Eq. (48) proves the first point of Thm. 2. The second point is proven by applying [vieillard2020leverage, Thm. 2] to this same formulation.
∎
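The qualitative content of Thm. 2, namely that the bound depends on the norm of the moving average of the errors rather than on the sum of individual error norms, can be illustrated with a quick simulation. This is our own idealized error model (i.i.d. zero-mean noise), not a claim about the actual approximation errors:

```python
import numpy as np

# Idealized error model (ours): i.i.d. zero-mean noise eps_k. The moving
# average of the errors shrinks as iterations accumulate, while the typical
# individual error norm does not: errors compensate along iterations.
rng = np.random.default_rng(5)
errors = rng.normal(size=(1000, 10))                      # eps_1, ..., eps_1000
running = np.cumsum(errors, axis=0) / np.arange(1, 1001)[:, None]

mean_norm = np.linalg.norm(errors, axis=1).mean()         # typical ||eps_k||
avg_norm = np.linalg.norm(running[-1])                    # ||(1/K) sum eps_k||
assert avg_norm < mean_norm                               # compensation
```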
a.5 IQ and Munchausen DQN
We claim in Section 3 that IQ is a form of Munchausen algorithm, specifically Munchausen-DQN (MDQN). Here, we clarify this link. Note that all of the information below is contained in Appx. A.3 and Appx. A.4; the point of this section is to rewrite it using the notations with which IQ is defined as a deep RL agent, consistent with how MDQN is defined.
IQ optimizes a policy and a value by minimizing a single loss (Eq. (2)). Recall that IQ implicitly defines a Q-function as the sum of the value and the scaled log-policy. Identifying this implicit Q-function in the loss makes the connection between Munchausen RL and IQ completely clear. Indeed, the loss can be written as
(50) 
and since, from Lemma 2, we have
(51) 
we get that the loss is
(52) 
which is exactly the Munchausen-DQN loss on the implicit Q-function. Thus, in a mono-dimensional action setting (classic discrete control problems, for example), IQ can really be seen as a reparametrized version of MDQN.
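This rewriting is easy to verify numerically. In the sketch below (our illustration, with assumed shapes and a single hypothetical transition), the IQ loss computed from a (v, π) pair coincides with the MDQN loss computed on the implicit q = v + τ·ln π:

```python
import numpy as np

tau, alpha, gamma = 0.3, 0.9, 0.99
rng = np.random.default_rng(4)

def random_pair(S, A):
    # An arbitrary value and policy; pi rows are distributions over actions.
    return rng.normal(size=S), rng.dirichlet(np.ones(A), size=S)

v, pi = random_pair(6, 4)          # online networks
v_t, pi_t = random_pair(6, 4)      # target networks
s, a, r, s2 = 2, 1, 0.5, 3         # one transition (s, a, r, s')

# IQ loss: regress v + tau*ln(pi) onto the Munchausen target
iq_target = r + alpha * tau * np.log(pi_t[s, a]) + gamma * v_t[s2]
iq_loss = (v[s] + tau * np.log(pi[s, a]) - iq_target) ** 2

# MDQN loss on the implicit q = v + tau*ln(pi): same quantity, rewritten
q = v[:, None] + tau * np.log(pi)
q_t = v_t[:, None] + tau * np.log(pi_t)
lse = tau * np.log(np.exp(q_t / tau).sum(axis=1))   # = v_t, by Lemma 1
log_pi_t = (q_t - lse[:, None]) / tau               # = ln(pi_t)
mdqn_target = r + alpha * tau * log_pi_t[s, a] + gamma * lse[s2]
mdqn_loss = (q[s, a] - mdqn_target) ** 2

assert np.isclose(iq_loss, mdqn_loss)
```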
Appendix B Additional material on experiments
This Appendix provides additional details on the experiments, along with complete empirical results.
b.1 General information on experiments
Used assets.
IQ is implemented on the Acme library [hoffman2020acme], distributed as open-source code under the Apache License (2.0).
Compute resources.
Experiments were run on TPUv2, with one TPU and one random seed per run. To produce the main results (without the sweeps over parameters), we computed single runs. One such run on a TPUv2 takes a number of hours that depends on the environment (the larger the action space, the longer the run).
b.2 Details on algorithms
On the relation between α and τ.
The equivalence result of Theorem 1 explains the role of, and the relation between, α and τ. In particular, it shows that IQ-DP performs a VI scheme in an entropy-regularized MDP (or in a max-entropy setting) where the temperature is not τ, but (1 − α)τ. Indeed, in this framework, the parameter α balances between two forms of regularization: with α = 0, IQ-DP is only regularized with entropy, while with α > 0, IQ-DP is regularized with both entropy and KL. Thus, IQ-DP implicitly modifies the intrinsic temperature of the MDP it is optimizing for. To account for this discrepancy, every time we evaluate IQ with α > 0 (that is, M-IQ), we report scores using τ/(1 − α), and not τ. For example, on Gym, we used a temperature of for IQ, and thus for M-IQ (since, in our experiments, we took ).
Discretization.
We used IQ with policies that discretize the action space evenly. Here, we provide a precise definition of our discretization method. Consider a multidimensional action space of dimension $d$, each dimension $j$ being a bounded interval $[m_j, M_j]$, such that $\mathcal{A} = \prod_{j=1}^{d} [m_j, M_j]$. We discretize each dimension of the space uniformly into $n$ values $b_i^j$, for $i \in \{0, \ldots, n-1\}$. The bin values are defined as
(53) $b_i^j = m_j + i \, \frac{M_j - m_j}{n - 1}$,
and, for each $j$,
(54) $\mathcal{A}_j = \{b_0^j, \ldots, b_{n-1}^j\}$.
It effectively defines a discrete action space
(55) $\tilde{\mathcal{A}} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_d$, with $|\tilde{\mathcal{A}}| = n^d$.
We use $n = 11$ in all of our experiments. The values of $d$, $m_j$ and $M_j$ depend on the environments' specifications.
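A minimal sketch of this scheme (ours; the bounds are illustrative, and we assume the n values per dimension include both endpoints, matching an even discretization):

```python
import numpy as np

def uniform_bins(low, high, n):
    # n evenly spaced bin values per dimension, endpoints included
    return [np.linspace(lo, hi, n) for lo, hi in zip(low, high)]

# Example: a 2-D action space [-1, 1] x [0, 2] with n = 11
bins = uniform_bins(low=[-1.0, 0.0], high=[1.0, 2.0], n=11)

# A multicategorical policy outputs one bin index per dimension; the
# continuous action is recovered by indexing the bins.
indices = [3, 7]
action = np.array([b[i] for b, i in zip(bins, indices)])
assert np.allclose(action, [-0.4, 1.4])
assert len(bins[0]) == 11      # n values per dimension, n^d joint actions
```

The policy only needs d independent categorical heads of size n, rather than one head of size n^d.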
Evaluation setting.
We evaluate our algorithms on Mujoco environments from OpenAI Gym and on the Adroit manipulation tasks. On each environment, we track performance for M environment steps. Every k environment steps, we stop learning and evaluate our algorithm by reporting the average undiscounted return over episodes. We use deterministic evaluation, meaning that, at evaluation time, the algorithms act by taking the expected value of the policy in each state rather than sampling from it (sampling is used during training).
Pseudocode.
We provide the pseudocode of IQ in Algorithm 1. This pseudocode describes a general learning procedure that is followed by all agents. Replacing the IQ loss in Algorithm 1 by its residual version gives the pseudocode for PCL, and replacing it by the actor and critic losses of SAC gives the pseudocode for that method.
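The content of Algorithm 1 is not reproduced above. As a purely illustrative pseudocode sketch of the generic loop it describes, with every helper name (`env`, `replay`, `iq_loss`, `polyak_average`, ...) hypothetical rather than taken from the paper's code:

```python
# Pseudocode sketch (hypothetical names, not the paper's actual API).
def train(env, networks, targets, replay, optimizer, iq_loss,
          num_steps, target_update_period):
    state = env.reset()
    for step in range(num_steps):
        # Interact with the online policy and store the transition.
        action = networks.policy.sample(state)
        next_state, reward, done = env.step(action)
        replay.add((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

        # Single-loss update of the policy and value networks.
        batch = replay.sample()
        grads = iq_loss.gradient(networks, targets, batch)
        networks = optimizer.update(networks, grads)

        # Soft (Polyak) update of the target networks.
        if step % target_update_period == 0:
            targets = targets.polyak_average(networks)
    return networks
```

As stated above, swapping `iq_loss` for its residual version or for separate actor and critic losses recovers the baselines.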
HyperParameters.
We provide the hyperparameters used for our experiments in Tab. 1. If a parameter is under "common parameters", then it was used for all algorithms. We denote fully connected layers by their output sizes. Recall the dimension of the action space and the number of bins each dimension is discretized into.
Parameter  Value 

Common parameters  
(update coefficient)  0.05 
(discount)  0.99 
(replay buffer size)  
(batch size)  256 
activations  ReLU 
optimizer  Adam 
learning rate  on Gym, on Adroit 
IQ-specific parameters  
(entropy temperature)  on Gym, on Adroit 
(implicit KL term)  
(number of bins for the discretization)  11 
policy network structure  (input: state) 
value network structure  (input: state)