# Implicitly Regularized RL with Implicit Q-Values

The Q-function is a central quantity in many Reinforcement Learning (RL) algorithms, in which RL agents behave following a (soft)-greedy policy w.r.t. Q. It is a powerful tool that allows action selection without a model of the environment and even without explicitly modeling the policy. Yet, this scheme can only be used in discrete action tasks with small numbers of actions, as the softmax cannot be computed exactly otherwise. In particular, the use of function approximation to deal with continuous action spaces in modern actor-critic architectures intrinsically prevents the exact computation of a softmax. We propose to alleviate this issue by parametrizing the Q-function implicitly, as the sum of a log-policy and of a value function. We use the resulting parametrization to derive a practical off-policy deep RL algorithm that is suitable for large action spaces and that enforces the softmax relation between the policy and the Q-value. We provide a theoretical analysis of our algorithm: from an Approximate Dynamic Programming perspective, we show its equivalence to a regularized version of value iteration that accounts for both entropy and Kullback-Leibler regularization and that enjoys beneficial error propagation results. We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods.

## 1 Introduction

A large body of reinforcement learning (RL) algorithms, based on approximate dynamic programming (ADP) (bertsekas1996neuro; scherrer2015approximate), operate in two steps: a greedy step, where the algorithm learns a policy that maximizes a $Q$-function, and an evaluation step, that (partially) updates the $Q$-values towards the $Q$-values of the policy. A common improvement to these techniques is to use regularization, which prevents the newly updated policy from being too different from the previous one, or from a fixed “prior” policy. For example, Kullback-Leibler (KL) regularization keeps the policy close to the previous iterate (vieillard2020leverage), while entropy regularization keeps the policy close to the uniform one (haarnoja2018soft). Entropy regularization, often used in this context (ziebart2010modeling), modifies both the greedy step and the evaluation step so that the policy jointly maximizes its expected return and its entropy. In this framework, the solution to the policy optimization step is simply a softmax of the $Q$-values over the actions. When dealing with small discrete action spaces, the softmax can be computed exactly: one only needs to define a critic algorithm, with a single loss that optimizes a $Q$-function. However, in large multi-dimensional (or even continuous) action spaces, one needs to estimate it. This estimation is usually done by adding an actor loss, which optimizes a policy to fit this softmax. It results in an actor-critic algorithm, with two losses that are optimized simultaneously (haarnoja2018soft). This additional optimization step introduces supplementary errors on top of the ones already created by the approximation in the evaluation step.

To remove these extraneous approximations, we introduce the Implicit $Q$-Functions (IQ) algorithm, which deviates from classic actor-critics, as it optimizes a policy and a value in a single loss. The core idea is to implicitly represent the $Q$-function as the sum of a value function and a log-policy. This representation ensures that the policy is an exact softmax of the $Q$-value, whatever approximation scheme is used. We use this to design a practical model-free deep RL algorithm that optimizes with a single loss a policy network and a value network, built on this implicit representation of a $Q$-value. To better understand it, we abstract this algorithm to an ADP scheme, IQ-DP, and use this point of view to provide a detailed theoretical analysis. It relies on a key observation that shows an equivalence between IQ-DP and a specific form of regularized Value Iteration (VI). This equivalence explains the role of the components of IQ: namely, IQ performs entropy and KL regularization. It also allows us to derive strong performance bounds for IQ-DP. In particular, we show that the errors made when following IQ-DP are compensated along iterations.

Parametrizing the $Q$-value as a sum of a log-policy and a value is reminiscent of the dueling architecture (wang2016dueling), which factorizes the $Q$-value as the sum of an advantage and a value. In fact, we show that the dueling architecture is a limiting case of IQ in a discrete actions setting. This link highlights the role of the policy in IQ, which calls for a discussion on the necessary parametrization of the policy.

Finally, we empirically validate IQ. We evaluate our method on several classic continuous control benchmarks: locomotion tasks from OpenAI Gym (brockman2016openai), and hand manipulation tasks from the Adroit environment (rajeswaran2017learning). On these environments, IQ reaches performance competitive with state-of-the-art actor-critic methods.

## 2 Implicit Q-value parametrization

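We consider the standard Reinforcement Learning (RL) setting, formalized as a Markov Decision Process (MDP). An MDP is a tuple $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$. $\mathcal{S}$ and $\mathcal{A}$ are the finite state and action spaces (we restrict to finite spaces for the sake of analysis, but our approach applies to continuous spaces), $\gamma \in [0, 1)$ is the discount factor and $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the bounded reward function. Write $\Delta_X$ the simplex over the finite set $X$. The dynamics of an MDP are defined by a Markovian transition kernel $P \in \Delta_{\mathcal{S}}^{\mathcal{S} \times \mathcal{A}}$, where $P(s'|s,a)$ is the probability of transitioning to state $s'$ after taking action $a$ in $s$. An RL agent acts through a stochastic policy $\pi \in \Delta_{\mathcal{A}}^{\mathcal{S}}$, a mapping from states to distributions over actions. The quality of a policy is quantified by the value function, $V_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$. The $Q$-function is a useful extension, which notably allows choosing a (soft)-greedy action in a model-free setting, $Q_\pi(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[V_\pi(s')]$. An optimal policy is one that achieves the highest expected return, $\pi_* \in \operatorname{argmax}_\pi V_\pi$.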

A classic way to design practical algorithms beyond the tabular setting is to adopt the actor-critic perspective. In this framework, an RL agent parametrizes a policy $\pi_\theta$ and a $Q$-value $Q_\psi$ with function approximation, usually through the use of neural networks, and aims at estimating an optimal policy. The policy and the $Q$-function are then updated by minimizing two losses: the actor loss corresponds to the greedy step, and the critic loss to the evaluation step. The weights of the policy and $Q$-value networks are regularly frozen into target weights $\bar\theta$ and $\bar\psi$. With entropy regularization, the greedy step amounts to finding, in each state $s$, the policy that maximizes $\mathbb{E}_{a \sim \pi(\cdot|s)}[Q_{\bar\psi}(s,a)] + \tau \mathcal{H}(\pi(\cdot|s))$ (maximize the $Q$-value with a stochastic enough policy). The solution to this problem is simply $\pi(\cdot|s) = \operatorname{softmax}(Q_{\bar\psi}(s,\cdot)/\tau)$, which is the result of the greedy step of regularized Value Iteration (VI) (geist2019theory) and, for example, how the optimization step of Soft Actor-Critic (haarnoja2018soft, SAC) is built. In a setting where the action space is discrete and small, it amounts to a simple softmax computation. However, on more complex action spaces (continuous, and/or with a higher number of dimensions: as a reference, the Humanoid-v2 environment from OpenAI Gym (brockman2016openai) has an action space of dimension 17), it becomes prohibitive to use the exact solution. In this case, the common practice is to resort to an approximation with a parametric distribution model. In many actor-critic algorithms (SAC, TD3 (fujimoto2018addressing), …), the policy is modelled as a Gaussian distribution over actions. It introduces approximation errors, resulting from the partial optimization process of the critic, and inductive bias, as a Gaussian policy cannot represent an arbitrary softmax distribution. We now turn to the description of our core contribution: the Implicit $Q$-value (IQ) algorithm, introduced to mitigate this discrepancy.

IQ implicitly parametrizes a $Q$-value via an explicit parametrization of a policy and a value. Precisely, from a policy network $\pi_\theta$ and a value network $V_\phi$, we define our implicit $Q$-value as

$$Q_{\theta,\phi}(s,a) = \tau \ln \pi_\theta(a|s) + V_\phi(s). \tag{1}$$

Since $\pi_\theta$ is constrained to be a distribution over the actions, we have by construction that $\pi_\theta(a|s) = \operatorname{softmax}(Q_{\theta,\phi}(s,\cdot)/\tau)(a)$, the solution of the regularized greedy step (see Appx. A.1 for a detailed proof). Hence, the consequence of using such a parametrization is that the greedy step is performed exactly, even in the function approximation regime. Compared to the classic actor-critic setting, it thus gets rid of the errors created by the actor. Note that calling $V_\phi$ a value makes sense, since following the same reasoning we have that $V_\phi(s) = \tau \ln \sum_{a'} \exp(Q_{\theta,\phi}(s,a')/\tau)$, a soft version of the value. With this parametrization in mind, one could derive a deep RL algorithm from any value-based loss using entropy regularization. We keep the fixed-point approach of the standard actor-critic framework: $\theta$ and $\phi$ are regularly copied to $\bar\theta$ and $\bar\phi$. We design an off-policy algorithm, working on a replay buffer of transitions $(s_t, a_t, r_t, s_{t+1})$ collected during training. Consider two hyperparameters, $\tau$ and $\alpha$, that we will show in Sec. 3 control two forms of regularization. The policy and value are optimized jointly by minimizing the loss

$$L_{IQ}(\theta,\phi) = \hat{\mathbb{E}}\left[\left(r_t + \alpha\tau \ln \pi_{\bar\theta}(a_t|s_t) + \gamma V_{\bar\phi}(s_{t+1}) - \tau \ln \pi_\theta(a_t|s_t) - V_\phi(s_t)\right)^2\right], \tag{2}$$

where $\hat{\mathbb{E}}$ denotes the empirical expected value over a dataset of transitions. IQ thus consists of a single loss that jointly optimizes a policy and a value. This brings a notable remark on the role of $Q$-functions in RL. Indeed, $Q$-learning was introduced by watkins1992q, among other reasons, to make greediness possible without a model (using a value only, one needs to maximize it over all possible successive states, which requires knowing the transition model), and consequently to derive practical, model-free RL algorithms. Here however, IQ illustrates how, with the help of regularization, one can derive a model-free algorithm that does not rely on an explicit $Q$-value.
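To make the parametrization and the loss concrete, here is a minimal NumPy sketch. It is not the authors' implementation: random arrays stand in for the outputs of the policy and value networks, the hyperparameter values are arbitrary, terminal transitions are ignored, and the helper names (`log_softmax`, `implicit_q`, `iq_loss`) are made up for illustration. It checks numerically that the implicit $Q$-value of Eq. (1) has the policy as its softmax and the value as its scaled logsumexp, then evaluates the squared residual of Eq. (2) on a hand-built transition.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def implicit_q(log_pi, v, tau):
    """Eq. (1) for every action: Q(s, a) = tau * ln pi(a|s) + V(s)."""
    return tau * log_pi + v[:, None]

def iq_loss(log_pi_a, v_s, log_pi_bar_a, v_bar_next, r, alpha, tau, gamma):
    """Eq. (2) on a batch of transitions (s_t, a_t, r_t, s_{t+1}).

    log_pi_a, v_s: online log-policy at (s_t, a_t) and online value at s_t.
    log_pi_bar_a, v_bar_next: target log-policy at (s_t, a_t), target value at s_{t+1}.
    """
    target = r + alpha * tau * log_pi_bar_a + gamma * v_bar_next
    online = tau * log_pi_a + v_s
    return float(np.mean((target - online) ** 2))

rng = np.random.default_rng(0)
batch, n_actions = 4, 5
tau, alpha, gamma = 0.5, 0.9, 0.99  # illustrative values only

# Random stand-ins for the outputs of pi_theta and V_phi on a batch of states.
log_pi = log_softmax(rng.normal(size=(batch, n_actions)))
v = rng.normal(size=batch)

q = implicit_q(log_pi, v, tau)
pi_from_q = np.exp(log_softmax(q / tau))  # softmax(Q / tau)
q_max = q.max(axis=1)
soft_value = q_max + tau * np.log(np.exp((q - q_max[:, None]) / tau).sum(axis=1))

# Hand-built transition: target = 1 + 0 + 0.5 * 2 = 2, online = 0 + 1 = 1.
toy_loss = iq_loss(
    log_pi_a=np.array([0.0]), v_s=np.array([1.0]),
    log_pi_bar_a=np.array([0.0]), v_bar_next=np.array([2.0]),
    r=np.array([1.0]), alpha=alpha, tau=tau, gamma=0.5,
)
```

In a deep RL setting the same expression would be differentiated with respect to the online weights only, with the target terms treated as constants.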

## 3 Analysis

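In this section, we explain the workings of the IQ algorithm defined by Eq. (2) and detail the influence of its hyperparameters. We abstract IQ into an ADP framework, and show that, from that perspective, it is equivalent to a Mirror Descent VI (MD-VI) scheme (geist2019theory), with both entropy and KL regularization. Let us first introduce some useful notations. We make use of the actions partial dot-product notation: for $u, v \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, we define $\langle u, v \rangle = \left(\sum_{a \in \mathcal{A}} u(s,a) v(s,a)\right)_s \in \mathbb{R}^{\mathcal{S}}$. For any $V \in \mathbb{R}^{\mathcal{S}}$, we have for any $(s,a) \in \mathcal{S} \times \mathcal{A}$, $PV(s,a) = \sum_{s'} P(s'|s,a) V(s')$. We will define regularized algorithms, using the entropy of a policy, $\mathcal{H}(\pi) = -\langle \pi, \ln \pi \rangle$, and the KL divergence between two policies, $\operatorname{KL}(\pi \| \mu) = \langle \pi, \ln \pi - \ln \mu \rangle$. The $Q$-value of a policy is the unique fixed point of its Bellman operator $T_\pi$, defined for any $Q \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ as $T_\pi Q = r + \gamma P \langle \pi, Q \rangle$. We denote $Q_* = Q_{\pi_*}$ the optimal $Q$-value (the $Q$-value of the optimal policy). When the MDP is entropy-regularized with a temperature $\tau$, a policy $\pi$ admits a regularized $Q$-value $Q^\tau_\pi$, the fixed point of the regularized Bellman operator $T^\tau_\pi Q = r + \gamma P (\langle \pi, Q \rangle + \tau \mathcal{H}(\pi))$. A regularized MDP admits an optimal regularized policy $\pi^\tau_*$ and a unique optimal regularized $Q$-value $Q^\tau_*$ (geist2019theory).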

### 3.1 Ideal case

First, let us look at the ideal case, i.e. when $L_{IQ}$ is exactly minimized at each iteration (tabular representation, dataset covering the whole state-action space, expectation rather than sampling for transitions). In this context, IQ can be understood as a Dynamic Programming (DP) scheme that iterates on a policy $\pi_{k+1}$ and a value $V_k$. They are respectively equivalent to the target networks $\pi_{\bar\theta}$ and $V_{\bar\phi}$, while the next iterate $(\pi_{k+2}, V_{k+1})$ matches the solution $(\pi_\theta, V_\phi)$ of the optimization problem in Eq. (2). We call the scheme IQ-DP$(\alpha, \tau)$ and one iteration is defined by choosing $(\pi_{k+2}, V_{k+1})$ such that the squared term in Eq. (2) is $0$, that is

$$\tau \ln \pi_{k+2} + V_{k+1} = r + \alpha\tau \ln \pi_{k+1} + \gamma P V_k. \tag{3}$$

This equation is well-defined, due to the underlying constraint that $\pi_{k+2} \in \Delta_{\mathcal{A}}^{\mathcal{S}}$ (the policy must be a distribution over actions), that is $\sum_{a} \pi_{k+2}(a|s) = 1$ for all $s \in \mathcal{S}$. The basis for our discussion will be the equivalence of this scheme to a version of regularized VI. Indeed, we have the following result, proved in Appendix A.3.

###### Theorem 1.

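For any $k$, let $(\pi_{k+2}, V_{k+1})$ be the solution of IQ-DP$(\alpha, \tau)$ at step $k$. We have that

$$\begin{cases} \pi_{k+2} = \operatorname{argmax}_\pi \langle \pi, r + \gamma P V_k \rangle + (1-\alpha)\tau \mathcal{H}(\pi) - \alpha\tau \operatorname{KL}(\pi \| \pi_{k+1}) \\ V_{k+1} = \langle \pi_{k+2}, r + \gamma P V_k \rangle + (1-\alpha)\tau \mathcal{H}(\pi_{k+2}) - \alpha\tau \operatorname{KL}(\pi_{k+2} \| \pi_{k+1}) \end{cases} \tag{4}$$

so IQ-DP$(\alpha, \tau)$ produces the same sequence of policies as a value-based version of Mirror Descent VI, MD-VI (vieillard2020leverage).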
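The equivalence can be checked numerically on a small random MDP. The sketch below is an illustration under stated assumptions, not the authors' code: the MDP is randomly generated, the hyperparameter values are arbitrary, and `iq_dp_step` is a hypothetical helper. It solves Eq. (3) in closed form (the normalization constraint gives $V_{k+1}$ as a scaled logsumexp of the right-hand side, and $\pi_{k+2}$ as the corresponding softmax), then compares $V_{k+1}$ with the second line of Eq. (4).

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def iq_dp_step(pi, v, r, P, alpha, tau, gamma):
    """One tabular IQ-DP iteration, i.e. the exact solution of Eq. (3).

    pi: (S, A) current policy pi_{k+1};  v: (S,) current value V_k
    r:  (S, A) rewards;                  P: (S, A, S) transition kernel
    """
    q = r + gamma * P @ v                  # r + gamma P V_k, shape (S, A)
    rhs = q + alpha * tau * np.log(pi)     # right-hand side of Eq. (3)
    v_next = tau * logsumexp(rhs / tau, axis=1)
    pi_next = np.exp((rhs - v_next[:, None]) / tau)
    return pi_next, v_next, q

rng = np.random.default_rng(0)
S, A = 3, 4
alpha, tau, gamma = 0.5, 0.5, 0.9          # illustrative values only
r = rng.uniform(-1.0, 1.0, size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S), rows sum to 1

pi = np.full((S, A), 1.0 / A)              # uniform initial policy
v = np.zeros(S)
for _ in range(5):
    pi_prev = pi
    pi, v, q = iq_dp_step(pi, v, r, P, alpha, tau, gamma)

# Second line of Eq. (4), evaluated with the last pair of policies.
entropy = -(pi * np.log(pi)).sum(axis=1)
kl = (pi * (np.log(pi) - np.log(pi_prev))).sum(axis=1)
v_md_vi = (pi * q).sum(axis=1) + (1 - alpha) * tau * entropy - alpha * tau * kl
```

Up to floating-point error, `v` and `v_md_vi` coincide, which is the value part of Eq. (4).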

#### Discussion.

The previous result sheds a first light on the nature of the IQ method. Essentially, IQ-DP is a parametrization of a VI scheme regularized with both entropy and KL divergence, MD-VI. This first highlights the role of the hyperparameters, as it shows the interaction between the two forms of regularization. The value of $\alpha$ balances between those two: with $\alpha = 0$, IQ-DP reduces to a classic VI regularized with entropy; with $\alpha = 1$, only the KL regularization is taken into account. The value of $\tau$ then controls the amplitude of this regularization. In particular, in the limit $\tau \to 0$, we recover the standard VI algorithm. This result also justifies the soundness of IQ-DP. Indeed, this MD-VI scheme is known to converge to the optimal policy of the regularized MDP (vieillard2020leverage, Thm. 2) and this result readily applies to IQ (vieillard2020leverage show this for $Q$-functions, but it can straightforwardly be extended to value functions). Another consequence is that it links IQ to Advantage Learning (AL) (bellemare2016increasing). Indeed, AL is a limiting case of MD-VI when $\alpha > 0$ and $\tau \to 0$ (vieillard2020munchausen). Therefore, IQ also generalizes AL, and the parameter $\alpha$ can be interpreted as the advantage coefficient. Finally, a key observation is that IQ performs KL regularization implicitly, the way it was introduced by Munchausen RL (vieillard2020munchausen), by augmenting the reward with the $\alpha\tau \ln \pi_{k+1}$ term (Eq. (3)). This observation has implications that are discussed next.

### 3.2 Error propagation result

Now, we are interested in understanding how the errors introduced by function approximation propagate along iterations. At iteration $k$ of IQ, denote $\pi_{k+1} = \pi_{\bar\theta}$ and $V_k = V_{\bar\phi}$ the target networks. In the approximate setting, we do not solve Eq. (3), but instead, we minimize $L_{IQ}(\theta, \phi)$ with stochastic gradient descent. This means that $\pi_{k+2}$ and $V_{k+1}$ are the result of this optimization, and thus the next target networks. The optimization process introduces errors that come from many sources: partial optimization, function approximation (policy and value are approximated with neural networks), finite data, etc. We study the impact of these errors on the distance between the optimal $Q$-value of the MDP and the regularized $Q$-value of the current policy used by IQ, $Q^{(1-\alpha)\tau}_{\pi_k}$. We insist right away that $Q^{(1-\alpha)\tau}_{\pi_k}$ is not the learned, implicit $Q$-value, but the actual $Q$-value of the policy computed by IQ in the regularized MDP. We have the following result concerning the error propagation.

###### Theorem 2.

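Write $\pi_{k+1}$ and $V_k$ the $k$-th update of respectively the target policy and value networks. Consider the error at step $k$, $\epsilon_k \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, as the difference between the ideal and the actual updates of IQ. Formally, we define the error as, for all $k \geq 1$,

$$\epsilon_k = \tau \ln \pi_{k+2} + V_{k+1} - (r + \alpha\tau \ln \pi_{k+1} + \gamma P V_k), \tag{5}$$

and the moving average of the errors as

$$E_k = (1-\alpha) \sum_{j=1}^{k} \alpha^{k-j} \epsilon_j. \tag{6}$$

We have the following results for two different cases depending on the value of $\alpha$. Note that when $\alpha < 1$, we bound the distance to the regularized optimal $Q$-value.

1. General case: $0 < \alpha < 1$ and $\tau > 0$, entropy and KL regularization together:

   $$\left\| Q^{(1-\alpha)\tau}_* - Q^{(1-\alpha)\tau}_{\pi_k} \right\|_\infty \leq \frac{2}{(1-\gamma)^2} \left( (1-\gamma) \sum_{j=1}^{k} \gamma^{k-j} \| E_j \|_\infty \right) + o\left(\frac{1}{k}\right). \tag{7}$$

2. Specific case $\alpha = 1$, $\tau > 0$, use of KL regularization alone:

   $$\left\| Q_* - Q_{\pi_k} \right\|_\infty \leq \frac{2}{1-\gamma} \left\| \frac{1}{k} \sum_{j=1}^{k} \epsilon_j \right\|_\infty + O\left(\frac{1}{k}\right). \tag{8}$$

###### Sketch of proof.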

The full proof is provided in Appendix A.4. We build upon the connection we established between IQ-DP and a VI scheme regularized by both KL and entropy in Thm. 1. By injecting the proposed representation into the classic MD-VI scheme, we can build upon the analysis of vieillard2020leverage to provide these results. ∎

#### Impact of KL regularization.

The KL regularization term, specifically in the MD-VI framework, is discussed extensively by vieillard2020leverage, and we refer to them for an in-depth analysis of the subject. We recall here the main benefits of KL regularization, as illustrated by the bounds of Thm. 2. In the second case, where it is the clearest (only KL is used), we observe a beneficial property of KL regularization: averaging of errors. Indeed, in a classic non-regularized VI scheme (scherrer2015approximate), the error would depend on a moving average of the norms of the errors $(1-\gamma) \sum_{j=1}^{k} \gamma^{k-j} \|\epsilon_j\|_\infty$, while with the KL it depends on the norm of the average of the errors $\frac{1}{k} \left\| \sum_{j=1}^{k} \epsilon_j \right\|_\infty$. In a simplified case where the errors would be i.i.d. and zero mean, this would allow convergence of approximate MD-VI, but not of approximate VI. In the case $\alpha < 1$, where we introduce entropy regularization, the impact is less obvious, but we still transform a sum of norms of errors into a sum of norms of moving averages of errors, which can help by reducing the underlying variance.
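The difference between a moving average of norms and the norm of an average can be seen on synthetic data. The following sketch assumes i.i.d. standard Gaussian errors, which is the simplified case mentioned above and not a claim about the errors of an actual agent; the sizes and the discount are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
k, dim, gamma = 2000, 10, 0.9

# Synthetic i.i.d. zero-mean errors eps_1, ..., eps_k (one row per iteration).
eps = rng.normal(size=(k, dim))

# Non-regularized VI: (1 - gamma) * sum_j gamma^(k - j) * ||eps_j||_inf.
sup_norms = np.abs(eps).max(axis=1)
weights = (1 - gamma) * gamma ** (k - 1 - np.arange(k))  # row i holds eps_{i+1}
vi_term = float(weights @ sup_norms)

# KL-regularized case (Eq. (8)): || (1/k) * sum_j eps_j ||_inf.
kl_term = float(np.abs(eps.mean(axis=0)).max())

print(f"moving average of norms: {vi_term:.3f}")
print(f"norm of the average:     {kl_term:.3f}")
```

The first quantity stays at the scale of a single error, while the second shrinks as $k$ grows, since the zero-mean errors cancel out inside the norm.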

As stated in the sketched proof, Thm. 2 is a consequence of (vieillard2020leverage, Thm. 1 and 2). A crucial limitation of this work is that the analysis only applies when no errors are made in the greedy step. This is possible in a relatively simple setting, with tabular representation, or with a linear parametrization of the $Q$-function. However, in the general case with function approximation, exactly solving the optimization problem regularized by KL is not immediately possible: the solution of the greedy step of MD-VI is $\pi_{k+2} \propto \pi_{k+1}^{\alpha} \exp(Q_k / \tau)$ (where $Q_k = r + \gamma P V_k$), so computing it exactly would require remembering every previous iterate during the procedure, which is not feasible in practice. A workaround to this issue was introduced by vieillard2020munchausen as Munchausen RL: the idea is to augment the reward by the log-policy, to implicitly define a KL regularization term, while reducing the greedy step to a softmax. As mentioned before, in small discrete action spaces, this allows computing the greedy step exactly, but it is not the case in multidimensional or continuous action spaces, and thus Munchausen RL loses its interest in such domains. With IQ, we utilize the Munchausen idea to implicitly define the KL regularization; but with our parametrization, the exactness of the greedy step holds even for complex action spaces: recall that the parametrization defined in Eq. (1) enforces that the policy is a softmax of the (implicit) $Q$-value. Thus, IQ can be seen as an extension of Munchausen RL to multidimensional and continuous action spaces.

To sum up, IQ implements with function approximation an ADP scheme that is essentially VI with entropy and KL regularization. This type of regularization is known to be efficient, as it can compensate errors made during the evaluation step, but this compensation relies on the greedy step being exact. A way to have an exact greedy step and still use KL regularization is the Munchausen method, which avoids computing an explicit KL by simply augmenting the reward with a log-policy. This type of KL regularization reduces the greedy step to a softmax: this is sufficient to avoid errors in a discrete actions setting, but not with continuous actions. IQ yields an exact softmax by implicitly defining the $Q$-value. And, using the Munchausen method to compute KL regularization, it extends it to continuous actions: IQ performs entropy and KL regularization with no approximation in the greedy step, even in continuous action domains.

### 3.3 Link to the dueling architecture

Now, we show a link between IQ and the dueling networks architecture as defined by wang2016dueling. We will first quickly describe the dueling architecture, and then show how it can be related to IQ.

Dueling Networks (DN) were introduced as a variation of the seminal Deep Q-Networks (DQN, mnih2015human), and have been empirically shown to be efficient (for example by hessel2018rainbow). The idea is to represent the $Q$-value as the sum of a value and an advantage. In this setting, we work with a notion of advantage defined over $Q$-functions (as opposed to defining the advantage as a function of a policy). For any $Q \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, its advantage $A_Q$ is defined as $A_Q(s,a) = Q(s,a) - \max_{a'} Q(s,a')$. The advantage encodes a sub-optimality constraint: it has non-positive values and its maximum over actions (attained at the action maximizing the $Q$-value) is $0$. wang2016dueling propose to learn a $Q$-value by defining an advantage network $F_\Theta$ and a value network $V_\Phi$, which in turn define a $Q$-value $Q_{\Theta,\Phi}$ as

$$Q_{\Theta,\Phi}(s,a) = F_\Theta(s,a) - \max_{a'} F_\Theta(s,a') + V_\Phi(s). \tag{9}$$

Subtracting the maximum over the actions ensures that the advantage network indeed represents an advantage. Note that dueling DQN was designed for discrete action settings, where computing the maximum over actions is not an issue.

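In IQ, we need a policy network that represents a distribution over the actions. There are several practical ways to represent the policy, which are discussed in Sec. 4. For the sake of simplicity, let us for now assume that we are in a mono-dimensional discrete action space, and that we use a common scaled softmax representation. Specifically, our policy is represented by a neural network (e.g. fully connected) $F_\theta$, that maps state-action pairs to logits $F_\theta(s,a)$. The policy is then defined as $\pi_\theta(\cdot|s) = \operatorname{softmax}(F_\theta(s,\cdot)/\tau)$. Directly from the definition of the softmax, we observe that $\tau \ln \pi_\theta(a|s) = F_\theta(s,a) - \tau \ln \sum_{a'} \exp(F_\theta(s,a')/\tau)$. The second term is a classic scaled logsumexp over the actions, a soft version of the maximum: when $\tau \to 0$, we have that $\tau \ln \sum_{a'} \exp(F_\theta(s,a')/\tau) \to \max_{a'} F_\theta(s,a')$. Within the IQ parametrization, we have

$$Q_{\theta,\phi}(s,a) = F_\theta(s,a) - \tau \ln \sum_{a'} \exp\frac{F_\theta(s,a')}{\tau} + V_\phi(s), \tag{10}$$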

which makes a clear link between IQ and DN. In this case (scaled softmax representation), the IQ parametrization generalizes the dueling architecture, retrieved when $\tau \to 0$ (and with an additional AL term whenever $\alpha > 0$, see Sec. 3). In practice, wang2016dueling use a different parametrization of the advantage, replacing the maximum by a mean, defining $A_\Theta(s,a) = F_\Theta(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} F_\Theta(s,a')$. We could use a similar trick and replace the logsumexp by a mean in our policy parametrization, but in our case this did not prove to be efficient in practice.
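The limit can be illustrated numerically. In the sketch below, the logits are made-up numbers for a single state and `scaled_log_policy` is a hypothetical helper; it computes the scaled log-policy $\tau \ln \operatorname{softmax}(F/\tau)$ and measures how far it is from the dueling advantage $F - \max_{a'} F$ as the temperature decreases.

```python
import numpy as np

def scaled_log_policy(f, tau):
    """tau * ln softmax(f / tau) for the logits of a single state, computed stably."""
    m = f.max()
    scaled_logsumexp = m + tau * np.log(np.exp((f - m) / tau).sum())
    return f - scaled_logsumexp

f = np.array([1.0, 2.5, -0.3, 2.0])      # made-up logits F(s, .)
hard_advantage = f - f.max()             # advantage term of the dueling architecture

gaps = {}
for tau in (1.0, 0.1, 0.001):
    soft_advantage = scaled_log_policy(f, tau)
    gaps[tau] = float(np.abs(soft_advantage - hard_advantage).max())
    print(f"tau={tau}: max gap to the hard advantage = {gaps[tau]:.6f}")
```

The gap equals the scaled logsumexp of the centered logits, which is at most $\tau \ln |\mathcal{A}|$, so it vanishes with the temperature.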

We showed how the log-policy represents a soft version of the advantage. While this makes its role in the learning procedure clearer, it also raises questions about what sort of representation would be the most suited for optimization.

## 4 Practical considerations

We now describe key practical issues encountered when choosing a policy representation. The main one comes from the delegation of the representation power of the algorithm to the policy network. In a standard actor-critic algorithm (take SAC for example, where the policy is parametrized as a Gaussian distribution), the goal of the policy is mainly to track the maximizing action of the $Q$-value. Thus, estimation errors can cause the policy to choose sub-optimal actions, but the inductive bias caused by the Gaussian representation may not be a huge issue in practice, as long as the mean of the Gaussian policy is not too far from the maximizing action. In other words, the representation capacity of an algorithm such as SAC lies mainly in the representation capacity of its $Q$-network.

In IQ, we have a parametrization of the policy that enforces it to be a softmax of an implicit $Q$-value. By doing this, we trade in estimation error (our greedy step is exact by construction) for representation power. More precisely, as the $Q$-value is not parametrized explicitly, but through the policy, the representation power of IQ is in its policy network, and a “simple” representation might not be enough anymore. For example, if we parametrized the policy as a Gaussian, this would amount to parametrizing an advantage as a quadratic function of the action: this would drastically limit what IQ could represent.

#### Multicategorical policies.

To address this issue, we turn to other, richer, distribution representations. In practice, we consider a multi-categorical discrete softmax distribution. Precisely, we are in the context of a multi-dimensional action space $\mathcal{A}$ of dimension $d$, each dimension being a bounded interval. We discretize each dimension of the space uniformly into $n$ values. It effectively defines a discrete action space $\mathcal{A}' = \mathcal{A}_1 \times \dots \times \mathcal{A}_d$, with $|\mathcal{A}_j| = n$. A multidimensional action is a vector $a \in \mathcal{A}'$, and we denote $a^j$ the $j$-th component of the action $a$. Assuming independence between actions conditioned on states, a policy $\pi$ can be factorized as the product of $d$ marginal mono-dimensional policies, $\pi(a|s) = \prod_{j=1}^{d} \pi_j(a^j|s)$. We represent each policy as the softmax of the output of a neural network $F^j_\theta$, and thus we get the full representation

$$\pi_\theta(a|s) = \prod_{j=1}^{d} \operatorname{softmax}(F^j_\theta(\cdot|s))(a^j). \tag{11}$$

The functions $F^j_\theta$ can be represented as neural networks with a shared core, which only differ in the last layer. This type of multicategorical policy can represent any distribution (with $n$ high enough) that does not encompass a dependency between the dimensions. The independence assumption is quite strong, and does not hold in general. From an advantage point of view, it assumes that the soft-advantage (i.e. the log-policy) can be linearly decomposed along the actions. While this somehow limits the advantage representation, it is a much weaker constraint than parametrizing the advantage as a quadratic function of the action (which would be the case with a Gaussian policy). In practice, these types of multicategorical policies have been tried (akkaya2019solving; tang2020discretizing), and have proven to be efficient on continuous control tasks.
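As an illustration of Eq. (11), the sketch below computes the log-probability of a discretized action as a sum of per-dimension log-softmax terms. It makes several assumptions that are not taken from the paper: the logits are random stand-ins for the network outputs, the bounds `low` and `high` are arbitrary, and the helper names are hypothetical.

```python
import itertools
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def multicategorical_log_prob(logits, action_bins):
    """ln pi(a|s) under Eq. (11).

    logits: (d, n) array, row j holds the n logits F^j(.|s) of dimension j.
    action_bins: (d,) integer array, index of the chosen bin in each dimension.
    """
    log_p = log_softmax(logits)
    return float(log_p[np.arange(len(action_bins)), action_bins].sum())

def bins_to_action(action_bins, low, high, n):
    """Map bin indices to continuous values on a uniform grid of n points."""
    grid = np.linspace(low, high, n)
    return grid[action_bins]

rng = np.random.default_rng(0)
d, n = 2, 3
logits = rng.normal(size=(d, n))           # stand-in for the policy network outputs

# The factorized policy is a proper distribution over the n**d joint actions.
total_prob = sum(
    np.exp(multicategorical_log_prob(logits, np.array(bins)))
    for bins in itertools.product(range(n), repeat=d)
)
action = bins_to_action(np.array([0, 2]), low=-1.0, high=1.0, n=n)
```

The network only needs $n \cdot d$ outputs, while the induced joint distribution covers $n^d$ actions.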

Even richer policy classes can be explored. To account for dependency between dimensions, one could envision auto-regressive multicategorical representations, used for example to parametrize a $Q$-value by metz2017discrete. Another approach is to use richer continuous distributions, such as normalizing flows (rezende2015variational; ward2019improving). In this work, we restrict ourselves to the multicategorical setting, which is sufficient to get satisfying results (Sec. 6.2), and we leave the other options for future work.

## 5 Related work

#### Similar parametrizations.

Other algorithms make use of a similar parametrization. First, Path Consistency Learning (PCL, nachum2017bridging) also parametrizes the $Q$-value as a sum of a log-policy and a value. Trust-PCL (nachum2017trust) builds on PCL by adding a trust region constraint on the policy update, similar to our KL regularization term. A key difference with IQ is that (Trust-)PCL is a residual algorithm, while IQ works around a fixed-point scheme. In short, Trust-PCL can be seen as a version of IQ without the target value network $V_{\bar\phi}$. These entropy-regularized residual approaches are derived from the softmax temporal consistency principle, which allows considering extensions to a specific form of multi-step learning (strongly relying on the residual aspect), but they also come with drawbacks, such as introducing a bias in the optimization when the environment is stochastic (geist2016bellman). Second, Quinoa (degrave2019quinoa) uses a similar loss to Trust-PCL and IQ (without reference to the former Trust-PCL), but does not propose any analysis, and is evaluated only on a few tasks. Third, Normalized Advantage Function (NAF, gu2016continuous) is designed with similar principles. In NAF, a $Q$-value is parametrized as a value and an advantage, the latter being quadratic in the action. It matches the special case of IQ with a Gaussian policy, where we recover this quadratic parametrization.

#### Regularization.

Entropy and KL regularization are used by many other RL algorithms. Notably, from a dynamic programming perspective, IQ-DP(0, $\tau$) (IQ with only entropy regularization) performs the same update as SAC, an entropy-regularized VI. This equivalence is however not true in the function approximation regime. Due to the empirical success of SAC and its link to IQ, it will be used as our main baseline on continuous control tasks. Other algorithms also use KL regularization, notably Maximum a posteriori Policy Optimization (MPO, abdolmaleki2018maximum). We refer to vieillard2020leverage for an exhaustive review of algorithms encompassed within the MD-VI framework.

## 6 Experiments

Here, we describe our experimental setting and provide results evaluating the performance of IQ.

### 6.1 Setup

#### Environments and metrics.

We evaluate IQ first on the Mujoco environment from OpenAI Gym (brockman2016openai). It consists of locomotion tasks, with action spaces ranging from 3 (Hopper-v2) to 17 dimensions (Humanoid-v2). We use a rather long time horizon setting, evaluating our algorithm on M steps on each environment. We also provide results on the Adroit manipulation dataset (rajeswaran2017learning), with a similar setting of M environment steps. Adroit is a collection of hand manipulation tasks. This environment is often used in an offline RL setting, but here we use it only as a direct RL benchmark. Out of these tasks, we leave one out: we could not find any working algorithm (baseline or new) on the “relocate” task. To summarize the performance of an algorithm, we report the baseline-normalized score along iterations: it normalizes the score so that $0$ corresponds to a random score, and $1$ to a given baseline. It is defined for one task as $\frac{\text{score}_{\text{algorithm}} - \text{score}_{\text{random}}}{\text{score}_{\text{baseline}} - \text{score}_{\text{random}}}$, where the baseline is the best version of SAC on Mujoco and Adroit after M steps. We then report aggregated results, showing the mean and median of these normalized scores along the tasks. Each score is reported as the average over random seeds. For each experiment, the corresponding standard deviation is reported in Appx. B.3.
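The baseline-normalized score and its aggregation over tasks can be sketched as follows. The task names and raw scores are hypothetical placeholders chosen only to exercise the formula; they are not results from the paper.

```python
import numpy as np

def normalized_score(score, random_score, baseline_score):
    """Baseline-normalized score: 0 for a random agent, 1 for the baseline."""
    return (score - random_score) / (baseline_score - random_score)

# Hypothetical (algorithm, random, baseline) raw scores for two made-up tasks.
raw_scores = {
    "task_a": (1500.0, 0.0, 3000.0),
    "task_b": (900.0, 100.0, 900.0),
}

normalized = np.array([normalized_score(*scores) for scores in raw_scores.values()])
mean_score = float(normalized.mean())
median_score = float(np.median(normalized))
```

Here `task_a` yields 0.5 and `task_b` yields 1.0, so both aggregates are 0.75.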

#### IQ algorithms.

We implement IQ with the Acme (hoffman2020acme) codebase. It defines two deep neural networks, a policy network $\pi_\theta$ and a value network $V_\phi$. IQ interacts with the environment through $\pi_\theta$, and collects transitions that are stored in a FIFO replay buffer. At each interaction, IQ updates $\theta$ and $\phi$ by performing a step of stochastic gradient descent with Adam (kingma2014adam) on $L_{IQ}$ (Eq. (2)). During each step, IQ updates a copy $\bar\theta$ of the weights $\theta$ with a smooth update $\bar\theta \leftarrow (1-\lambda)\bar\theta + \lambda\theta$, with $\lambda \in (0, 1)$. It tracks a similar copy $\bar\phi$ of $\phi$. We keep almost all common hyperparameters (network architecture, etc.) the same as our main baseline, SAC. We only adjust the learning rate for two tasks, Humanoid and Walker, where we used a lower value: we found that IQ benefits from this, while for SAC we did not observe any improvement (we provide more details and complete results in Appx. B.3). Our value network has the same architecture as the SAC $Q$-networks except that the input size is only the state size (as it does not depend on the action). The policy network has the same architecture as the SAC policy network, and differs only by its output: the IQ policy outputs a multicategorical policy (so $n \cdot d$ values, where $d$ is the dimensionality of the action space and $n$ is the number of discrete actions on each dimension), while the SAC policy outputs $2d$-dimensional vectors (mean and diagonal covariance matrix of a Gaussian). We use in our experiments. IQ introduces two hyperparameters, $\alpha$ and $\tau$. We tested several values of $\tau$ between and , and selected a value per task suite: we use on Mujoco tasks and on Adroit. We tested values of $\alpha$ in . To make the distinction between the cases when $\alpha = 0$ and $\alpha > 0$, we denote IQ($\alpha > 0$) as M-IQ, for Munchausen-IQ, since it makes use of the Munchausen regularization term. For M-IQ, we found to be the best performing value, which is consistent with the findings of vieillard2020munchausen. We report results for non-optimal values of $\alpha$ in the ablation study (Section 6.2). Extended explanations are provided in Appendix B.2.
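A smooth update of target weights is commonly implemented as an exponential moving average of the online weights. The sketch below assumes that form; the `rate` parameter, the dictionary-of-arrays representation of the weights, and the function name are illustrative choices, not details taken from the paper or from Acme.

```python
import numpy as np

def smooth_update(target, online, rate):
    """Move each target weight a fraction `rate` of the way towards the online weight."""
    return {name: (1.0 - rate) * target[name] + rate * online[name] for name in target}

# Toy weights for a single layer (hypothetical names and shapes).
online_weights = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
target_weights = {"w": np.zeros(2), "b": np.zeros(1)}

target_weights = smooth_update(target_weights, online_weights, rate=0.1)
```

With `rate=1.0` this reduces to a hard copy of the online weights, and with a small rate the target networks change slowly, which is the usual motivation for this kind of update.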

#### Baselines.

On continuous control tasks, our main baseline is SAC, as it reaches state-of-the-art performance on Mujoco tasks. For reference, we compare to the version of SAC that uses an adaptive temperature, but note that for IQ we keep a fixed temperature ($\tau$) setting. To reach its best performance, SAC either uses a specific temperature value per task, or an adaptive scheme that controls the entropy of the policy. This method could be extended to multicategorical policies, but we leave this for future work, and focus on a fixed temperature setting, where we use the same value of $\tau$ for all tasks of an environment. On Gym, we use the default parameters from haarnoja2018soft2. On Adroit, we used a specifically tuned version of SAC. Remarkably, both SAC and IQ work with similar hyperparameter ranges on Mujoco and Adroit. We only found that using a learning rate of (instead of ) gave better performance on Adroit. We also compare IQ to Trust-PCL. It is the closest algorithm to IQ, with a similar parametrization. To be fair, we compare to our version of Trust-PCL, which is essentially a residual version of IQ, where the target value network is removed (replaced by the online one). We use Trust-PCL with a fixed temperature, and we tuned this temperature to the environment. We found that Trust-PCL reaches its best performance with significantly lower values of $\tau$ compared to IQ. In the ablation (Fig. 1) we used for PCL and Trust-PCL.

### 6.2 Results

#### Comparison to baselines.

We report aggregated results of IQ and M-IQ on Gym in Figure 1 and on Adroit in Figure 2, and corresponding standard deviations in Appx. B.3. IQ reaches performance competitive with SAC. It is less sample efficient on Gym (SAC reaches higher performance sooner), but faster on Adroit, and IQ reaches a close final performance on both environments. These results also show the impact of the $\alpha$ parameter. Although the impact of the Munchausen term (i.e., KL regularization) might not seem as impressive as in discrete actions, these results show that using that term is never detrimental, and can even bring a slight improvement on Gym, while it does not add any computational complexity to the algorithm. We also report scores on each individual task in Appx. B.3, along with an in-depth discussion of the performance and the impact of hyperparameters.

#### Influence of the temperature.

We study the influence of the temperature on the Mujoco tasks in Fig. 3. We report the score of IQ for several values of $\tau$ (with here, and with in Appx. B.3), on all environments of Mujoco. It shows that $\tau$ needs to be selected carefully: while it helps learning, too high values of $\tau$ can be detrimental to the performance, and it highlights that its optimal value might be dependent on the task. Another observation is that $\tau$ has a much stronger influence on IQ than $\alpha$. This is a key empirical difference with M-DQN (vieillard2020munchausen), which has the same parameters but is evaluated in discrete action settings. In these settings, the $\alpha$ parameter is shown to have a crucial importance in terms of empirical results: M-DQN with largely outperforms M-DQN with on the Atari benchmark. While this term still has an effect in IQ on some tasks, it is empirically less useful, even though it is never detrimental; this discrepancy is yet to be understood.

#### Ablation study.

We perform an ablation on important components of IQ in Fig. 1. (1) We replace the target network by its online counterpart in Eq. (2), which gives us Trust-PCL (and PCL is obtained by setting $\alpha = 0$), a residual version of our method. IQ and M-IQ both outperform Trust-PCL and PCL on Mujoco. (2) We use a Gaussian parametrization of the policy instead of a multicategorical distribution. We observe in Figure 1 that this causes the performance to drop drastically. This empirically validates the considerations about the necessary complexity of the policy from Section 4.

## 7 Conclusion

We introduced IQ, a parametrization of a $Q$-value that mechanically preserves the softmax relation between a policy and an implicit $Q$-function. Building on this parametrization, we derived an off-policy algorithm that learns a policy and a value by minimizing a single loss, in a fixed-point fashion. We provided an analysis that justifies our parametrization and the algorithm. Specifically, IQ performs entropy and (implicit) KL regularization on the policy. While this kind of regularization had already been used and analyzed in RL, it was limited by the difficulty of estimating the softmax of a $Q$-function in continuous action settings. IQ removes this limitation by avoiding any approximation in this softmax, effectively extending the analysis of this regularization. This parametrization comes at a cost: it shifts the representation capacity from the $Q$-network to the policy, which makes the use of Gaussian representations ineffective. We solved this issue by considering simple multicategorical policies, which allowed IQ to reach performance comparable to state-of-the-art methods on classic continuous control benchmarks. Yet, we envision that studying even richer policy classes may result in even better performance. In the end, this work brings together theory and practice: IQ is a theory-consistent manner of implementing an algorithm based on regularized VI in continuous action settings.

## Appendix A Analysis

This Appendix provides details and proofs on the IQ parametrization.

#### Reminder on notations.

Throughout the Appendix, we use the following notations. Recall that we defined the action dot product as, for any $u \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ and $v \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$,

$$\langle u, v\rangle = \Big(\sum_{a\in\mathcal{A}} u(s,a)\,v(s,a)\Big)_{s} \in \mathbb{R}^{\mathcal{S}}. \tag{12}$$

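We also slightly overload the $+$ operator. Precisely, for any $Q \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$, $V \in \mathbb{R}^{\mathcal{S}}$, we define $Q + V \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ as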

$$\forall (s,a)\in\mathcal{S}\times\mathcal{A},\quad (Q+V)(s,a) = Q(s,a) + V(s). \tag{13}$$

Write $\mathbf{1}$ the constant function of value $1$. For any $Q \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$, we define the softmax operator as

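$$\operatorname{softmax}(Q) = \frac{\exp(Q)}{\langle \mathbf{1}, \exp Q\rangle} \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}, \tag{14}$$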

where the fraction is overloaded in the same way as the addition operator, that is, for any state-action pair $(s,a)$,

$$\operatorname{softmax}(Q)(s,a) = \frac{\exp Q(s,a)}{\sum_{a'\in\mathcal{A}}\exp Q(s,a')}. \tag{15}$$

### a.1 About the softmax consistency

First, we provide a detailed explanation of the consistency of the IQ parametrization. In Section 2, we claim that parametrizing a $Q$-value as $Q = \tau\ln\pi + V$ enforces the relation $\pi = \operatorname{softmax}(Q/\tau)$. This relation comes mechanically from the constraint that $\pi$ is a distribution over actions. For the sake of precision, we provide a detailed proof of this claim as formalized in the following lemma.

###### Lemma 1.

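For any $Q \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$, $V \in \mathbb{R}^{\mathcal{S}}$, $\pi \in \Delta_{\mathcal{A}}^{\mathcal{S}}$, we have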

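$$Q = \tau\ln\pi + V \;\Leftrightarrow\; \begin{cases}\pi = \operatorname{softmax}\big(\frac{Q}{\tau}\big)\\[2pt] V = \tau\ln\big\langle \mathbf{1}, \exp\frac{Q}{\tau}\big\rangle.\end{cases} \tag{16}$$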
###### Proof.

Directly from the left hand side (l.h.s.) of Eq. (16), we have

$$\pi = \exp\frac{Q - V}{\tau}. \tag{17}$$

Since $\pi$ is a distribution over the actions, we have

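$$\begin{aligned}
\langle \mathbf{1}, \pi\rangle = 1 &\Leftrightarrow \Big\langle \mathbf{1}, \exp\frac{Q-V}{\tau}\Big\rangle = 1 && \text{(18)}\\
&\Leftrightarrow \Big(\exp\frac{-V}{\tau}\Big)\Big\langle \mathbf{1}, \exp\frac{Q}{\tau}\Big\rangle = 1 \quad (V \text{ does not depend on the actions}) && \text{(19)}\\
&\Leftrightarrow V = \tau\ln\Big\langle \mathbf{1}, \exp\frac{Q}{\tau}\Big\rangle. && \text{(20)}
\end{aligned}$$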

And, for the policy, this gives

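$$\pi = \exp\frac{Q-V}{\tau} = \exp\frac{Q - \tau\ln\langle \mathbf{1}, \exp\frac{Q}{\tau}\rangle}{\tau} = \frac{\exp\big(\frac{Q}{\tau}\big)}{\langle \mathbf{1}, \exp\frac{Q}{\tau}\rangle} = \operatorname{softmax}\frac{Q}{\tau}. \tag{21}$$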

This concludes the proof. ∎
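As a sanity check of Lemma 1, the following minimal sketch builds an implicit $Q$-value from a policy and a value for a single state, then recovers both through the softmax and log-sum-exp. The function names and the numerical values are illustrative assumptions, not part of the IQ implementation.

```python
import math

def logsumexp(xs):
    """Numerically stable ln(sum(exp(x))) over a list of floats."""
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))

def implicit_q(log_pi, v, tau):
    """Q(s, a) = tau * ln pi(a|s) + V(s), for one state."""
    return [tau * lp + v for lp in log_pi]

def recover_policy_and_value(q, tau):
    """Right-hand side of Eq. (16): pi = softmax(Q / tau), V = tau * ln <1, exp(Q / tau)>."""
    scaled = [x / tau for x in q]
    lse = logsumexp(scaled)
    return [math.exp(x - lse) for x in scaled], tau * lse

# Hypothetical policy and value for a state with three actions.
tau = 0.5
pi = [0.2, 0.5, 0.3]
v = 1.7
q = implicit_q([math.log(p) for p in pi], v, tau)
pi_rec, v_rec = recover_policy_and_value(q, tau)
```

Up to floating-point error, `pi_rec` and `v_rec` coincide with `pi` and `v`, which is the equivalence stated in Eq. (16).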

### a.2 Useful properties of KL-entropy-regularized optimization

The following proofs rely on some properties of the KL divergence and of the entropy. Consider the greedy step of MD-VI, defined in Thm. 1:

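$$\pi_{k+2} = \operatorname*{argmax}_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}} \langle \pi, r + \gamma P V_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi) - \alpha\tau\operatorname{KL}(\pi\|\pi_{k+1}). \tag{22}$$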

Since the objective function is concave in $\pi$, this optimization problem can be tackled using properties of the Legendre-Fenchel transform (see for example hiriart2004fundamentals for general definition and properties, and vieillard2020leverage for application to our setting). We quickly state two properties that are of interest for this work in the following Lemma.

###### Lemma 2.

Consider the optimization problem of Eq. (22). Writing $Q_{k+1} = r + \gamma P V_k$, we have that

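$$\pi_{k+2} = \frac{\pi_{k+1}^{\alpha}\exp\frac{Q_{k+1}}{\tau}}{\big\langle \mathbf{1}, \pi_{k+1}^{\alpha}\exp\frac{Q_{k+1}}{\tau}\big\rangle}. \tag{23}$$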

We also get a relation between the maximizer and the maximum:

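$$\langle \pi_{k+2}, r + \gamma P V_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi_{k+2}) - \alpha\tau\operatorname{KL}(\pi_{k+2}\|\pi_{k+1}) = \tau\ln\Big\langle \pi_{k+1}^{\alpha}, \exp\frac{Q_{k+1}}{\tau}\Big\rangle. \tag{24}$$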
###### Proof.

See  vieillard2020leverage.
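The two properties of Lemma 2 can be checked numerically on a single state. The sketch below is only an illustration under assumed values: `q`, `mu` (playing the role of $\pi_{k+1}$), `alpha` and `tau` are hypothetical, and the random search over policies is a sanity check rather than a proof.

```python
import math
import random

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def objective(pi, q, mu, alpha, tau):
    """<pi, q> + (1 - alpha) * tau * H(pi) - alpha * tau * KL(pi || mu), for one state."""
    linear = sum(p * x for p, x in zip(pi, q))
    return linear + (1 - alpha) * tau * entropy(pi) - alpha * tau * kl(pi, mu)

def greedy_step(q, mu, alpha, tau):
    """Closed-form maximizer (pi proportional to mu^alpha * exp(q / tau)) and maximum."""
    logits = [alpha * math.log(m) + x / tau for m, x in zip(mu, q)]
    top = max(logits)
    lse = top + math.log(sum(math.exp(l - top) for l in logits))
    return [math.exp(l - lse) for l in logits], tau * lse

# Hypothetical values for a state with three actions.
q = [1.0, -0.5, 0.3]
mu = [0.6, 0.3, 0.1]
alpha, tau = 0.9, 0.2
pi_star, max_value = greedy_step(q, mu, alpha, tau)

# Compare against randomly drawn policies: none should beat the closed form.
rng = random.Random(0)

def random_policy(n):
    w = [rng.random() + 1e-3 for _ in range(n)]
    return [x / sum(w) for x in w]

gaps = [max_value - objective(random_policy(3), q, mu, alpha, tau) for _ in range(1000)]
```

Here `objective(pi_star, ...)` matches `max_value`, and every entry of `gaps` is nonnegative, in line with the maximizer and maximum given by the lemma.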

### a.3 Equivalence to MD-VI: proof of Theorem 1

We turn to the proof of Thm. 1. This result formalizes an equivalence in the exact case between the IQ-DP scheme and a VI scheme regularized by entropy and KL divergence. Recall that we define the update of IQ-DP at step $k$ as

$$\tau\ln\pi_{k+2} + V_{k+1} = r + \alpha\tau\ln\pi_{k+1} + \gamma P V_k. \qquad \text{IQ-DP}(\alpha,\tau) \tag{25}$$

Note that we are for now considering the scenario where this update is computed exactly. We will consider errors later, in Thm 2. Recall Thm. 1.

###### Theorem 1.

For any $k \geq 0$, let $(\pi_{k+2}, V_{k+1})$ be the solution of IQ-DP($\alpha,\tau$) at step $k$. We have that

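$$\begin{cases}\pi_{k+2} = \operatorname*{argmax}\; \langle \pi, r + \gamma P V_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi) - \alpha\tau\operatorname{KL}(\pi\|\pi_{k+1})\\[2pt] V_{k+1} = \langle \pi_{k+2}, r + \gamma P V_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi_{k+2}) - \alpha\tau\operatorname{KL}(\pi_{k+2}\|\pi_{k+1})\end{cases} \tag{26}$$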

so IQ-DP($\alpha,\tau$) produces the same sequence of policies as a value-based version of Mirror Descent VI, MD-VI [vieillard2020leverage].

###### Proof.

Applying Lemma 1 to Eq. (25) gives

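$$\begin{cases}\pi_{k+2} = \operatorname{softmax}\dfrac{r + \alpha\tau\ln\pi_{k+1} + \gamma P V_k}{\tau}\\[6pt] V_{k+1} = \tau\ln\Big\langle \mathbf{1}, \exp\dfrac{r + \alpha\tau\ln\pi_{k+1} + \gamma P V_k}{\tau}\Big\rangle.\end{cases} \tag{27}$$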

For the policy, we have

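$$\pi_{k+2} = \frac{\exp(\alpha\ln\pi_{k+1})\exp\frac{r + \gamma P V_k}{\tau}}{\big\langle \mathbf{1}, \exp(\alpha\ln\pi_{k+1})\exp\frac{r + \gamma P V_k}{\tau}\big\rangle} = \frac{\pi_{k+1}^{\alpha}\exp\frac{r + \gamma P V_k}{\tau}}{\big\langle \mathbf{1}, \pi_{k+1}^{\alpha}\exp\frac{r + \gamma P V_k}{\tau}\big\rangle}, \tag{28}$$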

and, as a direct consequence of Lemma 2,

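$$\pi_{k+2} = \operatorname*{argmax}\; \langle \pi, r + \gamma P V_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi) - \alpha\tau\operatorname{KL}(\pi\|\pi_{k+1}). \tag{29}$$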

For the value, we have:

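$$V_{k+1} = \tau\ln\Big\langle \mathbf{1}, \exp(\alpha\ln\pi_{k+1})\exp\frac{r + \gamma P V_k}{\tau}\Big\rangle = \tau\ln\Big\langle \pi_{k+1}^{\alpha}, \exp\frac{r + \gamma P V_k}{\tau}\Big\rangle, \tag{30}$$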

and again applying Lemma 2 gives

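$$V_{k+1} = \langle \pi_{k+2}, r + \gamma P V_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi_{k+2}) - \alpha\tau\operatorname{KL}(\pi_{k+2}\|\pi_{k+1}). \tag{31}$$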
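To make the exact scheme concrete, here is a minimal tabular sketch of the IQ-DP update: it solves the update for the new policy and value with Lemma 1, then evaluates the regularized objective of Theorem 1 at the new policy. The two-state MDP, the parameter values, and all function names are assumptions made for illustration only.

```python
import math

def backup(s, v, r, P, gamma):
    """r(s, .) + gamma * (P V)(s, .) for one state s."""
    return [r[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(P[s][a]))
            for a in range(len(r[s]))]

def iq_dp_step(pi_prev, v_prev, r, P, alpha, tau, gamma):
    """One exact IQ-DP update: softmax for the policy, log-sum-exp for the value."""
    pi_next, v_next = [], []
    for s in range(len(r)):
        scaled = [(b + alpha * tau * math.log(p)) / tau
                  for b, p in zip(backup(s, v_prev, r, P, gamma), pi_prev[s])]
        top = max(scaled)
        lse = top + math.log(sum(math.exp(x - top) for x in scaled))
        pi_next.append([math.exp(x - lse) for x in scaled])
        v_next.append(tau * lse)
    return pi_next, v_next

def regularized_value(s, pi_new, pi_prev, v_prev, r, P, alpha, tau, gamma):
    """Regularized objective of the MD-VI value update, evaluated at pi_new in state s."""
    b = backup(s, v_prev, r, P, gamma)
    ent = -sum(p * math.log(p) for p in pi_new[s])
    kl = sum(p * math.log(p / m) for p, m in zip(pi_new[s], pi_prev[s]))
    linear = sum(p * x for p, x in zip(pi_new[s], b))
    return linear + (1 - alpha) * tau * ent - alpha * tau * kl

# Hypothetical MDP: r[s][a] is the reward, P[s][a] the next-state distribution.
r = [[1.0, 0.0], [0.0, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
alpha, tau, gamma = 0.9, 0.5, 0.9
pi, v = [[0.5, 0.5], [0.5, 0.5]], [0.0, 0.0]
for _ in range(20):
    pi_prev, v_prev = pi, v
    pi, v = iq_dp_step(pi_prev, v_prev, r, P, alpha, tau, gamma)
```

After any iteration, `v[s]` agrees with `regularized_value(s, pi, pi_prev, v_prev, ...)` up to floating-point error, which is the value identity of Theorem 1 on this toy problem.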

### a.4 Error propagation: proof of Theorem 2

Now we turn to the proof of Thm 2. This theorem handles the IQ-DP scheme in the approximate case, when errors are made during the iterations. The considered scheme is

$$\tau\ln\pi_{k+2} + V_{k+1} = r + \alpha\tau\ln\pi_{k+1} + \gamma P V_k + \epsilon_{k+1}. \tag{32}$$

Recall Thm. 2.

###### Theorem 2.

Write and the update of respectively the target policy and value networks. Consider the error at step , , as the difference between the ideal and the actual updates of IQ. Formally, we define the error as, for all ,

$$\epsilon_{k+1} = \tau\ln\pi_{k+2} + V_{k+1} - (r + \alpha\tau\ln\pi_{k+1} + \gamma P V_k), \tag{33}$$

and the moving average of the errors as

$$E_k = (1-\alpha)\sum_{j=1}^{k}\alpha^{k-j}\epsilon_j. \tag{34}$$

We have the following results for two different cases depending on the value of $\alpha$. Note that when $\alpha < 1$, we bound the distance to the regularized optimal $Q$-value.

1. General case: $0 < \alpha < 1$ and $\tau > 0$, entropy and KL regularization together:

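 $$\big\|Q^{*}_{(1-\alpha)\tau} - Q^{\pi_k}_{(1-\alpha)\tau}\big\|_\infty \leq \frac{2}{(1-\gamma)^2}\Big((1-\gamma)\sum_{j=1}^{k}\gamma^{k-j}\|E_j\|_\infty\Big) + o\Big(\frac{1}{k}\Big). \tag{35}$$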
2. Specific case $\alpha = 1$, $\tau > 0$, use of KL regularization alone:

 $$\|Q^{*} - Q^{\pi_k}\|_\infty \leq \frac{2}{1-\gamma}\Big\|\frac{1}{k}\sum_{j=1}^{k}\epsilon_j\Big\|_\infty + O\Big(\frac{1}{k}\Big). \tag{36}$$
###### Proof.

To prove this error propagation result, we first show an extension of Thm. 1 that links Approximate IQ-DP with a $Q$-value based version of MD-VI. This new equivalence makes IQ-DP correspond exactly to a scheme that is extensively analyzed by vieillard2020leverage. Then our result can be derived as a consequence of [vieillard2020leverage, Thm 1] and [vieillard2020leverage, Thm 2].

Define a (KL-regularized) implicit $Q$-value as

$$Q_k = \tau\ln\pi_{k+1} - \alpha\tau\ln\pi_k + V_k, \tag{37}$$

so that now, the IQ-DP update (Eq. (32)) can be written

$$Q_{k+1} = r + \gamma P V_k + \epsilon_{k+1}. \tag{38}$$

We then use the same method as for the proof of Thm. 1. Specifically, applying Lemma 1 to the definition of $Q_k$ gives for the policy

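$$\begin{aligned}
\pi_{k+1} &= \operatorname{softmax}\Big(\frac{Q_k + \alpha\tau\ln\pi_k}{\tau}\Big) \quad \text{(Lemma 1)} && \text{(39)}\\
&= \frac{\pi_k^{\alpha}\exp\frac{Q_k}{\tau}}{\big\langle \mathbf{1}, \pi_k^{\alpha}\exp\frac{Q_k}{\tau}\big\rangle} && \text{(40)}\\
\Leftrightarrow\; \pi_{k+1} &= \operatorname*{argmax}\; \langle \pi, Q_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi) - \alpha\tau\operatorname{KL}(\pi\|\pi_k). \quad \text{(Lemma 2)} && \text{(41)}
\end{aligned}$$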

For the value, we have from Lemma 1

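$$V_k = \tau\ln\Big\langle \mathbf{1}, \exp\frac{Q_k + \alpha\tau\ln\pi_k}{\tau}\Big\rangle = \tau\ln\Big\langle \pi_k^{\alpha}, \exp\frac{Q_k}{\tau}\Big\rangle, \tag{42}$$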

then, using Lemma 2, and the fact that , we have

$$V_k = \langle \pi_{k+1}, Q_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi_{k+1}) - \alpha\tau\operatorname{KL}(\pi_{k+1}\|\pi_k). \tag{44}$$

Injecting this in Eq. (38) gives

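$$Q_{k+1} = r + \gamma P\big(\langle \pi_{k+1}, Q_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi_{k+1}) - \alpha\tau\operatorname{KL}(\pi_{k+1}\|\pi_k)\big) + \epsilon_{k+1}. \tag{45}$$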

Thus, we have proved the following equivalence between DP schemes

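$$\begin{gathered}
\tau\ln\pi_{k+2} + V_{k+1} = r + \alpha\tau\ln\pi_{k+1} + \gamma P V_k + \epsilon_{k+1} \qquad \text{(46)}\\
\Updownarrow \qquad \text{(47)}\\
\begin{cases}\pi_{k+1} = \operatorname*{argmax}\; \langle \pi, Q_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi) - \alpha\tau\operatorname{KL}(\pi\|\pi_k)\\[2pt] Q_{k+1} = r + \gamma P\big(\langle \pi_{k+1}, Q_k\rangle + (1-\alpha)\tau\mathcal{H}(\pi_{k+1}) - \alpha\tau\operatorname{KL}(\pi_{k+1}\|\pi_k)\big) + \epsilon_{k+1},\end{cases} \qquad \text{(48)}
\end{gathered}$$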

with

$$Q_k = \tau\ln\pi_{k+1} - \alpha\tau\ln\pi_k + V_k. \tag{49}$$

The above scheme in Eq. (48) is exactly the MD-VI scheme studied by vieillard2020leverage, where they define and . We now use their analysis of MD-VI to apply their result to IQ-DP, building on the equivalence between the schemes. Note that transferring this type of analysis between equivalent formulations of DP schemes is justified because the equivalences exist in terms of policies. Indeed, IQ-DP and MD-VI compute different ($Q$)-values, but produce identical series of policies. Since [vieillard2020leverage, Thm 1] and [vieillard2020leverage, Thm 2] bound the distance between the optimal (regularized) $Q$-value and the actual (regularized) $Q$-values of the computed policy, the equivalence in terms of policies is sufficient to apply these theorems to IQ-DP. Specifically, [vieillard2020leverage, Thm 1] applied to the formulation of IQ in Eq. (48) proves point 1 of Thm. 2, that is, the case where $\alpha < 1$. The second part is proven by applying [vieillard2020leverage, Thm 2] to this same formulation. ∎
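The moving average of errors that appears in the general-case bound can be computed either from its definition or through a one-step recursion. The sketch below compares the two on scalar errors (one state-action pair); the Gaussian error model and the function names are assumptions chosen purely for illustration.

```python
import random

def moving_average_direct(errors, alpha):
    """E_k = (1 - alpha) * sum_{j=1..k} alpha^(k - j) * eps_j, with k = len(errors)."""
    k = len(errors)
    return (1 - alpha) * sum(alpha ** (k - j) * e for j, e in enumerate(errors, start=1))

def moving_average_recursive(errors, alpha):
    """Same quantity via E_k = alpha * E_{k-1} + (1 - alpha) * eps_k, starting from E_0 = 0."""
    e_k = 0.0
    for eps in errors:
        e_k = alpha * e_k + (1 - alpha) * eps
    return e_k

# Hypothetical zero-mean errors; any sequence of floats works.
rng = random.Random(0)
errors = [rng.gauss(0.0, 1.0) for _ in range(200)]
alpha = 0.9
e_direct = moving_average_direct(errors, alpha)
e_recursive = moving_average_recursive(errors, alpha)
```

For a constant error sequence equal to $c$, both functions return $c(1 - \alpha^k)$, which shows that the weights of the average sum to $1 - \alpha^k$.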

### a.5 IQ and Munchausen DQN

We claim in Section 3 that IQ is a form of Munchausen algorithm, specifically Munchausen-DQN (M-DQN). Here, we clarify this link. Note that all of the information below is contained in Appx. A.3 and Appx. A.4. The point of this section is to rewrite it using the notations used to define IQ as a deep RL agent, which are consistent with how M-DQN is defined.

IQ optimizes a policy $\pi_\theta$ and a value $V_\phi$ by minimizing a loss (Eq. (2)). Recall that IQ implicitly defines a $Q$-function as $Q_{\theta,\phi} = \tau\ln\pi_\theta + V_\phi$. Identifying this in the loss makes the connection between Munchausen RL and IQ completely clear. Indeed, the loss can be written as

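$$\hat{\mathbb{E}}\left[\left(r_t + \alpha\tau\ln\pi_{\bar\theta}(a_t|s_t) + \gamma\underbrace{V_{\bar\phi}(s_{t+1})}_{\tau\ln\sum_a \exp\frac{Q_{\bar\theta,\bar\phi}(s_{t+1},a)}{\tau}} - \underbrace{\big(\tau\ln\pi_\theta(a_t|s_t) + V_\phi(s_t)\big)}_{Q_{\theta,\phi}}\right)^2\right], \tag{50}$$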

and since we have (Lemma 2, and using the fact that )

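$$\tau\ln\sum_a \exp\frac{Q_{\bar\theta,\bar\phi}(s,a)}{\tau} = \sum_a \pi_{\bar\theta}(a|s)\big(Q_{\bar\theta,\bar\phi}(s,a) - \tau\ln\pi_{\bar\theta}(a|s)\big), \tag{51}$$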

we get that the loss is

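$$\hat{\mathbb{E}}\left[\left(r_t + \alpha\tau\ln\pi_{\bar\theta}(a_t|s_t) + \gamma\sum_a \pi_{\bar\theta}(a|s_{t+1})\big(Q_{\bar\theta,\bar\phi}(s_{t+1},a) - \tau\ln\pi_{\bar\theta}(a|s_{t+1})\big) - Q_{\theta,\phi}(s_t,a_t)\right)^2\right], \tag{52}$$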

which is exactly the Munchausen-DQN loss on $Q_{\theta,\phi}$. Thus, in a mono-dimensional action setting (classic discrete control problems for example), IQ can really be seen as a re-parameterized version of M-DQN.
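The equivalence between the two forms of the loss can be illustrated on a single transition with one discrete action dimension. In the sketch below, the logits, values and parameters are hypothetical numbers, and the two functions are illustrative names for the residuals inside the squares of Eq. (50) and Eq. (52); this is not the Acme implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities of a categorical distribution."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - lse for x in logits]

def iq_residual(r, a, logits, v, tgt_logits, tgt_v_next, alpha, tau, gamma):
    """Residual of the IQ form: the target uses the target value network at s' directly."""
    target = r + alpha * tau * log_softmax(tgt_logits)[a] + gamma * tgt_v_next
    q_online = tau * log_softmax(logits)[a] + v  # implicit Q(s, a)
    return target - q_online

def munchausen_residual(r, a, logits, v, tgt_logits, tgt_logits_next, tgt_v_next,
                        alpha, tau, gamma):
    """Residual of the Munchausen form: the soft value at s' is expanded as a sum over actions."""
    log_pi_next = log_softmax(tgt_logits_next)
    q_next = [tau * lp + tgt_v_next for lp in log_pi_next]  # implicit target Q(s', .)
    soft_value = sum(math.exp(lp) * (q - tau * lp) for lp, q in zip(log_pi_next, q_next))
    target = r + alpha * tau * log_softmax(tgt_logits)[a] + gamma * soft_value
    return target - (tau * log_softmax(logits)[a] + v)

# Hypothetical transition (s, a, r, s') with three actions.
args = dict(r=1.0, a=2, logits=[0.1, -0.4, 0.7], v=0.3,
            tgt_logits=[0.0, -0.2, 0.5], tgt_v_next=0.8,
            alpha=0.9, tau=0.1, gamma=0.99)
res_iq = iq_residual(**args)
res_m = munchausen_residual(tgt_logits_next=[0.3, 0.3, -1.0], **args)
```

The two residuals coincide whatever the target logits at the next state are, because the expanded soft value collapses to the target value, as stated by Eq. (51).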

## Appendix B Additional material on experiments

This Appendix provides additional detail on experiments, along with complete empirical results.

### b.1 General information on experiments

#### Used assets.

IQ is implemented on the Acme library [hoffman2020acme], distributed as open-source code under the Apache License (2.0).

#### Compute resources.

Experiments were run on TPUv2. One TPU is used for a single run, with one random seed. To produce the main results (without the sweeps over parameters), we computed single runs. One of this run on a TPUv2 takes from to hours depending on the environment (the larger the action space, the longer the run).

### b.2 Details on algorithms

#### On the relation between α and τ.

The equivalence result of Theorem 1 explains the roles of $\alpha$ and $\tau$ and the relation between them. In particular, it shows that IQ-DP performs a VI scheme in an entropy-regularized MDP (or in a max-entropy setting) where the temperature is not $\tau$, but $(1-\alpha)\tau$. Indeed, in this framework, the parameter $\alpha$ balances between two forms of regularization: with $\alpha = 0$, IQ-DP is only regularized with entropy, but with $\alpha > 0$, IQ-DP is regularized with both entropy and KL. Thus, IQ-DP implicitly modifies the intrinsic temperature of the MDP it is optimizing for. To account for this discrepancy, every time we evaluate IQ with $\alpha > 0$ (that is, M-IQ), we report scores using , and not . For example, on Gym, we used a temperature of for IQ, and thus for M-IQ (since, in our experiments, we took ).

#### Discretization.

We used IQ with policies that discretize the action space evenly. Here, we provide a precise definition of our discretization method. Consider a multi-dimensional action space $\mathcal{A}$ of dimension $d$, each dimension being a bounded interval $[a_{\min}, a_{\max}]$, such that $\mathcal{A} = [a_{\min}, a_{\max}]^d$. We discretize each dimension of the space uniformly in $n$ values $\delta_j$, for $0 \leq j \leq n-1$. The bin values are defined as

$$\delta_0 = a_{\min} + \frac{a_{\max} - a_{\min}}{2n}, \tag{53}$$

and, for each $j \in \{1, \ldots, n-1\}$,

$$\delta_j = \delta_0 + j\,\frac{a_{\max} - a_{\min}}{n}. \tag{54}$$

It effectively defines a discrete action space

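$$\mathcal{A}' = \underset{j=1}{\overset{d}{\times}}\,\mathcal{A}_j, \quad \text{with } \mathcal{A}_j = \{\delta_0, \ldots, \delta_{n-1}\}. \tag{55}$$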

We use in all of our experiments. The values of $a_{\min}$, $a_{\max}$ and $d$ depend on the environment specifications.
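The discretization above can be sketched in a few lines. The bounds, the number of bins and the dimension used below are arbitrary illustrative choices, not the values used in the experiments, and the function names are hypothetical.

```python
import itertools

def bin_values(a_min, a_max, n):
    """Centers of n uniform bins on [a_min, a_max]."""
    width = (a_max - a_min) / n
    delta_0 = a_min + width / 2          # first bin center, Eq. (53)
    return [delta_0 + j * width for j in range(n)]  # Eq. (54)

def discrete_action_space(a_min, a_max, n, d):
    """Cartesian product of the per-dimension bins, Eq. (55); it has n**d elements."""
    return list(itertools.product(bin_values(a_min, a_max, n), repeat=d))

# Hypothetical setting: actions in [-1, 1]^2 with 4 bins per dimension.
bins = bin_values(-1.0, 1.0, 4)
actions = discrete_action_space(-1.0, 1.0, 4, 2)
```

Enumerating the product space is only done here for illustration: a multicategorical policy handles each dimension separately, so it needs $n \cdot d$ outputs rather than $n^d$.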

#### Evaluation setting.

We evaluate our algorithms on Mujoco environments from OpenAI Gym and on the Adroit manipulation tasks. On each environment, we track performance for M environment steps. Every k environment steps, we stop learning, and we evaluate our algorithm by reporting the average undiscounted return over episodes. We use deterministic evaluation, meaning that, at evaluation time, the algorithms interact by choosing the expected value of the policy in one state, not by sampling from this policy (sampling is used during training).
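One plausible reading of the deterministic evaluation for a multicategorical policy is sketched below: the evaluation action is the per-dimension expectation of the bin values, while training actions are sampled per dimension. The function names, the probabilities and the bins are illustrative assumptions.

```python
import random

def expected_action(probs_per_dim, bins):
    """Evaluation-time action: expectation of the bin values in each action dimension."""
    return [sum(p * b for p, b in zip(probs, bins)) for probs in probs_per_dim]

def sample_action(probs_per_dim, bins, rng):
    """Training-time action: one bin sampled independently in each action dimension."""
    return [rng.choices(bins, weights=probs, k=1)[0] for probs in probs_per_dim]

# Hypothetical policy output for a 2-dimensional action space with 2 bins per dimension.
bins = [-0.5, 0.5]
probs_per_dim = [[0.5, 0.5], [1.0, 0.0]]
eval_action = expected_action(probs_per_dim, bins)
train_action = sample_action(probs_per_dim, bins, random.Random(0))
```

Note that the expected action need not be one of the bin values, whereas a sampled action always is.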

#### Pseudocode.

We provide pseudocode for IQ in Algorithm 1. This pseudocode describes a general learning procedure that is followed by all agents. Replacing the IQ loss in Algorithm 1 by its residual version gives the pseudocode for PCL, and replacing it by the actor and critic losses of SAC gives the pseudocode for this method.

#### Hyperparameters.

We provide the hyperparameters used for our experiments in Tab. 1. If a parameter is under “common parameters”, then it was used for all algorithms. We denote a fully connected layer with an output of neurons. Recall that $d$ is the dimension of the action space, and $n$ is the number of bins we discretize each dimension into.