# Unifying Value Iteration, Advantage Learning, and Dynamic Policy Programming

Approximate dynamic programming algorithms, such as approximate value iteration, have been successfully applied to many complex reinforcement learning tasks, and a better approximate dynamic programming algorithm is expected to further extend the applicability of reinforcement learning to various tasks. In this paper we propose a new, robust dynamic programming algorithm that unifies value iteration, advantage learning, and dynamic policy programming. We call it generalized value iteration (GVI) and its approximated version, approximate GVI (AGVI). We show AGVI's performance guarantee, which includes performance guarantees for existing algorithms, as special cases. We discuss theoretical weaknesses of existing algorithms, and explain the advantages of AGVI. Numerical experiments in a simple environment support theoretical arguments, and suggest that AGVI is a promising alternative to previous algorithms.


## 1 Introduction

Approximate dynamic programming (approximate DP or ADP) approximates each iteration of DP in two ways: estimating the Bellman operator using empirical samples and/or expressing a Q-value function by a function approximator. Many reinforcement learning (RL) algorithms are based on ADP. For example, Q-learning is an instance of approximate value iteration (approximate VI or AVI). Recently, a combination of deep learning and AVI is becoming increasingly popular because its performance has exceeded that of human experts in many Atari games [Mnih et al.2015, Hasselt, Guez, and Silver2016].

However, theoretical analysis of AVI shows that even when approximation errors are i.i.d. Gaussian noise, AVI may not be able to find an optimal policy. Unfortunately, approximate policy iteration (approximate PI or API) has almost the same performance guarantee [Bertsekas and Tsitsiklis1996, Scherrer et al.2012], and a better ADP algorithm is necessary to further extend the applicability of RL to complex problems.

Recently, value-based algorithms using new DP operators have been proposed by several researchers [Azar, Gómez, and Kappen2012, Bellemare et al.2016]. Bellemare et al. showed that a class of operators including an advantage learning (AL) operator can be used to find an optimal policy when there is no approximation error [Bellemare et al.2016]. In particular, it was shown experimentally that deep RL based on approximate AL (AAL) outperforms an AVI-based deep RL algorithm called deep Q-network (DQN). However, AAL lacks a performance guarantee.

Azar et al. proposed dynamic policy programming (DPP) and approximate DPP (ADPP). The latter displays greater robustness to approximation errors than AVI [Azar, Gómez, and Kappen2012]. In particular, if cumulative approximation errors over iterations is , ADPP finds an optimal policy. However, despite its theoretical guarantee, ADPP has been rarely used for complex tasks, with a few exceptions [Tsurumine et al.2017].

Motivated by those studies, we propose a new DP algorithm called generalized VI (GVI), which unifies VI, DPP, and AL. We provide a performance guarantee of approximate GVI (AGVI), which not only shows that the price to pay for ADPP’s robustness to approximation errors is a prolonged effect of approximation errors, but also provides a performance guarantee for AAL. Furthermore, we argue that AAL tends to over-estimate Q-values by maximization bias [Hasselt2010] as the cost of its optimality. AGVI provides a way to balance the pros and cons of other algorithms, leading to better performance as exemplified in Fig. 1.

We also explain how GVI is related to a regularized policy search method, in which a policy is updated repeatedly with constraints on new and old policies. We show a relationship between the Q-value function learned by GVI and regularization coefficients. Such a connection has not been demonstrated for AL.

Finally, we show experimental results for AGVI in simple environments. The results support our theoretical argument, and suggest that AGVI is a promising alternative.

In summary, here we:

• propose a new DP algorithm called GVI, the approximate version of which is robust to approximation errors and resistant to over-estimation of Q-values.

• show AGVI’s performance guarantee, which indicates a weakness of ADPP.

• show a performance guarantee for AAL.

• clarify a connection with existing DP algorithms and a regularized policy search method.

## 2 Preliminaries

### 2.1 Basic Definitions and Notations

We denote the set of bounded real-valued functions over a finite set by . We often consider a Banach space , where . For brevity, we denote it by with an abuse of notation. When we say a sequence of functions converges to , we mean uniform convergence, and we write .

For functions and with a domain , mean for any . Similarly, any arithmetic operation of two functions is a component-wise operation. For example, is a function , and is a function for any constant .

### 2.2 Reinforcement Learning

We only consider the following type of Markovian decision processes (MDPs):

###### Definition 1 (Finite State and Action MDP).

An MDP is a 5-tuple $(S, A, P, r, \gamma)$, where $S$ is the finite state space, $A$ is the finite action space, $P$ is the state transition probability kernel, $r$ is the expected immediate reward function, and $\gamma$ is the discount factor.

Semantics are as follows: suppose that an agent has executed an action at a state . Then, state transition to a subsequent state occurs with an immediate reward whose expected value is . We usually use and to denote a state and an action, respectively. We only consider infinite horizon tasks.

A policy is a conditional probability distribution over actions given a state. We consider only stationary stochastic Markov policies.

The state value function (for a policy ) is the expected discounted future rewards when the policy is followed from a state , in other words, , where indicates that a policy is followed with the expectation, and and denote reward and state at time , respectively. When the expectation is further conditioned by an action , it is called a Q-value function. We denote it as . It is known that and exist under our settings, and they are called the optimal state value function and the optimal Q-value function, respectively. An optimal policy satisfies . The optimal advantage function is defined as .

### 2.3 Bellman Operator and Policy Operators

An operator is a mapping between functional spaces. A policy yields a right-linear operator defined by , . A stochastic kernel also yields a right-linear operator defined as , where . By combining them, we define the following right-linear operator . Hereafter, we omit parentheses, e.g., , and denote it as for brevity.

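The Bellman operator $T^\pi$ for a policy $\pi$ is defined by $T^\pi Q := r + \gamma P \pi Q$. Similarly, the Bellman optimality operator $T$ is defined by $T Q := r + \gamma P m Q$, where $m$ is the max operator defined by $(m f)(s) := \max_a f(s, a)$. We often use the mellowmax operator $m_\beta$ defined by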

$$m_\beta f(s) := \frac{1}{\beta} \log \frac{\sum_a \exp(\beta f(s, a))}{|A|},$$

where $|A|$ is the number of actions [Asadi and Littman2017]. It is known that $m_\beta \to m$ as $\beta \to \infty$. On the other hand, as $\beta \to 0$, $m_\beta$ becomes just an average over actions. Therefore, by $m_\infty$ and $m_0$, we mean $m$ and an average over actions, respectively, in this paper. We define $T_\beta$ by $T_\beta Q := r + \gamma P m_\beta Q$. Mellowmax is known to be a non-expansion [Asadi and Littman2017]. Therefore, $T_\beta$ is a contraction with modulus $\gamma$. We denote its unique fixed point by $Q_\beta$.
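As a minimal illustration of the mellowmax definition above, the following NumPy sketch evaluates it with a numerically stable log-mean-exp and checks the two limiting cases numerically. The function name `mellowmax`, the convention that actions lie on the last array axis, and the example Q-values are assumptions made for illustration only.

```python
import numpy as np

def mellowmax(q, beta):
    """Mellowmax over the last axis: (1 / beta) * log(mean(exp(beta * q)))."""
    q = np.asarray(q, dtype=float)
    if beta == 0.0:
        return q.mean(axis=-1)            # limit beta -> 0: plain average
    if np.isinf(beta):
        return q.max(axis=-1)             # limit beta -> infinity: max operator
    z = beta * q
    z_max = z.max(axis=-1, keepdims=True)
    # Subtract the maximum before exponentiating to avoid overflow.
    log_mean_exp = z_max.squeeze(-1) + np.log(np.exp(z - z_max).mean(axis=-1))
    return log_mean_exp / beta

# Hypothetical Q-values for two states and three actions.
q = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 10.0]])
small = mellowmax(q, 1e-6)   # close to the average over actions
large = mellowmax(q, 1e3)    # close to the maximum over actions
print(small, large)
```

Subtracting the row maximum leaves the value unchanged mathematically but keeps `exp` from overflowing when `beta * q` is large.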

The following operator is often used in RL:

$$b_\beta f(s) = \frac{\sum_a \exp(\beta f(s, a))\, f(s, a)}{\sum_{a'} \exp(\beta f(s, a'))}.$$

We call the Boltzmann operator, which is not a non-expansion [Asadi and Littman2017].
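To make the contrast between the two soft operators concrete, the sketch below implements the Boltzmann operator next to mellowmax and compares them on random inputs. For positive $\beta$, the entropy argument used in Appendix A implies that mellowmax never exceeds the Boltzmann operator, which in turn never exceeds the max. The function names and the random test data are illustrative assumptions.

```python
import numpy as np

def boltzmann(q, beta):
    """Boltzmann operator: softmax-weighted average of q over the last axis."""
    z = beta * q
    w = np.exp(z - z.max(axis=-1, keepdims=True))   # subtract max for stability
    w /= w.sum(axis=-1, keepdims=True)
    return (w * q).sum(axis=-1)

def mellowmax(q, beta):
    """Mellowmax for finite beta > 0, via a stable log-mean-exp."""
    z = beta * q
    z_max = z.max(axis=-1, keepdims=True)
    return (z_max.squeeze(-1) + np.log(np.exp(z - z_max).mean(axis=-1))) / beta

rng = np.random.default_rng(0)
q = rng.normal(size=(1000, 4))   # hypothetical Q-values: 1000 states, 4 actions

# Smallest observed gaps; both should be non-negative up to rounding error.
min_gap_b_minus_m = {}
min_gap_max_minus_b = {}
for beta in (0.1, 1.0, 10.0):
    b, m = boltzmann(q, beta), mellowmax(q, beta)
    min_gap_b_minus_m[beta] = (b - m).min()
    min_gap_max_minus_b[beta] = (q.max(axis=-1) - b).min()
print(min_gap_b_minus_m, min_gap_max_minus_b)
```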

### 2.4 Advantage Learning Operator

Bellemare et al. [Bellemare et al.2016] proposed an AL operator:

$$Q_{k+1} := T Q_k + \alpha (Q_k - m Q_k), \tag{1}$$

where . The algorithm using this update rule is called AL. They showed that a greedy policy w.r.t. is an optimal policy when there is no approximation error. Furthermore, Bellemare et al. argued that by using AL, the difference between Q-values for an optimal action and for sub-optimal actions is enhanced, leading to learning that is less susceptible to function approximation error. They experimentally showed that deep RL based on AAL outperforms DQN in Atari games.

### 2.5 Dynamic Policy Programming Operator

Azar et al. [Azar, Gómez, and Kappen2012] proposed the following update rule called DPP:

$$Q_{k+1} := T_\beta Q_k + Q_k - m_\beta Q_k, \tag{2}$$

where . Since the difference between and can be bounded, they also proposed the following update rule:

$$Q_{k+1} := r + \gamma P b_\beta Q_k + Q_k - b_\beta Q_k.$$

They showed that a Boltzmann action selection policy converges to an optimal policy, and that ADPP is more robust to approximation errors than AVI or API.

## 3 The Algorithm and Theoretical Analyses

### 3.1 Generalized Value Iteration (GVI)

Note that the r.h.s. of (2) becomes an AL operator with as . Consequently, one may think that also converges to the optimal Q-value function. Unfortunately, this does not hold. Specifically, the following theorem holds (all proofs are in the Appendix).

###### Theorem 1 (Generalized Value Iteration).

Suppose a function and the following update rule

$$Q_{k+1} := T_\beta Q_k + \alpha (Q_k - m_\beta Q_k), \tag{3}$$

where , . If ,

where . If , , s.t. , s.t.

$$Q_k = V^* + Q_0 + k A^* - m_\beta\big((k-1) A^* + Q_0\big) + \phi.$$

We call the algorithm using the update rule (3) GVI.
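The update rule (3) can be sketched for a tabular MDP as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the MDP is a small randomly generated one, `P` has shape (states, actions, next states), and the names `gvi_step` and `run_gvi` are hypothetical. Setting `alpha=0, beta=inf` gives value iteration, and `beta=inf` with `0 < alpha < 1` gives advantage learning.

```python
import numpy as np

def mellowmax(q, beta):
    """Mellowmax over actions (last axis); beta = inf gives the max operator."""
    if np.isinf(beta):
        return q.max(axis=-1)
    z = beta * q
    z_max = z.max(axis=-1, keepdims=True)
    return (z_max.squeeze(-1) + np.log(np.exp(z - z_max).mean(axis=-1))) / beta

def gvi_step(Q, P, r, gamma, alpha, beta):
    """One exact GVI update: Q <- T_beta Q + alpha * (Q - m_beta Q)."""
    v = mellowmax(Q, beta)              # m_beta Q, shape (S,)
    t_beta_q = r + gamma * (P @ v)      # T_beta Q = r + gamma * P m_beta Q
    return t_beta_q + alpha * (Q - v[:, None])

def run_gvi(P, r, gamma, alpha, beta, n_iter=2000):
    Q = np.zeros_like(r)
    for _ in range(n_iter):
        Q = gvi_step(Q, P, r, gamma, alpha, beta)
    return Q

# Hypothetical random MDP with 5 states and 3 actions.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))

Q_vi = run_gvi(P, r, gamma, alpha=0.0, beta=np.inf)   # value iteration
Q_al = run_gvi(P, r, gamma, alpha=0.5, beta=np.inf)   # advantage learning
Q_gvi = run_gvi(P, r, gamma, alpha=0.5, beta=5.0)     # moderate alpha and beta
print(Q_vi.argmax(axis=1), Q_al.argmax(axis=1), Q_gvi.argmax(axis=1))
```

With `alpha=1` and finite `beta` the same step corresponds to the DPP update (2); in that case the Q-values of sub-optimal actions keep decreasing, so the table is not expected to converge.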

###### Remark 1.

Theorem 1 for states that for any required accuracy , there exists s.t. for , the deviation of from is kept within . Note that unless , does not converge. Hence, we cannot state it in a form similar to cases where . Clearly, when for a state and an action , diverges to . Accordingly, it follows that a greedy policy w.r.t. is optimal. When , any action is optimal.

###### Remark 2.

Let us note that when the greedy policy w.r.t. is optimal. (Hence, the greedy policy w.r.t. is optimal.) Suppose that an optimal action and a second-optimal action satisfy . Then, . Hence, is also a greedy action w.r.t. . This implies that whether or not a greedy action w.r.t. is optimal depends on , which is called the action-gap [Farahmand2011]. When the action-gap is large, a task is easy, and GVI may find an optimal policy. On the other hand, when the action-gap is small, a task is difficult, and GVI is unlikely to find an optimal policy. However, a second-optimal action then has a Q-value close to that of the best action. Accordingly, the second-optimal action may not be a bad choice.

As clearly shown by Theorem 1, unless either or , an optimal policy cannot be obtained by GVI. However, empirical results show that both AAL and ADPP work best when and take moderate values rather than or for and for [Azar, Gómez, and Kappen2012, Bellemare et al.2016]. Our theoretical analyses (Theorem 2 and Sect. 3.2) indeed indicate that moderate values of and have preferable properties.

### 3.2 Performance Bound for Approximate GVI

An exact implementation of GVI requires a model of an environment. In model-free RL, sampling by a behavior policy introduces sampling errors and bias on chosen state-action pairs. In addition, in large scale problems, function approximation is inevitable. As a result, GVI updates are contaminated with approximation errors resulting in an update rule . We call this algorithm approximate GVI (AGVI). The following theorem relates approximation error and the quality of a policy obtained by AGVI.

###### Theorem 2 (Performance Bound for AGVI).

Suppose the update rule of AGVI, and s.t. . Furthermore, let denote a policy which satisfies . Then, we have

$$\left\| Q^* - Q^{\pi_k} \right\| \le C + \frac{2}{1-\gamma} \cdot \frac{1-\alpha}{1-\alpha^{k+1}} \left( C_k + E_k \right), \tag{4}$$

where

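$$\begin{aligned}
C &:= \frac{\gamma}{1-\gamma} \cdot \frac{1-\alpha}{\beta} \log |A|, \\
C_k &:= \gamma\, \frac{\alpha^{k+1} - \gamma^{k+1}}{\alpha - \gamma} \left( 2 V_{\max} + \frac{\alpha}{\beta} \log |A| \right), \\
E_k &:= \sum_{i=0}^{k} \gamma^i \left\| \sum_{j=0}^{k-i} \alpha^j \varepsilon_{k-i-j} \right\|.
\end{aligned}$$

###### Remark 3.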

which satisfies can be found, for example, by maximizing the entropy of with constraints as proposed in [Asadi and Littman2017].

###### Remark 4.

As $\alpha$ approaches $1$, this performance bound recovers that of ADPP [Azar, Gómez, and Kappen2012] (we corrected a mistake in their bound: their error terms lack a coefficient 2):

$$\left\| Q^* - Q^{\pi_k} \right\| \le \frac{2}{(1-\gamma)(k+1)} \left( C_k + E_k \right),$$

where $\alpha$ in $C_k$ and $E_k$ is set to $1$. As a corollary, a performance bound for AAL can be obtained by letting $\beta \to \infty$ as well.

#### Faster Error Decay with α<1

Theorem 2 implies a slow decay of approximation error when . For simplicity, assume that for all except where . In this case, (4) becomes

$$\left\| Q^* - Q^{\pi_k} \right\| \le C + \frac{2}{1-\gamma} D_k \varepsilon, \tag{5}$$

where , and all terms not related to approximation error are aggregated to . Therefore, determines how quickly the effect of the approximation error decays. Figure 2 shows the coefficient for various . As becomes higher, the decay slows.
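The decay described here can be reproduced numerically. The sketch below is one reading of (4) under the stated single-error assumption (only the first error is non-zero, with norm $\varepsilon$): the double sum in $E_k$ then collapses to $\varepsilon \sum_{i=0}^{k} \gamma^i \alpha^{k-i}$, and the coefficient multiplying $\varepsilon$ is that sum times $(1-\alpha)/(1-\alpha^{k+1})$. The function name `decay_coefficient` and the chosen values of $\gamma$, $\alpha$, and $k$ are illustrative assumptions.

```python
import numpy as np

def decay_coefficient(k, alpha, gamma):
    """(1 - alpha) / (1 - alpha**(k+1)) * sum_{i=0}^{k} gamma**i * alpha**(k-i)."""
    i = np.arange(k + 1)
    weighted_sum = np.sum(gamma ** i * alpha ** (k - i))
    # (1 - alpha) / (1 - alpha**(k+1)) equals 1 / sum_{j=0}^{k} alpha**j,
    # which stays well defined at alpha = 1 (where it is 1 / (k + 1)).
    normaliser = np.sum(alpha ** i)
    return weighted_sum / normaliser

gamma = 0.9
for alpha in (0.0, 0.5, 0.9, 1.0):
    coeffs = [decay_coefficient(k, alpha, gamma) for k in (0, 10, 50, 200)]
    print(alpha, np.round(coeffs, 5))
```

For `alpha < 1` the coefficient shrinks geometrically in `k`, whereas for `alpha = 1` it shrinks only like `1 / (k + 1)`, which matches the slower decay attributed to ADPP.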

Accordingly, for some types of approximation error, such as model bias of a function approximator, might pile up, and ADPP might perform poorly. Another source of such error is sampling bias due to a poor policy. In the beginning of learning, a policy that seems best is deployed to collect samples. However, such a policy may not be optimal, and may explore only a limited state and action space. As a result, approximation error is expected to accumulate outside the limited space. Over-estimation of Q-value function, which we explain next, is also a source of such error.

#### Less Maximization Bias with finite β

AVI tends to over-estimate the Q-value due to maximization bias. Such over-estimation can be caused not only by environmental stochasticity, but also by function approximation error. This is a significant problem when these algorithms are applied to complex RL tasks [Hasselt2010, Hasselt, Guez, and Silver2016].

To understand maximization bias, suppose that AVI has started with an initial function $Q_0$. When an environment or a policy is stochastic, $Q_1(s, a)$ is a random variable. As a result, taking the maximum of $Q_1(s, a)$ over actions corresponds to an estimator of $\mathbb{E}[\max_a Q_1(s, a)]$, i.e., an over-estimate of $\max_a \mathbb{E}[Q_1(s, a)]$, which is what we want in reality.

On the other hand, as , the over-estimation diminishes. Indeed, since mellowmax is increasing in , and convex in Q-value, we have

$$\mathbb{E}\left[ m_0 Q_1(s) \right] \le m_\beta \mathbb{E}\left[ Q_1(s, a) \right] \le \mathbb{E}\left[ m_\beta Q_1(s) \right].$$

The l.h.s. is equal to . Accordingly, for a small , over-estimation of becomes less. Soft-update similar to the above works better than double Q-learning [Fox, Pakman, and Tishby2016].
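The effect of $\beta$ on maximization bias can be illustrated with a small Monte Carlo experiment. In this sketch, all actions are assumed to have a true value of zero, the estimates are corrupted by unit Gaussian noise, and the expectation of the mellowmax of the noisy estimates is approximated by a sample mean. The number of actions, the noise model, and the values of $\beta$ are assumptions chosen for illustration.

```python
import numpy as np

def mellowmax(q, beta):
    """Mellowmax over the last axis, including the limits beta = 0 and beta = inf."""
    if beta == 0.0:
        return q.mean(axis=-1)
    if np.isinf(beta):
        return q.max(axis=-1)
    z = beta * q
    z_max = z.max(axis=-1, keepdims=True)
    return (z_max.squeeze(-1) + np.log(np.exp(z - z_max).mean(axis=-1))) / beta

rng = np.random.default_rng(0)
n_actions, n_samples = 10, 20000
true_q = np.zeros(n_actions)      # every action is equally good: max_a E[Q] = 0
noisy_q = true_q + rng.normal(size=(n_samples, n_actions))   # noisy estimates

betas = [0.0, 0.5, 2.0, np.inf]
# Sample mean of m_beta applied to the noisy estimates, for each beta.
expected = {b: mellowmax(noisy_q, b).mean() for b in betas}
for b in betas:
    print(f"beta={b}: E[m_beta Q] ~ {expected[b]:.3f}")
```

Because the true value is zero for every action, any positive number printed here is pure over-estimation; it grows with `beta` and is largest for the hard max.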

### 3.3 Derivation of the Algorithm

To understand the meaning of and , we explain how GVI is derived from a regularized policy search method. The derivation is similar to that of DPP [Azar, Gómez, and Kappen2012]. A difference is that we use entropy regularization in addition to Kullback–Leibler (KL) divergence.

#### Regularized Policy Search to a New PI-Like DP

Let denote KL divergence between policies and at state , and denote entropy of at state . Suppose a modified state value function

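$$V^{\pi}_{\tilde\pi}(s) = V^{\pi}(s) - \mathbb{E}^{\pi}\left[ \sum_{t \ge 0} \gamma^t \left( \frac{1}{\eta} D(s_t; \pi, \tilde\pi) - \frac{1}{\theta} H(s_t; \pi) \right) \,\middle|\, s_0 = s \right]. \tag{6}$$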

Let denote an optimal policy that maximizes the modified state value function above. It turns out that they have the following form. (A proof is in Appendix D)

###### Theorem 3 (Expression of an Optimal Policy π∘).

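For the modified state value function (6), there exists a policy $\pi^\circ$ s.t. for any policy $\pi$, $V^{\pi^\circ}_{\tilde\pi} \ge V^{\pi}_{\tilde\pi}$. Furthermore, $V^{\pi^\circ}_{\tilde\pi}$ and $\pi^\circ$ have the following form: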

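$$\begin{aligned}
V^{\pi^\circ}_{\tilde\pi}(s) &= \frac{1}{\beta} \log \sum_a \tilde\pi(a|s)^{\alpha} \exp\left( \beta Q^{\pi^\circ}_{\tilde\pi}(s, a) \right), \\
\pi^\circ(a|s) &= \frac{\tilde\pi(a|s)^{\alpha} \exp\left( \beta Q^{\pi^\circ}_{\tilde\pi}(s, a) \right)}{\sum_{a'} \tilde\pi(a'|s)^{\alpha} \exp\left( \beta Q^{\pi^\circ}_{\tilde\pi}(s, a') \right)} \\
&= \frac{\tilde\pi(a|s)^{\alpha} \exp\left( \beta Q^{\pi^\circ}_{\tilde\pi}(s, a) \right)}{\exp\left( \beta V^{\pi^\circ}_{\tilde\pi}(s) \right)},
\end{aligned}$$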

where , , and .

Therefore, after obtaining , can be computed with . Since maximizes expected cumulative rewards while maintaining entropy and KL divergence between and , is expected to be better than , but not to be too different from it and not to be deterministic.

We are interested in solving an original MDP. A straightforward approach is updating to , and finding a new optimal policy with as a new baseline policy, s.t. KL divergence becomes . This can be done by first obtaining using fixed-point iteration

$$V^{k+1}_{\tilde\pi}(s) = \frac{\log \sum_a \tilde\pi(a|s)^{\alpha} \exp\left( \beta Q^{k}_{\tilde\pi}(s, a) \right)}{\beta},$$

where . Then, we compute with , and finally update to . By repeating these steps, the policy is expected to converge to an entropy-regularized optimal policy.
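One step of this PI-like scheme can be sketched as follows. The code assumes that the Q-function inside the fixed-point iteration is $r + \gamma P V^k$ (the usual one-step backup), takes $\alpha$ and $\beta$ directly as inputs instead of deriving them from $\eta$ and $\theta$, and uses a small random MDP with a uniform baseline policy. The function name `regularized_policy_step` and all numerical values are hypothetical.

```python
import numpy as np

def log_sum_exp(z):
    """Stable log(sum(exp(z))) over the last axis."""
    z_max = z.max(axis=-1, keepdims=True)
    return z_max.squeeze(-1) + np.log(np.exp(z - z_max).sum(axis=-1))

def regularized_policy_step(pi_tilde, P, r, gamma, alpha, beta, n_iter=500):
    """Fixed-point iteration for V, followed by the closed-form policy."""
    log_prior = alpha * np.log(pi_tilde)      # log of pi_tilde ** alpha
    V = np.zeros(r.shape[0])
    for _ in range(n_iter):
        Q = r + gamma * (P @ V)               # assumed one-step backup
        V = log_sum_exp(log_prior + beta * Q) / beta
    Q = r + gamma * (P @ V)
    # pi(a|s) = pi_tilde(a|s)**alpha * exp(beta * Q(s, a)) / exp(beta * V(s))
    pi_new = np.exp(log_prior + beta * (Q - V[:, None]))
    return pi_new, V

# Hypothetical random MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))
pi_tilde = np.full((S, A), 1.0 / A)           # uniform baseline policy

pi_new, V = regularized_policy_step(pi_tilde, P, r, gamma=0.9, alpha=0.8, beta=2.0)
print(pi_new.sum(axis=1))                     # each row should sum to one
```

The iteration is a contraction with modulus `gamma` because adding a constant to `V` shifts the log-sum-exp by `gamma` times that constant, so a few hundred iterations are ample here.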

#### Regularized Policy Search to a New VI-Like DP

Rather than updating infinitely, updating once might be enough, as is the case for VI. Suppose . In this case, update rule is

$$V_{k+1}(s) = \frac{\log \sum_a \pi_k(a|s)^{\alpha} \exp\left( \beta (r + \gamma P V_k)(s, a) \right)}{\beta}, \tag{7}$$

where is an arbitrary policy satisfying for any state and action , and

$$\pi_{k+1}(a|s) = \frac{\pi_k(a|s)^{\alpha} \exp\left( \beta (r + \gamma P V_{k+1})(s, a) \right)}{\exp\left( \beta V_{k+1}(s) \right)}.$$

It turns out that (slightly modified version of) the above algorithm can be efficiently implemented by GVI. The modification is that policy improvement is done by

$$\pi_{k+1}(a|s) = \frac{\pi_k(a|s)^{\alpha} \exp\left( \beta (r + \gamma P V_k)(s, a) \right)}{\exp\left( \beta V_{k+1}(s) \right)}. \tag{8}$$

With this modification, GVI can be derived as follows. Define $Q_{k+1}$ by the following equation, whose last term is added just to obtain a log-average-exp expression in the end (without it, almost the same algorithm can be derived):

$$Q_{k+1} := r + \gamma P V_k + \frac{\alpha}{\beta} \log \pi_k + \frac{\alpha - \gamma}{(1-\gamma)\beta} \log |A|. \tag{9}$$

Equivalently, we have

$$r + \gamma P V_k = Q_{k+1} - \frac{\alpha}{\beta} \log \pi_k - \frac{\alpha - \gamma}{(1-\gamma)\beta} \log |A|. \tag{10}$$

By using (7) and (10),

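$$\begin{aligned}
V_{k+1}(s) &= \frac{\log \sum_a \exp\left( \beta Q_{k+1}(s, a) - \frac{\alpha - \gamma}{1-\gamma} \log |A| \right)}{\beta} && (11) \\
&= \frac{\log \sum_a \exp\left( \beta Q_{k+1}(s, a) \right)}{\beta} - \frac{\alpha - \gamma}{(1-\gamma)\beta} \log |A| && (12) \\
&= m_\beta Q_{k+1}(s) + \frac{1-\alpha}{(1-\gamma)\beta} \log |A|. && (13)
\end{aligned}$$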

Therefore, we have

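$$\begin{aligned}
r + \gamma P V_k - V_{k+1} &= Q_{k+1} - \frac{\alpha}{\beta} \log \pi_k - \frac{\alpha - \gamma}{(1-\gamma)\beta} \log |A| - V_{k+1} \\
&= Q_{k+1} - \frac{\alpha}{\beta} \log \pi_k - \frac{\log \sum_a \exp\left( \beta Q_{k+1}(\cdot, a) \right)}{\beta}.
\end{aligned}$$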

Consequently, by substituting in (8) with the above expression,

$$\pi_{k+1}(a|s) = \frac{\exp\left( \beta Q_{k+1}(s, a) \right)}{\sum_{a'} \exp\left( \beta Q_{k+1}(s, a') \right)}. \tag{14}$$

Plugging (13) and (14) back into $V_k$ and $\pi_k$ in (9), respectively, we get

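$$\begin{aligned}
Q_{k+1}(s, a) &= r(s, a) + \gamma P m_\beta Q_k(s, a) + \alpha Q_k(s, a) - \frac{\alpha}{\beta} \log \sum_{a'} \exp\left( \beta Q_k(s, a') \right) + \frac{\alpha}{\beta} \log |A| \\
&= T_\beta Q_k(s, a) + \alpha \left[ Q_k(s, a) - m_\beta Q_k(s, a) \right].
\end{aligned}$$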

The last line exactly corresponds to GVI update rule.
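The algebra of this derivation can be checked numerically. The sketch below starts from an arbitrary Q-table, builds $V_k$ and $\pi_k$ from it through (13) and (14), evaluates definition (9), and compares the result with the GVI update (3). It also checks that (7) and (13) give the same $V_{k+1}$. The random MDP, the parameter values, and the variable names are assumptions made for this check.

```python
import numpy as np

def log_sum_exp(z):
    """Stable log(sum(exp(z))) over the last axis."""
    z_max = z.max(axis=-1, keepdims=True)
    return z_max.squeeze(-1) + np.log(np.exp(z - z_max).sum(axis=-1))

rng = np.random.default_rng(1)
S, A = 4, 3
gamma, alpha, beta = 0.9, 0.7, 2.0
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))
Q_k = rng.normal(size=(S, A))        # arbitrary current Q-table
log_A = np.log(A)
const = (1 - alpha) / ((1 - gamma) * beta) * log_A

# Quantities at iteration k, expressed through Q_k via (13) and (14).
m_beta_Q_k = (log_sum_exp(beta * Q_k) - log_A) / beta
V_k = m_beta_Q_k + const                                        # (13)
log_pi_k = beta * Q_k - log_sum_exp(beta * Q_k)[:, None]        # (14)

# Definition (9) of Q_{k+1}.
Q_next_def = (r + gamma * (P @ V_k) + alpha / beta * log_pi_k
              + (alpha - gamma) / ((1 - gamma) * beta) * log_A)

# GVI update (3).
Q_next_gvi = r + gamma * (P @ m_beta_Q_k) + alpha * (Q_k - m_beta_Q_k[:, None])

# V_{k+1} from (7) versus V_{k+1} from (13) applied to Q_{k+1}.
V_next_7 = log_sum_exp(alpha * log_pi_k + beta * (r + gamma * (P @ V_k))) / beta
V_next_13 = (log_sum_exp(beta * Q_next_def) - log_A) / beta + const

max_gap_q = np.max(np.abs(Q_next_def - Q_next_gvi))
max_gap_v = np.max(np.abs(V_next_7 - V_next_13))
print(max_gap_q, max_gap_v)          # both should be at rounding-error level
```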

## 4 Numerical Experiments

Our purposes in the numerical experiments are as follows:

• Purpose 1. We confirm that Theorem 1 is consistent with numerical experiments, and that the Q-value difference can be enhanced as approaches .

• Purpose 2. (or ADPP) may need time to switch from a poor initial policy to a better policy. We examine whether by setting to a moderate value, such a problem can be ameliorated.

• Purpose 3. (or AAL) over-estimates Q-values. We examine whether by setting to a moderate value, such a problem can be avoided.

### 4.1 Environments and Experimental Conditions

We used the following environments.

#### ChainWalk

There are states () connected like a chain, and the agent can move either left or right. Training episodes always start from state . State transition to a desired direction occurs with probability . With probability , state transition to the opposite occurs. At the ends of the chain, attempted movement to outside of the chain results in staying at the ends. When an agent gets to a state which is on the left (or right) side, but not at the left (or right) end of the chain, the agent gets (or ) reward. If the agent reaches the center, or state , it gets no reward. If the agent moves to the left (or right) end of the chain, it can get (or ) reward. In this environment, optimal behavior is going to the left regardless of states. For brevity, we denote the Q-value of going left by , and right by in this environment.

#### LongChainWalk

The LongChainWalk environment is a modified version of the ChainWalk environment. We modified the environment as follows: First, the chain consisted of states. Second, training episodes start from a uniformly sampled state. Third, actions are specified by integers from to meaning a desired movement to state , where is a current state, and is an action. In other words, the agent is able to make larger movements. Since the over-estimation problem becomes more serious as the number of actions increases, this modification is important for our purpose. Fourth, action always succeeds, but a subsequent state is , where is sampled uniformly from an integer from to at every state transition, and restrict to . Finally, immediate reward is , where is a subsequent state. Therefore, the agent needs to move toward the center.

In an experiment for Purpose 1, the ChainWalk environment was used. We updated a Q-table with a perfect model of the environment. and are fixed to and , respectively.

In an experiment for Purpose 2, we again used the ChainWalk environment. However, this time, we trained an agent without the environmental model. Training consisted of episodes. In each episode, the agent was allowed to take actions according to -greedy. After every episode, the Q-table was immediately updated using experience the agent obtained during the episode. After every episode, evaluation of the agent was performed. The evaluation consisted of episodes starting from a state sampled uniformly. The agent was allowed to take greedy actions w.r.t. the Q-value it obtained from the training. The metric of the agent is the median of mean episodic rewards in an evaluation over experimental runs. and are fixed to and , respectively.

In an experiment for Purpose 3, we used LongChainWalk. Except that training consisted of episodes, and that was fixed to , training conditions were the same as in the second experiment.

### 4.2 Value Difference Enhancement (Purpose 1)

Figure 3 compares the numerical and analytical values of action-gap at various . It shows that Theorem 1 is consistent with numerical experiments, and the action-gap increases as approaches , as predicted.

Figure 4 shows the numerical and analytical Q-values after iterations when . In this environment, going right is a sub-optimal action. Therefore, it is strongly devalued when ( is diverging to ).

From these results, we conclude that Theorem 1 is consistent with numerical experiments, and that the Q-value difference can be increasingly enhanced as approaches .

### 4.3 Error Decay Property of AGVI (Purpose 2)

(or ADPP) may need time to switch from a poor initial policy (going right) to a better policy (going left). We examine whether by setting moderate , such a problem can be ameliorated.

Figure 5 shows the result. When , the performance is poorer than that of . However, performance slowly approaches that of as learning proceeds. For reasonably large and smaller than or equal to approximately , similar results were obtained.

In order to further analyze what was occurring, we visualized the Q-value of ADPP (Fig. 6). It suggested that slow learning when (ADPP) is caused by prolonged devaluation of an optimal action.

In summary, as Fig. 5 shows, when (ADPP), learning is slow. This occurred because ADPP takes a long time to switch from an initial poor policy that tends to go right to a better policy that tends to go left (Fig. 6) by a strong marginalization of a sub-optimal action. Indeed, this slow learning was not seen when was higher, presumably due to almost exploratory behavior. This policy switching is probably important in complex environments in which an initial policy is likely to be sub-optimal. Figure 5 shows that setting to a moderate value ameliorates this problem while outperforming AVI.

### 4.4 Less Maximization Bias (Purpose 3)

Finally, we conducted an experiment to investigate whether by setting to a moderate value, over-estimation of the Q-value by maximization bias could be avoided.

We define the error ratio (ER) by

$$\mathrm{ER} := \frac{\sum_s \left[ \max_a \tilde{Q}(s, a) - \max_a Q(s, a) \right]}{\left| \sum_s \max_a Q(s, a) \right|}, \tag{15}$$

where is the true Q-value of a corresponding parameter setting, and is the estimated Q-value. Therefore, stronger over-estimation results in a higher error ratio. In Fig. 7, ERs for various parameters are shown. One can see that over-estimation appears when is high. In particular, parameter settings at the upper left corner strongly over-estimate the Q-value.
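The error ratio (15) is straightforward to compute from two Q-tables. The following sketch assumes both tables are arrays of shape (states, actions); the function name `error_ratio` and the toy numbers are illustrative and do not come from the experiments.

```python
import numpy as np

def error_ratio(q_est, q_true):
    """Error ratio (15): summed gap of state-wise maxima, normalised by |sum of true maxima|."""
    v_est = q_est.max(axis=1)     # max_a of the estimated Q-values
    v_true = q_true.max(axis=1)   # max_a of the true Q-values
    return (v_est - v_true).sum() / abs(v_true.sum())

# Toy example: every estimate is 0.5 too high.
q_true = np.array([[1.0, 2.0], [3.0, 4.0]])
q_est = q_true + 0.5
er_over = error_ratio(q_est, q_true)           # positive: over-estimation
er_under = error_ratio(q_true - 0.5, q_true)   # negative: under-estimation
print(er_over, er_under)
```

A positive value indicates over-estimation and a negative value indicates under-estimation, which is how the sign is read in the discussion of Fig. 7.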

Figure 8 shows final performance across different parameter settings. It is clear that a moderate value of works best, except for , where no over-estimation occurred, and performance was high with any choice of . However, even in this simple environment, we observed that intermediate performance was higher when was set to a moderate value such as . This observation is consistent with the finding that a moderate may lead to faster learning because of faster error decay.

From these results, we conclude that by setting to a moderate value, over-estimation can be avoided. Rather, under-estimation occurs.

## 5 Related Work

Our work was stimulated by research that established connections between a regularized policy search and value-function learning such as [Azar, Gómez, and Kappen2012, O’Donoghue et al.2017, Fox, Pakman, and Tishby2016, Nachum et al.2017]. In particular, our work is an extension of [Azar, Gómez, and Kappen2012], further connecting a regularized policy search with AL [Bellemare et al.2016].

In [Bellemare et al.2016], it is shown that a class of operators may enhance the action-gap. In this paper, we showed the value to which AL converges. We also showed a performance guarantee that implies that as increases, AAL becomes robust to approximation errors although its learning slows. A performance guarantee for AAL is new, and our performance guarantee explains why AAL works well.

Over-estimation of the Q-value by maximization bias was first noted by [Thrun and Schwartz1993], and several researchers have addressed it in various ways [Fox, Pakman, and Tishby2016, Hasselt2010, Hasselt, Guez, and Silver2016]. In particular, GVI is similar to a batch version of G-learning [Fox, Pakman, and Tishby2016] with action-gap enhancement. Fox et al. predicted that action-gap enhancement would further ameliorate maximization bias. Our experimental results support their argument. Another approach to tackle the over-estimation uses a double estimator, as proposed in [Hasselt2010]. With a double estimator, searching to find optimal or scheduling of can be avoided. This may be a good choice when interactions with an environment require a long time so that short experiments with different are prohibitive. However, the use of a double estimator doubles the sample complexity of algorithms. In [Fox, Pakman, and Tishby2016], the authors showed that soft-update using appropriate scheduling of led to faster and better learning.

Recently, a unified view of a regularized policy search and DPP has been provided [Neu, Jonsson, and Gómez2017]. Our work is limited in that we only unified value-iteration-like algorithms. However, our work shows that AL can be also seen in the unified view. In addition, our work is more advanced in that we showed a performance guarantee for AGVI, which includes AAL, for which there has been no performance guarantee previously.

## 6 Conclusion

In this paper, we proposed a new DP algorithm called GVI, which unifies VI, AL, and DPP. We showed a performance guarantee of its approximate version, AGVI, and discussed a weakness of ADPP. We also showed that AAL tends to over-estimate Q-values. Experimental results support our argument, and suggest our algorithm as a promising alternative to existing algorithms. Specifically, AGVI allows us to balance (i) faster learning and robustness to approximation error and (ii) maximization bias and optimality of the algorithm. We also showed an interesting connection between GVI and a regularized policy search. For AL, such a connection was formerly unknown.

## Acknowledgement

This work was supported by JSPS KAKENHI Grant Numbers 16H06563 and 17H06042.

## References

• [Asadi and Littman2017] Asadi, K., and Littman, M. L. 2017. A new softmax operator for reinforcement learning. In Proc. of the 34th International Conference on Machine Learning.
• [Azar, Gómez, and Kappen2012] Azar, M. G.; Gómez, V.; and Kappen, H. J. 2012. Dynamic policy programming. J. Mach. Learn. Res. 13(1):3207–3245.
• [Bellemare et al.2016] Bellemare, M. G.; Ostrovski, G.; Guez, A.; Thomas, P.; and Munos, R. 2016. Increasing the action gap: New operators for reinforcement learning. In Proc. of the 30th AAAI Conference on Artificial Intelligence.
• [Bertsekas and Tsitsiklis1996] Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Nashua, NH, USA: Athena Scientific, 1st edition.
• [Farahmand2011] Farahmand, A.-m. 2011. Action-gap phenomenon in reinforcement learning. In Proc. of the 24th International Conference on Neural Information Processing Systems.
• [Fox, Pakman, and Tishby2016] Fox, R.; Pakman, A.; and Tishby, N. 2016. G-learning: Taming the noise in reinforcement learning via soft updates. In Proc. of the 32nd Conference on Uncertainty in Artificial Intelligence.
• [Hasselt, Guez, and Silver2016] Hasselt, H. v.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In Proc. of the 30th AAAI Conference on Artificial Intelligence.
• [Hasselt2010] Hasselt, H. V. 2010. Double Q-learning. In Proc. of the 23rd International Conference on Neural Information Processing Systems.
• [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
• [Nachum et al.2017] Nachum, O.; Norouzi, M.; Xu, K.; and Schuurmans, D. 2017. Bridging the Gap Between Value and Policy Based Reinforcement Learning. ArXiv e-prints.
• [Neu, Jonsson, and Gómez2017] Neu, G.; Jonsson, A.; and Gómez, V. 2017. A unified view of entropy-regularized Markov decision processes. ArXiv e-prints.
• [O’Donoghue et al.2017] O’Donoghue, B.; Munos, R.; Kavukcuoglu, K.; and Mnih, V. 2017. Combining policy gradient and Q-learning. In Proc. of the 5th International Conference on Learning Representations.
• [Scherrer et al.2012] Scherrer, B.; Ghavamzadeh, M.; Gabillon, V.; and Geist, M. 2012. Approximate modified policy iteration. In Proc. of the 29th International Conference on Machine Learning.
• [Sutton and Barto2017] Sutton, R. S., and Barto, A. G. 2017. Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2nd edition. (A draft version of June 19 2017).
• [Thrun and Schwartz1993] Thrun, S., and Schwartz, A. 1993. Issues in using function approximation for reinforcement learning. In Proc. of the 4th Connectionist Models Summer School.
• [Tsurumine et al.2017] Tsurumine, Y.; Cui, Y.; Uchibe, E.; and Matsubara, T. 2017. Deep dynamic policy programming for robot control with raw images. In Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems.

## Appendix A Lemmas Related to Mellowmax

We prove lemmas related to mellowmax. They are used throughout the proof of Theorem 1 and Theorem 2. For brevity, we use the following definitions in this section:

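$$\begin{aligned}
f(\beta) &:= \frac{\sum_i x_i \exp(\beta x_i)}{\sum_j \exp(\beta x_j)}, \\
g(\beta) &:= \frac{1}{\beta} \log \frac{\sum_i \exp(\beta x_i)}{N},
\end{aligned}$$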

where , and .

, .

###### Proof.

Indeed, consider an entropy of

$$p(i; \beta) := \frac{\exp(\beta x_i)}{\sum_j \exp(\beta x_j)}.$$

It can be rewritten as

$$\begin{aligned}
H(\beta) &= \log \sum_i \exp(\beta x_i) - \beta f(\beta) \\
&= \beta g(\beta) + \log N - \beta f(\beta).
\end{aligned}$$

Accordingly, . Since , we can conclude the proof. ∎

###### Lemma 5.

, is non-increasing, but is non-decreasing.

###### Proof.

Indeed,

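$$\begin{aligned}
\frac{d g(\beta)}{d \beta} &= \frac{\sum_i x_i \exp(\beta x_i)}{\beta \sum_i \exp(\beta x_i)} - \frac{\log\left( N^{-1} \sum_i \exp(\beta x_i) \right)}{\beta^2} \\
&= \frac{1}{\beta} \left[ f(\beta) - g(\beta) \right].
\end{aligned}$$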

Therefore, the derivative of is smaller than or equal to , but the derivative of is larger than or equal to . ∎
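The derivative identity used in this proof can be sanity-checked with a central finite difference. In the sketch below, `f` and `g` follow the definitions at the start of this appendix; the sample vector `x`, the value of `beta`, and the step size `h` are arbitrary choices made for illustration.

```python
import numpy as np

def f(beta, x):
    """Boltzmann-weighted average of x."""
    w = np.exp(beta * (x - x.max()))     # shifting by a constant leaves the ratio unchanged
    return (w * x).sum() / w.sum()

def g(beta, x):
    """Mellowmax of x: (1 / beta) * log(mean(exp(beta * x)))."""
    z = beta * x
    return (z.max() + np.log(np.exp(z - z.max()).mean())) / beta

x = np.array([0.3, -1.2, 2.0, 0.7])      # arbitrary sample values
beta, h = 1.5, 1e-6

numeric = (g(beta + h, x) - g(beta - h, x)) / (2 * h)   # central difference
analytic = (f(beta, x) - g(beta, x)) / beta             # closed form from the proof
print(numeric, analytic)
```

Both numbers agree to several digits and are non-negative for this positive `beta`, consistent with $f(\beta) \ge g(\beta)$ for $\beta > 0$.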

## Appendix B Proof of Theorem 1.

For shorthand notation, we use . This value frequently appears as the unique fixed point of , i.e., . When either or , needs to be understood as , and needs to be read as . Hereafter, we use this notation.

We mainly assume . For , take the limit of appropriately (also see [Azar, Gómez, and Kappen2012]). For brevity, we define

$$A_k := \frac{1 - \alpha^k}{1 - \alpha}$$

for non-negative integer . Note that when , .

For later use in a proof of Theorem 2, we deal with a case where AGVI update is used in this section. Its update rule is the following: suppose , is obtained by applying to the update rule of AGVI

$$Q_{k+1} = T_\beta Q_k + \alpha (Q_k - m_\beta Q_k) + \varepsilon_k,$$

where , , and is approximation error at iteration .

The following series of functions turns out to be very useful: and

$$A_{k+1} q_{k+1} := A_k T^{\pi_k} q_k + \alpha^k \left( T^{\pi_k} q_0 + E_k \right),$$

where satisfies , and .

### B.1 Proof Sketch

Since the proof is lengthy, we provide a sketch of the proof.

Lemma 7 shows that is bounded by

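$$A_{k+1} \left\| Q_\theta - q_{k+1} \right\| \le \gamma A_k \left\| Q_\theta - q_k \right\| + \alpha^k \left( C_0 + \left\| E_k \right\| \right), \tag{16}$$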

where .

By using (16), we can show that when there is no approximation error, . To show this, we suppose that it does not hold, and deduce a contradiction. When there is no approximation error, (16) becomes

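$$\left\| Q_\theta - q_{k+1} \right\| \le \left( \frac{\gamma A_k}{A_{k+1}} + \frac{\alpha^k C_0}{A_{k+1} \left\| Q_\theta - q_k \right\|} \right) \left\| Q_\theta - q_k \right\|.$$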

Since converges to , converges to . On the other hand, converges to . Therefore,

$$\frac{\alpha^k C_0}{A_{k+1} \left\| Q_\theta - q_k \right\|} \to 0$$

unless or , both of which implies with a convergence rate equal to or faster than . Accordingly, it converges to , and there exists such that for ,

$$\left\| Q_\theta - q_{k+1} \right\| \le c \left\| Q_\theta - q_k \right\|,$$

where . It clearly follows that converges to , which contradicts the assumption that does not hold. Therefore, .

Furthermore, from this discussion, one can see that there exists such that

$$\left\| (k+t) Q_\theta - (k+t) q_{k+t} \right\| \le c^t (k+t) \left\| Q_\theta - q_k \right\|.$$

This shows that for any , there exists such that , .

Lemma 6 shows that can be expressed by

$$Q_k = A_k q_k + \alpha^k q_0 - \alpha m_\beta \left( A_{k-1} q_{k-1} + \alpha^{k-1} q_0 \right). \tag{17}$$

Note that for any functions ,

$$\left\| m_\beta f - m_\beta g \right\| \le \left\| f - g \right\|$$

holds [Asadi and Littman2017]. Therefore, for any sequence of functions such that , . Accordingly,

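$$\begin{aligned}
\lim_{k \to \infty} m_\beta \left( A_{k-1} q_{k-1} + \alpha^{k-1} q_0 \right) &= m_\beta \left( \frac{Q_\theta}{1 - \alpha} \right) \\
&= \frac{m_\theta Q_\theta}{1 - \alpha},
\end{aligned}$$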

and .

### B.2 Proofs

###### Lemma 6.

$$Q_k = A_k q_k + \alpha^k q_0 - \alpha \pi_{k-1} \left( A_{k-1} q_{k-1} + \alpha^{k-1} q_0 \right). \tag{18}$$

###### Proof.

We prove the claim by induction. For ,

 Q1 =TβQ0+α(Q0−mβQ0)+ε0 =Tπ0q0+ε0+α(q0−π0q0) =q1+α(q0−π0q0) =A1q1+