# Neural Temporal-Difference Learning Converges to Global Optima

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.


## 1 Introduction

Given a policy, temporal-difference learning (TD) (Sutton, 1988) aims to learn the corresponding (action-)value function by following the semigradients of the mean-squared Bellman error in an online manner. As the most widely used policy evaluation algorithm, TD serves as the "critic" component of many reinforcement learning algorithms, such as the actor-critic algorithm (Konda and Tsitsiklis, 2000) and trust-region policy optimization (Schulman et al., 2015). In particular, in deep reinforcement learning, TD is often applied to learn value functions parametrized by neural networks (Lillicrap et al., 2015; Mnih et al., 2016; Haarnoja et al., 2018), which gives rise to neural TD. As policy improvement relies crucially on policy evaluation, the optimization efficiency and statistical accuracy of neural TD are critical to the performance of deep reinforcement learning. Towards theoretically understanding deep reinforcement learning, the goal of this paper is to characterize the convergence of neural TD.

Despite the broad applications of neural TD, its convergence remains rarely understood. Even with linear value function approximation, the nonasymptotic convergence of TD remained open until recently (Bhandari et al., 2018; Lakshminarayanan and Szepesvari, 2018; Dalal et al., 2018; Srikant and Ying, 2019), although its asymptotic convergence is well understood (Jaakkola et al., 1994; Tsitsiklis and Van Roy, 1997; Borkar and Meyn, 2000; Kushner and Yin, 2003; Borkar, 2009). Meanwhile, with nonlinear value function approximation, TD is known to diverge in general (Baird, 1995; Boyan and Moore, 1995; Tsitsiklis and Van Roy, 1997). To remedy this issue, Bhatnagar et al. (2009) propose nonlinear (gradient) TD, which uses the tangent vectors of nonlinear value functions in place of the feature vectors in linear TD. Unlike linear TD, which converges to the global optimum of the mean-squared projected Bellman error (MSPBE), nonlinear TD is only guaranteed to converge to a local optimum asymptotically. As a result, the statistical accuracy of the value function learned by nonlinear TD remains unclear. In contrast to such conservative theory, neural TD, which straightforwardly combines TD with neural networks without the explicit local linearization in nonlinear TD, often learns a desired value function that generalizes well to unseen states in practice (Duan et al., 2016; Amiranashvili et al., 2018; Henderson et al., 2018). Hence, a gap separates theory from practice.

There exist three obstacles towards closing such a theory-practice gap: (i) MSPBE has an expectation over the transition dynamics within the squared loss, which forbids the construction of unbiased stochastic gradients (Sutton and Barto, 2018). As a result, even with linear value function approximation, TD largely eludes the classical optimization framework, as it follows biased stochastic semigradients. (ii) When the value function is parametrized by a neural network, MSPBE is nonconvex in the weights of the neural network, which may introduce undesired stationary points such as local optima and saddle points (Jain and Kar, 2017). As a result, even an ideal algorithm that follows the population gradients of MSPBE may get trapped. (iii) Due to the interplay between the bias in stochastic semigradients and the nonlinearity in value function approximation, neural TD may even diverge (Baird, 1995; Boyan and Moore, 1995; Tsitsiklis and Van Roy, 1997), instead of converging to an undesired stationary point, as it lacks the explicit local linearization in nonlinear TD (Bhatnagar et al., 2009). Such divergence is also not captured by the classical optimization framework.

Contribution. Towards bridging theory and practice, we establish the first nonasymptotic global rate of convergence of neural TD. In detail, we prove that randomly initialized neural TD converges to the global optimum of the MSPBE at the rate of O(1/T) with population semigradients and at the rate of O(1/√T) with stochastic semigradients. Here T is the number of iterations, and the (action-)value function is parametrized by a sufficiently wide two-layer neural network. Moreover, we prove that the projection in the MSPBE allows for a sufficiently rich class of functions, which has the same representation power as a reproducing kernel Hilbert space associated with the random initialization. As a result, for a broad class of reinforcement learning problems, neural TD attains zero MSPBE. Beyond policy evaluation, we further establish the global convergence of neural (soft) Q-learning, which allows for policy improvement. In particular, we prove that, under stronger regularity conditions, neural (soft) Q-learning converges at the same rate as neural TD to the global optimum of the MSPBE for policy optimization. Also, by exploiting the connection between (soft) Q-learning and policy gradient algorithms (Schulman et al., 2017; Haarnoja et al., 2018), we establish the global convergence of a variant of the policy gradient algorithm (Williams, 1992; Szepesvári, 2010; Sutton and Barto, 2018).

At the core of our analysis is the overparametrization of the two-layer neural network for value function approximation (Zhang et al., 2016; Neyshabur et al., 2018; Allen-Zhu et al., 2018; Arora et al., 2019), which enables us to circumvent the three obstacles above. In particular, overparametrization leads to an implicit local linearization that varies smoothly along the solution path, which mirrors the explicit one in nonlinear TD (Bhatnagar et al., 2009). Such an implicit local linearization enables us to circumvent the third obstacle of possible divergence. Moreover, overparametrization allows us to establish a notion of one-point monotonicity (Harker and Pang, 1990; Facchinei and Pang, 2007) for the semigradients followed by neural TD, which ensures its evolution towards the global optimum of MSPBE along the solution path. Such a notion of monotonicity enables us to circumvent the first and second obstacles of bias and nonconvexity. Broadly speaking, our theory backs the empirical success of overparametrized neural networks in deep reinforcement learning. In particular, we show that instead of being a curse, overparametrization is indeed a blessing for minimizing MSPBE in the presence of bias, nonconvexity, and even divergence.

More Related Work. There is a large body of literature on the convergence of linear TD under both asymptotic (Jaakkola et al., 1994; Tsitsiklis and Van Roy, 1997; Borkar and Meyn, 2000; Kushner and Yin, 2003; Borkar, 2009) and nonasymptotic (Bhandari et al., 2018; Lakshminarayanan and Szepesvari, 2018; Dalal et al., 2018; Srikant and Ying, 2019) regimes. See Dann et al. (2014) for a detailed survey. In particular, our analysis is based on the recent breakthrough in the nonasymptotic analysis of linear TD (Bhandari et al., 2018) and its extension to linear Q-learning (Zou et al., 2019). An essential step of our analysis is bridging the evolution of linear TD and neural TD through the implicit local linearization induced by overparametrization.

To incorporate nonlinear value function approximation into TD, Bhatnagar et al. (2009) propose the first convergent nonlinear TD based on explicit local linearization, which however only converges to a local optimum of MSPBE. See Geist and Pietquin (2013); Bertsekas (2019) for a detailed survey. In contrast, we prove that, with the implicit local linearization induced by overparametrization, neural TD, which is simpler to implement and more widely used in deep reinforcement learning than nonlinear TD, provably converges to the global optimum of MSPBE.

There exist various extensions of TD, including least-squares TD (Bradtke and Barto, 1996; Boyan, 1999; Lazaric et al., 2010; Ghavamzadeh et al., 2010; Tu and Recht, 2017) and gradient TD (Sutton et al., 2009a, b; Bhatnagar et al., 2009; Liu et al., 2015; Du et al., 2017; Wang et al., 2017; Touati et al., 2017). In detail, least-squares TD is based on batch update, which loses the computational and statistical efficiency of the online update in TD. Meanwhile, gradient TD follows unbiased stochastic gradients, but at the cost of introducing another optimization variable. Such a reformulation leads to bilevel optimization, which is less stable in practice when combined with neural networks (Pfau and Vinyals, 2016). As a result, both extensions of TD are less widely used in deep reinforcement learning (Duan et al., 2016; Amiranashvili et al., 2018; Henderson et al., 2018). Moreover, when using neural networks for value function approximation, the convergence to the global optimum of MSPBE remains unclear for both extensions of TD.

Our work is also related to the recent breakthrough in understanding overparametrized neural networks, especially their generalization error (Zhang et al., 2016; Neyshabur et al., 2018; Allen-Zhu et al., 2018; Arora et al., 2019). See Fan et al. (2019) for a detailed survey. In particular, Daniely (2017); Allen-Zhu et al. (2018); Arora et al. (2019); Chizat and Bach (2018); Jacot et al. (2018); Lee et al. (2019) characterize the implicit local linearization in the context of supervised learning, where we train an overparametrized neural network by following the stochastic gradients of the mean-squared error. In contrast, neural TD does not follow the stochastic gradients of any objective function, hence leading to possible divergence, which makes the convergence analysis more challenging.

## 2 Background

In Section 2.1, we briefly review policy evaluation in reinforcement learning. In Section 2.2, we introduce the corresponding optimization formulations.

### 2.1 Policy Evaluation

We consider a Markov decision process (\cS, \cA, P, r, γ), in which an agent interacts with the environment to learn the optimal policy that maximizes the expected total reward. At the t-th time step, the agent has a state st ∈ \cS and takes an action at ∈ \cA. Upon taking the action, the agent enters the next state st+1 ∼ P(⋅|st, at) according to the transition probability P and receives a random reward r(st, at) from the environment. The action that the agent takes at each state is decided by a policy π, where π(⋅|s) belongs to the set of all probability distributions over \cA. The performance of policy π is measured by the expected total reward, J(π) = \EE[∑∞t=0 γt⋅r(st, at)], where γ ∈ (0, 1) is the discount factor.

Given policy π, policy evaluation aims to learn the following two functions, the value function Vπ(s) = \EE[∑∞t=0 γt⋅r(st, at) | s0 = s] and the action-value function (Q-function) Qπ(s, a) = \EE[∑∞t=0 γt⋅r(st, at) | s0 = s, a0 = a]. Both functions form the basis for policy improvement. Without loss of generality, we focus on learning the Q-function in this paper. We define the Bellman evaluation operator,

 TπQ(s,a)=\EE[r(s,a)+γQ(s′,a′)|s′∼P(⋅|s,a),a′∼π(s′)], (1)

for which Qπ is the fixed point, that is, the solution to the Bellman equation Q = TπQ.
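To make the operator concrete, the following sketch applies Tπ of (1) to a tabular Q-function on a finite MDP. This is purely an illustration in our own notation (the paper's state space is continuous); the array layout and variable names are assumptions of this sketch:

```python
import numpy as np

def bellman_evaluation(Q, P, R, pi, gamma):
    """Apply the Bellman evaluation operator T^pi of (1) to a Q-table.

    Q:     (S, A) action-values
    P:     (S, A, S) transition probabilities P(s' | s, a)
    R:     (S, A) expected rewards r(s, a)
    pi:    (S, A) policy, pi[s, a] = probability of action a in state s
    gamma: discount factor in (0, 1)
    """
    # E_{a' ~ pi(.|s')}[Q(s', a')] for every next state s'
    V = (pi * Q).sum(axis=1)          # shape (S,)
    # (T^pi Q)(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V(s')]
    return R + gamma * (P @ V)        # shape (S, A)
```

Iterating this map from any initial Q-table converges to the fixed point Qπ, since Tπ is a γ-contraction.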

### 2.2 Optimization Formulation

Corresponding to (1), we aim to learn Qπ by minimizing the mean-squared Bellman error (MSBE),

 MSBE(θ)=\EEμ[(^Qθ(s,a)−Tπ^Qθ(s,a))2], (2)

where the Q-function is parametrized as ^Qθ with parameter θ. Here μ is the stationary distribution of (s, a) corresponding to policy π. Due to Q-function approximation, we focus on minimizing the following surrogate of the MSBE, namely the mean-squared projected Bellman error (MSPBE),

 MSPBE(θ)=\EEμ[(^Qθ(s,a)−Π\cFTπ^Qθ(s,a))2]. (3)

Here Π\cF is the projection onto a function class \cF. For example, for linear Q-function approximation (Sutton, 1988), \cF takes the form {ϕ(s,a)⊤θ : θ∈Θ}, where ^Qθ is linear in θ and Θ is the set of feasible parameters. As another example, for nonlinear Q-function approximation (Bhatnagar et al., 2009), \cF consists of the local linearization of ^Qθ at the current iterate θ.
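For the linear case, the MSPBE in (3) can be evaluated in closed form, since the μ-weighted projection onto a linear function class is a weighted least-squares fit. The following sketch does so on the tabular chain induced by a fixed policy; the matrix layout and names are our own, purely for illustration:

```python
import numpy as np

def mspbe_linear(theta, Phi, P_pi, R, mu, gamma):
    """MSPBE (3) for a linear Q-hat = Phi @ theta on a tabular chain.

    Phi:   (n, k) feature matrix over state-action pairs
    P_pi:  (n, n) transition matrix of the chain induced by the policy
    R:     (n,) expected rewards
    mu:    (n,) stationary distribution weights
    The projection Pi_F is the mu-weighted least-squares fit onto span(Phi).
    """
    D = np.diag(mu)
    q = Phi @ theta
    tq = R + gamma * (P_pi @ q)                        # Bellman backup T^pi Q
    A = Phi.T @ D @ Phi
    proj = Phi @ np.linalg.solve(A, Phi.T @ D @ tq)    # Pi_F(T^pi Q)
    diff = proj - q
    return float(diff @ D @ diff)                      # ||Pi_F T^pi Q - Q||_mu^2
```

With full-rank features (projection equal to the identity), the MSPBE vanishes exactly at the Bellman fixed point, matching the discussion around (2) and (3).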

Throughout this paper, we assume that we are able to sample tuples in the form of (s, a, r, s′, a′) from the stationary distribution of policy π in an independent and identically distributed manner, although our analysis can be extended to handle temporal dependence using the proof techniques of Bhandari et al. (2018). With a slight abuse of notation, we use μ to denote the stationary distribution of (s, a) corresponding to policy π and any of its marginal distributions.

## 3 Neural Temporal-Difference Learning

TD updates the parameter of the Q-function by taking the stochastic semigradient descent step (Sutton, 1988; Szepesvári, 2010; Sutton and Barto, 2018),

 θ′←θ−η⋅(^Qθ(s,a)−r(s,a)−γ^Qθ(s′,a′))⋅∇θ^Qθ(s,a), (4)

which corresponds to the MSBE in (2). Here (s, a, r, s′, a′) ∼ μ and η > 0 is the stepsize. In a more general context, (4) is referred to as TD(0). In this paper, we focus on TD(0), which is abbreviated as TD, and leave the extension to TD(λ) to future work.
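A minimal sketch of the TD(0) step (4) for a linear parametrization ^Qθ(s, a) = ϕ(s, a)⊤θ, a special case chosen for brevity (the function names are our own):

```python
import numpy as np

def td0_step(theta, phi, s, a, r, s_next, a_next, gamma, eta):
    """One TD(0) semigradient step (4) for linear Q-hat(s,a) = phi(s,a)^T theta.

    For a neural parametrization, phi(s, a) below would be replaced by the
    gradient of Q-hat in theta at the current (s, a).
    """
    delta = phi(s, a) @ theta - r - gamma * (phi(s_next, a_next) @ theta)
    return theta - eta * delta * phi(s, a)
```

Note the update treats the target r + γ^Qθ(s′, a′) as a constant, which is exactly why it is a semigradient rather than the gradient of any fixed objective.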

In the sequel, we denote the state-action pair (s, a) by a vector x = (s, a) ∈ \cX = \cS × \cA with dimension d. We consider \cS to be continuous and \cA to be finite. Without loss of generality, we assume that ∥x∥2 ≤ 1 and the reward r(s, a) is upper bounded by a constant ¯r > 0 for any (s, a). We use a two-layer neural network

 ^Q(x;W)=1√mm∑r=1brσ(W⊤rx) (5)

to parametrize the Q-function. Here σ(u) = max{u, 0} is the rectified linear unit (ReLU) activation, and the parameters br and Wr are initialized as br ∼ Unif({−1, 1}) and Wr(0) ∼ N(0, Id/d) for any r ∈ [m] independently. During training, we only update W = (W1, …, Wm), while keeping b = (b1, …, bm) fixed as the random initialization. To ensure global convergence, we incorporate an additional projection step with respect to the set SB of feasible W, a ball of radius B centered at the initialization W(0). See Algorithm 1 for a detailed description.
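The parametrization and the projection step can be sketched as follows. The initialization distributions and the shape of the feasible set shown here are common choices consistent with this setup, but the exact scaling constants are assumptions of this sketch:

```python
import numpy as np

def init_network(m, d, rng):
    """Random initialization for the two-layer network (5).

    b_r ~ Unif{-1, +1} is fixed during training; W_r(0) is Gaussian.
    (The exact scaling of W_r(0) is an assumption of this sketch.)
    """
    b = rng.choice([-1.0, 1.0], size=m)
    W0 = rng.normal(size=(m, d)) / np.sqrt(d)
    return b, W0

def q_hat(x, b, W):
    """Two-layer ReLU network (5): (1/sqrt(m)) sum_r b_r * max(W_r^T x, 0)."""
    m = len(b)
    return float((b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m))

def project_ball(W, W0, B):
    """Projection onto a ball of radius B around the initialization W0."""
    diff = W - W0
    n = np.linalg.norm(diff)
    if n <= B:
        return W
    return W0 + diff * (B / n)
```

The projection keeps every iterate within distance B of W(0), which is what makes the implicit local linearization discussed later remain accurate along the solution path.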

To understand the intuition behind the global convergence of neural TD, note that for the TD update in (4), we have from (1) that

 \EE(s,a,r,s′,a′)∼μ[(^Qθ(s,a)−r(s,a)−γ^Qθ(s′,a′))⋅∇θ^Qθ(s,a)]
 =\EE(s,a)∼μ[(^Qθ(s,a)−\EE[r(s,a)+γ^Qθ(s′,a′)|s′∼P(⋅|s,a),a′∼π(s′)])⋅∇θ^Qθ(s,a)]
 =\EE(s,a)∼μ[(^Qθ(s,a)−Tπ^Qθ(s,a))(i)⋅∇θ^Qθ(s,a)(ii)]. (6)

Here (i) is the Bellman residual at ^Qθ, while (ii) is the gradient of the first term in (i). Although the TD update in (4) resembles the stochastic gradient descent step for minimizing a mean-squared error, it is not an unbiased stochastic gradient of any objective function. However, we show that the TD update yields a descent direction towards the global optimum of the MSPBE in (3). Moreover, as the neural network becomes wider, the function class \cF that (3) projects onto becomes richer. Correspondingly, the MSPBE reduces to the MSBE in (2) as the projection Π\cF becomes closer to the identity, which implies the recovery of the desired Q-function Qπ such that ^Qθ = Qπ. See Section 4 for a more rigorous characterization.

## 4 Main Results

In Section 4.1, we characterize the global optimality of the stationary point attained by Algorithm 1 in terms of minimizing the MSPBE in (3) and its other properties. In Section 4.2, we establish the nonasymptotic global rates of convergence of neural TD to the global optimum of the MSPBE when following the population semigradients in (3) and the stochastic semigradients in (4), respectively.

We use the subscript μ, as in \EEμ, to denote the expectation over the randomness of the tuple (s, a, r, s′, a′) (or its concise form (x, r, x′)) conditional on all other randomness, e.g., the random initialization and the random current iterate. Meanwhile, we use the subscript init, as in \EEinit,μ, when we are taking the expectation over all randomness, including the random initialization.

### 4.1 Properties of Stationary Point

We consider the population version of the TD update in Line 6 of Algorithm 1,

 W←ΠSB(W−η⋅\EEμ[δ(x,r,x′;W)⋅∇W^Q(x;W)]), (7)

where μ is the stationary distribution and δ(x, r, x′; W) = ^Q(x; W) − r − γ^Q(x′; W) is the Bellman residual at W. The stationary point W† of (7) satisfies the following stationarity condition,

 \EEμ[δ(x,r,x′;W†)⋅∇W^Q(x;W†)]⊤(W−W†)≥0,  for any W∈SB. (8)

Also, note that

 ^Q(x;W)=1√mm∑r=1brσ(W⊤rx)=1√mm∑r=1br\ind{W⊤rx>0}W⊤rx

where σ(u) = \ind{u > 0}⋅u almost everywhere in u. Meanwhile, recall that SB is the set of feasible W. We define the function class

 \cF†B,m={1√mm∑r=1br\ind{(W†r)⊤x>0}W⊤rx:W∈SB}, (9)

which consists of the local linearization of ^Q(⋅; W) at W†. Then (8) takes the following equivalent form,

 ⟨^Q(⋅;W†)−Tπ^Q(⋅;W†),f−^Q(⋅;W†)⟩μ≥0,  for any f∈\cF†B,m, (10)

which implies ^Q(⋅;W†)=Π\cF†B,mTπ^Q(⋅;W†) by the definition of the projection induced by ⟨⋅,⋅⟩μ. By (3), ^Q(⋅;W†) is the global optimum of the MSPBE that corresponds to the projection onto \cF†B,m.

Intuitively, when using an overparametrized neural network with width m → ∞, the average variation in each Wr diminishes to zero. Hence, roughly speaking, we have \ind{W⊤rx>0} = \ind{Wr(0)⊤x>0} with high probability for any r ∈ [m]. As a result, the function class \cF†B,m defined in (9) approximates

 \cFB,m={1√mm∑r=1br\ind{Wr(0)⊤x>0}W⊤rx:W∈SB}. (11)

In the sequel, we show that, to characterize the global convergence of Algorithm 1 with a sufficiently large m, it suffices to consider \cFB,m in place of \cF†B,m, which simplifies the analysis, since the distribution of W(0) is given. To this end, we define the approximate stationary point with respect to the function class \cFB,m defined in (11). [Approximate Stationary Point W∗] If W∗ ∈ SB satisfies

 \EEμ[δ0(x,r,x′;W∗)⋅∇W^Q0(x;W∗)]⊤(W−W∗)≥0,  for any W∈SB, (12)

where we define

 ^Q0(x;W)=1√mm∑r=1br\ind{Wr(0)⊤x>0}W⊤rx, (13)
 δ0(x,r,x′;W)=^Q0(x;W)−r−γ^Q0(x′;W), (14)

then we say that W∗ is an approximate stationary point of the population update in (7). Here ^Q0 depends on the random initialization b and W(0). The next lemma proves that such an approximate stationary point uniquely exists, since ^Q0(⋅; W∗) is the fixed point of the operator Π\cFB,mTπ, which is a contraction in the ∥⋅∥μ-norm associated with the stationary distribution μ. [Existence, Uniqueness, and Optimality of W∗] There exists a unique approximate stationary point W∗ for any B > 0 and m ∈ \NN. Also, ^Q0(⋅; W∗) is the global optimum of the MSPBE that corresponds to the projection onto \cFB,m in (11).

###### Proof.

See Appendix B.1 for a detailed proof. ∎

### 4.2 Global Convergence

In this section, we establish the main results on the global convergence of neural TD in Algorithm 1. We first lay out the following regularity condition on the stationary distribution μ. [Regularity of Stationary Distribution μ] There exists a constant c0 > 0 such that for any τ ≥ 0 and w ∈ \RRd with w ≠ 0, it holds almost surely that

 \EEμ[\ind{|w⊤x|≤τ}∣∣w]≤c0⋅τ/∥w∥2. (15)

Assumption 4.2 regularizes the density of μ in terms of the marginal distribution of x. In particular, it is straightforwardly implied when the density of μ in terms of the state s is upper bounded.

Population Update: The next theorem establishes the nonasymptotic global rate of convergence of neural TD when it follows population semigradients. Recall that the approximate stationary point W∗ and ^Q0 are defined in Definition 4.1. Also, B is the radius of the set SB of feasible W, which is defined in Algorithm 1, T is the number of iterations, γ is the discount factor, and m is the width of the neural network in (5). [Convergence of Population Update] We set the stepsize η in Algorithm 1 and replace the TD update in Line 6 by the population update in (7). Under Assumption 4.2, the output ^Qout of Algorithm 1 satisfies

 \EEinit,μ[(^Qout(x)−^Q0(x;W∗))2]≤16B2(1−γ)2T+O(B3m−1/2+B5/2m−1/4),

where the expectation is taken with respect to all randomness, including the random initialization and the stationary distribution μ.

###### Proof.

The key to the proof of Theorem 4.2 is the one-point monotonicity of the population semigradient ¯¯¯g(t), which is established through the local linearization of ^Q(⋅; W(t)). See Appendix C.5 for a detailed proof. ∎

Stochastic Update: To further prove the global convergence of neural TD when it follows stochastic semigradients, we first establish an upper bound of their variance, which affects the choice of the stepsize η. For notational simplicity, we define the stochastic and population semigradients as

 g(t)=δ(x,r,x′;W(t))⋅∇W^Q(x;W(t)),  ¯¯¯g(t)=\EEμ[g(t)]. (16)

[Variance Bound] There exists σ2g > 0 such that the variance of the stochastic semigradient g(t) is upper bounded as \EE[∥g(t)−¯¯¯g(t)∥22] ≤ σ2g for any t ∈ [T].

###### Proof.

See Appendix B.2 for a detailed proof. ∎

Based on Theorem 4.2 and Lemma 4.2, we establish the global convergence of neural TD in Algorithm 1. [Convergence of Stochastic Update] We set the stepsize η to be of order T−1/2 in Algorithm 1. Under Assumption 4.2, the output ^Qout of Algorithm 1 satisfies

 \EEinit,μ[(^Qout(x)−^Q0(x;W∗))2] ≤16(B2+σ2g)(1−γ)2√T+O(B3m−1/2+B5/2m−1/4).
###### Proof.

See Appendix C.6 for a detailed proof. ∎

As the width of the neural network m → ∞, Lemma 4.1 implies that ^Q0(⋅; W∗) is the global optimum of the MSPBE in (3) with a richer function class to project onto. In fact, the function class \cFB,m is a subset of an RKHS with RKHS-norm upper bounded by B. Here m is the width of the neural network in (5). See Appendix A.2 for a more detailed discussion on the representation power of \cFB,m. Therefore, if the desired Q-function Qπ falls into \cFB,m, it is the global optimum of the MSPBE. By Lemma 4.1 and Theorem 4.2, we approximately obtain Qπ through ^Qout.

More generally, the following proposition quantifies the distance between ^Qout and Qπ in the case that Qπ does not fall into the function class \cFB,m. In particular, it states that the ∥⋅∥μ-norm distance between ^Q0(⋅; W∗) and Qπ is upper bounded by the distance between Π\cFB,mQπ and Qπ. [Convergence of Stochastic Update to Qπ] It holds that ∥^Q0(⋅;W∗)−Qπ∥μ ≤ (1−γ)−1⋅∥Π\cFB,mQπ−Qπ∥μ, which by Theorem 4.2 implies

 \EEinit,μ[(^Qout(x)−Qπ(x))2] ≤32(B2+σ2g)(1−γ)2√T+2\EEinit,μ[(Π\cFB,mQπ(x)−Qπ(x))2](1−γ)2 +O(B3m−1/2+B5/2m−1/4).
###### Proof.

See Appendix B.3 for a detailed proof. ∎

Proposition 4.2 implies that if Π\cFB,mQπ = Qπ, then \EEinit,μ[(^Qout(x)−Qπ(x))2] → 0 as T, m → ∞. In other words, neural TD converges to the global optimum of the MSPBE in (3), or equivalently, the MSBE in (2), both of which have objective value zero.

## 5 Proof Sketch

In the sequel, we sketch the proofs of Theorems 4.2 and 4.2 in Section 4.

### 5.1 Implicit Local Linearization via Overparametrization

Recall that ^Q0(x; W), as defined in (13), takes the form

 ^Q0(x;W)=Φ(x)⊤W, where Φ(x)=1√m⋅(b1\ind{W1(0)⊤x>0}x,…,bm\ind{Wm(0)⊤x>0}x)∈\RRmd,

which is linear in the feature map Φ(x). In other words, with respect to Φ(x), ^Q0(x; W) locally linearizes the neural network ^Q(x; W) defined in (5) at the random initialization W(0). The following lemma characterizes the difference between ^Q(x; W(t)), which is along the solution path of neural TD in Algorithm 1, and its local linearization ^Q0(x; W(t)). In particular, we show that the error of such a local linearization diminishes to zero as m → ∞. For notational simplicity, we use ^Qt(x) to denote ^Q(x; W(t)) in the sequel. Note that by (13) we have ∇W^Q0(x; W) = Φ(x). Recall that B is the radius of the set SB of feasible W in (11).
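A sketch of the linearized Q-function and its feature map (function names are ours). The point of the construction is that the activation pattern is frozen at the initialization, so ^Q0 is exactly linear in W:

```python
import numpy as np

def phi(x, b, W0):
    """Feature map Phi(x): the r-th block is (1/sqrt(m)) * b_r * 1{W0_r^T x > 0} * x.
    The ReLU activation pattern is frozen at the initialization W0."""
    m = W0.shape[0]
    mask = (W0 @ x > 0).astype(float)
    return (b[:, None] * mask[:, None] * x[None, :]) / np.sqrt(m)

def q0(x, W, b, W0):
    """Linearized Q-function (13): Q0(x; W) = <Phi(x), W>, linear in W."""
    return float((phi(x, b, W0) * W).sum())
```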

[Local Linearization of Q-Function] There exists a constant c1 > 0 such that for any t ∈ [T], it holds that

 \EEinit,μ[∣∣^Qt(x)−^Q0(x;W(t))∣∣2]≤4c1B3⋅m−1/2.
###### Proof.

See Appendix C.1 for a detailed proof. ∎

As a direct consequence of Lemma 5.1, the next lemma characterizes the effect of local linearization on population semigradients. Recall that ¯¯¯g(t) is defined in (16). We denote by ¯¯¯g0(t) the locally linearized population semigradient, which is defined by replacing ^Q in ¯¯¯g(t) with its local linearization ^Q0. In other words, by (16), (13), and (14), we have

 ¯¯¯g(t)=\EEμ[δ(x,r,x′;W(t))⋅∇W^Q(x;W(t))], (17)
 ¯¯¯g0(t)=\EEμ[δ0(x,r,x′;W(t))⋅∇W^Q0(x;W(t))]. (18)

[Local Linearization of Semigradient] Let ¯¯¯r be the upper bound of the reward r(s, a) for any (s, a). There exists a constant c2 > 0 such that for any t ∈ [T], it holds that

 \EEinit[∥¯¯¯g(t)−¯¯¯g0(t)∥22]≤(56c1B3+24c2B+6c1B¯¯¯r2)⋅m−1/2.
###### Proof.

See Appendix C.2 for a detailed proof. ∎

Lemmas 5.1 and 5.1 show that the error of local linearization diminishes as the degree of overparametrization m increases, uniformly along the solution path. As a result, we do not require the explicit local linearization in nonlinear TD (Bhatnagar et al., 2009). Instead, we show that such an implicit local linearization suffices to ensure the global convergence of neural TD.

### 5.2 Proofs for Population Update

The characterization of the locally linearized Q-function in Lemma 5.1 and the locally linearized population semigradients in Lemma 5.1 allows us to establish the following descent lemma, which extends Lemma 3 of Bhandari et al. (2018) for characterizing linear TD.

[Population Descent Lemma] For any t ∈ [T] in Algorithm 1 with the TD update in Line 6 replaced by the population update in (7), it holds that

 ∥W(t+1)−W∗∥22 ≤∥W(t)−W∗∥22−(2η(1−γ)−8η2)⋅\EEμ[(^Q0(x;W(t))−^Q0(x;W∗))2] +2η2⋅∥¯¯¯g(t)−¯¯¯g0(t)∥22+2ηB⋅∥¯¯¯g(t)−¯¯¯g0(t)∥2Error of Local Linearization.
###### Proof.

See Appendix C.3 for a detailed proof. ∎

Lemma 5.2 shows that, with a sufficiently small stepsize η, ∥W(t)−W∗∥22 decays at each iteration up to the error of local linearization, which is characterized by Lemma 5.1. By combining Lemmas 5.1 and 5.2 and further plugging them into a telescoping sum, we establish the convergence of ^Qt to the global optimum of the MSPBE. See Appendix C.5 for a detailed proof.

### 5.3 Proofs for Stochastic Update

Recall that the stochastic semigradient g(t) is defined in (16). In parallel with Lemma 5.2, the following lemma additionally characterizes the effect of the variance of g(t), which is induced by the randomness of the current tuple (x, r, x′). We use the subscript W, as in \EEW, to denote the expectation over the randomness of the current iterate W(t) conditional on the random initialization. Correspondingly, \EEW,μ is over the randomness of both the current tuple and the current iterate conditional on the random initialization.

[Stochastic Descent Lemma] For any t ∈ [T] in Algorithm 1, it holds that

 \EEW,μ[∥W(t+1)−W∗∥22] ≤\EEW[∥W(t)−W∗∥22]−(2η(1−γ)−8η2)⋅\EEW,μ[(^Q0(x;W(t))−^Q0(x;W∗))2] +\EEW[2η2⋅∥¯¯¯g(t)−¯¯¯g0(t)∥22+2ηB⋅∥¯¯¯g(t)−¯¯¯g0(t)∥2]Error of Local Linearization+\EEW,μ[η2⋅∥g(t)−¯¯¯g(t)∥22]Variance of Semigradient.
###### Proof.

See Appendix C.4 for a detailed proof. ∎

To ensure the global convergence of neural TD in the presence of the variance of g(t), we rescale the stepsize η to be of order T−1/2. The rest of the proof of Theorem 4.2 mirrors that of Theorem 4.2. See Appendix C.6 for a detailed proof.

## 6 Extension to Policy Optimization

With the Q-function learned by TD, policy iteration may be applied to learn the optimal policy. Alternatively, Q-learning more directly learns the optimal policy and its Q-function using temporal-difference update. Compared with TD, Q-learning aims to solve the projected Bellman optimality equation

 Q=Π\cFTQ,  with  TQ(s,a)=\EE[r(s,a)+γmaxa′∈\cAQ(s′,a′)∣∣s′∼P(⋅|s,a)], (19)

which replaces the Bellman evaluation operator Tπ in (3) with the Bellman optimality operator T. When Π\cF is the identity, the fixed-point solution to (19) is the Q-function Q∗ of the optimal policy π∗, which maximizes the expected total reward (Szepesvári, 2010; Sutton and Barto, 2018). Compared with TD, the max operator in T makes the analysis more challenging and hence requires stronger regularity conditions. In the following, we first introduce neural Q-learning and then establish its global convergence. Finally, we discuss the corresponding implication for policy gradient algorithms.

### 6.1 Neural Q-Learning

In parallel with (4), we update the parameter θ of the optimal Q-function by

 θ′←θ−η⋅(^Qθ(s,a)−r(s,a)−γmaxa′∈\cA^Qθ(s′,a′))⋅∇θ^Qθ(s,a), (20)

where the tuple (s, a, r, s′) is sampled from the stationary distribution μexp of an exploration policy πexp in an independent and identically distributed manner. We present the detailed neural Q-learning algorithm in Algorithm 2. Similar to Definition 4.1, we define the approximate stationary point W∗ of Algorithm 2 by

 \EEμexp[δ0(x,r,x′;W∗)⋅∇W^Q0(x;W∗)]⊤(W−W∗)≥0,  for any W∈SB, (21)

where the Bellman residual is now δ0(x, r, x′; W) = ^Q0(x; W) − r − γmaxa′∈\cA^Q0((s′, a′); W). Following the same analysis of neural TD in Lemma 4.1, we have that ^Q0(⋅; W∗) is the unique fixed-point solution to the projected Bellman optimality equation Q = Π\cFB,mTQ, where the function class \cFB,m is defined in (11).
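A generic sketch of the Q-learning semigradient step (20), written for any differentiable parametrization (the function signatures are our own, not from the paper):

```python
import numpy as np

def q_learning_step(theta, q, grad_q, s, a, r, s_next, actions, gamma, eta):
    """One Q-learning semigradient step (20) for a differentiable q(s, a, theta).

    grad_q(s, a, theta) returns the gradient of q in theta; the target takes
    the max over next-state actions, per the Bellman optimality operator.
    """
    target = r + gamma * max(q(s_next, ap, theta) for ap in actions)
    delta = q(s, a, theta) - target
    return theta - eta * delta * grad_q(s, a, theta)
```

As in TD, the target is treated as a constant when differentiating, so this is a semigradient step rather than a gradient step on any fixed objective.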

### 6.2 Global Convergence

To establish the global convergence of neural Q-learning, we lay out an extra regularity condition on the exploration policy πexp, which is not required by neural TD. Such a regularity condition ensures that x′ = (s′, a′) with the greedy action a′ in Line 4 of Algorithm 2 follows a similar distribution to that of x, which follows the stationary distribution μexp of the exploration policy πexp. Recall that ^Q0 is defined in (13) and γ is the discount factor. [Regularity of Exploration Policy πexp] There exists a constant ν > 0 such that for any W1, W2 ∈ SB, it holds that

 \EEx∼μexp[(^Q0(x;W1)−^Q0(x;W2))2]≥(γ+ν)2⋅\EEs∼μexp[(^Q♯0(s;W1)−^Q♯0(s;W2))2], (22)

where ^Q♯0(s; W) = maxa∈\cA^Q0((s, a); W). We remark that Melo et al. (2008) and Zou et al. (2019) establish the global convergence of linear Q-learning based on an assumption that implies (22). Although Assumption 6.2 is strong, we are not aware of any weaker regularity condition in the literature, even for linear Q-learning. As our focus is to go beyond linear Q-learning to analyze neural Q-learning, we do not attempt to weaken such a regularity condition in this paper.

The following regularity condition on μexp mirrors Assumption 4.2, but additionally accounts for the max operator in the Bellman optimality operator. [Regularity of Stationary Distribution μexp] There exists a constant c3 > 0 such that for any τ ≥ 0 and w ∈ \RRd with w ≠ 0, it holds almost surely that

 \EEs∼μexp[maxa∈\cA\ind{|w⊤(s,a)|≤τ}∣∣w]≤c3⋅τ/∥w∥2. (23)

In parallel with Theorem 4.2, the following theorem establishes the global convergence of neural Q-learning in Algorithm 2.

[Convergence of Stochastic Update] We set the stepsize η to be of order T−1/2 in Algorithm 2. Under Assumptions 6.2 and 6.2, the output ^Qout of Algorithm 2 satisfies

 \EEinit,μexp[(^Qout(x)−^Q0(x;W∗))2]=O(B2T−1/2+B3m−1/2+B5/2m−1/4).
###### Proof.

See Appendix D.1 for a detailed proof. ∎

Corresponding to Proposition 4.2, Theorem 6.2 also implies the convergence to the Q-function Q∗ of the optimal policy, which is omitted due to space limitations.

### 6.3 Implication for Policy Gradient

Theorem 6.2 can be further extended to handle neural soft Q-learning, where the max operator in the Bellman optimality operator is replaced by a more general softmax operator (Haarnoja et al., 2017; Neu et al., 2017). By exploiting the equivalence between soft Q-learning and policy gradient algorithms (Schulman et al., 2017; Haarnoja et al., 2018), we establish the global convergence of a variant of the policy gradient algorithm. Due to space limitations, we defer the discussion to Appendix E.
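For reference, the softmax (log-sum-exp) operator that replaces the hard max in soft Q-learning can be sketched as follows, with a temperature parameter τ (the temperature parametrization is a common convention, not notation from the paper); it recovers the hard max as τ → 0:

```python
import numpy as np

def soft_max(q, tau):
    """Softmax (log-sum-exp) operator: tau * log sum_a exp(q[a] / tau).

    Computed stably by shifting by the max; bounded between max(q)
    and max(q) + tau * log(len(q)), and tends to max(q) as tau -> 0.
    """
    m = q.max()
    return float(m + tau * np.log(np.exp((q - m) / tau).sum()))
```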

## 7 Conclusions

In this paper, we prove that neural TD converges at a sublinear rate to the global optimum of the MSPBE for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks. Moreover, we extend the convergence result to policy optimization, including (soft) Q-learning and policy gradient. Our results shed new light on the theoretical understanding of reinforcement learning with neural networks, which is widely employed in practice.