Given a policy, temporal-difference learning (TD) (Sutton, 1988) aims to learn the corresponding (action-)value function by following the semigradients of the mean-squared Bellman error in an online manner. As the most widely used policy evaluation algorithm, TD serves as the “critic” component of many reinforcement learning algorithms, such as the actor-critic algorithm (Konda and Tsitsiklis, 2000) and trust-region policy optimization (Schulman et al., 2015). In particular, in deep reinforcement learning, TD is often applied to learn value functions parametrized by neural networks (Lillicrap et al., 2015; Mnih et al., 2016; Haarnoja et al., 2018), which gives rise to neural TD. As policy improvement relies crucially on policy evaluation, the optimization efficiency and statistical accuracy of neural TD are critical to the performance of deep reinforcement learning. Towards theoretically understanding deep reinforcement learning, the goal of this paper is to characterize the convergence of neural TD.
Despite the broad applications of neural TD, its convergence remains poorly understood. Even with linear value function approximation, the nonasymptotic convergence of TD remained open until recently (Bhandari et al., 2018; Lakshminarayanan and Szepesvari, 2018; Dalal et al., 2018; Srikant and Ying, 2019), although its asymptotic convergence is well understood (Jaakkola et al., 1994; Tsitsiklis and Van Roy, 1997; Borkar and Meyn, 2000; Kushner and Yin, 2003; Borkar, 2009). Meanwhile, with nonlinear value function approximation, TD is known to diverge in general (Baird, 1995; Boyan and Moore, 1995; Tsitsiklis and Van Roy, 1997). To remedy this issue, Bhatnagar et al. (2009)
propose nonlinear (gradient) TD, which uses the tangent vectors of nonlinear value functions in place of the feature vectors in linear TD. Unlike linear TD, which converges to the global optimum of the mean-squared projected Bellman error (MSPBE), nonlinear TD is only guaranteed to converge to a local optimum asymptotically. As a result, the statistical accuracy of the value function learned by nonlinear TD remains unclear. In contrast to such conservative theory, neural TD, which straightforwardly combines TD with neural networks without the explicit local linearization in nonlinear TD, often learns a desired value function that generalizes well to unseen states in practice (Duan et al., 2016; Amiranashvili et al., 2018; Henderson et al., 2018). Hence, a gap separates theory from practice.
There exist three obstacles towards closing such a theory-practice gap: (i) MSPBE has an expectation over the transition dynamics within the squared loss, which forbids the construction of unbiased stochastic gradients (Sutton and Barto, 2018). As a result, even with linear value function approximation, TD largely eludes the classical optimization framework, as it follows biased stochastic semigradients. (ii) When the value function is parametrized by a neural network, MSPBE is nonconvex in the weights of the neural network, which may introduce undesired stationary points such as local optima and saddle points (Jain and Kar, 2017). As a result, even an ideal algorithm that follows the population gradients of MSPBE may get trapped. (iii) Due to the interplay between the bias in stochastic semigradients and the nonlinearity in value function approximation, neural TD may even diverge (Baird, 1995; Boyan and Moore, 1995; Tsitsiklis and Van Roy, 1997), instead of converging to an undesired stationary point, as it lacks the explicit local linearization in nonlinear TD (Bhatnagar et al., 2009). Such divergence is also not captured by the classical optimization framework.
Contribution. Towards bridging theory and practice, we establish the first nonasymptotic global rate of convergence of neural TD. In detail, we prove that randomly initialized neural TD converges to the global optimum of MSPBE at the rate of $O(1/T)$ with population semigradients and at the rate of $O(1/\sqrt{T})$ with stochastic semigradients. Here $T$ is the number of iterations and the (action-)value function is parametrized by a sufficiently wide two-layer neural network. Moreover, we prove that the projection in MSPBE allows for a sufficiently rich class of functions, which has the same representation power as a reproducing kernel Hilbert space associated with the random initialization. As a result, for a broad class of reinforcement learning problems, neural TD attains zero MSPBE. Beyond policy evaluation, we further establish the global convergence of neural (soft) Q-learning, which allows for policy improvement. In particular, we prove that, under stronger regularity conditions, neural (soft) Q-learning converges at the same rate as neural TD to the global optimum of MSPBE for policy optimization. Also, by exploiting the connection between (soft) Q-learning and policy gradient algorithms (Schulman et al., 2017; Haarnoja et al., 2018), we establish the global convergence of a variant of the policy gradient algorithm (Williams, 1992; Szepesvári, 2010; Sutton and Barto, 2018).
At the core of our analysis is the overparametrization of the two-layer neural network for value function approximation (Zhang et al., 2016; Neyshabur et al., 2018; Allen-Zhu et al., 2018; Arora et al., 2019), which enables us to circumvent the three obstacles above. In particular, overparametrization leads to an implicit local linearization that varies smoothly along the solution path, which mirrors the explicit one in nonlinear TD (Bhatnagar et al., 2009). Such an implicit local linearization enables us to circumvent the third obstacle of possible divergence. Moreover, overparametrization allows us to establish a notion of one-point monotonicity (Harker and Pang, 1990; Facchinei and Pang, 2007) for the semigradients followed by neural TD, which ensures its evolution towards the global optimum of MSPBE along the solution path. Such a notion of monotonicity enables us to circumvent the first and second obstacles of bias and nonconvexity. Broadly speaking, our theory backs the empirical success of overparametrized neural networks in deep reinforcement learning. In particular, we show that instead of being a curse, overparametrization is indeed a blessing for minimizing MSPBE in the presence of bias, nonconvexity, and even divergence.
More Related Work. There is a large body of literature on the convergence of linear TD under both asymptotic (Jaakkola et al., 1994; Tsitsiklis and Van Roy, 1997; Borkar and Meyn, 2000; Kushner and Yin, 2003; Borkar, 2009) and nonasymptotic (Bhandari et al., 2018; Lakshminarayanan and Szepesvari, 2018; Dalal et al., 2018; Srikant and Ying, 2019) regimes. See Dann et al. (2014) for a detailed survey. In particular, our analysis is based on the recent breakthrough in the nonasymptotic analysis of linear TD (Bhandari et al., 2018) and its extension to linear Q-learning (Zou et al., 2019). An essential step of our analysis is bridging the evolution of linear TD and neural TD through the implicit local linearization induced by overparametrization.
To incorporate nonlinear value function approximation into TD, Bhatnagar et al. (2009) propose the first convergent nonlinear TD based on explicit local linearization, which however only converges to a local optimum of MSPBE. See Geist and Pietquin (2013); Bertsekas (2019) for a detailed survey. In contrast, we prove that, with the implicit local linearization induced by overparametrization, neural TD, which is simpler to implement and more widely used in deep reinforcement learning than nonlinear TD, provably converges to the global optimum of MSPBE.
There exist various extensions of TD, including least-squares TD (Bradtke and Barto, 1996; Boyan, 1999; Lazaric et al., 2010; Ghavamzadeh et al., 2010; Tu and Recht, 2017) and gradient TD (Sutton et al., 2009a, b; Bhatnagar et al., 2009; Liu et al., 2015; Du et al., 2017; Wang et al., 2017; Touati et al., 2017). In detail, least-squares TD is based on batch update, which loses the computational and statistical efficiency of the online update in TD. Meanwhile, gradient TD follows unbiased stochastic gradients, but at the cost of introducing another optimization variable. Such a reformulation leads to bilevel optimization, which is less stable in practice when combined with neural networks (Pfau and Vinyals, 2016). As a result, both extensions of TD are less widely used in deep reinforcement learning (Duan et al., 2016; Amiranashvili et al., 2018; Henderson et al., 2018). Moreover, when using neural networks for value function approximation, the convergence to the global optimum of MSPBE remains unclear for both extensions of TD.
Our work is also related to the recent breakthrough in understanding overparametrized neural networks, especially their generalization error (Zhang et al., 2016; Neyshabur et al., 2018; Allen-Zhu et al., 2018; Arora et al., 2019). See Fan et al. (2019) for a detailed survey. In particular, Daniely (2017); Allen-Zhu et al. (2018); Arora et al. (2019); Chizat and Bach (2018); Jacot et al. (2018); Lee et al. (2019)
characterize the implicit local linearization in the context of supervised learning, where we train an overparametrized neural network by following the stochastic gradients of the mean-squared error. In contrast, neural TD does not follow the stochastic gradients of any objective function, hence leading to possible divergence, which makes the convergence analysis more challenging.
2.1 Policy Evaluation
We consider a Markov decision process, in which an agent interacts with the environment to learn the optimal policy that maximizes the expected total reward. At the $t$-th time step, the agent is at state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$. Upon taking the action, the agent enters the next state $s_{t+1} \in \mathcal{S}$ according to the transition probability $\mathcal{P}(\cdot \,|\, s_t, a_t)$ and receives a random reward $r_t$ from the environment. The action that the agent takes at each state is decided by a policy $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the set of all probability distributions over $\mathcal{A}$. The performance of policy $\pi$ is measured by the expected total reward, $J(\pi) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$, where $\gamma \in (0, 1)$ is the discount factor.
Given policy $\pi$, policy evaluation aims to learn the following two functions, the value function $V^\pi(s) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t \,\big|\, s_0 = s\big]$ and the action-value function (Q-function) $Q^\pi(s, a) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t \,\big|\, s_0 = s, a_0 = a\big]$. Both functions form the basis for policy improvement. Without loss of generality, we focus on learning the Q-function in this paper. We define the Bellman evaluation operator,
$$\mathcal{T}^\pi Q(s, a) = \mathbb{E}\big[\, r + \gamma \cdot Q(s', a') \,\big|\, s' \sim \mathcal{P}(\cdot \,|\, s, a),\; a' \sim \pi(\cdot \,|\, s') \,\big], \qquad (1)$$
for which $Q^\pi$ is the fixed point, that is, the solution to the Bellman equation $Q = \mathcal{T}^\pi Q$.
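The fixed-point property in the Bellman equation can be illustrated on a toy tabular MDP, where repeatedly applying the evaluation operator converges to $Q^\pi$ because the operator is a $\gamma$-contraction; the 2-state MDP, the uniform policy, and all constants below are illustrative assumptions, not the paper's setting.

```python
import numpy as np

# A toy tabular MDP: P[s, a, s'] are transition probabilities, R[s, a] are
# expected rewards, and pi(a | s) is a uniform policy.
n_s, n_a, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # shape (n_s, n_a, n_s)
R = rng.uniform(0, 1, size=(n_s, n_a))
pi = np.full((n_s, n_a), 1.0 / n_a)

def bellman_eval(Q):
    # (T^pi Q)(s, a) = r(s, a) + gamma * E_{s' ~ P, a' ~ pi}[Q(s', a')]
    v_next = (pi * Q).sum(axis=1)                  # V(s') = E_{a' ~ pi} Q(s', a')
    return R + gamma * P @ v_next                  # shape (n_s, n_a)

# T^pi is a gamma-contraction, so fixed-point iteration converges to Q^pi.
Q = np.zeros((n_s, n_a))
for _ in range(500):
    Q = bellman_eval(Q)
assert np.allclose(Q, bellman_eval(Q), atol=1e-8)  # Bellman equation Q = T^pi Q
```

With function approximation the iterate can no longer be stored exactly, which is what motivates the projected formulation in the next subsection.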
2.2 Optimization Formulation
Corresponding to (1), we aim to learn $Q^\pi$ by minimizing the mean-squared Bellman error (MSBE),
$$\min_\theta \; \mathbb{E}_\mu\Big[ \big( Q_\theta(s, a) - \mathcal{T}^\pi Q_\theta(s, a) \big)^2 \Big], \qquad (2)$$
where the Q-function is parametrized as $Q_\theta$ with parameter $\theta$. Here $\mu$ is the stationary distribution of $(s, a)$ corresponding to policy $\pi$. Due to Q-function approximation, we focus on minimizing the following surrogate of MSBE, namely the projected mean-squared Bellman error (MSPBE),
$$\min_\theta \; \mathbb{E}_\mu\Big[ \big( Q_\theta(s, a) - \Pi_{\mathcal{F}} \mathcal{T}^\pi Q_\theta(s, a) \big)^2 \Big]. \qquad (3)$$
Here $\Pi_{\mathcal{F}}$ is the projection onto a function class $\mathcal{F}$. For example, for linear Q-function approximation (Sutton, 1988), $\mathcal{F}$ takes the form $\{ \phi(s, a)^\top \theta : \theta \in \Theta \}$, where $\phi(s, a)^\top \theta$ is linear in $\theta$ and $\Theta$ is the set of feasible parameters. As another example, for nonlinear Q-function approximation (Bhatnagar et al., 2009), $\mathcal{F}$ takes the form $\{ Q_{\theta_0} + \nabla_\theta Q_{\theta_0}^\top (\theta - \theta_0) : \theta \in \Theta \}$, which consists of the local linearizations of $Q_\theta$ at the current iterate $\theta_0$.
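To make the projection $\Pi_{\mathcal{F}}$ concrete for the linear case, the following sketch computes the $\ell_2$ projection of a target function onto a linear function class by least squares over samples from the distribution; the features, the target function, and the sample size are illustrative assumptions.

```python
import numpy as np

# Projection onto F = {phi(x)^T theta}: least squares under the empirical
# distribution of the samples. X stacks the feature vectors phi(x_i), and
# g holds the target function values g(x_i).
rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))               # phi(x_i) for samples x_i
g = np.sin(X[:, 0]) + X[:, 1] ** 2        # a nonlinear target to be projected
theta, *_ = np.linalg.lstsq(X, g, rcond=None)
proj = X @ theta                          # (Pi_F g)(x_i)
resid = g - proj
# Defining property of the L2 projection: the residual is orthogonal to F,
# i.e., <phi_j, g - Pi_F g> = 0 for every feature coordinate j.
assert np.allclose(X.T @ resid, 0.0, atol=1e-8)
```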
Throughout this paper, we assume that we are able to sample tuples in the form of $(s, a, r, s', a')$ from the stationary distribution of policy $\pi$ in an independent and identically distributed manner, although our analysis can be extended to handle temporal dependence using the proof techniques of Bhandari et al. (2018). With a slight abuse of notation, we use $\mu$ to denote the stationary distribution of $(s, a)$ corresponding to policy $\pi$ and any of its marginal distributions.
3 Neural Temporal-Difference Learning
TD updates the parameter $\theta$ via
$$\theta_{t+1} \leftarrow \theta_t - \eta \cdot \big( Q_{\theta_t}(s_t, a_t) - r_t - \gamma \cdot Q_{\theta_t}(s_{t+1}, a_{t+1}) \big) \cdot \nabla_\theta Q_{\theta_t}(s_t, a_t), \qquad (4)$$
which corresponds to the MSBE in (2). Here $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ is the tuple sampled at the $t$-th iteration and $\eta > 0$ is the stepsize. In a more general context, (4) is referred to as TD(0). In this paper, we focus on TD(0), which is abbreviated as TD, and leave the extension to TD($\lambda$) to future work.
In the sequel, we denote the state-action pair $(s, a)$ by a vector $x \in \mathbb{R}^d$. We consider $\mathcal{S}$ to be continuous and $\mathcal{A}$ to be finite. Without loss of generality, we assume that $\|x\|_2 = 1$ and that the reward $r$ is upper bounded by a constant $\bar{r} > 0$ for any $(s, a)$. We use a two-layer neural network
$$f(x; W) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} b_r \cdot \sigma(w_r^\top x) \qquad (5)$$
to parametrize the Q-function. Here $\sigma(u) = \max\{u, 0\}$ is the rectified linear unit, and the parameters are initialized as $b_r \sim \mathrm{Unif}(\{-1, 1\})$ and $w_r \sim N(0, I_d / d)$ for any $r \in [m]$ independently. During training, we only update $W = (w_1, \ldots, w_m)$, while keeping $b = (b_1, \ldots, b_m)$ fixed at the random initialization. To ensure global convergence, we incorporate an additional projection step with respect to $W$. See Algorithm 1 for a detailed description.
Here (i) is the Bellman residual at the current iterate, while (ii) is the gradient of the first term in (i). Although the TD update in (4) resembles a stochastic gradient descent step for minimizing a mean-squared error, it is not an unbiased stochastic gradient of any objective function. However, we show that the TD update yields a descent direction towards the global optimum of the MSPBE in (3). Moreover, as the neural network becomes wider, the function class $\mathcal{F}$ that (3) projects onto becomes richer. Correspondingly, the MSPBE reduces to the MSBE in (2) as the projection becomes closer to the identity, which implies the recovery of the desired Q-function $Q^\pi$ such that $\mathcal{T}^\pi Q^\pi = Q^\pi$. See Section 4 for a more rigorous characterization.
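To make the update and the projection step concrete, the following sketch runs a neural TD(0) iteration end to end on a toy MDP. The MDP, reward, feature embeddings, stepsize, projection radius, and the simplified initialization $w_r \sim N(0, I_d)$ are all illustrative assumptions rather than the paper's setting.

```python
import numpy as np

# Neural TD(0) with a two-layer ReLU network on a toy 2-state, 2-action MDP
# with uniform transitions; all constants here are illustrative.
rng = np.random.default_rng(0)
d, m, gamma, eta, B, T = 4, 512, 0.5, 0.1, 50.0, 2000
feats = rng.normal(size=(2, 2, d))                      # x = (s, a) embeddings
feats /= np.linalg.norm(feats, axis=-1, keepdims=True)  # ||x||_2 = 1

b = rng.choice([-1.0, 1.0], size=m)   # output weights, fixed after init
W0 = rng.normal(size=(m, d))          # random initialization; only W is trained
W = W0.copy()

def f(x, V):
    # f(x; W) = (1/sqrt(m)) * sum_r b_r * relu(w_r^T x)
    return (b * np.maximum(V @ x, 0.0)).sum() / np.sqrt(m)

def grad_f(x, V):
    return (b * (V @ x > 0.0))[:, None] * x[None, :] / np.sqrt(m)

def residual(V):
    # mean-squared Bellman residual under the uniform stationary distribution
    vbar = np.mean([f(feats[s, a], V) for s in range(2) for a in range(2)])
    return np.mean([(f(feats[s, a], V) - (float(s == a) + gamma * vbar)) ** 2
                    for s in range(2) for a in range(2)])

r0 = residual(W0)
for _ in range(T):
    s, a = rng.integers(2), rng.integers(2)     # (s, a) from the stationary dist.
    s2, a2 = rng.integers(2), rng.integers(2)   # next pair (s', a')
    x, x2, rew = feats[s, a], feats[s2, a2], float(s == a)
    delta = f(x, W) - (rew + gamma * f(x2, W))  # Bellman residual (i)
    W = W - eta * delta * grad_f(x, W)          # stochastic semigradient step
    shift = W - W0                              # projection onto the ball
    if np.linalg.norm(shift) > B:               # {W : ||W - W0||_F <= B}
        W = W0 + shift * (B / np.linalg.norm(shift))

assert residual(W) < r0                         # TD drives the residual down
```

The projection step mirrors the one described above: the iterate is kept inside a ball of radius $B$ around the random initialization, which is also what keeps the implicit local linearization accurate.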
4 Main Results
In Section 4.1, we characterize the global optimality of the stationary point attained by Algorithm 1 in terms of minimizing the MSPBE in (3) and its other properties. In Section 4.2, we establish the nonasymptotic global rates of convergence of neural TD to the global optimum of the MSPBE when following the population semigradients in (3) and the stochastic semigradients in (4), respectively.
We use the subscript $(x, r, x')$ to denote the expectation over the randomness of the tuple $(s, a, r, s', a')$ (or its concise form $(x, r, x')$) conditional on all other randomness, e.g., the random initialization and the random current iterate. Meanwhile, we use the subscript $\mathrm{init}$ when we are taking the expectation over all randomness, including the random initialization.
4.1 Properties of Stationary Point
where $\mu$ is the stationary distribution of $(s, a)$ and $\delta$ is the Bellman residual at the current parameter $W$. The stationary point $W^*$ of (7) satisfies the following stationarity condition,
Also, note that
$$\nabla_{w_r} f(x; W) = \frac{b_r}{\sqrt{m}} \cdot \mathbb{1}\{w_r^\top x > 0\} \cdot x$$
and, since the rectified linear unit is piecewise linear, $f(x; W) = \langle \nabla_W f(x; W), W \rangle$ almost everywhere in $\mathbb{R}^{m d}$. Meanwhile, recall that $W(0)$ denotes the random initialization. We define the function class
$$\mathcal{F}_{B, m} = \Big\{ \widehat{f}(\cdot\,; W) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} b_r \cdot \mathbb{1}\{w_r(0)^\top x > 0\} \cdot w_r^\top x \;:\; \|W - W(0)\|_2 \le B \Big\},$$
which consists of the local linearizations of $f(\cdot\,; W)$ at the random initialization $W(0)$. Then (8) takes the following equivalent form
which implies $\widehat{f}(\cdot\,; W^*) = \Pi_{\mathcal{F}_{B, m}} \mathcal{T}^\pi \widehat{f}(\cdot\,; W^*)$ by the definition of the projection induced by $\mathcal{F}_{B, m}$. By (3), $\widehat{f}(\cdot\,; W^*)$ is the global optimum of the MSPBE that corresponds to the projection onto $\mathcal{F}_{B, m}$.
Intuitively, when using an overparametrized neural network with a large width $m$, the average variation in each $w_r$ diminishes to zero. Hence, roughly speaking, we have $\mathbb{1}\{w_r^\top x > 0\} = \mathbb{1}\{w_r(0)^\top x > 0\}$ with high probability for any $r \in [m]$. As a result, the function class defined in (9) approximates its infinite-width counterpart $\mathcal{F}_{B, \infty}$.
In the sequel, we show that, to characterize the global convergence of Algorithm 1 with a sufficiently large $m$, it suffices to consider $\mathcal{F}_{B, \infty}$ in place of $\mathcal{F}_{B, m}$, which simplifies the analysis, since the distribution of $w_r(0)$ is given. To this end, we define the approximate stationary point with respect to the function class $\mathcal{F}_{B, \infty}$ defined in (11). [Approximate Stationary Point $W^\dagger$] If $W^\dagger$ satisfies
where we define
then we say that $W^\dagger$ is an approximate stationary point of the population update in (7). Here $W^\dagger$ depends on the random initialization $W(0)$ and the radius $B$. The next lemma proves that such an approximate stationary point uniquely exists, since $\widehat{f}(\cdot\,; W^\dagger)$ is the fixed point of the operator $\Pi_{\mathcal{F}_{B, \infty}} \mathcal{T}^\pi$, which is a contraction in the $\ell_2$-norm associated with the stationary distribution $\mu$. [Existence, Uniqueness, and Optimality of $W^\dagger$] There exists a unique approximate stationary point $W^\dagger$ for any $B > 0$ and $m \in \mathbb{N}$. Also, $\widehat{f}(\cdot\,; W^\dagger)$ is the global optimum of the MSPBE that corresponds to the projection onto $\mathcal{F}_{B, \infty}$ in (11).
See Appendix B.1 for a detailed proof. ∎
4.2 Global Convergence
In this section, we establish the main results on the global convergence of neural TD in Algorithm 1. We first lay out the following regularity condition on the stationary distribution $\mu$. [Regularity of Stationary Distribution $\mu$] There exists a constant $c > 0$ such that for any $\tau \ge 0$ and any $w \in \mathbb{R}^d$ with $w \neq 0$, it holds almost surely that
$$\mathbb{P}_{x \sim \mu}\big( |w^\top x| \le \tau \big) \le c \cdot \tau / \|w\|_2.$$
Assumption 4.2 regularizes the density of $\mu$ in terms of the marginal distribution of $w^\top x$. In particular, it is straightforwardly implied when the density of $x$ under $\mu$ is upper bounded.
Population Update: The next theorem establishes the nonasymptotic global rate of convergence of neural TD when it follows population semigradients. Recall that the approximate stationary point $W^\dagger$ and its local linearization $\widehat{f}(\cdot\,; W^\dagger)$ are defined in Definition 4.1. Also, $B$ is the radius of the set of feasible $W$, which is defined in Algorithm 1, $T$ is the number of iterations, $\gamma$ is the discount factor, and $m$ is the width of the neural network in (5). [Convergence of Population Update] We set the stepsize $\eta$ properly in Algorithm 1 and replace the TD update in Line 6 by the population update in (7). Under Assumption 4.2, the output $\bar{W}$ of Algorithm 1 satisfies
where the expectation is taken with respect to all randomness, including the random initialization and the stationary distribution $\mu$.
To further prove the global convergence of neural TD when it follows stochastic semigradients, we first establish an upper bound of their variance, which affects the choice of the stepsize. For notational simplicity, we define the stochastic and population semigradients as
$$g_t = \big( f(x_t; W_t) - r_t - \gamma \cdot f(x_t'; W_t) \big) \cdot \nabla_W f(x_t; W_t), \qquad \bar{g}_t = \mathbb{E}_{(x, r, x')}\big[ g_t \big],$$
respectively.
[Variance Bound] There exists $\sigma_g^2 > 0$ such that the variance of the stochastic semigradient is upper bounded as $\mathbb{E}\big[ \| g_t - \bar{g}_t \|_2^2 \big] \le \sigma_g^2$ for any $t \in [T]$.
See Appendix B.2 for a detailed proof. ∎
Based on Theorem 4.2 and Lemma 4.2, we establish the global convergence of neural TD in Algorithm 1. [Convergence of Stochastic Update] We set the stepsize $\eta = O(1/\sqrt{T})$ in Algorithm 1. Under Assumption 4.2, the output of Algorithm 1 satisfies
See Appendix C.6 for a detailed proof. ∎
As the width of the neural network $m \to \infty$, Lemma 4.1 implies that $\widehat{f}(\cdot\,; W^\dagger)$ is the global optimum of the MSPBE in (3) with a richer function class to project onto. In fact, the function class $\mathcal{F}_{B, \infty}$ is a subset of an RKHS whose RKHS-norm is upper bounded in terms of $B$, where the kernel is induced by the random initialization in (5). See Appendix A.2 for a more detailed discussion on the representation power of $\mathcal{F}_{B, \infty}$. Therefore, if the desired Q-function $Q^\pi$ falls into $\mathcal{F}_{B, \infty}$, it is the global optimum of the MSPBE. By Lemma 4.1 and Theorem 4.2, we approximately obtain $Q^\pi$ through the output of Algorithm 1.
More generally, the following proposition quantifies the distance between $\widehat{f}(\cdot\,; W^\dagger)$ and $Q^\pi$ in the case that $Q^\pi$ does not fall into the function class $\mathcal{F}_{B, \infty}$. In particular, it states that the $\ell_2(\mu)$-norm distance between them is upper bounded by the distance between $Q^\pi$ and the function class $\mathcal{F}_{B, \infty}$. [Convergence of Stochastic Update to $Q^\pi$] It holds that $\|\widehat{f}(\cdot\,; W^\dagger) - Q^\pi\|_\mu \le (1 - \gamma)^{-1} \cdot \|\Pi_{\mathcal{F}_{B, \infty}} Q^\pi - Q^\pi\|_\mu$, which by Theorem 4.2 implies
See Appendix B.3 for a detailed proof. ∎
5 Proof Sketch
5.1 Implicit Local Linearization via Overparametrization
Recall that, as defined in (13), $\widehat{f}(\cdot\,; W)$ takes the form
$$\widehat{f}(x; W) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} b_r \cdot \mathbb{1}\{w_r(0)^\top x > 0\} \cdot w_r^\top x,$$
which is linear in the feature map $\big( b_r \cdot \mathbb{1}\{w_r(0)^\top x > 0\} \cdot x / \sqrt{m} \big)_{r \in [m]}$. In other words, with respect to $W$, $\widehat{f}(\cdot\,; W)$ linearizes the neural network defined in (5) locally at the random initialization $W(0)$. The following lemma characterizes the difference between $f(\cdot\,; W_t)$, which is along the solution path of neural TD in Algorithm 1, and its local linearization $\widehat{f}(\cdot\,; W_t)$. In particular, we show that the error of such a local linearization diminishes to zero as $m \to \infty$. For notational simplicity, we use $\widehat{f}_t$ to denote $\widehat{f}(\cdot\,; W_t)$ in the sequel. Note that by (13) we have $\widehat{f}_0 = f(\cdot\,; W(0))$, since the indicators in (13) agree with those of the neural network at the initialization. Recall that $B$ is the radius of the set of feasible $W$ in (11).
[Local Linearization of Q-Function] There exists a constant $c > 0$ such that for any $t \in [T]$, it holds that
See Appendix C.1 for a detailed proof. ∎
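The diminishing linearization error can also be checked numerically. The following sketch (not the paper's proof) compares the network with its linearization at the initialization for a fixed perturbation radius and increasing width; all constants are illustrative assumptions.

```python
import numpy as np

# Measure |f(x; W) - f_hat(x; W)|, where f_hat is the first-order expansion
# of the two-layer ReLU network at W0 and ||W - W0||_F is held fixed, for
# increasing widths m. The error should shrink as m grows.
rng = np.random.default_rng(0)
d, radius, trials = 8, 5.0, 50

def lin_err(m):
    errs = []
    for _ in range(trials):
        b = rng.choice([-1.0, 1.0], size=m)
        W0 = rng.normal(size=(m, d))
        D = rng.normal(size=(m, d))
        D *= radius / np.linalg.norm(D)     # enforce ||W - W0||_F = radius
        W = W0 + D
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)              # ||x||_2 = 1, as in Section 3
        def f(V):
            return (b * np.maximum(V @ x, 0.0)).sum() / np.sqrt(m)
        g0 = (b * (W0 @ x > 0.0))[:, None] * x / np.sqrt(m)  # grad at W0
        f_hat = f(W0) + (g0 * D).sum()      # first-order expansion at W0
        errs.append(abs(f(W) - f_hat))
    return float(np.mean(errs))

errors = [lin_err(m) for m in (64, 256, 1024, 4096)]
assert errors[-1] < errors[0]   # error diminishes with overparametrization
```

The mechanism is visible in the code: the linearization only errs on neurons whose activation pattern flips between $W(0)$ and $W$, and with the total perturbation fixed, the per-neuron perturbation shrinks as $m$ grows, so fewer activations flip.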
As a direct consequence of Lemma 5.1, the next lemma characterizes the effect of local linearization on population semigradients. Recall that the population semigradient $\bar{g}_t$ is defined in (16). We denote by $\widehat{g}_t$ the locally linearized population semigradient, which is defined by replacing $f(\cdot\,; W_t)$ in $\bar{g}_t$ with its local linearization $\widehat{f}(\cdot\,; W_t)$. In other words, by (16), (13), and (14), we have
[Local Linearization of Semigradient] Let $\bar{r} > 0$ be the upper bound of the reward $r$ for any $(s, a)$. There exists a constant $c > 0$ such that for any $t \in [T]$, it holds that
See Appendix C.2 for a detailed proof. ∎
The two lemmas above show that the error of local linearization diminishes as the degree of overparametrization increases along the solution path. As a result, we do not require the explicit local linearization in nonlinear TD (Bhatnagar et al., 2009). Instead, we show that such an implicit local linearization suffices to ensure the global convergence of neural TD.
5.2 Proofs for Population Update
The characterizations of the locally linearized Q-function and the locally linearized population semigradient in the two lemmas above allow us to establish the following descent lemma, which extends Lemma 3 of Bhandari et al. (2018) for characterizing linear TD.
See Appendix C.3 for a detailed proof. ∎
Lemma 5.2 shows that, with a sufficiently small stepsize $\eta$, the distance between the current iterate and the approximate stationary point $W^\dagger$ decays at each iteration up to the error of local linearization, which is characterized by Lemma 5.1. By combining Lemmas 5.1 and 5.2 and further plugging them into a telescoping sum, we establish the convergence of the output of Algorithm 1 to the global optimum of the MSPBE. See Appendix C.5 for a detailed proof.
5.3 Proofs for Stochastic Update
Recall that the stochastic semigradient $g_t$ is defined in (16). In parallel with Lemma 5.2, the following lemma additionally characterizes the effect of the variance of $g_t$, which is induced by the randomness of the current tuple $(x, r, x')$. We attach a subscript to the expectation to denote that it is over the randomness of the current iterate conditional on the random initialization and the history. Correspondingly, the unconditional expectation is over the randomness of both the current tuple and the current iterate conditional on the random initialization.
[Stochastic Descent Lemma] For the stepsize $\eta$ in Algorithm 1, it holds that
See Appendix C.4 for a detailed proof. ∎
6 Extension to Policy Optimization
With the Q-function learned by TD, policy iteration may be applied to learn the optimal policy. Alternatively, Q-learning more directly learns the optimal policy and its Q-function using temporal-difference update. Compared with TD, Q-learning aims to solve the projected Bellman optimality equation
which replaces the Bellman evaluation operator $\mathcal{T}^\pi$ in (3) with the Bellman optimality operator $\mathcal{T}^*$, defined by $\mathcal{T}^* Q(s, a) = \mathbb{E}\big[\, r + \gamma \cdot \max_{a' \in \mathcal{A}} Q(s', a') \,\big]$. When the projection is the identity, the fixed-point solution to (19) is the Q-function $Q^*$ of the optimal policy $\pi^*$, which maximizes the expected total reward (Szepesvári, 2010; Sutton and Barto, 2018). Compared with TD, the max operator in $\mathcal{T}^*$ makes the analysis more challenging and hence requires stronger regularity conditions. In the following, we first introduce neural Q-learning and then establish its global convergence. Finally, we discuss the corresponding implication for policy gradient algorithms.
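As with the evaluation operator, the contraction and fixed-point properties of the optimality operator can be illustrated on a toy tabular MDP; the MDP below is an illustrative assumption, not the paper's setting.

```python
import numpy as np

# Toy check that T* Q(s, a) = r(s, a) + gamma * E_{s'}[max_a' Q(s', a')]
# is a gamma-contraction in the sup-norm, with the optimal Q-function Q*
# as its unique fixed point.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']
R = rng.uniform(0, 1, size=(n_s, n_a))

def bellman_opt(Q):
    return R + gamma * P @ Q.max(axis=1)           # max over next actions

# sup-norm contraction: ||T*Q1 - T*Q2||_inf <= gamma * ||Q1 - Q2||_inf
Q1, Q2 = rng.normal(size=(n_s, n_a)), rng.normal(size=(n_s, n_a))
assert (np.abs(bellman_opt(Q1) - bellman_opt(Q2)).max()
        <= gamma * np.abs(Q1 - Q2).max() + 1e-12)

Q = np.zeros((n_s, n_a))
for _ in range(500):
    Q = bellman_opt(Q)
assert np.allclose(Q, bellman_opt(Q), atol=1e-8)   # fixed point Q*
```

Note that, unlike the evaluation operator, $\mathcal{T}^*$ is nonlinear in $Q$ because of the max; this nonlinearity is what necessitates the stronger regularity conditions below.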
6.1 Neural Q-Learning
In parallel with (4), we update the parameter of the optimal Q-function by
where the tuple $(s, a, r, s')$ is sampled from the stationary distribution of an exploration policy in an independent and identically distributed manner. We present the detailed neural Q-learning algorithm in Algorithm 2. Similar to Definition 4.1, we define the approximate stationary point of Algorithm 2 by
where the Bellman residual is now $f(x; W) - r - \gamma \cdot \max_{a' \in \mathcal{A}} f\big((s', a'); W\big)$. Following the same analysis as for neural TD in Lemma 4.1, we have that the approximate stationary point is the unique fixed-point solution to the projected Bellman optimality equation $\widehat{f} = \Pi_{\mathcal{F}_{B, \infty}} \mathcal{T}^* \widehat{f}$, where the function class $\mathcal{F}_{B, \infty}$ is defined in (11).
6.2 Global Convergence
To establish the global convergence of neural Q-learning, we lay out an extra regularity condition on the exploration policy, which is not required by neural TD. Such a regularity condition ensures that the state-action pair $(s', a')$ with the greedy action $a'$ in Line 4 of Algorithm 2 follows a distribution similar to that of $(s, a)$, which is drawn from the stationary distribution of the exploration policy. Recall that $\widehat{f}$ is defined in (13) and $\gamma$ is the discount factor. [Regularity of Exploration Policy] There exists a constant $c > 0$ such that for any $W$, it holds that
We remark that Melo et al. (2008); Zou et al. (2019) establish the global convergence of linear Q-learning based on an assumption that implies (22). Although Assumption 6.2 is strong, we are not aware of any weaker regularity condition in the literature, even for linear Q-learning. As our focus is to go beyond linear Q-learning to analyze neural Q-learning, we do not attempt to weaken such a regularity condition in this paper.
The following regularity condition on the stationary distribution of the exploration policy mirrors Assumption 4.2, but additionally accounts for the max operator in the Bellman optimality operator. [Regularity of Stationary Distribution] There exists a constant $c > 0$ such that for any $\tau \ge 0$ and any $w \in \mathbb{R}^d$ with $w \neq 0$, it holds almost surely that
See Appendix D.1 for a detailed proof. ∎
6.3 Implication for Policy Gradient
Theorem 6.2 can be further extended to handle neural soft Q-learning, where the max operator in the Bellman optimality operator is replaced by a more general softmax operator (Haarnoja et al., 2017; Neu et al., 2017). By exploiting the equivalence between soft Q-learning and policy gradient algorithms (Schulman et al., 2017; Haarnoja et al., 2018), we establish the global convergence of a variant of the policy gradient algorithm. Due to space limitations, we defer the discussion to Appendix E.
7 Conclusion
In this paper, we prove that neural TD converges at a sublinear rate to the global optimum of the MSPBE for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks. Moreover, we extend the convergence result to policy optimization, including (soft) Q-learning and policy gradient. Our results shed new light on the theoretical understanding of RL with neural networks, which is widely employed in practice.
- Allen-Zhu et al. (2018) Allen-Zhu, Z., Li, Y. and Liang, Y. (2018). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.
- Amiranashvili et al. (2018) Amiranashvili, A., Dosovitskiy, A., Koltun, V. and Brox, T. (2018). TD or not TD: Analyzing the role of temporal differencing in deep reinforcement learning. arXiv preprint arXiv:1806.01175.
- Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.
- Baird (1995) Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning.
- Bertsekas (2019) Bertsekas, D. P. (2019). Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA Journal of Automatica Sinica, 6 1–31.
- Bhandari et al. (2018) Bhandari, J., Russo, D. and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450.
- Bhatnagar et al. (2009) Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R. and Szepesvári, C. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems.
- Borkar (2009) Borkar, V. S. (2009). Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer.
- Borkar and Meyn (2000) Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38 447–469.
- Boyan (1999) Boyan, J. A. (1999). Least-squares temporal difference learning. In International Conference on Machine Learning.
- Boyan and Moore (1995) Boyan, J. A. and Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems.
- Bradtke and Barto (1996) Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22 33–57.
- Chizat and Bach (2018) Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.
- Dalal et al. (2018) Dalal, G., Szörényi, B., Thoppe, G. and Mannor, S. (2018). Finite sample analyses for TD(0) with function approximation. In AAAI Conference on Artificial Intelligence.
- Daniely (2017) Daniely, A. (2017). SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems.
- Dann et al. (2014) Dann, C., Neumann, G. and Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15 809–883.
- Du et al. (2017) Du, S. S., Chen, J., Li, L., Xiao, L. and Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning.
- Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J. and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.
- Facchinei and Pang (2007) Facchinei, F. and Pang, J.-S. (2007). Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media.
- Fan et al. (2019) Fan, J., Ma, C. and Zhong, Y. (2019). A selective overview of deep learning. arXiv preprint arXiv:1904.05526.
- Geist and Pietquin (2013) Geist, M. and Pietquin, O. (2013). Algorithmic survey of parametric value function approximation. IEEE Transactions on Neural Networks and Learning Systems, 24 845–867.
- Ghavamzadeh et al. (2010) Ghavamzadeh, M., Lazaric, A., Maillard, O. and Munos, R. (2010). LSTD with random projections. In Advances in Neural Information Processing Systems.
- Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P. and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning.
- Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P. and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
- Harker and Pang (1990) Harker, P. T. and Pang, J.-S. (1990). Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 48 161–220.
- Henderson et al. (2018) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D. and Meger, D. (2018). Deep reinforcement learning that matters. In AAAI Conference on Artificial Intelligence.
- Hofmann et al. (2008) Hofmann, T., Schölkopf, B. and Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics 1171–1220.
- Jaakkola et al. (1994) Jaakkola, T., Jordan, M. I. and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems.
- Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.
- Jain and Kar (2017) Jain, P. and Kar, P. (2017). Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning, 10 142–336.
- Konda and Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.
- Kushner and Yin (2003) Kushner, H. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer Science & Business Media.
- Lakshminarayanan and Szepesvari (2018) Lakshminarayanan, C. and Szepesvari, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics.
- Lazaric et al. (2010) Lazaric, A., Ghavamzadeh, M. and Munos, R. (2010). Finite-sample analysis of LSTD. In International Conference on Machine Learning.
- Lee et al. (2019) Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J. and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Liu et al. (2015) Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S. and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Conference on Uncertainty in Artificial Intelligence.
- Melo et al. (2008) Melo, F. S., Meyn, S. P. and Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In International Conference on Machine Learning.
- Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.
- Neu et al. (2017) Neu, G., Jonsson, A. and Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.
- Neyshabur et al. (2018) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y. and Srebro, N. (2018). Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076.
- Pfau and Vinyals (2016) Pfau, D. and Vinyals, O. (2016). Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945.
- Rahimi and Recht (2008a) Rahimi, A. and Recht, B. (2008a). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.
- Rahimi and Recht (2008b) Rahimi, A. and Recht, B. (2008b). Uniform approximation of functions with random bases. In Annual Allerton Conference on Communication, Control, and Computing.
- Schulman et al. (2017) Schulman, J., Chen, X. and Abbeel, P. (2017). Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.
- Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning.
- Srikant and Ying (2019) Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD learning. arXiv preprint arXiv:1902.00923.
- Sutton (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3 9–44.
- Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- Sutton et al. (2009a) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C. and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning.
- Sutton et al. (2009b) Sutton, R. S., Maei, H. R. and Szepesvári, C. (2009b). A convergent temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.
- Szepesvári (2010) Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4 1–103.
- Touati et al. (2017) Touati, A., Bacon, P.-L., Precup, D. and Vincent, P. (2017). Convergent tree-backup and retrace with function approximation. arXiv preprint arXiv:1705.09322.
- Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems.
- Tu and Recht (2017) Tu, S. and Recht, B. (2017). Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642.
- Wang et al. (2017) Wang, Y., Chen, W., Liu, Y., Ma, Z.-M. and Liu, T.-Y. (2017). Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Advances in Neural Information Processing Systems.
- Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8 229–256.
- Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
- Zou et al. (2019) Zou, S., Xu, T. and Liang, Y. (2019). Finite-sample analysis for SARSA and Q-learning with linear function approximation. arXiv preprint.