## 1 Introduction

In reinforcement learning (RL), an agent interacts with an unknown environment and seeks to learn a policy, which maps states to distributions over actions, so as to maximize a long-term numerical reward. Recently, many popular off-policy deep RL algorithms have enjoyed empirical successes on challenging domains such as video games and robotics. Their success can be attributed to their ability to scale gracefully to high-dimensional state-action spaces thanks to their use of modern high-capacity function approximators such as neural networks. Most of these algorithms have their roots in Approximate Dynamic Programming (ADP) methods (Bertsekas et al., 1995), which are standard approaches for tackling decision problems with large state spaces by making successive calls to a supervised learning algorithm. For example, Deep Q-Network (DQN) (Mnih et al., 2015) can be related to Approximate Value Iteration, while Soft Actor-Critic (SAC) (Haarnoja et al., 2018) can be related to Approximate Policy Iteration.

Unfortunately, it is well known that such off-policy methods, when combined with function approximation, frequently fail to converge to a solution and can even diverge (Baird, 1995; Tsitsiklis and Van Roy, 1997). Stable approaches have been an active area of investigation. For example, restrictive function classes such as averagers (Gordon, 1995) or smoothing kernels (Ormoneit and Sen, 2002) were shown to lead to stable learning. Gradient-based temporal-difference approaches have been proposed to derive convergent algorithms with linear function approximators, but only for off-policy evaluation (Sutton et al., 2008; Touati et al., 2018).

Recently, Smoothed Bellman Error Embedding (SBEED) was introduced in Dai et al. (2018) as the first provably convergent algorithm with general function approximators. Dai et al. (2018) leverage Nesterov’s smoothing technique and the convex-conjugate trick to derive a primal-dual optimization problem. The algorithm learns the optimal value function and the optimal policy in the primal, and the Bellman residual in the dual.

In this work, we study the theoretical behavior of SBEED in batch-mode reinforcement learning, where the algorithm only has access to a fixed dataset of transitions. We prove a near-optimal performance guarantee that depends on the representation power of the function classes we use and on a tight notion of distribution shift. Our results improve upon the prior guarantee for SBEED, presented in the original paper of Dai et al. (2018), in terms of the dependence on the planning horizon and on the sample size. In particular, we show that SBEED enjoys a linear dependence on the horizon, which is the best we can hope for (Scherrer and Lesner, 2012), and that the statistical error decreases at the rate $O(1/n)$ instead of $O(1/\sqrt{n})$, provided that the function classes are rich enough in a sense that we will specify. Our analysis builds on the recent work of Xie and Jiang (2020), which studies a related algorithm, MSBO, that can be interpreted as a non-smooth counterpart of SBEED. However, the two algorithms differ in several aspects: SBEED jointly learns the optimal policy and the optimal value function, while MSBO learns the optimal $Q$-value function and considers only policies that are greedy with respect to it. Moreover, even as the smoothing parameter goes to zero, SBEED's learning objective does not recover MSBO's objective.

## 2 Preliminaries and Setting

### 2.1 Markov Decision Processes

We consider a discounted Markov Decision Process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, \gamma, P, r, \mu_0)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, discount factor $\gamma \in [0, 1)$, transition probabilities $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ mapping state-action pairs to distributions over next states ($\Delta(\cdot)$ denotes the probability simplex), and reward function $r : \mathcal{S} \times \mathcal{A} \to [0, R_{\max}]$. $\mu_0$ is the initial state distribution. For the sake of clarity, we assume the state and action spaces are finite, with cardinality that can be arbitrarily large, but our analysis can be extended to the countable or continuous case. We denote by $\pi(a \mid s)$ the probability of choosing action $a$ in state $s$ under the policy $\pi$. The performance of a policy $\pi$ represents the expected sum of discounted rewards $J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where the expectation is taken over trajectories induced by the policy $\pi$ in the MDP such that $s_0 \sim \mu_0$, $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Moreover, we define the value function $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$ and the Q-value function $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. These functions take values in $[0, V_{\max}]$ where $V_{\max} = \frac{R_{\max}}{1 - \gamma}$.

We define the discounted state occupancy measure induced by a policy $\pi$ as
$$d^\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi, \mu_0),$$
where $\Pr(s_t = s \mid \pi, \mu_0)$ is the probability of being in state $s$ after we execute $\pi$ for $t$ steps, starting from the initial distribution $\mu_0$. By definition, $d^\pi \in \Delta(\mathcal{S})$. Similarly, we define the state-action occupancy measure $d^\pi(s, a) = d^\pi(s)\, \pi(a \mid s)$.
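To make the definition concrete, here is a minimal sketch (a toy two-state chain with purely illustrative numbers) that computes the discounted occupancy measure in closed form via the Neumann series; the variable names are ours, not the paper's:

```python
import numpy as np

# Toy 2-state MDP under a fixed policy; all numbers are illustrative.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])   # induced state transition matrix P_pi[s, s']
mu0 = np.array([1.0, 0.0])      # initial state distribution

# d_pi = (1 - gamma) * sum_t gamma^t * (mu0 @ P_pi^t)
#      = (1 - gamma) * mu0 @ (I - gamma * P_pi)^{-1}   (Neumann series).
d_pi = (1.0 - gamma) * mu0 @ np.linalg.inv(np.eye(2) - gamma * P_pi)

assert np.isclose(d_pi.sum(), 1.0)   # occupancy is a probability distribution
assert np.all(d_pi >= 0)
```

The closed form avoids truncating the infinite sum; summing the series explicitly gives the same answer.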

#### Entropy Regularized MDP:

The idea of entropy regularization has been widely used in the RL literature. In an entropy-regularized MDP, also known as a soft MDP, we aim at finding the policy that maximizes the following objective:
$$J_\lambda(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \Big( r(s_t, a_t) + \lambda\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right],$$
where $\mathcal{H}(\pi(\cdot \mid s)) = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$ is the Shannon entropy function and $\lambda \geq 0$ is a regularization parameter.

### 2.2 Batch Reinforcement Learning

We are concerned with the batch RL setting, where an agent does not have the ability to interact with the environment but is instead provided with a batch dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^n$ such that for every $i$, $(s_i, a_i)$ is an i.i.d. sample generated from a data distribution $\mu$, $r_i = r(s_i, a_i)$, and $s'_i \sim P(\cdot \mid s_i, a_i)$.

A typical batch learning algorithm requires access to a function class and aims at computing a near-optimal policy from the data by approximating the optimal action-value function with some element of the class and then outputting the greedy policy with respect to it. Different algorithms assume access to different function classes. As a further simplification, we assume that all function classes have finite, but possibly exponentially large, cardinalities.
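The batch data model above can be sketched in a few lines (toy MDP and data distribution, all numbers illustrative; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy MDP used only to illustrate the batch data model; numbers illustrative.
r = np.array([[1.0, 0.0], [0.0, 0.5]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])   # P[s, a, s']
mu = np.array([[0.3, 0.2], [0.1, 0.4]])    # data distribution over (s, a) pairs

# Draw n i.i.d. transitions: (s, a) ~ mu, reward = r(s, a), s' ~ P(. | s, a).
n = 5_000
flat = rng.choice(4, size=n, p=mu.ravel())
s, a = np.divmod(flat, 2)
rewards = r[s, a]
next_state_probs = P[s, a]                 # shape (n, 2)
# Vectorized inverse-CDF sampling of s' for each transition.
s_next = (rng.random(n)[:, None] > next_state_probs.cumsum(axis=1)).sum(axis=1)

dataset = list(zip(s, a, rewards, s_next))
assert len(dataset) == n and set(s_next) <= {0, 1}
```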

## 3 Smoothed Bellman Error Embedding

In this section, we describe the SBEED algorithm and provide insights about its design that will be useful for our subsequent analysis. The next lemma restates Proposition 3 in Dai et al. (2018), which characterizes the optimal value function and the optimal policy of the soft MDP.

###### Lemma 1 (Temporal consistency; Proposition 3 in Dai et al. (2018)).

The optimal value function $V^\star_\lambda$ and the optimal policy $\pi^\star_\lambda$ of the soft MDP are the unique pair $(V, \pi)$ that satisfies the following equality for all $(s, a) \in \mathcal{S} \times \mathcal{A}$:
$$V(s) = r(s, a) - \lambda \log \pi(a \mid s) + \gamma\, \mathbb{E}_{s' \mid s, a}\big[V(s')\big],$$
where $\mathbb{E}_{s' \mid s, a}$ denotes the expectation over $s' \sim P(\cdot \mid s, a)$.
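As a numerical sanity check of this temporal consistency characterization, the following sketch (a toy two-state soft MDP with illustrative numbers) runs soft value iteration with a log-sum-exp backup and verifies that the resulting value function and Boltzmann policy satisfy the consistency equation at every state-action pair:

```python
import numpy as np

# Toy soft MDP: 2 states, 2 actions; all numbers are illustrative.
gamma, lam = 0.9, 0.1
r = np.array([[1.0, 0.0],
              [0.0, 0.5]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])   # P[s, a, s']

# Soft value iteration: V(s) <- lam * log sum_a exp(Q(s, a) / lam),
# with Q(s, a) = r(s, a) + gamma * E_{s'|s,a}[V(s')].
V = np.zeros(2)
for _ in range(3000):
    Q = r + gamma * P @ V
    V = lam * np.log(np.exp(Q / lam).sum(axis=1))

# The optimal soft policy is the Boltzmann policy over Q.
pi = np.exp((Q - V[:, None]) / lam)

# Temporal consistency: V(s) = r(s,a) - lam*log pi(a|s) + gamma*E[V(s')]
# must hold for ALL (s, a), not just a greedy action.
consistency = r - lam * np.log(pi) + gamma * P @ V
assert np.allclose(consistency, V[:, None], atol=1e-6)
assert np.allclose(pi.sum(axis=1), 1.0)
```

Note that the consistency holds at every action simultaneously, which is what makes the squared-residual objective below meaningful.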

Let $\mathcal{C}$ denote the consistency operator, defined for any pair $(V, \pi)$ by
$$\mathcal{C}(V, \pi)(s, a) = r(s, a) - \lambda \log \pi(a \mid s) + \gamma\, \mathbb{E}_{s' \mid s, a}\big[V(s')\big] - V(s).$$
A natural objective function inspired by Lemma 1 would be:

$$\min_{V \in \mathcal{V},\, \pi \in \Pi}\; \big\|\mathcal{C}(V, \pi)\big\|_{2, \mu}^2, \tag{1}$$

where $\|f\|_{2, \mu} = \sqrt{\mathbb{E}_{(s,a) \sim \mu}[f(s, a)^2]}$ is the $\mu$-weighted 2-norm, $\mathcal{V} \subseteq \{V : \mathcal{S} \to [0, V_{\max}]\}$ is the class of candidate value functions, and $\Pi$ is the class of candidate policies.
To solve the minimization problem (1), one may try to minimize the empirical objective estimated purely from data:
$$\hat{L}_n(V, \pi) = \frac{1}{n} \sum_{i=1}^{n} \big( r_i - \lambda \log \pi(a_i \mid s_i) + \gamma V(s'_i) - V(s_i) \big)^2.$$
Due to the inner conditional expectation in (1), the expectation of $\hat{L}_n(V, \pi)$ over the draw of the dataset differs from the original objective. In particular,

$$\mathbb{E}_D\big[\hat{L}_n(V, \pi)\big] = \big\|\mathcal{C}(V, \pi)\big\|_{2, \mu}^2 + \gamma^2\, \mathbb{E}_{(s,a) \sim \mu}\Big[\operatorname{Var}_{s' \mid s, a}\big(V(s')\big)\Big]. \tag{2}$$
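The extra conditional-variance term can be seen numerically. The sketch below (a single state-action pair with illustrative numbers) Monte-Carlo estimates the expected squared empirical residual and checks that it exceeds the squared expected residual by exactly the conditional variance of the target:

```python
import numpy as np

rng = np.random.default_rng(0)

# One (s, a) pair with a stochastic next state; all numbers illustrative.
gamma = 0.9
V = np.array([0.0, 1.0, 2.0])       # candidate value at 3 possible next states
p_next = np.array([0.2, 0.5, 0.3])  # P(s' | s, a)
r, V_s = 0.4, 1.1                   # reward r(s, a) and V(s)

target = r + gamma * V                               # per-next-state target
bellman_residual_sq = (p_next @ target - V_s) ** 2   # (expected residual)^2
cond_var = p_next @ (target - p_next @ target) ** 2  # Var[r + gamma*V(s')|s,a]

# Monte-Carlo estimate of E[(r + gamma*V(s') - V(s))^2] over draws of s'.
s_prime = rng.choice(3, size=200_000, p=p_next)
emp_sq_residual = np.mean((target[s_prime] - V_s) ** 2)

# The naive empirical loss is biased upward by the conditional variance.
assert abs(emp_sq_residual - (bellman_residual_sq + cond_var)) < 1e-2
assert emp_sq_residual > bellman_residual_sq + 0.1   # clearly not the same
```

With these numbers the squared residual is about 0.08 while the naive estimate converges to about 0.48, so the bias dominates whenever the next-state dynamics are noisy.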

To address this issue, also known as the double sampling issue (Baird, 1995), Dai et al. (2018) use the Fenchel dual trick as well as an appropriate change of variable to derive the following minimax SBEED objective:

$$\min_{V \in \mathcal{V},\, \pi \in \Pi}\; \max_{\nu \in \mathcal{H}}\; \hat{L}(V, \pi; \nu), \tag{3}$$

where $\mathcal{H}$ is a helper function class and
$$\hat{L}(V, \pi; \nu) = \frac{1}{n} \sum_{i=1}^{n} \Big[ \delta_i^2 - \big( \delta_i - \nu(s_i, a_i) \big)^2 \Big], \qquad \delta_i := r_i - \lambda \log \pi(a_i \mid s_i) + \gamma V(s'_i) - V(s_i).$$

To understand the intuition behind the SBEED objective (3), note that if the helper class $\mathcal{H}$ is rich enough, then in the regime of an infinite amount of data the minimizer of the inner regression problem converges to the conditional expectation of the empirical residual given $(s, a)$, which is the Bayes-optimal regressor, and the minimum of the regression converges to the conditional variance of the residual, which is the optimal Bayes error. This allows the cancellation of the extra conditional-variance term in Equation (2). Therefore, the SBEED objective is a consistent estimate of the objective (1) as long as $\mathcal{H}$ is rich enough. Note that the only difference between the SBEED loss and the corresponding MSBO loss (Section 4) is that the former takes a single-variable value function $V(\cdot)$ as its first argument, while the latter takes a two-variable $Q$-value function $Q(\cdot, \cdot)$ as its first argument.
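The cancellation can also be checked numerically. In the sketch below (Gaussian residuals for a single state-action pair, illustrative parameters), subtracting the optimal regression residual from the naive squared loss recovers the squared conditional-mean residual:

```python
import numpy as np

rng = np.random.default_rng(1)

# Residuals delta at a single (s, a): conditional mean 0.3, std 0.7
# (both illustrative).
delta_mean, delta_std = 0.3, 0.7
delta = rng.normal(delta_mean, delta_std, size=500_000)

naive = np.mean(delta ** 2)        # biased: converges to mean^2 + variance
nu = delta.mean()                  # Bayes-optimal regressor at this (s, a)
corrected = naive - np.mean((delta - nu) ** 2)   # subtract regression residual

assert abs(naive - (delta_mean**2 + delta_std**2)) < 5e-3   # ~0.58
assert abs(corrected - delta_mean**2) < 5e-3                # ~0.09: cancelled
```

The corrected estimate targets the squared Bellman residual alone, which is exactly the role the helper class plays in the minimax objective.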

## 4 Analysis

In this section, we provide a near-optimal performance guarantee for the SBEED algorithm. In order to state our main results, we need to introduce a few key assumptions. The first characterizes the distribution shift, more precisely the mismatch between the training distribution $\mu$ and the discounted occupancy measure $d^\pi$ induced by any policy $\pi$.

###### Assumption 1 (Concentrability coefficient).

We assume that $C := \max_{\pi} \mathbb{E}_{(s, a) \sim \mu}\left[\left(\frac{d^\pi(s, a)}{\mu(s, a)}\right)^2\right] < \infty$, where the maximum ranges over all policies.

$C$ uses the $\mu$-weighted square of the marginalized importance weights and is one of the simplest versions of the concentrability coefficients considered in the literature (Munos, 2003; Antos et al., 2008; Scherrer, 2014). In spite of its simple form, $C$ can be tighter than more involved concentrability coefficients in some cases (Xie and Jiang, 2020).

Now, we introduce the assumptions that characterize the representation power of the function classes. The next assumption measures the capacity of the policy and value classes to represent the optimal policy and the optimal value function of the soft MDP.

###### Assumption 2 (Approximate realizability).

Define $\epsilon_{\mathcal{V}, \Pi} := \min_{V \in \mathcal{V},\, \pi \in \Pi} \big\|\mathcal{C}(V, \pi)\big\|_{2, \mu}^2$ and assume it is small. According to Lemma 1, if $\mathcal{V}$ and $\Pi$ realize $V^\star_\lambda$ and $\pi^\star_\lambda$ (that is, $V^\star_\lambda \in \mathcal{V}$ and $\pi^\star_\lambda \in \Pi$), then $\epsilon_{\mathcal{V}, \Pi} = 0$. Therefore, $\epsilon_{\mathcal{V}, \Pi}$ measures the violation of the realizability of $V^\star_\lambda$ and $\pi^\star_\lambda$.

###### Assumption 3 (Approximate realizability of the helper class).

Define $\epsilon_{\mathcal{H}} := \max_{V \in \mathcal{V},\, \pi \in \Pi}\, \min_{\nu \in \mathcal{H}} \big\| \nu - \nu^\star_{V, \pi} \big\|_{2, \mu}^2$ and assume it is small, where $\nu^\star_{V, \pi}(s, a)$ denotes the conditional expectation of the empirical residual of $(V, \pi)$ given $(s, a)$.

When $\mathcal{H}$ realizes the optimal Bayes regressor $\nu^\star_{V, \pi}$ for any $V \in \mathcal{V}$ and $\pi \in \Pi$, we have $\epsilon_{\mathcal{H}} = 0$. Therefore, the latter assumption measures the violation of the realizability of $\mathcal{H}$ for the worst-case $V$ and $\pi$.

Our analysis starts with a useful telescoping lemma.

###### Lemma 2 (Telescoping Lemma).

For any $V$ and $\pi$:
$$J_\lambda(\pi) - \mathbb{E}_{s \sim \mu_0}\big[V(s)\big] = \frac{1}{1 - \gamma}\, \mathbb{E}_{(s, a) \sim d^\pi}\big[\mathcal{C}(V, \pi)(s, a)\big].$$

Lemma 2 is an important first step in proving the linear dependence of SBEED's guarantee on the planning horizon $\frac{1}{1-\gamma}$, unlike standard iterative methods, such as Fitted Q-Iteration, which incur a quadratic dependence on the horizon. A similar lemma was proved in Xie and Jiang (2020) for Q-value functions of the unregularized MDP.
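A telescoping identity of this flavor is easy to verify numerically. The sketch below (a toy two-state soft MDP with illustrative numbers; variable names are ours) checks that, for an arbitrary candidate value function, the soft performance of a policy equals the initial-state average of the candidate plus the occupancy-weighted average of its consistency residual, scaled by $1/(1-\gamma)$:

```python
import numpy as np

# Toy soft MDP; all numbers are illustrative.
gamma, lam = 0.9, 0.1
r = np.array([[1.0, 0.0], [0.0, 0.5]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])   # P[s, a, s']
pi = np.array([[0.7, 0.3], [0.4, 0.6]])    # arbitrary stochastic policy
mu0 = np.array([0.6, 0.4])                 # initial state distribution

# Soft performance J_lam(pi): solve V = r_pi + lam*H(pi) + gamma*P_pi V.
P_pi = np.einsum('sa,sap->sp', pi, P)              # induced state transitions
r_pi = (pi * (r - lam * np.log(pi))).sum(axis=1)   # reward plus entropy bonus
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
J = mu0 @ V_pi

# Occupancy measure d_pi over (s, a).
d_s = (1 - gamma) * mu0 @ np.linalg.inv(np.eye(2) - gamma * P_pi)
d_sa = d_s[:, None] * pi

# Telescoping: J = E_mu0[V] + (1/(1-gamma)) * E_{d_pi}[C(V, pi)] for ANY V.
V = np.array([3.0, -1.0])                  # arbitrary candidate value function
C = r - lam * np.log(pi) + gamma * P @ V - V[:, None]
assert np.isclose(J, mu0 @ V + (d_sa * C).sum() / (1 - gamma))
```

Because the identity holds for every candidate value function, controlling the residual under the occupancy measure directly controls the policy's suboptimality, with a single $1/(1-\gamma)$ factor.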

Let $(\hat{V}, \hat{\pi})$ denote the output of the SBEED algorithm, i.e., a solution of the minimax problem (3).

With the telescoping Lemma 2 and the definition of the concentrability coefficient (Assumption 1), we can relate the suboptimality of the learnt policy to the minimization objective of Equation (1).

###### Lemma 3 (Suboptimality).

The performance difference between the optimal policy and the output policy is given by

(4)

The first term in Equation (4) is the bias due to the entropy regularization. To establish our performance guarantee, we need to relate the second term to the empirical loss in the minimax objective (3) that SBEED solves. The former is a population loss, while the latter is an empirical loss estimated from the dataset that involves a helper function $\nu$. We first drop the dependence on the helper function by bounding the deviation between the empirical loss and its population counterpart uniformly over $\mathcal{V}$, $\Pi$ and $\mathcal{H}$. Informally, we get:

We can finally obtain the desired result by exploiting the fact that the population loss is equal, in expectation over draws of the dataset, to the quantity of interest. A thorough treatment of each step involves concentration-of-measure arguments as well as a careful handling of function approximation errors. In particular, we use Bernstein's inequality in order to get a faster rate, similarly to what was done for the Fitted Q-Iteration and MSBO analyses in Chen and Jiang (2019) and Xie and Jiang (2020). Detailed proofs are provided in the supplemental material. We now state our performance guarantee.

As an immediate consequence of our main performance theorem, we provide a finite sample complexity for SBEED in the case of full realizability, i.e., when the approximation errors in Assumptions 2 and 3 vanish.

#### Comparison with prior analysis of SBEED:

Our results improve over the original analysis of SBEED in Dai et al. (2018) in several aspects. First, in terms of the guarantee itself, Dai et al. (2018) provide a bound on the weighted distance between the learnt value function and the optimal one, which does not really capture the performance of the algorithm. In fact, the learnt value function does not necessarily correspond to the value of any policy; it is rather used to learn a policy that is then executed in the MDP. Therefore, the quantity we should be looking at instead, as we do in this work, is the difference between the optimal performance and the performance of the learnt policy. Secondly, in terms of statistical error, we obtain a faster rate of $O(1/n)$ in the fully realizable case thanks to the use of Bernstein's inequality (Cesa-Bianchi and Lugosi, 2006), while Dai et al. (2018) prove a slower rate of $O(1/\sqrt{n})$. When realizability holds only approximately, the slow-rate term of our bound can be made smaller than the corresponding term of Dai et al. (2018) by decreasing the approximation error.

#### Comparison with MSBO:

Xie and Jiang (2020) study a related algorithm called MSBO, which solves the following minimax objective:

$$\min_{f \in \mathcal{F}}\; \max_{g \in \mathcal{G}}\; \frac{1}{n} \sum_{i=1}^{n} \Big[ \big( f(s_i, a_i) - r_i - \gamma \max_{a'} f(s'_i, a') \big)^2 - \big( g(s_i, a_i) - r_i - \gamma \max_{a'} f(s'_i, a') \big)^2 \Big],$$

where $\mathcal{F}$ is a class of candidate $Q$-value functions and $\mathcal{G}$ is a helper class. SBEED can be seen as a smooth counterpart of MSBO. Except for the bias due to entropy regularization, our performance bound shares a similar structure with the bound obtained by Xie and Jiang (2020) for MSBO, but with our own definitions of the concentrability coefficient and approximation errors that suit our algorithm of interest. However, the two algorithms differ in several ways. SBEED jointly learns the policy and the value function, while MSBO learns the $Q$-value function and considers only policies that are greedy with respect to it. If we set $\lambda = 0$ in the SBEED objective, we do not recover the MSBO objective, which means that MSBO is not a special case of SBEED; in fact, when $\lambda = 0$, SBEED will learn the value function of the behavior policy that generated the data. Finally, it is established that SBEED, when implemented with a differentiable function approximator, converges locally, while, to the best of our knowledge, there is no practical convergent instantiation of MSBO.

## 5 Conclusion and Future Work

We establish a performance guarantee for the SBEED algorithm that depends only linearly on the planning horizon and enjoys an improved statistical rate in the realizable case. Our bound matches the bound of MSBO, a non-smooth counterpart of SBEED, which suggests that there is no clear benefit from the entropy regularization. As future work, we would like to look at regularized versions of Fitted Policy Iteration or Fitted Q-Iteration, which have weaker guarantees than SBEED, and investigate whether regularization would play a more significant role in improving their performance.

## 6 Acknowledgements

We would like to thank Harsh Satija for his helpful feedback on the paper.

## References

- A. Antos, Cs. Szepesvári, and R. Munos (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning 71(1), pp. 89–129.
- L. Baird (1995). Residual algorithms: reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37.
- D. P. Bertsekas (1995). Dynamic Programming and Optimal Control. Vol. 1, Athena Scientific, Belmont, MA.
- N. Cesa-Bianchi and G. Lugosi (2006). Prediction, Learning, and Games. Cambridge University Press.
- J. Chen and N. Jiang (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pp. 1042–1051.
- B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song (2018). SBEED: convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pp. 1125–1134.
- G. J. Gordon (1995). Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pp. 261–268.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870.
- V. Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.
- R. Munos (2003). Error bounds for approximate policy iteration. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 560–567.
- D. Ormoneit and Ś. Sen (2002). Kernel-based reinforcement learning. Machine Learning 49(2–3), pp. 161–178.
- B. Scherrer and B. Lesner (2012). On the use of non-stationary policies for stationary infinite-horizon Markov decision processes. In Advances in Neural Information Processing Systems, pp. 1826–1834.
- B. Scherrer (2014). Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pp. 1314–1322.
- R. S. Sutton, Cs. Szepesvári, and H. R. Maei (2008). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21, pp. 1609–1616.
- A. Touati, P.-L. Bacon, D. Precup, and P. Vincent (2018). Convergent tree backup and retrace with function approximation. In International Conference on Machine Learning, pp. 4955–4964.
- J. N. Tsitsiklis and B. Van Roy (1997). An analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1075–1081.
- T. Xie and N. Jiang (2020). Q* approximation schemes for batch reinforcement learning: a theoretical comparison. CoRR abs/2003.03924.

## Appendix A Outline

The appendix of this paper is organized as follows:

## Appendix B Notations

We provide this table for easy reference. Notation will also be defined as it is introduced.

| Notation | Meaning |
| --- | --- |
| $J(\pi)$ | policy performance in the unregularized MDP |
| $\pi^\star, V^\star, J^\star$ | optimal policy, value function and performance of the unregularized MDP |
| $J_\lambda(\pi)$ | policy performance in the soft MDP |
| $\pi^\star_\lambda, V^\star_\lambda, J^\star_\lambda$ | optimal policy, value function and performance of the soft MDP |
| $V_{\max}$ | maximum value taken by the value function of the soft MDP |
| $\mathcal{C}$ | consistency operator, $\mathcal{C}(V, \pi)(s, a) = r(s, a) - \lambda \log \pi(a \mid s) + \gamma\, \mathbb{E}_{s' \mid s, a}[V(s')] - V(s)$ |
| $\mathcal{V}$ | class of candidate value functions |
| $\Pi$ | class of candidate policies |
| $\mathcal{H}$ | class of helper functions |
| $\|\cdot\|_{2, \mu}$ | the $\mu$-weighted 2-norm |
| $(\hat{V}, \hat{\pi})$ | the output of the SBEED algorithm |
| $(\tilde{V}, \tilde{\pi})$ | the best solution in $\mathcal{V}$ and $\Pi$ |
| $\tilde{\nu}$ | the best solution in $\mathcal{H}$ |
| $C$ | concentrability coefficient |

## Appendix C Proof of Lemma 2

###### Proof.

We have

(by marginalizing over ) | ||||

We obtain the desired result by noticing that ∎

## Appendix D Proof of Lemma 3

###### Proof.

We start by bounding the performance suboptimality in the soft MDP.

(apply Lemma 2) | ||||

() | ||||

(Cauchy-Schwarz inequality) | ||||

Therefore,

($\pi^\star$ is a deterministic policy, so its entropy is zero) | ||||

(by optimality of $\pi^\star_\lambda$ in the soft MDP) | ||||

∎

## Appendix E Proof of Theorem LABEL:theorem:_performance

We provide here a complete analysis of the SBEED algorithm. Recall:

### E.1 Dependence on the helper function class

Let $\tilde{\nu}$ denote the best function in the class $\mathcal{H}$.

where we define the following random variables

and for , and is an i.i.d sample when .

###### Lemma 4 (Properties of ).

We have