1 Introduction
One fundamental problem in Reinforcement Learning (RL) is to learn the long-term expected reward, i.e. the value function, which can consequently be used for determining a good control policy, cf. Sutton and Barto (1998). In the general setting with a large or infinite state space, exact representation of the actual value function is often prohibitively expensive or simply impossible. To overcome this difficulty, function approximation techniques are employed for estimating the value function from sampled trajectories. The quality of the learned policy depends significantly on the chosen function approximation technique.
In this paper, we consider the technique of linear value function approximation. The value function is represented or approximated as a linear combination of a set of features, or basis functions. These features are generated from the sampled states via either heuristic constructions, e.g. Bradtke and Barto (1996); Keller et al. (2006), or kernel-based approaches, e.g. Taylor and Parr (2009). A common approach first generates a vast number of features, often much larger than the number of available samples, and then automatically selects relevant features to approximate the actual value function. Unfortunately, such approaches may fail completely due to overfitting. To cope with this situation, regularization techniques must be employed. Other than the simple $\ell_2$ regularization, which penalizes the smoothness of the learned value function, e.g. Farahmand et al. (2008), in this work we focus on $\ell_1$ regularization. The $\ell_1$ regularization often produces sparse solutions, and can thus serve as a method of automatic feature selection for linear value function approximation.
This work focuses on the development of Temporal Difference (TD) learning algorithms, cf. Bradtke and Barto (1996). Recent active research on applying $\ell_1$ regularization to TD learning has led to a number of effective algorithms, e.g. Loth et al. (2007); Kolter and Ng (2009); Johns et al. (2010); Geist and Scherrer (2012); Hoffman et al. (2012). It is important to notice that $\ell_1$ minimization has been extensively studied in the areas of compressed sensing and image processing, and many efficient $\ell_1$ minimization algorithms have been developed, cf. Candès and Romberg (2007); Zibulevsky and Elad (2010). Very recently, two advanced $\ell_1$ minimization algorithms have been adapted to TD learning, i.e. the Dantzig selector based TD algorithm from Geist et al. (2012) and the orthogonal matching pursuit based TD algorithm developed in Painter-Wakefield and Parr (2012a).
On the other hand, most TD learning algorithms are known to be unstable with linear value function approximation and off-policy learning. Based on the observation that most original forms of TD algorithms are not true gradient descent methods, a new class of intrinsic gradient TD (GTD) learning algorithms with linear value function approximation has been developed and proven to be stable, cf. Sutton et al. (2008, 2009). However, the success of GTD algorithms might be limited by the fact that the GTD family requires a set of well chosen features. In other words, the GTD algorithms are in potential danger of overfitting. The key contribution of the present work is the development of a family of $\ell_1$ regularized GTD algorithms, referred to as GTD-IST algorithms. Convergence properties of the proposed algorithms are investigated from the perspective of stochastic optimization.

The paper is outlined as follows. In Section 2, we briefly introduce a general setting of TD learning and provide some preliminaries on TD objective functions. Section 3 presents a framework of $\ell_1$ regularized GTD learning algorithms and investigates their convergence properties. In Section 4, several numerical experiments depict the practical performance of the proposed algorithms compared with several existing $\ell_1$ regularized TD algorithms. Finally, a conclusion is drawn in Section 5.
2 Notations and Preliminaries
In this work, we consider a RL process as a Markov Decision Process (MDP), defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ is a set of possible states of the environment, $\mathcal{A}$ is a set of actions of the agent, $\mathcal{P}$ denotes the conditional transition probabilities $P(s' \mid s, a)$ over state transitions from state $s$ to state $s'$ given an action $a$, $r$ is a reward function assigning an immediate reward $r(s)$ to a state $s$, and $\gamma \in (0, 1)$ is a discount factor.

2.1 TD Learning with Linear Function Approximation
The goal of a RL agent is to learn a mapping from states to actions, i.e. a policy $\pi$, which maximizes the value function of a state $s$ under a policy $\pi$, defined as

$V^\pi(s) = E\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t) \,\Big|\, s_0 = s, \pi\Big].$  (1)
It is well known that, for a given policy $\pi$, the value function fulfills the Bellman equation, i.e.

$V^\pi(s) = E\big[r(s) + \gamma\, V^\pi(s') \mid s\big].$  (2)

The right hand side of (2) is often referred to as the Bellman operator for policy $\pi$, denoted by $T^\pi$. In other words, the value function $V^\pi$ is the fixed point of the Bellman operator $T^\pi$, i.e. $V^\pi = T^\pi V^\pi$.
When the state space is too large or infinite, exact representation of the value function is often practically infeasible. Function approximation is thus in great demand for estimating the actual value function. A popular approach is to construct a feature map $\phi \colon \mathcal{S} \to \mathbb{R}^n$, whose components are called the features or basis functions, and then to approximate the value function by a linear function. Concretely, for a given state $s$, the value function is approximated by

$V_\theta(s) = \phi(s)^\top \theta,$  (3)

where $\theta \in \mathbb{R}^n$ is a parameter vector. In the setting of TD learning, the parameter $\theta$ is updated at each time step $t$, i.e. for each state transition $s_t \to s_{t+1}$ and the associated reward $r_t$. Here, we consider simple one-step TD learning with linear function approximation, i.e. the framework of TD(0) learning. The parameter is updated as follows:

$\theta_{t+1} = \theta_t + \alpha_t\, \delta_t\, \phi(s_t),$  (4)

where $\{\alpha_t\}$ is a sequence of step-size parameters, and $\delta_t$ is the simple TD error

$\delta_t = r_t + \gamma\, \phi(s_{t+1})^\top \theta_t - \phi(s_t)^\top \theta_t.$  (5)
Note that the TD error $\delta_t$ can be considered as a function of the parameter $\theta$. By abuse of notation, in the rest of the paper we also write $\delta_t(\theta) = r_t + \gamma\, \phi(s_{t+1})^\top \theta - \phi(s_t)^\top \theta$.
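As a concrete illustration of the update (4) and the TD error (5), the following sketch implements one TD(0) step with linear features; the function and variable names are ours, not from the paper.

```python
import numpy as np

def td0_step(theta, phi_s, phi_s_next, reward, alpha, gamma):
    """One TD(0) step with linear value-function approximation.

    theta      : parameter vector
    phi_s      : feature vector of the current state
    phi_s_next : feature vector of the successor state
    """
    # TD error (5): delta = r + gamma * V(s') - V(s), with V(s) = phi(s)^T theta
    delta = reward + gamma * (phi_s_next @ theta) - phi_s @ theta
    # TD(0) update (4): move theta along the current state's features
    return theta + alpha * delta * phi_s

# Toy example with three tabular features
theta = np.zeros(3)
phi_a, phi_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
theta = td0_step(theta, phi_a, phi_b, reward=1.0, alpha=0.5, gamma=0.9)
# delta = 1 + 0.9*0 - 0 = 1, so theta becomes [0.5, 0, 0]
```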
2.2 Three Objective Functions for TD Learning
In order to find an optimal parameter $\theta^*$ via an optimization process, one has to define an appropriate objective function, which accurately measures the correctness of the current value function approximation, i.e. how far the current approximation is from the actual TD solution. In this subsection, we recall three popular objective functions for TD learning.
Motivated by the fact that the value function is the fixed point of the Bellman operator for a given policy, the correctness of an approximation can simply be measured by the TD error itself, i.e.

$J_{\mathrm{MSBE}}(\theta) = \|V_\theta - T^\pi V_\theta\|_D^2,$  (6)

where $D$ is a diagonal matrix whose diagonal entries are given by some state distribution. This cost function is often referred to as the Mean Squared Bellman Error (MSBE). Ideally, the minimum of the MSBE function admits a good value function approximation. Unfortunately, it is well known that, in practice, the performance of an approximation depends on the preselected feature space, i.e. the span of the features. By introducing the projector as
$\Pi = \Phi(\Phi^\top D\, \Phi)^{-1} \Phi^\top D,$  (7)

where the rows of $\Phi$ are the feature vectors $\phi(s)^\top$, the so-called Mean Squared Projected Bellman Error (MSPBE) is often preferred:

$J_{\mathrm{MSPBE}}(\theta) = \|V_\theta - \Pi T^\pi V_\theta\|_D^2.$  (8)

Minimizing the MSPBE function finds a fixed point of the projected Bellman operator in the feature space, i.e. $V_\theta = \Pi T^\pi V_\theta$.
Finally, we present a less popular objective function for TD learning. Recall the TD parameter update defined in (4). The expectation of the vector $\delta_t \phi(s_t)$ in the second summand can be considered as an error for a given $\theta$; it is expected to be equal to zero at the TD solution. Hence, one can use the squared norm of this vector, defined as

$J_{\mathrm{NEU}}(\theta) = \big\| E[\delta(\theta)\, \phi] \big\|_2^2,$  (9)

as an objective function for TD learning. The function is referred to as the Norm of Expected TD Update (NEU), and it is used to derive the original GTD algorithm in Sutton et al. (2008).
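Under i.i.d. sampling, the NEU and MSPBE objectives reduce to quadratic forms in the model quantities $A = E[\phi(\phi - \gamma\phi')^\top]$, $b = E[r\phi]$, and $M = E[\phi\phi^\top]$, since $E[\delta(\theta)\phi] = b - A\theta$. The following sketch evaluates both; the matrix-based formulation and all names are ours, chosen for illustration.

```python
import numpy as np

def neu(theta, A, b):
    # NEU (9): squared norm of the expected TD update, E[delta * phi] = b - A @ theta
    g = b - A @ theta
    return float(g @ g)

def mspbe(theta, A, b, M):
    # MSPBE (8) in matrix form: (b - A theta)^T M^{-1} (b - A theta), M = E[phi phi^T]
    g = b - A @ theta
    return float(g @ np.linalg.solve(M, g))

# Both objectives vanish at the TD fixed point theta* = A^{-1} b
A = np.array([[2.0, 0.5], [0.0, 1.5]])
b = np.array([1.0, 3.0])
M = np.eye(2)
theta_star = np.linalg.solve(A, b)
```

At any other parameter vector both objectives are strictly positive whenever $A$ is invertible, which is exactly the convexity condition used in Section 3.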
3 Stochastic Gradient Algorithms for Regularized TD Learning
In the first part of this section, we present a general framework of gradient algorithms for minimizing the $\ell_1$ regularized TD objective functions. The second subsection develops two $\ell_1$ regularized stochastic gradient TD algorithms in the online setting and investigates their convergence properties from the perspective of stochastic optimization.
3.1 $\ell_1$ Regularized TD Learning
Applying an $\ell_1$ regularizer to the parameter $\theta$ leads to the following objective function

$\widehat{J}(\theta) = J(\theta) + \lambda \|\theta\|_1,$  (10)

where $J$ is one of the TD objective functions from Section 2.2 and $\|\theta\|_1$ denotes the $\ell_1$ norm of the vector $\theta$. Here, the scalar $\lambda > 0$ weighs the regularization term $\|\theta\|_1$, and balances the sparsity of $\theta$ against the TD objective function $J$. The iterative soft thresholding (IST) algorithm is nowadays a classic algorithm for minimizing the cost function (10). It can be interpreted as an extension of the classical gradient algorithm. Due to its high popularity, we skip the derivation of the IST algorithm and refer to Zibulevsky and Elad (2010) and the references therein for further reading.
Given $x \in \mathbb{R}^n$ and $\tau > 0$, the soft thresholding operator $\mathcal{S}_\tau$ applied to $x$ is defined as

$\mathcal{S}_\tau(x) = \operatorname{sign}(x) \odot \max(|x| - \tau,\, 0),$  (11)

where $\operatorname{sign}(\cdot)$ and $\max(\cdot,\cdot)$ are applied entrywise, and $\odot$ is the entrywise multiplication. Then, minimization of the objective function (10) can be achieved by applying the soft thresholding operator iteratively. Straightforwardly, we define the IST based TD update as follows:

$\theta_{k+1} = \mathcal{S}_{\alpha_k \lambda}\big(\theta_k + \alpha_k\, g(\theta_k)\big),$  (12)
where $\alpha_k > 0$ is a step size, and $g(\theta_k)$ denotes the gradient update of the chosen TD objective function, i.e. $g = -\tfrac{1}{2}\nabla J$. Specifically, the gradient updates of the three objective functions are given as

$g_{\mathrm{MSBE}}(\theta) = E[(\phi - \gamma\phi')\,\delta(\theta)], \quad g_{\mathrm{NEU}}(\theta) = E[(\phi - \gamma\phi')\phi^\top]\, E[\delta(\theta)\phi], \quad g_{\mathrm{MSPBE}}(\theta) = E[(\phi - \gamma\phi')\phi^\top]\, E[\phi\phi^\top]^{-1} E[\delta(\theta)\phi],$  (13)

where $\phi$ and $\phi'$ denote the features of the current and the successor state, respectively.
We refer to this family of algorithms as TD-IST algorithms. Note that IST has been employed in developing fixed-point TD algorithms in Painter-Wakefield and Parr (2012b), whereas in this work we focus on developing intrinsic gradient TD algorithms.
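For reference, the soft thresholding operator (11) and the generic IST step (12) can be sketched as follows; the function names and the toy values are ours.

```python
import numpy as np

def soft_threshold(x, tau):
    # (11): sign(x) * max(|x| - tau, 0), applied entrywise
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ist_step(theta, grad_update, alpha, lam):
    # (12): gradient move followed by entrywise shrinkage toward zero
    return soft_threshold(theta + alpha * grad_update, alpha * lam)

x = np.array([1.5, -0.3, 0.8])
shrunk = soft_threshold(x, 0.5)  # -> [1.0, 0.0, 0.3]
```

Note how entries whose magnitude falls below the threshold are set exactly to zero; this is the mechanism that produces sparse parameter vectors and hence performs feature selection.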
3.2 Stochastic GTD-IST Algorithms
The TD-IST algorithms presented in the previous subsection are only applicable in the batch setting. In many real applications, it is certainly favorable to have them working online. Stochastic gradient descent algorithms can be developed straightforwardly to minimize the $\ell_1$ regularized TD objective functions. Now let us consider the online setting, i.e. we are given a sequence of data samples $(\phi_t, r_t, \phi_{t+1})_{t \ge 0}$. In the form of stochastic gradient descent, we propose a general form of parameter update as

$\theta_{t+1} = \mathcal{S}_{\alpha_t \lambda}\big(\theta_t + \alpha_t\, \widehat{g}_t\big),$  (14)
where $\widehat{g}_t$ denotes a stochastic gradient update of one of the objectives in (13), or its appropriate stochastic approximation, cf. Sutton et al. (2008, 2009). Investigating the convergence properties of the proposed algorithms requires results from Duchi and Singer (2009), which develops a general framework for analyzing empirical loss minimization with regularization. We adapt the result of Corollary 10 from Duchi and Singer (2009) to our current setting as follows.
Theorem 3.2. Let the function $J$ be smooth and strictly convex, and let $\theta^*$ be the global minimum of the function $\widehat{J} = J + \lambda\|\cdot\|_1$ with $\|\theta^*\| < \infty$. If the following three conditions hold: (1) the gradient fulfills $\|\nabla J(\theta_t)\| \le G$ for some constant $G$; (2) $\|\theta_t\| \le D$ for some constant $D$; and (3) a stochastic estimate $\widehat{g}_t$ of the gradient fulfills $E[\widehat{g}_t] = -\tfrac{1}{2}\nabla J(\theta_t)$, then IST based stochastic algorithms converge with probability one to $\theta^*$.
Let us look at the $\ell_1$ regularized NEU function first. Recall the approximate stochastic gradient update, developed in Sutton et al. (2008), as

$\theta_{t+1} = \mathcal{S}_{\alpha_t \lambda}\big(\theta_t + \alpha_t\, (\phi_t - \gamma\phi_{t+1})(\phi_t^\top w_t)\big),$  (15)

with

$w_{t+1} = w_t + \beta_t\big(\delta_t \phi_t - w_t\big),$  (16)
where $\beta_t > 0$ is a step size parameter. We refer to the corresponding algorithm as the GTD-IST algorithm. Its convergence properties are characterized in the following corollary.

Corollary. If $(\phi_t, r_t, \phi_{t+1})_{t \ge 0}$ is an i.i.d. sequence with uniformly bounded second moments, and the matrix $A = E[\phi(\phi - \gamma\phi')^\top]$ is invertible, then the GTD-IST algorithm, whose update is specified in (15), converges with probability one to the TD solution.

Proof. Recall the TD error $\delta(\theta) = r + \gamma\, \phi'^\top\theta - \phi^\top\theta$. The $\ell_1$ regularized NEU cost function can be written as

$\widehat{J}_{\mathrm{NEU}}(\theta) = \|b - A\theta\|_2^2 + \lambda\|\theta\|_1, \quad b = E[r\phi].$  (17)
It is easily seen that the regularized function $\widehat{J}_{\mathrm{NEU}}$ is strictly convex if the matrix $A$ is invertible. The TD solution is then the global minimum of $J_{\mathrm{NEU}}$. The condition that $(\phi_t, r_t, \phi_{t+1})_{t \ge 0}$ is an i.i.d. sequence with uniformly bounded second moments ensures that the boundedness conditions of Theorem 3.2 hold true for some constants. Finally, applying the fact that the stochastic approximation $(\phi_t - \gamma\phi_{t+1})(\phi_t^\top w_t)$ is a quasi-stationary estimate of the term $E[(\phi - \gamma\phi')\phi^\top]\, E[\delta\phi]$, cf. Sutton et al. (2008), we have

$E\big[(\phi_t - \gamma\phi_{t+1})(\phi_t^\top w_t)\big] = E[(\phi - \gamma\phi')\phi^\top]\, E[\delta\phi].$  (18)

Then the result follows from Theorem 3.2.
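A minimal sketch of one GTD-IST iteration, combining the updates (15) and (16); the function and variable names are ours, and the block is an illustration rather than the tuned implementation used in the experiments.

```python
import numpy as np

def soft_threshold(x, tau):
    # Entrywise shrinkage: sign(x) * max(|x| - tau, 0)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def gtd_ist_step(theta, w, phi, phi_next, r, alpha, beta, gamma, lam):
    """One online GTD-IST step.

    w is the auxiliary vector tracking E[delta * phi]; theta follows the
    approximate NEU gradient update (15) and is then soft-thresholded.
    """
    delta = r + gamma * (phi_next @ theta) - phi @ theta
    theta_new = soft_threshold(
        theta + alpha * (phi - gamma * phi_next) * (phi @ w), alpha * lam)
    # (16): exponential moving average of the sampled TD update delta * phi
    w_new = w + beta * (delta * phi - w)
    return theta_new, w_new
```

Starting from zero vectors, the first step leaves theta untouched (the auxiliary estimate is still zero) while w begins tracking the sampled TD update.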
In order to minimize the MSPBE function, two efficient GTD algorithms are developed in Sutton et al. (2009). Their approximate stochastic updates are defined as

$\theta_{t+1} = \mathcal{S}_{\alpha_t \lambda}\big(\theta_t + \alpha_t\, (\phi_t - \gamma\phi_{t+1})(\phi_t^\top w_t)\big),$  (19a)

$\theta_{t+1} = \mathcal{S}_{\alpha_t \lambda}\big(\theta_t + \alpha_t\, \big(\delta_t \phi_t - \gamma\, \phi_{t+1}(\phi_t^\top w_t)\big)\big),$  (19b)

where

$w_{t+1} = w_t + \beta_t\big(\delta_t - \phi_t^\top w_t\big)\phi_t.$  (20)
We refer to the corresponding $\ell_1$ regularized GTD algorithms, which employ the updates (19a) and (19b), as the GTD2-IST and TDC-IST algorithms, respectively. Not surprisingly, they share convergence properties similar to those of the GTD-IST algorithm.
Corollary. If $(\phi_t, r_t, \phi_{t+1})_{t \ge 0}$ is an i.i.d. sequence with uniformly bounded second moments, and both $A = E[\phi(\phi - \gamma\phi')^\top]$ and $M = E[\phi\phi^\top]$ are invertible, then both the GTD2-IST and the TDC-IST algorithms, whose updates are specified in (19), converge with probability one to the TD solution.

Proof. The $\ell_1$ regularized MSPBE cost function can be written as

$\widehat{J}_{\mathrm{MSPBE}}(\theta) = (b - A\theta)^\top M^{-1}(b - A\theta) + \lambda\|\theta\|_1, \quad b = E[r\phi].$  (21)

The function $J_{\mathrm{MSPBE}}$ is strictly convex if the matrix

$A^\top M^{-1} A$  (22)

is positive definite, i.e. both $A$ and $M$ are invertible. By the fact that the stochastic approximations in (19) are quasi-stationary estimates of the term $E[(\phi - \gamma\phi')\phi^\top]\, E[\phi\phi^\top]^{-1} E[\delta\phi]$, cf. Sutton et al. (2009), we get

$E[\widehat{g}_t] = E[(\phi - \gamma\phi')\phi^\top]\, E[\phi\phi^\top]^{-1} E[\delta\phi].$  (23)
Then, the result follows straightforwardly from the same arguments as in the proof of the previous corollary.
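Analogously, one TDC-IST iteration, combining (19b) and (20), can be sketched as follows; names are ours and serve only as an illustration of the update structure.

```python
import numpy as np

def soft_threshold(x, tau):
    # Entrywise shrinkage: sign(x) * max(|x| - tau, 0)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def tdc_ist_step(theta, w, phi, phi_next, r, alpha, beta, gamma, lam):
    """One online TDC-IST step: a TD(0)-like move plus a gradient
    correction term, followed by soft thresholding."""
    delta = r + gamma * (phi_next @ theta) - phi @ theta
    # (19b): delta*phi is the plain TD update; the second term corrects its bias
    theta_new = soft_threshold(
        theta + alpha * (delta * phi - gamma * phi_next * (phi @ w)),
        alpha * lam)
    # (20): w performs least-mean-squares tracking of E[phi phi^T]^{-1} E[delta phi]
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new
```

In contrast to GTD-IST, the first TDC-IST step already moves theta along the sampled TD update before shrinkage, which is one reason TDC variants are typically faster in practice.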
4 Numerical Experiments
In this section, we investigate the performance of our proposed $\ell_1$ regularized GTD algorithms, compared with two existing $\ell_1$ regularized TD algorithms, in both the on-policy and off-policy settings.
4.1 Experiment One: On-Policy Learning
In this experiment, we apply our proposed algorithms to a random walk problem in a chain environment consisting of seven states. There exists only one action, and the transition probabilities of going right or left are equal. A reward of one is assigned only in the rightmost state, which is the terminal state, whereas the rewards are zero everywhere else. The features consist of a binary encoding of the states and ten additional “noisy” features, which are simply Gaussian noise. In this setting, we run three different experiments.
4.1.1 Regularized vs. Unregularized
This experiment compares the performance of the proposed $\ell_1$ regularized GTD algorithms with their unregularized counterparts. Figure 1 shows the learning curves of the three GTD learning algorithms, namely GTD, GTD2, and TDC, together with their $\ell_1$ regularized versions. It is evident that the IST based GTD algorithms outperform their respective unregularized originals. The experimental results demonstrate the effectiveness of IST based GTD learning algorithms.
4.1.2 Unfavorable Initializations
The second experiment investigates the recovery behavior and convergence speed of our proposed algorithms under unfavorable initializations. Here, we only consider the simple GTD-IST algorithm. The parameter vector is initialized to ones for all the noisy features and zeros for all the “good” features. In other words, our experiment starts with an initialization that selects all the “bad” features. The results in Figure 2 show that the regularized GTD algorithms, i.e. the GTD-IST algorithm with different values of the regularization parameter $\lambda$, converge faster to the correct selection of features than the original GTD algorithm.
4.1.3 GTD-IST Algorithms vs. Others
In the third experiment, we compare the GTD-IST algorithms with the L1-TD algorithm from Painter-Wakefield and Parr (2012b) and the LARS-TD algorithm from Kolter and Ng (2009). The results in Figure 3 imply that, with or without noise, all three GTD-IST algorithms outperform the L1-TD algorithm consistently. A closer look at the zoomed-in window in Figure 3 shows that the LARS-TD algorithm performs best in the presence of noise. This might be due to the fact that the LARS-TD algorithm updates, after every 20 episodes, using all the samples available. Nevertheless, without any surprise, the timing experiment in Table 1 shows that the LARS-TD algorithm runs much slower than the other online algorithms.
Table 1: Runtime comparison of the algorithms.

Algorithm   L1-TD    GTD-IST   GTD2-IST   TDC-IST   LARS-TD
Time (s)    9.2160   9.5565    8.2130     8.4660    118.5490
4.2 Experiment Two: Off-Policy Learning
To test the performance of the GTD-IST algorithms on off-policy learning, we employ the well-known star example proposed in Baird (1995). It consists of seven states, with one state considered as the “center”. In each of the outer states, the agent can choose between two actions: either the “solid” action, which takes it to the center state with probability one, or the “dotted” action, which takes it to any of the other states with equal probability. The reward on all state transitions is zero, and the states are represented by tabular features as described in the original setting. We add 20 noisy “Gaussian” features to the state representation. The behavior policy chooses the “solid” action with a fixed probability and the “dotted” action otherwise, while the estimation policy always chooses the “dotted” action. The learning curves in Figure 4 show that both the GTD-IST and GTD2-IST algorithms outperform their original counterparts consistently.
5 Conclusions
This work combines the recently developed GTD methods with $\ell_1$ regularization, and proposes a family of GTD-IST algorithms. We investigate the convergence properties of the proposed algorithms from the perspective of stochastic optimization. Preliminary experiments demonstrate that the proposed family of GTD-IST algorithms outperforms the original unregularized counterparts as well as two existing $\ell_1$ regularized TD algorithms. Being aware of advanced developments in the sparse representation community, we plan to apply further state-of-the-art algorithms of sparse representation to RL. For example, the IST algorithms are usually known to be slow compared to other advanced $\ell_1$ minimization algorithms. Applying more efficient $\ell_1$ minimization algorithms, such as that of Beck and Teboulle (2009), to TD learning is of great interest for future work.
Acknowledgements
This work has been partially supported by the International Graduate School of Science and Engineering (IGSSE), Technische Universität München, Germany. The authors would like to thank Christopher PainterWakefield for providing us with the Matlab implementation of the L1TD algorithm.
References

Baird (1995) L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the International Conference on Machine Learning, pages 30–37, 1995.
 Beck and Teboulle (2009) A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 Bradtke and Barto (1996) S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
 Candès and Romberg (2007) E. J. Candès and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969–985, 2007.
 Duchi and Singer (2009) J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
 Farahmand et al. (2008) A. M. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized policy iteration. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, volume 21, pages 441–448. The MIT Press, 2008.
 Geist and Scherrer (2012) M. Geist and B. Scherrer. $\ell_1$-penalized projected Bellman residual. In S. Sanner and M. Hutter, editors, Recent Advances in Reinforcement Learning, volume 7188 of Lecture Notes in Computer Science, pages 89–101. Springer Berlin Heidelberg, 2012.
 Geist et al. (2012) M. Geist, B. Scherrer, A. Lazaric, and M. Ghavamzadeh. A Dantzig selector approach to temporal difference learning. In Proceedings of the International Conference on Machine Learning, 2012.
 Hoffman et al. (2012) M. W. Hoffman, A. Lazaric, M. Ghavamzadeh, and R. Munos. Regularized least squares temporal difference learning with nested $\ell_2$ and $\ell_1$ penalization. In S. Sanner and M. Hutter, editors, Recent Advances in Reinforcement Learning, volume 7188 of Lecture Notes in Computer Science, pages 102–114. Springer Berlin Heidelberg, 2012.
 Johns et al. (2010) J. Johns, C. Painter-Wakefield, and R. Parr. Linear complementarity for regularized policy evaluation and improvement. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1009–1017, 2010.
 Keller et al. (2006) P. W. Keller, S. Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML’06), pages 449–456, 2006.
 Kolter and Ng (2009) J. Z. Kolter and A. Y. Ng. Regularization and feature selection in leastsquares temporal difference learning. In Proceedings of the International Conference on Machine Learning (ICML 2009), pages 521–528, 2009.
 Loth et al. (2007) M. Loth, M. Davy, and P. Preux. Sparse temporal difference learning using LASSO. In Proceedings of the 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.
 Painter-Wakefield and Parr (2012a) C. Painter-Wakefield and R. Parr. Greedy algorithms for sparse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2012a.
 Painter-Wakefield and Parr (2012b) C. Painter-Wakefield and R. Parr. $L_1$ regularized linear temporal difference learning. Technical report, Department of Computer Science, Duke University, 2012b.
 Sutton and Barto (1998) R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
 Sutton et al. (2008) R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1609–1616. The MIT Press, 2008.
 Sutton et al. (2009) R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradientdescent methods for temporaldifference learning with linear function approximation. In Proceedings of the International Conference on Machine Learning (ICML’09), pages 993–1000, 2009.
 Taylor and Parr (2009) G. Taylor and R. Parr. Kernelized value function approximation for reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML’09), pages 1017–1024, 2009.
 Zibulevsky and Elad (2010) M. Zibulevsky and M. Elad. L1-L2 optimization in signal and image processing. IEEE Signal Processing Magazine, 27(3):76–88, 2010.