1 Introduction
For tasks with large state or action spaces, where tabular representations are not feasible, reinforcement learning algorithms typically rely on function approximation. Whether they are learning the value function, policy or models, the success of function approximation techniques hinges on the quality of the representation. Typically, representations are handcrafted, with some common representations including tile coding, radial basis functions, polynomial basis functions and Fourier basis functions (Sutton, 1996; Konidaris et al., 2011). Automating feature discovery, however, alleviates this burden and has the potential to significantly improve learning.
Representation learning techniques in reinforcement learning have typically drawn on the large literature in unsupervised and supervised learning. Common approaches include feature selection, including $\ell_1$ regularization on the value function parameters (Loth et al., 2007; Kolter and Ng, 2009; Nguyen et al., 2013) and matching pursuit (Parr et al., 2008; Painter-Wakefield and Parr, 2012); basis-function adaptation approaches (Menache et al., 2005; Whiteson et al., 2007); instance-based approaches, such as locally weighted regression (Atkeson and Morimoto, 2003), sparse distributed memories (Ratitch and Precup, 2004), proto-value functions (Mahadevan and Maggioni, 2007) and manifold learning techniques (Mahadevan, 2009); and neural network approaches, including more standard feedforward neural networks (Coulom, 2002; Riedmiller, 2005; Mnih et al., 2015) as well as random representations (Sutton and Whitehead, 1993), linear threshold unit search (Sutton and Barto, 2013), and evolutionary algorithms like NEAT (Stanley and Miikkulainen, 2002).
Surprisingly, however, there has been little investigation into using sparse coding for reinforcement learning. Sparse coding approaches have been developed to learn MDP models for transfer learning (Ammar et al., 2012); outside this work, however, little has been explored. Nonetheless, such sparse coding representations have several advantages: they naturally enable local models, are computationally efficient to use, are much simpler to train than more complicated models such as neural networks, and are biologically motivated by the observed representation in the mammalian cortex (Olshausen and Field, 1997).
In this work, we develop a principled sparse coding objective for policy evaluation. In particular, we formulate a joint optimization over the basis and the value function parameters, to provide a supervised sparse coding objective where the basis is informed by its utility for prediction. We highlight the importance of using the Bellman error or mean-squared return error for this objective, and discuss how the projected Bellman error is not suitable. We then show that, despite being a non-convex objective, all local minima are global minima, under minimal conditions. We avoid the need for careful initialization strategies needed for previous optimality results for sparse coding (Agarwal et al., 2014; Arora et al., 2015), using recent results for more general dictionary learning settings (Haeffele and Vidal, 2015; Le and White, 2017), particularly by extending beyond smooth regularizers using $\Gamma$-convergence. Using this insight, we provide a simple alternating proximal gradient algorithm and demonstrate the utility of learning supervised sparse coding representations versus unsupervised sparse coding and a variety of tile coding representations.
2 Background
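In reinforcement learning, an agent interacts with its environment, receiving observations and selecting actions to maximize a scalar reward signal provided by the environment. This interaction is usually modeled by a Markov decision process (MDP). An MDP consists of $(\mathcal{S}, \mathcal{A}, P, R)$, where $\mathcal{S}$ is the set of states; $\mathcal{A}$ is a finite set of actions; $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition function, which describes the probability of reaching a state $s'$ from a given state $s$ and action $a$; and finally $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, which returns a scalar value for transitioning from state-action $(s, a)$ to state $s'$. The state of the environment is said to be Markov if $\Pr(s_{t+1} \mid s_t, a_t) = \Pr(s_{t+1} \mid s_t, a_t, \ldots, s_0, a_0)$.
One important goal in reinforcement learning is policy evaluation: learning the value function for a policy. A value function $V^\pi: \mathcal{S} \to \mathbb{R}$ approximates the expected return. The return $G_t$ from a state $s_t$ is the total discounted future reward, discounted by $\gamma \in [0, 1]$, for following policy $\pi$:
$$G_t = \sum_{i=0}^{\infty} \gamma^i R_{t+1+i} = R_{t+1} + \gamma G_{t+1},$$
where $V^\pi(s_t)$ is the expectation of this return from state $s_t$. This value function can also be thought of as a vector of values $V^\pi \in \mathbb{R}^{|\mathcal{S}|}$ satisfying the Bellman equation
$$V^\pi = R^\pi + \gamma P^\pi V^\pi, \tag{1}$$
where $P^\pi(s, s') = \sum_{a} \pi(s, a) P(s, a, s')$ and $R^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P(s, a, s') R(s, a, s')$.
Given the reward function and transition probabilities, the solution can be analytically obtained: $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$.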
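As a minimal sketch of the analytic solution to the Bellman equation (1), the following solves the linear system for a small MDP. The three-state chain, its rewards and the names `solve_bellman`, `P_pi` and `r_pi` are hypothetical and chosen only for illustration; they are not from the experiments in this paper.

```python
import numpy as np

def solve_bellman(P_pi, r_pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V exactly for a small, known MDP."""
    n = P_pi.shape[0]
    # Solving the linear system avoids forming the matrix inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Hypothetical 3-state chain under a fixed policy (illustrative numbers only).
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 0.0, 1.0])
gamma = 0.9

V = solve_bellman(P_pi, r_pi, gamma)
# The solution satisfies the Bellman equation up to numerical precision.
assert np.allclose(V, r_pi + gamma * P_pi @ V)
```

The cost of this solve grows cubically with the number of states, which is one reason the approach is not practical for large state spaces.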
In practice, however, we likely have a prohibitively large state space. The typical strategy in this setting is to use function approximation to learn $V^\pi$ from a trajectory of samples: a sequence of states, actions, and rewards $s_0, a_0, r_1, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, r_T, s_T$, where $s_0$ is drawn from the start-state distribution, $s_{t+1} \sim P(s_t, a_t, \cdot)$ and $a_t \sim \pi(s_t, \cdot)$. Commonly, a linear function is assumed, $V^\pi(s) \approx \phi(s)^\top w$, for a parameter vector $w \in \mathbb{R}^k$ and a feature function $\phi: \mathcal{S} \to \mathbb{R}^k$ describing states. With this approximation, however, typically we can no longer satisfy the Bellman equation in (1), because there may not exist a $w$ such that $\Phi w$ equals $R^\pi + \gamma P^\pi \Phi w$ for $\Phi \in \mathbb{R}^{|\mathcal{S}| \times k}$. Instead, we focus on minimizing the error to the true value function.
Reinforcement learning algorithms, such as temporal difference learning and residual gradient, therefore focus on finding an approximate solution to the Bellman equation, despite this representation issue. The quality of the representation is critical to accurately approximating $V^\pi$ with $\Phi w$, but also to balancing compactness of the representation and speed of learning. Sparse coding, and sparse representations, have proven successful in machine learning and in reinforcement learning, particularly as fixed bases, such as tile coding, radial basis functions and other kernel representations. A natural goal, therefore, and the one we explore in this work, is to investigate learning these sparse representations automatically.
3 Sparse Coding for Reinforcement Learning
In this section, we formalize sparse coding for reinforcement learning as a joint optimization over the value function parameters and the representation. We introduce the true objective over all states, and then move to the sampled objective for the algorithm in the next section.
We begin by formalizing the representation learning component. Many unsupervised representation learning approaches consist of factorizing input observations $X \in \mathbb{R}^{n \times d}$ (footnote 1) into a basis dictionary $B \in \mathbb{R}^{k \times d}$ and new representation $\Phi \in \mathbb{R}^{n \times k}$. The rows of $B$ form a set of bases, with each row of $\Phi$ weighting amongst those bases for the corresponding observation (row) in $X$. Though simple, this approach encompasses a broad range of models, including PCA, CCA, ISOMAP, locally linear embeddings and sparse coding (Singh and Gordon, 2008; Le and White, 2017). The (unsupervised) sparse coding objective is (Aharon et al., 2006)
$$\min_{B \in \mathbb{R}^{k \times d},\ \Phi \in \mathbb{R}^{n \times k}} \|\Phi B - X\|_D^2 + \beta_B \|B\|_F^2 + \beta_\phi \|\Phi\|_{D,1}$$
where $\|\cdot\|_F^2$ is the squared Frobenius norm; $B$ is a learned basis dictionary; $\beta_B, \beta_\phi > 0$ determine the magnitudes of the regularizers; $D$ is a diagonal matrix giving a distribution over states, corresponding to the stationary distribution of the policy $\pi$; and $\|\cdot\|_D$ is a weighted norm. The reconstruction error $\|\Phi B - X\|_D^2$ is weighted by the stationary distribution because states are observed with frequency indicated by $D$. The weighted $\ell_1$ norm $\|\Phi\|_{D,1}$ promotes sparsity on the entries of $\Phi$, preferring entries in $\Phi$ to be entirely pushed to zero rather than spreading magnitude across all of $\Phi$. The Frobenius norm regularizer on $B$ ensures that $B$ does not become too large. Without this regularizer, all magnitude can be shifted to $B$, producing the same $\Phi B$, but pushing $\Phi$ to zero and nullifying the utility of its regularizer. Optimizing this sparse coding objective would select a sparse representation for each observation such that $\Phi B$ approximately reconstructs $X$.
Footnote 1: This variable can also be a base set of features, on which the agent can improve or which the agent can sparsify.
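The role of the Frobenius norm regularizer can be checked numerically. The sketch below uses random matrices and a hypothetical scaling factor `c`, neither of which comes from the paper. It shows that shifting magnitude from the representation into the dictionary leaves the reconstruction unchanged while shrinking the sparsity penalty, which is why the dictionary must also be regularized.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 4))   # representation: one row per observation
B = rng.normal(size=(4, 3))     # dictionary: one row per basis vector

c = 10.0
Phi_small, B_big = Phi / c, B * c   # shift magnitude from Phi into B

# The reconstruction Phi @ B is unchanged by the rescaling.
assert np.allclose(Phi @ B, Phi_small @ B_big)

# The l1 penalty on the representation shrinks by the factor c.
l1_before, l1_after = np.abs(Phi).sum(), np.abs(Phi_small).sum()

# The squared Frobenius norm of the dictionary grows by c**2, so the
# regularizer on B is what penalizes this degenerate rescaling.
fro_before, fro_after = (B ** 2).sum(), (B_big ** 2).sum()
```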
Further, however, we would like to learn a new representation that is also optimized towards approximating the value function. Towards this aim, we need to jointly learn $\Phi$ and $w$, where $\Phi w$ provides the approximate value function. In this way, the optimization must balance between accurately recreating $X$ and approximating the value function $V^\pi$. For this, we must choose an objective for learning $w$.
We consider two types of objectives: fixed-point objectives and squared-error objectives. Two common fixed-point objectives are the mean-squared Bellman error (MSBE), also called the Bellman residual (Baird, 1995),
$$\text{MSBE}(w) = \|\Phi w - (R^\pi + \gamma P^\pi \Phi w)\|_D^2$$
and the mean-squared projected BE (MSPBE) (Sutton et al., 2009)
$$\text{MSPBE}(w) = \|\Phi w - \Pi (R^\pi + \gamma P^\pi \Phi w)\|_D^2$$
where $D$ is a diagonal matrix giving a distribution over states, corresponding to the stationary distribution of the policy; $\|v\|_D^2 = v^\top D v$ is a weighted norm; and the projection matrix for linear value functions is $\Pi = \Phi (\Phi^\top D \Phi)^{-1} \Phi^\top D$. The family of TD algorithms converge to the minimum of the MSPBE, whereas residual gradient algorithms typically use the MSBE (see (Sun and Bagnell, 2015) for an overview). Both have useful properties (Scherrer, 2010), though arguably the MSPBE is more widely used.
There are also two alternative squared-error objectives that do not correspond to fixed-point equations: the mean-squared return error (MSRE) and the Bellman error (BE). For a trajectory of samples $\{(s_t, a_t, r_{t+1}, s_{t+1})\}_{t=0}^{T-1}$, the BE is defined as
$$\text{BE}(w) = \sum_{t=0}^{T-1} \left( r_{t+1} + \gamma \phi(s_{t+1})^\top w - \phi(s_t)^\top w \right)^2$$
and the MSRE as
$$\text{MSRE}(w) = \sum_{t=0}^{T-1} \left( g_t - \phi(s_t)^\top w \right)^2$$
where $g_t = \sum_{i=0}^{\infty} \gamma^i r_{t+1+i}$ is a sample return. In expectation, these objectives are, respectively,
$$\sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}\left[ \left( r + \gamma \phi(s')^\top w - \phi(s)^\top w \right)^2 \mid s \right] \quad \text{and} \quad \sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}\left[ \left( g - \phi(s)^\top w \right)^2 \mid s \right]$$
where $d(s)$ is the diagonal entry of $D$ for state $s$ and the expectation is w.r.t. the transition probabilities and taking actions according to policy $\pi$.
These differ from the fixed-point objectives because of the placement of the expectation. To see why, consider the MSBE and BE. The expected value of the BE is the expected squared error between the prediction from this state and the reward plus the value from a possible next state. The MSBE, on the other hand, is the squared error between the prediction from this state and the expected reward plus the expected value for the next state. Though the MSPBE and MSBE constitute the most common objectives chosen for reinforcement learning, these squared-error objectives have also been shown to be useful, particularly for learning online (Sun and Bagnell, 2015).
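The two squared-error objectives can be computed directly from one sampled episode. The following is a minimal sketch under stated assumptions: `Phi` holds one feature row per visited state $s_0, \ldots, s_T$, `rewards[t]` stores $r_{t+1}$, the terminal state has zero features, and the function names and the two-step episode are hypothetical.

```python
import numpy as np

def sample_returns(rewards, gamma):
    """Monte Carlo returns g_t = r_{t+1} + gamma * g_{t+1} for one episode."""
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g

def bellman_error(Phi, w, rewards, gamma):
    """Sum of squared TD errors; Phi has T+1 rows (features of s_0..s_T)."""
    v = Phi @ w
    delta = rewards + gamma * v[1:] - v[:-1]
    return float(np.sum(delta ** 2))

def return_error(Phi, w, rewards, gamma):
    """Sum of squared errors to the sampled returns (MSRE before averaging)."""
    g = sample_returns(rewards, gamma)
    return float(np.sum((g - (Phi @ w)[:-1]) ** 2))

# Hypothetical two-step episode with tabular features and a zero terminal row.
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
rewards = np.array([0.0, 1.0])
w_true = np.array([0.5, 1.0])      # exact values for gamma = 0.5
w_zero = np.zeros(2)

be_true = bellman_error(Phi, w_true, rewards, 0.5)
re_true = return_error(Phi, w_true, rewards, 0.5)
be_zero = bellman_error(Phi, w_zero, rewards, 0.5)
re_zero = return_error(Phi, w_zero, rewards, 0.5)
```

Both errors vanish at the exact values, but they weight mistakes differently elsewhere: the BE only compares successive predictions, whereas the MSRE compares each prediction with the full sampled return.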
For sparse coding, however, the MSPBE is not a suitable choice, compared to the MSBE, BE and MSRE, for two reasons. First, the MSBE, BE and MSRE are all convex in $\Phi w$, whereas the MSPBE is not. Second, because of the projection onto the space spanned by the features, the MSPBE can be solved with zero error for any features $\Phi$. Therefore, because it does not inform the choice of $\Phi$, the MSPBE produces a two-stage approach (footnote 2), where features are learned in a completely unsupervised way and prediction performance does not influence $B$.
Footnote 2: This problem seems to have been overlooked in two approaches for basis adaptation based on the MSPBE: the adaptive bases algorithm for the projected Bellman error (ABPBE) (Di Castro and Mannor, 2010, Algorithm 9) and mirror descent Q($\lambda$) with basis adaptation (Mahadevan et al., 2013). For example, for ABPBE, it is not immediately obvious this would be a problem, because a stochastic approximation approach is taken. However, if written as a minimization over the basis parameters and the weights, one would obtain a minimum error solution (i.e., error zero) immediately for any basis parameters. The basis parameters are considered to change on a slow timescale, and the weights on a fast timescale, which is a reflection of this type of separate minimization. Menache et al. (2005) avoided this problem by explicitly using a two-stage approach, using MSPBE approaches for learning the parameters and using another score function, such as the squared Bellman error, to update the bases. This basis learning approach, however, is unsupervised. Representation learning strategies for the MSPBE have been developed, by using local projections (Yu and Bertsekas, 2009; Bhatnagar et al., 2009). These strategies, however, do not incorporate sparse coding.
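The claim that the MSPBE can be driven to zero for any features can be checked numerically. The sketch below uses a hypothetical random six-state Markov chain and random feature matrices, and computes the TD fixed point in closed form; the function names and all numbers are illustrative assumptions rather than part of the paper's method.

```python
import numpy as np

def mspbe(Phi, w, P, r, d, gamma):
    """||Phi w - Pi (r + gamma P Phi w)||_D^2 for linear value functions."""
    D = np.diag(d)
    # Pi projects onto the span of the features under the D-weighted norm.
    Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
    resid = Phi @ w - Pi @ (r + gamma * P @ Phi @ w)
    return float(resid @ D @ resid)

def td_fixed_point(Phi, P, r, d, gamma):
    """Solve Phi^T D (I - gamma P) Phi w = Phi^T D r for w."""
    D = np.diag(d)
    A = Phi.T @ D @ (np.eye(len(d)) - gamma * P) @ Phi
    return np.linalg.solve(A, Phi.T @ D @ r)

rng = np.random.default_rng(0)
n, k, gamma = 6, 2, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)          # random row-stochastic matrix
r = rng.normal(size=n)

# Stationary distribution: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d /= d.sum()

# Three unrelated random feature matrices all reach (numerically) zero MSPBE.
errors = [mspbe(Phi, td_fixed_point(Phi, P, r, d, gamma), P, r, d, gamma)
          for Phi in (rng.normal(size=(n, k)) for _ in range(3))]
```

Because every choice of features attains the same minimum value, the MSPBE by itself gives no signal for preferring one representation over another.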
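The final objective for loss $L$ set to either MSBE, BE or MSRE is
$$\min_{\Phi,\, B,\, w} \; L(\Phi w) + \|\Phi B - X\|_D^2 + \beta_w \|w\|_2^2 + \beta_B \|B\|_F^2 + \beta_\phi \|\Phi\|_{D,1} \tag{2}$$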
4 Algorithm for Sparse Coding
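We now derive the algorithm for sparse coding for policy evaluation: SCoPE. We generically consider either the BE or MSRE. For a trajectory of samples $\{(s_t, a_t, r_{t+1}, s_{t+1})\}_{t=0}^{T-1}$, the objective is
$$\min_{\Phi,\, B,\, w} \; \frac{1}{T} L(\Phi w, y) + \frac{1}{T} \|\Phi B - X\|_F^2 + \beta_w \|w\|_2^2 + \beta_B \|B\|_F^2 + \beta_\phi \sum_{i=1}^{k} \|\Phi_{:,i}\|_1^p \tag{3}$$
where $L(\Phi w, y) = \sum_t (y_t + \gamma \Phi_{t+1,:} w - \Phi_{t,:} w)^2$ with $y_t = r_{t+1}$ for BE, and $L(\Phi w, y) = \sum_t (y_t - \Phi_{t,:} w)^2$ with $y_t = g_t$ for MSRE, and $\Phi_{t,:}$ is the representation for state $s_t$. We consider two possible powers for the $\ell_1$ norm, $p = 1$ or $p = 2$, where the theory relies on using $p = 2$, but in practice we find they perform equivalently and $p = 1$ provides a slightly simpler optimization. The loss is averaged by $T$, to obtain a sample average, which in the limit converges to the expected value under $D$. This averaged loss is also more scale-invariant, in terms of the number of samples, to the choice of regularization parameters.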
SCoPE consists of alternating amongst these three variables, $B$, $w$ and $\Phi$, with a proximal gradient update for the non-differentiable $\ell_1$ norm. The loss in terms of $B$ and $w$ is differentiable; to solve for $B$ (or $w$) with the other variables fixed, we can simply use gradient descent. To solve for $\Phi$ with $B$ and $w$ fixed, however, we cannot use a standard gradient descent update because the $\ell_1$ regularizer is non-differentiable. The proximal update consists of taking a gradient step on the smooth (differentiable) component of the objective and then projecting back to a sparse solution using the proximal operator: a soft thresholding operator. The convergence of this alternating minimization follows from results on block coordinate descent for non-smooth regularizers (Xu and Yin, 2013).
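The alternating scheme can be sketched as follows for the MSRE loss with $p = 1$. This is an illustrative sketch under several assumptions, not the authors' implementation: the updates for the dictionary and weights use exact ridge-regression solves in place of gradient descent, the step size for the representation comes from a spectral-norm Lipschitz bound derived for this sketch only, and the function names, default regularization weights and synthetic data are all hypothetical.

```python
import numpy as np

def soft_threshold(A, tau):
    """Proximal operator of tau * ||.||_1: entrywise shrinkage towards zero."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def objective(Phi, B, w, X, y, b_w, b_B, b_phi):
    """Sampled objective for the MSRE loss with an entrywise l1 penalty."""
    T = X.shape[0]
    smooth = (np.sum((Phi @ w - y) ** 2) + np.sum((Phi @ B - X) ** 2)) / T
    return smooth + b_w * (w @ w) + b_B * np.sum(B ** 2) + b_phi * np.abs(Phi).sum()

def scope_msre(X, y, k, b_w=1e-2, b_B=1e-2, b_phi=1e-2, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    T, d = X.shape
    Phi = rng.normal(size=(T, k))
    history = []
    for _ in range(iters):
        # B and w: exact minimizers with Phi fixed (ridge regression),
        # swapped in here for the gradient descent steps described in the text.
        G = Phi.T @ Phi / T
        B = np.linalg.solve(G + b_B * np.eye(k), Phi.T @ X / T)
        w = np.linalg.solve(G + b_w * np.eye(k), Phi.T @ y / T)
        # Phi: one proximal gradient step on (1/T) * ||Phi M - Y||_F^2.
        M = np.column_stack([B, w])        # k x (d + 1) joint factor [B, w]
        Y = np.column_stack([X, y])        # T x (d + 1) joint target [X, y]
        grad = (2.0 / T) * (Phi @ M - Y) @ M.T
        # Lipschitz constant of the gradient; guarded against an all-zero M.
        lip = max((2.0 / T) * np.linalg.norm(M @ M.T, 2), 1e-12)
        Phi = soft_threshold(Phi - grad / lip, b_phi / lip)
        history.append(objective(Phi, B, w, X, y, b_w, b_B, b_phi))
    return Phi, B, w, history

# Synthetic data: 50 two-dimensional observations with made-up return targets.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2
Phi, B, w, history = scope_msre(X, y, k=8)
```

Each block update cannot increase the objective, so the recorded objective values are non-increasing across iterations; this is a convenient sanity check when experimenting with step sizes or regularization weights.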
To apply the standard proximal operator for the regularizer, we need to compute an upper bound on the Lipschitz constant for this objective. The upper bound is
, computed by finding the maximum singular value of the Hessian of the objective w.r.t.
for each . We will provide additional details for this calculation, and implementation details, in a supplement.
4.1 Local Minima Are Global Minima
In this section, we show that despite nonconvexity, the objective for SCoPE has the nice property that all local minima are in fact global minima. Consequently, though there may be many different local minima, they are in fact equivalent in terms of the objective. This result justifies a simple alternating minimization scheme, where convergence to local minima ensures an optimal solution is obtained.
Figure 1: Learning curves for SCoPE versus a variety of tile coding representations in three domains. The graphs depict early learning; the numbers in brackets correspond to final error, after 5000 samples. Because errors are sampled every 50 samples, and because MSRE is used for optimization, the lines have an irregular pattern. The differences are nonetheless statistically significant, with an average over 50 runs, and so the small standard error bars are omitted. SCoPE outperforms the best of the TC representations in Mountain Car and Acrobot using a more compact sparse representation; in Puddle World, it performs more poorly, which we discuss further in the text. The larger TC representations likely perform poorly due to hashing.
We need the following technical assumption. It is guaranteed to be true for a sufficiently large $k$ (see Haeffele and Vidal (2015); Le and White (2017)).
Assumption 1 For the given , the following function is convex in
Theorem 1 (Landscape of the SCoPE objective).
For the objective in equation (3) with $p = 2$,
1. under Assumption 1, all full-rank local minima are global minima; and
2. if a local minimum has $\Phi_{:,i} = 0$ (i.e., a zero column) and $B_{i,:} = 0$, $w_i = 0$ (i.e., a zero row) for some $i$, then it is a global minimum.
Proof.
For the first statement, we construct a limit of twice-differentiable functions $f_\mu$ that converge to the SCoPE objective $f$. With this, we can then show that all minimizers of the sequence converge to minimizers of $f$, and vice-versa (Braides, 2013). Because all local minimizers of the twice-differentiable functions are global minimizers from (Le and White, 2017, Theorem 10), we can conclude that all corresponding minimizers of $f$ are global minimizers.
We use the pseudo-Huber loss (Fountoulakis and Gondzio, 2013), which is a twice-differentiable approximation to the absolute value: $|x|_\mu = \sqrt{\mu^2 + x^2} - \mu$. Let $\|z\|_\mu = \sum_j |z_j|_\mu$ for a vector $z$. The sequence of functions $f_\mu$ are defined with $\mu \to 0$, as
$$f_\mu(\Phi, B, w) = g(\Phi, B, w) + \beta_\phi \sum_{i=1}^{k} \|\Phi_{:,i}\|_\mu^2$$
where $g$ equals the equation in (3), but without the regularizer on $\Phi$.
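As a short worked bound, the scalar pseudo-Huber function stays within $\mu$ of the absolute value everywhere. The derivation below assumes the standard form $|x|_\mu = \sqrt{\mu^2 + x^2} - \mu$ with $\mu > 0$; it concerns only the scalar approximation and is a supporting calculation, not a step taken from the original proof.

```latex
% Assumed (standard) pseudo-Huber form: |x|_\mu = \sqrt{\mu^2 + x^2} - \mu, with \mu > 0.
\begin{align*}
|x| - |x|_\mu &= |x| + \mu - \sqrt{\mu^2 + x^2} \\
  &\ge 0 && \text{since } (|x| + \mu)^2 = x^2 + 2\mu|x| + \mu^2 \ge \mu^2 + x^2, \\
|x| - |x|_\mu &\le \mu && \text{since } \sqrt{\mu^2 + x^2} \ge |x|.
\end{align*}
% Hence 0 \le |x| - |x|_\mu \le \mu for every x, so |x|_\mu \to |x| uniformly in x as \mu \to 0.
```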
Part 1: All local minima of $f_\mu$ for all $\mu > 0$ are global minima. To show this, we show each $f_\mu$ satisfies the conditions of (Le and White, 2017, Theorem 10 and Proposition 11).
Part 1.1 We can rewrite the loss in terms of
(4) 
where a diagonal matrix of all ones with the first diagonal entry set to zero, and with the last diagonal entry set to zero. This loss is convex in the joint variable because equation (4) is the composition of a convex function (squared norm) and an affine function (multiplication by and addition of ).
Part 1.2 The regularizer on $[B \; w]$ must be a weighted Frobenius norm, with weightings on each column; here, we have weighting using regularization parameter $\beta_B$ for the first $d$ columns (corresponding to $B$) and regularization parameter $\beta_w$ for the last column (corresponding to $w$).
Part 1.3 The inner dimension , which is true by assumption and the common setting for sparse coding.
Part 1.4 The pseudo-Huber loss, on the columns of $\Phi$, is convex, centered and twice-differentiable.
Part 2: The sequence $f_\mu$ converges uniformly to $f$. To see why, recall the definition of uniform convergence. A sequence of functions $\{f_n\}$ is uniformly convergent with limit $f$ if for every $\epsilon > 0$, there exists $N$ such that for all $n \ge N$ and all $x$, $|f_n(x) - f(x)| < \epsilon$. Further recall that for any complete metric space, if $\{f_n\}$ is uniformly Cauchy, then it is uniformly convergent. The sequence is uniformly Cauchy if for all $\epsilon > 0$, there exists $N$ such that $\sup_x |f_m(x) - f_n(x)| < \epsilon$ for all $m, n \ge N$. Take any and let . Then
The upper bound of the first component is maximized when , and so we get
Part 3: Asymptotic equivalence of minimizers of $f_\mu$ and $f$. Because $f_\mu$ is continuous, and so lower semi-continuous, and uniformly converges to $f$, we know that $f_\mu$ $\Gamma$-converges to $f$ (Braides, 2013).
By the fundamental theorem of $\Gamma$-convergence, if $\{f_\mu\}$ is an equi-coercive family of functions, then the minimizers of $f_\mu$ converge to minimizers of $f$. A sequence of functions $\{f_\mu\}$ is equi-coercive iff there exists a lower semi-continuous coercive function $\psi$ such that $f_\mu \ge \psi$ on the domain for every $\mu$ (Dal Maso, 2012, Proposition 7.7). A function $\psi$ is coercive if $\psi(x) \to \infty$ as $\|x\| \to \infty$. For $\psi = g$, it is clear that $\psi$ is coercive, as well as lower semi-continuous (since it is continuous). Further, $f_\mu \ge \psi$, because the regularizer on $\Phi$ is non-negative. Therefore, the family $\{f_\mu\}$ is equi-coercive, and so the minimizers of $f_\mu$ converge to minimizers of $f$.
For the other direction, if a local minimum $x$ of $f$ is an isolated local minimum, then there exists a sequence $x_\mu \to x$ with $x_\mu$ a local minimizer of $f_\mu$ for sufficiently small $\mu$ (Braides, 2013, Theorem 5.1). Because we have Frobenius norm regularizers on $B$ and $w$, which are strongly convex, the objective is strictly convex with respect to $(B, w)$. Further, because $B$ is full rank, $\|\Phi B - X\|_F^2$ is a strictly convex function with respect to $\Phi$. Therefore, locally the objective is strictly convex with respect to $(\Phi, B, w)$. We therefore know that local minima of $f$ are isolated, and so there exists an $\mu_0$ such that for all $\mu < \mu_0$, $x_\mu$ are local minimizers of $f_\mu$. Since these local minimizers are global minimizers, and they converge to $x$, this means $x$ is a global minimum of $f$.
For the second statement, we use (Haeffele and Vidal, 2015, Theorem 15). Because we already showed above that our loss can be cast as factorization, it is clear our loss and regularizers are positively homogeneous, of order 2. A minimum is guaranteed to exist for our objective, because the loss function is continuous, bounded below (by zero) and goes to infinity as the parameters go to $\pm\infty$. ∎
5 Experimental Results
We aim to address the question: can we learn useful representations using SCoPE? We therefore tackle the setting where the representation is first learned, and then used, to avoid conflating incremental estimation and the utility of the representation. We particularly aim to evaluate estimation accuracy, as well as qualitatively understanding the types of sparse representations learned by SCoPE.
Domains. We conducted experiments in three benchmark RL domains: Mountain Car, Puddle World and Acrobot (Sutton, 1996). All domains are episodic, with discount set to 1 until termination. The data in Mountain Car is generated using the standard energy-pumping policy with 10% randomness. The data in Puddle World is generated by a policy that chooses to go North with 50% probability, and East with 50% probability on each step, with the starting position in the lower-left corner of the grid, and the goal in the top-right corner. The data in Acrobot is generated by a near-optimal policy.
Evaluation. We measure value function estimation accuracy using mean absolute percentage value error (MAPVE), with rollouts to compute the true value estimates: $\text{MAPVE} = \frac{1}{M} \sum_{s \in \mathcal{S}_{\text{test}}} \frac{|\hat{V}(s) - V^*(s)|}{|V^*(s)|}$, where $\mathcal{S}_{\text{test}}$ is the set of test states, $M$ is the number of samples in the test set, $\hat{V}(s)$ is the estimated value of state $s$ and $V^*(s)$ is the true value of state $s$ computed using extensive rollouts. Errors are averaged over 50 runs.
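A minimal sketch of the error measure, assuming it is reported as a fraction rather than multiplied by 100 and that the true values of all test states are non-zero; the function name `mapve` and the example values are hypothetical.

```python
import numpy as np

def mapve(v_hat, v_true):
    """Mean absolute percentage value error over a set of test states."""
    v_hat = np.asarray(v_hat, dtype=float)
    v_true = np.asarray(v_true, dtype=float)
    # Each absolute error is normalized by the magnitude of the true value.
    return float(np.mean(np.abs(v_hat - v_true) / np.abs(v_true)))

# Two hypothetical test states, each estimated with a 10% relative error.
example_error = mapve([-90.0, -45.0], [-100.0, -50.0])
```

Normalizing by the true value makes errors comparable across states whose returns differ greatly in scale, which matters in episodic domains where states near the goal have much smaller returns.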
Algorithms. We compare to using several fixed tile coding (TC) representations. TC uses overlapping grids on the observation space. It is a sparse representation that is well known to perform well for Mountain Car, Puddle World, and Acrobot. We varied the granularity of the grid size N and number of tilings D, where D is the number of active features for each observation. The grid is either N×N for Mountain Car and Puddle World or N×N×N×N for Acrobot. We explore (D=4, N=4), (D=4, N=8), (D=16, N=4), (D=16, N=8), (D=32, N=4), (D=32, N=8); a grid size of 16 performed poorly, and so is omitted. For Mountain Car and Puddle World the numbers of features are, respectively, 64, 256, 256, 1024, 512, 2048, then hashed to 1024 dimensions; for Acrobot, the numbers of features are 1024, 16384, 4096, 65536, 8192, 131072, then hashed to 4096. Both of these hashed sizes are much larger than our chosen $k$.
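The feature counts quoted above follow from multiplying the number of tilings by the number of cells in each grid. The sketch below reproduces them, assuming two-dimensional observations for Mountain Car and Puddle World and four-dimensional observations for Acrobot; the helper name `tile_coding_size` is hypothetical.

```python
def tile_coding_size(num_tilings, grid_size, obs_dim):
    """Number of tiles before hashing: D tilings, each an N**obs_dim grid."""
    return num_tilings * grid_size ** obs_dim

# (D, N) pairs explored in the text.
configs = [(4, 4), (4, 8), (16, 4), (16, 8), (32, 4), (32, 8)]

two_dim = [tile_coding_size(D, N, 2) for D, N in configs]    # Mountain Car, Puddle World
four_dim = [tile_coding_size(D, N, 4) for D, N in configs]   # Acrobot
```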
For consistency, once the SCoPE representation is learned, we use the same batch gradient descent update on the MSRE for all the algorithms, with line search to select step-sizes. The regularization weights are chosen from , based on lowest cumulative error. For convenience, is fixed to be the same as . For learning the SCoPE representations, regularization parameters were chosen using 5-fold cross-validation on 5000 training samples, with $\beta_\phi$ fixed to give a reasonable level of sparsity. This data is only used to learn the representation; for the learning curves, the weights are learned from scratch in the same way they are learned for TC. The dimension $k$ is set to be smaller than for tile coding, to investigate if SCoPE can learn a more compact sparse representation. We tested unsupervised sparse coding, but the error was poor (approximately worse). We discuss the differences between the representations learned by supervised and unsupervised sparse coding below.
Learning curves. We first demonstrate learning with increasing number of samples, in Figure 1. The weights are recomputed using the entire batch up to the given number of samples.
Across domains, SCoPE results in faster learning and, in Mountain Car and Acrobot, obtains the lowest final error. Matching the performance of TC is meaningful, as TC is well-understood and optimized for these domains. For Acrobot, it is clear that a larger TC is needed, resulting in relatively poor performance, whereas SCoPE can still perform well with a compact, learned sparse representation. These learning curves provide some insight that we can learn effective sparse representations with SCoPE, but also raise some questions. One issue is that SCoPE is not as effective in Puddle World as some of the TC representations, namely (D=4, N=4) and (D=16, N=4). The reason for this appears to be that we optimize MSRE to obtain the representation, which is a surrogate for the MAPVE. When measuring MSRE instead of MAPVE on the test data, SCoPE consistently outperforms TC. Optimizing both the representation and weights according to MSRE may have overfitting issues; extensions to MSBE or BE, or improvements in selecting regularization parameters, may alleviate this issue.
Learned representations. We also examine the learned representations, both for unsupervised sparse coding and SCoPE, shown in Figure 2. We draw two conclusions from these results: the structure in the observations is not sufficient for unsupervised sparse coding, and the combination of supervised and unsupervised losses sufficiently constrains the space to obtain discriminative representations. For these two-dimensional and four-dimensional observations, it is relatively easy to reconstruct the observations by using only a small subset of dictionary atoms (row vectors of $B$ in equation (2)). The unsupervised representations, even with additional non-negativity constraints to narrow the search space, are less distributed, with darker and thicker blocks, and more frequently pick fewer features. For the supervised sparse coding representation, however, the sparsity pattern is smoother and more distributed: more features are selected by at least one sample, but the level of sparsity is similar. We further verified the utility of supervised sparse coding, by only optimizing the supervised loss (MSRE), without including the unsupervised loss; the resulting representations looked similar to the purely unsupervised representations. The combination of the two losses, therefore, much more effectively constrains or regularizes the space of feasible representations and improves discriminative power.
The learning demonstrated for SCoPE here is under ideal conditions. This was intentionally chosen to focus on the question: can we learn effective sparse representations using the SCoPE objective? With the promising results here, future work needs to investigate the utility of jointly estimating the representation and learning the value function, as well as providing incremental algorithms for learning the representations and setting the regularization parameters.
6 Conclusion
In this work, we investigated sparse coding for policy evaluation in reinforcement learning. We proposed a supervised sparse coding objective, for joint estimation of the dictionary, sparse representation and value function weights. We provided a simple algorithm that uses alternating minimization on these variables, and proved that this simple and easy-to-use approach is principled. We finally demonstrate results on three benchmark domains, Mountain Car, Puddle World and Acrobot, against a variety of configurations for tile coding.
This paper provides a new view of using dictionary learning techniques from machine learning in reinforcement learning. It lays a theoretical and empirical foundation for further investigating sparse coding, and other dictionary learning approaches, for policy evaluation and suggests that they show some promise. Formalizing representation learning as a dictionary learning problem facilitates extending recent and upcoming advances in unsupervised learning to the reinforcement learning setting. For example, though we considered a batch gradient descent approach for this first investigation, the sparse coding objective is amenable to incremental estimation, with several works investigating effective stochastic gradient descent algorithms
(Mairal et al., 2009, 2010; Le and White, 2017). The generality of the approach and easy-to-understand optimization make it a promising direction for representation learning in reinforcement learning.
References
 Agarwal et al. [2014] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, and Praneeth Netrapalli. Learning sparsely used overcomplete dictionaries via alternating minimization. In Ann. Conf. on Learning Theory, 2014.
 Aharon et al. [2006] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing, 2006.
 Ammar et al. [2012] Haitham B Ammar, Karl Tuyls, Matthew E Taylor, Kurt Driessens, and Gerhard Weiss. Reinforcement learning transfer via sparse coding. In Inter. Conf. on Autonomous Agents and Multiagent Systems, 2012.
 Arora et al. [2015] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. arXiv:1503.00778v1 [cs.LG], 2015.
 Atkeson and Morimoto [2003] Christopher G Atkeson and Jun Morimoto. Nonparametric representation of policies and value functions: a trajectory-based approach. In Advances in Neural Information Processing Systems, 2003.
 Baird [1995] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Inter. Conf. on Mach. Learning, 1995.
 Bhatnagar et al. [2009] Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, 2009.
 Braides [2013] Andrea Braides. Local minimization, variational evolution and Γ-convergence. Lecture Notes in Mathematics, 2013.
 Coulom [2002] Rémi Coulom. Reinforcement learning using neural networks, with applications to motor control. PhD thesis, INPG, 2002.
 Dal Maso [2012] Gianni Dal Maso. An Introduction to Γ-Convergence. Springer, 2012.
 Di Castro and Mannor [2010] Dotan Di Castro and Shie Mannor. Adaptive bases for Q-learning. In Conference on Decision and Control, 2010.
 Fountoulakis and Gondzio [2013] Kimon Fountoulakis and Jacek Gondzio. A second-order method for strongly convex ℓ1-regularization problems. Math. Prog., 2013.
 Haeffele and Vidal [2015] Benjamin D Haeffele and Rene Vidal. Global Optimality in Tensor Factorization, Deep Learning, and Beyond. arXiv.org, 2015.
 Kolter and Ng [2009] J Zico Kolter and Andrew Y Ng. Regularization and feature selection in least-squares temporal difference learning. In Inter. Conf. on Mach. Learning, 2009.
 Konidaris et al. [2011] George Konidaris, Sarah Osentoski, and Philip Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Inter. Conf. on Mach. Learning, 2011.
 Le and White [2017] Lei Le and Martha White. Global optimization of factor models using alternating minimization. arXiv.org:1604.04942v3, 2017.
 Loth et al. [2007] Manuel Loth, Manuel Davy, and Philippe Preux. Sparse temporal difference learning using LASSO. In Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.
 Mahadevan and Maggioni [2007] Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: a Laplacian framework for learning representation and control in Markov decision processes. J. Machine Learning Research, 2007.
 Mahadevan et al. [2013] Sridhar Mahadevan, Stephen Giguere, and Nicholas Jacek. Basis Adaptation for Sparse Nonlinear Reinforcement Learning. In AAAI Conference on Artificial Intelligence, 2013.
 Mahadevan [2009] Sridhar Mahadevan. Learning Representation and Control in Markov Decision Processes: New Frontiers. Foundations and Trends® in Machine Learning, 2009.
 Mairal et al. [2009] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Supervised dictionary learning. In Advances in Neural Information Processing Systems, 2009.
 Mairal et al. [2010] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. J. Machine Learning Res., 2010.
 Menache et al. [2005] Ishai Menache, Shie Mannor, and Nahum Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 2005.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc Bellemare, Alex Graves, Martin Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
 Nguyen et al. [2013] Trung Nguyen, Zhuoru Li, Tomi Silander, and Tze Yun Leong. Online feature selection for model-based reinforcement learning. J. Machine Learning Research, 2013.
 Olshausen and Field [1997] Bruno Olshausen and David Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 1997.
 Painter-Wakefield and Parr [2012] Christopher Painter-Wakefield and Ronald Parr. Greedy algorithms for sparse reinforcement learning. In Inter. Conf. on Mach. Learning, 2012.
 Parr et al. [2008] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Inter. Conf. on Mach. Learning, 2008.
 Ratitch and Precup [2004] Bohdana Ratitch and Doina Precup. Sparse distributed memories for online value-based reinforcement learning. In ECML PKDD, 2004.
 Riedmiller [2005] Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML PKDD, 2005.
 Scherrer [2010] Bruno Scherrer. Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view. In Inter. Conf. on Mach. Learning, 2010.
 Singh and Gordon [2008] Ajit Singh and Geoffrey Gordon. A unified view of matrix factorization models. In ECML PKDD, 2008.
 Stanley and Miikkulainen [2002] Kenneth O Stanley and Risto Miikkulainen. Efficient evolution of neural network topologies. In CEC, 2002.
 Sun and Bagnell [2015] Wen Sun and J Andrew Bagnell. Online Bellman Residual Algorithms with Predictive Error Guarantees. In Conference on Uncertainty in Artificial Intelligence, 2015.
 Sutton and Barto [2013] Richard S Sutton and Andrew G Barto. Representation search through generate and test. In Proceedings of the AAAI Workshop on Learning Rich Representations from LowLevel Sensors, 2013.
 Sutton and Whitehead [1993] Richard S Sutton and Steven D Whitehead. Online learning with random representations. In Inter. Conf. on Mach. Learning, 1993.
 Sutton et al. [2009] Richard S Sutton, Hamid Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Inter. Conf. on Mach. Learning, 2009.
 Sutton [1996] Richard S Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, 1996.
 Whiteson et al. [2007] Shimon Whiteson, Matthew E Taylor, and Peter Stone. Adaptive tile coding for value function approximation. Technical report, 2007.
 Xu and Yin [2013] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sciences, 2013.
 Yu and Bertsekas [2009] Huizhen Yu and Dimitri P Bertsekas. Basis function adaptation methods for cost approximation in MDP. In Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.