1 Introduction
Learning performance in artificial intelligence systems is highly dependent on the data representation: the features. An effective representation captures important attributes of the state (or instance), and simplifies the estimation of predictors. Consider a reinforcement learning agent. A local representation enables the agent to more feasibly make accurate predictions in a local region, because the local dynamics are likely to be a simpler function than the global dynamics. Additionally, such a representation can help prevent forgetting or interference
[McCloskey and Cohen1989, French1991], by only updating local weights, as opposed to dense representations where any update would modify many weights. At the same time, it is important to have a distributed representation
[Bengio2009, Bengio, Courville, and Vincent2013], where the representation for an input is distributed across multiple features or attributes, promoting generalization and a more compact representation. Such properties can be well captured by sparse representations: those for which only a few features are active for a given input (Figure 1). Enforcing sparsity promotes identifying key attributes, because it encourages the input to be well described by a small subset of attributes. Sparsity, then, promotes locality, because local inputs are likely to share similar attributes (similar activation patterns), with less overlap to non-local inputs. In fact, many handcrafted features are sparse representations, including tile coding [Sutton1996, Sutton and Barto1998], radial basis functions and sparse distributed memory [Kanerva1988, Ratitch and Precup2004]. Other useful properties of sparse representations, which can be seen as projecting data into a higher-dimensional space, include invariance [Goodfellow et al.2009, Rifai et al.2011]; decorrelated features per instance [Földiák1990]; improved computational efficiency for updating weights in the predictor, as only weights corresponding to active features need to be updated; and linear separability in the high-dimensional space [Cover1965], which facilitates the learning of a simple linear predictor. Further, such sparse, distributed representations have been observed in the brain [Olshausen and Field1997, Quian Quiroga and Kreiman2010, Ahmad and Hawkins2015].

Traditionally, sparse representations have been common for control in reinforcement learning, in the form of tile coding and radial basis functions [Sutton and Barto1998]. They are effective for incremental learning, but can be difficult to scale to high-dimensional inputs, because they grow exponentially with input dimension. Neural networks scale much more feasibly to high-dimensional inputs, such as images, but can be problematic when used with incremental training. Instead, techniques like target networks, inspired by batch methods such as fitted Q-iteration [Riedmiller2005], have been necessary for many of the successes of control with neural networks. We provide some evidence in this paper that this modification is necessary with dense, but not sparse, networks because the reinforcement learning agent bootstraps off its own estimates: if the values in other states are overwritten, the agent will bootstrap off inaccurate estimates. Local representations, however, are much less likely to suffer from interference and these issues with bootstrapping. Learned sparse representations, then, are a promising strategy to obtain the benefits of previously common, fixed sparse representations with the scaling of neural networks.
Learning sparse representations, however, remains a challenge. There have been some approaches developed for learning sparse representations incrementally, particularly factorization approaches for dictionary learning [Mairal et al.2009, Mairal et al.2010, Le, Kumaraswamy, and White2017] and approaches for general sparse distributions [Olshausen and Field1997, Olshausen2002, Teh et al.2003, Ranzato et al.2006, Ranzato, Boureau, and LeCun2007, Lee et al.2008]
, like Boltzmann machines. In sparse coding, for example, the sparse representation learning problem is formulated as a matrix factorization, where input instances are reconstructed using a small subset of a large dictionary. Many of the methods for general sparse distributions, however, are expensive or complex to train, and those based on sparse coding have been found to have serious out-of-sample issues
[Mairal et al.2009, Lemme, Reinhart, and Steil2012, Le, Kumaraswamy, and White2017]. There are fewer methods using feedforward neural network architectures. Certain activation functions, such as linear threshold units (LTU) [McCulloch and Pitts1943] and rectified linear units (ReLU) [Glorot, Bordes, and Bengio2011], naturally provide some level of sparsity, but of course provide no such guarantees. Early work on catastrophic interference investigated some simple heuristics for encouraging sparsity, such as node sharpening [French1991]. Though catastrophic interference was reduced, the resulting networks were still quite dense.^1 k-sparse autoencoders [Makhzani and Frey2013] use a top-k constraint per instance: only the k nodes with the largest activations are kept, and the rest are zeroed. Winner-Take-All autoencoders [Makhzani and Frey2015] use a k% response constraint per node across instances, during training, to promote sparse activations of each node over time. These approaches, however, can be problematic, as we reaffirm in this work, because they tend to truncate non-negligible values or produce insufficiently sparse representations. Another line of work has investigated learning or specifying sparse activation functions for neural networks [Triesch2005, Ranzato et al.2006, Lemme, Reinhart, and Steil2012, Arpit et al.2015], but used a sigmoid activation, which is unlikely to result in sparse representations, and defined sparsity based on norms of the activation vector rather than on activation level.

^1 There have been strategies developed for catastrophic interference that rely on rehearsal or on dedicating sub-parts of the network to particular tasks. That work is a complementary direction, for understanding catastrophic interference in a sequential multi-task setting. We explore specifically the utility of sparse representations for alleviating interference for RL agents learning incrementally on one task, but do not claim that it is the only strategy to alleviate such interference. The comparisons in this work, therefore, focus on other strategies to learn sparse representations.
In this work, we first highlight that learned sparse representations can significantly improve control performance, in an incremental learning setting, compared to dense neural networks. We visualize the activations of the hidden nodes for the sparse representation, as well as the action-values for particular states. These provide evidence that locality helps avoid catastrophic interference and improves the accuracy of action-values used for bootstrapping. We then investigate a simple strategy for encouraging sparsity in neural networks: Distributional Regularizers. This approach flexibly enables any desired architecture, simply with the addition of a KL divergence on the activation distribution of each node. We show that direct use of such a regularizer can cause dead filters or collapse, with activation concentrating on a few nodes, potentially explaining why this simple strategy has not yet found widespread use. We show that a simple clipping is sufficient to obtain effective sparse representations, and conclude with a comparison to several other strategies for obtaining a sparse representation on the same benchmark domains.
2 Background
In reinforcement learning (RL), an agent interacts with its environment, receiving observations and selecting actions to maximize a reward signal. The environment is formalized as a Markov decision process (MDP), with states S, actions A, transition probabilities P, rewards R and discount function γ [White2017]. One algorithm for on-policy control is Sarsa, where the agent updates the action-values for its current policy and acts near-greedily according to these action-values. The action-values for a policy π are the expected return for that policy, starting from state s and action a:
(1)  Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ],  with return G_t = R_{t+1} + γ_{t+1} G_{t+1}
These action-values can be estimated with function approximation, such as with a neural network. Because the expected return is a real-valued target, such a neural network typically uses a linear activation on the last layer:

(2)  Q(s, a) = θ_a^T φ_w(s)

where θ_a is the vector of weights in the last layer for action a, and φ_w(s) is the representation learned by the network, composed of all the hidden layers in the network, with weights w. The efficacy of the action-value approximation, therefore, relies on this representation φ_w.
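To make Equation (2) concrete, the following minimal sketch computes action-values with a two-hidden-layer ReLU network in NumPy; the layer sizes, initialization and function names are illustrative, not those used in our experiments:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def q_values(s, W1, b1, W2, b2, theta):
    """Action-values Q(s, .) = theta @ phi_w(s): a linear last layer on top
    of the representation phi_w(s) produced by the hidden layers."""
    h1 = relu(W1 @ s + b1)       # first hidden layer
    phi = relu(W2 @ h1 + b2)     # representation phi_w(s)
    return theta @ phi           # one linear output per action

# Illustrative sizes: 2-d state, 32 hidden units per layer, 3 actions.
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((32, 2)), np.zeros(32)
W2, b2 = 0.1 * rng.standard_normal((32, 32)), np.zeros(32)
theta = 0.1 * rng.standard_normal((3, 32))
q = q_values(np.array([0.1, -0.2]), W1, b1, W2, b2, theta)  # shape (3,)
```

In the incremental setting studied below, only the last-layer weights (theta here) are updated by the Sarsa(0) agent; the hidden-layer weights are fixed after pre-training.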
3 The Utility of Sparsity for Control
We begin by highlighting the utility of sparsity for control, before discussing how to learn sparse representations. We show that two sparse representations, tile coding and a sparse representation learned by a neural network (referred to as SRNN from here on), both significantly improve stability in control. We choose tile coding, a static representation, as a baseline, as it is known to perform very well in the benchmark RL domains we experiment with [Sutton and Barto1998]. We hypothesize that the main reason for this improvement is reduced catastrophic interference, which is much less problematic for the local representations typically provided by sparse representations. We show both that SRNN appears to have more stable action-values for bootstrapping, and that the learned sparse representation is local, providing some evidence for this hypothesis.
We evaluate control performance on four benchmark domains: Mountain Car, Puddle World, Acrobot and Catcher. All domains are episodic, with discount γ = 1 until termination. We choose these domains because they are well understood and typically considered relatively simple. A priori, it would be expected that a standard action-value method, like Sarsa with a two-layer neural network, should be capable of learning a near-optimal policy in all four of these domains. We provide details about the domains in the Appendix.
The experimental setup is as follows. To extract a representation with a neural network, to be used for control, we pre-train the neural network on a batch of data with a mean-squared temporal-difference error (MSTDE) objective and the applicable regularization strategies. The training data consists of trajectories generated by a fixed policy that explores much of the space in each domain. For the SRNN, we use our distributional regularization strategy, described in a later section. This learned representation is then fixed, and used by a (fully incremental) Sarsa(0) agent for learning a control policy, where only the weights on the last layer are updated. The meta-parameters for the batch-trained neural network producing the representation and for the Sarsa agent were swept in a wide range, and chosen based on control performance. The aim is to provide the best opportunity for a regular feedforward network (NN) to learn on these problems, as it is more sensitive to its meta-parameters than the SRNN. Additional details on ranges and objectives are provided in the Appendix.
We choose this two-stage training regime to remove confounding factors in the difficulties of training neural networks incrementally. Our goal here is to identify whether a sparse representation can improve control performance, and if so, why. The networks are trained with an objective for learning values, on a large batch of data generated by a policy that covers the space; the learned representations are capable of representing the optimal policy. We then investigate their utility for fully incremental learning. Outside of this carefully controlled experiment, we advocate for learning the representation incrementally, for the task faced by the agent.
The learning curves for the four domains, with Tile Coding (TC), SRNN and NN, are shown in Figure 2. Both SRNN and NN used two hidden layers with ReLU activations. The NN performs surprisingly poorly, in some cases increasing and then decreasing in performance (Mountain Car), and in others failing altogether (Catcher). In all the benchmark RL domains, the baseline sparse representation, TC, performs well, as expected. In Catcher specifically, TC learns a close-to-optimal policy, as the representation is powerful. The learned SRNN performs as well in all domains, and is effective for learning in Catcher, whereas NN performs poorly in all domains and does not learn at all in Catcher. Both the SRNN and NN representations were trained in the same regime, with similar representational capabilities. Yet the sparsity of SRNN enables the Sarsa(0) agent to learn, where the regular feedforward NN does not. We investigate this effect further in the next sets of experiments, to better understand the phenomenon.
To determine if the main impact of the sparse representation is simply from regularization, preventing overfitting, we tested several regularization strategies for the neural network. These include ℓ2 and ℓ1 regularization on the weights of the network (NN-ℓ2 and NN-ℓ1 respectively) and Dropout on the activations [Srivastava et al.2014] (Dropout-NN). The ℓ1 regularizer encourages weights to go to zero, reducing the number of connections, but does not necessarily provide a sparse representation. In Figure 3, we can see that regularization is unlikely to account for the improvements in control. SRNN performs well across all domains, whereas none of the regularization strategies consistently performs well. NN-ℓ2 and NN-ℓ1 perform well in Mountain Car during early learning, but fail in other domains. Dropout-NN performs poorly in all domains except Puddle World. Interestingly, in this one domain, Dropout-NN appears to have learned a sparse representation, based on the heatmap shown in Figure 4. It has been observed that Dropout can at times learn sparse representations [Banino et al.2018], but not consistently, as corroborated by our experiments.
We next investigate the hypothesis that locality is preventing catastrophic interference. We first investigate the locality of the representations, as well as examining the bootstrap values over time. We show results for Puddle World here, as it is an interpretable twodimensional domain; similar experiments for other domains are in the Appendix.
Figure 4(a) shows the activation maps of randomly selected hidden neurons for the different networks. We can see that each hidden neuron in SRNN only responds to a local region of the input space, while some hidden neurons in NN respond to a large part of the space. Consequently, when one state is updated in a part of the space with the NN representation, it is more likely to significantly shift the values in other parts of the space, as compared to the more local SRNN. The NN-ℓ1 and NN-ℓ2 representations do not exhibit any discernible locality properties. Dropout-NN does achieve some degree of locality in this domain, as mentioned earlier.
To show the stability (or lack thereof) of the bootstrap targets used during control, we select five states and evaluate their action-values for the optimal action over the course of learning. These states are distributed across the observation space, as depicted in Figure 4(b). The bootstrap estimates, corresponding to the algorithm settings used for the learning curves, are plotted in Figure 4(c). We can see that the relative ordering of the value estimates is maintained with SRNN and Dropout-NN, the two networks effective for on-policy control, and that their values converge to near the true values (given in Figure 4(d)). The other representations, on the other hand, have very poor estimates. Moreover, these estimates seem to decrease together, suggesting interference is causing over-generalization that reduces values in other states.
Finally, we report additional measures of locality, to determine if the successful methods are indeed sparse. The heatmaps provide some evidence of locality, but are more qualitative than quantitative. We report two quantitative measures: instance sparsity and activation overlap. Instance sparsity corresponds to the percentage of active units for each input. A sparse representation should be instance sparse, where most inputs produce a relatively low percentage of activation. As shown in Figure 5, SRNN maintains a consistently low percentage of active units across all four domains, with a slightly higher level in Catcher, potentially explaining the noisy behaviour in that domain. Once again, Dropout-NN is noticeably more instance sparse on Puddle World, but less so on the other domains. The NN representation, which has no regularization, has some instance sparsity, likely due simply to the ReLU activation. Interestingly, NN-ℓ1 and NN-ℓ2 actually produced less instance sparsity.
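Instance sparsity is straightforward to compute from a batch of hidden-layer activations; the following is a minimal sketch (the function name is ours), where nonzero activation is taken as "active", which assumes ReLU units:

```python
import numpy as np

def instance_sparsity(phi):
    """Percentage of active (nonzero) hidden units per input.

    phi: (n_samples, n_hidden) matrix of hidden-layer activations.
    Returns a vector with one activation percentage per instance."""
    active = (phi > 0).sum(axis=1)
    return 100.0 * active / phi.shape[1]

phi = np.array([[0.0, 1.2, 0.0, 0.3],
                [0.5, 0.0, 0.0, 0.0]])
instance_sparsity(phi)  # -> array([50., 25.])
```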
Activation overlap, introduced by French [French1991], reflects the amount of shared activation between any two inputs. We consider a variant of activation overlap that measures the number of shared activations between two representations, φ(s_i) and φ(s_j), for two samples s_i and s_j:

overlap(s_i, s_j) = Σ_k 1( φ_k(s_i) > 0 and φ_k(s_j) > 0 )
We measure the activation overlap of the five chosen states, distributed across Puddle World. If the overlap between two representations is zero, the interference would be zero: updating the value function with respect to one state would not affect the other state's value. Table 1 shows the average overlap, and once again a similar trend emerges, where SRNN has significantly less overlap (about 8 shared units), with Dropout-NN showing the next least overlap (about 30).
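This variant of overlap can be computed directly from the representations; a minimal sketch (function name ours), where the zero threshold again assumes ReLU activations:

```python
import numpy as np

def activation_overlap(phi_i, phi_j):
    """Count of hidden units simultaneously active (nonzero) for both inputs.
    Zero overlap means an update for one input cannot interfere with the other."""
    return int(np.sum((phi_i > 0) & (phi_j > 0)))

phi_a = np.array([0.0, 0.7, 1.1, 0.0])
phi_b = np.array([0.2, 0.9, 0.0, 0.0])
activation_overlap(phi_a, phi_b)  # -> 1 (only the second unit is shared)
```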
Overall, these results provide some evidence that (a) sparse representations can improve control performance in an incremental learning setting, (b) these sparse representations appear to provide locality, and (c) this locality reduces interference and improves the accuracy of bootstrap values in Sarsa(0). These results are a first step, and warrant further investigation. They nonetheless suggest that learning sparse representations could be a promising direction for control in reinforcement learning. In the next section, we discuss how we actually obtain such sparse representations (SRNN).
4 Distributional Regularizers for Sparsity
In this section, we describe how to use Distributional Regularizers to learn sparse representations with neural networks.^2 When used out-of-the-box, we found important limitations in the learned representations, arising both from using Sigmoid activations instead of ReLU and from using the KL to a single specific distribution. We explore the idea in depth here, to make it a practical option for learning sparse representations. We introduce a Set Distributional Regularizer, which, when paired with ReLU activations, enables sparse representations to be learned, as we demonstrate in the next section. We first describe how to define Distributional Regularizers on neural networks, and then discuss the extension to a Set Distributional Regularizer and the motivation for doing so.

^2 The idea was originally introduced for neural networks with Sigmoid activations in an unpublished set of notes [Ng2011], and has not yet been systematically explored.
Table 1: Average activation overlap between the five selected states in Puddle World.

SRNN   NN-ℓ2   NN-ℓ1   Dropout-NN   NN
8.8    111.5   142.5   31.2         54.0
The goal of using Distributional Regularizers is to encourage the distribution of each hidden node, across samples, to match a desired target distribution. In a neural network, we can view the hidden nodes h_j as random variables, with randomness due to random inputs. Each of these random variables h_j has a distribution p_θ_j, where the parameters θ_j of this distribution are induced by the weights of the neural network. This provides a distribution over the values φ_j(s) for the feature j, across inputs s. A Distributional Regularizer is a KL divergence KL(p_θ_j ‖ q_β) that encourages this distribution to match a desired target distribution q_β with parameter β.
Such a regularizer can be used to encourage sparsity, by selecting a target distribution that has high mass or density at zero. Consider a Bernoulli distribution for activations, with parameter p the probability that the node is active. Using a Bernoulli target distribution with β = 0.1 encodes a desired activation of 10%. As another example, for continuous non-negative activations, the target distribution can be set to an exponential distribution q_β(x) = (1/β) exp(−x/β), which has highest density at zero and expected value β. Setting β small encourages the average activation to be small and increases density on values near zero. The efficacy of this regularizer, however, is tied to the parameterization of the network, which should match the target distribution. For a ReLU activation, for example, which has range [0, ∞), a Bernoulli target distribution is not appropriate; rather, for this range, an exponential distribution is more suitable. For a Sigmoid activation, giving values in (0, 1), a Bernoulli is reasonably appropriate. Additionally, the parametrization should be able to set activations to zero. The ReLU activation naturally enables zero values [Glorot, Bordes, and Bengio2011], by pushing activations to negative values, which are truncated to zero. The addition of a Distributional Regularizer simply encourages this natural tendency, and is more likely to provide sparse representations. Activations under Sigmoid and tanh, on the other hand, are more difficult to encourage to zero, because they require highly negative input values or input values exactly equal to zero, respectively, to set the hidden node to zero. For these reasons, we advocate for ReLU for the sparse layer, with an exponential target distribution.
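Under the exponential target, the per-node KL has a closed form in terms of the node's empirical mean activation; the sketch below assumes each node's activations are modelled as exponential with mean estimated from a batch (function names are ours):

```python
import numpy as np

def exp_kl(mean_act, beta):
    """KL(Exp(mean_act) || Exp(beta)): the divergence between two exponential
    distributions, expressed through their means. Zero when mean_act == beta."""
    return np.log(beta / mean_act) + mean_act / beta - 1.0

def distributional_regularizer(phi, beta=0.1):
    """Sum of per-node KLs to the target Exp(beta).

    phi: (n_samples, n_hidden) matrix of ReLU activations; the empirical mean
    of each column estimates that node's activation level."""
    means = phi.mean(axis=0) + 1e-8   # small constant avoids log(0)
    return float(np.sum(exp_kl(means, beta)))
```

This penalty would be added to the network's training objective, scaled by a regularization weight.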
Finally, we modify this regularizer to provide a Set Distributional Regularizer, which does not require an exact level of sparsity to be achieved. It can be difficult to choose a precise level of sparsity, making the Distributional Regularizer prone to misspecification. Rather, the actual goal is typically to obtain at least some level of sparsity, where some nodes can be even more sparse. For this modification, we specify that the distribution should match any of a set of target distributions Q, giving a Set KL: min over q ∈ Q of KL(p_θ_j ‖ q). Generally, this Set KL can be hard to evaluate. However, as we show below, it corresponds to a simple clipped KL divergence for certain choices of Q, importantly including for exponential distributions where Q = {q_β : β ∈ (0, β₀]}.
Theorem 1 (Set KL as a Clipped KL).
Let p_η be a one-dimensional exponential family distribution with natural parameter η, let B = [η₁, η₂] be a convex set in the natural parameter space, and let Q = {p_η̄ : η̄ ∈ B}. Then the Set KL divergence

(3)  SKL(p_η ‖ Q) := min over η̄ ∈ B of KL(p_η ‖ p_η̄)

is (a) non-negative, (b) convex in η and (c) corresponds to the simple clipped form

(4)  SKL(p_η ‖ Q) = 0 if η ∈ B;  KL(p_η ‖ p_η₁) if η < η₁;  KL(p_η ‖ p_η₂) if η > η₂.
Proof.
For exponential families, the KL divergence corresponds to a Bregman divergence [Banerjee et al.2005]:

KL(p_η ‖ p_η̄) = D_F(η̄, η) = F(η̄) − F(η) − ∇F(η)(η̄ − η)

for a convex potential function F that depends on the exponential family. Hence, we have

SKL(p_η ‖ Q) = min over η̄ ∈ B of D_F(η̄, η)

If η ∈ B, this minimum over Bregman divergences is clearly zero. If η < η₁ or η > η₂, we have to consider the minimization. The Bregman divergence is not necessarily convex in its second argument; instead, we can rely on the convexity of the set B. Taking the derivative of D_F(η̄, η) with respect to η̄, we get

(d/dη̄) D_F(η̄, η) = ∇F(η̄) − ∇F(η)

Now because F is convex, ∇F is non-decreasing. The derivative, then, is negative when η̄ < η, indicating η̄ should be increased to decrease D_F. Similarly, when η̄ > η, the derivative is positive, indicating η̄ should be decreased to decrease D_F. This derivative, then, points to the boundaries when η < η₁ or η > η₂, respectively to the boundary point closest to η. ∎
Corollary 1 (SKL for Exponential Distributions).
For an exponential distribution q_β(x) = (1/β) exp(−x/β), with natural parameter η = −1/β, and Q = {q_β̄ : β̄ ∈ (0, β₀]}, then

(5)  SKL(q_β ‖ Q) = 0 if β ≤ β₀, and KL(q_β ‖ q_β₀) otherwise.
We use the SKL in Corollary 1 to encode a sparsity level of at least β₀, rather than exactly β₀, for the last layer in a two-layer neural network with ReLU activations. This regularizer was used to encourage sparse activations for SRNN in the preceding sections. We include pseudocode for optimizing the regularized objective with the SKL in Algorithm 1 in the Appendix.
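The clipped form in Corollary 1 makes this regularizer simple to implement; a per-node sketch (assuming, as before, an exponential model of each node's activation, with the closed-form exponential KL and our own function name):

```python
import numpy as np

def skl_exponential(mean_act, beta0):
    """Set KL for exponential targets with beta in (0, beta0]: zero once the
    node's mean activation is at or below beta0 (at least the desired level of
    sparsity), otherwise the KL to the boundary distribution Exp(beta0)."""
    if mean_act <= beta0:
        return 0.0
    # KL(Exp(mean_act) || Exp(beta0)) for two exponentials with these means.
    return np.log(beta0 / mean_act) + mean_act / beta0 - 1.0

skl_exponential(0.05, 0.1)  # -> 0.0 (already sparser than the target)
```

Unlike the plain KL, this penalty never pushes a node's activation back up toward β₀, which is what allows some nodes to be even more sparse.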
5 Evaluation of Distributional Regularizers
In this section, we investigate the efficacy of Distributional Regularizers for obtaining sparsity. There are a variety of possible choices with Distributional Regularizers, including the activation function with its corresponding target distribution, and the use of a KL versus a Set KL. We investigate some of these combinations, particularly focusing on differences in sparsity and performance when using (a) KL versus SKL; (b) Sigmoid (with a Bernoulli target distribution) versus ReLU (with an Exponential target distribution); and (c) previous strategies for obtaining sparse representations versus the proposed variant of the Distributional Regularizer.
In the first set of experiments, we compare the instance sparsity of KL to Set KL, with ReLU activations and Exponential distributions (ReLU+KL and ReLU+SKL). Figure 6 shows the instance sparsity for these, and for the NN without regularization. Interestingly, ReLU+KL actually reduces sparsity in several domains, because the optimization encouraging an exact level of sparsity is quite finicky. ReLU+SKL, on the other hand, significantly improves instance sparsity over the NN. This instance sparsity again translates into control performance, where ReLU+KL does noticeably worse than ReLU+SKL across the four domains in Figure 7. Despite the poor instance sparsity, ReLU+KL does seem to provide some useful regularity that allows some learning across all four domains. This contrasts with the previous regularization strategies, ℓ1, ℓ2 and Dropout, which all failed to learn on at least one domain, particularly Catcher.
In the next set of experiments, we compare Sigmoid (with a Bernoulli target distribution) versus ReLU (with an Exponential target distribution). We include both KL and Set KL, giving the combinations ReLU+KL, ReLU+SKL, SIG+KL and SIG+SKL. We expect Sigmoid with Bernoulli to perform significantly worse, in terms of sparsity levels, locality and performance, because the Sigmoid activation makes it difficult to truly obtain sparse representations. This hypothesis is validated in the learning curves in Figure 7 and the heatmaps for Puddle World in Figure 8. SIG+KL and SIG+SKL perform poorly across domains, even in Puddle World, where they achieved their best performance. Unlike ReLU with Exponential, here the Set KL seems to provide little benefit. The heatmaps in Figure 8 show that both versions, SIG+KL and SIG+SKL, cover large portions of the space, and do not have local activations for hidden nodes. In fact, SIG+KL and SIG+SKL use all the hidden nodes for all the samples across domains, resulting in no instance sparsity.
Next, we compare to previously proposed strategies for learning sparse representations with neural networks. These include ℓ1 and ℓ2 regularization on the activations (denoted ℓ1R-NN and ℓ2R-NN respectively); k-sparse NNs, where all but the top k activations are zeroed [Makhzani and Frey2013] (k-sparse-NN); and Winner-Take-All NNs that keep the top k% of the activations per node across instances, to promote sparse activations of nodes over time [Makhzani and Frey2015] (WTA-NN).^3

^3 Both k-sparse-NNs and WTA-NNs were introduced for autoencoders, though the idea can be applied more generally to NNs. We additionally tested these methods with autoencoders, but performance was significantly worse.
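For reference, the two truncation-based constraints can be sketched as follows; these are our simplified reconstructions of the per-instance and per-node rules, not the original authors' implementations:

```python
import numpy as np

def k_sparse(phi, k):
    """k-sparse constraint per instance: keep the k largest activations in
    each row (instance), zero the rest [Makhzani and Frey2013]."""
    out = np.zeros_like(phi)
    idx = np.argsort(phi, axis=1)[:, -k:]   # indices of the top-k per row
    np.put_along_axis(out, idx, np.take_along_axis(phi, idx, axis=1), axis=1)
    return out

def winner_take_all(phi, pct):
    """WTA constraint per node: across the batch, keep only the top pct% of
    each column's (node's) activations [Makhzani and Frey2015]."""
    thresh = np.percentile(phi, 100 - pct, axis=0)  # per-node threshold
    return np.where(phi >= thresh, phi, 0.0)

phi = np.array([[1.0, 3.0, 2.0],
                [0.0, 5.0, 4.0]])
k_sparse(phi, 1)  # keeps only the single largest activation per instance
```

Note the asymmetry: k-sparse truncates across nodes for one input, while WTA truncates across inputs for one node.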
We include learning curves and instance sparsity for these methods, for a ReLU activation, in Figures 9 and 10. Results for the Sigmoid activation are included in the Appendix. Neither WTA-NN nor k-sparse-NN is effective. We found that k-sparse-NN was prone to dead units, and often truncated non-negligible values. Surprisingly, ℓ1R-NN performs comparably to SRNN in all domains but Catcher, whereas ℓ2R-NN is effective only during early learning in Mountain Car. From the instance sparsity plots in Catcher, we see that ℓ1R-NN and ℓ2R-NN produce highly sparse representations (2%-3% instance sparsity), potentially explaining their poor performance there. While similar instance sparsity was effective in Puddle World, this is unlikely to be true in general. This was with considerable optimization of the regularization parameter.
6 Conclusion
In this work, we investigate using and learning sparse representations with neural networks for control in reinforcement learning. We show that sparse representations can significantly improve control performance when used in an incremental learning setting, and provide some evidence that this is because the locality of the representation reduces the catastrophic interference that would otherwise corrupt bootstrap values. We formalize Distributional Regularizers, with a practically important extension to a Set KL, for learning a Sparse Representation Neural Network (SRNN). We provide an empirical investigation into the sparsity properties and control performance under different Distributional Regularizers, and compare to other algorithms for obtaining sparse representations with neural networks. We conclude that SRNN performs consistently well across domains, with the next best method, which fails in only one domain, being a simple method that uses an ℓ1 regularizer on activations.
This work highlights an important phenomenon that arises in control, beyond the typical issues with catastrophic interference. Interference is typically considered in sequential multi-task learning, where previously learned functions are forgotten by training on a new task. Interference can occur even in a single-task setting, if an agent remains in a particular area of the space for a long time. In reinforcement learning, however, the problem is magnified by the fact that the agent uses its own estimates as targets: if estimates change incorrectly due to interference, there can be a cascading effect. This work provides some first empirical steps, in a carefully controlled set of experiments, to identify that this could be an issue, and that sparse representations could be a promising direction to alleviate the problem. We hope for this work to spur further empirical investigation into how widespread this issue is, and further algorithmic development into learning sparse representations for reinforcement learning.
References
 [Ahmad and Hawkins2015] Ahmad, S., and Hawkins, J. 2015. Properties of Sparse Distributed Representations and their Application to Hierarchical Temporal Memory.

[Arpit et al.2015] Arpit, D.; Zhou, Y.; Ngo, H.; and Govindaraju, V. 2015. Why regularized auto-encoders learn sparse representation? In International Conference on Machine Learning.
 [Banerjee et al.2005] Banerjee, A.; Merugu, S.; Dhillon, I. S.; and Ghosh, J. 2005. Clustering with Bregman divergences. Journal of Machine Learning Research 6(Oct):1705-1749.
 [Banino et al.2018] Banino, A.; Barry, C.; Uria, B.; Blundell, C.; Lillicrap, T.; Mirowski, P.; Pritzel, A.; Chadwick, M. J.; Degris, T.; Modayil, J.; et al. 2018. Vectorbased navigation using gridlike representations in artificial agents. Nature.
 [Bengio, Courville, and Vincent2013] Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 [Bengio2009] Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning.

[Cover1965] Cover, T. M. 1965. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Transactions on Electronic Computers.
 [Földiák1990] Földiák, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics.
 [French1991] French, R. M. 1991. Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. In Annual Cognitive Science Society Conference.
 [Glorot, Bordes, and Bengio2011] Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics.
 [Goodfellow et al.2009] Goodfellow, I.; Lee, H.; Le, Q. V.; Saxe, A.; and Ng, A. Y. 2009. Measuring invariances in deep networks. In Advances in Neural Information Processing Systems.

[He et al.2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision.
[Hinton, Srivastava, and Swersky2012] Hinton, G.; Srivastava, N.; and Swersky, K. 2012. Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent.
 [Kanerva1988] Kanerva, P. 1988. Sparse Distributed Memory. MIT Press.
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
 [Le, Kumaraswamy, and White2017] Le, L.; Kumaraswamy, R.; and White, M. 2017. Learning sparse representations in reinforcement learning with sparse coding. arXiv:1707.08316.
[Lee et al.2008] Lee, H.; Ekanadham, C.; and Ng, A. Y. 2008. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems.
 [Lemme, Reinhart, and Steil2012] Lemme, A.; Reinhart, R. F.; and Steil, J. J. 2012. Online learning and generalization of partsbased image representations by nonnegative sparse autoencoders. Neural Networks.
 [Mairal et al.2009] Mairal, J.; Bach, F.; Ponce, J.; Sapiro, G.; and Zisserman, A. 2009. Supervised dictionary learning. In Advances in Neural Information Processing Systems.
 [Mairal et al.2010] Mairal, J.; Bach, F.; Ponce, J.; and Sapiro, G. 2010. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research.
[Makhzani and Frey2013] Makhzani, A., and Frey, B. 2013. k-sparse autoencoders. arXiv:1312.5663.
[Makhzani and Frey2015] Makhzani, A., and Frey, B. 2015. Winner-take-all autoencoders. In Advances in Neural Information Processing Systems.
 [McCloskey and Cohen1989] McCloskey, M., and Cohen, N. J. 1989. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation.
 [McCulloch and Pitts1943] McCulloch, W. S., and Pitts, W. 1943. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics.
 [Ng2011] Ng, A. 2011. Sparse autoencoder. CS294A Lecture notes.
 [Olshausen and Field1997] Olshausen, B. A., and Field, D. J. 1997. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research.
 [Olshausen2002] Olshausen, B. A. 2002. Sparse Codes and Spikes. In Probabilistic Models of the Brain.
[Quian Quiroga and Kreiman2010] Quian Quiroga, R., and Kreiman, G. 2010. Measuring sparseness in the brain: Comment on Bowers (2009).

[Ranzato et al.2006] Ranzato, M.; Poultney, C. S.; Chopra, S.; and LeCun, Y. 2006. Efficient Learning of Sparse Representations with an Energy-Based Model. In Advances in Neural Information Processing Systems.
[Ranzato, Boureau, and LeCun2007] Ranzato, M.; Boureau, Y.-L.; and LeCun, Y. 2007. Sparse Feature Learning for Deep Belief Networks. In Advances in Neural Information Processing Systems.
[Ratitch and Precup2004] Ratitch, B., and Precup, D. 2004. Sparse distributed memories for on-line value-based reinforcement learning. In Machine Learning: ECML PKDD.
 [Riedmiller2005] Riedmiller, M. 2005. Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning.

[Rifai et al.2011] Rifai, S.; Vincent, P.; Muller, X.; Glorot, X.; and Bengio, Y. 2011. Contractive auto-encoders: Explicit invariance during feature extraction. In International Conference on Machine Learning.
[Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.
 [Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning: An introduction. MIT press Cambridge.
 [Sutton1988] Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine learning 3(1):9–44.
 [Sutton1996] Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems.
[Teh et al.2003] Teh, Y. W.; Welling, M.; Osindero, S.; and Hinton, G. E. 2003. Energy-Based Models for Sparse Overcomplete Representations. Journal of Machine Learning Research.
 [Triesch2005] Triesch, J. 2005. A Gradient Rule for the Plasticity of a Neuron’s Intrinsic Excitability. In ICANN.
[White2017] White, M. 2017. Unifying Task Specification in Reinforcement Learning. In International Conference on Machine Learning.
Appendix A Additional algorithmic details
In general, we advocate for learning the representation incrementally, for the task faced by the agent. However, for our experiments, we learned the representations first to remove confounding factors. We detail that learning regime here.
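As a concrete illustration of this pretraining regime, the following is a minimal sketch (hypothetical class and variable names; network sizes follow Appendix B) of training a two-hidden-layer network on the mean squared TD error, from transitions generated by a fixed policy. For brevity, the sketch only descends on the final linear weights; backpropagating the same squared TD error through the hidden layers trains the representation itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # He initialization: zero-mean Gaussian with variance 2 / n_in.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

class MSTDENet:
    """Hypothetical sketch: 32- then 256-unit ReLU layers, linear value head."""

    def __init__(self, n_obs, gamma=0.99, lr=0.005):
        self.W1 = he_init(n_obs, 32)
        self.W2 = he_init(32, 256)
        self.w = np.zeros(256)   # linear weights on top of the representation
        self.gamma, self.lr = gamma, lr

    def phi(self, s):
        h = np.maximum(0.0, s @ self.W1)      # first hidden layer (ReLU)
        return np.maximum(0.0, h @ self.W2)   # representation layer (ReLU)

    def value(self, s):
        return self.phi(s) @ self.w

    def update(self, s, r, s_next, done):
        # The MSTDE takes the gradient through both phi(s) and phi(s');
        # here we only adapt the final linear weights w.
        phi_s, phi_next = self.phi(s), self.phi(s_next)
        target = r + (0.0 if done else self.gamma * (phi_next @ self.w))
        delta = target - phi_s @ self.w
        grad_dir = phi_s - (0.0 if done else self.gamma * phi_next)
        self.w += self.lr * delta * grad_dir
        return delta
```

The learned `phi` is then frozen and handed to Sarsa(0) for control, as described in Appendix B.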
The problem of learning a good representation in the case of finite actions can be transformed to learning a good representation of the form $\phi_\theta(s)$, and using that to represent the action-value function from Equation (2) as:

$$Q(s, a) \approx \phi_\theta(s)^\top w_a. \qquad (6)$$

Here, $\phi_\theta(s)$ is the representation of the state $s$, which is used in conjunction with the linear predictor $w_a$ to estimate action-values for action $a$ across the state space. Under a given policy $\pi$, like the action-values $Q^\pi(s,a)$, the corresponding state-values, $V^\pi(s)$, are defined as: $V^\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s \right]$.
An easy objective to train connectionist networks with simple backpropagation is the Mean Squared Temporal Difference Error (MSTDE) [Sutton1988]. For a given policy, the MSTDE is defined as:

$$\text{MSTDE}(\theta, w) = \sum_{s} d_\pi(s)\, \mathbb{E}\!\left[ \delta_t^2 \mid S_t = s \right], \quad \text{with } \delta_t = R_{t+1} + \gamma\, \phi_\theta(S_{t+1})^\top w - \phi_\theta(S_t)^\top w. \qquad (7)$$

Here, $d_\pi$ denotes the stationary distribution over the states induced by the given policy, and $\theta$ and $w$ are parameters that can be estimated with stochastic gradient descent. Therefore, given experience generated by a policy that explores sufficiently in an environment, a strong function approximator (a dense neural network) can be trained to estimate useful features, $\phi_\theta(s)$. These features can then be used for estimating action-values in on-policy control for learning the (close-to) optimal behaviour policy in the environment.
Appendix B Experimental Details
B.1 Policies to generate training and testing data
In Mountain Car, we use the standard energy-pumping policy with 10% randomness. In Puddle World, the data is generated by a policy that chooses to go North with 50% probability, and East with 50% probability on each step. The data in Acrobot is generated by a near-optimal policy. In Catcher, the agent chooses to move toward the apple with 50% probability, and selects a random action with 50% probability on each step; the agent gets only 1 life in the environment.
B.2 Tile Coding
We compare to the Tile Coding (TC) representation, a well-known sparse representation, as the baseline. TC uses overlapping grids on the observation space to convert a continuous space to a discrete, high-dimensional space. The representations it generates are sparse and distributed, based on a static hashing technique. We experiment with several configurations for the fixed representation, sweeping the grid-size (N) and the number of tilings (D). We use a hash size of 8192, which is significantly larger than the largest feature size of 256 used in the other learned-representation models we compare to. The results shown in Figure 3 are for the best configuration of the static tile coder after a sweep.
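As a rough sketch of how such a static, hashed tile coder produces sparse binary features (the function and hashing scheme below are illustrative, not the exact implementation used for the baseline):

```python
import numpy as np

def tile_code(obs, lows, highs, n_grid=8, n_tilings=8, hash_size=8192):
    """Sparse binary features from n_tilings offset grids, hashed to hash_size.

    Illustrative sketch: each tiling is an n_grid-per-dimension grid, shifted
    by a fraction of a cell, and the active cell of each tiling is hashed into
    one index of a fixed-size feature vector.
    """
    obs = np.asarray(obs, dtype=float)
    lows, highs = np.asarray(lows, dtype=float), np.asarray(highs, dtype=float)
    scaled = (obs - lows) / (highs - lows)            # normalize to [0, 1]
    features = np.zeros(hash_size)
    for t in range(n_tilings):
        offset = t / n_tilings                        # shift each tiling
        coords = np.floor(scaled * n_grid + offset).astype(int)
        # deterministic hash of (tiling index, cell coordinates)
        idx = hash((t, *coords.tolist())) % hash_size
        features[idx] = 1.0
    return features
```

At most D (here 8) of the 8192 features are active for any input, which is what makes the representation both sparse and cheap to use with a linear predictor.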
B.3 Training neural networks
Architecture and optimizer: We used neural networks with two hidden layers. The first layer has 32 hidden units; the second layer, which is the representation layer used for prediction, has 256 units. We optimized the neural network weights using the Adam optimizer [Kingma and Ba2014] with a batch size of 64. The neural network weights are initialized with He initialization [He et al.2015]: that is, with a zero-mean Gaussian distribution with variance equal to $2/n_l$, where $n_l$ is the number of input nodes for layer $l$.
Representation hyperparameters: The ranges of the grid search for the representation hyperparameters are as follows:
Algorithmic choices: For k-sparse networks, only the top-k hidden units in the representation layer are activated. We also use the scheduling of the sparsity level described in the original paper [Makhzani and Frey2013]. If used in conjunction with a distributional regularizer, the top-k nodes are chosen before application of the distributional regularizer. For dropout, given the form of the supervision goal (MSTDE), the same dropout mask is used to generate the representation for both states $S_t$ and $S_{t+1}$ (we experimented with different dropout masks for $S_t$ and $S_{t+1}$, and the results suggest that this is not able to learn good representations even for prediction, across all domains); this preserves dropout's role as a regularizer with respect to the target, and promotes diversity in learning.
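The top-k selection for the k-sparse networks can be sketched as follows (an illustrative helper; in the original k-sparse autoencoder the selection is applied to the hidden layer's linear activities):

```python
import numpy as np

def k_sparse(h, k):
    """Keep only the k largest hidden activations, zeroing the rest."""
    h = np.asarray(h, dtype=float)
    out = np.zeros_like(h)
    top = np.argpartition(h, -k)[-k:]   # indices of the k largest activations
    out[top] = h[top]
    return out
```

Applied to the 256-unit representation layer, this guarantees exactly k active features per input, regardless of what the preceding layers produce.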
Grid-search evaluation metric: The learned representations are then used for on-policy control in Sarsa(0), with the representation fixed. The value function for Sarsa is initialized with a zero-mean Gaussian distribution with small variance. For sparse representations, we use semi-gradient Sarsa with a step-decay learning rate. For dense representations, we use the adaptive learning-rate method RMSprop [Hinton, Srivastava, and Swersky2012]. The initial learning rate for Sarsa(0) is swept over a set of values. All the sweeps for selecting the representation-learning hyperparameters across domains use 50 epochs and 10 runs.
Learning curves: The chosen hyperparameters are used to train a good representation (to a saturated testing loss: 100 epochs for Acrobot, and 50 epochs for the other domains), after which it is used for on-policy control with Sarsa(0). While the main paper focuses on control performance, the learning curves during the representation-training phase are shown in Figures 11, 12 and 13. The metric on the y-axis is the Root Mean Squared Error (RMSE), which is evaluated as follows:

$$\text{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{s \in \mathcal{T}} \big(\hat{V}(s) - V(s)\big)^2},$$

where $\mathcal{T}$ is the set of test states for which the representations have been extracted, $\hat{V}(s)$ is the estimated value of state $s$, and $V(s)$ is the true value of state $s$ computed using Monte Carlo rollouts. The number of test states is 5000 for the benchmark domains and 1000 for Catcher. Most algorithms converge to a good solution within 50 epochs in Mountain Car, Puddle World and Catcher, and within 100 epochs in Acrobot, as shown in the curves. These curves are for illustrative purposes and are only averaged over 5 runs. All learning curves for Sarsa(0) are averaged over 30 runs, and are plotted with an exponential moving average.
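The RMSE above is straightforward to compute from the two arrays of values (variable names are illustrative):

```python
import numpy as np

def rmse(v_hat, v_true):
    """Root Mean Squared Error between estimated and Monte-Carlo values
    over a fixed set of test states."""
    v_hat = np.asarray(v_hat, dtype=float)
    v_true = np.asarray(v_true, dtype=float)
    return np.sqrt(np.mean((v_hat - v_true) ** 2))
```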
Appendix C More results
C.1 Control curves
We evaluate the sparsity-inducing networks with the Sigmoid activation. Figure 14 shows the performance of Sarsa(0) with representations learned by the different networks. k-sparse and WTA perform well in Puddle World; however, none of these representations is effective across all domains.
The learning curves for the various k-sparse networks with distributional regularizers are in Figure 15. They suggest that k-sparse (ReLU+k+SKL) provides no improvement over just using the distributional regularizer with the ReLU activation (SRNN).
C.2 Activation heatmaps
The activation heatmaps for randomly selected neurons (excluding dead neurons) in Mountain Car with different regularization strategies are shown in Figure 16, and with different Distributional Regularizer designs in Figure 17. Heatmaps for sparsity-inducing networks with ReLU and Sigmoid activations, for Mountain Car and Puddle World, are shown in Figure 18.
C.3 Bootstrap values
The bootstrap values comparing SRNN to the different regularization strategies and to NN are shown in Figure 19. Since a 4-dimensional space is not easy to visualize, we only include the bootstrap-value result for Mountain Car here.
C.4 Activation overlap
We show the overlap of representations learned by the different networks in Table 2 for Mountain Car and Puddle World. The RNN variants have low overlap values; however, these regularizers tend to push many neurons to be activated only in a very small region to reduce the penalty, as shown in Figure 18. SRNN, on the other hand, learns a more distributed representation.
              Mountain Car   Puddle World
SRNN               16.8           8.8
NN                112.3         111.5
NN                109.5         142.5
Dropout-NN         72.5          31.2
NN                106.5          54.0
ReLU+KL            36.8          71.4
SIG+SKL           256.0         256.0
SIG+KL            256.0         256.0
k-sparse-NN        36.6          61.8
WTA-NN             24.8           6.5
RNN                30.0           3.8
RNN                10.5           0.4
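Overlap numbers like those in the table above can be computed along the following lines (a hedged sketch: we read "overlap" as the number of units simultaneously active for a pair of states, averaged over sampled pairs; the exact measure used for Table 2 may differ):

```python
import numpy as np

def activation_overlap(phis):
    """Average count of hidden units active for both members of each pair
    of representations. `phis` is a (num_states, num_features) array."""
    active = np.asarray(phis) > 0          # binary activation patterns
    n = active.shape[0]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.sum(active[i] & active[j])
            pairs += 1
    return total / pairs
```

On this reading, a dense Sigmoid network where every unit is active for every input would score the maximum of 256, matching the SIG+SKL and SIG+KL rows.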