1 Introduction
Reinforcement learning (RL) is a field of research that uses dynamic programming (DP; Bertsekas 2008), among other approaches, to solve sequential decision making problems. The main challenge in applying DP to real-world problems is an exponential growth of computational requirements as the problem size increases, known as the curse of dimensionality (Bertsekas, 2008). RL tackles the curse of dimensionality by approximating terms in the DP calculation such as the value function or policy. Popular function approximators for this task include deep neural networks, henceforth termed deep RL (DRL), and linear architectures, henceforth termed shallow RL (SRL).
SRL methods have enjoyed wide popularity over the years (see, e.g., Tsitsiklis et al. 1997; Bertsekas 2008 for extensive reviews). In particular, batch algorithms based on a least squares (LS) approach, such as Least Squares Temporal Difference (LSTD, Lagoudakis & Parr 2003) and Fitted Q Iteration (FQI, Ernst et al. 2005), are known to be stable and data efficient. However, the success of these algorithms crucially depends on the quality of the feature representation. Ideally, the representation encodes rich, expressive features that can accurately represent the value function. In practice, however, finding such good features is difficult and often hampers the usage of linear function approximation methods.
In DRL, on the other hand, the features are learned together with the value function
in a deep architecture. Recent advancements in DRL using convolutional neural networks demonstrated learning of expressive features
(Zahavy et al., 2016; Wang et al., 2016) and state-of-the-art performance in challenging tasks such as video games (Mnih et al. 2015; Tessler et al. 2017; Mnih et al. 2016) and Go (Silver et al., 2016). To date, the most impressive DRL results (e.g., the works of Mnih et al. 2015; 2016) were obtained using online RL algorithms, based on a stochastic gradient descent (SGD) procedure.
On the one hand, SRL is stable and data efficient. On the other hand, DRL learns powerful representations. This motivates us to ask: can we combine DRL with SRL to leverage the benefits of both?
In this work, we develop a hybrid approach that combines batch SRL algorithms with online DRL. Our main insight is that the last layer in a deep architecture can be seen as a linear representation, with the preceding layers encoding features. Therefore, the last layer can be learned using standard SRL algorithms. Following this insight, we propose a method that repeatedly retrains the last hidden layer of a DRL network with a batch SRL algorithm, using data collected throughout the DRL run.
We focus on value-based DRL algorithms (e.g., the popular DQN of Mnih et al. 2015) and on SRL based on LS methods (our approach can be generalized to other DRL/SRL variants), and propose the Least Squares DQN algorithm (LS-DQN). Key to our approach is a novel regularization term for the least squares method that uses the DRL solution as a prior in a Bayesian least squares formulation. Our experiments demonstrate that this hybrid approach significantly improves performance on the Atari benchmark for several combinations of DRL and SRL methods.
To support our results, we performed an in-depth analysis to tease out the factors that make our hybrid approach outperform DRL. Interestingly, we found that the improved performance is mainly due to the large batch size of SRL methods, compared to the small batch size that is typical for DRL.
2 Background
In this section we describe our RL framework and several shallow and deep RL algorithms that will be used throughout the paper.
RL Framework: We consider a standard RL formulation (Sutton & Barto, 1998) based on a Markov Decision Process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, and $\gamma \in [0,1]$ is the discount factor. A transition probability function $P : \mathcal{S} \times \mathcal{A} \to \Delta_{\mathcal{S}}$ maps states and actions to a probability distribution over next states. Finally, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ denotes the reward. The goal in RL is to learn a policy $\pi : \mathcal{S} \to \Delta_{\mathcal{A}}$ that solves the MDP by maximizing the expected discounted return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid \pi\right]$. Value-based RL methods make use of the action-value function $Q^\pi(s,a)$, which represents the expected discounted return of executing action $a$ from state $s$ and following the policy $\pi$ thereafter. The optimal action-value function $Q^*(s,a)$ obeys a fundamental recursion known as the Bellman equation:
$$Q^*(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \right].$$
2.1 SRL algorithms
Least Squares Temporal Difference Q-Learning (LSTDQ): LSTD (Barto & Crites, 1996) and LSTDQ (Lagoudakis & Parr, 2003) are batch SRL algorithms. LSTDQ learns a control policy $\pi$ from a batch of samples by estimating a linear approximation $\tilde{Q}^\pi = \Phi w^\pi$ of the action-value function $Q^\pi$, where $w^\pi \in \mathbb{R}^k$ are a set of weights and $\Phi \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times k}$ is a feature matrix. Each row of $\Phi$ represents a feature vector for a state-action pair $(s,a)$. The weights $w^\pi$ are learned by enforcing $\tilde{Q}^\pi$ to satisfy a fixed-point equation w.r.t. the projected Bellman operator, resulting in a system of linear equations $A w^\pi = b$, where $A = \Phi^\top (\Phi - \gamma P \Pi^\pi \Phi)$ and $b = \Phi^\top R$. Here, $R$ is the reward vector, $P$ is the transition matrix and $\Pi^\pi$ is a matrix describing the policy. Given a set of $N$ samples $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$, we can approximate $A$ and $b$ with the following empirical averages:
$$\tilde{A} = \frac{1}{N}\sum_{i=1}^{N} \phi(s_i, a_i)\left[\phi(s_i, a_i) - \gamma \phi(s'_i, \pi(s'_i))\right]^\top, \qquad \tilde{b} = \frac{1}{N}\sum_{i=1}^{N} \phi(s_i, a_i)\, r_i. \quad (1)$$
The weights can be calculated using a least squares minimization, $\tilde{w}^\pi = \arg\min_w \|\tilde{A} w - \tilde{b}\|^2$, or by calculating the pseudo-inverse, $\tilde{w}^\pi = \tilde{A}^\dagger \tilde{b}$. LSTDQ is an off-policy algorithm: the same set of samples $\mathcal{D}$ can be used to train any policy $\pi$, so long as $\pi(s'_i)$ is defined for every $s'_i$ in the set.
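As a concrete illustration, the empirical estimates in Equation 1 and the resulting linear solve can be sketched in a few lines of NumPy (function and argument names are ours, not taken from any released code):

```python
import numpy as np

def lstdq_weights(phi, phi_next, rewards, gamma=0.99, reg=0.0):
    """Solve the empirical LSTDQ system A w = b of Equation 1.

    phi:      (N, k) features phi(s_i, a_i)
    phi_next: (N, k) features phi(s'_i, pi(s'_i)) of the next state
              under the evaluated policy
    rewards:  (N,)   observed rewards r_i
    """
    N, k = phi.shape
    A = phi.T @ (phi - gamma * phi_next) / N   # empirical A of Eq. 1
    b = phi.T @ rewards / N                    # empirical b of Eq. 1
    # optional ridge term; Section 4.1 discusses why A is often ill-conditioned
    return np.linalg.solve(A + reg * np.eye(k), b)
```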
Fitted Q Iteration (FQI): The FQI algorithm (Ernst et al., 2005) is a batch SRL algorithm that computes iterative approximations of the Q-function using regression. At iteration $n$ of the algorithm, the set $\mathcal{D}$ defined above and the approximation $\tilde{Q}_{n-1}$ from the previous iteration are used to generate supervised learning targets:
$$y_i = r_i + \gamma \max_{a'} \tilde{Q}_{n-1}(s'_i, a').$$
These targets are then used by a supervised learning (regression) method to compute the next function in the sequence, $\tilde{Q}_n$, by minimizing the MSE loss $\sum_i \left(\tilde{Q}_n(s_i, a_i) - y_i\right)^2$. For a linear function approximation $\tilde{Q}_n(s,a) = \phi(s,a)^\top w_n$, LS can be used to give the FQI solution $w_n = \tilde{A}^{-1}\tilde{b}$, where $\tilde{A}, \tilde{b}$ are given by:
$$\tilde{A} = \frac{1}{N}\sum_{i=1}^{N} \phi(s_i, a_i)\,\phi(s_i, a_i)^\top, \qquad \tilde{b} = \frac{1}{N}\sum_{i=1}^{N} \phi(s_i, a_i)\, y_i. \quad (2)$$
2.2 DRL algorithms
Deep Q-Network (DQN): The DQN algorithm (Mnih et al., 2015) learns the Q-function by minimizing the mean squared error of the Bellman equation, $\mathbb{E}\left\| Q_\theta(s_t, a_t) - y_t \right\|^2$, where $y_t = r_t + \gamma \max_{a'} Q_{\theta_{\text{target}}}(s_{t+1}, a')$. DQN maintains two separate networks, namely the current network with weights $\theta$ and the target network with weights $\theta_{\text{target}}$. Fixing the target network makes the DQN algorithm equivalent to FQI (see the FQI MSE loss defined above), where the regression algorithm is chosen to be SGD (RMSProp, Hinton et al. 2012). DQN is an off-policy learning algorithm; therefore, the tuples $(s_t, a_t, r_t, s_{t+1})$ that are used to optimize the network weights are first collected from the agent's experience and stored in an Experience Replay (ER) buffer (Lin, 1993), providing improved stability and performance.

Double DQN (DDQN): DDQN (Van Hasselt et al., 2016) is a modification of the DQN algorithm that addresses overly optimistic estimates of the value function. This is achieved by performing action selection with the current network $\theta$ and evaluating the action with the target network $\theta_{\text{target}}$, yielding the DDQN target: $y_t = r_t$ if $s_{t+1}$ is terminal, and otherwise
$$y_t = r_t + \gamma\, Q_{\theta_{\text{target}}}\!\left(s_{t+1}, \arg\max_{a} Q_\theta(s_{t+1}, a)\right).$$
3 The LS-DQN Algorithm
We now present a hybrid approach for DRL with SRL updates (code is available online at https://github.com/ShallowUpdatesforDeepRL). Our algorithm, the LS-DQN algorithm, periodically switches between training a DRL network and retraining its last hidden layer using an SRL method (we refer the reader to Appendix B for a diagram of the algorithm).
We assume that the DRL algorithm uses a deep network for representing the Q-function, where the last layer is linear and fully connected. (The features in the last DQN layer are not action dependent; we generate action-dependent features by zero-padding them into a one-hot state-action feature vector. See Appendix E for more details.) Such networks have been used extensively in deep RL recently (e.g., Mnih et al. 2015; Van Hasselt et al. 2016; Mnih et al. 2016). In such a representation, the last layer, which approximates the Q-function, can be seen as a linear combination of features (the output of the penultimate layer), and we propose to learn more accurate weights for it using SRL.

Explicitly, the LS-DQN algorithm begins by training the weights of a DRL network, $w_k$, using a value-based DRL algorithm for $N_{\text{DRL}}$ steps (Line 2). LS-DQN then updates the last hidden layer weights, $w_k^{\text{last}}$, by executing LS-UPDATE: retraining the weights using an SRL algorithm with $N_{\text{SRL}}$ samples (Line 3).
The LS-UPDATE consists of the following steps. First, data trajectories $\mathcal{D}$ for the batch update are gathered using the current network weights $w_k$ (Line 7). In practice, the current experience replay can be used, so no additional samples need to be collected. The algorithm next generates new features $\Phi$ from the data trajectories using the current DRL network with weights $w_k$. This step guarantees that we do not use samples with inconsistent features, as the ER contains features from 'old' network weights. Computationally, this step requires running a forward pass of the deep network for every sample in $\mathcal{D}$, and can be performed quickly using parallelization.
Once the new features are generated, LS-DQN uses an SRL algorithm to recalculate the weights of the last hidden layer (Line 9).
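Putting the pieces together, the high-level control flow of LS-DQN can be sketched as below; the callables stand in for the components described above (DRL training, feature relabeling, and the SRL solver) and are placeholders, not the paper's actual API:

```python
def ls_dqn(train_drl, relabel_features, srl_solve, num_updates):
    """Skeleton of the LS-DQN loop.

    train_drl()              -> runs N_DRL steps; returns (network, ER buffer)
    relabel_features(net, d) -> recomputes features of batch d with the
                                *current* network (one forward pass per sample)
    srl_solve(f, d, w_last)  -> batch SRL solution for the last-layer weights
    """
    for _ in range(num_updates):
        net, replay = train_drl()               # Line 2: DRL training
        data = replay.sample_all()              # Line 7: reuse the ER as data
        feats = relabel_features(net, data)     # consistent, fresh features
        net.w_last = srl_solve(feats, data, net.w_last)  # Line 9: LS-UPDATE
    return net
```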
While the LS-DQN algorithm is conceptually straightforward, we found that naively running it with off-the-shelf SRL algorithms such as FQI or LSTD resulted in instability and a degradation of the DRL performance. The reason is that the 'slow' SGD computation in DRL essentially retains information from older training epochs, while the batch SRL method 'forgets' all data but the most recent batch. In the following, we propose a novel regularization method to address this issue.
Regularization
Our goal is to improve the performance of a value-based DRL agent using a batch SRL algorithm. Batch SRL algorithms, however, do not leverage the knowledge that the agent has gained before the most recent batch (while conceptually the data batch could include all of the data seen so far, due to computational limitations this is not a practical solution in the domains we consider). We observed that this issue prevents the use of off-the-shelf implementations of SRL methods in our hybrid LS-DQN algorithm.
To enjoy the benefits of both worlds, that is, a batch algorithm that can use the accumulated knowledge gained by the DRL network, we introduce a novel Bayesian regularization method for LSTDQ and FQI that uses the last hidden layer weights of the DRL network as a Bayesian prior for the SRL algorithm (the reader is referred to Ghavamzadeh et al. 2015 for an overview of Bayesian methods in RL).
SRL Bayesian Prior Formulation: We are interested in learning the weights of the last hidden layer, $w^{\text{last}}$, using a least squares SRL algorithm. We pursue a Bayesian approach, where the prior weight distribution at iteration $k$ of LS-DQN is a Gaussian centered at $w_k^{\text{last}}$, and we recall that $w_k^{\text{last}}$ are the last hidden layer weights of the DRL network at iteration $k$. The Bayesian solution for the regression problem in the FQI algorithm is given by (Box & Tiao, 2011)
$$w_{k+1}^{\text{last}} = \left(\tilde{A} + \lambda I\right)^{-1}\left(\tilde{b} + \lambda\, w_k^{\text{last}}\right),$$
where $\tilde{A}$ and $\tilde{b}$ are given in Equation 2 and $\lambda$ is the regularization coefficient. A similar regularization can be added to LSTDQ, based on a regularized fixed-point equation (Kolter & Ng, 2009). Full details are in Appendix A.
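For the FQI case, the regularized solution above amounts to a ridge solve whose shrinkage target is the DRL network's current last-layer weights rather than zero; a minimal sketch:

```python
import numpy as np

def fqi_bayesian_prior(phi, targets, w_prior, lam):
    """FQI last-layer solve with a Gaussian prior centered on w_prior
    (the DRL network's current last-layer weights)."""
    N, k = phi.shape
    A = phi.T @ phi / N            # Eq. 2
    b = phi.T @ targets / N
    # the prior pulls the batch solution towards w_prior instead of towards 0
    return np.linalg.solve(A + lam * np.eye(k), b + lam * w_prior)
```

With $\lambda \to 0$ this recovers the unregularized FQI solution; with large $\lambda$ it stays near the DRL weights, which is what lets the batch update retain the knowledge accumulated by SGD.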
4 Experiments
In this section, we present experiments showcasing the improved performance attained by our LS-DQN algorithm compared to state-of-the-art DRL methods. Our experiments are divided into three sections. In Section 4.1, we start by investigating the behavior of SRL algorithms in high dimensional environments. We then show results for LS-DQN on five Atari domains in Section 4.2, and compare the resulting performance to regular DQN and DDQN agents. Finally, in Section 4.3, we present an ablative analysis of the LS-DQN algorithm, which clarifies the reasons behind our algorithm's success.
4.1 SRL Algorithms with High Dimensional Observations
In the first set of experiments, we explore how least squares SRL algorithms perform in domains with high dimensional observations. This is an important step before applying an SRL method within the LS-DQN algorithm. In particular, we focused on answering the following questions: (1) What regularization method should be used? (2) How should data be generated for the LS algorithm? (3) How many policy improvement iterations should be performed?
To answer these questions, we performed the following procedure. We trained DQN agents on two games from the Arcade Learning Environment (ALE, Bellemare et al. 2013), namely Breakout and Qbert, using the vanilla DQN implementation (Mnih et al., 2015). For each DQN run, we (1) periodically save the current DQN network weights and ER (every three million DQN steps, referred to as one epoch, out of a total of 50 million steps); (2) use an SRL algorithm (LSTDQ or FQI) to re-learn the weights of the last layer; and (3) evaluate the resulting DQN network by temporarily replacing the DQN weights with the SRL solution weights. After the evaluation, we replace back the original DQN weights and continue training.
Each evaluation entails multiple rollouts with an $\epsilon$-greedy policy (similar to Mnih et al. 2015); each rollout starts from a new (random) game and follows the policy until the agent loses all of its lives. This periodic evaluation setup allowed us to effectively experiment with the SRL algorithms and obtain clear comparisons with DQN, without waiting for full DQN runs to complete.
(1) Regularization: Experiments with standard SRL methods without any regularization yielded poor results. We found the main reason to be that the matrices used in the SRL solutions (Equations 1 and 2) are ill-conditioned, resulting in instability. One possible explanation stems from the sparseness of the features. The DQN uses ReLU activations (Jarrett et al., 2009), which cause the network to learn sparse feature representations; for example, once the DQN completed training on Breakout, the vast majority of the features were zero.

Once we added a regularization term, the performance of the SRL algorithms improved. We experimented with the $\ell_2$ and Bayesian Prior (BP) regularizers. While the $\ell_2$ regularizer showed competitive performance in Breakout, we found that the BP performed better across domains (Figure 1, best regularizers chosen, shows the average score of each configuration following the evaluation procedure explained above, for the different epochs). Moreover, the BP regularizer was not sensitive to the scale of the regularization coefficient: coefficients across a wide range performed well in all domains. A table of average scores for different coefficients can be found in Appendix C.1. Note that we do not expect much improvement here, as we replace back the original DQN weights after each evaluation.
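The conditioning issue is easy to reproduce on synthetic ReLU-style sparse features; the data below is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# sparse, ReLU-like features: mostly zeros, as observed for the trained DQN
feats = np.maximum(rng.normal(size=(1000, 64)) - 1.5, 0.0)
A = feats.T @ feats / len(feats)

cond_plain = np.linalg.cond(A)
cond_reg = np.linalg.cond(A + 0.1 * np.eye(64))   # ridge/BP-style term
# adding a regularizer strictly shrinks the condition number of a PSD matrix
assert cond_reg < cond_plain
```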
(2) Data Gathering: We experimented with two mechanisms for generating data: (1) generating new data from the current policy, and (2) using the ER. We found that the data generation mechanism had a significant impact on the performance of the algorithms. When the data was generated only from the current DQN policy (without the ER), the SRL solution resulted in poor performance compared to a solution using the ER (as was also observed by Mnih et al. 2015). We believe that the main reason the ER works well is that it contains data sampled from multiple (past) policies, and therefore exhibits more exploration of the state space.
(3) Policy Improvement: LSTDQ and FQI are off-policy algorithms and can be applied iteratively on the same dataset (e.g., LSPI, Lagoudakis & Parr 2003). However, in practice, we found that performing multiple iterations did not improve the results. A possible explanation is that the improved policy reaches new areas of the state space that are not well represented in the current ER, and therefore are not approximated well by the SRL solution or the current DRL network.
4.2 Atari Experiments
We next ran the full LS-DQN algorithm (Alg. 1) on five Atari domains: Asterix, Space Invaders, Breakout, Qbert and Bowling. We ran LS-DQN using both DQN and DDQN as the DRL algorithm, and using both LSTDQ and FQI as the SRL algorithm. We ran an LS-UPDATE periodically throughout training, for a total of 50M steps. We used the current ER buffer as the 'generated' data in the LS-UPDATE function (Line 7 in Alg. 1), and a fixed regularization coefficient for the Bayesian prior solution (both for FQI and LSTDQ). We emphasize that we did not use any additional samples beyond those already obtained by the DRL algorithm.
Figure 2 presents the learning curves of the DQN network, LS-DQN with LSTDQ, and LS-DQN with FQI (referred to as DQN, LS-DQN-LSTDQ, and LS-DQN-FQI, respectively) on three domains: Asterix, Space Invaders and Breakout. Note that we use the same evaluation process described in Mnih et al. (2015). We were also interested in a test that measures differences between full learning curves, and not only their maximal scores. Hence, we performed a Wilcoxon signed-rank test on the average scores of the three DQN variants. This non-parametric statistical test measures whether related samples differ in their means (Wilcoxon, 1945). We found that the learning curves of both LS-DQN-LSTDQ and LS-DQN-FQI were statistically significantly better than those of DQN for all three domains.
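For reference, the signed-rank test used here can be implemented directly (normal-approximation form; the paired score arrays a user would pass in stand for per-epoch learning-curve averages, not our actual data):

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples
    (normal approximation to the null distribution)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # rank |d_i|, averaging ranks over ties
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

In practice, `scipy.stats.wilcoxon` provides the same test, including exact small-sample p-values.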
Table 1 presents the maximal average scores along the learning curves for the five domains, when the SRL algorithms were incorporated into both DQN and DDQN agents (the notation is similar, i.e., LS-DDQN-FQI); scores for DQN and DDQN were taken from Van Hasselt et al. (2016). Our algorithm, LS-DQN, attained better performance than the vanilla DQN agents, as seen by the higher scores in Table 1 and Figure 2. We observe an interesting phenomenon in the game Asterix: in Figure 2, the DQN's score "crashes" to zero (as was also observed by Van Hasselt et al. 2016). LS-DQN-LSTDQ did not manage to resolve this issue, even though it achieved a significantly higher score than that of the DQN. LS-DQN-FQI, however, maintained steady performance and did not "crash" to zero. We found that, in general, incorporating FQI as the SRL algorithm into the DRL agents resulted in improved performance.
Algorithm     | Breakout | Space Invaders | Asterix  | Qbert    | Bowling
DQN           | 401.20   | 1975.50        | 6011.67  | 10595.83 | 42.40
LS-DQN-LSTDQ  | 420.00   | 3207.44        | 13704.23 | 10767.47 | 61.21
LS-DQN-FQI    | 438.55   | 3360.81        | 13636.81 | 12981.42 | 75.38
DDQN          | 375.00   | 3154.60        | 15150.00 | 14875.00 | 70.50
LS-DDQN-FQI   | 397.94   | 4400.83        | 16270.45 | 12727.94 | 80.75
(DQN and DDQN scores taken from Van Hasselt et al. 2016.)
4.3 Ablative Analysis
In the previous section, we saw that the LS-DQN algorithm improves performance over the DQN agents across a number of domains. The goal of this section is to understand the reasons for LS-DQN's improved performance by conducting an ablative analysis of our algorithm. For this analysis, we used a DQN agent trained on the game of Breakout, in the same manner as described in Section 4.1. We focus on analyzing LS-DQN-FQI, which has the same optimization objective as DQN (cf. Section 2), and postulate the following conjectures for its improved performance:

(i) The SRL algorithms use a Bayesian regularization term, which is not included in the DQN objective.
(ii) The SRL algorithms have fewer hyperparameters to tune and produce an explicit solution, compared to SGD-based DRL solutions.
(iii) Large-batch methods perform better than small-batch methods when combining DRL with SRL.
(iv) SRL algorithms focus on training the last layer, and are easier to optimize.
The Experiments: We started by analyzing the learning method of the last layer (i.e., the 'shallow' part of the learning process). We did this by optimizing the last layer, at each LS-UPDATE epoch, using (1) FQI with a Bayesian prior and an LS solution, and (2) an ADAM (Kingma & Ba, 2014) optimizer, with and without an additional Bayesian prior regularization term in the loss function. We compared these approaches across a range of mini-batch sizes, using the same prior regularization coefficient for all experiments.

Relating to conjecture (ii), note that the FQI algorithm has only one hyperparameter to tune and produces an explicit solution using the whole dataset simultaneously. ADAM, on the other hand, has more hyperparameters to tune and operates on mini-batches.
The Experimental Setup: The experiments were conducted in a periodic fashion similar to Section 4.1, i.e., testing behavior at different epochs over a vanilla DQN run. For both ADAM and FQI, we first collected data samples from the ER at each epoch. For ADAM, we performed several iterations over the data, where each iteration consisted of randomly permuting the data, dividing it into mini-batches, and optimizing with ADAM over the mini-batches (the selected hyperparameters used for these experiments can be found in Appendix D, along with results for one iteration of ADAM). We then simulated the agent and report average scores across evaluation trajectories.
The Results: Figure 3 depicts the difference between the average scores of (1) and (2) and the DQN baseline scores. We see that larger mini-batches result in improved performance. Moreover, the LS solution (FQI) outperforms the ADAM solutions with smaller mini-batches on most epochs, and even slightly outperforms the best of them (the largest mini-batch size with a Bayesian prior). In addition, a solution with a prior performs better than a solution without a prior.
Summary: Our ablative analysis experiments strongly support conjectures (iii) and (iv) above for explaining LS-DQN's improved performance. That is, large-batch methods perform better than small-batch methods when combining DRL with SRL; and SRL algorithms that focus on training only the last layer are easier to optimize, as we see that optimizing the last layer alone improved the score across epochs.
We finish this section with an interesting observation. While the LS solution improves the performance of the DRL agents, we found that the LS solution weights are very close to the baseline DQN solution (see Appendix D for the full results). Moreover, the distance was inversely proportional to the performance of the solution: the FQI solution that performed best was the closest (in norm) to the DQN solution, and vice versa, with orders-of-magnitude differences between the norms of solutions that performed well and those that did not. Similar results, i.e., that large-batch methods find solutions that are close to the baseline, have been reported by Keskar et al. (2016). We further compare our results with the findings of Keskar et al. in the section to follow.
5 Related work
We now review recent works that are related to this paper.
Regularization:
The general idea of applying regularization for feature selection and to avoid overfitting is a common theme in machine learning. However, applying it to RL algorithms is challenging, since these algorithms are based on finding a fixed point rather than optimizing a loss function (Kolter & Ng, 2009).

Value-based DRL approaches do not use regularization layers (e.g., pooling, dropout and batch normalization), which are popular in other deep learning methods. The DQN, for example, has a relatively shallow architecture (three convolutional layers, followed by two fully connected layers) without any regularization layers. Recently, regularization has been introduced in problems that combine value-based RL with other learning objectives. For example, Hester et al. (2017) combine RL with supervised learning from expert demonstrations, and introduce regularization to avoid overfitting the expert data; Kirkpatrick et al. (2017) introduce regularization to avoid catastrophic forgetting in transfer learning. SRL methods, on the other hand, perform well with regularization (Kolter & Ng, 2009) and have been shown to converge (Farahmand et al., 2009).

Batch size: Our results suggest that a large-batch LS solution for the last layer of a value-based DRL network can significantly improve its performance. This result is somewhat surprising, as practitioners have observed that using larger batches in deep learning degrades the quality of the model, as measured by its ability to generalize (Keskar et al., 2016).
However, our method differs from the experiments performed by Keskar et al. (2016), and therefore does not contradict them, for the following reasons: (1) The LS-DQN algorithm uses the large-batch solution only for the last layer; the lower layers of the network are not affected by the large-batch solution and therefore do not converge to a sharp minimum. (2) The experiments of Keskar et al. (2016) were performed on classification tasks, whereas our algorithm minimizes an MSE loss. (3) Keskar et al. showed that large-batch solutions work well when piggybacking (warm-starting) on a small-batch solution. Similarly, our algorithm mixes small- and large-batch solutions as it switches between them periodically.
Moreover, it was recently observed that flat minima in practical deep learning model classes can be turned into sharp minima via reparameterization without changing the generalization gap, and hence this phenomenon requires further investigation (Dinh et al., 2017). In addition, Hoffer et al. (2017) showed that large-batch training can generalize as well as small-batch training by adapting the number of iterations. Thus, we strongly believe that our findings on combining large and small batches in DRL are in agreement with recent results from other deep learning research groups.
Deep and Shallow RL: Using the last hidden layer of a DNN as a feature extractor and learning the last layer with a different algorithm has been addressed before in the literature, e.g., in the context of transfer learning (Donahue et al., 2013). In RL, there have been competitive attempts to use SRL with unsupervised features to play Atari (Liang et al., 2016; Blundell et al., 2016), but to the best of our knowledge, this is the first attempt that successfully combines DRL with SRL algorithms.
6 Conclusion
In this work we presented LS-DQN, a hybrid approach that combines least-squares RL updates with online deep RL. LS-DQN obtains the best of both worlds: rich representations from deep RL networks, as well as the stability and data efficiency of least squares methods. Experiments with two deep RL methods and two least squares methods revealed that the hybrid approach consistently improves over vanilla deep RL in the Atari domain. Our ablative analysis indicates that the success of the LS-DQN algorithm is due to the large batch updates made possible by using least squares.
This work focused on value-based RL. However, our hybrid linear/deep approach can be extended to other RL methods, such as actor-critic (Mnih et al., 2016). More broadly, decades of research on linear RL methods have produced methods with strong guarantees, such as approximate linear programming (Desai et al., 2012) and modified policy iteration (Scherrer et al., 2015). Our approach shows that with the correct modifications, such as our Bayesian regularization term, linear methods can be combined with deep RL. This opens the door to future combinations of well-understood linear RL with deep representation learning.

Acknowledgements
This research was supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 306638 (SUPREL). A. Tamar is supported in part by Siemens and the Viterbi Scholarship, Technion.
References
 Barto & Crites (1996) Barto, AG and Crites, RH. Improving elevator performance using reinforcement learning. Advances in neural information processing systems, 8:1017–1023, 1996.

Bellemare et al. (2013) Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bertsekas (2008) Bertsekas, Dimitri P. Approximate dynamic programming. 2008.
 Blundell et al. (2016) Blundell, Charles, Uria, Benigno, Pritzel, Alexander, Li, Yazhe, Ruderman, Avraham, Leibo, Joel Z, Rae, Jack, Wierstra, Daan, and Hassabis, Demis. Model-free episodic control. stat, 1050:14, 2016.
 Box & Tiao (2011) Box, George EP and Tiao, George C. Bayesian inference in statistical analysis. John Wiley & Sons, 2011.
 Desai et al. (2012) Desai, Vijay V, Farias, Vivek F, and Moallemi, Ciamac C. Approximate dynamic programming via a smoothed linear program. Operations Research, 60(3):655–674, 2012.
 Dinh et al. (2017) Dinh, Laurent, Pascanu, Razvan, Bengio, Samy, and Bengio, Yoshua. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
 Donahue et al. (2013) Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 30th international conference on machine learning (ICML13), pp. 647–655, 2013.
 Ernst et al. (2005) Ernst, Damien, Geurts, Pierre, and Wehenkel, Louis. Treebased batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
 Farahmand et al. (2009) Farahmand, Amir M, Ghavamzadeh, Mohammad, Mannor, Shie, and Szepesvári, Csaba. Regularized policy iteration. In Advances in Neural Information Processing Systems, pp. 441–448, 2009.
 Ghavamzadeh et al. (2015) Ghavamzadeh, Mohammad, Mannor, Shie, Pineau, Joelle, Tamar, Aviv, et al. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 8(56):359–483, 2015.
 Hester et al. (2017) Hester, Todd, Vecerik, Matej, Pietquin, Olivier, Lanctot, Marc, Schaul, Tom, Piot, Bilal, Sendonaris, Andrew, DulacArnold, Gabriel, Osband, Ian, Agapiou, John, et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.
 Hinton et al. (2012) Hinton, Geoffrey, Srivastava, Nitish, and Swersky, Kevin. Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent. 2012.
 Hoffer et al. (2017) Hoffer, Elad, Hubara, Itay, and Soudry, Daniel. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.
 Jarrett et al. (2009) Jarrett, Kevin, Kavukcuoglu, Koray, LeCun, Yann, et al. What is the best multistage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pp. 2146–2153. IEEE, 2009.
 Keskar et al. (2016) Keskar, Nitish Shirish, Mudigere, Dheevatsa, Nocedal, Jorge, Smelyanskiy, Mikhail, and Tang, Ping Tak Peter. On largebatch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
 Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kirkpatrick et al. (2017) Kirkpatrick, James, Pascanu, Razvan, Rabinowitz, Neil, Veness, Joel, Desjardins, Guillaume, Rusu, Andrei A, Milan, Kieran, Quan, John, Ramalho, Tiago, GrabskaBarwinska, Agnieszka, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835, 2017.
 Kolter & Ng (2009) Kolter, J Zico and Ng, Andrew Y. Regularization and feature selection in leastsquares temporal difference learning. In Proceedings of the 26th annual international conference on machine learning. ACM, 2009.
 Lagoudakis & Parr (2003) Lagoudakis, Michail G and Parr, Ronald. Leastsquares policy iteration. Journal of machine learning research, 4(Dec):1107–1149, 2003.
 Liang et al. (2016) Liang, Yitao, Machado, Marlos C, Talvitie, Erik, and Bowling, Michael. State of the art control of atari games using shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, 2016.
 Lin (1993) Lin, LongJi. Reinforcement learning for robots using neural networks. 1993.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Mnih et al. (2016) Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
 Riedmiller (2005) Riedmiller, Martin. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.
 Scherrer et al. (2015) Scherrer, Bruno, Ghavamzadeh, Mohammad, Gabillon, Victor, Lesner, Boris, and Geist, Matthieu. Approximate modified policy iteration and its application to the game of tetris. Journal of Machine Learning Research, 16:1629–1676, 2015.
 Silver et al. (2016) Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Sutton & Barto (1998) Sutton, Richard and Barto, Andrew. Reinforcement Learning: An Introduction. MIT Press, 1998.
 Tessler et al. (2017) Tessler, Chen, Givony, Shahar, Zahavy, Tom, Mankowitz, Daniel J, and Mannor, Shie. A deep hierarchical approach to lifelong learning in minecraft. Proceedings of the National Conference on Artificial Intelligence (AAAI), 2017.
 Tsitsiklis et al. (1997) Tsitsiklis, John N and Van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
 Van Hasselt et al. (2016) Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double q-learning. Proceedings of the National Conference on Artificial Intelligence (AAAI), 2016.
 Wang et al. (2016) Wang, Ziyu, Schaul, Tom, Hessel, Matteo, van Hasselt, Hado, Lanctot, Marc, and de Freitas, Nando. Dueling network architectures for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1995–2003, 2016.
 Wilcoxon (1945) Wilcoxon, Frank. Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83, 1945.
 Zahavy et al. (2016) Zahavy, Tom, Ben-Zrihem, Nir, and Mannor, Shie. Graying the black box: Understanding dqns. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1899–1908, 2016.
Appendix A Adding Regularization to LSTD-Q
For LSTD-Q, regularization cannot be applied directly, since the algorithm finds a fixed point rather than solving an LS problem. To overcome this obstacle, we augment the fixed-point function of the LSTD-Q algorithm with a regularization term, based on Kolter & Ng (2009):
(3)  w = f(w),  f(w) = argmin_u { ‖Φu − TΦw‖₂² + R(u) },
where the arg min implements the linear projection Π onto the span of the features Φ, T stands for the Bellman optimality operator, and R(·) is the regularization function. Once the augmented problem is solved with R(u) = λ‖u − w_k‖₂², i.e., a Bayesian prior centered at the current last-layer weights w_k, the solution to the regularized LSTD-Q problem is given by w = (Φᵀ(Φ − γΦ′) + λI)⁻¹(Φᵀr + λw_k), where Φ′ contains the next-state features under the greedy action and r the rewards. This derivation results in the same solution for LSTD-Q as was obtained for FQI (Equation 2). In the special case R(u) = λ‖u‖₂² (i.e., w_k = 0), we recover the regularized solution of Kolter & Ng (2009).
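The closed-form solution above can be sketched in a few lines of NumPy (a minimal illustration under the assumptions of this appendix; the feature matrices, rewards, and prior weights below are synthetic placeholders, not the paper's data):

```python
import numpy as np

def regularized_lstdq(phi, phi_next, rewards, w_prior, gamma=0.99, lam=1.0):
    """Solve the regularized LSTD-Q fixed point in closed form:
    w = (Phi^T (Phi - gamma * Phi') + lam * I)^-1 (Phi^T r + lam * w_prior).
    Setting w_prior = 0 recovers the l2-regularized LSTD of Kolter & Ng (2009)."""
    d = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next) + lam * np.eye(d)
    b = phi.T @ rewards + lam * w_prior
    return np.linalg.solve(A, b)

# Tiny synthetic example: 5 transitions, 3 state-action features.
rng = np.random.default_rng(0)
phi = rng.normal(size=(5, 3))       # features of (s, a)
phi_next = rng.normal(size=(5, 3))  # features of (s', greedy a')
r = rng.normal(size=5)              # rewards
w = regularized_lstdq(phi, phi_next, r, w_prior=np.zeros(3))
```

Solving the linear system directly (rather than iterating the fixed point) is what makes the LS approach data efficient: each transition is used exactly once to build A and b.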
Appendix B LS-DQN Algorithm
Figure 4 provides an overview of the LS-DQN algorithm described in the main paper. The DNN agent is first trained for a fixed number of DRL steps (A). Data is then gathered from the agent’s experience replay (LS.1) and features are generated with the current network (LS.2). An SRL algorithm is applied to the generated features (LS.3) with a regularized Bayesian weight update (LS.4), in which the current weights of the last hidden layer serve as the prior. These weights are then replaced by the SRL output, and the process is repeated.
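The alternation between DRL training and the SRL correction can be sketched as the following loop (a schematic with stub functions; `train_drl`, `generate_features`, and `srl_solve` are hypothetical placeholders, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_drl(w, steps):
    """Stub for DRL training (A): returns updated last-layer weights."""
    return w + 0.01 * rng.normal(size=w.shape)

def generate_features(states, w):
    """Stub for feature generation (LS.2) with the current network."""
    return rng.normal(size=(len(states), w.shape[0]))

def srl_solve(phi, targets, w_prior, lam=1.0):
    """Regularized LS solution (LS.3) with w_prior as the Bayesian prior (LS.4)."""
    d = phi.shape[1]
    return np.linalg.solve(phi.T @ phi + lam * np.eye(d),
                           phi.T @ targets + lam * w_prior)

w = np.zeros(4)                          # last hidden layer weights
for iteration in range(3):               # outer LS-DQN loop
    w = train_drl(w, steps=10)           # (A) DRL training phase
    states = list(range(50))             # (LS.1) data from the experience replay
    phi = generate_features(states, w)   # (LS.2) features under current network
    targets = rng.normal(size=len(states))   # e.g. FQI regression targets
    w = srl_solve(phi, targets, w_prior=w)   # (LS.3)+(LS.4), prior = current w
```

Note how the output of each SRL step becomes both the new last-layer weights and the prior for the next iteration.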
Appendix C Results for SRL Algorithms with High Dimensional Observations
We present the average scores (averaged over rollouts) at different epochs, for both the original DQN and after relearning the last layer using LSTD-Q with different regularization coefficients.
Breakout
Epoch  DQN  LSTD-Q (one column per regularization coefficient)
Epoch  54  49  48  44  53  49  48  50  28  30  46 
Epoch  207  189  196  193  64  30  18  4  9  5  171 
Epoch  238  247  314  284  277  254  270  232  225  194  271 
Epoch  238  271  289  249  207  201  291  326  274  304  212 
Epoch  265  311  322  315  208  109  175  36  14  48  292 
Epoch  299  331  327  328  259  150  248  227  281  245  164 
Epoch  332  335  350  266  128  67  145  249  291  214  325 
Epoch  361  352  343  262  204  65  270  309  287  304  324 
Epoch  294  291  323  319  101  85  224  276  347  340  350 
Epoch  186  297  256  263  243  236  349  323  333  333  165 
Epoch  241  277  290  140  79  111  338  335  330  315  233 
Epoch  328  336  327  352  226  208  337  374  354  377  302 
Epoch  343  305  247  308  62  112  338  342  305  344  316 
Epoch  278  294  259  273  156  198  320  355  350  346  306 
Epoch  312  327  282  292  161  141  321  381  368  367  252 
Epoch  186  160  283  273  170  225  370  314  325  324  114 
Qbert
Epoch  DQN  LSTD-Q (one column per regularization coefficient)
Epoch  3470  3070  2163  1998  1599  2078  964  629  831  484  2978 
Epoch  2794  1853  2196  2565  3839  3558  1376  2123  1728  2388  2060 
Epoch  4253  4188  4579  4034  4031  2239  561  691  824  570  4148 
Epoch  2789  2489  2536  2750  3435  5214  2730  2303  1356  594  1878 
Epoch  6426  6831  7480  6703  3419  3335  4205  3519  4673  5231  7410 
Epoch  8480  7265  7950  5300  4978  4178  4533  6005  6133  4829  8356 
Epoch  8176  9036  8635  7774  7269  7428  6196  3030  3246  2343  8643 
Epoch  9104  10340  9935  7293  7689  7343  6728  2913  3299  1473  9315 
Epoch  9274  10288  9115  7508  6660  7800  120  8133  4880  5018  8156 
Epoch  10523  7245  9704  7949  8640  7794  2663  8905  10044  7585  12584 
Epoch  10821  11510  9971  7064  6836  9908  1020  11868  9940  11138  10290 
Epoch  7291  10134  7583  6673  7815  9028  5564  8893  8649  6748  7438 
Epoch  12365  12220  13103  11868  11531  10091  2753  10804  8216  8835  13054 
Epoch  11686  11085  10338  10811  8386  9580  2980  6469  6435  6071  10249 
Epoch  11228  12841  13696  10971  5820  10148  7524  11959  9270  6949  11630 
Epoch  11643  12489  13468  11773  8191  8976  198  7284  7598  5649  12923 
Appendix D Results for Ablative Analysis
We used the implementation of ADAM from the optim package for torch, which can be found at https://github.com/torch/optim/blob/master/adam.lua. We used the default hyperparameters (except for the learning rate): learningRate, learningRateDecay, beta1, beta2, epsilon, and weightDecay. For solutions that use the prior, we additionally set the prior’s regularization coefficient. Figure 5 depicts the offset of the average scores from the DQN’s scores after one iteration of the ADAM algorithm.
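For concreteness, a single prior-regularized ADAM iteration on the last layer can be sketched as follows (a schematic NumPy re-implementation, not the torch/optim code used in the experiments; the mini-batch data, λ value, and variable names are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM iteration (Kingma & Ba, 2014) with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
phi = rng.normal(size=(32, 4))   # mini-batch of last-layer features (MB=32)
y = rng.normal(size=32)          # regression targets
w_prior = rng.normal(size=4)     # DQN's original last-layer weights (the prior)
w, lam = w_prior.copy(), 1.0

# Gradient of the prior-regularized objective
# ||Phi w - y||^2 / n + lam * ||w - w_prior||^2
grad = 2 * phi.T @ (phi @ w - y) / len(y) + 2 * lam * (w - w_prior)
w, m, v = adam_step(w, grad, np.zeros(4), np.zeros(4), t=1)
```

With the default learning rate, a single step moves each weight by at most lr, which is why the prior-regularized solutions in Table 4 stay so close to the DQN weights.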
Table 4 shows the norm of the difference between the different solution weights and the original last-layer weights of the DQN (divided by the norm of the DQN’s weights, for scale), averaged over epochs. Note that MB stands for the mini-batch size used by the ADAM solver.
  Batch  MB=32 iter=1  MB=32 iter=20  MB=512 iter=1  MB=512 iter=20  MB=4096 iter=1  MB=4096 iter=20
w/ prior  3e-4  3e-3  3e-3  2e-3  2e-3  1.7e-3  1.8e-3
wo/ prior  —  3.8e-2  2.7e-1  1.3e-2  1.2e-1  5e-3  5e-2
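The relative distance reported in Table 4 can be computed in a few lines (a sketch; the weight vectors below are toy placeholders):

```python
import numpy as np

def relative_weight_change(w_new, w_dqn):
    """Norm of the weight difference, scaled by the norm of the DQN's weights."""
    return np.linalg.norm(w_new - w_dqn) / np.linalg.norm(w_dqn)

w_dqn = np.array([1.0, -2.0, 0.5])                 # original last-layer weights
w_new = w_dqn + np.array([0.001, -0.002, 0.0005])  # solution weights
print(round(relative_weight_change(w_new, w_dqn), 6))  # → 0.001
```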
Appendix E Feature augmentation
The LS-DQN algorithm requires a function that creates features (Algorithm 1, Line 9) for a dataset using the current value-based DRL network. Notice that for most value-based DRL networks (e.g., DQN and DDQN), the DRL features (the output of the last hidden layer) are a function of the state only, and not of the action. The FQI and LSTD-Q algorithms, on the other hand, require features that are a function of both state and action. We therefore augment the DRL features to be a function of the action in the following manner. Denote by φ(s) ∈ ℝ^d the output of the last hidden layer in the DRL network, where d is the number of neurons in this layer. We define the augmented features φ(s,a) ∈ ℝ^{d·|A|} to equal φ(s) on the subset of indices that belongs to action a and zero otherwise, where |A| refers to the size of the action space. Note that in practice, DQN and DDQN maintain an experience replay (ER), and we create features for all the states in the ER. A more computationally efficient approach would be to compute the features during the forward pass when the DRL agent visits each state, and store them in the ER alongside the state. However, SRL algorithms work only with features that are fixed over time, whereas the network’s features change during training; we therefore generate new features for all states with the current DRL network.
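The augmentation described above amounts to placing the state features in the block of indices belonging to the chosen action (a minimal sketch; the feature values below are illustrative):

```python
import numpy as np

def augment_features(phi_s, action, num_actions):
    """Map state features phi(s) in R^d to state-action features in
    R^(d * |A|): phi(s) is copied into the block of indices belonging
    to `action`, and all other entries are zero."""
    d = phi_s.shape[0]
    phi_sa = np.zeros(d * num_actions)
    phi_sa[action * d:(action + 1) * d] = phi_s
    return phi_sa

phi_s = np.array([0.2, -1.0, 3.5])   # last-hidden-layer activations, d = 3
phi_sa = augment_features(phi_s, action=1, num_actions=3)
# the middle block (action 1) holds phi(s); the other two blocks are zero
```

This block structure means the SRL solution effectively learns a separate last-layer weight vector per action, matching the per-action output heads of DQN-style networks.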