Shallow Updates for Deep Reinforcement Learning

by   Nir Levine, et al.
berkeley college

Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN) have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy. Batch reinforcement learning methods with linear representations, on the other hand, are more stable and require less hyperparameter tuning. Yet, substantial feature engineering is necessary to achieve good results. In this work we propose a hybrid approach -- the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method. We do this by periodically re-training the last hidden layer of a DRL network with a batch least squares update. Key to our approach is a Bayesian regularization term for the least squares update, which prevents over-fitting to the more recent data. We tested LS-DQN on five Atari games and demonstrated significant improvement over vanilla DQN and Double-DQN. We also investigated the reasons for the superior performance of our method. Interestingly, we found that the performance improvement can be attributed to the large batch size used by the LS method when optimizing the last layer.


1 Introduction

Reinforcement learning (RL) is a field of research that uses dynamic programming (DP; Bertsekas 2008), among other approaches, to solve sequential decision making problems. The main challenge in applying DP to real world problems is an exponential growth of computational requirements as the problem size increases, known as the curse of dimensionality (Bertsekas, 2008).

RL tackles the curse of dimensionality by approximating terms in the DP calculation such as the value function or policy. Popular function approximators for this task include deep neural networks, henceforth termed deep RL (DRL), and linear architectures, henceforth termed shallow RL (SRL).

SRL methods have enjoyed wide popularity over the years (see, e.g.,  Tsitsiklis et al. 1997; Bertsekas 2008 for extensive reviews). In particular, batch algorithms based on a least squares (LS) approach, such as Least Squares Temporal Difference (LSTD, Lagoudakis & Parr 2003) and Fitted-Q Iteration (FQI, Ernst et al. 2005) are known to be stable and data efficient. However, the success of these algorithms crucially depends on the quality of the feature representation. Ideally, the representation encodes rich, expressive features that can accurately represent the value function. However, in practice, finding such good features is difficult and often hampers the usage of linear function approximation methods.

In DRL, on the other hand, the features are learned together with the value function in a deep architecture. Recent advancements in DRL using convolutional neural networks demonstrated learning of expressive features (Zahavy et al., 2016; Wang et al., 2016) and state-of-the-art performance in challenging tasks such as video games (Mnih et al. 2015; Tessler et al. 2017; Mnih et al. 2016), and Go (Silver et al., 2016). To date, the most impressive DRL results (e.g., the works of Mnih et al. 2015, Mnih et al. 2016) were obtained using online RL algorithms, based on a stochastic gradient descent (SGD) procedure.

On the one hand, SRL is stable and data efficient. On the other hand, DRL learns powerful representations. This motivates us to ask: can we combine DRL with SRL to leverage the benefits of both?

In this work, we develop a hybrid approach that combines batch SRL algorithms with online DRL. Our main insight is that the last layer in a deep architecture can be seen as a linear representation, with the preceding layers encoding features. Therefore, the last layer can be learned using standard SRL algorithms. Following this insight, we propose a method that repeatedly re-trains the last hidden layer of a DRL network with a batch SRL algorithm, using data collected throughout the DRL run.

We focus on value-based DRL algorithms (e.g., the popular DQN of Mnih et al. 2015) and on SRL based on LS methods [1], and propose the Least Squares DQN algorithm (LS-DQN). Key to our approach is a novel regularization term for the least squares method that uses the DRL solution as a prior in a Bayesian least squares formulation. Our experiments demonstrate that this hybrid approach significantly improves performance on the Atari benchmark for several combinations of DRL and SRL methods.

Footnote 1: Our approach can be generalized to other DRL/SRL variants.

To support our results, we performed an in-depth analysis to tease out the factors that make our hybrid approach outperform DRL. Interestingly, we found that the improved performance is mainly due to the large batch size of SRL methods compared to the small batch size that is typical for DRL.

2 Background

In this section we describe our RL framework and several shallow and deep RL algorithms that will be used throughout the paper.

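RL Framework: We consider a standard RL formulation (Sutton & Barto, 1998) based on a Markov Decision Process (MDP). An MDP is a tuple $\langle S, A, R, P, \gamma \rangle$, where $S$ is a finite set of states, $A$ is a finite set of actions, and $\gamma \in [0, 1]$ is the discount factor. A transition probability function $P : S \times A \to \Delta_S$ maps states and actions to a probability distribution over next states. Finally, $R : S \times A \to [R_{min}, R_{max}]$ denotes the reward. The goal in RL is to learn a policy $\pi : S \to \Delta_A$ that solves the MDP by maximizing the expected discounted return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid \pi\right]$. Value based RL methods make use of the action value function $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$, which represents the expected discounted return of executing action $a \in A$ from state $s \in S$ and following the policy $\pi$ thereafter. The optimal action value function $Q^*(s, a)$ obeys a fundamental recursion known as the Bellman equation $Q^*(s_t, a_t) = \mathbb{E}\left[r_t + \gamma \max_{a'} Q^*(s_{t+1}, a')\right]$.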

2.1 SRL algorithms

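Least Squares Temporal Difference Q-Learning (LSTD-Q): LSTD (Barto & Crites, 1996) and LSTD-Q (Lagoudakis & Parr, 2003) are batch SRL algorithms. LSTD-Q learns a control policy $\pi$ from a batch of samples by estimating a linear approximation $\hat{Q}^\pi = \Phi w^\pi$ of the action value function $Q^\pi \in \mathbb{R}^{|S||A|}$, where $w^\pi \in \mathbb{R}^k$ are a set of weights and $\Phi \in \mathbb{R}^{|S||A| \times k}$ is a feature matrix. Each row of $\Phi$ represents a feature vector for a state-action pair $\langle s, a \rangle$. The weights $w^\pi$ are learned by enforcing $\hat{Q}^\pi$ to satisfy a fixed point equation w.r.t. the projected Bellman operator, resulting in a system of linear equations $A w^\pi = b$, where $A = \Phi^T (\Phi - \gamma P \Pi_\pi \Phi)$ and $b = \Phi^T R$. Here, $R \in \mathbb{R}^{|S||A|}$ is the reward vector, $P \in \mathbb{R}^{|S||A| \times |S|}$ is the transition matrix and $\Pi_\pi \in \mathbb{R}^{|S| \times |S||A|}$ is a matrix describing the policy. Given a set of $N_{SRL}$ samples $D = \{s_i, a_i, r_i, s_{i+1}\}_{i=1}^{N_{SRL}}$, we can approximate $A$ and $b$ with the following empirical averages:

$$\tilde{A} = \frac{1}{N_{SRL}} \sum_{i=1}^{N_{SRL}} \left[ \phi(s_i, a_i)^T \left( \phi(s_i, a_i) - \gamma \phi(s_{i+1}, \pi(s_{i+1})) \right) \right], \qquad \tilde{b} = \frac{1}{N_{SRL}} \sum_{i=1}^{N_{SRL}} \left[ \phi(s_i, a_i)^T r_i \right]. \tag{1}$$

The weights $w^\pi$ can be calculated using a least squares minimization: $\tilde{w}^\pi = \arg\min_w \| \tilde{A} w - \tilde{b} \|_2^2$ or by calculating the pseudo-inverse: $\tilde{w}^\pi = \tilde{A}^\dagger \tilde{b}$. LSTD-Q is an off-policy algorithm: the same set of samples $D$ can be used to train any policy $\pi$ so long as $\pi(s_{i+1})$ is defined for every $s_{i+1}$ in the set.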
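As a minimal sketch of the LSTD-Q estimate described above, the following NumPy snippet forms the empirical averages of A and b from a batch and solves the resulting linear system by least squares. The function name `lstdq_weights`, the array layout (one feature row per sample), and the toy data are assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np

def lstdq_weights(phi, phi_next, rewards, gamma):
    """Estimate LSTD-Q weights from a batch of transitions.

    phi      : (N, k) features of the visited pairs (s_i, a_i)
    phi_next : (N, k) features of (s_{i+1}, pi(s_{i+1})) for the evaluated policy
    rewards  : (N,)   observed rewards r_i
    """
    n = phi.shape[0]
    # Empirical averages of A = Phi^T (Phi - gamma P Pi Phi) and b = Phi^T R.
    a_tilde = phi.T @ (phi - gamma * phi_next) / n
    b_tilde = phi.T @ rewards / n
    # Least-squares solution of a_tilde w = b_tilde (equivalent to the pseudo-inverse).
    w, *_ = np.linalg.lstsq(a_tilde, b_tilde, rcond=None)
    return w

# Toy check: one state with a self-loop, a constant feature and reward 1.
# The true value is 1 / (1 - gamma).
phi = np.ones((4, 1))
w = lstdq_weights(phi, phi, np.ones(4), gamma=0.5)
```

In this toy problem the estimate should coincide with the closed-form value 1 / (1 - 0.5) = 2, since the single constant feature can represent the value function exactly.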

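Fitted Q Iteration (FQI): The FQI algorithm (Ernst et al., 2005) is a batch SRL algorithm that computes iterative approximations of the Q-function using regression. At iteration $N$ of the algorithm, the set $D$ defined above and the approximation from the previous iteration $Q^{N-1}$ are used to generate supervised learning targets: $y_i = r_i + \gamma \max_{a'} Q^{N-1}(s_{i+1}, a')$, for all $i \in \{1, \dots, N_{SRL}\}$. These targets are then used by a supervised learning (regression) method to compute the next function in the sequence $Q^N$, by minimizing the MSE loss $Q^N = \arg\min_Q \sum_{i=1}^{N_{SRL}} \left( Q(s_i, a_i) - \left( r_i + \gamma \max_{a'} Q^{N-1}(s_{i+1}, a') \right) \right)^2$. For a linear function approximation $Q^N(s, a) = \phi(s, a)^T w^N$, LS can be used to give the FQI solution $w^N = \arg\min_w \| \tilde{A} w - \tilde{b} \|_2^2$, where $\tilde{A}, \tilde{b}$ are given by:

$$\tilde{A} = \frac{1}{N_{SRL}} \sum_{i=1}^{N_{SRL}} \left[ \phi(s_i, a_i)^T \phi(s_i, a_i) \right], \qquad \tilde{b} = \frac{1}{N_{SRL}} \sum_{i=1}^{N_{SRL}} \left[ \phi(s_i, a_i)^T y_i \right]. \tag{2}$$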

The FQI algorithm can also be used with non-linear function approximations such as trees (Ernst et al., 2005) and neural networks (Riedmiller, 2005). The DQN algorithm (Mnih et al., 2015) can be viewed as an online form of FQI.
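A single linear FQI iteration can be sketched as follows: build the regression targets from the previous weights, then solve the least squares problem for the new weights. The function `fqi_step`, the three-dimensional layout of `phi_next_all` (one feature vector per next-state action), and the toy data are illustrative assumptions. Terminal transitions are not handled in this sketch.

```python
import numpy as np

def fqi_step(phi, phi_next_all, rewards, w_prev, gamma):
    """One linear FQI iteration.

    phi          : (N, k)    features of (s_i, a_i)
    phi_next_all : (N, m, k) features of (s_{i+1}, a') for each of the m actions
    rewards      : (N,)      observed rewards r_i
    w_prev       : (k,)      weights from the previous iteration
    """
    n = phi.shape[0]
    # Regression targets y_i = r_i + gamma * max_a' Q_prev(s_{i+1}, a').
    q_next = phi_next_all @ w_prev            # shape (N, m)
    y = rewards + gamma * q_next.max(axis=1)
    # Normal-equation terms of the least squares regression onto the targets.
    a_tilde = phi.T @ phi / n
    b_tilde = phi.T @ y / n
    w, *_ = np.linalg.lstsq(a_tilde, b_tilde, rcond=None)
    return w

# Toy problem: one state, one action, constant feature, reward 1, gamma 0.5.
phi = np.ones((4, 1))
phi_next_all = np.ones((4, 1, 1))
w = np.zeros(1)
for _ in range(50):
    w = fqi_step(phi, phi_next_all, np.ones(4), w, gamma=0.5)
```

Each iteration here maps w to 1 + 0.5 w, so the weights approach the fixed point 2, the same value LSTD-Q would return for this toy problem.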

2.2 DRL algorithms

Deep Q-Network (DQN): The DQN algorithm (Mnih et al., 2015) learns the Q function by minimizing the mean squared error of the Bellman equation, defined as $\mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \| Q_\theta(s_t, a_t) - y_t \|_2^2$, where $y_t = r_t + \gamma \max_{a'} Q_{\theta_{target}}(s_{t+1}, a')$. The DQN maintains two separate networks, namely the current network with weights $\theta$ and the target network with weights $\theta_{target}$. Fixing the target network makes the DQN algorithm equivalent to FQI (see the FQI MSE loss defined above), where the regression algorithm is chosen to be SGD (RMSPROP, Hinton et al. 2012). The DQN is an off-policy learning algorithm. Therefore, the tuples $\langle s_t, a_t, r_t, s_{t+1} \rangle$ that are used to optimize the network weights are first collected from the agent's experience and are stored in an Experience Replay (ER) buffer (Lin, 1993), providing improved stability and performance.

Double DQN (DDQN): DDQN (Van Hasselt et al., 2016) is a modification of the DQN algorithm that addresses overly optimistic estimates of the value function. This is achieved by performing action selection with the current network $\theta$ and evaluating the action with the target network $\theta_{target}$, yielding the DDQN target update $y_t = r_t$ if $s_{t+1}$ is terminal, otherwise $y_t = r_t + \gamma Q_{\theta_{target}}(s_{t+1}, \arg\max_{a} Q_\theta(s_{t+1}, a))$.
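The difference between the two target computations can be made concrete with a short NumPy sketch. The function names, the `done` flag encoding (1.0 for terminal transitions), and the example Q-values are assumptions chosen for illustration only.

```python
import numpy as np

def dqn_targets(r, q_next_target, done, gamma):
    # DQN: the target network both selects and evaluates the next action.
    # The bootstrap term is dropped on terminal transitions (done == 1).
    return r + gamma * (1.0 - done) * q_next_target.max(axis=1)

def ddqn_targets(r, q_next_online, q_next_target, done, gamma):
    # DDQN: the current network selects the action, the target network evaluates it.
    a_star = q_next_online.argmax(axis=1)
    q_eval = q_next_target[np.arange(len(r)), a_star]
    return r + gamma * (1.0 - done) * q_eval

r = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])                        # second transition is terminal
q_online = np.array([[1.0, 2.0], [0.5, 0.1]])      # Q_theta(s_{t+1}, .)
q_target = np.array([[3.0, 1.0], [4.0, 2.0]])      # Q_theta_target(s_{t+1}, .)

y_dqn = dqn_targets(r, q_target, done, gamma=0.9)
y_ddqn = ddqn_targets(r, q_online, q_target, done, gamma=0.9)
```

In the first transition the two networks disagree on the best action, so the DDQN target (1 + 0.9 * 1.0) is lower than the DQN target (1 + 0.9 * 3.0). This is the mechanism by which DDQN tempers over-optimistic estimates.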

3 The LS-DQN Algorithm

We now present a hybrid approach for DRL with SRL updates [2]. Our algorithm, the LS-DQN Algorithm, periodically switches between training a DRL network and re-training its last hidden layer using an SRL method [3].

We assume that the DRL algorithm uses a deep network for representing the Q function [4], where the last layer is linear and fully connected. Such networks have been used extensively in deep RL recently (e.g., Mnih et al. 2015; Van Hasselt et al. 2016; Mnih et al. 2016). In such a representation, the last layer, which approximates the Q function, can be seen as a linear combination of features (the output of the penultimate layer), and we propose to learn more accurate weights for it using SRL.

Footnote 2: Code is available online.
Footnote 3: We refer the reader to Appendix B for a diagram of the algorithm.
Footnote 4: The features in the last DQN layer are not action dependent. We generate action-dependent features by zero-padding to a one-hot state-action feature vector. See Appendix E for more details.
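One plausible reading of the zero-padding construction for action-dependent features is sketched below: the state features from the penultimate layer are copied into the block that belongs to the chosen action, and all other blocks are zero. The function name `action_features` and the block layout are assumptions; the exact construction used by the authors is described in their Appendix E, which is not reproduced here.

```python
import numpy as np

def action_features(phi_s, action, num_actions):
    """Embed state features into a state-action feature vector.

    phi_s  : (k,) output of the penultimate layer for a state
    action : index of the action taken
    Returns a (num_actions * k,) vector that is zero outside the action's block.
    """
    k = phi_s.shape[0]
    out = np.zeros(num_actions * k)
    out[action * k:(action + 1) * k] = phi_s
    return out

# With this layout, a flattened last-layer weight matrix reproduces the
# per-action linear output of the network.
W = np.arange(6.0).reshape(3, 2)      # hypothetical last-layer weights: 3 actions, k = 2
phi_s = np.array([1.0, 2.0])
q_via_padding = W.reshape(-1) @ action_features(phi_s, 1, 3)
q_direct = W[1] @ phi_s
```

Under this layout a single weight vector of length `num_actions * k` plays the role of the weights w in the linear SRL formulation, so LSTD-Q or FQI can be applied to the last layer without any change.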

Explicitly, the LS-DQN algorithm begins by training the weights of a DRL network, $w_k$, using a value-based DRL algorithm for $N_{DRL}$ steps (Line 2). LS-DQN then updates the last hidden layer weights, $w_k^{last}$, by executing LS-UPDATE: retraining the weights using an SRL algorithm with $N_{SRL}$ samples (Line 3).

The LS-UPDATE consists of the following steps. First, data trajectories $D$ for the batch update are gathered using the current network weights, $w_k$ (Line 7). In practice, the current experience replay can be used and no additional samples need to be collected. The algorithm next generates new features $\Phi(s, a)$ from the data trajectories using the current DRL network with weights $w_k$. This step guarantees that we do not use samples with inconsistent features, as the ER contains features from 'old' network weights. Computationally, this step requires running a forward pass of the deep network for every sample in $D$, and can be performed quickly using parallelization.

Once the new features are generated, LS-DQN uses an SRL algorithm to re-calculate the weights of the last hidden layer, $w_k^{last}$ (Line 9).

While the LS-DQN algorithm is conceptually straightforward, we found that naively running it with off-the-shelf SRL algorithms such as FQI or LSTD resulted in instability and a degradation of the DRL performance. The reason is that the ‘slow’ SGD computation in DRL essentially retains information from older training epochs, while the batch SRL method ‘forgets’ all data but the most recent batch. In the following, we propose a novel regularization method for addressing this issue.

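1: for $k = 1, \dots, SRL_{iters}$ do
2:     Train the DRL network for $N_{DRL}$ steps
3:     Set $w_k^{last} \leftarrow$ LS-UPDATE($w_k$)    (update the last layer weights with the SRL solution)
4: end for
5:
6: function LS-UPDATE($w$)
7:     Generate trajectories $D$ using the DRL network with weights $w$ (or use the current ER)
8:     Generate features $\Phi(s, a)$ for the samples in $D$ using the DRL network with weights $w$
9:     Compute $w^{last}$ with an SRL algorithm on $\Phi(s, a)$, regularized with the Bayesian prior
10:    return $w^{last}$
11: end function
Algorithm 1 LS-DQN Algorithm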


Our goal is to improve the performance of a value-based DRL agent using a batch SRL algorithm. Batch SRL algorithms, however, do not leverage the knowledge that the agent has gained before the most recent batch [5]. We observed that this issue prevents the use of off-the-shelf implementations of SRL methods in our hybrid LS-DQN algorithm.

Footnote 5: While conceptually, the data batch can include all the data seen so far, due to computational limitations, this is not a practical solution in the domains we consider.

To enjoy the benefits of both worlds, that is, a batch algorithm that can use the accumulated knowledge gained by the DRL network, we introduce a novel Bayesian regularization method for LSTD-Q and FQI that uses the last hidden layer weights of the DRL network as a Bayesian prior for the SRL algorithm [6].

Footnote 6: The reader is referred to Ghavamzadeh et al. (2015) for an overview on using Bayesian methods in RL.

SRL Bayesian Prior Formulation: We are interested in learning the weights of the last hidden layer ($w^{last}$), using a least squares SRL algorithm. We pursue a Bayesian approach, where the prior weights distribution at iteration $k$ of LS-DQN is given by $w_{prior} \sim \mathcal{N}(w_k^{last}, \lambda^{-2})$, and we recall that $w_k^{last}$ are the last hidden layer weights of the DRL network at iteration $k$. The Bayesian solution for the regression problem in the FQI algorithm is given by (Box & Tiao, 2011)

$$w^{last} = (\tilde{A} + \lambda I)^{-1} (\tilde{b} + \lambda w_k^{last}),$$

where $\tilde{A}$ and $\tilde{b}$ are given in Equation 2. A similar regularization can be added to LSTD-Q based on a regularized fixed point equation (Kolter & Ng, 2009). Full details are in Appendix A.
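The effect of the prior can be illustrated with a small NumPy sketch. It assumes the regularized solution takes the ridge-like form w = (A + lam I)^(-1) (b + lam w_prior), with A and b the FQI averages of the features and targets; the function name `fqi_bayesian_prior` and the toy data are hypothetical.

```python
import numpy as np

def fqi_bayesian_prior(phi, y, w_prior, lam):
    """Least squares FQI solution pulled toward the current DRL weights.

    phi     : (N, k) features of (s_i, a_i)
    y       : (N,)   FQI regression targets
    w_prior : (k,)   last-layer weights of the DRL network (prior mean)
    lam     : regularization coefficient (assumed > 0)
    """
    n, k = phi.shape
    a_tilde = phi.T @ phi / n
    b_tilde = phi.T @ y / n
    # Adding lam * I makes the system solvable even when a_tilde is singular.
    return np.linalg.solve(a_tilde + lam * np.eye(k), b_tilde + lam * w_prior)

# Toy batch in which the second feature is always zero (a "dead" unit),
# so the unregularized normal equations are singular.
phi = np.array([[1.0, 0.0], [2.0, 0.0]])
y = np.array([1.0, 2.0])
w_prior = np.array([0.5, 0.3])
w = fqi_bayesian_prior(phi, y, w_prior, lam=1.0)
```

In this toy batch the weight of the dead feature stays at its prior value, because the data carry no information about it, while the weight of the active feature moves from the prior (0.5) toward the unregularized least squares fit (1.0). This is the sense in which the prior retains knowledge that the most recent batch cannot supply.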

4 Experiments

In this section, we present experiments showcasing the improved performance attained by our LS-DQN algorithm compared to state-of-the-art DRL methods. Our experiments are divided into three sections. In Section 4.1, we start by investigating the behavior of SRL algorithms in high dimensional environments. We then show results for the LS-DQN on five Atari domains, in Section 4.2, and compare the resulting performance to regular DQN and DDQN agents. Finally, in Section 4.3, we present an ablative analysis of the LS-DQN algorithm, which clarifies the reasons behind our algorithm’s success.

4.1 SRL Algorithms with High Dimensional Observations

In the first set of experiments, we explore how least squares SRL algorithms perform in domains with high dimensional observations. This is an important step before applying a SRL method within the LS-DQN algorithm. In particular, we focused on answering the following questions: (1) What regularization method to use? (2) How to generate data for the LS algorithm? (3) How many policy improvement iterations to perform?

To answer these questions, we performed the following procedure: We trained DQN agents on two games from the Arcade Learning Environment (ALE, Bellemare et al., 2013), namely Breakout and Qbert, using the vanilla DQN implementation (Mnih et al., 2015). For each DQN run, we (1) periodically [7] save the current DQN network weights and ER; (2) use an SRL algorithm (LSTD-Q or FQI) to re-learn the weights of the last layer; and (3) evaluate the resulting DQN network by temporarily replacing the DQN weights with the SRL solution weights. After the evaluation, we restore the original DQN weights and continue training.

Footnote 7: Every three million DQN steps, referred to as one epoch (out of a total of 50 million steps).

Each evaluation entails roll-outs [8] with an $\epsilon$-greedy policy (similar to Mnih et al., 2015). This periodic evaluation setup allowed us to effectively experiment with the SRL algorithms and obtain clear comparisons with DQN, without waiting for full DQN runs to complete.

Footnote 8: Each roll-out starts from a new (random) game and follows a policy until the agent loses all of its lives.

(1) Regularization: Experiments with standard SRL methods without any regularization yielded poor results. We found the main reason to be that the matrices used in the SRL solutions (Equations 1 and 2) are ill-conditioned, resulting in instability. One possible explanation stems from the sparseness of the features. The DQN uses ReLU activations (Jarrett et al., 2009), which causes the network to learn sparse feature representations. For example, once the DQN completed training on Breakout, of features were zero.
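The link between sparse ReLU features and an ill-conditioned least squares system can be illustrated on synthetic data. The sketch below uses random pre-activations, with a quarter of the units shifted far below zero so that the ReLU silences them; it does not use DQN features, and all sizes and the coefficient 1.0 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.normal(size=(1000, 64))
pre[:, :16] -= 10.0                 # these units are (almost surely) never active
phi = np.maximum(pre, 0.0)          # ReLU features: 16 columns are identically zero

a_tilde = phi.T @ phi / phi.shape[0]
rank = np.linalg.matrix_rank(a_tilde)

# Adding lam * I lifts every eigenvalue by lam, so the system becomes well-conditioned.
lam = 1.0
cond_reg = np.linalg.cond(a_tilde + lam * np.eye(64))
```

The matrix built from the sparse features is rank deficient, so the unregularized normal equations have no unique solution, whereas the regularized matrix has a modest condition number. A Bayesian-prior or ridge term of the kind discussed next acts in exactly this way.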

Once we added a regularization term, we found that the performance of the SRL algorithms improved. We experimented with the $\ell_2$ and Bayesian Prior (BP) regularizers (). While the $\ell_2$ regularizer showed competitive performance in Breakout, we found that the BP performed better across domains (Figure 1, best regularizers chosen, shows the average score of each configuration following the explained evaluation procedure, for the different epochs). Moreover, the BP regularizer was not sensitive to the scale of the regularization coefficient. Regularizers in the range performed well across all domains. A table of average scores for different coefficients can be found in Appendix C.1. Note that we do not expect much improvement, as we restore the original DQN weights after each evaluation.

(2) Data Gathering: We experimented with two mechanisms for generating data: (1) generating new data from the current policy, and (2) using the ER. We found that the data generation mechanism had a significant impact on the performance of the algorithms. When the data is generated only from the current DQN policy (without ER) the SRL solution resulted in poor performance compared to a solution using the ER (as was observed by Mnih et al. 2015). We believe that the main reason the ER works well is that the ER contains data sampled from multiple (past) policies, and therefore exhibits more exploration of the state space.

(3) Policy Improvement: LSTD-Q and FQI are off-policy algorithms and can be applied iteratively on the same dataset (e.g. LSPI, Lagoudakis & Parr 2003). However, in practice, we found that performing multiple iterations did not improve the results. A possible explanation is that by improving the policy, the policy reaches new areas in the state space that are not represented well in the current ER, and therefore are not approximated well by the SRL solution and the current DRL network.

Figure 1: Periodic evaluation for DQN (green), LS-DQN with Bayesian prior regularization (red, solid , dashed ), and $\ell_2$ regularization (blue, solid , dashed ).

4.2 Atari Experiments

We next ran the full LS-DQN algorithm (Alg. 1) on five Atari domains: Asterix, Space Invaders, Breakout, Qbert and Bowling. We ran the LS-DQN using both DQN and DDQN as the DRL algorithm, and using both LSTD-Q and FQI as the SRL algorithms. We chose to run an LS-update every steps, for a total of M steps (). We used the current ER buffer as the 'generated' data in the LS-UPDATE function (line 7 in Alg. 1, ), and a regularization coefficient for the Bayesian prior solution (both for FQI and LSTD-Q). We emphasize that we did not use any additional samples beyond the samples already obtained by the DRL algorithm.

Figure 2 presents the learning curves of the DQN network, LS-DQN with LSTD-Q, and LS-DQN with FQI (referred to as DQN, LS-DQN$_{LSTD-Q}$, and LS-DQN$_{FQI}$, respectively) on three domains: Asterix, Space Invaders and Breakout. Note that we use the same evaluation process as described in Mnih et al. (2015). We were also interested in a test to measure differences between learning curves, and not only their maximal score. Hence we chose to perform a Wilcoxon signed-rank test on the average scores between the three DQN variants. This non-parametric statistical test measures whether related samples differ in their means (Wilcoxon, 1945). We found that the learning curves for both LS-DQN$_{LSTD-Q}$ and LS-DQN$_{FQI}$ were statistically significantly better than those of DQN, with p-values smaller than e- for all three domains.

Figure 2: Learning curves of DQN (green), LS-DQN$_{LSTD-Q}$ (red), and LS-DQN$_{FQI}$ (blue).

Table 1 presents the maximum average scores along the learning curves of the five domains, when the SRL algorithms were incorporated into both DQN agents and DDQN agents (the notation is similar, i.e., LS-DDQN$_{FQI}$) [9]. Our algorithm, LS-DQN, attained better performance compared to the vanilla DQN agents, as seen by the higher scores in Table 1 and Figure 2. We observe an interesting phenomenon for the game Asterix: In Figure 2, the DQN's score "crashes" to zero (as was observed by Van Hasselt et al. 2016). LS-DQN$_{LSTD-Q}$ did not manage to resolve this issue, even though it achieved a significantly higher score than that of the DQN. LS-DQN$_{FQI}$, however, maintained steady performance and did not "crash" to zero. We found that, in general, incorporating FQI as an SRL algorithm into the DRL agents resulted in improved performance.

Footnote 9: Scores for DQN and DDQN were taken from Van Hasselt et al. (2016).

Algorithm \ Game     Breakout   Space Invaders   Asterix     Qbert      Bowling
DQN [9]              401.20     1975.50          6011.67     10595.83   42.40
LS-DQN (LSTD-Q)      420.00     3207.44          13704.23    10767.47   61.21
LS-DQN (FQI)         438.55     3360.81          13636.81    12981.42   75.38
DDQN [9]             375.00     3154.60          15150.00    14875.00   70.50
LS-DDQN (FQI)        397.94     4400.83          16270.45    12727.94   80.75
Table 1: Maximal average scores across five different Atari domains for each of the DQN variants.

4.3 Ablative Analysis

In the previous section, we saw that the LS-DQN algorithm has improved performance, compared to the DQN agents, across a number of domains. The goal of this section is to understand the reasons behind the LS-DQN's improved performance by conducting an ablative analysis of our algorithm. For this analysis, we used a DQN agent that was trained on the game of Breakout, in the same manner as described in Section 4.1. We focus on analyzing the LS-DQN$_{FQI}$ algorithm, which has the same optimization objective as DQN (cf. Section 2), and postulate the following conjectures for its improved performance:

  (i) The SRL algorithms use a Bayesian regularization term, which is not included in the DQN objective.

  (ii) The SRL algorithms have fewer hyperparameters to tune and generate an explicit solution compared to SGD-based DRL solutions.

  (iii) Large-batch methods perform better than small-batch methods when combining DRL with SRL.

  (iv) SRL algorithms focus on training the last layer and are easier to optimize.

The Experiments: We started by analyzing the learning method of the last layer (i.e., the 'shallow' part of the learning process). We did this by optimizing the last layer, at each LS-UPDATE epoch, using (1) FQI with a Bayesian prior and a LS solution, and (2) an ADAM (Kingma & Ba, 2014) optimizer with and without an additional Bayesian prior regularization term in the loss function. We compared these approaches for different mini-batch sizes of , , and data points, and used for all experiments.

Relating to conjecture (ii), note that the FQI algorithm has only one hyper-parameter to tune and produces an explicit solution using the whole dataset simultaneously. ADAM, on the other hand, has more hyper-parameters to tune and works on different mini-batch sizes.

The Experimental Setup: The experiments were done in a periodic fashion similar to Section 4.1, i.e., testing behavior in different epochs over a vanilla DQN run. For both ADAM and FQI, we first collected data samples from the ER at each epoch. For ADAM, we performed iterations over the data, where each iteration consisted of randomly permuting the data, dividing it into mini-batches and optimizing using ADAM over the mini-batches [10]. We then simulate the agent and report average scores across trajectories.

Footnote 10: The selected hyper-parameters used for these experiments can be found in Appendix D, along with results for one iteration of ADAM.
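The ADAM side of this comparison can be sketched as a mini-batch loop over the last-layer weights: permute the data, split it into mini-batches, and take one ADAM step per mini-batch. The form of the prior term in the loss, 0.5 * lam * ||w - w_prior||^2, is an assumption (the text does not spell it out), and the learning rate, sizes and synthetic data are illustrative only. ADAM is written out by hand so that the snippet depends only on NumPy.

```python
import numpy as np

def adam_last_layer(phi, y, w_init, w_prior, lam, batch_size, epochs,
                    lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Fit linear last-layer weights with mini-batch ADAM and an optional prior term."""
    rng = np.random.default_rng(seed)
    w = w_init.copy()
    m = np.zeros_like(w)            # first-moment estimate
    v = np.zeros_like(w)            # second-moment estimate
    t = 0
    n = phi.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)  # randomly permute the data on each pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = phi[idx] @ w - y[idx]
            # Gradient of 0.5 * mean(err^2) + 0.5 * lam * ||w - w_prior||^2.
            grad = phi[idx].T @ err / len(idx) + lam * (w - w_prior)
            t += 1
            m = beta1 * m + (1 - beta1) * grad
            v = beta2 * v + (1 - beta2) * grad ** 2
            m_hat = m / (1 - beta1 ** t)    # bias-corrected moments
            v_hat = v / (1 - beta2 ** t)
            w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

def objective(phi, y, w, w_prior, lam):
    return 0.5 * np.mean((phi @ w - y) ** 2) + 0.5 * lam * np.sum((w - w_prior) ** 2)

rng = np.random.default_rng(1)
phi = rng.normal(size=(256, 4))
y = phi @ np.array([1.0, -1.0, 0.5, 2.0])
w0 = np.zeros(4)
w_prior = np.zeros(4)
w_adam = adam_last_layer(phi, y, w0, w_prior, lam=0.1, batch_size=32, epochs=20)
loss_before = objective(phi, y, w0, w_prior, 0.1)
loss_after = objective(phi, y, w_adam, w_prior, 0.1)
```

Setting `batch_size` to the full data size turns each step into a full-batch gradient step on the same objective that the closed-form regularized least squares solution minimizes, which is the sense in which larger mini-batches bring ADAM closer to the FQI update.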

The Results: Figure 3 depicts the difference between the average scores of (1) and (2) to that of the DQN baseline scores. We see that larger mini-batches result in improved performance. Moreover, the LS solution (FQI) outperforms the ADAM solutions for mini-batch sizes of and on most epochs, and even slightly outperforms the best of them (mini-batch size of and a Bayesian prior). In addition, a solution with a prior performs better than a solution without a prior.

Summary: Our ablative analysis experiments strongly support conjectures (iii) and (iv) from above, for explaining LS-DQN’s improved performance. That is, large-batch methods perform better than small-batch methods when combining DRL with SRL as explained above; and SRL algorithms that focus on training only the last layer are easier to optimize, as we see that optimizing the last layer improved the score across epochs.

Figure 3: Differences of the average scores, for different learning methods, compared to vanilla DQN. Positive values represent improvement over vanilla DQN. For example, for mini-batch of 32 (left figure), in epoch 3, FQI (blue) out-performed vanilla DQN by 21, while ADAM with prior (red), and ADAM without prior (green) under-performed by -46, and -96, respectively. Note that: (1) as the mini-batch size increases, the improvement of ADAM over DQN becomes closer to the improvement of FQI over DQN, and (2) optimizing the last layer improves performance.

We finish this section with an interesting observation. While the LS solution improves the performance of the DRL agents, we found that the LS solution weights are very close to the baseline DQN solution. See Appendix D for the full results. Moreover, the distance was inversely proportional to the performance of the solution. That is, the FQI solution that performed the best was the closest (in norm) to the DQN solution, and vice versa. There were orders of magnitude differences between the norms of solutions that performed well and those that did not. Similar results, i.e., that large-batch solutions find solutions that are close to the baseline, have been reported by Keskar et al. (2016). We further compare our results with the findings of Keskar et al. in the section to follow.

5 Related work

We now review recent works that are related to this paper.


Regularization: The general idea of applying regularization for feature selection, and to avoid over-fitting, is a common theme in machine learning. However, applying it to RL algorithms is challenging due to the fact that these algorithms are based on finding a fixed-point rather than optimizing a loss function (Kolter & Ng, 2009). Value-based DRL approaches do not use regularization layers (e.g. pooling, dropout and batch normalization), which are popular in other deep learning methods. The DQN, for example, has a relatively shallow architecture (three convolutional layers, followed by two fully connected layers) without any regularization layers. Recently, regularization was introduced in problems that combine value-based RL with other learning objectives. For example, Hester et al. (2017) combine RL with supervised learning from expert demonstration, and introduce regularization to avoid over-fitting the expert data; and Kirkpatrick et al. (2017) introduce regularization to avoid catastrophic forgetting in transfer learning. SRL methods, on the other hand, perform well with regularization (Kolter & Ng, 2009) and have been shown to converge (Farahmand et al., 2009).

Batch size: Our results suggest that a large batch LS solution for the last layer of a value-based DRL network can significantly improve its performance. This result is somewhat surprising, as it has been observed by practitioners that using larger batches in deep learning degrades the quality of the model, as measured by its ability to generalize (Keskar et al., 2016).

However, our method differs from the experiments performed by Keskar et al. (2016) and therefore does not contradict them, for the following reasons: (1) The LS-DQN Algorithm uses the large batch solution only for the last layer. The lower layers of the network are not affected by the large batch solution and therefore do not converge to a sharp minimum. (2) The experiments of Keskar et al. (2016) were performed for classification tasks, whereas our algorithm is minimizing an MSE loss. (3) Keskar et al. showed that large-batch solutions work well when warm-started from (piggy-backing on) a small-batch solution. Similarly, our algorithm mixes small and large batch solutions as it switches between them periodically.

Moreover, it was recently observed that flat minima in practical deep learning model classes can be turned into sharp minima via re-parameterization without changing the generalization gap, and hence the issue requires further investigation (Dinh et al., 2017). In addition, Hoffer et al. (2017) showed that large-batch training can generalize as well as small-batch training by adapting the number of iterations. Thus, we strongly believe that our findings on combining large and small batches in DRL are in agreement with recent results of other deep learning research groups.

Deep and Shallow RL: Using the last-hidden layer of a DNN as a feature extractor and learning the last layer with a different algorithm has been addressed before in the literature, e.g., in the context of transfer learning (Donahue et al., 2013). In RL, there have been competitive attempts to use SRL with unsupervised features to play Atari (Liang et al., 2016; Blundell et al., 2016), but to the best of our knowledge, this is the first attempt that successfully combines DRL with SRL algorithms.

6 Conclusion

In this work we presented LS-DQN, a hybrid approach that embeds least-squares RL updates within online deep RL. LS-DQN obtains the best of both worlds: rich representations from deep RL networks as well as stability and data efficiency of least squares methods. Experiments with two deep RL methods and two least squares methods revealed that a hybrid approach consistently improves over vanilla deep RL in the Atari domain. Our ablative analysis indicates that the success of the LS-DQN algorithm is due to the large batch updates made possible by using least squares.

This work focused on value-based RL. However, our hybrid linear/deep approach can be extended to other RL methods, such as actor critic (Mnih et al., 2016). More broadly, decades of research on linear RL methods have provided methods with strong guarantees, such as approximate linear programming (Desai et al., 2012) and modified policy iteration (Scherrer et al., 2015). Our approach shows that with the correct modifications, such as our Bayesian regularization term, linear methods can be combined with deep RL. This opens the door to future combinations of well-understood linear RL with deep representation learning.


This research was supported by the European Community’s Seventh Framework Program (FP7/2007-2013) under grant agreement 306638 (SUPREL). A. Tamar is supported in part by Siemens and the Viterbi Scholarship, Technion.


  • Barto & Crites (1996) Barto, AG and Crites, RH. Improving elevator performance using reinforcement learning. Advances in neural information processing systems, 8:1017–1023, 1996.
  • Bellemare et al. (2013) Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bertsekas (2008) Bertsekas, Dimitri P. Approximate dynamic programming. 2008.
  • Blundell et al. (2016) Blundell, Charles, Uria, Benigno, Pritzel, Alexander, Li, Yazhe, Ruderman, Avraham, Leibo, Joel Z, Rae, Jack, Wierstra, Daan, and Hassabis, Demis. Model-free episodic control. stat, 1050:14, 2016.
  • Box & Tiao (2011) Box, George EP and Tiao, George C. Bayesian inference in statistical analysis. John Wiley & Sons, 2011.
  • Desai et al. (2012) Desai, Vijay V, Farias, Vivek F, and Moallemi, Ciamac C. Approximate dynamic programming via a smoothed linear program. Operations Research, 60(3):655–674, 2012.
  • Dinh et al. (2017) Dinh, Laurent, Pascanu, Razvan, Bengio, Samy, and Bengio, Yoshua. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
  • Donahue et al. (2013) Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 30th international conference on machine learning (ICML-13), pp. 647–655, 2013.
  • Ernst et al. (2005) Ernst, Damien, Geurts, Pierre, and Wehenkel, Louis. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
  • Farahmand et al. (2009) Farahmand, Amir M, Ghavamzadeh, Mohammad, Mannor, Shie, and Szepesvári, Csaba. Regularized policy iteration. In Advances in Neural Information Processing Systems, pp. 441–448, 2009.
  • Ghavamzadeh et al. (2015) Ghavamzadeh, Mohammad, Mannor, Shie, Pineau, Joelle, Tamar, Aviv, et al. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 8(5-6):359–483, 2015.
  • Hester et al. (2017) Hester, Todd, Vecerik, Matej, Pietquin, Olivier, Lanctot, Marc, Schaul, Tom, Piot, Bilal, Sendonaris, Andrew, Dulac-Arnold, Gabriel, Osband, Ian, Agapiou, John, et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.
  • Hinton et al. (2012) Hinton, Geoffrey, Srivastava, Nitish, and Swersky, Kevin. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. 2012.
  • Hoffer et al. (2017) Hoffer, Elad, Hubara, Itay, and Soudry, Daniel. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.
  • Jarrett et al. (2009) Jarrett, Kevin, Kavukcuoglu, Koray, LeCun, Yann, et al. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pp. 2146–2153. IEEE, 2009.
  • Keskar et al. (2016) Keskar, Nitish Shirish, Mudigere, Dheevatsa, Nocedal, Jorge, Smelyanskiy, Mikhail, and Tang, Ping Tak Peter. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  • Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirkpatrick et al. (2017) Kirkpatrick, James, Pascanu, Razvan, Rabinowitz, Neil, Veness, Joel, Desjardins, Guillaume, Rusu, Andrei A, Milan, Kieran, Quan, John, Ramalho, Tiago, Grabska-Barwinska, Agnieszka, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835, 2017.
  • Kolter & Ng (2009) Kolter, J Zico and Ng, Andrew Y. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th annual international conference on machine learning. ACM, 2009.
  • Lagoudakis & Parr (2003) Lagoudakis, Michail G and Parr, Ronald. Least-squares policy iteration. Journal of machine learning research, 4(Dec):1107–1149, 2003.
  • Liang et al. (2016) Liang, Yitao, Machado, Marlos C, Talvitie, Erik, and Bowling, Michael. State of the art control of atari games using shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, 2016.
  • Lin (1993) Lin, Long-Ji. Reinforcement learning for robots using neural networks. 1993.
  • Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mnih et al. (2016) Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
  • Riedmiller (2005) Riedmiller, Martin. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.
  • Scherrer et al. (2015) Scherrer, Bruno, Ghavamzadeh, Mohammad, Gabillon, Victor, Lesner, Boris, and Geist, Matthieu. Approximate modified policy iteration and its application to the game of tetris. Journal of Machine Learning Research, 16:1629–1676, 2015.
  • Silver et al. (2016) Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Sutton & Barto (1998) Sutton, Richard and Barto, Andrew. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • Tessler et al. (2017) Tessler, Chen, Givony, Shahar, Zahavy, Tom, Mankowitz, Daniel J, and Mannor, Shie. A deep hierarchical approach to lifelong learning in minecraft. Proceedings of the National Conference on Artificial Intelligence (AAAI), 2017.
  • Tsitsiklis et al. (1997) Tsitsiklis, John N, Van Roy, Benjamin, et al. An analysis of temporal-difference learning with function approximation. IEEE transactions on automatic control 42.5, pp. 674–690, 1997.
  • Van Hasselt et al. (2016) Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double q-learning. Proceedings of the National Conference on Artificial Intelligence (AAAI), 2016.
  • Wang et al. (2016) Wang, Ziyu, Schaul, Tom, Hessel, Matteo, van Hasselt, Hado, Lanctot, Marc, and de Freitas, Nando. Dueling network architectures for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1995–2003, 2016.
  • Wilcoxon (1945) Wilcoxon, Frank. Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83, 1945.
  • Zahavy et al. (2016) Zahavy, Tom, Ben-Zrihem, Nir, and Mannor, Shie. Graying the black box: Understanding dqns. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1899–1908, 2016.

Appendix A Adding Regularization to LSTD-Q

For LSTD-Q, regularization cannot be applied directly, since the algorithm finds a fixed point rather than solving a LS problem. To overcome this obstacle, we augment the fixed-point function of the LSTD-Q algorithm to include a regularization term, based on Kolter & Ng (2009):

$$f(w) = \arg\min_{u} \lVert \Phi u - \Pi T^{*} \Phi w \rVert_2^2 + \lambda g(u),$$

where $\Pi$ stands for the linear projection, $T^{*}$ for the Bellman optimality operator and $g(u)$ is the regularization function. Once the augmented problem is solved, the solution to the regularized LSTD-Q problem is given by $w = f(w)$. This derivation results in the same solution for LSTD-Q as was obtained for FQI (Equation 2). In the special case where the prior weights are zero, we get the $L_2$-regularized solution of Kolter & Ng (2009).
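To make the regularized fixed point concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the regularizer is the squared distance to a vector of prior weights, and it evaluates a fixed policy (the next-state features are given) rather than applying the optimality operator. Under those assumptions, setting the gradient to zero at the fixed point gives the linear system solved below. The function name `regularized_lstdq` and all argument names are hypothetical.

```python
import numpy as np

def regularized_lstdq(phi, phi_next, rewards, w_prior, gamma=0.99, lam=1.0):
    """Solve (Phi^T (Phi - gamma * Phi') + lam * I) w = Phi^T r + lam * w_prior.

    phi, phi_next: (n, d) feature matrices for (s, a) and (s', pi(s')).
    rewards:       (n,) observed rewards.
    w_prior:       (d,) prior weights (assumed regularization centre).
    """
    d = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A + lam * np.eye(d), b + lam * w_prior)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 4))
phi_next = rng.normal(size=(200, 4))
rewards = rng.normal(size=200)
w_prior = rng.normal(size=4)

# A very large coefficient pins the solution to the prior.
w_large = regularized_lstdq(phi, phi_next, rewards, w_prior, lam=1e8)
# A zero prior reduces the system to an L2 (ridge-style) regularized LSTD solve.
w_ridge = regularized_lstdq(phi, phi_next, rewards, np.zeros(4), lam=1.0)
```

The regularization coefficient interpolates between the unregularized LSTD-Q solution (coefficient near zero) and the prior weights (coefficient very large).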

Appendix B LS-DQN Algorithm

Figure 4 provides an overview of the LS-DQN algorithm described in the main paper. The DNN agent is first trained for a fixed number of steps (A). Data is then gathered (LS.1) from the agent's experience replay and features are generated (LS.2). An SRL algorithm is applied to the generated features (LS.3), which includes a regularized Bayesian prior weight update (LS.4). Note that the current weights of the last hidden layer are used as the prior. The weights of the last hidden layer are then replaced by the SRL output and this process is repeated.

Figure 4: An overview of the LS-DQN algorithm.
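The least-squares stage (LS.1 to LS.4) can be illustrated with a short NumPy sketch. This is a simplified stand-in under stated assumptions, not the paper's implementation: it performs a single FQI-style refit, ignores any bias term in the last layer, takes the features as precomputed arrays, and solves one regularized regression per action (equivalent to a single regression on block-structured state-action features). The function `ls_update_last_layer` and its argument names are hypothetical.

```python
import numpy as np

def ls_update_last_layer(psi, actions, rewards, psi_next, done, W_prior,
                         gamma=0.99, lam=1.0):
    """One regularized FQI-style refit of last-layer weights.

    psi, psi_next: (n, f) last-hidden-layer features of s and s'.
    actions:       (n,) integer actions taken.
    rewards, done: (n,) rewards and terminal flags (0.0 or 1.0).
    W_prior:       (num_actions, f) current last-layer weights, used as the prior.
    """
    num_actions, f = W_prior.shape
    # Regression targets computed from the current (prior) weights.
    q_next = psi_next @ W_prior.T
    y = rewards + gamma * (1.0 - done) * q_next.max(axis=1)

    W_new = W_prior.copy()
    for a in range(num_actions):
        mask = actions == a
        if not mask.any():
            continue  # no data for this action: keep the prior weights
        X = psi[mask]
        A = X.T @ X + lam * np.eye(f)
        b = X.T @ y[mask] + lam * W_prior[a]
        W_new[a] = np.linalg.solve(A, b)
    return W_new

# Synthetic replay data, for illustration only; action 2 never occurs.
rng = np.random.default_rng(0)
psi = rng.normal(size=(500, 3))
psi_next = rng.normal(size=(500, 3))
actions = rng.integers(0, 2, size=500)
rewards = rng.normal(size=500)
done = np.zeros(500)
W_prior = rng.normal(size=(3, 3))

W_new = ls_update_last_layer(psi, actions, rewards, psi_next, done, W_prior)
```

In a full loop, `W_new` would replace the last-layer weights before the DRL agent resumes training.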

Appendix C Results for SRL Algorithms with High Dimensional Observations

We present the average scores (averaged over roll-outs) at different epochs, for both the original DQN and after relearning the last layer using LSTD-Q, for different regularization coefficients.


Epoch DQN
Epoch 54 49 48 44 53 49 48 50 28 30 46
Epoch 207 189 196 193 64 30 18 4 9 5 171
Epoch 238 247 314 284 277 254 270 232 225 194 271
Epoch 238 271 289 249 207 201 291 326 274 304 212
Epoch 265 311 322 315 208 109 175 36 14 48 292
Epoch 299 331 327 328 259 150 248 227 281 245 164
Epoch 332 335 350 266 128 67 145 249 291 214 325
Epoch 361 352 343 262 204 65 270 309 287 304 324
Epoch 294 291 323 319 101 85 224 276 347 340 350
Epoch 186 297 256 263 243 236 349 323 333 333 165
Epoch 241 277 290 140 79 111 338 335 330 315 233
Epoch 328 336 327 352 226 208 337 374 354 377 302
Epoch 343 305 247 308 62 112 338 342 305 344 316
Epoch 278 294 259 273 156 198 320 355 350 346 306
Epoch 312 327 282 292 161 141 321 381 368 367 252
Epoch 186 160 283 273 170 225 370 314 325 324 114
Table 2: Average scores on the different epochs as a function of regularization coefficients


Epoch DQN
Epoch 3470 3070 2163 1998 1599 2078 964 629 831 484 2978
Epoch 2794 1853 2196 2565 3839 3558 1376 2123 1728 2388 2060
Epoch 4253 4188 4579 4034 4031 2239 561 691 824 570 4148
Epoch 2789 2489 2536 2750 3435 5214 2730 2303 1356 594 1878
Epoch 6426 6831 7480 6703 3419 3335 4205 3519 4673 5231 7410
Epoch 8480 7265 7950 5300 4978 4178 4533 6005 6133 4829 8356
Epoch 8176 9036 8635 7774 7269 7428 6196 3030 3246 2343 8643
Epoch 9104 10340 9935 7293 7689 7343 6728 2913 3299 1473 9315
Epoch 9274 10288 9115 7508 6660 7800 120 8133 4880 5018 8156
Epoch 10523 7245 9704 7949 8640 7794 2663 8905 10044 7585 12584
Epoch 10821 11510 9971 7064 6836 9908 1020 11868 9940 11138 10290
Epoch 7291 10134 7583 6673 7815 9028 5564 8893 8649 6748 7438
Epoch 12365 12220 13103 11868 11531 10091 2753 10804 8216 8835 13054
Epoch 11686 11085 10338 10811 8386 9580 2980 6469 6435 6071 10249
Epoch 11228 12841 13696 10971 5820 10148 7524 11959 9270 6949 11630
Epoch 11643 12489 13468 11773 8191 8976 198 7284 7598 5649 12923
Table 3: Average scores on the different epochs as a function of regularization coefficients

Appendix D Results for Ablative Analysis

We used the implementation of ADAM from the optim package for torch, with the default hyperparameters (except for the learning rate): learningRate, learningRateDecay, beta1, beta2, epsilon, and weightDecay. For solutions that use the prior, we set .

Figure 5 depicts the offset of the average scores from the DQN’s scores, after one iteration of the ADAM algorithm:

Figure 5: Differences of the average scores from DQN compared to ADAM and FQI (with and without priors) for different mini-batches (MB) sizes.

Table 4 shows the norm of the difference between the different solution weights and the original last layer weights of the DQN (divided by the norm of the DQN’s weights for scale), averaged over epochs. Note that MB stands for mini-batch sizes used by the ADAM solver.

Batch MB=32 iter=1 MB=32 iter=20 MB=512 iter=1 MB=512 iter=20 MB=4096 iter=1 MB=4096 iter=20
w/ prior 3e-4 3e-3 3e-3 2e-3 2e-3 1.7e-3 1.8e-3
wo/ prior 3.8e-2 2.7e-1 1.3e-2 1.2e-1 5e-3 5e-2
Table 4: Norms of the differences between the solution weights and the original last-layer weights of the DQN.
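The quantity reported in Table 4 can be computed as in the following sketch. The text does not state which norm was used, so the Euclidean (Frobenius) norm here is an assumption, and the helper name `relative_weight_change` is hypothetical.

```python
import numpy as np

def relative_weight_change(w_solution, w_dqn):
    """Norm of the weight difference, scaled by the norm of the DQN weights."""
    return np.linalg.norm(w_solution - w_dqn) / np.linalg.norm(w_dqn)

# Toy last-layer weights, for illustration only.
w_dqn = np.array([[3.0, 0.0], [0.0, 4.0]])
w_solution = w_dqn + np.array([[0.0, 0.3], [0.4, 0.0]])
ratio = relative_weight_change(w_solution, w_dqn)
```

Averaging this ratio over epochs gives one table entry per solver configuration.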

Appendix E Feature augmentation

The LS-DQN algorithm requires a function that creates features (Algorithm 1, Line 9) for a dataset using the current value-based DRL network. Notice that for most value-based DRL networks (e.g., DQN and DDQN), the DRL features (output of the last hidden layer) are a function of the state and not a function of the action. On the other hand, the FQI and LSTD-Q algorithms require features that are a function of both state and action. We therefore augment the DRL features to be a function of the action in the following manner. Denote by $\psi(s) \in \mathbb{R}^{f}$ the output of the last hidden layer in the DRL network (where $f$ is the number of neurons in this layer). We define $\phi(s, a) \in \mathbb{R}^{f|A|}$ to be $\psi(s)$ on the subset of indices that belongs to action $a$ and zero otherwise, where $|A|$ refers to the size of the action space.
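The augmentation can be sketched in NumPy as follows. The names `psi` (last-hidden-layer output) and `phi` (augmented state-action features), and the choice to give each action one contiguous block of indices, are illustrative assumptions; any fixed assignment of index subsets to actions would serve.

```python
import numpy as np

def augment_features(psi, actions, num_actions):
    """Place each state's features in the block of indices owned by its action.

    psi:     (n, f) state-only features.
    actions: (n,) integer actions in [0, num_actions).
    Returns: (n, f * num_actions) features, zero outside the action's block.
    """
    n, f = psi.shape
    phi = np.zeros((n, num_actions, f), dtype=psi.dtype)
    phi[np.arange(n), actions] = psi
    return phi.reshape(n, num_actions * f)

psi = np.array([[1.0, 2.0], [3.0, 4.0]])
actions = np.array([0, 2])
phi = augment_features(psi, actions, num_actions=3)

# With per-action weight rows W[a], a single linear model on phi reproduces
# the per-action outputs of a bias-free last layer.
W = np.arange(6.0).reshape(3, 2)
q_linear = phi @ W.reshape(-1)
q_layer = (psi * W[actions]).sum(axis=1)
```

This equivalence is what allows the weights found by a linear method on the augmented features to be written back into the last layer, one block per action.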

Note that in practice, DQN and DDQN maintain an ER, and we create features for all the states in the ER. A more computationally efficient approach would be to compute the features once, during the forward pass the DRL agent makes when it visits a state, and store them in the ER alongside the state. However, SRL algorithms work only with features that are fixed over time, whereas stored features would come from different versions of the network. Therefore, we generate new features with the current DRL network.