Deep Reinforcement Learning with Decorrelation

by Borislav Mavrin, et al.

Learning an effective representation for high-dimensional data is a challenging problem in reinforcement learning (RL). Deep reinforcement learning (DRL) algorithms such as the Deep Q-Network (DQN) achieve remarkable success in computer games by learning deeply encoded representations from convolution networks. In this paper, we propose a simple yet very effective method for representation learning with DRL algorithms. Our key insight is that features learned by DRL algorithms are highly correlated, which interferes with learning. By adding a regularized loss that penalizes correlation in latent features (at only slight computational cost), we decorrelate features represented by deep neural networks incrementally. On 49 Atari games, with the same regularization factor, our decorrelation algorithms achieve 70% in terms of human-normalized scores, which is 40% better than DQN. In particular, ours performs better than DQN on 39 games, with 4 close ties, and loses only slightly on 6 games. Empirical results also show that the decorrelation method applies to Quantile Regression DQN (QR-DQN) and significantly boosts performance. Further experiments on the losing games show that our decorrelation algorithms can win over DQN and QR-DQN with a fine-tuned regularization factor.


1 Introduction

Since the early days of artificial intelligence, learning effective representations of states has been a focus of research, especially because state representation is important for algorithms in practice. For example, reinforcement learning (RL)-based algorithms for computer Go used to rely on millions of binary features constructed from local shapes that recognize certain important geometries on the board. Some features were even constructed by the Go community from interviews with top Go professionals (Silver, 2009).

In this paper, we study state representation for RL. There have been numerous works on learning a good representation in RL with linear function approximation. Tile coding is a classical binary scheme for encoding states that generalizes over local sub-spaces (Albus, 1975). Parr et al. (2008) constructed representations from sampled Bellman error to reduce policy evaluation error. Petrik (2007) computed powers of a certain transition matrix of the underlying Markov Decision Process (MDP) to represent states. Konidaris et al. (2011) used Fourier basis functions to construct state representations; and so on.

To the best of our knowledge, there are two approaches to state representation learning in DRL. The first uses auxiliary tasks. For example, UNREAL is an architecture that learns a universal representation by presenting the learning agent with auxiliary tasks (each of which has a pseudo reward) that are relevant to the main task (which has an extrinsic reward), with the goal of solving all the tasks from a shared representation (Jaderberg et al., 2016). The second is a two-phase approach, which first learns a good representation and then performs learning and control using that representation. The motivation for this line of work is to leverage neural networks for RL algorithms without using experience replay. It is well known that neural networks have a "catastrophic interference" issue: they can forget what they have been trained on in the past, which is exactly the motivation for experience replay in DRL (using a buffer to store experience and replaying it later). Ghiassian et al. (2018) proposed applying an input transformation to reduce input interference caused by ReLU gates. Liu et al. (2018) proposed learning sparse feature representations, which can be used by online algorithms such as Sarsa.

While we also deal with representation learning with neural networks for RL, our motivation is to learn effective representations in a one-phase process, simultaneously acquiring a good representation and solving control. We also study the generic one-task setting, although our method may be applicable to multi-task control. Our insight in this paper is that a feature correlation phenomenon arises frequently in DRL algorithms: features learned by deep neural networks are highly correlated, as measured by covariances. We empirically show that this correlation causes DRL algorithms to have slow learning curves. We propose a regularized correlation loss function to automatically decorrelate state features for Deep Q-Networks (DQN). Across 49 Atari games, our decorrelation method significantly improves DQN at only slight computational cost. Our decorrelation algorithms achieve 70% in terms of human-normalized scores, 40% better than DQN. In particular, ours performs better than DQN on 39 games, with 4 close ties and only slight losses on 6 games.

To test the generalizability of our method, we also apply decorrelation to Quantile Regression DQN (QR-DQN), a recent distributional RL algorithm that generally achieves better performance than DQN. Results show that the same conclusion holds: our decorrelation method performs much better than QR-DQN in terms of median human-normalized scores. In particular, the decorrelation algorithm lost 5 games, with 10 games being close and 34 winning games in cumulative rewards. We used the same regularization factor in benchmarking for all games. We conducted further experiments on the losing games and found that the losses are due to the choice of regularization factor: with other values of the regularization factor, our decorrelation algorithms can significantly win over DQN and QR-DQN on those games.

2 Background

We consider a Markov Decision Process (MDP) with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward "function" $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a transition kernel $p(s' \mid s, a)$, and a discount ratio $\gamma \in [0, 1)$. In this paper we treat the reward "function" $R$ as a random variable to emphasize its stochasticity. The bandit setting is a special case of the general RL setting, where we usually have only one state.

We use $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$ to denote a stochastic policy. We use $Z^\pi(s, a)$ to denote the random variable of the sum of the discounted rewards in the future, following the policy $\pi$ and starting from the state $s$ and the action $a$. We have $Z^\pi(s, a) \doteq \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$, where $s_0 = s, a_0 = a$ and $s_{t+1} \sim p(\cdot \mid s_t, a_t)$, $a_t \sim \pi(\cdot \mid s_t)$. The expectation of the random variable $Z^\pi(s, a)$,

$$Q^\pi(s, a) \doteq \mathbb{E}_{\pi, p, R}\big[Z^\pi(s, a)\big],$$

is usually called the state-action value function. In the general RL setting, we are usually interested in finding an optimal policy $\pi^*$ such that $Q^{\pi^*}(s, a) \geq Q^\pi(s, a)$ holds for any $(\pi, s, a)$. All possible optimal policies share the same optimal state-action value function $Q^*$, which is the unique fixed point of the Bellman optimality operator $\mathcal{T}$ (Bellman, 2013),

$$\mathcal{T} Q(s, a) \doteq \mathbb{E}[R(s, a)] + \gamma \, \mathbb{E}_{s' \sim p}\Big[\max_{a'} Q(s', a')\Big].$$

Based on the Bellman optimality operator, Watkins & Dayan (1992) proposed Q-learning to learn the optimal state-action value function for control. At each time step, we update $Q(s, a)$ as

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\Big),$$

where $\alpha$ is a step size and $(s, a, r, s')$ is a transition. There has been much work extending Q-learning to linear function approximation (Sutton & Barto, 2018; Szepesvári, 2010).
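The tabular update can be sketched in a few lines; the helper name and the toy two-state MDP below are illustrative, not from the paper:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on the transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target
    return Q

# Toy 2-state, 2-action MDP: a single update moves Q[0, 1] toward the target.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Starting from a zero table, one update with reward 1.0 moves the visited entry to `alpha * r = 0.1` while leaving the other entries untouched.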

3 Regularized Correlation Loss

3.1 DQN with Decorrelation

Mnih et al. (2015) combined Q-learning with deep neural network function approximators, resulting in the Deep Q-Network (DQN). Assume the $Q$ function is parameterized by a network with weights $\theta$. At each time step, DQN performs a stochastic gradient descent step to update $\theta$, minimizing the loss

$$\Big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\Big)^2,$$

where $\theta^-$ is the target network (Mnih et al., 2015), which is a copy of $\theta$ and is synchronized with $\theta$ periodically, and $(s, a, r, s')$ is a transition sampled from an experience replay buffer (Mnih et al., 2015), which is a first-in-first-out queue storing previously experienced transitions.

Suppose the latent feature vector in the last hidden layer of DQN is denoted by a column vector $\phi(s; \theta) \in \mathbb{R}^d$ (where $d$ is the number of units in the last hidden layer). Then the covariance matrix of the features estimated from $n$ samples is

$$\Sigma = \frac{1}{n} \sum_{k=1}^{n} \big(\phi(s_k; \theta) - \bar{\phi}\big)\big(\phi(s_k; \theta) - \bar{\phi}\big)^\top,$$

where $\bar{\phi} = \frac{1}{n} \sum_{k=1}^{n} \phi(s_k; \theta)$ and $\Sigma_{i,j}$ denotes the $(i, j)$ entry of $\Sigma$. We know that if two features $\phi_i$ and $\phi_j$ (the $i$th and $j$th units in the last layer) are decorrelated, then $\Sigma_{i,j} = 0$. To apply decorrelation incrementally and efficiently, our key idea is to regularize the DQN loss function with a term that penalizes correlation in the features:

$$L(\theta) = \Big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\Big)^2 + \lambda \, L_{\text{corr}}(\theta),$$

where the regularization term $L_{\text{corr}}(\theta)$ is the mean-squared loss of the off-diagonal entries in the covariance matrix, which we call the regularized correlation loss, and $\lambda$ is the regularization factor. The other elements of our algorithm, such as experience replay, are exactly the same as in DQN. We call this new algorithm DQN-decor.
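As a concrete sketch of the regularizer (not the authors' code; the exact normalization of the mean-squared off-diagonal term is an assumption on our part), the correlation loss on a minibatch of last-layer features can be computed as:

```python
import numpy as np

def correlation_loss(phi):
    """Mean-squared off-diagonal covariance of a feature minibatch.

    phi: (batch, d) array of last-hidden-layer activations.
    """
    centered = phi - phi.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / phi.shape[0]    # (d, d) empirical covariance
    off_diag = cov - np.diag(np.diag(cov))        # zero out the variances
    d = phi.shape[1]
    return np.sum(off_diag ** 2) / (d * (d - 1))  # mean over off-diagonal entries

# Two identical features give a large loss; two independent ones, near zero.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 1))
correlated = np.hstack([x, x])
independent = rng.normal(size=(256, 2))
```

In practice this scalar would simply be scaled by the regularization factor and added to the temporal-difference loss before backpropagation.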

3.2 QR-DQN with Decorrelation

The core idea behind QR-DQN is quantile regression, introduced in the seminal paper of Koenker & Bassett Jr (1978). This approach has gained significant attention in the field of theoretical and applied statistics. Let us first consider QR in supervised learning. Given data $\{(x_k, y_k)\}_{k=1}^{n}$, we want to compute the quantile of $y$ corresponding to the quantile level $\tau \in (0, 1)$. The linear quantile regression loss is defined as

$$L(\beta) = \sum_{k=1}^{n} \rho_\tau\big(y_k - x_k^\top \beta\big), \qquad \rho_\tau(u) = u\,\big(\tau - \mathbb{I}_{\{u < 0\}}\big),$$

which is a weighted sum of residuals. The weights are proportional to the counts of the residual signs and the order of the estimated quantile $\tau$: for higher quantiles, positive residuals get higher weight, and vice versa. If $\tau = 0.5$, then the estimate of the median of $y$ at $x_k$ is $x_k^\top \hat{\beta}$, with $\hat{\beta} = \arg\min_\beta L(\beta)$.
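A minimal numeric illustration of the quantile (pinball) loss; the function name and the toy data are ours, not from the paper. Minimizing the empirical loss over a constant recovers the sample $\tau$-quantile, which for $\tau = 0.5$ is the median:

```python
import numpy as np

def pinball_loss(residual, tau):
    """Quantile-regression check loss: weights positive residuals by tau
    and negative residuals by (1 - tau)."""
    return np.where(residual >= 0, tau * residual, (tau - 1) * residual)

# Sweep a constant predictor c over a grid; the minimizer is the median,
# which is robust to the outlier 100.0.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
candidates = np.linspace(0.0, 100.0, 10001)
losses = [pinball_loss(y - c, 0.5).mean() for c in candidates]
best = candidates[int(np.argmin(losses))]
```

Note how the median estimate ignores the magnitude of the outlier, which is exactly the robustness property quantile regression is prized for.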

Instead of learning the expected return $Q$, distributional RL focuses on learning the full distribution of the random variable $Z$ directly (Jaquette, 1973; Bellemare et al., 2017). There are various approaches to represent a distribution in the RL setting (Bellemare et al., 2017; Dabney et al., 2018; Barth-Maron et al., 2018). In this paper, we focus on the quantile representation (Dabney et al., 2017) used in QR-DQN, where the distribution of $Z$ is represented by a uniform mix of $N$ supporting quantiles:

$$Z_\theta(s, a) \doteq \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s, a)},$$

where $\delta_x$ denotes a Dirac at $x \in \mathbb{R}$, and each $\theta_i$ is an estimation of the quantile corresponding to the quantile level (a.k.a. quantile index) $\hat{\tau}_i \doteq \frac{2i - 1}{2N}$ for $1 \leq i \leq N$. The state-action value $Q(s, a)$ is then approximated by $\frac{1}{N} \sum_{i=1}^{N} \theta_i(s, a)$. Such an approximation of a distribution is referred to as quantile approximation.
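As a small numeric sketch of the quantile representation (the choice of $N = 4$ and the quantile values below are arbitrary illustrations, not values from the paper):

```python
import numpy as np

N = 4
# Quantile midpoints tau_hat_i = (2i - 1) / (2N): 1/8, 3/8, 5/8, 7/8.
tau_hat = (2 * np.arange(N) + 1) / (2 * N)

# Hypothetical quantile estimates theta_i(s, a) for one state-action pair.
theta = np.array([0.0, 1.0, 2.0, 5.0])

# The uniform Dirac mixture makes Q(s, a) the plain mean of the quantiles.
q_value = theta.mean()
```

The quantile estimates carry the full (approximate) return distribution, while their mean recovers the scalar value used for greedy action selection.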

Similar to the Bellman optimality operator in mean-centered RL, we have the distributional Bellman optimality operator for control in distributional RL,

$$\mathcal{T} Z(s, a) \doteq R(s, a) + \gamma Z\Big(s', \arg\max_{a'} \mathbb{E}\big[Z(s', a')\big]\Big), \qquad s' \sim p(\cdot \mid s, a).$$

Based on the distributional Bellman optimality operator, Dabney et al. (2017) proposed to train the quantile estimations (i.e., $\{\theta_i\}$) via the Huber quantile regression loss (Huber et al., 1964). To be more specific, at time step $t$ the loss is

$$\frac{1}{N} \sum_{i=1}^{N} \sum_{i'=1}^{N} \rho_{\hat{\tau}_i}^\kappa\big(y_{t, i'} - \theta_i(s_t, a_t)\big),$$

where $y_{t, i'}$ is the distributional target computed from the next-state quantiles, and

$$\rho_{\hat{\tau}_i}^\kappa(u) \doteq \big|\hat{\tau}_i - \mathbb{I}_{\{u < 0\}}\big| \, \frac{\mathcal{L}_\kappa(u)}{\kappa},$$

where $\mathbb{I}$ is the indicator function and $\mathcal{L}_\kappa$ is the Huber loss,

$$\mathcal{L}_\kappa(u) \doteq \begin{cases} \frac{1}{2} u^2 & \text{if } |u| \leq \kappa, \\ \kappa \big(|u| - \frac{\kappa}{2}\big) & \text{otherwise.} \end{cases}$$
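The Huber quantile loss can be sketched as follows; this is our reading of the standard QR-DQN loss on a single residual, not the authors' implementation:

```python
import numpy as np

def huber(u, kappa=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    return np.where(np.abs(u) <= kappa,
                    0.5 * u ** 2,
                    kappa * (np.abs(u) - 0.5 * kappa))

def huber_quantile_loss(residual, tau, kappa=1.0):
    """Quantile Huber loss: the asymmetric quantile weight |tau - 1{u < 0}|
    applied to the kappa-normalized Huber-smoothed residual."""
    weight = np.abs(tau - (residual < 0).astype(float))
    return weight * huber(residual, kappa) / kappa
```

With `kappa = 1.0` this reduces to the asymmetric weight times the plain Huber loss; the smoothing near zero is what makes the loss differentiable where the pinball loss has a kink.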

To decorrelate QR-DQN, we use the following loss function,

$$\frac{1}{N} \sum_{i=1}^{N} \sum_{i'=1}^{N} \rho_{\hat{\tau}_i}^\kappa\big(y_{t, i'} - \theta_i(s_t, a_t)\big) + \lambda \, L_{\text{corr}},$$

where the features $\phi$ used in the correlation loss $L_{\text{corr}}$ are those encoded by the last hidden layer of QR-DQN, and the quantile Huber term is as defined above.

The other elements of QR-DQN are also preserved. The only difference between our new algorithm and QR-DQN is the regularized correlation loss. We call this new algorithm QR-DQN-decor.

4 Experiment

In this section, we conduct experiments to study the effectiveness of the proposed decorrelation method for DQN and QR-DQN. We used the Atari game environments of Bellemare et al. (2013). In particular, we compare DQN-decor vs. DQN, and QR-DQN-decor vs. QR-DQN. Algorithms were evaluated by training on 20 million frames (or equivalently, 5 million agent steps), with 3 runs per game.

In performing the comparisons, we used the same parameters for all the algorithms, which are reported below. The discount factor is . The image size is . We clipped the continuous reward into three discrete values, {−1, 0, +1}, according to its sign; this clipping of reward signals leads to more stable performance according to Mnih et al. (2015). The reported performance scores are calculated, however, in terms of the original unclipped rewards. The target network update frequency is frames. The learning rate is . The exploration strategy is epsilon-greedy, with epsilon decaying linearly from to over the first 1 million frames and remaining constant after that. The experience replay buffer size is 1 million. The minibatch size for experience replay is . In the first frames, all four agents behaved randomly to fill the buffer with experience as a warm start.

Figure 1: Human-normalized performance (using median across 49 Atari games): DQN-decor vs. DQN.

Figure 2: Correlation loss over training (using median across 49 Atari games): DQN-decor vs. DQN.

Figure 3: Cumulative reward improvement of DQN-decor over DQN. Each bar indicates the improvement, computed as the Area Under the Curve (AUC) at the end of training, normalized relative to DQN. Bars above / below the horizontal axis indicate performance gain / loss. For MontezumaRevenge, the improvement is ; and for PrivateEye, the improvement is .
(a) Bowling-DQN: Performance.
(b) Bowling-DQN: Correlation loss.

(c) BattleZone: DQN-decor with a smaller regularization factor performs significantly better than DQN.

The input to the neural network is the most recent 4 frames. There are three convolution layers followed by a fully connected layer. The first layer convolves 32 filters of with stride 4 over the input image and applies a rectifier non-linearity. The second layer convolves 64 filters of with stride 2, followed by a rectifier non-linearity. The third convolution layer convolves 64 filters of with stride 1, followed by a rectifier. The final hidden layer consists of rectifier units, and the outputs of these units given an image are exactly the features encoded in $\phi$. Thus the number of features, $d$, is 512. This is followed by the output layer, which is fully connected and linear, with a single output for each action (the number of actions varies by game, ranging from 4 to 18).

The parameter $\kappa$ in the Huber loss for the QR-DQN algorithms was set following Dabney et al. (2017).

The correlation loss in equation 1 is computed on the minibatch sampled in experience replay (the empirical mean is also computed over the minibatch).

4.1 Decorrelating DQN

Performance. First we compare DQN-decor with DQN in Figure 1. The performance is measured by the median score of DQN-decor and DQN across all games, where for each game the average score over the 3 runs is first taken. DQN-decor is shown to outperform DQN starting from about 4 million frames, with the winning edge increasing over time. By 20 million frames, DQN-decor achieves a human-normalized score of (while DQN's is ), outperforming DQN by .

While the median performance is widely used, it is only a performance summary across all games. To see how the algorithms perform on each game, we benchmark them per game in Figure 3. DQN-decor won over DQN on 39 games (each by more than ), with 4 close ties, and lost slightly on 6 games. In particular, only on Bowling and BattleZone does it lose to DQN by about ; for the other losing games, the loss was below .

We dug into the training on Bowling and found that DQN fails to learn this task. The evaluation score for Bowling in the original DQN paper (Mnih et al., 2015) is (see their Table 2) with a standard deviation of (trained over 50 million frames). In fact, for a fair comparison one should look at the training performance of DQN, because the scores reported in the original DQN paper are testing performance without exploration. Their results show that for Bowling, the scores of DQN in the training phase fluctuate around , which is close to our implemented DQN here. Figure 3(a) shows that two runs (out of three) of DQN are even worse than the random agent. Because the exploration factor is close to 1.0 in the beginning, the first data points of the curves correspond to the performance of the random agent. The representation learned by DQN turns out not to be useful for this task. In fact, the features learned by these two failed runs of DQN are nearly uncorrelated, as reflected in the correlation loss being nearly zero in Figure 3(b). So decorrelating an unsuccessful representation does not help in this case.

The loss of DQN-decor to DQN on BattleZone is due to the regularization factor $\lambda$, which is uniformly fixed for all games. In producing these two figures, $\lambda$ was fixed to the same value for all 49 games; in fact, we did not run a parameter search to optimize $\lambda$ across games, and the value was picked at our best guess. We did notice that DQN-decor can achieve better performance than DQN on BattleZone with a different regularization factor, as shown in Figure 3(c). It appears that the features learned by DQN are already largely decorrelated on this game, and thus a smaller regularization factor gives better performance.

Setting aside the unsuccessful representation learned by DQN in the case of Bowling, and given a better choice of regularization factor, we found decorrelation to be a uniformly effective technique for training DQN.

Effect of Feature Correlation. To study the relation between performance and correlation loss, we plot the correlation loss over training time (also using the median across the 49 games) in Figure 2. The correlation loss is computed by averaging over the minibatch samples during experience replay at each time step.

As shown by the figure, DQN-decor effectively reduces the correlation in features. The feature correlation loss for DQN-decor is almost flat after no more than 2 million frames, indicating that decorrelating features can be achieved faster than learning (good news for representation learning). For DQN, by contrast, the learned features become more and more correlated with each other over the first 6 million frames. After that, however, DQN does achieve a flat correlation loss, although at a much higher level than DQN-decor's.

Figure 4: Human-normalized performance (using median across 49 Atari games): QR-DQN-decor vs. QR-DQN.

Figure 5: Frostbite: QR-DQN-decor-0.01 won by in normalized AUC (area under the curve).

Figure 6: Normalized AUC: cumulative reward improvement of QR-DQN-decor over QR-DQN. For MontezumaRevenge, the improvement is ; for Hero, the improvement is ; and for PrivateEye, the improvement is .
(a) BattleZone.

(b) DemonAttack.
Figure 7: Re-comparing QR-DQN-decor (with ) and QR-DQN in normalized AUC. Left (BattleZone): losing to QR-DQN by . Right (DemonAttack): winning over QR-DQN by .

This phenomenon of DQN learning is interesting because it shows that although correlation is not considered in its loss function, DQN does have the ability to achieve feature decorrelation (to some extent) over time. Note that DQN's performance keeps improving after 6 million frames while its correlation loss measure stays almost flat. This may indicate that it is natural to understand learning in two phases: first decorrelating the features (with some loss in performance), and then continuing to learn and improve performance with the decorrelated features.

Interestingly, the faster learning of DQN-decor follows after the features are decorrelated. To see this, note that DQN-decor's advantage over DQN becomes significant after about 6 million frames (Figure 1), while DQN-decor's correlation loss has been flat since before 2 million frames (Figure 2).

Game DQN DQN-decor QR-DQN-1 QR-DQN-1-decor
Alien 1445.8 1348.5 1198.0 1496.0
Amidar 234.9 263.8 236.5 339.0
Assault 2052.6 1950.8 6118.2 5341.3
Asterix 3880.2 5541.7 6978.4 8418.6
Asteroids 733.2 1292.0 1442.4 1731.4
Atlantis 189008.6 307251.6 65875.3 148162.4
BankHeist 568.4 648.1 730.3 740.8
BattleZone 15732.2 14945.3 18001.3 17852.6
BeamRider 5193.1 5394.4 5723.6 6483.1
Bowling 27.3 21.9 22.5 22.2
Boxing 85.6 86.0 87.1 83.9
Breakout 311.3 337.7 372.8 393.3
Centipede 2161.2 2360.5 6003.0 6092.9
ChopperCommand 1362.4 1735.2 2266.9 2777.4
CrazyClimber 69023.8 100318.4 74110.1 100278.4
DemonAttack 7679.6 7471.8 34845.7 27393.6
DoubleDunk -15.5 -16.8 -18.5 -19.1
Enduro 808.3 891.7 409.4 884.5
FishingDerby 0.7 11.7 9.0 10.8
Freeway 23.0 32.4 25.6 24.9
Frostbite 293.8 376.6 1414.2 755.7
Gopher 2064.5 3067.6 2816.5 3451.8
Gravitar 271.2 382.3 305.9 314.9
Hero 3025.4 6197.1 1948.4 9352.2
IceHockey -10.0 -8.6 -10.3 -9.6
Jamesbond 387.5 471.0 391.4 515.0
Kangaroo 3933.3 3955.5 1987.8 2504.4
Krull 5709.9 6286.4 6547.7 6567.9
KungFuMaster 16999.0 20482.9 22131.3 23531.4
MontezumaRevenge 0.0 0.0 0.0 2.4
MsPacman 2019.0 2166.0 2221.4 2407.9
NameThisGame 7699.0 7578.2 8407.0 8341.0
Pong 19.9 20.0 19.9 20.0
PrivateEye 345.6 610.8 41.7 251.0
Qbert 2823.5 4432.4 4041.0 5148.1
Riverraid 6431.3 7613.8 7134.8 7700.5
RoadRunner 35898.6 39327.0 36800.0 37917.6
Robotank 24.8 24.5 31.3 30.6
Seaquest 4216.6 6635.7 4856.8 5224.6
SpaceInvaders 1015.8 913.0 946.7 825.1
StarGunner 15586.6 21825.0 25530.8 37461.0
Tennis -22.3 -21.2 -17.3 -16.2
TimePilot 2802.8 3852.1 3655.8 3651.6
Tutankham 103.4 116.2 148.3 190.2
UpNDown 8234.5 9105.8 8647.4 11342.4
Venture 8.4 15.3 0.7 1.1
VideoPinball 11564.1 15759.3 53207.5 66439.4
WizardOfWor 1804.3 2030.3 2109.8 2530.6
Zaxxon 3105.2 7049.4 4179.8 6816.9
Table 1: Comparison of gaming scores obtained by our decorrelation algorithms with DQN and QR-DQN, averaged over the last one million training frames (the total number of training frames is 20 million). Out of 49 games, the decorrelation algorithms perform best in 39 games: DQN-decor is best in 15 games and QR-DQN-decor is best in 24 games.

4.2 Decorrelating QR-DQN

We further conduct experiments to study whether our decorrelation method applies to QR-DQN. The value of $\lambda$ was set based simply on a parameter study on the game of Seaquest. Figure 4 shows the median human-normalized scores across the 49 games. Similar to the case of DQN, our method achieves much better performance by decorrelating QR-DQN, especially after about 7 million frames. Again, the performance edge of QR-DQN-decor over QR-DQN has a trend of increasing over time. We observed a similar relation between performance and correlation loss, which supports that decorrelation is the factor that improves performance.

We also profile the per-game performance of the algorithms in terms of normalized AUC, shown in Figure 6. In this case, the decorrelation algorithm lost 5 games, with 10 games being close and 34 winning games. The algorithm lost most to QR-DQN on the games of Frostbite, BattleZone, and DemonAttack. This is due to the parameter choice rather than an algorithmic issue. For all three losing games, we performed additional experiments with the same value of the regularization factor as in the DQN experiments. The learning curves are shown in Figure 5 (Frostbite), Figure 6(a) (BattleZone), and Figure 6(b) (DemonAttack). Figure 5 shows that with this value, the performance of QR-DQN-decor on Frostbite is significantly improved, yielding an improvement over QR-DQN in the normalized AUC measure. For the other two games, QR-DQN-decor performs at least no worse than QR-DQN (better on DemonAttack and roughly on par on BattleZone).

Note that the median human-normalized score across games and the per-game normalized AUC may not give a full picture of algorithm performance. Algorithms that perform well in these two measures can still exhibit plummeting behaviour, characterized by abrupt degradation in performance: the learning curve can drop to a low score and stay there indefinitely. A more detailed discussion of this point is given by Machado et al. (2017). To study whether our decorrelation algorithms exhibit plummeting behaviour, we benchmark all algorithms by averaging the rewards over the last one million training frames (essentially the near-tail performance in training). The results in this measure are summarized in Table 1. Out of 49 games, the decorrelation algorithms perform best in 39 games: DQN-decor is best in 15 games and QR-DQN-decor is best in 24 games. Thus our decorrelation algorithms do not exhibit plummeting behaviour.

In summary, the empirical results in this section show that decorrelation is effective for improving the performance of both DQN and QR-DQN. Even on the games lost under a single regularization factor fixed across all games, both DQN and QR-DQN can still be significantly improved by our decorrelation method with a tuned regularization factor.

4.3 Analysis on the Learned Representation

To study the correlation between the learned features after training finished, we first compute the feature covariance matrix (see equation 1) using the one million samples in the experience replay buffer at the end of training, for both DQN and DQN-decor. Then we sort the 512 features according to their variances, which are the diagonal entries of the feature covariance matrix.
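This analysis step can be sketched as follows (a hypothetical helper of ours; the paper does not provide code):

```python
import numpy as np

def sort_features_by_variance(phi):
    """Order features by variance (the diagonal of the covariance matrix),
    largest first, and permute the covariance matrix to match.

    phi: (samples, d) array of feature activations collected from the buffer.
    """
    cov = np.cov(phi, rowvar=False)            # (d, d) feature covariance
    order = np.argsort(np.diag(cov))[::-1]     # feature indices by variance, desc.
    return order, cov[np.ix_(order, order)]    # reorder rows and columns together

# Synthetic check: three independent features with std 1, 3, and 2
# should be ordered as feature 1, then 2, then 0.
rng = np.random.default_rng(0)
phi = rng.normal(size=(2000, 3)) * np.array([1.0, 3.0, 2.0])
order, sorted_cov = sort_features_by_variance(phi)
```

The permuted covariance matrix is what a heat map like Figure 8 would display, with the most active features in the top-left corner.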

(a) DQN.

(b) DQN-decor.
Figure 8: Feature correlation heat map on Seaquest. First, the diagonal pattern is more obvious in the feature covariance matrix of DQN-decor. Second, the magnitude of feature correlation in DQN is much larger than in DQN-decor. Third, there are more highly activated features in DQN, and they correlate with each other, suggesting that many of them may be redundant.
Figure 9: Activation of the most important features (according to their variances) on Seaquest (6 random samples). Top row: features of DQN-decor. Middle row: features of DQN. Bottom row: current image frame.

A visualization of the top 50 features on the game of Seaquest is shown in Figure 8(a) (DQN) and Figure 8(b) (DQN-decor). Three observations can be made. First, the diagonal pattern is more obvious in the feature covariance matrix of DQN-decor. Second, the magnitude of feature correlation in DQN is much larger than in DQN-decor (note that the heat intensity bars on the right of the two plots differ). Third, there are more highly activated features in DQN, as reflected by the diagonal entries of DQN's matrix having more intense values (black color). Interestingly, there are also almost the same number of equally sized squares in the off-diagonal region of DQN's matrix, indicating that these highly activated features correlate with each other and many of them may be redundant. In contrast, the off-diagonal region of DQN-decor's matrix has very few intense values, suggesting that these important features are successfully decorrelated.

Figure 9 shows randomly sampled frames and their feature activation values (the same set of 50 features as in Figure 8) for both algorithms. Interestingly, DQN-decor (top row) appears to have far fewer features activated at the same time for a given input image, whereas DQN has many highly activated features simultaneously. In contrast, for DQN-decor there are only a few (in these samples, only one) highly active features. Interpreting features for DRL algorithms may be made easier thanks to decorrelation; although we do not have any conclusions on this yet, it is an interesting direction for future research.

5 Conclusion

In this paper, we found that feature correlation is a key factor in the performance of DQN algorithms. We have proposed a method for obtaining decorrelated features for DQN. The key idea is to regularize the loss function of DQN with a correlation loss computed from the mean-squared pairwise covariances of the features. Our decorrelation method turns out to be very effective for training DQN algorithms. We showed that it also applies to QR-DQN, improving QR-DQN significantly on most games and losing on only a few. Experiments on the losing games show that those losses are due to the choice of the regularization factor. Our decorrelation method effectively improves DQN and QR-DQN, with better or no worse performance across most games, which makes it promising for improving representations in other DRL algorithms.

Appendix A All Learning Curves

The learning curves of DQN-decor vs. DQN on all games are shown in Figure 10.

The learning curves of QR-DQN-decor vs. QR-DQN are shown in Figure 11.

Figure 10: DQN-decor (orange) vs. DQN (blue): all learning curves for 49 Atari games.

Figure 11: QR-DQN-decor (orange) vs. QR-DQN (blue): all learning curves for 49 Atari games.