Since the early days of artificial intelligence, learning effective representations of states has been a focus of research, especially because state representation matters for algorithms in practice. For example, reinforcement learning (RL) based algorithms for computer Go used to rely on millions of binary features constructed from local shapes that recognize important geometric patterns on the board. Some features were even constructed by the Go community from interviews with top Go professionals (Silver, 2009). In this paper, we study state representation for RL. There have been numerous works on learning a good representation in RL with linear function approximation. Tile coding is a classical binary scheme for encoding states that generalizes over local sub-spaces (Albus, 1975). Parr et al. (2008) constructed representations from sampled Bellman error to reduce policy evaluation error. Petrik (2007) computed powers of certain transition matrices of the underlying Markov Decision Process (MDP) to represent states. Konidaris et al. (2011) used Fourier basis functions to construct state representations; and so on.
To the best of our knowledge, there are two approaches to state representation learning in DRL. The first uses auxiliary tasks. For example, UNREAL is an architecture that learns a universal representation by presenting the learning agent with auxiliary tasks (each of which has a pseudo reward) that are relevant to the main task (which has an extrinsic reward), with the goal of solving all the tasks from a shared representation (Jaderberg et al., 2016). The second is a two-phase approach, which first learns a good representation and then performs learning and control on top of that representation. The motivation for this line of work is to leverage neural networks for RL algorithms without using experience replay. It is well known that neural networks suffer from a "catastrophic interference" issue: a network can forget what it was trained on in the past, which is exactly the motivation for experience replay in DRL (using a buffer to store experience and replaying it later). Ghiassian et al. (2018) proposed applying input transformations to reduce input interference caused by ReLU gates. Liu et al. (2018) proposed learning sparse feature representations, which can be used by online algorithms such as Sarsa.
While we also deal with representation learning with neural networks for RL, our motivation is to learn an effective representation in a one-phase process, simultaneously acquiring a good representation and solving control. We also study the generic one-task setting, although our method may be applicable to multi-task control. Our insight in this paper is that the feature-correlation phenomenon arises frequently in DRL algorithms: features learned by deep neural networks are highly correlated, as measured by their covariances. We empirically show that this correlation causes DRL algorithms to have slow learning curves. We propose a regularized correlation loss function to automatically decorrelate state features for Deep Q-Networks (DQN). Across 49 Atari games, our decorrelation method significantly improves DQN in terms of human-normalized scores, at only a slight computational cost. In particular, ours performs better than DQN on 39 games, with 4 close ties and only slight losses on the remaining games.
To test the generality of our method, we also apply decorrelation to Quantile Regression DQN (QR-DQN), a recent distributional RL algorithm that achieves generally better performance than DQN. Results show that the same conclusion holds: our decorrelation method performs much better than QR-DQN in terms of median human-normalized scores. In particular, the decorrelation algorithm lost 5 games, with 10 games being close in cumulative rewards, and won the remaining 34 games. We used the same regularization factor for all games in the benchmark. We conducted further experiments on the losing games and found that the losses are due to the choice of the regularization factor: for those games, the decorrelation algorithms with other values of the regularization factor can significantly outperform DQN and QR-DQN.
We consider a Markov Decision Process (MDP) with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward "function" $R$, a transition kernel $p(s' \mid s, a)$, and a discount ratio $\gamma \in [0, 1)$. In this paper we treat the reward "function" $R$ as a random variable to emphasize its stochasticity. The bandit setting is a special case of the general RL setting, in which there is usually only one state.
We use $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$ to denote a stochastic policy. We use $Z^{\pi}(s, a)$ to denote the random variable of the sum of the discounted rewards in the future, following the policy $\pi$ and starting from the state $s$ and the action $a$. We have $Z^{\pi}(s, a) \doteq \sum_{t=0}^{\infty} \gamma^t R(S_t, A_t)$, where $S_0 = s$, $A_0 = a$, and $S_{t+1} \sim p(\cdot \mid S_t, A_t)$, $A_t \sim \pi(\cdot \mid S_t)$. The expectation of the random variable $Z^{\pi}(s, a)$ is
$$Q^{\pi}(s, a) \doteq \mathbb{E}\big[Z^{\pi}(s, a)\big],$$
which is usually called the state-action value function. In the general RL setting, we are usually interested in finding an optimal policy $\pi^*$, such that $Q^{\pi^*}(s, a) \geq Q^{\pi}(s, a)$ holds for any $(\pi, s, a)$. All the possible optimal policies share the same optimal state-action value function $Q^*$, which is the unique fixed point of the Bellman optimality operator $\mathcal{T}$ (Bellman, 2013),
$$Q(s, a) = \mathcal{T} Q(s, a) \doteq \mathbb{E}\big[R(s, a)\big] + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\Big[\max_{a'} Q(s', a')\Big].$$
Based on the Bellman optimality operator, Watkins & Dayan (1992) proposed Q-learning to learn the optimal state-action value function $Q^*$ for control. At each time step, we update $Q(s_t, a_t)$ as
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big),$$
where $\alpha$ is a step size.
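The update above can be sketched in a few lines of code (a minimal tabular sketch; the step size, the toy table sizes, and the sample transition are ours for illustration, not from the paper):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s, a] toward the bootstrapped target."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # step toward the target by alpha
    return Q

# Toy usage: a 2-state, 2-action table initialized to zero.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

After one step from an all-zero table, only the visited entry moves: `Q[0, 1]` becomes `alpha * r = 0.1`.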
3 Regularized Correlation Loss
3.1 DQN with Decorrelation
Mnih et al. (2015) combined Q-learning with deep neural network function approximators, resulting in the Deep Q-Network (DQN). Assume the $Q$ function is parameterized by a network with weights $\theta$; at each time step, DQN performs a stochastic gradient descent step to update $\theta$, minimizing the loss
$$\tfrac{1}{2}\big( r_{t+1} + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a') - Q_{\theta}(s_t, a_t) \big)^2,$$
where $\theta^-$ is the target network (Mnih et al., 2015), a copy of $\theta$ that is synchronized with $\theta$ periodically, and $(s_t, a_t, r_{t+1}, s_{t+1})$ is a transition sampled from an experience replay buffer (Mnih et al., 2015), a first-in-first-out queue storing previously experienced transitions.
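As a minimal sketch of this loss, the following computes the mean squared TD error for a minibatch given pre-computed Q-values from the online and target networks (the function name, array shapes, and toy numbers are ours, not the paper's implementation):

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, gamma=0.99):
    """Mean squared TD error over a minibatch.
    q_online:      (B, A) online-network Q-values at s_t
    q_target_next: (B, A) target-network Q-values at s_{t+1}
    """
    targets = rewards + gamma * q_target_next.max(axis=1)   # bootstrapped targets
    chosen = q_online[np.arange(len(actions)), actions]     # Q_theta(s_t, a_t)
    return 0.5 * np.mean((targets - chosen) ** 2)

# One-transition example: target = 1.0 + 0.5 * 1.0 = 1.5, prediction = 2.0.
loss = dqn_loss(np.array([[1.0, 2.0]]), np.array([[0.0, 1.0]]),
                actions=np.array([1]), rewards=np.array([1.0]), gamma=0.5)
```

In practice the gradient flows only through `q_online`; the target-network values are treated as constants.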
Suppose the latent feature vector in the last hidden layer of DQN is denoted by a column vector $\phi_{\theta}(s) \in \mathbb{R}^d$ (where $d$ is the number of units in the last hidden layer); then the covariance matrix of the features estimated from samples is
$$C \doteq \mathbb{E}\big[ (\phi_{\theta}(s) - \bar{\phi}) (\phi_{\theta}(s) - \bar{\phi})^\top \big], \qquad (1)$$
where $\bar{\phi} \doteq \mathbb{E}[\phi_{\theta}(s)]$ and the expectations are estimated empirically from samples. We know that if two features $\phi_i$ and $\phi_j$ (the $i$th and $j$th units in the last layer) are decorrelated, then $C_{ij} = 0$. To apply decorrelation incrementally and efficiently, our key idea is to regularize the DQN loss function with a term that penalizes correlation in the features:
$$L(\theta) = L_{\text{DQN}}(\theta) + \lambda \cdot \frac{1}{d(d-1)} \sum_{i \neq j} C_{ij}^2, \qquad (2)$$
where the regularization term is the mean-squared loss of the off-diagonal entries in the covariance matrix, which we call the regularized correlation loss, and $\lambda$ is a regularization factor. The other elements of our algorithm, such as experience replay, are exactly the same as in DQN. We call this new algorithm DQN-decor.
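The regularizer can be sketched as follows on a minibatch of features (a minimal numpy sketch; the exact normalization, here the mean over off-diagonal entries, is our reading of "mean-squared loss of the off-diagonal entries"):

```python
import numpy as np

def correlation_loss(phi):
    """Mean squared off-diagonal entry of the empirical feature covariance.
    phi: (B, d) minibatch of last-hidden-layer feature vectors."""
    centered = phi - phi.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / phi.shape[0]     # (d, d) empirical covariance
    off_diag = cov - np.diag(np.diag(cov))         # zero out the variances
    d = phi.shape[1]
    return np.sum(off_diag ** 2) / (d * (d - 1))   # mean over off-diagonal entries

# Perfectly correlated features give a large penalty ...
high = correlation_loss(np.array([[1.0, 1.0], [-1.0, -1.0]]))
# ... while decorrelated features give zero.
zero = correlation_loss(np.array([[1.0, 0.0], [-1.0, 0.0]]))
```

In a deep learning framework this term is differentiable in $\theta$ through `phi`, so it can simply be added to the DQN loss before backpropagation.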
3.2 QR-DQN with Decorrelation
The core idea behind QR-DQN is quantile regression (QR), introduced in the seminal paper of Koenker & Bassett Jr (1978). This approach has gained significant attention in theoretical and applied statistics. Let us first consider QR in the supervised learning setting. Given data $\{(x_i, y_i)\}_{i=1}^{n}$, we want to compute the quantile of $y$ corresponding to a quantile level $\tau \in (0, 1)$. The linear quantile regression loss is defined as
$$L(\beta) = \sum_{i=1}^{n} \rho_{\tau}\big(y_i - x_i^\top \beta\big), \qquad \rho_{\tau}(u) = u \big( \tau - \mathbb{I}_{\{u < 0\}} \big),$$
which is a weighted sum of residuals: the weights depend on the residual signs and on the quantile level $\tau$ being estimated. For higher quantiles, positive residuals get higher weight, and vice versa. If $\tau = 1/2$, then the estimate of the median of $y$ given $x_i$ is $x_i^\top \hat{\beta}$, with $\hat{\beta} = \arg\min_{\beta} L(\beta)$.
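A small numerical sketch of the pinball loss $\rho_{\tau}$ makes the median claim concrete (the toy data and grid search are ours for illustration):

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    """Quantile-regression ("pinball") loss: positive residuals are weighted
    by tau, negative residuals by (1 - tau)."""
    u = y - y_hat
    return np.mean(np.where(u >= 0, tau * u, (tau - 1.0) * u))

# For tau = 0.5, the best constant predictor is the sample median.
y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])
grid = np.linspace(0.0, 10.0, 1001)
best = grid[int(np.argmin([pinball_loss(y, c, tau=0.5) for c in grid]))]
```

Here the grid search recovers the median 2.5 rather than the mean 3.7, because at $\tau = 1/2$ the loss is proportional to the mean absolute error.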
Instead of learning the expected return $Q$, distributional RL focuses on learning the full distribution of the random variable $Z$ directly (Jaquette, 1973; Bellemare et al., 2017). There are various approaches to representing a distribution in the RL setting (Bellemare et al., 2017; Dabney et al., 2018; Barth-Maron et al., 2018). In this paper, we focus on the quantile representation (Dabney et al., 2017) used in QR-DQN, where the distribution of $Z$ is represented by a uniform mix of $N$ supporting quantiles:
$$Z_{\theta}(s, a) \doteq \frac{1}{N} \sum_{i=1}^{N} \delta_{q_i(s, a)},$$
where $\delta_x$ denotes a Dirac at $x \in \mathbb{R}$, and each $q_i$ is an estimation of the quantile corresponding to the quantile level (a.k.a. quantile index) $\hat{\tau}_i \doteq \frac{\tau_{i-1} + \tau_i}{2}$ with $\tau_i \doteq \frac{i}{N}$ for $1 \leq i \leq N$. The state-action value $Q(s, a)$ is then approximated by $\frac{1}{N} \sum_{i=1}^{N} q_i(s, a)$. Such approximation of a distribution is referred to as quantile approximation.
Similar to the Bellman optimality operator in mean-centered RL, we have the distributional Bellman optimality operator for control in distributional RL,
$$\mathcal{T} Z(s, a) \doteq R(s, a) + \gamma Z\big(s', \arg\max_{a'} \mathbb{E}[Z(s', a')]\big), \qquad s' \sim p(\cdot \mid s, a).$$
Based on the distributional Bellman optimality operator, Dabney et al. (2017) proposed to train the quantile estimations (i.e., $\{q_i\}$) via the Huber quantile regression loss (Huber et al., 1964). To be more specific, at time step $t$ the loss is
$$\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \rho^{\kappa}_{\hat{\tau}_i}\big( y_{t,j} - q_i(s_t, a_t) \big), \qquad \rho^{\kappa}_{\tau}(u) \doteq \big| \tau - \mathbb{I}_{\{u < 0\}} \big| \, \frac{L_{\kappa}(u)}{\kappa},$$
where $y_{t,j} \doteq r_t + \gamma \, q_j\big(s_{t+1}, \arg\max_{a'} \frac{1}{N} \sum_{k} q_k(s_{t+1}, a')\big)$, $\mathbb{I}$ is the indicator function, and $L_{\kappa}$ is the Huber loss,
$$L_{\kappa}(u) \doteq \begin{cases} \frac{1}{2} u^2 & \text{if } |u| \leq \kappa, \\ \kappa \big( |u| - \frac{1}{2} \kappa \big) & \text{otherwise.} \end{cases}$$
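The two pieces combine as follows (a minimal sketch under the formulation above; the averaging convention over $i$ and $j$ and the single-quantile example are ours):

```python
import numpy as np

def huber(u, kappa=1.0):
    """Huber loss: quadratic for |u| <= kappa, linear beyond."""
    return np.where(np.abs(u) <= kappa,
                    0.5 * u ** 2,
                    kappa * (np.abs(u) - 0.5 * kappa))

def huber_quantile_loss(targets, quantiles, tau_hats, kappa=1.0):
    """Huber quantile regression loss between N target samples y_j and N
    quantile estimates q_i at levels tau_hat_i (summed over i, averaged over j)."""
    u = targets[None, :] - quantiles[:, None]       # (N, N) residuals y_j - q_i
    weight = np.abs(tau_hats[:, None] - (u < 0.0))  # |tau_i - 1{u < 0}|
    return np.mean(np.sum(weight * huber(u, kappa) / kappa, axis=0))

# Single-quantile example: residual u = 2.0, weight = 0.5, Huber(2.0) = 1.5.
loss = huber_quantile_loss(np.array([2.0]), np.array([0.0]), np.array([0.5]))
```

The Huber smoothing keeps gradients bounded for large residuals while remaining smooth at zero, which is what makes the quantile loss practical for gradient-based training.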
To decorrelate QR-DQN, we augment its loss with the same regularized correlation loss: the training objective is the Huber quantile regression loss plus $\lambda$ times the mean-squared off-diagonal covariance of the last-hidden-layer features. The other elements of QR-DQN are preserved; the only difference of our new algorithm from QR-DQN is the regularized correlation loss. We call this new algorithm QR-DQN-decor.
In this section, we conduct experiments to study the effectiveness of the proposed decorrelation method for DQN and QR-DQN, using the Atari game environments of Bellemare et al. (2013). In particular, we compare DQN-decor vs. DQN, and QR-DQN-decor vs. QR-DQN. Algorithms were evaluated over 20 million training frames (or equivalently, 5 million agent steps), with 3 runs per game.
In performing the comparisons, we used the same parameters for all the algorithms; the discount factor, image size, target-network update frequency, learning rate, and minibatch size were identical across algorithms. We clipped the continuous reward into three discrete values, $\{-1, 0, +1\}$, according to its sign; this clipping of the reward signal leads to more stable performance according to Mnih et al. (2015). The reported performance scores are, however, calculated in terms of the original unclipped rewards. The exploration strategy is epsilon-greedy, with $\epsilon$ decaying linearly over the first 1 million frames and remaining constant after that. The experience replay buffer size is 1 million. During an initial warm-up period, all four agents behaved randomly to fill the buffer with experience.
The input to the neural network is the most recent 4 frames. There are three convolution layers followed by a fully connected layer. The first layer convolves the input image with 32 filters of $8 \times 8$ with stride 4 and applies a rectifier non-linearity. The second layer convolves 64 filters of $4 \times 4$ with stride 2, followed by a rectifier non-linearity. The third convolution layer convolves 64 filters of $3 \times 3$ with stride 1, followed by a rectifier. The final hidden layer has 512 rectifier units; the outputs of these units given an image are exactly the features encoded in $\phi_{\theta}(s)$. Thus the number of features is $d = 512$. This is followed by the output layer, which is fully connected and linear, with a single output for each action (the number of actions differs per game, ranging from 4 to 18).
The parameter $\kappa$ in the Huber loss for the QR-DQN algorithms was set following Dabney et al. (2017).
The correlation loss in equation 1 is computed on the minibatch sampled in experience replay (the empirical mean is also computed over the minibatch).
4.1 Decorrelating DQN
Performance. First we compare DQN-decor with DQN in Figure 1. Performance is measured by the median human-normalized score of DQN-decor and DQN across all games, where for each game the average score over the 3 runs is taken first. DQN-decor is shown to outperform DQN starting from about 4 million frames, with the winning edge increasing over time; by 20 million frames, DQN-decor achieves a clearly higher human-normalized score than DQN.

While the median performance is widely used, it is only a summary across all games. To see how the algorithms perform on each game, we benchmark them per game in Figure 3. DQN-decor won over DQN on 39 games, with 4 close ties, and lost only slightly on the remaining games. In particular, it loses to DQN by a noticeable margin only on Bowling and BattleZone; for the other losing games, the losses were marginal.
We dug into the training on Bowling and found that DQN fails to learn this task. The evaluation score for Bowling in the original DQN paper (Mnih et al., 2015) (see their Table 2) was obtained by training over 50 million frames. In fact, for a fair comparison, one should check the training performance of DQN at https://google.github.io/dopamine/baselines/plots.html, because the scores reported in the original DQN paper are testing performance without exploration. Those results show that for Bowling, the scores of DQN in the training phase fluctuate around a low level, close to our DQN implementation here. Figure 3(a) shows that two runs (out of three) of DQN are even worse than the random agent. Because the exploration factor is close to 1.0 in the beginning, the first data points of the curves correspond to the performance of the random agent. The representation learned by DQN turns out not to be useful for this task. In fact, the features learned by these two failure runs of DQN are nearly uncorrelated, reflected by the correlation loss being nearly zero, as shown in Figure 3(b). So decorrelating an unsuccessful representation does not help in this case.
The loss of DQN-decor to DQN on BattleZone is due to the regularization factor $\lambda$, which is fixed uniformly across all games: in producing these two figures, $\lambda$ was fixed to the same value for all 49 games. In fact, we did not run a parameter search to optimize $\lambda$ for all games; the value was picked at our best guess. We did notice that DQN-decor can achieve better performance than DQN on BattleZone with a different regularization factor, as shown in Figure 3(c). It appears that the features learned by DQN are already fairly decorrelated on this game, and thus a smaller regularization factor gives better performance.
Setting aside the unsuccessful representation learned by DQN in the case of Bowling and the better choice of regularization factor on BattleZone, we found decorrelation to be a uniformly effective technique for training DQN.
Effect of Feature Correlation. To study the relation between performance and correlation loss, we plot the correlation loss over training time (also using the median measure) across the 49 games in Figure 2. The correlation loss is computed by averaging over the minibatch samples drawn during experience replay at each time step.

As shown by the figure, DQN-decor effectively reduces the correlation in the features. The feature correlation loss of DQN-decor is almost flat after no more than 2 million frames, indicating that decorrelating features can be achieved faster than learning (good news for representation learning). For DQN, by contrast, the learned features become more and more correlated over the first 6 million frames. After that, however, DQN does reach a flat correlation loss, although a much bigger one than DQN-decor's.
This phenomenon of DQN learning is interesting because it shows that although correlation is not considered in its loss function, DQN does have the ability to achieve feature decorrelation (to some extent) over time. Note that DQN's performance keeps improving after 6 million frames while its correlation loss measure is almost flat. This may indicate that it is natural to understand learning in two phases: decorrelating features (with some loss), after which learning continues and performance keeps improving with decorrelated features.

Interestingly, the faster learning of DQN-decor follows after the features are decorrelated: DQN-decor's advantage over DQN becomes significant after about 6 million frames (Figure 1), while DQN-decor's correlation loss has been flat since less than 2 million frames (Figure 2).
4.2 Decorrelating QR-DQN
We further conduct experiments to study whether our decorrelation method applies to QR-DQN. The value of $\lambda$ was picked simply from a parameter study on the game of Seaquest. Figure 4 shows the median human-normalized scores across the 49 games. Similar to the case of DQN, our method achieves much better performance by decorrelating QR-DQN, especially after about 7 million frames. Again, the performance edge of QR-DQN-decor over QR-DQN has a trend of increasing over time. We observed a similar relation between performance and correlation loss, which supports the claim that decorrelation is the factor that improves performance.
We also profile the performance of the algorithms on each game using the normalized AUC, shown in Figure 6. In this measure, the decorrelation algorithm lost 5 games, with 10 games being close and 34 games won. The algorithm lost most to QR-DQN on Frostbite, BattleZone and DemonAttack. This is due to the parameter choice rather than an algorithmic issue: for all three losing games, we performed additional experiments with the same value of $\lambda$ as in the DQN experiments. The learning curves are shown in Figure 5 (Frostbite), Figure 6(a) (BattleZone), and Figure 6(b) (DemonAttack). Figure 5 shows that with this $\lambda$, the performance of QR-DQN-decor on Frostbite is significantly improved, overtaking QR-DQN in the normalized AUC measure. For the other two games, QR-DQN-decor performs at least no worse than QR-DQN (better on DemonAttack, and no worse on BattleZone).
Note that the median human-normalized score across games and the per-game normalized AUC may not give a full picture of algorithm performance. Algorithms that perform well in these two measures can still exhibit plummeting behaviour, which is characterized by abrupt degradation in performance: the learning curve can drop to a low score and stay there indefinitely. A more detailed discussion of this point is given by Machado et al. (2017). To study whether our decorrelation algorithms exhibit plummeting behaviour, we benchmark all algorithms by averaging the rewards over the last one million training frames (essentially the near-tail performance in training). The results in this measure are summarized in Table 1. Out of 49 games, the decorrelation algorithms perform best in 39 games: DQN-decor is best in 15 games and QR-DQN-decor is best in 24 games. Thus our decorrelation algorithms do not exhibit plummeting behaviour.
In summary, the empirical results in this section show that decorrelation is effective for improving the performance of both DQN and QR-DQN. Even on the games that were lost (using a single regularization factor fixed across all games), both DQN and QR-DQN can still be significantly improved by our decorrelation method with a tuned regularization factor.
4.3 Analysis on the Learned Representation
To study the correlation between the learned features after training finished, we first compute the feature covariance matrix (see equation 1) using the (one million) samples in the experience replay buffer at the end of training, for both DQN and DQN-decor. Then we sort the (512) features according to their variances, which are exactly the diagonal entries of the feature covariance matrix.
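This sorting step can be sketched as follows (a minimal numpy sketch; the function name, the top-$k$ cutoff, and the toy feature matrix are ours for illustration):

```python
import numpy as np

def top_features_by_variance(phi, k=50):
    """Return indices of the k highest-variance features and the covariance
    sub-matrix restricted to them.
    phi: (n, d) matrix of feature vectors collected from the replay buffer."""
    centered = phi - phi.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / phi.shape[0]       # (d, d) covariance matrix
    order = np.argsort(np.diag(cov))[::-1][:k]       # sort by variance, descending
    return order, cov[np.ix_(order, order)]          # permuted covariance block

# Tiny example: column variances are 0, 1 and 100, so the top-2 are [2, 1].
phi = np.array([[0.0, 1.0, 10.0], [0.0, -1.0, -10.0]])
order, sub = top_features_by_variance(phi, k=2)
```

The returned sub-matrix is exactly what a heat-map visualization like Figure 7 would display: variances on the diagonal, cross-feature covariances off it.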
A visualization of the top 50 features on the game of Seaquest is shown in Figure 7(a) (DQN) and Figure 7(b) (DQN-decor). Three observations can be made. First, the diagonalization pattern is more obvious in the feature covariance matrix of DQN-decor. Second, the magnitude of feature correlation in DQN is much larger than in DQN-decor (note that the heat-intensity bars on the right of the two plots differ). Third, there are more highly activated features in DQN, reflected in the diagonal of DQN's matrix having more intense values (black color). Interestingly, there is also almost the same number of equally sized squares in the off-diagonal region of DQN's matrix, indicating that these highly activated features correlate with each other and many of them may be redundant. In contrast, the off-diagonal region of DQN-decor's matrix has very few intense values, suggesting that these important features are successfully decorrelated.
Figure 9 shows randomly sampled frames and their feature activation values (the same set of 50 features as in Figure 7(a) and Figure 7(b)) for both algorithms. Interestingly, given an input image, DQN-decor (top row) has far fewer features activated at the same time. Moreover, DQN has many highly activated features at once, whereas for DQN-decor there are only a few (in these samples, only one) highly active features. Interpreting features for DRL algorithms may be made easier thanks to decorrelation; although we do not have any conclusion on this yet, it is an interesting direction for future research.
In this paper, we found that feature correlation is a key factor in the performance of DQN algorithms. We proposed a method for obtaining decorrelated features for DQN. The key idea is to regularize the loss function of DQN with a correlation loss computed from the mean-squared correlation between the features. Our decorrelation method turns out to be very effective for training DQN algorithms. We showed that it also applies to QR-DQN, improving QR-DQN significantly on most games and losing on only a few; an experimental study on the losing games shows that the losses are due to the choice of the regularization factor. Our decorrelation method effectively improves DQN and QR-DQN, with better or no-worse performance across most games, which makes it promising for improving representations in other DRL algorithms.
Appendix A All Learning Curves
The learning curves of DQN-decor vs. DQN on all games are shown in Figure 10.

The learning curves of QR-DQN-decor vs. QR-DQN are shown in Figure 11.
- Albus (1975) Albus, J. S. Data storage in the cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97(3):228–233, 1975.
- Barth-Maron et al. (2018) Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
- Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
- Bellman (2013) Bellman, R. Dynamic programming. Courier Corporation, 2013.
- Dabney et al. (2017) Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044, 2017.
- Dabney et al. (2018) Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.
- Ghiassian et al. (2018) Ghiassian, S., Yu, H., Rafiee, B., and Sutton, R. S. Two geometric input transformation methods for fast online reinforcement learning with neural nets. CoRR, abs/1805.07476, 2018. URL http://arxiv.org/abs/1805.07476.
- Huber et al. (1964) Huber, P. J. et al. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 1964.
- Jaderberg et al. (2016) Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016. URL http://arxiv.org/abs/1611.05397.
- Jaquette (1973) Jaquette, S. C. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1973.
- Koenker & Bassett Jr (1978) Koenker, R. and Bassett Jr, G. Regression quantiles. Econometrica: Journal of the Econometric Society, pp. 33–50, 1978.
- Konidaris et al. (2011) Konidaris, G., Osentoski, S., and Thomas, P. Value function approximation in reinforcement learning using the fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI’11, pp. 380–385. AAAI Press, 2011. URL http://dl.acm.org/citation.cfm?id=2900423.2900483.
- Liu et al. (2018) Liu, V., Kumaraswamy, R., Le, L., and White, M. The utility of sparse representations for control in reinforcement learning. CoRR, abs/1811.06626, 2018. URL http://arxiv.org/abs/1811.06626.
- Machado et al. (2017) Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009, 2017.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Parr et al. (2008) Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 752–759. ACM, 2008.
- Petrik (2007) Petrik, M. An analysis of laplacian methods for value function approximation in mdps. In Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI’07, pp. 2574–2579, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1625275.1625690.
- Silver (2009) Silver, D. Reinforcement Learning and Simulation-Based Search in Computer Go. PhD thesis, 2009.
- Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction (2nd Edition). MIT press, 2018.
- Szepesvári (2010) Szepesvári, C. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.
- Watkins & Dayan (1992) Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 1992.