1 Introduction
Since the early days of artificial intelligence, learning effective representations of states has been a focus of research, especially because state representation is important for algorithms in practice. For example, reinforcement learning (RL) based algorithms for computer Go used to rely on millions of binary features constructed from local shapes that recognize important geometric patterns on the board. Some of these features were even constructed by the Go community by interviewing top Go professionals
(Silver, 2009). In this paper, we study state representation for RL. There have been numerous works on learning good representations in RL with linear function approximation. Tile coding is a classical binary scheme for encoding states that generalizes over local subspaces (Albus, 1975). Parr et al. (2008) constructed representations from sampled Bellman errors to reduce policy evaluation error. Petrik (2007) computed powers of a transition matrix of the underlying Markov Decision Process (MDP) to represent states.
Konidaris et al. (2011) used Fourier basis functions to construct state representations; and so on. To the best of our knowledge, there are two approaches to state representation learning in deep reinforcement learning (DRL). The first uses auxiliary tasks. For example, UNREAL is an architecture that learns a universal representation by presenting the learning agent with auxiliary tasks (each of which has a pseudo reward) that are relevant to the main task (which has an extrinsic reward) (Jaderberg et al., 2016), with the goal of solving all the tasks from a shared representation. The second is a two-phase approach, which first learns a good representation and then performs learning and control using that representation. The motivation for this line of work is to leverage neural networks for RL algorithms without using experience replay. It is well-known that neural networks have a "catastrophic interference" issue: a network can forget what it has been trained on in the past, which is exactly the motivation for experience replay in DRL (using a buffer to store experience and replaying it later). Ghiassian et al. (2018)
proposed to apply input transformations to reduce input interference caused by ReLU gates.
Liu et al. (2018) proposed to learn sparse feature representations that can be used by online algorithms such as Sarsa. While we also deal with representation learning with neural networks for RL, our motivation is to learn effective representations in a one-phase process, simultaneously acquiring a good representation and solving control. We also study the generic one-task setting, although our method may be applicable to multi-task control. Our insight in this paper is that the feature correlation phenomenon occurs frequently in DRL algorithms: features learned by deep neural networks are highly correlated, as measured by their covariances. We show empirically that this correlation causes DRL algorithms to have a slow learning curve. We propose a regularized correlation loss function that automatically decorrelates state features for Deep Q-Networks (DQN). Across 49 Atari games, our decorrelation method significantly improves DQN with only a slight increase in computation. Our decorrelation algorithm performs better than DQN in terms of human-normalized scores. In particular, ours performs better than DQN on 39 games, with 4 close ties and only slight losses on the remaining games. To test the generality of our method, we also apply decorrelation to Quantile Regression DQN (QR-DQN), a recent distributional RL algorithm that generally achieves better performance than DQN. Results show that the same conclusion holds: our decorrelation method performs much better than QR-DQN in terms of median human-normalized scores. In particular, the decorrelation algorithm lost 5 games, with 10 games being close and 34 games winning (measured in cumulative rewards). We used the same regularization factor for all games in the benchmark. We conducted further experiments on the losing games, and found that the losses are due to the choice of regularization factor: for those games, our decorrelation algorithms with other values of the regularization factor can significantly outperform DQN and QR-DQN.
2 Background
We consider a Markov Decision Process (MDP) with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward "function" $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a transition kernel $p(s' \mid s, a)$, and a discount ratio $\gamma \in [0, 1)$. In this paper we treat the reward "function" $R$
as a random variable to emphasize its stochasticity. The bandit setting is a special case of the general RL setting, where we usually have only one state.
We use $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ to denote a stochastic policy. We use $Z^\pi(s, a)$ to denote the random variable of the sum of the discounted rewards in the future, following the policy $\pi$ and starting from the state $s$ and the action $a$. We have $Z^\pi(s, a) \doteq \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$, where $s_0 = s$, $a_0 = a$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ and $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$. The expectation of the random variable $Z^\pi(s, a)$ is
$$Q^\pi(s, a) \doteq \mathbb{E}\big[Z^\pi(s, a)\big],$$
which is usually called the state-action value function. In the general RL setting, we are usually interested in finding an optimal policy $\pi^*$ such that $Q^{\pi^*}(s, a) \geq Q^\pi(s, a)$ holds for any $(\pi, s, a)$. All possible optimal policies share the same optimal state-action value function $Q^*$, which is the unique fixed point of the Bellman optimality operator (Bellman, 2013),
$$(\mathcal{T} Q)(s, a) \doteq \mathbb{E}\big[R(s, a)\big] + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\Big[\max_{a'} Q(s', a')\Big].$$
Based on the Bellman optimality operator, Watkins & Dayan (1992) proposed Q-learning to learn the optimal state-action value function $Q^*$ for control. At each time step, we update $Q(s_t, a_t)$ as
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big),$$
where $\alpha$ is a step size and $(s_t, a_t, r_{t+1}, s_{t+1})$ is a transition. There have been many works extending Q-learning to linear function approximation (Sutton & Barto, 2018; Szepesvári, 2010).
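As a concrete illustration, the tabular update above can be sketched as follows; the two-state toy MDP, step size, and discount values here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the bootstrapped
    target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy 2-state, 2-action MDP: repeated updates on a single rewarding
# transition drive Q(0, 1) toward its bootstrapped target value.
Q = np.zeros((2, 2))
for _ in range(200):
    Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Since state 1 is never updated here, the target stays at $1 + \gamma \cdot 0 = 1$, and repeated updates make $Q(0, 1)$ converge to it.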
3 Regularized Correlation Loss
3.1 DQN with Decorrelation
Mnih et al. (2015) combined Q-learning with deep neural network function approximators, resulting in the Deep-Q-Network (DQN). Assume the $Q$ function is parameterized by a network with weights $\theta$; at each time step, DQN performs a stochastic gradient descent step to update $\theta$, minimizing the loss
$$\Big( r_{t+1} + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a') - Q_\theta(s_t, a_t) \Big)^2,$$
where $\theta^-$ is the target network (Mnih et al., 2015), a copy of $\theta$ that is synchronized with $\theta$ periodically, and $(s_t, a_t, r_{t+1}, s_{t+1})$ is a transition sampled from an experience replay buffer (Mnih et al., 2015), a first-in-first-out queue storing previously experienced transitions.
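The loss above can be sketched on a minibatch as follows, with a linear Q-function standing in for the convolutional network; the linear parameterization and the minibatch tuple format are our illustrative assumptions.

```python
import numpy as np

def dqn_loss(theta, theta_target, batch, gamma=0.99):
    """Mean squared TD error of DQN on a minibatch, using a frozen
    target network theta_target. Q is linear in a state feature vector:
    Q(s, .) = theta @ s, with one row of theta per action (a stand-in
    for the convolutional network used in the paper)."""
    loss = 0.0
    for s, a, r, s_next, done in batch:
        target = r + (0.0 if done else gamma * np.max(theta_target @ s_next))
        loss += (target - (theta @ s)[a]) ** 2
    return loss / len(batch)
```

Freezing `theta_target` between periodic synchronizations is what keeps the regression target stable during the gradient steps on `theta`.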
Suppose the latent feature vector in the last hidden layer of DQN is denoted by a column vector $\phi_\theta(s) \in \mathbb{R}^d$ (where $d$ is the number of units in the last hidden layer). Then the covariance matrix of the features, estimated from $n$ samples, is
$$\Sigma \doteq \frac{1}{n} \sum_{i=1}^{n} \big(\phi_\theta(s_i) - \bar{\phi}\big)\big(\phi_\theta(s_i) - \bar{\phi}\big)^\top, \qquad (1)$$
where $\bar{\phi} \doteq \frac{1}{n} \sum_{i=1}^{n} \phi_\theta(s_i)$ and $\Sigma_{jk}$ denotes the $(j, k)$ entry of $\Sigma$. We know that if two features $\phi_j$ and $\phi_k$ (the $j$th and $k$th units in the last layer) are decorrelated, then $\Sigma_{jk} = 0$. To apply decorrelation incrementally and efficiently, our key idea is to regularize the DQN loss function with a term that penalizes correlation in the features:
$$\Big( r_{t+1} + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a') - Q_\theta(s_t, a_t) \Big)^2 + \lambda \, \frac{1}{d(d-1)} \sum_{j \neq k} \Sigma_{jk}^2,$$
where the regularization term (with factor $\lambda$) is the mean-squared loss of the off-diagonal entries of the covariance matrix, which we call the regularized correlation loss. The other elements of our algorithm, such as experience replay, are exactly the same as in DQN. We call this new algorithm DQN-decor.
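A minimal sketch of the regularized correlation loss follows. The normalization over the $d(d-1)$ off-diagonal entries is our reading of "mean-squared loss of the off-diagonal entries"; any constant factor here can be absorbed into the regularization factor $\lambda$.

```python
import numpy as np

def correlation_loss(phi):
    """Mean-squared off-diagonal covariance of a feature minibatch.
    phi: (n, d) array of last-hidden-layer features for n sampled states."""
    n, d = phi.shape
    centered = phi - phi.mean(axis=0)
    cov = centered.T @ centered / n          # empirical covariance, eq. (1)
    off_diag = cov - np.diag(np.diag(cov))   # zero out the diagonal
    return (off_diag ** 2).sum() / (d * (d - 1))

def dqn_decor_loss(td_loss, phi, lam):
    """DQN loss regularized by the correlation penalty; lam is the
    regularization factor tuned per experiment."""
    return td_loss + lam * correlation_loss(phi)
```

Perfectly decorrelated features yield a zero penalty, so the regularizer only pushes on features whose minibatch covariances are nonzero.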
3.2 QR-DQN with Decorrelation
The core idea behind QR-DQN is quantile regression (QR), introduced in the seminal paper of Koenker & Bassett Jr (1978). This approach has gained significant attention in theoretical and applied statistics. Let us first consider QR in supervised learning. Given data $\{(x_i, y_i)\}_{i=1}^{N}$, we want to compute the quantile of $y$ corresponding to the quantile level $\tau \in (0, 1)$. The linear quantile regression loss is defined as
$$L(\beta) \doteq \sum_{i=1}^{N} \rho_\tau\big(y_i - x_i^\top \beta\big), \qquad (2)$$
where
$$\rho_\tau(u) \doteq u \big(\tau - \mathbb{I}\{u < 0\}\big) = \tau \, u^+ + (1 - \tau) \, u^-, \qquad (3)$$
with $u^+ \doteq \max(u, 0)$ and $u^- \doteq \max(-u, 0)$, so that the loss is a weighted sum of residuals. The weights depend on the signs of the residuals and on the level $\tau$ of the estimated quantile: for higher quantiles, positive residuals get higher weight, and vice versa. If $\tau = 1/2$, then the minimizer $\hat{\beta} \doteq \arg\min_\beta L(\beta)$ gives an estimate $x_i^\top \hat{\beta}$ of the median of $y$ at $x_i$.
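A small sketch of the check loss in equation (3); the brute-force grid minimizer is only an illustrative stand-in for actually solving the linear regression in equation (2).

```python
import numpy as np

def pinball_loss(residuals, tau):
    """Quantile-regression check loss rho_tau(u) = u * (tau - 1{u < 0}),
    averaged over the residuals u = y - q."""
    u = np.asarray(residuals, dtype=float)
    return np.mean(u * (tau - (u < 0)))

def empirical_quantile(y, tau, grid):
    """Pick the grid value minimizing the pinball loss -- a brute-force
    stand-in for minimizing the quantile regression loss in eq. (2)."""
    return min(grid, key=lambda q: pinball_loss(np.asarray(y) - q, tau))
```

At $\tau = 1/2$ the minimizer is the sample median; at higher $\tau$, positive residuals are penalized more, pulling the estimate toward the upper tail.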
Instead of learning the expected return $Q$, distributional RL focuses on learning the full distribution of the random variable $Z$ directly (Jaquette, 1973; Bellemare et al., 2017). There are various approaches to representing a distribution in the RL setting (Bellemare et al., 2017; Dabney et al., 2018; Barth-Maron et al., 2018). In this paper, we focus on the quantile representation (Dabney et al., 2017) used in QR-DQN, where the distribution of $Z$ is represented by a uniform mixture of $K$ supporting quantiles:
$$Z_\theta(s, a) \doteq \frac{1}{K} \sum_{j=1}^{K} \delta_{q_j(s, a)},$$
where $\delta_x$ denotes a Dirac at $x \in \mathbb{R}$, and each $q_j$ is an estimation of the quantile corresponding to the quantile level (a.k.a. quantile index) $\hat{\tau}_j \doteq \frac{2j - 1}{2K}$ for $1 \leq j \leq K$. The state-action value $Q(s, a)$ is then approximated by $\frac{1}{K} \sum_{j=1}^{K} q_j(s, a)$. Such an approximation of a distribution is referred to as quantile approximation.
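The quantile approximation of the state-action value can be sketched as follows; the midpoint quantile levels follow the convention of QR-DQN described above.

```python
import numpy as np

def q_from_quantiles(quantiles):
    """Approximate Q(s, a) as the mean of the K quantile estimates,
    i.e. the expectation of the uniform Dirac mixture (1/K) sum_j delta_{q_j}."""
    return float(np.mean(quantiles))

def quantile_levels(K):
    """Midpoint quantile levels tau_hat_j = (2j - 1) / (2K) for j = 1..K."""
    return (2 * np.arange(1, K + 1) - 1) / (2 * K)
```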
Similar to the Bellman optimality operator in mean-centered RL, we have the distributional Bellman optimality operator for control in distributional RL,
$$\mathcal{T} Z(s, a) \doteq R(s, a) + \gamma Z\Big(s', \arg\max_{a'} \mathbb{E}\big[Z(s', a')\big]\Big), \quad s' \sim p(\cdot \mid s, a).$$
Based on the distributional Bellman optimality operator, Dabney et al. (2017) proposed to train the quantile estimations (i.e., $\{q_j\}_{j=1}^{K}$) via the Huber quantile regression loss (Huber et al., 1964). To be more specific, at time step $t$ the loss is
$$\frac{1}{K} \sum_{j=1}^{K} \sum_{k=1}^{K} \rho^\kappa_{\hat{\tau}_j}\big(y_{t,k} - q_j(s_t, a_t)\big),$$
where
$$y_{t,k} \doteq r_{t+1} + \gamma \, q_k\Big(s_{t+1}, \arg\max_{a'} \frac{1}{K} \sum_{j=1}^{K} q_j(s_{t+1}, a')\Big), \qquad (4)$$
and
$$\rho^\kappa_{\hat{\tau}_j}(u) \doteq \big|\hat{\tau}_j - \mathbb{I}\{u < 0\}\big| \, \frac{L_\kappa(u)}{\kappa}, \qquad (5)$$
where $\mathbb{I}$ is the indicator function and $L_\kappa$ is the Huber loss,
$$L_\kappa(u) \doteq \begin{cases} \frac{1}{2} u^2 & \text{if } |u| \leq \kappa, \\ \kappa \big(|u| - \frac{1}{2} \kappa\big) & \text{otherwise.} \end{cases}$$
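The Huber quantile loss of equation (5) can be sketched as follows (a direct transcription of the formulas above):

```python
import numpy as np

def huber(u, kappa=1.0):
    """Huber loss L_kappa: quadratic within [-kappa, kappa], linear outside."""
    a = np.abs(u)
    return np.where(a <= kappa, 0.5 * a ** 2, kappa * (a - 0.5 * kappa))

def huber_quantile_loss(u, tau, kappa=1.0):
    """Asymmetric Huber check loss rho^kappa_tau(u) from eq. (5):
    |tau - 1{u < 0}| * L_kappa(u) / kappa."""
    return np.abs(tau - (u < 0)) * huber(u, kappa) / kappa
```

Like the plain check loss in equation (3), it weights residuals asymmetrically by `tau`, but the Huber smoothing keeps the gradient bounded near zero.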
To decorrelate QR-DQN, we use the following loss function,
$$\frac{1}{K} \sum_{j=1}^{K} \sum_{k=1}^{K} \rho^\kappa_{\hat{\tau}_j}\big(y_{t,k} - q_j(s_t, a_t)\big) + \lambda \, \frac{1}{d(d-1)} \sum_{m \neq l} \Sigma_{ml}^2,$$
where $\Sigma$ is the covariance matrix of the features $\phi_\theta(s)$ encoded by the last hidden layer of QR-DQN, $y_{t,k}$ is computed as in equation (4), and $\rho^\kappa_{\hat{\tau}_j}$ is computed as in equation (5).
The other elements of QR-DQN are also preserved; the only difference between our new algorithm and QR-DQN is the regularized correlation loss. We call this new algorithm QR-DQN-decor.
4 Experiment
In this section, we conduct experiments to study the effectiveness of the proposed decorrelation method for DQN and QR-DQN, using the Atari game environments of Bellemare et al. (2013). In particular, we compare DQN-decor vs. DQN, and QR-DQN-decor vs. QR-DQN. Algorithms were evaluated by training over 20 million frames (equivalently, 5 million agent steps), with 3 runs per game.
In performing the comparisons, we used the same parameters for all algorithms, reported below. The discount factor is . The image size is . We clipped the continuous reward into three discrete values, , according to its sign; this clipping of the reward signal leads to more stable performance according to Mnih et al. (2015). The reported performance scores are calculated, however, in terms of the original unclipped rewards. The target network update frequency is frames. The learning rate is . The exploration strategy is epsilon-greedy, with $\epsilon$ decaying linearly from to over the first 1 million frames and remaining constant after that. The experience replay buffer size is 1 million. The minibatch size for experience replay is . For the first frames, all four agents behaved randomly to fill the buffer with experience as a warm start.
The input to the neural network is the most recent 4 frames. There are three convolution layers followed by a fully connected layer. The first layer convolves 32 filters of with stride 4 over the input image and applies a rectifier nonlinearity. The second layer convolves 64 filters of
with stride 2, followed by a rectifier nonlinearity. The third convolution layer convolves 64 filters of with stride 1, followed by a rectifier. The final hidden layer has 512 rectifier units; the outputs of these units given an image are exactly the features encoded in $\phi$. Thus the number of features is $d = 512$. This is followed by the output layer, which is fully connected and linear, with a single output for each action (the number of actions differs across games, ranging from 4 to 18). The parameter $\kappa$ in the Huber loss for the QR-DQN algorithms was set following Dabney et al. (2017).
The correlation loss in equation (1) is computed on the minibatch sampled during experience replay (the empirical mean is also computed over the minibatch).
4.1 Decorrelating DQN
Performance. First we compare DQN-decor with DQN in Figure 1. Performance is measured by the median score of DQN-decor and DQN across all games, where for each game the average score over the 3 runs is taken first. DQN-decor is shown to outperform DQN starting from about 4 million frames, with the winning edge increasing over time. By 20 million frames, DQN-decor achieves a higher human-normalized score than DQN.
While the median performance is widely used, it is only a summary of performance across all games. To see how the algorithms perform in each game, we benchmark them per game in Figure 3. DQN-decor won over DQN on 39 games, with 4 close ties, and lost slightly on the remaining games. In particular, only on Bowling and BattleZone does it lose to DQN by a noticeable margin; on the other losing games, the losses were small.
We dug into the training on Bowling and found that DQN fails to learn this task. The evaluation score for Bowling in the original DQN paper (Mnih et al., 2015, their Table 2) was obtained by training over 50 million frames. In fact, for a fair comparison, one can check the training performance of DQN at https://google.github.io/dopamine/baselines/plots.html, because the scores reported in the original DQN paper are testing performance without exploration. Those results show that for Bowling, the score of DQN during the training phase fluctuates around a low level, which is close to our implemented DQN here. Figure 3(a) shows that two runs (out of three) of DQN are even worse than the random agent. Because the exploration factor is close to 1.0 at the beginning, the first data points of the curves correspond to the performance of the random agent. The representation learned by DQN turns out not to be useful for this task. In fact, the features learned by these two failure runs of DQN are nearly i.i.d., reflected in the correlation loss being nearly zero, as shown in Figure 3(b). So decorrelating an unsuccessful representation does not help in this case.
The loss of DQN-decor to DQN on BattleZone is due to the regularization factor $\lambda$, which is uniformly fixed for all games. In producing these two figures, $\lambda$ was fixed to the same value for all 49 games; we did not run a parameter search to optimize $\lambda$ across games, and the value was picked at our best guess. We did notice that DQN-decor can achieve better performance than DQN on BattleZone with a different regularization factor, as shown in Figure 3(c). It appears that the features learned by DQN on this game are already largely decorrelated, and thus a smaller regularization factor gives better performance.
Setting aside the unsuccessful representation learned by DQN in the case of Bowling, and allowing a better choice of regularization factor, we found decorrelation to be a uniformly effective technique for training DQN.
Effect of Feature Correlation. To study the relationship between performance and correlation loss, we plot the correlation loss over training time (also using the median measure) across the 49 games in Figure 2. The correlation loss is computed by averaging over the minibatch samples during experience replay at each time step.
As the figure shows, DQN-decor effectively reduces the correlation in the features. The feature correlation loss for DQN-decor is almost flat after no more than 2 million frames, indicating that decorrelating features can be achieved faster than learning itself (good news for representation learning). For DQN, in contrast, the learned features keep becoming more and more correlated over the first 6 million frames. After that, however, DQN does achieve a flat correlation loss, although at a much higher level than DQN-decor's.
This phenomenon of DQN learning is interesting because it shows that, although correlation is not considered in its loss function, DQN does have some ability to decorrelate features over time. Note that DQN's performance keeps improving after 6 million frames while its correlation loss measure is almost flat. This may indicate that it is natural to understand learning in two phases: first decorrelating features (to some extent), after which learning continues and performance improves with the decorrelated features.
Interestingly, the faster learning of DQN-decor follows after its features are decorrelated: DQN-decor's advantage over DQN becomes significant after about 6 million frames (Figure 1), while its correlation loss has been flat since less than 2 million frames (Figure 2).
Table 1: Average scores over the last one million training frames.
Game  DQN  DQN-decor  QR-DQN  QR-DQN-decor

Alien  1445.8  1348.5  1198.0  1496.0 
Amidar  234.9  263.8  236.5  339.0 
Assault  2052.6  1950.8  6118.2  5341.3 
Asterix  3880.2  5541.7  6978.4  8418.6 
Asteroids  733.2  1292.0  1442.4  1731.4 
Atlantis  189008.6  307251.6  65875.3  148162.4 
BankHeist  568.4  648.1  730.3  740.8 
BattleZone  15732.2  14945.3  18001.3  17852.6 
BeamRider  5193.1  5394.4  5723.6  6483.1 
Bowling  27.3  21.9  22.5  22.2 
Boxing  85.6  86.0  87.1  83.9 
Breakout  311.3  337.7  372.8  393.3 
Centipede  2161.2  2360.5  6003.0  6092.9 
ChopperCommand  1362.4  1735.2  2266.9  2777.4 
CrazyClimber  69023.8  100318.4  74110.1  100278.4 
DemonAttack  7679.6  7471.8  34845.7  27393.6 
DoubleDunk  15.5  16.8  18.5  19.1 
Enduro  808.3  891.7  409.4  884.5 
FishingDerby  0.7  11.7  9.0  10.8 
Freeway  23.0  32.4  25.6  24.9 
Frostbite  293.8  376.6  1414.2  755.7 
Gopher  2064.5  3067.6  2816.5  3451.8 
Gravitar  271.2  382.3  305.9  314.9 
Hero  3025.4  6197.1  1948.4  9352.2 
IceHockey  10.0  8.6  10.3  9.6 
Jamesbond  387.5  471.0  391.4  515.0 
Kangaroo  3933.3  3955.5  1987.8  2504.4 
Krull  5709.9  6286.4  6547.7  6567.9 
KungFuMaster  16999.0  20482.9  22131.3  23531.4 
MontezumaRevenge  0.0  0.0  0.0  2.4 
MsPacman  2019.0  2166.0  2221.4  2407.9 
NameThisGame  7699.0  7578.2  8407.0  8341.0 
Pong  19.9  20.0  19.9  20.0 
PrivateEye  345.6  610.8  41.7  251.0 
Qbert  2823.5  4432.4  4041.0  5148.1 
Riverraid  6431.3  7613.8  7134.8  7700.5 
RoadRunner  35898.6  39327.0  36800.0  37917.6 
Robotank  24.8  24.5  31.3  30.6 
Seaquest  4216.6  6635.7  4856.8  5224.6 
SpaceInvaders  1015.8  913.0  946.7  825.1 
StarGunner  15586.6  21825.0  25530.8  37461.0 
Tennis  22.3  21.2  17.3  16.2 
TimePilot  2802.8  3852.1  3655.8  3651.6 
Tutankham  103.4  116.2  148.3  190.2 
UpNDown  8234.5  9105.8  8647.4  11342.4 
Venture  8.4  15.3  0.7  1.1 
VideoPinball  11564.1  15759.3  53207.5  66439.4 
WizardOfWor  1804.3  2030.3  2109.8  2530.6 
Zaxxon  3105.2  7049.4  4179.8  6816.9 
4.2 Decorrelating QR-DQN
We further conduct experiments to study whether our decorrelation method applies to QR-DQN. The value of $\lambda$ was picked simply from a parameter study on the game of Seaquest. Figure 4 shows the median human-normalized performance scores across the 49 games. Similar to the case of DQN, our method achieves much better performance by decorrelating QR-DQN, especially after about 7 million frames. Again, the performance edge of QR-DQN-decor over QR-DQN has a trend of increasing over time. We observed a similar relationship between performance and correlation loss, which supports that decorrelation is the factor that improves performance.
We also profile the performance of the algorithms for each game using the normalized AUC, shown in Figure 6. In this measure, the decorrelation algorithm lost 5 games, with 10 games being close and 34 games winning. The algorithm lost most to QR-DQN on the games Frostbite, BattleZone and DemonAttack. This is due to the parameter choice rather than an algorithmic issue. For all three losing games, we performed additional experiments with the same value of $\lambda$ as in the DQN experiments. The learning curves are shown in Figure 5 (Frostbite), Figure 6(a) (BattleZone), and Figure 6(b) (DemonAttack). Figure 5 shows that with this $\lambda$, the performance of QR-DQN-decor on Frostbite is significantly improved, beating QR-DQN in the normalized AUC measure. For the other two games, QR-DQN-decor performs at least no worse than QR-DQN.
Note that the median human-normalized score across games and the per-game normalized AUC may not give a full picture of algorithm performance. Algorithms that perform well in these two measures can still exhibit plummeting behaviour, characterized by an abrupt degradation in performance: the learning curve can drop to a low score and stay there indefinitely. A more detailed discussion of this point is given by Machado et al. (2017). To study whether our decorrelation algorithms exhibit plummeting behaviour, we benchmark all algorithms by averaging the rewards over the last one million training frames (which is essentially near-tail training performance). The results in this measure are summarized in Table 1. Out of 49 games, the decorrelation algorithms perform best in 39 games: DQN-decor is best in 15 games and QR-DQN-decor is best in 24 games. Thus our decorrelation algorithms do not exhibit plummeting behaviour.
In summary, the empirical results in this section show that decorrelation is effective in improving the performance of both DQN and QR-DQN. Even on the games with the largest losses (incurred by using the same fixed regularization factor for all games), both DQN and QR-DQN can still be significantly improved by our decorrelation method with a tuned regularization factor.
4.3 Analysis on the Learned Representation
To study the correlation between the learned features after training has finished, we first compute the feature covariance matrix (see equation (1)) using the one million samples in the experience replay buffer at the end of training, for both DQN and DQN-decor. We then sort the 512 features according to their variances, which are simply the diagonal entries of the feature covariance matrix.
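The sorting step can be sketched as follows; the feature-matrix shape and the top-k slicing are straightforward assumptions about the visualization.

```python
import numpy as np

def top_variance_cov(phi, k=50):
    """Covariance submatrix of the k highest-variance features, as used
    for the heatmaps: sort features by the diagonal of the covariance
    matrix, then slice out the corresponding k-by-k block."""
    centered = phi - phi.mean(axis=0)
    cov = centered.T @ centered / phi.shape[0]
    order = np.argsort(np.diag(cov))[::-1][:k]  # indices of top-k variances
    return cov[np.ix_(order, order)]
```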
A visualization of the top 50 features on the game of Seaquest is shown in Figure 7(a) (DQN) and Figure 7(b) (DQN-decor). Three observations can be made. First, the diagonal pattern is more pronounced in the feature covariance matrix of DQN-decor. Second, the magnitude of feature correlation in DQN is much larger than in DQN-decor (note that the heat intensity bars on the right of the two plots differ). Third, there are more highly activated features in DQN, reflected in the diagonal of DQN's matrix having more intense (black) values. Interestingly, there is also a similar number of equally sized squares in the off-diagonal region of DQN's matrix, indicating that these highly activated features correlate with each other and that many of them may be redundant. In contrast, the off-diagonal region of DQN-decor's matrix has very few intense values, suggesting that these important features are successfully decorrelated.
Figure 9 shows randomly sampled frames and their feature activation values (the same set of 50 features as in Figures 7(a) and 7(b)) for both algorithms. Interestingly, the features of DQN-decor (top row) have far fewer units activated at the same time for a given input image. In addition, DQN has many highly activated features at the same time, whereas for DQN-decor only a few (in these samples, only one) features are highly active. Interpreting features for DRL algorithms may thus be made easier by decorrelation; although we do not draw any conclusion on this yet, it is an interesting direction for future research.
5 Conclusion
In this paper, we found that feature correlation is a key factor in the performance of DQN algorithms. We proposed a method for obtaining decorrelated features for DQN, whose key idea is to regularize the loss function of DQN with a correlation loss computed from the mean-squared off-diagonal covariances of the features. Our decorrelation method turns out to be very effective for training DQN. We showed that it also applies to QR-DQN, improving QR-DQN significantly on most games and losing on only a few. An experimental study of the losing games showed that the losses are due to the choice of regularization factor. Our decorrelation method effectively improves DQN and QR-DQN, with better or no worse performance across most games, which makes it promising for improving representations in other DRL algorithms.
Appendix A All Learning Curves
The learning curves of DQN-decor vs. DQN on all games are shown in Figure 10.
The learning curves of QR-DQN-decor vs. QR-DQN are shown in Figure 11.
References
 Albus (1975) Albus, J. S. Data storage in the cerebellar model articulation controller (CMAC). Journal of Dynamic Systems,Measurement and Control, 97(3):228–233, 1975.
 Barth-Maron et al. (2018) Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
 Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
 Bellman (2013) Bellman, R. Dynamic programming. Courier Corporation, 2013.
 Dabney et al. (2017) Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044, 2017.
 Dabney et al. (2018) Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.
 Ghiassian et al. (2018) Ghiassian, S., Yu, H., Rafiee, B., and Sutton, R. S. Two geometric input transformation methods for fast online reinforcement learning with neural nets. CoRR, abs/1805.07476, 2018. URL http://arxiv.org/abs/1805.07476.
 Huber et al. (1964) Huber, P. J. et al. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 1964.
 Jaderberg et al. (2016) Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016. URL http://arxiv.org/abs/1611.05397.
 Jaquette (1973) Jaquette, S. C. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1973.
 Koenker & Bassett Jr (1978) Koenker, R. and Bassett Jr, G. Regression quantiles. Econometrica: Journal of the Econometric Society, pp. 33–50, 1978.
 Konidaris et al. (2011) Konidaris, G., Osentoski, S., and Thomas, P. Value function approximation in reinforcement learning using the fourier basis. In Proceedings of the TwentyFifth AAAI Conference on Artificial Intelligence, AAAI’11, pp. 380–385. AAAI Press, 2011. URL http://dl.acm.org/citation.cfm?id=2900423.2900483.
 Liu et al. (2018) Liu, V., Kumaraswamy, R., Le, L., and White, M. The utility of sparse representations for control in reinforcement learning. CoRR, abs/1811.06626, 2018. URL http://arxiv.org/abs/1811.06626.
 Machado et al. (2017) Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009, 2017.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.

 Parr et al. (2008) Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 752–759. ACM, 2008.
 Petrik (2007) Petrik, M. An analysis of laplacian methods for value function approximation in mdps. In Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI'07, pp. 2574–2579, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1625275.1625690.
 Silver (2009) Silver, D. Reinforcement Learning and SimulationBased Search in Computer Go. PhD thesis, 2009.
 Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction (2nd Edition). MIT press, 2018.
 Szepesvári (2010) Szepesvári, C. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.
 Watkins & Dayan (1992) Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 1992.