1 Temporal Difference Learning
Temporal-difference (TD) methods [Sutton1988] are an important approach in reinforcement learning as they combine ideas from dynamic programming and Monte Carlo methods. TD allows learning to occur from raw experience in the absence of a model of the environment’s dynamics, like with Monte Carlo methods, while computing estimates which bootstrap from other estimates, like with dynamic programming. This provides a way for an agent to learn online and incrementally in both long-term prediction and sequential decision-making problems.
A key view of TD learning is that it is learning testable, predictive knowledge of the environment [Sutton et al.2011]. The learned value functions represent answers to predictive questions about how a signal will accumulate over time, given a way of behaving in the environment. A TD learning agent can continually compare its predictions to the actual outcomes, and incrementally adjust its world knowledge accordingly. In control problems, this signal is the reward sequence, and the value function represents the long-term cumulative reward an agent expects to receive when behaving greedily with respect to its current predictions about this signal.
A TD learning agent’s time horizon of interest, or how far into the future it is to predict, is specified through a discount rate, $\gamma$ [Sutton and Barto2018]. This parameter adjusts the weighting given to later outcomes in the sum of a sequence over time, trading off between only considering immediate or near-term outcomes and estimating the sum of arbitrarily long sequences. From this interpretation of its purpose, along with convergence considerations, the discount rate is restricted to be $\gamma \in [0, 1]$ in episodic problems, and $\gamma \in [0, 1)$ in continuing problems.
In this paper, we investigate whether meaningful information can be learned from relaxing the range of values the discount rate can be set to. In particular, we allow it to take on complex values, and instead restrict the magnitude of the discount rate, $|\gamma|$, to fall within the aforementioned ranges.
2 One-step TD and the MDP Formalism
The sequential decision-making problem in reinforcement learning is often modeled as a Markov decision process (MDP). Under the MDP framework, an agent interacts with an environment over a sequence of discrete time steps. At each time step $t$, the agent receives information about the environment’s current state, $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states in the MDP. The agent is to use this state information to select an action, $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(s)$ is the set of possible actions in state $s$. Based on the environment’s current state and the agent’s selected action, the agent receives a reward, $R_{t+1} \in \mathbb{R}$, and gets information about the environment’s next state, $S_{t+1} \in \mathcal{S}$, according to the environment model: $p(s', r | s, a) = P(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a)$.
The agent selects actions according to a policy, $\pi(a|s)$, which gives a probability distribution across actions $a \in \mathcal{A}(s)$ for a given state $s$, and is interested in the expected discounted return:
$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} \tag{1}$$
given a discount rate $\gamma \in [0, 1]$ and $T$ equal to the final time step in an episodic task, or $\gamma \in [0, 1)$ and $T$ equal to infinity for a continuing task.
Value-based methods approach the sequential decision-making problem by computing value functions, which provide estimates of what the return will be from a particular state onwards. In prediction problems, also referred to as policy evaluation, the goal is to estimate the return under a particular policy as accurately as possible, and a state-value function is often estimated. It is defined to be the expected return when starting in state $s$ and following policy $\pi$:
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] \tag{2}$$
For control problems, the policy which maximizes the expected return is to be learned, and an action-value function from which a policy can be derived is instead estimated. It is defined to be the expected return when taking action $a$ in state $s$, and following policy $\pi$:
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \tag{3}$$
Of note, the action-value function can still be used for prediction problems, and the state-value can be computed as an expectation across action-values under the policy for a given state:
$$v_\pi(s) = \sum_{a} \pi(a|s) \, q_\pi(s, a) \tag{4}$$
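TD methods learn an approximate value function, such as $V \approx v_\pi$ for state-values, by computing an estimate of the return, $\hat{G}_t$. First, Equation 3 can be written in terms of its successor state-action pairs, also known as the Bellman equation for $q_\pi$:
$$q_\pi(s, a) = \sum_{s', r} p(s', r | s, a) \Big[ r + \gamma \sum_{a'} \pi(a'|s') \, q_\pi(s', a') \Big] \tag{5}$$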
Based on Equation 5, one-step TD methods estimate the return by taking an action in the environment according to a policy, sampling the immediate reward, and bootstrapping off of the current estimates in the value function for the remainder of the return. The difference between this TD target and the value of the previous state is then computed, and is often referred to as the TD error. The previous state’s value is then updated by taking a step proportional to the TD error with step size $\alpha$:
$$\hat{G}_t = R_{t+1} + \gamma V(S_{t+1}) \tag{6}$$
$$V(S_t) \leftarrow V(S_t) + \alpha \big[ \hat{G}_t - V(S_t) \big] \tag{7}$$
Since the rewards received depend on the actions selected, the above updates will learn the expected return under the policy that is generating the agent’s behavior, which is referred to as on-policy learning. Off-policy learning allows an agent to learn about the expected return given a policy different from the one generating its behavior. One way of achieving this is through importance sampling [Precup, Sutton, and Singh2000], where with a behavior policy $\mu$ and a target policy $\pi$, an alternative update to Equation 7 is:
$$V(S_t) \leftarrow V(S_t) + \alpha \rho_t \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big], \qquad \rho_t = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \tag{8}$$
This strictly generalizes the on-policy case, as the importance sampling ratio $\rho_t$ is $1$ when the two policies are identical.
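The one-step update described above can be sketched as follows. This is a minimal illustration assuming tabular state-values; the function name `td0_update`, the argument `rho` for the importance sampling ratio, and the numbers in the usage lines are chosen here for illustration and are not taken from the paper. Leaving `rho` at its default of 1 gives the on-policy update.

```python
def td0_update(V, s, r, s_next, alpha, gamma, rho=1.0, terminal=False):
    """Apply one TD update to the tabular state-value estimate V, in place.

    rho is the importance sampling ratio (target policy probability divided
    by behavior policy probability for the action taken). gamma may be a
    real or a complex number. Returns the TD error.
    """
    bootstrap = 0.0 if terminal else V[s_next]
    td_target = r + gamma * bootstrap   # sampled reward plus bootstrapped estimate
    td_error = td_target - V[s]
    V[s] = V[s] + alpha * rho * td_error
    return td_error


# Usage with hypothetical numbers: two states, one transition from state 0 to 1.
V = [0.0, 2.0]
delta = td0_update(V, s=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

Because Python's arithmetic operators accept complex numbers, the same function can be called with a complex `gamma` without modification.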
3 Complex Discounting
The discount rate has an interpretation of specifying the horizon of interest for the return, trading off between focusing on immediate rewards and considering the sum of longer sequences of rewards. It can also be interpreted as a soft termination of the return [Sutton et al.2011, Sutton1995, Modayil, White, and Sutton2014], where an agent includes the next reward with probability $\gamma$, and terminates with probability $1 - \gamma$, receiving a terminal reward of $0$. From these interpretations, it is intuitive for the discount rate to fall in the range of $\gamma \in [0, 1)$ with the exception of episodic problems, where $\gamma$ can be equal to 1.
With considerations for convergence, assuming the rewards are bounded, restricting the discount rate to be in this range makes the infinite sum (in the continuing case) of Equation 1 finite. However, this sum will remain finite when the magnitude of the discount rate is restricted to be $|\gamma| < 1$, allowing for the use of negative discount rates with $\gamma > -1$, as well as complex discount rates within the complex unit circle.
While the use of alternative discount rates may result in some corresponding value function, a question arises regarding whether these values are meaningful, or if there is any situation in which an agent would benefit from this knowledge. First, we consider the implications of exponentiating a complex discount rate. We look at the exponential form of a complex number with unit magnitude, and note that it can be expressed as a sum of sinusoids by Euler’s Formula:
$$e^{i\omega} = \cos(\omega) + i \sin(\omega) \tag{9}$$
From this, it is evident that exponentiating a complex number to the power of $k$ corresponds to taking $k$ steps around the complex unit circle with an angle of $\omega$:
$$(e^{i\omega})^k = e^{i\omega k} = \cos(\omega k) + i \sin(\omega k) \tag{10}$$
Using the above as a discount rate, assuming an episodic setting as it has a magnitude of $1$, we would get the following return for some angle $\omega$:
$$G_t = \sum_{k=0}^{T-t-1} e^{i\omega k} R_{t+k+1} = \sum_{k=0}^{T-t-1} \cos(\omega k) R_{t+k+1} + i \sum_{k=0}^{T-t-1} \sin(\omega k) R_{t+k+1} \tag{11}$$
Instead of weighting the reward sequence in a way that decays the importance of future rewards, complex discount rates weight the sequence with two sinusoids, one along the real axis and one along the imaginary axis. This can be interpreted as checking the cross correlation between a reward sequence and a sinusoid oscillating with a frequency of $\omega$ radians per time step, and effectively allows a TD learning agent to identify periodicity in the reward sequence at specified frequencies online and incrementally.
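To make the cross-correlation reading concrete, the sketch below evaluates a complex-discounted return on a hypothetical reward sequence that alternates between 1 and -1. The helper `complex_return` and the test sequence are illustrative assumptions, not the paper's code. For this sequence the magnitude of the return works out to 20 at an angle of $\pi$ (half a cycle per step) and to zero, up to floating-point error, at angles of $\pi/2$ and $0$.

```python
import cmath
import math


def complex_return(rewards, omega, magnitude=1.0):
    """Return sum_k (magnitude * e^{i*omega})^k * rewards[k]."""
    gamma = cmath.rect(magnitude, omega)
    g = 0j
    # Work backwards through the sequence: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequence with a period of two time steps.
rewards = [1.0, -1.0] * 10

at_pi = complex_return(rewards, math.pi)           # matches the alternation
at_half_pi = complex_return(rewards, math.pi / 2)  # a frequency not present
at_zero = complex_return(rewards, 0.0)             # ordinary undiscounted sum

print(abs(at_pi), abs(at_half_pi), abs(at_zero))
```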
4 The Discrete Fourier Transform
The ability to identify periodicity in the reward sequence by weighting it with exponentiated complex numbers can be viewed as performing the Discrete Fourier Transform (DFT) from digital signal processing literature [Brigham1988]. The DFT is defined as follows:
$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k n / N} \tag{12}$$
where $N$ is the length of the sequence, and $k$ is set to each whole number less than $N$. This can be viewed as testing whether a frequency of $2\pi k / N$ exists in the sequence, for equally spaced values of $k$. If a frequency of $2\pi k / N$ exists, this sum will tend to have a larger magnitude; if no such frequency exists, the terms in the sum will tend to cancel out, and the sum will hover around zero. Acknowledging that $k$ is less than $N$, the term $2\pi k / N$ is in the range $[0, 2\pi)$, and can be rewritten where the frequency is specified directly:
$$X_\omega = \sum_{n=0}^{N-1} x_n e^{-i \omega n} \tag{13}$$
where $\omega \in [0, 2\pi)$. This is exactly equivalent to the DFT when the length of the sequence is known and particular frequencies are chosen, but has similar functionality and interpretation for other sets of frequencies.
The DFT corresponds to the discrete form of the coefficients of a Fourier series, and thus each complex number $X_k$ encodes the amplitude and phase of a particular sinusoidal component of a sequence. Specifically, the normalized magnitude, $|X_k| / N$, corresponds to the amplitude, and the angle between the imaginary and real components, $\phi_k = \operatorname{atan2}(\operatorname{Im}(X_k), \operatorname{Re}(X_k))$, gives the phase. The DFT is also an invertible, linear transformation [Brigham1988]. With knowledge of the length of the sequence, $N$, and the sampling frequency (in Hz), denoted $f_s$, one way of reconstructing the original sequence is by computing this sum of sinusoids:
$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} |X_k| \cos\left( 2\pi \frac{k f_s}{N} \cdot \frac{n}{f_s} + \phi_k \right) \tag{14}$$
In the context of a TD learning agent with a complex discount rate, the learned approximate values can be seen as computing the expected DFT of the reward sequence from a specified state onwards, and allows for extraction of the corresponding amplitude and phase information. However, the expected length of the sequence is typically not known by the agent, resulting in unnormalized amplitude information.
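The relationship between a complex-discounted sum and the DFT can be checked numerically. The sketch below uses only the Python standard library, with a naive DFT and a hypothetical eight-step reward sequence; the function names are illustrative assumptions. It relies on the standard sign convention, under which a discount rate of $e^{-i 2\pi k / N}$ reproduces the $k$-th DFT coefficient exactly, while $e^{+i 2\pi k / N}$ gives its complex conjugate for a real-valued sequence (same magnitude, negated phase).

```python
import cmath
import math


def dft(x):
    """Naive DFT: X_k = sum_n x[n] * exp(-2*pi*i*k*n / N)."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def discounted_sum(x, gamma):
    """Sum of gamma^n * x[n], i.e. a return with discount rate gamma."""
    return sum(gamma ** n * x[n] for n in range(len(x)))


def reconstruct(coeffs):
    """Rebuild a real sequence as a sum of cosines from DFT coefficients."""
    n_samples = len(coeffs)
    return [
        sum(abs(c) * math.cos(2 * math.pi * k * n / n_samples + cmath.phase(c))
            for k, c in enumerate(coeffs)) / n_samples
        for n in range(n_samples)
    ]


# Hypothetical rewards: alternating +1/-1, then a larger terminal reward.
rewards = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 11.0]
coeffs = dft(rewards)
rebuilt = reconstruct(coeffs)

# Normalized magnitudes (amplitudes) require the sequence length.
amplitudes = [abs(c) / len(rewards) for c in coeffs]

# A discount rate of e^{-2*pi*i*3/8} yields the k = 3 coefficient.
as_return = discounted_sum(rewards, cmath.exp(-2j * math.pi * 3 / 8))
```

Here `rebuilt` matches `rewards` up to floating-point error because the sequence length is known exactly. That is the condition the agent typically lacks when episode lengths vary.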
5 Revisiting Continuing Problems
The DFT is computed with complex numbers that have a magnitude of $1$, and a discount rate of $e^{i\omega}$ (corresponding to $|\gamma| = 1$) would only work in the episodic setting. To see the effects of using a complex discount rate with a magnitude less than $1$, we introduce an amplitude parameter, $|\gamma|$, to the exponential form of a complex number:
$$\gamma = |\gamma| e^{i\omega} \tag{15}$$
which results in a complex number with a magnitude of $|\gamma|$. Substituting this in the summation in Equation 1 gives:
$$G_t = \sum_{k=0}^{T-t-1} \big( |\gamma| e^{i\omega} \big)^k R_{t+k+1} = \sum_{k=0}^{T-t-1} e^{i\omega k} \big( |\gamma|^k R_{t+k+1} \big) \tag{16}$$
which can be seen as computing the DFT of a real discounted return with a discount rate of $|\gamma|$. That is, a TD learning agent would still learn an expected DFT, but of the reward sequence over an exponentially decaying window determined by $|\gamma|$. While discounting can distort the signal, it primarily affects the low frequencies which are unable to complete an oscillation within a discount rate’s effective horizon.
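As a small numerical illustration of the decaying-window view, the sketch below checks that discounting by a complex number of magnitude 0.9 equals weighting the exponentially decayed rewards by unit-magnitude complex numbers. It then scans a coarse grid of normalized frequencies up to 0.5. The magnitude of 0.9, the cosine reward with a period of five steps, the truncation at 400 steps, and the frequency grid are all assumptions made for this example.

```python
import cmath
import math


def complex_return(rewards, gamma):
    """Discounted sum of rewards, computed backwards: G = R + gamma * G'."""
    g = 0j
    for r in reversed(rewards):
        g = r + gamma * g
    return g


MAGNITUDE = 0.9  # hypothetical discount rate magnitude
# Hypothetical long reward stream with a period of five time steps.
rewards = [math.cos(2 * math.pi * k / 5) for k in range(400)]

# Identity check at one angle: (MAGNITUDE * e^{i*omega})^k * r_k
# equals e^{i*omega*k} * (MAGNITUDE^k * r_k), term by term.
omega = 2 * math.pi * 0.2
direct = complex_return(rewards, cmath.rect(MAGNITUDE, omega))
windowed = sum(cmath.exp(1j * omega * k) * (MAGNITUDE ** k * r)
               for k, r in enumerate(rewards))

# Scan normalized frequencies from 0 up to 0.5 (half the sampling frequency).
freqs = [j / 20 for j in range(11)]
magnitudes = [abs(complex_return(rewards, cmath.rect(MAGNITUDE, 2 * math.pi * f)))
              for f in freqs]
peak = freqs[magnitudes.index(max(magnitudes))]
```

On this grid the largest magnitude falls at a normalized frequency of 0.2, one cycle every five steps, so the periodicity remains detectable despite the decaying window.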
6 Existence and Uniqueness of Value Function
Perhaps surprisingly, the value function is well-defined for complex-valued discounting when $|\gamma| < 1$, similar to the more familiar case with a real-valued discount factor.
We first illustrate this in the continuing setting (i.e., the MDP has no terminal states). Consider the aperiodic, irreducible Markov chain with state transition matrix $P$, with entries $P_{ij} = P(S_{t+1} = j | S_t = i)$, and expected reward vector $r$ with entries $r_i = \mathbb{E}[R_{t+1} | S_t = i]$. Such a transition matrix has eigenvalues $|\lambda_i| \le 1$ [Rosenthal1995]. The vector of expected returns after $n$ transitions is:
$$v_n = \sum_{k=0}^{n-1} (\gamma P)^k r \tag{17}$$
where $[v_n]_i$ represents the expected return conditioned on starting in state $i$.
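For the case where $|\gamma| < 1$, we have that $\|\gamma P\| = |\gamma| < 1$ in the maximum-row-sum norm (the rows of $P$ sum to one), and that the partial sums of the matrix series satisfy (for $m > n$):
$$\Big\| \sum_{k=0}^{m} (\gamma P)^k - \sum_{k=0}^{n} (\gamma P)^k \Big\| = \Big\| \sum_{k=n+1}^{m} (\gamma P)^k \Big\| \le \sum_{k=n+1}^{m} |\gamma|^k \le \frac{|\gamma|^{n+1}}{1 - |\gamma|} \tag{18}$$
Thus the partial sums become arbitrarily close together as $m$ and $n$ grow larger. Stated more formally, the sequence of partial sums $\sum_{k=0}^{n} (\gamma P)^k$ is a Cauchy sequence, and therefore convergent. We can then observe that:
$$(I - \gamma P) \sum_{k=0}^{n} (\gamma P)^k = I - (\gamma P)^{n+1}$$
Taking the limit, we note
$$\lim_{n \to \infty} (\gamma P)^{n+1} = 0$$
The preceding facts come together to show that the limit of the matrix series exists and is equal to $(I - \gamma P)^{-1}$. Thus we have:
$$v = \sum_{k=0}^{\infty} (\gamma P)^k r = (I - \gamma P)^{-1} r$$
as in the usual setting with $\gamma \in [0, 1)$.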
For the episodic setting (i.e., where $P$ contains some absorbing states) we note that we have $\lim_{k \to \infty} P^k = 0$ over the non-absorbing states, assuming that $P$ is aperiodic and indecomposable. This implies that the previous argument for the convergence of the matrix series (in 18) holds, and that $v = (I - \gamma P)^{-1} r$ as before.
Therefore, the value function is well-defined for complex discount factors with $|\gamma| < 1$ (or $|\gamma| \le 1$ for the episodic setting), in the sense that it exists and is unique.
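The convergence argument can be checked numerically on a small example. The sketch below assumes a hypothetical two-state Markov chain, reward vector, and complex discount rate of magnitude 0.9, none of which come from the paper. It compares a long partial sum of the matrix series against the solution of $(I - \gamma P) v = r$ obtained by Cramer's rule.

```python
import cmath
import math


def series_value(P, r, gamma, n_terms):
    """Partial sum v_n = sum_{k<n} (gamma P)^k r, built as v <- r + gamma P v."""
    n = len(r)
    v = [0j] * n
    for _ in range(n_terms):
        v = [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n))
             for i in range(n)]
    return v


def closed_form_two_state(P, r, gamma):
    """Solve (I - gamma P) v = r for a two-state chain by Cramer's rule."""
    a, b = 1 - gamma * P[0][0], -gamma * P[0][1]
    c, d = -gamma * P[1][0], 1 - gamma * P[1][1]
    det = a * d - b * c
    return [(r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


# Hypothetical chain: each row of P is a probability distribution.
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, -1.0]
gamma = cmath.rect(0.9, math.pi / 3)  # magnitude 0.9, angle pi/3

v_series = series_value(P, r, gamma, n_terms=500)
v_exact = closed_form_two_state(P, r, gamma)
```

With 500 terms the neglected tail is on the order of $0.9^{500}$, so the two results agree to within floating-point precision.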
7 Experiments
In this section, we detail several experiments involving TD learning agents using complex-valued discounting.
7.1 Checkered Grid World
The checkered grid world environment consists of a $5 \times 5$ grid of states with terminal states in the top-left and bottom-right corners. The actions consist of deterministic 4-directional movement, and moving off of the grid transitions the agent to the same state. The agent starts in the center, and the board is colored with a checkered pattern with colors representing the reward distribution. Transitioning into a white cell results in a reward of 1, transitioning into a gray cell results in a reward of -1, and transitioning to a terminal state ends the episode with a reward of 11. A diagram of the environment can be seen in Figure 4. This layout introduces an alternating pattern of 1 and -1 in the reward sequence. Given the interpretation of complex discounting as computing the DFT, we would like to see whether an agent using complex discount rates can pick up on this periodic pattern. We would also like to qualitatively assess how well the expected reward sequence can be reconstructed through Equation 14 (given knowledge of the expected sequence length).
This environment was treated as an on-policy policy evaluation task with no discounting ($|\gamma| = 1$). The agent behaved under an equiprobable-random behavior policy, which results in an expected episode length of 37.33 steps. Because the reconstruction of the reward sequence requires an integer sequence length, we round this up to 38 steps. The agent learned 114 (a multiple of 38) value functions in parallel corresponding to equally spaced normalized frequencies in the range $[0, 1)$. Action-values were learned using the Expected Sarsa algorithm [van Seijen et al.2009], and state-values were computed from the learned action-values through Equation 4.
We performed 100 runs of 250 episodes, and the value of the starting state, represented by the complex number’s magnitude and phase information, was plotted for each frequency after the 250th episode. The resulting learned DFT of the starting state can be seen in Figure 2. Of note, the specified frequencies are normalized by the agent’s sampling frequency (in Hz). Under the assumption that the agent is sampling at 1 Hz, a frequency of $1$ corresponds to one sample per time step. Also, when computing the DFT of a real-valued signal, it will be symmetric about half of the sampling frequency [Brigham1988]. This “folding” frequency is referred to as the Nyquist frequency, which acts as a limit for the largest detectable frequency. Frequencies larger than this would be under-sampled and subject to aliasing.
In the learned DFT, the magnitude of the value at a frequency of $0$ corresponds to the expected return with a discount rate of $\gamma = 1$. That is, it is what a standard TD learning agent with a non-oscillatory discount rate would have learned. The magnitudes of the values at other frequencies are interpreted as a measure of confidence in a particular frequency existing in the reward sequence, as exact amplitude information would require normalization by sequence length. It can be seen that there is relatively large magnitude at the frequency $0.5$, which corresponds to half of the agent’s sampling frequency. If the agent is sampling at a rate of 1 Hz, or 1 sample per time step, this means that it has large confidence in an oscillation at a rate of half a cycle per time step. This corresponds to the rewards alternating between 1 and -1 in the environment, as this pattern takes two time steps to complete a cycle.
Next, we try to reconstruct the expected reward sequence by computing a sum of sinusoids. Using a sequence length of 38, we use the learned complex values corresponding to 38 equally spaced frequencies in $[0, 1)$, and evaluate Equation 14 up to the 38th time step. The resulting reconstructed reward sequence can be seen in Figure 3.
One might intuitively expect the reward sequence to consist of an alternating sequence of 1 and -1, ending with an 11. Qualitatively, the reconstructed signal does not fit this intuition, but still captures several aspects of the structure of the return. For example, the apparent oscillations in the sequence are at 0.5 Hz, and they begin at an approximate amplitude of 1. The oscillations also have a positive mean, corresponding to the large positive reward upon termination. Also, if we compute the sum of the reconstructed sequence, we get the learned value of the starting state for a frequency of $0$ (the standard undiscounted return). There are several reasons why the reconstruction would not completely match the aforementioned intuition. One reason is due to cases where the agent tries to move off of the grid. Doing so transitions the agent to the same state, which may break or shift the periodic pattern each time this occurs. The earliest an agent can bump into a wall is in 3 steps, which is approximately where the exponential decay begins in the reconstructed sequence. Another reason is that the expected return consists of averaging sequences from varying episode lengths, and this is an attempt at reconstructing a sequence over a fixed length (which is rounded up from the expected episode length). This would shift where in the sequence the terminal reward appears, and end up distributing it as the mean of the oscillation.
7.2 Wavy Ring World
The previous experiment was done in an undiscounted, episodic, tabular setting. To see whether we can achieve similar results in a continuing setting with function approximation, we designed the wavy ring world environment. This environment consists of 20 states arranged in a ring. Each state has a single action which moves the agent to the next state in a fixed direction along the ring. We used tile coding [Sutton1996] to produce, for each state, a binary feature vector to be used with linear function approximation. Specifically, the 20 states were covered by 6 overlapping tilings where each tile spanned one third of the 20 states. This resulted in 6 active features for a given state, and relatively broad generalization between states. The reward for leaving a state $s$ consisted of the sum of four state-dependent sinusoids with periods of 2, 4, 5, and 10 states:
A TD agent learned a set of complex-valued weights for each of 64 equally spaced frequencies in the range , and the magnitude of each discount rate was . As there is no stochasticity in the transitions, we performed 1 run of 15,000 steps, with the agent starting in state 0. We extracted the state values from the learned weights, and the resulting DFT of the return from state 0 can be seen in Figure 5.
In the learned DFT, we can see that despite the lower discount rate magnitude, and the use of function approximation, it still has relatively large peaks at various frequencies in the magnitude plot. Looking at the frequencies at which these peaks occur (up until the Nyquist frequency), they correspond to the frequencies of the reward function in Equation 20.
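A simplified version of this kind of experiment can be sketched with the standard library alone. The code below is not the paper's setup: it assumes a tabular representation instead of tile coding, a hypothetical reward made of four unit-amplitude cosines with periods of 2, 4, 5, and 10 states, a step size of 0.5, and a discount magnitude of 0.9. Because the ring is deterministic, the learned value of state 0 can be compared against a closed-form geometric sum.

```python
import cmath
import math

N_STATES = 20
PERIODS = (2, 4, 5, 10)

# Assumed reward for leaving each state: a sum of unit-amplitude cosines.
# The paper's exact reward function may differ.
REWARDS = [sum(math.cos(2 * math.pi * s / p) for p in PERIODS)
           for s in range(N_STATES)]


def learn_values(gamma, n_steps=15000, alpha=0.5):
    """Tabular one-step TD with a complex discount rate on the ring."""
    V = [0j] * N_STATES
    s = 0
    for _ in range(n_steps):
        s_next = (s + 1) % N_STATES
        V[s] += alpha * (REWARDS[s] + gamma * V[s_next] - V[s])
        s = s_next
    return V


def exact_value(gamma, s=0):
    """Closed form: one lap of discounted rewards divided by 1 - gamma^N."""
    lap = sum(gamma ** k * REWARDS[(s + k) % N_STATES] for k in range(N_STATES))
    return lap / (1 - gamma ** N_STATES)


# Learned magnitude of state 0 at normalized frequencies from 0 to 0.5.
freqs = [j / 20 for j in range(11)]
magnitudes = [abs(learn_values(cmath.rect(0.9, 2 * math.pi * f))[0])
              for f in freqs]
```

Under these assumptions, `magnitudes` can be inspected for larger values near the normalized frequencies 0.1, 0.2, 0.25, and 0.5, which correspond to periods of 10, 5, 4, and 2 states.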
From our experiments, we showed that a TD agent using complex discount rates can identify periodic patterns in the return. This is due to complex discount rates being closely related to the DFT, which a TD learning agent can be seen as incrementally estimating. We also showcased a simple way of inverting the DFT, using knowledge of the sequence length, in an attempt to reconstruct the original reward sequence. This reconstructed reward sequence contained several features pertaining to the structure of the reward sequence: an oscillation at a particular frequency and amplitude, a positive average reward, and a sum equal to the standard expected return.
Our experiments focused on a case where the periodicity came from the environment. This may have implications for reinforcement learning approaches for problems with sound or image data, as the DFT is typically used as an offline post-processing tool in those applications. In general, this approach would identify policy-contingent frequency information, as the expected return is computed under a particular policy. One could imagine an agent behaving under a policy which led it in circles. This would induce similar alternating behavior in the experienced reward sequence without this explicit structure in the environment. An example of an application involving policies containing cyclic behavior is robot gait training. If the rewards are set to be a robot’s joint position, it would allow the robot to be aware of periodicity in tasks involving repetitive motion, such as walking. Such awareness of periodicity also has implications in the options framework [Sutton, Precup, and Singh1999], as it may offer insight regarding where an option should terminate. It may also have use in exploration, where if state features are used as rewards, an agent actively avoiding periodicity might lead it to seek out novel states.
With the ability to invert the DFT and roughly reconstruct the expected reward sequence (given a sequence length), an agent would have access to information regarding the structure of the sequence. This may be able to inform decisions based on properties like reward sparsity, or noise in the reward signal. Reconstructing the sequence can be seen as recovering the information lost from computing the sum of the rewards, which is different from, but comparable to, distributional reinforcement learning [Bellemare, Dabney, and Munos2017, Dabney et al.2017], which recovers the information lost from computing the expectation of this sum.
There has been prior work on using a Fourier basis as a representation for reinforcement learning problems [Konidaris, Osentoski, and Thomas2011]. Using the learned value functions as a state representation, complex discounting may allow for incrementally estimating a similar representation. Also, in the deep reinforcement learning setting, learning about many frequencies in parallel may have the representation learning benefits of predicting many auxiliary tasks at once [Jaderberg et al.2016].
In this paper, we showed that meaningful information can be learned by allowing the discount rate in TD learning to take on complex numbers. The learned complex value functions can be interpreted as incremental estimation of the DFT of a signal of interest. From this DFT interpretation, a complex discount rate corresponds to a particular frequency, and the magnitude of the learned complex value represents an agent’s confidence in the frequency being present in the reward sequence. By learning several complex value functions in parallel, in both a tabular setting and one with function approximation, we showed that a TD learning agent was able to pick up on periodic structure in the reward sequence.
We also showed that information regarding the structure of the reward sequence is encoded in the resulting DFT. Because the DFT is invertible (with knowledge of the sequence length), we showed that an expected reward sequence can be reconstructed from the learned DFT. The resulting reconstructed sequence had qualitative properties that seemed reasonable for the given environment. It may be possible to infer the structure of the return from the phase information directly (without having to invert the DFT), but we leave that as an avenue for future research.
The authors thank Roshan Shariff for insights and discussions contributing to the results presented in this paper, and the entire Reinforcement Learning and Artificial Intelligence research group for providing the environment to nurture and support this research. We gratefully acknowledge funding from Alberta Innovates – Technology Futures, Google DeepMind, and from the Natural Sciences and Engineering Research Council of Canada.
- Bellemare, Dabney, and Munos2017. Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. In ICML, volume 70 of Proceedings of Machine Learning Research, 449–458. PMLR.
- Brigham1988. Brigham, E. O. 1988. The Fast Fourier Transform and Its Applications. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
- Dabney et al.2017. Dabney, W.; Rowland, M.; Bellemare, M. G.; and Munos, R. 2017. Distributional reinforcement learning with quantile regression. CoRR abs/1710.10044.
- Jaderberg et al.2016. Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.
- Konidaris, Osentoski, and Thomas2011. Konidaris, G. D.; Osentoski, S.; and Thomas, P. S. 2011. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 380–385.
- Modayil, White, and Sutton2014. Modayil, J.; White, A.; and Sutton, R. S. 2014. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behaviour 22(2):146–160.
- Precup, Sutton, and Singh2000. Precup, D.; Sutton, R. S.; and Singh, S. P. 2000. Eligibility traces for off-policy policy evaluation. In Kaufman, M., ed., Proceedings of the 17th International Conference on Machine Learning, 759–766.
- Rosenthal1995. Rosenthal, J. S. 1995. Convergence rates for Markov chains. Siam Review 37(3):387–405.
- Sutton and Barto2018. Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. 2nd edition. Manuscript in preparation.
- Sutton et al.2011. Sutton, R. S.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P. M.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In AAMAS, 761–768. IFAAMAS.
- Sutton, Precup, and Singh1999. Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.
- Sutton1988. Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine learning 3(1):9–44.
- Sutton1995. Sutton, R. S. 1995. TD model: Modeling the world at a mixture of time scales. Technical report, Amherst, MA, USA.
- Sutton1996. Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, 1038–1044. MIT Press.
- van Seijen et al.2009. van Seijen, H.; van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of Expected Sarsa. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 177–184.