1 Introduction & Related Work
We introduce a novel method for discovering disentangled structure in the reinforcement learning (RL) setting by learning a decomposition of the agent’s reward function, such that the pursuit of one decomposed reward does not result in the collection of another, i.e., the individual rewards in the decomposition are independently obtainable. In the following sections we briefly discuss related work, detail our method, and empirically and theoretically explore the resulting structure learned from the reward decomposition.
When viewed through the lens of RL, methods for learning disentangled representations can be classified according to how they utilize their learned representations.
Some methods seek robust, interpretable disentangled features (Achille et al., 2018; Mathieu et al., 2016; Yingzhen & Mandt, 2018; Tran et al., 2017). For example, Higgins et al. (2016) do so by creating an “information bottleneck” (Tishby & Zaslavsky, 2015) that pressures the latent representation to be unit Gaussian. Chen et al. (2016) accomplish a similar goal by maximizing the mutual information between components of their latent representation and other independent random variables. Methods such as these have been leveraged in RL to decompose the state of the environment. In particular, Laversanne-Finot et al. (2018) have applied β-VAE to learn disentangled, “modular” representations of the environment state for use in many-goals RL (Kaelbling, 1993).
However, in Laversanne-Finot et al. (2018), the decomposition of environment state and the learning of corresponding control policies are treated as separate processes. Other work explores jointly learning state decompositions and corresponding policies. Thomas et al. (2017) define an alternative notion of disentanglement, “independent controllability,” which pairs components of the learned state representation with control policies and measures the degree to which each policy can control its corresponding component independently of the other components.
While Laversanne-Finot et al. (2018) and Thomas et al. (2017) each leverage some notion of disentanglement to address RL problems, their methods do not take into account the reward function. This motivates our exploration into directly decomposing the reward function of the environment (of course, rewards are functions of state and have corresponding policies and hence decomposing rewards also implicitly decomposes states as well as policies).
In addition to those that decompose parts of the environment, many methods exist that exploit existing decompositions. Guestrin et al. (2002) and Kok & Vlassis (2006) rely on existing state factorizations to efficiently coordinate agents in the multi-agent learning setting (Hu et al., 1998); Van Seijen et al. (2017) and Russell & Zimdars (2003) propose methods of learning given an existing reward factorization. Such works are complementary to our own.
2 RL Framework & Reward Decomposition
We consider RL problems formulated as Markov Decision Processes (MDPs; Sutton et al. 1998), which we represent here by tuples $(S, A, r, T, \gamma)$, where $S$ is a set of states that the environment can be in, $A$ is a set of actions that an agent operating in the environment can perform at any time, $r : S \to \mathbb{R}$ is a function mapping environment states to their corresponding rewards, $T(s' \mid s, a)$ represents the probability of transitioning between two particular states when an action is taken, and $\gamma \in [0, 1)$ is a discount factor that makes future reward less valuable than more immediate reward.
The goal of an RL agent is to act so as to maximize the expected discounted reward, or value, over an infinite horizon, defined as follows:
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t) \,\middle|\, s_0 = s, \pi\right]$$
where the expectation is over trajectories generated by starting at state $s$ in the environment and behaving according to the policy $\pi$ (a policy maps states to actions or, more generally, to distributions over actions). RL methods learn optimal policies defined as follows:
$$\pi^{*} \in \arg\max_{\pi} V^{\pi}(s) \quad \text{for all } s \in S.$$
The central aim of our method is to learn an additive factorization (henceforth, decomposition) of the reward function of the following form:
$$r(s) = \sum_{i=1}^{n} r_i(s)$$
where each $r_i$ can be thought of as a type of sub-reward (below we add constraints that make these sub-rewards independently obtainable).
Given such a decomposition, we get a Factored Markov Decision Process (fMDP; Degris & Sigaud 2013) as follows: $(S, A, \{r_i\}_{i=1}^{n}, T, \gamma)$. Value functions and policies with respect to pursuing individual reward functions in the fMDP can be defined as follows:
$$V_i^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_i(s_t) \,\middle|\, s_0 = s, \pi\right], \qquad \pi_i^{*} \in \arg\max_{\pi} V_i^{\pi}(s) \quad \text{for all } s \in S.$$
Note that throughout we will use $V_i^{\pi}$ to denote value functions for the learned rewards $r_i$ and $V^{\pi}$ for the value functions for the environment reward $r$.
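Because the decomposition is additive, the value of any fixed policy under the environment reward equals the sum of its values under the sub-rewards. The following minimal Python sketch checks this on a truncated trajectory. The five-state corridor, the reward placement, the visited path and the helper name `discounted_return` are all illustrative assumptions rather than part of the method.

```python
GAMMA = 0.9  # discount factor (illustrative value)


def discounted_return(rewards_along_path, gamma=GAMMA):
    """Sum of gamma**t * r_t over a finite (truncated) trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards_along_path))


# Hypothetical corridor with states 0..4; environment reward sits at both ends.
env_reward = {0: 1.0, 4: 1.0}
# One possible additive decomposition: one sub-reward per end of the corridor.
sub_rewards = [{0: 1.0}, {4: 1.0}]

# States visited by some fixed policy that walks left from state 2 and stays there.
path = [2, 1, 0, 0, 0]

# Return under the environment reward and under each sub-reward on the same path.
v_env = discounted_return([env_reward.get(s, 0.0) for s in path])
v_subs = [discounted_return([r.get(s, 0.0) for s in path]) for r in sub_rewards]
```

Here `v_env` and `sum(v_subs)` agree because the sub-rewards sum to the environment reward at every state; the same linearity holds for the infinite-horizon expectations.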
Motivating and Defining Disentangled Reward Decompositions
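Notice that for any MDP there exists an endless array of different decompositions, many of them “uninteresting”. For example, writing $r$ for the environment reward and $r_1, \dots, r_n$ for the learned rewards, the following two decompositions:
$$r_1(s) = r(s), \; r_i(s) = 0 \ \text{for } i > 1 \qquad \text{and} \qquad r_i(s) = \frac{r(s)}{n} \ \text{for all } i$$
are valid in that they sum to the environment reward function, but uninteresting in that they encode no additional information about the environment not present in the unfactored reward function.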
In this paper, we propose that one approach to encouraging “interesting” reward decompositions is for them to satisfy the following two desiderata as best possible:
Each reward should be obtainable independently of the other rewards (i.e., the policy that optimally obtains $r_i$ should not obtain $r_j$ for any $j \neq i$).
Each reward should be non-trivial (an example of a trivial reward is one of the form $r_i(s) = 0$ for all $s$).
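We can roughly codify these properties using the factor-specific value functions defined in the following equations:
$$J_{\text{independent}}(r_1, \dots, r_n) = \mathbb{E}_{s \sim \mu}\Big[\sum_{i \neq j} \alpha_{i,j}(s)\, V_i^{\pi_j^{*}}(s)\Big] \tag{3}$$
$$J_{\text{nontrivial}}(r_1, \dots, r_n) = \mathbb{E}_{s \sim \mu}\Big[\sum_{i} \alpha_{i,i}(s)\, V_i^{\pi_i^{*}}(s)\Big] \tag{4}$$
where $V_i^{\pi_j^{*}}(s)$ is the value, measured in the learned reward $r_i$, of following the policy $\pi_j^{*}$ that is optimal for $r_j$, $\mu$ is some distribution with support on $S$, and $\alpha_{i,j}$ are functions that can be used to control the weighting of different value function terms in $J_{\text{independent}}$ and $J_{\text{nontrivial}}$. For ease of exposition we set $\alpha_{i,j}(s) = 1$ for all $i, j$ and $s$ in subsequent equations (for our choice of $\alpha_{i,j}$ in our experiments, see Section 3.1).
Intuitively, the above two desiderata make it possible to view each reward function as a different “resource” that an agent can collect in the world. $V_i^{\pi_j^{*}}(s)$ then encodes the degree to which the $i$’th resource (reward) is collected when the agent is attempting to collect the $j$’th resource (reward). In line with our first desideratum, we should expect “interesting” reward decompositions to have small values of $J_{\text{independent}}$. Similarly, large values of $J_{\text{nontrivial}}$ should ensure that each of the factors encodes something about the environment, i.e., that they are non-trivial under the second desideratum.
Capturing the two desiderata, we define one reward decomposition as more disentangled than another if it has a larger value of
$$J(r_1, \dots, r_n) = J_{\text{nontrivial}}(r_1, \dots, r_n) - J_{\text{independent}}(r_1, \dots, r_n).$$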
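To make the objective concrete, the sketch below scores three decompositions of the same reward on a small tabular problem. It assumes the score has the form "sum of own-policy values minus sum of cross-policy values", with unit weights and a uniform start-state distribution. The five-state corridor and all function names are hypothetical, and exact value iteration stands in for the learned policies used in the paper.

```python
GAMMA = 0.9
N_STATES = 5
ACTIONS = (-1, +1)  # move left, move right


def step(state, action):
    """Deterministic corridor dynamics with walls at both ends."""
    return min(max(state + action, 0), N_STATES - 1)


def optimal_policy(reward, sweeps=500):
    """Greedy policy from value iteration on V(s) = r(s) + gamma * max_a V(s')."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [reward[s] + GAMMA * max(v[step(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    # Ties are broken toward the first action (left); any optimal policy would do.
    return [max(ACTIONS, key=lambda a, s=s: v[step(s, a)]) for s in range(N_STATES)]


def evaluate(policy, reward, sweeps=500):
    """Value of a fixed policy under a given reward function."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [reward[s] + GAMMA * v[step(s, policy[s])] for s in range(N_STATES)]
    return v


def disentanglement_score(rewards):
    """Own-policy values minus cross-policy values, averaged over start states."""
    policies = [optimal_policy(r) for r in rewards]
    nontrivial = independent = 0.0
    for i, reward in enumerate(rewards):
        for j, policy in enumerate(policies):
            mean_value = sum(evaluate(policy, reward)) / N_STATES
            if i == j:
                nontrivial += mean_value   # reward i collected by its own policy
            else:
                independent += mean_value  # reward i collected by another policy
    return nontrivial - independent


env = [1.0, 0.0, 0.0, 0.0, 1.0]  # environment reward at both ends of the corridor
j_split = disentanglement_score([[1.0, 0, 0, 0, 0], [0, 0, 0, 0, 1.0]])
j_trivial = disentanglement_score([env, [0.0] * N_STATES])
j_even = disentanglement_score([[x / 2 for x in env], [x / 2 for x in env]])
```

In this toy setting the split that assigns each end of the corridor to its own reward scores highest, the decomposition with one zero-everywhere reward scores far lower, and the even split scores zero because both rewards share the same optimal policy.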
Next we present our method for learning reward decompositions in MDPs by maximizing this disentanglement score.
3 Proposed Method
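We use a parameterized reward decomposition network (a neural network with parameters $\theta$) that learns a function $f_{\theta} : S \to \mathbb{R}^{n}$. The outputs of this function are used to define a reward decomposition through a softmax function:
$$r_i^{\theta}(s) = r(s) \cdot \frac{\exp\big(f_{\theta}(s)_i\big)}{\sum_{j=1}^{n} \exp\big(f_{\theta}(s)_j\big)}$$
where $r$ is the environment reward function. The vector of decomposed rewards is thus $\big(r_1^{\theta}(s), \dots, r_n^{\theta}(s)\big)$. See Figure 1a for a visualization.
Note that $\theta$ defines the reward decomposition, which in turn defines the matrix of value functions used in the definition of the disentanglement score $J$, and so for ease of notation we can refer to $J(r_1^{\theta}, \dots, r_n^{\theta})$ as just $J(\theta)$. We similarly abbreviate its two component terms when appropriate.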
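The softmax parameterization can be sketched in a few lines of plain Python. The function names and the example logits are illustrative assumptions; in the paper the logits come from the reward decomposition network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of real-valued logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decompose_reward(env_reward, logits):
    """Split a scalar environment reward into shares that sum back to it."""
    return [env_reward * share for share in softmax(logits)]


# Equal logits split the reward evenly; a large gap gives almost all of it to one factor.
even_shares = decompose_reward(2.0, [0.0, 0.0])
peaked_shares = decompose_reward(1.0, [50.0, 0.0])
```

A consequence of this design is that the decomposed rewards always sum to the environment reward, and states with zero environment reward yield zero for every learned reward regardless of the network output.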
We use approximate gradient ascent to update the parameters as follows:
where is a cutoff on the number of time-steps that a trajectory is rolled out for and depends on the details of the form of the functions (which will be neural networks in our empirical work below).
Notice that computing the gradient of the disentanglement objective with respect to the network parameters requires learning the optimal policies for each factor of the reward decomposition after each change to those parameters. In practice, for sample efficiency, we do incremental updates of the policies and value functions as we adapt the parameters. We use Deep Q-Networks (DQNs; Mnih et al. 2015) to learn optimal policies for the decomposed rewards (collectively called the Policy Networks; see Figure 1b). Additionally, in order to compute the gradient estimate we must be able to collect multiple trajectories from the same starting state (see Equation 6). This requires that our environments be resettable to specific states (more details for our specific implementations are in Algorithm 1, presented in Appendix A).
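The need for resettable environments can be illustrated with a small sketch: from one start state, each policy is rolled out for a fixed number of steps, and the discounted sum of every learned reward is accumulated along that rollout. The `Corridor` class, its `reset_to` method and the function `rollout_value_matrix` are hypothetical names used only for illustration, and the policies are hand-written rather than learned DQNs.

```python
class Corridor:
    """Hypothetical resettable environment: states 0..4, actions -1 / +1."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset_to(self, state):
        """Place the environment in a specific state (the property the method needs)."""
        self.state = state
        return self.state

    def step(self, action):
        self.state = min(max(self.state + action, 0), self.n_states - 1)
        return self.state


def rollout_value_matrix(env, start, policies, reward_fns, gamma, horizon):
    """values[i][j]: truncated discounted sum of reward i while following policy j."""
    values = [[0.0] * len(policies) for _ in reward_fns]
    for j, policy in enumerate(policies):
        state = env.reset_to(start)  # every policy starts from the same state
        for t in range(horizon):
            for i, reward_fn in enumerate(reward_fns):
                values[i][j] += gamma ** t * reward_fn(state)
            state = env.step(policy(state))
    return values


def go_left(state):
    return -1


def go_right(state):
    return +1


values = rollout_value_matrix(
    Corridor(), start=2,
    policies=[go_left, go_right],
    reward_fns=[lambda s: float(s == 0), lambda s: float(s == 4)],
    gamma=0.9, horizon=4,
)
```

The diagonal of `values` corresponds to rewards collected by their own policies and the off-diagonal entries to rewards collected by other policies. With stochastic dynamics or policies, several rollouts per policy would be averaged.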
3.1 Illustrative Results
Before we present our theoretical results and substantive empirical results, we present an empirical illustration of the kinds of reward decompositions achieved by our method in a 5x5 grid with a rewarding square at each corner (see image labeled “Gridworld” in Figure 2a). The agent can move left, right, up and down. Upon reaching a rewarding square, the agent is teleported to a random square.
Learned Reward Decompositions
Figure 2a illustrates the types of reward decompositions obtained when we learn 2, 3 and 4 reward functions respectively (the numbers/colors denote which reward sources were put into the same learned reward). Notice how our disentanglement objective encourages the environment reward to be divided among the learned reward functions in a way that can be “independently obtained” by their associated policies. As the number of learned reward functions increases, this results in increasingly fine divisions of the environment reward: two reward functions divides the reward between halves of the environment, three reward functions separates the top half of the environment from the bottom and further divides the bottom half into bottom left and right corners and four reward functions associates each corner of the environment with a separate reward function.
Training Stability and Correspondence to the Disentanglement Score
Despite our disentanglement objective discouraging the learning of trivial (i.e., zero-everywhere) reward functions, we observed that such degenerate decompositions can still emerge as a result of our training procedure. In particular, this can occur in early training, as illustrated in Figure 2b: in the early part of training (shaded in blue), the score for two runs (denoted (1) and (2)) “drops” to lower values, while the score for one run (denoted (3)) continues to increase. The drops coincide with individual learned reward functions becoming trivial during early training. This can be seen in Figure 2c, where we show the corresponding reward decompositions. Note that run (1) puts all 4 reward sources into the same reward function, while run (2) puts three reward sources in one reward function and the fourth in a second reward function. Only in run (3), where the score increases throughout learning, do we find the disentanglement we were looking for: each reward source is in a separate reward function. This suggests that our score is a useful indicator of the quality of the reward decompositions found.
How do we mitigate the above instability? As seen in Algorithm 1 in Appendix A, updating the learned reward functions depends on having policies that can obtain some environment reward, and a given policy’s ability to obtain environment reward depends on its learned reward function being non-trivial. If a learned reward function becomes trivial before the corresponding policy learns to obtain some environment reward, it can get stuck there. We found that the prevalence of this problem could be significantly reduced with a specific choice of the coefficients defined in Equations 3 and 4. For all of our experiments we used
which corresponds to taking a softened minimum, with a temperature parameter, over the terms in Equation 4. The intuition behind this choice is that by approximately maximizing the minimal term, the optimization procedure is able to dynamically attend to the learned reward functions that are most in danger of becoming trivial. This choice yielded the best empirical stability among the choices we considered.
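The effect of a softened minimum can be illustrated as follows. The exact coefficient formula is not reproduced in this text, so the sketch assumes the common Boltzmann form in which each term is weighted in proportion to exp(-value / temperature). The function names and example values are hypothetical.

```python
import math


def softmin_weights(values, temperature):
    """Weights proportional to exp(-value / temperature); smaller values weigh more."""
    scaled = [-v / temperature for v in values]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_minimum(values, temperature):
    """Weighted average that approaches min(values) as the temperature goes to zero."""
    weights = softmin_weights(values, temperature)
    return sum(w * v for w, v in zip(weights, values))


# Own-policy values of two learned rewards; the first is close to becoming trivial.
own_values = [1.0, 5.0]
cold = soft_minimum(own_values, temperature=0.01)  # close to the minimum
warm = soft_minimum(own_values, temperature=1e6)   # close to the plain average
```

At low temperature nearly all weight falls on the smallest value, so a gradient step mostly improves the reward function closest to becoming trivial; at high temperature the weighting approaches a uniform average.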
4 Theoretical Properties
In this section we detail various theoretical properties that result from the optimization of our objective. The following theorems characterize the nature of the reward functions and their corresponding policies learned by our method.
Non-Overlapping Visitation Frequencies
Our first theoretical result illustrates that the optimization of our objective results in policies whose state-visitation frequencies have a low degree of overlap. This provides a concrete sense of the resulting disentanglement.
Define the discounted visitation frequency of starting from under policy as:
where is defined as
Specifically, we show that if two learned reward functions, and , are sufficiently disentangled in that , then the state visitation frequencies of their corresponding optimal policies must also be at least somewhat different.
Theorem 1. If and for all then
where is the maximum environment reward and represents total variation distance.
We can alternatively represent value functions in terms of as:
Our constraints imply that
as needed. ∎
Saturation of Softmax-Parameterization
We refer to a reward decomposition as saturated if for each rewarding state there is exactly one reward function that is non-zero (i.e., the decomposition never splits an environment reward between two or more reward functions).
Theorem 2. For for all and and under conditions described in Appendix C, the optimal reward decomposition under is saturated.
where and for all and .
Suppose we chose an alternate parameterization with the property that
This alternate parameterization yields
with strict inequality if for any pair of . Next notice that represents the best policy to collect the ’th reward under parameterization . Under the new parameterization , there is a new optimal policy . It thus follows that:
Finally note that:
where is defined in Appendix C.
This implies that if is sufficiently small then . ∎
5 Experimental Results
We present three classes of experimental results here. First, we demonstrate the qualitative and quantitative properties of our reward decompositions and associated policies on the Atari 2600 games: Assault, Pacman and Seaquest. We connect these qualitative properties with the theoretical properties discussed in Section 4. Second, we compare the policies learned by our method to those learned by an alternative method: Independently Controllable Factors (ICF; Thomas et al. 2017). Third, we illustrate that the policies optimal for our learned reward functions can be used as actions to obtain environment reward and further show that these policies exhibit some degree of generalization performance when applied to tasks with modified environment reward functions.
Common Experimental Procedure
For all experiments involving learning a decomposition, we produce decompositions and select the best run, i.e., the decomposition from the run that achieves the best disentanglement score. Unless otherwise stated, all curves seen in the subsequent sections should be regarded as the average of four runs with different random seeds, all using the best decomposition discussed above. For further details regarding experimental procedure and hyperparameter selection, see Appendix B.
5.1 Qualitative & Quantitative Analyses
Reward Function Visualization Methodology
Atari 2600 games have too many states to enumerate the learned rewards, necessitating an approximate visualization. For each game we select a “game-element” (see Figure 3) and visualize its position as reward is received from the different learned reward functions. To construct our visualization, we discretize the environments spatially into pixel bins. We then execute a random policy for time-steps; every time an environment reward is received we store its associated reward decomposition in the appropriate bin determined by the location of the game-element. Finally, we compute the distribution over learned reward functions at each bin and produce images for each reward function with regions of high-reward shaded in green. This is depicted in Figure 3.
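The binning step of this visualization can be sketched as below. The event format, the bin size and the function name `bin_reward_shares` are assumptions made for illustration; the game-element coordinates would come from the emulator, and the decomposed rewards from the reward decomposition network.

```python
def bin_reward_shares(events, bin_size, n_rewards):
    """Per spatial bin, the distribution of received reward over learned reward functions.

    events: iterable of ((x, y), decomposed_rewards) pairs, one per rewarded time-step,
            where (x, y) is the pixel location of the tracked game-element.
    """
    sums = {}
    for (x, y), decomposed in events:
        key = (x // bin_size, y // bin_size)  # discretize the location into a bin
        acc = sums.setdefault(key, [0.0] * n_rewards)
        for i, value in enumerate(decomposed):
            acc[i] += value
    # Normalize each bin so its entries form a distribution over reward functions.
    return {key: [v / sum(acc) for v in acc]
            for key, acc in sums.items() if sum(acc) > 0}


# Three hypothetical rewarded time-steps with two learned reward functions.
events = [((3, 4), [1.0, 0.0]), ((5, 2), [0.0, 1.0]), ((40, 40), [0.0, 1.0])]
shares = bin_reward_shares(events, bin_size=8, n_rewards=2)
```

Each learned reward function then gets one image whose pixel intensity in a bin is that function's entry in the bin's distribution.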
Reward Function Visualization Analysis
In Figure 3 each image shows the location of the game-element when a specific learned reward is obtained. We observe that each learned reward function generally yields reward in separate regions of the environment’s state-space. This effect is most pronounced when the number of reward functions is small but persists as more are added, suggesting that the location of the game-element plays a significant role in our learned decompositions. The learning of location-dependent reward functions is consistent with Theorem 1, which implies that the optimization of our objective results in optimal policies with different occupancy frequencies.
Empirical Reward Saturation
Theorem 2 predicts that under certain conditions, the optimal reward decomposition never splits the environment reward of a state between its learned factors. We term this property “saturation.” To illustrate the degree to which our method learns saturated reward functions empirically, we define the following “saturation score:”
which is defined when the environment reward is nonzero and ranges from to as the reward decomposition becomes more saturated at .
Table 1 depicts the average saturation scores over all instances of positive reward obtained during the execution of a random policy across our selection of Atari 2600 games as the number of learned reward functions varies. We observe that, under Equation 7, our learned decompositions are extremely saturated, suggesting that the saturation predicted by Theorem 2 also emerges in practice.
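A per-state measure of saturation can be sketched as follows. The formula used in the paper is not reproduced in this text, so the function below is a stand-in chosen for illustration: it assumes non-negative decomposed rewards and reports the largest share of the environment reward held by any single learned reward function.

```python
def saturation_score(decomposed_rewards):
    """Largest fraction of the environment reward assigned to one learned reward.

    Illustrative stand-in, not the paper's formula. Ranges from 1/n (even split)
    to 1 (the whole reward goes to a single reward function).
    """
    total = sum(decomposed_rewards)
    if total == 0:
        raise ValueError("undefined when the environment reward is zero")
    return max(decomposed_rewards) / total


fully_saturated = saturation_score([1.0, 0.0, 0.0])
evenly_split = saturation_score([0.5, 0.5])
```

Averaging such a score over all rewarded time-steps of a rollout gives a single number per game and per number of learned reward functions.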
5.2 Comparison to ICF
We now compare our method against a competing approach, ICF, which learns a set of latent factors as a state decomposition with the property that each factor can be controlled by a corresponding policy without changing the other factors, by optimizing a “selectivity score.” While ICF does not involve the reward function, we can empirically compare its resultant policies against those generated by our method. (Footnote 1: We use the directed variant of the selectivity score proposed in Thomas et al. (2017). In the comparisons reported here, unless otherwise stated, we allot ICF two directed policies for every policy we let our method learn.)
Qualitative Gridworld Comparison
We examine the behaviors of the policies found by ICF with 4 policies and our method with 4 reward functions on the gridworld domain in Figure 4. In this setting ICF learns policies corresponding to the cardinal directions of the domain, with each policy performing a single action (i.e., up, down, left, right) irrespective of the state. Conversely, given four reward functions, our approach learns policies that are directed to the four sources of reward.
Table 2: Average Policy Values and Average State-Dependence of the policies learned by ICF and by our method.
Quantitative Policy Comparison on Atari
As a further comparison between ICF and our method on Atari 2600 games, we collect two statistics on the policies learned by each: the value of the environment reward averaged over all the policies, and the average degree to which each policy changes actions with respect to state. We denote this latter quantity as “state-dependence” and define it for a policy as follows:
where denotes states sampled from a trajectory generated according to policy and
denotes standard deviation with respect to distribution.
We display the results of these comparisons in Table 2. Notice that our method’s policies achieve higher average environment-reward-based value than the ICF policies, and substantially higher state-dependence. While we expect this trend in average value (since our method considers the environment reward), the near-zero state-dependence of the ICF policies is more surprising and suggests that the policies learned by ICF are largely insensitive to the environment state.
5.3 Control using Induced MDPs
We also explore whether the policies generated from disentangled rewards can be useful for learning control with a DQN (Mnih et al., 2015). Replacing the actions in the original MDP with the policies found by our method “induces” an MDP in which selecting a policy in a state executes the action in the original MDP the policy takes in that state. The top row of Figure 6 shows that the game score in the induced MDP rises faster but asymptotes lower than the baseline of learning in the original MDP in Seaquest and Assault. We conjecture the faster rise is because the policies learned for the disentangled rewards produced by our method are also good at obtaining environment reward. Of course, the lower asymptote is expected because the baseline method is not limited in the behaviors that can be executed. We should also expect the use of the learned policies as actions to generalize to changes in the environment reward and we see this in the bottom row of Figure 6. The specific reward changes to each Atari game are explained in the caption of Figure 5. In both Seaquest and Assault, the induced agent not only learns faster than the baseline but also achieves an asymptote that is competitive with it.
Another expected but nonetheless interesting result is that, in both Seaquest and Assault, performance generally improves as we increase the number of learned rewards, in both the top and bottom rows of Figure 6b. In Pacman, the results are less consistent: in the top row, learning with 3 policies as actions does slightly better than the baseline, but performance does not necessarily improve with increasing numbers of policies.
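The induced MDP described in this section can be sketched as a thin wrapper around a base environment. The `Corridor` environment, the `InducedEnv` class and the two hand-written policies are hypothetical stand-ins; in the paper the base environments are Atari games and the policies are the DQNs learned for the decomposed rewards.

```python
class Corridor:
    """Hypothetical base environment: states 0..4, reward 1 when standing on either end."""

    def __init__(self):
        self.state = 2

    def reset(self):
        self.state = 2
        return self.state

    def step(self, action):
        self.state = min(max(self.state + action, 0), 4)
        reward = 1.0 if self.state in (0, 4) else 0.0
        return self.state, reward


class InducedEnv:
    """Environment whose actions are indices into a fixed list of policies."""

    def __init__(self, base_env, policies):
        self.base_env = base_env
        self.policies = policies
        self.n_actions = len(policies)  # one induced action per policy
        self.state = None

    def reset(self):
        self.state = self.base_env.reset()
        return self.state

    def step(self, policy_index):
        # Selecting a policy executes the primitive action it takes in the current state.
        action = self.policies[policy_index](self.state)
        self.state, reward = self.base_env.step(action)
        return self.state, reward


def go_left(state):
    return -1


def go_right(state):
    return +1


induced = InducedEnv(Corridor(), [go_left, go_right])
```

A standard agent can then be trained on `induced` exactly as on the base environment, with the environment reward unchanged and only the action set replaced.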
6 Conclusion and Future Work
In this work we presented and explored a novel formulation for additively-decomposing rewards into independently obtainable components. With empirical investigations, we showed that our disentanglement score is predictive of qualitatively interesting decompositions, that our algorithm is able to learn independently obtainable rewards for 3 Atari games better than a recent approach to disentangling states and policies, and that the policies optimal with respect to the learned rewards are useful in that they also obtain portions of the environment reward. With theoretical investigations, we showed that when our rewards are independently obtainable their optimal policies occupy non-overlapping states and that our gradient-based method for finding the reward decomposition yields saturated rewards in which each state’s environment reward is allocated entirely to one learned reward.
This work was supported by a grant from Toyota Research Institute (TRI), and by a grant from DARPA’s L2M program. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.
- Achille et al. (2018) Achille, A., Eccles, T., Matthey, L., Burgess, C., Watters, N., Lerchner, A., and Higgins, I. Life-long disentangled representation learning with cross-domain latent homologies. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9895–9905. Curran Associates, Inc., 2018.
- Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180, 2016.
- Degris & Sigaud (2013) Degris, T. and Sigaud, O. Factored markov decision processes. Markov Decision Processes in Artificial Intelligence, pp. 99–126, 2013.
- Guestrin et al. (2002) Guestrin, C., Koller, D., and Parr, R. Multiagent planning with factored mdps. In Advances in neural information processing systems, pp. 1523–1530, 2002.
- Higgins et al. (2016) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. 2016.
- Hu et al. (1998) Hu, J., Wellman, M. P., et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pp. 242–250. Citeseer, 1998.
- Kaelbling (1993) Kaelbling, L. P. Learning to achieve goals. In IJCAI, pp. 1094–1099. Citeseer, 1993.
- Kok & Vlassis (2006) Kok, J. R. and Vlassis, N. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7(Sep):1789–1828, 2006.
- Laversanne-Finot et al. (2018) Laversanne-Finot, A., Péré, A., and Oudeyer, P.-Y. Curiosity driven exploration of learned disentangled goal spaces. arXiv preprint arXiv:1807.01521, 2018.
- Mathieu et al. (2016) Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., and LeCun, Y. Disentangling factors of variation in deep representation using adversarial training. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 5040–5048. Curran Associates, Inc., 2016.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Russell & Zimdars (2003) Russell, S. J. and Zimdars, A. Q-decomposition for reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 656–663, 2003.
- Sutton et al. (1998) Sutton, R. S., Barto, A. G., et al. Reinforcement learning: An introduction. MIT press, 1998.
- Thomas et al. (2017) Thomas, V., Pondard, J., Bengio, E., Sarfati, M., Beaudoin, P., Meurs, M.-J., Pineau, J., Precup, D., and Bengio, Y. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.
- Tishby & Zaslavsky (2015) Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pp. 1–5. IEEE, 2015.
- Tran et al. (2017) Tran, L., Yin, X., and Liu, X. Disentangled representation learning gan for pose-invariant face recognition. In CVPR, volume 3, pp. 7, 2017.
- Van Seijen et al. (2017) Van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5392–5402, 2017.
- Yingzhen & Mandt (2018) Yingzhen, L. and Mandt, S. Disentangled sequential autoencoder. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5670–5679, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/yingzhen18a.html.
Appendix A Algorithm Description
Appendix B Experimental Details
For Atari 2600 games we concatenate the last 3 frames before passing them to our networks. We do not do this for our gridworld experiments. We clip all Atari 2600 rewards between 0 and 1 and do not mark any states as terminal.
All networks are trained for steps. All network training is done with a batch-size of 32. A single replay buffer is shared across DQN instances with a capacity of . We annealed our epsilon greedy behavior policy from to over the course of time-steps. We updated our Policy networks every time-steps and our Reward Decomposition networks every time-steps. For all experiments we use a discount value of .
Reward Decomposition Network
The Reward Decomposition (RD) network was trained using the Adam optimizer with a learning rate of .
Our RD network maps states to vectors of $n$ decomposed rewards and consists of 3 convolutional layers followed by 2 fully connected layers. The convolutional layers have filter sizes: 8, 4, 3; numbers of filters: 32, 64, 64; strides: 4, 2, 1 and activations: relu, relu, relu. The fully connected layers have widths: 128, $n$ and activations: relu, softmax.
When approximating value functions as in Equation 6, we roll out each policy for steps. We do not compute gradients through the weighting functions .
Policy Networks (DQNs)
Our Policy networks are represented as dueling DQNs adapted from the OpenAI baselines repository (https://github.com/openai/baselines). These networks each use a learning rate of , a target update frequency of time-steps. Each network consists of 3 convolutional layers followed by 2 fully connected layers. The convolutional layers have filter sizes: 8, 4, 3; numbers of filters: 32, 64, 64; strides: 4, 2, 1 and activations: relu, relu, relu. The fully connected layers have widths: 256, and activations: relu, unit.
Our network for replicating ICF was adapted from the authors’ git repository (https://github.com/bengioe/implementations/tree/master/DL/ICF_simple), leaving the majority of settings unchanged. The network, when applied to an image with $n$ channels, consists of an autoencoder with an auxiliary branch that outputs the latent state representation and policy distributions. The encoder consists of 3 convolutional layers with filter sizes: 3, 3, 3; numbers of filters: 32, 64, 64; strides: 2, 2, 2 and activations: relu, relu, relu. The decoder consists of 3 transpose-convolutional layers with filter sizes: 3, 3, 3; numbers of filters: 64, 32, $n$; strides: 2, 2, 2 and activations: relu, relu, unit. The auxiliary network branch has a single fully connected layer with width 32 and relu activation before splitting into two branches for latent factors and policy distributions respectively.
Appendix C Mathematical Results
Relationship between the Independence and Non-Triviality Terms
We next illustrate that the independence and non-triviality terms defined in Equations 3 and 4 can be related to each other by a function that does not directly depend on the choice of reward decomposition, but on the behavior of the agents resulting from the decomposition.
For a set of policies , we define this function as:
which intuitively captures the “total value” obtained by these policies.
If for all and it follows that:
The statement can be shown by simple algebraic manipulations:
as needed. ∎
We can then define the “sensitivity” of an MDP’s total value with respect to a change in policies as: