Learning Action Representations for Reinforcement Learning

02/01/2019 ∙ by Yash Chandak, et al. ∙ University of Massachusetts Amherst ∙ Adobe

Most model-free reinforcement learning methods leverage state representations (embeddings) for generalization, but either ignore structure in the space of actions or assume the structure is provided a priori. We show how a policy can be decomposed into a component that acts in a low-dimensional space of action representations and a component that transforms these representations into actual actions. These representations improve generalization over large, finite action sets by allowing the agent to infer the outcomes of actions similar to actions already taken. We provide an algorithm to both learn and use action representations and provide conditions for its convergence. The efficacy of the proposed method is demonstrated on large-scale real-world problems.




1 Introduction

Reinforcement learning (RL) methods have been applied successfully to many simple and game-based tasks. However, their applicability is still limited for problems involving decision making in many real-world settings. One reason is that many real-world problems with significant human impact involve selecting a single decision from a multitude of possible choices. Examples include maximizing long-term portfolio value in finance using various trading strategies (Jiang et al., 2017), improving fault tolerance by regulating the voltage level of all the units in a large power system (Glavic et al., 2017), and personalized tutoring systems that recommend sequences of videos from a large collection of tutorials (Sidney et al., 2005). Therefore, it is important that we develop RL algorithms that are effective for real problems, where the number of possible choices is large.

In this paper we consider the problem of creating RL algorithms that are effective for problems with large action sets. Existing RL algorithms handle large state sets (e.g., images consisting of pixels) by learning a representation or embedding for states (e.g., using line detectors or convolutional layers in neural networks), which allows the agent to reason and learn using the state representation rather than the raw state. We extend this idea to the set of actions: we propose learning a representation for the actions, which allows the agent to reason and learn by making decisions in the space of action representations rather than the original large set of possible actions. This setup is depicted in Figure 1, where an internal policy, $\pi_i$, acts in a space of action representations, and a function, $f$, transforms these representations into actual actions. Together we refer to $\pi_i$ and $f$ as the overall policy, $\pi_o$.

Figure 1: The structure of the proposed overall policy, $\pi_o$, consisting of $f$ and $\pi_i$, that learns action representations to generalize over large action sets.

Recent work has shown the benefits associated with using action embeddings (Dulac-Arnold et al., 2015), particularly that they allow for generalization over actions. For real-world problems where there are thousands of possible (discrete) actions, this generalization can significantly speed up learning. However, this prior work assumes that fixed and predefined representations are provided. In this paper we present a method to autonomously learn the underlying structure of the action set by using the observed transitions. This method can both learn action representations from scratch and improve upon a provided action representation.

A key component of our proposed method is that it frames the problem of learning an action representation (learning $f$) as a supervised learning problem rather than an RL problem. This is desirable because supervised learning methods tend to learn more quickly and reliably than RL algorithms since they have access to instructive feedback rather than evaluative feedback (Sutton and Barto, 2018). The proposed learning procedure exploits the structure in the action set by aligning actions based on the similarity of their impact on the state. Therefore, updates to a policy that acts in the space of learned action representations generalize the feedback received after taking an action to other actions that have similar representations. Furthermore, we prove that our combination of supervised learning (for $f$) and reinforcement learning (for $\pi_i$) within one larger RL agent preserves the almost sure convergence guarantees provided by policy gradient algorithms (Borkar and Konda, 1997).

To evaluate our proposed method empirically, we study two real-world recommender system problems using data from Adobe HelpX and Adobe Photoshop. In both applications, there are thousands of possible recommendations that could be given at each time step (e.g., which video to suggest the user watch next on the HelpX portal, or which tool to suggest to the user next in the Photoshop software). Our experimental results show that the proposed system significantly improves performance relative to existing methods for these applications by quickly and reliably learning action representations that allow for meaningful generalization over the large discrete set of possible actions.

The rest of this paper provides, in order, a background on RL, a discussion of related work, and the following primary contributions:

  • A new parameterization, called the overall policy, that leverages action representations. We show that for all optimal policies, $\pi^\star$, there exist parameters for this new policy class that are equivalent to $\pi^\star$.

  • A proof of equivalence of the policy gradient update between the overall policy and the internal policy.

  • A supervised learning algorithm for learning action representations ($f$ in Figure 1). This procedure can be combined with any existing policy gradient method for learning the overall policy.

  • An almost sure asymptotic convergence proof for the algorithm, which extends existing results for actor-critics (Borkar and Konda, 1997).

  • Experimental results on real-world domains with thousands of actions using actual data collected from Adobe HelpX and Photoshop.

2 Background

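We consider problems modeled as discrete-time Markov decision processes (MDPs) with discrete states and finite actions. An MDP is represented by a tuple, $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, d_0)$. $\mathcal{S}$ is the set of all possible states, called the state space, and $\mathcal{A}$ is a finite set of actions, called the action set. Though our notation assumes that the state set is finite, our primary results extend to MDPs with continuous states. In this work, we restrict our focus to MDPs with finite action sets, and $|\mathcal{A}|$ denotes the size of the action set. The random variables, $S_t \in \mathcal{S}$, $A_t \in \mathcal{A}$, and $R_t \in \mathbb{R}$ denote the state, action, and reward at time $t \in \{0, 1, \dots\}$. We assume that $R_t \in [-R_{\max}, R_{\max}]$ for some finite $R_{\max}$. The first state, $S_0$, comes from an initial distribution, $d_0$, and the reward function $\mathcal{R}$ is defined so that $\mathcal{R}(s, a) = \mathbf{E}[R_t \mid S_t = s, A_t = a]$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. Hereafter, for brevity, we write $P$ to denote both probabilities and probability densities, and when writing probabilities and expectations, write $s$, $a$, $s'$ or $e$ to denote both elements of various sets and the events $S_t = s$, $A_t = a$, $S_{t+1} = s'$ or $E_t = e$ (defined later). The desired meaning for $s$, $a$, $s'$ or $e$ should be clear from context. The reward discounting parameter is given by $\gamma \in [0, 1)$. $\mathcal{P}$ is the state transition function, such that $\mathcal{P}(s, a, s') := P(s' \mid s, a)$ for all $s$, $a$, $s'$ and $t$.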

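A policy $\pi: \mathcal{A} \times \mathcal{S} \to [0, 1]$ is a conditional distribution over actions for each state: $\pi(a, s) := P(A_t = a \mid S_t = s)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $t$. Although $\pi$ is simply a function, we write $\pi(a \mid s)$ rather than $\pi(a, s)$ to emphasize that it is a conditional distribution. For a given $\mathcal{M}$, an agent’s goal is to find a policy that maximizes the expected sum of discounted future rewards. For any policy $\pi$, the corresponding state-action value function is $q^{\pi}(s, a) = \mathbf{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, A_t = a, \pi]$, where conditioning on $\pi$ denotes that $A_{t+k} \sim \pi(\cdot \mid S_{t+k})$ for all $A_{t+k}$ and $S_{t+k}$ for $k \in [t+1, \infty)$. The state value function is $v^{\pi}(s) = \mathbf{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, \pi]$. It follows from the Bellman equation that $v^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q^{\pi}(s, a)$. An optimal policy is any $\pi^\star \in \arg\max_{\pi \in \Pi} \mathbf{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid \pi]$, where $\Pi$ denotes the set of all possible policies, and $v^\star$ is shorthand for $v^{\pi^\star}$.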
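The relationship between the state value function and the state-action value function can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP whose transition probabilities, rewards, and policy are made up for illustration; it evaluates the policy exactly by solving the linear Bellman system and then verifies that the state value equals the policy-weighted average of the state-action values.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative only.
gamma = 0.9
# P[s, a, s2] = probability of moving to state s2 after taking action a in s.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a] = expected immediate reward.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# pi[s, a] = probability of choosing action a in state s.
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# State-to-state transition matrix and expected reward under pi.
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = (pi * R).sum(axis=1)

# Solve the linear Bellman system v = r_pi + gamma * P_pi v.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# q(s, a) = R(s, a) + gamma * sum over s2 of P(s, a, s2) v(s2).
q = R + gamma * P @ v

# The state value is the policy-weighted average of the action values.
v_from_q = (pi * q).sum(axis=1)
print(np.allclose(v, v_from_q))
```

With thousands of actions the sum over actions and the per-action value estimates become the expensive part, which is the regime the rest of the paper targets.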

3 Related Work

Here we summarize the most closely related work and discuss how it relates to the proposed work.

Factorizing Action Space: To reduce the size of large action spaces, Pazis and Parr (2011) considered representing each action in binary format and learning a value function associated with each bit. A similar binary based approach was also used as an ensemble method for learning optimal policies for MDPs with large action sets (Sallans and Hinton, 2004). For planning problems, Cui and Khardon (2016, 2018) showed how a gradient based search on a symbolic representation of the state-action value function can be used to address scalability issues. More recently, it was shown that better performance can be achieved on Atari 2600 games (Bellemare et al., 2013) when actions are factored into their primary categories (Sharma et al., 2017). All these methods assumed that a handcrafted binary decomposition of raw actions was provided. To deal with discrete actions that might have an underlying continuous representation, Van Hasselt and Wiering (2009) used policy gradients with continuous actions and selected the nearest discrete action. This work was extended by Dulac-Arnold et al. (2015) for larger domains, where they performed action representation look up, similar to our approach. However, they assumed that the embeddings for the actions are given a priori. We present a method that can learn action representations with no prior knowledge or further optimize available action representations. If no prior knowledge is available, our method learns these representations from scratch autonomously.

Auxiliary Tasks: Previous works showed empirically that supervised learning with the objective of predicting a component of a transition tuple from the others can be useful as an auxiliary method to learn state representations (Jaderberg et al., 2016) or to obtain intrinsic rewards (Shelhamer et al., 2016; Pathak et al., 2017). We show how the overall policy itself can be decomposed using an action representation module learned using a similar loss function.

Motor Primitives: Research in neuroscience suggests that animals decompose their plans into mid-level abstractions, rather than the exact low-level motor controls needed for each movement (Jing et al., 2004). Such abstractions of behavior that form the building blocks for motor control are often called motor primitives (Lemay and Grill, 2004; Mussa-Ivaldi and Bizzi, 2000). In the field of robotics, dynamical system based models have been used to construct dynamic movement primitives (DMPs) for continuous control (Ijspeert et al., 2003; Schaal, 2006). Imitation learning can also be used to learn DMPs, which can be fine-tuned online using RL (Kober and Peters, 2009a, b). However, these are significantly different from our work as they are specifically parameterized for robotics tasks and produce an encoding for kinematic trajectory plans, not the actions.

Later, Thomas and Barto (2012) showed how a goal-conditioned policy can be learned using multiple motor primitives that control only useful sub-spaces of the underlying control problem. To learn binary motor primitives, Thomas and Barto (2011) showed how a policy can be modeled as a composition of multiple “coagents”, each of which learns using only the local policy gradient information (Thomas, 2011). Our work follows a similar direction, but we focus on automatically learning optimal continuous-valued action representations for discrete actions. For action representations, we present a method that uses supervised learning and restricts the usage of high variance policy gradients to train the internal policy only.

Other Domains: In supervised learning, representations of the output categories have been used to extract additional correlation information among the labels. Popular examples include learning label embeddings for image classification (Akata et al., 2016) and learning word embeddings for natural language problems (Mikolov et al., 2013). In contrast, for an RL setup, the policy is a function whose outputs correspond to the available actions. We show how learning action representations can be beneficial as well.

4 Generalization over Actions

The benefits of capturing the structure in the underlying state space of MDPs are well understood and widely used in RL. State representations allow the policy to generalize across states. Similarly, there often exists additional structure in the space of actions that can be leveraged. We hypothesize that exploiting this structure can enable quick generalization across actions, thereby making learning with large action sets feasible. To bridge the gap, we introduce an action representation space, $\mathcal{E} \subseteq \mathbb{R}^d$, and consider a factorized policy, $\pi_o$, parameterized by an embedding-to-action mapping function, $f: \mathcal{E} \to \mathcal{A}$, and an internal policy, $\pi_i: \mathcal{S} \times \mathcal{E} \to [0, 1]$, such that the distribution of $A_t$ given $S_t$ is characterized by:

$$E_t \sim \pi_i(\cdot \mid S_t), \qquad A_t = f(E_t).$$

Here, $\pi_i$ is used to sample $E_t \in \mathcal{E}$, and the function $f$ deterministically maps this representation to an action in the set $\mathcal{A}$. Both these components together form an overall policy, $\pi_o$. Figure 2 illustrates the probability of each action under such a parameterization. With a slight abuse of notation, we use $f^{-1}(a)$ to denote the set of representations that are mapped to the action $a$ by the function $f$, i.e., $f^{-1}(a) := \{e \in \mathcal{E} : f(e) = a\}$.
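To make the factorization concrete, the sketch below works through a one-dimensional case. It assumes a Gaussian internal policy for a single fixed state and a hypothetical mapping that partitions the real line into three intervals, one per action; the boundaries, mean, and standard deviation are illustrative choices, not values from the paper. The probability of each action is the probability mass the internal policy places on the set of embeddings mapped to that action, and a Monte Carlo run that samples an embedding and then applies the mapping should agree with it.

```python
import math
import random

# Hypothetical boundaries that split a 1-D embedding space into three regions:
# f(e) = 0 if e < -0.5, 1 if -0.5 <= e < 1.0, and 2 otherwise.
BOUNDARIES = [-0.5, 1.0]

def f(e):
    """Deterministically map an embedding e to an action index."""
    for action, upper in enumerate(BOUNDARIES):
        if e < upper:
            return action
    return len(BOUNDARIES)

def normal_cdf(x, mean, std):
    """CDF of a Gaussian, used to integrate the density over an interval."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def overall_policy_probs(mean, std):
    """Probability of each action: the mass of pi_i(.|s) on f^{-1}(a)."""
    cdf = [0.0] + [normal_cdf(b, mean, std) for b in BOUNDARIES] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]

def sample_action(mean, std, rng):
    """Sample e ~ pi_i(.|s), then return a = f(e)."""
    return f(rng.gauss(mean, std))

# Assumed Gaussian internal policy for one fixed state: mean 0.2, std 0.8.
probs = overall_policy_probs(0.2, 0.8)

# Monte Carlo check that sampling through f matches the integrated masses.
rng = random.Random(0)
n_samples = 20000
counts = [0, 0, 0]
for _ in range(n_samples):
    counts[sample_action(0.2, 0.8, rng)] += 1
empirical = [c / n_samples for c in counts]
print(probs, empirical)
```

In higher dimensions the regions mapped to each action are generally not intervals, so the integral has no closed form, while sampling an embedding and applying the mapping remains cheap.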

In the following sections we discuss the existence of an optimal policy and the learning procedure for $\pi_o$. To elucidate the steps involved, we split it into four parts. First, we show that there exist $f$ and $\pi_i$ such that $\pi_o$ is an optimal policy. Then we present the supervised learning process for the function $f$ when $\pi_i$ is fixed. Next we give the policy gradient learning process for $\pi_i$ when $f$ is fixed. Finally, we combine these methods to learn $f$ and $\pi_i$ simultaneously.

4.1 Existence of $\pi_i$ and $f$ to Represent An Optimal Policy

Figure 2: Illustration of the probability induced for three actions by the probability density of $\pi_i(e \mid s)$ on a 1-D embedding space. The $x$-axis represents the embedding, $e$, and the $y$-axis represents the probability. The colored regions represent the mapping $a = f(e)$, where each color is associated with a specific action.

In this section, we aim to establish a condition under which $\pi_o$ can represent an optimal policy. Consequently, we then define the optimal set of $\pi_o$ and $\pi_i$ using the proposed parameterization. To establish the main results we begin with the necessary assumptions.

The characteristics of the actions can be naturally associated with how they influence state transitions. In order to learn a representation for actions that captures this structure, we consider a standard Markov property, often used for learning probabilistic graphical models (Ghahramani, 2001), and make the following assumption that the transition information can be sufficiently encoded to infer the action that was executed.

Assumption A1.

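Given an embedding $E_t$, $A_t$ is conditionally independent of $S_t$ and $S_{t+1}$:

$$P(A_t \mid S_t, S_{t+1}) = \int_{\mathcal{E}} P(A_t \mid E_t = e)\, P(E_t = e \mid S_t, S_{t+1})\, \mathrm{d}e.$$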

Assumption A2.

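Given the embedding $E_t$, the action $A_t$ is deterministic and is represented by a function $f: \mathcal{E} \to \mathcal{A}$, i.e., there exists an $a$ such that $P(A_t = a \mid E_t = e) = 1$.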

We now establish a necessary condition under which our proposed policy can represent an optimal policy. This condition will also be useful later when deriving learning rules.

Lemma 1.

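Under Assumptions (A1)–(A2), which define a function $f$, for all $\pi$, there exists a $\pi_i$ such that

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i(e \mid s)\, q^{\pi}(s, a)\, \mathrm{d}e.$$

The proof is deferred to Appendix A. Following Lemma 1, we use $\pi_i$ and $f$ to define the overall policy as

$$\pi_o(a \mid s) := \int_{f^{-1}(a)} \pi_i(e \mid s)\, \mathrm{d}e. \tag{3}$$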

Theorem 1.

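Under Assumptions (A1)–(A2), which define a function $f$, there exists an overall policy, $\pi_o$, such that $v^{\pi_o} = v^\star$.

Proof. This follows directly from Lemma 1. Because the state and action sets are finite, the rewards are bounded, and $\gamma \in [0, 1)$, there exists at least one optimal policy. For any optimal policy $\pi^\star$, the corresponding state-value and state-action-value functions are the unique $v^\star$ and $q^\star$, respectively. By Lemma 1 there exist $f$ and $\pi_i^\star$ such that

$$v^\star(s) = \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i^\star(e \mid s)\, q^\star(s, a)\, \mathrm{d}e. \tag{4}$$

Therefore, there exist $f$ and $\pi_i^\star$ such that the resulting $\pi_o^\star$ has the state-value function $v^{\pi_o^\star} = v^\star$, and hence it represents an optimal policy. ∎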

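Note that Theorem 1 establishes existence of an optimal overall policy based on equivalence of the state-value function, but does not ensure that all optimal policies can be represented by an overall policy. Using (4), we define $\Pi_o^\star := \{\pi_o : v^{\pi_o} = v^\star\}$. Correspondingly, we define the set of optimal internal policies as $\Pi_i^\star := \{\pi_i : \exists\, \pi_o^\star \in \Pi_o^\star,\ \exists\, f,\ \pi_o^\star(a \mid s) = \int_{f^{-1}(a)} \pi_i(e \mid s)\, \mathrm{d}e\}$.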

4.2 Supervised Learning of $f$ For a Fixed $\pi_i$

Theorem 1 shows that there exist $\pi_i$ and a function $f$, which helps in predicting the action responsible for the transition from $S_t$ to $S_{t+1}$, such that the corresponding overall policy is optimal. However, such a function, $f$, may not be known a priori. In this section, we present a method to estimate $f$ using data collected from interactions with the environment.

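By Assumptions (A1)–(A2), $P(A_t \mid S_t, S_{t+1})$ can be written in terms of $f$ and $P(E_t \mid S_t, S_{t+1})$. We propose searching for an estimator, $\hat{f}$, of $f$ and an estimator, $\hat{g}(E_t \mid S_t, S_{t+1})$, of $P(E_t \mid S_t, S_{t+1})$ such that a reconstruction of $P(A_t \mid S_t, S_{t+1})$ is accurate. Let this estimate of $P(A_t \mid S_t, S_{t+1})$ based on $\hat{f}$ and $\hat{g}$ be

$$\hat{P}(A_t \mid S_t, S_{t+1}) = \int_{\mathcal{E}} \hat{f}(A_t \mid E_t = e)\, \hat{g}(E_t = e \mid S_t, S_{t+1})\, \mathrm{d}e.$$

One way to measure the difference between $P(A_t \mid S_t, S_{t+1})$ and $\hat{P}(A_t \mid S_t, S_{t+1})$ is using the expected (over states coming from the on-policy distribution) Kullback-Leibler (KL) divergence

$$\begin{aligned} D_{KL}\big(P(A_t \mid S_t, S_{t+1}) \,\|\, \hat{P}(A_t \mid S_t, S_{t+1})\big) &= -\mathbf{E}\left[\sum_{a \in \mathcal{A}} P(a \mid S_t, S_{t+1}) \ln\left(\frac{\hat{P}(a \mid S_t, S_{t+1})}{P(a \mid S_t, S_{t+1})}\right)\right] \\ &= -\mathbf{E}\left[\ln\left(\frac{\hat{P}(A_t \mid S_t, S_{t+1})}{P(A_t \mid S_t, S_{t+1})}\right)\right]. \end{aligned} \tag{7}$$

Since the observed transition tuples, $(S_t, A_t, S_{t+1})$, contain the action responsible for the given $S_t$ to $S_{t+1}$ transition, an on-policy sample estimate of the KL-divergence can be computed readily using (7). We adopt the following loss function based on the KL divergence between $P(A_t \mid S_t, S_{t+1})$ and $\hat{P}(A_t \mid S_t, S_{t+1})$:

$$\mathcal{L}(\hat{f}, \hat{g}) = -\mathbf{E}\left[\ln \hat{P}(A_t \mid S_t, S_{t+1})\right], \tag{8}$$

where the denominator in (7) is not included in (8) because it does not depend on $\hat{f}$ or $\hat{g}$. If $\hat{f}$ and $\hat{g}$ are parameterized, their parameters can be learned by minimizing the loss function, $\mathcal{L}$, using a supervised learning procedure.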
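The supervised loss is a negative log-likelihood of the observed action given the observed transition, so it can be minimized with ordinary gradient descent. The sketch below is one possible instantiation under stated assumptions, not the parameterization used in the paper's experiments (that is described in its Appendix D): the embedding estimator is taken to be a deterministic linear map `W` of the concatenated pair of states, the action estimator is a softmax over inner products with a hypothetical action embedding matrix `U`, and the transitions come from a synthetic environment in which each of four actions shifts a 2-D state by a fixed displacement.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, state_dim, embed_dim, n = 4, 2, 2, 512

# Synthetic transitions (assumption): action k shifts the state by DISPLACEMENT[k].
DISPLACEMENT = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
s = rng.normal(size=(n, state_dim))
a = rng.integers(n_actions, size=n)
s_next = s + DISPLACEMENT[a] + 0.1 * rng.normal(size=(n, state_dim))
x = np.concatenate([s, s_next], axis=1)  # input to g_hat: the pair (s, s')

W = 0.1 * rng.normal(size=(2 * state_dim, embed_dim))  # g_hat weights
U = 0.1 * rng.normal(size=(n_actions, embed_dim))      # f_hat action embeddings

def loss_and_grads(W, U):
    """Mean of -ln P_hat(a | s, s') and its gradients w.r.t. W and U."""
    e = x @ W                                   # embeddings predicted by g_hat
    logits = e @ U.T                            # similarity to each action
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)           # f_hat(a | e) as a softmax
    loss = -np.log(p[np.arange(n), a]).mean()
    d_logits = p.copy()
    d_logits[np.arange(n), a] -= 1.0            # softmax cross-entropy gradient
    d_logits /= n
    grad_U = d_logits.T @ e
    grad_W = x.T @ (d_logits @ U)
    return loss, grad_W, grad_U

initial_loss, _, _ = loss_and_grads(W, U)
for _ in range(1000):
    _, grad_W, grad_U = loss_and_grads(W, U)
    W -= 0.1 * grad_W
    U -= 0.1 * grad_U
final_loss, _, _ = loss_and_grads(W, U)
print(initial_loss, final_loss)
```

No reward appears anywhere in this procedure; the only supervision is which action produced each observed transition.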

Figure 3: (a) Given a state transition tuple, functions $\hat{g}$ and $\hat{f}$ are used to estimate the action taken. The red arrow denotes the gradients of the supervised loss (8) for learning the parameters of these functions. (b) During execution, an internal policy, $\pi_i$, can be used to first select an action representation, $e$. The function $\hat{f}$, obtained from the previous learning procedure, then transforms this representation to an action. The blue arrow represents the internal policy gradients (10) obtained using Lemma 2 to update $\pi_i$.

A computational graph for this model is shown in Figure 3. We refer the reader to Appendix D for the parameterizations of $\hat{f}$ and $\hat{g}$ used in our experiments. Note that, while $\hat{f}$ will be used for $f$ in an overall policy, $\hat{g}$ is only used to find $\hat{f}$, and will not serve an additional purpose.

As this supervised learning process only requires estimating $P(A_t \mid S_t, S_{t+1})$, it does not require (or depend on) the rewards. This partially mitigates the problems due to sparse and stochastic rewards, since an alternative informative supervised signal is always available. This is advantageous for making the action representation component of the overall policy learn quickly and with low variance updates.

4.3 Learning $\pi_i$ For a Fixed $f$

A common method for learning a policy parameterized with weights $\theta$ is to optimize the discounted start-state objective function, $J(\theta) := \sum_{s \in \mathcal{S}} d_0(s)\, v^{\pi}(s)$. For a policy with weights $\theta$, the expected performance of the policy can be improved by ascending the policy gradient, $\partial J(\theta) / \partial \theta$.

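Let the state-value function associated with the internal policy, $\pi_i$, be $v^{\pi_i}(s) = \mathbf{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid s, \pi_i, f]$, and the state-action value function $q^{\pi_i}(s, e) = \mathbf{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid s, e, \pi_i, f]$. We then define the performance function for $\pi_i$ as:

$$J_i(\theta) := \sum_{s \in \mathcal{S}} d_0(s)\, v^{\pi_i}(s). \tag{9}$$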
Viewing the embeddings as the action for the agent with policy $\pi_i$, the policy gradient theorem (Sutton et al., 2000) states that the unbiased (Thomas, 2014) gradient of (9) is

$$\frac{\partial J_i(\theta)}{\partial \theta} = \sum_{t=0}^{\infty} \mathbf{E}\left[\gamma^t \int_{\mathcal{E}} q^{\pi_i}(S_t, e)\, \frac{\partial}{\partial \theta} \pi_i(e \mid S_t)\, \mathrm{d}e\right], \tag{10}$$

where the expectation is over states from $d^{\pi}$, as defined by Sutton et al. (2000) (which is not a true distribution, since it is not normalized). The parameters of the internal policy can be learned by iteratively updating them in the direction of $\partial J_i(\theta) / \partial \theta$. Since there are no special constraints on the policy $\pi_i$, any policy gradient algorithm designed for continuous control, like DPG (Silver et al., 2014), PPO (Schulman et al., 2017), NAC (Bhatnagar et al., 2009) etc., can be used out-of-the-box.
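Because the internal policy acts in a continuous embedding space, a sample-based version of this gradient can use the usual score-function form. The sketch below is a minimal illustration that assumes a linear-Gaussian internal policy with mean `theta @ s` and a fixed isotropic standard deviation `sigma`; these names, and the use of a generic value estimate `q_hat`, are assumptions made for the example rather than details from the paper.

```python
import numpy as np

def log_prob(theta, s, e, sigma):
    """Log density of e under a Gaussian with mean theta @ s and std sigma."""
    d = e.shape[0]
    mean = theta @ s
    return (-0.5 * np.sum((e - mean) ** 2) / sigma**2
            - d * np.log(sigma) - 0.5 * d * np.log(2.0 * np.pi))

def score(theta, s, e, sigma):
    """Gradient of log_prob with respect to theta (same shape as theta)."""
    return np.outer((e - theta @ s) / sigma**2, s)

def policy_gradient_step(theta, s, e, q_hat, gamma_t, sigma, lr):
    """One stochastic ascent step: theta + lr * gamma^t * q_hat * score."""
    return theta + lr * gamma_t * q_hat * score(theta, s, e, sigma)

# Usage on made-up numbers: one sampled embedding for one state.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3))          # embedding dim 2, state dim 3
s = rng.normal(size=3)
sigma = 0.5
e = theta @ s + sigma * rng.normal(size=2)
theta_new = policy_gradient_step(theta, s, e, q_hat=1.0, gamma_t=1.0,
                                 sigma=sigma, lr=0.01)
```

The update never refers to the discrete action set: the number of policy parameters depends on the embedding dimension, not on the number of actions.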

However, note that the performance function associated with the overall policy, $\pi_o$ (consisting of function $f$ and the internal policy parameterized with weights $\theta$), is:

$$J_o(\theta, f) = \sum_{s \in \mathcal{S}} d_0(s)\, v^{\pi_o}(s).$$

The ultimate requirement is the improvement of this overall performance function, $J_o(\theta, f)$, and not just $J_i(\theta)$. So, how useful is it to update the internal policy, $\pi_i$, by following the gradient of its own performance function? The following lemma answers this question.

Lemma 2.

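For all deterministic functions, $f$, which map each point, $e \in \mathbb{R}^d$, in the representation space to an action, $a \in \mathcal{A}$, the expected updates to $\theta$ based on $\partial J_i(\theta) / \partial \theta$ are equivalent to updates based on $\partial J_o(\theta, f) / \partial \theta$. That is,

$$\frac{\partial J_o(\theta, f)}{\partial \theta} = \frac{\partial J_i(\theta)}{\partial \theta}.$$

The proof is deferred to Appendix B. The chosen parameterization for the policy has this special property, which allows $\pi_i$ to be learned using its internal policy gradient. Since this gradient update does not require computing the value of any $\pi_o(a \mid s)$ explicitly, the potentially intractable computation of $f^{-1}$ in (3) required for $\pi_o$ can be avoided. Instead, $\partial J_i(\theta) / \partial \theta$ can be used directly to update the parameters of the internal policy while still optimizing the overall policy’s performance, $J_o(\theta, f)$.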
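The following is an informal sketch of why such an equivalence is plausible; it is not the proof from Appendix B. It assumes that the overall policy assigns to each action $a$ the mass that $\pi_i$ places on $f^{-1}(a) = \{e : f(e) = a\}$, that these sets partition the embedding space $\mathcal{E}$, that differentiation and integration can be exchanged, and that, because $f$ is deterministic, choosing $e$ has the same value as choosing $f(e)$, i.e. $q^{\pi_i}(s, e) = q^{\pi_o}(s, f(e))$.

```latex
\begin{aligned}
\frac{\partial J_o(\theta, f)}{\partial \theta}
  &= \sum_{t=0}^{\infty} \mathbf{E}\left[\gamma^t \sum_{a \in \mathcal{A}} q^{\pi_o}(S_t, a)\,
     \frac{\partial}{\partial \theta} \pi_o(a \mid S_t)\right] \\
  &= \sum_{t=0}^{\infty} \mathbf{E}\left[\gamma^t \sum_{a \in \mathcal{A}} q^{\pi_o}(S_t, a)\,
     \frac{\partial}{\partial \theta} \int_{f^{-1}(a)} \pi_i(e \mid S_t)\, \mathrm{d}e\right] \\
  &= \sum_{t=0}^{\infty} \mathbf{E}\left[\gamma^t \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)}
     q^{\pi_o}(S_t, f(e))\, \frac{\partial}{\partial \theta} \pi_i(e \mid S_t)\, \mathrm{d}e\right] \\
  &= \sum_{t=0}^{\infty} \mathbf{E}\left[\gamma^t \int_{\mathcal{E}}
     q^{\pi_i}(S_t, e)\, \frac{\partial}{\partial \theta} \pi_i(e \mid S_t)\, \mathrm{d}e\right]
   = \frac{\partial J_i(\theta)}{\partial \theta}.
\end{aligned}
```

The third line uses $a = f(e)$ for every $e \in f^{-1}(a)$, and the fourth merges the integrals over the partition into a single integral over $\mathcal{E}$.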

4.4 Learning $\pi_i$ and $f$ Simultaneously

Since the supervised learning procedure for $f$ does not require rewards, a few initial trajectories can contain enough information to begin learning a useful action representation. As more data becomes available it can be used for fine-tuning and improving the action representations.

4.4.1 Algorithm

We call our algorithm policy gradients with representations for actions (PG-RA). PG-RA first initializes the parameters in the action representation component by sampling a few trajectories using a random policy and using the supervised loss defined in (8). If additional information is known about the actions, as assumed in prior work (Dulac-Arnold et al., 2015), it can also be considered when initializing the action representations. Optionally, once these action representations are initialized, they can be kept fixed.

Algorithm 1 illustrates the online update procedure for all of the parameters involved. Each time step in the episode is represented by $t$. For each step, an action representation is sampled and is then mapped to an action by $\hat{f}$. Having executed this action in the environment, the observed reward is then used to update the internal policy, $\pi_i$, using any policy gradient algorithm. Depending on the policy gradient algorithm, if a critic is used then semi-gradients of the TD-error are used to update the parameters of the critic. In other cases, like in REINFORCE (Williams, 1992) where there is no critic, this step can be ignored. The observed transition is then used to update the parameters of $\hat{f}$ and $\hat{g}$ so as to minimize the supervised learning loss (8). In our experiments, this last update is a stochastic gradient update.
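The order of operations in the online loop can be sketched in a few lines. This is a toy illustration under several loudly stated assumptions and not the paper's Algorithm 1: episodes have length one so that the reward is the return, the internal policy is linear-Gaussian and updated with a REINFORCE-style rule without a critic or baseline, the mapping from embedding to action picks the highest-scoring action embedding, the embedding estimator is a deterministic linear map, and the initial warm-up phase with a random policy is omitted. All names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (assumption): four actions that shift a 2-D state by unit steps.
DISPLACEMENT = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
N_ACTIONS, STATE_DIM, EMBED_DIM = 4, 2, 2
SIGMA, LR_POLICY, LR_REPR = 0.5, 0.01, 0.02

def f_hat(U, e):
    """Assumed mapping: pick the action whose embedding scores highest."""
    return int(np.argmax(U @ e))

def env_step(s, a):
    """One-step episode: reward is higher the closer s' is to the origin."""
    s_next = s + DISPLACEMENT[a]
    return s_next, -float(np.linalg.norm(s_next))

def representation_update(W, U, s, a, s_next):
    """One SGD step on -ln P_hat(a | s, s') for g_hat (W) and f_hat (U)."""
    x = np.concatenate([s, s_next])
    e = x @ W                      # deterministic g_hat (assumption)
    logits = U @ e
    p = np.exp(logits - logits.max())
    p /= p.sum()
    d_logits = p.copy()
    d_logits[a] -= 1.0             # gradient of the loss w.r.t. the logits
    grad_U = np.outer(d_logits, e)
    grad_W = np.outer(x, d_logits @ U)
    return W - LR_REPR * grad_W, U - LR_REPR * grad_U

theta = np.zeros((EMBED_DIM, STATE_DIM))               # internal policy weights
W = 0.1 * rng.normal(size=(2 * STATE_DIM, EMBED_DIM))  # g_hat weights
U = 0.1 * rng.normal(size=(N_ACTIONS, EMBED_DIM))      # f_hat action embeddings

for step in range(2000):
    s = rng.normal(size=STATE_DIM)
    mean = theta @ s
    e = mean + SIGMA * rng.normal(size=EMBED_DIM)   # e ~ pi_i(.|s)
    a = f_hat(U, e)                                 # a = f_hat(e)
    s_next, reward = env_step(s, a)
    # REINFORCE-style update of the internal policy (no critic, no baseline).
    theta += LR_POLICY * reward * np.outer((e - mean) / SIGMA**2, s)
    # Supervised update of the action representation module.
    W, U = representation_update(W, U, s, a, s_next)
```

Only `theta` is touched by the high-variance policy gradient signal; `W` and `U` are trained purely from which action produced each observed transition.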


4.4.2 PG-RA Convergence

If the action representations are held fixed while learning the internal policy, then as a consequence of Lemma 2, convergence of our algorithm directly follows from previous two-timescale results (Borkar and Konda, 1997; Bhatnagar et al., 2009). Here we show that learning both $\pi_i$ and $f$ simultaneously using our PG-RA algorithm can also be shown to converge by using a three-timescale analysis.

Similar to prior work (Bhatnagar et al., 2009; Degris et al., 2012; Konda and Tsitsiklis, 2000), for analysis of the updates to the parameters, $\theta$, of the internal policy, $\pi_i$, we use a projection operator $\Gamma$ that projects any $x$ to a compact set $\mathcal{C}$. We then define an associated vector field operator, $\hat{\Gamma}$, that projects any gradients leading outside the compact region, $\mathcal{C}$, back to $\mathcal{C}$. We refer the reader to Appendix C.3 for precise definitions of these operators and the additional standard assumptions (A3)–(A5). Practically, however, we do not project the iterates to a constraint region as they are seen to remain bounded (without projection).

Theorem 2.

Under Assumptions (A1)–(A5), the internal policy parameters , converge to as , with probability one.


Proof. (Outline) We consider three learning rate sequences, such that the update recursion for the internal policy is on the slowest timescale, the critic’s update recursion is on the fastest, and the action representation module’s update recursion is at an intermediate rate. With this construction, we leverage the three-timescale analysis technique (Borkar, 2009) and prove convergence. The complete proof is in Appendix C. ∎
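As an illustration of what three separated timescales can look like, the sketch below uses polynomially decaying step sizes. The exponents are an assumption chosen for the example; the exact conditions required by the proof are those stated in Appendix C. With exponents in (0.5, 1], each sequence sums to infinity while its squares sum to a finite value, and the ratio of a slower sequence to a faster one tends to zero.

```python
def step_sizes(t):
    """Return (critic, representation, policy) step sizes at iteration t >= 1.

    Illustrative exponents only: the critic decays slowest (fastest timescale),
    the internal policy decays fastest (slowest timescale).
    """
    critic = t ** -0.6
    representation = t ** -0.8
    policy = t ** -1.0
    return critic, representation, policy

# The ratios that separate the timescales shrink as t grows.
for t in (10, 1000, 100000):
    critic, representation, policy = step_sizes(t)
    print(t, representation / critic, policy / representation)
```

A faster timescale sees the slower parameters as almost constant, which is what allows the recursions to be analyzed one at a time.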

5 Empirical Analysis

A core motivation of this work is to provide an algorithm that can be used as a drop-in extension for improving the action generalization capabilities of existing policy gradient methods for problems with large action spaces. We consider two standard policy gradient methods in our experiments: actor-critic (AC) and deterministic policy gradient (DPG) (Silver et al., 2014). Just like previous algorithms, we also ignore the $\gamma^t$ terms and perform the biased policy gradient update to be practically more sample efficient (Thomas, 2014). We believe that the reported results can be further improved by using the proposed method with other policy gradient methods; we leave this for future work. For detailed discussion on parameterization of the function approximators and hyper-parameter search, see Appendix D.

5.1 Domains


Maze:

As a proof-of-concept, we constructed a continuous-state maze environment where the state comprises the coordinates of the agent’s current location. The agent has $n$ equally spaced actuators (each actuator moves the agent in the direction the actuator is pointing towards) around it, and it can choose whether each actuator should be on or off. Therefore, the size of the action set is exponential in the number of actuators, that is $|\mathcal{A}| = 2^n$. The net outcome of an action is the vectorial summation of the displacements associated with the selected actuators. The agent is rewarded with a small penalty for each time step, and a positive reward is given upon reaching the goal position. To make the problem more challenging, random noise was added to the action a fraction of the time and the episode length was capped.
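The structure of this action set is easy to enumerate for small numbers of actuators. The sketch below assumes that each actuator contributes a unit displacement along its own direction and that the actuators are spaced at equal angles starting from the positive x-axis; both are illustrative choices, since the step length and orientation are not given in the text.

```python
import itertools

import numpy as np

def actuator_directions(n):
    """Unit vectors for n actuators spaced at equal angles around the agent."""
    angles = 2.0 * np.pi * np.arange(n) / n
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

def action_displacements(n):
    """Net displacement of every on/off combination, as a (2**n, 2) array."""
    directions = actuator_directions(n)
    # Each row is one action: a binary vector saying which actuators are on.
    switches = np.array(list(itertools.product([0, 1], repeat=n)))
    # The outcome of an action is the vector sum of the selected displacements.
    return switches @ directions

displacements = action_displacements(4)
print(displacements.shape)   # one row per action
```

Many distinct actions produce identical or nearly identical displacements (for example, switching on two opposing actuators cancels out), which is the kind of redundancy an action representation can exploit.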

This environment is a useful test bed as it requires solving a long horizon task in an MDP with a large action set and a single goal reward. Further, we know the Cartesian representation for each of the actions, and can thereby use it to visualize the learned representation, as shown in Figure 4.

Figure 4: (a) The maze environment. The star denotes the goal state, the red dot corresponds to the agent and the arrows around it are the actuators. Each action corresponds to a unique combination of these actuators. Therefore, in total $2^n$ actions are possible. (b) 2-D representations for the displacements in the Cartesian co-ordinates caused by each action, and (c) learned action embeddings. In both (b) and (c), each action is colored based on the displacement ($\Delta x$, $\Delta y$) it produces, with the red and green color channels determined by $\Delta x$ and $\Delta y$, which are normalized before coloring. Cartesian actions are plotted on co-ordinates ($\Delta x$, $\Delta y$), and learned ones are on the coordinates in the embedding space. Smoother color transition of the learned representation is better as it corresponds to preservation of the relative underlying structure. The ‘squashing’ of the learned embeddings is an artifact of a non-linearity applied to bound its range.

Real-world recommender systems:

We consider two real-world applications of recommender systems that require decision making over multiple time steps.

The first is Adobe HelpX, a web-based video-tutorial platform with a recommendation engine that suggests a series of tutorial videos on various Adobe software products. The aim is to meaningfully engage the users in learning how to use these software products and convert novice users into experts in their respective areas of interest. The tutorial suggestion at each time step is made from a large pool of available tutorial videos on several products.

The second application is Adobe Photoshop, a professional multimedia editing application. Modern multimedia editing software often contains many tools that can be used to manipulate the media, and this wealth of options can be overwhelming for users. In this Adobe Photoshop domain, an agent suggests which of the available tools the user may want to use next. The objective is to increase user productivity and assist users in achieving their end goal.

For both of these applications, an existing log of users’ click stream data was used to create an n-gram based MDP model for user behavior (Shani et al., 2005). In the Adobe HelpX tutorial recommendation task, user activity for a three month period was observed. Sequences of user interaction were aggregated to obtain millions of clicks. Similarly, for a month long duration, sequential usage patterns of the tools in the Adobe Photoshop software were collected to obtain a total of over a billion user clicks. Tutorials and tools whose total click count fell below a minimum threshold were discarded. The remaining tutorials and tools for the HelpX platform and Adobe Photoshop, respectively, were used to create the action set for the MDP model.

The state for the MDP consists of the feature descriptors associated with each item (tutorial or tool) in the current n-gram. Rewards were chosen based on a surrogate measure for difficulty level of tutorials on the HelpX portal and popularity of final outcomes of user interactions in Photoshop, respectively. Since such data is sparse, only a small fraction of the items had rewards associated with them, and the reward for any item was capped at a fixed maximum.

Often the problem of recommendation is formulated as a contextual bandit or collaborative filtering problem, but as shown by Theocharous et al. (2015) these approaches fail to capture the long term value of the prediction. Solving this problem for a longer time horizon with a large number of actions (tutorials/tools) makes this real-life problem a useful and a challenging domain for RL algorithms.

5.2 Results

Visualizing the learned action representations

To understand the internal working of our proposed algorithm, we present visualizations of the learned action representations on the maze domain. A pictorial illustration of the environment is provided in Figure 4. Here, the underlying structure in the set of actions is related to the displacements in the Cartesian coordinates. This provides an intuitive base case against which we can compare our results.

In Figure 4, we provide a comparison between the action representations learned using our algorithm and the underlying Cartesian representation of the actions. It can be seen that the proposed method extracts useful structure in the action space. Actions which correspond to settings where the actuators on opposite sides of the agent are selected result in relatively small displacements of the agent. These are the ones in the center of the plot. In contrast, maximum displacement in any direction is caused by only selecting actuators facing in that particular direction. Actions corresponding to those are at the edge of the representation space. The smooth color transition indicates that not only the magnitude of displacement but also its direction is represented. Therefore, the learned representations efficiently preserve the relative transition information among all the actions. To make the exploration step tractable in the internal policy, $\pi_i$, we bound the representation space along each dimension to the range $[-1, 1]$ using a tanh non-linearity. This results in ‘squashing’ of these representations around the edge of this range.

Performance Improvement

Figure 5: (Top) Results on the Maze domain for increasing action-set sizes. (Bottom) Results on (a) Adobe HelpX MDP and (b) Adobe Photoshop MDP. AC-RA and DPG-RA are the variants of the PG-RA algorithm that use actor-critic (AC) and DPG, respectively.

The plots in Figure 5 for the Maze domain show how the performance of the standard actor-critic (AC) method deteriorates as the number of actions increases, even though the goal remains the same. However, with the addition of an action representation module it is able to capture the underlying structure in the action space and consistently perform well across all settings. Similarly, for both Adobe HelpX and Adobe Photoshop MDPs, standard AC methods fail to reason over longer time horizons under such an overwhelming number of actions, choosing mostly one-step actions that have high returns. In comparison, instances of our proposed algorithm not only achieve significantly higher return in both tasks, but they also do so much more quickly. These results reinforce our claim that learning action representations allows implicit generalization of feedback to other actions embedded in proximity to the executed action.

Further, under the PG-RA algorithm, only a fraction of the total parameters, those of the internal policy, are learned using high-variance policy gradient updates. The remaining parameters, associated with the action representations, are learned by a supervised learning procedure. This reduces the variance of the updates significantly, allowing the PG-RA algorithms to learn a better policy faster, as is evident from the plots in Figure 5. These advantages allow the internal policy to quickly approximate an optimal policy without succumbing to the curse of large action sets.

6 Summary and Future Work

In this paper, we built upon the core idea of leveraging the structure in the space of actions and showed its importance for enhancing generalization over large action sets in real-world large-scale applications. Our approach has three key advantages. (a) Simplicity: by simply using the observed transitions, an additional supervised update rule can be used to learn action representations. (b) Theory: we showed that the proposed overall policy class can represent an optimal policy and derived the associated learning procedures for its parameters. (c) Extensibility: as the PG-RA algorithm indicates, our approach can be easily extended using other policy gradient methods to leverage additional advantages, while preserving the convergence guarantees.

An interesting future direction would be to extend these results to capture the structure of a high-dimensional continuous action space in a lower-dimensional representation space as well. Unlike a finite set of actions, which can be embedded in a continuous space, the key challenge is that learning a lower-dimensional space for continuous actions inevitably results in the inability to represent some sections of the original action space, since the representation space has fewer dimensions than the action space.


References

  • Akata et al. (2016) Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1425–1438, 2016.
  • Bellemare et al. (2013) M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bhatnagar et al. (2009) S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
  • Borkar (2009) V. S. Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
  • Borkar and Konda (1997) V. S. Borkar and V. R. Konda. The actor-critic algorithm as multi-time-scale stochastic approximation. Sadhana, 22(4):525–543, 1997.
  • Cui and Khardon (2016) H. Cui and R. Khardon. Online symbolic gradient-based optimization for factored action MDPs. In IJCAI, pages 3075–3081, 2016.
  • Cui and Khardon (2018) H. Cui and R. Khardon. Lifted stochastic planning, belief propagation and marginal MAP. In The Workshops of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Degris et al. (2012) T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
  • Dulac-Arnold et al. (2015) G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
  • Ghahramani (2001) Z. Ghahramani. An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence, 15(01):9–42, 2001.
  • Glavic et al. (2017) M. Glavic, R. Fonteneau, and D. Ernst. Reinforcement learning for electric power system decision and control: Past considerations and perspectives. IFAC-PapersOnLine, 50(1):6918–6927, 2017.
  • Haeffele and Vidal (2017) B. D. Haeffele and R. Vidal. Global optimality in neural network training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7331–7339, 2017.
  • Ijspeert et al. (2003) A. J. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives. In Advances in neural information processing systems, pages 1547–1554, 2003.
  • Jaderberg et al. (2016) M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Jiang et al. (2017) Z. Jiang, D. Xu, and J. Liang. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059, 2017.
  • Jing et al. (2004) J. Jing, E. C. Cropper, I. Hurwitz, and K. R. Weiss. The construction of movement with behavior-specific and behavior-independent modules. Journal of Neuroscience, 24(28):6315–6325, 2004.
  • Kawaguchi (2016) K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
  • Kober and Peters (2009a) J. Kober and J. Peters. Learning motor primitives for robotics. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 2112–2118. IEEE, 2009a.
  • Kober and Peters (2009b) J. Kober and J. R. Peters. Policy search for motor primitives in robotics. In Advances in neural information processing systems, pages 849–856, 2009b.
  • Konda and Tsitsiklis (2000) V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
  • Konidaris et al. (2011) G. Konidaris, S. Osentoski, and P. S. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In AAAI, volume 6, page 7, 2011.
  • Lemay and Grill (2004) M. A. Lemay and W. M. Grill. Modularity of motor output evoked by intraspinal microstimulation in cats. Journal of neurophysiology, 91(1):502–514, 2004.
  • Mikolov et al. (2013) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • Mussa-Ivaldi and Bizzi (2000) F. A. Mussa-Ivaldi and E. Bizzi. Motor learning through the combination of primitives. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 355(1404):1755–1769, 2000.
  • Pathak et al. (2017) D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
  • Pazis and Parr (2011) J. Pazis and R. Parr. Generalized value functions for large action sets. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1185–1192, 2011.
  • Sallans and Hinton (2004) B. Sallans and G. E. Hinton. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5(Aug):1063–1088, 2004.
  • Schaal (2006) S. Schaal. Dynamic movement primitives-a framework for motor control in humans and humanoid robotics. In Adaptive motion of animals and machines, pages 261–280. Springer, 2006.
  • Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Shani et al. (2005) G. Shani, D. Heckerman, and R. I. Brafman. An MDP-based recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.
  • Sharma et al. (2017) S. Sharma, A. Suresh, R. Ramesh, and B. Ravindran. Learning to factor policies and action-value functions: Factored action space representations for deep reinforcement learning. arXiv preprint arXiv:1705.07269, 2017.
  • Shelhamer et al. (2016) E. Shelhamer, P. Mahmoudieh, M. Argus, and T. Darrell. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
  • Sidney et al. (2005) K. D. Sidney, S. D. Craig, B. Gholson, S. Franklin, R. Picard, and A. C. Graesser. Integrating affect sensors in an intelligent tutoring system. In Affective Interactions: The Computer in the Affective Loop Workshop at, pages 7–13, 2005.
  • Silver et al. (2014) D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
  • Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Sutton et al. (2000) R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • Theocharous et al. (2015) G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Ad recommendation systems for life-time value optimization. In Proceedings of the 24th International Conference on World Wide Web, pages 1305–1310. ACM, 2015.
  • Thomas (2014) P. Thomas. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pages 441–448, 2014.
  • Thomas (2011) P. S. Thomas. Policy gradient coagent networks. In Advances in Neural Information Processing Systems, pages 1944–1952, 2011.
  • Thomas and Barto (2011) P. S. Thomas and A. G. Barto. Conjugate Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 137–144, 2011.
  • Thomas and Barto (2012) P. S. Thomas and A. G. Barto. Motor primitive discovery. In Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on, pages 1–8. IEEE, 2012.
  • Tsitsiklis and Van Roy (1996) J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. Technical report, Report LIDS-P-2322. Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, 1996.
  • Van Hasselt and Wiering (2009) H. Van Hasselt and M. A. Wiering. Using continuous action spaces to solve discrete problems. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 1149–1156. IEEE, 2009.
  • Williams (1992) R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Zhu et al. (2018) Z. Zhu, D. Soudry, Y. C. Eldar, and M. B. Wakin. The global optimization geometry of shallow linear neural networks. arXiv preprint arXiv:1805.04938, 2018.


Appendix A Proof of Lemma 1

Lemma 1.

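Under Assumptions (A1)–(A2), which define a function $f$, for all $\pi$, there exists a $\pi_i$ such that

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i(e \mid s)\, q^{\pi}(s, a)\, \mathrm{d}e.$$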


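The Bellman equation associated with a policy, $\pi$, for any MDP, $\mathcal{M}$, is:

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \big[ R(s, a) + \gamma v^{\pi}(s') \big].$$

$G$ is used to denote $[R(s, a) + \gamma v^{\pi}(s')]$ hereafter. Re-arranging terms in the Bellman equation,

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \pi(a \mid s)\, P(s' \mid s, a)\, G = \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P(s', a \mid s)\, G.$$

Using the law of total probability, we introduce a new variable $e$ such that (note that $a$ and $e$ are from a joint distribution over a discrete and a continuous random variable; for simplicity, we avoid measure-theoretic notation to represent their joint probability):

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} P(s', a, e \mid s)\, G\, \mathrm{d}e.$$

After multiplying and dividing by $P(e \mid s)$, we have:

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} P(e \mid s)\, \frac{P(s', a, e \mid s)}{P(e \mid s)}\, G\, \mathrm{d}e = \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} P(e \mid s)\, P(s' \mid s, a, e)\, P(a \mid s, e)\, G\, \mathrm{d}e.$$

Since the transition to the next state, $S_{t+1}$, is conditionally independent of $E_t$, given the previous state, $S_t$, and the action taken, $A_t$,

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} P(e \mid s)\, P(s' \mid s, a)\, P(a \mid s, e)\, G\, \mathrm{d}e.$$

Similarly, using the Markov property, action $A_t$ is conditionally independent of $S_t$ given $E_t$,

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} P(e \mid s)\, P(s' \mid s, a)\, P(a \mid e)\, G\, \mathrm{d}e.$$

As $P(a \mid e)$ evaluates to one for representations, $e$, that map to $a$ and zero for others (Assumption A2),

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{f^{-1}(a)} P(e \mid s)\, P(s' \mid s, a)\, G\, \mathrm{d}e = \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} P(e \mid s)\, q^{\pi}(s, a)\, \mathrm{d}e, \tag{28}$$

where $q^{\pi}(s, a) = \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, G$.

In (28), note that the probability density, $P(e \mid s)$, is the internal policy, $\pi_i(e \mid s)$. Therefore,

$$v^{\pi}(s) = \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i(e \mid s)\, q^{\pi}(s, a)\, \mathrm{d}e.$$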
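The lemma rests on the overall policy assigning to each action the probability mass that the internal policy places on the representations mapped to that action. The following sketch checks this numerically in a deliberately simple setting. It assumes a one-dimensional representation space, a Gaussian internal policy, and a hypothetical mapping `f` that splits the line into three intervals; none of these choices come from the paper.

```python
import math
import random

ACTIONS = ["left", "stay", "right"]


def f(e):
    """Hypothetical deterministic mapping from a representation to an action."""
    if e < -0.5:
        return "left"
    if e < 0.5:
        return "stay"
    return "right"


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def overall_policy_exact(mean, std):
    """Integrate the Gaussian internal policy over each pre-image of f."""
    lo = normal_cdf(-0.5, mean, std)
    hi = normal_cdf(0.5, mean, std)
    return {"left": lo, "stay": hi - lo, "right": 1.0 - hi}


def overall_policy_sampled(mean, std, n, seed):
    """Estimate the same probabilities by sampling e and applying f."""
    rng = random.Random(seed)
    counts = {a: 0 for a in ACTIONS}
    for _ in range(n):
        counts[f(rng.gauss(mean, std))] += 1
    return {a: c / n for a, c in counts.items()}


exact = overall_policy_exact(mean=0.2, std=0.6)
sampled = overall_policy_sampled(mean=0.2, std=0.6, n=20_000, seed=0)
```

Sampling a representation and then applying `f` yields action frequencies that agree with the integrals over the pre-images, which is the decomposition the proof uses.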


Appendix B Proof of Lemma 2

Lemma 2.

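For all deterministic functions, $f$, which map each point, $e$, in the representation space to an action, $a$, the expected updates to $\theta$ based on $\frac{\partial J_i(\theta)}{\partial \theta}$ are equivalent to updates based on $\frac{\partial J_o(\theta, f)}{\partial \theta}$. That is,

$$\frac{\partial J_o(\theta, f)}{\partial \theta} = \frac{\partial J_i(\theta)}{\partial \theta}.$$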


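Recall from (3) that the probability of an action given by the overall policy, $\pi_o$, is

$$\pi_o(a \mid s) = \int_{f^{-1}(a)} \pi_i(e \mid s)\, \mathrm{d}e.$$

Using Lemma 1, we express the performance function of the overall policy, $\pi_o$, as:

$$J_o(\theta, f) = \sum_{s \in \mathcal{S}} d_0(s) \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i(e \mid s)\, q^{\pi_o}(s, a)\, \mathrm{d}e,$$

where $d_0$ is the start-state distribution. The gradient of the performance function is therefore

$$\frac{\partial J_o(\theta, f)}{\partial \theta} = \frac{\partial}{\partial \theta} \left[ \sum_{s \in \mathcal{S}} d_0(s) \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i(e \mid s)\, q^{\pi_o}(s, a)\, \mathrm{d}e \right].$$

Using the policy gradient theorem [Sutton et al., 2000] for the overall policy, $\pi_o$, the partial derivative of $J_o(\theta, f)$ w.r.t. $\theta$ is,

$$\frac{\partial J_o(\theta, f)}{\partial \theta} = \sum_{t=0}^{\infty} \mathbf{E}\left[ \sum_{a \in \mathcal{A}} \gamma^t\, q^{\pi_o}(S_t, a)\, \frac{\partial}{\partial \theta} \pi_o(a \mid S_t) \right] = \sum_{t=0}^{\infty} \mathbf{E}\left[ \sum_{a \in \mathcal{A}} \gamma^t \int_{f^{-1}(a)} q^{\pi_o}(S_t, a)\, \frac{\partial}{\partial \theta} \pi_i(e \mid S_t)\, \mathrm{d}e \right].$$

Note that since $e$ deterministically maps to $a$, $q^{\pi_o}(S_t, a) = q^{\pi_i}(S_t, e)$. Therefore,

$$\frac{\partial J_o(\theta, f)}{\partial \theta} = \sum_{t=0}^{\infty} \mathbf{E}\left[ \sum_{a \in \mathcal{A}} \gamma^t \int_{f^{-1}(a)} q^{\pi_i}(S_t, e)\, \frac{\partial}{\partial \theta} \pi_i(e \mid S_t)\, \mathrm{d}e \right].$$

Finally, since each $e$ is mapped to a unique action by the function $f$, the nested summation over $a$ and its inner integral over $f^{-1}(a)$ can be replaced by an integral over the entire domain of $e$. Hence,

$$\frac{\partial J_o(\theta, f)}{\partial \theta} = \sum_{t=0}^{\infty} \mathbf{E}\left[ \gamma^t \int_{e} q^{\pi_i}(S_t, e)\, \frac{\partial}{\partial \theta} \pi_i(e \mid S_t)\, \mathrm{d}e \right] = \frac{\partial J_i(\theta)}{\partial \theta}.$$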


Appendix C Convergence of PG-RA

To analyze the convergence of PG-RA, we first briefly review existing two-timescale convergence results for actor-critics. Afterwards, we present a general setup for stochastic recursions of three dependent parameter sequences. Asymptotic behavior of the system is then discussed using three different timescales, by adapting existing multi-timescale results by Borkar [2009]. This lays the foundation for our subsequent convergence proof. Finally, we prove convergence of the PG-RA method, which extends standard actor-critic algorithms using a new action prediction module, using a three-timescale approach. This technique for the proof is not a novel contribution of the work. We leverage and extend the existing convergence results of actor-critic algorithms [Borkar and Konda, 1997] for our algorithm.

C.1 Actor-Critic Convergence Using Two-Timescales

In actor-critic algorithms, the updates to the policy depend upon a critic that can estimate the value function associated with the policy at that particular instant. One way to get a good value function is to fix the policy temporarily and update the critic in an inner loop that uses transitions drawn using only that fixed policy. While this is a sound approach, it requires a possibly large time between successive updates to the policy parameters and is severely sample-inefficient. Two-timescale stochastic approximation methods [Bhatnagar et al., 2009, Konda and Tsitsiklis, 2000] circumvent this difficulty. The faster update recursion for the critic ensures that, asymptotically, it is always a close approximation to the required value function before the next update to the policy is made.

C.2 Three-Timescale Setup

In our proposed algorithm, to update the action prediction module, one could have also considered an inner loop that uses transitions drawn using the fixed policy for supervised updates. Instead, to make such a procedure converge faster, we extend the existing two-timescale actor-critic results and take a three-timescale approach.

Consider the following system of stochastic ordinary differential equations (ODE):
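A three-timescale system of this kind can be sketched as follows. The symbols $F_x, F_y, F_z$ for the update functions, $N^1, N^2, N^3$ for the noise sequences, and $\alpha^x_t, \alpha^y_t, \alpha^z_t$ for the step sizes are notation assumed for this sketch. The argument lists are chosen to agree with Assumptions B3–B5 as stated below ($Y$ is driven by $Z$ alone, while $X$ is driven by both $Y$ and $Z$), and the labels follow the references (41)–(43) used later in the text.

```latex
\begin{align}
X_{t+1} &= X_t + \alpha^x_t \left( F_x(X_t, Y_t, Z_t) + N^1_{t+1} \right), \tag{41} \\
Y_{t+1} &= Y_t + \alpha^y_t \left( F_y(Y_t, Z_t) + N^2_{t+1} \right), \tag{42} \\
Z_{t+1} &= Z_t + \alpha^z_t \left( F_z(X_t, Y_t, Z_t) + N^3_{t+1} \right). \tag{43}
\end{align}
```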


where, and are Lipschitz continuous functions and , are the associated martingale difference sequences for noise w.r.t. the increasing -fields = , satisfying

for , and any constant such that the quadratic variation of noise is always bounded. To study the asymptotic behavior of the system, consider the following standard assumptions,

Assumption B1 (Boundedness).

, almost surely.

Assumption B2 (Learning rate schedule).

The learning rates and satisfy:
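A standard form of such a schedule for three coupled recursions combines the usual summability conditions with a strict ordering of the timescales. The step-size symbols $\alpha^x_t, \alpha^y_t, \alpha^z_t$ are assumed notation, and the ordering shown ($X$ fastest, $Z$ slowest) is the one described in the discussion that follows.

```latex
\begin{align*}
&\sum_{t} \alpha^x_t = \sum_{t} \alpha^y_t = \sum_{t} \alpha^z_t = \infty, \\
&\sum_{t} \left( (\alpha^x_t)^2 + (\alpha^y_t)^2 + (\alpha^z_t)^2 \right) < \infty, \\
&\lim_{t \to \infty} \frac{\alpha^z_t}{\alpha^y_t} = 0, \qquad
 \lim_{t \to \infty} \frac{\alpha^y_t}{\alpha^x_t} = 0.
\end{align*}
```

For example, the polynomial schedules $\alpha^x_t = t^{-0.6}$, $\alpha^y_t = t^{-0.8}$ and $\alpha^z_t = t^{-1}$ satisfy all three lines.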

Assumption B3 (Existence of stationary point for Y).

The following ODE has a globally asymptotically stable equilibrium , where is a Lipschitz continuous function.

Assumption B4 (Existence of stationary point for X).

The following ODE has a globally asymptotically stable equilibrium , where is a Lipschitz continuous function.

Assumption B5 (Existence of stationary point for Z).

The following ODE has a globally asymptotically stable equilibrium ,


Assumptions B1–B2 are required to bound the values of the parameter sequence and make the learning rate well-conditioned, respectively. Assumptions B3–B4 ensure that there exists a global stationary point for the respective recursions, individually, when the other parameters are held constant. Finally, Assumption B5 ensures that there exists a global stationary point for the update recursion associated with $Z$, if between each successive update to $Z$, $X$ and $Y$ have converged to their respective stationary points.

Lemma 3.

Under Assumptions B1-B5, as , with probability one.


We adapt the multi-timescale analysis by Borkar [2009] to analyze the above system of equations using three-timescales. First we present an intuitive explanation and then we formalize the results.

Since these three updates are not independent at each time step, we consider three step-size schedules, one per recursion, which satisfy Assumption B2. As a consequence of (46), the recursion (42) is ‘faster’ than (43), and (41) is ‘faster’ than both (42) and (43). In other words, $Z$ moves on the slowest timescale and $X$ moves on the fastest. Such a separation of timescales is desirable since $Z$ converges to its stationary point if, at each time step, the values of the corresponding converged $X$ and $Y$ estimates are used to make the next update (Assumption B5).
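The effect of the timescale separation can be observed on a toy system. The sketch below is an illustration only: the linear update functions, the noise level, and the step-size exponents are all hypothetical choices, picked so that $Y$ tracks $2Z$, $X$ tracks $Y + Z$, and the slow recursion for $Z$ has its stationary point at $Z = 1$ (hence $Y = 2$ and $X = 3$).

```python
import random


def three_timescale_demo(steps=50_000, noise_std=0.1, seed=0):
    """Run three coupled noisy recursions with separated step sizes."""
    rng = random.Random(seed)
    x = y = z = 0.0
    for t in range(steps):
        # Step sizes: x is fastest, z is slowest; each is square-summable
        # but not summable, and a_z / a_y -> 0, a_y / a_x -> 0.
        a_x = (t + 10) ** -0.6
        a_y = (t + 10) ** -0.8
        a_z = (t + 10) ** -1.0

        # Hypothetical update directions, evaluated at the current iterates.
        f_x = (y + z) - x      # x tracks y + z
        f_y = 2.0 * z - y      # y tracks 2 z, independent of x
        f_z = 3.0 - x          # z is stationary once x equals 3

        # Noisy simultaneous updates of all three iterates.
        x += a_x * (f_x + rng.gauss(0.0, noise_std))
        y += a_y * (f_y + rng.gauss(0.0, noise_std))
        z += a_z * (f_z + rng.gauss(0.0, noise_std))
    return x, y, z


x_final, y_final, z_final = three_timescale_demo()
```

Although all three iterates are updated at every step, the slow iterate changes so little between updates of the faster ones that they behave as if it were fixed, which is the intuition formalized in the remainder of this section.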

To elaborate on the previous points, first consider the ODEs:


Alternatively, one can consider the ODE


in place of (50), because $Z$ is fixed (51). Now, under Assumption B3 we know that the iterative update (42) performed on $Y$, with a fixed $Z$, will eventually converge to a corresponding stationary point.

Now, with this converged $Y$, consider the following ODEs:


Alternatively, one can consider the ODE


in place of (53), as $Y$ and $Z$ are fixed (54)-(55). As a consequence of Assumption B4, $X$ converges when both $Y$ and $Z$ are held fixed.

Intuitively, as a result of Assumption B2, in the limit, the learning rate of $Z$ becomes very small relative to that of $Y$. This makes $Z$ ‘quasi-static’ compared to $Y$ and has an effect similar to fixing $Z$ and running the iteration (42) forever to converge at its stationary point. Similar