1 Introduction
Reinforcement learning (RL) methods have been applied successfully to many simple and game-based tasks. However, their applicability is still limited for problems involving decision making in many real-world settings. One reason is that many real-world problems with significant human impact involve selecting a single decision from a multitude of possible choices. Examples include maximizing long-term portfolio value in finance using various trading strategies (Jiang et al., 2017), improving fault tolerance by regulating the voltage level of all the units in a large power system (Glavic et al., 2017), and recommending sequences of videos from a large collection of tutorials in personalized tutoring systems (Sidney et al., 2005). Therefore, it is important that we develop RL algorithms that are effective for real problems, where the number of possible choices is large.
In this paper we consider the problem of creating RL algorithms that are effective for problems with large action sets. Existing RL algorithms handle large state sets (e.g., images consisting of pixels) by learning a representation or embedding for states (e.g., using line detectors or convolutional layers in neural networks), which allows the agent to reason and learn using the state representation rather than the raw state. We extend this idea to the set of actions: we propose learning a representation for the actions, which allows the agent to reason and learn by making decisions in the space of action representations rather than the original large set of possible actions. This setup is depicted in Figure 1, where an internal policy, $\pi_i$, acts in a space of action representations, and a function, $f$, transforms these representations into actual actions. Together we refer to $\pi_i$ and $f$ as the overall policy, $\pi_o$.
Recent work has shown the benefits associated with using action embeddings (Dulac-Arnold et al., 2015), particularly that they allow for generalization over actions. For real-world problems where there are thousands of possible (discrete) actions, this generalization can significantly speed learning. However, this prior work assumes that fixed and predefined representations are provided. In this paper we present a method to autonomously learn the underlying structure of the action set by using the observed transitions. This method can both learn action representations from scratch and improve upon a provided action representation.
A key component of our proposed method is that it frames the problem of learning an action representation (learning $f$) as a supervised learning problem rather than an RL problem. This is desirable because supervised learning methods tend to learn more quickly and reliably than RL algorithms since they have access to instructive feedback rather than evaluative feedback (Sutton and Barto, 2018). The proposed learning procedure exploits the structure in the action set by aligning actions based on the similarity of their impact on the state. Therefore, updates to a policy that acts in the space of learned action representations generalize the feedback received after taking an action to other actions that have similar representations. Furthermore, we prove that our combination of supervised learning (for $f$) and reinforcement learning (for $\pi_i$) within one larger RL agent preserves the almost sure convergence guarantees provided by policy gradient algorithms (Borkar and Konda, 1997).
To evaluate our proposed method empirically, we study two real-world recommender system problems using data from Adobe HelpX and Adobe Photoshop. In both applications, there are thousands of possible recommendations that could be given at each time step (e.g., which video to suggest the user watch next on the HelpX portal, or which tool to suggest to the user next in the Photoshop software). Our experimental results show our proposed system's ability to significantly improve performance relative to existing methods for these applications by quickly and reliably learning action representations that allow for meaningful generalization over the large discrete set of possible actions.
The rest of this paper provides, in order, background on RL, related work, and the following primary contributions:

A new parameterization, called the overall policy, that leverages action representations. We show that for all optimal policies, $\pi^*$, there exist parameters for this new policy class that are equivalent to $\pi^*$.

A proof of equivalence of the policy gradient update between the overall policy and the internal policy.

A supervised learning algorithm for learning action representations ($f$ in Figure 1). This procedure can be combined with any existing policy gradient method for learning the overall policy.

An almost sure asymptotic convergence proof for the algorithm, which extends existing results for actor-critics (Borkar and Konda, 1997).

Experimental results on real-world domains with thousands of actions using actual data collected from Adobe HelpX and Photoshop.
2 Background
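We consider problems modeled as discrete-time Markov decision processes (MDPs) with discrete states and finite actions. An MDP is represented by a tuple, $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, d_0)$. $\mathcal{S}$ is the set of all possible states, called the state space, and $\mathcal{A}$ is a finite set of actions, called the action set. Though our notation assumes that the state set is finite, our primary results extend to MDPs with continuous states. In this work, we restrict our focus to MDPs with finite action sets, and $|\mathcal{A}|$ denotes the size of the action set. The random variables, $S_t \in \mathcal{S}$, $A_t \in \mathcal{A}$, and $R_t \in \mathbb{R}$ denote the state, action, and reward at time $t \in \{0, 1, \dots\}$. We assume that $R_t \in [-R_{\max}, R_{\max}]$ for some finite $R_{\max}$. The first state, $S_0$, comes from an initial distribution, $d_0$, and the reward function $\mathcal{R}$ is defined so that $\mathcal{R}(s, a) = \mathbb{E}[R_t \mid S_t = s, A_t = a]$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. Hereafter, for brevity, we write $P$ to denote both probabilities and probability densities, and when writing probabilities and expectations, write $s$, $a$, $s'$, or $e$ to denote both elements of various sets and the events $S_t = s$, $A_t = a$, $S_{t+1} = s'$, or $E_t = e$ (defined later). The desired meaning for $s$, $a$, $s'$, or $e$ should be clear from context. The reward discounting parameter is given by $\gamma \in [0, 1)$. $\mathcal{P}$ is the state transition function, such that $\mathcal{P}(s, a, s') := P(s' \mid s, a)$ for all $s$, $a$, $s'$, and $t$.
A policy $\pi : \mathcal{A} \times \mathcal{S} \to [0, 1]$ is a conditional distribution over actions for each state: $\pi(a, s) := P(A_t = a \mid S_t = s)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $t$. Although $\pi$ is simply a function, we write $\pi(a \mid s)$ rather than $\pi(a, s)$ to emphasize that it is a conditional distribution. For a given $\mathcal{M}$, an agent's goal is to find a policy that maximizes the expected sum of discounted future rewards. For any policy $\pi$, the corresponding state-action value function is $q^{\pi}(s, a) = \mathbb{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, A_t = a, \pi]$, where conditioning on $\pi$ denotes that $A_{t+k} \sim \pi(\cdot \mid S_{t+k})$ for all $A_{t+k}$ and $S_{t+k}$ with $k \geq 1$. The state value function is $v^{\pi}(s) = \mathbb{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, \pi]$. It follows from the Bellman equation that $v^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) q^{\pi}(s, a)$. An optimal policy is any $\pi^* \in \arg\max_{\pi \in \Pi} \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid \pi]$, where $\Pi$ denotes the set of all possible policies, and $v^*$ is shorthand for $v^{\pi^*}$.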
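As a small numerical illustration of these definitions, the following sketch evaluates a fixed policy on a made-up two-state, two-action MDP and checks that the state value equals the policy-weighted average of the state-action values, i.e., $v^{\pi}(s) = \sum_a \pi(a \mid s) q^{\pi}(s, a)$. The transition probabilities, rewards, policy, and discount factor are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = transition probability
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                  # R[s, a] = expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],                 # pi[s, a] = P(A_t = a | S_t = s)
               [0.4, 0.6]])
gamma = 0.9

# State-to-state transition matrix and expected reward under pi.
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = (pi * R).sum(axis=1)

# Solve the linear Bellman equation v = r_pi + gamma * P_pi v.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# q(s, a) = R(s, a) + gamma * sum_{s'} P(s, a, s') v(s').
q = R + gamma * P @ v

# The identity under test: v(s) = sum_a pi(a | s) q(s, a).
v_from_q = (pi * q).sum(axis=1)
```

Here `v` and `v_from_q` agree up to floating-point error, which is the relation the rest of the analysis relies on.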
3 Related Work
Here we summarize the most closely related work and discuss how it relates to the proposed work.
Factorizing Action Space: To reduce the size of large action spaces, Pazis and Parr (2011) considered representing each action in binary format and learning a value function associated with each bit. A similar binary-based approach was also used as an ensemble method for learning optimal policies for MDPs with large action sets (Sallans and Hinton, 2004). For planning problems, Cui and Khardon (2016, 2018) showed how a gradient-based search on a symbolic representation of the state-action value function can be used to address scalability issues. More recently, it was shown that better performance can be achieved on Atari 2600 games (Bellemare et al., 2013) when actions are factored into their primary categories (Sharma et al., 2017). All these methods assumed that a handcrafted binary decomposition of raw actions was provided. To deal with discrete actions that might have an underlying continuous representation, Van Hasselt and Wiering (2009) used policy gradients with continuous actions and selected the nearest discrete action. This work was extended by Dulac-Arnold et al. (2015) for larger domains, where they performed action representation look-up, similar to our approach. However, they assumed that the embeddings for the actions are given a priori. We present a method that can learn action representations with no prior knowledge or further optimize available action representations. If no prior knowledge is available, our method learns these representations from scratch autonomously.
Auxiliary Tasks: Previous works showed empirically that supervised learning with the objective of predicting a component of a transition tuple from the others can be useful as an auxiliary method to learn state representations (Jaderberg et al., 2016) or to obtain intrinsic rewards (Shelhamer et al., 2016; Pathak et al., 2017). We show how the overall policy itself can be decomposed using an action representation module learned using a similar loss function.
Motor Primitives: Research in neuroscience suggests that animals decompose their plans into mid-level abstractions, rather than the exact low-level motor controls needed for each movement (Jing et al., 2004). Such abstractions of behavior that form the building blocks for motor control are often called motor primitives (Lemay and Grill, 2004; Mussa-Ivaldi and Bizzi, 2000). In the field of robotics, dynamical-system-based models have been used to construct dynamic movement primitives (DMPs) for continuous control (Ijspeert et al., 2003; Schaal, 2006). Imitation learning can also be used to learn DMPs, which can be fine-tuned online using RL (Kober and Peters, 2009a, b). However, these are significantly different from our work as they are specifically parameterized for robotics tasks and produce an encoding for kinematic trajectory plans, not the actions.
Later, Thomas and Barto (2012) showed how a goal-conditioned policy can be learned using multiple motor primitives that control only useful subspaces of the underlying control problem. To learn binary motor primitives, Thomas and Barto (2011) showed how a policy can be modeled as a composition of multiple "coagents", each of which learns using only the local policy gradient information (Thomas, 2011). Our work follows a similar direction, but we focus on automatically learning optimal continuous-valued action representations for discrete actions. For action representations, we present a method that uses supervised learning and restricts the usage of high-variance policy gradients to training the internal policy only.
Other Domains: In supervised learning, representations of the output categories have been used to extract additional correlation information among the labels. Popular examples include learning label embeddings for image classification (Akata et al., 2016) and learning word embeddings for natural language problems (Mikolov et al., 2013). In contrast, for an RL setup, the policy is a function whose outputs correspond to the available actions. We show how learning action representations can be beneficial as well.
4 Generalization over Actions
Capturing the structure in the underlying state space of MDPs is a well-understood and widely used concept in RL. State representations allow the policy to generalize across states. Similarly, there often exists additional structure in the space of actions that can be leveraged. We hypothesize that exploiting this structure can enable quick generalization across actions, thereby making learning with large action sets feasible. To bridge the gap, we introduce an action representation space, $\mathcal{E} \subseteq \mathbb{R}^d$, and consider a factorized policy, $\pi_o$, parameterized by an embedding-to-action mapping function, $f : \mathcal{E} \to \mathcal{A}$, and an internal policy, $\pi_i$, such that the distribution of $A_t$ given $S_t$ is characterized by:
$$E_t \sim \pi_i(\cdot \mid S_t), \qquad A_t = f(E_t). \tag{1}$$
Here, $\pi_i$ is used to sample $E_t \in \mathcal{E}$, and the function $f$ deterministically maps this representation to an action in the set $\mathcal{A}$. Both these components together form an overall policy, $\pi_o$. Figure 2 illustrates the probability of each action under such a parameterization. With a slight abuse of notation, we use $f^{-1}(a)$ to denote the set of representations that are mapped to the action $a$ by the function $f$, i.e., $f^{-1}(a) := \{e \in \mathcal{E} : f(e) = a\}$.
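The two-stage sampling described above can be sketched in a few lines. This is a minimal illustration, not the parameterization used in the experiments: the linear-Gaussian internal policy, the random action embeddings, and the nearest-neighbour choice for the deterministic map are all assumptions made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, state_dim, embed_dim = 5, 3, 2
# Hypothetical action embeddings: one point in the representation space per action.
action_embeddings = rng.normal(size=(n_actions, embed_dim))
# Hypothetical linear-Gaussian internal policy: mean = W @ s, fixed standard deviation.
W = rng.normal(size=(embed_dim, state_dim))
sigma = 0.5

def internal_policy_sample(s):
    """Sample an action representation e from the internal policy given state s."""
    return W @ s + sigma * rng.normal(size=embed_dim)

def f(e):
    """Deterministically map a representation to the nearest action (an assumed choice of f)."""
    dists = np.linalg.norm(action_embeddings - e, axis=1)
    return int(np.argmin(dists))

def overall_policy_sample(s):
    """Sample from the overall policy: first e ~ internal policy, then a = f(e)."""
    e = internal_policy_sample(s)
    return f(e), e

s = np.array([0.1, -0.3, 0.8])
a, e = overall_policy_sample(s)
```

Note that all stochasticity lives in the internal policy; once a representation is drawn, the action is fully determined.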
In the following sections we discuss the existence of an optimal policy and the learning procedure for $\pi_o$. To elucidate the steps involved, we split it into four parts. First, we show that there exist $f$ and $\pi_i$ such that $\pi_o$ is an optimal policy. Then we present the supervised learning process for the function $f$ when $\pi_i$ is fixed. Next we give the policy gradient learning process for $\pi_i$ when $f$ is fixed. Finally, we combine these methods to learn $f$ and $\pi_i$ simultaneously.
4.1 Existence of $\pi_i$ and $f$ to Represent an Optimal Policy
In this section, we aim to establish a condition under which can represent an optimal policy. Consequently, we then define the optimal set of and using the proposed parameterization. To establish the main results we begin with the necessary assumptions.
The characteristics of the actions can be naturally associated with how they influence state transitions. In order to learn a representation for actions that captures this structure, we consider a standard Markov property, often used for learning probabilistic graphical models (Ghahramani, 2001), and make the following assumption that the transition information can be sufficiently encoded to infer the action that was executed.
Assumption A1.
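Given an embedding $E_t$, $A_t$ is conditionally independent of $S_t$ and $S_{t+1}$:
$$P(A_t \mid S_t, S_{t+1}) = \int_{\mathcal{E}} P(A_t \mid E_t = e) \, P(E_t = e \mid S_t, S_{t+1}) \, \mathrm{d}e.$$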
Assumption A2.
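Given the embedding $E_t$, the action $A_t$ is deterministic and is represented by a function $f : \mathcal{E} \to \mathcal{A}$, i.e., there exists an $a$ such that $P(A_t = a \mid E_t = e) = 1$.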
We now establish a necessary condition under which our proposed policy can represent an optimal policy. This condition will also be useful later when deriving learning rules.
Lemma 1.
The proof is deferred to Appendix A. Following Lemma 1, we use $\pi_i$ and $f$ to define the overall policy as
$$\pi_o(a \mid s) := \int_{f^{-1}(a)} \pi_i(e \mid s) \, \mathrm{d}e. \tag{3}$$
Theorem 1.
Proof.
This follows directly from Lemma 1. Because the state and action sets are finite, the rewards are bounded, and $\gamma \in [0, 1)$, there exists at least one optimal policy. For any optimal policy $\pi^*$, the corresponding state-value and state-action-value functions are the unique $v^*$ and $q^*$, respectively. By Lemma 1 there exist $f$ and $\pi_i$ such that
$$v^*(s) = \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i(e \mid s) \, q^*(s, a) \, \mathrm{d}e. \tag{4}$$
Therefore, there exist $\pi_i$ and $f$ such that the resulting $\pi_o$ has the state-value function $v^{\pi_o} = v^*$, and hence it represents an optimal policy. ∎
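To make the overall policy concrete, the sketch below computes the probability of each action as the mass that the internal policy places on the set of representations mapped to that action, and checks the result against sampling. It assumes a one-dimensional representation space, a Gaussian internal policy, and a nearest-neighbour map onto hypothetical anchor points; none of these choices come from the paper, and in higher dimensions the integral generally has no such closed form.

```python
import math
import numpy as np

def normal_cdf(x, mu, sigma):
    """CDF of a univariate Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical 1-D action embeddings (sorted); f maps e to the nearest anchor.
anchors = np.array([-1.0, -0.2, 0.3, 1.5])
# Under nearest-neighbour f, the preimage of each action is the interval
# between the midpoints of neighbouring anchors.
edges = np.concatenate(([-np.inf], (anchors[:-1] + anchors[1:]) / 2, [np.inf]))

def overall_policy_probs(mu, sigma):
    """Probability of each action: Gaussian mass on that action's preimage interval."""
    cdf = np.array([normal_cdf(x, mu, sigma) for x in edges])
    return np.diff(cdf)

mu, sigma = 0.1, 0.7   # assumed internal-policy output for some fixed state
probs = overall_policy_probs(mu, sigma)

# Monte Carlo check: sample e from the internal policy, apply f, count actions.
rng = np.random.default_rng(0)
e = rng.normal(mu, sigma, size=200_000)
a = np.abs(e[:, None] - anchors[None, :]).argmin(axis=1)
mc_probs = np.bincount(a, minlength=len(anchors)) / len(e)
```

The closed-form and sampled probabilities agree closely, which illustrates that sampling a representation and then applying the deterministic map induces a proper distribution over the discrete actions.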
4.2 Supervised Learning of $f$ for a Fixed $\pi_i$
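Theorem 1 shows that there exist $\pi_i$ and a function $f$, which helps in predicting the action responsible for the transition from $S_t$ to $S_{t+1}$, such that the corresponding overall policy is optimal. However, such a function, $f$, may not be known a priori. In this section, we present a method to estimate $f$ using data collected from interactions with the environment.
By Assumptions (A1)–(A2), $P(A_t \mid S_t, S_{t+1})$ can be written in terms of $f$ and $P(E_t \mid S_t, S_{t+1})$. We propose searching for an estimator, $\hat{f}$, of $f$ and an estimator, $\hat{g}$, of $P(E_t \mid S_t, S_{t+1})$ such that a reconstruction of $P(A_t \mid S_t, S_{t+1})$ is accurate. Let this estimate of $P(A_t \mid S_t, S_{t+1})$ based on $\hat{f}$ and $\hat{g}$ be
$$\hat{P}(A_t \mid S_t, S_{t+1}) = \int_{\mathcal{E}} \hat{f}(A_t \mid E_t = e) \, \hat{g}(E_t = e \mid S_t, S_{t+1}) \, \mathrm{d}e. \tag{5}$$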
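One way to measure the difference between $P(A_t \mid S_t, S_{t+1})$ and $\hat{P}(A_t \mid S_t, S_{t+1})$ is using the expected (over states coming from the on-policy distribution) Kullback-Leibler (KL) divergence
$$D_{KL}\big(P(A_t \mid S_t, S_{t+1}) \,\|\, \hat{P}(A_t \mid S_t, S_{t+1})\big) = -\mathbb{E}\left[\sum_{a \in \mathcal{A}} P(a \mid S_t, S_{t+1}) \ln \frac{\hat{P}(a \mid S_t, S_{t+1})}{P(a \mid S_t, S_{t+1})}\right] \tag{6}$$
$$= -\mathbb{E}\left[\ln \frac{\hat{P}(A_t \mid S_t, S_{t+1})}{P(A_t \mid S_t, S_{t+1})}\right]. \tag{7}$$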
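Since the observed transition tuples, $(S_t, A_t, S_{t+1})$, contain the action responsible for the given $S_t$ to $S_{t+1}$ transition, an on-policy sample estimate of the KL divergence can be computed readily using (7). We adopt the following loss function based on the KL divergence between $P(A_t \mid S_t, S_{t+1})$ and $\hat{P}(A_t \mid S_t, S_{t+1})$:
$$\mathcal{L}(\hat{f}, \hat{g}) = -\mathbb{E}\left[\ln \hat{P}(A_t \mid S_t, S_{t+1})\right], \tag{8}$$
where the denominator in (7) is not included in (8) because it does not depend on $\hat{f}$ or $\hat{g}$. If $\hat{f}$ and $\hat{g}$ are parameterized, their parameters can be learned by minimizing the loss function, $\mathcal{L}$, using a supervised learning procedure.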
A computational graph for this model is shown in Figure 3. We refer the reader to Appendix D for the parameterizations of $\hat{f}$ and $\hat{g}$ used in our experiments. Note that, while $\hat{f}$ will be used for $f$ in an overall policy, $\hat{g}$ is only used to find $\hat{f}$, and will not serve an additional purpose.
As this supervised learning process only requires estimating $P(A_t \mid S_t, S_{t+1})$, it does not require (or depend on) the rewards. This partially mitigates the problems due to sparse and stochastic rewards, since an alternative informative supervised signal is always available. This is advantageous for making the action representation component of the overall policy learn quickly and with low-variance updates.
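The supervised step can be illustrated with a small self-contained sketch: predict which action caused an observed transition, and minimize the negative log-likelihood of the action that was actually taken. Everything concrete here is an assumption for illustration, including the synthetic transitions, the linear map used to produce an embedding from the two states, and the softmax over dot-product similarities with per-action embeddings; the parameterizations used in the experiments are described in the paper's appendix and may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, state_dim, embed_dim = 8, 2, 2

# Synthetic transitions (assumed): action a shifts the state by disp[a] plus noise.
angles = 2 * np.pi * np.arange(n_actions) / n_actions
disp = np.stack([np.cos(angles), np.sin(angles)], axis=1)
N = 2000
s = rng.normal(size=(N, state_dim))
a = rng.integers(n_actions, size=N)
s_next = s + disp[a] + 0.05 * rng.normal(size=(N, state_dim))

# Assumed estimators: a linear map from (s, s') to an embedding, and a
# softmax over similarities between that embedding and per-action embeddings.
G = 0.1 * rng.normal(size=(2 * state_dim, embed_dim))
Z = 0.1 * rng.normal(size=(n_actions, embed_dim))

def loss_and_grads(G, Z):
    x = np.concatenate([s, s_next], axis=1)       # (N, 2 * state_dim)
    e = x @ G                                     # predicted embedding per transition
    logits = e @ Z.T                              # similarity to each action embedding
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # predicted action probabilities
    loss = -np.log(p[np.arange(N), a]).mean()     # negative log-likelihood of taken actions
    d_logits = p.copy()
    d_logits[np.arange(N), a] -= 1.0
    d_logits /= N
    grad_Z = d_logits.T @ e
    grad_G = x.T @ (d_logits @ Z)
    return loss, grad_G, grad_Z

initial_loss, _, _ = loss_and_grads(G, Z)
for _ in range(1000):                             # plain full-batch gradient descent
    _, gG, gZ = loss_and_grads(G, Z)
    G -= 0.1 * gG
    Z -= 0.1 * gZ
final_loss, _, _ = loss_and_grads(G, Z)
```

No reward appears anywhere in this procedure; only state transitions and the executed actions are used, which is what makes the signal available even when rewards are sparse.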
4.3 Learning $\pi_i$ for a Fixed $f$
A common method for learning a policy parameterized with weights $\theta$ is to optimize the discounted start-state objective function, $J(\theta) := \sum_{s \in \mathcal{S}} d_0(s) v^{\pi}(s)$. For a policy with weights $\theta$, the expected performance of the policy can be improved by ascending the policy gradient, $\partial J(\theta) / \partial \theta$.
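Let the state-value function associated with the internal policy, $\pi_i$, be $v^{\pi_i}(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid s, \pi_i, f]$, and the state-action value function $q^{\pi_i}(s, e) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid s, e, \pi_i, f]$. We then define the performance function for $\pi_i$ as:
$$J_i(\theta) := \sum_{s \in \mathcal{S}} d_0(s) \, v^{\pi_i}(s). \tag{9}$$
Viewing the embeddings as the action for the agent with policy $\pi_i$, the policy gradient theorem (Sutton et al., 2000) states that the unbiased (Thomas, 2014) gradient of (9) is
$$\frac{\partial J_i(\theta)}{\partial \theta} = \sum_{t=0}^{\infty} \mathbb{E}\left[\gamma^t \int_{\mathcal{E}} q^{\pi_i}(S_t, e) \, \frac{\partial}{\partial \theta} \pi_i(e \mid S_t) \, \mathrm{d}e\right], \tag{10}$$
where the expectation is over states from $d^{\pi}$, as defined by Sutton et al. (2000) (which is not a true distribution, since it is not normalized). The parameters of the internal policy can be learned by iteratively updating them in the direction of $\partial J_i(\theta) / \partial \theta$. Since there are no special constraints on the policy $\pi_i$, any policy gradient algorithm designed for continuous control, such as DPG (Silver et al., 2014), PPO (Schulman et al., 2017), or NAC (Bhatnagar et al., 2009), can be used out of the box.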
However, note that the performance function associated with the overall policy, $\pi_o$ (consisting of the function $f$ and the internal policy parameterized with weights $\theta$), is:
$$J_o(\theta, f) := \sum_{s \in \mathcal{S}} d_0(s) \, v^{\pi_o}(s). \tag{11}$$
The ultimate requirement is the improvement of this overall performance function, $J_o(\theta, f)$, and not just $J_i(\theta)$. So, how useful is it to update the internal policy, $\pi_i$, by following the gradient of its own performance function? The following lemma answers this question.
Lemma 2.
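For all deterministic functions, $f$, which map each point, $e \in \mathbb{R}^d$, in the representation space to an action, $a \in \mathcal{A}$, the expected updates to $\theta$ based on $\partial J_i(\theta) / \partial \theta$ are equivalent to updates based on $\partial J_o(\theta, f) / \partial \theta$. That is,
$$\frac{\partial J_o(\theta, f)}{\partial \theta} = \frac{\partial J_i(\theta)}{\partial \theta}.$$
The proof is deferred to Appendix B. The chosen parameterization for the policy has this special property, which allows $\pi_i$ to be learned using its internal policy gradient. Since this gradient update does not require computing the value of any $\pi_o(a \mid s)$ explicitly, the potentially intractable computation of $f^{-1}$ in (3) required for $\pi_o$ can be avoided. Instead, $\partial J_i(\theta) / \partial \theta$ can be used directly to update the parameters of the internal policy while still optimizing the overall policy's performance, $J_o(\theta, f)$.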
4.4 Learning $\pi_i$ and $f$ Simultaneously
Since the supervised learning procedure for $f$ does not require rewards, a few initial trajectories can contain enough information to begin learning a useful action representation. As more data becomes available, it can be used for fine-tuning and improving the action representations.
4.4.1 Algorithm
We call our algorithm policy gradients with representations for actions (PGRA). PGRA first initializes the parameters in the action representation component by sampling a few trajectories using a random policy and using the supervised loss defined in (8). If additional information is known about the actions, as assumed in prior work (Dulac-Arnold et al., 2015), it can also be considered when initializing the action representations. Optionally, once these action representations are initialized, they can be kept fixed.
Algorithm 1 illustrates the online update procedure for all of the parameters involved. Each time step in the episode is represented by $t$. For each step, an action representation is sampled and is then mapped to an action by $f$. Having executed this action in the environment, the observed reward is then used to update the internal policy, $\pi_i$, using any policy gradient algorithm. Depending on the policy gradient algorithm, if a critic is used then semi-gradients of the TD-error are used to update the parameters of the critic. In other cases, such as REINFORCE (Williams, 1992) where there is no critic, this step can be ignored. The observed transition is then used to update the parameters of $\hat{f}$ and $\hat{g}$ so as to minimize the supervised learning loss (8). In our experiments, this last step uses a stochastic gradient update.
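The online procedure described above can be sketched as a short loop. This is only an illustrative skeleton under strong assumptions, not the authors' implementation: the toy point-mass environment, the linear-Gaussian internal policy, the linear critic, the dot-product form of the action predictor, and all step sizes are hypothetical choices, and the sketch makes no claim about performance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, d = 8, 2
angles = 2 * np.pi * np.arange(n_actions) / n_actions
disp = 0.1 * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # toy dynamics (assumed)

def env_step(s, a):
    """Toy environment: move by disp[a]; reward is negative distance to the origin."""
    s_next = s + disp[a] + 0.01 * rng.normal(size=2)
    return s_next, -np.linalg.norm(s_next)

def features(s):
    return np.array([s[0], s[1], 1.0])

W = np.zeros((d, 3))                        # internal policy mean: W @ features(s)
w_critic = np.zeros(3)                      # linear critic weights
G = 0.1 * rng.normal(size=(4, d))           # maps (s, s') to a predicted embedding
Z = 0.1 * rng.normal(size=(n_actions, d))   # per-action embeddings
sigma, gamma = 0.5, 0.95
lr_actor, lr_critic, lr_rep = 1e-3, 1e-2, 1e-2

def f(e):
    """Deterministic map: the action whose embedding is most similar to e."""
    return int(np.argmax(Z @ e))

def representation_update(s, a, s_next):
    """One stochastic gradient step on the negative log-likelihood of the taken action."""
    global G, Z
    x = np.concatenate([s, s_next])
    e_hat = x @ G
    logits = Z @ e_hat
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[a] -= 1.0                              # gradient of the loss w.r.t. the logits
    grad_Z = np.outer(p, e_hat)
    grad_G = np.outer(x, p @ Z)
    Z -= lr_rep * grad_Z
    G -= lr_rep * grad_G

for episode in range(20):
    s = rng.uniform(-1, 1, size=2)
    for t in range(30):
        phi = features(s)
        mean = W @ phi
        e = mean + sigma * rng.normal(size=d)   # sample a representation
        a = f(e)                                # map it to a discrete action
        s_next, r = env_step(s, a)
        # Critic: semi-gradient TD(0) update.
        delta = r + gamma * w_critic @ features(s_next) - w_critic @ phi
        w_critic += lr_critic * delta * phi
        # Actor: policy gradient step on the internal policy only.
        W += lr_actor * delta * np.outer((e - mean) / sigma**2, phi)
        # Supervised update of the action representation components.
        representation_update(s, a, s_next)
        s = s_next
```

Only `W` is trained with the policy gradient signal; `G` and `Z` are trained purely from observed transitions, mirroring the division of labour described in the text.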
4.4.2 PGRA Convergence
If the action representations are held fixed while learning the internal policy, then as a consequence of Lemma 2, convergence of our algorithm directly follows from previous two-timescale results (Borkar and Konda, 1997; Bhatnagar et al., 2009). Here we show that learning both $\pi_i$ and $f$ simultaneously using our PGRA algorithm can also be shown to converge by using a three-timescale analysis.
Similar to prior work (Bhatnagar et al., 2009; Degris et al., 2012; Konda and Tsitsiklis, 2000), for analysis of the updates to the parameters, $\theta$, of the internal policy, $\pi_i$, we use a projection operator $\Gamma$ that projects any $x$ to a compact set $\mathcal{C}$. We then define an associated vector field operator, $\hat{\Gamma}$, that projects any gradients leading outside the compact region, $\mathcal{C}$, back to $\mathcal{C}$. We refer the reader to Appendix C.3 for precise definitions of these operators and the additional standard assumptions (A3)–(A5). Practically, however, we do not project the iterates to a constraint region as they are seen to remain bounded (without projection).
Theorem 2.
Proof.
(Outline) We consider three learning rate sequences, such that the update recursion for the internal policy is on the slowest timescale, the critic's update recursion is on the fastest, and the action representation module's has an intermediate rate. With this construction, we leverage the three-timescale analysis technique (Borkar, 2009) and prove convergence. The complete proof is in Appendix C. ∎
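One common way to realize separated timescales in stochastic approximation is to use polynomially decaying step sizes with different exponents, so that the ratio of a slower step size to a faster one tends to zero. The sketch below is only an illustration of that idea; the exponents are assumptions chosen here, and the precise conditions required by the convergence proof are those stated in the paper's appendix.

```python
def step_sizes(t):
    """Return (critic, representation, actor) step sizes at iteration t >= 1.

    Each exponent lies in (0.5, 1], so each sequence sums to infinity while its
    squares sum to a finite value. The exponents are illustrative assumptions.
    """
    critic = 1.0 / t ** 0.6           # fastest timescale
    representation = 1.0 / t ** 0.8   # intermediate timescale
    actor = 1.0 / t                   # slowest timescale
    return critic, representation, actor

def ratios(t):
    """Ratios of slower to faster step sizes; both should shrink as t grows."""
    critic, representation, actor = step_sizes(t)
    return representation / critic, actor / representation

early = ratios(10)
late = ratios(10_000)
```

With these choices both ratios decay like $t^{-0.2}$, so from the point of view of each faster recursion the slower parameters appear almost stationary.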
5 Empirical Analysis
A core motivation of this work is to provide an algorithm that can be used as a drop-in extension for improving the action generalization capabilities of existing policy gradient methods for problems with large action spaces. We consider two standard policy gradient methods in our experiments: actor-critic (AC) and deterministic policy gradient (DPG) (Silver et al., 2014). Just like previous algorithms, we also ignore the $\gamma^t$ terms and perform the biased policy gradient update to be practically more sample efficient (Thomas, 2014). We believe that the reported results can be further improved by using the proposed method with other policy gradient methods; we leave this for future work. For a detailed discussion of the parameterization of the function approximators and the hyperparameter search, see Appendix D.
5.1 Domains
Maze:
As a proof-of-concept, we constructed a continuous-state maze environment where the state comprised the coordinates of the agent's current location. The agent has $n$ equally spaced actuators (each actuator moves the agent in the direction the actuator is pointing towards) around it, and it can choose whether each actuator should be on or off. Therefore, the size of the action set is exponential in the number of actuators, that is, $|\mathcal{A}| = 2^n$. The net outcome of an action is the vectorial summation of the displacements associated with the selected actuators. The agent is rewarded with a small penalty for each time step, and a reward of is given upon reaching the goal position. To make the problem more challenging, random noise was added to the action of the time and the maximum episode length was steps.
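The combinatorial action set of this maze can be enumerated directly for a small number of actuators. The sketch below assumes, purely for illustration, that each actuator contributes a unit displacement in its own direction; the actual displacement magnitude used in the experiments is not specified here.

```python
import itertools
import numpy as np

def build_action_set(n_actuators):
    """Enumerate all on/off actuator combinations and their net displacements."""
    angles = 2 * np.pi * np.arange(n_actuators) / n_actuators
    # One unit vector per actuator (unit magnitude is an assumption).
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Every binary on/off pattern is one discrete action: 2**n_actuators in total.
    actions = np.array(list(itertools.product([0, 1], repeat=n_actuators)))
    # Net displacement is the vector sum of the selected actuators.
    displacements = actions @ directions
    return actions, displacements

actions, displacements = build_action_set(4)
```

Many distinct on/off patterns yield the same or nearly the same displacement (for instance, switching on two opposing actuators cancels out), which is the kind of structure an action representation can exploit.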
This environment is a useful test bed as it requires solving a long horizon task in an MDP with a large action set and a single goal reward. Further, we know the Cartesian representation for each of the actions, and can thereby use it to visualize the learned representation, as shown in Figure 4.
Real-world recommender systems:
We consider two real-world applications of recommender systems that require decision making over multiple time steps.
The first is Adobe HelpX, a web-based video-tutorial platform, which has a recommendation engine that suggests a series of tutorial videos on various Adobe software products. The aim is to meaningfully engage the users in learning how to use these software products and convert novice users into experts in their respective areas of interest. The tutorial suggestion at each time step is made from a large pool of available tutorial videos on several products.
The second application is Adobe Photoshop, a professional multimedia editing software. Modern multimedia editing software often contains many tools that can be used to manipulate the media, and this wealth of options can be overwhelming for users. In this Adobe Photoshop domain, an agent suggests which of the available tools the user may want to use next. The objective is to increase user productivity and assist in achieving their end goal.
For both of these applications, an existing log of users' click-stream data was used to create an n-gram based MDP model for user behavior (Shani et al., 2005). In the Adobe HelpX tutorial recommendation task, user activity for a three-month period was observed. Sequences of user interaction were aggregated to obtain over million clicks. Similarly, for a month-long duration, sequential usage patterns of the tools in the Adobe Photoshop software were collected to obtain a total of over billion user clicks. Tutorials and tools that had less than clicks in total were discarded. The remaining tutorials and tools for the HelpX platform and Adobe Photoshop, respectively, were used to create the action set for the MDP model.
The state for the MDP consists of the feature descriptors associated with each item (tutorial or tool) in the current n-gram. Rewards were chosen based on a surrogate measure for difficulty level of tutorials on the HelpX portal and popularity of final outcomes of user interactions in Photoshop, respectively. Since such data is sparse, only of the items had rewards associated with them, and the maximum reward for any item was .
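The core counting step behind an n-gram model of user behavior can be sketched as follows. This is a simplified illustration, not the construction used for the experiments: the session data and item names are made up, the state is just the tuple of the last `n` clicked items, and the effect of the recommendation (the action) on the transition, which a full MDP model would need, is omitted.

```python
from collections import Counter, defaultdict

def build_ngram_model(sessions, n=2):
    """Estimate P(next item | last n items) from click sequences by counting."""
    counts = defaultdict(Counter)
    for clicks in sessions:
        for i in range(len(clicks) - n):
            state = tuple(clicks[i:i + n])       # the current n-gram
            counts[state][clicks[i + n]] += 1    # the item clicked next
    model = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        model[state] = {item: c / total for item, c in next_counts.items()}
    return model

# Hypothetical click sessions, for illustration only.
sessions = [
    ["crop", "resize", "export"],
    ["crop", "resize", "filter"],
    ["crop", "resize", "export"],
]
model = build_ngram_model(sessions, n=2)
```

In this toy data the state `("crop", "resize")` is followed by `"export"` in two of three sessions, so the estimated transition probabilities are 2/3 and 1/3.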
Often the problem of recommendation is formulated as a contextual bandit or collaborative filtering problem, but as shown by Theocharous et al. (2015) these approaches fail to capture the long-term value of the prediction. Solving this problem for a longer time horizon with a large number of actions (tutorials/tools) makes this real-life problem a useful and challenging domain for RL algorithms.
5.2 Results
Visualizing the learned action representations
To understand the internal working of our proposed algorithm, we present visualizations of the learned action representations on the maze domain. A pictorial illustration of the environment is provided in Figure 4. Here, the underlying structure in the set of actions is related to the displacements in the Cartesian coordinates. This provides an intuitive base case against which we can compare our results.
In Figure 4, we provide a comparison between the action representations learned using our algorithm and the underlying Cartesian representation of the actions. It can be seen that the proposed method extracts useful structure in the action space. Actions which correspond to settings where the actuators on opposite sides of the agent are selected result in relatively small displacements of the agent. These are the ones in the center of the plot. In contrast, maximum displacement in any direction is caused by only selecting actuators facing in that particular direction. Actions corresponding to those are at the edge of the representation space. The smooth color transition indicates that not only the magnitude of displacement but also the direction of displacement is represented. Therefore, the learned representations efficiently preserve the relative transition information among all the actions. To make the exploration step tractable in the internal policy, $\pi_i$, we bound the representation space along each dimension to the range $[-1, 1]$ using a tanh non-linearity. This results in 'squashing' of these representations around the edge of this range.
Performance Improvement
The plots in Figure 5 for the Maze domain show how the performance of the standard actor-critic (AC) method deteriorates as the number of actions increases, even though the goal remains the same. However, with the addition of an action representation module it is able to capture the underlying structure in the action space and consistently perform well across all settings. Similarly, for both Adobe HelpX and Adobe Photoshop MDPs, standard AC methods fail to reason over longer time horizons under such an overwhelming number of actions, choosing mostly one-step actions that have high returns. In comparison, instances of our proposed algorithm are not only able to achieve significantly higher return, up to and in the respective tasks, but they do so much quicker. These results reinforce our claim that learning action representations allows implicit generalization of feedback to other actions embedded in proximity to the executed action.
Further, under the PGRA algorithm, only a fraction of the total parameters, the ones in the internal policy, are learned using the high-variance policy gradient updates. The other set of parameters associated with action representations are learned by a supervised learning procedure. This reduces the variance of updates significantly, thereby making the PGRA algorithms learn a better policy faster. This is evident from the plots in Figure 5. These advantages allow the internal policy, $\pi_i$, to quickly approximate an optimal policy without succumbing to the curse of large action sets.
6 Summary and Future Work
In this paper, we built upon the core idea of leveraging the structure in the space of actions and showed its importance for enhancing generalization over large action sets in real-world, large-scale applications. Our approach has three key advantages. (a) Simplicity: by simply using the observed transitions, an additional supervised update rule can be used to learn action representations. (b) Theory: we showed that the proposed overall policy class can represent an optimal policy and derived the associated learning procedures for its parameters. (c) Extensibility: as the PGRA algorithm indicates, our approach can be easily extended using other policy gradient methods to leverage additional advantages, while preserving the convergence guarantees.
An interesting future direction would be to extend these results to capture the structure of a high-dimensional continuous action space in a lower-dimensional representation space as well. Unlike a finite set of actions, which can be embedded in a continuous space, the key challenge is that learning a lower-dimensional space for continuous actions inevitably results in the inability to represent some sections of the original action space.
References
 Akata et al. (2016) Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Labelembedding for image classification. IEEE transactions on pattern analysis and machine intelligence, 38(7):1425–1438, 2016.

Bellemare et al. (2013)
M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling.
The arcade learning environment: An evaluation platform for general
agents.
Journal of Artificial Intelligence Research
, 47:253–279, 2013.  Bhatnagar et al. (2009) S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
 Borkar (2009) V. S. Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
 Borkar and Konda (1997) V. S. Borkar and V. R. Konda. The actorcritic algorithm as multitimescale stochastic approximation. Sadhana, 22(4):525–543, 1997.
 Cui and Khardon (2016) H. Cui and R. Khardon. Online symbolic gradientbased optimization for factored action mdps. In IJCAI, pages 3075–3081, 2016.
 Cui and Khardon (2018) H. Cui and R. Khardon. Lifted stochastic planning, belief propagation and marginal MAP. In The Workshops of the The ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 Degris et al. (2012) T. Degris, M. White, and R. S. Sutton. Offpolicy actorcritic. arXiv preprint arXiv:1205.4839, 2012.
 DulacArnold et al. (2015) G. DulacArnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.

 Ghahramani (2001) Z. Ghahramani. An introduction to hidden Markov models and Bayesian networks. International journal of pattern recognition and artificial intelligence, 15(01):9–42, 2001.
 Glavic et al. (2017) M. Glavic, R. Fonteneau, and D. Ernst. Reinforcement learning for electric power system decision and control: Past considerations and perspectives. IFAC-PapersOnLine, 50(1):6918–6927, 2017.

 Haeffele and Vidal (2017) B. D. Haeffele and R. Vidal. Global optimality in neural network training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7331–7339, 2017.
 Ijspeert et al. (2003) A. J. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives. In Advances in neural information processing systems, pages 1547–1554, 2003.
 Jaderberg et al. (2016) M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Jiang et al. (2017) Z. Jiang, D. Xu, and J. Liang. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059, 2017.
 Jing et al. (2004) J. Jing, E. C. Cropper, I. Hurwitz, and K. R. Weiss. The construction of movement with behavior-specific and behavior-independent modules. Journal of Neuroscience, 24(28):6315–6325, 2004.
 Kawaguchi (2016) K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
 Kober and Peters (2009a) J. Kober and J. Peters. Learning motor primitives for robotics. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 2112–2118. IEEE, 2009a.
 Kober and Peters (2009b) J. Kober and J. R. Peters. Policy search for motor primitives in robotics. In Advances in neural information processing systems, pages 849–856, 2009b.
 Konda and Tsitsiklis (2000) V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
 Konidaris et al. (2011) G. Konidaris, S. Osentoski, and P. S. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In AAAI, volume 6, page 7, 2011.
 Lemay and Grill (2004) M. A. Lemay and W. M. Grill. Modularity of motor output evoked by intraspinal microstimulation in cats. Journal of neurophysiology, 91(1):502–514, 2004.
 Mikolov et al. (2013) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 Mussa-Ivaldi and Bizzi (2000) F. A. Mussa-Ivaldi and E. Bizzi. Motor learning through the combination of primitives. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 355(1404):1755–1769, 2000.

 Pathak et al. (2017) D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
 Pazis and Parr (2011) J. Pazis and R. Parr. Generalized value functions for large action sets. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1185–1192, 2011.
 Sallans and Hinton (2004) B. Sallans and G. E. Hinton. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5(Aug):1063–1088, 2004.
 Schaal (2006) S. Schaal. Dynamic movement primitives - a framework for motor control in humans and humanoid robotics. In Adaptive motion of animals and machines, pages 261–280. Springer, 2006.
 Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Shani et al. (2005) G. Shani, D. Heckerman, and R. I. Brafman. An MDP-based recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.
 Sharma et al. (2017) S. Sharma, A. Suresh, R. Ramesh, and B. Ravindran. Learning to factor policies and action-value functions: Factored action space representations for deep reinforcement learning. arXiv preprint arXiv:1705.07269, 2017.
 Shelhamer et al. (2016) E. Shelhamer, P. Mahmoudieh, M. Argus, and T. Darrell. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
 Sidney et al. (2005) K. D. Sidney, S. D. Craig, B. Gholson, S. Franklin, R. Picard, and A. C. Graesser. Integrating affect sensors in an intelligent tutoring system. In Affective Interactions: The Computer in the Affective Loop Workshop at, pages 7–13, 2005.
 Silver et al. (2014) D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
 Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Sutton et al. (2000) R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Theocharous et al. (2015) G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Ad recommendation systems for lifetime value optimization. In Proceedings of the 24th International Conference on World Wide Web, pages 1305–1310. ACM, 2015.
 Thomas (2014) P. Thomas. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pages 441–448, 2014.
 Thomas (2011) P. S. Thomas. Policy gradient coagent networks. In Advances in Neural Information Processing Systems, pages 1944–1952, 2011.
 Thomas and Barto (2011) P. S. Thomas and A. G. Barto. Conjugate Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 137–144, 2011.
 Thomas and Barto (2012) P. S. Thomas and A. G. Barto. Motor primitive discovery. In Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on, pages 1–8. IEEE, 2012.
 Tsitsiklis and Van Roy (1996) J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. Technical Report LIDS-P-2322, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, 1996.
 Van Hasselt and Wiering (2009) H. Van Hasselt and M. A. Wiering. Using continuous action spaces to solve discrete problems. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 1149–1156. IEEE, 2009.
 Williams (1992) R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
 Zhu et al. (2018) Z. Zhu, D. Soudry, Y. C. Eldar, and M. B. Wakin. The global optimization geometry of shallow linear neural networks. arXiv preprint arXiv:1805.04938, 2018.
Appendix
Appendix A Proof of Lemma 1
Lemma 1.
Proof.
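The Bellman equation associated with a policy, $\pi$, for any MDP, $\mathcal{M}$, is:
\begin{align}
v^{\pi}(s) &= \sum_{a \in \mathcal{A}} \pi(a|s)\, q^{\pi}(s,a) \tag{13} \\
&= \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[ R(s,a) + \gamma v^{\pi}(s') \right]. \tag{14}
\end{align}
$G$ is used to denote $[R(s,a) + \gamma v^{\pi}(s')]$ hereafter. Rearranging terms in the Bellman equation,
\begin{align}
v^{\pi}(s) &= \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \pi(a|s)\, P(s'|s,a)\, G \tag{15} \\
&= \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \pi(a|s)\, \frac{P(s',s,a)}{P(s,a)}\, G \tag{16} \\
&= \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \pi(a|s)\, \frac{P(s',s,a)}{\pi(a|s)\, P(s)}\, G \tag{17} \\
&= \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \frac{P(s',s,a)}{P(s)}\, G. \tag{18}
\end{align}
Using the law of total probability, we introduce a new variable $e$ such that:
\begin{align}
v^{\pi}(s) &= \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} \frac{P(s',s,a,e)}{P(s)}\, G \, \mathrm{d}e. \tag{19}
\end{align}
(Footnote 1: Note that $a$ and $e$ are from a joint distribution over a discrete and a continuous random variable. For simplicity, we avoid measure-theoretic notation for their joint probability.)
After multiplying and dividing by $P(e|s)$, we have:
\begin{align}
v^{\pi}(s) &= \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} P(e|s)\, \frac{P(s',s,a,e)}{P(s)\, P(e|s)}\, G \, \mathrm{d}e \tag{20} \\
&= \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} P(e|s)\, \frac{P(s',s,a,e)}{P(s,e)}\, G \, \mathrm{d}e \tag{21} \\
&= \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} P(e|s)\, P(s',a|s,e)\, G \, \mathrm{d}e \tag{22} \\
&= \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \int_{e} P(e|s)\, P(s'|s,a,e)\, P(a|s,e)\, G \, \mathrm{d}e \tag{23} \\
&= \sum_{a \in \mathcal{A}} \int_{e} P(e|s) \sum_{s' \in \mathcal{S}} P(s'|s,a,e)\, P(a|s,e)\, G \, \mathrm{d}e. \tag{24}
\end{align}
Since the transition to the next state, $s'$, is conditionally independent of $e$, given the previous state, $s$, and the action taken, $a$,
\begin{align}
v^{\pi}(s) &= \sum_{a \in \mathcal{A}} \int_{e} P(e|s) \sum_{s' \in \mathcal{S}} P(s'|s,a)\, P(a|s,e)\, G \, \mathrm{d}e. \tag{25}
\end{align}
Similarly, using the Markov property, action $a$ is conditionally independent of $s$ given $e$,
\begin{align}
v^{\pi}(s) &= \sum_{a \in \mathcal{A}} \int_{e} P(e|s) \sum_{s' \in \mathcal{S}} P(s'|s,a)\, P(a|e)\, G \, \mathrm{d}e. \tag{26}
\end{align}
As $P(a|e)$ evaluates to one for representations, $e$, that map to $a$ and zero for others (Assumption A2),
\begin{align}
v^{\pi}(s) &= \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} P(e|s) \sum_{s' \in \mathcal{S}} P(s'|s,a)\, G \, \mathrm{d}e \tag{27} \\
&= \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} P(e|s)\, q^{\pi}(s,a) \, \mathrm{d}e. \tag{28}
\end{align}
In (28), note that the probability density, $P(e|s)$, is the internal policy, $\pi_i(e|s)$. Therefore,
\begin{align}
v^{\pi}(s) &= \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i(e|s)\, q^{\pi}(s,a) \, \mathrm{d}e. \tag{29}
\end{align}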
∎
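As a quick sanity check of the identity that this proof arrives at, the sketch below evaluates both sides numerically for a single fixed state. Everything in it is a hypothetical toy chosen for illustration and is not taken from the paper: a one-dimensional representation space [0, 1), an internal policy with density 2e, a mapping `f` that splits the interval into `K` equal bins (one per action), and arbitrary action values `Q`.

```python
# Toy check, for one fixed state s, that
#   sum_a pi_o(a|s) q(s,a)  ==  sum_a int_{f^{-1}(a)} pi_i(e|s) q(s,a) de.
# All quantities below are illustrative assumptions.

K = 4                          # number of discrete actions (assumed)
Q = [1.0, -0.5, 2.0, 0.25]     # assumed action values q(s, a)


def internal_density(e: float) -> float:
    """Assumed internal policy density pi_i(e|s) = 2e on [0, 1)."""
    return 2.0 * e


def f(e: float) -> int:
    """Assumed deterministic map from a representation e to an action."""
    return min(K - 1, int(e * K))


def overall_policy(a: int) -> float:
    """pi_o(a|s): closed-form integral of 2e over the bin [a/K, (a+1)/K)."""
    lo, hi = a / K, (a + 1) / K
    return hi * hi - lo * lo


def value_from_actions() -> float:
    """Sum over actions of pi_o(a|s) q(s,a)."""
    return sum(overall_policy(a) * Q[a] for a in range(K))


def value_from_representations(n: int = 4000) -> float:
    """Midpoint-rule integral over e of pi_i(e|s) q(s, f(e))."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        e = (i + 0.5) * h
        total += internal_density(e) * Q[f(e)] * h
    return total


if __name__ == "__main__":
    print(value_from_actions(), value_from_representations())
```

Because the assumed density is linear within each bin and `n` is a multiple of `K`, the midpoint rule is exact up to floating-point rounding, so the two printed values should agree closely.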
Appendix B Proof of Lemma 2
Lemma 2.
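For all deterministic functions, $f$, which map each point, $e$, in the representation space to an action, $a$, the expected updates to $\theta$ based on $\frac{\partial J_i(\theta)}{\partial \theta}$ are equivalent to updates based on $\frac{\partial J_o(\theta, f)}{\partial \theta}$. That is,
$$\frac{\partial J_o(\theta, f)}{\partial \theta} = \frac{\partial J_i(\theta)}{\partial \theta}.$$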
Proof.
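Recall from (3) that the probability of an action given by the overall policy, $\pi_o$, is
\begin{align}
\pi_o(a|s) &= \int_{f^{-1}(a)} \pi_i(e|s) \, \mathrm{d}e. \tag{30}
\end{align}
Using Lemma 1, we express the performance function of the overall policy, $\pi_o$, as:
\begin{align}
J_o(\theta, f) &= \sum_{s \in \mathcal{S}} d_0(s)\, v^{\pi_o}(s) \tag{31} \\
&= \sum_{s \in \mathcal{S}} d_0(s) \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i(e|s)\, q^{\pi_o}(s,a) \, \mathrm{d}e. \tag{32}
\end{align}
The gradient of the performance function is therefore
\begin{align}
\frac{\partial J_o(\theta, f)}{\partial \theta} &= \frac{\partial}{\partial \theta} \left[ \sum_{s \in \mathcal{S}} d_0(s) \sum_{a \in \mathcal{A}} \int_{f^{-1}(a)} \pi_i(e|s)\, q^{\pi_o}(s,a) \, \mathrm{d}e \right]. \tag{33}
\end{align}
Using the policy gradient theorem [Sutton et al., 2000] for the overall policy, $\pi_o$, the partial derivative of $J_o(\theta, f)$ w.r.t. $\theta$ is,
\begin{align}
\frac{\partial J_o(\theta, f)}{\partial \theta} &= \sum_{t=0}^{\infty} \mathbb{E}\left[ \sum_{a \in \mathcal{A}} \gamma^t q^{\pi_o}(S_t, a)\, \frac{\partial}{\partial \theta} \pi_o(a|S_t) \right] \tag{34} \\
&= \sum_{t=0}^{\infty} \mathbb{E}\left[ \sum_{a \in \mathcal{A}} \gamma^t q^{\pi_o}(S_t, a)\, \frac{\partial}{\partial \theta} \int_{f^{-1}(a)} \pi_i(e|S_t) \, \mathrm{d}e \right] \tag{35} \\
&= \sum_{t=0}^{\infty} \mathbb{E}\left[ \sum_{a \in \mathcal{A}} \gamma^t \int_{f^{-1}(a)} q^{\pi_o}(S_t, a)\, \frac{\partial}{\partial \theta} \pi_i(e|S_t) \, \mathrm{d}e \right]. \tag{36}
\end{align}
Note that since $e$ deterministically maps to $a$, $q^{\pi_o}(S_t, a) = q^{\pi_i}(S_t, e)$. Therefore,
\begin{align}
\frac{\partial J_o(\theta, f)}{\partial \theta} &= \sum_{t=0}^{\infty} \mathbb{E}\left[ \sum_{a \in \mathcal{A}} \gamma^t \int_{f^{-1}(a)} q^{\pi_i}(S_t, e)\, \frac{\partial}{\partial \theta} \pi_i(e|S_t) \, \mathrm{d}e \right]. \tag{37}
\end{align}
Finally, since each $e$ is mapped to a unique action by the function $f$, the nested summation over $a$ and its inner integral over $f^{-1}(a)$ can be replaced by an integral over the entire domain of $e$. Hence,
\begin{align}
\frac{\partial J_o(\theta, f)}{\partial \theta} &= \sum_{t=0}^{\infty} \mathbb{E}\left[ \gamma^t \int_{e} q^{\pi_i}(S_t, e)\, \frac{\partial}{\partial \theta} \pi_i(e|S_t) \, \mathrm{d}e \right] \tag{38} \\
&= \sum_{t=0}^{\infty} \mathbb{E}\left[ \gamma^t \int_{e} \pi_i(e|S_t)\, q^{\pi_i}(S_t, e)\, \frac{\partial \ln \pi_i(e|S_t)}{\partial \theta} \, \mathrm{d}e \right] \tag{39} \\
&= \frac{\partial J_i(\theta)}{\partial \theta}. \tag{40}
\end{align}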
∎
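The gradient equivalence in Lemma 2 can be illustrated numerically on a one-step, single-state problem. The sketch below is a hypothetical toy and not the paper's setup: the internal policy is assumed to be a unit-variance Gaussian over a scalar representation with mean `theta`, the map `f` sends negative representations to action 0 and all others to action 1, and the action values `Q` are arbitrary. Under these assumptions the overall-policy gradient has a closed form, and the internal-policy gradient is computed by quadrature over the representation space.

```python
import math

Q = (1.0, 3.0)   # assumed action values q(a) for a one-step, single-state problem


def f(e: float) -> int:
    """Assumed deterministic map: negative representations -> action 0, else 1."""
    return 0 if e < 0.0 else 1


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def overall_gradient(theta: float) -> float:
    """d/dtheta of J_o = sum_a pi_o(a) q(a), using pi_o(1) = Phi(theta)."""
    return normal_pdf(theta) * (Q[1] - Q[0])


def internal_gradient(theta: float, n: int = 40000, width: float = 10.0) -> float:
    """Integral over e of q(f(e)) * d pi_i(e)/d theta, with pi_i = N(theta, 1)."""
    h = 2.0 * width / n
    total = 0.0
    for i in range(n):
        e = theta - width + (i + 0.5) * h
        # Derivative of the Gaussian density N(e; theta, 1) w.r.t. its mean.
        d_density = normal_pdf(e - theta) * (e - theta)
        total += Q[f(e)] * d_density * h
    return total


if __name__ == "__main__":
    for theta in (-1.0, 0.3, 2.0):
        print(theta, overall_gradient(theta), internal_gradient(theta))
```

The two columns should agree up to the quadrature error, which is dominated by the jump in `Q[f(e)]` at e = 0.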
Appendix C Convergence of PGRA
To analyze the convergence of PGRA, we first briefly review existing two-timescale convergence results for actor-critics. Afterwards, we present a general setup for stochastic recursions of three dependent parameter sequences. The asymptotic behavior of the system is then discussed using three different timescales, by adapting existing multi-timescale results by Borkar [2009]. This lays the foundation for our subsequent convergence proof. Finally, we use a three-timescale approach to prove convergence of the PGRA method, which extends standard actor-critic algorithms with a new action prediction module. This proof technique is not a novel contribution of the work; we leverage and extend the existing convergence results of actor-critic algorithms [Borkar and Konda, 1997] for our algorithm.
C.1 Actor-Critic Convergence Using Two Timescales
In actor-critic algorithms, the updates to the policy depend upon a critic that can estimate the value function associated with the policy at that particular instant. One way to get a good value function is to fix the policy temporarily and update the critic in an inner loop that uses transitions drawn using only that fixed policy. While this is a sound approach, it requires a possibly long time between successive updates to the policy parameters and is severely sample-inefficient. Two-timescale stochastic approximation methods [Bhatnagar et al., 2009, Konda and Tsitsiklis, 2000] circumvent this difficulty. The faster update recursion for the critic ensures that asymptotically it is always a close approximation to the required value function before the next update to the policy is made.
C.2 Three-Timescale Setup
In our proposed algorithm, to update the action prediction module, one could also have considered an inner loop that uses transitions drawn using the fixed policy for supervised updates. Instead, to make such a procedure converge faster, we extend the existing two-timescale actor-critic results and take a three-timescale approach.
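Consider the following system of coupled stochastic recursions:
\begin{align}
X_{t+1} &= X_t + \alpha^x_t \left( F_x(X_t, Y_t, Z_t) + N^1_{t+1} \right), \tag{41} \\
Y_{t+1} &= Y_t + \alpha^y_t \left( F_y(Y_t, Z_t) + N^2_{t+1} \right), \tag{42} \\
Z_{t+1} &= Z_t + \alpha^z_t \left( F_z(X_t, Y_t, Z_t) + N^3_{t+1} \right), \tag{43}
\end{align}
where $F_x$, $F_y$, and $F_z$ are Lipschitz continuous functions and $\{N^1_t\}$, $\{N^2_t\}$, $\{N^3_t\}$ are the associated martingale difference sequences for noise w.r.t. the increasing $\sigma$-fields $\mathcal{F}_t = \sigma(X_n, Y_n, Z_n, N^1_n, N^2_n, N^3_n,\ n \le t)$, satisfying
$$\mathbb{E}\left[ \|N^i_{t+1}\|^2 \,\middle|\, \mathcal{F}_t \right] \le D \left( 1 + \|X_t\|^2 + \|Y_t\|^2 + \|Z_t\|^2 \right)$$
for $i = 1, 2, 3$, $t \ge 0$, and any constant $D < \infty$ such that the quadratic variation of noise is always bounded. To study the asymptotic behavior of the system, consider the following standard assumptions,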
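Assumption B1 (Boundedness).
$\sup_t \left( \|X_t\| + \|Y_t\| + \|Z_t\| \right) < \infty$, almost surely.
Assumption B2 (Learning rate schedule).
The learning rates $\alpha^x_t$, $\alpha^y_t$, and $\alpha^z_t$ satisfy:
\begin{gather}
\sum_t \alpha^x_t = \infty, \quad \sum_t \alpha^y_t = \infty, \quad \sum_t \alpha^z_t = \infty, \tag{44} \\
\sum_t (\alpha^x_t)^2 < \infty, \quad \sum_t (\alpha^y_t)^2 < \infty, \quad \sum_t (\alpha^z_t)^2 < \infty, \tag{45} \\
\text{as } t \to \infty: \quad \frac{\alpha^z_t}{\alpha^y_t} \to 0, \quad \frac{\alpha^y_t}{\alpha^x_t} \to 0. \tag{46}
\end{gather}
Assumption B3 (Existence of stationary point for Y).
The following ODE has a globally asymptotically stable equilibrium $\mu_1(Z)$, where $\mu_1(\cdot)$ is a Lipschitz continuous function.
\begin{gather}
\dot{Y}(t) = F_y(Y(t), Z). \tag{47}
\end{gather}
Assumption B4 (Existence of stationary point for X).
The following ODE has a globally asymptotically stable equilibrium $\mu_2(Y, Z)$, where $\mu_2(\cdot, \cdot)$ is a Lipschitz continuous function.
\begin{gather}
\dot{X}(t) = F_x(X(t), Y, Z). \tag{48}
\end{gather}
Assumption B5 (Existence of stationary point for Z).
The following ODE has a globally asymptotically stable equilibrium $Z^\star$,
\begin{gather}
\dot{Z}(t) = F_z\left( \mu_2(\mu_1(Z(t)), Z(t)),\ \mu_1(Z(t)),\ Z(t) \right). \tag{49}
\end{gather}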
Assumptions B1–B2 are required to bound the values of the parameter sequence and make the learning rate well-conditioned, respectively. Assumptions B3–B4 ensure that there exists a global stationary point for the respective recursions, individually, when other parameters are held constant. Finally, Assumption B5 ensures that there exists a global stationary point for the update recursion associated with $Z$, if between each successive update to $Z$, $X$ and $Y$ have converged to their respective stationary points.
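To make the learning rate requirement concrete, the sketch below shows one family of schedules that would satisfy it, assuming Assumption B2 takes the usual form for multi-timescale stochastic approximation: each step-size sequence is non-summable but square-summable, and the ratio of each slower rate to the next faster one vanishes. The polynomial exponents are hypothetical choices for illustration, not values from the paper. Any exponent in (0.5, 1] gives a non-summable, square-summable sequence, and strictly increasing exponents drive the ratios to zero.

```python
from typing import Tuple


def step_sizes(t: int) -> Tuple[float, float, float]:
    """Hypothetical schedules (alpha_x, alpha_y, alpha_z) for t >= 1.

    alpha_x is the fastest (largest) rate and alpha_z the slowest.
    """
    return t ** -0.6, t ** -0.8, t ** -1.0


def timescale_ratios(t: int) -> Tuple[float, float]:
    """Ratios alpha_y/alpha_x and alpha_z/alpha_y; both shrink as t grows."""
    ax, ay, az = step_sizes(t)
    return ay / ax, az / ay


if __name__ == "__main__":
    for t in (1, 10, 1_000, 100_000):
        print(t, timescale_ratios(t))
```

With these exponents both ratios equal t^(-0.2), so the separation between timescales grows slowly; larger gaps between exponents separate the timescales faster at the cost of a slower slowest iterate.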
Proof.
We adapt the multi-timescale analysis by Borkar [2009] to analyze the above system of equations using three timescales. First we present an intuitive explanation and then we formalize the results.
Since these three updates are not independent at each time step, we consider three step-size schedules: $\{\alpha^x_t\}$, $\{\alpha^y_t\}$, and $\{\alpha^z_t\}$, which satisfy Assumption B2. As a consequence of (46), the recursion (42) is ‘faster’ than (43), and (41) is ‘faster’ than both (42) and (43). In other words, $Z$ moves on the slowest timescale and $X$ moves on the fastest. Such a timescale is desirable since $Z$ converges to its stationary point if at each time step the value of the corresponding converged $X$ and $Y$ estimates are used to make the next update (Assumption B5).
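To elaborate on the previous points, first consider the ODEs:
\begin{align}
\dot{Y}(t) &= F_y(Y(t), Z(t)), \tag{50} \\
\dot{Z}(t) &= 0. \tag{51}
\end{align}
Alternatively, one can consider the ODE
\begin{align}
\dot{Y}(t) &= F_y(Y(t), Z), \tag{52}
\end{align}
in place of (50), because $Z$ is fixed (51). Now, under Assumption B3 we know that the iterative update (42) performed on $Y$, with a fixed $Z$, will eventually converge to a corresponding stationary point.
Now, with this converged $Y$, consider the following ODEs:
\begin{align}
\dot{X}(t) &= F_x(X(t), Y(t), Z(t)), \tag{53} \\
\dot{Y}(t) &= 0, \tag{54} \\
\dot{Z}(t) &= 0. \tag{55}
\end{align}
Alternatively, one can consider the ODE
\begin{align}
\dot{X}(t) &= F_x(X(t), Y, Z), \tag{56}
\end{align}
in place of (53), as $Y$ and $Z$ are fixed (54)–(55). As a consequence of Assumption B4, $X$ converges when both $Y$ and $Z$ are held fixed.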
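The nested-fixed-point picture above can be seen in a small simulation. The sketch below is a toy under stated assumptions, not the PGRA update: three scalar iterates with hypothetical linear drift functions and polynomial step sizes, where `x` is the fastest iterate, `y` the intermediate one, and `z` the slowest. For fixed `z`, `y` is pulled toward 2z; for fixed `(y, z)`, `x` is pulled toward y + z; and once `x` tracks 3z, the slow iterate is pulled toward 1/3. The joint fixed point of these assumed drifts is therefore (x, y, z) = (1, 2/3, 1/3).

```python
import random
from typing import Tuple


def simulate(steps: int = 50_000, noise_std: float = 0.1,
             seed: int = 0) -> Tuple[float, float, float]:
    """Run a toy three-timescale recursion and return the final (x, y, z)."""
    rng = random.Random(seed)
    x = y = z = 0.0
    for t in range(1, steps + 1):
        # Assumed schedules: x is fastest, y intermediate, z slowest.
        ax, ay, az = t ** -0.6, t ** -0.8, t ** -1.0
        # Assumed linear drifts with a unique stable joint fixed point.
        fx = (y + z) - x      # for fixed (y, z), x is pulled to y + z
        fy = 2.0 * z - y      # for fixed z, y is pulled to 2z
        fz = 1.0 - x          # with x tracking 3z, z is pulled to 1/3
        # Simultaneous noisy update of all three iterates.
        x, y, z = (
            x + ax * (fx + rng.gauss(0.0, noise_std)),
            y + ay * (fy + rng.gauss(0.0, noise_std)),
            z + az * (fz + rng.gauss(0.0, noise_std)),
        )
    return x, y, z


if __name__ == "__main__":
    print(simulate())
```

All three iterates are updated at every step from the same transition-like noise source, with no inner loops; only the relative step sizes enforce the ordering in which they effectively converge.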
Intuitively, as a result of Assumption B2, in the limit, the learning rate $\alpha^z_t$ becomes very small relative to $\alpha^y_t$. This makes $Z$ ‘quasi-static’ compared to $Y$ and has an effect similar to fixing $Z$ and running the iteration (42) forever to converge at $\mu_1(Z)$. Similarly, both $\alpha^y_t$ and $\alpha^z_t$ become very small relative to $\alpha^x_t$. Therefore, both $Y$ and $Z$ are ‘quasi-static’ compared to the critic, which has an effect similar to fixing $Y$ and