Introduction
Deep reinforcement learning (DRL), which combines reinforcement learning algorithms and deep neural networks, has achieved great success in many domains, such as playing Atari games [21], playing the game of Go [24] and robotic control [17]. Although DRL is viewed as one of the most promising routes to general artificial intelligence, it is still criticized for its data inefficiency. Training an agent from scratch requires a considerable number of interactions with the environment for a very specific task. Moreover, some researchers point out that DRL algorithms are likely to overfit to the training environment [3]: once the configuration of the environment changes, the learned policy no longer works and needs to be retrained. One approach to this problem in the DRL domain is transfer learning (TL) [26], which makes use of the knowledge gained while solving one task and applies it to a different but related task, aiming to reduce the consumption of samples and improve performance.
Several methods have been proposed to transfer various kinds of knowledge across tasks [27, 9, 4]. However, all of these works assume that the source and target tasks share the same action and state spaces; specifically, the sizes of both the action and state spaces are consistent and well aligned. This assumption does not hold in many tasks. For example, in MOBA games different heroes have their own unique skills and state representations, so they are usually modeled by individual neural networks with different inputs (states) and outputs (policies or values). Under these circumstances, direct transfer between models with different network structures is not feasible.
To overcome the formulation mismatch between source and target tasks, especially when the state and action spaces differ, many studies have proposed methods to transfer knowledge across tasks, such as manifold alignment [1, 11] and domain adaptation [6]. However, most of them focus on mapping the original state space into a common feature space and rarely consider the action space. In the MOBA game example above, although the skill systems of different heroes are distinct, the effects of their skills may be similar. For most heroes, skills can be classified into several categories, such as 'Damage Skill', 'Control Skill', 'Summoning Skill' and so on. If the semantics of actions are learned explicitly, this semantic information can serve as a prior for transferring policies across different heroes.
One idea is inter-task mapping [25], which constructs inter-state and inter-action mappings to describe the relations between tasks. However, such a mapping can be difficult to learn, because it requires prior knowledge about the semantics and ranges of state variables and actions. Inspired by recent work on action embedding [28, 7], we study the feasibility of leveraging action embeddings, which automatically learn the semantics of actions, to transfer a policy across tasks with different state spaces and/or discrete action spaces.
The main difficulty in our problem is how to learn meaningful action embeddings that capture the semantics of actions. Additionally, in order to transfer the policy, we need to align the embeddings of the target task with those of the source task, even though the state spaces differ. To learn action embeddings in RL, the work [28] uses a skip-gram model [20] with negative sampling to train action representations from the action contexts in expert demonstrations. In contrast, our method builds on the idea that the semantics of an action should be reflected in its effect, that the effect shows up as state changes, and that state changes are captured by the state transition function of the RL problem. Thus, we can learn latent representations of actions by learning the dynamics with a transition model, which is quite similar to learning word embeddings from a language model [16], and the model can be trained on any generated trajectories. Further, the parameters of the transition model are either frozen or used as initialization, so that the embeddings of the two tasks stay as close as possible and the policy can be transferred.
We test our methods with the reinforcement learning algorithm Soft Actor-Critic (SAC) [12] on three sets of environments: a set of gridworld navigation tasks, a set of discretized continuous control tasks in MuJoCo [29] and Roboschool, and a set of fighting tasks in a video game.
Our main contributions in this paper are summarized as follows:

We propose a method to learn action embeddings from interactions with environments, which may also be used for other purposes, such as automatic task decomposition.

We propose a transfer framework based on action embeddings, making policy transfer across tasks with different action and state spaces possible.

Our experimental results show that our methods can a) learn informative action embeddings and b) effectively transfer policies to reduce the time needed for convergence to near-optimal behavior. The experimental code of this work is released anonymously at https://github.com/ActionEmbedding/ActionEmbedding.git for reproducibility.
Related Work
In this section, we briefly review the literature on transfer learning in RL, action embedding and representation learning.
Transfer in Reinforcement Learning
Transfer learning has long been considered an important and challenging direction in reinforcement learning and has drawn increasing attention. The work [27] proposes Distral, a method that uses a shared distilled policy for joint training of multiple tasks. The work [9] introduces a generic meta-learning framework that can achieve fast adaptation. Successor features and generalized policy improvement have also been applied to transfer knowledge [4, 19]. All of these methods focus on tasks that differ only in their reward functions.
To transfer across tasks with different state and action spaces, the work [25] manually constructs an inter-task mapping and builds a transferable action-value function based on it. The work [2] introduces a common task subspace between the states of tasks and uses it to learn the state mapping between tasks. Further, unsupervised manifold alignment (UMA) is used to autonomously learn an inter-task mapping from source and target task trajectories [1]. The main difference from our work is that we do not learn a direct mapping but instead embed actions into a common space where similarities can be measured by distance. In a similar vein, the work [11] tries to learn invariant common features between different agents from a proxy skill and uses them as an auxiliary reward. The learning process, however, minimizes the distance between state embeddings of corresponding pairs of states, and such pairs may be difficult to obtain. Adversarial losses based on the mutual alignment of visited state distributions between tasks are used by [30] as auxiliary rewards to train policies. The work [6] later adopts an adversarial autoencoder (AAE) to align the representation vectors of target and source states on Atari games.
All these works focus on learning a state representation that contains transferable knowledge among the tasks, while in this paper we leverage action representations to transfer the policy.
Action Embedding
Action embeddings were first studied by [8], aiming to solve the problem of large discrete action spaces in RL; the optimal actions are found using a k-nearest-neighbors approach. However, the action embeddings are assumed to be given as a prior. In [7], embeddings are used as part of the overall policy, and a mapping function is learned to map the embeddings to discrete actions. Act2Vec is introduced by [28], in which a skip-gram model is used to learn representations of actions from expert demonstrations; the embeddings are then transferred from a 2D navigation task to a 3D navigation task.
Different from the existing studies, we present a method that can autonomously capture the relations between actions and learn informative action embeddings without any prior knowledge.
Representation Learning
Representation learning aims to learn representations of the data that make it easier to extract useful information when building classifiers or other predictors [5], and has been applied in various domains, e.g. NLP [20, 23] and graphs [22]. In reinforcement learning, features are extracted from raw images by convolutional neural networks [21], sparse representations are learned for control [18], and the work [10] combines model-based and model-free approaches via a shared state abstraction. Moreover, the work [14] utilizes skill representations to learn versatile skills in hierarchical reinforcement learning. In this paper, by contrast, we investigate representations of discrete actions and use them to improve RL algorithms.
Background and Problem Formulation
In this section we present the background material that will serve as a foundation for the rest of the paper.
Reinforcement Learning
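An RL problem is often modeled as a Markov Decision Process (MDP), defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$. $\mathcal{S}$ and $\mathcal{A}$ are the sets of states and actions, called the state space and action space respectively. In this work, we restrict our focus to discrete action spaces, and $|\mathcal{A}|$ denotes the size of the action set. $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is a state transition probability function describing the dynamics of the problem. $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function measuring the performance of agents, and $\gamma \in [0, 1]$ is a discount factor for future rewards. Further, a policy $\pi(a \mid s)$ can be defined, which is a conditional distribution over actions for each state. For any policy $\pi$, its corresponding state value function is $V^{\pi}(s) = \mathbb{E}_{\pi}\big[\sum_{k \ge 0} \gamma^{k} r_{t+k} \mid s_t = s\big]$ and its state-action value function is $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\sum_{k \ge 0} \gamma^{k} r_{t+k} \mid s_t = s, a_t = a\big]$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$ at time step $t$. Given an MDP, the goal of an agent is to find an optimal policy $\pi^{*}$ that maximizes the expected discounted return $\mathbb{E}_{\pi}\big[\sum_{t} \gamma^{t} r_t\big]$.
In this work, we choose Soft Actor-Critic (SAC) [12], a model-free off-policy actor-critic method, as our RL algorithm. In SAC, the actor aims to maximize not only the expected reward but also the entropy of the stochastic policy. This changes the RL problem to:
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} \gamma^{t}\big(\mathcal{R}(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]$$
where $\tau$ is the trajectory of state-action pairs, $\alpha$ is the trade-off coefficient, and $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the policy's action distribution at $s_t$. It is noteworthy that the proposed method is not limited to SAC; it can be extended to other RL algorithms with appropriate adaptation.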
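As a small numerical illustration of the entropy term in the SAC objective, the following sketch computes the entropy of a discrete action distribution and an entropy-augmented discounted return for a short trajectory. It is not the authors' implementation: the function names `entropy` and `soft_return`, the default values of `alpha` and `gamma`, and the example distributions are assumptions chosen for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_return(rewards, policies, alpha=0.2, gamma=0.99):
    """Discounted sum of (reward + alpha * policy entropy) over a trajectory.

    rewards[t] is the reward at step t; policies[t] is the action
    distribution used at step t (both are hypothetical inputs).
    """
    total = 0.0
    for t, (reward, probs) in enumerate(zip(rewards, policies)):
        total += (gamma ** t) * (reward + alpha * entropy(probs))
    return total

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 actions
greedy = [1.0, 0.0, 0.0, 0.0]        # deterministic choice, zero entropy
print(entropy(uniform), entropy(greedy))
print(soft_return([1.0, 1.0], [uniform, uniform], alpha=1.0, gamma=0.5))
```

A larger `alpha` gives more weight to keeping the policy stochastic, while `alpha = 0` recovers the ordinary discounted return.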
Problem Formulation
In this work, we consider the transfer problem between two tasks, denoted as a source MDP $M_S = (\mathcal{S}_S, \mathcal{A}_S, \mathcal{T}_S, \mathcal{R}_S, \gamma)$ and a target MDP $M_T = (\mathcal{S}_T, \mathcal{A}_T, \mathcal{T}_T, \mathcal{R}_T, \gamma)$. In general, the state and action spaces of the two MDPs might be completely different, as might the dynamics $\mathcal{T}$ and the reward function $\mathcal{R}$. However, in this paper, we assume there are some similarities between the reward functions. In particular, the goals of the two MDPs are similar, so that the optimal policy of the target MDP resembles the optimal policy of the source MDP. Moreover, the dynamics are also assumed to be similar, because we want the action embeddings of the target task to be learned quickly and to be aligned with those of the source task by using one transition model. For example, in one of our experiments, $M_S$ is to balance a pole on a cart, while in $M_T$ there are two poles on the cart. The dimensionalities of the states are completely different, and we discretize the continuous actions into different numbers of discrete actions so that $|\mathcal{A}_S| \neq |\mathcal{A}_T|$. Both tasks require balancing the poles under gravity, with a reward for staying alive. Therefore, their dynamics and reward functions share some similarities.
Methods
In this section, we discuss how action embeddings can be learned by means of a transition model. We then describe how the action embeddings can be combined with RL algorithms and used for policy transfer to a new task.
Learning Action Embeddings
We aim to project discrete actions into a continuous feature space in which the distance between two action representations is small if the effects of the actions are similar. It is natural to measure the effect of an action by the state change it causes, that is, by the state transition probabilities of the RL problem. Formally, our goal is to learn an embedding matrix $E \in \mathbb{R}^{|\mathcal{A}| \times d}$, in which $e_i$ denotes the $i$-th row of $E$ and represents action $a_i$, such that the distance between $e_i$ and $e_j$ reflects $D_{KL}\big(\mathcal{T}(\cdot \mid s, a_i) \,\|\, \mathcal{T}(\cdot \mid s, a_j)\big)$, where $d$ is the dimension of the embedding vector $e_i$, $\mathcal{T}(\cdot \mid s, a)$ is the state transition probability given state $s$ and action $a$, and $D_{KL}$ is the KL divergence, which measures the discrepancy between two distributions. Action embeddings can therefore be learned by training a parameterized transition model that predicts the next state.
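To make the criterion concrete, the sketch below compares the next-state distributions of three actions in a toy tabular setting using the KL divergence. The transition table, the action names and the helper `kl_divergence` are hypothetical; the point is only that actions with the same effect have zero divergence, while actions with different effects are far apart.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same next states.

    eps avoids division by zero when q assigns zero probability.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Hypothetical next-state distributions from one fixed state of a toy MDP
# with three possible next states.
transition = {
    "a0": [0.9, 0.1, 0.0],
    "a1": [0.9, 0.1, 0.0],   # same effect as a0
    "a2": [0.0, 0.1, 0.9],   # very different effect
}

same = kl_divergence(transition["a0"], transition["a1"])
diff = kl_divergence(transition["a0"], transition["a2"])
print(same, diff)
```

Under the stated goal, embeddings of `a0` and `a1` should end up close together, and the embedding of `a2` should lie farther away.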
We adopt a recurrent model, e.g. an LSTM [15], to learn the transition. Given a sequence of state-action pairs $\{(s_1, a_1), \dots, (s_T, a_T)\}$, the forward process of the model runs as follows:
$$h_t = f(h_{t-1}, s_t, e_{a_t}; \theta), \quad t = 1, \dots, T \tag{1}$$
where $T$ is the length of the trajectory, $h_t$ is an activation vector that allows the model to access all previous states at time step $t$, and $\theta$ denotes the parameters of the model. Further, the next state $\hat{s}_{t+1}$ is computed by a multilayer feed-forward network that conditions on $h_t$.
The recurrent state transition model can be understood as a deterministic process [13]. To cope with stochastic environments, we can introduce a latent variable $z_t$ as a stochastic process in the model, as in a variational autoencoder (VAE). The next state is then computed from both $h_t$ and $z_t$. Each latent variable $z_t$ is sampled from a multivariate Gaussian distribution whose parameters are obtained by a nonlinear transformation of the previous hidden state, i.e. $z_t \sim \mathcal{N}(\mu_t, \sigma_t^{2})$, where $[\mu_t, \sigma_t] = \varphi(h_{t-1})$.
The transition model is learned by minimizing the following loss function over the trajectory:
$$L(\theta) = \sum_{t=1}^{T} \ell(\hat{s}_{t+1}, s_{t+1}) \tag{2}$$
where the loss function $\ell$ is the mean squared error (MSE) in this work. The action embeddings are learned at the same time. The whole transition model is illustrated in Figure 1.
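The forward pass and trajectory loss described above can be sketched in plain Python. This is a deliberately simplified illustration, not the authors' model: it uses a single tanh recurrent layer instead of an LSTM, omits the latent variable, uses a linear output layer, and only evaluates the loss without training. The class name `TinyTransitionModel` and all sizes are assumptions.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix used as an untrained weight initialization."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class TinyTransitionModel:
    """Simple recurrent model: (state, action embedding) -> predicted next state."""

    def __init__(self, state_dim, n_actions, embed_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.E = rand_matrix(n_actions, embed_dim, rng)   # one embedding row per action
        self.W_in = rand_matrix(hidden_dim, state_dim + embed_dim, rng)
        self.W_h = rand_matrix(hidden_dim, hidden_dim, rng)
        self.W_out = rand_matrix(state_dim, hidden_dim, rng)
        self.hidden_dim = hidden_dim

    def step(self, h, state, action):
        x = list(state) + self.E[action]                  # concatenate s_t and e_{a_t}
        pre = [a + b for a, b in zip(matvec(self.W_in, x), matvec(self.W_h, h))]
        h_next = [math.tanh(v) for v in pre]
        return h_next, matvec(self.W_out, h_next)         # new hidden state, predicted s_{t+1}

    def trajectory_loss(self, states, actions):
        """Mean squared error between predicted and observed next states."""
        h = [0.0] * self.hidden_dim
        total, count = 0.0, 0
        for t, action in enumerate(actions):
            h, pred = self.step(h, states[t], action)
            total += sum((p - s) ** 2 for p, s in zip(pred, states[t + 1]))
            count += len(pred)
        return total / count

model = TinyTransitionModel(state_dim=2, n_actions=4, embed_dim=3, hidden_dim=5)
states = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # len(states) == len(actions) + 1
actions = [0, 3]
print(model.trajectory_loss(states, actions))
```

In practice the loss would be minimized with an automatic-differentiation framework, so that gradients flow into the rows of `E` and the embeddings are learned together with the other weights.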
Action Generation
To utilize the learned embeddings, we first need to combine RL algorithms with them. The work [28] shows that the Q-function can be approximated by the inner product of action embeddings and state representations. In this paper, we adopt the architecture proposed by [8], in which the policy produces its output in a continuous space and that output is then mapped to the discrete action space, since the output itself may not correspond to any discrete action. Specifically, the policy parameterized by $\phi$ can be defined as $\pi_{\phi}: \mathcal{S} \to \mathbb{R}^{d}$. The policy outputs a proto-action $\hat{a} \in \mathbb{R}^{d}$ for a given state $s$. The action actually performed is then chosen as the nearest neighbor among the learned action embeddings:
$$a = g(\hat{a}) = \arg\min_{a_i \in \mathcal{A}} \lVert \hat{a} - e_i \rVert_2$$
where $g$ is a mapping from the continuous space to the discrete space based on a nearest-neighbor algorithm. It returns the action in $\mathcal{A}$ whose embedding is closest to the proto-action $\hat{a}$ in the embedding space by $L_2$ distance.
Note that the proto-actions generated by the policy, rather than the embeddings of the performed actions, are stored in the replay buffer when training the policy. Otherwise the region of the embedding space covered by the discrete actions is very limited, which often leads to unstable training, especially for networks that predict Q-values from states and action representations.
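The nearest-neighbor action selection can be sketched as follows. The function name `nearest_action` and the two-dimensional embeddings are hypothetical, and a linear scan is used for clarity; a large action set might call for an approximate nearest-neighbor index instead.

```python
import math

def nearest_action(proto_action, embeddings):
    """Return the index of the action whose embedding is closest (L2) to proto_action."""
    best_index, best_dist = -1, math.inf
    for index, embedding in enumerate(embeddings):
        dist = math.dist(proto_action, embedding)   # Euclidean distance
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

# Hypothetical 2-D embeddings for four discrete actions.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
proto = [0.8, 0.3]                        # continuous output of the policy
action = nearest_action(proto, embeddings)
print(action)  # prints 0

# As described in the text, the replay buffer would store `proto`,
# not `embeddings[action]`.
```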
Policy Transfer
In this section, we discuss the procedure of policy transfer from a source task to a target task. For clarity, we start from a simple setting in which the state spaces of the source and target tasks are the same while the action spaces differ. For example, a character may carry different sets of skills to perform a task, with each set containing a different number of skills. In this setting, the policy can be transferred directly: facing the same state, the agent should react similarly even when it carries a different set of skills, and the most relevant skill is found by the nearest-neighbor algorithm in the embedding space. Meanwhile, the embeddings can be learned with the same transition model, which ensures that the actions are embedded into the same latent space.
When the tasks differ in both state and action spaces, the transition model cannot be reused directly, since the state dimensions of the source and target tasks differ. Reusing the transition model presupposes that the actions are embedded into the same or a similar space with the same state size. Thus, the input of the transition model becomes a sequence $\{(\psi(s_t), a_t)\}_{t=1}^{T}$, where $\psi$ denotes a nonlinear function, called the state embedding, that maps the original state space into a common state space. The function $\psi$ can be represented by a neural network and trained along with the policy. Note that, in this way, the two modules become interdependent: the transition model needs state embeddings as training input, and the policy needs action embeddings to perform nearest-neighbor action selection. Therefore, we train the two modules together, which also increases data utilization. The training process on the source task is outlined in Algorithm 1, and the process of transfer to the target task is shown in Figure 2.
After training on the source task, the learned parameters of the policy and of the transition model are used as the initialization weights of the networks trained for the target task. The initialized networks are then trained according to lines 2–15 of Algorithm 1. The transfer process is shown in Algorithm 2. This can be seen as a kind of adaptation in which the state embedding is regularized by the policy and the transition model. There exist promising methods, such as the adversarial autoencoder [6], for aligning the state representation function of the target task with that of the source task during pretraining; however, this is not the focus of this paper.
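The initialization step can be illustrated with a minimal sketch in which parameters are kept in a plain dictionary. The module names (`"policy"`, `"transition"`, `"action_embeddings"`), the function `init_target_params` and the choice of which modules to copy are assumptions made for this example, not the authors' code; the target action-embedding table is created fresh because the number of actions differs between tasks.

```python
import copy
import random

def init_target_params(source_params, n_target_actions, embed_dim,
                       transferred=("policy", "transition"), seed=0):
    """Build target-task parameters from a source-task parameter dict.

    Modules named in `transferred` are deep-copied from the source; the
    action-embedding table is re-created with one row per target action.
    """
    rng = random.Random(seed)
    target = {name: copy.deepcopy(source_params[name]) for name in transferred}
    target["action_embeddings"] = [
        [rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
        for _ in range(n_target_actions)
    ]
    return target

# Hypothetical source parameters with 4 actions and 2-D embeddings.
source = {
    "policy": [[0.5, -0.2], [0.1, 0.3]],
    "transition": [[0.7, 0.0], [0.0, 0.7]],
    "action_embeddings": [[0.0, 1.0] for _ in range(4)],
}
target = init_target_params(source, n_target_actions=16, embed_dim=2)
print(len(target["action_embeddings"]), target["policy"] == source["policy"])
```

Because the copies are deep, continued training on the target task does not modify the stored source parameters.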
Experiment
To evaluate the feasibility of learning action embeddings with the dynamics model, we first conduct experiments on two domains to show the semantics of the learned embeddings. Next, we evaluate the transferability of our method, named AESAC, both between tasks in the same domain and between tasks in different domains.
For comparison, we select Soft Actor-Critic [12] for discrete action spaces, denoted by SAC. In all experiments, AESAC and SAC use the same hyperparameters. The results of all experiments are averaged over 10 individual runs.
Semantic of Embeddings
We first validate our method of learning action embeddings on a simple environment, in which the agent needs to reach a randomly assigned goal position in a gridworld. The environment has 4 atomic actions at each time step: Up, Down, Left and Right. We consider an $n$-step planning task here, and hence the number of actions becomes $4^{n}$. With $n$ fixed, we sample 1000 trajectories with a maximum length of 20 according to a random policy to train action embeddings of dimension $d$. We project the embeddings of the actions into 2D space for ease of interpretation; the result is shown in Figure 3 (a). Actions with the same or similar effects are positioned close together, and the embeddings can be clustered into 16 separate groups. Moreover, the embedding space can be divided along 4 axes and shows evident symmetry with respect to the four directions in the gridworld, which indicates that our method effectively captures the semantics of actions. In word embeddings, relationships between words are often discussed, such as Paris - France + Italy = Rome [20]. Surprisingly, the learned action embeddings exhibit analogous additive relationships.
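The structure of the composite action set can be checked with a short enumeration. The sketch assumes an open grid with no walls, so that the effect of an $n$-step action is simply its net displacement; the names `MOVES`, `composite_actions` and `group_by_displacement` are hypothetical. It lists all $4^{n}$ sequences and groups them by effect.

```python
from collections import defaultdict
from itertools import product

# Atomic moves as (dx, dy) offsets; the coordinate convention is an assumption.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def composite_actions(n):
    """All length-n sequences of the four atomic moves (4**n of them)."""
    return list(product(MOVES, repeat=n))

def group_by_displacement(n):
    """Map each net (dx, dy) displacement to the sequences that produce it."""
    groups = defaultdict(list)
    for seq in composite_actions(n):
        dx = sum(MOVES[m][0] for m in seq)
        dy = sum(MOVES[m][1] for m in seq)
        groups[(dx, dy)].append("".join(seq))
    return dict(groups)

for n in (1, 2, 3):
    print(n, len(composite_actions(n)), len(group_by_displacement(n)))
```

On an open grid, three-step sequences give 64 composite actions but only 16 distinct net displacements, which is the kind of grouping by effect that a good embedding should reveal.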
Further, we test our approach on a one-versus-one video game. In this game, heroes can carry different skills to fight against a rule-controlled character with fixed skills. There are 15 skills in total. Unlike in the gridworld, we also take the opponent's actions into consideration in order to learn more reasonable representations, since the opponent's actions have a great influence on the states. The embeddings are trained on 5000 fighting records with a maximum length of 20. The result is shown in Figure 3 (b). We notice that skills with special effects, such as Silence and Stun, are clearly distinguished, and damage skills lie closer to each other. As annotated in the figure, the DoT Damage and Instant Damage skills are also recognized as distinct groups.
Gridworld
We first evaluate AESAC on same-domain transfer, where the state spaces are the same among tasks. Considering the $n$-step planning task in the gridworld described above, we run experiments on three settings with $n \in \{1, 2, 3\}$, so the action spaces of the tasks have sizes 4, 16 and 64, respectively. The state of the task consists of the current position and the goal position. The agent receives a reward of -0.05 at each step and +10 when it reaches the goal. We assess the performance of AESAC with a policy transferred from the other source tasks against standard SAC for discrete actions by measuring the return on the target task, averaged over 100 episodes, as a function of the number of training episodes.
Figure 4 shows the results on the different tasks. Training is greatly accelerated after transfer and is faster than SAC in all tasks, especially on two of the target settings. Note that there is no jump start in performance, because the action embeddings of the target tasks are randomly initialized first and then quickly adapted. To transfer a policy, the action embeddings of the target task should align with those of the source task so that the policy performs well on the target task. Figure 5 (a) displays the action embeddings of a source task and a target task. As we can see, the embeddings of the target task are shifted, since the parameters of the dynamics model are not frozen after transfer, but they still show the same symmetry within a single task. Nonetheless, transfer still effectively accelerates training, because the policy only needs to correct the resulting biases.
Mujoco and Roboschool
Next, we consider the more difficult problem of cross-domain transfer, where the action spaces differ as well as the state spaces. We conduct experiments on four tasks, InvertedPendulum and InvertedDoublePendulum in MuJoCo [29] and Roboschool, denoted by mP, mDP, rP and rDP for short. In these tasks, the agent needs to control the cart to balance the pole or poles on it. For mP and rP, the agent receives +1 for each step on which the episode has not terminated. For mDP and rDP, apart from a +10 reward for staying alive, it is penalized for high velocity and for states far from the goal state. As for the dynamics, InvertedPendulum and InvertedDoublePendulum may share some similarities, since the former is contained in the latter. The dynamics of MuJoCo and Roboschool may differ because their physics engines are quite different; however, both follow the laws of physics. To evaluate our methods, we discretize the original $m$-dimensional continuous control action space into $k$ equally spaced values per dimension, resulting in a discrete action space with $k^{m}$ actions. The details of the tasks are summarized in Table 1.
Table 1: Details of the tasks.
Task  State Dim  Action Dim  Action Range  Discretized Actions
mP    4          1           [-3, 3]       101
mDP   11         1           [-1, 1]       51
rP    5          1           [-1, 1]       91
rDP   9          1           [-1, 1]       71
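The discretization of a one-dimensional continuous action can be sketched as below. The helper names `discretize` and `to_continuous` are hypothetical, and the example assumes an action range of [-3, 3] split into 101 values, so treat the numbers as illustrative rather than as the authors' exact configuration.

```python
def discretize(low, high, k):
    """Return k equally spaced values covering [low, high] inclusive."""
    if k < 2:
        raise ValueError("k must be at least 2")
    step = (high - low) / (k - 1)
    return [low + i * step for i in range(k)]

def to_continuous(index, low, high, k):
    """Map a discrete action index back to the continuous control value."""
    return discretize(low, high, k)[index]

grid = discretize(-3.0, 3.0, 101)   # assumed range; 101 discrete actions
print(len(grid), grid[0], grid[-1])
print(to_continuous(50, -3.0, 3.0, 101))   # middle index, close to zero torque
```

With one action dimension, the discrete action set is simply this grid; with several dimensions, the set would be the Cartesian product of the per-dimension grids.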
We train models on these tasks and transfer between each pair of them; Figure 6 reports the results of cross-domain transfer. Overall, almost all transferred policies learn faster than training from scratch, especially on the MuJoCo tasks. In Figure 6 (a), transferring from mDP learns fastest, and the other two are relatively close. For rP (Figure 6 (b)), there is only a small gap between methods. The variance when transferring from mDP is quite large at first, since these two differ not only in task but also in the underlying physics engine. In Figure 6 (c), transferring the policy from mP is faster; rDP and rP are close at first, and both show quite large variance. All transfer results are about the same in Figure 6 (d). Interestingly, in the previous experiments SAC always performs better than AESAC, since AESAC needs to learn action embeddings at the same time, which makes learning a policy more difficult, whereas for tasks mDP and rDP AESAC outperforms SAC.
Further, we investigate the properties of action embeddings learned from sequences of (state embedding, action) pairs. The relations between the action embeddings of these tasks ought to be linear, since the actions are discretized from a continuous action space. Figure 5 (b) plots the action embeddings of the source task (mDP) and the target task (mP), using PCA to reduce them jointly to one dimension. As shown in Figure 5 (b), although there are some local oscillations, the overall trend is linear. This indicates that our method can still learn meaningful latent representations of actions based on state embeddings.
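The one-dimensional PCA projection used for this kind of inspection can be sketched in plain Python with power iteration. This is an illustrative stand-in for a library PCA routine, not the authors' analysis code; the function name `project_to_first_component` and the toy embeddings are assumptions.

```python
import math

def project_to_first_component(points, iters=200):
    """Project points onto their first principal axis, found by power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Covariance matrix (d x d) of the centered points.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0 / (j + 1) for j in range(d)]          # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                             # degenerate case: stop iterating
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]

# Hypothetical 2-D embeddings of five ordered actions lying on a line.
embeddings = [[2.0 * i, 1.0 * i] for i in range(5)]
coords = project_to_first_component(embeddings)
gaps = [b - a for a, b in zip(coords, coords[1:])]
print(coords)
print(gaps)
```

If the embeddings of consecutive discretized actions are linearly related, the gaps between consecutive projected coordinates are roughly constant, which is the linear trend described in the text.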
Fighting Video Game
To show the potential of our method on a more practical problem, we validate it on the aforementioned one-versus-one commercial fighting video game. In this scenario, the hero can select a subset of skills to fight against a rule-controlled opponent. The state of this environment is a 48-dimensional vector that contains information about the hero and the opponent. The agent receives positive rewards for dealing damage and winning, and negative rewards for taking damage and losing. The agent is also penalized for choosing skills that are not ready.
In this environment, we select three different sets of skills with 5, 6 and 7 skills respectively. Some skills are shared between them. We first train a policy on each set individually and then transfer between them. Performance is measured by the average winning rate over the 100 most recent episodes.
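A sliding-window winning rate of this kind can be tracked with a bounded queue. The class name `WinRateTracker` is hypothetical, and the small window in the example is only for demonstration; a window of 100 corresponds to the metric described above.

```python
from collections import deque

class WinRateTracker:
    """Average winning rate over the most recent `window` episodes."""

    def __init__(self, window=100):
        # A deque with maxlen drops the oldest result automatically.
        self.results = deque(maxlen=window)

    def record(self, won):
        self.results.append(1 if won else 0)

    def rate(self):
        """Fraction of wins in the window, or 0.0 before any episode is recorded."""
        return sum(self.results) / len(self.results) if self.results else 0.0

tracker = WinRateTracker(window=3)
for outcome in (True, False, True, True):
    tracker.record(outcome)
print(tracker.rate())   # only the last three outcomes are counted
```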
Figure 7 reports the results. For the task with 5 skills (Figure 7 (a)), the policy transferred from 7 skills fits well and the winning rate soon rises to 100%, while the other transferred policy is close to SAC. This may be because the skills available in the target task are quite different from the key skills used by the source policy. In Figure 7 (b), the two transfer results are close and both outperform SAC. For the task with 7 skills (Figure 7 (c)), the policy from 6 skills adapts better and achieves the best result. Overall, all transfer results outperform AESAC and SAC to some degree.
Conclusion
In this paper, we investigate learning and leveraging embeddings of discrete actions to transfer across tasks with different action spaces and/or state spaces in RL. We propose a method that can effectively learn meaningful action embeddings from any generated trajectories by training a transition model. Further, we train RL policies with action embeddings by selecting the nearest neighbor in the embedding space. The policy and the transition model are then used as initializations for the target task, leading to fast learning of action representations in the target task and fast adaptation of the policy. Our method is evaluated on three sets of tasks, demonstrating that it is capable of improving the initial performance compared to a standard RL algorithm for discrete actions, even when the state spaces differ.
In future work, we will try to learn promising action embeddings for continuous action spaces and to align the state embeddings with additional constraints.
References
[1] (2015) Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[2] (2011) Reinforcement learning transfer via common subspaces. In International Workshop on Adaptive and Learning Agents, pp. 21–36.
[3] (2018) Do deep reinforcement learning algorithms really learn to navigate?
[4] (2019) Transfer in deep reinforcement learning using successor features and generalised policy improvement. arXiv preprint arXiv:1901.10964.
[5] (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
[6] (2019) Domain adaptation for reinforcement learning on the Atari. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1859–1861.
[7] (2019) Learning action representations for reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 941–950.
[8] (2015) Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.
[9] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135.
[10] (2019) Combined reinforcement learning via abstract representations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3582–3589.
[11] (2017) Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv preprint arXiv:1703.02949.
[12] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
[13] (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.
[14] (2018) Learning an embedding space for transferable robot skills.
[15] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[16] (2014) Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196.
[17] (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
[18] (2019) The utility of sparse representations for control in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4384–4391.
[19] (2018) Universal successor representations for transfer reinforcement learning. arXiv preprint arXiv:1804.03758.
[20] (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[21] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
[22] (2017) Unsupervised large graph embedding. In Thirty-First AAAI Conference on Artificial Intelligence.
[23] (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
[24] (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
[25] (2007) Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research 8 (Sep), pp. 2125–2167.
[26] (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10 (Jul), pp. 1633–1685.
[27] (2017) Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506.
[28] (2019) The natural language of actions. In Proceedings of the 36th International Conference on Machine Learning, pp. 6196–6205.
[29] (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
[30] (2017) Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907.