
Macro Action Reinforcement Learning with Sequence Disentanglement using Variational Autoencoder

One problem in applying reinforcement learning to real-world problems is the curse of dimensionality in the action space. Macro actions, sequences of primitive actions, have been studied to diminish the dimensionality of the action space along the time axis. However, previous studies relied on humans to define macro actions or assumed macro actions to be repetitions of the same primitive action. We present Factorized Macro Action Reinforcement Learning (FaMARL), which autonomously learns a disentangled factor representation of a sequence of actions to generate macro actions that can be directly applied to general reinforcement learning algorithms. FaMARL exhibits higher scores than other reinforcement learning algorithms in environments that require an extensive amount of search.




1 Introduction

Reinforcement learning has gained significant attention recently in both the robotics and machine-learning communities because of its potential for wide application across domains. Recent studies have achieved above-human-level game play in Go [Silver et al.2016, Silver et al.2017] and video games [Mnih et al.2015, OpenAI2018]. Application of reinforcement learning to real-world robots has also been widely studied [Levine et al.2016, Haarnoja et al.2018].

Reinforcement learning involves learning the relationship between states and actions on the basis of rewards, and it struggles as the dimensionality of the state or action space increases. This is why reinforcement learning is often considered data inefficient, i.e., requiring a large number of trials. The curse of dimensionality on the state space is partially solved using a convolutional neural network (CNN) [Krizhevsky et al.2012, Mnih et al.2015]; training a policy from raw image input has become possible by applying a CNN to the input states. However, reducing the dimensionality on the action side is still challenging: the search space grows exponentially with longer sequences and higher action dimensions.

Application of macro actions to reinforcement learning has been studied as a way to reduce the dimensionality of actions. By compressing a sequence of primitive actions, macro actions diminish the search space. Previous studies defined macro actions as repetitions of the same primitive action [Sharma et al.2017] or required humans to define them manually [Hausknecht and Stone2015]. However, more sophisticated macro actions should contain different primitive actions in one sequence, without humans having to define these actions manually.

We propose Factorized Macro Action Reinforcement Learning (FaMARL), a novel algorithm that abstracts sequences of primitive actions into macro actions by learning a disentangled representation [Bengio2013] of a given sequence of actions, reducing the dimensionality of the action search space. Our algorithm uses the Factorized Action Variational Autoencoder (FAVAE) [Yamada et al.2019], a variation of the VAE [Kingma and Welling2013], to learn macro actions from given expert demonstrations. Using the acquired disentangled latent variables as macro actions, FaMARL matches the state with the latent variables of FAVAE instead of with primitive actions directly. The matched latent variables are then decoded into a sequence of primitive actions and applied repeatedly to the environment. FaMARL is not limited to repeating the same primitive action multiple times, because FAVAE can compress arbitrary sequences of actions. We experimentally show that FaMARL can learn in environments with a high-dimensional search space.

2 Related work

Applying a sequence of actions to reinforcement learning has been studied [Sharma et al.2017, Vezhnevets et al.2016, Lakshminarayanan et al.2017, Durugkar et al.2016]. Fine Grained Action Repetition (FiGAR) successfully adopts macro actions into deep reinforcement learning [Sharma et al.2017], showing that Asynchronous Advantage Actor-Critic (A3C) [Mnih et al.2016], an asynchronous variant of deep reinforcement learning, scores higher in Atari 2600 games when it learns the time scale for repeating an action as well as the action itself than when it uses primitive actions alone.

There are two main differences between FaMARL and FiGAR. First, FiGAR can only generate macro actions that repeat the same primitive action. In contrast, macro actions generated with FaMARL can combine different primitive actions, because FaMARL finds a disentangled representation of a sequence of continuous actions and uses the decoded sequence as a macro action. Second, FaMARL learns how to generate macro actions and optimizes the policy for the target task independently, while FiGAR learns both simultaneously. Although FaMARL cannot learn macro actions end-to-end, it can easily recycle acquired macro actions for new target tasks, because the macro actions are acquired independently of the target tasks.

Hausknecht proposed using a parameterized continuous action space in the reinforcement learning framework [Hausknecht and Stone2015]. This approach, however, is limited by the fact that an action has to be selected at every time step and that humans need to parameterize the actions. FaMARL can be viewed as an extension of this model to time series.

3 Sequence-Disentanglement Representation Learning by Factorized Action Variational AutoEncoder

VAE [Kingma and Welling2013] is a generative model that learns probabilistic latent variables via probability-distribution learning of a dataset. VAE encodes data x to a latent variable z and reconstructs x from z.

The β-VAE [Higgins et al.2017] and CCI-VAE [Burgess et al.2018], an improved β-VAE, are models for learning disentangled representations. These models disentangle by adding a constraint to the VAE that reduces the total correlation. FAVAE [Yamada et al.2019] is an extended β-VAE model that learns disentangled representations from sequential data. FAVAE has a ladder network structure and an information-bottleneck-type loss function. The loss function of FAVAE is defined as

L = E_{q(z|x)}[ -log p(x|z) ] + Σ_l β | D_KL( q(z_l|x) || p(z_l) ) - C_l |,

where l is the index of the ladder, β is a constant greater than zero that encourages disentangled representation learning by weighting the Kullback-Leibler divergence term, and C_l is the information capacity that supports reconstruction. In the learning phase, C_l increases linearly with the epochs from zero to its maximum value. The maximum value is determined by first training FAVAE with a small β and zero capacity; the last value of the KL term is then used as the maximum capacity. Each ladder requires its own C. For example, a 3-ladder network requires 3 Cs.
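As a concrete illustration, here is a minimal numpy sketch of a capacity-controlled loss of this form. The function names are ours, and diagonal-Gaussian posteriors with a standard-normal prior are assumed; the real FAVAE also has an encoder/decoder, which is out of scope here.

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def favae_style_loss(recon_error, ladder_stats, beta, capacities):
    """Reconstruction error plus, per ladder l, beta * |KL_l - C_l|.

    ladder_stats: list of (mu, logvar) pairs, one per ladder.
    capacities:   list of C_l values, one per ladder.
    """
    penalty = sum(beta * abs(kl_diag_gaussian(mu, lv) - c)
                  for (mu, lv), c in zip(ladder_stats, capacities))
    return recon_error + penalty
```

Linearly growing each C_l toward its maximum, as described above, relaxes the bottleneck over training so reconstruction quality recovers while disentanglement is preserved.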

4 Proposed algorithm

Figure 1: Overview of FaMARL

Our objective is to find factorized macro actions from a given time series of expert demonstrations and to search for the optimal policy of a target task based on these macro actions instead of primitive actions. The target task can differ from the task from which the expert demonstrations were generated. We use FAVAE [Yamada et al.2019] to find factorized macro actions. The details of FaMARL are given in Sections 4.1 and 4.2.

One might wonder why we do not apply expert demonstrations, or their segmentations, directly to the reinforcement learning agent to learn a new task. There are two reasons for learning disentangled factors of (segmented) expert demonstrations. First, if the agent explores only these expert demonstrations, it can only mimic them to solve the task, which results in serious deficiencies in generalizing macro actions. Consider a set of demonstrations containing actions that turn right by various fixed angles; if the environment requires the agent to turn right by an angle that is not in the set, the agent cannot complete the task. On the other hand, latent variables trained on the expert demonstrations can generate macro actions for turning right by intermediate angles, so the agent can easily adapt to the target task. Second, without latent variables, the action space is composed by listing all expert demonstrations, forming a discrete action space. This causes the curse of dimensionality, deterring fast convergence on the task.

Input: Expert demonstrations on Base, each a sequence of actions (T_i: length of the i-th episode)
Parameter: Encoder of the autoencoder

1:  Slice all demonstrations into sliding windows of a fixed size
2:  Train the autoencoder with the sliced windows
3:  Segment the demonstrations at peaks of the distance between encodings of adjacent windows
Algorithm 1 Unsupervised segmentation of macro actions

Input: Decoder of FAVAE
Parameter: PPO agent

1:  while not converged do
2:     for each macro-level step do
3:        Match the state to a latent variable, decode it into a macro action, apply it, and sum the rewards
4:     end for
5:     Minimize equation 4 using the collected macro-level transitions
6:  end while
Algorithm 2 Factorized macro action with proximal policy optimization (PPO)

4.1 Unsupervised segmentation of macro actions

An episode of an expert demonstration is composed of a series of macro actions. For example, when humans demonstrate moving an object by hand, the demonstration is composed of 1) extending a hand to the object, 2) grasping the object, 3) moving the hand to the target position, and 4) releasing the object.

Therefore, expert demonstrations first need to be segmented into each macro action. One significant challenge is that there are usually no ground-truth labels for macro actions. One possible solution is to ask experts to label their actions. However, this is another burden and incurs additional cost.

Lee proposed a simple method using an AE [Hinton and Salakhutdinov2006, Vincent et al.2008] to segment signal data [Lee et al.2018]. Simply speaking, this method trains an AE on sliding windows of signal data, acquiring the temporal characteristics of the sliding windows. The distance between the encoded features of two adjacent sliding windows is then calculated, and all the peaks of the distance curve are selected as segmentation points. One advantage of this method is that it is not domain-specific: since it assumes no specific data characteristics, it can be easily applied to expert demonstration data.

In our implementation of this segmentation method, the distance is defined as d_{i,j} = ||f_{i,j+1} - f_{i,j}||, where f_{i,j} refers to the encoded feature of the j-th sliding window on the i-th trajectory. We used a fixed sliding-window size. Any distance point that is the highest among its adjacent points within a fixed margin is selected as a peak.
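The sliding-window segmentation described above can be sketched as follows. Here `encode` stands in for the trained AE encoder, and the window size and margin are illustrative values, not the ones used in the paper:

```python
import numpy as np

def segment_points(actions, window, encode, margin=3):
    """Return segmentation points of an action sequence at peaks of the
    distance between encodings of adjacent sliding windows (Lee et al.-style)."""
    # encode every sliding window of the trajectory
    feats = [encode(actions[j:j + window])
             for j in range(len(actions) - window + 1)]
    # distance between adjacent window encodings
    dist = [np.linalg.norm(feats[j + 1] - feats[j])
            for j in range(len(feats) - 1)]
    peaks = []
    for j in range(len(dist)):
        lo, hi = max(0, j - margin), min(len(dist), j + margin + 1)
        # a peak is the highest distance within the surrounding margin
        if dist[j] > 0 and dist[j] == max(dist[lo:hi]):
            peaks.append(j + window)  # segmentation point in action indices
    return peaks
```

In the paper the encoder is the trained autoencoder; any feature map that is sensitive to temporal characteristics would produce peaks where the motion regime changes.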

4.2 Learning disentangled latent variables with FAVAE

Once the expert demonstrations are segmented, FAVAE learns the factors that compose macro actions. However, FAVAE cannot directly take in the segmented macro actions: they may have different lengths, while FAVAE cannot compute data of different lengths because it uses a combination of 1D convolution and multilayer perceptron, which requires a unified data size across the dataset. To address this issue, macro actions are padded with trailing zeros to match the input length of FAVAE. Two additional flag dimensions are also added to each macro action, one marking real actions and one marking zero padding, to identify whether the action at each timestep is a real action or a zero-padded one: the real-action flag is set for the first T timesteps and the padding flag afterwards, where T is the length of the macro action. The cutting point of real actions against zero padding is computed as the first timestep at which the padding flag is selected from the softmax of the two flags. We used the mean squared error for the reconstruction loss. FAVAE used three ladders, and CCI [Burgess et al.2018] was applied.
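A minimal sketch of the padding and cropping described above, with our own function names and with the two flags stored as the last two channels of the action array:

```python
import numpy as np

def pad_macro_action(actions, t_in):
    """Zero-pad a (T, action_dim) macro action to length t_in and append
    two one-hot channels marking real vs. padded timesteps."""
    t, dim = actions.shape
    out = np.zeros((t_in, dim + 2))
    out[:t, :dim] = actions
    out[:t, dim] = 1.0      # "real action" flag
    out[t:, dim + 1] = 1.0  # "padding" flag
    return out

def crop_macro_action(decoded):
    """Cut a decoded (t_in, action_dim + 2) array at the first timestep
    where the padding flag beats the real-action flag."""
    real, pad = decoded[:, -2], decoded[:, -1]
    for t in range(len(decoded)):
        if pad[t] > real[t]:
            return decoded[:t, :-2]
    return decoded[:, :-2]
```

The same cropping rule is reused at policy-execution time (Section 4.3) to trim the decoder's output before applying it to the environment.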

4.3 Learning policy with proximal policy optimization (PPO)

Our key idea for diminishing the search space is to search on the latent space of the macro actions instead of on primitive actions directly. We used proximal policy optimization (PPO) [Schulman et al.2017] as the reinforcement learning algorithm, although any kind of reinforcement learning algorithm can be used.

PPO uses the following clipped surrogate loss function:

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t ) ],

where r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t) denotes the probability ratio and Â_t is the estimated advantage.
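The clipped surrogate can be sketched as follows (function name ours; `ratio` and `advantage` are per-timestep arrays):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Mean over t of min( r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t )."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))
```

The clipping keeps the policy update from moving the probability ratio far from 1, which is what makes PPO stable enough to drop in as the policy optimizer here.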

Integrating PPO with macro actions generated by FAVAE simply replaces the primitive action at every time step with a macro action whose step interval equals the length of the macro action. Therefore, the model of the environment with respect to a macro action (a_1, ..., a_k) is

s_{t+k} = T( ... T( T(s_t, a_1), a_2 ) ..., a_k ),

where T is the transition model of the environment.

The PPO agent matches a latent variable z to the input state s. The decoder of FAVAE then decodes z into a series of actions (a_1, ..., a_Tin), where Tin is the output length of the decoder. The actions are then trimmed using the softmax of the two flag dimensions, which are also produced by the decoder: the macro action is cropped to (a_1, ..., a_k), where k is the first timestep at which the padding flag is selected. This macro action is applied to the environment without feedback. Rewards collected while the macro action is executed are summed and regarded as the reward for the output z.
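Open-loop execution of a decoded macro action with reward summation can be sketched as follows; `env_step` is a hypothetical transition function returning (next_state, reward, done):

```python
def run_macro_action(env_step, state, macro_action):
    """Apply a decoded macro action open-loop (no feedback between steps)
    and sum the rewards, which become the reward for the single
    macro-level decision."""
    total = 0.0
    done = False
    for a in macro_action:
        state, reward, done = env_step(state, a)
        total += reward
        if done:
            break  # episode ended mid-macro-action
    return state, total, done
```

Note that the loop deliberately ignores intermediate observations: this is the trade-off discussed in Section 6.1, where fast feedback is sacrificed for a smaller search space.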

Thus, the objective function of PPO can be modified to operate on macro-level time steps:

L^CLIP(θ) = E_t̂[ min( r_t̂(θ) Â_t̂, clip(r_t̂(θ), 1-ε, 1+ε) Â_t̂ ) ],

where t̂ is the time step from the perspective of the macro action. If t̂ and t indicate the same time step in the environment, t is the sum of the lengths of the macro actions executed before t̂.

5 Experiments

FaMARL was tested in two environments: ContinuousWorld, a simple 2D environment with continuous action and state spaces, and RobotHand, a 2D environment with a simulated robot hand built with Box2D, a 2D physics engine. (Dataset and other supplementary results are available online.)

5.1 ContinuousWorld

The objective in this environment is to find the optimal trajectory from the starting position (blue dot in Figure 2) to the goal position (red dot in Figure 2). The reward of this environment is r = -||p_agent - p_goal||, where p_agent is the position of the agent and p_goal is the position of the goal. The action space is defined by the acceleration along the x axis and the acceleration along the y axis.
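Assuming the elided norm is the Euclidean distance (the paper does not specify), the reward can be sketched as:

```python
import numpy as np

def continuous_world_reward(agent_pos, goal_pos):
    """Negative Euclidean distance between agent and goal; the reward is
    maximal (zero) exactly when the agent sits on the goal."""
    diff = np.asarray(agent_pos, dtype=float) - np.asarray(goal_pos, dtype=float)
    return -float(np.linalg.norm(diff))
```

Because this reward is dense, the difficulty of Maze comes not from reward sparsity but from the walls acting as local optima, as described below.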

(a) Base
(b) Maze
Figure 2: ContinuousWorld tasks

There are two tasks in ContinuousWorld: Base and Maze. In Base, the agent and goal are randomly placed at the corners, top or bottom, so initialization is drawn from a small fixed set of cases. To acquire factors of macro actions regardless of scale, the size of the map is selected randomly within a fixed range. In Maze, the agent and goal are always placed at the same positions. However, the entrances in the four walls are set randomly for each episode, so the agent has to find an optimal policy for different entrance positions. This makes the environment difficult because the walls act like strong local optima of reward; the agent has to make a long detour with lower rewards before finally finding the optimal policy.

Figure 3: Examples of script trajectories. DownOnly uses only the downward trajectories, Down&Up uses both the downward and upward trajectories, PushedDownOnly uses only the pushed-downward trajectories, and PushedDown&Up uses both pushed trajectories

Our purpose was to find disentangled macro actions from expert demonstrations in Base and to apply those macro actions to complete the target tasks. 100 episodes of expert demonstrations were generated in Base using programmed scripts. We compared four different scripts: DownOnly, Down&Up, PushedDownOnly, and PushedDown&Up, all illustrated in Figure 3. For DownOnly, the goal is only initialized at the bottom of the aisle; therefore, the macro actions do not include upward movements. In contrast, Down&Up does not limit the position of the goal; thus, both upward and downward movements are included in the macro actions. For PushedDownOnly and PushedDown&Up, the agent always accelerates upward or downward, according to the goal position.

(a) Comparison among different actions
(b) Example trajectories of macro actions. Color change indicates change in macro action
Figure 4: Results of Maze

With the expert demonstrations generated in Base, we applied FaMARL to Maze. Among FaMARL with macro actions acquired from expert demonstrations of PushedDownOnly, PPO with primitive actions, and FiGAR, FaMARL performed best on this task, and the other two algorithms failed to converge (Figure 4). It is also clear that the choice of macro actions is critical. While PushedDownOnly outperformed primitive actions, the other macro actions could not complete the task. Because PushedDownOnly does not contain any demonstrated actions of moving upward, it can dramatically diminish the action space to search. In contrast, Down&Up is similar to just repeatedly moving in one direction, which was not sufficient for completing the task.

(a) (3,1): Node that learned factor
(b) (2,1): Node that did not learn any factor
Figure 5: Examples of latent traversal on (Ladder, Index of z) of Down&Up

Figure 5 shows visualized example trajectories of latent traversal for Down&Up. Latent traversal is a technique that shifts only one latent variable while fixing the other variables, in order to observe the decoded output from the modified latent variables. If a disentangled factor representation has been acquired, the output shows meaningful changes; otherwise, the changes are not distinguishable. Also, if the number of latent variables exceeds the number of factors that form the sequence of actions, only some of the latent variables acquire factors and the others show no changes when traversed. Figure 5(a) shows that traversing the 1st variable of the 3rd ladder changed the direction of the agent's trajectory, while Figure 5(b) shows no change. This result indicates that FAVAE learns a disentangled representation of a given sequence of actions.
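Latent traversal as described above can be sketched as follows (function name ours; `decode` stands in for the trained FAVAE decoder):

```python
import numpy as np

def latent_traversal(decode, z, dim, values):
    """Decode copies of z where a single latent dimension `dim` is swept
    over `values` while all other dimensions are held fixed."""
    outputs = []
    for v in values:
        z_mod = z.copy()  # leave the original latent untouched
        z_mod[dim] = v
        outputs.append(decode(z_mod))
    return outputs
```

If the swept dimension has learned a factor, the decoded outputs vary systematically (e.g., trajectory direction changes); if not, they are indistinguishable.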

(a) Comparison among different β on PushedDownOnly
(b) Comparison among different numbers of expert trajectories of PushedDownOnly
Figure 6: Comparison between different β and numbers of expert trajectories in Maze

Comparisons among different β in equation 1 and among different numbers of expert trajectories on PushedDownOnly are shown in Figure 6. Figure 6(a) illustrates the experiment with different β. FAVAE did not learn factors in a disentangled manner when β was low. Entangled latent variables of macro actions severely deter matching the state space with the macro action space for an optimal policy, because the latent space, which actually matches with the state space, is distorted. On ContinuousWorld, we found that a sufficiently large β is enough to complete Maze. Figure 6(b) illustrates the experiment with different numbers of expert trajectories. Even though we used 100 expert trajectories across all other experiments, the number of trajectories did not impact the performance of FaMARL.

5.2 RobotHand

RobotHand has four degrees of freedom (DOFs): moving along the x axis, moving along the y axis, rotation, and a grasping operation. The entire environment was built with Box2D and rendered with OpenGL. Similar to the Base task in ContinuousWorld, Base in RobotHand, a pegging task, provides 100 expert demonstrations for learning disentangled macro actions. The target tasks Reaching and BallPlacing are then completed with the acquired macro actions.

Base (Figure 7) is a pegging task. In Base, the robot moves a rod from a blue basket to a red one. We chose this task because the pegging task is complex enough to contain all macro actions that might be used in target tasks.

Reaching (Figure 7) is a simple task. The robot hand has to reach for a randomly generated goal position (red) as fast as possible. To make this task sufficiently difficult, we used a sparse reward setting in which the robot hand only receives a positive reward of +100 for reaching the goal position within a distance of 0.5 m; otherwise there is a time penalty of -1.

In BallPlacing (Figure 7), the robot hand has to carry the ball (blue) to the goal position (red). The ball is initialized at a random position within a certain range, and the goal position is fixed. The reward is defined by r = -||p_ball - p_goal||, where p_ball is the position of the ball and p_goal is the position of the goal. An episode ends when the ball hits the edges or reaches the goal position within a distance of 0.5 m. An additional reward of +200 is given when the ball reaches the goal.

(a) Base
(b) Reaching
(c) BallPlacing
Figure 7: RobotHand tasks
(a) Reaching
(b) BallPlacing
Figure 8: Comparison of FaMARL, PPO with primitive actions, and FiGAR in RobotHand tasks

Figure 8 is a comparison of FaMARL, PPO with primitive actions, and FiGAR on both Reaching and BallPlacing. PPO with primitive actions failed to learn Reaching, and FiGAR failed to learn BallPlacing, while FaMARL successfully learned both tasks. Because the reward of Reaching is sparse, primitive actions fail to find rewards. On the other hand, even though the reward of BallPlacing is not sparse, the task requires precisely controlling the ball to the goal; FiGAR, which repeats the same primitive action a number of times, could not control the ball precisely. FaMARL was the only algorithm that completed both tasks.

(a) Reaching with time penalty
(b) Reaching without time penalty
Figure 9: Average macro action length and rewards in Reaching with/without time penalty

It should be noted that in the RobotHand experiments, FaMARL optimized its behavior by shortening macro actions while still exploiting the advantages of exploring with macro actions. In Reaching, the average length of macro actions gradually diminished (Figure 9). However, when the time penalty (in Reaching, a penalty of -1 added to the reward at every time step) was eliminated, the length of macro actions did not diminish (Figure 9). This is because the agent did not need to optimize its policy for speed. A macro action can be less efficient than a primitive action for optimizing a policy, because the optimal policy for the task may not match the macro actions while a suboptimal policy will. That is why FaMARL gradually shifts to primitive-like actions (macro actions with lengths of 1 to 3) instead of keeping macro actions composed of dozens of primitive actions.

6 Limitations of FaMARL

FaMARL exhibits generally better scores than using primitive actions. However, there are limitations with FaMARL.

6.1 Lack of feed-back control

Searching over macro actions instead of primitive actions eases search on the action space in exchange for fast responses to unexpected changes in state. We failed to train BipedalWalker-v2 with FaMARL based on expert demonstrations from BipedalWalker-v2. A bipedal-locomotion task requires highly precise control for balancing because of the instability of the environment, so diminishing the search space with macro actions in exchange for a slower response was not adequate.

6.2 Compatibility of macro actions with task

Figure 4 shows that the type of macro actions is critical. If the target task requires actions that were not abstracted from the expert demonstrations, FaMARL will fail, because the actions an optimal policy requires are not present in the acquired macro actions. Thus, choosing appropriate expert demonstrations for a target task is essential for transferring macro actions to that task.

7 Discussion

We proposed FaMARL, an algorithm that uses expert demonstrations to learn disentangled latent variables of macro actions and searches on these latent spaces instead of on primitive actions directly for efficient search. FaMARL exhibited higher scores than other reinforcement learning algorithms in tasks that require extensive search when proper expert demonstrations were provided, because FaMARL diminishes the search space based on the acquired macro actions. We consider this a promising first step toward practical application of macro actions in reinforcement learning in continuous action spaces. However, FaMARL could not complete tasks that require actions outside the acquired macro actions. Possible solutions include searching for the optimal policy with both macro actions and primitive actions.