Learning to Run with Actor-Critic Ensemble

12/25/2017 ∙ by Zhewei Huang, et al. ∙ Megvii Technology Limited

We introduce an Actor-Critic Ensemble (ACE) method for improving the performance of the Deep Deterministic Policy Gradient (DDPG) algorithm. At inference time, our method uses a critic ensemble to select the best action from proposals of multiple actors running in parallel. By having a larger candidate set, our method can avoid actions that have fatal consequences, while staying deterministic. Using ACE, we won 2nd place in the NIPS'17 Learning to Run competition, under the name of "Megvii-hzwer".


1 Method

We introduce an Actor-Critic Ensemble (ACE) method for improving the performance of the Deep Deterministic Policy Gradient (DDPG) algorithm [1]. At inference time, our method uses a critic ensemble to select the best action from proposals of multiple actors running in parallel. By having a larger candidate set, our method can avoid actions that have fatal consequences, while staying deterministic. Using ACE, we won 2nd place in the NIPS'17 Learning to Run competition, under the name of "Megvii-hzwer" (https://www.crowdai.org/challenges/nips-2017-learning-to-run/leaderboards).

1.1 ACE

1.1.1 Dooming Actions Problem of DDPG

The Learning to Run competition asks participants to teach a skeleton with legs to run as far as possible in 1000 steps, while avoiding random obstacles on the ground. In the game, we find that the legs of a fast-running skeleton can easily be tripped up by obstacles, which causes the skeleton to enter an unstable state with its limbs swinging wildly and to fall down after a few frames. We call an action that causes the skeleton to enter such an unstable state a "dooming action", as it is almost impossible to recover from these states.

To investigate dooming actions, we let the critic network inspect the actions at inference time. We find that most of the time, the critic can recognize dooming actions by giving them low scores. However, as the actor network in DDPG proposes only one action at every step, the dooming actions cannot be avoided. This observation leads us to propose using an actor ensemble, which allows the agent to avoid dooming actions by having a critic ensemble pick the best action, as shown in Fig. 3(a).

Figure 3: Schema for DDPG and ACE. (a) DDPG and ACE; (b) Performance of ACE.

1.1.2 Inference-Time Actor Critic Ensemble

We first train multiple actor-critic pairs separately, using the standard DDPG method. Then we build a new agent with many actor networks proposing actions at every step. Given multiple actions, a critic network is used to select the best action. We simply pick the action with the highest score, and send it to the actuator.
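As a concrete illustration, the selection step can be written in a few lines of PyTorch-style code. This is a minimal sketch, assuming actor and critic modules that map state tensors to actions and Q-values; the function and variable names are illustrative and not taken from the released code.

```python
import torch

def select_action(actors, critic, state):
    """Pick the highest-scoring action among all actors' proposals."""
    with torch.no_grad():
        proposals = [actor(state) for actor in actors]            # one candidate per actor
        scores = [critic(state, action) for action in proposals]  # Q-value of each candidate
        best = int(torch.argmax(torch.stack(scores)))             # index of the best proposal
    return proposals[best]
```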

Empirically, we find that using actors of a heterogeneous nature, e.g. trained with different hyper-parameters, performs better than using actors from different epochs of the same training run. This is in agreement with observations in ensemble learning [4].

To further improve the prediction quality of the critic, we build an ensemble of critics by taking the critics paired with the actors. The outputs of the critic networks are combined by averaging.
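A sketch of the combined selection rule follows, again with illustrative names only: every candidate action is scored by every critic, the scores are averaged, and the action with the highest average score is executed.

```python
import torch

def select_action_ace(actors, critics, state):
    """ACE inference: critics' Q-values are averaged for every actor proposal."""
    with torch.no_grad():
        proposals = [actor(state) for actor in actors]
        scores = [
            torch.stack([critic(state, action) for critic in critics]).mean()
            for action in proposals
        ]                                               # mean Q over the critic ensemble
        best = int(torch.argmax(torch.stack(scores)))
    return proposals[best]
```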

1.1.3 Training with ACE

If we train the actor networks together, every actor network can be updated at each step, even when its action is not the one executed. The modified Bellman equation is as follows:

Q(s_t, a_t) = r(s_t, a_t) + γ max_i Q(s_{t+1}, μ_i(s_{t+1}))
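In code, the corresponding TD target backs up the maximum Q-value over the actions proposed by all actors at the next state. The sketch below assumes PyTorch, a target critic network, and a standard 0/1 done-mask as in common DDPG implementations; these details, together with the discount factor value taken from Table 1, are our assumptions rather than a transcription of the authors' code.

```python
import torch

def ace_td_target(reward, next_state, done, actors, target_critic, gamma=0.96):
    """TD target that backs up the best next action among the actor ensemble.

    reward and done are (batch, 1) float tensors; done is 1.0 for terminal steps.
    """
    with torch.no_grad():
        # Q-value of each actor's proposal at the next state: (num_actors, batch, 1)
        next_q = torch.stack(
            [target_critic(next_state, actor(next_state)) for actor in actors]
        )
        best_next_q = next_q.max(dim=0).values        # max over the actor ensemble
        return reward + gamma * (1.0 - done) * best_next_q
```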

2 Experiments

2.1 Baseline Implementation

We use DDPG as our baseline. To describe the state of the agent, we collect three consecutive frames of observations from the environment. This information goes through the feature engineering proposed by Yongliang Qin (https://github.com/ctmakro/stanford-osrl) before being fed into the network.
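The state construction can be sketched as a small frame-stacking helper. The class below is a hypothetical illustration of concatenating three consecutive observation frames into one state vector; it does not reproduce Qin's feature engineering.

```python
from collections import deque

import numpy as np

class FrameStacker:
    """Concatenate the last three observation frames into one state vector."""

    def __init__(self, num_frames=3):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_obs):
        # At episode start, fill the buffer by repeating the first observation.
        for _ in range(self.frames.maxlen):
            self.frames.append(np.asarray(first_obs))
        return np.concatenate(self.frames)

    def step(self, obs):
        self.frames.append(np.asarray(obs))
        return np.concatenate(self.frames)
```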

As the agent is expected to run 1000 steps to finish a successful episode, we find the vanishing gradient problem to be critical. We make several attempts to deal with this difficulty. First, we find that with the original simulation timestep, DDPG converges slowly. In contrast, using a four times larger simulation timestep, which is equivalent to changing the action only every four frames, speeds up convergence significantly. We have also tried unrolling DDPG as in [2], but found it to be inferior to simply increasing the simulation timestep. Second, we have tried several activation functions and found Scaled Exponential Linear Units (SELU) [3] to be superior to ReLU, Leaky ReLU, Tanh and Sigmoid, as shown in Fig. 6.
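The larger effective simulation timestep can be implemented as an action-repeat wrapper. The sketch below assumes a gym-style environment whose step() returns (observation, reward, done, info); the wrapper name and interface are illustrative.

```python
class ActionRepeat:
    """Repeat each chosen action for a fixed number of simulation frames."""

    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip  # four frames per action, matching the setting above

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, obs, done, info = 0.0, None, False, {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```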

Figure 6: Training with different activation functions and different numbers of processes for generating training data, using DDPG. (a) Reward per episode (60 processes); (b) Reward per episode (20 processes).
Actor network architecture: Tanh for the output layer, SELU for the other layers
Critic network architecture: linear for the output layer, SELU for the other layers
Actor learning rate: 3e-4
Critic learning rate: 3e-4
Batch size: 128
Discount factor γ: 0.96
Replay buffer size
Table 1: Hyper-parameters used in the experiments

2.2 ACE Experiments

For all models we use an identical architecture for the actor and critic networks, with the hyper-parameters listed in Table 1. Our code used for the competition can be found online at https://github.com/hzwer/NIPS2017-LearningToRun.

We build the ensemble by drawing models trained with the settings of the previous section. Fig. 3(b) shows the distribution of rewards when using ACE, where AXCY stands for X actors and Y critics. It can be seen that A10C10 (10 actors and 10 critics) has a much smaller chance of falling (rewards below 30) than A1C0, which is equivalent to DDPG. The maximum reward also improves, as shown in Tab. 2.

Training with ACE is found to perform similarly to inference-time ACE.

Experiment   # Test episodes   # Actors   # Critics   Average reward   Max reward   # Falls
A1C0         100               1          0           32.0789          41.4203      25
A10C1        100               10         1           37.7578          41.4445      7
A10C10       100               10         10          39.2579          41.9507      4
Table 2: Performance of ACE

3 Conclusion

We propose Actor-Critic Ensemble, a deterministic method that avoids dooming actions at inference time by asking an ensemble of critics to pick from the actions proposed by an ensemble of actors. Experiments show that ACE can significantly improve the performance of DDPG, as exhibited by fewer falls and increased running speed of the skeleton.

References