We introduce an Actor-Critic Ensemble (ACE) method for improving the performance of the Deep Deterministic Policy Gradient (DDPG) algorithm [1]. At inference time, our method uses a critic ensemble to select the best action from proposals made by multiple actors running in parallel. With a larger candidate set, the agent can avoid actions that have fatal consequences while staying deterministic. Using ACE, we won 2nd place in the NIPS'17 Learning to Run competition, under the name "Megvii-hzwer" (https://www.crowdai.org/challenges/nips-2017-learning-to-run/leaderboards).
1.1.1 Dooming Actions Problem of DDPG
The Learning to Run competition asks participants to teach a skeleton with legs to run as far as possible in 1000 steps, while avoiding random obstacles on the ground. In the game, we find that the legs of a fast-running skeleton are easily tripped by obstacles, which causes the skeleton to enter an unstable state with limbs swinging wildly and to fall down after a few frames. We call an action that causes the skeleton to enter an unstable state a "dooming action", as it is almost impossible to recover from such states.
To investigate dooming actions, we let the critic network inspect the actions at inference time. We find that, most of the time, the critic can recognize dooming actions by giving them low scores. However, since the actor network in DDPG proposes only one action at every step, dooming actions cannot be avoided. This observation leads us to use an actor ensemble to give the agent alternatives, with a critic ensemble picking the best action, as shown in Fig. 3(a).
1.1.2 Inference-Time Actor Critic Ensemble
We first train multiple actor-critic pairs separately, using the standard DDPG method. We then build a new agent in which all the actor networks propose actions at every step. Given these candidate actions, a critic network selects the best one: we simply pick the action with the highest score and send it to the actuator.
Empirically, we find that an ensemble of heterogeneous actors, e.g., actors trained with different hyper-parameters, performs better than an ensemble of actors taken from different epochs of the same training run. This agrees with observations in ensemble learning [4].
To further improve the prediction quality, we also build an ensemble of critics by taking the critics paired with the chosen actors. The outputs of the critic networks are combined by averaging.
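The inference-time selection described above can be sketched as follows. This is a minimal illustration, not the competition code; the `actor(state)` and `critic(state, action)` callable interfaces are assumptions for the sake of the example.

```python
import numpy as np

def select_action(state, actors, critics):
    """Pick the proposal with the highest critic score, averaged
    over the critic ensemble (hypothetical interfaces:
    actor(state) -> action vector, critic(state, action) -> scalar Q)."""
    proposals = [actor(state) for actor in actors]
    # Score each proposal with every critic, then average the scores.
    scores = [np.mean([critic(state, a) for critic in critics])
              for a in proposals]
    # Deterministically pick the highest-scoring candidate.
    return proposals[int(np.argmax(scores))]
```

The agent stays deterministic: given the same state and the same ensemble, the same action is always chosen.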
1.1.3 Training with ACE
If we instead train the actor networks jointly, every actor network can be updated at each step, even when its action is not the one executed. The modified Bellman equation replaces the single target actor of DDPG with the best proposal from the ensemble:

y_t = r_t + γ max_j Q′(s_{t+1}, μ′_j(s_{t+1})),

where μ′_j denotes the j-th target actor and Q′ the target critic.
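A sketch of this target computation, assuming Gym-style scalars and hypothetical `target_actors` / `target_critic` callables:

```python
def ace_td_target(reward, next_state, target_actors, target_critic, gamma=0.99):
    """TD target using the best proposal among the target actors:
    y = r + gamma * max_j Q'(s', mu'_j(s'))."""
    q_values = [target_critic(next_state, mu(next_state))
                for mu in target_actors]
    return reward + gamma * max(q_values)
```

Compared to standard DDPG, where the target is r + γ Q′(s′, μ′(s′)) with a single target actor, the max over the ensemble bootstraps from the best available proposal.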
2.1 Baseline Implementation
We use DDPG as our baseline. To describe the state of the agent, we collect three consecutive frames of observations from the environment. This information goes through the feature engineering proposed by Yongliang Qin (https://github.com/ctmakro/stanford-osrl) before being fed into the network.
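The three-frame state construction can be sketched as below; this is a generic illustration and omits the feature-engineering step, and the class name is our own for the example.

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keep the last three observation frames and concatenate them
    into a single state vector."""
    def __init__(self, n_frames=3):
        self.frames = deque(maxlen=n_frames)

    def reset(self, obs):
        # At episode start, fill the buffer with copies of the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(obs)
        return np.concatenate(self.frames)

    def step(self, obs):
        # Drop the oldest frame, append the newest, and rebuild the state.
        self.frames.append(obs)
        return np.concatenate(self.frames)
```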
As the agent is expected to run 1000 steps to finish a successful episode, we find the vanishing gradient problem to be critical. We make several attempts to deal with this difficulty. First, we find that with the original simulation timestep, DDPG converges slowly. In contrast, using a four times larger simulation timestep, which is equivalent to changing the action only every four frames, speeds up convergence significantly. We also tried unrolling DDPG as in [2], but found it inferior to simply increasing the simulation timestep. Second, we tried several activation functions and found Scaled Exponential Linear Units (SELU) [3] to be superior to ReLU, Leaky ReLU, Tanh and Sigmoid, as shown in Fig. 6.
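The enlarged simulation timestep amounts to an action-repeat wrapper. A minimal sketch, assuming a Gym-style `env.step(action) -> (obs, reward, done, info)` interface (the wrapper name is our own):

```python
class ActionRepeat:
    """Repeat each chosen action for k simulation steps, summing rewards,
    which is equivalent to enlarging the simulation timestep k times."""
    def __init__(self, env, k=4):
        self.env, self.k = env, k

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.k):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:  # stop early if the episode ends mid-repeat
                break
        return obs, total_reward, done, info
```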
| Hyper-parameter | Value |
| --- | --- |
| Actor network architecture | Tanh for the output layer, SELU for other layers |
| Critic network architecture | Linear for the output layer, SELU for other layers |
| Actor learning rate | 3e-4 |
| Critic learning rate | 3e-4 |
| Replay buffer size | |
2.2 ACE experiments
For all models we use an identical architecture of actor and critic networks, with hyper-parameters listed in Table 1. Our code used for the competition can be found online (https://github.com/hzwer/NIPS2017-LearningToRun).
We build the ensemble from models trained with the settings of the previous section. Fig. 3(b) gives the distribution of rewards when using ACE, where AXCY stands for X actors and Y critics. It can be seen that A10C10 (10 actors and 10 critics) has a much smaller chance of falling (rewards below 30) than A1C0, which is equivalent to DDPG. The maximum reward also improves, as shown in Tab. 2.
Training with ACE is found to have similar performance as Inference-Time ACE.
| Experiment | # Test | # Actor | # Critic | Average reward | Max reward | # Fall off |
We propose Actor-Critic Ensemble, a deterministic method that avoids dooming actions at inference time by asking an ensemble of critics to pick among actions proposed by an ensemble of actors. Experiments show that ACE can significantly improve the performance of DDPG, as exhibited by fewer falls and an increased running speed of the skeletons.
- 1 Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- 2 Anonymous. Distributional policy gradients. Under review at International Conference on Learning Representations, 2018.
- 3 Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. arXiv preprint arXiv:1706.02515, 2017.
- 4 Thomas G. Dietterich. Ensemble methods in machine learning. Multiple Classifier Systems, 1857:1–15, 2000.
- 5 Mikhail Pavlov, Sergey Kolesnikov, and Sergey M. Plis. Run, skeleton, run: skeletal model in a physics-based simulation. arXiv preprint arXiv:1711.06922, 2017.