Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks. Normally, the critic's action-value function is updated using temporal-difference learning, and the critic in turn provides a loss for the actor that trains it to take actions with higher expected return. In this paper, we introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor that accelerates and improves actor-critic learning. Compared to the vanilla critic, the meta-critic network is explicitly trained to accelerate the learning process; and compared to existing meta-learning algorithms, the meta-critic is rapidly learned online for a single task, rather than slowly over a family of tasks. Crucially, our meta-critic framework is designed for off-policy learners, which currently provide state-of-the-art reinforcement learning sample efficiency. We demonstrate that online meta-critic learning leads to improvements in a variety of continuous control environments when combined with the contemporary Off-PAC methods DDPG, TD3 and the state-of-the-art SAC.
Off-policy Actor-Critic (Off-PAC) methods are currently central in deep reinforcement learning (RL) research due to their greater sample efficiency compared to on-policy alternatives. On-policy learning requires new trajectories to be collected for each update to the policy, and is expensive because the number of gradient steps and samples per step increases with task complexity, even for contemporary TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017) and A3C (Mnih et al., 2016) algorithms.
Off-policy methods achieve greater sample efficiency due to their ability to learn from randomly sampled historical transitions without a time sequence requirement, thus making better use of past experience. Their critic estimates the action-value (Q-value) function using a differentiable function approximator, and the actor updates its policy parameters in the direction of the approximate action-value gradient. Briefly, the critic provides a loss to guide the actor, and is trained in turn to estimate the environmental action-value under the current policy via temporal-difference learning (Sutton et al., 2009). In all these cases the learning objective function is hand-crafted and fixed.
Recently meta-learning, or “learning-to-learn”, has become topical as a paradigm to accelerate RL by learning aspects of the learning strategy, for example, through learning fast adaptation strategies (Finn et al., 2017; Rakelly et al., 2019; Riemer et al., 2019), exploration strategies (Gupta et al., 2018), optimization strategies (Duan et al., 2016b), losses (Houthooft et al., 2018), hyperparameters (Xu et al., 2018; Veeriah et al., 2019), and intrinsic rewards (Zheng et al., 2018). However, the majority of these works perform meta-learning on a family of tasks or environments and amortize this huge cost by deploying the trained strategy for fast learning on a new task.
In this paper we introduce a novel meta-critic network to enhance existing Off-PAC learning frameworks. The meta-critic is used alongside the vanilla critic to provide a loss to guide the actor’s learning. However, compared to the vanilla critic, the meta-critic is explicitly (meta)-trained to accelerate the learning process rather than merely estimate the action-value function. Overall, the actor is trained by gradients provided by both critic and meta-critic losses, the critic is trained by temporal-difference as usual, and the meta-critic is trained to generate maximum learning performance improvements in the actor. In our framework, both the critic and meta-critic use randomly sampled off-policy transitions for efficient and effective Off-PAC learning, providing superior sample efficiency compared to existing on-policy meta-learners. Furthermore, we demonstrate that our meta-critic can be successfully learned online within a single task. This is in contrast to the currently widely used meta-learning research paradigm – where entire task families are required to provide enough data for meta-learning, and to provide new tasks to amortize the huge cost of meta-learning.
Essentially our framework meta-learns an auxiliary loss function, which can be seen as an intrinsic motivation towards optimum learning progress (Oudeyer and Kaplan, 2009). As analogously observed in several recent meta-learning studies (Franceschi et al., 2018), our loss-learning can be formalized as a bi-level optimization problem with the upper level being meta-critic learning, and the lower level being conventional learning. We solve this joint optimization by iteratively updating the meta-critic and base learner online while solving a single task. Our strategy is thus related to the meta-loss learning in EPG (Houthooft et al., 2018), but learned online rather than offline, and integrated with Off-PAC rather than their on-policy policy-gradient learning. The most related prior work is LIRPG (Zheng et al., 2018), which meta-learns an intrinsic reward online. However, their intrinsic reward just provides a helpful scalar offset to the environmental reward for on-policy trajectory optimization via policy-gradient (Sutton et al., 2000). In contrast, our meta-critic provides a loss for direct actor optimization based only on sampled transitions, and thus achieves dramatically better sample efficiency than LIRPG reward learning in practice. We evaluate our framework on several contemporary continuous control benchmarks and demonstrate that online meta-critic learning can be integrated with and improve a selection of contemporary Off-PAC algorithms including DDPG, TD3 and SAC.
Policy-Gradient (PG) Methods. On-policy methods usually update actor parameters in the direction of greater cumulative reward. However, on-policy methods need to interact with the environment in a sequential manner to accumulate rewards, and the expected reward is generally not differentiable due to environment dynamics. Even when exploiting tricks like importance sampling and improved application of A2C (Zheng et al., 2018), the use of full trajectories is less effective than off-policy transitions, as a trajectory needs a series of transitions that are contiguous in time. Off-policy actor-critic architectures aim to provide better sample efficiency by reusing past experience (previously collected transitions). DDPG (Lillicrap et al., 2016) borrows two main ideas from Deep Q Networks (Mnih et al., 2013, 2015): a big replay buffer and a target Q network to give consistent targets during temporal-difference backups. TD3 (Twin Delayed Deep Deterministic policy gradient) (Fujimoto et al., 2018) develops a variant of Double Q-learning by taking the minimum value between a pair of critics to limit over-estimation. SAC (Soft Actor-Critic) (Haarnoja et al., 2018a, b) proposes a maximum entropy RL framework where its stochastic actor aims to simultaneously maximize expected action-value and entropy. The latest version of SAC (Haarnoja et al., 2018b) also includes the “minimum value between both critics” idea in its implementation.
Meta Learning for RL. Meta-learning (a.k.a. learning to learn) (Santoro et al., 2016; Finn et al., 2017) has received a resurgence in interest recently due to its potential to improve learning performance, and especially sample efficiency in RL (Gupta et al., 2018). Several studies learn optimizers that provide policy updates with respect to known loss or reward functions (Andrychowicz et al., 2016; Duan et al., 2016b; Meier et al., 2018). A few studies learn hyperparameters (Xu et al., 2018; Veeriah et al., 2019), loss functions (Houthooft et al., 2018; Sung et al., 2017) or rewards (Zheng et al., 2018) that steer the learning of standard optimizers. Our meta-critic framework is in the category of loss-function meta-learning, but unlike most of these we are able to meta-learn the loss function online, in parallel to learning a single extrinsic task. No costly offline learning on a task family is required, as it is in Houthooft et al. (2018) and Sung et al. (2017). Most current meta-RL methods are based on on-policy policy-gradient, limiting their sample efficiency. For example, while LIRPG (Zheng et al., 2018) is one of the few prior works to attempt online meta-learning, it is ineffective in practice because it only provides a scalar reward increment rather than a loss for direct optimization. A few meta-RL studies have begun to address off-policy RL, for conventional multi-task meta-learning (Rakelly et al., 2019) and for optimising transfer vs forgetting in continual learning of multiple tasks (Riemer et al., 2019). The contribution of our Meta-Critic is to enhance state-of-the-art Off-PAC RL with single-task online meta-learning.
where a teacher network predicts the parameters of a manually designed loss in supervised learning. In contrast, our meta-critic is itself a differentiable loss, and is designed for use in reinforcement learning. Other applications learn losses that improve model robustness to out-of-distribution samples (Li et al., 2019; Balaji et al., 2018). Our loss learning architecture is related to that of Li et al. (2019), but designed for accelerating single-task Off-PAC RL rather than improving robustness in multi-domain supervised learning.
We aim to learn a meta-critic that provides an auxiliary loss to assist the actor’s learning of a task. The auxiliary loss parameters are optimized in a meta-learning process. The main policy loss and auxiliary loss train the actor off-policy via stochastic gradient descent.
Reinforcement learning involves an agent interacting with the environment $E$. At each time $t$, the agent receives an observation $s_t$, takes a (possibly stochastic) action $a_t$ based on its policy $\pi$, and receives a scalar reward $r_t$ and new state of the environment $s_{t+1}$. We call $(s_t, a_t, r_t, s_{t+1})$ a single-point transition. The objective of RL is to find the optimal policy $\pi_\phi$, which maximizes the expected cumulative return $J$.
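In on-policy RL, $J$ is defined as the discounted episodic return based on a sequential trajectory over the horizon $H$: $(s_0, a_0, r_0, \ldots, s_H, a_H, r_H)$, with $J = \sum_{t=0}^{H} \gamma^t r_t$. In the usual implementation of A2C, $J$ is represented by a surrogate state-value $V(s_t)$ from its critic. Since $J$ is only a scalar value, the gradient of $J$ with respect to policy parameters $\phi$ has to be optimized under the policy gradient theorem (Sutton et al., 2000): $\nabla_\phi J(\phi) = \mathbb{E}\left[J \, \nabla_\phi \log \pi_\phi(a_t \mid s_t)\right]$.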
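As a small illustration of the discounted episodic return described above, the following sketch sums the rewards of one trajectory with a discount factor. The function name `discounted_return` and the default discount value are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t] for a single trajectory."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Because the return needs every reward of the trajectory in order, it cannot be computed from isolated transitions, which is the contrast with the off-policy setting discussed next.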
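In off-policy RL (e.g., DDPG, TD3, SAC), which is our focus in this paper, parameterized policies $\pi_\phi$ can be directly updated by defining the actor loss in terms of the expected return $J(\phi)$ and taking its gradient $\nabla_\phi J(\phi)$, where $J(\phi)$ depends on the action-value $Q_\theta(s, a)$. The main loss $L^{\text{main}}$ provided by the vanilla critic is thus

$$L^{\text{main}} = -J(\phi) = -\mathbb{E}_{s \sim p_\pi}\, Q_\theta(s, a)\big|_{a = \pi_\phi(s)} \tag{1}$$

where we follow the notation in TD3 and SAC that $\phi$ and $\theta$ denote actors and critics respectively.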
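The main loss is calculated on a mini-batch of transitions randomly sampled from the replay buffer. The actor’s policy network is updated as $\phi \leftarrow \phi - \eta \nabla_\phi L^{\text{main}}$, following the critic’s gradient to increase the likelihood of actions that achieve a higher Q-value. Meanwhile, the critic uses Q-learning updates to estimate the action-value function:

$$\theta \leftarrow \arg\min_\theta\, \mathbb{E}\left[\left(Q_\theta(s_t, a_t) - r_t - \gamma\, Q_\theta(s_{t+1}, \pi_\phi(s_{t+1}))\right)^2\right] \tag{2}$$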
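The critic's temporal-difference update can be sketched on plain lists of numbers as follows. This is a minimal illustration, not the authors' implementation: the function names, the `dones` flag for terminal states, and the default discount are assumptions added for the example.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step TD targets r + gamma * Q(s', pi(s')).

    `dones` marks terminal transitions, for which no bootstrap term is added
    (an assumption of this sketch; common practice in replay-buffer methods).
    """
    return [
        r + gamma * (0.0 if done else q_next)
        for r, q_next, done in zip(rewards, next_q_values, dones)
    ]


def critic_loss(q_values, targets):
    """Mean squared TD error over a mini-batch of sampled transitions."""
    errors = [(q - y) ** 2 for q, y in zip(q_values, targets)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    targets = td_targets([1.0, 2.0], [10.0, 10.0], [False, True], gamma=0.9)
    print(targets, critic_loss([9.0, 2.5], targets))
```

The transitions in the batch need not be consecutive in time, which is why a replay buffer of past experience can be sampled uniformly.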
Our meta-learning goal is to train an auxiliary meta-critic network $h_\omega$ that in turn enhances actor learning. Specifically, it should lead to the actor $\phi$ having improved performance on the main task when following gradients provided by the meta-critic as well as those provided by the main task. This can be seen as a bi-level optimization problem (Franceschi et al., 2018; Rajeswaran et al., 2019; see Franceschi et al. (2018) for a discussion on convergence of bi-level algorithms) of the form:

$$\omega = \arg\min_\omega L^{\text{meta}}(d_{\text{val}}; \phi^*) \quad \text{s.t.} \quad \phi^* = \arg\min_\phi \left(L^{\text{main}}(d_{\text{trn}}; \phi) + L^{\text{mcritic}}_\omega(d_{\text{trn}}; \phi)\right) \tag{3}$$
where we can assume $L^{\text{meta}}(\cdot) = L^{\text{main}}(\cdot)$ for now. Here the lower-level optimization trains the actor $\phi$ to minimize both the main task and meta-critic-provided losses on some training samples. The upper-level optimization further requires the meta-critic $\omega$ to have produced a learned actor $\phi^*$ that minimizes a meta-loss that measures the actor’s main task performance on a second set of validation samples, after being trained by the meta-critic. Note that in principle the lower-level optimization could rely purely on $L^{\text{mcritic}}_\omega$, analogously to the procedure in EPG (Houthooft et al., 2018), but we find that optimizing their sum greatly increases learning stability and speed. Eq. (3) is satisfied when the meta-critic successfully trains the actor for good performance on the main task as measured by the validation meta-loss. Note that the vanilla critic update is also in the lower loop, but since it is updated as usual, we focus on the actor and meta-critic optimization for simplicity of exposition.
In this setup the meta-critic is a neural network $h_\omega(d_{\text{trn}}; \phi)$ that takes as input some featurisation of the actor $\phi$ and the states and actions in $d_{\text{trn}}$. This auxiliary neural network must produce a scalar output, which we can then treat as a loss $L^{\text{mcritic}}_\omega := h_\omega$, and must be differentiable with respect to $\phi$. We next discuss the overall optimization flow, and discuss the specific meta-critic architecture later.
Meta-Optimization Flow. To optimize Eq. (3), we iteratively update the meta-critic parameters $\omega$ (upper-level) and the actor and vanilla-critic parameters $\phi$ and $\theta$ (lower-level). At each iteration, we perform: (i) Meta-train: Sample a mini-batch of transitions and putatively update the policy according to the main and meta-critic losses. (ii) Meta-test: Sample another mini-batch of transitions to evaluate the performance of the updated policy according to $L^{\text{meta}}$. (iii) Meta-optimization: Update the meta-critic parameters $\omega$ to maximize the performance on the validation batch, and perform the real actor update according to both losses. In this way the meta-critic is trained online and in parallel to the actor so that they co-evolve. Figure 1 and Alg. 1 summarize the process, and the details of each step are explained next.
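Updating Actor Parameters ($\phi$). During meta-train, we randomly sample a mini-batch of transitions $d_{\text{trn}} = \{(s_i, a_i, r_i, s_{i+1})\}$ with batch size $N$ from the replay buffer $\mathcal{D}$. We then update the policy using both losses as: $\phi_{\text{new}} = \phi - \eta \frac{\partial L^{\text{main}}(d_{\text{trn}})}{\partial \phi} - \eta \frac{\partial L^{\text{mcritic}}_\omega(d_{\text{trn}})}{\partial \phi}$. We also compute a separate update $\phi_{\text{old}} = \phi - \eta \frac{\partial L^{\text{main}}(d_{\text{trn}})}{\partial \phi}$ that only makes use of the vanilla loss. If the meta-critic provided a beneficial source of loss, $\phi_{\text{new}}$ should be a better parameter than $\phi$, and in particular it should be a better parameter than $\phi_{\text{old}}$. We will use this comparison in the next meta-test step.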
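The two candidate actor updates of the meta-train step can be sketched as follows, assuming the gradients of the main loss and of the meta-critic loss with respect to the actor parameters are already available as flat lists. The helper names `sgd_step` and `candidate_updates` are hypothetical.

```python
def sgd_step(params, grads, lr):
    """Plain gradient-descent step on a flat list of parameters."""
    return [p - lr * g for p, g in zip(params, grads)]


def candidate_updates(phi, grad_main, grad_aux, lr=0.001):
    """Return (phi_old, phi_new) for one meta-train step.

    phi_old follows only the vanilla critic's gradient; phi_new follows the
    sum of the vanilla and meta-critic gradients.
    """
    phi_old = sgd_step(phi, grad_main, lr)
    combined = [gm + ga for gm, ga in zip(grad_main, grad_aux)]
    phi_new = sgd_step(phi, combined, lr)
    return phi_old, phi_new


if __name__ == "__main__":
    old, new = candidate_updates([1.0, -2.0], [0.5, 0.5], [1.0, 0.0], lr=0.1)
    print(old, new)
```

Both candidates start from the same parameters and the same training batch, so any difference in their validation performance is attributable to the auxiliary loss alone.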
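Updating Meta-Critic Parameters ($\omega$). To train the meta-critic network, we sample another mini-batch of transitions $d_{\text{val}} = \{(s_i^{\text{val}}, a_i^{\text{val}}, r_i^{\text{val}}, s_{i+1}^{\text{val}})\}$ with batch size $M$. The use of a validation batch for bi-level meta-optimization (Franceschi et al., 2018; Rajeswaran et al., 2019) ensures the meta-learned component does not overfit. Since our framework is off-policy, this does not incur any sample-efficiency cost. The meta-critic is then updated by a meta-loss $L^{\text{meta}}$, which could in principle be the same as the main loss $L^{\text{main}}$. However, we find it helpful for optimization efficiency to optimize the difference between the updates with and without the meta-critic’s input. Specifically, we use

$$L^{\text{meta}}(d_{\text{val}}; \phi_{\text{new}}) = \tanh\left(L^{\text{main}}(d_{\text{val}}; \phi_{\text{new}}) - L^{\text{main}}(d_{\text{val}}; \phi_{\text{old}})\right) \tag{4}$$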
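which is simply a monotonic re-centering and re-scaling of $L^{\text{main}}$. (Note that the parameter $\omega$ that minimises $L^{\text{meta}}$ as in Eq. (4) is also the minimum of $L^{\text{main}}$, and vice versa.) This leads to

$$\omega \leftarrow \arg\min_\omega \tanh\left(L^{\text{main}}(d_{\text{val}}; \phi_{\text{new}}) - L^{\text{main}}(d_{\text{val}}; \phi_{\text{old}})\right) \tag{5}$$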
Note that here the updated actor $\phi_{\text{new}}$ depends on the feedback given by the meta-critic $\omega$, and $\phi_{\text{old}}$ does not. Thus only the first term is optimized for $\omega$. In this setup the term $L^{\text{main}}(d_{\text{val}}; \phi_{\text{new}})$ should obtain high reward/low loss on the validation batch, and the second term $L^{\text{main}}(d_{\text{val}}; \phi_{\text{old}})$ provides a baseline, analogous to the baseline widely used to accelerate and stabilize policy-gradient RL. The use of $\tanh$ reflects the idea of diminishing marginal utility, and ensures that the meta-loss range is always nicely distributed in $(-1, 1)$. In essence, the meta-loss is for the agent to ask itself the question “Did the meta-critic improve validation performance?”, and to adjust the meta-critic (auxiliary task) parameters accordingly.
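To make the meta-train, meta-test and meta-optimization steps concrete, here is a hand-differentiated sketch on a toy problem. Everything about the toy is an assumption made for illustration: the actor is a single scalar `phi`, the main loss is a mean squared distance to the batch values, and the auxiliary loss is the hypothetical form `0.5 * omega * phi**2`. Only the structure (two candidate updates, a tanh of the validation-loss difference, and a gradient that flows to `omega` through `phi_new` only) mirrors the text above.

```python
import math


def main_loss(batch, phi):
    """Toy main loss: mean squared distance between phi and the batch values."""
    return sum((phi - x) ** 2 for x in batch) / len(batch)


def main_grad(batch, phi):
    """Derivative of main_loss with respect to phi."""
    return 2.0 * (phi - sum(batch) / len(batch))


def meta_step(phi, omega, d_trn, d_val, lr=0.1, meta_lr=0.1):
    """One meta-train / meta-test / meta-optimization iteration on the toy."""
    # Meta-train: candidate updates without and with the auxiliary loss,
    # whose gradient with respect to phi is omega * phi in this toy.
    g_main = main_grad(d_trn, phi)
    phi_old = phi - lr * g_main
    phi_new = phi - lr * (g_main + omega * phi)

    # Meta-test: compare both candidates on the validation batch.
    delta = main_loss(d_val, phi_new) - main_loss(d_val, phi_old)
    l_meta = math.tanh(delta)

    # Meta-optimization: chain rule through phi_new only; phi_old does not
    # depend on omega and acts as a constant baseline.
    dphi_new_domega = -lr * phi
    dl_meta_domega = (1.0 - l_meta ** 2) * main_grad(d_val, phi_new) * dphi_new_domega
    omega = omega - meta_lr * dl_meta_domega

    # The real actor update uses both losses, i.e. phi_new.
    return phi_new, omega, l_meta


if __name__ == "__main__":
    phi, omega = 2.0, 0.3
    for _ in range(5):
        phi, omega, l_meta = meta_step(phi, omega, [0.0, 1.0], [0.5, 1.5])
        print(round(phi, 4), round(omega, 4), round(l_meta, 4))
```

In a real implementation the derivative of the meta-loss with respect to the meta-critic parameters would come from automatic differentiation through the actor update rather than from a hand-written chain rule.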
Designing Meta-Critic ($h_\omega$). The meta-critic network $h_\omega$ implements the auxiliary loss for the actor. The design space for $h_\omega$ has several requirements: (i) Its input must depend on the policy parameters $\phi$, because this auxiliary loss is also used to update the policy network. (ii) It should be permutation invariant to transitions in $d_{\text{trn}}$, i.e., it should not make a difference if we feed the randomly sampled transitions indexed [1,2,3] or [3,2,1]. The most naive way to achieve (i) is given in MetaReg (Balaji et al., 2018), which meta-learns a parameter regularizer: $h_\omega(\phi) = \sum_i \omega_i |\phi_i|$. Although this form of $h_\omega$ acts directly on $\phi$, it does not exploit state information, and introduces a large number of parameters in $h_\omega$, as $\phi$ may be a high-dimensional neural network. Therefore, we design a more efficient and effective form of $h_\omega$ that also meets both of these requirements. Similar to the feature extractor in supervised learning, the actor needs to analyse and extract information from states for decision-making. We assume the policy network can be represented as

$$\pi_\phi(s) = \hat{\pi}(\bar{\pi}(s))$$
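and decomposed into the feature extraction $\bar{\pi}_\phi$ and decision-making $\hat{\pi}_\phi$ (i.e., the last layer of the full policy network) modules. Thus the output of the penultimate layer of the full policy network is just the output of feature extraction $\bar{\pi}_\phi(s)$, and this feature jointly encodes $\phi$ and $s$. Given this encoding, we implement $h_\omega(d_{\text{trn}}; \phi)$ as a three-layer multi-layer perceptron (MLP) whose input is the extracted feature from $\bar{\pi}_\phi(s)$. Here we consider two designs for the meta-critic ($h_\omega$): using our joint feature alone (Eq. (6)) or augmenting the joint feature with states and actions (Eq. (7)):

$$h_\omega(d_{\text{trn}}; \phi) = \frac{1}{N} \sum_{i=1}^{N} h_\omega\left(\bar{\pi}_\phi(s_i)\right) \tag{6}$$

$$h_\omega(d_{\text{trn}}; \phi) = \frac{1}{N} \sum_{i=1}^{N} h_\omega\left(\bar{\pi}_\phi(s_i), s_i, a_i\right) \tag{7}$$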
$h_\omega$ works out the auxiliary loss based on this batch-wise set-embedding (Zaheer et al., 2017) of our joint actor-state feature. That is to say, $d_{\text{trn}}$ is a randomly sampled mini-batch of transitions from the replay buffer, the $s_i$ (and $a_i$) of the transitions are input to the $h_\omega$ network in a permutation-invariant way, and finally we obtain the auxiliary loss for this batch $d_{\text{trn}}$. Here, our design of Eq. (7) also includes the cue features in LIRPG and EPG, where $s_i$ and $a_i$ are used as the input of their learned reward and loss respectively. We apply a softplus activation to the final layer of $h_\omega$, following the idea in TD3 that the vanilla critic may over-estimate, so the introduction of a non-negative actor auxiliary loss can mitigate such over-estimation. Moreover, we point out that only $s_i$ (and $a_i$) from $d_{\text{trn}}$ are used when calculating $L^{\text{main}}$ and $L^{\text{mcritic}}_\omega$ for the actor, while $s_i$, $a_i$, $r_i$ and $s_{i+1}$ are all used for optimizing the vanilla critic.
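The permutation-invariant, non-negative auxiliary loss described above can be sketched with a tiny MLP written in plain Python. This is an illustrative sketch under stated assumptions: the layer sizes, the `params` layout and all function names are made up for the example, and `features` stands in for the actor's penultimate-layer outputs on each sampled state.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relu(x):
    return max(x, 0.0)


def dense(vec, weights, bias):
    """Affine layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def meta_critic_loss(features, params):
    """Mean over the batch of an MLP score for each per-transition feature.

    Averaging over transitions makes the result independent of batch order,
    and the softplus on the output keeps the auxiliary loss non-negative.
    """
    (w1, b1), (w2, b2), (w3, b3) = params
    scores = []
    for feature in features:
        hidden = [relu(x) for x in dense(feature, w1, b1)]
        hidden = [relu(x) for x in dense(hidden, w2, b2)]
        scores.append(softplus(dense(hidden, w3, b3)[0]))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    params = (
        ([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1]),
        ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
        ([[0.3, -0.7]], [0.05]),
    )
    batch = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
    print(meta_critic_loss(batch, params))
    print(meta_critic_loss(list(reversed(batch)), params))
```

For the augmented design, each feature vector would simply be concatenated with the corresponding state and action before being passed to the first layer.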
Implementation on DDPG, TD3 and SAC. Our meta-critic module can be incorporated in the main Off-PAC methods DDPG, TD3 and SAC. In our framework, these algorithms differ only in their definitions of $L^{\text{main}}$, and the meta-critic implementation is otherwise exactly the same for each. Further implementation details can be found in the supplementary material.
TD3 (Fujimoto et al., 2018) borrows the Double Q-learning idea and uses the minimum value between both critics to limit over-estimation in its value estimates. At the same time, computational cost is reduced by using a single actor optimized with respect to $Q_{\theta_1}$. Thus the corresponding $L^{\text{main}}$ for the actor becomes:

$$L^{\text{main}} = -\mathbb{E}_{s \sim p_\pi}\, Q_{\theta_1}(s, a)\big|_{a = \pi_\phi(s)}$$
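In SAC, two key ingredients are considered for the actor: maximizing the policy entropy and automatic temperature hyper-parameter regulation. At the same time, the latest version of SAC (Haarnoja et al., 2018b) also adopts the idea of “taking the minimum value between both critics”. The $L^{\text{main}}$ for the SAC actor is:

$$L^{\text{main}} = \mathbb{E}_{s \sim p_\pi}\left[\alpha \log \pi_\phi(a \mid s) - Q_\theta(s, a)\right]_{a \sim \pi_\phi(\cdot \mid s)}$$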
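Since the three base algorithms differ only in the actor's main loss, the differences can be sketched as three small functions over per-sample quantities. These are hedged illustrations of the standard forms of each algorithm rather than the authors' code: the inputs are assumed to be Q-values already evaluated at the actor's own actions, and the temperature default `alpha=0.2` is an arbitrary illustrative value.

```python
def ddpg_actor_loss(q_values):
    """DDPG: negative mean Q-value of the actor's actions."""
    return -sum(q_values) / len(q_values)


def td3_actor_loss(q1_values):
    """TD3: same form as DDPG, using only the first of the two critics."""
    return -sum(q1_values) / len(q1_values)


def sac_actor_loss(q1_values, q2_values, log_probs, alpha=0.2):
    """SAC: entropy-regularised loss using the minimum of both critics."""
    terms = [
        alpha * log_p - min(q1, q2)
        for q1, q2, log_p in zip(q1_values, q2_values, log_probs)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    print(ddpg_actor_loss([1.0, 3.0]))
    print(sac_actor_loss([1.0, 2.0], [0.5, 3.0], [-1.0, -2.0], alpha=0.5))
```

In each case the meta-critic's auxiliary loss would be added to the value returned here before the actor's gradient step.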
The goal of our experimental evaluation is to demonstrate the versatility of our meta-critic module in integration with several prior Off-PAC algorithms, and its efficacy in improving their respective performance. We use the open-source implementations of DDPG, TD3 and SAC algorithms as our baselines, and denote their enhancements by meta-critic as DDPG-MC, TD3-MC, SAC-MC respectively. All -MC agents have both their built-in vanilla critic, and the meta-critic that we propose. We take Eq. (6) as the default meta-critic architecture , and we compare the alternative in the later ablation study. For our implementation of meta-critic, we use a three-layer neural network with an input dimension of (300 in DDPG and TD3, 256 in SAC), two hidden feed-forward layers of
hidden nodes each, and ReLU non-linearity between layers.
We evaluate the methods on a suite of seven MuJoCo continuous control tasks (Todorov et al., 2012) in OpenAI Gym (Brockman et al., 2016), two MuJoCo tasks in rllab (Duan et al., 2016a), and the simulated racing car environment TORCS (Loiacono et al., 2013). For MuJoCo-Gym, we use the latest V2 tasks instead of V1 used in TD3 and the old-SAC (Haarnoja et al., 2018a) work without any modification to their original environment or reward.
Implementation Details. For DDPG, we use the open-source implementation “OurDDPG” (https://github.com/sfujim/TD3/blob/master/OurDDPG.py), which is the re-tuned version of DDPG implemented in Fujimoto et al. (2018) with the same hyper-parameters of the actor and critic for MuJoCo tasks. For TD3 and SAC, we use the open-source implementations of TD3 (https://github.com/sfujim/TD3/blob/master/TD3.py) and SAC (https://github.com/pranz24/pytorch-soft-actor-critic). In MuJoCo cases we integrate our meta-critic with learning rate 0.001. The hyper-parameters for TORCS can be found in the supplementary material.
DDPG. Figure 2 shows the learning curves of DDPG and DDPG-MC. The experimental results for each task are averaged over 5 random seeds (trials) and network initialisations, and the standard deviations are shown as shaded regions over the time steps. Following Fujimoto et al. (2018), curves are uniformly smoothed for clarity (window_size=10 for TORCS, 30 for others). We run the gym-MuJoCo experiments for 1-10 million steps depending on the environment, the rllab experiments for 3 million steps and the TORCS experiment for 100 thousand steps. Every 1000 steps we evaluate our policy over 10 episodes with no exploration noise.
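Uniform smoothing of a learning curve can be sketched as a trailing moving average. This is only one plausible reading of "uniformly smoothed": the exact smoothing used for the figures is not specified in the text, and the function name and the handling of the first few points are assumptions.

```python
def smooth(values, window_size):
    """Trailing uniform moving average.

    The first points, where fewer than `window_size` values are available,
    are averaged over the values seen so far.
    """
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window_size:
            # Drop the value that has just left the window.
            running -= values[i - window_size]
        smoothed.append(running / min(i + 1, window_size))
    return smoothed


if __name__ == "__main__":
    print(smooth([1.0, 2.0, 3.0, 4.0], window_size=2))
```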
From the learning curves in Figure 2, we can see that DDPG-MC generally outperforms the corresponding DDPG baseline in terms of learning speed and asymptotic performance. Furthermore, it usually has smaller variance. The summary results for all tasks in terms of max average return are given in Table 1. The -MC variant usually provides a higher max return. We select the seven tasks shown in Figure 2 for plotting, because the other MuJoCo tasks “Reacher”, “InvertedPendulum” and “InvertedDoublePendulum” have environmental reward upper bounds which all methods reach quickly without obvious differences.
TD3 and SAC. Figure 3 reports the learning curves for TD3. For some tasks vanilla TD3 performance declines in the long run, while our TD3-MC shows improved stability with much higher asymptotic performance. Generally speaking, the learning curves show that TD3-MC provides comparable or better learning performance in each case, while Table 1 shows a clear improvement in the max average return.
Figure 4 reports the learning curves of SAC. Note that we use the most recent update of SAC (Haarnoja et al., 2018b), which can be regarded as the combination SAC+TD3. Although this SAC+TD3 is arguably the strongest existing method, SAC-MC still gives a clear boost in asymptotic performance for several of the tasks.
Comparison vs PPO-LIRPG. Intrinsic Reward Learning for PPO (Zheng et al., 2018) is the most related method to our work in performing online single-task meta-learning of an auxiliary reward/loss via a neural network. The original PPO-LIRPG study evaluated on a modified environment with hidden rewards. Here we apply it to the standard unmodified learning tasks that we aim to improve. The results in Table 1 demonstrate that: (i) In this conventional setting, PPO-LIRPG worsens rather than improves basic PPO performance. (ii) Overall, Off-PAC methods perform better than on-policy PPO for most environments. This shows the importance of our meta-learning contribution to the off-policy setting. In general Meta-Critic is preferred to PPO-LIRPG because the latter only provides a scalar reward bonus that influences the policy indirectly via high-variance policy-gradient updates, while Meta-Critic provides a direct loss.
Summary. Table 1 and Figure 5 summarize all the results in terms of max average return. We can see that SAC-MC generally performs best; the Meta-Critic-enhanced methods are generally comparable to or better than their corresponding vanilla alternatives; and Meta-Critic usually provides lower variance in return compared to the baselines.
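Table 2: Ablation of meta-critic designs on Walker2d under SAC-MC.

|Metric|SAC|SAC-MC, Eq. (6) (default)|SAC-MC, Eq. (7)|SAC-MC, MetaReg|SAC-MC, no baseline|
|---|---|---|---|---|---|
|Max Average Return|6398.8 ± 289.2|7164.9 ± 151.3|7423.8 ± 780.2|6644.3 ± 1815.6|6456.1 ± 424.8|
|Sum Average Return|53,695,678|61,672,039|57,364,405|58,875,184|52,446,717|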
Loss Analysis. To analyse the learning dynamics of our algorithm, we take a simple learning problem, tabular MDP (Duan et al., 2016b), as an example, and compare DDPG vs DDPG-MC. Figure 6 reports the main loss curves of the actor and the loss curves of $h_\omega$ (i.e., $L^{\text{mcritic}}_\omega$) and $L^{\text{meta}}$ over 5 trials for DDPG-MC. In addition, we plot the model optimization trajectories (pink dots) via a 2D weight-space slice in Figure 7. These are plotted over the average reward surface for this slice. Following the neural network visualization method of Li et al. (2018), we calculate the subspace to plot as follows: let $\phi_i$ denote the model parameters at episode $i$ and $\phi_n$ the final estimate. We apply PCA to the matrix $M = [\phi_0 - \phi_n, \ldots, \phi_{n-1} - \phi_n]$, and take the two most explanatory directions of this optimization path. Model parameters are then projected onto the plane defined by these directions for plotting, and models at each point on this plane are densely evaluated to calculate average reward.
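The trajectory projection described above can be sketched with NumPy as follows. This is a minimal sketch, assuming the parameter checkpoints have been flattened into rows of an array with the final estimate last; the function name `pca_plane` is hypothetical, and centring the difference matrix before the SVD is this sketch's choice for implementing PCA.

```python
import numpy as np


def pca_plane(checkpoints):
    """Project parameter checkpoints onto the top-2 PCA directions of the path.

    checkpoints: array-like of shape (n + 1, dim); the last row is the final
    estimate. Returns (coords, directions) with shapes (n + 1, 2) and (2, dim).
    """
    checkpoints = np.asarray(checkpoints, dtype=float)
    final = checkpoints[-1]
    # Rows are the differences between each earlier checkpoint and the final one.
    diffs = checkpoints[:-1] - final
    centred = diffs - diffs.mean(axis=0)
    # Right singular vectors of the centred matrix are the principal directions.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    directions = vt[:2]
    # 2D coordinates of every checkpoint relative to the final estimate.
    coords = (checkpoints - final) @ directions.T
    return coords, directions


if __name__ == "__main__":
    path = [[3.0, 0.0, 0.0], [2.0, 1.0, 0.0], [1.0, 1.5, 0.0],
            [0.5, 0.5, 0.0], [0.0, 0.0, 0.0]]
    coords, directions = pca_plane(path)
    print(coords)
```

A reward surface over the same plane could then be obtained by evaluating the model at `final + x * directions[0] + y * directions[1]` on a grid of `(x, y)` values.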
We see some interesting behavior in these results. Figure 6 shows: (i) DDPG-MC shows faster convergence to a lower value of $L^{\text{main}}$, demonstrating the auxiliary loss’s ability to accelerate learning. (ii) The meta-loss (which corresponds to the success of the meta-critic in improving actor learning) shows a pattern: ‘positive’ → ‘negative’ → ‘converging to zero’. This pattern is expected because, at the start, $h_\omega$ is randomly initialised and knows little about how to help the actor, thus the $\phi_{\text{old}}$-based model outperforms the $\phi_{\text{new}}$-based model. Then as $h_\omega$ is trained by the meta-loss, it begins to make $\phi_{\text{new}}$ better than $\phi_{\text{old}}$. In the late stage, the meta-loss goes towards zero, which indicates that all of $h_\omega$’s knowledge has been distilled to help the actor. (iii) The auxiliary loss $L^{\text{mcritic}}_\omega$ converges smoothly under the supervision of the meta-loss. In Figure 7, (iv) DDPG-MC has a very direct optimization trajectory to the high-reward zone, while the vanilla DDPG model moves slowly through the low-reward space before finally finding the direction to the high-reward zone.
Ablation on $h_\omega$ design. To analyse the designs of $h_\omega$, we run Walker2d experiments under SAC-MC with the alternative architecture from Eq. (7) or the MetaReg (Balaji et al., 2018) format (input actor parameters directly). As shown in Table 2, we record the max average return and sum average return (regarded as the area under the average reward curve) of all evaluations during all time steps. Eq. (7) achieves the highest max average return and our default (Eq. (6)) attains the highest sum average return. We can also see some improvement from using the MetaReg format, but its huge number (73484) of parameters is expensive. Overall, all meta-critic module designs provide at least a small improvement on vanilla SAC.
Ablation on baseline in meta-loss. In Eq. (4), we use $L^{\text{main}}(d_{\text{val}}; \phi_{\text{old}})$ as a baseline to improve numerical stability of the gradient update. To evaluate this design, we remove the baseline and optimize $\tanh(L^{\text{main}}(d_{\text{val}}; \phi_{\text{new}}))$. The last column in Table 2 shows that this barely improves on vanilla SAC, validating our design choice to use a baseline.
We present Meta-Critic, an auxiliary critic module for Off-PAC methods that can be meta-learned online during single task learning. The meta-critic is trained to generate gradients that improve the actor’s learning performance over time, and leads to long run performance gains in continuous control. The meta-critic module can be flexibly incorporated into various contemporary Off-PAC methods to boost performance. In future work, we plan to apply the meta-critic to conventional meta-learning with multi-task and multi-domain RL.