The rapid development of deep neural-networks as general-purpose function approximators has propelled the recent Reinforcement Learning (RL) renaissance (Zai and Brown, 2020). RL algorithms have progressed in robustness, e.g. from (Lillicrap et al., 2016) to (Fujimoto et al., 2018); exploration (Haarnoja et al., 2018); gradient sampling (Schulman et al., 2017, 2015a); and off-policy learning (Fujimoto et al., 2019; Kumar et al., 2019). Many actor-critic algorithms have focused on improving the critic learning routines by modifying the target value (Hasselt et al., 2016), which enables more accurate and robust $Q$-function approximations. While this greatly improves the policy optimization efficiency, the performance is still bound by the networks’ ability to represent $Q$-functions and policies. Such a constraint calls for studying and designing neural models suited for the representation of these RL building blocks.
A critical insight in designing neural models for RL is the reciprocity between the state and the action, which both serve as the input for the $Q$-function. At the start, each input can be processed individually according to its source domain. For example, when the state $s$ is a vector of images, it is common to employ CNN models (Kaiser et al., 2019), and when the state $s$ or the action $a$ are natural language words, each input can be processed separately with embedding vectors (He et al., 2016). The common practice in incorporating the state and action learnable features into a single network is to concatenate the two vectors and follow with an MLP to yield the $Q$-value (Schulman et al., 2017). In this work, we argue that for actor-critic RL algorithms (Grondman et al., 2012), such an off-the-shelf method could be significantly improved with Hypernetworks.
In actor-critic methods, for each state, sampled from the dataset distribution, the actor’s task is to solve an optimization problem over the action distribution, i.e. the policy. This motivates an architecture where the $Q$-function is explicitly modeled as the value function of a contextual bandit (Lattimore and Szepesvári, 2020) where the state $s$ is the context. While standard architectures are not designed to model such a relationship, Hypernetworks were explicitly constructed for that purpose (Ha et al., 2016). Hypernetworks, also called meta-networks, can represent hierarchies by transforming a meta variable into a context-dependent function that maps a base variable to the required output space. This emphasizes the underlying dynamic between the meta and base variables and has found success in a variety of domains such as Bayesian neural-networks (Lior Deutsch, 2019), continual learning (von Oswald et al., 2019), generative models (Ratzlaff and Li, 2019) and adversarial defense (Sun et al., 2017). The practical success has sparked interest in the theoretical properties of Hypernetworks. For example, it has recently been shown that they enjoy better parameter complexity than classical models which concatenate the base and meta-variables together (Galanti and Wolf, 2020a, b).
When analyzing the critic’s ability to represent the $Q$-function, it is important to notice that in order to optimize the policy, modern off-policy actor-critic algorithms (Fujimoto et al., 2018; Haarnoja et al., 2018) utilize only the parametric neural gradient of the critic with respect to the action input, i.e., $\nabla_a Q_\theta(s, a)$. (This is in contrast to the REINFORCE approach (Williams, 1992) based on the policy gradient theorem (Sutton et al., 2000) which does not require a differentiable $Q$-function estimation.) Recently, (Ilyas et al., 2019) examined the accuracy of the policy gradient in on-policy algorithms. They demonstrated that standard RL implementations achieve gradient estimation with a near-zero cosine similarity when compared to the “true” gradient. Therefore, recovering better gradient approximations has the potential to substantially improve the RL learning process. Motivated by the need to obtain high-quality gradient approximations, we set out to investigate the gradient accuracy of Hypernetworks with respect to standard models. In Sec. 3 we analyze three critic models and find that the Hypernetwork model with a state as a meta-variable enjoys better gradient accuracy which translates into a faster learning rate.
Much like the induced hierarchy in the critic, meta-policies that optimize multi-task RL problems have a similar structure as they combine a task-dependent context and a state input. While some algorithms like MAML (Finn et al., 2017) and LEO (Rusu et al., 2019) do not utilize an explicit context, other works, e.g. PEARL (Rakelly et al., 2019) or MQL (Fakoor et al., 2019), have demonstrated that a context improves the generalization abilities. Recently, (Jayakumar et al., 2019) have shown that Multiplicative Interactions (MI) are an excellent design choice when combining states and contexts. MI operations can be viewed as shallow Hypernetwork architectures. In Sec. 4, we further explore this approach and study context-based meta-policies with deep Hypernetworks. We find that with Hypernetworks, the task and state-dependent gradients are disentangled s.t. the state-dependent gradients are marginalized out, which leads to an empirically lower learning step variance. This is specifically important in on-policy methods such as MAML, where there are fewer optimization steps during training.
The contributions of this paper are three-fold. First, in Sec. 3 we provide a theoretical link between the $Q$-function gradient approximation quality and the allowable learning rate for monotonic policy improvement. Next, we show empirically that Hypernetworks achieve better gradient approximations which translates into a faster learning rate and improves the final performance. Finally, in Sec. 4 we show that Hypernetworks significantly reduce the learning step variance in Meta-RL. We summarize our empirical results in Sec. 5, which demonstrates the gain of Hypernetworks both in single-task RL and Meta-RL. Importantly, we find empirically that Hypernetwork policies eliminate the need for the MAML adaptation step and improve the Out-Of-Distribution generalization in PEARL.
A Hypernetwork (Ha et al., 2016) is a neural-network architecture designed to process a tuple $(z, x)$ and output a value $y$. It is comprised of two networks, a primary network $w_\theta(z)$ which produces weights for a dynamic network $f_{w_\theta(z)}(x)$. Both networks are trained together, and the gradient flows through $f$ to the primary network’s weights $\theta$. During test time or inference, the primary weights $\theta$ are fixed while the input $z$ determines the dynamic network’s weights.
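The two-network structure described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' PyTorch implementation: the class name `TinyHypernetwork`, the single linear primary layer, the one-layer dynamic network and the initialization range are all assumptions chosen for brevity.

```python
import random

def linear(weights, bias, x):
    """Apply y = W x + b for a weight matrix given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weights, bias)]

class TinyHypernetwork:
    """Primary network maps z to the weights of a one-layer dynamic network."""

    def __init__(self, z_dim, x_dim, y_dim, seed=0):
        rng = random.Random(seed)
        self.x_dim, self.y_dim = x_dim, y_dim
        n_dynamic = y_dim * x_dim + y_dim  # dynamic weights plus biases
        # Primary parameters (theta): one linear map, an assumption for brevity.
        self.theta_w = [[rng.uniform(-0.1, 0.1) for _ in range(z_dim)]
                        for _ in range(n_dynamic)]
        self.theta_b = [0.0] * n_dynamic

    def dynamic_params(self, z):
        """Primary network: produce the dynamic weights and biases from z."""
        flat = linear(self.theta_w, self.theta_b, z)
        split = self.y_dim * self.x_dim
        w = [flat[i * self.x_dim:(i + 1) * self.x_dim]
             for i in range(self.y_dim)]
        b = flat[split:]
        return w, b

    def forward(self, z, x):
        """Dynamic network: apply the z-dependent weights to the base input x."""
        w, b = self.dynamic_params(z)
        return linear(w, b, x)

net = TinyHypernetwork(z_dim=3, x_dim=2, y_dim=1)
y = net.forward(z=[0.5, -1.0, 2.0], x=[1.0, 0.0])
```

Only `theta_w` and `theta_b` are learnable here; the weights applied to `x` are recomputed from `z` on every call, which is the property the text refers to as dynamic weights.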
The idea of learnable context-dependent weights can be traced back to (McClelland, 1985; Schmidhuber, 1992). However, only in recent years have Hypernetworks gained popularity when they have been applied successfully with many dynamic network models, e.g. recurrent networks (Ha et al., 2016), MLP networks for 3D point clouds (Littwin and Wolf, 2019), spatial transformation (Potapov et al., 2018), convolutional networks for video frame prediction (Jia et al., 2016) and few-shot learning (Brock et al., 2018). In the context of RL, Hypernetworks were also applied, e.g., in QMIX (Rashid et al., 2018) to solve Multi-agent RL tasks and for continual model-based RL (Huang et al., 2020).
which transform the meta-variable into a 1024 sized latent representation. This stage is followed by a series of parallel linear transformations, termed “heads”, which output the sets of dynamic weights. The dynamic network contains only a single hidden layer of 256 neurons, which is smaller than the standard MLP architecture used in many RL papers (Fujimoto et al., 2018; Haarnoja et al., 2018) of 2 hidden layers, each with 256 neurons. The computational model of each dynamic layer is
where the non-linearity is applied only over the hidden layer and is an additional gain parameter that is required in Hypernetwork architectures (Littwin and Wolf, 2019). We defer the discussion of these design choices to Sec. 5.
3 Recomposing the Actor-Critic’s $Q$-Function
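Reinforcement Learning concerns finding optimal policies in Markov Decision Processes (MDPs). An MDP (Dean and Givan, 1997) is defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, R)$ where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $\mathcal{P}$ is a set of probabilities to switch from a state $s$ to $s'$ given an action $a$, and $R$ is a scalar reward function. The objective is to maximize the expected discounted sum of rewards with a discount factor $\gamma$
$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, a_t \sim \pi(\cdot \mid s_t)\right].$$
$J(\pi)$ can also be written, up to a constant factor $1-\gamma$, as an expectation over the $Q$-function
$$J(\pi) = \mathbb{E}_{s \sim d^\pi}\left[\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^\pi(s, a)\right]\right],$$
where the $Q$-function $Q^\pi(s, a)$ is the expected discounted sum of rewards following visitation at state $s$ and execution of action $a$ (Sutton and Barto, 2018), and $d^\pi$ is the state distribution induced by policy $\pi$.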
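Actor-critic methods maximize $J(\pi)$ over the space of parameterized policies. Stochastic policies are constructed as a state dependent transformation of an independent random variable
$$a = \mu_\phi(\varepsilon \mid s), \quad \varepsilon \sim p_\varepsilon,$$
where $p_\varepsilon$ is a predefined multivariate distribution over $\mathbb{R}^{n_a}$ and $n_a$ is the number of actions. (Deterministic policies, on the other hand, are commonly defined as a deterministic transformation of the state’s feature vector.) To maximize $J(\pi_\phi)$ over the $\phi$ parameters, actor-critic methods operate with an iterative three-phase algorithm. First, they collect into a replay buffer $\mathcal{D}$ the experience tuples $(s, a, r, s')$ generated with the parametric $\pi_\phi$ and some additive exploration noise policy (Zhang and Sutton, 2017). Then they fit a critic which is a parametric model $Q_\theta$ for the $Q$-function. For that purpose, they apply TD-learning (Sutton and Barto, 2018) with the loss function
$$\mathcal{L}_{critic}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(Q_\theta(s, a) - r - \gamma\, \mathbb{E}_{a' \sim \pi_\phi(\cdot \mid s')}\left[Q_{\bar\theta}(s', a')\right]\right)^2\right],$$
where $\bar\theta$ is a lagging set of parameters (Lillicrap et al., 2016). Finally, they apply gradient descent updates in the direction of an off-policy surrogate of $\nabla_\phi J(\pi_\phi)$
$$\phi \leftarrow \phi + \eta\, \mathbb{E}_{s \sim \mathcal{D}}\left[\mathbb{E}_{\varepsilon \sim p_\varepsilon}\left[\nabla_\phi \mu_\phi(\varepsilon \mid s)\, \nabla_a Q_\theta(s, \mu_\phi(\varepsilon \mid s))\right]\right]. \tag{5}$$
Here, $\nabla_\phi \mu_\phi(\varepsilon \mid s)$ is a matrix of size $n_\phi \times n_a$ where $n_\phi$ is the number of policy parameters to be optimized.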
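The critic-fitting phase of this loop can be sketched in a few lines. This is a simplified illustration, not the TD3 or SAC implementation: the function names, the deterministic policy used in place of the expectation over next actions, the absence of terminal-state handling, and the Polyak-style `soft_update` for the lagging parameters are all assumptions.

```python
def td_targets(batch, q_target, policy, gamma=0.99):
    """Compute r + gamma * Q_target(s', policy(s')) for each transition."""
    return [r + gamma * q_target(s_next, policy(s_next))
            for (_s, _a, r, s_next) in batch]

def critic_loss(batch, q, q_target, policy, gamma=0.99):
    """Mean squared TD error of the critic q over a batch of transitions."""
    targets = td_targets(batch, q_target, policy, gamma)
    errors = [(q(s, a) - y) ** 2
              for (s, a, _r, _s_next), y in zip(batch, targets)]
    return sum(errors) / len(errors)

def soft_update(theta_bar, theta, tau=0.005):
    """Move the lagging parameters a small step toward the current ones."""
    return [(1.0 - tau) * tb + tau * t for tb, t in zip(theta_bar, theta)]

# Toy usage with scalar states and actions (hypothetical critic and policy).
q = lambda s, a: s + a
policy = lambda s: 0.0
batch = [(0.0, 1.0, 1.0, 2.0)]  # (s, a, r, s')
loss = critic_loss(batch, q, q, policy, gamma=0.5)
```

In the toy usage the same function plays the role of both the critic and its lagging copy; in practice the two hold different parameters, and `soft_update` is one common way to keep the copy lagging.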
Two well-known off-policy algorithms are TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018). TD3 optimizes deterministic policies with additive normal exploration noise and double $Q$-learning to improve the robustness of the critic part (Hasselt et al., 2016). On the other hand, SAC adopts stochastic, normally distributed policies but it modifies the reward function to include a high entropy bonus, which eliminates the need for exploration noise.
3.2 Our Approach
The gradient of the off-policy surrogate differs from the true gradient in two elements: First, the distribution of states is the empirical distribution in the dataset and not the policy distribution $d^\pi$; and second, the $Q$-function gradient is estimated with the critic’s parametric neural gradient $\nabla_a Q_\theta(s, a)$. Avoiding a distribution mismatch is the motivation of many constrained policy improvement methods such as TRPO and PPO (Schulman et al., 2015a, 2017). However, it requires very small and impractical steps. Thus, many off-policy algorithms ignore the distribution mismatch and seek to maximize only the empirical advantage
In practice, a positive empirical advantage is associated with better policies and is required by monotonic policy improvement methods such as TRPO (Kakade and Langford, 2002; Schulman et al., 2015a). Yet, finding positive empirical advantage policies requires a good approximation of the gradient $\nabla_a Q^\pi(s, a)$. The next proposition suggests that with a sufficiently accurate approximation, applying the gradient step as formulated in the actor update in Eq. (5) yields positive empirical advantage policies.
Let be a stochastic parametric policy with , and a transformation with a Lipschitz continuous gradient and a Lipschitz constant . Assume that its -function has a Lipschitz continuous gradient in , i.e. . Define the average gradient operator . If there exists a gradient estimation and s.t.
then the ascent step with yields a positive empirical advantage policy.
We provide the proof in the appendix. It follows that a positive empirical advantage can be guaranteed when the gradient of the $Q$-function is sufficiently accurate, and with better gradient models, i.e. a smaller approximation error, one may apply larger ascent steps. However, instead of fitting the gradient, actor-critic algorithms favor modeling the $Q$-function and estimate the gradient with the parametric gradient of the model $\nabla_a Q_\theta(s, a)$. It is not obvious whether better models for the $Q$-functions, with lower Mean-Squared-Error (MSE), provide better gradient estimation. A more direct approach could be to explicitly learn the gradient of the $Q$-function (Sarafian et al., 2020; Saremi, 2019); however, in this work, we choose to explore which architecture recovers more accurate gradient approximation based on the parametric gradient of the $Q$-function model.
We consider three alternative models:
MLP network, where state features (possibly learnable) and the action are concatenated into a single input of a multi-layer linear network.
Action-State Hypernetwork (AS-Hyper), where the actions are the meta variable, input of the primary network, and the state features are the base variable, input for the dynamic network.
State-Action Hypernetwork (SA-Hyper), which reverses the order of AS-Hyper.
To develop some intuition, let us first consider the simplest case where the dynamic network has a single linear layer and the MLP model is replaced with a plain linear model. Starting with the linear model, the -function and its gradient take the following parametric model:
where . Clearly, in this case, the gradient is not a function of the state, therefore it is impossible to exploit this model for actor-critic algorithms. For the AS-Hyper we obtain the following model
Usually, the state feature vector has a much larger dimension than the action dimension . Thus, the matrix has a large null-space which can potentially hamper the training as it may yield zero or near-zero gradients even when the true gradient exists.
On the other hand, the SA-Hyper formulation is
which is a state-dependent constant model of the gradient in the action $a$. While it is a relatively naive model, it is sufficient for localized policies with low variance as it approximates the tangent hyperplane around the policy mean value.
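The three single-layer cases discussed above can be written out as a short worked derivation. The notation is an assumption made for illustration: $\xi(s) \in \mathbb{R}^{n_\xi}$ denotes the state features, $w_s$ and $w_a$ are fixed weight vectors of the plain linear model, $w(\cdot)$ is the output of a primary network, and bias terms are omitted.

$$
\begin{aligned}
\text{Linear:} \quad & Q(s, a) = w_s^\top \xi(s) + w_a^\top a, && \nabla_a Q(s, a) = w_a, \\
\text{AS-Hyper:} \quad & Q(s, a) = w(a)^\top \xi(s), && \nabla_a Q(s, a) = \nabla_a w(a)\, \xi(s), \\
\text{SA-Hyper:} \quad & Q(s, a) = w(s)^\top a, && \nabla_a Q(s, a) = w(s).
\end{aligned}
$$

Under these assumptions the linear model's action gradient is the same vector $w_a$ for every state. In the AS-Hyper case $\nabla_a w(a)$ is an $n_a \times n_\xi$ matrix, so when $n_\xi > n_a$ it has a null-space of dimension at least $n_\xi - n_a$, and any state feature vector in that null-space produces a zero gradient. In the SA-Hyper case the gradient varies with the state but is constant in the action, matching the tangent-hyperplane interpretation.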
Moving forward to a multi-layer architecture, let us first consider the AS-Hyper architecture. In this case the gradient is . We see that the problem of the single layer is exacerbated since is now a matrix where is the number of dynamic network weights.
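Next, the MLP and SA-Hyper models can be jointly analyzed. First, we calculate the input’s gradient of each layer
$$x^{l+1} = \sigma\left(W^l x^l + b^l\right) \;\Rightarrow\; \nabla_{x^l} x^{l+1} = \Lambda^l(x^l)\, W^l, \quad \Lambda^l(x^l) = \mathrm{diag}\left(\sigma'\left(W^l x^l + b^l\right)\right),$$
where $\sigma$ is the activation function and $W^l$ and $b^l$ are the weights and biases of the $l$-th layer, respectively. By the chain rule, the input’s gradient of an $n$-layers network is the product of these expressions. For the MLP model we obtain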
On the other hand, in SA-Hyper the weights are the outputs of the primary network, thus we have
Importantly, while the SA-Hyper’s gradient configuration is controlled via the state-dependent weight matrices, in the MLP model it is a function of the state only via the diagonal elements of the activation-derivative matrices. These local derivatives of the non-linear activation functions are usually piecewise constant when the activations take the form of ReLU-like functions. Also, they are required to be bounded and smaller than one in order to avoid exploding gradients during training (Philipp et al., 2017). These restrictions significantly reduce the expressiveness of the parametric gradient and its ability to model the true $Q$-function gradient. For example, with ReLU, for two different state-action pairs $(s_1, a_1)$ and $(s_2, a_2)$ the estimated gradient is equal if they have the same active neurons map (i.e. the same ReLUs are in the active mode). Following this line of reasoning, we postulate that the SA-Hyper configuration should have better gradient approximations.
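The ReLU observation can be checked numerically on a toy critic. The sketch below assumes a one-hidden-layer network $q = w_2^\top \mathrm{ReLU}(W_1 [s, a] + b_1)$ with hand-picked weights; the function name and all numbers are hypothetical and serve only to illustrate the argument.

```python
def relu_mlp_action_gradient(w1, b1, w2, state, action):
    """Gradient of q = w2 . relu(W1 [state, action] + b1) w.r.t. the action."""
    x = list(state) + list(action)
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w1, b1)]
    active = [1.0 if p > 0 else 0.0 for p in pre]  # ReLU derivative per neuron
    n_s = len(state)
    grad = [sum(w2[j] * active[j] * w1[j][n_s + k] for j in range(len(w1)))
            for k in range(len(action))]
    return grad, active

# Hypothetical weights: two hidden neurons, 1-D state, 1-D action.
w1 = [[1.0, 1.0], [-1.0, 2.0]]
b1 = [0.0, 0.0]
w2 = [1.0, 1.0]

grad_a, mask_a = relu_mlp_action_gradient(w1, b1, w2, [1.0], [1.0])
grad_b, mask_b = relu_mlp_action_gradient(w1, b1, w2, [0.5], [3.0])
grad_c, mask_c = relu_mlp_action_gradient(w1, b1, w2, [3.0], [1.0])
```

The first two pairs differ in both state and action but activate the same neurons, so their action gradients coincide; the third pair switches one neuron off and gets a different gradient. The gradient changes only when the active set changes.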
Empirical analysis. To test our hypothesis, we trained TD3 agents with different network models and evaluated their parametric gradient $\nabla_a Q_\theta(s, a)$. To empirically analyze the gradient accuracy, we opted to estimate the true $Q$-function gradient with a non-parametric local estimator at the policy mean value. For that purpose, we generated independent trajectories with actions sampled around the mean value, and fit with a Least-Mean-Square (LMS) estimator a linear model for the empirical return of the sampled trajectories. The “true” gradient is therefore the linear model’s gradient. Additional technical details of this estimator are found in the appendix.
As our $Q$-function estimator is based on Temporal-Difference (TD) learning, it bears bias. Hence, in practice we cannot hope to reconstruct the true $Q$-function scale. Thus, instead of evaluating the gradient’s MSE, we take the Cosine Similarity (CS) as a surrogate for measuring the gradient accuracy.
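The evaluation procedure, a local linear fit of returns followed by a cosine-similarity comparison, can be sketched as follows. This is an illustrative reconstruction, not the estimator from the appendix: the function names, the ordinary least-squares fit through the normal equations, the synthetic noise-free returns and the convention of returning 0 for a zero-norm vector are all assumptions.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def local_linear_gradient(actions, returns, mean_action):
    """Least-squares fit of return ~ c + g . (a - mean); returns the slope g."""
    feats = [[1.0] + [a_k - m_k for a_k, m_k in zip(a, mean_action)]
             for a in actions]
    d = len(feats[0])
    xtx = [[sum(f[i] * f[j] for f in feats) for j in range(d)]
           for i in range(d)]
    xty = [sum(f[i] * r for f, r in zip(feats, returns)) for i in range(d)]
    coef = solve_linear_system(xtx, xty)
    return coef[1:]  # drop the intercept

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 by convention if a norm is 0)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0

# Synthetic check: returns are an exactly linear function of the action.
rng = random.Random(0)
mean_action = [0.0, 0.0]
actions = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)]
returns = [0.5 + 2.0 * a[0] - 1.0 * a[1] for a in actions]
g_hat = local_linear_gradient(actions, returns, mean_action)
cs = cosine_similarity(g_hat, [2.0, -1.0])
```

Because cosine similarity ignores vector length, a critic whose gradient has the right direction but the wrong scale still scores well, which is the motivation given above for preferring CS over the gradient's MSE.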
Fig. 3 summarizes our CS evaluations with the three model alternatives averaged over 4 Mujoco (Todorov et al., 2012) environments. Fig. 3d presents the mean CS over states during the training process. Generally, the CS is very low, which indicates that the RL training is far from optimal. While this finding is somewhat surprising, it corroborates the results in (Ilyas et al., 2019) which found near-zero CS in policy gradient algorithms. Nevertheless, note that the impact of the CS accuracy is cumulative as in each gradient ascent step the policy accumulates small improvements. This lets even near-zero gradient models improve over time. Overall, we find that the SA-Hyper CS is higher, and unlike other models, it is larger than zero during the entire training process. The SA-Hyper advantage is specifically significant at the first learning steps, which indicates that SA-Hyper learns faster in the early learning stages.
Assessing the gradient accuracy by the average CS can be somewhat confounded by states that have reached a local equilibrium during the training process. In these states the true gradient has zero magnitude s.t. the CS is ill-defined. For that purpose, in Fig. 3a-c we measure the percentage of states with a CS higher than a threshold . This indicates how many states are learnable where more learnable states are attributed to a better gradient estimation. Fig. 3a shows that for all thresholds SA-Hyper has more learnable states, and Fig. 3b-c present the change in learnable states for different during the training process. Here we also find that the SA-Hyper advantage is significant particularly at the first stage of training. Finally, Fig. 4 demonstrates how gradient accuracy translates to better learning curves. As expected, we find that SA-Hyper outperforms both the MLP architecture and the AS-Hyper configuration which is also generally inferior to MLP.
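The thresholded statistic described above reduces to a simple count. The sketch below assumes per-state CS values are already available; the function name `learnable_fraction` and the sample numbers are hypothetical.

```python
def learnable_fraction(cs_values, threshold):
    """Share of states whose cosine similarity exceeds the threshold."""
    if not cs_values:
        return 0.0
    return sum(1 for cs in cs_values if cs > threshold) / len(cs_values)

# Hypothetical per-state CS values and a sweep over thresholds.
cs_per_state = [0.30, -0.10, 0.05, 0.20]
curve = {tau: learnable_fraction(cs_per_state, tau) for tau in (0.0, 0.1, 0.25)}
```

Sweeping the threshold gives a curve per model, so architectures can be compared without averaging over states whose true gradient is near zero.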
In the next section, we discuss the application of Hypernetworks in Meta-RL for modeling context conditional policies. When such a context exists, it also serves as an input variable to the $Q$-function. In that case, when modeling the critic with a Hypernetwork, one may choose to use the context as a meta-variable or alternatively as a base variable. Importantly, when the context is the dynamic’s input, the dynamic weights are fixed for each state, regardless of the task. In our PEARL experiments in Sec. 5 we always used the context as a base variable of the critic. We opted for this configuration since: (1) we found empirically that it is important for the generalization to have a constant set of weights for each state; and (2) as the PEARL context is learnable, we found that when the context gradient backpropagates through three networks (primary, dynamic and the context network), it hampers the training. Instead, as a base variable, the context’s gradient backpropagates only via two networks as in the original PEARL implementation.
4 Recomposing the Policy in Meta-RL
Meta-RL is the generalization of Meta-Learning (Mishra et al., 2018; Sohn et al., 2019) to the RL domain. It aims at learning meta-policies that solve a distribution of different tasks . Instead of learning different policies for each task, the meta-policy shares weights between all tasks and thus can generalize from one task to the other (Sung et al., 2017). A popular Meta-RL algorithm is MAML (Finn et al., 2017), which learns a set of weights that can quickly adapt to a new task with a few gradient ascent steps. To do so, for each task, it estimates the policy gradient (Sutton et al., 2000) at the adaptation point. The total gradient is the sum of policy gradients over the task distribution :
where is the empirical advantage estimation at the -th step in task (Schulman et al., 2015b). On-policy algorithms tend to suffer from high sample complexity as each update step requires many new trajectories sampled from the most recent policy in order to adequately evaluate the gradient direction.
Off-policy methods are designed to improve the sample complexity by reusing experience from old policies (Thomas and Brunskill, 2016). Although not necessarily related, in Meta-RL, many off-policy algorithms also avoid the MAML approach of weight adaptation. Instead, they opt to condition the policy and the $Q$-function on a context which distinguishes between different tasks (Ren et al., 2019; Sung et al., 2017). A notable off-policy Meta-RL method is PEARL (Rakelly et al., 2019). It builds on top of the SAC algorithm and learns a $Q$-function, a policy and a context. The context, which is a latent representation of the task, is generated by a probabilistic model that processes a trajectory of transitions sampled from that task. To learn the critic alongside the context, PEARL modifies the SAC critic loss to
is a prior probability over the latent distribution of the context. While PEARL’s context is a probabilistic model, other works (Fakoor et al., 2019) have suggested that a deterministic learnable context can provide similar results.
In this work, we consider both a learnable context and also the simpler approach of an oracle-context which is a unique, predefined identifier for task (Jayakumar et al., 2019). It can be an index when there is a countable number of tasks or a continuous number when the tasks are sampled from a continuous distribution. In practice, the oracle identifier is often known to the agent. Moreover, sometimes, e.g., in goal-oriented tasks, the context cannot be recovered directly from the transition tuples without prior knowledge, since there are no rewards unless the goal is reached, which rarely happens without policy adaptation.
4.2 Our Approach
Hypernetworks naturally fit into the meta-learning formulation where the context is an input to the primary network (von Oswald et al., 2019; Zhao et al., 2020). Therefore, we suggest modeling meta-policies s.t. the context is the meta variable and the state is the dynamic’s input
Interestingly, this modeling disentangles the state dependent gradient and the task dependent gradient of the meta-policy. To see that, let us take for example the on-policy objective of MAML and plug in a context dependent policy . Then, the objective in Eq. (15) becomes
Applying the Hypernetwork modeling of the meta-policy in Eq. (16), this objective can be written as
In this form, the state-dependent gradients of the dynamic weights are averaged independently for each task, and the task-dependent gradients of the primary weights are averaged only over the task distribution and not over the joint task-state distribution as in Eq. (17). We postulate that such disentanglement reduces the gradient noise for the same number of samples. This should translate to more accurate learning steps and thus a more efficient learning process.
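The disentanglement argument follows from the chain rule, which can be sketched in assumed notation: $c_i$ is the context of task $i$, $w_\theta(c_i)$ are the dynamic weights produced by the primary network, $\pi_w(a \mid s)$ is the dynamic policy, and $\hat{A}_{i,t}$ is the advantage estimate at step $t$ of task $i$. For a single sample,

$$\nabla_\theta \log \pi_{w_\theta(c_i)}(a_t \mid s_t) = \nabla_\theta w_\theta(c_i)\; \nabla_w \log \pi_w(a_t \mid s_t)\Big|_{w = w_\theta(c_i)}.$$

Since $\nabla_\theta w_\theta(c_i)$ does not depend on the step $t$, it factors out of the inner sum:

$$\sum_i \sum_t \hat{A}_{i,t}\, \nabla_\theta \log \pi_{w_\theta(c_i)}(a_t \mid s_t) = \sum_i \nabla_\theta w_\theta(c_i) \left[\sum_t \hat{A}_{i,t}\, \nabla_w \log \pi_w(a_t \mid s_t)\right]_{w = w_\theta(c_i)}.$$

Under these assumptions the bracketed term averages the state-dependent gradients within each task, and only the outer sum runs over tasks.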
To test our hypothesis, we trained two different meta-policy models based on the MAML algorithm: (1) an MLP model where a state and an oracle-context are joined together; and (2) a Hypernetwork model, as described, with an oracle-context as a meta-variable. Importantly, note that, other than the neural architecture, both algorithms are identical. For four different timestamps during the learning process, we constructed 50 different uncorrelated gradients from different episodes and evaluated the updated policy’s performance. We take the performance statistics of the updated policies as a surrogate for the gradient noise. In Fig. 5, we plot the performance statistics of the updated meta-policies. We find that the variance of the Hypernetwork model is significantly lower than the MLP model across all tasks and environments. This indicates more efficient improvement and therefore we also observe that the mean value is consistently higher.
5.1 Experimental Setup
We conducted our experiments in the MuJoCo simulator (Todorov et al., 2012) and tested the algorithms on the benchmark environments available in OpenAI Gym (Brockman et al., 2016). For single task RL, we evaluated our method on the: (1) Hopper-v2; (2) Walker2D-v2; (3) Ant-v2 (we reduced the control cost as is done in PEARL (Rakelly et al., 2019) to avoid numerical instability problems); and (4) Half-Cheetah-v2 environments. For meta-RL, we evaluated our method on the: (1) Half-Cheetah-Fwd-Back and (2) Ant-Fwd-Back, and on velocity tasks: (3) Half-Cheetah-Vel and (4) Ant-Vel as is done in (Rakelly et al., 2019). We also added the Half-Cheetah-Vel-Medium environment as presented in (Fakoor et al., 2019), which tests out-of-distribution generalization abilities. For Context-MAML and Hyper-MAML we adopted the oracle-context as discussed in Sec. 4. For the forward-backward tasks, we provided a binary indicator, and for the velocity tasks, we adopted a continuous context in the range that maps to the velocities in the training distribution.
implementation when the official one was not available) and the original baselines’ hyperparameters, as well as strictly following each algorithm evaluation procedure. The Hypernetwork training was executed with the baseline loss s.t. we changed only the networks model and adjusted the learning rate to fit the different architecture. All experiments were averaged over 5 seeds. Further technical details are in the appendix.
5.2 The Hypernetwork Architecture
Our Hypernetwork model is illustrated in Fig. 1 and in Sec. 2. When designing the Hypernetwork model, we did not search for the best performance model, rather we sought a proper comparison to the standard MLP architecture used in RL (denoted here as MLP-Standard). To that end, we used a smaller dynamic network than the MLP model (single hidden layer instead of two layers and the same number of neurons (256) in a layer). With this approach, we wish to show the gain of using dynamic weights with respect to a fixed set of weights in the MLP model. To emphasize the gain of the dynamic weights, we added an MLP-Small baseline with equal configuration to the dynamic model (one hidden layer with 256 neurons).
Unlike the dynamic network, the role of the primary network is missing from the MLP architecture. Therefore, for the primary network, we used a high-performance ResNet model (Srivastava et al., 2015) which we found apt for generating the set of dynamic weights (Glorot and Bengio, 2010). To make sure that the performance gain is not due to the expressiveness of the ResNet model or the additional number of learnable weights, we added three more baselines: (1) ResNet Features: the same primary and dynamic architecture, but the output of the primary is a state feature vector which is concatenated to the action as the input for an MLP-Standard network; (2) MLP-Large: two hidden layers, each with 2900 neurons, which sum up to the same number of weights as in the Hypernetwork architecture; and (3) Res35: ResNet with 35 blocks to yield the $Q$-value, which sum up to weights. In addition, we added a comparison to the Q-D2RL model: a deep dense architecture for the $Q$-function which was recently suggested in (Sinha et al., 2020).
One important issue with Hypernetworks is their numerical stability. We found that they are specifically sensitive to weight initialization as bad primary initialization may amplify into catastrophic dynamic weights (Chang et al., 2019). We solved this problem by initializing the primary s.t. the average initial distribution of the dynamic weights resembles the Kaiming-uniform initialization (He et al., 2015). Further details can be found in the appendix.
The results and the comparison to the baselines are summarized in Fig. 6. In all four experiments, our Hypernetwork model achieves an average of 10%-70% gain over the MLP-Standard baseline in the final performance and reaches the baseline’s score with only 20%-70% of the total training steps. As described in Sec. 5.2, for the RL experiments, in addition to the MLP-Standard model, we tested five more baselines: (1) MLP-Large; (2) MLP-Small; (3) ResNet Features; (4) Res35; and (5) Q-D2RL. Both on TD3 and SAC, we find a consistent improvement over all baselines and SA-Hyper outperforms in all environments with two exceptions: where MLP-Large or Q-D2RL achieve a better score than SA-Hyper in the Ant-v2 environment (the learning curves for each environment are found in the appendix). While it may seem like the Hypernetwork improvement is due to its large parametric dimension or the ResNet design of the primary model, our results provide strong evidence that this assumption is not true. The SA-Hyper model outperforms other models with the same number of parameters (MLP-Large and ResNet Features) and also models that employ ResNet architectures (ResNet Features and Res35). Interestingly, the ResNet Features baseline achieved very low scores even as compared to the MLP-Standard baseline. Indeed, this result is not surprising as the action gradient model of ResNet Features is identical to the action gradient model of MLP-Small (single hidden layer with 256 neurons). While ResNet generated state features may improve the $Q$-function estimation, they do not necessarily improve the gradient estimation as the network is not explicitly trained to model the gradient. In addition, SA-Hyper is as good (SAC) or better (TD3) than Q-D2RL, which was recently suggested as an architecture tailored for the RL problem (Sinha et al., 2020). Please note that as discussed in Sec. 5.2 and unlike D2RL, we do not optimize the number of layers in the dynamic model. We do not compare to the full D2RL model which also modifies the policy architecture, as our SA-Hyper model only changes the $Q$-net model.
In Fig. 6c we compared different models for MAML: (1) Vanilla-MAML; (2) Context-MAML, i.e. a context-based version of MAML with an oracle-context; and (3) Hyper-MAML, similar to context-MAML but with a Hypernetwork model. For all models, we evaluated both the pre-adaptation (pre-ad) as well as the post-adaptation scores. First, we verify the claim in (Fakoor et al., 2019) that context benefits Meta-RL algorithms just as Context-MAML outperforms Vanilla-MAML. However, we find that Hyper-MAML outperforms Context-MAML by roughly 50%. Moreover, unlike the standard MLP models, we find that Hyper-MAML does not require any adaptation step (no observable difference between the pre- and post-adaptation scores). We assume that this result is due to the better generalization capabilities of the Hypernetwork architecture as can also be seen from the next PEARL experiments.
In Fig. 6d we evaluated the Hypernetwork model with the PEARL algorithm. The context is learned with a probabilistic encoder as presented in (Rakelly et al., 2019) s.t. the only difference with the original PEARL is the policy and critic neural models. The empirical results show that Hyper-PEARL outperforms the MLP baseline both in the final performance (15%) and in sample efficiency (70% fewer steps to reach the final baseline score). Most importantly, we find that Hyper-PEARL generalizes better to the unseen test tasks. This applies both to test tasks sampled from the training distribution (as the higher score and lower variance of Hyper-PEARL indicate) and also to Out-Of-Distribution (OOD) tasks, as can be observed in Fig. 7.
In this work, we set out to study neural models for the RL building blocks: $Q$-functions and meta-policies. Arguing that the unique nature of the RL setting requires unconventional models, we suggested the Hypernetwork model and showed empirically several significant advantages over MLP models. First, Hypernetworks are better able to estimate the parametric gradient signal of the $Q$-function required to train actor-critic algorithms. Second, they reduce the gradient variance in training meta-policies in Meta-RL. Finally, they improve OOD generalization and they do not require any adaptation step in Meta-RL training, which significantly facilitates the training process.
Our Hypernetwork PyTorch implementation is found at https://github.com/keynans/HypeRL.
- SMASH: one-shot model architecture search through hypernetworks. In International Conference on Learning Representations, Cited by: §2.
- OpenAI gym. CoRR abs/1606.01540. Cited by: §5.1.
- Principled weight initialization for hypernetworks. In International Conference on Learning Representations, Cited by: §D.3, §5.2.
- Model minimization in markov decision processes. In AAAI/IAAI, pp. 106–111. Cited by: §3.1.
- Meta-q-learning. In International Conference on Learning Representations, Cited by: §E.3.1, §1, §4.1, §5.1, §5.3.
- Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Cited by: §1, §4.1.
- Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. Cited by: §1.
- Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: Appendix E, §1, §1, §2, §3.1.
- Comparing the parameter complexity of hypernetworks and the embedding-based alternative. arXiv preprint arXiv:2002.10006. Cited by: §1.
- On the modularity of hypernetworks. Advances in Neural Information Processing Systems 33. Cited by: §1.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §5.2.
- A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (6), pp. 1291–1307. Cited by: §1.
- HyperNetworks. arXiv, pp. arXiv–1609. Cited by: §1, §2, §2.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §1, §1, §2, §3.1.
- Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2094–2100. Cited by: §1, §3.1.
- Deep reinforcement learning with a natural language action space. In ACL (1), Cited by: item 1, §1.
- Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In ICCV, Cited by: §D.3, §5.2.
- Densely connected convolutional networks. In , pp. 4700–4708. Cited by: item 1.
- Continual model-based reinforcement learning with hypernetworks. arXiv preprint arXiv:2009.11997. Cited by: §2.
- A closer look at deep policy gradients. In International Conference on Learning Representations, Cited by: §1, §3.2.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: item 3.
- Multiplicative interactions and where to find them. In International Conference on Learning Representations, Cited by: §1, §4.1.
- Dynamic filter networks. In Advances in neural information processing systems, pp. 667–675. Cited by: §2.
- Model based reinforcement learning for atari. In International Conference on Learning Representations, Cited by: §1.
- Approximately optimal approximate reinforcement learning. In Proc. 19th International Conference on Machine Learning, Cited by: §3.2.
- Introduction to pytorch. In Deep learning with python, pp. 195–208. Cited by: §5.1.
- Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pp. 11784–11794. Cited by: §1.
- Bandit algorithms. Cambridge University Press. Cited by: §1.
- Continuous control with deep reinforcement learning. In ICLR (Poster), Cited by: §1, §3.1.
- A generative model for sampling high-performance and diverse weights for neural networks. CoRR. Cited by: §D.3, §1.
- Deep meta functionals for shape representation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1824–1833. Cited by: §2, §2.
- Putting knowledge in its place: a scheme for programming parallel processing structures on the fly. Cognitive Science 9 (1), pp. 113–146. Cited by: §2.
- A simple neural attentive meta-learner. In International Conference on Learning Representations, Cited by: §4.1.
- Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: item 3.
- The exploding gradient problem demystified-definition, prevalence, impact, origin, tradeoffs, and solutions. arXiv preprint arXiv:1712.05577. Cited by: §3.2.
- HyperNets and their application to learning spatial transformations. In International Conference on Artificial Neural Networks, pp. 476–486. Cited by: §2.
- Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp. 5331–5340. Cited by: Appendix E, §1, §4.1, §5.1, §5.3, footnote 3.
- QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304. Cited by: §2.
- HyperGAN: A generative model for diverse, performant neural networks. CoRR abs/1901.11058. Cited by: §1.
- Context-based meta-reinforcement learning with structured latent space. Skills Workshop NeurIPS 2019. Cited by: §4.1.
- Meta-learning with latent embedding optimization. In International Conference on Learning Representations, Cited by: §1.
- Weight normalization: a simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868. Cited by: item 3.
- Explicit gradient learning for black-box optimization. In International Conference on Machine Learning, pp. 8480–8490. Cited by: §3.2.
- On approximating with neural networks. arXiv preprint arXiv:1910.12744. Cited by: §3.2.
- Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139. Cited by: §2.
- Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §1, §3.2.
- High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §4.1.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §1, §3.2.
- D2RL: deep dense architectures in reinforcement learning. arXiv preprint arXiv:2010.09163. Cited by: §D.4.8, §5.2, §5.3.
- Meta reinforcement learning with autonomous inference of subtask dependencies. In International Conference on Learning Representations, Cited by: §4.1.
- Training very deep networks. In NIPS, Cited by: §2, §5.2.
- HyperNetworks with statistical filtering for defending adversarial examples. arXiv preprint arXiv:1711.01791. Cited by: §1.
- Learning to learn: meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529. Cited by: §4.1, §4.1.
- Reinforcement learning: an introduction. MIT press. Cited by: §3.1, §3.1.
- Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §4.1, footnote 1.
- Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148. Cited by: §4.1.
- Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §3.2, §5.1.
- Continual learning with hypernetworks. In International Conference on Learning Representations, Cited by: §1, §4.2.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: footnote 1.
- Deep reinforcement learning in action. Manning Publications. Cited by: §1.
- A deeper look at experience replay. arXiv preprint arXiv:1712.01275. Cited by: §3.1.
- Meta-learning via hypernetworks. 4th Workshop on Meta-Learning at NeurIPS 2020. Cited by: §4.2.
Appendix A Proof of Proposition 1
Let be a stochastic parametric policy with and a transformation with a Lipschitz continuous gradient and a Lipschitz constant . Assume that its -function has a Lipschitz continuous gradient in , i.e. . Define the average gradient operator . If there exists a gradient estimation and s.t.
then the ascent step with yields a positive empirical advantage policy.
First, recall the objective to be optimized:
Notice that as is bounded by the maximal reward and its gradient is Lipschitz continuous, the gradient is therefore bounded. Similarly, since the action space is bounded, and the deterministic transformation has a Lipschitz continuous gradient, it follows that is also bounded. Define and .
Let s.t. and and is the induced vector norm. And let s.t. and . The operator is Lipschitz with constant .
The Lipschitz constant of the objective gradient is bounded by
Applying Lemma 1, we obtain
Therefore, is also Lipschitz. Hence, applying Taylor’s expansion around , we have that
Plugging in the iteration we obtain
Taking the second term on the right-hand side,
For the last term we have
Plugging both terms together into Eq. (21) we get
To obtain a positive empirical advantage we need
Thus the sufficient requirement for the learning rate is
Appendix B Cosine Similarity Estimation
To evaluate the averaged Cosine Similarity (CS)
we need to estimate the local CS for each state. To that end, we estimate the “true” Q-function in the vicinity of with a non-parametric local linear model
where s.t. the Q-function gradient is constant . To fit the linear model, we draw unbiased samples of the Q-function around , i.e. . These samples are the empirical discounted sum of rewards obtained by executing action at state and then applying policy .
To fit the linear model we directly fit the constant model for the gradient. Recall that applying Taylor’s expansion around gives
for in the vicinity of .
To find the best fit we minimize the averaged quadratic error term over all pairs of sampled trajectories
This problem can be expressed in a matrix notation as
where is a matrix with rows of all the vectors and is a element vector of all the differences . Its minimization is the Least-Mean-Squared Estimator (LMSE)
In our experiments we evaluated the CS every learning steps and used , and for each evaluation. This choice trades off somewhat less accurate local estimators with more samples during training. To test our gradient estimator, we first applied it to the outputs of the Q-function network (instead of the true returns) and calculated the CS between a linear model based on the network outputs and the network parametric gradient. The results in Fig. 8 show that our estimator obtains a high CS between the Q-net outputs of the SA-Hyper and MLP models and their respective parametric gradients. This indicates that these networks are locally () linear. On the other hand, the CS between the linear model based on the AS-Hyper outputs and its parametric gradient is lower, which indicates that the network is not necessarily close to linear with . We assume that this may be because the action in the AS-Hyper configuration plays the meta-variable role, which increases the non-linearity of the model with respect to the action input. Importantly, note that this does not indicate that the true Q-function of the AS-Hyper model is more non-linear than that of the other models.
In Fig. 9 we plot the CS for 4 different environments averaged with a window size of . The results show that on average the SA-Hyper configuration obtains a higher CS, which indicates that the policy optimization step is more accurate s.t. the RL training process is more efficient.
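The local-linear fit described in this appendix can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names `local_gradient_lmse` and `cosine_similarity` are hypothetical, the returns are assumed to be already-computed discounted sums of rewards, and the least-squares problem is solved with `numpy.linalg.lstsq` rather than an explicit normal-equation inverse.

```python
import numpy as np


def local_gradient_lmse(actions, returns):
    """Fit a constant gradient g such that q_i - q_j ~= g . (a_i - a_j) over all pairs."""
    actions = np.asarray(actions, dtype=float)  # shape (N, action_dim)
    returns = np.asarray(returns, dtype=float)  # shape (N,)
    i, j = np.triu_indices(len(actions), k=1)   # all unordered pairs with i < j
    X = actions[i] - actions[j]                 # rows: action differences
    delta = returns[i] - returns[j]             # entries: return differences
    g, *_ = np.linalg.lstsq(X, delta, rcond=None)  # least-squares solution
    return g


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two gradient vectors (eps avoids division by zero)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


if __name__ == "__main__":
    # Toy check: noisy samples of a locally linear function around a0 = 0.
    rng = np.random.default_rng(0)
    true_grad = np.array([1.0, -2.0, 0.5])
    acts = 0.1 * rng.standard_normal((16, 3))
    rets = 3.0 + acts @ true_grad + 0.01 * rng.standard_normal(16)
    g_hat = local_gradient_lmse(acts, rets)
    print("CS with the true gradient:", cosine_similarity(g_hat, true_grad))
```

In the setting of this appendix, `g_hat` would be compared against the parametric action-gradient of the critic at the same state, and the resulting CS values averaged over states.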
Appendix C Gradient Step Noise Statistics in MAML
Hypernetworks disentangle the state-dependent gradient and the task-dependent gradient. As explained in the paper, we hypothesized that these characteristics reduce the gradient ascent step noise during policy updates
where is the gradient step estimation and is the learning rate. It is not obvious how to define the gradient noise properly as any norm-based measure depends on the network’s structure and size. Therefore, we take an alternative approach and define the gradient noise as the performance statistics after applying a set of independent gradient steps. In simple words, this definition essentially corresponds to how noisy the learning process is.
To estimate the performance statistics, we take different independent policy gradients based on independent trajectories at 4 different time steps during the training process. For each gradient step, we sampled 20 trajectories with a maximal length of 200 steps (identical to a single policy update during the training process) out of 40 tasks. After each gradient step, we evaluated the performance and restored the policy’s weights s.t. the gradient steps are independent.
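The procedure above (take a gradient step, evaluate, restore the weights, repeat) can be sketched as follows. This is an illustrative outline only: `sample_gradient` and `evaluate` are hypothetical callables standing in for trajectory collection and policy evaluation, and the policy is represented as a flat list of floats for simplicity.

```python
import copy
import statistics


def gradient_step_noise(policy_params, sample_gradient, evaluate, lr, num_steps):
    """Apply `num_steps` independent ascent steps from the same weights.

    Returns the mean and sample variance of the post-step performance.
    """
    snapshot = copy.deepcopy(policy_params)  # weights restored before every step
    scores = []
    for _ in range(num_steps):
        grad = sample_gradient(snapshot)  # gradient from fresh, independent trajectories
        stepped = [w + lr * g for w, g in zip(snapshot, grad)]  # theta + lr * g
        scores.append(evaluate(stepped))
        # `snapshot` is never modified, so the steps stay independent.
    return statistics.mean(scores), statistics.variance(scores)


if __name__ == "__main__":
    import random

    random.seed(0)
    # Toy objective J(theta) = -(theta - 1)^2 with a noisy gradient estimate.
    noisy_grad = lambda p: [-2.0 * (p[0] - 1.0) + random.gauss(0.0, 0.5)]
    objective = lambda p: -(p[0] - 1.0) ** 2
    print(gradient_step_noise([0.0], noisy_grad, objective, lr=0.1, num_steps=50))
```

A lower variance of the returned scores corresponds to a less noisy learning process in the sense defined above.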
We compared two different network architectures, both with access to an oracle context: (1) Hyper-MAML; and (2) Context-MAML. We did not evaluate Vanilla-MAML as it has no context and the gradient noise, in this case, might also be due to higher adaptation noise as the context must be recovered from the trajectories’ rewards. In the paper, we presented the performance statistics after different updates. In Table 1 we present the variance of those statistics.
|Environment||50 iter||150 iter||300 iter||450 iter|
|Context-MAML||1.184 (774)||4.492 (2595)||2.590 (1891)||0.822 (3689)|
|Hyper-MAML (Ours)||0.027 (26)||0.017 (43)||0.021 (96)||0.014 (53)|
|Context-MAML||0.035 (122)||0.050 (208)||0.093 (520)||0.066 (161)|
|Hyper-MAML (Ours)||0.009 (5)||0.005 (1)||0.008 (2)||0.009 (2)|
|Context-MAML||0.274 (3)||0.199 (5)||0.400 (12)||0.285 (20)|
|Hyper-MAML (Ours)||0.073 (1)||0.047 (2)||0.050 (6)||0.047 (11)|
|Context-MAML||0.379 (52)||0.377 (8)||0.628 (109)||0.418 (117)|
|Hyper-MAML (Ours)||0.252 (5)||0.159 (2)||0.080 (2)||0.057 (2)|
Appendix D Models Design
D.1 Hypernetwork Architecture
The Hypernetwork’s primary part is composed of three main blocks followed by a set of heads. Each block contains an up-scaling linear layer followed by two pre-activation residual linear blocks (ReLU-linear-ReLU-linear). The first block up-scales from the state’s dimension to 256 and the second and third blocks grow to 512 and 1024 neurons respectively. The total number of learnable parameters in the three blocks is . The last block is followed by the heads which are a set of linear transformations that generate the dynamic parameters (including weights, biases and gains). The heads have learnable parameters s.t. the total number of parameters in the primary part is .
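To make the layout concrete, here is a small NumPy sketch of a forward pass through such a primary network and the dynamic network it generates. It is a sketch under several assumptions rather than the released implementation: the dynamic network is taken to have one hidden layer (256 units by default), the way gains enter the dynamic layers (`1 + g` multiplying the pre-activation) is a guess, the random initialization is a placeholder, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear_params(fan_in, fan_out, scale=0.05):
    # Placeholder initialization for illustration; not the scheme used in the paper.
    return scale * rng.standard_normal((fan_in, fan_out)), np.zeros(fan_out)


def relu(x):
    return np.maximum(x, 0.0)


def res_block(x, params):
    """Pre-activation residual block: x + Linear(ReLU(Linear(ReLU(x))))."""
    (w1, b1), (w2, b2) = params
    h = relu(x) @ w1 + b1
    h = relu(h) @ w2 + b2
    return x + h


def make_primary(state_dim, widths=(256, 512, 1024)):
    """Three blocks: an up-scaling linear layer followed by two residual blocks each."""
    blocks, d_in = [], state_dim
    for d_out in widths:
        up = linear_params(d_in, d_out)
        res = [[linear_params(d_out, d_out), linear_params(d_out, d_out)] for _ in range(2)]
        blocks.append((up, res))
        d_in = d_out
    return blocks


def make_heads(feat_dim, action_dim, hidden=256):
    """One linear head per group of dynamic parameters (weights, biases, gains)."""
    sizes = {"w1": action_dim * hidden, "b1": hidden, "g1": hidden,
             "w2": hidden, "b2": 1, "g2": 1}
    return {name: linear_params(feat_dim, n) for name, n in sizes.items()}


def q_value(state, action, primary, heads, hidden=256):
    """State (meta-variable) -> dynamic weights; action (base variable) -> Q estimate."""
    z = state
    for (w_up, b_up), res in primary:
        z = z @ w_up + b_up
        for params in res:
            z = res_block(z, params)
    dyn = {name: z @ w + b for name, (w, b) in heads.items()}
    w1 = dyn["w1"].reshape(action.shape[0], hidden)
    # Assumed gain parameterization: (1 + g) rescales each dynamic layer's output.
    h = relu((action @ w1) * (1.0 + dyn["g1"]) + dyn["b1"])
    return float((h @ dyn["w2"]) * (1.0 + dyn["g2"][0]) + dyn["b2"][0])


if __name__ == "__main__":
    primary = make_primary(state_dim=11)
    heads = make_heads(feat_dim=1024, action_dim=3)
    print(q_value(rng.standard_normal(11), rng.standard_normal(3), primary, heads))
```

The state only ever passes through the primary network, and the action only through the generated weights, which is the separation between meta and base variables that the architecture is built around.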
D.2 Primary Model Design: Negative Results
In our search for a primary network that can learn to model the weights of a state-dependent dynamic -function, we experimented with several different architectures. Here we outline a list of negative results, i.e. models that failed to learn good primary networks.
We found that the head size (the last layer that outputs all the dynamic network weights) should not be smaller than 512 and the depth should be at least 5 blocks. Upsampling from the low state dimension can either be done gradually or at the first layer.
For the non-linear activation functions, we tried ReLU and ELU, which we found to perform similarly.
D.3 Hypernetwork Initialization
A proper initialization for the Hypernetwork is crucial for the network’s numerical stability and its ability to learn. Common initialization methods are not necessarily suited for Hypernetworks (Chang et al., 2019) since they fail to generate the dynamic weights in the correct scale. We found that some RL algorithms are more affected than others by the initialization scheme, e.g., SAC is more sensitive than TD3. However, we leave the question of why some RL algorithms are more sensitive than others to the weight initialization for future research.
To improve the Hypernetwork weight initialization, we followed (Lior Deutsch, 2019) and initialized the primary weights with smaller than usual values s.t. the initial dynamic weights were also relatively small compared to standard initialization (Fig. 10). As shown in Fig. 11, this enables the dynamic weights to converge during the training process to a distribution relatively similar to that of a standard MLP network.
The residual blocks in the primary part were initialized with a fan-in Kaiming uniform initialization (He et al., 2015) with a gain of (instead of the normal gain of for the ReLU activation). We used fixed uniform distributions to initialize the weights in the heads: for the first dynamic layer, for the second dynamic layer, and for the standard deviation output layer in the PEARL meta-policy we used the distribution.
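As a reference point for the scheme described above, the following sketch samples a fan-in Kaiming uniform initialization with an adjustable gain, using the same bound that PyTorch's `kaiming_uniform_` applies, `gain * sqrt(3 / fan_in)`. The standard ReLU gain is `sqrt(2)`. The reduced gain in the demo (`small_gain = 0.5`) is a hypothetical placeholder chosen only to show the effect on the weight scale; it is not the value used in the paper.

```python
import math

import numpy as np


def kaiming_uniform_fan_in(fan_in, fan_out, gain, rng):
    """Sample weights from U(-bound, bound) with bound = gain * sqrt(3 / fan_in)."""
    bound = gain * math.sqrt(3.0 / fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    relu_gain = math.sqrt(2.0)  # standard gain for ReLU activations
    small_gain = 0.5            # hypothetical smaller-than-usual gain, for illustration
    w_std = kaiming_uniform_fan_in(1024, 1024, relu_gain, rng)
    w_small = kaiming_uniform_fan_in(1024, 1024, small_gain, rng)
    print("std with ReLU gain:   ", w_std.std())
    print("std with smaller gain:", w_small.std())
```

Because the bound scales linearly with the gain, shrinking the gain shrinks the primary weights, and with them the initial dynamic weights, by the same factor.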
In Fig. 10 and Fig. 11 we plot the histogram of the TD3 critic dynamic network weights with different primary initializations: (1) our custom primary initialization; and (2) the default PyTorch initialization of the primary network. We compare the dynamic weights to the weights of a standard MLP-Small network (the same size as the dynamic network). We take two snapshots of the weight distribution: (1) in Fig. 10, before the start of the training process; and (2) in Fig. 11, after 100K training steps. In Table 2 we also report the total-variation distance between each initialization and the MLP-Small weight distribution. Interestingly, the results show that while the dynamic weight distribution with the PyTorch primary initialization is closer to the MLP-Small distribution at the beginning of the training process, after 100K training steps our primary initialization produces a dynamic weight distribution that is closer to that of the MLP-Small network (also trained for 100K steps).
|Primary Initialization Scheme||Hopper||Walker2d||Ant||HalfCheetah|
|Ours Hyper init||31.4||23.9||13.6||29.4|
|PyTorch Hyper init||16.3||20.5||9.2||8.8|
|Ours Hyper init||34.8||30.77||37.7||36.9|
|PyTorch Hyper init||24.7||39.6||11.2||29.3|
|First Layer After 100K Steps|
|Ours Hyper init||14.4||19.6||29.9||16.4|
|PyTorch Hyper init||24.9||22.6||34.0||22.4|
|Second Layer After 100K Steps|
|Ours Hyper init||31.2||28.5||30.6||21.1|
|PyTorch Hyper init||32.11||20.8||30.7||31.1|
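For readers who want to reproduce this kind of comparison, a total-variation distance between two empirical weight distributions can be estimated from histograms over a shared set of bins, as sketched below. The number of bins and any rescaling used for the values in Table 2 are not specified here, so `bins=50` and the `[0, 1]` output range are assumptions, and the function name is hypothetical.

```python
import numpy as np


def total_variation_distance(weights_a, weights_b, bins=50):
    """TV distance between two weight samples, estimated on shared histogram bins."""
    a = np.ravel(np.asarray(weights_a, dtype=float))
    b = np.ravel(np.asarray(weights_b, dtype=float))
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)  # common support for both histograms
    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)
    p = p / p.sum()  # normalize counts to probabilities
    q = q / q.sum()
    return 0.5 * float(np.abs(p - q).sum())  # value in [0, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mlp_weights = rng.normal(0.0, 0.10, size=10_000)      # stand-in for MLP-Small weights
    dynamic_weights = rng.normal(0.0, 0.05, size=10_000)  # stand-in for dynamic weights
    print(total_variation_distance(mlp_weights, dynamic_weights))
```

A value of 0 means the two histograms coincide and a value of 1 means they do not overlap at all.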
D.4 Baseline Models for the SAC and TD3 algorithms
In our TD3 and SAC experiments, we tested the Hypernetwork architecture with respect to 7 different baseline models.
MLP-Standard is a standard MLP architecture, used in many RL papers (e.g. SAC and TD3), with 2 hidden layers of 256 neurons each and ReLU activation functions.
The MLP-Small model helps in understanding the gain of using context-dependent dynamic weights. It is an MLP network with the same architecture as our dynamic network model, i.e. 1 hidden layer with 256 neurons followed by a ReLU activation function. As expected, although the MLP-Small and MLP-Standard configurations are relatively similar with only a different number of hidden layers (1 and 2 respectively), the MLP-Small achieved close to half the return of the MLP-Standard. However, our experiments show that when using even a shallow MLP network with context-dependent weights (i.e. our SA-Hyper model), it can significantly outperform both shallow and deeper standard MLP models.
To make sure that the performance gain is not due to the large number of weights in the primary network, we evaluated MLP-Large, an MLP network with 2 hidden layers, like MLP-Standard, but with 2,900 neurons in each layer. This yields a total number of learnable parameters, as in our entire primary model. While this large network usually outperformed the other baselines, in almost all environments it still did not reach the Hypernetwork performance, the one exception being the Ant-v2 environment with the TD3 algorithm. This provides another empirical argument that Hypernetworks are better suited to the RL problem and that their performance gain is not only due to their larger parametric space.
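The size gap between the MLP baselines can be checked with a short parameter count for fully connected networks. The input dimension below (14, i.e. an 11-dimensional state concatenated with a 3-dimensional action, as in Hopper) is an assumption made for the example, and `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(input_dim, hidden_sizes, output_dim=1):
    """Number of weights and biases in a fully connected network."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


if __name__ == "__main__":
    in_dim = 14  # assumed: state (11) + action (3)
    print("MLP-Small   :", mlp_param_count(in_dim, [256]))         # 4097
    print("MLP-Standard:", mlp_param_count(in_dim, [256, 256]))    # 69889
    print("MLP-Large   :", mlp_param_count(in_dim, [2900, 2900]))  # 8459301
```

Under this assumption the count for MLP-Large is dominated by the 2,900 x 2,900 hidden-to-hidden layer, so it changes only slightly across environments with different state and action sizes.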
D.4.4 ResNet Features
To test whether the performance gain is due to the expressiveness of the ResNet model, we evaluated ResNet-Features: an MLP-Small model that, instead of taking the raw state features as input, uses the primary model configuration (with ResNet blocks) to generate 10 learnable features of the state. Note that the feature extractor part of ResNet-Features has a similar parameter space to the Hypernetwork’s primary model, except for the head units. ResNet-Features was unable to learn on most environments with both algorithms, even though we tried several different initialization schemes. This shows that the primary model is not suitable for extracting state features. While it may be possible to find other ResNet-based models that outperform this one, the result is further evidence that the success of the Hypernetwork architecture cannot be attributed solely to the expressive power of the ResNet in the primary network.
AS-Hyper is the reverse configuration of our SA-Hyper model. In this configuration, the action is the meta-variable and the state serves as the base-variable. Its lower performance provides another empirical argument (alongside the lower CS, see Appendix B) that the “correct” Hypernetwork composition is when the state plays the context role and the action is the base-variable.
In this configuration, we replace the input of the primary network with a learnable embedding of size 5 (equal to the PEARL context size) and the dynamic part gets both the state and the action as its input variables. This produces a learnable set of weights that is constant for all states and actions. However, unlike MLP-Small, the weights are generated via the primary model and are not independent as in normal neural network training. Note that we did not include this experiment in the main paper but we have added it to the results in the appendix. This is another configuration that aims to validate that the Hypernetwork gain is not due to the over-parameterization of the primary model and that the disentanglement of the state and action is an important ingredient of the Hypernetwork performance.
D.4.7 ResNet 35
To validate that the performance gain is not due to a large number of weights in the primary network combined with the expressiveness of the residual blocks, we evalua