1 Introduction
The rapid development of deep neural networks as general-purpose function approximators has propelled the recent Reinforcement Learning (RL) renaissance
(Zai and Brown, 2020). RL algorithms have progressed in robustness, e.g. from (Lillicrap et al., 2016) to (Fujimoto et al., 2018); exploration (Haarnoja et al., 2018); gradient sampling (Schulman et al., 2017, 2015a); and off-policy learning (Fujimoto et al., 2019; Kumar et al., 2019). Many actor-critic algorithms have focused on improving the critic's learning routine by modifying the target value (Hasselt et al., 2016), which enables more accurate and robust function approximations. While this greatly improves the policy optimization efficiency, the performance is still bound by the networks' ability to represent functions and policies. Such a constraint calls for studying and designing neural models suited for the representation of these RL building blocks.

A critical insight in designing neural models for RL is the reciprocity between the state s and the action a, which both serve as inputs to the Q function. At the start, each input can be processed individually according to its source domain. For example, when s is a vector of images, it is common to employ CNN models (Kaiser et al., 2019), and when s or a are natural language words, each input can be processed separately with embedding vectors (He et al., 2016). The common practice for incorporating the state and action learnable features into a single network is to concatenate the two vectors and follow with an MLP to yield the Q value (Schulman et al., 2017). In this work, we argue that for actor-critic RL algorithms (Grondman et al., 2012), such an off-the-shelf method can be significantly improved with Hypernetworks.
In actor-critic methods, for each state sampled from the dataset distribution, the actor's task is to solve an optimization problem over the action distribution, i.e. the policy. This motivates an architecture where the Q function is explicitly modeled as the value function of a contextual bandit (Lattimore and Szepesvári, 2020) where the state is the context. While standard architectures are not designed to model such a relationship, Hypernetworks were explicitly constructed for that purpose (Ha et al., 2016). Hypernetworks, also called meta-networks, can represent hierarchies by transforming a meta-variable into a context-dependent function that maps a base variable to the required output space. This emphasizes the underlying dynamic between the meta and base variables and has found success in a variety of domains such as Bayesian neural networks (Deutsch, 2019), continual learning (von Oswald et al., 2019), generative models (Ratzlaff and Li, 2019) and adversarial defense (Sun et al., 2017). The practical success has sparked interest in the theoretical properties of Hypernetworks. For example, it has recently been shown that they enjoy better parameter complexity than classical models which concatenate the base and meta-variables together (Galanti and Wolf, 2020a, b).
When analyzing the critic's ability to represent the Q function, it is important to notice that in order to optimize the policy, modern off-policy actor-critic algorithms (Fujimoto et al., 2018; Haarnoja et al., 2018) utilize only the parametric neural gradient of the critic with respect to the action input, i.e., ∇_a Q(s, a).¹ Recently, (Ilyas et al., 2019) examined the accuracy of the policy gradient in on-policy algorithms. They demonstrated that standard RL implementations achieve gradient estimation with a near-zero cosine similarity when compared to the "true" gradient. Therefore, recovering better gradient approximations has the potential to substantially improve the RL learning process. Motivated by the need to obtain high-quality gradient approximations, we set out to investigate the gradient accuracy of Hypernetworks with respect to standard models. In Sec. 3 we analyze three critic models and find that the Hypernetwork model with the state as a meta-variable enjoys better gradient accuracy, which translates into a faster learning rate.

¹This is in contrast to the REINFORCE approach (Williams, 1992), based on the policy gradient theorem (Sutton et al., 2000), which does not require a differentiable Q function estimation.

Much like the induced hierarchy in the critic, meta-policies that optimize multi-task RL problems have a similar structure, as they combine a task-dependent context and a state input. While some algorithms like MAML (Finn et al., 2017) and LEO (Rusu et al., 2019) do not utilize an explicit context, other works, e.g. PEARL (Rakelly et al., 2019) or MQL (Fakoor et al., 2019), have demonstrated that a context improves the generalization abilities. Recently, (Jayakumar et al., 2019) have shown that Multiplicative Interactions (MI) are an excellent design choice when combining states and contexts. MI operations can be viewed as shallow Hypernetwork architectures. In Sec. 4, we further explore this approach and study context-based meta-policies with deep Hypernetworks. We find that with Hypernetworks, the task-dependent and state-dependent gradients are disentangled s.t. the state-dependent gradients are marginalized out, which leads to an empirically lower learning step variance. This is specifically important in on-policy methods such as MAML, where there are fewer optimization steps during training.
The contributions of this paper are threefold. First, in Sec. 3 we provide a theoretical link between the Q function gradient approximation quality and the allowable learning rate for monotonic policy improvement. Next, we show empirically that Hypernetworks achieve better gradient approximations, which translates into a faster learning rate and improves the final performance. Finally, in Sec. 4 we show that Hypernetworks significantly reduce the learning step variance in Meta-RL. We summarize our empirical results in Sec. 5, which demonstrates the gain of Hypernetworks both in single-task RL and Meta-RL. Importantly, we find empirically that Hypernetwork policies eliminate the need for the MAML adaptation step and improve the Out-Of-Distribution generalization in PEARL.
2 Hypernetworks
A Hypernetwork (Ha et al., 2016) is a neural-network architecture designed to process a tuple (z, x) and output a value y. It is comprised of two networks: a primary network f_w(z), which produces the weights θ(z) for a dynamic network g_θ(x). Both networks are trained together, and the gradient flows through the dynamic network to the primary network's weights w. During test time or inference, the primary weights are fixed while the meta-variable z determines the dynamic network's weights.
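This two-network structure can be sketched in a few lines of numpy. The sketch below is illustrative only (toy dimensions, a linear primary network and a one-layer tanh dynamic network, none of which are the paper's architecture); it shows how the meta-variable z selects the weights that are then applied to the base variable x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not the paper's.
META_DIM, BASE_DIM, HIDDEN = 3, 4, 8

# Primary network f_w(z): maps the meta-variable z to a flat weight vector.
W_p = rng.standard_normal((BASE_DIM * HIDDEN + HIDDEN, META_DIM)) * 0.1

def primary(z):
    theta = W_p @ z                        # flat vector of dynamic weights
    W_d = theta[: BASE_DIM * HIDDEN].reshape(HIDDEN, BASE_DIM)
    b_d = theta[BASE_DIM * HIDDEN:]
    return W_d, b_d

def hypernet(z, x):
    """g_{theta(z)}(x): dynamic network whose weights depend on z."""
    W_d, b_d = primary(z)
    return np.tanh(W_d @ x + b_d).sum()    # scalar output for illustration

z1, z2 = rng.standard_normal(META_DIM), rng.standard_normal(META_DIM)
x = rng.standard_normal(BASE_DIM)
# Different meta-variables induce different functions of the same base input.
assert hypernet(z1, x) != hypernet(z2, x)
```

In training, the loss gradient would flow through `hypernet` back into `W_p`, so only the primary weights are learned.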
The idea of learnable context-dependent weights can be traced back to (McClelland, 1985; Schmidhuber, 1992). However, only in recent years have Hypernetworks gained popularity, having been applied successfully with many dynamic network models, e.g. recurrent networks (Ha et al., 2016), MLP networks for 3D point clouds (Littwin and Wolf, 2019), spatial transformation (Potapov et al., 2018), convolutional networks for video frame prediction (Jia et al., 2016) and few-shot learning (Brock et al., 2018). In the context of RL, Hypernetworks were also applied, e.g., in QMIX (Rashid et al., 2018) to solve Multi-agent RL tasks and for continual model-based RL (Huang et al., 2020).
Fig. 1 illustrates our Hypernetwork model. The primary network contains residual blocks (Srivastava et al., 2015) which transform the meta-variable into a 1024-sized latent representation. This stage is followed by a series of parallel linear transformations, termed "heads", which output the sets of dynamic weights. The dynamic network contains only a single hidden layer of 256 neurons, which is smaller than the standard MLP architecture of 2 hidden layers, each with 256 neurons, used in many RL papers (Fujimoto et al., 2018; Haarnoja et al., 2018). The computational model of each dynamic layer is
$$x^{l+1} = g^l \odot \sigma^l\left(W^l x^l + b^l\right) \tag{1}$$

where the non-linearity σ^l is applied only over the hidden layer and g^l is an additional gain parameter that is required in Hypernetwork architectures (Littwin and Wolf, 2019). We defer the discussion of these design choices to Sec. 5.
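A minimal sketch of the dynamic-layer computation in Eq. (1). All sizes are illustrative, and in the real model the weights, biases, and gains would be emitted by the primary network's heads rather than sampled as below; the point is the gain-modulated layer with the non-linearity applied only on the hidden layer:

```python
import numpy as np

def dynamic_layer(x, W, b, g, hidden=True):
    """One dynamic layer of Eq. (1): g * sigma(W x + b), where sigma is
    ReLU on the hidden layer and the identity on the output layer."""
    act = W @ x + b
    return g * (np.maximum(act, 0.0) if hidden else act)

rng = np.random.default_rng(0)
x = rng.standard_normal(6)                      # base-variable input
# In the paper's model these come from the primary network's heads.
W1, b1, g1 = rng.standard_normal((256, 6)), rng.standard_normal(256), rng.standard_normal(256)
w2, b2, g2 = rng.standard_normal((1, 256)), rng.standard_normal(1), rng.standard_normal(1)

h = dynamic_layer(x, W1, b1, g1, hidden=True)   # single 256-unit hidden layer
q = dynamic_layer(h, w2, b2, g2, hidden=False)  # linear output head
assert q.shape == (1,)
```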
3 Recomposing the Actor-Critic's Q Function
3.1 Background
Reinforcement Learning concerns finding optimal policies in Markov Decision Processes (MDPs). An MDP (Dean and Givan, 1997) is defined by a tuple (S, A, P, r), where S is a set of states, A is a set of actions, P(s'|s, a) is the probability of switching from a state s to s' given an action a, and r(s, a) is a scalar reward function. The objective is to maximize the expected discounted sum of rewards with a discount factor γ:

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] \tag{2}$$

J(π) can also be written, up to a constant factor 1/(1−γ), as an expectation over the Q function

$$J(\pi) \propto \mathbb{E}_{s \sim d^{\pi}}\, \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}(s, a)\right] \tag{3}$$

where the Q function Q^π(s, a) is the expected discounted sum of rewards following visitation at state s and execution of action a (Sutton and Barto, 2018), and d^π is the state distribution induced by policy π.
Actor-critic methods maximize J(π) over the space of parameterized policies π_θ. Stochastic policies are constructed as a state-dependent transformation of an independent random variable ε:

$$a = \pi_\theta(s, \varepsilon), \qquad \varepsilon \sim p(\varepsilon) \tag{4}$$

where p is a predefined multivariate distribution over R^m and m is the number of actions.² To maximize J over the parameters θ, actor-critic methods operate with an iterative three-phase algorithm. First, they collect into a replay buffer the experience tuples (s, a, r, s') generated with the parametric policy and some additive exploration noise (Zhang and Sutton, 2017). Then they fit a critic, which is a parametric model Q_{θ_Q}(s, a) for the Q function. For that purpose, they apply TD-learning (Sutton and Barto, 2018) with the loss function

$$\mathcal{L}(\theta_Q) = \mathbb{E}\left[\left(Q_{\theta_Q}(s, a) - r - \gamma\, Q_{\bar{\theta}_Q}\!\left(s', \pi_\theta(s')\right)\right)^{2}\right]$$

where θ̄_Q is a lagging set of parameters (Lillicrap et al., 2016). Finally, they apply gradient ascent updates in the direction of an off-policy surrogate of ∇_θ J:

$$\hat{\nabla}_\theta J = \mathbb{E}_{s \sim \mathcal{D}}\, \mathbb{E}_{\varepsilon}\left[\nabla_\theta \pi_\theta(s, \varepsilon)^{\top}\, \nabla_a Q_{\theta_Q}(s, a)\big|_{a = \pi_\theta(s, \varepsilon)}\right] \tag{5}$$

Here, ∇_θ π_θ is a matrix of size n × m, where n is the number of policy parameters to be optimized.

²Deterministic policies, on the other hand, are commonly defined as a deterministic transformation of the state's feature vector.
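The chain-rule structure of the actor update in Eq. (5) can be sketched with a toy linear model. Everything below is illustrative (a linear deterministic policy and a critic whose action gradient is known in closed form), not the paper's implementation; it shows that the update is the policy Jacobian applied to the critic's action gradient, and that a step along it improves the surrogate objective:

```python
import numpy as np

rng = np.random.default_rng(0)
S_DIM, A_DIM = 5, 2

# Toy deterministic policy pi_theta(s) = Theta @ s, and a critic that is
# linear in the action: Q(s, a) = w(s) . a with w(s) = M @ s.
Theta = rng.standard_normal((A_DIM, S_DIM)) * 0.1
M = rng.standard_normal((A_DIM, S_DIM))

def grad_a_Q(s):
    return M @ s                       # critic's action gradient at (s, a)

def actor_step(Theta, states, lr=1e-2):
    """Ascent step of Eq. (5): average over states of (dpi/dtheta)^T dQ/da.
    For a linear policy, d a_i / d Theta[i, :] = s, so each term is the
    outer product of the action gradient with the state."""
    grad = np.mean([np.outer(grad_a_Q(s), s) for s in states], axis=0)
    return Theta + lr * grad

states = rng.standard_normal((64, S_DIM))
before = np.mean([grad_a_Q(s) @ (Theta @ s) for s in states])
Theta_new = actor_step(Theta, states)
after = np.mean([grad_a_Q(s) @ (Theta_new @ s) for s in states])
assert after > before                  # the surrogate objective improved
```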
Two well-known off-policy algorithms are TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018). TD3 optimizes deterministic policies with additive normal exploration noise and double learning to improve the robustness of the critic (Hasselt et al., 2016). On the other hand, SAC adopts stochastic, normally distributed policies, but it modifies the reward function to include a high entropy bonus which eliminates the need for exploration noise.

3.2 Our Approach
The gradient of the off-policy surrogate differs from the true gradient ∇_θ J in two elements: First, the distribution of states is the empirical distribution in the dataset and not the policy-induced distribution d^π; and second, the Q function gradient is estimated with the critic's parametric neural gradient ∇_a Q_{θ_Q}(s, a). Avoiding a distribution mismatch is the motivation of many constrained policy improvement methods such as TRPO and PPO (Schulman et al., 2015a, 2017). However, this requires very small and impractical steps. Thus, many off-policy algorithms ignore the distribution mismatch and seek to maximize only the empirical advantage, i.e. the advantage evaluated over the dataset's state distribution.

In practice, a positive empirical advantage is associated with better policies and is required by monotonic policy improvement methods such as TRPO (Kakade and Langford, 2002; Schulman et al., 2015a). Yet, finding positive empirical advantage policies requires a good approximation of the gradient ∇_a Q. The next proposition suggests that with a sufficiently accurate approximation, applying the gradient step as formulated in the actor update in Eq. (5) yields positive empirical advantage policies.
Proposition 1.

Let π_θ(s, ε) be a stochastic parametric policy given by a transformation of a noise variable ε with a Lipschitz continuous gradient in θ and a Lipschitz constant L_π. Assume that its Q function has a Lipschitz continuous gradient in a, i.e. ‖∇_a Q^π(s, a₁) − ∇_a Q^π(s, a₂)‖ ≤ L_Q ‖a₁ − a₂‖. Define the average gradient operator over the noise distribution, i.e. the expectation over ε of ∇_a Q^π(s, a) at a = π_θ(s, ε). If there exists a gradient estimation ∇̃_a Q and an error bound ε_Q ≥ 0 s.t.

$$\left\| \tilde{\nabla}_a Q(s, a) - \nabla_a Q^{\pi}(s, a) \right\| \le \varepsilon_Q \tag{6}$$

then the ascent step of Eq. (5) with a sufficiently small learning rate yields a positive empirical advantage policy.
We define the maximal learning rate and provide the proof in the appendix. It follows that a positive empirical advantage can be guaranteed when the gradient of the Q function is sufficiently accurate, and with better gradient models, i.e. a smaller error, one may apply larger ascent steps. However, instead of fitting the gradient, actor-critic algorithms favor modeling the Q function and estimating the gradient with the parametric gradient of the model. It is not obvious whether better models of the Q function, with a lower Mean-Squared Error (MSE), provide better gradient estimation. A more direct approach could be to explicitly learn the gradient of the Q function (Sarafian et al., 2020; Saremi, 2019); however, in this work, we choose to explore which architecture recovers a more accurate gradient approximation based on the parametric gradient of the Q function model.
We consider three alternative models:

1. MLP network, where the state features (possibly learnable) and the action are concatenated into a single input of a multi-layer linear network.

2. Action-State Hypernetwork (AS-Hyper), where the actions are the meta-variable, i.e. the input of the primary network, and the state features are the base variable, i.e. the input of the dynamic network.

3. State-Action Hypernetwork (SA-Hyper), which reverses the order of AS-Hyper.
To develop some intuition, let us first consider the simplest case where the dynamic network has a single linear layer and the MLP model is replaced with a plain linear model. Starting with the linear model, the Q function and its gradient take the following parametric form:

$$Q(s, a) = w_s^{\top} \varphi(s) + w_a^{\top} a + b, \qquad \nabla_a Q = w_a \tag{7}$$

where φ(s) is the state feature vector. Clearly, in this case, the gradient is not a function of the state; therefore, it is impossible to exploit this model for actor-critic algorithms. For the AS-Hyper we obtain the following model:

$$Q(s, a) = w(a)^{\top} \varphi(s) + b(a), \qquad \nabla_a Q = \nabla_a w(a)^{\top} \varphi(s) + \nabla_a b(a) \tag{8}$$

Usually, the state feature vector φ(s) ∈ R^n has a much larger dimension than the action dimension m. Thus, the matrix ∇_a w(a)^⊤ ∈ R^{m×n} has a large null space, which can potentially hamper the training as it may yield zero or near-zero gradients even when the true gradient exists.

On the other hand, the SA-Hyper formulation is

$$Q(s, a) = w(s)^{\top} a + b(s), \qquad \nabla_a Q = w(s) \tag{9}$$

which is a state-dependent constant model of the gradient in a. While it is a relatively naive model, it is sufficient for localized policies with low variance as it approximates the tangent hyperplane around the policy mean value.
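The null-space issue of Eq. (8) can be checked numerically. The sketch below uses toy dimensions and hypothetical features: when the state features lie in the null space of the Jacobian's transpose, the linear AS-Hyper gradient vanishes entirely, while the linear SA-Hyper gradient of Eq. (9) remains a non-zero, state-dependent vector:

```python
import numpy as np

n, m = 6, 2                    # feature dim n much larger than action dim m
rng = np.random.default_rng(0)

# AS-Hyper, linear case: Q(s, a) = w(a) . phi(s) with w(a) = J @ a, so
# grad_a Q = J^T phi(s). J^T is m x n: its null space has dim >= n - m.
J = rng.standard_normal((n, m))

# Build a feature vector in the null space of J^T by removing the
# projection of a random phi onto the column space of J.
phi = rng.standard_normal(n)
phi_null = phi - J @ np.linalg.lstsq(J, phi, rcond=None)[0]

grad_as = J.T @ phi_null
assert np.allclose(grad_as, 0.0, atol=1e-8)   # AS-Hyper gradient collapses

# SA-Hyper, linear case: grad_a Q = w(s); with w(s) = W_s @ phi(s) the
# gradient stays state-dependent and generically non-zero.
W_s = rng.standard_normal((m, n))
grad_sa = W_s @ phi_null
assert not np.allclose(grad_sa, 0.0)
```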
Moving forward to a multi-layer architecture, let us first consider the AS-Hyper architecture. In this case the gradient is ∇_a Q = ∇_a θ(a)^⊤ ∇_θ g_θ(φ(s)). We see that the problem of the single layer is exacerbated, since ∇_a θ(a) is now a k × m matrix where k is the number of dynamic network weights.
Next, the MLP and SA-Hyper models can be jointly analyzed. First, we calculate the input gradient of each layer:

$$\frac{\partial x^{1}}{\partial x^{0}} = D^{0} W^{0} \tag{10}$$

$$\frac{\partial x^{l+1}}{\partial x^{l}} = D^{l} W^{l} \tag{11}$$

$$D^{l} = \mathrm{diag}\left(\sigma'\left(W^{l} x^{l} + b^{l}\right)\right) \tag{12}$$

where σ is the activation function and W^l and b^l are the weights and biases of the l-th layer, respectively. By the chain rule, the input gradient of an L-layer network is the product of these expressions. For the MLP model we obtain

$$\nabla_{(s,a)} Q = \prod_{l=L-1}^{0} D^{l} W^{l} \tag{13}$$

On the other hand, in SA-Hyper the weights are the outputs of the primary network, thus we have

$$\nabla_{a} Q = \prod_{l=L-1}^{0} D^{l} W^{l}(s) \tag{14}$$

Importantly, while the SA-Hyper's gradient configuration is controlled via the state-dependent matrices W^l(s), in the MLP model it is a function of the state only via the diagonal elements in D^l. These local derivatives of the non-linear activation functions are usually piecewise constant when the activations take the form of ReLU-like functions. Also, they are required to be bounded and smaller than one in order to avoid exploding gradients during training
(Philipp et al., 2017). These restrictions significantly reduce the expressiveness of the parametric gradient and its ability to model the true Q function gradient. For example, with ReLU, for two different pairs (s, a₁) and (s, a₂) the estimated gradient is equal if they have the same active-neuron map (i.e. the same ReLUs are in the active mode). Following this line of reasoning, we postulate that the SA-Hyper configuration should have better gradient approximations.

Empirical analysis. To test our hypothesis, we trained TD3 agents with different network models and evaluated their parametric gradient ∇_a Q_{θ_Q}(s, a). To empirically analyze the gradient accuracy, we opted to estimate the true Q function gradient with a non-parametric local estimator at the policy mean value. For that purpose, we generated independent trajectories with actions sampled around the mean value, and fit with a Least-Mean-Square (LMS) estimator a linear model for the empirical return of the sampled trajectories. The "true" gradient is therefore the linear model's gradient. Additional technical details of this estimator are found in the appendix.
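The active-neuron-map argument above can be verified directly. In the (hypothetical) bias-free two-layer ReLU network below, positively scaling the input provably preserves the activation pattern, so two distinct inputs yield exactly the same input gradient; the MLP's gradient model cannot distinguish them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer ReLU network y = w2 @ relu(W1 @ x): illustrative sizes, zero
# bias so that scaling the input by a positive factor provably keeps the
# same active-neuron map.
W1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal(8)

def input_gradient(x):
    pre = W1 @ x
    active = (pre > 0).astype(float)     # the "active-neuron map"
    # dy/dx = w2 @ diag(relu'(pre)) @ W1 -- piecewise constant in x.
    return (w2 * active) @ W1, active

x1 = rng.standard_normal(4)
x2 = 2.0 * x1                            # different input, same sign pattern

g1, m1 = input_gradient(x1)
g2, m2 = input_gradient(x2)

# Same active-neuron map => exactly the same input gradient at x1 and x2.
assert np.array_equal(m1, m2) and np.allclose(g1, g2)
```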
As our Q function estimator is based on Temporal-Difference (TD) learning, it bears bias. Hence, in practice, we cannot hope to reconstruct the true Q function's scale. Thus, instead of evaluating the gradient's MSE, we take the Cosine Similarity (CS) as a surrogate for measuring the gradient accuracy.
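A minimal numpy sketch of this measurement procedure, assuming synthetic data in place of sampled trajectory returns (all names and dimensions are illustrative): the "true" gradient is the slope of a least-squares linear fit of returns against action perturbations, and the accuracy surrogate is the cosine similarity against it:

```python
import numpy as np

def lms_gradient_estimate(actions, returns):
    """Fit returns ~ g . a + c by least squares; the slope g is the
    non-parametric estimate of the 'true' gradient at the mean action."""
    A = np.hstack([actions, np.ones((len(actions), 1))])  # add intercept
    coef, *_ = np.linalg.lstsq(A, returns, rcond=None)
    return coef[:-1]

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic sanity check: returns generated from a known linear model.
rng = np.random.default_rng(1)
true_grad = np.array([0.5, -2.0, 1.0])
actions = rng.normal(0.0, 0.1, size=(256, 3))     # samples around the mean
returns = actions @ true_grad + 3.0 + rng.normal(0, 1e-3, 256)

g_hat = lms_gradient_estimate(actions, returns)
assert cosine_similarity(g_hat, true_grad) > 0.99
```

Note the CS is scale-invariant, which is exactly why it tolerates the bias of the TD-based critic.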
Fig. 3 summarizes our CS evaluations with the three model alternatives, averaged over 4 MuJoCo (Todorov et al., 2012) environments. Fig. 3d presents the mean CS over states during the training process. Generally, the CS is very low, which indicates that the RL training is far from optimal. While this finding is somewhat surprising, it corroborates the results in (Ilyas et al., 2019), which found near-zero CS in policy gradient algorithms. Nevertheless, note that the impact of the CS accuracy is cumulative, as in each gradient ascent step the policy accumulates small improvements. This lets even near-zero gradient models improve over time. Overall, we find that the SA-Hyper CS is higher, and unlike the other models, it is larger than zero during the entire training process. The SA-Hyper advantage is specifically significant in the first learning steps, which indicates that SA-Hyper learns faster in the early learning stages.

Assessing the gradient accuracy by the average CS can be somewhat confounded by states that have reached a local equilibrium during the training process. In these states the true gradient has zero magnitude s.t. the CS is ill-defined. For that purpose, in Fig. 3a-c we measure the percentage of states with a CS higher than a threshold. This indicates how many states are learnable, where more learnable states are attributed to a better gradient estimation. Fig. 3a shows that for all thresholds SA-Hyper has more learnable states, and Fig. 3b-c present the change in learnable states for different thresholds during the training process. Here we also find that the SA-Hyper advantage is significant particularly in the first stage of training. Finally, Fig. 4 demonstrates how gradient accuracy translates into better learning curves. As expected, we find that SA-Hyper outperforms both the MLP architecture and the AS-Hyper configuration, which is also generally inferior to MLP.
In the next section, we discuss the application of Hypernetworks in Meta-RL for modeling context-conditional policies. When such a context exists, it also serves as an input variable to the Q function. In that case, when modeling the critic with a Hypernetwork, one may choose to use the context as a meta-variable or alternatively as a base variable. Importantly, when the context is the dynamic network's input, the dynamic weights are fixed for each state, regardless of the task. In our PEARL experiments in Sec. 5 we always used the context as a base variable of the critic. We opted for this configuration since: (1) we found empirically that it is important for generalization to have a constant set of weights for each state; and (2) as the PEARL context is learnable, we found that when the context gradient backpropagates through three networks (primary, dynamic and the context network), it hampers the training. Instead, as a base variable, the context's gradient backpropagates only via two networks, as in the original PEARL implementation.
4 Recomposing the Policy in Meta-RL
4.1 Background
Meta-RL is the generalization of Meta-Learning (Mishra et al., 2018; Sohn et al., 2019) to the RL domain. It aims at learning meta-policies that solve a distribution of different tasks. Instead of learning a different policy for each task, the meta-policy shares weights between all tasks and thus can generalize from one task to another (Sung et al., 2017). A popular Meta-RL algorithm is MAML (Finn et al., 2017), which learns a set of weights that can quickly adapt to a new task with a few gradient ascent steps. To do so, for each task, it estimates the policy gradient (Sutton et al., 2000) at the adaptation point. The total gradient is the sum of policy gradients over the task distribution:
$$\nabla_\theta J(\theta) = \sum_{i} \mathbb{E}_{\tau \sim \pi_{\theta'_i}}\left[\sum_{t} \nabla_\theta \log \pi_{\theta'_i}(a_t \mid s_t)\, \hat{A}^{i}_{t}\right], \qquad \theta'_i = \theta + \alpha \nabla_\theta J_{\mathcal{T}_i}(\theta) \tag{15}$$

where Â^i_t is the empirical advantage estimation at the t-th step in task T_i (Schulman et al., 2015b). On-policy algorithms tend to suffer from high sample complexity, as each update step requires many new trajectories sampled from the most recent policy in order to adequately evaluate the gradient direction.
Off-policy methods are designed to improve the sample complexity by reusing experience from old policies (Thomas and Brunskill, 2016). Although not necessarily related, in Meta-RL, many off-policy algorithms also avoid the MAML approach of weight adaptation. Instead, they opt to condition the policy and the Q function on a context which distinguishes between different tasks (Ren et al., 2019; Sung et al., 2017). A notable off-policy Meta-RL method is PEARL (Rakelly et al., 2019). It builds on top of the SAC algorithm and learns a Q function Q(s, a, z), a policy π(a|s, z) and a context z. The context, which is a latent representation of task T_i, is generated by a probabilistic model q(z|c) that processes a trajectory of transitions c sampled from task T_i. To learn the critic alongside the context, PEARL modifies the SAC critic loss to

$$\mathcal{L} = \mathbb{E}_{z \sim q(z \mid c)}\left[\mathcal{L}_{critic}(s, a, z)\right] + \beta\, D_{KL}\left(q(z \mid c)\, \|\, p(z)\right)$$

where p(z) is a prior probability over the latent distribution of the context. While PEARL's context is a probabilistic model, other works (Fakoor et al., 2019) have suggested that a deterministic learnable context can provide similar results.

In this work, we consider both a learnable context and also the simpler approach of an oracle-context, which is a unique, predefined identifier of task T_i (Jayakumar et al., 2019). It can be an index when there is a countable number of tasks, or a continuous number when the tasks are sampled from a continuous distribution. In practice, the oracle identifier is often known to the agent. Moreover, sometimes, e.g., in goal-oriented tasks, the context cannot be recovered directly from the transition tuples without prior knowledge, since there are no rewards unless the goal is reached, which rarely happens without policy adaptation.
4.2 Our Approach
Hypernetworks naturally fit into the meta-learning formulation where the context is an input to the primary network (von Oswald et al., 2019; Zhao et al., 2020). Therefore, we suggest modeling meta-policies s.t. the context is the meta-variable and the state is the dynamic network's input:

$$\pi(s, z) = g_{\theta(z)}(s), \qquad \theta(z) = f_{w}(z) \tag{16}$$

Interestingly, this modeling disentangles the state-dependent gradient and the task-dependent gradient of the meta-policy. To see that, let us take for example the on-policy objective of MAML and plug in a context-dependent policy π_θ(a|s, z). Then, the objective in Eq. (15) becomes

$$\nabla_\theta J(\theta) = \mathbb{E}_{\mathcal{T}_i}\, \mathbb{E}_{\tau}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t, z_i)\, \hat{A}^{i}_{t}\right] \tag{17}$$

Applying the Hypernetwork modeling of the meta-policy in Eq. (16), this objective can be written as

$$\nabla_w J(w) = \mathbb{E}_{\mathcal{T}_i}\left[\nabla_w f_w(z_i)^{\top}\, \mathbb{E}_{\tau}\left[\sum_{t} \nabla_{\theta} \log g_{\theta(z_i)}(a_t \mid s_t)\, \hat{A}^{i}_{t}\right]\right] \tag{18}$$
In this form, the state-dependent gradients of the dynamic weights are averaged independently for each task, and the task-dependent gradients of the primary weights are averaged only over the task distribution and not over the joint task-state distribution as in Eq. (17). We postulate that such disentanglement reduces the gradient noise for the same number of samples. This should translate into more accurate learning steps and thus a more efficient learning process.
To test our hypothesis, we trained two different meta-policy models based on the MAML algorithm: (1) an MLP model where the state and an oracle-context are joined together; and (2) a Hypernetwork model, as described, with an oracle-context as the meta-variable. Importantly, note that, other than the neural architecture, both algorithms are identical. At four different timestamps during the learning process, we constructed 50 different uncorrelated gradients from different episodes and evaluated the updated policy's performance. We take the performance statistics of the updated policies as a surrogate for the gradient noise. In Fig. 5, we plot the performance statistics of the updated meta-policies. We find that the variance of the Hypernetwork model is significantly lower than that of the MLP model across all tasks and environments. This indicates a more efficient improvement, and therefore we also observe that the mean value is consistently higher.
5 Experiments
5.1 Experimental Setup
We conducted our experiments in the MuJoCo simulator (Todorov et al., 2012) and tested the algorithms on the benchmark environments available in OpenAI Gym (Brockman et al., 2016). For single-task RL, we evaluated our method on the: (1) Hopper-v2; (2) Walker2d-v2; (3) Ant-v2³; and (4) HalfCheetah-v2 environments. For Meta-RL, we evaluated our method on the forward-backward tasks: (1) HalfCheetah-Fwd-Back and (2) Ant-Fwd-Back, and on velocity tasks: (3) HalfCheetah-Vel and (4) Ant-Vel, as is done in (Rakelly et al., 2019). We also added the HalfCheetah-Vel-Medium environment as presented in (Fakoor et al., 2019), which tests out-of-distribution generalization abilities. For Context-MAML and Hyper-MAML we adopted the oracle-context as discussed in Sec. 4. For the forward-backward tasks, we provided a binary indicator, and for the velocity tasks, we adopted a continuous context in a fixed range that maps to the velocities in the training distribution.

³We reduced the control cost as is done in PEARL (Rakelly et al., 2019) to avoid numerical instability problems.
In the RL experiments, we compared our model to SAC and TD3, and in Meta-RL, we compared to MAML and PEARL. We used the authors' official implementations (or an open-source PyTorch (Ketkar, 2017) implementation when the official one was not available) and the original baselines' hyperparameters, as well as strictly following each algorithm's evaluation procedure. The Hypernetwork training was executed with the baseline loss s.t. we changed only the network model and adjusted the learning rate to fit the different architecture. All experiments were averaged over 5 seeds. Further technical details are in the appendix.
5.2 The Hypernetwork Architecture
Our Hypernetwork model is illustrated in Fig. 1 and described in Sec. 2. When designing the Hypernetwork model, we did not search for the best-performing model; rather, we sought a proper comparison to the standard MLP architecture used in RL (denoted here as MLP-Standard). To that end, we used a smaller dynamic network than the MLP model (a single hidden layer instead of two, with the same number of neurons (256) per layer). With this approach, we wish to show the gain of using dynamic weights with respect to a fixed set of weights in the MLP model. To emphasize the gain of the dynamic weights, we added an MLP-Small baseline with a configuration equal to the dynamic model (one hidden layer with 256 neurons).
Unlike the dynamic network, the role of the primary network is missing from the MLP architecture. Therefore, for the primary network, we used a high-performance ResNet model (Srivastava et al., 2015) which we found apt for generating the set of dynamic weights (Glorot and Bengio, 2010). To make sure that the performance gain is not due to the expressiveness of the ResNet model or the additional number of learnable weights, we added three more baselines: (1) ResNet Features: the same primary and dynamic architecture, but the output of the primary is a state feature vector which is concatenated to the action as the input of an MLP-Standard network; (2) MLP-Large: two hidden layers, each with 2900 neurons, which sum up to the same number of weights as in the Hypernetwork architecture; and (3) Res35: a ResNet with 35 blocks that yields the Q value directly, which sums up to the same number of weights. In addition, we added a comparison to the Q-D2RL model: a deep dense architecture for the Q function which was recently suggested in (Sinha et al., 2020).
One important issue with Hypernetworks is their numerical stability. We found that they are specifically sensitive to weight initialization, as a bad primary initialization may amplify into catastrophic dynamic weights (Chang et al., 2019). We solved this problem by initializing the primary network s.t. the average initial distribution of the dynamic weights resembles the Kaiming-uniform initialization (He et al., 2015). Further details can be found in the appendix.
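One simple scheme with this property (a sketch under our own assumptions, not necessarily the paper's exact initialization) is to zero the head's weight matrix and draw its bias from the Kaiming-uniform range: the dynamic weights at step zero are then exactly Kaiming-uniform, independent of the meta-variable:

```python
import numpy as np

def init_primary_head(latent_dim, fan_in, fan_out, rng):
    """Hypothetical primary-head init: the head is linear,
    dynamic_w = H @ latent + c. Zeroing H and drawing the bias c from the
    Kaiming-uniform range makes the initial dynamic layer exactly
    Kaiming-initialized, regardless of the meta-variable."""
    bound = np.sqrt(6.0 / fan_in)                  # Kaiming-uniform bound
    H = np.zeros((fan_out * fan_in, latent_dim))   # no meta-dependence at t=0
    c = rng.uniform(-bound, bound, size=fan_out * fan_in)
    return H, c

rng = np.random.default_rng(0)
H, c = init_primary_head(latent_dim=1024, fan_in=256, fan_out=256, rng=rng)
z = rng.standard_normal(1024)                      # any meta-variable
dynamic_w = (H @ z + c).reshape(256, 256)
# All generated weights start inside the Kaiming-uniform range.
assert np.all(np.abs(dynamic_w) <= np.sqrt(6.0 / 256))
```

After the first update, H becomes non-zero and the generated weights start to depend on the meta-variable.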
5.3 Results
The results and the comparison to the baselines are summarized in Fig. 6. In all four experiments, our Hypernetwork model achieves an average of 10%-70% gain over the MLP-Standard baseline in the final performance and reaches the baseline's score with only 20%-70% of the total training steps. As described in Sec. 5.2, for the RL experiments, in addition to the MLP-Standard model, we tested five more baselines: (1) MLP-Large; (2) MLP-Small; (3) ResNet Features; (4) Res35; and (5) Q-D2RL. Both on TD3 and SAC, we find a consistent improvement over all baselines, and SA-Hyper outperforms in all environments with two exceptions, where MLP-Large or Q-D2RL achieve a better score than SA-Hyper in the Ant-v2 environment (the learning curves for each environment are found in the appendix). While it may seem like the Hypernetwork improvement is due to its large parametric dimension or the ResNet design of the primary model, our results provide strong evidence that this assumption is not true. The SA-Hyper model outperforms other models with the same number of parameters (MLP-Large and ResNet Features⁴) and also models that employ ResNet architectures (ResNet Features and Res35). In addition, it is as good (SAC) or better (TD3) than Q-D2RL, which was recently suggested as an architecture tailored for the RL problem (Sinha et al., 2020). Please note that, as discussed in Sec. 5.2 and unlike D2RL, we do not optimize the number of layers in the dynamic model.⁵

⁴Interestingly, the ResNet Features baseline achieved very low scores, even compared to the MLP-Standard baseline. Indeed, this result is not surprising, as the action-gradient model of ResNet Features is identical to the action-gradient model of MLP-Small (a single hidden layer with 256 neurons). While ResNet-generated state features may improve the Q function estimation, they do not necessarily improve the gradient estimation, as the network is not explicitly trained to model the gradient.

⁵We do not compare to the full D2RL model, which also modifies the policy architecture, as our SA-Hyper model only changes the Q network model.
In Fig. 6c we compared different models for MAML: (1) Vanilla-MAML; (2) Context-MAML, i.e. a context-based version of MAML with an oracle-context; and (3) Hyper-MAML, similar to Context-MAML but with a Hypernetwork model. For all models, we evaluated both the pre-adaptation and the post-adaptation scores. First, we verify the claim in (Fakoor et al., 2019) that context benefits Meta-RL algorithms, as Context-MAML outperforms Vanilla-MAML. However, we find that Hyper-MAML outperforms Context-MAML by roughly 50%. Moreover, unlike the standard MLP models, we find that Hyper-MAML does not require any adaptation step (there is no observable difference between the pre- and post-adaptation scores). We assume that this result is due to the better generalization capabilities of the Hypernetwork architecture, as can also be seen in the following PEARL experiments.
In Fig. 6d we evaluated the Hypernetwork model with the PEARL algorithm. The context is learned with a probabilistic encoder as presented in (Rakelly et al., 2019) s.t. the only difference from the original PEARL is the policy and critic neural models. The empirical results show that Hyper-PEARL outperforms the MLP baseline both in the final performance (15%) and in sample efficiency (70% fewer steps to reach the final baseline score). Most importantly, we find that Hyper-PEARL generalizes better to unseen test tasks. This applies both to test tasks sampled from the training distribution (as the higher score and lower variance of Hyper-PEARL indicate) and also to Out-Of-Distribution (OOD) tasks, as can be observed in Fig. 7.
6 Conclusions
In this work, we set out to study neural models for the RL building blocks: Q-functions and meta-policies. Arguing that the unique nature of the RL setting calls for unconventional models, we suggested the Hypernetwork model and showed empirically several significant advantages over MLP models. First, Hypernetworks better estimate the parametric gradient signal of the Q-function, which is required to train actor-critic algorithms. Second, they reduce the gradient variance when training meta-policies in Meta-RL. Finally, they improve OOD generalization and do not require any adaptation step in Meta-RL training, which significantly facilitates the training process.
7 Code
Our Hypernetwork PyTorch implementation can be found at https://github.com/keynans/HypeRL.
References
 SMASH: one-shot model architecture search through hypernetworks. In International Conference on Learning Representations.
 OpenAI Gym. arXiv preprint arXiv:1606.01540.
 Principled weight initialization for hypernetworks. In International Conference on Learning Representations.
 Model minimization in Markov decision processes. In AAAI/IAAI, pp. 106–111.
 Meta-Q-learning. In International Conference on Learning Representations.
 Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
 Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062.
 Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
 Comparing the parameter complexity of hypernetworks and the embedding-based alternative. arXiv preprint arXiv:2002.10006.
 On the modularity of hypernetworks. In Advances in Neural Information Processing Systems 33.
 Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
 A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (6), pp. 1291–1307.
 HyperNetworks. arXiv preprint arXiv:1609.09106.
 Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
 Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2094–2100.
 Deep reinforcement learning with a natural language action space. In ACL (1).
 Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In ICCV.
 Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
 Continual model-based reinforcement learning with hypernetworks. arXiv preprint arXiv:2009.11997.
 A closer look at deep policy gradients. In International Conference on Learning Representations.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
 Multiplicative interactions and where to find them. In International Conference on Learning Representations.
 Dynamic filter networks. In Advances in Neural Information Processing Systems, pp. 667–675.
 Model-based reinforcement learning for Atari. In International Conference on Learning Representations.
 Approximately optimal approximate reinforcement learning. In Proc. 19th International Conference on Machine Learning.
 Introduction to PyTorch. In Deep Learning with Python, pp. 195–208.
 Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pp. 11784–11794.
 Bandit Algorithms. Cambridge University Press.
 Continuous control with deep reinforcement learning. In ICLR (Poster).
 A generative model for sampling high-performance and diverse weights for neural networks. CoRR.
 Deep meta functionals for shape representation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1824–1833.
 Putting knowledge in its place: a scheme for programming parallel processing structures on the fly. Cognitive Science 9 (1), pp. 113–146.
 A simple neural attentive meta-learner. In International Conference on Learning Representations.
 Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
 The exploding gradient problem demystified: definition, prevalence, impact, origin, tradeoffs, and solutions. arXiv preprint arXiv:1712.05577.
 HyperNets and their application to learning spatial transformations. In International Conference on Artificial Neural Networks, pp. 476–486.
 Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pp. 5331–5340.
 QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304.
 HyperGAN: a generative model for diverse, performant neural networks. CoRR abs/1901.11058.
 Context-based meta-reinforcement learning with structured latent space. Skills Workshop, NeurIPS 2019.
 Meta-learning with latent embedding optimization. In International Conference on Learning Representations.
 Weight normalization: a simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868.
 Explicit gradient learning for black-box optimization. In International Conference on Machine Learning, pp. 8480–8490.
 On approximating with neural networks. arXiv preprint arXiv:1910.12744.
 Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139.
 Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
 High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 D2RL: deep dense architectures in reinforcement learning. arXiv preprint arXiv:2010.09163.
 Meta reinforcement learning with autonomous inference of subtask dependencies. In International Conference on Learning Representations.
 Training very deep networks. In NIPS.
 HyperNetworks with statistical filtering for defending adversarial examples. arXiv preprint arXiv:1711.01791.
 Learning to learn: meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529.
 Reinforcement Learning: An Introduction. MIT Press.
 Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
 Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148.
 MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
 Continual learning with hypernetworks. In International Conference on Learning Representations.
 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256.
 Deep Reinforcement Learning in Action. Manning Publications.
 A deeper look at experience replay. arXiv preprint arXiv:1712.01275.
 Meta-learning via hypernetworks. In 4th Workshop on Meta-Learning at NeurIPS 2020.
Appendix A Proof of Proposition 1
Proposition 1.
Let be a stochastic parametric policy with and a transformation with a Lipschitz continuous gradient and a Lipschitz constant . Assume that its function has a Lipschitz continuous gradient in , i.e. . Define the average gradient operator . If there exists a gradient estimation and s.t.
(19) 
then the ascent step with yields a positive empirical advantage policy.
Proof.
First, recall the objective to be optimized:
(20)  
Notice that as is bounded by the maximal reward and its gradient is Lipschitz continuous, the gradient is therefore bounded. Similarly, since the action space is bounded and the deterministic transformation has a Lipschitz continuous gradient, it follows that is also bounded. Define and .
Lemma 1.
Let s.t. and and is the induced vector norm. And let s.t. and . The operator is Lipschitz with constant .
Proof.
∎
The Lipschitz constant of the objective gradient is bounded by
Applying Lemma 1, we obtain
Therefore, is also Lipschitz. Hence, applying Taylor’s expansion around , we have that
Plugging in the iteration we obtain
(21) 
Taking the second term on the righthand side,
For the last term we have
Plugging both terms together into Eq. (21) we get
To obtain a positive empirical advantage we need
Thus the sufficient requirement for the learning rate is
where .
∎
Appendix B Cosine Similarity Estimation
To evaluate the averaged Cosine Similarity (CS)
(22) 
we need to estimate the local CS for each state. To that end, we estimate the “true” Q-function at the vicinity of with a non-parametric local linear model
where s.t. the Q-function gradient is constant . To fit the linear model, we sample unbiased samples of the Q-function around , i.e. . These samples are the empirical discounted sums of rewards obtained by executing action at state and then following policy .
To fit the linear model, we directly fit the constant model for the gradient. Recall that applying Taylor's expansion around gives
, therefore
for at the vicinity of .
To find the best fit, we minimize the averaged quadratic error term over all pairs of sampled trajectories
This problem can be expressed in matrix notation as
where is a matrix whose rows are the vectors and is an -element vector of all the differences . Its minimization yields the Least-Mean-Squared Estimator (LMSE)
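For illustration, the local linear fit above can be sketched in a few lines of numpy. The black-box function `q`, the Gaussian perturbation sampling, and all names here are our own illustrative stand-ins, not the released implementation (which estimates returns from sampled trajectories):

```python
import numpy as np

# Sketch: estimate a constant local gradient g around an action a0 by least
# squares over sampled perturbations, following the LMSE g = (X^T X)^{-1} X^T y.

def estimate_local_gradient(q, a0, n_samples=100, scale=0.1, rng=None):
    rng = np.random.default_rng(rng)
    # Perturbations da in the vicinity of a0 (rows of the matrix X)
    X = rng.normal(scale=scale, size=(n_samples, a0.shape[0]))
    # Differences y_i = q(a0 + da_i) - q(a0) ~ g^T da_i under the linear model
    y = np.array([q(a0 + dx) - q(a0) for dx in X])
    # Least-mean-squared estimator
    g, *_ = np.linalg.lstsq(X, y, rcond=None)
    return g
```

For an exactly linear `q` this recovers the gradient up to numerical precision; for empirical returns, the sampling radius trades off estimator bias against noise, as discussed below.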
In our experiments we evaluated the CS every learning steps and used , and for each evaluation. This choice trades off somewhat less accurate local estimators with more samples during training. To test our gradient estimator, we first applied it to the outputs of the Q-function network (instead of the true returns) and calculated the CS between a linear model based on the network outputs and the network's parametric gradient. The results in Fig. 8 show that our estimator obtains a high CS between the net outputs of the SA-Hyper and MLP models and their respective parametric gradients. This indicates that these networks are locally () linear. On the other hand, the CS between the linear model based on the AS-Hyper outputs and its parametric gradient is lower, which indicates that the network is not necessarily close to linear with . We assume that this may be because the action in the AS-Hyper configuration plays the meta-variable role, which increases the non-linearity of the model with respect to the action input. Importantly, note that this does not indicate that the true Q-function of the AS-Hyper model is more non-linear than that of the other models.
In Fig. 9 we plot the CS for 4 different environments, averaged with a window size of . The results show that on average the SA-Hyper configuration obtains a higher CS, which indicates that the policy optimization step is more accurate, s.t. the RL training process is more efficient.
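The per-state similarity and the window averaging used for the plots amount to the following minimal sketch (function names are ours):

```python
import numpy as np

def cosine_similarity(u, v):
    # CS between the estimated "true" gradient and the network's action gradient
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def moving_average(x, window):
    # Smooth a sequence of per-evaluation CS values with a fixed window
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="valid")
```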
Appendix C Gradient Step Noise Statistics in MAML
Hypernetworks disentangle the state-dependent gradient and the task-dependent gradient. As explained in the paper, we hypothesized that this characteristic reduces the noise of the gradient ascent step during policy updates,
where is the gradient step estimation and is the learning rate. It is not obvious how to define the gradient noise properly, as any norm-based measure depends on the network's structure and size. Therefore, we take an alternative approach and define the gradient noise as the performance statistics after applying a set of independent gradient steps. In simple words, this definition essentially corresponds to how noisy the learning process is.
To estimate the performance statistics, we take different independent policy gradients based on independent trajectories at 4 different time steps during the training process. For each gradient step, we sampled 20 trajectories with a maximal length of 200 steps (identical to a single policy update during the training process) out of 40 tasks. After each gradient step, we evaluated the performance and restored the policy’s weights s.t. the gradient steps are independent.
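The measurement protocol above can be sketched as follows; `grad_estimator` and `evaluate` are hypothetical stand-ins for a single policy update (from freshly sampled trajectories) and an evaluation rollout, and the plain gradient step replaces the full MAML update for brevity:

```python
import numpy as np

def gradient_step_noise(theta, grad_estimator, evaluate, n_steps=10, lr=0.1,
                        seed=None):
    # Apply n_steps independent gradient steps, all starting from the same
    # parameters theta; evaluate each resulting policy, discard the step,
    # and report the mean and variance of the scores.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_steps):
        g = grad_estimator(theta, rng)           # independent noisy gradient
        scores.append(evaluate(theta + lr * g))  # step, evaluate, restore
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.var()
```

With a deterministic gradient the reported variance is exactly zero, so any spread in the scores reflects the noise of the gradient estimate itself.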
We compared two different network architectures, both with access to an oracle context: (1) Hyper-MAML; and (2) Context-MAML. We did not evaluate Vanilla-MAML as it has no context, and the gradient noise in this case might also be due to higher adaptation noise, as the context must be recovered from the trajectories' rewards. In the paper, we presented the performance statistics after different updates. In Table 1 we present the variance of those statistics.
Environment  50 iter  150 iter  300 iter  450 iter 
HalfCheetah-Fwd-Back  
Context-MAML  1.184 (774)  4.492 (2595)  2.590 (1891)  0.822 (3689) 
Hyper-MAML (Ours)  0.027 (26)  0.017 (43)  0.021 (96)  0.014 (53) 
HalfCheetah-Vel  
Context-MAML  0.035 (122)  0.050 (208)  0.093 (520)  0.066 (161) 
Hyper-MAML (Ours)  0.009 (5)  0.005 (1)  0.008 (2)  0.009 (2) 
Ant-Fwd-Back  
Context-MAML  0.274 (3)  0.199 (5)  0.400 (12)  0.285 (20) 
Hyper-MAML (Ours)  0.073 (1)  0.047 (2)  0.050 (6)  0.047 (11) 
Ant-Vel  
Context-MAML  0.379 (52)  0.377 (8)  0.628 (109)  0.418 (117) 
Hyper-MAML (Ours)  0.252 (5)  0.159 (2)  0.080 (2)  0.057 (2) 

Appendix D Model Designs
D.1 Hypernetwork Architecture
The Hypernetwork's primary part is composed of three main blocks followed by a set of heads. Each block contains an upscaling linear layer followed by two pre-activation residual linear blocks (ReLU-linear-ReLU-linear). The first block upscales from the state's dimension to 256, and the second and third blocks grow to 512 and 1024 neurons, respectively. The total number of learnable parameters in the three blocks is . The last block is followed by the heads, which are a set of linear transformations that generate the dynamic parameters (including weights, biases and gains). The heads have learnable parameters, s.t. the total number of parameters in the primary part is .
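For concreteness, the overall composition (a primary network mapping the state to the weights of a small dynamic Q-network over the action) can be sketched as below. The sizes and the single-linear-layer primary are deliberately toy-scale stand-ins, not the three-block residual primary described above:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class HyperQ:
    # Toy SA-Hyper sketch: the primary maps the state s to the weights of a
    # one-hidden-layer dynamic network that maps the action a to Q(s, a).
    def __init__(self, state_dim, action_dim, hidden=8, seed=None):
        rng = np.random.default_rng(seed)
        self.action_dim, self.hidden = action_dim, hidden
        n_dyn = hidden * action_dim + hidden + hidden + 1  # W1, b1, w2, b2
        self.Wp = rng.normal(scale=0.05, size=(n_dyn, state_dim))  # "head"
        self.bp = np.zeros(n_dyn)

    def __call__(self, s, a):
        w = self.Wp @ s + self.bp            # state-dependent dynamic weights
        h, d = self.hidden, self.action_dim
        i = h * d
        W1, b1 = w[:i].reshape(h, d), w[i:i + h]
        w2, b2 = w[i + h:i + 2 * h], w[i + 2 * h]
        return w2 @ relu(W1 @ a + b1) + b2   # dynamic Q-network on the action
```

Note how every dynamic weight, not only a bias or gain, depends on the state, which is the structural difference from concatenation-based MLP critics.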
D.2 Primary Model Design: Negative Results
In our search for a primary network that can learn to model the weights of a state-dependent dynamic Q-function, we experimented with several different architectures. Here we outline a list of negative results, i.e. models that failed to yield good primary networks.

We found that the head size (the last layer, which outputs all the dynamic network weights) should not be smaller than 512, and that the depth should be at least 5 blocks. Upsampling from the low state dimension can be done either gradually or at the first layer.

For the non-linear activation functions, we tried ReLU and ELU, which we found to perform similarly.
D.3 Hypernetwork Initialization
A proper initialization of the Hypernetwork is crucial for the network's numerical stability and its ability to learn. Common initialization methods are not necessarily suited for Hypernetworks (Chang et al., 2019), since they fail to generate the dynamic weights at the correct scale. We found that some RL algorithms are more affected than others by the initialization scheme, e.g., SAC is more sensitive than TD3. However, we leave the question of why some RL algorithms are more sensitive to the weight initialization than others for future research.
To improve the Hypernetwork weight initialization, we followed (Lior Deutsch, 2019) and initialized the primary weights with smaller-than-usual values, s.t. the initial dynamic weights were also relatively small compared to a standard initialization (Fig. 10). As shown in Fig. 11, this enables the dynamic weights to converge during training to a distribution relatively similar to that of a normal MLP network.
The residual blocks in the primary part were initialized with a fan-in Kaiming uniform initialization (He et al., 2015) with a gain of (instead of the normal gain of for the ReLU activation). We used fixed uniform distributions to initialize the weights in the heads: for the first dynamic layer, for the second dynamic layer, and for the standard deviation output layer in the PEARL meta-policy we used the distribution. In Fig. 10 and Fig. 11 we plot the histograms of the TD3 critic dynamic network weights with different primary initializations: (1) our custom primary initialization; and (2) the default PyTorch initialization of the primary network. We compare the dynamic weights to the weights of a standard MLP-Small network (the same size as the dynamic network). We take two snapshots of the weight distribution: (1) before the start of the training process (Fig. 10); and (2) after 100K training steps (Fig. 11). In Table 2 we also report the total-variation distance between each initialization and the MLP-Small weight distribution. Interestingly, the results show that while the dynamic weight distribution with the PyTorch primary initialization is closer to the MLP-Small distribution at the beginning of the training process, after 100K training steps our primary-initialized weights produce a dynamic weight distribution closer to that of the MLP-Small network (also trained for 100K steps).
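The reduced-scale, fan-in uniform idea can be sketched as follows; the gain value is purely illustrative, as the exact constants are not reproduced in this text:

```python
import numpy as np

def fanin_uniform(fan_in, fan_out, gain=0.1, seed=None):
    # Fan-in Kaiming-style uniform init U(-b, b) with a reduced gain, so the
    # head outputs (the generated dynamic weights) start smaller than usual.
    rng = np.random.default_rng(seed)
    bound = gain * np.sqrt(3.0 / fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))
```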
Primary Initialization Scheme  Hopper  Walker2d  Ant  HalfCheetah 
First Layer  
Our Hyper init  31.4  23.9  13.6  29.4 
PyTorch Hyper init  16.3  20.5  9.2  8.8 
Second Layer  
Our Hyper init  34.8  30.77  37.7  36.9 
PyTorch Hyper init  24.7  39.6  11.2  29.3 
First Layer After 100K Steps  
Our Hyper init  14.4  19.6  29.9  16.4 
PyTorch Hyper init  24.9  22.6  34.0  22.4 
Second Layer After 100K Steps  
Our Hyper init  31.2  28.5  30.6  21.1 
PyTorch Hyper init  32.11  20.8  30.7  31.1 

D.4 Baseline Models for the SAC and TD3 Algorithms
In our TD3 and SAC experiments, we tested the Hypernetwork architecture against 7 different baseline models.
D.4.1 MLP-Standard
A standard MLP architecture, used in many RL papers (e.g. SAC and TD3), with 2 hidden layers of 256 neurons each and ReLU activation functions.
D.4.2 MLP-Small
The MLP-Small model helps in understanding the gain of using context-dependent dynamic weights. It is an MLP network with the same architecture as our dynamic network model, i.e. 1 hidden layer with 256 neurons followed by a ReLU activation function. As expected, although the MLP-Small and MLP-Standard configurations are relatively similar, differing only in the number of hidden layers (1 and 2, respectively), MLP-Small achieved close to half the return of MLP-Standard. However, our experiments show that even a shallow MLP network with context-dependent weights (i.e. our SA-Hyper model) can significantly outperform both shallow and deeper standard MLP models.
D.4.3 MLP-Large
To make sure that the performance gain is not due to the large number of weights in the primary network, we evaluated MLP-Large, an MLP network with 2 hidden layers like MLP-Standard but with 2,900 neurons in each layer. This yields a total number of learnable parameters as in our entire primary model. While this large network usually outperformed the other baselines, in almost all environments it still did not reach the Hypernetwork performance, with one exception in the Ant-v2 environment with the TD3 algorithm. This provides another empirical argument that Hypernetworks are more suited for the RL problem and that their performance gain is not only due to their larger parametric space.
D.4.4 ResNet Features
To test whether the performance gain is due to the expressiveness of the ResNet model, we evaluated ResNet Features: an MLP-Small model where, instead of plugging in the raw state features, we use the primary model configuration (with ResNet blocks) to generate 10 learnable features of the state. Note that the feature extractor part of ResNet Features has a similar parameter space to the Hypernetwork's primary model, except for the head units. ResNet Features was unable to learn in most environments with both algorithms, even though we tried several different initialization schemes. This shows that the primary model is not suitable for extracting state features, and while it may be possible to find other ResNet models that outperform this one, it is yet further evidence that the success of the Hypernetwork architecture is not attributed solely to the expressive power of the ResNet blocks in the primary network.
D.4.5 AS-Hyper
This is the reverse configuration of our SA-Hyper model: the action is the meta-variable and the state serves as the base-variable. Its lower performance provides another empirical argument (alongside the lower CS, see Sec. B) that the "correct" Hypernetwork composition is the one where the state plays the context role and the action is the base-variable.
D.4.6 Emb-Hyper
In this configuration, we replace the input of the primary network with a learnable embedding of size 5 (equal to the PEARL context size), and the dynamic part gets both the state and the action as its input variables. This produces a learnable set of weights that is constant for all states and actions. However, unlike MLP-Small, the weights are generated via the primary model and are not independent as in normal neural network training. Note that we did not include this experiment in the main paper, but we have added it to the results in the appendix. This configuration further validates that the Hypernetwork gain is not due to the over-parameterization of the primary model, and that the disentanglement of the state and action is an important ingredient of the Hypernetwork performance.
D.4.7 ResNet-35
To validate that the performance gain is not due to the large number of weights in the primary network combined with the expressiveness of the residual blocks, we evaluated a full ResNet architecture: the state and actions are concatenated and followed by 35 ResNet blocks. Each block contains two linear layers of size 256 (and an identity path). This yields a total number of learnable parameters which is half of the parameters in the Hypernetwork model. In almost all environments it underperformed both SA-Hyper and the MLP-Standard baseline.
D.4.8 Q-D2RL
The Deep Dense architecture (D2RL) (Sinha et al., 2020) suggests adding skip connections from the input to each hidden layer. In the original paper, this applies both to the Q-network model, where states and actions are concatenated and added to each hidden layer, and to policies, where only states are added to each hidden layer. According to the paper, the best-performing model contains 4 hidden layers. Here, we compare to Q-D2RL, which modifies only the Q-network, as our SA-Hyper model does, but does not alter the policy network. Q-D2RL shows inconsistent performance between SAC and TD3: with SAC, it performs close to SA-Hyper in all environments, while with TD3, it was unable to reach the SA-Hyper performance in any environment.
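The dense-connection idea can be sketched as a minimal forward pass; the layer construction, sizes, and function names here are illustrative, not the D2RL authors' code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def d2rl_q(s, a, hidden_layers, out_layer):
    # D2RL-style critic sketch: the concatenated state-action input is
    # re-appended to the output of every hidden layer (dense skip connections),
    # so each layer sees both learned features and the raw input.
    x = np.concatenate([s, a])
    h = x
    for W, b in hidden_layers:
        h = np.concatenate([relu(W @ h + b), x])  # skip connection from input
    w, b = out_layer                              # final linear head -> Q(s, a)
    return float(w @ h + b)
```

Each hidden weight matrix therefore takes an input of size (hidden + state_dim + action_dim), which is the main structural difference from a plain MLP critic.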
D.5 Complexity and Run Time Considerations
Modern deep learning packages such as PyTorch and TensorFlow currently do not have optimized implementations of Hypernetworks, as opposed to conventional neural architectures such as CNNs or MLPs. Therefore, it is not surprising that training a Hypernetwork can take longer than training MLP models. However, remarkably, in MAML we were able to reduce the training time, as the primary weights and gradients are calculated only once for each task and the dynamic network is smaller than the Vanilla-MAML MLP network. Therefore, within each task, both data collection and gradient calculation with the dynamic model require less time than with the Vanilla-MAML network. In Table 3 we summarize the average training time of each algorithm and compare the Hyper and MLP configurations.
