
Generalization in Transfer Learning

Agents trained with deep reinforcement learning algorithms are capable of performing highly complex tasks, including locomotion in continuous environments. In order to attain human-level performance, the next step of research should be to investigate the ability to transfer the learning acquired in one task to a different set of tasks. Concerns about generalization and overfitting in deep reinforcement learning are not usually addressed in current transfer learning research. This issue results in underperforming benchmarks and inaccurate algorithm comparisons due to rudimentary assessments. In this study, we primarily propose regularization techniques in deep reinforcement learning for continuous control, through the application of sample elimination and early stopping. First, the importance of including the training iteration among the hyperparameters in deep transfer learning problems is emphasized. Because source task performance is not indicative of the generalization capacity of the algorithm, we start by proposing various transfer learning evaluation methods that acknowledge the training iteration as a hyperparameter. In line with this, we introduce an additional step of resorting to earlier snapshots of policy parameters depending on the target task, due to overfitting to the source task. Then, in order to generate robust policies, we discard the samples that lead to overfitting via strict clipping. Furthermore, we increase the generalization capacity in widely used transfer learning benchmarks by using an entropy bonus, different critic methods and curriculum learning in an adversarial setup. Finally, we evaluate the robustness of these techniques and algorithms on simulated robots in target environments where the morphology of the robot, gravity and tangential friction of the environment are altered from the source environment.





1 Introduction

Inferring the general intuition of the learning process and harnessing it to learn a different task is necessary for an autonomous agent operating in non-stationary real-life environments. Being able to adapt to changes in an environment while performing continuous control, as quickly as possible and through cross-task generalization, is essential for attaining artificial intelligence. Generalizing well among tasks belonging to the same set often leads to higher performance in each.

Deep reinforcement learning methods require long training periods in the source domain to develop a strategy comparable to a human's in the same source environment [mnih2015humanlevel]. In real life, a robot will encounter different scenarios when executing tasks in a non-stationary environment. Waiting for a robot to interact millions of times with the environment, instead of increasing its robustness to varying environmental dynamics, is time-consuming and impractical. A robot is expected to generalize to a similar task it has not encountered, adequately and quickly, to coexist with humans in the real world. In order to obtain robust policies that excel in the aforementioned tasks, the robot should not only learn to walk but learn to walk robustly in altered target environments, and the learning attained in the source environment should be transferable to morphologically different robots. Hence, the trade-off between generalization capacity and source task performance should be acknowledged when determining the most pertinent training model.

Analogous to the variety of ways humans carry out a simple task in the real world, robots can perform a task in continuous control environments in distinct ways depending on the design decisions made in the training phase. A human can perform locomotion in many different ways, ranging from low, close-to-the-ground running to careful tiptoeing on a rope. To increase the capabilities of robots in the same manner, we have focused on increasing the scope of abilities gained during learning.

Evaluation of generalization in deep reinforcement learning for discrete and continuous environments is still an open research area [cobbe2018quantifying, zhang2018study, zhao2019investigating]. Similar to Zhao et al. [zhao2019investigating], we first show that source task environment performance is not indicative of generalization capacity and target task performance. This observation forms the basis of our methodology for achieving adequate performance in the target task environment. In continuous control environments, we leverage a recent policy gradient method, Proximal Policy Optimization (PPO) [schulman2017proximal], to acquire knowledge in the form of neural network parameters. First, we show how the failure to recognize overfitting leads to inaccurate algorithm comparisons. We suggest transfer learning evaluation structures based on the policy buffer we propose and employ one of them in our evaluations. The policy buffer consists of promising policy network snapshots saved during training in the source task environment; the design is inspired by human memory. Likewise, we propose new environments inspired by real-life humanoid robot scenarios for benchmarking transfer learning. First, we increased the range of the gravity and robot torso mass experiments used by Henderson et al. [henderson2017benchmark], Rajeswaran et al. [rajeswaran2016epopt] and Pinto et al. [pinto2017robust], respectively, to demonstrate the capabilities of the methods we propose. Then, we introduce new morphology environments for the humanoid. Robots should be able to transfer the learning they have attained to each other, like humans. Accordingly, we designed two new target environments named tall humanoid and short humanoid, each having different loss function constraints and morphologies than the standard humanoid source environment. Similarly, because carrying a heavy object is among the expectations of a service robot, we designed a humanoid delivery environment.

Recognition of the policy iteration as a hyperparameter not only prevents inaccurate algorithm evaluations but also increases the performance of recent algorithms. In this regard, we show that earlier iterations of policies perform well in harder target environments due to the regularization effect of early stopping. We introduce the method of strict clipping to discard samples that cause overfitting. This regularization technique is developed for PPO, but we discuss its possible applications to other algorithms in future directions. We propose a new advantage estimation technique for Robust Adversarial Reinforcement Learning (RARL) [pinto2017robust] in Section 3.2, named Average Consecutive Critic Robust Adversarial Reinforcement Learning (ACC-RARL), which involves both critics at each iteration. In morphological experiments, we further demonstrate that with this technique a hopper robot can hop with twice its original torso mass using learning attained in the standard environment [pinto2017robust]. We compare the generalization capacity of different training methods on a hopper robot, namely advantage estimation techniques, entropy bonuses [cobbe2018quantifying] and different curricula [shioya2018extending] in the RARL setting. In contrast to previous work, we combine the entropy bonus with RARL and compare different critic architectures using the training iteration as a hyperparameter. These findings are beneficial in constructing a meaningful policy buffer; we further assess the necessity of this evaluation technique by pointing out the striking variability in target task performance under constant hyperparameters.

In Section 2, studies on generalization in deep reinforcement learning and transfer learning are reviewed. The proposed method, along with the background, is detailed in Section 3. The experimental setup and results are given in Sections 4 and 5, respectively. Finally, Section 6 concludes with a summary of our contributions, a discussion of the results and future directions for research.

2 Related Work

Generalization in Deep Reinforcement Learning

Evaluation of generalization in deep reinforcement learning is a trending research area [zhang2018study, cobbe2018quantifying] that flourished from the need to extend the range of control tasks using existing skills. Because the testing and training environments are identical in deep reinforcement learning research, the proposed algorithms' robustness, which is essential to real-world deployment, is often neglected [zhang2018study]. Since generalization is neglected, algorithms are developed without the necessary intermediary step of increasing generalization capacity.

In supervised learning problems, data is separated into a training set, a validation set and a test set. Cross-validation is used for hyperparameter tuning on the training set: the training set is divided into a predetermined number of groups, and in each hyperparameter search iteration, each group is used once as the validation fold. After all hyperparameter candidates are evaluated, the algorithm is ready to be tested on the unseen test set. Although hyperparameter optimization to increase the generalization capacity of the algorithm is an almost mandatory step in supervised learning problems, it is challenging to implement this method in deep reinforcement learning.
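As a concrete point of reference, the k-fold procedure described above can be sketched as follows (a generic sketch; the function name and NumPy-based implementation are illustrative, not from the paper):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k folds for cross-validation.

    Each fold serves once as the validation set while the remaining
    folds form the training set, mirroring the supervised-learning
    procedure described above.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val
```

In deep reinforcement learning there is no fixed dataset to split this way, which is precisely why such hyperparameter validation is hard to replicate.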

Concerns over the reproducibility and evaluation of deep reinforcement learning algorithms in general have been brought up by Henderson et al. [henderson2018deep], where the effect of various hyperparameter selections on training performance is analyzed. In transfer learning, the design decisions are even more pivotal since training environment performance is not informative about target environment performance. Consequently, few-shot learning via sampling from the target environment is used in these cases.

Cobbe et al. [cobbe2018quantifying] developed a benchmark for generalization named CoinRun, where a high number of test and training levels can be generated. The generalization capacity of various agents is compared using the percentage of levels solved in the test environment. Dropout, L2 regularization, data augmentation, batch normalization and increased stochasticity of the policy and the environment are the regularization techniques used in [cobbe2018quantifying] to increase zero-shot test performance. Still, few-shot learning in the test environment is required to find the dropout probability, L2 weight, epsilon and entropy coefficient that increase the generalization capacity. Zhao et al. [zhao2019investigating] studied generalization for continuous control by parametrizing domain shift, measuring generalization as the testing Area Under Curve (AUC) under systematic shifts and noise scales introduced to the transition, observation and actuator functions. Similar to [cobbe2018quantifying], Zhao et al. [zhao2019investigating] compared regularization techniques for deep reinforcement learning such as policy network size reduction, the soft actor-critic entropy regularizer [haarnoja2018soft], multi-domain learning [tobin2017domain], structured control nets [srouji2018structured] and adversarially robust policy learning (APRL) [mandlekar2017adversarially] using the AUC. In contrast to [zhao2019investigating], we do not compare regularization techniques using only the policies obtained after a predetermined number of training iterations.

Forward Transfer Learning

Transferring the knowledge and representation gained from one task to another task is called forward transfer learning[deeprlcourse-fa17]. In order to start from a more rewarding location in the parameter space or to learn faster, most applications include transfer of the policy parameters to the new task after the manipulation of the training phase or modification of the architecture [rajeswaran2016epopt, rusu2016progressive]. The transferred neural networks are expected to generalize to the new task and adapt to the new task’s domain.

A transfer can occur from a large domain to a small domain and vice versa. In both cases, the possibility of negative transfer exists: if the algorithm performs worse than it would without any transfer at all, it is called negative transfer. For instance, if the algorithm is initialized from a point that is optimal for the source task but suboptimal for the target task, insufficient exploration might occur. Eventually, this results in negative transfer, thereby making random initialization a better choice.

When a transfer occurs from a larger domain to a smaller domain, it is called partial transfer learning. One example of this in computer vision is discussed in [Cao_2018_CVPR], where the target space is a subset of the source space. The transfer cases covered in their experiments include performing transfer from ImageNet 1000 [russakovsky2015imagenet] to Caltech 84, and from Caltech 256 [griffin2007caltech] to ImageNet 84, where the numbers stand for the number of classes. They attempt to counteract the effects of negative transfer by discriminating the outlier classes from the source domain and maximizing the alignment of the source and target distributions.

Transferring from simulation to the real world is oftentimes a tedious task. Model-free algorithms rely on samples; however, the cost of sampling from a real-world environment is high in robotics settings. Tobin et al. [tobin2017domain] implemented the method of domain randomization in the physics simulator to accurately localize objects in the real world for a manipulation task. Similarly, Sadeghi et al. [sadeghi2016cad2rl] used domain variation in the 3D modeling software Blender [blender], generating distinct pieces of furniture and hallway textures to train a simulated quadcopter.

Adversarial scenarios have also been widely used for robotics tasks involving computer vision. For instance, Bousmalis et al. [bousmalis2018using] and Tzeng et al. [tzeng2015adapting] used adversarial networks where one neural network is optimized to discriminate real-world image data from simulation, whereas the other network is optimized to generate simulator images that can fool the discriminator. The generator network comes up with better representations of the real image data as the discriminator approaches its minimum.

Deep reinforcement learning has benefited from adversarial implementations. Increasing the robustness of the policy leads to higher performance in target tasks involving domain change. Rajeswaran et al. [rajeswaran2016epopt] suggested a method to increase the robustness of the policy network in the Ensemble Policy Optimization (EPOpt) algorithm by training on an ensemble of different tasks. Their dual-step approach to transfer from one distribution of tasks to another consists of Robust Policy Search, where policy optimization is performed using samples from a batch of different tasks, and Model-Based Bayesian Reinforcement Learning, where the source task distribution parameters are updated via experience on the target task during training. Experiments in EPOpt further show that performing batch policy optimization solely on the worst-performing subset of trajectories, while discarding the higher-performing ones, leads to more robust policies. Although not experimented with in the paper, this setting is applicable to problems where a limited number of trials are allowed in the real-world target setting. All in all, they provide satisfactory results by comparing EPOpt with worst-percentile subset extraction against Trust Region Policy Optimization (TRPO) trained on a single source task for each mass, and against EPOpt without worst-percentile subset extraction.
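The worst-percentile subsampling idea from EPOpt can be sketched as follows (a hypothetical illustration; EPOpt's full algorithm additionally adapts the source task distribution via Bayesian updates):

```python
import numpy as np

def worst_percentile_batch(trajectories, returns, epsilon=0.1):
    """Sketch of EPOpt-style worst-percentile subsampling: keep only the
    trajectories whose returns fall in the lowest epsilon percentile, so
    batch policy optimization focuses on the hardest task instances."""
    returns = np.asarray(returns, dtype=float)
    cutoff = np.percentile(returns, 100 * epsilon)
    return [t for t, r in zip(trajectories, returns) if r <= cutoff]
```

Optimizing only on this subset is what the experiments above credit with producing more robust policies.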

In Robust Adversarial Reinforcement Learning (RARL) [pinto2017robust], a separate adversarial network is created to destabilize the agent during training in a more intelligent way, for increased robustness in the target environment. Separate critic networks are consecutively optimized with their policy network counterparts in RARL. The reward functions of the protagonist and the antagonist are each other's negative, so a single shared critic network is also a relevant architecture. For instance, in Dong's implementation, both networks are optimized redundantly, but only the protagonist critic network's advantage estimate is used in policy optimization, resulting in a single shared critic network architecture [rarlbaselines].

Inspired by the fruitful results of RARL [pinto2017robust], Shioya et al. [shioya2018extending] proposed two extensions to the RARL algorithm by varying the adversary policies. Their first proposal is to add a penalty term to the adversary policy's reward function, computed by sampling from the test domain, to adapt to the source task's transition function. This, however, tailors robustness to each test task at hand and requires sampling from the test domain, similar to the Bayesian update used in EPOpt [rajeswaran2016epopt]. The second extension is inspired by curriculum learning and selects the adversarial agents based on the progress of learning instead of naively taking the latest adversarial policy. Protagonist policies trained in harder environments do not guarantee more robust performance; in fact, using previous adversaries randomly during training was explored in earlier work [bansal2017emergent, shioya2018extending]. In both Shioya et al.'s [shioya2018extending] and Bansal et al.'s [bansal2017emergent] experiments, using the latest and hardest adversarial policy hinders the learning progression of the protagonist. In [shioya2018extending], first, multiple adversaries are created and samples from the latest T iterations of the adversary policies are ranked according to the progress of learning using linear regression. The ranks determine the probabilities of using the samples, which are selected stochastically during training. Each adversary policy is optimized using the negative reward of the protagonist agent plus the sum of KL divergences from all the other adversary policies, to encourage diversity between the adversaries. Bansal et al. [bansal2017emergent] determined the opponent humanoid's policy iteration by sampling uniformly from a set covering a percentage of the most recent adversary policy iterations. Shioya et al. [shioya2018extending] used multiple adversaries and ranked each sample's performance to determine the set of samples that should be used for optimization. Experiments were done using the Hopper and Walker2d environments in MuJoCo [todorov2012mujoco] to compare the results to the RARL algorithm [shioya2018extending]. It was found that in the Hopper environment, ranking the policies to adapt the probability of their selection performs better than RARL and uniform random selection, but worse than the latter in the Walker2d environment. In both environments, using the less rewarding trajectories performed worse than all the other methods. In contrast to this result, optimizing over the worst-performing samples generated more robust policies in the Hopper task when tested with different torso masses in the EPOpt algorithm [rajeswaran2016epopt], where an adversary policy is not present.
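The opponent-sampling curriculum described above can be roughly sketched as follows (the function and the exact meaning of the delta parameter are illustrative assumptions, not taken from [bansal2017emergent]):

```python
import random

def sample_opponent(snapshots, delta=0.5, rng=random):
    """Hypothetical sketch of an opponent-sampling curriculum: pick an
    opponent policy snapshot uniformly from the most recent delta fraction
    of saved adversary iterations. delta = 1.0 covers the whole history,
    while delta -> 0 approaches always using the latest (hardest) adversary,
    which the experiments above found to hinder the protagonist's learning."""
    n = len(snapshots)
    start = max(0, n - max(1, int(delta * n)))
    return rng.choice(snapshots[start:])
```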

Adversarial algorithms can be considered a dynamic way of generating different suitable tasks for the agent at each iteration, encouraging it to be more robust in unseen test environments [journals/corr/abs-1710-03641]. Challenging tasks increase robustness by allowing agents to grasp complex latent features of the task.

3 Method

3.1 Background

Policy gradient methods have gained vast popularity over Deep Q-Networks (DQNs) [mnih2015humanlevel], especially after the introduction of algorithms that constrain gradient movement in the policy parameter space, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) [schulman2015trust, schulman2017proximal]. In our experiments, the open-source OpenAI Baselines framework [baselines] is used, where PPO [schulman2017proximal] optimizes the actor policy network and the generalized advantage estimator (GAE) is used to optimize the critic value function network simultaneously [schulman2015high].

Actor-Critic architectures are used to estimate the advantage function [schulman2015high] because the actual value of the state can only be approximated based on the samples rolled out so far in the reinforcement learning domain. Thus, a separate critic network is trained simultaneously with the policy network to predict the value of a given state.

Value function loss is the mean squared difference between the target value and the predicted critic network output. In the OpenAI Baselines framework, the target value is the sum of the generalized advantage estimate and the sampled output of the value function network. The PPO clipped surrogate objective is given in Eq. 1:

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],   (1)

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio and Â_t is the advantage estimate. If the advantage is negative, all ratios below 1 − ε are clipped; if the advantage is positive, all ratios above 1 + ε are clipped, and the gradient of the clipped loss is 0.
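A minimal NumPy sketch of the clipped surrogate loss follows (illustrative only; the actual OpenAI Baselines implementation differs, e.g. it works with log-probabilities and adds value and entropy terms):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate loss (to be minimized).

    ratio:     pi_theta(a|s) / pi_theta_old(a|s) for each sampled action
    advantage: advantage estimates (e.g. from GAE)
    epsilon:   clipping parameter; strict clipping, as proposed later in
               this paper, would use unconventionally low values such as 0.01
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # The elementwise minimum makes the objective pessimistic:
    # samples that would move the policy too far receive zero gradient.
    return -np.mean(np.minimum(unclipped, clipped))
```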


The RARL algorithm will be used as the baseline algorithm in the hopper robot experiments, so detailed information is presented in Section 3.2. The proposed scenario is a two-player zero-sum discounted game where the agent tries to maximize its own reward, which is actively minimized by the adversary. Actions (a¹_t) are sampled from the agent's (protagonist's) policy, denoted μ, whereas actions (a²_t) are rolled out from the adversary's policy ν. Equation 2 shows the reward function of the agent at timestep t:

r¹_t = r(s_t, a¹_t, a²_t).   (2)

Corresponding to that, the reward function of the adversary is r²_t = −r¹_t.


Instead of optimizing the minimax equation at each iteration, the reward functions of the agent and the adversary are maximized consecutively. The agent's policy is optimized iteratively while collecting samples using a fixed adversary policy. After that, the same number of rollouts are collected from the environment for the adversary's optimization while the agent's latest policy parameters are held fixed. First, the authors compared TRPO and RARL using the default environment hyperparameters without any disturbances for 500 iterations on the HalfCheetah, Swimmer, Hopper and Walker2d tasks in MuJoCo. Training with RARL achieved a better mean reward than the baseline TRPO. RARL and TRPO were also evaluated with a trained adversary in the test environment, where RARL performed significantly better on all tasks compared to TRPO, although the training and test environments are identical in this scenario. They also tested the aforementioned tasks with varying torso masses and friction coefficients not seen during the training phase, and again RARL yielded better results than the TRPO baseline. An implementation of the algorithm using the rllab framework [rllab-adv, gym-adv, rllab] and a single-critic, simultaneous PPO variant using the OpenAI Baselines framework are open-sourced [rarlbaselines]. We compare the double, single and shared double critic structures in Sections 5.2.2 and 5.2.3 to evaluate how the critic affects the algorithm's generalization capability using a policy buffer.
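The alternating optimization scheme can be sketched as follows (a hypothetical skeleton; `collect` and `ppo_update` stand in for the sampling and PPO/GAE machinery, and are assumptions for illustration):

```python
def rarl_training_loop(protagonist, adversary, env, n_iters, n_rollouts,
                       collect, ppo_update):
    """Hypothetical sketch of RARL's alternating optimization: the two
    policies are optimized consecutively rather than solving the minimax
    problem jointly."""
    for _ in range(n_iters):
        # Optimize the protagonist while the adversary is held fixed.
        batch = collect(env, protagonist, adversary, n_rollouts)
        ppo_update(protagonist, batch.states, batch.pro_actions, batch.rewards)
        # Optimize the adversary on fresh rollouts; its reward is the
        # negative of the protagonist's (zero-sum game).
        batch = collect(env, protagonist, adversary, n_rollouts)
        ppo_update(adversary, batch.states, batch.adv_actions,
                   [-r for r in batch.rewards])
```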

3.2 Proposed Method

Policy Buffer

Performing continuous control tasks in different environments requires different policies. Policies trained with different hyperparameters show different control patterns; thus, we propose a policy buffer to store policies trained on the same source task environment with the same loss function. In the proposed system, we show that it is possible to extract a comprehensive set of policies representative of distinct control patterns from a single source task environment.

Different snapshots of the policy network parameters taken during training in the source task environment perform dramatically differently in the target task environment. Overtraining in the source task environment decreases the testing area-under-curve performance, which depends on the expected return in the test environment [zhao2019investigating]. Overfitting leads to worse results on the target environments as the distance between the target environment and the source task environment increases in the environment parameter space. In order to discover the most suitable policy network parameters for the unknown target task, we take snapshots of the policy network at predetermined intervals of sampling iterations. The snapshots of the policies trained with different hyperparameters in the source task environment are saved in the policy buffer. Considering the striking difference between each policy snapshot's ability to transfer, evaluations at different iterations are necessary to compare various methods. Taking the resulting parameters after a constant number of training iterations does not constitute a valid comparison, because the full scope of the algorithms' generalization capacity is omitted.
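A minimal sketch of such a policy buffer might look as follows (class and method names, and the dict-based parameter format, are illustrative assumptions):

```python
import copy

class PolicyBuffer:
    """Minimal sketch of the proposed policy buffer: snapshots of policy
    parameters are stored at predetermined training intervals so that
    earlier, less overfit policies remain available for transfer."""

    def __init__(self, interval=10):
        self.interval = interval
        self.snapshots = {}  # training iteration -> parameter snapshot

    def maybe_save(self, iteration, params):
        if iteration % self.interval == 0:
            self.snapshots[iteration] = copy.deepcopy(params)

    def best_for_target(self, evaluate):
        """Return the (iteration, params) pair maximizing a target-task
        evaluation, e.g. few-shot return on a surrogate validation task."""
        return max(self.snapshots.items(), key=lambda kv: evaluate(kv[1]))
```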

Sampling from the target environment during training is used in several transfer learning algorithms [rajeswaran2016epopt, shioya2018extending]. In these settings, the real world is the target environment, where minimizing the quantity of samples is the aim, and the simulator is the source environment, where sampling is restricted only by computational resources.

Acknowledging the training iteration as a hyperparameter is analogous to using the early stopping regularization technique in supervised learning. In deep reinforcement learning settings where the difference between the source task and target task can be parametrized, we propose designing a surrogate validation task. This task is informative for hyperparameter optimization, identifying highly generalizable policy snapshots for a similar target task. The surrogate validation task should have parameters closer to the target environment than to the source environment, and a few-shot surrogate validation task would be an adequate starting point for determining which policies should be given priority during target environment sampling.

We should be aware that these target environment performance plots are unknown to us before sampling in the target environment. Consequently, we suggest that a proper alternative comparison of algorithms can be made by computing the area above a predetermined threshold for each curve: a larger area implies a higher likelihood of choosing a robust policy from the snapshots of policies trained with the corresponding algorithm. However, in our work we plot the consecutive policy iterations of the algorithm that generated the best-performing policy in the target environment, to show that an expert-level robust policy for a variety of environments has already been saved during training.
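The area-above-threshold comparison could be computed as in the following sketch (illustrative; the paper does not fix a particular implementation):

```python
import numpy as np

def area_above_threshold(returns, threshold):
    """Sketch of the proposed comparison metric: the area of a target-task
    performance curve above a predetermined threshold, summed over policy
    snapshots. A larger area implies a higher likelihood that a randomly
    chosen snapshot from that algorithm's policy buffer is robust."""
    returns = np.asarray(returns, dtype=float)
    return float(np.clip(returns - threshold, 0.0, None).sum())
```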

Regularization via PPO Hyperparameter Tuning

Policy gradient algorithms are the building blocks of recent transfer learning and generalization algorithms. The choice of the clipping hyperparameter of Proximal Policy Optimization is crucial when using the algorithm as a transfer learning benchmark. The OpenAI Baselines framework and most of the literature use a clip parameter of 0.2 for continuous control tasks [baselines, schulman2017proximal, bansal2017emergent, journals/corr/abs-1710-03641, huang2018reinforcement, henderson2018deep]. In addition, the clipping parameter is annealed using a learning rate multiplier in the OpenAI Baselines framework to encourage swift convergence to the asymptote for continuous control tasks in MuJoCo. In our experiments, we found that decaying the clipping parameter decreases the asymptotic performance of the algorithm in the Humanoid environment. After our suggestion, annealing of the clipping hyperparameter in the ppo1 algorithm was removed from the OpenAI Baselines framework [baselines].

We hypothesize that in a transfer learning setting, strict clipping can be used to discard the MDP tuples that lead to overfitting. In addition to early stopping, we propose strict clipping as a regularization technique for PPO to combat the overfitting introduced by source task-specific samples. In order to construct a fair comparison between state-of-the-art transfer learning algorithms and their corresponding benchmarks, various lower values of the clipping parameter are analyzed. Strict clipping is performed by decreasing the clipping parameter to unconventional values, for instance as low as 0.01. We show that this method is a competitive benchmark for transfer learning algorithms.
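To illustrate the mechanism, the following sketch counts the fraction of samples whose gradient is zeroed by the PPO clipping operation, an approximation of the "discarded" samples (an illustrative sketch; the function and the toy numbers are not from the paper):

```python
import numpy as np

def discarded_fraction(ratios, advantages, epsilon):
    """Fraction of samples that receive zero gradient under PPO clipping.

    For a positive advantage, ratios above 1 + epsilon are clipped; for a
    negative advantage, ratios below 1 - epsilon are clipped. With strict
    clipping (e.g. epsilon = 0.01) far more samples are effectively
    discarded than with the conventional epsilon = 0.2."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    clipped = np.where(advantages >= 0,
                       ratios > 1 + epsilon,
                       ratios < 1 - epsilon)
    return float(np.mean(clipped))
```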

Adversarial Reinforcement Learning

Training the policy for general robustness over a range of possible unknown scenarios is one way of achieving successful initial test performance. Adversarial scenarios are inspired by the success of domain randomization during training. Introducing an adversary that destabilizes the agent using multidimensional forces in Robust Adversarial Reinforcement Learning (RARL) has proven successful in continuous control tasks [pinto2017robust]. APRL [mandlekar2017adversarially], another type of adversarial algorithm that uses adversarial noise, is among the regularization methods compared in [zhao2019investigating]. However, it performed worse than vanilla PPO in most cases, where each regularization technique was compared on test performance after a predetermined number of timesteps sampled from the source environment. In contrast to prior work, we acknowledge the policy iteration as a hyperparameter in our comparisons and use RARL instead of APRL for our evaluations.

There is a continuous competition in RARL depending on the destabilization capability of the adversary. For instance, an adversary policy with a 2-dimensional output has a restricted power due to low dimensional action space and might have a hard time destabilizing a Humanoid robot with 17-dimensional action space depending on the strength of the force during training. However, an adversary with a 27-dimensional action space that applies a 3-dimensional force to each body component of the protagonist humanoid might even hinder the protagonist policy from reaching convergence. Accounting for overfitting is pivotal in finding the policy with the highest generalization capability when increasing the complexity of the training environment. Thus, we will compare and discuss the variants of RARL by forming a policy buffer and extract the most generalizable policies to increase the capability of the algorithm.

In order to analyze the effect of different critic network architectures on the generalization capacity of the policy, we compare three critic architectures: the separate double critic networks used in RARL [rllab-adv], the single critic network of Shared Critic Robust Adversarial Reinforcement Learning (SC-RARL) [rarlbaselines] and our proposed Average Consecutive Critic Robust Adversarial Reinforcement Learning (ACC-RARL). Critic networks are function approximators, so they are vulnerable to overfitting just like the actor networks.

In RARL, the critic networks are separate, and at each global iteration the protagonist and adversary policies are updated with advantages computed from the rewards of different trajectories and the outputs of separate, randomly initialized critic networks. In SC-RARL, the shared critic is updated with the rewards gained during the protagonist's optimization phase, whereas in RARL each critic is updated only with the rewards gained during its corresponding policy's optimization phase. Since the adversary's advantage is the negative of the protagonist's, SC-RARL is meaningful, but its critic network is updated only with the trajectories sampled in the protagonist's optimization phase. Fewer samples and optimization iterations may act as a regularizer, but we should also explore ways of using the rewards from the other sampling iteration without overfitting. The total number of samples used to update the critic networks is the same for RARL and SC-RARL.

We propose a third architecture, named Average Consecutive Critic Robust Adversarial Reinforcement Learning (ACC-RARL), that computes advantage estimates using the mean of both critics' outputs while consecutively and separately optimizing each critic network along with its corresponding policy. With this method, we aim to decrease overfitting by using double critic networks with different random initializations, and to restrict each critic's movement in parameter space by including the output of the previously updated critic in advantage estimation. The value estimates for both the adversary and the protagonist in the ACC-RARL algorithm are given in Equation 3.


In ACC-RARL, each critic is updated with the average of the two sequentially updated critic outputs and the rewards gained during its corresponding policy's optimization phase. In RARL, the policies are not informed of each other's critic output; they rely on the similarity of two consecutive sets of rewards accumulated by different groups of agents. In ACC-RARL, by contrast, both critics are updated with each other's outputs and with the rewards gained during their own optimization phases. Updating the critic more frequently than the policies would increase overfitting, so with ACC-RARL we aim to increase generalization capacity by letting the critics optimize with different reward batches while considering each other's outputs. Thus, each critic network observes all the rewards sampled in the environment but assigns more weight to the rewards gained during its own optimization phase.
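The averaged-critic value estimate and the zero-sum advantage computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `critic_a`/`critic_b` stand in for the two consecutively updated value networks, and standard GAE is assumed for the advantage estimator.

```python
import numpy as np

def averaged_values(critic_a, critic_b, states):
    """ACC-RARL value estimate: the mean of the two consecutively
    updated critics' outputs, used for advantage estimation."""
    return 0.5 * (critic_a(states) + critic_b(states))

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation. The protagonist uses these
    advantages directly; in the zero-sum formulation the adversary
    uses their negation."""
    values = np.append(values, last_value)
    adv = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```

In this sketch the adversary's advantages would simply be `-gae_advantages(...)` computed from the same averaged values, reflecting the negated-reward relationship between the two policies.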

Both the value function approximators and the policy networks have two hidden layers of size 64 and an input layer of size 11 for the hopper environment. Each policy network has a separate vector of log standard deviations with the same size as the policy network's output; these vectors are optimized simultaneously with the policy networks.

An entropy bonus is used to aid exploration by rewarding the variance of the multivariate Gaussian distribution over actions, scaled by a coefficient, as given in Equation 4.


Although the entropy bonus is part of the PPO total loss function, we did not include it in our loss function because it decreased training performance. Cobbe et al. [cobbe2018quantifying] used PPO with a non-zero entropy coefficient to introduce stochasticity as a regularization technique in discrete environments. Instead, we incorporate the entropy bonus into RARL for the continuous control cases, because decreasing training performance usually increases the generalization capacity of the algorithm. Its separate inclusion in the protagonist's and the adversary's loss functions is discussed due to its destabilizing effect. Our hypothesis is motivated by work on competitive and adversarial environments suggesting that hard environments might hinder learning [bansal2017emergent, shioya2018extending]. Impelling the protagonist to explore proves beneficial in some cases, although it decreases performance in environments with no adversary. Similarly, adding an entropy bonus to the adversary's loss function might prepare the protagonist for a wider range of target tasks through extended domain randomization; correspondingly, the entropy bonus occasionally keeps the adversary from reaching peak performance and decreases the difficulty of the environment. All in all, the snapshots of two uniquely trained policies are added to the policy buffer for each critic architecture.
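For a diagonal Gaussian policy, the entropy term that the bonus coefficient scales has a closed form depending only on the log standard deviations. A minimal sketch, assuming the separate log-std vector described above:

```python
import numpy as np

def gaussian_entropy(log_std):
    """Entropy of a diagonal Gaussian policy.
    H = sum_i [ log(sigma_i) + 0.5 * log(2*pi*e) ].
    The entropy bonus added to the objective is coef * H."""
    return np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e))

# Hypothetical 3-dimensional action space with unit standard deviations
h = gaussian_entropy(np.zeros(3))
```

Because the entropy grows with the log standard deviations, maximizing the bonus directly pushes the policy toward more randomized actions, which is the destabilizing effect discussed above.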

Curriculum learning is a recent branch of transfer learning that focuses on discovering the optimal arrangement of source tasks to perform better on the target task. To excel at complex tasks, humans follow specifically designed curricula in higher education [bengio2009curriculum]. Just as a more personalized curriculum leads to more successful outcomes for humans, this strategy benefits the learning of robots in hard environments.

Similar to [bansal2017emergent, shioya2018extending], we construct a random curriculum by randomizing the adversary policy iterations during training. In our experiments, we compare the performance of protagonist policies trained against adversaries randomly chosen from different recent iterations. We use uniform sampling from a restricted set of the last adversary policies, which corresponds to Shioya et al.'s [shioya2018extending] "mean" method. We introduce RARL Policy Storage (PS)-Curriculum, in which we first train the policy against the adversary in the source task and record every adversary policy snapshot at each iteration to the policy storage. Then, at each iteration, an adversary sampled from the distribution of [bansal2017emergent] is loaded from the policy storage during sampling. This method is included in our experiments because the adversaries recorded to and loaded from the buffer during curriculum training become less capable and more inconsistent as training progresses. As a consequence, we intend to show how a variety of design decisions during training affect target performance, in pursuit of more reliable benchmarks for transfer deep reinforcement learning.
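The restricted uniform sampling from recent adversary snapshots can be sketched as follows. The storage layout and identifier names are hypothetical; in practice the entries would be saved network parameters.

```python
import random

def sample_adversary(policy_storage, window, rng=random):
    """Uniformly sample an adversary snapshot from the last `window`
    entries of the policy storage (the "mean"-style restricted set)."""
    return rng.choice(policy_storage[-window:])

# Hypothetical storage of snapshot identifiers recorded every 10 iterations
storage = [f"adv_iter_{i:04d}" for i in range(0, 1000, 10)]
picked = sample_adversary(storage, window=20)
```

Widening `window` moves this scheme toward the full PS-Curriculum, where any recorded adversary may be loaded; `window=1` recovers standard RARL against the latest adversary.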

4 Experimental Setup

To demonstrate the generalization capability of policies trained with the proposed regularization techniques, we introduce a set of transfer learning benchmarks in the Humanoid-v2 environment and extend the range of the commonly used mass and gravity benchmarks in the Hopper-v2 environment [todorov2012mujoco]. Reproducibility of reinforcement learning algorithms is challenging due to their strong dependence on hyperparameters and implementation details. We use a stochastic policy during training and a deterministic policy in the testing environment.
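The stochastic-train / deterministic-test convention can be sketched with a Gaussian policy head; the function below is an illustrative stand-in, not the authors' code:

```python
import numpy as np

def act(mean_action, log_std, deterministic, rng=None):
    """Return the policy mean for deterministic testing, or a sample
    from the diagonal Gaussian for stochastic training-time exploration."""
    if deterministic:
        return mean_action
    rng = rng or np.random.default_rng()
    return mean_action + np.exp(log_std) * rng.standard_normal(mean_action.shape)

# Hypothetical 2-D action mean; at test time the noise is switched off
mu = np.array([0.3, -0.1])
test_action = act(mu, log_std=np.zeros(2), deterministic=True)
```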

The inputs of each policy network are the same, but the outputs of the protagonist and adversary policy networks differ. Because Hopper-v2 is a 2-dimensional environment, two output neurons are specified to represent each force applied to the geom component of the robot, specifically the heel. The environment simulates actions from both the protagonist and the adversary at each state and outputs the reward based on the next derived position of the agent.

4.1 Environment Variation

The environments in our setup are diverse enough to pose a challenging transfer learning problem. For instance, a humanoid trained in the source environment with parameters optimized for the source task cannot walk in a target environment with a high tangential friction coefficient. No modifications are made to the loss functions for these environmental conditions.

Friction is an environmental dynamic with a substantial effect on bipedal locomotion. Rajeswaran et al. varied ground friction for a hopper robot in [rajeswaran2016epopt]. Similarly, we perform forward transfer learning in the higher-dimensional humanoid environment by transferring the learning attained in an environment with a ground tangential friction of 1 to an environment with a ground tangential friction of 3.5.

Altering gravity to generate target tasks for the Humanoid and Hopper was one of the multi-task experiments Henderson et al. [henderson2017benchmark] designed in line with OpenAI's request for research [OpenAIRequest]. In [henderson2018optiongan], four target tasks are created at 0.25 multiples of the source environment's gravity using the MuJoCo simulator, where G = -9.81. In our gravity experiments, we use 0.5G, 1.5G and 1.75G as the target environment gravities to benchmark our propositions.
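Constructing such target tasks amounts to overwriting two arrays of the simulator's model. The sketch below mirrors the MuJoCo field names (`model.opt.gravity`, `model.geom_friction`) with plain numpy stand-ins; the wrapper plumbing that installs them into a live environment is assumed and not shown.

```python
import numpy as np

def make_target_dynamics(gravity_z=-9.81, tangential_friction=1.0):
    """Sketch: build modified MuJoCo-style dynamics parameters for a
    target task. Rows of geom_friction are (tangential, torsional,
    rolling); here a single hypothetical floor geom is assumed."""
    gravity = np.array([0.0, 0.0, gravity_z])
    geom_friction = np.array([[tangential_friction, 0.005, 0.0001]])
    return gravity, geom_friction

# Target tasks from this paper: 1.5G gravity and 3.5x tangential friction
g, fr = make_target_dynamics(gravity_z=1.5 * -9.81, tangential_friction=3.5)
```

The torsional and rolling friction values here are illustrative defaults, not values taken from the paper.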

4.2 Morphological Variation

Transferring among morphologically different robots with different limb sizes or torso masses has been a popular multi-task learning benchmark [henderson2017benchmark, rajeswaran2016epopt, pinto2017robust]. First, we test the generalization capability of our trained policy on a hopper with differing torso masses to compare our proposition with other algorithms. In Section 5.1.2 we introduce three new target environments: a tall, heavy humanoid; a short, lightweight humanoid; and a delivery robot carrying a heavy box.

5 Results

In this section, we will first discuss the details of the training conducted in the source environments using the hyperparameters provided in Table 1, then the target environment experiments will be discussed using the methodology we propose.

Table 1: Hyperparameters
Hyperparameter Values
Clipping Parameter 0.01 0.025 0.05 0.1 0.2 0.3
Batch Size 64 512
Step Size 0.0001 0.0003
Curriculum Parameter 0.3 0.5
Learning Schedule constant linear
Clipping Schedule constant linear
Trajectory Size 2048
Discount 0.99
GAE Parameter 0.95
Adam Optimizer Beta1 0.9
Adam Optimizer Beta2 0.999
Number of Epochs
Entropy Coefficient 0.001
Number of Hidden Layers 2
Hidden Layer Size 64
Activation Function tanh

5.1 Humanoid

5.1.1 Humanoid Source Environment

As detailed in Equation 5, the standard reward function of the Humanoid-v2 environment consists of an alive bonus, a linear forward velocity reward, a quadratic impact cost clipped at an upper bound of 10, and a quadratic control cost.


If the z-coordinate of the agent's root, which lies at the center of the torso, is not within the interval [1, 2], the episode terminates. The alive bonus is +5 for the Humanoid task, the default value in the Gym Humanoid environment [1606.01540].
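Equation 5 did not survive extraction. Based on the reward terms named above and the Gym Humanoid-v2 implementation, a plausible reconstruction is the following (the coefficients are those of the Gym source and should be checked against the authors' exact equation):

```latex
r_t \;=\; \underbrace{5}_{\text{alive bonus}}
\;+\; \underbrace{1.25\,v_x}_{\text{forward velocity}}
\;-\; \underbrace{0.1\,\lVert a_t \rVert^2}_{\text{control cost}}
\;-\; \underbrace{\min\!\left(5\times 10^{-7}\,\lVert F_{\mathrm{ext}} \rVert^2,\; 10\right)}_{\text{impact cost}}
```

Here \(v_x\) is the forward velocity of the torso, \(a_t\) the action vector, and \(F_{\mathrm{ext}}\) the external contact forces whose squared magnitude forms the impact cost clipped at 10.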

Equation 6 gives the total loss function of the PPO algorithm that we use to update the actor and critic networks. We do not use the entropy term in our PPO implementation because we did not see any improvement in the expected reward; likewise, the entropy bonus is not used in the reference PPO implementation [baselines, schulman2017proximal]. In the OpenAI Baselines framework [baselines], the clipping hyperparameter decays with the learning rate multiplier; we omit this as well, because in environments with higher-dimensional state-action spaces the learning curve tends to decrease after reaching its asymptote in the later stages of training. Hence, we propose a constant clipping schedule and strict clipping parameters for action and state spaces with higher dimensions and for transfer learning scenarios.
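The clipped surrogate that the constant, strict clipping parameter acts on can be sketched as follows. This is the standard PPO clipping rule [schulman2017proximal], shown here only to make the effect of an unusually small clip range concrete:

```python
import numpy as np

def clipped_surrogate(ratio, adv, clip=0.01):
    """PPO clipped surrogate objective (to be maximized) with a
    constant, non-decaying clip. Samples whose probability ratio
    leaves [1-clip, 1+clip] stop contributing gradient, which is how
    strict clipping discards source-task-specific samples."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    return np.minimum(unclipped, clipped).mean()

# With a strict clip of 0.01, a ratio of 1.2 on a positive-advantage
# sample is capped at 1.01 * adv
loss = clipped_surrogate(np.array([1.2]), np.array([2.0]), clip=0.01)
```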

Figure 1: (a) Humanoid running in the source environment using the last policy trained with PPO. (b) Expected reward of policies trained in the standard humanoid source environment with different hyperparameters.

In Figure 1b, the average episode rewards of policies trained with four different sets of hyperparameters are shown. The policy parameters of each set are saved at intervals of 50 iterations. The learning curve obtained with the latest PPO hyperparameters suggested in the OpenAI Baselines framework [baselines] for the Humanoid environment is represented by the red curve in Figure 1b. Following the OpenAI Baselines framework, we used 16 parallel processes to sample concurrently from structurally identical environments with different random seeds during training. The learning curves for the strict clipping methods are shown for clipping hyperparameters 0.01, 0.025, and 0.01 with decaying learning rate and clipping. A linearly decaying learning rate and clipping schedule is typically used in lower-dimensional environments, but we include the strict clipping variations with clipping hyperparameters 0.01 and 0.025 in our benchmarks, since source environment performance is not indicative of target environment performance. In the testing phase, we sample trajectories of 2048 timesteps from 32 differently seeded target task environments for each policy in the policy buffer.
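The snapshotting that fills the policy buffer is a simple periodic save; a minimal sketch with hypothetical names (`maybe_snapshot`, a dict keyed by iteration) follows:

```python
def maybe_snapshot(buffer, iteration, params, every=50):
    """Record a copy of the policy parameters every `every` iterations;
    target-task evaluation later draws candidates from this buffer."""
    if iteration % every == 0:
        buffer[iteration] = dict(params)  # copy, so later updates don't leak in
    return buffer

# Hypothetical training loop recording every 50th iteration
buffer = {}
for it in range(1, 301):
    maybe_snapshot(buffer, it, params={"iteration": it})
```

Saving copies rather than references matters in practice: otherwise every stored "snapshot" would silently track the latest parameters.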

5.1.2 The Morphology Experiment

The morphological modification experiments involve inter-robot transfer learning. Since the termination criterion depends on the location of the center of the torso, the reward functions of both the tall and short humanoid environments are updated: for the tall humanoid the range of the constraint shifts higher, and for the short humanoid it shifts lower. In addition, the total body weights of the short and tall humanoids differ from the standard humanoid by the exclusion and inclusion of the upper waist, respectively, as seen in Figure 2.

Clipping is primarily used to discourage catastrophic displacement in parameter space. We observed that a higher clipping parameter often causes a sudden drop in the learning curve and leaves the policy in an unrecoverable region of the parameter space. In contrast, we show that when unconventionally strict clipping values are used in a transfer learning setting, the MDP samples that lead to overfitting to the source task are discarded. With strict clipping, the trajectory used in the optimization process is free of the variance introduced by source task-specific samples.

Figures 2, 3 and 4 show the performance of the policies trained with the strict clipping parameters. The curves of the strictly clipped policy iterations tend to be smooth, so saving the policy parameters every 50 iterations is sufficient for this experiment in all environments. We also tried the RARL algorithm for these tasks, but strict clipping performed better with the hyperparameters we used. Since strict clipping achieved remarkably good results in our benchmarks, the domain randomization enacted by the adversary is not needed here. Strict clipping allows the humanoid to learn general characteristics of forward locomotion that transfer to various environments by discarding the samples that cause overfitting.

Figure 2: (a) Average reward per episode of every 50th iteration from the policy buffer for a shorter humanoid. (b) Snapshot from the short humanoid environment simulation when the selected policy iteration is used.

Figure 2a shows that the policies trained in the standard humanoid environment with strict clipping and saved after a certain iteration perform exceptionally well in the target environment. The later policy iterations not only gained a high average reward per episode but also performed consistently, with low standard deviation over all trajectories sampled from the 32 differently seeded environments. When the policy is directly transferred, the short humanoid is able to run without adaptation. In contrast, the policy with the larger clipping parameter cannot transfer the learning it attained in the source environment, because the additional samples used during optimization caused overfitting to the source environment. Figure 2 shows that even the earlier iterations of the policy trained with the larger clipping parameter cannot be transferred to a shorter robot; here, early stopping alone is not a sufficient regularization technique.

Figure 3: (a) Average reward per episode of every 50th iteration from the policy buffer for a taller humanoid. (b) Snapshot from the tall humanoid environment simulation when the selected policy iteration is used.

We observed that the humanoid takes smaller steps to stay balanced with a larger upper body and a higher height constraint. Figure 3 shows that the same policy iterations trained with strict clipping also perform well in the tall humanoid environment, suggesting that this form of forward locomotion is applicable in both of these environments as well as in the source environment. A tradeoff between generalization capacity and source performance thus arises, as overfitting to the source task samples proves to be a crucial issue. Our results confirm that the tall humanoid environment is harder than the short humanoid environment, as expected: it is harder to keep balance with a heavier upper body whose center of mass is farther from the ground. Similarly, we observe higher variance in the average rewards collected from the hopper environments with heavier unit torso mass.

The masses of the relevant body parts of the delivery robot are given in Table 2. Given the total body mass, a delivery box with a unit mass of 5 constitutes a challenging benchmark. The design decision was made to create a horizontal imbalance by forcing the humanoid to carry the box with the right hand only. Using the transferred policy, the humanoid is able to carry the heavy box much like a human would. The policy iteration with the best target environment jumpstart performance is shown in the simulation snapshot in Figure 4b. The simulation shows that the humanoid generalizes to this delivery task using only the learning attained in the standard humanoid environment. The reduction in performance after a certain iteration in Figure 4a supports our method of resorting to earlier policy iterations before discarding all the snapshots of the corresponding algorithm. This concavity suggests that, depending on the target environment, the experience gained from continued source training becomes detrimental after a point, and that early stopping is an effective regularization technique.

Table 2: Delivery Environment
Body Unit Mass
Delivery Box 5
Right Hand 1.19834313
Torso 8.32207894
Total Body without Delivery Box 39.64581713
Figure 4: (a) Average reward per episode of every 50th iteration from the policy buffer in the target environment, where a delivery box of unit mass 5 is carried with the right hand. (b) Standard delivery humanoid.

The humanoid robots fall immediately in the target environments when the best-performing source environment policy is used; the orange curves in Figures 2, 3 and 4 thus support our claim that source environment performance is not indicative of generalization capacity. When strict clipping is used as a regularization technique for PPO, the humanoid is able to run in the proposed target environments. In these experiments, our policy buffer consisted of snapshots of policies trained with only the two strict clipping parameters. An alternative policy buffer might consist of snapshots of policies trained with different hyperparameters or training methods, in which case a better-performing snapshot might be found for each benchmark. To find the best-performing policy in the buffer when there is no surrogate validation environment, more sampling must be done in the target environment; thus a tradeoff emerges between the number of trajectories rolled out and the performance attained in the target environment. If this problem is not acknowledged, the number of experiences gathered from the target environment might even exceed the number of samples needed to train from a random initialization. Using observations of the performance of different policies in various target environments, we provide insight into sensible ways of constructing a policy buffer.
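The selection problem under a limited rollout budget can be sketched as follows. `evaluate` is an assumed helper standing in for target-environment rollouts; the budget `episodes_per_snapshot` is the knob behind the tradeoff described above.

```python
def select_snapshot(buffer, evaluate, episodes_per_snapshot=4):
    """Evaluate each stored snapshot with a limited rollout budget in
    the target environment and return the best iteration.
    `evaluate(params, n_episodes)` is an assumed helper returning the
    mean episode reward over n_episodes rollouts."""
    scores = {it: evaluate(p, episodes_per_snapshot) for it, p in buffer.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy example: a concave score curve peaking at an intermediate snapshot,
# mimicking the overfitting-to-source behavior seen in the experiments
buffer = {it: {"iter": it} for it in range(50, 501, 50)}
best_iter, best_score = select_snapshot(buffer, lambda p, n: -(p["iter"] - 150) ** 2)
```

Total target-environment cost is `len(buffer) * episodes_per_snapshot` episodes, which is exactly what a sensibly constructed, smaller buffer keeps below the cost of retraining from scratch.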

5.1.3 The Friction Environment

Frictional variation is one of the most common scenarios encountered in real life for bipedal locomotion. The environment designed to benchmark this scenario is shown in Figure 5b. The humanoid sinks into the ground due to the high tangential friction but is still able to run using the policy trained with strict clipping, without any knowledge of the target environment's friction.

Figure 5: (a) Comparison of average reward per episode of policies in the target environment with tangential friction 3.5 times that of the source environment. (b) Humanoid running in the target environment with tangential friction 3.5 times that of the source environment.

Instead of using multiple different policies for environments with different friction coefficients, choosing a policy with a higher generalization capacity is sufficient even for a target environment with 3.5 times the tangential friction of the source environment.

Table 3: Best Performing Iterations of Policies in Target Friction Environment
Clip Iteration Average Reward per Episode
0.01 1500
0.1 300

Figure 5a gives the best jumpstart performances for each clipping parameter. As seen in Table 3, the last iteration of the policy trained in the source environment with strict clipping 0.01 has an average reward of 8283 with a standard deviation of 24.26 across the 32 target environments. In contrast, the best-performing policy in the source environment has low generalization capacity because it overfits the source task environment. As more samples are discarded and movement in parameter space is restricted through strict clipping, the agent learns more generalizable patterns of bipedal locomotion.

5.1.4 The Gravity Environment

In order to stay in balance under harsh circumstances, the policy being transferred should be robust to unknown environmental dynamics. Walking uninterruptedly in gravities lower and higher than the earth's requires different patterns of forward locomotion, unlike the humanoid morphology benchmarks in Section 5.1.2, where the selected iteration of each policy trained with the same clipping parameter gained above 4000 average reward per episode in all target environments.

Figure 6: (a) Average reward per episode in the target environment with gravity -4.905 (0.5G). (b) Humanoid in the target environment with gravity -4.905 (0.5G), including RARL variations.

When the last iteration of the policy trained with strict clipping is used in the target environment with gravity -4.905 (0.5G), the humanoid in Figure 6b is able to run. Although earlier snapshots of the policy with the best training performance yield lower average rewards in this target environment than the last iteration of the strictly clipped PPO policy, their performance is still remarkably good. Both regularization techniques, early stopping and strict clipping, generalize to this target environment as they did in the delivery environment of Section 5.1.2. In the target environment with gravity -14.715 (1.5G), however, the humanoid needs to resort to earlier snapshots of the policy trained with strict clipping, as plotted in Figure 7a. The bipedal locomotion pattern in the simulated target environment with gravity -14.715 (1.5G), when the humanoid jumpstarts with the strictly clipped policy, is shown in Figure 7b.

Figure 7: (a) Average reward per episode in the target environment with gravity -14.715 (1.5G). (b) Average reward per episode in the target environment with gravity -14.715 (1.5G), including RARL variations.
Figure 8: (a) Average reward per episode in the target environment with gravity -17.1675 (1.75G). (b) Humanoid in the target environment with gravity -17.1675 (1.75G), including RARL variations.

The gravity benchmarks for the humanoid indicate that snapshots of different policies should be used for the target environment with gravity -17.1675 (1.75G). The policy iterations trained with a decaying learning rate and clipping performed poorly in the source environment and in the target environment with lower gravity, given in Figures 1b and 6a respectively. However, their last iterations perform consistently well in the environments with higher gravities. Figures 7 and 8a suggest that decaying the clipping during training may have hindered exploration and restricted the humanoid to a more careful way of stepping forward, which suits the high gravitational force pulling the humanoid to the ground as in Figure 8b.

5.2 Hopper

5.2.1 Hopper Source Environment

The reward function of the hopper environment, given in Equation 7, consists of an alive bonus, a linear forward velocity reward and a control cost equal to the sum of squared actions. The alive bonus in the Hopper environment is +1.
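Equation 7 did not survive extraction. From the terms listed above and the Gym Hopper-v2 implementation, a plausible reconstruction is (the control-cost coefficient is the Gym default and should be checked against the authors' exact equation):

```latex
r_t \;=\; v_x \;+\; \underbrace{1}_{\text{alive bonus}} \;-\; 10^{-3}\,\lVert a_t \rVert^2
```

where \(v_x\) is the hopper's forward velocity and \(\lVert a_t \rVert^2\) the sum of squared actions.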

Figure 9: (a) Expected reward of policies trained in the standard hopper environment. (b) Hopping action in source environment using the last policy trained with PPO.

We initiate 16 parallel processes with different random seeds for the Hopper environment and collect 1.875M timesteps of samples from each uniquely seeded environment. The hyperparameters of the best-performing PPO in this experimental setting are found via a simple grid search over the clipping parameter, step size and batch size. The average reward per episode of PPO and of RARL with different critic architectures in the source environment over a total of 30 million timesteps is shown in Figure 9a.

Consistent with the Humanoid experiments, the policies trained with the different RARL variants perform worse in the source task environment than PPO. The protagonist policy gains fewer rewards due to the domain randomization introduced by the adversary, but a natural regularization occurs that counteracts overfitting to the source environment.

5.2.2 The Morphology Experiment

First, the performance of the different critic structures is compared: RARL [rllab-adv], Shared Critic Robust Adversarial Reinforcement Learning (SC-RARL) [rarlbaselines], and our proposition, Average Consecutive Critic Robust Adversarial Reinforcement Learning (ACC-RARL). Next, the best-performing variation of each critic structure is analyzed. Table 4 shows the morphological specifications of the standard Hopper.

Table 4: Source Environment
Body Unit Mass
Torso 3.53429174
Thigh 3.92699082
Leg 2.71433605
Foot 5.0893801
Total Body 15.26499871

In Figure 9a, the best source task reward is found to be above 3000, where the hopper hops quickly and seamlessly. In the original RARL with TRPO, a narrower torso mass range was chosen for the experiments [pinto2017robust]; we experiment with torso unit masses in the range 1-8. Several of these unit masses prove to be easier benchmarks, so the best-performing policy comparison for each critic structure is omitted for them. The right iterations of the baseline PPO policy performed adequately, which suggests that domain randomization via an adversary is redundant for these experiments, and early stopping, by choosing the right iteration of the policy from the policy buffer, is a sufficient regularization technique.

The algorithms trained with the adversaries proposed in Section 3.2 are more unstable during the training phase. In our experiments, we observed that a wrong choice of hyperparameters leads to a decrease in the average reward per episode after convergence.

Figure 10: (a)-(e) Average reward per episode of every 10th iteration from the policy buffer at the target environments with torso masses 1, 2, 3, 4 and 5, respectively.

The policy buffer consists of snapshots of policies trained with PPO and 21 variants of RARL, recorded at intervals of 10 iterations. Hence, we analyze the average reward per episode of 2002 different policies and test their generalization capacities across the torso masses. Figure 10b shows that different iterations of each algorithm reach maximum performance in the target environment.

Figures 10c and 10d show that RARL and SC-RARL perform uniformly well on the benchmarks close to the source environment. Although the two algorithms have different critic initializations, each critic is updated for the same number of iterations with identically structured loss functions, so it is understandable that SC-RARL and RARL perform similarly in Figure 10d, where the target environment is close to the source environment.

There is a considerable difference between the target environment performances of SC-RARL and RARL in Figure 10e, especially at certain iterations, which shows that training with the rewards of different trajectories, sampled using different protagonist-adversary pairs, does affect the type of control behavior learned.

The performance of the last iteration of RARL starts to decay in Figures 10a and 10e. Consequently, the agent should first resort to earlier snapshots of the policy intended for transfer in order to succeed in harder target environments. Assume the agent's policy buffer holds only several snapshots of a policy trained in the source environment with the standard torso mass, and the agent is then put in a target environment with torso mass 6, analogous to being expected to carry weight while performing a control task. If the last iteration of each policy known to the agent performs below a certain threshold, as in Figure 10e, we propose that instead of training from scratch the agent should first resort to earlier policies at intervals suited to the context, because policies performing above 3000 are readily available in the agent's memory if the policy iterations were saved during training.

The policy buffer allows us to analyze all the different patterns of hopping learned during training. For instance, the type of hopping learned by ACC-RARL between iterations 550 and 800 is successful in the source task environment and in the environment with torso mass 3 (Figure 10c), but clearly unsuccessful in the environments with torso masses 1 and 2 (Figures 10a and 10b). Had we not recognized the policy iteration as a hyperparameter, comparing the algorithms at arbitrarily selected iterations would not constitute a fair comparison. More importantly, PPO, which is generally used as the benchmark algorithm, performs poorly when its last snapshot is used for comparison in the target task of Figure 10a. However, across the target environments of Figures 10a-10e, the right snapshots of PPO obtain above 2500 average reward per episode. Thus, this transfer learning problem reduces to finding the most suitable snapshot of the policies from an appropriately constructed policy buffer.
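The fallback procedure argued for above, walking backwards through the buffer when the latest snapshots underperform, can be sketched as follows; the threshold of 3000 mirrors the hopper discussion, and the score dictionary is a hypothetical stand-in for target-task evaluations.

```python
def fallback_iteration(buffer_scores, threshold=3000):
    """If recent snapshots score below `threshold` in the target task,
    walk backwards through the buffer and return the first (most
    recent) iteration that clears it; None means no stored snapshot
    suffices and retraining is needed."""
    for it in sorted(buffer_scores, reverse=True):
        if buffer_scores[it] >= threshold:
            return it
    return None

# Hypothetical target-task scores: the last iterations overfit the source
scores = {600: 900, 550: 1200, 500: 3400, 450: 3600, 400: 2800}
chosen = fallback_iteration(scores)
```

Returning the most recent acceptable iteration, rather than the global best, keeps the number of target-environment evaluations minimal, consistent with the rollout-budget tradeoff discussed earlier.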

Assuming that the environments are parametrized, it might be possible to predict the performance of the policies residing in the policy buffer. Coinciding with the performances seen in Figures 10a, 11a, 12a and 13a, we anticipate that earlier policy iterations trained with fewer samples perform better as the distance between the target and source environments increases in the parameter space. As seen in Figures 10b, 10c and 10d, the last iteration of the policy trained with RARL with PPO performs at least as well as the original RARL experiment carried out with TRPO in [pinto2017robust]. For masses below the hopper's standard torso mass, the last iterations of ACC-RARL shown in Figures 10a, 10b and 10c perform better than the last iterations of SC-RARL and RARL.

Figure 11: (a) Average reward per episode at the target environment with torso mass 6 of every 10th iteration from the policy buffer. (b) Average reward per episode at the target environment with torso mass 6 of every 10th iteration from the policy buffer, including RARL variations.

In Figure 11a, a significant drop in target environment performance occurs from a certain point of training until the last policy iteration, affecting all algorithms. This implies that the type of hopping behavior learned after that point cannot generalize to hopper environments with higher torso masses, and that all policy iterations trained via the PPO algorithm with the given hyperparameters are inadequate. The best-performing policy iterations of all algorithms accumulate in an increasingly narrow range when the torso mass is greater than 5; thus a mapping between target environment parameters and policy iterations is highly probable.

Addition of the entropy bonus increased the fluctuation of the average rewards for all algorithms, as seen in Figure 11b. Because the standard deviations of the adversary's output action probability distributions increase as a result of optimization, the adversary takes more randomized actions, which often changes the protagonist's way of hopping by destabilizing the equilibrium.

Figure 12: (a) Average reward per episode at the target environment with torso mass 7 for every 10th policy iteration from the policy buffer. (b) Average reward per episode at the target environment with torso mass 7 for every 10th policy iteration, including RARL variations.

In the harder environments with torso masses 7 and 8, the earlier iterations of the policy trained with our proposed critic architecture, ACC-RARL, perform best. Figure 12a shows that the range of the best performing policy iterations contracts further, and the performance similarities of RARL and SC-RARL are more apparent than in Figure 10d. Moreover, it is observed in Figure 12b that the benefit of adding entropy to the adversary loss function in ACC-RARL and SC-RARL continues in the target environment with torso mass 7, whereas standard RARL is seen to perform better in this case. Hence, the addition of the entropy bonus is not guaranteed to increase the maximum average performance in the target environment.

Figure 13: (a) Average reward per episode at the target environment with torso mass 8 for every 10th policy iteration from the policy buffer. (b) Average reward per episode at the target environment with torso mass 8 for every 10th policy iteration from the policy buffer, including RARL variations.

The best average episode rewards for SC-RARL are obtained when it is trained with an adversary loss function that includes the entropy bonus, as shown in Figures 11b, 12b and 13b. Training with adversary entropy shifts the best-performing policy iterations slightly to the right, owing to the increased domain randomization produced by encouraging adversary policy exploration via the entropy bonus.
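As a sketch of the mechanism (not the exact loss or coefficient used in the experiments; the coefficient value below is a placeholder), an entropy bonus for a diagonal-Gaussian adversary policy can be folded into the loss as follows:

```python
import math

def diag_gaussian_entropy(log_stds):
    """Entropy (in nats) of a diagonal Gaussian policy, summed over
    action dimensions: 0.5 * log(2*pi*e) + log_std per dimension."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e) + ls for ls in log_stds)

def adversary_loss(pg_loss, log_stds, entropy_coef=0.01):
    """Subtracting the entropy bonus from the policy-gradient loss rewards
    larger standard deviations, i.e. more exploratory, more randomized
    adversary actions."""
    return pg_loss - entropy_coef * diag_gaussian_entropy(log_stds)
```

Because the optimizer can lower this loss by raising the log standard deviations, the adversary's action distribution widens, which matches the increased reward fluctuation observed above.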

When the mass of the torso is increased to 8, an early iteration of the policy trained with ACC-RARL attains the best average episode reward, as plotted in Figure 13. Although RARL trained with a curriculum (RARL PS-Curriculum) still performed worse than ACC-RARL, there is less variation among the policy snapshots recorded after 100 iterations, as shown in Figure 13b. The increase in performance indicates that training against the hardest adversary policies might not be beneficial. The robustness of a randomly chosen snapshot from the policies trained with RARL PS-Curriculum has increased.

Table 5: Average Reward per Episode of two policy iterations (iterations 150 and 170) trained with ACC-RARL
Unit Mass Iteration Average Reward per Episode
1 170
2 170
3 170
4 170
5 170
6 170
7 150
8 150

If the target environments are grouped into those with torso masses lower and higher than the source environment's, we find that each target group requires a different policy if the highest possible target performance is intended. We show the performance of two policy iterations with high generalization capacity trained with the ACC-RARL algorithm in Table 5 to demonstrate that only two closely saved policy iterations are capable of performing forward locomotion when the torso mass is in the range [1-8]. Using ACC-RARL and regularization via early stopping, we have increased the target environment success range in RARL [pinto2017robust] from [2.5-4.75] to [1-8].

5.2.3 The Gravity Environment

In Learning Joint Reward Policy Options using Generative Adversarial Inverse Reinforcement Learning (OptionGAN) [henderson2018optiongan], the parameter space of the gravity environment spans values below and above the earth's gravity for both the humanoid and the hopper. The policy over options converges to two different policies for the Hopper tasks, one for gravities lower and one for gravities higher than the earth's, indicating that the tasks are complex enough to require different policies. In our sets of experiments, conducted with a larger range of gravity environments, we further show that, by using early stopping with adversarial algorithms, an average episode reward greater than 3000 can be achieved in the target environment. For these sets of tasks, we use the same policy buffer created with PPO and the different variations of RARL for the hopper morphology experiments. The plots in Figures 14a and 14b show that the probability of picking the right iteration from the policy buffer is higher for the policies trained with the curriculum.

Figure 14: (a) Average reward per episode at the target environment with gravity = -4.905 (0.5g). (b) Average reward per episode at the target environment with gravity = -4.905 (0.5g) using the policy buffer, including RARL variations.

In Figure 15a, the best policy of the baseline PPO is shown. Although it is the best performing policy among the policy iterations recorded at intervals of 10, finding this iteration is harder than for the policies trained with RARL. Figure 15b shows that training SC-RARL with the curriculum and including the entropy bonus in the adversary loss function not only increased the average reward per episode of the best performing policy iterations but also increased the number of policy snapshots that obtained an average episode reward above 2000.

Figure 15: (a) Average reward per episode at the target environment with gravity = -14.715 (1.5g). (b) Average reward per episode at the target environment with gravity = -14.715 (1.5g) using the policy buffer, including RARL variations.
Figure 16: (a) Average reward per episode at the target environment with gravity = -17.1675 (1.75g). (b) Average reward per episode at the target environment with gravity = -17.1675 (1.75g) using the policy buffer, including RARL variations.

As the target environment gets harder by moving further away from the source environment, the best performing iterations of the policies aggregate around earlier iterations, similar to the morphology experiments. As seen in Figure 16, the domain randomization in naive PPO does not suffice for generalization in harder target environments. In Figure 16b, we see that encouraging the exploration of the protagonist policy through the inclusion of the entropy bonus increases performance for the policies trained with ACC-RARL and SC-RARL. Although the entropy bonus does not guarantee an increase in all harder target environments in the morphology tasks, it is a regularization technique that should be considered, given that it increases generalization for ACC-RARL and SC-RARL in the gravity tasks.

In Figures 16a and 16b, we see the same concavity encountered in the performance plots of the heavier torso mass target environments in the hopper morphology and the humanoid delivery experiments. This curve is analogous to the convex test error curve in supervised learning, where earlier training iterations underfit and later iterations overfit to the training set. The optimum point on the curve has high generalization capacity and performance in the target environment. The regularization effect of early stopping is pivotal in increasing the generalization capacity. Above all, the gravity environments behaved in line with the morphology experiments, and early stopping in deep reinforcement transfer learning proves essential for succeeding in harder target environments and creating meaningful benchmarks.

6 Conclusion

An agent tries a substantial number of different action combinations, depending on the hyperparameters, during the training phase of deep reinforcement learning algorithms. With each environment interaction, the agent's strategy for solving that particular task is expected to advance globally. At some point, the agent becomes an expert at performing that task but no longer remembers the generalizable strategies it acquired during the earlier phases of learning. These overridden strategies, and strategies with poorer training performance, yield higher performance in our transfer learning benchmarks. Given that forward locomotion is an integral problem in continuous control, we altered the gravity and tangential friction of the environment and the morphology of the agent in our benchmarks.

When training deep neural networks, early stopping [prechelt1998early] is applied when the algorithm's generalization capacity, measured by validation set performance, starts to decrease. Knowing where to stop depends on the difference between the target and source environments. In our work, we proposed keeping a policy buffer, analogous to human memory, to capture different strategies, because training performance does not determine test performance in a transfer learning setting. Since the context of the target environment is not given during training in the source environment, we keep snapshots of policies in the policy buffer. Accordingly, transferring the policy with the best source task performance to the target task becomes a less adequate evaluation technique as the difference between the source and target environments increases. This methodology allowed us to increase the scope of existing algorithms and to transfer the learning attained in a source environment to harder target environments. In addition, we suggest the use of a surrogate validation environment to optimize the hyperparameters by choosing the best fitting policy from the buffer. We proposed including the training iteration among the hyperparameters. With this inclusion, we managed to retrieve the overridden strategies that yield high rewards in the target environments. We showed that a hopper robot is capable of performing forward locomotion in an unknown environment with 1.75 times the source task's gravity using the policies saved at earlier iterations.

We provided comparisons of RARL algorithms trained with different critic structures, curriculum learning and the entropy bonus, and showed how these training choices affect the generalization capacity in continuous control experiments. Using our proposed method ACC-RARL and early stopping via the policy buffer, we increased the success range of the Hopper torso mass experiments from [2.5-4.75] to [1-8].

Furthermore, we introduced strict clipping for Proximal Policy Optimization as a regularization technique. Using an unconventionally low clipping parameter, we discarded the samples that overfit the source task, namely the standard humanoid environment. We observed higher jumpstart performance in humanoid environments with higher tangential friction, a larger range of gravity and morphological modifications, using the robust policies saved during training.
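A minimal sketch of the PPO clipped surrogate with an unconventionally small clipping parameter follows; the default value below is illustrative, not the one used in the experiments.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.01):
    """PPO clipped surrogate objective (to be maximized).

    With a very small clip_eps ('strict clipping'), any sample whose
    importance ratio strays outside [1 - clip_eps, 1 + clip_eps] is
    clipped to a constant, so it contributes no gradient with respect
    to the ratio: the sample is effectively eliminated from the update.
    """
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(r * adv, clipped_r * adv))
    return sum(terms) / len(terms)
```

The same objective with the conventional clipping parameter of 0.2 retains far more of each sample's contribution, which is why strict clipping acts as sample elimination.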

Although outside the scope of transfer learning, we discovered that decaying the clipping parameter decreases training performance for the humanoid environment, which has a higher-dimensional state and action space. Policy gradient algorithms are used in transfer learning algorithms for continuous control; thus, this finding has a substantial effect on training and testing performance.

We believe that the first step in determining the most promising policy parameters lies in the accurate parametrization of the environment. We would like to investigate the relationship between the parametrized distance between environments and the hyperparameters used to train the policies residing in the buffer. This mapping might be modeled by a nonlinear function approximator and should be estimated with the least amount of data possible.

In this study, we showed the necessity of hyperparameter tuning not just for the training performance but for the generalization capacity of the transferred policy. Designing a better surrogate validation task while minimizing its difference from the source task is also a fruitful future research direction we would like to explore. Decreasing the Kullback–Leibler divergence constraint to unconventional values for Trust Region Policy Optimization (TRPO) similar to strict clipping for PPO might also prove to increase the generalization capacity.