I Introduction
As a classic approach to solving intelligent decision-making problems, reinforcement learning (RL) [1] has been revived over the last decade by the development of deep learning technology [2][3]. RL algorithms rely on reward functions to perform well. Despite recent academic efforts to marginalize hand-engineered reward functions [4][5][6], reward design is still an essential way to deal with credit assignment in most RL applications. The optimal reward problem (ORP) was first proposed and studied in [7][8]. Later, [9] reported that a bounded agent can hardly achieve the best performance under the direct guidance of the designer’s goals, whereas well-designed alternative reward functions enable better and faster learning.
A general formulation of reward functions takes the form $r = \theta^{\top}\phi(s, a, s')$, where $\theta$ is a vector of scalar weights composed of encouraging/discouraging reward items and $\phi$ is a vector of predefined indicator features dependent on the states $s$, $s'$, and the action $a$. With $\phi$ fixed, ORP seeks the optimal $\theta^{*}$ that leads to RL policies maximizing the given fitness functions. One major difficulty in reward design is the lack of an instant feedback mechanism from $\theta$ to its actual effect, due to the inherent inefficiency of RL algorithms. Most existing ORP-oriented approaches rely on sufficient training to provide good indicators for further improvement on $\theta$ [7][8][9][10][11][12][13][14][15]. However, the scalability of these approaches is unverified, since they have only been tested on small tasks.
In this paper, we propose a new scalable approach named conditional deep reinforcement learning (cDRL). Instead of optimizing reward functions directly, we leverage the representation power of deep neural networks to model their influences on RL policies. As illustrated in Fig. 1, by extending the input observation with a condition linearly correlated with the effective reward parameters and training the model with correspondingly featured examples, we expect deep RL algorithms to learn policies that are sensitive to this condition and able to adapt their behaviors according to the underlying long-period preferences. This approach is time- and resource-efficient in the sense that it makes only minor modifications to the frameworks and training processes of standard deep RL algorithms, without requiring extra computing resources or prolonged training, so it can be easily applied to large-scale tasks in a plug-and-play fashion.
Once a cDRL agent is trained, the input condition alone acts as a control panel for tweaking the policy’s characteristics in hindsight, since the consequent effects can be readily measured without any further training. Despite the potential inaccuracy in modeling reward influences, which should be kept in mind, cDRL alleviates the dilemma in reward design as long as the policy stays sensitive to the input condition and their asymptotic high-level interaction mechanism is properly learned. Given this convenience, a straightforward application of cDRL is hindsight policy boosting with respect to fitness functions given by the designer. We validate this potential with multiple experiments in Section IV.
II Conditional Deep Reinforcement Learning
II-A Problem Set-Up
When facing a new RL task, we need to identify a group of indicator features based on observations, actions, and history, in accordance with the designer’s goal. These features, either scalar or binary, should be highly expressive and correlated with the intrinsic logic behind the intended behaviors.
Assumption 1. For any specific task domain, we assume that all reward-related indicator features are predefined and well capable of conveying the designer’s goal.
Moreover, we need to set a vector of scalars, i.e. the reward parameters, as the weights of these indicator features, which is the core procedure for reward design in ORP. In RL, parallel vectors of reward parameters normally have equivalent effects. To remove this redundancy, we select the first element as the anchor with a constant value and study the influence of the other elements as they vary relative to it. (Any element of the reward parameters can serve as the anchor; we choose the first to keep (1) concise.) This dimension-reduced parameter space is denoted $\Theta'$ to distinguish it from the whole space $\Theta$. Since we are only interested in the near-optimal region of $\Theta'$, we make another assumption, similar to [15], below.
Assumption 2. Each non-anchor reward parameter is assigned a reasonable range relative to the anchor, based on available domain knowledge, so that the sampled combinations are likely to lead to behaviors of high true utility as desired.
Essentially, these ranges delimit a subspace of interest $\Theta'_{s}$ within the reduced parameter space $\Theta'$. We argue that this is pragmatically much easier than setting exact optimal values for reward parameters. Hence our target now is to model the underlying interaction mechanisms among the different dimensions of the reward parameters, which provides the opportunity to perform hindsight reward tweaking and achieve better performance than standard RL algorithms with hand-designed rewards.
To facilitate cDRL training for a certain domain, we need environment instances whose reward parameters are randomized, drawn uniformly from the subspace of interest $\Theta'_{s}$, and which output these parameters along with the original observations at every step. Note that different dimensions of $\Theta'_{s}$ may vary drastically in numerical scale, which increases learning difficulty. We thus introduce an adaptive affine transformation $f$, which simply maps each dimension of $\Theta'_{s}$ to $[-1, 1]$ in a space $\mathcal{C}$. We name $\mathcal{C}$ the condition space and denote the normalized subspace corresponding to $\Theta'_{s}$ as $\mathcal{C}_{s}$ (see Fig. 2). All reward parameters within $\Theta'_{s}$ are projected to $\mathcal{C}_{s}$ before they are concatenated and output with the original observations. Then the step-wise reward function can be reformulated as:
(1) 
where $c$ denotes the condition and $f^{-1}$ uniquely maps $c$ to a valid reward parameter vector within the reduced parameter space which, combined with the anchor, forms the whole group of weights for the indicator features.
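The normalization and the reformulated reward can be illustrated with a minimal sketch. It assumes the affine map sends each per-dimension range `[low, high]` onto `[-1, 1]` and that the anchor weight multiplies the first indicator feature; the function names, ranges, and feature values below are hypothetical and chosen only for illustration.

```python
import numpy as np

def to_condition(theta, low, high):
    """Affine map sending each dimension of [low, high] onto [-1, 1]."""
    return 2.0 * (theta - low) / (high - low) - 1.0

def from_condition(c, low, high):
    """Inverse affine map from the condition space back to reward parameters."""
    return low + (c + 1.0) * (high - low) / 2.0

def conditional_reward(features, c, low, high, anchor=1.0):
    """Step reward: anchor weight on the first feature, recovered weights on the rest."""
    weights = np.concatenate(([anchor], from_condition(c, low, high)))
    return float(weights @ features)

# Hypothetical ranges for two non-anchor parameters with very different scales.
low = np.array([0.8, 0.001])
high = np.array([1.2, 0.1])
rng = np.random.default_rng(0)
theta = rng.uniform(low, high)           # sampled non-anchor reward parameters
c = to_condition(theta, low, high)       # condition appended to the observation
features = np.array([1.5, 1.0, -0.3])    # made-up indicator feature values
r = conditional_reward(features, c, low, high)
```

Because the map is affine and invertible, the agent only ever sees well-scaled conditions, while the environment can always recover the exact weights used to compute the reward.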
II-B cDRL Algorithms
When applying deep RL to a certain domain, different reward parameters produce different returns and thus favor different behaviors for the same observation. If the input observation is extended with a condition that exclusively embodies the relevant reward parameters, gradients will consist of two components during backpropagation. One drives the neural network to extract useful features from the original observation; the other drives the network to interpret different input conditions and combine them properly with the extracted features to generate the desired behaviors and value estimates. (Conditions need not be fed into the network at the input layer together with the original observations; they can also be concatenated with intermediate features as needed, which is especially the case when image-like observations are used.) With this intuition, we expect that such a conditional deep neural policy can learn to adapt its characteristics as the input condition changes.
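As a rough illustration of the two ways of feeding the condition described above, the following sketch builds a tiny tanh network in NumPy. The class name, layer sizes, and the `late_fusion` switch are hypothetical; it shows only the forward pass and is not the architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

class ConditionalMLP:
    """Tiny tanh MLP that receives the condition at the input or after the first layer."""

    def __init__(self, obs_dim, cond_dim, hidden, out_dim, late_fusion=False):
        self.late_fusion = late_fusion
        first_in = obs_dim if late_fusion else obs_dim + cond_dim
        second_in = hidden + cond_dim if late_fusion else hidden
        self.l1 = init_layer(first_in, hidden)
        self.l2 = init_layer(second_in, hidden)
        self.out = init_layer(hidden, out_dim)

    def forward(self, obs, cond):
        # Early fusion: concatenate the condition with the raw observation.
        x = obs if self.late_fusion else np.concatenate([obs, cond], axis=-1)
        h = np.tanh(x @ self.l1[0] + self.l1[1])
        if self.late_fusion:
            # Late fusion: concatenate the condition with intermediate features.
            h = np.concatenate([h, cond], axis=-1)
        h = np.tanh(h @ self.l2[0] + self.l2[1])
        return h @ self.out[0] + self.out[1]

policy = ConditionalMLP(obs_dim=4, cond_dim=2, hidden=8, out_dim=3)
obs = np.ones(4)
out_a = policy.forward(obs, np.array([-1.0, 1.0]))
out_b = policy.forward(obs, np.array([1.0, -1.0]))
```

The same observation yields different outputs under different conditions, which is the sensitivity that training is expected to shape into meaningful behavioral differences.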
We denote a conditional policy parameterized by $w$ as $\pi_{w}(a \mid s, c)$, or $\pi_{w}$ for simplicity, where $c$ represents the input condition. Let $\rho_{c}$ be the initial state distribution under $c$; then the cDRL optimization target is formulated as:
(2) 
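One way to write an objective consistent with this description is sketched below. It assumes a conditional policy $\pi_{w}$ with parameters $w$, an input condition $c$ drawn uniformly from the training subspace of conditions $\mathcal{C}_{s}$, an initial state distribution $\rho_{c}$, a discount factor $\gamma$, and the condition-dependent step reward $r(s_{t}, a_{t}, s_{t+1}; c)$ of (1). These symbol choices are assumptions made for illustration and may differ from the original notation.

```latex
\max_{w}\;
\mathbb{E}_{c \sim \mathcal{U}(\mathcal{C}_{s})}\,
\mathbb{E}_{s_{0} \sim \rho_{c},\; a_{t} \sim \pi_{w}(\cdot \mid s_{t}, c)}
\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_{t}, a_{t}, s_{t+1}; c) \right]
```

The outer expectation is over conditions and the inner one over trajectories generated under each condition, which is the nesting of the training data.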
Apparently, the desired training data is nested: both the condition subspace and the consequent conditional example spaces should be sufficiently explored. This may seem extremely inefficient at first, but keep in mind that the feature extraction task is shared across all reward parameters, while varied reward parameters induce more diverse observation distributions, which in turn benefits this task. We assume that the main learning burden for deep RL algorithms lies in extracting efficient high-level features, strongly correlated with decision making, from raw input information. It is then possible for cDRL to learn without requiring more examples or notably prolonged training compared to standard deep RL, as long as its extra task of preference adjustment is relatively simple. Indeed, we evaluate both our approach and the baselines with an equal amount of training in Section IV, and the experimental results support this assumption well.
On the other hand, to enhance exploration diversity in the condition subspace, we adopt the asynchronous methods of [16], running a batch of environments in parallel with individually sampled reward parameters that are updated periodically. Other measures used by standard deep RL algorithms to improve data efficiency and learning stability are kept unchanged. As a general case, the full cDRL algorithm is outlined in Algorithm 1. For the chosen deep RL frameworks, any neural network approximator that takes the observation as input becomes conditional, in the sense that the original observation is concatenated with an extra condition. We describe the conditional versions of A3C [16], DDPG [17], and Deep Q-learning [2] in detail below.
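The sampling schedule described above can be sketched as follows. This is only a skeleton under stated assumptions: the function name and arguments are hypothetical, the rollout collection and the deep RL update are left as a placeholder comment, and it is not a transcription of Algorithm 1.

```python
import numpy as np

def run_conditional_sampling(n_envs, n_updates, refresh_period, low, high, seed=0):
    """Return, for each update, the per-environment reward parameters in use."""
    rng = np.random.default_rng(seed)
    # Every environment starts with its own independently sampled parameters.
    params = rng.uniform(low, high, size=(n_envs, len(low)))
    history = []
    for update in range(n_updates):
        if update > 0 and update % refresh_period == 0:
            # Periodic refresh: each environment draws new parameters.
            params = rng.uniform(low, high, size=(n_envs, len(low)))
        # Placeholder: collect rollouts with observations extended by the
        # normalized parameters, then apply the usual deep RL update.
        history.append(params.copy())
    return history

low = np.array([0.8, 0.8])
high = np.array([1.2, 1.2])
hist = run_conditional_sampling(n_envs=4, n_updates=25, refresh_period=10,
                                low=low, high=high)
```

A longer refresh period gives each sampled parameter vector more data, while more parallel environments cover more of the parameter ranges at any one time.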

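Conditional A3C: the algorithm maintains a conditional policy $\pi_{w}(a \mid s, c)$ and a conditional value estimation function $V_{v}(s, c)$. The policy is optimized according to the advantage-based policy gradient $\nabla_{w} \log \pi_{w}(a_{t} \mid s_{t}, c)\, A(s_{t}, a_{t}, c)$, where $A(s_{t}, a_{t}, c)$ is the conditional advantage of action $a_{t}$ for state $s_{t}$ under condition $c$. For moment $t$ and a following episode of length $k$, $A(s_{t}, a_{t}, c)$ is estimated by $\sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V_{v}(s_{t+k}, c) - V_{v}(s_{t}, c)$, where $\gamma$ is the discounting factor. We adopt PPO [18] to stabilize learning. As this is an online RL algorithm, a large batch of sampling environments is important for the success of conditional training.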
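The k-step advantage estimate for a single condition can be sketched as below. The function name is hypothetical, the value estimates are passed in as plain numbers rather than computed by a network, and this is the plain k-step form rather than the GAE variant used in the experiments.

```python
import numpy as np

def k_step_advantage(rewards, value_start, value_end, gamma=0.99):
    """Discounted k-step return plus bootstrapped value, minus the starting value.

    rewards: the k rewards observed under one fixed condition.
    value_start: critic estimate for the first state under that condition.
    value_end: critic estimate for the state reached after k steps.
    """
    rewards = np.asarray(rewards, dtype=float)
    k = len(rewards)
    discounts = gamma ** np.arange(k)
    return float(discounts @ rewards + gamma ** k * value_end - value_start)

# Three unit rewards, gamma = 0.5: 1 + 0.5 + 0.25 + 0.125 * 4 - 2 = 0.25
adv = k_step_advantage([1.0, 1.0, 1.0], value_start=2.0, value_end=4.0, gamma=0.5)
```

Since rewards and values both depend on the condition, the rewards in one estimate should all come from a segment collected under the same reward parameters.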
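Conditional DDPG: the algorithm learns a conditional Q-function $Q(s, a, c; w^{Q})$ with parameter $w^{Q}$ by fitting the target value $y_{t} = r_{t} + \gamma\, Q'(s_{t+1}, \mu'(s_{t+1}, c), c)$, based on the bootstrapping property of the Bellman equation. $\mu(s, c; w^{\mu})$ is the conditional deterministic policy parameterized by $w^{\mu}$, which is optimized through the gradient stemming directly from $Q$, given by $\nabla_{a} Q(s, a, c; w^{Q})\big|_{a = \mu(s, c)}\, \nabla_{w^{\mu}} \mu(s, c; w^{\mu})$. Independent target networks $Q'$ and $\mu'$ are used and softly updated with a temperature parameter $\tau$ to stabilize learning. Conditional transitions [$s_{t} \oplus c$, $a_{t}$, $r_{t}$, $s_{t+1} \oplus c$] are pushed into a replay buffer and resampled in batches for training, where $\oplus$ represents the concatenation operation.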
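The bootstrapped target and the soft target-network update can be sketched as follows. The target actor and critic are passed in as plain callables, the function names are hypothetical, and terminal-state handling is omitted for brevity.

```python
import numpy as np

def ddpg_target(reward, next_obs, cond, target_actor, target_critic, gamma=0.99):
    """Target value built from the next observation concatenated with the condition."""
    next_in = np.concatenate([next_obs, cond])
    next_action = target_actor(next_in)
    return reward + gamma * target_critic(next_in, next_action)

def soft_update(target_params, online_params, tau=0.001):
    """Move each target parameter a fraction tau towards its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Toy stand-ins for the target networks, used only to exercise the functions.
toy_actor = lambda x: np.array([0.5])
toy_critic = lambda x, a: float(x.sum() + a.sum())

y = ddpg_target(1.0, np.array([1.0, 2.0]), np.array([-1.0]),
                toy_actor, toy_critic, gamma=0.9)
new_target = soft_update([np.zeros(2)], [np.ones(2)], tau=0.1)
```

Storing the condition inside each transition lets replayed samples be evaluated under the reward parameters that generated them, even after the environments have been refreshed.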

Conditional DQN: similar to conditional DDPG, this algorithm also learns a conditional Q-function $Q_{w}(s, a, c)$ parameterized by $w$ from conditional transitions, but with a different target value given by $y_{t} = r_{t} + \gamma \max_{a' \in \mathcal{A}(s_{t+1})} Q_{w^{-}}(s_{t+1}, a', c)$, where $\mathcal{A}(s_{t+1})$ is the discrete collection of all available actions for state $s_{t+1}$. A lagged target network $Q_{w^{-}}$ is used to stabilize learning; it is updated at a lower frequency than $Q_{w}$. For a certain state $s$ under condition $c$, the desired action is selected according to $\arg\max_{a} Q_{w}(s, a, c)$, with an $\epsilon$-greedy strategy during training or a greedy strategy at test time. No explicit conditional policy is learned in this algorithm.
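The target computation and action selection can be sketched as below, assuming the conditional Q-values for the next state have already been evaluated by the lagged target network. The function names and the `done` flag are illustrative additions.

```python
import numpy as np

def dqn_target(reward, next_q_values, gamma=0.99, done=False):
    """Reward plus the discounted maximum target-network Q-value over actions."""
    if done:
        return float(reward)
    return float(reward + gamma * np.max(next_q_values))

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy over conditional Q-values; epsilon = 0 gives the greedy test-time rule."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_next = np.array([0.2, 2.0, -1.0])
y = dqn_target(1.0, q_next, gamma=0.5)               # 1.0 + 0.5 * 2.0
greedy_action = select_action(q_next, epsilon=0.0, rng=rng)
```

Because the condition is part of the Q-function's input, the same state can yield a different greedy action under a different condition without any separate policy network.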
III Related Work
As proof-of-concept studies, [7][8][9] used exhaustive search to examine the nature of reward functions and verify the benefits of well-designed rewards. [10] took a step towards pragmatic applications and proposed a lightweight approach that uses policy gradient to optimize reward parameters online. This approach requires an explicit model of the Markov decision process, which is impractical for complex or continuous tasks. Another research direction adopts nested optimization, applying a high-level reinforcer or genetic programmer to optimize reward parameters while optimizing RL policies [11][12][13][14]. These approaches are strongly limited by task complexity and available computational resources. [15] presented a Bayesian approach to reward design that estimates a posterior over optimal rewards from parameter samples and their performance. They used alternative planning methods instead of RL to circumvent the intractability of the original idea, which also confines their approach to relatively simple tasks.
Meta reinforcement learning (meta-RL) is a class of RL algorithms designed for fast adaptation to new tasks by learning internal representations broadly suitable for a certain task distribution. Theoretically, meta-RL could be trained to adapt to different reward parameters and perform hindsight tweaking of policy characteristics similar to cDRL. However, it relies on either special network structures [21][22][23] or a special loss computed from two consecutively sampled batches [24], which significantly increases learning difficulty and inevitably demands longer training. Besides, sufficient meta-adaptation is needed before performance evaluations on reward functions can be executed. In contrast, as a highly specialized approach for hindsight reward tweaking, cDRL is efficient in both training and evaluation.
Analogies have been drawn between deep reinforcement learning with actor-critic structures and generative adversarial nets (GANs) [25][26]. Similarly, cDRL corresponds to conditional GANs (cGANs) [27] in applying the same methodology: featured conditions are added to the input, and the model is trained to be sensitive to the corresponding data distributions, which in cGANs are manipulated via data feeding while in cDRL they are determined by reward parameters in a less direct way. cGANs have been widely reported to sharpen the predictive distributions of both the discriminator and the generator, and thus to significantly improve the visual quality of generated images [27][28]. They are also easier to train than vanilla GANs [29]. These facts lend some support to the effectiveness and feasibility of our approach.
IV Case Study: Hindsight Policy Boosting
IV-A Method Formulation
After a cDRL policy is trained, all learnable parameters are held constant as $w^{*}$. Given a certain fitness function, the input condition becomes the only control interface for further optimization. The search for the optimal policy then reduces to the search for the optimal input condition $c^{*}$, which is given by:
(3) 
where $F(\cdot)$ represents any evaluation process on the conditional policy that returns scalar fitness scores, and $\mathcal{C}_{t}$ stands for the search space for the condition at test time. With sufficient training, a cDRL policy can generalize to a broader region of the condition space, so that the potential optimal condition may lie outside the subspace within which the policy was trained. Since we have little prior knowledge about this search space with respect to agent performance, a natural choice of optimization method is genetic programming [20], which places no assumptions on the problem domain and searches the space globally, although without a guarantee of reaching the global optimum.
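A form of the search problem consistent with this description can be written as below. It assumes that $w^{*}$ denotes the frozen policy parameters, $\pi_{w^{*}}$ the trained conditional policy, $F$ the evaluation process returning a scalar fitness score, and $\mathcal{C}_{t}$ the test-time search space of conditions; the symbol names are assumptions made for illustration.

```latex
c^{*} = \arg\max_{c \in \mathcal{C}_{t}} \; F\big(\pi_{w^{*}}(\cdot \mid \cdot,\, c)\big)
```

Only the condition $c$ varies in this search, so each candidate is scored by rolling out the same frozen network.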
Compared to separate ’trial-train-test’ cycles, cDRL combines the first two phases into a single one-time training process, which makes hindsight reward tweaking and evaluation much more flexible. This is the key advantage of cDRL over former ORP solutions on large-scale complex tasks, which can take a tremendous amount of training time [19].
IV-B Experimental Configurations
We choose MuJoCo [30], integrated in OpenAI Gym [31], to test the proposed cDRL approach. MuJoCo provides excellent physics simulation and is widely adopted for benchmarking high-dimensional continuous control tasks. Our aim is to verify whether a trained cDRL policy can be tweaked by the input condition, and whether a hindsight-boosted cDRL policy outperforms policies trained by standard deep RL algorithms with default reward parameters, given an equal amount of training. Specifically, we choose three locomotion tasks: HalfCheetah, Walker2d, and Ant, all of which aim to maximize the agent’s forward velocity without falling to the ground (if applicable). We apply the conditional versions of both A3C and DDPG to these domains and use, as baselines, deep policies trained by the corresponding standard frameworks with the default reward parameters in the Gym environment settings, which are already well optimized.
We follow the original indicator features defined in Gym, such as forward reward, healthy reward, control cost, and contact cost, so that a vector of ones acts as the baseline reward parameters. When applying cDRL algorithms, we choose forward reward as the anchor with constant weight 1.0 and let the other reward parameters vary within $[1-\delta, 1+\delta]$, where $\delta$ is a small positive value. We set $\delta = 0.2$ for HalfCheetah and Walker2d and $\delta = 0.05$ for Ant. Note that this is not necessarily the case for tasks whose indicator features are not well scaled; substantially heterogeneous ranges might need to be specified for such tasks. During training, baseline and cDRL models use almost identical configurations and hyperparameters except for three main differences: a) cDRL uses a slightly modified network architecture to admit the extra input condition; b) cDRL independently samples reward parameters from the predefined ranges and refreshes them periodically for every environment, while the baseline uses default reward parameters for all environments; c) baseline models are trained for $3 \times 10^{4}$ more agent steps than cDRL models to compensate for the latter’s extra exposure to environments during hindsight optimization.
A3C: The Actor and Critic have identical but separate fully-connected network structures with 3 hidden layers of 256 units and tanh nonlinearity. We use the PPO loss [18] to compute a stabilized policy gradient with a clip range of 0.2, and Adam [32] to update network parameters with a learning rate of $3 \times 10^{-4}$. 50 environments are run in parallel with an episode length of 2048; a full batch from all 50 environments is divided into 4 mini-batches and used for 16 epochs per update. For each experiment, a total of 810 updates are performed, with samples of about $8 \times 10^{7}$ agent steps. We use 0.99 and 0.95 for the discounting factor and the truncation factor of generalized advantage estimation (GAE) [33], respectively. The entropy coefficient is set to 0.0, the default for MuJoCo tasks in Gym. For the conditional version of A3C, we resample the reward parameters for all environments every 10 updates.
DDPG: The Actor and Critic have identical but separate fully-connected network structures with 2 hidden layers of 64 units and ReLU nonlinearity, after which layer normalization [34] is applied. Adaptive parameter noise is used for exploration on HalfCheetah and Walker2d, while Ornstein-Uhlenbeck noise [17] is used for Ant. 50 environments are run in parallel to sample transitions, which are pushed into a replay buffer of size $1 \times 10^{6}$. A batch of 128 transitions is resampled from the buffer for gradient calculation per update. Adam is used to update network parameters with learning rates of $1 \times 10^{-4}$ and $1 \times 10^{-3}$ for the Actor and Critic, respectively. For each experiment, 50 updates are performed after every 100 agent steps, which is repeated $2 \times 10^{4}$ times. We use a discounting factor of 0.99 and a soft update coefficient of 0.001 for target networks. For the conditional version of DDPG, we resample the reward parameters every 2048 agent steps.
For hindsight optimization, we choose as the fitness function the average travel distance along the forward direction over 1000 consecutive steps, across 50 random seeds. Outliers caused by occasional falls are excluded for stability, while runs with more than 10 falls return 0. Genetic programming with real-valued encoding is applied to a predefined condition subspace in which each dimension varies within $[-2, 2]$ for conditional A3C policies and $[-10, 10]$ for conditional DDPG policies. A population size of 50 is used for every generation. We use tournament selection with elite preservation, and single-point crossover with a probability of 0.8 for recombination, after which a mutation may occur with probability 0.1. All optimizations evolve for 30 generations, and the extra environment exposure during this process is compensated for in baseline training as mentioned above.
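The search over conditions can be sketched as a small real-valued genetic search using the settings listed above (population 50, 30 generations, crossover probability 0.8, mutation probability 0.1, tournament selection, elite preservation). The tournament size, the mutation operator, and the toy fitness function are assumptions, since they are not specified here; in practice the fitness would roll out the frozen conditional policy.

```python
import numpy as np

def evolve_condition(fitness, dim, bound, pop_size=50, generations=30,
                     p_cross=0.8, p_mut=0.1, tournament=3, seed=0):
    """Search [-bound, bound]^dim for the condition with the highest fitness."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-bound, bound, size=(pop_size, dim))
    scores = np.array([fitness(ind) for ind in pop])
    for _ in range(generations):
        children = [pop[np.argmax(scores)].copy()]      # elite preservation
        while len(children) < pop_size:
            parents = []
            for _ in range(2):                          # tournament selection
                idx = rng.choice(pop_size, size=tournament, replace=False)
                parents.append(pop[idx[np.argmax(scores[idx])]].copy())
            a, b = parents
            if dim > 1 and rng.random() < p_cross:      # single-point crossover
                cut = rng.integers(1, dim)
                a[cut:], b[cut:] = b[cut:].copy(), a[cut:].copy()
            for child in (a, b):
                if rng.random() < p_mut:                # resample one coordinate
                    child[rng.integers(dim)] = rng.uniform(-bound, bound)
                children.append(child)
        pop = np.array(children[:pop_size])
        scores = np.array([fitness(ind) for ind in pop])
    best = int(np.argmax(scores))
    return pop[best], float(scores[best])

# Toy fitness with a known optimum, standing in for policy evaluation.
optimum = np.array([0.5, -1.0])
toy_fitness = lambda c: -float(np.sum((c - optimum) ** 2))
best_c, best_score = evolve_condition(toy_fitness, dim=2, bound=2.0)
```

With a deterministic fitness, elite preservation means the best score never decreases from one generation to the next; with noisy rollouts this would only hold approximately.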
IV-C Results
The evolution processes are visualized in Fig. 3. We stored the populations of all generations during hindsight optimization of the conditional A3C policy trained on the Walker2d domain, which has 2 non-anchor reward parameters. These 2D parameter coordinates are shown as scatter plots whose colors indicate measured performance and whose densities reveal evolution trends. Clearly, a cDRL policy exhibits distinct characteristics, and thus different performance, as the input condition varies in the condition space, which supports the feasibility of cDRL as expected. From the evolution heatmaps in Fig. 3, we can also gain some intuition about the interaction mechanisms of reward parameters for the Walker2d domain: a bigger healthy reward combined with a smaller control cost tends to result in a better ability to run forward.
To verify the ability of cDRL to boost policy performance, we compared the best individuals of each generation during genetic evolution with baseline models. As shown in Fig. 4, hindsight-optimized cDRL policies consistently yield longer travel distances than baseline policies in all three domains, which indicates that the long-period influences of reward parameters can not only be modeled by cDRL but also be further utilized to search for better policies. Hence, one can also use the cDRL approach to achieve better performance on real-world RL tasks where hand-designed raw reward parameters work but sophisticated interaction mechanisms exist among them.
For a better understanding of cDRL, we performed extra qualitative experiments on its unique hyperparameters, i.e. the refresh period and the exploration range of reward parameters. We found that a refresh period that is too small or an exploration range that is too big leads to a significant performance decrease. The former is caused by an unbalanced sample structure that overemphasizes horizontal diversity in the reward parameter space at the expense of vertical data sufficiency in the conditional example spaces; the latter results from a loss of focus on the core near-optimal region. In practice, the selection of these two hyperparameters depends on specific task properties. As a rule of thumb, one should use as big a batch of sampling environments as possible and avoid very small refresh periods. For unfamiliar task domains, one should start with small exploration ranges for each reward parameter, given a group of rewards that already works.
V Conclusion and Future Work
We observe the fundamental role of reward design in RL, draw on the idea of ’conditional deep learning’, and propose a new paradigm for deep RL called cDRL, which models the influences of reward functions while doing its original job. This approach is scalable to modern complex RL tasks. We verify the feasibility of cDRL with several experiments on MuJoCo tasks and demonstrate one potential application in hindsight performance boosting of trained policies. Our approach tries to bridge the gap between reward changes and their actual effects by avoiding repeated training and enabling hindsight reward tweaking with more convenient feedback. Importantly, cDRL does not require substantial modifications to the learning processes of standard deep RL frameworks.
Essentially, our trained conditional policies provide a manipulation interface stemming from the learned long-period functioning mechanism of several key factors (reward parameters in this paper) in the form of internal neural weights. Sensitivity takes no less credit than modeling accuracy for the effectiveness of cDRL. Given the success of our approach on reward parameters, it is promising to extend this paradigm to other hyperparameters with fundamental but delayed influences, such as the discounting factor and the truncation factor of GAE, as long as they can be individually configured in separate sampling pipelines. We leave this research for future work.
References
 [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge, MA: MIT Press, 1998.
 [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Human-level control through deep reinforcement learning. Nature 518(7540): pp.529-533. 2015.
 [3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): pp.484-489. 2016.
 [4] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp.4299-4307. 2017.
 [5] D. Bahdanau, F. Hill, J. Leike, E. Hughes, A. Hosseini, P. Kohli, and E. Grefenstette, Learning to understand goal specifications by modelling reward. arXiv:1806.01946v2 [cs.AI]. Ithaca, NY: Cornell University Library. 2018.
 [6] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, Diversity is all you need: Learning skills without a reward function. arXiv:1802.06070 [cs.AI]. Ithaca, NY: Cornell University Library. 2018.
 [7] S. P. Singh, R. L. Lewis, and A. G. Barto, Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, pp.2601-2606. 2009.
 [8] S. P. Singh, R. L. Lewis, A. G. Barto, and J. Sorg, Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development 2(2): pp.70-82. 2010.
 [9] J. Sorg, S. P. Singh, and R. L. Lewis, Internal rewards mitigate agent boundedness. In Proceedings of the International Conference on Machine Learning, pp.1007-1014. 2010.
 [10] J. Sorg, R. L. Lewis, and S. P. Singh, Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, pp.2190-2198. 2010.
 [11] S. Niekum, A. G. Barto, and L. Spector, Genetic programming for reward function search. IEEE Transactions on Autonomous Mental Development 2(2): pp.83-90. 2010.
 [12] L. Spector, D. M. Clark, I. Lindsay, B. Barr, and J. Klein, Genetic programming for finite algebras. In Proceedings of the Genetic and Evolutionary Computation Conference, pp.1291-1298. 2008.
 [13] C. Mericli, T. Mericli, and H. L. Akin, A reward function generation method using genetic algorithms: A robot soccer case study. In Proceedings of the Adaptive Agents and Multi Agents Systems, pp.1513-1514. 2010.
 [14] J. Bratman, S. P. Singh, J. Sorg, and R. L. Lewis, Strong mitigation: Nesting search for good policies within search for good reward. In Proceedings of the Adaptive Agents and Multi Agents Systems, pp.407-414. 2012.
 [15] D. Hadfield-Menell, S. Milli, P. Abbeel, S. Russell, and A. D. Dragan, Inverse reward design. In Advances in Neural Information Processing Systems, pp.6765-6774. 2017.
 [16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp.1928-1937. 2016.
 [17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. P. Wierstra, Continuous control with deep reinforcement learning. arXiv:1509.02971 [cs.LG]. Ithaca, NY: Cornell University Library. 2015.
 [18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms. arXiv:1707.06347 [cs.LG]. Ithaca, NY: Cornell University Library. 2017.
 [19] OpenAI. Openai five. https://openai.com/blog/openaifive/. 2019.
 [20] J. R. Koza, Genetic programming: on the programming of computers by means of natural selection. Cambridge, MA: MIT Press. 1992.
 [21] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. M. Botvinick, Learning to reinforcement learn. arXiv:1611.05763 [cs.LG]. Ithaca, NY: Cornell University Library. 2016.
 [22] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, RL2: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779v2 [cs.AI]. Ithaca, NY: Cornell University Library. 2016.
 [23] G. Wayne, C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. W. Rae, P. Mirowski, J. Z. Leibo, A. Santoro, M. Gemici, M. Reynolds, T. Harley, J. Abramson, S. Mohamed, D. J. Rezende, D. Saxton, A. Cain, C. Hillier, D. Silver, K. Kavukcuoglu, M. Botvinick, D. Hassabis, and T. P. Lillicrap, Unsupervised predictive memory in a goal-directed agent. arXiv:1803.10760 [cs.LG]. Ithaca, NY: Cornell University Library. 2018.
 [24] C. Finn, P. Abbeel, and S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, pp.1126-1135. 2017.
 [25] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets. In Advances in Neural Information Processing Systems, pp.2672-2680. 2014.
 [26] D. Pfau and O. Vinyals, Connecting generative adversarial networks and actor-critic methods. arXiv:1610.01945 [cs.LG]. Ithaca, NY: Cornell University Library. 2016.
 [27] M. Mirza and S. Osindero, Conditional generative adversarial nets. arXiv:1411.1784v1 [cs.LG]. Ithaca, NY: Cornell University Library. 2014.
 [28] K. Sricharan, R. Bala, M. Shreve, H. Ding, K. Saketh, and J. Sun, Semi-supervised conditional GANs. arXiv:1708.05789 [stat.ML]. Ithaca, NY: Cornell University Library. 2017.
 [29] Soumith. How to train a gan? tips and tricks to make gans work. https://github.com/soumith/ganhacks. 2016.
 [30] E. Todorov, T. Erez, and Y. Tassa, MuJoCo: A physics engine for model-based control. In Proceedings of the Intelligent Robots and Systems, pp.5026-5033. 2012.
 [31] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, Openai gym. arXiv:1606.01540v1 [cs.LG]. Ithaca, NY: Cornell University Library. 2016.
 [32] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization. arXiv:1412.6980 [cs.LG]. Ithaca, NY: Cornell University Library. 2014.
 [33] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438 [cs.LG]. Ithaca, NY: Cornell University Library. 2015.
 [34] J. L. Ba, J. R. Kiros, and G. E. Hinton, Layer normalization. arXiv:1607.06450v1 [stat.ML]. Ithaca, NY: Cornell University Library. 2016.