Enhanced Experience Replay Generation for Efficient Reinforcement Learning

05/23/2017, by Vincent Huang et al., Ericsson

Applying deep reinforcement learning (RL) on real systems suffers from slow data sampling. We propose an enhanced generative adversarial network (EGAN) to initialize an RL agent in order to achieve faster learning. The EGAN utilizes the relation between states and actions to enhance the quality of data samples generated by a GAN. Pre-training the agent with the EGAN shows a steeper learning curve, with about 20% higher sample efficiency compared to no pre-training and about a 5% improvement compared to pre-training with a plain GAN. For systems with sparse and slow data sampling, the EGAN could be used to speed up the early phases of the training process.


1 Introduction

In 5G telecom systems, network functions need to fulfill new network characteristic requirements, such as ultra-low latency, high robustness, quick response to changed capacity needs, and dynamic allocation of functionality. With the rise of cloud computing and data centers, more and more network functions will be virtualized and moved into the cloud. Self-optimizing and self-caring dynamic systems with fast and efficient scaling, workload optimization, and new functionality such as self-healing, parameter-free, and zero-touch operation will assure SLAs (Service Level Agreements) and reduce TCO (Total Cost of Ownership). Reinforcement learning, where an agent learns how to act optimally given the system state information and a reward function, is a promising technology for solving such optimization problems.

Reinforcement learning is a technology for developing self-learning software agents, which can learn and optimize a policy based on observed states of the environment and a reward system. The agent receives an observation of the environment state $s_t$ and selects an action $a_t$ to maximize the expected future reward $R_t$. Based on the expected future rewards, a value function for each state can be calculated, and an optimal policy that maximizes the long-term value function can be derived. In a model-free environment, the RL agent needs to balance exploitation with exploration. Exploitation is the strategy of selecting actions based on the previously learned policy, while exploration is a strategy of searching for better policies using actions not prescribed by the learned policy. Exploration creates opportunities, but it also induces the risk that choices made during this phase will not generate increased reward.

In real-time service-critical systems, exploration can have an impact on service quality. In addition, sparse and slow data sampling and extended training duration put extra requirements on the training phase. This paper proposes a new approach of pre-training the agent with enhanced-GAN data samples to shorten the training phase, addressing the training limitations of environments with sparse and slow data sampling.

The paper is organized as follows. In Section 2, we give a brief overview of reinforcement learning, generative adversarial networks, and their recent development. In Section 3, we present our proposed approach of a pre-training system with an enhanced GAN. Experimental results are presented in Section 4. Finally, we give concluding remarks and discussions in Section 5.

2 Background

2.1 Reinforcement Learning

Reinforcement learning is, in general, the problem of learning to make decisions by maximizing a numerical reward signal (Sutton and Barto, 1998). A reinforcement learning agent receives an observation $s_t$ from the environment it interacts with, and selects an action $a_t$ so as to maximize the total expected discounted reward $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. The action, drawn from the action space $\mathcal{A}$, is calculated by a policy $\pi(a_t \mid s_t)$. Every time the policy is executed, a scalar reward $r_t$ is returned from the environment, and the agent transitions to the next state $s_{t+1}$, following the state transition probability $P(s_{t+1} \mid s_t, a_t)$.

We can define the state value function $V^{\pi}(s)$ as the expected return at state $s$ when following policy $\pi$, and the action value function $Q^{\pi}(s, a)$ as the expected return when taking action $a$ in state $s$ and following policy $\pi$ thereafter.

The reinforcement learning agent tries to maximize the expected return by maximizing the value function $V^{\pi}(s)$:

$$V^{\pi}(s) = \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s, \pi\right] \quad (1)$$

One approach to maximizing $V^{\pi}(s)$ is policy gradients (PG), in which the policy is parametrized and optimized by calculating the gradients in a supervised-learning fashion, iteratively adjusting the weights by backpropagating the gradients through the neural network.
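As an illustration of such a policy-gradient update (this is not the authors' implementation; the network size, learning rate, and discount factor below are placeholder values for a CartPole-like task):

import torch
import torch.nn as nn

# Small softmax policy for a 4-dimensional state and 2 discrete actions (CartPole-like).
policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

def pg_update(states, actions, rewards):
    """One REINFORCE-style policy-gradient step over a finished episode."""
    # Discounted returns R_t = sum_k gamma^k * r_{t+k}
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction

    logits = policy(torch.as_tensor(states, dtype=torch.float32))
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs[torch.arange(len(actions)), torch.as_tensor(actions)]

    # Maximizing the expected return corresponds to minimizing -sum_t log pi(a_t|s_t) * R_t.
    loss = -(chosen * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()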

Most reinforcement learning work uses simulated environments like OpenAI Gym (Brockman et al., 2016) and can achieve good results by running many episodes (Duan et al., 2016; Mnih et al., 2015). Compared to simulated environments, real environments have different characteristics, and different training strategies need to be applied. The agent has access only to partial, local information, which can be formalized as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) (Oliehoek, 2012). Further, exhaustive exploration strategies are either not possible or too expensive in a real production system, as they might cause service impact. Finally, sparse data, a low data sampling rate, and slow reaction to actions greatly limit the possibility to train an agent in an acceptable time frame (Duan et al., 2016). New, sample-efficient algorithms such as Q-Prop (Gu et al., 2016) have been proposed that provide substantial gains in sample efficiency over trust region policy optimization (TRPO) (Schulman et al., 2015). Methods such as actor-critic algorithms (Mnih et al., 2016), as well as combinations of on-policy and off-policy algorithms (O'Donoghue et al.), have been tested to beat the benchmarks. Other approaches using supervised learning have also been tested (Pinto and Gupta, 2016). Still, increasing sample efficiency to speed up training remains imperative in real production systems that only allow sparse data sampling.

2.2 Generative Adversarial Networks

A second trend in deep learning research has been generative models, especially Generative Adversarial Nets (GAN) (Goodfellow et al., 2014), and their connection to reinforcement learning has been discussed (Finn et al., 2016; Yu et al., 2017). GANs are used to synthesize data samples that can be used for training an RL agent. In our case, these synthesized data samples are used to pre-train a reinforcement learning agent to speed up the training time in the real production system. We compare this method with different pre-training alternatives.

The essence of Generative Adversarial Nets is an adversarial game between a generative model $G$, which learns the true data distribution, and a discriminative model $D$, which estimates the probability that a sample comes from the true distribution rather than having been generated by $G$.

The generator, modeled as a multilayer perceptron, is given inputs $z$ sampled from a noise distribution $p_z(z)$. The network is trained to learn the mapping from $p_z(z)$ to $p_{\text{data}}(x)$, where $p_{\text{data}}$ is the true data distribution. The discriminator $D$, also represented by a multilayer perceptron, is given as input either a generated sample $G(z)$ or a true data point $x$, and learns $D(x)$, the probability of $x$ originating from the true distribution.

By training both models in parallel, we can converge to a solution where $G$ eventually captures the training data distribution and $D$ cannot discriminate between true and generated samples.

3 Enhanced GAN

The objective of a GAN can be viewed as a minimax game: the discriminator tries to maximize a value function, while the generator tries to minimize it,

$$\min_G \max_D V(D, G), \quad (2)$$

where the value function can be expressed as:

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \quad (3)$$
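As a sketch of how this value function translates into alternating updates (a generic PyTorch example, not the authors' code; the network sizes, learning rates, and the non-saturating generator loss are assumptions):

import torch
import torch.nn as nn

noise_dim, data_dim = 8, 6   # placeholders: z dimension and experience-sample dimension
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(real_batch):
    """One update of D and G on a batch of real experience samples."""
    n = real_batch.size(0)
    fake = G(torch.randn(n, noise_dim))

    # Discriminator ascends log D(x) + log(1 - D(G(z))), the maximization in Eq. (3).
    d_loss = bce(D(real_batch), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator tries to fool D (non-saturating form of the minimization in Eq. (2)).
    g_loss = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()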

In our case, the training data set consists of the collected state ($x_s$) and {action, reward} ($x_{ar}$) pairs. Thus, each training sample can be split into two parts:

$$x = \{x_s, x_{ar}\}. \quad (4)$$

Correspondingly, each generated sample also consists of two parts:

$$\hat{x} = \{\hat{x}_s, \hat{x}_{ar}\}, \quad (5)$$

where $\hat{x}_s$ and $\hat{x}_{ar}$ are the generated state and {action, reward} parts. Since the new state and reward depend on the current state and the selected action, there are latent relations between $\hat{x}_s$ and $\hat{x}_{ar}$. The mutual information between them can be expressed as two entropy terms:

$$I(X_s; X_{ar}) = H(X_{ar}) - H(X_{ar} \mid X_s), \quad (6)$$

where $X_s$ represents the state part and $X_{ar}$ represents the {action, reward} part. Since the {action, reward} part depends on the state part, $I(X_s; X_{ar})$ cannot be zero. We can utilize this information to generate better quality experience replay data.
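For intuition, the two entropy terms of Eq. (6) can be estimated from an empirical joint distribution of discretized state and {action, reward} values; the paper does not specify an estimator, so the NumPy sketch below is purely illustrative:

import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X_s; X_ar) = H(X_ar) - H(X_ar | X_s) = H(X_s) + H(X_ar) - H(X_s, X_ar).

    joint[i, j] = P(X_s = i, X_ar = j); entries must sum to 1.
    """
    p_s = joint.sum(axis=1)    # marginal over discretized states
    p_ar = joint.sum(axis=0)   # marginal over discretized {action, reward} values
    return entropy(p_s) + entropy(p_ar) - entropy(joint.ravel())

# Toy example: the {action, reward} outcome depends strongly on the state, so I > 0.
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])
print(mutual_information(joint))   # roughly 0.4 bits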

To exploit these relations, we use the Kullback–Leibler divergence from $P_{\hat{x}_{ar}}$ to $P_{x'_{ar}}$, where $P_{\hat{x}_{ar}}$ is the distribution of the generated {action, reward} values and $P_{x'_{ar}}$ is the distribution of the dependent {action, reward} values derived from the generated states $\hat{x}_s$ using the mutual information $I(X_s; X_{ar})$:

$$D_{KL}\!\left(P_{\hat{x}_{ar}} \,\middle\|\, P_{x'_{ar}}\right) = \mathbb{E}_{\hat{x}_{ar} \sim P_{\hat{x}_{ar}}}\!\left[\log \frac{P_{\hat{x}_{ar}}(\hat{x}_{ar})}{P_{x'_{ar}}(\hat{x}_{ar})}\right]. \quad (7)$$

$P_{x'_{ar}}$ can be obtained by training on the real experience replay data. Thus, we can update the value function of Eq. (3) as

$$V'(D, G) = V(D, G) + \lambda\, D_{KL}\!\left(P_{\hat{x}_{ar}} \,\middle\|\, P_{x'_{ar}}\right), \quad (8)$$

where $\lambda$ is a weighting factor. The last term is a regularization term that forces the GAN to follow the relation between the state and the {action, reward} pair. As the GAN improves, $\hat{x}_s$ and $\hat{x}_{ar}$ will follow the relations in the real experience replay data, and the KL divergence will tend to zero. The generator network also aims to minimize this term.
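One way the regularization term of Eq. (8) could enter the generator update is sketched below; the enhancer E that predicts the {action, reward} part from the state part, the [x_s | x_ar] sample layout, and the softmax-KL surrogate for the divergence are assumptions for illustration, not the authors' exact formulation:

import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, ar_dim = 4, 2
# Enhancer: predicts the {action, reward} part from the state part; trained on real experience.
E = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, ar_dim))

def egan_generator_loss(fake, D, lam=0.1):
    """Generator loss in the spirit of Eq. (8): adversarial term + lambda * consistency term.

    `fake` is a generated batch laid out as [x_s | x_ar]; D is the discriminator; lam is lambda.
    """
    x_s, x_ar = fake[:, :state_dim], fake[:, state_dim:]
    adv = F.binary_cross_entropy(D(fake), torch.ones(fake.size(0), 1))

    # Consistency: compare the generated {action, reward} part with the enhancer's
    # prediction from the generated state part (softmax-KL as one possible surrogate).
    log_p_gen = F.log_softmax(x_ar, dim=-1)
    p_enh = F.softmax(E(x_s), dim=-1)
    reg = F.kl_div(log_p_gen, p_enh, reduction='batchmean')
    return adv + lam * reg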

The network architecture is illustrated in Figure 1.

Figure 1: Enhanced GAN structure.

Besides the normal GAN networks, an additional DNN (the enhancer) is added to learn the relations between the state ($x_s$) and {action, reward} ($x_{ar}$) parts. The training procedure is shown in Algorithm 1.

Data: Batch of quadruplets $(s_t, a_t, r_t, s_{t+1})$ drawn from the real experience
Result: Unlimited experience replay samples which can be used for pre-training the reinforcement learning agent.
begin
       initialization;
       /* initialize the weights of the generator and discriminator networks in the GAN, as well as of the enhancer network */
       train GAN;
       /* train the GAN on the real experience data */
       train enhancer;
       /* train the enhancer on the real experience data to find the relations between $x_s$ and $x_{ar}$ */
       for k iterations do
             generate data with GAN;
             /* generate a test experience data set with the GAN */
             improve GAN with enhancer;
             /* use the enhancer to calculate the discrepancy between $\hat{x}_{ar}$ and $x'_{ar}$ and use it to update the GAN */
       end for
end
Algorithm 1: Data generation with EGAN

In practice, we can update the GAN with the regularization term at the same time. However, it is also possible to update the regularization term separately. In a real system, where data collection is slow, more training iterations can be performed while waiting for new experience replay data to arrive. We train the relation between the state ($x_s$) and {action, reward} ($x_{ar}$) parts whenever new experience data comes in. After training the GAN with the normal settings, the network weights can be updated using the relations learned by the enhancer.

Once the GAN has been trained, it is possible to generate unlimited experience replay data to pre-train the agent.
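A minimal sketch of this pre-training step, reusing the pg_update sketch from Section 2.1 and assuming a [x_s | action | reward] layout for the generated samples (the dimensions and batch size are placeholders; 6000 batches matches the setting reported in Section 4):

import torch

def pretrain_with_egan(G, policy_update, n_batches=6000, batch_size=64,
                       noise_dim=8, state_dim=4):
    """Pre-train the RL agent on synthetic experience drawn from the trained generator."""
    for _ in range(n_batches):
        z = torch.randn(batch_size, noise_dim)
        with torch.no_grad():
            synthetic = G(z)                          # assumed layout: [x_s | action | reward]
        states = synthetic[:, :state_dim]
        actions = synthetic[:, state_dim].round().clamp(0, 1).long().tolist()
        rewards = synthetic[:, state_dim + 1].tolist()
        policy_update(states, actions, rewards)       # e.g. the pg_update sketch above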

4 Results

We use the CartPole environment from OpenAI Gym to evaluate the EGAN performance, as shown in Figure 2, with the parameter settings listed in Table 1. The figure shows the training of the PG agent after it has been pre-trained; we therefore observe a small offset of the pre-trained agents on the x-axis of around 10,000 samples (500 episodes), while the agent with no pre-training starts at 0. The black solid line is the 100-episode rolling average reward over the total consumed samples of a PG agent without any pre-training mechanism. The red dashed line and the blue dotted line represent the performance of the PG agent with GAN and EGAN pre-training, respectively. The EGAN uses 500 episodes of real experience, collected with randomly selected actions, to train the GAN and enhancer networks in the pre-training phase, and then generates 6000 batches of synthetic data to update the policy network at the beginning of the training phase. For the agent without pre-training we set the total number of training episodes to 5500, to allow a fair comparison with the EGAN over cumulative samples.

The samples for training the GAN and EGAN were collected using a random policy. Consequently, we expect a low initial performance for both pre-trained systems, but a more accurate value function estimation, and thus a quicker learning curve, since they are already initialized with generated samples. As a result, we can observe in Figure 2 a faster increase of the reward for both agents pre-trained with GAN and EGAN. Both networks can provide more modalities in the data space, and since the EGAN enhances the state-action-reward relation, it can further improve the quality of the synthesized data and the robustness of the system in terms of single standard deviation. We therefore obtain a 20% higher sample efficiency for EGAN pre-training compared to no pre-training, and a 5% improvement compared to pre-training with a GAN without an enhancer. That means fewer cumulative samples are needed to reach a certain mean reward, thus speeding up the training.

Pre-training phase                        Training phase
D and G network size                      Policy network size
Enhancer network size                     PG learning rate
GAN learning rate                         PG discount factor
GAN sample size                           PG update frequency
Pre-training buffer size (episodes)       Training episodes
EGAN pre-training iterations              Synthetic replay buffer size
Table 1: EGAN simulation parameter settings
Figure 2: Comparison of training with and without EGAN pre-training.

In order to test the hypothesis of bootstrapping the online training with DNN, GAN, and policy-network pre-initialization, we trained our system with varying pre-training lengths; the results are shown in Figure 3. The y-axis again represents the 100-episode rolling average reward, while the x-axis shows the online episode number. Figure 3 demonstrates that pre-training the generator networks with 5000 episodes results in a faster learning curve for the policy network.

In a real production system, pre-training could be achieved by saving prior data to pre-initialize the system, helping it converge faster and making training more stable. It is therefore important to point out that in real environments, where samples are expensive to produce, and taking into account the episodes needed for pre-initialization, pre-training the network with 500 episodes rather than 5000 is more efficient and cost-effective.

Figure 3: Comparison of different pre-training lengths.

5 Conclusions

In this work, we tackle a fundamental problem of applying reinforcement learning to a real environment: training normally takes a long time and requires many samples. We first collected a small set of data samples from the environment, following a random policy, in order to train a GAN. The GAN is then used to generate unlimited synthesized data to pre-train an RL agent, so that the agent learns the basic characteristics of the environment. Using a GAN, we can cover larger variations of the randomly sampled data. We further improve the GAN with an enhancer, which utilizes the state-action relations in the experience replay data to improve the quality of the synthesized data.

By using the enhanced structure (EGAN), we achieve 20% faster learning than with no pre-training, 5% faster learning than pre-training with a plain GAN, and a more robust system in terms of standard deviation. However, further work is needed to verify and fine-tune the system for optimal performance.

Our next step is to explore and test this setup together with virtualized network functions in 5G telecom systems, where sample efficiency is crucial, and exploration can directly affect the service quality of the system.

References

  • Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
  • Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • Duan et al. [2016] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Oliehoek [2012] Frans A. Oliehoek. Decentralized POMDPs, pages 471–503. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-27645-3. doi: 10.1007/978-3-642-27645-3_15. URL http://dx.doi.org/10.1007/978-3-642-27645-3_15.
  • Gu et al. [2016] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • [9] Brendan O’Donoghue, Rémi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and q-learning.
  • Pinto and Gupta [2016] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 3406–3413. IEEE, 2016.
  • Goodfellow et al. [2014] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. ArXiv e-prints, June 2014.
  • Finn et al. [2016] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.
  • Yu et al. [2017] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.