Introduction: GAN vs. DRL (actor-critic)
Currently, most problems in artificial intelligence (AI) are formulated as an end-to-end (deep) artificial neural network optimization problem, also known as deep neural network learning or, in short, deep learning (DL)
[1]. These problems are mainly classified into offline learning and online learning. Offline learning refers to learning from stored data or datasets, nowadays known as the big data problem. In offline learning, the stored (big or small) dataset might have no labels (unsupervised), be partially labeled (semi-supervised), or be fully labeled (fully-supervised). Online learning refers to learning from devices or machines such as robots (simulated or real) directly in real time to achieve a specific goal or accomplish a specific task. Online learning may also be referred to as reinforcement learning (RL), reward-based learning, robot learning, or robust control learning. RL is sometimes referred to as deep RL (DRL) due to the application of DL to RL problems. Originally, DL
[1] was proposed as a powerful solution to offline learning, more specifically supervised learning (SL) on big data such as ImageNet classification
[2] and TIMIT speech recognition [3]. Currently, generative adversarial nets (GAN) [4] and deep reinforcement learning (DRL) [5], specifically actor-critic (AC) algorithms, are at the forefront of the DL/AI community: (A) GAN for offline learning, i.e., (un-, semi-, fully-)supervised learning (SL) [6, 7]; and (B) AC for online learning, i.e., (deep) RL [8, 9]. DL formulates an optimization problem with a user-defined objective (loss or cost) function and lacks a single unifying objective function [1, 10]. GAN and AC both consist of two models (networks): one approximates the main function for prediction (the actor in AC, the generator in GAN) and the other approximates the objective (loss or cost) function (the discriminator in GAN, the critic in AC). This has revolutionized the assumptions behind DL algorithms in terms of having a predefined, fixed objective function, although applying gradient descent (and ascent) to these networks (GAN and AC) often leads to mode collapse, unstable training, and in some cases no convergence. AC methods [11, 12] and GAN [4] are two main classes of multilevel (in fact bilevel) optimization problems with a close connection [10]. Since both of these hybrid or multilevel optimization models suffer from stability issues, mode collapse, and convergence difficulties, techniques [10] for stabilizing training have been developed largely independently by the two communities. In this work, we mainly focus on the most stable member of each class: Wasserstein GAN (WGAN) [13] and deep deterministic policy gradients (DDPG) [8]. The main contributions of this work are to point out: 1. the strong connection/similarity between WGAN and DDPG as the two most stable classes of GAN and AC (DRL) models; 2. given the first contribution, that DDPG demonstrates adversarial learning behavior very similar to WGAN; 3.
given the first contribution, how this similarity can lead to a model-based DDPG (model-based actor-critic). Eventually, we want to conclude that adversarial intelligence (the product of an adversarial learning process/behavior) might be a general-purpose AI (AGI) technique, since it is applicable to both offline learning through GAN for (un-, semi-, fully-)SL and online learning through AC (DRL) for RL and robot learning.
GAN
Generative adversarial networks (GAN)[4]
formulates the unsupervised learning problem as a game between two opponents: a generator G, which generates sample images from random noise drawn from a fixed probability distribution/density function (PDF) such as the normal distribution; and a discriminator D, which classifies sample images as real/true or fake/false (it maps an image to a binary classification probability). The original or vanilla GAN
[4] was formulated as a zero-sum game with the cross-entropy loss between the discriminator's prediction and the true identity of the image as real or generated/fake. To make sure that the generator has gradients from which to learn even when the discriminator's classification accuracy is high, the generator loss is usually formulated as maximizing the probability of a generated sample being classified as true, rather than minimizing its probability of being classified as false (fake).
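As a minimal numerical sketch (in NumPy, with a hypothetical discriminator output value, not taken from [4]), the two generator losses behave very differently when the discriminator confidently rejects a fake:

```python
import numpy as np

def minimax_generator_loss(d_fake):
    # Original zero-sum formulation: minimize log(1 - D(G(z))).
    return np.log(1.0 - d_fake)

def non_saturating_generator_loss(d_fake):
    # Practical formulation: maximize log D(G(z)), i.e. minimize -log D(G(z)).
    return -np.log(d_fake)

# When the discriminator confidently rejects a fake (D(G(z)) near 0),
# the minimax loss is nearly flat while the non-saturating loss is steep.
d_fake = 1e-3  # hypothetical discriminator probability that the fake is real
eps = 1e-4
grad_minimax = (minimax_generator_loss(d_fake + eps) - minimax_generator_loss(d_fake)) / eps
grad_nonsat = (non_saturating_generator_loss(d_fake + eps) - non_saturating_generator_loss(d_fake)) / eps
print(abs(grad_nonsat) > abs(grad_minimax))  # the non-saturating loss has a far larger gradient
```

This is exactly the vanishing-gradient argument above: the slope of log(1 - D(G(z))) near D(G(z)) = 0 is roughly -1, while the slope of -log D(G(z)) grows without bound.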
GAN for offline learning or (un-, semi-, fully-)supervised learning
GAN was initially applied to unsupervised learning (no labels available for the dataset) for image generation, data distribution modelling, and discovering the data structure [4]
. Later on, it was applied to (semi-)supervised learning (SL) problems
[6, 7], where the data is either partially labeled (semi-supervised) or fully labeled (fully-supervised) [14], for tasks such as image classification, object detection, and object recognition. The first application of GAN to the (semi-)SL problem [6] reported 64% accuracy on the SVHN (street-view house numbers, http://ufldl.stanford.edu/thestreetviewhousenumbers/) dataset with 1% labeled data. In another original work [7] by Tim Salimans at OpenAI, GAN reached over 94% accuracy on the same SVHN dataset using only 1,000 labeled examples out of almost 100,000, i.e., with only about 1% of the dataset labeled. Though GAN was originally proposed as a generative model for unsupervised learning [4], it has proven useful for semi-supervised learning [7] and fully supervised learning [14]. Therefore, GAN is a potentially capable approach for offline learning on big data, such as big data analytics/mining and visualization.
GAN has no proof of convergence.
Unfortunately, there is no proof of convergence for GAN: no one has yet shown that GAN [4] will converge to an equilibrium. On the contrary, a great deal of practical experience has shown that GAN is unstable and can suffer mode collapse, which technically means no convergence (equilibrium) at all. One ray of hope is the idea of using the Earth Mover's metric (or Wasserstein metric) [13]. The MIXGAN [15] architecture, with multiple generators and discriminators, can also reach an approximate equilibrium, but under fairly tight conditions that are usually impractical. Generative multi-adversarial networks [16] show that multiple discriminators do indeed improve empirical performance.
Wasserstein GAN: stable GAN with Wasserstein loss
WGAN [13] uses the Wasserstein loss (also known as the Earth Mover's distance) as an alternative to the traditional/original GAN loss, the cross-entropy loss [4]. The Wasserstein loss improves the stability of GAN training considerably by almost eliminating the important problem of mode collapse. Later on, we show that the Wasserstein loss has a deep connection and similarity to the Bellman loss used in DDPG [8], the most stable AC method in DRL.
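A minimal sketch (NumPy, with hypothetical critic scores) of the Wasserstein critic loss next to the original cross-entropy discriminator loss; note that the WGAN critic outputs unbounded real-valued scores, not probabilities:

```python
import numpy as np

def wgan_critic_loss(d_real, d_fake):
    # WGAN critic maximizes E[D(real)] - E[D(fake)]; written as a loss to minimize:
    return -(np.mean(d_real) - np.mean(d_fake))

def gan_discriminator_loss(p_real, p_fake):
    # Original GAN discriminator: binary cross-entropy on real vs. fake probabilities.
    return -(np.mean(np.log(p_real)) + np.mean(np.log(1.0 - p_fake)))

# Hypothetical critic scores: unbounded reals (no sigmoid), so the Wasserstein
# loss stays informative even when real and fake samples are easily separated.
d_real = np.array([2.0, 3.0])   # scores on real samples
d_fake = np.array([-1.0, 0.0])  # scores on fake samples
loss = wgan_critic_loss(d_real, d_fake)
print(loss)  # -(2.5 - (-0.5)) = -3.0
```

In the full WGAN algorithm the critic must additionally be kept (approximately) 1-Lipschitz, e.g. by weight clipping as in [13]; that constraint is omitted in this sketch.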
DRL (actor-critic)
Actor-critic (AC) methods are a long-established class of techniques in reinforcement learning (RL) [11, 12]. While most RL algorithms either focus on learning a value function, such as value iteration and TD-learning (value-based methods), or on learning a policy directly, such as policy gradient methods (policy-based methods), AC methods learn both the actor (policy) and the critic (value) simultaneously. In some AC methods, the actor model (policy function) is updated with respect to the approximate critic model (value function), in which case learning behavior and architecture similar to those in GAN can result. Formally, consider the typical Markov decision process (MDP) setting for RL, with a set of states S, a set of actions A, and a discount factor γ ∈ [0, 1]. The aim of AC methods is to simultaneously learn an action-value function that predicts the total expected discounted future reward and a policy that is optimal for that value function. There are many AC methods that attempt to solve this problem; the distinction between them lies mainly in how training proceeds. Traditional AC methods optimize the policy through policy gradients and scale the policy gradient by the Bellman loss, also known as the temporal-difference (TD) loss or TD error, while the action-value function is updated by ordinary TD learning
[17]. In this work, we mainly focus on deep deterministic policy gradients (DDPG) [8], which is intended for the case where actions and observations are continuous, and uses deep learning (DL) for function approximation of both the action-value function (critic) and the policy (actor). DDPG is an established DRL (AC) approach for continuous actions which updates the policy (actor) by backpropagating the gradients of the critic's estimated value with respect to the actions, rather than backpropagating the TD error directly.
Actor-critic for online learning and (deep) reinforcement learning
Deep Q-learning (DQL), or the deep Q-network [5], was one of the breakthrough works in RL and the beginning of deep RL (DRL). Actor-critic (AC) algorithms [18] are among the most powerful RL/DRL algorithms and are composed of two networks: an actor and a critic. In AC methods, a model learns an action-value function that predicts the expected total future reward (the critic), and a policy that is optimal for maximizing that value (the actor or controller). DDPG [8], as a stable DL-based AC method, is in fact the extension of the DQL approach to continuous action spaces, since DQL was applied to Atari games with discrete action spaces.
Actor-critic methods have a proof of convergence to an optimal policy.
The original AC method [11] can be viewed as the formal beginning of work in computational reinforcement learning. The critic in AC is like the discriminator in GAN, and the actor is like the generator: in both systems, a game is being played between the actor (generator) and the critic (discriminator). The actor explores the state space, and the critic learns to evaluate the actor's random exploratory behavior. It was formally proved by Konda [19, 20, 21] that in any Markov decision process (MDP), AC methods will eventually converge to the optimal policy. The proof was nontrivial and appeared a decade after the AC proposal [11]. Recently (35 years after AC methods were proposed), Google DeepMind combined the same AC algorithm with DL (DDPG [8]) to solve difficult continuous control tasks from raw pixel input.
Deep Deterministic Policy Gradients: stable AC (DRL) method with Bellman/TD loss
Deep deterministic policy gradients (DDPG) [8], a stable AC approach in terms of convergence, is the application of DL to AC methods. DDPG uses the Bellman loss [5], or temporal-difference (TD) loss [12], as its loss function. TD-loss learning is also known as Q-learning [12], which, after the introduction of DL [1], gave birth to deep Q-learning (DQL) or the deep Q-network [5], the beginning of deep RL (DRL). DDPG [8], as a stable DL-based AC method, is in fact the extension of the DQL approach to continuous action spaces, since DQL was applied to Atari games with discrete action spaces.
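The TD/Bellman update behind this family of methods can be sketched in a few lines of tabular Q-learning (NumPy, with hypothetical toy values; DDPG itself replaces the table with neural networks):

```python
import numpy as np

def td_error(q, s, a, r, s_next, gamma=0.99):
    # Bellman/TD error: target (r + gamma * max_a' Q(s', a')) minus current Q(s, a).
    target = r + gamma * np.max(q[s_next])
    return target - q[s, a]

def q_learning_step(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Move Q(s, a) a small step (learning rate alpha) toward the TD target.
    q[s, a] += alpha * td_error(q, s, a, r, s_next, gamma)
    return q

# Tiny 2-state, 2-action example with a single hypothetical transition.
q = np.zeros((2, 2))
q = q_learning_step(q, s=0, a=1, r=1.0, s_next=1)
print(q[0, 1])  # 0.1 * (1.0 + 0.99 * 0 - 0) = 0.1
```

Minimizing the squared TD error is exactly the "Bellman loss" referred to throughout this work.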
Connection between GAN and AC methods
Both GAN and AC methods can be seen as bilevel or two-timescale optimization problems, meaning one model is optimized with respect to the optimum of another model. Bilevel optimization problems have been extensively studied for AC methods by Konda [19, 20, 21], mainly under the assumption that both optimization problems are linear or convex [22]. On the contrary, both DDPG [8] and WGAN [13], which are the center of attention in this work and the most stable kinds of AC method and GAN, are nonlinear and have non-convex optimization surfaces due to DL, which is a nonlinear function approximator. The questions, though, are: (A) are GAN and AC by nature the same algorithm or not? and (B) can one emerge from the other, and if yes, which emerges from which? Pfau and Vinyals [10] show that GAN can be viewed as an AC method in an environment where the actor cannot affect the reward. Therefore, they answer the above questions as follows: (A) yes, they are connected; (B) yes, GAN can emerge from AC methods. They review a number of extensions to GAN and AC algorithms with even more complicated information flow, and they encourage both the GAN and AC (DRL) communities to develop general, scalable, and stable algorithms for multilevel optimization with DL, and to draw inspiration across communities. Pfau and Vinyals [10] confirmed that AC and GAN are siblings, and therefore there must be a parent algorithm/architecture that includes both methods; they encourage further investigation of the deeper connection (and possibly convergence and unification) between GAN and AC methods, toward the development and adoption of a more general-purpose AI technique (AGI) applicable as either GAN or AC (DRL). Pfau and Vinyals [10] describe the connection between GAN and AC as an MDP in which GAN is a modified AC method, as follows: consider an MDP where the actions set every pixel in an image. The environment randomly chooses either to show the image generated by the actor or to show a real image.
Let the reward from the environment be 1 if the environment chose the real image and 0 if not. This MDP is stateless, as the image generated by the actor does not affect future data. AC learning in this environment resembles the GAN game, though a few adjustments have to be made to make it identical. If the actor had access to the state, it could trivially pass a real image forward; therefore the actor must be a blind actor, with no knowledge of the state. A stateless MDP does not prevent the actor from learning, though. The mean-squared Bellman/TD loss is usually used as the loss function for AC (specifically DDPG), while the mean cross-entropy adversarial loss is used for GAN (the so-called GAN loss). The actor's parameters in AC should not be updated if the environment shows a real image: the critic zeroes its gradients for the actor (no update of the actor's parameters/weights) whenever the sample shown is real, and passes its gradients through to the actor only for images the actor itself generated. This is how GAN can be seen as a modified AC method with a blind actor in a stateless MDP.
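The gradient gating described above can be sketched as follows (a hypothetical helper for illustration, not code from [10]): the actor's gradient from the critic is simply zeroed whenever the environment shows a real image, exactly as the generator in a GAN never receives gradients from real data:

```python
import numpy as np

def actor_update_mask(is_real):
    # The blind actor is updated only on samples it actually generated;
    # gradients from real images are zeroed out.
    return 0.0 if is_real else 1.0

def gated_actor_gradient(critic_grad_wrt_action, is_real):
    # Critic's gradient w.r.t. the generated image (the "action"),
    # zeroed whenever the environment chose to show a real image.
    return actor_update_mask(is_real) * critic_grad_wrt_action

grad = np.array([0.5, -0.2])  # hypothetical critic gradient w.r.t. the fake image
print((gated_actor_gradient(grad, is_real=True) == 0).all())   # True: no actor update on real images
print((gated_actor_gradient(grad, is_real=False) == grad).all())  # True: actor updated on its own samples
```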
Is actor-critic adversarial?
According to [10], it is not obvious why an AC algorithm should lead to adversarial behavior: typically the actor and critic optimize complementary loss functions (compatible or orthogonal), rather than optimizing the same loss function in different directions (adversarial). The adversarial behavior in AC emerges from the stateless MDP in which the GAN game is played, since the actor cannot have any causal effect on the reward in this stateless MDP. A critic, however, cannot learn the causal structure of the game from input examples alone, and moves in the direction of features that predict reward more accurately (minimizing the reward-prediction error/loss). The actor moves in a direction that increases reward (maximizing the reward value) based on the critic's best estimate, but this change cannot increase the true reward, so the critic quickly learns to assign lower value in the direction the actor has moved. Thus the updates to the actor and critic, which ideally would be orthogonal (as with compatible actor-critic or complementary loss functions), instead become adversarial. Despite these differences between GAN and typical RL settings, we believe there are enough similarities to generalize between GAN and AC (DRL) algorithms. This generalization might be a path toward general-purpose AI or artificial general intelligence (AGI).
Target networks are the only main difference between AC methods (DDPG) and GAN.
Since the action-value function appears twice in the Bellman equation, stability can be an issue in Q-learning with function approximation. Target networks address this by fixing one of the networks in the TD updates, or by slowing down the updates of this so-called target network. This turns the Q-learning problem from RL into SL and helps Q-learning converge, whereas without it, learning will most likely diverge. Since the GAN game can be seen as a stateless MDP, the second appearance of the action-value function disappears; therefore, target networks are not applicable in the GAN setting [10].
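The slow target-network update used by DDPG can be sketched as Polyak averaging (a minimal NumPy sketch with hypothetical parameter vectors; the rule θ' ← τθ + (1 − τ)θ' follows [8]):

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    # Polyak averaging as in DDPG: theta_target <- tau * theta + (1 - tau) * theta_target.
    # With small tau, the target network trails the online network slowly,
    # which stabilizes the bootstrapped TD target.
    return tau * online_params + (1.0 - tau) * target_params

target = np.array([0.0, 0.0])  # hypothetical target-network parameters
online = np.array([1.0, 2.0])  # hypothetical online-network parameters
target = soft_update(target, online, tau=0.5)  # large tau only to make the effect visible
print(target)  # halfway between the two parameter vectors: [0.5, 1.0]
```

DQN instead uses a hard update (copying the online weights every N steps), which is the limiting case τ = 1 applied infrequently.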
Compatibility of actor-critic
According to [10], one of the unique theoretical developments of AC methods is the notion that the actor and critic can be compatible or complementary in terms of their loss functions. It is not clear whether the notion of compatibility can be naturally extended to the GAN setting; we would generally prefer our GAN to be adversarial rather than compatible. We hope that by pointing out the deep connections between GAN and AC (DRL), we encourage the GAN and DRL (AC) communities to join forces.
Model-based actor-critic: GAN + DRL (actor-critic)
We propose to teach machines to accomplish more complex tasks in one common environment by modelling the real world/environment with its rich textures and complex structural compositions. We address three challenges in particular for training an agent to model the real world. First, how to apply adversarial learning [4] to improve the quality of the generated environment model, since GAN has proven effective in image generation and in modelling data distributions [23]. This is also known as explicit modelling of the environment; actor-critic (AC) methods, specifically deep deterministic policy gradients (DDPG) [8], model the environment implicitly in the critic network. We chose DDPG as our default AC approach due to the continuous action space of the agent. Second, how to build an efficient differentiable neural renderer that can simulate the environment and is transferable to other tasks, so that we do not need to retrain the environment model from scratch; moreover, the more tasks our agent accomplishes, the more accurate our environment model (the generator, i.e., the differentiable neural simulator/renderer) becomes. We train a generative neural network that directly maps the current action and the current state of the environment/world to the next state of the environment. This differentiable renderer (generator) can be combined with AC to yield a model-based AC that can be trained in an end-to-end fashion, which might significantly boost both the modelling quality and the convergence speed in terms of solving the task. NOTE: generator, neural simulator, differentiable renderer, environment modeller, and world renderer all refer to the same network. Third, how to design a reward function is another important challenge, since reinforcement learning (RL, or deep RL, also referred to as DRL), i.e., reward-based learning, is impossible without a reward. Manually designing this reward function for complex problems/tasks is almost impossible.
Specifically, if we want to apply DRL to robotics problems, we must either be able to design an accurate reward function or settle for a simple task for which reward design is fairly easy. The latter is not always possible, and therefore we must also be able to learn the reward function; this has become a headache for many researchers in artificial intelligence (AI) and robotics. In summary, our contributions/solutions for the three aforementioned challenges are threefold: 1. We approach the modelling task with GAN and build a model-based DRL agent, combining GAN with a (model-free) AC method (DDPG), that can model the agent's environment. To this end, we build a differentiable neural renderer (generator) for efficiently modelling the environment. This neural environment renderer (generator) and a discriminator model the world by training the model-based DRL agent in an end-to-end fashion; the discriminator is required to discriminate between real environment samples and fake (generated/rendered) ones produced by the neural environment model. 2. Explicitly modelling the environment with a generator neural net, compared to implicitly modelling it in the critic of an AC method (DDPG [8]
), helps transfer this model to other problems and tasks in the same environment. This transfer learning makes our DRL agent very sample-efficient, meaning it can accomplish the same task in far fewer episodes than (model-free) AC, i.e., the original DDPG.
3. Given the explicit, transferable environment model (generator) and the target image (an image of the goal, i.e., what we want to accomplish at the end), we can learn the reward function with the discriminator, as it becomes more and more powerful at discriminating between the real environment and the explicit environment model. This also solves the problem of hand-engineering the reward function for complex tasks.
Literature
SPIRAL [24] is an adversarially trained DRL agent that learns the structures in images and tries to paint them from scratch, but fails to recover the details of images. SPIRAL++ [25] is an improved version of SPIRAL which uses GAN and an RL agent at the same time, in a reinforced adversarial learning fashion, to paint an image on a canvas from scratch. Both SPIRAL and SPIRAL++ are composed of an actor-critic and a discriminator, but they use a fixed/non-differentiable renderer to generate the painting. StrokeNet [26]
combines a differentiable neural image/painting renderer (a generator) with an actor-critic and a discriminator, using a recurrent neural network (RNN) to train agents to paint, but fails to generalize to color images. The most inspiring and similar work to ours is by Huang et al.
[27], which proposes a model-based DDPG, which is in fact a model-based actor-critic as well. They propose four neural networks in their model-based DDPG agent to learn how to paint on a canvas from scratch and without any external reward. They experimentally compare their model-based DDPG to the original (model-free) DDPG and demonstrate the superiority of the model-based variant.
RL approaches
Superficially, SPIRAL and SPIRAL++ can be seen as model-free RL techniques for stroke-based, non-photorealistic rendering/generation. As in some modern stroke-based rendering techniques, the positions of strokes are determined by a learned system (namely, an RL agent). Unlike traditional methods, however, the objective function that determines the goodness of the output image is learned unsupervised via the adversarial objective (adversarial loss). This adversarial objective allows training without access to ground-truth strokes, enabling the method to be applied to domains where labels are prohibitively expensive to collect, and to discover surprising and unexpected styles. In this line of work, a number of works use constructions similar to SPIRAL and SPIRAL++ to tune the parameters of non-differentiable/fixed simulators. In another line of work, Frans and Cheng (2018) [28], Nakano (2019) [29], Zheng et al. (2019) [26], and Huang et al. (2019) [27] achieve remarkable learning speed by relying on backpropagation through a neural model of the renderer; however, they rely on being able to train an accurate model of the environment in the first instance.
Model-free vs. model-based RL approaches
Recently, several works (Frans and Cheng, 2018 [28]; Nakano, 2019 [29]; Zheng et al., 2019 [26]; and Huang et al., 2019 [27]) proposed to improve the sample efficiency and convergence stability of generative RL agents by replacing the actual non-differentiable/fixed rendering simulator with a differentiable neural one (a generator neural net), trained offline to predict how an action affects the state of the canvas environment. Although promising, there are three scenarios in which model-free approaches like SPIRAL and SPIRAL++ might be a better choice. Firstly, one of the main advantages of neural environments is training agents by directly backpropagating the gradient coming from the objective of interest (either a reconstruction loss, if there is no discriminator, or an adversarial loss in the GAN setting). Unfortunately, this means the agent's action space has to be continuous to use the backpropagated gradient efficiently; if the action space is discrete, learning the neural networks in model-based approaches may be problematic. Secondly, the success of RL agent training largely depends on the quality of the neural model of the environment. The simulator used in SPIRAL and SPIRAL++ is arguably easy to learn, since new strokes interact with the existing drawing in a fairly straightforward manner; environments that are highly stochastic or require handling of object occlusions and lighting might pose a challenge for neural-network-based environment models. Thirdly, while possible in principle, designing a model for a simulator with complex dynamics may be a nontrivial task. The majority of recent works relying on neural renderers assume that the simulator state is fully represented by the appearance of the canvas, and therefore only consider non-recurrent state-transition models; there is no need for such an assumption in the SPIRAL and SPIRAL++ framework.
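The first point can be illustrated with a toy differentiable environment model (a hypothetical linear stand-in for a neural renderer, chosen so the chain rule is easy to verify by hand): the gradient of the loss flows through the model back to a continuous action, which is exactly what a discrete action space breaks:

```python
import numpy as np

def renderer(state, action):
    # Hypothetical differentiable environment model G(s, a):
    # the next state moves halfway toward the action (a stand-in for a neural renderer).
    return state + 0.5 * action

def loss(state, target):
    # Reconstruction loss between the rendered state and the target state.
    return 0.5 * np.sum((state - target) ** 2)

def grad_action(state, action, target):
    # Chain rule through the model: dL/da = dL/ds' * ds'/da = (s' - target) * 0.5.
    s_next = renderer(state, action)
    return (s_next - target) * 0.5

# Continuous action: the gradient tells the agent how to adjust the action.
s, a, t = np.array([0.0]), np.array([1.0]), np.array([2.0])
g = grad_action(s, a, t)
print(g)  # (0.5 - 2.0) * 0.5 = [-0.75]
```

With a discrete action there is no infinitesimal adjustment for this gradient to describe, which is why the model-based works above all assume continuous action spaces.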
Conclusion of the literature
The use of neural simulators in model-based DRL is largely orthogonal to SPIRAL and SPIRAL++ and is likely to improve their performance further. Nonetheless, model-based DRL approaches depend on a given target image, which is not the case for SPIRAL and SPIRAL++.
Model-based actor-critic
We propose the model-based DRL framework shown in Figure 1.
This framework is composed of two streams: a GAN framework for the adversarial learning stream and an AC framework for the reinforcement learning stream.
GAN framework for adversarial learning stream
GAN has been widely used because of its great ability to model data distributions by measuring the distribution distance between the generated (fake) data and the target (real) data; therefore, the adversarial learning stream is mainly composed of a GAN, i.e., a generator and a discriminator, together with an actor, as shown in Figure 2.
Wasserstein GAN (WGAN) [13] is our preference over the original GAN [4], since it is an improved version which uses the Wasserstein-1 distance, also known as the Earth-Mover distance. The objective of the discriminator in WGAN is defined in Equation 1:
\max_{D} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{x \sim P_g}[D(x)]   (1)
where D denotes the discriminator, and P_g and P_r are the fake-sample and real-sample distributions, respectively. The adversarial learning stream is responsible for two important tasks: 1. modelling the environment using the generator G as accurately as possible; the discriminator in this stream helps discriminate between the real environment and the environment model G. Using a generator neural network to model the environment has two advantages. First, it is transferable to other tasks in the same environment, and the more tasks we perform with this environment model, the more accurate it becomes at predicting the next state of the environment given the current state and the current action. Second, it boosts the performance of the agent in terms of successful task completion and sample efficiency, which in simple words means completing the task in far fewer episodes of trial and error. The training samples can also be generated randomly using computer graphics rendering programs, so the generator (environment model G) can be quickly trained with supervised learning on GPUs. The model-based transition dynamics and the reward function are differentiable neural networks. The generator is a deep (end-to-end) neural network consisting of several fully connected and convolutional layers.
2. learning the reward function for the reinforcement learning stream, since reinforcement learning is impossible without the reward function. Using a discriminator neural network to model the reward also has two advantages. First, it is adaptable to other tasks in the same environment, and we do not have to manually engineer/design the reward function. Second, it boosts the performance of the agent in terms of successful task completion by accurately learning the appropriate reward function for that specific task. The discriminator, like the generator, is a deep (end-to-end) neural network consisting of several fully connected and convolutional layers, but its output is a scalar value/score. The conditional GAN training scheme [30] is suggested by [27] as the reward metric, where fake samples paired with the target versus real samples paired with the target are used for reward learning, as shown in Figure 3, taken from [27].
We want to reduce the difference between the current state and the target state as much as possible. To achieve this, we set the difference of discriminator scores from s_t to s_{t+1}, given by Equation 2, as the reward guiding the learning of the actor.
r(s_t, a_t) = D(s_{t+1}, s_{target}) - D(s_t, s_{target})   (2)
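A minimal sketch of this reward with hypothetical conditional-discriminator scores D(s, s_target) (the score values are illustrative only):

```python
def discriminator_reward(d_score_next, d_score_current):
    # Reward is the improvement in the discriminator score from s_t to s_{t+1}:
    # positive when the next state looks closer to the target state.
    return d_score_next - d_score_current

# Hypothetical scores: the next state is judged more target-like than the current one.
r = discriminator_reward(d_score_next=0.75, d_score_current=0.25)
print(r)  # 0.5
```

Because the reward is a difference of scores, only relative progress toward the target matters; a constant offset in the discriminator's output cancels out.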
Actor-critic framework for reinforcement learning stream
The reinforcement learning stream is mainly composed of the actor-critic model and the generator, which contains the model of the environment, as shown in Figure 4.
The optimization of the RL agent using model-based AC differs from that using the original AC. At timestep t, the critic takes only the state as input instead of both s_t and a_t, as shown in Figure 5: the critic predicts the expected reward for the state without needing the current action as input. The new expected reward is a state-value function V trained using the discounted reward: V(s_t) = r(s_t, a_t) + γ V(s_{t+1}). Here r(s_t, a_t) is the reward obtained when performing action a_t in state s_t. The environment function is the generator network G, which models the environment.
Model parameters and variables
We model a task as an MDP with a state space S, an action space A, a reward function r(s_t, a_t), and an environment (transition) function, which can be the real environment E or the environment model G. The details of these components are as follows. The state space S includes all information regarding the state of the environment that can be observed by the agent. The environment function makes the transition from the current state s_t of the environment to the next state s_{t+1}. The action space A of the agent is a set of continuous parameters that control the position and orientation of the agent, or of the part of the agent required for completing the target task. We define the behavior of the agent as a policy function π that maps states to deterministic actions (a_t = π(s_t)). At every timestep t, the agent makes an observation of the environment state and takes an action accordingly; as a result of the taken action, the current state of the environment transitions according to the environment function, which can be the real one E or the modelled one G. Reward: selecting a suitable metric to measure the difference between the current state and the target state is crucial for an RL agent; in fact, reinforcement learning is reward-based learning and is not possible without a stable reward function/metric. The reward is defined as r_t = L_t - L_{t+1}, where r_t is the reward at step t, L_t is the measured loss between the target environment state s_{target} and the current state s_t, and L_{t+1} is the measured loss between the target state and the next state s_{t+1}; L is formulated as the discriminator output score for the corresponding environment state. To reach the target environment state, the agent should model the environment accurately for a precise prediction of the next state (the adversarial learning stream using GAN) and maximize the cumulative reward (the reinforcement learning stream using AC) in one episode.
Why model-based actor-critic?
Since the action spaces in control and robotic tasks are mainly continuous and high-dimensional, discretizing the action space to fit discrete DRL methods such as the deep Q-network (DQN) [5] and policy gradients (PG) [12] is quite burdensome, if possible at all. In contrast, deterministic policy gradients (DPG) [31] use a deterministic policy to resolve the difficulties caused by a high-dimensional continuous action space. If the environment states are also high-dimensional, (original) DDPG [8]
solves this problem too using deep (endtoend) convolutional neural network (DNN) socalled deep learning (DL)
[1, 2]. In the original DDPG, there are two networks: the actor and critic . The actor models a policy that maps a state to action . The critic estimates the expected reward for the agent taking action at state , which is trained using Bellman equation 3 as in Qlearning [32] and the data is sampled from an experience replay buffer:(3) 
Here r(s_t, a_t) is the reward given by the environment when performing action a_t at state s_t. The actor is trained to maximize the critic’s estimated value Q(s_t, π(s_t)). In other words, the actor decides an action/policy for each state; based on the current state and the target state, the critic predicts an expected reward for that action, and the critic is optimized to estimate ever more accurate expected rewards. We cannot train a well-performing RL agent using the original DDPG alone, because it is hard for the agent to model a complex environment composed of real-world images/observations during learning. The World Model [33] is a method to make the agent understand its environment effectively. Similar to [27, 29, 28], we design a neural model of the environment (referred to as a neural renderer in the cited works) so that the agent can model the environment effectively; it can then explore the modelled environment and improve its policy efficiently. The difference between the two algorithms (model-free vs. model-based), visually shown in Figure 5 of [27], is very similar to what we are proposing.
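To make the critic update implied by equation (3) concrete, here is a minimal, self-contained sketch with hypothetical linear stand-ins for the actor and critic; the real DDPG uses deep networks, target networks, mini-batches, and a replay buffer, none of which are shown here:

```python
GAMMA = 0.99   # discount factor
LR = 0.1       # critic learning rate

# Linear stand-ins for the actor pi(s) and the critic Q(s, a). In the original
# DDPG both are deep neural networks; this only sketches the regression implied
# by equation (3).
def actor(s):
    return 0.5 * s                     # a fixed deterministic policy pi(s)

def phi(s, a):
    return [s, a, s * a, 1.0]          # critic features: Q(s, a) = w . phi(s, a)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

w_critic = [0.0, 0.0, 0.0, 0.0]

def critic_update(s, a, r, s_next):
    """One critic step on a sampled transition (s, a, r, s_next):
    regress Q(s, a) toward the Bellman target y = r + gamma * Q(s', pi(s'))."""
    global w_critic
    y = r + GAMMA * dot(w_critic, phi(s_next, actor(s_next)))   # Bellman target
    td_error = dot(w_critic, phi(s, a)) - y
    w_critic = [w - LR * td_error * f for w, f in zip(w_critic, phi(s, a))]
    return td_error

# Repeated updates on one transition drive the TD error toward zero.
errors = [abs(critic_update(s=0.5, a=0.2, r=1.0, s_next=0.4)) for _ in range(300)]
```

Each call recomputes the Bellman target y from the current critic and policy and takes one gradient step on the squared TD error; in DDPG proper, a slowly updated target critic computes y to stabilize exactly this regression.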
Experiments: initial limited results
We implemented and tested the proposed model-based actor-critic in simulated environments such as OpenAI Gym and Unity ML-Agents, each of which simulates a number of independent tasks in its own environments and provides both the sensor input and the reward function, as visualized in figure 6. These two simulators offer tasks in their own unique environments, ranging from classical control problems (e.g. CartPole) and robotic problems (e.g. the Reacher arm) to famous video games (e.g. car racing). These task environments are independent from each other in the sense that knowledge from one is neither required by nor transferable to another. The inputs and outputs of the AI model (agent) and the task environment (env) are shown in figure 6 for a better understanding of the data flow between the agent and the environment and of how the experiments are performed.
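The data flow between agent and environment amounts to the standard agent-environment loop. The sketch below uses a hypothetical Gym-style interface (`reset`/`step` on the env, `act`/`observe` on the agent) with a dummy task standing in for a real simulator; the simplified `step` signature is an assumption for illustration:

```python
class DummyEnv:
    """Stand-in for a Gym / Unity ML-Agents task: 3 steps, +0.1 reward each."""
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        return float(self.t), 0.1, self.t >= 3    # next_state, reward, done

class DummyAgent:
    def act(self, state):
        return 0.0                                # a fixed (no-op) action
    def observe(self, s, a, r, s_next):
        pass                                      # a real agent would store this

def run_episode(env, agent, max_steps=1000):
    """One episode of the agent/env data flow: the env provides the sensor
    input (state) and reward; the agent returns an action at every timestep."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                       # actor output
        next_state, reward, done = env.step(action)     # env transition + reward
        agent.observe(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward

total = run_episode(DummyEnv(), DummyAgent())
```

With the dummy task above, one episode collects three +0.1 rewards, so `total` is 0.3 (up to floating-point rounding).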
Reacher environment simulated by Unity ML-Agents as a robotic arm problem
Reacher is one of the Unity ML-Agents environments for deep reinforcement learning (DRL) research experiments. Its features and specifications are as follows: Reacher is a double-jointed arm which can move to target locations; Goal: the agent must move its hand to the goal location and keep it there; Agent reward function (independent): for each step the agent’s hand is at the goal location, it receives +0.1; State space: a vector of 26 variables corresponding to the positions, rotations, velocities, and angular velocities of the two arm rigid bodies;
Action space: a vector of 4 continuous variables corresponding to the torques applicable to the two joints; Benchmark mean reward required to solve the task: 30. The Reacher environment can be run with one agent or with 20 agents (multi-agent) (figure 7). We implemented the proposed model-based actor-critic for initial experimentation on solving such tasks (figure 6) using the PyTorch library, a Python-based deep learning framework from Facebook. The initial results of applying the model-based actor-critic to the Reacher environment with one agent or with multiple (twenty) agents are shown in figure 8. Based on the solving criterion of Reacher, the proposed model-based actor-critic solved this task with one agent in roughly 500 episodes and with twenty agents (multi-agent) in roughly 175 episodes, as shown in figure 8. We conducted this experiment to confirm that the proposed architecture is implementable and that it works in terms of solving reinforcement learning tasks.

Conclusion & future perspective
Our limited experiments show that combining deep reinforcement learning (DRL) and GAN (in our AGI model) can result in the incremental goal-driven intelligence required to potentially solve a (general-purpose) variety of independent tasks, each in its own separate environment. Our future focus is to investigate:

the connection between model-based DDPG and the brain: are the model-based DDPG architecture and its learning compatible with, and plausible in, the brain?

the application of the autoencoding GAN to (semi-/fully-)supervised learning problems for offline learning from stored data, for big data analytics (mining) & visualization: is it applicable to the whole variety of SL problems?

the application of model-based DDPG to a variety of independent tasks in one and the same environment with a reward (or reward function), such as the DeepMind Control Suite: can it transfer skills from one task to another?

the application of model-based DDPG to (simulated or real) robotic control without a reward signal from the environment: can we learn the reward function instead of manually engineering/designing it in a robotic environment?
References
 [1] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436 (2015).
 [2] Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105 (2012).
 [3] Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine 29 (2012).
 [4] Goodfellow, I. et al. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680 (2014).
 [5] Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529 (2015).
 [6] Kingma, D. P., Mohamed, S., Rezende, D. J. & Welling, M. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, 3581–3589 (2014).
 [7] Salimans, T. et al. Improved techniques for training gans. In Advances in neural information processing systems, 2234–2242 (2016).
 [8] Lillicrap, T. P. et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
 [9] BarthMaron, G. et al. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617 (2018).
 [10] Pfau, D. & Vinyals, O. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945 (2016).
 [11] Barto, A. G., Sutton, R. S. & Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics 834–846 (1983).
 [12] Sutton, R. S., McAllester, D. A., Singh, S. P. & Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, 1057–1063 (2000).

 [13] Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein generative adversarial networks. In International conference on machine learning, 214–223 (2017).
 [14] Isola, P., Zhu, J.-Y., Zhou, T. & Efros, A. A. Image-to-image translation with conditional adversarial networks. arXiv (2016).
 [15] Arora, S., Ge, R., Liang, Y., Ma, T. & Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 224–232 (JMLR.org, 2017).
 [16] Durugkar, I., Gemp, I. & Mahadevan, S. Generative multiadversarial networks. arXiv preprint arXiv:1611.01673 (2016).
 [17] Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning 3, 9–44 (1988).
 [18] Crites, R. H. & Barto, A. G. An actor/critic algorithm that is equivalent to Q-learning. In Advances in Neural Information Processing Systems, 401–408 (1995).
 [19] Konda, V. R. & Tsitsiklis, J. N. Actor-critic algorithms. In Advances in neural information processing systems, 1008–1014 (2000).
 [20] Konda, V. Actor-critic algorithms (Ph.D. thesis). Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology (2002).
 [21] Konda, V. R. & Tsitsiklis, J. N. On actor-critic algorithms. SIAM Journal on Control and Optimization 42, 1143–1166 (2003).
 [22] Colson, B., Marcotte, P. & Savard, G. An overview of bilevel optimization. Annals of operations research 153, 235–256 (2007).

 [23] Ledig, C. et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4681–4690 (2017).
 [24] Ganin, Y., Kulkarni, T., Babuschkin, I., Eslami, S. & Vinyals, O. Synthesizing programs for images using reinforced adversarial learning. arXiv preprint arXiv:1804.01118 (2018).
 [25] Mellor, J. F. et al. Unsupervised doodling and painting with improved SPIRAL. arXiv preprint arXiv:1910.01007 (2019).
 [26] Zheng, N., Jiang, Y. & Huang, D. Strokenet: A neural painting environment. In Proceedings of the IEEE conference on learning and representation (ICLR) (2018).
 [27] Huang, Z., Heng, W. & Zhou, S. Learning to paint with model-based deep reinforcement learning. arXiv preprint arXiv:1903.04411 (2019).
 [28] Frans, K. & Cheng, C.-Y. Unsupervised image to sequence translation with canvas-drawer networks. arXiv preprint arXiv:1809.08340 (2018).
 [29] Nakano, R. Neural painters: A learned differentiable constraint for generating brushstroke paintings. arXiv preprint arXiv:1904.08410 (2019).
 [30] Isola, P., Zhu, J.-Y., Zhou, T. & Efros, A. A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1125–1134 (2017).
 [31] Silver, D. et al. Deterministic policy gradient algorithms. In Proceedings of MLR (2014).
 [32] Watkins, C. J. & Dayan, P. Q-learning. Machine learning 8, 279–292 (1992).
 [33] Ha, D. & Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, 2450–2462 (2018).