I Introduction
Deep reinforcement learning (RL) has recently been applied successfully to a variety of problems, such as playing Atari games with superhuman proficiency [1] and robot control [2]. However, applying RL methods to real robots can be extremely costly, since acquiring thousands of episodes of interaction with the environment often requires a great deal of time and can lead to physical damage. Furthermore, in human-robot interaction (HRI) scenarios, human actions cannot be predicted with certainty, which can significantly impede convergence to a good policy.
One way of alleviating these problems is to have the agent learn a model of the environment, and use this model to generate synthetic data that can be used in conjunction with real data to train the agent. This assumes that the environment dynamics are easier to learn than an optimal policy, which is a generally valid assumption for at least some classes of tasks. Furthermore, if such a model is stochastic in nature, then the uncertainty in state changes can be taken into account, allowing more natural interaction with humans. Much like how people learn, an agent with a model of its environment can generate imaginary scenarios that can be used to help optimize its performance. This approach has garnered much attention in the field recently, and is sometimes referred to as endowing agents with imagination [3, 4, 5].
In this paper we describe an architecture that allows an agent to learn a stochastic model of the environment and use it to accelerate learning in RL problems. In our approach, an agent encodes sensory information into low-dimensional representations and learns a model of its environment online in latent space, while simultaneously learning an optimal policy. The model can be learned much faster than the policy, and can therefore be used to augment transitions collected from the real environment with synthetic transitions, improving the sample efficiency of RL. Our approach requires no prior knowledge of the task; only the encoder needs to be pretrained on task-relevant images, and it can generally be reused across multiple tasks. We test our architecture on a high-level robotic task in which a robot has to interpret a gesture and solve a puzzle based on it (Fig. 1). Results show that incorporating synthetic data leads to a significant speedup in learning, especially when only a small amount of real interaction data is made available to the agent.
II Related Work
There has been much recent interest in the literature in combining model-free and model-based approaches to reinforcement learning. Ha and Schmidhuber [5] built models for various video game environments using a combination of a mixture density network (MDN) and a long short-term memory (LSTM), which they call MDN-RNN. In their approach, they first compress visual data into low-dimensional representations via a variational autoencoder (VAE), and then train the MDN-RNN to predict future state vectors, which are used by the controller as additional information for selecting optimal actions. However, they pretrained the environment models on data collected by a random agent playing video games, whereas in our work a model for an HRI task is learned online.
The use of learned models to create synthetic training data has also been explored. Kalweit et al. [4] used learned models to create imaginary rollouts to be used in conjunction with real rollouts. In their approach, they limit model usage based on an uncertainty estimate of the Q-function approximator, which they obtain with bootstrapping. They were able to achieve significantly faster learning on simulated continuous robot control tasks. However, they relied on well-defined, low-dimensional state representations such as joint states and velocities, as opposed to raw visual data as in our approach.
Racaniere et al. [3] used a learned model to generate multiple imaginary rollouts, which they compress and aggregate to provide context for a controller that they train on classic video games. The advantage of this approach is that the controller can leverage important information contained in subsequences of imagined rollouts, and is more robust to erroneous model predictions.
Model rollouts can be used to improve targets for temporal difference (TD) algorithms as well. Feinberg et al. [6] used a model rollout to compute improved targets over many steps, in what they termed model-based value expansion (MVE). More recently, Buckman et al. [7] proposed an extension to MVE in which they use an ensemble of models to generate multiple rollouts of various lengths, interpolating between them and favoring those with lower uncertainty.
Deep reinforcement learning is increasingly being employed successfully for robots in continuous control and manipulation tasks [2, 8, 9, 10]. However, its application to high-level tasks and to HRI has been very limited. Qureshi et al. [11] used a multimodal DQN to teach a humanoid robot basic social skills, such as successfully shaking hands with strangers. Input to their system consisted of depth and greyscale images. Interaction data were collected using the robot over a period of 14 days, separating the data collection and training phases and alternating between them for practical reasons. In our work, we are primarily interested in increasing sample efficiency so that training requires fewer resources, making RL more practical for robots.
III Background
III-A Reinforcement Learning
In reinforcement learning [12], a task is modelled as a Markov decision process (MDP) in which an agent influences its environment state $s_t$ with an action $a_t$ chosen according to some policy $\pi(a_t|s_t)$. The environment then transitions into a new state $s_{t+1}$ and provides the agent with a reward signal $r_{t+1}$. This process repeats until the environment reaches a terminal state, concluding an episode of interaction. The goal of the agent is to learn an optimal policy $\pi^*$ and use it to maximize the expected return $R_t = \sum_{k=t}^{T} \gamma^{k-t} r_{k+1}$, where $\gamma \in [0,1]$ is the discount factor and $T$ is the timestep at which a terminal state is reached.

There are model-based and model-free algorithms for finding an optimal policy. One model-free method is to learn the action-value function $Q^\pi(s, a)$, which is the expected return from taking action $a$ in state $s$ and following policy $\pi$ thereafter: $Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a]$. The agent's goal thus becomes to learn an optimal Q-function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$. This can be achieved using a recursive relationship known as the Bellman equation:
$$Q^*(s, a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right] \quad (1)$$
Since most nontrivial tasks have very large state or action spaces, the Q-function usually cannot be calculated analytically, and is instead estimated by a function approximator $Q(s, a; \theta)$ with parameters $\theta$. One common approach is deep Q-networks (DQN) [1], in which transitions are stored as tuples of the form $(s_t, a_t, r_{t+1}, s_{t+1})$ and used to train a neural network so that $Q(s, a; \theta) \approx Q^*(s, a)$. A DQN is trained to minimize the loss function:
$$L(\theta) = \mathbb{E}\left[\left( y_t - Q(s_t, a_t; \theta) \right)^2\right] \quad (2)$$
where the target $y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$ is obtained from Equation 1 using the estimate $Q(s, a; \theta)$. Given the gradients of Equation 2 with respect to $\theta$, the network can be trained using some variation of stochastic gradient descent. Actions are chosen according to the $\epsilon$-greedy policy, in which the optimal action is chosen with probability $1 - \epsilon$ and a random action with probability $\epsilon$.

III-B Variational Autoencoders
Variational autoencoders (VAE) [13] are generative models that can be used both to generate synthetic data and to encode existing data into representations in a low-dimensional latent space. Like traditional autoencoders, they consist of an encoding network and a decoding network. The key difference is that they encode a data point $x$ into a probability distribution over a latent variable vector $z$. The goal is then to learn an approximate multivariate Gaussian posterior distribution $q(z|x) = \mathcal{N}(\mu, \Sigma)$, which is assumed to have a unit Gaussian prior $p(z) = \mathcal{N}(0, I)$. This can be done by minimizing the Kullback-Leibler (KL) divergence between $q(z|x)$ and the true posterior $p(z|x)$:

$$D_{KL}\left(q(z|x) \,\|\, p(z|x)\right) = -\mathbb{E}_{z \sim q}\left[\log p(x|z)\right] + D_{KL}\left(q(z|x) \,\|\, p(z)\right) + \log p(x) \quad (3)$$
Here, the first term is the reconstruction loss, while the second term penalizes divergence of the learned posterior from the assumed prior. The expectation can be approximated as $\log p(x|\tilde{z})$ by sampling a vector $\tilde{z} \sim \mathcal{N}(\mu, \Sigma)$ and decoding it with the decoder network. Since the marginal likelihood $\log p(x)$ does not depend on $q$, minimizing Equation 3 is equivalent to minimizing:
$$L(\phi, \psi) = -\log p_\psi(x|\tilde{z}) + \frac{1}{2}\sum_{j=1}^{d}\left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right) \quad (4)$$
where we have parametrized the encoder and decoder networks with $\phi$ and $\psi$ respectively, $d$ is the dimensionality of the latent space, and $\sigma_j^2$ are the diagonal elements of $\Sigma$. The encoder and decoder networks are trained back to back to minimize the loss given by Equation 4. Note that if $p(x|z)$ is Bernoulli, the reconstruction loss is equivalent to the cross-entropy between the actual and the predicted $x$.
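The loss in Equation 4 can be computed in closed form; a minimal sketch for a single sample, assuming a Bernoulli decoder (so the reconstruction term is the binary cross-entropy) and the log-variance parametrization commonly used in practice:

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Per-sample VAE loss from Equation 4: binary cross-entropy
    reconstruction plus the closed-form KL term for a diagonal Gaussian
    posterior N(mu, diag(exp(log_var))) against a unit Gaussian prior."""
    eps = 1e-7  # numerical guard for log(0)
    recon = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)
    return recon + kl
```

With `mu = 0` and `log_var = 0` the KL term vanishes, so the loss reduces to the pure reconstruction cross-entropy.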
III-C Mixture Density Networks
Mixture density networks (MDN) [14] are neural networks that model data as a mixture of Gaussian distributions. This is useful for modelling multivalued mappings, such as many inverse functions or stochastic processes. MDNs model the distribution of target data $y$ conditioned on input data $x$ as

$$p(y|x) = \sum_{i=1}^{K} \alpha_i(x)\, \phi_i(y|x)$$

where $K$ is the number of components, $\alpha_i$ are the mixture coefficients subject to $\sum_i \alpha_i = 1$, and $\phi_i$ is a Gaussian kernel with mean $\mu_i(x)$ and variance $\sigma_i^2(x)$. MDNs have a similar structure to feedforward networks, except that they have three parallel output layers for three vectors: one for the means, one for the variances, and one for the mixture coefficients. The network parameters are optimized by minimizing the negative log-likelihood of the data:

$$L = -\sum_{n} \log \left( \sum_{i=1}^{K} \alpha_i(x_n)\, \phi_i(y_n|x_n) \right) \quad (5)$$
To predict an output for a given input, we sample from the resulting mixture distribution by first sampling from the categorical distribution defined by the mixture coefficients to select a component Gaussian, and then sampling from that component.
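The two-stage sampling just described can be sketched as follows for a one-dimensional mixture (the function name and interface are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mdn(alphas, mus, sigmas):
    """Draw one sample from a 1-D Gaussian mixture: first pick a
    component from the categorical distribution over the mixture
    coefficients, then sample from that component's Gaussian."""
    i = rng.choice(len(alphas), p=alphas)  # select component
    return rng.normal(mus[i], sigmas[i])   # sample within it
```

For a multimodal target (e.g. two well-separated components), repeated calls produce samples from both modes, which a single-Gaussian regressor could not represent.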
IV Architecture
The proposed architecture consists of three components: the vision module (V) that produces abstract representations of input images, the environment model (M) which generates imaginary rollouts, and the controller (C) that learns to map states into actions. We assume that the environment is Markovian and is fully represented at any given time by the input image. Figure 2 shows an overview of the architecture.
V comprises the encoder part of a variational autoencoder (VAE) [13], and is responsible for mapping the high-dimensional input images into low-dimensional state representations. The controller and the environment model are trained in this low-dimensional latent space, which is generally computationally less expensive. The main advantage of using a VAE instead of a vanilla autoencoder is that the VAE maps every input image into a continuous region in the latent space, defined by the parameters of a multivariate Gaussian distribution. This makes the environment model more robust and ensures that its output is meaningful and can be mapped back into realistic images.
M is responsible for generating synthetic transitions; it predicts the future state $z_{t+1}$ and the reward based on the current state $z_t$ and the input action $a_t$. It is composed of three models: a mixture density network (MDN) [14] that learns the transition dynamics, a reward predictor called the r-network, and a terminal state predictor called the d-network. The MDN learns the conditional probability distribution of the next state $P(z_{t+1} \mid z_t, a_t)$. The r-network learns to predict the reward for each state, while the d-network learns to predict whether a state is terminal or not. Both the r- and d-networks are implemented as feedforward neural networks. To generate imaginary rollouts, M can be seeded with an initial state from V and then run in closed loop, where its output is fed back into its input along with the selected action. The advantage of using an MDN is that it makes it possible to learn a model of stochastic environments, in which an action taken in a given state can lead to multiple next states. This is especially useful in HRI tasks, in which the human response to actions taken by the robot cannot be predicted with certainty. Furthermore, modelling the next state probabilistically is much more robust to errors in prediction, allowing the environment model to run in closed loop.

Lastly, C is responsible for selecting the appropriate action in a given state. It is implemented as a Q-network, and learns to estimate the action values of states. C is trained on both real and imaginary transitions to maximize the cumulative reward.
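The closed-loop rollout generation described above can be sketched as follows. Here `mdn_predict`, `r_net`, `d_net`, and `q_net` are hypothetical stand-ins for the trained components, and the latent state is treated as a scalar for brevity; the interface is an assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def imagine_rollout(z0, mdn_predict, r_net, d_net, q_net, depth=10):
    """Generate one imaginary rollout by running the environment model
    in closed loop: each sampled next state is fed back as the input of
    the following step, with actions chosen greedily by the controller."""
    rollout, z = [], z0
    for _ in range(depth):
        a = int(np.argmax(q_net(z)))             # controller picks action
        alphas, mus, sigmas = mdn_predict(z, a)  # next-state mixture params
        i = rng.choice(len(alphas), p=alphas)    # pick a mixture component
        z_next = rng.normal(mus[i], sigmas[i])   # sample next latent state
        done = d_net(z_next) > 0.5               # d-network: terminal?
        rollout.append((z, a, r_net(z_next), z_next, done))
        if done:                                 # stop at predicted terminal
            break
        z = z_next
    return rollout
```

The sampled transitions have the same tuple shape as real ones, so they can be pushed into the imaginary replay memory and consumed by the controller unchanged.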
V Experiments
The experiments detailed in this section are designed to evaluate our approach on a real-world robotic application. We are primarily interested in the performance increase gained by utilizing the learned model, compared to the baseline DQN method [1].
V-A Experiment Setup
To test our architecture, we designed a task in which a robot has to solve a puzzle based on pointing gestures made by a human. The robot sees three cubes with arrows painted on them, with each arrow pointing either up, down, left, or right. The human can point to any of the three cubes, but may not point to a different cube during the same episode. To successfully complete the task, the robot has to rotate the cubes so that only the arrow on the cube being pointed to is in the direction of the human, with the constraint that at no point should two arrows point in the same direction. The task is similar to puzzle games typically used in studies about robot learning, such as the Towers of Hanoi puzzle [15].
We formulate the task as an RL problem in which the agent can choose from 6 discrete actions at any given time: rotating any of the three cubes 90° clockwise or counterclockwise. The robot receives a reward of +50 for reaching the goal state, −5 for reaching an illegal state, and −1 otherwise to incentivize solving the task as efficiently as possible. An episode terminates when the robot reaches either a goal state or an illegal state, or after it performs 10 actions. See Fig. 3 for examples of goal and illegal terminal states.
To train the robot, we created a simulated environment that receives the selected action from the agent (the robot) as input, and outputs an image representing its state, along with a reward signal and a flag to mark terminal states. The environment is implemented as a finite state machine with 192 macrostates, where each macrostate is the combination of 3 microstates describing the directions of each of the three arrows, plus another microstate describing which box the hand is pointing to. Whenever the environment transitions to a certain state, it outputs one of a multitude of images associated with that state at random, thus producing the observable state that the agent perceives.
To produce the images used for the environment, we first collected multiple image fragments for each of the possible microstates of the environment. Each of these fragments depicts a slight variation of the same microstate, for example slightly different box positions or hand positions. We thus create a pool of multiple image fragments for each possible microstate. To synthesize a complete image for a given macrostate, we choose a random fragment for each of its constituent microstates, and then patch them together. For the experiments, we collected 50 fragments for each possible hand microstate and 16 fragments for each possible arrow microstate, resulting in a very large number of possible unique synthesized images. The images were taken with the Sawyer robotic arm camera (Fig. 1). For the experiments, we synthesized 100,000 training images and 10,000 test images.
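The fragment-based synthesis can be sketched as follows. The pool sizes (50 hand fragments, 16 arrow fragments) come from the text; the pool layout, fragment IDs, and string concatenation standing in for pixel patching are illustrative assumptions:

```python
import random

random.seed(0)

# Hypothetical fragment pools: one list of fragment IDs per microstate.
# Hand microstate: which of the 3 boxes is pointed to (50 variants each).
hand_pool = {h: [f"hand{h}_{i}" for i in range(50)] for h in range(3)}
# Arrow microstate: (box index, direction) with 16 variants each.
arrow_pool = {(b, d): [f"arrow{b}{d}_{i}" for i in range(16)]
              for b in range(3) for d in "UDLR"}

def synthesize_image(hand, arrows):
    """Compose one observable image for a macrostate by picking a random
    fragment for each constituent microstate and patching them together
    (here simply joining fragment IDs as a stand-in for pixel stitching)."""
    parts = [random.choice(hand_pool[hand])]
    parts += [random.choice(arrow_pool[(b, d)]) for b, d in enumerate(arrows)]
    return "+".join(parts)
```

Because each macrostate maps to many possible images, the observable dynamics the agent sees are stochastic even though the underlying arrow configuration changes deterministically.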
V-B Procedure
The training procedure for our experiments can be summarized as follows:

1. Train the VAE on all training images.
2. Start collecting real rollouts and training the controller.
3. After some number of episodes, start training the environment model M on the collected rollouts.
4. Use M to generate synthetic rollouts simultaneously with real rollouts.
5. Continue training the controller using both real and synthetic rollouts.
The exact training procedure is given in Algorithm 1. In the following, we will detail the training procedure for each component of the system and justify our choice of different parameters.
V-B1 Variational Autoencoder
To train the VAE, we split the grayscale training images along the horizontal axis into 3 strips, where each strip contains a single box. We then fed the strips into the VAE as 3 channels to help the VAE learn more task-relevant features. The architecture used for the VAE is that used by Ha and Schmidhuber in [5], except that we encode the images into an 8-dimensional latent space. The VAE was trained for 1000 epochs on the 100,000 synthesized training images after scaling them down to a manageable resolution. The VAE is trained to minimize the combined reconstruction and KL losses given by Equation 4. Here, the reconstruction loss is given by the pixel-wise binary cross-entropy. The KL loss was multiplied by a weighting factor $\beta$ that controls the capacity of the latent information channel. In general, increasing $\beta$ yields more efficient compression of the inputs and leads to learning independent and disentangled features, at the cost of worse reconstruction [16]. The value of $\beta$ was chosen empirically to produce the best results. The Adam optimizer was used with a learning rate of 0.0005 and a batch size of 2000.

V-B2 Environment Model
The MDN used to model the dynamics in the environment model learns the posterior probability of the next latent state vector as a Gaussian mixture model with 5 components. The MDN has 3 hidden layers of 256 ReLU units with 50% dropout and 3 parallel output layers for the distribution parameters: one for the mixture coefficients with softmax activation, and one each for the means and the variances, both with linear activation. When collecting transitions, we stored the parameters $\mu$ and $\sigma$ produced by V for each frame, and sampled from $\mathcal{N}(\mu, \sigma^2)$ to obtain latent space vectors when constructing a training batch. This form of data augmentation was found to greatly improve the generalization and performance of the model. The accompanying r-network has 3 hidden layers of 512 ReLU units each with 50% dropout, and was trained to minimize the log-cosh loss. The d-network has 2 hidden layers of 256 ReLU units each with 50% dropout, and was trained to minimize the binary cross-entropy. Both networks were trained to predict the corresponding value based on the state alone. During training, the MDN, the r-network, and the d-network were all updated 16 times per timestep on batches of 512 transitions using the Adam optimizer with a learning rate of 0.001.

V-B3 Controller
The controller is a DQN consisting of 3 hidden layers (512 ReLU, 256 ReLU, 128 ReLU) and a linear output layer. It was updated once on a batch of 64 real transitions and once on a batch of 64 imaginary transitions each timestep. We found that for such a relatively simple task, updating the controller more often led to worse performance. We also found that using popular DQN extensions like a separate target network or prioritized experience replay did not significantly affect performance. The controller used an $\epsilon$-greedy strategy with an exponentially annealed exploration rate given by $\epsilon_t = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min})\, e^{-\lambda t}$, where $t$ is the time step. The controller was trained to minimize the MSE loss using an Adam optimizer with a learning rate of 0.001.
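The annealing schedule can be illustrated as follows; the exact values of $\epsilon_{\max}$, $\epsilon_{\min}$, and the decay rate $\lambda$ are elided in the source, so the defaults below are purely illustrative:

```python
import math

def epsilon(t, eps_max=1.0, eps_min=0.05, lam=1e-4):
    """Exponentially annealed exploration rate: starts at eps_max at
    t = 0 and decays toward eps_min as the time step t grows."""
    return eps_min + (eps_max - eps_min) * math.exp(-lam * t)
```

Early in training the agent explores almost uniformly at random; as the Q-estimates improve, actions become increasingly greedy while a small floor of exploration remains.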
V-B4 Parameters
When training the agents, we set the depth of imaginary rollouts to 10, and the breadth to 3. The size of the real memory was 50,000 transitions, and that of the imaginary memory was 3,000. We found that training the controller only on recently generated transitions leads to better performance, since more recent copies of M produce better predictions. We achieve this both by limiting the imaginary memory size and by generating multiple rollouts simultaneously. Furthermore, we found that setting the update rate of the controller on both real and imaginary transitions (see Algorithm 1) to more than 1 can lead to stability issues. Another parameter we had to tune was the number of episodes to wait before starting to generate imaginary rollouts (see Algorithm 1), since M produces erroneous predictions early in the training. We found that waiting for about 1000 episodes provides the best results.
V-C Results
We compare the performance of agents augmented with imaginary transitions using our approach with a baseline DQN trained only on real transitions. To aid comparison, all hyperparameters and architectural choices were the same for agents augmented with our approach and the baseline DQN. For a given number of training episodes, we trained 5 agents from scratch and then tested them on the simulated environment for 1000 episodes. We then averaged the percentage of successfully completed episodes of all 5 agents in all test runs.
The agents trained using our approach performed significantly better than the baseline DQN when trained for a small number of episodes, with an increase of 35.9% in performance at 2000 episodes (Fig. 4). The advantage then starts to decline as the agent is trained for more episodes, since the baseline DQN catches up quickly. This is to be expected: at higher episode counts, the agent has collected enough real transitions and no longer needs the extra data generated by the environment model. Table I shows the exact results for this experiment.
We then increased the difficulty of the task while keeping the dynamics the same by additionally requiring the goal state not to have any arrows pointing towards the agent, and ran all the tests again. Results can be seen in Fig. 4 and Table II. Augmented agents showed an even greater performance increase compared to the baseline DQN, with up to a 78.5% increase in performance at 2000 training episodes. This suggests that the performance increase due to using synthetic transitions grows with the difference in complexity between the task itself and the environment dynamics.
Fig. 4 shows the performance increase in both tasks; error bars represent standard deviations.
Table I: Success rate (%) on the original task (standard deviation in parentheses).

Episodes  Base DQN  Augmented  % increase
2000  42.18 (6.01)  57.34 (6.37)  35.94
3000  62.88 (4.91)  81.40 (2.95)  29.45
4000  81.44 (3.13)  91.88 (2.38)  12.81
5000  88.22 (3.07)  95.10 (3.76)  7.79
6000  92.16 (2.57)  96.96 (2.10)  5.20
Table II: Success rate (%) on the harder task variant (standard deviation in parentheses).

Episodes  Base DQN  Augmented  % increase
2000  30.12 (4.16)  53.78 (3.16)  78.55
3000  49.56 (8.48)  75.10 (3.50)  51.53
4000  62.54 (2.92)  81.12 (5.35)  29.70
5000  79.44 (2.97)  94.88 (1.80)  19.43
6000  84.66 (2.73)  95.03 (2.23)  12.24
Generating Plans
One of the advantages of learning an environment model is that it allows a trained agent to produce entire plans given only the initial state (this is only possible for environments with deterministic underlying dynamics). This can be achieved by initializing the environment model with the initial state, and then generating an imaginary rollout in which the controller always chooses the optimal action for each state. To demonstrate this, we deployed a controller and an environment model on the Sawyer robotic arm (Fig. 1), where both networks had been previously trained for 6000 episodes using the training method described previously. Afterwards, we ran experiments to evaluate the planning capabilities of the system. Each experiment began by setting the cubes to a random state, with the experimenter pointing to a random cube. We then let the robot observe the configuration with the camera, and asked it to produce a plan consisting of a trajectory of actions to solve the task in its original form. The robot can execute the plan by successively selecting from a set of preprogrammed point-to-point movements to rotate the boxes. Out of 20 test runs, the robot successfully solved the task 17 times. The correct generated plans varied in length from 1 to 5, depending on the initial state. Moreover, the generated plans for all successful runs were optimal, containing only the fewest possible actions required to solve the task. Fig. 5 shows an example of an imaginary rollout according to an optimal plan of length 5 as generated by the agent.
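Plan generation as described above can be sketched like this. The network interfaces (`mdn_mean`, `q_net`, `d_net`) are hypothetical stand-ins for the trained components; since the underlying dynamics are deterministic, the sketch follows the most likely mixture mean instead of sampling:

```python
import numpy as np

def generate_plan(z0, mdn_mean, q_net, d_net, max_len=10):
    """Open-loop planning: seed the model with the encoded initial
    state, then roll forward greedily, recording the optimal action at
    each imagined state until a terminal state is predicted."""
    plan, z = [], z0
    for _ in range(max_len):
        a = int(np.argmax(q_net(z)))  # optimal action in imagined state
        plan.append(a)
        z = mdn_mean(z, a)            # most likely next latent state
        if d_net(z) > 0.5:            # predicted terminal: plan complete
            break
    return plan
```

The resulting action list can then be executed by mapping each action to a preprogrammed point-to-point movement, without further sensing.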
Model Generalization
One interesting result we noticed is that the model showed some generalization capability to transitions it had not experienced before. Since episodes always terminated after encountering a terminal state, the model never experienced any transitions from this kind of state. To test model generalization, we deliberately set the model state to a random terminal state 20 times, and then asked it to predict the next state for a random action each time. A model trained for 5000 episodes was able to correctly predict the next state 75% of the time. Fig. 6 shows an example of model prediction for unseen transitions.
V-D Discussion
One of the main challenges in learning a model online is avoiding overfitting on the small subset of data that is available early in the training. A model can easily get stuck in a local minimum if it is trained excessively on initial data, and then fail to converge to an acceptable loss value in a reasonable amount of time as more data become available. (When trained online, high-capacity models often exhibited a behaviour reminiscent of the Dunning-Kruger effect: they would achieve a very low loss value early in the training, which would quickly rise as more data were acquired, before eventually settling at a value in between.) We address this in three ways. First, we limit the model capacity by deliberately choosing smaller model sizes. Second, we adopt a probabilistic approach to encoding latent space representations and modelling environment dynamics. Third, we employ high dropout rates in the models. We also found that selecting an unnecessarily large latent space dimensionality leads to worse models.
Probabilistic models are also much more robust, which is essential when using the dynamics model in closed loop to generate rollouts. Traditional models based on point estimates will produce some error in prediction, which quickly compounds, resulting in completely erroneous predictions sometimes as early as the second pass. This, of course, makes using imaginary rollouts detrimental to learning.
The ability to learn stochastic models can be useful even for environments whose underlying dynamics are deterministic. An environment with deterministic underlying dynamics can have stochastic observable dynamics, since each latent state of the environment can produce multiple observable states. For example, the task we used for the experiments has deterministic underlying dynamics, since the configuration of the arrows will always change in the same way in response to a certain action. However, the observable state changes stochastically: the positions of the boxes or the hand may differ for the same configuration. The agent has no knowledge of the underlying dynamics, since it only has access to observable states. Therefore, it needs to be able to model the observable dynamics stochastically in order to produce realistic imaginary rollouts.
The generalization capabilities of the dynamics model can in principle be used to facilitate learning other similar tasks. The two variations of the task we used for the experiments share the exact same dynamics; they are only different in the definition of the reward functions. Indeed, for any given dynamics, an arbitrarily large family of tasks can be defined by specifying different reward functions. If learning the reward function can be separated from learning the dynamics, and assuming that the former is easier to learn than the latter, then learning new tasks in the same family will become much faster once the agent learns a dynamics model. However, this is left for future work.
VI Conclusion
In this paper we presented an architecture that allows an agent to learn a model of stochastic environments in a reinforcement learning setting. This allows the agent to significantly reduce the number of interactions it needs to make with the actual environment, which is especially useful for tasks involving real robots, where collecting real data can be prohibitively expensive. Furthermore, the ability to model stochastic environments makes this approach well-suited for tasks involving interaction with humans, as their actions usually cannot be predicted with certainty. We provided a detailed algorithm describing how to train both the agent and the environment model simultaneously, and how to use synthetic data in conjunction with real data. We validated our approach on a high-level robotic task in which an agent has to simultaneously interpret a human gesture and solve a puzzle based on it. Results show that agents augmented with synthetic data outperform baseline methods, especially in situations where only limited interaction data with the environment are available.
In future work, we will include recurrent models (such as LSTMs) in our architecture to handle environments with non-Markovian state representations. Furthermore, we will experiment with building environment models that can capture multimodal dynamics, allowing agents to make use of acoustic information, for instance. Another important extension is to include a measure of uncertainty to limit model usage when its output is erroneous; the simplest way to achieve this is by using model ensembles. We will also incorporate different ways of leveraging synthetic data to improve data efficiency even further, such as using imaginary rollouts to compute improved targets and to predict future outcomes directly. Finally, we will investigate using programming by demonstration techniques [17] to bootstrap agents, further decreasing the number of interactions the robot has to make with the environment.
Acknowledgment
This project has received funding from the European Union's Horizon 2020 framework programme for research and innovation under the Marie Sklodowska-Curie Grant Agreement No. 642667 (SECURE).
References
[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.

[2] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[3] S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al., "Imagination-augmented agents for deep reinforcement learning," in Advances in Neural Information Processing Systems, 2017, pp. 5690–5701.
[4] G. Kalweit and J. Boedecker, "Uncertainty-driven imagination for continuous deep reinforcement learning," in Conference on Robot Learning, 2017, pp. 195–206.
 [5] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” in Advances in Neural Information Processing Systems, 2018, pp. 2455–2467.
[6] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine, "Model-based value estimation for efficient model-free reinforcement learning," arXiv preprint arXiv:1803.00101, 2018.
[7] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee, "Sample-efficient reinforcement learning with stochastic ensemble value expansion," in Advances in Neural Information Processing Systems, 2018, pp. 8234–8244.
 [8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[9] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389–3396.
[10] I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, "Data-efficient deep reinforcement learning for dexterous manipulation," arXiv preprint arXiv:1704.03073, 2017.
[11] A. H. Qureshi, Y. Nakamura, Y. Yoshikawa, and H. Ishiguro, "Robot gains social intelligence through multimodal deep reinforcement learning," in 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids). IEEE, 2016, pp. 745–751.
 [12] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[13] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in International Conference on Learning Representations, 2014.
 [14] C. M. Bishop, “Mixture density networks,” Citeseer, Tech. Rep., 1994.

[15] K. Lee, Y. Su, T.-K. Kim, and Y. Demiris, "A syntactic approach to robot imitation learning using probabilistic activity grammars," Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1323–1334, 2013.
[16] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," in International Conference on Learning Representations, 2017.
 [17] A. Billard, S. Calinon, R. Dillmann, and S. Schaal, “Robot programming by demonstration,” Springer handbook of robotics, pp. 1371–1394, 2008.