I. Introduction
Model learning, or learning dynamics models from data, is a key element of model-based reinforcement learning (MBRL). MBRL is attractive due to its better sample efficiency over model-free methods [1]. The learned forward dynamics model can be used as a simulator for evaluating the performance of a candidate policy. In this process, also called long-term prediction, successive predictions from the model and the policy are cascaded in order to predict a trajectory distribution over a time horizon of interest. Such recursive predictions are susceptible to error accumulation and can deviate into regions where the model has seen little training data. When the uncertainty due to limited training data is also incorporated into the probabilistic model, further improvement in sample efficiency can be achieved [2]. Therefore, accurate model learning and long-term prediction with such uncertainty-aware models are important for MBRL.
We consider the case of contact-rich manipulation tasks, such as assembly, in static and rigid environments. Although model learning can be achieved using any standard regression method, such as artificial neural networks (ANN) or Gaussian processes (GP), the complex nature of contact dynamics can lead to poor performance. The dynamics of contact-rich manipulation is characterized by smooth regions representing different contact situations separated by discontinuities (Fig. 1b). In such situations, the mixture of experts (ME) [3] strategy is more appropriate, where a number of experts (or modes, in the case of dynamics) that model the different regions switch between one another. However, such an approach introduces a problem for long-term prediction. A straightforward long-term prediction using a standard ME model cannot handle the discontinuity in the state (velocity) evolution that can be caused by an impacting contact (Fig. 1d). Our main goal is a method that can learn a discontinuous dynamics model represented as a system of switching modes, encode uncertainty due to limited training data, and support discontinuous long-term prediction. Our problem is also related to system identification of hybrid dynamical systems [4], learning switching linear dynamical systems (SLDS) [5], and learning hidden Markov model (HMM) based switching models [6], but none of the existing frameworks is designed to satisfy our specific goal.

Our method relies on particle-based uncertainty propagation and is intended to be used as a gradient-free option. We obtain a system of uncertainty-aware models by clustering the trial data and learning independent dynamics modes (GP). Next, we introduce a novel switching scheme, consisting of a learned one-step-ahead mode predictor and initialization models that predict the initial state distributions of modes, to achieve long-term prediction. The novel switching scheme and its associated models make the key contribution of allowing discontinuous (Gaussian) state evolution. The scheme also supports simultaneous propagation through multiple modes in the form of a mixture of Gaussians. Our experiment on a 7-DOF industrial robot shows that our method can scale to real scenarios and outperform highly flexible but general-purpose model learning baselines.
II. Related Work
Although model learning for robot control has been actively researched [7], learning forward dynamics models for contact-rich manipulation tasks that feature discontinuities has been less explored. Most methods address learning only inverse dynamics models, which are often used for one-step prediction. For MBRL in a contact-rich manipulation context, the learned forward model has to be evaluated for long-term prediction through discontinuities. Such a direct evaluation was lacking in all the reviewed works.
A prominent example of probabilistic model learning for MBRL is PILCO [2], in which GPs were used for model learning. Unfortunately, with squared exponential (SE) kernels, PILCO does not support learning and prediction through discontinuous dynamics. A strategy for addressing discontinuities within the GP framework is manifold GP [8], in which the SE kernel function operates in a feature space that is transformed from the input space using an ANN. However, that method was evaluated only for single-step prediction. A more recent work (PETS) [9], which learns an uncertainty-aware ANN model (an ensemble of bootstraps), succeeded in regaining many of the desirable features of GP while retaining the expressive power and scalability of ANN. Apart from benefiting from the data efficiency of GP, our method has two main differences from the model learning approach in PETS: 1) it explicitly handles discontinuity and can therefore potentially model contacts better, and 2) it has a parametric form (GMM) for the multimodal state distribution instead of a particle-based representation.
The mixture of experts (ME) approach [3], in which a number of local models or experts are learned together with a gating network, provides a framework for highly complex models. Inverse dynamics models were learned for contact tasks in [10] and [11] and for a free motion task in [12]. McKinnon et al. [13] used a GP based Dirichlet process mixture model to learn the forward dynamics model of a mobile robot. Such ME-inspired approaches, and the more general GP mixture model in [14], where the current dynamics mode is inferred based on present or past data, may be sufficient for one-step prediction, but are not designed for discontinuous state propagation. To remedy this, our method introduces a one-step-ahead mode predictor, a GP based initial state predictor, and a probabilistic switching scheme to support not only discontinuous dynamics but also discontinuous state evolution.
Also relevant to our work is forward model learning with multi-step prediction for contact-rich tasks, such as [15] and [16], in which deterministic dynamics models were learned using ANN. An uncertainty-aware approach, such as ours, can potentially help achieve greater data efficiency for MBRL [2]. Finally, Levine et al. [17] used Gaussian mixture models (GMM) to learn locally linear model priors. While each component of the GMM represented a locally linear dynamics mode, our method benefits from a more flexible nonlinear GP model.
III. Problem Formulation
In contact-rich manipulation, the robot-environment interaction is characterized by contact situations such as sliding, sticking, slipping, or motion constrained by obstacles. Our modeling assumptions are: 1) smooth dynamics within a contact situation, 2) impacting contacts cause instantaneous state changes (velocities), and 3) a static and rigid environment.
The goal of model learning is, given a training dataset consisting of tuples (x_t, u_t, x_{t+1}), to fit a model of the form

p(x_{t+1} | x_t, u_t) = N(μ(x_t, u_t), Σ(x_t, u_t)),  (1)

where x_t = [q_t, q̇_t] is the state vector comprising the joint positions q_t ∈ R^N and velocities q̇_t ∈ R^N, u_t is the control action taken by a policy, N is the number of axes of the manipulator, and μ(·) and Σ(·) are the mean and variance functions of the dynamics model. In contact-rich manipulation, μ (and Σ) can change instantaneously with contact situations (Fig. 1b). While GP with the SE kernel certainly cannot express such functions, ANN under low data regimes can also fail. The ME model is a natural framework for such cases, where a set of experts and a gate are learned to solve the regression problem. Equation (2) summarizes the standard ME model for forward dynamics, where f_k represents an expert (or mode) and θ_k its model parameters:

p(x_{t+1} | x_t, u_t) = Σ_{k=1}^{K} p(k | x_t, u_t) p(x_{t+1} | x_t, u_t, f_k; θ_k).  (2)
The gate (or mode predictor) splits the input space of the regression model into a number of regions. For each region, an expert is trained. If the most likely expert is selected (hard switching) instead of averaging, discontinuities can be represented more accurately.
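The hard-switching behavior of the gate can be sketched in a few lines. This is a minimal, hypothetical 1-D example: the two experts, the region threshold, and the gate are illustrative stand-ins, not the learned models of this paper.

```python
# Minimal sketch of hard-switching mixture-of-experts prediction.
# Hypothetical 1-D experts and gate; not the paper's implementation.

def expert_free(x):
    # Smooth dynamics in the "free motion" region.
    return x + 0.1

def expert_contact(x):
    # Smooth dynamics in the "in contact" region.
    return 0.5 * x

def gate(x):
    """Split the input space into regions; return the most likely expert."""
    return expert_free if x < 1.0 else expert_contact

def predict(x):
    # Hard switching: pick the single most likely expert instead of
    # averaging, so the discontinuity at the region boundary is preserved.
    return gate(x)(x)
```

Averaging the two experts near x = 1.0 would smear the jump in the predicted dynamics; selecting the most likely expert keeps it sharp.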
In long-term prediction, we seek the state distributions p(x_1), ..., p(x_T), starting from an initial state distribution p(x_0), where the actions are given by a policy π that is under evaluation and T is the time horizon of interest. Obtaining long-term prediction by cascading the model in (2) will only work if, during switching, one expert (plus the policy) succeeds in driving the state variable into the next expert's input region. In other words, there has to be a continuous evolution of the state regardless of dynamics switching (Fig. 1c). Unfortunately, this is not the case for contact-rich manipulation, in which the state variables can undergo instantaneous changes in velocities during contact (Fig. 1d).
In MBRL, long-term prediction that encodes the subjective uncertainty due to limited training data (epistemic uncertainty), in addition to the inherent stochasticity of the system (aleatoric uncertainty), can be advantageous [2]. Furthermore, contacts may give rise to discrete events, such as stick or slip, that may both be probable under the same policy. Therefore, another desirable feature is the ability to propagate multiple state paths.
We define the problem as: 1) representation of contact-rich dynamics by a system of switching models, 2) long-term prediction with discontinuous state evolution, 3) encoding of epistemic and aleatoric uncertainties, and 4) propagation of a multimodal distribution that reflects multiple probable dynamics modes.
IV. Model Learning and Long-term Prediction
Our method has two main aspects: model learning (Alg. 1) and long-term prediction (Alg. 2), both visualized in Fig. 2. The ME-inspired system of switching dynamics models is realized by learning a set of GP dynamics models. GP regression is the most established method for encoding both epistemic and aleatoric uncertainties. Our method exploits discontinuities in the state data of contact-rich manipulation by first clustering the dataset (in the state space or its equivalent) and then fitting dynamics modes to the individual clusters. The number of such models is automatically inferred from data. As a key step, we also learn a one-step-ahead mode predictor and initialization models that predict initial state distributions for switching pairs.
Long-term prediction is addressed by a novel switching scheme. The one-step-ahead mode predictor is used to determine switching probabilities. In the event of a switch, the initialization model predicts the initial state distribution of the next mode, which allows discontinuous state evolution. Probabilistic switching is introduced to manage the splitting of the state distributions caused by discontinuities. The result is an overlapping of active modes in time, with the state in each mode represented as a weighted Gaussian. The same mechanism also supports continued propagation of multiple modes as a mixture of Gaussians.
IV-A. Learning Forward Models of Switching Dynamics
Equation (1) is the standard form of a probabilistic dynamics model. We adopt an alternative form [2] that predicts the difference between the next and current states. The individual parts of our model learning can be formalized as:
m_{t+1} = h(x_t, u_t),  (3)

p(x_{t+1} − x_t | x_t, u_t, m_{t+1} = k) = GP_k(x_t, u_t),  (4)

p(x_{t+1} | x_t, u_t, m_t = i, m_{t+1} = j) = GP_{ij}(x_t, u_t).  (5)

The variable m_t represents the mode at time t. Equation (3) represents a deterministic one-step-ahead mode predictor. Equation (4) represents the dynamics model for mode k. Equation (5) represents the initialization model that predicts the initial state distribution of a new mode j when transitioning from the current mode i.
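How (3)–(5) compose into one prediction step can be sketched as follows. The mode predictor, per-mode dynamics, and initialization model here are deterministic toy stand-ins for the learned SVM and GPs; the thresholds and mode behaviors are hypothetical.

```python
# Sketch of one prediction step composing (3)-(5): the one-step-ahead mode
# predictor decides the next mode; within a mode the dynamics model advances
# the state; across a switch the initialization model re-initializes it.
# All models below are hypothetical 1-D stand-ins for the learned models.

def mode_predictor(x, u):
    # Eq. (3): predict the NEXT mode from the current (x_t, u_t).
    return 0 if x + u < 1.0 else 1

dynamics = {
    # Eq. (4): per-mode dynamics (written in next-state form for brevity).
    0: lambda x, u: x + u,        # mode 0: free-motion-like response
    1: lambda x, u: x + 0.5 * u,  # mode 1: stiffer, in-contact response
}

init_model = {
    # Eq. (5): initial state of the new mode for a given switching pair;
    # here the hypothetical impact (0 -> 1) resets the state to zero.
    (0, 1): lambda x, u: 0.0,
}

def step(x, u, mode):
    next_mode = mode_predictor(x, u)
    if next_mode == mode:
        x_next = dynamics[mode](x, u)                 # continuous evolution
    else:
        x_next = init_model[(mode, next_mode)](x, u)  # discontinuous jump
    return x_next, next_mode
```

Cascading `step` recursively yields a long-term rollout in which the state can jump at mode switches, which a single smooth regression model cannot represent.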
The model learning procedure is summarized in Alg. 1 and depicted in the upper half of Fig. 2. The dataset obtained from a batch of trials is first clustered in the state space (or its equivalent) using the Dirichlet process Gaussian mixture model (DPGMM) [18], which automatically infers the number of clusters (modes), K. Fig. 2 symbolically depicts the clustering in a two-dimensional state space. We maintain the sequential order of the dataset and the cluster labels, which is exploited elsewhere in the algorithm. Next, we train a multi-output GP dynamics model for each cluster with its respective data. Each output dimension is modeled as an independent GP. Although the learned GP model is (4), for representational simplicity we refer to the full model in (1). The initialization models in (5) are also learned with GP regression. For every mode switch found in the dataset, a multi-output GP (independent outputs) is learned for predicting the initial state distribution of mode j conditioned on the state-action pair in mode i at the instant of transition. The training data for each initialization model is extracted from the clustered dataset, which has the sequential order of the data preserved. In the final step, the one-step-ahead mode predictor, which approximates (3), is obtained by training a multiclass support vector machine (SVM) classifier on state-action pairs labeled with the mode of the next time step. The SVM hyperplane (in state-action space) is symbolically depicted in the upper right part of Fig. 2. This is another use of the sequential order of the dataset.

The motivation for using DPGMM [18] is to be independent of the number of modes. In practice, we use the Bayesian Gaussian mixture, a variational inference based implementation available at [19], which requires an upper bound on the number of clusters. Clustering dynamics data to model piecewise smooth dynamics has been previously tried in [17], where a regular GMM clustering in the joint input-output space of the model was used. For our robot experiment, we obtained the best results by clustering in the Cartesian state space, with the state represented by the positions and velocities of three non-collinear points defined at the end-effector. To prevent spurious switching events, we implemented a heuristic that reassigns all data points that form small isolated groups along a trajectory.

GP regression is a Bayesian nonparametric method that is well suited to our needs. A GP is a distribution over functions completely defined by a mean function m(x) and a covariance function k(x, x'). In our case, the covariance function is the standard SE function with automatic relevance determination (ARD). The GP model provides uncertainty estimates for its predictions and, in our case, allows independent noise level estimation for each output dimension of each mode. The GP models are trained by optimizing the log marginal likelihood.
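The SE-ARD covariance used above has a simple closed form; the following sketch makes it explicit. The hyperparameter values (signal variance and per-dimension lengthscales) are placeholders, not tuned values from this work.

```python
import math

def se_ard(x1, x2, signal_var=1.0, lengthscales=(1.0, 0.5)):
    """Squared-exponential kernel with automatic relevance determination:
    k(x, x') = s^2 * exp(-0.5 * sum_d ((x_d - x'_d) / l_d)^2).
    A separate lengthscale per input dimension lets irrelevant inputs be
    effectively switched off (a large l_d flattens that dimension)."""
    r2 = sum(((a - b) / l) ** 2 for a, b, l in zip(x1, x2, lengthscales))
    return signal_var * math.exp(-0.5 * r2)
```

During training, the lengthscales, signal variance, and noise variance are the hyperparameters optimized via the log marginal likelihood.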
SVM has been previously used for gating in a standard ME model in [10]. In our model, the SVM classifier has the crucial difference of being a one-step-ahead predictor.
IV-B. Uncertainty Propagation in Forward Prediction
Long-term prediction is achieved by repeating one-step predictions recursively. However, this involves propagating uncertain inputs through the learned models in (3)–(5) and also through the policy. We focus on this process, which is shown in the bottom half of Fig. 2.

We derive a probabilistic mode predictor (MC-SVM) from the previously learned deterministic model by using a simple Monte Carlo (MC) approach. A number of samples are drawn from the input distribution, and the relative frequencies of the predicted classes are taken as a discrete probability distribution over the next mode. MC-SVM is depicted in the lower left section of Fig. 2, with dots used to represent the particles. We represent this MC process with an operator taking the predictor and the input distribution as its arguments.

For the GP models, we chose another particle-based method [20] that is based on the unscented transform (UT) [21]. An analytic but less computationally efficient alternative is moment matching [22]. We also exploit the particle-based approach to split state-action distributions during switching. Following the outcomes of the mode predictor, for each predicted mode pair (i, j), a corresponding state-action distribution is approximated using the particles involved in that transition. During a mode switch, when both the original mode and the next mode are simultaneously active for a few time steps, the distribution is propagated through (4) for the non-switching part and through (5) for the switching part. Through this splitting, the prediction and the training data are made consistent.

UT is a particle-based method for propagating an input distribution through a nonlinear function; it is also used on the policy. As introduced in [20], the GP prediction variance for the mean particle is added to the transformed input distribution. A similar approach is taken for the policy and its intrinsic variance. The cross-covariance between the state and the action is obtained in a manner similar to the variance and is used to form the joint state-action distribution. Fig. 2 shows the cases for the GP based models with the UT particles indicated as crosses. Similar to the MC case, we represent the UT process with its own operator.
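Both particle schemes can be sketched compactly. These are simplified 1-D toy versions: the sigma-point weighting below is one common UT parameterization, and the added GP variance term from [20] is omitted.

```python
import math, random

def mc_mode_probs(classifier, mean, std, n=1000, seed=0):
    """Monte Carlo approximation of switching probabilities: sample the
    (Gaussian) input distribution, push the samples through a deterministic
    classifier, and use relative class frequencies as probabilities."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        c = classifier(rng.gauss(mean, std))
        counts[c] = counts.get(c, 0) + 1
    return {c: k / n for c, k in counts.items()}

def unscented_transform(f, mean, var, kappa=1.0):
    """1-D unscented transform: propagate N(mean, var) through a nonlinear
    f using three sigma points, then recover the output mean/variance."""
    s = math.sqrt((1.0 + kappa) * var)
    points = [mean, mean + s, mean - s]
    w0 = kappa / (1.0 + kappa)          # weight of the central point
    wi = 0.5 / (1.0 + kappa)            # weight of each outer point
    ys = [f(p) for p in points]
    m = w0 * ys[0] + wi * (ys[1] + ys[2])
    v = w0 * (ys[0] - m) ** 2 + wi * ((ys[1] - m) ** 2 + (ys[2] - m) ** 2)
    return m, v
```

For a linear function the UT is exact, which is a quick sanity check: pushing N(0, 1) through f(x) = 2x + 1 gives mean 1 and variance 4.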
IV-C. Long-term Prediction with Switching Dynamics Models
Long-term prediction is achieved by cascading the one-step predictions in (6). This induces a distribution over all trajectories possible with the learned model and the current policy. For switching dynamics models that feature discontinuous state evolution, the trajectory distribution will be piecewise continuous. We refer to each piece that evolves within a mode as a segment. Because modes can be revisited, many segments can be associated with one mode, but not vice versa. In our novel switching scheme, segments can overlap in time during switching, which we call probabilistic switching, or when multiple modes are active simultaneously. An example of the latter case is the stick-slip phenomenon, where both outcomes are probable.
In Alg. 2 we drop the conditioning on the policy in (6) and indicate it with a superscript instead, since the policy completely defines the actions. The first segment is initialized with the initial state distribution and weight one. The mode predictor predicts the probabilities of the next mode at every time step and for each active segment. We split the segment's state-action distribution based on all predicted switches, as discussed earlier. There are two cases: the predicted mode equals the current mode (non-switching) or differs from it (switching). In the non-switching case, the state distribution is advanced using one of the dynamics models (line 9). The switching case follows a similar process but with one of the initialization models (line 13). In both cases, the result of the one-step prediction is merged (lines 10 and 14) with another one that may exist as the result of a previous iteration of the loop in line 4. The merging process is implemented to handle probabilistic switching, i.e., simultaneously active segments involved in a multi-step switch are maintained as weighted Gaussians by consolidating a segment's existing quantity with the one split away from the other segment.
The merging of two nearby Gaussians in lines 10 and 14 is done as follows. Let N(μ_1, Σ_1) and N(μ_2, Σ_2) represent the two Gaussians and w_1 and w_2 their respective weights. It can be shown that the mean and variance of the merged Gaussian distribution are:

μ = w̄_1 μ_1 + w̄_2 μ_2,  (7a)

Σ = w̄_1 Σ_1 + w̄_2 Σ_2 + w̄_1 w̄_2 (μ_1 − μ_2)(μ_1 − μ_2)^T,  (7b)

where w̄_1 and w̄_2 are the normalized versions of w_1 and w_2, respectively. The weights of the segments are updated (lines 11 and 15) such that, for any time t, if more than one segment is active, the state distribution is a valid mixture of Gaussians. Therefore, if the mode predictor continuously predicts multiple modes, a multimodal state distribution with each component corresponding to a dynamics mode follows.
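The merge in (7a)–(7b) is a standard moment match, sketched here for the scalar case:

```python
def merge_gaussians(m1, v1, w1, m2, v2, w2):
    """Moment-matched merge of two weighted 1-D Gaussians, as in (7a)-(7b):
    the merged mean is the weighted mean of the means, and the merged
    variance is the weighted variance plus a spread term that accounts
    for the separation of the two means."""
    wb1, wb2 = w1 / (w1 + w2), w2 / (w1 + w2)             # normalized weights
    m = wb1 * m1 + wb2 * m2                               # (7a)
    v = wb1 * v1 + wb2 * v2 + wb1 * wb2 * (m1 - m2) ** 2  # (7b)
    return m, v
```

Note that the spread term vanishes when the two means coincide, in which case the merge reduces to a weighted average of the variances.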
V. Experimental Results
We conducted two experiments to validate our method. First, a simulated block of mass was controlled to slide on a surface (1D) with abrupt changes in frictional properties. Here, we also demonstrate the multimodal state propagation aspect. Second, a 7-DOF robot arm (YuMi) was controlled to move such that it experienced unexpected contacts and sliding motions. In this case, our method was applied to a dynamics model with a 14-dimensional state (joint positions and velocities) and a 7-dimensional action.
As baselines for comparison, we used GP (SE kernel with ARD), manifold GP (mGP) [8], and the uncertainty-aware ANN method in PETS [9] (uANN). Both mGP and uANN encode the epistemic and aleatoric uncertainties and, in principle, can model any nonlinear function. In all cases, we do not assume observation noise. In mGP, the parameters of an ANN feature space mapping and an SE kernel based on that feature space are jointly learned. The ANN based mapping is expected to provide enough flexibility to represent discontinuities. The uANN model encodes uncertainty using an ensemble of bootstrap models and propagates state distributions with a number of particles. It benefits from the high flexibility and scalability of ANNs. For all three baselines, we sampled 50 trajectories up to the full episode lengths, followed by density estimation (GMM) at each time step, to make them comparable to our method. For both experiments, we report results with small and large datasets. This helps in analyzing how the methods scale with data size. Each dataset is split into train and test subsets. We report the average negative log likelihood (NLL) and root mean square error (RMSE) of individual time step predictions w.r.t. the test trajectories. The RMSE is calculated for the most likely mode.
For DPGMM, the upper bound on the number of clusters was set to 10 and 20 for the sliding mass and robot experiments, respectively, with the same concentration parameter in both cases. For the UT method, we used fixed values for its parameters. mGP was implemented on the GPflow framework [23]. For uANN, we used the published implementation of [9] and chose 5 ensembles for all cases. The TS uncertainty propagation scheme was used (when possible), in which particles are assigned to one of the ensembles permanently. All ANN network structures were fully connected layers.
V-A. The Sliding Mass Experiment
The sliding mass experiment is designed to have three dynamics modes, two of which are expected to be predicted simultaneously with some probability. A block of mass is controlled by a policy that pulls it toward a fixed destination along one direction. The policy is a simple linear Gaussian controller that outputs a force based on the current state (position and velocity). Along the direction of motion, the surface imposes three dynamics modes (Fig. 3): free frictionless motion, halted motion due to sticking, and slipping under kinetic friction. Once the block is stuck, the stochastic control policy may or may not overcome the stiction force, resulting in a probabilistic scenario of possibly staying stuck and possibly slipping. We expect discontinuities to appear at both stick and slip. The slip is implemented as an instantaneous velocity jump to 5 m/s. Stochasticity is introduced to the dynamics by injecting Gaussian noise into the state.
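The three modes can be reproduced with a toy simulator along these lines. All constants except the 5 m/s slip velocity from the text (stiction threshold, damping coefficient, mass, step size) are hypothetical placeholders, and the kinetic friction is replaced by a simple velocity damping stand-in.

```python
import random

# Toy reproduction of the sliding-mass modes (hypothetical constants; the
# paper's exact mass, forces, and noise levels are not reproduced here).
SLIP_VELOCITY = 5.0  # instantaneous velocity jump at slip, as in the text

def step(pos, vel, force, stiction=3.0, damping=0.4, mass=1.0, dt=0.05,
         rng=random.Random(0), noise=0.0):
    if vel == 0.0 and abs(force) <= stiction:
        mode = "stick"          # the stochastic policy may fail to
        new_vel = 0.0           # overcome the stiction force
    elif vel == 0.0:
        mode = "slip"           # discontinuity: instantaneous velocity jump
        new_vel = SLIP_VELOCITY
    else:
        mode = "free"           # smooth dynamics within a mode
        new_vel = vel + (force - damping * vel) / mass * dt
    new_vel += rng.gauss(0.0, noise)   # aleatoric noise injected on the state
    return pos + new_vel * dt, new_vel, mode
```

A rollout of this simulator under a noisy controller produces exactly the kind of data the clustering step is meant to separate: smooth segments per mode with jumps at the stick and slip events.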
The experiment generated two datasets: a smaller and a larger one. Each trial trajectory is simulated for a fixed number of time steps. Fig. 4 shows the result on the smaller dataset. We indicate different modes using different colors. We used the same policy and initial state distribution that were used for generating the data. All but one test trajectory eventually switched to the slip mode. The mean predictions are plotted with their transparency adjusted according to the probability weights for better clarity. The variance shown is the 2σ region for individual modes and is not adjusted for the weights. Also visible in the plot is the overlap indicating simultaneous activation of modes during switching. We have two cases of discontinuities: free-to-stick and stick-to-slip. The prediction closely follows the test trajectories while handling the two discontinuities well. Our method is also able to predict both the slip and stick modes simultaneously, and the state distribution is indeed multimodal. The switch without any discontinuity that appears in the middle of the free motion segment implies that the clustering phase split the free motion data into two. Although less ideal from a conceptual point of view, such over-clustering is practically advantageous since it reduces the GP training time. The variance prediction of our model is generally consistent (test data inside the 2σ region) but with slight violations.
TABLE I: NLL and RMSE scores (small and large datasets) for GP, mGP, uANN, and our method in the sliding mass experiment. Most entries were lost in extraction; our method's best scores of 0.63 (NLL) and 0.76 (RMSE) are preserved.
Figure 4 and Table I compare the various methods. The plot shows only the mean values of the predictions, although both the means and variances are used for the score calculation. For the baselines, the density estimation of the sampled trajectories was done using DPGMM. We used network sizes of [32, 32, 2] and [16, 16, 16, 2] for mGP and uANN, respectively. Our method accurately handled both of the expected discontinuities. None of the baselines was successful in handling the free-to-stick discontinuity, while both uANN and mGP somewhat succeeded in the stick-to-slip case. Although the GP model failed in both cases, when averaged over restarts, it gave better scores than uANN and mGP. Our method has the best scores in all cases. uANN showed a slight advantage over mGP. GP and mGP were left out of the large-dataset case due to long training times. The larger dataset did not lead to better scores, although a slight reduction in the spread of NLL is noticeable. Our method occasionally underestimates the variances, while all the baselines consistently overestimate them.
V-B. The Contact Motion Experiment
The robot used in this experiment is the bimanual 7-DOF collaborative robot YuMi from ABB. It was mounted on a wooden platform that also had a thick steel plate rigidly attached to its surface (Fig. 5). The robot interacts with the environment through a peg rigidly attached at the end-effector. A Cartesian space trajectory (uniform velocity of 0.06 m/s) was generated such that it caused the following sequence of events: free motion, contact with the platform, sliding motion on the platform, then contact and sliding motion along the edge of the steel plate. Stable contact motions were accomplished by combining reference trajectories that were slightly beyond the physical constraints, low gain joint space proportional-derivative (PD) controllers, and a closed-loop inverse kinematics scheme. Gaussian noise was added to two components of the reference trajectory as exploration noise. The robot was put in gravity compensated mode and torque commands were sent to the control interface.
Each trial consisted of 100 time steps sampled at 0.05 second intervals. We also perturbed the base policy (controller) by altering the joint space proportional gains of all axes by several fixed factors. The experiment generated two datasets: a smaller one containing 15 trials from the base policy, and a larger one with an extra 5 trials from each perturbed policy except one. The held-out perturbation case also had 5 trials and was designated as the test set. By perturbing the policy, we aim to closely emulate a typical MBRL iteration, in which the dataset for model learning would be a mixture of trials from slightly different policies. All aspects of our method are in the joint space, with the state variable comprising the joint positions and velocities. Our actual policy is a complex Cartesian space controller, but we use the generated joint space reference trajectory and the fixed PD gains as a simpler alternative. We use the resulting trajectory-centric policy that is associated with the test set for long-term predictions.
In Fig. 6, the results are presented in the Cartesian space (only translational position) for visual clarity. The transformation from the joint space was done by applying the UT method to the forward kinematics. The plot shows only the most likely mode at each time step. It can be seen that the method sequentially switched through eight modes, each indicated with its own color. This reveals that the clustering step discovered eight clusters (potential over-clustering). The prediction manages to closely follow the ground truth trajectories, respecting the two instances of contact (free-to-wood and wood-to-steel), with fairly consistent variance predictions (2σ). The subplots of Fig. 6 help validate our hypothesis by showing the incremental effects of adding the one-step-ahead predictor and the initialization model to a basic ME model. A basic ME model with an SVM-based current-step mode predictor and hard switching is not able to switch between modes correctly or handle the discontinuities (at about 1.5 and 3.5 seconds) due to the problem mentioned in Sec. III. With only the one-step-ahead predictor (and our switching scheme), the model is able to switch more effectively, but it is still not able to handle the discontinuities. Our method, with the further addition of the initialization model, achieves effective switching and can handle the discontinuities. Note that these ablation predictions are part of the overall long-term predictions.
TABLE II: NLL and RMSE scores (small and large datasets) for GP, mGP, uANN, and our method in the contact motion experiment. Most entries were lost in extraction; our method's best scores of 0.17 (NLL) and 0.15 (RMSE) are preserved.
The performance of our method is significantly better than the baselines, none of which succeeded in closely following the ground truth (Fig. 6). Since we did not expect multiple active modes, the density estimation for the baselines was limited to a unimodal Gaussian. We used network sizes of [32, 32, 3] and [64, 64, 64, 14] for mGP and uANN, respectively. While attempting the TS propagation scheme, we found that although it performed generally well up to relatively short horizons, as used in [9], it ultimately went unstable by successively predicting larger values. We reduced this problem by resorting to a scheme of averaging the predictions of all ensembles for each particle. Larger training data also alleviated this problem. While our method predicted fairly consistent variances, with occasional underestimations, all of the baselines produced overestimated variances. Our method has the best scores in all cases (Table II). mGP has better scores than uANN, but the plot shows noisy mean predictions. As in the previous experiment, the GP outperformed the other baselines. Both GP and mGP were not feasible to train with the larger dataset. Our method showed slight improvement with the larger dataset, but the corresponding case with uANN was inconclusive. For the larger dataset, the training times were 53 s and 923 s for uANN and our method, respectively. This can be improved with parallelization.
VI. Discussion
Our method has high prediction accuracy and is able to handle discontinuities in a very data efficient manner thanks to its novelties, the one-step-ahead mode predictor and the initialization model. It can be argued that ANN based methods would require much larger training data, especially in the vicinity of discontinuities. A sign of data inefficiency in the case of uANN was the instability in long-term prediction. Both mGP and uANN are sensitive to network structure selection and also suffer from the problem of local optima. The uANN method, however, scales better. While training uANN with minibatch gradient descent is usually fast, the parameter optimization of mGP with L-BFGS can be quite slow. We attribute the slightly better scores of the GP baseline over uANN and mGP to the above mentioned issues with ANN based models. Our method inherits data efficiency and minimal model selection from GP and has improved prediction accuracy by virtue of being a system of expert models. One aspect of accuracy is the variance prediction, which we noticed was significantly overestimated by all the baselines. The experiments clearly show that our method outputs reasonable variance estimates, even in the presence of discontinuities.
Unlike PILCO, our particle-based method does not offer the possibility of computing analytic gradients for use in policy search. A complete MBRL solution will be our future focus, and a suitable strategy could be the black-box gradient-free approach in [24].
The clustering step relies on contact-induced formation of clusters of data in the state space. However, not all mode transitions are accompanied by impacting contacts, and i.i.d. clustering such as DPGMM can easily identify multiple clusters within a mode, as we have seen in the results. Neither of these is an issue, because our focus has clearly been modeling and prediction involving discontinuities rather than precisely and uniquely identifying dynamics modes. The current clustering strategy could result in suboptimal solutions in the event of local optima or sparse trajectories. Trajectory segmentation methods that take into account the temporal correlation of data could be a future improvement.
Multimodality in long-term prediction is allowed to arise only from the simultaneous activation of modes and not from the propagation within a single mode. Such representations of discrete modes of operation could be exploited in policy search. Also, state distributions in parametric form remove the need for numerous trajectory samples for cost evaluation.
One of the biggest concerns with GP models is their scalability with training data. Our divide-and-conquer strategy reduces training time considerably, but there is still room for improvement. Since each output dimension of each dynamics mode is represented by an independent GP, there is great potential to improve the scalability of model learning through parallelization. This will be pursued in our future work.
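Because the per-dimension GPs share no parameters, their hyperparameter optimizations are embarrassingly parallel. A minimal sketch of such parallel training, using scikit-learn GPs as stand-ins for the actual GP models:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_dimension(args):
    """Fit one independent GP for a single output dimension."""
    X, y = args
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-4, normalize_y=True)
    return gp.fit(X, y)

def fit_mode(X, Y, max_workers=4):
    """Train the GPs of one dynamics mode (one per output dimension) in parallel."""
    jobs = [(X, Y[:, d]) for d in range(Y.shape[1])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_dimension, jobs))
```

The same pattern extends across modes as well as dimensions, so the total wall-clock training time could approach that of the single slowest GP.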
VII Conclusion
In this paper, we addressed the problem of predicting the state evolution of a contact-rich manipulation task using a learned dynamics model. In MBRL, such predictions are useful for evaluating a policy. Focusing on discontinuities as the main challenge, we adopted the strategy of a switching dynamics model and made progress on discontinuous state evolution. Our GP-based method inherently encodes uncertainty due to limited data, and it can output multimodal state distributions that reflect discrete modes of operation. The main experiment, which closely reflected a typical MBRL iteration of an episodic manipulation task, showed that our method outperforms highly flexible ANN-based methods. The need for further scaling with data remains, and a clear strategy has been suggested. Therefore, although limited to gradient-free settings, the proposed method is a promising option for model learning for contact-rich tasks.
References
 [1] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
 [2] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “Gaussian processes for data-efficient learning in robotics and control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 408–423, 2015.
 [3] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, pp. 79–87, 1991.
 [4] S. Paoletti, A. L. Juloski, G. Ferrari-Trecate, and R. Vidal, “Identification of hybrid systems: a tutorial,” European Journal of Control, vol. 13, no. 2–3, pp. 242–260, 2007.
 [5] S. Linderman, M. Johnson, A. Miller, R. Adams, D. Blei, and L. Paninski, “Bayesian learning and inference in recurrent switching linear dynamical systems,” in Artificial Intelligence and Statistics, 2017, pp. 914–922.
 [6] Y. Bengio and P. Frasconi, “An input output HMM architecture,” in Advances in Neural Information Processing Systems, 1995, pp. 427–434.
 [7] D. Nguyen-Tuong and J. Peters, “Model learning for robot control: a survey,” Cognitive Processing, vol. 12, no. 4, pp. 319–340, 2011.
 [8] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth, “Manifold Gaussian processes for regression,” in Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 2016, pp. 3338–3345.
 [9] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in Advances in Neural Information Processing Systems 31, 2018, pp. 4754–4765.
 [10] R. Calandra, S. Ivaldi, M. P. Deisenroth, E. Rueckert, and J. Peters, “Learning inverse dynamics models with contacts,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 3186–3191.
 [11] M. Toussaint and S. Vijayakumar, “Learning discontinuities with products-of-sigmoids for switching between local models,” in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 904–911.
 [12] D. Nguyen-Tuong, J. R. Peters, and M. Seeger, “Local Gaussian process regression for real time online model learning,” in Advances in Neural Information Processing Systems, 2009, pp. 1193–1200.
 [13] C. D. McKinnon and A. P. Schoellig, “Learning multimodal models for robot dynamics online with a mixture of Gaussian process experts,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 322–328.
 [14] C. E. Rasmussen and Z. Ghahramani, “Infinite mixtures of Gaussian process experts,” in Advances in Neural Information Processing Systems, 2002, pp. 881–888.
 [15] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7559–7566.
 [16] I. Lenz, R. A. Knepper, and A. Saxena, “DeepMPC: Learning deep latent features for model predictive control,” in Robotics: Science and Systems, 2015.
 [17] S. Levine, N. Wagener, and P. Abbeel, “Learning contact-rich manipulation skills with guided policy search,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 156–163.
 [18] C. E. Rasmussen, “The infinite Gaussian mixture model,” in Advances in Neural Information Processing Systems, 2000, pp. 554–560.
 [19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [20] J. Ko and D. Fox, “GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models,” Autonomous Robots, vol. 27, no. 1, pp. 75–90, 2009.
 [21] S. J. Julier and J. K. Uhlmann, “New extension of the Kalman filter to nonlinear systems,” in Signal Processing, Sensor Fusion, and Target Recognition VI, vol. 3068. International Society for Optics and Photonics, 1997, pp. 182–194.
 [22] M. P. Deisenroth, M. F. Huber, and U. D. Hanebeck, “Analytic moment-based Gaussian process filtering,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 225–232.
 [23] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, “GPflow: A Gaussian process library using TensorFlow,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 1299–1304, 2017.
 [24] K. Chatzilygeroudis, R. Rama, R. Kaushik, D. Goepp, V. Vassiliades, and J.-B. Mouret, “Black-box data-efficient policy search for robotics,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 51–58.