Probabilistic Inference of Simulation Parameters via Parallel Differentiable Simulation

To accurately reproduce measurements from the real world, simulators need to have an adequate model of the physical system and require the parameters of the model be identified. We address the latter problem of estimating parameters through a Bayesian inference approach that approximates a posterior distribution over simulation parameters given real sensor measurements. By extending the commonly used Gaussian likelihood model for trajectories via the multiple-shooting formulation, our chosen particle-based inference algorithm Stein Variational Gradient Descent is able to identify highly nonlinear, underactuated systems. We leverage GPU code generation and differentiable simulation to evaluate the likelihood and its gradient for many particles in parallel. Our algorithm infers non-parametric distributions over simulation parameters more accurately than comparable baselines and handles constraints over parameters efficiently through gradient-based optimization. We evaluate estimation performance on several physical experiments. On an underactuated mechanism where a 7-DOF robot arm excites an object with an unknown mass configuration, we demonstrate how our inference technique can identify symmetries between the parameters and provide highly accurate predictions. Project website:



There are no comments yet.


page 1

page 3

page 14


DiSECt: A Differentiable Simulator for Parameter Inference and Control in Robotic Cutting

Robotic cutting of soft materials is critical for applications such as f...

BayesSim: adaptive domain randomization via probabilistic inference for robotics simulators

We introduce BayesSim, a framework for robotics simulations allowing a f...

Dual Online Stein Variational Inference for Control and Dynamics

Model predictive control (MPC) schemes have a proven track record for de...

DiSECt: A Differentiable Simulation Engine for Autonomous Robotic Cutting

Robotic cutting of soft materials is critical for applications such as f...

A Differentiable Newton-Euler Algorithm for Real-World Robotics

Obtaining dynamics models is essential for robotics to achieve accurate ...

Forget-SVGD: Particle-Based Bayesian Federated Unlearning

Variational particle-based Bayesian learning methods have the advantage ...

Simulation-based Bayesian inference for multi-fingered robotic grasping

Multi-fingered robotic grasping is an undeniable stepping stone to unive...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Simulators for robotic systems allow for rapid prototyping and development of algorithms and systems [koenig_design_2004]

, as well as the ability to quickly and cheaply generate training data for reinforcement learning agents and other control algorithms 

[andrychowicz_learning_2020]. In order for these models to be useful, the simulator must accurately predict the outcomes of real-world interactions. This is accomplished through both accurately modeling the dynamics of the environment as well as correctly identifying the parameters of such models. In this work, we focus on the latter problem of parameter inference. Optimization-based approaches have been applied in the past to find simulation parameters that best match the measured trajectories [chebotar_closing_2019, ramos_bayessim_2019]. However, in many systems we encounter in the real world, the dynamics are highly nonlinear, resulting in optimization landscapes fraught with poor local optima where such algorithms get stuck. Global optimization approaches, such as population-based methods, have been applied [heiden2021neuralsim] but are sample inefficient and cannot quantify uncertainty over the predicted parameters. In this work we follow a probabilistic inference approach and estimate belief distributions over the the most likely simulation parameters given the noisy trajectories of observations from the real system. The relationship between the trajectories and the underlying simulation parameters can be highly nonlinear, hampering commonly used inference algorithms. To tackle this, we introduce a multiple-shooting formulation to the parameter estimation process which drastically improves convergence to high-quality solutions. Leveraging GPU-based parallelism of a differentiable simulator allows us to efficiently compute likelihoods and evaluate its gradients over many particles simultaneously. Based on Stein Variational Gradient Descent (SVGD), our gradient-based nonparametric inference method allows us to optimize parameters while respecting constraints on parameter limits and continuity between shooting windows. Through various experiments we demonstrate the improved accuracy and convergence of our approach. Our contributions are as follows: first, we reformulate the commonly used Gaussian likelihood function through the multiple-shooting strategy to allow for the tractable estimation of simulation parameters from long noisy trajectories. Second, we propose a constrained optimization algorithm for nonparametric variational inference with constraints on parameter limits and shooting defects. Third, we leverage a fully differentiable simulator and GPU parallelism to automatically calculate gradients for many particles in parallel. Finally, we validate our system on a simulation parameter estimation problem from real-world data and show that our calculated posteriors are more accurate than comparable algorithms, as well as likelihood-free methods.

Ii Related Work

System identification methods for robotics use a dynamics model with often linearly dependent parameters in classical time or frequency domain 

[vandanjon1995spectrum], and solve for these parameters via least-squares methods [kozlowski_modelling_1998]. Such estimation approaches have been applied, for example, to the identification of inertial parameters of robot arms [khosla1985sysid, atkeson1986inertial, caccavale_identification_1994, vandanjon1995spectrum] with time-dependent gear friction [grotjahn_friction_2001], or parameters of contact models [verscheure2010contact, fazeli2018identifiability]. Parameter estimation has been studied to determine a minimum set of identifiable inertial parameters [gautier1990minset] and finding exciting trajectories which maximize identifiability [antonelli_systematic_1999, gautier_exciting_1991]. More recently, least-squares approaches have been applied to estimate parameters of nonlinear models, such as the constitutive equations of material models [mahnken2017identification, hahn2019real2sim, narang2020tactile], and contact models [kolev2015sysid, lelidec2021diffsim]

. In this work, we do not assume a particular type of system to identify, but propose a method for general-purpose differentiable simulators that may combine multiple models whose parameters can influence the dynamics in highly nonlinear ways. Bayesian methods seek to infer probability distributions over simulation parameters, and have been applied to infer the parameters of dynamical systems in robotic tasks 

[wang_markov_2011, Muratore2020BayesianDR, Tan2018SimtoRealLA] and complex dynamical systems [ninness_bayesian_2010]. Our approach is a Bayesian inference algorithm which allows us to include priors to find posterior distributions over simulation parameters. The advantages of Bayesian inference approaches have been shown to be useful in the areas of uncertainty quantification [ninness_bayesian_2010, eykhoff_chapter_1981], system noise quantification [ting_bayesian_2011] and model parameter inference [qian2003mc, cranmer2020sbi]. Our method is designed for differentiable simulators which have been developed recently for various areas of modeling, such as articulated rigid body dynamics [peres2018lcp, hu2020difftaichi, geilinger2020add, Qiao2020Scalable, heiden2021neuralsim, lelidec2021diffsim], deformables [hu2019chainqueen, hu2020difftaichi, murthy2021gradsim, heiden2021disect, Huang2021PlasticineLabAS] and cloth simulation [liang2019diffcloth, hu2020difftaichi, Qiao2020Scalable], as well as sensor simulation [NimierDavidVicini2019Mitsuba2, heiden2020lidar]. Certain physical simulations (e.g. fracture mechanics) may not be analytically differentiable, so that surrogate gradients may be necessary [han2018stein]. Without assuming access to the system equations, likelihood-free inference approaches, such as approximate Bayesian computation (ABC), have been applied to the inference of complex phenomena [toni2008abc, papamakarios2016epsilon, ramos_bayessim_2019, hsu2019likelihoodfree, matl_inferring_2020, matl_stressd_2020]. While such approaches do not rely on a simulator to compute the likelihood, our experiments in subsection B-D indicate that the approximated posteriors inferred by likelihood-free methods are less accurate for high-dimensional parameter distributions while requiring significantly more simulation roll-outs as training data. Domain adaptation techniques have been proposed that close the loop between parameter estimation from real observation and improving policies learned from simulators [chebotar_closing_2019, ramos_bayessim_2019, mehta2020adaptiveda, du_auto-tuned_2021]. Achieving an accurate simulation is typically not the final objective of these methods. Instead, the performance of the learned policy derived from the calibrated simulator is evaluated which does not automatically imply that the simulation is accurate [lambert2020mismatch]. In this work, we focus solely on the calibration of the simulator where its parameters need to be inferred.

Iii Formulation

In this work we address the parameter estimation problem via the Bayesian inference methodology. The posterior over simulation parameters and a set of trajectories is calculated using Bayes’ rule:

where is the likelihood distribution and is the prior distribution over the simulation parameters. We aim to approximate the distribution over true parameters , which in general may be intractable to compute. These true parameters generate a set of trajectories

which may contain some observation noise. We assume that these parameters are unchanging during the trajectory and represent physical parameters, such as friction coefficients or link masses. We assume that each trajectory is a Hidden Markov Model (HMM) 

[russell_artificial_2009] which has some known initial state, , and some hidden states , . These hidden states induce an observation model . In the simulator, we map simulation states to observations via a deterministic observation function

. The transition probability

of the HMM cannot be directly observed but, in the case of a physics engine, can be approximated by sampling from a distribution of simulation parameters and applying a deterministic simulation step function . In subsection B-A, we describe our implementation of , the discrete dynamics simulation function. The function rolls out multiple steps via to produce a trajectory of

states given a parameter vector

and an initial state : . To compute measurements from such state vectors, we use the observation function : . Finally, we obtain a set of simulated trajectories for each initial state from the trajectories in . The initial state from an observed trajectory may be acquired via state estimation techniques, e.g. methods that use inverse kinematics to infer joint positions from tracking measurements. We aim to minimize the Kullback-Leibler (KL) divergence between the trajectories generated from forward simulating our learned parameter particles and the ground-truth trajectories , while taking into account the priors over simulation parameters:

We choose the exclusive KL divergence instead of the opposite direction since it was shown for our choice of particle-based inference algorithm in [liu2017svgdgradientflow] that the particles exactly approximate the target measure when an infinitely dimensional functional optimization process minimizes this form of KL divergence.

Iv Approach

Iv-a Stein Variational Gradient Descent

A common challenge in statistics and machine learning is the approximation of intractable posterior distributions. In the domain of robotics simulators, the inverse problem of inferring high-dimensional simulation parameters from trajectory observations is often nonlinear and non-unique as there are potentially a number of parameters values that equally well produce simulated roll-outs similar to the real dynamical behavior of the system. This often results in non-Gaussian, multi-modal parameter estimation problems. Markov Chain Monte-Carlo (MCMC) methods are known to be able to find the true distribution, but require an enormous amount of samples to converge, which is exacerbated in high-dimensional parameter spaces. Variational inference approaches, on the other hand, approximate the target distribution by a simpler, tractable distribution, which often does not capture the full posterior over parameters accurately 

[blei_variational_2017]. We present a solution based on the Stein Variational Gradient Descent (SVGD) algorithm [liu_stein_2019] that approximates the posterior distribution by a set of particles where is the Dirac delta function, and makes use of differentiable likelihood and prior functions to be efficient. SVGD avoids the computation of the intractable marginal likelihood in the denominator by only requiring the computation of which is independent of the normalization constant. The particles are adjusted according to the steepest descent direction to reduce the KL divergence in a reproducing kernel Hilbert space (RKHS) between the current set of particles representing and the target . As shown in [liu_stein_2019], the set of particles is updated by the following function:


where is a positive definite kernel and

is the step size. In this work, we use the radial basis function kernel, which is a common choice for SVGD 


due to its smoothness and infinite differentiability, and the median heuristic, which has been shown to provide robust performance on large learning problems 

[garreau2018median]. SVGD requires that the likelihood function be differentiable in order to calculate Equation 1, which in turn requires that and be differentiable. To enable this, we use a fully-differentiable simulator, observation function and likelihood function, and leverage automatic differentiation with code generation to generate CUDA kernel code [sanders_cuda_2010]. Because of this CUDA code generation, we are able to calculate for each particle in parallel on the GPU.

Iv-B Likelihood Model for Trajectories

(a) Single Shooting
(b) Multiple Shooting
(c) Shooting Windows and Defects
Fig. 2: The two heatmaps plots on the left show the landscape of the log-likelihood function for an inference problem where the two link lengths of a double pendulum are estimated (ground-truth parameters indicated by a white star). In (a), the likelihood is evaluated over a 400-step trajectory; in (b), the trajectory is split into 10 shooting windows and the likelihood is computed via Equation 6. In (c), the shooting intervals and defects are visualized for an exemplary parameter guess.

We define as the probability of an individual trajectory simulated from parameters with respect to a matched ground-truth trajectory . Following the HMM assumption from section III, we define Equation 4 as the product of probabilities over observations, shown in Equation 2. This assumption is justified because the next state is fully determined by the position and velocity of the current state in articulated rigid body dynamics (see subsection B-A). To directly compare observations we use a Gaussian likelihood:


This likelihood model for observations is used to compute the likelihood for a trajectory


where is (an estimate of) the first state of . This formulation is known as a single-shooting estimation problem, where a trajectory is compared against another only by varying the initial conditions of the modeled system. To evaluate the likelihood for a collection of ground-truth trajectories,

, we use an equally weighted Gaussian Mixture Model likelihood:


Iv-C Multiple Shooting

Estimating parameters from long trajectories can prove difficult in the face of observation noise, and for systems in which small changes in initial conditions produce large changes in trajectories, potentially resulting in poor local optima [aydogmus_modified_2020]. We adopt the multiple-shooting method which significantly improves the convergence and stability of the estimation process. Multiple shooting has been applied in system identification problems [bock_recent_1983] and biochemistry [peifer_parameter_2007]. Multiple shooting divides up the trajectory into shooting windows over which the likelihood is computed. To be able to simulate such shooting windows, we require their start states to be available for to generate a shooting window trajectory. Since we only assume access to the first true state from the real trajectory , we augment the parameter vector by the start states of the shooting windows, which we refer to as shooting variables (for ). We define an augmented parameter vector as . A shooting window of time steps starting from time is then simulated via , where denotes a selection operation of the sub-array between the indices and . Analogous to Equation 2, we evaluate the likelihood for a single shooting window as a product of state-wise likelihoods. To impose continuity between the shooting windows, defect constraints are imposed as a Gaussian likelihood term between the last simulated state from the previous shooting window at time and the shooting variable at time :



is a small variance so that the MCMC samplers adhere to the constraint. Including the defect likelihood allows the extension of the likelihood defined in

Equation 3 to a multiple-shooting scenario:


As for the single-shooting case, Equation 4 with as the trajectory-wise likelihood function gives the likelihood for a set of trajectories. In Figure 2, we provide a parameter estimation example where the two link lengths of a double pendulum must be inferred. The single-shooting likelihood from Equation 3 (shown in (a)) exhibits a rugged landscape where many estimation algorithms will require numerous samples to escape from poor local optima, or a much finer choice of parameter prior to limit the search space. The multiple-shooting likelihood (shown in Figure 2) is significantly smoother and therefore easier to optimize.

Iv-D Parameter Limits as a Uniform Prior

Simulators may not be able to handle simulating trajectories from any given parameter and it is often useful to enforce some limits on the parameters. To model this, we define a uniform prior distribution on the parameter settings , where denote the upper and lower limits of parameter dimension .

Iv-E Constrained Optimization for SVGD

Directly optimizing SVGD for the unnormalized posterior on long trajectories, with a uniform prior, can be difficult for gradient based optimizers. The uniform prior has discontinuities at the extremities, effectively imposing constraints, which produce non-differentiable regions in the parameter space domain. We propose an alternative solution to deal with this problem and treat SVGD as a constrained optimization on with and as constraints. A popular method of constrained optimization is the Modified Differential Multiplier Method (MDMM) [platt1988constrained] which augments the cost function to penalize constraint violations. In contrast to the basic penalty method, MDMM uses Lagrange multipliers in place of constant penalty coefficients that are updated automatically during the optimization:


MDMM formulates this setup into an unconstrained minimization problem by introducing Lagrange multipliers (initialized with zero):


where is a constant damping factor that improves convergence in gradient descent algorithms. To accommodate these Lagrange multipliers per-particle we again extend the parameter set () introduced in subsection IV-C to store the multiple shooting variables and Lagrange multipliers, . The following update equations for and are used to minimize Equation 8:

We include parameter limit priors from subsection IV-D as the following equality constraints (where clips the value to the interval ): Other constraints are the defect constraints from Equation 5 which are included as equality constraints as well: . The overall procedure of our Constrained SVGD (CSVGD) algorithm is given in Algorithm 1.

Inputs: differentiable simulator , observation function , start states for each ground-truth trajectory , learning rate scheduler (e.g. Adam)
for  do
       Roll out simulated observations
       Compute via Equation 1 and
       Update via
       Update via for
end for
Algorithm 1 Constrained SVGD

Iv-F Performance Metrics for Particle Distributions

In most cases we do not know the underlying ground-truth parameter distribution , and only have access to a finite set of ground-truth trajectories. Therefore, we measure the discrepancy between trajectories rolled out from the estimated parameter distribution, , and the reference trajectories, . One measure, the KL divergence, is the expected value of the log likelihood ratio between two distributions. Although the KL divergence cannot be calculated from samples of continuous distributions, methods have been developed to estimate it from particle distributions using the -nearest neighbors distance [wang_divergence_2009]. The KL divergence is non-symmetric and this estimate can be poor in two situations. The first is estimating when the particles are all very close to one trajectory, causing a low estimated divergence, yet the posterior is of poor quality. The opposite can happen when the particles in the posterior are overly spread out when estimating . We additionally measure the maximum mean discrepancy (MMD) [gretton2012mmd], which is a metric used to determine if two sets of samples are drawn from the same distribution by calculating the square of the distance between the embedding of the two distributions in an RKHS.

V Experiments

We compare our method against commonly used parameter estimation baselines. As an algorithm comparable to our particle-based approach, we use the Cross Entropy Method (CEM) [rubinstein1997optimization]. In addition, we evaluate the Markov chain Monte-Carlo techniques Emcee [foreman2013emcee], Stochastic Gradient Langevin Dynamics (SGLD) [welling_bayesian_2011], and the No-U-Turn-Sampler (NUTS) [hoffman_no-u-turn_2011], which is an adaptive variant of the gradient-based Hamiltonian MC algorithm. For Emcee, we use a parallel sampler that implements the “stretch move” ensemble method [goodman_ensemble_2010]. For BayesSim, we report results from the best performing instantiation using a mixture density random Fourier features model (MDRFF) with Matérn kernel in Table I. We present further details and extended results from our experiments in the appendix.

V-a Parameter Estimation Accuracy

We create a synthetic dataset of 10 trajectories from a double pendulum which we simulate by varying its two link lengths. These two uniquely identifiable [fazeli2018identifiability]

parameters are drawn from a Gaussian distribution with a mean of

and a full covariance matrix (density visualized by the red contour lines in Figure 3). We show the evolution of the consistency metrics in Figure 4. Trajectories generated by evaluating the posteriors of the compared methods are compared against 50 test trajectories rolled out from the ground-truth parameter distribution. We find that SVGD produces a density very close to the density that matches the principal axes of the ground-truth posterior, and outperforms the other methods in all metrics except log likelihood. CEM collapses to a single high-probability estimate but does not accurately represent the full posterior, as can be seen in (d) being maximal quickly but poor performance on the other metrics which measure the spread of the posterior. Emcee and NUTS represent the spread of the posterior but do not sharply capture the high-likelihood areas, shown by their good performance in (b), (b) and (c). SGLD captures a small amount of the posterior around the high likelihood points but does not fully approximate the posterior.

(a) Emcee
(b) CEM
(c) SGLD
(d) NUTS
(e) SVGD
Fig. 3: Estimated posterior distributions from synthetic data generated from a known multivariate Gaussian distribution (subsection V-A). The 100 last samples were drawn from the Markov chains sampled by Emcee, SGLD, and NUTS, while CEM and SVGD used 100 particles.
Experiment Metric Emcee CEM SGLD NUTS SVGD BayesSim CSVGD (Ours)
Double Pendulum 8542.2466 8911.1798 8788.0962 9196.7461 8803.5683 8818.1830 5204.5336
4060.6312 8549.5927 7876.0310 6432.2131 10283.6659 3794.9873 2773.1751
MMD 1.1365 0.9687 2.1220 0.5371 0.7177 0.6110 0.0366
Panda Arm -16.1185 -17.3331 -17.3869 -17.9809 -17.7611 -17.6395 -15.1671
TABLE I: Consistency metrics of the posterior distributions approximated by the different estimation algorithms. Each metric is calculated across simulated and real trajectories. Lower is better on all metrics except .
Fig. 4: Accuracy metrics for the estimated parameter posteriors shown in Figure 3. The estimations were done on the synthetic dataset of a multivariate Gaussian over parameters (subsection V-A). (d) shows the single-shooting likelihood using the equation described in Equation 3.

V-B Identify Real-world Double Pendulum

We leverage the dataset from [asseman2018learning] containing trajectories from a physical double pendulum. While in our previous synthetic data experiment the parameter space was reduced to only the two link lengths, we now define 11 parameters to estimate. The parameters for each link are the mass, inertia , the center of mass in the and direction, and the joint friction. We also estimate the length of the second link. Note that parameters, such as the length of the first link, are not explicitly included since they are captured by the remaining parameters, which we validated through sensitivity analysis. Like before, the state space is completely measurable, i.e. , except for observation noise. In this experiment we find that CSVGD outperforms all other methods in KL divergence (both ways) as well as MMD, shown in Table I. We believe this is because of the complex relationship between the parameters of each link and the resulting observations. This introduces many local minima which are hard to escape from (see Figure 5). The multiple-shooting likelihood improves the convergence significantly by simplifying the optimization landscape.

(a) SVGD
Fig. 5: Trajectory density plots (only joint positions are shown) obtained by simulating the particle distribution found by SVGD with the single-shooting likelihood ((a)) and the multiple-shooting likelihood ((b)) on the real-world double pendulum (subsection V-B).

V-C Identify Inertia of an Articulated Rigid Object

In our final experiment, we investigate a more complicated physical system which is underactuated. Through an uncontrollable universal joint, we attach an acrylic box to the end-effector of a physical 7-DOF Franka Emika Panda robot arm. We fix two weights at the bottom inside the box, and prescribe a trajectory which the robot executes through a PD controller. By tracking the motion of the box via a VICON motion capture system, we aim to identify the 2D locations of the two weights (see Figure 1). The system only has access to the proprioceptive measurements from the robot arm (seven joint positions and velocities), as well as the 3D pose of the box at each time step. In the first phase, we identify the inertia properties of the empty box, as well as the joint friction parameters of the universal joint from real-robot trajectories of shaking an empty box. Given our best estimate, in the actual parameter estimation setup we infer the two planar positions of the weights. The particles from SVGD and CSVGD quickly converge in a way that the two weights are aligned opposed to each other. If the weights were not at locations symmetrical about the center, the box would tilt and yield a large discrepancy to the real observations. MCMC, on the other hand, even after more than ten times the number of iterations, only rarely approaches configurations in which the box remains balanced. In Figure 1 (right) we visualize the posterior over weight locations found by CSVGD (blue shade). The found symmetries are clearly visible when we draw lines (orange) between the inferred positions of both weights (yellow, green), while the true weight locations (red) are contained in the approximated distribution. As can be seen in Table I, the log likelihood is maximized by CSVGD. The results indicate the ability of our method to accurately model difficult posteriors over complex trajectories because of the symmetries underlying the simulation parameters.

Vi Conclusion

We have presented Constrained Stein Variational Gradient Descent (CSVGD), a new method for estimating the distribution over simulation parameters that leverages Bayesian inference and parallel, differentiable simulators. By segmenting the trajectory into multiple shooting windows via hard defect constraints, and effectively using the likelihood gradient, CSVGD produces more accurate posteriors and exhibits improved convergence over previous estimation algorithms. In future work, we plan to leverage the probabilistic predictions from our simulator for uncertainty-aware control applications. Similar to [lambert_stein_2020], the particle-based uncertainty information can be leveraged by a model-predictive controller that takes into account the multi-modality of future outcomes.

Appendix A Technical Details

In this section we provide further technical details on our approach.

A-a Initializing the Estimators

For each experiment, we initialize the particles for the estimators via the deterministic Sobol sequence on the intervals specified through the parameter limits. Our code uses the Sobol sequence implementation from [burkardt2021sobol] which is based on a Fortran77 implementation by [bennett1986sobol]. For the MCMC methods that sample a single Markov chain, we used the center point between the parameter limits as initial guess.

A-B Likelihood Model for Sets of Trajectories

(a) Product
(b) Sum
Fig. 6: Comparison of posterior parameter distributions obtained from fitting the parameters to two ground-truth trajectories generated from different link lengths of a simulated double pendulum (units of the axes in meters). The trajectories were 300 steps long (which corresponds to a length of ) and contain the 2 joint positions and 2 joint velocities of the uncontrolled double pendulum which starts from a zero-velocity initial configuration where the first angle is at (sideways) and the other at . In (a), the product of the individual per-trajectory likelihoods is maximized (Equation 10). In (b), the sum of the likelihoods is maximized (Equation 9).

In this work we assume that trajectories may have been generated by a distribution over parameters. In the case of a replicable experimental setup, this could be a point distribution at the only true parameters. However, when trajectories are collected from multiple robots, or with slightly different experimental setups between experiments, there may be a multimodal distribution over parameters which generated the set of trajectories. Note, that irrespective of the choice of likelihood function we do not make any assumption about the shape of the posterior distribution by leveraging SVGD which is a non-parametric inference algorithm. In trajectory space, the Gaussian likelihood function is a common choice as it corresponds to the typical least-squares estimation methodology. Other likelihood distributions may be integrated with our method, which we leave up to future work. The likelihood which we use is a mixture of equally-weighted Gaussians centered at each reference trajectory :


If we were to consider each trajectory as an independent sample from the same trajectory distribution (the product), the likelihood function would be


While both Eqs. (9) and (10) define the likelihood for a combination of single-shooting likelihood functions for a set of real trajectories , the same combination operators (sum or product) apply to the combination of multiple-shooting likelihood functions analogously. The consequence of using these likelihoods can be seen in Figure 6 where (a) shows the resulting posterior distribution (in parameter space) when treating a set of trajectories as independent and taking the product of their likelihoods (Equation 10), while (b) shows the result of treating them as a sum of Gaussian likelihoods (Equation 9). In (a) the posterior becomes the average of the two distributions since that is the most likely position that generated both of the distinct trajectories. In contrast, the posterior approximated by the same algorithm (CSVGD) but using the sum of Gaussian likelihoods, successfully captures the multimodality in the trajectory space since most particles have aligned near the two modes of the true distribution in parameter space.

A-C State and Parameter Normalization

The parameters we are estimating are valid over only particular ranges of values. These ranges are often widely different - in the case of our real-world pendulum experiment, the center of mass of a link in a pendulum may be in the order of centimeters, while the angular velocity at the beginning of the recorded motion can reach values on the orders of meters per second. It is therefore important to scale the parameters to a common range to avoid any dimension to dominate smaller parameter ranges during the estimation. Similarly, the state dimensions are of different units - for example, we typically include velocities and positions in the recorded data over which we compute the likelihood. Therefore, we also normalize the range over the state dimensions. Given the state vector, respective parameter vector, , we normalize each dimension by its statistical variance , i.e. .

A-D KNN-based Approximation for KL Divergence

In this work, we compare a set of parameter guesses (particles) to the ground-truth parameters, or a set of trajectories generated by simulating trajectories from each parameter in the particle distribution to a set of trajectories created on a physical system. To compare these distributions, we use the KL divergence to determine how the two distributions differ from each other. Formally, the KL divergence is the expected value of the log likelihood ratio between two distributions, and is an asymmetric divergence that does not satisfy the triangle inequality. The KL divergence is easily computable in the case of discrete distributions or simple parametric distributions, but is not easily calculable for samples from non-parametric distributions such as those over trajectories. Instead, we use an approximation to the KL divergence which uses the relative distances between samples in a set to estimate the KL divergence between particle distributions. This method has been used to compare particle distributions over robot poses to asses the performance of particle filter distributions [chou_performance_2011]. To estimate the KL divergence between particle distributions over trajectories and we adopt the formulation from [wang_divergence_2009, chou_performance_2011]:


where is the dimensionality of the trajectories, is the number of trajectories in the dataset, is the number of particles in the dataset, is the distance from trajectory to its -th nearest neighbor in , is the distance from trajectory to its -th nearest neighbor in , and is the digamma function. Note that this approximation of KL divergence can also be applied to compare parameter distributions, as we show in the synthetic data experiment from subsection V-A (cf. (a) and (b)) where the ground-truth parameter distribution is known. Throughout this work, we set and to 3 as this reduces the bias in the approximation, but does not require a large amount of samples from the ground-truth distribution.

Appendix B Experiments

In the following, we provide further technical details and results from the experiments we present in the main paper.

B-a Differentiable Simulator

Other than requiring a differentiable forward dynamics model which allows to simulate the system in its entirety following the Markov assumption, our proposed algorithm does not rely on a particular choice of dynamical system or simulator for which its parameters need to be estimated. For our experiments, we use the Tiny Differentiable Simulator [heiden2021neuralsim] that implements end-to-end differentiable contact models and the Articulated Body Algorithm (ABA) [featherstone2007rbda] to compute the forward dynamics (FD) for articulated rigid-body mechanisms. Given joint positions , velocities , torques in generalized coordinates, and external forces , ABA calculates the joint accelerations . We use semi-implicit Euler integration to advance the system dynamics in time for a time step :


The second-order ODE described by Equation 12 is lowered to a first-order system, with state . Furthermore, we deal primarily with the discrete time-stepped dynamics function , assuming that is constant. The function uses iteratively to produce a trajectory of states given an initial state and the parameters . Many systems of practical interest for robotics are controlled by an external input. In our formulation for parameter estimation, we include controls as explicit dependencies on the time parameter . For an articulated rigid body system, the parameters may include (but are not limited to) the masses, inertial properties and geometrical properties of the bodies in the mechanism, as well as joint and contact friction coefficients. Given and

, gradients of simulation parameters with respect to the state trajectories can be computed directly through the chain rule, or via the adjoint sensitivity method 


Link Parameter Minimum Maximum
Link 1 Mass
Joint friction
Link 2 Length
Joint friction
(a) Parameters to be estimated. refers to the inertia matrix, COM stands for center of mass.
(b) Time lapse of a double pendulum trajectory from the IBM dataset [asseman2018learning].
Fig. 7: Physical double pendulum experiment from subsection B-B.

B-B Identify Real-world Double Pendulum

We set up the double pendulum estimation experiment with the 11 parameters shown in (a) to be estimated. The state space consists of the two positions and velocities of both joints: . The dataset of trajectories contains image sequences (see time lapse of an excerpt from a trajectory in (b)) and annotated pixel coordinates of the three vertices in the double pendulum, from which we extracted joint positions and velocities (via finite differencing given the known recording frequency of ). Since we know that all trajectories in this dataset stem from the same double pendulum [asseman2018learning], we only used a single reference trajectory as target trajectory during the estimation. We let each estimator run for 2000 iterations. For evaluation, we calculate the consistency metrics from Table I over 10 held-out trajectories from a test dataset. For comparison, we visualize the trajectory density over simulations rolled out from the last 100 Markov samples (or 100 particles in the case of particle-based approaches) in Figure 8. The ground-truth shown in these plots again stems from the unseen test dataset. This experiment further demonstrates the generalizability of simulation-based inference, which, when an adequate model has been implemented and its parameters identified, can predict outcomes under various novel conditions even though the training dataset consisted of only a single trajectory in this example.

Fig. 8: Kernel density estimation over trajectory roll-outs from the last estimated 100 parameter guesses of each method, applied to the physical double pendulum dataset from subsection B-B. The ground-truth trajectory here stems from the test dataset of 10 trajectories that were held out during training. The particle-based approaches (CEM, SVGD, CSVGD) use 100 particles.

B-C Ablation Study on Multiple Shooting

We evaluate the baseline estimation algorithms with our proposed multiple-shooting likelihood function (using 10 shooting windows) on the physical double pendulum dataset from before. To make the constrained optimization problem amenable to the MCMC samplers, we formulate the defect constraint through the likelihood defined in Equation 5 where we tune to a small value (on the order of ) such that the defects are minimized during the estimation. As we describe in subsection IV-C, the parameter space is augmented by the shooting variables . As shown in Table II, the MCMC approaches Emcee and NUTS do not benefit meaningfully from the multiple-shooting approach. Emcee often yields unstable simulations from which we are not able to compute some of the metrics. The increased dimensionality of the parameter space appears to add a significant challenge to these methods, which are known to scale poorly to higher dimensions. Despite being configured to use a Gaussian mixture model of 3 kernels, the CEM posterior immediately collapses to a single point such that the KL divergence of simulated against real trajectories cannot be computed. We observe a significant improvement in estimation accuracy on SGLD, where the multiple-shooting approach allowed it to converge to closely matching trajectories, as shown in Figure 9. As with SVGD, the availability of gradients allows this method to scale better to the higher dimensional parameter space, while the smoothed likelihood landscape further helps the approach to find better fitting parameters.

Algorithm SS MS SS MS SS MS
Emcee 8542.2466 8950.4574 4060.6312 N/A 1.1365 N/A
CEM 8911.1798 8860.5115 8549.5927 N/A 0.9687 0.5682
SGLD 8788.0962 5863.2728 7876.0310 2187.2825 2.1220 0.0759
NUTS 9196.7461 8785.5326 6432.2131 4935.8983 0.5371 1.1642
(C)SVGD 8803.5683 5204.5336 10283.6659 2773.1751 0.7177 0.0366
TABLE II: Consistency metrics of the posterior distributions approximated from the physical double pendulum dataset (subsection B-B) by the different estimation algorithms using the single-shooting likelihood (column “SS”) and the multiple-shooting likelihood (column “MS”) with 10 shooting windows. Note that SVGD with multiple-shooting corresponds to CSVGD.
Fig. 9: Kernel density estimation over trajectory roll-outs from the last estimated 100 parameter guesses of SGLD with the multiple-shooting likelihood model (see subsection B-C), applied to the physical double pendulum dataset from subsection B-B. Similarly to SVGD, SGLD benefits significantly from the smoother likelihood function while being able to cope with the augmented parameter space thanks to its gradient-based approach.

B-D Comparison to Likelihood-free Inference

Our Bayesian inference approach leverages the simulator as part of the likelihood model to approximate posterior distributions over simulation parameters, which means the simulator is indispensable in our estimation process. In the following, we compare our approach against the likelihood-free inference approach BayesSim [ramos_bayessim_2019] that leverages approximate Bayesian computation (ABC) which is the most popular family of algorithms within likelihood-free methods. Likelihood-free methods assume the simulator is a black box that can generate a trajectory given a parameter setting. Instead of querying the simulator to evaluate the likelihood (as in our approach), a conditional density

is learned directly from a dataset of simulation parameters and their corresponding rolled-out trajectories via supervised learning to approximate the posterior. A common choice of model for such density is a mixture density network 

[bishop1994mixture], which parameterizes a Gaussian mixture model. This is in contrast to our (C)SVGD algorithm which can approximate any shape of posterior by being a nonparametric inference algorithm. For our experiments we collect a dataset of 10,000 simulated trajectories of parameters randomly sampled from the prior distribution. We train the density network via the Adam optimizer with a learning rate of

for 3000 epochs, after which we observed no meaningful improvement to the calculated log-likelihood loss during training. In the following, we provide further details on the likelihood-free inference pipeline we consider, by describing the input data processing and the model used for approximating the posterior.

B-D1 Input Data Processing

The input to the learned density model has to be a sufficient statistic of the underlying data, while being low-dimensional in order to keep the learning problem computationally tractable. We consider the following four methods of processing the trajectories that are the input to the likelihood-free methods, as visualized for an example trajectory in Figure 11. Note that we configured the following input processing methods to generate a one-dimensional input vector that has a reasonable length to be computationally feasible to train on (given the 10,000 trajectories from the training dataset), while achieving acceptable performance which we validated through testing various settings.


we down-sample the trajectory so that for the double pendulum experiment (subsection B-B) we use only every 20th state, for the Panda arm experiment (subsection V-C) only every 200-th state of the trajectory. Finally, the state dimensions per trajectory are concatenated to a one-dimensional vector.


we adapt the input statistic from the original BayesSim approach in [ramos_bayessim_2019, Eq. (22)] where the differences of two consecutive states along the trajectory are used in concatenation with their mean and variance:

As before, we down-sample these state differences and concatenate them to a vector.


for each state dimension of the trajectory, we compute the following statistics typical for time series: mean, variance, cross correlation between state dimensions of the trajectory, as well as auto-correlations for each dimension at 5 different time delays: [5, 10, 20, 50, 100] time steps. These numbers are concatenated for all state dimensions to a one-dimensional vector per input trajectory.


we compute the signature transform from the signatory package [kidger2021signatory] over the input trajectory. Such so-called path signatures have been recently introduced to extract information about order and area, thereby preserving features inherent to nonlinear trajectories. We select a depth for the signature transform of 3 for the double pendulum experiment, and 2 for the Panda arm experiment, to obtain feature vectors of comparable size to the aforementioned input techniques.

B-D2 Density Model

As the density model for the learned posterior , we select the following commonly used representations.

Mixture density network (Mdn)

uses neural network features from a feed-forward neural network using two hidden layers with 24 units each.

Mixture density random Fourier features (Mdrff)

this density model uses Fourier features and a kernel. We evaluate the MDRFF with the following common choices for the kernel:

  • Radial Basis Function (RBF) kernel

  • Matérn kernel [rasmussen2005gp, Equation (4.14)] with

B-D3 Evaluation

Note that instead of action generation, which is part of the proposed BayesSim pipeline [ramos_bayessim_2019], we only focus on the inference of the posterior density over simulation parameters in order to compare such likelihood-free inference approach against our method. Finally, to evaluate the metrics shown in Table III for each BayesSim instantiation (input method and density model), we sample 100 parameter vectors from the learned posterior and simulate them to obtain 100 trajectories which are compared against the reference trajectory sets, as we did in the comparison for the other Bayesian inference methods in Table I.

Double Pendulum Experiment Panda Arm Experiment
Input Model MMD
Downsampled MDN 8817.9222 4050.4666 0.6748 -17.4039
Difference MDN 8919.2463 4633.2637 0.6285 -17.1646
Summary MDN 9092.5575 5093.8851 0.5664 -18.3481
Signature MDN 8985.8056 4610.5438 0.5807 -19.3432
Downsampled MDRFF (RBF) 9027.9474 5091.5283 0.5593 -17.2335
Difference MDRFF (RBF) 8936.3823 4282.8599 0.5988 -18.4892
Summary MDRFF (RBF) 9063.1753 4884.1398 0.5672 -19.5430
Signature MDRFF (RBF) 8980.9080 4081.1160 0.6016 -18.3458
Downsampled MDRFF (Matérn) 8818.1830 3794.9873 0.6110 -17.6395
Difference MDRFF (Matérn) 8859.2156 4349.9971 0.6176 -17.2752
Summary MDRFF (Matérn) 8962.0501 4241.4551 0.5999 -19.6672
Signature MDRFF (Matérn) 9036.9626 4620.9517 0.5715 -18.1652
CSVGD 5204.5336 2773.1751 0.0366 -15.1671
TABLE III: Consistency metrics of the posterior distributions approximated by the different BayesSim instantiations, where the input method and model name (including the kernel type for the MDRFF model) are given. Each metric is calculated across simulated and real trajectories. Lower is better on all metrics except the log-likelihood from the Panda arm experiment. For comparison, in the last row, we reproduce the numbers from CSVGD shown in Table I.
MDN – Downsampled MDN – Difference
MDN – Summary MDN – Signature
MDRFF (RBF) – Downsampled MDRFF (RBF) – Difference
MDRFF (RBF) – Summary MDRFF (RBF) – Signature
MDRFF (Matérn) – Downsampled MDRFF (Matérn) – Difference
MDRFF (Matérn) – Summary MDRFF (Matérn) – Signature

Fig. 10: Kernel density estimation over trajectory roll-outs from 100 parameter samples drawn from the posterior of each BayesSim method (model name with kernel choice in bold font + input method, see subsection B-D), applied to the physical double pendulum dataset from subsection B-B. The ground-truth trajectory here stems from the test dataset of 10 trajectories that were held out during training.
(a) Raw Input
(b) Downsampled
(c) Difference
(d) Summary
(e) Signature
Fig. 11: Exemplary visualization of the input processing methods for the likelihood-free baselines from subsection B-D applied to a trajectory from the double pendulum experiment in subsection B-B.

B-D4 Discussion

The results from our experiments with the various likelihood-free approaches in Table III indicate that, among the tested pipelines, the MDRFF model with Matérn kernel and downsampled trajectory input overall performed the strongest, followed by the MDN with downsampled input. In comparison to the likelihood-based algorithms from Table I, these results are comparable on the double pendulum experiment. However, in comparison to CSVGD, the estimated likelihood-free posteriors are significantly less accurate, which can also be clearly seen in the density plots over the rolled out trajectories from such learned densities in Figure 10. On the Panda arm experiment, the likelihood-free methods are outperformed by the likelihood-based algorithms (such as the Emcee sampler) more often on the likelihood of the learned parameter densities. CSVGD again achieves a much more accurate posterior in this experiment than any likelihood-free approach. Why do these likelihood-free methods perform so poorly on a seemingly simple double pendulum? One would expect that this kind of dynamical system poses no greater challenge to BayesSim when it was shown to identify a cartpole’s link length and cart mass successfully [ramos_bayessim_2019]. To investigate this problem, we revisit the simplified double pendulum estimation experiment from subsection V-B

of our main paper, where only the two link lengths need to be estimated from simulated trajectories. As before, we create a dataset with 10,000 trajectories of 400 time steps based on the two simulation parameters sampled from a uniform distribution ranging between

and . While keeping all parameters the same as in our previous double-pendulum experiment where eleven parameters had to be inferred, all of the density models in combination with both the “difference” and “downsampled” input statistic infer a highly accurate parameter distribution, as shown in (a). The trajectories produced by sampling from the BayesSim posterior ((b)) also match the reference observations much more closely than any of the BayesSim models on the previous 11-parameter double pendulum (Figure 10). These results suggest that BayesSim and potentially other likelihood-free method have problems in inferring higher dimensional parameter distributions. The experiments in [ramos_bayessim_2019] demonstrated as many as four parameters being estimated (for the acrobot), while showing inference results for simulated systems only. While our double pendulum system from subsection B-B is basic in principle, the higher dimensional parameter space (see parameters in (a)) and the fact that we need to fit against real-world data makes it a significantly harder problem for most state-of-the-art inference algorithms. CSVGD is able to achieve a close fit thanks to the multiple-shooting segmentation of the trajectory which improves the convergence (see more ablation results for multiple-shooting on this experiment in subsection B-C).

(a) Posterior
(b) Trajectory density
Fig. 12: Results from BayesSim on the simplified double pendulum experiment where only the two link lengths need to be inferred. (a) shows the approximated posterior distribution by the MDN model and “downsampled” input statistics. The diagonal plots show the marginal parameter distributions, the bottom-left heatmap and the top-right contour plot show the 2D posterior where the ground-truth parameters at (, ) are indicated by a red star. The black dots in the top-right plot are 100 parameters sampled from the posterior which are rolled out to generate trajectories for the trajectory density plot in (b). (b) shows a kernel density estimation over these 100 trajectories for the four state dimensions of the double pendulum.

B-E Identify Inertia of an Articulated Rigid Object

The state space consists of the positions and velocities of the seven degrees of freedom of the robot arm and the two degrees of freedom in the universal joint, resulting in a 20-dimensional state vector

consisting of nine joint positions and velocities, plus the PD control position and velocity targets, and , for the actuated joints of the robot arm. We control the arm using the MoveIt! motion planning framework [coleman_david_reducing_2014] by moving joints 6 and 7 to predefined joint-space offsets of and radians, in sequence. We use the default Iterative Parabolic Time Parameterization algorithm with a velocity scaling factor of . We track the motion of the acrylic box via a Vicon motion capture system and derive four Cartesian coordinates as observation to represent the frame of the box (shown in Figure 13): a point of origin located at the center of the upper lid of the box, and three points located away from the origin into the x, y, and z direction (transformed by the reference frame of the box). We chose this state representation to ease the computation of the likelihood, since we only need to compute differences between 3D points instead of computing the distances over 3D rotations which requires special treatment [huynh2009metrics].

Fig. 13: Rendering of the simulation for the system from subsection V-C, where the four reference points for the origin, unit x, y, and z vectors are shown. The trace of the simulated trajectory is visualized by the solid lines, the ground-truth trajectories of the markers are shown as dotted lines.

We first identify the simulation parameters pertaining to the inertial properties of the box and the friction parameters of the universal joint. As shown in (a), the symmetric inertia matrix of the box is fully determined by the first six parameters, followed by the 3D center of mass. We have measured the mass to be , so we do not need to estimate it. We simulate the universal joint with velocity-dependent damping, with possibly different friction coefficients for both degrees of freedom. The simulation parameters yielding the most accurate fit to a ground-truth trajectory from the physical robot shaking an empty box is shown in Figure 14. We were able to find such parameters via SVGD, CSVGD and Emcee (shown is a parameter configuration from the particle distribution estimated by CSVGD with the highest likelihood). While the simulated trajectory matches the real data significantly better after the inertial parameters of the empty box have been identified ((b)) than before ((a)), a reality gap remains. We believe this to be a result from a slight modeling error that the rigid body simulator cannot capture, e.g. the top of the box where the universal joint is attached bends slightly while the box is moving, and there may be slight geometric offsets between the real system and the model of it we use in the simulator. The latter parameters could have been further identified with our approach, nonetheless the simulation given the identified parameters is sufficient to be used in the next phase of the inference experiment.

(a) Before identification of empty box
(b) After identification of empty box
Fig. 14: Trajectories from the Panda robot arm shaking an empty box. Visualized are the simulated (red) and real (black) observations before (a) and after (b) the inertial parameters of the empty box and the friction from the universal joint ((a)) have been identified. The columns correspond to the four reference points in the frame of the box (see a rendering of them in Figure 13), the rows show the , , and axes of these reference points in meters. The horizontal axes show the time step.

Given the parameters found in the first phase, we now investigate how well the various approaches can cope with dependent variables. By fixing two to the bottom of the acrylic box, the 2D locations of such weights need to be inferred. Naturally, such assignment is symmetric, i.e. weight 1 and 2 can swap locations without affecting the dynamics. What would significantly alter the dynamics, however, is an unbalanced configuration of the weights which would cause the box to tilt.

Parameter Minimum Maximum
U-Joint Friction DOF 1
Friction DOF 2
(a) Phase I
Weight Parameter Minimum Maximum
Weight 1 Position
Weight 2 Position
(b) Phase II
TABLE IV: Parameters to be estimated and their ranges for the two estimation phases of the underactuated mechanism experiment from subsection V-C.
(c) Emcee
(d) CEM
(e) SGLD
(f) NUTS
(g) SVGD
Fig. 15: Posterior plots over the 2D weight locations approximated by the estimation algorithms applied to the underactuated mechanism experiment from subsection V-C. Blue shades indicate a Gaussian kernel density estimation computed over the inferred parameter samples. Since it is an unbounded kernel density estimate, the blue shades cross the parameter boundaries in certain areas (e.g. for CSVGD), while in reality none of the estimated particles violate the parameter limits.

We use 50 particles and run each estimation algorithm for 500 iterations. For each baseline method, we carefully tuned the hyper parameters to facilitate a fair comparison. Such tuning included selecting an appropriate measurement noise variance, which, as we observed on Emcee and SGLD in particular, had a significant influence on the exploration behavior of these algorithms. With a larger observation noise variance the resulting posterior distribution became wider, however we were unable to attain such behavior with the NUTS estimator whose iterates quickly collapsed to a single point at the center of the box (see (f)). Similarly, CEM immediately became stuck in the suboptimal configuration shown in (d). Nonetheless, after 500 iterations, all methods predicted weight positions that were aligned opposite to one another to balance the box. As can be seen in (g), SVGD achieves a fairly predictive posterior approximation, with many particles aligned close to the true vertical position at . With the introduction of the multiple-shooting constraints, Constrained SVGD (CSVGD) converges significantly faster to a posterior distribution that accurately captures the true locations of the box, while retaining the exploration performance of SVGD that spreads out the particles over multiple modes, as shown in (h).