I Introduction
Simulators for robotic systems allow for rapid prototyping and development of algorithms and systems [koenig_design_2004], as well as the ability to quickly and cheaply generate training data for reinforcement learning agents and other control algorithms [andrychowicz_learning_2020]. In order for these models to be useful, the simulator must accurately predict the outcomes of real-world interactions. This requires both accurately modeling the dynamics of the environment and correctly identifying the parameters of such models. In this work, we focus on the latter problem of parameter inference. Optimization-based approaches have been applied in the past to find simulation parameters that best match the measured trajectories [chebotar_closing_2019, ramos_bayessim_2019]. However, in many systems we encounter in the real world, the dynamics are highly nonlinear, resulting in optimization landscapes fraught with poor local optima where such algorithms get stuck. Global optimization approaches, such as population-based methods, have been applied [heiden2021neuralsim], but they are sample-inefficient and cannot quantify uncertainty over the predicted parameters. In this work we follow a probabilistic inference approach and estimate belief distributions over the most likely simulation parameters given noisy trajectories of observations from the real system. The relationship between the trajectories and the underlying simulation parameters can be highly nonlinear, hampering commonly used inference algorithms. To tackle this, we introduce a multiple-shooting formulation to the parameter estimation process which drastically improves convergence to high-quality solutions. Leveraging the GPU-based parallelism of a differentiable simulator allows us to efficiently compute likelihoods and evaluate their gradients over many particles simultaneously. Based on Stein Variational Gradient Descent (SVGD), our gradient-based non-parametric inference method allows us to optimize parameters while respecting constraints on parameter limits and continuity between shooting windows. Through various experiments we demonstrate the improved accuracy and convergence of our approach.
Our contributions are as follows: first, we reformulate the commonly used Gaussian likelihood function through the multiple-shooting strategy to allow for the tractable estimation of simulation parameters from long noisy trajectories. Second, we propose a constrained optimization algorithm for non-parametric variational inference with constraints on parameter limits and shooting defects. Third, we leverage a fully differentiable simulator and GPU parallelism to automatically calculate gradients for many particles in parallel. Finally, we validate our system on a simulation parameter estimation problem from real-world data and show that our calculated posteriors are more accurate than those of comparable algorithms, as well as likelihood-free methods.

II Related Work
System identification methods for robotics use a dynamics model whose parameters often enter linearly, in the classical time or frequency domain [vandanjon1995spectrum], and solve for these parameters via least-squares methods [kozlowski_modelling_1998]. Such estimation approaches have been applied, for example, to the identification of inertial parameters of robot arms [khosla1985sysid, atkeson1986inertial, caccavale_identification_1994, vandanjon1995spectrum] with time-dependent gear friction [grotjahn_friction_2001], or parameters of contact models [verscheure2010contact, fazeli2018identifiability]. Parameter estimation has been studied to determine a minimum set of identifiable inertial parameters [gautier1990minset] and to find exciting trajectories which maximize identifiability [antonelli_systematic_1999, gautier_exciting_1991]. More recently, least-squares approaches have been applied to estimate parameters of nonlinear models, such as the constitutive equations of material models [mahnken2017identification, hahn2019real2sim, narang2020tactile] and contact models [kolev2015sysid, lelidec2021diffsim]. In this work, we do not assume a particular type of system to identify, but propose a method for general-purpose differentiable simulators that may combine multiple models whose parameters can influence the dynamics in highly nonlinear ways.

Bayesian methods seek to infer probability distributions over simulation parameters, and have been applied to infer the parameters of dynamical systems in robotic tasks [wang_markov_2011, Muratore2020BayesianDR, Tan2018SimtoRealLA] and complex dynamical systems [ninness_bayesian_2010]. Our approach is a Bayesian inference algorithm which allows us to include priors to find posterior distributions over simulation parameters. The advantages of Bayesian inference approaches have been demonstrated in the areas of uncertainty quantification [ninness_bayesian_2010, eykhoff_chapter_1981], system noise quantification [ting_bayesian_2011] and model parameter inference [qian2003mc, cranmer2020sbi]. Our method is designed for differentiable simulators, which have been developed recently for various areas of modeling, such as articulated rigid-body dynamics [peres2018lcp, hu2020difftaichi, geilinger2020add, Qiao2020Scalable, heiden2021neuralsim, lelidec2021diffsim], deformables [hu2019chainqueen, hu2020difftaichi, murthy2021gradsim, heiden2021disect, Huang2021PlasticineLabAS] and cloth simulation [liang2019diffcloth, hu2020difftaichi, Qiao2020Scalable], as well as sensor simulation [NimierDavidVicini2019Mitsuba2, heiden2020lidar]. Certain physical simulations (e.g. fracture mechanics) may not be analytically differentiable, so surrogate gradients may be necessary [han2018stein]. Without assuming access to the system equations, likelihood-free inference approaches, such as approximate Bayesian computation (ABC), have been applied to the inference of complex phenomena [toni2008abc, papamakarios2016epsilon, ramos_bayessim_2019, hsu2019likelihoodfree, matl_inferring_2020, matl_stressd_2020]. While such approaches do not rely on a simulator to compute the likelihood, our experiments in subsection B-D indicate that the approximate posteriors inferred by likelihood-free methods are less accurate for high-dimensional parameter distributions while requiring significantly more simulation rollouts as training data.
Domain adaptation techniques have been proposed that close the loop between parameter estimation from real observations and improving policies learned in simulation [chebotar_closing_2019, ramos_bayessim_2019, mehta2020adaptiveda, du_autotuned_2021]. Achieving an accurate simulation is typically not the final objective of these methods. Instead, the performance of the policy derived from the calibrated simulator is evaluated, which does not automatically imply that the simulation is accurate [lambert2020mismatch]. In this work, we focus solely on the calibration of the simulator whose parameters need to be inferred.

III Formulation
In this work we address the parameter estimation problem via the Bayesian inference methodology. The posterior over simulation parameters $\theta$ given a set of trajectories $\mathcal{D}^{\text{real}}$ is calculated using Bayes' rule:
$$p(\theta \mid \mathcal{D}^{\text{real}}) = \frac{p(\mathcal{D}^{\text{real}} \mid \theta)\, p(\theta)}{p(\mathcal{D}^{\text{real}})},$$
where $p(\mathcal{D}^{\text{real}} \mid \theta)$ is the likelihood and $p(\theta)$ is the prior distribution over the simulation parameters. We aim to approximate the distribution over the true parameters $\theta^*$, which in general may be intractable to compute. These true parameters generate a set of trajectories $\mathcal{D}^{\text{real}} = \{\mathcal{T}^{\text{real}}_i\}_{i=1}^N$, which may contain some observation noise. We assume that these parameters are unchanging during the trajectory and represent physical quantities, such as friction coefficients or link masses. We assume that each trajectory follows a Hidden Markov Model (HMM) [russell_artificial_2009] which has a known initial state $s_0$ and hidden states $s_1, \dots, s_T$. These hidden states induce an observation model $p(o_t \mid s_t)$. In the simulator, we map simulation states to observations via a deterministic observation function. The transition probability $p(s_{t+1} \mid s_t)$ of the HMM cannot be directly observed but, in the case of a physics engine, can be approximated by sampling from a distribution of simulation parameters and applying a deterministic simulation step function $s_{t+1} = f_{\text{step}}(s_t, t; \theta)$. In subsection B-A, we describe our implementation of $f_{\text{step}}$, the discrete dynamics simulation function. The function $f_{\text{sim}}$ rolls out multiple steps via $f_{\text{step}}$ to produce a trajectory of states $[s_1, \dots, s_T]$ given a parameter vector $\theta$ and an initial state $s_0$. To compute measurements from such state vectors, we use the observation function $f_{\text{obs}}$: $o_t = f_{\text{obs}}(s_t)$. Finally, we obtain a set of simulated trajectories $\mathcal{D}^{\text{sim}}_\theta$ by rolling out $f_{\text{sim}}$ from the initial state of each trajectory in $\mathcal{D}^{\text{real}}$. The initial state of an observed trajectory may be acquired via state estimation techniques, e.g. methods that use inverse kinematics to infer joint positions from tracking measurements. We aim to minimize the Kullback-Leibler (KL) divergence $\operatorname{KL}\!\left(p(\mathcal{D}^{\text{sim}}_\theta) \,\big\|\, p(\mathcal{D}^{\text{real}})\right)$ between the trajectories generated by forward simulating our learned parameter particles and the ground-truth trajectories, while taking into account the priors over simulation parameters. We choose this exclusive KL divergence instead of the opposite direction since it was shown for our choice of particle-based inference algorithm [liu2017svgdgradientflow] that the particles exactly approximate the target measure when an infinite-dimensional functional optimization process minimizes this form of KL divergence.
IV Approach

IV-A Stein Variational Gradient Descent
A common challenge in statistics and machine learning is the approximation of intractable posterior distributions. In the domain of robotics simulators, the inverse problem of inferring high-dimensional simulation parameters from trajectory observations is often nonlinear and non-unique, as there are potentially many parameter values that produce simulated rollouts equally similar to the real dynamical behavior of the system. This often results in non-Gaussian, multimodal parameter estimation problems. Markov Chain Monte Carlo (MCMC) methods are known to be able to find the true distribution, but require an enormous number of samples to converge, which is exacerbated in high-dimensional parameter spaces. Variational inference approaches, on the other hand, approximate the target distribution by a simpler, tractable distribution, which often does not capture the full posterior over parameters accurately [blei_variational_2017]. We present a solution based on the Stein Variational Gradient Descent (SVGD) algorithm [liu_stein_2019] that approximates the posterior distribution by a set of particles $\{\theta_i\}_{i=1}^n$, i.e. $q(\theta) = \frac{1}{n}\sum_{i=1}^n \delta(\theta - \theta_i)$, where $\delta$ is the Dirac delta function, and makes use of differentiable likelihood and prior functions to be efficient. SVGD avoids the computation of the intractable marginal likelihood in the denominator by only requiring the computation of $\nabla_\theta \log p(\mathcal{D}^{\text{real}} \mid \theta)\, p(\theta)$, which is independent of the normalization constant. The particles are adjusted according to the steepest descent direction, in a reproducing kernel Hilbert space (RKHS), that reduces the KL divergence between the current set of particles representing $q$ and the target posterior. As shown in [liu_stein_2019], the set of particles is updated by the following function:
$$\theta_i \leftarrow \theta_i + \epsilon\, \phi^*(\theta_i), \qquad \phi^*(\theta) = \frac{1}{n} \sum_{j=1}^n \left[ k(\theta_j, \theta)\, \nabla_{\theta_j} \log p(\theta_j \mid \mathcal{D}^{\text{real}}) + \nabla_{\theta_j} k(\theta_j, \theta) \right], \quad (1)$$
where $k(\cdot, \cdot)$ is a positive definite kernel and $\epsilon$ is the step size. In this work, we use the radial basis function (RBF) kernel, which is a common choice for SVGD [liu_stein_2019] due to its smoothness and infinite differentiability, and the median heuristic for its bandwidth, which has been shown to provide robust performance on large learning problems [garreau2018median]. SVGD requires that the likelihood function be differentiable in order to calculate Equation 1, which in turn requires that the simulation and observation functions be differentiable. To enable this, we use a fully differentiable simulator, observation function and likelihood function, and leverage automatic differentiation with code generation to generate CUDA kernel code [sanders_cuda_2010]. Thanks to this CUDA code generation, we are able to calculate the likelihood gradient for each particle in parallel on the GPU.

IV-B Likelihood Model for Trajectories
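As an illustration of the update in Equation 1, the following sketch implements one SVGD step in plain NumPy, with an RBF kernel whose bandwidth follows the median heuristic. All names (`rbf_kernel`, `svgd_step`, `log_prob_grad`) are illustrative; our actual implementation generates CUDA kernels and evaluates particles on the GPU.

```python
import numpy as np

def rbf_kernel(particles):
    """RBF kernel matrix and its gradient w.r.t. each particle,
    with the bandwidth set by the median heuristic."""
    diffs = particles[:, None, :] - particles[None, :, :]   # (n, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)                  # (n, n)
    n = particles.shape[0]
    h = np.median(sq_dists) / max(np.log(n + 1), 1e-8)      # median heuristic
    K = np.exp(-sq_dists / (h + 1e-8))
    # grad_K[j, i] = d k(theta_j, theta_i) / d theta_j
    grad_K = -2.0 / (h + 1e-8) * K[:, :, None] * diffs
    return K, grad_K

def svgd_step(particles, log_prob_grad, eps=1e-2):
    """One SVGD update: theta_i += eps * phi(theta_i) (Equation 1)."""
    n = particles.shape[0]
    K, grad_K = rbf_kernel(particles)
    # score term: kernel-weighted gradients of the unnormalized log-posterior
    scores = np.stack([log_prob_grad(p) for p in particles])  # (n, d)
    phi = (K @ scores + grad_K.sum(axis=0)) / n
    return particles + eps * phi
```

The second term in `phi` is the repulsive force that keeps the particles from collapsing onto a single mode, which is what distinguishes SVGD from running parallel gradient ascent on the posterior.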
We define $p(\mathcal{T}^{\text{sim}}_\theta \mid \mathcal{T}^{\text{real}})$ as the probability of an individual trajectory simulated from parameters $\theta$ with respect to a matched ground-truth trajectory $\mathcal{T}^{\text{real}}$. Following the HMM assumption from section III, we define the trajectory likelihood as the product of probabilities over observations, shown in Equation 2. This assumption is justified because, in articulated rigid-body dynamics, the next state is fully determined by the position and velocity of the current state (see subsection B-A). To directly compare observations we use a Gaussian likelihood:
$$p(o^{\text{sim}}_t \mid o^{\text{real}}_t) = \mathcal{N}\!\left(o^{\text{sim}}_t;\; o^{\text{real}}_t,\, \sigma^2_{\text{obs}}\right). \quad (2)$$
This likelihood model for observations is used to compute the likelihood for a trajectory:
$$p(\mathcal{T}^{\text{sim}}_\theta \mid \mathcal{T}^{\text{real}}) = \prod_{t=1}^{T} p(o^{\text{sim}}_t \mid o^{\text{real}}_t), \quad (3)$$
where the rollout starts from $s_0$, (an estimate of) the first state of $\mathcal{T}^{\text{real}}$. This formulation is known as a single-shooting estimation problem, where a trajectory is compared against another only by varying the initial conditions of the modeled system. To evaluate the likelihood for a collection of ground-truth trajectories $\mathcal{D}^{\text{real}} = \{\mathcal{T}^{\text{real}}_i\}_{i=1}^N$, we use an equally weighted Gaussian mixture model likelihood:
$$p(\mathcal{D}^{\text{real}} \mid \theta) = \frac{1}{N} \sum_{i=1}^{N} p(\mathcal{T}^{\text{sim}}_\theta \mid \mathcal{T}^{\text{real}}_i). \quad (4)$$
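A minimal sketch of how Equations 2-4 compose, assuming a diagonal Gaussian observation model with one shared standard deviation `sigma` (a simplifying assumption; the function names are illustrative, not the paper's API). Everything is computed in log space, with a log-sum-exp for the mixture of Equation 4:

```python
import numpy as np

def traj_log_likelihood(sim_obs, real_obs, sigma=0.1):
    """Log of Equation 3: product of per-step Gaussians (Equation 2)
    comparing simulated and ground-truth observations, shape (T, obs_dim)."""
    sim_obs, real_obs = np.asarray(sim_obs), np.asarray(real_obs)
    resid = (sim_obs - real_obs) / sigma
    d = sim_obs.shape[-1]
    log_norm = -0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return np.sum(-0.5 * np.sum(resid ** 2, axis=-1) + log_norm)

def set_log_likelihood(sim_obs, real_trajs, sigma=0.1):
    """Log of Equation 4: equally weighted Gaussian mixture over the
    ground-truth trajectories, via a numerically stable log-sum-exp."""
    logs = np.array([traj_log_likelihood(sim_obs, tr, sigma) for tr in real_trajs])
    m = logs.max()
    return m + np.log(np.mean(np.exp(logs - m)))
```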
IV-C Multiple Shooting
Estimating parameters from long trajectories can prove difficult in the face of observation noise, and for systems in which small changes in initial conditions produce large changes in trajectories, potentially resulting in poor local optima [aydogmus_modified_2020]. We adopt the multiple-shooting method, which significantly improves the convergence and stability of the estimation process. Multiple shooting has been applied in system identification problems [bock_recent_1983] and biochemistry [peifer_parameter_2007]. Multiple shooting divides the trajectory into shooting windows over which the likelihood is computed. To be able to simulate such shooting windows, $f_{\text{sim}}$ requires their start states to generate a shooting-window trajectory. Since we only assume access to the first true state $s_0$ from the real trajectory, we augment the parameter vector by the start states of the shooting windows, which we refer to as shooting variables $\bar{s}_t$ (one for each window start time $t$). We define an augmented parameter vector $\hat{\theta} = [\theta, \bar{s}_h, \bar{s}_{2h}, \dots]$, where $h$ is the shooting-window length. A shooting window of $h$ time steps starting from time $t$ is then simulated via $f_{\text{sim}}(\theta, \bar{s}_t)$ over the steps $t, \dots, t+h$. Analogous to Equation 2, we evaluate the likelihood for a single shooting window as a product of state-wise likelihoods. To impose continuity between the shooting windows, defect constraints are imposed as a Gaussian likelihood term between the last simulated state $s_t$ from the previous shooting window at time $t$ and the shooting variable $\bar{s}_t$ at time $t$:
$$p_{\text{def}}(s_t, \bar{s}_t) = \mathcal{N}\!\left(s_t;\; \bar{s}_t,\, \sigma^2_{\text{def}}\right), \quad (5)$$
where $\sigma^2_{\text{def}}$ is a small variance so that the MCMC samplers adhere to the constraint. Including the defect likelihood allows the extension of the likelihood defined in Equation 3 to a multiple-shooting scenario:
$$p_{\text{ms}}(\mathcal{T}^{\text{sim}}_{\hat\theta} \mid \mathcal{T}^{\text{real}}) = \prod_{\text{windows}} \Big[ \prod_{t \in \text{window}} p(o^{\text{sim}}_t \mid o^{\text{real}}_t) \Big] \prod_{\text{boundaries } t} p_{\text{def}}(s_t, \bar{s}_t). \quad (6)$$
As in the single-shooting case, Equation 4 with $p_{\text{ms}}$ as the trajectory-wise likelihood function gives the likelihood for a set of trajectories. In Figure 2, we provide a parameter estimation example where the two link lengths of a double pendulum must be inferred. The single-shooting likelihood from Equation 3 (shown in Figure 2 (a)) exhibits a rugged landscape where many estimation algorithms will require numerous samples to escape from poor local optima, or a much finer choice of parameter prior to limit the search space. The multiple-shooting likelihood (shown in Figure 2 (b)) is significantly smoother and therefore easier to optimize.
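To make the mechanics concrete, the sketch below rolls out each shooting window from its own start state and collects the defects that the constraint of Equation 5 penalizes. Here `step` is a stand-in for the simulator's step function; all names and shapes are illustrative assumptions.

```python
import numpy as np

def rollout(step, s0, horizon, theta):
    """Roll out `horizon` steps of a generic step function s' = step(s, theta)."""
    states = [np.asarray(s0)]
    for _ in range(horizon):
        states.append(step(states[-1], theta))
    return np.stack(states[1:])

def multiple_shooting_rollout(step, s0, shooting_vars, window, theta):
    """Simulate each shooting window from its own start state and collect
    the defects between consecutive windows (the terms in Equation 5)."""
    starts = [np.asarray(s0)] + [np.asarray(s) for s in shooting_vars]
    windows, defects = [], []
    for i, start in enumerate(starts):
        traj = rollout(step, start, window, theta)
        windows.append(traj)
        if i + 1 < len(starts):
            # defect: last simulated state vs. the next shooting variable
            defects.append(traj[-1] - starts[i + 1])
    return np.concatenate(windows), np.stack(defects)
```

When the shooting variables lie exactly on the true trajectory, all defects vanish and the concatenated windows reproduce the single-shooting rollout; during estimation the defects are driven to zero by the constraint.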
IV-D Parameter Limits as a Uniform Prior
Simulators may not be able to handle simulating trajectories from arbitrary parameter values, and it is often useful to enforce limits on the parameters. To model this, we define a uniform prior distribution over the parameter settings, $p(\theta) = \prod_d \mathcal{U}(\theta_d;\, \theta^{\min}_d, \theta^{\max}_d)$, where $\theta^{\min}_d$ and $\theta^{\max}_d$ denote the lower and upper limits of parameter dimension $d$.
IV-E Constrained Optimization for SVGD
Directly optimizing SVGD for the unnormalized posterior on long trajectories, with a uniform prior, can be difficult for gradient-based optimizers. The uniform prior has discontinuities at the extremities, effectively imposing constraints, which produce non-differentiable regions in the parameter space. We propose an alternative solution to deal with this problem and treat SVGD as a constrained optimization over the likelihood, with the parameter limits and the shooting defects as constraints. A popular method for constrained optimization is the Modified Differential Multiplier Method (MDMM) [platt1988constrained], which augments the cost function to penalize constraint violations. In contrast to the basic penalty method, MDMM uses Lagrange multipliers in place of constant penalty coefficients that are updated automatically during the optimization:
$$\min_\theta\ f(\theta) \quad \text{s.t.} \quad g_i(\theta) = 0 \ \ \forall i, \quad (7)$$
where $f$ denotes the negative log of the unnormalized posterior. MDMM turns this setup into an unconstrained minimization problem by introducing Lagrange multipliers $\lambda_i$ (initialized with zero):
$$\mathcal{L}(\theta, \lambda) = f(\theta) + \sum_i \lambda_i\, g_i(\theta) + \frac{c}{2} \sum_i g_i(\theta)^2, \quad (8)$$
where $c$ is a constant damping factor that improves convergence in gradient descent algorithms. To accommodate these Lagrange multipliers per particle, we again extend the augmented parameter set $\hat\theta$ introduced in subsection IV-C so that it stores the multiple-shooting variables and the Lagrange multipliers, $\hat\theta = [\theta, \bar{s}_h, \bar{s}_{2h}, \dots, \lambda]$. The following update equations for $\theta$ and $\lambda$ are used to minimize Equation 8:
$$\theta \leftarrow \theta - \epsilon\, \nabla_\theta \mathcal{L}(\theta, \lambda), \qquad \lambda_i \leftarrow \lambda_i + \epsilon\, g_i(\theta).$$
We include the parameter limit priors from subsection IV-D as equality constraints of the form $g_d(\theta) = \theta_d - \operatorname{clip}(\theta_d, \theta^{\min}_d, \theta^{\max}_d) = 0$, where $\operatorname{clip}(x, a, b)$ clips the value $x$ to the interval $[a, b]$. The other constraints are the defect constraints from Equation 5, which are included as equality constraints as well: $g_t(\hat\theta) = s_t - \bar{s}_t = 0$ at each window boundary $t$. The overall procedure of our Constrained SVGD (CSVGD) algorithm is given in Algorithm 1.
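The MDMM updates can be sketched on a toy problem as follows; `f_grad`, `g`, and `g_grad` are placeholders for the negative log-posterior gradient and a single equality constraint (in CSVGD this update is applied per particle and per constraint, alongside the SVGD kernel terms):

```python
def mdmm_step(theta, lam, f_grad, g, g_grad, eps=1e-2, c=10.0):
    """One MDMM update (Equation 8): gradient descent on theta, gradient
    ascent on the Lagrange multiplier, with damping term c * g(theta)."""
    gval = g(theta)
    # d/dtheta [ f + lam * g + (c/2) * g^2 ]
    grad_theta = f_grad(theta) + (lam + c * gval) * g_grad(theta)
    theta = theta - eps * grad_theta
    lam = lam + eps * g(theta)   # multiplier ascends on the constraint value
    return theta, lam
```

Unlike a fixed penalty weight, the multiplier `lam` keeps growing while the constraint is violated, so the constraint is eventually satisfied without having to hand-tune an arbitrarily large penalty coefficient.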
IV-F Performance Metrics for Particle Distributions
In most cases we do not know the underlying ground-truth parameter distribution and only have access to a finite set of ground-truth trajectories. Therefore, we measure the discrepancy between trajectories rolled out from the estimated parameter distribution, $\mathcal{D}^{\text{sim}}$, and the reference trajectories, $\mathcal{D}^{\text{real}}$. One measure, the KL divergence, is the expected value of the log-likelihood ratio between two distributions. Although the KL divergence cannot be calculated exactly from samples of continuous distributions, methods have been developed to estimate it from particle distributions using nearest-neighbor distances [wang_divergence_2009]. The KL divergence is non-symmetric, and this estimate can be poor in two situations. The first is estimating $\operatorname{KL}(\mathcal{D}^{\text{sim}} \,\|\, \mathcal{D}^{\text{real}})$ when the particles are all very close to one trajectory, causing a low estimated divergence even though the posterior is of poor quality. The opposite can happen when the particles in the posterior are overly spread out when estimating $\operatorname{KL}(\mathcal{D}^{\text{real}} \,\|\, \mathcal{D}^{\text{sim}})$. We additionally measure the maximum mean discrepancy (MMD) [gretton2012mmd], a metric used to determine whether two sets of samples are drawn from the same distribution by calculating the squared distance between the embeddings of the two distributions in an RKHS.
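For reference, a biased sample estimate of the squared MMD under an RBF kernel can be written compactly; the fixed bandwidth below is an assumption for illustration, not the setting used in our experiments:

```python
import numpy as np

def mmd_squared(X, Y, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets X and Y
    under an RBF kernel (Gretton et al., 2012)."""
    def k(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    # ||mean embedding of X - mean embedding of Y||^2 in the RKHS
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```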
V Experiments
We compare our method against commonly used parameter estimation baselines. As an algorithm comparable to our particle-based approach, we use the Cross-Entropy Method (CEM) [rubinstein1997optimization]. In addition, we evaluate the Markov chain Monte Carlo techniques Emcee [foreman2013emcee], Stochastic Gradient Langevin Dynamics (SGLD) [welling_bayesian_2011], and the No-U-Turn Sampler (NUTS) [hoffman_nouturn_2011], which is an adaptive variant of the gradient-based Hamiltonian MC algorithm. For Emcee, we use a parallel sampler that implements the "stretch move" ensemble method [goodman_ensemble_2010]. For BayesSim, we report results from the best-performing instantiation, a mixture density random Fourier features model (MDRFF) with Matérn kernel, in Table I. We present further details and extended results from our experiments in the appendix.
V-A Parameter Estimation Accuracy
We create a synthetic dataset of 10 trajectories from a double pendulum which we simulate by varying its two link lengths. These two uniquely identifiable [fazeli2018identifiability] parameters are drawn from a Gaussian distribution with a full covariance matrix (density visualized by the red contour lines in Figure 3). We show the evolution of the consistency metrics in Figure 4. Trajectories generated by evaluating the posteriors of the compared methods are compared against 50 test trajectories rolled out from the ground-truth parameter distribution. We find that SVGD produces a density very close to one that matches the principal axes of the ground-truth posterior, and outperforms the other methods on all metrics except log-likelihood. CEM collapses to a single high-probability estimate but does not accurately represent the full posterior, as can be seen in Figure 4 (d) being maximal quickly but poor performance on the other metrics, which measure the spread of the posterior. Emcee and NUTS represent the spread of the posterior but do not sharply capture the high-likelihood areas, as shown by their performance in Figure 4 (a)-(c). SGLD captures a small amount of the posterior around the high-likelihood points but does not fully approximate the posterior.

Table I:

Experiment      | Metric | Emcee     | CEM       | SGLD      | NUTS      | SVGD       | BayesSim  | CSVGD (Ours)
Double Pendulum | …      | 8542.2466 | 8911.1798 | 8788.0962 | 9196.7461 | 8803.5683  | 8818.1830 | 5204.5336
                | …      | 4060.6312 | 8549.5927 | 7876.0310 | 6432.2131 | 10283.6659 | 3794.9873 | 2773.1751
                | MMD    | 1.1365    | 0.9687    | 2.1220    | 0.5371    | 0.7177     | 0.6110    | 0.0366
Panda Arm       | …      | 16.1185   | 17.3331   | 17.3869   | 17.9809   | 17.7611    | 17.6395   | 15.1671
V-B Identify Real-World Double Pendulum
We leverage the dataset from [asseman2018learning] containing trajectories from a physical double pendulum. While in our previous synthetic-data experiment the parameter space was reduced to only the two link lengths, we now define 11 parameters to estimate. The parameters for each link are the mass, the inertia, the center of mass in the two planar directions, and the joint friction. We also estimate the length of the second link. Note that parameters such as the length of the first link are not explicitly included, since they are captured by the remaining parameters, which we validated through a sensitivity analysis. As before, the state space is completely observable up to observation noise. In this experiment we find that CSVGD outperforms all other methods in KL divergence (in both directions) as well as MMD, as shown in Table I. We believe this is because of the complex relationship between the parameters of each link and the resulting observations, which introduces many local minima that are hard to escape from (see Figure 5). The multiple-shooting likelihood improves the convergence significantly by simplifying the optimization landscape.
V-C Identify Inertia of an Articulated Rigid Object
In our final experiment, we investigate a more complicated physical system which is underactuated. Through an uncontrollable universal joint, we attach an acrylic box to the end-effector of a physical 7-DOF Franka Emika Panda robot arm. We fix two weights at the bottom inside the box and prescribe a trajectory which the robot executes through a PD controller. By tracking the motion of the box via a VICON motion capture system, we aim to identify the 2D locations of the two weights (see Figure 1). The system only has access to the proprioceptive measurements from the robot arm (seven joint positions and velocities), as well as the 3D pose of the box at each time step. In a first phase, we identify the inertial properties of the empty box, as well as the joint friction parameters of the universal joint, from real-robot trajectories of shaking an empty box. Given our best estimate, in the actual parameter estimation setup we infer the two planar positions of the weights. The particles from SVGD and CSVGD quickly converge to configurations in which the two weights are aligned opposite each other. If the weights were not at locations symmetric about the center, the box would tilt and yield a large discrepancy from the real observations. MCMC, on the other hand, even after more than ten times the number of iterations, only rarely approaches configurations in which the box remains balanced. In Figure 1 (right) we visualize the posterior over weight locations found by CSVGD (blue shade). The found symmetries are clearly visible when we draw lines (orange) between the inferred positions of the two weights (yellow, green), while the true weight locations (red) are contained in the approximated distribution. As can be seen in Table I, the log-likelihood is maximized by CSVGD. These results indicate the ability of our method to accurately model difficult posteriors over complex trajectories, owing to the symmetries underlying the simulation parameters.
VI Conclusion
We have presented Constrained Stein Variational Gradient Descent (CSVGD), a new method for estimating the distribution over simulation parameters that leverages Bayesian inference and parallel, differentiable simulators. By segmenting the trajectory into multiple shooting windows via hard defect constraints, and by effectively using the likelihood gradient, CSVGD produces more accurate posteriors and exhibits improved convergence over previous estimation algorithms. In future work, we plan to leverage the probabilistic predictions from our simulator for uncertainty-aware control applications. Similar to [lambert_stein_2020], the particle-based uncertainty information can be leveraged by a model-predictive controller that takes into account the multimodality of future outcomes.
Appendix A Technical Details
In this section we provide further technical details on our approach.
A-A Initializing the Estimators
For each experiment, we initialize the particles for the estimators via the deterministic Sobol sequence on the intervals specified by the parameter limits. Our code uses the Sobol sequence implementation from [burkardt2021sobol], which is based on a Fortran77 implementation by [bennett1986sobol]. For the MCMC methods that sample a single Markov chain, we used the center point between the parameter limits as the initial guess.
A-B Likelihood Model for Sets of Trajectories
In this work we assume that the trajectories may have been generated by a distribution over parameters. In the case of a replicable experimental setup, this could be a point distribution at the single set of true parameters. However, when trajectories are collected from multiple robots, or with slightly different experimental setups between experiments, there may be a multimodal distribution over the parameters which generated the set of trajectories. Note that, irrespective of the choice of likelihood function, we do not make any assumption about the shape of the posterior distribution, as SVGD is a non-parametric inference algorithm. In trajectory space, the Gaussian likelihood function is a common choice as it corresponds to the typical least-squares estimation methodology. Other likelihood distributions may be integrated with our method, which we leave to future work. The likelihood we use is a mixture of equally weighted Gaussians centered at each reference trajectory $\mathcal{T}^{\text{real}}_i$:
$$p(\mathcal{D}^{\text{real}} \mid \theta) = \frac{1}{N} \sum_{i=1}^{N} p(\mathcal{T}^{\text{sim}}_\theta \mid \mathcal{T}^{\text{real}}_i). \quad (9)$$
If we were to consider each trajectory as an independent sample from the same trajectory distribution, the likelihood function would instead be the product
$$p(\mathcal{D}^{\text{real}} \mid \theta) = \prod_{i=1}^{N} p(\mathcal{T}^{\text{sim}}_\theta \mid \mathcal{T}^{\text{real}}_i). \quad (10)$$
While both Eqs. (9) and (10) define the likelihood as a combination of single-shooting likelihood functions for a set of real trajectories, the same combination operators (sum or product) apply analogously to the combination of multiple-shooting likelihood functions. The consequence of using these likelihoods can be seen in Figure 6, where (a) shows the resulting posterior distribution (in parameter space) when treating the set of trajectories as independent and taking the product of their likelihoods (Equation 10), while (b) shows the result of treating them as a sum of Gaussian likelihoods (Equation 9). In (a) the posterior becomes the average of the two distributions, since that is the most likely position to have generated both of the distinct trajectories. In contrast, the posterior approximated by the same algorithm (CSVGD) but using the sum of Gaussian likelihoods successfully captures the multimodality in trajectory space, since most particles have aligned near the two modes of the true distribution in parameter space.
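The effect is easy to reproduce in one dimension: collapsing each "trajectory" to a scalar and placing two references at ±1, the product of Equation 10 peaks between the references, while the mixture of Equation 9 stays bimodal. This toy sketch is illustrative only:

```python
import numpy as np

theta = np.linspace(-3.0, 3.0, 601)          # 1-D "parameter" grid
refs = [-1.0, 1.0]                           # two scalar reference trajectories
# Gaussian log-likelihoods with sigma = 0.5 for each reference
log_l = [-2.0 * (theta - r) ** 2 for r in refs]

log_product = log_l[0] + log_l[1]                             # Equation 10
log_mixture = np.logaddexp(log_l[0], log_l[1]) - np.log(2.0)  # Equation 9

product_mode = theta[np.argmax(log_product)]   # lands between the references
mixture_mode = theta[np.argmax(log_mixture)]   # lands on one of the references
```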
A-C State and Parameter Normalization
The parameters we are estimating are valid only over particular ranges of values. These ranges are often widely different: in the case of our real-world pendulum experiment, the center of mass of a link may be on the order of centimeters, while the angular velocity at the beginning of the recorded motion can reach values on the order of meters per second. It is therefore important to scale the parameters to a common range, to prevent any dimension from dominating smaller parameter ranges during the estimation. Similarly, the state dimensions have different units; for example, we typically include both velocities and positions in the recorded data over which we compute the likelihood. Therefore, we also normalize the ranges of the state dimensions. Given a state vector (respectively, a parameter vector), we normalize each dimension $d$ by its statistical variance $\sigma^2_d$, i.e. $\hat{x}_d = x_d / \sigma^2_d$.
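A sketch of the normalization described above (dividing each dimension by its sample variance); the array shape and function name are assumptions for illustration:

```python
import numpy as np

def normalize_dims(X, eps=1e-12):
    """Divide each column (state or parameter dimension) of X by its
    statistical variance, so no dimension dominates the likelihood.
    X has shape (num_samples, dim)."""
    var = X.var(axis=0)
    return X / np.maximum(var, eps)
```

In practice the variances would be computed once from the reference data and reused when scaling simulated rollouts, so that both are mapped through the same transformation.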
A-D k-NN-based Approximation for KL Divergence
In this work, we compare a set of parameter guesses (particles) to the ground-truth parameters, or a set of trajectories generated by simulating from each parameter in the particle distribution to a set of trajectories created on a physical system. To compare these distributions, we use the KL divergence to determine how the two distributions differ from each other. Formally, the KL divergence is the expected value of the log-likelihood ratio between two distributions, and is an asymmetric divergence that does not satisfy the triangle inequality. The KL divergence is easily computable in the case of discrete distributions or simple parametric distributions, but is not easily calculable for samples from non-parametric distributions such as those over trajectories. Instead, we use an approximation to the KL divergence which uses the relative distances between samples in a set to estimate the KL divergence between particle distributions. This method has been used to compare particle distributions over robot poses to assess the performance of particle filters [chou_performance_2011]. To estimate the KL divergence between particle distributions over trajectories $X = \{x_i\}_{i=1}^n$ and $Y = \{y_j\}_{j=1}^m$, we adopt the formulation from [wang_divergence_2009, chou_performance_2011]:
$$\widehat{\operatorname{KL}}(X \,\|\, Y) = \frac{d}{n} \sum_{i=1}^{n} \log \frac{\nu_l(i)}{\rho_k(i)} + \frac{1}{n} \sum_{i=1}^{n} \left[ \psi(k) - \psi(l) \right] + \log \frac{m}{n-1}, \quad (11)$$
where $d$ is the dimensionality of the trajectories, $n$ is the number of trajectories in $X$, $m$ is the number of trajectories in $Y$, $\rho_k(i)$ is the distance from trajectory $x_i$ to its $k$-th nearest neighbor in $X \setminus \{x_i\}$, $\nu_l(i)$ is the distance from $x_i$ to its $l$-th nearest neighbor in $Y$, and $\psi$ is the digamma function. Note that this approximation of the KL divergence can also be applied to compare parameter distributions, as we show in the synthetic-data experiment from subsection V-A (cf. Figure 4 (a) and (b)) where the ground-truth parameter distribution is known. Throughout this work, we set $k$ and $l$ to 3, as this reduces the bias in the approximation but does not require a large number of samples from the ground-truth distribution.
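A direct NumPy transcription of Equation 11 with $k = l$ (so the digamma terms cancel), using brute-force distance computation for clarity; this is a sketch, not the implementation used in our experiments:

```python
import numpy as np

def knn_kl_estimate(X, Y, k=3):
    """k-NN estimate of KL(X || Y) following Wang et al. (2009),
    i.e. Equation 11 with k = l, so the digamma terms cancel."""
    X, Y = np.asarray(X), np.asarray(Y)
    n, d = X.shape
    m = Y.shape[0]

    def kth_dist(A, B, k, skip_self):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        dists = np.sqrt(np.sort(d2, axis=1))
        # skip the zero self-distance when A and B are the same set
        return dists[:, k] if skip_self else dists[:, k - 1]

    rho = kth_dist(X, X, k, skip_self=True)   # k-th NN within X
    nu = kth_dist(X, Y, k, skip_self=False)   # k-th NN in Y
    return (d / n) * np.sum(np.log(nu / rho)) + np.log(m / (n - 1))
```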
Appendix B Experiments
In the following, we provide further technical details and results from the experiments we present in the main paper.
B-A Differentiable Simulator
Other than requiring a differentiable forward dynamics model which allows to simulate the system in its entirety following the Markov assumption, our proposed algorithm does not rely on a particular choice of dynamical system or simulator for which its parameters need to be estimated. For our experiments, we use the Tiny Differentiable Simulator [heiden2021neuralsim] that implements endtoend differentiable contact models and the Articulated Body Algorithm (ABA) [featherstone2007rbda] to compute the forward dynamics (FD) for articulated rigidbody mechanisms. Given joint positions , velocities , torques in generalized coordinates, and external forces , ABA calculates the joint accelerations . We use semiimplicit Euler integration to advance the system dynamics in time for a time step :
(12)  
The second-order ODE described by Equation 12 is lowered to a first-order system with state $s = [q \;\; \dot{q}]$. Furthermore, we deal primarily with the discrete time-stepped dynamics function $f_{\text{step}}(s_t, t; \theta)$, assuming that $\Delta t$ is constant. The function $f_{\text{sim}}$ applies $f_{\text{step}}$ iteratively to produce a trajectory of states given an initial state $s_0$ and the parameters $\theta$. Many systems of practical interest for robotics are controlled by an external input. In our formulation for parameter estimation, we include controls as explicit dependencies on the time parameter $t$. For an articulated rigid-body system, the parameters $\theta$ may include (but are not limited to) the masses, inertial properties, and geometrical properties of the bodies in the mechanism, as well as joint and contact friction coefficients. Given $\partial f_{\text{step}}/\partial s$ and $\partial f_{\text{step}}/\partial \theta$, gradients of the state trajectories with respect to the simulation parameters can be computed directly through the chain rule, or via the adjoint sensitivity method [pontryagin2018mathematical].
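The time-stepping scheme above can be sketched in a few lines. The pendulum dynamics and the parameter tuple used below are illustrative placeholders, not the ABA-based model from the simulator:

```python
import numpy as np

def f_step(state, t, theta, dt=1e-3):
    """One semi-implicit Euler step for a single pendulum (an illustrative
    stand-in for the ABA forward dynamics; theta = (length, damping) is a
    hypothetical parameter vector)."""
    q, qd = state
    length, damping = theta
    qdd = -9.81 / length * np.sin(q) - damping * qd  # forward dynamics
    qd_next = qd + qdd * dt    # velocity update uses the current acceleration
    q_next = q + qd_next * dt  # position update uses the *new* velocity
    return np.array([q_next, qd_next])

def f_sim(state0, theta, num_steps, dt=1e-3):
    """Roll out a trajectory by applying f_step iteratively
    (Markov assumption: each state depends only on the previous one)."""
    traj = [np.asarray(state0, dtype=float)]
    for i in range(num_steps):
        traj.append(f_step(traj[-1], i * dt, theta, dt))
    return np.stack(traj)
```

Semi-implicit (symplectic) Euler is a common choice here because it preserves the oscillation amplitude of such systems far better than explicit Euler at the same step size.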

B-B Identify Real-World Double Pendulum
We set up the double pendulum estimation experiment with the 11 parameters shown in (a) to be estimated. The state space consists of the positions and velocities of both joints: $s = [q_1 \; q_2 \; \dot{q}_1 \; \dot{q}_2]$. The dataset of trajectories contains image sequences (see the time lapse of an excerpt from a trajectory in (b)) and annotated pixel coordinates of the three vertices of the double pendulum, from which we extracted joint positions and velocities (via finite differencing, given the known recording frequency). Since we know that all trajectories in this dataset stem from the same double pendulum [asseman2018learning], we only used a single reference trajectory as the target trajectory during the estimation. We let each estimator run for 2000 iterations. For evaluation, we calculate the consistency metrics from Table I over 10 held-out trajectories from a test dataset. For comparison, we visualize the trajectory density over simulations rolled out from the last 100 Markov samples (or 100 particles in the case of particle-based approaches) in Figure 8. The ground truth shown in these plots again stems from the unseen test dataset. This experiment further demonstrates the generalizability of simulation-based inference which, when an adequate model has been implemented and its parameters identified, can predict outcomes under various novel conditions, even though the training dataset consisted of only a single trajectory in this example.
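The velocity extraction step can be sketched as follows (the actual recording frequency and the pixel-to-joint-angle conversion are omitted; `fps` is a placeholder for the known camera rate):

```python
import numpy as np

def joint_velocities(q, fps):
    """Estimate joint velocities from a sequence of joint positions via
    finite differencing: central differences in the interior, one-sided
    differences at the ends. fps is the known recording frequency."""
    q = np.asarray(q, dtype=float)
    dt = 1.0 / fps
    return np.gradient(q, dt, axis=0)
```

For a linear ramp of joint angles, this recovers the constant angular velocity exactly.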
[Figure 8 panels: trajectory densities for Emcee, CEM, NUTS, SGLD, SVGD, and CSVGD]
B-C Ablation Study on Multiple Shooting
We evaluate the baseline estimation algorithms with our proposed multiple-shooting likelihood function (using 10 shooting windows) on the physical double pendulum dataset from before. To make the constrained optimization problem amenable to the MCMC samplers, we formulate the defect constraint through the likelihood defined in Equation 5, where we tune the defect variance to a small value such that the defects are minimized during the estimation. As we describe in subsection IV-C, the parameter space is augmented by the shooting variables. As shown in Table II, the MCMC approaches Emcee and NUTS do not benefit meaningfully from the multiple-shooting approach. Emcee often yields unstable simulations from which we are not able to compute some of the metrics. The increased dimensionality of the parameter space appears to pose a significant challenge to these methods, which are known to scale poorly to higher dimensions. Despite being configured to use a Gaussian mixture model with 3 kernels, the CEM posterior immediately collapses to a single point, such that the KL divergence of simulated against real trajectories cannot be computed. We observe a significant improvement in estimation accuracy for SGLD, where the multiple-shooting approach allows it to converge to closely matching trajectories, as shown in Figure 9. As with SVGD, the availability of gradients allows this method to scale better to the higher-dimensional parameter space, while the smoothed likelihood landscape further helps the approach find better-fitting parameters.
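The multiple-shooting likelihood can be sketched as follows. The Gaussian forms, variances, and names are illustrative stand-ins for Equation 5 of the paper; the key idea is that each window is rolled out from its own shooting variable, and "defects" between a window's end state and the next window's start are penalized under a tight Gaussian:

```python
import numpy as np

def multiple_shooting_loglik(theta, shooting_states, target, f_step,
                             num_windows, sigma_obs=0.1, sigma_def=1e-3):
    """Sketch of a multiple-shooting log-likelihood: the trajectory is split
    into num_windows segments, each simulated from its shooting variable
    s_i; the small sigma_def drives the defects toward zero during
    estimation (all names and Gaussian forms are illustrative)."""
    T = len(target)
    w = T // num_windows
    loglik = 0.0
    for i in range(num_windows):
        state = shooting_states[i]
        for t in range(i * w, (i + 1) * w):
            # observation term: simulated state vs. measured state
            loglik += -0.5 * np.sum((state - target[t]) ** 2) / sigma_obs ** 2
            state = f_step(state, t, theta)
        if i + 1 < num_windows:
            # defect term ties the segment end to the next shooting variable
            defect = state - shooting_states[i + 1]
            loglik += -0.5 * np.sum(defect ** 2) / sigma_def ** 2
    return loglik
```

Because each window only simulates a short horizon, gradients stay well-behaved, which is what smooths the likelihood landscape relative to single shooting.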
Table II: Consistency metrics with the single-shooting (SS) and multiple-shooting (MS) likelihood per algorithm.

Algorithm | D_KL(p_real ‖ p_sim) SS / MS | D_KL(p_sim ‖ p_real) SS / MS | MMD SS / MS
Emcee     | 8542.2466 / 8950.4574  | 4060.6312 / N/A        | 1.1365 / N/A
CEM       | 8911.1798 / 8860.5115  | 8549.5927 / N/A        | 0.9687 / 0.5682
SGLD      | 8788.0962 / 5863.2728  | 7876.0310 / 2187.2825  | 2.1220 / 0.0759
NUTS      | 9196.7461 / 8785.5326  | 6432.2131 / 4935.8983  | 0.5371 / 1.1642
(C)SVGD   | 8803.5683 / 5204.5336  | 10283.6659 / 2773.1751 | 0.7177 / 0.0366
B-D Comparison to Likelihood-free Inference
Our Bayesian inference approach leverages the simulator as part of the likelihood model to approximate posterior distributions over simulation parameters, which means the simulator is indispensable in our estimation process. In the following, we compare our approach against the likelihood-free inference approach BayesSim [ramos_bayessim_2019], which leverages approximate Bayesian computation (ABC), the most popular family of algorithms among likelihood-free methods. Likelihood-free methods assume the simulator is a black box that can generate a trajectory given a parameter setting. Instead of querying the simulator to evaluate the likelihood (as in our approach), a conditional density is learned directly from a dataset of simulation parameters and their corresponding rolled-out trajectories via supervised learning to approximate the posterior. A common choice of model for such a density is a mixture density network [bishop1994mixture], which parameterizes a Gaussian mixture model. This is in contrast to our (C)SVGD algorithm, which, being a non-parametric inference algorithm, can approximate a posterior of any shape. For our experiments, we collect a dataset of 10,000 simulated trajectories with parameters randomly sampled from the prior distribution. We train the density network via the Adam optimizer for 3000 epochs, after which we observed no meaningful improvement in the log-likelihood loss during training. In the following, we provide further details on the likelihood-free inference pipeline we consider by describing the input data processing and the model used for approximating the posterior.
B-D1 Input Data Processing
The input to the learned density model has to be a sufficient statistic of the underlying data, while being low-dimensional in order to keep the learning problem computationally tractable. We consider the following four methods of processing the trajectories that are the input to the likelihood-free methods, as visualized for an example trajectory in Figure 11. Note that we configured the following input processing methods to generate a one-dimensional input vector that is short enough to be computationally feasible to train on (given the 10,000 trajectories from the training dataset), while achieving acceptable performance, which we validated by testing various settings.
Downsampled:
we downsample the trajectory so that for the double pendulum experiment (subsection B-B) we use only every 20th state, and for the Panda arm experiment (subsection V-C) only every 200th state of the trajectory. Finally, the state dimensions per trajectory are concatenated to a one-dimensional vector.
Difference:
we adapt the input statistic from the original BayesSim approach in [ramos_bayessim_2019, Eq. (22)], where the differences of consecutive states along the trajectory are used in concatenation with their mean and variance:

$$\psi(s_{0:T}) = \big(s_1 - s_0,\; \ldots,\; s_T - s_{T-1},\; \mu(\cdot),\; \sigma^2(\cdot)\big).$$

As before, we downsample these state differences and concatenate them to a vector.
Summary:
for each state dimension of the trajectory, we compute the following statistics typical for time series: mean, variance, cross-correlation between state dimensions of the trajectory, as well as auto-correlations for each dimension at 5 different time delays: [5, 10, 20, 50, 100] time steps. These numbers are concatenated for all state dimensions into a one-dimensional vector per input trajectory.
Signature:
we compute the signature transform via the signatory package [kidger2021signatory] over the input trajectory. Such so-called path signatures have recently been introduced to extract information about order and area, thereby preserving features inherent to nonlinear trajectories. We select a signature transform depth of 3 for the double pendulum experiment and 2 for the Panda arm experiment to obtain feature vectors of a size comparable to the aforementioned input techniques.
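The first two input statistics above can be sketched as follows (strides and shapes are illustrative):

```python
import numpy as np

def downsampled_input(traj, stride):
    """'Downsampled' statistic: keep every stride-th state of the
    trajectory (T, D), then flatten into one vector."""
    return traj[::stride].ravel()

def difference_input(traj, stride):
    """'Difference' statistic (after [ramos_bayessim_2019]): consecutive
    state differences, downsampled and flattened, concatenated with their
    per-dimension mean and variance."""
    diffs = np.diff(traj, axis=0)
    return np.concatenate([diffs[::stride].ravel(),
                           diffs.mean(axis=0), diffs.var(axis=0)])
```

Both reduce a long trajectory to a fixed, short feature vector that the density network can consume.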
B-D2 Density Model
As the density model for the learned posterior, we select the following commonly used representations.
Mixture density network (MDN)
uses features from a feed-forward neural network with two hidden layers of 24 units each.
Mixture density random Fourier features (MDRFF)
this density model uses random Fourier features computed from a kernel. We evaluate the MDRFF with the following common kernel choices:

- Radial basis function (RBF) kernel
- Matérn kernel [rasmussen2005gp, Equation (4.14)]
B-D3 Evaluation
Note that instead of action generation, which is part of the proposed BayesSim pipeline [ramos_bayessim_2019], we focus only on the inference of the posterior density over simulation parameters, in order to compare the likelihood-free inference approach against our method. Finally, to evaluate the metrics shown in Table III for each BayesSim instantiation (input method and density model), we sample 100 parameter vectors from the learned posterior and simulate them to obtain 100 trajectories, which are compared against the reference trajectory sets, as in the comparison of the other Bayesian inference methods in Table I.
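The sampling step of this evaluation can be sketched as follows (diagonal covariances assumed; names are illustrative):

```python
import numpy as np

def sample_gmm_posterior(pi, mu, sigma, num_samples, rng):
    """Draw parameter vectors from a learned Gaussian-mixture posterior
    with weights pi (K,), means mu (K, D) and diagonal standard
    deviations sigma (K, D)."""
    comps = rng.choice(len(pi), size=num_samples, p=pi)
    eps = rng.standard_normal((num_samples, mu.shape[1]))
    return mu[comps] + sigma[comps] * eps
```

Each sampled parameter vector is then passed through the simulator, and the resulting trajectories are compared against the reference set via the consistency metrics.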
Table III: Evaluation of the likelihood-free instantiations (input statistic and density model) against CSVGD.

Input       | Model          | Double Pendulum: D_KL(p_real ‖ p_sim) | D_KL(p_sim ‖ p_real) | MMD    | Panda Arm: D_KL(p_real ‖ p_sim)
Downsampled | MDN            | 8817.9222 | 4050.4666 | 0.6748 | 17.4039
Difference  | MDN            | 8919.2463 | 4633.2637 | 0.6285 | 17.1646
Summary     | MDN            | 9092.5575 | 5093.8851 | 0.5664 | 18.3481
Signature   | MDN            | 8985.8056 | 4610.5438 | 0.5807 | 19.3432
Downsampled | MDRFF (RBF)    | 9027.9474 | 5091.5283 | 0.5593 | 17.2335
Difference  | MDRFF (RBF)    | 8936.3823 | 4282.8599 | 0.5988 | 18.4892
Summary     | MDRFF (RBF)    | 9063.1753 | 4884.1398 | 0.5672 | 19.5430
Signature   | MDRFF (RBF)    | 8980.9080 | 4081.1160 | 0.6016 | 18.3458
Downsampled | MDRFF (Matérn) | 8818.1830 | 3794.9873 | 0.6110 | 17.6395
Difference  | MDRFF (Matérn) | 8859.2156 | 4349.9971 | 0.6176 | 17.2752
Summary     | MDRFF (Matérn) | 8962.0501 | 4241.4551 | 0.5999 | 19.6672
Signature   | MDRFF (Matérn) | 9036.9626 | 4620.9517 | 0.5715 | 18.1652
—           | CSVGD          | 5204.5336 | 2773.1751 | 0.0366 | 15.1671
[Figure 10 panels: posterior trajectory densities for each combination of density model (MDN, MDRFF with RBF and Matérn kernels) and input statistic (Downsampled, Difference, Summary, Signature)]
B-D4 Discussion
The results from our experiments with the various likelihood-free approaches in Table III indicate that, among the tested pipelines, the MDRFF model with the Matérn kernel and downsampled trajectory input performed the strongest overall, followed by the MDN with downsampled input. On the double pendulum experiment, these results are comparable to the likelihood-based algorithms from Table I. However, in comparison to CSVGD, the estimated likelihood-free posteriors are significantly less accurate, which can also be clearly seen in the density plots over the trajectories rolled out from such learned densities in Figure 10. On the Panda arm experiment, the likelihood-free methods are outperformed more often by the likelihood-based algorithms (such as the Emcee sampler) on the likelihood of the learned parameter densities. CSVGD again achieves a much more accurate posterior in this experiment than any likelihood-free approach.

Why do these likelihood-free methods perform so poorly on a seemingly simple double pendulum? One would expect that this kind of dynamical system poses no greater challenge to BayesSim, given that it was shown to successfully identify a cart-pole's link length and cart mass [ramos_bayessim_2019]. To investigate this problem, we revisit the simplified double pendulum estimation experiment from subsection V-B of our main paper, where only the two link lengths need to be estimated from simulated trajectories. As before, we create a dataset with 10,000 trajectories of 400 time steps, based on the two simulation parameters sampled from a uniform prior distribution. While keeping all other settings the same as in our previous double pendulum experiment where eleven parameters had to be inferred, all of the density models, in combination with both the "difference" and "downsampled" input statistics, infer a highly accurate parameter distribution, as shown in (a). The trajectories produced by sampling from the BayesSim posterior ((b)) also match the reference observations much more closely than any of the BayesSim models on the previous 11-parameter double pendulum (Figure 10). These results suggest that BayesSim, and potentially other likelihood-free methods, have problems inferring higher-dimensional parameter distributions. The experiments in [ramos_bayessim_2019] demonstrated at most four parameters being estimated (for the acrobot), while showing inference results for simulated systems only. While our double pendulum system from subsection B-B is simple in principle, the higher-dimensional parameter space (see the parameters in (a)) and the fact that we need to fit against real-world data make it a significantly harder problem for most state-of-the-art inference algorithms. CSVGD is able to achieve a close fit thanks to the multiple-shooting segmentation of the trajectory, which improves convergence (see more multiple-shooting ablation results for this experiment in subsection B-C).

B-E Identify Inertia of an Articulated Rigid Object
The state space consists of the positions and velocities of the seven degrees of freedom of the robot arm and the two degrees of freedom of the universal joint, resulting in a 20-dimensional state vector consisting of the nine joint positions and velocities, plus the PD control position and velocity targets for the actuated joints of the robot arm. We control the arm using the MoveIt! motion planning framework [coleman_david_reducing_2014] by moving joints 6 and 7 to predefined joint-space offsets, in sequence. We use the default Iterative Parabolic Time Parameterization algorithm with a reduced velocity scaling factor. We track the motion of the acrylic box via a Vicon motion capture system and derive four Cartesian coordinates as observations to represent the frame of the box (shown in Figure 13): a point of origin located at the center of the upper lid of the box, and three points offset from the origin along the x, y, and z directions (transformed by the reference frame of the box). We chose this state representation to ease the computation of the likelihood, since we only need to compute differences between 3D points instead of distances over 3D rotations, which require special treatment [huynh2009metrics].

We first identify the simulation parameters pertaining to the inertial properties of the box and the friction parameters of the universal joint. As shown in (a), the symmetric inertia matrix of the box is fully determined by the first six parameters, followed by the 3D center of mass. We have measured the mass beforehand, so we do not need to estimate it. We simulate the universal joint with velocity-dependent damping, with possibly different friction coefficients for the two degrees of freedom. The simulation parameters yielding the most accurate fit to a ground-truth trajectory from the physical robot shaking an empty box are shown in Figure 14. We were able to find such parameters via SVGD, CSVGD, and Emcee (shown is the parameter configuration with the highest likelihood from the particle distribution estimated by CSVGD).
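The four-point frame representation can be sketched as follows (the offset distance used here is an illustrative placeholder; the text does not specify the actual value):

```python
import numpy as np

def frame_points(R, p, offset=0.1):
    """Represent a pose (rotation matrix R, origin p) by four 3D points:
    the origin plus points offset along the local x, y and z axes.
    Differences between such point sets can be computed directly, avoiding
    distance metrics on 3D rotations."""
    pts = [p]
    for axis in range(3):
        pts.append(p + offset * R[:, axis])
    return np.stack(pts)  # shape (4, 3)
```

The likelihood then only needs Euclidean distances between corresponding points of the observed and simulated frames.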
While the simulated trajectory matches the real data significantly better after the inertial parameters of the empty box have been identified ((b)) than before ((a)), a reality gap remains. We believe this is the result of a slight modeling error that the rigid-body simulator cannot capture: for example, the top of the box where the universal joint is attached bends slightly while the box is moving, and there may be small geometric offsets between the real system and the model of it we use in the simulator. The latter parameters could have been identified further with our approach; nonetheless, the simulation given the identified parameters is sufficient to be used in the next phase of the inference experiment.
Given the parameters found in the first phase, we now investigate how well the various approaches cope with dependent variables. By fixing two weights to the bottom of the acrylic box, the 2D locations of these weights need to be inferred. Naturally, this assignment is symmetric, i.e., weights 1 and 2 can swap locations without affecting the dynamics. What would significantly alter the dynamics, however, is an unbalanced configuration of the weights, which would cause the box to tilt.


We use 50 particles and run each estimation algorithm for 500 iterations. For each baseline method, we carefully tuned the hyperparameters to facilitate a fair comparison. Such tuning included selecting an appropriate measurement noise variance which, as we observed for Emcee and SGLD in particular, has a significant influence on the exploration behavior of these algorithms. With a larger observation noise variance the resulting posterior distribution becomes wider; however, we were unable to attain such behavior with the NUTS estimator, whose iterates quickly collapsed to a single point at the center of the box (see (f)). Similarly, CEM immediately became stuck in the suboptimal configuration shown in (d). Nonetheless, after 500 iterations, all methods predicted weight positions that were aligned opposite to one another to balance the box. As can be seen in (g), SVGD achieves a fairly predictive posterior approximation, with many particles aligned close to the true vertical position. With the introduction of the multiple-shooting constraints, Constrained SVGD (CSVGD) converges significantly faster to a posterior distribution that accurately captures the true weight locations, while retaining the exploration behavior of SVGD that spreads the particles over multiple modes, as shown in (h).
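The particle behavior discussed above follows directly from the SVGD update, which combines a kernel-weighted gradient term (driving particles toward high posterior density) with a repulsive term (spreading them over modes). A minimal sketch with an RBF kernel and median-heuristic bandwidth, which is our own simplification rather than the constrained CSVGD update:

```python
import numpy as np

def svgd_step(particles, grad_logp, step_size=1e-2):
    """One SVGD update. particles is (n, d); grad_logp is (n, d), the
    gradient of the log posterior at each particle (in the paper, obtained
    in parallel from the differentiable simulator). The kernel-gradient
    term pushes nearby particles apart."""
    n, d = particles.shape
    diff = particles[:, None, :] - particles[None, :, :]  # (n, n, d)
    sq = np.sum(diff ** 2, axis=-1)                       # (n, n)
    h = np.median(sq) / np.log(n + 1) + 1e-8              # median heuristic
    K = np.exp(-sq / h)                                   # RBF kernel matrix
    grad_K = -2.0 / h * diff * K[:, :, None]              # grad of k(x_j, x_i) w.r.t. x_j
    phi = (K @ grad_logp + grad_K.sum(axis=0)) / n
    return particles + step_size * phi
```

Even with zero posterior gradient, two nearby particles repel each other, which is the mechanism that keeps the particle set spread over multiple weight configurations rather than collapsing like CEM or NUTS did in this experiment.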