Planning these push manipulations requires a forward model. There are many ways to express and acquire such a model, from analytic mechanics to machine learning, as well as hybrid techniques. There are several open problems, of which we address two. First, push planning typically does not take account of all the types of uncertainty in the predictions of the forward model. Second, when using purely learned models, push planning has only been demonstrated for single pushes, not for complex push sequences. In this work we present a combined solution to these problems.
Uncertainty in prediction comes from two sources. First, it can arise from small variations in physical properties, such as shape and friction, that are hard to measure, but which significantly alter action effects. Second, in a learned forward model it can arise from a paucity of data. In this paper, we use two different learning methods to explicitly predict a distribution over push outcomes, including both types of uncertainty.
When planning an action sequence the robot can take account of regions of high uncertainty. This is important because actions in uncertain regions of the state space can lead to unrecoverable states. We model these as incurring a cost that rises with the uncertainty predicted by the forward model. But what push planner can we use? The choice is complicated by the fact that the overall cost function is typically not a differentiable function of the actions. In this case, a path integral formulation [15, 28, 29] works well. We also re-plan at each step to account for model inaccuracies. Thus, our planner is a model predictive path integral (MPPI) controller that performs uncertainty averse pushing.
The technical contributions of this work are: (i) applying ensembles of MDNs and Gaussian Processes to learn uncertainty aware forward models of push manipulation; (ii) using a learnt forward model with model predictive path integral (MPPI) control to push an object to a given goal pose with a many step plan; and (iii) two algorithms for uncertainty averse push planning.
The paper is organised as follows. Section III reviews existing work on push learning and planning. Section IV gives a brief background on the elements of our approach. Section V explains the problem formulation and algorithms. Section VI details the experimental study. Finally, Section VII is a discussion.
III Related Work
III-A Learning Models for Pushing
There are many approaches to modelling push effects based on classical analytic mechanics [23, 22, 25, 24]. These require modelling and knowledge of parameters such as friction, mass, inertia, and centre of gravity, of the manipulated object, manipulator, and other objects in contact. Accurate identification of these parameters is hard. Even if this is solved, approximations used in rigid body engines can render poor predictions. An alternative is to learn a model from data [11, 1, 26, 18], or to use a hybrid approach [32, 3]. Learning methods divide into data-intensive and data-efficient methods.
Data-intensive methods, such as deep learning, adopt self-supervision for data collection, allowing the creation of large datasets. In [1], for example, a siamese architecture learns a forward and an inverse model for pushing. The forward model is used as a regulariser for the inverse model. Limitations are the use of a discrete action space and the lack of a representation of predictive uncertainty. Finn et al. [11] used an auto-encoder based forward model within a model predictive control scheme to find push actions based on image input. However, this model also lacks knowledge of the predictive uncertainty and has only been shown to achieve single step push manipulations. Pinto et al. [26] used push models to improve grasp performance. Again, predictive uncertainty in the learnt models was not explored.
There are also data-efficient approaches to learning push effects [18, 19, 2]. These models typically use hand-crafted features, and have not yet been used for push planning. They do, however, represent uncertainty in the outcome, including, in , the ability to predict meta-uncertainty due to a lack of data.
III-B Estimating Predictive Uncertainty
The ability to predict both uncertainty in dynamics and meta-uncertainty in this dynamics model is useful for robot planning. The policy search method PILCO [8] utilises Gaussian Processes (GPs) [27]. GPs, however, scale poorly with the amount of training data. Representations such as Gaussian Mixture Regression scale better and can estimate dynamic uncertainty, but not the meta-uncertainty. A neural network approach to representing uncertainty in dynamics is to learn the parameters of a mixture density. This is termed a mixture density network (MDN). This can be extended to add meta-uncertainty due to a lack of data in various ways. One way is to use dropout. Another way is to use an ensemble of MDNs and adversarial training [20]. We use this latter approach.
III-C Path Integral Applications for Control
Stochastic optimal control (SOC) deals with both uncertainty in the action and sensor models, and the resultant state uncertainty. The sequence of control commands is found by minimising an integral of individual step costs (called the running cost) along a given trajectory. The SOC problem is defined by a Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE) corresponding to the system to be controlled. This can be solved numerically backwards in time, given the system’s initial and target configurations. This is straightforward for linear systems with quadratic costs, but non-trivial for non-linear systems.
However, by using the Feynman-Kac theorem, a non-linear HJB can be converted into a linear PDE, which can be solved via forward sampling of trajectories [28, 15]. This formulation can cope with arbitrary state costs that need not be differentiable, and is applicable to a wide range of non-linear systems. Recently, researchers have explored its benefits for robot control [31, 29, 30]. We make use of the path integral framework. Specifically, we apply the model predictive path integral control algorithm proposed by Williams et al. [29, 30].
Classical optimal control deals with finding a set of control actions that solves the problem (typically for deterministic systems) and is optimal with respect to a cost function. The general framework is to design an agent as an automaton that seeks to minimise a cost function for a fixed or varying time horizon [17]. There are typically two methods for solving an optimal control problem: the HJB formulation (finding a solution using dynamic programming) and Pontryagin’s Minimum Principle (PMP) (finding a solution to the resulting ordinary differential equations). These formulations are typically interested in finding a globally optimal solution.
In this paper we define a trajectory as being a sequence of states , actions and associated uncertainty at time step . Thus, let be a convenient tuple, such that . Then, using to index a trajectory and to index a discrete timestep, the optimal control problem is cast by defining a cost function for a trajectory starting from timestep until , i.e. :
where, is the immediate cost function and is the final cost function. The aim is to find a policy that minimises the above cost function. The value function for a state is defined as the minimum cumulative cost the agent can obtain from a state, if it proceeds optimally from that state to the goal. The value function for a state can be defined as:
At any state , the aim is to find a set of control commands or actions that would minimise the expected cumulative cost from that state. Note that in Eq 2 the expectation is taken over all possible paths (i.e. trajectories) and thus the index is dropped for convenience.
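As a minimal sketch of the trajectory cost described above, the snippet below sums an immediate cost over the steps of a trajectory and adds a final cost. The names `trajectory_cost`, `immediate_cost` and `final_cost`, the `(state, action, uncertainty)` tuple layout and the one-dimensional toy values are illustrative assumptions, not the notation of this paper.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical step tuple: (state, action, uncertainty), all 1-D toy values.
Step = Tuple[float, float, float]


def trajectory_cost(
    steps: Sequence[Step],
    final_state: float,
    immediate_cost: Callable[[Step], float],
    final_cost: Callable[[float], float],
) -> float:
    """Sum the immediate cost over all steps and add the final cost."""
    return sum(immediate_cost(s) for s in steps) + final_cost(final_state)


goal = 1.0
steps = [(0.0, 0.5, 0.1), (0.5, 0.5, 0.2)]
cost = trajectory_cost(
    steps,
    final_state=1.0,
    # Assumed immediate cost: squared distance to goal plus uncertainty.
    immediate_cost=lambda s: (s[0] - goal) ** 2 + s[2],
    # Assumed final cost: heavily weighted squared distance to goal.
    final_cost=lambda x: 10.0 * (x - goal) ** 2,
)
```

The value function of a state would then be the minimum expected value of this quantity over all action sequences starting from that state.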
We now describe the proposed uncertainty averse push planner, which comprises two parts. First, an uncertainty calibrated forward model is learnt with data collected from a variety of pushes. Second, using the learnt model, we formalise our planning problem as Model Predictive Path Integral control (MPPI) and detail our planning algorithm. Later we describe another path integral based approach to find a low cost trajectory that can be followed by using the MPPI controller.
V-A Forward Models with Predictive Uncertainty
There are various ways to capture uncertainty in predictions. Gaussian Processes, for example, provide an effective and theoretically clean tool to make uncertainty aware predictions [27]. The main problem with Gaussian processes (GPs) is their inability to scale to high dimensional spaces or large data sets. We therefore also utilise an ensemble of mixture density networks (E-MDN).
Arbitrary multi-modal densities can also be learnt via gradient descent with Mixture Density Networks (MDNs) [4], as defined in equation 3:

$$p(\mathbf{y} \mid \mathbf{x}) = \sum_{m=1}^{M} \alpha_m(\mathbf{x}) \, \mathcal{N}\big(\mathbf{y};\, \boldsymbol{\mu}_m(\mathbf{x}),\, \boldsymbol{\Sigma}_m(\mathbf{x})\big) \qquad (3)$$

where $\alpha_m$, $\boldsymbol{\mu}_m$ and $\boldsymbol{\Sigma}_m$ are the network outputs which form the parameters of the mixture. In order to make sure the network outputs valid parameters for the mixture, Bishop [4] suggests using a softmax layer to represent $\alpha$, such that $\sum_{m} \alpha_m = 1$, whereas an exponential layer is able to guarantee that $\boldsymbol{\Sigma}$ is positive definite, and finally $\boldsymbol{\mu}$ can be a linear combination of hyperbolic tangent activation functions. The reader is encouraged to refer to Bishop [4] and Graves [13] for further details on practical considerations in implementing MDNs.
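The output constraints suggested by Bishop can be illustrated with a small, framework-free sketch for a one-dimensional target. The function names `mdn_head` and `mixture_pdf` and the raw output values are hypothetical; a real MDN would produce the raw outputs from its final hidden layer and would be trained by minimising the negative log of this density.

```python
import math
from typing import List, Sequence, Tuple


def mdn_head(
    raw_alpha: Sequence[float], raw_sigma: Sequence[float], raw_mu: Sequence[float]
) -> Tuple[List[float], List[float], List[float]]:
    """Map unconstrained network outputs to valid 1-D mixture parameters."""
    # Softmax (shifted by the max for numerical stability): weights sum to one.
    shift = max(raw_alpha)
    exps = [math.exp(a - shift) for a in raw_alpha]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Exponential keeps every standard deviation strictly positive.
    sigma = [math.exp(s) for s in raw_sigma]
    # Means are passed through unchanged (a linear output layer).
    mu = list(raw_mu)
    return alpha, sigma, mu


def mixture_pdf(
    y: float, alpha: Sequence[float], sigma: Sequence[float], mu: Sequence[float]
) -> float:
    """Density of a 1-D Gaussian mixture evaluated at y."""
    return sum(
        a * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for a, s, m in zip(alpha, sigma, mu)
    )


# Illustrative raw outputs for a three-component mixture.
alpha, sigma, mu = mdn_head([0.0, 1.0, -1.0], [-0.5, 0.0, 0.3], [-1.0, 0.0, 2.0])
density = mixture_pdf(0.0, alpha, sigma, mu)
```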
Concretely, for learning a forward model, we define , where is the state at time step , (position and orientation of the box on the plane) , and is the action taken at that time, encoded as a direction vector  and a single real value , normalised between zero and one, indicating the contact location on the box edge, i.e. (all quantities are given in the object frame). Furthermore, as has been demonstrated by [20], one can form an ensemble of such models so as to estimate the uncertainty of the forward model’s predictive distribution. If an ensemble is composed of members, the final model can be written as:
The statistics of interest that we compute from the ensemble are the mean prediction and the variance, which, when trained accordingly, can reflect the predictive uncertainty of the model, and are given by:
Thus, the predictive uncertainty we refer to throughout this paper represents both inherent uncertainty in dynamics, and meta-uncertainty due to lack of data, as given by Eq 6.
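To illustrate how an ensemble can expose both kinds of uncertainty, the sketch below combines the predictions of several members under the simplifying assumption that each member is summarised by a single one-dimensional Gaussian and that members are weighted uniformly, as in deep ensembles [20]. The function name `ensemble_moments` and the numbers are hypothetical, and the exact form used in this paper may differ.

```python
from typing import Sequence, Tuple


def ensemble_moments(
    means: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Mean and variance of a uniformly weighted mixture of 1-D Gaussians."""
    k = len(means)
    mean = sum(means) / k
    # Law of total variance: average member variance plus the spread of
    # the member means around the ensemble mean.
    second_moment = sum(v + m * m for m, v in zip(means, variances)) / k
    return mean, second_moment - mean * mean


# Members agree: the total variance equals the shared member variance.
_, var_agree = ensemble_moments([1.0, 1.0, 1.0], [0.04, 0.04, 0.04])
# Members disagree (e.g. little training data nearby): the variance grows.
_, var_disagree = ensemble_moments([0.5, 1.0, 1.5], [0.04, 0.04, 0.04])
```

The first term of the variance reflects the noise each member attributes to the dynamics, while the disagreement between member means acts as a proxy for meta-uncertainty.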
V-B Uncertainty Averse Model Predictive Path Integral Control
Once we have a forward model, we need a way to use this model to find the right sequence of push commands to move the object to the goal. When solving a task, it is easier to exploit the already known part of the state-action space rather than exploring new parts. This suggests moving through more certain regions of the state-action space. We use the path integral based approach to model predictive control of Williams et al. [29, 30].
The path integral formulation for stochastic optimal control permits one to find policy updates by calculating expectations over trajectory roll outs . It provides an alternative to directly solving the non-linear HJB equation via backward integration, and allows one to find the command updates that minimise the cost-to-go in Eq 7 as a weighted average over forward sampled trajectories.
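The weighted average over forward sampled trajectories can be sketched as follows. This assumes the common MPPI form in which each rollout is weighted by the exponential of its negative cost-to-go divided by a temperature; the function name `path_integral_update`, the `temperature` parameter and the scalar controls are illustrative assumptions.

```python
import math
from typing import List, Sequence


def path_integral_update(
    costs: Sequence[float],
    disturbances: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> List[float]:
    """Weighted average of sampled control disturbances.

    costs[k] is the cost-to-go of rollout k; disturbances[k][t] is the
    disturbance that rollout applied at step t (scalar controls for brevity).
    """
    # Subtracting the minimum cost leaves the normalised weights unchanged
    # but avoids underflow in the exponential.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    horizon = len(disturbances[0])
    return [
        sum(w * d[t] for w, d in zip(weights, disturbances)) / total
        for t in range(horizon)
    ]


# Three rollouts over a two-step horizon; the cheapest rollout dominates.
update = path_integral_update(
    costs=[1.0, 5.0, 9.0],
    disturbances=[[0.2, -0.1], [-0.3, 0.4], [0.5, 0.5]],
)
```

Because only cost evaluations are needed, the cost function does not have to be differentiable with respect to the actions.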
Given the cost-to-go in Eq 7, one wants to find the optimal action sequence as . Note that in this work the cost is also a function of the predictive uncertainty , in addition to the state and controls . The importance of each sample is given by a weight, defined as the exponential of the cost-to-go , which is given by:
Thus, the control command updates are calculated as an expectation, or weighted average, over sampled control disturbances with weights equal to . Here, control disturbances are in a similar manner to that defined in . However, we introduce an exploration decay parameter , i.e.
where is the time step magnitude, with , which has same dimensionality as the control actions . The parameter can be seen as a constant responsible for controlling the magnitude or level of exploration of the sampled disturbances , whereas a suitable exploration decay schedule for helps to ensure local convergence even when the number of sampled trajectories is small. In all experiments presented in this paper is always initialised as and geometrically decays over a chosen number of decay steps as detailed in Algorithm 1.
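A geometric exploration decay of this kind could look like the sketch below. The schedule parameters (`initial`, `rate`, `decay_steps`) and the plain Gaussian scaling are assumptions made for illustration; the dependence on the time step magnitude is omitted, and the schedule in Algorithm 1 may differ in detail.

```python
import random
from typing import List


def decay_factor(
    iteration: int, initial: float = 1.0, rate: float = 0.9, decay_steps: int = 10
) -> float:
    """Geometric decay, held constant once `decay_steps` is reached."""
    return initial * rate ** min(iteration, decay_steps)


def sample_disturbances(
    rng: random.Random,
    n_rollouts: int,
    horizon: int,
    exploration: float,
    decay: float,
) -> List[List[float]]:
    """Zero-mean Gaussian disturbances scaled by exploration * decay."""
    scale = exploration * decay
    return [
        [scale * rng.gauss(0.0, 1.0) for _ in range(horizon)]
        for _ in range(n_rollouts)
    ]


# Same random draws, scaled by the decay factor at iterations 0 and 10.
early = sample_disturbances(random.Random(0), 4, 3, 0.5, decay_factor(0))
late = sample_disturbances(random.Random(0), 4, 3, 0.5, decay_factor(10))
```

Shrinking the disturbances over iterations narrows the search around the current plan, which is what allows a small number of rollouts to settle on a local solution.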
By using a model predictive approach we are able to incorporate feedback into the system. At each state a look-ahead window of time steps is used, starting at the current time step. Then only the first control command of the planned steps is executed on the robot. After this first push, the new state is fed back into the optimiser and the process is repeated until the task converges.
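The receding-horizon loop described above can be sketched on a toy one-dimensional system. Everything here is an assumption made for illustration: the additive `model` stands in for the learnt forward model, the quadratic `cost` for the pushing cost, and the parameter values are arbitrary.

```python
import math
import random
from typing import Callable, List


def mppi_step(
    state: float,
    controls: List[float],
    model: Callable[[float, float], float],
    cost: Callable[[float], float],
    rng: random.Random,
    n_rollouts: int = 64,
    noise: float = 0.3,
    temperature: float = 0.1,
) -> List[float]:
    """One MPPI update of the nominal control sequence from `state`."""
    costs, noises = [], []
    for _ in range(n_rollouts):
        eps = [rng.gauss(0.0, noise) for _ in controls]
        x, total = state, 0.0
        for u, e in zip(controls, eps):
            x = model(x, u + e)  # roll the forward model ahead
            total += cost(x)     # accumulate the running cost
        costs.append(total)
        noises.append(eps)
    c_min = min(costs)
    w = [math.exp(-(c - c_min) / temperature) for c in costs]
    w_sum = sum(w)
    return [
        u + sum(wk * e[t] for wk, e in zip(w, noises)) / w_sum
        for t, u in enumerate(controls)
    ]


def run_receding_horizon(
    start: float, goal: float, steps: int = 30, horizon: int = 5, seed: int = 0
) -> float:
    """Re-plan at every step and execute only the first planned command."""
    rng = random.Random(seed)

    def model(x: float, u: float) -> float:
        return x + u  # toy stand-in for the learnt forward model

    def cost(x: float) -> float:
        return (x - goal) ** 2

    state, controls = start, [0.0] * horizon
    for _ in range(steps):
        controls = mppi_step(state, controls, model, cost, rng)
        state = model(state, controls[0])  # execute the first command only
        controls = controls[1:] + [0.0]    # shift the plan before re-planning
    return state


final_state = run_receding_horizon(start=0.0, goal=1.0)
```

On a robot, the line that applies `model` to obtain the next state would be replaced by executing the push and observing the resulting object pose.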
The immediate cost function used in our formulation is
and the final cost is given by
By adding the term in equation 11, the samples that pass through an uncertain region are penalised more and hence contribute less to the control update in equation 9. Throughout the remainder of the paper, the state of the object to be pushed is defined by its position and orientation on the plane under the quasi-static assumption, subject to only planar motion, i.e. $\mathbf{x} = (x, y, \theta)$.
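The effect of an uncertainty penalty on the rollout weights can be seen in a two-rollout toy example. The additive cost with gains `goal_gain` and `uncertainty_gain` is an assumed form chosen for illustration and is not claimed to match equation 11.

```python
import math
from typing import List, Sequence


def immediate_cost(
    distance_to_goal: float,
    uncertainty: float,
    goal_gain: float = 1.0,
    uncertainty_gain: float = 0.0,
) -> float:
    """Assumed cost: weighted squared goal distance plus weighted uncertainty."""
    return goal_gain * distance_to_goal ** 2 + uncertainty_gain * uncertainty


def normalised_weights(costs: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Exponentiated negative costs, normalised to sum to one."""
    c_min = min(costs)
    w = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(w)
    return [x / total for x in w]


# Two rollouts equally far from the goal; the second crosses an uncertain region.
distances = [0.5, 0.5]
uncertainties = [0.1, 2.0]

plain = normalised_weights(
    [immediate_cost(d, s) for d, s in zip(distances, uncertainties)]
)
averse = normalised_weights(
    [immediate_cost(d, s, uncertainty_gain=1.0) for d, s in zip(distances, uncertainties)]
)
```

With the penalty switched off both rollouts contribute equally; with it switched on, the rollout through the uncertain region receives a much smaller weight in the control update.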
V-C An Alternative Approach to Uncertainty Averse Planning
The performance of MPPI for uncertainty averse planning depends on the cost function defined. The challenge of the cost function defined in equation 7 is to optimise the trade off between and .
There is a different approach that involves decoupling goal finding and uncertainty reduction. First, a simple path is defined from start to goal. Then the learnt forward model can be used to find a low uncertainty variation of this path. Finally, this can be given to the model predictive controller to follow. The decoupling of the uncertainty cost and the final goal cost gives a significant improvement in performance. We propose a path integral based approach to find a low cost trajectory in the state-action space that can then be followed by the MPPI. The formulation derives inspiration from the STOMP planner [14], though our formulation uses a different cost function and improvement equations. Apart from finding a low cost trajectory, kinematic constraints are imposed on the optimisation problem to find paths within the workspace of the system.
The problem of finding a low uncertainty trajectory can be formulated as:
where, for a state at time the uncertainty predicted is and is a positive constant. The uncertainty for each point in the heat-map is estimated via Monte Carlo sampling: by iterating over a subset of states (e.g. states on a grid) and sampling random actions for each state, the average uncertainty for each state is estimated using the learnt forward model with Eq 6. With this uncertainty heat-map, Algorithm 2 starts with an initial trajectory leading from start to goal, which could be a simple linear interpolation. The optimisation is performed over the interior states of this interpolation only, leaving the first and last states fixed, thus ensuring that the path still leads from the start to the goal. The initial trajectory is improved iteratively by using Eq 8 for the cost defined in Eq 13. The iterative update is defined by:
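The heat-map construction and the iterative improvement of an initial path can be sketched as below. This is a STOMP-flavoured illustration under stated assumptions, not a reproduction of Eq 13 or of the update used in Algorithm 2: the path is a list of lateral offsets, `toy_uncertainty` is a hypothetical stand-in for the learnt model, and the accept-only-if-better rule is a safeguard added for the sketch.

```python
import math
import random
from typing import Callable, List


def heat_map(
    grid: List[float],
    predict_uncertainty: Callable[[float, float], float],
    rng: random.Random,
    n_actions: int = 100,
) -> List[float]:
    """Average predicted uncertainty per grid state over random actions."""
    return [
        sum(predict_uncertainty(y, rng.uniform(-1.0, 1.0)) for _ in range(n_actions))
        / n_actions
        for y in grid
    ]


def nearest_heat(y: float, grid: List[float], heat: List[float]) -> float:
    """Look up the heat-map value of the grid cell closest to y."""
    step = grid[1] - grid[0]
    i = min(max(round((y - grid[0]) / step), 0), len(grid) - 1)
    return heat[i]


def improve_path(
    path: List[float],
    cost: Callable[[List[float]], float],
    rng: random.Random,
    iterations: int = 50,
    n_samples: int = 20,
    noise: float = 0.2,
    temperature: float = 0.1,
) -> List[float]:
    """Iteratively lower the cost of a path while keeping both ends fixed."""
    best = list(path)
    for _ in range(iterations):
        # Perturb interior waypoints only, so start and goal are preserved.
        samples = [
            [0.0] + [rng.gauss(0.0, noise) for _ in best[1:-1]] + [0.0]
            for _ in range(n_samples)
        ]
        costs = [cost([y + e for y, e in zip(best, eps)]) for eps in samples]
        c_min = min(costs)
        w = [math.exp(-(c - c_min) / temperature) for c in costs]
        w_sum = sum(w)
        candidate = [
            y + sum(wk * eps[i] for wk, eps in zip(w, samples)) / w_sum
            for i, y in enumerate(best)
        ]
        if cost(candidate) < cost(best):  # keep only improving updates
            best = candidate
    return best


def toy_uncertainty(y: float, action: float) -> float:
    # Hypothetical model: uncertain near y = 0, more so for large actions.
    return math.exp(-y * y / 0.1) * (1.0 + 0.5 * abs(action))


rng = random.Random(0)
grid = [-1.0 + 0.1 * i for i in range(21)]
heat = heat_map(grid, toy_uncertainty, rng)


def path_cost(path: List[float]) -> float:
    return sum(nearest_heat(y, grid, heat) for y in path)


straight = [0.0] * 7  # straight line through the uncertain region
better = improve_path(straight, path_cost, rng)
```

The resulting path could then be handed to the MPPI controller as a sequence of intermediate targets, with no uncertainty term needed in its running cost.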
Experiments 1 and 2 described below validate the fundamental ability to push objects to a goal location using E-MDN as the learnt forward model on a real robot (Baxter), without considering uncertainty in the cost function. Experiments 3 and 4 demonstrate in simulation that the same approach is able to find uncertainty averse trajectories for pushing an object to a given goal. For the experiments on the real robot, we found that uncertainty averse pushes achieved in simulation were sometimes hard to reproduce on the Baxter platform due to kinematic limitations of the robot, i.e. the inverse kinematic solver sometimes failed to find solutions for planned pushes. Thus, general pushing is shown on the real robot, whereas uncertainty averse pushing (which requires using a much larger workspace) is demonstrated in simulation only. We first describe the experiments on the real robot, and then the simulation experiments that demonstrate uncertainty averse pushing. (See the video summarising the proposed approach and experiments: https://youtu.be/LjYruxwxkPM)
VI-A Experiment 1
In this experiment we trained the E-MDN model on a set of 326 pushes, gathered from the Baxter robot (see the data collection setup, in which the robot applies random pushes to the object and periodically resets the box to an initial location: https://youtu.be/pRDvkDkCSTQ). The E-MDN utilised had members in the ensemble, each member being an MDN with 3 hidden layers with 20 neurons per layer. The number of mixtures in each MDN was chosen to be , and training used the Adam [9] optimiser. We utilised a batch size of 5 and a learning rate of 0.001. Following the training protocol described by [20], we used 0.005 as the adversarial coefficient for generating adversarial examples for training. Figures 2(a) and (b) show the predictions given by the trained E-MDN model and the GP model respectively.
To show that uncertainty rises when the amount of available data is limited, we also gathered data from randomised pushes in Box-2D [5]. We then lesioned the data for various parts of the state space, and trained the E-MDN model. The uncertainty should be higher in the lesioned parts of the space, and this is what we see in Figure 4, in which the average uncertainty for a given box state was obtained by marginalising over the action space at that state using Monte-Carlo action samples.
VI-B Experiment 2
We performed experiments with the real robot, using the MPPI planner to plan with the learnt models from Experiment 1, and also with a physics simulator suitable for planar pushing called Box-2D [5]. We use this to investigate whether the push planning framework can be combined with a variety of forward models to achieve many-step push manipulations. Some push sequences are visualised in Figure 3. These show that both learning methods terminate with poses close to, but not at, the goal, in terms of both orientation and position. Table I shows that both substantially reduce the cost in paired trials, and that there is no clear difference in performance between the different predictors underpinning MPPI. The cost parameters for the MPPI were chosen to be , , . The optimisation horizon was set to and the number of sampled trajectory rollouts was chosen to be , , with , , and .
TABLE I (column headings): Treatment | Starting pose | Initial cost | Final cost | Steps
VI-C Experiment 3
This experiment utilises Algorithm 1 together with the cost function defined by Equation 11, set to penalise uncertainty for the first 150 pushes, with the uncertainty penalty switched off thereafter. The aim of this experiment was to show in simulation that, with the right trade-off between the goal and the uncertainty gains, a box can be pushed to a desired location while avoiding regions with high uncertainty. The E-MDN was trained with 261 pushes collected from simulation. The parameters of the model were , in which each MDN member of the ensemble had a single hidden layer with 25 units, and the number of mixtures was set to . The MPPI parameters were set to , , , , , , (exploration decay not utilised) and the uncertainty penalty was set to for 150 pushes, and then afterwards. The results for this first experiment are shown in Figures 5 (a) and (b), in which two distinct trajectories are obtained as a result of either penalising or not penalising model uncertainty.
VI-D Experiment 4
Finally, this experiment makes use of Algorithm 2, which first performs an optimisation to find a trajectory with low uncertainty cost. This trajectory is then followed by the Algorithm 1 push controller, but this time we do not need to penalise uncertainty in its running cost, since the trajectory has already been optimised for that. The results for this experiment are shown in Fig 6.
A push planning approach that uses a learnt forward model is presented. The push planner is also capable of taking into account the reliability of the learnt model. Initially we showed how a learnt forward model can be used by a real robot to push the object to a target location using the MPPI approach. Later, we showed how this basic MPPI can be modified to accommodate predictive uncertainty in two ways. In the first algorithm the uncertainty is directly inserted into the MPPI cost function. In the second formulation, a trajectory that is uncertainty averse is pre-computed using a path-integral update (Algorithm 2) and MPPI (Algorithm 1) is used to follow it. We have shown that both algorithms exhibit the desired behaviour subject to tuning. In addition we have created the data gathering framework on a Baxter robot, and shown that E-MDNs and GPs produce very similar estimates of model uncertainty for real data. Experiments showed that Algorithm 1 works on the real robot, and that either one or both learning methods outperformed a physics simulator, when used as the forward model for planning.
[1] P Agrawal, A Nair, P Abbeel, J Malik, and S Levine. Learning to poke by poking: Experiential learning of intuitive physics. CoRR-2016.
[2] M Bauza and A Rodriguez. A probabilistic data-driven model for planar pushing. In Proc. of IEEE, ICRA-2017.
[3] D Belter, M Kopicki, S Zurek, and J Wyatt. Kinematically optimised predictions of object motion. In Proc. of the IEEE, IROS-2014.
[4] C M Bishop. Mixture density networks. Technical report, 1994.
[5] Erin Catto. Box2d: A 2d physics engine for games, 2011.
[6] A Cosgun, T Hermans, V Emeli, and M Stilman. Push planning for object placement on cluttered table surfaces. In Proc. of the IEEE, IROS-2011.
[7] B Da Silva, G Konidaris, and A Barto. Learning parameterized skills. arXiv preprint arXiv:1206.6398, 2012.
[8] M Deisenroth and C E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In ICML-2011.
[9] D P Kingma and J Ba. Adam: A method for stochastic optimization. CoRR-2014.
[10] M Dogar and S Srinivasa. A framework for push-grasping in clutter. Robotics: Science and systems VII, 2011.
[11] C Finn, I J Goodfellow, and S Levine. Unsupervised learning for physical interaction through video prediction. CoRR-2016.
[12] Y Gal and Z Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML-2016.
[13] A Graves. Generating sequences with recurrent neural networks. CoRR-2013.
[14] M Kalakrishnan, S Chitta, E Theodorou, P Pastor, and S Schaal. Stomp: Stochastic trajectory optimization for motion planning. In Proc. of the IEEE, ICRA-2011.
[15] H J Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of statistical mechanics: theory and experiment, 2005.
[16] H J Kappen. Optimal control theory and the linear bellman equation. 2011.
[17] D E Kirk. Optimal control theory: an introduction. Courier Corporation, 2012.
[18] M Kopicki, S Zurek, R Stolkin, T Moerwald, and J L Wyatt. Learning modular and transferable forward models of the motions of push manipulated objects. Autonomous Robots-2016.
[19] M Kopicki, S Zurek, R Stolkin, T Mörwald, and J L Wyatt. Learning to predict how rigid objects behave under simple manipulation. In Proc. of the IEEE, ICRA-2011.
[20] B Lakshminarayanan, A Pritzel, and C Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2016.
[21] W Li and E Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO-2004.
[22] K Lynch. The mechanics of fine manipulation by pushing. In Proc. of the IEEE, ICRA-1992.
[23] M T Mason. Manipulator grasping and pushing operations. PhD thesis, MIT, 1982.
[24] M T Mason. Mechanics of robotic manipulation. MIT press, 2001.
[25] M A Peshkin and A C Sanderson. The motion of a pushed, sliding workpiece.
[26] L Pinto and A Gupta. Learning to push by grasping: Using multiple tasks for effective learning. CoRR-2016.
[27] C E Rasmussen. Gaussian processes for machine learning. 2006.
[28] E Theodorou, J Buchli, and S Schaal. A generalized path integral control approach to reinforcement learning. JMLR-2010.
[29] G Williams, A Aldrich, and E Theodorou. Model predictive path integral control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149, 2015.
[30] G Williams, P Drews, B Goldfain, J M Rehg, and E A Theodorou. Aggressive driving with model predictive path integral control. In IEEE, ICRA-2016.
[31] Y Chebotar, M Kalakrishnan, A Yahya, A Li, S Schaal, and S Levine. Path integral guided policy search. CoRR-2016.
[32] J Zhou, J A Bagnell, and M T Mason. A fast stochastic contact model for planar pushing and grasping: Theory and experimental validation. In RSS-2017.