Robot manipulation has been extensively studied for decades and there is a large body of work on the manipulation of rigid and deformable objects. Compared to the manipulation of a rigid object, whose state can be completely described by a point in a six-dimensional configuration space, deformable object manipulation (DOM) is more challenging due to the very high dimensionality of its configuration space. The resulting manipulation algorithm needs to handle this dimensional complexity and maintain tension to perform the task. DOM has many important applications, including cloth folding [1, 2]; robot-assisted dressing or household chores [3, 4]; ironing; coat checking; sewing; string insertion; robot-assisted surgery and suturing [9, 10]; and transporting large materials like cloth, leather, and composite materials.
There are two main challenges in performing DOM tasks. First, we need a feature representation of the object's status that can account for the object's possibly large deformations. The design of such a feature is usually task-dependent and object-specific. For instance, a cloth needs richer features than a rope, and a surgical application requires more sophisticated features than a consumer application. As a result, feature design is usually a tedious manual procedure. Second, we need to develop a controller that maps the object's state features to a robot control action in order to achieve a specific task. The design of such a controller is usually also task-dependent. For instance, a manipulation controller for human dressing requires a more expressive controller parametrization than one for flattening a piece of paper. As a result, controller design usually requires tedious manual parameter tuning when generalizing to different tasks. More importantly, feature extraction and controller design have traditionally been solved as two separate problems. With the recent development of deep (reinforcement) learning, a prominent approach [12, 13] is to represent feature extraction and controller parametrization as two neural networks, which are trained either jointly or separately. However, the design decisions for these two networks, e.g., their network structures, are made independently. This approach suffers from the large number of parameters that need to be manually determined. There are more task-specific methods, such as [14, 5], which use a vision-based feature extractor, but the controller is a standalone optimization-based motion planner whose formulation is independent of the feature extraction. These methods are hard to extend because the feature-based objective functions for trajectory optimization are task-specific.
There is a rich literature on autonomous robotic DOM addressing the above challenges, and these works can be classified into three categories. The first group of approaches requires a physical model of the object's deformation properties, in terms of stiffness, Young's modulus, or FEM coefficients, to design a control policy [15, 9, 16, 17, 18, 19]. However, such deformation parameters are difficult to estimate accurately and may even change during the manipulation process, especially for objects made of nonlinear elastic or plastic materials. The approaches in the second group use manually designed low-dimensional features to model the object's deformation behavior [20, 18, 17, 21] and then use an adaptive linear controller to accomplish the task. The approaches in the final group do not explicitly model the deformation parameters of the object; instead, they use vision or learning methods to accomplish tasks directly [1, 10, 22, 23, 24, 14, 25, 26]. These methods focus on high-level policies but cannot achieve accurate operations; their success rates are low, and some of them are open-loop. As a result, designing appropriate features and controllers to achieve accurate and flexible deformable object manipulation is still an open research problem in robotics.
In this paper, we focus on designing a general learning-based feedback control framework for accurate and flexible DOM. The feedback controller’s input is a feature representation for the deformable object’s current status and its output is the robotic end-effector’s movement. Our framework provides solutions to both the feature design and controller parameterization challenges. For feature design, we propose both a set of manually designed low-level features and a novel higher-level feature based on a histogram of oriented wrinkles, which is automatically extracted from data to describe the shape variation of a highly deformable object. For controller parameterization, we first propose a novel nonlinear feedback controller based on Gaussian Process Regression (GPR), which learns the object’s deformation behavior online and can accomplish a set of challenging DOM tasks accurately and reliably. We further design a controller that treats feature extraction and controller design as a coupled problem by using a random forest trained using two-stage learning. During the first stage, we construct a random forest to classify a sampled dataset of images of deformable objects. Based on the given forest topology, we augment the random forest by defining one optimal control action for each leaf-node, which provides the action prediction for any unseen image that falls in that leaf-node. In this way, the feature extraction helps determine the parameterization of the controller. The random forest construction and controller optimization are finally integrated into an imitation learning framework to improve the robustness of human-robot co-manipulation tasks.
We have integrated our approach with an ABB YuMi dual-arm robot and a camera for image capture and use this system to manipulate different cloth materials for different tasks. We highlight the real-time performance of our method on a set of DOM benchmarks, including standard feature point reaching tasks (as in [17, 18]); the tasks for cloth stretching, folding, twisting, and placement; and the industrial tasks for cloth assembly. Our manipulation system successfully and efficiently accomplishes these manipulation tasks for a wide variety of objects with different deformation properties.
Our main contributions in this paper include:
A general feedback control framework for DOM tasks.
A set of manually designed low-level features that work well in a set of challenging DOM tasks.
A novel histogram feature representation of highly deformable materials (HOW-features) that are computed directly from the streaming RGB data using Gabor filters. These features are then correlated using a sparse representation framework with a visual feedback dictionary and then fed to the feedback controller to generate appropriate robot actions.
An online Gaussian Process Regression based nonlinear feedback controller for DOM tasks that is robust and adaptive to the object’s unknown deformation parameters.
A random-forest-based DOM-controller parametrization that is robust and less parameter-sensitive, and an imitation-learning-based framework that trains robust DOM controllers using a deformable object simulator.
A set of benchmark DOM tasks that have importance for manufacturing or service industries.
The techniques developed in this paper would be useful for general DOM tasks, including DOM tasks in warehouses. In addition, our techniques could potentially be used in some manufacturing processes. Consider, for example, the assembly of cloth pieces with fixtures, one typical deformable object manipulation task that we study in this paper. In this task, a piece of cloth with holes needs to be aligned with a fixture made of a set of vertical locating pins. The assembled cloth pieces are then sent to the sewing machine for sewing operations. Such a task can be performed efficiently by a human worker without any training, as shown in Figure 2, but it is difficult for a robot. For instance, the wrinkles generated during the operation interfere with the feature extraction and tracking procedures that are critical for perception feedback, and the high deformability of the cloth leads to unpredictable changes in the size and shape of the holes during manipulation. These challenges make it difficult to achieve accurate and reliable robotic manipulation control for cloth assembly. The methods developed in this paper help enable the robot to accomplish cloth assembly tasks accurately and efficiently.
The rest of this paper is organized as follows. We briefly survey related work in Section II and give an overview of our DOM framework in Section III. We present the details of our feature design and extraction in Sections IV and V, and we discuss the details of our new controllers in Section VI. Finally, we demonstrate the performance of our new approach with a set of experiments on a wide variety of soft objects in Section VIII, with conclusions in Section IX.
II Related work
II-A Deformable object manipulation
Many robotic manipulation methods for deformable objects have been proposed in recent years. Early work [28, 29] used knot theory or energy theory to plan manipulation trajectories for linear deformable objects like ropes. Some recent work considered manipulating clothes using dexterous grippers. These works required complete and accurate knowledge of the object's geometric and deformation parameters and thus are not applicable in practice.
More practical works used sensors to guide the manipulation process. Some methods used images to estimate the knot configuration, or used vision to estimate the configuration of a cloth and then leveraged gravity to accomplish folding tasks; others used an RGBD camera to identify the boundary components of clothes. The methods in [14, 33] first used vision to determine the status of the cloth, then optimized a set of grasp points to unfold the clothes on the table, and finally found a sequence of folding actions. Schulman et al. enabled a robot to accomplish complex multi-step deformable object manipulation strategies by learning from a set of manipulation sequences with depth images that encode the task status. Such learning-from-demonstration techniques have been further extended using reinforcement learning and tangent space mapping. A deep learning-based end-to-end framework has also been proposed recently, and a complete pipeline for clothes folding tasks, including vision-based garment grasping, clothes classification and unfolding, model matching, and folding, has been described.
The above methods generally did not explicitly model the deformation parameters of the deformable objects, which is necessary for high-quality manipulation control. Some methods used uncertainty models or heuristics [34, 8] to account for rough deformation models during the manipulation process, and some required an offline procedure to estimate the deformation parameters. Several recent works estimated the object's deformation parameters in an online manner and then designed a controller accordingly. Navarro-Alarcon et al. [17, 18, 21] used an adaptive and model-free linear controller to servo-control soft objects, where the object's deformation is modeled using a spring model. Other work learned models of part deformation, depending on the end-effector force and grasping parameters, in an online manner to accomplish high-quality cleaning tasks. A more complete survey of deformable object manipulation in industry is also available.
II-B Deformable object feature design
Different techniques have been proposed for motion planning for deformable objects. Most work on deformable object manipulation focuses on volumetric objects, such as a deforming ball, or on linear deformable objects, such as steerable needles [36, 37, 38, 28]. By comparison, cloth-like thin-shell objects tend to exhibit more complex deformations, forming wrinkles and folds. Current solutions for thin-shell manipulation problems are limited to specific tasks, including folding, ironing, sewing, and dressing. Deformable body tracking, on the other hand, solves a simpler problem: inferring the 3D configuration of a deformable object from sensing inputs [40, 41, 42]. However, these methods usually require a template mesh to be known a priori, and the allowed deformations are relatively small. Recently, some template-less approaches have been proposed, including [43, 44, 45], that tackle the tracking and reconstruction problems jointly and in real time.
Rather than requiring a complete 3D reconstruction of the entire object, a visual-servo controller uses only single-view observations of the object as input. However, even a single-view observation is high-dimensional. Thus, previous DOM methods use various feature extraction and dimensionality-reduction techniques, including SIFT features and combined depth- and curvature-based features [25, 46]. Recently, deep neural networks have become a mainstream general-purpose feature extractor; they have also been used for manipulating low-DOF articulated bodies and for DOM applications. However, visual feature extraction is decoupled from controller design in all of these methods.
II-C Deformable object controller optimization
In robotics, reinforcement learning, imitation learning, and trajectory optimization have been used to compute optimal control actions. Trajectory optimization, or a model-based controller, has been used in [14, 5, 51] for DOM applications. Although accurate, these methods cannot achieve real-time performance. For low-DOF robots such as articulated bodies, researchers have developed real-time trajectory optimization approaches, but these are hard to extend to deformable models due to the high simulation complexity of such models. Currently, real-time performance can only be achieved by learning-based controllers [53, 46, 47], which use supervised learning to train real-time controllers. However, these methods are not robust in handling unseen data; we can therefore further improve robustness by using imitation learning. Other work used reinforcement learning to control a soft hand, but the object manipulated by the soft hand was still rigid.
The dominant approach for DOM tasks is visual servoing [56, 57], which aims at controlling a dynamic system using visual features extracted from images. These techniques have been widely used in robotic tasks like manipulation and navigation; recent work includes the use of histogram features for rigid objects. Sullivan et al. used a visual servoing technique to solve the deformable object manipulation problem using active models. Navarro-Alarcon et al. [17, 18, 21] used an adaptive and model-free linear controller to servo-control soft objects, where the object's deformation is modeled using a spring model. Langsfeld et al. performed online learning of part-deformation models for robotic cleaning of compliant objects. Our goal is to extend these visual servoing methods to perform complex tasks on highly deformable materials.
III Overview and problem formulation
Notation:
- the 3D configuration space of the object
- a configuration of the object
- the feedback feature points on the object
- the uninformative points on the object
- the robot end-effectors' grasping points on the object
- an observation of the object
- the feature vector extracted from an observation
- the target configuration of the object
- the optimal grasping points returned by the expert
- the interaction function linking velocities of the feature space to the end-effector configuration space
- the distance measure in the feature space
- the distance measure in the policy space
- the random forest DOM controller
- the parameter for the random forest topology
- the confidence of a leaf-node
- the number of decision trees
- a leaf-node of the i-th decision tree
- the leaf-node that an observation belongs to
- the labeling function for optimal actions
- the feature transformation for an observation
The problem of 3D deformable object manipulation can be formulated as follows. Similar to prior work, we describe an object as a set of discrete points, which are classified into three types: manipulated points, feedback points, and uninformative points, as shown in Figure 3. The manipulated points correspond to positions on the object that are grabbed by the robot and are thus fixed relative to the robotic end-effectors. The feedback points correspond to the object surface regions that define the task goals and are involved in the visual feedback. The uninformative points correspond to the other regions of the object. Given this setup, the deformable object manipulation problem is to move the manipulated points so as to drive the feedback points toward a required target configuration.
Our goal is to optimize a real-time feedback controller that deforms the object into a desired target configuration. Typically, a configuration of the object can be represented discretely as a 3D mesh, and the dimension of the configuration space can be in the thousands. However, we assume that only a partial observation is available, which in our case is an RGB-D image from a single, fixed point of view.
Since the manipulation of a deformable object is usually executed at a low speed to avoid vibration, we can reasonably assume that the object always lies in a quasi-static state in which the internal forces caused by the elasticity of the object are balanced with the external force applied by the end-effector on the manipulated points. We use a potential energy function E(x_f, x_u, x_m) to formulate the elasticity of the deformable object, where the vectors x_f, x_u, and x_m represent the stacked coordinates of all feedback points, uninformative points, and manipulated points, respectively. The equations of equilibrium for the object can then be described as

∂E/∂x_f = 0,  ∂E/∂x_u = 0,  ∂E/∂x_m = f,

where f is the external force vector applied on the manipulated points. To solve these equations, we need exact knowledge of the deformable object's deformation properties, which is either unavailable or difficult to acquire in many applications. To cope with this issue, we first simplify the potential energy function to depend only on x_f and x_m, which is reasonable because the uninformative points are usually far from the manipulated and feedback points, so their influence on the manipulation process is small and can be neglected. Next, we perform a Taylor expansion of the equilibrium equations about the current static equilibrium state, which implies a relationship between the relative displacements of the feedback points and the manipulated points:

g(δx_f) = h(δx_m),

where δx_f and δx_m are the displacements relative to the equilibrium for the feedback points and manipulated points, respectively. The functions g and h are nonlinear in general, though they can be linear in some special cases; for instance, when only performing a first-order Taylor expansion as in prior work, g and h are two linear functions. In this paper, we allow g and h to be general functions in order to estimate a better model of the deformable object manipulation process.

We further assume the function h to be invertible, which implies

δx_m = h⁻¹(g(δx_f)) = G(δx_f),

where G is the mapping between the velocities of the feedback points and the manipulated points. In this way, we can determine a suitable end-effector velocity via the feedback control ẋ_m = λ G(Δx_f) to drive the object toward its goal state, where Δx_f is the difference between the desired and current vectors of the feedback points and λ is the feedback gain.
However, the velocities of the feedback points generally cannot be used directly in the control because, in practice, these velocities are measured by visual tracking of the deformable object and are thus likely to be noisy or even unreliable when tracking fails. More importantly, a soft object needs a large set of feedback points to characterize its deformation, but a robotic manipulation system usually has only a few end-effectors. Thus, the function in Equation 5 is a mapping from the high-dimensional space of feedback-point velocities to the low-dimensional space of manipulated-point velocities. Such a system is extremely underactuated, and the convergence of the control would be slow.
To deal with the aforementioned difficulties, we replace the feedback points with a low-dimensional feature vector x = φ(o), which is extracted from the observed part o of the deformable object, where φ is the feature extraction function. Around the equilibrium state, the feature velocity is determined by the velocity of the manipulated points, and we can rewrite the equilibrium relationship using the feature vector as

ẋ = F(ṙ),

where ṙ is the velocity of the robot end-effectors at the manipulated points and the function F is called the interaction function.
The manipulation problem for deformable objects can finally be described as follows: given the desired state x* of an object in the feature space, design a controller that learns the interaction function F in an offline or online manner and outputs a control velocity that decreases the distance between x and x*, the object's current state and desired goal state in the feature space. More formally, the controller is

ṙ = −λ F⁻¹(∇_x d(x, x*)),

where d is the distance metric in the feature space and λ is the feedback gain. If we assume F to be a linear function ẋ = J ṙ and d to be the squared Euclidean metric, then the controller degrades to the linear visual-servo controller

ṙ = λ J⁺ (x* − x).
In addition, since x = φ(o) and x* is usually a known vector, we can also write the controller as a policy u = π(o), where o is the observation of the current state of the deformable object and the action u is equivalent to the control velocity ṙ. This form is more popular in the policy optimization literature.
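As an illustration, the linear visual-servo special case can be sketched in a few lines of numpy; the gain, Jacobian, and dimensions below are hypothetical placeholders for illustration, not values from the paper:

```python
import numpy as np

def linear_servo_step(x, x_star, J, lam=0.5):
    """One step of the linear visual-servo controller:
    r_dot = lam * pinv(J) @ (x_star - x), where J maps
    end-effector velocities to feature velocities."""
    return lam * np.linalg.pinv(J) @ (x_star - x)

# toy example: 3 features, 2 controlled DOFs
J = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
x = np.zeros(3)
x_star = np.array([1.0, 1.0, 1.0])
r_dot = linear_servo_step(x, x_star, J)
```

Applying the resulting velocity moves the predicted feature state closer to the goal, which is the defining property of the feedback law.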
IV Feature extraction: manual design
For rigid body manipulation, an object’s state can be completely described by its centroid position and orientation. However, such global features are not sufficient to determine the configuration of a deformable object. As mentioned in Section III, we extract a feature vector from the observed feedback points to represent the object’s state. One common practice for feature extraction is to manually design task-dependent features. Here we present a set of features that are able to provide high-quality representations for DOM tasks, as will be demonstrated in our experiments.
IV-A Global features

IV-A1 Centroid

The centroid feature is computed as the geometric center of the 3D coordinates of all n observed feedback points:

c = (1/n) Σ_{i=1}^{n} p_i,

where p_i is the 3D coordinate of the i-th feedback point.
IV-A2 Positions of feedback points
Another way to describe a deformable object's configuration is to use the positions of all observed feedback points directly as part of the feature vector, i.e., by stacking their 3D coordinates.
IV-B Local features
IV-B1 Distance between points
The distance between each pair of feedback points intuitively measures the stretch of the deformable object. This feature is computed as

d_{ij} = ‖p_i − p_j‖,

where p_i and p_j are a pair of feedback points.
IV-B2 Surface variation indicator
For deformable objects with developable surfaces, the surface variation around each feedback point measures the local geometric properties. Given a feedback point p, we compute the covariance matrix C of its neighborhood, and the surface variation is then computed as

σ(p) = λ_0 / (λ_0 + λ_1 + λ_2),

where λ_0 ≤ λ_1 ≤ λ_2 are the eigenvalues of C.
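A minimal numpy sketch of this surface-variation computation (the function name and neighborhood format are our own choices):

```python
import numpy as np

def surface_variation(neighborhood):
    """neighborhood: (n, 3) array of 3D points around a feedback point.
    Returns lambda_0 / (lambda_0 + lambda_1 + lambda_2), which is close
    to 0 for a locally planar patch and grows with local curvature."""
    centered = neighborhood - neighborhood.mean(axis=0)
    cov = centered.T @ centered / len(neighborhood)
    # ascending eigenvalues: lambda_0 <= lambda_1 <= lambda_2
    eig = np.sort(np.linalg.eigvalsh(cov))
    return eig[0] / eig.sum()

# a flat patch of points yields a surface variation of (nearly) zero
pts = np.array([[x, y, 0.0] for x in range(4) for y in range(4)], dtype=float)
sv = surface_variation(pts)
```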
IV-B3 Extended FPFH from VFH
The extended FPFH is the local descriptor of the Viewpoint Feature Histogram (VFH) and is based on Fast Point Feature Histograms (FPFH). The idea is to use a histogram to record the differences between the centroid point and its normal and all other points and their normals.
IV-C Feature selection for different tasks
In our experience, for 1D deformable objects such as a rope, we can use the centroid, pairwise distances, or the coordinates of several marked points; for 2D deformable objects like deformable sheets, the surface variation indicator or the extended FPFH is more effective for representing the object's state. Features such as the centroid and pairwise distances have also been used in previous work, and we will use these features to compare our method with existing approaches. Our learning-based controllers (discussed later) are not very sensitive to the feature selection because they can adaptively weight the importance of different features. Nevertheless, we can introduce task-relevant prior knowledge into the controller by manually designing task-specific features, which helps the robustness and effectiveness of the control process.
V Feature extraction: data-driven design
Manual feature design is tedious and cannot be optimized for a given task. Here we present a novel feature representation, the Histogram of Oriented Wrinkles (HOW), to describe the shape variation of a highly deformable object like clothes. These features are computed by applying Gabor filters and extracting the high-frequency and low-frequency components. We precompute a visual feedback dictionary in an offline training phase that stores a mapping between these visual features and the velocity of the end-effector. At runtime, we automatically compute the goal configurations based on the manipulation task and use a sparse linear representation to compute the velocity of the controller from the dictionary.
V-A Histogram of deformation model feature
Here we present our algorithm to compute the HOW-features from the camera stream. These are low-dimensional features of highly deformable material.
The pipeline of our HOW-feature computation process is shown in Figure 4, and it has three stages:
V-A1 Foreground segmentation
V-A2 Deformation enhancement
To model the high-dimensional characteristics of the highly deformable material, we use deformation enhancement. This is based on the perceptual observation that most deformations can be modeled by shadows and shape variations. Therefore, we extract the features corresponding to shadow variations by applying a Gabor transform to the RGB image, which enhances the ridges, wrinkles, and edges (as shown in Figure 4). We convolve the deformation filters with the image to obtain the filter responses.
In the spatial domain, a 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave, and it has been used to detect wrinkles. The 2D Gabor filter can be represented as follows:

g(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ² y′²) / (2σ²)) cos(2π x′/λ + ψ),

with x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ, where θ is the orientation of the normal to the parallel stripes of the Gabor filter, λ is the wavelength of the sinusoidal factor, ψ is the phase offset, σ is the standard deviation of the Gaussian, and γ is the spatial aspect ratio. When we apply the Gabor filter to our deformation model image, the choices of wavelength (λ) and orientation (θ) are the key parameters with respect to the wrinkles of deformable materials. As a result, the deformation model features consist of multiple Gabor filters with different wavelengths and orientations.
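The filter bank can be sketched directly from the Gabor formula above; the kernel size, wavelengths, and orientations below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def gabor_kernel(ksize, lam, theta, psi=0.0, sigma=2.0, gamma=0.5):
    """2D Gabor kernel: a Gaussian envelope modulated by a cosine wave."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)
    yp = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xp**2 + gamma**2 * yp**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xp / lam + psi)

# a small bank over wavelengths and orientations, as described in the text
bank = [gabor_kernel(9, lam, th)
        for lam in (4.0, 8.0)
        for th in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
```

Each kernel in the bank would then be convolved with the segmented image to produce one deformation-enhanced response channel.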
V-A3 Grids of histogram
A histogram-based feature is an approximation of the image that reduces data redundancy and extracts a high-level representation that is robust to local variations. Histogram-based features have been adopted to achieve a general framework for photometric visual servoing. Although the distribution of pixel values can be represented by a histogram, it is also important to encode the spatial position of the deformation features to achieve the manipulation task. Our approach is inspired by the grids used in the Histogram of Oriented Gradients, which is computed on a dense grid of uniformly spaced cells.
We compute the grids of histograms for the deformation model feature by dividing the image into small spatial regions and accumulating a local histogram of the responses of the different filters in each region. For each grid cell, we compute the histogram in the region and represent it as a matrix. We vary the grid size and compute matrix features for each size. Finally, we flatten the entries of each matrix into a column feature vector. The complete feature extraction process is described in Algorithm 1.
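A simplified single-scale version of the grid-histogram step might look as follows (the grid size and bin count are arbitrary choices for illustration):

```python
import numpy as np

def grid_histograms(response, grid=4, bins=8):
    """Split one filter-response image into grid x grid cells and stack
    the per-cell histograms into a single column feature vector."""
    h, w = response.shape
    lo, hi = response.min(), response.max()
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = response[i * h // grid:(i + 1) * h // grid,
                            j * w // grid:(j + 1) * w // grid]
            hist, _ = np.histogram(cell, bins=bins, range=(lo, hi))
            feats.append(hist / max(cell.size, 1))  # normalize per cell
    return np.concatenate(feats)

resp = np.random.default_rng(0).standard_normal((64, 64))
f = grid_histograms(resp)  # 16 cells x 8 bins = 128-dimensional vector
```

In the full HOW pipeline this would be repeated for every Gabor response and every grid size, and the resulting vectors concatenated.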
The HOW-feature has several advantages. It captures the deformation structure, which is based on the characteristics of the local shape. Moreover, it uses a local representation that is invariant to local geometric and photometric transformations. This is useful when the translations or rotations are much smaller than the local spatial or orientation grid size.
V-B Manipulation using the visual feedback dictionary
Here we present our algorithm for computing the visual feedback dictionary. At runtime, this dictionary is used to compute the corresponding velocity of the controller based on the visual feedback.
V-B1 Building the visual feedback dictionary
As shown in Figure 5, the inputs to the offline training phase are a stream of images and the corresponding end-effector configurations, and the output is the visual feedback dictionary. For the training process, the end-effector configurations are either collected by human tele-operation or generated randomly. A single configuration of the robot is a column vector whose length equals the number of degrees of freedom to be controlled, with its values represented in the configuration space.
In order to compute the mapping in Equation 6 from the visual feedback to the velocity, we need to transform the configuration and image streams into velocities in the configuration space and the feature space, respectively. One solution is to select a fixed time step and to approximate both velocities by finite differences between consecutive samples, scaled by the frame rate of the captured video.
However, sampling with a fixed time step leads to a limited number of samples and can result in over-fitting. To overcome this issue, we break the sequential order of the time indices to generate more training data. In particular, we assume that the manipulation task can be observed as a Markov process and that each step is independent of every other, so we form training pairs from differences between randomly generated pairs of time indices until the desired total number of samples is reached. To build a more concise dictionary, we also apply K-Means Clustering in the feature space, which enhances performance and prevents over-fitting.
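The random-index sampling of training pairs can be sketched as below (the array shapes and helper name are assumptions, and the K-Means compression step is omitted):

```python
import numpy as np

def build_dictionary(features, configs, n_samples=200, seed=0):
    """Pair random time indices (i, j) to form (feature-velocity,
    config-velocity) training samples, breaking sequential order."""
    rng = np.random.default_rng(seed)
    T = len(features)
    idx = rng.integers(0, T, size=(n_samples, 2))
    dh = np.stack([features[j] - features[i] for i, j in idx])
    dr = np.stack([configs[j] - configs[i] for i, j in idx])
    return dh, dr  # dictionary atoms and their velocity labels

# toy streams: 50 frames of 16-D HOW features and 6-DOF configurations
feats = np.random.default_rng(1).standard_normal((50, 16))
confs = np.random.default_rng(2).standard_normal((50, 6))
dH, dR = build_dictionary(feats, confs)
```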
V-B2 Sparse representation
At runtime, we use a sparse linear representation to compute the velocity of the controller from the visual feedback dictionary. Such representations tend to assign zero weights to most irrelevant or redundant features and thus find a small subset of the most predictive features in the high-dimensional feature space. Given a noisy runtime observation h of the HOW-feature and a visual feedback dictionary D = [h_1, …, h_N] whose columns are training features with velocity labels, we represent h as a sparse linear combination of the dictionary columns with a sparsity-inducing l1 penalty. To deal with noisy data, we also use the l1 norm on the data-fitting term, and compute the sparse coefficient vector β* with the minimization formulation

β* = argmin_β ‖h − D β‖₁ + λ ‖β‖₁,

where λ is a slack parameter that balances the trade-off between fitting the data perfectly and using a sparse solution.
After β* is computed, the observation and its probable label can be reconstructed from the visual feedback dictionary as ĥ = D β*, with the corresponding velocity for each degree of freedom in the configuration given by the same weighted combination of the dictionary labels. The l1 regularizer typically results in a sparse solution in the feature space.
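As a rough illustration of the sparse-coding step, the sketch below solves the more common Lasso variant (l2 data term plus l1 penalty) with ISTA rather than the paper's exact l1-l1 formulation; the dictionary sizes and names are made up:

```python
import numpy as np

def ista_lasso(D, h, lam=0.1, iters=500):
    """Sparse code beta minimizing 0.5*||h - D beta||_2^2 + lam*||beta||_1
    via iterative shrinkage-thresholding (ISTA)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of grad
    beta = np.zeros(D.shape[1])
    for _ in range(iters):
        z = beta - step * (D.T @ (D @ beta - h))
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

# dictionary of 10 atoms (32-D features), one 6-DOF velocity label per atom
rng = np.random.default_rng(0)
D = rng.standard_normal((32, 10))
labels = rng.standard_normal((10, 6))
h = D[:, 3] + 0.01 * rng.standard_normal(32)  # noisy copy of atom 3
beta = ista_lasso(D, h)
dr_hat = labels.T @ beta  # predicted end-effector velocity
```

The sparse code concentrates its weight on the matching atom, so the predicted velocity is essentially that atom's label.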
V-B3 Goal configuration and mapping
We compute the goal configuration and the corresponding HOW-features based on the underlying manipulation task at runtime. Based on the configuration, we compute the velocity of the end-effector. The different ways to compute the goal configuration are:
To manipulate deformable objects to a single state , the goal configuration can be represented simply by the visual feature of the desired state .
To manipulate a deformable object to a hidden state, which can be represented by a set of object states with corresponding visual features, we modify the formulation in Equation 6 to compute the goal as:
For a complex task, which can be represented by a sequential set of states, we estimate a sequential cost for each state and use a modified formulation that selects the state with the lowest sequential cost:
After the current goal state is computed, the corresponding velocity is determined, and that state is removed from the set of goals for subsequent computations.
VI Controller design: nonlinear Gaussian Process Regression
VI-A Interaction function learning
Unlike many previous methods that assume the interaction function is linear, here we model it as a general, highly nonlinear function that determines how the movement of the manipulated points is converted into the feature space. Learning this function requires a flexible, non-parametric method. Our solution is to use Gaussian Process Regression (GPR) to fit the interaction function in an online manner.
GPR is a nonparametric regression technique that defines a distribution over functions; inference takes place directly in function space, given the mean and covariance of that distribution. For our manipulation problem, we formulate the interaction function as a Gaussian process:
where the output still denotes the velocity in the feature space. For the covariance or kernel function, we use the Radial Basis Function (RBF) kernel $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$, where the parameter $\sigma$ sets the spread of the kernel. For the mean function, we use the linear mean $m(x) = Wx$, where $W$ is the linear regression weight matrix. We choose a linear mean function rather than the common zero-mean function because previous work has shown that the adaptive Jacobian method, which can be considered a special case of our method with a purely linear model, captures a large part of the interaction function. As a result, a linear mean function speeds up the convergence of our online learning process and provides a relatively accurate prediction in unexplored regions of the feature space. The matrix $W$ is learned online by minimizing a squared prediction error with respect to the weights.
Given a set of training data consisting of pairs of manipulated point velocities $X$ and feature-space velocities $Y$ collected during the previous manipulation process, standard GPR computes the distribution of the interaction function as a Gaussian process whose posterior mean is
$$\bar{f}(x) = m(x) + K(x, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,(Y - m(X))$$
and whose posterior covariance is
$$\operatorname{cov}(x, x') = k(x, x') - K(x, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,K(X, x').$$
Here $X$ and $Y$ are matrices stacking the training inputs and outputs, respectively, and $K(\cdot, \cdot)$ denotes the matrices and vectors obtained by evaluating the covariance function $k$ on its arguments. The matrix $K(X, X)$ is called the Gram matrix, and the parameter $\sigma_n^2$ estimates the uncertainty or noise level of the training data.
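A compact sketch of this posterior computation with an RBF kernel and a least-squares linear mean; all symbols here are our stand-ins for the paper's notation:

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    """RBF kernel matrix between two sets of points."""
    d2 = ((X1[:, None] - X2[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def gp_posterior(X, Y, Xs, sigma=1.0, noise=1e-2, W=None):
    """GPR posterior sketch with an optional linear mean m(x) = x W.

    X: (N, d) training inputs, Y: (N, q) targets, Xs: (M, d) queries.
    """
    if W is None:
        # least-squares linear mean, mirroring the adaptive-Jacobian prior
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    K = rbf(X, X, sigma) + noise * np.eye(len(X))   # Gram matrix + noise
    Ks = rbf(Xs, X, sigma)
    Kinv = np.linalg.inv(K)
    mean = Xs @ W + Ks @ Kinv @ (Y - X @ W)         # posterior mean
    cov = rbf(Xs, Xs, sigma) - Ks @ Kinv @ Ks.T     # posterior covariance
    return mean, cov
```

When the data is exactly linear, the residual term vanishes and the posterior mean reduces to the linear mean alone, which is why the linear prior accelerates convergence.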
VI-B Real-time online GPR
In the deformable object manipulation process, the data is generated sequentially. Thus, at each time step, we need to update the GP interaction function in an iterative manner, with
In online GPR, we must repeatedly invert the Gram matrix, which takes $O(N^3)$ time, where $N$ is the size of the current training set. Such cubic complexity makes the training process slow for long manipulation sequences, where the training data size increases quickly. In addition, as the GP model grows, the newest data has less and less impact on the regression result, so the GP fails to capture changes in the object's deformation parameters during manipulation. This is critical for deformable object manipulation, because the interaction function is derived from the local force equilibrium and thus is only accurate in a small region.
Motivated by previous work on efficient offline GPR [68, 69, 70, 71], we present a novel online GPR method called Fast Online GPR (FO-GPR) to reduce the high computational cost and to adapt to changing deformation properties while updating the deformation model during manipulation. The main idea of FO-GPR has two parts: 1) maintaining the inverse of the Gram matrix incrementally rather than recomputing it by direct matrix inversion; and 2) restricting the size of the Gram matrix to a given limit and, when that limit is exceeded, using a selective “forgetting” method to replace stale or uninformative data with fresh data points.
VI-B1 Incremental update of Gram matrix
Suppose that at time $t$ the size of the Gram matrix $K_t$ is still below the limit $M$. In this case, $K_{t+1}$ and $K_t$ are related by
$$K_{t+1} = \begin{bmatrix} K_t & k \\ k^\top & k^* \end{bmatrix},$$
where $k$ is the vector of covariances between the new data point and the existing data and $k^*$ is the new point's self-covariance. According to the Helmert–Wolf blocking inverse property, we can compute the inverse of $K_{t+1}$ from the inverse of $K_t$:
$$K_{t+1}^{-1} = \begin{bmatrix} K_t^{-1} + \frac{1}{\gamma} b b^\top & -\frac{1}{\gamma} b \\ -\frac{1}{\gamma} b^\top & \frac{1}{\gamma} \end{bmatrix},$$
where $b = K_t^{-1} k$ and $\gamma = k^* - k^\top b$. In this way, we incrementally update the inverse Gram matrix from $K_t^{-1}$ to $K_{t+1}^{-1}$ at a cost of $O(N^2)$ rather than the $O(N^3)$ of direct matrix inversion. This acceleration enables fast GP model updates during the manipulation process.
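The blockwise update can be sketched as follows; `k_new` holds the covariances between the new point and the stored data and `k_self` its self-covariance (names are illustrative):

```python
import numpy as np

def grow_inverse(K_inv, k_new, k_self):
    """Extend K^{-1} when one row/column (k_new, k_self) is appended
    to the Gram matrix K, via the block-inverse identity.
    Cost is O(N^2) instead of the O(N^3) of a fresh inversion.
    """
    b = K_inv @ k_new
    gamma = k_self - k_new @ b          # Schur complement (scalar)
    N = K_inv.shape[0]
    out = np.empty((N + 1, N + 1))
    out[:N, :N] = K_inv + np.outer(b, b) / gamma
    out[:N, N] = -b / gamma
    out[N, :N] = -b / gamma
    out[N, N] = 1.0 / gamma
    return out
```

The result matches a direct inversion of the enlarged Gram matrix while touching only $O(N^2)$ entries.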
VI-B2 Selective forgetting in online GPR
When the size of the Gram matrix reaches the limit $M$, we use a “forgetting” strategy that replaces the most uninformative data point with the fresh one while keeping the size at $M$. In particular, we choose to forget the data point that is most similar to the other data points in terms of covariance, i.e.,
$$p = \arg\max_i \sum_{j \neq i} K_{ij},$$
where $K_{ij}$ denotes the covariance value stored in the $i$-th row and $j$-th column of the Gram matrix, i.e., $K_{ij} = k(x_i, x_j)$.
The incremental updates of the stacked training matrices $X$ and $Y$ are trivial: each is identical to its previous version, except that the $p$-th entry is replaced by the new data point.
We then discuss how to update the inverse Gram matrix after this replacement. Since the change $\Delta K$ is non-zero only in the $p$-th row and the $p$-th column, it can be written as the product of two low-rank factors $U$ and $V$, i.e., $\Delta K = U V^\top$, where $e_p$ is the vector that is all zeros except for a one at the $p$-th entry and $d$ is the vector of changes along the $p$-th row and column. Both $U$ and $V$ are $N \times 2$ matrices built from $e_p$ and $d$.
Then, using the Sherman–Morrison–Woodbury formula, we get
$$(K + U V^\top)^{-1} = K^{-1} - K^{-1} U \,(I + V^\top K^{-1} U)^{-1}\, V^\top K^{-1},$$
which provides the incremental update scheme for the inverse Gram matrix. Since $I + V^\top K^{-1} U$ is a $2 \times 2$ matrix, its inversion can be computed in constant time. Therefore, the incremental update is dominated by matrix-vector multiplications, and the time complexity is $O(M^2)$ rather than $O(M^3)$.
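One way to realize this rank-2 “forgetting” update is sketched below; the specific factorization of the change into `U` and `V` is our choice and need not match the paper's exact construction:

```python
import numpy as np

def replace_inverse(K_inv, K, p, k_new, k_self):
    """Woodbury sketch of the forgetting update: row/column p of the
    Gram matrix K is overwritten with (k_new, k_self) and K^{-1} is
    refreshed in O(N^2). All names are illustrative.
    """
    N = K.shape[0]
    r = k_new.copy()
    r[p] = k_self                       # new p-th row/column
    d = r - K[p]                        # change along row/column p
    e = np.zeros(N)
    e[p] = 1.0
    # Delta K = e d^T + d e^T - d[p] e e^T, written as U V^T (rank 2)
    U = np.stack([e, d - 0.5 * d[p] * e], axis=1)
    V = np.stack([d - 0.5 * d[p] * e, e], axis=1)
    # Woodbury: (K + U V^T)^{-1} = K^{-1} - K^{-1}U (I + V^T K^{-1} U)^{-1} V^T K^{-1}
    KU = K_inv @ U
    S = np.eye(2) + V.T @ KU            # 2x2 system, constant-time solve
    return K_inv - KU @ np.linalg.solve(S, V.T @ K_inv)
```

Only matrix-vector products over the $N \times N$ inverse are needed, so the per-step cost stays quadratic in the dictionary limit.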
A complete description of FO-GPR is shown in Algorithm 3.
VI-C Exploitation and exploration
Given the interaction function learned by FO-GPR, the controller system predicts the required velocity to be executed by the end-effectors based on the error between the current state and the goal state in the feature space:
However, when there is insufficient data, GPR cannot output a control policy with high confidence; this typically happens in the early steps of the manipulation or when the robot moves the object into a new, unexplored configuration. Fortunately, the GPR framework provides a natural way to trade off exploitation and exploration by sampling the control velocity from the GP's predictive distribution:
If the current state lies in an unexplored region with large predictive variance, the controller will explore around the predicted mean; if it lies in a well-explored region with small variance, the controller will output a velocity close to the predicted mean.
A complete description of the controller based on FO-GPR is shown in Figure 8.
VI-D Convergence and stability analysis
We can prove that, given more and more data, our GPR-based online learning converges to the true underlying deformation distribution and that the resulting controller is stable.
The GPR prediction in Equations VI-A and 34 includes two terms. The first term corresponds to the adaptive Jacobian, which has been proven to be bounded and to asymptotically minimize the error between the current and target states. The second term corresponds to a function that minimizes the functional
where the regularizer is the RKHS (reproducing kernel Hilbert space) norm with respect to the kernel. We can then prove that this second term converges:
The prediction function converges to the true underlying interaction function.
Let $\mu$ be the probability measure from which the data pairs are generated, and let $\lambda_i$ and $\phi_i$ be the $i$-th eigenvalue and eigenfunction of the kernel function. Expanding the functional in this eigenbasis, its minimizer can be computed in closed form, and as the amount of training data grows the minimizer converges to the true interaction function.
VII Controller design: random forest controller
The GPR controller proposed above has two main limitations. First, the controller must take the form of a closed-form mathematical function, so it cannot model the more complicated control policies required for dealing with large deformations. Second, the feature extraction step and the controller design step are independent of each other.
Here, we present another controller for DOM tasks. This controller uses a random forest to model the mapping between the visual features of the object and the optimal control action of the manipulator. The topological structure of this random-forest-based controller is determined automatically from the training data, which consists of visual features and control actions. This enables us to integrate the overall process of training-data classification and controller optimization into an imitation learning (IL) algorithm. Our approach enables joint feature extraction and controller optimization for DOM tasks.
The pipeline of our approach is illustrated in Figure 9 and consists of two components. For preprocessing (the blue block), a dataset of object observations is labeled and features are extracted for each observation. This dataset is used to train our controller. At runtime (the red block), our algorithm takes the current observation as input and generates the optimal action through the learned control policy and its learnable parameters.
VII-A Random forest-based controller formulation
Our key contribution is a novel parametrization of the control policy using a random forest. A random forest is an ensemble of decision trees, where the $i$-th tree classifies an observation by routing it to one of its leaf-nodes. The random forest makes its decision by passing the observation through every tree and averaging the results. To use an already constructed random forest as a controller, we define an optimal control action on each leaf-node so that the final action is determined by averaging:
where we introduce an additional parameter that encodes the random forest's topology. As a result, the number of optimizable controller parameters equals the total number of leaf-nodes over all trees. Such an extension of the random forest has two benefits. First, even if two decision trees give the same classification for an observation, we still assign separate optimal control actions to their respective leaves for optimization. This makes the controller more robust to bad predictions. Second, the number of optimized parameters is tied to the random forest's topology. This reveals a strong connection between feature extraction and controller parametrization. By automatically segmenting the state space into pieces (leaf-nodes) in which different control actions need to be taken, fewer design decisions on the controller's side are exposed to the end user. This makes our method less sensitive to parameter choices.
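A minimal sketch of this averaging controller; the `Node`/`Leaf` classes are toy stand-ins for the trained decision trees:

```python
import numpy as np

class Leaf:
    """Leaf-node holding a per-leaf optimal control action."""
    def __init__(self, action):
        self.action = action

class Node:
    """Internal node: axis-aligned threshold test on one feature dim."""
    def __init__(self, dim, thresh, left, right):
        self.dim, self.thresh = dim, thresh
        self.left, self.right = left, right

def leaf_of(tree, z):
    """Route feature vector z down the tree to its leaf."""
    node = tree
    while isinstance(node, Node):
        node = node.left if z[node.dim] <= node.thresh else node.right
    return node

def forest_action(trees, z):
    """Controller: average the per-leaf actions over all trees."""
    return np.mean([leaf_of(t, z).action for t in trees], axis=0)
```

Each tree contributes the action stored at the leaf that its routing selects, and the controller output is their average.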
VII-B Controller optimization problem
Our controller training problem can take different forms depending on the information available about the goal observation. If the goal observation is known, we can define a reward function from any distance measure between RGB-D images. In this setting, we want to solve the following reinforcement learning (RL) problem:
where the expectation is taken over trajectories sampled according to the control policy and future rewards are weighted by a discount factor. Another widely used setting assumes that the goal observation is unknown, but that an expert is available to provide the optimal control action. In this case, we want to solve the following IL problem:
Our method is designed for this IL problem, and we assume that the optimal actions are known even in the IL setting, because they are needed to construct the random forest. This assumption always holds when we train in a simulated environment. We describe the method used to perform these optimizations in Section VII-C.
VII-C Learning random forest-based controller
To find the controller parameters, we use an IL algorithm that can be decomposed into two substeps: dataset sampling and controller optimization. The first step samples a dataset in which each sample is a pair consisting of a cloth observation and its optimal action. Our goal is then to optimize the random-forest-based controller with respect to this dataset.
VII-C1 Random forest construction
We first solve for the topology of the random forest. Unfortunately, random forest construction requires a labeled dataset that assigns the samples to a discrete set of classes. To generate these labels, we run mean-shift clustering on the optimal actions. In addition, we reduce the dimension of each RGB-D image by extracting the HOW-feature (Section V).
After feature mapping, we obtain a modified dataset in which each observation is replaced by its extracted feature and labeled by the mean-shift clustering.
To construct the random forest, we use a strategy similar to prior work. We construct binary decision trees in a top-down manner, each from a random subset of the dataset. Specifically, for each node of a tree, a set of random partitions is computed and the one with the maximal Shannon information gain is adopted. Each tree is grown until a maximum depth is reached or until the best Shannon information gain falls below a threshold.
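The split-selection step can be sketched as follows (names are illustrative; `Z` holds HOW-features and `labels` the mean-shift class labels):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(Z, labels, n_candidates=64, seed=0):
    """Pick the random axis-aligned partition with maximal
    Shannon information gain among n_candidates proposals."""
    rng = np.random.default_rng(seed)
    H = entropy(labels)
    best = (-np.inf, None, None)        # (gain, dim, thresh)
    for _ in range(n_candidates):
        dim = rng.integers(Z.shape[1])
        thresh = rng.uniform(Z[:, dim].min(), Z[:, dim].max())
        mask = Z[:, dim] <= thresh
        if mask.all() or not mask.any():
            continue                    # degenerate partition
        gain = H - (mask.mean() * entropy(labels[mask])
                    + (~mask).mean() * entropy(labels[~mask]))
        if gain > best[0]:
            best = (gain, dim, thresh)
    return best
```

Applying this recursively at each node, and stopping at a depth limit or a minimum gain, yields one decision tree; the forest repeats this over random data subsets.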
VII-C2 Controller optimization
After constructing the random forest, we define an optimal control action on each leaf-node of each decision tree, derived according to Equation 39. We can then optimize all the per-leaf actions by solving the following optimization problem:
This is a quadratic energy function that can be solved in closed form. However, a drawback of this energy is that every decision tree contributes equally to the final optimal action, making it less robust to outliers. To resolve this problem, we introduce a more robust model that exploits sparsity. We assume that each leaf-node has a confidence value, and the final optimal action is found by weighted averaging:
where a lower bound on the confidence values is used to avoid division by zero. Our final controller optimization then takes the following form: