Learning-based Feedback Controller for Deformable Object Manipulation

06/25/2018 · by Biao Jia, et al.

In this paper, we present a general learning-based framework to automatically servo-control the position and shape of a deformable object with unknown deformation parameters. The servo-control is accomplished by learning a feedback controller that determines the robotic end-effector's movement according to the deformable object's current status. This status encodes the object's deformation behavior using a set of observed visual features, which are either manually designed or automatically extracted from the robot's sensor stream. A feedback control policy is then optimized to efficiently push the object toward a desired feature status. The feedback policy can be learned either online or offline. Our online policy learning is based on Gaussian Process Regression (GPR), which achieves fast and accurate manipulation and is robust to small perturbations. We also propose an offline imitation learning framework to obtain a control policy that is robust to large perturbations in human-robot interaction. We validate the performance of our controller on a set of deformable object manipulation tasks and demonstrate that our method can achieve effective and accurate servo-control for general deformable objects with a wide variety of goal settings.




I Introduction

Robot manipulation has been extensively studied for decades and there is a large body of work on the manipulation of rigid and deformable objects. Compared to the manipulation of a rigid object, whose state can be completely described by a point in a six-dimensional configuration space, deformable object manipulation (DOM) is more challenging due to the very high dimensionality of its configuration space. The resulting manipulation algorithm needs to handle this dimensional complexity and maintain the tension to perform the task. DOM has many important applications, including cloth folding [1, 2]; robot-assisted dressing or household chores [3, 4]; ironing [5]; coat checking [6]; sewing [7]; string insertion [8]; robot-assisted surgery and suturing [9, 10]; and transporting large materials like cloth, leather, and composite materials [11].

Fig. 1: Our robotic system for deformable object manipulation is made of two 3D cameras and one dual-arm ABB robot.

There are two main challenges that arise in performing DOM tasks. First, we need to model a feature representation of the object status that can account for the object’s possibly large deformations. The design of such a feature is usually task-dependent and object-specific. For instance, a cloth needs richer features than a rope, and a surgical application requires more sophisticated features than a consumer application. As a result, feature design is usually a tedious manual procedure. Second, we need to develop a controller that maps the object’s state features to a robot control action in order to achieve a specific task. The design of such a controller is usually also task-dependent. For instance, a manipulation controller for human dressing requires a more expressive controller parametrization than that needed to flatten a piece of paper. As a result, controller design usually requires tedious manual parameter tuning when generalizing to different tasks. More importantly, feature extraction and controller design have traditionally been treated as two separate problems. With the recent development of deep (reinforcement) learning, a prominent method [12, 13] is to represent feature extraction and controller parametrization as two neural networks, which are trained either jointly or separately. However, the design decisions for these two networks, e.g., their network structures, are made independently. This method suffers from the large number of parameters that need to be manually determined. There are some more task-specific methods, such as [14, 5], which use a vision-based feature extractor, but the controller is a standalone optimization-based motion planner whose formulation is independent of feature extraction. These methods are hard to extend because the feature-based objective functions for trajectory optimization are task-specific.

There is a rich literature on autonomous robotic DOM addressing the above challenges, and these works can be classified into three categories. The first group of approaches requires a physical model of the object’s deformation properties, in terms of stiffness, Young’s modulus, or FEM coefficients, to design a control policy [15, 9, 16, 17, 18, 19]. However, such deformation parameters are difficult to estimate accurately and may even change during the manipulation process, especially for objects made of nonlinear elastic or plastic materials. The approaches in the second group use manually designed low-dimensional features to model the object’s deformation behavior [20, 18, 17, 21] and then use an adaptive linear controller to accomplish the task. The approaches in the final group do not explicitly model the deformation parameters of the object. Instead, they use vision or learning methods to accomplish tasks directly [1, 10, 22, 23, 24, 14, 25, 26]. These methods focus on high-level policies but cannot achieve accurate operations; their success rate is low and some of them are open-loop methods. As a result, designing appropriate features and controllers to achieve accurate and flexible deformable object manipulation is still an open research problem in robotics [27].

In this paper, we focus on designing a general learning-based feedback control framework for accurate and flexible DOM. The feedback controller’s input is a feature representation for the deformable object’s current status and its output is the robotic end-effector’s movement. Our framework provides solutions to both the feature design and controller parameterization challenges. For feature design, we propose both a set of manually designed low-level features and a novel higher-level feature based on a histogram of oriented wrinkles, which is automatically extracted from data to describe the shape variation of a highly deformable object. For controller parameterization, we first propose a novel nonlinear feedback controller based on Gaussian Process Regression (GPR), which learns the object’s deformation behavior online and can accomplish a set of challenging DOM tasks accurately and reliably. We further design a controller that treats feature extraction and controller design as a coupled problem by using a random forest trained using two-stage learning. During the first stage, we construct a random forest to classify a sampled dataset of images of deformable objects. Based on the given forest topology, we augment the random forest by defining one optimal control action for each leaf-node, which provides the action prediction for any unseen image that falls in that leaf-node. In this way, the feature extraction helps determine the parameterization of the controller. The random forest construction and controller optimization are finally integrated into an imitation learning framework to improve the robustness of human-robot co-manipulation tasks.
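As a rough sketch of the online GPR idea (the class, parameter names, and gain below are our own illustration, not the paper's implementation), one can regress end-effector velocities on feature-space errors collected during manipulation and use the GP posterior mean as the feedback action:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

class GPRController:
    """Minimal GP regression from feature-space errors to end-effector
    velocities, updated online from observed (error, action) pairs."""

    def __init__(self, length_scale=1.0, noise=1e-6, gain=0.5):
        self.X, self.Y = [], []
        self.ls, self.noise, self.gain = length_scale, noise, gain

    def add_sample(self, feature_error, action):
        self.X.append(np.asarray(feature_error, float))
        self.Y.append(np.asarray(action, float))

    def predict(self, feature_error):
        X, Y = np.array(self.X), np.array(self.Y)
        K = rbf_kernel(X, X, self.ls) + self.noise * np.eye(len(X))
        k = rbf_kernel(np.atleast_2d(feature_error), X, self.ls)
        # GP posterior mean, scaled by the feedback gain
        return self.gain * (k @ np.linalg.solve(K, Y)).ravel()
```

The nonparametric regressor adapts to the object's unknown deformation behavior as samples accumulate, which is what makes the controller model-free.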

We have integrated our approach with an ABB YuMi dual-arm robot and a camera for image capture and use this system to manipulate different cloth materials for different tasks. We highlight the real-time performance of our method on a set of DOM benchmarks, including standard feature-point reaching tasks (as in [17, 18]); cloth stretching, folding, twisting, and placement tasks; and industrial cloth assembly tasks. Our manipulation system successfully and efficiently accomplishes these manipulation tasks for a wide variety of objects with different deformation properties.

Our main contributions in this paper include:

  • A general feedback control framework for DOM tasks.

  • A set of manually designed low-level features that work well in a set of challenging DOM tasks.

  • A novel histogram feature representation of highly deformable materials (HOW-features) that are computed directly from the streaming RGB data using Gabor filters. These features are then correlated using a sparse representation framework with a visual feedback dictionary and then fed to the feedback controller to generate appropriate robot actions.

  • An online Gaussian Process Regression based nonlinear feedback controller for DOM tasks that is robust and adaptive to the object’s unknown deformation parameters.

  • A random-forest-based DOM-controller parametrization that is robust and less parameter-sensitive, and an imitation-learning-based framework that trains robust DOM controllers using a deformable object simulator.

  • A set of benchmark DOM tasks that are important for manufacturing and service industries.

The techniques developed in this paper are useful for general DOM tasks, including DOM tasks in the warehouse. In addition, our techniques could potentially be used in some manufacturing processes. Take the assembly of cloth pieces with fixtures, one typical deformable object manipulation task that we study in this paper, as an example. In this task, a piece of cloth with holes needs to be aligned with a fixture made of a set of vertical locating pins. The assembled cloth pieces are then sent to the sewing machine for sewing operations. Such a task can be performed efficiently by a human worker without any training, as shown in Figure 2, but it is difficult for a robot. For instance, the wrinkles generated during the operation interfere with the feature extraction and tracking procedures that are critical for perception feedback, and the high deformability of the cloth leads to unpredictable changes in the size and shape of the holes during manipulation. These challenges make it difficult to achieve accurate and reliable robotic manipulation control for cloth assembly. The methods developed in this paper help enable the robot to accomplish cloth assembly tasks accurately and efficiently.

Fig. 2: The manual procedure in a clothing factory, where a human worker uses fixtures to assemble cloth pieces for subsequent automated sewing: (a) a human worker locates the cloth pieces on the fixtures made of a few pins; (b) the human worker finishes the assembly of the cloth pieces; (c) an industrial sewing machine performs the automated sewing over the assembled template. Our DOM framework can enable the robot to accomplish this manufacturing process autonomously.

The rest of this paper is organized as follows. We briefly survey related work in Section II. We give an overview of our DOM framework in Section III. We present the details of our feature design and extraction in Section IV. We discuss the details of our new controllers in Section VI. Finally, we demonstrate the performance of our approach with a set of experiments on a wide variety of soft objects in Section VIII and conclude in Section IX.

II Related work

II-A Deformable object manipulation

Many robotic manipulation methods for deformable objects have been proposed in recent years. Early work [28, 29] used knot theory or energy theory to plan manipulation trajectories for linear deformable objects like ropes. Some recent work [30] considered manipulating clothes using dexterous grippers. These works require complete and accurate knowledge of the object’s geometric and deformation parameters and thus are not applicable in practice.

More practical works used sensors to guide the manipulation process. [31] used images to estimate the knot configuration. [1] used vision to estimate the configuration of a cloth and then leveraged gravity to accomplish folding tasks [32]. [6] used an RGB-D camera to identify the boundary components in clothes. [14, 33] first used vision to determine the status of the cloth, then optimized a set of grasp points to unfold the clothes on the table, and finally found a sequence of folding actions. Schulman et al. [10] enabled a robot to accomplish complex multi-step deformable object manipulation strategies by learning from a set of manipulation sequences with depth images to encode the task status. Such learning-from-demonstration techniques have been further extended using reinforcement learning [22] and tangent space mapping [23]. A deep learning-based end-to-end framework has also been proposed recently [24]. A complete pipeline for clothes folding tasks, including vision-based garment grasping, clothes classification and unfolding, and model matching and folding, has been described in [25].

The above methods generally did not explicitly model the deformation parameters of the deformable objects, which is necessary for high-quality manipulation control. Some methods used uncertainty models [15] or heuristics [34, 8] to account for rough deformation models during the manipulation process. Some works required an offline procedure to estimate the deformation parameters [16]. Several recent works estimated the object’s deformation parameters in an online manner and then designed a controller accordingly. Navarro-Alarcon et al. [17, 18, 21] used an adaptive and model-free linear controller to servo-control soft objects, where the object’s deformation is modeled using a spring model [35]. [19] learned models of part deformation, depending on the end-effector force and grasping parameters, in an online manner to accomplish high-quality cleaning tasks. A more complete survey of deformable object manipulation in industry is available in [27].

II-B Deformable object feature design

Different techniques have been proposed for motion planning for deformable objects. Most works on deformable object manipulation focus on volumetric objects such as a deforming ball or on linear deformable objects such as steerable needles [36, 37, 38, 28]. By comparison, cloth-like thin-shell objects tend to exhibit more complex deformations, forming wrinkles and folds. Current solutions for thin-shell manipulation problems are limited to specific tasks, including folding [14], ironing [5], sewing [7], and dressing [39]. On the other hand, deformable body tracking solves a simpler problem, namely inferring the 3D configuration of a deformable object from sensor data [40, 41, 42]. However, these methods usually require a template mesh to be known a priori, and the allowed deformations are relatively small. Recently, some template-less approaches have been proposed, including [43, 44, 45], that tackle the tracking and reconstruction problems jointly and in real time.

Rather than requiring a complete 3D reconstruction of the entire object, a visual-servo controller only uses single-view observations of the object as the input. However, even a single-view observation is high-dimensional. Thus, previous DOM methods use various feature extraction and dimensionality-reduction techniques, including SIFT features [14] and combined depth- and curvature-based features [25, 46]. Recently, deep neural networks have become mainstream general-purpose feature extractors. They have also been used for manipulating low-DOF articulated bodies [12] and for DOM applications [47]. However, visual feature extraction is decoupled from controller design in all these methods.

II-C Deformable object controller optimization

In robotics, reinforcement learning [48], imitation learning [49], and trajectory optimization [50] have been used to compute optimal control actions. Trajectory optimization, or a model-based controller, has been used in [14, 5, 51] for DOM applications. Although accurate, these methods cannot achieve real-time performance. For low-DOF robots such as articulated bodies [52], researchers have developed real-time trajectory optimization approaches, but it is hard to extend them to deformable models due to the high simulation complexity of such models. Currently, real-time performance can only be achieved by learning-based controllers [53, 46, 47], which use supervised learning to train real-time controllers. However, as pointed out in [54], these methods are not robust in handling unseen data. Therefore, we can further improve robustness by using imitation learning. [55] used reinforcement learning to control a soft hand, but the object manipulated by the soft hand was still rigid.

The dominant approach for DOM tasks is visual servoing [56, 57], which aims at controlling a dynamic system using visual features extracted from images. These techniques have been widely used in robotic tasks like manipulation and navigation. Recent work includes the use of histogram features for rigid objects [58]. Sullivan et al. [20] used a visual servoing technique to solve the deformable object manipulation problem using active models. Navarro-Alarcon et al. [17, 18, 21] used an adaptive and model-free linear controller to servo-control soft objects, where the object’s deformation is modeled using a spring model [35]. Langsfeld et al. [19] performed online learning of part-deformation models for robot cleaning of compliant objects. Our goal is to extend these visual servoing methods to perform complex tasks on highly deformable materials.

III Overview and problem formulation

Symbol Meaning
3D configuration space of the object
a configuration of the object
feedback feature points on the object
uninformative points on the object
robot end-effectors’ grasping points on the object
an observation of the object
the feature vector extracted from the observation
target configuration of the object
optimal grasping points returned by the expert
interaction function linking velocities of the feature space to the end-effector configuration space
distance measure in the feature space
distance measure in the policy space
DOM-control policy
random forest DOM controller
parameter for random forest topology
leaf parameter
confidence of a leaf-node
parameter sparsity
the number of decision trees
a leaf-node of the -th decision tree
the leaf-node that an observation belongs to
labeling function for optimal actions
feature transformation for an observation
TABLE I: Symbol table.

The problem of 3D deformable object manipulation can be formulated as follows. Similar to [35], we describe an object as a set of discrete points, which are classified into three types: manipulated points, feedback points, and uninformative points, as shown in Figure 3. The manipulated points correspond to the positions on the object that are grabbed by the robot and are thus fixed relative to the robotic end-effectors. The feedback points correspond to the object surface regions that define the task goals and are involved in the visual feedback. The uninformative points correspond to other regions on the object. Given this setup, the deformable object manipulation problem is to move the manipulated points so as to drive the feedback points toward a required target configuration.

Our goal is to optimize a real-time feedback controller to deform the object into a desired target configuration. We denote the 3D configuration space of the object as . Typically, a configuration can be discretely represented as a 3D mesh of the object, and its dimension can be in the thousands. However, we assume that only a partial observation is available, which in our case is an RGB-D image from a single, fixed point of view.

Since the manipulation of a deformable object is usually executed at a low speed to avoid vibration, we can reasonably assume that the object always lies in the quasi-static state, where the internal forces caused by the elasticity of the object are balanced with the external force applied by the end-effector on the manipulated points. We use a potential energy function to formulate the elasticity of the deformable object, where the potential energy depends on all the points on the object and the vectors , , and represent the stacked coordinates of all feedback points, uninformative points, and manipulated points, respectively. The equation of equilibrium for the object can then be described as follows:


where is the external force vector applied on the manipulated points. To solve the above equations, we need exact knowledge of the deformable object’s deformation properties, which is either not available or difficult to acquire in many applications. To cope with this issue, we first simplify the potential energy function to depend only on the feedback and manipulated points, which is reasonable because the uninformative points are usually far from the manipulated and feedback points, and thus their influence on the manipulation process is small and can be neglected. Next, we perform a Taylor expansion of Equation 3 and Equation 1 about the current static equilibrium status, and the equation of equilibrium implies a relationship between the relative displacements of feedback points and manipulated points:


where and are the displacements relative to the equilibrium for feedback points and manipulated points, respectively. The functions and are nonlinear in general, though they can be linear in some special cases. For instance, when only performing the first order Taylor expansion as in [18], and are two linear functions. In this paper, we allow and to be general functions to estimate a better model for the deformable object manipulation process.

We further assume the function to be invertible, which implies


where is the mapping between the velocities of the feedback points and the manipulated points. In this way, we can determine a suitable end-effector velocity via feedback control to drive the object toward its goal state, where is the difference between the desired vector and the current vector of the feedback points and is the feedback gain.
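As a toy numerical illustration of this feedback law (the linear deformation map A below is invented for the example and stands in for the unknown object model), a proportional servo loop driven by the feedback-point error converges to the goal:

```python
import numpy as np

def servo_step(y_current, y_target, F_inv, gain):
    """One proportional visual-servo step: map the feedback-point error
    through an estimate of the inverse interaction mapping."""
    return gain * (F_inv @ (y_target - y_current))

# Toy linear 'object': feedback-point displacement y = A @ p for
# manipulated-point displacement p (A is a made-up stand-in).
A = np.array([[1.0, 0.2],
              [0.1, 0.8]])
p = np.zeros(2)
y_target = np.array([0.5, -0.3])
for _ in range(200):
    y = A @ p                                   # observe feedback points
    p += servo_step(y, y_target, np.linalg.inv(A), gain=0.2)
```

With a gain of 0.2 the error contracts by a factor of 0.8 per step, so after a few hundred iterations the feedback points reach the target.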

Fig. 3: We model a soft object using three classes of points: manipulated points , feedback points , and uninformative points .

However, the velocities of feedback points generally cannot be directly used in the control because, in practice, these velocities are measured using visual tracking of deformable objects and are thus likely to be noisy, or even unreliable when tracking fails. More importantly, a soft object needs a large set of feedback points to characterize its deformation, but a robotic manipulation system usually has only a few end-effectors. Thus, the function in Equation 5 is a mapping from the high-dimensional space of feedback point velocities to the low-dimensional space of manipulated point velocities. Such a system is extremely underactuated and the convergence of the control would be slow.

To deal with aforementioned difficulties, we replace the feedback points with a low-dimensional feature vector , which is extracted from the observed part of the deformable object, where is the feature extraction function. Around the equilibrium state, we have , and we can rewrite the equilibrium function using the feature vector as


where the function is called the interaction function.

The manipulation problem of deformable objects can finally be described as follows: given the desired state of an object in the feature space, design a controller that learns the interaction function in an offline or online manner and outputs a control velocity that decreases the distance between the object’s current state and its desired goal state in the feature space. More formally, the controller is:


where is the distance metric in the feature space and is the feedback gain. If we assume the interaction function to be linear and the distance metric to be linear, then the controller reduces to the linear visual-servo controller:


In addition, since the desired feature state is usually a known vector, we can also write the controller as a policy :


where is the observation of the current state of the deformable object. This form is more popular in the policy optimization literature.

IV Feature extraction: manual design

For rigid body manipulation, an object’s state can be completely described by its centroid position and orientation. However, such global features are not sufficient to determine the configuration of a deformable object. As mentioned in Section III, we extract a feature vector from the observed feedback points to represent the object’s state. One common practice for feature extraction is to manually design task-dependent features. Here we present a set of features that are able to provide high-quality representations for DOM tasks, as will be demonstrated in our experiments.

IV-A Global features

IV-A1 Centroid

The centroid feature is computed as the geometric center of the 3D coordinates of all the observed feedback points:


IV-A2 Positions of feedback points

Another way to describe a deformable object’s configuration is to directly use the positions of all observed feedback points as part of the feature vector, i.e.


IV-B Local features

IV-B1 Distance between points

The distance between each pair of feedback points intuitively measures the stretch of deformable objects. This feature is computed as


where and are a pair of feedback points.
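The centroid feature of Section IV-A1 and this pairwise-distance feature can be sketched in a few lines of numpy (function names are our own):

```python
import numpy as np

def centroid_feature(P):
    """Geometric center of the observed feedback points (IV-A1).
    P is an (n, 3) array of 3D point coordinates."""
    return P.mean(axis=0)

def pairwise_distance_feature(P):
    """Distances between all pairs of feedback points (IV-B1),
    measuring the stretch of the deformable object."""
    n = len(P)
    return np.array([np.linalg.norm(P[i] - P[j])
                     for i in range(n) for j in range(i + 1, n)])
```

For n feedback points this yields n(n-1)/2 distance entries, which is why these features are reserved for small marker sets such as points on a rope.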

IV-B2 Surface variation indicator

For deformable objects with developable surfaces, the surface variation around each feedback point can measure the local geometric property. Given a feedback point, we can compute the covariance matrix of its neighborhood, and the surface variation is then computed as

σ = λ0 / (λ0 + λ1 + λ2),

where λ0, λ1, and λ2 are the eigenvalues of the covariance matrix with λ0 ≤ λ1 ≤ λ2.
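A minimal numpy sketch of this indicator, using the common eigenvalue-ratio form of surface variation (the paper's exact normalization may differ):

```python
import numpy as np

def surface_variation(neighborhood):
    """Surface variation of a point neighborhood: the smallest eigenvalue
    of the 3x3 covariance matrix divided by the sum of all three.
    Near zero for locally planar neighborhoods, larger for curved ones."""
    C = np.cov(neighborhood.T)                 # (3, 3) covariance
    lam = np.sort(np.linalg.eigvalsh(C))       # ascending eigenvalues
    return lam[0] / lam.sum()
```

A perfectly flat patch gives a value of zero because the covariance has no spread along the surface normal.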

IV-B3 Extended FPFH from VFH

Extended FPFH is the local descriptor of VFH and is based on Fast Point Feature Histograms (FPFH) [59]. The idea is to use a histogram to record the differences between the centroid point and its normal and all the other points and their normals.

IV-C Feature selection for different tasks

In our experience, for 1D deformable objects such as a rope, we can use the centroid, distances, or the coordinates of several marked points; for 2D deformable objects like deformable sheets, the surface variation indicator or extended FPFH is more effective for representing the states of the deformable objects. Features such as the centroid and distances are also used in [18], and we will use these features to compare our method with [18]. Our learning-based controllers (discussed later) are not very sensitive to the feature selection because they can adaptively weight the importance of different features. Nevertheless, we can introduce task-relevant prior knowledge into the controller by manually designing task-specific features, which helps the robustness and effectiveness of the control process.

V Feature extraction: data-driven design

Manual feature design is tedious and cannot be optimized for a given task. Here, we present a novel feature representation, a Histogram of Oriented Wrinkles (HOW), to describe the shape variation of a highly deformable object like clothes. These features are computed by applying Gabor filters and extracting the high-frequency and low-frequency components. We precompute a visual feedback dictionary using an offline training phase that stores a mapping between these visual features and the velocity of the end-effector. At runtime, we automatically compute the goal configurations based on the manipulation task and use sparse linear representation to compute the velocity of the controller from the dictionary.
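As an illustrative sketch of the runtime lookup (a greedy matching-pursuit stand-in; the paper's exact sparse solver is not specified here), the observed feedback word is sparsely coded against the feature half of the dictionary, and the same code is applied to the paired velocity half:

```python
import numpy as np

def sparse_code(D, x, k=3):
    """Greedy matching pursuit: approximate x with at most k columns of
    the dictionary D; returns the sparse coefficient vector."""
    coef = np.zeros(D.shape[1])
    r, idx = x.astype(float).copy(), []
    for _ in range(k):
        if np.linalg.norm(r) < 1e-12:
            break                              # residual already explained
        idx.append(int(np.argmax(np.abs(D.T @ r))))
        sub = D[:, idx]
        c, *_ = np.linalg.lstsq(sub, x, rcond=None)
        r = x - sub @ c                        # update the residual
    if idx:
        coef[idx] = c
    return coef

def dictionary_velocity(D_feat, D_vel, feedback_word, k=3):
    """Reuse the sparse code of the feature dictionary on the paired
    velocity dictionary to obtain the end-effector velocity."""
    return D_vel @ sparse_code(D_feat, feedback_word, k)
```

The key design choice is that the feature words and the velocity words share column indices, so a sparse combination found in feature space transfers directly to a velocity command.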


Fig. 4: Pipeline for HOW-feature computation: Given the input image (1), we use the following stages: (2) foreground segmentation using Gaussian mixture; (3) image filtering with multiple orientations and wavelengths of a Gabor Kernel; (4) discretization of the filtered images to form grids of a histogram; (5) stacking the feature matrix to a single column vector.
Fig. 5: Computing the visual feedback dictionary: The input to this offline process is the recorded manipulation data with images and the corresponding configuration of the robot’s end-effector. The output is a visual feedback dictionary, which links the velocity of the features and the controller.
Fig. 6: At runtime, the controller leverages the HOW features in two stages. First, we extract HOW features from the visual input and compute the visual feedback word by subtracting the extracted features from the features of the goal configuration. Then, we apply the sparse representation and compute the velocity of the controller for manipulation.

V-A Histogram of deformation model feature

Here we present our algorithm to compute the HOW-features from the camera stream. These are low-dimensional features of highly deformable material.

The pipeline of our HOW-feature computation process is shown in Figure 4, and it has three stages:

V-A1 Foreground segmentation

To find the foreground partition of a deformable object, we apply the Gaussian mixture algorithm [60] on the RGB data captured by the camera. The intermediate result of segmentation is shown in Figure 4(2).

V-A2 Deformation enhancement

To model the high-dimensional characteristics of the highly deformable material, we use deformation enhancement. This is based on the perceptual observation that most deformations can be modeled by shadows and shape variations. Therefore, we extract the features corresponding to shadow variations by applying a Gabor transform [61] to the RGB image. This results in the enhancement of the ridges, wrinkles, and edges (as shown in Figure 4). We convolve the deformation filters with the image to obtain the filtered result.

In the spatial domain, a 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave [62], and it has been used to detect wrinkles [63]. The 2D Gabor filter can be represented as follows:

G(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ²y′²) / (2σ²)) cos(2πx′/λ + ψ),

where x′ = x cos θ + y sin θ, y′ = −x sin θ + y cos θ, θ is the orientation of the normal to the parallel stripes of the Gabor filter, λ is the wavelength of the sinusoidal factor, ψ is the phase offset, σ is the standard deviation of the Gaussian, and γ is the spatial aspect ratio. When we apply the Gabor filter to our deformation model image, the choices of wavelength (λ) and orientation (θ) are the key parameters with respect to the wrinkles of deformable materials. As a result, the deformation model features consist of multiple Gabor filters with different values of wavelengths (λ) and orientations (θ).
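A numpy sketch of such a Gabor filter bank (the kernel size, wavelengths, orientations, and the sigma/gamma defaults below are illustrative choices, not the paper's settings):

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, psi=0.0, sigma=2.0, gamma=0.5):
    """Standard 2D Gabor kernel: a Gaussian envelope modulated by a
    sinusoid; theta is the orientation, wavelength the sinusoid period."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    env = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    return env * np.cos(2 * np.pi * xr / wavelength + psi)

# A small filter bank over orientations and wavelengths, as in the pipeline
bank = [gabor_kernel(15, lam, th)
        for lam in (4.0, 8.0)
        for th in np.linspace(0, np.pi, 4, endpoint=False)]
```

Convolving the segmented image with each kernel in the bank yields one response image per (λ, θ) pair, which the next step aggregates into grid histograms.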

V-A3 Grids of histogram

A histogram-based feature is an approximation of the image that reduces data redundancy and extracts a high-level representation that is robust to local variations in an image. Histogram-based features have been adopted to achieve a general framework for photometric visual servoing [58]. Although the distribution of pixel values can be represented by a histogram, it is also important to represent the spatial position of the deformation in the feature space to achieve the manipulation task. Our approach is inspired by the use of grids in the Histogram of Oriented Gradients [64], which is computed on a dense grid of uniformly spaced cells.

We compute the grids of histogram for the deformation model feature by dividing the image into small spatial regions and accumulating a local histogram of the different filters over each region. For each grid, we compute the histogram in the region and represent it as a matrix. We vary the grid size and compute matrix features for each size. Finally, we stack the entries of the matrices into a single column feature vector. The complete feature extraction process is described in Algorithm 1.

1:Input: image I, deformation filters (Gabor kernels) G_1, …, G_K, set of grid sizes S
2:Output: feature vector x
3:for each grid size s in S do
4:     for each filter G_i do
5:         for each pixel of the filtered image do
6:              compute its grid-cell index using truncation
7:              add the filtered pixel value to the corresponding histogram bin
8:         end for
9:     end for
10:end for
11:concatenate all cell histograms into the column feature vector x
Algorithm 1 Computing HOW features
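The accumulation loop can be sketched in NumPy. The per-pixel truncation-based cell indexing below mirrors the algorithm, but the function name and the choice of grid sizes are illustrative:

```python
import numpy as np

def how_feature(filtered, grid_sizes):
    """Grids-of-histogram (HOW) feature for a stack of filtered images.

    filtered:   list of 2D arrays, one per deformation-filter response
    grid_sizes: iterable of cell counts per image side, e.g. (2, 4)
    Returns a single column feature vector.
    """
    chunks = []
    for s in grid_sizes:
        for resp in filtered:
            h, w = resp.shape
            hist = np.zeros((s, s))
            for i in range(h):
                for j in range(w):
                    # grid-cell index by truncation
                    gi, gj = min(i * s // h, s - 1), min(j * s // w, s - 1)
                    # accumulate the filtered pixel value in its cell
                    hist[gi, gj] += resp[i, j]
            chunks.append(hist.ravel())
    return np.concatenate(chunks)

# toy responses: one all-ones and one all-zeros 8x8 filter output
feat = how_feature([np.ones((8, 8)), np.zeros((8, 8))], (2, 4))
```

With two filters and grid sizes 2 and 4, the feature has 2·(4 + 16) = 40 entries.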

The HOW-feature has several advantages. It captures the deformation structure, which is based on the characteristics of the local shape. Moreover, it uses a local representation that is invariant to local geometric and photometric transformations. This is useful when the translations or rotations are much smaller than the local spatial or orientation grid size.

V-B Manipulation using the visual feedback dictionary

Here we present our algorithm for computing the visual feedback dictionary. At runtime, this dictionary is used to compute the controller velocity corresponding to the observed visual feedback.

Fig. 7: Visual feedback dictionary: A visual feedback word pairs a change in the visual features with the corresponding change in the controller positions. The visual feedback dictionary consists of the set of computed visual feedback words. We show the initial and final states on the left and the right, respectively.

V-B1 Building the visual feedback dictionary

As shown in Figure 5, the inputs to the offline training phase are a set of images and end-effector configurations, and the output is the visual feedback dictionary.

For the training process, the end-effector configurations are either collected by human tele-operation or generated randomly. A single configuration of the robot is a column vector whose length equals the number of degrees of freedom to be controlled, and its value is expressed in the configuration space.

In order to compute the mapping in Equation 6 from the visual feedback to the velocity, we need to transform the configurations and the image stream into velocities and visual feature feedback, respectively. One solution is to select a fixed time step and represent the velocities in the feature and configuration spaces by finite differences,

Δx(t) = x(t + 1/f) − x(t),  Δp(t) = p(t + 1/f) − p(t),

where f is the frame rate of the captured video, x(t) is the visual feature of the frame at time t, and p(t) is the corresponding end-effector configuration.

However, sampling with a fixed time step leads to a limited number of samples and can result in over-fitting. To overcome this issue, we break the sequential order of the time index to generate more training data from the recorded streams. In particular, we assume that the manipulation task can be observed as a Markov process [65] and that each step is independent of every other. In this case, the samples are generated as

Δx_j = x(t_j) − x(t'_j),  Δp_j = p(t_j) − p(t'_j),  j = 1, …, n_s,

where the (t_j, t'_j) are randomly generated index pairs and n_s is the total number of samples. To build a more concise dictionary, we also apply K-Means Clustering [66] in the feature space, which improves performance and prevents over-fitting.

In practice, the visual feedback dictionary can be regarded as an approximation of the interaction function in Equation 6. The overall algorithm to compute the dictionary is given in Algorithm 2.
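The random-pair sampling step can be sketched as follows. The stream sizes, dimensions, and variable names are stand-ins for the recorded feature and configuration streams, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-ins for the recorded streams: n_t frames of feature
# vectors x_t and end-effector configurations p_t
n_t, n_feat, n_dof = 200, 40, 6
X = rng.normal(size=(n_t, n_feat))
P = rng.normal(size=(n_t, n_dof))

def sample_feedback_pairs(X, P, n_samples, rng):
    """Break the sequential time order: draw random index pairs (t, t')
    and form (feature change, configuration change) training pairs."""
    t = rng.integers(0, len(X), size=n_samples)
    tp = rng.integers(0, len(X), size=n_samples)
    dX = X[t] - X[tp]          # visual feedback "words"
    dP = P[t] - P[tp]          # corresponding controller velocities
    return dX, dP

dX, dP = sample_feedback_pairs(X, P, 1000, rng)
```

The resulting (dX, dP) pairs would then be compressed by K-Means clustering in the feature space to form the dictionary.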

V-B2 Sparse representation

At runtime, we use a sparse linear representation [67] to compute the velocity of the controller from the visual feedback dictionary. Such representations tend to assign zero weights to most irrelevant or redundant features and thereby select a small subset of the most predictive features in the high-dimensional feature space. Given a noisy feature observation Δx̂ at runtime and a visual feedback dictionary constructed from features {Δx_i} with labels {Δp_i}, we represent Δx̂ by a sparse linear combination of the dictionary features. To deal with noisy data, we use the ℓ2 norm on the data-fitting term, together with an ℓ1 sparsity-inducing term, and formulate the resulting sparse representation as

min_β ‖Δx̂ − Σ_i β_i Δx_i‖₂² + λ ‖β‖₁,

where λ is a slack variable that balances the trade-off between fitting the data perfectly and using a sparse solution. The sparse coefficient vector β* is the minimizer of this formulation. After β* is computed, the observation and the probable label can be reconstructed from the visual feedback dictionary as

Δx̂ ≈ Σ_i β*_i Δx_i,  Δp̂ = Σ_i β*_i Δp_i.

The corresponding velocity of the j-th DOF of the configuration is the j-th entry of Δp̂. The ℓ1 regularizer typically results in a sparse solution in the feature space.

1:Input: image stream {I_t} and end-effector positions {p_t}, sampling amount n_s, dictionary size n_d
2:Output: visual feedback dictionary
3:generate n_s random index pairs for sampling
4:for each sampled index pair do
5:     compute the feature difference Δx by sampling the image stream
6:     compute the configuration difference Δp by sampling the positions
7:end for
8:compute the n_d cluster centers of the feature set by K-Means Clustering
9:for each cluster do
10:     store the clustered feature and its corresponding label in the dictionary
11:end for
Algorithm 2 Building the visual feedback dictionary

V-B3 Goal configuration and mapping

We compute the goal configuration and the corresponding HOW-features based on the underlying manipulation task at runtime. Based on the configuration, we compute the velocity of the end-effector. The different ways to compute the goal configuration are:

  • To manipulate a deformable object to a single state, the goal configuration can be represented simply by the visual feature of the desired state.

  • To manipulate a deformable object to a hidden state that can be represented by a set of object states with a corresponding set of visual features, we modify the formulation in Equation 6 so that the velocity is computed with respect to the nearest feature in the set.

  • For a complex task that can be represented by a sequential set of states, we estimate the sequential cost of each state and use a modification that selects the state with the lowest sequential cost. After this state is computed, the velocity is determined by its visual feature, and the state is removed from the set of goals for subsequent computations.

VI Controller design: nonlinear Gaussian Process Regression

VI-A Interaction function learning

Unlike many previous methods that assume the interaction function to be linear, here we consider a general and highly nonlinear interaction function H that determines how the movement of the manipulated points is converted into the feature space. Learning such a function requires a flexible, non-parametric method. Our solution is to use Gaussian Process Regression (GPR) to fit the interaction function in an online manner.

GPR is a nonparametric regression technique that defines a distribution over functions and in which the inference takes place directly in the function space, given the covariance and mean of the functional distribution. For our manipulation problem, we formulate the interaction function as a Gaussian process:

Δx = H(Δp) ∼ GP(m(Δp), k(Δp, Δp')),

where Δx again denotes the velocity in the feature space and Δp the velocity of the manipulated points. For the covariance or kernel function k, we use the Radial Basis Function (RBF) kernel k(Δp, Δp') = exp(−‖Δp − Δp'‖² / (2σ_k²)), where the parameter σ_k sets the spread of the kernel. For the mean function m, we use the linear mean function m(Δp) = A Δp, where A is the linear regression weight matrix. We choose a linear mean function rather than the common zero-mean function because previous work [18] has shown that the adaptive Jacobian method, which can be considered a special version of our method with a linear mean function, is able to capture a large part of the interaction function H. As a result, a linear mean function yields faster convergence of our online learning process and provides a relatively accurate prediction in unexplored regions of the feature space. The matrix A is learned online by minimizing the squared prediction error with respect to A.

Given a set of training data consisting of pairs (Δp_i, Δx_i) of manipulated-point velocities and feature-space velocities collected during the previous manipulation process, the standard GPR computes the distribution of the interaction function as a Gaussian process H ∼ GP(m̄, k̄), where the GP's mean function is

m̄(Δp) = A Δp + k(Δp, P)ᵀ K̃⁻¹ (X − P Aᵀ)

and the GP's covariance function is

k̄(Δp, Δp') = k(Δp, Δp') − k(Δp, P)ᵀ K̃⁻¹ k(P, Δp').

Here P and X are the matrices obtained by stacking the Δp_i and Δx_i in the training data, respectively, and k(Δp, P) is the vector of covariances between Δp and the training inputs, computed with the given covariance function k. The matrix K̃ = k(P, P) + σ_n² I is called the Gram matrix, and the parameter σ_n estimates the uncertainty or noise level of the training data.
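These posterior formulas can be checked with a few lines of NumPy. The kernel width, noise level, and the purely linear synthetic data below are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, sigk=1.0):
    """RBF kernel matrix between row-stacked inputs A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigk ** 2))

def gp_fit_predict(P, X, Pq, W, sigk=1.0, noise=1e-4):
    """GP posterior mean/covariance with a linear mean m(p) = p @ W.

    P:  (n, d) manipulated-point velocities (inputs)
    X:  (n, m) feature-space velocities (outputs)
    Pq: (q, d) query inputs
    W:  (d, m) linear-mean weight matrix
    """
    K = rbf(P, P, sigk) + noise * np.eye(len(P))   # Gram matrix
    Kinv = np.linalg.inv(K)
    Kq = rbf(Pq, P, sigk)
    mean = Pq @ W + Kq @ Kinv @ (X - P @ W)        # posterior mean
    cov = rbf(Pq, Pq, sigk) - Kq @ Kinv @ Kq.T     # posterior covariance
    return mean, cov

rng = np.random.default_rng(2)
P = rng.uniform(-1, 1, size=(20, 3))
W = rng.normal(size=(3, 2))
X = P @ W                       # a purely linear interaction, for the check
mean, cov = gp_fit_predict(P, X, P[:5], W)
```

With a purely linear interaction, the residual term vanishes and the posterior mean reduces to the linear mean, matching the intuition behind choosing a linear mean function.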

VI-B Real-time online GPR

In the deformable object manipulation process, the data is generated sequentially. Thus, at each time step t, we need to update the GP interaction function incrementally, with

P_{t+1} = [P_t; Δp_t],  X_{t+1} = [X_t; Δx_t],

and the Gram matrix K̃_{t+1} augmented with the covariances of the new data point accordingly.
In online GPR, we need to invert the Gram matrix repeatedly, with a time complexity of O(n³), where n is the size of the current training set used in the regression. Such cubic complexity makes the training process slow for long manipulation sequences, in which the training set grows quickly. In addition, the growth of the GP model reduces the newest data's impact on the regression result and makes the GP fail to capture changes in the object's deformation parameters during the manipulation. This is critical for deformable object manipulation because the interaction function is derived from the local force equilibrium and thus is only accurate in a small region.

Motivated by previous work on efficient offline GPR [68, 69, 70, 71], we present a novel online GPR method, called Fast Online GPR (FO-GPR), to reduce the high computational cost and to adapt to the changing deformation properties while updating the deformation model during the manipulation process. The main idea of FO-GPR has two parts: 1) maintaining the inverse of the Gram matrix K̃ incrementally rather than by direct matrix inversion; 2) restricting the size of K̃ to be smaller than a given limit M and, when that limit is exceeded, using a selective “forgetting” method to replace stale or uninformative data with a fresh data point.

VI-B1 Incremental update of Gram matrix

Suppose that at time t the size of K̃_t is still smaller than the limit M. In this case, K̃_{t+1} and K̃_t are related by

K̃_{t+1} = [ K̃_t   b ; bᵀ   c ],

where b = k(P_t, Δp_t) and c = k(Δp_t, Δp_t) + σ_n². According to the Helmert–Wolf blocking inverse property, we can compute the inverse of K̃_{t+1} from the inverse of K̃_t:

K̃_{t+1}⁻¹ = [ K̃_t⁻¹ + K̃_t⁻¹ b bᵀ K̃_t⁻¹ / s   −K̃_t⁻¹ b / s ; −bᵀ K̃_t⁻¹ / s   1/s ],

where s = c − bᵀ K̃_t⁻¹ b. In this way, we achieve the incremental update of the inverse Gram matrix from K̃_t⁻¹ to K̃_{t+1}⁻¹ at a computational cost of O(n²) rather than the O(n³) of direct matrix inversion. This acceleration enables fast GP model updates during the manipulation process.
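The blocking identity is easy to verify numerically against direct inversion; the sizes and the SPD test matrix below are arbitrary:

```python
import numpy as np

def grow_inverse(Kinv, b, c):
    """Update the inverse Gram matrix when one row/column (b, c) is
    appended, using the blocking (Schur-complement) identity:
    O(n^2) instead of O(n^3) for a fresh inversion."""
    s = c - b @ Kinv @ b                  # Schur complement (scalar)
    u = Kinv @ b
    n = len(b)
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = Kinv + np.outer(u, u) / s
    out[:n, n] = -u / s
    out[n, :n] = -u / s
    out[n, n] = 1.0 / s
    return out

# check against direct inversion on a random SPD Gram matrix
rng = np.random.default_rng(3)
A = rng.normal(size=(6, 5))
Kfull = A @ A.T + 0.1 * np.eye(6)        # Gram matrix after the new point
K1inv = np.linalg.inv(Kfull[:5, :5])     # inverse Gram matrix before
K2 = grow_inverse(K1inv, Kfull[:5, 5], Kfull[5, 5])
```

The incremental result matches the directly inverted 6×6 matrix up to floating-point error.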

VI-B2 Selective forgetting in online GPR

When the size of K̃ reaches the limit M, we use a “forgetting” strategy that replaces the most uninformative data point with the fresh data point while keeping the size of K̃ equal to M. In particular, we choose to forget the data point that is most similar to the other data points in terms of covariance, i.e.,

i* = argmax_i Σ_{j≠i} K̃_{ij},

where K̃_{ij} denotes the covariance value stored in the i-th row and j-th column of K̃, i.e., K̃_{ij} = k(Δp_i, Δp_j).

Given the new data (Δp_t, Δx_t), we need to update P, X, and K̃⁻¹ in the equations above by swapping the data terms related to index i* with the new data, so as to update the interaction function.

The incremental updates of P and X are trivial: the new P is identical to the old one except that its i*-th row is Δp_t rather than Δp_{i*}; the new X is identical to the old one except that its i*-th row is Δx_t rather than Δx_{i*}.

We then discuss how to update K̃⁻¹. Let r be the change of the i*-th row of K̃ caused by the swap (with r_{i*} the change of the diagonal entry). The change ΔK̃ = K̃_new − K̃ is non-zero only in the i*-th row and column, so it can be written as the product of two n × 2 matrices U and V, i.e., ΔK̃ = U Vᵀ, where

U = [e, w],  V = [w, e],  w = r − (r_{i*}/2) e.

Here e is the vector that is all zeros except for a one at the i*-th entry.

Then, using the Sherman–Morrison–Woodbury formula, we get

K̃_new⁻¹ = (K̃ + U Vᵀ)⁻¹ = K̃⁻¹ − K̃⁻¹ U (I₂ + Vᵀ K̃⁻¹ U)⁻¹ Vᵀ K̃⁻¹,

which provides the incremental update scheme for the inverse Gram matrix. Since (I₂ + Vᵀ K̃⁻¹ U) is a 2 × 2 matrix, its inverse can be computed in constant time. The update is therefore dominated by matrix-vector multiplications, and the time complexity is O(n²) rather than O(n³).
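The rank-2 swap can also be checked against direct inversion. The factorization below (vectors e and w) is one valid way to write the row/column change as U Vᵀ, and the test matrix is synthetic:

```python
import numpy as np

def swap_inverse(Kinv, i, r):
    """Woodbury rank-2 update of the inverse when the i-th row and
    column of the Gram matrix change by r (r[i] is the change of the
    diagonal entry)."""
    n = len(r)
    e = np.zeros(n)
    e[i] = 1.0
    w = r - 0.5 * r[i] * e
    U = np.column_stack([e, w])           # delta K = U @ V.T
    V = np.column_stack([w, e])
    M = np.eye(2) + V.T @ Kinv @ U        # small 2x2 system
    return Kinv - Kinv @ U @ np.linalg.inv(M) @ V.T @ Kinv

# random SPD Gram matrix and a small symmetric change of row/column 2
rng = np.random.default_rng(4)
B = rng.normal(size=(5, 5))
K = B @ B.T + np.eye(5)
r = 0.1 * rng.normal(size=5)
Knew = K.copy()
Knew[2, :] += r
Knew[:, 2] += r
Knew[2, 2] -= r[2]                        # diagonal was added twice
Kinv_new = swap_inverse(np.linalg.inv(K), 2, r)
```

Only matrix-vector products of size n appear, so the cost is O(n²) per swap.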

A complete description of FO-GPR is shown in Algorithm 3.

1:Input: P, X, K̃⁻¹, new data (Δp_t, Δx_t), size limit M
2:Output: updated P, X, K̃⁻¹
3:if the size of K̃ is smaller than M then
4:     append Δp_t to P and Δx_t to X
5:     update K̃⁻¹ incrementally using the blocking inverse
6:else
7:     compute the most uninformative index i*
8:     replace the i*-th rows of P and X by Δp_t and Δx_t
9:     update K̃⁻¹ using the Sherman–Morrison–Woodbury formula
10:end if
Algorithm 3 FO-GPR

VI-C Exploitation and exploration

Given the interaction function learned by FO-GPR, the control system predicts the velocity to be executed by the end-effectors from the error between the current state and the goal state in the feature space.

However, when there is insufficient data, GPR cannot output a control policy with high confidence; this typically happens in the early steps of the manipulation or when the robot manipulates the object into a new, unexplored configuration. Fortunately, the GPR framework provides a natural way to trade off exploitation and exploration by sampling the control velocity from the predictive distribution.

If the current state lies in an unexplored region with a large predictive variance, the controller will perform exploration around the predicted mean; if it lies in a well-explored region with a small predictive variance, the controller will output a velocity close to the mean prediction.

A complete description of the controller based on FO-GPR is shown in Figure 8.

Fig. 8: An overview of our nonlinear DOM controller based on Gaussian Process Regression.

VI-D Convergence and stability analysis

We can prove that, given more and more data, our GPR-based online learning converges to the true underlying deformation distribution and that the resulting controller is stable.

The GPR prediction in Equations VI-A and 34 includes two terms. The first term corresponds to the adaptive Jacobian in [18], which has been proven to be bounded and to asymptotically minimize the error between the current and target states. The second term corresponds to a function f that minimizes the regularized functional

J[f] = Σ_i ‖Δx_i − f(Δp_i)‖² + λ ‖f‖_K²,

where ‖·‖_K is the RKHS (reproducing kernel Hilbert space) norm w.r.t. the kernel k. According to [72], we can prove that this second term converges:

Proposition 1

The prediction function converges to the true underlying interaction function.

Let μ be the probability measure from which the data pairs (Δp_i, Δx_i) are generated. In the large-data limit, the data-fitting term becomes an expectation over μ:

J[f] = E_μ ‖Δx − f(Δp)‖² + λ ‖f‖_K².

Let λ_i and φ_i be the i-th eigenvalue and eigenfunction of the kernel function k with respect to μ. Expanding the minimizer as f = Σ_i c_i φ_i and the true interaction function as H = Σ_i h_i φ_i, the functional then becomes

J[f] = Σ_i ( (c_i − h_i)² + λ c_i² / λ_i ),

and its solution can then be computed as

c_i = h_i λ_i / (λ_i + λ).

When λ → 0, c_i → h_i and thus f converges to H.

After showing the convergence of Equations VI-A and 34, we can perform local linearization for the nonlinear GPR prediction function and then design a Lyapunov function similar to [18] to show the stability of the GPR-based visual-servoing controller.

VII Controller design: random forest controller

The GPR controller proposed above has two main limitations. First, the controller must take the form of a closed-form mathematical function, which prevents it from modeling the more complicated control policies required for dealing with large deformations. Second, the feature extraction and control design steps are independent.

Here, we present another controller for DOM tasks. This controller uses a random forest [73] to model the mapping between the visual features of the object and an optimal control action of the manipulator. The topological structure of this random-forest-based controller is determined automatically from the training data, which consist of visual features and control actions. This enables us to integrate the overall process of training data classification and controller optimization into an imitation learning (IL) algorithm. Our approach enables joint feature extraction and controller optimization for DOM tasks.

The pipeline of our approach is illustrated in Figure 9 and consists of two components. For preprocessing (the blue block), a dataset of object observations is labeled and features are extracted for each observation. This dataset is used to train our controller. At runtime (the red block), our algorithm takes the current observation o as input and generates the optimal action u = π_θ(o), where π is the control policy and θ is the controller's learnable parameter.


Fig. 9: The pipeline of learning a random forest-based DOM controller that maps the visual feature (RGB-D image) to the control action. Given a sampled dataset (a), we first label each data point (shown as red text in (b)) to get a labeled dataset, (b). We then construct a random forest to classify the images, (c). After training, the random forest is used as a controller. Given an unseen visual observation (d), the observation is brought through the random forest to a set of leaf-nodes. The optimal control actions are defined on these leaf-nodes, (e). The entire process of labeling, classification, and controller optimization can be integrated into an imitation learning algorithm, (f).

VII-A Random forest-based controller formulation

Our key contribution is a novel parametrization of π_θ using a random forest [73]. A random forest is an ensemble of N decision trees, where the k-th tree classifies an observation o by bringing it to a leaf-node l_k(o) ∈ {1, …, N_k}, with N_k the number of leaf-nodes in the k-th decision tree. The random forest makes its decision by bringing o through every tree and averaging. To use an already constructed random forest as a controller, we define an optimal control action u_{k,l} on each leaf-node l of each tree k, so that the final action is determined by averaging:

π_θ(o) = (1/N) Σ_k u_{k, l_k(o)},

where θ additionally encodes the random forest's topology. As a result, the number of optimizable controller parameters is Σ_k N_k. Such an extension of a random forest has two benefits. First, even if two decision trees give the same classification for an observation o, i.e., the class label of l_k(o) equals that of l_{k'}(o), we still assign separate optimal control actions, u_{k,l_k(o)} and u_{k',l_{k'}(o)}, for optimization. This makes the controller more robust to bad predictions. Second, the number of optimized parameters is tied to the random forest's topology, which reveals a strong connection between feature extraction and controller parametrization. By automatically segmenting the state space into pieces (leaf-nodes) where different control actions need to be taken, fewer design decisions on the controller's side are exposed to the end user. This makes our method less sensitive to parameter choices.
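A minimal sketch of this parametrization, with two hand-written decision stumps standing in for trained trees (the tree structures, thresholds, and action values are all illustrative):

```python
import numpy as np

# two toy decision trees over a 1-D observation, each returning a leaf index
def tree0(o):
    return 0 if o < 0.5 else 1

def tree1(o):
    return 0 if o < 0.3 else (1 if o < 0.7 else 2)

trees = [tree0, tree1]

# one optimal control action per leaf-node per tree (2-D actions here)
leaf_actions = [
    np.array([[0.0, 1.0], [1.0, 0.0]]),                 # tree 0: 2 leaves
    np.array([[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]]),     # tree 1: 3 leaves
]

def controller(o):
    """Bring o through every tree and average the per-tree leaf actions."""
    acts = [leaf_actions[k][trees[k](o)] for k in range(len(trees))]
    return np.mean(acts, axis=0)

u = controller(0.6)   # tree 0 -> leaf 1, tree 1 -> leaf 1
```

Note that the two trees hold separate actions even when their leaves would correspond to the same class label, which is exactly the robustness argument above.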

VII-B Controller optimization problem

Our controller training problem can take different forms depending on the available information. If the desired observation o* is known, we can define a reward function r(o) = −dist(o, o*), where dist can be any distance measure between RGB-D images. In this setting, we want to solve the following reinforcement learning (RL) problem:

θ* = argmax_θ E_{τ∼π_θ} [ Σ_t γᵗ r(o_t) ],

where τ is a trajectory sampled according to π_θ and γ is the discount factor. Another widely used setting assumes that o* is unknown but that an expert is available to provide the optimal control action π*(o). In this case, we want to solve the following IL problem:

θ* = argmin_θ E_o [ ‖π_θ(o) − π*(o)‖² ].

Our method is designed for this IL problem, although we assume that the reward information is known even in the IL setting because it is needed to construct the random forest. This assumption always holds when training in a simulated environment. We describe the method used to perform these optimizations in Section VII-C.

VII-C Learning random forest-based controller

To find the controller parameters, we use an IL algorithm [54] that can be decomposed into two substeps: dataset sampling and controller optimization. The first step samples a dataset D = {(o_i, u*_i)}, where each sample combines a cloth observation with its optimal action. Our goal is then to optimize the random-forest-based controller π_θ with respect to θ, given D.

VII-C1 Random forest construction

We first solve for the topology of the random forest. A random forest construction requires a labeled dataset that classifies the samples into a discrete set of classes. To generate these labels, we run mean-shift clustering on the optimal actions u*_i. In addition, we reduce the dimension of each RGB-D image by extracting its HOW-feature (Section V).

After feature mapping, we obtain a modified dataset {(f(o_i), c_i)}, where f is the HOW-feature extractor and c_i is the mean-shift cluster label of u*_i.

To construct the random forest, we use a strategy similar to [74]. We construct binary decision trees in a top-down manner, each from a random subset of the dataset. Specifically, for each node of a tree, a set of random partitions is computed and the one with the maximal Shannon information gain [73] is adopted. Each tree is grown until a maximum depth is reached or until the best Shannon information gain falls below a threshold.

VII-C2 Controller optimization

After constructing the random forest, we define an optimal control action on each leaf-node of each decision tree, so that the final control action follows Equation 39. We then optimize θ, i.e., all the leaf-node actions u_{k,l}, by solving the following optimization problem:

min_u Σ_i ‖ u*_i − (1/N) Σ_k u_{k, l_k(o_i)} ‖².
This is a quadratic energy function that can be solved in closed form. However, a drawback of this energy is that every decision tree contributes equally to the final optimal action, making it less robust to outliers. To resolve this problem, we introduce a more robust model by exploiting sparsity. We assume that each leaf-node has a confidence value w_{k,l}. The final optimal action is then found by weighted averaging:

π_θ(o) = Σ_k w_{k, l_k(o)} u_{k, l_k(o)} / Σ_k w_{k, l_k(o)},

where a lower bound on the denominator is used to avoid division by zero. Our final controller optimization then takes the form

min_{u, w} Σ_i ‖ u*_i − π_θ(o_i) ‖²  subject to the bounds on w,

which can be solved using an L-BFGS-B optimizer [75]. Finally, both the random forest construction and the controller optimization are integrated into the IL algorithm, as outlined in Algorithm 4.

1:Input: initial guess of θ, optimal policy π*
2:while IL has not converged do
3:     sample a dataset D by querying π* as in [54]  (generate training data based on the current controller)
4:     run mean-shift on the optimal actions to label each data sample
5:     for each data sample do
6:          extract its HOW feature