Robotic manipulation of deformable objects is challenging, especially because of the many different ways an object can deform. Searching within such a high-dimensional state space makes it difficult to recognize, track, and manipulate deformable objects. In this paper we present a feed-forward, model-driven approach to address this challenge, using a pre-computed, simulated database of deformable thin-shell object models in which bending of the mesh models is predominant Grinspun et al. The models are detailed, robust, and easy to construct, and using a physics engine one can accurately predict the behavior of the objects in simulation, which can then be applied to a real physical setting. This work bridges the gap between the simulation world and the real world. The predictive, feed-forward, model-driven approach takes advantage of simulation to generate a large number of instances for learning approaches, which not only alleviates the burden of data collection, but also makes adapting the methods to other application areas easier and faster. Mesh models of common deformable garments are simulated as the garments are picked up in multiple different poses under gravity, and stored in a database for fast and efficient retrieval. To validate this approach, we developed a comprehensive pipeline for manipulating clothing as in a typical laundry task. First, the database is used to estimate categories and poses of garments in arbitrary positions: a fully featured 3D volumetric model of the garment is constructed in real time, and volumetric features are used to retrieve the most similar model in the database, predicting the object category and pose. Second, the database significantly benefits the manipulation of deformable objects via non-rigid registration, providing accurate correspondences between the reconstructed object model and the database models. Third, accurate model simulation can also be used to optimize trajectories for manipulation of deformable objects, such as garment folding, and the simulation can easily be adapted to new garment models. Extensive experimental results are shown for the tasks above using a variety of different clothing.
Figure 1 shows a typical pipeline for manipulating clothing as in a laundry task. This paper brings together work addressing all the tasks in the pipeline (except ironing), which have been previously published in conference papers (Li et al. [2014a], Li et al. [2014b], Li et al. [2015a], Li et al. [2015b]). These tasks, with the exception of the ironing task, are all implemented using a feed-forward, model-driven methodology, and this paper consolidates all these results into a single integrated whole. The work has also been extended to include novel garments not found in the database, extended results on regrasping using a much larger dataset of objects and examples, quantitative registration results for our hybrid rigid/deformable registration methods, new dense mesh modeling techniques, and a novel dissimilarity metric used to assess folding success. The ironing task is omitted from this paper due to size constraints; full details on ironing can be found in Li et al.
In addition, a set of videos of our experimental results are available at: https://youtu.be/fRp05Teua4c
2 Related Work
Interactive perception has been used to classify clothing type, initially with an image-only database of 6 categories, each with 5 different items from real garments; the database was later enlarged but still used real garments. That work focused on small clothing, such as socks and short pants, usually of a single color. Miller et al., Wang et al., Schulman et al., and Cusumano-Towner et al. have done impressive work in clothing recognition and manipulation, successfully enabling the PR2 robot to fold clothing and towels. Their methods mainly focus on aligning the current edge/shape from observation to an existing shape. A series of works on clothing pose recognition was done by Kita et al. [2011a], Kita et al. [2011b], and Kita and Kita. They used a simulation database of a single garment with a number of different grasping points, mostly selected on the border when the garment was laid flat. Their work demonstrated the ability to identify the pose of the clothes by registration to pre-recorded template images. Doumanoglou et al. used a pair of industrial arms to recognize and manipulate deformable garments, with a database of depth images captured from real garments such as a sweater or a pair of pants.
With powerful computing resources, reconstructing a 3D model of the garment and using it to search a pre-computed database of simulated garment models in different poses can be more accurate and efficient. With the increasing popularity of the Kinect sensor, various methods have emerged in computer graphics, such as KinectFusion and its variants Newcombe et al., Chen et al., Li et al. Although these methods have shown success in reconstructing static scenes, they do not directly fit our scenario, in which a robotic arm is rotating the target garment about a grasping point. Therefore we first perform a 3D segmentation to obtain masks of the garment on the depth images, and then invoke KinectFusion to do the reconstruction.
Shape matching is another related and long-standing topic in robotics and computer vision. On the 2D side, various local features have been developed for image matching and recognition Huttenlocher et al., Latecki et al., Lowe, which have shown good performance on textured images. Another direction is shape-context based recognition Belongie et al., Toshev et al., Tu and Yuille, which is better suited for handwriting and character matching. On the 3D side, Wu et al. and Wang et al. have proposed methods to match patches based on 3D local features, extracting Viewpoint-Invariant Patches or distributions of geometric primitives as features on which matching is performed. Osada and Funkhouser, Thayananthan et al., and Frome et al. apply 3D shape context as a metric to compute similarities of 3D layout for recognition. However, most of these methods are designed for noise-free, human-designed models, and cannot match the relatively noisy and incomplete mesh models produced by a Kinect against human-designed models. Our method is inspired by 3D shape context Frome et al., but provides the capability of cross-domain matching with a learned distance metric, and also utilizes a volumetric data representation to extract features efficiently.
Osawa et al. proposed a method using a dual-arm setup to unfold a garment from pick-up, matching a segmented mask against pre-stored template masks to track the states of the garment. The PR2 robot is probably the first robot that has successfully manipulated deformable objects such as a towel or a T-shirt Maitin-Shepard et al. The visual recognition in this work targets corner-based features, which do not require a template to match. Subsequent work improved the prediction of the state of a garment using an HMM framework by regrasping at the lowest corner point Cusumano-Towner et al. Doumanoglou et al. applied pre-recorded depth images to guide the manipulation procedures. Sun et al. used a pair of stereo cameras to analyze the surface of a piece of cloth and performed flattening and unfolding.
One of the applications of our database is to localize the regrasping point during manipulation by mapping pre-determined points from the simulation mesh to the reconstructed mesh. Therefore, a fast and accurate registration algorithm plays a key role in our method. Rigid or non-rigid surface registration is a fundamental tool for finding shape correspondence; a thorough review can be found in Tam et al. Our registration algorithm builds on previous techniques for rigid and non-rigid registration. First, we use an iterative closest point method Besl and McKay to rigidly align the garment, using a distance field to accelerate the computation. Next, we perform a non-rigid registration to improve the matching by locally deforming the garment. Similar to Li et al., we find the correspondence by minimizing an energy function that describes the deformation and the fitting.
2.3 Folding Deformable Objects
With the garment fully spread on the table, attention turns to parsing its shape. Miller et al. designed a parametrized shape model for unknown garments Miller et al., Miller et al. Each set of parameters defines a certain type of garment, such as a sweater or a towel. The goal is to minimize the distance between the observed garment contour points and points from the parametrized shape model. The fitting score between the observed contour and the shape models can also be used to recognize the garment category. However, the fitting procedure is relatively slow on average and sometimes does not converge. The contour-based garment shape model was further improved by Stria et al. using polygonal models Stria et al. [2014a]: the detected garment contour is matched to a polygonal model by removing non-convex points with a dynamic programming approach, the landmarks on the polygonal model are then mapped to the real garment contour, and a folding plan is generated.
Folding is another application of garment manipulation. Osawa et al. used a dual-arm robot to fold a garment with a special-purpose table containing a plate that can bend and fold the clothes; the robot mainly repositioned the clothes on the plate for each folding action. Within several “flip-fold” operations, the garment can be folded. Another folding method, using a PR2 robot, was implemented by van den Berg et al. The core of their approach was geometric reasoning with respect to the cloth model, without any physical simulation. Contour fitting at each step took relatively longer than the execution of the folding actions, which reduced efficiency. This was further sped up by Stria et al. [2014b] using two industrial arms and a polygonal contour model. They showed impressive folding results by utilizing a specifically designed gripper Le et al. suitable for cloth grasping and manipulation.
None of the previous works focuses on trajectory optimization for garment folding, which introduces uncertainty into the resulting layout even given the same folding plan. In one possible case, the garment shifts on the table during a folding action, so that the targeted folding position also moves. In another, an improper folding trajectory causes additional deformation of the garment itself, which can accumulate. Our previous work Li et al. [2015b] showed that with effective simulation, bad trajectories can be avoided and the results of manipulating deformable objects are predictable.
3 A Database For Deformable Object Recognition
Figure 1 shows an overview of our pipeline for dexterous manipulation of deformable objects. The first step is visual recognition of deformable objects, which requires a large set of exemplars of how garments look when arbitrarily grasped. In addition, as mentioned previously, a 3D model can be used for regrasping and further manipulation after an accurate registration; a database containing many 3D models is therefore desirable. To obtain a set of pre-calculated trajectories for efficient manipulation of deformable objects, off-line simulation is an effective approach: with low-cost and fast simulation, optimized trajectories can be calculated, exported, and adapted to real robotic manipulation.
Physically having people, or a robot arm, successively pick up an object and image its appearance is too slow and cannot span the large space we hope to learn. Given the physical nature of such a training set, it is very time-consuming to create and may fail to encompass the wide range of garments and fabrics that we can easily accommodate in a simulation environment. Using advanced simulators such as Maya to physically simulate deformable objects, we can produce thousands of exemplars efficiently, which we then use as a corpus for learning the visual appearances of deformed garments.
Take manipulation of a deformable garment as an example. One solution is to use prior knowledge to guide the robot through the steps of a task. In our previous work, we successfully used online registration between the database model and the reconstructed model to achieve stable regrasping and unfold a garment. The work closest to ours is by Doumanoglou et al., which shows impressive results for unfolding a number of different garments. They use a dual-arm industrial robot to unfold a garment guided by a set of depth images that provide a regrasping point, achieving promising accuracy. Their training set is a number of physical garments grasped at different grasping points to create feature vectors for learning. A major difference is our use of simulated predictive thin-shell models for garments from a large database of garments and their poses. We also use an online registration of a reconstructed 3D surface mesh with the simulated model to find regrasping points. By this method, we can choose arbitrary regrasping points without having to train the physical model for the occurrence of the grasping points; this allows us to choose any point on the garment at any time as the regrasping point.
3.2 Simulating Deformable Objects
We have developed an off-line simulation pipeline whose results can be used to predict poses of deformable objects. The off-line simulation is time efficient, noise free, and more accurate than acquiring data via sensors from real objects: simulation models do not suffer from occlusion or noise and are more complete than physically scanned models. In the off-line simulation, we use a few well-defined garment mesh models such as sweaters, jeans, and shorts. Similar garment mesh models can also be obtained from Poserworld Inc. [a] and Turbo Squid Inc. [b], and we can generate models using our own “Sensitive Couture” software Umetani et al. Figure 2 shows a few of our current garment models rendered in Maya.
For each grasping point, we compute the garment layout by letting it hang under gravity in the simulator. In Maya, a mesh model can be converted into an nCloth model, which can then be simulated with cloth properties such as hanging and falling. Maya also allows control of cloth thickness, deformation resistance, and other properties. In addition, any vertex on the mesh can be selected as a constraint point to simulate a draping effect. The hanging-under-gravity behavior of the garment models is shown in Figure 2. We manually label each garment in the database with key grasping points such as sleeve end, elbow, shoulder, chest, and waist.
The simulation model can be exported as an OBJ file for recognition using a volumetric approach Li et al. [2014b]. Figure 3 shows a small sample of different picking points of a single garment hanging under gravity as simulated in Maya.
4 Pose Estimation
Pose estimation of deformable objects is an important problem in robotics, laying the foundation for further procedures. For example, in the task of garment folding, once the robot has detected the pose of the garment, it can proceed to manipulate the target garment into a preset “standard pose.” Unlike rigid object recognition, which has a finite state space, deformable object recognition is much harder because of the very large state space of possible deformations.
In this section, we describe a real-time pose recognition algorithm with accurate prediction of grasping point locations. Figure 4 shows the experimental setting for our algorithm: a Baxter robot grasps a garment and predicts the grasping location (e.g., a point just left of the collar). With this information, the robot can proceed to subsequent tasks such as regrasping and folding. The main idea of our method is to first accurately reconstruct a 3D mesh model from a low-cost depth sensor, and then compute the similarity between the reconstructed model and models simulated offline to predict the pose of the object. The database introduced in the previous section provides a perfect source of such offline-simulated models.
Our method consists of two stages: an offline model simulation stage and an online recognition stage. In the offline stage, we use a physics engine to simulate the stationary states of the mesh models of different types of garments in different poses. In the online stage, we use a Kinect sensor to capture depth images of the garment from many viewpoints by rotating it as it hangs from a robotic arm. We then reconstruct a smooth 3D model from the depth input, extract compact 3D features from it, and finally match against the offline model database to recognize its pose. Figure 5 shows the framework of our method, described in the subsequent subsections.
4.1.1 3D Reconstruction
Given the model database described above, we need to match sensed depth data against it. Direct recognition from depth images suffers from self-occlusion and sensor noise, which leads to our approach of first building a smooth 3D model from the noisy input and then performing recognition in 3D. How to do such a reconstruction, however, is still an open problem. Existing approaches such as KinectFusion Newcombe et al. can obtain high-quality models from noisy depth input, but they require the scene to be static; in our data collection setting, the target garment is rotated by a robotic arm, which violates KinectFusion’s assumptions. We solve this problem by first segmenting the garment from its background and then invoking KinectFusion to obtain a smooth 3D model, assuming the rotation is slow and steady enough that the garment does not deform in the process.
Segmentation. Before diving into the reconstruction algorithm, let us first define some notation. Given the intrinsic matrix $K$ of the depth camera and the $i$-th depth image $D_{i}$, we can compute the 3D coordinates of all pixels in the camera coordinate system as $X = d \, K^{-1} [u, v, 1]^{\top}$, in which $(u, v)$ is the coordinate of a pixel in $D_{i}$, $d$ is its depth, and $X$ is the corresponding 3D coordinate in the camera coordinate system.
Our segmentation is then performed in 3D space. We ask the user to specify a 2D bounding box on the depth image together with a rough estimate of the garment’s depth. Given that the data collection environment is reasonably constrained, we find that even a single predefined bounding box works well. We then take all pixels whose 3D coordinates fall within the bounding box as foreground, producing a series of masked depth images and their corresponding 3D points, which are fed into the reconstruction module.
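For concreteness, this back-projection and box test can be sketched as follows (a minimal NumPy sketch; the function names and the camera-frame box convention are illustrative, not our actual implementation):

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth image (H x W, meters) to camera-space 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.stack([(u - cx) / fx * depth,
                     (v - cy) / fy * depth,
                     depth], axis=-1)                     # H x W x 3

def segment_garment(depth, K, box_min, box_max):
    """Keep only pixels whose 3D point falls inside the user-specified box."""
    pts = backproject(depth, K)
    inside = np.all((pts >= box_min) & (pts <= box_max), axis=-1)
    inside &= depth > 0                                   # drop invalid pixels
    return np.where(inside, depth, 0.0)                   # zero depth = background
```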
The 3D reconstruction is done by feeding the masked depth images into KinectFusion; with the unrelated surroundings eliminated, the scene to reconstruct is effectively static. This process runs in real time. In addition to a smooth mesh, the KinectFusion library also generates a Signed Distance Function (SDF) mapping, which we use for 3D feature extraction. The SDF is defined at any 3D point: it is negative when the point is inside the surface of the scanned object, positive when the point is outside the surface, and zero when it is on the surface. We will use this function to efficiently compute our 3D features in the next subsection.
4.1.2 Feature Extraction
Inspired by 3D Shape Context Belongie et al., we design a binary feature to describe the 3D models. In our method, the features are defined on a cylindrical coordinate system fit to the hanging garment, as opposed to traditional 3D shape context, which uses a spherical coordinate system Frome et al.
For each layer, as shown in Figure 6 top-right, we uniformly divide the world space into $N_r$ rings $\times$ $N_\phi$ sectors in a polar coordinate system, with the largest ring covering the largest radius among all the layers. The center of the polar coordinate system is the mean of all points in the highest layer, which usually contains the robot gripper. Note that we divide the radius $r$ uniformly rather than logarithmically as Shape Context does: Shape Context uses a logarithmic division because cells farther from the center are less important, which is not the case in our setting. For each cell, instead of counting points as in the original Shape Context method, we check the Signed Distance Function (SDF) of the voxel containing the center of the polar cell, and fill in a one (1) if the SDF is zero or negative (i.e., the cell is on or inside the surface), and a zero (0) otherwise. Finally, the binary values of all cells are collected in a fixed order (e.g., increasing $r$, then increasing $\phi$) and concatenated into the final feature vector.
The insight behind this design is that, to improve robustness against local surface disturbances due to friction, we include the 3D voxels inside the surface in the features. Note that we do not need to run a time-consuming classification (e.g., ray tracing) to determine whether each cell is inside the surface; we only need to look up its SDF, which dramatically speeds up the feature extraction.
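A minimal sketch of this extraction, assuming an `sdf(p)` callable that returns the signed distance at a 3D point (e.g., by trilinear lookup into the KinectFusion volume); the downward-layer convention and all names are illustrative:

```python
import numpy as np

def extract_binary_feature(sdf, origin, height, n_layers, n_rings, n_sectors, r_max):
    """Binary cylindrical feature: 1 where a cell center lies on/inside the
    surface (SDF <= 0), 0 where it lies outside."""
    feat = np.zeros((n_layers, n_rings, n_sectors), dtype=np.uint8)
    for l in range(n_layers):
        z = origin[2] - (l + 0.5) * height / n_layers      # layers descend from the gripper
        for r in range(n_rings):
            rad = (r + 0.5) * r_max / n_rings              # uniform (not log) radial bins
            for s in range(n_sectors):
                phi = 2.0 * np.pi * (s + 0.5) / n_sectors
                p = np.array([origin[0] + rad * np.cos(phi),
                              origin[1] + rad * np.sin(phi), z])
                feat[l, r, s] = 1 if sdf(p) <= 0 else 0
    return feat   # flattening in (layer, ring, sector) order gives the feature vector
```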
Matching Scheme. Similar to Shape Context, when matching two shapes, we conceptually rotate one of them and adopt the minimum distance as the matching cost, to provide rotation invariance. That is,

$$d(\mathbf{x}, \mathbf{y}) = \min_{\theta} \ \left\| \mathbf{x} \oplus T_{\theta}\, \mathbf{y} \right\|_{1},$$

in which $\mathbf{x}, \mathbf{y} \in \mathbb{B}^{n}$ are the features to be matched ($\mathbb{B}$ is the binary set $\{0, 1\}$), $\oplus$ is the binary XOR operation, and $T_{\theta}$ is the transform that rotates the feature of each layer by $\theta$. Recall that both features are compact binary codes, so the conceptual rotation as well as the Hamming distance computation can be implemented efficiently with integer shifting and XOR operations, resulting in matching that is even faster than computing a Euclidean distance for reasonable numbers of rotation steps. A complete illustration of the feature extraction algorithm can be found in Algorithm 1.
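The matching logic can be sketched as below; a production implementation would pack each ring into machine words and use bit shifts with popcount, but this NumPy version shows the idea (names illustrative):

```python
import numpy as np

def rotation_invariant_distance(f1, f2):
    """Minimum Hamming distance over all sector rotations of f2.
    f1, f2: (n_layers, n_rings, n_sectors) binary uint8 arrays."""
    n_sectors = f1.shape[2]
    best = np.inf
    for shift in range(n_sectors):
        rotated = np.roll(f2, shift, axis=2)   # rotate every layer by the same angle
        best = min(best, int(np.count_nonzero(f1 ^ rotated)))
    return best
```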
4.1.3 Domain Adaptation
Now we have a feature vector representation for each model in the simulated database and for the query. A natural idea is to find the nearest neighbor (NN) of the query in the database and transfer metadata such as category and pose from the NN to the query. But a naive NN algorithm with Euclidean distance does not work here because, even for the same garment grasped at the same point by the robot, the way it deforms may still differ slightly due to friction. This requires a solution in the matching stage, especially given that it is impractical to simulate every object with all possible materials. Essentially, we are doing cross-domain retrieval, which generally requires a “calibration” step to adapt knowledge from one domain (simulated models) to another (reconstructed models).
Weighted Hamming Distance. Similar to the distance calibration in Wang et al., we use a learned distance metric to improve the NN accuracy, i.e.,

$$\mathrm{NN}(\mathbf{q}) = \arg\min_{i} \ \mathbf{w}^{\top} \left( \mathbf{q} \oplus \mathbf{x}_{i}^{\theta^{*}} \right),$$

in which $\mathbf{q}$ is the feature vector of the query, $i$ is the index of models in the simulated database, and $\oplus$ is the binary XOR operation. $\mathbf{x}_{i}^{\theta^{*}}$ indicates the feature vector of the $i$-th model, with $\theta^{*}$ as the optimal $\theta$ in Equation 1.
The insight here is that we wish to make our distance metric more robust to material properties by assigning larger weights to regions invariant to material differences; this amplifies the features that are more intrinsic to the recognition task.
Distance Metric Learning. To robustly learn the weighted Hamming distance, we use an extra set of mesh models collected with a Kinect as calibration data. The collection settings are the same as described in “3D Reconstruction,” and only a small amount of calibration data is needed for each category (e.g., a few poses of the long-sleeve shirt model). To determine the weight vector $\mathbf{w}$, we formulate the learning process as an optimization problem minimizing the empirical error with a large-margin regularizer:

$$\min_{\mathbf{w}} \ \frac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{j} \mathbf{1}\!\left[ y_{\mathrm{nn}(j)} \neq \tilde{y}_{j} \right], \qquad \mathrm{nn}(j) = \arg\min_{i} \ \mathbf{w}^{\top}\!\left( \mathbf{z}_{j} \oplus \mathbf{x}_{i}^{\theta^{*}} \right),$$

in which $\mathbf{x}_{i}^{\theta^{*}}$ is the orientation-calibrated feature of the $i$-th model (from the database), with $y_{i}$ as the corresponding ground-truth label (i.e., the index of the pose). $\mathbf{z}_{j}$ is the extracted feature of the $j$-th training model (from the Kinect), with $\tilde{y}_{j}$ as its ground-truth label. We wish to minimize the empirical error term, which counts how many wrong results the learned metric gives, together with a quadratic regularizer; $C$ controls how much penalty is given to wrong predictions.
This is a non-convex and even non-differentiable problem. Therefore we employ RankSVM Joachims to obtain an approximate solution using the cutting-plane method.
Knowledge Transfer. Given the learned $\mathbf{w}$, in the testing stage we use Equation 2 to obtain the nearest neighbor of the query model. We directly adopt the grasping point of the nearest neighbor, which is known from the simulation process, as the final prediction.
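Given Equation 2, the prediction step reduces to a single weighted-distance scan over the database; a minimal sketch (array names illustrative):

```python
import numpy as np

def predict_pose(query, database_feats, w):
    """Nearest neighbor under the learned weighted Hamming distance (Equation 2).
    query: flattened binary feature of the reconstructed model.
    database_feats: (N, D) binary features, each pre-rotated to its optimal
    orientation theta* from Equation 1.  w: (D,) learned nonnegative weights."""
    diffs = np.bitwise_xor(database_feats, query[None, :])   # (N, D) of 0/1
    dists = diffs @ w                                        # weighted Hamming
    return int(np.argmin(dists))   # index of NN; its grasp point is the prediction
```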
4.2 Experimental Results
We used a series of experiments to demonstrate the effectiveness of the proposed method and to justify its components. We tested our method on a dataset of various kinds of garments collected in practical settings, treating pose estimation as a classification problem and computing the classification accuracy. Experimental results demonstrate that our method achieves both reasonable accuracy and fast speed.
4.2.1 Data Acquisition
Since the simulated database introduced in the previous section does not contain physically captured data, we collected an extra test dataset for general evaluation of pose recognition of deformable objects with depth images as inputs.
The dataset consists of two parts, a test set and a calibration set. To collect the test set, we use a Baxter robot, which is equipped with two 7-degree-of-freedom arms. A Kinect sensor is mounted on a horizontal platform to capture the depth images, as shown in Figure 4. We bought three kinds of garments – long-sleeve shirts, pants, and shorts – as representative examples from the manufacturing industry, and collected their depth images at the same grasping points as in the training database. We then used our 3D reconstruction algorithm to obtain their mesh models. For each grasping point of each garment, the robot rotates the garment 360 degrees over several seconds while the Kinect captures depth images, yielding a set of depth images for each garment/pose. The result is a test set of mesh models together with their raw depth images.
Since we also need to learn/calibrate a distance metric from extra Kinect data (using Equation 3), we collect a small additional set of data with the same settings as the test set, capturing only five poses per garment; this serves as the calibration set. A weight vector $\mathbf{w}$ is then learned from this calibration data for each type of garment.
4.2.2 Qualitative Evaluation
We demonstrate some of the recognition results in Figure 7, in the order of color image, depth image, reconstructed model, predicted model, ground-truth model, and predicted grasping point (red) vs. ground-truth grasping point (yellow) on the garment. From the figure, we can see that our 3D reconstruction provides good-quality models from a fixed camera capturing a dynamic scene, and that our shape retrieval scheme with learned distance metrics provides reasonable matches for the grasping points. Note that our method outputs a mesh model of the target garment, which is critical for subsequent operations such as path planning and object manipulation.
4.2.3 Quantitative Evaluation
Implementation Details. In the 3D reconstruction, the size and resolution of the voxel grid are chosen to trade off model resolution against robustness to sensor noise. In the feature extraction, we adopt an empirically good configuration for the numbers of layers, rings, and sectors; each mesh model thus yields a fixed-dimensional binary feature. The penalty $C$ in Equation 3 is also set empirically.
Classification Accuracy. For each input garment, we compute the classification accuracy of pose recognition, i.e.,

$$\text{accuracy} = \frac{\#\{\text{correctly predicted poses}\}}{\#\{\text{test cases}\}}.$$
The classification accuracy for each garment type is reported in Table 1 (left). Given that we have two models for each garment in the database (except shorts), we report the accuracy achieved using only Model 1 for retrieval, using only Model 2 for retrieval, and using all the available data. The total numbers of grasping points differ for long-sleeve shirts, pants, and shorts. Our method benefits from the 3D reconstruction step, which reduces sensor noise and integrates information across frames into a comprehensive model, leading to better decisions. Among the three types of garments, recognition of shorts is less accurate than the other two. One possible reason is that many of the shapes from different grasping points look very similar; even for human observers, they are hard to distinguish.
Running Time. We also report the processing time of our method, measured on a PC with an Intel i7 3.0 GHz CPU and shown in Table 1 (right). Our method demonstrates orders-of-magnitude speed-up over the state-of-the-art depth-image-based method, which takes minutes to process one input. This verifies the advantages of our efficient 3D reconstruction, feature extraction, and matching.
4.2.4 Generality to Novel Garments
Though we used a relatively small garment database for our experiments, our simulated models also generalize to similar but unseen garments. For example, long-sleeve shirts and jackets can be considered similar to our long-sleeve shirt model, and knit pants and suit pants are similar to our jeans model. Although they are made of different materials, the way they deform is similar to our training models in some poses. Figure 8 shows additional examples of recognizing poses of unseen garments using the same weights learned on our original dataset. We also note that these garments carry decorations such as pockets or shoulder boards; our method is robust enough to ignore these subtler features.
5 Online Model Registration for Regrasping and Unfolding
As part of the pipeline shown in Figure 1, to unfold a garment we can use the simulated models in the database to guide real object manipulation via registration. In this pipeline, the registration results are used to detect regrasping points. One such scenario is unfolding a garment by iteratively registering the reconstructed mesh model to the database mesh model and then regrasping. After several regrasping steps, the robot holds the garment at two desired positions; for a long-sleeve shirt, for example, we define the optimal grasping positions on the two sleeves. The regrasping is built on the recognition pipeline described in the previous section. Once we have a recognized 3D object model from the database, we perform a registration search that looks for an optimal registration between the model and the physical garment over the entire mesh. Once registered, we can predict the best regrasping point in 3D space and guide the other hand to approach and regrasp at this point. We do this using a fast, two-stage deformable object registration algorithm that integrates off-line simulated results with online localization and uses a novel non-rigid registration method to minimize energy differences between source and target models. We then use a constrained weighted metric for evaluating grasping points during regrasping, which also serves as a convergence criterion.
5.1 Problem Formulation
Our objective is to put the garment into a certain configuration, defined by the relative grasping points on the garment Li et al. [2014b], such that the garment can easily be placed flat on a table for the folding process. This can be formulated as a mathematical optimization problem:

$$(p_{L}^{*}, p_{R}^{*}) = \arg\max_{(p_{L}, p_{R})} \ Q(p_{L}, p_{R}).$$

Here $(p_{L}, p_{R})$ are the positions of the left and right grasping points on the garment (the configuration), and $Q$ is an evaluation function for such a configuration. (Each garment mesh is defined in a 2-dimensional parameter space; when we choose a grasping point, we choose a particular set of parameters, which are then mapped by registration with the sensed garment to a grasping point in 3D.) We seek a principled way to build a feedback loop for garment regrasping, which allows us to grasp at pre-determined points on the garment and place it flat.
Suppose the candidate garment is a long-sleeve shirt that we want to unfold and place flat. A desired solution is a pair of grasping points lying on the elbows of the sleeves. Our goal is to find a pair of grasping points $(p_{L}, p_{R})$, through a series of regrasping procedures, that converges to a value close to the optimum. We need a quantitative function defined on the pose of the garment (i.e., where the robot arm grasps it) to evaluate how good a grasping point is. While this can be computed on the continuous surface of the garment, we can also discretize the garment into a set of anchor points, a fixed number per garment in our database. After such quantization, garment pose recognition can be treated as a discrete classification problem, which current robotic systems can handle reliably. This also simplifies the definition of the objective function, which becomes a 2D score table, i.e., a matrix, given that our robot has two arms.
Details of the optimization procedure and inference can be found in Li et al. [2015a]. The objective function to be maximized can finally be written as a Gaussian centered on the desired grasping points:

$$Q(p_{L}, p_{R}) = \exp\!\left( - \frac{ \| p_{L} - \mu_{L} \|^{2} + \| p_{R} - \mu_{R} \|^{2} }{ 2 \sigma^{2} } \right).$$

The related parameters in the objective, such as $\mu_{L}$, $\mu_{R}$, and $\sigma$, are set depending on the desired configuration. For example, for long-sleeve shirts, we set $\mu_{L}$ and $\mu_{R}$ on the elbows of the two sleeves. The Gaussian formulation ensures a smooth decrease away from the expected grasping points, as visualized in Figure 11.
5.2 Deformable Registration
After obtaining the location of the current grasp point, we register the reconstructed 3D model to the ground-truth garment mesh to establish point correspondences. The input to the registration is a canonical reference (“source”) triangle mesh $S$, computed in advance and stored in the garment database, and a target triangle mesh $T$ representing the geometry of the garment grasped by the robot, acquired by 3D scans of the grasped garment.
The registration proceeds in three steps. First, we scale the source mesh $S$ to match its size to the target mesh $T$. Next, we apply an iterative closest point (ICP) technique to rigidly transform the source mesh (i.e., via only translation and rotation). Finally, we apply a non-rigid registration technique to locally deform the source mesh toward the target $T$.
First, we compute a representative size for each of the source and target meshes. For a given mesh, let $a_{i}$ and $c_{i}$ be the area and barycenter of the $i$-th triangle. The area-weighted center of the mesh is then

$$\bar{c} = \frac{ \sum_{i} a_{i}\, c_{i} }{ \sum_{i} a_{i} },$$

where the sum runs over all triangles of the mesh. Given the area-weighted center, the representative size of the mesh is

$$s = \frac{ \sum_{i} a_{i} \, \| c_{i} - \bar{c} \| }{ \sum_{i} a_{i} }.$$

Let the representative sizes of the source and target meshes be $s_{S}$ and $s_{T}$, respectively. Then we scale the source mesh by a factor of $s_{T} / s_{S}$.
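A NumPy sketch of this scaling step; note that the exact form of the representative size follows our reconstruction above, so treat it as an assumption rather than the definitive formula:

```python
import numpy as np

def representative_size(V, F):
    """Area-weighted center and size of a triangle mesh (V: n x 3, F: m x 3)."""
    tri = V[F]                                   # m x 3 x 3 triangle corners
    bary = tri.mean(axis=1)                      # triangle barycenters
    area = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    center = (area[:, None] * bary).sum(0) / area.sum()
    size = (area * np.linalg.norm(bary - center, axis=1)).sum() / area.sum()
    return center, size

# scale the source mesh about its center so its size matches the target's:
# c_s, s_s = representative_size(V_src, F_src)
# _,   s_t = representative_size(V_tgt, F_tgt)
# V_src = c_s + (V_src - c_s) * (s_t / s_s)
```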
Computing the rigid transformation
We use a variant of ICP Besl and McKay to compute the rigid transformation. ICP iteratively updates a rigid transformation by (a) finding the closest point on the target mesh $T$ for each vertex of the source mesh $S$, (b) computing the optimal rigid motion (rotation and translation) that minimizes the distance between the corresponding point pairs, and (c) updating the vertices of $S$ via this rigid motion.
To accelerate the closest point query, we prepare a grid data structure during preprocessing. For each grid point, we compute the closest point on the target mesh using fast sweeping Tsai, and store both the found point and its distance to the grid point for runtime use, as shown in Figure 12. At runtime, we approximate the closest point query for a vertex $v$ by searching only among the eight precomputed closest points corresponding to the eight grid points surrounding $v$, thereby reducing the complexity of the closest point query to $O(1)$ per vertex.
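A sketch of the accelerated query, assuming the precomputed closest points are stored in a flat array `cp` indexed in C order over grid nodes (the grid conventions and names are illustrative):

```python
import numpy as np

def corner_indices(v, origin, h, dims):
    """Flat (C-order) indices of the 8 grid nodes of the cell containing v,
    for a regular grid with spacing h and node counts dims = (nx, ny, nz)."""
    nx, ny, nz = dims
    i, j, k = np.clip(((v - origin) // h).astype(int), 0, np.array(dims) - 2)
    return [((i + di) * ny + (j + dj)) * nz + (k + dk)
            for di in (0, 1) for dj in (0, 1) for dk in (0, 1)]

def approx_closest_point(v, origin, h, dims, cp):
    """cp[g] holds the precomputed closest point on the target mesh for grid
    node g (e.g., from fast sweeping).  Only 8 candidates are examined, so
    the query is O(1) per vertex."""
    cands = cp[corner_indices(v, origin, h, dims)]
    return cands[np.argmin(np.linalg.norm(cands - v, axis=1))]
```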
After establishing point correspondences, we compute the optimal rotation and translation for registering $S$ with $T$ Besl and McKay. We iteratively compute point correspondences and rigid motions until successive iterations converge to a fixed rigid motion, yielding a rigidly registered source mesh $\tilde{S}$.
Given a candidate source mesh $\tilde{S}$ obtained via rigid registration, our non-rigid registration seeks the vertex positions of the source mesh that minimize

$$E = E_{\mathrm{fit}} + E_{\mathrm{def}},$$

where $E_{\mathrm{fit}}$ penalizes discrepancies between the source and target meshes, and $E_{\mathrm{def}}$ limits and regularizes the deformation of the source mesh away from its rigidly registered counterpart $\tilde{S}$. The term

$$E_{\mathrm{fit}} = \sum_{i} d(c_{i})^{2}$$

penalizes deviation of the source mesh from the target. Here $c_{i}$ is the barycenter of the $i$-th triangle, and $d(c_{i})$ is the distance from $c_{i}$ to the closest point on the target mesh. As in the rigid case, we use the precomputed distance field to query the distance.
It might appear that the fitting energy could be trivially minimized by moving each vertex of mesh $S$ to lie on mesh $T$. In practice, however, this does not work, because all of the geometry of the precomputed reference mesh would be discarded; instead, the geometry of this mesh, which was precomputed using fabric simulation, should serve as a prior. Thus, we introduce a second term to retain as much as possible of the geometry of the reference mesh:
The deformation term $E_{\mathrm{def}}$, derived from a physically based energy (e.g., see Grinspun et al.), is a sum of three terms,

$$E_{\mathrm{def}} = \alpha E_{\mathrm{area}} + \beta E_{\mathrm{shear}} + \gamma E_{\mathrm{bend}},$$

where $\alpha$, $\beta$, and $\gamma$ are user-specified coefficients. The term

$$E_{\mathrm{area}} = \sum_{i} \left( \frac{ a_{i} - \bar{a}_{i} }{ \bar{a}_{i} } \right)^{2}$$

penalizes changes to the area of each mesh triangle. Here $a_{i}$ is the area of the $i$-th triangle, and $\bar{a}_{i}$ refers to the corresponding quantity from the undeformed mesh $\tilde{S}$. The term

$$E_{\mathrm{shear}} = \sum_{i} \sum_{k=1}^{3} \left( \theta_{i,k} - \bar{\theta}_{i,k} \right)^{2}$$

penalizes shearing of each mesh triangle, where $\theta_{i,k}$ is the $k$-th interior angle of the $i$-th triangle. The term Grinspun et al.

$$E_{\mathrm{bend}} = \sum_{e} \left( \psi_{e} - \bar{\psi}_{e} \right)^{2} \frac{ \| e \| }{ h_{e} }$$

penalizes bending, measured by the angle formed by adjacent triangles. Here $\psi_{e}$ is the hinge angle of edge $e$, i.e., the angle formed by the normals of the two triangles incident to $e$; $\|e\|$ is the length of the edge, and $h_{e}$ is a third of the sum of the heights of the two triangles incident to the edge.
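The three energy terms can be evaluated directly from the deformed and rest meshes; the following sketch (the hinge bookkeeping and helper names are our own, not the paper's implementation) illustrates the discrete formulas above:

```python
import numpy as np

def tri_area(p0, p1, p2):
    return 0.5 * np.linalg.norm(np.cross(p1 - p0, p2 - p0))

def tri_angles(p0, p1, p2):
    """Interior angles of triangle (p0, p1, p2)."""
    def angle(apex, a, b):
        u, v = a - apex, b - apex
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(c, -1.0, 1.0))
    return angle(p0, p1, p2), angle(p1, p2, p0), angle(p2, p0, p1)

def hinge_angle(V, i, j, k, l):
    """Angle between normals of triangles (i,j,k) and (j,i,l) sharing edge (i,j)."""
    n1 = np.cross(V[j] - V[i], V[k] - V[i])
    n2 = np.cross(V[i] - V[j], V[l] - V[j])
    c = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return np.arccos(np.clip(c, -1.0, 1.0))

def deformation_energy(V, V0, F, hinges, alpha, beta, gamma):
    """E_def = alpha*E_area + beta*E_shear + gamma*E_bend for deformed vertices V
    against rest vertices V0 (same connectivity F).
    hinges: list of ((i, j), (k, l)): shared edge (i, j), opposite vertices k, l."""
    E_area = E_shear = E_bend = 0.0
    for f in F:
        a, a0 = tri_area(*V[f]), tri_area(*V0[f])
        E_area += ((a - a0) / a0) ** 2
        for t, t0 in zip(tri_angles(*V[f]), tri_angles(*V0[f])):
            E_shear += (t - t0) ** 2
    for (i, j), (k, l) in hinges:
        psi, psi0 = hinge_angle(V, i, j, k, l), hinge_angle(V0, i, j, k, l)
        e_len = np.linalg.norm(V0[j] - V0[i])
        # a third of the sum of the two triangle heights (height = 2*area/base)
        h = (2 * tri_area(V0[i], V0[j], V0[k]) / e_len
             + 2 * tri_area(V0[i], V0[j], V0[l]) / e_len) / 3.0
        E_bend += (psi - psi0) ** 2 * e_len / h
    return alpha * E_area + beta * E_shear + gamma * E_bend
```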
5.3 Grasping Point Localization
We use a pre-determined anchor point (e.g., the elbow on the sleeve of a long-sleeve shirt) to indicate a possible regrasping point. Detection of the regrasping point proceeds in two steps: global localization and local refinement. Global localization is achieved by deformable registration: the registered simulation mesh provides a 3D regrasping point from the recognized state, which is then mapped onto the reconstructed mesh. Details of local refinement can be found in Li et al. [2015a].
To improve the regrasping success rate, we propose a local refinement step. The point on the actual garment may be hard to grasp for several reasons. One is that during the garment manipulation steps, such as rotation, the curvature over the garment may change. Another is that, considering the width of the robot's gripper, a ridge curve with proper orientation and width should be selected for regrasping; we consider the proper orientation to be a direction perpendicular to the opening of the gripper. Therefore, we propose an efficient 1D blob curvature detection algorithm that finds a refined position in the local area of the garment surface via an IR range sensor.
In our experiments, the Baxter robot is equipped with an IR range sensor close to the gripper, as shown in Figure 15, top. Once the gripper reaches the height of the predicted 3D regrasping point from registration, it performs a horizontal scan to refine the local grasping point, moving from one side to the other so that the IR sensor scans over the full local curvature.
We then apply a curvature detection algorithm that convolves the IR depth signal with a fixed-width kernel, where the width is determined by the opening of the gripper. Here we use a Laplacian-of-Gaussian kernel:

$$r(x) = \left( d * \nabla^{2} G_{\sigma} \right)(x), \qquad \nabla^{2} G_{\sigma}(x) = \frac{ x^{2} - \sigma^{2} }{ \sigma^{4} } \, \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{ -x^{2} / (2\sigma^{2}) },$$

where $d$ is the depth signal, and $\sigma$ is the width parameter.
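A minimal NumPy sketch of this scan processing (the kernel support of $3\sigma$ is an arbitrary choice on our part):

```python
import numpy as np

def log_kernel(sigma, half_width):
    """1D Laplacian of Gaussian (second derivative of a Gaussian of scale sigma)."""
    x = np.arange(-half_width, half_width + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return (x ** 2 - sigma ** 2) / sigma ** 4 * g

def best_grasp_index(depth_scan, sigma):
    """Convolve the horizontal IR depth scan with the LoG kernel; the most
    negative response marks a ridge whose width matches the gripper opening."""
    kernel = log_kernel(sigma, max(1, int(3 * sigma)))
    response = np.convolve(depth_scan, kernel, mode="same")
    return int(np.argmin(response)), response
```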
After the regrasping is finished, we evaluate the current grasping configuration with the objective function $Q$. If $Q$ is greater than a given threshold, meaning the grasping points are at the desired positions, the robot stops regrasping and enters the placing-flat mode: the two arms open to slightly stretch the garment and place it on a table. The overall algorithm is summarized in Algorithm 2.
5.5 Experimental Results
To evaluate our results, we tested our method on several different garments such as long-sleeve shirts and pants for multiple trials.
Below, we briefly recap the pose recognition method; details can be found in the previous section. We first pick up the garment at a random point. In the online recognition stage, we use a Kinect sensor to capture depth images of different views of the garment while it is rotated by a robotic arm. The garment is rotated clockwise and then counter-clockwise to obtain enough depth images for an accurate reconstruction. We reconstruct a 3D mesh model via depth image segmentation and volumetric fusion. Then, with an efficient 3D feature extraction algorithm, we build a binary feature vector and match it against the offline database for pose recognition. One of the outputs is a high-quality reconstructed mesh, which is used for 3D registration and accurate regrasping point prediction, as described below.
We apply both rigid and non-rigid registration. The rigid registration step mainly handles mesh rescaling and alignment, whereas the non-rigid registration step refines the result and improves the mapping accuracy. In Figure 13, we compare rigid registration only against rigid plus non-rigid registration side by side. With non-rigid registration, the two meshes are registered more accurately, and the locations of the designated grasping points on the sleeves are closer to the ground-truth points. Note that for the fourth row, the alignment produced by the rigid registration algorithm is evaluated as a local minimum, so the subsequent non-rigid registration yields no improvement; but as the visualization shows, such a case is still good enough for finding point correspondence.
| Source Mesh | S to T (R) | T to S (R) | S to T (R+N) | T to S (R+N) |
| Long-Sleeve T-Shirt 1 | | | | |
| Long-Sleeve T-Shirt 2 | | | | |
| Long Pants 1 | | | | |
| Long Pants 2 | | | | |
We also evaluate the registration algorithm on the entire database. It contains two stages: rigid registration using the ICP algorithm, and non-rigid registration. To isolate the performance of the registration itself, the registration pairs are established assuming the pose recognition is correct, so that registration happens between the closest grasping locations. We design the registration experiments in both directions: from the source mesh to the target mesh, and vice versa. We also compare the results of rigid registration alone against rigid plus non-rigid registration for all pairs; detailed results are shown in Table 2. For example, for S to T (R), we first subdivide the source mesh into a set of disjoint triangulated patches and generate a single sample point in each patch. Each sample point is assigned the area of the patch it belongs to. Then, from each sample point, we find the closest point on the target mesh, sum the distances of all point pairs weighted by the corresponding patch areas, and finally divide by the total area of the source mesh.
5.7 Search for Best Grasping Point by Local Curvature
Once we choose a potential grasping point, we perform a search to find the best local grasping point for the gripper. We are trying to find a fold in the vicinity of the potential grasping point with a high local curvature, tuned to the gripper width, that allows a stable grasp. The width parameter $\sigma$ in Equation 15 is set empirically to match the opening size of the gripper. Figure 15, top, shows a picture of the IR range sensor on the gripper; a plot of its signal, as well as the convolved signal, is shown in Figure 15, bottom left and right. The response of the filter is clearly at a minimum where the grasping should take place. The tactile sensors then assure that the gripper has properly closed on the fabric.
5.8 Iterative Regrasping
Figure 14 shows two examples (a long-sleeve shirt and pants) of iterative regrasping using the Baxter robot. The robot first picks up the garment at a random grasping point. Once the arm reaches a pre-defined position, the last joint of the arm rotates while the Kinect captures depth images, and the 3D mesh is reconstructed in real time. After the rotation, a predicted pose is recognized Li et al. [2014b]
as shown in the third image of each row. For each pose, we have a constrained weighted evaluation metric over the surface to identify the regrasping point, as indicated in the fourth image. By registering the reconstructed mesh with the predicted mesh from the database, we map the desired regrasping point onto the reconstructed mesh. The robot then regrasps by moving the other gripper toward it. With our 1D blob curvature detection method, the gripper moves to the best curvature on the garment and regrasps, which increases the success rate. The iterative regrasping stops when the two grasped points are the designated anchor points on the garment (e.g., the elbows on the sleeves of a long-sleeve shirt).
Figure 16, left, shows sample garments in our test, and the table on the right shows the results. For each garment, we performed multiple unfolding tests. On average, pose recognition succeeded in most trials across all garments, and regrasping (defined as a successful grasp by the other arm on the garment) also succeeded in the large majority of trials. In most cases we were able to successfully unfold the garment, placing the grippers at the designated grasping points. Unsuccessful unfolding occurred when either the gripper lost contact with the garment or the gripper was unable to find a regrasping point. Although we did not perform this experiment, it is possible to restart the method after one of the grippers loses contact as an error recovery procedure.
For the successful unfolding cases, we also report the average number of regrasping attempts. The minimum number of regrasping attempts is one; this happens when the initial grasp is already at one of the desired positions and the regrasping succeeds at the other desired position (i.e., the two elbows on the sleeves for a long-sleeve shirt). In most cases, we are able to successfully unfold the garments within a few regraspings.
Among all these garments, jeans, pants, and leggings achieve high success rates because of their distinctive layout when grasped at the leg position. The shorts are difficult for both the recognition and unfolding steps, possibly because of their ambiguous appearance at different grasping points. One observation is that in a few cases where the recognition was not accurate, our registration algorithm was still able to find a desired regrasping point for unfolding. This is an artifact of the geometry of pant-like garments, where the designated regrasping points are at extreme locations on the garment.
6 Trajectory Optimization for Folding
Robotic folding of a garment is a difficult task because it requires sequential manipulation of a highly unconstrained, deformable object. Given the garment shape, the robot can fold it by following a folding plan Miller et al., Miller et al. However, the outcome of the same folding action can vary with material properties, such as cloth stiffness, and with the environment, such as friction between the garment and the table. Given the starting and ending folding positions, different folding trajectories lead to different results. In this section, we propose a novel method that learns optimal folding trajectory parameters from predictive thin-shell simulations of similar garments, which can then be applied to a real garment folding task (see Figure 17). We first present an online optimization algorithm that learns optimal trajectories for manipulation by evolving a mathematical model combined with predictive thin-shell simulation. We also introduce a novel approach to adjust the simulation environment to the robot's working environment so as to reproduce similar manipulation results. Then, with the learned simulation results, we introduce a fast and robust algorithm that automatically detects garment key points, such as sleeve ends, collar, and waist corners, which can be used to generate a folding plan. The trajectories themselves are general in that they can be scaled to accommodate similar garments of different sizes.
Figure 19 shows the key steps of garment folding, the final step of the entire garment manipulation pipeline, which comprises visual recognition, unfolding, ironing, and folding (Figure 1). This section specifically addresses the robotic folding task (purple rectangle in Figure 19), with the goal of finding optimal trajectories to successfully fold garments.
Figure 18 shows a few failure cases caused by improper trajectories. We use green tape on the table to mark the original position of the garments. The first two rows show that if the moving trajectory is too low and close to the garment, the folded part falls down, pulls the rest, and causes the whole garment to drift. These cases usually happen when the folding step is lengthy and lacks trajectory optimization. The third row shows a case where the folding trajectory is too high, causing extra wrinkles or even piling up. The last row shows two failure cases using two arms to fold: if the arms are too close to each other, the part in between loses tension, falls down, and pulls the rest away. The focus of this work is to create folding trajectories that overcome these problems.
6.1 Simulation Environment
6.1.1 Folding Pipeline in Simulation
In the model simulation, we use a physics engine (Maya) to simulate the movement and deformation of the garment mesh models. We assume there is only one garment per folding task, placed flat on a table. A virtual table is added to the scene for the garment to lie on, as shown in Figure 17, top.
During each folding step, the robot arm picks up a small part of the mesh, moves it to the target position along a computed trajectory, and places it on the table, simulating an entire folding scenario. If the part of the garment to be folded is relatively wide, both left and right arms may be involved. The trajectory is generated using a Bézier curve, as discussed in Section 6.2 below.
We can use the mesh models from the database to simulate the folding process. However, for faster computation, these mesh models are relatively low resolution and are not very accurate when used to simulate folding by bending the mesh. For more accurate simulation, we propose a method to build a mesh model from our real garments. Specifically, a garment mesh is created by first extracting the contour of the real garment Li et al. [2015b]. Then, by inserting points inside the garment contour, we triangulate a mesh connecting these points. Lastly, we mirror the mesh to construct a two-sided garment mesh (see Figure 20).
6.1.2 Parameter Adaptation
There are two key parameters needed to accurately simulate the real world folding environment. The first is the material properties of the fabric, and the second is the frictional forces between the garment and the table.
Through many experiments, we found that the most important property of the garments in the simulation environment is shear resistance, which specifies how much the simulated mesh model resists shear under strain. When the garment is picked up and hangs under gravity, its total length is elongated according to the balance between gravity and shear resistance. An appropriate shear resistance measure allows the simulated mesh to reproduce the same elongation as the real garment, bridging the gap between the simulation and the real world for the garment mesh model.
For each new garment, we follow the steps described below to measure the shear resistance; Figure 21 shows an example.
1. Manually pick one extremal part of the garment, such as the sleeve end of a T-shirt, the waist of a pair of pants, or a corner of a towel.
2. Hang the garment under gravity and measure the length between the picking point and the lowest point as $L_{1}$.
3. Slowly put the garment down on a table, keeping the picking point and the lowest point from the previous step at maximum spread. Measure the distance between these two points again as $L_{2}$. The shear resistance fraction is then defined as $\rho = L_{1} / L_{2}$.
We then put the virtual garment into the same configuration in Maya, adjusting the Maya shear parameter so that the shear fraction computed in the simulator matches the real-world value.
The surface of the table can be rough if covered by a cloth sheet or slippery if not, which leads to variance in friction between the table and the garment. A shift of the garment during folding can impair the whole process and require additional repositioning. Adjusting the friction level in the simulation environment to match the real world is therefore crucial for trajectory optimization.
To measure the friction between the table and the garment, we perform the following steps.
1. Place a real garment on the real table of length $L$.
2. Slowly lift one side of the real table until the garment begins to slide. The lifted height is $H$. The friction angle is computed as $\theta_{f} = \arcsin(H / L)$.
3. In the virtual environment, place the garment flat on a table under gravity. Assign a relatively high friction value to the virtual table, and lift one side of the virtual table to the angle $\theta_{f}$.
4. Gradually decrease the friction in the virtual environment until the garment begins to slide. Use this friction value in the virtual environment, as it mirrors the real world.
With these two parameters set, we obtain similar manipulation results in simulation and on the real garment (see the sketch below).
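Both calibration quantities reduce to simple arithmetic; in the sketch below all measurement values are made-up examples, not data from our experiments:

```python
import numpy as np

# Shear resistance (Section 6.1.2): ratio of hanging length to flat length.
L_hang, L_flat = 0.74, 0.70        # example measurements in meters (made up)
shear_fraction = L_hang / L_flat   # tune Maya's shear parameter to match this

# Friction: lift one table edge until the garment starts to slide.
L_table, H_lift = 1.20, 0.45       # table length and lifted height (made up)
theta_f = np.degrees(np.arcsin(H_lift / L_table))   # friction angle in degrees
print(f"shear fraction {shear_fraction:.3f}, friction angle {theta_f:.1f} deg")
```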
6.2 Trajectory Optimization
The goal of the folding task is specified by the initial and folded shapes of the garment, and by the starting and target positions of the grasp point (as in Figure 22). Given the simulation parameters, we seek the trajectory that effects the desired set of folds. We first describe how to optimize the trajectory for a single end effector and then discuss the case of two end effectors.
6.2.1 Trajectory parametrization
We use a Bézier curve Farin to describe the trajectory. An $n$-th order Bézier curve has $n+1$ control points $P_{0}, \dots, P_{n}$ and is defined by

$$B(t) = \sum_{i=0}^{n} b_{i,n}(t) \, P_{i}, \qquad t \in [0, 1],$$

where $b_{i,n}(t) = \binom{n}{i} t^{i} (1-t)^{n-i}$ are the Bernstein basis functions.
We fix the order to a small value for simplicity, but our method can easily be extended to higher-order curves. $P_{0}$ and $P_{n}$ are fixed to the specified starting and target positions of the grasp point (as in Figure 22). The intermediate control points can then be adjusted to define a new trajectory using the objective function

$$C(\mathbf{P}) = L(\mathbf{P}) + \lambda \, \mathcal{D}\!\left( M(\mathbf{P}), M^{*} \right).$$

Here $C$ is a cost function with two terms. The first term penalizes the trajectory length $L(\mathbf{P})$, thus preferring a folding path that is efficient in time and energy. The second term seeks the desired fold by penalizing the dissimilarity $\mathcal{D}$ between the desired folded shape $M^{*}$ and the shape $M(\mathbf{P})$ obtained by the candidate folding trajectory, as predicted by a cloth simulation; we used the physical simulation engine Maya for the cloth simulation. The weight $\lambda$ balances the two terms and is set empirically in our experiments.
Intuitively, the dissimilarity measures the difference between the desired folded shape and the folded garment in simulation. We define the dissimilarity term as

$$\mathcal{D}\!\left( M, M^{*} \right) = \frac{1}{A} \int_{M^{*}} \left\| x - \tilde{x} \right\| \, dA,$$

where $A$ is the total surface area of the garment mesh (including both sides of the garment), $x$ is a point on the target folded shape $M^{*}$, $\tilde{x}$ is the corresponding point on the simulated folded shape, and $dA$ is the area measure; see Figure 23, left. Our implementation assumes $M$ and $M^{*}$ are given as triangle meshes, and discretizes (20) as

$$\mathcal{D}\!\left( M, M^{*} \right) \approx \frac{1}{A} \sum_{i} \left\| c_{i}^{*} - c_{i} \right\| a_{i}^{*},$$

where $c_{i}^{*}$ is the barycenter of the $i$-th triangle on the target shape, $c_{i}$ is the corresponding barycenter on the simulated shape, and $a_{i}^{*}$ is the area of the $i$-th triangle on the target shape.
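Assuming the target and simulated meshes share a triangle list (so correspondence is implicit), the discrete dissimilarity can be computed as follows (a minimal sketch; names illustrative):

```python
import numpy as np

def dissimilarity(Vt, Vs, F):
    """Area-weighted mean distance between corresponding triangle barycenters
    of the target folded shape (Vt) and the simulated result (Vs); both meshes
    share the triangle list F, so correspondence is implicit."""
    tt, ts = Vt[F], Vs[F]                        # m x 3 x 3 triangle corners
    ct, cs = tt.mean(axis=1), ts.mean(axis=1)    # barycenters
    at = 0.5 * np.linalg.norm(
        np.cross(tt[:, 1] - tt[:, 0], tt[:, 2] - tt[:, 0]), axis=1)
    return (at * np.linalg.norm(ct - cs, axis=1)).sum() / at.sum()
```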
To compute the trajectory length $L(\mathbf{P})$, we use De Casteljau's algorithm Farin to recursively subdivide the Bézier curve into a set of Bézier curves, until the deviation between the chord length ($\|P_{n} - P_{0}\|$) and the total length of the control polygon ($\sum_{i} \|P_{i+1} - P_{i}\|$) falls below a tolerance; the arc length is then approximated by summing the lengths of the subdivided segments.
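A sketch of this subdivision-based length computation (averaging the chord and polygon bounds at termination, and halving the tolerance per split, are common error-control choices, not necessarily the paper's):

```python
import numpy as np

def split_bezier(P):
    """De Casteljau subdivision of a Bezier curve at t = 0.5."""
    left, right, Q = [P[0]], [P[-1]], np.asarray(P, dtype=float)
    while len(Q) > 1:
        Q = 0.5 * (Q[:-1] + Q[1:])       # midpoints of successive control points
        left.append(Q[0])
        right.append(Q[-1])
    return np.array(left), np.array(right[::-1])

def bezier_length(P, tol=1e-6):
    """Arc length via recursive subdivision: stop when the control-polygon
    length and the chord length agree to within tol."""
    P = np.asarray(P, dtype=float)
    chord = np.linalg.norm(P[-1] - P[0])
    poly = np.linalg.norm(np.diff(P, axis=0), axis=1).sum()
    if poly - chord < tol:
        return 0.5 * (poly + chord)      # both bound the true length
    L, R = split_bezier(P)
    return bezier_length(L, tol / 2) + bezier_length(R, tol / 2)
```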