Vision-based Manipulation of Deformable and Rigid Objects Using Subspace Projections of 2D Contours

06/16/2020 ∙ by Jihong Zhu, et al. ∙ IEEE 0

This paper proposes a unified vision-based manipulation framework using image contours of deformable/rigid objects. Instead of using human-defined cues, the robot automatically learns the features from processed vision data. Our method simultaneously generates, from the same data, both visual features and the interaction matrix that relates them to the robot control inputs. Extraction of the feature vector and control commands is done online and adaptively, with little data for initialization. The method allows the robot to manipulate an object without knowing whether it is rigid or deformable. To validate our approach, we conduct numerical simulations and experiments with both deformable and rigid objects.




I Introduction

Humans are capable of manipulating both rigid and deformable objects. However, robotic researchers tend to consider the manipulation of these two classes of objects as separate problems. This paper presents our efforts in formulating a generalized framework for vision-based manipulation of both rigid and deformable objects, which does not require prior knowledge of the object’s mechanical properties.

In the classical visual servoing literature [chaumette2006visual], the vector $\mathbf{s}$ denotes the set of features selected to represent the object in the image. These features represent both the object’s pose and its shape. We refer to the process of selecting $\mathbf{s}$ as parameterization. The aim of visual servoing is to minimize, through robot motion, the feedback error between the desired feature $\mathbf{s}^*$ and the current (i.e., measured) feature $\mathbf{s}$.

One of the initial works on vision-based manipulation of deformable objects is presented in [inoue1984hand], which solves a knotting problem using a topological model. Smith et al. developed a relative elasticity model, such that vision can be utilized without a physical model for the manipulation task [smith1996vision]. A classical model-free approach to manipulating deformable objects is developed in [berenson2013manipulation]. More recent research [lagneau2020active] proposes a method for online estimation of the deformation Jacobian, based on weighted least-squares minimization with a sliding window. In [navarro2014visual] and [navarro2017fourier], vision-based manipulation of deformable objects is termed shape servoing. An expository paper on the topic is available in [navarro2019model]. A recent work on shape servoing of plastic material was presented in [cherubini2020model].

For shape servoing, commonly selected features are curvatures [navarro2014visual], points [wang2018unified] and angles [navarro2013uncalibrated]. Laranjeira et al. proposed a catenary-based feature for tether management on wheeled and underwater robots [laranjeira2017catenary, LARANJEIRA2020107018]. A more general feature vector is one containing the Fourier coefficients of the object contour [navarro2017fourier, zhu2018dual]. Yet, all these approaches require the user to specify a model, e.g., the object geometry [wang2018unified, navarro2013uncalibrated, navarro2014visual] or a function [laranjeira2017catenary, navarro2017fourier, zhu2018dual] for selecting the feature. Alternative data-driven (hence, modeling-free) approaches rely on machine learning. Nair et al. combine learning and visual feedback to manipulate ropes in [nair2017combining]. The authors of [hu20193] employ deep neural networks to manipulate deformable objects given their 3D point cloud. All these methods rely on (deep) connectionist models that invariably require training on an extensive data set. The collected data has to be diverse enough for the model learnt by this type of network to generalize. Furthermore, the above-mentioned methods focus only on deformable object manipulation.

As for pose control of rigid objects, the trend in visual servoing is to find features which are independent of the object characteristics. Following this trend, [chaumette2004image] proposes the use of image moments. More recently, researchers have proposed direct visual servoing (DVS) methods, which eliminate the need for user-defined features and for the related image processing procedures. The pioneering DVS works [collewet2008visual, collewet2011photometric] propose using the whole image luminance to control the robot, leading to “photometric” visual servoing. Bakthavatchalam et al. join the two ideas by introducing photometric moments [bakthavatchalam2013photometric]. A subspace method [marchand2019subspace] can further enhance the convergence of photometric visual servoing, via Principal Component Analysis (PCA). This method was first introduced for visual servoing in [nayar1996subspace]. In that work, using an eye-in-hand setup, the image was compressed to obtain a low-dimensional vector for controlling the robot to a desired pose. Similarly, the authors of [deguchi1996visual] transformed the image into a lower dimensional hyper surface, to control the robot position via in-hand camera feedback. However, DVS generally considers rigid and static scenes, where the robot controls the motion of the camera (eye-in-hand setup) to change only the image viewpoint, and not the environment. This setup avoids breaking the Lambertian hypotheses, which are needed for the methods to be applicable. For this reason, to our knowledge, DVS was never applied to object manipulation, since changes in the pose and/or shape of the object would break the Lambertian assumption.

Compared with the above-mentioned works, our paper makes the following original contributions:

  1. We introduce a new compact feedback feature vector – based on PCA of sampled 2D contours – which can be used for both deformable and rigid object manipulation.

  2. We exploit the linear properties of PCA and of the local interaction matrix, to initialize our algorithm with little data – the same data for feature vector extraction and for interaction matrix estimation.

  3. We report experiments using the same framework to manipulate objects with different unknown geometric and mechanical properties.

In a nutshell, we propose an efficient data-driven approach that allows manipulation of an object regardless of its deformation characteristics. To the best of the authors’ knowledge, no similar framework has been proposed before.

The paper is organized as follows. Sect. II presents the problem. Sect. III outlines the framework. Sect. IV elaborates on the methods. In Sect. V, we analyze and verify the methods by numerical simulations. Then, Sect. VI presents the robotic experiments and we conclude in Sect. VII.

II Problem Statement

In this work, we aim at solving object manipulation tasks with visual feedback. We rely on the following hypotheses:

  • The shape and pose of the object are represented by its 2-D contour on the image as seen from a camera fixed in the robot workspace (eye-to-hand setup). We denote this contour as

    $$\mathbf{c} = [\mathbf{p}_1^\top \; \cdots \; \mathbf{p}_K^\top]^\top \in \mathbb{R}^{2K},$$

    where $\mathbf{p}_j \in \mathbb{R}^2$ denotes the $j$th pixel of the contour in the image.

  • The contour is always entirely visible in the scene and there are no occlusions.

  • One of the robot’s end-effectors holds one point of the object (we consider the grasping problem to be already solved). At each control iteration $i$, its pose is $\mathbf{r}_i \in \mathbb{R}^n$, and it can execute motion commands $\delta\mathbf{r}_i$ that drive the robot as $\mathbf{r}_{i+1} = \mathbf{r}_i + \delta\mathbf{r}_i$.

Fig. 1: Vision-based manipulation of rigid and deformable objects. (a) Rigid objects (left): control the pose (translation and rotation). (b) Deformable objects (right): control the pose and also the shape.
Problem Statement.

Given a desired shape of the object, represented by its contour , we aim at designing a controller that generates a sequence of robot motions to drive the initial contour to the desired one, by relying on visual feedback.

The controller should work without any knowledge of the object physical characteristics (i.e., for both rigid and deformable objects). Typically, since rigid and deformable objects behave differently during manipulation, we set the following manipulation goals:

  • Rigid objects: move them to a desired pose (see Fig. 1(a)).

  • Deformable objects: move them to a desired pose with a desired shape (see Fig. 1(b)).

III Framework overview

In this section, we present an overview of the proposed approach motivated by the problem analysis.

III-A Problem analysis

We can work directly on the object shape space by selecting the contour as the feature vector . However, this will result in an unnecessarily large dimension of the feature vector (e.g., if , has 100 components). The high dimensional feature vector increases the computation demand and complicates the control due to the high under-actuation of the system. Therefore, instead of working directly on shape space, we work on a feature vector space that has reduced dimensions. For that, we split the problem into two sub-problems: parameterization and control, see Fig. 2.

Fig. 2: Graphic representation of the vision-based manipulation problem, with its two sub-problems, parameterization and control.

Parameterization consists in representing the shape contour via a compact feature vector , such that . We denote this representation as .

Control consists in computing robot motions , so that the object’s representation converges to a desired target . Control can be broken down to solving the optimization problem:


where denotes the mapping between robot pose and feature vector, which is assumed to be smooth and often nonlinear. If we know the analytic solution to , we can solve (2) and obtain the target shape in a single iteration by commanding . However, the full mapping is the result of two subsequent mappings. A mapping exists between the robot pose and the resulting contour . This, combined with the parameterization mapping above yields: . To solve for , one needs to have the analytic expressions of both and . While the latter comes from the parameterization, the former () is difficult to obtain, since it encompasses both the object characteristics and the camera projection model. For different objects and camera poses, we expect different properties and camera parameters, therefore a different . Even for , we would like our framework to automatically extract the feature vector without explicit human definition. Therefore, unlike traditional approaches where is user-defined and known, in our framework, the robot has to infer from sensor data, which in return makes deriving difficult.

A solution to this problem is to approximate the full mapping from sensor observations. Classic data-driven approaches typically require a long training phase to collect vast and diverse data for approximating . In some cases (robotics surgery for instance), it is not possible to collect such data beforehand. Moreover, if the object changes, new data has to be collected to retrain the model, leading to a cumbersome process. In this paper instead, we aim at doing the data collection online, with minimum initialization.

Thus, instead of estimating the full nonlinear mapping , we divide it into piece-wise linear models [Journals:Sang2012] at successive equilibrium points. These models are considered time invariant in the neighbourhood of the equilibrium points. We then compute the control law for each linear model and apply it to the robot end-effector.

III-B Proposed methods

Given a target shape , we define an intermediate local target at each (see Fig. 2). At the instance, the robot autonomously generates a local mapping to produce the feature vector . The robot then finds the local mapping online.

Consider at the current time instant , the shape , the intermediate target and the local parameterization , we can transform shape data into a feature vector by:


The linearized version of centered at is then:




The matrix represents a local mapping, referred to as the interaction matrix in the visual servoing literature [chaumette2006visual]. If can be estimated online at each iteration , then, we can design one-step control laws to drive towards .

After the robot has executed the motion command , we update the next target to be , and so on, until it reaches the final (desired) target . Although the validity region of this local linear mapping is much smaller than that of the original nonlinear mapping, it results in an online training with less data and reduced computational demand.

In the following section, we detail our proposed approach.

IV Methods

Figure 3 shows the building blocks for the overall framework. In this section, we focus on the red dashed part of the diagram. We will elaborate on each red block in the subsequent subsections. The blue block represents the image processing pipeline that will be discussed in Sect. VI-A.

Fig. 3: The block diagram that represents the overall framework.

IV-A Feature vector extraction

There are many ways for parameterizing to reduce its dimension. Since the interaction matrix in (4) represents a linear mapping for the control, to be consistent, we look for a linear transformation for the parameterization.

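One of the prominent linear dimension reduction methods is Principal Component Analysis (PCA). PCA finds a new orthogonal basis for high-dimensional data, which enables projecting the data to a lower dimension with the minimal sum of squared residuals. We apply PCA to map the contour $\mathbf{c} \in \mathbb{R}^{2K}$ (with $K$ the number of sampled contour pixels) to the feature vector $\mathbf{s} \in \mathbb{R}^{k}$, with $k \ll 2K$.

To find the projection, we collect $M$ images with different shapes of the object and construct the data matrix $\boldsymbol{\Gamma} = [\mathbf{c}_1 \; \cdots \; \mathbf{c}_M] \in \mathbb{R}^{2K \times M}$. Then, we shift the columns of $\boldsymbol{\Gamma}$ by the mean contour $\bar{\mathbf{c}} = \frac{1}{M}\sum_{j=1}^{M}\mathbf{c}_j$:

$$\bar{\boldsymbol{\Gamma}} = \boldsymbol{\Gamma} - \bar{\mathbf{c}}\,\mathbf{1}_M^\top. \tag{6}$$

We then compute the covariance matrix $\mathbf{Q} = \frac{1}{M-1}\bar{\boldsymbol{\Gamma}}\bar{\boldsymbol{\Gamma}}^\top$, and apply Singular Value Decomposition (SVD) to it:

$$\mathbf{Q} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^\top. \tag{7}$$

Once we have obtained the eigenvector matrix $\mathbf{U} \in \mathbb{R}^{2K \times 2K}$, we select its first $k$ columns, denoted by $\mathbf{U}_k$. Then, the $2K$-dimensional contour $\mathbf{c}$ can be projected into a smaller $k$-dimensional feature vector $\mathbf{s}$ as:

$$\mathbf{s} = \mathbf{U}_k^\top(\mathbf{c} - \bar{\mathbf{c}}). \tag{8}$$

To assess the quality of this projection, we can compute the explained variance using the eigenvalue matrix $\boldsymbol{\Sigma}$ in (7). Denoting the diagonal entries of $\boldsymbol{\Sigma}$ as $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{2K} \ge 0$, the explained variance of the first $k$ components is:

$$\Theta(k) = \frac{\sum_{j=1}^{k}\sigma_j}{\sum_{j=1}^{2K}\sigma_j}, \tag{9}$$

where $\Theta(k)$ is a scalar between $0$ and $1$ (since $\sigma_j \ge 0$ and $k \le 2K$), indicating to what extent the $k$ components represent the original data (a larger $\Theta(k)$ suggests a better representation).

At this stage, it should be clear to the reader that there is a crucial trade-off between choosing a low or high value for the number of features, $k$. A low $k$ will ease controllability (given the limited robot inputs), whereas a high $k$ will improve the data representation (by increasing $\Theta(k)$). In the next section, we explain how to select the best value of $k$.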
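The feature extraction steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `fit_contour_pca` and `project_contour` are hypothetical, contours are assumed to be stored as flattened column vectors, and the covariance is assumed to be normalized by the number of shapes minus one.

```python
import numpy as np

def fit_contour_pca(contours, k):
    """Fit PCA on contours stored column-wise in a (D, M) array.

    Returns the mean contour (D, 1), the first k basis vectors (D, k)
    and the explained variance of those k components.
    """
    mean = contours.mean(axis=1, keepdims=True)
    centered = contours - mean                      # shift columns by the mean contour
    cov = centered @ centered.T / (contours.shape[1] - 1)  # assumed normalization
    # For a symmetric positive semi-definite matrix, the SVD returns the
    # eigenvectors (U) and the eigenvalues (sigma) in decreasing order.
    U, sigma, _ = np.linalg.svd(cov)
    explained = sigma[:k].sum() / sigma.sum()
    return mean, U[:, :k], explained

def project_contour(contour, mean, U_k):
    """Project one flattened contour (length D) onto the k-dimensional basis."""
    column = np.asarray(contour, dtype=float).reshape(-1, 1)
    return U_k.T @ (column - mean)
```

Calling `fit_contour_pca` with `k` equal to the number of robot inputs returns the quantity that the text calls the explained variance, so the trade-off discussed above can be inspected directly on recorded contours.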

IV-B Selection of the Feature Dimension

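The feature vector $\mathbf{s}$ in (8) lies in the new orthonormal basis represented by the first $k$ columns of $\mathbf{U}$ in (7). Therefore, both $\mathbf{s}$ and its variation $\delta\mathbf{s}$ span a space of dimension $k$. Similarly, the robot pose $\mathbf{r}$ and the robot motion $\delta\mathbf{r}$ span a space of dimension $n$.

From (4), we know that at each iteration, $\delta\mathbf{s}$ is the result of a constant linear mapping ($\mathbf{L}$) on $\delta\mathbf{r}$. For an arbitrary number $M$ of samples of small robot motions $\delta\mathbf{R} = [\delta\mathbf{r}_1 \; \cdots \; \delta\mathbf{r}_M]$ such that the mapping stays constant, we have $\delta\mathbf{S} = [\delta\mathbf{s}_1 \; \cdots \; \delta\mathbf{s}_M]$. Using (4) we obtain

$$\delta\mathbf{S} = \mathbf{L}\,\delta\mathbf{R}, \tag{10}$$

such that each column of $\delta\mathbf{S}$ is the result of the corresponding robot motion. To find the relationship between $k$ and $n$, we introduce the following lemma [strang1993introduction]:

Lemma 1.

For any two matrices $\mathbf{A}$ and $\mathbf{B}$ whose product is defined, $\operatorname{rank}(\mathbf{A}\mathbf{B}) \le \min\big(\operatorname{rank}(\mathbf{A}), \operatorname{rank}(\mathbf{B})\big)$.

Applying Lemma 1 to (10):

$$\operatorname{rank}(\delta\mathbf{S}) \le \min\big(\operatorname{rank}(\mathbf{L}), \operatorname{rank}(\delta\mathbf{R})\big). \tag{11}$$

Since $\delta\mathbf{R} \in \mathbb{R}^{n \times M}$, the maximum dimension of $\delta\mathbf{R}$ is $n$: $\operatorname{rank}(\delta\mathbf{R}) \le n$. According to (11), $\operatorname{rank}(\delta\mathbf{S}) \le n$. Since a larger $k$ yields a better approximation of the real shape data, the value of $k$ should be chosen as large as possible considering the condition $k \le n$. Therefore we choose $k = n$.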
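The rank argument can be checked numerically. The sketch below uses made-up sizes (six candidate features, three robot inputs, twenty motion samples) and a random matrix as a stand-in for the local interaction matrix; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_features, n_samples = 3, 6, 20   # illustrative sizes only

L = rng.standard_normal((n_features, n_inputs))        # stand-in local interaction matrix
delta_R = rng.standard_normal((n_inputs, n_samples))   # small robot motions, one per column
delta_S = L @ delta_R                                  # resulting feature variations

rank_R = np.linalg.matrix_rank(delta_R)
rank_S = np.linalg.matrix_rank(delta_S)
# rank_S cannot exceed rank_R, however many features are stacked in delta_S
print(rank_R, rank_S)
```

In other words, adding features beyond the number of robot inputs does not add independent directions that the robot can act on, which is the motivation for matching the feature dimension to the input dimension.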

IV-C Local Target Generation

Let us now explain how we generate a local desired contour given current contour and final desired contour (Algorithm 1). The overall shape error is given by:


We define the intermediate desired contour as:


with an integer that ensures that is a “good” local target for (i.e., the two are similar). This means that if we project the intermediate local data using (note that we are using the full projection matrix and not just the first columns), the feature vector should fulfil:


with a threshold and the -th component of .

Algorithm 1 outlines the steps for computing the local intermediate targets, so that:

  • they are near the final target,

  • the corresponding feature vector can be extracted with the current learned projection matrix.

Remark 1.

The reachability of a local target can only be verified with a global deformation model which we want to avoid identifying in our methods. We will further discuss this issue in the Conclusion (Sect. VII).

localTargetFound = false
while not localTargetFound do
     if  or  then
         localTargetFound = true
     end if
end while
Algorithm 1 Local target generation
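Since the listing above lost its formulas in extraction, the following is only a hedged sketch of how such a search could look, not a reconstruction of Algorithm 1. It assumes that the candidate target is the current contour plus a fraction of the overall shape error, that a candidate is accepted when its projection on the discarded PCA components stays below a threshold, and that the fraction is halved otherwise. The function name `local_target`, the halving rule and the cap `m_max` are all hypothetical.

```python
import numpy as np

def local_target(c, c_final, mean, U, k, eps, m_max=64):
    """Search for an intermediate target between c and c_final (1-D arrays).

    U is the full orthonormal PCA basis and k the number of kept components.
    Returns the accepted candidate contour and the divisor m that was used.
    """
    m = 1
    while True:
        candidate = c + (c_final - c) / m        # step of size 1/m towards the final target
        s_full = U.T @ (candidate - mean)        # projection on the full basis
        discarded = np.abs(s_full[k:])           # components not captured by the k features
        if np.all(discarded < eps) or m >= m_max:
            return candidate, m
        m *= 2                                   # assumed rule: halve the step and retry
```

The cap `m_max` only guards against an endless loop in this sketch; how the original algorithm terminates is not recoverable from the text.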

IV-D Interaction Matrix Estimation

Let us consider the current contour and the local target . In this section, we show how we can implement the PCA and model estimation together and online. We denote the robot motions and corresponding object contours over the last iterations (prior to iteration ) as:


By selecting , we compute the projection matrix , from and via (6) and (7). Then, using , we project current contour , desired contour and shape matrix :


In (16), is normalized by as in (6). We can then compute from (16), by subtracting consecutive columns of :


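Using the motion matrix $\delta\mathbf{R}_i \in \mathbb{R}^{n \times M}$ and the feature variation matrix $\delta\mathbf{S}_i \in \mathbb{R}^{k \times M}$, we can now estimate the local interaction matrix $\mathbf{L}_i$ at iteration $i$. We assume that near this iteration, the system remains linear and time invariant, i.e., $\mathbf{L}_i$ is constant. Using the local linear model (4), we can write the following:

$$\delta\mathbf{S}_i = \mathbf{L}_i\,\delta\mathbf{R}_i. \tag{18}$$

Our goal then is to solve for $\mathbf{L}_i$, given $\delta\mathbf{R}_i$ and $\delta\mathbf{S}_i$. Note that this is an overdetermined linear system (with $kM$ equations for $kn$ unknowns). Let us assume that $\delta\mathbf{R}_i$ has full row rank. Note that this implies $M \ge n$. With this prerequisite, $\operatorname{rank}(\delta\mathbf{R}_i\,\delta\mathbf{R}_i^\top) = n$. Therefore, $\delta\mathbf{R}_i\,\delta\mathbf{R}_i^\top \in \mathbb{R}^{n \times n}$ has full rank, and its inverse exists. We post-multiply (18) by $\delta\mathbf{R}_i^\top$:

$$\delta\mathbf{S}_i\,\delta\mathbf{R}_i^\top = \mathbf{L}_i\,\delta\mathbf{R}_i\,\delta\mathbf{R}_i^\top. \tag{19}$$

Then, since $\delta\mathbf{R}_i\,\delta\mathbf{R}_i^\top$ is invertible, the $\hat{\mathbf{L}}_i$ that best fulfills (18) is:

$$\hat{\mathbf{L}}_i = \delta\mathbf{S}_i\,\delta\mathbf{R}_i^\top\left(\delta\mathbf{R}_i\,\delta\mathbf{R}_i^\top\right)^{-1}. \tag{20}$$

This matrix is also the solution of the optimization problem

$$\hat{\mathbf{L}}_i = \arg\min_{\mathbf{L}} \sum_{j=1}^{k} \left\| \delta\mathbf{R}_i^\top\,\mathbf{l}_j - \boldsymbol{\sigma}_j \right\|^2, \tag{21}$$

where $\boldsymbol{\sigma}_j^\top$ and $\mathbf{l}_j^\top$ are respectively the $j$th row of $\delta\mathbf{S}_i$ and $\mathbf{L}$. The detailed derivation is presented in the Appendix.

If the full row rank condition of $\delta\mathbf{R}_i$ is not satisfied, $\delta\mathbf{R}_i\,\delta\mathbf{R}_i^\top$ is singular. Then, instead of (20), we can use Tikhonov regularization with a factor $\mu > 0$:

$$\hat{\mathbf{L}}_i = \delta\mathbf{S}_i\,\delta\mathbf{R}_i^\top\left(\delta\mathbf{R}_i\,\delta\mathbf{R}_i^\top + \mu\,\mathbf{I}\right)^{-1}. \tag{22}$$

Practically, a rank-deficient $\delta\mathbf{R}_i$ implies that one or more input motions do not appear in $\delta\mathbf{R}_i$. Therefore, we cannot infer the relationship between these motions and the resulting feature vector changes. In this case it is better to increase $M$ and obtain more data, so that $\delta\mathbf{R}_i$ has full row rank.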
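The least-squares estimate and its regularized variant can be sketched with NumPy as follows. This is a minimal sketch under assumptions: the function names and the argument `mu` (the Tikhonov factor) are hypothetical, the motion and feature-variation samples are assumed to be stored one per column, and a linear solve is used in place of an explicit matrix inverse. The second function applies the same pattern with the roles of the two matrices swapped, which yields a direct estimate of the inverse mapping.

```python
import numpy as np

def estimate_interaction(dS, dR, mu=0.0):
    """Least-squares L such that dS ~= L @ dR; mu > 0 adds Tikhonov regularization."""
    n = dR.shape[0]
    gram = dR @ dR.T + mu * np.eye(n)
    # Solve L @ gram = dS @ dR.T for L (transposed so that np.linalg.solve applies).
    return np.linalg.solve(gram.T, (dS @ dR.T).T).T

def estimate_inverse_interaction(dS, dR, mu=0.0):
    """Least-squares G such that dR ~= G @ dS, i.e. a direct estimate of the inverse map."""
    k = dS.shape[0]
    gram = dS @ dS.T + mu * np.eye(k)
    return np.linalg.solve(gram.T, (dR @ dS.T).T).T
```

With `mu=0.0` the solve fails when the window does not excite every input direction, which mirrors the rank condition discussed above; a small positive `mu` trades some bias for a well-posed problem.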

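Instead of computing the interaction matrix, it is also possible to directly compute its inverse, since this guarantees better control properties, as explained in [lapreste2004efficient]. Denoting this inverse by $\mathbf{G}_i$, and with the same data (the robot motions $\delta\mathbf{R}_i$ and the feature variations $\delta\mathbf{S}_i$ collected in the window), one can re-write (18) as:

$$\delta\mathbf{R}_i = \mathbf{G}_i\,\delta\mathbf{S}_i. \tag{23}$$

We can also solve (23) with Tikhonov regularization, with regularization factor $\mu$:

$$\hat{\mathbf{G}}_i = \delta\mathbf{R}_i\,\delta\mathbf{S}_i^\top\left(\delta\mathbf{S}_i\,\delta\mathbf{S}_i^\top + \mu\,\mathbf{I}\right)^{-1}. \tag{24}$$
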
IV-E Control law and stability analysis

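It is now possible to control the robot, using either of the following strategies:

$$\delta\mathbf{r}_i = -\lambda\,\hat{\mathbf{L}}_i^{+}\,(\mathbf{s}_i - \mathbf{s}_i^*), \tag{25}$$

if one estimates the interaction matrix $\hat{\mathbf{L}}_i$ with (22), where $^{+}$ denotes the pseudo-inverse, or:

$$\delta\mathbf{r}_i = -\lambda\,\hat{\mathbf{G}}_i\,(\mathbf{s}_i - \mathbf{s}_i^*), \tag{26}$$

if one estimates the inverse of the interaction matrix, written here as $\hat{\mathbf{G}}_i$, with (24). In both equations, $\lambda$ is an arbitrary control gain, $\mathbf{s}_i$ is the current feature vector and $\mathbf{s}_i^*$ is the local target.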
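A hedged sketch of the two one-step commands is given below. It assumes the usual visual servoing form (a gain times the mapped feature error, with a negative sign); the function names and the argument `gain` are hypothetical.

```python
import numpy as np

def control_step(L_hat, s, s_target, gain):
    """Command computed from an estimated interaction matrix via its pseudo-inverse."""
    return -gain * np.linalg.pinv(L_hat) @ (s - s_target)

def control_step_inverse(G_hat, s, s_target, gain):
    """Command computed from a directly estimated inverse interaction matrix."""
    return -gain * G_hat @ (s - s_target)
```

If the local linear model were exact, applying such a command would shrink the feature error by the factor `1 - gain` at every iteration, which is the behaviour analysed in the proposition that follows.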

Proposition 1.

Consider that locally, the model (4) closely approximates the interaction matrix . For number of linearly independent displacement vectors such that the interaction matrix is invertible, the update rule (25) asymptotically minimizes the error .


With , we can write (4) in discretized form as


The error dynamic


is asymptotically stable for . This can be proved by considering the Lyapunov function


Using the error dynamic (28), one can derive:


This proves the asymptotic stability of the error locally by our inputs. ∎
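For readers who want the argument spelled out, the following is one way to write the proof sketch. The notation is assumed rather than taken from the paper: $\mathbf{e}_i$ is the feature error, $\mathbf{L}$ the locally constant and invertible interaction matrix, $\lambda$ the control gain, and the admissible gain range is simply the one that falls out of this particular derivation.

```latex
\begin{align}
\mathbf{e}_i &= \mathbf{s}_i - \mathbf{s}^*, \qquad
\mathbf{s}_{i+1} = \mathbf{s}_i + \mathbf{L}\,\delta\mathbf{r}_i, \qquad
\delta\mathbf{r}_i = -\lambda\,\mathbf{L}^{-1}\mathbf{e}_i \\
\mathbf{e}_{i+1} &= \mathbf{e}_i - \lambda\,\mathbf{L}\mathbf{L}^{-1}\mathbf{e}_i
                  = (1-\lambda)\,\mathbf{e}_i \\
V_i &= \tfrac{1}{2}\,\mathbf{e}_i^\top\mathbf{e}_i \\
V_{i+1} - V_i &= \tfrac{1}{2}\big((1-\lambda)^2 - 1\big)\,\|\mathbf{e}_i\|^2
               = -\tfrac{1}{2}\,\lambda\,(2-\lambda)\,\|\mathbf{e}_i\|^2 < 0
               \quad \text{for } 0 < \lambda < 2,\ \mathbf{e}_i \neq \mathbf{0}
\end{align}
```

The Lyapunov candidate decreases strictly along the error dynamics, so the error converges to zero as long as the local linear model remains valid.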

IV-F Model adaptation

Since both the projection matrix and the interaction matrix are local approximations of the full nonlinear mapping, they need to be updated constantly. We choose a receding window approach with the window size equal to .

At the current instance, we estimate the projection matrix and the local interaction matrix from the most recent samples in the window. Using the interaction matrix and a local target, we can derive the one-step command by (25). Once we execute the motion, new contour data is obtained, and we move to the next instance, where a new pair of input and shape data is available. We shift the window by deleting the oldest data pair and adding the new one. Then, using the shifted window, we compute the one-step control at the new instance.

The receding window approach ensures that, at each instance, we are using the latest data to estimate the interaction matrix. The overall algorithm is initialized with small random motions around the initial configuration. First, samples of shape data and the corresponding robot motions are collected. With this initialization, we can simultaneously solve for the projection matrix and estimate the initial interaction matrix using the methods described in Sect. IV-A and IV-D. Using the projection matrix and the initial/desired shapes, we can then find an intermediate target as explained in Sect. IV-C.
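The bookkeeping of the receding window can be sketched with a bounded double-ended queue. This is an illustrative sketch only: the class name `RecedingWindow` and its methods are hypothetical, and each sample is assumed to be a pair of a robot motion vector and a flattened contour.

```python
from collections import deque

import numpy as np

class RecedingWindow:
    """Keeps the most recent (motion, contour) pairs and drops the oldest ones."""

    def __init__(self, size):
        self.motions = deque(maxlen=size)
        self.contours = deque(maxlen=size)

    def push(self, motion, contour):
        # A full deque with maxlen silently discards its oldest element.
        self.motions.append(np.asarray(motion, dtype=float))
        self.contours.append(np.asarray(contour, dtype=float))

    def is_full(self):
        return len(self.motions) == self.motions.maxlen

    def matrices(self):
        """Return the motion matrix and the contour matrix, one sample per column."""
        return np.column_stack(list(self.motions)), np.column_stack(list(self.contours))
```

After the random initialization motions have filled the window, each control iteration would push one new pair and rebuild the two matrices used for the projection and for the interaction matrix estimate.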

We consider quasi-static deformation. Hence, at each iteration the system is in equilibrium and can be linearized according to (4). The data that best capture the current system are the most recent ones. The choice of the window size is a trade-off between locality and richness. For fast-varying deformations, we would expect to reduce the window size, as a larger window will hinder the locality assumption (the notion of fast or slow varying depends on both the speed of manipulation and on the object's deformation characteristics, which affect the rate of change in shapes, with regard to the image processing time). Yet, if the window is too small, it affects the estimation of the interaction matrix (refer to the detailed discussion in Sect. IV-D).

V Simulation results

In this section, we present the numerical simulations that we ran to validate our method.

V-A Simulating the objects

We ran simulations on MATLAB (R2018b) with two types of objects: a rigid box and a deformable cable, both constrained to move on a plane. The rigid object is represented by a uniformly sampled rectangular contour. The controllable inputs are its position and orientation. For the cable, we developed a simulator, which is publicly available. The simulator relies on the differential geometry cable model introduced in [wakamatsu2004static], with the shape defined by solving a constrained optimization problem. The underlying principle is that the object’s potential energy is minimal for the object’s static shape [wakamatsu1995modeling]. Position and orientation constraints, imposed at the cable ends, are input to the simulator. The output is the sampled cable. Figures 4-6, 9 and 10 show simulated shapes of cables and rigid boxes. We choose samples for both rigid objects and cables. The camera perspective projection is simulated, with optical axis perpendicular to the plane.

Fig. 4: Six trials conducted to test various choices of feature dimension for a cable. In each sub-figure, the solid red lines are the initial shapes and the dashed black are the shapes resulting from random motions of the right tip (translations limited to of the length, rotations limited to ).

V-B Selecting the feature dimension

To check condition (see Sect. IV-B), we simulate trials with distinct initial shapes of a cable. The dimension of the robot motion vector is (two translations and one rotation of the right tip), and the motions are limited (each translation to of the cable length and the rotation to ). For each trial, we command random motions around the initial shape using our simulator. Figure 4 shows the initial cable shapes (solid red) and resulting shapes from random movements (dashed black).

For each trial, we apply PCA to map the cable contour to feature vector , as explained in Sect. IV-A. We do this for , and and for each of these experiments, we calculate the explained variance with (9). Table I shows these explained variances. In all trials, yields explained variances very close to . This result confirms that choosing as the dimension of the feature vector gives an excellent representation of the shape data. It is also possible to select , since the first two components can represent more than of the variance. Nevertheless, the simulation is noise-free. Therefore, although increases little from to , this increase is not related to noise but to an actual gain in data information.

        trial 1  trial 2  trial 3  trial 4  trial 5  trial 6
k = 1   0.727    0.795    0.871    0.847    0.847    0.705
k = 2   0.992    0.995    0.996    0.997    0.997    0.994
k = 3   0.999    0.999    0.999    0.999    0.999    0.999
TABLE I: Explained variance for the trials with small motion (one row per feature dimension k).

At this stage, it is legitimate to ask how this scales to larger movements. Fig. 5 illustrates cable shapes generated by large movements (angle variation: , maximum translation: ). Again, we apply PCA (); Table II shows the explained variance resulting from various values of the feature dimension.

0 0.5444 0.7218 0.8927 0.9919 0.9990
TABLE II: Explained variance computed with large motion.
Fig. 5: Ten distinctive cable shapes generated by large motion: angle variation: , maximum translation: of the cable length.

Comparing Tables I and II, it is noteworthy that the explained variance with large motion is smaller than with small motion. There are two possible explanations. One is that when shapes stay local, the local linear mapping in (4) remains constant and we need fewer features to characterize it; the more the shape varies, the more features we need. Another possible explanation is that, for larger motions, the number of shapes used may be insufficient for PCA. Likely, the larger the changes, the larger the number of shapes needed for proper PCA.

V-C Manipulation of deformable objects

With our cable simulator, we can now test the controller to modify the shape from an initial to a desired one. Again, the left tip of the cable is fixed, and we control the right tip with degrees of freedom (two translations and one rotation). Using the methods described in Sect. IV, we choose window size , the Tikhonov factor , the local target threshold , the control gain . To quantify the effectiveness of our algorithms in driving the contour to , we define a scalar measure: the Average Sample Error (ASE). At iteration , with current contour it is:


A small ASE indicates that the current contour is near the desired one. In Sect. IV-E, we have proved that our controller asymptotically stabilizes the feature vector to its target. Hence, since we have also shown that the feature vector is a “very good” representation of the contour, we also expect our controller to drive the contour to the desired one, and thus the ASE to zero. This measure is also used in the real experiments.
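The exact ASE formula did not survive in this copy, so the sketch below is an assumption: it takes the ASE to be the mean Euclidean distance, in pixels, between corresponding samples of the current and desired contours. The function name `average_sample_error` is hypothetical.

```python
import math

def average_sample_error(contour, target):
    """Mean distance between corresponding (u, v) samples of two contours.

    Assumed definition; both contours must contain the same number of samples.
    """
    if len(contour) != len(target):
        raise ValueError("contours must have the same number of samples")
    total = sum(math.dist(p, q) for p, q in zip(contour, target))
    return total / len(contour)
```

Under this assumed definition the measure is zero only when every sample coincides with its counterpart, which matches the qualitative description in the text.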

Using the cable simulator, we compared the convergence of the two control laws proposed in our paper, (25) and (26), against a baseline algorithm [zhu2018dual] which uses Fourier parameters as features. To make the methods comparable, we choose a first-order Fourier approximation. Note that this results in a feature vector whose dimension (see [zhu2018dual]) is still twice the number used in our method. We also normalized the computed control action and then multiplied it by the same gain factor.

Figure 6 shows the evolution of the cable shapes towards the target using (26). Figure 7 compares the convergence of our methods against the Fourier-based method. We can observe that our method provides comparable convergence using half the features. Also, directly computing the inverse (26) provides faster convergence than (25). It is worth pointing out that the Fourier-based method requires a different parameterization for closed and open contours (see [navarro2017fourier] and [zhu2018dual]), whereas in our framework the parameterization can be kept the same.

Fig. 6: Cable manipulation with a single end-effector, moving the right tip. The blue and black lines are the initial and intermediate shapes, respectively, and the dashed black line is the target shape. The red frame indicates the end-effector position and orientation generated by our controller.
Fig. 7: The evolution of ASE of the simulated cable manipulation using our method against the Fourier-based method as baseline. Top: left simulation in Fig. 6. Bottom: right simulation in Fig. 6.

Taking a step further, we consider the Broyden update law [broyden1965class], which has been used to update the interaction matrix in classic visual servoing [hosoda1994versatile, jagersand1997experimental, chaumette2007visual] and shape servoing [navarro2013model]. Let us hereby show why it is not applicable in our framework.

The Broyden update is an iterative method for estimating at iteration . Its standard discrete-time formulation is:


with an adjustable gain. Using our simulator, we estimate the interaction matrix using both Broyden update (with three different values of ) and our receding horizon method (22). We then compare (with estimated with either method) the one-step prediction of the resulting feature vector:


with the ground truth from the simulator. The results (plotted in Fig. 8) show that the receding horizon method outperforms all three Broyden trials. One possible reason is that the components of the feature vector fluctuate since, at each iteration, a new projection matrix is used. These variations cause the Broyden method to accumulate the results from old interaction matrices, and therefore to perform poorly in the long term. This result contrasts with that of [navarro2013model], where the Broyden method performs well since there is a fixed mapping from contour data to feature vector. Another advantage of the receding horizon approach is that it does not require any gain tuning.
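For reference, the textbook rank-one Broyden correction can be sketched as below. This is the standard form of the update, shown for comparison only; the function name `broyden_update` and the argument `gamma` (the adjustable gain) are hypothetical, and the exact variant used in the simulations is not recoverable from this copy.

```python
import numpy as np

def broyden_update(L_hat, ds, dr, gamma):
    """Rank-one correction of an interaction matrix estimate.

    ds is the observed feature variation and dr the robot motion that caused it.
    """
    ds = np.ravel(ds)
    dr = np.ravel(dr)
    denom = dr @ dr
    if denom == 0.0:
        return L_hat                      # no motion, nothing to learn from
    prediction_error = ds - L_hat @ dr    # what the current estimate failed to explain
    return L_hat + gamma * np.outer(prediction_error, dr) / denom
```

Because each update only corrects the previous estimate along the latest motion direction, the estimate carries the history of earlier matrices, which is consistent with the explanation given above for its weaker one-step predictions.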

Fig. 8: Comparison, for estimating the interaction matrix, of the receding horizon approach (RH, left: panels (a), (c), (e)) and of the Broyden update (right: panels (b), (d), (f), with three values of the gain). The topmost, middle and bottom plots show the one-step prediction of the three components of the feature vector, respectively. In all plots, the dashed red curve is the ground truth from the simulator. The plots clearly show that the receding horizon approach outperforms all three Broyden trials.

V-D Manipulation of rigid objects

The same framework can also be applied to rigid object manipulation. Consider the problem of moving a rigid object to a certain position and orientation via visual feedback. This time, the shape of the object does not change, but its pose will (it can translate and rotate). We use the same window size, Tikhonov factor, local target threshold and control gain as for cable manipulation. We compare the convergence of the two control laws proposed in our paper, (25) and (26), against a baseline using image moments [chaumette2004image]. The translation and orientation can be represented with image moments, and the analytic interaction matrix can be computed as explained in [chaumette2004image]. To make the methods comparable, we normalize the computed control and then multiply it by the same factor.

Figure 9 shows two simulations where our controller successfully moves a rigid object from an initial (blue) to a desired (dashed black) pose using control law (26). Figure 11 compares convergence of our methods against the image moments method. We can observe that our method provides a slightly slower convergence. Directly computing the inverse (26) provides a convergence similar to (25). Later, we will show why our method is slower. Yet, the fact that it can be applied on both deformable and rigid objects makes it stand out over the other techniques.

Fig. 9: Manipulation of a rigid object with a single end-effector (red frame). The initial, intermediate and desired contours are respectively blue, solid black and dashed black. Note that in both cases, our controller moves the object to the desired pose.
Fig. 10: From an initial (red) pose, we generate (dashed blue) random motions of a rigid object.
Fig. 11: Evolution of ASE of the simulated rigid object manipulation using our method against image moments. Top: left simulation in Fig. 9, Bottom: right simulation in Fig. 9.

Taking a step further, we analyze locally what each component of the feature vector represents. To this end, we apply random movements (rotation range , maximum translation of the width) to a rigid rectangular object of length times larger than width (see Fig. 10). We compute the projection matrix as explained in Sect. IV-A, and transform the contour samples to feature vectors. Following the rationale explained in Sect. IV-B, we set . Then, we seek the relationship – at each iteration – between the object pose , , and the components of the feature vector generated by PCA. For this, we use the bivariate correlation [feller2008introduction] defined by:


$$\rho_{X,Y} = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X \sigma_Y},$$

where $X$ and $Y$ are two variables with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$. An absolute correlation close to 1 indicates that the variables are highly correlated. All the simulations in Fig. 10 exhibit similar correlation between the computed feature vector and the object pose. In Table III, we show one instance (the first simulation from the left in Fig. 10) of the correlation between variables, with high absolute correlations marked in red. It is clear from the table that each component in the feature vector relates strongly to one pose parameter. We further demonstrate the correlation in Fig. 12, where we plot the evolution of object poses and feature components. Note that and are negatively correlated. The slower convergence (compared to image moments) could result from the non-unit correlation between the extracted features and the object pose.

-0.2819 -0.3343 0.9887
0.2607 -0.8547 -0.0465
0.9230 0.3629 -0.1426
TABLE III: Correlation between , , and , , .
Fig. 12: Progression of the auto-generated feature components (rows 1, 3, 5: , , ) vs. object pose (rows 2, 4, 6: , , ). Highly correlated variables are purposely drawn in the same color.
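The correlation analysis above can be sketched as follows. This is a minimal illustration, assuming that the feature vectors and the object poses recorded over the iterations are stored row-wise in two arrays; the names `bivariate_correlation` and `correlation_table` are hypothetical and do not come from the paper.

```python
import numpy as np


def bivariate_correlation(x, y):
    """Bivariate (Pearson) correlation: E[(X - mu_X)(Y - mu_Y)] / (sigma_X sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    return covariance / (x.std() * y.std())


def correlation_table(features, poses):
    """Correlate every feature component with every pose parameter.

    features: (N, k) array, one feature vector per iteration (assumed layout).
    poses:    (N, 3) array, one planar pose per iteration (assumed layout).
    Returns a (k, 3) table in the spirit of Table III.
    """
    features = np.asarray(features, dtype=float)
    poses = np.asarray(poses, dtype=float)
    return np.array([[bivariate_correlation(f, p) for p in poses.T]
                     for f in features.T])
```

An entry with absolute value close to 1 then indicates that the corresponding feature component varies almost linearly with that pose parameter over the recorded motions.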

VI Experiments

Figure 13 outlines our experimental setup. We use a KUKA LWR IV arm. We constrain it to planar () motions , defined in its base frame (red in the figure): two translations and and one counterclockwise rotation around . A Microsoft Kinect V2 observes the object; we only use the RGB image from the sensor, not the depth. A Linux-based -bit PC processes the image at fps. In the following sections, we first introduce the image processing for contour extraction, then present the experiments.

Fig. 13: Overview of the experimental setup.

VI-A Image processing

This section explains how we extract and sample the object contours from an image. We have developed two pipelines, according to the kind of contour (see Fig. 14): open (e.g., representing a cable) and closed. We describe both below.

Fig. 14: Open (left) and closed (right) contours can be both represented by a sequence of sample pixels in the image.

VI-A1 Open Contours

The overall pipeline for extracting an open contour is illustrated in Fig. 15 and Algorithm 2. On the initial image, the user manually selects a Region of Interest (ROI, see Fig. 15(a)) containing the object. We then apply thresholding, followed by a morphological opening, to remove the noise and obtain a binary image as in Fig. 15(b). This image is dilated to generate a mask (Fig. 15(c)), which is used as the new ROI to detect the object in the following image. Figure 15(e) shows the object after a small manipulation motion, and Fig. 15(f) shows, in grey, the mask that contains the cable of Fig. 15(e). On each binary image, we apply Algorithm 2 to uniformly sample the object (see Fig. 15(d), where the green box indicates the end-effector). This is the contour used by our controller.
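The thresholding, opening and dilation steps can be sketched with plain NumPy as below. This is only an illustration under stated assumptions: the object is assumed darker than the background, the structuring element is a 3×3 square, and the threshold and mask growth values are placeholders. The function names are hypothetical, and an image processing library would normally be used for these operations.

```python
import numpy as np


def _neighbourhood_stack(binary):
    """Stack the nine one-pixel shifts of a binary image (3x3 neighbourhood)."""
    padded = np.pad(binary, 1, mode="constant", constant_values=False)
    h, w = binary.shape
    return np.stack([padded[i:i + h, j:j + w]
                     for i in range(3) for j in range(3)])


def erode(binary):
    """Keep a pixel only if its whole 3x3 neighbourhood is set."""
    return _neighbourhood_stack(binary).all(axis=0)


def dilate(binary, iterations=1):
    """Set a pixel if any pixel of its 3x3 neighbourhood is set."""
    for _ in range(iterations):
        binary = _neighbourhood_stack(binary).any(axis=0)
    return binary


def extract_object_and_mask(gray, roi, threshold=100, mask_growth=5):
    """Return the cleaned binary object and the ROI mask for the next image.

    gray: 2D grey-level image; roi: boolean mask of the current ROI.
    threshold and mask_growth are illustrative values, not the paper's.
    """
    binary = (gray < threshold) & roi      # assumption: dark object, light background
    opened = dilate(erode(binary))         # morphological opening removes small noise
    next_roi = dilate(opened, iterations=mask_growth)  # enlarged mask = next ROI
    return opened, next_roi
```

Growing the mask by a few pixels lets the object stay inside the ROI as long as its motion between two consecutive images is small.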

(a) ROI
(b) After thresholding
(c) Mask
(d) Sampled data
(e) Next image
(f) Cable in the mask
Fig. 15: Image processing steps needed to obtain the sampled open contour of an object (here, a cable).
Fig. 16: Image processing for getting a sampled closed contour: (a) original image, (b) image after thresholding and Gaussian blur, (c) extracted contour, (d) finding the starting sample and the order of the samples.

VI-A2 Closed Contours

The procedure is shown in Fig. 16. For an object with uniform color (blue in the experiment), we apply HSV segmentation, followed by a Gaussian blur of size , and finally the OpenCV findContours function, to get the object contour. The contour is then re-sampled using Algorithm 2. The starting point and the order of the samples are determined by tracking the grasping point (red dot in Fig. 16(d)) and the centroid of the object (blue dot). We obtain the vector connecting the grasping point to the centroid. Then, the starting sample is the one closest to this vector, and we proceed along the contour clockwise.
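The choice of the starting sample and of the sample order can be sketched as follows. This is one possible reading of the rule above, with two assumptions: "closest to this vector" is taken as the smallest distance to the segment joining the grasping point and the centroid, and "clockwise" is understood as seen on the image, whose y axis points down. All function names are hypothetical.

```python
import numpy as np


def signed_area(contour):
    """Shoelace formula; positive when the contour runs clockwise on the image
    (image y axis pointing down)."""
    x, y = contour[:, 0], contour[:, 1]
    return 0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y)


def distance_to_segment(points, a, b):
    """Distance from each point to the segment [a, b]."""
    ab = b - a
    t = np.clip((points - a) @ ab / (ab @ ab), 0.0, 1.0)
    closest = a + t[:, None] * ab
    return np.linalg.norm(points - closest, axis=1)


def order_closed_contour(contour, grasp, centroid):
    """Rotate and, if needed, reverse a closed contour so that it starts at the
    sample closest to the grasp-centroid segment and runs clockwise."""
    contour = np.asarray(contour, dtype=float)
    grasp = np.asarray(grasp, dtype=float)
    centroid = np.asarray(centroid, dtype=float)
    if signed_area(contour) < 0:           # counterclockwise on the image: flip
        contour = contour[::-1]
    start = int(np.argmin(distance_to_segment(contour, grasp, centroid)))
    return np.roll(contour, -start, axis=0)
```

Fixing both the starting sample and the direction keeps the samples of the current and desired contours in correspondence from one image to the next.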

Algorithm 2 Generate fixed number of sampled data
Input: , original ordered sampled data and , desired number of data samples generated
Output: , re-sampled data.
1: compute the full length of .
2: compute desired distance per sample:
6: while  do
9:     if  then
12:         ,
13:     else
17:     end if
18: end while
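The purpose of Algorithm 2, producing a fixed number of samples equally spaced along the contour, can be illustrated with the sketch below. It is not a transcription of Algorithm 2: it reaches a similar result by linear interpolation over the cumulative arc length, and the function name `resample_contour` is an assumption.

```python
import numpy as np


def resample_contour(points, n_samples):
    """Re-sample an ordered polyline into n_samples points equally spaced
    in arc length (first and last input points are kept)."""
    points = np.asarray(points, dtype=float)
    segment_lengths = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate(([0.0], np.cumsum(segment_lengths)))  # arc length at each input sample
    targets = np.linspace(0.0, arc[-1], n_samples)             # desired arc lengths
    x = np.interp(targets, arc, points[:, 0])
    y = np.interp(targets, arc, points[:, 1])
    return np.column_stack((x, y))
```

For a closed contour, one option is to append the first sample at the end of the input so that the last segment is also covered.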

VI-B Vision-based manipulation

In this section, we present the experiments that we ran to validate our algorithms, which are also visible online. To demonstrate the generality of our framework, we tested it with:

  • Rigid objects represented by closed contours,

  • Deformable objects represented by open contours (cables),

  • Deformable objects represented by closed contours (sponges).

We carried out different experiments with a variety of initial and desired contours and camera-to-object relative poses. The variety of both geometric and physical properties demonstrates the robustness of our framework. The variety of camera-to-object relative poses shows that—as usual in image-based visual servoing [chaumette2006visual]—camera calibration is unnecessary. The algorithm and parameters are the same in all experiments; the only differences are in the image processing, depending on the type of contour (closed or open, see Sect. VI-A).

We obtain the desired contours by commanding the robot with predefined motions. Once the desired contour is acquired, the robot goes back to the initial position, and then should autonomously reproduce the desired contour. Again, we set the number of features , and use samples to represent the contour . We set the window size , both for obtaining the feature vector and the interaction matrix . The control gain is , the local target threshold and the Tikhonov factor . At the beginning of each experiment, the robot executes steps of small random motions to obtain the initial features and interaction matrix. (The notion of small is relative, and usually dependent on the size of the object the robot is manipulating. Refer to Sect. IV-A, especially Fig. 4, for a discussion on this.)
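To illustrate the role of the control gain and of the Tikhonov factor, the sketch below computes one control step with a damped least-squares inverse of the interaction matrix. Control laws (25) and (26) are not reproduced here, so this generic form is an assumption and may differ from them in detail; the function name and the default values are placeholders.

```python
import numpy as np


def control_step(L, s, s_des, gain=0.1, tikhonov=1e-2):
    """One damped least-squares control step (generic form, assumed here):

        delta_r = -gain * (L^T L + tikhonov * I)^-1 L^T (s - s_des)

    L:     (k, n) interaction matrix estimate, mapping robot motion to feature change.
    s:     (k,) current feature vector; s_des: (k,) desired feature vector.
    """
    error = s - s_des
    n = L.shape[1]
    regularized = L.T @ L + tikhonov * np.eye(n)   # Tikhonov term keeps this invertible
    return -gain * np.linalg.solve(regularized, L.T @ error)
```

With a zero Tikhonov factor and a well-conditioned matrix, this reduces to the usual pseudo-inverse step. A positive factor bounds the commanded motion when the estimated interaction matrix is close to singular.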

For all the experiments, we set the same termination condition at iteration using ASE defined in (31) such that:

  1. pixel and

  2. .

In the graphs that follow, we show the evolution of ASE in blue before the termination condition, and in red after the condition (until manual stop by the operator).

Figure 17 presents eight experiments, one per column. Columns 1 – 3, 4 – 6 and 7 – 8 show respectively the manipulation of a cable, a rigid object and a sponge. The first row presents the full RGB image obtained from the Kinect V2. The second and third rows zoom in on the manipulation at the initial and final iterations. We track the end-effector in the image with a green marker for contour sampling. The desired and current contours are drawn in red and blue, respectively.

Fig. 17: Eight experiments with the robot manipulating different objects. From left to right: a cable (columns 1 – 3), a rigid object (columns 4 – 6) and a sponge (columns 7 and 8). The first row shows the full Kinect V2 view, and the second and third rows zoom in to show the manipulation process at the first and last iterations. The red contour is the desired one, whereas the blue contour is the current one. The green square indicates the end-effector.
(a) cable – column 1
(b) cable – column 2
(c) cable – column 3
(d) rigid object – column 4
(e) rigid object – column 5
(f) rigid object – column 6
(g) sponge – column 7
(h) sponge – column 8
Fig. 18: Evolution of at each iteration , for the experiments of Fig. 17. The black dashed lines indicate the threshold pixel. The blue curves show until the termination condition, whereas the red curves show the error until manual termination by the human operator.

Figure 18 shows the decreasing trend of the ASE for each experiment. The initial increase of the ASE can be due to the random motions at the beginning of each experiment. In general, we found that the ASE is noisier for closed than for open contours. This is visible in Figs. 18(c) and 18(d) (zigzag evolution). Such noise is likely introduced by the way we sample the contour. When the contour data are false, the ASE may exhibit a sudden discontinuity. Figure 19 shows examples of these false samples, output by the image processing pipeline. Despite these errors, thanks to the “forgetting nature” of the receding horizon and to the relatively small window size (), the corrupted data are soon forgotten and do not hinder the overall manipulation task. Yet, the overall framework would benefit from a more robust sensing strategy, as in [Cheng2019Occlusion].
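The forgetting effect can be illustrated with a sliding-window least-squares estimate of the interaction matrix, sketched below. The exact receding horizon formulation of the paper is not reproduced in this section, so this is a generic stand-in: the class name `RecedingHorizonEstimator`, the plain least-squares fit and the window size used in any example are assumptions.

```python
from collections import deque

import numpy as np


class RecedingHorizonEstimator:
    """Estimate L in delta_s = L @ delta_r from the last `window` samples only."""

    def __init__(self, window):
        # A bounded deque silently drops the oldest sample when full.
        self.delta_s = deque(maxlen=window)
        self.delta_r = deque(maxlen=window)

    def add(self, delta_s, delta_r):
        """Store one pair (feature change, robot motion)."""
        self.delta_s.append(np.asarray(delta_s, dtype=float))
        self.delta_r.append(np.asarray(delta_r, dtype=float))

    def estimate(self):
        """Least-squares fit of L over the samples currently in the window."""
        dS = np.array(list(self.delta_s))   # (M, k) feature changes
        dR = np.array(list(self.delta_r))   # (M, n) robot motions
        L_transposed, *_ = np.linalg.lstsq(dR, dS, rcond=None)  # dR @ L^T ~ dS
        return L_transposed.T
```

A corrupted sample biases the estimate only while it remains inside the window. Once as many new samples as the window size have been added, it no longer contributes, which matches the behaviour described above.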

Fig. 19: False contour data from the image can cause noise in ASE.
Fig. 20: Two “move and shape” experiments grouped into two rows. The desired contour (red dotted) is far from the initial one. This requires the robot to 1) move the object, establish contact with the right – fixed – robot arm, 2) give the object the desired shape, by relying on the contact. The first column shows the starting configuration, the second column presents the contact establishment, and the third column zooms in to show the alignment. The last column shows the final results.
(a) move and shape – row 1
(b) move and shape – row 2
Fig. 21: The evolution of for the experiments of Fig. 20. The black dashed line indicates the threshold . The blue curves show until the termination condition, whereas the red curves show the error until manual termination by the human operator.

Finally, since our framework can deal with both rigid and deformable objects, we tested it in two experiments where the same object (a sponge) can be both rigid (in free space) and deformed (when in contact with the environment). These experiments require the robot to: 1) move the object and establish contact, 2) give the object the desired shape by relying on the contact. Figure 20 presents these two original “move and shape” servoing experiments, with the corresponding ASE plotted in Fig. 21. We use a second, fixed robot arm to generate the deforming contact. As the figures and curves show, both experiments were successful.

The success of the “move and shape” task is largely dependent on the contact establishment. However, even when the initial contact has some misalignment (see Figs. 20(c) and 20(g)), our framework can still reduce the ASE to give a reasonable final configuration (see Fig. 20(h) and Fig. 21(b)).

VII Conclusion

In this paper, we propose algorithms to automatically and concurrently generate object representations (feature vectors) and models of interaction (interaction matrices) from the same data. We use these algorithms to generate the control inputs enabling a robot to move and shape the object, be it rigid or deformable. The scheme is validated with comprehensive experiments, including tasks where the desired contour requires both moving and shaping. To the best of our knowledge, this has not been demonstrated in previous research.

Our framework adopts a model-free approach. The system characteristics are computed online from visual and manipulation data. We require neither camera calibration nor a priori knowledge of the camera pose, object size or shape. The management of 6-DOF arm motions remains an open question. Indeed, while the proposed control strategy can be easily generalized to 6-DOF motions, it relies on a sufficiently accurate extraction of feature vectors from vision sensors. Generating complete 3D feature vectors of objects from a limited sensor set is very challenging, due to partial views of the objects and occlusions.

The framework could benefit from robust sensing of deformation. Moreover, one obvious drawback is that the representation and the model of interaction are extremely local, so they cannot guarantee global convergence. In addition, our framework cannot infer whether a shape is reachable or not. This drawback could be solved by using a global deformation model for control. But, as we mentioned earlier, a global model usually requires an offline identification phase, which we want to avoid, and the model must be re-identified for each different object. Hence the dilemma in using a global deformation model.

One possible solution to this dilemma is to run our method and deep-learning-based methods in parallel. While our scheme enables fast online computation and direct manipulation, the data it extracts can be used by a deep neural network to obtain a global interaction mapping. Once learned, this global mapping can be used for direct manipulation and to infer the feasibility of the goal shape.