I Introduction
Manipulation enables robots to physically interact with their environment. Robotics researchers have made significant progress on tasks such as grasping [shimoga1996robot, pinto2016supersizing, grasp_unseen, levine2018learning, zeng2018learning] and dexterous manipulation [yousef2011tactile, andrychowicz2020learning] of rigid objects. In this work, we focus on the problem of interacting with nonrigid objects. Learning to manipulate nonrigid objects allows robots to handle fragile [wen2020force, soft_object] and flexible objects [rambow2012autonomous], as well as household items [matas2018sim, wu2019learning]. Although research on rigid object manipulation is a mature field, existing techniques cannot be applied directly to nonrigid objects [sanchez2018robotic]. In this paper, we introduce velcro peeling as an illustrative application for manipulating a nonrigid object in a complex geometric setting (Fig. 1).
The goal is to peel velcro over a surface with unknown geometry, which provides a unique set of challenges: (1) force feedback measurements can be ambiguous, (2) visual feedback is not always viable due to self-occlusion, and (3) the system state is not directly observable. Since a camera’s view can be blocked by self-occlusion, and merging different sensor modalities such as vision and touch is challenging, we investigate whether a robot can learn to peel velcro from force feedback alone. However, relying solely on touch brings additional challenges. Suppose the robot has established an initial grasp of the velcro endpoint, and the task is to open the velcro fully. Tactile feedback can only be measured if the material resists the robot’s pulling motion. However, we show in Section III that the measured feedback signal while peeling the velcro is nearly identical to the signal when the robot is pulling in the wrong direction. Additionally, our experiments on a real robot identified a force void space, where no force feedback can be measured at all. Learning correct behavior from such sensor signals requires reasoning capabilities over a long time horizon.
We believe that velcro peeling provides an approachable, representative application for manipulating nonrigid objects with only touch feedback. It also has practical applications considering that velcro is a common material found on everyday objects such as coats, bags, and shoes.
We propose a novel simulation environment where a velcro strip is placed on a variety of surfaces, including planar and cylindrical ones (Fig. 1). We show how the ideal, fully observable version of the peeling task can be formulated as a Markov Decision Process and solved optimally. If only tactile measurements are available, the problem becomes partially observable. For this version, we present a multi-step Deep Recurrent Q-Network (DRQN) that can successfully solve more than 90% of the configurations under geometric uncertainty and ambiguous sensor feedback. Our method improves performance over existing baselines by over 20%.
Our contributions are summarized as follows:

We introduce velcro peeling as a representative application to learn nonrigid robotic object manipulation from only touch feedback.

We present a Multi-Step DRQN architecture that handles long-term dependencies between sensor measurements to successfully peel velcro strips from varying geometric shapes.

We validate our approach in simulation and in experiments on a real robot.
II Related Work
Markov Decision Processes (MDPs) provide a mathematical formulation for reinforcement learning problems. The standard MDP formulation assumes that the current state of the environment is fully observable and that the optimal action choice depends solely on this current state. However, estimating the current state is nontrivial. As such, various neural network architectures have emerged as powerful tools to learn state estimation from observations. Deep Q-Networks (DQN) achieved human-level control in Atari games for discrete [mnih2015dqn] and continuous action spaces [dqn_continuous]. The same approach can be used for robotic manipulation. However, the MDP modeling approach performs well only in fully observable systems, such as Atari games, in which low-dimensional features can be extracted from observations due to relatively simple physical environment settings. Extending these methods to manipulation tasks in 3D Cartesian space with combinations of multimodal sensory inputs such as vision, tactile, and proprioceptive data [making_sense_of_VaT] is an active area of research. If the environment additionally contains uncertainty or noise, the MDP-based modeling approach fails. Instead, the problem can be reformulated as a Partially Observable Markov Decision Process (POMDP). Two recent papers [grasping_pomdp_2007, pomdp_lite_grasping] used this approach to learn robust robotic grasping. Glashan et al. [grasping_pomdp_2007] simplified the grasping observation and state space to discrete abstractions to reduce the probability model complexity in their POMDP formulation. Similarly, Chen et al.
[pomdp_lite_grasping] fixed the hidden state variables and introduced more deterministic properties to the Bayesian transition model. They successfully achieved robust two-finger grasping under uncertainty. In more challenging manipulation tasks, where multiple objects interact with each other through collision and friction, it is nontrivial to design an optimal state-space representation. Sung et al. [diode_deepRNN] proposed a variational Bayesian model to learn a continuous state representation from the tactile signal, followed by a planning network. In contrast, we propose a method that skips feature extraction altogether and learns a mapping from a tactile signal sequence to the optimal action.

Several prior works have applied hand-designed controllers in combination with tactile feedback to solve rigid body manipulation tasks [haptic_feedback_control_grasp, inverse_sensor_modeling]. When the geometric parameters of the object are known, a PID controller can even address the peg-in-hole problem robustly [geometric_peginhole]. These controllers address particular tasks well when the states in their models are accurately measured or estimated. Koval et al. [koval2017manifold] demonstrated particle filtering for the states of a noisy robot arm informed by a tactile sensor. Platt et al. [platt2011using] applied Bayesian estimation using tactile feedback to localize flexible materials during manipulation. Sutanto et al. [latent_space_tactile_servoing] developed an approach to predict tactile servoing actions based on a learned latent space representation. The same method can also be applied to estimate physical object properties such as elasticity and stiffness [latent_elasticity, latent_stiffness]. However, there might not be a correlation between changes in the latent representation and the agent's actions. This is one of the cases where tactile input for feature extraction performs worse than other sensory inputs such as vision. In the velcro peeling case, it is hard to estimate the status of the velcro’s loops and hooks based on the pressure mapping and shear force readings at the gripper finger. Additionally, the detachment of the hooks and loops introduces noise to the sensor, and the resistance force can result in ambiguous sensor readings. We show that we can overcome these challenges with our proposed Multi-Step DRQN architecture, which considers the long-term dependencies between individual observations.
Tactile sensory input can provide useful perceptual capabilities to assist manipulation [making_sense_of_VaT]. When combined with visual input, a neural network can fuse the two signals and extract features for object classification. Combining vision and tactile observations was shown to benefit applications [crossmodal_touch_and_vision, haptic_SVM_manipulation, haptics_drilling_carving] such as slicing, drilling, and carving, which require direct perception around the contact area. In [whats_in_container], the authors showed that vision and tactile feedback can each individually identify objects inside a container. In our work, vision does not provide a stable input signal due to self-occlusions. Instead, we show that a high-fidelity model can be learned from tactile observations alone.
III Velcro Peeling
This section presents an initial experiment on a real robot that highlights the challenges of designing velcro peeling strategies for complex geometries. We measure the force feedback while moving the manipulator along predefined trajectories, parametrized by the angle θ between the peeling direction and the x-axis. Fig. 2 shows the force magnitude of each trajectory as we vary the percentage of already peeled velcro. The force is measured with a tri-axis load sensor on the wrist of a Kinova Gen 3 7-DOF manipulator, and the magnitude is indicated by the strength of the marker in the figure: the darker the dot, the larger the force magnitude.
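As an illustration, such straight-line trajectories can be generated from the angle alone. The sketch below is hypothetical; the angle name theta and the sweep resolution are our own choices, not the experiment's exact parameters.

```python
import math

def peel_direction(theta_deg):
    """Unit vector for a straight-line peeling trajectory, where theta is
    the angle between the peeling direction and the x-axis."""
    theta = math.radians(theta_deg)
    return (math.cos(theta), math.sin(theta))

# Sweep candidate trajectories, e.g. one every 15 degrees from 0 to 180.
candidates = [peel_direction(t) for t in range(0, 181, 15)]
```

Each candidate direction is then executed open-loop while the wrist load sensor records the force magnitude along the trajectory.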
We observe a tactile void space (where force feedback is weak) that grows as more of the velcro is peeled off. These extended sequences without useful feedback pose a significant challenge for standard Deep Q-learning approaches [mnih2015dqn]. As the force feedback becomes weaker over time, the learned policy cannot differentiate between states, and all Q-values become equally likely, meaning that no optimal action exists. In the next section, we present our approach to address this challenge.
IV Method
In this section, we briefly describe our model of the velcro peeling task in simulation together with the mathematical notation (Table I). Next, we show that if the full state is observable, velcro peeling can be solved by choosing greedy decisions based on the peeling boundary region. Finally, we study the partially observable cases, where the controller observes only vision or tactile feedback.
IV-A Problem Formulation
We aim to peel a velcro strip of uniform width, applied to surfaces of varying geometry. The manipulator initially holds the velcro’s handle in its gripper. We discretize the velcro into consecutive flat pieces, each carrying a binary value indicating whether the segment is peeled (1) or attached (0), so the global state of the velcro is a string of attached/peeled bits. Given tactile and proprioceptive observations, our goal is to find an optimal strategy to control the gripper so that the velcro attachment state changes from all zeros (fully attached) to all ones (fully peeled).
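This bit-string view of the task can be sketched in a few lines (the segment count below is arbitrary, not the paper's discretization):

```python
N = 20                      # number of discretized velcro segments (arbitrary)
state = [0] * N             # 0 = attached, 1 = peeled; initially fully attached

def peel(state, i):
    """Mark segment i as peeled; detaching is irreversible in this model."""
    state[i] = 1

def task_done(state):
    return all(state)       # goal: every segment peeled
```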
Notation  Description
IV-B Simulation Model
The velcro strip is simulated as a 2D net of point-mass nodes connected by spring-damper units. The variables relevant to the state (listed in Table I) describe the translation, rotation, and radius of the surface shape on which the velcro is mounted. The velcro strip’s hooks and loops are simulated with breakable tendons. At each time step, the environment state includes the position and velocity of each velcro node, the length of all tendons, the manipulator’s end-effector pose, and the tactile feedback measurement. Additional details of the simulation setup are introduced in Section V.
IV-C Our Approach
Humans peel a velcro strip by grasping one end and pulling towards a direction roughly between the surface tangential and the peeling boundary’s surface normal. In our experiments in Section V, we show that if the states of all velcro nodes are observable, we can compute the surface tangential and normal. In this case, a simple greedy strategy suffices to peel the velcro.
Of course, in real-world environments, it is not possible to observe these environment variables directly. Additionally, it is challenging to estimate the velcro’s geometric properties accurately. Since the environment state is not fully observable, we formulate our approach as a Partially Observable Markov Decision Process (POMDP). A POMDP is characterized by a 6-tuple (S, A, T, R, Ω, O): states S, actions A, a state transition function T, a reward function R, and observations Ω received according to an observation function O. In our case, the observations contain only the position of the end-effector and the tactile feedback measurement. We use the area of peeled velcro as our reward, and we define six possible manipulator actions (move left, right, forward, backward, up, and down).
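The six-action discrete control space can be written down directly. A minimal sketch follows; the step size delta is our own placeholder, not the paper's value.

```python
# Six discrete Cartesian actions: unit displacements along +/- x, y, z.
ACTIONS = {
    "left":     (-1, 0, 0),
    "right":    ( 1, 0, 0),
    "forward":  ( 0, 1, 0),
    "backward": ( 0,-1, 0),
    "up":       ( 0, 0, 1),
    "down":     ( 0, 0,-1),
}

def apply_action(pos, name, delta=0.005):
    """Displace the end-effector by a fixed step along the chosen axis."""
    d = ACTIONS[name]
    return tuple(p + delta * di for p, di in zip(pos, d))
```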
We use reinforcement learning [sutton2018reinforcement] to learn a control policy that, at each time step, receives an observation and chooses an action that optimizes the long-term reward. If the state is directly observable, the problem can be solved by learning an optimal policy that maximizes the expected sum of future rewards, i.e., the discounted sum of returns over an infinite time horizon. Q-Learning [watkins1992q] is a model-free, off-policy algorithm for estimating these expected long-term rewards, or Q-values. However, in real-world scenarios, it is often impossible to observe the state directly [hausknecht_deep_2015]. In this case, estimating the Q-values from a single observation can be arbitrarily bad, since the Q-value of an observation can differ arbitrarily from the Q-value of the underlying state. Hausknecht and Stone [hausknecht_deep_2015] showed that estimating the state from multiple observations with a Deep Recurrent Q-Network (DRQN) leads to better policies in partially observed environments.
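For reference, the discounted return that the policy maximizes and the standard Q-learning update take the usual forms (standard notation; γ is the discount factor and α the learning rate):

```latex
G_t \;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}\right], \qquad 0 \le \gamma < 1,
\qquad
Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right).
```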
However, in our experiments in Section V, we show empirically that the standard DRQN approach suffers from the long-timescale memory transport problem [longtime_mem_transpt]. We show that we can address this issue by slowing down the Q-value estimation frequency. To achieve this, we propose a Multi-Step DRQN approach, as shown in Fig. 3.
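A minimal sketch of the resulting control loop follows; the force threshold, the repetition limit k, and all function names are our own placeholders, not the paper's implementation.

```python
def multi_step_policy(select_action, force_magnitude, step_env, k=4, force_eps=0.1):
    """Repeat the chosen action while tactile feedback stays weak, for at most
    k environment steps, so Q-values are estimated at a slower rate inside
    the tactile void space."""
    obs_history = []
    action = select_action(obs_history)       # Q-network picks an action
    for _ in range(k):
        obs = step_env(action)                # execute one environment step
        obs_history.append(obs)
        if force_magnitude(obs) > force_eps:  # meaningful feedback: re-plan
            break
    return action, obs_history
```

The accumulated variable-length observation history is then summarized by the recurrent tactile network before the next Q-value estimate.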
When the tactile feedback is weak (usually when there is slack in the peeled part of the velcro), the force measurements carry little meaningful information. In our initial experiment in Section III, we showed that this tactile void space grows as the percentage of peeled velcro increases. To estimate the Q-values within this void space, the agent needs to reason over observations that span a long time horizon. To overcome this issue, our Multi-Step DRQN outputs a single action for a maximum of k time steps while tactile feedback is weak. Once sensor measurements become meaningful, the agent predicts the state-action value and chooses the next action as in the standard DRQN network. Since the resulting observation sequences do not have a fixed length, we use a Long Short-Term Memory (LSTM) layer along with two linear layers to learn a fixed-size tactile feature vector. This tactile feature vector is then used as input for the DRQN network to estimate the Q-values. Our proposed network architecture enforces more state exploration inside the tactile void space and slows down the Q-value estimation frequency. We also apply the Double Q-learning method from [double_dqn], which uses two independent Q-networks for Q-value estimation to increase training stability.

V Experiments: Design and Setup
In this section, we first introduce the velcro strip simulation model and associated parameters. Next, we introduce the evaluation metrics, implementation details, and the evaluation baselines. Finally, we introduce our setup for realworld evaluation.
V-A Velcro Model
We use MuJoCo, a fast and accurate physics engine optimized for dynamical systems with rich contacts and constraints, for our simulations. We simulate the velcro strip as closed-loop kinematic chains, consisting of two 2D arrays of point-mass nodes connected through tendons. Each tendon is modeled as a spring-damper unit to impose force constraints and motion limits (Fig. 4). The bottom layer is fixed on a table structure, while the upper layer is constrained only by the tendons. Once the spring tension exceeds a certain threshold, we reduce the spring constant to 0 to eliminate the relevant constraint, i.e., the hook detaches from the loop. To simplify simulation behavior, we do not recover the spring constant even if the spring displacement returns to zero, i.e., the hook does not reattach to the loop once it has detached. The tendon spring constant remains unchanged throughout training for consistent dynamic behavior.
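The one-way detachment rule described above can be sketched as follows. This is a simplified stand-in for the MuJoCo tendon model; the constants are illustrative, not the paper's parameters.

```python
class Tendon:
    """Spring-damper 'hook' unit that detaches permanently once overstretched."""

    def __init__(self, k=100.0, rest_len=0.01, break_threshold=0.02):
        self.k = k                              # spring constant
        self.rest_len = rest_len                # natural spring length
        self.break_threshold = break_threshold  # max stretch before detaching
        self.broken = False

    def force(self, length):
        if self.broken:
            return 0.0                  # never reattaches, even at zero stretch
        stretch = max(length - self.rest_len, 0.0)
        if stretch > self.break_threshold:
            self.k = 0.0                # hook detaches from the loop
            self.broken = True
            return 0.0
        return self.k * stretch
```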
V-B Model Generation
To introduce geometric variations, we parameterize three different geometry scenarios from which we sample training data. Namely, we sample models with the following variations (Fig. 4): velcro translation (two parameters), rotation (three parameters), and concavity, where the velcro is generated on a cylindrical surface of varying radius. During training, we randomly initialize a model by choosing a set of parameters for each episode.
V-C Robot Agent
We simulate only the gripper part of the manipulator to achieve fast planning. The gripper is a standard parallel-jaw gripper equipped with force-torque sensors at the fingertips (Fig. 4). We avoid the computational complexity of inverse kinematics by controlling the gripper directly in the end-effector frame using position and velocity commands.
Unlike [making_sense_of_VaT], we use a set of discrete actions in 3D Cartesian space. Each Cartesian-space action displaces the manipulator by a fixed distance. The action displacement is sampled into a quintic polynomial trajectory to obtain the joint position and joint velocity commands at each time step through a PD controller. The Cartesian-space action magnitude is fixed to ensure an equal number of simulation steps per action.
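A quintic profile with zero velocity and acceleration at both endpoints can be sketched as below. This is our own helper, not the authors' controller code.

```python
def quintic_profile(d, T, t):
    """Position along a straight-line displacement d at time t in [0, T],
    with zero velocity and acceleration at both endpoints."""
    s = max(0.0, min(t / T, 1.0))            # normalized time in [0, 1]
    blend = 10 * s**3 - 15 * s**4 + 6 * s**5  # quintic smoothing polynomial
    return d * blend
```

Sampling this profile at the controller rate yields the joint position targets that a PD controller can track.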
V-D Reward Design
In our modeling approach, a velcro tendon’s spring constant is set to 0 to approximate the hooks detaching from the loops; we call this process breaking the tendons. The reward at each step is the number of tendons the manipulator breaks during that step.
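As a sketch, the per-step reward can be computed by counting newly broken tendons (the function and variable names are ours):

```python
def step_reward(broken_flags, broken_before):
    """Reward for one action step: number of tendons newly detached.

    broken_flags: per-tendon booleans after the step.
    broken_before: count of broken tendons before the step.
    Returns (reward, updated broken count).
    """
    broken_now = sum(broken_flags)
    return broken_now - broken_before, broken_now
```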
V-E Evaluation Metric
We evaluate our approach in environments generated similarly to the models used during training. To demonstrate shortcomings and failure cases occurring under geometric self-occlusion, we generate three test cases with different parameters. Test case 1 includes variation in translation and one rotation parameter. Test case 2 additionally contains variations in the other two rotation parameters. Test case 3 uniformly samples 50% of the environments on planar tables and 50% on cylinder-shaped tables of varying radius. In total, we generate 500 examples each for test cases 1 and 2, and 1000 samples for test case 3. We measure success via the completion ratio (number of broken tendons divided by the total number of tendons). To discourage infinite exploration, we formulate three termination criteria: success when the manipulator peels off the whole velcro, failure when the manipulator loses hold of the velcro, and failure when the time limit is exceeded. We set the time limit to 200 steps for both training and testing.
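The completion ratio and the three termination criteria can be made concrete as follows (a hypothetical sketch; function and variable names are ours):

```python
TIME_LIMIT = 200  # maximum number of action steps per episode

def episode_status(broken, total, holding, step):
    """Return (completion_ratio, outcome) under the three termination criteria."""
    ratio = broken / total
    if broken == total:
        return ratio, "success"
    if not holding:
        return ratio, "failure: lost hold of the velcro"
    if step >= TIME_LIMIT:
        return ratio, "failure: time limit exceeded"
    return ratio, "running"
```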
V-F Implementation Details
Our tactile sequence network is implemented as an LSTM layer followed by two linear layers. In the simulation, the tactile input consists of force and torque values for both fingers, sampled at a fixed rate; each observation thus contains 186 values (6 values for the end-effector pose and 180 for the tactile observation). The output hidden vector is of size 150. Our Q-network is a three-layer Multilayer Perceptron (MLP), with two linear layers followed by an LSTM layer. Finally, a linear layer outputs a Q-value for each action. We use ReLU nonlinearities and batch-normalization layers for weight normalization. During training, we jointly learn the linear layer weights and the LSTM weights. For each episode, we randomly sample the geometric uncertainty parameters to generate a new model, and the episode terminates after 200 action steps. The policy is trained for 1500 episodes using the RMSProp optimizer with a fixed learning rate, for about 30 hours on a single NVIDIA GeForce GTX 1080.

V-G Baseline Strategies
We compare our approach with six different baselines. Full observation: geometric-input greedy approach (Geom-greedy). If the full state is observable, a hand-designed strategy based on the peeling boundary’s geometric information is sufficient to solve the peeling task. Specifically, the peeling orientation, the position and normal vector at the peeling boundary, and the gripper orientation need to be observed. We use this information as the basis for a greedy algorithm similar to the setup presented in Section III. The end-effector follows trajectories defined by the angle θ (see Fig. 5). Trajectories that yield successful peels are plotted in green, while failed ones are shown in red. The results show that, by approximating the velcro shape with straight line segments, the agent only needs to drive the end-effector inside a peeling cone while increasing its distance from the cone’s origin.
Partial observation. As geometric velcro properties cannot be observed directly in the real world, we limit the observations to measurable data: vision or tactile input. Open loop (sweep-through). The most straightforward strategy is to pull in a single direction. We randomly sample a direction from a hemisphere, assuming the velcro strip’s topside is always facing up. The gripper moves in the sampled direction until failure or success.
State estimation + hand-designed strategy. This strategy estimates the peeling boundary and normal of the velcro using either vision or tactile input and generates actions using the previously presented Geom-greedy approach. We design two strategies based on the sensor modality: Vision-greedy and Tactile-greedy. For the Vision-greedy method, we place a camera that captures RGB-D images in the simulation environment. We use a neural network with a ResNet18 encoder followed by three additional linear layers to predict the peeling orientation, the position, and the normal vector at the peeling boundary from images. We randomly explore the state space and collect 4532 images with the associated geometric features from 1000 different velcro configurations. For the Tactile-greedy method, we sample 1000 random exploration trajectories and train a recurrent neural network to predict the geometric information of the peeling boundary. Both the Vision-greedy and Tactile-greedy methods are trained in a supervised manner until convergence.
Reactive policy. The reactive policy network is a standard Q-value network. It contains no memory of previously chosen actions or observations and selects the next action based only on the current observation.
Single-step DRQN. The single-step DRQN closely follows the approach introduced in [hausknecht_deep_2015]. The current observation is processed by a DRQN network to predict Q-values at every step. The network keeps memory in the DRQN’s internal hidden state.
V-H Real Robot Setup
For real-world evaluations, we use the same setup as described in Section III. We fix the 3D-printed structure on the table and rotate the robot’s end-effector frame at initialization, rotating the discrete Cartesian action space and the tactile force frame accordingly. We then add a random offset to the position observation to introduce translational uncertainty. We found that the hooks and loops detaching during the velcro peeling process yield a distinctive audio signal that can approximate the reward signal. We record the audio of a single, non-continuous velcro peeling process in a quiet room. We then threshold and average the spectrum of this signal over time to obtain a filter in the frequency domain. A 1D convolution with this signal spectrum yields a filtered signal without background noise, which we use to approximate the amount of velcro peeled during the process.
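One way to realize this spectral filtering with NumPy is sketched below; the frame size and threshold are placeholders, not the recording parameters used in the paper.

```python
import numpy as np

def peel_template(signal, frame=256, thresh_ratio=0.5):
    """Average the magnitude spectrum of a clean peel recording over frames,
    keeping only bins above a threshold, to build a frequency-domain template."""
    frames = signal[: len(signal) // frame * frame].reshape(-1, frame)
    spec = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    spec[spec < thresh_ratio * spec.max()] = 0.0
    return spec

def match_filter(signal, template, frame=256):
    """Correlate each frame's spectrum with the template; higher scores
    indicate more peel-like sound (an approximate reward) in that frame."""
    frames = signal[: len(signal) // frame * frame].reshape(-1, frame)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return spec @ template
```

Frames whose score exceeds a calibrated level are counted toward the approximate amount of velcro peeled.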
We train the real robot agent for 200 episodes and then test the result on test cases 1 and 2. In total, we generated 30 examples for both case 1 and case 2.
VI Results
We record the episodic return (total number of tendons broken), the time (number of discrete action steps), and the final result as success or failure. The episodic returns are discretized into five ranges. Fig. 6 shows 100% stacked column charts of the episodic returns.
k  Test 1 (success / time / return)  Test 2 (success / time / return)  Test 3 (success / time / return)
1  82% / 138 / 184  68% / 143 / 155  47% / 152 / 122
2  84% / 117 / 144  65% / 157 / 142  61% / 164 / 151
4  97% / 87 / 210  88% / 115 / 207  76% / 136 / 173
8  98% / 82 / 213  92% / 97 / 206  85% / 129 / 191
We also compute the success rate, average time step, and average episodic return for each test case and summarize them in Table II. Under full observability, the Geom-greedy approach achieves a 100% success rate with the shortest peeling time among all methods. This is expected, as Geom-greedy can follow the direction between the surface tangential and the normal of the peeling boundary throughout the episode. When we compare the Vision-greedy approach with Geom-greedy, we see that Vision-greedy achieves perfect performance on test 1. However, when the test cases contain geometric self-occlusions, as in test cases 2 and 3, its performance decays quickly. Among all approaches with partial observation, our Multi-Step DRQN achieves the highest success rate on test sets 2 and 3 and competitive performance on test set 1.
To show the importance of tactile feedback measurements, we provide ablations of the observation representation. The agent observes only the gripper position and the tactile feedback, and the training is conducted with both single-step and multi-step DRQN formulations. Additionally, we study the influence of the number of sampled time steps on the DRQN agents. Both ablation study results are summarized in Table II and Table III.
In Table II, both the single-step and multi-step DRQN approaches show a performance drop when either the force or the position input is missing, indicating that observing only part of the geometry is not sufficient to solve the task, and that partially observed tactile feedback can be ambiguous with respect to the underlying physical properties. In Table III, the multi-step DRQN agent is trained with the same hyperparameters except for k, which is selected from {1, 2, 4, 8}. For k = 1, the method is identical to the reactive method; the small performance difference can be attributed to differently initialized parameters. The performance improves significantly as k increases, indicating the importance of memory and long-horizon reasoning for the agent to solve this task.
In the real robot evaluations, we show that the multi-step approach achieves a similar success rate as in simulation. It also outperforms the single-step DRQN approach both in success rate and in the average number of time steps it takes to finish the task.
Method  Test 1 (success / time)  Test 2 (success / time)
Open loop  40% / 18  27% / 25
Single-step  90% / 28  83% / 32
Multi-step  97% / 21  88% / 25
VII Conclusion
We introduced the task of peeling velcro strips mounted on varying geometries as a new task for nonrigid robotic manipulation. To solve this task in the presence of environmental uncertainties, we proposed a novel Multi-Step DRQN architecture that outperforms all baselines in two out of three test cases and achieves competitive performance on the last one. We provide experiments in both simulation and a real robot setup to evaluate our approach and to explain the need for a network that models the long-term dependencies between observations. The empirical evaluation and comparison with multiple baseline methods provide a benchmark for future work on this problem. Exciting future research directions include learning the initial grasp of the velcro strip and generalizing to more complex geometric configurations.