Extracting Contact and Motion from Manipulation Videos

07/13/2018 ∙ by Konstantinos Zampogiannis, et al. ∙ 6

When we physically interact with our environment using our hands, we touch objects and force them to move: contact and motion are defining properties of manipulation. In this paper, we present an active, bottom-up method for the detection of actor-object contacts and the extraction of moved objects and their motions in RGBD videos of manipulation actions. At the core of our approach lies non-rigid registration: we continuously warp a point cloud model of the observed scene to the current video frame, generating a set of dense 3D point trajectories. Under loose assumptions, we employ simple point cloud segmentation techniques to extract the actor and subsequently detect actor-environment contacts based on the estimated trajectories. For each such interaction, using the detected contact as an attention mechanism, we obtain an initial motion segment for the manipulated object by clustering trajectories in the contact area vicinity and then we jointly refine the object segment and estimate its 6DOF pose in all observed frames. Because of its generality and the fundamental, yet highly informative, nature of its outputs, our approach is applicable to a wide range of perception and planning tasks. We qualitatively evaluate our method on a number of input sequences and present a comprehensive robot imitation learning example, in which we demonstrate the crucial role of our outputs in developing action representations/plans from observation.




1 Introduction

A manipulation action, by its very definition, involves the handling of objects by an intelligent agent. Every such interaction requires physical contact between the actor and some object, followed by the exertion of forces on the manipulated object, which typically induce motion. When we open a door, pick up a coffee mug, or pull a chair, we invariably touch an object and cause it (or parts of it) to move. This obvious observation demonstrates that contact and motion are two fundamental aspects of manipulation.

Contact and motion information alone are often sufficient to describe manipulations in a wide range of applications, as they naturally encode crucial information regarding the performed action. Contact encodes where the affected object was touched/grasped, as well as when and for how long the interaction took place. Motion conveys what part of the environment (i.e. which object or object part) was manipulated and how it moved.

The ability to automatically extract contact and object motion information from video either directly solves or can significantly facilitate a number of common perception tasks. For example, in the context of manipulation actions, knowledge of the spatiotemporal extent of an actor-object contact automatically provides action detection/segmentation in the time domain, as well as localization of the detected action in the observed space poppe2010survey ; weinland2011survey . At the same time, motion information bridges the gap between the observation of an action and its semantic grounding. Knowing what part of the environment was moved effectively acts as an attention mechanism for the manipulated object recognition rutishauser2004bottom ; ba2014multiple , while the extracted motion profile provides invaluable cues for action recognition, in both “traditional” wang2013action ; poppe2010survey ; weinland2011survey and deep learning simonyan2014two frameworks.

Robot imitation learning is rapidly gaining attention. The use of robots in less controlled workspaces and even domestic environments necessitates the development of easily applicable methods for robot “programming”: autonomous robots for manipulation tasks must efficiently learn how to manipulate. Exploiting contact and motion information can largely automate robot replication of a wide class of actions. As we will discuss later, the detected contact area can effectively bootstrap the grasping stage by guiding primitive fitting and grasp planning, while the extracted object and its motion capture the trajectory to be replicated as well as any applicable kinematic/collision constraints. Thus, the components introduced in this work are essential for building complex, hierarchical models of action (e.g., behavior trees, activity graphs) as they appear in the recent literature kruger2011object ; amaro2014understanding ; summers2012using ; yang2014cognitive ; yang2015robot ; aksoy2011learning ; zampogiannis2015learning .

In this paper, we present an unsupervised, bottom-up method for estimating from RGBD video the contacts and object motions in manipulation tasks. Our approach is fully 3D and relies on dense motion estimation: we start by capturing a point cloud model of the observed scene and continuously warp/update it throughout the duration of the video. Building upon our estimated dense 3D point trajectories, we use simple concepts and common sense rules to segment the actor and detect actor-environment contact locations and time intervals. Subsequently, we exploit the detected contact to guide the motion segmentation of the manipulated object and, finally, estimate its 6DOF pose in all observed video frames. Our intermediate and final results are summarized in Table 1.

It is worth noting that we do not treat contact detection and object motion segmentation/estimation independently: we use the detected contact as an attention mechanism to guide the extraction of the manipulated object and its motion. This active approach provides an elegant and effective solution to our motion segmentation task. A passive approach to our problem would typically segment the whole observed scene into an unknown (i.e. to be estimated) number of motion clusters. By exploiting contact, we avoid having to solve a much larger and less constrained problem, while gaining significant improvements in terms of both computational efficiency and segmentation/estimation accuracy.

The generality of our framework, combined with the highly informative nature of our outputs, renders our approach applicable to a wide spectrum of perception and planning tasks. In Section 3, we provide a detailed technical description of our method, while in Section 4, we demonstrate our intermediate results and final outputs for a number of input sequences. In Section 5, we present a comprehensive example of how our outputs were successfully used to facilitate a robot imitation learning task.

2 Related Work

We focus our literature review on recent works in four areas that are most relevant to our twofold problem and to the major processes/components upon which we build. We deliberately do not review works from the action recognition literature; while our approach may very appropriately become a component of a higher-level reasoning solution, the scope of this paper is the extraction of contacts, moving objects, and their motions.

Scene flow. Scene flow refers to the dense 3D motion field of an observed scene with respect to a camera; its 2D projection onto the image plane of the camera is the optical flow. Scene flow, analogously to optical flow, is typically computed from multi-view frame pairs yan2016scene . There have been a number of successful recent works on scene flow estimation from RGBD frame pairs, following both variational herbst2013rgb ; quiroga2014dense ; jaimez2015motion ; jaimez2015primal ; jaimez2017fast and deep learning mayer2016large frameworks. While being of great relevance in a number of motion reasoning tasks, plain scene flow cannot be directly integrated into our pipeline, which requires model-to-frame motion estimation: the scene flow motion field has a 2D support (i.e. the image plane), effectively warping the 2.5D geometry of an RGBD frame, while we need to appropriately warp a full 3D point cloud model.

Non-rigid registration. The non-rigid alignment of 3D point sets can be viewed as a generalization of scene flow, in the sense that the estimated motion field is supported by a 3D point cloud: the goal is to estimate point-wise transformations (usually rigid) that best align the point set to the target geometry under certain global prior constraints (e.g., ‘as-rigid-as-possible’ sorkine2007rigid ). The warp field estimation is performed either by iterating between correspondence estimation and motion optimization tam2013registration ; amberg2007optimal ; newcombe2015dynamicfusion ; innmann2016volume , or in a correspondence-free fashion, by aligning volumetric SDFs (Signed Distance Fields) slavcheva2017killingfusion . For this work, and due to lack of publicly available solutions, we have implemented a non-rigid registration algorithm similar to newcombe2015dynamicfusion and innmann2016volume (Section 3.2) and released it as part of our cilantro cilantro library.

Contact detection. A CNN-based method for grasp recognition is introduced in yang2015grasp . A 2D approach for detecting “touch” interactions between a caregiver and an infant is presented in chen2016touch . To the best of our knowledge, there is no prior work on explicitly determining the spatiotemporal extent of human-environment contact.

Motion segmentation. A large body of work on motion segmentation has cast the problem as subspace clustering of 2D point trajectories, assuming an affine camera model yan2006general ; tron2007benchmark ; costeira1995multi ; kanatani2001motion ; rao2010motion ; vidal2004motion . In katz2013interactive , an active approach for the segmentation and kinematic modeling of articulated objects is proposed, which relies on the robot manipulation capabilities to induce object motion. In herbst2012object , object segmentation is performed from two RGBD frames, one before and one after the manipulation of the object, by rigidly aligning and ‘differencing’ the two views and robustly estimating rigid motion between the ‘difference’ regions. The same method is used in herbst2013rgb , where scene flow is used to obtain motion proposals, followed by an MRF inference step. In ruenz2017icra , joint tracking and reconstruction of multiple rigidly moving objects is achieved by combining two segmentation/grouping strategies with multiple surfel fusion whelan2015elasticfusion instances. A naive integration of a generic motion segmentation algorithm for the extraction of the manipulated object into our pipeline would be suboptimal in multiple ways. For instance, given the fact that there may exist an unknown number of other object motions that are irrelevant to the manipulation, we would be solving an unnecessarily hard problem. For the same reason, we would have little control over the segmentation granularity, which could cause the manipulated object to be over/under-segmented. Instead, we leverage the detected contact and bootstrap our segmentation by an informed trajectory clustering approach that is similar to ochs2014segmentation .

3 Our Approach

3.1 Overview

We present an automated system that, given a video of a human performing a manipulation task as input, detects and tracks the parts of the environment that participate in the manipulation. More specifically, our system is able to visually detect physical contact between the actor and their environment, and, using contact as an attention mechanism, eventually segment the manipulated object and estimate its 6DOF pose in every observed video frame. Our pipeline and the interactions of the involved processes are sketched in Fig. 1, followed by a more detailed description. An in-depth discussion of our core modules is provided in the following subsections.

Figure 1: A high-level overview of our modules and their connections in the proposed pipeline.

The input to our system is an RGBD frame sequence, captured by a commodity depth sensor, of a human actor performing a task that involves the manipulation of objects in their environment. We assume that the input depth images are registered to and in sync with their color counterparts. Using estimates of the color camera intrinsics (e.g., from the manufacturer provided specifications), all input RGBD frames are back-projected to 3D point clouds (colored, with estimated surface normals), on which all subsequent processing is performed.
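As an illustration of the back-projection step, the following is a minimal sketch under a standard pinhole camera model. The function name `back_project` and the toy intrinsics are assumptions made for this example and are not taken from our implementation; color and normal estimation are omitted.

```python
import numpy as np

def back_project(depth, fx, fy, cx, cy):
    """Back-project a depth image (in meters) to an (N, 3) array of 3D points.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy). Pixels with non-positive depth are treated as missing.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row grids
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy 2x2 depth image with one missing measurement (hypothetical values).
depth = np.array([[1.0, 0.0],
                  [2.0, 2.0]])
points = back_project(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In practice the intrinsics would come from calibration or from the sensor specifications, as stated above.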

At the core of our method lies non-rigid point cloud registration, described in detail in Section 3.2. An initial point cloud model of the observed scene is built from the first observed frame and is then consecutively transformed to the current observation based on the estimated model-to-frame warp field at every time instance. This process generates a dense set of point trajectories, each associated with a point in the initial model. In order to keep the presentation clean, we opted to obtain the scene model from the first frame and keep it fixed in terms of its point set. Non-rigid reconstruction techniques for updating the model over time newcombe2015dynamicfusion ; innmann2016volume can be easily integrated into our pipeline if required.

To perform actor/background segmentation, we follow the semi-automatic approach described in Section 3.3. The obtained binary labeling is propagated to the whole temporal extent of the observed action via our estimated dense point trajectories, and enables us to easily detect human-environment contacts as described in Section 3.4.

Given the dense scene point trajectories, the actor/background labels, and the (hand) contact interaction locations and time intervals, our final goal is, for each detected interaction, to segment the manipulated object and re-estimate its motion for every time instance, assuming it is rigid (i.e. fully defined by a 6DOF pose). Our contact-guided motion segmentation approach for this task is described in Section 3.5.

In Table 1, we summarize our proposed system’s expected inputs, final outputs, and some useful generated intermediate results.

Input:
  • RGBD video of manipulation

Intermediate results:
  • Dense 3D point trajectories for the whole sequence duration
  • Actor/background labels for all model points at all times

Final outputs:
  • 3D trajectories of detected actor-environment contact points
  • Manipulated object segments and their 6DOF poses for every time point

Table 1: List of the inputs, intermediate results, and final outputs of our proposed system.

3.2 Non-rigid registration

As described in the previous subsection, whenever a new RGBD frame (point cloud) becomes available, our scene model is non-rigidly warped from its previous state (that corresponds to the previous frame) to the new (current) observation. Since parts of the scene model may be invisible in the current state (e.g., because of self-occlusion), we cannot directly apply a traditional scene flow algorithm, as that would only provide us with motion estimates for (some of) the currently visible points. Instead, we adopt a more general approach, by implementing a non-rigid Iterative Closest Point (ICP) algorithm, similar to amberg2007optimal ; newcombe2015dynamicfusion ; innmann2016volume .

As is the case with rigid ICP besl1992method , our algorithm iterates between a correspondence search step and a warp field optimization step for the given correspondences. Our correspondence search typically amounts to finding the nearest neighbors of each point in the current frame to the model point cloud in its previous state. Correspondences that exhibit large point distance, normal angle, or color difference are discarded. Nearest neighbor searches are done efficiently by parallel kd-tree queries.
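The correspondence search and rejection step can be sketched as follows. This is a simplified illustration, not our implementation: it uses a brute-force nearest neighbor search instead of kd-trees, omits the color test, and the function name and threshold values are assumptions chosen for the example.

```python
import numpy as np

def find_correspondences(frame_pts, frame_nrm, model_pts, model_nrm,
                         max_dist=0.05, max_angle_deg=45.0):
    """Match each frame point to its nearest model point and reject bad pairs.

    Returns (model_idx, frame_idx): index arrays of the surviving pairs.
    Normals are assumed to be unit length.
    """
    # Brute-force squared distances, O(N*M); a kd-tree would replace this.
    d2 = ((frame_pts[:, None, :] - model_pts[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)
    dist = np.sqrt(d2[np.arange(len(frame_pts)), nn])
    cos_angle = (frame_nrm * model_nrm[nn]).sum(axis=1)
    keep = (dist <= max_dist) & (cos_angle >= np.cos(np.radians(max_angle_deg)))
    return nn[keep], np.nonzero(keep)[0]

model_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
model_nrm = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
frame_pts = np.array([[0.01, 0.0, 0.0],   # close, normals agree: kept
                      [1.0, 0.02, 0.0],   # close, normals disagree: rejected
                      [5.0, 5.0, 5.0]])   # too far: rejected
frame_nrm = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
model_idx, frame_idx = find_correspondences(frame_pts, frame_nrm,
                                            model_pts, model_nrm)
```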

In the following, we will focus on the warp field optimization step of our scheme. It has been found that modeling the warp field using locally affine amberg2007optimal or locally rigid newcombe2015dynamicfusion transformations provides better motion estimation results than adopting a simple translational local model, due to better regularization. In our implementation, for each point of the scene model in its previous state, we compute a full 6DOF rigid transformation that best aligns it to the current frame.

Let $S = \{s_i\}$ be the set of scene model points in the previous state that need to be registered to the point set $D = \{d_i\}$ of the current frame, whose surface normals we denote by $\{n_i\}$. Let $C_S$ and $C_D$ be the index sets of corresponding points in $S$ and $D$ respectively, such that $(s_{C_S(k)}, d_{C_D(k)})$ is a pair of corresponding points. Let $\mathcal{T} = \{T_i\}$ be the unknown warp field of rigid transformations, such that $T_i \in SE(3)$ and $|\mathcal{T}| = |S|$, and $T_i(s_i)$ denote the application of $T_i$ to model point $s_i$. Local transformations are parameterized by 3 Euler angles $(\alpha, \beta, \gamma)$ for their rotational part and 3 offsets $(t_x, t_y, t_z)$ for their translational part, and are represented as 6D vectors $T_i = [\alpha_i \;\; \beta_i \;\; \gamma_i \;\; t_{x,i} \;\; t_{y,i} \;\; t_{z,i}]^\top$.

Our goal at this stage is to estimate a warp field $\mathcal{T}$, of $6|S|$ unknown parameters, that maps model points in $S$ as closely as possible to frame points in $D$. We formulate this property as the minimization of a weighted combination of sums of point-to-plane and point-to-point squared distances between corresponding pairs:

$$E_{\text{data}}(\mathcal{T}) = \sum_{k} \left( n_{C_D(k)}^\top \left( T_{C_S(k)}(s_{C_S(k)}) - d_{C_D(k)} \right) \right)^2 + w_{\text{point}} \sum_{k} \left\| T_{C_S(k)}(s_{C_S(k)}) - d_{C_D(k)} \right\|^2 \tag{1}$$

Pure point-to-plane metric optimization generally converges faster and to better solutions than pure point-to-point rusinkiewicz2001efficient and is the standard trend in the state of the art for both rigid newcombe2011kinectfusion ; whelan2015elasticfusion and non-rigid newcombe2015dynamicfusion ; innmann2016volume registration. However, we have found that integrating a point-to-point term (second term in (1)) with a small weight $w_{\text{point}}$ into the registration cost improves motion estimation on surfaces that lack geometric texture.

The set of estimated correspondences is only expected to cover a subset of $S$ and $D$, as not all model points are expected to be visible in the current frame, and the latter may suffer from missing data. Furthermore, even for model points with existing data terms (correspondences) in (1), analogously to the aperture problem in optical flow estimation, the estimation of point-wise transformation parameters locally is under-constrained. These reasons render the minimization of the cost function in (1) ill-posed. To overcome this, we introduce a “stiffness” regularization term that imposes an as-rigid-as-possible prior sorkine2007rigid by directly penalizing differences between transformation parameters of neighboring model points in a way similar to amberg2007optimal . We fix a neighborhood graph on $S$, based on point locations, and use $N_i$ to denote the indices of the neighbors of point $s_i$ to formulate our stiffness prior term as:

$$E_{\text{reg}}(\mathcal{T}) = \sum_{i=1}^{|S|} \sum_{j \in N_i} w_{ij} \, \psi_\delta(T_i - T_j) \tag{2}$$

where $w_{ij}$ is a distance-based weight between neighboring points $s_i$ and $s_j$, $\sigma$ controls the radial extent of the regularization neighborhoods, ‘$-$’ denotes regular matrix subtraction for the 6D vector representations of the local transformations, and $\psi_\delta$ denotes the sum of the Huber loss function values over the 6 residual components. Parameter $\delta$ controls the point at which the loss function behavior switches from quadratic ($L_2$-norm) to absolute-linear ($L_1$-norm). Since $L_1$-norm regularization is known to better preserve solution discontinuities, we choose a small value of $\delta$.

Our complete registration cost function is a weighted combination of costs (1) and (2):

$$E(\mathcal{T}) = E_{\text{data}}(\mathcal{T}) + w_{\text{reg}} \, E_{\text{reg}}(\mathcal{T}) \tag{3}$$

where $w_{\text{reg}}$ controls the overall regularization weight (held fixed in our experiments). We minimize $E(\mathcal{T})$ in (3), which is non-linear in the unknowns, by performing a small number of Gauss-Newton iterations. At every step, we linearize $E(\mathcal{T})$ around the current solution and obtain a solution increment $\Delta\mathcal{T}$ by solving the system of normal equations $J^\top J \, \Delta\mathcal{T} = -J^\top r$, where $J$ is the Jacobian matrix of the residual terms in $E(\mathcal{T})$ and $r$ is the vector of residual values. We solve this sparse system iteratively, using the Conjugate Gradient algorithm with a diagonal preconditioner.
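To make the two kinds of cost terms concrete, the sketch below evaluates a combined point-to-plane and point-to-point data cost for a set of already matched pairs, and the sum-of-Huber penalty applied to a difference of 6D parameter vectors. It is an illustration only: the Euler angle convention (`Rz @ Ry @ Rx`), the standard form of the Huber loss, and all names and numeric values are assumptions, and the Gauss-Newton solver itself is not shown.

```python
import numpy as np

def euler_to_matrix(alpha, beta, gamma):
    """Rotation matrix R = Rz(gamma) @ Ry(beta) @ Rx(alpha) (assumed convention)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return rz @ ry @ rx

def data_cost(params, src, dst, dst_normals, w_point):
    """Point-to-plane plus weighted point-to-point cost over matched pairs.

    params: (N, 6) rows [alpha, beta, gamma, tx, ty, tz], one per source point.
    src, dst, dst_normals: (N, 3) matched points and target normals.
    """
    cost = 0.0
    for p, s, d, n in zip(params, src, dst, dst_normals):
        warped = euler_to_matrix(*p[:3]) @ s + p[3:]
        diff = warped - d
        cost += float(n @ diff) ** 2 + w_point * float(diff @ diff)
    return cost

def huber_sum(residual, delta):
    """Sum of Huber losses over the components of a residual vector."""
    a = np.abs(residual)
    return float(np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta)).sum())

src = np.array([[0.0, 0.0, 0.0]])
dst = np.array([[0.0, 0.0, 1.0]])
nrm = np.array([[0.0, 0.0, 1.0]])
identity = np.zeros((1, 6))
shifted = np.array([[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])  # translate by 1 along z
cost_identity = data_cost(identity, src, dst, nrm, w_point=0.5)
cost_shifted = data_cost(shifted, src, dst, nrm, w_point=0.5)
stiffness = huber_sum(np.array([0.5, 2.0]), delta=1.0)
```

The identity warp leaves a unit residual along the normal, while the shifted warp aligns the pair exactly, so the second cost vanishes.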

Figure 2: Non-rigid registration: displacement vectors are depicted as white lines, aligning the source (red) to the target (blue) geometry.

In Fig. 2, we show two sample outputs of our algorithm in an RGBD frame pair non-rigid alignment scenario. Our registration module accurately estimates deformations even for complex motions of significant magnitude.

3.3 Human actor segmentation

We follow a semi-automatic approach to perform actor/background segmentation that relies on simple point cloud segmentation techniques.

We construct a proximity graph over the scene model points in the initial state, in which each node is a model point and two nodes are connected if and only if their Euclidean distance falls below a predefined threshold. Assuming that the actor is initially not in contact with any other part of the scene (i.e. the minimum distance of an actor point to a background point is at least our predefined distance threshold) and the observed actor points are not too severely disconnected in the initial state, the actor points will be exactly defined by one connected component of this proximity graph. The selection of the correct (actor) component can be automated by filtering all the extracted components based on context-specific criteria (e.g., rough size, shape, location, etc.) or by picking the component whose image projection exhibits maximum overlap with the output of a 2D human detector cao2017realtime ; dalal2005histograms . Equivalently, we may begin by selecting a seed point known to belong to the actor and then perform region growing on the model point cloud until our distance threshold is no longer satisfied. Again, the selection of the seed point can be automated by resorting to standard 2D means (e.g., by picking the point with the strongest skin color response jones2002statistical ; vezhnevets2003survey within a 2D human detector output cao2017realtime ; dalal2005histograms ).
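The seed-based variant of this segmentation can be sketched as a breadth-first region growing over the proximity graph. The function name, the toy point set, and the threshold are assumptions made for illustration; the neighbor search is brute force for brevity.

```python
import numpy as np
from collections import deque

def grow_region(points, seed, dist_thresh):
    """Indices of all points reachable from `seed` through hops shorter than
    `dist_thresh`, i.e. one connected component of the proximity graph."""
    visited = np.zeros(len(points), dtype=bool)
    visited[seed] = True
    queue = deque([seed])
    while queue:
        i = queue.popleft()
        d = np.linalg.norm(points - points[i], axis=1)
        for j in np.nonzero((d < dist_thresh) & ~visited)[0]:
            visited[j] = True
            queue.append(j)
    return np.nonzero(visited)[0]

# Two well-separated groups of points along the x axis (hypothetical data).
points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.0],
                   [1.0, 0.0, 0.0], [1.1, 0.0, 0.0]])
actor = grow_region(points, seed=0, dist_thresh=0.15)
```

Growing from a seed in the first group recovers only that group, because the gap to the second group exceeds the distance threshold.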

We believe that the assumptions imposed by our Euclidean clustering based approach for the actor segmentation task are not too restrictive, as the main setting we focus on (representing human demonstrations for robot learning) is reasonably controlled in the first place.

We note that, since we opted to keep the scene model point set fixed and track it throughout the observed action, the obtained segmentation automatically becomes available at all time points.

3.4 Contact detection

The outputs of the above two processes are a dense set of point trajectories and their respective actor/background labels. Given this information, it is straightforward to reason about contact, simply by examining whether the minimum distance between parts of the two clusters is small enough at any given time. In other words, we can easily infer both when the actor comes into/goes out of contact with part of the environment and where this interaction is taking place.
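A minimal sketch of this minimum-distance test is given below. It assumes the trajectories are stored as a `(T, N, 3)` array with one boolean actor label per model point; the function names, the threshold, and the toy data are illustrative assumptions.

```python
import numpy as np

def detect_contacts(trajectories, is_actor, dist_thresh):
    """Per-frame contact test between actor and background points.

    trajectories: (T, N, 3) tracked model points; is_actor: (N,) bool labels.
    Returns (in_contact, contact_point): a (T,) bool array and, for frames in
    contact, the model index of the closest background point (-1 otherwise).
    """
    actor_idx = np.nonzero(is_actor)[0]
    bg_idx = np.nonzero(~is_actor)[0]
    in_contact = np.zeros(len(trajectories), dtype=bool)
    contact_point = np.full(len(trajectories), -1)
    for t, pts in enumerate(trajectories):
        d = np.linalg.norm(pts[actor_idx][:, None, :] - pts[bg_idx][None, :, :],
                           axis=2)
        a, b = np.unravel_index(d.argmin(), d.shape)
        if d[a, b] < dist_thresh:
            in_contact[t] = True
            contact_point[t] = bg_idx[b]
    return in_contact, contact_point

def contact_intervals(in_contact):
    """Convert a per-frame boolean signal into (start, end) frame intervals."""
    intervals, start = [], None
    for t, flag in enumerate(in_contact):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, len(in_contact) - 1))
    return intervals

# One actor point approaching and leaving a static background point.
traj = np.zeros((5, 2, 3))
traj[:, 0, 0] = [1.0, 0.5, 0.01, 0.02, 0.6]
is_actor = np.array([True, False])
in_contact, contact_point = detect_contacts(traj, is_actor, dist_thresh=0.05)
intervals = contact_intervals(in_contact)
```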

Some of the contact interactions detected using this criterion may, of course, be semantically irrelevant to the performed action. Since semantic reasoning is not part of our core framework, these cases have to be handled by a higher-level module. However, under reasonably controlled scenarios, we argue that it is sufficient to simply assume that the detected contacts are established by the actor's hands, with the goal of manipulating an object in their environment.

3.5 Manipulated object motion/segmentation

Knowing the dense scene point trajectories, labeled as either actor or background, as well as the contact locations and intervals, our next goal is to infer what part of the environment is being manipulated, or, in other words, which object was moved. We assume that every contact interaction involves the movement of a single object, and that the latter undergoes rigid motion. In the following, we only focus on the background part of the scene around the contact point area, ignoring the human point trajectories. We propose the following two-step approach.

First, we bootstrap our segmentation task by finding a coarse/partial mask of the moving object, using standard unsupervised clustering techniques. Specifically, we cluster the point trajectories that are labeled as background and lie within a fixed radius of the detected contact point at the beginning of the interaction into two groups. We adopt a spectral clustering approach, using the ‘random walk’ graph Laplacian von2007tutorial and a standard $k$-means last step. Our pairwise trajectory similarities are computed from $d_{\min}(i,j)$ and $d_{\max}(i,j)$, the minimum and maximum Euclidean point distance of trajectories $i$ and $j$ over the duration of the interaction, respectively. This similarity metric enforces similar trajectories to exhibit relatively constant point-wise distances, i.e. promotes clusters that undergo rigid motion. From the two output clusters, one is expected to cover (part of) the object being manipulated. Operating under the assumption that only interaction can cause motion in the scene, we pick the cluster that exhibits the largest average motion over the duration of contact as our object segment candidate.

In the above, we restricted our focus to a region around the contact point, in order to 1) prevent our binary classification from being influenced by other captured motions in the scene that are not related to the current interaction, and 2) make the classification itself more computationally tractable. As long as these requirements are met, the choice of radius is not important.
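The first step can be sketched as follows. This is an illustration under stated assumptions rather than our implementation: the similarity is taken to be a Gaussian kernel on the gap between maximum and minimum pairwise distance (the kernel form and `sigma` are assumptions), the random walk Laplacian eigenvectors are obtained through the symmetric normalized Laplacian, and the final 2-means step uses a simple deterministic initialization.

```python
import numpy as np

def trajectory_similarity(traj, sigma):
    """traj: (T, N, 3). Assumed kernel: exp(-(d_max - d_min)^2 / (2 sigma^2))."""
    diff = traj[:, :, None, :] - traj[:, None, :, :]   # (T, N, N, 3)
    dist = np.linalg.norm(diff, axis=3)                # (T, N, N)
    gap = dist.max(axis=0) - dist.min(axis=0)
    return np.exp(-gap ** 2 / (2.0 * sigma ** 2))

def spectral_bipartition(W, n_iter=20):
    """Two-way spectral clustering with the random walk Laplacian I - D^-1 W."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)                    # ascending eigenvalues
    # If u is an eigenvector of L_sym, D^-1/2 u is an eigenvector of L_rw.
    emb = d_inv_sqrt[:, None] * vecs[:, :2]
    # 2-means on the embedded rows, farthest-point initialization.
    c1 = emb[np.linalg.norm(emb - emb[0], axis=1).argmax()]
    centers = np.stack([emb[0], c1])
    for _ in range(n_iter):
        labels = np.linalg.norm(emb[:, None, :] - centers[None], axis=2).argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    return labels

def pick_moving_cluster(traj, labels):
    """Indices of the cluster with the largest average per-frame displacement."""
    motion = np.linalg.norm(np.diff(traj, axis=0), axis=2).mean(axis=0)
    means = [motion[labels == k].mean() if np.any(labels == k) else -1.0
             for k in range(2)]
    return np.nonzero(labels == int(np.argmax(means)))[0]

# Three static points and three points translating rigidly along z.
static = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
moving = np.array([[0.3, 0.0, 0.0], [0.4, 0.0, 0.0], [0.3, 0.1, 0.0]])
traj = np.stack([np.vstack([static, moving + [0.0, 0.0, 0.1 * t]])
                 for t in range(5)])
labels = spectral_bipartition(trajectory_similarity(traj, sigma=0.02))
object_idx = pick_moving_cluster(traj, labels)
```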

Subsequently, we obtain a refined, more accurate segment of the moving object by requiring that the latter undergoes a rigid motion that is at every time point consistent with that of the previously found motion cluster. Let $B^t$ denote the background (non-actor) part of the scene model point cloud at time $t$, for $t = 0, \ldots, \tau$, and $M^t \subseteq B^t$ be the initial motion cluster state at the same time instance. For all $t$, we robustly estimate the rigid motion between point sets $M^0$ and $M^t$ (i.e. relative to the first frame), using the closed form solution of umeyama1991least under a RANSAC scheme, and then find the set of points in all of $B^0$ that are consistent with this motion model between $B^0$ and $B^t$. If we denote this set of motion inliers by $I^t$ (which is a set of indices of points in $B^0$), we obtain our final object segment for this interaction as the intersection of inlier indices for all time instances $t$:

$$I = \bigcap_{t} I^t \tag{4}$$

The subset of the background points indexed by $I$, as well as the per-frame RANSAC motion (pose) estimates of this last step, are the final outputs of our pipeline for the given interaction.
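The refinement step can be sketched as below: a closed-form least-squares rigid fit (SVD-based, without scale), wrapped in a basic RANSAC loop, followed by the intersection of per-frame inlier sets. Function names, the inlier threshold, the iteration count, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def fit_rigid(src, dst):
    """Least-squares R, t such that dst ~ R @ src + t (SVD solution, no scale)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (dst - mu_d).T @ (src - mu_s)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ S @ Vt
    return R, mu_d - R @ mu_s

def ransac_rigid(src, dst, thresh, n_iter=50, seed=0):
    """Rigid fit on the largest inlier set found from random 3-point samples."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        sample = rng.choice(len(src), size=3, replace=False)
        R, t = fit_rigid(src[sample], dst[sample])
        inliers = np.linalg.norm(src @ R.T + t - dst, axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return fit_rigid(src[best], dst[best])

def refine_segment(bg_traj, cluster_idx, thresh):
    """bg_traj: (T, N, 3) background trajectories; cluster_idx: initial cluster.
    Returns final segment indices and per-frame (R, t) relative to frame 0."""
    segment = np.ones(bg_traj.shape[1], dtype=bool)
    poses = []
    for t in range(1, len(bg_traj)):
        R, tr = ransac_rigid(bg_traj[0, cluster_idx], bg_traj[t, cluster_idx], thresh)
        poses.append((R, tr))
        resid = np.linalg.norm(bg_traj[0] @ R.T + tr - bg_traj[t], axis=1)
        segment &= resid < thresh          # intersect inlier sets over time
    return np.nonzero(segment)[0], poses

def rot_z(a):
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# Points 0-4 move rigidly (rotation about z plus lift); points 5-7 are static.
obj = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [0.0, 0.2, 0.0],
                [0.2, 0.2, 0.1], [0.1, 0.05, 0.2]])
fixed = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.5]])
bg_traj = np.stack([np.vstack([obj @ rot_z(0.2 * t).T + [0.0, 0.0, 0.1 * t], fixed])
                    for t in range(4)])
segment, poses = refine_segment(bg_traj, cluster_idx=np.array([0, 1, 2]), thresh=0.01)
```

Starting from a partial cluster of three object points, the intersection of inliers recovers all five points of the moving object while excluding the static ones.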

4 Experiments

4.1 Qualitative evaluation

We provide a qualitative evaluation of our method for video inputs recorded in different settings, covering three different scenarios: 1) a tabletop object manipulation that involves flipping a pitcher, 2) opening a drawer, and 3) opening a room door. All videos were captured from a static viewpoint, using a standard RGBD sensor.

For each scenario, we depict (in Fig. 3, 4, and 5, respectively) the scene model point cloud state at three time snapshots: one right before, one during, and one right after the manipulation. For each time point, we show the corresponding color image and render the tracked point cloud from two viewpoints. The actor segment is colored green, the background is red, and the detected contact area is marked in blue. We also render the point-wise displacements induced by the estimated warp field (from the currently visible state to its next) as white lines (mostly visible in areas that exhibit large motion). The outputs displayed in these figures are in direct correspondence with the processes described in Sections 3.2, 3.3, and 3.4.

Figure 3: Flipping a pitcher: scene tracking, labeling, and contact detection.
Figure 4: Opening a drawer: scene tracking, labeling, and contact detection.
Figure 5: Opening a door: scene tracking, labeling, and contact detection.

Next, we demonstrate our attention-driven motion segmentation and 6DOF pose estimation of the manipulated object. In Fig. 6, we render the background part of the scene model in its initial state with the actor removed and show the two steps of our segmentation method described in Section 3.5. In the middle column, the blue segment corresponds to the initial motion segment, obtained by clustering trajectories in the vicinity of the contact point, which was propagated back to the initial model state and is highlighted in yellow. In the right column, we show the refined, final motion segment. We note that, because of our choice of the radius around the contact point in which we focus our attention in the first step, the initial segment in the first two cases is the same as the final one.

(a) Flipping a pitcher.
(b) Opening a drawer.
(c) Opening a door.
Figure 6: Motion segmentation of the manipulated object. First column: scene background points (the actor is removed). Second column: initial motion segment (blue) obtained by spectral clustering of point trajectories around contact area (yellow). Third column: Final motion segment.

In Fig. 7, we show the estimated rigid motion (6DOF pose) of the segmented object. To more clearly visualize the evolution of object pose over time, we attach a local coordinate frame to the object, at the location of the contact point, whose axes were chosen as the principal components of the extracted object point cloud segment.

(a) Flipping a pitcher.
(b) Opening a drawer.
(c) Opening a door.
Figure 7: Estimated rigid motion of the manipulated object. A coordinate frame is attached to the object segment (blue) at the contact point location (yellow). First column: temporal accumulation of color frames for the whole action duration. Second column: object state before manipulation. Third column: object trajectory as a series of 6DOF poses. Fourth column: object state after manipulation.

The above illustrations provide a qualitative demonstration of the successful application of our proposed pipeline to three different manipulation videos. In all cases, contacts were detected correctly and the manipulated object was accurately segmented and tracked. A more thorough, quantitative evaluation of our contact and segmentation outputs on an extended set of videos is in our plans for the immediate future.

4.2 Implementation

Our pipeline is implemented using the cilantro cilantro library, which provides a self-contained set of facilities for all of the computational steps involved.

5 Application: Replication from Observation by a robot

Figure 8: High-Level representation of opening a refrigerator door.

For any human-environment task to be successful, there is a well-defined process involved, demarcated into phases depending on human-environment contact and consequent motion. This allows us to generate a graph representation for actions, such as that shown in Fig. 8, for the task of opening a refrigerator. Given this general representation of tasks, we demonstrate how our algorithm allows grounding of the grasp and release parts, based on contact detection, and also of the feedback loop for opening the door, based on motion analysis of segmented objects. Such a representation, featuring a tight coupling of planning and perception, is crucial for robots to observe and replicate human actions.

Figure 9: Robot observing a human opening a door.

We now present a comprehensive application of our method to a real-world task, where a robot observes a human operator opening a refrigerator door and learns the process for replication. This can be seen in Fig. 9, where an RGBD sensor mounted to the robot’s manipulator is used for observation. This process involves the segmentation of the human and the environment from the observed video input, analyzing the contact between the human agent and the environment (the refrigerator handle in this case), and finally performing 3D motion tracking and segmentation on the action of opening the door, using our methods elucidated in Section 3. These analyses, and the corresponding outputs, are then converted into an intermediate graph-like representation, which encodes both semantic labeling of regions of interest, such as doors and handles in our case, as well as motion trajectories computed from observing the human agent. The combination of these allows the robot to understand and generalize the action to be performed even in changing scenarios.

We present a detailed explanation of each step involved in the process of a robot’s replication of an action by observing a human. This entire process is visually described in Fig. 10, which separates our application into three phases, namely preprocessing, planning and execution.

Figure 10: State transition diagram of our process.

5.1 Preprocessing Stage

Figure 11: Input to the preprocessing stage from our algorithm.
Figure 12: Handle detection. (a) Diagram depicting refrigerator handle detection. (b) Point cloud of refrigerator with detected handle and door.

The preprocessing stage is responsible for taking the contact point, object segments and their motion trajectories, as described in Fig. 1, and converting them into robot-specific trajectories for planning and execution. A visualization of this input can be seen in Fig. 11, where (a) depicts the RGB frame of the human performing the action. Subfigure (b) shows the contact point, highlighted in yellow, along with an initial object frame. Subfigure (c) demonstrates a dynamic view of the motion trajectory and segmentation of the door, along with the tracked contact point axes across time. Subfigure (d) shows the final pose of the door, after opening has finished.
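One ingredient of converting observed object poses into robot-specific ones is a change of reference frame. The following is a minimal sketch of that step only, assuming a known, fixed camera-to-base transform; the name `base_T_camera`, its values, and the example poses are hypothetical, and in practice such a transform would presumably come from calibration and the robot's kinematics.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_pose(rotation_z_rad, translation):
    """Homogeneous transform: rotation about z, then translation."""
    c, s = math.cos(rotation_z_rad), math.sin(rotation_z_rad)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def to_robot_frame(base_T_camera, camera_poses):
    """Re-express each camera-frame pose in the robot base frame."""
    return [matmul4(base_T_camera, pose) for pose in camera_poses]


# Hypothetical extrinsics: camera rotated 90 degrees about z, offset from the base.
base_T_camera = make_pose(math.pi / 2, (0.5, 0.0, 1.0))

# Hypothetical object poses observed in the camera frame at two time points.
camera_poses = [make_pose(0.0, (1.0, 0.0, 0.0)),
                make_pose(math.pi / 6, (1.0, 0.2, 0.0))]

robot_poses = to_robot_frame(base_T_camera, camera_poses)
```

The rotation is restricted to the z axis only to keep the example short; a full 6DOF pose would use a general rotation matrix in the upper-left 3x3 block.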

In this stage, we exploit domain knowledge to semantically ground contact points and object segments, in order to assist affordance analysis and common-sense reasoning for robot manipulation, since that provides us with task-dependent priors. For instance, since we know that our task involves opening a refrigerator door, we can make prior assumptions that the contact point between the human agent and the environment will happen at the handle and any consequent motion will be of the door and handle only.

5.1.1 Door Handle Detection

These priors allow us to robustly fit a plane to the points of the door (extracted object) using standard least squares fitting under RANSAC and obtain a set of points for the door handle (plane outliers). We then fit a cylinder to these points, in order to generate a grasp primitive with a 6DOF pose for robot grasp planning. The estimated trajectories of the object segment, as mentioned in Table 1, are not directly utilized by the robot execution system, but must instead be converted to a robot-specific representation before replication can take place. Our algorithm outputs a series of 6DOF poses for every time point. These are then converted to a series of robot-usable poses for the planning phase.
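The plane fitting and outlier extraction described above can be sketched as follows. This is a simplified illustration, not the implementation used in our pipeline: each RANSAC hypothesis is a plane through three sampled points, the least squares refit on the inliers and the subsequent cylinder fit are omitted, and the synthetic door and handle points, the threshold, and the iteration count are all assumed values.

```python
import random


def plane_from_points(p, q, r):
    """Return (unit normal, offset d) with n.x + d = 0, or None if degenerate."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-12:  # the three points are (nearly) collinear
        return None
    n = [c / norm for c in n]
    return n, -sum(n[i] * p[i] for i in range(3))


def point_plane_distance(point, plane):
    n, d = plane
    return abs(sum(n[i] * point[i] for i in range(3)) + d)


def ransac_plane(points, threshold=0.01, iterations=200, seed=0):
    """Keep the 3-point plane hypothesis with the most inliers."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        inliers = [i for i, pt in enumerate(points)
                   if point_plane_distance(pt, plane) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Synthetic data (assumed): a flat door on z = 0 and a handle 5 cm in front of it.
door = [(0.05 * i, 0.05 * j, 0.0) for i in range(20) for j in range(20)]
handle = [(0.9, 0.4 + 0.02 * k, 0.05) for k in range(10)]
points = door + handle

plane, inlier_idx = ransac_plane(points)
inlier_set = set(inlier_idx)
# Plane outliers are taken as the handle candidate points.
handle_points = [pt for i, pt in enumerate(points) if i not in inlier_set]
```

On real data the inlier threshold has to reflect the sensor noise, and the outlier set would typically need further filtering before a cylinder is fit to it.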

5.2 Planning Stage

Figure 13: Visualization of planning stage

The outputs from the preprocessing stage, namely the robot-specific 6DOF poses of the handle and the cylinder of specified radius and height depicting the handle, are passed to the planning stage of our pipeline for both grasp planning and trajectory planning. The Robot Visualizer (rviz) [50] package in ROS allows for simulation and visualization of the robot during planning and execution, via real-time feedback from the robot’s state estimator. It also has point cloud visualization capabilities, and point clouds can be overlaid on primitive shapes. We use this tool for the planning stage, with the Baxter robot and our detected refrigerator.

5.2.1 Grasp Planning

Given a primitive shape, such as a block or cylinder, we are able to use the MoveIt! Simple Grasps [51] package to generate grasp candidates for a parallel gripper (such as one mounted on the Baxter robot). The package integrates with the “MoveIt!” library’s pick and place pipeline to simulate and generate multiple potential grasp candidates, i.e., approach poses. There is also a grasp filtering stage, which uses task and configuration specific constraints to remove kinematically infeasible grasps, by performing feasibility tests via inverse kinematics solvers. At the end of the grasp planning pipeline, we have a set of candidate grasps, sorted by a grasp quality metric, of which one is chosen for execution in the next stage.
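The generate, filter, and rank structure of this stage can be illustrated with a small standalone sketch. It does not use or reproduce the MoveIt! Simple Grasps API: the candidate sampling, the distance-based `is_reachable` check (a stand-in for an inverse kinematics feasibility test), the toy quality metric, and all numeric values are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class GraspCandidate:
    approach_angle: float  # radians around the (vertical) cylinder axis
    position: tuple        # pre-grasp gripper position (x, y, z) in meters
    quality: float = 0.0


def generate_candidates(center, radius, standoff, count):
    """Sample pre-grasp positions on a circle around a vertical cylinder."""
    r = radius + standoff
    candidates = []
    for k in range(count):
        angle = 2.0 * math.pi * k / count
        position = (center[0] + r * math.cos(angle),
                    center[1] + r * math.sin(angle),
                    center[2])
        candidates.append(GraspCandidate(angle, position))
    return candidates


def is_reachable(candidate, base=(0.0, 0.0, 0.0), max_reach=1.0):
    """Stand-in for an IK feasibility test: accept positions within reach."""
    return math.dist(candidate.position, base) <= max_reach


def rank(candidates, base=(0.0, 0.0, 0.0)):
    """Toy quality metric: prefer candidates closer to the robot base."""
    for c in candidates:
        c.quality = 1.0 / (1.0 + math.dist(c.position, base))
    return sorted(candidates, key=lambda c: c.quality, reverse=True)


# Hypothetical handle cylinder: 1.5 cm radius, centered 0.8 m in front of the base.
handle_center = (0.8, 0.0, 0.9)
candidates = generate_candidates(handle_center, radius=0.015, standoff=0.10, count=8)
feasible = [c for c in candidates if is_reachable(c, max_reach=1.2)]
best = rank(feasible)[0]
```

With these assumed numbers, only the candidates approaching from the robot's side of the handle pass the reachability check, and the one directly facing the base ranks first.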

5.2.2 Trajectory Planning

The ordered set of poses over time obtained from the preprocessing stage is then used to generate a Cartesian path, using the Robot Operating System’s “MoveIt!” [52] motion planning library. This abstraction allows us to input a set of poses through which the end-effector must pass, along with parameters for path validity and obstacle avoidance. “MoveIt!” then uses inverse kinematics solutions for the specified manipulator configuration, combined with sampling-based planning algorithms such as Rapidly-Exploring Random Trees [53], to generate a trajectory for the robot to execute.
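To make the notion of a Cartesian path through a set of poses more concrete, the sketch below densifies a list of end-effector waypoints so that consecutive positions are at most a given step apart. It is only an illustration of the idea under simplifying assumptions: it handles positions and ignores orientation (which would require rotation interpolation), it performs no inverse kinematics or collision checking, and the waypoints and the `max_step` value are hypothetical.

```python
import math


def interpolate_waypoints(waypoints, max_step):
    """Insert intermediate points so consecutive positions are <= max_step apart."""
    path = [waypoints[0]]
    for start, end in zip(waypoints, waypoints[1:]):
        distance = math.dist(start, end)
        segments = max(1, math.ceil(distance / max_step))
        for k in range(1, segments + 1):
            t = k / segments
            path.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return path


# Hypothetical handle positions (meters) taken from three observed frames.
waypoints = [(0.6, 0.0, 0.9), (0.6, 0.1, 0.9), (0.5, 0.2, 0.9)]
path = interpolate_waypoints(waypoints, max_step=0.02)
```

A planner would then have to find joint configurations for each interpolated pose and reject the path if any of them is infeasible or in collision.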

5.3 Execution Stage

Figure 14: Robot replicating human by opening refrigerator

The execution stage takes as input the grasp and trajectory plans generated in the planning stage and executes the plan on the robot. First, the generated grasp candidate is used to move the end-effector to a pre-grasp pose and the parallel gripper is aligned to the cylindrical shape of the handle. The grasp is executed based on a feedback control loop, with the termination condition decided by collision avoidance and force feedback. Upon successful grasp of the handle, our pipeline transitions into the trajectory execution stage, which attempts to follow the generated plan based on feedback from the robot’s state estimation system. Once the trajectory has been successfully executed, the human motion replication pipeline is complete. This execution process is demonstrated by the robot in Fig. 14, beginning with the robot grasping the handle in the top-leftmost figure and ending with the robot releasing the handle in the bottom-leftmost figure, with intermediate frames showing the robot imitating the motion trajectory of the human.
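The force-terminated grasp loop mentioned above can be sketched with a simulated gripper. This is a toy model under stated assumptions, not our controller: the linear contact model, the stiffness, the force limit, and the step size are all hypothetical, and the collision-avoidance part of the termination condition is not modeled.

```python
def simulated_force(opening, object_width, stiffness=2000.0):
    """Toy contact model: force grows linearly once the fingers squeeze the object."""
    return max(0.0, (object_width - opening) * stiffness)


def close_until_contact(initial_opening, object_width, force_limit, step=0.001):
    """Close the gripper in small steps until the sensed force reaches the limit.

    Returns (final_opening, final_force, grasped).
    """
    opening = initial_opening
    while opening > 0.0:
        force = simulated_force(opening, object_width)
        if force >= force_limit:
            return opening, force, True   # termination: force feedback
        opening = max(0.0, opening - step)
    return opening, 0.0, False            # fully closed without contact


# Hypothetical values: 8 cm initial opening, 3 cm handle, 5 N force limit.
opening, force, grasped = close_until_contact(0.08, 0.03, force_limit=5.0)
```

On a real robot the force reading would come from the gripper's sensors rather than a model, and the loop would run at the controller's update rate with a timeout.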

In future work, we plan to implement an approach based on dynamic movement primitives (DMP) [54], which will allow more accurate and robust tracking of trajectories by the robot.

6 Conclusions

In this paper, we have introduced an active, bottom-up method for the extraction of two fundamental features of an observed manipulation, namely the contact points and motion trajectories of segmented objects. We have qualitatively demonstrated the success of our approach on a set of video inputs and described in detail its fundamental role in a robot imitation scenario. Owing to its general applicability and the manipulation-defining nature of its output features, our method can effectively bridge the gap between observation and the development of action representations and plans.

There are many possible directions for future work. At a lower level, we plan to integrate dynamic reconstruction into our pipeline to obtain a more complete model for the manipulated object; at this moment, this can be achieved by introducing a step of static scene reconstruction before the manipulation happens, after which we run our algorithm. We also plan to extend our method so that it can handle articulated manipulated objects, as well as objects that are indirectly manipulated (e.g., via the use of tools).

On the planning end, one of our future goals is to release a software component for the fully automated replication of door opening tasks (Section 5), given only a single demonstration. This module will be hardware agnostic up until the final execution stage of the pipeline, such that the generated plan to be imitated can be handled by any robot agent, given the specific manipulator and end-effector configurations.


The support of ONR under grant award N00014-17-1-2622 and the support of the National Science Foundation under grants SMA 1540916 and CNS 1544787 are gratefully acknowledged.


  • (1) R. Poppe, A survey on vision-based human action recognition, Image and vision computing 28 (6) (2010) 976–990.
  • (2) D. Weinland, R. Ronfard, E. Boyer, A survey of vision-based methods for action representation, segmentation and recognition, Computer vision and image understanding 115 (2) (2011) 224–241.
  • (3) U. Rutishauser, D. Walther, C. Koch, P. Perona, Is bottom-up attention useful for object recognition?, in: Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, Vol. 2, IEEE, 2004, pp. II–II.

  • (4) J. Ba, V. Mnih, K. Kavukcuoglu, Multiple object recognition with visual attention, arXiv preprint arXiv:1412.7755.
  • (5) H. Wang, C. Schmid, Action recognition with improved trajectories, in: Computer Vision (ICCV), 2013 IEEE International Conference on, IEEE, 2013, pp. 3551–3558.
  • (6) K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Advances in neural information processing systems, 2014, pp. 568–576.
  • (7) N. Krüger, C. Geib, J. Piater, R. Petrick, M. Steedman, F. Wörgötter, A. Ude, T. Asfour, D. Kraft, D. Omrčen, et al., Object–action complexes: Grounded abstractions of sensory–motor processes, Robotics and Autonomous Systems 59 (10) (2011) 740–757.
  • (8) K. R. Amaro, M. Beetz, G. Cheng, Understanding human activities from observation via semantic reasoning for humanoid robots, in: IROS Workshop on AI and Robotics, 2014.
  • (9) D. Summers-Stay, C. L. Teo, Y. Yang, C. Fermüller, Y. Aloimonos, Using a minimal action grammar for activity understanding in the real world, in: Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, IEEE, 2012, pp. 4104–4111.
  • (10) Y. Yang, A. Guha, C. Fermüller, Y. Aloimonos, A cognitive system for understanding human manipulation actions, Advances in Cognitive Systems 3 (2014) 67–86.
  • (11) Y. Yang, Y. Li, C. Fermüller, Y. Aloimonos, Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web, in: AAAI, 2015, pp. 3686–3693.
  • (12) E. E. Aksoy, A. Abramov, J. Dörr, K. Ning, B. Dellen, F. Wörgötter, Learning the semantics of object–action relations by observation, The International Journal of Robotics Research 30 (10) (2011) 1229–1249.
  • (13) K. Zampogiannis, Y. Yang, C. Fermüller, Y. Aloimonos, Learning the spatial semantics of manipulation actions through preposition grounding, in: Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE, 2015, pp. 1389–1396.
  • (14) Z. Yan, X. Xiang, Scene flow estimation: A survey, arXiv preprint arXiv:1612.02590.
  • (15) E. Herbst, X. Ren, D. Fox, Rgb-d flow: Dense 3-d motion estimation using color and depth, in: Robotics and Automation (ICRA), 2013 IEEE International Conference on, IEEE, 2013, pp. 2276–2282.
  • (16) J. Quiroga, T. Brox, F. Devernay, J. Crowley, Dense semi-rigid scene flow estimation from rgbd images, in: European Conference on Computer Vision, Springer, 2014, pp. 567–582.
  • (17) M. Jaimez, M. Souiai, J. Stückler, J. Gonzalez-Jimenez, D. Cremers, Motion cooperation: Smooth piece-wise rigid scene flow from rgb-d images, in: 3D Vision (3DV), 2015 International Conference on, IEEE, 2015, pp. 64–72.
  • (18) M. Jaimez, M. Souiai, J. Gonzalez-Jimenez, D. Cremers, A primal-dual framework for real-time dense rgb-d scene flow, in: Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE, 2015, pp. 98–104.
  • (19) M. Jaimez, C. Kerl, J. Gonzalez-Jimenez, D. Cremers, Fast odometry and scene flow from rgb-d cameras based on geometric clustering, in: Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE, 2017, pp. 3992–3999.
  • (20) N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, T. Brox, A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4040–4048.
  • (21) O. Sorkine, M. Alexa, As-rigid-as-possible surface modeling, in: Symposium on Geometry processing, Vol. 4, 2007, p. 30.
  • (22) G. K. Tam, Z.-Q. Cheng, Y.-K. Lai, F. C. Langbein, Y. Liu, D. Marshall, R. R. Martin, X.-F. Sun, P. L. Rosin, Registration of 3d point clouds and meshes: a survey from rigid to nonrigid, IEEE transactions on visualization and computer graphics 19 (7) (2013) 1199–1217.
  • (23) B. Amberg, S. Romdhani, T. Vetter, Optimal step nonrigid icp algorithms for surface registration, in: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE, 2007, pp. 1–8.
  • (24) R. A. Newcombe, D. Fox, S. M. Seitz, Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 343–352.
  • (25) M. Innmann, M. Zollhöfer, M. Nießner, C. Theobalt, M. Stamminger, VolumeDeform: Real-time Volumetric Non-rigid Reconstruction.
  • (26) M. Slavcheva, M. Baust, D. Cremers, S. Ilic, Killingfusion: Non-rigid 3d reconstruction without correspondences, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 3, 2017, p. 7.
  • (27) K. Zampogiannis, C. Fermuller, Y. Aloimonos, Cilantro: A lean, versatile, and efficient library for point cloud data processing, in: Proceedings of the 26th ACM International Conference on Multimedia, MM ’18, ACM, New York, NY, USA, 2018, pp. 1364–1367. doi:10.1145/3240508.3243655.
  • (28) Y. Yang, C. Fermuller, Y. Li, Y. Aloimonos, Grasp type revisited: A modern perspective on a classical feature for vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 400–408.
  • (29) Q. Chen, H. Li, R. Abu-Zhaya, A. Seidl, F. Zhu, E. J. Delp, Touch event recognition for human interaction, Electronic Imaging 2016 (11) (2016) 1–6.
  • (30) J. Yan, M. Pollefeys, A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate, in: European conference on computer vision, Springer, 2006, pp. 94–106.
  • (31) R. Tron, R. Vidal, A benchmark for the comparison of 3-d motion segmentation algorithms, in: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE, 2007, pp. 1–8.
  • (32) J. Costeira, T. Kanade, A multi-body factorization method for motion analysis, in: Computer Vision, 1995. Proceedings., Fifth International Conference on, IEEE, 1995, pp. 1071–1076.
  • (33) K.-i. Kanatani, Motion segmentation by subspace separation and model selection, in: Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, Vol. 2, IEEE, 2001, pp. 586–591.
  • (34) S. Rao, R. Tron, R. Vidal, Y. Ma, Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (10) (2010) 1832–1845.
  • (35) R. Vidal, R. Hartley, Motion segmentation with missing data using powerfactorization and gpca, in: Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, Vol. 2, IEEE, 2004, pp. II–II.
  • (36) D. Katz, M. Kazemi, J. A. Bagnell, A. Stentz, Interactive segmentation, tracking, and kinematic modeling of unknown 3d articulated objects, in: Robotics and Automation (ICRA), 2013 IEEE International Conference on, IEEE, 2013, pp. 5003–5010.
  • (37) E. Herbst, X. Ren, D. Fox, Object segmentation from motion with dense feature matching, in: ICRA Workshop on Semantic Perception, Mapping and Exploration, Vol. 2, 2012.
  • (38) M. Rünz, L. Agapito, Co-fusion: Real-time segmentation, tracking and fusion of multiple objects, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 4471–4478.
  • (39) T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, A. Davison, Elasticfusion: Dense slam without a pose graph, Robotics: Science and Systems, 2015.
  • (40) P. Ochs, J. Malik, T. Brox, Segmentation of moving objects by long term video analysis, IEEE transactions on pattern analysis and machine intelligence 36 (6) (2014) 1187–1200.
  • (41) P. J. Besl, N. D. McKay, Method for registration of 3-d shapes, in: Sensor Fusion IV: Control Paradigms and Data Structures, Vol. 1611, International Society for Optics and Photonics, 1992, pp. 586–607.
  • (42) S. Rusinkiewicz, M. Levoy, Efficient variants of the icp algorithm, in: 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, IEEE, 2001, pp. 145–152.
  • (43) R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, A. Fitzgibbon, Kinectfusion: Real-time dense surface mapping and tracking, in: Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, IEEE, 2011, pp. 127–136.
  • (44) Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2d pose estimation using part affinity fields, in: CVPR, 2017.
  • (45) N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1, IEEE, 2005, pp. 886–893.
  • (46) M. J. Jones, J. M. Rehg, Statistical color models with application to skin detection, International Journal of Computer Vision 46 (1) (2002) 81–96.
  • (47) V. Vezhnevets, V. Sazonov, A. Andreeva, A survey on pixel-based skin color detection techniques, in: Proc. Graphicon, Vol. 3, Moscow, Russia, 2003, pp. 85–92.
  • (48) U. Von Luxburg, A tutorial on spectral clustering, Statistics and computing 17 (4) (2007) 395–416.
  • (49) S. Umeyama, Least-squares estimation of transformation parameters between two point patterns, IEEE Transactions on pattern analysis and machine intelligence 13 (4) (1991) 376–380.
  • (50) D. Hershberger, D. Gossow, J. Faust, rviz, https://github.com/ros-visualization/rviz (2012).
  • (51) D. T. Coleman, “moveit!” simple grasps, https://github.com/davetcoleman/moveit_simple_grasps (2016).
  • (52) S. Chitta, I. Sucan, S. Cousins, Moveit! [ros topics], IEEE Robotics Automation Magazine 19 (1) (2012) 18–19. doi:10.1109/MRA.2011.2181749.
  • (53) S. M. Lavalle, Rapidly-exploring random trees: A new tool for path planning, Tech. rep., Iowa State University (1998).
  • (54) S. Schaal, Dynamic movement primitives - a framework for motor control in humans and humanoid robotics (2002).