Source code for Video Object Segmentation-based Visual Servo Control
To be useful in everyday environments, robots must be able to identify and locate unstructured, real-world objects. In recent years, video object segmentation has made significant progress on densely separating such objects from background in real and challenging videos. This paper addresses the problem of identifying generic objects and locating them in 3D from a mobile robot platform equipped with an RGB camera. We achieve this by introducing a video object segmentation-based approach to visual servo control and active perception. We validate our approach in experiments using an HSR platform, which subsequently identifies, locates, and grasps objects from the YCB object dataset. We also develop a new Hadamard-Broyden update formulation, which enables HSR to automatically learn the relationship between actuators and visual features without any camera calibration. Using a variety of learned actuator-camera configurations, HSR also tracks people and other dynamic articulated objects.
Visual servo control (VS), using visual data in the servo loop to control a robot, is a well-established field [11, 28]. By using RGB images for sensing, VS has been used for positioning UAVs [25, 42] and wheeled robots [33, 41], manipulating objects [29, 55], and even laparoscopic surgery. While this prior work attests to the applicability of VS, generating robust visual features for VS in unstructured environments with generic objects (e.g., without fiducial markers) remains an open problem.
On the other hand, video object segmentation (VOS), the dense separation of objects in video from background, has made great progress on real, unstructured videos. This progress is due in part to the recent introduction of multiple benchmark datasets [46, 48, 58], which evaluate VOS methods across many challenging categories, including moving cameras, occlusions, objects leaving view, scale variation, appearance change, edge ambiguity, multiple interacting objects, and dynamic background (among others); these challenges frequently occur simultaneously. However, despite all of VOS’s contributions to video understanding, we are unaware of any work that utilizes VOS for VS.
To this end, this paper develops a video object segmentation-based framework to address the problem of visual servo control in unstructured environments. The choice to use VOS-based features has many advantages. First, recent VOS methods are robust in terms of the variety of unstructured objects and backgrounds they can operate on, making our framework general to many objects and settings. Second, VOS methods can operate on streaming RGB images, making them ideal for VS and tracking objects from a mobile platform (see Figure 1). Third, recent work in active and interactive perception enables robots to automatically generate object-specific training data for semi-supervised VOS methods [7, 31, 40, 52]. Finally, VOS remains a hotly studied area of video understanding, and the accuracy and robustness of state-of-the-art segmentation methods will continue to improve.
The primary contribution of our paper is our video object segmentation-based framework for visual servo control (VOS-VS). We demonstrate the utility of VOS-VS on a mobile robot equipped with an RGB camera to identify and position itself relative to many challenging objects from HSR challenges and the YCB object dataset. In addition, we develop an auxiliary framework that combines our segmentation-based features with active perception to estimate the depth of a segmented object relative to the robot, which, in conjunction with VOS-VS, provides the object’s 3D location. Finally, we develop a new Hadamard-Broyden update formulation, which enables HSR to learn the relationship between actuators and VOS-VS features online without any camera calibration. We use this formulation to learn the pseudoinverse feature Jacobian for all VOS-VS experiments and provide analysis for seven unique configurations, which include permutations over seven actuators and two cameras. To the best of our knowledge, this work is the first use of video object segmentation for visual servo control and the first use of a Broyden update to directly estimate the pseudoinverse feature Jacobian for visual servo control on an actual robot.
We provide source code and training data for the current work at https://github.com/griffbr/VOSVS.
Video object segmentation methods can be generally categorized by their level of supervision. Compared to unsupervised VOS, which generally relies on object motion [21, 23, 32, 45, 56], semi-supervised VOS, the problem of segmenting objects in video given a user-annotated example, is particularly useful in this work. From an annotation, semi-supervised VOS methods can learn the particular visual characteristics of a target object, which enables them to reliably segment dynamic or static objects. In addition, semi-supervised VOS has seen rampant advances, even within just the past year [5, 14, 15, 34, 39, 44, 60].
To generate our VOS-based features in the current work, we segment objects using One-Shot Video Object Segmentation (OSVOS), which is state-of-the-art in VOS and has influenced other leading methods [39, 53]. One unique property of OSVOS is that it does not require temporal consistency, i.e., the order in which OSVOS segments frames is inconsequential. Nonetheless, segmentation methods that operate sequentially are still applicable to the current work.
We draw inspiration from numerous visual servo control methodologies that have been developed in robotics. To overcome drawbacks of classical position- and image-based visual servo control, a technique of 2-1/2D VS was developed, which uses a hybrid input of 3D Cartesian space and 2D image space to estimate the homography between the current and desired feature image. Other researchers show analytically and by experiment that the choice of image features has a direct effect on closed-loop system dynamics, and, for the $z$-axis in particular, image features that scale proportionally to the optical depth of the observed target should be used. In a partitioned approach, decoupled $z$-axis motions are controlled using the longest line connecting two feature points for rotation and the square root of the collective feature-point-polygon area for depth; this approach is shown to address the Chaumette Conundrum. Finally, planar contours of objects have been specified using Canny edge detection for VS image features.
In this work, we build upon previous methods by introducing segmentation-based features that are generated from common objects. Furthermore, our VOS-based features are rotation invariant and work even when parts of an object and its contour are out of view or occluded. Using semi-supervised VOS, our features are learned and do not require any particular object viewpoint or marking, making this work applicable to articulated and deformable objects (e.g., the plastic chain from the YCB dataset or a crumpled up paper towel in the third HSR challenge). In general, if an object can be segmented by a VOS method, we can use it to generate features for our VOS-VS framework.
A critical asset for robot perception is that explicit actions can be taken to improve sensing and understanding of the environment. Accordingly, Active Perception (AP) [4, 3] and Interactive Perception (IP) are two areas of research that exploit robot-specific capabilities to improve perception. Compared to structure from motion [1, 2], which requires feature matching or scene flow to relate images, AP exploits knowledge of a robot’s relative position to relate images and improve 3D reconstruction. Furthermore, AP can select future view locations that explicitly improve perception performance [19, 51, 61].
In this work, we develop an auxiliary framework that uses VOS-based features and active perception to estimate the depth of segmented objects. We do this by using a pinhole camera model-based formulation that can be linearized and solved in real time during an approach to a segmented object. In addition, we track the convergence of our depth estimate and can collect more data as required. Essentially, using an RGB camera and 3D kinematic information already available to the robot, we approximate the 3D position of segmented objects without any reliance on 3D sensing hardware (e.g., RGBD cameras [26, 49, 50] or stereo vision). Even in cases where such 3D sensors are available, our approach can serve as an auxiliary backup in the case of a sensor failure or occlusion.
For our robot experiments, we use a Toyota Human Support Robot (HSR), which has a 4-DOF manipulator arm mounted on a torso with prismatic and revolute joints and a differential drive base. Also, because HSR has a revolute joint directly on top of its differential drive base, it is effectively omnidirectional. For visual servo control, we use the actuators shown in Figure 2 as the joint space $\mathbf{q}$.
In addition to $\mathbf{q}$, HSR’s end effector has a parallel gripper with series elastic fingertips for grasping objects; the fingertips have 135 mm maximum width.
For perception, we use HSR’s base-mounted UST-20LX 2D scanning laser for obstacle avoidance and the head-mounted Xtion PRO LIVE RGBD camera and end effector-mounted wide-angle grasp camera for segmentation. The head tilt and pan joints act as a 2-DOF gimbal for the head camera, and the grasp camera moves with the arm and wrist joints; both cameras stream 640x480 resolution RGB images.
Note that a significant component of HSR’s manipulation DOF comes from its mobile base. While many planning algorithms work well on high DOF arms with a stationary base, the odometry errors of HSR compound during trajectory execution and can cause missed grasps. Thus, VS naturally lends itself to the mobile HSR platform, providing updates on the relative position of objects as HSR approaches.
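Using video object segmentation (VOS)-based features, we derive a new visual servo (VS) control framework (VOS-VS). In our VOS-VS control scheme, we define feature error $\mathbf{e}$ as

$$\mathbf{e} = \mathbf{s}(I, \mathbf{W}) - \mathbf{s}^*, \tag{2}$$

where $\mathbf{s}$ is a vector of visual features found in image $I$ using learned VOS parameters $\mathbf{W}$ and $\mathbf{s}^*$ is the desired values of the features. Note that compared to more general VS control schemes, $\mathbf{s}$ in (2) has no dependence on time, previous observations, or additional system parameters (e.g., camera parameters or 3D object models).

Typical VS approaches relate camera motion to $\mathbf{s}$ using

$$\dot{\mathbf{s}} = \mathbf{L}_\mathbf{s} \mathbf{v}_c, \tag{3}$$

where $\mathbf{L}_\mathbf{s}$ is the feature interaction matrix and $\mathbf{v}_c$ is the vector of linear and angular camera velocities. A control law built on (3) requires continuous, six degree of freedom (DOF) control of camera velocity. Instead, our VOS-VS controller uses discrete joint commands,

$$\Delta\mathbf{q} = -\hat{\mathbf{J}}_\mathbf{s}^{+} \mathbf{e}, \tag{7}$$

where $\Delta\mathbf{q}$ is the change of actuated joints, $\mathbf{J}_\mathbf{s}$ is the feature Jacobian relating $\mathbf{q}$ to $\mathbf{e}$, and $\hat{\mathbf{J}}_\mathbf{s}^{+}$ is the estimated pseudoinverse of $\mathbf{J}_\mathbf{s}$. We command $\Delta\mathbf{q}$ from (7) directly to the robot joint space as our VOS-VS controller to minimize $\mathbf{e}$ and reach the desired feature values $\mathbf{s}^*$ in (2).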
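As a rough illustration of the joint-space command described above, the following sketch assumes the controller has the form $\Delta\mathbf{q} = -\hat{\mathbf{J}}_\mathbf{s}^{+}\mathbf{e}$ with any gain folded into the estimated pseudoinverse. The function name `vos_vs_step` and all numeric values are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
import numpy as np


def vos_vs_step(features, desired, J_pinv_hat):
    """Return (delta_q, e) for one discrete control step.

    Assumes e = s - s* and delta_q = -J_pinv_hat @ e, where J_pinv_hat is an
    (n_joints x n_features) estimate of the pseudoinverse feature Jacobian.
    """
    e = np.asarray(features, dtype=float) - np.asarray(desired, dtype=float)
    delta_q = -np.asarray(J_pinv_hat, dtype=float) @ e
    return delta_q, e


# Hypothetical example: two joints, two features (x- and y-centroid in pixels).
J_pinv_hat = np.array([[0.001, 0.0],
                       [0.0, 0.002]])
delta_q, e = vos_vs_step(features=[400.0, 200.0],
                         desired=[320.0, 240.0],
                         J_pinv_hat=J_pinv_hat)
print(delta_q)
```

Because the command depends only on the current feature error, the same function can be called once per segmented frame without storing previous observations.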
In real visual servo systems, it is impossible to know the exact feature Jacobian ($\mathbf{J}_\mathbf{s}$) relating control actuators to image features; instead, a few VS methods have estimated $\mathbf{J}_\mathbf{s}$ directly from observations. Of these methods, we are particularly interested in those using the Broyden update rule [27, 30, 47], which iteratively updates the estimate $\hat{\mathbf{J}}_\mathbf{s}$ online.
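In contrast to the Jacobian estimates used in previous VS work, Broyden’s original paper [8, (4.5)] includes a formulation to estimate the pseudoinverse feature Jacobian ($\hat{\mathbf{J}}_\mathbf{s}^{+}$) directly. However, we found it necessary to augment Broyden’s formulation with the logical matrix $\mathbf{H}$. We define our Hadamard-Broyden update as

$$\hat{\mathbf{J}}_{\mathbf{s},t+1}^{+} := \hat{\mathbf{J}}_{\mathbf{s},t}^{+} + \alpha \left( \frac{\left(\Delta\mathbf{q} - \hat{\mathbf{J}}_{\mathbf{s},t}^{+} \Delta\mathbf{e}\right) \Delta\mathbf{q}^\top \hat{\mathbf{J}}_{\mathbf{s},t}^{+}}{\Delta\mathbf{q}^\top \hat{\mathbf{J}}_{\mathbf{s},t}^{+} \Delta\mathbf{e}} \right) \circ \mathbf{H}, \tag{8}$$

where $\alpha$ determines the update speed, $\Delta\mathbf{q}$ and $\Delta\mathbf{e}$ are the respective changes in joint space and feature errors since the last update, and $\mathbf{H}$ is a logical matrix coupling actuators to image features. In experiments, we initialize (8) by choosing $\alpha$ and an initial estimate $\hat{\mathbf{J}}_{\mathbf{s},t=0}^{+}$.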
The Hadamard product with $\mathbf{H}$ prevents undesired coupling between certain actuator and image feature pairs. In practice, we find that using $\mathbf{H}$ enables real-time convergence of (8) without any calibration on the robot for all of the configuration experiments in Section V-C. To the best of our knowledge, this is the first use of a Broyden update to directly estimate $\hat{\mathbf{J}}_\mathbf{s}^{+}$ for VS on an actual robot.
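A minimal sketch of one masked update step may help make the mechanics concrete. It assumes the update takes the standard inverse ("good") Broyden form, scaled by an update speed and multiplied elementwise by the logical coupling matrix. The function name `hadamard_broyden_update`, the default `alpha`, the `eps` guard against a near-zero denominator, and the numeric values are all assumptions made for illustration, not values from the experiments.

```python
import numpy as np


def hadamard_broyden_update(J_pinv, delta_q, delta_e, H, alpha=0.1, eps=1e-12):
    """One masked inverse-Broyden step for an (n_joints x n_features) estimate.

    delta_q: change in joints since the last update (length n_joints).
    delta_e: change in feature error since the last update (length n_features).
    H:       logical matrix; H[i, j] = 1 lets joint i couple with feature j.
    """
    J_pinv = np.asarray(J_pinv, dtype=float)
    dq = np.asarray(delta_q, dtype=float).reshape(-1, 1)  # n x 1
    de = np.asarray(delta_e, dtype=float).reshape(-1, 1)  # k x 1
    denom = (dq.T @ J_pinv @ de).item()
    if abs(denom) < eps:
        # Assumed safeguard: skip the update rather than divide by ~0.
        return J_pinv.copy()
    correction = (dq - J_pinv @ de) @ (dq.T @ J_pinv) / denom  # n x k
    # Elementwise (Hadamard) product keeps uncoupled entries unchanged.
    return J_pinv + alpha * correction * np.asarray(H, dtype=float)


# Hypothetical example: two joints, two features, diagonal coupling only.
J0 = 0.01 * np.eye(2)
H = np.eye(2)
J1 = hadamard_broyden_update(J0, delta_q=[0.1, -0.2], delta_e=[-5.0, 12.0], H=H)
print(J1)
```

With `alpha=1` and `H` set to all ones, the step reduces to the unmasked update and the new estimate maps `delta_e` exactly onto `delta_q`; with a sparse `H`, entries where `H` is zero keep their initial values.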
Assume we are given an RGB image $I$ that contains an object of interest. Using VOS, we generate a binary mask

$$\mathbf{M} = \text{vos}(I, \mathbf{W}), \tag{9}$$

where $\mathbf{M}$ consists of pixel-level labels $\ell_p \in \{0, 1\}$, $\ell_p = 1$ indicates pixel $p$ corresponds to the segmented object, and $\mathbf{W}$ are learned parameters for VOS (we detail our specific VOS method in Section V-B).
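Using $\mathbf{M}$, we define the following VOS-based features

$$s_A := \sum_{\ell_p \in \mathbf{M}} \ell_p, \tag{10}$$

$$s_x := \frac{\sum_{\ell_p \in \mathbf{M}} \ell_p \, x_p}{s_A}, \tag{11}$$

$$s_y := \frac{\sum_{\ell_p \in \mathbf{M}} \ell_p \, y_p}{s_A}, \tag{12}$$

where $s_A$ is a measure of segmentation area by the number of labeled pixels, $s_x$ is the $x$-centroid of the segmented object using $x$-axis label locations $x_p$, and $s_y$ is the $y$-centroid using $y$-axis label locations $y_p$. In addition to (10)-(12), we develop more VOS-based features for depth estimation and grasping in Sections IV-E and IV-F.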
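The area and centroid features can be computed directly from a binary mask. The sketch below assumes the mask is a 2D array with nonzero entries marking object pixels; the function name `vos_features` and the choice to return `None` centroids for an empty mask are illustrative assumptions.

```python
import numpy as np


def vos_features(mask):
    """Return (s_area, s_x, s_y) for a binary segmentation mask.

    s_area is the number of labeled pixels; s_x and s_y are the mean column
    and row indices of labeled pixels (None if nothing is segmented).
    """
    mask = np.asarray(mask).astype(bool)
    s_area = int(mask.sum())
    if s_area == 0:
        return 0, None, None
    ys, xs = np.nonzero(mask)  # row (y) and column (x) indices of object pixels
    return s_area, float(xs.mean()), float(ys.mean())


# Synthetic 640x480 mask with a 100x100 pixel object.
mask = np.zeros((480, 640), dtype=np.uint8)
mask[100:200, 300:400] = 1
print(vos_features(mask))  # (10000, 349.5, 149.5)
```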
[Table I: Hadamard-Broyden configurations for (8) and the joints coupled with each image feature.]
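Using $s_x$ and $s_y$ in (2), we define error as

$$\mathbf{e} = \begin{bmatrix} s_x \\ s_y \end{bmatrix} - \mathbf{s}^*. \tag{13}$$

Note that in our Hadamard-Broyden update (8), element $\mathbf{H}_{i,j}$ of the Hadamard product corresponds to element $(i, j)$ of $\hat{\mathbf{J}}_\mathbf{s}^{+}$. Thus, we configure the logical coupling matrix $\mathbf{H}$ by setting $\mathbf{H}_{i,j} = 1$ if coupling actuated joint $q_i$ with image feature $s_j$ is desired. Using our Broyden formulation (8), we learn $\hat{\mathbf{J}}_\mathbf{s}^{+}$ on HSR for each of the configurations in Table I and provide experimental results in Section V-C.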
By combining VOS-based features with active perception, we are able to estimate the depth of segmented objects and approximate an object’s relative 3D position. As shown in Figure 3, we initiate our depth estimation framework (VOS-DE) by centering a segmented object on the optical axis of our camera using the VOS-VS controller. This alignment minimizes lens distortion, which facilitates the use of an ideal camera model. Using the pinhole camera model, projections of objects onto the image plane scale inversely with their distance on the optical axis from the camera. Thus, with the object centered on the optical axis, we can relate projection scale and object distance using

$$\ell_1 d_1 = \ell_2 d_2, \tag{15}$$

where $\ell_1$ is the projected length of an object measurement orthogonal to the optical axis, $d_1$ is the distance on the optical axis of the object away from the camera, and $\ell_2$ is the projected measurement length at a new distance $d_2$. Combining Galileo Galilei’s square-cube law with (15),

$$\sqrt{a_1}\, d_1 = \sqrt{a_2}\, d_2, \tag{16}$$

where $a_n$ is the projected object area corresponding to $\ell_n$ and $d_n$ (see Figure 3).
After centering on the segmented object, we advance the camera along the optical axis while collecting object segmentations. We modify (16) to relate each image using

$$d_n \sqrt{a_n} = c, \tag{17}$$

where $c$ is a constant determined by the orthogonal surface area of the segmented object. Also, using a coordinate frame with the $z$ axis aligned with the optical axis,

$$d = z_{\text{camera}} - z_{\text{object}}, \tag{18}$$

where $z_{\text{camera}}$ and $z_{\text{object}}$ are the respective $z$-axis coordinates of the camera and object. Note, because the camera and object are both centered on the $z$ axis, $x_{\text{camera}} = x_{\text{object}}$ and $y_{\text{camera}} = y_{\text{object}}$. Using (18) and the area measurement $s_A$ from (10), we update (17) as

$$\left(z_{\text{camera},n} - z_{\text{object}}\right) \sqrt{s_{A,n}} = c, \tag{19}$$

where the object is assumed stationary between images (i.e., $\dot{z}_{\text{object}} = 0$) and the $z_{\text{camera}}$ position is known from the robot kinematics and encoder values. Note that $z_{\text{camera}}$ provides 3D information to our VOS-DE framework and (19) identifies a key linear relationship between VOS and the distance between the segmented object and the camera.
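Finally, after collecting a series of measurements, we can estimate the depth of the segmented object. From (19),

$$z_{\text{object}} \sqrt{s_{A,n}} + c = z_{\text{camera},n} \sqrt{s_{A,n}}, \tag{20}$$

which over the $n$ measurements in $\mathbf{A}\mathbf{x} = \mathbf{b}$ form yields

$$\begin{bmatrix} \sqrt{s_{A,1}} & 1 \\ \sqrt{s_{A,2}} & 1 \\ \vdots & \vdots \\ \sqrt{s_{A,n}} & 1 \end{bmatrix} \begin{bmatrix} \hat{z}_{\text{object}} \\ \hat{c} \end{bmatrix} = \begin{bmatrix} z_{\text{camera},1} \sqrt{s_{A,1}} \\ z_{\text{camera},2} \sqrt{s_{A,2}} \\ \vdots \\ z_{\text{camera},n} \sqrt{s_{A,n}} \end{bmatrix}. \tag{21}$$

By solving (21) for $\hat{z}_{\text{object}}$ and $\hat{c}$, we can estimate the distance $d$ in (18), and, thus, the 3D location of the object. In Section V-D, we show in robot experiments that our combined VOS-VS and VOS-DE framework is sufficient for locating, approaching, and estimating the depth of a variety of household objects.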
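The depth estimate can be sketched as an ordinary least-squares problem. This example assumes the linear relation $z_{\text{object}}\sqrt{s_A} + c = z_{\text{camera}}\sqrt{s_A}$ holds for each measurement, with the camera height known from kinematics and the segmentation area measured in pixels. The function name `estimate_object_depth` and the synthetic measurements are hypothetical; the solver is NumPy's `np.linalg.lstsq`, which is one reasonable choice rather than the method used in the experiments.

```python
import numpy as np


def estimate_object_depth(z_camera, s_area):
    """Least-squares estimate of (z_object, c) from paired measurements.

    z_camera: camera z-coordinates along the optical axis (from kinematics).
    s_area:   segmentation areas (pixel counts) observed at those positions.
    """
    z = np.asarray(z_camera, dtype=float)
    root_area = np.sqrt(np.asarray(s_area, dtype=float))
    A = np.column_stack([root_area, np.ones_like(root_area)])
    b = z * root_area
    (z_object, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(z_object), float(c)


# Synthetic, noise-free data: object at z = 0.1 m with constant c = 30.
z_camera = np.array([0.6, 0.5, 0.4, 0.3])
s_area = (30.0 / (z_camera - 0.1)) ** 2  # area grows as the camera approaches
z_hat, c_hat = estimate_object_depth(z_camera, s_area)
print(z_hat, c_hat)
```

At least two measurements at distinct camera positions are needed for the system to be determined; additional measurements help average out segmentation noise.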
As a final extension to our VOS-based framework, we develop a VOS-based method of grasping and grasp-error detection. While many recent grasping methods are RGBD-based [24, 35], similar to VOS-DE, a VOS-based approach to grasping can function when 3D sensing is unavailable. This design choice is further motivated by the winning team of the 2017 Amazon Robotics Challenge [43, 54], which uses RGB-based methods when RGBD-based methods fail.
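Assuming an object is centered using VOS-VS and has estimated depth $\hat{z}_{\text{object}}$ using VOS-DE, we move $z_{\text{camera}}$ to

$$z_{\text{camera,grasp}} = \hat{z}_{\text{object}} + z_{\text{gripper}}, \tag{22}$$

where $z_{\text{gripper}}$ is the known $z$-axis offset between $z_{\text{camera}}$ and the center of HSR’s closed fingertips. Thus, when $z_{\text{camera}}$ is at $z_{\text{camera,grasp}}$, HSR can reach the object at depth $\hat{z}_{\text{object}}$.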
After moving to $z_{\text{camera,grasp}}$, we center the object directly underneath HSR’s antipodal gripper using VOS-VS control. To find a suitable grasp location, we project a mask of the gripper, $\mathbf{G}$, into the camera and solve

$$\theta^* = \arg\min_{\theta} \mathcal{J}\left(\mathbf{G}_\theta, \mathbf{M}\right), \tag{23}$$

where $\mathcal{J}$ is the intersection over union (or Jaccard index) of $\mathbf{G}_\theta$ and object segmentation mask $\mathbf{M}$, and $\mathbf{G}_\theta$ is the projection of $\mathbf{G}$ corresponding to HSR wrist rotation $\theta$. Thus, we grasp the object using the wrist rotation with least intersection between the object and the gripper, which is then less likely to collide with the object before achieving a parallel grasp (see Figure 4).
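The wrist-rotation search can be sketched as a brute-force minimization of intersection over union across candidate angles. The gripper footprint used here, a thin bar through the image center rotated by the wrist angle, is a stand-in assumption for the projected gripper mask, and the function names, bar dimensions, and candidate angles are all hypothetical.

```python
import numpy as np


def jaccard(a, b):
    """Intersection over union of two binary masks (0.0 if both are empty)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def gripper_mask(shape, theta, half_length=60, half_width=8):
    """Hypothetical gripper footprint: a centered bar rotated by theta (rad)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - (w - 1) / 2.0, ys - (h - 1) / 2.0
    u = dx * np.cos(theta) + dy * np.sin(theta)   # along the bar
    v = -dx * np.sin(theta) + dy * np.cos(theta)  # across the bar
    return (np.abs(u) <= half_length) & (np.abs(v) <= half_width)


def best_wrist_rotation(object_mask, thetas):
    """Return the candidate angle with the smallest IoU, plus all scores."""
    object_mask = np.asarray(object_mask, dtype=bool)
    scores = [jaccard(gripper_mask(object_mask.shape, t), object_mask)
              for t in thetas]
    return thetas[int(np.argmin(scores))], scores


# Synthetic example: a long, thin horizontal object centered in the image.
object_mask = np.zeros((240, 240), dtype=bool)
object_mask[115:125, 60:180] = True
thetas = np.linspace(0.0, np.pi, 8, endpoint=False)
theta_best, scores = best_wrist_rotation(object_mask, thetas)
print(theta_best)
```

For the elongated object in this example, the selected footprint lies perpendicular to the object's long axis, which matches the intuition that the gripper should overlap the object as little as possible before closing.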
After the object is grasped, we lift HSR’s arm to perform a visual grasp check. We consider a grasp complete based on a comparison of $s_{A,\text{grasp}}$ and $s_{A,\text{lift}}$, where $s_{A,\text{grasp}}$ is the object segmentation size (10) during the initial grasp and $s_{A,\text{lift}}$ is the corresponding $s_A$ after lifting the arm. Essentially, if $s_A$ decreases when lifting the arm, the object is further from the camera and not securely grasped. Thus, we quickly identify if a grasp is missed and regrasp as necessary. Note that our VOS-based grasp check can work with other grasping methods as well.
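A minimal sketch of the grasp check compares the segmentation area before and after lifting. The acceptance ratio `min_ratio` is an assumed, tunable parameter introduced here for illustration; the function name `grasp_succeeded` and the example areas are likewise hypothetical.

```python
def grasp_succeeded(s_area_grasp, s_area_lift, min_ratio=1.0):
    """Return True if the segmented area did not shrink after lifting.

    s_area_grasp: segmentation area (pixels) during the initial grasp.
    s_area_lift:  segmentation area (pixels) after lifting the arm.
    min_ratio:    assumed acceptance ratio; lower values tolerate some noise.
    """
    if s_area_grasp <= 0:
        return False  # nothing was segmented at grasp time
    return s_area_lift >= min_ratio * s_area_grasp


# Object lifted with the gripper: area stays roughly the same or grows.
print(grasp_succeeded(5000, 5200))  # True
# Object left behind: it moves away from the camera and its area shrinks.
print(grasp_succeeded(5000, 1200))  # False
```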
For most of our experiments, we use the objects from the YCB dataset shown in Figure 6. We use six objects from each of the food, kitchen, tool, and shape categories and purposefully choose some of the most difficult objects to stress test our system. To name only a few of the challenges for selected objects: dimensions span from the 470 mm long pan to the 4 mm thick washer, most of the contours change with pose, and over a third of the objects exhibit specular reflection of overhead lights. To learn object recognition, we annotate ten training images of each object using HSR’s grasp camera with various object poses, backgrounds, and distances from the camera (see example image in Figure 1).
OSVOS uses a base network trained on ImageNet to recognize image features, re-trains a parent network on DAVIS 2016 to learn general video object segmentation, and then fine-tunes for each object that we use in our experiments (i.e., each object has unique learned parameters $\mathbf{W}$ in (9)). After learning $\mathbf{W}$, we segment HSR’s 640x480 RGB images at 29.6 Hz with a dual-GPU (GTX 1080 Ti) machine.
[Table II: Hadamard-Broyden configurations for (8) and the corresponding parameters learned in (14).]
We learn all of the VOS-VS configurations in Table I on HSR using the Hadamard-Broyden update formulation in (8). We initialize each configuration using $\alpha$, $\hat{\mathbf{J}}_{\mathbf{s},t=0}^{+}$, and a target segmentation object in view to elicit a step response from the VOS-VS controller. Each configuration starts at a specific pose (e.g., one uses the leftmost pose in Figures 3-4), and the configurations share a common $\mathbf{s}^*$ in (13), except for the grasp-positioning configuration, which uses a different $\mathbf{s}^*$.
After initializing each configuration, within a few iterations of control inputs from (7) and updates from (8), the learned $\hat{\mathbf{J}}_\mathbf{s}^{+}$ matrix generally shows convergence for any component that is initialized with the correct sign. Components initialized with an incorrect sign generally require more updates to change directions and jump through zero during one of the discrete updates. If an object goes out of view from an incorrectly signed component, we reset HSR’s pose and restart the update from the most recent $\hat{\mathbf{J}}_\mathbf{s}^{+}$. Once $\mathbf{s}^*$ is reached, the object can be moved to elicit a few more step responses for fine tuning. The learned parameters for each Hadamard-Broyden configuration are provided in Table II. In the rest of our experiments, we set $\alpha$ in (8) to reduce variability.
To show the step response of each learned configuration in Table II, we perform additional experiments by centering the camera on multiple YCB objects within view of each configuration’s starting pose. In Figure 7, both base configurations exhibit a stable response for the same configuration of objects; note that the wood block starts close to $\mathbf{s}^*$ (shown by green dot) in the raised perspective. Our motivation to learn two base configurations is the increase in sensitivity to base motion as an object’s depth decreases. One configuration operates with the camera raised high above objects, while the other operates with the camera directly above objects to position for grasping. Thus, the raised-camera configuration needs more base movement for the same changes in $\mathbf{e}$ compared to the grasp-positioning configuration. This difference is apparent in Table II from learning greater $\hat{\mathbf{J}}_\mathbf{s}^{+}$ values and in Figure 7 from the raised-camera configuration’s smaller distribution of initial $\mathbf{e}$ values despite identical object distances.
We show the step response of all arm-based VS configurations in Figure 8. Each configuration uses the same segmentation objects and starting pose. While each configuration segments the pan and baseball, $\mathbf{s}^*$ is not reachable for these objects within any of the configured actuator spaces; only one configuration centers on all four of the other objects. The overactuated configuration exhibits the most overshoot, while another exhibits the most limited range of camera positions but essentially deadbeat control.
Finally, we show the step response of the head configuration in Figure 9. It is the only configuration that uses HSR’s 2-DOF head gimbal and camera, and it exhibits a relatively smooth step response over the entire image. Of particular significance, even though this configuration uses the head camera, it uses the same OSVOS parameters that are learned using images from the grasp camera; this further demonstrates the general applicability of our VOS-VS framework with regard to not needing any camera calibration.
[Table III: trial configurations and results.]
|Object|Category|Support Height (m)|VOS-VS|VOS-DE|
|Box of Sugar|Food|0.25|X|X|
|Skillet with Lid|Kitchen|Ground||N/A|
We perform an experiment consisting of a consecutive set of trials that simultaneously test VOS-VS and VOS-DE. Each trial consists of three unique YCB objects placed at different heights: one on the blue bin 0.25 m above the ground, one on the green bin 0.125 m above the ground, and one directly on the ground (see bin configuration in Figure 1). The trial configurations and corresponding results are provided in Table III. VOS-VS is considered a success (“X”) if HSR locates and centers on the object for depth estimation. VOS-DE is considered a success if HSR achieves (22) such that 1) HSR can close its grippers on the object without hitting the underlying surface and 2) HSR does not move past the top surface of the object.
Across all of the challenging YCB objects we selected for this single consecutive set of trials, VOS-VS has an 83% success rate. Note that VOS-DE is only applicable if VOS-VS succeeds; in these cases, VOS-DE has a 50% overall success rate. By category, food objects have the highest success (100% VOS-VS, 83% VOS-DE) and kitchen objects have the lowest (50% VOS-VS, 66% VOS-DE). Note that the margin of error and difficulty for VOS-DE is highly variable between objects (e.g., much more difficult for the 4 mm thick washer compared to the 50 mm thick foam brick).
We perform a few additional experiments for our VOS-based methods, including our work in the TRI-sponsored HSR challenges. These challenges consist of timed trials for pick-and-place tasks with randomly scattered, non-YCB objects (e.g., banana peels and wadded up dollar bills). These HSR challenges are a particularly good demonstration of our VOS-based approach to visual servo control and grasping. Footage for these experiments and trials is available at: https://www.youtube.com/playlist?list=PLz52BAn_JPx8nVgP2XfnG_9TCJj0DwC5y.
Finally, we perform additional VOS-VS experiments with dynamic articulated objects. Using a learned VOS-VS configuration, HSR tracks the plastic chain across the room in real time as we kick it and throw it in a variety of unstructured poses; we can even pick up the chain and use it to guide HSR’s movements from the grasp camera. In addition, by training OSVOS to recognize an article of clothing, HSR is able to reliably track a person moving throughout the room (see Figure 10). Footage for these experiments with dynamic articulated objects is available at: https://youtu.be/hlog5FV9RLs.
In this work, we develop a video object segmentation-based approach to visual servo control, depth estimation of objects, and grasping. Visual servo control is an established approach to controlling a physical robot system from RGB images, and video object segmentation has seen rampant advances within the computer vision community for densely segmenting general objects in challenging videos. The success of our VOS-based approach to visual servo control in experiments using a mobile robot platform and generic objects is a tribute to both of these communities and the initiation of a bridge between them. Future developments in video object segmentation will improve the robustness of our method and, we expect, lead to other innovations in robotics.
A significant benefit of our VOS-based framework is that it only requires an RGB camera combined with robot actuation. While 3D sensing-based methods for depth estimation and grasping are still state-of-the-art, in future work, we will improve our approach to VOS-based depth estimation and grasping by collecting data from multiple poses, thereby leveraging more information and making our 3D understanding of the target object more complete. Nonetheless, we find that the current VOS-based framework is a useful tool for robotics applications where 3D sensors are unavailable.
Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” 2017.
S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
P. Voigtlaender and B. Leibe, “Online adaptation of convolutional neural networks for video object segmentation,” in British Machine Vision Conference (BMVC), 2017.