Video Object Segmentation-based Visual Servo Control and Object Depth Estimation on a Mobile Robot Platform

03/20/2019 ∙ by Brent Griffin, et al. ∙ University of Michigan

To be useful in everyday environments, robots must be able to identify and locate unstructured, real-world objects. In recent years, video object segmentation has made significant progress on densely separating such objects from background in real and challenging videos. This paper addresses the problem of identifying generic objects and locating them in 3D from a mobile robot platform equipped with an RGB camera. We achieve this by introducing a video object segmentation-based approach to visual servo control and active perception. We validate our approach in experiments using an HSR platform, which subsequently identifies, locates, and grasps objects from the YCB object dataset. We also develop a new Hadamard-Broyden update formulation, which enables HSR to automatically learn the relationship between actuators and visual features without any camera calibration. Using a variety of learned actuator-camera configurations, HSR also tracks people and other dynamic articulated objects.




I Introduction

Visual servo control (VS), using visual data in the servo loop to control a robot, is a well-established field [11, 28]. By using RGB images for sensing, VS has been used for positioning UAVs [25, 42] and wheeled robots [33, 41], manipulating objects [29, 55], and even laparoscopic surgery [57]. While this prior work attests to the applicability of VS, generating robust visual features for VS in unstructured environments with generic objects (e.g., without fiducial markers) remains an open problem.

On the other hand, video object segmentation (VOS), the dense separation of objects in video from background, has made great progress on real, unstructured videos. This progress is due in part to the recent introduction of multiple benchmark datasets [46, 48, 58], which evaluate VOS methods across many challenging categories, including moving cameras, occlusions, objects leaving view, scale variation, appearance change, edge ambiguity, multiple interacting objects, and dynamic background (among others); these challenges frequently occur simultaneously. However, despite all of VOS’s contributions to video understanding, we are unaware of any work that utilizes VOS for VS.

Fig. 1: HSR observing objects at various heights (left). HSR’s grasp camera faces downward and only collects RGB data for objects in the scene (top right). However, using active perception and video object segmentation (bottom right), HSR can locate and grasp a variety of objects in real time.

To this end, this paper develops a video object segmentation-based framework to address the problem of visual servo control in unstructured environments. The choice to use VOS-based features has many advantages. First, recent VOS methods are robust in terms of the variety of unstructured objects and backgrounds they can operate on, making our framework general to many objects and settings. Second, VOS methods can operate on streaming RGB images, making them ideal for VS and tracking objects from a mobile platform (see Figure 1). Third, recent work in active and interactive perception enables robots to automatically generate object-specific training data for semi-supervised VOS methods [7, 31, 40, 52]. Finally, VOS remains a hotly studied area of video understanding, and the accuracy and robustness of state-of-the-art segmentation methods will continue to improve.

The primary contribution of our paper is our video object segmentation-based framework for visual servo control (VOS-VS). We demonstrate the utility of VOS-VS on a mobile robot equipped with an RGB camera to identify and position itself relative to many challenging objects from HSR challenges and the YCB object dataset [10]. In addition, we develop an auxiliary framework that combines our segmentation-based features with active perception to estimate the depth of a segmented object relative to the robot, which, in conjunction with VOS-VS, provides the object’s 3D location. Finally, we develop a new Hadamard-Broyden update formulation, which enables HSR to learn the relationship between actuators and VOS-VS features online without any camera calibration. We use this formulation to learn the pseudoinverse feature Jacobian for all VOS-VS experiments and provide analysis for seven unique configurations, which include permutations over seven actuators and two cameras. To the best of our knowledge, this work is the first to use video object segmentation for visual servo control and the first to use a Broyden update to directly estimate the pseudoinverse feature Jacobian for visual servo control on an actual robot.

We provide source code and training data for the current work at

II Related Work

II-A Video Object Segmentation

Video object segmentation methods can be generally categorized by their level of supervision. Compared to unsupervised VOS, which generally relies on object motion [21, 23, 32, 45, 56], semi-supervised VOS, the problem of segmenting objects in video given a user-annotated example, is particularly useful in this work. From an annotation, semi-supervised VOS methods can learn the particular visual characteristics of a target object, which enables them to reliably segment dynamic or static objects. In addition, semi-supervised VOS has seen rampant advances, even within just the past year [5, 14, 15, 34, 39, 44, 60].

To generate our VOS-based features in the current work, we segment objects using One-Shot Video Object Segmentation (OSVOS) [9], which is state-of-the-art in VOS and has influenced other leading methods [39, 53]. One unique property of OSVOS is that it does not require temporal consistency, i.e., the order that OSVOS segments frames is inconsequential. Nonetheless, segmentation methods that operate sequentially are still applicable to the current work.

II-B Visual Servo Control

We draw inspiration from numerous visual servo control methodologies that have been developed in robotics. In [37], to overcome drawbacks of classical position- and image-based visual servo control, a technique of 2-1/2D VS is developed, which uses a hybrid input of 3D Cartesian space and 2D image space to estimate the homography between the current and desired feature image. In [36], researchers show analytically and by experiment that the choice of image features has a direct effect on closed-loop system dynamics, and, for the $z$ axis in particular, image features that scale proportional to the optical depth of the observed target should be used. In the partitioned approach of [17], decoupled $z$-axis motions are controlled using the longest line connecting two feature points for rotation and the square root of the collective feature-point-polygon area for depth; this approach is shown to address the Chaumette Conundrum presented in [13]. Finally, in [16], planar contours of objects are specified using Canny edge detection for VS image features.

In this work, we build upon previous methods by introducing segmentation-based features that are generated from common objects. Furthermore, our VOS-based features are rotation invariant and work even when parts of an object and its contour are out of view or occluded. Using semi-supervised VOS, our features are learned and do not require any particular object viewpoint or marking, making this work applicable to articulated and deformable objects (e.g., the plastic chain from the YCB dataset or a crumpled up paper towel in the third HSR challenge). In general, if an object can be segmented by a VOS method, we can use it to generate features for our VOS-VS framework.

II-C Active Perception

A critical asset for robot perception is that explicit actions can be taken to improve sensing and understanding of the environment. Accordingly, Active Perception (AP) [4, 3] and Interactive Perception (IP) [6] are two areas of research that exploit robot-specific capabilities to improve perception. Compared to structure from motion [1, 2], which requires feature matching or scene flow to relate images, AP exploits knowledge of a robot’s relative position to relate images and improve 3D reconstruction. Furthermore, AP can select future view locations that explicitly improve perception performance [19, 51, 61].

In this work, we develop an auxiliary framework that uses VOS-based features and active perception to estimate the depth of segmented objects. We do this by using a pinhole camera model-based formulation that can be linearized and solved in real time during an approach to a segmented object. In addition, we track the convergence of our depth estimate and can collect more data as required. Essentially, using an RGB camera and 3D kinematic information already available to the robot, we approximate the 3D position of segmented objects without any reliance on 3D sensing hardware (e.g., RGBD cameras [26, 49, 50] or stereo vision [38]). Even in cases where such 3D sensors are available, our approach can serve as an auxiliary backup in the case of a sensor failure or occlusion.

III Robot Model and Perception Hardware

Fig. 2: HSR Control Model.

For our robot experiments, we use a Toyota Human Support Robot (HSR), which has a 4-DOF manipulator arm mounted on a torso with prismatic and revolute joints and a differential drive base [59]. Also, because HSR has a revolute joint directly on top of its differential drive base, it is effectively omnidirectional. For visual servo control, we use the actuators shown in Figure 2 as the joint space ,


In addition to , HSR’s end effector has a parallel gripper with series elastic fingertips for grasping objects; the fingertips have 135 mm maximum width.

For perception, we use HSR’s base-mounted UST-20LX 2D scanning laser for obstacle avoidance and the head-mounted Xtion PRO LIVE RGBD camera and end effector-mounted wide-angle grasp camera for segmentation. The head tilt and pan joints act as a 2-DOF gimbal for the head camera, and the grasp camera moves with the arm and wrist joints; both cameras stream 640x480 resolution RGB images.

Note that a significant portion of HSR’s manipulation DOF comes from its mobile base. While many planning algorithms work well on high DOF arms with a stationary base, the odometry errors of HSR compound during trajectory execution and can cause missed grasps. Thus, VS naturally lends itself to the mobile HSR platform, providing updates on the relative position of objects as HSR approaches.

IV Segmentation-based Visual Servo Control, Object Depth Estimation, and Grasping

IV-A Visual Servo Model

Using video object segmentation (VOS)-based features, we derive a new visual servo (VS) control framework (VOS-VS). In our VOS-VS control scheme, we define feature error as



$$\mathbf{e} := \mathbf{s}(I, \mathbf{W}) - \mathbf{s}^*, \tag{2}$$

where $\mathbf{s}$ is a vector of visual features found in image $I$ using learned VOS parameters $\mathbf{W}$, and $\mathbf{s}^*$ is the vector of desired feature values. Note that compared to more general VS control schemes, $\mathbf{e}$ in (2) has no dependence on time, previous observations, or additional system parameters (e.g., camera parameters or 3D object models).

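Typical VS approaches relate camera motion to $\mathbf{s}$ using

$$\dot{\mathbf{s}} = \mathbf{L}_s \mathbf{v}_c, \tag{3}$$

where $\mathbf{L}_s$ is a feature Jacobian relating the three linear and three angular camera velocities $\mathbf{v}_c$ to $\dot{\mathbf{s}}$. From (2)-(3), assuming $\dot{\mathbf{s}}^* = 0$, we find the following VS control velocities for $\mathbf{v}_c$ to minimize $\mathbf{e}$:

$$\mathbf{v}_c = -\lambda \widehat{\mathbf{L}}_s^{+} \mathbf{e}, \tag{4}$$

where $\widehat{\mathbf{L}}_s^{+}$ is the estimated pseudoinverse of $\mathbf{L}_s$ and $\lambda$ ensures an exponential decoupled decrease of $\mathbf{e}$ [11]. Note that VS control using (4) requires continuous, six degree of freedom (DOF) control of camera velocity.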

To generalize our VS controller for discrete motion planning and fewer required DOF, we modify (3)-(4) to


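$$\Delta\mathbf{s} = \mathbf{J}_s \Delta\mathbf{q},$$

$$\Delta\mathbf{q} = -\lambda \widehat{\mathbf{J}}_s^{+} \mathbf{e}, \tag{7}$$

where $\Delta\mathbf{q}$ is the change of actuated joints, $\mathbf{J}_s$ is the feature Jacobian relating $\Delta\mathbf{q}$ to $\Delta\mathbf{s}$, and $\widehat{\mathbf{J}}_s^{+}$ is the estimated pseudoinverse of $\mathbf{J}_s$. We command $\Delta\mathbf{q}$ from (7) directly to the robot joint space as our VOS-VS controller to minimize $\mathbf{e}$ and reach the desired feature values $\mathbf{s}^*$ in (2).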
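As a rough illustration of the discrete controller described above, the following Python sketch assumes the control law takes the standard proportional form: a joint-space step equal to minus a gain times the estimated pseudoinverse feature Jacobian times the feature error. The function name `vos_vs_step`, the gain value, and the example numbers are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def vos_vs_step(features, desired, pinv_jacobian, gain=0.5):
    """Return a joint-space step dq = -gain * J_pinv @ e (assumed control form)."""
    # Feature error: current features minus desired features.
    error = np.asarray(features, dtype=float) - np.asarray(desired, dtype=float)
    return -gain * np.asarray(pinv_jacobian, dtype=float) @ error

# Hypothetical example: two actuators, each coupled to one centroid feature.
J_pinv = np.array([[0.002, 0.0],
                   [0.0, 0.003]])
dq = vos_vs_step(features=[400.0, 200.0], desired=[320.0, 240.0],
                 pinv_jacobian=J_pinv)
print(dq)
```

Because the step is computed from a single image and a fixed matrix, the same function can serve any actuator-camera pairing; only the matrix changes.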

IV-B Feature Jacobians using a Hadamard-Broyden Update

In real visual servo systems, it is impossible to know the exact feature Jacobian ($\mathbf{J}_s$) relating control actuators to image features [11]; instead, a few VS methods have estimated $\mathbf{J}_s$ directly from observations [12]. Of these methods, we are particularly interested in those using the Broyden update rule [27, 30, 47], which iteratively updates the estimate of $\mathbf{J}_s$ online.

In contrast to previous VS work, there is a formulation to estimate the pseudoinverse feature Jacobian ($\widehat{\mathbf{J}}_s^{+}$) in Broyden’s original paper [8, (4.5)]. However, we found it necessary to augment Broyden’s formulation with the logical matrix, $\mathbf{H}$. We define our Hadamard-Broyden update as

$$\widehat{\mathbf{J}}_{s,t+1}^{+} := \widehat{\mathbf{J}}_{s,t}^{+} + \alpha \left( \frac{\left(\Delta\mathbf{q} - \widehat{\mathbf{J}}_{s,t}^{+} \Delta\mathbf{e}\right) \Delta\mathbf{q}^{\top} \widehat{\mathbf{J}}_{s,t}^{+}}{\Delta\mathbf{q}^{\top} \widehat{\mathbf{J}}_{s,t}^{+} \Delta\mathbf{e}} \right) \circ \mathbf{H}, \tag{8}$$

where $\alpha$ determines the update speed, $\Delta\mathbf{q}$ and $\Delta\mathbf{e}$ are the respective changes in joint space and feature errors since the last update, and $\mathbf{H}$ is a logical matrix coupling actuators to image features. In experiments, we initialize (8) by choosing $\alpha$ and a starting estimate of $\widehat{\mathbf{J}}_s^{+}$.

The Hadamard product with $\mathbf{H}$ prevents undesired coupling between certain actuator and image feature pairs. In practice, we find that using $\mathbf{H}$ enables real-time convergence of (8) without any calibration on the robot for all of the configuration experiments in Section V-C. To the best of our knowledge, this is the first use of a Broyden update to directly estimate $\widehat{\mathbf{J}}_s^{+}$ for VS on an actual robot.
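The masked update can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the update follows the standard inverse ("good") Broyden form with the rank-one correction multiplied elementwise by the logical coupling matrix; the function name `hadamard_broyden_update`, the default `alpha`, and the example values are hypothetical and may differ from the authors' implementation.

```python
import numpy as np

def hadamard_broyden_update(J_pinv, dq, de, H, alpha=0.1, eps=1e-12):
    """One masked Broyden step on an estimated pseudoinverse Jacobian (n x k).

    dq: change in the n actuated joints since the last update.
    de: change in the k feature errors since the last update.
    H:  n x k logical matrix; zero entries are never updated.
    """
    J_pinv = np.asarray(J_pinv, dtype=float)
    dq = np.asarray(dq, dtype=float).reshape(-1, 1)
    de = np.asarray(de, dtype=float).reshape(-1, 1)
    denom = (dq.T @ J_pinv @ de).item()
    if abs(denom) < eps:
        # Skip the update when the secant denominator is numerically zero.
        return J_pinv.copy()
    correction = (dq - J_pinv @ de) @ (dq.T @ J_pinv) / denom
    return J_pinv + alpha * correction * np.asarray(H, dtype=float)

# Hypothetical example with two actuators and two features.
H = np.eye(2)
J0 = 0.001 * H
J1 = hadamard_broyden_update(J0, dq=[0.01, -0.02], de=[5.0, -8.0], H=H)
print(J1)
```

With `alpha=1` and an all-ones mask, the updated matrix satisfies the secant condition exactly (it maps the observed error change to the observed joint change); a smaller `alpha` and a sparse mask trade that exactness for smoother, decoupled learning.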

IV-C Segmentation-based Features

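Assume we are given an RGB image $I$ that contains an object of interest. Using VOS, we generate a binary mask

$$\mathbf{M} = \mathrm{vos}(I, \mathbf{W}), \tag{9}$$

where $\mathbf{M}$ consists of pixel-level labels $\ell_p \in \{0, 1\}$, $\ell_p = 1$ indicates pixel $p$ corresponds to the segmented object, and $\mathbf{W}$ are learned parameters for VOS (we detail our specific VOS method in Section V-B).

Using $\mathbf{M}$, we define the following VOS-based features

$$s_A := \sum_{\ell_p \in \mathbf{M}} \ell_p, \tag{10}$$

$$s_x := \frac{\sum_{\ell_p \in \mathbf{M}} \ell_p \, p_x}{s_A}, \tag{11}$$

$$s_y := \frac{\sum_{\ell_p \in \mathbf{M}} \ell_p \, p_y}{s_A}, \tag{12}$$

where $s_A$ is a measure of segmentation area by the number of labeled pixels, $s_x$ is the $x$-centroid of the segmented object using $x$-axis label locations $p_x$, and $s_y$ is the $y$-centroid. In addition to (10)-(12), we develop more VOS-based features for depth estimation and grasping in Sections IV-E and IV-F.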
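The area and centroid features are straightforward to compute from a binary mask. The sketch below is illustrative only; the function name `vos_features` and the synthetic mask are hypothetical, and the mask is assumed to be indexed as rows ($y$) by columns ($x$).

```python
import numpy as np

def vos_features(mask):
    """Return (area, x_centroid, y_centroid) of a binary segmentation mask."""
    mask = np.asarray(mask, dtype=bool)
    area = int(mask.sum())          # number of pixels labeled as object
    if area == 0:
        return 0, None, None        # object not in view: centroids undefined
    ys, xs = np.nonzero(mask)       # row (y) and column (x) indices of object pixels
    return area, float(xs.mean()), float(ys.mean())

# Synthetic 640x480 mask with a 100x100 pixel object.
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 300:400] = True
s_A, s_x, s_y = vos_features(mask)
print(s_A, s_x, s_y)
```

Because the centroid is an average over all labeled pixels, it stays defined when part of the object is occluded or out of view, which is the property the text relies on.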

(8) Joints Coupled with
Configuration ID Camera
Grasp ,
TABLE I: VOS-VS Hadamard-Broyden Configurations.
Fig. 3: Video Object Segmentation-based Visual Servo Control and Object Depth Estimation. HSR first centers the object with the camera’s optical axis (left) then estimates the object’s depth as the camera approaches on the axis (right).

IV-D HSR Visual Servo Configuration

Using and in (2), we define error as


Using and HSR joints (1), we choose in (8) as


where . Note that in our Hadamard-Broyden update (8), the Hadamard product of corresponds to in . Thus, we configure the logical coupling matrix by setting if coupling actuated joint with image feature is desired. Using our Broyden formulation (8), we learn on HSR for each of the configurations in Table I and provide experimental results in Section V-C.

IV-E Segmentation-based Object Depth Estimation

Fig. 4: VOS-based visual servo control (columns 1 to 2), active depth estimation (2-4), and mobile robot grasping (5-6).

By combining VOS-based features with active perception, we are able to estimate the depth of segmented objects and approximate an object’s relative 3D position. As shown in Figure 3, we initiate our depth estimation framework (VOS-DE) by centering a segmented object on the optical axis of our camera using the VOS-VS controller. This alignment minimizes lens distortion, which facilitates the use of an ideal camera model. Using the pinhole camera model [22], projections of objects onto the image plane scale inversely with their distance on the optical axis from the camera. Thus, with the object centered on the optical axis, we can relate projection scale and object distance using


$$\ell_1 d_1 = \ell_2 d_2, \tag{15}$$

where $\ell_1$ is the projected length of an object measurement orthogonal to the optical axis, $d_1$ is the distance on the optical axis of the object away from the camera, and $\ell_2$ is the projected measurement length at a new distance $d_2$. Combining Galileo Galilei’s Square-cube law with (15),

$$a_1 d_1^2 = a_2 d_2^2, \tag{16}$$

where $a$ is the projected object area corresponding to $\ell$ and $d$ (see Figure 3).

After centering on the segmented object, we advance the camera along the optical axis while collecting object segmentations. We modify (16) to relate each image $i$ using

$$d_i \sqrt{a_i} = c, \tag{17}$$

where $c$ is a constant determined by the orthogonal surface area of the segmented object. Also, using a coordinate frame with the $z$ axis aligned with the optical axis,

$$d = z_c - z_o, \tag{18}$$

where $z_c$ and $z_o$ are the respective $z$-axis coordinates of the camera and object. Note, because the camera and object are both centered on the $z$ axis, their $x$ and $y$ coordinates coincide. Using (18) and the area measurement $s_A$ from (10), we update (17) as

$$(z_{c,i} - z_o) \sqrt{s_{A,i}} = c, \tag{19}$$

where the object is assumed stationary between images (i.e., $\dot{z}_o = 0$) and the position $z_{c,i}$ is known from the robot kinematics and encoder values. Note that $z_c$ provides 3D information to our VOS-DE framework and (19) identifies a key linear relationship between VOS and the distance between the segmented object and the camera.

Finally, after collecting a series of $n$ measurements, we can estimate the depth of the segmented object. From (19),

$$\hat{z}_o \sqrt{s_{A,i}} + \hat{c} = z_{c,i} \sqrt{s_{A,i}}, \tag{20}$$

which over the $n$ measurements in $\mathbf{A}\mathbf{x} = \mathbf{b}$ form yields

$$\begin{bmatrix} \sqrt{s_{A,1}} & 1 \\ \vdots & \vdots \\ \sqrt{s_{A,n}} & 1 \end{bmatrix} \begin{bmatrix} \hat{z}_o \\ \hat{c} \end{bmatrix} = \begin{bmatrix} z_{c,1} \sqrt{s_{A,1}} \\ \vdots \\ z_{c,n} \sqrt{s_{A,n}} \end{bmatrix}. \tag{21}$$

By solving (21) for $\hat{z}_o$ and $\hat{c}$, we can estimate the distance $d$ in (18), and, thus, the 3D location of the object. In Section V-D, we show in robot experiments that our combined VOS-VS and VOS-DE framework is sufficient for locating, approaching, and estimating the depth of a variety of household objects.
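The depth estimate can be illustrated with a small least-squares sketch. It assumes the linear model implied by the pinhole relation above: camera height times the square root of segmentation area equals object height times the square root of area plus a constant, with the camera approaching along the optical axis. The function name `estimate_object_depth` and the synthetic data are hypothetical.

```python
import numpy as np

def estimate_object_depth(camera_z, areas):
    """Least-squares fit of object z-coordinate and scale constant.

    Assumed model per image i: camera_z[i] * sqrt(area[i]) = z_obj * sqrt(area[i]) + c
    """
    camera_z = np.asarray(camera_z, dtype=float)
    root_area = np.sqrt(np.asarray(areas, dtype=float))
    A = np.column_stack([root_area, np.ones_like(root_area)])  # unknowns: [z_obj, c]
    b = camera_z * root_area
    (z_obj, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(z_obj), float(c)

# Synthetic approach: object at z = 0.25 m, camera descending from 0.8 m to 0.5 m.
true_z, true_c = 0.25, 30.0
camera_z = np.linspace(0.8, 0.5, 7)
areas = (true_c / (camera_z - true_z)) ** 2   # areas consistent with the model
z_hat, c_hat = estimate_object_depth(camera_z, areas)
print(z_hat, c_hat)
```

Only two unknowns are fitted, so the estimate can be refreshed after every new image during the approach and its convergence monitored as more observations arrive.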

Remark: There are many methods to find approximate solutions to (21). In practice, we find that a least squares solution provides good robustness to outliers in terms of encoder and segmentation errors (see data in Figures 




Fig. 5: Depth Estimate of Sugar Box. Data collected and processed in real time during the initial approach in Figure 4.

IV-F Segmentation-based Grasping

Fig. 6: Experiment Objects from YCB Dataset. Object categories are (from left to right) Food, Kitchen, Tool, and Shape.

As a final extension to our VOS-based framework, we develop a VOS-based method of grasping and grasp-error detection. While many recent grasping methods are RGBD-based [24, 35], similar to VOS-DE, a VOS-based approach to grasping can function when 3D sensing is unavailable. This design choice is further motivated by the winning team of the 2017 Amazon Robotics Challenge [43, 54], which uses RGB-based methods when RGBD-based methods fail.

Assuming an object is centered using VOS-VS and has estimated depth $\hat{z}_o$ using VOS-DE, we move $z_c$ to

$$z_{\text{grasp}} := \hat{z}_o + z_g, \tag{22}$$

where $z_g$ is the known $z$-axis offset between $z_c$ and the center of HSR’s closed fingertips. Thus, when $z_c$ is at $z_{\text{grasp}}$, HSR can reach the object at depth $\hat{z}_o$.

After moving to $z_{\text{grasp}}$, we center the object directly underneath HSR’s antipodal gripper using VOS-VS control. To find a suitable grasp location, we project a mask of the gripper, $\mathbf{M}_g$, into the camera and solve

$$\arg\min_{q_w} \; \mathcal{J}\big(\mathbf{M}_g(q_w), \mathbf{M}\big), \tag{23}$$

where $\mathcal{J}$ is the intersection over union (or Jaccard index [20]) of $\mathbf{M}_g$ and object segmentation mask $\mathbf{M}$, and $\mathbf{M}_g(q_w)$ is the projection of $\mathbf{M}_g$ corresponding to HSR wrist rotation $q_w$. Thus, we grasp the object using the wrist rotation with least intersection between the object and the gripper, which is then less likely to collide with the object before achieving a parallel grasp (see Figure 4).
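The wrist-rotation search can be sketched as a direct minimization of intersection over union across candidate angles. This is a minimal sketch under stated assumptions: the projected gripper masks are taken as given (a dictionary from wrist angle to mask), and the names `iou` and `best_wrist_rotation` as well as the toy masks are hypothetical.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over union (Jaccard index) of two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def best_wrist_rotation(object_mask, gripper_masks):
    """Return the wrist angle whose projected gripper mask overlaps the object least."""
    return min(gripper_masks, key=lambda angle: iou(object_mask, gripper_masks[angle]))

# Toy example: a horizontal bar-shaped object and two candidate finger placements.
obj = np.zeros((9, 9), dtype=bool)
obj[4, 1:8] = True
g0 = np.zeros((9, 9), dtype=bool)    # fingers left and right of center
g0[3:6, 1] = True
g0[3:6, 7] = True
g90 = np.zeros((9, 9), dtype=bool)   # fingers above and below center
g90[1, 3:6] = True
g90[7, 3:6] = True
best = best_wrist_rotation(obj, {0: g0, 90: g90})
print(best)
```

In the toy case the fingers placed across the bar's narrow dimension do not overlap the object, so that rotation is selected.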

After the object is grasped, we lift HSR’s arm to perform a visual grasp check. We consider a grasp complete if


where is the object segmentation size (10) during the initial grasp and is the corresponding after lifting the arm. Essentially, if decreases when lifting the arm, the object is further from the camera and not securely grasped. Thus, we quickly identify if a grasp is missed and regrasp as necessary. Note that our VOS-based grasp check can work with other grasping methods as well.

A complete demonstration of our VOS-based visual servo control, depth estimation, and grasping framework is shown from start to finish in Figure 4, with corresponding depth estimation data shown in Figure 5.

V Robot Experiments

V-A Experiment Objects

For most of our experiments, we use the objects from the YCB dataset [10] shown in Figure 6. We use six objects from each of the food, kitchen, tool, and shape categories and purposefully choose some of the most difficult objects to stress test our system. To name only a few of the challenges for selected objects: dimensions span from the 470 mm long pan to the 4 mm thick washer, most of the contours change with pose, and over a third of the objects exhibit specular reflection of overhead lights. To learn object recognition, we annotate ten training images of each object using HSR’s grasp camera with various object poses, backgrounds, and distances from the camera (see example image in Figure 1).

V-B Video Object Segmentation Method

We segment objects using One-Shot Video Object Segmentation (OSVOS) [9] (available at ). OSVOS uses a base network trained on ImageNet [18] to recognize image features, re-trains a parent network on DAVIS 2016 [46] to learn general video object segmentation, and then fine-tunes for each object that we use in our experiments (i.e., each object has unique learned VOS parameters in (9)). After learning these parameters, we segment HSR’s 640x480 RGB images at 29.6 Hz with a dual-GPU (GTX 1080 Ti) machine.

V-C Hadamard-Broyden VOS-VS Results

(8) Learned in (14)
 0.00173 0.00183
-0.00157 0.00321
-0.00221 0.00445
-0.00392 0.00328
-0.00179 0.00173
-0.00040 0.00040
TABLE II: Learned for Various Configurations.

V-C1 Initialization and Learning

We learn all of the VOS-VS configurations in Table I on HSR using the Hadamard-Broyden update formulation in (8). We initialize each configuration using , , and a target segmentation object in view to elicit a step response from the VOS-VS controller. Each configuration starts at a specific pose (e.g., uses the leftmost pose in Figures 3-4), and configurations use in (13), except for , which uses for grasp positioning.

After initializing each configuration, within a few iterations of control inputs from (7) and updates from (8), the learned matrix generally shows convergence for any component that is initialized with the correct sign. Components initialized with an incorrect sign (e.g., in ) generally require more updates to change directions and jump through zero during one of the discrete updates. If an object goes out of view from an incorrectly signed component, we reset HSR’s pose and restart the update from the most recent . Once is reached, the object can be moved to elicit a few more step responses for fine tuning. The learned parameters for each Hadamard-Broyden configuration are provided in Table II. In the rest of our experiments, we set in (8) to reduce variability.

Fig. 7: Segmentation visual servo trajectories for the same object locations using (left) and (right).

Fig. 8: Segmentation visual servo trajectories using (center left), (center right), and (right).

Fig. 9: Segmentation visual servo trajectories using .

V-C2 Configuration Results

To show the step response of each learned configuration in Table II, we perform additional experiments by centering the camera on multiple YCB objects within view of each configuration’s starting pose. In Figure 7, both and exhibit a stable response for the same configuration of objects; note that the wood block starts close to (shown by green dot) in the raised perspective of . Our motivation to learn two base configurations is the increase in sensitivity to base motion as an object’s depth decreases. operates with the camera raised high above objects, while operates with the camera directly above objects to position for grasping. Thus, needs more base movement for the same changes in compared to . This difference is apparent in Table II from learning greater values and in Figure 7 from the ’s smaller distribution of initial values despite identical object distances.

V-C3 Configuration Results

We show the step response of all arm-based VS configurations in Figure 8. Each configuration uses the same segmentation objects and starting pose. While each configuration segments the pan and baseball, is not reachable for these objects within any of the configured actuator spaces; is the only configuration to center on all four of the other objects. The overactuated configuration exhibits the most overshoot, while exhibits the most limited range of camera positions but essentially deadbeat control.

V-C4 Configuration Results

Finally, we show the step response of in Figure 9. is the only configuration that uses HSR’s 2-DOF head gimbal and camera, and it exhibits a relatively smooth step response over the entire image. Of particular significance, even though uses the head camera, it uses the same OSVOS parameters that are learned using images from the grasp camera; this further demonstrates the general applicability of our VOS-VS framework in regards to not needing any camera calibration.

Item | Category | Support Height (m) | VS Complete | DE Complete
Chips Can | Food | 0.25 | X | X
Potted Meat | Food | 0.125 | X | X
Plastic Banana | Food | Ground | X | X
Box of Sugar | Food | 0.25 | X | X
Tuna | Food | 0.125 | X |
Gelatin | Food | Ground | X | X
Mug | Kitchen | 0.25 | X | X
Softscrub | Kitchen | 0.125 | | N/A
Skillet with Lid | Kitchen | Ground | | N/A
Plate | Kitchen | 0.25 | X | X
Spatula | Kitchen | 0.125 | | N/A
Knife | Kitchen | Ground | X |
Power Drill | Tool | 0.25 | X | X
Marker | Tool | 0.125 | X |
Padlock | Tool | Ground | X |
Wood | Tool | 0.25 | X |
Spring Clamp | Tool | 0.125 | X |
Screwdriver | Tool | Ground | X |
Baseball | Shape | 0.25 | X |
Plastic Chain | Shape | 0.125 | X |
Washer | Shape | Ground | X |
Stacking Cup | Shape | 0.25 | X | X
Dice | Shape | 0.125 | | N/A
Foam Brick | Shape | Ground | X | X
TABLE III: Sequential VOS-VS and VOS-DE Results. All results from a single consecutive set of mobile HSR trials.

V-D Consecutive VOS-VS and VOS-DE Mobile Robot Trials

We perform an experiment consisting of a consecutive set of trials that simultaneously test VOS-VS and VOS-DE. Each trial consists of three unique YCB objects placed at different heights: one on the blue bin 0.25 m above the ground, one on the green bin 0.125 m above the ground, and one directly on the ground (see bin configuration in Figure 1). The trial configurations and corresponding results are provided in Table III. VOS-VS is considered a success (“X”) if HSR locates and centers on the object for depth estimation. VOS-DE is considered a success if HSR achieves (22) such that 1) HSR can close its grippers on the object without hitting the underlying surface and 2) HSR does not move past the top surface of the object.

Across all of the challenging YCB objects we selected for this single consecutive set of trials, VOS-VS has an 83% success rate. Note that VOS-DE is only applicable if VOS-VS succeeds; in these cases, VOS-DE has a 50% overall success rate. By category, food objects have the highest success (100% VOS-VS, 83% VOS-DE) and kitchen objects have the lowest (50% VOS-VS, 66% VOS-DE). Note that the margin of error and difficulty for VOS-DE is highly variable between objects (e.g., much more difficult for the 4 mm thick washer compared to the 50 mm thick foam brick).
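The reported rates can be checked directly against Table III. The short script below transcribes the table's success marks (1 for "X", 0 otherwise) and computes the VOS-VS rate over all trials and the VOS-DE rate over only those trials where VOS-VS succeeded; the function name `success_rates` is hypothetical.

```python
# (item, category, VOS-VS success, VOS-DE success), transcribed from Table III.
ROWS = [
    ("Chips Can", "Food", 1, 1), ("Potted Meat", "Food", 1, 1),
    ("Plastic Banana", "Food", 1, 1), ("Box of Sugar", "Food", 1, 1),
    ("Tuna", "Food", 1, 0), ("Gelatin", "Food", 1, 1),
    ("Mug", "Kitchen", 1, 1), ("Softscrub", "Kitchen", 0, 0),
    ("Skillet with Lid", "Kitchen", 0, 0), ("Plate", "Kitchen", 1, 1),
    ("Spatula", "Kitchen", 0, 0), ("Knife", "Kitchen", 1, 0),
    ("Power Drill", "Tool", 1, 1), ("Marker", "Tool", 1, 0),
    ("Padlock", "Tool", 1, 0), ("Wood", "Tool", 1, 0),
    ("Spring Clamp", "Tool", 1, 0), ("Screwdriver", "Tool", 1, 0),
    ("Baseball", "Shape", 1, 0), ("Plastic Chain", "Shape", 1, 0),
    ("Washer", "Shape", 1, 0), ("Stacking Cup", "Shape", 1, 1),
    ("Dice", "Shape", 0, 0), ("Foam Brick", "Shape", 1, 1),
]

def success_rates(rows, category=None):
    """Return (VS rate over all trials, DE rate over trials where VS succeeded)."""
    picked = [r for r in rows if category is None or r[1] == category]
    vs_ok = [r for r in picked if r[2]]
    de_ok = [r for r in vs_ok if r[3]]
    return len(vs_ok) / len(picked), len(de_ok) / len(vs_ok)

overall = success_rates(ROWS)
food = success_rates(ROWS, "Food")
kitchen = success_rates(ROWS, "Kitchen")
print(overall, food, kitchen)
```

The counts behind the percentages are 20 of 24 trials for VOS-VS and 10 of those 20 for VOS-DE; the kitchen VOS-DE figure of 66% corresponds to 2 of 3.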

V-E Additional Experiments

Fig. 10: HSR Taking Grasped Banana Peel to Garbage.

V-E1 HSR Pick-And-Place Challenge

We perform a few additional experiments for our VOS-based methods, including our work in the TRI-sponsored HSR challenges. These challenges consist of timed trials for pick-and-place tasks with randomly scattered, non-YCB objects (e.g., banana peels and wadded up dollar bills). These HSR challenges are a particularly good demonstration of our VOS-based approach to visual servo control and grasping. Footage for these experiments and trials is available at:

V-E2 Dynamic Articulated Objects

Finally, we perform additional VOS-VS experiments with dynamic articulated objects. Using , HSR tracks the plastic chain across the room in real time as we kick it and throw it in a variety of unstructured poses; we can even pick up the chain and use it to guide HSR’s movements from the grasp camera. In addition, by training OSVOS to recognize an article of clothing, HSR is able to reliably track a person moving throughout the room using (see Figure 11). Footage for these experiments with dynamic articulated objects is available at:

Fig. 11: HSR Tracking Person with Head Camera.

VI Conclusions and Future Work

In this work, we develop a video object segmentation-based approach to visual servo control, depth estimation of objects, and grasping. Visual servo control is an established approach to controlling a physical robot system from RGB images, and video object segmentation has seen rampant advances within the computer vision community for densely segmenting general objects in challenging videos. The success of our VOS-based approach to visual servo control in experiments using a mobile robot platform and generic objects is a tribute to both of these communities and the initiation of a bridge between them. Future developments in video object segmentation will improve the robustness of our method and, we expect, lead to other innovations in robotics.

A significant benefit of our VOS-based framework is that it only requires an RGB camera combined with robot actuation. While 3D sensing-based methods for depth estimation and grasping are still state-of-the-art, in future work, we will improve our approach to VOS-based depth estimation and grasping by collecting data from multiple poses, thereby leveraging more information and making our 3D understanding of the target object more complete. Nonetheless, we find that the current VOS-based framework is a useful tool for robotics applications where 3D sensors are unavailable.


Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.


  • [1] “A computer algorithm for reconstructing a scene from two projections,” in Readings in Computer Vision, M. A. Fischler and O. Firschein, Eds. San Francisco (CA): Morgan Kaufmann, 1987, pp. 61–62.
  • [2] J. K. Aggarwal and N. Nandhakumar, “On the computation of motion from sequences of images-a review,” Proceedings of the IEEE, vol. 76, no. 8, pp. 917–935, Aug 1988.
  • [3] R. Bajcsy, “Active perception,” Proceedings of the IEEE, vol. 76, no. 8, pp. 966–1005, Aug 1988.
  • [4] R. Bajcsy, Y. Aloimonos, and J. K. Tsotsos, “Revisiting active perception,” Autonomous Robots, vol. 42, no. 2, pp. 177–196, Feb 2018.
  • [5] L. Bao, B. Wu, and W. Liu, “CNN in MRF: video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [6] J. Bohg, K. Hausman, B. Sankaran, O. Brock, D. Kragic, S. Schaal, and G. S. Sukhatme, “Interactive perception: Leveraging action in perception and perception in action,” IEEE Transactions on Robotics, vol. 33, no. 6, pp. 1273–1291, Dec 2017.
  • [7] B. Browatzki, V. Tikhanoff, G. Metta, H. H. Bülthoff, and C. Wallraven, “Active in-hand object recognition on a humanoid robot,” IEEE Transactions on Robotics, vol. 30, no. 5, pp. 1260–1269, Oct 2014.
  • [8] C. G. Broyden, “A class of methods for solving nonlinear simultaneous equations,” Mathematics of Computation, vol. 19, no. 92, pp. 577–593, 1965.
  • [9] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [10] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Benchmarking in manipulation research: Using the yale-cmu-berkeley object and model set,” IEEE Robotics Automation Magazine, vol. 22, no. 3, pp. 36–52, Sep. 2015.
  • [11] F. Chaumette and S. Hutchinson, “Visual servo control. i. basic approaches,” IEEE Robotics Automation Magazine, vol. 13, no. 4, pp. 82–90, Dec 2006.
  • [12] ——, “Visual servo control. ii. advanced approaches [tutorial],” IEEE Robotics Automation Magazine, vol. 14, no. 1, pp. 109–118, March 2007.
  • [13] F. Chaumette, “Potential problems of stability and convergence in image-based and position-based visual servoing,” in The confluence of vision and control, D. J. Kriegman, G. D. Hager, and A. S. Morse, Eds.   London: Springer London, 1998, pp. 66–78.
  • [14] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool, “Blazingly fast video object segmentation with pixel-wise metric learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [15] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang, “Fast and accurate online video object segmentation via tracking parts,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [16] G. Chesi, E. Malis, and R. Cipolla, “Automatic segmentation and matching of planar contours for visual servoing,” in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065), vol. 3, April 2000, pp. 2753–2758 vol.3.
  • [17] P. I. Corke and S. A. Hutchinson, “A new partitioned approach to image-based visual servo control,” IEEE Transactions on Robotics and Automation, vol. 17, no. 4, pp. 507–515, Aug 2001.
  • [18] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [19] R. Eidenberger and J. Scharinger, “Active perception and scene modeling by planning with probabilistic 6d object poses,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct 2010, pp. 1036–1043.
  • [20] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [21] A. Faktor and M. Irani, “Video segmentation by non-local consensus voting,” in British Machine Vision Conference (BMVC), 2014.
  • [22] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach.   Prentice Hall Professional Technical Reference, 2002.
  • [23] B. A. Griffin and J. J. Corso, “Tukey-inspired video object segmentation,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2019.
  • [24] M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt, “High precision grasp pose detection in dense clutter,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, pp. 598–605.
  • [25] N. Guenard, T. Hamel, and R. Mahony, “A practical visual servo control for an unmanned aerial vehicle,” IEEE Transactions on Robotics, vol. 24, no. 2, pp. 331–340, April 2008.
  • [26] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from rgb-d images for object detection and segmentation,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds.   Cham: Springer International Publishing, 2014, pp. 345–360.
  • [27] K. Hosoda and M. Asada, “Versatile visual servoing without knowledge of true jacobian,” in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’94), vol. 1, Sep. 1994, pp. 186–193 vol.1.
  • [28] S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control,” IEEE Transactions on Robotics and Automation, vol. 12, no. 5, pp. 651–670, Oct 1996.
  • [29] M. Jagersand, O. Fuentes, and R. Nelson, “Experimental evaluation of uncalibrated visual servoing for precision manipulation,” in Proceedings of International Conference on Robotics and Automation, vol. 4, April 1997, pp. 2874–2880 vol.4.
  • [30] ——, “Experimental evaluation of uncalibrated visual servoing for precision manipulation,” in Proceedings of International Conference on Robotics and Automation, vol. 4, April 1997, pp. 2874–2880 vol.4.
  • [31] M. Krainin, P. Henry, X. Ren, and D. Fox, “Manipulator and object tracking for in-hand 3d object modeling,” The International Journal of Robotics Research, vol. 30, no. 11, pp. 1311–1327, 2011.
  • [32] Y. J. Lee, J. Kim, and K. Grauman, “Key-segments for video object segmentation,” in IEEE International Conference on Computer Vision (ICCV), 2011.
  • [33] A. D. Luca, G. Oriolo, and P. R. Giordano, “Feature depth observation for image-based visual servoing: Theory and experiments,” The International Journal of Robotics Research, vol. 27, no. 10, pp. 1093–1116, 2008.
  • [34] J. Luiten, P. Voigtlaender, and B. Leibe, “Premvos: Proposal-generation, refinement and merging for video object segmentation,” in Asian Conference on Computer Vision, 2018.
  • [35] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” 2017.
  • [36] R. Mahony, P. Corke, and F. Chaumette, “Choice of image features for depth-axis control in image based visual servo control,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1, Sept 2002, pp. 390–395 vol.1.
  • [37] E. Malis, F. Chaumette, and S. Boudet, “2 1/2 d visual servoing,” IEEE Transactions on Robotics and Automation, vol. 15, no. 2, pp. 238–250, April 1999.
  • [38] E. Malis and F. Chaumette, “Theoretical improvements in the stability analysis of a new class of model-free visual servoing methods,” IEEE Transactions on Robotics and Automation, vol. 18, pp. 176–186, May 2002.
  • [39] K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. V. Gool, “Video object segmentation without temporal information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2018.
  • [40] P. Marion, P. R. Florence, L. Manuelli, and R. Tedrake, “Label fusion: A pipeline for generating ground truth labels for real rgbd data of cluttered scenes,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 1–8.
  • [41] G. L. Mariottini, G. Oriolo, and D. Prattichizzo, “Image-based visual servoing for nonholonomic mobile robots using epipolar geometry,” IEEE Transactions on Robotics, vol. 23, no. 1, pp. 87–100, Feb 2007.
  • [42] A. McFadyen, M. Jabeur, and P. Corke, “Image-based visual servoing with unknown point feature correspondence,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 601–607, April 2017.
  • [43] A. Milan, T. Pham, K. Vijay, D. Morrison, A. W. Tow, L. Liu, J. Erskine, R. Grinover, A. Gurman, T. Hunn, N. Kelly-Boxall, D. Lee, M. McTaggart, G. Rallos, A. Razjigaev, T. Rowntree, T. Shen, R. Smith, S. Wade-McCue, Z. Zhuang, C. F. Lehnert, G. Lin, I. D. Reid, P. I. Corke, and J. Leitner, “Semantic segmentation from limited training data,” CoRR, vol. abs/1709.07665, 2017.
  • [44] S. W. Oh, J.-Y. Lee, K. Sunkavalli, and S. J. Kim, “Fast video object segmentation by reference-guided mask propagation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [45] A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
  • [46] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [47] J. A. Piepmeier, G. V. McMurray, and H. Lipkin, “Uncalibrated dynamic visual servoing,” IEEE Transactions on Robotics and Automation, vol. 20, no. 1, pp. 143–147, Feb 2004.
  • [48] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 davis challenge on video object segmentation,” arXiv:1704.00675, 2017.
  • [49] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 746–760.
  • [50] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [51] R. Spica, P. R. Giordano, and F. Chaumette, “Coupling active depth estimation and visual servoing via a large projection operator,” The International Journal of Robotics Research, vol. 36, no. 11, pp. 1177–1194, 2017.
  • [52] A. Venkataraman, B. Griffin, and J. J. Corso, “Kinematically-informed interactive perception: Robot-generated 3d models for classification,” CoRR, vol. abs/1901.05580, 2019.
  • [53] P. Voigtlaender and B. Leibe, “Online adaptation of convolutional neural networks for video object segmentation,” in British Machine Vision Conference (BMVC), 2017.
  • [54] S. Wade-McCue, N. Kelly-Boxall, M. McTaggart, D. Morrison, A. W. Tow, J. Erskine, R. Grinover, A. Gurman, T. Hunn, D. Lee, A. Milan, T. Pham, G. Rallos, A. Razjigaev, T. Rowntree, R. Smith, K. Vijay, Z. Zhuang, C. F. Lehnert, I. D. Reid, P. I. Corke, and J. Leitner, “Design of a multi-modal end-effector and grasping system: How integrated design helped win the amazon robotics challenge,” CoRR, vol. abs/1710.01439, 2017.
  • [55] Y. Wang, H. Lang, and C. W. de Silva, “A hybrid visual servo controller for robust grasping by wheeled mobile robots,” IEEE/ASME Transactions on Mechatronics, vol. 15, no. 5, pp. 757–769, Oct 2010.
  • [56] S. Wehrwein and R. Szeliski, “Video segmentation with background motion models,” in British Machine Vision Conference (BMVC), 2017.
  • [57] G. Wei, K. Arbter, and G. Hirzinger, “Real-time visual servoing for laparoscopic surgery. controlling robot motion with color image segmentation,” IEEE Engineering in Medicine and Biology Magazine, vol. 16, no. 1, pp. 40–45, Jan 1997.
  • [58] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang, “Youtube-vos: A large-scale video object segmentation benchmark,” arXiv preprint arXiv:1809.03327, 2018.
  • [59] U. Yamaguchi, F. Saito, K. Ikeda, and T. Yamamoto, “Hsr, human support robot as research and development platform,” The Abstracts of the international conference on advanced mechatronics : toward evolutionary fusion of IT and mechatronics : ICAM, vol. 2015.6, pp. 39–40, 2015.
  • [60] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos, “Efficient video object segmentation via network modulation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [61] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, N. Fazeli, F. Alet, N. C. Dafle, R. Holladay, I. Morona, P. Q. Nair, D. Green, I. Taylor, W. Liu, T. Funkhouser, and A. Rodriguez, “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2018.