Morphology-Agnostic Visual Robotic Control

by   Brian Yang, et al.

Existing approaches for visuomotor robotic control typically require characterizing the robot in advance by calibrating the camera or performing system identification. We propose MAVRIC, an approach that works with minimal prior knowledge of the robot's morphology, and requires only a camera view containing the robot and its environment and an unknown control interface. MAVRIC revolves around a mutual information-based method for self-recognition, which discovers visual "control points" on the robot body within a few seconds of exploratory interaction, and these control points in turn are then used for visual servoing. MAVRIC can control robots with imprecise actuation, no proprioceptive feedback, unknown morphologies including novel tools, unknown camera poses, and even unsteady handheld cameras. We demonstrate our method on visually-guided 3D point reaching, trajectory following, and robot-to-robot imitation.



page 2

page 4

page 6

page 7


Active Visual SLAM with independently rotating camera

In active Visual-SLAM (V-SLAM), a robot relies on the information retrie...

Self-Modifying Morphology Experiments with DyRET: Dynamic Robot for Embodied Testing

Robots need to be able to adapt to complex and dynamic environments for ...

Controlling Robot Morphology from Incomplete Measurements

Mobile robots with complex morphology are essential for traversing rough...

Design, fabrication and 3-DOF control of legless capsule robot

In this paper, a legless capsule robot (capsubot) comprised of a sealed ...

Learning a generative model for robot control using visual feedback

We introduce a novel formulation for incorporating visual feedback in co...

The Field-of-View Constraint of Markers for Mobile Robot with Pan-Tilt Camera

In the field of navigation and visual servo, it is common to calculate r...

Shimon the Robot Film Composer and DeepScore: An LSTM for Generation of Film Scores based on Visual Analysis

Composing for a film requires developing an understanding of the film, i...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Achild playing an arcade “claw crane” game must first visually locate and recognize the claw robot they are controlling, and learn how it responds to various control commands—something they typically accomplish within a few seconds of twiddling the joystick controller. Infants and some animals also exhibit mirror self-recognition, the ability to recognize one’s reflection in the mirror as tied to oneself [1, 2]. In this paper, we ask: how might a robot perform efficient self-recognition and how might this ability improve robotic control?

Current robotic control methods typically require precise knowledge of the robot’s configuration and kinematics, obtained from accurate geometric models of its body, and joint-level proprioceptive encoders. As an example, the degrees of freedom of a standard robot arm are fully specified by its joint angles, available through servomotor encoders. Given a target pose, a controller can quite easily plan a trajectory of poses, and servo to sequentially reach those poses.

Unfortunately, such proprioception-driven control methods do not generalize to many important settings. What if the robot were made of soft or deformable material, so that its degrees of freedom are not easily enumerated, let alone measured? Even for the rigid robot arm above, introduce, say, a pen into its gripper, and the position of its tip is now unknown and therefore not possible to control.

Humans easily handle such control tasks by relying on rich sensory feedback, such as from vision, rather than purely on precise proprioception. This makes us less reliant on precisely modeling our bodies and tools, and the richer visual space also enables specifying and performing a larger class of tasks, such as writing with an unfamiliar chalk piece, or catching a ball.

Existing image-based visual servoing (IBVS) approaches attempt to bridge this gap by relying on visual feedback to plan towards visual goals, but they require the robot to be pre-specified in some way. For example, they may assume that there exists a reliable detector for a point of interest such as the end-effector of a robot arm, so that its position in the camera view is known. To control a new robot, or the pen in the gripper above, a new detector would have to be provided by a human engineer, either by applying visual markers or by training a visual detector with machine learning.

We propose mavric, a self-recognition-enabled approach to IBVS that works “out of the box” on arbitrary new or altered robots with no manual specification of any points of interest. In a self-recognition phase, the robot locates and characterizes itself through exploratory actions, much like the claw crane game player twiddling their joystick. We use simple techniques to accomplish this: a mutual information measure [3] evaluates the responsiveness of various points in the environment, tracked using Lucas-Kanade optical flow computation [4], to the robot control commands. The most responsive points are then exploited in the control phase. mavric is lightweight, flexible, and fast to adapt, producing responsive “control points” for a new robot within a few seconds of interaction.

Fig. 1: mavric first collects random exploration video for a few seconds, then computes scores for various tracked points corresponding to their responsiveness to control inputs (shown in blue here). The top-few such points are then averaged (shown as the green point) treated as the end-effector for visual servoing in control tasks.

As we will show, mavric handles settings that are challenging for today’s state-of-the-art robotic control approaches: imprecise actuation, unknown robot morphologies, unknown camera poses, novel unmodeled tools, and unsteady handheld cameras.

Ii Related Work

Image-based visual servoing.   Visual servoing methods [5, 6, 7, 8, 9] perform feedback control to reduce an error that is measured through camera observations. Specifically, uncalibrated image-based visual servoing (IBVS) approaches [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

typically estimate the differential relationships between control inputs and changes in some visual feature of interest, such as the pixel coordinates of the robot’s end-effector in an uncalibrated camera view. Reinforcement learning approaches have also been proposed for the IBVS setting 

[24, 25, 26, 27, 28, 29].

All these IBVS methods produce controllers that are tied to a single robot morphology in some way—for example, they may require visual markers on the robot [20, 21, 22, 23] or a large dataset of interactions specific to the current robot morphology and environment [24, 26, 28, 29, 30, 31, 25, 27]. In contrast, mavric performs automatic self-recognition to produce a controller that adapts to new or altered robots within a few seconds.

Robotic self-recognition.   Prior methods have been proposed that learn to recognize the robot’s body. Michel et al [32] use learned characteristic response delays between actions and observed motion, and Natale et al [33] also exploit simple temporal correspondences, relying on periodicity. Robot and object keypoints may be discovered by training a keypoint encoder for image reconstruction on a large image dataset [24], or by weakly supervised multiple-instance learning from a few minutes of video with and without an object of interest [34]. Byravan et al [35] demonstrate rigid robot link discovery by training a dynamics model with appropriate inductive biases on hundreds of thousands of depth frames annotated with ground truth correspondences. Closest to mavric, [36] use a mutual information-based approach to recognize individual links, but different from mavric, they assume known aspects of the robot’s morphology, such as the number of its links and the visual appearance of its body, and the correspondences between action dimensions and the various servos on the robot, and also rely on manually demonstrated robot poses during exploration. Despite these advantages, they report requiring four minutes of exploration, compared to about 20s for mavric. They also propose a different approach to tracking, which we empirically compare against.

Iii Morphology-Agnostic Visual Control

We operate in the following setting: at each time step , a controller has access to raw RGBD image observations from a camera, and the ability to set a -dimensional control input for the robot. The images contain the robot’s body as well as other portions of its environment. We are interested in performing visually guided control tasks, such as reaching and trajectory-following.

To maximize generality, we make very few assumptions about factors that are commonly treated as fully known in robotic control: (i) We do not know the nature of the robot’s embodiment, such as the degrees of freedom, rigidity, or the number and lengths of links in a robot arm. (ii) For the control interface, aside from the standard assumptions made in uncalibrated visual servoing, we make one additional assumption that the displacements of points on the robot are a probabilistic function of the control commands. We make no further assumptions about how the controls affect the robot. We will revisit this point in Sec III-A. (iii) We do not assume camera calibration.

Our approach, mavric, works in two phases. In a self-recognition phase (Sec III-A), it identifies the robot’s body and represents it as trackable and characterizable “control points,” through unsupervised interaction. In the control phase (Sec III-B), it performs visual servoing to move these control points along desired target trajectories.

Iii-a Self-Recognition: Robot as Responsive Particles

In the self-recognition phase, we aim to resolve the question: what is the body of the robot? In standard robotic control settings, the body is a predefined, physically connected object with rigid links. Since we do not assume a pre-specified body in our setting, we first define the body in terms of a new notion of “responsiveness.”

We start by decomposing the observable environment containing the robot into “particles”— points in 3D space that may or may not be part of its body. Each particle has an associated position in RGBD camera coordinates (, , depth) from our uncalibrated camera. The task of identifying the body now reduces to assigning a binary label (body / not body) to each such particle.

Responsiveness.   We define the responsiveness of a particle as the mutual information (MI) [3, 37] between its motions and the control inputs. Specifically, we execute a random sequence of exploratory control commands assign a non-negative responsiveness score to each particle :


where is the change in position of in response to , and

is the MI between two random variables.

is zero for points whose motion is completely independent of .

To understand this definition intuitively, note that , where is entropy and is conditional entropy. High responsiveness corresponds to low conditional entropy , i.e., changes in states are predictable given the control inputs. At the same time, the state changes themselves should also be sufficiently expressive, i.e., the unconditional entropy should be high. A fixed particle in the environment would have zero , but also zero H(), so it would have zero responsiveness.

Alternatively, since MI is symmetric, may also be written as . Since is the same for all particles, high responsiveness corresponds precisely to low conditional entropy : control inputs should be easily recoverable given the state changes . This is consistent with the intuition for discovering “contingent image regions” in Atari games used in Choi et al [38].

Fig. 2: Manual annotations to illustrate the end-effector point (green point) and region (red outline) in different settings: a schematic single link robot (leftmost), followed by our experimental setups with various tools held in a robot arm.

Body and control points.   We define the “body” of a robot as the set of particles whose responsiveness is higher than some threshold : . We call the constituent particles of this set “control points.”

First, does this definition align with our intuitive notion of a robot’s body? Should all points on a robot body be “responsive”, i.e., do their displacements have high MI with the control inputs ? For this to hold, must be a probabilistic function of , i.e., a fixed control input must induce a fixed distribution over . This is true for velocity control commands as long as the states explored during the self-recognition phase lie within a small neighborhood. For example, consider a single motor controlling a rigid rod, as in Fig 2 (left). A small angular shift in the servomotor corresponds to a displacement for a particle at a distance along the rod, in a direction perpendicular to the current orientation of the rod. With a significantly different orientation of the rod, the same angular shift would produce a very different displacement.

To account for this, our experiments employ velocity control and a small number of small exploration actions, so that all exploration happens within a small state neighborhood. As we will show empirically, this yields good performance.

Most responsive control point (MRCP).   We now define the most responsive control point (MRCP) as the particle with the highest responsiveness . Its position is henceforth denoted . In practice, we average the positions of the top- most responsive particles to compute a robust MRCP. The MRCP is the point that is most responsive to control, which intuitively corresponds to highest maneuverability or dexterity. For this reason, it may be treated as the end-effector. Fig 2 shows images of various settings from our experiments, with manual annotations of the end effector point and region that we evaluate the MRCP against.

As an example, consider a hammer held firmly in the gripper of a robotic arm. It is very responsive to the arm’s control inputs, and may even contain the MRCP. Replacing the hammer with a loose rope, the rope would be less responsive, and the MRCP would be in the gripper instead.

Handling rigid objects.   Points on a rigid object, such as a single link of a standard arm, all exhibit the same motion, modulo an invertible affine transformation. Mutual information is known to be invariant under such smooth, invertible mappings, a.k.a. homeomorphisms (see Kraskov et al [39] for a simple proof). This in turn means that points on a rigid object all have the same responsiveness score. For example, for a rigid rod pivoted at one end as in Fig 2, the midpoint of the rod is just as responsive to any forces applied to the rod as the end of the rod.

To break such ties, we preferentially select points with larger motions. We do this by exploiting the fact that, although mutual information is insensitive to the scale of motion under infinite precision, any loss of precision leads naturally to a preference for large motions. We add Gaussian noise to in Eq. 1 to artificially lower the precision and express this preference for large motions.

Note however that it is not always the case that particles that are near the periphery or which move the most are MRCP points. For example, if the last link of a robot arm is broken so that it moves randomly, it might very well have a lower responsiveness score than the previous link.

Fig. 3: [Best seen in pdf] Self-recognition for MRCP search illustrated on data from a handheld, shaky camera. Frame 1 shows the camera view of the robot during exploration. Frame 2 shows the results of coarse tracking on the exploration video, where the track points overlaid on the image are colored more green if they are more responsive. Frame 3 shows how fine tracking is initialized on the same video around most responsive tracks from the coarse stage, and Frame 4 shows surviving tracks after the fine stage of MRCP search. Frame 5 shows the final result, with the top-15 control points in green and their average position in red. Video in Supp.

Implementation details.   Fig 1 shows a schematic of the mavric self-recognition phase. For tracking the positions of particles over time, we use the Lucas-Kanade optical flow estimator [4]. We only consider points that last the full duration of the exploration actions. We use the mutual information estimator proposed in [39], as implemented in [40]. In our experiments, control points on a robot arm are discovered within 20 seconds of exploration, sufficient to execute about 100 randomly sampled small actions. Finally, since we are primarily interested in the MRCP for control, we adopt a coarse-to-fine search strategy: In the coarse stage, we initialize point tracking with Shi-Tomasi corner points [41], compute responsiveness scores and select the top-

candidate particles. In the fine stage, we reinitialize tracking with a grid of 15x15 points around each of the selected candidates, and recompute responsiveness scores. The only important hyperparameter in all the above is the noise variance for handling rigid objects. Our experiments in Sec 

IV-A study this and other implementation details such as coarse-to-fine search. Fig 3 shows an example of the various stages of MRCP identification with a handheld, shaky camera.

Finally, the Lucas-Kanade tracker often drops particles over time, especially under occlusions. Since mavric is able to work with very short exploration phases in our experiments, the effect of dropped particles is mitigated. However, improved tracking, such as through articulated motion segmentation approaches [42], or by using multiple cameras, may improve robustness.

Iii-B Visual Servoing in MRCP Coordinates

Once the MRCP is identified, mavric performs visual servoing for control to transport the MRCP point to a specified goal point . This is appropriate for tasks like reaching and pushing, which are normally performed by hand-specified end-effectors in standard control settings.

We use online regression to fit tuples to estimate local Jacobian matrices “on the fly,” using Broyden updates [12]: .

The new control input for the current step is then computed quickly and cheaply using the pseudoinverse of the Jacobian, as , where is a rate hyperparameter. Once is executed and the new position is measured, the Jacobian matrix is updated as above, and the process repeats until . The Broyden update above is susceptible to noise, since it only uses a single measurement, hence we apply a batched update comprising the last tuples of (, ) as proposed in [13]. In our experiments, we set .

The Jacobian matrix initialization is computed as follows: we start at a random arm position, sequentially set the control inputs

to scaled unit vectors

along each control dimension, and set the -th column of to the response . In our experiments, we repeat this initialization procedure whenever servoing has failed to get closer to the target in the last 20 steps.

Handling tracking failures.   The above discussion of visual servoing depends on reliably continuing to track the MRCP throughout the servoing process. In practice, tracking is imperfect, and the control points are often dropped midway through the task due to occlusions, lighting changes etc. For robustness to such errors, we take the MRCP to be the average of the most responsive control points. If any one point is dropped by the tracker during servoing, the MRCP is set to the average of the remaining points. The larger is, the greater is the robustness to dropped points, and the lower is the precision in end effector localization since those points would be spatially more scattered. In our studies, sufficed for robust MRCP detection, so we study lower in experiments.

Iv Experiments

We perform experiments using the REPLAB standardized hardware platform [43, 44] with an imprecise low-cost manipulator (Trossen WidowX) and an RGB-D camera (Intel Realsense SR300). We evaluate how well mavric’s self-recognition phase works under varied conditions, and also the overall effectiveness of mavric for visuomotor servoing tasks. We will release REPLAB-compatible code upon acceptance for reproducibility.

Iv-a Self-Recognition: Discovering End Effectors, Tools, and Robot Morphology

First, we evaluate mavric’s self-recognition phase, described in Sec III-A. Specifically, we measure its accuracy at locating the robot’s end-effector or the tool held in its gripper, with four different tools – pliers, wrench, marker, pencil, shown in Fig 2.

Simulation.   In simulation, we evaluate mavric with perfect actuation and tracking, as a sanity check. We use a simulated Baxter robot, and perform 100 small exploratory control commands during the self-recognition phase. In our experiments, we precisely identified the end-effector with zero error every single time in this setting.

Real-world experiments.   Next, we evaluate our approach on real-world REPLAB cells with noisy tracking and actuation. In each run, a sequence of 100 random exploratory control commands are executed, which requires about 20 seconds. Fig 5 shows some example results of detected control points and MRCP points in each setting.

We quantitatively evaluate how closely mavric’s MRCP matches the manually annotated “true end-effector” of the robot. We annotate an end-effector region as well as a single end effector point in each setting. Example annotations are shown in Fig 2: the end-effector region includes the entire tool or the entire last link of the robot, and the end-effector point is more subjectively chosen for each tool based on how it is typically used, e.g., for the pencil and the marker, we annotate its tip. We evaluate several design decisions in mavric: 1-stage vs 2-stage (coarse-to-fine) MRCP search, values of for top-K control point selection, and values of noise variance added to point tracks before responsiveness computation. No prior work studies self-recognition in as general settings as ours, and the closest approaches [36, 33, 32] do not have public implementations. In lieu of prior work, we introduce a simple baseline that selects the points that move the largest distance over exploration (“Max-motion”). We compare later against our implementation of Edsinger et al [36]’s tracking approach.

Fig 4 shows these quantitative results. Max-motion peforms very poorly in nearly all settings, while most variants of mavric get close to perfect end-effector region identification success rate. Max-motion has a number of conceptual weakenesses, since it cannot distinguish between intentional and accidental motion, and is vulnerable to moving distractors. However, even in experiments without such distractors, we find that Max-motion often fails due to sensitivity to noisy sensing and tracking (examples in Supp).

The end-effector point identification error plot (Fig 4, bottom) provides a clearer comparison of the mavric ablations, labeled A through H in the legend. Comparing A and C (1-stage vs. 2-stage search), it is clear that coarse-to-fine MRCP search has a big impact on self-recognition success. Comparing B, C, D, E, and F (increasing noise variance), it is clear that a small amount of noise improves outcomes, but performance deteriorates when the noise is too high. Finally, comparing G, H, and C (top 1 vs top 5 vs top 15 control points), top 15 performs best in most cases. For all remaining experiments, we use variant E (2-stage, noise variance 1.6, and top 15 control point selection). Fig 5 shows examples of the detected MRCP from various runs under various settings. Fig 6 shows examples of the effect of noise on MRCP detection with the marker tool, clearly illustrating how higher noise variance biases towards selecting points closer to the tip of the marker. In our experiments, mavric performed qualitatively well across all these settings, consistently identifying point close to the tooltips. The end-effector error is also a function of the geometry of the end-effector. For example, in the no-tool case, the parallel jaw gripper has two fingers with a gap in the middle, which is where we annotate the end effector point (see Fig 2). However, only points on the physical robot can be control points, so this is an additional source of error.

We also quantitatively evaluate self-recognition in two additional settings: an amputated version of the robot arm, with the last two links removed, and a shaky handheld camera. Fig 7 shows the errors. Once again, 2-stage mavric works best. Fig 3 shows various steps during self-recognition with the handheld camera. Fig 5 includes an example in the amputated arm setting.

Moving distractors.   Next, we evaluate self-recognition with moving distractors by evaluating it on videos with two robots, where one of the robots is controlled by our method, while the other arm, a decoy moves independently, thereby creating a moving distractor. We create such videos by spatially concatenating two separate exploration videos, so that both arms are moving using the same random motion scheme. Fig 8 shows an example result. mavric correctly selects the end-effector of the correct arm, based on which arm’s control commands it receives as input. See Supp for example videos. Max-motion does not have any control inputs, so it produces the same prediction in both cases.

Tracking.   We now compare against an alternative tracking scheme. Edsinger et al [36] track moving objects for self-recognition by finding image patches that match the expected appearance of the robot and clustering them based on appearance. We implement their tracking scheme for self-recognition, so that the output is an image patch tracked through the video, representing the end-effector. On the same “no-tool” videos where mavric correctly identifies the end-effector 10 out of 10 times, this method produces an output image patch that is centered on the end-effector only 2 out of 10 times. Further, since this clustering scheme relies on appearance similarities, it completely breaks down in the moving distractors setting above, where multiple identical-looking robots are present — the same appearance cluster teleports across the different robots, making responsiveness computation extremely noisy.

Self-recognition phase duration.   While the above results are based on a 100-time step self-recognition phase (approx. 20 s), how much faster could this phase be? We evaluate end-effector identification with even fewer exploration steps in Fig 7, which shows that performance deteriorates gracefully under shorter exploration sequences.

Evaluating control points.   While the above results evaluate end-effector identification alone, mavric finds control points all along the robot body. We now annotate the full robot body to evaluate whether these discovered control points are indeed on the robot body. Treating points on the robot body as ground truth positives, and those outside as negatives, Fig 7 (d) shows the precision-recall plot as the threshold on the responsiveness scores are varied (“mavric

w/o outlier removal”): while the precision is very high at low recall, it drops off quickly. This is intuitive: the lower the true responsiveness, the more noisy the measurements are. We expect that less responsive control points would benefit from a longer self-recognition phase. However, even with

seconds, it is possible to filter the points to improve the precision-recall performance. We perform simple outlier removal as follows: we measure the 2D position variance of each candidate track over the length of self-recognition, and set a heuristic threshold on this value, below which points are discarded. This simple outlier removal scheme proves sufficient to significantly improve precision-recall, as shown in Fig 

7 (d). These control points may then be clustered based on spatial coherence to discover various links of a rigid robot, and their associated responsiveness scores. Fig 7 (c) shows an example. We use -means clustering () on position history features.

Please see Supp for videos demonstrating self-recognition in still more varied settings, including different robot configurations, lighting conditions, and more tools.

Fig. 4: [Best seen in pdf] (Top) End-effector region identification success rate (higher is better), and (Bottom) End-effector identification error (in cm, lower is better) in various settings for the maximum-motion baseline, and various ablations of our method. Note that max-motion performs very poorly in most situations (average error 8.8 cm): the error axis is clipped at 6 cm here. Among ablations of mavric, we study three hyperparameters: number of stages of end-effector ID (default: 2), noise variance (in squared pixel units) before responsiveness computation (default: 1.6), number of top points averaged to compute the MRCP (default: 15).
no tool pencil pliers wrench amputated
Fig. 5: [Best seen in pdf] MRCP identification in various settings. In each setting, the red point is the MRCP, computed as the average of the 15 most responsive points, shown in green. Please see videos in Supp. Fig 2 presents the ground truth end effector annotations used to evaluate these settings.
variance 0.0 variance 0.4 variance 0.8 variance 1.6
Fig. 6: [Best seen in pdf] (Left to right) Original image of the arm with a marker tool, followed by MRCP identification with various values of noise variance. As noise increases, the MRCP points move closer towards the marker tip.
(a) (b) (c) (d)
Fig. 7: [Best seen in pdf] (a) Self-recognition performance with shorter exploration phases, (b) Self-recognition with an amputated arm and a handheld camera, (c) Discovery of robot arm links from self-recognition phase data: tracks assigned to different clusters are colored differently. (d) Precision-Recall plot for control points.
video 1, left arm control inputs video 1, right arm control inputs
Fig. 8: The moving distractors test: given the same video of two arms operating side by side (produced by concatenating individual frames side-by-side from two exploration videos), mavric correctly ignores the decoy arm and selects the arm that is being controlled based on which control input sequence it receives as input. (left) mavric is fed the left arm’s controls, and it selects the MRCP (red point) on the left arm’s end effector. (right) With right arm’s control inputs, mavric selects the right arm’s end effector.
Method no-tool wrench pliers pencil marker average
error (cm) ETR (%) error (cm) ETR (%) error (cm) ETR (%) error (cm) ETR (%) error (cm) ETR (%) error (cm) ETR (%)
ROS MoveIt - - - - - - - - - - -
Oracle VS
TABLE I: 3D point-reaching error and early termination rate (ETR) during visual servoing for point reaching

Iv-B Visually Guided Servoing

We now demonstrate controllability using two basic control tasks involving our automatically localized end-effectors: 3D point reaching, and trajectory following. Then, we show how mavric enables robot-to-robot imitation.

3D point reaching.   We now evaluate mavric (self-recognition + servoing) on 3D-point reaching tasks. We set 9 goal positions in the RGBD camera view at varying elevations and azimuths centered at the end effector’s initial position at a distance of about 15 cm. We compare mavric to two methods that have access to additional manually specified information: “Oracle VS,” which servos a manually annotated end effector using the same visual servoing approach (based on [12, 13]) as our method, and “ROS MoveIt” [45] which has knowledge of the full robot morphology and kinematics models, camera calibration matrices, and proprioception. We allow a maximum of 150 steps.

Tab I reports (i) median 3D distance error of the manually annotated end effector point from the goal, and (ii) early termination rate (ETR). Early termination is triggered whenever the MRCP has reached within a 5 px radius of the goal — in our experiments, this is a good proxy for servoing success. ROS MoveIt cannot control unmodeled tools, so we report its performance only in the no-tool setting. Its error is higher than Oracle VS; this may be due to WidowX robot model inaccuracy, servo encoder position errors, and camera calibration error. Oracle VS does well in most settings, and mavric takes slightly longer (lower ETR), but its error is within 3 cm of Oracle VS — in our experiments, this is largely explained by the end-effector point identification error (Fig 4) from the self-recognition stage. We use standard techniques for visual servoing (Sec III-B), and Oracle VS and mavric both inherit common problems of these methods, such as local minima and singularites during Jacobian estimation. This sometimes results in failures with unbounded errors — we report the median error to discard these.

Trajectory following.   Aside from single point reaching, we may also servo to follow a trajectory. In Supp, we show videos demonstrating mavric writing the letter “C” onto the projected view of the camera through 2D trajectory following, with different robots. Fig 9 (left) shows one such video frame.

Fig. 9: [Best seen in pdf] Robot-to-robot imitation with mavric. In both video frames, cyan represents the next target, white points are future targets, and black points are previously reached targets. (left) A video frame of a source robot draws the letter C — in this case, we used mavric to perform this task with visual servoing for trajectory-following. (right) A video frame of the target robot imitating the motions of the source robot. Full video in Supp.

Robot-to-robot imitation.   Consider a source robot with unknown morphology and kinematics that is performing a writing task — perhaps it has been trained for days using an RL approach [27]. How might a target robot, also with unknown morphology and kinematics, perform the same task? This requires visually mapping the embodiment across these two robots, with very little data — we know of no prior approach that might accomplish this. Using mavric, we map the automatically discovered MRCP of the source robot to that of the target, and perform imitation by servoing the target MRCP along the observed trajectory of the source MRCP. Fig 9 shows frames from the result videos (full videos in Supp). Here, the source and target robots are both WidowX arms, with different robot and camera poses, and different illumination.

V Discussion

We have presented mavric, an approach that performs fast robot body recognition and uses this to accomplish morphology-agnostic visuomotor control. Reinforcement learning-based approaches operating in the same setting typically require days of robot data [24, 26, 27, 29, 27, 25], compared to 20s for mavric — we know of no other methods that can perform visuomotor control without knowledge of kinematics or morphology on similar timescales to mavric. We have demonstrated mavric across diverse settings, with various tools, shaky camera views etc. However, our validation of mavric has been limited to coarse control for basic tasks due to various limitations: end-effector self-recognition error (average  2 cm), point tracking failures under occlusions, noisy depth sensing causing poor Jacobian estimates, and well-known issues with visual servoing [46, 6, 47, 48]. We hope to address these in future work by moving to multi-camera setups, using longer exploration phases, using better tracking algorithms, and better depth sensors. mavric may also be extended to handle settings with multiple end-effectors such as a multi-fingered hand.


  • [1] G. G. Gallup, “Chimpanzees: self-recognition,” Science, vol. 167, no. 3914, pp. 86–87, 1970.
  • [2] J. B. Asendorpf, V. Warkentin, and P.-M. Baudonniere, “Self-awareness and other-awareness.” Developmental Psychology, 1996.
  • [3] C. E. Shannon, “A mathematical theory of communication,” Bell system technical journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [4] B. D. Lucas, T. Kanade, et al., “An iterative image registration technique with an application to stereo vision,” 1981.
  • [5] P. I. Corke, “Visual control of robot manipulators–a review,” in Visual Servoing: Real-Time Control of Robot Manipulators Based on Visual Sensory Feedback, 1993.
  • [6] S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control,” IEEE transactions on robotics and automation, vol. 12, no. 5, pp. 651–670, 1996.
  • [7] F. Chaumette and S. Hutchinson, “Visual servo control. I. Basic approaches,” IEEE Robotics & Automation Magazine, 2006.
  • [8] ——, “Visual servo control, part ii: Advanced approaches,” IEEE Robotics and Automation Magazine, vol. 14, no. 1, pp. 109–118, 2007.
  • [9] É. Marchand, F. Spindler, and F. Chaumette, “ViSP for visual servoing,” IEEE Robotics & Automation Magazine, 2005.
  • [10] K. Hosoda and M. Asada, “Versatile visual servoing without knowledge of true jacobian,” in IROS.   IEEE, 1994.
  • [11] B. H. Yoshimi and P. K. Allen, “Active, uncalibrated visual servoing,” in ICRA.   IEEE, 1994.
  • [12] M. Jagersand, O. Fuentes, and R. Nelson, “Experimental evaluation of uncalibrated visual servoing for precision manipulation,” in ICRA, 1997.
  • [13] M. Bonkovi, K. Jezernik, et al., “Population-based uncalibrated visual servoing,” IEEE/ASME Transactions on Mechatronics, 2008.
  • [14] A. Shademan, A.-M. Farahmand, and M. Jägersand, “Robust jacobian estimation for uncalibrated visual servoing,” in ICRA, 2010.
  • [15] A. Dame and E. Marchand, “Mutual information-based visual servoing,” IEEE Transactions on Robotics, vol. 27, no. 5, pp. 958–969, 2011.
  • [16] T. Lampe and M. Riedmiller, “Acquiring visual servoing reaching and grasping skills using neural reinforcement learning,” in IJCNN, 2013.
  • [17] K. Mohta, V. Kumar, and K. Daniilidis, “Vision-based control of a quadrotor for perching on lines,” in ICRA, 2014.
  • [18]

    Q. Bateux, E. Marchand, J. Leitner, F. Chaumette, and P. Corke, “Training deep neural networks for visual servoing,” in

    ICRA, 2018.
  • [19] U. Viereck, A. t. Pas, K. Saenko, and R. Platt, “Learning a visuomotor controller for real world robotic grasping using simulated depth images,” arXiv preprint arXiv:1706.04652, 2017.
  • [20] R. L. Andersson, “A robot ping-pong player,” Experiment in Real-Time Intelligent Control., 1988.
  • [21] W. Jang, K. Kim, M. Chung, and Z. Bien, “Concepts of augmented image space and transformed feature space for efficient visual servoing of an “eye-in-hand robot”,” Robotica, 1991.
  • [22] C. Cai, E. Dean-León, N. Somani, and A. Knoll, “6d image-based visual servoing for robot manipulators with uncalibrated stereo cameras,” in IROS, 2014.
  • [23] A. McFadyen, M. Jabeur, and P. Corke, “Image-based visual servoing with unknown point feature correspondence.” RAL, 2017.
  • [24]

    C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, “Deep spatial autoencoders for visuomotor learning,” in

    ICRA.   IEEE.
  • [25] F. Ebert, C. Finn, A. X. Lee, and S. Levine, “Self-supervised visual planning with temporal skip connections,” arXiv preprint arXiv:1710.05268, 2017.
  • [26] F. Sadeghi, A. Toshev, E. Jang, and S. Levine, “Sim2real view invariant visual servoing by recurrent control,” arXiv preprint arXiv:1712.07642, 2017.
  • [27]

    S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,”

    IJRR, 2018.
  • [28] A. X. Lee, S. Levine, and P. Abbeel, “Learning visual servoing with deep features and fitted q-iteration,” arXiv:1703.11000, 2017.
  • [29] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, “Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks,” in CVPR, 2019.
  • [30] F. Shkurti, W.-D. Chang, P. Henderson, M. J. Islam, J. C. G. Higuera, J. Li, T. Manderson, A. Xu, G. Dudek, and J. Sattar, “Underwater multi-robot convoying using visual tracking by detection,” in IROS, 2017.
  • [31] M. J. Islam, M. Fulton, and J. Sattar, “Towards a generic diver-following algorithm: Balancing robustness and efficiency in deep visual detection,” arXiv preprint arXiv:1809.06849, 2018.
  • [32] P. Michel, K. Gold, and B. Scassellati, “Motion-based robotic self-recognition,” in IROS, 2004.
  • [33] L. Natale, F. Orabona, G. Metta, and G. Sandini, “Sensorimotor coordination in a “baby” robot: learning about objects through grasping,” Progress in brain research, vol. 164, pp. 403–424, 2007.
  • [34] R. Jonschkowski, “Found by NEMO: Unsupervised Object Detection from Negative Examples and Motion,” OpenReview ICLR submission, 2018.
  • [35] A. Byravan, F. Lceb, F. Meier, and D. Fox, “SE3-Pose-Nets: Structured deep dynamics models for visuomotor control,” in ICRA, 2018.
  • [36] A. Edsinger and C. C. Kemp, “What can i control? a framework for robot self-discovery,” in 6th International Conference on Epigenetic Robotics, 2006.
  • [37] T. M. Cover and J. A. Thomas, Elements of information theory.   John Wiley & Sons, 2012.
  • [38] J. Choi, Y. Guo, M. Moczulski, J. Oh, N. Wu, M. Norouzi, and H. Lee, “Contingency-aware exploration in reinforcement learning,” in ICLR, 2019.
  • [39] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Physical review E, vol. 69, no. 6, p. 066138, 2004.
  • [40] G. Ver Steeg, “Non-parametric entropy estimation toolbox (npeet),” 2000.
  • [41] J. Shi and C. Tomasi, “Good features to track,” Cornell University, Tech. Rep., 1993.
  • [42] J. Yan and M. Pollefeys, “A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate,” in ECCV, 2006.
  • [43] B. Yang, J. Zhang, D. Jayaraman, and S. Levine, “Replab: A reproducible low-cost arm benchmark platform for robotic learning,” ICRA, 2019.
  • [44] B. Yang, J. Zhang, V. Pong, S. Levine, and D. Jayaraman, “REPLAB: A Reproducible Low-Cost Arm Benchmark Platform for Robotic Learning,” arXiv preprint arXiv:1905.07447, 2019.
  • [45] “ROS MoveIt package,”, accessed: 2019-07-06.
  • [46] F. Chaumette, “Potential problems of stability and convergence in image-based and position-based visual servoing,” in The confluence of vision and control.   Springer, 1998, pp. 66–78.
  • [47] C. Cai, “6d visual servoing for industrial manipulators applied to human-robot interaction scenarios,” 2017.
  • [48] S. Cho and D. H. Shim, “Sampling-based visual path planning framework for a multirotor uav,” International Journal of Aeronautical and Space Sciences, pp. 1–29, 2019.