The difficulty of robot programming is one of the central hurdles to the widespread application of robots. This task requires domain-specific expertise, making it inaccessible to untrained personnel and resulting in high system costs that lead to low adoption rates. Few-shot imitation from videos is an appealing alternative to overcome this problem, as videos typically capture all task-relevant information. However, the high dimensionality of videos makes it challenging to convert a demonstration video into actionable commands, while at the same time being robust to variations in the environment and the task.
The visual imitation problem can be divided into three separate components: (1) determining the salient objects relevant to the task, (2) establishing correspondences between the demonstration and the live application, and (3) controlling the robot in order to reproduce the motion observed in the demonstration.
Each component is a difficult task.
Existing learning-based approaches need large amounts of training data. Other methods that rely on explicit pose estimation require precise 3D models of the objects. Even when given a 3D model, robust 6D pose estimation under appearance variation is an ongoing area of research.
In this paper, we propose a one-shot imitation learning approach which can robustly replicate a task from a single demonstration video, despite substantial variation of the objects’ initial positions, orientations, and appearances. (Ours is not a learning method in the sense of machine learning: we do not use data to optimize weights.) Our approach imitates demonstrations through the use of learned optical flow; point correspondences from optical flow together with a given foreground mask align live observations with demonstration frames. After successfully aligning the live observation with the first demonstration frame, we successively do this for subsequent frames, thereby tracking an entire demonstration trajectory. Thus, this formulation naturally extends to learning multi-step tasks. There is neither a need for CAD models of the objects involved, nor expensive pretraining in elaborate simulation environments.
While conceptually straightforward, our approach shows a large degree of robustness towards various factors of variation. We successfully learn a variety of tasks, including picking and insertion of objects. The method is both data-efficient and achieves high success rates.
The main contribution of our work is a practical, data-efficient approach to imitation which exploits and transfers the trained robustness of modern optical flow methods to robot control.
II Related Work
Imitation learning (also known as Learning from Demonstration, Robot Programming by Demonstration, and Apprenticeship Learning) tries to control robots in such a way as to replicate what has been demonstrated. This approach reduces or eliminates the need for explicit programming. It is often formulated as a learning problem, and numerous strategies to utilize demonstrations exist. These include kinesthetic teaching, the decomposition of movement into motion primitives, additional exploration guidance for learning algorithms, and the direct learning of input-to-action mappings.
Some approaches start with low-dimensional inputs, but most start from high-dimensional sensor input and aim to reduce its dimensionality. This is often done by combining embeddings and imitation. Time Contrastive Networks use multiple perspectives to learn a perspective-invariant embedding, and embeddings of images have been used to guide Atari play. Another approach to the problem is one-shot imitation learning, in which policy networks are fine-tuned on conditioning demonstrations. In contrast, our approach need not learn an encoding to work directly with the high-dimensional sensor input.
Visual Servoing (VS) is the concept of creating a feedback loop in which sensor data is used to compute control commands that change sensor measurements [23, 9]. This feedback actively updates the control based on current observations, allowing for more adaptive control that is robust to variation in the environment. Many tasks in robotics, such as navigation, manipulation, and learning from demonstration, can be addressed using visual servoing [19, 23, 3]. Visual servoing allows the specification of goal configurations as target feature states to which a control law can servo.
Our approach is a form of visual servoing. Visual servoing can be realized in a number of different ways. It generally consists of three components: (1) image feature extraction, (2) the control law to decide where to move with respect to these features, and (3) joint control to execute this decision. While many works focus on formulating robust control laws and image feature extraction in the context of navigation, we consider the imitation of manipulation tasks. Unlike most other visual servoing techniques ours benefits from the use of robust correspondences to generalize over scene geometry, lighting conditions, and object appearance.
Visual servoing is used in diverse applied robotics fields: aircraft manufacturing, robotic surgery, marine ROVs, and aerial manipulation. Recent methodological extensions range from the combination with pose estimation to servoing to bounding boxes detected by an R-CNN.
A combination of visual servoing and optical flow is often used to track poses when servoing with respect to explicit pose estimates, which can be expensive to compute ab initio [33, 32, 42]. Optical flow has also been used directly for visual servoing, replacing template matching in an industrial positioning system. In [27, 28], a character navigation policy is learned based on the intermediate representation of optical flow.
A number of recent works combine visual servoing and learning. Some policy learning architectures bear a similarity to visual servoing, e.g. through the use of soft-argmax activations or the use of optical flow as an auxiliary task. A number of applied works have also been published that use a combination of visual servoing and learning-based approaches: one trains a model acting as a control law based on limited examples for electrical engine construction, another uses images as templates, and a third trains a network to predict relative poses between images. Target following has been implemented by learning features and dynamics using reinforcement learning, and learning-based visual servoing has been presented for peg insertion.
The performance of optical flow computation has improved very rapidly in recent years [12, 20]. While initial interest in the optical flow problem was grounded in the context of active motion, it is also treated as an independent problem. The good performance of these methods has renewed interest in applications of optical flow. We benefit from the ability of recent learning-based optical flow to generalize, as it has been trained to be robust to common variations in appearance.
We use optical flow to solve the dense correspondence problem. Optical flow is commonly defined as the per-pixel apparent motion between two consecutive frames, which implies a particular data distribution. Strictly speaking, we apply an optical flow algorithm outside of its canonical scope, since we match a live frame against a demonstration frame rather than two consecutive frames. This is not a trivial change since it adversely affects the data distribution; the incidence of large displacements and out-of-frame occlusions increases. Learned dense correspondences have also been generated in [7, 10]. Similar to our method, [13, 29] learn to transfer key point affordances used for grasping.
The main idea of FlowControl is to align a live video frame with a demonstration consisting of a sequence of target frames; this is illustrated in Figure 2. A learned optical flow algorithm is used to compute correspondences between a live frame and a target frame. Together with the recorded depth this yields a 3D flow of the underlying point cloud, which describes how points in the scene must move to reach the target state. A foreground mask restricts the points to a subset relevant for the task. These foreground points are used for computing the 3D transformation that brings the current frame closer to the target frame, and the robot is moved according to that transformation.
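As a rough illustration of how flow, depth, and the foreground mask combine into 3D correspondences, the following is a minimal sketch and not the authors' implementation. It assumes an ideal pinhole camera with intrinsics `fx, fy, cx, cy`, depth images in metres that are registered to the color images, and a flow field that maps target pixels to live pixels; the function names `deproject` and `flow_to_3d_pairs` are hypothetical.

```python
import numpy as np


def deproject(depth, fx, fy, cx, cy):
    """Back-project a depth image (H, W) into camera-frame points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def flow_to_3d_pairs(flow, mask, depth_target, depth_live, intrinsics):
    """Return matched 3D points (target, live) for the masked target pixels.

    flow[v, u] = (du, dv) is assumed to map a target pixel (u, v) to its
    position in the live image. Pixels that leave the frame or have no
    valid depth reading (depth <= 0) are dropped.
    """
    fx, fy, cx, cy = intrinsics
    h, w = depth_target.shape
    pts_target = deproject(depth_target, fx, fy, cx, cy)
    pts_live = deproject(depth_live, fx, fy, cx, cy)

    # Foreground pixels in the target (demonstration) image.
    v, u = np.nonzero(mask)
    # Follow the flow to the nearest live-image pixel.
    u2 = np.rint(u + flow[v, u, 0]).astype(int)
    v2 = np.rint(v + flow[v, u, 1]).astype(int)

    inside = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    u, v, u2, v2 = u[inside], v[inside], u2[inside], v2[inside]
    valid = (depth_target[v, u] > 0) & (depth_live[v2, u2] > 0)
    return pts_target[v[valid], u[valid]], pts_live[v2[valid], u2[valid]]
```

The difference between the two returned point sets is the 3D flow of the foreground points; nearest-pixel rounding is used here for brevity, where a real system might interpolate depth instead.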
After successfully aligning with the first demonstration frame, the procedure can be successively repeated for subsequent frames. This tracks an entire demonstration trajectory, which can include the manipulation of multiple different objects.
FlowControl benefits from three design choices: (1) the end-of-arm camera setup makes it easy to convert image transformations into the end-effector frame; (2) providing a first-person demonstration avoids the problem of a changing perspective; (3) the learned optical flow is robust to many appearance variations.
The foreground mask defines which object in the scene to align in order to progress to the next target state. While such a foreground mask could be inferred automatically from the demonstration video, we simplify the problem by manually providing the foreground mask, trading algorithmic complexity for an intuitive and practical extra step. The foreground mask need not be precise; an under-segmentation of the object suffices.
Optical Flow: For computing correspondences, we use the optical flow from FlowNet2. FlowNet2 is trained on FlyingThings3D, a synthetic dataset procedurally assembled from a large set of different objects. Data augmentation employed in FlowNet training ensures that the resulting optical flow is robust to changes in lighting and partial occlusions. It also performs well on textureless objects, having learned from a large set of different shapes. While we benefit from the robustness and generalization this offers, correspondence is a modular component in our framework, and we could employ other methods, including ones that do not require any learning. Since the foreground mask is provided for the target image from the demonstration, we compute the optical flow from the target image to the live image.
Frame Alignment: We use pixel correspondences from the flow algorithm together with our aligned RGB-D data to match 3D points between the recorded demonstration observation and the live observations. This gives us correspondences between individual points of the point clouds. Subsequently, we use SVD to compute a least-squares rigid transformation between these point clouds. This is an estimate of the relative transformation between the demonstration scene and the scene as observed live. Servoing in this direction will align the camera image with the demonstration image. We use a position-based controller to convert this transformation into a control signal. Pseudocode for this algorithm is given in Algorithm 1.
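The SVD step can be sketched as follows. This is the standard least-squares rigid alignment of two matched point sets (often called the Kabsch or Arun method), written as a minimal example; it is not taken from Algorithm 1, and the function name `rigid_transform_svd` is an assumption.

```python
import numpy as np


def rigid_transform_svd(src, dst):
    """Least-squares rigid transform T (4x4) with dst ~= R @ src + t.

    src, dst: (N, 3) arrays of matched 3D points, N >= 3 and not collinear.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centred point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection: force det(R) = +1.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T
```

Because every matched point contributes equally to the fit, a few wrong flow vectors can bias the estimate; whether and how outliers are filtered beforehand is left open in this sketch.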
Sequence Tracking: In order to imitate a complete task, such as grasping an object, we need to successively align with respect to a sequence of demonstration images. To this end, when the live image is sufficiently close to the target image, as defined by a threshold, we step over to the next target image from the demonstration. As converging to each correct alignment takes some time, the tracking process can be accelerated by sampling only every n-th demonstration frame. Moving the attention from one object to another requires a new target frame with the foreground mask on the new object.
We need not generalize over some control quantities such as gripper state; for these we just copy the demonstration actions from the corresponding trajectory step. Interestingly, as long as our method correctly aligns all other dimensions, an orthogonal dimension can be copied from the demonstration. This allows aligning the in-plane components of the relative orientation and copying the height of the gripper.
To account for the time it takes the gripper to close, we delay progressing to the next frame accordingly. Pseudocode for the sequence tracking is given in Algorithm 2.
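The tracking logic described above can be sketched as a simple loop. This is an illustrative outline and not a transcription of Algorithm 2: the `DemoStep` record, the callables `alignment_error`, `servo_step`, and `set_gripper`, and all parameter names are assumptions, and the gripper delay is counted in control cycles rather than seconds.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class DemoStep:
    frame_id: int        # index of the demonstration frame
    gripper_open: bool   # gripper state recorded in the demonstration


def track_sequence(
    demo: Sequence[DemoStep],
    alignment_error: Callable[[DemoStep], float],
    servo_step: Callable[[DemoStep], None],
    set_gripper: Callable[[bool], None],
    threshold: float = 0.01,
    stride: int = 1,
    gripper_wait_steps: int = 0,
    max_steps: int = 1000,
) -> List[int]:
    """Servo to every `stride`-th demo frame in turn; return the frame ids reached."""
    reached: List[int] = []
    targets = list(demo[::stride])
    if not targets:
        return reached
    prev_gripper = targets[0].gripper_open
    steps = 0
    for target in targets:
        # Servo until the live view is close enough to this target frame.
        while alignment_error(target) > threshold:
            if steps >= max_steps:
                return reached  # give up instead of looping forever
            servo_step(target)
            steps += 1
        reached.append(target.frame_id)
        # The gripper state is copied from the demonstration, not inferred.
        if target.gripper_open != prev_gripper:
            set_gripper(target.gripper_open)
            steps += gripper_wait_steps  # let the fingers finish moving
            prev_gripper = target.gripper_open
    return reached
```

Comparing the gripper state against the previously kept target, rather than the previous raw frame, means a state change is still picked up when `stride` skips the exact frame where it occurred.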
Task Modularity: Our method can combine individual subtasks into a multi-step task. This is done by switching the object segmented as foreground object during the demonstration. An example is shown in Figure 3, where the first subtask is to grasp the wheel (the foreground mask is on the wheel) before the focus switches to the screw to connect the wheel with the screw.
We tested the end-to-end performance of our system on four manipulation tasks and in a localization experiment that is inspired by a real-world industrial use case. Since camera pose estimation plays a key role in our setup, we additionally tested this component in isolation and quantified its performance relative to alternative pose estimation methods. Finally, we tested the robustness of our system to variations in scene geometry and appearance.
Setup: The experimental setup consists of a KUKA iiwa arm with a WSG-50 two-finger parallel gripper and an Intel SR300 projected light RGB-D camera mounted on the flange for an eye-in-hand view. The camera is attached to a 3D-printed mount and faces towards the point between the fingertips. The setup, which is shown in Fig. 3, allows us to record depth images, which we use in our geometric fitting procedure. The movement of the end effector is restricted such that the gripper always faces downwards; it is parameterized as a 5 DoF continuous action $a = (\Delta x, \Delta y, \Delta z, \Delta\theta, a_{\text{gripper}})$ in the end effector frame. $\Delta x, \Delta y, \Delta z$ specify a Cartesian offset for the desired end effector position, $\Delta\theta$ defines the yaw rotation of the end effector, and $a_{\text{gripper}}$ is the gripper action that is mapped to the binary command to open or close the fingers.
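To make the 5 DoF action concrete, the sketch below maps an estimated rigid transform to such an action with a simple proportional rule. It is an illustration under stated assumptions, not the authors' controller: the transform is assumed to be expressed in the end effector frame already, the sign convention depends on the set-up, and `gain`, `max_offset`, and `max_yaw` are hypothetical parameters.

```python
import numpy as np


def transform_to_action(T, gripper_cmd, gain=0.5, max_offset=0.01, max_yaw=0.1):
    """Map a 4x4 rigid transform to a clipped 5-DoF action.

    Returns [dx, dy, dz, dyaw, gripper]: a Cartesian offset (metres),
    a yaw increment (radians), and the gripper command passed through.
    """
    # Proportional step towards the target pose.
    offset = gain * T[:3, 3]
    # Yaw is the rotation about the z axis; roll and pitch are ignored
    # because the gripper is constrained to face downwards.
    yaw = gain * np.arctan2(T[1, 0], T[0, 0])
    # Clipping bounds the per-step motion, so one bad estimate moves
    # the robot only a little.
    offset = np.clip(offset, -max_offset, max_offset)
    yaw = float(np.clip(yaw, -max_yaw, max_yaw))
    return np.array([offset[0], offset[1], offset[2], yaw, gripper_cmd])
```

Small gains and tight clipping trade speed for the chance to correct an occasional wrong flow estimate on the next control cycle.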
Our state observation consists of 640 × 480 pixel RGB-D camera images and a proprioceptive state vector consisting of the gripper height above the table, the angle that specifies the rotation of the gripper, and the width of the gripper fingers. The optical flow is computed at the same resolution.
IV-A Manipulation Experiments
Our approach is demonstrated by experiments on four example tasks: grasping a wooden block, picking up a block and stowing it in a box, inserting a block into a shape sorter, and grasping and inserting a wheel onto a screw. Solving these tasks requires precise movements; for example, an error of 4 mm is enough to make the insertion tasks fail. During the experiments we tested the robustness of our method by varying the object positions. To test the reactiveness of the approach we also moved the objects during task execution and varied the lighting conditions. Examples of this are shown in the supplemental videos.
[Figure 3: (a) Grasping Task, (b) Pick-and-Stow Task, (c) Shape-Sorter Task, (d) Wheel Task]
Grasping Blocks: A small, 25 mm wide, wooden block must be grasped and lifted from a table surface. This task is shown in Fig. 3 (a).
Pick-and-Stow: A small wooden block needs to be grasped and lifted from a table surface to be dropped into a box.
Shape-Sorter: A block must be inserted into a shape-sorting cube. This requires precise positioning due to the tight fit of the opening to the block. We start with the block already grasped. This task is shown in Fig. 3 (c).
Wheel Insertion: This task uses parts from a toy construction set. A wheel must be grasped and inserted onto a screw that is held in a vertical position. This task is shown in Fig. 3 (d).
Results: The success rates for our manipulation experiments are shown in Table I; examples are also shown in the supplemental video. During testing, we used the same position distributions for each task across methods, as the success rate depends on the variation in the environment. Our method achieves high success rates. In addition, we outperform ACGD, a recent approach that uses demonstrations to generate a curriculum for reinforcement learning. In contrast to ACGD, FlowControl does not require a simulation environment for task learning, which is difficult and time-consuming to set up.
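Table I: Success rates for the manipulation experiments.

| Task | Method | Successes / Trials |
|---|---|---|
| Pick-Stow | ACGD | 17 / 20 |
| Pick-Stow | Ours | 19 / 20 |
| Shape-Sorter | Ours | 8 / 10 |
| Wheel Insertion | Ours | 9 / 10 |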
Despite the generally good performance, we also identified failure cases. Occlusions were a predictable source of problems. These occurred, for example, in the pick-and-stow task when the box occluded the cube.
Starting too far away from a target frame of the demonstration also leads to failure. FlowNet2 works within a given range of displacements. If the target is too far away, the optical flow will not find the correspondence anymore. Especially large rotations are a problem for optical flow. Section IV-C quantifies the robustness of FlowControl in greater detail. As the flow algorithm works for limited displacements in image space, starting demonstrations with the robot further away from the objects allows for coping with bigger displacements, as these appear smaller.
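The last point can be made explicit with the pinhole camera model. The following is a sketch under the assumption of an ideal camera with focal length $f$ (in pixels), a scene point at lateral position $X$ and depth $Z$, and a purely lateral offset $\Delta X$ at constant depth; these symbols are introduced here for illustration only.

```latex
u = f\,\frac{X}{Z}
\quad\Longrightarrow\quad
\Delta u = f\,\frac{\Delta X}{Z}
```

For a fixed physical offset $\Delta X$, doubling the camera distance $Z$ halves the image-space displacement $\Delta u$ that the flow algorithm has to bridge.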
Optical flow is also more likely to fail when confusing background flows are present. This is not prevented by the foreground mask as the flow algorithm receives as input the whole image and information from this may confound the flow computation. When running the controller with high velocities, a single wrong optical flow estimate can move the robot out of the convergent zone. Running the controller at smaller velocities allows such errors to be corrected, reducing the chance of failure on the task.
Other practical examples of failure were due to low illumination combined with fixed exposure times and grasping attempts snapping the object out of the visual field.
IV-B Navigation Experiment
To demonstrate that the proposed method is directly applicable in an industrial setting, we evaluated a practical localization task. This task is based on an automotive assembly scenario in which a nut-runner must fasten nuts to fix an Engine Control Unit (ECU) into position. The nut-runner must be positioned precisely in order to engage the bolts. Our method can visually align the end-effector, resulting in greater robustness to variation in workpiece placement. The task is shown in Fig. 4.
We randomized the initial positions of the end-effector in the camera plane and measured how precisely it can return to the given reference position according to the robot’s state-estimation. This resulted in a precision of mm, also measured using the robot’s state estimation system. This test was repeated five times, with starting positions of up to 8 cm from the target position. In these experiments, our reference image was taken with different lighting.
IV-C Fitting Experiment
We evaluated the pose estimation component of our system using a proxy task with pre-recorded images. Instead of moving an object with respect to the camera, we moved the camera with respect to a static object and recorded multiple views. Visual markers were added to the scene to determine the relative camera pose between views; these were calculated using the FreiCalib tool. The pose estimation algorithms must estimate this relative pose. We evaluate several baselines. The simplest is a zero pose change prediction. The SIFT baseline is evaluated in a manner similar to prior work. We also compare to DeepTAM, a learned algorithm that estimates depth and relative pose given two images.
In this setup, the static background could be used to infer the relative pose. To mitigate this, we again masked our computed features with the demonstration segmentation, except for DeepTAM, where this was not possible, because it is a monolithic system that produces a pose. As both SIFT matching and optical flow methods have outliers, we substitute zero pose change predictions for rotations larger than, and translations larger than .
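The evaluation logic can be sketched as follows. The error definitions used here (rotation angle of the residual rotation, Euclidean distance between translations) are common choices and are an assumption about how the relative orientation and translation errors are measured; the thresholds are left as required parameters without defaults, and all function names are hypothetical.

```python
import numpy as np


def rotation_angle(R):
    """Rotation angle (radians) of a 3x3 rotation matrix."""
    cos_theta = (np.trace(R) - 1.0) / 2.0
    # Clip to absorb rounding errors just outside [-1, 1].
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def pose_errors(T_est, T_gt):
    """Orientation error (rad) and translation error between two 4x4 poses."""
    R_residual = T_est[:3, :3].T @ T_gt[:3, :3]
    t_error = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    return rotation_angle(R_residual), float(t_error)


def reject_outlier(T_est, max_angle, max_translation):
    """Replace implausibly large estimates by the zero-pose-change prediction."""
    too_far = np.linalg.norm(T_est[:3, 3]) > max_translation
    too_rotated = rotation_angle(T_est[:3, :3]) > max_angle
    if too_far or too_rotated:
        return np.eye(4)  # identity = zero pose change
    return T_est
```

With this convention the zero pose change baseline is simply `pose_errors(np.eye(4), T_gt)`, which gives every other method a fixed reference to beat.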
[Figure 5: panels showing Relative Orientation Error and Relative Translation Error]
This evaluation was done on a selection of 250 views with a relative rotation of less than . The results are shown in Figure 5. SIFT based relative pose detection performed badly, completely failing for most views. This resulted in worse performance than the zero pose change baseline. DeepTAM performs slightly better than the zero rotation. However, as this approach is designed for smaller pose differences it often underestimates changes. Our flow-based approach performs best for most samples, although it still has outliers in cases where the flow computation failed. An example of this is shown in Figure 6. This usually occurs for a combination of large displacements and rotations.
IV-D Generalization Experiments
In contrast to classical fixed visual servoing approaches, FlowNet has been trained to be invariant to miscellaneous effects, such as lighting changes and partial occlusion. This helps it find correspondences even when objects in the demonstration do not match exactly. For simplicity, we limited the generalization experiments to the grasping task. We recorded a demonstration with one object and then tested if this demonstration generalizes to objects of different shapes and sizes. Examples of this are shown in Figure 7 and in the supplemental video. FlowControl is able to cope with variation in both color and shape.
V Discussion and Conclusion
We presented a practical, data-efficient method for visual servoing from optical flow. Our method works with single demonstrations and is able to handle significant variations in the geometric arrangement as well as visual appearance of the task. We demonstrated the effectiveness of our method on a series of robotic manipulation experiments. In addition, we provided a quantitative assessment of the pose estimation part of our algorithm and combined this with a discussion of possible failure cases of our method. Finally, we also provided experiments indicating that our method is able to generalize over substantial variation in geometry and appearance.
While FlowControl has many advantageous properties, it has natural limitations: it cannot yet do re-grasping and currently relies on manual segmentation to define the task. Current failure cases include optical flow methods failing for large displacements. One could train optical flow specifically for the type of data distribution at hand: one with larger rotations and displacements, or for the specific objects that may appear in the task.
Despite this, FlowControl satisfies an important aim; robotics algorithms should not merely solve one specific task, but instead obviate the need for task-specific engineering. With little manual effort, FlowControl solves a diverse set of tasks.
References
- (2016-11) JUMP: virtual reality video. ACM Transactions on Graphics 35, pp. 1–13.
- (2018) Playing hard exploration games by watching youtube. CoRR abs/1805.11592.
- (2012-03) A survey of vision based architectures for robot learning by imitation. International Journal of Humanoid Robotics 9.
- Training Deep Neural Networks for Visual Servoing. In ICRA 2018 - IEEE International Conference on Robotics and Automation, pp. 3307–3314.
- (2008) Robot programming by demonstration. In Springer Handbook of Robotics.
- (2019) Trends and challenges in robot manipulation. Science 364 (6446).
- Dense semantic correspondence where every pixel is a classifier. CoRR abs/1505.04143.
- (2017) A machine learning-based visual servoing approach for fast robot control in industrial setting. International Journal of Advanced Robotic Systems 14.
- (2006) Visual servo control. I. Basic approaches. IEEE Robotics & Automation Magazine 13, pp. 82–90.
- (2016) Universal correspondence network. In Advances in Neural Information Processing Systems 30.
- (2016) Deep visual foresight for planning robot motion. CoRR abs/1610.00696.
- (2015) FlowNet: learning optical flow with convolutional networks. CoRR abs/1504.06852.
- (2018) Dense object nets: learning dense visual object descriptors by and for robotic manipulation. In 2nd Annual Conference on Robot Learning, CoRL 2018 Proceedings, pp. 373–385.
- (1979) The ecological approach to visual perception. Lawrence Erlbaum Associates Inc.
- (2019) What is optical flow for?: workshop results and summary. In Computer Vision – ECCV 2018 Workshops, L. Leal-Taixé and S. Roth (Eds.), Cham, pp. 731–739.
- (2019) Adaptive curriculum generation from demonstrations for sim-to-real visuomotor control.
- (1954) Doubly stochastic matrices and the diagonal of a rotation matrix. American Journal of Mathematics 76 (3), pp. 620–630.
- (2017) A vision-guided multi-robot cooperation framework for learning-by-demonstration and task reproduction. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4797–4804.
- (1996-10) A tutorial on visual servo control. IEEE Transactions on Robotics and Automation 12 (5), pp. 651–670.
- (2016) FlowNet 2.0: evolution of optical flow estimation with deep networks. CoRR abs/1612.01925.
- (2018) Vision-guided robotic leaf picking. Note: EasyChair Preprint no. 250.
- (2019-12) Humanoid robots in aircraft manufacturing. IEEE Robotics and Automation Magazine 26 (4), pp. 30–45.
- (2002) Survey on visual servoing for manipulation. Technical report, Computational Vision and Active Perception Laboratory.
- (2012) Incremental learning of full body motion primitives and their sequencing through human motion observation. The International Journal of Robotics Research 31, pp. 330–345.
- Learning visual servoing with deep features and fitted q-iteration. CoRR abs/1703.11000.
- (2015) End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 17, pp. 39:1–39:40.
- (2019-05) Character navigation in dynamic environments based on optical flow. Computer Graphics Forum 38 (2), pp. 181–192. Note: Eurographics 2019 proceedings.
- (2019) Attracted by light: vision-based steering virtual characters among dark and light obstacles. In MIG 2019 - ACM SIGGRAPH Conference Motion Interaction and Games, pp. 1–6.
- (2019) KPAM: keypoint affordances for category-level robotic manipulation. ArXiv abs/1903.06684.
- A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). Note: arXiv:1512.02134.
- (1999) Visual servoing based on coarse optical flow. IFAC Proceedings Volumes 32 (2), pp. 539–544. Note: 14th IFAC World Congress 1999.
- (1995) Improved force control through visual servoing. Proceedings of 1995 American Control Conference - ACC’95 1, pp. 380–386.
- (1993) Visual servoing for robotic assembly. In Visual Servoing-Real-Time Control of Robot Manipulators Based on Visual Sensory Feedback, K. Hashimoto (Ed.), pp. 139–164.
- (2018) An algorithmic perspective on imitation learning. Foundations and Trends in Robotics 7, pp. 1–179.
- (2019) Grasp planning and visual servoing for an outdoors aerial dual manipulator. Engineering.
- (2018-06) Sim2Real viewpoint invariant visual servoing by recurrent control. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4691–4699.
- Time-contrastive networks: self-supervised learning from multi-view observation. CoRR abs/1704.06888.
- (2018) Fully automatic visual servoing control for work-class marine intervention rovs. Control Engineering Practice 74, pp. 153–167.
- (2004) Scene modelling, recognition and tracking with invariant image features. Third IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 110–119.
- (2019) Quickly inserting pegs into uncertain holes using multi-view images and deep network trained on synthetic data. CoRR abs/1902.09157.
- (2017) An image-based trajectory planning approach for robust robot programming by demonstration. Robotics and Autonomous Systems 98, pp. 241–257.
- (2002) Dynamic aspects of visual servoing and a framework for real-time 3d vision for robotics. In Sensor Based Intelligent Robots, G. D. Hager, H. I. Christensen, H. Bunke, and R. Klein (Eds.), Berlin, Heidelberg, pp. 101–121.
- (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. 6th International Conference on Learning Representations, ICLR 2018, Workshop Track Proceedings.
- (2018) DeepTAM: deep tracking and mapping. CoRR abs/1808.01900.
- FreiPose: a deep learning framework for precise animal motion capture in 3d spaces. Technical report, Department of Computer Science, University of Freiburg.
- (2018) 3D human pose estimation in rgbd images for robotic task learning. 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1986–1992.