Towards Accurate Task Accomplishment with Low-Cost Robotic Arms

12/03/2018 ∙ by Yiming Zuo, et al. ∙ Tsinghua University ∙ Peking University

Training a robotic arm to accomplish real-world tasks has been attracting increasing attention in both academia and industry. This work discusses the role of computer vision algorithms in this field. We focus on low-cost arms on which no sensors are equipped, so that all decisions are made upon visual recognition, e.g., real-time 3D pose estimation. This normally requires annotating a large amount of training data, which is both time-consuming and laborious. In this paper, we present an alternative solution, which uses a 3D model to create a large amount of synthetic data, trains a vision model in this virtual domain, and applies it to real-world images after domain adaptation. To this end, we design a semi-supervised approach which fully leverages the geometric constraints among keypoints, and apply an iterative algorithm for optimization. Without any annotations on real images, our algorithm generalizes well and produces satisfying results on 3D pose estimation, which is evaluated on two real-world datasets. We also construct a vision-based control system for task accomplishment, for which we train a reinforcement learning agent in a virtual environment and apply it to the real world. Moreover, our approach, with merely a 3D model being required, has the potential to generalize to other types of multi-rigid-body dynamic systems.







1 Introduction

Precise and agile robotic arms have been widely used in the assembly industry for decades, but adapting robots to domestic use remains a challenging topic. This task can be made much easier if vision inputs are provided and well utilized by the robots. A typical example lies in autonomous driving [8]. In the area of robotics, researchers have paid more and more attention to vision-based robots and collected large-scale datasets, e.g., for object grasping [19][28] and block stacking [12]. However, the high cost of configuring a robotic system largely limits researchers from accessing these interesting topics.

This work aims at equipping a robotic system with computer vision algorithms, e.g., predicting its real-time status using an external camera, so that researchers can control it with high flexibility, e.g., mimicking the behavior of human operators. In particular, we build our platform upon a low-cost robotic arm named OWI-535, which can be purchased from Amazon for less than . The downside is that this arm has no sensors and thus relies entirely on vision inputs (even when the initial status of the arm is provided and each action is recorded, we cannot accurately compute its real-time status, because each order is executed with large variation – even the battery level has an effect); on the other hand, we can expect vision inputs to provide complementary information in sensor-equipped robotic systems. We chose this arm for two reasons. (i) Accessibility: the cheap price reduces experimental budgets and makes our results easy for lab researchers to reproduce. (ii) Popularity: users around the world have uploaded videos to YouTube recording how this arm was manually controlled to complete various tasks, e.g., picking up tools, stacking up dice, etc. These videos were captured under substantial environmental changes including viewpoint, lighting condition, occlusion and blur, which raises real-world challenges very different from research done in a lab environment.

Hence, the major technical challenge is to train a vision algorithm to estimate the 3D pose of the robotic arm. Mathematically, given an input image I, a vision model is used to predict v = f(I; θ), the real-time 3D pose of the arm, where θ denotes the learnable parameters, e.g., network weights in the context of deep learning [17]. Training such a vision model often requires a considerable amount of labeled data. One option is to collect a large number of images under different environments and annotate them using crowd-sourcing, but we note a significant limitation: these efforts, which take hundreds of hours, are often not transferable from one robotic system to another. In this paper, we suggest an alternative solution which borrows a 3D model and synthesizes an arbitrary amount of labeled data in the virtual world at almost no cost, and later adapts the vision model trained on these virtual data to real-world scenarios.

This falls into the research area of domain adaptation [4]. Specifically, the goal is to train on a virtual distribution P_v and then generalize the model to the real distribution P_r. We achieve this goal by making full use of a strong property: the spatial relationship between keypoints, e.g., the length of each bone, is fixed and known. That is to say, although the target distribution P_r is different from P_v and data from P_r remain unlabeled, the predicted keypoints should strictly obey a set of geometric constraints C. To formulate this, we decompose f into two components, namely g for keypoint detection and h for 3D pose estimation, respectively. Here, h is parameter-free and thus cannot be optimized, so we train g on P_v and hope to adapt it to P_r, with the set of 2D keypoints K becoming a hidden variable. We apply an iterative algorithm to infer K, and the optimal pose v determined by h serves as the guessed label, which is used to fine-tune g. Eventually, domain adaptation is achieved without any annotations in the target domain.

We design two benchmarks to evaluate our system. The first one measures pose estimation accuracy, for which we manually annotate two image datasets, captured in our lab and crawled from YouTube, respectively. Our algorithm, trained on labeled virtual data and fine-tuned with unlabeled lab data, achieves a mean angular error of 4.81°, averaged over four joints. This lays the foundation of the second benchmark, in which we create an environment for the arm to accomplish a real-world task, e.g., touching a specified point. Both quantitative (distance error and success rate) and qualitative (demos are provided in the supplementary material) results are reported. Equipped with reinforcement learning, our vision-based algorithm achieves accuracy comparable with human operators. All data and code will be released after the review process, which we believe will make it easier for other researchers to reproduce these tasks and experiment with their algorithms at a low cost.

In summary, the contribution of this paper is three-fold. First, we design a complete framework to achieve satisfying accuracy in task accomplishment with a low-cost, sensor-free robotic arm. Second, we propose a vision algorithm involving training in virtual environment and domain adaptation, and verify its effectiveness in a typical multi-rigid-body system. Third, we develop a platform with two real-world datasets and a virtual environment so as to facilitate future research in this field.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 elaborates the entire system including data acquisition, synthetic training and domain adaptation. After experiments are shown in Section 4, we conclude this work in Section 5.

2 Related Work

2.1 Vision-based Robotic Control

Vision-based robotic control is attracting more and more attention. Compared with conventional systems relying on specific sensors, e.g., IMUs and rotary encoders, vision has the flexibility to adapt to complex and novel tasks. Recent progress in computer vision makes vision-based robotic control more feasible. Besides using vision algorithms as a perception module, researchers are also exploring training end-to-end control systems purely from vision [13][18][23]. In order to achieve this goal, researchers have collected large datasets for various tasks, including grasping [19][28], block stacking [12], autonomous driving [8][42], etc.

On the other hand, training a system for real-world control tasks is always time-consuming, and high-accuracy sensor-based robots are expensive, both of which have prevented many vision researchers from entering this research field. For the first issue, people turned to simulators such as MuJoCo [36] and Gazebo [16] to accelerate training processes, e.g., with reinforcement learning, and applied the results to real robots, e.g., PR2 [38], Jaco [31] and KUKA IIWA [2]. For the second issue, although low-cost objects (e.g., toy cars [15]) have been used to simulate real-world scenarios, low-cost robotic arms were rarely used, mainly due to the limitations caused by imprecise motors and/or sensors, which make conventional control algorithms difficult to apply. For instance, the Lynxmotion Arm is an inexpensive robotic arm used for training reinforcement learning algorithms [30][5]. The control of this arm was done using a hybrid of camera and servo-motors, which provide joint angles. This paper uses an even cheaper and more popular robotic arm named OWI-535, which relies merely on vision inputs from an external camera. To the best of our knowledge, this arm has never been used for automatic task accomplishment, because it lacks sensors.

2.2 Computer Vision with Synthetic Data

Synthetic data have been widely applied to many computer vision problems in which annotations are difficult to obtain, such as optical flow [3], autonomous driving [6], human parsing [40], VQA [14], pose estimation [33][34], semantic segmentation [10], etc.

Domain adaptation is an important stage in transferring models trained on synthetic data to real scenarios. There are three major approaches in this field, namely, domain randomization [35][37], adversarial training [9][24][32][39] and joint supervision [20]. A more comprehensive survey on domain adaptation is available in [4]. As an alternative solution, researchers introduced intermediate representations (e.g., semantic segmentation) to bridge the domain gap [11]. In this work, we focus on semi-supervised learning with the assistance of domain randomization. The former method is mainly based on 3D priors obtained from modeling the geometry of the target object [7][27]. Previously, researchers applied parameterized 3D models to refine the parsing results of humans [1][26] or animals [45], or to fine-tune the model itself [27]. The geometry of a robotic system (e.g., a multi-rigid-body system) often has a lower degree of freedom, which enables strong shape constraints to be used for both purposes, i.e., prediction refinement and model fine-tuning.

3 Approach

3.1 System Overview

We aim at designing a vision-based system to control a sensor-free robotic arm to accomplish real-world tasks. Our system, as illustrated in Figure 1, consists of three major components working in a loop. The first component, data acquisition, samples synthetic images from a virtual environment for training and real images from an external camera for real-time control. The second component is pose estimation, an algorithm which produces the 3D pose v (joint angles) of the robotic arm (see Figure 2 for the definition of the four joints). The third component is the control module, which takes v as input, determines an action for the robotic arm to take, and thereby triggers a new loop.

Note that data acquisition (Section 3.2) may happen in both virtual and real environments – our idea is to collect cheap training data in the virtual domain, train a vision model, and tune it into one that works well in the real world. The core of this paper is pose estimation (Section 3.3), which is itself an important problem in computer vision, and we investigate it from the perspective of domain transfer. While studying motion control (Section 3.4) is also interesting yet challenging, it goes beyond the scope of this paper, so we set up a relatively simple environment and apply a reinforcement learning algorithm.

3.2 Data Acquisition

Figure 2: The joints and keypoints of OWI-535 used in our experiments. Each joint is assigned a specific name. The color of each keypoint corresponds to the part to which it belongs.

The OWI-535 robotic arm has five motors, named rotation, base, elbow, wrist and gripper. Among them, the status of the gripper is not necessary for motion planning and is thus simply ignored in this paper. The ranges of motion for the first four motors are , , and , respectively. Being an educational arm, OWI-535 is not equipped with any sensors, and the accuracy of its motors is considerably low – the same instruction can lead to different reactions because of changes in the battery level. We choose it mainly for its accessibility and popularity, both of which make it a suitable tool for research and education.

In order to collect training data at low cost, we turn to the virtual world. We download a CAD model of the arm with exactly the same appearance as the real one and reconstruct it in Unreal Engine 4 (UE4). Using Maya, we implement its motion system, which was not equipped in the original model. The angle limitation as well as the collision boundary of each joint is also manually configured. Among the vertices of the CAD model, we manually annotate a set of visually distinguishable ones as our keypoints, as shown in Figure 2. This number is larger than the degree of freedom of the system (camera parameters plus joint angles), yet reasonably small so that all keypoints can be annotated on real images. The images and annotations are collected from UE4 via UnrealCV [29].

We create real-world datasets from two sources, for benchmarking and replication purposes. The first part of the data is collected from an arm in our own lab, and we maximally guarantee variability in its pose, viewpoint and background. The second part is crawled from YouTube, where many users have uploaded videos of playing with this arm. Both subsets raise new challenges not covered by the virtual data, such as motion blur and occlusion, with the second subset being more challenging as the camera intrinsic parameters are unknown and the arm may be modded for various purposes. We manually annotate the keypoints on these images, which typically takes one minute for a single frame.

More details of these datasets are covered in Section 4.

3.3 Transferable 3D Pose Estimation

We first define some terminology used in this paper. Let v denote all parameters that determine the arm's position in the image: the camera extrinsic parameters (location and rotation) and the angles of the motors. K2D and K3D are the locations of the keypoints in 2D and 3D, respectively. Both K2D and K3D are deterministic functions of v.

The goal is to design a function f which receives an image I and outputs the pose vector v = f(I; θ), which defines the pose of the object; θ denotes the learnable parameters, e.g., network weights in the context of deep learning. The keypoints need to follow the same geometric constraints in both the virtual and real domains. In order to fully utilize these constraints, we decompose f into two components, namely, 2D keypoint detection g and 3D pose estimation h. Here, C is a fixed set of equations corresponding to the geometric constraints, e.g., the length between two joints. That is to say, g needs to be trained to optimize θ, while h is a parameter-free algorithm which involves fitting a few fixed arithmetic equations.

To alleviate the expense of data annotation, we apply a setting known as semi-supervised learning [44], which involves two parts of training data. First, a labeled set D_v of training data is collected from the virtual environment. This process is performed automatically at little cost, and is easily transplanted to other robotic systems with a 3D model available. Second, an unlabeled set D_r of image data is provided, while the corresponding label for each image remains unknown. We use P_v and P_r to denote the virtual and real image distributions, respectively. Since P_v and P_r can differ in many aspects, we cannot expect a model trained on D_v to generalize sufficiently well to P_r.

The key is to bridge the gap between P_v and P_r. One existing solution works in an explicit manner, which trains a mapping M so that when we sample an image I from P_v, M(I) maximally mimics the distribution P_r. This is achieved by unpaired image-to-image translation [43], which has been verified effective in some vision tasks [9]. However, in our problem, an additional cue emerges: the source and target images have the same label distribution, i.e., both scenarios aim at estimating the pose of exactly the same object, so we can make use of this cue to achieve domain adaptation in an implicit manner. In practice, we perform semi-supervised training by providing the system with unlabeled data. Our approach exhibits superior transfer ability in this specific task, while we preserve the possibility of combining both manners towards higher accuracy.

To this end, we reformulate g and h in a probabilistic style: g produces a distribution p(K2D | I; θ), and similarly, h outputs p(v | K2D). The goal is to maximize the marginal likelihood of v, while K2D remains a latent variable:

    p(v | I; θ) = ∫ p(v | K2D) · p(K2D | I; θ) dK2D.    (1)

There is another option, which directly computes p(K2D | I; θ) and then infers K3D and v from K2D. We do not take it because we trust p(v | K2D) more than p(K2D | I; θ), since the former is formulated by strict geometric constraints. Eqn (1) can be solved using an iterative algorithm, starting with a model pre-trained on the virtual dataset.

In the first step, we fix θ and infer K2D. The detector g is implemented as a stacked hourglass network [25]: it takes a cropped input image and produces one heatmap per keypoint. These heatmaps are then taken as the input of h, which estimates v as well as K2D and K3D. This is done by making use of the geometric constraints C, which appear as a few linear equations with fixed parameters, e.g., the length of each bone of the arm. This is a probabilistic model, and we apply an iterative algorithm (see Section 3.5.2) to find an approximate solution v*, K2D* and K3D*. Note that v* is not necessarily the maximum of p(v | K2D).

In the second step, we take the optimal v* found in the first step to update θ. As g is a deep network, this is achieved by gradient back-propagation. We incorporate this iterative algorithm with stochastic gradient descent: in each basic unit known as an epoch, each step is executed only once. Although convergence is most often not achieved, we continue with the next epoch, which brings more informative supervision. Compared with solving Eqn (1) directly, this strategy improves the efficiency of the training stage, i.e., a smaller number of iterations is required.
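The alternation between geometric fitting and pseudo-labeling can be illustrated on a toy planar two-link arm. This is a hypothetical sketch, not the paper's implementation: the link length, grid-search fitting and noise model are our own simplifications, and the grid search stands in for the approximate inference of Section 3.5.2.

```python
import numpy as np

BONE = 1.0  # known link length: the geometric constraint C of the toy arm

def forward_kinematics(angles):
    """2D keypoints (base, elbow, tip) of a planar two-link arm."""
    a1, a2 = angles
    elbow = BONE * np.array([np.cos(a1), np.sin(a1)])
    tip = elbow + BONE * np.array([np.cos(a1 + a2), np.sin(a1 + a2)])
    return np.stack([np.zeros(2), elbow, tip])

def fit_pose(keypoints, grid=np.linspace(-np.pi, np.pi, 181)):
    """Parameter-free h: grid-search the pose that best explains the
    detected keypoints under the fixed-bone-length constraint."""
    best, best_err = None, np.inf
    for a1 in grid:
        for a2 in grid:
            err = np.sum((forward_kinematics((a1, a2)) - keypoints) ** 2)
            if err < best_err:
                best, best_err = (a1, a2), err
    return np.asarray(best)

# a toy detector output g(I; theta): noisy keypoints that violate C
true_pose = np.array([0.4, -0.7])
rng = np.random.default_rng(0)
noisy_keypoints = forward_kinematics(true_pose) + rng.normal(0, 0.05, (3, 2))

pose_star = fit_pose(noisy_keypoints)         # infer the latent pose v*
pseudo_label = forward_kinematics(pose_star)  # guessed label to fine-tune g
```

The pseudo-labels regenerated from the fitted pose satisfy the bone-length constraint exactly, which is the property that makes them more trustworthy than the raw detections.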

3.4 Motion Control

In order to control the arm to complete tasks, we need a motion control module which takes the estimated 3D pose as input and outputs an action to achieve the goal. The motion control policy a_t = π(s_t, g) is learned via a deep reinforcement learning algorithm. Here, s_t is the state of the environment at time t, e.g., the arm pose v; g represents the goal, e.g., the target location; and a_t is the control signal for each joint in our system. The policy is learned in our virtual environment and optimized by Deep Deterministic Policy Gradient (DDPG) [21]. Our experiments show that, using the arm pose as input, a policy learned in the virtual environment can be directly applied to the real world.
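The closed loop between pose estimation and control can be sketched as follows. This is an illustration only: the learned DDPG actor is replaced by a hand-written proportional controller, and the first-order motor response is our own assumption, not the system's dynamics.

```python
import numpy as np

def policy(state, goal):
    """Stand-in for the learned DDPG actor pi(s_t, g) -> a_t; here a
    simple proportional controller bounded to a unit signal range."""
    return np.clip(goal - state, -1.0, 1.0)

def control_loop(initial_pose, goal, steps=50, dt=0.1):
    """Closed loop: read the pose estimate, pick an action, let the
    (assumed first-order) motors integrate the control signal."""
    state = np.asarray(initial_pose, dtype=float)
    for _ in range(steps):
        action = policy(state, goal)   # a_t, one control signal per joint
        state = state + dt * action    # assumed motor response
    return state
```

In the real system, `state` would come from the vision-based pose estimator rather than from the simulated dynamics above.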

3.5 Implementation Details

3.5.1 Training Data Variability

Our approach involves two parts of training data, namely, a virtual subset to pre-train 2D keypoint detection, and an unlabeled real subset for fine-tuning. In both parts, we change the background contents of each training image so as to facilitate data variability and thus alleviate the risk of over-fitting.

In the virtual domain, the background can be freely controlled by the graphical renderer. In this scenario, we place the arm on a board under a sky sphere, and the textures of both the board and the sphere are randomly sampled from the MS-COCO dataset [22]. In the real domain, however, background segmentation is non-trivial and can be inaccurate. To sidestep this difficulty, we create a special subset for fine-tuning in which all images are captured in a clean environment, e.g., in front of a white board, which makes it easy to segment the arm with a pixel-wise color-based filter and then place it onto a random image from the MS-COCO dataset. We observe consistent accuracy gains brought by these simple techniques.
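The color-filter-and-paste step for the real fine-tuning subset can be sketched in a few lines; the tolerance value and the assumption of a near-white backdrop are placeholders for illustration, not the exact settings used in our experiments.

```python
import numpy as np

def composite_on_background(frame, background, tol=30):
    """Segment the arm from a near-white backdrop with a pixel-wise
    color filter, then paste it onto a replacement background image."""
    white = np.array([255, 255, 255])
    backdrop = np.all(np.abs(frame.astype(int) - white) < tol, axis=-1)
    # keep arm pixels from the frame, fill backdrop pixels from background
    return np.where(backdrop[..., None], background, frame).astype(np.uint8)
```

In practice `background` would be a randomly drawn MS-COCO image resized to the frame resolution.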

3.5.2 Joint Keypoint Detection and Pose Estimation

We use an approximate algorithm to find the v (as well as K2D and K3D) that maximizes p(v | I; θ) in Eqn (1), because exact optimization is mathematically intractable. We first compute the K2D that maximizes p(K2D | I; θ). This is performed on the heatmap of each 2D keypoint individually, which produces not only the most probable location of each keypoint but also a score c indicating its confidence. We then filter out low-confidence keypoints with a threshold ξ, i.e., all keypoints with c < ξ are considered unknown (and thus completely determined by the geometric prior) in the subsequent 3D reconstruction module. This maximally prevents the impact of outliers. In practice, our algorithm is not sensitive to this parameter.
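The per-keypoint decoding and confidence filtering can be sketched as below; the threshold value and array shapes are placeholders, not the values used in our experiments.

```python
import numpy as np

def decode_heatmaps(heatmaps, xi=0.5):
    """Per-keypoint argmax decoding with confidence filtering: keypoints
    whose peak score falls below xi are marked unknown (NaN) and left to
    the geometric prior in the 3D reconstruction step."""
    n, h, w = heatmaps.shape
    coords = np.full((n, 2), np.nan)
    conf = heatmaps.reshape(n, -1).max(axis=1)
    for k in range(n):
        if conf[k] >= xi:
            y, x = np.unravel_index(heatmaps[k].argmax(), (h, w))
            coords[k] = (x, y)
    return coords, conf
```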

Next, we recover the 3D pose from these 2D keypoints, i.e., we maximize p(v | K2D). Under the assumption of perspective projection, each 2D keypoint u_k is the projection of a 3D coordinate x_k, which can be written as a linear equation:

    s_k · [u_k; 1] = A · (R · x_k + t).    (2)

Here, u_k and x_k are the 2D and 3D coordinates, respectively, and [u_k; 1] is the homogeneous form of u_k. A is the camera intrinsic matrix, which is constant for a specific camera. s_k, R and t denote the scaling factor, rotation matrix and translation vector, respectively, all of which are determined by v. For each keypoint k, x_k is determined by the motor transformation x_k = T_k(v) · x̄_k, where x̄_k is the constant coordinate of the k-th keypoint when all motor angles are zero, and T_k(v) is the motor transformation matrix for the k-th keypoint, which is also determined by v. Due to inaccuracies in prediction (u_k can be inaccurate in either prediction or manual annotation) and formulation (e.g., perspective projection does not model camera distortion), Eqn (2) may not hold perfectly. In practice, we assume the residuals to follow an isotropic Gaussian distribution, and maximizing the likelihood gives the following log-likelihood loss:

    L(v) = Σ_k ‖u_k − proj(x_k(v))‖²,    (3)

where proj(·) denotes the perspective projection of Eqn (2) and the sum runs over the confident keypoints.
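The projection and the residual loss can be written out directly in numpy. This is an illustrative sketch with hypothetical intrinsic and extrinsic values; the confidence mask corresponds to the filtering of Section 3.5.2.

```python
import numpy as np

def project(points3d, A, R, t):
    """Perspective projection of Eqn (2): s * [u; 1] = A (R x + t)."""
    cam = points3d @ R.T + t          # rotate/translate into the camera frame
    uvw = cam @ A.T                   # apply the intrinsic matrix A
    return uvw[:, :2] / uvw[:, 2:3]   # divide by depth, i.e., the scale s

def reprojection_loss(points3d, keypoints2d, A, R, t, confident):
    """Eqn (3): squared reprojection residuals over confident keypoints."""
    residual = project(points3d, A, R, t) - keypoints2d
    return float(np.sum(residual[confident] ** 2))
```

Minimizing `reprojection_loss` over the pose parameters that generate `points3d` (via the motor transformations T_k) recovers v.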
4 Experiments

Figure 3: Examples of data used in this paper. From top to bottom: synthetic images (top two rows), lab images and YouTube images. Zoom in to see details.

4.1 Dataset and Settings

We generated synthetic images with randomized camera parameters, lighting conditions, arm poses and background augmentation (see Section 3.5.1), split into a training set and a held-out validation set. This dataset, later referred to as the virtual dataset, is used to verify that the 2D keypoint detection model, e.g., a stacked hourglass network, works well.

In the real environments, we collected and manually annotated two sets of data. The first is named the lab dataset, which contains video frames captured by a Logitech C920 camera. We manually chose key frames and annotated them. For this purpose, we rendered the 3D model of the arm in the virtual environment and adjusted it to the same pose as the real arm so that the arms in the two images exactly overlap – in this way we obtained the ground-truth arm pose, as well as the camera intrinsic and extrinsic parameters (obtained via a checkerboard placed alongside the robotic arm at the beginning of each video). We deliberately placed distractors, e.g., colorful boxes, dice and balls, to make the dataset more difficult and thus evaluate the generalization ability of our models. The frames used for fine-tuning and for testing come from different videos.

The second part of the real-world image data, the YouTube dataset, is crawled from YouTube videos featuring the OWI-535 arm. This is a highly diversified collection, in which the arm may even be modded, i.e., the geometric constraints may not hold perfectly. We sampled frames and manually annotated the visibility as well as the position of each 2D keypoint. Note that, without camera parameters, we cannot annotate the accurate pose of the arm. This dataset is never included in training, but is used to observe the behavior of domain adaptation.

Sample images from the three datasets are shown in Figure 3.

4.2 Pose Estimation

This section studies 3D pose estimation on the basis of 2D keypoint detection. In all experiments, a popular metric named PCK@0.2 [41] is used to evaluate 2D keypoint detection, and the average angular error (in degrees) is used for 3D pose estimation.
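For reference, the PCK metric counts a keypoint as correct when its prediction lies within a fraction of a reference size of the ground truth. The sketch below assumes normalization by a bounding-box-like reference size, which is one common convention and not necessarily the exact normalization of [41].

```python
import numpy as np

def pck(pred, gt, ref_size, alpha=0.2):
    """PCK@alpha: fraction of keypoints predicted within
    alpha * ref_size pixels of the ground-truth location."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dist <= alpha * ref_size))
```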

4.2.1 Detecting 2D Keypoint

We first evaluate 2D keypoint detection, the basis of 3D pose estimation. For this purpose, we train a stacked hourglass network from scratch on the virtual dataset. Standard data augmentation techniques are applied, including random translation, rotation, scaling, color shifting and flipping. On top of this model, we consider two approaches to domain transfer. The first is to train an explicit model which transfers virtual data to fake real data, on which we train a new model. In practice, we apply a popular generative model named CycleGAN [43], trained for 100 epochs with synthetic images as the source domain and lab images as the target domain. The second approach is the one described in Section 3.3: we mix the synthetic and real images and use the same hyper-parameters as in the baseline training. Background clutters are added to the lab images in an online manner to facilitate variability (see Section 3.5.1).

Model Virtual Lab YouTube (all) YouTube (visible)
Synthetic 99.95 95.66 80.05 81.61
CycleGAN 99.86 97.84 75.26 76.98
Refined 99.63 99.55 87.01 88.89
Table 1: 2D keypoint detection results on three datasets. Accuracy is reported under the PCK@0.2 metric. Models are tested on the YouTube dataset both considering all keypoints and considering only the visible keypoints.

Results are summarized in Table 1. The baseline model works almost perfectly on virtual data, reporting a PCK@0.2 accuracy of 99.95. However, this number drops significantly to 95.66 on lab data, and dramatically to 80.05 on YouTube data, demonstrating the existence of domain gaps. These gaps are largely shrunk after domain adaptation algorithms are applied. Training with images generated by CycleGAN, the model works better in its target domain, i.e., the lab dataset, by a margin of 2.18. However, this model fails to generalize to the YouTube dataset, where its accuracy is even lower than the baseline model. Our approach, on the other hand, achieves much higher accuracy, with a PCK@0.2 score of 99.55 on the lab data and 87.01 on YouTube, boosting the baseline performance by 6.96. On the subset of visible YouTube keypoints, the improvement is even higher (7.28). In addition, the refined model produces only a slightly worse PCK@0.2 accuracy on virtual data (99.63 vs. 99.95), implying that a balance is achieved between fitting the virtual data and transferring to real data.

These results reveal that the performance of explicit domain adaptation methods, e.g., CycleGAN, can be limited in several aspects. Although the model trained on CycleGAN-generated images fits the target domain, it performs poorly on unseen data, whereas our 3D-geometry-based domain adaptation method generalizes well. Moreover, we failed to train a CycleGAN model with the YouTube dataset as the target domain, because the distribution of YouTube data is too diverse and such a transformation is hard to learn.

4.2.2 Estimating 3D Pose

Motor Synthetic Only Refined
Rotation 7.41 4.40
Base 6.20 3.29
Elbow 7.15 5.35
Wrist 7.74 6.03
Average 7.13 4.81
Camera Rotation (degrees) 6.29 5.29
Camera Location (cm) 7.58 6.73
Table 2: 3D pose estimation errors (in degrees) and camera parameter prediction errors (in degrees and centimeters) in the lab dataset.

We first test the performance of our 3D pose estimation algorithm on the virtual dataset. We use the model trained on synthetic images only, since it performs better on synthetic data. The experiment was conducted on 500 synthetic images, on which we measure the angular error of each of the four joints and their average, as well as the camera rotation error (in degrees) and camera location error (in centimeters).

We also test the 3D pose estimation performance of our model on real images, which is the basis for completing tasks. Quantitative 3D pose estimation results are only reported for the lab dataset, since obtaining 3D annotations for YouTube data is difficult. We estimate the camera intrinsic parameters via checkerboard calibration. Results are shown in Table 2: our refined model outperforms the synthetic-only model by 2.32° in average angular error. Qualitative results on the YouTube dataset are shown in Figure 4. Since the camera intrinsic parameters are unknown for YouTube videos, we use the weak-perspective model during reconstruction. Heavy occlusion, user modification and extreme lighting make 3D pose estimation hard in some cases. We select typical samples of success and failure cases.
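The weak-perspective model used for the YouTube reconstructions replaces the per-point depth division of Eqn (2) by a single global scale. A minimal sketch, with the scale, rotation and translation as hypothetical fitted parameters:

```python
import numpy as np

def weak_perspective(points3d, s, R, t2):
    """Weak-perspective projection: rotate, drop the depth coordinate,
    apply a uniform scale s and a 2D translation t2 (no per-point
    depth division, hence no intrinsic matrix is needed)."""
    return s * (points3d @ R.T)[:, :2] + t2
```

This approximation is reasonable when the arm's depth extent is small relative to its distance from the camera, which is why it is usable when the intrinsics are unknown.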

Figure 4: Qualitative results from our YouTube dataset. The challenges include occlusion, user modification, lighting, etc. We show synthetic images generated using the camera parameters and pose estimated from the single input image. Both success cases (left five columns) and failure cases (rightmost column) are shown.

4.2.3 Ablation Study: Domain Adaptation Options

As described in Section 3.5, when training the refined model, two strategies are applied to the real images: random background augmentation and joint keypoint detection and pose estimation. To evaluate the contribution of each strategy to the improvement in accuracy, we conduct an ablation study. Results are shown in Table 3.

We compare the performance of 5 models: 1) the baseline model, trained on 4,500 synthetic images; 2)-5) models trained with/without joint estimation and with/without background augmentation. Note that if a model is trained without joint estimation, we directly take the argmax of the predicted heatmap as the annotation for training.

We see that both strategies contribute to the overall performance improvement, and our model performs best when they are combined.

Model Lab YouTube (all) YouTube (visible)
Synthetic Only 95.66 80.05 81.61
BG ✗ 3D ✗ 94.52 80.24 82.22
BG ✓ 3D ✗ 98.72 84.04 87.11
BG ✗ 3D ✓ 97.31 86.52 88.27
BG ✓ 3D ✓ 99.55 87.01 88.89
Table 3: Ablation study results. Models are tested on the lab and YouTube datasets under the PCK@0.2 metric. "3D" stands for joint keypoint detection and 3D pose estimation; "BG" stands for random background augmentation.

4.3 Control the Arm with Vision

We implement a complete control system purely based on vision, as described in Section 3.4. It takes a video stream as input, estimates the arm pose, then plans the motion and sends control signals to the motors.

This system is verified with a reaching task: the goal is to control the arm so that the arm tip reaches a position right above a specified point on the table without collision. Each attempt is considered successful if the horizontal distance between the arm tip and the target is within 3 cm. The system is tested from 6 different camera views; for each view, the arm needs to reach 9 target points. The target points and camera views are selected to cover a variety of cases. We also place distractors to challenge our vision module. A snapshot of our experimental setup is shown in Figure 5.

We report human performance on the same task. A human operator is asked to watch the video stream on a screen and control the arm with a gamepad. This setup ensures that the human and our system receive the same visual input. In addition, we also allow the human to look directly at the arm and move freely to observe it while doing the task. The performance of both setups is reported.

Our control system achieves performance comparable to that of a human on this task; results are reported in Table 4. Humans perform much better when looking at the arm directly, because they can constantly move their head to pick the best view for the current state. Potentially, we could use active vision or a multi-camera system to mimic this ability and further improve the system; these are interesting directions for future work but beyond the scope of this paper. It is worth noting that our system runs in real time and finishes the task faster than the human operator.

Built on this control module, we show that our system can move a stack of dice into a box; please see the supplementary video for a demonstration.

Agent   Input Type   Distance Error (cm)   Success Rate   Average Time (s)
Human   Direct       0.65                  100.0%         29.8
Human   Camera       2.67                  66.7%          38.8
Ours    Camera       2.66                  59.3%          21.2
Table 4: Quantitative results on the reaching task. Our system achieves performance comparable to the human operator when given the same input.
Figure 5: A snapshot of the real-world experiment setup. Locations of the goals are printed on the reference board and are used as references when measuring the error. Background objects are scattered randomly during testing.

5 Conclusions

In this paper, we built a system, based purely on vision input, to control a low-cost, sensor-free robotic arm to accomplish real-world tasks. We used a semi-supervised algorithm that integrates labeled synthetic data and unlabeled real data to train the pose estimation module. Geometric constraints of the multi-rigid-body system (the robotic arm in this case) were utilized for domain adaptation. Our approach, requiring merely a 3D model, has the potential to be applied to other multi-rigid-body systems.

To facilitate reproducible research, we created a virtual environment to generate synthetic data, and also collected two real-world datasets from our lab and from YouTube videos, respectively, all of which can be used as benchmarks to evaluate 2D keypoint detection and/or 3D pose estimation algorithms. In addition, the low cost of our system enables vision researchers to study robotic tasks, e.g., reinforcement learning, imitation learning, active vision, etc., without large economic expense. This system also has the potential to be used for high-school and college educational purposes.

Beyond our work, many interesting future directions can be explored. For example, we can train perception and controller modules in a joint, end-to-end manner [18][13], or incorporate other vision components, such as object detection and 6D pose estimation, to enhance the ability of the arm so that more complex tasks can be accomplished.


Acknowledgments

This work is supported by IARPA via DOI/IBC contract No. D17PC00342. The authors would like to thank Vincent Yan, Kaiyue Wu, Andrew Hundt, Prof. Yair Amir and Prof. Gregory Hager for helpful discussions.


References

  • [1] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision, 2016.
  • [2] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In International Conference on Robotics and Automation, 2018.
  • [3] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, 2012.
  • [4] G. Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.
  • [5] M. P. Deisenroth. Learning to control a low-cost manipulator using data-efficient reinforcement learning. Robotics: Science and Systems, 2012.
  • [6] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Conference on Computer Vision and Pattern Recognition, 2016.
  • [7] Y. Gao and A. L. Yuille. Exploiting symmetry and/or manhattan properties for 3d object structure estimation from single and multiple images. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [9] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, 2018.
  • [10] J. Hoffman, D. Wang, F. Yu, and T. Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
  • [11] Z. Hong, Y. Chen, H. Yang, S. Su, T. Shann, Y. Chang, B. H. Ho, C. Tu, T. Hsiao, H. Hsiao, S. Lai, Y. Chang, and C. Lee. Virtual-to-real: Learning to control in visual semantic segmentation. In International Joint Conference on Artificial Intelligence, 2018.
  • [12] A. Hundt, V. Jain, C. Paxton, and G. D. Hager. Training frankenstein’s creature to stack: Hypertree architecture search. arXiv preprint arXiv:1810.11714, 2018.
  • [13] E. Jang, S. Vijaynarasimhan, P. Pastor, J. Ibarz, and S. Levine. End-to-end learning of semantic grasping. arXiv preprint arXiv:1707.01932, 2017.
  • [14] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [15] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017.
  • [16] N. P. Koenig and A. Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In International Conference on Intelligent Robots and Systems, 2004.
  • [17] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • [18] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • [19] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.
  • [20] C. Li, M. Z. Zia, Q. Tran, X. Yu, G. D. Hager, and M. Chandraker. Deep supervision with shape concepts for occlusion-aware 3d object parsing. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [21] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference for Learning Representations, 2016.
  • [22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
  • [23] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang. End-to-end active object tracking via reinforcement learning. In International Conference on Machine Learning, pages 3286–3295, 2018.
  • [24] F. Mahmood, R. Chen, and N. J. Durr. Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Transactions on Medical Imaging, 2018.
  • [25] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, 2016.
  • [26] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In International Conference on 3D Vision, 2018.
  • [27] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [28] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In International Conference on Robotics and Automation, 2016.
  • [29] W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, Y. Wang, and A. Yuille. UnrealCV: Virtual worlds for computer vision. In Proceedings of the 25th ACM International Conference on Multimedia, 2017.
  • [30] R. Rahmatizadeh, P. Abolghasemi, L. Bölöni, and S. Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In International Conference on Robotics and Automation, 2018.
  • [31] A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell. Sim-to-real robot learning from pixels with progressive nets. arXiv preprint arXiv:1610.04286, 2016.
  • [32] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [33] H. Su, C. R. Qi, Y. Li, and L. J. Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In International Conference on Computer Vision, 2015.
  • [34] M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In European Conference on Computer Vision, 2018.
  • [35] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems, 2017.
  • [36] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
  • [37] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. arXiv preprint arXiv:1804.06516, 2018.
  • [38] E. Tzeng, C. Devin, J. Hoffman, C. Finn, P. Abbeel, S. Levine, K. Saenko, and T. Darrell. Adapting deep visuomotor representations with weak pairwise constraints. arXiv preprint arXiv:1511.07111, 2015.
  • [39] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [40] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [41] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2878–2890, 2013.
  • [42] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
  • [43] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision, 2017.
  • [44] X. Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2006.
  • [45] S. Zuffi, A. Kanazawa, and M. J. Black. Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images. In Conference on Computer Vision and Pattern Recognition, 2018.