In the context of autonomous manipulation, 3D object localization plays an essential role. Stereo vision is one of the most efficient methods for 3D reconstruction and is thus a very useful tool for robotic manipulation. Indeed, stereo vision has received much attention over the past decade in different subfields of autonomous manipulation (grasping [grasping_stereo, grasping_stereo2], contact-reach tasks [manipulation_stereo], and even playing soccer [robocup_stereo], …).
However, classical stereo vision methods are hard to design and require a lot of tuning. They can be inaccurate because they are composed of many steps, each of which is a potential source of error (see section 2). For this reason, we strongly believe that stereo localization would benefit from an end-to-end learning approach, which would simplify the design process and hopefully improve the accuracy of the system. Recently, several end-to-end approaches have succeeded at complex tasks, such as robot manipulation [deepVisuo], self-driving cars [endtoend_driving], speech recognition [endtoend_speech], obstacle avoidance [endtoend_obstacle], … Even for end-to-end learning of manipulation tasks, which includes object localization, the vision part must be pretrained to locate objects if reinforcement learning is to scale to real-life applications [deepVisuo]. Hence, this work, which mainly focuses on end-to-end learning of 3D stereo object localization, is also significant within the wider context of autonomous robot learning of manipulation tasks.
To train a convolutional neural network (or any regression model) to learn stereo localization, we need labelled data for localization (i.e., stereo images of the object to be located together with its position in space). Gathering such labels is a hard task: position labels cannot be written manually without spending precious time on very precise measurements for each sample. In this paper, we introduce an approach for building such a dataset, using an industrial robot to generate labelled data for 3D stereo localization. The main contribution of this paper is a procedure to automatically gather a large amount of labelled stereo data for object localization. This procedure is applied to generate a dataset for 3D screwdriver localization, composed of more than three thousand samples. This dataset (available at https://goo.gl/hCuj35) also constitutes progress in itself, as it provides a baseline on which future research can test various regression models (such as CNN architectures) for stereo localization.
2 Motivations for end-to-end stereo localization
2.1 Classical methods for 3D object localization using stereo vision
The use of stereo vision has a long history in robotic manipulation [grasping_stereo, manipulation_stereo]. However, stereo localization currently consists of stacking different methods, each of which can be inaccurate. As we can see in figure 1, to implement 3D object localization, one must calibrate both cameras, compute accurate stereo matching, build a computer vision pipeline to identify the pixels of interest, triangulate, and accurately measure the frame transformations needed to express the result in the frame of interest.
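To illustrate how errors in one stage propagate to the final position, the triangulation step alone can be sketched as follows. This is a minimal sketch assuming an idealized, rectified pinhole stereo pair; the function name and all numeric values (focal length, baseline, pixel coordinates) are hypothetical and are not taken from the setup used in this paper.

```python
def triangulate_rectified(u_left, v_left, u_right, focal_px, baseline_m, cx, cy):
    """Back-project a matched pixel pair to a 3D point in the left camera frame.

    Assumes an ideal rectified stereo pair (hypothetical setup): identical
    focal lengths, parallel optical axes, and a purely horizontal baseline.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity   # depth from disparity
    x = (u_left - cx) * z / focal_px        # lateral offset
    y = (v_left - cy) * z / focal_px        # vertical offset
    return (x, y, z)


# Illustrative values only: 800 px focal length, 10 cm baseline.
camera = dict(focal_px=800.0, baseline_m=0.1, cx=320.0, cy=240.0)
exact = triangulate_rectified(420.0, 260.0, 380.0, **camera)
# Same point with a one-pixel stereo matching error in the right image.
shifted = triangulate_rectified(420.0, 260.0, 381.0, **camera)
depth_error_m = abs(shifted[2] - exact[2])
```

With these illustrative numbers, a single-pixel matching error changes the estimated depth by about five centimetres, before any error from calibration, pixel selection, or frame transformation is added.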
More recently, researchers have started to use deep learning for stereo vision [cnn_stereomatching1, cnn_stereomatching2, cnn_stereomatching3]. However, they mainly focus on stereo matching, which solves only part of the problem. This trend can be seen in the tasks proposed with the most famous stereo vision datasets [kitti_stereomatching, middlebury]. To locate an object in space, one also needs to identify the precise pixels of the object in one of the images, which can cause errors even with perfect stereo matching. Indeed, this problem is close to instance segmentation [instance_segmentation] and is not easy. In general, end-to-end learning approaches tend to perform better than stacked subsystems, as they can learn to compensate for internal errors. For this reason, we introduce such a framework for stereo localization in the next section. This paper is a first step towards implementing it, as it proposes a generic way to produce real-world datasets for stereo object localization, using a robot to generate position labels.
2.2 End-to-end pipeline for stereo localization
An end-to-end pipeline for stereo object localization would consist in mapping pixels from two images to 3D points in space representing an object pose, using a regression function approximator. Such a pipeline could then be used for grasping, or any robotic manipulation task.
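The input/output interface of such a regressor can be sketched as follows. This is only an illustration of the shapes involved: a linear least-squares model stands in for the CNN, and the images and labels are synthetic random data generated for the example, not samples from our dataset. All names and sizes are assumptions.

```python
import numpy as np

# Synthetic stand-in data (assumption): tiny 8x8 grayscale stereo pairs.
rng = np.random.default_rng(0)
n_samples, height, width = 200, 8, 8
left_images = rng.random((n_samples, height, width))
right_images = rng.random((n_samples, height, width))

# One feature vector per sample: both images flattened, plus a bias term.
features = np.concatenate(
    [
        left_images.reshape(n_samples, -1),
        right_images.reshape(n_samples, -1),
        np.ones((n_samples, 1)),
    ],
    axis=1,
)

# Synthetic 3D position labels, constructed to be linear in the pixels so
# that the toy model can fit them; real labels would come from the robot.
hidden_weights = rng.normal(size=(features.shape[1], 3))
positions = features @ hidden_weights

# Fit on the first 150 samples, predict (x, y, z) for the remaining 50.
n_train = 150
weights, *_ = np.linalg.lstsq(features[:n_train], positions[:n_train], rcond=None)
predicted = features[n_train:] @ weights
```

The point of the sketch is the mapping itself: one pair of images in, one 3D position out, with no intermediate calibration, matching, or triangulation stage exposed to the designer.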
In this paper, we propose an approach to build datasets for learning stereo localization. Showing that large datasets can be gathered is a first step towards investigating the feasibility of end-to-end stereo object localization. The purple dotted line in figure 1 illustrates what the network is supposed to encode (the relative camera calibration could be added). Given the results CNNs have produced on image tasks, it seems that only the difficulty of gathering enough labelled data prevents them from carrying out end-to-end stereo localization. The idea we implement in this paper is to define a method that uses a precise and accurate industrial robot to generate labelled data for object localization. We also apply it to gather data for screwdriver localization. Section 3 contains more details about the dataset.
3 Dataset generation
When building a dataset for supervised learning, it is crucial to make sure that the input data contain enough information to determine the outputs. For stereo object localization, this requires two cameras with fixed focal lengths and a fixed relative position. We also need to keep both cameras fixed with respect to the robot and to orient them so that their fields of view partially overlap (the object must be seen by both cameras).
3.1 Generate diversity to avoid overfitting
In order to avoid overfitting, the dataset should be as diverse as possible. To this end, we vary the following parameters:
lighting conditions, by gathering the data on a shop floor with uncontrolled lighting;
background, by placing different cloths/objects on the table at which the cameras are looking.
We also add images in which we place distractor objects (other tools) in the background, so that the CNN learns to locate screwdrivers and not just any object. The dataset contains several different screwdrivers to help the network learn what characterizes a screwdriver.
Finally, we make sure that the random configurations of the robot explore the full range of positions and orientations allowed within the common view range of the cameras. Figure 2 shows a representative subset of the images present in the dataset. Note that each image shown has a corresponding image from the other camera.
3.2 Avoid learning the wrong thing by removing the robot
The idea of using a robot to generate supervision opens a broad range of possibilities for object localization. However, it has a potential drawback: the robot (or at least the tool holder) appears in every image, and the regressor might learn to locate the robot and simply apply some kind of offset to find the screwdriver. To avoid this problem, we need to remove the robot from some of the images. We proceed in two different ways:
Physically, by hiding the robot under some cloth wrapped around it (figure 2).
Computationally, by using a computer vision (CV) pipeline to remove the robot from the images. To do so, we extend the dataset and take four images per sample (robot + tool, robot alone, and the background corresponding to each of the two situations). The full pipeline is described in Figure 3.
The idea behind the robot removal CV algorithm is to use the robot-only image and its background to compute a mask of the robot, and then to replace the corresponding pixels with background pixels. A similar approach has been proposed by [levine_grasping]. Two background images are needed because the lighting conditions change between shots. The algorithm is then fine-tuned to remove only the intended pixels and to produce smooth edges.
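The core of this idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline of Figure 3: the function name, the argument names, and the fixed difference threshold are hypothetical, and the fine-tuning and edge smoothing steps are omitted.

```python
import numpy as np


def remove_robot(robot_tool, robot_only, bg_for_tool, bg_for_robot, threshold=30):
    """Replace robot pixels in the robot + tool image with background pixels.

    The robot mask is the set of pixels where the robot-only image differs
    from its own background by more than `threshold` (an assumed value).
    """
    diff = np.abs(robot_only.astype(np.int32) - bg_for_robot.astype(np.int32))
    if diff.ndim == 3:
        # Colour image: use the largest difference over the channels.
        diff = diff.max(axis=2)
    robot_mask = diff > threshold
    cleaned = robot_tool.copy()
    # Fill with the background taken under the robot + tool lighting.
    cleaned[robot_mask] = bg_for_tool[robot_mask]
    return cleaned, robot_mask


# Toy grayscale example: the two shots have slightly different lighting.
bg_for_robot = np.full((4, 6), 100, dtype=np.uint8)
bg_for_tool = np.full((4, 6), 110, dtype=np.uint8)
robot_only = bg_for_robot.copy()
robot_only[0:2, 0:3] = 200          # robot pixels
robot_tool = bg_for_tool.copy()
robot_tool[0:2, 0:3] = 205          # robot pixels
robot_tool[2:4, 3:5] = 50           # tool pixels
cleaned, robot_mask = remove_robot(robot_tool, robot_only, bg_for_tool, bg_for_robot)
```

In the toy example, the robot region is filled with the background that matches the lighting of the robot + tool shot, while the tool pixels are left untouched because they do not overlap the mask.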
So that not all images in our dataset carry systematic patterns introduced by the robot removal algorithm, it is also important to keep the original images, i.e. those with the robot.
3.3 Technical details on data generation
For tool calibration, we assume that the screwdriver is orthogonal to the tool holder, and we use the very accurate torque sensors of the KUKA LBR iiwa robot to detect contact with a plate in several directions. In this way, we obtain all the offsets and can compute the frame transformation.
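Once the tool offset is known, a position label can be obtained by composing homogeneous transformations. The following is a minimal sketch, assuming the robot reports the pose of its flange in the base frame and that calibration yields a fixed flange-to-tip transform; the helper names, frame names, and numeric values are hypothetical.

```python
import numpy as np


def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = translation
    return transform


def rotation_z(angle_rad):
    """Rotation matrix about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def tool_tip_position(base_T_flange, flange_T_tip):
    """Position of the tool tip expressed in the robot base frame."""
    return (base_T_flange @ flange_T_tip)[:3, 3]


# Illustrative values: flange pose from the robot, tip offset from calibration.
base_T_flange = make_transform(rotation_z(np.pi / 2), [0.5, 0.0, 0.3])
flange_T_tip = make_transform(np.eye(3), [0.1, 0.0, 0.2])
label = tool_tip_position(base_T_flange, flange_T_tip)
```

Because the flange-to-tip transform is fixed, it only has to be calibrated once; every subsequent label then follows from the pose reported by the robot.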
When we generate data, we make sure that the object stays within a given cuboid (the largest one that fits inside the common field of view of the cameras), that the robot does not hide the object in the images, and, obviously, that the robot does not collide with itself or the environment.
We also check, in the cases where we take several images, that the robot is not behind the screwdriver, so that the masks do not overlap and our computer vision algorithm does not remove part of the tool.
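These checks can be organized as a rejection-sampling loop, sketched below for positions only (orientations are omitted for brevity). This is a sketch under stated assumptions: the function name, the cuboid bounds, and the `is_valid` predicate, which stands in for the occlusion, collision, and mask overlap checks, are all hypothetical.

```python
import random


def sample_valid_positions(n_samples, cuboid_min, cuboid_max, is_valid,
                           rng=None, max_tries=100000):
    """Draw positions uniformly inside a cuboid, keeping only valid ones.

    `is_valid` is a caller-supplied predicate (hypothetical stand-in for the
    occlusion, collision, and mask overlap checks).
    """
    rng = rng or random.Random()
    accepted = []
    tries = 0
    while len(accepted) < n_samples:
        if tries >= max_tries:
            raise RuntimeError("could not find enough valid positions")
        tries += 1
        candidate = tuple(
            rng.uniform(low, high) for low, high in zip(cuboid_min, cuboid_max)
        )
        if is_valid(candidate):
            accepted.append(candidate)
    return accepted


# Illustrative bounds in metres and a toy validity rule (keep z above 0.15).
cuboid_min = (0.3, -0.2, 0.1)
cuboid_max = (0.7, 0.2, 0.4)
is_high_enough = lambda position: position[2] > 0.15
positions = sample_valid_positions(
    20, cuboid_min, cuboid_max, is_high_enough, rng=random.Random(0)
)
```

Sampling uniformly inside the cuboid guarantees the first condition by construction, so only the remaining checks need to be evaluated for each candidate.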
In this paper, we described a procedure to build a dataset for 3D object localization using stereo vision. By using an industrial robot, we have shown that it is possible to generate data with accurate position labels. We proposed an approach to generate data with enough variability to avoid overfitting, as well as a method to remove the robot from some of the images, so that the regressor does not learn to locate the robot instead of the object. We also applied the procedure to build a dataset for screwdriver localization.
The next steps for this work are to exploit this dataset and to design a CNN architecture that encodes the 3D stereo localization pipeline. Another perspective is to apply this process as a pretraining step for reinforcement learning of manipulation tasks, as in [deepVisuo]. By doing so, we can expect to get better results with a policy using stereo images instead of a single camera.
The authors would like to thank Jorge Palos and Matthieu Decobecq for their valuable help. This work was partly supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No 688807.