Long range teleoperation for fine manipulation tasks under time-delay network conditions

by   Jun Jin, et al.

We present a coarse-to-fine approach based semi-autonomous teleoperation system using vision guidance. The system is optimized for long range teleoperation tasks under time-delay network conditions and does not require prior knowledge of the remote scene. Our system initializes with a self exploration behavior that senses the remote surroundings through a freely mounted eye-in-hand web cam. The self exploration stage estimates hand-eye calibration and provides a telepresence interface via real-time 3D geometric reconstruction. The human operator is able to specify a visual task through the interface and a coarse-to-fine controller guides the remote robot enabling our system to work in high latency networks. Large motions are guided by coarse 3D estimation, whereas fine motions use image cues (IBVS). Network data transmission cost is minimized by sending only sparse points and a final image to the human side. Experiments from Singapore to Canada on multiple tasks were conducted to show our system's capability to work in long range teleoperation tasks.



There are no comments yet.


page 1

page 2

page 3

page 4

page 5


Long-Range Route-planning for Autonomous Vehicles in the Polar Oceans

There is an increasing demand for piloted autonomous underwater vehicles...

Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration

We introduce a simple new method for visual imitation learning, which al...

Human-Machine Interface for Remote Training of Robot Tasks

Regardless of their industrial or research application, the streamlining...

Fine Manipulation and Dynamic Interaction in Haptic Teleoperation

Teleoperation of robots enables remote intervention in distant and dange...

Learn and Transfer Knowledge of Preferred Assistance Strategies in Semi-autonomous Telemanipulation

Increasing the autonomy level of a robot hand to accomplish remote objec...

Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images

Long-range context information is crucial for the semantic segmentation ...

Hierarchical Memory Learning for Fine-Grained Scene Graph Generation

As far as Scene Graph Generation (SGG), coarse and fine predicates mix i...

Code Repositories


Implementation and annotation of Free Space Carving algorithm (CARV) using ORB-SLAM map points and camera poses.

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Long range teleoperation for fine manipulation tasks (e.g., remote facility maintenance, offshore oil rig operations, and space missions) under unknown environments often have limited network conditions. There are two major challenges: (i) For the purpose of situation awareness, traditional telepresence methods [sheridan1995teleoperation, beck2013immersive], are no longer practical via low bandwidth network communications as they require real time visual feedback and a fine 3D scene model prior or sound and tactile feedback. (ii) For the purpose of remote control, a fully human supervised system is difficult to use since the response time increases under high latency network conditions. We propose a coarse-to-fine approach to address these challenges.

A coarse-to-fine manipulation approach is commonly used in human eye-hand-coordination when the exact target location or 3D model of the surroundings is unknown. This approach can easily plan the large motion part using a coarse geometric estimation of the scene, and then, if needed, performs fine motions to precisely hit the target using visual texture feedback. Inspired by the same idea, we propose a coarse-to-fine teleoperation system which guides large motions based on a coarse 3D geometric reconstruction and fine motions via visual servoing.

Fig. 1: Design overview: The remote robot begins with a self exploration process which drives an uncalibrated web cam and send back map points. cam poses to human operator. The returned points are used to build a telepresence interface for task specification. Then the robot side coarse-to-fine autonomous controller guides the robot to fulfill the task.

Our proposed system starts with a self exploration process to sense the remote surroundings by randomly driving a freely mounted eye-in-hand web cam through the robots workspace. A visual odometry method [mur2017orb] is used to generate map points throughout this process. After self exploration, hand-eye calibration [horaud1995hand] is estimated using correspondent map points and a coarse telepresence model is obtained by real-time 3D reconstruction. The human operator then specifies a target by clicking on the model, which the system will use to generate large motion commands that drive the robot towards the target. Once confirmation of the coarse movement is received, an image of the remote scene is sent to the telepresence interface for further fine task specification  [gridseth2013bringing]. Image-Based Visual Servoing (IBVS) [chaumette2006visual] control is then used to autonomously guide the robot to the desired position. Our major contributions are:

Fig. 2: Experiment overview: A. Location of the remote robot and human operator; B. Network structure used for data communication; C. Interface for human operator. Up: Selecting a coarse target. Down: Fine manipulation task specification [gridseth2013bringing]; D. Two tasks are designed: 1) hold handler. 2) press button.
  • We propose a long range teleoperation system that works in time-delay network conditions. Capability of our system is demonstrated through experiments using a wireless 3G router to connect between a remote robot and human operator.

  • Our proposed system includes a coarse-to-fine autonomous control which can work in high latency networks. Errors coming from the coarse 3D model and hand-eye calibration are further compensated in the fine manipulation stage using image cues.

  • Our proposed system provides telepresence under low bandwidth network by constructing the 3D geometric model of the remote scene.

  • Our 3D reconstruction process also removes the common constraint in task specification, that the target must be in camera’s field of view, by exploiting the coarse 3D model in our visual interface.

  • The cumbersome camera-robot calibration is automatically calculated in our self exploration stage.

Ii Method

Ii-a Preliminaries

We begin by defining coordinate frames (Fig. 3). Our notation style follows the same definition in [corke2017robotics]. Symbols that will be used in further derivation are listed below:

Notations Descriptions
robot’s base frame.
current end effector’s frame.
current camera’s frame.
end effector’s frame at time i.
camera’s frame at time i.
initial end effector’s frame.
corresponding initial camera’s frame.
desired end effector’s frame.
corresponding desired camera’s frame.
spatial transformation from to .
spatial transformation from to .
spatial transformation from to .
spatial transformation from to .
translation from origin of to defined in
translation from origin of to defined in

It’s worth noting that: a) the end effector’s frame and the camera’s frame have a fixed correspondence relationship since they form a rigid body; b) The initial camera’s frame is the world coordinate frame used to represent map points, camera poses, and reconstructed 3D model in the self exploration stage.

Fig. 3: Coordinate Frames Definition

Ii-B Remote Robot Side: Self Exploration Process

This autonomous process drives a freely mounted webcam following a sphere-like routine to explore remote surroundings. It is used to estimate an unknown remote scene, as well as hand-eye calibration using visual odometry.

Ii-B1 Visual Odometry

In this paper, we use ORB-SLAM[mur2017orb] to generate sparse 3D map points and estimate camera pose. ORB-SLAM is a feature-based method which relies on rich image textures and scene overlaps. Based on the application needs, direct visual odometry methods, such as LSD SLAM [engel2014lsd] or SVO [forster2017svo], may also be used as they generate dense map points which work well for telepresence visual rendering but generally perform poorly for network data transmission. The visual odometry module incrementally outputs map points defined in Frame {}, as well as continuously estimates the cameras pose.

Ii-C Solving Hand-Eye Calibration

Along with the camera pose estimation in the visual odometry module, the corresponding end effector pose is also recorded. Therefore, we are able to construct a dataset with camera poses and end effector poses: , where defines key frame time steps. By decomposing a SE(3) transformation into a translation part and a rotation part, dataset can be further represented as two pairs of correspondence: . Our goal is to estimate two spatial transformations: and .

The camera and end effector form a rigid body, namely and are fixed for all time steps . Let’s denote as .

Ii-C1 From initial camera frame to robots base frame

Assuming the scale matrix that converts camera frame units to physical world units be denoted as D = diag(, , ), we have:


where and does not depend on our parameters and . The objective function is:


This forms a relaxed Orthogonal Procrustes Problem [everson1998orthogonal]). We use a tandem algorithm described in [everson1998orthogonal] to solve , and by iteratively optimizing one while fixing others.

In our experiments, we simply assume as the camera and end effector are situated close to each other, since this error will be further compensated in the fine motion control part using image textures.

Subsequently, can be calculated using an average estimation from all samples. Now we have estimated , and . The transformation from to , i.e., is solved.

Fig. 4: The CARV [lovi2011incremental] algorithm is based on free-space constraint and has the ability to incrementally refine the 3D reconstruction result in real-time. Basic tetrahedron cells are built by triangulation and refined as new points are added (red cross).

Ii-C2 From camera frame to end effector frame

By using orientation correspondences (, ), we have:


where . Now the cost function is:


This is a regular Orthogonal Procrustes Problem. A general solution [schonemann1966generalized] based on SVD can solve . Also, is solved previously, now the transformation from to , i.e., is obtained.

Ii-D Human Side: Visual Telepresence Interface

Ii-D1 Real-time 3D reconstruction

Given a set a map points that are transmitted to human side, real-time 3D geometric reconstruction methods [lovi2011incremental], [niessner2013real], [he2018incremental] can be used to build a visual telepresence interface. We use CARV [lovi2011incremental] in this paper (as shown in Fig. 4), other more sophisticated methods [petit2010multicamera] can also be used to create better rendering effects. The reconstructed 3D geometric model generated in CARV is defined in frame .

Ii-D2 Task specification

As shown in Fig. 5, the operator is required to select a target on the rendered 3D model with a desired end effector orientation. This is called coarse target selection. A console interface is designed (Fig. 6) for the user to select among 4 preset orientations. The task command is then send back to the remote robot side as:

  • : Position of target point defined in {}.

  • : Desired end effector orientation defined in {b}.

After receiving the task command, a coarse controller will drive the robot towards the target. Since the camera is also near the target now, an image is sent back to human side for precise task specification. As shown in [gridseth2013bringing], we use a similar interface to specify a task using image features. The human operators selection is then sent to the remote robot side, where an IBVS controller is used to fulfill the task.

Fig. 5: Visual telepresence interface for task specification. Left Column: Hold handler task; Right Column: Press button task; Up Row: Coarse target selection; Down Row: Precise task specification.
Fig. 6: Human side console interface. It allows user to select a preferred end effector orientation. It also monitors remote task execution status and pops up an image for precise task specification as shown in Fig. 5

Ii-E Remote Robot Side: Coarse-to-Fine Controller

Ii-E1 Coarse motion controller for large motions

The main purpose of our coarse controller is to bring the target into the camera’s field of view by driving the robot towards it, thus fulfilling the large movement part in task execution, which is why a coarse 3D model still works in practice.

One way to drive large motions is to use an inverse kinematics solver. Since all of the calibration transformations are already complete, it is trivial to get our desired end effector’s position if we substitute with in eq. (1).

However, an inverse kinematics solver may cause large joint movements, which is unsafe in practice. For smooth motion purpose, PBVS [hutchinson1996tutorial] is used for coarse motion control. Let’s denote as , since target point can represent the desired end effector position. Given the transformation matrix, it’s trivial to derive basic equations of PBVS control as:


where and . Now we have the desired camera pose. Given current camera pose (,

), we can represent error vector in desired camera’s frame

as :


At last, a simple PD control law can be designed [hutchinson1996tutorial, kragic2003robust] minimizing the error to zero.

Ii-E2 Fine motion controller using image cues

After human operator specifies the task on the send-back image (Fig. 5), fine motion controller will be activated.

Image based visual servoing [chaumette2006visual] is used as our fine motion controller. IBVS can both work in calibrated and uncalibrated scenarios. IBVS has the ability to compensate errors in calibration by directly minimizing errors represented in image space.

Iii Experimental Setup

Iii-a Network Structure

Work sites for long range teleoperation tasks (e.g., offshore oil rig operation, remote facility maintenance) often have poor network infrastructures. Considering this practical issue, we use a 3G wireless router which connects the remote robot to the Internet. Thus, a human operator can work in any locations where there is Internet access (as shown in Fig. 2).

A data communication module based on UDP protocal is designed to transmit map points and images from the remote robot side to the human operator. Task specification data is also sent to robot side for autonomous control purpose. Detailed network structure is shown in Fig. 2.

Fig. 7: Measurement of the final error. For all the two tasks, error is measured as the distance from the final end effector position (finger tip) to the target. Left: hold handler task; Right: press button task. A threshold of 10mm is used to determine success of a trial. The final grasping motion is hard coded since it’s simply a close grasp action.
Metrics Hold handler (10 trials) Press button (10 trials)
Baseline Ours Baseline Ours
Success rate (%) 70 100 70 90
Task duration (s) 1,778.2 471.0+36.0 3,318.2 471.0+125.0
Data size (KB) 10,812.0 4,154+323.8 20,540.0 4,154+376.5
TABLE I: Comparison to baseline method. Results show that our method has higher success rate, less task execution time and lower network data consumption. The self exploration (duration: 471s, datagram size: 4.154KB) phase in our method is only required to do once. All following tasks can share the same model thus task duration and datagram size are reduced.
Fig. 8: Measurement of datagram transmitted via network on task 2: Press Button. x axis: Timeline (s). y axis: Datagram size(KB). Blue dot: Timestamp of sending data. Red dot: Timestamp of receiving response. Green Line: Time delay measured on robot side (from sending to receiving response). Red Line: Time delay measured on human side (from sending to receiving response). The average time delay is 111.4 ms/KB. The majority of total task duration and network data consumption are in our system’s self exploration phase. However, this part is only required to do once at the beginning. Following tasks will be done using the same coarse model.

Iii-B Teleoperation Tasks

Two tasks were carried out in our experiments (as shown in Fig. 2). Both tasks require the remote robot to coarsely estimate its surroundings and return a telepresence model to human operator. Task 1 is a regular manipulation type, while task 2 requires fine manipulation.

  • Task 1: Holder handle. Robot is required to hold a handle based on human operator’s selection.

  • Task 2: Press button. Robot is required to press a button according to human operator’s selection.

Iv Evaluation Results

Iv-a Comparison to baseline method

Iv-A1 Baseline method

Our baseline method is a Cartesian controller which relies on an inverse kinematics solver from MoveIt [moveit]. Remote images are continuously transmitted to the human side, requiring on average 35.2s per frame. There are two intrinsic limitations in baseline method compared to ours:

  1. A target must be always in the camera’s field of view, otherwise a human operator has no idea where the target may be. This is tedious since the view field of an eye-in-hand camera is relative small.

  2. Task execution is fully controlled by human side, which can be challenging and frustrating for the user in practice because of the large response delay.

In order to compare, we made two assumptions: a) Human operator knows roughly where the target is to ensure the baseline method is possible to use. b) Task duration is unlimited, unless the human operator chooses to abort early.

Iv-A2 Comparison metrics

Both the baseline method and our methods include 10 trials of the above mentioned two tasks. We designed 4 metrics for comparison:

  1. Success rate of trials, which is based on a threshold (10mm) of the final error.

  2. Average time to complete task (min). In failed trials, we set it to 100 min for convenience in calculation.

  3. Average network data consumption (kB), which measures the total size of datagrams transmitted via network.

Evaluation results are shown in Table I.

Iv-B Measurement of data communication

For the purpose of analyzing how time-delay network conditions affect the communication between remote robot controller and human operator, we measure two factors in our method: a) datagram size (KB) transmitted via network for all tasks; b) round trip response delays (s) caused by network latency. Since we have the UDP communication interface for both robot side and human side, the above mentioned measurements are taken by logging, sending, and receiving datagrams in both the client and server side. Each datagram will have a short response ending with its datagram ID. So that a round trip delay can be measured without synchronizing the client and server side time. Measurement results are shown in Fig. 8. The average time delay in our ‘press button’ task is 111.4 ms/KB.

V Related Works

Long range teleoperation control has lots of work done using Internet connections [hu2001internet], [uddin2017long]. While it provides a possibility for convenient operation regardless of location, it typically omits a common limitation in practice: network conditions. Low bandwidth and high latency networks are more common, for example, in tasks such as space missions (e.g., 6 to 44 minutes delay in Martian Rover [longrange2018]) where an Internet connection is not available.

Teleoperation under such network conditions is a significant challenge. Khan [khan2010wireless] proposed a wireless network architecture to optimize data transition and control, thus improving network quality. Yokokohji et al.[yokokohji2001bilateral] proposed a PD-based bilateral control method testing on a satellite (ETS_VII) with round-trip delay 6 seconds. However, addressing both telepresence and control in time-delay network conditions is rare. One possible solution is to use robotic vision. Lovi et al. [lovi2010predictive] proposed a method using PTAM [klein2007parallel] and CARV [lovi2011incremental] with a telepresence interface and predictive display to generate rendered intermediate image for human control guidance. However, previous images are needed for a projective texture rendering which requires high network consumption and a fully human supervised control causes difficulties in practice.

For the 3D reconstruction, there are other approaches (e.g., multi-camera based method [petit2010multicamera] and RGB-D data based methods [niessner2013real]), that can create a usable telepresence model, however, they typically require high performance network conditions and machines. For robot side semi-autonomous control using vision guidance, Gridseth et al. [gridseth2013bringing] proposed an interface for task specification based on geometric constraints with error mapping represented in image space. A successive uncalibrated IBVS [chaumette2006visual] controller is used for robot side autonomy.

Vi Conclusions

We present a coarse-to-fine approach based teleoperation system taking into consideration the common network limitation in long range telerobotic control: low bandwidth and high latency. The ‘coarse’ part includes a coarse estimation of the remote scene and a controller for large movements. The ‘fine’ part includes an IBVS [chaumette2006visual] controller for fine motions, which compensates errors in the coarse part by minimizing errors in image space. Only sparse points, an target image and two task commands are transmitted via the network, thus removing the low bandwidth challenge. The robot side autonomous controller removes the high latency challenge. Experiments are designed to showcase our methods ability in such network conditions.

Limitations and future work: We use Orb-SLAM [mur2017orb] to generate map points. However, this feature based method relies highly on rich image textures. In scenarios with a lack of image textures, direct methods (e.g., LSD SLAM [engel2014lsd]) can be used. However, for poor illumination scenarios (e.g., space missions), both of these methods will not perform well. Furthermore, a pure coarse 3D model based telepresence may still cause human understanding difficulties. Generating denser map points may help, however, it consumes a larger network cost. The second limitation comes from our fine manipulation controller: IBVS [chaumette2006visual], which relies on trackers to estimate error in image space. This is a typical challenge in visual servoing. Several approaches are proposed, e.g., direct visual servoing [silveira2012direct], however new challenges arise [bateux2018going]. Other learning based methods [jin2018robot], [levine2016end] could be a future direction.

We are planning to deploy our system on a mobile manipulator. Combined with a motion planning module, our system can work for tasks requiring both self-navigation and manipulation, enhancing application potential in remote facility maintenance tasks, offshore oil rig operations and space missions.