Teleoperation brings the advantage of remote control and manipulation to distant locations or harsh or constrained environments. The system allows operators to send commands from a remote console, traditionally called a master device, to a robot, traditionally called a slave device, and offers synchronization of movements. This allows the remote user to operate as if on-site, making teleoperational systems an ideal and often the only solution for a wide range of applications such as underwater exploration, space robotics, mobile robots, and telesurgery [1, 2].
Perhaps the best-known and still widely used teleoperation system to date is the da Vinci® Surgical System, which is deployed in thousands of hospitals. However, the master and slave systems are operated in the same room, providing some advantages of tremor reduction and stereoscopic vision but none of the remote teleoperation functionality. The main technical challenge in realizing remote telesurgery (and similarly, all remote teleoperation) is the latency caused by the communication distance between the master and slave. Latency can exceed a second when linking sites between continents, and may be higher still when using satellite or space communications. This delay causes overshoot and oscillations in the commanded positions, which are observable and statistically significant with as little as 50 msec of round trip communication delay. Anvari et al. reported on 21 remote laparoscopic surgeries where a distance of over 400 km introduced a delay of 135-140 msec. Furthermore, 300 msec has been stated as the largest delay at which surgeons feel safe [8, 9], and when using satellite communication for operations between London and Toronto, a delay of 560 msec was observed.
Teleoperation under delay is not specific to telesurgery, and its history is too vast to be covered in this paper. Please refer to Hokayem and Spong’s historical survey for a more extensive background on prior work. Nonetheless, early work by Ferrell and Sheridan suggested supervisory control to mitigate the negative effects of delay. This gives the robotic system the capability of making limited decisions on its own while being supervised by the operator; however, the master loses explicit control of the motions of the slave, and thus supervisory control can only be implemented practically at this time in structured environments.
The largest area of research in delayed teleoperation has been haptic feedback. When haptic feedback is given to the operator, it has been experimentally measured and theoretically shown that delay causes instability [12, 13]. While there are some techniques to dampen the unstable overshoot and oscillatory behavior, such as wave variables, these techniques have been shown to increase task completion time under delay, and teleoperating without haptic feedback can often be the better alternative.
Predictive displays are virtual reality renderings, generally designed for space operations, that show a prediction of the events to follow in a short amount of time. They can be used to overcome the negative effects of delay by giving the operator immediate feedback from a predicted environment. Furthermore, they do not suffer from the stability issues that arise with delayed haptic feedback. Early predictive displays included manipulation of the Engineering Test Satellite 7 from ground control, where the round trip delay can be up to 7 sec [17, 18]. Recent work by Winck et al. combined prediction and haptic feedback of dynamic environments. These strategies can be applied to telesurgery, but require overcoming the unique challenges in calculating and tracking the 3D geometry of the surgical environment. While this is feasible for space operations due to the lack of physical disturbances, including gravity, surgery operates at a millimeter scale, and the environment includes tissue that cannot be modeled as rigid bodies and is subject to gravity and other disturbances.
In this paper we propose the first predictive display designed for teleoperated surgical robots. Most crucially, we calculate the predicted behavior of the robot arms and display this prediction in real-time to the operator to anticipate the delayed slave feedback video. The novel contributions of the paper are:
a real-time strategy for kinematically estimated Augmented Reality (AR) registration, rendering, and lens-distorted image overlays for stereoscopic left and right streams of laparoscopic stereo cameras,
an Extended Kalman Filtering (EKF) strategy to address the challenges of visual-mismatch between the prediction and the actual movements that arise from imperfect kinematic calibrations with on-the-fly corrections, and
an adaptive transparency filter that prevents confusion arising from overlapping virtual and real visuals of the robot arms.
We present a complete system, and show through a user study that the result is an efficient AR rendering architecture for streaming stereoscopic displays. We call this a Stereoscopic AR Predictive Display (SARPD). We show that over 30fps for the stereoscopic AR rendering and 24Hz for the slave-tool tracking from the EKF can be achieved while running simultaneously on a commodity GPU and using the robot operating system (ROS). The EKF is also shown to be accurate and therefore an accurate prediction is presented to the operator. A user study is carried out to demonstrate its ability to improve the speed of procedures without affecting error rates. Beyond telesurgery, this solution can be deployed to any teleoperated robot with visual feedback once calibrated. Furthermore, it does not require stereo cameras or displays since the real-time slave-tool tracking is done on a single monocular camera data stream.
A block diagram in Fig. 2 shows the architecture for SARPD, and the variables shown will be used throughout this paper. In surgical systems, such as the da Vinci® Surgical System, translational motions from the operator are scaled down to improve accuracy. This constant scaling relationship to set the slave’s pose, , from the operator’s input pose, , in a teleoperation system under delay is described as follows:
where and are the translational component and quaternion of pose . Equations (1) and (2) give the target pose for the slave, , through the constant scaling for the position and mirroring the operator’s rotational input. Note that both and are poses in the corresponding slave arm base frame. The rotation is mirrored because master consoles such as the da Vinci® Surgical System use wrist orientation as an input. Equation (3) simply highlights the delay channel.
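As a rough illustration of the relationship described above, the following Python sketch scales the operator's translational increments, copies the orientation, and passes the result through a fixed delay. The incremental form of the position update, the function and class names, and the numeric scale are all assumptions made for this sketch; they are not taken from Equations (1) to (3).

```python
from collections import deque

import numpy as np


def scaled_target(p_slave_prev, p_master_prev, p_master, q_master, scale=0.2):
    """Return one slave target (position, quaternion) from a master update.

    Assumption: the position update is incremental, i.e. the master's
    translational change is scaled and added to the previous slave target.
    The default scale is a placeholder, not a value from the paper.
    """
    delta = np.asarray(p_master, dtype=float) - np.asarray(p_master_prev, dtype=float)
    p_target = np.asarray(p_slave_prev, dtype=float) + scale * delta
    q_target = np.asarray(q_master, dtype=float)  # orientation is mirrored (copied)
    return p_target, q_target


class DelayChannel:
    """FIFO that releases each sample `delay_steps` ticks after it was sent."""

    def __init__(self, delay_steps, initial):
        self.buf = deque([initial] * delay_steps)

    def step(self, sample):
        self.buf.append(sample)
        return self.buf.popleft()
```

In such a setup the predictive display would use the undelayed target directly, while the slave side only sees the output of `DelayChannel.step`.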
The predictive display proposed here has two major components: slave-tool tracking by correcting the hand-eye transform in real-time to calculate an accurate prediction and stereoscopic AR rendering to display the prediction. Both are running asynchronously, and ROS is used to pass the data from the slave-tool tracking to the stereoscopic AR rendering.
II-A EKF to Correct Hand-Eye
Ye et al. successfully tracked the slave-tools in real-time by estimating the error of the initial hand-eye calibrations between the base of both slave arms and the left camera, using:
1. virtual slave-tool rendering to generate part-based templates online,
2. template matching between virtual slave-tool parts and image data,
3. geometric context to extract the best location estimates of the slave-tool parts, and
4. EKF to track the error of the hand-eye transform using the previously found 2D estimates for the update.
The hand-eye correction from Ye et al. is implemented on the master side, so the predicted AR slave-tools use the correction corresponding to the image data they will be displayed with. This means the joint angles, , from encoder readings and the image data, , are passed through the communication channel before calculating the correction. This differs from Ye et al., whose work is not concerned with delayed teleoperation. Each slave-tool part is given its own thread in steps 1 and 2 to improve the real-time performance. Fig. 4 is an example photo showing the detected features and rendered slave-tools that use the hand-eye correction.
To initialize the EKF, steps 1 through 3 are completed using the hand-eye transform from calibration, and instead of step 4, a Perspective-n-Point (PnP) solver is applied to the detected features. To improve robustness, this initialization is repeated whenever the distance or the absolute difference in roll, pitch, or yaw between the corrected hand-eye and the initial hand-eye from calibration is greater than a set threshold. This and the EKF updates to the corrected transform, , cause large steps in the hand-eye transform relative to the scale of surgical environments. These large steps make the predicted AR slave-tool appear jumpy and therefore untrustworthy to operators. To smooth this, the following filter is applied to get the output hand-eye transform, :
where and are the quaternion and translation representations of the output, smoothed hand-eye transform, . Likewise, and are the corrected transforms from the EKF’s output, . This is simply a first-order infinite impulse response filter with a smoothing parameter, , and spherical linear interpolation is used to average the rotations.
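The smoothing step described above can be sketched in Python as follows. This is a minimal illustration, assuming quaternions stored as `(w, x, y, z)` and a hypothetical parameter `alpha` that weights the newest EKF output; the paper's actual parameter name, convention, and value are not reproduced here.

```python
import numpy as np


def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0 = np.asarray(q0, dtype=float) / np.linalg.norm(q0)
    q1 = np.asarray(q1, dtype=float) / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:  # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:  # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)


def smooth_hand_eye(p_prev, q_prev, p_ekf, q_ekf, alpha):
    """First-order IIR filter on translation, slerp on rotation.

    alpha = 0 keeps the previous output, alpha = 1 follows the EKF exactly
    (this weighting convention is an assumption of the sketch).
    """
    p_prev = np.asarray(p_prev, dtype=float)
    p_ekf = np.asarray(p_ekf, dtype=float)
    p_out = (1.0 - alpha) * p_prev + alpha * p_ekf
    q_out = slerp(q_prev, q_ekf, alpha)
    return p_out, q_out
```

A small `alpha` trades responsiveness for a steadier overlay, which is the behavior the text asks of the filter.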
II-B Stereoscopic AR Rendering
SARPD is rendered using The Visualization Toolkit (VTK) and OpenGL. The rendering pipeline is shown in Fig. 3. To render the predicted slave-tools, their 3D CAD models are loaded as VTK Actors and transformed into the left camera frame using the most recent filtered hand-eye transform. The predicted joint angles for the slave-tools, , are calculated through inverse kinematics from target pose , as shown in Fig. 2. For the right display rendering, the slave-tools are additionally transformed using the baseline from stereo camera calibration to be in the right camera frame.
Distortion must be applied to the rendered slave-tools since the master console shows the distorted images to the operator. Camera calibration procedures in MATLAB and OpenCV find the coefficients for equations that un-distort barrel and tangential camera distortions. Solving for the inverse of these equations gives multiple solutions. Therefore, bilinear interpolation on the inverse of the discretized un-distort mapping is used to find the distortion map for both the left and right cameras. Fig. 6 shows how far the pixels move when applying the distortion mapping and highlights the importance of using the distortion mapping to have a lens-matched rendering.
Similar to how mappings are applied in OpenCV, the distortion map for each camera is split into two separate mappings, columns and rows. The four mappings are uploaded as textures onto the GPU once at the beginning and then applied to the slave-tool renderings using fragment shaders.
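The idea of inverting a discretized mapping by interpolation, rather than solving the polynomial analytically, can be illustrated with a deliberately simplified one-dimensional case. The sketch below assumes a purely radial two-coefficient model in normalized coordinates and inverts it with a lookup table and linear interpolation; the per-pixel 2D maps with bilinear interpolation described in the text follow the same principle but are not reproduced here. The function names and coefficients are hypothetical.

```python
import numpy as np


def radial_forward(r, k1, k2):
    """Closed-form radial model: maps an input radius to an output radius."""
    r = np.asarray(r, dtype=float)
    return r * (1.0 + k1 * r**2 + k2 * r**4)


def build_inverse_table(k1, k2, r_max, n=2048):
    """Tabulate the forward model on a grid so it can be inverted by lookup."""
    r_in = np.linspace(0.0, r_max, n)
    r_out = radial_forward(r_in, k1, k2)
    if np.any(np.diff(r_out) <= 0.0):
        # A non-monotone forward map has several pre-images per radius,
        # which is the multiple-solution problem mentioned in the text.
        raise ValueError("forward map is not monotone on [0, r_max]")
    return r_out, r_in


def radial_inverse(r_query, table):
    """Invert the forward model by interpolating the tabulated samples."""
    r_out, r_in = table
    return np.interp(r_query, r_out, r_in)
```

Because the table is built once, the inverse lookup is cheap at render time, which mirrors the one-time texture upload described above.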
Two virtual cameras, left and right, render the stereoscopic display. Each virtual camera display overlays the predicted and distorted slave-tool renderings on top of the corresponding image frame, which is uploaded to the GPU as a texture. The left and right rendering pipelines run in parallel, and the VTK slave-tool actors and textures are shared resources on the GPU. Through this implementation optimization and by utilizing fragment shaders to apply the distortion mapping, the rendering pipeline can be run on consumer grade GPUs. Fig. 1 shows an example of the predicted slave-tools and image data rendering.
Even with the real-time hand-eye correction, there are other kinematic inaccuracies that are unaccounted for. Joint angle inaccuracies on the wrist of the slave-tool will cause inconsistency between the AR rendering and the image. This is noticeable when the slave-tools are making no motion. Even small, sub-millimeter inaccuracies were observed to cause confusion in users in initial studies. To overcome this, the opacity, , of the rendered slave-tools is dynamically set with the following equation:
The opacity will increase proportionally with the distance , which is the distance between the new target pose that the prediction shows and the pose in the delayed image. The distance threshold, , is the minimum value must have to not be fully transparent, and simply sets the maximum opacity. The dynamic opacity allows for the operator to use the image data undisturbed when making precise, slow motions where the kinematic inaccuracies would be apparent and naturally use the AR prediction when making larger motions. Fig. 5 shows an example of the dynamic transparency during a ring hand-off.
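A minimal Python sketch of a distance-driven opacity rule of this kind is given below. The exact form of the equation is not reproduced in this text, so the ramp used here (zero below the threshold, then linear in the excess distance, then clamped) is an assumption; a variant proportional to the raw distance above the threshold would be equally consistent with the description. The parameter names `threshold`, `gain`, and `max_opacity` are hypothetical.

```python
def dynamic_opacity(distance, threshold, gain, max_opacity=1.0):
    """Opacity of the rendered slave-tools as a function of prediction distance.

    distance:    gap between the predicted target pose and the pose seen
                 in the delayed image
    threshold:   below this distance the overlay is fully transparent
    gain:        slope of the linear ramp (assumed form)
    max_opacity: upper clamp on the returned opacity
    """
    if distance <= threshold:
        return 0.0
    return min(max_opacity, gain * (distance - threshold))
```

With this shape, slow and precise motions leave the camera image unobstructed, while larger motions fade the prediction in.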
III Experimental Setup
To initialize the hand-eye transform for the EKF, we provide a calibration method that only requires a checkerboard rigidly mounted on the slave-tools (and only during the calibration phase). This is the only additional hardware required to implement the proposed predictive display, and calibration is only required once before a procedure. We measure the accuracy of the slave-tool tracking with a motion capture system (OptiTrack’s V120:Trio). Performance and latency tests were also conducted to show the efficiency of SARPD.
To measure the effectiveness of SARPD, a user study with ten participants was conducted on the da Vinci® Surgical System. The user study ensures that participants have enough time to practice such that they can get over the learning curve of the system and environment. Errors and time to complete task are used as metrics, and statistical testing was done to draw conclusions about SARPD’s performance. A recent study found that when using satellite communication for telesurgery between London and Toronto, a round trip delay of 560 msec was measured. A round trip delay of 1 sec was therefore used for the delayed environment in the user study to ensure SARPD performs in realistically high latency remote operations.
SARPD is rendered to the master console at 30fps. The operator manipulates two master arms in the console to produce the operator’s input pose. There are two 7-DOF slave arms which move to the target slave pose. Modifications were made to the da Vinci Research Kit (dVRK) to support the architecture proposed in Fig. 2 for the da Vinci® Surgical System. The computer used to run the modified dVRK and SARPD has an Intel® Core™ i9-7940X Processor and NVIDIA’s GeForce GTX 1060.
III-A Initial Hand-Eye Calibration
To find the initial hand-eye transform, a checkerboard is rigidly attached to the end-effector of the slave-tool. The arm is moved around and photos are taken of the checkerboard and the corresponding joint angles are recorded. The -th corner on the checkerboard in the checkerboard frame, , with side length , is projected onto the image plane with the following equation:
projects a point onto the image plane using the camera matrix. is the pose of the checkerboard which is calculated by using forward kinematics from the recorded joint angles at image , and is the constant offset error of the calculated which accounts for mounting errors of the checkerboard. To solve for the hand-eye transform in a robust manner, the following optimization problem is solved when there are corners on the checkerboard and recorded images and joint angles:
where is the pixel position of the checkerboard corner on the recorded image. Optimizing the constant offset of the checkerboard and side length, , will help account for alignment errors and printing inaccuracies.
is initially set to the identity matrix, all three positional components are constrained to cm, and the rotation about each axis is constrained by . The side length, , is initially set to the side length of the checkerboard and bounded by mm.
To initialize the hand-eye transform, a PnP solver is used on each recorded image to find the pose of the camera in the checkerboard frame. This is then further transformed with the corresponding to get image ’s individual estimate of the hand-eye transform. The initial value for the hand-eye is simply set to the arithmetic average of the positional component and the roll, pitch, and yaw of all of the images’ individual estimates. Similar to the constant offset transform, the hand-eye is constrained by cm and from the initialization for the positional and rotational components respectively. The optimization is solved using MATLAB’s fmincon function, and Fig. 7 shows an example result of the re-projected corners on a recorded image.
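To make the calibration objective more concrete, the sketch below evaluates a reprojection cost in Python for a candidate hand-eye transform, checkerboard offset, and side length. It is only an illustration: the transform chain (camera from base, base from end-effector via forward kinematics, end-effector from checkerboard), the planar corner layout, the squared-pixel cost, and every name in the code are assumptions, and lens distortion is ignored. A bounded optimizer (the paper uses MATLAB's fmincon) would minimize such a cost over the unknowns.

```python
import numpy as np


def project(K, points_cam):
    """Pinhole projection of Nx3 camera-frame points with camera matrix K."""
    uvw = (K @ np.asarray(points_cam, dtype=float).T).T
    return uvw[:, :2] / uvw[:, 2:3]


def checkerboard_corners(rows, cols, side):
    """Corner positions in the checkerboard frame (planar grid, z = 0)."""
    return np.array([[c * side, r * side, 0.0]
                     for r in range(rows) for c in range(cols)])


def reprojection_cost(K, T_cam_base, T_base_ee_list, T_ee_board,
                      side, rows, cols, pixels_list):
    """Sum of squared pixel errors over all recorded images.

    T_cam_base:     candidate hand-eye transform (4x4)
    T_base_ee_list: forward-kinematics poses, one 4x4 per image
    T_ee_board:     constant checkerboard mounting offset (4x4)
    pixels_list:    detected corner pixels, one (rows*cols)x2 array per image
    """
    corners = checkerboard_corners(rows, cols, side)
    corners_h = np.hstack([corners, np.ones((len(corners), 1))])
    cost = 0.0
    for T_base_ee, pixels in zip(T_base_ee_list, pixels_list):
        T = T_cam_base @ T_base_ee @ T_ee_board
        pts_cam = (T @ corners_h.T).T[:, :3]
        cost += float(np.sum((project(K, pts_cam) - pixels) ** 2))
    return cost
```

Including the side length and mounting offset as arguments reflects the text's point that both are optimized alongside the hand-eye transform.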
III-B Motion Capture Measurement
OptiTrack’s V120:Trio was used to measure the performance of the initial hand-eye calibration and the real-time correction of it on the da Vinci® Surgical System. Due to the location of the base frame, as shown in Fig. 9, the orientation cannot be recovered without measuring joint angles 1 and 2, which can only be done with encoder readings from the slave arms. Therefore, the performance will be evaluated by comparing the distance between the slave arms’ base frames.
To get the position of the slave arms’ base frames with the motion capture system, retroreflective markers are rigidly mounted as shown in Fig. 9. The tracked frame from the retroreflective markers is then shifted down the z-axis such that the position is at the base of the slave arm. Since the position of the base on the slave arms is invariant to rotations from the joints, the positional measurement of each base is repeated with different joint angle configurations and the average is taken to get an accurate measurement. Using the average positions of the two bases in the motion capture frame, and , the error is defined as:
are the positional components of the hand-eye transforms from the EKF directly, from the smoothing filter applied after the EKF, and from the initial calibration. The precision of this measurement is the standard deviation of the repeated positional measurements from the motion capture. Since the geometry of the rigidly attached markers and the slave arm is known, there should be no bias in the measurement. To further reduce any potential biases, the markers’ geometry and the slave arms were measured with sub-millimeter precision via calipers.
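One plausible reading of this comparison is sketched below in Python: the distance between the two arm bases measured by motion capture is compared with the distance implied by the positional components of the two hand-eye transforms. Since the error equation itself is not reproduced in this text, the absolute-difference form and all names here are assumptions.

```python
import numpy as np


def base_distance_error(b1_mocap, b2_mocap, p1_cam, p2_cam):
    """Absolute difference between measured and estimated base separation.

    b1_mocap, b2_mocap: averaged base positions in the motion capture frame
    p1_cam, p2_cam:     base positions in the camera frame, taken from the
                        positional part of each arm's hand-eye transform
    """
    d_mocap = np.linalg.norm(np.asarray(b1_mocap, float) - np.asarray(b2_mocap, float))
    d_est = np.linalg.norm(np.asarray(p1_cam, float) - np.asarray(p2_cam, float))
    return abs(d_mocap - d_est)
```

Comparing distances rather than positions avoids needing a registration between the motion capture frame and the camera frame, because a distance is the same in either frame.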
III-C Task for User Study
A peg transfer task is the sample task used in the user study due to the complex motions it involves. A photo of the environment from the endoscope is provided in Fig. 10. To complete the task, the operator must pick up the ring from the front right or left peg with the corresponding arm, pass the ring to the other arm, place the ring onto the opposite front peg, and then repeat with the back pegs and ring.
Time to complete task and weighted error are used to evaluate performance of completing the task. To find the weighted error, all errors that occurred during a trial are counted and weighted according to Table 1. Example errors are shown in Fig. 8. The weightings were chosen such that the severity of the errors would be reflected properly when evaluating the performance of completing the peg transfer. Each participant was also shown a video of a complete task with no errors and examples of the errors before beginning the procedure.
| Error | Weight |
| Stretch ring during handoff for a second or less | 2 |
| Stretch ring on peg for a second or less | 4 |
| Stretch ring for an additional second | 4 |
| Knock down peg | 20 |
Table 1: Weights associated with type of error
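The weighted error for a trial is then a weighted count. A small Python sketch, using only the rows listed in Table 1 and hypothetical key names for the error types:

```python
# Weights copied from the rows listed in Table 1; the key names are
# illustrative labels, not identifiers from the study.
ERROR_WEIGHTS = {
    "stretch_during_handoff": 2,
    "stretch_on_peg": 4,
    "stretch_additional_second": 4,
    "knock_down_peg": 20,
}


def weighted_error(error_counts):
    """Sum each error type's count multiplied by its weight."""
    return sum(ERROR_WEIGHTS[name] * count for name, count in error_counts.items())
```

For instance, a trial with one stretch during handoff and one knocked-down peg would score 2 + 20 = 22.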
III-D Procedure for User Study
After introducing the participant to the study and showing the video, the participant goes through the following procedure:
1. Practice: complete the task twice under no delay and with no SARPD
2. Record: complete the task once under no delay and with no SARPD
3. Repeat steps 1 and 2 under delay and with no SARPD
4. Repeat steps 1 and 2 under delay and with SARPD
Two practice trials are always completed before every recording, so that the participant can overcome the learning curve for the new environment. To further ensure this, participants are also offered additional practice before recording. The order was chosen so that participants only encounter one new environmental effect at a time. Through initial experimentation we set the following values: (the default value for dVRK) for equation (1), for equation (4) and (6), and mm, , and for equation (8).
Table 2 reports the results of real-time slave-tool tracking using both the original work and the parallelization strategies in this paper while simultaneously running the stereoscopic AR rendering and the modified dVRK, reaching 24Hz. Tool Rendering refers to the online rendering in the part-based template generation procedure and Tool Tracking is the total tool tracking rate. During this measurement, the stereoscopic AR rendering ran asynchronously at 36fps.
| | Tool Rendering | Tool Tracking |
| Original | 26 Hz | 13 Hz |
| Parallelized | 50 Hz | 24 Hz |
Table 2: Performance of slave-tool tracking in SARPD
We also measured the latency of the entire image pipeline, which is the time from when an image is captured to when it is displayed to the operator under no delay. When using SARPD, the latency was measured to be ms. To compare, we ran the same experiment with ROS’s to display the raw image data and measured a latency of ms. ROS’s is the recommended way to view images in dVRK, so our solution also improves on the latency of the system. Finally, Fig. 12 shows the accuracy results of the slave-tool tracking compared to OptiTrack’s V120:Trio, indicating SARPD’s ability to effectively localize features in 3D. The precision of the measurement was less than 1 mm.
| | Weighted Error | Completion Time (sec) |
| No SARPD, d=0sec | | |
| No SARPD, d=1sec | | |
Table 3: Average and standard deviation from the user study results
The results of the user study comparing delayed teleoperation with and without SARPD are shown in Fig. 11 and the statistics are given in Table 3. To identify a statistically significant ( ) change in weighted error or time to complete task under the different conditions, two-sided paired t-tests were performed. No statistically significant difference was measured in weighted errors when comparing SARPD under delay with not using SARPD under both no delay () and delay (). Repeating for time to complete task, statistical significance was measured when comparing SARPD under delay with not using SARPD under both no delay () and delay (). To conclude, we measured that using SARPD on average decreases the time to complete task by 19% when under delay while having no statistically significant change in errors.
V Discussions and Conclusion
SARPD is the first predictive display developed for teleoperated surgical robots and has been shown to be an efficient system. On a consumer grade GPU, the tool-tracking is computed in real-time at high accuracy and the AR rendering pipeline is able to support two 1080p displays, each running above 30fps. By utilizing AR to show the prediction, no assumptions need to be made about the environment, so SARPD can be applied to any calibrated teleoperated system under delay. SARPD also on average decreases time to complete task while not affecting the number of errors. This is an expected result since the AR rendering should not assist when small, precise motions are made, which is when the errors in our user study occur.
In future work, to further improve the accuracy of the AR Prediction, the EKF can be expanded to also track joint angles. This allows for a more precise prediction to be made when no motion is applied to a joint. As seen in our results, the tool rendering procedure in the tool-tracking component is running at twice the rate of the total tool-tracking component. This implies that there is enough headroom to use the secondary camera to get more features for the EKF update. With these improvements, an even more consistent and accurate prediction can be given to the operator when controlling a teleoperated robot under delay.
[1] M. Yip and N. Das, “Robot autonomy for surgery,” arXiv preprint arXiv:1707.03080, 2017.
[2] P. F. Hokayem and M. W. Spong, “Bilateral teleoperation: An historical survey,” Automatica, vol. 42, no. 12, pp. 2035–2057, 2006.
[3] A. R. Lanfranco, A. E. Castellanos, J. P. Desai, and W. C. Meyers, “Robotic surgery: A current perspective,” Annals of Surgery, vol. 239, no. 1, pp. 14–21, Jan 2004.
[4] M. J. H. Lum, D. C. W. Friedman, H. H. I. King, R. Donlin, G. Sankaranarayanan, T. J. Broderick, M. N. Sinanan, J. Rosen, and B. Hannaford, “Teleoperation of a surgical robot via airborne wireless radio and transatlantic internet links,” in 6th International Conference on Field and Service Robotics - FSR 2007, vol. 42 of Springer Tracts in Advanced Robotics, (Chamonix, France), Springer, July 2007.
[5] A. Kapoor, R. Kumar, and R. Taylor, “Simple biomanipulation tasks with ’steady hand’ cooperative manipulator,” Lect Notes Comput Sc., pp. 141–148, 2003.
[6] A. Schweikard, R. Tombropoulos, L. Kavraki, J. Adler, and J. Latombe, “Treatment planning for a radiosurgical system with general kinematics,” IEEE Intl. Conf. on Robotics and Automation, pp. 1720–1727, 1994.
[7] M. Anvari, C. McKinley, and H. Stein, “Establishment of the world’s first telerobotic remote surgical service,” Annals of Surgery, vol. 241, no. 3, pp. 460–464, Mar 2005.
[8] J. Marescaux, J. Leroy, M. Gagner, F. Rubino, D. Mutter, M. Vix, S. E. Butner, and M. K. Smith, “Transatlantic robot-assisted telesurgery,” Nature, pp. 379–380, Sep 2001.
[9] S. Xu, M. Perez, K. Yang, C. Perrenot, J. Felblinger, and J. Hubert, “Determination of the latency effects on surgical performance and the acceptable latency levels in telesurgery using the dV-Trainer® simulator,” Surg Endosc., pp. 2569–76, 2014.
[10] C. Y. Nguan, R. Morady, C. Wang, D. Harrison, D. Browning, R. Rayman, and P. P. W. Luke, “Robotic pyeloplasty using internet protocol and satellite network-based telesurgery,” The International Journal of Medical Robotics and Computer Assisted Surgery, vol. 4, pp. 10–14, Mar 2008.
[11] W. R. Ferrell and T. B. Sheridan, “Supervisory control of remote manipulation,” IEEE Spectrum, vol. 4, pp. 81–88, Oct 1967.
[12] W. R. Ferrell, “Delayed force feedback,” Human Factors, vol. 8, no. 5, pp. 449–455, 1966. PMID: 5966936.
[13] R. J. Anderson and M. W. Spong, “Bilateral control of teleoperators with time delay,” IEEE Transactions on Automatic Control, vol. 34, pp. 494–501, May 1989.
[14] G. Niemeyer and J.-J. E. Slotine, “Using wave variables for system analysis and robot control,” in Proceedings of International Conference on Robotics and Automation, vol. 2, pp. 1619–1625, April 1997.
[15] M. C. Yip, M. Tavakoli, and R. D. Howe, “Performance analysis of a manipulation task in time-delayed teleoperation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5270–5275, Oct 2010.
[16] M. C. Yip, M. Tavakoli, and R. D. Howe, “Performance analysis of a haptic telemanipulation task under time delay,” Advanced Robotics, vol. 25, no. 5, pp. 651–673, 2011.
[17] W.-K. Yoon, T. Goshozono, H. Kawabe, M. Kinami, Y. Tsumaki, M. Uchiyama, M. Oda, and T. Doi, “Model-based space robot teleoperation of ETS-VII manipulator,” IEEE Transactions on Robotics and Automation, vol. 20, pp. 602–612, June 2004.
[18] T. Imaida, Y. Yokokohji, T. Doi, M. Oda, and T. Yoshikawa, “Ground-space bilateral teleoperation of ETS-VII robot arm by direct bilateral coupling under 7-s time delay condition,” IEEE Transactions on Robotics and Automation, vol. 20, pp. 499–511, June 2004.
[19] R. C. Winck, S. M. Sketch, E. W. Hawkes, D. L. Christensen, H. Jiang, M. R. Cutkosky, and A. M. Okamura, “Time-delayed teleoperation for interaction with moving objects in space,” in IEEE International Conference on Robotics and Automation, pp. 5952–5958, May 2014.
[20] M. Ye, L. Zhang, S. Giannarou, and G.-Z. Yang, “Real-time 3D tracking of articulated tools for robotic surgery,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Cham), pp. 386–394, Springer International Publishing, 2016.
[21] P. Kazanzides, Z. Chen, A. Deguet, G. S. Fischer, R. H. Taylor, and S. P. DiMaio, “An open-source research kit for the da Vinci® Surgical System,” IEEE Intl. Conf. on Robotics and Automation, pp. 6434–6439, 2014.