Tracking of medical instruments and tools is required for various systems in medical imaging as well as computer-aided interventions. In general, tracking systems provide the rigid body transformation of one or multiple targets with respect to a common reference frame, which can be the patient, a camera system, or any pre-calibrated coordinate space. Especially for medical applications, accurate tracking is an important requirement, but it often comes with severe drawbacks impacting the medical workflow. Mechanical tracking systems (robotic arms or linear stages) can provide highly precise tracking through a kinematic chain. However, these systems often require bulky and expensive equipment, which cannot be adapted to a clinical environment where high flexibility needs to be ensured. In contrast, electromagnetic tracking is flexible in its use, but is limited to comparably small workspaces and can interfere with metallic objects in proximity to the target, severely reducing the accuracy.
Optical tracking systems (OTS) enjoy widespread use as they do not have these disadvantages. Usually, a set of active or passive infrared markers is attached to the target and tracked by a static external stereo camera. Despite favourable spatial accuracy under optimal conditions, such systems suffer from the constraints of the required line-of-sight. Especially in interventional settings, this can impair the positioning of equipment and staff, as occlusions of the markers occur while they are in use. Robust marker-based methods address this problem and work even if the target is only partly visible. However, the marker-visibility issue is further complicated for imaging solutions relying on tracking systems, with prominent examples being freehand SPECT imaging as well as freehand 3D ultrasound imaging.
Aiming at both accurate and flexible systems for 3D imaging, a series of developments have been proposed recently. Inside-out tracking for collaborative robotic imaging proposes a marker-based approach using infrared cameras, but does not resolve line-of-sight issues. A first attempt at making use of localized features employs tracking of specific skin features to estimate 3D poses in 3D US imaging. While this work shows promising results, it is constrained to the specific anatomy at hand.
In contrast to previous works, our aim is to provide a generalizable tracking approach without requiring a predefined or application-specific set of features. With the recent advent of advanced miniaturized camera systems, we evaluate an inside-out tracking approach relying solely on features extracted from image sensor data for pose tracking. For a generic inside-out tracking approach that is robust in different environments, the sole geometric information of the scene, without any prior knowledge, is vital for orientation. For this purpose, we propose the use of visual methods for simultaneously mapping the scene and localizing the system within it. This is enabled by building up a map from characteristic structures within the previously unknown scene observed by a camera, which is known as simultaneous localization and mapping (SLAM). SLAM methods can be divided into direct and feature-based methods, each with characteristic drawbacks and benefits. Direct SLAM approaches take the whole image information into account [10, 11]; they may produce erroneous poses under changing lighting conditions, require good initialization, and are not able to recover poses correctly for rolling-shutter cameras. In contrast, feature-based methods rely on extracted feature points, which leads to more stable tracking behaviour during illumination changes, but requires a minimum amount of structure within the scene. Different image modalities can be used for visual SLAM, and a stereo setup possesses many benefits compared to monocular vision or active depth sensors.
On this foundation, we propose a flexible inside-out tracking approach relying on image features and poses retrieved from SLAM. We evaluate different methods in direct comparison to a commercial tracking solution and ground truth, and show an integration with freehand 3D US imaging as one potential use case. The proposed prototype is a first proof of concept for SLAM-based inside-out tracking in interventional applications, applied here to 3D TRUS as shown in Fig. 2. The novelty of pointing the camera away from the patient into the quasi-static room, while constantly updating the OR map, yields advantages in terms of robustness, rotational accuracy and avoidance of line-of-sight problems. Thus, no hardware relocalization of external outside-in systems is needed, partial occlusion is handled with wide-angle lenses, and the method copes with dynamic environmental changes. Moreover, it paves the way for automatic multi-sensor alignment through a shared common map while maintaining easy installation by clipping the sensor to tools.
For interventional imaging, and specifically for the case of 3D ultrasound, the goal is to provide rigid body transformations of a desired target with respect to a common reference frame. We denote the rigid transformation from frame A to frame B as {}^{B}T_{A}. On this foundation, the transformation from the ultrasound image frame (I) should be expressed in a desired world coordinate frame (W). In the case of inside-out tracking, and in contrast to outside-in approaches, the ultrasound probe is rigidly attached to the camera system (C), providing the desired relation to the world reference frame.
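Assuming the frame labels W, C and I for the world, camera and ultrasound image coordinate frames (the original symbols are not recoverable from the text, so this notation is our reconstruction), the inside-out pose chain can be written as

```latex
{}^{W}T_{I} \;=\; \underbrace{{}^{W}T_{C}}_{\text{SLAM pose}} \cdot \; \underbrace{{}^{C}T_{I}}_{\text{US calibration}}
```

where {}^{W}T_{C} is the camera pose estimated by the visual SLAM at every frame, and {}^{C}T_{I} is the fixed rigid offset between the camera and the ultrasound image plane, obtained once from calibration (Sec. 2).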
Inside-out tracking is proposed on the foundation of a miniature camera setup as described in Sec. 3. The setup provides different image modalities for the visual SLAM. Monocular SLAM is not suitable for our needs, since it requires an appropriate translation without rotation within the first frames for proper initialization and suffers from drift due to accumulating errors over time. Furthermore, the absolute scale of the reconstructed map and the trajectory is unknown, due to the arbitrary baseline induced by the non-deterministic initialization used to find a suitable translation. The latter is needed to triangulate matched feature points between two views. Relying on the depth data from the sensor would not be sufficient for the desired tracking accuracy, due to noisy depth information. A stereo setup, in contrast, can account for absolute scale through its known fixed baseline, and purely rotational movements can be handled, since matched feature points can be triangulated for each frame.
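The fixed-baseline argument can be made concrete: for a rectified stereo pair, a matched feature with disparity d yields metric depth directly via Z = fB/d, which is why scale is observable in every frame. A minimal numpy sketch (the focal length and baseline values are illustrative, not those of our sensor):

```python
import numpy as np

def stereo_depth(disparity_px, focal_px, baseline_m):
    """Metric depth from disparity for a rectified stereo pair: Z = f * B / d."""
    d = np.asarray(disparity_px, dtype=float)
    return focal_px * baseline_m / d

# Illustrative values: 380 px focal length, 50 mm baseline.
z = stereo_depth(np.array([19.0, 9.5]), focal_px=380.0, baseline_m=0.05)
# smaller disparity -> larger depth
```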
For the evaluations, we run experiments with publicly available SLAM methods for better reproducibility and comparability. ORB-SLAM2 is used as the state-of-the-art feature-based method. The well-known direct methods [10, 11] are not eligible due to their restriction to monocular cameras. We rely on the recent, publicly available stereo implementation of Direct Sparse Odometry (DSO) (https://github.com/JiatianWu/stereo-dso; Horizon Robotics, Inc., Beijing, China; authors: Jiatian Wu, Degang Yang, Qinrui Yan, Shixin Li).
The intrinsic camera parameters of the involved monocular and stereo cameras (RGB, IR1, IR2) are estimated with an established calibration method. We use the standard pinhole camera model with two radial distortion coefficients. The stereo geometry is calculated via OpenCV (https://github.com/itseez/opencv). For the rigid transformation from the robotic end effector to the inside-out camera, we use the Tsai-Lenz hand-eye calibration algorithm in its eye-on-hand variant as implemented in ViSP, and the eye-on-base version to obtain the rigid transformation from the optical tracking system to the robot base. To calibrate the ultrasound image plane with respect to the different tracking systems, we use the open-source PLUS ultrasound toolkit, provide a series of correspondence pairs using a tracked stylus pointer, and retrieve the desired transformation matrix.
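The camera model referred to above can be sketched as a generic pinhole projection with two radial distortion coefficients; the intrinsics and coefficients below are illustrative stand-ins, not our calibrated values:

```python
import numpy as np

def project_pinhole_radial(X, K, k1, k2):
    """Project camera-frame 3D points with a pinhole model and two
    radial distortion coefficients (k1, k2)."""
    X = np.atleast_2d(X).astype(float)
    x = X[:, 0] / X[:, 2]                    # normalized image coordinates
    y = X[:, 1] / X[:, 2]
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2    # radial distortion polynomial
    xd, yd = x * factor, y * factor
    u = K[0, 0] * xd + K[0, 2]               # apply focal length and principal point
    v = K[1, 1] * yd + K[1, 2]
    return np.stack([u, v], axis=1)

# Illustrative intrinsic matrix (fx = fy = 380 px, principal point at 320, 240).
K = np.array([[380.0,   0.0, 320.0],
              [  0.0, 380.0, 240.0],
              [  0.0,   0.0,   1.0]])
uv = project_pinhole_radial([[0.1, 0.0, 1.0]], K, k1=-0.1, k2=0.01)
```

A point on the optical axis projects exactly to the principal point regardless of k1 and k2, which is a quick sanity check for any implementation of this model.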
3 Experiments and Validation
To validate the proposed tracking approach, we first evaluate the tracking accuracy, followed by a specific analysis for the suitability to 3D ultrasound imaging. We use a KUKA iiwa (KUKA Roboter GmbH, Augsburg, Germany) 7 DoF robotic arm to gather ground truth tracking data, as it provides a guaranteed positional reproducibility of mm. To provide a realistic evaluation, we also utilize an optical infrared-based outside-in tracking system (Polaris Vicra, Northern Digital Inc., Waterloo, Canada). Inside-out tracking is performed with the Intel RealSense Depth Camera D435 (Mountain View, US), providing RGB and infrared stereo data in a portable system (see Fig. 3).
Direct and feature-based SLAM methods for markerless inside-out tracking are compared and evaluated against marker-based optical inside-out tracking with ArUco markers and classical optical outside-in tracking. For the former, ArUco markers with a size of cm are placed in the acquisition room. For a quantitative analysis, a combined marker with an optical target and a miniature vision sensor is used (see Fig. 3) and attached to the robot end effector. The robot is controlled using the Robot Operating System (ROS), while the camera acquisition is done on a separate machine using the Intel RealSense SDK (https://github.com/IntelRealSense/librealsense). Images are acquired with a resolution of pixels at a frame rate of Hz. The pose of the RGB camera and the tracking target are communicated via TCP/IP with a publicly available library (https://github.com/IFL-CAMP/simple). The images are processed on an Intel Core i7-6700 CPU (64 bit, 8 GB RAM) running Ubuntu 14.04. We use the same constraints as in a conventional TRUS acquisition; thus, the scanning time, covered volume and distance of the tracker are directly comparable, and the error analysis reflects this specific procedure with all involved components. Fig. 4 shows the clinical environment for the quantitative evaluation together with the inside-out view and the extracted image information for the different SLAM methods.
3.1 Tracking Accuracy
To evaluate the tracking accuracy without a specific application to imaging, we use the setup described above and acquire a series of pose sequences. The robot is programmed to run in gravity compensation mode such that it can be directly manipulated by a human operator. The forward kinematics of the robotic manipulator are used as ground truth (GT) for the actual movement.
To allow for error evaluation, we transform all poses of the different tracking systems into a joint coordinate frame coinciding with the RGB camera of the end-effector mount (see Fig. 3 for an overview of all reference frames), providing a direct way to compare the optical tracking system (OTS), the SLAM-based methods (SR), and the ArUco-based tracking (AR). Multiple calibrations have shown that the residuals from the robotic hand-eye calibration are negligible with respect to the tracking and US calibration.
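Bringing all trackers into a joint frame amounts to composing 4x4 rigid transforms; a minimal numpy sketch under assumed names and illustrative offsets (the concrete calibration matrices come from the hand-eye procedure of Sec. 2):

```python
import numpy as np

def rigid(R, t):
    """Assemble a 4x4 homogeneous transform from R (3x3) and t (3,)."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(R, dtype=float)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T

def invert(T):
    """Closed-form inverse of a rigid transform (avoids a general matrix inverse)."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

# Hypothetical chain: the OTS pose of its target, re-expressed at the RGB
# camera via a fixed, hand-eye calibrated target-to-camera offset.
T_ots_target = rigid(np.eye(3), [0.1, 0.0, 0.5])
T_target_cam = rigid(np.eye(3), [0.0, 0.02, 0.0])
T_ots_cam = T_ots_target @ T_target_cam   # camera pose in the OTS frame
```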
Overall, 5 sequences were acquired with a total of 8698 poses. The pose errors for all compared systems are indicated in Fig. 5, where the translation error is given by the RMS of the residuals compared with the robotic ground truth, while the illustrated angle error gives the angular deviation of the rotation axis.
From the results it can be observed that, compared to GT, optical tracking provides the best translational results with errors of 1.90 ± 0.53 mm, followed by 2.65 ± 0.74 mm for ORB-SLAM, 3.20 ± 0.96 mm for DSO, and 5.73 ± 1.44 mm for ArUco. Interestingly, for orientation the SLAM-based methods provide better results compared to OTS, with errors of 1.99° ± 1.99° for ORB-SLAM, followed by 3.99° ± 3.99° for DSO, respectively. OTS estimates result in errors of 8.43° ± 6.35°, and ArUco orientations are rather noisy with 29.75° ± 48.92°.
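For reproducibility, the two error measures can be stated precisely; a numpy sketch (function names are ours): the translational error is the RMS of the residual norms against ground truth, and the rotational error is the angle of the residual rotation between the estimated and ground-truth orientations.

```python
import numpy as np

def translation_rmse(t_est, t_gt):
    """RMS of the translational residual norms against ground truth."""
    res = np.linalg.norm(np.asarray(t_est, dtype=float)
                         - np.asarray(t_gt, dtype=float), axis=1)
    return float(np.sqrt(np.mean(res ** 2)))

def angular_error_deg(R_est, R_gt):
    """Angle (degrees) of the residual rotation R_gt^T R_est."""
    R = np.asarray(R_gt).T @ np.asarray(R_est)
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```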
3.2 Markerless Inside-Out 3D Ultrasound
On the foundation of these favourable tracking characteristics, we evaluate the performance of a markerless inside-out 3D ultrasound system by means of image quality and reconstruction accuracy for a 3D US compounding. For imaging, the tracking mount shown in Fig. 3 is integrated with a 128-element linear transducer (CPLA12875, 7 MHz) connected to a cQuest Cicada scanner (Cephasonics, CA, USA). For data acquisition, a publicly available real-time framework is employed (https://github.com/IFL-CAMP/supra) in conjunction with ROS, and calibration is performed using a stylus as described in Sec. 2. We perform a sweep acquisition, comparing OTS outside-in tracking with the proposed inside-out tracking, and evaluate the quality of the reconstructed data, deploying a dedicated method for temporal pose synchronization. Fig. 6 shows a qualitative comparison of the 3D US compoundings for the same sweep with the different tracking methods.
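Temporal pose synchronization requires interpolating poses between tracking timestamps; the rotational part can be interpolated on the unit quaternion sphere. A minimal slerp sketch in numpy, as our own simplified stand-in for the deployed synchronization method (quaternions in (w, x, y, z) order):

```python
import numpy as np

def slerp(q0, q1, alpha):
    """Spherical linear interpolation between unit quaternions q0 and q1
    for a blend factor alpha in [0, 1], e.g. derived from timestamps."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                  # take the shorter arc on the 4-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:               # nearly parallel: fall back to normalized lerp
        q = q0 + alpha * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1.0 - alpha) * theta) * q0
            + np.sin(alpha * theta) * q1) / np.sin(theta)
```

Interpolating halfway between the identity and a 180° rotation about z yields a 90° rotation about z, which is a convenient correctness check.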
4 Discussion and Conclusion
From our evaluation, it appears that ArUco markers are viable only for approximate positioning within a room rather than for accurate tracking. Our proposed inside-out approach shows valuable results compared to the standard OTS and even outperforms the outside-in system in terms of rotational accuracy. These findings concur with expectations based on the camera system design: small rotations around any axis close to the optical principal point of the camera lead to severe changes in the viewing angle, which can visually be described as an inside-out rotation leverage effect.
One main advantage of the proposed method is its usability in practice. By not relying on specific markers, there is no need to set up an external system or to change the setup during procedures. Additionally, we can avoid line-of-sight problems and potentially allow for highly accurate tracking even for complete rotations around the camera axis without losing tracking. Beyond the results above, our proposed method is capable of orienting itself within an unknown environment by mapping its surroundings from the beginning of the procedure. This map is built up from scratch without the necessity of any additional calibration. Our tracking results for a single sensor also suggest further investigation towards collaborative inside-out tracking with multiple systems at the same time, orienting themselves within a global map as a common reference frame.
Overall, we presented a markerless inside-out tracking method based on visual SLAM and demonstrated its accuracy for general tracking as well as 3D ultrasound imaging. Given its accuracy and versatility, we hope that this will lead to a more detailed investigation of markerless inside-out tracking for use in medical procedures, also by other research groups. This is particularly interesting for applications that primarily involve rotation, such as transrectal prostate fusion biopsy.
-  Hennersperger, C., Fuerst, B., Virga, S., Zettinig, O., Frisch, B., Neff, T., Navab, N.: Towards MRIs-based autonomous robotic US acquisitions: a first feasibility study. IEEE transactions on medical imaging 36(2) (2017) 538–548
-  Adebar, T., Salcudean, S., Mahdavi, S., Moradi, M., Nguan, C., Goldenberg, L.: A robotic system for intra-operative trans-rectal ultrasound and ultrasound elastography in radical prostatectomy. In: International Conference on Information Processing in Computer-Assisted Interventions, Springer (2011) 79–89
-  Kral, F., Puschban, E.J., Riechelmann, H., Freysinger, W.: Comparison of optical and electromagnetic tracking for navigated lateral skull base surgery. The International Journal of Medical Robotics and Computer Assisted Surgery 9(2) (2013) 247–252
-  Busam, B., Esposito, M., Che'Rose, S., Navab, N., Frisch, B.: A stereo vision approach for cooperative robotic movement therapy. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. (2015) 127–135
-  Heuveling, D., Karagozoglu, K., Van Schie, A., Van Weert, S., Van Lingen, A., De Bree, R.: Sentinel node biopsy using 3d lymphatic mapping by freehand spect in early stage oral cancer: a new technique. Clinical Otolaryngology 37(1) (2012) 89–90
-  Fenster, A., Downey, D.B., Cardinal, H.N.: Three-dimensional ultrasound imaging. Physics in medicine & biology 46(5) (2001) R67
-  Esposito, M., Busam, B., Hennersperger, C., Rackerseder, J., Lu, A., Navab, N., Frisch, B.: Cooperative robotic gamma imaging: Enhancing us-guided needle biopsy. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2015) 611–618
-  Sun, S.Y., Gilbertson, M., Anthony, B.W.: Probe localization for freehand 3d ultrasound by tracking skin features. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2014) 365–372
-  Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33(5) (2017) 1255–1262
-  Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: Large-scale direct monocular SLAM. In: European Conference on Computer Vision. (2014)
-  Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. Transactions on Pattern Analysis and Machine Intelligence (2018)
-  Hsu, P.W., Prager, R.W., Gee, A.H., Treece, G.M.: Freehand 3d ultrasound calibration: a review. In: Advanced imaging in biology and medicine. Springer (2009) 47–84
-  Wang, R., Schwörer, M., Cremers, D.: Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras. In: International Conference on Computer Vision. (2017)
-  Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence 22(11) (2000) 1330–1334
-  Tsai, R.Y., Lenz, R.K.: A new technique for fully autonomous and efficient 3d robotics hand/eye calibration. IEEE Transactions on robotics and automation 5(3) (1989) 345–358
-  Marchand, É., Spindler, F., Chaumette, F.: Visp for visual servoing: a generic software platform with a wide class of robot control skills. IEEE Robotics & Automation Magazine 12(4) (2005) 40–52
-  Lasso, A., Heffter, T., Rankin, A., Pinter, C., Ungi, T., Fichtinger, G.: Plus: open-source toolkit for ultrasound-guided intervention systems. IEEE Transactions on Biomedical Engineering 61(10) (2014) 2527–2537
-  Garrido-Jurado, S., Muñoz-Salinas, R., Madrid-Cuevas, F.J., Marín-Jiménez, M.J.: Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition 47(6) (2014) 2280–2292
-  Busam, B., Esposito, M., Frisch, B., Navab, N.: Quaternionic upsampling: Hyperspherical techniques for 6 dof pose tracking. In: 3D Vision (3DV), 2016 Fourth International Conference on, IEEE (2016) 629–638