Multi-layer Visualization for Medical Mixed Reality

09/26/2017 ∙ by Séverine Habert, et al. ∙ Technische Universität München

Medical Mixed Reality helps surgeons to contextualize intraoperative data with video of the surgical scene. Nonetheless, the surgical scene and anatomical target are often occluded by surgical instruments and surgeon hands. In this paper we propose, to our knowledge, the first multi-layer visualization solution for Medical Mixed Reality, which improves a surgeon's view by making the occluding objects transparent. As an example scenario, we use an augmented reality C-arm fluoroscope. A video image is created using a volumetric-based image synthesization technique and stereo RGBD cameras mounted on the C-arm. The background occluded by the surgical instruments and surgeon hands is recovered by modifying the volumetric-based image synthesization technique, so that the occluding objects can become transparent over the surgical scene. Experiments with different augmented reality scenarios yield results demonstrating that the background of the surgical scene can be recovered with an accuracy between 45% and 97%. This work demonstrates a novel visualization concept for medicine, providing transparency to objects occluding the surgical scene. It is also the first application of a volumetric field for Diminished Reality/Mixed Reality.




1 Methodology

The setup, calibration methods, and image synthesization used in this paper have been previously published in [habert2015posteraugmenting]. In the interest of brevity, we do not describe the calibration steps, but we thoroughly describe the synthesization process, since it is vital to our Mixed Reality multi-layer visualization contribution.

1.1 Setup

The setup comprises 2 RGBD cameras (Kinect v2) placed on the sides of an X-ray source (Figure 1). Each RGBD camera outputs a depth image, an infrared image, and a wide-angle video image, and their fields of view overlap over the C-arm detector. The Kinect v2 has been chosen because its time-of-flight depth sensing does not interfere with that of a similar sensor.

Figure 1: Setup with 2 Kinects attached to the C-arm gantry

The depth and video images are recorded using the libfreenect2 library [florian_echtler_2016_45314], which also provides the mapping from the depth image to the video image. The synchronization between images from the two cameras has been performed manually, because two Kinect v2 devices cannot be used on a single standard computer and are therefore run on two separate computers. As a consequence, every sequence is recorded at a lower framerate than a standard 30 fps video.

1.2 Image synthesization

Once the system has been calibrated following the steps from [habert2015posteraugmenting], the video image from the X-ray viewpoint can be synthesized. First, the origin of the 3D world coordinate space is positioned at the center of the volumetric grid, around the C-arm intensifier. Knowing the poses of the two RGBD cameras relative to the X-ray source, the projection matrices P_1 and P_2 for the 2 RGBD sensors can be computed. The notations relative to the cameras are defined as follows: C_1 is the optical center of the first camera, D_1 its depth image and I_1 its color image (respectively C_2, D_2 and I_2 for the second camera).

To render the color image from the X-ray source viewpoint, a volumetric TSDF field F is created, mapping a 3D point x to a truncated signed distance value F(x). This value is the weighted mean of the truncated signed distance values F_1(x) and F_2(x) computed respectively in the 2 RGBD sensor cameras. The field therefore follows Equation 1:

F(x) = (W_1(x) F_1(x) + W_2(x) F_2(x)) / (W_1(x) + W_2(x))    (1)
where W_1(x) and W_2(x) are the weights for each camera. The weights are used to reject truncated signed values according to specific conditions (described in Equation 2). For each camera i, the weight for each truncated signed value is computed as:

W_i(x) = 1 if F_i(x) ≥ −η, and W_i(x) = 0 otherwise    (2)
where η is a small tolerance on the visibility of x. For each view i, F_i(x) geometrically represents the difference between the depth value obtained by projecting x into camera i and the distance from x to the optical center C_i of the camera, to which a scaled truncation to the interval [−1, 1] is applied. The truncated signed distances are computed according to Equation 3:

F_i(x) = max(−1, min(1, (D_i(π_i(x)) − ‖x − C_i‖) / δ))    (3)
where π_i(x) is the pixel obtained by projecting x into camera i with the projection matrix P_i, and δ is a tolerance parameter to handle noise in the depth measurements. Alongside the TSDF field F, we also create a volumetric color field C following Equation 4:

C(x) = (W_1(x) I_1(π_1(x)) + W_2(x) I_2(π_2(x))) / (W_1(x) + W_2(x))    (4)
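
To make the fusion concrete, the per-voxel computation of Equations 1–4 can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the tolerance values for η and δ are hypothetical placeholders, and the function names are our own.

```python
import numpy as np

ETA = 0.1     # visibility tolerance (eta) -- hypothetical value
DELTA = 0.02  # depth-noise tolerance (delta) in metres -- hypothetical value

def tsdf_value(depth_at_pixel, dist_to_center):
    """Equation 3: signed depth difference, scaled and truncated to [-1, 1]."""
    return float(np.clip((depth_at_pixel - dist_to_center) / DELTA, -1.0, 1.0))

def fuse(f1, f2, color1, color2):
    """Equations 1, 2 and 4: weighted fusion of the two per-camera samples.

    A camera's sample is rejected (weight 0) when the voxel lies more than
    ETA behind the surface observed by that camera, i.e. it is occluded."""
    w1 = 1.0 if f1 >= -ETA else 0.0
    w2 = 1.0 if f2 >= -ETA else 0.0
    if w1 + w2 == 0.0:
        return None, None  # voxel not reliably seen by either camera
    f = (w1 * f1 + w2 * f2) / (w1 + w2)
    c = (w1 * np.asarray(color1, float) + w2 * np.asarray(color2, float)) / (w1 + w2)
    return f, c
```

A voxel lying exactly on an observed surface yields a fused value of 0, which is the condition the ray tracing step searches for.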
The scene to synthesize is represented in the volumetric grid by the voxels whose TSDF value is equal to 0. The color image from the X-ray viewpoint is therefore generated by performing ray tracing on the TSDF field F from the X-ray viewpoint. For every pixel in the image to be synthesized, a ray is traced passing through the X-ray source and the pixel. Ray tracing consists of searching, along this ray, for the voxel y closest to the X-ray source that satisfies F(y) = 0. To speed up this step, the search for the 0-value is performed by binary search. Once y has been found, its color C(y) is applied to the pixel in the synthesized image. A depth image can also be synthesized by calculating the distance between y and the X-ray source.
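
The ray search just described can be sketched as follows: a coarse march brackets the first sign change of the TSDF along the ray, then a binary search refines the 0-crossing. This is a minimal illustrative sketch (function names and step counts are our own), not the paper's implementation.

```python
import numpy as np

def trace_ray(tsdf, origin, direction, t_near, t_far, refine_iters=32):
    """Find the distance along the ray of the point closest to the source
    where the TSDF crosses 0 (positive = free space, negative = behind
    the surface). Returns None if no surface is crossed in [t_near, t_far]."""
    # Coarse march: bracket the first +/- sign change along the ray.
    ts = np.linspace(t_near, t_far, 64)
    prev_t, prev_f = ts[0], tsdf(origin + ts[0] * direction)
    lo = hi = None
    for t in ts[1:]:
        f = tsdf(origin + t * direction)
        if prev_f > 0.0 and f <= 0.0:
            lo, hi = prev_t, t
            break
        prev_t, prev_f = t, f
    if lo is None:
        return None
    # Binary search: refine the 0-crossing inside the bracket.
    for _ in range(refine_iters):
        mid = 0.5 * (lo + hi)
        if tsdf(origin + mid * direction) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```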

1.3 Multi-Layer Image Generation

After the first ray tracing step, the video image as seen from the X-ray source viewpoint, as well as its corresponding depth image, are generated. The volumetric TSDF field is a dense representation which contains information about the full 3D space around the C-arm detector, whereas the ray tracing stops at the first 0-value voxel found. The TSDF field therefore contains more information than has been used so far: beyond the hands synthesized by the first ray tracing, more 0-value voxels can be present along the ray. This is especially true since the 2 RGBD cameras are placed on the sides of the C-arm, giving additional information from other viewpoints. This situation is illustrated in Figure 2, where the background occluded by a hand from the X-ray source viewpoint (the blue point) can be seen by at least one of the 2 cameras. In a TSDF representation, this means those occluded background voxels also have a 0-value. To find those additional 0-values, a modified "second run" of ray tracing must be performed on the foreground (e.g. surgeon hands or surgical tools).

Figure 2: Occlusion

1.3.1 Hand segmentation

As a first step, the foreground needs to be segmented from the synthesized video and depth images. A background model is computed from an initialization sequence of depth images in which no hands or surgical instruments have been introduced yet: an average depth image is created by averaging the depth at every pixel along the initialization sequence. Then, for every new image (with potential hands or surgical instruments present), the depth image is compared to the mean image in order to create a binary mask image. Every pixel whose depth is lower than the average depth minus a margin (3 cm) is classified as foreground and set to white in the mask; pixels classified as background are set to black. The method is rudimentary compared to background subtraction methods, but the margin allows the background to change shape (within the limit of the margin). A noise removal step is added using morphological opening on the mask image. An example of a scaled depth image and its corresponding mask is shown in Figure 3.
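
The depth-based segmentation above amounts to a per-pixel threshold against the averaged background. A minimal numpy sketch, with our own illustrative names; the morphological opening is noted in a comment but omitted to keep the snippet dependency-free:

```python
import numpy as np

MARGIN = 0.03  # 3 cm margin, as in the text

def background_model(init_depths):
    """Per-pixel mean over an initialization sequence of depth images
    recorded before any hands or instruments enter the scene."""
    return np.mean(np.stack(init_depths), axis=0)

def foreground_mask(depth, mean_depth, margin=MARGIN):
    """White (255) where the pixel is at least `margin` closer to the
    camera than the background model, black (0) elsewhere. A
    morphological opening (e.g. OpenCV's cv2.morphologyEx with
    cv2.MORPH_OPEN) would follow to remove speckle noise."""
    return np.where(depth < mean_depth - margin, 255, 0).astype(np.uint8)
```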

Figure 3: The synthesized depth image and its corresponding segmented mask

1.3.2 Second-run raytracing

Once the foreground has been segmented, a second ray tracing can be performed for the pixels classified as hands or surgical instruments. Instead of beginning the ray tracing from the X-ray source viewpoint, the ray search starts at the voxel y found in the first ray tracing run plus a margin of 4 cm; this margin ensures that no 0-value still related to the foreground is found. The starting voxel y can easily be retrieved using the depth image resulting from the first ray tracing. The ray tracing is then performed forward using binary search, in a similar fashion to the first run. As a result, a color image of the background can be synthesized and combined with the color image from the first ray tracing run (excluding the foreground segmented pixels), creating a complete background image.
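
Combined with the segmentation mask, the second run only re-traces the foreground pixels, restarting each ray a guard margin past the first hit. A sketch under our own naming; the per-ray search `trace_color` is assumed to behave like the first-run binary search:

```python
import numpy as np

MARGIN = 0.04  # 4 cm guard past the first surface hit, as in the text

def background_image(color_first, depth_first, fg_mask, trace_color, t_far):
    """Complete background image: background pixels keep their first-run
    color; foreground pixels are re-traced starting MARGIN metres beyond
    the first hit, so the new 0-value cannot belong to the foreground.

    trace_color(row, col, t_near, t_far) is assumed to re-run the binary
    ray search for one pixel and return a color, or None on failure."""
    out = color_first.copy()
    for r, c in zip(*np.nonzero(fg_mask)):
        color = trace_color(r, c, depth_first[r, c] + MARGIN, t_far)
        if color is not None:
            out[r, c] = color  # recovered background color
    return out
```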

1.3.3 Multi-Layer Visualization

On top of the background image, the foreground layer extracted from the first ray tracing image can be overlaid with transparency, as well as the X-ray image. A multi-layer image I_m can then be created by blending the X-ray image I_x, the background image I_b and the foreground image I_f according to Equation 5:

I_m = α_x I_x + α_b I_b + α_f I_f    (5)
where α_x, α_b and α_f ∈ [0, 1] are the blending parameters associated with each layer. They can also be seen as weight values which emphasize a specific layer during the blending process.

The visualization scheme we propose then allows us to observe three layers of structures (displayed in Figure 4), according to those parameters.

Figure 4: Layers in our visualization, all can be observed depending on the chosen blending values

The furthest layer is the X-ray, which can be observed in its totality when its blending parameter is set to 1 and the others to 0. As we get closer to the camera, the next layer is the background structure recovered using the volumetric field; it can be observed by setting its blending parameter to 1. Finally, the front layer, comprising the hands and instruments, can be observed by setting its blending parameter to 1. Our visualization scheme allows the different layers (anatomy by X-ray, background, front layer) to be seen in transparency by choosing blending parameters other than 0 and 1. The choice of blending values depends on multiple parameters, such as surgeon preference, the step in the surgical workflow, and the type of instrument used, and it can be changed on the fly during surgery. For example, once an instrument has penetrated the skin, the background is no longer necessary to visualize: the transparent hands can be overlaid directly on the X-ray image, skipping the background layer by setting its blending parameter to 0. When the front layer is rendered fully opaque over the X-ray image, the visualization consists of fully opaque hands or surgical tools on the X-ray image, giving a similar output to [pauly2014relevance], which aimed at obtaining a natural ordering of hands over the X-ray image. As every layer is known at any point in a sequence, the multi-layer visualization can be replayed, for example to medical students and residents, with blending parameters other than those used in surgery; they then have full control over the layers, with the choice to emphasize particular layers of interest for their learning.
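
Equation 5 is a plain per-pixel linear blend, so the whole visualization reduces to a few numpy operations. A sketch with illustrative names and example alpha values of our own choosing (images as float arrays in [0, 1]):

```python
import numpy as np

def blend_layers(xray, background, foreground, a_x, a_b, a_f):
    """Equation 5: I_m = a_x * I_x + a_b * I_b + a_f * I_f.
    The alphas are the user-chosen blending parameters; clipping keeps
    the result displayable even when the alphas sum to more than 1."""
    out = a_x * xray + a_b * background + a_f * foreground
    return np.clip(out, 0.0, 1.0)

# Example configurations (alpha values are illustrative):
#   X-ray only:                          blend_layers(x, b, f, 1.0, 0.0, 0.0)
#   transparent hands over the X-ray,
#   skipping the background layer:       blend_layers(x, b, f, 0.6, 0.0, 0.4)
```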

2 Results

2.1 Experimental protocol

Six sequences have been recorded depicting example scenarios which include both surgeon hands and surgical tools. Both a realistic hand model phantom and a real patient hand are used and positioned on a surgical table. A clinician wearing purple examination gloves introduces partial occlusions randomly into the scene. Sequences 1 and 3 contain the motion of the clinician's open hand above the hand model phantom at 20 cm and 30 cm respectively. Sequences 2 and 4 contain the motion of the clinician's closed hand above the hand model phantom at 20 cm and 30 cm respectively. Sequences 3 and 4 also contain incision lines drawn with a marker on the hand model phantom. Finally, Sequences 5 and 6 are recorded with surgical tools above a real patient hand: Sequence 5 includes actions using a surgical hammer aiming for a cross target drawn on the patient hand, and Sequence 6 includes a scalpel targeting the same cross. The heights of the surgical instruments above the patient hand vary from 5 cm to 30 cm.

2.2 Background recovery

For every sequence, the mean percentage of recovered pixels is calculated and reported in Table 1. The natural observation in Table 1 is that the closer the surgeon hand and surgical tools are to the anatomy, the larger the occlusion in both side cameras, and therefore the lower the percentage of pixels recovered by our algorithm, as the results demonstrate.

Sequence                   1      2      3      4      5      6
Pixels recovered (in %)    69.3   65.2   88.2   97.4   84.1   45.2

Table 1: Background recovery results

Sequences 1 and 2 were recorded with the surgeon hand open (69.3%) and closed (65.2%). Fewer pixels are recovered in the closed-hand scenario because mainly the fist is present in the scene. The fist is not recovered in the open-hand scenario either, but the occluding fingers are easier to recover from due to their thin shape, so in percentage the open-hand scenario recovers more even though it occludes more. Sequences 3 and 4 resulted in larger recovery percentages (88.2% and 97.4% respectively) because the surgeon hand was farther away from the hand model, which implies a greater probability for the background voxels to be seen by the RGBD sensors. Sequence 6, with a scalpel, confirms that the height strongly influences the recovery: this scenario includes numerous images with hands and instruments close to the background (less than 10 cm) and shows a low recovery result, as expected. Due to the hammer's shape, Sequence 5 nevertheless shows a higher recovery percentage.

2.3 Visualization results

In Figure 5, for each scenario, one selected image from the sequence can be observed with different values of the blending parameters. Each row corresponds to one sequence. From left to right, the layer visualized gets closer to the X-ray source viewpoint. In column (a), the furthest layer (the X-ray image) is displayed; in column (b), the second layer (the background); in column (c), the blending of the front layer with the background; in column (d), the blending of the three layers; and finally, in column (e), the closest layer is shown. Additional images from the sequences can be seen in the supplementary video, where the interaction between the layers obtained by changing the blending values can be observed.

Since the background cannot be ideally recovered, a manual post-processing step involving inpainting is applied; its result is displayed in column (f) of Figure 5. We believe that the multi-layer visualization concept is an interesting solution offering numerous possibilities to the surgical as well as the mixed reality communities.

Figure 5: Per row, the multi-layer image of one selected frame in each sequence with different blending parameters

Similar to the results from Habert et al. [habert2015posteraugmenting], the images resulting from synthesization are not as sharp as a real video image. The area synthesized by our algorithm is approximately 20 cm × 20 cm (the C-arm detector size), which is small compared to the wide-angle field of view of the Kinect v2. Restricted to the synthesized area, the video and depth from the Kinect are not of high enough resolution for sharper results; more specialized hardware with a smaller field of view and higher-resolution RGBD data would solve this problem. Moreover, several artifacts can be seen around the hands and surgical instruments in the synthesized image, due to large depth differences and noise in the RGBD data from the 2 cameras. However, our results demonstrate that our method works well, since the incision line and cross drawn on the hand model and patient hand are perfectly visible in the recovered background image and can be seen in transparency through the hands and surgical tools in Figure 5, columns (c) and (d). In the scalpel sequence (Sequence 6) in Figure 5, column (b), the tip of the scalpel is considered as background; this is due to the margin of a few centimeters used for background segmentation, and in this image the scalpel is actually touching the skin.

3 Discussion

Inferring temporal priors can help alleviate occlusion: methods involving volumetric fields [newcombe2011kinectfusion] use temporal information, as the field is sequentially updated with new measurements instead of being fully reinitialized as in our method. The percentage of pixels recovered is also dependent on the side camera configuration. In our clinical case, the camera setup is constrained by the C-arm design, and the disparity between the X-ray source and the two RGBD cameras is low; a higher disparity would lead to less occlusion in at least one of the cameras. Even with our constrained and difficult clinical setup, the results are very promising, and we are convinced the work could easily be extended to less restrictive settings. A potential application is industrial Diminished/Mediated Reality, where workers wearing an HMD with two cameras placed on its sides (with a higher disparity than our setup) could see their viewpoint synthesized with their hands in transparency.

4 Conclusion

In this paper, we have presented the first work combining Diminished and Augmented Reality in the medical domain. Our visualization scheme proposes a user-adjustable multi-layer visualization in which each layer can be blended with the others. The layers comprise the anatomy (the X-ray image), the patient background, and the surgeon hands and surgical instruments. Our visualization scheme lets the clinician choose which layer(s) become transparent depending on the surgical scenario or workflow step. Beyond the medical domain, this work is the first use of a volumetric field for background recovery in Diminished and Mixed Reality. Future work should involve adding further layers, by dissociating the surgeon hand layer from the surgical instrument layer, in order to adjust the visualization even more closely to user preferences.