1 Description of purpose
A common application of laser microsurgery is the treatment of pathological occurrences in the larynx, also referred to as laser phonomicrosurgery (LP). Recently, automatic laser control in LP that accounts for tissue deformation has been achieved by stereo-vision-based tracking of laryngeal soft tissue. However, this approach detects only superficial deformations and lacks tomographic depth perception of subcutaneous vocal fold layers, where vocal fold pathologies such as Reinke’s edema are often located. This results in lower accuracy than is possible, the need for larger safety margins during resection, and thus greater trauma and reduced residual speech ability for patients. In addition to camera imaging, optical coherence tomography (OCT) has proved promising for depth-resolved imaging of vocal fold tissue layers. OCT can be used to enable automated control of the laser position for high-precision laser cuts in laryngeal soft tissue environments. Therefore, the purpose of this paper is to estimate voxel displacements in 4D OCT images in order to track the desired laser target with respect to tissue movement. In the following sections of this supplemental abstract, the proposed methods, the evaluation setup and first results are described.
Convolutional neural networks (CNNs) have recently shown superior performance in processing OCT images, for example by identifying medical diagnoses on retinal OCTs with human-level performance or in pose estimation with micrometer accuracy. CNNs have also led to state-of-the-art performance in estimating pixel displacements (disparity) on stereo camera images. Therefore, the workflow proposed in this study extends CNNs for disparity estimation and applies them to 4D OCT data for estimating voxel displacements over time.
An overview of the workflow is shown in Fig. 1. Two subsequent OCT volumes are fed into separate branches of the tracking scheme. In the first step, depth maps from predefined view angles are calculated for each volume using the maximum intensity projection (MIP): the depth values of the maximum-intensity voxels along parallel rays are projected onto image planes. These depth maps can be interpreted as concise representations of the original input volume. The two resulting representations are then used for the unsupervised training of a CNN, which estimates the dense planar optical flow and the depth changes between them. This is referred to as a sparse 2.5D flow field with respect to the input volumes.
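The depth-map step can be illustrated with a minimal sketch. It assumes a single, axis-aligned view direction (the predefined view angles of the workflow are not reproduced here), and the function name `mip_depth_map` and the toy volume are hypothetical.

```python
import numpy as np

def mip_depth_map(volume, axis=0):
    """Return (depth_map, intensity_map) of a maximum intensity projection.

    For every ray parallel to `axis`, depth_map holds the index of the
    brightest voxel and intensity_map holds the intensity of that voxel.
    """
    volume = np.asarray(volume)
    depth_map = np.argmax(volume, axis=axis)
    intensity_map = np.max(volume, axis=axis)
    return depth_map, intensity_map

# Toy volume (z, y, x): a bright tilted surface in an otherwise dark cube.
vol = np.zeros((8, 4, 4), dtype=np.float32)
for x in range(4):
    vol[2 + x, :, x] = 1.0  # surface depth increases with x

depth, intensity = mip_depth_map(vol, axis=0)
print(depth[0])  # depth of the surface along the first image row
```

The resulting 2D array encodes depth as pixel value, which is why a 3D volume can be handed to a 2D network as a compact representation.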
2.1 Sparse flow field estimation
The 2.5D flow field describes pixel displacements and changes in depth between two concise representations of successive OCT volumes; this is also referred to as scene flow. To estimate the 2.5D flow field, a modified DispNet architecture is used. The original DispNet was designed to estimate disparities between a pair of stereo images, and it consists of an encoding part and a decoding part with long-range connections between them. The encoder comprises convolutional layers, some of them with a stride of 2, which downsample the input by a total factor of 64. This also allows the network to estimate large displacements. The input of the encoder is formed by stacking the left and right stereo projections along the channel dimension. The decoder then upsamples the feature maps through an alternating series of up-convolutional and convolutional layers, also exploiting the features from the encoder by concatenating the feature maps. Similar to the connections in ResNet, information can pass through the long-range connections, which avoids a data bottleneck. Disparity predictions are created at different stages of the decoder, resulting in disparity estimations at 1, 2, 3 and 8 downscaled resolution of the original input image.
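As a small worked example of the downsampling arithmetic: a total factor of 64 corresponds to six stride-2 layers, since 2^6 = 64. The sketch below assumes that all downsampling comes from stride-2 convolutions with "same" padding; the helper name is hypothetical.

```python
def encoder_output_size(size, num_stride2_layers=6):
    """Spatial size after repeated stride-2 convolutions with 'same' padding."""
    for _ in range(num_stride2_layers):
        size = (size + 1) // 2  # ceil(size / 2)
    return size

total_factor = 2 ** 6                    # six halvings
bottleneck = encoder_output_size(512)    # 512 / 64
print(total_factor, bottleneck)
```

A displacement of many pixels in the input shrinks to a fraction of a pixel at the coarsest level, which is one way to see why strong downsampling helps with large displacements.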
The described DispNet architecture is modified to output a parameter that is used to reconstruct the second depth map from the first one, and vice versa, using the differentiable spatial transformer function. The actual scene flow can be calculated directly from this parameter. The convolutional layers that predict the sparse flow field are constrained to a bounded output range by a scaled hyperbolic tangent non-linearity, where the bound depends on the corresponding image dimension at a given output scale. The exponential linear unit (ELU) is used as the non-linear activation for all other convolutional layers. The flow field estimation network is trained with an unsupervised loss function, as described below.
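The two ingredients named above, a bounded output and a differentiable resampling step, can be sketched in NumPy. This is an illustration under assumptions, not the network code: the bound `limit` is a free parameter here, the warp uses backward bilinear sampling with border clamping, and both function names are hypothetical.

```python
import numpy as np

def scaled_tanh(x, limit):
    """Squash raw outputs into the interval [-limit, limit]."""
    return limit * np.tanh(x)

def warp_bilinear(image, flow_x, flow_y):
    """Backward-warp a 2D image with a dense flow using bilinear sampling.

    output[y, x] = image[y + flow_y[y, x], x + flow_x[y, x]], with the
    sampling positions clamped to the image border.
    """
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow_x, 0, w - 1)
    sy = np.clip(ys + flow_y, 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0  # horizontal interpolation weight
    wy = sy - y0  # vertical interpolation weight
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

depth_a = np.arange(16, dtype=float).reshape(4, 4)
flow_x = np.full((4, 4), 0.5)   # half a pixel to the right
flow_y = np.zeros((4, 4))
warped = warp_bilinear(depth_a, flow_x, flow_y)
bounded = scaled_tanh(np.array([-50.0, 0.0, 50.0]), limit=2.0)
```

Because every operation is a weighted sum of pixel values, gradients can flow from a reconstruction error back to the predicted flow, which is what makes unsupervised training possible.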
A small dataset of subsequent 3D OCT volumes with known displacements of 75 µm was created by placing a specimen on a motorized linear stage (MZ812B, ThorLabs Inc., Newton, NJ, USA) with a repeatability of 1.5 µm and acquiring volumes with a swept-source OCT (OCS1300SS, ThorLabs Inc., lateral resolution 12 µm). The scan dimensions are set to 3 mm in each spatial direction at a resolution of 512 voxels, which results in isotropic voxels with an edge size of approximately 6 µm. To augment the small dataset, we cropped adjacent, partly overlapping sub-volumes of fixed shape at planar offsets from the original volumes, created the depth maps along a fixed axis, and used them as training samples (see Fig. 6).
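The numbers in this paragraph can be checked with a short calculation, and the cropping scheme can be sketched alongside it. The crop size of 256 voxels and the step of 64 voxels are hypothetical values chosen only for illustration; the actual sub-volume shape and offsets are not stated here.

```python
from itertools import product

def voxel_edge_um(scan_mm, voxels):
    """Edge length of one voxel in micrometers."""
    return scan_mm * 1000.0 / voxels

def crop_offsets(full, crop, step):
    """Start indices of overlapping crops of length `crop` along one axis."""
    return list(range(0, full - crop + 1, step))

edge = voxel_edge_um(3.0, 512)   # 5.859375 um, i.e. roughly 6 um
shift_voxels = 75.0 / edge       # known stage displacement in voxels

# Hypothetical augmentation: 256-voxel crops every 64 voxels in a plane.
offsets = crop_offsets(512, 256, 64)
corners = list(product(offsets, offsets))
print(edge, round(shift_voxels, 2), len(corners))
```

With these assumed values, one volume pair yields 25 overlapping training samples, and the 75 µm stage displacement corresponds to about 12.8 voxels.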
2.3 Unsupervised loss function
The proposed method learns its parameters by minimizing the differences between each input depth map and its reconstruction (see Fig. 1). This is done by minimizing a loss function composed of a reconstruction loss, a smoothness loss and a consistency loss. The loss is computed at each output scale, and the per-scale losses together form the final training loss. The reconstruction loss encourages the reconstructed image to have a similar appearance to the corresponding input image. The smoothness loss enforces a smooth scene flow field, and the consistency loss enforces the predictions to be consistent in both directions between the two representations.
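A minimal NumPy sketch of such a three-term, multi-scale loss is given below. Several choices are assumptions made for illustration and are not taken from the text: the reconstruction term is a plain L1 difference, the consistency term compares forward and backward flow without warping (reasonable only for small displacements), the per-scale losses are summed, and the weights `w_rec`, `w_smooth` and `w_cons` are hypothetical.

```python
import numpy as np

def reconstruction_loss(target, reconstructed):
    """Mean absolute difference between an input map and its reconstruction."""
    return float(np.mean(np.abs(target - reconstructed)))

def smoothness_loss(flow):
    """Mean absolute spatial gradient of a flow field of shape (H, W, C)."""
    dx = np.abs(np.diff(flow, axis=1))
    dy = np.abs(np.diff(flow, axis=0))
    return float(dx.mean() + dy.mean())

def consistency_loss(flow_fwd, flow_bwd):
    """Simplified check that forward and backward flow cancel each other."""
    return float(np.mean(np.abs(flow_fwd + flow_bwd)))

def total_loss(scales, w_rec=1.0, w_smooth=0.1, w_cons=1.0):
    """Sum the weighted loss terms over all output scales."""
    total = 0.0
    for s in scales:
        rec = (reconstruction_loss(s["a"], s["a_rec"])
               + reconstruction_loss(s["b"], s["b_rec"]))
        smooth = smoothness_loss(s["flow_fwd"]) + smoothness_loss(s["flow_bwd"])
        cons = consistency_loss(s["flow_fwd"], s["flow_bwd"])
        total += w_rec * rec + w_smooth * smooth + w_cons * cons
    return total

# Toy example with one scale: perfect reconstruction, constant opposite flows.
a = np.arange(16, dtype=float).reshape(4, 4)
scale = {"a": a, "a_rec": a.copy(), "b": a, "b_rec": a.copy(),
         "flow_fwd": np.ones((4, 4, 3)), "flow_bwd": -np.ones((4, 4, 3))}
loss_perfect = total_loss([scale])
loss_offset = total_loss([dict(scale, a_rec=a + 1.0)])
```

In the toy example all three terms vanish for the perfect case, and a uniform reconstruction offset of one unit raises only the reconstruction term.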
First quantitative results of the aforementioned tracking setup are shown in Fig. 10. As the accuracy metric, we use the widely accepted endpoint error (EPE). The samples for training and validation were cropped from different OCT volumes of the same specimen. As shown in Fig. 10 (d), tracking results were achieved with a mean EPE of 2.27 voxels and a median EPE of 0.91 voxels for the shown example. Generally, a small error can be observed in feature-rich areas.
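For reference, the endpoint error can be computed as sketched below. The sketch assumes that the EPE is the Euclidean distance between predicted and true displacement vectors over all three flow components; the sample values are made up and unrelated to the reported results.

```python
import math
import statistics

def endpoint_errors(predicted, ground_truth):
    """Euclidean distance between predicted and true displacement vectors."""
    return [math.dist(p, g) for p, g in zip(predicted, ground_truth)]

# Made-up displacement vectors (dx, dy, dz) in voxels.
predicted = [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

errors = endpoint_errors(predicted, ground_truth)
mean_epe = statistics.mean(errors)
median_epe = statistics.median(errors)
print(mean_epe, median_epe)
```

In this toy case a single large error pulls the mean above the median, which is one possible reading of a mean EPE that is noticeably larger than the median.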
4 New or breakthrough work to be presented
This paper formulates the following hypothesis: the sparse scene flow field of 4D OCT data can be obtained from concise representations in a deep learning approach. For the final manuscript, we will add quantitative results obtained from the following evaluation setup. A large dataset of tissue phantoms is created by placing different specimens on a high-accuracy hexapod robotic platform and acquiring 3D OCT volumes while moving the platform. This results in 4D OCT data with known rigid transformations (translation and rotation), against which the tracking accuracy can be stated. An in vivo sequence of human skin with elastic manipulations is used to assess performance under non-rigid transformations. For the latter, due to the lack of ground truth, the forward-backward error metric is used to quantify the tracking result.
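The forward-backward error for a pair of frames can be sketched as follows: a point is tracked forward, the result is tracked backward, and the distance to the starting point is the error. The two tracker callables are hypothetical stand-ins for the flow estimates in each direction.

```python
import math

def forward_backward_error(points, track_forward, track_backward):
    """Distance between each start point and its forward-backward round trip."""
    errors = []
    for p in points:
        q = track_forward(p)        # position in the second frame
        p_back = track_backward(q)  # tracked back into the first frame
        errors.append(math.dist(p, p_back))
    return errors

# Toy trackers: the backward tracker undershoots by half a pixel in x.
def fwd(p):
    return (p[0] + 1.0, p[1])

def bwd_exact(p):
    return (p[0] - 1.0, p[1])

def bwd_drift(p):
    return (p[0] - 0.5, p[1])

pts = [(0.0, 0.0), (2.0, 3.0)]
fb_exact = forward_backward_error(pts, fwd, bwd_exact)
fb_drift = forward_backward_error(pts, fwd, bwd_drift)
print(fb_exact, fb_drift)
```

The metric needs no ground truth, but it measures self-consistency only: a tracker that is wrong in the same way in both directions can still score well.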
A dense 4D scene flow is important for using 4D OCT as an intra-operative modality for image-guided procedures such as laser phonomicrosurgery. However, current state-of-the-art high-speed OCT devices can operate at 25 volumes per second, and the computational complexity of directly estimating 4D scene flow limits its potential. By using concise 2.5D representations of 3D OCT volumes, 3.5D OCT data is created. This data can be processed more efficiently for tracking, thus enabling 4D OCT for image guidance with sub-epithelial information in microsurgery.
The author declares that this work has not been published previously, that it is not under consideration for publication elsewhere, and that, if accepted, it will not be published elsewhere.
Acknowledgements. The authors thank Tom Pfeiffer and Robert Huber from the Institute of Biomedical Optics, University of Lübeck, Germany for providing us with the 4D OCT data used in this study.
-  Mattos, L. S., Deshpande, N., Barresi, G., Guastini, L., and Peretti, G., “A novel computerized surgeon–machine interface for robot-assisted laser phonomicrosurgery,” The Laryngoscope 124(8), 1887–1894 (2014).
-  Schoob, A., Laves, M.-H., Kahrs, L. A., and Ortmaier, T., “Soft tissue motion tracking with application to tablet-based incision planning in laser surgery,” Int. J. Comput. Assist. Radiol. Surg. (2016).
-  Rubin, J. S., Sataloff, R. T., and Korovin, G. S., [Diagnosis and Treatment of Voice Disorders ], Plural Publishing (2014).
-  Benboujja, F., Garcia, J. A., Beaudette, K., Strupler, M., Hartnick, C. J., and Boudoux, C., “Intraoperative imaging of pediatric vocal fold lesions using optical coherence tomography,” J. of Biomed. Opt. 21(1) (2016).
-  Kermany, D. S., Goldbaum, M., Cai, W., Valentim, C. C., Liang, H., Baxter, S. L., McKeown, A., Yang, G., Wu, X., Yan, F., et al., “Identifying medical diagnoses and treatable diseases by image-based deep learning,” Cell 172(5), 1122–1131 (2018).
-  Gessert, N., Schlüter, M., and Schlaefer, A., “A deep learning approach for pose estimation from volumetric OCT data,” Medical Image Analysis 46, 162–179 (2018).
-  Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., and Brox, T., “A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation,” in [ ], 4040–4048 (2016).
-  Huguet, F. and Devernay, F., “A Variational Method for Scene Flow Estimation from Stereo Sequences,” in [Proceedings of the ICCV 2007 ], IEEE (2007).
-  He, K., Zhang, X., Ren, S., and Sun, J., “Deep Residual Learning for Image Recognition,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition ], 770–778 (2016).
-  Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K., “Spatial Transformer Networks,” in [Advances in Neural Information Processing Systems 28 ], 2017–2025 (2015).
-  Godard, C., Aodha, O. M., and Brostow, G. J., “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” in [IEEE Conference on Computer Vision and Pattern Recognition ], 6602–6611 (2017).
-  Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., and Brox, T., “FlowNet: Learning Optical Flow with Convolutional Networks,” in [Proceedings of the IEEE International Conference on Computer Vision ], 2758–2766 (2015).
-  Kalal, Z., Mikolajczyk, K., and Matas, J., “Forward-Backward Error: Automatic Detection of Tracking Failures,” in [20th International Conference on Pattern Recognition ], 2756–2759, IEEE (2010).
-  Wieser, W., Draxinger, W., Klein, T., Karpf, S., Pfeiffer, T., and Huber, R., “High definition live 3D-OCT in vivo: design and evaluation of a 4D OCT engine with 1 GVoxel/s,” Biomedical Optics Express 5(9), 2963–2977 (2014).
-  Laves, M.-H., Kahrs, L. A., and Ortmaier, T., “Volumetric 3D stitching of optical coherence tomography volumes,” in [Proceedings of the 52nd Annual Conference of the German Society for Biomedical Engineering ], (2018).