Patient-Specific Domain Adaptation for Fast Optical Flow Based on Teacher-Student Knowledge Transfer

07/09/2020 ∙ by Sontje Ihler, et al. ∙ uni hannover

Fast motion feedback is crucial in computer-aided surgery (CAS) on moving tissue. Image assistance in safety-critical vision applications requires dense tracking of tissue motion. This can be done using optical flow (OF). Accurate motion predictions at high processing rates lead to higher patient safety. Current deep learning OF models show the common speed vs. accuracy trade-off. To achieve high accuracy at high processing rates, we propose patient-specific fine-tuning of a fast model. This minimizes the domain gap between training and application data, while reducing the target domain to the capability of the lower-complexity, fast model. We propose to obtain training sequences pre-operatively in the operation room. We handle missing ground truth by employing teacher-student learning. Using flow estimations from the teacher model FlowNet2, we specialize a fast student model FlowNet2S on the patient-specific domain. Evaluation is performed on sequences from the Hamlyn dataset. Our student model shows very good performance after fine-tuning. Tracking accuracy is comparable to the teacher model at a speed-up of a factor of six. Fine-tuning can be performed within minutes, making it feasible for the operation room. Our method allows the use of a real-time capable model that was previously not suited for this task. This method paves the way for improved patient-specific motion estimation in CAS.




1 Introduction

In robot-assisted minimally invasive surgery (MIS), instruments are inserted through small incisions and observed via video endoscopy. The remote control of the instruments is unintuitive and requires a trained operator. Image-guided surgery can help surgeons to operate more safely and accurately. A challenging open problem in this scenario is visual motion estimation of moving tissue. Accurate motion predictions with fast feedback rates are crucial to ensure the patient’s safety.

Visual motion estimation can be implemented with sparse tracking, for instance based on feature matching, or dense tracking algorithms like optical flow estimation. There is a wide variety of optical flow algorithms, many based on conventional image processing with engineered features (a broad overview is provided in [Sun et al.(2010)]), as well as many recent data-driven approaches [Dosovitskiy et al.(2015), Ilg et al.(2017), Yu et al.(2016), Meister et al.(2018), Sun et al.(2018), Wulff and Black(2015)].

Because tissue deformation can only fully be captured with dense tracking, we focus this work on motion estimation with dense optical flow (OF). We further focus on deep learning models, as these outperform the conventional methods on public OF benchmarks [Butler et al.(2012), Geiger et al.(2012), Menze and Geiger(2015)]. Unfortunately, these models show the common speed vs. accuracy trade-off. At one end, complex models achieve high accuracy over large target domains, but their complexity leads to slow processing rates. At the other end, faster models lack the capability to generalize well on a large target space and drop in accuracy. For illustration see Figure 1, top and middle row. Surgical interventions require accurate motion predictions at simultaneously high processing rates. One might argue that the issue of speed will be solved automatically within the next years by the continuous improvement of GPUs/TPUs. However, this can only be said for tasks that profit from parallelization, which is not the case for surgical online applications, where each camera frame must be processed right after it is captured.

Figure 1: Tracking results on a challenging sparse-textured liver with respiratory motion [Mountney et al.(2010)] after 0, 12, 38 and 180 frames, respectively from left to right. (top row) FlowNet2 @12 fps. (middle row) FlowNet2S @80 fps. (bottom row) our proposed patient-adapted FlowNet2S+F @80 fps.

In this work, we present a combination of patient-specific domain adaptation and a teacher-student learning approach to achieve accurate and simultaneously fast OF estimation without ground truth labels. Patient-specific domain adaptation describes the specialization of a model to a specific patient; in other words, for each patient the OF model is specifically tailored to their individual anatomical motion and appearance. This maximizes the domain match between the model’s training data and the image/motion data that is afterwards observed during the intervention, which improves the performance of the OF model. To overcome the issue of missing ground truth, we deploy a teacher-student learning strategy for unsupervised domain adaptation. One model functions as a student model. Another model (or several models) functions as a teacher model that supports the training of the student model. In our case, a complex, accurate, but not real-time capable OF estimation model serves as a guide for a less complex and fast student model. The complex teacher model operates as a high-performance workhorse, which achieves high accuracy in a large task/target space. The fast student model lacks the complexity to achieve high accuracy everywhere in a large task space; however, if the task space is reduced, it can also achieve high accuracy. This is achieved by reducing the task space to the patient-specific task domain. We combine both approaches by capturing video sequences from the situs of the planned intervention, using the accurate teacher model to compute a gold truth, and fine-tuning the fast student model with the gold truth to specialize it on the patient-specific image and motion domain.

The contribution of our work is manifold. We apply the concept of patient-specific neural networks to regression tasks without requiring manual annotation, and we introduce patient-specific domain adaptation for motion estimation during surgical interventions. Our method allows the use of a real-time capable OF model that was previously not suited for this task. We are able to perform robust optical flow estimation on sparse-textured or deformable tissue at high frame rates.

In the following sections, we first give an overview of related work. We then provide a detailed presentation of the proposed unsupervised fine-tuning method. Afterwards we describe the experimental setup to verify our proposed method and show accuracy results on a selection of endoscopic scenarios. The paper concludes with a discussion and outlook.

2 Related Work

Optical flow (OF) estimation is a regression problem to determine visible motion between two images. The displacement of corresponding pixels is interpreted as a vector field describing the movement of each image component [Horn and Schunck(1981)]. It is widely applied in motion estimation for medical endoscopy: changes of the camera (endoscope) pose [Mahmoud et al.(2017), Spyrou and Iakovidis(2012)] as well as tissue motion [Penza et al.(2018), Schoob et al.(2017), Yip et al.(2012)] can be determined with OF algorithms. Penza et al. recently proposed a long-term safety tracking of pre-operatively defined risk areas [Penza et al.(2018)]. However, the computation time between two frames is too high and does not ensure real-time capability for all applications that require high speed. In broader applications (especially driven by autonomous driving), deep-learning-based methods like FlowNet2 [Ilg et al.(2017)] and PWC-Net [Sun et al.(2018)] are outperforming the conventional approaches. Both are designed to be trained in a supervised manner. The lack of ground truth is a general problem in end-to-end learning. FlowNet2 [Ilg et al.(2017)] profits from training on large synthetically created (non-medical) image sequences, FlyingChairs [Dosovitskiy et al.(2015)] and FlyingThings3D [Mayer et al.(2016)]. The image content as well as the inter-frame motion in these rendered training sets is drastically different from the properties found in endoscopic surgery. FlowNet2 is a high-complexity network architecture of several stacked subcomponents (FlowNet2S, FlowNet2C, and more), each of which can also function as an OF estimation model on its own. As they are smaller, they have much lower inference time but are less accurate. FlowNet2 yields very good motion estimation on endoscopic images, illustrating good general-purpose capabilities due to its high complexity. Its small, and therefore faster, counterpart FlowNet2S however fails, as illustrated in Fig. 1.

Tissue tracking is a challenging task as “endoscopic tissue images do not have strong edge features, are poorly lit and have been limited in providing stable, long-term tracking or real-time processing” [Yip et al.(2012)]. The issue of sparse textures was recently tackled for ocular endoscopy, successfully using a fine-tuned FlowNet2S. Fine-tuning a pretrained FlowNet2S to retinal images with virtual movement yielded good motion estimation for mosaicking small retinal scans into a retinal panorama [Guerre et al.(2018)]. Retinal images are very challenging due to their low-textured image content. The authors state that this was the first time they obtained an estimation accuracy sufficient for this task. To obtain labels for supervised training, they implemented virtual motion using an affine motion model (translation and rotation). Affine inter-frame motion can be interpreted as an approximation of camera movement. The high speed of FlowNet2S is highly suitable for our application. In pre-experiments, however, a FlowNet2S model fine-tuned with virtual affine movement was not able to track tissue deformations, which are more complex than affine motion. An unsupervised fine-tuning approach based on UnFlow [Meister et al.(2018)], which does not have the restriction of a simplified motion model, did not converge on sparse-textured image pairs in pre-experiments.

In 2018, Armin et al. proposed an unsupervised deep learning method to learn inter-frame correspondences for endoscopic images [Armin et al.(2018)]. However, their method is outperformed by the state-of-the-art model FlowNet2, as well as by FlowNet2S (see Appendix D). Liu et al. recently proposed DDFlow, a teacher-student approach to learn occlusion maps for OF in an unsupervised manner [Liu et al.(2019)]. Unlike us, they use two models of identical complexity and do not focus on computation speed.

3 Methods

We propose a simplified teacher-student learning strategy to create an annotated training set for fine-tuning a real-time capable OF model to specialize on a surgical scene.

To gain speed, we were inspired by the concept of knowledge distillation (KD) [Hinton et al.(2015)] using a teacher-student learning approach. This strategy was pioneered in model compression [Buciluǎ et al.(2006)], where a small model is trained to imitate a pretrained, complex model (ensemble). It was introduced to train small models for mobile applications. Small models also achieve higher processing rates than their complex counterparts. To achieve high accuracy, we propose patient-specific fine-tuning of the fast model. This has two advantages. First, we minimize the domain gap between training and application data, enabling higher accuracy during a surgical intervention. Second, patient-specific fine-tuning reduces the target space to the capability of the lower-complexity, fast model. For patient-specific fine-tuning we propose to obtain training sequences intra-operatively in the operation room once the camera is placed in the situs. We overcome the issue of missing ground truth annotations by employing the unsupervised teacher-student learning approach.

3.1 Gaining Speed

To explain our approach we recite a simplified concept of KD model compression, illustrated in Figure 2.

Figure 2: Domain adaptation with simplified teacher-student approach. The goal is to shift the student’s convergence space $\mathcal{C}_S$ to the target space $\mathcal{D}$. In an optimal scenario the target space $\mathcal{D}$ fully lies in the teacher’s convergence space $\mathcal{C}_T$ and the student’s solution space $\mathcal{S}_S$.

We assume there exists a high-complexity teacher model $f_T$, parametrized by $\theta_T$, with a large solution space $\mathcal{S}_T$ and convergence space $\mathcal{C}_T$. The solution space encapsulates all possible mappings that the model is capable of, representing the full potential of the model. The convergence space, on the other hand, describes a fixed variant of the model with a fixed parameter configuration $\hat{\theta}_T$. For neural networks, the parameters are equivalent to the model’s weights. We now assume a second, less complex but faster student model $f_S$. Due to its lower complexity it has a smaller solution space $\mathcal{S}_S$ and an even smaller convergence space $\mathcal{C}_S$, parametrized with the fixed configuration $\hat{\theta}_S$.

The target space $\mathcal{D}$ is the domain of the target application (in our case the patient-specific, surgical image/motion domain). The mismatch of the target space with the convergence space of a model can be interpreted as the domain gap. The goal is to shift the student’s convergence space $\mathcal{C}_S$ to overlap with the target space $\mathcal{D}$. If the target space lies within the intersection of the teacher’s convergence space and the student’s solution space,

$$\mathcal{D} \subseteq \mathcal{C}_T \cap \mathcal{S}_S, \qquad (1)$$

the student’s parameter configuration can be altered to replicate the teacher’s knowledge (convergence space) within the target space. If the student’s capacity is great enough to mimic the behavior of the teacher within the target space, it becomes a specialist for the target domain, as it achieves the accuracy of the complex model with less computation time.

3.2 Closing the Gap

To achieve the shift of the student’s convergence space $\mathcal{C}_S$ towards the target space $\mathcal{D}$, the objective is to maximize the overlap of $\mathcal{C}_S$ and $\mathcal{D}$ by finding the optimal solution $\hat{\theta}_S^{*}$ so that

$$\mathcal{D} \subseteq \mathcal{C}_S(\hat{\theta}_S^{*}). \qquad (2)$$

This is achieved by fine-tuning the model to the target domain. (Fine-tuning reduces the risk of over-fitting compared to training from scratch; a further advantage is shorter training times, see Subsection 3.3.) It must be noted again that the target space must be small enough to match the capacity of the student model to achieve high accuracy. For fine-tuning, we must draw training samples $(x, y)$ from the target space $\mathcal{D}$, where in our case $x$ are image pairs and $y$ is the corresponding flow field. Unfortunately, in our case it is only possible to sample $x$. Neither the corresponding $y$ nor the true mapping $f^{*}: x \mapsto y$ is known. We solve this by employing the mentioned teacher knowledge transfer to realize unsupervised domain adaptation.

We assume that the requirement in Eq. 1 is met and that the teacher model $f_T$ provides a good approximation of the true mapping $f^{*}$. We can then use $f_T$ to create an annotated dataset to shift the student’s convergence space $\mathcal{C}_S$ to overlap with the target space $\mathcal{D}$. We do that by sampling image pairs $x$ from the target domain and computing the corresponding flow field $\tilde{y} = f_T(x; \hat{\theta}_T)$, where $\tilde{y} \approx y$ (gold truth). We can then optimize our student model $f_S$ with the objective $\min_{\theta_S} \mathcal{L}(f_S(x; \theta_S), \tilde{y})$.
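The knowledge transfer described above can be illustrated with a deliberately small toy problem, sketched here in plain Python rather than with the actual flow networks. Everything in it is an assumption made for illustration: `teacher` is a sine function standing in for the complex model, the student is a linear model with too little capacity for the full input range, and the target domain is a narrow interval. The specialist student is fitted only on teacher-labelled samples from that interval.

```python
import math

def teacher(x):
    """Stand-in for the complex teacher model (hypothetical)."""
    return math.sin(x)

def fit_student(xs, ys):
    """Closed-form least-squares fit of a linear student y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def mean_abs_error(params, xs):
    """Mean absolute deviation of the student from the teacher on xs."""
    a, b = params
    return sum(abs(a * x + b - teacher(x)) for x in xs) / len(xs)

def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi (inclusive)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

# Large "general" domain versus a narrow target domain.
general_domain = linspace(0.0, 2.0 * math.pi, 200)
target_domain = linspace(0.0, 0.3, 200)

# Generalist student: fitted on teacher labels from the whole domain.
generalist = fit_student(general_domain, [teacher(x) for x in general_domain])

# Specialist student: only unlabeled target samples are available, so the
# teacher supplies the labels (the "gold truth") used for fitting.
gold_truth = [teacher(x) for x in target_domain]
specialist = fit_student(target_domain, gold_truth)

err_generalist = mean_abs_error(generalist, target_domain)
err_specialist = mean_abs_error(specialist, target_domain)
print(f"generalist error on target domain: {err_generalist:.4f}")
print(f"specialist error on target domain: {err_specialist:.4f}")
```

In this toy setting the low-capacity student cannot follow the teacher over the whole range, but it follows the teacher closely once the task is restricted to the narrow target interval, which is the intuition behind reducing the target space.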

3.3 Patient-Specific Target Domain

The approach presented above requires a small target space to create a fast but accurate student specialist. We already addressed the issue of over-fitting by employing fine-tuning. We also propose to obtain our training samples directly prior to the surgical intervention in the operation room. This reduces the target space efficiently and simultaneously minimizes the domain gap. To achieve the highest accuracy possible, training samples should ideally be identical to application samples. The idea is to capture image data during a preparation stage of an intervention, right after the placement of the camera in the situs. For an optimal fine-tuning outcome of the student model, all motion expected during application should be induced during the capture of training data. To make this approach feasible in the operation room, training times must be very short.

Dataset          Image Size   # Training   # Validation   # Test   Epoch of convergence   Time [min]
rotation         640x448      329          110            161      91                     12.1
scale change     640x448      600          240            397      81                     19.1
sparse texture   640x448      399          100            307      91                     14.3
deformation      320x512      600          200            100      96                     13.6
Table 1: Details for datasets and corresponding fine-tuning. All datasets have a frame rate of 30 fps. Average training time was less than 15 minutes on an Nvidia GeForce GTX 1080 Ti with batch size 8.

4 Experiments

The aim of our experiments is to verify that our teacher model is indeed capable of accurate motion estimation in the general endoscopic image domain. We further want to verify that our specialized model has the capacity to achieve high accuracy on the patient-specific target domain. Finally, we compare the accuracy of the teacher, student and specialist models, and assess the feasibility of our approach as an intra-operative procedure.

4.1 Dataset

To simulate intra-operative sampling from the patient-specific target domain, we use endoscopic video sequences that we split into disjoint training and test sets. The training set represents the sampling phase (preparation stage), while the test set represents the application phase during intervention. We chose four endoscopic sequences from the Hamlyn datasets that pose varying challenges regarding image content and inter-frame motion: rotation, scale change, sparse texture and deformation (see Table 1).

All sequences show specular lights. The sparse-texture set shows very challenging lighting conditions. The rotation and scale change sequences show artifacts caused by interlacing. Each dataset represents an individual patient. The splits of the subsets were chosen manually, so that training and test data both cover dataset-specific motion. The exact sectioning of the sequences is shown in Table 1. For all datasets the left camera was used.

4.2 Implementation

For all our experiments we used the accurate, high-complexity FlowNet2 framework [Ilg et al.(2017)] as our teacher model and its fast FlowNet2S component as our low-complexity student model. Due to the very different architectures of the two networks, we focus on mimicking only the predictions of the teacher model, instead of learning several behaviors of subcomponents. Both models have full weight channels and are pretrained on the datasets FlyingChairs [Dosovitskiy et al.(2015)] and FlyingThings3D [Mayer et al.(2016)]. We utilized the publicly available implementation in PyTorch [Reda et al.(2017)]. We cropped all datasets to multiples of 64 to fit the models’ architecture.
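The cropping step can be sketched as follows. This is a minimal illustration, assuming a centered crop; where the crop window was actually placed is not stated here, and the function names and the 480x640 input are hypothetical.

```python
def crop_to_multiple(height, width, multiple=64):
    """Largest (height, width) not exceeding the input, divisible by `multiple`."""
    return (height // multiple) * multiple, (width // multiple) * multiple

def center_crop_box(height, width, multiple=64):
    """Return (top, left, new_height, new_width) of a centered crop (assumed placement)."""
    new_h, new_w = crop_to_multiple(height, width, multiple)
    top = (height - new_h) // 2
    left = (width - new_w) // 2
    return top, left, new_h, new_w

# Hypothetical 480x640 (height x width) frame: 480 is not divisible by 64.
print(center_crop_box(480, 640))   # (16, 0, 448, 640)
# Sizes that are already multiples of 64 are left unchanged.
print(crop_to_multiple(512, 320))  # (512, 320)
```

Both image sizes listed in Table 1 (640x448 and 320x512) are multiples of 64 in each dimension.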

4.3 Results

All results were obtained using the test sets described in 4.1. The test sets were not used during training in any way. The experimental results cover training time, inference time, as well as accuracy of teacher model FlowNet2, student model FlowNet2S and proposed specialist model FlowNet2S+F.

Figure 3: Endpoint error relative to gold truth (EPE*), provided by teacher model FlowNet2, for FlowNet2S as well as the proposed FlowNet2S+F. The EPE* is provided averaged over all test sets (left) as well as for each set individually (right). The boxplot illustrates the median, lower and upper quartile of all estimation errors; whiskers represent 1.5 times the interquartile range.

Figure 4: Tracking results on rotation (top) and scale (bottom) sequences from in vivo porcine abdomen [Mountney et al.(2010)] after 110 (rotation) and 158 (scale) consecutive frames. The main challenge is the change of camera pose. There are also specular lights, illumination changes, small respiratory motion, as well as interlacing artifacts.

Gold truth from teacher model

We compute the gold truth for all data samples using teacher model FlowNet2.

Fine-tuning student model

We use the pretrained student model FlowNet2S to avoid over-fitting. We followed the (multi-scale) supervised learning scheme proposed by FlowNet2. Training was performed on the training sets described in Subsection 4.1. We split our training sets into training and validation sets (see Table 1). Fine-tuning was performed until convergence of the validation loss, up to a maximum of 100 epochs. Validation loss was computed every five epochs. During training, we augmented the training samples using random crops. Training parameters can be found in Appendix A.
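The training schedule (validation every five epochs, stop at convergence, at most 100 epochs) could be organized as in the following sketch. The callables `train_one_epoch` and `validate` are hypothetical placeholders for the actual training and validation code, and the convergence criterion (`min_improvement`) is an assumption, since the exact criterion is not specified.

```python
def fine_tune(train_one_epoch, validate, max_epochs=100, val_interval=5,
              min_improvement=1e-3):
    """Run fine-tuning until the validation loss stops improving.

    `train_one_epoch()` and `validate()` are hypothetical callables supplied
    by the caller; `validate()` returns the current validation loss.
    Returns (epochs_run, best_validation_loss).
    """
    best_loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if epoch % val_interval != 0:
            continue
        loss = validate()
        # Assumed convergence criterion: stop once a validation step no
        # longer improves on the best loss by at least `min_improvement`.
        if best_loss - loss < min_improvement:
            return epoch, min(best_loss, loss)
        best_loss = loss
    return max_epochs, best_loss

# Usage with simulated validation losses instead of a real model.
simulated_losses = iter([1.0, 0.5, 0.3, 0.2999, 0.1])
epochs_run, best = fine_tune(lambda: None, lambda: next(simulated_losses))
print(epochs_run, best)
```

With the simulated losses, the fourth validation step improves by less than the threshold, so the loop stops after 20 epochs.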

Training time

Training times at the point of convergence are provided in Table 1. Average training time was less than 15 minutes on an Nvidia GeForce GTX 1080 Ti with batch size 8. Unlike online inference, where only a single image pair can be processed at a time, training profits from parallelization and can be accelerated further with larger batch sizes (on a GPU with more memory). This makes training possible within minutes. We therefore consider the proposal of intra-operative sampling of training data feasible.

Figure 5: Tracking results on strong tissue deformation from tool interaction on low-textured tissue after 60 consecutive frames [Giannarou et al.(2012)]. Panel (a): FlowNet2S+F.

Accuracy and speed

We provide the relative endpoint error (EPE*) as well as inference time in Figure 3 on all test sets for models FlowNet2, FlowNet2S and our proposed model FlowNet2S+F. FlowNet2 ran at a processing rate of approx. 12 fps on our dataset, while FlowNet2S is significantly faster at approx. 80 fps (rates as reported in Figure 1). Our fine-tuned model FlowNet2S+F has the same inference time as FlowNet2S. FlowNet2S+F is therefore 6 times as fast as its complex counterpart FlowNet2.
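The speed-up can be checked with simple arithmetic from the processing rates reported in Figure 1 (12 fps for FlowNet2, 80 fps for FlowNet2S and FlowNet2S+F) and the 30 fps camera rate from Table 1. The per-frame times below are derived from these rates, not measured values, and the helper name is hypothetical.

```python
def frame_time_ms(fps):
    """Per-frame processing time in milliseconds for a given rate."""
    return 1000.0 / fps

teacher_fps = 12.0   # FlowNet2, rate reported in Figure 1
student_fps = 80.0   # FlowNet2S and FlowNet2S+F, rate reported in Figure 1
camera_fps = 30.0    # frame rate of the datasets (Table 1)

speed_up = student_fps / teacher_fps
print(f"teacher: {frame_time_ms(teacher_fps):.1f} ms per frame")   # 83.3 ms
print(f"student: {frame_time_ms(student_fps):.1f} ms per frame")   # 12.5 ms
print(f"speed-up: {speed_up:.2f}x")                                # 6.67x

# A model keeps up with the camera only if it needs less time per frame
# than the camera's frame period (33.3 ms at 30 fps).
print("teacher keeps up:", frame_time_ms(teacher_fps) <= frame_time_ms(camera_fps))
print("student keeps up:", frame_time_ms(student_fps) <= frame_time_ms(camera_fps))
```

At these rates only the student model processes a frame within the camera's frame period.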

We show EPE* results averaged over the entire sequence, as well as results for each test sequence separately to see how FlowNet2S and our proposed model FlowNet2S+F handle rotation, scale, sparse texture and deformation. To show outliers and maximum errors, we chose boxplots rather than the common average. The number of samples in each boxplot is provided in Table 1. As we employed FlowNet2 as the benchmark to compute EPE*, the error of FlowNet2 naturally is zero.
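As an illustration of the measure, the following sketch computes per-pixel endpoint errors between a student and a teacher flow field and the statistics shown in a boxplot. The flow values are hypothetical, the flow fields are plain nested lists for simplicity, and the quartile convention (`method="inclusive"`) is an assumption; plotting libraries may use a slightly different one.

```python
import math
import statistics

def endpoint_errors(flow_a, flow_b):
    """Per-pixel Euclidean distance between two flow fields.

    Each flow field is a list of rows, each row a list of (u, v) tuples.
    """
    errors = []
    for row_a, row_b in zip(flow_a, flow_b):
        for (ua, va), (ub, vb) in zip(row_a, row_b):
            errors.append(math.hypot(ua - ub, va - vb))
    return errors

def boxplot_stats(values):
    """Median, quartiles and 1.5*IQR whisker limits of a list of values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_low": q1 - 1.5 * iqr, "whisker_high": q3 + 1.5 * iqr}

# Hypothetical 2x2 flow fields (u, v) in pixels: teacher versus student.
teacher_flow = [[(1.0, 0.0), (0.0, 2.0)], [(1.0, 1.0), (0.0, 0.0)]]
student_flow = [[(1.0, 0.0), (3.0, 6.0)], [(1.0, 2.0), (0.0, 0.0)]]

errors = endpoint_errors(student_flow, teacher_flow)
mean_epe = sum(errors) / len(errors)
stats = boxplot_stats(errors)
print("mean EPE*:", mean_epe)
print(stats)
```

Comparing the teacher's flow with itself yields an error of zero everywhere, which is why no EPE* is reported for FlowNet2.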

The fine-tuned model produces optical flow estimations much closer to the ones estimated by FlowNet2. The initial FlowNet2S showed an average EPE* of 0.6, which was reduced to 0.12 by fine-tuning. Generally, sparse texture and deformation seem more challenging for FlowNet2S than rotation and scale change. After fine-tuning, the proposed FlowNet2S+F achieves comparable accuracy to FlowNet2 on the deformation test set. The sparse texture seems to pose the biggest challenge also for FlowNet2S+F (we propose an explanation in Appendix C); nevertheless, the error is still reduced to less than half. Overall, both models achieve low relative errors on all test sets. However, this does not imply that the estimations are accurate enough for tracking. As already shown in Figure 1, FlowNet2S fails on this task.


Due to the lack of annotated ground truth, we cannot evaluate the estimation accuracy on the test sequences directly. To overcome this issue, we embed the flow estimation into a tracking algorithm over consecutive frames. Small errors between two consecutive frames are not visible. During tracking, however, small errors accumulate over consecutive frames, resulting in drift and making them visible over time.

We illustrate our tracking results in Figures 1, 4 and 5. We compare the estimation accuracy of our fine-tuned model FlowNet2S+F to the existing models FlowNet2 and FlowNet2S. The long-term tracking based on FlowNet2 yields robust results on all our test sets, confirming its capability as a suitable teacher model. There is a small occurrence of drift, visible in Figure 4, bottom row, after 158 frames. Before fine-tuning, the student model FlowNet2S fails on all test sets. After fine-tuning, however, our customized FlowNet2S+F model mimics the predictions of FlowNet2 very well, with few exceptions at the margins. We deduce that FlowNet2S is in fact capable of learning highly accurate flow estimation for small, patient-specific target domains. Overall, our specialist model achieves comparable accuracy on all test sets compared to the complex FlowNet2 framework, which is more than a factor of 6 slower in inference time.

5 Conclusion

In this work we introduced a novel method to create a patient-adaptive OF algorithm on-the-fly during pre-operation setup based on the concept of patient-specific vision algorithms. Our method allows us to use a real-time capable model that was previously not suitable for this task.

For evaluation of our method, we used in vivo endoscopic video sequences from the Hamlyn Dataset [Mountney et al.(2010), Giannarou et al.(2012)]. We created gold truth annotations for our training data using FlowNet2 as a high-performance teacher model. The gold truth was used to fine-tune FlowNet2S, a simple, therefore real-time capable student model [Ilg et al.(2017)]. We benchmarked our approach by comparing FlowNet2 (accurate but slow) and FlowNet2S (fast but less accurate) embedded in a tissue tracking algorithm, as well as providing the relative EPE for the fast models. Overall, our specialist model FlowNet2S+F achieves comparable estimation accuracy to the complex FlowNet2 framework on our test set at a significant speed up (more than 6 times faster).

Our fine-tuning method significantly reduced the issue of drift for model FlowNet2S on our test set, making it feasible for robust long-term tissue tracking at high processing speeds. With a processing rate of approx. 80 fps on our test images (Figure 1), this not only enables the processing of higher camera frame rates, increasing the update rate of estimated motion, but also generally reduces computation delay for slower frame rates. Overall, high processing rates improve the safety of vision-assisted applications.

With fine-tuning taking less than 15 minutes on average, the proposed method can be used as a preprocessing step before operation. The method paves the way for improved patient-specific motion estimation and tracking for computer-aided surgery. Overall, we demonstrate the feasibility of our training scheme. We achieved accuracy and robustness to drift comparable to a state-of-the-art model architecture for accurate OF at a significant speed-up, making it feasible for real-time online application. Faster processing rates result in faster tracking feedback and can significantly increase the safety of a surgical intervention.

In our experiments, we only tested our method on FlowNet2S, however, any adequate end-to-end convnet for OF is feasible. Future experiments could also be done on FlowNets, a sparse version (sparse weights) of FlowNet2S. At the expense of possibly reduced accuracy, it is possible to achieve even faster flow computation. It would further be interesting to teach the student model using a teacher ensemble, fully exploiting the potential of distilled knowledge. This has the further advantage of introducing predictive uncertainty to our motion estimation. The uncertainty of a prediction has significant influence on the safety of a vision application. For real-life application the method would also have to be extended by occlusion handling.


This research has received funding from the European Union as being part of the EFRE OPhonLas project.


  • [Armin et al.(2018)] Mohammad Ali Armin, Nick Barnes, Salman Khan, Miaomiao Liu, Florian Grimpen, and Olivier Salvado. Unsupervised learning of endoscopy video frames’ correspondences from global and local transformation. In OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, pages 108–117. Springer, 2018.
  • [Buciluǎ et al.(2006)] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541, 2006.
  • [Butler et al.(2012)] Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV), pages 611–625, 2012.
  • [Dosovitskiy et al.(2015)] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [Geiger et al.(2012)] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. pages 3354–3361, 05 2012. ISBN 978-1-4673-1226-4. doi: 10.1109/CVPR.2012.6248074.
  • [Giannarou et al.(2012)] Stamatia Giannarou, Marco Visentini-Scarzanella, and Guang-Zhong Yang. Probabilistic tracking of affine-invariant anisotropic regions. IEEE transactions on pattern analysis and machine intelligence, 35(1):130–143, 2012.
  • [Guerre et al.(2018)] Alexandre Guerre, Mathieu Lamard, P-H Conze, Béatrice Cochener, and Gwénolé Quellec. Optical flow estimation in ocular endoscopy videos using flownet on simulated endoscopy data. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 1463–1466. IEEE, 2018.
  • [Hinton et al.(2015)] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [Horn and Schunck(1981)] Berthold K. P. Horn and Brian G. Schunck. Determining optical flow. Artificial Intelligence, 17:185–203, 1981. doi: 10.1016/0004-3702(81)90024-2.
  • [Ilg et al.(2017)] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
  • [Liu et al.(2019)] Pengpeng Liu, Irwin King, Michael R Lyu, and Jia Xu. Ddflow: Learning optical flow with unlabeled data distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8770–8777, 2019.
  • [Mahmoud et al.(2017)] Nader Mahmoud, Óscar G. Grasa, Stéphane A. Nicolau, Christophe Doignon, Luc Soler, Jacques Marescaux, and J. M. M. Montiel. On-patient see-through augmented reality based on visual slam. International Journal of Computer Assisted Radiology and Surgery (JCARS), 12(1):1–11, Jan 2017. ISSN 1861-6429. doi: 10.1007/s11548-016-1444-x.
  • [Mayer et al.(2016)] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. arXiv:1512.02134.
  • [Meister et al.(2018)] Simon Meister, Junhwa Hur, and Stefan Roth. UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In Conference on Artificial Intelligence (AAAI), pages 7251–7259, New Orleans, Louisiana, February 2018.
  • [Menze and Geiger(2015)] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [Mountney et al.(2010)] Peter Mountney, Danail Stoyanov, and Guang-Zhong Yang. Three-dimensional tissue deformation recovery and tracking. IEEE Signal Process. Mag., 27(4):14–24, 2010. doi: 10.1109/MSP.2010.936728.
  • [Penza et al.(2018)] Veronica Penza, Xiaofei Du, Danail Stoyanov, Antonello Forgione, Leonardo S. Mattos, and Elena De Momi. Long term safety area tracking (LT-SAT) with online failure detection and recovery for robotic minimally invasive surgery. Medical Image Analysis, 45:13–23, 2018.
  • [Reda et al.(2017)] Fitsum Reda, Robert Pottorff, Jon Barker, and Bryan Catanzaro. flownet2-pytorch: Pytorch implementation of flownet 2.0: Evolution of optical flow estimation with deep networks., 2017.
  • [Schoob et al.(2017)] Andreas Schoob, Dennis Kundrat, Lüder A. Kahrs, and Tobias Ortmaier. Stereo vision-based tracking of soft tissue motion with application to online ablation control in laser microsurgery. Medical Image Analysis, 40:80–95, 2017.
  • [Spyrou and Iakovidis(2012)] Evaggelos Spyrou and Dimitris K Iakovidis. Homography-based orientation estimation for capsule endoscope tracking. In IEEE International Conference on Imaging Systems and Techniques (IST), pages 101–105. IEEE, 2012.
  • [Sun et al.(2010)] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2432–2439, June 2010.
  • [Sun et al.(2018)] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
  • [Wulff and Black(2015)] Jonas Wulff and Michael J Black. Efficient sparse-to-dense optical flow estimation using a learned basis and layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 120–130, 2015.
  • [Yip et al.(2012)] M. C. Yip, D. G. Lowe, S. E. Salcudean, R. N. Rohling, and C. Y. Nguan. Tissue tracking and registration for image-guided surgery. IEEE Transactions on Medical Imaging, 31(11):2169–2182, Nov 2012. ISSN 0278-0062. doi: 10.1109/TMI.2012.2212718.
  • [Yu et al.(2016)] Jason J. Yu, Adam W. Harley, and Konstantinos G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In European Conference on Computer Vision (ECCV) Workshops, pages 3–10, Cham, 2016. ISBN 978-3-319-49409-8.

Appendix A Training Parameters

added soon

Appendix B Tracking algorithm

Image stabilization (or motion compensation) based on estimated flow fields is achieved by warping each frame to reconstruct the first frame.


Tracking is the inverse operation of image stabilization.
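A minimal sketch of flow-based tracking, assuming that a point is propagated by adding the estimated displacement at its current position frame by frame. The function name, the nearest-pixel lookup and the constant example flow are illustrative assumptions, not a reproduction of the tracking algorithm used in the experiments.

```python
def track_point(point, flows):
    """Propagate an (x, y) point through a sequence of flow fields.

    `flows[t][row][col]` is the (u, v) displacement of pixel (col, row)
    from frame t to frame t+1. The flow is sampled at the nearest pixel
    (an assumed simplification; bilinear sampling would be smoother).
    """
    x, y = point
    trajectory = [(x, y)]
    for flow in flows:
        height, width = len(flow), len(flow[0])
        # Clamp the lookup position to the image borders.
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        u, v = flow[row][col]
        x, y = x + u, y + v
        trajectory.append((x, y))
    return trajectory

# Hypothetical example: constant flow of (1.0, 0.5) pixels on a 4x4 grid.
constant_flow = [[(1.0, 0.5)] * 4 for _ in range(4)]
trajectory = track_point((0.0, 0.0), [constant_flow] * 3)
print(trajectory)  # [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]
```

Because each step starts from the previously estimated position, small per-frame errors accumulate along the trajectory, which is what makes drift visible over long sequences.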

Figure 6: Progress of Training FlowNet2S with illumination augmentation. Showing flow-based image reconstructions before epoch 1, 2 and 3. Top and bottom are initial image pairs. Second, third and fourth row show training progress before epoch 1, 2 and 3 respectively. It shows that FlowNet2S cannot handle strong illumination changes well without fine-tuning.

Appendix C FlowNet2S and its struggle with illumination changes

It is not only the sparse texture that is challenging in the liver dataset of this work, but also the very difficult lighting conditions around the border. The shading looks like vignetting, but is actually due to the shape of the liver in combination with frontal illumination. As a result, the transition from light to dark areas should not be used as a reference for flow estimation. A well-working OF model would require invariance to illumination. As can be seen in Figure 6, FlowNet2S does not handle strong illumination changes well. This should be the subject of future work.

Appendix D EndoRegNet

Armin et al. propose a method called EndoRegNet for estimating flow between two endoscopic frames using an unsupervised learning scheme [Armin et al.(2018)]. They provide the structural similarity index (SSIM) for their method on the rotation and scale dataset also used in this work. For comparison, we computed the same measure derived from FlowNet2 and FlowNet2S on our test sets, which both perform significantly better than EndoRegNet, see Table 2.

SSIM   EndoRegNet   FlowNet2   FlowNet2S
mean   0.85         0.9615*    0.9551*
min    0.83         0.8785*    0.8650*
max    0.87         0.9894*    0.9845*
Table 2: Structural Similarity Index on Hamlyn rotation and scale dataset. *Evaluation of FlowNet2 and FlowNet2S was performed on a subset (test set of this work).
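For orientation, the structural similarity index can be sketched as below. This is a simplified single-window variant in plain Python: the usual SSIM averages the same statistic over local windows, so values from this sketch are not directly comparable to those in Table 2. The function name and the example images are hypothetical; in the present context the two inputs would presumably be a flow-warped frame and its reference frame.

```python
def global_ssim(img_a, img_b, data_range=255.0):
    """Single-window SSIM over two equally sized grayscale images.

    Images are flat lists of pixel intensities. Using one global window
    instead of local windows is an assumed simplification.
    """
    n = len(img_a)
    mu_a = sum(img_a) / n
    mu_b = sum(img_b) / n
    var_a = sum((a - mu_a) ** 2 for a in img_a) / n
    var_b = sum((b - mu_b) ** 2 for b in img_b) / n
    cov = sum((a - mu_a) * (b - mu_b) for a, b in zip(img_a, img_b)) / n
    # Standard stabilizing constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

reference = [10.0, 20.0, 30.0, 40.0]
reversed_image = [40.0, 30.0, 20.0, 10.0]
print(global_ssim(reference, reference))       # identical images give 1.0
print(global_ssim(reference, reversed_image))  # dissimilar structure scores lower
```

A perfect reconstruction of the reference frame gives a value of 1, and any structural mismatch lowers the score.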