Deep convolutional neural networks have been proposed to segment livers from surgical video images [gibson2017deep], a significant step towards fully-automated computer-assisted guidance for liver resection procedures. The automatically segmented liver surfaces can be used to reconstruct anatomical structures for assisting real-time navigation and for registering with preoperative 3D medical images, such as diagnostic CT or MR, to locate the target of operative interest. Precise image guidance has the potential to increase the number of patients that can be offered laparoscopic liver resection over open surgery, thereby significantly reducing the surgery-related stress and risk.
Further improving the segmentation accuracy may require more labelled data, or more unlabelled data used with semi-supervised learning. As in many other medical image segmentation tasks, deep-learning-based approaches often require a substantial amount of labelled data for training, which relies on human experts with specialised clinical knowledge and multidisciplinary experience. On the other hand, acquiring more unlabelled image data from more patients or by prolonging procedures may have a significant impact on workflow and patient safety. The data planning decision in relation to performance improvement needs to be weighed against the unit costs associated with these two choices: labelling more data and collecting more unlabelled data.
Semi-supervised approaches have been successfully applied in medical image segmentation [bai2017semi, perone2018deep, cheplygina2019not]. However, comparing semi-supervised methods directly with their supervised counterparts has to account for multiple factors, such as the added unlabelled data and a different network whose training strategy is often more complex and application-specific. We postulate that this could lead to an inconclusive correlation between these confounding factors and the observed performance improvement. Based on the ‘mean teacher’ method [mean-teacher], which has been adapted into several medical imaging applications [perone2018deep, adapted-mean-teacher], we decomposed the effects into those caused by the change of network (training and architectures) and those caused by adding unlabelled data. The mean teacher approach averages model weights to produce perturbed predictions as pseudo labels for regularising the training [pseudo-label], a strategy that can be applied with or without ground-truth labels. In this work, we use the aforementioned surgical application as a real-world example to provide a quantitative analysis of the performance impact of the quantities of labelled and unlabelled training data.
Using real patient data from liver surgery cases, we summarise the contributions of this study as follows: a) A statistically significantly higher segmentation accuracy is reported in terms of Dice score and Hausdorff distance, compared with a previously proposed supervised method [gibson2017deep]; b) We demonstrate the possibility that the change of training strategy specific to semi-supervised learning could result in significantly better segmentation results without adding any labelled or unlabelled data; c) We show that adding more unlabelled data can potentially match the improvement made with more labels, providing a practically important quantitative basis for data planning decisions.
2.1 Supervised Segmentation Network Architecture
To analyse the effect of different training data set sizes in this work, we consistently use an exemplar neural network throughout our experiments, which is adapted from a U-Net variant [focal-tversky-loss]. Like the original U-Net [u-net], it consists of a downsampling path (encoder) and an upsampling path (decoder), with skip connections added between the two paths. In addition, a multi-scale input image pyramid is added at each encoder layer except for the bottom one. For the decoder, the attention gate and deep supervision are omitted in this network for faster training. The details of the network are illustrated in Fig. 1. The two-class Dice loss [sudre2017generalised] with regularization is adopted for classifying the foreground pixels representing liver from the background pixels.
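As an illustrative sketch only, a two-class soft Dice loss can be written in a few lines of plain Python. The unweighted average over the foreground and background classes, the smoothing constant `eps` and the function names are assumptions made for this example; the class weighting of the generalised Dice formulation and the regularization term are not reproduced here.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for one class: 1 - 2*sum(p*t) / (sum(p) + sum(t))."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # eps (an assumed smoothing constant) avoids division by zero on empty masks
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def two_class_dice_loss(fg_prob, fg_label):
    """Average the soft Dice loss over the foreground and background classes.

    fg_prob:  flat list of predicted foreground (liver) probabilities in [0, 1]
    fg_label: flat list of ground-truth foreground labels (0 or 1)
    """
    bg_prob = [1.0 - p for p in fg_prob]
    bg_label = [1.0 - t for t in fg_label]
    return 0.5 * (soft_dice_loss(fg_prob, fg_label)
                  + soft_dice_loss(bg_prob, bg_label))
```

With this definition, a perfect prediction gives a loss close to zero and a fully inverted prediction gives a loss close to one.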
2.2 Semi-supervised Mean Teacher Training
Denote the labelled input as , with its label as , and the unlabelled input as . Let be the mixed input. Two identical segmentation networks, the student network and the teacher network are illustrated in Fig. 2, with different input noise and network weights .
During the training, the student network’s weights are optimized using back-propagated gradients with respect to a regularised segmentation loss:
where is a hyper-parameter balancing the contributions of a supervised loss and an unsupervised loss , both based on the two-class soft Dice loss [sudre2017generalised]. measures the overlap between the prediction and the ground-truth label, while measures the discrepancy between the student's and the teacher's predictions. The teacher network is updated using an exponential moving average (EMA): after each training step, , where controls the smoothing.
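The two update rules described above can be sketched as follows. This is a minimal illustration, assuming the additive form of the regularised loss used in the original mean teacher method (supervised loss plus a weighted unsupervised loss) and the standard EMA recursion; the function and argument names are hypothetical, and network weights are represented as flat lists of floats.

```python
def total_loss(supervised_loss, unsupervised_loss, weight):
    """Regularised loss: supervised term plus a weighted consistency term.

    The additive form is an assumption taken from the original mean
    teacher formulation.
    """
    return supervised_loss + weight * unsupervised_loss


def ema_update(teacher_weights, student_weights, decay):
    """Exponential moving average of the student weights.

    new_teacher = decay * old_teacher + (1 - decay) * student
    A decay close to 1 gives a smoother, slower-moving teacher.
    """
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]
```

Only the student receives gradients from `total_loss`; the teacher is changed solely through `ema_update` after each training step.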
One important mechanism of this method is adding noise and to labelled and unlabelled image input, respectively, and for . In this work, we propose to use random affine transformation as the noise in the spatial domain. We apply two independently-drawn affine transformations to the input data as follows: one is applied to the student network input, with the same transformation applied to the available labels for supervised loss; while the second is composed with the first and applied to the teacher network input. The second transformation is then applied to the student network’s prediction for computing the unsupervised loss.
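The bookkeeping of the two transformations can be illustrated on point coordinates. The sketch below is an assumption-laden example rather than the implementation used in this work: transformations are restricted to rotation, isotropic scaling and translation, the parameter ranges are hypothetical, and image resampling is omitted. It shows that applying the second transformation to something already in the student's frame lands in the same frame as the teacher's input, which is what makes the two predictions comparable.

```python
import math
import random


def random_affine(rng, max_rot=0.1, max_shift=5.0, max_log_scale=0.1):
    """Draw a random 2-D transform as a 3x3 homogeneous matrix.

    The ranges are hypothetical and chosen only for illustration.
    """
    angle = rng.uniform(-max_rot, max_rot)
    scale = math.exp(rng.uniform(-max_log_scale, max_log_scale))
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    return [[scale * math.cos(angle), -scale * math.sin(angle), tx],
            [scale * math.sin(angle), scale * math.cos(angle), ty],
            [0.0, 0.0, 1.0]]


def compose(second, first):
    """Matrix product second @ first: apply `first`, then `second`."""
    return [[sum(second[i][k] * first[k][j] for k in range(3))
             for j in range(3)] for i in range(3)]


def transform_point(matrix, point):
    """Apply a 3x3 homogeneous transform to an (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


rng = random.Random(0)
t1 = random_affine(rng)               # applied to the student input (and labels)
t2 = random_affine(rng)               # second, independently drawn transform
teacher_transform = compose(t2, t1)   # applied to the teacher input

point = (10.0, 20.0)
# Student frame first, then the second transform applied to its output
student_then_t2 = transform_point(t2, transform_point(t1, point))
# Teacher frame reached directly through the composed transform
teacher_direct = transform_point(teacher_transform, point)
```

Both routes map `point` to the same location up to floating-point error.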
3.1 Data Set
A total of 41,994 laparoscopic video frames, sampled at four frames per second, were captured with a Storz TIPCAM 3D stereo laparoscope camera in our experiment. These were from thirteen patients during six liver resection and seven liver staging procedures, with informed consent obtained from all patients; the data collection was approved by our institutional research ethics board. In addition, 2,209 images were selected, on which the liver regions were manually contoured by an expert clinical research fellow in General Surgery to provide ground-truth segmentation labels. The annotation was performed in NiftyIGI [clarkson2015niftk], resulting in , , , , , , , , , , , , labelled frames for each patient respectively.
The original size of the frame images was pixels in RGB channels, with black borders on both sides. For computational and memory efficiency, all images were linearly re-sampled to for each channel after cropping out the border to a size of pixels.
3.2 Network Implementation and Training
The depth of the network was and each network was trained for iterations with a mini-batch size of , using the Adam optimizer with an initial learning rate of . The weight of loss was fixed to throughout the experiments. The network output has the same size as the re-sampled input image, larger than used in previous work [gibson2017deep]. In the loss used in the mean teacher training, with increasing progressively, i.e. , where is the current training step and is the ramp-up length. The EMA decay was fixed to during the initial ramp-up phase and afterwards. All networks were implemented in TensorFlow and trained using Nvidia Tesla V100 general-purpose graphics processing units on a DGX-1 workstation. To avoid over-fitting the entire data set, all the reported hyper-parameter values were configured empirically without extensive tuning.
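A ramp-up schedule of this kind can be sketched as below. The sigmoid-shaped curve exp(-5(1 - t/T)^2) is an assumption borrowed from the original mean teacher method and may differ from the schedule used in this work; the two-phase EMA decay follows the description above, with the actual values left as arguments. All function and argument names are hypothetical.

```python
import math


def consistency_weight(step, ramp_up_length, max_weight):
    """Weight of the unsupervised loss, increasing from near 0 to max_weight.

    Assumes the sigmoid-shaped ramp-up exp(-5 * (1 - t/T)^2) from the
    original mean teacher method; the weight stays at max_weight once the
    ramp-up length is reached.
    """
    if ramp_up_length <= 0 or step >= ramp_up_length:
        return max_weight
    phase = 1.0 - max(step, 0) / ramp_up_length
    return max_weight * math.exp(-5.0 * phase * phase)


def ema_decay(step, ramp_up_length, decay_during_ramp_up, decay_afterwards):
    """Two-phase EMA decay: one value during the ramp-up, another afterwards."""
    if step < ramp_up_length:
        return decay_during_ramp_up
    return decay_afterwards
```

Starting with a small weight lets the supervised loss dominate while the teacher's predictions are still unreliable.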
All experiment results reported in this paper were based on 13-fold leave-one-patient-out cross-validation: for each fold, data from one patient were used for evaluation and the network was trained on the remaining data. The predicted binary masks representing segmentation were first re-sampled to and then processed by filling the holes before evaluation. Commonly-adopted data augmentation strategies for surgical video applications, including contrast and brightness adjustment and standardization, were also applied to the input data before they were fed to the network. The segmentation performance was measured by the Dice score and the 95th-percentile Hausdorff distance. The reported Hausdorff distance is in pixels and pixels correspond to approximately mm to mm, depending on the typical object-to-camera distance range in this application.
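The evaluation protocol can be sketched as follows, assuming the frames are grouped in a dictionary keyed by a patient identifier and that masks are flat lists of 0/1 values; both data layouts and the function names are hypothetical. Hole filling and the Hausdorff distance are omitted from this sketch.

```python
def leave_one_patient_out(frames_by_patient):
    """Yield (held_out_patient, train_frames, test_frames), one fold per patient."""
    patients = sorted(frames_by_patient)
    for held_out in patients:
        test_frames = list(frames_by_patient[held_out])
        # Training data are all frames from every other patient
        train_frames = [frame
                        for patient in patients if patient != held_out
                        for frame in frames_by_patient[patient]]
        yield held_out, train_frames, test_frames


def dice_score(pred_mask, true_mask):
    """Dice score between two flat binary masks: 2|A and B| / (|A| + |B|)."""
    overlap = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    # Two empty masks are treated as a perfect match (an assumed convention)
    return 1.0 if total == 0 else 2.0 * overlap / total
```

With thirteen patients this generator produces thirteen folds, and no frame from the held-out patient ever appears in the corresponding training set.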
To test different data set sizes, , , , and of the labelled data set were randomly sampled from each patient for semi-supervised networks, while , , and of the unlabelled data set were sampled with indicating the mean teacher models trained without unlabelled data. A single network without the mean teacher model (hereafter referred to as the baseline supervised network; the mean teacher model without unlabelled images is also fully supervised) was also tested. In practice, however, the availability of the labelled and unlabelled image data would be influenced by other practical factors, such as cost and patient cohort sampling, and is highly application-dependent. This controlled experiment was designed with a simplified condition that excludes potential differences introduced by anatomical variation between patients, and should be considered as the first step towards a more comprehensive experiment design considering both inter- and intra-patient variation. We also report the statistical significance of the observed differences throughout the presented experiments using non-parametric Wilcoxon signed-rank tests at a significance level of 0.05.
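Sampling a fixed fraction from each patient, rather than from the pooled data, keeps every patient represented at every data set size. A minimal sketch is given below, assuming frames are grouped by patient in a dictionary; the rounding rule, the fixed seed and the function name are assumptions made for this example.

```python
import random


def sample_fraction_per_patient(frames_by_patient, fraction, seed=0):
    """Randomly keep `fraction` of the frames of each patient.

    A fraction of 0 returns empty lists (for example, a mean teacher model
    trained without unlabelled data); a fraction of 1 keeps every frame.
    """
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    sampled = {}
    for patient in sorted(frames_by_patient):
        frames = list(frames_by_patient[patient])
        count = min(len(frames), max(0, round(fraction * len(frames))))
        sampled[patient] = rng.sample(frames, count)
    return sampled
```

The same helper can be applied separately to the labelled and the unlabelled frames to build each combination of data set sizes.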
4.0.1 Baseline Supervised Network (SL)
The median Dice scores on 13 folds from the baseline supervised network trained using all labelled images ranged from to with a median of , compared with , and from the previous study [gibson2017deep] to , the segmentation performance was improved, from to and from to , for Dice score and Hausdorff distance, respectively.
4.0.2 Mean Teacher (MT)
The results for SL and MT with unlabelled data are summarised in Table 1. The medians of both the Dice score and the Hausdorff distance from MT were significantly better than those from SL (both p-values ). The median Dice scores on folds ranged from to , with a median of , and therefore surpassed the previous study [gibson2017deep] (p-value ). Examples are shown in Fig. 3.
4.0.3 Mean Teacher with Different Labelled Data Set Sizes
The median Dice scores for the MT models, trained with all available unlabelled data and different quantities of labelled data, varied from to . The MT models consistently outperformed SL trained with the same sampled labelled data set sizes, as shown in Fig. 4. The Hausdorff distance results also showed a consistent difference. In addition, a clear overall trend for both segmentation metrics can be observed: the performance improves as the amount of labelled data increases.
4.0.4 Mean Teacher with Different Unlabelled Data Set Sizes
Median Dice scores are plotted in Fig. 5 with the quantity of labelled data indicated in the brackets. Without using any unlabelled data, MT generally outperformed SL; with more unlabelled data, MT produced better segmentation in general, but the improvement was not monotonic. For instance, using of unlabelled data improved MT () from to in terms of Dice score, but for MT () the score decreased from to . This may be caused by a) high correlation between unlabelled data due to the nature of the procedure and the omitted inter-patient variation (also discussed in Sec. 3.3); and b) the lack of optimised semi-supervised training and hyper-parameter tuning, which was not pursued further for the purpose of this work. Practically important, and perhaps more interesting, results quantify the trade-off between the labelled and unlabelled data. For example, using unlabelled data, MT () reached a Dice score of , which was higher than SL (), , depicting a scenario in which adding more unlabelled data achieves performance comparable to that of adding labels.
The quantified differences shown in this work, such as the improvement due to more labelled and/or unlabelled data, are useful in developing machine learning applications that in turn assist clinical procedures. To summarise, we have shown a statistically significant improvement in segmenting liver from laparoscopic video images using a semi-supervised mean teacher method. Whilst adding more labelled data generally improves the segmentation, it is possible to use more unlabelled data, instead of labelling more data, to achieve a comparable level of segmentation accuracy. To the best of our knowledge, this is the first time these conclusions have been presented with quantitative evidence based on real patient data.
These results, however, should be interpreted in light of the limitations of the experiment design due to practical constraints. We suspect that non-optimised semi-supervised training and sampling intra-patient variation, also discussed in Sec. 3.2 and 3.3, respectively, are possible reasons for the fluctuating segmentation performance as the amount of unlabelled data increases, which limited potentially larger improvement. Nevertheless, the reported high segmentation accuracy warrants a high applicability of the presented models for clinical use. Thus, the statistical significance found in the performance changes, measured on independent test data, suggests potential clinical value in planning data for training these semi-supervised models. These experiments also produced a set of quantitative results, on which future work can build to answer further multidisciplinary questions.
This work is supported by the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) (203145Z/16/Z). DS receives funding from EPSRC [EP/P012841/1]. MC receives funding from EPSRC [EP/P034454/1]. BD was supported by the NIHR Biomedical Research Centre at University College London Hospitals NHS Foundation Trust and University College London. The imaging data used for this work were obtained with funding from the Health Innovation Challenge Fund [HICF-T4-317], a parallel funding partnership between the Wellcome Trust and the Department of Health. The views expressed in this publication are those of the author(s) and not necessarily those of the Wellcome Trust or the Department of Health.