Surgical data science is the discipline of capturing, organizing, analyzing and modelling surgical data in order to improve the quality of interventional healthcare maier-hein_surgical_2017; maier-hein_surgical_2021. Semantic segmentation is an important prerequisite for various tasks in surgical data science, including context-aware assistance and surgical robotics. So far, the scientific literature has focused on binary segmentation tasks (e.g. medical instruments ros_comparative_2021) and conventional RGB video data (see e.g. scheikl_deep_2020; grammatikopoulou_cadis_2021).
Recently, spectral imaging clancy_surgical_2020 has evolved as a promising technique for advanced optical imaging in the operating room. While conventional RGB imaging is limited by imitating the human eye, spectral imaging systems capture the reflectance spectrum of the tissue over the entire field of view, thereby generating a datacube consisting of two spatial and one spectral dimension. Since the underlying tissue optical properties determine the measured tissue reflectance spectrum, spectral imaging has the potential to extract biological information such as tissue type and pathologies while being non-invasive. Spectral imaging with up to tens of spectral bands is generally referred to as msi, whereas spectral imaging with up to hundreds of spectral bands is named hsi clancy_surgical_2020. Early work on semantic segmentation with msi data indicated that more detailed spectral information may improve the segmentation performance, but the number of classes was relatively low with six and three classes, respectively moccia_uncertainty-aware_2018; garifullin_hyperspectral_2018. It has not been determined yet whether there is a benefit in using hsi data over processed hsi data (e.g. in the form of parameter images, cf. Fig. 1) and RGB data. Furthermore, we are not aware of any prior work that investigated what is an adequate representation of medical hsi data for neural network-based fully automated scene segmentation, especially with respect to the spatial granularity of the data. The spatial variability of images acquired during open surgery is large due to the less controlled imaging conditions (e.g. imaging distance), complex three-dimensional relationships between multiple soft tissues as well as anatomical differences between individuals. In addition, acquisition and annotation of large data sets are difficult and time-consuming. In order to determine the ideal spatial granularity of the hsi input data, the generalization capability towards unseen individuals and the required amount of training data thus also need to be considered.
To address this gap in the literature, this paper investigates the following research questions:
What is an adequate representation of hsi data for neural network-based fully automated organ segmentation? Specifically, what is the optimal granularity of the data (pixels vs. superpixels vs. patches vs. full images) with respect to segmentation quality, the required number of training cases and the capability to generalize towards unseen individuals?
Is there a benefit of using hsi data compared to RGB data and processed hsi data (e.g. tissue parameter estimations) when performing semantic organ segmentation?
2 Related Work
As only very limited prior work on automatic organ segmentation in msi/hsi exists, we first summarize related work on organ segmentation based on RGB data in surgery and then present a brief overview on the state of the art in segmentation with msi/hsi data, both within and outside the field of biomedical image analysis.
2.1 Deep learning-based organ segmentation on RGB data
During the past years, deep-learning based segmentation of RGB data has found several applications in surgery, especially in minimally invasive surgery rivas-blanco_review_2021 such as cataract grammatikopoulou_cadis_2021 or colorectal surgery maier-hein_heidelberg_2021. However, work has mainly focused on medical instrument segmentation, driven by various challenges in this area (e.g. CATARACTS challenge on automatic tool segmentation in cataract surgery al_hajj_cataracts_2019, Robust Medical Instrument Segmentation challenge in laparoscopic surgery maier-hein_heidelberg_2021). Only few recent works have tackled organ segmentation, either restricted to organ classes (e.g. fu_more_2019; webster_deep_2017) or, more often, in the context of full scene segmentation (e.g. allan_2018_2020; zhou_feature_2019; madad_zadeh_surgai_2020; scheikl_deep_2020). The data sets used differ highly in terms of annotation sparsity (e.g. full scene vs. specific organ segmentation) and the number of considered classes. Input to the models are video frames of varying size (e.g. in scheikl_deep_2020, in laves_dataset_2019). Relatively few works have tackled organ segmentation in open surgery, where, compared to minimally invasive surgery, image acquisition is often more difficult to realize and challenges arise from the even larger complexity and variability of the surgical scene gong_using_2021. We are aware of only a single investigation of deep-learning based organ segmentation on RGB images in open surgery: gong_using_2021 analysed segmentation performance under different imaging conditions such as lightning changes or varying distances based on RGB images of 130 patients and found that these factors have a high influence on the image scores.
Overall, prior work has identified several major challenges for automated organ segmentation on RGB data such as a high variability in tissue appearance across patients (e.g. webster_deep_2017; collins_segmenting_2015) and across images (e.g. due to occlusions or deformations moccia_uncertainty-aware_2018) as well as the variability in the image acquisition. Including further spectral information could be key for addressing those challenges as msi/hsi may be less reliant on the spatial context and encodes additional clinical information such as tissue perfusion fei_chapter_2020.
2.2 Segmentation with msi/hsi data
Within biomedical image analysis
Only a small number of papers address a biomedical segmentation problem based on msi/hsi data with deep learning khan_trends_2021. Even without restricting the search to deep learning-based approaches, we could only identify five related publications (with only the first three using deep learning techniques):
trajanovski_tongue_2021 segmented healthy and tumorous tongue tissue in histopathological hsi images on an in-house data set consisting of 14 patients (one image per patient) trajanovski_tongue_2021. They compared several pixel-based networks, networks based on patches of size and hybrid networks, taking a combination of entire pixel spectra and patches with a reduced number of channels as input. They found a U-Net architecture ronneberger_u-net_2015
based on patches to perform best in their specific segmentation task. However, as the performance analysis was conducted on the validation data set (on which hyperparameters were tuned), a subsequent evaluation on an independent test set remains to be performed.
garifullin_hyperspectral_2018 analysed 55 retinal msi images and segmented three tissue types (vessels, optic disc and macula) garifullin_hyperspectral_2018. They used SegNet badrinarayanan_segnet_2016 and Dense-FCN jegou_one_2017 models and compared msi with RGB data but their results did not reveal a clear winner (neither from the model nor the modality perspective).
cervantes-sanchez_automatic_2021 analysed 18 hsi images from seven hepatic surgery patients and 21 hsi images from seven thyroid surgery patients. They created sparse annotations of circular shape for four organs (liver, bile duct, artery, portal vein) in the case of hepatic surgeries and three organs (thyroid, parathyroid, muscle) in the case of thyroid surgeries cervantes-sanchez_automatic_2021cox_regression_1958boser_training_1992haykin_neural_1994 and U-Net) based on single pixels or small patches of an shape for automatic segmentation of the annotated organ classes. However, evaluation was only performed on the sparse annotations and on the validation data set (on which hyperparameters were tuned). Therefore, a subsequent evaluation on full semantic annotations and an independent test set remains to be performed.
moccia_uncertainty-aware_2018 acquired msi data of seven pigs (57 images) in the setting of hepatic laparoscopic surgery moccia_uncertainty-aware_2018
. They turned the actual organ segmentation problem into a classification problem in the following way: Based on manually extracted textural and spectral features from automatically segmented superpixels, they trained a support vector machine to classify six organs (liver, gallbladder, spleen, diaphragm, intestine, and abdominal wall). They showed that the classification accuracy for their msi data was superior to the classification accuracy for a selection of only three channels. However, the selected channels were too narrow to represent an RGB image.
akbari_wavelet-based_2008 acquired seven hsi images of abdominal organs for a single pig. Five organs (spleen, colon, small intestine, bladder, peritoneum) were annotated for these images and pixel-based organ classification was performed while learning vector quantization kohonen_learning_1995 of compressed spectra akbari_wavelet-based_2008. Given the small size of the data set consisting only of a single individual and the unclear separation between train and test data, a subsequent evaluation on an independent test set comprising a larger number of individuals remains to be performed.
Outside biomedical image analysis
msi/hsi is applied in various fields such as biochemistry, agriculture, archaeology and especially remote sensing prasad_hyperspectral_2020. However, investigation of deep learning-based semantic scene segmentation in those fields is rare khan_modern_2018. The validity of existing works is very limited due to small data sets composed of only one to two images and training and testing being performed on the same data (e.g. alam_crf_2016; nalepa_towards_2020; paul_classification_2021). Generally, the application of deep learning-based semantic segmentation is hampered by limitations in the available annotations vali_deep_2020: training data is sparse and usually only several discrete pixels instead of entire images are labelled, while due to the high dimensionality of the data, large data sets would be required to avoid overfitting zhu_spectral-spatial-dependent_2021. Due to the sparse annotations, most segmentation tasks in these fields are addressed via pixel-based classification (e.g. nalepa_towards_2020) and the few existing patch-based segmentation approaches are at high risk of train-test-leakage nalepa_validating_2019.
In summary, prior work on semantic scene segmentation in open surgery is extremely sparse and even non-existent in medical hsi. Furthermore, the data sets used so far are rather small, and the high complexity and variability in surgical scenes due to non-standardized image acquisition, inter-subject variability and complex three-dimensional relationships between multiple soft tissues (e.g. overlapping tissue, shadowing, deformations) gong_using_2021 remain to be addressed. Among the related work on organ segmentation from msi/hsi data, models have been based on superpixels moccia_uncertainty-aware_2018, patches trajanovski_tongue_2021; garifullin_hyperspectral_2018; cervantes-sanchez_automatic_2021 and pixels akbari_wavelet-based_2008. However, the optimal granularity of the data with respect to segmentation quality, the required number of training cases, and the capability to generalize towards new individuals given the large variability in the surgical scene, has not been determined up to the present date. Furthermore, no prior work could show a clear benefit of msi/hsi data over RGB data for deep learning-based organ segmentation. We address these gaps in the literature based on a semantically annotated hsi data set of unprecedented size and number of classes.
3 Materials and Methods
3.1 Image acquisition and data set
The hsi data was acquired at the Heidelberg University Hospital after approval by the Committee on Animal Experimentation of the regional council Baden-Württemberg in Karlsruhe, Germany (G-161/18 and G-262/19). hsi images were taken for 20 pigs that were managed according to the German laws for animal use and care and in agreement with the directives of the European Community Council (2010/63/EU).
3.1.1 HSI camera system
The hsi camera system Tivita® Tissue (Diaspective Vision GmbH, Am Salzhaff, Germany) was used to acquire the hsi data. In a push-broom fashion, it captures hyperspectral images with a spectral resolution of approximately in the spectral range between and , resulting in datacubes with a size of (width height number of spectral channels ). The camera system images an area of approximately 20 . An imaging distance of about is ensured through an integrated distance calibration system composed of two light marks that overlap if the distance is correct. The image acquisition takes approximately seven seconds. In addition to the hsi datacubes, the camera system estimates tpi that include oxygenation (sto2), perfusion (npi), water content (twi) and hemoglobin content (thi) from the hsi datacubes. Furthermore, reconstructed RGB images based on the hsi images are available. The underlying calculations are described in holmer_hyperspectral_2018. Fig. 1 shows the reconstructed RGB image and tpi associated with an exemplary hsi datacube. More technical details on the hardware are available in holmer_hyperspectral_2018; kulcke_compact_2018.
3.1.2 Data acquisition
In order to prevent distortion of the spectra from stray light, light sources other than the integrated halogen lighting unit were shut off during image acquisition and window blinds were closed. Instead of acquiring a fixed number of images with a fixed set of camera perspectives from a fixed set of situses per pig, we decided to acquire images in a fashion reflecting intraoperative reality. That is, the number of images and situses were decided individually for every pig to mimic natural variations during the surgery and the camera perspectives were chosen to allow for a good view of all organs in the scene.
An overview of the data set is given in Fig. 2. In total, 506 images from 20 pigs were acquired. For each organ, between 32 and 405 images were acquired from 5 to 20 individuals.
3.1.3 Image annotation
19 different classes were annotated, including two thoracic organs (heart, lung), eight abdominal organs (stomach, jejunum, colon, liver, gallbladder, pancreas, kidney, spleen) and one pelvic organ (bladder). In the case of kidney, images before and after removal of Gerota’s fascia were taken and labelled kidney with Gerota’s fascia and kidney, respectively. Furthermore, subcutaneous fat, skin and muscle tissue as well as omentum, peritoneum and major veins were annotated. Pixels that did not belong to any of the described 18 organ classes were labelled background. Additionally, pixels were labeled unsure if it could not be decided to which organ they belong. These pixels were later excluded from our analysis.
The semantic annotations were performed by two different annotators using vector annotation tools. To ensure consistent labelling, all annotations were then revised by the same medical expert.
Imbalances in the number of images per class arose since some organs naturally occur more often in the field of view of other organs. For example on images of the gallbladder, the surrounding liver is always present, whereas not on all liver images the gallbladder is visible. Heterogeneity in the number of animals per organ arose from differences in the surgical procedure performed, for example, opening of the thorax was only performed for eight out of 20 pigs, making hsi data from heart and lung unavailable for the remaining 12 pigs.
3.2 Deep learning-based full semantic scene segmentation
Our approach to deep learning-based semantic hsi image segmentation is summarized in Fig. 3. The following sections present our image processing pipeline in detail, including an overview of our input modalities (Sec. 3.2.1), our pre-processing of the hsi data (Sec. 3.2.2), the architectures of our deep-learning models (Sec. 3.2.3), our training setup (Sec. 3.2.4) and our approach to increase randomness in the data loading (Sec. 3.2.5).
In general, our design choices with respect to model architectures and training setup were motivated by our comparative study. We aimed to have common model and training parameters across the different spatial models and modalities whenever possible (e.g. same hyperparameters and data splits). We explicitly avoided individual parameter tuning for each model to ensure a fair comparison and reduce computational costs.
3.2.1 Model input modalities
A primary purpose of this study was to investigate whether there is a benefit in using hsi data compared to RGB and tpi data for neural network-based fully automated organ segmentation. For simplicity, we will refer to these different input data types as input modalities although they were in practice all computed with the same camera. This reflects the fact that a future application leveraging semantic segmentation could be based on RGB images from a conventional camera, on the preprocessed hsi images of an hsi camera provider or on raw hsi spectra. As illustrated in Fig. 3, we trained neural networks separately on all three input modalities for all studied levels of data granularity. RGB data reconstructed from the hsi data was available through the camera system. To study the organ segmentation performance on processed hsi data, the associated sto2, npi, twi and thi images were stacked, yielding a () tpi cube that served as model input.
3.2.2 hsi data pre-processing
In order to remove the influence of sensor noise and convert the acquired hsi data from radiance to reflectance, the raw hsi datacubes were automatically corrected with a pre-recorded white and dark reference cube by the camera system as described in holmer_hyperspectral_2018. After exporting the hsi cubes from the camera system, the -norm was applied to each pixel spectrum in order to account for multiplicative illumination changes that arise, for example, from fluctuations in the measurement distance. The hsi cube was further pre-processed with a median filter of the size (). Border values were reflected in order to maintain the original image size. This pre-processing step smoothed the hsi images both spatially and spectrally.
3.2.3 Deep learning models
Our approach to semantic segmentation is presented in Fig. 3 for the exemplary case of hsi input data. Implementation details are provided in the following paragraphs.
The smallest possible spatial granularity of the input data is to use single pixel spectra, resulting in input feature vectors of length in the case of hsi input data, in the case of RGB input data and
in the case of tpi input data. The deep learning architecture for the hsi input data is composed of three convolutional layers with 64 filters in the first, 32 in the second, and 16 in the third layer. Each convolution is performed with a kernel size of 5 and followed by an average pooling layer with a kernel size of 2. The output of the convolutional layers is stacked and served as input for two fully connected layers, with 100 neurons in the first and 50 in the second layer. For tpi and RGB input data, no convolutional operations along channels are feasible due to the small channel size. Instead, the network consists of three fully connected layers with 200 neurons in the first, 100 neurons in the second and 50 neurons in the third layer. The eluclevert_fast_2016
is used as an activation function and batch normalization is applied to the outputs of all layers. The class logits are calculated by a final linear layer, resulting in one organ label for each input pixel spectrum. The cross-entropy (CE) loss function is used to optimize the model during training.
The architecture was designed such that both local and global spectral information is aggregated while keeping a small network size of only weights for hsi, weights for tpi and weights for RGB input data and thus being computationally efficient. The local information aggregation is achieved through the convolutional layers: a relatively small kernel size was chosen to focus on local structures and by stacking three convolutional layers, a compromise between increasing the receptive field of the network and keeping the number of learnable weights small was achieved szegedy_rethinking_2016. The global context was learned by the fully connected layers.
To retrieve a segmentation map for an image, we predict a class label for each pixel in the image and then map the resulting labels back to the image positions.
Superpixels are regions of low spatial granularity that adhere to local boundaries which enclose pixels with similar features. As for the pixel-wise organ segmentation, the unsupervised clustering of superpixels turns the actual organ segmentation task into a superpixel-wise organ classification task. This is justified by the assumption that all pixels within a superpixel belong to the same organ class since superpixels are supposed to lie within the local boundaries of an organ. Superpixels are generated by the use of the simple linear iterative clustering (SLIC) algorithm on the reconstructed RGB data achanta_SLIC_2012
. Prior to clustering, the image is smoothed with a Gaussian kernel of width 3 and then 1000 segments are computed in ten iterations while adaptively changing the per-superpixel compactness parameter (SLICO mode). For each superpixel, a minimum enclosing bounding box is computed and areas not belonging to the superpixel are replaced with zeros. To ensure one common input shape, superpixels are resized via bilinear interpolation to the shape() for hsi, for tpi and for RGB input data.
The resized superpixel cubes are passed to an efficientnet-b5 encoder tan_efficientnet_2019
pre-trained on the ImageNet data setdeng_imagenet_2009 using the library of yakubovskiy_segmentation_2020 yakubovskiy_segmentation_2020. We chose this encoder as it yields good performance while at the same time being economical in terms of the number of parameters leading to a low memory footprint and fast computation times. The output of the encoder network is passed on to a classification head consisting of a dropout layer and a fully connected layer with 19 neurons for calculating the class logits. This way, the superpixel network shares the same architecture as the segmentation networks for the image and patch-based models with only minor modifications.
It is possible that not all pixels within one superpixel belong to the same organ class, for example due to inconsistencies at the border between organs. To account for this, we introduced the concept of fuzzy labels where we assigned a label vector of length to each superpixel (e.g.
classes in our case). The fuzzy label vector stores the relative frequency of each class label considering all enclosed pixels inside the superpixel. The Kullback-Leibler divergencekullback_information_1951 between fuzzy labels and the softmax output is used as a loss function during training.
During inference, the argmax of the class logits is computed and the resulting label is assigned to every pixel position of the superpixel. Predictions of all pixels from all superpixels are combined to yield a segmentation map for an image.
Patches are regions of low spatial granularity that are extracted from images according to a fixed shape. They are generally more easily generated and more straightforward to use in neural networks than superpixels, for example because their rectangular shape matches with the rectangular kernel shapes of convolutional neural networks. In order to capture different degrees of granularity, patches of two different shapes are extracted:and . These sizes serve as intermediate steps between the superpixel and the image model in terms of spatial granularity (cf. Table. 1). We use patch sizes which are a power of two so as to easily integrate them with encoder architectures which halve the input shape multiple times. The number of generated patches per image corresponds to the number of patches that could have been generated via a grid-based tiling.
The patches are passed to a U-Net with an efficientnet-b5 encoder pre-trained on the ImageNet data set (like the superpixel network). Dice loss and ce loss are calculated during training based on all pixels in the batch and equally weighted to compute the loss function. While each misclassified pixel contributes equally in the computation of the ce loss, misclassified pixels belonging to an organ class of smaller image area contribute more to the dice loss than misclassified pixels belonging to a dominant class (e.g.background). By computing a weighted sum of both loss terms, the network training can benefit from the respective advantages.
During inference, images are divided into a grid of non-overlapping patches of the corresponding patch size. In cases where an image dimension is not an integer multiple of the patch dimension, missing image regions are zero-padded. For each patch, the network yields a segmentation map. The segmentation maps of all patches of one image are combined to yield an image segmentation map. Segmentations belonging to previously zero-padded image regions are cropped.
Entire images offer a maximum of spatial granularity and we use them directly without any further adaptations of the image dimensions, i.e. the input tensors have a shape of() for hsi, for tpi and for RGB. Equivalent to the patch-based segmentation, the images are passed to an efficientnet-b5 U-Net pre-trained on the ImageNet data set. Again, both dice and ce loss are equally weighted to compute the loss function.
3.2.4 Training setup
To prevent biases due to differences in the training setup when comparing the organ segmentation performance on input data for different levels of spatial granularity and modalities, the training setup was made as comparable as possible for all models. The following paragraphs describe the data augmentation, optimization and regularization methods that were consistently applied for all the different segmentation models.
In order to increase the size and diversity of the available training data and thereby improve convergence, generalization and robustness on out-of-distribution samples, data augmentation is commonly applied in computer visionbuslaev_albumentations_2020. For all models, the training data is augmented using the Albumentation library111https://github.com/albumentations-team/albumentations with default settings buslaev_albumentations_2020 on a per-image basis before extracting pixels, superpixels or patches: images are shifted (shift factor limit: 0.0625), scaled (scaling factor limit: 0.1), rotated (rotation angle limit:
) and flipped with a probability of. All transformations are applied with a probability of to reduce the computational data loading costs that are induced by extensive data augmentations.
For all models, the Adam kingma_adam_2017 optimization algorithm and an exponential learning rate scheme are used (initial learning rate: 0.001, decay rate : 0.99, Adam decay rates : 0.9 and
: 0.999). Training is performed for 100 epochs and stochastic weight averagingizmailov_averaging_2019 is applied. To achieve a comparable training procedure across models, the available training budget should be equal for each model. To this end, the size of one epoch is defined as seeing 500 training images for image-based segmentation. For pixel-, superpixel- and patch-based segmentation, it was ensured that the total number of extracted pixels for each approach matches roughly the total number of pixels of 500 images. Such matching can only be roughly approximated because the epoch size has to be divisible by the batch size so that every worker in the data loader can contribute equally to every batch, cf. Sec. 3.2.5. For comparability across modalities, the batch size is adapted only on a per-model level, so that for all input modalities roughly of gpu memory are sufficient for training. Table. 1, gives an overview of the resulting epoch and batch sizes.
During training, each model was evaluated after each epoch on the validation set of the respective training fold. The dsc was calculated while respecting the hierarchical structure of the data, as described in more detail in Sec. 4.1. The final validation score was obtained by averaging the dsc values of three validation pigs, and this score was also used to determine the best model across all epochs.
|model||# pixels||epoch size||batch size|
To avoid overfitting, dropout regularization is applied for the fully connected layers in the pixel models and in the superpixel classification head while using a dropout probability of 0.1.
Variability of results
Training of neural networks is always subject to various sources of variation, some easier to control than others (e.g. seeding vs. hardware influences) pham_problems_2020. Especially in our study, in which we aim for a fair comparison between different models and modalities, reduction of such sources of variation is important. While obtaining exactly reproducible results is only achievable at the cost of longer training times (e.g. by restricting oneself to slow deterministic operations or a single homogeneous hardware infrastructure) pham_problems_2020, we took several measures to reduce the variation. We controlled the weight initialization of the networks, the initialization of the workers responsible for data loading (which also affects the data augmentation) and the ordering in which training samples are presented to the network for the modalities (e.g. corresponding spatial models for different modalities receive patches from the same spatial locations and in identical order). To this end, we set a seed for the network initialization and the data loaders and fixed the number of workers on each data loader to 12 across all experiments. Since we had to run many training runs efficiently for this study, we did not enforce deterministic operations and we used our in-house cluster infrastructure which consists of an inhomogeneous hardware infrastructure with many different GPUs (e.g. GeForce GTX™ 2080 Ti or DGX™ A100 (Nvidia Corporation, Santa Clara, USA)).
3.2.5 Multi-image contribution in data loading
The models that work on part of an image such as the patch or pixel models feature an additional challenge when it comes to data loading. Due to the size of the data set, it is not feasible to load all images at the beginning of the training process as the hsi data easily exceeds memory limitations. Pre-computing a set of image parts is also not feasible as the number of instances can be very high (e.g. each image has pixel instances). To overcome these issues, we used a custom data loading strategy outlined in Fig. 4 where all training images are distributed in unique sets to each worker. Each worker processes its own set of images as an endless stream but only ever loads one image at a time to extract the relevant parts (pixel, patch or superpixel in our case). The image parts are directly stored in a shared memory location by the workers so that the main process only needs to move the data to the gpu without any further processing. The shared memory implements a ring buffer which gets filled by the workers in a cyclic manner where the next batch is filled as soon as the batch is moved to the gpu by the main process. Every worker contributes equally to the batch, meaning that the same number of image parts are added to the batch dimension by every worker. This increases randomness across images in one batch so that the batch distribution more closely resembles the data distribution, which is usually desired when training a deep learning network. Once a worker has finished extracting parts from an image, the next image is loaded without keeping the old image in the memory. This leads to a constant memory footprint for data loading during training that is only determined by the number of used workers. The ordering of the images is reshuffled after each training epoch.
4 Experiments and Results
The purpose of our experiments was to investigate what constitutes an adequate representation of hsi data for neural network-based organ segmentation (cf. Sec. 4.2) and determine whether there is a benefit in using hsi data over other modalities like RGB and tpi data (cf. Sec. 4.3). The underlying experimental setup is described in Sec. 4.1.
4.1 Experimental setup
We validated our framework on the data set described in Sec. 3.1. This section describes our train-test-split, validation metrics as well as how we assessed the quality of our reference annotations and how much training data was required.
We split the data set comprising 20 pigs (506 images) into a train data set consisting of 15 pigs (340 images) and a disjoint test set consisting of five pigs (166 images). The five test pigs were randomly selected such that in both train and test set each organ class was represented by at least one pig. The test set remained untouched throughout model development. -fold cross-validation was performed on the train data set with . The folds were generated such that each organ class was represented by at least one pig and the number of pigs per organ class was maximized for each of the five validation sets. Since several fold combinations met this criterion, the fold combination with the most homogeneous distributions of number of pigs and number of images per class for all folds was selected. We refer to this traditional validation set obtained for each fold, which consists of three pigs that are not seen during training, as . Additionally, for each of the 12 pigs included in each of the five training sets, one random image was excluded from the training and used to compose a second validation set consisting of 12 unseen images per fold from pigs known from the training., referred to as . By comparing the model’s performance on the two validation sets, its generalization capabilities towards unseen individuals could be estimated. Once model architectures and parameters were finalized based on the performance on the validation set, we evaluated the segmentation performance on the hold-out test data set by ensembling the network predictions for each fold via averaging the softmax values.
Since individual validation metrics do not reflect all clinical requirements yeghiazaryan_family_2018; reinke_common_2021, we combined several validation metrics with different strengths and weaknesses to obtain a more informative quantification of the model performance. The most commonly used validation metric for biomedical segmentation tasks is the dsc dice_measures_1945; maier-hein_why_2018. It measures the overlap between a predicted object segmentation and the corresponding reference segmentation and is highly sensitive to the object size and insensitive to the object shape reinke_common_2021. dsc values are available per class and are always in the range with 0 indicating that either there is no overlap between predicted segmentation and reference segmentation for the class or the class is present on the image but has not been predicted. A value of 1 for a class is obtained if the prediction overlaps perfectly with the reference segmentation. It should be noted that any binary segmentation task can be regarded as a pixel-wise classification tasks and that the dsc is identical to the F1-score, which is typically represented as a function of true/false positives/negatives. Furthermore, the dsc is very closely related to the Intersection over Union (IoU), which is commonly used as primary validation metric in the machine learning community. For the computation of the dsc, we used the MONAI framework222https://monai.io/ consortium_monai_2020.
Boundary-distance-based methods measure the dissimilarity between the predicted segmentation and reference segmentation in terms of distances between boundaries. In contrast to overlap-based metrics, boundary-distance-based metrics are insensitive to the object size and sensitive to the object shape. An example is the asd heimann_comparison_2009 which for an organ is the average of all distances between pixels on the predicted object segmentation border and its nearest neighbor on the reference segmentation border . Here, we used the symmetric version of the asd which repeats the computation of the set of nearest neighbor distances with the role of the predicted segmentation and reference reversed, yielding . All obtained distance values are averaged, yielding an average distance value for each class:
The asd has the disadvantage that it is unbounded, yielding values in the range with 0 in cases where boundaries of objects match exactly. Therefore, asd values are generally harder to interpret. Furthermore, special attention must be placed on the case of missed classes (classes that are present in the reference annotations but were not predicted), as no natural limit is available reinke_common_2021. Here, we decided to set the asd value for a missed class to the maximum asd obtained for the other classes on the same image. This introduced a potentially high and image-dependent penalty when a class could not be predicted in an image (cf. discussion on this point in Sec. 5.3).
The nsd nikolov_deep_2021 estimates which fraction of a segmentation boundary is correctly predicted with an additional threshold related to the clinically acceptable amount of deviation in pixels. It is thus a measure of what fraction of a segmentation boundary would have to be redrawn to correct for segmentation errors. Instead of one common threshold , we used a class-specific threshold for each organ class since the difficulty of annotating varies between organs (e.g. an organ with a clear boundary, such as liver, is easier to precisely annotate than an organ with a fuzzy boundary, such as omentum). We adapted the nsd, which was originally invented for 3D segmentation maps, to our 2D segmentation maps in the following way: Instead of considering 3D segmentation surfaces, we considered 2D segmentation boundaries. For all pixels of the predicted segmentation boundary of organ , , the nearest-neighbor distances to the reference segmentation boundary were computed, resulting in a set of distances . We determined the subset of distances in that are smaller or equal to the acceptable deviation
The entire procedure was symmetrically repeated for , yielding and . For each class that appears in both the prediction and the reference segmentation the was then computed as:
The nsd is bounded in the range with 0 indicating that either the boundary is completely off, with all distances being larger than the acceptable deviation , or that the class is present on the image, but has not been predicted. 1 is obtained for cases where no redrawing of the segmentation boundary is necessary since every deviation is below the acceptable threshold .
One of the major challenges of the nsd is the determination of the organ-specific thresholds (cf. discussion in Sec. 5.3). To this end, 20 randomly selected images (one image per pig such that all organ classes are represented by at least two images) were re-annotated by a second medical expert. Equivalent to the asd, we computed distances between the boundaries of the original and the re-annotation for each organ in each image and averaged the results to obtain the image- and organ-specific threshold . If an organ appeared in only one of the two corresponding images, no distances could be computed and the corresponding structure was ignored (cf. discussion in Sec. 5.3). We computed the class-specific distance threshold by averaging the for the set of images where the organ is present and could be computed:
For each image, metric values were first separately computed for all classes annotated in the reference segmentation. These class-wise metric values were then averaged to obtain a per-image metric value. To account for the hierarchical nature of the data (following holland-letz_drawing_2020), all metric values were first averaged for all images of one pig and these per-pig scores served as a basis for visualizations and model rankings.
We investigated the model ranking and its stability with respect to two sources of variability, namely the sampling variability and variability due to choice of validation metric. The ranking analyses were performed using the challengeR333https://github.com/wiesenfa/challengeR toolkit wiesenfarth_methods_2021.
To assess the stability of the model ranking concerning the sampling variability, rankings were performed repeatedly on bootstrap samples. Each bootstrap sample consisted of 5 cases randomly drawn with replacement from the set of test cases, which consists of five test cases (one metric value for each pig in the test set).
To assess the stability of the ranking with respect to different validation metrics, rankings were performed separately for each metric and compared.
Quality of reference annotations
Crucial for any deep learning algorithm is the quality of the data, including the available reference annotations. It is known from previous studies that the variability between different human raters can be large joskowicz_inter-observer_2019
. To quantify the quality of our reference annotations, we used the same set of re-annotated images as described above and the annotations of the second medical expert for inter-rater estimations. Additionally, the medical expert who originally annotated our data set re-annotated the same set of 20 images for an estimation of the intra-rater variability. In both cases, we compared the re-annotations with the original annotation per image and calculated our three evaluation metrics. In contrast to the comparison of model predictions, we did not use pixels which the new annotator assigned one of the ignored classes (e.g.unsure). That is, the union of ignored pixels in reference and re-annotated segmentation map was computed and those pixels were ignored.
Required amount of training data
To study the effect of the amount of training data on the performance of the different models, we randomly sampled pigs from the set of 15 training pigs without replacement and varied from 1 to 14. Training of the models was only performed on the images of the sampled pigs without -folds, while the performance was evaluated on the same test set described in Sec. 4.1 but only for nine organ classes and without ensembling. These nine organ classes (background, liver, colon, jejunum, stomach, spleen, omentum, peritoneum and skin) are the set of organs of which images are available for all 15 training pigs. This design choice avoids the problem of having a pig sampled during training that does not contain any of the target classes, and hence would not give indicative results for this experiment. To increase the stability of the results towards pig sampling variability, we repeated the experiment five times with different random pig selections.
Misclassifications were analysed by computing the confusion matrix for the model of interest. To account for the hierarchical structure of the data, confusion matrices were first computed per pig and row-wise-normalized by dividing every value by the total number of pixels of the respective organ in the reference segmentation. Thereupon, the row-normalized confusion matrices of all test set pigs were averaged to yield the model-wise confusion matrix.
4.2 Representation of hsi data
A primary purpose of our study was to investigate what is an adequate representation of hsi data both with respect to the segmentation performance, the required amount of training data and the capability to generalize towards unseen individuals. The experiments performed to answer these research questions are outlined together with their results in the following paragraphs.
Fig. 5 shows the performance of the pixel-based (referred to as pixel), superpixel-based (referred to as superpixel), patch-based (referred to as patch_32 for an input shape of and patch_64 for an input shape of ) and image-based (referred to as image) segmentation models for our three validation metrics dsc, asd and nsd. Despite the fact that performance gaps for different spatial granularities are less prominent for hsi data than for RGB and tpi data, a consistent trend is visible for all levels of spectral granularity and validation metrics: the larger the spatial granularity of the input data, the better the segmentation performance.
Segmentation performance of the spatio-spectral models. The performance of five spatial models and three modalities (RGB, *tpi and *hsi) is shown for three different metrics on the test set. The dotted line and the shaded area represent the mean and standard deviation range, respectively, of the inter-rater results for the corresponding metric. Each boxplot shows the quartiles of the metric value distribution with the whiskers extending up tothe interquartile range, and the median and mean as solid and dashed line, respectively. The results on the validation data set reveal a similar pattern and are shown in Fig. 1.
Notably, the image-based hsi organ segmentation model performance is, on average, consistently comparable to predictions of a second human expert for all validation metrics. For the inter-rater variability, we obtained a dsc of 0.89 (sd 0.07), an asd of 4.41 (sd 5.41) and an nsd of 0.79 (sd 0.10). The intra-rater variability is better on all three metrics with a dsc of 0.91 (sd 0.06), an asd of 4.11 (sd 4.62) and an nsd of 0.81 (sd 0.07). Across all 20 images, it occurred 8 times in the inter-rater and 6 times in the intra-rater agreement evaluation that organ classes that were not annotated in the reference segmentation map were newly assigned to an image. 8 times in the inter-rater and 5 times in the intra-rater agreement evaluation, organ classes that were annotated in the reference segmentation map were missing in the re-annotations. Differences in the unsure class occurred for 15 and 16 out of the 20 images in the inter-rater and intra-rater comparison, respectively. In total, for in the inter-rater and in the intra-rater comparison, the label unsure was assigned in the re-annotation, but an organ label had been assigned in the reference annotation or vice-versa.
We described our strategy for the reduction of software- and hardware-related variability of our results in Sec. 3.2.4. To get an estimate of the controlled source of variation, we ran the image-based hsi model 5 times with different seeds and found the dsc to be in the range , the asd in and the nsd in on the test set. The controlled software- and hardware-related variability between pigs in the test set is thus about a magnitude smaller than the inter-pig variability.
In Fig. 6, we illustrate the ranking stability with respect to the sampling variability for the dsc (results for the asd and nsd can be found in Fig. 2 and Fig. 3, respectively). The bootstrapped ranking is relatively stable with the last three ranks being very clear and all other ranks varying mainly rank around the median (with image RGB being the only exception). Across all modalities, the ranking of the spatial models is always (from best to worst): image, patch_64, patch_32, superpixel and pixel. Hence, more context always improves the segmentation performance irrespective of the modality.
To assess the stability of the ranking with respect to different validation metrics, rankings for the three metrics were compared as depicted in Fig. 7. Consistently across all metrics, the image-based segmentation of hsi data ranks first. For the dsc and nsd, the ranking closely follows the level of spatial granularity (with more context yielding better results). Only for the asd, a shift in rank between superpixel-based and patch_32-based segmentation of hsi data can be observed. Generally, rankings for the different metrics are in close agreement (deviations of the rank by only rank from the median) with only two exceptions: in the asd ranking, ranks for superpixel-based models are improved compared to dsc and nsd rankings and the pixel-based segmentation of hsi data ranks better for the dsc than for asd and nsd.
Visual segmentation quality assessment
Example predictions for the five spatial models on the hsi data are shown in Fig. 8. Based on the image dsc averaged over all five models, the images corresponding to the quantile, quantile and quantile were selected, representing examples for overall bad, intermediate and good segmentation performances, respectively. For pixel-based segmentation predictions, boundaries are more incomplete and scattered than for the other models. In some of the patch-based segmentation examples, sharp vertical and horizontal edges are visible where adjacent patches connect, whereas, for the superpixel-based segmentation, edges appear wiggly due to misclassified superpixels in the proximity of organ boundaries.
Required amount of training data
A potential benefit when using input data of smaller spatial granularity is that more training samples are available, e.g. one image corresponds to training samples in the case of pixel-based segmentation, but only to a single training sample in the case of image-based segmentation (Table. 1). Fig. 9 shows the development of the dsc over the number of training pigs for the different models. Independently from the number of pigs in the training data set, the image-based segmentation model outperforms all other models. Results for asd and nsd show the same development and can be found in Fig. 4 and Fig. 5, respectively.
We wanted to study how well our neural network-based organ segmentation generalizes towards unseen individuals. To obtain an initial estimate for the generalization capabilities, we compared the segmentation performance achieved on the validation set containing only unseen pigs to those obtained on the validation set composed of unseen images of seen pigs (cf. Sec. 4.1). (a) shows the average dsc on and , respecting the hierarchical structure of the data, for the 5 different levels of spatial granularity over all 100 epochs throughout training. Performance is generally better for . With increasing spatial granularity, a larger gap in performance between and can be observed.
Fig. 11 shows the confusion matrix of the best model (image model on the hsi modality) for the test set described in Sec. 4.1. For 8 out of the 19 classes, on average more than of the organ pixels were correctly identified. For major vein, the sensitivity was lowest with only of the organ pixels being correctly identified.
4.3 Model input modalities
We wanted to investigate whether there is a benefit in using hsi data compared to RGB and tpi data.
Segmentation performance and ranking
Fig. 5 shows that for all metrics and all models, the average segmentation performance on hsi data is consistently better than the performance on RGB and tpi data. While the performance gap is largest in the case of pixel-based segmentation, it decreases with an increased level of spatial granularity and is minimized in the case of image-based segmentation. Nevertheless, a model based on hsi data ranks better than the same model based on tpi and RGB data regarding sampling and metric stability (Fig. 6 and Fig. 7, respectively). In many cases, a model based on tpi data ranks better than the same model based on RGB data. However, compared to the gap in performance/ranking between hsi-based and non-hsi-based segmentation, the difference in performance/ranking between tpi- and RGB-based segmentation is usually smaller.
When comparing the generalization capabilities towards unseen individuals for different input modalities in Fig. 10, it can be observed that performance gaps between and are comparable across the hsi, tpi and RGB input modalities for the superpixel-based, patch-based and image-based segmentation. Only for the pixel-based segmentation, a slight increase in the generalization gap from RGB over tpi to hsi data can be observed. Regarding the progress over training time, there are striking differences across modalities: while the dsc changed relatively smoothly for tpi and RGB data, the training was noisier for hsi data. This holds especially true for the pixel model whereas the training was only slightly noisier for the image model. Additionally, training converged faster for the tpi and RGB modalities whereas hsi benefited more from longer training times.
With the present study, we addressed two important and so far unanswered research questions in deep learning-based organ segmentation on hsi data, namely: (1) What is the optimal spatial granularity of the input data, both with respect to segmentation performance, required amount of training data and capability to generalize towards unseen individuals? (2) Is there a benefit in using hsi data over other modalities such as tpi and RGB data? The key insights are:
Performance: Consistently across all our validation metrics, the segmentation performance improved with larger spatial granularity of the input data. For image-based segmentation on hsi data, the performance of our model was comparable to that of a second medical expert.
Required amount of training data: Across all studied numbers of training pigs, the image-based segmentation on hsi data outperformed all other models.
Generalization capability: With larger spatial granularity, a larger gap between performance on known versus unknown pigs could be observed. Pixel-based segmentation thus generalizes better towards unseen individuals than image-based segmentation.
Benefit of hsi data over tpi and RGB data: For all models and metrics, segmentation performance on hsi data is better than on tpi and RGB data. However, the gap in performance decreases with increased spatial context from an improvement in the dsc of for the pixel, for the superpixel, for the patch_32, for the patch_64 to only about for the image model.
The following sections provide a detailed discussion on the results of our experiments (Sec. 5.1), design choices for our models and validation (Sec. 5.2) as well as limitations of our study (Sec. 5.3). We further discuss the impact of our work (Sec. 5.4), give an outlook on the future (Sec. 5.5) and close with our conclusions (Sec. 5.6).
5.1 Interpretation of results
We found that the segmentation performance for all input modalities generally improved with larger spatial granularity of the input data. There is only one exception to this, visible in (b) and Fig. 7: Despite the smaller spatial context, superpixel-based segmentation of hsi data ranks better than patch_32-based segmentation for the asd metric. This may be attributed to the boundary-sensitive nature of the asd metric. While the superpixel boundaries match the reference segmentation very well with an average lower bound for the asd of 2.75 (sd 0.90) if all superpixels were correctly classified (cf. Sec. 5.3), we see from the example predictions in Fig. 8 that sharp vertical and horizontal edges can be present in the patch_32-based predictions. These are due to our choice of aggregation scheme in which an image segmentation prediction is assembled from non-overlapping patches (cf. Sec. 5.3). The resulting incomplete and scattered boundaries are especially penalized by boundary-distance metrics such as the asd, whereas well matching boundaries are rewarded (cf. Fig. 7).
Apart from the shift in the asd ranking for the superpixel-based models, the ranking is relatively stable (deviations of the rank by only rank from the median) with one further exception: the pixel-based segmentation of hsi data ranks better for the dsc than for asd and nsd. A possible explanation are the incomplete and scattered boundaries of pixel-based segmentation predictions (cf. Fig. 8) due to the lack of spatial context. These are penalized more heavily by boundary-distance based metrics such as asd and nsd compared to the (pixel-level) overlap-based dsc (cf. Fig. 7). This finding is also supported by the segmentation examples of Fig. 8, where the dsc values are almost identical for all 5 models (including the pixel model with a dsc of 0.96) for the quantile image, but the asd, with a value of 7.52, captures the noisiness of the pixel-based segmentation predictions much better.
Required amount of training data
The reduced standard deviation range of the dsc with an increasing number of training pigs in Fig. 9 should be interpreted with care since pigs were always sampled without replacement and this inevitably increases the overlap of selected pigs across random selections on runs with a higher number of training pigs due to the limited number of 15 available training pigs. For example, when selecting 14 pigs out of 15 training pigs two times, then the two sets can only differ in one pig in the best case.
We found in Fig. 10 that the organ segmentation performance is generally better for , which makes sense since the gap between different images of the same pig can be expected to be smaller than the domain gap between different images of different pigs. We further found that the performance gap increases with larger spatial granularity of the input data. This finding may be explained in the following way: While a pixel-based model only needs to generalize for inter-pig variation in the spectral dimension, the spatial models need to additionally account for inter-pig variations of the spatial context.
The confusion matrix in Fig. 11 revealed that the largest part of the confusion () is between peritoneum and major vein, which can be explained by the neighboring relation of the two organs and the very limited amount of training data available for major vein since it could only be imaged for 32 images (cf. Fig. 2) and visible parts are generally of small size with on average (sd ). Other more frequently misclassified classes are either classes that are already difficult to annotate due to fuzzy boundaries (e.g. omentum, peritoneum, fat) or unclear distinction (e.g. kidney with Gerota’s fascia and peritoneum). Generally, most of the misclassifications in the confusion matrix occur between classes that are neighbors in the images (e.g. heart instead of lung and vice versa, colon instead of jejunum and vice versa, fat instead of peritoneum and vice versa, background instead of skin, etc.) which may be attributed to errors in the predicted segmentation boundaries. This assumption is supported by the segmentation examples in Fig. 8.
Model input modalities
We observed that increasing the input spatial context of the RGB and tpi modalities led to the gap in segmentation performance between using hsi data and tpi/RGB data decreasing, potentially since the model can use additional information from the spatial context, which compensates for the lack of detail in the spectral dimension. However, the smaller performance gap for the increased spatial context may also be attributed to the quality of the provided expert annotations since the performance of our hsi models approaches the inter-rater variability. We further saw in Fig. 5 that in many cases, a model based on tpi data ranks better than the same model based on RGB data indicating that the manually derived tpi data contains relevant information for the segmentation task.
Regarding the generalization capabilities of our models for different input modalities, we found in Fig. 10 that performance gaps between and are comparable across hsi, tpi and RGB input modality for the superpixel-based, patch-based and image-based segmentation, indicating that inter-pig variation in the spatial context may be more relevant than inter-pig variation in the spectral dimension. Only for the pixel-based segmentation, which is not affected by inter-pig variations in the spatial context, a slight increase in the generalization gap from RGB over tpi to hsi data can be observed.
We further saw in Fig. 10 that there are striking differences in the noisiness of the training: While the dsc is changing relatively smoothly for tpi and RGB data, the hsi training curve is much noisier. This holds especially true for the pixel model whereas the training was only slightly noisier for the image model. This may be attributed to the larger input feature dimension of hsi data compared to the other modalities.
5.2 Design choices
Our data set described in Sec. 3.1 is highly imbalanced in terms of number of images (Fig. 2) as well as number of pixels per class. We therefore tested different countermeasures against imbalanced data sets during model development. More precisely, we tried weighted loss functions and oversampling strategies.
For the loss functions, we first calculated a weight for each class based on the number of pixels for this class across all images in the respective fold of the training set. We chose in such that majority classes received a low and minority classes a high weight. We then computed the loss value for the whole batch as a weighted average of the individual class losses. Even though we tried several approaches for calculating the class weights (e.g. inverse proportional weights), none of them could consistently improve the results on our validation data compared to having no weights at all.
In a similar manner, we evaluated oversampling strategies by sampling instances from minority classes more often compared to instances of majority classes based on the same weights as used for the loss function. We tried this for the pixel- and patch-based methods, but the resulting dsc performance was always worse compared to the default sampling on the validation data.
We followed recommendations of segmentation challenges and used more than one metric to evaluate our results and to perform the ranking ros_comparative_2021; antonelli_medical_2021. We used an overlap-based measure (dsc), a distance-based measure (asd) and a measure for quantifying annotation uncertainty (nsd). Each metric analyzes specific properties of the predicted segmentation map and a model may be biased towards a metric, e.g. as discussed in Sec. 5.1, the pixel model performs best under the dsc metric (cf. Fig. 7), whereas the superpixel model performs best under the asd. Hence, only a combination of multiple metrics can give insights into the overall performance of a model.
One particular design choice which affects all metrics is the case of missing classes in the prediction. Common evaluation frameworks such as MONAI usually return nan or inf
values in these cases, leaving the aggregation to the user. While being less problematic for the dsc and nsd metric as they are bounded, the manner of handling missing classes is a crucial design choice for the asd which is unbounded. There are several options such as ignoring these cases completely or imposing a fixed penalty which e.g. depends on the image diagonal. We set missing classes to the maximum distance of the other classes, which introduces a penalty without producing outliers. However, it has the disadvantage that the value for the missing class depends on the prediction of the other classes in the image.
Using the nsd requires setting a (class-specific) threshold and for this, re-annotations of a subset of the images by at least one additional human rater are needed nikolov_deep_2021. This subset is usually small compared to the data set size (e.g. 20 of 506 images in our case) since annotations of more images or even annotations from multiple annotators are often not feasible. Hence, the thresholds depend mainly on this subset and errors in these annotations have a high influence on the results. Missing classes in the re-annotation are also problematic since corresponding distances cannot be computed so that this part of an image does not contribute to the threshold. In the original formulation, this problem did not occur since an organ annotation was created separately for each known organ class nikolov_deep_2021. In our case, however, the annotators did not know which organ classes were present in the image.
An additional problem is the choice of the aggregation function of the class distances per organ. For each image pair which both experts annotated, we computed the distances between the two annotations for each organ, applied an aggregation function and finally averaged the aggregated values across pigs and organs (respecting the hierarchical structure). In Fig. 12, we show several thresholds resulting from different aggregation functions. First, we see that there is a high variation across organ classes with e.g. large differences between the two annotations for peritoneum and low deviations between annotations of bladder as well as high variations across pigs (e.g. the standard deviation for the mean aggregation in case of skin is 2.7 times higher than the mean itself). This underpins that not for every organ it is equally difficult to determine its boundary. In general, the agreement of our two expert annotations is rather low indicating that even for medical experts the decision of which pixel belongs to which organ is neither easy nor unambiguous. Second, the choice of the aggregation function is crucial for the thresholds. In their original work, Nikolov & Blackwell & Zverovitch et al. used the quantile of the distances nikolov_deep_2021, but this led to very high thresholds even above in our case, so we decided to use the mean instead which results in more moderate distances always below . However, it is important to note that other aggregation methods like the median or another quantile would also have been possible.
There are two main assumptions behind our superpixel classification approach: (1) superpixels consist of homogeneous regions with every pixel inside a superpixel belonging to the same organ class and (2) superpixel borders follow the borders of the organs (instead of crossing them). We evaluated these assumptions and created a performance limit for our superpixel model by taking the modal value of all pixel labels inside a superpixel and using this modal value as a superpixel label ((a)). This directly incorporates the annotation labels and hence serves as a performance limit for our model.
In (b), we show the results of the performance limit for the different metrics on our data set. The performance limit comes closest to a perfect segmentation for the asd with an average of 2.75 (sd 0.90), followed by the dsc and nsd with average values of 0.92 (sd 0.03) and 0.76 (sd 0.07), respectively. The asd is very low and has a small standard deviation because the distances between the annotation and performance limit are constrained by the superpixel size and since each superpixel contains roughly , large distances are unlikely (with ). For dsc and nsd the gap to a perfect segmentation is larger, which indicates that the superpixels are not in perfect agreement with the annotations; this can mainly be attributed to the borders between organs. It is possible that either the superpixels or the annotations (or both) are not located at the “true organ border” and any deviation leads to a reduced overlap (dsc) or the necessity to redraw some superpixel borders to be in alignment with the annotation (nsd). The nsd is lower than the dsc because the acceptable threshold is very low for some organs (cf. Fig. 12) so that pixels with minor deviations already influence the nsd.
There is a gap between the performance limit and our model predictions for all metrics with average scores of 0.82 (sd 0.07), 16.21 (sd 9.87) and 0.60 (sd 0.12) for the dsc, asd and nsd, respectively. Assuming that the features of the superpixels are discriminative enough, this indicates that our superpixel model could be improved. This is amplified by our design choices of making the models and modalities as comparable as possible without specific optimizations for a single model (e.g. when we introduced augmentations, we added them to all models). However, the image hsi model is not too far away from the performance limit (with average values of 0.89 (sd 0.04), 7.02 (sd 4.42), 0.77 (sd 0.09) for the dsc, asd and nsd, respectively) so that the gain in investing further in the superpixel model development is low.
Non-standardized image acquisition is an important challenge in open surgery. Instead of following a standardized imaging protocol, we thus decided to acquire images in a fashion reflecting intraoperative reality (i.e. number of images and situses were decided on individually for every pig to mimic natural variations in the surgery performed). However, we are aware that this does not cover other, additional sources of variation present in a real-world open surgery setting, such as perfusion changes of certain tissue areas, presence of tissue pathologies (e.g. fatty liver), image parts being covered by body fluids (e.g. blood) and interventions being performed.
Furthermore, we found that for all studied imaging modalities, image-based models performed similarly to a second human rater. Additional performance improvements of our models may thus be limited by the quality of our reference annotations.
We designed our models to be as comparable as possible by, for example, using the same U-Net architecture or a comparable epoch size. This was important for our study as we wanted only the size of the input and the modality to be the main sources of variation instead of model-specific design choices.
For the same reason, we also did not apply any post-processing to the network output even though models like the pixel model could benefit from this, for example via morphological operations. For inference, we restricted the spatial context of each model to its defined input size. For example, the patches which the patch_32 model sees during inference are explicitly non-overlapping as this would increase the spatial context beyond a resolution of , making a comparison across different spatial resolutions (e.g. superpixel vs. patch_32) harder. However, we also see in Fig. 8 that this design choice produces visible artefacts, for example at patch borders in the patch-based segmentation.
Similarly, we did not introduce any hsi-specific model variations but instead used, for example, the same U-Net for the RGB and hsi modalities. We see in the benchmarking of Fig. 5 that the U-Net was able to leverage the additional spectral information, but the performance gap is more dominant in the patch-based and less in the image-based segmentation. This may be attributed to the U-Net leveraging the spectral information more poorly if more spatial context is available, but could also be influenced by our limitations in the annotation quality.
Many prior reports on organ segmentation in surgery have focused largely on conventional RGB imaging and minimally invasive surgery. While previous approaches to segmenting organs on hsi data have been based on a variety of input data sizes including pixels, superpixels and patches, neither the optimal granularity of the input data has been determined nor has a clear benefit in using hsi data over RGB data been shown. Furthermore, leveraging existing approaches to segmenting organs in an open surgery setting (in contrast to minimally invasive surgery) poses several challenges such as the large complexity and variability in the surgical scene due to non-standardized image acquisition, inter-subject variability and complex three-dimensional relationships between multiple soft tissues.
We addressed these gaps in the literature in a validation study of unprecedented size, both in number of pigs and number of classes. Despite the challenges imposed by the open surgery setting, we could segment organs with an average dsc of 0.89 (sd 0.04) in an image-based approach that is close to the inter-rater variability with an average dsc of 0.89 (sd 0.07). Consistently over all input sizes, the segmentation based on hsi outperforms the segmentation based on RGB data. We are the first to clearly show a benefit in using hsi data compared to RGB data for organ segmentation, thus offering a foundation that further segmentation studies on hsi data can build upon. We found that the segmentation performance decreases with less spatial context of the input data, whereas the generalization capability towards unseen individuals increases. These findings deliver valuable guidance for future work.
The following aspects should be addressed in future work:
In order to better reflect intraoperative reality, we decided to acquire our data in a way such that typical variations due to non-standardized image acquisition were covered. However, we could not cover all sources of variation that are present in real-world open surgery. Mainly, we imaged healthy and surgically unaltered pig organs and thus did not account for perfusion changes of certain tissue areas, presence of tissue pathologies (e.g. fatty liver), image parts being covered by body fluids (e.g. blood) and interventions being performed. These aspects should be studied in the future. Since the ultimate goal is real-world open surgery in humans,the generalization of our findings from pigs to humans should be verified.
In the present study, we focused on organ segmentation. A natural next step would be to expand our work to full scene segmentation by introducing additional instruments and other non-biological object classes as well as medical classes (e.g. sutures).
Our study on the intra- and inter-rater variability of our data set showed that organ segmentation is not an easy task, even for medical experts. The quality of our reference annotations may impose an upper limit to the performance of our models with our image-based models approaching the inter-rater variability. Future work may thus also focus on improving the reference annotations, e.g. by having the data annotated by multiple medical experts and seeking for consensus agreement in cases of differing annotations.
In a comprehensive validation study, we showed that unprocessed hsi data offers a huge benefit compared to RGB data or processed data of the camera provider for organ segmentation with the image-based hsi model approaching inter-rater performance. We conclude that hsi could become a powerful imaging modality for fully-automatic surgical scene understanding with many advantages over traditional imaging including the ability to recover additional functional tissue information.
Acknowledgments and conflicts of interest
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (NEURAL SPICING, grant agreement No. 101002198) and was supported by the German Cancer Research Center (DKFZ) and the Helmholtz Association under the joint research school HIDSS4Health (Helmholtz Information and Data Science School for Health). Martin Wagner, Lena Maier-Hein and Beat P. Müller-Stich worked with the medical device manufacturer KARL STORZ SE & Co. KG in the projects “InnOPlan” and “OP 4.1”, funded by the German Federal Ministry of Economic Affairs and Energy (grant agreement No. BMWI 01MD15002E and BMWI 01MT17001E) and “Surgomics”, funded by the German Federal Ministry of Health (grant agreement No. BMG 2520DAT82D).
The Authors declare that there is no conflict of interest.