Fabric Image Representation Encoding Networks for Large-scale 3D Medical Image Analysis

06/28/2020, by Siyu Liu et al.

Deep neural networks are parameterised by weights that encode feature representations, and their ability to generalise is dictated by training on large-scale, feature-rich datasets. The lack of large-scale labelled 3D medical imaging datasets restricts the construction of such generalised networks. In this work, a novel 3D segmentation network, the Fabric Image Representation Encoding Network (FIRENet), is proposed to extract and encode generalisable feature representations from multiple medical image datasets in a large-scale manner. FIRENet learns image-specific feature representations by way of a 3D fabric network architecture that contains an exponential number of sub-architectures to handle various protocols and coverage of anatomical regions and structures. The fabric network uses Atrous Spatial Pyramid Pooling (ASPP), extended to 3D, to extract local and image-level features at a fine selection of scales. The fabric is constructed with weighted edges, allowing the learnt features to dynamically adapt to the training data at an architecture level. Conditional padding modules, which are integrated into the network to reinsert voxels discarded by feature pooling, allow the network to inherently process different-sized images at their original resolutions. FIRENet was trained for feature learning via automated semantic segmentation of pelvic structures and obtained a state-of-the-art median DSC score of 0.867. FIRENet was also simultaneously trained on MR (Magnetic Resonance) images acquired from 3D examinations of musculoskeletal elements in the hip, knee and shoulder joints, together with a public OAI knee dataset, to perform automated segmentation of bone across anatomy. Transfer learning was used to show that the features learnt through the pelvic segmentation helped achieve improved mean DSC scores of 0.962, 0.963, 0.945 and 0.986 for automated segmentation of bone across datasets.


I Introduction

Deep-learning-based algorithms for computer vision are rapidly evolving and are being actively developed for medical image analysis (MIA) in applications such as automated image segmentation, in which fast, accurate algorithms can offer significant advantages over expertise and resource-intensive manual analyses, and have the potential to facilitate disease diagnosis [40] and treatment planning [50]. However, the development of deep learning approaches used to provide automated MIA systems to enhance clinical decision-making and care of patients remains an open challenge. Currently, deep learning algorithms such as Convolutional Neural Networks (CNNs) [34] have already achieved excellent accuracy and generalisability in 2D applications due to large-scale learning on computer vision datasets such as [37, 23, 12]. That is, models can accurately interpret, classify and perform segmentation on complex objects such as vehicles irrespective of variations in colour, contrast, scale, shape, orientation and object coverage. However, the use of such deep learning approaches for 3D MIA is often hampered by poor performance and generalisation due to the lack of large-scale datasets.

Deep Neural Networks (DNNs) such as CNNs, which are universal function approximators parameterised by (usually millions of) weights, have the potential to revolutionise MIA. During training, the weights in a DNN are fine-tuned via gradient descent according to an optimisation objective, and these weights effectively encode the feature representations (knowledge) the neural network has accumulated from the training data (e.g. images). Thus, the weights in a DNN govern the overall performance and generalisability of the network. However, without general intelligence, DNNs are prone to overfitting in domains where high-quality training data are scarce. In this paper, this problem is formalised as the sparse sampling of training data from the problem domain, with medical image segmentation being one such domain.

The sparsely-sampled nature of labelled medical image data is associated with three central factors. First, 3D images have an order of magnitude larger sampling space compared to 2D images due to higher-order data signals. Second, medical image datasets used to train deep learning models are characterised by high inter-dataset variability, which substantially exceeds intra-dataset variability. As clinical studies are typically highly focused, individual medical image datasets are usually insufficient representations of their true underlying distributions. For example, MR imaging is commonly used for dedicated examinations of various (selected) organs and soft tissues for the diagnosis of tissue damage, tumours and other pathologies, and has the capacity to generate 2D and 3D images based on a wide range of acquisition sequences (such as T1-, T2- and PD-weighted). Hence, medical image data from different sources may vary in contrast, field-of-view and resolution, as well as contain different anatomical structures and tissues, and variations within images due to vendor-specific acquisition protocols and sequences. Consequently, deep learning models trained on individual medical image datasets are unable to form generalisable weights, which leads to poor performance and generalisability. Third, labelled medical images are extremely time-consuming to acquire, as manual procedures such as expert contouring are labour-intensive and require careful planning, strict acquisition procedures and contouring protocols to avoid inter-rater differences, as well as anatomical inconsistencies such as those caused by full or empty bladders or rectums.

Improving the performance and generalisability of deep learning models can be achieved by obtaining weights that encode generalisable feature representations. From a data acquisition perspective, one needs to increase the sampling density of the training data. In the context of MIA, one practical solution would be multi-source sampling: combining the data from various sources (e.g. multi-modality, multi-sequence, multi-object (tissue) and multi-centre imaging databases) to train a generalisable model. As an example, a more generalisable lung segmentation model can be obtained by training on a set of lung scans from a large number of different sources. While much of the deep learning research literature in MIA has narrowly focused on individual medical image datasets that are domain-specific, the learning encoded in the weights can still be harnessed towards building a powerful backbone containing transferable weights.

The weights formed by training on data sampled from rich distributions present many important applications. Transfer learning [61] is a mechanism for utilising pre-trained weights in different but related tasks. For example, pre-trained networks such as ResNet [27] have been widely adopted by different computer vision tasks to bootstrap weight initialisation. In problem domains with sparsely sampled training data, transferred weights from a related task can introduce generalisable knowledge that may otherwise be underrepresented in the training set. More recently, pre-trained networks have proven highly effective for more advanced applications such as style transfer [2]. In MIA, a network trained on rich data distributions also has the potential to be extremely valuable.

In this work, we propose a deep learning framework capable of harnessing multiple medical image datasets and different anatomical regions and tissues to learn valuable 3D feature representations specific to medical image data. Our proposed FIRENet encodes generalisable medical image features through a novel adaptive fabric structure for 3D volumes. The fabric is enhanced by our Atrous Spatial Pyramid Pooling 3D (ASPP3D) to extract and encode powerful feature representations. The learning process involves performing image segmentation on multiple medical image datasets irrespective of acquisition procedures (e.g. modality, sequence and resolution). FIRENet is designed for multi-anatomy applications from the ground up, which we demonstrate by handling multiple different bones in MR imaging. The novel features of this network can be summarised as follows:

Fig. 1: High-level architecture diagram of FIRENet (bottom) and detailed fabric cell structure diagram (top). The base network is an encoder-decoder architecture with 2 hierarchies of feature pooling. The convolutional depths progressively increase as the input is pooled.
  1. FIRENet is powered by a novel Dense Residual Fabric (DRF) latent representation block geared towards extracting rich multi-scale features from multiple datasets in 3D. The fabric itself contains three different-scaled branches and uses the 3D extension of ASPP [7], which we term ASPP3D, in the fabric cells to achieve finer control of the receptive-field sizes. The DRF is embedded in an encoder-decoder network with moderate intermediate down-sampling, which prevents excessive loss of high-resolution features within the Graphics Processing Unit (GPU) memory constraint.

  2. The DRF is non-dataset-specific as it is an ensemble of an exponential number of sub-architectures, generalising the concept of neural architecture search for image segmentation. Each edge of DRF’s computational graph is assigned a trainable weight allowing the network to dynamically fine-tune the feature sharing scheme, which in turn allows the entire DRF structure to adapt to the training sets, as well as re-adapt to new datasets. We do not retain only the "best" compact architecture as in AutoDeepLab [39], but use the entire fabric to encapsulate a superposition of sub-architectures for handling multiple datasets and anatomy. The weighted edges of the DRF are facilitated by Weighted Residual Summation (WRS) modules, which are placed before ASPP3D (Fig. 1) to modulate (using weights) and aggregate input signals from predecessor cells.

  3. The network can inherently process images of arbitrary sizes despite its multi-scale nature. In networks involving intermediate resizing, pooling is often the source of internal size inconsistencies (among the feature maps), as the sliding window used can discard edge pixels (voxels). In response, an Input Size Equaliser (ISE) is employed (Fig. 1) before WRS to detect and reinsert missing pixels (voxels) on a per-input basis at runtime through conditional minimal padding.

  4. Multi-dataset learning using FIRENet facilitates many practical applications that can exploit the learning encoded in the feature space. In this case, 3D medical image segmentation on multi-MR-sequence, multi-object (tissue) and multi-centre datasets with a single model instance.

We demonstrate these features with two feature learning experiments performed on 3D MR datasets containing limited training examples. In the first experiment, FIRENet was tasked with learning features via the 3D semantic segmentation of a prostate dataset, and it achieved expert-level performance with a median Dice Similarity Coefficient (DSC) of 0.867 and a median Mean Surface Distance (MSD) of 0.971mm. In the second experiment, FIRENet was used to learn features simultaneously from the MR scans of musculoskeletal (MSK) regions (hip, knee, shoulder) [6, 25, 60], as well as the open Osteoarthritis Initiative (OAI) knee dataset [53], to perform 3D multi-dataset multi-bone segmentation. An instance of the network was initialised with the weights from the prostate experiment to introduce the learnt feature representations. As a result of the weight transfer, an overall validation DSC improvement (from 0.955 to 0.964) and faster convergence were observed. Finally, feature maps sampled from FIRENet’s intermediate outputs are visualised to examine the features learnt by the DRF module. The visualisations indicate that the fabric learnt generalisable features from the MR prostate dataset, which were shown to be also useful for the segmentation of the OAI MR knee dataset.

Due to the lack of publicly available data and specialised augmentation tools for 3D image-segmentation pairs, a data augmentation library specifically for 3D medical image segmentation was developed and has been made available on GitHub (https://github.com/SiyuLiu0329/python-3d-image-augmentation) for public use.

II Related Work

In this section, we review important state-of-the-art methods in 2D and 3D image segmentation in the context of image representation learning, as well as the underlying mechanisms contributing to their success.

II-A Feature Learning in CNNs for Image Segmentation

CNNs, initially designed for classification tasks [34, 52], utilise powerful feature extractors that learn feature representations from aggregated contextual information. Typically, aggregating contextual information in CNNs for classification involves consecutive pooling, which projects the input onto a lower-resolution feature space with reduced spatial awareness. However, as image segmentation requires both contextual and spatial awareness (global and local feature representations), excessive pooling decreases segmentation accuracy due to the exponential loss of image resolution and spatial information. In response, early work such as SegNet [3] proposed an additional decoder network which attempts to recover the original resolution from the feature space via un-pooling. Later works such as U-Net [47] improve upon this idea by employing a staged decoder with shortcuts to route high-resolution feature maps directly to the decoder network. U-Net has proved to be very powerful, and there have been several U-Net based networks [67, 64, 44, 68] for specific tasks. Contrary to these approaches, works including those on context aggregation [63] advocate maintaining high-resolution feature representations throughout. Yu et al. proposed using dilated (or atrous) convolution, which expands the convolutional filters to learn from larger receptive fields without requiring feature pooling (down-sampling). It has also been shown that stacked dilated convolutional layers can achieve exponentially increased receptive field sizes without requiring additional parameters [62].
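To make the receptive-field argument concrete, the sketch below (illustrative Python, not from any paper's code) computes the effective receptive field of a stack of 3x3 dilated convolutions under the standard assumption of stride-1 layers; doubling the dilation rate at each layer roughly doubles the receptive field while each added layer contributes only a constant number of parameters.

# Illustrative only: effective receptive field of stacked 3x3 dilated
# convolutions. Each stride-1 layer with kernel size k and dilation d
# widens the receptive field by (k - 1) * d.
def stacked_dilated_rf(dilations, kernel_size=3):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Doubling the dilation rate per layer grows the receptive field
# exponentially with depth, with a fixed parameter cost per layer.
print(stacked_dilated_rf([1]))           # 3
print(stacked_dilated_rf([1, 2]))        # 7
print(stacked_dilated_rf([1, 2, 4]))     # 15
print(stacked_dilated_rf([1, 2, 4, 8]))  # 31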

Researchers have also discovered that, for image segmentation, multi-scale feature extraction [22, 43, 4, 65, 36] can be used to learn more powerful feature representations, hence enhancing the multi-scale reasoning abilities of CNNs. Typically, multi-scale feature extraction divides the input into several branches, each with a different receptive field size. The Atrous Spatial Pyramid Pooling (ASPP) in DeepLabV3 [7] is a popular multi-scale feature extractor which utilises parallel dilated convolution branches to achieve diverse receptive field sizes. DeepLabV3+ [8] improves upon DeepLabV3 by including an encoder-decoder base-network and depth-wise separable convolution [10]. The more recent HRNet [57] is another exceptionally successful example. However, instead of using dilated convolution, HRNet simply uses pooling with different rates and normal convolution to achieve multi-scale feature extraction. Multi-scale networks like DeepLabV3 and HRNet are, in essence, ensembles of different-scaled features and have been shown to improve the accuracy on complex segmentation tasks substantially.

The continuous evolution of CNNs has resulted in an exponential expansion of the hyper-parameter search space, making exhaustive search infeasible for designing a capable CNN architecture. This problem gives rise to a class of CNNs [66, 24, 48] that seek to encapsulate an exponential number of possible sub-architectures. These networks use interlaced convolutional layers with shared weights, and gradient descent is used to fine-tune the weight sharing scheme between components. More recently, AutoDeepLab [39] demonstrated explicit architecture search via gradient descent. It uses connections between cells that are weighted using trainable parameters. After training, AutoDeepLab removes weak connections to reveal a compact architecture tuned on the training set. The leftover compact architecture was found to be as competent as many hand-crafted state-of-the-art architectures.

II-B Traditional Medical Image Segmentation Methods

Prior to the wide adoption of CNNs for MIA, medical image segmentation was dominated by traditional methods such as thresholding [41], Statistical Shape Models (SSMs) [25] and atlas-based methods [20], many of which rely on hand-crafted features. For example, multi-atlas approaches have been used for prostate segmentation [31] by fusing labels with methods such as majority voting, SIMPLE [33] or others. Dowling et al. [20] recently showed how a multi-atlas approach can be used to accurately segment multiple objects from pelvic MR images for MR-alone treatment planning. Deformable models and machine learning have been successful in the 3D segmentation of MR images in a number of areas, including (individual) bones [49, 6] and the prostate [54].

SSM multi-object segmentation methods have also been developed [26, 5]. However, SSMs are very sensitive to initialisation. Determining the shape surfaces for training the model can also be very challenging due to the complex nature of anatomical shapes, which requires sophisticated methods [14]. The shape model itself can also be severely limited by a lack of imaging information, a problem that has only recently been solved [49, 6].

II-C CNN-based Medical Image Segmentation Methods

As CNNs are designed to learn rich representations directly from image data, many of the state-of-the-art CNNs have been extended for 3D medical image segmentation. Some notable contributions include 3D U-Net [11] and V-Net [42]. As 3D convolution is higher-order than regular 2D convolution, compromises and workarounds are often required to fit 3D adaptations of large 2D models into GPU memory. For example, 3D U-Net is heavily shrunk down from the original 2D implementation of U-Net in terms of the number of parameters. However, overly simplifying a CNN architecture can significantly limit its learning capacity and lead to underfitting. 2.5D CNNs [58, 56] have been proposed to avoid the use of 3D convolution altogether; instead, they operate on 2D slices extracted from the input volume, leaving headroom for more complex models. The downside, however, is that 2D convolution can only extract weak feature representations from 3D images. Patch-based methods [13, 17] provide alternative solutions by training networks on small 3D patches of the input volume. However, as patch extraction limits the size of the observable context, earlier patch-based implementations such as [13] usually result in sub-optimal performance and are prone to block artefacts if not carefully constructed. Recently, patch-based methods have evolved into more sophisticated forms [19, 18] delivering expert levels of performance, while being memory efficient and requiring little or no augmentation, but at the cost of longer inference times. In the meantime, the versatility of deep neural networks has also given rise to a class of hybrid networks that marry traditional methods and CNNs. [1] uses an SSM in the image segmentation pipeline and achieved state-of-the-art results on the OAI dataset. Work including [21, 28] demonstrated ways to incorporate domain knowledge such as shape priors into deep learning models.

II-D Transfer Learning in Medical Image Analysis

Transfer learning [61] is a procedure for adopting the encoded feature representations of a pre-trained network. It brings tangible benefits including improved performance, generalisability and convergence. In 2D image processing, there have been several ubiquitous backbone networks [27, 52] with weights pre-trained on large-scale datasets such as [15]. In MIA, transfer learning is especially useful as medical image data are often sparsely sampled from their distributions. [46, 32, 59, 51, 9] have all demonstrated that pre-trained weights can be generalised and adopted by other tasks for better performance, especially if the target task is related to the source task. [51] and [38] have even shown that the weights from models trained on 2D general images can be helpful in medical image segmentation tasks, although in some cases random initialisation may outperform transfer learning from unrelated tasks.

II-E The Dense Residual Fabric Network

FIRENet is a general 3D network that can learn features from any dataset intended for medical image segmentation. It employs a multi-scale fabric latent representation block, which encapsulates an exponential number of sub-architectures and hence alleviates the need for explicit dataset-specific architecture design. Unlike static fabric networks such as [48, 66, 24], all the connections in FIRENet are adjustable via trainable weights for a new level of adaptability. Each fabric cell contains an ASPP3D for multi-scale feature extraction and instance normalisation for improved convergence stability. As FIRENet may theoretically be used to simultaneously learn from hundreds of different-scaled datasets, utilising ASPP3D in an already multi-scale latent module provides a finer collection of receptive field sizes for multi-scale feature extraction.

While the architecture search methodology used in AutoDeepLab produced promising results, it is not suitable for multi-dataset feature learning because the architecture search is performed on specific data distributions. Removing connections based on the resulting weights limits the learning capacity and re-adaptability of the model on future datasets. Therefore, FIRENet focuses on architecture-level feature adaptation of the entire network rather than architecture search.

The fabric is embedded in a 2-stage encoder-decoder base network to balance the trade-off between learning capacity and memory usage. Alternative patch-based architectures were considered for better memory efficiency. However, they can potentially introduce additional parameters in the pre-processing and post-processing stages, most of which, such as sampling rate and patch size, are dataset-specific. For the same reason, hybrid approaches were not considered for the task at hand.

Finally, much of the transfer learning work reviewed, such as [9], implies a fixed input size. In the present work, a crucial requirement must be satisfied to enable inherent multi-dataset feature learning: the network must be capable of processing different-sized images at their native resolutions, all in an end-to-end manner.

III Methods

In this section, we introduce the proposed network architecture, its implementation, training pipeline and experiment setups in detail.

III-A Network Architecture

III-A1 Feature-Cell

The Dense Residual Fabric (DRF) (Fig. 2(a)) is the latent feature representation module of FIRENet. The fabric consists of inter-weaved 3D Feature-Cells. Each Feature-Cell has three major components (Fig. 1): an Input Size Equaliser (ISE), Weighted Residual Summation (WRS) and Atrous Spatial Pyramid Pooling 3D (ASPP3D).

ISE: The Input Size Equaliser (ISE) is used to enforce size consistency among the output feature maps from predecessor cells. As the DRF exchanges different-scaled intermediate outputs among its cells, feature merging operations such as element-wise summation may fail due to inconsistent feature map sizes. Specifically, when a feature map undergoes pooling, edge pixels (voxels) are discarded if its dimensions are not divisible by the pooling rate. As FIRENet has no pre-defined input size at compile time, the discarded edge pixels (voxels) depend on the input image size at runtime and can cause the subsequent WRS to fail. Thus, we employ an ISE at the start of each Feature-Cell. The role of the ISE is to detect feature maps with discarded pixels and apply padding to compensate. The maximum amount of padding allowed is 1 pixel per dimension, which is sufficient for a pooling rate of 2. As FIRENet assumes no pre-defined input size at compile time, the ISE uses runtime flow-control operations to conditionally apply padding or even cropping.
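The following PyTorch-style sketch illustrates one way such conditional size equalisation could be implemented; the framework, function name and exact behaviour are our assumptions, not the authors' released code.

import torch.nn.functional as F

def equalise_size(x, ref):
    # Minimal ISE sketch (illustrative): pad a 3D feature map x of shape
    # (N, C, D, H, W) so its spatial size matches a reference map ref.
    # With a pooling rate of 2, the required padding is at most one voxel
    # per dimension; cropping covers the (rare) opposite case.
    d, h, w = ref.shape[-3:]
    pad = [0, max(w - x.shape[-1], 0),   # W: (left, right)
           0, max(h - x.shape[-2], 0),   # H
           0, max(d - x.shape[-3], 0)]   # D
    x = F.pad(x, pad)
    return x[..., :d, :h, :w]            # crop if x was larger than ref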

WRS: The subsequent WRS fuses the size-equalised feature maps from the ISE. As Fig. 2(b) shows, the pixel intensity of each WRS input is scaled by its associated sigmoid-activated weight before merging with the others. The weights are initialised uniformly from [-0.03, 0.03] and are trainable via gradient descent. WRS gives the network the flexibility to determine the optimal connection strengths between components. This form of architecture-level adaptation alleviates the need for explicit feature extraction path design.
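A minimal sketch of a WRS module under these assumptions (PyTorch is our choice of framework; the exact module structure is not specified in the paper):

import torch
import torch.nn as nn

class WeightedResidualSummation(nn.Module):
    # Sketch of WRS: each incoming feature map is scaled by a
    # sigmoid-activated trainable weight before the maps are summed.
    def __init__(self, num_inputs):
        super().__init__()
        # Edge weights initialised uniformly in [-0.03, 0.03], as described.
        self.weights = nn.Parameter(torch.empty(num_inputs).uniform_(-0.03, 0.03))

    def forward(self, inputs):
        # 'inputs' is a list of size-equalised feature maps of identical shape.
        gates = torch.sigmoid(self.weights)
        return sum(g * x for g, x in zip(gates, inputs))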

ASPP3D: The third component, the ASPP3D (Fig. 2(b)), provides the cell with multi-scale feature learning capability. Many multi-scale networks like HRNet can only extract features at a limited number of unique scales. With ASPP3D, we can drastically increase this number to prepare the fabric for an unforeseeable number of possible feature sizes. For instance, given a DRF implementation with three branches, using Feature-Cells with dilation rates of 1, 2 and 4 yields nine unique receptive field dimensions of 3, 5, 6, 7, 10, 12, 14, 20 and 28, which would otherwise require nine dedicated branches without ASPP3D.
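An illustrative ASPP3D block along these lines is sketched below; fusing the parallel branches with a convolution is consistent with the description accompanying Fig. 8, but the channel counts and the 1x1x1 fusion kernel are assumptions for the sketch.

import torch
import torch.nn as nn

class ASPP3D(nn.Module):
    # Sketch: parallel 3x3x3 dilated convolutions with rates 1, 2 and 4,
    # fused by a 1x1x1 convolution. Normalisation/activation omitted here.
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv3d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # padding == dilation keeps the spatial size constant for a 3x3x3 kernel.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))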

(a) Structure of the Dense Residual Fabric module. Feature-Cells are indexed by their position along the network’s depth (N) and across the different-scaled branches (W) of the fabric. Cells are coloured based on the number of channels of each cell.
(b) Detailed Feature-Cell configuration. The number of channels is governed by the location of the cell in the fabric.
(c) Example dense shortcut arrangement in the Residual Fabric. A shortcut connection is only established if the layers involved have the same number of channels. This arrangement is applied to every branch of the fabric.
Fig. 2: DRF feature representation model. (a) Structure of the DRF. (b) Detailed cell architecture. (c) Example residual connections used in the fabric.

III-A2 Dense Residual Fabric Module

The Dense Residual Fabric (DRF) (Fig. 2(a)) can be viewed as an ensemble of an exponential number of different sub-networks with shared features. We span the fabric with two axes: a width axis representing the number of different-scaled branches, and a depth axis representing the fabric’s network-wise depth. For a given 3D input, the fabric first splits it into parallel branches using strided convolution, with each successive branch operating at a coarser scale. Each branch is then separately processed through Feature-Cells to progressively learn features at different scales. To enable intermediate sharing of multi-scale features, the output of each Feature-Cell is fed via WRS into the subsequent cells of its own branch and of the neighbouring branches. Strided convolution and bi-linear up-sampling are used to resize feature maps to their target sizes as needed; we avoid transpose convolution as it has been shown to produce checkerboard artefacts [45]. The channel depths of the Feature-Cells are distributed following the geometry of a pyramid, increasing towards the mid-point of the lowest-resolution branch of the fabric and then gradually shrinking to mirror the first half. At the end of the fabric, the different-scaled parallel branches are merged using WRS to form an output at the original scale.
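The resizing step between neighbouring branches could be sketched as follows; the module name ScaleAdapter is hypothetical, and the use of trilinear interpolation (the 3D analogue of the bi-linear up-sampling mentioned above) is our assumption.

import torch.nn as nn
import torch.nn.functional as F

class ScaleAdapter(nn.Module):
    # Sketch: adapts a predecessor cell's output to the scale and channel
    # depth of the receiving branch. Down-sampling uses strided convolution
    # (assumes a factor-of-2 gap between adjacent branches); up-sampling uses
    # interpolation, since transpose convolution can cause checkerboard artefacts.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x, target_spatial):
        if x.shape[-1] > target_spatial[-1]:      # finer than target: stride down
            return self.down(x)
        if x.shape[-1] < target_spatial[-1]:      # coarser than target: upsample
            x = F.interpolate(x, size=tuple(target_spatial), mode="trilinear",
                              align_corners=False)
        return self.proj(x)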

III-A3 Dense Residual Connections

He et al. [27] showed that network depth is positively correlated with training difficulty. We include supplementary residual shortcuts (Fig. 2(c)) to densely connect the cells in the fabric. That is, in addition to the different-scaled features from the immediately preceding layers, each cell receives shortcut signals from all preceding cells with a compatible number of channels, as illustrated in Fig. 2(c).
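As a toy illustration of this shortcut rule (not the authors' code), the helper below lists which preceding cells on a branch would feed a shortcut into a given cell based on matching channel depths; the channel list is hypothetical.

def dense_shortcut_sources(channels_per_cell, target_index):
    # A cell receives residual shortcuts from every preceding cell on the
    # same branch that has the same number of channels.
    target_ch = channels_per_cell[target_index]
    return [i for i in range(target_index)
            if channels_per_cell[i] == target_ch]

# Example: with channel depths [32, 64, 64, 128, 64] along one branch,
# cell 4 receives shortcuts from cells 1 and 2 (both 64 channels).
print(dense_shortcut_sources([32, 64, 64, 128, 64], 4))  # [1, 2]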

III-A4 Instance Normalisation and Dropout

Feature map normalisation techniques [29, 55, 35] have been extensively studied and utilised to improve convergence. In our work, we use instance normalisation [55] instead of batch normalisation to prevent undesirable batch-wise correlations, as the training examples are sampled from different datasets. In the proposed model, instance normalisation is applied after convolution and before non-linear activation. Dropout with a rate of 50 per cent is used after every non-linearity. Our initial results showed that FIRENet with dropout consistently produced better results.
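The ordering described above corresponds to a block of the following form; this is a sketch, and the 3x3x3 kernel size and ReLU activation are assumptions for illustration.

import torch.nn as nn

def conv_in_block(in_ch, out_ch, p_drop=0.5):
    # Convolution -> instance normalisation -> non-linearity -> dropout,
    # matching the ordering described in the text.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_ch),
        nn.ReLU(inplace=True),
        nn.Dropout3d(p=p_drop),
    )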

III-A5 Encoder-decoder Backbone

Even though the consensus in the literature is to maintain high-resolution feature representations throughout the network, high-resolution 3D networks require an order of magnitude more processing power and memory. We embed the DRF in a limited encoder-decoder base (Fig. 1) with WRS passing features from the encoder to the decoder via shortcuts. The encoder and the decoder have the same number of convolutional blocks. Each block is a residual unit [27] with two convolutions followed by max-pooling. A convolutional layer is added to each encoder-to-decoder shortcut to reduce semantic gaps [67].

III-A6 Instantiation Parameters

The encoder contains two convolutional blocks of 32 and 64 channels, respectively. The encoded representation of the input is passed to the DRF. Each Feature-Cell has three parallel dilated convolution branches with dilation rates of 1, 2 and 4, respectively. Finally, the fabric output is passed through two decoder blocks with 64 and 32 channels respectively to arrive at the network’s output. The shortcut convolutional layers used for semantic gap reduction have the same depths as their corresponding encoder or decoder blocks.

III-A7 Training Pipeline

We develop a 3-stage training pipeline to enable the architecture-level feature adaptation. Since tuning the WRS weights in early epochs can prematurely influence the strengths and stability of the back-propagated signal into different network components, we divide the training process into the following three stages.

Stage 1. Train FIRENet with all WRS weights frozen for 20 epochs.

Stage 2. Unfreeze the WRS weights in the fabric allowing the fabric to tune the connections between cells. This training stage lasts for another 20 epochs.

Stage 3. Unfreeze all the remaining WRS weights and train until converged.
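A sketch of how this staged freezing could be scripted is shown below; the parameter-name substrings "wrs" and "fabric" are hypothetical and would depend on how the model is actually defined.

def set_wrs_trainable(model, fabric_only, trainable):
    # Sketch: toggle gradient updates for WRS edge weights, optionally
    # restricted to those inside the fabric module.
    for name, param in model.named_parameters():
        if "wrs" not in name:
            continue
        if fabric_only and "fabric" not in name:
            continue
        param.requires_grad = trainable

# Stage 1: all WRS weights frozen.
#   set_wrs_trainable(model, fabric_only=False, trainable=False)
# Stage 2 (after 20 epochs): unfreeze the WRS weights inside the fabric only.
#   set_wrs_trainable(model, fabric_only=True, trainable=True)
# Stage 3 (after 40 epochs): unfreeze all remaining WRS weights.
#   set_wrs_trainable(model, fabric_only=False, trainable=True)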

The model was trained by minimising the basic cross-entropy loss using Adam optimiser [30]. The target class’s validation DSC was monitored during training and the weights from the best performing epoch were saved.

III-A8 Experiment Setups

A set of two 3D MR imaging segmentation experiments was conducted to examine FIRENet for feature representation learning. All the deep learning models were trained using an NVIDIA Tesla V100 (32GB) GPU for a maximum duration of 3 days. 3-fold validation was used to control experimental bias, and the evaluation metrics used to report the experiment results are the Dice Similarity Coefficient (DSC) [16] and the Mean Surface Distance (MSD).
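For reference, the DSC used throughout is the standard overlap measure; a simple implementation for binary masks is sketched below (not the authors' evaluation code).

import numpy as np

def dice_coefficient(pred, target):
    # DSC = 2 * |A intersect B| / (|A| + |B|) for two binary masks.
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    # Two empty masks are treated as a perfect match by convention.
    return 2.0 * intersection / denom if denom > 0 else 1.0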

Experiment I: Prostate MR Imaging Segmentation: As ISE and WRS are non-standard network components, a prostate MR imaging segmentation experiment was used to verify FIRENet’s performance on single-dataset semantic segmentation tasks, which also indirectly verifies its feature learning ability. The 3D MR imaging prostate dataset was derived from an 8-week clinical study for prostate cancer radiotherapy treatment, and the training data contained 211 MR examinations from 39 patients. The images are semantically labelled with five foreground classes: body, bone (pelvic and hip), urinary bladder, rectum and prostate. The ground truth (manual segmentation) of the prostate dataset has an average inter-observer overlap of 0.84 for the prostate class. Dowling et al. [20] used an atlas-based automatic segmentation method to determine the prostate volume and achieved a median DSC of 0.82 and MSD of 0.204mm. One issue of the atlas-based method was the slow run time, which took up to an hour per image. Chandra et al. [5] used a shape model-based method on the same dataset and achieved a similar median DSC of 0.81 while speeding up the inference drastically (10 minutes per image).

In the present work, the MR examinations were divided into three train-validation groups by patient case to prevent data leakage. A 3D U-Net [11] and a V-Net [42] were trained for performance comparison against FIRENet. All the models were trained using the same metric and loss function. It is worth noting that both of the traditional automatic segmentation methods used leave-one-out validation; the models trained in the present work should, in theory, be at a disadvantage, as they rely on only 2/3 of the dataset for training.

Experiment II: Multi-dataset Bone Segmentation: The bone segmentation tasks were undertaken as a proof of concept for large-scale feature learning via multi-dataset segmentation. FIRENet was trained simultaneously on four 3T MR imaging musculoskeletal datasets: knee [25], shoulder [60], hip [6] and OAI ZIB knee [1], in an effort to learn general feature representations that facilitate accurate bone segmentation across the four datasets. A second instance of FIRENet (FIRENet-T), initialised with the weights from the prostate experiment, was trained to exploit the commonalities in the feature representation space. The main challenge of this experiment is that the combined MR imaging dataset contains images with 12 different resolutions, which FIRENet overcomes with its input-size invariance. To quantitatively assess the importance of this property, an equivalent network (FIRENet-F) with a fixed-size input layer was trained for comparison. The fixed-size network was an exact copy of FIRENet, but the inputs were scaled to the average resolution of the four datasets and then restored in post-processing. Another challenge is data imbalance: the four sub-datasets have unequal numbers (62, 25, 53 and 507) of training examples. In response, a 3-fold split was performed on the four datasets separately. During training, the four datasets are enumerated, and a random training example is sampled at each training step.
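The balanced sampling scheme described above could be realised as follows; this is a sketch under our assumptions about how the datasets are stored (each as an in-memory sequence of examples).

import random

def multi_dataset_batches(datasets, steps_per_epoch):
    # Sketch: cycle through the datasets and draw one random training
    # example per step, so the smaller datasets are not drowned out by
    # the largest one.
    for step in range(steps_per_epoch):
        dataset = datasets[step % len(datasets)]
        yield random.choice(dataset)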

Data Augmentation: To increase the diversity of the training data, we apply 3D data augmentation to image-segmentation pairs using our augmentation library. The augmentation methods used to train our models are translation, rotation, affine transformation and elastic deformation [47]. The augmentation parameters are sampled from a truncated normal distribution within [-1, 1].

IV Results and Discussion

Method                                 | Prostate      | Bladder       | Rectum        | Bone          | Body
                                       | DSC    MSD    | DSC    MSD    | DSC    MSD    | DSC    MSD    | DSC    MSD
Atlas-based [20]                       | 0.820  2.040  | 0.900  3.260  | 0.850  2.160  | 0.920  1.330  | 0.999  0.310
SSM [5]                                | 0.810  2.080  | 0.873  2.771  | 0.788  2.425  | 0.810  -      | 0.940  -
3D U-Net [11]                          | 0.855  1.124  | 0.959  0.547  | 0.874  1.089  | 0.924  0.800  | 0.984  0.488
FIRENet                                | 0.867  0.971  | 0.963  0.519  | 0.890  0.827  | 0.931  0.723  | 0.989  0.337
Average Inter-observer Overlap (n=3)   | 0.840  1.980  | 0.950  0.910  | 0.820  1.980  | -      -      | -      -
TABLE I: Prostate dataset semantic segmentation results. Metrics are median DSC and median MSD (in mm). V-Net produced very inconsistent results on sparse classes such as prostate and rectum and has been omitted from the table.
Fig. 3: Violin plot for the baseline (week-0) prostate DSC results. The median DSC scores have been marked.

In this section, the results from the MR Imaging Prostate Segmentation and Multi-dataset Bone Segmentation experiments are presented and discussed within the context of large-scale feature learning from medical image data.

IV-A Experiment I: MR Imaging Prostate Segmentation Results

Table I shows the median DSC and MSD results for baseline (week-0) volumetric segmentation of the prostate, urinary bladder, rectum, (pelvic and hip) bones and body across the different automated segmentation approaches, compared with the manual segmentation results from the MR imaging prostate dataset. Overall, all of the automated segmentation methods, except V-Net, achieved over 0.8 median DSC on the prostate class. FIRENet had a significantly higher median prostate DSC (0.867) than the SSM-based method [5] (0.810), the atlas-based method [20] (0.820) and the 3D U-Net (0.855). The computed MSDs showed the same overall pattern in regards to the performance of these methods. The violin plot in Fig. 3 shows that FIRENet outperforms 3D U-Net in that its DSC distribution is relatively sparse in the low-DSC range (below 0.8). The prostate dataset contains an extreme outlier case characterised by its lack of object separation and a clear boundary. U-Net performed poorly on this case, with a prostate DSC of 0.29, and, as shown in Fig. 4, it also completely missed the relatively full urinary bladder. FIRENet provided more accurate estimates of both the prostate (DSC = 0.61) and the bladder.

V-Net produced very inconsistent results likely due to its sensitivity to the choice of loss function. To avoid bias, all of the experiments used the same standard cross-entropy loss, which differs from the DSC loss used by V-Net.

Fig. 4: Visualisation of segmentation results from manual segmentation (left), FIRENet (middle) and U-Net (right). Two cases representative of "above average" and "average" results and one outlier case are presented.

IV-B Experiment II: Multi-dataset Bone Segmentation from MR Imaging Examinations of the Knee, Hip and Shoulder Joints

Table II(a) and II(b) provide the mean and poorest DSC scores for segmentation of bone volume obtained using FIRENet, its fixed-size variant (FIRENet-F) and its transfer-learning variant (FIRENet-T). Despite only relying on one set of weights to encode features from 4 datasets, FIRENet obtained average DSC scores of over 0.9 for bone segmentation across all 4 datasets. This indicates the learning of more generalisable features (than features obtained using individual datasets). The mean DSC result for the FIRENet-T approach (0.986) on the OAI ZIB dataset is comparable to the state-of-the-art benchmark [1] of 0.986, which was obtained using a CNN aided by a shape model. It is worth noting that the method used by [1] has a run time of approximately 9 minutes per image, whereas FIRENet has a throughput of one (similarly sized) image per second. From FIRENet-F’s mean and poorest DSC results, there was a significant performance disadvantage compared to FIRENet and FIRENet-T. This was because FIRENet-F enforces a fixed input size, which resulted in a loss of image integrity from re-scaling in the pre-processing and post-processing stages. Comparing FIRENet-T to FIRENet, FIRENet-T produced better average DSC results for bone segmentation across all 4 sub-datasets. The poorest DSC results indicate some degree of improvement across three of the four sub-datasets. It was observed that FIRENet-T approaches convergence at a faster rate than the randomly initialised FIRENet across all three validation splits. This is shown in the convergence plot (Fig. 5). The improved convergence and final results indicate that the learning from the MR prostate dataset was successfully generalised and adopted by FIRENet-T in the automated segmentation of bones in MR images of the knee, hip and shoulder joints.

Method          | MSKH   | MSKK                             | MSKS                  | OAI    | Overall
FIRENet-F       | 0.930  | 0.943                            | 0.903                 | 0.964  | 0.935
FIRENet         | 0.954  | 0.961                            | 0.923                 | 0.985  | 0.955
FIRENet-T       | 0.962  | 0.963                            | 0.945                 | 0.986  | 0.964
CNN + SSM [1]   | -      | -                                | -                     | 0.986  | -
SSM [6]         | 0.950  | -                                | -                     | -      | -
SSM [25]        | -      | 0.952 (P), 0.952 (T), 0.862 (F)  | -                     | -      | -
SSM [60]        | -      | -                                | 0.926 (H), 0.837 (S)  | -      | -
(a) Mean validation set bone DSC results on the 3D musculoskeletal hip (MSKH), knee (MSKK) and shoulder (MSKS) datasets and the OAI dataset. The knee segmentation work by [25] divides the bone class into patella (P), tibia (T) and femur (F). The shoulder segmentation work by [60] divides the bone class into humerus (H) and scapula (S).

Method          | MSKH   | MSKK   | MSKS   | OAI ZIB
FIRENet-F       | 0.822  | 0.883  | 0.813  | 0.950
FIRENet         | 0.857  | 0.910  | 0.825  | 0.968
FIRENet-T       | 0.870  | 0.907  | 0.900  | 0.973
(b) Worst validation set DSC results of FIRENet-F, FIRENet and FIRENet-T.
TABLE II: Validation set DSC results.
Fig. 5: Early convergence (50 epochs) plots of FIRENet and FIRENet-T. FIRENet-T consistently showed accelerated convergence especially in the first 30 epochs.

The results acquired using traditional methods [6, 25, 60] are also included in Table II(a). The high DSC scores from the deep-learning-based methods provide good evidence that they performed well on the bone datasets.

IV-C Activation Map Visualisation

In this section, intermediate feature maps are visualised and presented to provide insights into the factors contributing to FIRENet’s performance and feature learning process (a feature map reveals the features that excite the neurons in a particular layer).

Feature Generalisation: The numeric results from Experiment II (FIRENet-T) indicate that the weights in FIRENet were generalised to the bone segmentation task through fine-tuning. Beyond the quantitative results, a qualitative analysis was conducted to assess the usefulness of the transferred weights. This was done by obtaining a trained instance of FIRENet (without any fine-tuning) from Experiment I and passing OAI ZIB data through it in inference mode. The intermediate feature maps from the fabric module are visualised in Fig. 6. Despite OAI ZIB being an unseen dataset, the activation maps contain useful features crucial for knee segmentation. Across many feature maps, prominent activation highlighting the knee bones as well as cartilages was observed. Understandably, the activation becomes less legible in the deeper layers, as deeper layers of CNNs extract higher-level features.

Fig. 6: Activation maps sampled from the outputs of Feature-Cells at increasing depths of the fabric. The weights were trained using the prostate data, and the input used to generate these activation maps is a knee MR image from the OAI ZIB dataset.
Fig. 7: Example intermediate feature maps of a hip bone enhanced by WRS by incorporating information from other parallel branches.
Fig. 8: Activation maps sampled from the dilated convolution branches (part of ASPP3D) of three Feature-Cells. RF refers to the effective receptive field size resulting from the different scales and dilation rates. The parallel dilated convolution branches are fused using convolution to form the ASPP3D output.

WRS Feature Augmentation: The fabric latent module exchanges multi-scale feature representations across its parallel branches via WRS. Our hypothesis is that the WRS in each cell performs feature map augmentation by filling in features from neighbouring branches. To examine this hypothesis, the activation maps before and after WRS are visualised, and some samples are presented in Fig. 7. As per Fig. 7A, B and D, WRS is shown to augment (enhance the definition of) certain features by incorporating activations from parallel branches. As a result, more prominent shape outlines can be observed across many of the feature maps. As per Fig. 7C, WRS may also "clean up" the input by subtracting "extraneous" signal intensity; in this case, WRS removed noise-like artefacts that bear little relation to a faithful representation of the underlying objects of interest. It was also observed that, in some cases, WRS leaves some of the activation maps untouched.

ASPP3D: The latent fabric of FIRENet used in the present work extracts features at 9 unique scales without needing to maintain 9 dedicated parallel branches. Fig. 8 shows activation maps for each of the 9 scales given a hip MR scan. It can be seen that the detected features become increasingly abstract as the receptive field size increases. The same trend continues after the parallel feature maps are fused into 3 main branches: the high- and mid-resolution branches show the outlines of bones, while the low-resolution branch is more abstract and shows weak spatial correlations. This is because the high-resolution path preserves most of the image details while the lower-resolution paths focus more on image-level information. While 3 of the feature maps share very similar effective receptive field sizes (RF = 5, 6 and 7), the features extracted from them are completely different. This indicates that the network incorporates different feature representations at 9 different scales when constructing the three different-scaled fabric branches.

IV-D Implications of Size Invariance

While the idea of multi-dataset learning via transfer learning is not new, FIRENet streamlines the process as a result of its input-size invariance. For our experiments, only the number of output classes was modified when transferring FIRENet from the prostate dataset to the bone datasets. The rest of the network’s structure (including weights) and all its hyper-parameters remained untouched.
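In practice, such a transfer amounts to replacing only the final classification layer; a hedged sketch is shown below, where the attribute name output_head and the head's channel count are ours, not the released code.

import torch.nn as nn

def transfer_with_new_head(pretrained_model, num_classes, head_channels=32):
    # Sketch: because the network is input-size invariant, only the final
    # 1x1x1 classification convolution needs replacing when moving to a
    # dataset with a different number of output classes; all other weights
    # are reused unchanged.
    pretrained_model.output_head = nn.Conv3d(head_channels, num_classes,
                                             kernel_size=1)
    return pretrained_model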

IV-E Future Work

The feature representations encoded by FIRENet’s fabric can be further enriched by incorporating additional datasets. In future work, FIRENet can take advantage of a semi-online learning pipeline involving an expanding collection of datasets. In addition to learning generalisable features through image segmentation, the learning objective of FIRENet can be expanded by including additional prediction heads at the end of the fabric module. The prediction heads can be used in tasks (other than image segmentation) that exploit the commonalities in the feature space; example uses include laterality detection of knee joint scans and bounding box regression. It is also possible to carry out feature learning via multi-task learning, where both the decoder output and the prediction heads are used simultaneously.

V Conclusion

Obtaining a well-formed set of weights that encode rich and generalisable feature representations is paramount in many deep-learning-based MIA tasks. As MIA is a field where labelled data are sparsely sampled, we advocate large-scale feature learning involving as many labelled medical image datasets as are available. A novel architecture, FIRENet, was proposed to learn and encode generalisable feature representations via large-scale 3D medical image segmentation. FIRENet is equipped with a state-of-the-art DRF module enhanced with ASPP3D for superior feature learning capacity and multi-scale reasoning ability, which are critical for extracting features from multiple datasets. One crucial property of FIRENet is its invariance to changes in the input size, which eliminates the need to destructively resize the input and in turn preserves the integrity of the input image. Combined with its smaller hyper-parameter space (than other approaches involving complex pipelines), FIRENet enables inherent large-scale feature learning involving any number of datasets. In future work, the training pipeline of FIRENet will be continually expanded to include additional datasets as well as tasks beyond image segmentation (image classification and regression). This process will allow the network to encode richer and more general knowledge that can substantially benefit other related tasks.

VI Acknowledgement

We wish to acknowledge The University of Queensland’s Research Computing Centre (RCC) and the use of the Wiener supercomputer (Abramson, Carroll and Porebski, 2017) for support in this research.

References

  • [1] F. Ambellan, A. Tack, M. Ehlke, and S. Zachow (2019) Automated segmentation of knee bone and cartilage combining statistical shape knowledge and convolutional neural networks: data from the osteoarthritis initiative. Medical Image Analysis 52, pp. 109 – 118. External Links: ISSN 1361-8415, Document Cited by: §II-C, §III-A8, §IV-B, II(a).
  • [2] K. Armanious, C. Jiang, M. Fischer, T. Küstner, K. Nikolaou, S. Gatidis, and B. Yang (2018-06) MedGAN: Medical Image Translation using GANs. arXiv e-prints, pp. arXiv:1806.06397. External Links: 1806.06397 Cited by: §I.
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla (2015-11) SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv e-prints, pp. arXiv:1511.00561. External Links: 1511.00561 Cited by: §II-A.
  • [4] P. Buyssens, A. Elmoataz, and O. Lezoray (2012-11) Multiscale convolutional neural networks for vision–based classification of cells. Vol. 7725, pp. . External Links: Document Cited by: §II-A.
  • [5] S. S. Chandra, J. A. Dowling, P. B. Greer, J. Martin, C. Wratten, P. Pichler, J. Fripp, and S. Crozier (2016-10) Fast automated segmentation of multiple objects via spatially weighted shape learning. Physics in Medicine and Biology 61 (22), pp. 8070–8084. External Links: Document Cited by: §II-B, §III-A8, §IV-A, TABLE I.
  • [6] S. S. Chandra, Y. Xia, C. Engstrom, S. Crozier, R. Schwarz, and J. Fripp (2014) Focused shape models for hip joint segmentation in 3D magnetic resonance images. Medical Image Analysis 18 (3), pp. 567 – 578. External Links: Document, ISSN 1361-8415 Cited by: §I, §II-B, §II-B, §III-A8, §IV-B, II(a).
  • [7] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017-06) Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv e-prints, pp. arXiv:1706.05587. External Links: 1706.05587 Cited by: item 1, §II-A.
  • [8] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018-02) Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv e-prints, pp. arXiv:1802.02611. External Links: 1802.02611 Cited by: §II-A.
  • [9] S. Chen, K. Ma, and Y. Zheng (2019-04) Med3D: Transfer Learning for 3D Medical Image Analysis. arXiv e-prints, pp. arXiv:1904.00625. External Links: 1904.00625 Cited by: §II-D, §II-E.
  • [10] F. Chollet (2016-10) Xception: Deep Learning with Depthwise Separable Convolutions. arXiv e-prints, pp. arXiv:1610.02357. External Links: 1610.02357 Cited by: §II-A.
  • [11] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger (2016-06) 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. arXiv e-prints, pp. arXiv:1606.06650. External Links: 1606.06650 Cited by: §II-C, §III-A8, TABLE I.
  • [12] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016-04) The Cityscapes Dataset for Semantic Urban Scene Understanding. arXiv e-prints, pp. arXiv:1604.01685. External Links: 1604.01685 Cited by: §I.
  • [13] Z. Cui, J. Yang, and Y. Qiao (2016) Brain mri segmentation with patch-based cnn approach. In 2016 35th Chinese Control Conference (CCC), Vol. , pp. 7026–7031. Cited by: §II-C.
  • [14] R. H. Davies, C. J. Twining, T. F. Cootes, and C. J. Taylor (2010) Building 3-D statistical shape models by direct optimization. Medical Imaging, IEEE Transactions on 29 (4), pp. 961–981. External Links: ISSN 0278-0062, Document Cited by: §II-B.
  • [15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §II-D.
  • [16] L. R. Dice (1945-07) Measures of the amount of ecologic association between species. Ecology 26 (3), pp. 297–302. Cited by: §III-A8.
  • [17] J. Dolz, C. Desrosiers, and I. Ben Ayed (2018) 3D fully convolutional networks for subcortical segmentation in mri: a large-scale study. NeuroImage 170, pp. 456 – 470. Note: Segmenting the Brain External Links: ISSN 1053-8119, Document, Link Cited by: §II-C.
  • [18] S. Dong, G. Luo, C. Tam, W. Wang, K. Wang, S. Cao, B. Chen, H. Zhang, and S. Li (2020) Deep atlas network for efficient 3d left ventricle segmentation on echocardiography. Medical Image Analysis 61, pp. 101638. External Links: ISSN 1361-8415, Document, Link Cited by: §II-C.
  • [19] Q. Dou, L. Yu, H. Chen, Y. Jin, X. Yang, J. Qin, and P. Heng (2017) 3D deeply supervised network for automated segmentation of volumetric medical images. Medical Image Analysis 41, pp. 40 – 54. Note: Special Issue on the 2016 Conference on Medical Image Computing and Computer Assisted Intervention (Analog to MICCAI 2015) External Links: ISSN 1361-8415, Document, Link Cited by: §II-C.
  • [20] J. A. Dowling, J. Sun, P. Pichler, D. Rivest-Hénault, S. Ghose, H. Richardson, C. Wratten, J. Martin, J. Arm, L. Best, S. S. Chandra, J. Fripp, F. W. Menk, and P. B. Greer (2015) Automatic substitute computed tomography generation and contouring for magnetic resonance imaging (mri)-alone external beam radiation therapy from standard mri sequences. International Journal of Radiation Oncology*Biology*Physics 93 (5), pp. 1144 – 1153. External Links: ISSN 0360-3016, Document Cited by: §II-B, §III-A8, §IV-A, TABLE I.
  • [21] J. Duan, G. Bello, J. Schlemper, W. Bai, T. J. W. Dawes, C. Biffi, A. de Marvao, G. Doumoud, D. P. O’Regan, and D. Rueckert (2019) Automatic 3d bi-ventricular segmentation of cardiac images by a shape-refined multi- task deep learning approach. IEEE Transactions on Medical Imaging 38 (9), pp. 2151–2164. Cited by: §II-C.
  • [22] D. Eigen and R. Fergus (2014-11) Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. arXiv e-prints, pp. arXiv:1411.4734. External Links: 1411.4734 Cited by: §II-A.
  • [23] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010-06) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §I.
  • [24] D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Tremeau, and C. Wolf (2017-07) Residual Conv-Deconv Grid Network for Semantic Segmentation. arXiv e-prints, pp. arXiv:1707.07958. External Links: 1707.07958 Cited by: §II-A, §II-E.
  • [25] J. Fripp, S. Crozier, S. K. Warfield, and S. Ourselin (2007-02) Automatic segmentation of the bone and extraction of the bone–cartilage interface from magnetic resonance images of the knee. Physics in Medicine and Biology 52 (6), pp. 1617–1631. External Links: Document Cited by: §I, §II-B, §III-A8, §IV-B, II(a).
  • [26] B. Glocker, O. Pauly, E. Konukoglu, and A. Criminisi (2012-10) Joint Classification-Regression Forests for Spatially Structured Multi-object Segmentation. In Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid (Eds.), Lecture Notes in Computer Science, pp. 870–881 (en). External Links: ISBN 978-3-642-33764-2, 978-3-642-33765-9, Document Cited by: §II-B.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun (2015-12) Deep Residual Learning for Image Recognition. arXiv e-prints, pp. arXiv:1512.03385. External Links: 1512.03385 Cited by: §I, §II-D, §III-A3, §III-A5.
  • [28] Z. He, S. Bao, and A. Chung (2018) 3D deep affine-invariant shape learning for brain mr image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, D. Stoyanov, Z. Taylor, G. Carneiro, T. Syeda-Mahmood, A. Martel, L. Maier-Hein, J. M. R.S. Tavares, A. Bradley, J. P. Papa, V. Belagiannis, J. C. Nascimento, Z. Lu, S. Conjeti, M. Moradi, H. Greenspan, and A. Madabhushi (Eds.), Cham, pp. 56–64. External Links: ISBN 978-3-030-00889-5 Cited by: §II-C.
  • [29] S. Ioffe and C. Szegedy (2015-02) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv e-prints, pp. arXiv:1502.03167. External Links: 1502.03167 Cited by: §III-A4.
  • [30] D. P. Kingma and J. Ba (2014-12) Adam: A Method for Stochastic Optimization. arXiv e-prints, pp. arXiv:1412.6980. External Links: 1412.6980 Cited by: §III-A7.
  • [31] S. Klein, U. A. v. d. Heide, I. M. Lips, M. v. Vulpen, M. Staring, and J. P. W. Pluim (2008) Automatic segmentation of the prostate in 3D MR images by atlas matching using localized mutual information. Medical Physics 35 (4), pp. 1407–1417. External Links: Document Cited by: §II-B.
  • [32] F. Knoll, K. Hammernik, E. Kobler, T. Pock, M. P. Recht, and D. K. Sodickson (2019) Assessment of the generalization of learned image reconstruction and the potential for transfer learning. Magn Reson Med 81 (1), pp. 116–128. External Links: ISSN 0740-3194 (Print) 0740-3194, Document Cited by: §II-D.
  • [33] T. R. Langerak, U. A. van der Heide, A. N. Kotte, M. A. Viergever, M. van Vulpen, and J. P. Pluim (2010-12) Label fusion in Atlas-Based segmentation using a selective and iterative method for performance level estimation (SIMPLE). Medical Imaging, IEEE Transactions on 29 (12), pp. 2000–2008. External Links: ISSN 0278-0062, Document Cited by: §II-B.
  • [34] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324. Cited by: §I, §II-A.
  • [35] J. Lei Ba, J. R. Kiros, and G. E. Hinton (2016-07) Layer Normalization. arXiv e-prints, pp. arXiv:1607.06450. External Links: 1607.06450 Cited by: §III-A4.
  • [36] S. Li, X. Zhu, and J. Bao (2019-04) Hierarchical Multi-Scale Convolutional Neural Networks for Hyperspectral Image Classification. Sensors (Basel) 19 (7). Cited by: §II-A.
  • [37] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 740–755. External Links: ISBN 978-3-319-10602-1 Cited by: §I.
  • [38] L. Wu, X. Yang, S. Li, T. Wang, P. Heng, and D. Ni (2017-04) Cascaded fully convolutional networks for automatic prenatal ultrasound image segmentation. pp. 663–666. External Links: Document Cited by: §II-D.
  • [39] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. Fei-Fei (2019-01) Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation. arXiv e-prints, pp. arXiv:1901.02985. External Links: 1901.02985 Cited by: item 2, §II-A.
  • [40] I. Y. Maolood, Y. E. A. Al-Salhi, and S. Lu (2018) Thresholding for Medical Image Segmentation for Cancer using Fuzzy Entropy with Level Set Algorithm. Open Med (Wars) 13, pp. 374–383. Cited by: §I.
  • [41] I. Maolood, Y. Alsalhi, and S. Lu (2018-09) Thresholding for medical image segmentation for cancer using fuzzy entropy with level set algorithm. Open Medicine 13, pp. 374–383. External Links: Document Cited by: §II-B.
  • [42] F. Milletari, N. Navab, and S. Ahmadi (2016-06) V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. arXiv e-prints, pp. arXiv:1606.04797. External Links: 1606.04797 Cited by: §II-C, §III-A8.
  • [43] H. T. Mustafa, J. Yang, and M. Zareapoor (2019) Multi-scale convolutional neural network for multi-focus image fusion. Image and Vision Computing 85, pp. 26 – 35. External Links: ISSN 0262-8856, Document Cited by: §II-A.
  • [44] Z. Ni, G. Bian, X. Zhou, Z. Hou, X. Xie, C. Wang, Y. Zhou, R. Li, and Z. Li (2019-09) RAUNet: Residual Attention U-Net for Semantic Segmentation of Cataract Surgical Instruments. arXiv e-prints, pp. arXiv:1909.10360. External Links: 1909.10360 Cited by: §II-A.
  • [45] A. Odena, V. Dumoulin, and C. Olah (2016) Deconvolution and checkerboard artifacts. Distill. External Links: Document Cited by: §III-A2.
  • [46] H. Ravishankar, P. Sudhakar, R. Venkataramani, S. Thiruvenkadam, P. Annangi, N. Babu, and V. Vaidya (2017-04) Understanding the mechanisms of deep transfer learning for medical images. Cited by: §II-D.
  • [47] O. Ronneberger, P. Fischer, and T. Brox (2015-05) U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv e-prints, pp. arXiv:1505.04597. External Links: 1505.04597 Cited by: §II-A, §III-A8.
  • [48] S. Saxena and J. Verbeek (2016-06) Convolutional Neural Fabrics. arXiv e-prints, pp. arXiv:1606.02492. External Links: 1606.02492 Cited by: §II-A, §II-E.
  • [49] J. Schmid, J. Kim, and N. Magnenat-Thalmann (2011) Robust statistical shape models for MRI bone segmentation in presence of small field of view. Medical Image Analysis 15 (1), pp. 155–168. External Links: ISSN 1361-8415, Document Cited by: §II-B.
  • [50] G. Sharp, K. D. Fritscher, V. Pekar, M. Peroni, N. Shusharina, H. Veeraraghavan, and J. Yang (2014-05) Vision 20/20: perspectives on automated image segmentation for radiotherapy. Med Phys 41 (5), pp. 050902. Cited by: §I.
  • [51] H. C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers (2016-05) Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans Med Imaging 35 (5), pp. 1285–1298. Cited by: §II-D.
  • [52] K. Simonyan and A. Zisserman (2014-09) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv e-prints, pp. arXiv:1409.1556. External Links: 1409.1556 Cited by: §II-A, §II-D.
  • [53] The osteoarthritis initiative. U.S. Department of Health and Human Services. External Links: Link Cited by: §I.
  • [54] R. Toth and A. Madabhushi (2012) Multi-Feature Landmark-Free Active Appearance Models: Application to Prostate MRI Segmentation. IEEE Transactions on Medical Imaging, In Press (99), pp. 1. External Links: ISSN 0278-0062, Document Cited by: §II-B.
  • [55] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016-07) Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv e-prints, pp. arXiv:1607.08022. External Links: 1607.08022 Cited by: §III-A4.
  • [56] G. Wang, W. Li, S. Ourselin, and T. Vercauteren (2019) Automatic brain tumor segmentation based on cascaded convolutional neural networks with uncertainty estimation. Frontiers in Computational Neuroscience 13, pp. 56. External Links: Link, Document, ISSN 1662-5188 Cited by: §II-C.
  • [57] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao (2019-08) Deep High-Resolution Representation Learning for Visual Recognition. arXiv e-prints, pp. arXiv:1908.07919. External Links: 1908.07919 Cited by: §II-A.
  • [58] Y. Xue, F. G. Farhat, O. Boukrina, A.M. Barrett, J. R. Binder, U. W. Roshan, and W. W. Graves (2020) A multi-path 2.5 dimensional convolutional neural network system for segmenting stroke lesions in brain mri images. NeuroImage: Clinical 25, pp. 102118. External Links: ISSN 2213-1582, Document, Link Cited by: §II-C.
  • [59] S. S. Yadav and S. M. Jadhav (2019-12-17) Deep convolutional neural network based medical image classification for disease diagnosis. Journal of Big Data 6 (1), pp. 113. External Links: ISSN 2196-1115, Document Cited by: §II-D.
  • [60] Z. Yang, J. Fripp, S. S. Chandra, A. Neubert, Y. Xia, M. Strudwick, A. Paproki, C. Engstrom, and S. Crozier (2015) Automatic bone segmentation and bone-cartilage interface extraction for the shoulder joint from magnetic resonance images. Physics in Medicine and Biology 60 (4), pp. 1441. External Links: Document Cited by: §I, §III-A8, §IV-B, II(a).
  • [61] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. Advances in Neural Information Processing Systems 27, pp. 3320–3328. Cited by: §I, §II-D.
  • [62] F. Yu, V. Koltun, and T. Funkhouser (2017-05) Dilated Residual Networks. arXiv e-prints, pp. arXiv:1705.09914. External Links: 1705.09914 Cited by: §II-A.
  • [63] F. Yu and V. Koltun (2015-11) Multi-Scale Context Aggregation by Dilated Convolutions. arXiv e-prints, pp. arXiv:1511.07122. External Links: 1511.07122 Cited by: §II-A.
  • [64] Z. Zhang, Q. Liu, and Y. Wang (2018-05) Road Extraction by Deep Residual U-Net. IEEE Geoscience and Remote Sensing Letters 15 (5), pp. 749–753. External Links: Document, 1711.10684 Cited by: §II-A.
  • [65] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2016-12) Pyramid Scene Parsing Network. arXiv e-prints, pp. arXiv:1612.01105. External Links: 1612.01105 Cited by: §II-A.
  • [66] Y. Zhou, X. Hu, and B. Zhang (2015-10) Interlinked convolutional neural networks for face parsing. Vol. 9377, pp. 222–231. External Links: ISBN 978-3-319-25392-3, Document Cited by: §II-A, §II-E.
  • [67] Z. Zhou, M. Mahfuzur Rahman Siddiquee, N. Tajbakhsh, and J. Liang (2018-07) UNet++: A Nested U-Net Architecture for Medical Image Segmentation. arXiv e-prints, pp. arXiv:1807.10165. External Links: 1807.10165 Cited by: §II-A, §III-A5.
  • [68] H. Zhu, F. Shi, L. Wang, S. Hung, M. Chen, S. Wang, W. Lin, and D. Shen (2019) Dilated dense u-net for infant hippocampus subfield segmentation. Frontiers in Neuroinformatics 13, pp. 30. External Links: Document, ISSN 1662-5196 Cited by: §II-A.