The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 Challenge

by   Nicholas Heller, et al.
University of Minnesota

There is a large body of literature linking anatomic and geometric characteristics of kidney tumors to perioperative and oncologic outcomes. Semantic segmentation of these tumors and their host kidneys is a promising tool for quantitatively characterizing these lesions, but its adoption is limited due to the manual effort required to produce high-quality 3D segmentations of these structures. Recently, methods based on deep learning have shown excellent results in automatic 3D segmentation, but they require large datasets for training, and there remains little consensus on which methods perform best. The 2019 Kidney and Kidney Tumor Segmentation challenge (KiTS19) was a competition held in conjunction with the 2019 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) which sought to address these issues and stimulate progress on this automatic segmentation problem. A training set of 210 cross-sectional CT images with kidney tumors was publicly released with corresponding semantic segmentation masks. 106 teams from five continents used this data to develop automated systems to predict the true segmentation masks on a test set of 90 CT images for which the corresponding ground truth segmentations were kept private. These predictions were scored and ranked according to their average Sørensen-Dice coefficient between the kidney and tumor across all 90 cases. The winning team achieved a Dice of 0.974 for kidney and 0.851 for tumor, approaching the inter-annotator performance on kidney (0.983) but falling short on tumor (0.923). This challenge has now entered an "open leaderboard" phase where it serves as a challenging benchmark in 3D semantic segmentation.


1 Introduction

The incidence of kidney tumors is increasing, especially for small, localized tumors that are often discovered incidentally (Hollingsworth et al., 2006). It’s difficult to radiographically differentiate between benign kidney tumors (e.g., angiomyolipoma and oncocytoma) and malignant Renal Cell Carcinoma (RCC) (Millet et al., 2011), but most kidney tumors are eventually found to be malignant (Chawla et al., 2006). Surgical removal of localized RCC is regarded as curative (Capitanio and Montorsi, 2016), so most localized kidney tumors are removed despite the sizable minority that are postoperatively found to be benign (Kim et al., 2019).

Traditionally, kidney tumors were removed through radical nephrectomy in which the entire kidney along with the tumor are excised (Robson, 1963). However, in order to preserve renal function (Scosyrev et al., 2014), partial nephrectomy, where only the tumor is removed, has recently become the standard of care in an increasing share of tumors with lower surgical complexity (Campbell et al., 2017). Further, a growing body of literature suggests that a large proportion of renal tumors are indolent (Richard et al., 2016; Uzosike et al., 2018; McIntosh et al., 2018; Patel et al., 2016), meaning they will never become a danger to the patient, and thus active surveillance has emerged as an increasingly popular treatment strategy for tumors exhibiting less aggressive characteristics in imaging.

With these developments, there is an exciting opportunity to reduce overtreatment of renal tumors without compromising oncologic outcomes (Mir et al., 2017), but there is a need for methods to objectively quantify the complexity and aggression of kidney tumors in order to better inform treatment decisions like radical nephrectomy vs. partial nephrectomy vs. active surveillance. Clinicians predominantly rely on imaging, primarily CT, to assess the complexity and aggression of renal masses. A number of manual scoring systems, termed nephrometry scores, have been proposed for this purpose (Kutikov and Uzzo, 2009; Ficarra et al., 2009; Simmons et al., 2010), but they have seen limited adoption due to the significant manual effort they require (Simmons et al., 2012), the interobserver variability between expert raters (Spaliviero et al., 2015), and their limited predictive power (Kutikov et al., 2011; Hayn et al., 2011; Okhunov et al., 2011).

Figure 1: An example of a coronal section of one of the training cases with its ground truth segmentation overlaid (kidney in red, tumor in green). Visualization generated by ITKSnap (Yushkevich et al., 2016). Best viewed in color.

Semantic segmentation of kidneys and kidney tumors offers an expressive characterization of the lesion, but it imposes an even larger burden of manual effort than most nephrometry scores. Reliable automatic semantic segmentation of kidneys and kidney tumors would enable full automation of several nephrometry scores as well as studies of kidney tumor morphology on unprecedented scales.

The 2019 Kidney and Kidney Tumor Segmentation Challenge (KiTS19) aimed to accelerate progress on this automatic segmentation problem by releasing a dataset of 210 CT images with associated high-quality kidney and kidney tumor segmentations that could be used for training learned models. It also aimed to objectively assess the state of the art by holding a collection of 90 segmentation masks private for participants to predict given associated imaging. Participating teams were ranked based on their average Sørensen-Dice coefficient between the kidneys and tumors across these 90 test cases.

This challenge was hosted on grand-challenge.org where it accrued 826 registrations prior to the deadline. 106 unique teams submitted valid predictions to the challenge, and the official leaderboard reflects submissions from 100 unique teams who met all criteria for a complete submission, including a detailed manuscript describing their method. This challenge was accepted to be held in conjunction with the 2019 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) in Shenzhen, China, and it has now entered an indefinite “open leaderboard” phase in which any user may submit their predictions on the 90 test cases and have them scored and added to the leaderboard without delay.

The remainder of this manuscript is structured as follows: In Section 2 we give a brief description of prior work in 3D segmentation as well as existing public datasets and prior challenges. In Section 3 we describe the design of the challenge, including the dataset, rules, timeline, and evaluation metric. In Section 4 we provide an in-depth description of the methods of the top five placing teams, and discuss the implications of these high-performing methods in the context of 3D segmentation research. In Section 5 we discuss the limitations of this challenge, including issues with the dataset, challenge design, and what could be done to address these in future iterations. Finally, we give concluding remarks in Section 6.

2 Related Work

2.1 Biomedical Grand Challenges

“Grand Challenges” in biomedical image analysis are events in which participants compete against one another to develop automated (or sometimes semi-automated) systems that perform a task in medical imaging (i.e., given some input, produce a particular output). Challenge organizers clearly define how participating teams will be evaluated, and usually release some training data to help teams develop systems based on machine learning. The use of common benchmarks such as this has a long history in predictive learning (West et al., 1997; Maier-Hein et al., 2018), but they first started to attract broader interest within the medical imaging community in 2007 when “Biomedical Grand Challenges” were first officially affiliated with the MICCAI conference in Brisbane, Australia that year (Heimann et al., 2009). Since then, hundreds of challenges have been organized in conjunction with a wide array of related conferences (Reinke et al., 2018).

Challenges serve two main purposes in biomedical image analysis research:

  1. They allow for the objective and fair comparison of methods. With all participating teams evaluated on a single benchmark that’s centrally maintained by the organizers, performance metrics are consistent, and the test set performance isn’t subject to sampling variations, each of which can obscure comparisons between independent studies on separate, private data.

  2. They contribute publicly available data to the research community. This stimulates and democratizes the development of systems for the target task and sometimes for other related tasks. This work would otherwise be limited to entities with the resources to curate large quantities of high-quality annotated data.

It’s important that new challenges continue to be organized because the set of interesting tasks in biomedical image analysis is vast and largely untapped. Even if each task could be adequately covered, the risk that high-performing methods are overfitting to the quirks of that particular test set rather than the true data generating process grows over time. Further, no challenge is perfect, and organizers continue to learn and improve from the experiences of others. We enumerate the limitations of the present challenge in Section 5 for exactly this reason.

2.2 3D Segmentation

3D semantic segmentation is the voxel-wise delineation of regions in three-dimensional imaging such as CT or MRI. Several diverse applications of semantic segmentation have been proposed including targeting for radiation therapy (He et al., 2019), patient-specific surgical simulation (Taha et al., 2018), and quantitative diagnostic and prognostic scoring of anatomic and histologic indications of disease (Farjam et al., 2007; Tang et al., 2018; Blake et al., 2019). However, without automation of the upstream segmentation task, the prospects for clinical translation of these applications are poor. Automatic semantic segmentation of biomedical imaging has thus arisen as an important and popular research direction. Indeed, roughly 70% of prior biomedical grand challenges were focused on semantic segmentation (Maier-Hein et al., 2018).

The third spatial dimension presents a unique set of challenges to this field such as a significantly higher cost-per-instance for data annotation and higher memory intensity during model training and inference. Further, because this problem is relatively esoteric compared with its 2D counterpart, the prospect for leveraging transfer learning (Bengio, 2012) from huge, well-established computer vision benchmarks like ImageNet (Deng et al., 2009) or MSCOCO (Lin et al., 2014) is severely limited. Despite these challenges, recent work in this area has demonstrated impressive performance on the 3D segmentation of a wide variety of anatomical structures and lesions in cross-sectional imaging (Maier et al., 2017; Bakas et al., 2018; Bilic et al., 2019; Zhuang et al., 2019).

Throughout computer vision, methods based on Deep Learning (DL) now dominate (Litjens et al., 2017), and 3D segmentation is no exception (Shen et al., 2017). Deep Neural Networks (DNNs) for any given task have a massive design space including a vast array of network architectures, optimization algorithms, and preprocessing procedures. Further, the impressive performance of DL has attracted many researchers who are now searching this design space for optimal performance. Unfortunately, with such a high computational cost of training DNNs, most papers that propose a new DNN architecture lack comprehensive benchmarking against the state of the art, especially when results are reported only on private datasets, a practice that still encompasses more than half of papers accepted at MICCAI each year (Heller et al., 2019a).

U-Net (Ronneberger et al., 2015) and its 3D variants (Milletari et al., 2016; Çiçek et al., 2016) are some of the earliest proposed methods for DL-based medical image segmentation. In the years since, innumerable modifications have been proposed to U-Net (e.g., residual connections (Milletari et al., 2016), dense connections (Li et al., 2018), and attention mechanisms (Oktay et al., 2018)), with researchers often reporting substantial improvements over their baseline U-Net results. Recently, however, Isensee et al. (2018) demonstrated state-of-the-art performance in several well-established 3D segmentation grand challenges using only a U-Net and a novel methodology to search a small space of hyperparameters and preprocessing procedures. This paradigm, termed the nnU-Net, won the recent “decathlon” grand challenge, in which teams were challenged to develop a system capable of performing well on the tasks of 10 grand challenges simultaneously.

The authors of nnU-Net make clear that, despite its dominance, they do not believe it to be a global optimum of the design space; rather, they believe it to be a more consistent starting point from which to evaluate architectural “bells and whistles” that could offer added performance. The KiTS19 challenge is significant because it’s one of the first grand challenges on 3D segmentation to be held after nnU-Net demonstrated its dominant performance and its implementation was released. Thus, one might have expected KiTS19 to be won by the team that found the right architectural garnish with which to augment the nnU-Net baseline, but as we discuss in Section 4, the “vanilla” nnU-Net demonstrated top performance despite attempts by several teams to enhance it.

3 Materials and Methods

3.1 The KiTS19 Dataset

We obtained approval from the University of Minnesota Institutional Review Board to conduct a retrospective review of patients who underwent partial or radical nephrectomy for suspicion of renal cancer by a physician in the urology service of University of Minnesota Health between 2010 and mid-2018. 544 patients met these initial criteria. We then excluded patients for whom imaging in the late arterial phase was not available. By the semantic definitions that we chose to use, cyst was regarded as a part of the kidney. We therefore had to exclude a small number of patients whose tumor was postoperatively determined to be a cyst. We also excluded patients with tumor thrombus, since in these cases the tumor extends out beyond what we considered to be the primary site and the appropriate boundaries were ambiguous. This left the 300 patients comprising the KiTS19 dataset.

Figure 2: A flow chart of the patient inclusion and random assignment to either training or test set of the study.

The imaging for these 300 patients was downloaded in DICOM format and converted to Nifti (Larobina and Murino, 2014) using the pydicom (Mason, 2011) and nibabel (Brett et al., 2018) packages in Python 3.6. Students under the supervision of the challenge’s clinical chair, Dr. Christopher Weight, then produced manual delineations for the kidneys and tumors in these images using a web-based annotation platform developed in-house (Heller et al., 2017). A comprehensive description of the annotation protocol and quality assurance procedures can be found in Heller et al. (2019b).

Despite the fact that all of these patients were treated by Urologists at a single tertiary care center, more than half of the imaging was acquired across more than 50 referring institutions. This, combined with the range of acquisition protocols within each institution, made KiTS19 a diverse dataset in terms of the voxel dimensions, contrast timing, table signature, and scanner field of view. This helps to ameliorate some concern over the external validity of a single institution retrospective cohort like KiTS19, but in future work, a multi-institution cohort with a prospectively collected test set would be preferable (Park and Han, 2018). Table 1 provides a summary of the patient characteristics.

Attribute             Values (N=300)
Age                   60 (51, 68)
BMI                   29.82 (26.16, 35.28)
Tumor Diameter*       4.200 (2.600, 6.125)
   Male               180 (60%)
   Female             120 (40%)
   Radical Nx         112 (37.3%)
   Partial Nx         188 (62.6%)
   Clear Cell RCC     203 (67.7%)
   Papillary RCC      28 (9.3%)
   Chromophobe RCC    27 (9%)
   Oncocytoma         16 (5.3%)
   Other              26 (8.7%)
Table 1: The baseline and tumor characteristics of patients in the KiTS19 dataset. Continuous variables are reported as: Median (Q1, Q3). *In cases where there is more than one tumor, the largest tumor diameter was reported. The initialism RCC represents Renal Cell Carcinoma.

3.2 Design of the KiTS19 Challenge

In Reinke et al. (2018), the authors review challenges hosted in conjunction with MICCAI conferences from 2007 to 2016 with a focus on how common weaknesses in the design of these challenges made them vulnerable to leaderboard manipulation, which could seriously jeopardize the validity of these challenges as objective and fair benchmarks. We found this paper to be very instructive in the design of this challenge, and we made every effort to follow the best practices outlined by the authors.

3.2.1 Rules

  1. External Data: Teams were permitted to use external data, as long as that data was publicly available. This was intended to allow for methods based on pre-training and domain adaptation without allowing teams who might have internal datasets to have an unfair advantage.

  2. Replicability: Teams were required to submit a manuscript describing their methods in detail, as well as any external sources of data. Teams were highly encouraged but not required to make their code available as well.

  3. Submissions per Team: Teams were allowed to enter only one submission to the final leaderboard. This was to prevent “fine tuning” on the test set by sending many different submissions and simply retracting all lower-scoring entries once the scores are released, which would result in an over-estimate of true performance. This policy was strictly enforced, and several cases in which a team made multiple submissions from different accounts were recognized and only the most recent submission was considered.

  4. Manual Predictions: The predictions submitted to the challenge were required to be entirely automatic; any manual intervention in the inference stage was strictly prohibited.

3.2.2 Timeline

The data was collected and annotated between July and December 2018, after which time the challenge was designed and proposed to MICCAI 2019 via their online submission platform. We were notified that KiTS19 was accepted on March 3, 2019, and we then published the official webpage on March 4.

The training data was released in full on March 15, at which time participants were invited to inspect and suggest improvements to the data for a period of 20 days. After this period, we addressed several concerns with the metadata as well as a handful of segmentation labels and the data was “frozen” for the challenge on April 15. On April 23, a second version of the data was released in which the imaging and segmentation labels were resampled to the median spacing of the original dataset: 3mm x 0.781mm x 0.781mm on the longitudinal, anterior-posterior, and mediolateral axes respectively. This was requested by a number of teams who wanted to work with data in a consistent spacing, but did not feel comfortable resampling the data and labels themselves.

On July 15, the test imaging was released (without labels) and the submission period was opened until July 29. The MICCAI 2019 leaderboard was then released on July 30, and submission reopened indefinitely for the open leaderboard on August 12, 2019.

3.2.3 Infrastructure

The main challenge webpage as well as submission and evaluation platforms were graciously hosted by grand-challenge.org. The data for this challenge was hosted on GitHub using their large file storage solution, Git-LFS. A separate server was then set up to run a Discourse discussion forum. Strict rules were never implemented for which types of correspondences should be submitted as GitHub issues and which should be forum posts, but generally speaking GitHub was used for issues with the data and starter code, and Discourse was used for all other correspondences such as clarifications about the rules and official announcements.

3.2.4 Submission and Evaluation

The evaluation metric for this challenge was the average Sørensen-Dice coefficient between kidney and tumor across all 90 test cases. That is,

$$S = \frac{1}{90} \sum_{i=210}^{299} \frac{1}{2} \left( \frac{2\,|P_k^{(i)} \cap G_k^{(i)}|}{|P_k^{(i)}| + |G_k^{(i)}|} + \frac{2\,|P_t^{(i)} \cap G_t^{(i)}|}{|P_t^{(i)}| + |G_t^{(i)}|} \right)$$

where $S$ is the final score, the superscript $(i)$ represents the $i$th case, $P_k$ represents the set of predicted kidney voxels, $G_k$ represents the set of ground truth kidney voxels, $P_t$ represents the set of predicted tumor voxels, and $G_t$ represents the set of ground truth tumor voxels. Here, $i$ ranges from 210 to 299 because by our indexing, cases 0 to 209 comprised the training set.
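As a rough illustration of this metric, the following minimal sketch computes the composite score from voxel sets. It is not the official evaluation code: the function names are hypothetical, voxels are assumed to be represented as coordinate tuples, and treating two empty masks as perfect agreement is an assumption made here only to avoid division by zero.

```python
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]
Case = Tuple[Set[Voxel], Set[Voxel], Set[Voxel], Set[Voxel]]


def dice(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Sørensen-Dice coefficient between two sets of voxel coordinates."""
    denom = len(pred) + len(truth)
    if denom == 0:
        # Assumption (not from the source): two empty masks agree perfectly.
        return 1.0
    return 2.0 * len(pred & truth) / denom


def composite_score(cases: Iterable[Case]) -> float:
    """Mean over cases of the average of kidney Dice and tumor Dice.

    Each case is (pred_kidney, true_kidney, pred_tumor, true_tumor).
    """
    per_case = [
        0.5 * (dice(pk, gk) + dice(pt, gt))
        for pk, gk, pt, gt in cases
    ]
    return sum(per_case) / len(per_case)


# Toy example with a single case: kidney Dice 0.8, tumor Dice 1.0.
toy_case = (
    {(0, 0, 0), (0, 0, 1), (0, 1, 0)},  # predicted kidney
    {(0, 0, 0), (0, 0, 1)},             # ground truth kidney
    {(1, 1, 1)},                        # predicted tumor
    {(1, 1, 1)},                        # ground truth tumor
)
toy_score = composite_score([toy_case])
```

Because the composite is a plain average, it equals the mean of the overall kidney Dice and the overall tumor Dice across the same cases.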

During the submission period, teams had the option of viewing the approximate score of two of their submissions. By “approximate”, we mean that 45 cases were sampled from the test set without replacement and the score was calculated for these 45 cases only. For each submission, teams were contacted via email to confirm that the submission had been received, and asked whether they would like to view this approximate score. A “yes” response would be honored no more than twice. This helped to alleviate participants’ concerns that the predictions were not packaged or interpreted correctly, without providing participants with enough information for test set fine-tuning.
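The approximate score can be pictured with the short sketch below, which averages the per-case scores of 45 cases drawn without replacement. The function name and the `seed` parameter are hypothetical, and the text does not say whether the same 45 cases were reused across submissions, so nothing here should be read as the organizers' implementation.

```python
import random


def approximate_score(per_case_scores, k=45, seed=None):
    """Average the scores of k cases sampled without replacement."""
    rng = random.Random(seed)  # seed is a hypothetical convenience parameter
    subset = rng.sample(list(per_case_scores), k)
    return sum(subset) / k


# Toy per-case scores for the 90 test cases (indices 210 to 299).
toy_scores = {case_id: 0.5 + 0.005 * (case_id - 210) for case_id in range(210, 300)}
estimate = approximate_score(toy_scores.values(), k=45, seed=0)
```

The estimate always lies between the lowest and highest per-case score, but it generally differs from the full 90-case average.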

In an attempt to prevent teams from gaming this system to receive more than two approximate scores, teams were required to enclose a manuscript with at least one page of content and their final authors list with all submissions. Authors lists were cross-checked against prior submissions before scores were released, and in a few cases where substantially overlapping submissions went unnoticed, all submissions after the one immediately following receipt of the second score were discarded. The final score on the MICCAI leaderboard reflects the most recent remaining submission from each team, whether its approximate score was requested or not.

4 Results

126 unique users made a submission to this challenge, but 20 of these were determined to belong to a team that had submitted previously. A further 6 submissions were disqualified for not providing a sufficient manuscript, even after repeated warnings. This left the 100 teams that appear on the official MICCAI leaderboard.

Submissions to this challenge were overwhelmingly based on deep neural networks, although they varied considerably in decisions around preprocessing strategies, architectural details, and training procedures. Sections 4.1 - 4.5 outline the methods used by the five highest-scoring teams listed on the official MICCAI leaderboard. These five teams were also invited to present this material orally at the KiTS19 satellite event of MICCAI 2019 in Shenzhen, China on Oct. 13, 2019.

Figure 3: Kidney Dice scores (left, gold) and tumor Dice scores (right, maroon) on each case for each of the top 5 teams. Similar plots for the remaining submissions are included in the supplementary material.

4.1 First Place: An Attempt at Beating the 3D U-Net

This submission was made by Fabian Isensee and Klaus H. Maier-Hein of the German Cancer Research Center.

4.1.1 Data Use and Preprocessing

This submission did not make use of any data other than the official training set. Model parameters were initialized randomly and no transfer learning was used.

The data was downloaded in its original spacing (from the master branch) and resampled to a common spacing of mm resulting in median volume dimensions of voxels. The CT intensities (HU) were clipped to a range of [-79, 304] and transformed by subtracting 101 and dividing by 76.9.
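The intensity transformation described above amounts to a clip followed by an affine rescaling. A minimal sketch on a flat list of HU values follows; the function name is hypothetical, and a real pipeline would operate on the full image array rather than a Python list.

```python
def normalize_hu(values, lo=-79.0, hi=304.0, shift=101.0, scale=76.9):
    """Clip HU values to [lo, hi], then subtract `shift` and divide by `scale`."""
    return [(min(max(v, lo), hi) - shift) / scale for v in values]


# Air (-1000 HU) and dense bone (2000 HU) are clipped before rescaling.
normalized = normalize_hu([-1000.0, 101.0, 2000.0])
```

With these constants, any value at or below -79 HU maps to (-79 - 101) / 76.9 and any value at or above 304 HU maps to (304 - 101) / 76.9.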

4.1.2 Architecture

Three 3D U-Net architectures were tested using five-fold cross-validation. The networks all used 3D convolutions, leaky ReLU (LReLU) activations, and instance normalization. Upsampling was performed with transposed convolutions and downsampling was performed with strided convolutions. The first level of the U-Net extracts either 24 or 30 feature maps, and each downsampling doubles this up to a maximum of 320. Downsampling is stopped once a further downsampling would result in at least one spatial dimension of

voxels, at which point upsampling begins. The three networks differed as follows:

Plain 3D U-Net: This network used 30 feature maps at the highest resolution. Between each up or downsampling, two blocks of conv-instnorm-LReLU were performed.

Residual 3D U-Net: This network used 24 feature maps at the highest resolution. In the encoder portion, the conv-instnorm-LReLU were replaced with residual blocks of the form: conv-instnorm-ReLU-conv-instnorm-ReLU where the residual addition takes place before the final activation, similar to He et al. (2016a). Just one of these blocks is used at the highest resolution, and with each downsampling another is added. The decoder portion uses just one conv-instnorm-ReLU between upsamplings.

Pre-activation Residual 3D U-Net: This network was similar to the Residual 3D U-Net but the residual blocks used pre-activation (He et al., 2016b). The residual blocks were thus instnorm-ReLU-conv-instnorm-ReLU-conv.
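The channel widths described above (a starting width of 24 or 30 that doubles with each downsampling, capped at 320) can be sketched as follows. The function name is hypothetical, and `num_levels` is an assumed parameter, since the number of resolution levels depends on the stopping criterion rather than being fixed in the text.

```python
def feature_maps_per_level(base, num_levels, cap=320):
    """Feature maps at each encoder level: doubled per downsampling, capped."""
    return [min(base * 2 ** level, cap) for level in range(num_levels)]


plain_widths = feature_maps_per_level(30, num_levels=6)
residual_widths = feature_maps_per_level(24, num_levels=6)
```

For six levels this gives 30, 60, 120, 240, 320, 320 for the plain network and 24, 48, 96, 192, 320, 320 for the residual variants.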

An overview of these architectures is given in Fig. 4.

Figure 4: 3D U-Net (top) and residual 3D U-Net architecture (bottom) used in this project. denotes that a block is repeated X times. The architecture of the pre-activation residual U-Net is analogous to the residual U-Net (with instnorm and ReLU being shifted to accommodate pre-activation residual blocks).

4.1.3 Training

Patches of size

were randomly sampled from the resampled volumes for training. Training ran for 1000 epochs, with an epoch defined as 250 batches of batch size two. A sum of cross-entropy and Dice loss was used as the training objective, with deep supervision. All networks were trained with stochastic gradient descent. Extensive data augmentation with the batchgenerators framework was used during training. Adjustments included scaling, rotations, brightness, contrast, a gamma transformation, and the introduction of Gaussian noise.

Training for each network was performed on a single NVIDIA Titan Xp GPU using the PyTorch framework (Paszke et al., 2017) based on the nnU-Net implementation (Isensee et al., 2019). Each network took about five days to train.
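To make the training objective concrete, the sketch below adds a cross-entropy term to a soft Dice term for a single binary foreground class over a flattened patch. This is a simplification under stated assumptions: the submission itself was multi-class and used deep supervision, both omitted here, and the function name and the smoothing constant `eps` are hypothetical.

```python
import math


def ce_plus_dice_loss(probs, labels, eps=1e-7):
    """Binary cross-entropy plus (1 - soft Dice) over a flat list of voxels.

    probs[i] is the predicted foreground probability at voxel i and
    labels[i] is its 0/1 ground truth label.
    """
    n = len(probs)
    # Probabilities are floored at eps so the logarithm stays finite.
    cross_entropy = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for p, y in zip(probs, labels)
    ) / n
    intersection = sum(p * y for p, y in zip(probs, labels))
    soft_dice = (2 * intersection + eps) / (sum(probs) + sum(labels) + eps)
    return cross_entropy + (1 - soft_dice)


perfect = ce_plus_dice_loss([1.0, 0.0, 1.0], [1, 0, 1])
uncertain = ce_plus_dice_loss([0.5, 0.5, 0.5], [1, 0, 1])
```

A perfect prediction drives both terms to zero, while an uninformative prediction is penalized by both. For scale, 1000 epochs of 250 batches at batch size two means each network saw 500,000 sampled patches.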

The cases that were known to have been mislabeled (outlined in Section 5.1) were excluded from training, and cases 23, 68, 125, and 133 were found to be in consistent disagreement with predictions, so they were excluded as well.

Five-fold cross-validation Dice scores of 0.974 and 0.857 were observed for kidney and tumor respectively for the Residual 3D U-Net architecture, which was marginally higher than the performance of any other approach, including an ensemble of all three. Therefore, an ensemble of the networks from these five folds only was used for the final test set predictions.

4.1.4 Postprocessing

The model predictions were resampled to their original spacing and submitted without any further processing.

4.1.5 Results

This submission scored a 0.974 kidney Dice and a 0.851 tumor Dice resulting in a first place 0.912 composite score. For a more detailed description of this submission, see Isensee and Maier-Hein (2019).

4.2 Second Place: Cascaded Semantic Segmentation for Kidney and Tumor

This submission was made by Xiaoshuai Hou, Chunmei Xie, Fengyi Li, and Yang Nan of PingAn Technology Co.

4.2.1 Data Use and Preprocessing

This submission did not make use of any data other than the official training set. Model parameters were initialized randomly and no transfer learning was used.

The CT intensities were normalized to zero mean and unit standard deviation without clipping. This method had two input modes: it first performed a coarse localization of the kidneys from a low-resolution image (volumes resampled to

mm), and then performed fine-grained delineation of those kidneys as well as the lesion(s) from a cropped region of high-resolution images (volumes resampled to mm).

4.2.2 Architecture

This cascaded approach had three stages. Stage 1 performed a coarse segmentation of all kidneys in the image in order to crop out spatially distant regions for the next stage. This stage is based on the nnU-Net (Isensee et al., 2018), but the segmentation masks here are used only to localize the kidney regions, and the predicted masks are discarded. The second stage is run for each rectangular kidney region that is found by the first stage. Here, another 3D U-Net based on the nnU-Net implementation is used to produce a fine-grained segmentation of the kidneys vs background, where tumor is included in the kidney label. Finally, in the third stage of the model, all voxels predicted to be background are set to zero, and a fully convolutional net is used to segment the tumor voxels from the kidney voxels. Here, all predictions made outside of the stage 2 kidney predictions are discarded.
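Two of the bookkeeping steps in this cascade can be sketched with voxel sets: cropping to the region found by the coarse stage, and discarding tumor predictions that fall outside the stage 2 kidney prediction. The function names and the `margin` parameter are hypothetical, the coarse mask is assumed to be non-empty, and splitting the mask into one region per kidney is omitted for brevity.

```python
def bounding_box(voxels, margin=0):
    """Axis-aligned box (lo, hi) enclosing a non-empty set of (z, y, x) voxels."""
    zs, ys, xs = zip(*voxels)
    lo = (min(zs) - margin, min(ys) - margin, min(xs) - margin)
    hi = (max(zs) + margin, max(ys) + margin, max(xs) + margin)
    return lo, hi


def crop(voxels, box):
    """Keep only the voxels that fall inside the box (bounds inclusive)."""
    lo, hi = box
    return {v for v in voxels if all(l <= c <= h for c, l, h in zip(v, lo, hi))}


def restrict_to_kidney(tumor_pred, kidney_pred):
    """Discard tumor voxels predicted outside the stage 2 kidney region."""
    return tumor_pred & kidney_pred


coarse_kidney = {(1, 2, 3), (4, 5, 6)}
box = bounding_box(coarse_kidney, margin=1)
kept = crop({(2, 3, 4), (20, 20, 20)}, box)
tumor = restrict_to_kidney({(1, 1, 1), (9, 9, 9)}, {(1, 1, 1), (2, 2, 2)})
```

The intersection in the last step relies on tumor being included in the stage 2 kidney label, as described above.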

An overview of this prediction pipeline is shown in Fig. 5.

Figure 5: The segmentation pipeline for the second place method. Kidneys are first coarsely localized via segmentation in stage one, then each kidney is finely segmented from background in stage two, and finally the tumor is segmented from kidney in stage three.

4.2.3 Training

All models were trained with a combination of the cross entropy loss and Dice loss. Data augmentation was used including elastic deformation, rotation, and random cropping. The Adam optimizer (Kingma and Ba, 2014) was used with an initial learning rate of . Whenever the exponential moving average of training loss did not improve by a certain threshold in 30 epochs, the learning rate was scaled by 1/5. Training was terminated if the validation loss did not improve within 50 consecutive epochs.
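The schedule described above can be sketched as a small state machine. The patience values (30 and 50 epochs) and the 1/5 factor come from the text; the class name, the EMA decay, the improvement threshold, and the initial learning rate are all assumptions chosen for illustration, since the text does not state them.

```python
class PlateauSchedule:
    """Learning-rate decay on a training-loss plateau, plus early stopping."""

    def __init__(self, initial_lr, ema_decay=0.9, threshold=1e-3,
                 lr_patience=30, stop_patience=50, factor=0.2):
        self.lr = initial_lr            # assumed value, not given in the text
        self.ema_decay = ema_decay      # assumed smoothing constant
        self.threshold = threshold      # assumed minimum improvement
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.factor = factor
        self.ema = None
        self.best_ema = float("inf")
        self.best_val = float("inf")
        self.epochs_since_ema_gain = 0
        self.epochs_since_val_gain = 0

    def step(self, train_loss, val_loss):
        """Update after one epoch; return True while training should continue."""
        if self.ema is None:
            self.ema = train_loss
        else:
            self.ema = self.ema_decay * self.ema + (1 - self.ema_decay) * train_loss

        if self.ema < self.best_ema - self.threshold:
            self.best_ema = self.ema
            self.epochs_since_ema_gain = 0
        else:
            self.epochs_since_ema_gain += 1
            if self.epochs_since_ema_gain >= self.lr_patience:
                self.lr *= self.factor  # scale the learning rate by 1/5
                self.epochs_since_ema_gain = 0

        if val_loss < self.best_val:
            self.best_val = val_loss
            self.epochs_since_val_gain = 0
        else:
            self.epochs_since_val_gain += 1
        return self.epochs_since_val_gain < self.stop_patience


# With a loss that never improves, the rate drops once and training then stops.
schedule = PlateauSchedule(initial_lr=1.0)
keep_going = [schedule.step(1.0, 1.0) for _ in range(51)]
```

In this toy run the first epoch sets the baseline, the learning rate is reduced after 30 further epochs without improvement, and the stop signal is raised after 50 epochs without a better validation loss.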

4.2.4 Postprocessing

After the model made its kidney and tumor predictions, an algorithm was used to fill holes within the tumor prediction and remove some predicted regions that appeared to be false positives.

4.2.5 Results

This submission scored a 0.967 kidney Dice and a 0.845 tumor Dice resulting in a second place 0.906 composite score. For a more detailed description of this submission, see Hou et al. (2019).

4.3 Third Place: Segmentation of kidney tumor by multi-resolution VB-nets

This submission was made by Guangrui Mu, Zhiyong Lin, Miaofei Han, Guang Yao, and Yaozong Gao of Shanghai United Imaging Intelligence Inc.

4.3.1 Data Use and Preprocessing

This submission did not make use of any data other than the official training set. Model parameters were initialized randomly and no transfer learning was used.

The data was downloaded in its original spacing and 30 cases were randomly selected to form a validation set, leaving 180 for training. CT intensity values were clipped to fall in the range of [-200, 500] HU and then uniformly normalized to [-1,1], and all volumes were resampled to an isotropic spatial resolution. A professional doctor then manually delineated all cysts in the dataset in order to help the learned model form an understanding of cyst as well as tumor, and mitigate the risk that the two are confused.
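The intensity step for this submission can be sketched as a clip followed by a linear map onto [-1, 1]. Reading “uniformly normalized” as a linear rescaling is an assumption, and the function name is hypothetical.

```python
def normalize_to_unit_range(values, lo=-200.0, hi=500.0):
    """Clip HU values to [lo, hi] and map that range linearly onto [-1, 1]."""
    return [2.0 * (min(max(v, lo), hi) - lo) / (hi - lo) - 1.0 for v in values]


rescaled = normalize_to_unit_range([-1000.0, -200.0, 150.0, 500.0, 3000.0])
```

Under this reading, the midpoint of the clipping window (150 HU) maps to zero.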

4.3.2 Architecture

The authors of this submission extended the V-Net proposed in Milletari et al. (2016) to include bottlenecks instead of traditional convolutional layers. Here, each bottleneck consists of three convolutional layers: the first applies a 1×1×1 kernel and reduces the number of feature maps, the second performs a spatial convolution with a receptive field greater than 1 in each spatial dimension, and the last applies another 1×1×1 filter to increase the number of feature maps back to their original count. The authors extracted 16 feature maps at the highest resolution, and doubled this with each downsampling. Residual blocks were used throughout the network of the form conv-batchnorm-ReLU-conv-batchnorm-addition-ReLU. Zero padding was used to keep blocks the same size for concatenation.
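To see why a bottleneck can be attractive, the sketch below compares the parameter count of a single plain 3D convolution with that of a three-layer bottleneck. The pointwise 1×1×1 first and last layers, the 3×3×3 spatial kernel, the reduction factor of 4, and the 64-channel example are all assumptions chosen for illustration, not values reported for this submission.

```python
def conv3d_params(c_in, c_out, k):
    """Weights plus biases of a 3D convolution with a cubic k x k x k kernel."""
    return c_in * c_out * k ** 3 + c_out


def bottleneck_params(channels, reduction=4, k=3):
    """Parameter count of a reduce / spatial / restore bottleneck."""
    mid = channels // reduction
    return (conv3d_params(channels, mid, 1)     # pointwise conv reduces feature maps
            + conv3d_params(mid, mid, k)        # spatial convolution
            + conv3d_params(mid, channels, 1))  # pointwise conv restores feature maps


channels = 64  # hypothetical layer width
plain = conv3d_params(channels, channels, 3)
bottleneck = bottleneck_params(channels)
print(plain, bottleneck)  # 110656 9056
```

Under these assumptions the bottleneck uses roughly a twelfth of the parameters of the plain convolution at the same input and output width.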

This submission also made use of a cascaded approach in which segmentation-based localization was again used on low resolution volumes (voxel size of mm) to produce Volumes of Interest (VOIs) which were then fed at a high-resolution (voxel size of mm) into a second model which produced final segmentation predictions.

A graphical representation of this pipeline is shown in Fig. 6.

Figure 6: A graphic showing the prediction pipeline for the third place submission to the challenge. Data is first resampled to a low-resolution global image, and kidney segmentation is used to define Volumes of Interest (VOIs). Those VOIs are then resampled to a high resolution, and another network is used to make kidney and tumor segmentation predictions.

4.3.3 Training

The models were implemented and trained in the PyTorch framework. During the training of both the coarse and fine models, random patches of size voxels were sampled from the target volume. A generalized Dice loss was used, defined over the number of class labels, the probability of each class at each voxel as predicted by the network, and the binary label indicating whether each voxel belongs to that class.
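For orientation, one common way to write a multi-class Dice loss in terms of these quantities is sketched below. The symbols are assumptions: $C$ for the number of class labels, $p_c(i)$ for the predicted probability of class $c$ at voxel $i$, and $g_c(i)$ for the corresponding binary label. The exact form used by the authors (for example, squared terms in the denominator or per-class weights) may differ.

```latex
\mathcal{L}_{\mathrm{Dice}}
  = 1 - \frac{1}{C} \sum_{c=1}^{C}
    \frac{2 \sum_{i} p_c(i)\, g_c(i)}
         {\sum_{i} p_c(i) + \sum_{i} g_c(i)}
```

In this form the loss is 0 when the predicted probabilities match the labels exactly for every class and approaches 1 as the overlap vanishes.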

The Adam optimization algorithm (Kingma and Ba, 2014) was used with a constant learning rate of and a batch size of 6. The networks were trained for 5000 epochs of 30 batches. Data augmentation was not employed.

4.3.4 Postprocessing

Connected component analysis was used to filter out small regions that were predicted as kidney. In order to aid in the discrimination between tumors and cysts with uneven densities, an algorithm made use of the spatial relationship between the cyst and the tumor and their average HU value to give a final, consistent classification.

4.3.5 Results

This submission scored a 0.973 kidney Dice and a 0.832 tumor Dice resulting in a third place 0.903 composite score. For a more detailed description of this submission, see Mu et al. (2019).

4.4 Fourth Place: Cascaded Volumetric Convolutional Network for Kidney Tumor Segmentation from CT volumes

This submission was made by Yao Zhang, Yixin Wang, Feng Hou, Jiawei Yang, Guangwei Xiong, Jiang Tian, and Cheng Zhong of the Lenovo AI Lab.

4.4.1 Data Use and Preprocessing

This submission did not make use of any data other than the official training set. Model parameters were initialized randomly and no transfer learning was used.

The data from the interpolated branch of the GitHub repository was used for this submission. After download, 50 cases were selected as a validation set, leaving 160 for training. This method also used a coarse-to-fine approach with a first segmentation-based localization to crop volumes of interest for each kidney in a lower resolution image (voxel spacing mm). Then, those crops were applied to the full resolution data, creating patches that were then sent to a finer-grained segmentation network which made the model’s predictions. Normalization of the CT intensities was done on a case-by-case basis by first clipping values at their 0.5% and 99.5% percentiles and then subtracting the mean and dividing by the standard deviation.
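A minimal sketch of the per-case normalization follows. It assumes that the mean and standard deviation are computed over the clipped intensities of the same case, which the text does not state explicitly; `percentile` and `normalize_case` are hypothetical helper names, and intensities are passed as a flat list for simplicity.

```python
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def normalize_case(values, low_q=0.5, high_q=99.5):
    """Clip to the [low_q, high_q] percentile range, then z-score the result."""
    ordered = sorted(values)
    lo, hi = percentile(ordered, low_q), percentile(ordered, high_q)
    clipped = [min(max(v, lo), hi) for v in values]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped) or 1.0  # guard against constant images
    return [(v - mean) / std for v in clipped]


normalized = normalize_case(list(range(1001)))
```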

4.4.2 Architecture

The architecture used for both the coarse localization and fine predictions of this submission is a 3D U-Net based on the nnU-Net implementation. Several hyperparameters and architectural enhancements and their combinations were tested on the validation set and the highest-performing combination was chosen for predictions on the test set. The baseline model that the authors used was a 3D U-Net with instance normalization. The net extracted 30 feature maps at original resolution and doubled this with each downsampling. Downsampling was performed with max-pooling and transposed convolutions were used for upsampling. These up and downsampling operations were strided differently based on the original patch resolution in order to handle blocks with differing extents in each axis. The network downsampled until each spatial dimension of the feature map is smaller than 8 voxels. Leaky ReLU was used for all activation functions outside of the loss layers.
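The downsampling rule can be sketched as a small planning function. This follows only what the paragraph states (30 feature maps at full resolution, doubling at each downsampling, and pooling an axis until it is smaller than 8 voxels); the patch size of 160×160×80 is a hypothetical input, and the actual nnU-Net implementation may apply further limits that are not described here.

```python
def plan_unet(patch_size, base_features=30, min_size=8):
    """Return per-stage pooling strides, feature-map counts, and the final size.

    An axis is pooled with stride 2 while it still has at least `min_size`
    voxels; axes that are already smaller use stride 1.
    """
    size = list(patch_size)
    features = [base_features]
    strides = []
    while any(s >= min_size for s in size):
        stride = tuple(2 if s >= min_size else 1 for s in size)
        size = [s // st for s, st in zip(size, stride)]
        strides.append(stride)
        features.append(features[-1] * 2)
    return strides, features, tuple(size)


strides, features, bottleneck_size = plan_unet((160, 160, 80))
print(strides)          # four isotropic poolings, then one in-plane only
print(features)         # 30 doubled at every downsampling
print(bottleneck_size)  # (5, 5, 5)
```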

The authors tested this baseline against other versions making use of deep supervision, residual blocks between up and down sampling, and the use of a “spatial prior”, in which they fed the coarse predictions of the first network as an input channel to the second. The authors found that the combination of all three of these modifications yielded the best validation performance, and they thus chose to use this model at test time. A diagram outlining this architecture is shown in Fig. 7.

In addition to deep supervision, the authors also employed deep prediction. The predictions at each decoder stage of the network are ensembled together by majority voting to produce final predictions.

Figure 7: The pipeline used by the fourth place submission. In the first stage, a baseline 3D U-Net equipped with instance normalization and residual blocks is used to obtain the coarse location of the kidney. In the second stage, a counterpart further augmented with deep supervision is employed for both kidney and tumor segmentation.

4.4.3 Training

The authors applied data augmentation to expand the training data and avoid overfitting. Augmentation, including random rotation, scaling, and elastic deformation, was performed on the fly. Random patches of size were sampled from the images and used for training with a batch size of 5. The objective was the sum of the cross entropy and Dice losses at each supervised layer. The Adam algorithm was used for optimization with an initial learning rate of . On a single NVIDIA Tesla V100 32GB GPU, the first stage took roughly 18 hours to train and the second stage took a further 30 hours.

4.4.4 Postprocessing

The model operates on the assumption that no more than two kidneys exist in each case. A connected component analysis was run on each case’s prediction to remove all but the two largest components of the kidney’s predicted segmentation.
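The postprocessing step above can be sketched in plain Python as follows. The sketch assumes 6-connectivity and represents the predicted kidney mask as a set of `(z, y, x)` voxel coordinates; both choices, and the helper names, are illustrative rather than taken from the submission.

```python
from collections import deque

# Face-adjacent neighbors only (6-connectivity), an assumed choice.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def connected_components(voxels):
    """Group foreground voxel coordinates into connected components."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBORS:
                n = (z + dz, y + dy, x + dx)
                if n in remaining:
                    remaining.remove(n)
                    component.add(n)
                    queue.append(n)
        components.append(component)
    return components


def keep_largest(voxels, k=2):
    """Keep only the k largest connected components of a mask."""
    comps = sorted(connected_components(voxels), key=len, reverse=True)
    kept = set()
    for comp in comps[:k]:
        kept |= comp
    return kept
```

For full-size volumes, a library routine for connected component labeling would be far faster than this breadth-first search; the version here is only meant to make the logic explicit.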

4.4.5 Results

This submission scored a 0.974 kidney Dice and a 0.831 tumor Dice resulting in a fourth place 0.902 composite score. For a more detailed description of this submission, see Zhang et al. (2019).

4.5 Fifth Place: Cascaded U-Net Ensembles

This submission was made by Jun Ma of the Nanjing University of Science and Technology.

4.5.1 Data Use and Preprocessing

This submission did not make use of any data other than the official training set. Model parameters were initialized randomly and no transfer learning was used.

This submission is built on the data from the interpolated branch of the GitHub repository, and the data were randomly assigned to five folds for cross-validation. The CT intensities were clipped based on the 0.5 and 99.5th percentile and normalized by subtracting the mean and dividing by the standard deviation of the intensity values.

4.5.2 Architecture

For this submission, a 3D U-Net (Fig. 4-Top) based on the nnU-Net implementation was used as the main architecture. Compared to the original 3D U-Net, the notable changes (Isensee et al., 2019) are the use of padded convolutions, instance normalization and leaky ReLUs. Predictions were made by three separate models and ensembled together with majority voting for kidney and an OR operation for tumor. One of these three models was a “vanilla” 3D U-Net, and the other two were cascaded models, each with one network to localize the kidneys’ Volumes of Interest (VOIs) and another to produce a fine-grained segmentation of the kidneys and tumors in each VOI.

At test time, the data was augmented with mirrors and predictions were averaged in an attempt to produce more robust predictions.
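The ensembling rule for the three models (majority voting for kidney, OR for tumor) might look like the sketch below. It assumes label maps flattened to lists with 0 for background, 1 for kidney and 2 for tumor, and it assumes that a tumor vote from any model takes precedence over the kidney vote; the text does not spell out how the two rules interact.

```python
BACKGROUND, KIDNEY, TUMOR = 0, 1, 2


def ensemble_voxel(labels):
    """Combine the labels that several models assigned to one voxel."""
    if any(label == TUMOR for label in labels):  # OR rule for tumor
        return TUMOR
    kidney_votes = sum(label == KIDNEY for label in labels)
    if 2 * kidney_votes > len(labels):           # strict majority for kidney
        return KIDNEY
    return BACKGROUND


def ensemble(predictions):
    """Combine equally sized flat label maps, one per model."""
    return [ensemble_voxel(voxel_labels) for voxel_labels in zip(*predictions)]


combined = ensemble([
    [0, 1, 1, 2, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
])
print(combined)  # [0, 1, 1, 2, 0]
```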

4.5.3 Training

The 3D U-Net model was first trained to minimize the sum of the cross-entropy and dice losses. For the cascaded models, the dice loss was replaced with a TopK loss to prevent tumor predictions in the upstream model. In all cases, the Adam algorithm was used for optimization with an initial learning rate of followed by a fine-tuning learning rate of .
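A TopK loss is usually understood as the cross entropy averaged over only the hardest voxels. The sketch below assumes that reading; the value `k_percent=10.0` is an illustrative default, not a setting reported for this submission, and the input is the predicted probability of the true class at each voxel.

```python
import math


def topk_cross_entropy(true_class_probs, k_percent=10.0):
    """Mean cross entropy over the k percent of voxels with the highest loss."""
    losses = sorted(
        (-math.log(max(p, 1e-12)) for p in true_class_probs),  # clamp avoids log(0)
        reverse=True,
    )
    keep = max(1, math.ceil(len(losses) * k_percent / 100.0))
    return sum(losses[:keep]) / keep


probs = [0.9] * 9 + [0.1]  # nine easy voxels and one hard voxel
hard_only = topk_cross_entropy(probs, k_percent=10.0)
all_voxels = topk_cross_entropy(probs, k_percent=100.0)
```

With `k_percent=100.0` the function reduces to the ordinary mean cross entropy, which makes it easy to compare the two objectives on the same predictions.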

Each model took about four days to train with one Titan-Xp GPU and two Intel Xeon E5-2650V4 CPUs.

4.5.4 Postprocessing

A heuristic-based algorithm was used to remove predicted kidney regions that were suspicious for false positives. First, isolated regions smaller than 20,000 voxels were removed. Second, the algorithm assumed that the centers of the two kidneys should have similar positions on the anterior-posterior axis, so larger isolated regions could be excluded if another plausible kidney region at that approximate position could not be found.
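One possible reading of this heuristic is sketched below on per-region summaries rather than on voxel data. The 20,000-voxel threshold comes from the text; the tolerance `ap_tolerance`, the dictionary layout, and the fallback of keeping the largest region when no pair is found are assumptions made for illustration.

```python
def filter_kidney_regions(regions, min_voxels=20000, ap_tolerance=30.0):
    """Filter predicted kidney regions.

    Each region is a dict with 'size' (voxel count) and 'ap' (centroid position
    on the anterior-posterior axis, in voxels). Small regions are dropped, and a
    remaining region is kept only if another remaining region lies at a similar
    anterior-posterior position.
    """
    large = [r for r in regions if r["size"] >= min_voxels]
    kept = [
        r for r in large
        if any(o is not r and abs(r["ap"] - o["ap"]) <= ap_tolerance for o in large)
    ]
    if not kept and large:
        # Assumed fallback: keep the largest region, e.g. for a solitary kidney.
        kept = [max(large, key=lambda r: r["size"])]
    return kept


regions = [
    {"size": 150000, "ap": 250.0},  # plausible kidney
    {"size": 140000, "ap": 262.0},  # plausible kidney at a similar position
    {"size": 30000, "ap": 90.0},    # large but isolated region
    {"size": 500, "ap": 255.0},     # small speck
]
kept = filter_kidney_regions(regions)
```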

4.5.5 Results

This submission scored a 0.973 kidney Dice and a 0.825 tumor Dice resulting in a fifth place 0.899 composite score. For a more detailed description of this submission, see Ma (2019).

4.6 Discussion

Semantic segmentation is one of the most popular research areas in medical image analysis, and as such, a vast number of novel sophisticated methods have been proposed in this space over the last few years. Surprisingly, very few of these are represented in the methods of the highest performing submissions, and this is consistent with the results of many concurrent challenges. The apparent failure of many of these methods to outperform e.g. the winning submission to this challenge is interesting and worthy of further study. Is this an artifact of this task in particular, or evidence of something more general? The winning method of Isensee et al. focused heavily on data preprocessing rather than novel architectures or optimization algorithms. Perhaps preprocessing has a larger impact than it typically gets credit for. Further experiments and challenges are needed to support or refute this claim.

5 Limitations

The KiTS19 challenge was, overall, highly successful. It attracted a high number of submissions and continues to serve as an important and challenging benchmark in 3D segmentation. With that said, it was not perfect. In this section, we discuss limitations of this challenge, including issues with the dataset, issues with the challenge policies and design, and issues with the infrastructure for the challenge and some instances where communication between the organizers and participants broke down. We conclude with some ideas for how future iterations might address these limitations.

5.1 Dataset

Even though systems based on Deep Learning have often shown excellent generalization beyond the population that their training set was sampled from (Zhang et al., 2016), there is still reason for concern about a potential performance drop when applying these systems beyond the population that was sampled for the test set.

The patients represented in the KiTS19 challenge were all treated by physicians within the same health system, and the population is therefore heavily concentrated in a limited geographic region, which might limit the generalizability of methods developed for this dataset to populations from other regions of the world. However, there is considerable diversity in imaging protocols and scanners since the preoperative studies were often performed at referring institutions. The dataset for the KiTS19 challenge was also retrospective, and the split into the training and test sets was random. Therefore, there is some concern that distributional shift over time might result in comparatively low performance on prospective data.

Even within this region and this block of time, the patients sampled represent only a subset of all patients seen for concern of renal malignancy. In particular, we were limited to patients who did not choose to “opt out” of making their data available for research, and we were limited to patients with contrast-enhanced imaging available for download (see Fig. 1). We have no reason to suspect that this introduced bias into our cohort, but this is, of course, difficult to check. Patients with tumor thrombus (27/329) were excluded in order to simplify the annotation process, and patients with concerning lesions that turned out to be cysts (2/329) were also excluded. The exclusion of cysts contributed to (but did not fully account for) the lower proportion of lesions postoperatively found to be benign in our cohort than has been reported elsewhere (8% vs 30%; Kim et al., 2019). Therefore, if a system based on this data were to be applied in clinical practice, care would need to be taken to ensure that the system was not being applied to patients who meet our exclusion criteria. Further, if a deployment’s target population differs significantly from that of this cohort (e.g., higher benign rate, smaller tumor sizes; see Table 1), the system might exhibit worse performance than on the KiTS19 test set.

Finally, despite our best efforts to avoid errors when creating the semantic segmentation labels, they are imperfect. Errors range from inevitable noise in boundary delineations, to a few cases in which a whole structure was labeled incorrectly.

5.2 Challenge Design

As stated in Section 3.2.2, when we first released the dataset, we designated a 20 day period for public “label review”, in which participants were invited to visualize the segmentation labels and raise any concerns. While a number of important issues were discovered during this period, in retrospect we believe that this period should have been extended considerably, perhaps until just one month before the test set was released. This would have allowed for the few issues that were discovered after the data “freeze” date to be fixed for the MICCAI 2019 competition.

Further, since we released the test images publicly, there is a possibility that teams may have manually intervened in the prediction process in order to illegally “clean” their predictions, or even simply segmented some cases manually from scratch. In order to mitigate this risk, we allowed only a two week period for submission, but we cannot exclude this possibility entirely. It might have been preferable to host our challenge on a platform where a kernel is made available with private access to the test data for prediction purposes only.

5.3 Communication and Challenge Infrastructure

There were two public avenues for communication between the participants and organizers: GitHub Issues and a Discourse forum. GitHub Issues were meant to be used for individual problems with downloading and using the data, as well as for reporting label errors, whereas Discourse was to be used for everything else. The intention was that Discourse would be for information that benefits all participants and GitHub would be for troubleshooting individual difficulties. In retrospect, the reporting of label errors was wrongly confined to GitHub, and was therefore not widely disseminated. In future challenges, we will take conscious steps to ensure that all participants are made aware of issues that are found with the training labels so that teams are given equal opportunities to exclude or amend the affected cases in their training process.

5.4 Future Directions for KiTS

We are actively working to improve upon the KiTS19 Challenge in several ways. Among these are:

  • Multi-Institutional Cohort: We will be expanding the dataset to represent at least four health systems, each in different geographic regions.

  • Pseudo-Prospective Cohort: A date will be selected, and data generated before that date will be used only for training, and data generated after that time will be used only for testing.

  • Longer Data Review: We will be extending the time period in which concerns will be addressed prior to the data freeze. Tentatively, we plan to freeze the data just one month before the test set release.

  • Clearer Communication of Label Errors: In addition to the labels’ version control system, label errors discovered after the data freeze will be announced on the homepage of the challenge as well as on the discussion forum.

  • Better Representation of Rare Subtypes: A vast majority of renal tumors are Clear Cell Renal Cell Carcinomas, and this is reflected in the KiTS19 dataset. With contributions from other clinical centers, we will have enough data to perform stratified random sampling in order to give equal representation to several histological subtypes that are comparatively rare.

  • More Segmentation Classes: In order to prevent the segmentation problem from becoming trivially easy with the larger dataset, we plan to expand the segmentation problem to include more classes and structures such as renal cyst, renal artery and vein, and ureter.

Things such as a longer data review and clearer communication of label errors are simple to implement, but others such as more segmentation classes and multi-institutional data will take significant effort. Our hope is to phase these changes into future KiTS Challenges as time and administrative hurdles allow.
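The pseudo-prospective cohort in the list above amounts to a split by date. A minimal sketch, assuming each case is a dict with a hypothetical `date` field and that cases falling exactly on the cutoff go to the test set:

```python
from datetime import date


def pseudo_prospective_split(cases, cutoff):
    """Cases before the cutoff are used for training; the rest for testing."""
    train = [c for c in cases if c["date"] < cutoff]
    test = [c for c in cases if c["date"] >= cutoff]
    return train, test


# Hypothetical case records; identifiers and dates are made up for illustration.
cases = [
    {"id": "case_a", "date": date(2015, 3, 1)},
    {"id": "case_b", "date": date(2017, 8, 15)},
    {"id": "case_c", "date": date(2018, 1, 1)},
    {"id": "case_d", "date": date(2019, 6, 30)},
]
train, test = pseudo_prospective_split(cases, cutoff=date(2018, 1, 1))
```

Unlike a random split, this keeps every test case later in time than every training case, so any drift in imaging protocols or patient mix over time shows up in the test score.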

6 Conclusion

The KiTS19 challenge served to accelerate and measure the state of the art in the automatic semantic segmentation of kidneys and kidney tumors in contrast-enhanced CT imaging. The challenge attracted submissions from more than 100 teams around the world, and the highest-scoring team achieved a kidney Dice score of 0.974 and a tumor Dice score of 0.851 on the private 90-case test set. The experiments and results of the winning team are surprising in that they failed to show any meaningful benefit to several “bells and whistles” that people have recently reported to yield substantial improvements over the 3D U-Net baseline. Instead, they won by a considerable margin by submitting the predictions of the baseline+residual connections model alone. The challenge has now entered an indefinite “open leaderboard” phase where it serves as a high-quality and challenging benchmark in 3D semantic segmentation. A second iteration of the KiTS challenge is planned with the goals of improving upon the clinical significance and external validity of the challenge, as well as increasing the difficulty of the 3D segmentation problem by adding other more complicated structures such as ureters, renal arteries, and renal veins.


Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA225435. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

We would like to thank the MICCAI challenge committee for taking the time to review our challenge proposal and provide useful feedback. We also thank for providing an excellent free platform for hosting challenges such as this one. Finally, we thank the developers of Discourse for providing an excellent piece of free software for self-hosted discussion forums.


  • Bakas et al. (2018) Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shinohara, R.T., Berger, C., Ha, S.M., Rozycki, M., et al., 2018. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. arXiv preprint arXiv:1811.02629 .
  • Bengio (2012) Bengio, Y., 2012. Deep learning of representations for unsupervised and transfer learning, in: Proceedings of ICML workshop on unsupervised and transfer learning, pp. 17–36.
  • Bilic et al. (2019) Bilic, P., Christ, P.F., Vorontsov, E., Chlebus, G., Chen, H., Dou, Q., Fu, C.W., Han, X., Heng, P.A., Hesser, J., et al., 2019. The liver tumor segmentation benchmark (lits). arXiv preprint arXiv:1901.04056 .
  • Blake et al. (2019) Blake, P., Sathianathen, N., Heller, N., Rosenberg, J., Rengel, Z., Moore, K., Kaluzniak, H., Walczak, E., Papanikolopoulos, N., Weight, C., 2019. Automatic renal nephrometry scoring using machine learning. European Urology Supplements 18, e904–e905.
  • Brett et al. (2018) Brett, M., Hanke, M., Markiewicz, C., Côté, M.A., McCarthy, P., Ghosh, S., Wassermann, D., et al., 2018. nipy/nibabel: 2.3.0, June. https://doi.org/10.5281/zenodo.1287921.
  • Campbell et al. (2017) Campbell, S., Uzzo, R.G., Allaf, M.E., Bass, E.B., Cadeddu, J.A., Chang, A., Clark, P.E., Davis, B.J., Derweesh, I.H., Giambarresi, L., et al., 2017. Renal mass and localized renal cancer: Aua guideline. The Journal of Urology 198, 520–529.
  • Capitanio and Montorsi (2016) Capitanio, U., Montorsi, F., 2016. Renal cancer. The Lancet 387, 894–906.
  • Chawla et al. (2006) Chawla, S.N., Crispen, P.L., Hanlon, A.L., Greenberg, R.E., Chen, D.Y., Uzzo, R.G., 2006. The natural history of observed enhancing renal masses: meta-analysis and review of the world literature. The Journal of Urology 175, 425–431.
  • Çiçek et al. (2016) Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O., 2016. 3d u-net: learning dense volumetric segmentation from sparse annotation, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 424–432.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition, IEEE. pp. 248–255.
  • Farjam et al. (2007) Farjam, R., Soltanian-Zadeh, H., Jafari-Khouzani, K., Zoroofi, R.A., 2007. An image analysis approach for automatic malignancy determination of prostate pathological images. Cytometry Part B: Clinical Cytometry: The Journal of the International Society for Analytical Cytology 72, 227–240.
  • Ficarra et al. (2009) Ficarra, V., Novara, G., Secco, S., Macchi, V., Porzionato, A., De Caro, R., Artibani, W., 2009. Preoperative aspects and dimensions used for an anatomical (padua) classification of renal tumours in patients who are candidates for nephron-sparing surgery. European urology 56, 786–793.
  • Hayn et al. (2011) Hayn, M.H., Schwaab, T., Underwood, W., Kim, H.L., 2011. Renal nephrometry score predicts surgical outcomes of laparoscopic partial nephrectomy. BJU international 108, 876–881.
  • He et al. (2016a) He, K., Zhang, X., Ren, S., Sun, J., 2016a. Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  • He et al. (2016b) He, K., Zhang, X., Ren, S., Sun, J., 2016b. Identity mappings in deep residual networks, in: European conference on computer vision, Springer. pp. 630–645.
  • He et al. (2019) He, T., Guo, J., Wang, J., Xu, X., Yi, Z., 2019. Multi-task learning for the segmentation of thoracic organs at risk in ct images., in: SegTHOR@ ISBI.
  • Heimann et al. (2009) Heimann, T., Van Ginneken, B., Styner, M.A., Arzhaeva, Y., Aurich, V., Bauer, C., Beck, A., Becker, C., Beichel, R., Bekes, G., et al., 2009. Comparison and evaluation of methods for liver segmentation from ct datasets. IEEE transactions on medical imaging 28, 1251–1265.
  • Heller et al. (2019a) Heller, N., Rickman, J., Weight, C., Papanikolopoulos, N., 2019a. The role of publicly available data in miccai papers from 2014 to 2018. arXiv preprint arXiv:1908.06830 .
  • Heller et al. (2019b) Heller, N., Sathianathen, N., Kalapara, A., Walczak, E., Moore, K., Kaluzniak, H., Rosenberg, J., Blake, P., Rengel, Z., Oestreich, M., et al., 2019b. The kits19 challenge data: 300 kidney tumor cases with clinical context, ct semantic segmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445 .
  • Heller et al. (2017) Heller, N., Stanitsas, P., Morellas, V., Papanikolopoulos, N., 2017. A web-based platform for distributed annotation of computerized tomography scans, in: Intravascular Imaging and Computer Assisted Stenting, and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. Springer, pp. 136–145.
  • Hollingsworth et al. (2006) Hollingsworth, J.M., Miller, D.C., Daignault, S., Hollenbeck, B.K., 2006. Rising incidence of small renal masses: a need to reassess treatment effect. Journal of the National Cancer Institute 98, 1331–1334.
  • Hou et al. (2019) Hou, X., Xie, C., Li, F., Nan, Y., 2019. Cascaded semantic segmentation for kidney and tumor, in: Submissions to the 2019 Kidney Tumor Segmentation Challenge – KiTS19.
  • Isensee and Maier-Hein (2019) Isensee, F., Maier-Hein, K.H., 2019. An attempt at beating the 3d u-net, in: Submissions to the 2019 Kidney Tumor Segmentation Challenge – KiTS19.
  • Isensee et al. (2018) Isensee, F., Petersen, J., Klein, A., Zimmerer, D., Jaeger, P.F., Kohl, S., Wasserthal, J., Koehler, G., Norajitra, T., Wirkert, S., et al., 2018. nnu-net: Self-adapting framework for u-net-based medical image segmentation. arXiv preprint arXiv:1809.10486 .
  • Isensee et al. (2019) Isensee, F., Petersen, J., Kohl, S.A., Jäger, P.F., Maier-Hein, K.H., 2019. nnu-net: Breaking the spell on successful medical image segmentation. arXiv preprint arXiv:1904.08128 .
  • Kim et al. (2019) Kim, J.H., Li, S., Khandwala, Y., Chung, K.J., Park, H.K., Chung, B.I., 2019. Association of prevalence of benign pathologic findings after partial nephrectomy with preoperative imaging patterns in the united states from 2007 to 2014. JAMA surgery 154, 225–231.
  • Kingma and Ba (2014) Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
  • Kutikov et al. (2011) Kutikov, A., Smaldone, M.C., Egleston, B.L., Manley, B.J., Canter, D.J., Simhan, J., Boorjian, S.A., Viterbo, R., Chen, D.Y., Greenberg, R.E., et al., 2011. Anatomic features of enhancing renal masses predict malignant and high-grade pathology: a preoperative nomogram using the renal nephrometry score. European urology 60, 241–248.
  • Kutikov and Uzzo (2009) Kutikov, A., Uzzo, R.G., 2009. The renal nephrometry score: a comprehensive standardized system for quantitating renal tumor size, location and depth. The Journal of Urology 182, 844–853.
  • Larobina and Murino (2014) Larobina, M., Murino, L., 2014. Medical image file formats. Journal of digital imaging 27, 200–206.
  • Li et al. (2018) Li, X., Chen, H., Qi, X., Dou, Q., Fu, C.W., Heng, P.A., 2018. H-denseunet: hybrid densely connected unet for liver and tumor segmentation from ct volumes. IEEE transactions on medical imaging 37, 2663–2674.
  • Lin et al. (2014) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft coco: Common objects in context, in: European conference on computer vision, Springer. pp. 740–755.
  • Litjens et al. (2017) Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical image analysis 42, 60–88.
  • Ma (2019) Ma, J., 2019. Solution to the kidney tumor segmentation challenge 2019, in: Submissions to the 2019 Kidney Tumor Segmentation Challenge – KiTS19.
  • Maier et al. (2017) Maier, O., Menze, B.H., von der Gablentz, J., Häni, L., Heinrich, M.P., Liebrand, M., Winzeck, S., Basit, A., Bentley, P., Chen, L., et al., 2017. Isles 2015-a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral mri. Medical image analysis 35, 250–269.
  • Maier-Hein et al. (2018) Maier-Hein, L., Eisenmann, M., Reinke, A., Onogur, S., Stankovic, M., Scholz, P., Arbel, T., Bogunovic, H., Bradley, A.P., Carass, A., et al., 2018. Why rankings of biomedical image analysis competitions should be interpreted with care. Nature communications 9, 5217.
  • Mason (2011) Mason, D., 2011. Su-e-t-33: pydicom: an open source dicom library. Medical Physics 38, 3493–3493.
  • McIntosh et al. (2018) McIntosh, A.G., Ristau, B.T., Ruth, K., Jennings, R., Ross, E., Smaldone, M.C., Chen, D.Y., Viterbo, R., Greenberg, R.E., Kutikov, A., et al., 2018. Active surveillance for localized renal masses: tumor growth, delayed intervention rates, and >5-yr clinical outcomes. European urology 74, 157–164.
  • Millet et al. (2011) Millet, I., Doyon, F.C., Hoa, D., Thuret, R., Merigeaud, S., Serre, I., Taourel, P., 2011. Characterization of small solid renal lesions: can benign and malignant tumors be differentiated with ct? American journal of roentgenology 197, 887–896.
  • Milletari et al. (2016) Milletari, F., Navab, N., Ahmadi, S.A., 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: 2016 Fourth International Conference on 3D Vision (3DV), IEEE. pp. 565–571.
  • Mir et al. (2017) Mir, M.C., Derweesh, I., Porpiglia, F., Zargar, H., Mottrie, A., Autorino, R., 2017. Partial nephrectomy versus radical nephrectomy for clinical t1b and t2 renal tumors: a systematic review and meta-analysis of comparative studies. European urology 71, 606–617.
  • Mu et al. (2019) Mu, G., Lin, Z., Han, M., Yao, G., Gao, Y., 2019. Segmentation of kidney tumor by multi-resolution vb-nets, in: Submissions to the 2019 Kidney Tumor Segmentation Challenge – KiTS19.
  • Okhunov et al. (2011) Okhunov, Z., Rais-Bahrami, S., George, A.K., Waingankar, N., Duty, B., Montag, S., Rosen, L., Sunday, S., Vira, M.A., Kavoussi, L.R., 2011. The comparison of three renal tumor scoring systems: C-index, padua, and renal nephrometry scores. Journal of endourology 25, 1921–1924.
  • Oktay et al. (2018) Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al., 2018. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 .
  • Park and Han (2018) Park, S.H., Han, K., 2018. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology 286, 800–809.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., 2017. Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration 6.
  • Patel et al. (2016) Patel, H.D., Riffon, M.F., Joice, G.A., Johnson, M.H., Chang, P., Wagner, A.A., McKiernan, J.M., Trock, B.J., Allaf, M.E., Pierorazio, P.M., 2016. A prospective, comparative study of quality of life among patients with small renal masses choosing active surveillance and primary intervention. The Journal of Urology 196, 1356–1362.
  • Reinke et al. (2018) Reinke, A., Eisenmann, M., Onogur, S., Stankovic, M., Scholz, P., Full, P.M., Bogunovic, H., Landman, B.A., Maier, O., Menze, B., et al., 2018. How to exploit weaknesses in biomedical challenge design and organization, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 388–395.
  • Richard et al. (2016) Richard, P.O., Jewett, M.A., Bhatt, J.R., Evans, A.J., Timilsina, N., Finelli, A., 2016. Active surveillance for renal neoplasms with oncocytic features is safe. The Journal of Urology 195, 581–587.
  • Robson (1963) Robson, C.J., 1963. Radical nephrectomy for renal cell carcinoma. The Journal of Urology 89, 37–42.
  • Ronneberger et al. (2015) Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer. pp. 234–241.
  • Scosyrev et al. (2014) Scosyrev, E., Messing, E.M., Sylvester, R., Campbell, S., Van Poppel, H., 2014. Renal function after nephron-sparing surgery versus radical nephrectomy: results from EORTC randomized trial 30904. European Urology 65, 372–377.
  • Shen et al. (2017) Shen, D., Wu, G., Suk, H.I., 2017. Deep learning in medical image analysis. Annual review of biomedical engineering 19, 221–248.
  • Simmons et al. (2010) Simmons, M.N., Ching, C.B., Samplaski, M.K., Park, C.H., Gill, I.S., 2010. Kidney tumor location measurement using the C index method. The Journal of Urology 183, 1708–1713.
  • Simmons et al. (2012) Simmons, M.N., Hillyer, S.P., Lee, B.H., Fergany, A.F., Kaouk, J., Campbell, S.C., 2012. Diameter-axial-polar nephrometry: integration and optimization of renal and centrality index scoring systems. The Journal of Urology 188, 384–390.
  • Spaliviero et al. (2015) Spaliviero, M., Poon, B.Y., Aras, O., Di Paolo, P.L., Guglielmetti, G.B., Coleman, C.Z., Karlo, C.A., Bernstein, M.L., Sjoberg, D.D., Russo, P., et al., 2015. Interobserver variability of R.E.N.A.L., PADUA, and centrality index nephrometry score systems. World Journal of Urology 33, 853–858.
  • Taha et al. (2018) Taha, A., Lo, P., Li, J., Zhao, T., 2018. Kid-Net: convolution networks for kidney vessels segmentation from CT-volumes, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 463–471.
  • Tang et al. (2018) Tang, Y., Harrison, A.P., Bagheri, M., Xiao, J., Summers, R.M., 2018. Semi-automatic RECIST labeling on CT scans with cascaded convolutional neural networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 405–413.
  • Uzosike et al. (2018) Uzosike, A.C., Patel, H.D., Alam, R., Schwen, Z.R., Gupta, M., Gorin, M.A., Johnson, M.H., Gausepohl, H., Riffon, M.F., Trock, B.J., et al., 2018. Growth kinetics of small renal masses on active surveillance: variability and results from the DISSRM registry. The Journal of Urology 199, 641–648.
  • West et al. (1997) West, J., Fitzpatrick, J.M., Wang, M.Y., Dawant, B.M., Maurer Jr, C.R., Kessler, R.M., Maciunas, R.J., Barillot, C., Lemoine, D., Collignon, A., et al., 1997. Comparison and evaluation of retrospective intermodality brain image registration techniques. Journal of computer assisted tomography 21, 554–568.
  • Yushkevich et al. (2016) Yushkevich, P.A., Gao, Y., Gerig, G., 2016. ITK-SNAP: An interactive tool for semi-automatic segmentation of multi-modality biomedical images, in: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE. pp. 3342–3345.
  • Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O., 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
  • Zhang et al. (2019) Zhang, Y., Wang, Y., Hou, F., Yang, J., Xiong, G., Tian, J., Zhong, C., 2019. Cascaded volumetric convolutional network for kidney tumor segmentation from CT volumes, in: Submissions to the 2019 Kidney Tumor Segmentation Challenge – KiTS19.
  • Zhuang et al. (2019) Zhuang, X., Li, L., Payer, C., Stern, D., Urschler, M., Heinrich, M.P., Oster, J., Wang, C., Smedby, O., Bian, C., et al., 2019. Evaluation of algorithms for multi-modality whole heart segmentation: An open-access grand challenge. arXiv preprint arXiv:1902.07880.