Consensus Based Medical Image Segmentation Using Semi-Supervised Learning And Graph Cuts

12/07/2016, by Dwarikanath Mahapatra, et al. (IBM)

Medical image segmentation requires consensus ground truth segmentations to be derived from multiple expert annotations. A novel approach is proposed that obtains consensus segmentations from experts using graph cuts (GC) and semi-supervised learning (SSL). Popular approaches use iterative Expectation Maximization (EM) to estimate the final annotation and quantify annotators' performance. Such techniques pose the risk of getting trapped in local minima. We propose a self-consistency (SC) score to quantify annotator consistency using low level image features. SSL is used to predict missing annotations by considering global features and local image consistency. The SC score also serves as the penalty cost in a second order Markov random field (MRF) cost function optimized using graph cuts to derive the final consensus label. Graph cuts obtain a global optimum without an iterative procedure. Experimental results on synthetic images, real data of Crohn's disease patients and retinal images show our final segmentation to be accurate and more consistent than competing methods.




1 Introduction

Combining manual annotations from multiple experts is important in medical image segmentation and computer aided diagnosis (CAD) tasks such as performance evaluation of different registration or segmentation algorithms, or assessment of the annotation quality of different raters through inter- and intra-expert variability LMS1 . The accuracy of the final (or consensus) segmentation determines to a large extent the accuracy of (semi-) automated segmentation and disease detection algorithms.

It is common for medical datasets to have annotations from different experts. Combining many experts' annotations is challenging due to their varying expertise levels, intra- and inter-expert variability, and missing labels from one or more experts. Poor consensus segmentations seriously affect the performance of segmentation algorithms, and robust fusion methods are crucial to their success. In this work we propose to combine multiple expert annotations using semi-supervised learning (SSL) and graph cuts (GC). Its effectiveness is demonstrated on example annotations of Crohn's Disease (CD) patients on abdominal magnetic resonance (MR) images, retinal fundus images, and synthetic images. Figure 1 shows an example with two consecutive slices of a patient affected with CD. In both slices, the red contour indicates a diseased region annotated by Expert 1 while the green contour denotes a diseased region annotated by Expert 2. Two significant observations can be made: 1) in Figure 1 (a) there is no common region which is marked as diseased by both experts; 2) in Figure 1 (b) the area agreed by both experts as diseased is very small. Figure 1 (c) illustrates the challenges in retinal fundus images, where different experts have different contours for the optic cup. The challenges of intra- and inter-expert variability are addressed by a novel self-consistency (SC) score, and the missing label information is predicted using SSL.

1.1 Related Work

Fusing expert annotations involves quantifying annotator performance. Global scores of segmentation quality for label fusion were proposed in LMS2 ; STAPLE . However, as suggested by Restif in LMS4 , the computation of local performance is a better measure since it suits applications requiring varying accuracy in different image areas. Majority voting has also been used for fusing atlases of the brain in LMS7 . However, it is limited by the use of a global metric for template selection which considers each voxel independently from the others, and assumes an equal contribution by each template to the final segmentation. It also produces locally inconsistent segmentations in regions of high anatomical variability and poor registration. To address these limitations, weighted majority voting was proposed in LMS10 , which calculates weights based on intensity differences. This strategy depends on intensity normalization and image registration, and is error prone.

A widely used algorithm for label fusion is STAPLE STAPLE , which uses Expectation-Maximization (EM) to find sensitivity and specificity values maximizing the data likelihood. These values quantify the quality of expert segmentations. Their performance varies depending upon annotation accuracy and anatomical variability between templates LMS14 . Commowick et al. propose Local MAP STAPLE (LMSTAPLE) LocalMAPSTAPLE , which addresses the limitations of STAPLE by using sliding windows and Maximum A Posteriori (MAP) estimation, and defining a prior over expert performance. Wang et al. WangPAMI13 exploit the correlation between different experts through a joint probabilistic model for improved automatic brain segmentation. Chatelain et al. ChatelainMiccai13 use random forests (RF) to determine the most coherent expert decisions with respect to the image by defining a consistency measure based on information gain; they select the most relevant features to train the classifier, and do not combine multiple expert labels. Statistical approaches such as COLLATE COLLATE model the rating behavior of experts and use statistical analysis to quantify their reliability; the final annotation is obtained using EM. The SIMPLE method combines atlas fusion and weight selection in an iterative procedure SIMPLE . Combining multiple atlases demonstrates the importance of anatomical information from multiple sources in segmentation tasks, leading to reduced error compared to a single training atlas SabuncuTMI2010 ; LotjonenNeuro2010 .

1.2 Our Contribution

The disadvantages of EM based methods are greater computation time and the risk of being trapped in local minima. Consequently, the quantification of expert performance might be prone to errors. Statistical methods such as STAPLER require many simulated user studies to learn rater behavior, which may be biased towards the simulated data.

Another common issue is missing annotation information from one or more experts. It is common practice to annotate only the interesting regions in medical images, such as diseased regions or boundaries of an organ, and disagreement between experts is a common occurrence. However, in some cases we find that one or more experts do not provide any labels in some image slices, perhaps due to mistakes or inattention induced by stress. In such cases it is important to infer the missing annotations and gather as much information as possible, since missing information is bound to impact the quality of the consensus annotation. Methods like STAPLE predict the missing labels that would maximize the assumed data likelihood function, which is a strong assumption on the data distribution.

Our work addresses the above limitations through the following contributions:

  1. SSL is used to predict missing annotation information. While SSL is a widely used concept in machine learning, it has not previously been used to predict missing annotations. Such an approach reduces computation time since it predicts the labels in one step, without iterations as in EM based methods. By considering local pixel characteristics and global image information from the available labeled samples, SSL predicts missing annotations without making strong assumptions about the form of the data generating function.

  2. A SC score based on the image features that best separate different training data quantifies the reliability and accuracy of each annotation, including both local and global information in quantifying segmentation quality.

  3. Graph cuts (GC) are used to obtain the final segmentation which gives a global optimum of the second order MRF cost function and also incorporates spatial constraints into the final solution. The SC is used to calculate the penalty costs for each possible class as reference model distributions cannot be defined in the absence of true label information. GC also pose minimal risk of being trapped in local minima compared to previous EM based methods.

We describe different aspects of our method in Sections 2-5, describe the datasets in Section 6, present our results in Section 7 and conclude with Section 8.


Figure 1: (a)-(b) Illustration of subjectivity in annotating medical images. In both figures, red contour indicates diseased region as annotated by Expert 1 while green contour denotes diseased region as annotated by Expert 2. (c) outline of optic cup by different experts.

2 Image Features

Feature vectors derived for each voxel are used to predict any missing annotations from one or more experts. Image intensities are normalized to a common range. Each voxel is described using intensity statistics, texture and curvature entropy, and spatial context features, extracted from a patch around each voxel. In previous work MahapatraTMI_CD2013 we used this same set of features to design a fully automated system for detecting and segmenting CD tissues from abdominal MRI, where the patches were applied to images of different sizes. Through extensive experimental analysis of the RF based training procedure we identified context features as most important, followed by curvature, texture and intensity. Our hand crafted features also outperformed other feature combinations MahapatraABD12 . Since the current work focuses on a method to combine multiple expert annotations, we refer the reader to MahapatraTMI_CD2013 for details.

2.1 Intensity Statistics

MR images commonly contain regions that do not form distinct spatial patterns but differ in their higher order statistics Petrou1 . Therefore, in addition to the features processed by the human visual system (HVS), i.e., mean and variance, we extract skewness and kurtosis values from each voxel's neighborhood.
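As a minimal sketch (not the authors' implementation), the four neighbourhood statistics can be computed with plain NumPy; the patch size is left to the caller:

```python
import numpy as np

def intensity_features(patch):
    """Mean, variance, skewness and excess kurtosis of a voxel neighbourhood."""
    v = np.asarray(patch, dtype=float).ravel()
    m, s = v.mean(), v.std()
    z = (v - m) / s if s > 0 else np.zeros_like(v)
    # skewness = E[z^3]; excess kurtosis = E[z^4] - 3 (zero for a Gaussian)
    return np.array([m, v.var(), (z ** 3).mean(), (z ** 4).mean() - 3.0])
```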

2.2 Texture Entropy

Texture maps are obtained from 2-D Gabor filter banks for each slice (at multiple orientations and scales). They are partitioned into equal parts corresponding to sectors of a circle. Figure 2 (a) shows the template for partitioning a patch into sectors and extracting entropy features. For each sector r we calculate the texture entropy, given by

E_tex(r) = - ∑_t p(t) log p(t),

where p(t) denotes the probability distribution of texture values t in sector r. This procedure is repeated for all the texture maps over orientations and scales to extract the texture entropy feature vector.
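A hedged sketch of the per-sector entropy computation; the number of histogram bins and the value range are illustrative choices, not taken from the paper:

```python
import numpy as np

def sector_entropy(values, bins=16):
    """Shannon entropy E = -sum p log p of the texture values in one sector.

    `bins` and the (0, 1) value range are illustrative assumptions."""
    hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]  # 0 log 0 is taken as 0
    return float(-(p * np.log(p)).sum())
```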

2.3 Curvature Entropy

Different tissue classes have different curvature distributions, and we exploit this characteristic for accurate discrimination between different tissue types. Curvature maps are obtained from the gradient maps of the tangent along the 3-D surface. The second fundamental form of these curvature maps is identical to the Weingarten mapping, and the trace of the matrix gives the mean curvature. This mean curvature map is used for calculating curvature entropy. Details on curvature calculation are given in 3dcurv ; MahapatraTMI_CD2013 . Similar to texture, curvature entropy is calculated from the sectors of a patch and is given by

E_curv(r) = - ∑_c p(c) log p(c),

where p(c) denotes the probability distribution of curvature values c in sector r. The intensity, texture and curvature features are combined into a single feature vector.

We use 2-D texture and curvature maps, as the 3-D maps do not provide consistent features because of the lower resolution in the z direction compared to the x and y axes (the voxel spacing is anisotropic). Experimental results demonstrate that using 2-D features results in higher classification accuracy in identifying diseased and normal samples when compared to using 3-D features. We also resampled the images using isotropic sampling and extracted 3-D features, but the results were similar and favour the use of 2-D features.

2.4 Spatial Context Features:

Context information is particularly important for medical images because of the regular arrangement of human organs Tu ; ZhengSteerable . Figure 2 (b) shows the template for context information, where the circle center is the current voxel and the sampled points are identified by a red 'X'. At each point corresponding to an 'X' we extract a small region and calculate the mean intensity, texture and curvature values. The texture values are derived from the texture maps at a fixed orientation and scale. The 'X's are located at a set of fixed distances from the center, with a constant angle between consecutive rays. The values from these regions are concatenated into the context feature vector, which is appended to the intensity, texture and curvature features to give the final feature vector. The choice of sampling distances and angles was determined experimentally on a small subset of images, selecting the combination that best distinguished between normal and diseased samples.
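The ray-based sampling can be sketched as below; the radii, number of rays and window half-width are hypothetical values, since the paper's exact settings are not reproduced here:

```python
import numpy as np

def context_features(feature_map, center, radii=(2, 4, 6), n_rays=8, half=1):
    """Mean of a small window at points sampled on rays around `center`.

    `radii`, `n_rays` and `half` are illustrative, not the paper's settings."""
    cy, cx = center
    H, W = feature_map.shape
    feats = []
    for r in radii:
        for k in range(n_rays):
            a = 2 * np.pi * k / n_rays          # constant angle between rays
            y = int(round(cy + r * np.sin(a)))  # sample point on the ray
            x = int(round(cx + r * np.cos(a)))
            y0, y1 = max(y - half, 0), min(y + half + 1, H)
            x0, x1 = max(x - half, 0), min(x + half + 1, W)
            feats.append(feature_map[y0:y1, x0:x1].mean())
    return np.array(feats)
```

The same routine would be applied to the intensity image and to each texture and curvature map, and the outputs concatenated.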


Figure 2: (a) partitioning of patch for calculating anisotropy features; (b) template for calculating context features.

3 Learning Using Random Forests

Let us consider a multi-supervised learning scenario with a training set S of samples and the corresponding labels provided by K experts. A binary decision tree is a collection of nodes and leaves, with each node containing a weak classifier that separates the data into two subsets of lower entropy. Training a node j on its data S_j consists of finding the parameters of the weak classifier that maximize the information gain IG_j of splitting the labeled samples into left and right subsets S_j^L and S_j^R:

IG_j = H(S_j) - ∑_{i ∈ {L,R}} (|S_j^i| / |S_j|) H(S_j^i),

where H(·) is the empiric entropy and |·| denotes cardinality. The parameters of the optimized weak classifier are stored in the node. Data splitting stops when we reach a predefined maximal depth, or when the training subset does not contain enough samples. In this case, a leaf is created that stores the empiric class posterior distribution estimated from this subset.
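The node-splitting criterion above is the standard information gain, which can be written compactly as:

```python
import numpy as np

def entropy(labels):
    """Empiric entropy H of a label set, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Information gain of splitting `parent` labels into `left` and `right`."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
```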

A collection of decorrelated decision trees increases generalization power over individual trees. Randomness is introduced by training each tree on a random subset of the whole training set (bagging), and by optimizing each node over a random subspace of the feature parameter space. At testing time, the output of the forest is defined as the average of the probabilistic predictions of the trees. Note that the feature vector for every pixel consists of the features defined in Section 2.

3.1 Predicting Missing Labels

Missing labels are commonly encountered when multiple experts annotate data. We use semi-supervised learning (SSL) to predict the missing labels. Unlike previous methods (BudvytisCVPR11 ), a 'single shot' RF method for SSL without the need for iterative retraining was introduced in ForestBook . We use this SSL classifier as it is shown to outperform other approaches ForestBook . For SSL the objective function encourages separation of the labeled training data and simultaneously separates different high density regions. This is achieved via the following mixed information gain for node j:

I_j = I_j^s + α I_j^u,

where the supervised gain I_j^s is defined as in Eqn. 3. The unsupervised term I_j^u depends on both labeled and unlabeled data, and is defined using differential entropies over continuous parameters as

I_j^u = log |Λ(S_j)| - ∑_{i ∈ {L,R}} (|S_j^i| / |S_j|) log |Λ(S_j^i)|,

where Λ(·) is the covariance matrix of the assumed multivariate distribution at each node. For further details we refer the reader to ForestBook . Thus the above cost function combines the information gain from labeled and unlabeled data without the need for an iterative procedure.

Each voxel has known labels and the unknown labels are predicted by SSL. The feature vectors of all samples (labeled and unlabeled) are inputted to the RF-SSL classifier which returns the missing labels. Note that although the same sample (hence feature vector) has multiple labels, RF-SSL treats it as another sample with similar feature values. The missing labels are predicted based on the split configuration (of decision trees in RFs) that leads to maximal global information gain. Hence the prediction of missing labels is not directly influenced by the other labels of the same sample but takes into account global label information ForestBook .
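As an illustrative stand-in for the single-shot RF-SSL of ForestBook (which is not reimplemented here), the following toy nearest-neighbour propagation conveys the core idea: missing labels are filled in one pass from the labeled samples, driven by proximity in feature space rather than by iterative retraining:

```python
import numpy as np

def fill_missing_labels(X, y, missing=-1):
    """Assign each unlabeled sample (y == missing) the label of its nearest
    labeled neighbour in feature space.

    A toy sketch of label propagation, NOT the single-shot RF-SSL of the
    paper; it only illustrates one-pass prediction of missing labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).copy()
    labeled = np.flatnonzero(y != missing)
    for i in np.flatnonzero(y == missing):
        d = np.linalg.norm(X[labeled] - X[i], axis=1)
        y[i] = y[labeled[np.argmin(d)]]
    return y
```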

4 Self Consistency of Experts

Since the annotator is guided by visual features, such as intensity, in distinguishing between different regions, it is expected that for reliable annotations the region with a particular label would have consistent feature distributions. Expert reliability is quantified by examining the information gain at different nodes while training a random forest on samples labeled by a particular expert. This helps us evaluate the consistency of the experts with respect to the visual features. For each expert k we define an estimator Ĝ_j^k of the expectation of the information gain on the labeled training set sent to node j as

Ĝ_j^k = (1 / |T_j|) ∑_{θ ∈ T_j} IG_j^k(θ),

where T_j is a randomly selected subset of the feature parameter space. Ĝ_j^k measures how well the data can be separated according to the labels of expert k. However, it suffers from two weaknesses in the lower nodes of the tree: (i) it is evaluated from fewer samples, and hence becomes less reliable, and (ii) it quantifies only the expert's local consistency, without considering global consistency measures. Therefore, similar to ChatelainMiccai13 we define the performance level P_j^k of each expert as a linear combination of the estimators along the path from the root to node j:

P_j^k = ∑_{a ∈ A_j} w_a Ĝ_a^k,   w_a ∝ |S_a|,

where A_j is the set of nodes on that path and |S_a| the number of training samples reaching node a. By weighting the estimators in proportion to the size of the training subset, we give more importance to the global estimates of the experts' consistencies, but still take into account their feature-specific performances. Once these quantities have been computed, an expert's reliability or self consistency (SC_k) is calculated as the average performance level over all nodes in all trees:

SC_k = (1 / N_T) ∑_{t=1}^{N_T} mean_j P_j^k(t),

where N_T is the total number of trees in the forest. A higher SC_k indicates greater rater consistency. To reduce computation time we select a region of interest (ROI) by taking the union of all expert annotations and determining its bounding box rectangle. The size of the rectangle is expanded by a few pixels along rows, columns and slices to give the final ROI.
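The size-weighted averaging of per-node gain estimates can be sketched as follows; this is a simplified illustration of the SC score (one weighted average over all nodes), not the exact per-path estimator:

```python
import numpy as np

def self_consistency(node_gains, node_sizes):
    """Weighted average of an expert's information-gain estimates over the
    nodes of a forest, weighting each node by the number of training samples
    it saw. A simplified sketch of the SC score, not the exact formulation."""
    g = np.asarray(node_gains, dtype=float)
    n = np.asarray(node_sizes, dtype=float)
    return float((g * n).sum() / n.sum())
```

Nodes near the root, which see most of the data, dominate the score, which is the intended behaviour: global consistency matters more than noisy estimates at small leaves.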

5 Obtaining Final Annotations

The final annotation is obtained by optimising a second order MRF cost function given by

E(L) = ∑_{s ∈ P} D_s(L_s) + λ ∑_{s ∈ P} ∑_{t ∈ N_s} V(L_s, L_t),

where P denotes the set of pixels; N_s is the set of neighbors of pixel (or sample) s; L_s is the label of s; t is a neighbor of s; and L is the set of labels of all s. λ determines the relative contribution of the penalty cost D and the smoothness cost V. We have only two labels (object/background), although our method can also be applied to the multi-label scenario. The final labels are obtained by graph cut optimization using Boykov's expansion method. For details about the implementation we refer the reader to BoykovFastApproximate .

The penalty cost for the MRF is usually calculated with respect to a reference model of each class (e.g., a distribution of intensity values). The implicit assumption is that the annotators' labels are correct. However, we aim to determine the actual labels of each pixel and hence do not have access to true class distributions. To overcome this problem we use the consistency scores of the experts to determine the penalty costs for a voxel. Each voxel has K labels (after predicting the missing labels). Say for voxel s the label given by the kth expert is 1, and the corresponding SC score is SC_k (Eqn. 8). Since SC is higher for better agreement with labels, the corresponding penalty cost for label 1 is

D_k(1) = 1 - SC_k,

and the corresponding penalty cost for label 0 is

D_k(0) = SC_k.

However, if the label given by the kth expert is 0, then the corresponding penalty costs are swapped,

D_k(1) = SC_k,   D_k(0) = 1 - SC_k.

The individual penalty costs depend upon the labels given by the experts, while the final penalty cost for each label l is the average of the costs from all experts,

D(l) = (1 / K) ∑_{k=1}^{K} D_k(l).
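A sketch of the SC-weighted penalty costs for the binary case, assuming SC scores normalised to [0, 1] (an assumption made for this illustration, since the exact normalisation is not reproduced here):

```python
import numpy as np

def penalty_costs(expert_labels, sc_scores):
    """Penalty cost of assigning label 0 or 1 to a voxel, averaged over experts.

    Assumes SC scores in [0, 1]. If expert k votes for label l with
    consistency SC_k, the cost of l is lowered and the cost of the other
    label raised in proportion to SC_k (one plausible reading of the
    paper's penalty definition)."""
    cost = np.zeros(2)
    for l, sc in zip(expert_labels, sc_scores):
        cost[l] += 1.0 - sc   # voted label: low cost when the expert is consistent
        cost[1 - l] += sc     # other label: high cost when the expert is consistent
    return cost / len(expert_labels)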
Smoothness Cost (V): V ensures a spatially smooth solution by penalizing discontinuities. We used a standard and popular formulation of the smoothness cost, as originally proposed in BoykovFastApproximate . It is given by

V(L_s, L_t) = exp( -(I_s - I_t)² / (2σ²) ) · (1 / ||s - t||)   if L_s ≠ L_t, and 0 otherwise,

where I denotes the intensity. The smoothness cost is determined over a neighborhood system.
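The contrast-sensitive smoothness term can be sketched as follows for unit-distance neighbours; σ is an illustrative parameter:

```python
import numpy as np

def smoothness_cost(l_s, l_t, i_s, i_t, sigma=0.1):
    """Contrast-sensitive Potts smoothness cost, the standard form used with
    graph cuts: zero for equal labels, and for differing labels a penalty
    that is large inside homogeneous regions and small across strong
    intensity edges. `sigma` is an illustrative parameter."""
    if l_s == l_t:
        return 0.0
    return float(np.exp(-((i_s - i_t) ** 2) / (2.0 * sigma ** 2)))
```

This makes label discontinuities cheap exactly where the image itself has an edge, so the graph cut tends to place segmentation boundaries on intensity boundaries.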

6 Dataset Description

We use real datasets from two different applications: Crohn’s disease detection, and colour fundus retinal images originally intended for optic cup and disc segmentation, and a synthetic image dataset. Details of the different datasets are given below.

6.1 Crohn’s Disease Dataset

For Crohn's Disease we use datasets from two different sources: one from the Academic Medical Center (AMC), Amsterdam, and the other from University College London Hospital (UCL).

  • AMC: The data was acquired from patients (mean age years, range years, females) with luminal Crohn's disease, in a study approved by AMC's Medical Ethics Committee. All patients had given informed consent to the prior study. Patients fasted four hours before a scan and drank ml of Mannitol (Baxter, Utrecht, the Netherlands) one hour before a scan. Weighted images were acquired using a MR imaging unit (Intera, Philips Healthcare, Best, The Netherlands) with a torso phased-array body coil.

  • UCL: Data from patients (mean age years, range years, females) diagnosed with small bowel Crohn's disease was used. Weighted images were acquired using a MR imaging unit (Avanto; Siemens, Erlangen). Ethical permission was given by the University College London Hospital ethics committee, and informed written consent was obtained from all participants.

Each of the hospital MRI datasets was annotated by four radiologists, two each from AMC and UCL. Consensus segmentations were obtained using the 4 methods described in Section 7.5. The final segmentations of all patients are used to train a fully supervised method for detecting and segmenting CD tissues (details are given in Section 6.3) using cross validation.

6.2 Colour fundus retinal images

We use the DRISHTI-GS dataset Drishti , consisting of retinal fundus images from patients. The optic cup and optic disc are manually segmented by ophthalmologists, and the consensus ground truth is also available. We choose this dataset because the final ground truth and the annotations of the individual experts are publicly available, which facilitates accurate validation.

6.3 Evaluation Metrics

Availability of ground truth annotations makes it easier to evaluate the performance of any segmentation algorithm. However, the purpose of our experiments is to estimate the actual ground truth annotations, and hence there is no direct method to estimate the accuracy of the consensus annotations. We adopt the following validation strategy using a fully supervised learning (FSL) framework:

  1. Obtain the consensus segmentation from different methods.

  2. Train a separate RF classifier on the consensus segmentations of different methods in a fold cross validation setting. The same set of features as described in Section 2 are used to describe each voxel. If the training labels were obtained using STAPLE then the FSL segmentation of the test image is compared with the ground truth segmentation from STAPLE only.

  3. Use the trained RF classifiers to generate probability maps for each voxel of the test image.

  4. Use the probability maps to obtain the final segmentation using a second order MRF cost function whose data term is

    D(L_s) = -log( Pr(L_s) + ε ),

    where Pr(L_s) is the likelihood (from the probability maps) previously obtained using the RF classifiers, and ε is a very small value to ensure that the cost is a real number. The smoothness cost is the same as in Eqn. 14.

  5. Obtain the final segmentation using graph cuts. Note that this segmentation is part of the validation scheme and not for obtaining consensus annotations.

This validation is similar to our previous method in MahapatraTMI_CD2013 , but without using the supervoxels for region of interest detection. The algorithm segmentations are compared with the ‘ground-truth’ segmentations (the consensus segmentation obtained by the particular method) using Dice Metric (DM) and Hausdorff distance (HD). Consensus segmentations with greater accuracy give better discriminative features and more accurate probability maps, and the classifiers obtained from these annotations can identify diseased regions more accurately. Thus we expect the resulting segmentations to be more accurate. The fusion method which most effectively combines the different annotations is expected to give higher accuracy for the segmentations on the test data.

Dice Metric (DM): DM measures the overlap between the segmented diseased region obtained by our algorithm and the reference manual annotations. It is given by

DM = 2 |A ∩ B| / (|A| + |B|),

where A is the segmentation from our algorithm and B the manual annotation. The measure yields values between 0 and 1, where a high DM corresponds to a good segmentation.
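For binary masks the Dice Metric can be computed directly:

```python
import numpy as np

def dice(seg, ref):
    """Dice Metric DM = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    seg, ref = np.asarray(seg, bool), np.asarray(ref, bool)
    denom = seg.sum() + ref.sum()
    # Two empty masks agree perfectly by convention.
    return 2.0 * np.logical_and(seg, ref).sum() / denom if denom else 1.0
```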

Hausdorff Distance (HD): HD measures the distance between the contours corresponding to different segmentations. If two curves are represented as sets of points A = {a_1, …, a_m} and B = {b_1, …, b_n}, where each a_i and b_j is an ordered pair of the x and y coordinates of a point on the curve, the distance to the closest point (DCP) from a_i to the curve B is calculated. The HD, defined as the maximum of the DCPs between the two curves, is

HD(A, B) = max( max_i DCP(a_i, B), max_j DCP(b_j, A) ).
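A direct (brute-force) computation of the symmetric HD between two point sets:

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets of (x, y) pairs,
    via the full pairwise distance matrix (fine for contour-sized sets)."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # d.min(axis=1): DCP of each point of A to B; d.min(axis=0): each of B to A.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```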
The results of two different methods were compared using a paired t-test, with a significance level that determines whether the two sets of results are statistically different or not. MATLAB's ttest2 function was used because it integrates well into our workflow and returns the result as a p-value. Before performing the test we ensured that all essential assumptions were met, namely: 1) all measurements are on a continuous scale; 2) the values are from related groups; 3) no significant outliers are present; 4) the assumption of normality is not violated.

Our whole pipeline was implemented in MATLAB on a quad core CPU running Windows. The random forest code was a MATLAB interface to the code in RFcode , written in the R programming language. The number of trees and the maximal tree depth of the RF classifier are discussed in Section 7.3.

7 Experiments and Results

7.1 Inter-expert Agreement

Each of the hospital MRI datasets was annotated by four radiologists, two each from AMC and UCL. Thus each slice has four different annotations, and a mean annotation is calculated from them. We computed the average DM between the individual annotations and the mean annotation, and the corresponding average p-values from the paired t-test between the mean annotation and the individual annotations of each slice; the corresponding numbers for inter-expert agreement on retinal images were computed in the same way. These values indicate good agreement between the different experts. Since each expert annotated a slice only once, we do not have the appropriate data to calculate intra-expert agreement.

7.2 MRF regularization strength (Eqn. 9)

To choose the MRF regularization strength we use a separate group of patient volumes (from both hospitals) and perform segmentation using our proposed method, with the regularization strength taking a range of values. The results are summarized in Table 1. The value giving the maximum average segmentation accuracy, measured using the Dice Metric (DM), was fixed for all subsequent experiments. Note that these datasets were a mix of patients from the two hospitals and not part of the test dataset used for evaluating our algorithm.

DM 71.4 72.8 75.4 80.2 82.8 88.7 87.2 87.4 86.1
Table 1: Change in segmentation accuracy (DM) with different values of the regularization strength (Eqn. 9).

7.3 Influence of Number of Trees

The effect of the number of trees on the segmentation is evaluated by varying their number and observing the final segmentation accuracy (DM values) on the datasets mentioned above. The results are summarized in Table 2. Beyond a certain number of trees there is no significant increase in DM, but the training time increases significantly. The number of trees giving the best trade-off between training time and DM was chosen for the RF ensemble. The tree depth was fixed after cross validation comparing tree depth and resulting classification accuracy.

DM 82.5 84.7 86.6 88.3 91.7 91.8 91.7 91.7
0.20T 0.21T 0.42T 0.8T T 1.4T 2.2T 3.4T
Table 2: Effect of the number of trees in the RF classifier on segmentation accuracy (DM) and relative training time T.

7.4 Synthetic Image Dataset

To illustrate the relevance of the SC score, we report segmentation results on synthetic images, as they provide a certain degree of control over image characteristics. Figure 3 (a) shows an example synthetic image where the 'diseased' region is within the red square. Pixel intensities are normalized. Intensities within the square follow a normal distribution with a fixed mean and varying standard deviation. Background pixels have a lower intensity distribution (a lower mean and varying standard deviation). Such images with different shapes for the diseased region (e.g., squares, circles, rectangles, polygons of different dimensions) are created with known ground truths of the desired segmentation. Adjacent boundary points are chosen and randomly displaced by a few pixels. This random displacement is repeated for more point sets, depending on the size of the image. These multiple displacements of boundary points constitute the simulated annotation of one annotator. Two other sets of annotations are generated to create simulated annotations for three 'experts'. The annotations of the different experts are shown as colored contours in Fig. 3 (b).

To test our SSL based prediction strategy, we intentionally removed some experts' annotations for each image/volume slice. The experts whose annotations were removed were chosen at random. We refer to our method as Graph Cut with Multiple Experts and compare its performance with the final segmentations obtained using COLLATE COLLATE , Majority Voting (MV) LMS7 , and Local MAP-STAPLE (LMStaple) LocalMAPSTAPLE . We also show results for the variant in which none of the expert annotations were removed while predicting the final segmentation. Note that, except for this variant, none of the methods has access to all annotations.

Additionally, we show results for a variant without SSL for predicting missing labels. In this case the penalty costs are determined from the SC scores of the available annotations; the missing annotations of experts are not predicted and hence not used for determining the consensus segmentation. Consensus segmentation results are also shown for a variant without our SC score, where the penalty cost is the distance between the reference distribution in the ground truth annotation of Fig. 3 (a) and the distribution from the 'expert's' annotation. Note that this condition can be tested only for synthetic images, where we know the pixels' true labels. For COLLATE we utilized the implementation available in the MASI fusion package MASI . The Local MAP STAPLE implementation is available from the Computational Radiology Laboratory website CRL . For both methods we closely followed the parameter settings recommended by the authors.

Table 3 summarizes the performance of the different methods. Our method gives the highest DM and lowest HD values, followed by LocalMAPSTAPLE , COLLATE , LMS7 , and the ablated variants. Our proposed self consistency score accurately quantifies the consistency level of each expert, as is evident from the significant difference in performance between the full method and the variant without the SC score. Figures 3 (c)-(i) show the final segmentations obtained using the different methods.

Figure 3: (a) synthetic image with ground truth segmentation in red; (b) synthetic image with simulated expert annotations; final segmentation obtained by (c) (DM); (d) Majority voting (DM); (e) COLLATE (DM); (f) LMStaple (DM); (g) (DM); (h) (DM); (i) (DM).
LmStaple Collate MV
DM 92.3 91.2 88.8 87.1 85.3 84.0 83.7
HD 6.1 7.4 9.0 10.1 11.9 13.5 13.9
0.032 -
Table 3: Quantitative measures of segmentation accuracy on synthetic images. DM: Dice Metric; HD: Hausdorff distance in mm; p is the result of Student's t-tests with respect to our method.

7.5 Real Patient Crohn’s Disease Dataset

For the CD patient datasets we show consensus segmentation results for , , COLLATE, Majority Voting (MV), and LMStaple. Although all the experts annotated every image, to test our SSL based prediction strategy we intentionally removed or annotations for each image/volume slice.
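Of these baselines, majority voting is simple enough to state exactly. A per-pixel sketch follows; resolving ties towards background is our assumption, as the tie-breaking rule is not specified in the text:

```python
import numpy as np

def majority_vote(annotations):
    """Fuse binary expert masks by per-pixel majority voting (the MV
    baseline).  annotations: list of equally shaped {0,1} arrays,
    one per expert.  Ties are resolved towards background."""
    stack = np.stack(annotations, axis=0)
    votes = stack.sum(axis=0)
    return (votes > stack.shape[0] / 2).astype(np.uint8)
```

Because each pixel is fused independently of its neighbours, MV is very fast but ignores both rater consistency and spatial context, which is consistent with its ranking in the experiments.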

Figure 4 shows the predicted ground truth for the fusion strategies using only two expert labels. We show results for two experts because it is easier to display the different annotations in one image; with three or more expert annotations plus the consensus segmentation, the images become very crowded and difficult to interpret. Since our purpose is to show the relative merit of the different methods, two expert annotations serve the purpose equally well.

Figures 5 and 6 show segmentation results for two patients (Patient 23 and Patient 15) using all the fusion strategies mentioned above, and Table 4 summarizes their average performance over all patients. From the visual results and quantitative measures it is clear that gives the highest DM and lowest HD values, followed by , LocalMAPSTAPLE , COLLATE , LMS7 , and . Since had access to all annotations, it is expected to perform best. However, 's performance is very close, and a Student t-test gives , indicating a very small difference between the two results. Thus we can conclude that does a very good job of predicting missing annotations. Importantly, performs much better than all other methods (). The results show that SSL effectively predicts missing annotation information, since shows a significant drop in performance relative to ().
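The significance comparisons quoted here are Student t-tests over per-patient scores; with scipy this looks like the following (the Dice values are made-up illustrative numbers, not the paper's data):

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative per-patient Dice scores for two method variants
# (hypothetical numbers, chosen only to show the mechanics).
dm_full    = np.array([92.1, 93.0, 91.8, 92.9, 92.4, 93.3])
dm_missing = np.array([91.5, 92.2, 91.0, 92.5, 91.9, 92.7])

# Paired t-test: the same patients are scored by both variants,
# so the paired (not independent) test is the appropriate one.
t, p = ttest_rel(dm_full, dm_missing)
```

A large p-value would indicate no significant difference between the two variants; a small one would indicate a genuine performance gap.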

If the consensus segmentation is inaccurate then the subsequent training is also flawed, because the classifier learns features from many voxels whose labels are inaccurate. As a result, in many cases the final segmentation includes regions which do not exhibit any disease characteristics, as confirmed by our medical experts. Another symptom of sub-optimal label fusion is wide variation in the segmentation performance of that particular method. The standard deviation of LMS7 is much higher than that of , indicating inconsistent segmentation quality. A good fusion algorithm should assign lower reliability scores to inconsistent segmentations, which is achieved by , as is evident from the low variation in its DM scores.

An important factor limiting the performance of LMStaple is its prediction of sensitivity and specificity parameters from the annotations without considering their overall consistency. Our SC score takes into account both global and local information and is able to accurately quantify a rater's consistency; its effect is also highlighted through the experiments on synthetic images (Section 7.4). Secondly, LMStaple may be prone to getting trapped in local minima due to its iterative EM approach. In contrast, we employ graph cuts, which for a binary labeling is guaranteed to find the global minimum. This makes the final output (the consensus segmentation) more accurate and robust. COLLATE also suffers from its reliance on an EM based approach.
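For a binary labeling, the graph-cut step reduces to a single s-t min-cut, which is exact and non-iterative. Below is a small sketch using scipy's max-flow with integer unary penalties and a Potts smoothness term; it illustrates the optimization machinery, not the paper's full cost function, and all names are our own:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_flow, breadth_first_order

def binary_graphcut(unary_bg, unary_fg, lam):
    """Exactly minimize sum of unary costs plus lam * (number of
    4-neighbor label disagreements) over binary labels via one s-t
    min-cut.  unary_bg/unary_fg are integer penalty images."""
    h, w = unary_bg.shape
    n = h * w
    s, t = n, n + 1
    rows, cols, caps = [], [], []

    def add(u, v, c):
        if c > 0:
            rows.append(u); cols.append(v); caps.append(int(c))

    for i in range(h):
        for j in range(w):
            p = i * w + j
            add(s, p, unary_bg[i, j])  # cut (paid) if p ends up background
            add(p, t, unary_fg[i, j])  # cut (paid) if p ends up foreground
            if i + 1 < h:              # Potts links to 4-neighbors
                add(p, p + w, lam); add(p + w, p, lam)
            if j + 1 < w:
                add(p, p + 1, lam); add(p + 1, p, lam)

    g = csr_matrix((caps, (rows, cols)), shape=(n + 2, n + 2), dtype=np.int32)
    res = maximum_flow(g, s, t)
    # Nodes still reachable from the source in the residual graph form
    # the source side of the min cut, i.e. the foreground.
    residual = ((g - res.flow) > 0).astype(np.int8)
    reach = breadth_first_order(csr_matrix(residual), s,
                                return_predecessors=False)
    labels = np.zeros(n, dtype=np.uint8)
    labels[[v for v in reach if v < n]] = 1
    return labels.reshape(h, w)
```

Because the cut is computed in one shot, there is no risk of the iteration-dependent local minima that EM based fusion methods can fall into.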

DM 92.6±2.4 91.7±3.0 87.3±4.5 85.1±5.3 83.8±7.3 82.3±9.0
HD 7.4±2.6 8.2±3.3 9.8±4.8 12.0±6.2 13.9±7.4 14.7±8.2
Table 4: Quantitative measures of segmentation accuracy on CD images (mean±standard deviation). DM is the Dice metric in %; HD is the Hausdorff distance in mm; p is the result of Student t-tests with respect to .
Figure 4: The predicted ground truth for UCL Patient 23 by different methods: (a) ; (b) ; (c) LocalMAPSTAPLE ; (d) COLLATE ; (e) LMS7 ; and (f) . Red and blue contours are expert annotations and yellow is the final annotation obtained by the respective methods.
Figure 5: Segmentation results on UCL Patient 23 for: (a) ; (b) ; (c) LocalMAPSTAPLE ; (d) COLLATE ; (e) LMS7 ; and (f) . The red contour is the ground truth generated by the corresponding fusion method, and the yellow contour is the algorithm segmentation obtained as described in Section 6.3.
Figure 6: Segmentation results on AMC Patient 15 for: (a) ; (b) ; (c) LocalMAPSTAPLE ; (d) COLLATE ; (e) LMS7 ; and (f) . The red contour is the ground truth generated by the corresponding fusion method, and the yellow contour is the algorithm segmentation obtained as described in Section 6.3.

7.6 Real Patient Retina Dataset

Quantitative evaluation is based on the F-score and the absolute pointwise localization error in pixels (measured in the radial direction). Additionally, we report the overlap measure , where is the manual segmentation and is the algorithm segmentation. Comparative results are shown for , , , COLLATE, MV and LMStaple.
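Concretely, with binary masks for the manual segmentation M and the algorithm segmentation A, the region measures can be computed as below; the overlap formula |M∩A|/|M∪A| is our assumption for the elided definition:

```python
import numpy as np

def fscore_and_overlap(manual, algo):
    """F-score and area overlap (|M∩A|/|M∪A|) between a manual mask M
    and an algorithm mask A, both binary; returned in %."""
    m, a = manual.astype(bool), algo.astype(bool)
    tp = np.logical_and(m, a).sum()
    prec = tp / max(a.sum(), 1)   # fraction of algorithm pixels that are correct
    rec = tp / max(m.sum(), 1)    # fraction of manual pixels that are recovered
    f = 0.0 if tp == 0 else 2 * prec * rec / (prec + rec)
    s = tp / np.logical_or(m, a).sum()
    return 100 * f, 100 * s
```

The boundary error is computed separately, as the pointwise radial distance between the two contours.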

Table 5 summarizes the segmentation performance of the different methods. Figure 7 (b),(c) shows the individual expert annotations and the consensus ground truth annotation, while Figs. 7 (d)-(f) show the predicted ground truth for the different fusion strategies. As is evident from the images, shows the best agreement with the ground truth segmentations.

These results confirm our earlier observations from synthetic and CD patient datasets about: 1) the superior performance of ; 2) effectiveness of SSL in predicting missing annotation information; 3) inferior performance of LMStaple due to predicting sensitivity and specificity parameters from annotations without considering their overall consistency, and using EM; and 4) contribution of our SC score and graph cuts in obtaining better consensus annotations.

COLLATE LMStaple Majority
F 95.4 97.2 90.2 89.0 92.1 86.4
S 89.2 91.2 84.8 83.2 85.9 80.8
B 9.9 8.2 13.2 10.9 10.3 18.1
Time 7 7 6 9 7 3
Table 5: Segmentation accuracy on retinal fundus images in terms of the F-score (F), overlap measure (S) and boundary error (B) for the different methods. B is in pixels; Time is the fusion time in minutes.
(a) (b) (c) (d) (e)
Figure 7: Example annotations of (a) optic disc and (b) optic cup. The ground truth consensus segmentation is shown in yellow while the different expert annotations are shown in red, green and blue. Consensus segmentations for optic cup obtained using (c) ; (d) LocalMAPSTAPLE ; and (e) COLLATE .
(a) (b) (c) (d) (e)
Figure 8: Segmentation results for different methods: (a) our proposed method (b) LocalMAPSTAPLE ; (c) COLLATE ; (d) Majority Voting; and (e) . Green contour is manual segmentation and blue contours are algorithm segmentations from different fusion methods.

7.7 Computation Time

Since the size of the annotations varies depending on the diseased area (the ROI varies between and ), an average fusion time per annotation may be misleading. Therefore we calculate an average fusion time per pixel, which is highest for LMStaple at seconds, followed by COLLATE ( seconds), ( seconds) and majority voting (MV) ( seconds). Other variants of take almost the same time as . Note that we report only the time for fusing the annotations and not the total segmentation time, since the segmentation time is the same for all cases (an RF based framework is used throughout). The segmentation time is an additional seconds per pixel.

These results clearly show the faster performance of our method, due to employing SSL and GC for predicting missing annotations and obtaining the final annotation. The EM based LMStaple algorithm is nearly times slower than , while COLLATE is times slower because of its many additional computations. Majority voting is faster than all other methods because of its simple approach to predicting the final annotation; however, its performance is the worst.

8 Discussion And Conclusion

We have proposed a novel strategy for combining multiple annotations and applied it to segmenting Crohn's disease tissues from abdominal MRI, and the optic cup and disc from retinal fundus images. Evaluation is performed using a machine learning approach for segmentation. The highest segmentation accuracy is observed for the annotations obtained by our fusion strategy, which is indicative of better quality annotations. The comparative results of our method and other fusion strategies highlight the following major points.

  1. With the least variance in DM values, is the most consistent fusion method, and with the highest DM values it is also the most accurate.

  2. SSL effectively predicts missing annotation information, since performs very close to and significantly better than . Local MAP STAPLE infers missing annotations by minimizing the log-likelihood of the overall cost function; its use of EM contributes to its erroneous results. SSL's advantage is that the predicted annotations are consistent with previously annotated samples, by considering both global information and local feature consistencies.

  3. Our proposed self-consistency score accurately quantifies the consistency level of each expert, as is evident from the performance of and () on synthetic images. SC analyzes the feature distributions of neighboring pixels that share the same label, and gives higher values to consistent annotations, which have similar feature distributions.

  4. Graph cut optimization quickly produces a global optimum without the risk of getting trapped in local minima, which can be a serious limitation of EM based methods. Using GC and SSL together contributes to the low computation time, since no iterative procedure is involved.
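To make point 3 concrete, here is a minimal, hedged sketch of a self-consistency style check: a consistent annotation should yield similar intensity histograms in different parts of the same-label region. The paper's SC score uses richer features and neighborhood structure; the Bhattacharyya coefficient here is our choice of similarity measure:

```python
import numpy as np

def self_consistency(image, annotation, bins=16):
    """Toy self-consistency score: split the annotated foreground into
    two halves, histogram the intensities of each half (image values
    assumed in [0, 1]), and return the Bhattacharyya coefficient of the
    two normalized histograms.  1.0 means identical distributions."""
    vals = image[annotation.astype(bool)]
    half = len(vals) // 2
    h1, edges = np.histogram(vals[:half], bins=bins, range=(0.0, 1.0))
    h2, _ = np.histogram(vals[half:], bins=edges)
    p = h1 / max(h1.sum(), 1)
    q = h2 / max(h2.sum(), 1)
    return float(np.sqrt(p * q).sum())
```

An annotation whose "diseased" label covers regions with very different appearance gets a low score, while a homogeneous, consistently applied label scores near 1.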

Our proposed method for obtaining consensus annotations can be used in scenarios where a ground truth needs to be established. In most medical image analysis applications it is good practice to have or more experts annotate the images, which also reduces the scope for biased or inaccurate annotations. In such cases our method can be used to generate the ground truth from the multiple expert annotations. In reality, however, it can be difficult to obtain multiple expert annotations due to cost and resource constraints. In such scenarios multiple segmentations can be generated by different automatic segmentation algorithms, and the consensus ground truth segmentation can then be generated using our method.

Algorithm Limitations: To generate a good ground truth we need multiple experts' annotations, and as mentioned before it is not easy to obtain annotations from many experts. Although in principle we can use different segmentation algorithms to generate candidate segmentations and then calculate the ground truth, these algorithms may not always be accurate and the final result could be erroneous. Thus our algorithm's performance is limited by the availability of qualified experts who can provide accurate annotations.

SSL for predicting missing annotations is an important part of our fusion approach, and erroneous predictions affect the final results. In SSL an unlabeled sample is assigned a class based on its position in the feature space and the subsequent splits that maximize information gain. Erroneous labels in one or more annotations therefore affect the predicted label. However, our proposed method limits the damage due to inaccurate label predictions with the help of the SC score, which is based on the image features of each annotation. Inaccurately labeled annotations are assigned low scores, since the image features for a label are not consistent throughout the annotation. Consequently, inaccurate or ambiguous annotations contribute less to the final consensus segmentation. Although we cannot completely eliminate mistakes, the SC score allows us to minimize them by assigning lower importance to erroneous annotations.
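The damage-limiting behavior described here can be illustrated with a toy weighted vote, where each annotation's weight is its SC score. This is a simplification of the paper's MRF penalty-cost mechanism, and all names are our own:

```python
import numpy as np

def weighted_consensus(annotations, sc_scores, threshold=0.5):
    """Fuse binary expert masks with each mask voting in proportion to
    its SC score, so inconsistent annotations contribute less."""
    masks = np.stack([a.astype(float) for a in annotations])
    w = np.asarray(sc_scores, dtype=float)
    w = w / w.sum()                          # normalize weights
    fused = np.tensordot(w, masks, axes=1)   # per-pixel weighted vote in [0, 1]
    return (fused >= threshold).astype(np.uint8)
```

With two consistent raters and one inconsistent one, the low-scoring rater's disagreeing pixels are outvoted, which is the qualitative behavior the SC score is designed to produce.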



  • (1) L. Hoyte, W. Ye, L. Brubaker, J. R. Fielding, M. E. Lockhart, M. E. Heilbrun, M. B. Brown, S. K. Warfield, Segmentations of mri images of the female pelvic floor: A study of inter and intra-reader reliability., J. Mag. Res. Imag. 33 (3) (2011) 684–691.
  • (2) G. Gerig, M. Jomier, M. Chakos, VALMET: A new validation tool for assessing and improving 3d object segmentation, in: In Proc: MICCAI, 2001, pp. 516–523.
  • (3) S. Warfield, K. Zhou, W. Wells, Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation., IEEE Trans. Med. Imaging 23 (7) (2004) 903–921.
  • (4) C. Restif, Revisiting the evaluation of segmentation results: Introducing confidence maps, in: In Proc: MICCAI, 2007, pp. 588–595.
  • (5) P. Aljabar, R. Heckemann, A. Hammers, J. Hajnal, D. R. ., Multi-atlas based segmentation of brain images:Atlas selection and its effect on accuracy., Neuroimage 46 (3) (2009) 726–738.
  • (6) X. Artaechevarria, A. Munoz-Barrutia., Combination strategies in multi-atlas image segmentation: Application to brain MR data., IEEE Trans. Med. Imag. 28 (8) (2009) 1266–1277.
  • (7) S. Klein, U. van der Heide, I. Lips, M. van Vulpen, M. Staring, J. Pluim., Automatic segmentation of the prostate in 3D MR images by atlas matching using localised mutual information, Medical Physics 35 (4) (2008) 1407–1417.
  • (8) O. Commowick, A. Akhondi-Asl, S. Warfield, Estimating a reference standard segmentation with spatially varying performance parameters: Local MAP STAPLE., IEEE Trans. Med. Imag. 31 (8) (2012) 1593–1606.
  • (9) H. Wang, J. Suh, S. Das, J. Pluta, C. Craige, P. Yushkevich, Multi-atlas segmentation with joint label fusion, IEEE Trans. Patt. Anal. Mach. Intell. 35 (3) (2013) 611–623.
  • (10) P. Chatelain, O. Pauly, L. Peter, A. Ahmadi, A. Plate, K. Botzel, N. Navab, Learning from multiple experts with random forests: Application to the segmentation of the midbrain in 3D ultrasound., in: In Proc: MICCAI Part II, 2013, pp. 230–237.
  • (11) A. Asman, B. Landman, Robust statistical label fusion through consensus level, labeler accuracy, and truth estimation (COLLATE), IEEE Trans. Med. Imag. 30 (10) (2011) 1779–1794.
  • (12) T. Langerak, U. van der Heide, A. Kotte, M. Viergever, M. van Vulpen, J. Pluim, Label fusion in atlas-based segmentation using a selective and iterative method for performance level estimation (SIMPLE), IEEE Trans. Med. Imag. 29 (12) (2010) 2000–2008.
  • (13) M. Sabuncu, B. Yeo, K. V. Leemput, B. Fischl, P. Golland, A generative model for image segmentation based on label fusion, IEEE Trans. Med. Imag. 29 (10) (2010) 1714–1729.
  • (14) J. Lotjonen, R. Wolz, J. Koikkalainen, L. Thurfjell, G. Waldemar, H. Soininen, D. Rueckert, Fast and robust multi-atlas segmentation of brain magnetic resonance images, Neuroimage 49 (3) (2010) 2352–2365.
  • (15) B. Landman, A. Asman, A. Scoggins, J. Bogovic, F. Xing, J. Prince, Robust statistical fusion of image labels, IEEE Trans. Med. Imag. 31 (2) (2011) 512–522.
  • (16) D. Mahapatra, P. Schüffler, J. Tielbeek, J. Makanyanga, J. Stoker, S. Taylor, F. Vos, J. Buhmann, Automatic detection and segmentation of Crohn's disease tissues from abdominal mri., IEEE Trans. Med. Imag. 32 (12) (2013) 1232–1248.
  • (17) D. Mahapatra, P. J. Schüffler, J. Tielbeek, J. M. Buhmann, F. M. Vos., A supervised learning based approach to detect Crohn's disease in abdominal mr volumes, in: Proc. MICCAI-ABD, 2012, pp. 97–106.
  • (18) M. Petrou, V. Kovalev, J. Reichenbach, Three-dimensional nonlinear invisible boundary detection., IEEE Trans. Imag. Proc 15 (10) (2006) 3020–3032.
  • (19)
  • (20) Z. Tu, X. Bai, Auto-context and its application to high-level vision tasks and 3d brain image segmentation, IEEE Trans. Patt. Anal. Mach. Intell. 32 (10) (2010) 1744 – 1757.
  • (21) Y. Zheng, A. Barbu, B. Beorgescu, M. Scheuering, D. Comaniciu., Four chamber heart modeling and automatic segmentation for 3D cardiac CT volumes using marginal space learning and steerable features., IEEE Trans. Med. Imag. 27 (11) (2008) 1668–1681.
  • (22) I. Budvytis, V. Badrinarayanan, R. Cipolla, Semi-supervised video segmentation using tree structured graphical models., in: IEEE CVPR, 2011, pp. 2257–2264.
  • (23) A. Criminisi, J. Shotton., Decision Forests for Computer Vision and Medical Image Analysis., Springer, 2013.

  • (24) Y. Boykov, O. Veksler, Fast approximate energy minimization via graph cuts, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 1222–1239.
  • (25) J. Sivaswamy, et al., Drishti-GS: Retinal image dataset for optic nerve head (ONH) segmentation, in: IEEE EMBC, 2014, pp. 53–56.
  • (26) A. Liaw, M. Wiener, Classification and regression by randomforest, R News 2 (3) (2002) 18–22.
  • (27) fusion/.
  • (28)
  • (29) J. Rimola, S. Rodriguez, O. Garcia-Bosch, et al., Magnetic resonance for assessment of disease activity and severity in ileocolonic Crohn's disease., Gut 58 (2009) 1113–1120.
  • (30) D. Mahapatra, J. Buhmann, Analyzing training information from random forests for improved image segmentation., IEEE Trans. Imag. Proc. 23 (4) (2014) 1504–1512.
  • (31) D. Mahapatra, J. Buhmann, Prostate mri segmentation using learned semantic knowledge and graph cuts., IEEE Trans. Biomed. Engg. 61 (3) (2014) 756–764.
  • (32) J. Rimola, I. Ordas, S. Rodriguez, O. Garcia-Bosch, M. Aceituno, J. Llach, C. Ayuso, E. Ricart, J. Panes, Magnetic resonance imaging for evaluation of crohn’s disease: Validation of parameters of severity and quantitative index of activity., Inflamm Bowel Dis 17 (8) (2011) 1759–1768.
  • (33) L. Irwig, P. Macaskill, P. Glasziou, M. Fahey, Meta-analytic methods for diagnostic test accuracy, J. Clin. Epidemiol 48 (1) (1995) 119–130.
  • (34) O. Chapelle, B. Scholkopf, A. Zien, Semi-Supervised Learning, MIT Press,Cambridge, MA, 2006.
  • (35) T. Riklin-Raviv, K. V. Leemput, B. Menze, W. W. III, P. Golland, Segmentation of image ensembles via latent atlases, Med. Imag. Anal. 14 (5) (2010) 654–665.
  • (36) C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1–27:27, software available at
  • (37) T. J. Fuchs, J. M. Buhmann, Computational pathology: Challenges and promises for tissue analysis, Comp Med Imag Graphics 35 (7-8) (2011) 515–530. doi:10.1016/j.compmedimag.2011.02.006.
  • (38) D. Baumgart, W. Sandborn, Inflammatory bowel disease: clinical aspects and established and evolving therapies., Lancet. 369 (9573) (2007) 1641–1657.
  • (39) K. Schunk, Small bowel magnetic resonance imaging for inflammatory bowel disease (2002).
  • (40) P. Melville, R. J. Mooney, R. Nagarajan, Content-boosted collaborative filtering for improved recommendations, in: In Proc: AAAI, 2002, pp. 187–192.
  • (41) P. Schüffler, D. Mahapatra, J. Tielbeek, F. Vos, J. Makanyanga, D. Pendsé, C. Nio, J. Stoker, S. Taylor, J. Buhmann, A model development pipeline for Crohn's disease severity assessment from magnetic resonance images, in: In Proc: MICCAI-ABD, 2013, pp. 1–10.
  • (42) V. Raykar, S. Yu, L. Zhao, G. Valadez, C. Florin, L. Bogoni, L. Moy, Learning from crowds., Journal of Machine Learning Research 11 (2010) 1297–1322.
  • (43) O. Commowick, S. Warfield, Incorporating priors on expert performance parameters for segmentation validation and label fusion: A maximum a posteriori STAPLE., in: In Proc: MICCAI Part III, 2010, pp. 25–32.
  • (44) H. Wang, P. Yushkevich, Guiding automatic segmentation with multiple manual segmentations., in: In Proc: MICCAI Part II, 2012, pp. 429–436.
  • (45) K. Horsthuis, S. Bipat, P. Stokkers, J. Stoker, Magnetic resonance imaging for evaluation of disease activity in crohn’s disease: a systematic review., Eur Radiol 19 (6) (2009) 1450–1460.
  • (46) J. Mary, R. Modigliani, Development and validation of an endoscopic index of the severity for Crohn's disease: a prospective multicentre study., Gut. 30 (7) (1989) 983–989.
  • (47) K. Bodily, J. Fletcher, C. Solem, et al., Crohn disease: mural attenuation and thickness at contrast-enhanced CT enterography correlation with endoscopic and histologic findings of inflammation., Radiology 238 (2) (2006) 505–516.
  • (48) C. Bru, M. Sans, M. Defelitto, R. Gilabert, D. Fuster, J. Llach, F. Lomeña, J. Bordas, J. Piqué, J. Panés, Hydrocolonic sonography for evaluating inflammatory bowel disease., AJR Am J Roentgenol 177 (1) (2001) 99–105.
  • (49) K. Horsthuis, S. Bipat, R. Bennink, J. Stoker, Inflammatory bowel disease diagnosed with US, MR, scintigraphy, and CT: meta-analysis of prospective studies, Radiology 247 (1) (2008) 64–79.
  • (50) A. Schreyer, H. Rath, R. Kikinis, M. Völk, J. Schölmerich, S. Feuerbach, G. Rogler, J. Seitz, H. Herfarth., Comparison of magnetic resonance imaging colonography with conventional colonoscopy for the assessment of intestinal inflammation in patients with inflammatory bowel disease., Gut 54 (2) (2005) 250–256.
  • (51) H. Siddiki, J. Fidler, J. Fletcher, S. Burton, J. Huprich, D. Hough, C. Johnson, D. Bruining, E. L. Jr, W. Sandborn, D. Pardi, J. M. JN., Prospective comparison of state-of-the-art MR enterography and CT enterography in small-bowel Crohn's disease., AJR Am J Roentgenol 193 (1) (2005) 113–121.
  • (52) L. Breiman, Random forests., Machine Learning 45 (1) (2001) 5–32.
  • (53) V. Caselles, F. Catte, T. Coll, F. Dibos, A geometric model for active contours in image processing., Numerische mathematik 66 (1993) 1–31.
  • (54) B. S. Manjunath, W. Y. Ma, Texture features for browsing and retrieval of image data., IEEE Trans. Pattern Anal. Mach. Intell 18 (8) (1996) 837–842.
  • (55) C. Liu, H. Wechsler, Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition., IEEE Trans. Image Process. 11 (4) (2002) 467–476.

  • (56) R. L. D. Valois, D. G. Albrecht, L. G. Thorell, Spatial-frequency selectivity of cells in macaque visual cortex., Vis. Res. 22 (5) (1982) 545–559.
  • (57) N. Kingsbury, Complex wavelets for shift invariant analysis and filtering of signals., Applied and Computational harmonic analysis 10 (3) (2001) 234–253.
  • (58) R. Verma, E. Zacharaki, Y. Ou, H. Cai, S. Chawla, S. Lee, E. Melhem, R. Wolf, C. Davatzikos, Multiparametric tissue characterization of brain neoplasms and their recurrence using pattern classification of mr images, Acad. Radiol. 15 (8) (2008) 966–977.
  • (59) R. Caruana, A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms., in: ICML, 2006, pp. 161–168.
  • (60) Z. Yi, A. Criminisi, J. Shotton, , A. Blake, Discriminative, semantic segmentation of brain tissue in mr images., in: MICCAI, 2009, pp. 558–565.
  • (61) B. Settles, M. Craven, An analysis of active learning strategies for sequence labeling tasks, in: Empirical methods in natural language processing, 2008, pp. 1070–1079.

  • (62) D. Lewis, J. Catlett, Heterogenous uncertainty sampling for supervised learning., in: ICML, 1994, pp. 148–156.
  • (63) R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Trans. Patt. Anal. Mach. Intell. 34 (11) (2012) 2274–2282.
  • (64) B. Andres, U. Köthe, M. Helmstaedter, W. Denk, F. Hamprecht, Segmentation of sbfsem volume data of neural tissue by hierarchical classification., in: DAGM Symp Pattern Recognition, 2008, pp. 142–152.

  • (65) M. Berks, Z. Chen, S. Astley, C. Taylor, Detecting and classifying linear structures in mammograms using random forests, in: IPMI, 2011, pp. 510–524.
  • (66) J. Cheng, D. Tao, D. Wong, B. Lee, et al., Focal biologically inspired feature for glaucoma type detection, in: MICCAI, part 3, 2011, pp. 91–98.
  • (67) Y. Xu, D. Xu, S. Lin, J. Liu, J. Cheng, C. Cheung, T. Aung, T. Wong, Sliding window and regression based cup detection in digital fundus images for glaucoma diagnosis, in: MICCAI, 2011, pp. 1–8.
  • (68) H. Wang, F. Nie, H. Huang, S. Risacher, A. Saykin, L. Shen, Identifying ad-sensitive and cognition-relevant imaging biomarkers via joint classification and regression, in: MICCAI, 2011, pp. 115–123.
  • (69) U. Avni, H. Greenspan, J. Goldberger, X-ray categorization and spatial localization of chest pathologies, in: MICCAI, 2011, pp. 199–206.
  • (70) W. Li, S. Liao, Q. Feng, W. Chen, D. Shen, Learning image context for segmentation of prostate in ct-guided radiotherapy, in: MICCAI, 2011, pp. 570–578.
  • (71) C. Leistner, A. Saffari, J. Santner, H. Bischof., Semi-supervised random forests., in: IEEE ICCV, 2009, pp. 506–513.
  • (72) C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, A. Zeileis, Conditional variable importance for Random forests, BMC Bioinformatics 9 (2008) doi:10.1186/1471–2105–9–307.
  • (73) Y. Freund, H. Seung, E. Samir, N. Tishby., Selective sampling using the query by committee algorithm., Mach. Learn 28 (2) (1997) 133–168.
  • (74) D. Zhang, Y. Wang, L. Zhou, H. Yuan, D. Shen., Multimodal classification of Alzheimer’s disease and mild cognitive impairment., Neuroimage 55 (3) (2011) 856–867.
  • (75) C. Davatzikos, Y. Fan, X. Wu, D. Shen and S. M. Resnick., Detection of prodromal Alzheimer's via pattern classification of mri., Neurobiology of Aging 29 (4) (2008) 514–523.
  • (76) X. Ren, J. Malik, Learning a classification model for segmentation, in: ICCV, 2003, pp. 10–17.
  • (77) B. Julesz, E. Gilbert, L. Shepp, H. Frisch, Inability of humans to discriminate between visual textures that agree in second-order statistics-revisited., Perception 2 (4) (1973) 391–405.
  • (78) V. Kovalev, M. Petrou, Y. Bondar, Texture anisotropy in 3D images., IEEE Trans. Imag. Proc 8 (3) (1999) 346–360.
  • (79) J. Iglesias, C.-Y. Liu, P. Thompson, Z. Tu, Robust brain extraction across datasets and comparison with publicly available methods., IEEE Trans. Med. Imag 30 (9) (2011) 1617–1634.
  • (80) Z. Liu, L. Smith, J. Sun, M. Smith, R. Warr, Biological indexes based reflectional asymmetry for classifying cutaneous lesions, in: MICCAI, 2011, pp. 124–132.
  • (81) F. M. Vos, et. al., Computational modeling for assessment of IBD: to be or not to be?, in: Proc. IEEE EMBC, 2012, pp. 3974–3977.
  • (82) D. Mahapatra, P. J. Schüffler, J. Tielbeek, J. Buhmann, F. M. Vos., Localizing and segmenting Crohn's disease affected regions in abdominal mri using novel context features, in: To Appear: SPIE Medical Imaging, 2013.
  • (83) A. Yu, L. C. E. Wu, P. Mulani, J. Chao, The costs of Crohn's disease in the United States and other western countries: a systematic review., Current Medical Research and Opinion 24 (2) (2008) 319–328.
  • (84) E. Armitage, H. Drummond, D. W. et al., Increasing incidence of both juvenile-onset Crohn's disease and ulcerative colitis in Scotland., Eur J Gastroenterol Hepatol 13 (12) (2001) 1439–1447.
  • (85) K. Fonager, H. Sorensen, J. Olsen, Change in incidence of Crohn's disease and ulcerative colitis in Denmark: a study based on the national registry of patients., Int J Epidemiol 26 (5) (1997) 1003–1008.
  • (86) D. Mahapatra, P. Schüffler, J. Tielbeek, F. Vos, J. Buhmann, Semi-supervised and active learning for automatic segmentation of Crohn's disease, in: Proc. MICCAI, Part 2, 2013, pp. 214–221.
  • (87) T. Rohlfing, D. B. Russakoff, J. C. R. Maurer, Performance-based classifier combination in atlas-based image segmentation using expectation-maximization parameter estimation., IEEE Trans. Med. Imag 23 (8) (2004) 983–994.
  • (88) R. A. Heckemann, J. V. Hajnal, P. Aljabar, D. Rueckert, A. Hammers., Automatic anatomical brain MRI segmentation combining label propagation and decision fusion., Neuroimage 33 (1) (2006) 115–126.
  • (89) E. M. van Rikxoort, I. Isgum, Y. Arzhaeva, M. Staring, S. Klein, M. A. Viergever, J. P. Pluim, B. van Ginneken., Adaptive local multi-atlas segmentation: Application to the heart and the caudate nucleus., Med. Imag. Anal. 14 (1) (2010) 39–49.
  • (90) I. Isgum, M. Staring, A. Rutten, M. Prokop, M. A. Viergever, B. van Ginneken., Multi-atlas-based segmentation with local decision fusion - application to cardiac and aortic segmentation in ct scans, IEEE Trans. Med. Imag 28 (7) (2009) 1000–1010.
  • (91) M. R. Sabuncu, B. T. T. Yeo, B. Fischl, P. Golland., A generative model for image segmentation based on label fusion, IEEE Trans. Med. Imag 29 (10) (2010) 1714–1729.
  • (92) Y. Yan, R. Rosales, G. Fung, M. Schmidt, Modeling annotator expertise: Learning when everybody knows a bit of something, in: In Proc: AISTATS, 2010, pp. 932–939.
  • (93) M. Richardson, P. Domingos, Learning with knowledge from multiple experts, in: In Proc: ICML, 2003, pp. 624–631.
  • (94) J. Kamarainen, L. Lensu, T. Kauppi, Learning with knowledge from multiple experts, in: In Proc: MICCAI-MLMI, 2012, pp. 193–200.
  • (95) H. Valizadegan, Q. Nguyen, M. Hauskrecht., Learning classification models from multiple experts, Journal of Biomedical Informatics 46 (6) (2013) 1125–1135.
  • (96) P. Roy, R. Chakravorty, S. Sedai, D. Mahapatra, R. Garnavi, Automatic eye type detection in retinal fundus image using fusion of transfer learning and anatomical features, in: In Proc. DICTA, 2016, pp. –.

  • (97) R. Tennakoon, D. Mahapatra, P. Roy, S. Sedai, R. Garnavi, Image quality classification for DR screening using convolutional neural networks, in: In Proc. MICCAI-OMIA, 2016, pp. 113–120.

  • (98) S. Sedai, P. Roy, D. Mahapatra, R. Garnavi, Segmentation of optic disc and optic cup in retinal images using coupled shape regression, in: In Proc. MICCAI-OMIA, 2016, pp. 1–8.
  • (99) D. Mahapatra, P. Roy, S. Sedai, R. Garnavi, Retinal image quality classification using saliency maps and cnns, in: In Proc. MICCAI-MLMI, 2016, pp. 172–179.
  • (100) S. Sedai, P. Roy, D. Mahapatra, R. Garnavi, Segmentation of optic disc and optic cup in retinal fundus images using shape regression, in: In Proc. EMBC, 2016, pp. 3260–3264.
  • (101) D. Mahapatra, P. Roy, S. Sedai, R. Garnavi, A cnn based neurobiology inspired approach for retinal image quality assessment, in: In Proc. EMBC, 2016, pp. 1304–1307.
  • (102) J. Zilly, J. Buhmann, D. Mahapatra, Boosting convolutional filters with entropy sampling for optic cup and disc image segmentation from fundus images, in: In Proc. MLMI, 2015, pp. 136–143.
  • (103) D. Mahapatra, J. Buhmann, Visual saliency based active learning for prostate mri segmentation, in: In Proc. MLMI, 2015, pp. 9–16.
  • (104) D. Mahapatra, J. Buhmann, Obtaining consensus annotations for retinal image segmentation using random forest and graph cuts, in: In Proc. OMIA, 2015, pp. 41–48.
  • (105) D. Mahapatra, J. Buhmann, A field of experts model for optic cup and disc segmentation from retinal fundus images, in: In Proc. IEEE ISBI, 2015, pp. 218–221.
  • (106) D. Mahapatra, Z. Li, F. Vos, J. Buhmann, Joint segmentation and groupwise registration of cardiac dce mri using sparse data representations, in: In Proc. IEEE ISBI, 2015, pp. 1312–1315.
  • (107) D. Mahapatra, F. Vos, J. Buhmann, Crohn’s disease segmentation from mri using learned image priors, in: In Proc. IEEE ISBI, 2015, pp. 625–628.
  • (108) H. Kuang, B. Guthier, M. Saini, D. Mahapatra, A. E. Saddik, A real-time smart assistant for video surveillance through handheld devices., in: In Proc: ACM Intl. Conf. Multimedia, 2014, pp. 917–920.
  • (109) D. Mahapatra, J. Tielbeek, J. Makanyanga, J. Stoker, S. Taylor, F. Vos, J. Buhmann, Combining multiple expert annotations using semi-supervised learning and graph cuts for Crohn's disease segmentation, in: In Proc: MICCAI-ABD, 2014.
  • (110) D. Mahapatra, J.Tielbeek, J. Makanyanga, J. Stoker, S. Taylor, F. Vos, J. Buhmann, Active learning based segmentation of crohn’s disease using principles of visual saliency, in: Proc. IEEE ISBI, 2014, pp. 226–229.
  • (111) D. Mahapatra, Graph cut based automatic prostate segmentation using learned semantic information, in: Proc. IEEE ISBI, 2013, pp. 1304–1307.
  • (112) D. Mahapatra, J. Buhmann, Automatic cardiac rv segmentation using semantic information with graph cuts, in: Proc. IEEE ISBI, 2013, pp. 1094–1097.
  • (113) D. Mahapatra, J. Tielbeek, F. Vos, J. Buhmann, Weakly supervised semantic segmentation of Crohn's disease tissues from abdominal mri, in: Proc. IEEE ISBI, 2013, pp. 832–835.
  • (114) D. Mahapatra, J. Tielbeek, F. Vos, J. Buhmann, Crohn's disease tissue segmentation from abdominal mri using semantic information and graph cuts, in: Proc. IEEE ISBI, 2013, pp. 358–361.
  • (115) D. Mahapatra, J. Tielbeek, F. Vos, J. Buhmann, Localizing and segmenting Crohn's disease affected regions in abdominal mri using novel context features, in: Proc. SPIE Medical Imaging, 2013.
  • (116) D. Mahapatra, Cardiac lv and rv segmentation using mutual context information, in: Proc. MICCAI-MLMI, 2012, pp. 201–209.
  • (117) D. Mahapatra, Landmark detection in cardiac mri using learned local image statistics, in: Proc. MICCAI-Statistical Atlases and Computational Models of the Heart. Imaging and Modelling Challenges (STACOM), 2012, pp. 115–124.
  • (118) D. Mahapatra, Groupwise registration of dynamic cardiac perfusion images using temporal information and segmentation information, in: Proc. SPIE Medical Imaging, 2012.
  • (119) D. Mahapatra, Neonatal brain mri skull stripping using graph cuts and shape priors, in: Proc. MICCAI workshop on Image Analysis of Human Brain Development (IAHBD), 2011.
  • (120) D. Mahapatra, Y. Sun, Orientation histograms as shape priors for left ventricle segmentation using graph cuts, in: Proc. MICCAI, 2011, pp. 420–427.
  • (121) D. Mahapatra, Y. Sun, Joint registration and segmentation of dynamic cardiac perfusion images using mrfs., in: Proc. MICCAI, 2010, pp. 493–501.
  • (122) D. Mahapatra, Y. Sun, An mrf framework for joint registration and segmentation of natural and perfusion images, in: Proc. IEEE ICIP, 2010, pp. 1709–1712.
  • (123) D. Mahapatra, Y. Sun, Retrieval of perfusion images using cosegmentation and shape context information, in: Proc. APSIPA Annual Summit and Conference (ASC), 2010.
  • (124) D. Mahapatra, Y. Sun, A saliency based mrf method for the joint registration and segmentation of dynamic renal mr images, in: Proc. ICDIP, 2010.
  • (125) D. Mahapatra, Y. Sun, Nonrigid registration of dynamic renal MR images using a saliency based MRF model, in: Proc. MICCAI, 2008, pp. 771–779.
  • (126) D. Mahapatra, Y. Sun, Registration of dynamic renal mr images using neurobiological model of saliency, in: Proc. ISBI, 2008, pp. 1119–1122.
  • (127) D. Mahapatra, M. Saini, Y. Sun, Illumination invariant tracking in office environments using neurobiology-saliency based particle filter, in: IEEE ICME, 2008, pp. 953–956.
  • (128) D. Mahapatra, S. Roy, Y. Sun, Retrieval of mr kidney images by incorporating spatial information in histogram of low level features, in: In 13th International Conference on Biomedical Engineering, 2008.
  • (129) D. Mahapatra, Y. Sun, Using saliency features for graphcut segmentation of perfusion kidney images, in: In 13th International Conference on Biomedical Engineering, 2008.
  • (130) D. Mahapatra, S. Winkler, S. Yen, Motion saliency outweighs other low-level features while watching videos, in: SPIE HVEI, 2008, pp. 1–10.
  • (131) D. Mahapatra, A. Routray, C. Mishra, An active snake model for classification of extreme emotions, in: IEEE International Conference on Industrial Technology (ICIT), 2006, pp. 2195–2199.
  • (132) D. Mahapatra, Semi-supervised learning and graph cuts for consensus based medical image segmentation., Pattern Recognition 63 (1) (2017) 700–709.
  • (133) J. Zilly, J. Buhmann, D. Mahapatra, Glaucoma detection using entropy sampling and ensemble learning for automatic optic cup and disc segmentation., Computerized Medical Imaging and Graphics, in press.
  • (134) D. Mahapatra, F. Vos, J. Buhmann, Active learning based segmentation of Crohn’s disease from abdominal mri., Computer Methods and Programs in Biomedicine 128 (1) (2016) 75–85.
  • (135) D. Mahapatra, J. Buhmann, Visual saliency based active learning for prostate mri segmentation., SPIE Journal of Medical Imaging 3 (1).
  • (136) D. Mahapatra, Combining multiple expert annotations using semi-supervised learning and graph cuts for medical image segmentation., Computer Vision and Image Understanding 151 (1) (2016) 114–123.
  • (137) Z. Li, D. Mahapatra, J. Tielbeek, J. Stoker, L. van Vliet, F. Vos, Image registration based on autocorrelation of local structure., IEEE Trans. Med. Imaging 35 (1) (2016) 63–75.
  • (138) D. Mahapatra, Automatic cardiac segmentation using semantic information from random forests., J. Digit. Imaging. 27 (6) (2014) 794–804.
  • (139) D. Mahapatra, S. Gilani, M. Saini, Coherency based spatio-temporal saliency detection for video object segmentation., IEEE Journal of Selected Topics in Signal Processing. 8 (3) (2014) 454–462.
  • (140) D. Mahapatra, J. Tielbeek, F. Vos, J. Buhmann, A supervised learning approach for crohn’s disease detection using higher order image statistics and a novel shape asymmetry measure., J. Digit. Imaging 26 (5) (2013) 920–931.
  • (141) D. Mahapatra, Cardiac mri segmentation using mutual context information from left and right ventricle., J. Digit. Imaging 26 (5) (2013) 898–908.
  • (142) D. Mahapatra, Cardiac image segmentation from cine cardiac mri using graph cuts and shape priors., J. Digit. Imaging 26 (4) (2013) 721–730.
  • (143) D. Mahapatra, Joint segmentation and groupwise registration of cardiac perfusion images using temporal information., J. Digit. Imaging 26 (2) (2013) 173–182.
  • (144) D. Mahapatra, Skull stripping of neonatal brain mri: Using prior shape information with graphcuts., J. Digit. Imaging 25 (6) (2012) 802–814.
  • (145) D. Mahapatra, Y. Sun, Integrating segmentation information for improved mrf-based elastic image registration., IEEE Trans. Imag. Proc. 21 (1) (2012) 170–183.
  • (146) D. Mahapatra, Y. Sun, Mrf based intensity invariant elastic registration of cardiac perfusion images using saliency information, IEEE Trans. Biomed. Engg. 58 (4) (2011) 991–1000.
  • (147) D. Mahapatra, Y. Sun, Rigid registration of renal perfusion images using a neurobiology based visual saliency model, EURASIP Journal on Image and Video Processing (2010) 1–16.