A Joint Graph and Image Convolution Network for Automatic Brain Tumor Segmentation

by Camillo Saueressig, et al.
Brown University

We present a joint graph convolution-image convolution neural network as our submission to the Brain Tumor Segmentation (BraTS) 2021 challenge. We model each brain as a graph composed of distinct image regions, which is initially segmented by a graph neural network (GNN). Subsequently, the tumorous volume identified by the GNN is further refined by a simple (voxel) convolutional neural network (CNN), which produces the final segmentation. This approach captures both global brain feature interactions via the graphical representation and local image details through the use of convolutional filters. We find that the GNN component by itself can effectively identify and segment the brain tumors. The addition of the CNN further improves the median performance of the model by 2 percent across all metrics evaluated. On the validation set, our joint GNN-CNN model achieves mean Dice scores of 0.89, 0.81, 0.73 and mean Hausdorff distances (95th percentile) of 6.8, 12.6, 28.2 mm on the whole tumor, core tumor, and enhancing tumor, respectively.




1 Introduction

Tumor segmentation is a cornerstone of nearly all standard tumor treatments. It is integral for surgical and radiation planning, treatment response analysis, and longitudinal tumor monitoring, among other standard practices. However, manual tumor segmentation is notoriously time-consuming and subjective, even for highly trained radiologists. Automatic tumor segmentation can produce such segmentations in a fraction of the time in a standardized, reproducible fashion. Over the past decade, the performance of automated biomedical segmentation methods has significantly improved across multiple tumor types, and brain tumors are no exception [9, 7]. The Brain Tumor Segmentation dataset (BraTS) is the largest publicly available dataset of brain tumor MRIs and corresponding expert segmentations and has played a pivotal role in developing and evaluating these methods [12, 5, 6, 3, 4].

The 2021 BraTS tumor segmentation challenge consists of over 2000 multi-parametric magnetic resonance images (MRIs) of tumorous brain volumes imaged across a wide array of institutions. While the images are collected using diverse procedures and instruments, they are all processed using a standard pipeline, and the same four modalities are available for every volume. These are T1-weighted, T1-weighted contrast-enhanced, T2-weighted, and Fluid Attenuated Inversion Recovery (FLAIR) modalities, all of which provide complementary information on the location and shape of the tumor and its compartments. The ground truth labels are generated using an ensemble of top-performing models from previous years and are manually revised by an expert neuroradiologist for all images. The challenge aims to correctly classify each voxel of a given brain volume as either healthy tissue, edema, enhancing tumor (ET), or necrotic tumor core. These tumor sub-regions can be combined into the whole tumor (WT) and core tumor (necrotic core + enhancing tumor, TC) to further evaluate model performance on gross tumor segmentation.


Our submission to the BraTS 2021 challenge is a joint graph neural network (GNN) - convolutional neural network (CNN) model (summarized in Figure 1). The GNN module aims to partition the brain into distinct regions and predict the label of each region, and the CNN component refines the predictions made by the GNN. Unlike the vast majority of BraTS competitors in recent years [6], which exclusively perform inference directly on voxel data, our model instead learns and predicts primarily on a graphical representation of the brain. We model each brain volume as composed of small, contiguous regions and connect nearby regions using edges, forming a graph. Each graph node contains information summarizing the intensity information of the brain in that region across all four modalities, and the edges allow neighboring regions to share their information with each other. This formulation greatly simplifies the representation of a brain from millions of voxels down to only thousands of nodes, while preserving nearly all the information. It also enables the modeling of explicit connectivity between different regions of the brain and potential long-range interactions between distant regions, which are difficult to capture using only CNNs. We have previously developed a similar model composed only of a graph neural network on the 2019 BraTS dataset [13]. Here, we improve on our previous work by adding a shallow CNN to the end of the model, which smooths out the model predictions at region boundaries and provides a substantial (about 2 percent) improvement in both median Dice score and median Hausdorff distance.

Figure 1: GNN-CNN Model Overview.

MRI modalities are first stacked to create one 3D image with 4 channels. 1) Combined modalities are clustered into supervoxels using SLIC. 2) Supervoxels are converted to a graph structure such that each supervoxel becomes one graph node (depicted graph is greatly simplified). 3) The graph is fed through a Graph Neural Network. 4) Node prediction outputs (more specifically, logits) are overlaid back onto the supervoxels. The original input image features are concatenated with re-projected node logits. 5) The result is fed through a 2-layer CNN which produces final predictions.

2 Methods

Our GNN-CNN model is composed of two components. The core component is a graph neural network (GNN) [10, 14]. For a given input graph representing one patient sample, where each node corresponds to a collection of adjacent voxels in the original MRI image, the GNN predicts each node’s label. Since the GNN can only predict the label of nodes (i.e. brain regions) atomically, its predictions are necessarily coarser than voxel-based predictions. This property can lead to incorrect predictions at the edges of tumor compartments, where created regions can contain voxels of multiple labels [13]. This shortcoming is especially pronounced in small tumors. Accordingly, we have added a second component to our model: a shallow CNN [11]. The convolutional layers receive both the GNN prediction logits (projected back into an image) and the original voxel image data. They are thus able to make fine-grained adjustments to the coarse predictions based on local voxel information. The details of the model are presented in Figure 2.

2.1 Graph Construction from MRI Modalities

Both the input and the output of the GNN are required to be graph-structured data. Therefore, before feeding the MRI scans into our network, we transform them into graphs. Graphs are composed of nodes and edges, where both the nodes and the edges can have features associated with them. In this work, each node corresponds to one image region, and an edge between two nodes corresponds to spatial proximity of the corresponding regions. We partition the brain into regions using supervoxels. Supervoxels are the 3D analog to superpixels, i.e., collections of nearby pixels that share similar intensities.

We construct the supervoxels using the Simple Linear Iterative Clustering (SLIC) algorithm [1]. SLIC uses a combination of spatial and intensity information to partition an image into approximately a desired number of supervoxels using K-means clustering. While the input to SLIC is traditionally in either RGB or Lab color space, we find that running SLIC directly on the stacked MRI modalities still produces meaningful supervoxels. To determine the optimal hyperparameters for the SLIC algorithm, we perform a grid search across k, the number of supervoxels, and m, the compactness coefficient (the weighting between spatial and intensity information), and compute the achievable segmentation accuracy (ASA). ASA measures how well the GNN would perform on a given supervoxel partitioning, given that it classifies every supervoxel according to the most common label of the constituent voxels. The ASA is high if there is a strong correspondence between supervoxel shape and tumor boundaries, resulting in supervoxels composed of voxels with the same label. It is low if supervoxels are composed of voxels with mixed labels.
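The ASA definition above can be made concrete with a short sketch. The function name and the flattened-list inputs are illustrative assumptions, not part of the original pipeline; the sketch only counts how many voxels would be correct if each supervoxel took its majority label.

```python
from collections import Counter, defaultdict


def achievable_segmentation_accuracy(supervoxel_ids, voxel_labels):
    """Fraction of voxels that are correct when every supervoxel is
    assigned the most common label of its constituent voxels."""
    if len(supervoxel_ids) != len(voxel_labels):
        raise ValueError("inputs must have one entry per voxel")
    counts = defaultdict(Counter)
    for sv, label in zip(supervoxel_ids, voxel_labels):
        counts[sv][label] += 1
    # For each supervoxel, only the voxels carrying the majority label are correct.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(voxel_labels)


# Toy example (hypothetical data): 3 supervoxels covering 8 flattened voxels.
supervoxels = [0, 0, 0, 1, 1, 1, 2, 2]
labels = [0, 0, 1, 2, 2, 2, 1, 0]
asa = achievable_segmentation_accuracy(supervoxels, labels)
print(asa)
```

In a grid search, this score would be computed for each candidate pair of k and m and averaged over the training volumes.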

After the supervoxels are generated via SLIC, we discard those supervoxels that lie outside the brain volume. Of the remaining supervoxels, each is assigned a feature vector, a label, and a set of neighbors. The feature vector summarizes the intensities of the input MRIs for its comprising voxels. We empirically found that intensity quintiles for each modality yielded the best results. The label is the majority label (mode) of its constituent voxels. The neighbors of a supervoxel are all other supervoxels which are directly adjacent to it. A graph is then constructed where each supervoxel forms one node with its associated features and label, and each supervoxel shares an unweighted and undirected edge with its neighbors.
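A minimal sketch of the graph construction step follows. It assumes the supervoxel map and the label map are given as nested lists indexed `[z][y][x]`, that voxels outside the brain carry a hypothetical background id of -1, and that "directly adjacent" means sharing a voxel face (6-connectivity). The feature vectors are omitted for brevity.

```python
from collections import Counter


def build_supervoxel_graph(sv_map, label_map, background=-1):
    """Return (edges, node_labels) for a 3D supervoxel map.

    edges: sorted list of undirected, unweighted pairs (a, b) with a < b.
    node_labels: majority voxel label of each supervoxel.
    """
    nz, ny, nx = len(sv_map), len(sv_map[0]), len(sv_map[0][0])
    edges = set()
    votes = {}
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                a = sv_map[z][y][x]
                if a == background:
                    continue  # supervoxels outside the brain are discarded
                votes.setdefault(a, Counter())[label_map[z][y][x]] += 1
                # Look one step forward along each axis so that every
                # face-adjacent voxel pair is inspected exactly once.
                for dz, dy, dx in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    z2, y2, x2 = z + dz, y + dy, x + dx
                    if z2 < nz and y2 < ny and x2 < nx:
                        b = sv_map[z2][y2][x2]
                        if b != background and b != a:
                            edges.add((min(a, b), max(a, b)))
    node_labels = {sv: c.most_common(1)[0][0] for sv, c in votes.items()}
    return sorted(edges), node_labels


# Toy volume (hypothetical data) with shape 1 x 2 x 4 and three supervoxels.
sv_map = [[[0, 0, 1, 1], [2, 2, 1, -1]]]
label_map = [[[0, 0, 1, 1], [2, 2, 1, 0]]]
edges, node_labels = build_supervoxel_graph(sv_map, label_map)
print(edges, node_labels)
```

Storing each edge as an ordered pair keeps the graph undirected without duplicating entries.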

2.2 GNN Architecture

Our graph neural network is composed of several sequential GraphSAGE-pool layers [8] alternated with the ReLU non-linearity (Figure 2). Each layer transforms the features of each node by aggregating information from that node’s neighbors, according to Eq. 1:

$$h_i^{(l+1)} = \sigma\left( W^{(l)} \cdot \left[ h_i^{(l)} \,\Vert\, \max_{j \in N(i)} \sigma\left( Q\, h_j^{(l)} \right) \right] \right) \tag{1}$$

where $h_i^{(l)}$ denotes the features of node $i$ at layer $l$, $\sigma$ is a differentiable, non-linear activation function, $W^{(l)}$ is a layer-specific trainable weight matrix, $Q$ is a global trainable weight matrix, $\Vert$ is the concatenation operator, and $N(i)$ is the subset of nodes which are directly connected to $i$ via edges, also known as the neighborhood of $i$.
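To illustrate the update, the following sketch implements one GraphSAGE-pool style layer in plain Python: each neighbor's features are transformed by a pooling matrix and a ReLU, the results are max-pooled element-wise, concatenated with the node's own features, and passed through the layer matrix and a ReLU. The names `W` and `Q`, the omission of bias terms, and the use of all neighbors rather than a sampled subset are simplifying assumptions.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(matrix, v):
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]


def sage_pool_layer(h, neighbors, W, Q):
    """One GraphSAGE-pool style update (bias terms omitted).

    h: dict node -> feature list
    neighbors: dict node -> list of adjacent nodes
    Q: pooling weight matrix applied to every neighbor
    W: layer weight matrix applied to [own features || pooled features]
    """
    out = {}
    for i, h_i in h.items():
        transformed = [relu(matvec(Q, h[j])) for j in neighbors[i]]
        if transformed:
            # Element-wise max over all transformed neighbor vectors.
            pooled = [max(column) for column in zip(*transformed)]
        else:
            pooled = [0.0] * len(Q)  # isolated node: nothing to aggregate
        out[i] = relu(matvec(W, h_i + pooled))  # list "+" is concatenation
    return out


# Toy graph (hypothetical data): node 0 is connected to nodes 1 and 2.
h = {0: [1.0, -2.0], 1: [3.0, 0.5], 2: [-1.0, 4.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
Q = [[1, 0], [0, 1]]              # identity pooling weights
W = [[1, 0, 1, 0], [0, 1, 0, 1]]  # adds own and pooled features
out = sage_pool_layer(h, neighbors, W, Q)
print(out)
```

Stacking several such layers lets information travel several hops, which is how distant regions of the brain graph can influence each other.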

The input layer expects 20 features (5 quintiles for each of four modalities) and the output layer outputs 4 logits (one for each label). The output logits are duplicated, where one copy is passed directly through a loss function which backpropagates only through the GNN, and the other is passed through to the CNN (Figure 2).


Figure 2: Detailed view of GNN and CNN. Left: The GNN is composed of GraphSAGE layers alternated with a nonlinearity. Each GraphSAGE layer updates each node’s features by sampling neighboring nodes and aggregating the features (Eq.1). Right: 1) The output of the GNN is reprojected into a 3D image by assigning each voxel the output logits of its corresponding node. 2) Based on this reprojection, the approximate location of the tumor predicted by the GNN is located and cropped out. 3) The projected and cropped logits are concatenated with the image features for that same location. This volume is then fed through a two-layer CNN. Note that the output of both the GNN and CNN components have an associated loss function.

2.3 CNN Architecture

The CNN consists of two convolutional layers with a common kernel size and a stride of 1 (Figure 2), separated by a ReLU nonlinearity. The second layer has 4 filters (one for each label). The architecture is purposefully kept simple since it only serves to refine the predictions made by the GNN.

The input to the CNN is the concatenation of the GNN output logits (4 channels) and the input MRI modalities (4 channels) for each voxel. Therefore, the CNN receives the predictions of the GNN and the input features, which allows it to correct the predictions made by the GNN. This correction is especially relevant around the edges of the tumor and its compartments, where the coarse predictions from the GNN can often result in misclassifications of strips of voxels. We feed only the tumorous tissue through the CNN to reduce the memory requirement and computation time. Specifically, we crop out a patch of the volume containing the tumor, as predicted by the GNN, and the CNN further refines only that patch.
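The reprojection, cropping, and concatenation steps can be sketched as follows. The dictionary-based volume representation, the `margin` parameter, and the convention that label index 0 denotes healthy tissue are assumptions made for illustration; a real implementation would operate on dense arrays.

```python
def reproject_and_crop(sv_of_voxel, node_logits, image, margin=0):
    """Build the CNN input patch from GNN outputs.

    sv_of_voxel: dict (z, y, x) -> supervoxel id
    node_logits: dict supervoxel id -> list of 4 logits
    image: dict (z, y, x) -> list of 4 modality intensities
    Returns a dict (z, y, x) -> 8 channels (4 logits + 4 intensities),
    restricted to the bounding box of voxels predicted as tumor.
    """
    # 1) Reproject: every voxel inherits the logits of its graph node.
    logit_map = {v: node_logits[sv] for v, sv in sv_of_voxel.items()}
    # 2) Locate the predicted tumor (argmax differs from healthy index 0).
    tumor = [v for v, lg in logit_map.items()
             if max(range(len(lg)), key=lg.__getitem__) != 0]
    if not tumor:
        return {}
    lo = [min(v[d] for v in tumor) - margin for d in range(3)]
    hi = [max(v[d] for v in tumor) + margin for d in range(3)]
    # 3) Crop and concatenate logits with the image features.
    return {v: list(lg) + list(image[v])
            for v, lg in logit_map.items()
            if all(lo[d] <= v[d] <= hi[d] for d in range(3))}


# Toy row of four voxels (hypothetical data): node 1 is predicted as tumor.
sv_of_voxel = {(0, 0, 0): 0, (0, 0, 1): 0, (0, 0, 2): 1, (0, 0, 3): 1}
node_logits = {0: [2.0, 0.1, 0.0, 0.0], 1: [0.1, 1.5, 0.2, 0.0]}
image = {v: [0.5, 0.6, 0.7, 0.8] for v in sv_of_voxel}
patch = reproject_and_crop(sv_of_voxel, node_logits, image)
print(sorted(patch))
```

Cropping before the convolution is what keeps memory use low: the CNN only ever sees the neighborhood the GNN flagged as tumorous.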

2.4 Loss Functions

We calculate and backpropagate loss through our model at two locations. A voxel-wise cross-entropy loss is calculated from the output of the CNN and backpropagated only through the convolutional layers. This loss is unweighted as the input to the CNN has been cropped to the tumor-containing volume.

A node-wise weighted cross-entropy loss is calculated from the GNN logits and backpropagated through the GNN. The ground truth label for each node is generated by finding the mode of the labels in the corresponding supervoxel. This loss is weighted approximately inversely to the prevalence of each label to address the class imbalance.
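As an illustration of the node-wise loss, the sketch below derives class weights that are exactly inversely proportional to label prevalence and applies them in a weighted cross-entropy. The exact weights used in this work are only described as approximately inverse, so the weighting formula and the normalization by the summed weights are assumptions.

```python
import math
from collections import Counter


def inverse_prevalence_weights(labels, num_classes):
    """Weight each class by total / (num_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    # Classes absent from the sample get weight 0 to avoid division by zero.
    return [total / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-node cross-entropy, normalized by the summed weights."""
    numerator = denominator = 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
        numerator += weights[y] * (log_sum_exp - z[y])
        denominator += weights[y]
    return numerator / denominator


# Toy batch of 8 nodes (hypothetical data) where label 0 dominates.
node_labels = [0, 0, 0, 0, 1, 1, 2, 3]
class_weights = inverse_prevalence_weights(node_labels, num_classes=4)
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in node_labels]
loss = weighted_cross_entropy(uniform_logits, node_labels, class_weights)
print(class_weights, loss)
```

With uniform logits the loss equals log 4 regardless of the weights, which is a convenient sanity check for an untrained four-class model.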

We include this GNN loss function to obtain prediction logits of the nodes that can then be easily projected in the image space. It is crucial for the model’s performance that the GNN output be interpretable as predictions so that the predicted tumorous volume can be located and cropped out. Furthermore, this formulation allows us to visualize the finer corrections that the CNN layer performs over the coarse GNN predictions (see Fig. 3 for example).

2.5 Model Training

In practice, we train the GNN and CNN sequentially rather than simultaneously to decrease training time. The GNN is trained for 300 epochs on mini-batches of 6 graphs, whereas the CNN is trained for 100 epochs using only one sample at a time. The training of a full model takes approximately 2 days on an 8GB GPU.

We used the AdamW optimizer with weight decay of 0.0001 and exponentially decrease the learning rate according to Eq. 2:

$$\eta_t = \eta_0 \, \gamma^{t} \tag{2}$$

where $\eta_0$ is the initial learning rate, $t$ is the current epoch, and $\gamma < 1$ is the per-epoch decay factor. We found that adding additional regularization, such as dropout or higher weight decay, did not improve performance.
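A minimal sketch of an exponentially decaying learning rate is given below. The initial learning rate of 0.0005 and the 300 epochs are the GNN values reported in this paper; the decay factor of 0.98 is a hypothetical placeholder, since its value is not recoverable from the text here.

```python
def exponential_lr(initial_lr, decay, epoch):
    """Learning rate after `epoch` epochs of multiplicative decay."""
    return initial_lr * decay ** epoch


# decay=0.98 is an assumed value chosen purely for illustration.
schedule = [exponential_lr(5e-4, 0.98, t) for t in range(300)]
print(schedule[0], schedule[-1])
```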

The BraTS 2021 dataset is split into training (n=1251), validation (n=216), and test (n to be released) partitions. The hyperparameters for only the GNN component, i.e., GNN layer sizes, GNN depth, learning rate, and class weighting, were tuned using random search and 5-fold cross-validation on the entire training set (n=1251). The GNN architecture with the best performance across the 5 folds (calculated using the average) was then integrated into the full hybrid model. Three architectural replicates were trained on the entire dataset and evaluated on the validation set. We report the performance of the best performing replicate.

2.6 Data Preprocessing

The BraTS dataset MRIs are all padded to a standard shape, presumably to facilitate image-based processing. Since our approach is graph-based, we first crop each patient sample into the tightest possible bounding box around the brain, which varies for each sample. Subsequently, we rescale each MRI to the approximate [0, 1] range by dividing by the 99.5th percentile of intensity values in that MRI. This normalization is crucial as MRI data is not collected in a bounded range, and intensity values can vary by several orders of magnitude even between two images of the same modality. Finally, we compute the mean and standard deviation for each modality across the entire training dataset (on non-zero voxels) and standardize each modality to have zero mean and unit variance.
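The two normalization steps can be sketched as follows for a single modality. Volumes are represented as flat lists of intensities; the linear-interpolation percentile, the inclusion of zero voxels when computing the percentile, and the application of the final standardization to every voxel are assumptions, as the text does not pin these details down.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def rescale(volume):
    """Divide by the 99.5th percentile so intensities fall roughly in [0, 1]."""
    scale = percentile(volume, 99.5)
    return [v / scale for v in volume]


def standardize(volumes):
    """Standardize one modality using the mean and standard deviation of
    all non-zero voxels across the dataset."""
    nonzero = [v for vol in volumes for v in vol if v != 0]
    mean = sum(nonzero) / len(nonzero)
    std = math.sqrt(sum((v - mean) ** 2 for v in nonzero) / len(nonzero))
    return [[(v - mean) / std for v in vol] for vol in volumes]


# Two toy scans (hypothetical data) whose raw intensities differ by 100x.
scaled_a = rescale([0, 10, 20, 30, 40])
scaled_b = rescale([0, 1000, 2000, 3000, 4000])
standardized = standardize([scaled_a, scaled_b])
print(scaled_a, scaled_b)
```

After rescaling, the two toy scans become numerically identical, which is the point of the percentile step: it removes scanner-dependent intensity scale before the dataset-wide statistics are computed.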

3 Results

3.1 Hyperparameters

The SLIC parameters k and m were set to the values with the highest achievable segmentation accuracy (ASA). One of these values differs from that in our previous work [13] as our preprocessing steps have slightly changed.

The best performing GNN model from the cross-validation phase had 6 layers with 256 neurons each and a learning rate of 0.0005. The GNN is thus deeper and has many more learnable parameters than the CNN. This is a purposeful design choice to force the GNN to do the majority of the learning.

3.2 Evaluation Metrics

The performance of the models submitted to the BraTS challenge is evaluated using two metrics, the Dice score and the 95th percentile of the symmetric Hausdorff distance. Both metrics are evaluated over the whole tumor, core tumor, and enhancing tumor subregions. Intuitively, the Dice score measures the overlap between the predictions and the ground truth, while the Hausdorff distance measures how far the predicted and ground truth segmentations diverge from each other at their worst.

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$

where $TP$, $FP$, and $FN$ are the number of true positives, false positives, and false negatives, respectively. True positive voxels are defined as those correctly assigned as belonging to a specific tumor compartment.

$$\mathrm{HD}_{95} = P_{95}\left( d(\hat{Y}, Y) \,\Vert\, d(Y, \hat{Y}) \right)$$

where $P_{95}$ denotes the 95th percentile, $d$ is the element-wise distance of every voxel in the first set to the closest voxel of the same label in the second, $\hat{Y}$ are the predicted labels of each voxel, $Y$ are the ground truth labels of each voxel, and $\Vert$ is the concatenation operator.
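Both metrics can be sketched for a single tumor sub-region as below. Each segmentation is represented as a non-empty set of voxel coordinates; the nearest-rank percentile, the Euclidean distance in voxel units, and the convention of returning a Dice score of 1 when both sets are empty are assumptions, and the official BraTS evaluation may differ in these details.

```python
import math


def dice(pred, truth):
    """Dice score between two sets of voxel coordinates."""
    tp = len(pred & truth)
    fp = len(pred - truth)
    fn = len(truth - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def directed_distances(a, b):
    """Distance from every voxel in a to its closest voxel in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def hd95(pred, truth):
    """95th percentile of the concatenated directed distances."""
    d = sorted(directed_distances(pred, truth) + directed_distances(truth, pred))
    rank = max(1, math.ceil(0.95 * len(d)))  # nearest-rank percentile
    return d[rank - 1]


# Toy example (hypothetical data): the prediction is shifted by one voxel.
pred = {(0, 0, 0), (0, 0, 1), (0, 0, 2)}
truth = {(0, 0, 1), (0, 0, 2), (0, 0, 3)}
print(dice(pred, truth), hd95(pred, truth))
```

The brute-force nearest-voxel search is quadratic in the number of voxels, so it is only suitable for small examples like this one.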

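| Model | Dice WT | Dice TC | Dice ET | HD95 WT (mm) | HD95 TC (mm) | HD95 ET (mm) |
|---|---|---|---|---|---|---|
| GNN | 0.874 | 0.782 | 0.738 | 6.92 | 16.67 | 20.40 |
| GNN-CNN | 0.894 | 0.807 | 0.734 | 6.79 | 12.62 | 28.20 |

Table 1: Mean results on validation set.

| Model | Dice WT | Dice TC | Dice ET | HD95 WT (mm) | HD95 TC (mm) | HD95 ET (mm) |
|---|---|---|---|---|---|---|
| GNN | 0.906 | 0.885 | 0.813 | 3.46 | 3.16 | 2.45 |
| GNN-CNN | 0.925 | 0.908 | 0.842 | 3.00 | 3.00 | 2.24 |

Table 2: Median results on validation set.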

3.3 Performance

The mean results of the best performing model replicate with the hyperparameters described above are given in Table 1. We report both the performance of the GNN-only model and the performance of the joint GNN-CNN model with the added shallow CNN.

The mean scores reported in Table 1 indicate that the addition of the convolutional layers to the model improves performance on the whole tumor and tumor core sub-regions but decreases performance on the enhancing tumor. However, we argue that the existence of outliers makes the mean results possibly misleading and therefore also report the median metrics in Table 2 (the ET, in particular, is the rarest label and thus prone to outliers).

Figure 3: Example Predictions on Validation Brain. Three slices (horizontal, coronal, sagittal) of the same brain from the validation set are shown. The first row is from the T1ce modality, and the second is from the FLAIR modality. The third row shows the GNN predictions. The fourth row contains the GNN predictions refined through the CNN. Red=edema, Blue=NET/necrosis, Yellow=ET. We observe that the GNN accurately identifies the tumorous region but makes slight errors in classifying the compartments. The CNN, however, can refine the predictions in greater accordance with the images.

In the median results (Table 2), we observe that the GNN-CNN model outperforms the GNN-only model across every metric. These results indicate that the addition of the CNN can successfully correct misclassification errors that result from mixed-label supervoxels, even while the CNN architecture is very simple. Notably, the improvement across all three subregions demonstrates that the joint GNN-CNN model is 1) better able to distinguish the border edema from healthy tissue, 2) better able to distinguish NET from edema and ET, and 3) better able to distinguish ET from NET. Nonetheless, it is interesting that the CNN tends to exacerbate poor performance on the ET predictions on outliers.

In Figure 3, we provide several example slices from a segmentation on a brain from the validation set, as well as 2 of the 4 input modalities. The FLAIR image provides information on the tumor core and edema and is thus well suited for the segmentation of the whole tumor. In contrast, the T1ce provides information on NET/necrotic tissue and the enhancing tumor and is thus vital for delineation of the ET and NET subregions. The predictions that have been refined through the CNN (last row) are both smoother and correspond more closely with the shape and appearance of the tumor in the two modalities than the predictions made directly by the GNN (third row).

4 Discussion

We have presented a joint GNN-CNN network for automatic brain tumor segmentation. The GNN can produce good segmentations on its own, but struggles to accurately delineate exact tumor and tumor compartment boundaries due to the coarse supervoxel generation step. We show that this limitation can be at least partially circumvented by adding convolutional layers to the end of the model to smooth out predictions. While it is likely that a more complex CNN could further boost performance, this work aims to improve the feasibility of GNNs for brain tumor segmentation rather than to engineer an optimal CNN. Furthermore, even without the additional convolutional module, the GNN-only model presented here outperforms our previous work (albeit on a different dataset). This improvement is likely the result of a larger training dataset and an improved standardization scheme.

An interesting direction for future work is the inverse of the approach we attempt here: rather than a complex GNN and a simple CNN, append a complex CNN to a simple GNN. A common problem with CNN approaches is that the entire brain is impractically large to fit on a single GPU, and thus random image patches are used instead. A possible solution is to train a GNN first to identify the approximate location of the whole tumor, but without distinguishing tumor subtypes, and then crop out that patch from the original image and feed it through a performant CNN architecture, such as a U-net, to obtain the final predictions. Given the flexibility and ease of integration into existing image processing models of our joint GNN-CNN model, we believe it is a promising approach for improved and more efficient automatic tumor segmentation models.


  • [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11), pp. 2274–2282.
  • [2] U. Baid, S. Ghodasara, M. Bilello, S. Mohan, E. Calabrese, E. Colak, K. Farahani, J. Kalpathy-Cramer, F. C. Kitamura, S. Pati, et al. (2021) The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314.
  • [3] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos (2017) Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection. The Cancer Imaging Archive. Nat Sci Data 4, pp. 170117.
  • [4] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos (2017) Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection. The Cancer Imaging Archive 286.
  • [5] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos (2017) Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data 4, pp. 170117.
  • [6] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, R. T. Shinohara, C. Berger, S. M. Ha, M. Rozycki, et al. (2018) Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BraTS challenge. arXiv preprint arXiv:1811.02629.
  • [7] W. L. Bi, A. Hosny, M. B. Schabath, M. L. Giger, N. J. Birkbak, A. Mehrtash, T. Allison, O. Arnaout, C. Abbosh, I. F. Dunn, et al. (2019) Artificial intelligence in cancer imaging: clinical challenges and applications. CA: A Cancer Journal for Clinicians 69 (2), pp. 127–157.
  • [8] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
  • [9] I. R. I. Haque and J. Neubert (2020) Deep learning approaches to biomedical image segmentation. Informatics in Medicine Unlocked 18, pp. 100297.
  • [10] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • [11] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444.
  • [12] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al. (2014) The multimodal brain tumor image segmentation benchmark (BraTS). IEEE Transactions on Medical Imaging 34 (10), pp. 1993–2024.
  • [13] C. Saueressig, A. Berkeley, E. Kang, R. Munbodh, and R. Singh (2021) Exploring graph-based neural networks for automatic brain tumor segmentation. In From Data to Models and Back: 9th International Symposium, DataMod 2020, Virtual Event, October 20, 2020, Revised Selected Papers, Vol. 12611, pp. 18.
  • [14] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434.