Learning an MR acquisition-invariant representation using Siamese neural networks

10/17/2018 ∙ by Wouter M. Kouw, et al. ∙ 0

Generalization of voxelwise classifiers is hampered by differences between MRI-scanners, e.g. different acquisition protocols and field strengths. To address this limitation, we propose a Siamese neural network (MRAI-NET) that extracts acquisition-invariant feature vectors. These can consequently be used by task-specific methods, such as voxelwise classifiers for tissue segmentation. MRAI-NET is tested on both simulated and real patient data. Experiments show that MRAI-NET outperforms voxelwise classifiers trained on the source or target scanner data when a small number of labeled samples is available.




1 Introduction

Voxelwise classifiers for brain tissue segmentation should be trained on a sufficiently large representative data set, covering all possible types of variation. However, acquiring manual labels as ground truth is both labor intensive and time consuming. Furthermore, non-standardized manual segmentation protocols and inter- and intra-observer variability add another factor of variation to an already complex problem. Instead of increasing the number of manual labels, we propose to improve generalization by teaching a neural network to minimize an undesirable form of variation, namely acquisition-based variation. The proposed network learns a representation [1] in which for example gray matter patches acquired with a 1.5T scanner and a 3T scanner are considered similar. Therefore it has the potential to fully exploit a 1.5T data set with fully labeled brain tissues for segmenting an unlabelled 3T data set.

Overcoming acquisition-variation is a relatively new challenge in medical imaging. Transfer classifiers have been proposed that weight classifiers based on how well their training data matches the test data, such as weighted SVMs [2] and weighted ensembles [3]. However, these classifiers need to be retrained for every new test data set. They neither remove acquisition-variation in general nor extract acquisition-invariant feature vectors for later use by task-specific methods.

We propose to learn a task-independent representation, in which acquisition variation is minimal while tissue variation is maintained. Patches sampled from MRI-scans that are mapped to this new representation will become feature vectors, and can be used by task-specific classifiers later on. In order to minimize one factor of variation while maintaining another, we exploit a Siamese network [4]. The proposed network (mrai-net) is described in Section 2. Experiments on both simulated and real patient data are shown in Section 3.

2 MR acquisition-invariant network

Suppose that we have scans that are acquired in two different ways: A (source) and B (target). A tissue patch, e.g. gray matter, is selected from both scans A and B. The aim is to teach a neural network that both these patches are gray matter. To achieve this, we use a loss function that expresses that pairs of samples from the same tissue but different scanners should be similar. However, if the neural network received only this expression, it would map all patches to a single point and destroy the variation between tissues. To balance out the action of making certain pairs more similar, another expression is added, stating that patches from different tissues, regardless of scanner, should remain dissimilar.

2.1 Siamese loss

Neural networks transform data in each layer. We summarize the total transformation from input to output with the symbol $f$, i.e. patch $a$ will be mapped to the new representation with $f(a)$ and patch $b$ will be mapped with $f(b)$. Distance in the new representation is expressed as $\|f(a) - f(b)\|_p$, where $\|\cdot\|_p$ is an $\ell_p$-norm. Pairs marked as similar ($y=1$) should be pulled together, while those marked as dissimilar ($y=0$) should be pushed apart. The loss for the similar pairs consists of the squared distance, $\|f(a) - f(b)\|_p^2$. The loss for the dissimilar pairs consists of a hinge loss, $\max\{0, m - \|f(a) - f(b)\|_p\}^2$, where $m$ is the margin parameter. Pairs that are pushed past the margin will not suffer a loss.

We can combine the similar and dissimilar losses into a single loss function:

$$\ell(f) = \sum_i y_i \, \|f(a_i) - f(b_i)\|_p^2 + (1 - y_i) \, \max\{0, m - \|f(a_i) - f(b_i)\|_p\}^2$$

where $i$ iterates over pairs. This type of loss function is known as a Siamese loss [4].
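The combined loss can be sketched in a few lines of plain Python, evaluated on pairs that have already been mapped to the new representation. This is an illustration under stated assumptions, not the repository's implementation: the Euclidean norm (p = 2), the squared hinge term in the style of the contrastive loss of [4], the function name `siamese_loss`, the margin value and the toy embeddings are all chosen here for the example.

```python
import math

def siamese_loss(emb_a, emb_b, labels, margin=1.0):
    """Sum of pairwise losses over embedded pairs.

    emb_a, emb_b: lists of equal-length feature vectors (embedded patches).
    labels: 1 for similar pairs, 0 for dissimilar pairs.
    """
    total = 0.0
    for fa, fb, y in zip(emb_a, emb_b, labels):
        # Euclidean distance between the two embedded patches (assumes p = 2).
        dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(fa, fb)))
        # Similar pairs are penalized by their squared distance.
        similar_term = dist ** 2
        # Dissimilar pairs are penalized only while they are inside the margin.
        dissimilar_term = max(0.0, margin - dist) ** 2
        total += y * similar_term + (1 - y) * dissimilar_term
    return total

# Toy 2-D embeddings (hypothetical values).
emb_a = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
emb_b = [[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]]
labels = [1, 0, 0]  # similar, dissimilar (inside margin), dissimilar (outside)
print(siamese_loss(emb_a, emb_b, labels, margin=1.0))
```

In this toy batch the third pair lies at distance 5, well past the margin of 1, so it contributes nothing; only the first two pairs, both at distance 0.5, add to the total.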

2.2 Labeling pairs as similar or dissimilar

Assume that we have sufficient manual segmentations (voxel labels) on scans from scanner A to train a supervised classifier, but a limited number of labels from scanner B. Let $K$ be the set of tissue labels. A patch from scanner A is denoted $a_k$, and a patch from scanner B is denoted $b_k$, with $k \in K$ specifying the current patch's tissue. Given sets of patches, we form similar and dissimilar pairs, designated by a similarity label $y$. The following pairs are labeled as similar ($y=1$): source patches from the same tissue $(a_k, a_k)$, source and target patches from the same tissue $(a_k, b_k)$, and target patches from the same tissue $(b_k, b_k)$. Conversely, the following are labeled as dissimilar ($y=0$), with $k \neq l$: source patches from different tissues $(a_k, a_l)$, source and target patches from different tissues $(a_k, b_l)$, and target patches from different tissues $(b_k, b_l)$.

Let $N_k$ be the number of patches extracted from a scan of scanner A belonging to tissue $k$, and $M_k$ be the number of patches extracted from scanner B of tissue $k$. The number of pairs that can be formed grows with the products of these counts, taken over all combinations of tissues from the set. This combinatorial explosion works in our favor, as it allows us to generate a large training data set from only a few labeled target samples. Figure 1 illustrates the process of selecting pairs of patches from different scanners.
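To make the pairing rule concrete, the sketch below enumerates every unordered pair from a small set of labeled patches and assigns the similarity label by tissue agreement alone. Using all possible pairs is an assumption made here for illustration (the actual pipeline may subsample), and the tuple layout, the function name `label_pairs` and the toy patch set, which mirrors the counts shown in Figure 1, are hypothetical.

```python
from itertools import combinations

def label_pairs(patches):
    """Form all unordered pairs and label them by tissue agreement.

    patches: list of (scanner, tissue, patch_id) tuples.
    Returns a list of (patch_1, patch_2, y) with y = 1 for the same tissue
    and y = 0 for different tissues, regardless of scanner.
    """
    pairs = []
    for p, q in combinations(patches, 2):
        y = 1 if p[1] == q[1] else 0
        pairs.append((p, q, y))
    return pairs

# Hypothetical toy set: per scanner, 2 gray matter patches,
# 1 cerebrospinal fluid patch and 1 white matter patch.
patches = [(scanner, tissue, idx)
           for scanner in ("A", "B")
           for idx, tissue in enumerate(["GM", "GM", "CSF", "WM"])]

pairs = label_pairs(patches)
n_similar = sum(y for _, _, y in pairs)
n_dissimilar = len(pairs) - n_similar
print(len(pairs), n_similar, n_dissimilar)  # prints: 28 8 20
```

Eight labeled patches already yield 28 training pairs, which illustrates how quickly the pair count outgrows the number of labeled samples.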

Figure 1: Illustration of extracting pairs of patches from images from scanner A and B. Each image shows 4 patches: 2 gray matter ones (green), 1 cerebrospinal fluid (blue) and 1 white matter (yellow). The lines mark the 6 types of combinations from Section 2.2 (green = similar, purple = dissimilar).

2.3 Network architecture

The network consists of two pipelines and a Siamese loss layer that acts on the pipelines' output layers. We made the following architectural choices: 15x15 input patches, 8 convolution kernels of size 3x3 with "ReLU" activation functions, a fully-connected layer of size 16, another fully-connected layer of size 8, and a final fully-connected layer of size 2. Dropout was set to 0.2 during training, and we used a standard "RMSprop" optimizer to perform backpropagation. For more implementation details, see the accompanying software repository: github.com/wmkouw/mrai-net. mrai-net is implemented in a combination of TensorFlow and Keras.

Patches represented in the final representation layer are, in fact, feature vectors. The wider the layer, the higher the feature vector dimensionality. The two pipelines share their weights, which means they are constrained to perform the same transformation. This means that single patches can be fed through the network and that it is not necessary to form pairs at test time.
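The layer sizes listed above can be traced through one pipeline with a little arithmetic. The sketch below does so under assumptions that the text does not state: a single-channel input, a stride-1 convolution without padding, no pooling, and the convolution output flattened directly into the first fully-connected layer. The repository may differ on any of these points, so the numbers are indicative only.

```python
def conv_output_side(input_side, kernel_side):
    # Spatial size after a stride-1 convolution without padding (assumption).
    return input_side - kernel_side + 1

def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

patch_side, kernel_side, n_kernels = 15, 3, 8
dense_sizes = [16, 8, 2]

side = conv_output_side(patch_side, kernel_side)
flat = side * side * n_kernels
# Single-channel input assumed: each kernel has 3*3 weights and one bias.
conv_params = n_kernels * (kernel_side * kernel_side + 1)

layer_params = [conv_params]
n_in = flat
for n_out in dense_sizes:
    layer_params.append(dense_params(n_in, n_out))
    n_in = n_out

print("conv output:", (side, side, n_kernels), "flattened:", flat)
print("parameters per layer:", layer_params)
print("total per pipeline:", sum(layer_params))
```

Because the two pipelines share their weights, this total is also the number of trainable parameters of the whole Siamese network under the same assumptions.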

3 Experiment

In this experiment we measure the proxy $\mathcal{A}$-distance between patches from the source and target scanners, and we evaluate a linear classifier trained on mrai-net's feature vectors on a cross-scanner tissue segmentation task.

3.1 Data

We simulated different MR acquisitions from anatomical models of the human brain [5], using the MRI simulator SIMRI [6, 5]. The anatomical models consist of transverse slices of 20 normal brains (Brainweb). We simulated two acquisition types: (1) Brainweb1.5T, a standard gradient-echo acquisition protocol with the same parameters as the MRI-scanner in the Rotterdam Scan Study (B0 = 1.5T, , TR=13.8 ms, TE=2.8 ms) [7], and (2) Brainweb3.0T, a standard gradient-echo protocol with the same parameters as the scanner used for MRBrainS (B0 = 3.0T, , TR=7.9 ms, TE=4.5 ms) [8]. Magnetic field inhomogeneities and partial volume effects are not included in the simulation. There are 9 tissues, but we grouped these into "background", "cerebrospinal fluid", "gray matter", and "white matter". The simulations result in images of 256 by 256 pixels, with a 1.0 x 1.0 mm resolution. Figure 1 shows examples of the Brainweb1.5T (A) and Brainweb3.0T (B) scans of the same subject.

In order to test the proposed method on real data, we use the publicly available training data (5 subjects) from the MRBrainS challenge [8].

3.2 Measuring acquisition variation

The proxy $\mathcal{A}$-distance is a measure of the discrepancy between two data sets [9]. Denoted by $d_{\mathcal{A}}$, it is defined as $d_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ represents the test error of a classifier trained to discriminate patches from scanner A and patches from scanner B. For computing the proxy $\mathcal{A}$-distance, we draw 1500 patches from all source and 1500 from all target scans. A linear support vector machine is trained to discriminate between them, and the cross-validation error is used as $\epsilon$ to produce $d_{\mathcal{A}}$.
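The usual form of the proxy A-distance from [9] is 2(1 - 2ε), with ε the held-out error of a classifier that tries to tell the two scanners apart. The sketch below computes it from a list of scanner predictions. The function names and the toy predictions are hypothetical, and the linear support vector machine itself is left out: only the step from error rate to distance is shown.

```python
def proxy_a_distance(error):
    """Proxy A-distance from a domain classifier's held-out error rate."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must be a rate between 0 and 1")
    return 2.0 * (1.0 - 2.0 * error)

def error_rate(predicted, actual):
    """Fraction of patches whose scanner label was predicted incorrectly."""
    mistakes = sum(p != a for p, a in zip(predicted, actual))
    return mistakes / len(actual)

# Hypothetical scanner predictions for 8 patches (0 = source, 1 = target).
actual    = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 0]
eps = error_rate(predicted, actual)
print(eps, proxy_a_distance(eps))  # prints: 0.25 1.0
```

A classifier at chance level (ε = 0.5) gives a distance of 0, meaning the scanners cannot be told apart, while a perfect classifier (ε = 0) gives the maximum value of 2.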


3.3 Measuring tissue variation

A tissue classifier is used to measure how much variation between tissues is preserved in mrai-net’s representation, specifically gray matter, white matter and cerebrospinal fluid. For evaluation, we use scans from target subjects that have been held back (10 subjects). From these scans, we draw 50 patches per tissue at random, for a total of 1500 patches. We apply the tissue classifier to these test samples and compute the classification error rate.

3.4 Experimental setup

Ultimately, we know that tissue variation is preserved if the extracted feature vectors can be used for tissue segmentation. To that end, we compare a linear support vector machine trained on mrai-net's extracted feature vectors (also referred to as mrai-net) to two other supervised classifiers: (1) source classifier, a convolutional neural network (CNN) trained on samples from the source (4 subjects) and target data (1 subject), and (2) target classifier, a CNN trained on samples from the target data (1 subject). These two classifiers represent two possible scenarios in which one does not account for the differences between the scanners. The network architecture of source and target is the same as that of each pipeline in mrai-net; this rules out that differences in behavior between source, target and mrai-net are due to choices for specific architectures. All classifiers are trained on a range between 1 and 1000 labeled target patches per tissue.

We first performed this experiment using Brainweb1.5T as the source scanner and Brainweb3.0T as the target scanner. Since the same subjects are used, all variation between the data sets is acquisition-based. Secondly, we performed the same experiment using Brainweb1.5T as the source scanner and MRBrainS as the target scanner. Now there are more factors of variation, such as different subjects, environments, partial volume effects and field inhomogeneities.

Figure 2: Learning curves for Brainweb1.5T → Brainweb3.0T (top row) and Brainweb1.5T → MRBrainS (bottom row). (Left column) Proxy $\mathcal{A}$-distance between source and target patches before (red) and after (blue) learning the new representation (smaller is better). (Right column) Tissue classification error for source, mrai-net and target.

3.5 Results

Figure 2 shows the proxy $\mathcal{A}$-distance and the tissue classification error, with an increasing number of labeled target patches used for training. In general, the experiment on the real data (MRBrainS) follows the same pattern as the simulated data. By using mrai-net, the distance between the source and target scanner data sets (proxy $\mathcal{A}$-distance) drops substantially, even with only one labeled target sample per class. With one hundred target training samples the proxy $\mathcal{A}$-distance approaches 0 (small acquisition variation means the data sets overlap), while tissue variation is preserved (tissue classification error 0.11 for simulated data and 0.27 for MRBrainS real patient data). The tissue classification error for the source and target voxel classifiers is 0.21 and 0.37, respectively.

For ten labeled target samples per tissue, mrai-net's error is 0.17 (simulated data) and 0.33 (MRBrainS data), while source still performs at a 0.66/0.64 error (simulated/MRBrainS) and target performs at 0.40/0.49. Given sufficient samples, all three classifiers reach similar performance. Figure 3 illustrates the difference in tissue classification performance when only one labeled target sample per tissue is used for training.

Figure 3: Example segmentations into white matter (yellow), gray matter (green) and cerebrospinal fluid (blue) using only one labeled target patch per class, for Brainweb1.5T → Brainweb3.0T (top row) and Brainweb1.5T → MRBrainS (bottom row). Panels in each row, from left to right: scan, ground truth, source, mrai-net, target.

Note furthermore that source performs worse than target for fewer than roughly 50 samples. In this setting, the scanners are so different that including the source samples in the training set actually interferes with learning. Given enough target samples, however, source finds a good balance between source and target samples and starts to match the performance of target.

4 Conclusion

We proposed a Siamese neural network (mrai-net) to learn a representation of the data where acquisition-based variation is minimal and tissue-based variation is maintained. A linear classifier trained on feature vectors extracted by mrai-net outperforms conventional CNN classifiers trained on the source and target data sets on a cross-scanner tissue segmentation task, when few labeled target samples are available.


  • [1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug 2013.
  • [2] A. van Opbroek, M.A. Ikram, M.W. Vernooij, and M. de Bruijne, “Transfer learning improves supervised image segmentation across imaging protocols,” IEEE Transactions on Medical Imaging, vol. 34, no. 5, pp. 1018–1030, 2015.
  • [3] V. Cheplygina, A. van Opbroek, M.A. Ikram, M.W. Vernooij, and M. de Bruijne, “Asymmetric similarity-weighted ensembles for image segmentation,” in International Symposium on Biomedical Imaging, 2016, pp. 273–277.
  • [4] Raia Hadsell, Sumit Chopra, and Yann LeCun, “Dimensionality reduction by learning an invariant mapping,” in Computer Vision and Pattern Recognition, IEEE Computer Society Conference on. IEEE, 2006, vol. 2, pp. 1735–1742.
  • [5] Berengere Aubert-Broche, Mark Griffin, G Bruce Pike, Alan C Evans, and D Louis Collins, “Twenty new digital brain phantoms for creation of validation image data bases,” IEEE Transactions on Medical Imaging, vol. 25, no. 11, pp. 1410–1416, 2006.
  • [6] Hugues Benoit-Cattin, Guylaine Collewet, Boubakeur Belaroussi, H Saint-Jalmes, and C Odet, “The SIMRI project: a versatile and interactive MRI simulator,” Journal of Magnetic Resonance, vol. 173, no. 1, pp. 97–115, 2005.
  • [7] M Arfan Ikram, Aad van der Lugt, Wiro J Niessen, Peter J Koudstaal, Gabriel P Krestin, Albert Hofman, Daniel Bos, and Meike W Vernooij, “The rotterdam scan study: design update 2016 and main findings,” European Journal of Epidemiology, vol. 30, no. 12, pp. 1299–1315, 2015.
  • [8] Adriënne M Mendrik, Koen L Vincken, Hugo J Kuijf, Marcel Breeuwer, Willem H Bouvy, Jeroen De Bresser, Amir Alansary, Marleen De Bruijne, Aaron Carass, Ayman El-Baz, et al., “MRBrainS challenge: Online evaluation framework for brain image segmentation in 3T MRI scans,” Computational Intelligence and Neuroscience, 2015.
  • [9] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan, “A theory of learning from different domains,” Machine Learning, vol. 79, no. 1, pp. 151–175, 2010.