Very few of the many medical image analysis algorithms that were proposed in the literature are applicable in clinical practice. One of the reasons for this is the complexity of the medical image data, i.e. the vast amount of variation that is present in this data. A more specific example of this, is brain tissue segmentation in MRI scans. Many automatic methods have been proposed bezdek1993review ; zijdenbos1994brain ; clarke1995mri ; saeed1998magnetic ; pham2000current ; suri2002computer ; duncan2004geometric ; balafar2010review , but due to a lack of generalization, large scale use in clinical practice remains a challenge giorgio2013clinical . In order to test the capacity of algorithms to generalize to new data, a representative sample (dataset) is required. This entails identifying all factors of variation in the data that would influence algorithm performance with respect to the medical image analysis task at hand. For brain tissue segmentation in MRI scans, we identify for example subject related variation (i.e. pathology, age, ethnicity, gender) and acquisition related variation (i.e. MR field strength, protocol settings, scanner vendor, artefacts). Supervised voxel classification approaches have been shown to perform well on small data sets van2015transfer ; moeskops2016automatic ; chen2017voxresnet . However, in order to ensure generalization, these algorithms should be trained and tested on a sufficiently large representative dataset that covers all possible types of variation. This is practically infeasible since training and testing require not only the MRI scans, but also manual labels as ground truth. The manual segmentation process is labor intensive and time consuming, and adds another layer of variation due to non-standardized manual segmentation protocols and inter- and intra-observer variability. To address this problem, we propose an alternative approach, by learning a representation of the data bengio2013RepLearning that is invariant to disturbing types of variation, while preserving the variation relevant for the selected classification task, i.e. clinically relevant variation. By reducing undesired variation, this method has the potential to decrease the number of fully labeled samples required for generalization and enable broader use of voxel classification approaches.
Overcoming acquisition-variation is a relatively new challenge in medical imaging. One particularly interesting approach focuses on weighting classifiers based on how well their training data matches the test data van2012supervised ; van2015transfer ; cheplygina2017transfer . Examples of transfer classifiers include weighted SVM’s van2015transfer and weighted ensembles cheplygina2017transfer . But these methods are very dataset-dependent: the classifiers need to be retrained for every new test dataset. Ideally, we would like to have a method that removes acquisition-variation or extracts acquisition-invariant features. Domain adaptation researchers have proposed representation learning methods that explicitly maximize ”domain confusion”: if a classifier cannot distinguish between domains then the representation is domain-invariant tzeng2014deep ; ganin2015unsupervised ; ganin2016domain
. For MRI scans, different scanners or acquisition protocols constitute different data domains. These representation learning methods are variants of deep neural networks, called domain-adversarial networks. They have two layers in which a loss function is computed: one layer for the task-dependent loss, such as tissue or lesion classification, and one that maximizes domain confusion. The networks learn representations in which the data from each domain overlaps while the different classes become separabletzeng2014deep . They are adversarial because the loss layers operate with different objectives, which can make them very difficult to train ganin2015unsupervised ; ganin2016domain ; goodfellow2014generative . A recent paper has applied domain-adversarial networks to segmenting brain lesions kamnitsas2017unsupervised . They achieved excellent performance and provided an in-depth analysis of the adversarial training procedure. However, their networks are still very task-dependent: the learned representation works well for brain lesion segmentation but cannot be used for tumor detection for example. It is not a method for learning a general acquisition-invariant representation.
In this paper, we propose to learn a general representation by marking certain factors of variation as desirable and others as undesired bengio2009learning . Learning a representation by explicitly minimizing undesirable factors of variation while maintaining desirable factors will produce a task-independent representation, which can be used for a variety of tasks later on. In order to minimize certain factors of variation while maintaining others, we exploit a particular type of neural network, referred to as a Siamese network bromley1993signature . Our work was inspired by the work of Hadsell hadsell2006dimensionality , who used Siamese neural networks to learn a lighting-invariant representation for airplane images in the NORB lecun2004NORB dataset. In this paper we aim to provide a proof of principle for learning an MR-acquisition invariant (mrai-net) representation for MR brain tissue segmentation.
To test mrai-net we simulated MRI data (SIMRI benoit2005simri ; aubert2006new ; aubert2006twenty ) from a 1.5T scanner and 3T scanner based on acquisition protocols used to acquire real data (Section 3.3) and real tissue segmentations from healthy adults (Brainweb). In addition we used real patient data (3T) as provided by MRBrainS mendrik2015mrbrains . We acknowledge that the simulated data is idealistic as compared to real patient data. However, experiments in a controlled environment provide a proof of principle to ensure that the method is behaving appropriately. Translation to real patient data is provided by including the MRBrainS data. For the experiments with the simulated data (Section 3.4 and 3.5), the same subject acquired with different acquisition protocols is used. This is however not a prerequisite to train mrai-net. For the experiments that use real patient data, different subjects are used. mrai-net is not trained by using tissue labels, but with patches labeled as similar or dissimilar. Factors of variation that should be preserved should be labeled as dissimilar, mrai-net will aim to reduce all other factors of variation.
2 Magnetic resonance acquisition-invariant network
Neural networks transform data based on minimizing a loss function. In supervised neural networks, labels are used to determine the loss (error between prediction and label). Many labels are required to learn a task. We aim to use as little labels as possible to learn a representation in which the variation over different methods of acquisition is minimal, without destroying the variation relevant to distinguish between brain tissues.
The proposed network works as follows. Suppose that we have scans that are acquired in two different ways (A and B). Possible differences can be in field strength, scanner vendor, acquisition protocol, and so on. A tissue patch, for example gray matter, is selected from both scans A and B. The aim is to teach the network that both these patches are gray matter regardless of their acquisition variation. Therefore, we use a loss function that expresses that in the mrai representation, pairs of samples from the same tissue but from different scanners should be as similar as possible. However, that expression alone would cause all samples to be mapped to a single point and would destroy variation between tissues. To balance out the action of pulling pairs marked as similar together, it is necessary to push other pairs apart hadsell2006dimensionality . Since we want to maintain the relevant variation between tissues, we additionally express that in the mrai representation, pairs from different tissues should retain their dissimilarity. The loss function is described in section 2.1. Section 2.2 describes how pairs of samples are labeled as similar or dissimilar. The Siamese neural network that is used to learn the mrai representation is described in section 2.3. The network consists of two pipelines with shared weights and a Siamese loss layer that acts on the output layer of the two pipelines (mrai representation).
2.1 Siamese loss
Neural networks transform data in each layer. We summarize the total transformation from input to output with the symbol , i.e. patch will be mapped to the new representation with and patch will be mapped with . To find an optimal transformation, we employ a loss function based on distances between pairs of patches in the output representation, i.e. . Pairwise distances are computed through an -norm, denoted by . We used an -norm as opposed to for instance an -norm, because larger values of in -norms either result in problems in high-dimensional spaces or result in problems with the gradient during optimization (see Appendix B).
The loss function for the similar pairs consists of the squared distance, . We chose this formulation in order to express that large distances are less desirable. The loss function for the dissimilar pairs consists of a hinge loss, where the distance is subtracted from a margin parameter and the negative values are set to : . Pairs that lie close together will suffer a loss, while pairs that are pushed sufficiently apart, i.e. past the margin, will not suffer a loss. We discuss the effect of the margin parameter in Section 3.7.
Each pair of patches is marked with a similarity variable; for similar and for dissimilar. Using the similarity label we can combine the similar and dissimilar loss functions into a single loss function:
where iterates over pairs and refers to the whole dataset of pairs.
2.2 Labeling pairs as similar or dissimilar
As described above, suppose we have two medical images from two different scanners; A and B. Assume that we have sufficient manual segmentations (labeled voxels) on scans from scanner A, to train a supervised classifier, but a very limited amount of labels from scanner B, for example labeled voxel per tissue for subject. The data from scanner A will be referred to as the source set, and the data from scanner B as the target set. Let be the set of tissue labels. The set of patches extracted from Scanner A is denoted , and the set from scanner B is denoted , with specifying the sample’s tissue. Given these two sets of patches, we form sets of similar and dissimilar pairs, with a similarity label . The following pairs are labeled as similar () and therefore will be pulled closer together:
Source patches from the same tissue : ,
Source and target patches from the same tissue : ,
Target patches from the same tissue : .
The subscript selects all patches that belong to tissue . The following pairs are labeled as dissimilar () and therefore will be pushed apart:
Source patches from different tissues : ,
Source and target patches from different tissues : ,
Target patches from different tissues : .
Figure 1 illustrates the process of selecting pairs of patches from different scanners. Consider a medical image from scanner A and scanner B, with 2 GM patches (green), 1 WM patch (yellow) and 1 CSF patch (blue) for each image. Using these patches we can generate the following pairs: a GM patch from A with another GM patch from A , a GM patch from A with a GM patch from B , a GM patch from B with another GM patch from B , a GM patch from A with a CSF patch from A , a GM patch from B with a WM patch from B , and a GM patch from B with a CSF patch from B . The bottom of the image shows examples of these 6 pairs of patches.
The pairs are concatenated into a dataset , where iterates over the pairs. In total, the number of combinations is , where refers to the number of source patches from the -th tissue and, likewise, refers to the number of target patches from the -th tissue. The number of pairs that can be generated is very large, even when only a small number of patches is available. For example, taking patches of 3 tissues from 4 source scans and 1 patch of 3 tissues from 1 target scan, results in pairs of patches that can be used for training the deep neural network.
2.3 Network architecture
Figure 2 shows a diagram of the network architecture. The network consists of two pipelines and a Siamese loss layer that acts on the output layers (red nodes). Pairs of patches enter the input layer (black squares) where they are convolved (blue squares) and mapped to feature vectors (blue nodes). The final layer is a low-dimensional feature space (red nodes). The Siamese loss layer (section 2.1) calculates the distance between each pair in their new representation and computes the loss based on whether the pair is marked as similar or dissimilar. The two pipelines share their weights, which means they are constrained to perform the same transformation. During training, the loss is propagated back through the network, adjusting the network weights.
Width and depth of the network may vary. In this paper, we made the following choices: input patches are size and scanner identification is set to a single variable. The convolution block consists of kernels of size. The output of these operations is flattened and the scanner ID ( for source, for target) is appended. The scanner ID ensures that regions of different tissues in different scanners do not overlap in the input space. The flattened and pooled convolutional layer output, plus the scanner ID is then densely mapped to a 16-dimensional representation. A dropout noise of size is set for each edge. This -dimensional representation is then densely mapped, again with a dropout of , to an -dimensional representation, which is finally mapped to a -dimensional representation. We chose a final representation of 2 dimensions because this allows for scatter plot visualizations.
Our method is implemented in a combination of Tensorflow111https://www.tensorflow.org/
and Keras222https://keras.io/ 45381 ; chollet2015keras . This proof of principle uses a 4-layer hybrid convolutional-dense network for the pipeline. However, the network architecture can be changed. Variations involve, for example, more layers, wider layers, larger convolution kernels, and heavier max-pooling. See Section 3.6 for an experiment that varies the layer widths in the network.
During training, we apply an -regularization of to every layer with weights. Regularization punishes the size of the weights, which prevents model over-complexity. In our experiments, the regularization parameter could be increased or decreased by two orders of magnitude with little effect on the networks performance. It is however always necessary to include some regularization as there is not only the danger of overfitting to training data but also the danger of overfitting to the specific target subject used for training.
for more details on optimizer parameters). RMSprop is based on stochastic gradient descent, which splits the dataset into batches and updates the networks parameters after processing each batch. An epoch is the number of times the optimization procedure splits the training set into batches. The number of epochs cannot be too large, otherwise the network starts to overfit to the specific target subject from which the target patches originated.
During experimentation we found that it is important that the batches are well-mixed with respect to the 6 types of pairs outlined in Section 2.2. If this is not the case, such as when one batch mostly consists of similar gray-matter patches and another batch consists mostly of dissimilar gray-matter / white-matter patches, then the network tends to push and pull in the same direction. These actions cancel each other out. The overall effect of having too many uniform batches is that the optimization procedure is slowed down.
2.4 Training and applying mrai-net for adaptive segmentation
Figure 3 illustrates the training procedure of mrai-net. Once it is trained and an MR acquisition-invariant representation is learned, it can be used as a preprocessing step for tissue segmentation (Figure 4). Because of the shared weights, either one of the pipelines can be used to transform the input patches into the mrai representation. Input patches from both the source and target scanner can be fed into the network, and any supervised classification model that uses feature vectors can subsequently be trained to distinguish tissues in the acquisition-invariant representation. Once the supervised classifier is trained, both the trained mrai-net and the trained supervised classifier are used to segment a new image. This is done by feeding a new patch through the mrai-net and letting the tissue classifier predict the label in the MR acquisition invariant space. In this way, the mrai-net acts as a preprocessing step to ensure that acquisition-based variation does not affect the tissue classifier.
3 Evaluating mrai-net
Since the aim of the mrai-net is to preserve variation between tissues while reducing the MR acquisition related variation, two different measures of performance are used to evaluate mrai-net. MR acquisition invariance is measured with the proxy -distance that measures the distance between the source and target scanner patches, as described in section 3.1. The preservation of tissue variation is measured using the tissue classification performance, and compared to supervised classification with CNN (Section 3.2). Section 3.3 describes the simulated (Brainweb 1.5T, Brainweb3.0T) and real data (MRBrainS) used for the experiments. For each experiment a source and target domain was specified. Four source subjects (100 random patches per tissue) and 1 target subject (1-1000 patches per tissue depending on experiment) were used for training. Four independent target subjects (100 random patches per tissue) were used for testing. Four experiments were set-up: 1) Only 1 patch per tissue from the target domain subject is used for training both the supervised CNNs (source, target) as well as the mrai-net followed by a linear classifier on the simulated data (Brainweb1.5T, Brainweb3.0T), 2) Multiple target training samples per tissue (randomly selected with 50 repeats) are used for training the source, target, and mrai -net classifiers for both simulated (Brainweb3.0T) and real patient data (MRBrainS). The first experiment (Section 3.4) was set-up to test if only 1 target patch per tissue would be sufficient in order to learn an MR-acquisition invariant representation. If so, then calibrating a supervised segmentation algorithm for a new scanner using mrai-net would require only three clicks in one scan acquired with a new scanner. The second experiment (Section 3.5) illustrates the performance of the mrai-net compared to the target, and mrai-net classifiers when adding more target training samples (Figure 7). Results of using 1 patch per tissue and 100 patches per tissue from the target subject for training are shown in Figures 8-10. The third experiment (Section 3.6) looks at the performance of the network if we vary the number of convolution kernels and the number of nodes in the dense layers. For the setting where Brainweb1.5T is the source scanner and Brainweb3.0T is the target scanner, the network will keep gaining in performance at the cost of adding tens of thousands more parameters. Finally, the fourth experiment (Section 3.7) shows the influence of the margin parameter on the Siamese loss function. If the margin parameter is set too low, tissue variation will not be preserved. On the other hand, if the margin parameter is set too high the acquisition variation will not be reduced. The next two sections describe how these two types of variation are measured.
3.1 MR acquisition invariance measure
The -divergence can be used as a measure of the discrepancy between the source and target scanner data sets kifer2004detecting ; ben2007analysis ; ben2010theory . This divergence relies on the ability of a classifier to distinguish between domains. If a classifier is not able to distinguish source from target, i.e. has a test error of , then invariance is achieved. Unfortunately, the original -divergence is a measure between distributions and not samples. Since we only have samples, we use its proxy instead: the -distance ben2007analysis ; ben2010theory , as used in ganin2016domain . The proxy -distance, denoted by , is defined as follows:
where represents the test error of a classifier trained to discriminate source samples from target samples . If the source and target data lie far apart, the error will be close to , i.e. perfect separability, and the proxy -distance will be close to . If the source and target data overlap, the error will be around , i.e. no separability (invariance), and the proxy -distance will approach
. We use a linear support vector machine (SVM) as domain classifier.
3.2 Measure of preserving tissue variation
The tissue classification error is used as a measure of tissue variation preservation. The aim is to learn a linearly separable representation with mrai-net, to aid the number of methods that can be used for classification. Therefore, we evaluate the tissue classification error of the samples in the acquisition-invariant representation with a logistic regressor. The classifier is -regularized and cross-validated for optimal regularization parameters. This classifier mrai-net, based on the mrai-net, is compared to two other supervised classifiers: 1) source classifier: a convolutional-dense neural network (CNN) trained on samples from the source (4 subjects) and target data (1 subject), and 2) target classifier: a CNN trained on samples from the target data (1 subject). In order to ensure that differences in performance between source, mrai-net and target are not due to differences between classifiers, the mrai-net (Figure 2) neural network architecture was used for the source and target classifiers as well.
To be able to provide a proof of principle, we simulated different MR acquisitions from various anatomical models of the human brain collins1998design ; aubert2006twenty , using an MRI simulator (SIMRI benoit2005simri ; aubert2006new ; aubert2006twenty ). The anatomical models consist of transverse slices of 20 normal brains and are publicly available through Brainweb333http://www.bic.mni.mcgill.ca/brainweb/. These models were used as input for the MRI simulator. For the experiments, we simulated two acquisition types: Brainweb1.5T, a standard gradient-echo acquisition protocol for a 1.5 Tesla scanner (c.f. ikram2015rotterdam ), and Brainweb3.0T, a standard gradient-echo protocol for a 3.0 Tesla scanner (c.f. mendrik2015mrbrains ). Table 1 describes the parameters used for the simulation: magnetic field strength (B0), flip angle (), repetition time (TR), echo time (TE). Magnetic field inhomogeneities and voxel inhomogeneity (partial volume effects) were not included in the simulation.
|Brainweb1.5T||1.5 Tesla||13.8 ms||2.8 ms|
|Brainweb3.0T||3.0 Tesla||7.9 ms||4.5 ms|
Appendix A describes the nuclear magnetic resonance (NMR) relaxation times for the tissues in the Brainweb anatomical models, for 1.5 and 3.0 Tesla field strengths. The tissues in the anatomical models are grouped into ”background” (BKG), ”cerebrospinal fluid” (CSF), ”gray matter” (GM), and ”white matter” (WM) to compose the ground truth segmentation labels for the simulated scans. The simulations result in images of 256 by 256 pixels, with a 1.0x1.0mm resolution. Figures 4(a) and 4(b) show examples of the Brainweb1.5T and Brainweb3.0T scan of the same subject. For all scans, we used a brain mask to strip the skull.
In order to test the proposed method on real data, we use the publicly available training data (5 subjects) from the MRBrainS challenge444http://mrbrains13.isi.uu.nl/Figure. The acquisition parameters used for simulating the Brainweb3.0T are based on the MRBrainS acquisition protocol (3.0T scanner, gradient-echo, B0 = 3.0T, = flip angle, TE = 4.5ms, and TR = 7.9ms). Figure 4(c) shows an example of an MRBrainS scan. Again, a brain mask is used to strip the skull.
3.4 Experiment 1: One training target sample per tissue
The first experiment with the simulated data tests the scenario described at the beginning of this section: suppose a supervised classification algorithm trained on one scanner needs to be calibrated for a new scanner, would this be possible with three clicks (1 for each tissue type) using mrai-net? To study this, we manually selected 1 patch for each tissue in the target scan (1 subject) and used this data to train mrai-net. Once mrai-net has been trained and an acquisition-invariant representation has been learned, we compute the proxy -distance and perform a tissue classification experiment.
For computing the proxy -distance, we used scans from 10 source subjects and 10 target subjects that had been held back (i.e. we did not draw samples from them to either train mrai-net or train any of the tissue classifiers). We randomly drew 50 patches per tissue from each subject, resulting in two sets of 1500 patches. These patches were fed into mrai-net which mapped them to the new acquisition-invariant representation. The datasets were labeled and for source and target. Next, we trained a linear classifier with 5-fold cross-validation to obtain a test error on data set discrimination. Finally, using this test error and Equation 1, we computed the proxy -distance.
For evaluating the tissue classification performance, we used scans from 10 target subjects that had been held back. From these 10 scans, we drew 50 patches per tissue at random, for a total of 1500 patches. We computed the error rate by computing the proportion of wrong predictions on this test set. We trained the following three classifiers (described in Section 3.2): firstly, the source classifier (CNN) was trained on images from the source dataset, and applied to the test set to make predictions. Secondly, we trained a linear classifier on the source data mapped to mrai-net’s representation. We mapped the test data to mrai-net’s representation as well and applied the trained linear classifier to make predictions. Its performance on the test set is indicated with mrai-net in Table 2. The target classifier (CNN) was applied to the available target patches. In this experiment, there were 3 target patches in total, which is far too little data to train such a large convolutional network. We included its performance to indicate that using the target classifier in this kind of situation is not a sensible option.
For comparison, we performed the same experiment but with randomly selected target patches. Table 2 lists the tissue classification errors of the three classifiers and the proxy -distance between the source and target patches before (raw) and after (rep) applying mrai-net
. The whole experiment was repeated 10 times and the average error rate is reported with the standard error of the mean between brackets.
|manual||0.631 (.02)||0.223 (.01)||0.613 (.01)|
|random||0.667 (.02)||0.250 (.02)||0.610 (.06)|
|1.88 (.01)||0.26 (.05)|
|1.91 (.01)||0.41 (.06)|
Figure 6 displays the manually selected patches and their position within the image. For both the source and target classifier, one target patch per tissue is insufficient to achieve good tissue classification performance (2 (top row): 0.631 and 0.613). However, the mrai-net classifier shows considerably better performance (0.223), using only one target patch per tissue. The proxy -distance also drops from near perfect separability (1.88) to near invariance (0.26). Randomly selecting (10 repeats) 1 target patch per tissue (Table 2 (bottom row)), shows worse performance of the mrai-net classifier, for both the classification error (0.250) as well as the -distance (0.41). Suggesting that purposive (information rich) sampling beats random sampling in this case.
3.5 Experiment 2: Multiple training target samples per tissue
The second experiment tests the performance when adding more target training samples, for both simulated (Brainweb3.0T) and real patient data (MRBrainS). We set-up the following sub-experiments:
2.1) Experiment on simulated data with two different acquisition protocols (Source: Brainweb1.5T, Target: Brainweb3.0T).
2.2) Experiment on 1.5T simulated data and 3.0T real data (Source: Brainweb1.5T, Target: MRBrainS).
2.3) Experiment on 3.0T simulated data and 3.0T real data (Source: Brainweb3.0T, Target: MRBrainS).
Each of these experiments is repeated 50 times. Figure 7 shows the performance (both tissue classification error as well as proxy -distance) as a function of the number of used target training samples. The average error (solid line) and the standard error of the mean (line thickness) is shown, ranging from using 1 target patch up to more than 1000 target patches per tissue for training both the supervised CNNs (source, target) as well as the mrai-net followed by a linear classifier (mrai-net).
Figure 7 (left) shows the proxy -distance between the source and target samples for all three experiments. The proxy -distance for experiments 2.1 and 2.2 shows that in the original representation (raw; red line), the source and target distributions lie far apart (proxy -distance approaches ). This illustrates the difference in acquisition protocol (1.5T versus 3.0T). After applying mrai-net (rep; blue line) the proxy -distance drops drastically (approaches ) showing that the network managed to learn an MR-acquisition invariant representation. Adding more target training samples improves the invariance up to about 100 samples, but the proxy -distance is already quite low after only using 1 target sample per tissue type for training. In experiment 2.3 the proxy -distance before applying mrai-net (raw) is already much lower than in the previous two experiments (around ), this illustrates that the acquisition protocols are more similar to begin with (both 3.0T). The main difference between the distributions presumably results from simulated versus real data, since not all factors of acquisition variation are included in the simulations, most notably partial volume (0.96x0.96x3mm voxels in MRBrainS versus no partial volume in Brainweb). However, after applying mrai-net the proxy -distance is reduced further (approaches ), again showing that mrai-net is able to learn an MR-acquisition invariant representation (rep) on this data, even for simulated and real data. Note that the MRBrainS data adds other modes of variation in terms of pathology and age in comparison to the Brainweb healthy adults, which could influence the tissue classification performance.
Figure 7 (right) shows the tissue classification error for all three experiments. If the proxy -distance between the source and target distribution is high (experiment 2.1 and 2.2), and when using only one target sample per tissue, the source classifier that uses both the source data and target data for training shows worse performance than the one that uses only the target data (target); an error of 0.667 versus 0.591, respectively. Even when adding more target samples for training, the results show that it is more beneficial to train a supervised classifier on the target data alone, instead of on both the source and target data; using 10 target samples for training, source achieves an error of 0.662 versus an error of 0.403 for target. The source classifier is focused on its source samples, which in this case are not informative of the target data. Given enough target samples, however, source starts to shift focus towards its target data and starts to match the performance of target: for 100 target samples, errors of 0.213 versus 0.205 respectively. If the proxy -distance between the source and target distributions is low (distributions are more similar; experiment 2.3), using the source data for training is beneficial; for 1 target sample per tissue source achieves an error of 0.435 and target an error of 0.596. In this case, the source samples are more representative of the target data and are aiding the classifier. In general, the mrai-net classifier outperforms both the source and target classifiers: an error of 0.269 for 1 sample, 0.175 for 10 samples and 0.111 for 100 samples. mrai-net’s representation ensures that the source and target samples are more similar and that the source samples can be effectively used for training.
Examples of the segmentation results on one of the target test images are shown in Figure 8 for experiment 2.1, Figure 9 for experiment 2.2, and Figure 10 for experiment 2.3. Examples are shown after using 1 target patch per tissue for training, and after using 100 target patches per tissue for training. The results show that only the mrai-net classifier is able to predict a segmentation that approaches the ground truth with only 1 target patch per tissue for training (error for experiment 2.1 = 0.269, experiment 2.2 = 0.403, experiment 2.3 = 0.320), while the source and target classifiers cannot (source error for experiment 2.1 = 0.667, experiment 2.2 = 0.653, experiment 2.3 = 0.435; target error for experiment 2.1: 0.591, experiment 2.2: 0.614, experiment 2.3 = 0.596). After using 100 patches the source and target classifiers can predict a gross segmentation of WM, GM and CSF (source error for experiment 2.1 = 0.213, experiment 2.2 = 0.384, experiment 2.3 = 0.363; target error for experiment 2.1: 0.205, experiment 2.2: 0.368, experiment 2.3 = 0.368), but the mrai-net classifier prediction shows more details and a lower tissue classification error (error for experiment 2.1 = 0.111, experiment 2.2 = 0.276, experiment 2.3 = 0.284).
3.6 Experiment 3: number of network parameters
Setting neural network hyperparameters, such as the number of convolution kernels to use, is always a tricky issue. The optimal parameter is different for each dataset, which means there are no easy defaults. In order to get some insight into the behavior of the network for different choices of hyperparameters, we performed an additional experiment. We used experiment 2.1’s setting: Brainweb1.5T as source and Brainweb3.0T as target.
mrai-net has three layers with parameters: a convolution layer and two dense layers. We varied the number of kernels in the convolution layer and the number of nodes in the dense layers. We use the following sets of hyperparameters: [2 kernels, 4 nodes, 4 nodes], [4 kernels, 8 nodes, 4 nodes], [8 kernels, 16 nodes, 8 nodes], [16 kernels, 32 nodes, 16 nodes], [32 kernels, 64 nodes, 32 nodes] and [64 kernels, 128 nodes, 64 nodes] (i.e. the layer widths double each time). The total number of parameters are 322, 1254, 4874, 19218, 76322, and 304194, respectively. We used 10 labeled target patches per classes, from which we generated 18000 pairs of patches. The network was trained for 320 epochs and the experiment was repeated 20 times to obtain standard errors of the means. Figure 11 shows the results: the left figure looks at the proxy -distance as a function of the number of parameters and the right figure looks at the tissue classification error of a linear classifier trained on the resulting representation. For the proxy -distance, the graphs show a steady decrease in distance and then roughly levels off after [8, 16, 8]. This result indicates that an extremely wide mrai-net
(i.e. [64, 128, 64]) will still be able to reduce acquisition variation. As for the tissue classification error, the thin network (i.e. [2, 4, 2]) starts out with a average error rate of 0.28 (underfitting) and drops immediately to 0.18 for [4, 8, 4]. Afterwards, it slowly increases to 0.19. This indicates that the network is not overfitting too drastically yet, which is probably due to the regularization (see Section2.3.1). However, the graph does indicate that its error rate will go up if the number of parameters is increased further.
3.7 Experiment 4: effect of the margin parameter
The margin parameter in the dissimilar loss function, , is important as it balances the actions of pushing and pulling between pairs. For small values, will be much smaller than and the network will focus on pulling pairs together. For large values, will always be much larger than and network will focus on pushing pairs apart. Figure 12 plots a synthetic data setting with the outcome of using three different values for the margin parameter. The left figure shows two synthetic 2-dimensional data sets, one with red versus blue crosses and the other with red versus blue squares. The right figures show validation samples fed through three networks with different values for the margin parameter. Firstly, the right top figure displays the result of using a margin parameter of : the network does not suffer any loss by making pairs of samples of different tissues too similar and consequently maps everything to a single point. Secondly, the right middle figure shows an appropriate choice for the margin, where the two data sets overlap and where red and blue points are separated. Lastly, the right bottom figure shows what happens when a large margin parameter is used: it focuses almost entirely on separating red versus blue and is not making the data sets more similar.
Additionally, the optimal value for the margin parameter is affected by the number of similar versus dissimilar pairs. If there are twice as many similar pairs, then their loss will be twice as large as well and the network will focus more on pulling pairs together. Overall, the more similar pairs there are, the larger the margin parameter will need to be.
In this paper, we proposed a method to learn an MR acquisition invariant (mrai) representation that preserves the variation between brain tissues for segmentation. Once the representation is learned using mrai-net
, any supervised classification model that uses feature vectors can be used to classify the brain tissues. The proposed method addresses the problem that the difference between scans acquired with two different MRI scanners or protocols can be so large that scans from one scanner are not representative of scans from another scanner. This difference does not affect assessment by human vision (e.g. radiologists can perform diagnostic work-up on both), but it does affect computer vision. To get insight into the difference between scans and to assess the performance ofmrai-net to reduce this difference (achieve invariance), the proxy -distance measure between source and target patches was used. The experiments (Figure 7) show that this is a good measure to determine the difference between source and target acquisition, and might be used to predict classifier performance of a source classifier. Note that this measure does not require any tissue labels, and can thus be used as a general measure of distance between scanners. It merely requires source patches to be labeled as source, and target patches to be labeled as target. When the proxy -distance is low (Figure 7 bottom row) the source (source) classifier outperforms the target (target) classifier when a small number of target training patches are used. When the proxy -distance is large the target classifier outperforms the source classifier, even when one target training patch per tissue is used. This suggests that if the proxy -distance is large (source data is not representative of target data), a source classifier trained on the source data should not be applied to the target data. Ground truth labels on the source data that are labor-intensive to acquire can in this case not be used for the target data. However, since mrai-net learns a representation that reduces the acquisition difference between source and target scanner the proxy -distance is drastically reduced. Therefore the mrai-net classifier outperforms both the source and target classifiers, when a small number of target training samples is available, and leverages the source ground truth labels.
Due to the complexity of the problem addressed in this paper, simulated data was used to provide a proof of principle. Ideal real data would require the same subject to be scanned on different scanners with different protocols, after which the scans should be manually segmented to obtain the ground truth for both scans. However, inter-observer variability would add an extra layer of variation. To test mrai-net on real data, the MRBrainS challenge data was used. Although, additional layers of variation include resolution, population and manual segmentation protocol, the experiments (Figure 7) show that the mrai-net performance on real data follows the same pattern as its performance on simulated data, be it with a higher classification error due to additional factors of variation.
A limitation of the proposed method is that learning an mrai representation with mrai-net, will not necessarily work well on data sets with poor contrast between tissues. In that case, the network will both push and pull points in the overlap. Since these actions will mostly cancel each other out, the network will not be able to reduce acquisition-variation without sacrificing tissue variation, and vice versa.
Another limitation is that the proposed mrai-net requires at least 1 sample per tissue from the target scanner. This is not an unreasonable request, as it is not hard to find at least 1 patch per tissue (Section 3.4) in only one subject scanned with the target scanner. However, it may be possible to perform the similar/dissimilar labeling based on assumptions instead. For instance, if one assumes that the registration between two scans is accurate and that the subject-variation is not too large, then one could assume that target patches at certain locations are the same tissue as the source patches at these locations. Hence, those voxels could be used for the similarity-labeling process.
The proposed representation learning method could be used to reduce any type of variation, by adjusting the way that the similar and dissimilar pairs are defined. For example, registration, which can be viewed as variation in position, might be approached in a similar manner simonovsky2016deep . Key is to identify the forms of variation, determine which variation should be preserved and which should be reduced, and to find a way to label them as similar or dissimilar accordingly.
In this paper we addressed one of the major challenges of supervised voxel classification, i.e. generalization to data that is not representative of the training data. We provided a proof of principle for learning an MR acquisition invariant representation that reduces the variation between MRI scans acquired with different scanners or acquisition protocols, while preserving the variation between brain tissues. We showed that the proposed mrai-net is able to learn an MR acquisition invariant representation (low proxy -distance), and outperform supervised convolution neural networks trained on patches from the source or target scanners for tissue classification, when little target training patches are available. By reducing the acquisition related variation using mrai-net, the ground truth labels from the source data can be reused for the target data, since the source and target data are mapped to the same representation achieving generalization.
The research of A.M. Mendrik was financially supported by IMDI Grant 104002002 (Brainbox) from ZonMw, the Netherlands Organisation for Health Research and Development. W.M. Kouw was supported by the Netherlands Organization for Scientific Research (NWO; grant 612.001.301).
Appendix A Nuclear Magnetic Resonance relaxation times
SIMRI requires NMR relaxation times for tissues based on particular magnetic static field strengths benoit2005simri . We performed a literature study for the T1 and T2 relaxation times, the results of which are listed in Table 3. The proton density values stem from yoder2002mri
. The 3.0T CSF parameters were interpolated using an exponential function fit (kroeker1986analysis justifies an exponential function based on physical properties). We equate connective tissue to glial matter (90% of the brain’s connective tissue system is glial matter555http://www.neuroplastix.com/styled-2/page139/styled-42/brainsconnectivetissue.html).
|CSF||100 (0)||4326 (0)||791 (127)||4313 (0)||503 (64)||kroeker1986analysis ; cheng1994vivo ; rooney2007magnetic ; piechnik2009functional|
|GM||86 (.4)||1124 (24)||95 (8)||1820 (114)||99 (7)||stanisz2005t1|
|WM||77 (3)||884 (50)||72 (4)||1084 (45)||69 (3)||stanisz2005t1|
|Fat||100 (0)||343 (37)||58 (4)||382 (13)||68 (4)||de2004mr|
|Muscle||100 (0)||629 (50)||44 (6)||832 (62)||50 (4)||stanisz2005t1 ; barral2009skin|
|Skind||100 (0)||230 (8)||35 (4)||306 (18)||22 (0)||richard1991vivo ; song1997vivo ; barral2009skin|
|Skullb||0 (0)||200 (0)||.46 (0)||223 (11)||.39 (.02)||reichert2005magnetic ; du2010qualitative|
|Gliala||86 (0)||1124 (24)||95 (8)||1820 (114)||99 (7)||cheng1994vivo ; stanisz2005t1|
|Conn. c||77 (0)||1124 (24)||95 (8)||1820 (114)||99 (7)||stanisz2005t1|
Glial matter values are unknown and are imputed with gray matter values.
b T2 values for cortical bone are actually T2* values (UTE seq).
c Equated to glial matter (see text).
d 3.0T T2 relaxation time is from dermis, other values are from hypodermis.
Appendix B -norm minimization
In Section 2.1 we specified the Siamese loss as the networks objective function. The input of this loss consists of a pairwise distance, for which we chose an -norm. There are 2 reasons for this: the first is that -norms with larger values for concentrate densely in high-dimensional spaces flexer2015choosing . Concentration means that the differences between pairwise distances of a set of points become smaller as the number of dimensions increases. This is a problem because the actions of pulling and pushing will not sufficiently decrease the distance between similar pairs or sufficiently increase the distance between dissimilar pairs. The second reason is that the gradient of the -norm is constant, while the gradient of an -norms with are functions of the distance boyd2004convex . Gradients of norms with large ’s become smaller as the distance between pairs becomes smaller, which means the incentive for the network to pull pairs closer decreases. A constant gradient ensures that there will also be a constant incentive to pull similar pairs closer together. Considering that we want our representation to be truly invariant, we want the network to continue to pull similar pairs together until they are as close as possible.
J. C. Bezdek, L. Hall, L. Clarke, Review of MR image segmentation techniques using pattern recognition, Medical Physics 20 (4) (1993) 1033–1048.
- (2) A. P. Zijdenbos, B. M. Dawant, Brain segmentation and white matter lesion detection in MR images, Critical Reviews in Biomedical Engineering 22 (5-6) (1994) 401–465.
- (3) L. Clarke, R. Velthuizen, M. Camacho, J. Heine, M. Vaidyanathan, L. Hall, R. Thatcher, M. Silbiger, MRI segmentation: methods and applications, Magnetic Resonance Imaging 13 (3) (1995) 343–368.
- (4) N. Saeed, Magnetic resonance image segmentation using pattern recognition, and applied to image registration and quantitation, NMR in Biomedicine 11 (4-5) (1998) 157–167.
- (5) D. L. Pham, C. Xu, J. L. Prince, Current methods in medical image segmentation, Annual Review of Biomedical Engineering 2 (1) (2000) 315–337.
- (6) J. S. Suri, S. Singh, L. Reden, Computer vision and pattern recognition techniques for 2-D and 3-D MR cerebral cortical segmentation (part i): a state-of-the-art review, Pattern Analysis & Applications 5 (1) (2002) 46–76.
- (7) J. S. Duncan, X. Papademetris, J. Yang, M. Jackowski, X. Zeng, L. H. Staib, Geometric strategies for neuroanatomic analysis from MRI, Neuroimage 23 (2004) S34–S45.
M. A. Balafar, A. R. Ramli, M. I. Saripan, S. Mashohor, Review of brain MRI image segmentation methods, Artificial Intelligence Review 33 (3) (2010) 261–274.
- (9) A. Giorgio, N. De Stefano, Clinical use of brain volumetry, Journal of Magnetic Resonance Imaging 37 (1) (2013) 1–14.
A. van Opbroek, M. Ikram, M. Vernooij, M. De Bruijne, Transfer learning improves supervised image segmentation across imaging protocols, IEEE Transactions on Medical Imaging 34 (5) (2015) 1018–1030.
- (11) P. Moeskops, M. A. Viergever, A. M. Mendrik, L. S. de Vries, M. J. Benders, I. Išgum, Automatic segmentation of MR brain images with a convolutional neural network, IEEE Transactions on Medical Imaging 35 (5) (2016) 1252–1261.
- (12) H. Chen, Q. Dou, L. Yu, J. Qin, P.-A. Heng, Voxresnet: Deep voxelwise residual networks for brain segmentation from 3D MR images, NeuroImage 2017.
- (13) Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8) (2013) 1798–1828.
A. van Opbroek, M. A. Ikram, M. W. Vernooij, M. de Bruijne, Supervised image segmentation across scanner protocols: A transfer learning approach, in: Machine Learning in Medical Imaging, Springer, 2012, pp. 160–167.
- (15) V. Cheplygina, A. van Opbroek, M. A. Ikram, M. W. Vernooij, M. de Bruijne, Transfer learning by asymmetric image weighting for segmentation across scanners, arXiv preprint arXiv:1703.04981.
- (16) E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, T. Darrell, Deep domain confusion: Maximizing for domain invariance, arXiv preprint arXiv:1412.3474.
- (17) Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in: International Conference on Machine Learning, 2015, pp. 1180–1189.
- (18) Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, V. Lempitsky, Domain-adversarial training of neural networks, Journal of Machine Learning Research 17 (59) (2016) 1–35.
- (19) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- (20) K. Kamnitsas, C. Baumgartner, C. Ledig, V. Newcombe, J. Simpson, A. Kane, D. Menon, A. Nori, A. Criminisi, D. Rueckert, et al., Unsupervised domain adaptation in brain lesion segmentation with adversarial networks, in: International Conference on Information Processing in Medical Imaging, Springer, 2017, pp. 597–609.
- (21) Y. Bengio, et al., Learning deep architectures for AI, Foundations and Trends® in Machine Learning 2 (1) (2009) 1–127.
- (22) J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, R. Shah, Signature verification using a siamese time delay neural network, International Journal of Pattern Recognition and Artificial Intelligence 7 (04) (1993) 669–688.
- (23) R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, Vol. 2, IEEE, 2006, pp. 1735–1742.
- (24) Y. LeCun, F. J. Huang, L. Bottou, Learning methods for generic object recognition with invariance to pose and lighting, in: Computer Vision and Pattern Recognition, Vol. 2, 2004, pp. II–97–104.
- (25) H. Benoit-Cattin, G. Collewet, B. Belaroussi, H. Saint-Jalmes, C. Odet, The SIMRI project: a versatile and interactive MRI simulator, Journal of Magnetic Resonance 173 (1) (2005) 97–115.
- (26) B. Aubert-Broche, A. C. Evans, L. Collins, A new improved version of the realistic digital brain phantom, NeuroImage 32 (1) (2006) 138–145.
- (27) B. Aubert-Broche, M. Griffin, G. B. Pike, A. C. Evans, D. L. Collins, Twenty new digital brain phantoms for creation of validation image data bases, IEEE Transactions on Medical Imaging 25 (11) (2006) 1410–1416.
- (28) A. M. Mendrik, K. L. Vincken, H. J. Kuijf, M. Breeuwer, W. H. Bouvy, J. De Bresser, A. Alansary, M. De Bruijne, A. Carass, A. El-Baz, et al., MRBrainS challenge: Online evaluation framework for brain image segmentation in 3T MRI scans, Computational Intelligence and Neuroscience.
- (29) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, Tensorflow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
- (30) F. Chollet, et al., Keras, https://github.com/fchollet/keras (2015).
- (31) L. Bottou, Stochastic gradient descent tricks, in: Neural Networks: Tricks of the Trade, Springer, 2012, pp. 421–436.
- (32) D. Kifer, S. Ben-David, J. Gehrke, Detecting change in data streams, in: Proceedings of the Thirtieth International Conference on Very Large Data Bases, Vol. 30, VLDB Endowment, 2004, pp. 180–191.
- (33) S. Ben-David, J. Blitzer, K. Crammer, F. Pereira, Analysis of representations for domain adaptation, in: Advances in Neural Information Processing Systems, 2007, pp. 137–144.
- (34) S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. W. Vaughan, A theory of learning from different domains, Machine Learning 79 (1) (2010) 151–175.
- (35) D. L. Collins, A. P. Zijdenbos, V. Kollokian, J. G. Sled, N. J. Kabani, C. J. Holmes, A. C. Evans, Design and construction of a realistic digital brain phantom, IEEE Transactions on Medical Imaging 17 (3) (1998) 463–468.
- (36) M. A. Ikram, A. van der Lugt, W. J. Niessen, P. J. Koudstaal, G. P. Krestin, A. Hofman, D. Bos, M. W. Vernooij, The rotterdam scan study: design update 2016 and main findings, European Journal of Epidemiology 30 (12) (2015) 1299–1315.
- (37) M. Simonovsky, B. Gutiérrez-Becker, D. Mateus, N. Navab, N. Komodakis, A deep metric for multimodal registration, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 10–18.
- (38) D. Yoder, E. Changchien, C. B. Paschal, J. M. Fitzpatrick, MRI simulator with static field inhomogeneity, in: Medical Imaging, International Society for Optics and Photonics, 2002, pp. 592–603.
- (39) R. M. Kroeker, R. M. Henkelman, Analysis of biological NMR relaxation data with continuous distributions of relaxation times, Journal of Magnetic Resonance 69 (2) (1986) 218–235.
- (40) K. H. Cheng, In vivo tissue characterization of human brain by chisquares parameter maps: multiparameter proton T2-relaxation analysis, Magnetic Resonance Imaging 12 (7) (1994) 1099–1109.
- (41) W. D. Rooney, G. Johnson, X. Li, E. R. Cohen, S.-G. Kim, K. Ugurbil, C. S. Springer, Magnetic field and tissue dependencies of human brain longitudinal 1H2O relaxation in vivo, Magnetic Resonance in Medicine 57 (2) (2007) 308–318.
S. Piechnik, J. Evans, L. Bary, R. G. Wise, P. Jezzard, Functional changes in CSF volume estimated using measurement of water T2 relaxation, Magnetic Resonance in Medicine 61 (3) (2009) 579–586.
- (43) G. J. Stanisz, E. E. Odrobina, J. Pun, M. Escaravage, S. J. Graham, M. J. Bronskill, R. M. Henkelman, T1, T2 relaxation and magnetization transfer in tissue at 3T, Magnetic Resonance in Medicine 54 (3) (2005) 507–512.
- (44) C. M. De Bazelaire, G. D. Duhamel, N. M. Rofsky, D. C. Alsop, MR imaging relaxation times of abdominal and pelvic tissues measured in vivo at 3.0 T: preliminary results, Radiology 230 (3) (2004) 652–659.
- (45) J. K. Barral, N. Stikov, E. Gudmundson, P. Stoica, D. G. Nishimura, Skin T1 mapping at 1.5T, 3T, and 7T, in: Proceedings of the 17th Annual Meeting of ISMRM, 2009, p. 4451.
- (46) S. Richard, B. Querleux, J. Bittoun, I. Idy-Peretti, O. Jolivet, E. Cermakova, J.-L. Lévêque, In vivo proton relaxation times analysis of the skin layers by magnetic resonance imaging, Journal of Investigative Dermatology 97 (1) (1991) 120–125.
- (47) H. K. Song, F. W. Wehrli, J. Ma, In vivo MR microscopy of the human skin, Magnetic Resonance in Medicine 37 (2) (1997) 185–191.
- (48) I. L. Reichert, M. D. Robson, P. D. Gatehouse, T. He, K. E. Chappell, J. Holmes, S. Girgis, G. M. Bydder, Magnetic resonance imaging of cortical bone with ultrashort TE pulse sequences, Magnetic Resonance Imaging 23 (5) (2005) 611–618.
- (49) J. Du, M. Carl, M. Bydder, A. Takahashi, C. B. Chung, G. M. Bydder, Qualitative and quantitative ultrashort echo time (UTE) imaging of cortical bone, Journal of Magnetic Resonance 207 (2) (2010) 304–311.
- (50) A. Flexer, D. Schnitzer, Choosing norms in high-dimensional spaces based on hub analysis, Neurocomputing 169 (2015) 281–287.
- (51) S. Boyd, L. Vandenberghe, Convex optimization, Cambridge University Press, 2004.