
Factorised Representation Learning in Cardiac Image Analysis

by   Agisilaos Chartsias, et al.

Typically, a medical image offers spatial information on the anatomy (and pathology) modulated by imaging specific characteristics. Many imaging modalities including Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) can be interpreted in this way. We can venture further and consider that a medical image naturally factors into some spatial factors depicting anatomy and factors that denote the imaging characteristics. Here, we explicitly learn this decomposed (factorised) representation of imaging data, focusing in particular on cardiac images. We propose Spatial Decomposition Network (SDNet), which factorises 2D medical images into spatial anatomical factors and non-spatial imaging factors. We demonstrate that this high-level representation is ideally suited for several medical image analysis tasks, such as semi-supervised segmentation, multi-task segmentation and regression, and image-to-image synthesis. Specifically, we show that our model can match the performance of fully supervised segmentation models, using only a fraction of the labelled images. Critically, we show that our factorised representation also benefits from supervision obtained either when we use auxiliary tasks to train the model in a multi-task setting (e.g. regressing to known cardiac indices), or when aggregating multimodal data from different sources (e.g. pooling together MRI and CT data). To explore the properties of the learned factorisation, we perform latent-space arithmetic and show that we can synthesise CT from MR and vice versa, by swapping the modality factors. We also demonstrate that the factor holding image specific information can be used to predict the input modality with high accuracy.





1 Introduction

Learning good data representations is a long-running goal of machine learning (Bengio et al., 2013a). In general, representations are considered “good” if they capture explanatory (discriminative) factors of the data, and are useful for the task(s) being considered. Learning good data representations for medical imaging tasks poses additional challenges, since the representation must lend itself to a range of medically useful tasks, and work across data from various image modalities.

Within deep learning research there has recently been a renewed focus on methods for learning so-called “factorised” or “disentangled” representations, for example in Higgins et al. (2017) and Chen et al. (2016). A disentangled representation is one in which all information is retained, but is represented as a number of (independent) factors, with each factor corresponding to some meaningful aspect of the data (Bengio et al., 2013a). Factorised representations offer many benefits: for example, they ensure the preservation of information not directly related to the primary task, which would otherwise be discarded, whilst they also facilitate the use of only the relevant aspects of the data as input to later tasks. Furthermore, and importantly, they improve the interpretability of the learned features, since each factor captures a distinct attribute of the data, while also varying independently from the other factors.

1.1 Motivation

Disentangled representations have considerable potential in the analysis of medical data. In this paper we combine recent developments in factorised representation learning with strong prior knowledge about medical image data: that it necessarily decomposes into a spatial “anatomy factor” and a non-spatial “modality factor”.

The use of an anatomy factor that is explicitly spatial (represented as a multi-class semantic map) enables maintaining pixel-level correspondences with the input, and directly supports spatially equivariant tasks such as segmentation and registration. Most importantly, it also allows a meaningful representation of the anatomy that can be generalised to any modality. As we demonstrate below, a spatial anatomical representation is useful for various modality independent tasks, for example in extracting segmentations as well as in calculating cardiac functional indices. It also provides a suitable format for pooling information from various imaging modalities, since the factorisation process ensures a modality-invariant representation.

The non-spatial modality factor captures global image modality information, specifying how the anatomy is rendered in the final image. Maintaining a representation of the modality characteristics allows, among other things, cross-modal synthesis of images between modalities (i.e. image transfer).

Finally, the ability to learn this factorisation using a very limited number of labels is of considerable significance in medical image analysis, as labelling data is tedious and costly. Thus, it will be demonstrated that the proposed factorisation, in addition to being intuitive and interpretable, leads also to considerable performance improvements in segmentation tasks when using a very limited number of labelled images.

Figure 1: A schematic overview of the proposed model. An input image is first encoded to a multi-channel spatial representation, the anatomical factor $s$, using an anatomy encoder $f_A$. Then $s$ can be used as an input to a segmentation network $h$ to produce a multi-class segmentation mask (or to some other task-specific network). The factor $s$, along with the input image, is used by a modality encoder $f_M$ to produce a latent vector $z$ representing the imaging modality. The two representations $s$ and $z$ are combined to reconstruct the input image through the decoder network $g$.

1.2 Overview of the proposed approach

Learning a decomposition of data into a spatial content factor and a non-spatial appearance (or style) factor has been a focus of recent research in computer vision (Huang et al., 2018; Lee et al., 2018) with the aim being to achieve diversity in image translation between domains. However, no consideration has been given to the semantics and the precision of the spatial factor. This is crucial in medical analysis tasks in order to be able to extract quantifiable information directly from the spatial factor. Concurrently with these approaches, Chartsias et al. (2018) aimed to precisely address the need for interpretable semantics by explicitly enforcing the spatial factor to be a binary myocardial segmentation. However, since the spatial factor is a segmentation mask of only the myocardium, remaining anatomies must be encoded in the non-spatial factor, which violates the concept of explicit factorisation into spatial anatomical and non-spatial imaging factors.

In this paper instead we propose the Spatial Decomposition Network (SDNet) that learns a factorised representation of medical images consisting of a spatial map that semantically represents the anatomy, and a non-spatial latent vector containing image modality information. The model’s schematic is shown in Figure 1.

The anatomy is modelled as a multi-channel feature map, where each channel represents different anatomical substructures (e.g. myocardium, left and right ventricles). This spatial representation is categorical with each pixel necessarily belonging to exactly one channel. This strong restriction prevents the binary maps from encoding modality information, resulting in the anatomy factors being modality-agnostic (invariant), and further promotes factorisation of the subject’s anatomy into meaningful topological regions.

On the other hand, the non-spatial factor contains modality-specific information, in particular the distribution of intensities of the spatial regions. We encode the image intensities into a smooth latent space, using a Variational Autoencoder (VAE) loss, such that nearby values in this space correspond to neighbouring values in the intensity space.

Finally, since the representation should retain all information about the input (albeit in two factors), image reconstructions are possible by combining both factors.

In the literature the term “factor” usually refers to either a single dimension of a latent representation, or a meaningful aspect of the data (i.e. a group of dimensions) that can vary independently from other aspects. Here we use factor in the second sense, and we thus learn a representation that consists of a (multi-dimensional) anatomy factor, and a (multi-dimensional) modality factor. Although the individual dimensions of the factors could be seen as (sub-)factors themselves, for clarity we will refer to them as dimensions throughout the paper.

1.3 Contributions

Our main contributions are as follows:

  • With the use of few segmentation labels and a reconstruction cost, we learn a multi-channel spatial representation of the anatomy. We specifically restrict this representation to be semantically meaningful by imposing that it is a discrete categorical variable, such that different channels represent different anatomical regions.

  • We learn a modality representation using a VAE, which allows sampling from a Gaussian distribution in the modality space. This facilitates the decomposition, permits latent space arithmetic, and also allows us to use part of our network as a generative model to synthesise new images.

  • We detail design choices, such as using Feature-wise Linear Modulation (FiLM) (Perez et al., 2018) in the decoder, to ensure that the modality factors do not contain anatomical information, and prevent posterior collapse of the VAE.

  • We demonstrate our method in a multi-class segmentation task, and on different datasets, and show that we maintain a good performance even when training with labelled images from only a single subject.

  • We show that our semantic anatomical representation is useful for other anatomical tasks, such as inferring the Left Ventricular Volume (LVV). More critically, we show that we can also learn from such auxiliary tasks demonstrating the benefits of multi-task learning, whilst also improving the learned representation.

  • Finally, we demonstrate that our method is suitable for multimodal learning, where a single encoder is used with both MR and CT data, and show that information from additional modalities improves segmentation accuracy.

In this paper we advance our preliminary work (Chartsias et al., 2018) in the following aspects: 1) we learn a general anatomical representation useful for multi-task learning; 2) we perform multi-class segmentation (of multiple cardiac substructures); 3) we impose a structure in the imaging factor which follows a multi-dimensional Gaussian distribution, that allows sampling and improves generalisation; 4) we formulate the reconstruction process to use FiLM normalisation (Perez et al., 2018), instead of concatenating the two factors; and 5) we offer a series of experiments using four different datasets to show the capabilities and expressiveness of our representation.

The rest of the paper is organised as follows: Section 2 reviews related literature in representation learning and segmentation. Then, Section 3 describes our proposed approach. Sections 4 and 5 describe the setup and results of the experiments performed. Finally, Section 6 concludes the manuscript.

Figure 2: The architectures of the four networks that make up SDNet. The anatomy encoder is a standard U-Net (Ronneberger et al., 2015) that produces a spatial anatomical representation $s$. The modality encoder is a convolutional network (except for a fully connected final layer) that produces the modality representation $z$. The segmentor is a small fully convolutional network that produces the final segmentation prediction of a multi-class mask (with $L$ classes) given $s$. Finally the decoder produces a reconstruction of the input image from $s$ with its output modulated by $z$ through FiLM normalisation (Perez et al., 2018). The bottom of the figure details the components used throughout the four networks. The anatomical factor’s channels parameter $C$, the modality factor’s size $n_z$, and the number of segmentation classes $L$ depend on the specific task and are detailed in the main text.

2 Related work

Here we review previous work on factorised representation learning, which is typically a focus of research on generative models (Section 2.1). We then review its application in domain adaptation, which is achieved by a factorisation of style and content (Section 2.2). Finally, we review semi-supervised methods in medical imaging, as well as recent literature in cardiac segmentation, since they are related to the application domain of our method (Sections 2.3 and 2.4).

2.1 Factorised representation learning

There has been growing interest in learning independent factors of variation of data distributions. Factors can be the individual dimensions of the latent representation, or groups of these dimensions, and should each capture a meaningful aspect of the data. Several variations of VAE (Kingma and Welling, 2014; Rezende et al., 2014) and Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) have been proposed to achieve such a factorisation. For example, $\beta$-VAE (Higgins et al., 2017) adds a hyperparameter $\beta$ to the KL-divergence constraint, whilst Factor-VAE (Kim and Mnih, 2018) boosts disentanglement by encouraging independence between the marginal distributions. On the other hand, using GANs, InfoGAN (Chen et al., 2016) maximises the mutual information between the generated image and a latent factor using adversarial training, and SD-GAN (Donahue et al., 2018) generates images with a common identity and varying appearance. Combinations of VAE and GANs have also been proposed, for example by Mathieu et al. (2016) and Szabó et al. (2018). Both learn two continuous factors: one dataset specific factor, in their case class labels, and one factor for the remaining information. In order to promote independence of the factors and prevent a degenerate condition where the decoder uses only one of the two factors, mixing techniques have also been proposed (Hu et al., 2017). These ideas have also begun to see use in medical image analysis: Biffi et al. (2018) apply VAE to learn a latent space of 3D cardiac segmentations in order to train a model of cardiac shapes useful for disease diagnosis. Learning factorised features is also used to distinguish between (learned) features specific to a modality from those shared across modalities (Fidon et al., 2017). However, their aim is combining information from multimodal images and not learning semantically meaningful representations.

These methods rely on learning representations in the form of latent vectors. Our method is similar in concept with Mathieu et al. (2016) and Szabó et al. (2018), which both learn a factorisation into known and other residual factors. However, we constrain the known factor to be spatial, since this is naturally related to the anatomy of medical images.

2.2 Style and content factorisation

There is a connection between our task and image translation (also called image transfer or modality transformation), which is the task of rendering one image in the “style” of another. Classic image translation methods do not explicitly model the style of the output image and therefore suffer from style ambiguity, where many outputs correspond to the same style. In order to address this “many to one” problem, a number of models have recently appeared that include an additional latent variable capturing image style. For example, colouring a sketch may result in different images (depending on the colours chosen) thus, in addition to the sketch itself, a vector parameterising the colour choices is also given as input (Zhu et al., 2017).

Our approach here can be seen as similar to a factorisation of an image into “style” and “content” (Gatys et al., 2016; Azadi et al., 2018), where we represent content (i.e. in our case the underlying anatomy) spatially. Similar to our approach, there have been recent factorisation models that also use vector and spatial representations for the style and content respectively (Almahairi et al., 2018; Huang et al., 2018; Lee et al., 2018). The intricacies of medical images differentiate us by necessitating the expression of the spatial content factor as categorical in order to produce a semantically meaningful (interpretable) representation of the anatomy. This discretisation of the spatial factor also prevents the spatial representation from being associated with a particular medical image modality.

2.3 Semi-supervised segmentation

A powerful property of factorised representations is that they can be applied in semi-supervised learning (Almahairi et al., 2018). An important application in medical image analysis is (semi-supervised) segmentation; for a recent review see Cheplygina et al. (2018). As discussed in this review, manual segmentation is a laborious task, particularly as inter-rater variation means multiple labels are required to reach a consensus, and images labelled by multiple experts are very limited. Semi-supervised segmentation has been proposed for cardiac image analysis using an iterative approach and Conditional Random Fields (CRF) post-processing (Bai et al., 2017), and for gland segmentation using GANs (Zhang et al., 2017).

More recent medical semi-supervised image segmentation approaches include Zhao et al. (2018) and Nie et al. (2018). Zhao et al. (2018) address a multi-instance segmentation task in which they have bounding boxes for all instances, but pixel-level segmentation masks for only some instances. Nie et al. (2018) approach semi-supervised segmentation with adversarial learning and a confidence network. Neither approach involves learning factorised representations of the data.

2.4 Cardiac segmentation

We apply our model to the problem of cardiac segmentation, for which there is considerable literature (Peng et al., 2016). The majority of recent methods use convolutional networks with full supervision for multi-class cardiac segmentations, as seen for example in participants of workshop challenges (Bernard et al., 2018). Cascaded networks (Vigneault et al., 2018) are used to perform 2D segmentation by transforming the data into a canonical orientation and also by combining information from different views. Prior information about the cardiac shape has been used to improve segmentation results (Oktay et al., 2018). Spatial correlation between adjacent slices has been explored (Zheng et al., 2018) to consistently segment 3D volumes. Segmentation can also be treated as a regression task (Tan et al., 2017). Finally, temporal information related to the cardiac motion has been used for segmentation of all cardiac phases (Qin et al., 2018; Bai et al., 2018b).

Differently from the above, in this work we focus on learning meaningful spatial factorised representations, and leveraging these for improved semi-supervised segmentation results, and performing auxiliary tasks.

3 Materials and methods

Overall, our proposed model can be considered as an autoencoder, which takes as input a 2D volume slice $x \in X$, where $X \subset \mathbb{R}^{H \times W \times 1}$ is the set of all images in the data, with $H$ and $W$ being the image’s height and width respectively. The model generates a reconstruction through an intermediate factorised representation. The factorised representation comprises a multi-channel spatial map (a tensor) $s \in \{0, 1\}^{H \times W \times C}$, where $C$ is the number of channels, and a multi-dimensional continuous vector factor $z \in \mathbb{R}^{n_z}$, where $n_z$ is the number of dimensions. These are generated respectively by two encoders, modelled as convolutional neural networks, $f_A$ and $f_M$. The two representations are combined by a decoder $g$ to reconstruct the input. In addition to the reconstruction cost, explicit supervision can be given in the form of auxiliary tasks, for example with a segmentation task using a network $h$, or with a regression task as we will demonstrate in Section 5.2. A schematic of our model can be seen in Figure 1 and the detailed architectures of each network are shown in Figure 2.

3.1 Input decomposition

The decomposition process yields representations for the anatomy and the modality characteristics of medical images and is achieved by two dedicated neural networks. Whilst a decomposition could also be performed with a single neural network with two separate outputs and shared layer components, as done in our previous work (Chartsias et al., 2018), we found that by using two separate networks we can more easily control the information captured by each factor, and we can adapt each encoder individually to achieve the desired behaviour.

3.1.1 Anatomical representation

Figure 3: Example of a spatial representation, expressed as a multi-channel binary map. Some channels represent defined anatomical parts such as the myocardium or the left ventricle, and others the remaining anatomy required to describe the input image on the left. Observe how sparse most of the informative channels are.
Figure 4: Spatial representation with no thresholding applied. Each channel of the spatial map also captures the intensity signal in different gray level variations and is not sparse, in contrast to Figure 3. This prevents a clear anatomical factorisation.

The anatomy encoder $f_A$ is a fully convolutional neural network that maps 2D images to spatial representations, $s = f_A(x)$. We use a U-Net (Ronneberger et al., 2015) architecture, containing downsampling and upsampling paths with skip connections between feature maps of the same size, allowing effective fusion of important local and non-local information.

The spatial representation is a feature map consisting of a number of binary channels of the same spatial dimensions as the input image, that is $s \in \{0, 1\}^{H \times W \times C}$, where $C$ is the number of channels. Some channels contain individual anatomical (cardiac) sub-structures, while the other structures, necessary for reconstruction, are freely dispersed in the remaining channels. Figure 3 shows an example of a spatial representation, where the representations of the myocardium, the left and the right ventricle, are clearly visible, and the remaining channels contain the surrounding image structures (albeit more mixed and not being anatomically distinct).

The spatial representation is derived using a softmax activation function to force each pixel of the input to have activations that sum to one across the channels. Since softmax functions encode continuous distributions, we binarise the anatomical representation with the use of a step operator, which acts as a threshold for the pixel values of the spatial variables in the forward pass. During back-propagation the step function is bypassed and updates are applied to the original non-binary representation, as in the straight-through operator (Bengio et al., 2013b).

Thresholding the spatial representation is an integral part of the model’s design and offers two advantages. Firstly, it reduces the capacity of the spatial factor, encouraging it to be a representation of only the anatomy and preventing any modality information from being encoded. Secondly, it enforces a factorisation of the spatial factor in distinct channels, as each pixel can only be active on one channel. To illustrate how binarisation of the channels is key, an example of a non-thresholded spatial factor is shown in Figure 4. Observe that the anatomical variable is not sparse and variations of gray level are evident. In particular, image intensities are encoded spatially, using different grayscale values, allowing a good reconstruction to be achieved without the need of a modality factor, which we explicitly want to avoid.
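The softmax-then-threshold step with a straight-through backward pass can be illustrated with a minimal NumPy sketch. The threshold value of 0.5, the function names and the toy tensor shape are assumptions made for illustration and are not taken from the paper; the backward function simply shows the straight-through idea of passing the gradient through unchanged.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax across the channel axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def binarise_forward(probs, threshold=0.5):
    """Forward pass: hard step on the softmax output (threshold is an assumption)."""
    return (probs > threshold).astype(probs.dtype)

def binarise_backward(grad_output):
    """Straight-through backward pass: the step is treated as the identity."""
    return grad_output

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4, 8)) * 5.0  # toy H x W x C anatomy logits
probs = softmax(logits)
s = binarise_forward(probs)                # binary spatial factor
```

Because the softmax activations of a pixel sum to one, at most one channel can exceed 0.5, so each pixel is active on at most one channel; in this sketch a pixel whose largest activation is below the threshold is left empty in every channel.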

3.1.2 Modality representation

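Given samples of the data $x$ with their corresponding anatomies $s$ (which are deterministically produced using $f_A$), we learn the posterior distribution of latent factors $z$, $q(z \mid x, s)$. Following the VAE principle (Kingma and Welling, 2014), this posterior is encouraged to match a prior distribution $p(z)$ that is set to be an isotropic unit Gaussian $\mathcal{N}(0, I)$. This is achieved by minimising the KL-divergence between $q(z \mid x, s)$ and $p(z)$:

$$L_{KL} = D_{KL}\big(q(z \mid x, s) \,\|\, \mathcal{N}(0, I)\big),$$

where the KL-divergence between two probability distributions $P$ and $Q$, with densities $p$ and $q$, is defined as: $D_{KL}(P \,\|\, Q) = \int p(u) \log \frac{p(u)}{q(u)} \, du$.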

The posterior distribution is modelled with a stochastic encoder as a convolutional network, which encodes the image modality, $z = f_M(x, s)$. Specifically, the stochasticity of the encoder (for a sample $x$ and its anatomy factor $s$) is achieved as follows: $f_M$ produces first the mean and diagonal covariance for an $n_z$-dimensional Gaussian, which is then sampled to yield the final $z$. This sampling is done using the reparameterisation trick, which allows the network to still be trained with back propagation (see Kingma and Welling (2014) for details).
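As an illustration of the two ingredients described above, the following NumPy sketch samples a modality factor with the reparameterisation trick and evaluates the closed-form KL divergence between a diagonal Gaussian and the unit Gaussian prior. It assumes the encoder outputs a mean and a log-variance vector; that parameterisation, the function names and the 8-dimensional toy inputs are illustrative choices rather than details confirmed by the text.

```python
import numpy as np

def sample_z(mean, log_var, rng):
    """Reparameterisation trick: z = mean + std * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

def kl_to_unit_gaussian(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mean = np.ones(8)        # toy posterior mean
log_var = np.zeros(8)    # unit variance in every dimension
z = sample_z(mean, log_var, rng)
kl = kl_to_unit_gaussian(mean, log_var)
```

The randomness enters only through `eps`, so the sample remains a differentiable function of the mean and log-variance, which is what allows training by back-propagation.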

3.2 Segmentation

One important task for the model is to infer segmentation masks $m \in \{0, 1\}^{H \times W \times L}$, where $L$ is the number of anatomical segmentation categories in the training dataset, out of the spatial representation. This is an integral part of the training process because it also defines the anatomical structures that will be extracted from the image. The segmentation network $h$ is a fully convolutional network consisting of two convolutional blocks followed by a final convolution layer (see Figure 2), with the goal of refining the anatomy present in the spatial maps and producing the final segmentation masks, $m = h(s)$.

When labelled data are available, a supervised cost is employed that is based on a differentiable Dice loss (Milletari et al., 2016) between the real segmentation mask $m$ of an image sample $x$ and its predicted segmentation $h(f_A(x))$, in which a small constant $\epsilon$ is added to avoid division by 0. In a semi-supervised scenario, where there are images with no corresponding segmentations, an adversarial loss is defined, based on LeastSquares-GAN (Mao et al., 2018), using a discriminator $D_M$ over masks. Networks $f_A$ and $h$ are trained to maximise the adversarial objective, against $D_M$ which is trained to minimise it:

$$L_{adv} = \mathbb{E}_{x}\big[D_M(h(f_A(x)))^2\big] + \mathbb{E}_{m}\big[(D_M(m) - 1)^2\big].$$

The architecture of the discriminator is based on the DCGAN discriminator (Radford et al., 2015), without Batch Normalization.
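The two segmentation costs can be sketched in NumPy as follows. The exact Dice variant is not recoverable from the text, so the soft Dice form below (with a smoothing constant in numerator and denominator) is an assumption, as are the function names; the adversarial function follows the usual least-squares GAN form with targets 1 for real masks and 0 for predicted masks.

```python
import numpy as np

def dice_loss(mask_true, mask_pred, eps=1e-12):
    """Soft Dice loss between two masks; eps avoids division by zero."""
    intersection = np.sum(mask_true * mask_pred)
    total = np.sum(mask_true) + np.sum(mask_pred)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

def lsgan_objective(d_real, d_fake):
    """Least-squares objective that the mask discriminator minimises.

    d_real: discriminator scores on real masks (pushed towards 1).
    d_fake: discriminator scores on predicted masks (pushed towards 0).
    """
    return np.mean(d_fake ** 2) + np.mean((d_real - 1.0) ** 2)

full = np.ones((4, 4))
empty = np.zeros((4, 4))
perfect = dice_loss(full, full)    # identical masks
disjoint = dice_loss(full, empty)  # no overlap
```

In this sketch the encoder and segmentor would be updated to increase `lsgan_objective`, while the discriminator is updated to decrease it.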

3.3 Image reconstruction

The two factors are combined by a decoder network $g$ to generate an image with the anatomical characteristics specified by $s$ and the imaging characteristics specified by $z$, $\hat{x} = g(s, z)$. The fusion of the two factors acts as an inpainting mechanism where the information stored in $z$ is used to derive the image signal intensities that will be used on the anatomical structures stored in $s$.

The reconstruction is achieved by a convolutional network using four FiLM layers (Perez et al., 2018). Using a small network of two fully connected layers, as shown in Figure 2, $z$ is mapped to $\gamma$ and $\beta$ (for each feature map of the decoder) which are used to normalise the intermediate feature maps of the decoder. Let $F$ be one such feature map. Then each channel $c$ of $F$ is normalised based on learned parameters $\gamma_c$ and $\beta_c$ as follows: $\mathrm{FiLM}(F_c \mid \gamma_c, \beta_c) = \gamma_c F_c + \beta_c$. The decoder and FiLM parameters are learned through the reconstruction of the input images using Mean Absolute Error:

$$L_{rec}^{X} = \mathbb{E}_{x}\big[\, \| x - g(f_A(x), f_M(x, f_A(x))) \|_1 \big].$$

The design of the decoding process restricts the type of information stored in $z$ to only affect the intensities of the produced image. This is important in the factorisation process as it ensures that $z$ cannot contain any spatial anatomical information.
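A minimal NumPy sketch of feature-wise modulation shows why this design keeps spatial information out of the modality factor: each channel receives a single scale and a single offset, so the modulation cannot add spatial structure that is not already present in the feature map. The function names and the toy shapes are illustrative assumptions, and the fully connected layers that would predict the scale and offset from the modality factor are omitted.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: one scale and offset per channel.

    features: array of shape (H, W, C); gamma, beta: arrays of shape (C,).
    """
    return gamma * features + beta

def mean_absolute_error(x, x_rec):
    """Reconstruction cost between an image and its reconstruction."""
    return np.mean(np.abs(x - x_rec))

features = np.ones((4, 4, 3))        # spatially constant toy feature map
gamma = np.array([1.0, 2.0, 3.0])    # would be predicted from the modality factor
beta = np.array([0.0, 1.0, -1.0])
modulated = film(features, gamma, beta)
```

A spatially constant input stays spatially constant in every channel after modulation; only its intensity level changes.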

The decoder can also be interpreted as a conditional generative model, where different samples of $z$ conditioned on a given $s$ generate images of the same anatomical properties, but with different appearances. The reconstruction process is the opposite of the decomposition process, i.e. it learns the dependencies between the two factors in order to produce a realistic output.

3.3.1 Modality factor reconstruction

A common problem when training VAEs is posterior collapse: a degenerate condition where the decoder ignores some factors. In this case, even though the reconstruction is accurate, not all data variation is captured in the underlying factors.

In our model posterior collapse manifests when some modality information is spatially encoded within the anatomical factor. (Note that while using FiLM prevents $z$ from encoding spatial information, it does not prevent posterior collapse, i.e. the case where $s$ encodes all or part of the modality information.) To overcome this we use a reconstruction cost, according to which an image produced by a random sample $z$ should produce the same modality factor when (re-)encoded.

The faithful reconstruction of the modality factor $z$ penalises the VAE for ignoring dimensions of the latent distribution and encourages each encoded image to produce a low variance Gaussian. This is in tension with the KL divergence cost which is optimal when the produced distribution is a spherical Gaussian of zero mean and unit variance. A perfect score of the KL divergence results in all samples producing the same distribution over $z$, and thus the samples are indistinguishable from each other based on $z$. Without the modality factor reconstruction cost, the overall cost function can be minimised if imaging information is encoded in $s$, thus resulting in posterior collapse. Reconstructing the modality factor prevents this, and results in an equilibrium where a good reconstruction is possible only with the use of both factors.
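The modality factor reconstruction cost can be sketched with toy stand-ins for the decoder and the modality encoder. Everything named here is hypothetical: the two decoders, the toy encoder, the use of an absolute-error distance, and the simplification of re-using the given anatomy factor instead of re-encoding it from the generated image. The sketch only illustrates that a decoder which ignores the sampled factor is penalised.

```python
import numpy as np

def modality_reconstruction_cost(s, decoder, modality_encoder, n_z, rng, n_samples=16):
    """Average |z - z_rec| over random samples z ~ N(0, I) (distance is an assumption)."""
    costs = []
    for _ in range(n_samples):
        z = rng.standard_normal(n_z)
        x_gen = decoder(s, z)               # image rendered from anatomy s and sample z
        z_rec = modality_encoder(x_gen, s)  # re-encode the generated image
        costs.append(np.mean(np.abs(z - z_rec)))
    return float(np.mean(costs))

# Toy anatomy with two disjoint regions, one per channel.
s = np.zeros((4, 4, 2))
s[:2, :, 0] = 1.0
s[2:, :, 1] = 1.0

def faithful_decoder(s, z):
    """Paints region k with intensity z[k], so z is recoverable from the image."""
    return z[0] * s[..., 0] + z[1] * s[..., 1]

def collapsed_decoder(s, z):
    """Ignores z entirely, mimicking posterior collapse."""
    return s[..., 0]

def toy_encoder(x, s):
    """Reads back the mean intensity of each region."""
    return np.array([x[s[..., 0] == 1].mean(), x[s[..., 1] == 1].mean()])

rng = np.random.default_rng(0)
cost_ok = modality_reconstruction_cost(s, faithful_decoder, toy_encoder, 2, rng)
cost_collapsed = modality_reconstruction_cost(s, collapsed_decoder, toy_encoder, 2, rng)
```

With the faithful decoder the sampled factor is recovered and the cost is essentially zero, whereas the collapsed decoder yields the same re-encoded factor for every sample and therefore a clearly positive cost.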

4 Experimental Setup

4.1 Data

In our experiments we use 2D images from four datasets, which have been normalised to the range [-1, 1].

  1. For the semi-supervised segmentation experiment (Section 5.1) and the latent space arithmetic (Section 5.5) we use data from the 2017 Automatic Cardiac Diagnosis Challenge (ACDC) (Bernard et al., 2018). This dataset contains cine-MR images acquired in 1.5T and 3T MR scanners, with resolution between 1.22 and 1.68 mm²/pixel and a number of phases varying between 28 and 40 images per patient. We resample all volumes to 1.37 mm²/pixel resolution. In total there are images of 100 patients, for which manual segmentations are provided for the left ventricular cavity (LV), the myocardium (MYO) and the right ventricle (RV), corresponding to the end systolic (ES) and end diastolic (ED) cardiac phases.

  2. We also use private data acquired at Edinburgh Imaging Facility QMRI with a 3T scanner. The dataset contains cine-MR images of 26 healthy volunteers each with approximately 30 cardiac phases (frames). The spatial resolution is 1.406 mm²/pixel with a slice thickness of 6 mm, matrix size and a field of view . This dataset is used in the semi-supervised segmentation and multi-task experiments of Sections 5.1 and 5.2 respectively. Manual segmentations of the left ventricular cavity (LV) and the myocardium (MYO) are provided, corresponding to the ES and ED cardiac phases.

  3. To demonstrate multimodal segmentation and modality transformation (Section 5.3), as well as modality estimation (Section 5.4), we use data from the 2017 Multi-Modal Whole Heart Segmentation (MM-WHS) challenge, made available by Zhuang et al. (2010), Zhuang (2013), and Zhuang and Shen (2016). This contains 40 anonymised volumes, of which 20 are cardiac CT/CT angiography (CTA) and 20 are cardiac MRI. The CT/CTA data were acquired in the axial view at Shanghai Shuguang Hospital, China, using routine cardiac CTA protocols. The in-plane resolution is about and the average slice thickness is . The MRI data were acquired at St. Thomas hospital and Royal Brompton Hospital, London, UK, using 3D balanced steady state free precession (b-SSFP) sequences, with about acquisition resolution at each direction and reconstructed (resampled) into about . All data have manual segmentations of seven heart substructures: myocardium (MYO), left atrium (LA), left ventricle (LV), right atrium (RA), right ventricle (RV), ascending aorta (AO) and pulmonary artery (PA). Data preprocessing is as in Chartsias et al. (2017).

  4. Finally, we use cine-MR and CP-BOLD images of 10 canines to further evaluate modality estimation (Section 5.4). 2D images with an in-plane resolution of were acquired at baseline and severe ischemia (inflicted as controllable stenosis of the left-anterior descending coronary artery (LAD)) on a 1.5T Espree (Siemens Healthcare) on the same instrumented canines. The image acquisition is at short axis view, covering the mid-ventricle, and is performed using cine-MR and a flow and motion compensated CP-BOLD acquisition. The pixel resolution is (Tsaftaris et al., 2013). This dataset (whilst private) is ideal to show complex spatio-temporal effects as it images the same animal with and without disease and using two almost identical sequences with the only difference that CP-BOLD modulates pixel intensity with the level of oxygenation present in the tissue.

4.2 Model and training details

The overall cost function is a composition of the individual costs of each of the model’s components and is defined as:

The weighting parameters are set to $\lambda_1 = 0.01$, $\lambda_2 = 10$, $\lambda_3 = 10$, $\lambda_4 = 1$, $\lambda_5 = 1$. We adopt the value of $\lambda_1$ from Zhu et al. (2017), who also train a VAE for modelling intensity variability. Separating the anatomy into segmentation masks is a difficult task, and is also in tension with the reconstruction process, which pushes parts with similar intensities to be in the same channels. This motivates our decision to increase the values of the segmentation hyperparameters $\lambda_2$ and $\lambda_3$.

We set the dimension of the modality factor to 8, as in Zhu et al. (2017), across all datasets. We also set the number of channels of the spatial factor to 8 for ACDC and QMRI and increase it to 16 for MM-WHS, to support the increased number of segmented regions (7 in MM-WHS).

We minimise the cost function using Adam (Kingma and Ba, 2014) with a learning rate of 0.0001 and a decay of 0.0001 per epoch. We used a batch size of 4 and an early stopping criterion based on the segmentation cost of a validation set. All code was developed in Keras (Chollet et al., 2015). The quantitative results of Section 5 are obtained through 3-fold cross validation, where each split contains a proportion of the total volumes of 60%, 20% and 20% corresponding to training, validation and test sets.
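The 60%/20%/20% volume-level splits can be illustrated with a short sketch. How the three folds were actually drawn is not stated, so the rotation scheme below, the function name `make_folds` and the seed are assumptions made purely for illustration.

```python
import random


def make_folds(volume_ids, n_folds=3, seed=0):
    """Return n_folds (train, val, test) splits of roughly 60/20/20 proportions.

    Assumption: each fold uses a different, non-overlapping test block,
    obtained by rotating one shuffled list of volume identifiers.
    """
    ids = list(volume_ids)
    random.Random(seed).shuffle(ids)
    n_test = len(ids) // 5   # 20% of the volumes
    n_val = len(ids) // 5    # 20% of the volumes
    folds = []
    for k in range(n_folds):
        shift = k * n_test
        rotated = ids[shift:] + ids[:shift]
        test = rotated[:n_test]
        val = rotated[n_test:n_test + n_val]
        train = rotated[n_test + n_val:]   # remaining ~60%
        folds.append((train, val, test))
    return folds


folds = make_folds(range(100))
for train, val, test in folds:
    print(len(train), len(val), len(test))
```

Splitting by volume rather than by image keeps all slices of a subject in the same set, which avoids leaking anatomy between training and test data.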

4.3 Baseline and benchmark methods

We evaluate our model’s segmentation accuracy by comparing with one fully supervised and two semi-supervised methods described below:

  1. We use U-Net (Ronneberger et al., 2015) as a fully supervised baseline because of its effectiveness in various medical segmentation problems, and also since it is frequently used by the participants of the two cardiac challenges MM-WHS and ACDC.

  2. We add an adversarial cost using a mask discriminator to the fully-supervised U-Net, enabling its use in semi-supervision. This can also be considered as a variant of SDNet without the reconstruction cost. We refer to this method as GAN in Section 5.

  3. Finally, we use the self-train method of Bai et al. (2017), which proposes an iterative method of using unlabelled data to retrain a segmentation network. In the original paper a Conditional Random Field (CRF) post-processing is applied. Here we use U-Net as a segmentation network and we do not perform any post-processing for a fair comparison with the other methods we present.

5 Results and Discussion

We here present and discuss quantitative and qualitative results of our method in various experimental scenarios. Initially, multi-class semi-supervised segmentation is evaluated in Section 5.1. Subsequently, Section 5.2 demonstrates multi-task learning with the addition of a regression task in the training objectives. In Section 5.3, SDNet is evaluated in a multimodal scenario by concurrently segmenting MR and CT data. In Section 5.4 we investigate whether the modality factor captures multimodal information. Finally, Section 5.5 demonstrates properties of the factorisation using latent space arithmetic, in order to show how the spatial and modality factors interact to reconstruct images.

5.1 Semi-supervised segmentation

We evaluate the utility of our method in a semi-supervised experiment, in which we combine labelled images with a pool of unlabelled images to achieve multi-class semi-supervised segmentation. Specifically, we explore the sensitivity of SDNet and the baselines of Section 4.3 to the number of labelled examples, by training with various numbers of labelled images. Our objective is to show that we can achieve comparable results to a fully supervised network using fewer annotations. [Footnote 2: It is not our focus to show state-of-the-art results in supervised learning (100% of labelled data). Better performance could be obtained by adapting some of the approaches (such as hyperparameter tuning and post-processing) of segmentation challenge participants, e.g. Baumgartner et al. (2018).]

To simulate a more realistic clinical scenario, sampling of the labelled images does not happen over the full image pool, but at a subject level: initially, a number of subjects is sampled, and then all images of these subjects constitute the labelled dataset. The number of unlabelled images is fixed and set equal to 1200 images: these are sampled equally at random from all subjects and from cardiac phases other than End Systole (ES) and End Diastole (ED) (for which no ground truth masks exist). The real segmentation masks used to train the mask discriminator are taken from the set of image-mask pairs from the same dataset.
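The subject-level sampling described above can be sketched as follows. The function names, the toy identifiers and the reading of "sampled equally" as "the same number of images per subject" are assumptions for illustration only.

```python
import random


def sample_labelled(images_by_subject, n_subjects, seed=0):
    """Pick whole subjects first, then keep every labelled image of those subjects."""
    rng = random.Random(seed)
    subjects = rng.sample(sorted(images_by_subject), n_subjects)
    return [img for s in subjects for img in images_by_subject[s]]


def sample_unlabelled(frames_by_subject, n_images=1200, seed=0):
    """Draw the same number of unlabelled frames from every subject.

    Assumption: n_images is divisible by the number of subjects; otherwise
    the pool is slightly smaller than n_images.
    """
    rng = random.Random(seed)
    subjects = sorted(frames_by_subject)
    per_subject = n_images // len(subjects)
    pool = []
    for s in subjects:
        pool.extend(rng.sample(frames_by_subject[s], per_subject))
    return pool


# Toy data: 60 training subjects, 10 labelled (ED/ES) images and
# 100 unlabelled frames from other cardiac phases per subject.
labelled = {f"subj{i:02d}": [f"subj{i:02d}_lab_{k}" for k in range(10)] for i in range(60)}
unlabelled = {f"subj{i:02d}": [f"subj{i:02d}_frame_{k}" for k in range(100)] for i in range(60)}

labelled_set = sample_labelled(labelled, n_subjects=2)
unlabelled_set = sample_unlabelled(unlabelled)
print(len(labelled_set), len(unlabelled_set))
```

Sampling at the subject level means a reduced labelled set never mixes slices from subjects that an annotator would not have fully contoured.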

In order to test the generalisability of all methods to different types of images, we use two cine-MR datasets: ACDC which contains masks of the LV, MYO and RV; and QMRI which contains masks of the LV and MYO. Spatial augmentations by rotating inputs up to are applied to experiments using ACDC data to better simulate the orientation variability of the dataset. No augmentations are applied in experiments using QMRI data since all images maintain a canonical orientation. No further augmentations have been performed to fairly compare the effect of the different methods.

We present the average cross-validation Dice score across all labels, along with its standard deviation on held out test sets in Tables 1 and 2 for the two datasets respectively. The best results are shown in bold font, and an asterisk indicates statistical significance at the 5% level, compared to the second best result, computed using a paired t-test. In both tables the lowest amount of labelled data (1.5% for Table 1 and 6% for Table 2) corresponds to images selected from one subject. Segmentation examples for ACDC data using different numbers of labelled images are shown in Figure 5, where different colours are used for the different segmentation classes.
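For reference, a minimal sketch of a per-label Dice score averaged over classes is given here. It assumes label maps stored as nested lists of integer class indices and treats two empty masks as a perfect match; both choices are illustrative and may differ from the evaluation code behind the tables.

```python
def dice(pred, target):
    """Dice overlap between two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pred_labels, target_labels, classes):
    """Average the Dice score over the given foreground classes."""
    scores = []
    for c in classes:
        pred_c = [[int(v == c) for v in row] for row in pred_labels]
        target_c = [[int(v == c) for v in row] for row in target_labels]
        scores.append(dice(pred_c, target_c))
    return sum(scores) / len(scores)


# Toy 2x3 label maps: 0 = background, 1 and 2 are two foreground labels.
target = [[0, 1, 1],
          [0, 2, 2]]
pred = [[0, 1, 0],
        [0, 2, 2]]
print(mean_dice(pred, target, classes=[1, 2]))
```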

For both datasets, when the number of annotated images is high, all methods perform equally well, although our method achieves the lowest variance. In Table 1 the performance of the supervised (U-Net) and self-trained methods decreases when the number of annotated images reduces below 25%, since the limited annotations are not sufficiently representative of the data. When using data from one or two subjects, these two methods, which mostly rely on supervision, fail with a Dice score below 0.3. On the other hand, even when the number of labelled images is small, the adversarial training used by SDNet and GAN helps maintain a good performance. The reconstruction cost used by our method further regularises training and consistently produces more accurate results that are also significantly better in a statistical sense.

In Table 2 with the smaller QMRI dataset, the segmentation task is easier since there are two masks to segment (instead of three) and there is also less pixel intensity and orientation variation in the images, since all are acquired using a single 3T scanner under a standard acquisition protocol, and all in a canonical orientation. As a result, the overall Dice scores are higher than those presented for the ACDC data (in Table 1). When using annotated images from one subject, the performance of the supervised method reduces by almost 50% compared to when using the full dataset. GAN and SDNet both maintain a good performance of 0.75 and 0.79 respectively, with no significant differences between them. Note that in this experiment using labelled images from just a single subject corresponds to 6% of the data, which is why this is the lowest percentage used.

labels   U-Net         GAN           self-train    SDNet
100%     0.79 ± 0.08   0.72 ± 0.09   0.80 ± 0.10   0.80 ± 0.06
50%      0.76 ± 0.09   0.76 ± 0.07   0.77 ± 0.10   0.78 ± 0.06
25%      0.73 ± 0.09   0.72 ± 0.11   0.74 ± 0.09   0.76 ± 0.08*
12.5%    0.54 ± 0.17   0.73 ± 0.08   0.43 ± 0.23   0.73 ± 0.07
6%       0.55 ± 0.11   0.70 ± 0.10   0.44 ± 0.16   0.72 ± 0.07
3%       0.30 ± 0.21   0.60 ± 0.10   0.20 ± 0.14   0.64 ± 0.10*
1.5%     0.05 ± 0.02   0.58 ± 0.11   0.05 ± 0.02   0.61 ± 0.10*
Table 1: Dice scores on ACDC (LV, MYO, RV). For training, 1200 unlabelled and different numbers of labelled images were used. The best result is shown in bold font and an asterisk indicates statistical significance at the 5% level compared to the second best.
labels   U-Net         GAN           self-train    SDNet
100%     0.83 ± 0.07   0.86 ± 0.04   0.86 ± 0.07   0.86 ± 0.04
50%      0.74 ± 0.15   0.83 ± 0.05   0.79 ± 0.09   0.84 ± 0.05
25%      0.69 ± 0.10   0.81 ± 0.05   0.49 ± 0.26   0.80 ± 0.08
12.5%    0.65 ± 0.07   0.79 ± 0.07   0.58 ± 0.14   0.80 ± 0.07
6%       0.43 ± 0.20   0.75 ± 0.10   0.13 ± 0.07   0.79 ± 0.05
Table 2: Dice scores on QMRI (LV, MYO) data. For training, 1200 unlabelled and different numbers of labelled images were used. The best result is shown in bold font.
Figure 5: Segmentation example for different numbers of labelled images from the ACDC dataset. Blue, green and red show the model's prediction for MYO, LV and RV respectively.

5.2 Left ventricular volume

It is common for clinicians to not manually annotate all endocardium and epicardium contours for all patients if it is not necessary. Rather, a mixture of annotations and other metrics of interest will be saved at the end of the study in the electronic health record. For example, we can have a scenario with images of some patients that contain myocardium segmentations and some images with the value of their left ventricular volume. Here we test our model in such a multi-task scenario and show that we can benefit from such auxiliary and mixed annotations. We will evaluate, firstly whether our model is capable of predicting a secondary output related to the anatomy (the volume of the left ventricle), and secondly whether this secondary task improves the performance of the main segmentation task.

Using the QMRI dataset, we first calculate the ground truth left ventricular volume (LVV) for each patient as follows: for each 2D slice, we first sum the pixels of the left ventricular cavity, then multiply this sum with the pixel resolution to get the corresponding area and then multiply the result with the slice thickness to get the volume occupied by each slice. The final volume is the sum of all individual slice volumes.
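The ground truth volume calculation just described amounts to a few lines of arithmetic. In the sketch below the function name, the toy masks, the unit assumptions (pixel area in mm² and thickness in mm) and the final conversion to millilitres are illustrative additions.

```python
def lv_volume_ml(lv_masks, pixel_area_mm2, slice_thickness_mm):
    """Sum per-slice volumes of the left ventricular cavity.

    lv_masks: one binary mask (nested lists of 0/1) per 2D slice.
    Assumption: area is given in mm^2 per pixel and thickness in mm,
    so the result in mm^3 is divided by 1000 to obtain millilitres.
    """
    volume_mm3 = 0.0
    for mask in lv_masks:
        n_pixels = sum(sum(row) for row in mask)       # pixels inside the cavity
        slice_area = n_pixels * pixel_area_mm2         # area of the cavity in this slice
        volume_mm3 += slice_area * slice_thickness_mm  # volume occupied by this slice
    return volume_mm3 / 1000.0


# Toy example: two tiny slices with 3 and 2 cavity pixels.
masks = [
    [[0, 1, 1],
     [0, 1, 0]],
    [[0, 1, 0],
     [0, 1, 0]],
]
print(lv_volume_ml(masks, pixel_area_mm2=1.406, slice_thickness_mm=6.0))
```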

Predicting the LVV as another output of SDNet follows a similar process to the one used to calculate the ground truth values. We design a small neural network consisting of two convolutional layers (each having a kernel followed by a ReLU activation), and two fully connected layers of 16 and 1 neurons respectively, both followed by a ReLU activation. This network regresses the sum of the pixels of the left ventricle, taking as input the spatial representation. The predicted sum can then be used to calculate the LVV offline.

Using a pre-trained model of labelled images corresponding to one subject (last row in Table 2 with 6% labels), we fine-tune the whole model whilst training the area regressor using ground truth values from 17 subjects. We find that the average LVV over the test volumes (over both ED and ES frames) is 59.37 with a standard deviation of 3.7, which is in the normal range as reported in Bai et al. (2018a). The multi-task objective used to fine-tune the whole model also benefits test segmentation accuracy, which is raised from 0.756 to 0.832. [Footnote 3: The multi-task objective in fact benefits the Dice score for both labels individually: MYO accuracy rises from 0.633 to 0.706 and LV accuracy rises from 0.819 to 0.899.] While this is for a single split, observe that using LVV as an auxiliary task effectively brought us closer to the range of having 50% annotated masks (second row in Table 2). Thus, auxiliary tasks, such as LVV prediction, which are related to the endocardial border segmentation can be used to train models in a multi-task setting and leverage supervision present in typical clinical settings.

5.3 Multimodal learning

By design, our model separates the anatomical factor from the image modality factor. As a result, it can be trained using multimodal data, with the spatial factor capturing the common anatomical information and the non-spatial factor capturing the intensity information unique to each image’s particular modality. Here we evaluate our model using a multimodal MR and CT input to achieve segmentation (Section 5.3.1) and modality transformation (Section 5.3.2).

5.3.1 Multimodal segmentation

We train SDNet using a multimodal input of MR and CT data with the aim to improve learning of the anatomical factor from both MR and CT segmentation masks. In fact, we show below that when mixing data from MR and CT images, we improve segmentation compared to when using each modality separately. Since the aim is to specifically evaluate the effect of multimodal training in segmentation accuracy, unlabelled images are not considered here as part of the training process, and the models are trained with full supervision only.

In Table 3 we present the Dice score, over held out MR and CT test sets, obtained when training a model with differing amounts of MR and CT data. Results for 12.5% of data correspond to images obtained from one subject. Training with multimodal data leads to improvements in both individual MR and CT performances. This is the case even when we add 12.5% of CT to the full MR dataset, and vice versa; this does not only improve the MR (increasing from 0.74 to 0.76), but also the CT performance (increasing from 0.23 to 0.56).

MR train   CT train   MR test       CT test
100%       100%       0.78 ± 0.05   0.80 ± 0.01
12.5%      100%       0.39 ± 0.07   0.81 ± 0.01
100%       12.5%      0.76 ± 0.03   0.56 ± 0.06
100%       0%         0.74 ± 0.02   -
12.5%      0%         0.27 ± 0.12   -
0%         100%       -             0.77 ± 0.04
0%         12.5%      -             0.23 ± 0.07
Table 3: Dice score on MM-WHS (LV, RV, MYO, LA, RA, PA, AO) data, when training with different mixtures of MR and CT data.

5.3.2 Modality transformation

Although our method is not specifically designed for modality transformations, it can be used as such, when trained with multimodal data as input. Only the image modality factor captures the intensity variability between the MR and CT images. Therefore, different values of the modality factor can be combined with the same fixed anatomy factor to achieve representations of the anatomy corresponding to the two modalities.

To illustrate this we use the model trained with 100% of the MR and CT in the MM-WHS dataset and demonstrate transformations between the two modalities. In Figure 6 we synthesise CT images from MR (and MR from CT) by fusing a CT modality vector with an anatomy from an MR image (and vice versa). We can readily see how the transformed images capture intensity characteristics typical of the domain.

This mixing of factors is a special case of latent space arithmetic that we demonstrate concretely in Section 5.5.
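The factor swap behind these transformations can be illustrated with a deliberately simple stand-in. The `encode` and `decode` functions below are toy placeholders (a foreground mask as "anatomy" and a single brightness value as "modality"), not the SDNet encoders or decoder; only the swapping pattern is the point.

```python
def encode(image):
    """Toy encoder: anatomy = foreground mask, modality = foreground brightness."""
    modality = max(image)
    anatomy = [1 if v > 0 else 0 for v in image]
    return anatomy, modality


def decode(anatomy, modality):
    """Toy decoder: paint the anatomy with the given brightness."""
    return [a * modality for a in anatomy]


# Two flattened toy "images" with different anatomy and different intensity.
mr_image = [0, 5, 5, 0]
ct_image = [9, 9, 0, 0]

s_mr, z_mr = encode(mr_image)
s_ct, z_ct = encode(ct_image)

# Keep the anatomy of one image and borrow the modality factor of the other.
mr_as_ct = decode(s_mr, z_ct)
ct_as_mr = decode(s_ct, z_mr)
print(mr_as_ct, ct_as_mr)
```

In the toy case the swapped outputs keep the shape of their source image while taking on the brightness of the other, which mirrors the qualitative behaviour described for Figure 6.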

Figure 6: Modality transformation between MR and CT when a fixed anatomy is combined with a modality vector derived from each imaging modality. The left pane of the figure shows an original MR image and a ‘reconstruction’ of it that uses the modality component derived from a CT image. The right pane of the figure shows an original CT image and a ‘reconstruction’ of it that uses the modality component derived from an MR image.

5.4 Modality type estimation

Our premise is that the learned modality factor captures imaging specific information. We assess this in two different settings using multimodal MR and CT data and also cine-MR and CP-BOLD MR data.

After training the model, we learn post hoc a logistic regression classifier to predict the image modality (MR or CT) from the modality factor $z$. The learned classifier is able to correctly classify the input images as CT or MR on a held out test set 92% of the time. To find whether there is a single dimension that best captures this binary semantic component (MR or CT), we repeat 8 independent experiments, training 8 single-input logistic regressors, one for each dimension of $z$. We find that a single dimension obtains an accuracy of 82%, whereas the remaining dimensions vary from 42% to 66% accuracy. We can conclude that a single dimension captures most of the intensity differences between MR and CT, which are global and affect all areas of the image.
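The per-dimension probing experiment can be sketched with a hand-written single-feature logistic regression. The synthetic 8-dimensional codes below stand in for learned modality factors, and placing the class signal in dimension index 3 is an arbitrary choice for the toy data; the learning rate, epoch count and data sizes are likewise assumptions.

```python
import math
import random


def fit_logistic(xs, ys, lr=0.5, epochs=300):
    """Fit p(y=1|x) = sigmoid(w*x + b) on one scalar feature by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    preds = [1 if w * x + b > 0 else 0 for x in xs]
    return sum(int(p == y) for p, y in zip(preds, ys)) / len(ys)


# Synthetic stand-in for modality factors: label 1 = "CT", label 0 = "MR".
rng = random.Random(0)
labels = [i % 2 for i in range(400)]
codes = []
for y in labels:
    z = [rng.gauss(0.0, 1.0) for _ in range(8)]
    z[3] += 1.5 if y == 1 else -1.5   # only this dimension carries the label
    codes.append(z)

train, test = slice(0, 300), slice(300, 400)
accs = []
for d in range(8):
    xs = [z[d] for z in codes]
    w, b = fit_logistic(xs[train], labels[train])
    accs.append(accuracy(w, b, xs[test], labels[test]))

best = max(range(8), key=lambda d: accs[d])
print(best, [round(a, 2) for a in accs])
```

Comparing the held-out accuracy of each single-dimension classifier is one simple way to see whether the modality label is concentrated in one coordinate or spread across several.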

In a second complementary experiment we perform the same logistic regression classification to discriminate between cine-MR and CP-BOLD MR images (which are also cine, but additionally contain oxygen-level dependent contrast). Unlike MR and CT, which are globally easy to differentiate due to their pronounced differences in signal intensities across the whole anatomy, BOLD and cine exhibit subtle spatially and temporally localised differences that are modulated by the amount of oxygenated blood present (the BOLD effect) and the cardiac cycle, and these are most acute in the heart. [Footnote 4: In fact, these subtle spatio-temporal differences are adequate to detect myocardial ischemia at rest, as has been demonstrated in Bevilacqua et al. (2016); Kali et al. (2013).] Even here the accuracy of the classifier is 96% in detecting the presence or not of BOLD contrast, when all dimensions of the modality factor are given as input. When each dimension is used separately, accuracy ranges between 47% and 65%, and thus no single dimension globally captures the presence (or lack) of BOLD contrast.

These findings are revealing and have considerable implications. First they show that our modality factor does capture modality specific information which is obtained completely unsupervised, and depending on context and complexity of the imaging modality, a single dimension may capture it almost completely (in the case of MR/CT). This also implicitly suggests that spatial information is captured only in the spatial factor.

More importantly, it opens the question of how the spatial and modality factors interact to reproduce the output. We address these questions below using latent space arithmetic.

5.5 Latent space arithmetic

Herein we demonstrate the properties of our spatial latent space factorisation by separately examining the effects of anatomical and modality factors on the synthetic images and how modifications of each alter the output. We conclude this section by showing how specific changes in each of the dimensions correlate to changes in the output. For these experiments we consider the model from Table 1, trained on ACDC using 100% of the labelled training images.

Arithmetic on the spatial factor $s$: We start with the spatial factor and in Figure 7 we alter the content of the spatial channels to qualitatively see how the decoder has learned an association between the position of each channel and different signal intensities of the anatomical parts. In all these experiments the modality factor $z$ remains the same. The first two images show the input and the original reconstruction. The third image is produced by adding the MYO spatial channel to the LV spatial channel and by nulling (zeroing) the MYO channel. We can see that the intensity of the myocardium is now the same as the intensity of the left ventricle. In the fourth image, we swap the MYO channel with the LV channel, resulting in reversed intensities for the two substructures. Finally, the fifth image is produced by randomly shuffling the spatial channels.

Figure 7: Reconstructions of an input image, when re-arranging the channels of the spatial representation. The images from left to right are: the input, the original reconstruction, the reconstruction when moving the MYO to the LV channel, the reconstruction when exchanging the content of the MYO and the LV channels, and finally a reconstruction obtained after a random permutation of the channels.
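The three channel manipulations can be written down directly on a toy spatial factor. Here the factor is a list of flattened binary channels, and the channel indices assigned to MYO and LV are hypothetical; in the trained model each edited factor would then be passed to the decoder together with an unchanged modality factor.

```python
import random

MYO, LV = 0, 1  # hypothetical channel indices for this toy example


def move_channel(s, src, dst):
    """Add channel src into channel dst, then zero channel src."""
    out = [list(ch) for ch in s]
    out[dst] = [a + b for a, b in zip(out[dst], out[src])]
    out[src] = [0] * len(out[src])
    return out


def swap_channels(s, a, b):
    """Exchange the content of two channels."""
    out = [list(ch) for ch in s]
    out[a], out[b] = out[b], out[a]
    return out


def shuffle_channels(s, seed=0):
    """Randomly permute all channels."""
    out = [list(ch) for ch in s]
    random.Random(seed).shuffle(out)
    return out


# Toy spatial factor: 3 binary channels over 4 pixels.
s = [[1, 0, 0, 0],   # MYO
     [0, 1, 0, 0],   # LV
     [0, 0, 1, 1]]   # remaining anatomy

print(move_channel(s, MYO, LV))
print(swap_channels(s, MYO, LV))
print(shuffle_channels(s))
```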

Arithmetic on the modality factor $z$:

Next, we examine the information captured in each dimension of the modality factor. Since the modality factor follows a Gaussian distribution, we can draw random samples or interpolate between samples in order to generate new images. In this analysis, an image is first encoded to its spatial factor $s$ and its modality factor $z$. Since the prior over $z$ is an 8-dimensional unit Normal distribution, 99.7% of the probability mass of each dimension lies within three standard deviations of the mean. As a result, the probability space is almost fully covered by values in the range $[-3, 3]$. By interpolating each $z$-dimension between $-3$ and $3$, whilst keeping the values of the remaining dimensions and $s$ fixed, we can decode synthetic images that show the variability induced by every $z$-dimension.

To achieve this we consider a grid where each dimension is varied over 7 fixed steps from $-3$ to $3$. Each row of the grid corresponds to one of the 8 dimensions of $z$, whereas each column corresponds to the $j$-th value in the range $[-3, 3]$. This grid is visualised in Figure 8.

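Mathematically, for $i \in \{1, 2, \dots, 8\}$ and $j \in \{1, 2, \dots, 7\}$, the image in the $i$-th row and $j$-th column of the grid is $g(s, (z \odot v_i) + w_{ij})$, where $g$ is the decoder, $s$ and $z$ are the spatial and modality factors of the input image, $\odot$ denotes element-wise multiplication, $v_i$ is a vector of length 8 with all entries 1 except for a 0 in the $i$-th position, and $w_{ij}$ is a vector of length 8 with all entries 0 except for $-3 + (j - 1)$ in the $i$-th position.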
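The grid of modality vectors can be built with a few lines of code. This sketch only constructs the 8 × 7 set of edited vectors (one dimension overwritten by a value from -3 to 3, all others kept); the function name is an assumption, and decoding each vector with a fixed spatial factor is left to the trained model.

```python
def interpolation_grid(z, low=-3.0, high=3.0, steps=7):
    """Return grid[i][j]: a copy of z whose i-th entry is the j-th interpolation value."""
    step = (high - low) / (steps - 1)
    grid = []
    for i in range(len(z)):
        row = []
        for j in range(steps):
            z_ij = list(z)               # all other dimensions stay fixed
            z_ij[i] = low + j * step     # overwrite only dimension i
            row.append(z_ij)
        grid.append(row)
    return grid


z = [0.1] * 8                 # toy modality factor of one encoded image
grid = interpolation_grid(z)
print(len(grid), len(grid[0]))
print(grid[2][0])
print(grid[2][6])
```

Overwriting a single entry is equivalent to zeroing it with a mask and then adding the interpolation value, which is the form used in the text.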

In order to assess the effect of $z_i$ (the $i$-th dimension of $z$) on the intensities of the synthetic results, we calculate a correlation image and a difference image (for every row of results). The value of each pixel in the correlation image is calculated using the Pearson correlation coefficient between the interpolation values of $z_i$ and the intensity values of the synthetic images for this pixel:

$$\rho_{h,w} = \frac{\sum_{j=1}^{7} (z_i^j - \bar{z}_i)\,(x^j_{h,w} - \bar{x}_{h,w})}{\sqrt{\sum_{j=1}^{7} (z_i^j - \bar{z}_i)^2}\,\sqrt{\sum_{j=1}^{7} (x^j_{h,w} - \bar{x}_{h,w})^2}},$$

where $h, w$ are the height and width position of a pixel, $z_i^j$ is the $j$-th interpolation value of $z_i$, $x^j_{h,w}$ is the intensity of pixel $(h, w)$ in the $j$-th synthetic image, $\bar{z}_i$ is the mean value of $z_i$, and $\bar{x}_{h,w}$ is the mean value of a pixel across the interpolated images. The difference image is calculated for each row by subtracting the image in the first column of the grid ($j = 1$) from the image in the last column ($j = 7$). [Footnote 5: Note that in order to keep the correlation and the difference image in the same scale [-1, 1], we rescale the synthetic images from [-1, 1] to [0, 1], which does not have any effect on the results.]
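A per-pixel correlation image and a difference image for one row of the grid can be computed as sketched below. The images are nested lists with intensities assumed to lie in [0, 1], the difference is taken as last column minus first column, and assigning zero correlation to constant pixels is an added convention, not something stated in the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: constant pixels get zero correlation
    return cov / math.sqrt(var_x * var_y)


def correlation_image(values, images):
    """Correlate the interpolation values with each pixel across the images of a row."""
    height, width = len(images[0]), len(images[0][0])
    return [[pearson(values, [img[h][w] for img in images])
             for w in range(width)] for h in range(height)]


def difference_image(images):
    """Last image of the row minus the first image of the row."""
    first, last = images[0], images[-1]
    return [[b - a for a, b in zip(row_first, row_last)]
            for row_first, row_last in zip(first, last)]


# Toy row of 7 synthetic 1x3 images: one pixel brightens, one darkens, one is constant.
values = [-3 + j for j in range(7)]
images = [[[0.5 + v / 10, 0.5 - v / 10, 0.5]] for v in values]

print(correlation_image(values, images))
print(difference_image(images))
```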

In Figure 8, the correlation images show high levels of positive or negative correlation between each dimension and most pixels of the input image, demonstrating that $z$ mostly captures global image characteristics. However, local correlations are also evident, for example between individual dimensions and all pixels of the heart, the right ventricle, or the myocardium. Different magnitude changes are also evident, as the difference image in the last column of Figure 8 shows. Among all dimensions, a small number appear to alter the local contrast significantly.

Figure 8: Reconstructions when interpolating between $z$ vectors. Each row corresponds to images obtained by changing the values of a single $z$-dimension. The final two columns (correlation and difference images) indicate which areas of the image are mostly affected by this change in $z$.

6 Conclusion

We have presented a method for factorising medical images into a spatial and a non-spatial latent factor, where we enforced a semantically meaningful spatial factor of the anatomy and a non-spatial factor encoding the modality information. To the best of our knowledge, maintaining semantics in the spatial factor has not been previously investigated. Moreover, through the incorporation of a variational autoencoder, we can treat our method as a generative model, which allows us to also efficiently model the intensity variability of medical data.

We demonstrated the utility of our methodology in a semi-supervised segmentation task, where we achieve high accuracy even when the amount of labelled images is substantially reduced. We also demonstrated that the semantics of our spatial representation mean it is suitable for secondary anatomically-based tasks, such as quantifying the left ventricular volume, which not only can be accurately predicted, but also improve the accuracy of the primary task in a multi-task training scenario. We also show that the factorisation of the model presented can be used in multimodal learning, where both anatomical and imaging information can be encoded to create synthetic MR and CT images, using even small fractions of CT and MR input images, respectively.

The main contribution of this paper is the factorisation of medical image data into meaningful spatial and non-spatial factors. This intuitive factorisation does not require the specific network architecture choices used here, but is general in nature with the potential for application in diverse medical image analysis tasks.

The utility of our factorisation is also evident in how we can interpret manipulations of the latent space and as such probe into the model. Such interpretability is considered key to advance the translation of advanced machine learning methods in the clinic (and is perhaps why it has recently been emphasised with dedicated MICCAI workshops). Critically though, whilst our work is a positive step in this direction, we can envision that extensions to 3D, and to explicitly learning hierarchical factors that better capture semantic information (both in terms of anatomical and modality representations), would further improve applicability of our approach in several domains.


This work was supported in part by the US National Institutes of Health (1R01HL136578-01) and UK EPSRC (EP/P022928/1). This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF).


  • Almahairi et al. (2018) Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron C. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. In International Conference on Machine Learning, 2018.
  • Azadi et al. (2018) Samaneh Azadi, Matthew Fisher, Vladimir Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. Multi-content GAN for few-shot font style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 11, page 13, 2018.
  • Bai et al. (2017) Wenjia Bai, Ozan Oktay, Matthew Sinclair, Hideaki Suzuki, Martin Rajchl, Giacomo Tarroni, Ben Glocker, Andrew King, Paul M Matthews, and Daniel Rueckert. Semi-supervised learning for network-based cardiac MR image segmentation. In Medical Image Computing and Computer-Assisted Intervention, pages 253–260, Cham, 2017. Springer International Publishing. ISBN 978-3-319-66185-8.
  • Bai et al. (2018a) Wenjia Bai, Matthew Sinclair, Giacomo Tarroni, Ozan Oktay, Martin Rajchl, Ghislain Vaillant, Aaron M. Lee, Nay Aung, Elena Lukaschuk, Mihir M. Sanghvi, Filip Zemrak, Kenneth Fung, Jose Miguel Paiva, Valentina Carapella, Young Jin Kim, Hideaki Suzuki, Bernhard Kainz, Paul M. Matthews, Steffen E. Petersen, Stefan K. Piechnik, Stefan Neubauer, Ben Glocker, and Daniel Rueckert. Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. Journal of Cardiovascular Magnetic Resonance, 20(1):65, Sep 2018a. doi: 10.1186/s12968-018-0471-x.
  • Bai et al. (2018b) Wenjia Bai, Hideaki Suzuki, Chen Qin, Giacomo Tarroni, Ozan Oktay, Paul M. Matthews, and Daniel Rueckert. Recurrent neural networks for aortic image sequence segmentation with sparse annotations. In Alejandro F. Frangi, Julia A. Schnabel, Christos Davatzikos, Carlos Alberola-López, and Gabor Fichtinger, editors, Medical Image Computing and Computer Assisted Intervention, pages 586–594, Cham, 2018b. Springer International Publishing. ISBN 978-3-030-00937-3.
  • Baumgartner et al. (2018) Christian F Baumgartner, Lisa M Koch, Marc Pollefeys, and Ender Konukoglu. An exploration of 2D and 3D deep learning techniques for cardiac MR image segmentation. In Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges, pages 111–119, Cham, 2018. Springer International Publishing.
  • Bengio et al. (2013a) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013a. doi: 10.1109/TPAMI.2013.50.
  • Bengio et al. (2013b) Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013b.
  • Bernard et al. (2018) O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester, G. Sanroma, S. Napel, S. Petersen, G. Tziritas, E. Grinias, M. Khened, V. A. Kollerathu, G. Krishnamurthi, M. Rohé, X. Pennec, M. Sermesant, F. Isensee, P. Jäger, K. H. Maier-Hein, C. F. Baumgartner, L. M. Koch, J. M. Wolterink, I. Išgum, Y. Jang, Y. Hong, J. Patravali, S. Jain, O. Humbert, and P. Jodoin. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Transactions on Medical Imaging, 37(11):2514–2525, Nov 2018. ISSN 0278-0062. doi: 10.1109/TMI.2018.2837502.
  • Bevilacqua et al. (2016) Marco Bevilacqua, Rohan Dharmakumar, and Sotirios A Tsaftaris. Dictionary-driven ischemia detection from cardiac phase-resolved myocardial BOLD MRI at rest. IEEE Transactions on Medical Imaging, 35(1):282–293, Jan 2016. ISSN 0278-0062. doi: 10.1109/TMI.2015.2470075.
  • Biffi et al. (2018) Carlo Biffi, Ozan Oktay, Giacomo Tarroni, Wenjia Bai, Antonio De Marvao, Georgia Doumou, Martin Rajchl, Reem Bedair, Sanjay Prasad, Stuart Cook, Declan O’Regan, and Daniel Rueckert. Learning interpretable anatomical features through deep generative models: Application to cardiac remodeling. In Alejandro F. Frangi, Julia A. Schnabel, Christos Davatzikos, Carlos Alberola-López, and Gabor Fichtinger, editors, Medical Image Computing and Computer Assisted Intervention, pages 464–471, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00934-2.
  • Chartsias et al. (2017) Agisilaos Chartsias, Thomas Joyce, Rohan Dharmakumar, and Sotirios A Tsaftaris. Adversarial image synthesis for unpaired multi-modal cardiac data. In Simulation and Synthesis in Medical Imaging, pages 3–13. Springer International Publishing, 2017. ISBN 978-3-319-68127-6.
  • Chartsias et al. (2018) Agisilaos Chartsias, Thomas Joyce, Giorgos Papanastasiou, Scott Semple, Michelle Williams, David Newby, Rohan Dharmakumar, and Sotirios A. Tsaftaris. Factorised spatial representation learning: Application in semi-supervised myocardial segmentation. In Medical Image Computing and Computer Assisted Intervention, pages 490–498, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00934-2.
  • Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180. Curran Associates, Inc., 2016.
  • Cheplygina et al. (2018) Veronika Cheplygina, Marleen de Bruijne, and Josien P. W. Pluim. Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. CoRR, abs/1804.06353, 2018.
  • Chollet et al. (2015) François Chollet et al. Keras, 2015.
  • Donahue et al. (2018) Chris Donahue, Zachary C Lipton, Akshay Balsubramani, and Julian McAuley. Semantically decomposing the latent spaces of generative adversarial networks. In International Conference on Learning Representations, 2018.
  • Fidon et al. (2017) Lucas Fidon, Wenqi Li, Luis C Garcia-Peraza-Herrera, Jinendra Ekanayake, Neil Kitchen, Sebastien Ourselin, and Tom Vercauteren. Scalable multimodal convolutional networks for brain tumour segmentation. In Medical Image Computing and Computer-Assisted Intervention, pages 285–293, Cham, 2017. Springer International Publishing. ISBN 978-3-319-66179-7.
  • Gatys et al. (2016) Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016. doi: 10.1109/CVPR.2016.265.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680. Curran Associates, Inc., 2014.
  • Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
  • Hu et al. (2017) Qiyang Hu, Attila Szabó, Tiziano Portenier, Matthias Zwicker, and Paolo Favaro. Disentangling factors of variation by mixing them. CoRR, abs/1711.07410, 2017.
  • Huang et al. (2018) Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision, volume 11207, pages 179–196. Springer International Publishing, 2018.
  • Kali et al. (2013) Avinash Kali, Andreas Kumar, Ivan Cokic, Richard LQ Tang, Sotirios A Tsaftaris, Matthias G Friedrich, and Rohan Dharmakumar. Chronic manifestation of post-reperfusion intramyocardial hemorrhage as regional iron deposition: a cardiovascular MR study with ex-vivo validation. Circulation: Cardiovascular Imaging, 2013.
  • Kim and Mnih (2018) Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, volume 80 of JMLR Workshop and Conference Proceedings, pages 2654–2663, 2018.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
  • Lee et al. (2018) Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, volume 11205, pages 36–52. Springer International Publishing, 2018.
  • Mao et al. (2018) Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. On the effectiveness of least squares generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence, 2018. ISSN 0162-8828. doi: 10.1109/TPAMI.2018.2872043.
  • Mathieu et al. (2016) Michael F Mathieu, Junbo Jake Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5040–5048, 2016.
  • Milletari et al. (2016) Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision, pages 565–571, 2016. doi: 10.1109/3DV.2016.79.
  • Nie et al. (2018) Dong Nie, Yaozong Gao, Li Wang, and Dinggang Shen. ASDNet: Attention based semi-supervised deep networks for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention, pages 370–378, Cham, 2018. Springer International Publishing.
  • Oktay et al. (2018) O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. de Marvao, T. Dawes, D. P. O’Regan, B. Kainz, B. Glocker, and D. Rueckert. Anatomically constrained neural networks (ACNNs): Application to cardiac image enhancement and segmentation. IEEE Transactions on Medical Imaging, 37(2):384–395, Feb 2018. ISSN 0278-0062. doi: 10.1109/TMI.2017.2743464.
  • Peng et al. (2016) Peng Peng, Karim Lekadir, Ali Gooya, Ling Shao, Steffen E Petersen, and Alejandro F Frangi. A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging. Magnetic Resonance Materials in Physics, Biology and Medicine, 29(2):155–195, 2016.
  • Perez et al. (2018) Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, pages 3942–3951. AAAI Press, 2018.
  • Qin et al. (2018) Chen Qin, Wenjia Bai, Jo Schlemper, Steffen E. Petersen, Stefan K. Piechnik, Stefan Neubauer, and Daniel Rueckert. Joint motion estimation and segmentation from undersampled cardiac MR image. In Florian Knoll, Andreas Maier, and Daniel Rueckert, editors, Machine Learning for Medical Image Reconstruction, pages 55–63, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00129-2.
  • Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286. PMLR, 22–24 Jun 2014.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, pages 234–241, Cham, 2015. Springer International Publishing.
  • Szabó et al. (2018) Attila Szabó, Qiyang Hu, Tiziano Portenier, Matthias Zwicker, and Paolo Favaro. Challenges in disentangling independent factors of variation. In International Conference on Learning Representations Workshop, 2018.
  • Tan et al. (2017) Li Kuo Tan, Yih Miin Liew, Einly Lim, and Robert A. McLaughlin. Convolutional neural network regression for short-axis left ventricle segmentation in cardiac cine MR sequences. Medical Image Analysis, 39:78–86, 2017. ISSN 1361-8415.
  • Tsaftaris et al. (2013) Sotirios A Tsaftaris, Xiangzhi Zhou, Richard Tang, Debiao Li, and Rohan Dharmakumar. Detecting myocardial ischemia at rest with cardiac phase–resolved blood oxygen level–dependent cardiovascular magnetic resonance. Circulation: Cardiovascular Imaging, 6(2):311–319, 2013.
  • Vigneault et al. (2018) Davis M. Vigneault, Weidi Xie, Carolyn Y. Ho, David A. Bluemke, and J. Alison Noble. Ω-net (omega-net): Fully automatic, multi-view cardiac MR detection, orientation, and segmentation with deep neural networks. Medical Image Analysis, 48:95–106, 2018. ISSN 1361-8415.
  • Zhang et al. (2017) Yizhe Zhang, Lin Yang, Jianxu Chen, Maridel Fredericksen, David P Hughes, and Danny Z Chen. Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In Medical Image Computing and Computer-Assisted Intervention, pages 408–416, Cham, 2017. Springer International Publishing.
  • Zhao et al. (2018) Zhuo Zhao, Lin Yang, Hao Zheng, Ian H Guldner, Siyuan Zhang, and Danny Z Chen. Deep learning based instance segmentation in 3D biomedical images using weak annotation. In Medical Image Computing and Computer Assisted Intervention, pages 352–360, Cham, 2018. Springer International Publishing.
  • Zheng et al. (2018) Q. Zheng, H. Delingette, N. Duchateau, and N. Ayache. 3-D consistent and robust segmentation of cardiac images by deep learning with spatial propagation. IEEE Transactions on Medical Imaging, 37(9):2137–2148, Sept 2018. ISSN 0278-0062. doi: 10.1109/TMI.2018.2820742.
  • Zhu et al. (2017) Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
  • Zhuang (2013) Xiahai Zhuang. Challenges and methodologies of fully automatic whole heart segmentation: a review. Journal of Healthcare Engineering, 4(3):371–407, 2013.
  • Zhuang and Shen (2016) Xiahai Zhuang and Juan Shen. Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis, 31:77–87, 2016.
  • Zhuang et al. (2010) Xiahai Zhuang, Kawal S Rhode, Reza S Razavi, David J Hawkes, and Sebastien Ourselin. A registration-based propagation framework for automatic whole heart segmentation of cardiac MRI. IEEE Transactions on Medical Imaging, 29(9):1612–1625, 2010. doi: 10.1109/TMI.2010.2047112.