Image interpretation using convolutional neural networks (CNNs) has been widely and successfully applied to medical image analysis in recent years. However, in contrast to human observers, CNNs generalize poorly to previously unseen combinations of entangled image properties (e.g. shape and texture). In Ultrasound (US), image property entanglement can be observed when acquisition-related artifacts (e.g. shadows) obfuscate the underlying anatomy (see Fig. 1). A CNN simultaneously learns anatomical features and artifact features for either anatomy classification or artifact detection. As a result, a model trained on images with certain entangled properties (e.g. images without acoustic shadows) can hardly handle images with new entangled properties that were unseen during training (e.g. images with shadows).
Approaches for representation disentanglement have been proposed in order to learn semantically disjoint internal representations for improving image interpretation. These methods pave the way for improving the generalization of CNNs in a wide range of medical image analysis problems. As a practical application in this work, we want to disentangle anatomical features from shadow features in order to generalize anatomical standard plane analysis for better detection of abnormalities in early pregnancy.
Contribution: In this paper, we propose a novel, end-to-end trainable representation disentanglement model that can learn distinct and generalizable features through a multi-task architecture with adversarial training. The obtained disjoint features are able to improve the performance of multi-task networks, especially on data with previously unseen properties. We evaluate the proposed model on specific multi-task problems, including shape/background-color classification tasks on synthetic data and standard-plane/shadow-artifacts classification tasks on fetal US data. Our experiments show that our model is able to disentangle latent representations and, in a practical application, improves the performance for anatomy analysis in US imaging.
Related work: Approaches to representation disentanglement range from classical models such as independent component analysis [4] and bilinear models to recent deep learning-based models such as InfoGAN and β-VAE [7, 8]. Disentangled representations can be utilized to interpret complex interactions of underlying factors within data [9, 10] and enable deep learning models to manipulate relevant information for specific tasks [11, 12, 13]. Particularly related to our work is the work by Mathieu et al., which proposed a conditional generative model with adversarial networks to disentangle specific and unspecific factors of variation in deep representations without strong supervision. Compared to Mathieu et al., Hadad et al. proposed a simpler two-step method with the same aim. Their network directly utilizes the encoded latent space without assuming the underlying distribution, which can be more efficient for learning various unspecified features. In contrast to their aim of disentangling one specific representation from unspecific factors, our work focuses on disentangling several specific factors. Also related to our research question is the learning of only unspecific invariant features, for example for domain adaptation. However, unlike learning invariant features, which ignores task-irrelevant information, our method aims to preserve information for multiple tasks while enhancing feature generalizability.
In the medical image analysis community, few approaches have focused on disentangling internal factors of representations in discriminative tasks. Ben-Cohen et al. proposed a method to disentangle lesion type from image appearance and to use the disentangled features to generate more training samples for data augmentation. Their work improves liver lesion classification. In contrast, our work aims to utilize disentangled features for generalization of deep neural networks in medical image analysis.
Our goal is to disentangle latent representations of the data into distinct feature sets, each containing the information relevant to one task. The main motivation of the proposed method is to learn feature sets that are maximally informative about their corresponding task but minimally informative about the irrelevant tasks. While our approach scales to any number of classification tasks, in this work we focus on two tasks as a proof of concept. The proposed method consists of two classification tasks with an adversarial regularization. The classification maps the encoded features to their relevant class identities and is trained to maximize the mutual information between each feature set and the class identity of its relevant task. The adversarial regularization penalizes the mutual information between the encoded features and their irrelevant class identities. The training architecture of our method is shown in Fig. 2.
Classification is used to learn encoded features that enable high prediction performance for the class identity of the relevant task. Each of the two classification networks is composed of an encoder and a classifier for a defined task. Every image in the data carries one label per task, and the number of class identities is defined separately for each task. Two independent encoders, each with its own parameters, map the images to two sets of encoded features, one per task. Two classifiers, again with separate parameters, predict the class identity of the corresponding task from the matching features. We define the cost function of each task as the softmax cross-entropy between the classifier prediction and the label of that task. The classification loss is minimized to train the two encoders and the two classifiers so that each feature set is maximally related to its relevant task.
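As a minimal sketch of the classification loss described above, the following pure-Python snippet computes a softmax cross-entropy per task and combines the two terms. The function names, the toy logits, and the equal-weight sum of the two terms are assumptions made for illustration; the text only states that each term is a softmax cross-entropy.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(batch_logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    losses = [-math.log(softmax(lg)[y]) for lg, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

def classification_loss(logits_a, labels_a, logits_b, labels_b):
    """Assumed combination: equal-weight sum of the two per-task terms."""
    return (softmax_cross_entropy(logits_a, labels_a)
            + softmax_cross_entropy(logits_b, labels_b))

# Toy batch of two images: task A has three classes, task B has two.
logits_a = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels_a = [0, 2]
logits_b = [[1.5, -0.5], [-2.0, 2.0]]
labels_b = [0, 1]
loss = classification_loss(logits_a, labels_a, logits_b, labels_b)
```

In a real model the logits would come from each classifier applied to the features of its own encoder; here they are fixed numbers so that the loss computation can be inspected in isolation.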
Adversarial regularization is used to force the encoded features to be minimally informative about irrelevant tasks, which results in disentanglement of internal representations. The adversarial regularization is implemented by using an adversarial network for each task, as shown in Fig. 2. Each adversarial network has its own parameters and maps the encoded features of one task to the class identity of the irrelevant task. With one softmax cross-entropy term per adversarial network, computed between its prediction and the label of the irrelevant task, the adversarial loss combines these two terms. During training, the adversarial networks are trained to minimize the adversarial loss, while the two encoders and two classifiers are trained to maximize it. This competition between the encoders/classifiers and the adversarial networks encourages the encoded features to be uninformative for irrelevant tasks.
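The adversarial loss described above can be illustrated with a small pure-Python sketch. The function names and the equal-weight sum of the two adversarial terms are assumptions. The example evaluates the loss for adversaries that can only guess uniformly, which is the situation the encoders are pushed towards when their features carry no information about the irrelevant task.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one predicted distribution against an integer label."""
    return -math.log(probs[label])

def adversarial_loss(adv_probs_on_a_features, labels_b,
                     adv_probs_on_b_features, labels_a):
    """Assumed equal-weight sum of the two adversaries' mean cross-entropies.

    Each adversary sees the features of one task and predicts the label
    of the other (irrelevant) task.
    """
    term_a = sum(cross_entropy(p, y) for p, y in
                 zip(adv_probs_on_a_features, labels_b)) / len(labels_b)
    term_b = sum(cross_entropy(p, y) for p, y in
                 zip(adv_probs_on_b_features, labels_a)) / len(labels_a)
    return term_a + term_b

# An adversary that can only guess uniformly over K classes pays log(K).
uniform_2 = [[0.5, 0.5]] * 4        # predicting a 2-class irrelevant label
uniform_3 = [[1 / 3] * 3] * 4       # predicting a 3-class irrelevant label
chance = adversarial_loss(uniform_2, [0, 1, 0, 1], uniform_3, [0, 1, 2, 0])
```

With two and three classes, the chance-level value is log 2 + log 3 = log 6. An adversary loss that stays near this level during training is one informal sign that the features have become uninformative about the irrelevant task.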
By combining the two classifications with the adversarial regularization, the whole model is optimized iteratively during training. The training objective for the two encoders and the two classifiers (Eq. 1) minimizes the classification loss minus the adversarial loss, where the adversarial loss is weighted by a trade-off parameter. The training objective for the adversarial networks (Eq. 2) minimizes the adversarial loss.
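The two objectives described above can be sketched in LaTeX as follows. The symbol names are assumptions, since the original notation is not available here: $\mathcal{L}_{cls}$ and $\mathcal{L}_{adv}$ denote the classification and adversarial losses, $\theta$ collects the parameters of the two encoders and the two classifiers, $\psi$ collects the parameters of the two adversarial networks, and $\lambda$ is the trade-off parameter of the adversarial regularization.

```latex
\begin{align}
  \min_{\theta} \; & \mathcal{L}_{cls} - \lambda \, \mathcal{L}_{adv} \tag{1} \\
  \min_{\psi} \; & \mathcal{L}_{adv} \tag{2}
\end{align}
```

Under this reading, the encoders and classifiers minimize the classification loss while maximizing the adversarial loss, and the adversarial networks minimize the adversarial loss, which matches the competition described in the text.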
Network architectures: Both encoders consist of six residual blocks, following a previously proposed implementation, to reduce the training error and to support easier network optimization. Both classifiers contain two dense layers. Each adversarial network has the same architecture as the corresponding classifier.
Training: Our model is optimized for a fixed number of epochs, and the trade-off parameter is chosen heuristically and independently for each data set using validation data. For more stable optimization, in each iteration we train the encoders and classifiers once, followed by five training steps of the adversarial networks. Following prior work, we use the Adam optimizer to train the encoders and classifiers based on Eq. 1, and Stochastic Gradient Descent (SGD) with momentum to update the parameters of the adversarial networks based on Eq. 2. We apply L2 regularization to all weights during training to prevent over-fitting. The batch size is 50, and the images in each batch are randomly flipped as data augmentation. Our model is trained on an Nvidia Titan X GPU with 12 GB of memory.
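The alternating schedule described above (one update of the encoders and classifiers, then five updates of the adversarial networks) can be sketched in plain Python. The names `train`, `main_step` and `adversary_step` are hypothetical, the update functions are stand-ins that only record the call order, and the left-right flip with probability 0.5 is an assumption, since the text only says that images are randomly flipped.

```python
import random

def random_flip(image, rng):
    """Flip a 2D image (list of rows) left-right with probability 0.5.

    The flip axis and probability are assumptions for illustration.
    """
    return [row[::-1] for row in image] if rng.random() < 0.5 else image

def train(num_iterations, main_step, adversary_step, adversary_steps=5):
    """Alternate one encoder/classifier update with several adversary updates."""
    for it in range(num_iterations):
        main_step(it)                  # encoders + classifiers (Eq. 1)
        for _ in range(adversary_steps):
            adversary_step(it)         # adversarial networks (Eq. 2)

# Stand-in update functions that only record which step ran.
calls = []
train(
    num_iterations=3,
    main_step=lambda it: calls.append(("main", it)),
    adversary_step=lambda it: calls.append(("adv", it)),
)
flipped = random_flip([[1, 2, 3], [4, 5, 6]], random.Random(0))
```

In an actual implementation, `main_step` would apply one Adam update to the encoder and classifier parameters and `adversary_step` would apply one SGD-with-momentum update to the adversarial networks, each on a freshly augmented batch.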
3 Evaluation and Results
Evaluation on synthetic data: We use synthetic data as a proof-of-concept example to verify our model. This data set contains a randomly located gray circle or rectangle on a black or white background. We split the data into train/validation/test images, which consist of circles on a white background, rectangles on a black background and rectangles on a white background. To keep the balance between image properties in the training split, we use circle:rectangle=1:1 and black:white=7:5. In this case, one task is background color classification and the other is shape classification. We implement our model as outlined in Sec. 2 and evaluate it on the test data. The experiments illustrate that the encoded features successfully identify the class identities of the relevant task but fail on the irrelevant task, as measured by overall accuracy. To show the utility of the proposed method on images with previously unseen entangled properties, we additionally compare the shape classification performance of our model and a baseline (our model without the adversarial regularization) on images with a previously unseen combination of entangled properties (circles on a black background). The proposed model outperforms the baseline. We use PCA to examine the learned embedding space at the penultimate dense layer of the classifiers. The top row of Fig. 11 illustrates that the extracted features are able to identify class identities for relevant tasks (see (a,c)) but unable to predict correct class identities for irrelevant tasks (see (b,d)).
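A synthetic data set of the kind described above could be generated along the following lines. This is only a sketch: the function name `make_image`, the image size, the gray level and the shape extent are all assumptions, as the text does not specify them.

```python
import random

def make_image(shape, background, size=32, rng=None):
    """Return a size x size grayscale image as rows of floats in [0, 1].

    shape is "circle" or "rectangle"; background is "black" or "white".
    Image size, gray level and shape extent are illustrative assumptions.
    """
    rng = rng or random.Random()
    bg = 0.0 if background == "black" else 1.0
    gray = 0.5
    half = size // 6                          # radius / half side length
    cx = rng.randint(half, size - 1 - half)   # random centre, shape stays inside
    cy = rng.randint(half, size - 1 - half)
    image = [[bg] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            dx, dy = x - cx, y - cy
            if shape == "circle":
                inside = dx * dx + dy * dy <= half * half
            else:
                inside = abs(dx) <= half and abs(dy) <= half
            if inside:
                image[y][x] = gray
    return image

rng = random.Random(0)
# Combinations seen in training, per the text; circles on black are held out.
seen = [("circle", "white"), ("rectangle", "black"), ("rectangle", "white")]
unseen = ("circle", "black")
train_images = [make_image(s, b, rng=rng) for s, b in seen]
test_image = make_image(*unseen, rng=rng)
```

Holding out one shape/background combination in this way reproduces the setting in which the model is evaluated on a previously unseen combination of entangled properties.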
Evaluation on fetal US data: We verify the applicability of our method on fetal US data. Here, one task is anatomical standard plane classification and the other is acoustic shadow artifact classification. We want to learn disentangled features that capture all anatomical information, separated from features that contain only information about shadow artifacts. The labels of the first task identify the different anatomical standard planes, while the labels of the second task distinguish the shadow-free class from the shadow-containing class.
Data set: The fetal US data set contains images sampled from 4120 2D US fetal anomaly screening examinations with gestational ages between 18 and 22 weeks. These sequences consist of eight standard planes defined in the UK FASP handbook, including three vessel view (3VV), left ventricular outflow tract (LVOT), abdominal (Abd.), four chamber view (4CH), femur, kidneys, lips and right ventricular outflow tract (RVOT), and are classified by expert observers as shadow-containing (W S) or shadow-free (W/O S) (Fig. 1). We split the data as shown in Table 1. Train, Validation and Test seen are separate data sets. Test seen contains the same entangled properties (but different images) as the training data set, while LVOT(W S) and Artifacts(OTHS) contain new combinations of entangled properties.
| Standard plane | Count format | Train | Validation | Test seen | LVOT(W S) | Artifacts(OTHS) |
|---|---|---|---|---|---|---|
| 3VV | W/O S (W S) | 180 (320) | 50 (50) | 334 (41) | - (-) | - (-) |
| LVOT | W/O S (W S) | 500 (-) | 50 (-) | 79 (-) | - (418) | - (-) |
| Abd. | W/O S (W S) | 125 (375) | 50 (50) | 190 (220) | - (-) | - (-) |
| Others | W/O S (W S) | - (-) | - (-) | - (-) | - (-) | 3159 (2211) |
Evaluation approach: We refer to Std plane only as the network for standard plane classification alone (one encoder and one classifier), and to Artifacts only as the network for shadow artifacts classification alone (one encoder and one classifier). A further baseline is the proposed method without the adversarial regularization, and Proposed is our method in Fig. 2.
The proposed method is implemented as outlined in Sec. 2. The classifier for standard plane classification contains three dense layers, while the classifier for shadow artifacts classification contains two dense layers. We choose a bigger network capacity for the standard plane classifier on the assumption that anatomies have more complex structures to be learned than shadows.
Table 2 shows that our method improves the performance of standard plane classification on Test seen when compared with both Std plane only and the method without the adversarial regularization (see Col. 5). It achieves minimal improvement over Artifacts only and the method without the adversarial regularization for shadow artifacts classification (see Col. 8). We also demonstrate the utility of the proposed method on images with previously unseen entangled properties. Table 2 shows that the proposed method achieves a higher standard plane classification accuracy on LVOT(W S) than the other comparison methods, while it performs similarly to the other methods on Artifacts(OTHS) for shadow artifacts classification.
| Method | 3VV | LVOT | Abd. | | W/O S | W S | | | |
|---|---|---|---|---|---|---|---|---|---|
| Std plane only | 60.80 | 96.59 | 67.09 | 78.36 | - | - | - | 34.93 | - |
We evaluate the performance of disentanglement by using the encoded features of the proposed method for the irrelevant task on Test seen. Table 2 indicates that the features encoded for shadow artifacts classification contain much less anatomical information for standard plane classification in the proposed method than in the method without the adversarial regularization, while the features encoded for standard plane classification contain less information about shadow features. We additionally use PCA to show the embedded test data on the penultimate dense layer. The bottom row in Fig. 11 shows that the encoded features are more capable of classifying class identities in the relevant task than in the irrelevant task (e.g. (a) vs. (d)).
Discussion: Acoustic shadows are caused by anatomies that block the propagation of sound waves or by destructive interference. Given this dependency between anatomy and artifacts, separating shadow features from anatomical features may lead to decreased performance of artifact classification (Table 2, Col. 7, Proposed). However, this separation enables feature generalization, so that the model is less limited to a certain image formation and is able to tackle new combinations of entangled properties (Table 2, Col. 9, Proposed). Generalization of supervised neural networks can also be achieved by extensive data collection across domains and, in a limited way, by artificial data augmentation. Here, we propose an alternative through feature disentanglement, which requires less data collection and training effort. Fig. 11 shows PCA plots for the penultimate dense layer. Observing entanglement in earlier layers reveals that disentanglement occurs in this very last layer. This is due to the definition of our loss functions and is partly influenced by the dense layers interpreting the latent representation for classification. Finally, perfect representation disentanglement is likely infeasible because image features are rarely totally isolated in reality. In this paper we have shown that even imperfect disentanglement can provide substantial benefits for artifact-prone image classification in medical image analysis.
In this paper, we propose a novel disentanglement method to extract generalizable features within a multi-task framework. In the proposed method, classification tasks lead to encoded features that are maximally informative with respect to these tasks while the adversarial regularization forces these features to be minimally informative about irrelevant tasks, which disentangles internal representations. Experimental results on synthetic and fetal US data show that our method outperforms baseline methods for multiple tasks, especially on images with entangled properties that are unseen during training. Future work will explore the extension of this framework to multiple tasks beyond classification.
We thank the Wellcome Trust IEH Award, Nvidia (GPU donations) and Intel.
- Geirhos et al.  Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv:1811.12231, 2018.
- Meng et al.  Qingjie Meng, Matthew Sinclair, Veronika Zimmer, Benjamin Hou, Martin Rajchl, Nicolas Toussaint, Ozan Oktay, Jo Schlemper, Alberto Gomez, James Housden, Jacqueline Matthew, Daniel Rueckert, Julia A Schnabel, and Bernhard Kainz. Weakly supervised estimation of shadow confidence maps in fetal ultrasound imaging. IEEE Transactions on Medical Imaging, 2019. ISSN 0278-0062.
- Kim and Mnih  Hyunjik Kim and Andriy Mnih. Disentangling by factorising. CoRR, arXiv/1802.05983, 2018.
- Hyvärinen and Oja  A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5):411–430, May 2000. ISSN 0893-6080.
- Tenenbaum and Freeman  Joshua B. Tenenbaum and William T. Freeman. Separating style and content with bilinear models. Neural Comput., 12(6):1247–1283, June 2000. ISSN 0899-7667.
- Chen et al.  Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS’16, pages 2180–2188, USA, 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9.
- Higgins et al.  Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR’17, 2017.
- Burgess et al.  Christopher P. Burgess, Irina Higgins, Arka Pal, Loïc Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv:1804.03599, 2018.
- Bengio et al.  Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013. ISSN 0162-8828.
- Chen et al.  Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. In ICLR’17, 2017.
- Gonzalez-Garcia et al.  Abel Gonzalez-Garcia, Joost van de Weijer, and Yoshua Bengio. Image-to-image translation for cross-domain disentanglement. In NeurIPS’18, pages 1287–1298. Curran Associates, Inc., 2018.
- Liu et al.  Alexander H. Liu, Yen-Cheng Liu, Yu-Ying Yeh, and Yu-Chiang Frank Wang. A unified feature disentangler for multi-domain image translation and manipulation. In NeurIPS, pages 2590–2599. Curran Associates, Inc., 2018.
- Hadad et al.  Naama Hadad, Lior Wolf, and Moni Shahar. A two-step disentanglement method. In CVPR’18, 2018.
- Mathieu et al.  Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representations using adversarial training. In NeurIPS’16, pages 5040–5048, 2016.
- Kamnitsas et al.  Konstantinos Kamnitsas, Christian Baumgartner, Christian Ledig, Virginia Newcombe, Joanna Simpson, Andrew Kane, David Menon, Aditya Nori, Antonio Criminisi, Daniel Rueckert, and Ben Glocker. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International conference on information processing in medical imaging, pages 597–609. Springer, 2017.
- Ben-Cohen et al.  Avi Ben-Cohen, Roey Mechrez, Noa Yedidia, and Hayit Greenspan. Improving CNN training using disentanglement for liver lesion classification in CT. arXiv:1811.00501, 2018.
- Pawlowski et al.  Nick Pawlowski, S. Ira Ktena, Matthew Lee, Bernhard Kainz, Daniel Rueckert, Ben Glocker, and Martin Rajchl. Dltk: State of the art reference implementations for deep learning on medical images. arXiv:1711.06853, 2017.
- NHS  NHS. Fetal anomaly screening programme: programme handbook June 2015. Public Health England, 2015.