Unsupervised Anomaly Detection for X-Ray Images

Diana Davletshina et al., Universität München — 01/29/2020

Obtaining labels for medical (image) data requires scarce and expensive experts. Moreover, due to ambiguous symptoms, single images rarely suffice to correctly diagnose a medical condition; it often requires taking additional background information such as the patient's medical history or test results into account. Hence, instead of focusing on uninterpretable black-box systems delivering an uncertain final diagnosis in an end-to-end fashion, we investigate how unsupervised methods trained on images without anomalies can be used to assist doctors in evaluating X-ray images of hands. Our method increases the efficiency of making a diagnosis and reduces the risk of missing important regions. To this end, we adopt state-of-the-art approaches for unsupervised learning to detect anomalies and show how the outputs of these methods can be explained. To reduce the effect of noise, which can often be mistaken for an anomaly, we introduce a powerful preprocessing pipeline. We provide an extensive evaluation of different approaches and demonstrate empirically that even without labels it is possible to achieve satisfying results on a real-world dataset of X-ray images of hands. We also evaluate the importance of preprocessing, and one of our main findings is that without it, most of our approaches perform no better than random. To foster reproducibility and accelerate research we make our code publicly available at https://github.com/Valentyn1997/xray


1 Introduction

Deep learning techniques are ubiquitous and achieve state-of-the-art performance in many areas. However, they require vast amounts of labeled data, as witnessed by the remarkable boost in image recognition after the publication of the large-scale ImageNet dataset [4, 11]. In medical applications, labels are expensive to acquire. While anyone can decide whether an image depicts a dog or a cat, deciding whether a medical image shows abnormalities is a highly difficult task requiring specialists with years of training. Another peculiarity of medical applications is that a simple classification decision often does not suffice. End-to-end deep learning solutions tend to be hard to interpret, preventing their application in an area as sensitive as deciding on a treatment. Moreover, additional patient information, such as the patient's medical history and clinical test results, is often crucial to a correct diagnosis. Integrating this information into an end-to-end pipeline is difficult and makes results even less interpretable. Thus, the motivation for our work is to let doctors decide about the final diagnosis and treatment, and to develop a system which can provide hints on where doctors should pay more attention.

Hence, in this work, we investigate how we can support doctors in assessing X-ray images faster and reduce the chance of overlooking suspicious regions. To this end, we demonstrate how state-of-the-art unsupervised methods, such as Autoencoders (AE) or Generative Adversarial Networks (GANs), can be used for anomaly detection on X-ray images. As this dataset is noisy, which is a general problem for many real-world datasets, we present a sophisticated preprocessing pipeline to obtain better training data. Afterwards, we train several unsupervised models and explain for each how to obtain image-level anomaly scores. For some of them, it is even natural to obtain pixel-wise annotations highlighting anomalous regions. One of our main findings is that accurate data preprocessing is indispensable. The advantage of using autoencoders is that they naturally provide pixel-level anomaly heatmaps, which can be used to understand model decisions. In contrast, GAN-based approaches seem to cope better with noisy data, yet are only able to produce image-wise anomaly scores. We envision that this methodology can easily be integrated into the clinical daily routine to support doctors in quickly assessing X-ray images and spotting candidate regions for anomalies.

In this work, we focus on a subset of the MURA dataset [18] containing only hand images. In total, we have 5,543 images of 2,018 studies of 1,945 patients. Each study is labeled as negative or positive, where positive means that there was an anomaly diagnosed in this study. There are 521 positive studies, with a total of 1,484 images. Figure 1 shows some examples from the dataset. In summary, our contributions are as follows:

  1. We present a powerful preprocessing pipeline for the MURA dataset [18], enabling the construction of a high-quality training set.

  2. We extensively survey unsupervised deep learning methods and present approaches for obtaining image-level and even pixel-level anomaly scores.

  3. We show extensive experiments on a real-world dataset evaluating the influence of proper preprocessing as well as the usability of the anomaly scores. To foster reproducibility, we make our code publicly available.

The rest of the paper is structured as follows: In Section 2 we describe our approach. We start with the description of data preprocessing in Section 2.1 and describe anomaly detection approaches along with anomaly scores in Section 2.2. We discuss related work in Section 3. Finally, Section 4 shows quantitative and qualitative experimental results on image-level and pixel-level anomaly detection.

Figure 1: A few examples from the used subset of the MURA dataset containing X-ray images of hands demonstrating the large variety of image quality.

2 Unsupervised Anomaly Detection

2.1 Preprocessing

Real-life data is often noisy. This is especially problematic for unsupervised anomaly detection approaches. On the one hand, it is necessary to remove noise to make sure that it is not recognized as an anomaly. On the other hand, it is crucial that the denoising process does not mistake anomalies for noise and remove them. After extensive experimentation, we arrived at the preprocessing pipeline depicted in Figure 2. We distinguish offline and online processing steps: the offline processing is done once and the result is stored to disk to save time, whereas the online preprocessing is done on-the-fly while loading the data. The individual steps are described in detail subsequently.

[Figure 2 pipeline steps: Input → Cropping → Hand Center Localization → Hand Segmentation (offline); Augmentation → Padding/Centering → Min-Max Normalization → Model (online)]

Figure 2: The full image preprocessing pipeline. Steps highlighted in green are performed once and the result is stored to disk. Steps highlighted in orange are done on-the-fly.
Cropping

The first step in our pipeline is to detect the X-ray image carrier in the image. To this end, we apply OpenCV's contour detection using Otsu binarization [15] and retrieve the minimum-size bounding box, which does not need to be axis-aligned. This works sufficiently well as long as the majority of the image carrier is within the image (cf. Figure 3). However, the approach might fail for heavily tilted images or those where larger parts of the image carrier reach beyond the image border.
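
For illustration, a minimal version of this cropping step could look as follows (assuming OpenCV ≥ 4; function and variable names are illustrative, not taken from our released code):

```python
# Sketch of the carrier cropping step: Otsu binarization + contour detection,
# then cut out the minimum-area (possibly rotated) bounding box.
import cv2
import numpy as np

def crop_carrier(gray: np.ndarray) -> np.ndarray:
    # Otsu thresholding separates the bright carrier from the dark background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)
    # Minimum-area rectangle; it does not have to be axis-aligned.
    (cx, cy), (w, h), angle = cv2.minAreaRect(largest)
    # Rotate the image so the rectangle becomes axis-aligned, then crop it out.
    rotation = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    rotated = cv2.warpAffine(gray, rotation, (gray.shape[1], gray.shape[0]))
    return cv2.getRectSubPix(rotated, (int(w), int(h)), (cx, cy))
```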

Figure 3: Result of image carrier detection with OpenCV (left side). The first image shows the original image with a detected rectangle. Next to it is the extracted image. The right image shows the result of running object detection on an image containing two hands. We extract the image of both hands separately such that our preprocessed data set does not contain images with more than one hand.
Hand Localization

To further improve the detection of hands, and in particular to split images on which two hands are depicted, we manually labeled approximately 150 bounding boxes. Using this small dataset, we fine-tune a pre-trained single shot multibox detector (SSD) [13] with a MobileNet backbone as provided by TensorFlow. An exemplary result can be seen in Figure 3.
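
At inference time, the fine-tuned detector can be applied as sketched below; this follows the usual TensorFlow Object Detection API conventions, and the export path and score threshold are illustrative assumptions:

```python
# Illustrative inference with a fine-tuned SSD/MobileNet exported as a SavedModel.
import numpy as np
import tensorflow as tf

detect_fn = tf.saved_model.load("exported_ssd_mobilenet/saved_model")  # hypothetical path

def detect_hands(image: np.ndarray, score_threshold: float = 0.5) -> np.ndarray:
    """image: (H, W, 3) uint8 array; returns relative [ymin, xmin, ymax, xmax] boxes."""
    detections = detect_fn(tf.convert_to_tensor(image[np.newaxis, ...]))
    boxes = detections["detection_boxes"][0].numpy()
    scores = detections["detection_scores"][0].numpy()
    return boxes[scores >= score_threshold]   # one box per detected hand
```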

Foreground Segmentation

In a final step, foreground segmentation is performed using Photoshop's "Select Subject" feature in batch processing mode. Thereby, we obtain a pixel-wise mask roughly encompassing the scanned hand.

Data Augmentation

Due to GPU memory constraints, the images for BiGAN and α-GAN are resized to 128 pixels on the longer image side while maintaining the aspect ratio before applying the augmentation. For the autoencoder models, this is not necessary. Afterwards, standard data augmentation methods (horizontal/vertical flipping, channel-wise multiplication, rotation, scaling) are applied using the imgaug library (https://github.com/aleju/imgaug), before finally padding the images to 512×512 (AE + DCGAN) or 128×128 (BiGAN + α-GAN) pixels.
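
For illustration, such a pipeline can be set up with imgaug as follows; the probabilities and value ranges here are illustrative choices, not necessarily the exact values used in our experiments:

```python
# Illustrative imgaug augmentation pipeline: flips, channel-wise multiplication,
# rotation/scaling, and center padding to the target resolution.
import imgaug.augmenters as iaa

augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                                   # horizontal flip
    iaa.Flipud(0.5),                                   # vertical flip
    iaa.Sometimes(0.5, iaa.Multiply((0.8, 1.2))),      # channel-wise brightness change
    iaa.Sometimes(0.5, iaa.Affine(rotate=(-10, 10),
                                  scale={"x": (0.9, 1.1), "y": (0.9, 1.1)})),
    iaa.PadToFixedSize(width=512, height=512, position="center"),
])

# batch_of_images: uint8 numpy array of shape (N, H, W, 1), assumed to exist
augmented = augmenter(images=batch_of_images)
```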

2.2 Models

In this section, we describe the different model types we trained in a fully unsupervised / self-supervised fashion on the training split, which comprises only images from patients without attested anomalies. We also describe how to obtain anomaly scores from the trained models. In the appendix, we additionally provide architecture details for every model.

2.3 Autoencoders

We studied different autoencoder architectures for the task at hand. Common among them is the usage of a reconstruction loss, i.e. the input to the network is also used as the target, and we evaluate how well the input is reconstructed. As the information has to pass through an informational bottleneck, the model cannot simply copy the input data, but instead has to perform a form of compression, extracting features which suffice to reconstruct the image sufficiently well. Hence, we have an encoder part of the network $E$, which transforms the input $x$ non-linearly to a latent space representation $z = E(x)$. Analogously, there is a decoder $D$ that transforms an input from latent space back to an element in the input space, $\hat{x} = D(z)$.

For simplicity, we describe the general loss formulation using a vector input $x \in \mathbb{R}^n$ instead of a two-dimensional pixel matrix. In its simplest form, the reconstruction loss is given as the mean over pixel-wise squared differences. Let $\hat{x} = D(E(x))$, then

$$\mathcal{L}_{rec}(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2.$$

As we are only interested in detecting anomalies on the hand part of the image, we consider a variant of this loss, named masked reconstruction loss, where only those pixels are considered that belong to the mask. Let $m \in \{0, 1\}^n$ be the mask, where $m_i = 1$ if and only if position $i$ belongs to the hand. Then

$$\mathcal{L}_{masked}(x) = \frac{1}{\|m\|_1} \sum_{i=1}^{n} m_i \cdot (x_i - \hat{x}_i)^2 = \frac{1}{\|m\|_1} \, \|m \odot (x - \hat{x})\|_2^2,$$

where $\odot$ denotes the Hadamard product (i.e. element-wise multiplication). In the following, we describe the architectures of the networks in more detail.
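
A minimal PyTorch implementation of the masked reconstruction loss could look as follows (tensor names are illustrative):

```python
# Sketch of the masked reconstruction loss: squared error is only accumulated
# over pixels that belong to the hand mask.
import torch

def masked_mse(x: torch.Tensor, x_hat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """x, x_hat: (B, 1, H, W) images; mask: (B, 1, H, W) binary {0, 1} hand mask."""
    squared_error = mask * (x - x_hat) ** 2          # Hadamard product zeroes out background
    # Average only over pixels that belong to the hand.
    return squared_error.sum(dim=(1, 2, 3)) / mask.sum(dim=(1, 2, 3)).clamp(min=1)
```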

In a convolutional autoencoder (CAE), we implement encoder and decoder as fully convolutional neural networks (CNNs). In general, the encoder is built as a sequence of repeated convolution blocks. We apply batch normalization [9] between every convolution and the respective activation, and use ReLU [7] as the activation function. A detailed model description is given in the appendix. Similarly, the decoder consists of repeated blocks of transposed convolutions. As before, we apply batch normalization before every activation function. As bottleneck size, we use a spatial resolution of 16×16 and 512 channels.
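
For illustration, a condensed PyTorch sketch of such a CAE is given below; it uses fewer convolution blocks than the exact architecture listed in the appendix and assumes min-max normalized 512×512 grayscale inputs:

```python
# Condensed sketch of a convolutional auto-encoder with a 16x16x512 bottleneck.
import torch.nn as nn

def down(c_in, c_out):
    # strided 4x4 convolution halves the spatial resolution
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out),   # batch norm between conv and activation
                         nn.ReLU(inplace=True))

def up(c_in, c_out):
    # transposed 4x4 convolution doubles the spatial resolution
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out),
                         nn.ReLU(inplace=True))

class CAE(nn.Module):
    def __init__(self):
        super().__init__()
        # 1x512x512 input -> 512x16x16 bottleneck
        self.encoder = nn.Sequential(down(1, 32), down(32, 64), down(64, 128),
                                     down(128, 256), down(256, 512))
        self.decoder = nn.Sequential(up(512, 256), up(256, 128), up(128, 64), up(64, 32),
                                     nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
                                     nn.Sigmoid())   # inputs assumed min-max normalized to [0, 1]

    def forward(self, x):
        return self.decoder(self.encoder(x))
```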

The Variational AE (VAE) [10] is a generative model which maps an input to a Gaussian distribution in latent space, characterized by its mean $\mu(x)$ and covariance $\Sigma(x)$, instead of mapping it to a fixed latent representation. The covariance matrix is usually restricted to a diagonal matrix. For reconstruction, a sample $z \sim \mathcal{N}(\mu(x), \Sigma(x))$ is drawn and passed through the decoder sub-network. To avoid very small values in $\Sigma(x)$, and thereby approaching a delta distribution, i.e. a traditional AE, an additional loss term is introduced as the Kullback-Leibler divergence (KLD) between $\mathcal{N}(\mu(x), \Sigma(x))$ and the standard normal distribution $\mathcal{N}(0, I)$.
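
The VAE-specific parts, reparameterized sampling and the KLD regularizer, can be sketched as follows (assuming the encoder predicts the mean and the log-variance of the diagonal Gaussian; names are illustrative):

```python
# Sketch of reparameterized sampling and the KL divergence between
# N(mu, diag(sigma^2)) and the standard normal prior N(0, I).
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)       # z ~ N(mu, diag(sigma^2))

def kl_divergence(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

# total_loss = reconstruction_loss + weight * kl_divergence(mu, logvar).mean()
```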

2.3.1 Anomaly Detection Scores

The rationale behind using AE for anomaly detection is that, as the AE is trained on normal data only, it has not seen anomalies during training and hence will fail to reproduce them. Due to the convolutional nature of the network, the error is even expected to be stronger in regions close to the anomaly and weaker further apart. If the receptive field is small enough, regions outside of it are not affected at all. Hence, we can use the reconstruction error in two ways:

  1. Pixel-wise, to obtain a heatmap highlighting regions that were hardest to reconstruct. If there is an anomaly, we expect the highest error in that region. We show such an example in the qualitative results (Figure 5).

  2. Aggregated over all pixels (under the mask), to obtain an image-wise score. We explore different aggregation strategies: in the simplest case, we average over all locations; by using only the highest values to compute the mean, we obtain a score that is more sensitive towards regions of high reconstruction error (i.e. anomalous regions). A code sketch of both uses is given after the next paragraph.

We aim for autoencoder architectures which are strong enough to successfully reconstruct normal hands, without risking learning identity mappings by allowing too wide bottlenecks. While the architecture should generalize over all normal hands, too strong a generalization might cause anomalies to be reconstructed sufficiently well, too.
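
Both uses of the reconstruction error, the pixel-level heatmap and the image-level mean / top-k scores, can be sketched as follows (tensor names and the helper function are illustrative):

```python
# Sketch: turn the pixel-wise reconstruction error into a heatmap and into
# image-level anomaly scores (plain mean and top-k mean over masked pixels).
import torch

def anomaly_scores(x, x_hat, mask, k: int = 200):
    error_map = mask * (x - x_hat) ** 2            # (B, 1, H, W) heatmap for inspection
    per_pixel = error_map.flatten(start_dim=1)     # (B, H*W)
    mean_score = per_pixel.sum(dim=1) / mask.flatten(start_dim=1).sum(dim=1).clamp(min=1)
    # Top-k aggregation: mean over the k pixels with the highest error,
    # intended to be more sensitive to small anomalous regions.
    topk_score = per_pixel.topk(k, dim=1).values.mean(dim=1)
    return error_map, mean_score, topk_score
```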

2.4 GANs

A Generative Adversarial Network (GAN) [8] comprises two sub-networks, a generator , and a discriminator , which can be seen as antagonists in a two-player game. The generator takes random noise as input and generates samples in the target domain. The discriminator takes real data points, as well as generated ones, and has to distinguish between real and fake data. The sub-networks are trained alternatingly, and if successful, the generator can afterwards be used to sample from the (approximated) data distribution, and the discriminator can be used to decide whether a sample is drawn from the given data distribution.

The Deep Convolutional GAN (DCGAN) [16] is an extension of the original GAN architecture to convolutional neural networks. Similarly to the CAE, the two networks contain convolutions (discriminator) and transposed convolutions (generator) instead of the fully connected layers of the originally proposed GAN architecture.

BiGAN [5] / ALI [6] extends DCGAN by an encoder $E$, which encodes the real image $x$ into latent space. The discriminator is now provided with both the real and the fake image as well as their latent codes, i.e. it receives the pairs $(x, E(x))$ and $(G(z), z)$.

α-GAN [20] comprises four sub-networks:

  • An encoder $E$ which transforms a real image $x$ into a latent representation $E(x)$.

  • A code-discriminator $C$ which distinguishes between the latent representations produced by the encoder and the random noise $z$ used as generator input.

  • A generator $G$ which generates an image from either the randomly sampled noise $z$ or the encoded image $E(x)$.

  • A discriminator $D$ which distinguishes real images from reconstructed real images $G(E(x))$ and generated images $G(z)$.

In addition to the classification losses for both discriminators, a reconstruction loss is applied to the autoencoder formed by the encoder-generator pair. Hence, the code-discriminator gives the encoder the incentive to transform the inputs to match the noise distribution, similar to the KL divergence in the VAE. Likewise, the discriminator encourages matching the data distribution in the image domain.

2.4.1 Anomaly Detection Scores

For the GAN models, we generally use the discriminator’s output as the anomaly score. When converged, the discriminator should be able to distinguish between images belonging to the data manifold, i.e. images of hands without any anomalies, and those which lie outside, such as those containing anomalous regions. For

GAN we use the mean over code discriminator and discriminator probability.
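
For illustration, these discriminator-based scores could be computed as follows; the assumption that the discriminators output logits (hence the sigmoid) and all network handles are ours:

```python
# Sketch of discriminator-based anomaly scores: higher probability of "fake"
# means the image lies further from the learned manifold of normal hands.
import torch

@torch.no_grad()
def gan_anomaly_score(x, discriminator):
    return 1.0 - torch.sigmoid(discriminator(x)).squeeze()

@torch.no_grad()
def alpha_gan_anomaly_score(x, encoder, discriminator, code_discriminator):
    z = encoder(x)
    d_score = 1.0 - torch.sigmoid(discriminator(x)).squeeze()
    c_score = 1.0 - torch.sigmoid(code_discriminator(z)).squeeze()
    return 0.5 * (d_score + c_score)   # mean of discriminator and code-discriminator
```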

3 Related Work

With the rapid advancement of deep learning methods, they have also found their way into medical imaging, cf. e.g. [12, 19]. Despite the limited availability of labels in medical contexts, supervised methods make up the vast majority. Very likely, this is due to their easier trainability, but possibly also because the interpretability of the results has so far often been secondary. Sato et al. [22] use a 3D CAE for pathology detection in CT scans of brains. The CAE is trained solely on normal images, and at test time, the MSE between the image and its reconstruction is taken as the anomaly score. Uzunova et al. [25] use VAEs for medical 2D and 3D CT images. Similarly, they use the MSE reconstruction loss as the anomaly score. Besides the KL divergence in latent space, they use a reconstruction loss for training, which produced less smooth output. GANomaly [2] and its extension with skip-connections use an AE and map the reconstructed input back to the latent space. The anomaly score is computed in latent space between the original and the reconstructed input. They apply their methods to X-ray security imagery to detect anomalous items in baggage. Recently, there have been many publications using the currently popular GANs. For example, [14] uses a semi-supervised approach for anomaly detection in chest X-ray images. They replace the standard discriminator classification into real and fake with a three-way classification into real-normal, real-abnormal, and fake. While this allows training with fewer labels, it still requires them for training. Schlegl et al. [23] train a DCGAN on slices of OCT scans, where the original volume is cut along the x-z axis, and the slices are further randomly cropped. At test time, they use gradient descent to iteratively solve the inverse problem of obtaining a latent vector that produces the image. Stopping after a few iterations, the distance between the generated image and the input image is considered as the residual loss. To summarize, the focus of recent work on anomaly detection lies either in applying existing methods to a new type of data or in adapting unsupervised methods for anomaly detection. Instead, we provide an extensive evaluation of state-of-the-art unsupervised learning approaches that can be directly used for anomaly detection. Furthermore, we evaluate the importance of different preprocessing steps and compare methods with regard to explainability.

4 Experiments

We demonstrate the capability of our preprocessing pipeline and all described models in experiments on a subset of the MURA dataset containing only X-ray images of hands. 3,062 images are stored as single-channel PNGs, and 2,481 are stored with three RGB channels. However, all images look like gray-scale images, which is why we convert all 3-channel images to a single channel. The longer side of the images is always 512 pixels; the smaller side ranges from 160 to 512, with the majority between 350 and 450.

Figure 4: Visualization of the applied data split scheme. "n" denotes patients who do not have an abnormal study ("negative"), "p" the contrary ("positive"). Note that the training part of the split does not contain any images of anomalies, i.e. we do not use anomalous images for training.

As our approach is unsupervised, we train only on negative images, i.e. images without an anomaly. Furthermore, to avoid test leakage, we split the data by patient, and not by study or image, to ensure that we do not have an image of a patient in the training data and another image of the same patient in the test or validation data. To this end, we proceed as follows: Let $P$ be the set of all patients, and $P^{+} \subseteq P$ be the set of patients with a study that is labeled as abnormal. The rest of the patients is denoted by $P^{-} = P \setminus P^{+}$. For the test and validation sets, we aim at having balanced classes. Therefore, we distribute $P^{+}$ evenly at random across test and validation. Afterwards, we randomly sample the same number of patients without known anomalies for test and validation, and use the rest of the patients for training. The procedure is visualized in Figure 4. In total, we end up with 2,554 training images, 1,494 validation images, and 1,495 test images.
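
An illustrative implementation of this patient-level split is given below; the metadata column names are assumptions and do not necessarily match our released code:

```python
# Sketch of the patient-level split: positives are shared between validation and
# test, matched by an equal number of negatives; remaining negatives form the
# training set, which therefore contains no anomalous studies.
import numpy as np
import pandas as pd

def split_by_patient(df: pd.DataFrame, seed: int = 42):
    rng = np.random.RandomState(seed)
    positive = df.loc[df["label"] == 1, "patient_id"].unique()
    negative = np.setdiff1d(df["patient_id"].unique(), positive)

    # Distribute patients with an abnormal study evenly over validation and test ...
    positive = rng.permutation(positive)
    val_pos, test_pos = np.array_split(positive, 2)
    # ... and sample the same number of purely normal patients for each of them.
    negative = rng.permutation(negative)
    val_neg = negative[:len(val_pos)]
    test_neg = negative[len(val_pos):len(val_pos) + len(test_pos)]

    train = np.setdiff1d(negative, np.concatenate([val_neg, test_neg]))  # normal only
    val = np.concatenate([val_pos, val_neg])
    test = np.concatenate([test_pos, test_neg])
    return train, val, test
```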

We trained all models on a machine with one NVIDIA Tesla V100 GPU with 16 GiB of VRAM, 20 cores, and 360 GB of RAM. Following [17], we train our models from scratch and do not use transfer learning from large image classification datasets. We performed a manual hyper-parameter search on the validation set and selected the best-performing models per type with respect to the area under the receiver operating characteristic curve (ROC-AUC). We report the ROC-AUC on the test set.
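
Model selection and reporting then reduce to computing ROC-AUC scores, e.g. with scikit-learn (the variable names below are placeholders):

```python
# Sketch: ROC-AUC of an anomaly score against the binary study labels.
from sklearn.metrics import roc_auc_score

val_auc = roc_auc_score(y_val, scores_val)      # used to pick the best configuration
test_auc = roc_auc_score(y_test, scores_test)   # reported in Table 1
```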

4.1 Quantitative Results

| Model | Score | raw w/o HE | raw w/ HE | crop w/o HE | crop w/ HE | full w/o HE | full w/ HE |
|---|---|---|---|---|---|---|---|
| CAE | MSE | .460 ± .033 | .504 ± .034 | .466 ± .022 | .510 ± .021 | .501 ± .013 | .570 ± .019 |
| CAE | MSE (top-200) | .466 ± .013 | .448 ± .025 | .486 ± .015 | .473 ± .018 | .506 ± .039 | .553 ± .023 |
| VAE | KLD | .488 ± .031 | .491 ± .013 | .470 ± .046 | .496 ± .045 | .520 ± .026 | .533 ± .014 |
| VAE | L1 | .432 ± .033 | .446 ± .016 | .438 ± .033 | .438 ± .016 | .435 ± .014 | .483 ± .009 |
| VAE | L1 + KLD | .432 ± .033 | .446 ± .016 | .438 ± .034 | .437 ± .016 | .438 ± .011 | .488 ± .011 |
| VAE | L1 (top-200) | .438 ± .017 | .472 ± .010 | .440 ± .025 | .471 ± .013 | .428 ± .013 | .481 ± .010 |
| VAE | MSE | .432 ± .033 | .446 ± .016 | .438 ± .033 | .438 ± .016 | .435 ± .014 | .483 ± .009 |
| VAE | MSE + KLD | .432 ± .033 | .446 ± .016 | .438 ± .033 | .438 ± .016 | .436 ± .013 | .486 ± .010 |
| VAE | MSE (top-200) | .438 ± .017 | .472 ± .010 | .440 ± .025 | .471 ± .013 | .428 ± .013 | .481 ± .010 |
| DCGAN | Disc. (D) | .497 ± .018 | .491 ± .041 | .493 ± .015 | .493 ± .025 | .530 ± .027 | .527 ± .022 |
| BiGAN | MSE | .471 ± .021 | – | .438 ± .039 | – | .491 ± .042 | .522 ± .017 |
| BiGAN | MSE (top-200) | .471 ± .011 | – | .459 ± .030 | – | .475 ± .033 | .508 ± .026 |
| BiGAN | Disc. (D) | .508 ± .007 | – | .534 ± .016 | – | .549 ± .006 | .522 ± .019 |
| α-GAN | Code-Disc. (C) | .500 ± .000 | – | .500 ± .001 | – | .500 ± .000 | .500 ± .000 |
| α-GAN | MSE | .476 ± .029 | – | .466 ± .022 | – | .442 ± .013 | .528 ± .018 |
| α-GAN | MSE (top-200) | .465 ± .031 | – | .446 ± .018 | – | .422 ± .016 | .533 ± .013 |
| α-GAN | Disc. (D) | .503 ± .022 | – | .534 ± .022 | – | .607 ± .016 | .584 ± .012 |
| α-GAN | C + D | .503 ± .022 | – | .534 ± .022 | – | .607 ± .016 | .584 ± .012 |
Table 1: Quantitative results for all models. We report ROC-AUC on the test set for the best configuration regarding validation set ROC-AUC. All numbers are mean and standard deviation across four separate trainings with different random seeds. For each model, we report results for various anomaly scores: mean squared error (MSE), L1, Kullback-Leibler divergence (KLD), and discriminator probability (D). Top-200 denotes the case where only the 200 pixels with the highest error are taken into consideration. HE denotes histogram equalization.

Apart from the performance of single models, we also evaluate the importance of the preprocessing steps. Therefore, we evaluate the models on the raw data, the data after cropping the hand regions, as well as on the fully preprocessed data. We also vary whether histogram equalization (HE) is applied before the augmentation or not. We summarize the quantitative results in Table 1, showing the mean and standard deviation across four runs. There is a clear trend regarding preprocessing: all models have their best runs in the fully preprocessed setting, emphasizing the importance of our preprocessing pipeline for noisy datasets. Interestingly, without foreground segmentation, i.e. only by cropping the single hands, the results appear to be worse than on the raw data. While histogram equalization is a contrast enhancement method particularly useful to improve human perception of low-contrast images, it seems to improve the results for AE-based models consistently. For BiGAN and α-GAN, the corresponding experiments did not finish until the deadline; as they comprise AE components, we expect to see an improvement there as well. On raw and also cropped data we frequently observe ROC-AUC values smaller than 45%. Hence, we might be able to improve the ROC-AUC score by flipping the anomaly decision. Partially, we attribute this to the rather unstable results for these models. Regarding the aggregation of the reconstruction error, we observe that using only the top-k loss values across all pixels does not improve the result. We attribute this partially to not tuning across different values of k, as we only used k = 200 for all models, which may be too few pixels to detect some anomalies. Due to the lack of pixel-level annotations, we did not investigate this issue further. In total, we obtain the best ROC-AUC score of 60.7% for α-GAN using the discriminator probability. The CAE, however, also achieves 57% ROC-AUC and can additionally provide pixel-level anomaly scores naturally, yielding higher interpretability.

4.2 Qualitative Results

Figure 5: Example heatmaps of the reconstruction error of the CAE. The left image pair shows a hand from a study labeled as normal. Here, the reconstruction error is relatively widespread. The right image pair shows an abnormal hand, where the abnormality is clearly highlighted.

In addition to the numerical results, we also showcase some qualitative results. For all methods with a reconstruction loss, i.e. all AEs as well as α-GAN, we can generate heatmaps visualizing the pixel-wise losses. Thereby, we can highlight regions that could not be reconstructed well. Following our assumption, these regions should be the anomalous regions. In Figure 5, we see prototypical examples produced by the CAE. The first image pair shows a hand contained in a study which was labeled as normal. We can see that the reconstruction error is not concentrated, but rather spread widely across the hand. The maxima seem to occur around joints, which, due to their more complex structure, are likely to be harder to reconstruct. Compared to the second image pair, which shows a study labeled as abnormal, we see a clear highlighting at the middle finger. Visible also to a non-expert, we can spot metal parts in the X-ray image at the very same location. For those anomalies which could be validated by a person without a medical background, the highlighted regions seem to correspond largely to the anomalous regions.

5 Conclusion

In this paper, we investigated methods for unsupervised anomaly detection in X-ray images. To this end, we surveyed two families of unsupervised models, autoencoders and GANs, regarding their applicability to derive anomaly scores. In addition, we provide a sophisticated multi-step preprocessing pipeline. In our experiments, we compare the methods against each other and, furthermore, reveal that the preprocessing is crucial for most models to obtain good results on real-world data. For the autoencoder family, we study the interpretability of pixel-wise losses as anomaly heatmaps and verify that in cases of anomalies which a non-expert can detect (e.g. metal pieces in the hand), these heatmaps closely match the anomalous regions. As future work, we envision the extension to broader datasets such as the full MURA dataset, as well as obtaining pixel-level anomaly scores for the GAN-based models. To this end, methods from the field of explainable AI, such as Grad-CAM [24] or LRP [3], can be applied to the discriminator to obtain heatmaps similar to those of the AE models. Moreover, we see potential for model architectures tailored more closely to the specific problem and data type, as well as the possibility of building an ensemble model using the different ways of extracting anomaly scores from single models, or even across different model types.

Acknowledgement

We would like to thank Franz Pfister and Rami Eisaway from deepc (www.deepc.ai) for access to the data and support in understanding the use case. Part of this work has been conducted during a practical course at Ludwig-Maximilians-Universität München funded by Z.DB. This work has been funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.

References

  • [1] 15th IEEE International Symposium on Biomedical Imaging, ISBI 2018, Washington, DC, USA, April 4-7, 2018. IEEE. ISBN 978-1-5386-3636-7.
  • [2] S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon (2019). GANomaly: Semi-supervised anomaly detection via adversarial training. In Computer Vision – ACCV 2018, C. V. Jawahar, H. Li, G. Mori, and K. Schindler (Eds.), Cham, pp. 622–637. ISBN 978-3-030-20893-6.
  • [3] S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), pp. 1–46.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009). ImageNet: A large-scale hierarchical image database. In CVPR 2009.
  • [5] J. Donahue, P. Krähenbühl, and T. Darrell (2017). Adversarial feature learning. In 5th International Conference on Learning Representations, ICLR 2017.
  • [6] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. C. Courville (2017). Adversarially learned inference. In 5th International Conference on Learning Representations, ICLR 2017.
  • [7] X. Glorot, A. Bordes, and Y. Bengio (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, pp. 315–323.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [9] S. Ioffe and C. Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pp. 448–456.
  • [10] D. P. Kingma and M. Welling (2014). Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [12] G. J. S. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez (2017). A survey on deep learning in medical image analysis. Medical Image Analysis 42, pp. 60–88.
  • [13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016). SSD: Single shot multibox detector. In Computer Vision – ECCV 2016, Part I, pp. 21–37.
  • [14] A. Madani, M. Moradi, A. Karargyris, and T. F. Syeda-Mahmood (2018). Semi-supervised learning with generative adversarial networks for chest X-ray classification with ability of data domain adaptation. In ISBI 2018 [1], pp. 1038–1042.
  • [15] N. Otsu (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1), pp. 62–66.
  • [16] A. Radford, L. Metz, and S. Chintala (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • [17] M. Raghu, C. Zhang, J. M. Kleinberg, and S. Bengio (2019). Transfusion: Understanding transfer learning with applications to medical imaging. arXiv preprint arXiv:1902.07208.
  • [18] P. Rajpurkar, J. Irvin, A. Bagul, D. Ding, T. Duan, H. Mehta, B. Yang, K. Zhu, D. Laird, R. L. Ball, et al. (2017). MURA: Large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957.
  • [19] K. Raza and N. K. Singh (2018). A tour of unsupervised deep learning for medical image analysis. arXiv preprint arXiv:1812.07715.
  • [20] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed (2017). Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987.
  • [21] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016). Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, pp. 2234–2242.
  • [22] D. Sato, S. Hanaoka, Y. Nomura, T. Takenaga, S. Miki, T. Yoshikawa, N. Hayashi, and O. Abe (2018). A primitive study on unsupervised anomaly detection with an autoencoder in emergency head CT volumes. In Medical Imaging 2018: Computer-Aided Diagnosis, pp. 105751P.
  • [23] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017). Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Information Processing in Medical Imaging, IPMI 2017, pp. 146–157.
  • [24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
  • [25] H. Uzunova, S. Schultz, H. Handels, and J. Ehrhardt (2019). Unsupervised pathology detection in medical images using conditional variational autoencoders. International Journal of Computer Assisted Radiology and Surgery 14(3), pp. 451–461.

Supplementary Material

A.1 Schematic Architectures and Reconstruction Examples

Figure 6: Schematic overview of the convolutional autoencoder (CAE) and an example reconstruction of a completely preprocessed image. Encoder and decoder are realized as deep convolutional neural networks (CNNs). Note that the masked reconstruction loss is used; therefore, the reconstruction outside of the hand is arbitrary.
Figure 7: Schematic overview of the variational autoencoder and an example reconstruction of a completely preprocessed image. Encoder and decoder are realized as CNNs. The encoder predicts the mean and covariance of a Gaussian distribution in latent space. The reconstruction is done from a sample of this Gaussian.
Figure 8: Schematic overview of DCGAN and an example comparison of real (above) and fake (below) images. The generator generates an image from input noise z. The discriminator distinguishes between real images and generated ones. More details in the text.
Figure 9: Schematic overview of BiGAN / ALI and examples of real (above) and fake (below) images. The generator generates an image from input noise z. The encoder encodes the real image. The discriminator distinguishes between real images and generated ones, additionally being provided with the noise vector and the encoding.
Figure 10: Schematic overview of α-GAN comprising four sub-networks: encoder E, generator G, discriminator D and code-discriminator C. Discriminator and code-discriminator distinguish real from fake data in image space and latent space, respectively. In addition, encoder and generator also form an autoencoder. The top image is a completely preprocessed real image. Its reconstruction is shown in the middle. The last image is a fake produced by the generator from random noise.

A.2 Architecture Details

CAE
encoder
kernel size output filters
(3, 3) (512, 512, 16)
(4, 4) (256, 256, 32)
(3, 3) (256, 256, 32)
(4, 4) (128, 128, 64)
(3, 3) (128, 128, 64)
(4, 4) (64, 64, 128)
(3, 3) (64, 64, 128)
(4, 4) (32, 32, 256)
(3, 3) (32, 32, 256)
(4, 4) (16, 16, 512)
decoder
kernel size output filters
(4, 4) (32, 32, 256)
(4, 4) (64, 64, 128)
(4, 4) (128, 128, 64)
(4, 4) (256, 256, 32)
(4, 4) (512, 512, 16)
(3, 3) (512, 512, 1)
VAE
encoder
kernel size output filters
(4, 4) (255, 255, 8)
(4, 4) (126, 126, 16)
(4, 4) (62, 62, 32)
(4, 4) (30, 30, 64)
(4, 4) (14, 14, 128)
(4, 4) (6, 6, 256)
(4, 4) (2, 2, 512)
bottleneck
reshape:
= FC() (1024,)
= FC() (1024,)
(1024,)
reshape: (2, 2, 512)
decoder
kernel size output filters
(4, 4) (6, 6, 256)
(4, 4) (14, 14, 128)
(4, 4) (30, 30, 64)
(4, 4) (62, 62, 32)
(4, 4) (126, 126, 16)
(4, 4) (254, 254, 8)
(6, 6) (512, 512, 1)

DCGAN
generator
kernel size output filters
(4, 4) (4, 4, 1024)
(4, 4) (8, 8, 512)
(4, 4) (16, 16, 256)
(4, 4) (32, 32, 128)
(4, 4) (64, 64, 64)
(4, 4) (128, 128, 32)
(4, 4) (256, 256, 16)
(4, 4) (512, 512, 1)
discriminator
kernel size output filters
(4, 4) (256, 256, 4)
(4, 4) (128, 128, 8)
(4, 4) (64, 64, 16)
(4, 4) (32, 32, 32)
(4, 4) (16, 16, 64)
(4, 4) (8, 8, 128)
(4, 4) (4, 4, 256)
(4, 4) (1, 1, 512)
minibatch discrimination (1, 1, 528)
FC (1,)
BiGAN
generator
kernel size output filters
(4, 4) (4, 4, 1024)
(4, 4) (8, 8, 512)
(4, 4) (16, 16 ,256)
(4, 4) (32, 32, 128)
(4, 4) (64, 64, 64)
(4, 4) (128, 128, 1)
encoder
kernel size output filters
(4, 4) (64, 64, 64)
(4, 4) (32, 32, 128)
(4, 4) (16, 16 ,256)
(4, 4) (8, 8, 512)
(4, 4) (4, 4, 1024)
(4, 4) (1, 1, 200)
Discriminator: Image Branch
kernel size output filters
(4, 4) (64, 64, 64)
(4, 4) (32, 32, 128)
(4, 4) (16, 16 ,256)
(4, 4) (8, 8, 512)
(4, 4) (4, 4, 1024)
(4, 4) (1, 1, 1024)
Discriminator: Code Branch
kernel size output filters
(1, 1) (1, 1, 512)
(1, 1) (1, 1, 512)
Discriminator: Combination
kernel size output filters
stack branches
(1, 1) (1, 1, 1024)
(1, 1) (1, 1, 1024)
(1, 1) (1, 1, 1)
α-GAN
generator
kernel size output filters
(4, 4) (4, 4, 1024)
(4, 4) (8, 8, 512)
(4, 4) (16, 16 ,256)
(4, 4) (32, 32, 128)
(4, 4) (64, 64, 64)
(4, 4) (128, 128, 1)
encoder
kernel size output filters
(4, 4) (64, 64, 64)
(4, 4) (32, 32, 128)
(4, 4) (16, 16 ,256)
(4, 4) (8, 8, 512)
(4, 4) (4, 4, 1024)

mean and variance

(4, 4) (1, 1, 200)
discriminator
kernel size output filters
(4, 4) (64, 64, 64)
(4, 4) (32, 32, 128)
(4, 4) (16, 16 ,256)
(4, 4) (8, 8, 512)
(4, 4) (4, 4, 1024)
minibatch discrimination (4, 4, 1028)
(4, 4) (1, 1, 1)
code-discriminator
kernel size output filters
(1, 1) 100
(1, 1) 50
(1, 1) 25
(1, 1) 1

Notes: * denotes additional max-pooling / nearest-neighbor upsampling; FC denotes a fully-connected layer; marked layers additionally use a self-attention layer.

A.3 Data Augmentation

We use two different augmentation strategies, named default (used for GANs and VAE) and advanced (used for CAE and BAE).

A.3.1 Default

  • Horizontally flip 50% of all images

  • Vertically flip 50% of all images

  • Center pad all images to the target resolution

A.3.2 Advanced

  • Horizontally flip 50% of all images

  • Vertically flip 50% of all images

  • For 50% of the images, change the brightness by multiplying all channels by a scalar value drawn randomly from a uniform distribution

  • For 50% of the images, randomly scale in x- and y-direction independently by a factor drawn randomly from a uniform distribution

  • For 50% of the images, rotate the image by an angle drawn randomly from a uniform distribution

  • Center pad all images to the target resolution

A.4 Training Details

  • For all models we train 4 variants with different random seeds being 42, 4242, 424242, and 42424242.

A.4.1 CAE

  • Batch Size: 32

  • Image Resolution: 512×512

  • 1,000 epochs

  • Batch Normalization

  • Learning rate: 0.0001

  • Adam optimizer

A.4.2 BAE

  • Batch Size: 32

  • Image Resolution: 512×512

  • 500 epochs

  • Batch Normalization

  • Learning rate: 0.0001

  • Adam optimizer

A.4.3 VAE

  • Batch Size: 32

  • Image Resolution: 512×512

  • 500 epochs

  • Batch Normalization

  • Learning rate: 0.0001

  • Adam optimizer

A.4.4 DCGAN

  • Batch Size: 80

  • Image Resolution: 512×512

  • 500 epochs

  • No Batch Normalization

  • Spectral Normalization

  • Soft Labels

  • Generator Learning Rate: 0.001

  • Discriminator Learning Rate: 0.00001

  • Soft Delta: 0.01

  • As we observed mode collapse, we added minibatch discrimination [21]

  • Adam optimizer

A.4.5 BiGAN

  • Batch Size: 16

  • Image Resolution: 128×128

  • 500 epochs

  • Generator & Encoder Learning Rate: 0.001

  • Discriminator Learning Rate: 0.000005

  • Adversarial Loss: Hinge Loss

  • Adam optimizer

A.4.6 α-GAN

  • Batch Size: 16

  • Image Resolution: 128×128

  • 500 epochs

  • Generator & Encoder Learning Rate: 0.001

  • Discriminator & Code-Discriminator Learning Rate: 0.000005

  • Adversarial Loss: Hinge Loss

  • As we observed mode collapse, we added minibatch discrimination [21]

  • Adam optimizer