Weakly Supervised Context Encoder using DICOM metadata in Ultrasound Imaging

03/20/2020 ∙ by Szu-Yeu Hu, et al. ∙ 0

Modern deep learning algorithms geared towards clinical adaption rely on a significant amount of high-fidelity labeled data. In low-resource settings, acquiring such data is difficult and becomes the bottleneck for developing artificial intelligence applications. Ultrasound images, stored in Digital Imaging and Communications in Medicine (DICOM) format, carry additional metadata describing ultrasound image parameters and medical exams. In this work, we leverage DICOM metadata from ultrasound images to help learn representations of the ultrasound image. We demonstrate that the proposed method outperforms non-metadata-based approaches across different downstream tasks.




1 Introduction

In recent years, deep learning algorithms have made forays into the clinical domain and have emerged as a successful technique in various medical imaging applications. They have shown the potential to automate disease detection, severity grading, and clinical diagnosis in different domains (Hu et al., 2019; Gulshan et al., 2016). However, clinically accepted deep learning algorithms need to be built upon a considerable amount of annotated data. For example, Gulshan et al. (2016) used more than 100,000 images to train and validate their algorithm. Unfortunately, obtaining accurate annotations from clinicians is extremely expensive and time-consuming, which constrains supervised learning approaches in low-resource settings.

Unsupervised or semi-supervised learning offers potential solutions by learning the data distribution with few or no labels. Studies have shown that unsupervised pretraining can serve as a regularization method and lead to better generalization (Erhan et al., 2010). Recently, weakly-supervised and self-supervised learning have also drawn significant attention for their ability to learn high-quality feature representations. In this paper, we explore one self-supervised technique, the context encoder (Pathak et al., 2016), and use the metadata in medical imaging as weak labels to reinforce its capability to learn representation features.

In most modern medical imaging acquisition devices, such as ultrasound machines, the data is stored in DICOM (Digital Imaging and Communications in Medicine) format. Besides the image pixel data, the DICOM headers contain metadata such as patient information, study descriptions, and reported results. The abundant information encoded in DICOM format provides a unique opportunity for modern deep learning applications. Recent studies have shown that the metadata can be leveraged for series categorization using machine learning (Gauriau et al., ). Nevertheless, DICOM metadata has not been a popular supervision target in machine learning. One major concern is that DICOM tags are often noisy and may be inaccurate (Gueld et al., 2002). In practice, clinical personnel often adjust the examination protocol and imaging presets to improve the image quality, and these changes may not be properly reflected in the DICOM tags because of spurious entries or missing tags. Even so, using DICOM metadata as weak labels can help incorporate valuable information into the deep learning algorithm while limiting the effect of the noise.

In this work, we investigate weakly-supervised learning using metadata and propose a framework built on top of a self-supervised learning method. We show that incorporating DICOM metadata as weak labels can improve the quality of representation learning and the performance of the downstream segmentation and classification tasks.

2 Related Work

2.1 Self-supervised learning

Self-supervised learning, a subcategory of unsupervised learning, creates labels from the data itself and trains the network in a supervised manner. It has proven to be an effective technique for representation learning. Commonly used methods include predicting rotation angles (Gidaris et al., 2018), solving jigsaw puzzles (Noroozi and Favaro, 2016), and predicting cropped-out parts of the image, which is the context encoder (Pathak et al., 2016) used in this study.

2.2 Adversarial Training with labels

The context encoder is trained to predict the missing parts of the images, with an adversarial loss added to encourage realistic output. Adversarial training originates from the generative adversarial network (GAN), which is trained in an unsupervised manner. Recent studies have shown that class labels can stabilize the training and improve image quality. For example, Brock et al. (2018) fed the labels as generator inputs to produce high-quality images. AC-GAN (Odena et al., 2017) uses the discriminator to classify the class labels as an auxiliary loss. Miyato and Koyama (2018) proposed a linear projection layer, which was also employed in Lučić et al. (2019) to generate high-fidelity images with a limited number of labels.

2.3 Weakly-supervised learning

Weakly-supervised learning can be roughly categorized into two types. The first is inexact supervision, which usually uses labels at a higher level of abstraction. For example, Hu et al. (2018) used position coordinates to enhance optic disc segmentation. The second is inaccurate supervision, which involves low-quality or noisy labels. Mahajan et al. (2018) used hashtags for weakly-supervised pretraining to boost ImageNet classification performance. Inspired by these efforts, we employ the metadata as a form of weak labels for pretraining.

3 Method

In this study, we propose a new weakly-supervised representation learning framework that incorporates the DICOM metadata into the context encoder. An overview of the framework is shown in Figure 1.

Figure 1: The proposed framework for representation learning

3.1 Context Encoder

The idea of the context encoder is that, given an input image with intentionally masked-out areas, we train a deep learning model to reconstruct the missing part (semantic in-painting). The network uses an encoder-decoder structure. The encoder encodes the image context into a compact latent representation, and the decoder uses this representation to generate the missing image content. The network is trained to minimize a distance-based reconstruction loss $\mathcal{L}_{rec}$.

In the original context encoder paper, it is proposed that the in-painting area can be either fixed or random blocks. Typically, models using random blocks tend to generalize better. However, due to the nature of ultrasound images, where informative context is located in the central region, we crop a square patch at the center of the image with a fixed size equal to half of the image width and height.
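The fixed central masking described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `mask_center` and the choice to fill the removed region with zeros are assumptions, since the fill value is not stated in the text.

```python
import numpy as np

def mask_center(image, fill_value=0.0):
    """Remove a centered patch whose height and width are half of the image's.

    Returns the masked input, the removed patch (the in-painting target),
    and the patch box as (top, left, height, width).
    The zero fill value is an assumption made for this sketch.
    """
    height, width = image.shape[:2]
    patch_h, patch_w = height // 2, width // 2
    top, left = (height - patch_h) // 2, (width - patch_w) // 2

    target = image[top:top + patch_h, left:left + patch_w].copy()
    masked = image.copy()
    masked[top:top + patch_h, left:left + patch_w] = fill_value
    return masked, target, (top, left, patch_h, patch_w)

# Example with the 256 x 384 input size used for training.
image = np.random.default_rng(0).uniform(size=(256, 384))
masked, target, box = mask_center(image)
```

For a 256 × 384 image the removed patch is 128 × 192 pixels, so one quarter of the pixels must be in-painted by the decoder.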

3.1.1 Discriminator with Linear Projection Layer

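We also added a discriminator for adversarial training to encourage the output to look realistic. The standard adversarial loss $\mathcal{L}_{adv}$ is formulated as

$$\mathcal{L}_{adv} = \max_{D} \; \mathbb{E}_{x} \left[ \log D(x) + \log\left(1 - D(F(\hat{x}))\right) \right] \qquad (1)$$

where $F$ is the context encoder, $D$ is the discriminator, $x$ is the original image, and $\hat{x}$ is the cropped input image.

To incorporate the DICOM metadata, we employ a linear projection layer as proposed in Miyato and Koyama (2018) and Lučić et al. (2019). The discriminator is decomposed into a learned discriminator representation, $\tilde{D}$, which is then fed into two different parts: (1) a classifier $c_r$ that distinguishes whether the image is real or fake; (2) a linear projection layer $P$, with a learned weight matrix $W$ applied to a feature vector $\tilde{D}(x)$ and the encoded DICOM tags $y$ as an input. The output of the discriminator becomes $D(x, y) = c_r(\tilde{D}(x)) + P(\tilde{D}(x), y)$, and $P(\tilde{D}(x), y) = \tilde{D}(x)^{\top} W y$. With the above modification, the adversarial loss with additional DICOM input can be rewritten as:

$$\mathcal{L}_{adv} = \max_{D} \; \mathbb{E}_{x, y} \left[ \log D(x, y) + \log\left(1 - D(F(\hat{x}), y)\right) \right] \qquad (2)$$

The final joint loss function is:

$$\mathcal{L} = \lambda_{rec} \mathcal{L}_{rec} + \lambda_{adv} \mathcal{L}_{adv} \qquad (3)$$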

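As a rough illustration of the projection discriminator, the NumPy sketch below computes the discriminator output as a real/fake score plus a projection term that couples the image features with the encoded DICOM tags, following the form used by Miyato and Koyama (2018). All names (`discriminator_output`, `features`, `tags`, `w_real`, `b_real`, `W_proj`) are hypothetical, the real/fake classifier is assumed to be linear, and assigning the weights 0.99 and 0.01 from Section 4.3 to the reconstruction and adversarial terms in that order is also an assumption.

```python
import numpy as np

def discriminator_output(features, tags, w_real, b_real, W_proj):
    """Score = real/fake classifier on the features + projection onto the tags.

    features: (batch, d) discriminator representation of each image
    tags:     (batch, k) multi-hot encoded DICOM tags
    w_real:   (d,) and b_real: scalar, a linear real/fake classifier (assumed)
    W_proj:   (d, k) learned projection weight matrix
    """
    real_fake = features @ w_real + b_real
    # Per-sample bilinear term features_b^T W_proj tags_b.
    projection = np.einsum("bd,dk,bk->b", features, W_proj, tags)
    return real_fake + projection

def joint_loss(loss_rec, loss_adv, lambda_rec=0.99, lambda_adv=0.01):
    """Weighted sum of the reconstruction and adversarial terms."""
    return lambda_rec * loss_rec + lambda_adv * loss_adv

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 5))
tags = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0]], dtype=float)
scores = discriminator_output(features, tags, rng.normal(size=5), 0.1,
                              rng.normal(size=(5, 4)))
```

When every tag is zero, the projection term vanishes and the score reduces to the plain real/fake classifier, which corresponds to the variant trained without DICOM input.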
3.2 DICOM Metadata

We selected two DICOM tags as targets since they are directly related to the semantic content of the image:

  • Transducer data (DICOM tag: (0018, 5010)), which indicates the probe type used for examination. There are three different transducer probes in the dataset – SC6-1, SL10-2, SL15-4, where S represents single crystal, C or L represents curvilinear or linear probe geometry, and the numbers represent the ultrasound frequency bandwidth in MHz. We classify the probes into two groups - linear (SL10-2, SL15-4) and curvilinear (SC6-1).

  • Study Description (DICOM tag: (0008, 1030)). The study description reflects the standard protocol in our institution when performing specific ultrasound exams or ultrasound-related procedures. For example, images with “US BIOPSY LIVER NONFOCAL” are acquired during an ultrasound-guided liver biopsy, implying they are predominantly liver images. We identified 45 different study descriptions in our dataset (Appendix A.1). Because of the spurious nature of the tags, we categorized the study series into eight different groups according to procedure type or site: liver, kidney, thyroid, abdomen, chest, soft tissue, nodule, and drainage. The DICOM categorization was performed manually by a board-certified radiologist. Each study series can belong to more than one group. For example, the tag “US BIOPSY LIVER NONFOCAL” is mapped to two groups: liver and abdomen. We binarized the DICOM labels in a multi-label format.
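The two encodings described in the list above could look roughly like the following sketch. The group order, the lookup-table names, the convention that 1 means a linear probe, and the all-zero vector for unknown descriptions are assumptions made for illustration. Only three of the 45 study descriptions are listed here, and reading the tag values out of a DICOM file is left out.

```python
# Order of the eight groups in the multi-hot vector (assumed for this sketch).
GROUPS = ["liver", "kidney", "thyroid", "abdomen",
          "chest", "soft tissue", "nodule", "drainage"]

# A few radiologist-assigned mappings taken from the text and Appendix A.1.
STUDY_TO_GROUPS = {
    "US BIOPSY LIVER NONFOCAL": ["liver", "abdomen"],
    "US THYROID BIOPSY": ["thyroid", "nodule"],
    "US DRAINAGE KIDNEY/PARARENAL (RIGHT)": ["abdomen", "kidney", "drainage"],
}

PROBE_TO_GEOMETRY = {"SC6-1": "curvilinear", "SL10-2": "linear", "SL15-4": "linear"}

def encode_study(description):
    """Multi-hot vector over GROUPS; unknown descriptions map to all zeros."""
    groups = STUDY_TO_GROUPS.get(description.strip().upper(), [])
    return [1 if group in groups else 0 for group in GROUPS]

def encode_probe(transducer):
    """Binary probe geometry: 1 for linear, 0 for curvilinear (assumed convention)."""
    return 1 if PROBE_TO_GEOMETRY[transducer] == "linear" else 0

print(encode_study("US BIOPSY LIVER NONFOCAL"))  # liver and abdomen set to 1
print(encode_probe("SL10-2"))
```

Keeping the mapping in a plain lookup table mirrors the manual categorization step: the table, not the code, carries the clinical judgment.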

Figure 2 shows example images with their DICOM tags.

Figure 2: Example of DICOM metadata. The first row in each subfigure indicates the original DICOM metadata. The second row is the encoding tag.

4 Experiments

4.1 Dataset

| Task | # of images | # of patients | Supervision | Description |
|---|---|---|---|---|
| Semantic Inpainting | 12267 | 1188 | Self + Weakly-supervised | Self-supervised pretraining with metadata |
| Classification | 3226 | 1000 | Supervised | Quality Score Classification |
| Segmentation | 591 | 591 | Supervised | Liver and Kidney Segmentation |

Table 1: Description of the dataset

A retrospective database was collected between September 2018 and November 2019 after proper approval from the Institutional Review Board. Informed consent was waived, and HIPAA compliance was ensured. A total of 12,267 images from 1,188 patients were collected. All images were acquired using a Supersonic Aixplorer ultrasound machine (SuperSonic Imagine S.A., Aix-en-Provence, France).

We externally evaluated our results on two different downstream tasks: classification and segmentation. These two datasets were also retrospectively collected from the same institution, but all images were acquired with a GE Logiq E9 ultrasound machine (GE Healthcare, Chicago, IL, USA). In other words, the pretraining dataset and the downstream evaluation datasets do not overlap, so the evaluation tests whether the algorithm is machine agnostic. An overview of the dataset is shown in Table 1, and the downstream tasks are described in detail in Appendix A.2.

4.2 Architecture

VGG16 (Simonyan and Zisserman, 2014) with five down-sampling blocks and batch normalization was used as the underlying architecture for the encoder. The decoder consists of four up-sampling blocks, each with a 3×3 up-convolutional layer, a batch normalization layer, and a ReLU layer.

For the quality score classification, we take the pretrained encoder (VGG16) and add a classifier, which is a sequence of a 1×1 convolutional layer, a dropout layer, a global average pooling layer, and a fully connected layer. For the liver and kidney segmentation, we adopted a UNet-like architecture (Ronneberger et al., 2015). We used the pretrained encoder as the first half of the UNet, adding five up-convolutional blocks and skip connections to complete the network.

4.3 Training

The dataset was randomly split into a training set (80%, 9814 images) and a validation set (20%, 2454 images). All images were resized to 256×384 pixels and z-score normalized before being fed into the network. We trained the context encoders with and without DICOM information using the joint loss in equation 3, with $\lambda_{rec}$ = 0.99 and $\lambda_{adv}$ = 0.01 as the weights of the reconstruction and adversarial terms. The adversarial loss with and without DICOM follows equation 2 and equation 1, respectively. Training hyperparameters include the Adam optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.999), batch size = 8, generator learning rate = 0.0001, and discriminator learning rate = 0.00001. The models were trained for 200 epochs without early stopping, and the ones with the lowest joint loss on the validation set were selected for downstream evaluation.
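The preprocessing and the random split can be sketched as follows. Whether the z-score statistics are computed per image or over the whole dataset is not stated, so per-image normalization is an assumption here, as are the function names and the fixed seed. Resizing is omitted.

```python
import numpy as np

def zscore(image, eps=1e-8):
    """Per-image z-score normalization (per-image statistics are an assumption)."""
    image = image.astype(np.float64)
    return (image - image.mean()) / (image.std() + eps)

def random_split(n_items, train_fraction=0.8, seed=0):
    """Shuffle item indices and split them into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(train_fraction * n_items))
    return order[:n_train], order[n_train:]

image = np.random.default_rng(1).uniform(0, 255, size=(256, 384))
normalized = zscore(image)
train_idx, val_idx = random_split(100)
```

A split by image index, as sketched, does not keep all images of one patient in the same partition; whether the pretraining split did so is not specified in the text.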

5 Results and Discussion

5.1 Context Encoder with DICOM

The qualitative results of the context encoder with and without DICOM tags are shown in Figure 3. In our experiments, training without DICOM tags was more prone to mode collapse, making it difficult to obtain optimal results. With DICOM tags, the generated images look sharper and resemble actual organ textures such as liver and kidney (Figure 3). This suggests that joint weakly-supervised training with DICOM tags improves the image generation quality.

Figure 3: Example of semantic in-painting results.

5.2 Downstream tasks pretraining

Next, we examined whether downstream segmentation and classification could benefit from the pretrained encoder. The results are summarized in Table 2. In both tasks, the performance improved significantly with pretraining. Furthermore, the proposed method with weakly-supervised DICOM data outperforms the original context encoder in all conditions. Freezing the encoder has different effects in the two tasks. When the encoder is frozen, we reuse the learned features directly, and the segmentation task benefits more from this approach; when the encoder is unfrozen, pretraining acts as a self-supervised initialization, and the classification task gains more from it. In this paper, we focus on the benefits of adding the DICOM information and leave the optimal approach for downstream tasks to future research.

In the segmentation task, the improvement from DICOM is marginal when the full data is used. We therefore analyzed the effect in a smaller data regime. With a small subset, 5% of our training data set, which is typical for low-resource settings, DICOM metadata improves the segmentation DICE score from 0.616 to 0.664 (freeze, p-value = 0.009) and from 0.582 to 0.653 (unfreeze, p-value = 0.0001). The same trend was also observed for the classification task.

| Pretraining | Encoder | Classification (Accuracy), 100% of training data | Classification (Accuracy), 5% of training data | Segmentation (DICE), 100% of training data | Segmentation (DICE), 5% of training data |
|---|---|---|---|---|---|
| None | | 0.558 ± 0.03 | 0.319 ± 0.03 | 0.801 ± 0.09 | 0.574 ± 0.15 |
| CE w/o DICOM | Unfreeze | 0.629 ± 0.02 | 0.387 ± 0.03 | 0.816 ± 0.06 | 0.582 ± 0.15 |
| CE w DICOM | Unfreeze | 0.657 ± 0.04 | 0.426 ± 0.04* | 0.818 ± 0.07 | 0.653 ± 0.13* |
| CE w/o DICOM | Freeze | 0.588 ± 0.03 | 0.396 ± 0.02 | 0.822 ± 0.06 | 0.616 ± 0.15 |
| CE w DICOM | Freeze | 0.645 ± 0.04* | 0.417 ± 0.03* | 0.826 ± 0.06 | 0.664 ± 0.13* |

Table 2: The downstream task performance with pretraining. Values are shown as mean ± standard deviation on the held-out test set. The standard deviation is derived from bootstrapping ten times. CE: context encoder. Best performance is highlighted in boldface. The asterisk indicates a statistically significant improvement (p < 0.05) over the counterpart pretrained without DICOM.
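One way to obtain a mean and standard deviation from ten bootstrap rounds is sketched below. The resampling unit and the exact statistic are not spelled out in the text, so resampling test samples with replacement and taking the spread of the per-round means are assumptions, as is the function name.

```python
import numpy as np

def bootstrap_mean_std(per_sample_scores, n_rounds=10, seed=0):
    """Resample the test scores with replacement and summarize the round means.

    per_sample_scores: one score per test sample (e.g. 0/1 correctness or DICE).
    Returns (mean of round means, standard deviation of round means).
    """
    scores = np.asarray(per_sample_scores, dtype=float)
    rng = np.random.default_rng(seed)
    round_means = []
    for _ in range(n_rounds):
        indices = rng.integers(0, len(scores), size=len(scores))
        round_means.append(scores[indices].mean())
    return float(np.mean(round_means)), float(np.std(round_means))

# Hypothetical per-sample correctness for a 100-image test set.
correct = np.random.default_rng(1).integers(0, 2, size=100)
mean, std = bootstrap_mean_std(correct)
```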

6 Conclusion

In this paper, we demonstrate the potential of using DICOM metadata from ultrasound images as weak labels to improve representation learning in a self-supervised scheme. The method can have great impact in resource-limited areas, given its ability to use the information in the data effectively without extra human labor. The method can be extended to other imaging modalities with DICOM tags, such as CT or MRI.

7 Acknowledgement

This work was supported by the National Institute of Biomedical Imaging and Bioengineering and the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health under Award Nos. K23 EB020710 and R01 DK119860, respectively. The authors are solely responsible for the content, and the work does not represent the official views of the National Institutes of Health.


References

  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
  • D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio (2010) Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11 (Feb), pp. 625–660.
  • R. Gauriau, C. Bridge, L. Chen, F. Kitamura, N. A. Tenenholtz, J. E. Kirsch, K. P. Andriole, M. H. Michalski, and B. C. Bizzo. Using DICOM metadata for radiological image series categorization: a feasibility study on large clinical brain MRI datasets. Journal of Digital Imaging, pp. 1–16.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
  • M. O. Gueld, M. Kohnen, D. Keysers, H. Schubert, B. B. Wein, J. Bredno, and T. M. Lehmann (2002) Quality of DICOM header information for image categorization. In Medical Imaging 2002: PACS and Integrated Medical Information Systems: Design and Evaluation, Vol. 4685, pp. 280–287.
  • V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, et al. (2016) Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 (22), pp. 2402–2410.
  • S. Hu, A. Beers, K. Chang, K. Höbel, J. P. Campbell, D. Erdogumus, S. Ioannidis, J. Dy, M. F. Chiang, J. Kalpathy-Cramer, et al. (2018) Deep feature transfer between localization and segmentation tasks. arXiv preprint arXiv:1811.02539.
  • S. Hu, W. Weng, S. Lu, Y. Cheng, F. Xiao, F. Hsu, and J. Lu (2019) Multimodal volume-aware detection and segmentation for brain metastases radiosurgery. In Workshop on Artificial Intelligence in Radiation Therapy, pp. 61–69.
  • M. Lučić, M. Ritter, M. Tschannen, X. Zhai, O. F. Bachem, and S. Gelly (2019) High-fidelity image generation with fewer labels. In International Conference on Machine Learning.
  • D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196.
  • T. Miyato and M. Koyama (2018) cGANs with projection discriminator. In International Conference on Learning Representations.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
  • A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2642–2651.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Appendix A

A.1 List of DICOM Study Descriptions

| Study Series | Encoding |
|---|---|
| US LYMPH NODE BIOPSY | soft tissue, nodule |
| US THYROID BIOPSY | thyroid, nodule |
| US DRAINAGE ABDOMEN | abdomen, drainage |
| US BIOPSY MESENTERY | abdomen, drainage, soft tissue |
| US NECK SOFT TISSUE BIOPSY | soft tissue, nodule |
| US DRAINAGE PELVIS | abdomen, drainage |
| US SOFT TISSUE BIOPSY | soft tissue, nodule |
| US BIOPSY NOT OTHERWISE SPECIFIED | soft tissue, nodule, drainage |
| CT LYMPH NODE BIOPSY | soft tissue, nodule |
| US DRAINAGE LIVER | liver, drainage, abdomen |
| US LYMPH NODE ASPIRATION/FNA | soft tissue, nodule, drainage |
| US SOFT TISSUE ASPIRATION | soft tissue, drainage |
| US ASPIRATION PELVIS | abdomen, drainage |
| US DRAINAGE KIDNEY/PARARENAL (RIGHT) | abdomen, kidney, drainage |

A.2 Downstream Tasks

A.2.1 Quality Score

The task is to identify an optimal view of Morison’s pouch, an anatomic site between the right lobe of the liver and the right kidney. Clinically, the view is important for identifying ascites and hemoperitoneum when abnormal fluid accumulation is present; it is also the reference view for estimating the severity of steatosis using the hepatorenal index. Therefore, quantifying the view quality is crucial in an ultrasound examination. The images were reviewed by a board-certified radiologist, who assigned one of five rankings as the quality measurement (Figure 4). Class 0 indicates that the view does not include the liver or the kidney and should not be used, while class 4 represents an optimal Morison’s pouch view that would be used by an experienced operator. We used ordinal encoding for the labels (class 0: [0,0,0,0], class 1: [1,0,0,0], class 2: [1,1,0,0], class 3: [1,1,1,0], class 4: [1,1,1,1]).
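The ordinal encoding listed above follows a simple rule: class k sets the first k of the four outputs to 1. A minimal sketch is given below. The function names are hypothetical, and the decoding rule (count the leading outputs above a threshold) is an assumption, since the text does not describe how predictions are mapped back to classes.

```python
def ordinal_encode(label, n_classes=5):
    """Class k becomes k ones followed by zeros, with length n_classes - 1."""
    if not 0 <= label < n_classes:
        raise ValueError(f"label must be in [0, {n_classes - 1}]")
    return [1 if position < label else 0 for position in range(n_classes - 1)]

def ordinal_decode(probabilities, threshold=0.5):
    """Count leading outputs at or above the threshold (assumed decoding rule)."""
    label = 0
    for probability in probabilities:
        if probability < threshold:
            break
        label += 1
    return label

for quality_class in range(5):
    print(quality_class, ordinal_encode(quality_class))
```

Compared with one-hot labels, this encoding makes a prediction that is off by one class differ from the target in a single output, so near misses are penalized less than distant ones.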

The dataset was split into training (800 patients, 2548 images), validation (100 patients, 343 images), and test (100 patients, 335 images) sets, stratified by patient. All images follow the same preprocessing procedure as in Section 4.3. All models were trained using the Adam optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.999), batch size = 4, and learning rate = 0.0001 over 300 epochs, minimizing a weighted binary cross-entropy loss in which the positive-to-negative weighting ratio is 2.0. The models with the lowest validation loss were selected, and the performance on the held-out test set was reported.
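A weighted binary cross-entropy with a positive-to-negative ratio of 2.0 can be written as below. This is a sketch under assumptions: the weight is applied to the positive term only, the loss is averaged over all outputs, and the function name and clipping constant are illustrative.

```python
import numpy as np

def weighted_bce(probabilities, targets, pos_weight=2.0, eps=1e-7):
    """Binary cross-entropy where positive targets count pos_weight times."""
    p = np.clip(np.asarray(probabilities, dtype=float), eps, 1.0 - eps)
    t = np.asarray(targets, dtype=float)
    loss = -(pos_weight * t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    return float(loss.mean())

# Four ordinal outputs for one image whose true class is 2.
loss = weighted_bce([0.9, 0.6, 0.3, 0.1], [1, 1, 0, 0])
```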

Figure 4: Example of quality score. Panels (a) to (e) show classes 0 to 4, respectively.

A.2.2 Kidney and Liver Segmentation

The task is to segment the kidney and liver in ultrasound images. All images were reviewed by a board-certified radiologist, who manually segmented the two organs.

The dataset was randomly split into training (391 images), validation (100 images), and test (100 images) sets. As in the classification task, all images follow the same preprocessing pipeline. All models were trained using the Adam optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.999), batch size = 8, and learning rate = 0.0001 over 500 epochs, minimizing a soft dice loss. The models with the lowest validation loss were selected, and the performance on the held-out test set was reported.
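For reference, a common form of the soft dice loss is sketched below in NumPy. The exact variant used for training (smoothing constant, per-class or per-image averaging) is not specified, so this single-mask version with a small smoothing term is an assumption.

```python
import numpy as np

def soft_dice_loss(probabilities, targets, eps=1e-6):
    """1 - soft DICE between predicted probabilities and a binary mask."""
    p = np.asarray(probabilities, dtype=float).ravel()
    t = np.asarray(targets, dtype=float).ravel()
    intersection = (p * t).sum()
    dice = (2.0 * intersection + eps) / (p.sum() + t.sum() + eps)
    return float(1.0 - dice)

mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
prediction = np.full((8, 8), 0.5)
loss = soft_dice_loss(prediction, mask)
```

Because the loss is computed from overlap rather than per-pixel error, it is less sensitive to the imbalance between organ and background pixels than a plain cross-entropy.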