Unsupervised Multimodal Representation Learning across Medical Images and Reports

11/21/2018 ∙ by Tzu Ming Harry Hsu, et al. ∙ MIT

Joint embeddings between medical imaging modalities and associated radiology reports have the potential to offer significant benefits to the clinical community, ranging from cross-domain retrieval to conditional generation of reports to the broader goals of multimodal representation learning. In this work, we establish baseline joint embedding results, measured via both local and global retrieval methods, on the soon-to-be-released MIMIC-CXR dataset, which consists of chest X-ray images and the associated radiology reports. We examine both supervised and unsupervised methods on this task and show that, for document retrieval tasks with the learned representations, only a limited amount of supervision is needed to yield results comparable to those of fully supervised methods.




1 Introduction

Medical imaging is one of the most compelling domains for the immediate application of artificial intelligence tools. Recent years have seen not only tremendous academic advancements (esteva2017dermatologist; gulshan2016development; rajpurkar2017chexnet) but also a breadth of applied tools (marr2017first; Walter2018DensitasDensity; Lagasse2018FDANews; EnvoyAI2017EnvoyAIPartners).

There has been emerging attention on joint processing of medical images and radiological free-text reports. wang2018tienet used the public NIH Chest X-ray 14 dataset (wang2017chestx) linked with the non-public associated reports both to improve disease classification performance and to generate reports automatically. gale2018producing attempted to generate radiology reports, while shin2016learning generated disease/location/severity annotations. liu2018learning generated notes, including radiology reports, for the Medical Information Mart for Intensive Care (MIMIC) dataset using non-image modalities such as demographics, previous notes, labs, and medications. These works used annotations from either machines (wang2017chestx) or humans. However, with a huge influx of imaging data beyond human capacity, parallel records of imaging and text are not always readily available. We therefore ask whether we can take advantage of massive but unannotated imaging datasets and learn from the underlying distribution of these images.

One natural area that remains unexplored is representation learning across images and reports. The idea of representation learning in a joint embedding space can be realized in multiple ways. Some (pan2011domain; chen2016transfer) explored statistical and metrical relevance across domains, and some (ganin2016domain) realized it as an adversarially determined domain-agnostic latent space. shen2017style and mor2018universal both used the latent space for style transfer, in language sentiment and music style, respectively. reed2016learning learned joint spaces of images and their captions, which reed2016generative later used for caption-driven image generation. conneau2017word and grave2018unsupervised used similar ideas to perform both supervised and unsupervised word-to-word translation tasks. chung2018unsupervised further aligned cross-modal embeddings through semantics in speech and text for spoken word classification and translation tasks.

A recent dataset, MIMIC Chest X-ray (MIMIC-CXR; soon to be publicly released), carries paired records of X-ray images and radiology reports, and the imaging modality has been explored in rubin2018large. In this work, we explore both the text and image modalities with joint embedding spaces under a spectrum of supervised and unsupervised methods. In particular, we make the following contributions:

  1. We establish baseline results and evaluation methods for jointly embedding radiological images and reports via retrieval and distance metrics.

  2. We profile the impact of supervision level on the quality of representation learning in joint embedding spaces.

  3. We characterize the influence of using different sections from the report on representation learning.

2 Methodology

2.1 Data

All experiments in this work used the MIMIC-CXR dataset. MIMIC-CXR consists of 473,057 chest X-ray images and 206,563 reports from 63,478 patients. Of these images, 240,780 are of anteroposterior (AP) views, which we focus on in this work. Further, we eliminate all duplicated radiograph images with adjusted brightness or contrast (commonly produced for clinical needs), leaving a total of 95,242/87,353 images/reports, which we subdivide into a train set of 75,147/69,171 and a test set of 19,825/18,182 images/reports, with no overlap of patients between the two. Radiological reports are parsed into sections, and we use either the impression or the findings section.
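The patient-disjoint split described above can be illustrated with a minimal sketch. The function name `split_by_patient`, the test fraction, and the toy patient identifiers are all assumptions made for illustration; the source does not specify how the split was produced beyond the absence of patient overlap.

```python
import random

def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, item) pairs so that no patient is in both sets."""
    patients = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    # Every record follows its patient, so the two sets cannot share a patient.
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test

# Toy records: (hypothetical patient id, hypothetical image filename).
records = [("p1", "a.png"), ("p1", "b.png"), ("p2", "c.png"),
           ("p3", "d.png"), ("p4", "e.png"), ("p4", "f.png"), ("p5", "g.png")]
train, test = split_by_patient(records)
```

Splitting on patient identifiers rather than on individual images keeps near-duplicate studies of the same patient from leaking between train and test.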

For evaluation, we aggregate a list of unique International Classification of Diseases (ICD-9) codes from all patient admissions and ask a clinician to pick out a subset of codes that are related to thoracic diseases. Records with ICD-9 codes in the subset are then extracted, including 3,549 images from 380 patients. This population serves as a disease-related evaluation for retrieval algorithms. Note that this disease information is never provided during training in any setting.

2.2 Methods

Figure 1: The overall experimental pipeline. EA: embedding alignment; Adv: adversarial training.

Our overall experimental flow follows Figure 1. Notes are featurized via (1) term frequency-inverse document frequency (TF-IDF) over bi-grams, (2) pre-trained GloVe word embeddings (pennington2014glove, ) averaged across the selected section of the report, (3) sentence embeddings, or (4) paragraph embeddings. In (3) and (4), we first perform sentence/paragraph splitting, and then fine-tune a deep averaging network (DAN) encoder (bird2004nltk, ; cer2018universal, ; iyyer2015deep, ) with the corpus. Embeddings are finally averaged across sentences/paragraphs. The DAN encoder is pretrained on a variety of data sources and tasks and fine-tuned on the context of report sections.
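As a rough illustration of featurization option (1), the sketch below computes TF-IDF weights over word bi-grams in plain Python. The whitespace tokenizer, the smoothed IDF variant, and the toy report snippets are assumptions; the source does not state which tokenizer or TF-IDF convention was used.

```python
import math
from collections import Counter

def bigrams(text):
    """Word bi-grams from a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf_bigrams(reports):
    docs = [Counter(bigrams(r)) for r in reports]
    vocab = sorted({g for d in docs for g in d})
    n = len(docs)
    # Document frequency: number of reports containing each bi-gram.
    df = {g: sum(1 for d in docs if g in d) for g in vocab}
    # Smoothed IDF (an assumed variant; several conventions exist).
    idf = {g: math.log((1 + n) / (1 + df[g])) + 1.0 for g in vocab}
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1
        vectors.append([d[g] / total * idf[g] for g in vocab])
    return vocab, vectors

# Hypothetical report snippets, for illustration only.
reports = [
    "no acute cardiopulmonary process",
    "no acute intrathoracic process",
    "small left pleural effusion",
]
vocab, vectors = tfidf_bigrams(reports)
```

Bi-grams shared by many reports (here "no acute") receive a lower weight than bi-grams specific to one report.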

Images are resized to 256×256, then featurized to the last bottleneck layer of a pretrained DenseNet-121 model (rajpurkar2017chexnet). PCA is applied onto the 1024-dimension raw image features to obtain 64-dimension features (96.9% variance explained).
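The PCA step can be sketched with NumPy as follows. The random matrix below is only a stand-in for the 1024-dimension DenseNet features, and the helper name `pca_reduce` is an assumption; the explained-variance ratio it prints for synthetic data has no relation to the 96.9% reported for the real features.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project rows of `features` onto the top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Thin SVD: rows of vt are principal directions, ordered by variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return reduced, explained

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 vectors of dimension 1024 whose variance is
# concentrated in a 64-dimension subspace, plus a little noise.
latent = rng.normal(size=(500, 64))
mixing = rng.normal(size=(64, 1024))
features = latent @ mixing + 0.01 * rng.normal(size=(500, 1024))
reduced, explained = pca_reduce(features, 64)
```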

Text features are projected into the 64-dimension image feature space. We consider several methods, each with a different objective.

Embedding Alignment (EA)

Here, we find a linear transformation $W$ between two sets of matched points $X$ and $Y$ by minimizing $L_{EA}(X, Y) = \lVert W^\top X - Y \rVert_F^2$.
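A minimal sketch of embedding alignment as an ordinary least-squares fit is given below. It assumes the text features `X` and image features `Y` are stored with one paired sample per column and that the objective is a squared Frobenius norm of the residual; the names `fit_alignment`, `X`, `Y`, and `W` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_alignment(X, Y):
    """Least-squares W minimizing ||W.T @ X - Y||_F^2.

    X: (d_x, n) text features, Y: (d_y, n) image features; columns are paired.
    """
    # W.T @ X = Y is equivalent to X.T @ W = Y.T, which lstsq solves directly.
    W, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return W  # shape (d_x, d_y)

rng = np.random.default_rng(0)
d_x, d_y, n = 10, 4, 200
W_true = rng.normal(size=(d_x, d_y))
X = rng.normal(size=(d_x, n))
Y = W_true.T @ X + 0.01 * rng.normal(size=(d_y, n))
W = fit_alignment(X, Y)
residual = np.linalg.norm(W.T @ X - Y) ** 2
```

On this synthetic data the fitted `W` recovers `W_true` up to the injected noise; with real features the residual would not be expected to approach zero.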

Adversarial Domain Adaptation (Adv)

Adversarial training pits a discriminator, $D$, implemented as a 2-layer (hidden size 256) neural network using scaled exponential linear units (SELUs) (klambauer2017self), against a projection matrix $W$, as the generator. $D$ is trained to classify points in the joint space according to source modality, and $W$ is trained adversarially to fool $D$. The two are updated alternately: $D$ minimizes its classification loss while $W$ minimizes the opposing loss.
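The two opposing losses can be sketched as a forward pass in NumPy. This is an illustration under stated assumptions, not the authors' implementation: the losses are taken to be binary cross-entropy with flipped labels for the generator (the convention used in conneau2017word), samples are stored as rows, and the names `discriminator`, `adversarial_losses`, `W`, and `params` are hypothetical. No gradient updates are shown.

```python
import numpy as np

def selu(x):
    # Constants from the SELU definition.
    alpha, scale = 1.6732632423543772, 1.0507009873554805
    return scale * np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

def discriminator(Z, params):
    """2-layer network returning P(point came from the image modality)."""
    W1, b1, W2, b2 = params
    hidden = selu(Z @ W1 + b1)
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))

def bce(p, target):
    eps = 1e-12
    return float(-np.mean(target * np.log(p + eps)
                          + (1 - target) * np.log(1 - p + eps)))

def adversarial_losses(W, X_text, Y_img, params):
    """Rows of X_text and Y_img are unpaired samples from each modality."""
    p_text = discriminator(X_text @ W, params)  # text projected into image space
    p_img = discriminator(Y_img, params)
    # Discriminator objective: label images 1 and projected text 0.
    loss_d = 0.5 * (bce(p_img, 1.0) + bce(p_text, 0.0))
    # Generator objective (assumed): the same loss with labels flipped.
    loss_w = 0.5 * (bce(p_img, 0.0) + bce(p_text, 1.0))
    return loss_d, loss_w

rng = np.random.default_rng(0)
d_text, d_img, hidden = 10, 4, 256
W = 0.1 * rng.normal(size=(d_text, d_img))
params = (0.1 * rng.normal(size=(d_img, hidden)), np.zeros(hidden),
          0.1 * rng.normal(size=(hidden, 1)), np.zeros(1))
X_text = rng.normal(size=(32, d_text))
Y_img = rng.normal(size=(32, d_img))
loss_d, loss_w = adversarial_losses(W, X_text, Y_img, params)
```

In training, one would alternate gradient steps on `params` to lower `loss_d` and on `W` to lower `loss_w`; an uninformative discriminator that outputs 0.5 everywhere gives both losses the value log 2.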

Procrustes Refinement (Adv + Proc)

On top of adversarial training, we also use an unsupervised Procrustes-induced refinement as in conneau2017word.
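A simplified sketch of such a refinement loop follows. It assumes both modalities already share a dimension, stores samples as columns, and builds the synthetic dictionary from mutual nearest neighbors under plain cosine similarity; conneau2017word uses a different similarity criterion (CSLS), so this is a swapped-in simplification. The function names and toy data are hypothetical.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||W.T @ X - Y||_F (columns are paired)."""
    U, _, Vt = np.linalg.svd(X @ Y.T)
    return U @ Vt

def mutual_nearest_pairs(A, B):
    """Pairs (i, j) where columns A[:, i] and B[:, j] are each other's
    nearest neighbor under cosine similarity."""
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    B = B / np.linalg.norm(B, axis=0, keepdims=True)
    sim = A.T @ B
    a_to_b = sim.argmax(axis=1)
    b_to_a = sim.argmax(axis=0)
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

def refine(W, X, Y, n_iter=3):
    """Alternate between inducing pairs with the current W and re-solving W."""
    for _ in range(n_iter):
        pairs = mutual_nearest_pairs(W.T @ X, Y)
        if not pairs:
            break
        idx_x, idx_y = map(list, zip(*pairs))
        W = procrustes(X[:, idx_x], Y[:, idx_y])
    return W

rng = np.random.default_rng(0)
d, n = 8, 100
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # ground-truth orthogonal map
X = rng.normal(size=(d, n))
Y = (Q.T @ X)[:, rng.permutation(n)]          # same points, pairing unknown
W0 = Q + 0.02 * rng.normal(size=(d, d))       # rough initial alignment
W_refined = refine(W0, X, Y)
```

The initial map `W0` plays the role of the adversarially trained projection; the refinement only helps when that starting point is already good enough for the induced pairs to be mostly correct.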


We also assess how much supervision is necessary to ensure strong performance on these modalities by randomly subsampling our data into supervised and unsupervised samples. We then combine the embedding alignment and adversarial training objectives into a single weighted objective and train on both simultaneously as we vary the supervised fraction. The relative weight of the two terms was chosen based on preliminary experiments.
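The subsampling and the combined objective can be sketched as below. The weighted-sum form, the placement of the weight on the alignment term, and every numeric value are assumptions made for illustration; the source's exact formula and weight are not reproduced here.

```python
import random

def split_supervision(n_samples, fraction, seed=0):
    """Randomly mark a fraction of training pairs as supervised."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_sup = round(n_samples * fraction)
    return sorted(indices[:n_sup]), sorted(indices[n_sup:])

def combined_loss(loss_adv, loss_ea, weight):
    # Assumed form: adversarial loss on all samples plus a weighted
    # alignment loss computed on the supervised pairs only.
    return loss_adv + weight * loss_ea

# Hypothetical numbers: 10,000 training pairs, 0.1% supervision.
supervised, unsupervised = split_supervision(10_000, 0.001)
total = combined_loss(loss_adv=0.69, loss_ea=2.0, weight=0.5)
```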

Orthogonal Regularization

smith2017offline, conneau2017word, and xing2015normalized all showed that imposing orthonormality on linear projections leads to better performance and stability in training. However, brock2018large suggested that orthogonality alone (i.e., not constraining the norms) can perform better as a regularization. Thus, on top of the objectives, we add the penalty $\beta \lVert W^\top W \odot (\mathbf{1}\mathbf{1}^\top - I) \rVert_F^2$, where $\odot$ denotes element-wise product and $\mathbf{1}$ denotes a column vector of all ones. The coefficient $\beta$ was selected by scanning through a range of values for good performance.
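The regularizer can be sketched in NumPy as follows, assuming the form proposed in brock2018large: a squared Frobenius norm of the off-diagonal entries of the Gram matrix of the projection, scaled by a coefficient. The names `orthogonal_penalty` and `beta`, and the value 0.1, are hypothetical.

```python
import numpy as np

def orthogonal_penalty(W, beta):
    """beta * ||(W.T @ W) * (1 1^T - I)||_F^2, i.e. off-diagonal Gram entries."""
    gram = W.T @ W
    ones = np.ones((gram.shape[0], 1))
    mask = ones @ ones.T - np.eye(gram.shape[0])  # zero on the diagonal
    return beta * np.linalg.norm(gram * mask, ord="fro") ** 2

# Orthogonal columns with unequal norms: no penalty, since the diagonal
# of the Gram matrix (the squared column norms) is masked out.
W_orth = np.diag([1.0, 2.0, 3.0])
# Correlated columns: penalized.
W_corr = np.array([[1.0, 1.0], [0.0, 1.0]])
p_orth = orthogonal_penalty(W_orth, beta=0.1)
p_corr = orthogonal_penalty(W_corr, beta=0.1)
```

Masking the diagonal is what distinguishes this penalty from an orthonormality constraint: column norms are left free, and only correlations between columns are discouraged.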

2.3 Evaluation

We evaluate via cross-domain retrieval in the test set: querying in the joint embedding space for the closest neighboring images using a report (text-to-image), or vice versa (image-to-text). For direct pairings, we compute the cosine similarity and the mean reciprocal rank, $\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$, where $Q$ is the set of queries and $\mathrm{rank}_q$ is the rank of the first true pair for $q$ (e.g., the first paired image or text corresponding to the query $q$) in the retrieval list. For thoracic disease induced pairings, we first define the relevance between two entries $q$ and $r$ as the intersection-over-union of their respective sets of ICD-9 codes. Then we calculate the normalized discounted cumulative gain (jarvelin2002cumulated), $\mathrm{nDCG}@k = \mathrm{DCG}@k / \mathrm{IDCG}@k$, where $\mathrm{IDCG}@k$ denotes the ideal value for $\mathrm{DCG}@k$ using a perfect retrieval algorithm. All experiments are repeated at least 5 times with random initial seeds. Means and 95% confidence intervals are reported in the following section.
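The retrieval metrics described in this section can be sketched in plain Python. The reciprocal-rank metric is assumed to be the standard mean reciprocal rank, and the DCG variant (linear gain with a log2 position discount) is also an assumption, since several conventions exist. The ICD-9 code sets and item identifiers below are toy values.

```python
import math

def mean_reciprocal_rank(ranked_lists, true_ids):
    """ranked_lists[q] is the retrieval list for query q;
    true_ids[q] is the set of items truly paired with q."""
    total = 0.0
    for ranking, truth in zip(ranked_lists, true_ids):
        rank = next((i for i, item in enumerate(ranking, start=1)
                     if item in truth), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_lists)

def iou(codes_a, codes_b):
    """Intersection-over-union of two sets of ICD-9 codes."""
    union = codes_a | codes_b
    return len(codes_a & codes_b) / len(union) if union else 0.0

def dcg(relevances):
    # Assumed DCG variant: linear gain, log2 position discount.
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

def ndcg_at_k(query_codes, retrieved_codes, candidate_codes, k):
    """nDCG@k with relevance defined as the ICD-9 set IoU with the query."""
    gains = [iou(query_codes, c) for c in retrieved_codes[:k]]
    ideal = sorted((iou(query_codes, c) for c in candidate_codes),
                   reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example with hypothetical ICD-9 code sets.
query = {"486", "428.0"}
pool = [{"486"}, {"428.0", "486"}, {"518.81"}]
retrieved = [pool[0], pool[1], pool[2]]  # order returned by some retriever
score = ndcg_at_k(query, retrieved, pool, k=3)
mrr = mean_reciprocal_rank([["b", "a", "c"], ["x", "y"]], [{"a"}, {"x"}])
```

The reciprocal rank rewards only the exact paired record, whereas nDCG gives partial credit to any retrieved record that shares diagnoses with the query.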

Text Feature Method Similarity
bi-gram EA
word EA
sentence EA
paragraph EA
bi-gram Adv
bi-gram Adv + Proc
word Adv
word Adv + Proc
sentence Adv
sentence Adv + Proc
paragraph Adv
paragraph Adv + Proc
Table 1: Comparison among supervised (upper) and unsupervised (lower) methods. Subscripts show the half width of confidence intervals. Bold denotes the best performance in each group. Chance is the expected value if we randomly yield retrievals. Higher is better for all metrics.

3 Results

Retrieval with/without Supervision

Table 1 compares four types of text features and supervised/unsupervised methods. We find that unsupervised methods can achieve comparable results on large-scale disease-related retrieval tasks without the need for labeling the chest X-ray images. Experiments show that uni-, bi-, and tri-grams yield very similar results, so we only include bi-grams in the table. Additionally, we find that the high-level sentence and paragraph embedding approaches underperformed the bi-gram text representation. Although they generalize well (cer2018universal), sentence and paragraph embeddings learned from the supervised multi-task pre-trained model may not represent the domain-specific radiological reports well, owing to the lack of medical domain tasks in the pre-training process. Unsupervised Procrustes refinement is occasionally, but not universally, helpful. Note that the direct-pair rank metric (MRR) is comparatively small, since reports are in general highly similar for radiographs with the same disease types.

The Impact of Supervision Fraction

We define the supervision fraction as the fraction of pairing information provided in the training set. Note that the ICD-9 codes are not provided for training even in the fully supervised setting. Figure 2 shows our evaluation metrics for models trained using bi-gram text features and the semi-supervised learning objective for various supervision fractions. Supervision as low as 0.1% can drastically improve the alignment quality, especially in terms of cosine similarity and retrieval quality. More annotations further improve the performance measures, but each linear increase in performance requires almost exponentially more paired data points. This suggests that combining a well-annotated dataset with a large but unannotated dataset can yield a substantial performance boost.

Figure 2: Performance measures of retrieval tasks at a fixed number of retrieved items as a function of the supervision fraction. Higher is better. Note the x-axis is in log scale. Unsupervised is on the left, increasingly supervised to the right. Dashed lines indicate the performance by chance. Vertical bars indicate the 95% confidence interval, and some are too narrow to be visible.

Using Different Sections of the Report

We investigate the effectiveness of using different sections for the embedding alignment task. All models in Figure 3 are run with a supervision fraction of 1%. The models trained on the findings section outperformed the models trained on the impression section as measured by cosine similarity and MRR. This makes sense from a clinical perspective: radiologists usually describe only image patterns in the findings section, so its text aligns well with the image. In the impression section, on the other hand, they make integrated radiological-clinical interpretations, which means that both the image-uncorrelated clinical history and the findings are mentioned there. Since nDCG is calculated using ICD-9 codes, which carry disease-related information, it naturally aligns with the purpose of writing an impression section. This may explain why the models trained on the impression section worked better for nDCG.

Figure 3: Different metrics for retrieval on either the impression or findings section using four types of features. 95% confidence intervals are indicated on the bars.

4 Conclusion

MIMIC-CXR will soon be the largest publicly available imaging dataset consisting of both medical images and paired radiological reports, promising myriad applications that can make use of both modalities together. We establish baseline results using supervised and unsupervised joint embedding methods along with local (direct pairs) and global (ICD-9 code groupings) retrieval evaluation metrics. Results show that incorporating more unsupervised data into training can increase performance with minimal annotation effort. Further study of joint embeddings between these modalities may enable significant applications, such as text/image generation or the incorporation of other EMR modalities.