We present a novel approach to training machine learning models for assessing pulmonary edema severity from chest radiographs by jointly learning representations from the images (chest radiographs) and their associated radiology reports. Pulmonary edema is the most common reason patients with acute congestive heart failure (CHF) seek care in hospitals [9, 15, 1]. The treatment success in acute CHF cases depends crucially on effective management of patient fluid status, which in turn requires pulmonary edema quantification, rather than detecting its mere absence or presence.
Chest radiographs are commonly acquired to assess pulmonary edema in routine clinical practice. Radiology reports capture radiologists' impressions of the edema severity in the form of unstructured text. While chest radiographs contain the ground-truth information about the disease, manually labeling them is time intensive and therefore expensive. Labels extracted from reports are thus used as a proxy for ground-truth image labels. However, only a limited number of numerical edema severity labels can be extracted from the reports, which limits the amount of labeled image data we can learn from. This presents a significant challenge for learning accurate image-based models for edema assessment. To improve the performance of the image-based model and to leverage a larger amount of training data, we make use of the free-text reports, which contain rich information about radiographic findings and the reasoning behind the pathology assessment, by including them in the training process.
We propose a neural network model that jointly learns from images and free text to quantify pulmonary edema severity from chest radiographs. At training time, the model learns from a large number of chest radiographs and their associated radiology reports, with a limited number of numerical edema severity labels. At inference time, the model computes edema severity given only the input image. While the model can also make predictions from reports, our main interest is in leveraging free-text information during training to improve the accuracy of image-based inference. In contrast to prior work in the image-text domain that fuses image and text features, our goal is to decouple the two modalities at inference time to construct an accurate image-based model.
To the best of our knowledge, ours is the first method to leverage free-text radiology reports for improving image model performance in this application. Our experimental results demonstrate that the joint representation learning framework improves the accuracy of edema severity estimates over a purely image-based model on a fully labeled subset of the data (supervised). The joint learning framework uses a ranking-based criterion [12, 7], which allows training the model on a larger dataset of unlabeled images and reports. This semi-supervised modification yields a further improvement in accuracy. Additional advantages of our joint learning framework are 1) the image and text models can be decoupled at inference time, and 2) it provides textual explanations for image classification in the form of saliency highlights in the radiology reports.
The ability of neural networks to learn effective feature representations from images and text has catalyzed the recent surge of interest in joint image-text modeling. In supervised learning, tasks such as image captioning have leveraged a recurrent visual attention mechanism based on recurrent neural networks (RNNs) to improve captioning performance. TieNet used this attention-based text embedding framework for pathology detection from chest radiographs [29]. A similar image-text embedding setup has been employed for chest radiograph annotation. In unsupervised learning, training a joint global embedding space for visual object discovery has recently been shown to capture relevant structure. All of these models used RNNs to encode text features. More recently, transformers such as the BERT model have shown the ability to capture richer contextualized word representations using self-attention, and have advanced the state of the art in nearly every language processing task compared to RNN variants. Our setup, while similar to these approaches, uses a series of residual blocks to encode the image representation and the BERT model to encode the text representation. We use the radiology reports during training only, to improve the image-based model's performance. This is in contrast to visual question answering [3, 19, 2], where inference is performed on an image-text pair, and image/video captioning [28, 22, 24, 16], where the model generates text from the input image.
For training and evaluating our model, we use the MIMIC-CXR dataset v2.0, consisting of 377,110 chest radiographs associated with 227,835 radiology reports. The data was collected in routine clinical practice, and each report is associated with one or more images. We limited our study to the 247,425 frontal-view radiographs.
We extracted pulmonary edema severity labels from the associated radiology reports using regular expressions (regex) with negation detection. The keywords for each severity level ("none" = 0, "vascular congestion" = 1, "interstitial edema" = 2, and "alveolar edema" = 3) are summarized in the supplementary materials. To limit confounding keywords from other disease processes, we restricted label extraction to patients with congestive heart failure (CHF), identified by their ED ICD-9 diagnosis codes in the MIMIC dataset. Cohort selection by CHF diagnosis code was previously validated by manual chart review. This resulted in 16,108 radiology reports. Regex labeling yielded 6,710 labeled reports associated with 6,743 frontal-view images (2,883, 1,511, 1,709, and 640 images for the four severity levels, respectively). Hence, our dataset includes 247,425 image-text pairs, 6,743 of which are of CHF patients with edema severity labels. Note that some reports are associated with more than one image, so one report may appear in more than one image-text pair.
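A minimal sketch of this labeling step follows. The keyword lists, negation cues, and 40-character negation window below are illustrative only; the actual terms and rules are given in the supplementary materials.

```python
import re

# Hypothetical, simplified sketch of regex labeling with negation detection.
POSITIVE_KEYWORDS = {
    3: ["alveolar edema", "severe pulmonary edema"],
    2: ["interstitial edema", "interstitial opacities"],
    1: ["vascular congestion", "cephalization"],
}
NEGATION_CUES = re.compile(r"\b(no|without|resolved|absent)\b")
NEGATED_LEVEL0 = re.compile(
    r"no (pulmonary edema|vascular congestion|fluid overload|"
    r"acute cardiopulmonary process)"
)

def label_report(text):
    """Return the highest severity with an un-negated keyword mention,
    0 if only explicitly negated mentions are found, else None."""
    text = text.lower()
    for level in (3, 2, 1):
        for keyword in POSITIVE_KEYWORDS[level]:
            for match in re.finditer(re.escape(keyword), text):
                window = text[max(0, match.start() - 40):match.start()]
                if not NEGATION_CUES.search(window):
                    return level  # un-negated positive mention
    if NEGATED_LEVEL0.search(text):
        return 0  # explicit negation maps to "none"
    return None  # no usable mention, report left unlabeled
```

Reports that return None are excluded from the labeled subset but still contribute to the unlabeled image-text pairs.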
Let $x$ be a 2D chest radiograph, $t$ be the free text in a radiology report, and $y \in \{0, 1, 2, 3\}$ be the corresponding edema severity label. Our dataset includes a set of image-text pairs $\{(x_i, t_i)\}$, where $i = 1, \ldots, N$. The first $N_L$ image-text pairs are annotated with severity labels $\{y_i\}_{i=1}^{N_L}$. Here we train a joint model that constructs an image-text embedding space, where an image encoder and a text encoder are used to extract image features and text features separately (Fig. 1). Two classifiers are trained to classify the severity labels independently from the image features and from the text features. This setup enables us to decouple the image classification and the text classification at inference time. Learning the two representations jointly at training time improves the performance of the image model.
Joint Representation Learning.
We apply a ranking-based criterion [7, 12] for training the image encoder $f_{\theta}$ and the text encoder $g_{\phi}$, parameterized by $\theta$ and $\phi$ respectively, to learn image and text feature representations $f_{\theta}(x)$ and $g_{\phi}(t)$. Specifically, given an image-text pair $(x_i, t_i)$, we randomly select an impostor image $x_{m(i)}$ and an impostor report $t_{m(i)}$ from the training set. This selection is generated at the beginning of each training epoch, where $m(\cdot)$ produces a random permutation of $\{1, \ldots, N\}$.
We encourage the feature representations of a matched pair $(x_i, t_i)$ to be "closer" than those of the mismatched pairs $(x_{m(i)}, t_i)$ and $(x_i, t_{m(i)})$ in the joint embedding space. Direct minimization of the distance between $f_{\theta}(x_i)$ and $g_{\phi}(t_i)$ could end up pushing the image and text features into a small cluster in the embedding space. Instead, we encourage matched image-text features to be close while spreading out all feature representations in the embedding space for downstream classification, by constructing the loss function

$\mathcal{L}_{\mathrm{rank}} = \max\big(0,\, \mathrm{Sim}(f_{\theta}(x_{m(i)}), g_{\phi}(t_i)) - \mathrm{Sim}(f_{\theta}(x_i), g_{\phi}(t_i)) + \eta\big) + \max\big(0,\, \mathrm{Sim}(f_{\theta}(x_i), g_{\phi}(t_{m(i)})) - \mathrm{Sim}(f_{\theta}(x_i), g_{\phi}(t_i)) + \eta\big), \quad (1)$

where $\mathrm{Sim}(\cdot, \cdot)$ is the similarity measurement of two feature representations in the joint embedding space and $\eta$ is a margin parameter that is set based on $|y_i - y_{m(i)}|$ when both $y_i$ and $y_{m(i)}$ are known; otherwise, $\eta$ is a constant. That is, the margin is determined by the difference due to the mismatch if both labels are known; otherwise the margin is a constant.
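As a concrete sketch, the ranking criterion for a single training pair can be written as follows. The function and variable names are ours, the similarity is the dot product, and the constant margin `eta` stands in for the label-difference-dependent margin used when both labels are known.

```python
import numpy as np

def ranking_loss(img, txt, img_impostor, txt_impostor, eta=0.2):
    """Bidirectional margin ranking loss with dot-product similarity."""
    s_match = img @ txt
    # impostor image paired with the true report
    loss_img = max(0.0, img_impostor @ txt - s_match + eta)
    # true image paired with the impostor report
    loss_txt = max(0.0, img @ txt_impostor - s_match + eta)
    return loss_img + loss_txt
```

At each epoch, the impostors can be drawn via a random permutation of the training indices, e.g. `m = np.random.permutation(N)`.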
We employ two fully connected layers (with the same neural network architecture) on the joint embedding space to assess edema severity from the image and from the report, respectively. For simplicity, we treat the problem as multi-class classification, i.e., the classifiers' outputs $\hat{y}^{I}$ and $\hat{y}^{T}$ are encoded as one-hot 4-dimensional vectors. We use cross entropy as the loss function for training the classifiers and the encoders on the labeled data:

$\mathcal{L}_{\mathrm{CE}} = \sum_{i=1}^{N_L} \big[\mathrm{CE}(\hat{y}^{I}_i, y_i) + \mathrm{CE}(\hat{y}^{T}_i, y_i)\big]; \quad (2)$

i.e., minimizing the cross entropy also affects the encoder parameters. The total training objective combines the ranking loss and the cross entropy:

$\mathcal{L} = \mathcal{L}_{\mathrm{rank}} + \mathcal{L}_{\mathrm{CE}}. \quad (3)$
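A sketch of the per-example cross-entropy term, computed from a classifier's 4-dimensional logits with a numerically stable log-softmax (the function name and implementation are ours):

```python
import numpy as np

def cross_entropy(logits, label):
    """Multi-class cross entropy for one example (severity labels 0..3)."""
    z = logits - logits.max()                  # stabilize the exponentials
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax
    return -log_probs[label]
```

The cross-entropy loss sums this term over the image and text classifiers for each labeled pair, and the total objective adds the ranking loss.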
The image encoder is implemented as a series of residual blocks; the text encoder is a BERT model that uses the leading [CLS] token's hidden state of size 768 and a maximum sequence length of 320. The image encoder is trained from a random initialization, while the BERT model is fine-tuned during the training of the joint model. The BERT model parameters are initialized with weights pre-trained on scientific text. The image features and the text features are represented as 768-dimensional vectors in the joint embedding space. The two classifiers are both 768-to-4 fully connected layers. The neural network architecture is provided in the supplementary materials.
We employ the stochastic gradient-based optimization procedure AdamW to minimize the loss in Eq. (3) and use a warm-up linear scheduler for the learning rate. The model is first trained on all the image-text pairs by optimizing the first term in Eq. (3) for 10 epochs, and then trained on the labeled image-text pairs by optimizing the full Eq. (3) for 50 epochs. The mini-batch size is 4. We use the dot product as the similarity metric in Eq. (1). The dataset is split into training and test sets. All hyper-parameters are selected based on the results of 5-fold cross validation within the training set.
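The warm-up linear schedule can be sketched as follows. The warm-up fraction is an assumed value, not reported in the paper:

```python
# Learning rate rises linearly to base_lr over the warm-up steps,
# then decays linearly to zero over the remaining steps.
def warmup_linear_lr(step, total_steps, base_lr, warmup_frac=0.1):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

In practice, a library implementation such as the one shipped with the Hugging Face transformers package would serve the same purpose.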
The size of the chest radiographs varies and is around 3000×3000 pixels. As data augmentation, we randomly translate and rotate the images on the fly during training and crop them to 2048×2048 pixels. We maintain the original image resolution to capture the subtle differences between the different levels of pulmonary edema severity. For the radiology reports, we extract the impressions, findings, conclusion, and recommendation sections. If none of these sections is present in the report, we use the final report section. We tokenize the text with ScispaCy before providing it to the BERT tokenizer.
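The translate-and-crop part of this augmentation might look like the following sketch (rotation omitted for brevity; the shift range is an illustrative choice, not a value from the paper):

```python
import numpy as np

def random_translate_crop(image, out_size=2048, max_shift=100, rng=None):
    """Randomly shift the crop center, then crop to out_size x out_size."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    half = out_size // 2
    cy = int(np.clip(h // 2 + dy, half, h - half))  # keep crop inside image
    cx = int(np.clip(w // 2 + dx, half, w - half))
    return image[cy - half:cy + half, cx - half:cx + half]
```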
For evaluating our model, we randomly selected 531 labeled image-text pairs (corresponding to 485 reports) for expert annotation. A board-certified radiologist and two domain experts reviewed and corrected the regex labels of the reports. We use the expert labels for model testing. The overall accuracy of the regex labels (positive predictive value compared against the expert labels) is 89%. The other 6,212 labeled image-text pairs and around 240K unlabeled image-text pairs were used for training. There is no patient overlap between the training set and the test set.
We evaluated variants of our model and training regimes as follows:
image-only: An image-only model with the same architecture as the image stream in our joint model. We trained the image model in isolation on the 6,212 labeled images.
ranking-dot, ranking-l2, ranking-cosine: A joint image-text model trained on the 6,212 labeled image-text pairs only, using the ranking-based criterion with the dot product, l2 distance, or cosine similarity in Eq. (1). We compare these with alternatives to the joint representation learning loss:
dot, l2, cosine: direct minimization on the similarity metrics without the ranking based criterion.
ranking-dot-semi: A joint image-text model trained on the 6,212 labeled and the 240K unlabeled image-text pairs in a semi-supervised fashion, using the ranking based criterion with dot product in Eq. (1). Dot product is selected for the ranking-based loss based on cross-validation experiments on the supervised data comparing ranking-dot, ranking-l2, ranking-cosine, dot, l2, and cosine.
All reported results are compared against the expert labels in the test set. The image portion of the joint model is decoupled for testing, and the reported results are predicted from images only. To optimize the baseline performance, we performed a separate hyper-parameter search for the image-only model using 5-fold cross validation (while holding out the test set).
We use the area under the ROC curve (AUC) and the macro-averaged F1-score (macro-F1) for model evaluation. Since the 4 classes are ordinal, we dichotomize the severity levels and report 3 comparisons: 0 vs 1,2,3; 0,1 vs 2,3; and 0,1,2 vs 3.
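The dichotomized evaluation can be sketched as follows. The rank-sum AUC implementation here is a generic stand-in, not necessarily the routine used in the experiments:

```python
import numpy as np

def dichotomize(labels, threshold):
    """1 if severity > threshold, else 0 (threshold 0, 1, or 2)."""
    return (np.asarray(labels) > threshold).astype(int)

def auc(scores, targets):
    """Probability that a random positive outranks a random negative (ties 0.5)."""
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets)
    pos, neg = scores[targets == 1], scores[targets == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

For example, `threshold=1` yields the "0,1 vs 2,3" comparison.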
Table 1 reports the performance statistics for all similarity measures. The findings are consistent with our cross-validation results: the ranking-based criterion offers a significant improvement when combined with the dot product as the similarity metric.
Table 2 reports the performance of the optimized baseline model (image-only) and two variants of the joint model (ranking-dot and ranking-dot-semi). We observe that when the joint model learns from the large number of unlabeled image-text pairs, it achieves the best performance. The unsupervised learning minimizes the ranking-based loss in Eq. (1), which does not depend on availability of labels.
It is not surprising that the model is better at differentiating the severity level 3 than other severity categories, because level 3 has the most distinctive radiographic features in the images.
|Method||AUC (0 vs 1,2,3)||AUC (0,1 vs 2,3)||AUC (0,1,2 vs 3)||macro-F1|
Joint Model Visualization.
As a by-product, our approach offers the possibility of interpreting the model's classification using text. While a method like Grad-CAM can localize regions in the image that are "important" to the model prediction, it does not identify the relevant characteristics of the radiographs, such as texture. By leveraging the image-text embedding association, we visualize the heatmap of the text attention corresponding to the last layer of the [CLS] token in the BERT model. This heatmap indicates report tokens that are important to our model prediction. As shown in Fig. 2, we use Grad-CAM to localize relevant image regions and the highlighted words (radiographic findings, anatomical structures, etc.) from the text embedding to explain the model's decision making.
In this paper, we presented a neural network model that jointly learns from images and text to assess pulmonary edema severity from chest radiographs. The joint image-text representation learning framework incorporates the rich information present in the free-text radiology reports and significantly improves the performance of edema assessment compared to learning from images alone. Moreover, our experimental results show that joint representation learning benefits from the large amount of unlabeled image-text data.
Expert labeling of the radiology reports enabled us to quickly obtain a reasonable amount of test data, but this is inferior to direct labeling of images. The joint model visualization suggests the possibility of using the text to semantically explain the image model, which represents a promising direction for future investigation.
This work was supported in part by NIH NIBIB NAC P41EB015902, Wistron Corporation, Takeda, MIT Lincoln Lab, and Philips. We also thank Dr. Daniel Moyer for helping generate Fig. 1.
-  Adams Jr, K.F., Fonarow, G.C., Emerman, C.L., LeJemtel, T.H., Costanzo, M.R., Abraham, W.T., Berkowitz, R.L., Galvao, M., Horton, D.P., Committee, A.S.A., Investigators, et al.: Characteristics and outcomes of patients hospitalized for heart failure in the united states: rationale, design, and preliminary observations from the first 100,000 cases in the acute decompensated heart failure national registry (adhere). American heart journal 149(2), 209–216 (2005)
-  Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6077–6086 (2018)
-  Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE international conference on computer vision. pp. 2425–2433 (2015)
-  Beltagy, I., Lo, K., Cohan, A.: Scibert: A pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3606–3611 (2019)
-  Ben-Younes, H., Cadene, R., Cord, M., Thome, N.: Mutan: Multimodal tucker fusion for visual question answering. In: Proceedings of the IEEE international conference on computer vision. pp. 2612–2620 (2017)
-  Chapman, W.W., Bridewell, W., Hanbury, P., Cooper, G.F., Buchanan, B.G.: A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of biomedical informatics 34(5), 301–310 (2001)
-  Chechik, G., Sharma, V., Shalit, U., Bengio, S.: Large scale online learning of image similarity through ranking. Journal of Machine Learning Research 11(Mar), 1109–1135 (2010)
-  Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
-  Gheorghiade, M., Follath, F., Ponikowski, P., Barsuk, J.H., Blair, J.E., Cleland, J.G., Dickstein, K., Drazner, M.H., Fonarow, G.C., Jaarsma, T., et al.: Assessing and grading congestion in acute heart failure: a scientific statement from the acute heart failure committee of the heart failure association of the european society of cardiology and endorsed by the european society of intensive care medicine. European journal of heart failure 12(5), 423–433 (2010)
-  Goldberger, A.L., Amaral, L.A., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. circulation 101(23), e215–e220 (2000)
-  Harwath, D., Recasens, A., Surís, D., Chuang, G., Torralba, A., Glass, J.: Jointly discovering visual objects and spoken words from raw sensory input. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 649–665 (2018)
-  Harwath, D., Torralba, A., Glass, J.: Unsupervised learning of spoken language with visual context. In: Advances in Neural Information Processing Systems. pp. 1858–1866 (2016)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
-  Horng, S., Liao, R., Wang, X., Dalal, S., Golland, P., Berkowitz, S.J.: Deep learning to quantify pulmonary edema in chest radiographs. arXiv preprint arXiv:2008.05975 (2020)
-  Hunt, S.A., Abraham, W.T., Chin, M.H., Feldman, A.M., Francis, G.S., Ganiats, T.G., Jessup, M., Konstam, M.A., Mancini, D.M., Michl, K., et al.: 2009 focused update incorporated into the acc/aha 2005 guidelines for the diagnosis and management of heart failure in adults: a report of the american college of cardiology foundation/american heart association task force on practice guidelines developed in collaboration with the international society for heart and lung transplantation. Journal of the American College of Cardiology 53(15), e1–e90 (2009)
-  Jing, B., Xie, P., Xing, E.: On the automatic generation of medical imaging reports. arXiv preprint arXiv:1711.08195 (2017)
-  Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6 (2019)
-  Liao, R., Rubin, J., Lam, G., Berkowitz, S., Dalal, S., Wells, W., Horng, S., Golland, P.: Semi-supervised learning for quantification of pulmonary edema in chest x-ray images. arXiv preprint arXiv:1902.10785 (2019)
-  Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances in neural information processing systems. pp. 289–297 (2016)
-  Moradi, M., Madani, A., Gur, Y., Guo, Y., Syeda-Mahmood, T.: Bimodal network architectures for automatic generation of image annotation from text. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 449–456. Springer (2018)
-  Neumann, M., King, D., Beltagy, I., Ammar, W.: ScispaCy: Fast and robust models for biomedical natural language processing. In: Proceedings of the 18th BioNLP Workshop and Shared Task. pp. 319–327. Association for Computational Linguistics, Florence, Italy (Aug 2019). https://doi.org/10.18653/v1/W19-5034, https://www.aclweb.org/anthology/W19-5034
-  Plummer, B.A., Brown, M., Lazebnik, S.: Enhancing video summarization via vision-language embedding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5781–5789 (2017)
-  Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)
-  Vasudevan, A.B., Gygli, M., Volokitin, A., Van Gool, L.: Query-adaptive video summarization via quality-aware relevance estimation. In: Proceedings of the 25th ACM international conference on Multimedia. pp. 582–590 (2017)
-  Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
-  Wang, X., Peng, Y., Lu, L., Lu, Z., Summers, R.M.: Tienet: Text-image embedding network for common thorax disease classification and reporting in chest x-rays. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 9049–9058 (2018)
-  Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
-  Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning. pp. 2048–2057 (2015)
-  Xue, Y., Huang, X.: Improved disease classification in chest x-rays with transferred features from report generation. In: International Conference on Information Processing in Medical Imaging. pp. 125–138. Springer (2019)
|Edema severity||Regex keyword terms||Number of reports||Accuracy|
|Level 0 – none (n=216)||(no) pulmonary edema||222||88.74%|
| ||(no) vascular congestion||43||100.00%|
| ||(no) fluid overload||4||100.00%|
| ||(no) acute cardiopulmonary process||115||98.27%|
|Level 1 – vascular congestion||cephalization||17||94.12%|
| ||pulmonary vascular congestion||96||98.96%|
| ||pulmonary vascular prominence||1||100.00%|
| ||pulmonary vascular engorgement||8||87.50%|
|Level 2 – interstitial edema||interstitial opacities||30||73.33%|
| ||interstitial pulmonary edema||21||100.00%|
|Level 3 – alveolar edema||alveolar infiltrates||10||100.00%|
| ||severe pulmonary edema||58||98.28%|
| ||ill defined opacities||1||100.00%|
|Hyper-parameter||Candidate values|
|number-of-epochs (supervised)||50, 100, 150, 250|
|learning-rate||2e-5, 5e-4, 1e-4, 1e-3|