Chest radiographs are performed to diagnose and monitor a wide range of conditions affecting lungs, heart, bones, and soft tissues. Despite being commonly performed, their reading is challenging and interpretation discrepancies can occur. There is a need to develop machine learning algorithms that can assist the reporting radiologist. In this work we address the problem of learning a distance metric for chest radiographs using a very large repository of historical exams that have already been reported. An ideal metric should be able to cluster together radiographs presenting similar radiological abnormalities and place them far away from exams with normal radiological appearance. Learning a suitable metric would enable a variety of applications, from automated retrieval of radiologically similar exams, for teaching and training, to their automated prioritization based on visual patterns.
The problem we discuss here is challenging for several reasons. First, the number of potential abnormalities that can be observed in a chest radiograph can be quite large. Visual patterns detected in radiographs are important cues used by clinicians when making a diagnosis. Often, at reporting time, the clinician will describe the visual pattern using descriptors (e.g. “enlarged heart”) or by stating the exact medical pathology associated with the visual pattern (e.g. “consolidation in the right lower lobe”). A metric learning algorithm should be able to deal with any such labels and their potential overlaps. Second, the labels may not always be accurate or comprehensive because not all abnormalities are always reported for an image, e.g. due to omissions or when deemed unimportant by the radiologist. When these labels are automatically obtained from free-text reports, as we do in this work, mislabelling errors may also occur. Third, certain abnormalities are less frequently observed than others, and may not even exist in the training dataset.
To support this study, we have prepared a large repository consisting of over
chest radiograph examinations extracted from the PACS (Picture Archiving and Communication System) of a large teaching hospital in London. To our knowledge, this is the largest chest radiograph repository ever to be used in a machine learning study. Due to the large sample size, manual annotation of all the exams is infeasible. All the historical free-text reports have therefore been parsed using a Natural Language Processing (NLP) system, which has identified and classified any mention of radiological abnormalities. As a result of this process, each film has been automatically assigned one or multiple labels.
Our contributions are the following. First, we discuss the problem of deep metric learning with multi-labelled images and propose two versions of a loss function specifically designed to deal with overlapping and potentially noisy labels. At the core of the architecture, a DCN is used to learn compact image representations capturing the visual patterns described by the labels. Second, we report on a large-scale evaluation of the proposed methodology using a manually curated subset of over exams. Each historical radiological report was reviewed by two independent clinicians who extracted all the labels associated with the films. We report comparative results for two tasks, clustering and image retrieval, and provide evidence that the learned metric can be used to identify clusters of radiographs with a normal appearance as well as clusters of abnormal exams with co-occurring abnormalities.
2 Related work
2.1 Deep metric learning
The first attempt at using neural networks to learn an embedding space was the Siamese Network, which used a contrastive loss to train the network to distinguish between pairs of examples. Schroff et al. combined a Siamese architecture with a triplet loss and applied the resulting model to the face verification problem, achieving near-human performance. Other approaches have been proposed more recently in order to better exploit the information in each mini-batch; e.g. Song et al. proposed a loss with a lifted structure, while Sohn et al. proposed a tuplet loss. They both use all the possible example pairs within each mini-batch. All these methods use a query or anchor image, which is compared with positive elements (images sharing the same label) and negative elements (images with a different label). Several of these methods also implement a hard data mining approach whereby samples within a given pair or triplet are selected so as to represent the hardest positive or negative example with respect to the given anchor. This strategy improves both the convergence speed and the final discriminative performance. In FaceNet, pairs of anchor and positive samples are randomly selected while negative samples are selected from a subset of the training set using a semi-hard negative algorithm. Recently, Wu et al. proposed a novel off-line mining strategy that, on the entire training set, selects the optimal positive and negative elements for each anchor. A different learning framework that does not require the training data to be processed in paired format has also been proposed recently.
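The losses mentioned above can be illustrated with a minimal sketch. The function names, the default margins and the use of plain (non-squared) distances in the triplet term are assumptions made for illustration only; the cited works differ in such details.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(d, same_label, margin=1.0):
    """One common form: pull same-label pairs together and push
    different-label pairs until they are at least `margin` apart."""
    return d ** 2 if same_label else max(0.0, margin - d) ** 2


def triplet_loss(d_ap, d_an, margin=0.2):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    return max(0.0, d_ap - d_an + margin)


def semi_hard_negatives(d_ap, d_negatives, margin=0.2):
    """Indices of negatives that are farther from the anchor than the
    positive, but still close enough to violate the margin."""
    return [i for i, d_an in enumerate(d_negatives)
            if d_ap < d_an < d_ap + margin]
```

A triplet only contributes to the loss when the negative lies within the margin of the positive, which is why the choice of negatives has such an influence on convergence.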
2.2 CAD systems for chest radiographs
The use of computer-aided diagnosis (CAD) systems in medical imaging goes back more than half a century. Over the years the methodologies powering CAD systems have evolved substantially from rule-based engines to artificial neural networks. In recent years, CAD developers have started to adopt deep learning strategies in a number of medical application domains. For instance, Geras et al. have developed a DCN model able to handle multiple views of high-resolution screening mammograms, which are commonly used to screen for breast cancer. For applications to plain chest radiographs, standard DCNs have been used to predict pulmonary tuberculosis, and an architecture involving DCNs and recurrent neural networks has been trained to perform automatic image annotation. Wang et al. have used a database of chest x-rays with more than frontal-view images and associated radiological reports in an attempt to detect commonly occurring thoracic diseases.
3 Deep metric learning with multi-labelled images
3.1 Problem formulation
In the remainder of this article we assume that each chest radiograph is associated with any of possible labels contained in a set . We collect all the labels describing in a set whilst all the remaining labels are identified by . Our aim is to learn a non-linear embedding that maps each onto a feature space where . In this subspace, the Euclidean distance among groups of similar images should be small and, conversely, the distance between dissimilar images should be large. The distance should be robust to anatomical variability within the normal range as well as geometric distortions and noise. Most importantly, it should be able to capture a notion of radiological similarity, i.e. two images are expected to be more similar to each other if they share similar radiological abnormalities. We require the embedding function,
, to depend only upon a learnable parameter vector. No assumptions about this function can be made besides differentiability with respect to . Consequently, the learned distance, , also depends on .
While the definition of positive and negative elements is straightforward for applications involving mutually exclusive labels, it becomes more ambiguous when each image is allowed to have non-mutually exclusive labels. Restrictive assumptions would need to be made in order to use existing approaches based on contrastive loss, triplet loss and others [14, 12]. The simplest approach would be to assume that and are positive with respect to each other only when they share exactly the same labels, i.e. when ; conversely, they would be interpreted as negative elements when the equality is not satisfied. However, assuming that two films are radiologically similar only when they share exactly the same abnormalities is too strong. Adopting this strategy would also result in much larger sample sizes for elements with frequently co-occurring labels compared to elements characterised by less frequent labels, thus hindering the learning process. Furthermore, since each individual label in both and is expected to be noisy, requiring the co-occurrence of exactly all the labels may be too restrictive.
A much less restrictive approach would be to assume that and are positive when they have at least one common label, i.e. when . Under this definition, either the contrastive or the triplet loss could still be used. This approach is still far from ideal, though, because this definition is invariant to the degree of overlap between and . Ideally, the learned distance between any two images should be proportional to the number of abnormalities they do not share. Fig. 2d illustrates this ideal situation. The triplet loss would struggle to satisfy this requirement as it does not take the global structure of the embedding space into consideration and does not explicitly account for overlapping labels; see Fig. 2a. In the next section, we propose two loss functions that are designed to overcome the above limitations.
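The two candidate definitions of a positive pair discussed above, and the ideal target of a distance that grows with the number of unshared abnormalities, can be made concrete with a small sketch. The helper names are hypothetical and label sets are represented as plain Python sets of strings.

```python
def exact_match(labels_a, labels_b):
    """Strict definition: positive only if the two label sets are identical."""
    return set(labels_a) == set(labels_b)


def share_any(labels_a, labels_b):
    """Relaxed definition: positive if at least one label is shared."""
    return bool(set(labels_a) & set(labels_b))


def unshared_count(labels_a, labels_b):
    """Number of labels present in exactly one of the two exams
    (size of the symmetric difference)."""
    return len(set(labels_a) ^ set(labels_b))


anchor = {"cardiomegaly", "pleural effusion"}
partial = {"cardiomegaly"}
disjoint = {"pneumothorax"}

# The relaxed definition treats `partial` as positive although the overlap
# is incomplete; the unshared count distinguishes the two cases.
print(exact_match(anchor, partial), share_any(anchor, partial))
print(unshared_count(anchor, partial), unshared_count(anchor, disjoint))
```

Under the relaxed definition, a pair sharing one of two labels and a pair sharing both are treated identically, which is the invariance to the degree of overlap criticised in the text.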
3.2 Proposed loss functions for multi-labelled images
We begin by assuming that and are positive when . Given an anchor , our approach starts by retrieving randomly selected images, one for each label in . The images are then grouped into two non-overlapping sets: one containing positive elements
and one containing the remaining negative elements
where . An ideal metric should ensure that is kept as close as possible to all the elements in whilst being kept away from all the elements in . Accordingly, the loss function to be minimised can be defined as
where the positive scalar represents a margin to be enforced between positive and negative pairs. This formulation can be seen as the triplet loss average derived from all the possible triplets where and .
The expression above can be simplified by pre-selecting the negative element having the largest contribution (e.g. see also Song et al.), i.e. yielding
In this way, we obtain a more tractable optimisation problem
which can be further simplified by using a smooth upper bound for ,
The above loss does not directly address the issue arising when some elements in have labels that are not in . Without imposing further constraints on how the elements in are selected, the loss will force to become as small as possible regardless of the number of labels that and actually have in common. This problem is addressed by introducing a quantity, , that represents the degree of overlap between the labels associated with and those associated with its positive elements, i.e.
Clearly, is equal to when and to when . By allowing to be a fraction of , we obtain the proposed ML2 (Metric Learning for Multi-Label) loss, i.e.
An illustrative example of its inner workings is provided in Fig. 2b. We also propose a different version of the loss, which relies on a different definition of positive elements. In this case, for each label in , a positive element is strictly required to have only that particular label. The quantity then simplifies to since and . An illustration is provided in Fig. 2c, and we call this version ML2+.
For applications involving a large number of classes, a memory efficient implementation of the two methods above can be obtained by reducing the elements in and using a hard class mining approach. In this case, and depend only on a subset of all labels, which is chosen by determining which labels contribute the most to the overall loss (e.g. see Sohn et al.).
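The chain of simplifications described in this section (average over all triplets, pre-selection of the hardest negative, and a smooth upper bound on the maximum) can be sketched numerically. This is an illustrative sketch under stated assumptions rather than the exact formulation used in this work: distances are passed in as plain lists, the log-sum-exp function is assumed as the smooth upper bound, and the overlap weighting that distinguishes ML2 and ML2+ is not attempted.

```python
import math


def hinge(x):
    return max(0.0, x)


def mean_triplet_loss(d_pos, d_neg, margin):
    """Average hinge loss over every (positive, negative) pair for one anchor."""
    total = sum(hinge(dp - dn + margin) for dp in d_pos for dn in d_neg)
    return total / (len(d_pos) * len(d_neg))


def hardest_negative_loss(d_pos, d_neg, margin):
    """Keep only the negative with the largest contribution,
    i.e. the one closest to the anchor."""
    hardest = min(d_neg)
    return sum(hinge(dp - hardest + margin) for dp in d_pos) / len(d_pos)


def smooth_loss(d_pos, d_neg, margin):
    """Replace max_n (margin - d_n) by log-sum-exp, a smooth upper bound."""
    terms = [margin - dn for dn in d_neg]
    m = max(terms)  # subtracted for numerical stability
    lse = m + math.log(sum(math.exp(t - m) for t in terms))
    return sum(hinge(dp + lse) for dp in d_pos) / len(d_pos)


d_pos = [0.4, 0.9]        # hypothetical anchor-positive distances
d_neg = [0.7, 1.5, 2.0]   # hypothetical anchor-negative distances
a = mean_triplet_loss(d_pos, d_neg, margin=0.5)
b = hardest_negative_loss(d_pos, d_neg, margin=0.5)
c = smooth_loss(d_pos, d_neg, margin=0.5)
print(a <= b <= c)
```

Each step yields a value at least as large as the previous one, so minimising the smooth version also controls the averaged triplet loss while remaining differentiable with respect to every negative distance.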
4 Large-scale metric learning for chest radiographs
For this study, we obtained a large dataset consisting of historical chest radiographs extracted from the PACS system of Guy’s & St Thomas’ NHS Foundation Trust, serving a large, diverse population in South London. Our dataset covers the period between January 2005 and March 2016. The radiographs were taken using different scanners across more than departments. For a large portion of these exams, we had both the radiological report as well as the associated plain film. The reports were written by different readers, including consultant and trainee radiologists and accredited reporting radiographers. All the examinations were anonymised with no patient-identifiable data or referral information. The size of the images ranges from to pixels, and each pixel is represented in greyscale with 12 bit precision. Table 1 contains the sample size breakdown of all the exams that we used for training, validation, and testing. Starting from the full dataset, we selected all the exams concerning patients older than years and for which we had both the report and the plain film. Only a subset of manually validated exams - the Golden Set - was used to assess and compare the performance of the metric algorithms.
4.1 Automatic labels extraction from medical reports
Given the large number of reports available for the study, obtaining manual labels for each exam was infeasible. Instead, all the written reports were processed using an NLP system specifically developed to model radiological language. The system was trained to detect any mention of radiological abnormalities and their negations. Labels were chosen to allow all common radiological findings to be allocated to a group along with other films sharing similar appearances. The labels were adapted from Hansell et al. and were meant to capture discrete radiological findings (e.g. cardiomegaly, medical device, pleural effusion) rather than giving a final diagnosis (e.g. pulmonary oedema), which requires clinical judgement to combine the current findings with previous imaging, clinical history, and laboratory results. For this study, we used different labels, i.e. cardiomegaly, medical devices (e.g. pacemakers, lines, and tubes), pleural effusion and pneumothorax. The NLP system also identified all “normal” exams, i.e. those where no abnormalities were mentioned in the report. Cumulatively, the normal and abnormal labels used here represent of all the reported visual patterns in our database.
A validation study was carried out to assess how accurately the NLP system extracted the clinical labels, plus the normal class, from the written reports. Two independent clinicians were presented with the original radiological reports and manually generated the labels from the reports. This study generated the Golden Set, which is used here purely for performance evaluation purposes.
In Table 2 we report the precision, sensitivity, specificity and F1 score obtained by our NLP system. These results indicate that the labels automatically extracted at scale from the written reports are sufficiently reliable; this provides evidence that the vast majority of the labels associated with images in our datasets are correct, thus allowing the neural network architectures to learn suitable image representations.
Further details on the NLP algorithms and experimental results can be found in Pesce et al.
4.2 DCN for high resolution input images
Standard DCN architectures, such as Inception v3, were originally designed to model natural images, such as those in the ImageNet dataset. These images are typically scaled down to pixels, even though higher resolution images are available. In many studies, down-scaling natural images has been shown to be a good compromise between the amount of information that is lost and computational efficiency. However, in a medical imaging setting, every detail in an image matters, at least in principle. Thus, arbitrarily reducing the resolution of the images is generally considered suboptimal. For this reason, in our study we have implemented a slightly modified version of Inception v3 that is able to handle pixels images. Table 3 shows the details of the proposed architecture. The chosen aspect ratio is close to the median aspect ratio ( ) amongst all images in our dataset and has the advantage of minimizing the number of images that would be cropped (or padded), since the input of our DCN has a fixed size.
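The reasoning behind choosing an input aspect ratio close to the dataset median can be illustrated with a short sketch. The image sizes below are hypothetical and do not come from the dataset; the helper names are likewise assumptions, and only padding (not cropping) is shown.

```python
from statistics import median


def median_aspect_ratio(sizes):
    """Median width/height ratio over a collection of (width, height) pairs."""
    return median(w / h for w, h in sizes)


def padding_to_ratio(width, height, target_ratio):
    """Pixels of padding (pad_width, pad_height) needed so that the padded
    image has approximately the target width/height ratio."""
    if width / height < target_ratio:
        return round(height * target_ratio) - width, 0
    return 0, round(width / target_ratio) - height


# Hypothetical image sizes, for illustration only.
sizes = [(2000, 2500), (2400, 2400), (3000, 2500)]
target = median_aspect_ratio(sizes)
for w, h in sizes:
    print((w, h), padding_to_ratio(w, h, target))
```

Images whose ratio already matches the target need no padding, so a target at the median keeps the total amount of added or removed content low across the dataset.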
5 Experimental results
5.1 Training strategy
The representation was learned using an Inception v3 architecture  resulting in an dimensional mapping under the constraint that . We call the output of the last convolutional layer and we define our final layer as:
where and are respectively the weights and bias of the last layer. All the results presented here use , because the use of larger dimensions did not introduce any significant improvements. All images were rescaled to have a standard size of ( for the non-standard model) pixels and no other pre-processing was carried out. For training purposes, synthetic data was generated by random rotation and flipping of the original images. Two different experimental setups were considered: one in which the embedding was learned end-to-end from the raw images, and one where pre-training was used instead, as is commonly done in other works. The proposed ML2 and ML2+ losses were compared to more traditional metric learning approaches based on contrastive and triplet losses sharing the same architecture.
Stochastic Gradient Descent (SGD) was used for the optimisation process, with an initial learning rate equal to , momentum equal to and weight decay equal to . When we started from randomly initialised weights, the total number of iterations was and, every iterations, the learning rate was decreased by a factor of . Instead, when the weights were pre-trained on the classification task, the total number of iterations was and the learning rate was decreased every iterations. In both experimental setups the size of the mini-batches was equal to when contrastive and triplet losses were used, and it was equal to for our proposed losses. We tested different values for , which, for the results shown in this work, has been set to . During training, the model with the highest NMI on the validation set was retained and used in the testing phase.
Positive and negative elements were randomly sampled. The noisiness of our labels prevented us from exploiting any sampling techniques (e.g. hardest negative mining, etc.), since all those methods take the reliability of the labels for granted.
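The optimisation schedule described above (SGD with momentum and weight decay, with the learning rate reduced by a fixed factor at regular intervals) can be sketched as follows. All numerical values in the example are placeholders chosen for illustration, not the settings used in the experiments, and the update rule shown is one common formulation of SGD with momentum.

```python
def step_decay_lr(initial_lr, iteration, step_size, factor):
    """Learning rate after dividing by `factor` once every `step_size` iterations."""
    return initial_lr / (factor ** (iteration // step_size))


def sgd_momentum_step(w, grad, velocity, lr, momentum, weight_decay):
    """One scalar SGD update with momentum and L2 weight decay.
    Returns the updated (weight, velocity) pair."""
    grad = grad + weight_decay * w          # weight decay enters as an L2 gradient term
    velocity = momentum * velocity + grad   # accumulate a running direction
    return w - lr * velocity, velocity


# Placeholder hyperparameters, for illustration only.
for it in (0, 999, 1000, 2500):
    print(it, step_decay_lr(0.1, it, step_size=1000, factor=10))
```

Keeping the schedule as a pure function of the iteration count makes it straightforward to reuse the same code for both setups, which differ only in the total number of iterations and the decay interval.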
5.1.1 DCN pre-training
For the pre-training of our DCN, we used a multi-label binary cross entropy loss. Given our possible labels, we defined an equal number of binary classifiers with the aim of predicting the presence or absence of each label. The output of each binary classifier is
where and are weights and biases distinct from those defined above. The loss function is equal to the average of the negative log likelihoods of for each ,
where is the labels vector; will be equal to when the i-th abnormality is present in the image , otherwise, it will be equal to .
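The multi-label binary cross entropy used for pre-training can be sketched as below. The sketch assumes a sigmoid output for each binary classifier and takes the pre-activation values (logits) directly as input; the function names and the small constant `eps` used to avoid taking the logarithm of zero are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets, eps=1e-12):
    """Average negative log likelihood over independent binary classifiers.

    logits:  one pre-activation value per label
    targets: 1 if the corresponding abnormality is present, 0 otherwise
    """
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)))
    return sum(losses) / len(losses)


# An uninformative prediction (probability 0.5 for every label) costs log(2) per label.
print(multilabel_bce([0.0, 0.0], [1, 0]))
```

Because each label has its own binary classifier, an image with several co-occurring abnormalities contributes one term per label rather than being forced into a single class.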
5.2 Cluster and retrieval performance
We assessed the performance of the proposed losses on two different tasks: (i) clustering, evaluated with the normalized mutual information (NMI) metric, and (ii) image retrieval, evaluated with the Recall@K metric; see Manning et al. for a complete account of these metrics.
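Both metrics can be sketched in a few lines. Two assumptions are made here that go beyond the text: for Recall@K, a retrieved image counts as a hit when it shares at least one label with the query; for NMI, each exam is given a single class and the mutual information is normalised by the mean of the two entropies.

```python
import math
from collections import Counter


def recall_at_k(distances, labels, k):
    """Fraction of queries with at least one relevant item among the k nearest
    neighbours. `distances` is a square matrix, `labels` a list of sets."""
    n = len(labels)
    hits = 0
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: distances[i][j])
        if any(labels[i] & labels[j] for j in neighbours[:k]):
            hits += 1
    return hits / n


def entropy(assignments):
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())


def nmi(clusters, classes):
    """Mutual information between cluster and class assignments,
    normalised by the average of the two entropies."""
    n = len(clusters)
    n_cluster, n_class = Counter(clusters), Counter(classes)
    mi = sum((c / n) * math.log(n * c / (n_cluster[u] * n_class[v]))
             for (u, v), c in Counter(zip(clusters, classes)).items())
    denom = (entropy(clusters) + entropy(classes)) / 2
    return mi / denom if denom > 0 else 1.0


positions = [0.0, 1.0, 10.0, 11.0]  # toy one-dimensional embeddings
dist = [[abs(a - b) for b in positions] for a in positions]
print(recall_at_k(dist, [{"a"}, {"a"}, {"b"}, {"b"}], k=1))
print(nmi([0, 0, 1, 1], ["a", "a", "b", "b"]))
```

Recall@K rewards embeddings in which similar exams are close neighbours, whereas NMI assesses the agreement between a clustering of the embeddings and the reference classes as a whole.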
Table 4 shows the empirical results obtained after learning the metrics on the training images and testing them on the Golden Set. When learning without pre-training (i.e. initially using random weights), ML2+ outperforms ML2 on both tasks and largely improves upon the other alternative losses. When using a pre-trained architecture, improvements can be observed across all methods, and ML2+ obtains a slightly better performance than ML2. Based on these results, we demonstrate the superior performance of our proposed losses with respect to the baseline; moreover, we suspect that ML2+ is able to converge to a better optimum more easily than ML2.
Table 4 also reports the results obtained with a DCN using pixels input images. Here we used the same configuration of the model yielding the best performance on the standard image size (i.e. pre-trained weights and ML2+ loss). Almost no improvement can be seen compared to the standard version of the network. In fact, while the retrieval performances are almost the same, the NMI score is more than one point lower. We hypothesize that, at least for the radiological abnormalities we have considered here, which involve large anatomical structures, an input size of pixels may be sufficiently informative.
Figure 3 shows a -dimensional representation of the radiographs contained in the Golden Set. This representation was obtained by means of dimensionality reduction using t-distributed Stochastic Neighbor Embedding (t-SNE), which projects the -dimensional embeddings extracted from the best model onto dimensions for visualisation purposes. Remarkably, this projection shows that the normal exams are mostly concentrated in a well-separated cluster; moreover, other clusters of exams sharing similar abnormalities have also been identified.
The chest radiographs marked with a circle can be seen in Figure 1. These are two examples of radiographs that were originally labelled as normal but ended up being placed away from the cloud of normal exams. A second reading of these exams has revealed unreported abnormalities thus confirming that their position within the embedding was justified.
|Without pre-training||With pre-training|
|ML2+ (high res.)||–||–||–||–||–||40.80||54.90||68.11||79.62||86.64|
5.3 Abnormalities classification performance
|Method||Precision||Sensitivity||Specificity||F1 score|
|LR on triplet embedding||95.36||95.04||87.63||95.20|
|LR on ML2+ embedding||95.55||94.98||88.17||95.26|
Table 5: The classification performance for abnormal exams obtained (i) when the network is trained directly on the classification task (Cross-entropy), (ii) using the embeddings extracted from a network trained with a triplet loss in order to train a logistic regression classifier (LR on triplet embedding) and (iii) using the embeddings extracted from a network trained with our proposed loss, ML2+, in order to train a logistic regression classifier (LR on ML2+ embedding).
In a separate task, we tried to predict whether a given chest radiograph contains a radiological abnormality. For this task, we compared the performance of the DCN architecture trained as a multi-label classifier using a cross entropy loss (the same described above and used for pre-training) with that of the feature embeddings extracted from one of our DCNs trained with a metric loss. Logistic regression was applied to the extracted embeddings in order to obtain a classification prediction. In Table 5 we present the results we obtained. Performance is evaluated in terms of precision, sensitivity, specificity and F1 score. We used the F1 score instead of accuracy because normal and abnormal exams are not balanced in our data, in which case comparing performance using accuracy can be misleading. In comparison to the baseline model, the models based on the learned embedding obtain better performance, showing a higher proficiency when discriminating between normal and abnormal exams.
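As a consistency check on the reported figures, the F1 score can be recomputed as the harmonic mean of precision and sensitivity. The sketch below uses the two rows of Table 5 that report logistic regression on the learned embeddings; the function name is illustrative.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# (precision, sensitivity) pairs in percent, taken from Table 5.
rows = {
    "LR on triplet embedding": (95.36, 95.04),
    "LR on ML2+ embedding": (95.55, 94.98),
}
for name, (p, s) in rows.items():
    print(name, round(f1_score(p, s), 2))
```

The recomputed values agree with the tabulated 95.20 and 95.26 to two decimal places.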
In this article we have proposed two loss functions for metric learning with multi-labelled medical images. Their performance has been tested on a very large dataset of chest radiographs. Our initial results demonstrate that learning a metric that captures a notion of radiological similarity is indeed possible; most importantly, the learned metric places normal radiographs far away from the exams that have been reported to contain one or multiple abnormalities. This is a striking result, given the complexity of the visual patterns to be discovered, the degree of noise characterising the radiological labels, and the large variety of scanners and readers included in our study. It is also an important step towards the fully-automated reading of chest radiographs, as being able to recognize normal radiological structures on plain film is key to interpreting any abnormal findings.
The authors thank NVIDIA for providing access to a DGX-1 server, which speeded up the training and evaluation of all the deep learning algorithms used in this work.
-  J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a “siamese” time delay neural network. In NIPS, pages 737–744. 1994.
-  S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, volume 1, pages 539–546. IEEE, 2005.
-  S. Cornegruta, R. Bakewell, S. Withey, and G. Montana. Modelling radiological language with bidirectional long short-term memory networks. 7th International Workshop on Health Text Mining and Information Analysis, 2016.
-  K. J. Geras, S. Wolfson, S. G. Kim, L. Moy, and K. Cho. High-resolution breast cancer screening with multi-view deep convolutional neural networks. CoRR, abs/1703.07047, 2017.
-  D. M. Hansell, A. A. Bankier, H. MacMahon, T. C. McLoud, N. L. Muller, and J. Remy. Fleischner society: glossary of terms for thoracic imaging 1. Radiology, 246(3):697–722, 2008.
-  P. Lakhani and B. Sundaram. Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology, 284(2):574–582, 2017. PMID: 28436741.
-  C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008.
-  E. Pesce, P.-P. Ypsilantis, S. Withey, R. Bakewell, V. Goh, and G. Montana. Learning to detect chest radiographs containing lung nodules using visual attention networks. ArXiv e-prints, Dec. 2017.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
-  H. Shin, K. Roberts, L. Lu, D. Demner-Fushman, J. Yao, and R. M. Summers. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation. In CVPR, pages 2497–2506, 2016.
-  K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, pages 1849–1857, 2016.
-  H. O. Song, S. Jegelka, V. Rathod, and K. Murphy. Deep metric learning via facility location. In CVPR, 2017.
-  H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826.
-  L. van der Maaten. Accelerating t-sne using tree-based algorithms. Journal of Machine Learning Research, 15:3221–3245, 2014.
-  B. van Ginneken. Fifty years of computer analysis in chest imaging: rule-based, machine learning, deep learning. Radiological Physics and Technology, 10(1):23–32, 2017.
-  X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. CoRR, abs/1705.02315, 2017.
-  K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, P. B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1473–1480. MIT Press, 2006.
-  C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning. CoRR, abs/1706.07567, 2017.