
Multimorbidity Content-Based Medical Image Retrieval Using Proxies

Content-based medical image retrieval is an important diagnostic tool that improves the explainability of computer-aided diagnosis systems and provides decision making support to healthcare professionals. Medical imaging data, such as radiology images, are often multimorbidity; a single sample may have more than one pathology present. As such, image retrieval systems for the medical domain must be designed for the multi-label scenario. In this paper, we propose a novel multi-label metric learning method that can be used for both classification and content-based image retrieval. In this way, our model is able to support diagnosis by predicting the presence of diseases and provide evidence for these predictions by returning samples with similar pathological content to the user. In practice, the retrieved images may also be accompanied by pathology reports, further assisting in the diagnostic process. Our method leverages proxy feature vectors, enabling the efficient learning of a robust feature space in which the distance between feature vectors can be used as a measure of the similarity of those samples. Unlike existing proxy-based methods, training samples are able to assign to multiple proxies that span multiple class labels. This multi-label proxy assignment results in a feature space that encodes the complex relationships between diseases present in medical imaging data. Our method outperforms state-of-the-art image retrieval systems and a set of baseline approaches. We demonstrate the efficacy of our approach to both classification and content-based image retrieval on two multimorbidity radiology datasets.



I Introduction

Radiology is a vital tool for the diagnosis of disease. With the demand for medical imaging increasing rapidly [1], computer-aided diagnosis systems can help to improve the radiology workflow. Two useful computer-aided diagnosis tasks are pathology classification and Content-Based Image Retrieval (CBIR), i.e. the process of searching an image database for samples that are pathologically similar to a query image. In practice, such a medical CBIR system may also return pathology reports alongside the retrieved samples. Returning similar images and their pathology information to the user provides evidence and context for the pathology classifications made by the system. This can in turn help foster trust between healthcare professionals and the computer-aided diagnosis tool. Further, a classification and retrieval system can help reduce the workload of healthcare professionals by assisting in the generation of radiology reports [2] and help to reduce the high inter-observer variability that occurs when analysing radiology images [3].

Fig. 1: Example X-ray retrieval results. Images are annotated with disease labels; see Section IV-B for label definitions.

Deep metric learning methods learn a feature space in which distance is a measure of similarity. As such, metric learning is well suited to the problem of content-based image retrieval. The feature spaces learned by metric learning approaches have been shown to generalise well to unseen samples [4, 5, 6, 7], with demonstrated efficacy for retrieval [8, 9]. Conventional deep classification models often require vast quantities of high-quality annotated data in order to be accurately trained [10]. Metric learning models can reduce over-fitting [11], resulting in better performance in few-shot learning [12, 13] and data-limited scenarios [7]. This is particularly important in the medical domain, as some diseases may be rare and annotating data has a high associated cost.

One family of metric learning approaches is proxy-based methods [14, 15]. Proxies are trainable model parameters that are used to approximate the real distribution of training set feature vectors. Using proxies enables efficient model training, as training samples need only be compared to the relatively small number of proxies, rather than to one another. Proxy methods are able to learn faster than other metric learning methods and result in a more generic feature space than commonly used triplet-based methods [14]. However, existing proxy methods are not designed for multi-label data and cannot be directly applied to multimorbidity CBIR.

In this paper, we propose a novel proxy-based metric learning method for multimorbidity computer-aided diagnosis. Existing proxy approaches generally allow a sample to assign to just a single proxy during training. Unlike these approaches, our method is explicitly designed for multi-label data by allowing X-rays with multiple disease findings to assign to all relevant proxies. This results in a metric feature space that encodes the complex interactions between different diseases. Unlike existing methods that tend to define a single proxy per class label, we propose the use of several proxies per disease class, allowing the semantic variations that exist within the distribution of a single disease to be encoded in the feature space. In this way, our method supports both intra-class and inter-class multi-proxy assignment. Further, we propose the use of negative proxies. By treating negative samples (i.e. those with no positive labels) as their own training class with proxies, a feature space is learned that better encodes the relationships between X-ray samples with no disease findings.

While many metric learning approaches do not perform well for classification [16], our method optimises a classification loss directly on the feature vectors extracted by a deep neural network. This results in a trained model that can be successfully applied to both pathology classification and content-based image retrieval. The major contributions of this paper can be summarised as follows:

  • We propose a novel proxy-based metric learning algorithm for multi-label data (Section III-C).

  • We introduce negative class proxies for encoding the important relationships between X-ray samples with no disease findings (Section III-D).

  • We propose defining multiple proxies for each class label and demonstrate the performance benefit (Sections III-C and IV-G).

  • We demonstrate that our method outperforms conventional deep classification models (Section IV-E).

  • We show that our approach achieves state-of-the-art CBIR performance on multimorbidity radiology datasets (Sections IV-D and IV-E).

II Related work

II-A Content-based medical image retrieval

As medical imaging data is often multimorbidity, i.e. a single sample may show the presence of several pathologies, we focus on the problem of multi-label CBIR. Medical image retrieval remains a challenging problem, owing to the often minuscule visual differences between pathologies, as well as the existence of complex relationships between pathologies [17, 18, 2]. Existing medical image retrieval systems include those based on handcrafted features and shallow methods [19, 20, 21, 22], as well as deep learning methods [23, 24, 25, 3, 26, 27, 28, 2].

In the deep learning domain, multi-label metric learning and hashing are common approaches [3, 26, 27, 28]. Chen et al. [3] propose a method that optimises a combination of ranking loss and multi-label classification loss, while Conjeti et al. [26] introduce a Deep Residual Hashing network that incorporates a retrieval loss and regularisation techniques to improve CBIR performance. A pair-wise Deep Supervised Hashing method is proposed by Liu et al. [27], whereby images are mapped to discrete codes, allowing Hamming distance to be used as a measure of the similarity between samples. Further discussion on multi-label metric learning is found in Section II-B3.

Taking a different deep learning approach, Haq et al. [2] leverage a conventional Convolutional Neural Network (CNN) multi-label classifier trained with binary cross-entropy. A community-based graph structure is proposed for efficient search in large retrieval databases. As demonstrated in our experiments (Section IV), a conventional CNN classifier is limited in its ability to encode rich semantic relationships in the feature vector space. This results in poorer CBIR performance compared to our multi-label metric learning approach that directly optimises the feature space.

II-B Metric learning

The aim of metric learning is to learn a feature space in which standard distance measures, such as Euclidean distance, can be used as a measure of similarity. For example, one would expect the feature vectors belonging to sample images with similar semantic content to be located nearby, while those from images with dissimilar semantic content to be located further apart. Metric feature spaces have applications including retrieval [29], ranking [8], out-of-distribution detection [30] and novel class image generation [31].

II-B1 Triplet Methods

One of the main approaches to metric learning is derived from Siamese networks [32] with contrastive loss [33, 34], whereby positive pairs of images (with matching semantic content) and negative pairs (with non-matching semantic content) are passed through the same network. The network parameters are updated such that the feature vectors of positive pairs are pulled together, while those of negative pairs are pushed apart. Triplet-based methods [35] improve on pairwise methods by constructing trios of training images: an anchor, a positive sample with matching semantic content and a negative sample with differing semantic content. Triplet loss attempts to make the distance between the anchor and the positive sample smaller than the distance between the anchor and the negative sample, by at least a set margin.

Metric learning literature often aims to improve triplet methods by performing intelligent mining of “hard” triplets [36, 6], while other streams of literature aim to generalise triplet loss, such as by allowing multiple comparisons within a single mini-batch [5] or by employing a lifted structured embedding [4] that allows computation between every positive and negative pair in the batch. Local triplet loss can also be combined with a global loss term to improve performance [37]. Triplet-based metric learning has shown good performance in embedding learning problems and extreme classification, but is often poor at regular classification problems compared to conventional neural network classifiers [16]. Further, triplet methods suffer from computational bottlenecks in terms of triplet mining.

II-B2 Neighbourhood and Proxy Methods

Analysing a larger neighbourhood of samples at each training iteration can allow for more efficient updates to the model and a more robust distance metric [7]. Neighbourhood Component Analysis (NCA) [38] considers all nearby samples, minimising a probabilistic loss based on a sum of Gaussian distances within the neighbourhood. As the feature vectors of every sample change after each training iteration, it is computationally infeasible to minimise this loss exactly with a deep neural network. To make training practical, a cache of training feature vectors can be stored and periodically updated, allowing an approximation of NCA loss to be optimised [7].

Another approach to making neighbourhood methods computationally feasible is to employ proxy feature vectors [14]. Proxy features are trainable model parameters that are assigned a class label and used as a proxy for real training feature vectors belonging to that same class. By minimising a proxy-based version of NCA (Proxy-NCA [14, 15]), training features need only be compared to the proxy features, rather than to each other. As the number of proxies is generally set to equal the number of class labels, this significantly reduces the computational complexity. Kim et al. [39] propose a loss function with advantages of both pair-wise methods and proxy-based methods by assigning samples to proxies using sample-to-sample relations. Similarly, N-pair loss is implemented using proxies by Aziere et al. [40], while SoftTriple loss [41] improves the representation of intra-class variations using a method analogous to proxies. In this work, we propose a novel multi-label proxy metric learning method that allows training samples to assign to multiple inter-class and intra-class proxies. This results in a model that can be applied to both multi-label classification and multi-label image retrieval.

II-B3 Multi-label Metric Learning

Beyond the discussed medical CBIR approaches [3, 26, 27, 28] (Section II-A), other multi-label metric learning approaches include the multi-label extension of triplet loss by Sumbul et al. [42], who propose a two-step triplet sampling algorithm that uses multi-label similarity to select a diverse set of triplets for a training mini-batch. Annarumma et al. [43] propose a multi-label triplet method that selects several positive examples for each training sample, such that each of the sample’s positive class labels is present in the constructed positive pairs. As multi-label extensions of triplet learning, these methods inherit the inefficiencies of triplet metric learning discussed in Section II-B1, in terms of high computational complexity, limited ability to encode complex intra-class and inter-class relationships, and poor classification performance. To avoid these known problems with triplet loss, our work focuses on computationally efficient proxies and directly minimises a distance-metric based classification loss that is effective for both classification and content-based image retrieval. Beyond multi-label triplet methods, Li et al. [44] optimise a two-way distance metric loss between image and label embeddings, both extracted by neural networks. This method uses the entire neighbourhood of feature vectors to compute the loss. Avoiding such inefficient neighbourhood analysis is a primary motivator behind our use of proxies.

Further to not being proxy methods, the discussed existing approaches do not give special consideration to negative samples (i.e. examples with no positive labels). Such samples are extremely common in medical data (e.g. a radiology image with no disease finding). Recognising this, our method is explicitly designed for such data via the introduction of negative proxies (Section III-D). Additionally, our approach was designed for the consolidated dual use case of medical image retrieval and pathology classification, while most existing multi-label metric learning methods were designed only for a single use case. We quantitatively evaluate our approach against existing multi-label metric learning methods [3, 27, 28] in Section IV-D.

(a) Training the system.
(b) Querying the system.
Fig. 2: Overview of training and testing of our system.

II-B4 Using Multiple Proxies per Class

Further to enabling multi-label proxy metric learning, we propose to improve the representation of intra-disease variations by defining multiple proxies for each disease class. Although existing works have proposed to define multiple proxies for a single class [41, 45, 16], these methods differ significantly from our approach. Qian et al. [41] propose the use of multiple centroids for each class, but a single weighted per-class centroid is used in the loss function, limiting the intra-class variations that can be captured. Conversely, our loss function incorporates all proxies (both intra-class and inter-class) independently. Similar to approximate NCA methods [7], Liu et al. [45] couple a proxy with each data instance, which is by definition different to the dataset-approximating proxies in our method. The proxies used in our proposed approach are trained model parameters that are efficient approximations of the data distribution. Although not strictly proxies, Rippel et al. [16] use k-means clustering and penalise overlapping clusters from different classes, with multiple k-means centres defined for each class. In this sense, the centres can be considered proxies of the data distribution. However, neither of the approaches in Rippel et al. [16] and Liu et al. [45] has the efficiency benefits of the parameterised proxies used in our paper.

III Method

An overview of our approach is shown in Fig. 2. Given a query image, such as a chest X-ray, our method is able to predict the presence of multiple diseases, as well as return a set of semantically similar X-rays from a retrieval database to the user. The similarity of samples is measured by Euclidean distance in a feature vector space. Features are extracted by a convolutional neural network that is trained with our novel multi-label metric learning method.

III-A Metric learning problem statement

Let $\mathcal{X} = \{x_1, \dots, x_N\}$ be a set of $N$ images from a multi-label dataset containing $C$ class labels. The corresponding set of ground truth label vectors is $\mathcal{Y} = \{y_1, \dots, y_N\}$, where $y_i \in \{0, 1\}^C$ is the label vector for the $i$-th image. A value of 1 at location $k$ in $y_i$ indicates the presence of the $k$-th class in image $x_i$. We aim to learn a distance metric $d$, such that:


Such a distance metric is achieved by learning a transformation from the image space to a feature vector space in which the Euclidean distance between features can be used as a measure of similarity, i.e.:

$$d(x_i, x_j) = \left\| f_\theta(x_i) - f_\theta(x_j) \right\|_2, \tag{2}$$

where $f_\theta$ represents a neural network encoder and $f_\theta(x_i)$ is the feature vector that is output from the network for image $x_i$. Given a well learned metric feature vector space, we would expect that the distance between similar samples in the feature space will be small, while the distance between dissimilar samples will be large.

III-B Proxy feature vectors

The feature vector extracted from image $x_i$ by the model $f_\theta$ is defined as:

$$z_i = f_\theta(x_i), \tag{3}$$

where $z_i \in \mathbb{R}^D$ and $D$ is the dimensionality of the feature vector space, realised by the model $f_\theta$. In general, the feature space may have any dimensionality; in our experiments $f_\theta$ produces features with 1024 dimensions. A proxy is defined as a trainable feature vector that is assigned a class label and represents the set of, or a subset of, the real training sample feature vectors that belong to that class. During training, sample feature vectors are compared with the proxies rather than with the much larger number of other training samples. In this way, these trainable parameters act as a proxy for the real distribution of training samples.
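The saving from comparing features with proxies rather than with each other can be illustrated with a minimal NumPy sketch. The array shapes, the helper names `l2_normalise` and `feature_to_proxy_distances`, and the use of squared Euclidean distance are assumptions made for illustration; in the method itself the proxies are trainable parameters updated alongside the network.

```python
import numpy as np

def l2_normalise(v, axis=-1, eps=1e-12):
    """Project vectors onto the unit sphere."""
    return v / np.maximum(np.linalg.norm(v, axis=axis, keepdims=True), eps)

def feature_to_proxy_distances(features, proxies):
    """Squared Euclidean distance from each feature to each proxy.

    features: (N, D) array of sample feature vectors.
    proxies:  (C, M, D) array, M proxies for each of C classes.
    Returns an (N, C, M) array, i.e. N*C*M comparisons instead of N*N.
    """
    diff = features[:, None, None, :] - proxies[None, :, :, :]
    return np.sum(diff ** 2, axis=-1)

rng = np.random.default_rng(0)
N, C, M, D = 8, 3, 2, 16  # illustrative sizes only (hypothetical)
features = l2_normalise(rng.normal(size=(N, D)))
proxies = l2_normalise(rng.normal(size=(C, M, D)))
dists = feature_to_proxy_distances(features, proxies)
```

With a realistic training set, N is in the tens of thousands while C*M stays small, which is where the efficiency of proxy methods comes from.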

III-C Multi-label metric learning with proxies

Existing proxy-based metric learning methods [14, 15] are designed for datasets with a single positive class label per sample. Generally, these methods define a single proxy for each class label, and training samples are only able to be assigned to one proxy. Since medical imaging data is often multimorbidity, we propose a novel proxy-based metric learning approach for multi-label datasets. In this approach, training samples are able to assign to multiple proxies spanning multiple class labels. We also generalise our method to allow for the definition of multiple proxies per class. Having more than one proxy for a single class may allow for intra-class variations to be better captured by the model, as well as more complex interactions between class labels. This means that a sample’s multi-proxy assignment may occur both inter-class and intra-class.

We define a set of $M$ proxies for each class label $c$ as $P_c = \{p_{c,1}, \dots, p_{c,M}\}$, where each $p_{c,m}$ corresponds to the $m$-th proxy feature for the $c$-th class label.

During training, we optimise a multi-label proxy-based loss that allows training features to assign to multiple proxies. The multi-proxy assignment happens both intra-class (when $M > 1$) and inter-class, when a training sample has more than one positive label (i.e. when $\|y_i\|_0 > 1$, where the 0-norm of $y_i$ is the number of non-zero elements).

The loss function for sample $x_i$ with labels $y_i$ is defined in (4). To deal with the large imbalance of positive and negative occurrences of classes that is common in medical imaging data [46, 47], per-class positive and negative weights are used in the loss function. With $N_c^{+}$ being the total number of positive samples and $N_c^{-}$ being the number of negative samples for class $c$, the positive and negative weights for class $c$, $w_c^{+}$ and $w_c^{-}$, are defined in terms of these counts. Before calculating the loss, feature vectors and proxies are normalised. The hyperparameter $\sigma$ is a fixed value that sets the width of the Gaussian windows in our multi-label proxy loss function. This value is the same for all proxies.
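As a rough illustration of the ingredients named above (Gaussian windows of fixed width, per-class positive and negative weights, all proxies contributing), the following NumPy sketch applies a class-weighted binary cross-entropy to the Gaussian similarity between a feature and every proxy. This is an assumed form, not Equation (4): the exact loss, the weight definitions and the Gaussian parameterisation `exp(-d^2 / (2 sigma^2))` are not recoverable from the text, and the function names are hypothetical.

```python
import numpy as np

def gaussian_similarity(feature, proxies, sigma=0.7):
    """Gaussian-window similarity between one feature and every proxy.

    feature: (D,) array; proxies: (C, M, D) array, both L2-normalised.
    Returns a (C, M) array with values in (0, 1].
    ASSUMPTION: the window is exp(-d^2 / (2 * sigma^2)).
    """
    d2 = np.sum((proxies - feature) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def multilabel_proxy_loss(feature, labels, proxies, w_pos, w_neg,
                          sigma=0.7, eps=1e-7):
    """ASSUMED form: weighted binary cross-entropy over every proxy.

    labels, w_pos, w_neg: (C,) arrays. Weights are passed in because
    their definition in terms of class counts is not given here.
    """
    s = np.clip(gaussian_similarity(feature, proxies, sigma), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)[:, None]            # (C, 1)
    pos = np.asarray(w_pos)[:, None] * y * np.log(s)        # pull to positives
    neg = np.asarray(w_neg)[:, None] * (1.0 - y) * np.log(1.0 - s)  # push away
    return float(-np.sum(pos + neg))

# Toy check: a feature sitting on the class-0 proxy.
proxies = np.array([[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]]])  # C=2, M=1, D=3
feature = np.array([1.0, 0.0, 0.0])
ones = np.ones(2)
loss_match = multilabel_proxy_loss(feature, [1, 0], proxies, ones, ones)
loss_mismatch = multilabel_proxy_loss(feature, [0, 1], proxies, ones, ones)
```

In this toy case the loss is small when the label agrees with the nearby proxy and large when it does not, which is the qualitative behaviour the section describes.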


III-D Negative proxies

Training samples that are negative for all labels, e.g. healthy chest X-rays with no disease findings, can be considered as their own class for model training purposes. Formally, an X-ray $x_i$ is considered a negative sample when $\|y_i\|_0 = 0$, where the 0-norm calculates the number of non-zero elements in $y_i$. An additional element can be concatenated to the label vector to represent negative samples, i.e. $\tilde{y}_i = y_i \oplus \mathbb{1}(\|y_i\|_0 = 0)$, where $\oplus$ denotes concatenation and $\mathbb{1}$ is an indicator function. As negative samples are now represented as a class label, proxies must also be defined for this new class. Training with additional proxies for negative samples can result in a better structured feature space, as negative samples will cluster together around those proxies. Without such proxies, the only training constraint for negative samples is for them to be located far away from positive class proxies. However, there is no loss term that encourages negative samples to be located nearby one another, despite their semantic similarity. The inclusion of proxies for negative samples, which we name negative proxies, introduces a training constraint that negative samples should be located nearby in the feature space. As shown in our experiments in Section IV-H, negative proxies result in both better classification and CBIR performance.
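The label augmentation described above is small enough to show directly. The sketch below appends the extra "negative" element to a batch of label vectors; the function name `add_negative_label` is a hypothetical helper.

```python
import numpy as np

def add_negative_label(labels):
    """Append a column that is 1 only for rows with no positive label.

    labels: (N, C) binary array. Returns an (N, C + 1) array.
    """
    labels = np.asarray(labels)
    negative = (np.count_nonzero(labels, axis=1) == 0).astype(labels.dtype)
    return np.concatenate([labels, negative[:, None]], axis=1)

batch = np.array([[0, 1, 1],
                  [0, 0, 0],   # no findings: becomes the negative class
                  [1, 0, 0]])
augmented = add_negative_label(batch)
```

Only the second row, which has no positive disease label, receives a 1 in the new final column.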

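Input: Model $f_\theta$ with parameters $\theta$; proxies $P$; dataset $\mathcal{D}$.
while not converged do
    Sample $(x_i, y_i) \sim \mathcal{D}$    ▷ Sample training data.
    $z_i \leftarrow f_\theta(x_i)$    ▷ Extract feature vector.
    $z_i \leftarrow z_i / \|z_i\|_2$    ▷ Normalise feature vector.
    for each $p \in P$ do
        $p \leftarrow p / \|p\|_2$    ▷ Normalise proxies.
    $\tilde{y}_i \leftarrow y_i \oplus \mathbb{1}(\|y_i\|_0 = 0)$    ▷ Concatenate label to represent negative samples.
    Compute the loss according to (4).
    Update $\theta$ with Adam.
    Update $P$ with Adam.
Algorithm 1 Training procedure.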

III-E Training algorithm

The training procedure for our approach is shown in Algorithm 1. Image and label pairs are sampled from the training dataset and feature vectors are extracted from the images. Both the feature vectors and the proxies are then constrained to a unit sphere using L2-normalisation. The label vector is modified to include an extra dimension that represents the negative (healthy) class. This extra label is set to a value of one when all other elements of the label vector are zero, otherwise the negative label is set to a value of zero. The extra label dimension is needed due to the inclusion of the negative proxies, as discussed in Section III-D. The loss is then calculated according to (4), and model parameters and proxies are updated using the Adam optimisation algorithm [48].

III-F Multi-label classification

We perform multi-label classification by analysing the similarity between a sample’s feature vector and each of the proxy feature vectors. The classification score $s_c$ for disease label $c$ is:


For each class $c$, we calculate the distance between the feature vector and each of the proxies belonging to class $c$. The classification score is then calculated based on the distance to the nearest proxy of that class, where a score close to 1 indicates a high likelihood that disease $c$ is present in the sample image, while a score close to 0 indicates a low likelihood. A prediction $\hat{y}_c$ for class $c$ can then be made by comparing the classification score to a discrimination threshold $\tau_c$ for that class, as shown in (6):

$$\hat{y}_c = \begin{cases} 1 & \text{if } s_c \geq \tau_c, \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$
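The nearest-proxy scoring and thresholding described above can be sketched as follows. The Gaussian form `exp(-d^2 / (2 sigma^2))` is an assumption chosen so that the score lies in (0, 1] and equals 1 when the feature coincides with a proxy; the function names and the shared 0.5 threshold in the example are likewise hypothetical.

```python
import numpy as np

def classification_scores(feature, proxies, sigma=0.7):
    """Score each class by the distance to its nearest proxy.

    feature: (D,) array; proxies: (C, M, D) array, both L2-normalised.
    ASSUMPTION: a Gaussian window maps squared distance to a (0, 1] score.
    """
    d2 = np.sum((proxies - feature) ** 2, axis=-1)  # (C, M)
    nearest = d2.min(axis=1)                        # nearest proxy per class
    return np.exp(-nearest / (2.0 * sigma ** 2))

def predict(scores, thresholds):
    """Binary prediction per class: 1 where the score reaches the threshold."""
    return (scores >= thresholds).astype(int)

# Two classes, two proxies each, in a 2-D feature space.
proxies = np.array([[[1.0, 0.0], [0.0, 1.0]],
                    [[-1.0, 0.0], [0.0, -1.0]]])
feature = np.array([1.0, 0.0])
scores = classification_scores(feature, proxies)
predictions = predict(scores, thresholds=np.array([0.5, 0.5]))
```

Here the feature coincides with a class-0 proxy, so class 0 scores 1 and is predicted positive, while class 1 falls below the threshold.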


III-G Content-based image retrieval

Given a well structured metric feature space, we expect the feature vectors of samples with similar pathology information to be located nearby. As such, we perform content-based image retrieval for a query image by returning the database images corresponding to the feature vectors that are nearest to the query sample’s feature vector. The image retrieval database, i.e. the set of images from which samples are retrieved based on a query, is constructed from the labelled training samples. The image retrieval procedure is outlined in Algorithm 2. The feature vector distances between the query sample and each of the samples in the database are calculated. The database samples are then sorted based on the distance to the query sample, in ascending order. Finally, the $k$ images that are most similar to the query image are returned to the user as the output of the CBIR system.

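Input: Trained model $f_\theta$; retrieval database $\mathcal{D}_r$; query image $x_q$; number of images to retrieve $k$.
Initialise: distances $S \leftarrow \emptyset$, retrieved images $R \leftarrow \emptyset$.
for each $x_j \in \mathcal{D}_r$ do
    $d_j \leftarrow \|f_\theta(x_q) - f_\theta(x_j)\|_2$    ▷ Dist. b/w query and samples.
    $S \leftarrow S \cup \{d_j\}$    ▷ Add to set of distances.
$I \leftarrow$ argsort ascent$(S)$    ▷ Sort indices of $S$.
while $|R| < k$ do    ▷ Pick k most similar images.
    $R \leftarrow R \cup \{x_{I_{|R|+1}}\}$    ▷ Add next similar image to $R$.
return $R$
Algorithm 2 CBIR of $k$ images for query image $x_q$.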
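The retrieval procedure amounts to a nearest-neighbour search in the feature space. A minimal NumPy sketch, assuming the database features have already been extracted and stored as an array (the function name `retrieve` is hypothetical), is:

```python
import numpy as np

def retrieve(query_feature, database_features, k):
    """Return indices and distances of the k nearest database samples.

    query_feature: (D,) array; database_features: (N, D) array.
    """
    # Euclidean distance between the query and every database sample.
    dists = np.linalg.norm(database_features - query_feature, axis=1)
    # Sort in ascending order of distance; keep the k most similar.
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]

database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.9, 0.0])
indices, distances = retrieve(query, database, k=2)
```

This brute-force search is linear in the database size; for very large archives an approximate nearest-neighbour index could be substituted without changing the interface.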

III-H Baselines for evaluation

We compare our proposed approach to seven state-of-the-art CBIR methods from the literature [20, 21, 28, 27, 3, 22, 2], including CNN classifier methods, feature-based deep learning methods and hashing methods. For further evaluation, we train appropriate baseline models to benchmark our approach against. These baselines are described below. For fairness, our method uses the same base network architecture as all baseline methods, as well as the highest performing evaluated method from literature [2].

III-H1 Multi-label classifier (DenseNet w/ BCE).

This method is a conventional CNN classifier, trained with multi-label binary cross entropy loss. The head of the network is a fully connected layer that outputs class-wise prediction scores. To extract feature vectors for image retrieval, we bypass the final fully connected layer, resulting in a feature vector with the same dimension as the feature vector $z_i$ produced by our encoder, described in Section III-C. This method is the state-of-the-art literature approach proposed by Haq et al. [2] in terms of model architecture and training, but does not include the nearest neighbour graph.

III-H2 Multi-label Proxy-NCA (ML-ProxyNCA).

The standard Proxy-NCA [14, 15] loss function can be naively extended to the multi-label case by optimising the following loss function:


where $p_i$ is the proxy for the $i$-th class. In the multi-label case, the outer sum over classes in (7) allows a training feature vector to pull towards the proxies belonging to all positive labels. In the case where each sample only has a single positive label, (7) becomes the standard probabilistic Proxy-NCA loss function [15].
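A sketch of this naive multi-label extension is given below. It assumes one proxy per class, squared Euclidean distance between L2-normalised vectors, and a softmax over negative distances with no temperature term; these details are assumptions, since Equation (7) is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def ml_proxy_nca_loss(feature, labels, proxies):
    """ASSUMED form of a multi-label Proxy-NCA loss.

    feature: (D,) array; labels: (C,) binary array; proxies: (C, D) array.
    Sums the negative log-probability of every positive class, where the
    probability is a softmax over negative squared distances to all proxies.
    """
    d2 = np.sum((proxies - feature) ** 2, axis=1)        # (C,)
    logits = -d2
    shifted = logits - np.max(logits)                    # numerical stability
    log_prob = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-np.sum(np.asarray(labels) * log_prob))

proxies = np.array([[1.0, 0.0], [0.0, 1.0]])
feature = np.array([1.0, 0.0])
loss_single = ml_proxy_nca_loss(feature, [1, 0], proxies)
loss_multi = ml_proxy_nca_loss(feature, [1, 1], proxies)
```

With a single positive label the outer sum has one term and the expression reduces to an ordinary softmax negative log-likelihood over proxies, matching the reduction described above.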

Method                    nDCG
Wang et al. [20]          0.15
Gong et al. [21]          0.16
Erin Liong et al. [28]    0.19
Liu et al. [27]           0.17
Chen et al. [3]           0.24
Lan et al. [22]           0.15
Haq et al. [2]            0.31
DenseNet w/ BCE           0.31
ML-ProxyNCA               0.32
Ours                      0.38
TABLE I: CBIR performances of literature approaches, baselines and our approach, on the NIH dataset [47].
                            Classification   CBIR
Method                      AUC              nDCG    ACG     Prec.
DenseNet w/ BCE             0.69             0.21    0.37    0.71
ML-ProxyNCA                 0.64             0.20    0.36    0.72
Ours (No Neg. Proxies)      0.74             0.28    0.46    0.80
Ours (With Neg. Proxies)    0.77             0.30    0.48    0.82
TABLE II: Classification and CBIR results on CheXpert [46].
(a) Classification AUC.
(b) CBIR nDCG.
(d) CBIR Precision.
Fig. 3: Classification and image retrieval results across a range of dataset sizes. Negative proxies are used for our approach. Note the logarithmic scale of the horizontal axes. See Section III-H for details on baseline DenseNet w/ BCE.

IV Experiments

IV-A Implementation

We use a DenseNet121 architecture [49] as our feature extractor network $f_\theta$, producing a feature vector with 1024 dimensions. The Adam optimisation algorithm [48] is used for model training, with coefficients $\beta_1$ and $\beta_2$. Our proposed method is trained for 50 epochs, with a learning rate of and a batch size of 48. The loss function hyperparameter $\sigma$ is set to 0.7, and unless otherwise stated, two proxies are used per class. Hyperparameter values were tuned using a withheld validation set.

During training, images are first resized to 270x270 and then randomly cropped to 224x224. Finally, images are normalised to a range of -1 to 1. For evaluation images (retrieval database and queries), the random cropping is replaced by a centre crop. For evaluating content-based image retrieval, we compute the metrics using the $k$-nearest database samples for each query. For the CheXpert and NIH datasets, $k$ is set to 10 and 100, respectively. This evaluation protocol follows the set-ups used in Haq et al. [2] and Chen et al. [3]. In the interest of fairness, the same backbone network, data augmentation and data preprocessing are used for the baseline methods.
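The cropping and normalisation steps can be sketched in NumPy as follows. The sketch assumes the image has already been resized to 270x270 by an image library and is stored as 8-bit pixel values; the linear mapping from [0, 255] to [-1, 1] and the function name `crop_and_normalise` are assumptions.

```python
import numpy as np

def crop_and_normalise(image, size=224, train=True, rng=None):
    """Crop a resized image and scale pixel values to [-1, 1].

    image: (H, W) or (H, W, channels) uint8 array, already resized to 270x270.
    train=True uses a random crop; train=False uses a centre crop.
    ASSUMPTION: 8-bit input mapped linearly from [0, 255] to [-1, 1].
    """
    h, w = image.shape[:2]
    if train:
        if rng is None:
            rng = np.random.default_rng()
        top = int(rng.integers(0, h - size + 1))
        left = int(rng.integers(0, w - size + 1))
    else:
        top, left = (h - size) // 2, (w - size) // 2
    patch = image[top:top + size, left:left + size].astype(np.float32)
    return patch / 127.5 - 1.0

white = np.full((270, 270), 255, dtype=np.uint8)
train_patch = crop_and_normalise(white, train=True, rng=np.random.default_rng(0))
eval_patch = crop_and_normalise(np.zeros((270, 270), dtype=np.uint8), train=False)
```

Passing an explicit random generator keeps the training-time crop reproducible, while the evaluation path is deterministic by construction.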

IV-B Datasets

CheXpert [46] is comprised of frontal and lateral chest X-rays from 67,740 individual patients. Following Haq et al.[2], we use the nine most common diseases found in the dataset. The diseases and their shorthand abbreviations used in this paper are: Enlarged Cardiomediastinum (EC), Cardiomegaly (CM), Lung Opacity (LO), Edema (ED), Consolidation (CS), Pneumonia (PNA), Atelectasis (AT), Pneumothorax (PTX) and Pleural Effusion (PE). Each sample is labelled as positive, negative or uncertain for each disease. Uncertain labels are ignored during training, but treated as positive for CBIR purposes, following the set-up used by Haq et al.[2]. The NIH Chest X-ray Dataset [47] consists of frontal-view X-ray images from 30,805 patients. Again following Haq et al.[2], we use the 13 most common disease labels in our experiments.

Fig. 4: Qualitative CBIR results. Images are annotated with disease labels (see Section IV-B for label definition).

IV-C Evaluation metrics

For evaluating CBIR performance, we analyse the three retrieval metrics described below.

IV-C1 Normalised Discounted Cumulative Gain (nDCG).

Discounted Cumulative Gain is the sum of the graded relevance of all of the retrieved images based on their rank position:


where $k$ is the number of images that the system is retrieving and $r_i$ is the graded relevance value of the $i$-th retrieved image. Each graded relevance value is the number of common positive labels between the query and retrieved image. Each value is discounted logarithmically according to the rank position of the retrieved image. This means that a highly relevant retrieved image is penalised if it appears late in the ranking. The normalised DCG (nDCG) is defined as:

$$\mathrm{nDCG} = \frac{\mathrm{DCG}}{\mathrm{iDCG}}, \tag{9}$$

where the ideal DCG (iDCG) is the maximum possible DCG that can be achieved based on the dataset.
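A possible implementation of nDCG for a single query is sketched below. It assumes the exponential gain form $(2^{r_i} - 1) / \log_2(i + 1)$ that is common in the retrieval literature; the exact gain used in Equation (8) is not recoverable here, and a linear gain would be a one-line change in `dcg`. The ideal DCG is computed from the best possible ordering of the same database, and all function names are hypothetical.

```python
import numpy as np

def graded_relevance(query_labels, retrieved_labels):
    """Number of positive labels shared by the query and each retrieved image."""
    return np.sum(np.logical_and(query_labels, retrieved_labels), axis=1)

def dcg(relevances):
    """ASSUMED gain form: (2^r - 1) / log2(rank + 1), with ranks from 1."""
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum((2.0 ** relevances - 1.0) / np.log2(ranks + 1)))

def ndcg_at_k(query_labels, ranked_db_labels, k):
    """nDCG over the top k of a database sorted by ascending distance."""
    rel = graded_relevance(query_labels, ranked_db_labels)
    ideal = np.sort(rel)[::-1][:k]     # best achievable ordering
    idcg = dcg(ideal)
    return dcg(rel[:k]) / idcg if idcg > 0 else 0.0

query = np.array([1, 1, 0])
ranked = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1]])   # as returned by the system
perfect = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1]])  # best possible order
score = ndcg_at_k(query, ranked, k=2)
best = ndcg_at_k(query, perfect, k=2)
```

The first ranking places the image sharing two labels in second position, so its score falls below the value obtained by the ideal ordering.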

IV-C2 Average Cumulative Gain (ACG).

The ACG is defined as:

$$\mathrm{ACG} = \frac{1}{k} \sum_{i=1}^{k} s_i, \tag{10}$$

where $s_i$ is the graded similarity value of the $i$-th retrieved image. Graded similarity is defined as the ratio of the number of common positive labels between the query image and the $i$-th retrieved image to the total number of positive labels in the query.
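The ACG for one query can be computed in a few lines. In this sketch the function name is hypothetical, and returning 0 for a query with no positive labels is an assumption, since the text does not say how such queries are handled.

```python
import numpy as np

def acg_at_k(query_labels, retrieved_labels):
    """Average graded similarity over the k retrieved images.

    query_labels: (C,) binary array; retrieved_labels: (k, C) binary array.
    """
    query_labels = np.asarray(query_labels)
    common = np.sum(np.logical_and(query_labels, retrieved_labels), axis=1)
    n_query = np.count_nonzero(query_labels)
    if n_query == 0:
        return 0.0  # ASSUMPTION: undefined case mapped to zero
    return float(np.mean(common / n_query))

query = np.array([1, 1, 0])
retrieved = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1]])
acg = acg_at_k(query, retrieved)
```

The three retrieved images share one, two and zero of the query's two positive labels, giving graded similarities of 0.5, 1 and 0.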

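IV-C3 Precision (Prec.).

The CBIR precision is the ratio of the number of relevant images to the total number of retrieved images, $k$. Each retrieved image that has at least one common label with the query image is considered to be a relevant image. Precision is defined in (11), where $\mathbb{1}$ is an indicator function and $r_i$ is the number of positive labels that the $i$-th retrieved image shares with the query:

$$\mathrm{Prec.} = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}(r_i > 0). \tag{11}$$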
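Precision follows the same pattern as the other retrieval metrics. A minimal sketch for a single query, with a hypothetical function name, is:

```python
import numpy as np

def precision_at_k(query_labels, retrieved_labels):
    """Fraction of retrieved images sharing at least one label with the query.

    query_labels: (C,) binary array; retrieved_labels: (k, C) binary array.
    """
    common = np.sum(np.logical_and(query_labels, retrieved_labels), axis=1)
    return float(np.mean(common > 0))

query = np.array([1, 1, 0])
retrieved = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1]])
precision = precision_at_k(query, retrieved)
```

Two of the three retrieved images share a positive label with the query, so they count as relevant.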


In order to evaluate pathology classification performance, we report the Area Under Receiver Operating Characteristic Curve (AUC). The receiver operating characteristic curve is a plot of the classifier’s True Positive Rate (TPR) against the False Positive Rate (FPR), produced by sweeping the classifier’s discrimination thresholds. The AUC measure will be between 0 and 1, where a higher value indicates a more robust classifier with a lower FPR across a range of TPRs.
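For a single disease label, the AUC can be computed without explicitly sweeping thresholds, using its equivalence to the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below uses this pairwise formulation; the function name is hypothetical, and how per-class values are aggregated into one reported figure is not specified here.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve for one class.

    labels: (N,) binary array; scores: (N,) classification scores.
    Counts positive/negative pairs ranked correctly, with ties worth 0.5.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form uses memory proportional to the number of positive-negative pairs, so a rank-based implementation would be preferable for large test sets.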

(a) Edema/Cardiomegaly.
(b) Pneumothorax/Lung Opacity.
(c) Pleural Effusion/Atelectasis.
Fig. 5: Visualisation of the retrieval database feature vector space. Best viewed zoomed in on a monitor.
(a) With negative proxies.
(b) Without negative proxies.
Fig. 6: Feature space visualisation showing the effect of including proxies for negative samples, i.e. samples with no positive labels. A positive sample is a sample with any positive disease label. When negative proxies are used, negative samples are better clustered and are co-located with fewer positive samples. Best viewed zoomed in on a monitor.
Fig. 7: Effect of the number of proxies defined per class label on classification and image retrieval performance. Note the logarithmic scale of the horizontal axis.

IV-D NIH literature comparison

We compare our approach to state-of-the-art image retrieval methods from the literature on the NIH dataset. All of the compared methods were designed for multi-label data, and three of them are multi-label metric learning approaches [3, 27, 28]. We follow the experimental setups of Haq et al. [2] and Chen et al. [3]. For training, 12,000 images are used, each containing at least one positive label from one of the 13 most common diseases in the dataset. Because every training image has a positive label, negative proxies are not used in this experiment. For testing, a further 1,000 samples are selected. As seen in the retrieval nDCG results in Table I, our method significantly outperforms the existing methods on the CBIR task.

IV-E CheXpert evaluation

To further evaluate our method, we compare both classification and CBIR results on the CheXpert dataset to the baseline methods detailed in Section III-H. As training sample efficiency is important in the medical domain, due to the difficulty of obtaining high-quality annotations, we evaluate performance across a range of training set sizes, with a particular interest in the smaller sizes. In these experiments, both positive samples (with at least one positive label from one of the nine most common diseases) and negative samples (no positive labels) are used. As such, negative proxies are leveraged in these experiments.

Table II shows all classification and CBIR metrics using a training set size of 4,096 samples. Our approach outperforms the baseline methods across all classification and CBIR metrics. The naive multi-label extension of Proxy-NCA (ML-ProxyNCA) performs comparatively poorly in Tables I and II, demonstrating that multi-label proxy metric learning is non-trivial to achieve. In contrast, our novel formulation of proxy metric learning for multi-label data is highly effective for both pathology classification and CBIR.

Fig. 3 compares the performance of our method to the classifier baseline (DenseNet w/ BCE) across a large range of dataset sizes (from 1,024 to 65,536 samples). These results show a consistent and significant performance advantage for our method in both pathology classification and CBIR. The advantage holds across both small and large training set sizes, and is largest when training data is limited. This shows the suitability of our method for data-constrained medical problems.

IV-F Qualitative evaluation

Example image retrieval results for sample query X-rays are shown in Figs. 1 and 4. Query images are from a test set that is withheld during training. Disease annotations are indicated by the shorthand names defined in Section IV-B. In general, there is a strong similarity between the disease labels of the query samples and those of the retrieved samples. Fig. 5 uses t-SNE visualisations [50] of the retrieval database feature vector space to show some of the relationships between diseases learned by the model. Each visualisation selects two disease labels and indicates by colour the samples that have only one of those labels positive or both labels positive. Samples are co-located in the feature space based on their combination of positive disease labels.
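The colour grouping used in such a two-label visualisation can be sketched as follows. This is a hypothetical helper written for illustration only; the group names and function are assumptions, and the 2-D coordinates themselves would come from a separate t-SNE implementation applied to the feature vectors.

```python
def label_pair_group(labels, idx_a, idx_b):
    """Assign a sample to a colour group based on two chosen disease labels.

    labels is the sample's binary label vector; idx_a and idx_b are the
    positions of the two diseases being visualised.
    """
    a, b = bool(labels[idx_a]), bool(labels[idx_b])
    if a and b:
        return "both"
    if a:
        return "first only"
    if b:
        return "second only"
    return "neither"


def group_samples(label_matrix, idx_a, idx_b):
    """Map each group name to the indices of the samples that fall in it."""
    groups = {}
    for i, labels in enumerate(label_matrix):
        groups.setdefault(label_pair_group(labels, idx_a, idx_b), []).append(i)
    return groups
```

Each group would then be plotted in its own colour, with samples in the "neither" group either omitted or drawn in a neutral shade.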

IV-G Proxies per class

Fig. 7 shows the effect that varying the number of proxies defined for each class label has on classification and CBIR performance. There is a benefit to having multiple proxies per class, with two providing the best results. Interestingly, once the number of proxies per class passes eight, performance begins to drop below the level of a single proxy per class. This is likely due to over-parameterisation of the feature space, resulting in overfitting and proxies that do not generalise as well to unseen samples. The extreme case would be one proxy for each training sample, where, without additional regularisation, the model would not necessarily need to generalise.

IV-H Negative proxies

We analyse the effect of using negative proxies both quantitatively and qualitatively. Table II shows that excluding negative proxies during training results in a performance drop for both classification and CBIR. The t-SNE visualisation [50] of the feature vector space in Fig. 6 shows that without negative proxies, more positive samples are peppered throughout the primary negative sample cluster, compared to when negative proxies are used. Negative proxies are important for accurately encoding the semantic content of samples with no disease findings, and allow for better discrimination between negative and positive samples.

V Conclusion and future work

Computer-aided diagnosis systems can help to reduce the workload of healthcare professionals, potentially resulting in improved patient outcomes [2]. In this paper, we presented a novel model that can be jointly used for pathology classification and content-based image retrieval. Our multimorbidity metric learning approach uses the power of proxies to efficiently learn a feature vector space that encodes the relationships between disease labels. We showed the efficacy of our approach on two chest X-ray datasets, demonstrating a performance advantage over the baseline and state-of-the-art methods in classification and retrieval.

Defining multiple proxies for each disease resulted in improved classification and CBIR performance. This leads to a research question: can defining a variable number of proxies across diseases help to alleviate the effects of unbalanced data? Such unbalanced data is common in the medical domain, where particular disease annotations may be scarce due to the rarity of the disease. Setting the number of proxies per disease with consideration to the disease distribution may help to improve model training on unbalanced data. We leave this as a promising future research direction.


References

  • [1] Ahmed Hosny, Chintan Parmar, John Quackenbush, Lawrence H Schwartz, and Hugo JWL Aerts. Artificial intelligence in radiology. Nature Reviews Cancer, 18(8):500–510, 2018.
  • [2] Nandinee Fariah Haq, Mehdi Moradi, and Z Jane Wang. A deep community based approach for large scale content based x-ray image retrieval. Medical Image Analysis, 68:101847, 2021.
  • [3] Zhixiang Chen, Ruojin Cai, Jiwen Lu, Jianjiang Feng, and Jie Zhou. Order-sensitive deep hashing for multimorbidity medical image retrieval. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 620–628. Springer, 2018.
  • [4] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4004–4012, 2016.
  • [5] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems, 29, 2016.
  • [6] Ben Harwood, Vijay Kumar BG, Gustavo Carneiro, Ian Reid, and Tom Drummond. Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2821–2829, 2017.
  • [7] Benjamin J Meyer, Ben Harwood, and Tom Drummond. Deep metric learning and image classification with nearest neighbour gaussian kernels. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 151–155. IEEE, 2018.
  • [8] Fatih Cakir, Kun He, Xide Xia, Brian Kulis, and Stan Sclaroff. Deep metric learning to rank. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1861–1870, 2019.
  • [9] Aoxiao Zhong, Xiang Li, Dufan Wu, Hui Ren, Kyungsang Kim, Younggon Kim, Varun Buch, Nir Neumark, Bernardo Bizzo, Won Young Tak, et al. Deep metric learning-based image retrieval system for chest radiograph and its clinical applications in covid-19. Medical Image Analysis, 70:101993, 2021.
  • [10] Gary Marcus. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018.
  • [11] Xiaoxu Li, Xiaochen Yang, Zhanyu Ma, and Jing-Hao Xue. Deep metric learning for few-shot image classification: A selective review. arXiv preprint arXiv:2105.08149, 2021.
  • [12] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208, 2018.
  • [13] Xiaomeng Li, Lequan Yu, Chi-Wing Fu, Meng Fang, and Pheng-Ann Heng. Revisiting metric learning for few-shot image classification. Neurocomputing, 406:49–58, 2020.
  • [14] Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pages 360–368, 2017.
  • [15] Eu Wern Teh, Terrance DeVries, and Graham W Taylor. Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis. In European Conference on Computer Vision, pages 448–464. Springer, 2020.
  • [16] Oren Rippel, Manohar Paluri, Piotr Dollar, and Lubomir Bourdev. Metric learning with adaptive density discrimination. arXiv preprint arXiv:1511.05939, 2015.
  • [17] Shaoting Zhang and Dimitris Metaxas. Large-scale medical image analytics: recent methodologies, applications and future directions, 2016.
  • [18] Zhongyu Li, Xiaofan Zhang, Henning Müller, and Shaoting Zhang. Large-scale retrieval for medical image analytics: A comprehensive review. Medical image analysis, 43:66–84, 2018.
  • [19] Md Mahmudur Rahman, Sameer K Antani, and George R Thoma. A learning-based similarity fusion and filtering approach for biomedical image retrieval using svm classification and relevance feedback. IEEE Transactions on Information Technology in Biomedicine, 15(4):640–646, 2011.
  • [20] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for large-scale search. IEEE transactions on pattern analysis and machine intelligence, 34(12):2393–2406, 2012.
  • [21] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE transactions on pattern analysis and machine intelligence, 35(12):2916–2929, 2012.
  • [22] Rushi Lan, Si Zhong, Zhenbing Liu, Zhuo Shi, and Xiaonan Luo. A simple texture feature for retrieval of medical images. Multimedia Tools and Applications, 77(9):10853–10866, 2018.
  • [23] Amit Shah, Sailesh Conjeti, Nassir Navab, and Amin Katouzian. Deeply learnt hashing forests for content based image retrieval in prostate mr images. In Medical Imaging 2016: Image Processing, volume 9784, pages 302–307. SPIE, 2016.
  • [24] Xinran Liu, Hamid R Tizhoosh, and Jonathan Kofman. Generating binary tags for fast medical image retrieval based on convolutional nets and radon transform. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 2872–2878. IEEE, 2016.
  • [25] Yaron Anavi, Ilya Kogan, Elad Gelbart, Ofer Geva, and Hayit Greenspan. A comparative study for chest radiograph image retrieval using binary texture and deep learning classification. In 2015 37th annual international conference of the IEEE engineering in medicine and biology society (EMBC), pages 2940–2943. IEEE, 2015.
  • [26] Sailesh Conjeti, Abhijit Guha Roy, Amin Katouzian, and Nassir Navab. Hashing with residual networks for image retrieval. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 541–549. Springer, 2017.
  • [27] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2064–2072, 2016.
  • [28] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. Deep hashing for compact binary codes learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2475–2483, 2015.
  • [29] Xingyu Gao, Steven CH Hoi, Yongdong Zhang, Ji Wan, and Jintao Li. Soml: Sparse online metric learning with application to image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
  • [30] Benjamin J Meyer and Tom Drummond. The importance of metric learning for robotic vision: Open set recognition and active learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 2924–2931. IEEE, 2019.
  • [31] Luke Ditria, Benjamin J Meyer, and Tom Drummond. Opengan: Open set generative adversarial networks. In Proceedings of the Asian Conference on Computer Vision, 2020.
  • [32] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. Advances in neural information processing systems, 6, 1993.
  • [33] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
  • [34] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 539–546. IEEE, 2005.
  • [35] Kilian Q Weinberger, John Blitzer, and Lawrence Saul. Distance metric learning for large margin nearest neighbor classification. Advances in neural information processing systems, 18, 2005.
  • [36] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • [37] Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5385–5394, 2016.
  • [38] Jacob Goldberger, Geoffrey E Hinton, Sam Roweis, and Russ R Salakhutdinov. Neighbourhood components analysis. Advances in neural information processing systems, 17, 2004.
  • [39] Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. Proxy anchor loss for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3238–3247, 2020.
  • [40] Nicolas Aziere and Sinisa Todorovic. Ensemble deep manifold similarity learning using hard proxies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7299–7307, 2019.
  • [41] Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, and Rong Jin. Softtriple loss: Deep metric learning without triplet sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6450–6458, 2019.
  • [42] Gencer Sumbul, Mahdyar Ravanbakhsh, and Begüm Demir. Informative and representative triplet selection for multilabel remote sensing image retrieval. IEEE Transactions on Geoscience and Remote Sensing, 60:1–11, 2021.
  • [43] Mauro Annarumma and Giovanni Montana. Deep metric learning for multi-labelled radiographs. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pages 34–37, 2018.
  • [44] Changsheng Li, Chong Liu, Lixin Duan, Peng Gao, and Kai Zheng. Reconstruction regularized deep metric learning for multi-label image classification. IEEE transactions on neural networks and learning systems, 31(7):2294–2303, 2019.
  • [45] Qi Liu, Wenhan Li, Zhiyuan Chen, and Bin Hua. Deep metric learning for image retrieval in smart city development. Sustainable Cities and Society, 73:103067, 2021.
  • [46] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590–597, 2019.
  • [47] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017.
  • [48] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [49] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [50] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.