Attention-based Dynamic Subspace Learners for Medical Image Analysis

Learning similarity is a key aspect of medical image analysis, particularly in recommendation systems or in uncovering the interpretation of anatomical data in images. Most existing methods learn such similarities in the embedding space over image sets using a single metric learner. Images, however, have a variety of object attributes such as color, shape, or artifacts. Encoding such attributes using a single metric learner is inadequate and may fail to generalize. Instead, multiple learners could focus on separate aspects of these attributes in subspaces of an overarching embedding. This, however, implies that the number of learners must be found empirically for each new dataset. This work, Dynamic Subspace Learners, proposes to dynamically exploit multiple learners by removing the need to know the number of learners a priori and by aggregating new subspace learners during training. Furthermore, the visual interpretability of such subspace learning is enforced by integrating an attention module into our method. This integrated attention mechanism provides visual insight into the discriminative image features that contribute to the clustering of image sets and a visual explanation of the embedding features. The benefits of our attention-based dynamic subspace learners are evaluated in the applications of image clustering, image retrieval, and weakly supervised segmentation. Our method achieves results competitive with multiple-learner baselines and significantly outperforms the classification network in terms of clustering and retrieval scores on three different public benchmark datasets. Moreover, our attention maps offer proxy-labels, which improve the segmentation accuracy by up to 15% compared to state-of-the-art interpretation techniques.



I Introduction

Learning the similarity between arbitrary images is a fundamental problem in many key areas of computer vision such as image retrieval [53, 37, 17], recommender systems [33, 31], duplicate detection [73], clustering [77], or zero-shot learning [71]. In this context, metric learning is commonly used for measuring similarities by learning a distance function over objects [62, 26]. Recently, deep metric learning (DML) has emerged as a powerful approach to learn these similarities [20]. More specifically, the goal of DML is to learn an embedding space where images from the same class are encouraged to be close to one another, whereas images belonging to different classes are pushed apart. In recent DML approaches, the loss function is typically expressed in terms of Euclidean distances or cosine similarities between pairs or tuples of images in the embedding space. Well-known losses employed in DML include contrastive loss [15], triplet loss [60], lifted structure loss [39], N-pairs loss [53], margin loss [63], angular loss [59], and ProxyNCA loss [37]. In addition to novel learning objectives, recent efforts have also been devoted to designing efficient sample-mining [63] or sample-weighting [61] strategies.

Most of these methods learn the embedding mapping function with a single metric learner. However, medical images have complex distributions consisting of different object attributes such as color, shape, size, or artifacts. Thus, a single learner may be inadequate to capture the complex similarity associated with these different object attributes. A few attempts have been made towards leveraging multiple metric learners to address this complexity. For example, Kim et al. [22] ensemble multiple learners, whereas a divide-and-conquer strategy is used in [46] by splitting the manifold into several embedding subspaces. One main limitation of these approaches is the need to empirically find the optimal number of learners, which requires a new validation for every new setting, including every new dataset. Furthermore, the sizes of the embedding subspaces associated with each learner might need to differ, since learning the various sets of object attributes requires varying degrees of modeling complexity.

Despite the popularity of DML, surprisingly few works attempt to visually explain which regions contribute to the similarity between images in embedding networks [18]. These visualizations are of pivotal importance since they provide an efficient mechanism to understand the predictions of the model. Recent efforts have been devoted to the interpretability of deep neural networks, resulting in a variety of different approaches [69, 24, 49, 9, 4]. Among these methods, GradCAM [49] has been widely employed to explain deep classification models. This method uses gradients to highlight the discriminative regions of an image. Nevertheless, since the gradients are not available during testing, directly applying this strategy in embedding networks is not feasible [8]. Integrating interpretability in embedding networks requires either attaching an additional classification branch [72] or employing multiple images simultaneously [54, 76]. Interpretability is of particular interest in medical imaging, as visual explanations of predictions directly impact the diagnosis, therapy planning, and follow-up of many diseases. Thus, existing DML approaches may be inadequate to visually uncover what constitutes similarities among a complex set of medical images.

Motivated by these gaps and the scarcity of the DML literature in medical imaging, we propose a novel attention-based dynamic subspace learners approach. The underlying metric learning method is inspired by the idea of a divide-and-conquer strategy. More specifically, we propose to follow the approach of [46] in order to capture different object attributes, each of them processed with an independent subspace learner. These subspace learners, which have variable sizes, are added dynamically whenever the network accuracy plateaus during training. This avoids the need to find the number of subspace learners a priori while retaining state-of-the-art performance. Furthermore, the visual interpretation of the embedding is addressed by integrating an attention module after the feature extraction layers, encouraging the learners to focus on the discriminative areas of target objects. Consequently, the learning process provides visual insight into which image regions contribute most to the clustering of image sets, in the form of pixel-wise interpretable predictions.

Our Contribution

We contribute a novel approach to deep metric learning and illustrate its application in medical image analysis. More precisely, we propose a training strategy that (i) explores the dynamic learning of an embedding, (ii) overcomes the empirical search for an optimal number of subspaces in approaches based on multiple metric learners, and (iii) produces compact subspaces of variable size to attend to different object attributes. Furthermore, the integration of an attention module in our dynamic learner approach focuses the attention of each independent learner on the discriminative regions of an object of interest. This attention mechanism provides the added benefit of visually interpreting relevant embedded features. The proposed method is evaluated through extensive experiments on three publicly available benchmarks: ISIC19 [10, 11], MURA [44], and HyperKvasir [5]. The performance is evaluated on clustering and image retrieval tasks, showing that the proposed method achieves results competitive with the state of the art without requiring grid searches over the number of learners. We also demonstrate that the attention maps produced by our method can be used as proxy-labels to train deep segmentation models. In particular, we evaluate our approach on ISIC18 [57, 10] in a weakly supervised segmentation task and show improvements over the visual attention and class activation maps obtained from recent state-of-the-art methods, including a method specifically designed for skin lesion detection [70].

II Related Work

II-A Deep Metric Learning

Metric learning is a widely explored research field in the learning community [6, 62]. The seminal work of Siamese Networks [6] represents the first attempt to use neural networks for feature embedding. Its concept is to employ two identical neural networks that learn a contrastive embedding from a pair of images. With the advent of deep learning, deep metric learning (DML) has gained popularity, becoming a mainstay in many modern computer vision problems, such as image retrieval [40], person re-identification [29], or few-shot learning [52]. In DML, the images are mapped into a manifold space via deep neural networks. Euclidean or cosine distances can then be directly used as a metric distance between two images in this mapped space. Typical losses employed in DML include contrastive [15] or triplet loss [48]. The contrastive loss [15] encourages images from the same class to stay closer in the learned manifold while pushing away samples from different classes, which should be separated by a given fixed distance. Nevertheless, forcing the same distance for all pairs of images can discourage any potential distortion in the embedded space. In contrast, this assumption is relaxed in the triplet loss [48], which only imposes that negative pairs of images should be further away than positive pairs.

In the same direction as our work, [22] and [46] have leveraged multiple learners to diversify the learning space towards different object attributes. While [22] propose an ensemble of multiple learners driven by attention, a divide-and-conquer strategy is employed in [46], which promotes the discovery of multiple subspaces. Specifically, Sanakoyeu et al. [46] explicitly split the embedded space into a predefined number of learners with fixed-size subspaces. Then, each learner independently learns a part of the embedding space, i.e., a subspace, from a portion of the clustered data, and the final embedding is later refined from the multiple learners. Even though this strategy leads to improvements over its single-learner counterpart, a grid search is needed to find an optimal number of learners for each new dataset. Furthermore, the size of the embedding subspace is uniform across the learners, whereas some attributes, such as color, might require smaller embeddings to encode the information than other attributes, such as shape.

II-B Metric Learning in Medical Image Analysis

Despite the interest in other domains, metric learning, and more particularly DML, remains almost unexplored in medical imaging. In the pre-deep learning era, related work includes [67], which employed distance metric learning in a traditional boosting framework in a medical image retrieval scenario. More recently, [66] investigate the use of DML to model the similarity relationship between lesions in the context of radiology images, where a triplet loss is employed to learn the lesion embeddings. Gupta et al. [14] also resort to the triplet loss to learn the underlying manifold space for the task of mitotic classification, whose embedded features are subsequently used as input to a Support Vector Machine classifier. Recently, a combination of a cross-entropy loss with a contrastive or triplet loss has been used to classify whole slide images in digital pathology [55, 42]. In [50], a triplet loss is used to learn a representation of source domain images, which is later used for target domain classification under the few-shot learning paradigm. In [56], DML is used to pre-train a model for digital pathology classification, where the authors use a ProxyNCA loss for learning transferable features. To enhance the embedding, [74, 68] have integrated a multi-similarity loss into DML in the context of chest radiography and liver histopathology images, respectively. Nevertheless, most of these methods are developed for classification tasks and do not effectively leverage the geometrical information of the underlying embedding space.

II-C Weakly Supervised Segmentation

Weakly supervised segmentation (WSS) has emerged as an alternative to alleviate the need for large amounts of pixel-level labelled data. Weak supervision can come in the form of image-level labels [41], scribbles [30], points [3], bounding boxes [43] or direct losses [21]. Among them, image-level labels are easier and less expensive to obtain [3]. Particularly, class activation maps (CAM) [75] have gained popularity in identifying saliency regions based on image labels. This is achieved by associating feature maps of the last layers and weighting their activation using a global average pooling (GAP) layer. However, the generated saliency maps are typically spread around the target object, only focusing on the most discriminant areas. This limits their usability as pixel-level supervision for semantic segmentation. To enhance the generated saliency regions, some alternatives based on back-propagation (GradCAM [49]) or super-pixels (SP-CAM [27]) have been proposed. Nevertheless, these methods demand additional gradient computations [49] or supervision [27].

The literature on WSS in medical imaging remains scarce. While a few methods resort to direct losses, hence requiring additional priors such as the target size [19, 21], other approaches rely on stronger forms of supervision, for instance, using bounding boxes [43] or scribbles [7]. Tackling WSS from the perspective of image-level labels typically involves visual features, which has not been thoroughly investigated [13, 35, 38, 12]. For example, Nguyen et al. [38] have proposed a CAM-based approach for the segmentation of uveal melanoma. In their method, the CAMs generated by the classification network are further refined by an active shape model and conditional random fields [25]. More recently, CAMs derived from image-level labels have been combined with attention scores to refine lesion segmentation in brain images [64], which improved performance compared to the vanilla version of CAMs. Nevertheless, these methods typically integrate CAM/GradCAM with complex models to enhance the performance of the final segmentation.

III Methodology

Fig. 1: Overview of our proposed attention-based dynamic subspace learners - The embedding space is dynamically divided into subspaces of varying sizes during training. Suppose there are $K$ subspaces at a particular training time; the data are first grouped into $K$ groups in the full embedding space (step 1, from epoch 1 to 250) and each subgroup of data is assigned to an individual subspace learner. Each learner then only attends to the data from its subgroup in the learning stage (step 2). At inference time, our method uses the entire embedding space $E$ to map an image. Best viewed in color.

III-A Overview

An overview of the proposed approach is depicted in Fig. 1. The main idea is to split the embedding space into multiple subspaces ($K$) such that the original embedding space can be learned by refining its subspaces. Contrary to [46], the embedding space is split dynamically, which removes the need to search for the optimal number of learners in each scenario. The whole process is divided into two iterative steps. First, input images are mapped into the lower-dimensional embedding space using the entire embedding layer ($d$-dimensional), where they are clustered into different groups. Second, each group of clustered data is assigned to an individual subspace learner, and the corresponding images are used to train that subspace. These two steps are repeated at regular intervals, as well as each time a new learner is added. The key idea is that each subspace learner learns a part of the embedding space from a subgroup of images instead of learning a whole embedded representation vector. Finally, all subspaces are combined to generate the full embedding space. Furthermore, an attention module is integrated within the learning process to guide the learning of distance metrics. The following sections describe the deep metric learning formulation and present the proposed dynamic subspace metric learning and the attention module.

III-B Deep Metric Learning Formulation

Let the training dataset be defined as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where the $i$-th image is denoted as $x_i \in \mathcal{X}$, and $y_i \in \{1, \dots, C\}$ is its corresponding class label. $C$ defines the total number of classes. The goal of deep metric learning is to learn an embedding function $f_\theta : \mathcal{X} \rightarrow \mathbb{R}^{d}$, which discriminatively maps semantically similar images (same class) in the input space $\mathcal{X}$ onto metrically close points in the learned manifold $\mathbb{R}^{d}$. Similarly, semantically dissimilar images (different classes) in $\mathcal{X}$ should be mapped metrically far apart in $\mathbb{R}^{d}$. The parameters $\theta$ of the mapping function are typically learned by a convolutional neural network. Formally, the distance metric $d_f$ between two images in the embedding space can be defined as:

$$ d_f(x_i, x_j) = \lVert f_\theta(x_i) - f_\theta(x_j) \rVert_2 \qquad (1) $$

where $\lVert \cdot \rVert_2$ denotes the Euclidean norm. This distance can be minimized in different ways, depending on the loss function employed. In this work, we resort to the margin loss [63]:

$$ \ell_{\mathrm{margin}}(x_i, x_j) = \max\bigl(0,\; \alpha + \mu_{ij}\,(d_f(x_i, x_j) - \beta)\bigr) \qquad (2) $$

where $\beta$ is the boundary between the similar and dissimilar pairs, $\alpha$ is a separation margin, and $\mu_{ij} \in \{-1, 1\}$ indicates whether the images in the pair are similar ($\mu_{ij} = 1$) or different ($\mu_{ij} = -1$). Note that any other metric learning loss function can be employed with our approach.
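As a rough illustration of the pairwise distance and margin loss described above, the following Python sketch evaluates the loss on toy 2-D embeddings. It assumes the usual form of the margin loss from [63], max(0, α + μ(d − β)), with α = 0.2 and β = 1.2 as reported in the implementation details. The function names and the toy vectors are hypothetical and are not taken from the authors' code.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def margin_loss(u, v, similar, alpha=0.2, beta=1.2):
    """Margin loss for one pair of embeddings (assumed form, see lead-in).

    similar=True encodes mu = +1 (same class), similar=False encodes mu = -1.
    """
    mu = 1.0 if similar else -1.0
    return max(0.0, alpha + mu * (euclidean_distance(u, v) - beta))


# Toy embeddings (illustrative values only).
anchor = [0.0, 0.0]
near = [0.3, 0.4]   # about 0.5 away from the anchor
far = [1.2, 1.6]    # about 2.0 away from the anchor

print(margin_loss(anchor, near, similar=True))    # close positive pair: no penalty
print(margin_loss(anchor, far, similar=True))     # distant positive pair: penalized
print(margin_loss(anchor, near, similar=False))   # close negative pair: penalized
print(margin_loss(anchor, far, similar=False))    # distant negative pair: no penalty
```

Under this form, positive pairs are only penalized once their distance exceeds β − α, and negative pairs once their distance falls below β + α.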

III-C Dynamic Subspace Learners

A complex problem can be made more tractable by dividing it into smaller sub-problems, which are easier to solve. We follow the approach in [46], where the embedding space and the data are split into multiple groups. Specifically, the embedding space is split by slicing the last dense layer of the network into $K$ sub-vectors of the same size, i.e., $d/K$ for a $d$-dimensional embedding. Furthermore, the data is clustered into $K$ groups based on pairwise distances in the embedding space, for instance, using K-means. Then, a set of $K$ independent learners is used to learn over each subspace by using a fraction of the input data, thereby reducing the complexity of the original problem. Nevertheless, a major bottleneck is finding an optimal number of subspaces to learn an effective embedding, which must be found empirically for every new dataset. Moreover, the embedding space is divided equally, which is ineffective as not all the object attributes require the same size to encode the information.
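The fixed divide-and-conquer step described above (equal slicing of the embedding plus K-means grouping of the data) can be sketched in a few lines of plain Python. This is only a toy illustration: the helper names, the tiny 4-D embeddings, and the choice of initializing centroids at the first K points are assumptions made for the example and do not come from [46] or from the authors' implementation.

```python
def slice_embedding(vector, k):
    """Split one embedding vector into k contiguous sub-vectors of equal size."""
    size = len(vector) // k
    return [vector[i * size:(i + 1) * size] for i in range(k)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; centroids start at the first k points (toy choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Four toy 4-D embeddings forming two well-separated groups.
embeddings = [
    [0.0, 0.1, 0.0, 0.2],
    [5.0, 5.1, 4.9, 5.0],
    [0.1, 0.0, 0.2, 0.1],
    [5.2, 4.9, 5.0, 5.1],
]
K = 2
groups = kmeans(embeddings, K)                   # data group per image
subspaces = slice_embedding(embeddings[0], K)    # equal-size sub-vectors
print(groups)
print(subspaces)
```

Each data group would then be paired with one sub-vector, so that every learner sees only a fraction of the data and a fraction of the embedding.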

Contrary to [46], our proposed learning strategy finds an optimal embedding by dynamically splitting the embedding space and associating each split with a metric learner during training. To construct each subspace, we group highly contributing neurons of the embedding layer $E$, which is repeated until network convergence. Initially, the entire embedding space is learned with all the data, with an initial single learner ($K = 1$). As the learning progresses, the accuracy of the model starts to reach an initial plateau. At this stage, we compute the score $s_j$ of each neuron $e_j$ in the embedding layer, similarly to the pruning strategy in [36]. In particular, the low-scoring neurons are pruned such that the performance drop of the model is minimal, i.e., such that the change in the loss $\mathcal{L}$ when setting $e_j = 0$ is as small as possible. By using a Taylor expansion, as in [36], the scoring of each neuron can be reduced to:

$$ s_j = \left| \frac{\partial \mathcal{L}}{\partial e_j}\, e_j \right| \qquad (3) $$

Thus, the scoring of neurons is simplified to multiplying the activation and the gradient output in the embedding layer. This score is computed for each training example separately, and is consequently averaged across all training data and normalized to $[0, 1]$. The neurons having high normalized scores are subsequently grouped to form a new subspace. Particularly, the neurons having more than 50% of the confidence score, i.e., $s_j > 0.5$, are grouped as a new subspace. The current metric learner ($f_K$) is later assigned to this group of neurons. The remaining neurons of the embedding layer, $E_r$, are eventually reset, similar to the pruning technique [36], and are assigned a new metric learner as in Eq. 4. After adding this new learner, the training data is clustered by mapping it into the entire embedding space using K-means with the updated $K$ ($K = 2$ for the second iteration). Note that the entire embedding space here is a combination of all the subspaces. Each learner is eventually assigned a subgroup of data from the clustering, resulting in each learner being trained with a fraction of the input data. The addition of a new learner is repeated with the remaining neurons when the network performance reaches a new plateau, until convergence. In the end, this results in $K$ mapping functions, $\{f_1, \dots, f_K\}$, where each mapping function projects the images into the corresponding subspace of $E$, each with a variable size.
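To make the neuron-scoring step more concrete, the sketch below scores each embedding neuron by the magnitude of activation times gradient, averages over samples, normalizes, and keeps the neurons scoring above 0.5 as a new subspace. It is an assumption-laden illustration: the text only states that scores are normalized, so dividing by the maximum score is a choice made here, and the function names and toy values are hypothetical.

```python
def neuron_scores(activations, gradients):
    """Average |activation * gradient| per neuron, then scale to [0, 1].

    activations, gradients: lists of per-sample lists, one value per neuron.
    Normalizing by the maximum score is an assumption of this sketch.
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])
    raw = [
        sum(abs(activations[s][j] * gradients[s][j]) for s in range(n_samples)) / n_samples
        for j in range(n_neurons)
    ]
    top = max(raw)
    return [r / top if top > 0 else 0.0 for r in raw]


def split_neurons(scores, threshold=0.5):
    """Return (indices forming the new subspace, indices left for later learners)."""
    selected = [j for j, s in enumerate(scores) if s > threshold]
    remaining = [j for j, s in enumerate(scores) if s <= threshold]
    return selected, remaining


# Two toy samples, four embedding neurons (illustrative values only).
activations = [[1.0, 0.5, 2.0, 0.1], [1.0, 0.5, 2.0, 0.1]]
gradients = [[0.8, 0.2, 0.5, 0.1], [0.4, 0.2, 0.3, 0.1]]

scores = neuron_scores(activations, gradients)
new_subspace, remaining = split_neurons(scores)
print(new_subspace, remaining)
```

In this toy case neurons 0 and 2 would form the new subspace, while neurons 1 and 3 would be reset and handed to the next learner.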

All learners are trained jointly by resorting to the margin loss [63], which for each learner can be defined as:

$$ \mathcal{L}_k = \sum_{(x_i, x_j) \in b} \max\bigl(0,\; \alpha + \mu_{ij}\,(d_{f_k}(x_i, x_j) - \beta)\bigr) \qquad (4) $$

where $b$ is the current mini-batch (uniformly sampled from each data group) having both positive and negative classes, $d_{f_k}$ is the distance metric (similar to Eq. 1) for the $k$-th learner, and $\alpha$, $\beta$, and $\mu_{ij}$ are the separation margin, boundary, and pair indicator of the margin loss. Once the individual learners are trained, they are merged to compose the entire embedding space, which is refined with the entire training set. Furthermore, assuming that the learned embedding space improves over time, we re-cluster the images every $T_c$ epochs by mapping all the images using the entire embedding space $E$. An outline of the proposed method is presented in Algorithm 1.

Inputs: X, X_test: training and test data
        θ: backbone network parameters
        E: embedding space
        T_c, T_p: clustering and network plateau thresholds
Initialize: K ← 1, number of learners
            B ← 0, best epoch
            ep ← 1, current epoch
            E_r ← E, remaining embedding space
            RC ← True, re-clustering flag
while not converged do
    if RC then                                          // re-cluster the data
        E ← ConcatEmbedding({E_1, E_2, …, E_K})
        emb ← ComputeEmbedding(X, θ, E)
        {C_1, C_2, …, C_K} ← ClusterData(emb, K)
        {E_1, E_2, …, E_K} ← SplitEmbedding(E, K)
        RC ← False
    repeat                                              // train all learners
        C_k ∼ {C_1, C_2, …, C_K}
        b ← GetBatch(C_k)
        L_k ← FPass(b, θ, E_k)
        θ, E_k ← BPass(L_k, θ, E_k)
    until epoch completed
    ep ← ep + 1
    E ← ConcatEmbedding({E_1, E_2, …, E_K})
    RC ← (ep mod T_c == 0)
    if Evaluate(X_test, θ, E, ep) > B then              // is best
        B ← ep
    else if ep − B > T_p then                           // is network plateaued
        K ← K + 1                                       // update new learner
        {E_K, E_r} ← splitLearner({E_r}) using Eq. 3
        {E_1, …, E_K} ← SplitEmbedding(E, K, E_r)
        RC ← True
E ← ConcatEmbedding({E_1, E_2, …, E_K})
θ, E ← FineTune(X, θ, E)
Output: θ, E
Algorithm 1: Dynamic Subspace Learner Pseudocode
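The plateau-driven growth of the number of learners in Algorithm 1 can be illustrated with a small control-flow sketch. It only tracks how K would evolve for a given sequence of per-epoch evaluation scores; the training, clustering, and splitting steps are omitted. The function name, the toy scores, and the decision to restart the plateau counter after adding a learner are assumptions of this example, not details confirmed by the text.

```python
def learners_over_time(scores, plateau_threshold):
    """Return the number of learners K after each epoch.

    scores: one evaluation score per epoch (higher is better).
    A learner is added when no improvement has been seen for more than
    `plateau_threshold` epochs; the counter is then restarted (assumption).
    """
    k = 1
    best_score = float("-inf")
    best_epoch = 0
    history = []
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch > plateau_threshold:
            k += 1                 # network plateaued: add a new learner
            best_epoch = epoch     # restart the plateau counter
        history.append(k)
    return history


# Toy evaluation curve with two plateaus (illustrative values only).
toy_scores = [0.50, 0.55, 0.55, 0.54, 0.55, 0.60, 0.60, 0.59, 0.60]
history = learners_over_time(toy_scores, plateau_threshold=2)
print(history)
```

With the plateau threshold of 10 epochs reported in the implementation details, the same logic would simply wait longer before adding each learner.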

III-D Attentive Dynamic Subspace Learners

Deep attention has emerged as an efficient mechanism to focus the learning on the objects of interest in a wide range of applications, such as person re-identification [28], object classification [58], or medical image segmentation [51, 47]. Inspired by these advances, we introduce an attention module to learn attentive features, with the goal of enhancing the learning of the embedding space. For a given input image $x_i$, the feature extractor produces feature maps $F \in \mathbb{R}^{H \times W \times M}$, where $H \times W$ denotes the spatial dimension of the feature maps and $M$ the number of channels. The attention map produced by the attention module can then be defined as $A \in [0, 1]^{H \times W}$. The generated attention map is multiplied with each feature map, $\hat{F}_m = A \odot F_m$, where $\odot$ is the element-wise product, resulting in the set of attentive features. Last, the attentive features are combined to produce an $M$-dimensional vector by using global average pooling (GAP), which is mapped into the manifold space using a dense layer (Fig. 1).
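The attention step described above (element-wise weighting of every feature map by a single spatial attention map, followed by global average pooling) can be sketched without any deep learning framework. The helper names and the 2x2 toy maps below are hypothetical; in the actual model the attention map would come from the convolutional attention module and the pooled vector would be fed to the dense embedding layer.

```python
def apply_attention(feature_maps, attention):
    """Multiply each channel (an H x W grid) element-wise by the attention map."""
    return [
        [[value * attention[r][c] for c, value in enumerate(row)]
         for r, row in enumerate(channel)]
        for channel in feature_maps
    ]


def global_average_pool(feature_maps):
    """Average each channel over its spatial positions, giving one value per channel."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two toy channels of size 2 x 2 and one attention map in [0, 1].
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
attention_map = [[1.0, 0.0], [0.0, 1.0]]

attentive = apply_attention(features, attention_map)
vector = global_average_pool(attentive)
print(vector)
```

Positions with attention close to zero contribute almost nothing to the pooled vector, which is what pushes the embedding to rely on the attended regions.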

III-E Attention Maps for Weakly Supervised Segmentation

The attention maps obtained by our proposed method can serve as proxy pixel-level labels to train a segmentation network in a fully supervised manner. Specifically, the input image $x_i$ and its corresponding attention map $A_i$ are used as a training pair. To differentiate foreground pixels from background pixels in $A_i$, we threshold the attention maps with $\tau$ (i.e., pixels in $A_i$ greater than $\tau$ are set to 1, and to 0 otherwise) before training the segmentation network. The network is trained with binary cross-entropy as a loss function, which is computed over pixel-wise softmax probabilities, defined as:

$$ \mathcal{L}_{\mathrm{seg}} = -\frac{1}{|\Omega|} \sum_{p \in \Omega} \Bigl[ m_i(p) \log s_\phi(x_i)(p) + \bigl(1 - m_i(p)\bigr) \log\bigl(1 - s_\phi(x_i)(p)\bigr) \Bigr] \qquad (5) $$

where $s_\phi$ is a segmentation network parameterized by $\phi$, $\Omega$ is the set of pixel locations, and $m_i$ is the binary label mask of image $x_i$. Note that the learning objective that trains the segmentation network is the same in both the fully and weakly supervised scenarios. The main difference lies in the labels employed in the cross-entropy term. In particular, while the former resorts to the given segmentation masks, i.e., $m_i$ is the ground-truth mask, the latter leverages the obtained attention masks as pseudo-labels, i.e., $m_i$ is the thresholded attention map.
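As a hedged illustration of how thresholded attention maps could act as pseudo-labels, the sketch below binarizes a toy attention map and evaluates a standard pixel-wise binary cross-entropy against it. The threshold value of 0.5, the function names, and the toy predictions are assumptions for this example; the text states only that the threshold is tuned on the validation set.

```python
import math


def binarize(attention, tau):
    """Pixels with attention greater than tau become foreground (1), others 0."""
    return [[1 if a > tau else 0 for a in row] for row in attention]


def binary_cross_entropy(probabilities, labels, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between predictions and a binary mask."""
    total, count = 0.0, 0
    for p_row, y_row in zip(probabilities, labels):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy 2 x 2 attention map and two candidate network outputs.
attention_map = [[0.9, 0.2], [0.6, 0.4]]
pseudo_mask = binarize(attention_map, tau=0.5)

good_prediction = [[0.9, 0.1], [0.8, 0.2]]   # agrees with the pseudo-mask
bad_prediction = [[0.1, 0.9], [0.2, 0.8]]    # disagrees with the pseudo-mask

print(pseudo_mask)
print(binary_cross_entropy(good_prediction, pseudo_mask))
print(binary_cross_entropy(bad_prediction, pseudo_mask))
```

Swapping `pseudo_mask` for a ground-truth mask gives the fully supervised objective, which is the only difference between the two training regimes.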

(a) ISIC19 dataset
(b) MURA dataset
(c) HyperKvasir dataset
Fig. 2: Impact of number of learners in DCML [46] - Each line indicates the NMI (top) and Recall@1 (bottom) scores across the three datasets. The default loss function employed is margin loss, whereas models with a triplet loss are explicitly mentioned. Best seen in color.

IV Experiments

IV-A Experimental Setting

The performance of the proposed attention-based dynamic subspace learners (ADSL) is compared to other deep metric learning methods applied in medical imaging [55, 42, 50, 14, 66], which resort to a contrastive or triplet loss. To assess the effectiveness of the dynamic learner training strategy, we compare it with the divide-and-conquer approach (DCML) [46]. Since we use class-label information, we also compare with a classification network trained using a cross-entropy loss. For a fair evaluation, the backbone architecture and hyper-parameters are fixed across the different methods. In addition, experiments across all the models and datasets are run three times, and their average performances are reported. Note that the baselines based on triplet and contrastive losses rely on a single learner, whereas models based on the divide-and-conquer strategy and our method employ multiple learners.

To assess the performance of our approach in terms of segmentation, we benchmark the resulting attention maps against the popular GradCAM [49] obtained from the classification networks. We include the recent Attention Residual Learning (ARL) approach [70], since it has been similarly proposed in the context of skin lesion analysis. We also include a recently proposed weakly supervised segmentation method, the Embedded Discriminative Attention Mechanism (EDAM) [65], originally applied to natural images. Lastly, we include as an upper bound the results obtained by a UNet [45] trained on the provided pixel-level masks. Note that the model architecture and hyperparameters are fixed across the different methods. Nevertheless, the ARL model employs a carefully modified ResNet50 backbone with soft-attention blocks in each layer. It also uses an offline multi-scale patch extraction strategy, resulting in extra images during training. The EDAM model, in contrast, employs a collaborative multi-head attention module after the feature extraction layer to directly generate the discriminative activation masks.


The performance of the proposed method, in terms of clustering and image retrieval, is evaluated on three diverse medical imaging datasets: skin lesion from the ISIC 2019 Challenge [10, 11], musculoskeletal radiographs from the MURA dataset [44], and gastrointestinal tract images from the HyperKvasir dataset [5]. To assess the segmentation performance, we resort to the skin lesion dataset from the ISIC 2018 Challenge [57, 10].


The ISIC19 dataset consists of 25,331 images across 8 different categories. In our experiments, following the standard procedure in DML, we split the dataset into independent training and testing sets. Specifically, 20,000 images were used for training and the remaining 5,331 for testing.


The MURA dataset consists of 40,561 images from 9,045 normal and 5,818 abnormal musculoskeletal radiography studies across seven standard upper extremity types. We configure this as 14 categories (7 normal and 7 abnormal) to represent the data in a manifold. We use the provided split of 36,808 images for training and 3,197 images for testing.


The HyperKvasir dataset consists of 110,079 images, of which 10,662 images are labeled across 23 different classes of findings. We randomly split the labeled data into 8,567 images for training and the remaining 2,095 images for testing.


The ISIC18 dataset is composed of 2,594 images and their corresponding pixel-level masks. The segmentation dataset is randomly split into three sets: training (1,042), validation (520), and testing (1,038). We leverage the attention maps and GradCAMs generated on the ISIC19 dataset (25,331 images) as proxy-labels to train the segmentation networks. In contrast, the ISIC18 training set is used to train the upper-bound model, i.e., the fully supervised one.

Evaluation Metrics

We follow the evaluation protocol typically employed in deep metric learning [46, 39]. In particular, we employ the normalized mutual information (NMI) to assess the clustering performance using K-means and the Recall score (with k = 1 and 4) to evaluate the image retrieval quality. To assess the segmentation performance, we employ the common Dice score coefficient.
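For readers unfamiliar with these metrics, the following sketch shows one common way to compute Recall@k for retrieval and the Dice score for binary masks. It is a simplified, framework-free illustration: the function names and toy data are hypothetical, and the NMI computation (which additionally requires a clustering step) is left out.

```python
def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-class item among their k nearest neighbours."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank all other items by squared Euclidean distance to the query.
        neighbours = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(query, embeddings[j])),
        )[:k]
        if any(labels[j] == labels[i] for j in neighbours):
            hits += 1
    return hits / len(embeddings)


def dice_score(prediction, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    intersection = sum(p * t for p_row, t_row in zip(prediction, target)
                       for p, t in zip(p_row, t_row))
    total = sum(sum(row) for row in prediction) + sum(sum(row) for row in target)
    return 2.0 * intersection / total if total else 1.0


# Toy retrieval set: four 2-D embeddings with class labels.
toy_embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [3.0, 3.0]]
toy_labels = [0, 0, 1, 0]
print(recall_at_k(toy_embeddings, toy_labels, k=1))
print(recall_at_k(toy_embeddings, toy_labels, k=2))

# Toy segmentation masks.
print(dice_score([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

Recall@k can only increase with k, which is why the retrieval scores at k = 4 are higher than those at k = 1 for every method.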

Implementation details

As in [46], we use ResNet50 [16] as the backbone architecture. The feature extractor layers consist of the first three residual blocks of ResNet50, used as input to the attention module. The attention module consists of three convolution layers with

kernel and filters size of {128, 32, 1}, with a ReLU activation between each convolutional layer. Last, a sigmoid activation is integrated into the final layer to produce the activation map. An input image size of

is used for all our experiments. All models are trained using the Adam optimizer [23] with a batch size of 32. In each mini-batch, 8 images per class are sampled to ensure a class-balanced scenario, and models are trained for 300 epochs. The last 50 epochs are used to fine-tune the full embedding. The re-clustering parameter is set to $T_c = 2$ as in [46] and the network plateau threshold is empirically set to $T_p = 10$. The margin loss parameters are set to $\alpha = 0.2$, $\beta = 1.2$, as in [63]. Last, since most DML approaches [63, 46] employ an embedding space of size $d = 128$, we use the same latent dimension in all our experiments.

Regarding the segmentation task, we use the UNet [45] architecture with an initial kernel size of 32, two convolution layers, and a depth of 3. It is trained with the Adam optimizer with a batch size of 16. For each method, the threshold parameter $\tau$ is set to maximize the Dice score on the initial maps of the validation set (Fig. 6).

IV-B Clustering and image retrieval results

Impact of number of learners

One of the motivations of this work is to remove the need to empirically search for the optimal number of learners. To validate this hypothesis, we first study the performance of DCML [46] by varying the number of subspace learners ($K$). Figure 2 depicts the results of this experiment across the three datasets and under two different loss functions: margin and triplet loss. In these plots, it can be observed that the optimal $K$ value differs significantly across datasets and metrics. This limitation of the DCML approach results in extra time-consuming steps to fine-tune the model on each dataset. In contrast, the proposed method (dotted line) eliminates the need to manually define $K$ by dynamically exploring the manifold, yet achieves results on par with the best-performing DCML setting.

We also report the average K values obtained from our method over three runs, as well as the best K for DCML, in Table I. The table shows that K has no relation to the number of ground-truth classes: the K obtained dynamically by our method is driven by image content, not by the number of ground-truth classes, which explains their uncorrelated values.

Dataset | #classes | ADSL - Avg. K | DCML - Best K
ISIC19 | 8 | 7 | 6
MURA | 14 | 4.67 | 1
HyperKvasir | 23 | 4.33 | 2
TABLE I: Comparison of the K values obtained from our method and the DCML best K values with respect to the number of ground-truth classes.

Method | NMI | R@1 | R@4 | Avg. of NMI + R@1
Classification network | 45.41 ± 1.95 | 77.85 ± 0.86 | 90.54 ± 0.51 | 61.63 ± 1.40
Contrastive loss | 31.47 ± 0.39 | 78.13 ± 0.59 | 91.13 ± 0.08 | 54.80 ± 0.49
Triplet loss | 50.97 ± 0.61 | 79.84 ± 0.49 | 91.70 ± 0.26 | 65.41 ± 0.55
DCML (worst NMI, K = 1) | 50.53 ± 1.01 | 82.84 ± 0.39 | 91.51 ± 0.43 | 66.69 ± 0.70
DCML (best NMI, K = 6) | 55.08 ± 0.83 | 82.29 ± 0.56 | 91.73 ± 0.36 | 68.69 ± 0.70
ADSL (free from K, ours) | 55.14 ± 0.87 | 82.39 ± 0.11 | 92.11 ± 0.27 | 68.77 ± 0.49
TABLE II: Quantitative evaluation on ISIC19 test set - The NMI, Recall, and average scores from the different methods. Our method is emphasized with light gray, whereas best and second-best results are highlighted with bold and underline.

Method | NMI | R@1 | R@4 | Avg. of NMI + R@1
Classification network | 71.09 ± 1.25 | 74.21 ± 0.27 | 92.59 ± 0.40 | 72.65 ± 0.76
Contrastive loss | 74.28 ± 0.53 | 71.65 ± 0.53 | 92.07 ± 0.36 | 72.97 ± 0.53
Triplet loss | 74.41 ± 0.27 | 74.51 ± 0.78 | 92.95 ± 0.33 | 74.46 ± 0.53
DCML (worst NMI, K = 10) | 72.88 ± 0.40 | 73.55 ± 0.16 | 91.17 ± 0.19 | 73.22 ± 0.28
DCML (best NMI, K = 1) | 74.67 ± 0.35 | 75.36 ± 0.79 | 92.89 ± 0.18 | 75.02 ± 0.57
ADSL (free from K, ours) | 74.88 ± 0.09 | 75.52 ± 0.18 | 92.25 ± 0.42 | 75.20 ± 0.15
TABLE III: Quantitative evaluation on MURA test set - The NMI, Recall, and average scores from the different methods. Our method is emphasized with light gray, whereas best and second-best results are highlighted with bold and underline.

Method | NMI | R@1 | R@4 | Avg. of NMI + R@1
Classification network | 80.13 ± 2.34 | 85.66 ± 0.39 | 94.42 ± 0.39 | 82.90 ± 1.87
Contrastive loss | 83.89 ± 0.15 | 78.52 ± 0.86 | 93.44 ± 0.48 | 81.21 ± 0.51
Triplet loss | 82.24 ± 0.19 | 83.44 ± 0.34 | 93.92 ± 0.22 | 82.84 ± 0.27
DCML (worst NMI, K = 1) | 83.31 ± 0.19 | 84.79 ± 0.59 | 94.05 ± 0.26 | 84.05 ± 0.39
DCML (best NMI, K = 2) | 84.40 ± 0.52 | 85.46 ± 0.31 | 94.19 ± 0.28 | 84.93 ± 0.42
ADSL (free from K, ours) | 84.18 ± 0.12 | 85.82 ± 0.27 | 94.24 ± 0.41 | 85.00 ± 0.20
TABLE IV: Quantitative evaluation on HyperKvasir test set - The NMI, Recall, and average scores from the different methods. Our method is emphasized with light gray, whereas best and second-best results are highlighted with bold and underline.

Comparison to prior literature

We now compare our method with recent prior work as baselines, whose results are reported in Tables II-IV. As the performance of DCML varies with K, we report only its best and worst models. Note that DCML with a single learner, i.e., K = 1, is equivalent to a margin loss method [63]. We also report the performance of the embedding space learned by the classification network. From Tables II-IV, we observe that the proposed method achieves the best NMI on ISIC19 and MURA and stays within 0.3 points of the best DCML setting on HyperKvasir, while performing on par with the best setting of the DCML approach on image retrieval metrics. As shown previously, it is important to note that the performance of DCML heavily depends on the value of K. For instance, the difference between the worst and best DCML configurations in NMI score can be up to 5% on the ISIC19 dataset. Compared to single-learner approaches, our method brings 5 and 2% improvements in NMI and Recall scores on the ISIC19 dataset and up to a 1% improvement in both scores on the MURA and HyperKvasir datasets. This highlights the potential of exploring embeddings via multiple subspaces.

Furthermore, the comparison with the conventional classification network shows that our method outperforms it by up to 10% in NMI on ISIC19 and up to 4% in NMI on the MURA and HyperKvasir datasets, and by up to 4% and 1.5% in Recall on the ISIC19 and MURA datasets, respectively. The averaged NMI and R@1 results of the proposed method slightly outperform the best DCML configuration, consistently across all the datasets. The standard deviation of our method is smaller than that of the best DCML configuration for this averaged score on all datasets and for most individual metrics. Overall, our method shows better robustness than the state-of-the-art methods in learning the manifold space. The performance of our method is in line with the recent literature [2, 1].
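For readers who want to reproduce the retrieval scores discussed above, Recall@K can be computed as the fraction of queries whose K nearest neighbors contain at least one sample of the same class. The sketch below is a plain-Python illustration under stated assumptions: Euclidean distance in the embedding space, the query excluded from its own neighbor list, and a hypothetical helper name `recall_at_k`. The toy embeddings are invented for the example and are not taken from the paper.

```python
import math


def recall_at_k(embeddings, labels, k):
    """Fraction of queries whose k nearest neighbours (query excluded)
    contain at least one sample with the query's class label."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Euclidean distance from the query to every other sample.
        dists = sorted(
            (math.dist(query, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        neighbour_labels = [labels[j] for _, j in dists[:k]]
        hits += labels[i] in neighbour_labels
    return hits / len(embeddings)


# Toy 2-D embedding: two tight clusters plus one class-1 sample lying inside class 0.
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (0.2, 0.2)]
toy_labels = [0, 0, 0, 1, 1, 1]
r1 = recall_at_k(toy_embeddings, toy_labels, k=1)
r4 = recall_at_k(toy_embeddings, toy_labels, k=4)
```

In this toy case the misplaced sample is missed at K = 1 but recovered at K = 4, which mirrors why R@4 values in the tables are always higher than R@1 values.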

Dataset | Method | NMI | R@1 | R@4
ISIC19 | DSL | 54.11 | 82.74 | 91.95
ISIC19 | ADSL | 55.14 | 82.39 | 92.11
MURA | DSL | 74.21 | 75.85 | 92.26
MURA | ADSL | 74.88 | 75.52 | 92.25
HyperKvasir | DSL | 84.44 | 85.36 | 93.54
HyperKvasir | ADSL | 84.18 | 85.82 | 94.24
Average | DSL | 70.92 | 81.32 | 92.58
Average | ADSL | 71.40 | 81.24 | 92.87
TABLE V: Impact of attention module - Per-dataset and average results of the proposed model with (ADSL) and without (DSL) the attention module. Best results are highlighted with bold for each dataset as well as for the average.
Fig. 3: Impact of the embedding size - Each bar indicates the NMI (top) and Recall@1 (bottom) scores on the ISIC19 dataset. Compared to the best model of the DCML method, our method produces better NMI and Recall scores in most cases.

Ablation study on the use of attention

Adding an attention module brings additional value to our model in terms of interpretability. To assess whether this benefit is also reflected in the model performance, we compare our model to its non-attention counterpart, denoted as Dynamic Subspace Learners (DSL). Results from this study are reported in Table V, which shows that adding attention typically leads to a boost in model performance. In particular, the attentive model brings 0.5 and 0.3% improvements on average over the three datasets for the NMI and R@4 metrics, respectively, while achieving on-par results for R@1. Additionally, the attention module increases the model memory by only 5 MB (including parameters and forward and backward pass sizes) compared to its non-attention counterpart, which is arguably negligible with respect to the overall model size (607 MB) in case of deployment.
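The averaged gains quoted above follow directly from the per-dataset rows of Table V. As a quick arithmetic check, the snippet below recomputes the dataset averages and the ADSL minus DSL difference for each metric; the dictionary layout and variable names are only illustrative.

```python
# Scores copied from Table V, as (NMI, R@1, R@4) per dataset.
table_v = {
    "DSL": {
        "ISIC19": (54.11, 82.74, 91.95),
        "MURA": (74.21, 75.85, 92.26),
        "HyperKvasir": (84.44, 85.36, 93.54),
    },
    "ADSL": {
        "ISIC19": (55.14, 82.39, 92.11),
        "MURA": (74.88, 75.52, 92.25),
        "HyperKvasir": (84.18, 85.82, 94.24),
    },
}
metrics = ("NMI", "R@1", "R@4")


def dataset_average(rows):
    """Average each metric column over the datasets of one model."""
    return tuple(sum(r[m] for r in rows.values()) / len(rows) for m in range(len(metrics)))


avg = {name: dataset_average(rows) for name, rows in table_v.items()}
# Difference ADSL - DSL per metric, rounded to two decimals.
gain = {m: round(avg["ADSL"][i] - avg["DSL"][i], 2) for i, m in enumerate(metrics)}
```

The resulting differences are +0.48 for NMI, -0.07 for R@1, and +0.28 for R@4, matching the rounded figures in the text.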

Impact of the embedding size

We also evaluate the effect of representing the embedding space with different sizes. In particular, we assess the clustering and image retrieval performance on the ISIC19 dataset by fixing the embedding dimension size to 64, 128, 256, and 512. Figure 3 shows that increasing the embedding size results in a performance improvement, which is reflected in both NMI and recall metrics. Nevertheless, beyond a 256-dimension embedding, the performance of both models typically decreases.

Fig. 4: Visualization of ISIC19 test set in embedding space using t-SNE - Panels: (a) before learning, (b) classification network, (c) DCML (single learner), (d) DCML (multiple learners), (e) our method. Each class is indicated by its individual color. When compared to a standard classification network, DCML with a single learner improves the separation between classes. The multi-learner methods, DCML and our method, further improve the separation between classes, while our method has the advantage of being free from the number of learners K. Best seen in color.
Fig. 5: Performance of image retrieval on test sets - Each query image and its five nearest neighbors in ascending order of distance are shown (left to right) from the DCML (best K) and our method with an overlay of our attention maps (probability above 0.5).

Qualitative Analysis

To show the inter- and intra-class representation power in the embedding space across different models, we visualize a t-SNE mapping [34] of the ISIC19 test set (Fig. 4). The classification network fails to discover clear boundaries across classes in the embedding space (Fig. 4b). This could be because the cross-entropy loss, when coupled with softmax, does not explicitly guarantee the minimization of intra-class variance or the maximization of inter-class variance, which results in suboptimal discriminative features [32]. The single metric learner, i.e., DCML with K = 1 (Fig. 4c), improves the class boundaries when compared to the classification network, yet it fails to produce compact clusters. On the other hand, inter-class discrimination is visually enhanced when resorting to multiple learners, i.e., multi-learner DCML (Fig. 4d) and our approach (Fig. 4e). Further, we can also observe that the proposed model yields more compact clusters than the DCML approach, which might be due to the freedom of our model to explore the manifold.
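The notions of compact clusters and class separation used in this paragraph can be made concrete with a simple pair of statistics: the mean distance of samples to their class centroid (intra-class spread) and the mean distance between class centroids (inter-class separation). The sketch below is only an illustration of these two quantities; it is not a metric reported in this work, the helper name `compactness` is hypothetical, and it assumes Euclidean distance and at least two classes.

```python
import math
from collections import defaultdict
from itertools import combinations


def compactness(embeddings, labels):
    """Return (mean sample-to-centroid distance, mean centroid-to-centroid distance)."""
    groups = defaultdict(list)
    for point, label in zip(embeddings, labels):
        groups[label].append(point)

    # Class centroid = coordinate-wise mean of the class members.
    centroids = {
        label: tuple(sum(coord) / len(points) for coord in zip(*points))
        for label, points in groups.items()
    }

    intra = sum(
        math.dist(point, centroids[label]) for point, label in zip(embeddings, labels)
    ) / len(embeddings)

    pairs = list(combinations(centroids.values(), 2))
    inter = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return intra, inter


# Two toy embeddings with the same class centres but different spreads.
tight = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
loose = [(0.0, 0.0), (0.0, 4.0), (10.0, 0.0), (10.0, 4.0)]
toy_labels = [0, 0, 1, 1]
tight_intra, tight_inter = compactness(tight, toy_labels)
loose_intra, loose_inter = compactness(loose, toy_labels)
```

A lower intra-class value at equal inter-class separation corresponds to the visually tighter clusters described for Fig. 4.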

Qualitative evaluation in terms of image retrieval is shown in Fig. 5, where a random query is displayed with its five nearest neighbors, found using both DCML and our method. Additionally, we overlay the contours of the attention maps (probability above 0.5) from the proposed method on the respective retrieved images. First, our method indeed retrieves images having similar lesions and colors from the ISIC19 dataset. In the wrist radiography images, both DCML and our method have similar retrieval errors. Finally, compared to DCML, images retrieved from the HyperKvasir dataset using our method have similar image semantics in terms of texture and probe length. The coherence of image retrievals indicates that the intra- and inter-class similarities have been captured by our method and thereby demonstrates the robustness of our learned embedding. Moreover, our attention maps mainly concentrate on the lesion in the skin images, the wrist in the radiography images, and the probe contact region in the endoscopic images, demonstrating that our model's decisions are consistent over all retrievals.

IV-C Weakly Supervised Segmentation results

Table VI reports the results of the segmentation experiments. In this table, Init maps denotes the raw visual salient regions from either GradCAM or attention maps, and Refined refers to the performance of the segmentation network trained on the Init maps. First, we can observe that the segmentation results obtained by raw attention maps and GradCAMs are considerably low, with Dice values around 40%. This is likely due to the well-known fact that both are highly discriminative, resulting in over-segmented regions. Attention Residual Learning (ARL) significantly outperforms these baselines, an improvement that could be due to the use of attentive residual blocks and additional multiscale data augmentation. The attention maps from the recent Embedded Discriminative Attention Mechanism (EDAM) method perform at a similar level to ARL. Last, the attention maps from the proposed approach bring a significant boost compared to all the other methods. In particular, our model outperforms the classification-network baselines by nearly 30% and the recent ARL model by 13%. These results are typically consistent if we employ the initial maps as proxy-labels to train a segmentation network. In this case, raw attention maps or GradCAMs barely improve or even decrease the initial segmentation performance. In contrast, ARL, EDAM, and the proposed method reach higher Dice values, with increases of about 1%, 3.5%, and 3%, respectively. After refinement, the proposed method leads ARL by 15% in Dice. Moreover, by only using image-level information, the proposed model narrows the gap with a fully-supervised network to only 14%. This suggests that the proposed model generates reliable segmentations.

Method | Init maps | Refined
Attention | 38.45 | 33.43
Attention | 38.52 | 38.38
GradCAM | 41.55 | 40.76
GradCAM | 39.80 | 41.27
ARL [70] | 56.78 | 57.60
EDAM [65] | 51.99 | 55.50
ADSL (ours) | 69.23 | 72.42
Full-supervision (upperbound) | - | 86.15
TABLE VI: Performance of weakly supervised segmentation - “Init maps” and “Refined” are Dice scores (in %) on the ISIC18 test set for different methods. Our method yields the best results. The architectures used for the visual maps are ResNet50, ResNet101, and a modified ResNet50.

Ablation study of threshold on the raw visual maps

We evaluate the effect of threshold values on the Dice score for raw visual maps from attention maps and GradCAMs, as shown in Fig. 6. First, the attention maps and GradCAMs from the classification network have an almost flat Dice score of around 40% until , succeeded by a gradual decrease. The ARL and EDAM have a gradually increasing Dice score until and with a maximum score of 57.33% and 50.89%, respectively, followed by a gradual decrease. Our method outperforms the baselines for all threshold values, with a maximum Dice score of 69.0%, showing the robustness of the attention maps derived from our method. This study assists in setting a threshold value for each method before training the segmentation network.
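The threshold selection studied here amounts to binarizing each raw map at a candidate threshold, computing the Dice score against the ground-truth mask, and keeping the threshold with the highest mean Dice on the validation set. A minimal plain-Python sketch is given below; the function names, the flattened-list representation of maps, the candidate grid, and the toy data are all assumptions for illustration and do not come from the experiments.

```python
def dice(pred, target):
    """Dice coefficient between two flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def best_threshold(prob_maps, masks, thresholds):
    """Return (threshold, mean Dice) maximizing the mean Dice over a validation set."""
    best = (None, -1.0)
    for t in thresholds:
        scores = [
            dice([int(p >= t) for p in prob_map], mask)
            for prob_map, mask in zip(prob_maps, masks)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best[1]:  # strict comparison keeps the lowest best threshold
            best = (t, mean_score)
    return best


# Toy validation set: two flattened 2x3 probability maps and their ground-truth masks.
toy_maps = [[0.9, 0.8, 0.4, 0.2, 0.1, 0.6], [0.7, 0.3, 0.2, 0.9, 0.1, 0.4]]
toy_masks = [[1, 1, 0, 0, 0, 1], [1, 0, 0, 1, 0, 0]]
threshold, score = best_threshold(toy_maps, toy_masks, [i / 10 for i in range(1, 10)])
```

Running the same sweep separately for each method yields the per-method threshold used before training the segmentation network.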

Fig. 6: Threshold selection - Each line indicates the Dice scores of initial maps on the ISIC18 validation set for different methods. Our method outperforms the baselines for all threshold values. The classification-network baselines are obtained using ResNet50 and ResNet101.
Fig. 7: Visual results of segmentation - (Init maps) Saliency maps obtained by different methods and (Refined) their segmentation results. The GradCAM baselines are obtained on classification networks using ResNet50 and ResNet101.

Qualitative Performance Evaluation

Visual results of the different methods are shown in Fig. 7. In this figure, Init maps (rows 1 and 3) are raw visual salient regions from either GradCAM or attention maps shown as heatmaps, whereas Refined (rows 2 and 4) refers to the output of the segmentation network trained using the Init maps as proxy-labels. The attention maps produced by the classification network spread all over the image, capturing some discriminative regions on the target lesion. GradCAMs spread around the target, highlighting discriminative regions of the lesion but failing to capture the whole context. The saliency map produced by the ARL method is focused on the target lesion. The attention maps obtained by the recent EDAM method spread around the target lesion, including the artifact regions, and fail to capture the target object context. In contrast, the attention maps derived from our approach better capture the attentive region, which mostly covers the lesion. The results show that our proposed approach generates superior attention maps compared to attention maps or GradCAMs from classification networks.

The results obtained by training a segmentation network on the initial salient regions (rows 1 and 3) are depicted in rows 2 and 4. These images demonstrate the feasibility of using our method to generate weak pixel-level labels that are usable for training segmentation networks.

V Discussion and Conclusion

This paper presents a novel attention-based dynamic subspace metric learning approach for medical image analysis. The proposed algorithm leverages recent advances in deep metric learning using multiple metric learners. Our contribution improves the state-of-the-art method [46] with dynamic exploitation of subspace learners to learn the embedding space. Specifically, our novel training strategy removes the empirical search for the optimal number of subspace learners while achieving competitive results in clustering and image retrieval tasks. Performance is extensively evaluated on three publicly available benchmark datasets: skin lesions, musculoskeletal radiography, and endoscopic images. Results demonstrate that our dynamic learner approach achieves the best clustering performance on ISIC19 and MURA and is on par with the best DCML setting on HyperKvasir. Compared to the single-learner method, our method brings a maximum of 5 and 2% improvements in clustering and image retrieval scores on the ISIC19 dataset. Furthermore, our method significantly outperforms the classification network on all the datasets, with a maximum of 10% and 4% improvements in clustering and retrieval scores on the ISIC19 dataset. Overall, the proposed method slightly outperforms the state-of-the-art multiple metric learning methods in averaged results and has a smaller standard deviation on the averaged score. Our experiments have shown consistency across all the datasets, demonstrating the robustness of our method. Qualitative results show that the proposed method produces compact clustering and coherent image retrievals.

The addition of the attention module to our subspace learners provides visual interpretability of the learned embedding space in terms of attention maps and improves the clustering metrics. Our method offers new tools for multiple metric learner approaches, notably dynamically learning the number of learners and providing attention maps to hint at salient information caught by the learners. The clinical usability of these tools remains to be explored. Nevertheless, a recent study [2] shows that the use of a retrieval network, with a single learner, yields an improvement of 9.2% in the decision accuracy of dermatologists. Our method indeed suggests that multiple learners capture a data embedding that yields a higher accuracy in clustering and retrieval tasks over single-learner methods, while additionally offering visual saliency from our attention mechanism.

The attention maps produced by our proposed method can serve as proxy pixel-level labels to train a segmentation network. The segmentation results outperform a state-of-the-art method, Attention Residual Learning (ARL) [70], as well as the recent Embedded Discriminative Attention Mechanism (EDAM) [65], by margins of 15% and 17% in Dice scores, respectively, on the skin lesion dataset. The qualitative results show that the produced attention maps and their segmentation masks focus on the target lesion, demonstrating the effectiveness and robustness of our method. The attention maps produced by our subspace learning approach could therefore be beneficial to a broader range of weakly supervised tasks, where the feature space remains challenging to represent using a single metric model within a specific task.


This research work was partly funded by the Canada Research Chair on Shape Analysis in Medical Imaging, the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Fonds de Recherche du Quebec (FQRNT). We would like to acknowledge Compute Canada for providing computing resources used for this work.


  • [1] S. Allegretti, F. Bolelli, F. Pollastri, S. Longhitano, G. Pellacani, and C. Grana (2021) Supporting skin lesion diagnosis with content-based image retrieval. In 25th International Conference on Pattern Recognition (ICPR), pp. 8053–8060.
  • [2] C. Barata and C. Santiago (2021) Improving the explainability of skin cancer diagnosis using CBIR. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 550–559.
  • [3] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei (2016) What’s the point: semantic segmentation with point supervision. In European Conference on Computer Vision, pp. 549–565.
  • [4] S. Belharbi, J. Rony, J. Dolz, I. B. Ayed, L. McCaffrey, and E. Granger (2020) Deep interpretable classification and weakly-supervised segmentation of histology images via max-min uncertainty. arXiv preprint arXiv:2011.07221.
  • [5] H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov, M. Lux, D. T. D. Nguyen, et al. (2020) HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Scientific Data 7 (1), pp. 1–14.
  • [6] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a “Siamese” time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744.
  • [7] Y. B. Can, K. Chaitanya, B. Mustafa, L. M. Koch, E. Konukoglu, and C. F. Baumgartner (2018) Learning to segment medical images with scribble-supervision alone. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 236–244.
  • [8] L. Chen, J. Chen, H. Hajimirsadeghi, and G. Mori (2020) Adapting Grad-CAM for embedding networks. In Winter Conference on Applications of Computer Vision.
  • [9] R. Chen, H. Chen, J. Ren, G. Huang, and Q. Zhang (2019) Explaining neural networks semantically and quantitatively. In International Conference on Computer Vision, pp. 9187–9196.
  • [10] N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, et al. (2019) Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the international skin imaging collaboration (ISIC). arXiv preprint arXiv:1902.03368.
  • [11] M. Combalia, N. C. Codella, V. Rotemberg, B. Helba, V. Vilaplana, O. Reiter, A. C. Halpern, S. Puig, and J. Malvehy (2019) BCN20000: dermoscopic lesions in the wild. arXiv preprint arXiv:1908.02288.
  • [12] F. Dubost, H. Adams, P. Yilmaz, G. Bortsova, G. van Tulder, M. A. Ikram, W. Niessen, M. W. Vernooij, and M. de Bruijne (2020) Weakly supervised object detection with 2d and 3d regression neural networks. Medical Image Analysis 65, pp. 101767.
  • [13] X. Feng, J. Yang, A. F. Laine, and E. D. Angelini (2017) Discriminative localization in CNNs for weakly-supervised segmentation of pulmonary nodules. In Medical Image Computing and Computer-Assisted Intervention, pp. 568–576.
  • [14] K. Gupta, D. Thapar, A. Bhavsar, and A. K. Sao (2019) Deep metric learning for identification of mitotic patterns of hep-2 cell images. In Computer Vision and Pattern Recognition Workshops.
  • [15] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pp. 770–778.
  • [17] X. He, Y. Zhou, Z. Zhou, S. Bai, and X. Bai (2018) Triplet-center loss for multi-view 3D object retrieval. In Computer Vision and Pattern Recognition, pp. 1945–1954.
  • [18] B. Hu, B. Vasu, and A. Hoogs (2022) X-mir: explainable medical image retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 440–450.
  • [19] Z. Jia, X. Huang, I. Eric, C. Chang, and Y. Xu (2017) Constrained deep weak supervision for histopathology image segmentation. Transactions on Medical Imaging 36 (11), pp. 2376–2388.
  • [20] M. Kaya and H. Ş. Bilge (2019) Deep metric learning: a survey. Symmetry 11 (9), pp. 1066.
  • [21] H. Kervadec, J. Dolz, M. Tang, E. Granger, Y. Boykov, and I. B. Ayed (2019) Constrained-CNN losses for weakly supervised segmentation. Medical Image Analysis 54, pp. 88–99.
  • [22] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based ensemble for deep metric learning. In European Conference on Computer Vision, pp. 736–751.
  • [23] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Vol. 5.
  • [24] P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In International Conference on Machine Learning, Vol. 70, pp. 1885–1894.
  • [25] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected CRFs with gaussian edge potentials. In Advances in Neural Information Processing Systems, pp. 109–117.
  • [26] B. Kulis et al. (2012) Metric learning: a survey. Foundations and trends in machine learning 5 (4), pp. 287–364.
  • [27] S. Kwak, S. Hong, B. Han, et al. (2017) Weakly supervised semantic segmentation using superpixel pooling network. In Association for the Advancement of Artificial Intelligence, pp. 4111–4117.
  • [28] W. Li, X. Zhu, and S. Gong (2018) Harmonious attention network for person re-identification. In Computer Vision and Pattern Recognition, pp. 2285–2294.
  • [29] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In Computer Vision and Pattern Recognition, pp. 2197–2206.
  • [30] D. Lin, J. Dai, J. Jia, K. He, and J. Sun (2016) Scribblesup: scribble-supervised convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, pp. 3159–3167.
  • [31] N. Liu, Q. Tan, Y. Li, H. Yang, J. Zhou, and X. Hu (2019) Is a single vector enough? exploring node polysemy for network embedding. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 932–940.
  • [32] W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In ICML, Vol. 2, pp. 7.
  • [33] J. Ma, C. Zhou, P. Cui, H. Yang, and W. Zhu (2019) Learning disentangled representations for recommendation. In Advances in Neural Information Processing Systems.
  • [34] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of machine learning research 9 (Nov), pp. 2579–2605.
  • [35] Q. Meng, M. Sinclair, V. Zimmer, B. Hou, M. Rajchl, N. Toussaint, O. Oktay, J. Schlemper, A. Gomez, J. Housden, et al. (2019) Weakly supervised estimation of shadow confidence maps in fetal ultrasound imaging. Transactions on Medical Imaging 38 (12), pp. 2755–2767.
  • [36] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2016) Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.
  • [37] A. Y. Movshovitz, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In International Conference on Computer Vision, pp. 360–368.
  • [38] H. Nguyen, A. Pica, J. Hrbacek, D. C. Weber, F. La Rosa, A. Schalenbourg, R. Sznitman, and M. B. Cuadra (2019) A novel segmentation framework for uveal melanoma in magnetic resonance imaging based on class activation maps. In Medical Imaging with Deep Learning, pp. 370–379.
  • [39] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Computer Vision and Pattern Recognition, pp. 4004–4012.
  • [40] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2017) Bier-boosting independent embeddings robustly. In International Conference on Computer Vision, pp. 5189–5198.
  • [41] G. Papandreou, L. Chen, K. Murphy, and A. L. Yuille (2015) Weakly-and semi-supervised learning of a DCNN for semantic image segmentation. In International Conference on Computer Vision.
  • [42] P. Pati, A. Foncubierta-Rodríguez, O. Goksel, and M. Gabrani (2020) Reducing annotation effort in digital pathology: a co-representation learning framework for classification tasks. Medical Image Analysis, pp. 101859.
  • [43] M. Rajchl, M. C. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz, et al. (2016) Deepcut: object segmentation from bounding box annotations using convolutional neural networks. Transactions on Medical Imaging 36 (2), pp. 674–683.
  • [44] P. Rajpurkar, J. Irvin, A. Bagul, D. Ding, T. Duan, H. Mehta, B. Yang, K. Zhu, D. Laird, R. L. Ball, et al. (2017) MURA: large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957.
  • [45] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
  • [46] A. Sanakoyeu, V. Tschernezki, U. Buchler, and B. Ommer (2019) Divide and conquer the embedding space for metric learning. In Computer Vision and Pattern Recognition, pp. 471–480.
  • [47] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert (2019) Attention gated networks: learning to leverage salient regions in medical images. Medical image analysis 53, pp. 197–207.
  • [48] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Computer Vision and Pattern Recognition, pp. 815–823.
  • [49] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision, pp. 618–626. Cited by: §I, §II-C, §IV-A.
  • [50] M. Sikaroudi, A. Safarpoor, B. Ghojogh, S. Shafiei, M. Crowley, and H. R. Tizhoosh (2020) Supervision and source domain impact on representation learning: a histopathology case study. In Engineering in Medicine & Biology Society, pp. 1400–1403. Cited by: §II-B, §IV-A.
  • [51] A. Sinha and J. Dolz (2020) Multi-scale self-guided attention for medical image segmentation. Journal of Biomedical and Health Informatics. Cited by: §III-D.
  • [52] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §II-A.
  • [53] K. Sohn (2016) Improved deep metric learning with multi-class N-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §I.
  • [54] A. Stylianou, R. Souvenir, and R. Pless (2019) Visualizing deep similarity networks. In Winter Conference on Applications of Computer Vision, pp. 2029–2037. Cited by: §I.
  • [55] E. W. Teh and G. W. Taylor (2019) Metric learning for patch classification in digital pathology. Cited by: §II-B, §IV-A.
  • [56] E. W. Teh and G. W. Taylor (2020) Learning with less data via weakly labeled patch classification in digital pathology. In International Symposium on Biomedical Imaging, pp. 471–475. Cited by: §II-B.
  • [57] P. Tschandl, C. Rosendahl, and H. Kittler (2018) The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data 5, pp. 180161. Cited by: §I, §IV-A.
  • [58] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §III-D.
  • [59] J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin (2017) Deep metric learning with angular loss. In International Conference on Computer Vision, pp. 2593–2601. Cited by: §I.
  • [60] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu (2014) Learning fine-grained image similarity with deep ranking. In Computer Vision and Pattern Recognition, pp. 1386–1393. Cited by: §I.
  • [61] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. In Computer Vision and Pattern Recognition, pp. 5022–5030. Cited by: §I.
  • [62] K. Q. Weinberger, J. Blitzer, and L. K. Saul (2006) Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 1473–1480. Cited by: §I, §II-A.
  • [63] C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In International Conference on Computer Vision, pp. 2840–2848. Cited by: §I, §III-B, §III-C, §IV-A, §IV-B.
  • [64] K. Wu, B. Du, M. Luo, H. Wen, Y. Shen, and J. Feng (2019) Weakly supervised brain lesion segmentation via attentional representation learning. In Medical Image Computing and Computer-Assisted Intervention, pp. 211–219. Cited by: §II-C.
  • [65] T. Wu, J. Huang, G. Gao, X. Wei, X. Wei, X. Luo, and C. H. Liu (2021) Embedded discriminative attention mechanism for weakly supervised semantic segmentation. In Computer Vision and Pattern Recognition, pp. 16765–16774. Cited by: §IV-A, TABLE VI, §V.
  • [66] K. Yan, X. Wang, L. Lu, L. Zhang, A. P. Harrison, M. Bagheri, and R. M. Summers (2018) Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In Computer Vision and Pattern Recognition, pp. 9261–9270. Cited by: §II-B, §IV-A.
  • [67] L. Yang, R. Jin, L. Mummert, R. Sukthankar, A. Goode, B. Zheng, S. C. Hoi, and M. Satyanarayanan (2008) A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (1), pp. 30–44. Cited by: §II-B.
  • [68] P. Yang, Y. Zhai, L. Li, H. Lv, J. Wang, C. Zhu, and R. Jiang (2019) Liver histopathological image retrieval based on deep metric learning. In Bioinformatics and Biomedicine, pp. 914–919. Cited by: §II-B.
  • [69] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Cited by: §I.
  • [70] J. Zhang, Y. Xie, Y. Xia, and C. Shen (2019) Attention residual learning for skin lesion classification. IEEE Transactions on Medical Imaging 38 (9), pp. 2092–2103. Cited by: §I, §IV-A, TABLE VI, §V.
  • [71] Z. Zhang and V. Saligrama (2016) Zero-shot learning via joint latent similarity embedding. In Computer Vision and Pattern Recognition, pp. 6034–6042. Cited by: §I.
  • [72] M. Zheng, S. Karanam, Z. Wu, and R. J. Radke (2019) Re-identification with consistent attentive siamese networks. In Computer Vision and Pattern Recognition, pp. 5735–5744. Cited by: §I.
  • [73] S. Zheng, Y. Song, T. Leung, and I. Goodfellow (2016) Improving the robustness of deep neural networks via stability training. In Computer Vision and Pattern Recognition, pp. 4480–4488. Cited by: §I.
  • [74] A. Zhong, X. Li, D. Wu, H. Ren, K. Kim, Y. Kim, V. Buch, N. Neumark, B. Bizzo, W. Y. Tak, et al. (2021) Deep metric learning-based image retrieval system for chest radiograph and its clinical applications in COVID-19. Medical Image Analysis. Cited by: §II-B.
  • [75] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition, pp. 2921–2929. Cited by: §II-C.
  • [76] S. Zhu, T. Yang, and C. Chen (2021) Visual explanation for deep metric learning. IEEE Transactions on Image Processing 30, pp. 7593–7607. Cited by: §I.
  • [77] I. Ziko, E. Granger, and I. Ben Ayed (2018) Scalable Laplacian K-modes. In Advances in Neural Information Processing Systems, pp. 10041–10051. Cited by: §I.