Breast cancer is the most common type of cancer and the second leading cause of death in women. Nearly 40 million mammography exams are performed every year in the US alone. Screening mammograms (MG) are the first line of imaging for the early detection of breast cancer; early detection raises the survival rate but places a massive workload on radiologists. Although mammography provides a high-resolution image, its analysis remains challenging due to tissue overlaps, the high variability between individual breast patterns, subtle malignant findings (often less than 0.1% of the image area) and the high similarity between benign and malignant lesions.
Suspicious lesions are often difficult to detect and classify, even for expert radiologists. A broad range of traditional machine learning classifiers has been developed for automatic diagnosis of specific findings such as masses and calcifications, and ultimately breast cancer [2, 3]. The diagnosis in mammograms is often dictated by the type of lesion found. A naive approach suggests training a detector from local (often referred to as instance) annotations [4, 5], and then classifying the image according to the most severe finding in the image (max operation or global max pooling). However, in such a supervised setting, training requires bounding-box annotations for every single abnormality. This setting is tedious, costly and impractical for large data sets. The problem is exacerbated in mammograms, which can contain tens or hundreds of micro-calcifications spread throughout the breast. Manual annotation further increases the chance of inconsistent labeling due to a lack of consensus between radiologists, caused by ambiguous lesion boundaries. This problem is often addressed by using multiple annotators, which further escalates the workload.
In the weakly supervised paradigm, only global image-level tags are provided to train a classifier. Global image labels are easily available from retrospective clinical records, often without the need for further clinician intervention. Weakly supervised methods that also localize abnormalities provide high value. In an era of growing demand for explainable AI (XAI), localization can shed light on the model's reasoning for the image classification and help gain trust among practitioners in the field. Weakly supervised methods with localization further offer a unique value in scenarios where the source of discrimination between the classes is unknown a priori.
Semi-supervised models, however, allow a subset of the data to carry local annotations in addition to the global label. In this work, we explore the problem of weakly supervised and semi-supervised classification and detection in mammograms, as illustrated in Fig. 1. The training set may be weakly labeled (i.e., without local annotations) or may include local annotations for a subset of the mammograms. Lesions can be relatively small with respect to the whole image and occluded by the parenchymal tissue.
We address the acute problem of annotation and suggest a new method that can be trained on weakly labeled data with or without additional instance labels. Our model is capable of localizing the lesions at test time (i.e., performing detection) in full resolution. Our extension to a semi-supervised method makes use of a small set of mammograms with local annotations to further boost performance.
In this study we first explore a new data-driven strategy for weakly supervised learning with a novel dual-branch deep CNN model. As abnormalities in mammograms have local characteristics, our model is based on analyzing regions in the breast MG, which defines a region-based setting. Although in the weakly supervised setting there are no labels for regions, in such an approach the deep learning model learns to classify the whole mammogram by detecting regions that discriminate between the classes. A unique objective function allows handling the inherent imbalance in the region classes, as the abnormalities are rare among healthy/normal tissues (e.g. ).
Our architecture is composed of two streams, one for classification and the other for detection. The classification branch scores the association of each region with a certain class, e.g. benign or malignant, with a newly added normal class representing healthy tissues. In the detection branch, however, the scores of all regions are ranked relative to one another for each class (resulting in a distribution over regions per class). Hence, the classification branch predicts which class to assign to a region, whereas the detection branch selects which regions are more likely to contain a finding. The image class probability is then obtained by aggregating the detection and classification abnormality probabilities over all regions in the image. The final abnormality probability thus increases when a suspicious finding is contained in a certain region, similar to a radiologist's inspection workflow.
The contributions of this work are as follows:
1. We suggest a novel deep learning scheme for multi-class classification of mammograms and detection of abnormalities in a weakly supervised setting.
2. Based on the weakly supervised scheme, we propose a novel classification and detection model that leverages locally annotated lesions in part of the data set, in the so-called semi-supervised detection setting.
3. We present a novel fully supervised loss on the detection branch for semi-supervised learning.
4. We present an ablation study and an in-depth analysis showing the contribution of the different parts of our model as well as the cost-effectiveness of local annotations in MG for boosting performance.
We validate our method on a large FFDM dataset of nearly 3,000 mammograms as well as on the public INbreast dataset. Direct comparison of our method to previous works [8, 10] and an ablation study show the superior performance of the suggested model in classification and, in particular, in detection.
II Related Work
Deep learning methods promise a breakthrough in assisting breast radiologists with early cancer detection in mammograms. Yet, the bottleneck for supervised methods in Big Data is the annotation workload, which often requires expert clinicians to delineate many findings in mammograms. Weakly supervised and semi-supervised methods promise an affordable way out of this tangle.
Weakly supervised detection: This category of deep learning methods has attracted growing interest since the publication of the paper “Is localization for free?”, which addressed the tedious task of local annotations in images [10, 12]. Consequently, recent studies and challenges in mammography, with vast data sets (of over 0.5 million mammograms), opted for weakly labeled data [13, 14].
In general there are two main approaches for weakly supervised learning, known as image and region based. In image based methods based on CNN [15, 16], the input to the model is the whole image. Region inference is then obtained from feature maps after pooling at the final CNN layer (often generating a heat map). In the region based methods e.g. [8, 17], the image is first decomposed into regions. The convolutional layers then process each region separately. Subsequent layers then classify the regions and aggregate results to a global class level.
In medical imaging,  proposed an image based method for mammogram classification based on Multi-Instance Learning (MIL) that classifies large tiles of the image by max-pooling over feature maps, with sparsity soft constraints. However, using down-sampled images, their method yielded detection maps with extremely low resolution of just 66 pixels, strongly limiting a practical use.
Hwang et al.  also took an image based approach, using a CNN with two whole-image classification branches that shared convolution layers. One branch used fully connected layers, and the second branch used convolution layers, resulting in a map per class followed by a global max pooling on each map. Their method yielded a low AUROC of 0.65 over 332 MIAS mammograms.
Both  and  address a binary classification task with a small test set of 410 full-field digital mammography (FFDM) mammograms  or using non-FFDM (digitally scanned) images of the MIAS cohort instead.  however proposed a region based method for a different use-case of discriminating between local anatomies in CT scans, using MIL in a DNN setting.
Region based approaches were previously shown at e.g. , using the MIL paradigm to classify the entire mammogram according to the max-probability region, providing also detection in full resolution.
Recent studies suggest that applying explicit data-driven detection in parallel to classification yields improved performance . In this study we follow this approach, but differ from and generalize the existing method in two main respects: 1) We do not use any unsupervised region proposal in our scheme, as it is commonly unavailable in MG. 2) We change the architecture in  and extend the region classification stream with an additional class, without any detection counterpart, in order to handle images lacking any findings (objects), to reduce false positives resulting from normal regions in detection, and to use the network in a semi-supervised setting. Handling radiology images assessed as normal is equivalent to handling images without any objects in natural images. The addition of the normal class changes the probability distribution for regions and allows improved classification of these specific and prevalent normal cases in many medical use cases such as screening mammography. Similar to , we further connect the branches by adding the information from the classification branch to guide the detection to the most relevant regions. Our model is capable of multi-class classification and detection that provides localization of the abnormalities in full resolution. As such, we compare our method with the previous approach of  and an approach based on . We show improved performance in both classification and detection.
Semi supervised detection: These methods deal with the fusion of weak labels and a subset of data with local annotations, namely fully labeled (also known as strongly labeled) data.  recently proposed a method for training Fast RCNN  via Expectation-Maximization (EM). Focusing on the detection problem, they treated instance-level labels as missing data for weakly annotated images. Their method alternated between two steps: 1) E-step: estimating a probability distribution over all possible latent locations, and 2) M-step: updating a CNN using the estimated locations from the last E-step. In the M-step, they optimize the sum of a region-level likelihood function from the fully supervised images and the estimation from the E-step. Their method was applied on non-medical (natural) images, and in practice, the quality of the final solution depended heavily on the initialization by another method (, which we compare our method with). Furthermore, their approach required thousands of Fast RCNN training iterations at each M-step, which is computationally expensive, particularly for large images such as mammograms.
In , Cinbis et al. suggest an MIL approach for weakly supervised detection. They proposed to extend their method to a semi-supervised setting by replacing the top region selection obtained from MIL with the ground-truth regions when training from fully supervised images.
A semi-supervised approach was further used for semantic image segmentation in .
In the medical domain, an approach based on Faster RCNN  was applied to breast ultrasound (US) images by , who also proposed semi-supervised training, but based on a combination of Faster RCNN  and MIL. Yet, in breast US, only fields of view with suspicious masses were considered (not calcifications or images without any abnormality). Unlike in mammograms, a lesion in US occupies a relatively large area of the image. Mammography therefore poses a greater challenge, as it involves more types of lesions with a significantly weaker signature.
Another line of studies first uses a large data set of fully labeled data with lesion annotations to train a region based classifier. Then, at a subsequent stage, the model is modified for whole-image input (usually decomposed into regions) and fine-tuned on the weakly labeled data to create a weakly labeled classifier [5, 24, 4]. However, these methods rely strongly on local annotations and need a sufficiently large fully labeled data set to initialize the model. They are unable to train purely on weakly labeled mammograms and often lack detection capability (except  with detection based on instance labels). In our approach the local annotations are used as auxiliary data, and our model can be trained with a small fully annotated data set, relying mostly on weak labels. Given the annotation cost in many medical domains, we believe that this approach offers a competitive edge.
Our semi-supervised setting can boost performance by incorporating a small set of local annotations. By local annotation we mean the delineation of key findings responsible for the image class (such as malignant tumors or calcifications). We further study the cost-effectiveness of local annotations towards a desired performance. It should be noted that our dual-branch approach is different from  and  in modeling. Our semi-supervised method differs from previous methods by having a ranking branch in the architecture. Previous methods [23, 18] added a fully supervised objective function on the region classification for the fully supervised subset of images. In our method, we add a fully supervised objective function on the region classification and, in addition, a fully supervised objective function on the region probabilities of the detection (ranking) branch for the fully supervised subset of images. Our approach is insensitive to image size and to the number of regions, thus eliminating the need to warp the image to a fixed size, which often distorts the lesion appearance and can bias the final decision.
III A Dual Branch Weakly Supervised Detection Methodology for Mammograms
Generally, radiologists divide regions in mammograms into three classes: 1. Normal tissues, which can be referred to as background. 2. Lesions appearing to be benign. 3. Malignant findings. In this section we propose a deep network architecture that classifies mammogram regions into three different classes (normal tissue, benign and malignant findings) using labels at the image level (also known as weak labels). We first decompose the image into regions that are fed into the network. The network has two branches: a classification branch that computes local probabilities of malignant, benign and normal for each region, and a detection branch that ranks regions relative to one another for the malignant class and, independently, for the benign class. The branches are then combined at a subsequent layer to obtain an image-level decision on the presence of malignant and/or benign findings. The proposed weakly supervised network architecture is described in Fig. 2, and the algorithm is summarized in Table I.
Region extraction. Given a mammography image, we first perform pre-processing to compute feature representations for regions within the breast. To this end, we used a sliding window of overlapping regions (with stride) within the breast region excluding the axilla (using a method similar to ).
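The sliding-window step can be illustrated with a minimal sketch. The window size, stride and the minimum fraction of breast pixels per window below are hypothetical values chosen for illustration; the text does not state them, and the breast mask is assumed to be given (with the axilla already excluded).

```python
import numpy as np

def extract_regions(image, mask, size=224, stride=112, min_cover=0.5):
    """Return (top, left) corners of overlapping windows lying mostly inside the mask.

    size, stride and min_cover are illustrative assumptions, not values from the paper.
    """
    corners = []
    height, width = image.shape
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            window = mask[top:top + size, left:left + size]
            # Keep the window only if enough of it falls inside the breast.
            if window.mean() >= min_cover:
                corners.append((top, left))
    return corners

# Toy example: the "breast" occupies the left half of a 448 x 448 image.
image = np.zeros((448, 448))
mask = np.zeros((448, 448), dtype=bool)
mask[:, :224] = True
corners = extract_regions(image, mask)
```

Each retained window would then be passed through the pre-trained network to obtain its feature vector.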
Due to the relatively small training dataset, we employ a two-stage deep neural network architecture. In the first stage, we apply a transfer learning approach by using the pre-trained VGG128 network, trained on the ImageNet dataset. In our model, we extract CNN codes from the last hidden layer as 128D feature vectors per region. Then, we process each region separately by a fully connected (FC) layer. Formally, an image $x$ is first decomposed into $m$ regions denoted by $r_1, \dots, r_m$ such that $r_i$ is the feature vector representation of the $i$-th region.
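We first compute a local decision for each region separately. Each region is classified, in this study, as normal (N), benign (B) or malignant (M) using a softmax layer:

$$p_{cls}(c \mid r_i) = \frac{\exp(w_c^\top r_i + a_c)}{\sum_{c' \in \{N, B, M\}} \exp(w_{c'}^\top r_i + a_{c'})}, \quad c \in \{N, B, M\}, \tag{1}$$

such that $r_i$ is the feature vector of the $i$-th region and $w_c$ and $a_c$ are the parameters of the classifier. Note that the same classification parameters are used across all the regions in the image.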
Detection branch. In parallel, we compute the relevance of each region for the global image-level decision. We perform a distinct detection process for each type of abnormality: one for malignant regions and one for benign regions. The normal class has a different characteristic. Normal regions are prevalent in all types of mammograms, similar to “background” in natural images. Therefore, the normal class is not associated with a detection scheme (see Fig. 2). This is a novel extension of the previous modeling in , where the image-level class set and the region-level class set are the same and are used in both branches. The detection result is a distribution over all the regions in the image, implemented by a softmax operation. Formally, let $h_c \in \{1, \dots, m\}$ be a hidden random variable representing the localization of class-$c$ findings in the image, with $c \in \{B, M\}$. Then, given an image $x$ with region feature vectors $r_1, \dots, r_m$, the probability of region $i$ in the distribution is:

$$p_{det}(h_c = i \mid x) = \frac{\exp(u_c^\top r_i)}{\sum_{j=1}^{m} \exp(u_c^\top r_j)}, \quad c \in \{B, M\}, \tag{2}$$

such that $u_B$ and $u_M$ are the parameter-sets of the benign and malignant detectors, respectively. Note that $p_{det}(h_c = i \mid x)$ is equivalent to a ranking of the $i$-th region in image $x$ relative to the other regions in $x$ for class $c$.
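Image level decision. Given the region-level classification results and the region detection distribution, we can now evaluate the image-level classification. Let $y = (y_M, y_B)$ be a binary tuple indicating whether an image contains a malignant and/or benign finding, respectively. Note that this type of tuple labeling allows tagging images of class N by $(0, 0)$ and those with both M and B findings by $(1, 1)$. The posterior distributions of $y_M$ and $y_B$ given a mammogram image $x$ are obtained as a weighted average of the local (i.e. region-level) decisions:

$$p(y_c = 1 \mid x) = \sum_{i=1}^{m} p_{det}(h_c = i \mid x)\, p_{cls}(c \mid r_i), \quad c \in \{M, B\}, \tag{3}$$

where $p_{cls}(c \mid r_i)$ is the classification probability of the $i$-th region and $p_{det}(h_c = i \mid x)$ is its weight in the detection distribution of class $c$.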
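As a rough illustration of how the two branches could combine into an image-level decision, the following NumPy sketch applies a softmax over classes (classification) and a softmax over regions (detection), then takes the detection-weighted average of the region classification probabilities. The linear parameterization, the names `W_cls`, `a_cls`, `U_det` and the column ordering are assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dual_branch_forward(R, W_cls, a_cls, U_det):
    """R: (m, d) region features. Returns region-level and image-level probabilities."""
    # Classification branch: distribution over classes for each region.
    p_cls = softmax(R @ W_cls + a_cls, axis=1)   # (m, 3), columns assumed [M, B, N]
    # Detection branch: distribution over regions for each abnormal class.
    p_det = softmax(R @ U_det, axis=0)           # (m, 2), columns assumed [M, B]
    # Image level: detection-weighted average of the region classifications.
    p_img = (p_det * p_cls[:, :2]).sum(axis=0)   # (2,), i.e. p(y_M=1|x), p(y_B=1|x)
    return p_cls, p_det, p_img

rng = np.random.default_rng(0)
m, d = 200, 128                                  # ~200 regions, 128-D features
R = rng.normal(size=(m, d))
p_cls, p_det, p_img = dual_branch_forward(
    R, rng.normal(size=(d, 3)) * 0.01, np.zeros(3), rng.normal(size=(d, 2)) * 0.01)
```

Because the detection weights sum to one per class, each image-level probability is bounded by the largest region-level probability of that class; the normal class enters only through the classification softmax.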
Comparison to previous dual-branch approach. Since in many medical applications, such as mammography, the most prevalent cases are normal without any finding, we extended the method in  by adding a normal (N) class in the classification branch. Note that in our new scheme the normal class is only added to the classification branch and not to the detection branch or to the image-level class set (see Fig. 2). This is a novel generalization of the previous modeling in . As opposed to , by allowing regions to be classified as normal, we can handle “clean” images without any finding. Normal images in our model are then discriminated by having low probability for both M and B findings. The probability for an image to be normal can then be obtained via the joint probability $p(y_M = 0, y_B = 0 \mid x)$.
This extension is also important for decreasing the false positives in detection (localization) resulting from normal regions, as shown in Sec. V, since normal regions gain a high probability for the local class N and low probabilities for B and M (instead of the expected uniform probabilities over B and M when class N is not used).
In addition, this architecture modification allows the use of a fully supervised loss on the classification branch for the extension to the semi-supervised detection setting in Sec. IV.
Region selection. So far the detection branch’s decision has been based solely on the features extracted from the image regions. It makes sense to use the classification results to guide the detection process. For example, if a region is clearly classified as malignant, it is likely that the malignancy detection will favor this region. Since the classification branch includes an additional class for normal regions, the suspicious regions in the B and M classes can be used to guide the detection branch and create a soft alignment between the branches. We formalize this intuition by a region selection step. Let $p_{cls}(M \mid r_1), \dots, p_{cls}(M \mid r_m)$ be the region probabilities of being classified as malignant. In the malignant detection process, we only consider the $k$ regions with the highest probability of being classified as malignant and only apply the softmax operation on these selected regions. Let $s_i^M \in \{0, 1\}$ be a binary value indicating whether region $i$ has been selected for the malignancy detection process. We can apply the same selection criterion for the benign detector. Thus, each detector’s ranking is performed solely on the relevant regions according to the classification branch. In the modified detection branch we replace the softmax over regions by a masked softmax:

$$p_{det}(h_c = i \mid x) = \frac{s_i^c \exp(u_c^\top r_i)}{\sum_{j=1}^{m} s_j^c \exp(u_c^\top r_j)}, \quad c \in \{B, M\},$$

where $r_i$ is the feature vector of region $i$ and $u_c$ denotes the parameters of the class-$c$ detector.
This paradigm guides the M detector to focus on the most probable malignant regions in malignant mammograms. Yet, if the image is normal or contains a benign finding, the model will concentrate on the regions that were most probably, and erroneously, classified as malignant (hard negatives). This process, which is similarly applied for the benign class, is equivalent to hard negative mining in natural images and reflects the difficulty often faced by radiologists when examining a mammogram. In the experimental section we compare network architectures with and without masked detectors and show that applying region selection yields superior performance.
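The region selection step can be sketched as a top-k mask followed by a softmax restricted to the selected regions. The function below is an illustrative assumption about how such a masked softmax might be computed; `k` and the toy scores are arbitrary.

```python
import numpy as np

def masked_softmax(det_scores, cls_prob, k):
    """Softmax over the k regions with the highest classification probability.

    det_scores: (m,) raw detection scores of one class (e.g. malignant).
    cls_prob:   (m,) region probabilities of that class from the classification branch.
    Returns the masked detection distribution and the boolean selection mask.
    """
    selected = np.zeros(det_scores.shape, dtype=bool)
    selected[np.argsort(cls_prob)[-k:]] = True       # top-k regions by class probability
    z = np.where(selected, det_scores, -np.inf)      # unselected regions get zero mass
    z = z - z[selected].max()                        # stabilize before exponentiation
    e = np.exp(z)
    return e / e.sum(), selected

det_scores = np.array([2.0, 0.0, 1.0, 3.0])
cls_prob = np.array([0.1, 0.7, 0.6, 0.2])
p_det, selected = masked_softmax(det_scores, cls_prob, k=2)
```

In this toy case the two regions with the highest detection scores are masked out because the classifier considers them unlikely to be malignant, so the detector ranks only the remaining two.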
Training. Assume we are given a set of $T$ weakly labeled mammography images $x_1, \dots, x_T$. Each image $x_t$ consists of $m_t$ regions and is associated with a binary tuple label $(y_M^t, y_B^t)$ that indicates whether the image contains at least one malignant and/or one benign finding, respectively. A normal case will have the label $(0, 0)$, whereas a mammogram with both M and B findings will be labeled $(1, 1)$. The network provides soft decisions for each image regarding the values of $y_M$ and $y_B$. The objective function that we maximize in the network training step is the following likelihood function:

$$L(\theta) = \sum_{t=1}^{T} \left[ \log p(y_M^t \mid x_t; \theta) + \log p(y_B^t \mid x_t; \theta) \right],$$

such that $\theta$ is the parameter-set of the model and the probability $p(y_c \mid x; \theta)$ is defined in Eq. (3).
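A minimal sketch of this weakly supervised objective, assuming it takes the form of a sum of Bernoulli log-likelihoods over the malignant and benign image-level indicators (the clipping constant `eps` is an implementation detail added here for numerical safety):

```python
import numpy as np

def weak_log_likelihood(p_img, y, eps=1e-7):
    """Log-likelihood of binary (y_M, y_B) tuples under image-level probabilities.

    p_img: (T, 2) predicted p(y_M=1|x_t), p(y_B=1|x_t) for each image.
    y:     (T, 2) binary tuple labels; (0, 0) marks a normal image.
    """
    p = np.clip(np.asarray(p_img, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# One malignant-only image and one normal image.
p_img = [[0.9, 0.2], [0.1, 0.1]]
y = [[1, 0], [0, 0]]
ll = weak_log_likelihood(p_img, y)
```

Maximizing this quantity (or minimizing its negative with a gradient-based optimizer) requires no region-level labels, only the image-level tuples.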
IV Semi Supervised Detection Methodology
IV-A Approach overview
In this section, we extend our weakly supervised setting to a novel semi-supervised approach. In the semi-supervised setting, we assume that part of the weakly labeled data has undergone local annotation, generating a subset of fully labeled data. This local annotation can take the form of contours around lesions or just bounding boxes. We demonstrate our model on M vs. B ∪ N. Motivated by the reduction of annotation workload, let us assume that the malignant class has a fully labeled subset in which only the malignant findings are locally annotated (note that malignant images can still include benign findings).
We use different ratios of local annotations in the malignant class (25%-100%) to show the impact of these annotations on the performance. Due to the rarity of malignant findings with respect to benign ones, the annotated set only captures 2.5%-10% of all the lesions in the cohort, and therefore demands a low annotation workload.
IV-B Semi-supervised detection objective function
Although local annotations on a large scale are commonly out of reach, in this section we study the effect of combining a small set of locally annotated data with a large set of weakly labeled data. We assume that the training set contains two distinct sets, one with weakly and one with fully labeled images. We denote by $W$ the set of indices of the weakly-labeled images (these can be malignant, benign or normal) and by $F$ the set of indices of the fully-labeled images; namely, mammograms where lesions have been locally annotated. For each fully labeled image $x_t$, $t \in F$, we are given a set of malignant regions. We next describe how we transform the pixel-level information (i.e. contour annotations) into region-level labels based on the intersection between our extracted regions and the malignant lesion. To this end, we define a soft version of Intersection over Union (IoU) called the Intersection over Minimum (IoM). This measure computes the ratio between the area of the intersection and the smaller of the areas of the $i$-th region $r_i$ and the lesion:

$$\mathrm{IoM}(r_i, A) = \frac{|r_i \cap A|}{\min(|r_i|, |A|)},$$

where $A$ is the annotated domain. In our setting the region size is fixed and the lesion scale can vary by a factor of 10. This definition therefore allows a positive region to cover a small lesion or, alternatively, to be located within a large finding. We define the local label of a region as malignant ($M$) if its IoM with a ground-truth (GT) malignant finding reaches a fixed threshold $\tau$, and as either benign or normal ($\bar{M}$) if the region has an empty intersection with all the GT malignant findings. Formally, the label of region $i$, denoted by $z_i$, is defined as follows:

$$z_i = \begin{cases} M, & \mathrm{IoM}(r_i, A) \geq \tau, \\ \bar{M}, & \mathrm{IoM}(r_i, A) = 0. \end{cases}$$

Non-malignant regions with $0 < \mathrm{IoM}(r_i, A) < \tau$ are ignored during training. In practice, we obtained better performance by ignoring those regions during training than by labeling them as $\bar{M}$.
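The Intersection over Minimum and the resulting region labels can be sketched for axis-aligned boxes as follows. Representing lesions as bounding boxes, the half-open `(top, left, bottom, right)` convention, the use of `>=` in the comparison, and the threshold value 0.5 in the example are all assumptions made for illustration; the threshold used by the authors is not recoverable from the text here.

```python
def area(box):
    """Area of a (top, left, bottom, right) box; empty boxes have zero area."""
    top, left, bottom, right = box
    return max(0, bottom - top) * max(0, right - left)

def iom(region, lesion):
    """Intersection over Minimum: intersection area over the smaller of the two areas."""
    inter = (max(region[0], lesion[0]), max(region[1], lesion[1]),
             min(region[2], lesion[2]), min(region[3], lesion[3]))
    smaller = min(area(region), area(lesion))
    return area(inter) / smaller if smaller > 0 else 0.0

def region_label(region, lesions, threshold):
    """'M' above the threshold, 'not-M' with no overlap, None (ignored) in between."""
    best = max((iom(region, lesion) for lesion in lesions), default=0.0)
    if best >= threshold:
        return "M"
    if best == 0.0:
        return "not-M"
    return None  # partial overlap: excluded from the supervised loss

region = (0, 0, 10, 10)
small_inside = (2, 2, 4, 4)      # small lesion fully covered by the region
corner_touch = (8, 8, 20, 20)    # large lesion overlapping only a corner
far_away = (20, 20, 30, 30)
```

Unlike IoU, a small lesion fully contained in a region scores 1.0, which matches the stated goal of letting a fixed-size region be positive for lesions of very different scales.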
In order to exploit the local annotations, we propose two separate and novel objective functions that are imposed directly on the region classification and detection probabilities. In the fully supervised objective of the classification branch, we compute the log-likelihood according to the true region classes (as $M$ or $\bar{M}$) as:

$$L_{cls}(\theta) = \sum_{t \in F} \sum_{i} \left[ \mathbb{1}\{z_{ti} = M\} \log p_{cls}(M \mid r_{ti}) + \mathbb{1}\{z_{ti} = \bar{M}\} \log\left(1 - p_{cls}(M \mid r_{ti})\right) \right],$$

where $t$ goes over all the fully labeled images, and $i$ goes over the labeled regions in each image. The probability of a region to be classified as malignant, $p_{cls}(M \mid r_{ti})$, is defined in Eq. (1), and $1 - p_{cls}(M \mid r_{ti})$ is the complement probability (i.e., the probability to be classified as either benign or normal).
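A small sketch of a region-level classification log-likelihood of this kind, assuming regions carry label 1 (malignant), 0 (benign or normal) or -1 (ignored because of partial overlap); the integer encoding is a convention chosen for this example only.

```python
import numpy as np

def region_cls_log_likelihood(p_malignant, labels, eps=1e-7):
    """Sum of log p(M) over malignant regions and log(1 - p(M)) over non-malignant ones.

    p_malignant: (n,) region probabilities of the malignant class.
    labels:      (n,) 1 = malignant, 0 = benign-or-normal, -1 = ignored.
    """
    p = np.clip(np.asarray(p_malignant, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels)
    positive = labels == 1
    negative = labels == 0
    return float(np.log(p[positive]).sum() + np.log(1.0 - p[negative]).sum())

# The third region is ignored and does not contribute.
ll_cls = region_cls_log_likelihood([0.9, 0.2, 0.5], [1, 0, -1])
```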
In the fully-supervised objective of the detection branch, we want to concentrate on the malignant regions. We therefore define the detection branch objective as:
This demands that regions with high overlap with M-lesions gain a high M-probability. This soft constraint shifts the weakly supervised decisions toward the manually labeled regions. The trained model eventually bases its decisions on discriminative power and on similarity to the annotated regions as the source of malignancy.
Without loss of generality, we assume the fully-supervised objective is applied on the malignant images in . Our final fully supervised objective is then obtained as:
We set where is the total number of regions in the train data that have a region-level label. For simplicity, we set .
The weakly supervised part is defined in a similar way as in Sec. III. In the semi-supervised setting, this objective is defined over the weakly labeled training subset for M class and over all the images for B class:
In order to prevent redundancy in the training samples we avoid using the fully labeled images also as weakly labeled samples, previously shown to degrade performance in .
The fusion of the weakly and fully supervised settings can now be obtained by maximizing the following multi-task objective:
where denotes the weakly supervised part and denotes the fully supervised part. For simplicity, we set .
V Experimental Results
Dataset. We conducted experiments on a large screening dataset, named IMG, with full field digital mammography (FFDM). The cohort was acquired from different Hologic devices at 4 different medical centers (with an image size of approximately 3K × 1.5K). From this proprietary dataset we excluded images containing artifacts such as metal clips, skin markers, etc., as well as large foreign bodies (pacemakers, implants, etc.). Otherwise, the images contain a wide variation in terms of anatomical differences, pathology (including benign and malignant cases) and breast densities that correspond to what is typically found in screening clinics. The dataset was composed of 2,967 mammograms with normal images as well as various benign and suspiciously malignant findings. In terms of the global image BI-RADS (Breast Imaging Reporting and Data System), we had 350, 2,364, 146 and 107 mammograms corresponding to BI-RADS 1, 2, 4 and 5, captured from 65, 693, 81 and 62 individuals, respectively. Note that our BI-RADS 1 (Normal category) images do not contain any suspicious findings, not even confidently benign ones. Since a mammogram can contain findings with different BI-RADS categories, the global image BI-RADS was set by the most severe finding in the image (max operation), and the global patient BI-RADS was set by the max global image BI-RADS for that patient at a specific study, according to clinical guidelines.
Mammograms with global BI-RADS of 3 were excluded from our IMG dataset since these intermediate BI-RADS are commonly assigned based on other modalities (e.g. ultrasound) and comparison to prior mammograms  which are often unavailable. Yet, our data set included BI-RADS 3 findings that were not the most severe ones in the image. In terms of breast composition, 20% were “almost entirely fatty”, 48% with “scattered fibroglandular density”, 27% “heterogeneously dense” and 5% “extremely dense”. With respect to the dominant pathologies, our data set included 4525 calcifications (micro and macro) and 926 masses.
In our test scenario, we split the mammograms into the following three global labels: BI-RADS 4 & 5 defined as malignant (M), BI-RADS 2 as benign (B) and BI-RADS 1 as normal (N). We included all types of suspiciously malignant abnormalities in the M class, such as masses, calcifications, architectural distortions, etc. This choice of data classes creates a specific challenge, demanding that the model distinguish between images with very similar types of lesions, such as malignant versus benign masses or different types of micro-calcifications that are often ambiguous even for expert radiologists. BI-RADS based class separation is frequently used (e.g., [14, 24, 29, 30, 31, 32, 33]) due to the frequent lack of pathological results in the data set and the ambition to have a large positive set. In , the authors claim that although the INbreast data set includes pathology results, they use BI-RADS assessments for class labels, due to “lacks of reliable pathological confirmation”. In a similar way, they assign all images with BI-RADS 1 and 2 as negative and BI-RADS 4, 5 and 6 as positive.
Our second test bed, used for our weakly supervised model, was the publicly available INbreast (INB) FFDM dataset . This small dataset has 410 mammograms from 116 cases and was split into 100 positive (global BI-RADS 4, 5, 6) and 310 negative (global BI-RADS 1, 2, 3) mammograms. Note that in this case we included BI-RADS 3 to enable comparison with previous methods in the literature. We conducted a random patient-wise split of the INbreast images with 50% for training and 50% for testing.
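The mapping from per-finding BI-RADS categories to the three image-level classes on the IMG dataset can be sketched as below. Treating an image with no recorded findings as BI-RADS 1, and returning `None` for any global category outside {1, 2, 4, 5}, are assumptions made for this illustration.

```python
def image_label(finding_birads):
    """Map the BI-RADS categories of an image's findings to M, B, N or None (excluded)."""
    # Global image BI-RADS is the most severe finding (max operation).
    global_birads = max(finding_birads, default=1)
    if global_birads in (4, 5):
        return "M"
    if global_birads == 2:
        return "B"
    if global_birads == 1:
        return "N"
    return None  # e.g. global BI-RADS 3, excluded from the IMG dataset

labels = [image_label(f) for f in ([2, 3, 4], [2, 2], [], [2, 3])]
```

Note that a BI-RADS 3 finding does not exclude an image as long as a more severe finding is present, consistent with the description of the dataset.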
We implemented our model in TensorFlow framework using the Adam optimizer for training, with learning rate of, dropout of ,
-regularization and a batch-size of 256 images. This included all the regions from each image (on average approx. 200). We initialized the weights of the shared fully connected (FC) layer with normal distribution. The weights of the FC layers in the branches were initialized with zero mean and STD normal distribution. As the number of selected regions we chose (other values were tested but yielded lower performance). To enlarge and balance the training set, we used augmentations by adding rotations of , left-right and up-down flips and 6 image shifts. Fig. 3 illustrates the train progress and the network convergence in our two binary classification scenarios.
Evaluation Procedure. Our evaluation on IMG dataset was based on 5 fold patient-wise cross-validation, where at each train and test iteration, all the images from the patient under test were strictly excluded from the training set. To this end we randomly split the data set into 5 folds according to patient IDs, keeping a similar distribution over breast composition and lesion types in the folds.
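A patient-wise split of this kind can be sketched as follows. This simplified version assigns whole patients to folds at random and does not balance breast composition or lesion types across folds, which the evaluation described above additionally does; the function names and the seed are hypothetical.

```python
import random

def patient_folds(patient_ids, n_folds=5, seed=0):
    """Assign each image to a fold so that all images of a patient share one fold."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    return [fold_of[pid] for pid in patient_ids]

def split(folds, test_fold):
    """Image indices for training and testing given the held-out fold."""
    train = [i for i, f in enumerate(folds) if f != test_fold]
    test = [i for i, f in enumerate(folds) if f == test_fold]
    return train, test

# One entry per image; patients p1 and p3 contribute two images each.
ids = ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p6"]
folds = patient_folds(ids)
train_idx, test_idx = split(folds, test_fold=0)
```

Splitting by patient rather than by image prevents views of the same breast from appearing on both sides of a train/test boundary.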
As our model outputs two probabilities per image (, Eq. (3)), we are able to create 2D probabilities maps and conduct multi-class classification. However, to compare our results to previous methods and as instance of practical use case, we evaluated system performance on two binary classification tasks by joining two “nearby” classes, namely M with B or B with N. To this end we used scoring for M vs. B N (M vs. BN) and scoring for M B vs. N (MB vs. N). For performance measures, in addition to AUROC, we also report two more practical measures as used in . The partial-AUC ratio (pAUCR), associated with the ratio of the area under the ROC curve in a high sensitivity range ([0.8,1]), representing the AUROC at a more relevant domain for clinicians. Also, we report the specificity extracted from the ROC curve at sensitivity 0.85 and 0.90 representing an average operation point (OP) for expert radiologists, as reported in .
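The two practical measures can be approximated with a short NumPy sketch. It assumes that the partial-AUC ratio is the average specificity over the sensitivity range [0.8, 1], i.e. the area of the ROC curve inside that band divided by the area of the band; the exact definition in the cited work may differ. The specificity at a given sensitivity is taken as the best specificity among operating points that reach that sensitivity.

```python
import numpy as np

def roc_points(y_true, scores):
    """False and true positive rates at every distinct score threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)[::-1]
    tpr = np.array([(scores[y_true] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[~y_true] >= t).mean() for t in thresholds])
    return fpr, tpr

def specificity_at(fpr, tpr, sensitivity):
    """Best specificity among operating points reaching the given sensitivity."""
    return 1.0 - fpr[tpr >= sensitivity].min()

def pauc_ratio(y_true, scores, low=0.8, steps=201):
    """Average specificity over the sensitivity range [low, 1] (assumed definition)."""
    fpr, tpr = roc_points(y_true, scores)
    grid = np.linspace(low, 1.0, steps)
    return float(np.mean([specificity_at(fpr, tpr, s) for s in grid]))

y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
fpr, tpr = roc_points(y_true, scores)
spec_090 = specificity_at(fpr, tpr, 0.90)
ratio = pauc_ratio(y_true, scores)
```

In the toy example, reaching a sensitivity above 0.5 requires accepting one of the two negatives, so both the specificity at 0.90 sensitivity and the partial-AUC ratio equal 0.5.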
We compare the performance of our model to several baselines and to the previously published method of  (Max-Region). We then present the impact of the fully labeled data, combined with our multi-task loss. For evaluation, we present our results on the two binary classification tasks, M vs. BN and MB vs. N.
Weakly-supervised set-up. Table II presents the performance for the two binary classification tasks. In this test we present three baselines: 1) Max-Region , 2) the DB-Baseline, presenting an equivalent dual-branch approach of , and 3) the Cls-Det, which is our approach without region selection.
Considering a purely weakly labeled data set, our method (Cls-Det-RS) outperforms the DB-Baseline and Max-Region  in all measures and in both classification scenarios. In AUROC we obtained a slight increase of 2-4%. However, in the more practical measures of pAUCR and operating point, the results show significant improvement. In particular, for M vs. BN, there was an increase of 10-17% in pAUCR and 8-11% in specificity at 0.85 sensitivity. For MB vs. N we obtained 6-14% higher pAUCR and 8-15% higher specificity at 0.85 sensitivity. The analysis of the model without the region selection (RS) part shows that, on average, the addition of region selection improves the performance.
We further conducted a breast-level analysis by considering both views of the same breast. To this end, we assigned each breast the maximum probability over its views. The results showed performance similar to that of single-mammogram processing.
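The breast-level aggregation amounts to a maximum over the views of each breast. The sketch below assumes a hypothetical breast key such as (patient ID, side) and per-view probabilities produced by the image-level model; the function name is ours.

```python
def breast_level_scores(view_scores):
    """Assign each breast the maximum probability over its views.

    view_scores: iterable of (breast_id, view_name, probability) tuples,
    where breast_id is a hypothetical key such as (patient ID, side).
    """
    best = {}
    for breast_id, _view, prob in view_scores:
        best[breast_id] = max(prob, best.get(breast_id, 0.0))
    return best
```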
Training and testing on the small public INB data set yielded an AUROC of 0.73. Note that this result was obtained without using an external fully labeled data set, as opposed to [4, 5]. It shows the performance of our model when trained on a very small data set. It is further comparable to the AUROC of 0.74 reported in  when trained on single MG, which nevertheless used fully supervised data.
Our error analysis over regions indicates that the majority of our errors are between and class as it is also the case for breast radiologists.
|Method||AUROC||pAUCR||Spec @ Sens 0.85||Spec @ Sens 0.90|
|M vs. BN: Weakly-supervised methods|
|M vs. BN: Semi-supervised methods|
|SS-Cls-Det-RS .25||0.731 ± 0.029||0.305 ± 0.108||0.40||0.31|
|SS-Baseline-RS .5||0.740 ± 0.022||0.316 ± 0.126||0.43||0.30|
|SS-Cls-Det-RS .5||0.745 ± 0.032||0.313 ± 0.119||0.46||0.33|
|SS-Cls-Det-RS .75||0.745 ± 0.026||0.320 ± 0.109||0.42||0.33|
|M vs. BN: Fully-supervised method|
|SS-Cls-Det-RS 1||0.751 ± 0.026||0.316 ± 0.078||0.47||0.32|
|MB vs. N: Weakly-supervised methods|
Fig. 3: AUC per epoch for the train and validation sets (first-fold experiment). Left: M vs. BN. Right: MB vs. N.
Semi-supervised set-up. In this section we analyze the performance of our semi-supervised model. To reduce the demand for local annotations, we considered local annotations only for the malignant findings. We opted for the classification task of M vs. BN, as commonly considered in previous works [15, 16, 30]. We further evaluated the impact of the ratio of the fully supervised train set, as a measure of the cost-effectiveness of the annotation workload. The results for our semi-supervised setting (SS-Cls-Det-RS) are shown in Table II. The classification performance improves as more localized regions are used. This continues up to 75% utilization of local annotations and plateaus afterward.
For instance, when using 50% of the local M annotations, there was a 2.33% increase in the AUROC, 5.2% in the pAUC and 8.2% in specificity at 0.85 sensitivity, compared to our weakly-supervised approach (Cls-Det-RS). This amounts to an improvement of 6.6%, 11% and 14% in AUROC, pAUC and Spec @ 0.85 Sens, respectively, compared to Max-Region. With respect to the DB-Baseline, there was a 5.1% improvement in AUROC, achieved with annotation of less than 5% of all findings in the training set (around 175 lesions). All performance values are averages over a random-split, 5-fold cross-validation.
Although the train process begins without any labels on regions, the impact of each region can be scored after the training process by:
The top regions for each class (B/M) can now be visualized and compared to the radiologist’s annotations as the source of malignancy or benign class of the image. Fig. 4 shows several examples with localization in the test set, overlaid with radiologist annotations (used only for validation). As observed, the method is capable of separately highlighting multiple types of abnormalities such as benign and malignant lesions without having an instance level annotation.
We further evaluated our localization performance by a quantitative measure. Targeting the localization as the system’s self-explanation tool, we used a less strict measure than the standard intersection over union (IoU) for the correctness of our localization outcome. We follow the weak localization measure, the intersection over the minimum area between the region and the lesion (IoM), as defined in Eq. (6) (also used in ). This measure allows explanation of an outcome when a specified region contains a true type of lesion or vice versa. Since our region size is relatively small and fixed, this setup will not allow oversizing of the localization area (see examples in Fig. 4). In contrast to the previous methods of [15, 16], we formally assess the accuracy of our localization results by Eq. (13).
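The difference between IoM and IoU can be made concrete with a short sketch. It assumes, for illustration only, that both the region and the lesion annotation are axis-aligned boxes given as (x0, y0, x1, y1); the function names are ours.

```python
def box_area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); zero if degenerate."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area(a, b):
    """Area of the overlap between two boxes (zero when disjoint)."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]),
               min(a[2], b[2]), min(a[3], b[3]))
    return box_area(overlap)


def iom(region, lesion):
    """Intersection over the minimum of the two areas."""
    smaller = min(box_area(region), box_area(lesion))
    return intersection_area(region, lesion) / smaller if smaller > 0 else 0.0


def iou(region, lesion):
    """Standard intersection over union, for comparison."""
    inter = intersection_area(region, lesion)
    union = box_area(region) + box_area(lesion) - inter
    return inter / union if union > 0 else 0.0
```

For a small region lying fully inside a large lesion, IoM equals 1 while IoU stays close to 0, which is why IoM suits fixed-size regions that are smaller than many findings.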
For an image classified as , we consider all the regions with over a certain threshold. Correct localization per-lesion is obtained as . We present the FROC localization accuracy for class using . The detection sensitivity in the FROC is the fraction of images in the True-Positive set with at least one correct localization. The results show that the region selection yields the best performance with relatively low False positive rate per image (FPPI).
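A minimal sketch of how one FROC operating point could be computed from scored regions follows. It assumes that each region has already been marked correct or incorrect by the localization criterion, and that false positives are averaged over the same set of images; both are simplifying assumptions of ours, and the function names are illustrative.

```python
def froc_point(images, threshold):
    """Return (FPPI, detection sensitivity) at one score threshold.

    images: list with one entry per image; each entry is a list of
    (score, is_correct) tuples for the regions of that image.
    Sensitivity is the fraction of images with at least one correct
    region above the threshold; FPPI is the mean number of incorrect
    regions above the threshold per image.
    """
    hits = 0
    false_positives = 0
    for regions in images:
        kept = [ok for score, ok in regions if score > threshold]
        hits += any(kept)
        false_positives += sum(1 for ok in kept if not ok)
    n = len(images)
    return false_positives / n, hits / n


def froc_curve(images, thresholds):
    """FROC points for a sweep of thresholds, in the order given."""
    return [froc_point(images, t) for t in thresholds]
```

Sweeping the threshold from high to low traces the curve from low FPPI and low sensitivity toward higher values of both.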
Weakly-supervised set-up. Fig. 5 shows the detection performance as FROC. The left plot shows the performance for MB vs. N. While at low FPPI the DB-Baseline (dotted black curve) and our model (Cls-Det-RS, dashed orange curve) are comparable, at high detection sensitivity our model shows slightly improved performance. Moreover, our model clearly outperforms Max-Region  (dotted red curve).
In the right plot of Fig. 5, we show FROC curves for detection of the malignant lesions (BI-RADS 4 & 5). In this set-up, we first compare our weakly supervised model to several baselines and then show the impact of our semi-supervised network with various ratios of fully labeled data. In particular, the detection performance of our weakly supervised model (dashed orange) is compared with the DB-Baseline  (dotted black) and the Max-Region method  (dotted red). In this scenario of detecting malignant lesions, the DB-Baseline shows poor results. Although Max-Region shows an improvement over the DB-Baseline, our model clearly outperforms both. Also, our model with region selection (Cls-Det-RS, dashed orange) outperforms our model without region selection (Cls-Det, dashed blue).
Semi-supervised set-up. The right plot in Fig. 5 shows that incorporating local annotations into our semi-supervised model (SS-Cls-Det-RS) improves the detection. The green lines indicate results when using different ratios of fully labeled data (wider curves indicate a higher fully labeled ratio in training). The wide cyan line stands for full supervision on the M class. For instance, at 1 FPPI, our model with 50% local annotations (SS-Cls-Det-RS .5) yields 30% higher detection sensitivity. The detection sensitivity further improves when more locally annotated mammograms are used. However, the influence of local annotations plateaus at a ratio of nearly 75% (SS-Cls-Det-RS .75), presenting performance similar to the fully-supervised method (SS-Cls-Det-RS 1).
The performance drop in M vs. BN compared to MB vs. N (right vs. left plot in Fig. 5) indicates the difficulty of the model in distinguishing between benign and malignant lesions, as is often the case with radiologists.
The impact of the loss on the detection branch. To assess the contribution of the detection loss, we ran our model with a loss only on the classification branch (similar to ). We trained our model with 50% fully labeled data, without the detection loss in Eq. (10) (setting ). We then show the resulting FROC (SS-Baseline-RS .5, dotted pink curve) in the right plot of Fig. 5. Comparison to our model (SS-Cls-Det-RS .5, green) shows a significant drop in the FROC of this baseline, indicating the contribution of our novel detection loss.
Visual Analysis. Using our multi-label probability output, we plot each sample in a probability plane representing the global prediction results of the images. In this plane, each image is a point whose coordinates are its malignant and benign probabilities. Fig. 6 shows the global probability plane on a train and test set, color coded by the true class. Blue points, representing normal (N) images (without any finding), are mostly located near the origin, showing approximately zero probabilities for both malignant and benign. Green points represent images with only benign findings (B). These are concentrated around (0,1), with low malignant and high benign probability. Red points represent malignant images without benign findings. These appear on the right side of the plot, with high malignant and low benign probability. Finally, black points, representing malignant mammograms that also include benign findings, are more likely located at the top-right corner, with both probabilities high.
Having the tuple probabilities as shown in Fig. 6, we applied a one-vs.-all kernel SVM classification into 3 classes of Normal, Benign and Malignant (for practical reasons and simplicity, we unified the M & MB labels). The confusion matrix is shown in Table III. Although there is no order between the classes, the errors are more likely to occur between “neighboring” classes (N-B-M). Note that most of the errors in the M class (75%) are with B and most of the errors in N (70%) are with B. This trend may correspond to the human difficulty of distinguishing between malignant and benign findings.
VI Summary & Discussion
In this work, we proposed a method for simultaneous classification and detection of abnormalities in mammograms. Our framework enables multi-class classification of mammograms in a weakly labeled setting (where only global labels are given, without local annotations). To this end, we used a novel region-based approach and a dual-branch deep neural net, with one branch for classification and the other for detection. The results show higher AUROC for whole mammogram classification and a significant improvement in detection of the multi-class abnormalities. A further novelty of this paper addresses the problem of fusion between weakly labeled and local annotations in the data set. The collection of local annotations is prohibitively expensive in the medical domain and its cost-effectiveness is questionable. We therefore suggest a novel semi-supervised approach where only a small subset of the data has local annotations for the key findings. The new model relies mainly on weakly labeled data and therefore can also be implemented without any locally annotated images. To this end, we used a novel objective function enforced on region classification and detection, formalized as a multi-task objective function.
We conducted an extensive evaluation, comparing our approach to several baselines and directly to previously published methods on the same data set. The results show an improvement in AUROC, and a significant performance boost in the more practical measures of partial AUC and specificity at high-sensitivity operating points. For instance, using only 5% of the data with local annotation shows a 10% increase in specificity (at 0.85 sensitivity), which projects to 3.6 million fewer false positives per year in screening (based on 40M yearly mammograms in the US, of which 90% are BI-RADS 1 or BI-RADS 2). Our method, which can learn from image-level labels alone and can also utilize instance-level labels where they exist, is uniformly applicable to both weakly supervised and semi-supervised detection problems.
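The projected reduction in false positives follows from simple arithmetic, sketched below. The sketch assumes that the 10% increase is read as ten absolute percentage points of specificity, which is the reading under which the stated figure is obtained; the function name and integer-percent parameters are ours.

```python
def fewer_false_positives(yearly_exams, negative_percent, specificity_gain_points):
    """Projected yearly reduction in false-positive screening results.

    yearly_exams: number of screening mammograms per year.
    negative_percent: percent of exams assessed as BI-RADS 1 or 2.
    specificity_gain_points: assumed absolute gain in specificity,
        in percentage points.
    """
    negatives = yearly_exams * negative_percent // 100
    return negatives * specificity_gain_points // 100


# 40M exams, 90% negative, 10-point specificity gain.
print(fewer_false_positives(40_000_000, 90, 10))  # prints 3600000
```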
Localization in the result is an effective tool for explaining a “black box” deep learning model. We therefore evaluated the localization of our approach quantitatively, in full resolution. The results show a significant improvement in detection of the true region, as the source of our assessment. The improvement is obtained in our weakly supervised method and is further boosted with the addition of local annotations. In the era of Big Data, the combination of large weakly labeled data sets with partial local annotations can provide a cost-effective solution for future decision support systems in medical imaging.
Our method was evaluated based on BI-RADS assessment by radiologists, similar to [14, 24, 29, 30, 31, 32, 33]. We opted for this setting in order to have a large data set of approx. 3K mammograms, as pathologies were not available for all of our high BI-RADS exams. BI-RADS 4 and 5 have positive predictive values of approx. 35% and over 95%, respectively, and are particularly rare in the population. There are several recent works trained and tested on large FFDM mammogram data sets with pathologies, such as [4, 5], which used the DREAM Challenge data set, or . Unfortunately, these data sets are not publicly available and cannot be used by other researchers for benchmarking. We believe that our scenario based on BI-RADS assessments can provide a valid platform for comparison between different methods and baselines. We test our method and all baselines on the same data setting. Our new model is compared to previous methods on the same data set, which is rarely done in the medical domain due to the lack of large public FFDM data sets.
Our method is also limited in that it analyzes each view separately, without the bilateral breast comparison conducted by radiologists. Our future work will focus on incorporating this additional information into the model, to find correlations between image views and dissimilarities between breast sides.
Combining the proposed approach with end-to-end training of the backbone network is applicable with larger data sets. End-to-end training, as well as using regions of multiple scales and aspect ratios, are interesting future research directions beyond the scope of this work.
-  A. Jemal, F. Bray, M. M. Center, J. Ferlay, E. Ward, and D. Forman, “Global cancer statistics,” CA: a cancer journal for clinicians, vol. 61, no. 2, pp. 69–90, 2011.
-  F. S. S. de Oliveira, A. O. de Carvalho Filho, A. C. Silva, A. C. de Paiva, and M. Gattass, “Classification of breast regions as mass and non-mass based on digital mammograms using taxonomic indexes and SVM,” Comp. in Bio. and Med., vol. 57, pp. 42–53, 2015.
-  C. Jen and S. Yu, “Automatic detection of abnormal mammograms in mammographic images,” Expert Syst. Appl., vol. 42, no. 6, pp. 3048–3055, 2015.
-  W. Lotter, G. Sorensen, and D. Cox, “A multi-scale CNN and curriculum learning strategy for mammogram classification,” MICCAI, 2017.
-  D. Ribli, A. Horváth, Z. Unger, P. Pollner, and I. Csabai, “Detecting and classifying lesions in mammograms with deep learning,” Scientific Reports, 2018.
-  A. Katalinic, C. Bartel, H. Raspe, and I. Schreer, “Beyond mammography screening: quality assurance in breast cancer diagnosis (the QuaMaDi project),” British Journal of Cancer, vol. 96, no. 1, pp. 157–161, 2007.
-  M. Y. Guan, V. Gulshan, A. M. Dai, and G. E. Hinton, “Who said what: Modeling individual labelers improves classification,” in AAAI, 2018.
-  Y. Choukroun, R. Bakalo, R. Ben-Ari, A. Akselrod-Ballin, E. Barkan, and P. Kisilev, “Mammogram classification and abnormality detection from nonlocal labels using deep multiple instance neural network,” in Eurographics Workshop on Visual Computing for Biology and Medicine. The Eurographics Association, 2017.
-  I. C. Moreira, I. Amaral, I. Domingues, A. Cardoso, M. J. Cardoso, and J. S. Cardoso, “Inbreast: toward a full-field digital mammographic database,” Academic radiology, vol. 19, no. 2, pp. 236–248, 2012.
-  H. Bilen and A. Vedaldi, “Weakly supervised deep detection networks,” in CVPR, 2016, pp. 2846–2854.
-  M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free? - weakly-supervised learning with convolutional neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2015.
-  W. Jiang, T. Ngo, B. S. Manjunath, Z. Zhao, and F. Su, “Optimizing region selection for weakly supervised object detection,” CoRR, vol. abs/1708.01723, 2017.
-  DREAM, “The Digital Mammography DREAM Challenge,” 2017, https://www.synapse.org/#!Synapse:syn4224222.
-  K. J. Geras, S. Wolfson, Y. Shen, S. G. Kim, L. Moy, and K. Cho, “High-resolution breast cancer screening with multi-view deep convolutional neural networks,” 2017. [Online]. Available: https://arxiv.org/abs/1703.07047
-  S. Hwang and H. Kim, “Self-transfer learning for weakly supervised lesion localization,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2016, pp. 239–246.
-  W. Zhu, Q. Lou, Y. S. Vang, and X. Xie, “Deep multi-instance networks with sparse label assignment for whole mammogram classification,” in Medical Image Computing and Computer Assisted Intervention (MICCAI), 2017, pp. 603–611.
-  Z. Yan, Y. Zhan, Z. Peng, S. Liao, Y. Shinagawa, S. Zhang, D. N. Metaxas, and X. S. Zhou, “Multi-instance deep learning: Discover discriminative local anatomies for bodypart recognition,” IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1332–1343, 2016.
-  Z. Yan, J. Liang, W. Pan, J. Li, and C. Zhang, “Weakly- and semi-supervised object detection with expectation-maximization algorithm,” CoRR, vol. abs/1702.08740, 2017.
-  R. B. Girshick, “Fast R-CNN,” in Int. Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
-  R. G. Cinbis, J. J. Verbeek, and C. Schmid, “Weakly supervised object localization with multi-fold multiple instance learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 189–203, 2017.
-  G. Papandreou, L. Chen, K. Murphy, and A. L. Yuille, “Weakly- and semi-supervised learning of a DCNN for semantic image segmentation,” CoRR, vol. abs/1502.02734, 2015.
-  S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
-  S. Y. Shin, S. Lee, I. D. Yun, and K. M. Lee, “Joint weakly and semi-supervised deep learning for localization and classification of masses in breast ultrasound images,” CoRR, vol. abs/1710.03778, 2017.
-  L. Shen, “End-to-end training for whole image breast cancer diagnosis using an all convolutional design,” NIPS Workshop on Machine Learning for Health, 2017.
-  C. Chen, G. Liu, J. Wang, and G. Sudlow, “Shape-based automatic detection of pectoral muscle boundary in mammograms,” Journal of Medical and Biological Engineering, vol. 35, pp. 315–322, 2015.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in BMVC, 2014.
-  J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
-  A. Y. Michaels, R. L. Birdwell, C. S. Chung, E. P. Frost, and C. S. Giess, “Assessment and management of challenging BI-RADS category 3 mammographic lesions,” RadioGraphics, vol. 36, no. 5, pp. 1261–1272, 2016.
-  N. Dhungel, G. Carneiro, and A. P. Bradley, “The automated learning of deep features for breast mass classification from mammograms,” in MICCAI, 2016.
-  N. Dhungel, G. Carneiro, and A. P. Bradley, “Fully automated classification of mammograms using deep residual neural networks,” in IEEE Int. Symposium on Biomedical Imaging (ISBI), 2017.
-  W. Zhu, Q. Lou, Y. S. Vang, and X. Xie, “Deep multi-instance networks with sparse label assignment for whole mammogram classification,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2017.
-  A. Akselrod-Ballin, L. Karlinsky, S. Alpert, S. Hasoul, R. Ben-Ari, and E. Barkan, “A region based convolutional network for tumor detection and classification in breast mammography,” in MICCAI Workshop on Deep Learning and Data Labeling for Medical Applications, 2016.
-  A. Akselrod-Ballin, L. Karlinsky, A. Hazan, R. Bakalo, A. B. Horesh, Y. Shoshan, and E. Barkan, “Deep learning for automatic detection of abnormal findings in breast mammography,” in MICCAI Workshop on Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, 2017.
-  X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Int. Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256.
-  C. D. Lehman, R. F. Arao, B. L. Sprague, J. M. Lee, D. S. Buist, K. Kerlikowske, L. M. Henderson, T. Onega, A. N. A. Tosteson, G. H. Rauscher, and D. L. Miglioretti, “National performance benchmarks for modern screening digital mammography: Update from Breast Cancer Surveillance Consortium,” Radiology, vol. 283, no. 1, pp. 59–69, 2017.
-  R. Ben-Ari, A. Akselrod-Ballin, L. Karlinsky, and S. Y. Hashoul, “Domain specific convolutional neural nets for detection of architectural distortion in mammograms,” in IEEE Int. Symposium on Biomedical Imaging (ISBI), 2017, pp. 552–556.
-  N. Wu, J. Phang, J. Park, and …, “Deep neural networks improve radiologists’ performance in breast cancer screening,” CoRR, vol. abs/1710.03778, 2019.