Using Machine Learning to Automate Mammogram Images Analysis

Breast cancer is the second leading cause of cancer-related death in women after lung cancer. Early detection of breast cancer by X-ray mammography is believed to have effectively reduced the mortality rate. However, mammography still suffers from a relatively high false positive rate and low specificity. In this work, a computer-aided automatic mammogram analysis system is proposed to process mammogram images and automatically classify them as either normal or cancerous. The system consists of three consecutive stages: image processing, feature selection, and image classification. In designing the system, discrete wavelet transforms (Daubechies 2, Daubechies 4, and Biorthogonal 6.8) and the Fourier cosine transform were first used to parse the mammogram images and extract statistical features. Then, an entropy-based feature selection method was implemented to reduce the number of features. Finally, different pattern recognition methods (the Back-propagation Network, Linear Discriminant Analysis, and the Naive Bayes Classifier) and a voting classification scheme were employed. Each classification strategy was evaluated in terms of sensitivity, specificity, and accuracy, and its general performance was assessed using the Receiver Operating Characteristic (ROC) curve. Our method is validated on a dataset from Eastern Health in Newfoundland and Labrador, Canada. The experimental results demonstrate that the proposed automatic mammogram analysis system can effectively improve classification performance.




I Introduction

Breast cancer is the most commonly diagnosed form of cancer in women and the second-leading cause of cancer-related death after lung cancer [7]. Statistics from the American Cancer Society indicate that approximately 232,670 American women (29% of all cancer cases) will be diagnosed with breast cancer, and an estimated 40,000 women (15% of all cancer cases) will die of it in 2014 [16]. Similar statistics were reported in Canada, where approximately 23,800 women (26%) were diagnosed with breast cancer and 5,000 (14%) died from it in 2013 [20]. Consequently, the detection and diagnosis of breast cancer have drawn a great deal of attention from the medical community.

Studies show that early detection, diagnosis, and therapy are particularly important for prolonging lives and treating cancers [24, 17]. If breast cancer is found early, the five-year survival rate of patients in stage 1 can reach 90% with effective treatment. Medical imaging is one of the main methods for breast cancer detection. Commonly used medical imaging technologies include X-ray mammography, Computed Tomography (CT), ultrasound, Magnetic Resonance Imaging (MRI), and Single-Photon Emission Computed Tomography (SPECT). Among these technologies, mammography achieves the best results in early detection of asymptomatic breast cancer and is one of the least expensive. For this reason, it has become the principal method of breast cancer detection in clinical practice, and one of the most effective methods for general breast cancer screening.

Masses and calcifications (including macrocalcifications and microcalcifications) are the most common and basic signs of breast cancer. Masses in mammography can be recognized as local, high-contrast areas, but the contrast value is not unique: it changes with imaging conditions, sizes, and backgrounds. The X-ray absorption rates of masses are very close to those of dense glandular tissue and other dense tissues in the breast. In addition, the boundaries of masses are always mixed with background structures. Therefore, microcalcification detection remains one of the most popular topics in medical image processing research [14].

Modern equipment has improved the technical aspects of mammography. However, a relatively high false positive rate and low specificity persist, due both to fundamental physical limitations, such as unobvious lesions, and to controllable factors, such as radiologists’ inexperience in reading mammograms. The latter issue has been addressed using double reading, where two radiologists independently judge the same mammogram and then combine and discuss their opinions. However, this solution is expensive and relies heavily on the radiologists’ experience. On the other hand, machine learning based systems provide sustainable support and are routinely used across every aspect of our lives [21, 26, 22, 23], including Computer-Aided Diagnosis (CADx) systems for mammogram image analysis [3].

In this paper, we focus on breast cancer detection using a computer-aided automatic mammogram analysis system to improve the accuracy of diagnosis. In the proposed system, an entropy-based method is implemented for feature selection, and different pattern recognition methods, including the Back-propagation (BP) Network, Linear Discriminant Analysis (LDA), the Naive Bayes (NB) Classifier, and voting schemes, are employed for image classification.

The rest of the paper is organized as follows. Section II presents background regarding breast cancer detection and relevant work. Section III proposes the automatic mammogram images analysis system. Subsequently, the experimental results are presented and analyzed in Section IV. Finally, Section V draws the conclusion and discusses lines of future work.

II Related Work

Calcification detection and classification has been an important research target of automatic mammogram analysis systems. A wide variety of approaches have been developed to improve detection performance. However, the task remains challenging for the following reasons: 1) microcalcifications occur in various sizes, shapes, and distributions; 2) microcalcifications have low contrast in the region of interest (ROI); 3) dense tissue and/or skin thickness make suspicious lesion areas difficult to detect (especially in young women); 4) dense tissue is easily mistaken for microcalcification, which results in high false positive rates among most existing algorithms.

Feature selection is crucial for detection and classification. Considerable effort has been devoted to employing different features in mammogram analysis. For example, geometrical and statistical features are used for classification in [1]. However, these methods depend heavily on image translation, scaling, and rotation, which results in misclassification. Consequently, these approaches can be improved by using a more robust representation, such as the Fourier transform, whose magnitude spectrum is translation invariant. In time-frequency analysis, the Fourier transform is traditionally applied as a global transform between the time and frequency domains [13]. Therefore, the Fourier transform cannot express the local properties of signals in the time and frequency domains simultaneously. However, these local properties are the key characteristics of non-stationary signals in some circumstances. To analyze and process non-stationary signals, the wavelet transform is one of the methods that retains important temporal information in the frequency domain, or partial frequency information in the time domain. In this work, both the Fourier transform and wavelet transforms are applied at the data transform stage to extract the features.

However, not all of the extracted features benefit classification performance. For a feature to be useful in classification, it should be closely and uniquely associated with a certain class [6]. Ideally, the feature will correlate with the desired class independently of the presence of other classes. If these conditions are met, the feature reduction (selection) problem can be addressed by measuring the correlation with that class and then establishing a pass threshold, which eliminates features that correlate poorly. There are two common approaches to measuring the correlation between two random variables, in this case between a feature and a class [10]. The first is linear correlation, where the variation in a feature value is compared to the variation in the class value. The second approach, and the one adopted for our study, is Information Gain, a concept based on the reduction of entropy in the dataset.

A target range for the number of features was determined from the work of Lei and Huan [15], who proposed a fast correlation-based filter approach and an efficient way of analyzing feature redundancy. Their feature selection algorithm was implemented and evaluated through extensive experiments against other related feature selection algorithms on ten different kinds of feature types. The number of features ranged from 57 to 650, and the sample size ranged from 32 to 9338. At the end of the experiment, they recorded the running time and the number of features selected by each algorithm. The results showed that the average number of selected features was 15 across the five compared feature selection algorithms, and that the selected features led to a classification accuracy of around 89%. In this research, we therefore chose an information gain threshold that leaves around 15 features. Entropy is a measure of the uncertainty of a random variable [16].

III System Design

The proposed automatic mammogram analysis system comprises three consecutive stages: the image processing stage, the feature selection stage, and the image classification stage. Fig. 1 visualizes the framework of the proposed system, and each component is detailed in the following sections.

Fig. 1: Framework of automatic mammogram analysis system.

III-A Image Processing

In the first image processing stage, a set of scalar features is extracted from an original image. This stage consists of two steps: image pre-processing and data transform (including wavelet and Fourier transforms). In the image pre-processing step, the original digitized mammogram image is flipped, de-noised, and scaled to a common maximum value. In the data transform step, the normalized images are decomposed separately by three wavelet transforms with different bases (Daubechies db2, Daubechies db4, and Biorthogonal bior6.8) and by the Fourier transform [9].

Multiple levels of decomposition are used, and four images are produced at each level of the decomposition. Finally, four statistical features (the mean, standard deviation, skewness, and kurtosis of the image intensities) are calculated.

III-A1 Mammogram Image Pre-processing

The original images supplied to the automatic mammogram analysis system differ in size and orientation. Furthermore, artefacts and noise may exist in some mammograms, which would lead to wrong or poor analysis results. Thus, several mammogram pre-processing steps were implemented to regularize the appearance of the images and to remove unnecessary artefacts and noise. Based on the studies in [5], the steps taken in this work are as follows:

Fig. 2: An example of MLO view mammogram: A. Right side; B. Left side; C. The image after orientation matching of A.

  1. Orientation Matching: our study only involved the Medial Lateral Oblique (MLO) mammogram view. In this view, the right and left breasts point to opposite sides of the mammogram image. Flipping one of the breasts to the same direction as the other therefore ensures that all images point in the same direction, preventing changes in the wavelet transform coefficients that are due only to the change in directionality between right and left images. The sharp edge between the tissue and the dark background is the major feature responsible for this effect in all images. As shown in Fig. 2, the intensity of right breast images falls from left to right across this edge, while it rises in left breast images, which would change the sign of the calculated wavelet coefficients. Fig. 2 A and B respectively show the right and left breast images of a patient with tiny microcalcifications in her breast tissue, and Fig. 2 C shows the reflected image of the right breast after orientation matching.

  2. Background Thresholding: signal outside the tissue is non-informative and was removed by binary masking. Concretely, a threshold is set to create a binary image, and pixels with an intensity lower than the threshold are set to zero. A satisfactory threshold removes all irrelevant information in the background pixels and leaves foreground objects unaltered. One of the most commonly used methods for choosing the threshold is Otsu’s Method [8], which assumes that the image to be thresholded contains two classes of pixels, i.e., a bi-modal histogram (e.g. foreground and background). The method then calculates the optimum threshold separating those two classes so that their combined spread (intra-class variance) is minimal. It also assumes that the foreground and background intensities are normally distributed, and it chooses the threshold level that minimizes the segmentation error between the two regions. The attenuation of x-rays passing through the tissue affects the intensity in the images, and is influenced by the thickness and density of the tissue. Therefore, tissue pixels that fall below the conservative threshold are predominantly from the edges of the tissue region, where the breast tissue is thin and uncompressed. While a few pixel layers may be removed by this method, this was deemed acceptable, as any pathology that exists this close to the surface of a patient’s skin should be readily detectable by conventional examination without the aid of mammography. In this work, binary thresholding sets all pixels below the threshold to an intensity of zero and all pixels above the threshold to an intensity of one (see Fig. 3). The output image of the process is the pixel-by-pixel product of the binary mask image and the original image. In this way, all background pixels of the output image are set to zero intensity, while all foreground pixels are unaffected.

    Fig. 3: A. Mammogram image before background thresholding; B. The thresholded binary image used to mask the original image.

  3. Intensity Matching: intensity matching is the last pre-processing step applied to the images before they are ready for the data transforms. In this step, all mammograms are linearly scaled to an intensity range of 0.0 to 1.0. This intensity matching process can be defined by

    $$ I_{\mathrm{out}}(x,y) = \frac{I_{\mathrm{in}}(x,y) - \min I_{\mathrm{in}}}{\max I_{\mathrm{in}} - \min I_{\mathrm{in}}} \qquad (1) $$

    where $I_{\mathrm{in}}$ is the input image following the background thresholding step, and $I_{\mathrm{out}}$ is the intensity-matched image whose pixel intensities range from zero to one. This step ensures uniformity across all mammogram images, because their pixel intensity ranges can differ with machine settings. Fig. 4 shows only a slight difference before and after intensity matching, because the maximum relative intensity prior to normalization was already 0.92. The broader spread in intensities increases the variation between different tissue types and densities.

    Fig. 4: Mammogram image before (A) and after (B) intensity matching.
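The three pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names, the 256-bin histogram, the assumption that input intensities lie in [0, 1], and the tiny 2×4 "image" are all hypothetical, the threshold is chosen with Otsu's method (the text names it as a common choice but does not confirm it was the one used), and de-noising is omitted.

```python
def flip_horizontal(image):
    """Mirror the image left-to-right (orientation matching)."""
    return [row[::-1] for row in image]

def otsu_threshold(image, bins=256):
    """Threshold maximising between-class variance (Otsu's method).

    Maximising between-class variance is equivalent to minimising the
    combined intra-class variance. Intensities are assumed to lie in [0, 1].
    """
    pixels = [p for row in image for p in row]
    hist = [0] * bins
    for p in pixels:
        hist[min(int(p * bins), bins - 1)] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0 = 0      # number of pixels assigned to the background so far
    sum0 = 0.0  # sum of bin indices of those background pixels
    for t in range(bins - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (total_sum - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_bin, best_var = t, between
    return (best_bin + 1) / bins  # upper edge of the last background bin

def mask_background(image, threshold):
    """Product of image and binary mask: pixels below the threshold become 0."""
    return [[p if p >= threshold else 0.0 for p in row] for row in image]

def match_intensity(image):
    """Linearly rescale intensities to the range [0, 1]."""
    pixels = [p for row in image for p in row]
    lo, hi = min(pixels), max(pixels)
    if hi == lo:  # flat image: nothing to rescale
        return [[0.0 for _ in row] for row in image]
    return [[(p - lo) / (hi - lo) for p in row] for row in image]

# Hypothetical right-breast image: bright tissue on the left, dark background on the right.
right_breast = [
    [0.80, 0.70, 0.05, 0.02],
    [0.92, 0.60, 0.04, 0.01],
]
flipped = flip_horizontal(right_breast)
threshold = otsu_threshold(flipped)
masked = mask_background(flipped, threshold)
normalized = match_intensity(masked)
```

After these steps the background pixels are exactly zero and the brightest tissue pixel is exactly one, so images acquired with different machine settings share a common intensity range.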

III-A2 Data Transforms

Once the images had been pre-processed to minimize differences that were not related to the physical composition of the breast tissue, the wavelet and Fourier transforms were performed on them. The images were all sampled to 1024×1024 pixels, which allows a maximum of 10 levels of decomposition, since dyadic sampling reduces the dimensions by a factor of two in each direction after each pass. In this work, only eight levels of decomposition were used, because the final two levels would consist of four-pixel and one-pixel images, respectively, which are too small relative to the size of the entire breast to be useful for mammogram analysis. These levels are therefore omitted from the wavelet analysis to speed up calculation.
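The level count can be checked with a short Python sketch. It is an idealized calculation that assumes each pass exactly halves the side length; real wavelet implementations may produce slightly larger maps because of filter length and boundary padding. The function name is hypothetical.

```python
def dyadic_sizes(side):
    """Side length of the maps produced at each decomposition level."""
    sizes = []
    while side > 1:
        side //= 2  # dyadic sampling halves each dimension per pass
        sizes.append(side)
    return sizes

sizes = dyadic_sizes(1024)
for level, s in enumerate(sizes, start=1):
    print(f"level {level:2d}: {s} x {s} = {s * s} pixels")
```

Under this assumption the ninth and tenth levels contain four pixels and one pixel, respectively, which matches the reason given above for stopping at eight levels.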

  1. Choice of Transform Methods: Fig. 5 shows the original mammogram and the four views obtained at the first decomposition level when the db4 wavelet basis is used. The wavelet maps have a lower resolution than the original image, and each view is sensitive to different features in the image. For example, the horizontal detail detects vertical changes in intensity, the vertical detail detects horizontal changes in intensity, the diagonal detail responds when the intensity varies in both directions, and the approximation image is a low resolution version of the original image that is used as the input to the next coarser level of the decomposition. Fig. 6 shows the Fourier transform view of the original mammogram. Compared with the Fourier view, the wavelet transform provides a multi-resolution decomposition, which means that the wavelet maps at different levels reflect image features of different sizes. Furthermore, spatial information is partially conserved. The wavelet maps in Fig. 5 show the spatial distribution of information at particular size scales; in contrast, the Fourier transform loses the spatial information and simply produces a map of the relative contributions of different frequencies over the entire image. This spatial information is useful for finding localized structures, such as microcalcifications and masses. These structures remain localized after the wavelet transform is applied, and they can then be distinguished from a more homogeneous background.

    Fig. 5: First level db4 wavelet decomposition: A. Original mammography image; B. Approximation view; C. Horizontal detail view; D. Vertical detail view; E. Diagonal view.
    Fig. 6: The Fourier transform view of the mammogram in Fig. 5.

  2. Choice of Measurement: in our experiment, four statistical features were extracted: the mean, standard deviation, skewness, and kurtosis of the pixel intensities. The mammogram analysis system then uses some of these features to classify mammogram images as normal or cancerous.

    • Mean: the mean in this paper is obtained by calculating the average pixel value of the tissue region in the mammogram image. The equation is given by

      $$ \mu = \frac{1}{N} \sum_{(x,y) \in T} p(x,y) \qquad (2) $$

      where $p(x,y)$ is the pixel value at point $(x,y)$ of the mammogram image, $T$ is the tissue region, and $N$ is the number of pixels in the tissue region of the image. The mean feature measures the average value of each detail view at the different decomposition levels. Microcalcifications are usually tiny and bright. Compared with normal samples, they have a slightly higher intensity in the high resolution maps. Masses, in contrast, vary in size and shape, and can range from millimetres to several centimetres in width. Therefore, masses cannot be extracted from the background tissue through a single scale or wavelet basis. However, masses are located in one region of tissue, and they are usually brighter than normal tissue. As a result, a slightly larger mean intensity can be measured through a wavelet basis, especially when different scales are used to detect masses.

    • Standard Deviation: the standard deviation $\sigma$, the estimate of the mean square deviation of grey pixel values, describes the dispersion of a local region. It is defined as

      $$ \sigma = \sqrt{\frac{1}{N} \sum_{(x,y) \in T} \big(p(x,y) - \mu\big)^{2}} \qquad (3) $$

      It measures the variability in the brightness of the image over the tissue region. The standard deviation increases in the high spatial resolution levels of wavelet map images that contain microcalcifications or masses, because these are brighter than the normal parts of mammogram images.

    • Skewness: the third statistical feature measured from each wavelet map image is the skewness of the pixel intensities, which measures the degree of asymmetry. The skewness of a distribution of values is defined as the third central moment of the distribution, normalized by the cube of the standard deviation. It is given by

      $$ S = \frac{\frac{1}{N} \sum_{(x,y) \in T} \big(p(x,y) - \mu\big)^{3}}{\sigma^{3}} \qquad (4) $$

      A distribution with a larger right tail has a positive skewness. Even when there is no significant difference in the mean value or standard deviation, the skewness still changes, because it is sensitive to the addition of a small number of unusually small or large values to a distribution.

    • Kurtosis: the fourth statistical feature measured from the wavelet maps is the kurtosis of the pixel intensities. The kurtosis of a distribution of values is defined as the fourth central moment of the distribution, normalized by the fourth power of the standard deviation of the distribution. The kurtosis K is given by

      $$ K = \frac{\frac{1}{N} \sum_{(x,y) \in T} \big(p(x,y) - \mu\big)^{4}}{\sigma^{4}} \qquad (5) $$

      Kurtosis measures the narrowness of the central peak of a distribution compared with the size of the distribution’s tails. A distribution with a narrow peak and tails that drop off slowly has a large kurtosis compared with a distribution with a relatively wide peak but suppressed tails. The kurtosis and standard deviation of a distribution may be similar, but kurtosis is more sensitive than the standard deviation to points distant from the mean. Because of this, kurtosis is sensitive to the presence of microcalcifications and masses: it rises when the number of unusually bright pixels in a wavelet map increases.
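The four measurements described above can be sketched in Python as moments of the tissue pixels. This is an illustrative sketch under stated assumptions: the moments are normalized by 1/N (the text does not say whether N or N−1 was used), a flat region is given zero skewness and kurtosis by convention, and the function names and the small example map are hypothetical.

```python
import math

def tissue_pixels(image, mask):
    """Collect the pixel values that lie inside the tissue mask (mask value 1)."""
    return [p for row, mrow in zip(image, mask) for p, m in zip(row, mrow) if m]

def moment_features(pixels):
    """Return (mean, standard deviation, skewness, kurtosis) of the pixel values."""
    n = len(pixels)
    mean = sum(pixels) / n
    # Central moments of order 2, 3 and 4 with 1/N normalisation (an assumption).
    m2 = sum((p - mean) ** 2 for p in pixels) / n
    m3 = sum((p - mean) ** 3 for p in pixels) / n
    m4 = sum((p - mean) ** 4 for p in pixels) / n
    std = math.sqrt(m2)
    if std == 0.0:
        return mean, 0.0, 0.0, 0.0  # flat region: higher moments set to 0 by convention
    return mean, std, m3 / std ** 3, m4 / std ** 4

# Hypothetical wavelet map with one unusually bright pixel inside the tissue region.
wavelet_map = [[0.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
tissue_mask = [[0, 1, 1], [0, 1, 1]]
mean, std, skew, kurt = moment_features(tissue_pixels(wavelet_map, tissue_mask))
```

In this toy example the single bright pixel produces a clearly positive skewness, which illustrates why the higher moments respond to a small number of unusually bright pixels.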

III-B Feature Selection

Since a large number of potential classification features are generated from each mammogram image, a selection process is needed to choose the features that are most effective at differentiating between normal and cancerous images. Specifically, four parameters are measured from each wavelet map, with four wavelet maps per level and eight levels of decomposition. Thus, 16 features can be generated from each level of decomposition. To eliminate some of these, we followed Terki et al. [4], who noted that the peak signal-to-noise ratio (PSNR) improved as the level of decomposition increased, and that the image quality was better from the third level of decomposition onward. Therefore, levels 3 to 8 of each of the three wavelet transforms were used in this work. In this case, 96 features are generated from each of the three wavelet transforms, based on the 6 levels of wavelet decomposition.

Then, the 96 features generated from each wavelet transform were combined with the 6 features extracted from the Fourier transform. In other words, 3 different feature sets were created, each containing the features from one wavelet transform and the Fourier transform.
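The feature counts above follow from simple arithmetic. The per-set total of 102 is derived here from the stated numbers and is not reported explicitly in the text:

$$ \underbrace{4}_{\text{statistics}} \times \underbrace{4}_{\text{maps per level}} = 16, \qquad 16 \times \underbrace{6}_{\text{levels 3 to 8}} = 96, \qquad 96 + \underbrace{6}_{\text{Fourier}} = 102 $$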

We adopted entropy-based feature selection in our work. The entropy of a variable X is defined as

$$ H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) \qquad (6) $$

and the entropy of X after observing values of another variable Y is defined as

$$ H(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j) \log_2 P(x_i \mid y_j) \qquad (7) $$

where $P(x_i)$ is the prior probability of each value of X, and $P(x_i \mid y_j)$ is the posterior probability of X given the values of Y. The amount by which the entropy of X decreases reflects the additional information about X provided by Y, and is called information gain [11], given by

$$ IG(X \mid Y) = H(X) - H(X \mid Y) \qquad (8) $$

If $IG(X \mid Y) > IG(Z \mid Y)$, a feature Y is regarded as more correlated to feature X than to feature Z.

The entropy-based feature selection algorithm was implemented in the following steps:

  • Order the features by decreasing entropy, computed using (6), and build a linked list of all features;

  • Calculate the entropy of each feature in the linked list conditioned on the classification results, using (7);

  • Calculate the information gain of each feature using (8), based on its two entropies obtained in the first two steps;

  • Compare each feature’s information gain with that of the next feature, and move the larger one ahead, until the end of the linked list is reached;

  • Select the features whose information gain is larger than the threshold set in the program.

To select the most effective features for differentiating between normal and cancerous mammogram images, less significant features are removed by the entropy-based algorithm. This selection was achieved by sorting the features and selecting those with higher information gain values. The experimental results suggested that the information gain of features from the db4-Fourier transform was higher than that of features from the bior6.8-Fourier transform, and that the information gain of features from the db2-Fourier transform was the lowest among the three feature sets. From the features of the db2, db4, bior6.8, and Fourier transforms, we selected the top 12 features (the optimal features), whose information gain values were all higher than 0.74.
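The selection procedure can be sketched in Python as follows. This is an illustration, not the Java program used in this work: it assumes the feature values have already been discretized (here into two bins, which the text does not specify), it uses a plain sort instead of a linked list, and the feature names and toy data are hypothetical. Only the 0.74 threshold is taken from the text.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X|Y): entropy of x within each group of equal y, weighted by P(y)."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / len(x) * entropy(g) for g in groups.values())

def information_gain(x, y):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(x) - conditional_entropy(x, y)

def select_features(features, labels, threshold):
    """Rank features by information gain and keep those above the threshold."""
    gains = {name: information_gain(values, labels) for name, values in features.items()}
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    return [(name, gain) for name, gain in ranked if gain > threshold]

# Toy data: 0 = normal, 1 = cancerous; feature values are already binned.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "feature_a": [0, 0, 0, 0, 1, 1, 1, 1],  # follows the class exactly
    "feature_b": [0, 0, 0, 1, 1, 1, 1, 1],  # partially informative
    "feature_c": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the class
}
selected = select_features(features, labels, threshold=0.74)
```

With two balanced classes the information gain is at most one bit, so a threshold of 0.74 keeps only features that remove most of the class uncertainty; in this toy example only `feature_a` passes.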

III-C Image Classification

In the final image classification stage, mammogram images are classified as either normal or cancerous based on the selected features. In the proposed system, three classifiers (Linear Discriminant Analysis, Back-propagation Network, and Naive Bayes Classifier) were trained and tested. Moreover, a voting classification scheme that combines these classifiers (LDA, BP, NB) is proposed for the mammogram analysis system. In the voting scheme, each classifier outputs “1” for a cancerous mammogram and “0” for a normal one. When classifying a mammogram image, the voting decision is the majority opinion of the three classifiers.
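The voting rule can be written in a few lines of Python. The function name and the per-image predictions below are hypothetical and only illustrate the majority decision over three 0/1 outputs.

```python
def majority_vote(votes):
    """Return 1 (cancerous) if more than half of the 0/1 votes are 1, else 0 (normal)."""
    return 1 if sum(votes) * 2 > len(votes) else 0

# Hypothetical outputs of the three classifiers for four images.
lda_pred = [1, 0, 1, 0]
bp_pred = [1, 0, 0, 0]
nb_pred = [0, 1, 1, 0]

voted = [majority_vote(v) for v in zip(lda_pred, bp_pred, nb_pred)]
print(voted)
```

Because the number of classifiers is odd and the labels are binary, a tie cannot occur.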

  1. Linear Discriminant Analysis: the objective of LDA is to place the data points of different classes as far apart from each other as possible, while keeping the data points of the same class as close together as possible. It can be implemented as follows:

    • Constructing a matrix of feature vectors: all feature samples were read in as a matrix $X$. Each feature vector $x_i$ was regarded as a node $i$ and, in the same way, another feature vector $x_j$ was regarded as a node $j$. Nodes $i$ and $j$ were connected with a line if $x_i$ and $x_j$ were close and belonged to the same class.

    • Calculating scatter matrices: in this step, the between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$ were calculated using (9) and (10):

      $$ S_b = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^{T} \qquad (9) $$

      $$ S_w = \sum_{i=1}^{c} \sum_{k=1}^{n_i} \big(x_k^{(i)} - \mu_i\big)\big(x_k^{(i)} - \mu_i\big)^{T} \qquad (10) $$

      where $c$ is the number of classes, $\mu$ is the total sample mean vector, $n_i$ is the number of samples in class $i$, $\mu_i$ is the mean vector of class $i$, and $x_k^{(i)}$ is the $k$-th sample vector in class $i$. $S_b$ and $S_w$ are the between-class scatter matrix and the within-class scatter matrix, respectively.

    • LDA projection: data points were projected into the LDA subspace so that the matrix $S_w$ was non-singular. The transformation matrix of LDA is denoted by $W$. After projection, $S_b$ and $S_w$ became

      $$ \tilde{S}_b = W^{T} S_b W, \qquad \tilde{S}_w = W^{T} S_w W $$

    • Computing the projection matrices: after adding the Lagrange multiplier $\lambda$ and some derivation steps, the following equation, also called the Fisher Linear Discriminant, was obtained:

      $$ S_b W = \lambda S_w W $$

      It can be seen that $W$ is the eigenvector of the matrix $S_w^{-1} S_b$.

    • Linear embedding: with the eigenvector substituted for $W$, the embedding of each sample is easy to find from the projection equation, which involves $\mu_i$, the mean value (central point) of the samples in each class.

  2. Back-propagation Network: the aim of the algorithm is to classify mammograms into two categories: cancerous or normal. Because the input features are 14-dimensional and there are two kinds of mammograms to be classified, the structure of the BP network can be written as “14-15-2”: there are 14 nodes in the input layer, 15 nodes in the hidden layer, and 2 nodes in the output layer. Furthermore, after random sorting of 670 mammograms, 520 of them were randomly selected as the training dataset, and the remaining 150 were used to test the classification performance of the BP network. Training proceeds as follows:

    • Network initialization: according to the input and desired output values ($X$ and $Y$) of the network, we set $n$ nodes in the input layer, $l$ nodes in the hidden layer, and $m$ nodes in the output layer. The weight values ($\omega_{ij}$ and $\omega_{jk}$), the threshold value $a$ in the hidden layer, the threshold value $b$ in the output layer, the learning speed, and the activation function should also be initialized.

    • Calculation of the hidden layer output: the output $H$ is obtained from $X$, $\omega_{ij}$, and $a$:

      $$ H_j = f\Big(\sum_{i=1}^{n} \omega_{ij} x_i - a_j\Big), \qquad j = 1, 2, \ldots, l $$

      Here, $l$ is the number of nodes in the hidden layer and $f$ is an activation function. In this work, the activation function is chosen as

      $$ f(x) = \frac{1}{1 + e^{-x}} $$

    • Calculation of the output layer output: $O$ is determined from $H$, $\omega_{jk}$, and $b$:

      $$ O_k = \sum_{j=1}^{l} H_j \omega_{jk} - b_k, \qquad k = 1, 2, \ldots, m $$

    • Calculation of error: based on the network output $O$ and the expected output $Y$, we obtain the prediction error $e$:

      $$ e_k = Y_k - O_k, \qquad k = 1, 2, \ldots, m $$

    • Weight update: the weight values can be updated using the following equations:

      $$ \omega_{ij} \leftarrow \omega_{ij} + \eta H_j (1 - H_j)\, x_i \sum_{k=1}^{m} \omega_{jk} e_k, \qquad \omega_{jk} \leftarrow \omega_{jk} + \eta H_j e_k $$

      in which $\eta$ is the learning speed.

    • Threshold update: the threshold values can be updated using the following equations:

      $$ a_j \leftarrow a_j - \eta H_j (1 - H_j) \sum_{k=1}^{m} \omega_{jk} e_k, \qquad b_k \leftarrow b_k - \eta e_k $$

    • If the iteration is not over, the algorithm goes back to the second step.

  3. Naive Bayes Classifier: suppose that there are $m$ classes, $C = \{c_1, c_2, \ldots, c_m\}$, and that $A_1, A_2, \ldots, A_n$ are the features of one dataset. Given an instance whose feature vector is $x = (a_1, a_2, \ldots, a_n)$, the posterior probability that the instance belongs to a class $c_i$ is $P(c_i \mid x) = \frac{P(x \mid c_i) P(c_i)}{P(x)}$. The Naive Bayes Classifier can be represented as $C(x) = \arg\max_{c_i \in C} P(c_i) P(x \mid c_i)$; that is, the prediction accuracy is maximized when the instance is assigned to the class with the largest posterior probability. However, the posterior probability is difficult to calculate directly. Therefore, the “Naive Bayes hypothesis”, which assumes that all features are independent of each other, is introduced to the Naive Bayes Classifier. Thus, $P(x \mid c_i) = \prod_{k=1}^{n} P(a_k \mid c_i)$ and $P(x) = \prod_{k=1}^{n} P(a_k)$.

    The Naive Bayes classification algorithm can independently learn either the conditional probability $P(a_k \mid c_i)$ of each feature $A_k$ in the class $c_i$, or the probability $P(a_k)$ of each feature $A_k$. With $1/P(x)$ replaced by a normalization factor “$\alpha$”, the posterior probability becomes

    $$ P(c_i \mid x) = \alpha\, P(c_i) \prod_{k=1}^{n} P(a_k \mid c_i) \qquad (19) $$

    According to (19), the optimal classification $C(x)$ should satisfy

    $$ C(x) = \arg\max_{c_i \in C} P(c_i) \prod_{k=1}^{n} P(a_k \mid c_i) \qquad (20) $$
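A minimal Python sketch of the Naive Bayes decision rule described above follows. The text does not state how the per-feature likelihoods were modelled, so the Gaussian likelihood, the variance floor, the function names, and the two-feature toy data are all assumptions made for illustration. Log-probabilities are used so that the product over features becomes a sum.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean/variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small floor on the variance avoids division by zero (an assumption).
        variances = [max(sum((v - mu) ** 2 for v in col) / n, 1e-9)
                     for col, mu in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(v, mu, var):
    """Log-density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)

def predict(model, x):
    """Return the class maximising log P(c) + sum_k log P(a_k | c)."""
    scores = {
        label: math.log(prior) + sum(log_gaussian(v, mu, var)
                                     for v, mu, var in zip(x, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical two-feature training data: 0 = normal, 1 = cancerous.
X_train = [[0.10, 2.9], [0.12, 3.1], [0.11, 3.0],
           [0.30, 5.8], [0.28, 6.1], [0.32, 6.0]]
y_train = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X_train, y_train)
predictions = [predict(model, [0.11, 3.0]), predict(model, [0.31, 6.0])]
```

The normalization factor does not appear in the code because it is the same for every class and therefore does not change which class attains the maximum.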


IV Experiments

In this section, we conduct experiments to evaluate the performance of our system on a mammogram image dataset. Since we adopted an entropy-based feature selection method, we evaluated and compared the performance of the three individual classifiers (LDA, BP, NB) using optimal and non-optimal features. The classification performance of the three classifiers was also compared using Receiver Operating Characteristic (ROC) curves. Furthermore, we compared the performance of the voting scheme with that of each individual classifier (LDA, BP, NB).

IV-A Dataset

The dataset analyzed in this work was a gift from Eastern Health in Newfoundland and Labrador, Canada. It consists of 1487 mammogram images, divided into a training set of 1040 images and a test set of 447 images. The images were all anonymized and stored in DICOM format, a set of standard protocols for medical image processing, storage, printing, and transmission. Their use in our experiment was authorized by the Health Research Ethics Authority (HREA) under reference number 11312. All DICOM mammogram images were sampled to 1024×1024 pixels and converted to PGM format.

IV-B Implementation Details

In this experiment, the image processing and classification programs were developed in Matlab 2010b, and the feature selection program was developed in Java using Eclipse. The experiments were run on a Windows 10 computer with an Intel(R) Core(TM) i7-8700 CPU @ 3.2GHz and 16GB of RAM.

IV-C Evaluation Metrics

The performance of a mammography screening system can be measured by two parameters: sensitivity and specificity. Sensitivity (true positive rate) is the proportion of cases deemed abnormal when breast cancer is present. For example, if 100 women among 1000 screened patients have breast cancer but only 90 of them are detected, then the sensitivity is 90/100 or 90%. Sensitivity may depend on several factors, such as lesion size, breast tissue density, and overall image quality. In cancer screening protocols, sensitivity is deemed more important than specificity, because failure to diagnose breast cancer may result in serious health consequences for a patient. Almost fifty percent of medical malpractice cases relate to “false-negative mammograms” [2]. Specificity (true negative rate) is the proportion of cases deemed normal when breast cancer is absent. For example, if 100 cases of breast cancer are diagnosed in a set of 1000 patients, and the screening system finds 720 of the remaining 900 cancer-free cases to be normal, the specificity is 720/900 or 80%. Although the consequences of a false positive (diagnosing a normal patient as having breast cancer) are less severe than those of missing a positive diagnosis of cancer, specificity should also be as high as possible. False positive examinations can result in unnecessary follow-up examinations and procedures, and may lead to significant anxiety and concern for the patient.
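The two worked examples above can be reproduced with a short sketch. The function names are illustrative, and the counts assume that the 720 normal findings in the specificity example all fall among the 900 cancer-free patients.

```python
def sensitivity(tp, fn):
    """True positive rate: detected cancers / all cancers."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: correctly cleared cases / all cancer-free cases."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# 1000 screened patients: 100 with cancer (90 detected), 900 without (720 cleared).
tp, fn = 90, 10
tn, fp = 720, 180

print(f"sensitivity = {sensitivity(tp, fn):.2%}")
print(f"specificity = {specificity(tn, fp):.2%}")
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.2%}")
```

With these counts the overall accuracy is 810/1000, which shows how a single accuracy figure can hide the gap between sensitivity and specificity.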

IV-D Results

Table I summarizes the accuracy, sensitivity, and specificity of the three classifiers and the voting scheme based on the features selected from db2-Fourier, bior6.8-Fourier, db4-Fourier, and the optimal features. Features including standard deviation, kurtosis, and skewness extracted from the Fourier transform are obtained by the proposed entropy-based feature selection algorithm. By ranking the information gain, the 12 features with the highest information gain values are selected as the optimal features for classification. The Receiver Operating Characteristic curves were also plotted to facilitate comparison of the classifiers, as shown in Fig. 7.


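| Feature set | Metric | LDA | BP | NB | Voting |
|---|---|---|---|---|---|
| db2-Fourier features | Accuracy | 80.69% | 85.05% | 81.03% | 88.07% |
| db2-Fourier features | Sensitivity | 71.18% | 83.05% | 90.06% | 90.06% |
| db2-Fourier features | Specificity | 89.03% | 88.06% | 70.08% | 92.03% |
| bior6.8-Fourier features | Accuracy | 81.01% | 86.07% | 83.03% | 89.01% |
| bior6.8-Fourier features | Sensitivity | 72.06% | 84.65% | 91.71% | 91.70% |
| bior6.8-Fourier features | Specificity | 90.32% | 89.01% | 72.15% | 90.41% |
| db4-Fourier features | Accuracy | 84.03% | 89.06% | 86.02% | 89.08% |
| db4-Fourier features | Sensitivity | 74.24% | 87.55% | 93.20% | 95.01% |
| db4-Fourier features | Specificity | 93.06% | 92.05% | 74.07% | 95.81% |
| Optimal features | Accuracy | 88.02% | 94.14% | 89.83% | 96.06% |
| Optimal features | Sensitivity | 78.26% | 90.45% | 96.21% | 96.45% |
| Optimal features | Specificity | 96.88% | 95.03% | 80.60% | 97.45% |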
TABLE I: Classification performances of three classifiers for the training dataset
Fig. 7: ROC curves with the classifiers: A. LDA; B. BP; C. NB; D. Voting. Panels (a), (b), (c), and (d) show the performances of the classifiers based on the db2-Fourier, bior6.8-Fourier, db4-Fourier, and optimal features, respectively.
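An ROC curve such as those in Fig. 7 plots the true positive rate against the false positive rate as the decision threshold on the classifier score is varied. The following is a minimal sketch of that construction; the scores and labels are made up for illustration, and the function names are hypothetical rather than taken from the paper's Matlab code.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores.

    labels: 1 for cancerous (positive), 0 for normal (negative).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples tied at the same score are admitted together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up classifier scores (higher means "more likely cancerous").
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
print(pts)
print(auc(pts))
```

A curve that stays closer to the top-left corner, and therefore has a larger area under it, indicates a better trade-off between sensitivity and specificity.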

According to the results in Table I, the classifiers achieved their highest classification performance using the optimal features, followed by the features selected from the db4 wavelet and Fourier transforms, whereas their accuracy was lowest using the features selected from the db2 wavelet and Fourier transforms. The specificity of the voting classifier with the optimal features is 1.64 percentage points higher than with the db4-Fourier features, 7.04 percentage points higher than with the bior6.8-Fourier features, and 5.42 percentage points higher than with the db2-Fourier features. Moreover, the specificity of each of the LDA, BP, and NB classifiers with the optimal features is higher than with any of the other three feature sets, i.e., the db4-Fourier, db2-Fourier, and bior6.8-Fourier features.

In addition, the voting classification scheme achieves higher accuracy than each individual base classifier on all four feature sets. The reason that the optimal features achieve the highest performance could be that their information gain is the highest among the four feature sets. In other words, the features in the optimal feature set are more strongly correlated with the mammogram class than any of the other features.
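Ranking features by information gain can be sketched as follows. The sketch assumes that each continuous feature has already been discretized into bins, which is an assumption made here for illustration; the feature names and values are made up, and the paper's own selection program was written in Java.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(class) - H(class | feature) for a discrete (already binned) feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Rank named feature columns by information gain and keep the best k."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]


labels = ["normal", "normal", "cancerous", "cancerous"]
columns = {
    "skewness_bin": ["low", "low", "high", "high"],  # separates the classes
    "kurtosis_bin": ["low", "high", "low", "high"],  # carries no class information
}
print(top_k_features(columns, labels, k=1))
```

In the paper's setting, `k` would be 12, the number of optimal features reported in the results.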

Furthermore, Table I and the ROC curves show that the voting scheme achieves the highest accuracy using the default parameters in the proposed mammogram analysis system. Among the three single classifiers (LDA, BP, NB), the Naive Bayes Classifier achieves the highest sensitivity, the LDA classifier achieves the highest specificity, and the Back Propagation network achieves the highest accuracy on all four feature sets. This result suggests that, among the three classifiers, the NB classifier is more sensitive in classifying cancerous mammograms, the LDA classifier gives better classification of normal mammograms, and the BP neural network works well on both normal and cancerous mammograms. The voting classification scheme performs best overall.
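A voting scheme over the three base classifiers can be sketched as below. This section does not restate the exact voting rule, so the sketch assumes a simple majority vote; the function names and the prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most base classifiers for one image."""
    return Counter(predictions).most_common(1)[0][0]


def vote_all(per_classifier_predictions):
    """Combine per-classifier prediction lists image by image."""
    return [majority_vote(sample) for sample in zip(*per_classifier_predictions)]


# Made-up predictions of the three base classifiers on four images.
lda = ["normal", "cancerous", "normal", "normal"]
bp = ["normal", "cancerous", "cancerous", "normal"]
nb = ["cancerous", "cancerous", "cancerous", "normal"]
print(vote_all([lda, bp, nb]))
```

With three classifiers and two classes a tie cannot occur, so the majority label is always well defined.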

V Conclusions

This paper proposes a computer-aided automatic mammogram analysis system. Its application to a real-world mammogram image dataset from Eastern Health in Newfoundland and Labrador, Canada, demonstrates the effectiveness of the proposed system. In future work, we plan to apply these results along with our previous work [19, 18, 25] to the fair allocation of health care resources.


  • [1] H. Al-Shamlan and A. El-Zaart (2010) Feature extraction values for breast cancer mammography images. In 2010 International Conference on Bioinformatics and Biomedical Technology, pp. 335–340.
  • [2] L. Berlin (2003) Breast cancer, mammography, and malpractice litigation: the controversies continue. American Journal of Roentgenology 180 (5), pp. 1229–1237.
  • [3] R. L. Birdwell (2009) The preponderance of evidence supports computer-aided detection for screening mammography. Radiology 253 (1), pp. 9–16.
  • [4] N. Doghmane, Z. Baarir, N. Terki, and A. Ouafi (2003) Study of effect of filters and decomposition level in wavelet image compression. Courrier du Savoir 3 (1), pp. 41–45.
  • [5] E. J. Kendall, M. G. Barnett, and K. Chytyk-Praznik (2013) Automatic detection of anomalies in screening mammograms. BMC Medical Imaging 13 (1), pp. 43.
  • [6] H. Liu and L. Yu (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17 (4), pp. 491–502.
  • [7] J. McLaughlin, D. Dryer, H. Logan, Y. Mao, L. Marrett, H. Morrison, B. Schacter, G. Villeneuve, C. Waters, and R. Semenciw (2006) Canadian cancer statistics 2006. Toronto (Canada): National Cancer Institute of Canada.
  • [8] N. Otsu (1979) A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9 (1), pp. 62–66.
  • [9] Q. Qin and Z. Yang (1994) Practical wavelet analysis. Press of Xi’an University of Electronic Science and Technology.
  • [10] Y. Saeys, I. Inza, and P. Larrañaga (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23 (19), pp. 2507–2517.
  • [11] M. Schneiders (2001) Wavelets in control engineering (master’s thesis). Eindhoven University of Technology, Eindhoven.
  • [12] O. R. Shahin and G. Attiya (2014) Classification of mammograms tumors using fourier analysis. IJCSNS 14 (2), pp. 110.
  • [13] J. O. Smith (2002) Mathematics of the discrete fourier transform (dft). Center for Computer Research in Music and Acoustics (CCRMA), Department of Music, Stanford University, Stanford, California.
  • [14] S. Wu, B. Ye, Z. Yuan, Q. Shen, L. Bao, Z. Ding, S. Hong, J. Li, J. Chen, and Y. Zhu (1996) Application research of laser scanning microscope for early diagnosis of tumors. In Lasers in Medicine and Dentistry: Diagnostics and Treatment, Vol. 2887, pp. 190–192.
  • [15] L. Yu and H. Liu (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proceedings of the 20th international conference on machine learning (ICML-03), pp. 856–863.
  • [16] L. Zhang and W. Zhang (2014) A comparison of different pattern recognition methods with entropy based feature reduction in early breast cancer classification.
  • [17] M. Zhang, X. Zhao, W. Zhang, A. Chaddad, A. Evans, and J. B. Poline (2020) Deep discriminative learning for autism spectrum disorder classification. In DEXA, pp. 435–443.
  • [18] W. Zhang and A. Bifet (2020) FEAT: a fairness-enhancing and concept-adapting decision tree classifier. In International Conference on Discovery Science, pp. 175–189.
  • [19] W. Zhang and E. Ntoutsi (2019) FAHT: an adaptive fairness-aware decision tree classifier. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 1480–1486.
  • [20] W. Zhang, J. Tang, and N. Wang (2016) Using the machine learning approach to predict patient survival from high-dimensional survival data. pp. 1234–1238.
  • [21] W. Zhang, X. Tang, and J. Wang (2019) On fairness-aware learning for non-discriminative decision-making. In 2019 International Conference on Data Mining Workshops (ICDMW), pp. 1072–1079.
  • [22] W. Zhang, J. Wang, D. Jin, L. Oreopoulos, and Z. Zhang (2018) A deterministic self-organizing map approach and its application on satellite data based cloud type classification. pp. 2027–2034.
  • [23] W. Zhang and J. Wang (2017) A hybrid learning framework for imbalanced stream classification. In 2017 IEEE International Congress on Big Data (BigData Congress), pp. 480–487.
  • [24] W. Zhang and J. Wang (2018) Content-bootstrapped collaborative filtering for medical article recommendations. pp. 1184–1188.
  • [25] W. Zhang, M. Zhang, J. Zhang, Z. Liu, Z. Chen, J. Wang, E. Raff, and E. Messina (2020) Flexible and adaptive fairness-aware learning in non-stationary data streams. In ICTAI.
  • [26] W. Zhang (2017) Phd forum: recognizing human posture from time-changing wearable sensor data streams. In 2017 IEEE International Conference on Smart Computing (SMARTCOMP), pp. 1–2.