On Breast Cancer Detection: An Application of Machine Learning Algorithms on the Wisconsin Diagnostic Dataset

by Abien Fred Agarap, et al.
Adamson University

This paper presents a comparison of six machine learning (ML) algorithms: GRU-SVM (Agarap, 2017), Linear Regression, Multilayer Perceptron (MLP), Nearest Neighbor (NN) search, Softmax Regression, and Support Vector Machine (SVM) on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (Wolberg, Street, & Mangasarian, 1992) by measuring their classification test accuracy and their sensitivity and specificity values. The dataset consists of features computed from digitized images of fine needle aspirate (FNA) tests on a breast mass (Wolberg, Street, & Mangasarian, 1992). For the implementation of the ML algorithms, the dataset was partitioned as follows: 70% for the training phase and 30% for the testing phase. The hyper-parameters were manually assigned. Results show that all the presented ML algorithms performed well (all exceeded 90% test accuracy) on the classification task. The MLP algorithm stands out among the implemented algorithms with a test accuracy of 99.04%. The results are comparable with the findings of related studies (Salama, Abdelhalim, & Zeid, 2012; Zafiropoulos, Maglogiannis, & Anagnostopoulos, 2006).


1. Introduction

Breast cancer is one of the most common cancers, along with lung and bronchus cancer, prostate cancer, colon cancer, and pancreatic cancer, among others (nci, 2017). Representing 15% of all new cancer cases in the United States alone (sur, [n. d.]), it is a research topic of great value.

The use of data science and machine learning approaches in medical fields has proven productive, as such approaches can assist medical practitioners in their decision making. With the unfortunate upward trend in breast cancer cases (sur, [n. d.]) also comes a large amount of data, which is of significant use in furthering clinical and medical research, and even more so in applying data science and machine learning to this domain.
Prior studies have recognized the importance of this research topic (Salama et al., 2012; Zafiropoulos et al., 2006); they proposed the use of machine learning (ML) algorithms for the classification of breast cancer using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (Wolberg et al., 1992), and reported significant results.
This paper presents another study on the topic, introducing our recently proposed GRU-SVM model (Agarap, 2017). This ML algorithm combines a type of recurrent neural network (RNN), the gated recurrent unit (GRU) (Cho et al., 2014), with the support vector machine (SVM) (Cortes and Vapnik, 1995). Along with the GRU-SVM model, a number of ML algorithms are presented in Section 2.4, all of which were applied to breast cancer classification using WDBC (Wolberg et al., 1992).

2. Methodology

2.1. Machine Intelligence Library

Google TensorFlow (Abadi et al., 2015) was used to implement the machine learning algorithms in this study, with the aid of other scientific computing libraries: matplotlib (Hunter, 2007), numpy (Walt et al., 2011), and scikit-learn (Pedregosa et al., 2011).

2.2. The Dataset

The machine learning algorithms were trained to detect breast cancer using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (Wolberg et al., 1992). According to Wolberg et al. (1992), the dataset consists of features which were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe the characteristics of the cell nuclei found in the image (Wolberg et al., 1992).

Figure 1. Image from (Wolberg et al., 1992) as cited by (Zafiropoulos et al., 2006). Digitized images of FNA: (a) Benign, (b) Malignant.

There are 569 data points in the dataset: 212 malignant and 357 benign. The dataset features are as follows: (1) radius, (2) texture, (3) perimeter, (4) area, (5) smoothness, (6) compactness, (7) concavity, (8) concave points, (9) symmetry, and (10) fractal dimension. Each feature has three pieces of information (Wolberg et al., 1992): (1) mean, (2) standard error, and (3) “worst” or largest (mean of the three largest values), giving a total of 30 dataset features.

2.3. Dataset Preprocessing

To avoid inappropriate assignment of relevance, the dataset was standardized using Eq. 1.

$$ z = \frac{X - \mu}{\sigma} \tag{1} $$

where $X$ is the feature to be standardized, $\mu$ is the mean value of the feature, and $\sigma$ is the standard deviation of the feature. The standardization was implemented using StandardScaler().fit_transform() of scikit-learn (Pedregosa et al., 2011).
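As a minimal sketch of Eq. 1 applied to a single feature column, the following pure-Python function subtracts the mean and divides by the standard deviation. It assumes the population standard deviation (dividing by N), which is the convention scikit-learn's StandardScaler follows; the function name `standardize` and the sample values are hypothetical.

```python
import statistics


def standardize(column):
    """Return (x - mean) / std for every value in one feature column."""
    mean = statistics.fmean(column)
    # Population standard deviation (divides by N), assumed here to
    # mirror the convention used by scikit-learn's StandardScaler.
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


# Hypothetical values for a single feature (mean 5, standard deviation 2).
radius = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(radius)
```

After the transformation the column has zero mean and unit standard deviation. A constant column has zero standard deviation and would need special handling before dividing.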

2.4. Machine Learning (ML) Algorithms

This section presents the machine learning (ML) algorithms used in the study. The Stochastic Gradient Descent (SGD) learning algorithm was used for all the ML algorithms presented in this section except for GRU-SVM, Nearest Neighbor search, and Support Vector Machine. The code implementations may be found online at


2.4.1. GRU-SVM

We proposed a neural network architecture (Agarap, 2017) combining the gated recurrent unit (GRU) variant of recurrent neural network (RNN) and the support vector machine (SVM), for the purpose of binary classification.

$$ z = \sigma(W_z \cdot [h_{t-1}, x_t]) \tag{2} $$

$$ r = \sigma(W_r \cdot [h_{t-1}, x_t]) \tag{3} $$

$$ \tilde{h}_t = \tanh(W \cdot [r * h_{t-1}, x_t]) \tag{4} $$

$$ h_t = (1 - z) * h_{t-1} + z * \tilde{h}_t \tag{5} $$

where $z$ and $r$ are the update gate and reset gate of a GRU-RNN respectively, $\tilde{h}_t$ is the candidate value, and $h_t$ is the new RNN cell state value (Cho et al., 2014). In turn, $h_t$ is used as the predictor variable $x$ in the L2-SVM predictor function (given by $f(x) = wx + b$) of the network instead of the conventional Softmax classifier.
The weight parameters of the GRU-RNN are learned by the L2-SVM using the loss function given by Eq. 20. The computed loss is then minimized through Adam (Kingma and Ba, 2014) optimization. The same optimization algorithm was used for Softmax Regression (Section 2.4.5) and SVM (Section 2.4.6). Then, the decision function $f(x) = \operatorname{sign}(wx + b)$ produces a vector of scores for each cancer diagnosis: -1 for benign, and +1 for malignant. In order to get the predicted labels $y$ for a given data $x$, the $\operatorname{argmax}$ function is used (see Eq. 6).

$$ \text{predicted\_class} = \operatorname{argmax}(\operatorname{sign}(wx + b)) \tag{6} $$

The $\operatorname{argmax}$ function shall return the indices of the highest scores across the vector of predicted classes $wx + b$.
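The GRU update and the SVM decision described in this section can be sketched in plain Python as follows. The sketch assumes the standard GRU formulation of Cho et al. (2014) with bias terms omitted; the helper names (`gru_step`, `svm_decision`) and the all-zero toy weights are hypothetical, and nothing here reproduces the TensorFlow implementation used in the study.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, w_z, w_r, w_h):
    """One GRU time step; biases are omitted (an assumption of this sketch)."""
    concat = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(w_z, concat)]  # update gate
    r = [sigmoid(a) for a in matvec(w_r, concat)]  # reset gate
    gated = [r_i * h_i for r_i, h_i in zip(r, h_prev)] + x_t
    h_cand = [math.tanh(a) for a in matvec(w_h, gated)]  # candidate value
    # New state: interpolate between the old state and the candidate.
    return [(1 - z_i) * h_i + z_i * c_i
            for z_i, h_i, c_i in zip(z, h_prev, h_cand)]


def svm_decision(h_t, w, b):
    """Linear SVM decision on the GRU state: +1 (malignant) or -1 (benign)."""
    score = sum(w_i * h_i for w_i, h_i in zip(w, h_t)) + b
    return 1 if score >= 0 else -1


# Hypothetical toy setup: 1 input feature, 2 hidden units, zero weights.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
h_t = gru_step([1.0], [0.2, -0.4], zeros, zeros, zeros)
label = svm_decision(h_t, w=[1.0, 1.0], b=0.0)
```

With zero weights both gates evaluate to 0.5 and the candidate value to 0, so the new state is simply half of the previous state, which makes the interpolation in the last step easy to check by hand.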

2.4.2. Linear Regression

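Despite being an algorithm for regression problems, linear regression (see Eq. 7) was used as a classifier for this study. This was done by applying a threshold to the output of Eq. 7, i.e. subjecting the value of the regressand to Eq. 8.

$$ h_\theta(x) = \sum_{j=1}^{n} \theta_j x_j + b \tag{7} $$

$$ f(h_\theta(x)) = \begin{cases} 1, & h_\theta(x) \geq 0.5 \\ 0, & h_\theta(x) < 0.5 \end{cases} \tag{8} $$

To measure the loss of the model, the mean squared error (MSE) was used (see Eq. 9).

$$ L(y, \theta, x) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - h_\theta(x_i) \right)^2 \tag{9} $$

where $y_i$ represents the actual class of the $i$-th data point, $h_\theta(x_i)$ represents the predicted class, and $N$ is the number of data points. This loss is minimized using the SGD algorithm, which learns the parameters of Eq. 7. The same method of loss minimization was used for MLP and Softmax Regression.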
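The use of a regression output as a classifier can be illustrated with a short sketch: a linear output is computed, thresholded into a class label, and scored with the mean squared error. The threshold of 0.5 (for labels 0 and 1), the parameter values, and the function names are assumptions made for illustration only.

```python
def linear_output(theta, bias, features):
    """Weighted sum of the features plus a bias term."""
    return sum(t * x for t, x in zip(theta, features)) + bias


def classify(output, threshold=0.5):
    """Turn a regression output into a class label (threshold is assumed)."""
    return 1 if output >= threshold else 0


def mean_squared_error(actual, predicted):
    """Average of squared differences between labels and raw outputs."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


# Hypothetical parameters and data points.
theta, bias = [0.5, -0.25], 0.1
samples = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
labels = [1, 0, 1]

outputs = [linear_output(theta, bias, s) for s in samples]
predictions = [classify(o) for o in outputs]
loss = mean_squared_error(labels, outputs)
```

Note that the loss is computed on the raw outputs rather than on the thresholded labels, since the thresholding step has no useful gradient for SGD.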

2.4.3. Multilayer Perceptron

The perceptron model was developed by Rosenblatt (1958) based on the neuron model by McCulloch and Pitts (1943). The multilayer perceptron (MLP) (Bishop, 1995) consists of hidden layers (each composed of a number of perceptrons) that enable the approximation of any function through activation functions such as $\tanh$ or sigmoid $\sigma$.

$$ f(x) = \max(0, x) \tag{11} $$

For this study, the activation function used for MLP was ReLU (Hahnloser et al., 2000) (see Eq. 11), while there were three hidden layers that each consist of 500 nodes (500-500-500 architecture). As for the loss, it was computed using the cross entropy function (see Eq. 15).
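A forward pass through an MLP with ReLU hidden layers can be sketched as below. The study used three hidden layers of 500 nodes each; the toy network here has a single two-unit hidden layer so the arithmetic can be followed by hand. All weights and function names are hypothetical.

```python
def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(inputs, layers):
    """Apply ReLU after every hidden layer; the last layer returns raw scores."""
    activation = inputs
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return dense(activation, weights, biases)


# Hypothetical toy network: 2 inputs -> 2 hidden units -> 2 class scores.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]),
]
scores = mlp_forward([2.0, 1.0], layers)
```

The hidden layer produces pre-activations 1 and -1, ReLU zeroes the negative one, and the output layer then yields the scores 1.5 and -0.5. The full architecture would use the same loop with three hidden entries in `layers`.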

2.4.4. Nearest Neighbor

This is a form of an optimization problem that seeks to find the closest point $p$ in the dataset to a query point $q$. In this study, both the L1 (Manhattan, see Eq. 12) and L2 (Euclidean, see Eq. 13) norms were used to measure the distance between $p$ and $q$.

$$ \|p - q\|_1 = \sum_{i=1}^{n} |p_i - q_i| \tag{12} $$

$$ \|p - q\|_2 = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{13} $$

The code implementation was based on the work of Damien (2017) on GitHub. A learning algorithm such as SGD or Adam (Kingma and Ba, 2014) is not applicable to Nearest Neighbor search, as it is practically a geometric approach to classification.
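A nearest neighbor classifier under either norm can be sketched in a few lines: compute the distance from the query to every stored point and return the label of the closest one. The toy points, labels, and function names below are hypothetical and are not taken from the implementation the study was based on.

```python
import math


def l1_distance(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_neighbor_label(query, points, labels, distance):
    """Return the label of the stored point closest to the query."""
    index = min(range(len(points)), key=lambda i: distance(query, points[i]))
    return labels[index]


# Hypothetical stored points with their diagnoses.
points = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
labels = ["benign", "malignant", "benign"]
query = [2.5, 3.0]

label_l1 = nearest_neighbor_label(query, points, labels, l1_distance)
label_l2 = nearest_neighbor_label(query, points, labels, l2_distance)
```

No parameters are learned here; the stored points themselves act as the model, which is why a single pass over the data suffices.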

2.4.5. Softmax Regression

This is a classification model generalizing logistic regression to multinomial problems. But unlike linear regression (Section 2.4.2), which produces raw scores for the classes, softmax regression produces a probability distribution over the classes. This is accomplished using the Softmax function (see Eq. 14).

$$ \hat{y}_i = \frac{e^{s_i}}{\sum_{j} e^{s_j}} \tag{14} $$

$$ H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i \tag{15} $$

The loss is measured by using the cross entropy function (see Eq. 15), where $y$ represents the actual class, and $\hat{y}$ represents the predicted class; $s_i$ in Eq. 14 is the raw score for class $i$.
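To make the relation between raw scores, class probabilities, and the cross entropy loss concrete, here is a small sketch. Subtracting the maximum score before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result. The function names and inputs are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into a probability distribution."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(actual, predicted):
    """Cross entropy between a one-hot label and predicted probabilities."""
    return -sum(y * math.log(p) for y, p in zip(actual, predicted) if y > 0)


# Two equal raw scores give a 50/50 distribution.
probabilities = softmax([2.0, 2.0])
loss = cross_entropy([1.0, 0.0], probabilities)
```

With equal scores the model is maximally uncertain between the two classes, and the loss equals the natural logarithm of 2.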

2.4.6. Support Vector Machine

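Developed by Vapnik (Cortes and Vapnik, 1995), the support vector machine (SVM) was primarily intended for binary classification. Its main objective is to determine the optimal hyperplane $f(w, x) = w \cdot x + b$ separating two classes in a given dataset having input features $x$, and labels $y \in \{-1, +1\}$.

SVM learns by solving the following constrained optimization problem:

$$ \min \frac{1}{p} \|w\|_1 + C \sum_{i=1}^{p} \xi_i \tag{16} $$

$$ \text{s.t. } y_i (w \cdot x_i + b) \geq 1 - \xi_i \tag{17} $$

$$ \xi_i \geq 0, \quad i = 1, \ldots, p \tag{18} $$

where $\|w\|_1$ is the Manhattan norm, $\xi$ is a cost function, $C$ is the penalty parameter (may be an arbitrary value or a selected value using hyper-parameter tuning), and $p$ is the number of data points. Its corresponding unconstrained optimization problem is the following:

$$ \min \frac{1}{p} \|w\|_1 + C \sum_{i=1}^{p} \max\left(0, 1 - y_i (w \cdot x_i + b)\right) \tag{19} $$

where $w \cdot x + b$ is the predictor function. The objective of Eq. 19 is known as the primal form problem of L1-SVM, with the standard hinge loss. The problem with L1-SVM is the fact that it is not differentiable (Tang, 2013), as opposed to its variation, the L2-SVM:

$$ \min \frac{1}{p} \|w\|_2^2 + C \sum_{i=1}^{p} \max\left(0, 1 - y_i (w \cdot x_i + b)\right)^2 \tag{20} $$

The L2-SVM is differentiable and provides more stable results than its L1 counterpart (Tang, 2013).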
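The difference between the two objectives can be seen by evaluating both on the same toy data. The sketch below follows the description in this section (Manhattan norm with hinge loss for L1-SVM, squared Euclidean norm with squared hinge loss for L2-SVM); scaling the regularizer by the number of data points is an assumption, as are the function names and the toy values. The penalty C = 5 matches the value listed for SVM in Table 1.

```python
def hinge_terms(w, b, samples, labels):
    """max(0, 1 - y * (w . x + b)) per data point; labels are -1 or +1."""
    terms = []
    for x, y in zip(samples, labels):
        score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        terms.append(max(0.0, 1.0 - y * score))
    return terms


def l1_svm_objective(w, b, samples, labels, c):
    """Manhattan-norm regularizer plus C times the summed hinge loss."""
    regularizer = sum(abs(w_j) for w_j in w) / len(samples)  # assumed scaling
    return regularizer + c * sum(hinge_terms(w, b, samples, labels))


def l2_svm_objective(w, b, samples, labels, c):
    """Squared Euclidean regularizer plus C times the summed squared hinge."""
    regularizer = sum(w_j ** 2 for w_j in w) / len(samples)  # assumed scaling
    return regularizer + c * sum(t ** 2 for t in hinge_terms(w, b, samples, labels))


# Hypothetical parameters and data points.
w, b = [1.0, -1.0], 0.0
samples = [[2.0, 0.0], [0.0, 0.5], [0.5, 0.0]]
labels = [1, -1, -1]

l1_value = l1_svm_objective(w, b, samples, labels, c=5.0)
l2_value = l2_svm_objective(w, b, samples, labels, c=5.0)
```

The first point lies outside the margin and contributes nothing; the other two contribute hinge terms of 0.5 and 1.5. Squaring penalizes the larger violation more heavily and removes the kink at zero, which is what makes the L2 variant differentiable.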

2.5. Data Analysis

There were two phases of experiment for this study: (1) training phase, and (2) test phase. The dataset was partitioned by 70% (training phase) / 30% (testing phase). The parameters considered in the experiments were as follows: (1) Test Accuracy, (2) Epochs, (3) Number of data points, (4) False Positive Rate (FPR), (5) False Negative Rate (FNR), (6) True Positive Rate (TPR), and (7) True Negative Rate (TNR).
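The four rates listed above follow directly from the counts of a confusion matrix. A minimal sketch, assuming the malignant class is treated as the positive class (label 1) and using hypothetical predictions:

```python
def confusion_rates(actual, predicted, positive=1):
    """Compute accuracy, TPR, TNR, FPR, and FNR from paired labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / (tp + fn),  # sensitivity
        "tnr": tn / (tn + fp),  # specificity
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }


# Hypothetical test labels and predictions (1 = malignant, 0 = benign).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
rates = confusion_rates(actual, predicted)
```

By construction TPR + FNR = 1 and TNR + FPR = 1, so each pair of rates carries the same information. The function assumes both classes occur in `actual`; otherwise one of the denominators is zero.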

3. Results and Discussion

All experiments in this study were conducted on a laptop computer with Intel Core(TM) i5-6300HQ CPU @ 2.30GHz x 4, 16GB of DDR3 RAM, and NVIDIA GeForce GTX 960M 4GB DDR5 GPU. Table 1 shows the manually-assigned hyper-parameters used for the ML algorithms. Table 2 summarizes the experiment results. In addition to the reported results, the result from Zafiropoulos et al. (2006) is included for comparison.

| Hyper-parameters | GRU-SVM | Linear Regression | MLP | Nearest Neighbor | Softmax Regression | SVM |
|---|---|---|---|---|---|---|
| Batch Size | 128 | 128 | 128 | N/A | 128 | 128 |
| Cell Size | 128 | N/A | [500, 500, 500] | N/A | N/A | N/A |
| Dropout Rate | 0.5 | N/A | None | N/A | N/A | N/A |
| Epochs | 3000 | 3000 | 3000 | 1 | 3000 | 3000 |
| Learning Rate | 1e-3 | 1e-3 | 1e-2 | N/A | 1e-3 | 1e-3 |
| Norm | L2 | N/A | N/A | L1, L2 | N/A | L2 |
| SVM C | 5 | N/A | N/A | N/A | N/A | 5 |

Table 1. Hyper-parameters used for the ML algorithms.

| Parameter | GRU-SVM | Linear Regression | MLP | L1-NN | L2-NN | Softmax Regression | SVM |
|---|---|---|---|---|---|---|---|
| Accuracy | 93.75% | 96.09375% | 99.038449585420729% | 93.567252% | 94.736844% | 97.65625% | 96.09375% |
| Data points | 384000 | 384000 | 512896 | 171 | 171 | 384000 | 384000 |
| Epochs | 3000 | 3000 | 3000 | 1 | 1 | 3000 | 3000 |
| FPR | 16.666667% | 10.204082% | 1.267042% | 6.25% | 9.375% | 5.769231% | 6.382979% |
| FNR | 0 | 0 | 0.786157% | 6.542056% | 2.803738% | 0 | 2.469136% |
| TPR | 100% | 100% | 99.213843% | 93.457944% | 97.196262% | 100% | 97.530864% |
| TNR | 83.333333% | 89.795918% | 98.732958% | 93.75% | 90.625% | 94.230769% | 93.617021% |

Table 2. Summary of experiment results on the ML algorithms.

Zafiropoulos et al. (2006) implemented the SVM with a Gaussian Radial Basis Function (RBF) kernel for classification on WDBC. Their experiment revealed that their SVM reached its highest test accuracy of 89.28% at a particular value of the kernel's free parameter. However, their experiment was based on a 60/40 partition (training/testing respectively), whereas the partition in this study was 70/30. Hence, a strictly fair comparison between the current study and Zafiropoulos et al. (2006) cannot be drawn, and the comparison below is only an intuitive one.
With a test accuracy of 96.09%, the L2-SVM in this study outperforms the SVM with Gaussian RBF of Zafiropoulos et al. (2006), which had a test accuracy of 89.28%. But then again, the L2-SVM was trained on a larger share of the data (70% vs. 60%).

Figure 2. Plotted using matplotlib (Hunter, 2007). Training accuracy of the ML algorithms on breast cancer detection using WDBC.
Figure 3. Plotted using matplotlib (Hunter, 2007). Scatter plot of mean features in the WDBC.
Figure 4. Plotted using matplotlib (Hunter, 2007). Scatter plot of error features in the WDBC.
Figure 5. Plotted using matplotlib (Hunter, 2007). Scatter plot of worst features in the WDBC.

Figure 2 shows the training accuracy of the ML algorithms: (1) GRU-SVM finished its training in 2 minutes and 54 seconds with an average training accuracy of 90.6857639%, (2) Linear Regression finished its training in 35 seconds with an average training accuracy of 92.8906257%, (3) MLP finished its training in 28 seconds with an average training accuracy of 96.9286785%, (4) Softmax Regression finished its training in 25 seconds with an average training accuracy of 97.366573%, and (5) L2-SVM finished its training in 14 seconds with an average training accuracy of 97.734375%. There was no recorded training accuracy for Nearest Neighbor search since it does not require any training, as the norm equations (Eq. 12 and Eq. 13) are directly applied on the dataset to determine the “nearest neighbor” of a given data point.
The empirical evidence presented in this section is qualitatively comparable with, and corroborates, the findings of Zafiropoulos et al. (2006), which is a testament to the effectiveness of ML algorithms in the diagnosis of breast cancer. While the experiment results are all commendable, the performance of the GRU-SVM model (Agarap, 2017) warrants a discussion. The mid-level performance of GRU-SVM, with a test accuracy of 93.75%, is hypothetically attributed to the following: (1) the non-linearities introduced to its output by the GRU model (Cho et al., 2014) through its gating mechanism (see Eq. 2, Eq. 3, and Eq. 4) may make it difficult to generalize on linearly separable data such as the WDBC dataset, and (2) RNNs are sensitive to weight initialization (Alalshekmubarak and Smith, 2013). Since the weights of the GRU-SVM model are initialized with arbitrary values, its results are also of limited reproducibility, even when using an identical configuration (Alalshekmubarak and Smith, 2013).
Despite these arguments, GRU-SVM remains comparable with the presented ML algorithms, as the results have shown. In addition, it was expected that the upper hand would go to the linear classifiers (Linear Regression and SVM), as the dataset used was linearly separable. The linear separability of the WDBC dataset is shown through a naive method of visualization (see Figure 3, Figure 4, and Figure 5). Visually, it is apparent that the scattered features in these figures may be easily separated by a linear function.

4. Conclusion and Recommendation

This paper presents an application of different machine learning algorithms, including the proposed GRU-SVM model in (Agarap, 2017), for the diagnosis of breast cancer. All presented ML algorithms exhibited high performance on the binary classification of breast cancer, i.e. determining whether a tumor is benign or malignant. Consequently, the statistical measures on the classification problem were also satisfactory.
To further substantiate the results of this study, a cross-validation (CV) technique such as k-fold cross validation should be employed. The application of such a technique will not only provide a more accurate measure of model prediction performance, but will also assist in determining the optimal hyper-parameters for the ML algorithms (Bengio et al., 2015).
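As a rough illustration of the recommended technique, the sketch below splits the indices of a dataset into k folds, each serving once as the test set while the remaining folds form the training set. The choice of k = 10 and the function name are hypothetical; the sketch uses contiguous folds without shuffling or stratification, both of which would usually be added in practice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 569 is the number of data points in WDBC; k = 10 is a hypothetical choice.
folds = list(k_fold_indices(569, 10))
fold_sizes = [len(test) for _, test in folds]
```

Each model would be trained and evaluated once per fold, and the reported metric would be the average over the folds, which makes the estimate less dependent on a single 70/30 partition.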

5. Acknowledgment

Deep appreciation is given to the family and friends of the author (in arbitrary order): Myra M. Maranan, Faisal E. Montilla, Corazon Fabreag-Agarap, Crystal Love Fabreag-Agarap, Michaelangelo Milo L. Lim, Liberato F. Ramos, Hyacinth Gasmin, Rhea Jude Ferrer, Ma. Pauline de Ocampo, and Abqary Alon.