On Breast Cancer Detection: An Application of Machine Learning Algorithms on the Wisconsin Diagnostic Dataset
This paper presents a comparison of six machine learning (ML) algorithms: GRU-SVM (Agarap, 2017), Linear Regression, Multilayer Perceptron (MLP), Nearest Neighbor (NN) search, Softmax Regression, and Support Vector Machine (SVM) on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (Wolberg, Street, & Mangasarian, 1992) by measuring their classification test accuracy and their sensitivity and specificity values. The dataset consists of features computed from digitized images of fine needle aspirate (FNA) tests on a breast mass (Wolberg, Street, & Mangasarian, 1992). For the implementation of the ML algorithms, the dataset was partitioned as follows: 70% for the training phase and 30% for the testing phase. The hyper-parameters were manually assigned. Results show that all the presented ML algorithms performed well (all exceeded 90% test accuracy) on the classification task. The MLP algorithm stands out among the implemented algorithms with a test accuracy of 99.04%. The results are comparable with the findings of related studies (Salama, Abdelhalim, & Zeid, 2012; Zafiropoulos, Maglogiannis, & Anagnostopoulos, 2006).
Breast cancer is one of the most common cancers, along with lung and bronchus cancer, prostate cancer, colon cancer, and pancreatic cancer, among others (nci, 2017). Representing 15% of all new cancer cases in the United States alone (sur, [n. d.]), it is a research topic of great value.
The use of data science and machine learning approaches in medical fields has proven fruitful, as such approaches can be of great assistance in the decision-making process of medical practitioners. With the unfortunate upward trend in breast cancer cases (sur, [n. d.]) comes a large amount of data that is of significant use in furthering clinical and medical research, and even more so in applying data science and machine learning to this domain.

The GRU-SVM model (Agarap, 2017) combines a type of recurrent neural network (RNN), the gated recurrent unit (GRU) (Cho et al., 2014), with the support vector machine (SVM) (Cortes and Vapnik, 1995). Along with the GRU-SVM model, a number of ML algorithms are presented in Section 2.4, all of which were applied to breast cancer classification with the aid of WDBC (Wolberg et al., 1992).

Google TensorFlow (Abadi et al., 2015) was used to implement the machine learning algorithms in this study, with the aid of other scientific computing libraries: matplotlib (Hunter, 2007), numpy (Walt et al., 2011), and scikit-learn (Pedregosa et al., 2011).

The machine learning algorithms were trained to detect breast cancer using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (Wolberg et al., 1992). According to Wolberg et al. (1992), the dataset consists of features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe the characteristics of the cell nuclei found in the image (Wolberg et al., 1992).
There are 569 data points in the dataset: 212 malignant and 357 benign. The dataset features are as follows: (1) radius, (2) texture, (3) perimeter, (4) area, (5) smoothness, (6) compactness, (7) concavity, (8) concave points, (9) symmetry, and (10) fractal dimension. Three values are recorded for each feature (Wolberg et al., 1992): (1) mean, (2) standard error, and (3) “worst” or largest (mean of the three largest values), giving a total of 30 dataset features.
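As a quick check of the feature count, the sketch below enumerates the 30 columns as every pairing of the three recorded values with the ten measurements, and computes the class shares from the counts given above. The column naming and ordering are illustrative assumptions and may not match the layout of the original data file.

```python
from itertools import product

# The ten cell-nucleus measurements listed above.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# The three values recorded for each measurement.
STATISTICS = ["mean", "standard error", "worst"]


def feature_names():
    """Return one name per (statistic, measurement) pair.

    The naming scheme and ordering are hypothetical, chosen for illustration.
    """
    return [f"{stat} {feature}" for stat, feature in product(STATISTICS, BASE_FEATURES)]


def class_shares(malignant=212, benign=357):
    """Return (total, malignant share, benign share) for the reported counts."""
    total = malignant + benign
    return total, malignant / total, benign / total


if __name__ == "__main__":
    print(len(feature_names()))
    total, malignant_share, benign_share = class_shares()
    print(total, round(malignant_share, 4), round(benign_share, 4))
```

The class shares show that the dataset is moderately imbalanced, which is one reason sensitivity and specificity are reported alongside accuracy.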
To avoid inappropriate assignment of relevance, the dataset was standardized using Eq. 1.

$$z = \frac{X - \mu}{\sigma} \tag{1}$$

where $X$ is the feature to be standardized, $\mu$ is the mean value of the feature, and $\sigma$ is the standard deviation of the feature. The standardization was implemented using StandardScaler().fit_transform() of scikit-learn (Pedregosa et al., 2011).

This section presents the machine learning (ML) algorithms used in the study. The Stochastic Gradient Descent (SGD) learning algorithm was used for all the ML algorithms presented in this section except for GRU-SVM, Nearest Neighbor search, and Support Vector Machine. The code implementations may be found online at https://github.com/AFAgarap/wisconsin-breast-cancer.

We proposed a neural network architecture (Agarap, 2017) combining the gated recurrent unit (GRU) variant of recurrent neural network (RNN) and the support vector machine (SVM), for the purpose of binary classification.
$$z = \sigma(W_z \cdot [h_{t-1}, x_t]) \tag{2}$$

$$r = \sigma(W_r \cdot [h_{t-1}, x_t]) \tag{3}$$

$$\tilde{h}_t = \tanh(W \cdot [r * h_{t-1}, x_t]) \tag{4}$$

$$h_t = (1 - z) * h_{t-1} + z * \tilde{h}_t \tag{5}$$

where $z$ and $r$ are the update gate and reset gate of a GRU-RNN respectively, $\tilde{h}_t$ is the candidate value, and $h_t$ is the new RNN cell state value (Cho et al., 2014). In turn, $h_t$ is used as the predictor variable $x$ in the L2-SVM predictor function (given by $wx + b$) of the network instead of the conventional Softmax classifier. The parameters of the GRU-RNN are learned by the L2-SVM using the loss function given by Eq. 20. The computed loss is then minimized through Adam (Kingma and Ba, 2014) optimization. The same optimization algorithm was used for Softmax Regression (Section 2.4.5) and SVM (Section 2.4.6). Then, the decision function $f(x) = \mathrm{sign}(wx + b)$ produces a vector of scores for each cancer diagnosis: -1 for benign, and +1 for malignant. To get the predicted labels $y$ for a given data point $x$, the $\arg\max$ function is used (see Eq. 6).

$$\text{predicted\_class} = \arg\max\left(\mathrm{sign}(wx + b)\right) \tag{6}$$

The $\arg\max$ function returns the indices of the highest scores across the vector of predicted classes $wx + b$.
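The forward pass described above can be sketched in plain Python. This is a minimal illustration that assumes the standard GRU update of Cho et al. (2014), omits bias terms in the gates, and uses hypothetical names (`gru_step`, `svm_predict`, `W_z`, `W_r`, `W_h`). It is not the TensorFlow implementation used in the study, and the mapping from class index to diagnosis depends on the label encoding.

```python
import math


def sigmoid(value):
    """Logistic function used by the update and reset gates."""
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step; each weight matrix has shape hidden x (hidden + input)."""
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(W_z, concat)]          # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]          # reset gate
    gated = [r_i * h_i for r_i, h_i in zip(r, h_prev)] + x_t
    h_tilde = [math.tanh(a) for a in matvec(W_h, gated)]   # candidate value
    # Blend the previous state and the candidate value using the update gate.
    return [(1.0 - z_i) * h_i + z_i * c_i
            for z_i, h_i, c_i in zip(z, h_prev, h_tilde)]


def sign(value):
    """Return -1, 0, or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def svm_predict(h_t, W, b):
    """Apply the linear SVM scores to the GRU state and take the argmax of their signs.

    Ties resolve to the lowest class index.
    """
    scores = [s + b_k for s, b_k in zip(matvec(W, h_t), b)]
    signs = [sign(s) for s in scores]
    return max(range(len(signs)), key=lambda k: signs[k])


if __name__ == "__main__":
    zeros = [[0.0] * 3 for _ in range(2)]   # hidden size 2, input size 1
    h_t = gru_step([1.0], [2.0, -4.0], zeros, zeros, zeros)
    print(h_t, svm_predict(h_t, [[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0]))
```

With all-zero gate weights, both gates evaluate to 0.5 and the candidate value to 0, so the new state is simply half of the previous state, which makes the update rule easy to verify by hand.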
Despite being an algorithm for regression problems, linear regression (see Eq. 7) was used as a classifier in this study. This was done by applying a threshold to the output of Eq. 7, i.e. subjecting the value of the regressand to Eq. 8.

$$h_\theta(x) = \sum_{i=1}^{n} \theta_i x_i + \theta_0 \tag{7}$$

$$f(h_\theta(x)) = \begin{cases} 1 & h_\theta(x) \geq 0.5 \\ 0 & h_\theta(x) < 0.5 \end{cases} \tag{8}$$

To measure the loss of the model, the mean squared error (MSE) was used (see Eq. 9).

$$L(y, \theta, x) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - h_\theta(x_i) \right)^2 \tag{9}$$

where $y$ represents the actual class, and $h_\theta(x)$ represents the predicted class. This loss is minimized using the SGD algorithm, which learns the parameters $\theta$ of Eq. 7. The same method of loss minimization was used for MLP and Softmax Regression.
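The use of a regressor as a classifier can be illustrated with a short sketch: a linear score, a threshold that turns the score into a class, the MSE loss, and a single SGD update on one example. The threshold of 0.5 assumes labels encoded as 0 and 1, and the function names are hypothetical; both are assumptions made for illustration.

```python
def predict_score(theta, theta_0, x):
    """Linear model output: weighted sum of the features plus a bias term."""
    return sum(t * x_i for t, x_i in zip(theta, x)) + theta_0


def classify(score, threshold=0.5):
    """Turn the regression output into a class (threshold is an assumption)."""
    return 1 if score >= threshold else 0


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def sgd_step(theta, theta_0, x, y, learning_rate):
    """One stochastic gradient descent update on the squared error of one example."""
    error = predict_score(theta, theta_0, x) - y
    # The gradient of error**2 is 2 * error * x_i for theta_i and 2 * error for the bias.
    new_theta = [t - learning_rate * 2.0 * error * x_i for t, x_i in zip(theta, x)]
    new_theta_0 = theta_0 - learning_rate * 2.0 * error
    return new_theta, new_theta_0


if __name__ == "__main__":
    theta, theta_0 = [0.0, 0.0], 0.0
    x, y = [1.0, 2.0], 1.0
    before = (predict_score(theta, theta_0, x) - y) ** 2
    theta, theta_0 = sgd_step(theta, theta_0, x, y, learning_rate=0.1)
    after = (predict_score(theta, theta_0, x) - y) ** 2
    print(before, after, classify(predict_score(theta, theta_0, x)))
```

In this toy run the squared error on the example drops after one update, and the score crosses the threshold so the example is assigned to class 1.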
The perceptron model was developed by Rosenblatt (1958) based on the neuron model by McCulloch and Pitts (1943). The multilayer perceptron (MLP) (Bishop, 1995) consists of hidden layers (each composed of a number of perceptrons) that enable the approximation of any function through activation functions such as $\tanh$ or sigmoid $\sigma$.
(10) |
(11) |
Softmax regression is a classification model that generalizes logistic regression to multinomial problems. Unlike linear regression (Section 2.4.2), which produces raw scores for the classes, softmax regression produces a probability distribution over the classes. This is accomplished using the Softmax function (see Eq. 14).

$$\hat{y}_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \tag{14}$$

$$H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i \tag{15}$$

The loss is measured using the cross entropy function (see Eq. 15), where $y$ represents the actual class, $\hat{y}$ represents the predicted class, and $z_i$ in Eq. 14 is the raw score for class $i$.
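To make the difference between raw scores and a probability distribution concrete, the following sketch computes the softmax of a score vector and the cross entropy against a one-hot label. Subtracting the maximum score and clipping probabilities at a small `eps` are common numerical safeguards added here as assumptions; they are not described in the paper.

```python
import math


def softmax(scores):
    """Map raw class scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


if __name__ == "__main__":
    probabilities = softmax([2.0, 0.5])
    print(probabilities, cross_entropy([1, 0], probabilities))
```

Equal scores give a uniform distribution, for which the cross entropy of a two-class problem equals $\ln 2$.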
Developed by Vapnik (Cortes and Vapnik, 1995), the support vector machine (SVM) was primarily intended for binary classification. Its main objective is to determine the optimal hyperplane $f(w, x) = w \cdot x + b$ separating two classes in a given dataset having input features $x \in \mathbb{R}^p$ and labels $y \in \{-1, +1\}$.

SVM learns by solving the following constrained optimization problem:

$$\min \frac{1}{p} \|w\|_1 + C \sum_{i=1}^{p} \xi_i \tag{16}$$

$$\text{s.t. } y_i (w \cdot x_i + b) \geq 1 - \xi_i \tag{17}$$

$$\xi_i \geq 0, \quad i = 1, \ldots, p \tag{18}$$

where $\|w\|_1$ is the Manhattan norm, $\xi$ is a cost function, and $C$ is the penalty parameter (may be an arbitrary value or a value selected using hyper-parameter tuning). Its corresponding unconstrained optimization problem is the following:

$$\min \frac{1}{p} \|w\|_1 + C \sum_{i=1}^{p} \max\left(0, 1 - y_i (w \cdot x_i + b)\right) \tag{19}$$

where $w \cdot x + b$ is the predictor function. The objective of Eq. 19 is known as the primal form problem of L1-SVM, with the standard hinge loss. The problem with L1-SVM is that it is not differentiable (Tang, 2013), as opposed to its variation, the L2-SVM:

$$\min \frac{1}{p} \|w\|_2^2 + C \sum_{i=1}^{p} \max\left(0, 1 - y_i (w \cdot x_i + b)\right)^2 \tag{20}$$

The L2-SVM is differentiable and provides more stable results than its L1 counterpart (Tang, 2013).
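The contrast between the two losses can be sketched as follows. The code computes only the data-dependent part of each objective, the sum of hinge terms scaled by the penalty parameter, and deliberately leaves out the regularization term on the weights. Function names are hypothetical, and labels are assumed to be encoded as -1 and +1.

```python
def margins(w, b, X, y):
    """Return y_i * (w . x_i + b) for every example."""
    return [y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
            for x_i, y_i in zip(X, y)]


def hinge_loss(w, b, X, y, C=1.0):
    """Standard hinge loss (L1-SVM data term); not differentiable at margin 1."""
    return C * sum(max(0.0, 1.0 - m) for m in margins(w, b, X, y))


def squared_hinge_loss(w, b, X, y, C=1.0):
    """Squared hinge loss (L2-SVM data term); differentiable everywhere."""
    return C * sum(max(0.0, 1.0 - m) ** 2 for m in margins(w, b, X, y))


if __name__ == "__main__":
    X = [[2.0], [0.5], [-1.0]]
    y = [1, 1, 1]
    print(hinge_loss([1.0], 0.0, X, y, C=5.0))
    print(squared_hinge_loss([1.0], 0.0, X, y, C=5.0))
```

Examples with a margin of at least 1 contribute nothing to either loss, while the squared variant penalizes large violations more heavily and small violations more lightly than the standard hinge.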
There were two phases of experiment for this study: (1) training phase, and (2) test phase. The dataset was partitioned by 70% (training phase) / 30% (testing phase). The parameters considered in the experiments were as follows: (1) Test Accuracy, (2) Epochs, (3) Number of data points, (4) False Positive Rate (FPR), (5) False Negative Rate (FNR), (6) True Positive Rate (TPR), and (7) True Negative Rate (TNR).
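The partition sizes and the four rates can be sketched as below. Rounding the test share up is an assumption, chosen because it yields the 171 test points that appear in Table 2. The confusion-matrix counts in the usage example are back-calculated so that the rates match the L2-NN column of Table 2; they are an inference for illustration and are not reported in the paper.

```python
import math


def split_counts(n_samples, test_fraction=0.3):
    """Return (training size, test size); the test size is rounded up (assumption)."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


def rates(tp, fn, tn, fp):
    """Compute accuracy and the four rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "TPR": tp / (tp + fn),  # sensitivity
        "TNR": tn / (tn + fp),  # specificity
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }


if __name__ == "__main__":
    print(split_counts(569))
    # Hypothetical counts inferred from the L2-NN column of Table 2.
    for name, value in rates(tp=104, fn=3, tn=58, fp=6).items():
        print(f"{name}: {100 * value:.6f}%")
```

Note that TPR and FNR sum to 1, as do TNR and FPR, so two of the four rates determine the other two.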
All experiments in this study were conducted on a laptop computer with Intel Core(TM) i5-6300HQ CPU @ 2.30GHz x 4, 16GB of DDR3 RAM, and NVIDIA GeForce GTX 960M 4GB GDDR5 GPU. Table 1 shows the manually-assigned hyper-parameters used for the ML algorithms. Table 2 summarizes the experiment results. In addition to the reported results, the result from Zafiropoulos et al. (2006) is included for comparison.

Table 1: Hyper-parameters used for the ML algorithms.

Hyper-parameters | GRU-SVM | Linear Regression | MLP | Nearest Neighbor | Softmax Regression | SVM |
---|---|---|---|---|---|---|
Batch Size | 128 | 128 | 128 | N/A | 128 | 128 |
Cell Size | 128 | N/A | [500, 500, 500] | N/A | N/A | N/A |
Dropout Rate | 0.5 | N/A | None | N/A | N/A | N/A |
Epochs | 3000 | 3000 | 3000 | 1 | 3000 | 3000 |
Learning Rate | 1e-3 | 1e-3 | 1e-2 | N/A | 1e-3 | 1e-3 |
Norm | L2 | N/A | N/A | L1, L2 | N/A | L2 |
SVM C | 5 | N/A | N/A | N/A | N/A | 5 |

Table 2: Summary of experiment results on the ML algorithms.

Parameter | GRU-SVM | Linear Regression | MLP | L1-NN | L2-NN | Softmax Regression | SVM |
---|---|---|---|---|---|---|---|
Accuracy | 93.75% | 96.09375% | 99.038449585420729% | 93.567252% | 94.736844% | 97.65625% | 96.09375% |
Data points | 384000 | 384000 | 512896 | 171 | 171 | 384000 | 384000 |
Epochs | 3000 | 3000 | 3000 | 1 | 1 | 3000 | 3000 |
FPR | 16.666667% | 10.204082% | 1.267042% | 6.25% | 9.375% | 5.769231% | 6.382979% |
FNR | 0% | 0% | 0.786157% | 6.542056% | 2.803738% | 0% | 2.469136% |
TPR | 100% | 100% | 99.213843% | 93.457944% | 97.196262% | 100% | 97.530864% |
TNR | 83.333333% | 89.795918% | 98.732958% | 93.75% | 90.625% | 94.230769% | 93.617021% |

Zafiropoulos et al. (2006) implemented the SVM with a Gaussian Radial Basis Function (RBF) kernel for classification on WDBC. Their experiment revealed that their SVM had its highest test accuracy of 89.28% at a particular value of its free parameter. However, their experiment was based on a 60/40 partition (training/testing respectively). Hence, we cannot draw a strictly fair comparison between the current study and Zafiropoulos et al. (2006). An intuitive comparison may nevertheless be close to fair, recalling that the partition used in this study was 70/30.

Figure 2 shows the training accuracy of the ML algorithms: (1) GRU-SVM finished its training in 2 minutes and 54 seconds with an average training accuracy of 90.6857639%, (2) Linear Regression finished its training in 35 seconds with an average training accuracy of 92.8906257%, (3) MLP finished its training in 28 seconds with an average training accuracy of 96.9286785%, (4) Softmax Regression finished its training in 25 seconds with an average training accuracy of 97.366573%, and (5) L2-SVM finished its training in 14 seconds with an average training accuracy of 97.734375%. There was no recorded training accuracy for Nearest Neighbor search since it does not require any training: the norm equations (Eq. 12 and Eq. 13) are applied directly to the dataset to determine the “nearest neighbor” of a given data point.
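The training-free nature of Nearest Neighbor search can be illustrated with a short sketch: a query point simply takes the label of the closest stored point under a chosen norm. The L1 (Manhattan) and L2 (Euclidean) distances below are the standard definitions, and the function names are hypothetical.

```python
import math


def l1_distance(a, b):
    """Manhattan distance between two points."""
    return sum(abs(a_i - b_i) for a_i, b_i in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))


def nearest_neighbor_label(query, points, labels, distance):
    """Return the label of the stored point closest to the query."""
    nearest = min(range(len(points)), key=lambda i: distance(query, points[i]))
    return labels[nearest]


if __name__ == "__main__":
    points = [[0.0, 0.0], [3.0, 4.0]]
    labels = [0, 1]
    print(nearest_neighbor_label([2.0, 2.0], points, labels, l1_distance))
    print(nearest_neighbor_label([2.0, 2.0], points, labels, l2_distance))
```

Because every query is compared against all stored points, the cost is paid at prediction time rather than during training, which is consistent with the single "epoch" listed for this method.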
The empirical evidence presented in this section is qualitatively comparable with, and corroborates, the findings of Zafiropoulos et al. (2006), and is thus a testament to the effectiveness of ML algorithms in the diagnosis of breast cancer. While the experiment results are all commendable, the performance of the GRU-SVM model (Agarap, 2017) warrants a discussion. The mid-level performance of GRU-SVM, with a test accuracy of 93.75%, is hypothetically attributed to the following: (1) the non-linearities introduced to its output by the GRU model (Cho et al., 2014) through its gating mechanism (see Eq. 2, Eq. 3, and Eq. 4) may make it difficult to generalize on linearly separable data such as the WDBC dataset, and (2) the sensitivity of RNNs to weight initialization (Alalshekmubarak and Smith, 2013). Since the weights of the GRU-SVM model are initialized with arbitrary values, its results will also have limited reproducibility, even when using an identical configuration (Alalshekmubarak and Smith, 2013).

These arguments do not negate the fact that GRU-SVM is comparable with the presented ML algorithms, as the results have shown. In addition, it was expected that the linear classifiers (Linear Regression and SVM) would have the upper hand, as the dataset used was linearly separable. The linear separability of the WDBC dataset is shown through a naive method of visualization (see Figure 3, Figure 4, and Figure 5). Visually, the scattered features in these figures appear easily separable by a linear function.
This paper presents an application of different machine learning algorithms, including the GRU-SVM model proposed in Agarap (2017), for the diagnosis of breast cancer. All presented ML algorithms exhibited high performance on the binary classification of breast cancer, i.e. determining whether a tumor is benign or malignant. Consequently, the statistical measures on the classification problem were also satisfactory.

To further substantiate the results of this study, a cross-validation (CV) technique such as $k$-fold cross validation should be employed. Applying such a technique will not only provide a more accurate measure of model prediction performance, but will also assist in determining the optimal hyper-parameters for the ML algorithms (Bengio et al., 2015).
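As a sketch of how such a procedure partitions the data, the generator below yields the training and test indices of each fold so that every data point is held out exactly once. The choice of 10 folds in the usage example is an arbitrary assumption, and shuffling and stratification by class are omitted for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    for fold, (train, test) in enumerate(k_fold_indices(569, 10)):
        print(fold, len(train), len(test))
```

A model would be trained and evaluated once per fold, and the per-fold scores averaged to obtain the performance estimate.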
Deep appreciation is given to the family and friends of the author (in arbitrary order): Myra M. Maranan, Faisal E. Montilla, Corazon Fabreag-Agarap, Crystal Love Fabreag-Agarap, Michaelangelo Milo L. Lim, Liberato F. Ramos, Hyacinth Gasmin, Rhea Jude Ferrer, Ma. Pauline de Ocampo, and Abqary Alon.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.