Cancer is the first or second leading cause of premature death in 134 of 183 nations, and it ranks third in 45 other countries. Of the 15.2 million premature deaths worldwide in 2016, 4.5 million (29.8%) were caused by cancer. Lung cancer is the most prevalent type, with nearly 2.1 million new cases and 1.8 million deaths in 2018. Breast cancer is the most commonly diagnosed cancer in women, with 2,088,849 new cases in 2018, and causes the world's largest cancer mortality in women (626,679 deaths in 2018) cancerstat; wild2020world.
An abnormal new tissue growth, known as a neoplasm, may be benign or malignant; these terms refer to the overall biologic behavior of a tumor rather than to its morphologic characteristics. The word "cancer" commonly refers to malignant tumors, although in some conditions benign tumors can also be lethal rubin2014essentials. Contingent upon the growth location, a cancer can have different clinical features, including rapid proliferation, diminished growth control, metastasis, and loss of contact inhibition in vitro rodwell2015harpers.
Conventional techniques such as histological biopsy tests, CT imaging, magnetic resonance imaging (MRI), bronchoscopy, and magnetic resonance mammography (MRM) are essential tools for diagnosing different types of cancer fass2008imaging; prabhakar2018current; geller2010osteosarcoma. While biopsy-based technology can efficiently diagnose malignancy, major challenges remain in expediting clinical diagnosis from pathology imaging and in automated image processing wang2019pathology. Automatically classifying histopathological images is an imperative task in computer-assisted pathology research. Because of the variability in appearance caused by the heterogeneity of diseases, tissue preparation, and the staining process, deriving descriptive and concise information from histopathological images is considerably challenging feng2018deep. With the emergence of precision medicine, the workload and complexity of histopathologic cancer diagnosis have increased significantly for pathologists, and a huge number of slides must be examined over a long time before any decision can be made litjens2016deep.
Whole slide imaging (WSI), or virtual microscopy, digitizes glass slides to provide the digital images necessary for automated image analysis, and it is used in surgical pathology in regular practice pantanowitz2011review. WSI has brought an opportunity to mitigate the histopathology workload and to assist pathologists in examining and quantifying large numbers of image slides. WSI combined with artificial neural networks and deep learning gives tremendous confidence in improving the efficiency of diagnosing cancers litjens2016deep; lisboa2006use. The accuracy of the models is growing continuously as digital pathology expands and more datasets become publicly available ibrahim2020artificial.
In particular fields such as medical imaging, deep learning algorithms have substantially outperformed human experts schmidhuber2015deep. For the prediction of complicated tasks, deep learning algorithms use massive datasets. In computer vision tasks that involve interpreting images, deep learning has achieved a significantly high level of accuracy duggento2020deep. A CNN is a particular kind of neural network that is especially efficient for analyzing data with a grid-like topology. Convolution is a special linear operator for spatial and grid-like data such as digitized images. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers goodfellow2016deep. In recent years, deep learning architectures, particularly Deep Convolutional Neural Networks (DCNNs), have become the most effective practice for medical image analysis tasks such as diagnosis Talo_2019.
DCNNs have gained a lot of attention and have been utilized for many classification tasks. The introduction of AlexNet krizhevsky2012imagenet was a breakthrough in the field of machine and deep learning. The idea of having a deeper network has been the focus of many successful implementations of DCNNs, including the idea of Networks inside Networks (NIN) lin2013network. An example of this idea is the Inception szegedy2015going network, which achieved outstanding performance in object detection and classification. Consequently, plenty of studies have applied these implementations to medical image classification, especially breast cancer. In this work, however, we shift the focus from going deeper with CNNs to Concatenating multiple CNNs composed of Outer, Middle, and Inner networks. With this new architecture, which we call C-Net, several NINs operate simultaneously on multiple networks.
The main contributions in this work can be summarized in the following aspects:
We propose a novel deep convolutional neural network, called C-Net, to surmount shortcomings in current classification schemes.
We provide a comparative and successful application of the new architecture in the domain of cancer classification.
The model has been evaluated on two datasets differing in size, disease type, and image resolution: (1) BreakHis, a breast cancer dataset containing 7909 histopathology images with different magnification factors (40X, 100X, 200X, and 400X), and (2) the Osteosarcoma dataset, composed of 1144 images with a size of 1024×1024 pixels and 10X magnification.
The rest of the paper is organized as follows: Section 2 discusses related work on cancer classification. Section 3 provides a detailed explanation of the C-Net model, including the datasets, pre-processing, and the model architecture. The experimental results and performance analysis are presented in Section 4. Finally, we present the conclusion and suggest future directions in Section 5.
2 Related Works
The advancements of CNNs are so substantial that many CNN-based models have surpassed human capacities in numerous fields litjens2017survey. In classifying skin cancer, for example, DCNNs have reached the dermatologist level in identifying keratinocyte carcinomas versus benign seborrheic keratoses, and malignant melanomas versus benign nevi esteva2017dermatologist. Classifying and detecting lung cancer with high confidence at the radiologist level is another successful application of DCNNs, providing physicians an accurate diagnostic tool for detecting pulmonary nodules and classifying them as malignant or benign zhang2019toward.
Several CNN architectures have been implemented on WSI. Cruz-Roa et al. employed a CNN model for invasive tumor detection on WSI. The classifiers were trained on 400 samples and then tested on a separate set of 200 cases from The Cancer Genome Atlas. Compared against manually annotated regions of invasive ductal carcinoma, their approach achieved a Dice coefficient of 75.86% and a negative predictive value of 96.77% cruz2017accurate. Using Google's Inception v3 model, Chang et al. did a pilot study on classifying breast cancer histopathological images and achieved a value of 93% in Area Under the Curve (AUC) chang.
In another work in 2017, Wahab et al. wahab2017two developed a CNN model to classify mitotic and non-mitotic nuclei in breast cancer histopathology, which achieved an F-measure of 79%. Similarly, in roy2019patch, the authors proposed a patch-based classifier with two approaches, one patch in one decision (OPOD) and all patches in one decision (APOD), to automate the classification of histopathological images using the breast histology image dataset.
Fondon et al. utilized a computer-aided tool for automatic classification of tissue malignancy. A Support Vector Machine (SVM) was used as the kernel classifier, which achieved an accuracy of 75.8% in their experiments fondon2018automatic.
A structured deep learning model was proposed by Han et al. han2017breast, in which a distance constraint on the feature space was introduced to articulate feature-space similarities. The model achieved an average accuracy of 96% on binary classifications across different magnification factors. Their model used the idea of transfer learning, in which networks are trained on a large dataset and the weights are then transferred. Several other studies have also used transfer learning for classifying histopathology images xu2017effect; talo2019automated; de2019double. A different study was done by Celik et al. to identify invasive ductal carcinoma using transfer learning. They used two well-known pre-trained models, ResNet-50 and DenseNet-161, in their experiments and obtained F-scores of 92.38% and 94.11% for DenseNet-161 and ResNet-50, respectively celik2020automated.
Pratiher et al. pratiher2018grading applied a Bidirectional LSTM (Bi-LSTM) to manifold-encoded histopathological images of the quasi-isometric topological space as an automated categorization framework for breast cancer classification. Their method exploits morphological dynamics and contextual feature-space semantic constraints through the Bi-LSTM on histopathological manifolds.
In another model, called DMAE feng2018deep, a patch-based deep learning method comprises two stages: (1) an end-to-end deep neural network is trained on histopathology images, and (2) previously unseen patches of the test images are classified using the features learned by the proposed model.
Correct classification and detection of breast cancer as malignant or benign have been the focus of many studies araujo2017classification; spanhol2017deep; deniz2018transfer; awan2018context; wahab2019transfer. Another study in this regard was by Khan et al. khan2019novel, in which they proposed a framework for extracting low-level features from cytology images using transfer learning. Their results were observed to outclass other deep learning models for detecting and classifying breast cancer on cytology images, with an average accuracy of 97.67%. Gupta et al. gupta2019partially implemented a partially-independent framework to explore features across a multi-layer ResNet model for histopathology image classification. The average accuracy over the four magnification categories of the BreakHis dataset was 94.66%.
An integration of multi-layer features from a ResNet model, called the Partially-Independent Framework, was proposed by Gupta and Bhavsar gupta2019partially for breast cancer histopathological image classification. Since discriminative features are not learned by all the layers, they chose an optimal subset of layers based upon an information-theoretic measure (ITS). They used the XGBoost method for dimensionality reduction and an SVM with two polynomial kernels as their classifier framework. The highest accuracy, 97%, was obtained on 40X images.
In an attempt to classify histopathological images of Osteosarcoma into necrotic tumor, viable tumor, and non-tumor, Arunachalam et al. employed a combination of machine and deep learning models. Their model achieved an overall accuracy of 93.3%, with 92.7% for necrotic tumor, 95.3% for viable tumor, and 91.9% for non-tumor arunachalam2019viable. A more recent study was done by Fu and Xue fu2020deep for assessing viable and necrotic tumors in Osteosarcoma. Their DS-Net combines an auxiliary supervision network (ASN) with a classification network, where features extracted by the ASN are used for accurate classification. Their model achieved an overall accuracy of 94.5%, with class-wise accuracies of 92.2%, 93.6%, and 97.7% for non-tumor, necrotic tumor, and viable tumor respectively.
3.1.1 BreakHis dataset
The Breast Cancer Histopathological Image Classification (BreakHis) dataset is freely available (https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/) and is composed of 7909 images, including 2480 benign and 5429 malignant cases, at four magnification factors: 40X, 100X, 200X, and 400X. The images are 700×460 pixels with a three-channel RGB value at each pixel spanhol2015dataset. Table 1 shows the distribution of benign and malignant cases in the BreakHis dataset by the four magnification factors. Figure 1 displays sample images of all types and magnifications.
3.1.2 Osteosarcoma dataset
An entirely different dataset used to test our methods is composed of Hematoxylin and Eosin (H&E) stained Osteosarcoma histology images, provided by the University of Texas Southwestern Medical Center. The samples were obtained from the pathology reports of Osteosarcoma resections of 50 patients. The dataset contains 1144 images of 1024×1024 pixels at 10X magnification, including 536 non-tumor (NT), 263 necrotic tumor (NCT), and 345 viable tumor (VT) images osteodataset. Among the 345 viable tumor images, 53 are discarded from the experiments because they were labeled with both viable and necrotic tumor. Figure 2 shows some sample images from the Osteosarcoma dataset.
Machine/deep learning models are inherently data-hungry, posing great challenges in the medical imaging field, where big data for training a new model is often lacking. By generating artificial data and adding it to the training set, data augmentation provides a way to mitigate the obstacles of small dataset sizes shorten2019survey. To this end, all image intensities in our experiments are first scaled to the range between 0 and 1. The augmentation techniques applied to them include horizontal and vertical flipping, shearing with a factor of 0.2, zooming by 0.2, height and width shifting with a factor of 0.2, and sequential rotation by 40 degrees.
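The rescaling and flipping steps above can be sketched with plain NumPy; this is a minimal illustration only (the full pipeline, with shear, zoom, shift, and rotation, would normally be built with a library such as Keras' ImageDataGenerator):

```python
import numpy as np

def rescale(image):
    """Scale 8-bit pixel intensities to the range [0, 1]."""
    return image.astype(np.float64) / 255.0

def augment_flips(image):
    """Return the image plus its horizontal and vertical flips."""
    return [image, image[:, ::-1], image[::-1, :]]
```

Each augmented copy shares the original's label, so flipping effectively triples the number of training samples without new annotation.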
3.3 The C-Net architecture
In developing our new architecture, called C-Net, different networks are integrated to fulfill several goals. (1) Using the same architecture in the Outer networks results in a more stable and reliable way of extracting features, and using the same filter size (3×3) in the convolutional layers, inspired by the VGG-19 model simonyan2014very, provides better feature extraction compared to implementations with varying filter sizes. (2) The shortcomings of one network are compensated by another, since the networks work in parallel and features are extracted by different networks at different stages instead of going directly into the FC layers. (3) Unlike many other studies talo2019application; yildirim2019automated; geng2016deep; celik2020automated; wahab2019transfer; khan2019novel, in which a deep learning model is first trained on large datasets and the weights are then transferred to perform similar tasks, a unique feature of the proposed method is that transfer learning has not been used to perform the classification tasks. (4) The proposed model has fewer parameters to train than most successful architectures, typically fewer than 30M for an image size of 224×224 pixels.
As shown in Figure 3, the C-Net comprises three main parts: the Outer, Middle, and Inner networks. The Outer part is composed of four CNNs and works as a feature extractor. The architecture of each Outer network is as follows. The images first enter the input layer of all the Outer networks simultaneously, followed by several convolutional layers and then a max-pooling layer in the first block. Here, a block refers to a combination of several convolutional layers followed by a pooling layer. The first block has 64 filters of size 3×3 with the same padding, and the max-pooling filter size is 2×2 with a stride of 2. The same structure is repeated in the same order for three additional blocks, except that the number of filters is multiplied by 2 in each block but the final one. To prevent further reduction of the final output, the max-pooling layer has been dropped from the final block. The Rectified Linear Unit (ReLU) is applied to the convolutional layers as the activation function.
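The block arithmetic described above can be checked with a short sketch. This is a hypothetical helper, not the paper's code; it assumes the filter progression 64, 128, 256, 256 (doubling into every block except the final one) and that 3×3 "same"-padded convolutions preserve spatial size while each 2×2/stride-2 max-pool halves it:

```python
def outer_network_shapes(height, width, blocks=4, first_filters=64):
    """Trace the (H, W, C) feature-map shape through the Outer-network blocks."""
    shapes = []
    filters = first_filters
    h, w = height, width
    for b in range(blocks):
        # 3x3 'same'-padded convolutions leave the spatial size unchanged.
        if b < blocks - 1:           # every block but the last ends in a
            h, w = h // 2, w // 2    # 2x2 max-pool with stride 2
        shapes.append((h, w, filters))
        if b < blocks - 2:           # filters double, except into the final block
            filters *= 2
    return shapes
```

For a 224×224 input this yields 28×28×256 at the end of an Outer network, which is the shape fed to the Concatenation step.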
The returned output features are then Concatenated, as shown by the (+) sign in Figure 3; this is one of the key operations in the C-Net model, given in Equation 1. The operation is applied pairwise to the outputs of all the Outer networks:

X^{(m \times n \times (c_1 + c_2))} = y^{(m \times n \times c_1)} \oplus w^{(m \times n \times c_2)} \quad (1)

where (m, n) are the dimensions of the networks' outputs, y and w are the feature maps of different networks, c_1 and c_2 are the numbers of channels in each output, \oplus is the Concatenation with respect to the feature-map axis and channel, and X is the result of the Concatenation operation, which serves as the input for the Middle networks.
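The Concatenation of Equation 1 simply joins two feature maps along the channel axis, provided their spatial dimensions match; a NumPy sketch:

```python
import numpy as np

def concat_channels(y, w):
    """Concatenate two (m, n, c_i) feature maps along the channel axis."""
    assert y.shape[:2] == w.shape[:2], "spatial dimensions must match"
    return np.concatenate([y, w], axis=-1)
```

In a Keras implementation the same operation would be a Concatenate layer; the resulting channel count is c_1 + c_2.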
Features extracted by the Outer networks serve as the input for the Middle networks. The structure of the Middle networks is as follows: four convolutional layers stacked on top of each other, each with a filter size of 3×3, the same padding, and 256 filters. To manage the model's complexity and reduce the feature maps, a 1×1 convolution is placed after these convolutions. The NIN, a 1×1 convolution, is a fully connected network applied across the feature map, inspired by Lin et al. lin2013network. Subsequently, a max-pooling layer with a stride of 2 is placed after the 1×1 convolutional layer to complete the first block of the Middle networks. The same architecture is repeated in the second block of the Middle networks. The activation function for all the convolutional layers is ReLU. The outputs (feature maps) of the Middle networks are then Concatenated (see Equation 1) and serve as the input for the Inner network. To generate efficient feature descriptors, we make sure that each network contains a max-pooling layer.
Dropout, which randomly turns off some units of a layer, is a regularization technique that is highly effective at preventing the network from overfitting. It has been applied to each block of the Middle networks.
Finally, the Inner network takes the features returned by the Middle networks as its input. The Inner network contains only one block with the following structure: two convolutional layers with a filter size of 3×3, a stride of 1, the same padding, and 256 filters. Additionally, there is a 1×1 convolutional layer with the same configuration, followed by a max-pooling layer of size 2×2 with a stride of 2. The ReLU is also used as the activation function in the Inner network.
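The 1×1 (NIN) convolution used in the Middle and Inner networks is equivalent to a per-pixel fully connected layer across channels, which is why it can reduce feature-map depth cheaply. A minimal NumPy sketch (the weights here are illustrative, not trained values):

```python
import numpy as np

def conv1x1(feature_map, weights, bias=0.0):
    """Apply a 1x1 convolution: a dense mapping over channels at each pixel.

    feature_map: (H, W, C_in); weights: (C_in, C_out) -> output (H, W, C_out).
    """
    out = np.einsum('hwc,cd->hwd', feature_map, weights) + bias
    return np.maximum(out, 0.0)  # ReLU activation, as in the C-Net blocks
```

Because the spatial dimensions are untouched, a 1×1 convolution changes only the channel count, e.g. from 512 back down to 256.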
The max-pooling output of the Inner network is flattened into a vector connected to a fully connected (FC) layer containing 1024 units, which is connected to another FC layer with an equal number of units. Dropout has been applied to both FC layers. Finally, the sigmoid, shown in Equation 2, is used as the activation function at the output layer with two nodes, benign and malignant, as shown in Figure 3.
\sigma(z) = \frac{1}{1 + e^{-z}} \quad (2)

where z is the dot product of the filter w with a chunk of the image of the same size as the filter, plus the bias b; that is, z = w \cdot x + b.
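Equation 2 and its pre-activation z can be written directly; a small NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid of Equation 2: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def preactivation(w, x, b):
    """z = w . x + b: dot product of the filter with an equally sized patch."""
    return np.dot(w.ravel(), x.ravel()) + b
```

The sigmoid squashes z into (0, 1), so each output node can be read as a class probability.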
The loss function for the proposed model is the cross-entropy function, represented as follows:

L(y, \hat{y}) = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) \quad (3)

where y_i is the i-th element of the label y over the N classes, and \hat{y}_i is the i-th element of the model's output \hat{y}.
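The cross-entropy loss above can be sketched in a few lines of NumPy (the small eps clip is a standard numerical guard, not part of the definition):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """L = -sum_i y_i * log(y_hat_i) over the N classes."""
    y_hat = np.clip(y_hat, eps, 1.0)  # avoid log(0)
    return -np.sum(y * np.log(y_hat))
```

A perfect one-hot prediction gives a loss of 0, and the loss grows as probability mass moves away from the true class.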
4 Experimental Results and Analysis
This section presents the results on two different histopathological datasets; both have been split into training, validation, and test sets with ratios of 70%, 15%, and 15% respectively. For the BreakHis dataset, two-class classification has been performed at each magnification level: 40X, 100X, 200X, and 400X. The two-class settings for the Osteosarcoma dataset are as follows: (a) NCT versus NT, (b) NCT versus VT, and (c) NT versus VT.
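The 70/15/15 split can be sketched as a shuffled index split; this is a hypothetical helper (the paper does not specify a seed or stratification, so none is assumed here):

```python
import numpy as np

def split_indices(n_samples, train=0.70, val=0.15, seed=0):
    """Shuffle sample indices and split them into train/val/test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

For the 7909 BreakHis images this produces disjoint partitions that together cover the whole dataset.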
The parameter setup for running the experiments is as follows:
Images are resized to 224×224 and 375×375 pixels for BreakHis and Osteosarcoma respectively.
Adam algorithm with the parameters of (learning rate = 1, beta_1 = 0.9, beta_2 = 0.999, epsilon = ) is used for optimization.
Different batch sizes, including 16, 32, 64, and 80, are used.
Binary cross entropy is chosen as the loss function.
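A single Adam update using the beta parameters listed above can be sketched in NumPy. This is the standard Adam step, not the paper's training loop; epsilon is set to a conventional 1e-8 here because the listed value is truncated in the source:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

Because of the bias correction, the very first update has magnitude close to the learning rate regardless of the gradient's scale.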
The model has been implemented in Python, employing two well-known deep learning platforms, TensorFlow abadi2016tensorflow and Keras chollet2015keras. The experiments are run on Ubuntu 18.04, trained and tested on an Nvidia GeForce RTX 2080Ti GPU.
Several metrics are used to measure the performance of the model. Precision (or Positive Predictive Value (PPV)) shows, among all the instances predicted as cancer (y=1), what fraction are truly cancerous. Negative Predictive Value (NPV), Equation 6, shows, among all the instances the model predicted as negative, what fraction are truly negative. Recall (or Sensitivity), Equation 7, tells what fraction of all the actual cancer instances the model detected as cancer. Specificity, Equation 8, is the capacity of the model to correctly identify non-cancerous instances. The Matthews correlation coefficient (MCC), Equation 10, is another evaluation metric; it produces a high score only when the model performs well on all four categories of the confusion matrix, True Negative (TN), True Positive (TP), False Negative (FN), and False Positive (FP), and it is more unbiased and reliable than accuracy (Equation 4) and F1-score (Equation 9) in the presence of imbalanced datasets chicco2020advantages. The following formulas give the definitions of the aforementioned metrics from the confusion matrix.
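All of these metrics follow directly from the four confusion-matrix counts; a NumPy sketch:

```python
import numpy as np

def metrics(tp, tn, fp, fn):
    """Binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)                        # PPV
    npv = tn / (tn + fn)
    recall = tp / (tp + fn)                           # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {'PPV': precision, 'NPV': npv, 'recall': recall,
            'specificity': specificity, 'accuracy': accuracy,
            'F1': f1, 'MCC': mcc}
```

Note how MCC uses all four counts, which is why a lopsided classifier on an imbalanced dataset can score high on accuracy or F1 yet low on MCC.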
[Table 2: evaluation metrics for the BreakHis dataset across the four magnification factors (40X, 100X, 200X, 400X), with averages]
The experimental results are composed of three parts. First, we present the results of the C-Net model on the BreakHis dataset upon different magnification factors (40X, 100X, 200X, 400X) using various evaluation metrics. Secondly, we show the classification results obtained by C-Net on the Osteosarcoma dataset. Finally, we demonstrate a comparative analysis of our model with previous models carried out upon the same dataset. Tables and matrices demonstrate the performance of the model on the test set.
Different values in Table 2 are derived from confusion matrices presented in Figure 4. Table 3 represents the results on three different settings of binary classification including, NCT vs NT, NCT vs VT, and ultimately VT vs NT. Furthermore, the confusion matrices for this dataset are given in Figure 5. A comparison between different architectures and the C-Net model on breast cancer classification using the BreakHis dataset is shown in Table 4.
[Table 3: evaluation metrics for the Osteosarcoma dataset on the three classification settings (NCT vs NT, NCT vs VT, VT vs NT), with averages]
|Architectures|Year|40X|100X|200X|400X|Avg. Acc|
|Sharma et al. sharma2020novel|2020|97.4|98.6|97.7|96.8|97.63|
Accurately diagnosing histopathological images is an extremely time-consuming and error-prone process that requires expertise, with pathologists manually going through on the order of 1000 slides to detect abnormalities. Early diagnosis of cancer can have a significant impact on the outcome of the disease and save lives, and hence it has been the focus of many studies in cancer-related research sun2020deep; azer2019challenges; zheng2020deep. Computer-aided diagnosis (CAD) systems, especially DCNNs, aim to help physicians diagnose the presence of cancer better and faster, and CAD has a transcendent advantage over traditional machine learning techniques chougrad2018deep.
Independent of the population of interest subjected to the test, sensitivity and specificity are usually employed for evaluating clinical tests lalkhen2008clinical. As presented in Table 2, the C-Net model achieves 100% performance for these two metrics on different magnification factors including 40X, 100X, and 400X on the BreakHis dataset, which outperforms all the previous studies using these metrics.
Sensitivity is the probability that the model predicts positive given that the patient has the disease, P(T+|D+); likewise, specificity is the probability that the model predicts negative given that the patient does not have the disease, P(T-|D-). Despite the popularity of these evaluation metrics, they do not directly give a clinician the probability of illness in an individual patient akobeng2007understanding. Clinicians may want to know the probability that the patient has the disease, or does not, when the model predicts positive or negative. Hence, PPV, which is P(D+|T+), and NPV, which is P(D-|T-), are two further measurements. As shown in Table 2, the C-Net achieves 100% PPV at 200X, and also 100% NPV at two magnification factors, 40X and 400X, which corroborates the model as a robust classifier.
The lowest precision achieved by the proposed model, as shown in Table 2, is 99.03% for 40X images, and the rate increases with magnification. Furthermore, the C-Net model achieves a 100% recall rate at two magnification factors, 40X and 400X. Precision and recall alone cannot fully characterize the performance of a classifier; the F1-score in Equation 9 combines these two metrics as their harmonic mean, and the higher the score, the closer precision and recall are to each other. The model attains an average F1-score of 99.52% across all the magnification factors, which, compared to the average F1-score of 97.67% in sharma2020novel, is a significant improvement and shows its robust performance for classifying medical images.
The MCC in Table 2 is the next metric for evaluating a binary classification model. Even though accuracy and the F1-score are the most popular metrics in binary classification tasks, they can be overly optimistic and inflate the results chicco2020advantages. The C-Net model achieves its highest MCC score of 99.23% on 200X images, and the average MCC across magnification factors is 98.46%, which exceeds even the average accuracy of all the models presented in Table 4. This high MCC performance shows that Concatenating multiple networks results in better feature extraction and fewer misclassifications, with only 8 misclassified cases across all magnification factors combined, as shown in the confusion matrices in Figure 4.
The confusion matrices allow a detailed comparison of the models' performances. The highest MCC score of the C-Net model, 99.23%, is 2.63% greater than the highest MCC value derived from the confusion matrix in tougaccar2020breastnet. Additionally, the highest accuracy and F1-score of their model were attained on 200X images, at 98.51% and 98.28% respectively; in the C-Net model, the highest accuracy (99.67%) and F1-score (99.76%) are obtained at the same magnification. Another evaluation metric for comparison is sensitivity, or recall. The highest sensitivity in tougaccar2020breastnet, 98.70%, was achieved at the same magnification factor, whereas our model achieves 100% at two magnification levels, 40X and 400X.
Table 4 presents a detailed comparison of the accuracies achieved by previously successful models and our model on the BreakHis dataset. The C-Net model achieves 99.34%, which is 1.71% higher than the best average accuracy among the preceding models, reported in sharma2020novel. Comparing accuracies across all magnification factors for models that used a hybrid architecture: the highest accuracy in pratiher2018grading is 97.2% with an average of 96.48%, and the highest accuracy achieved by alom2019breast is 97.95% with an average of 97.55%. Our model, using the same metrics and dataset, attains 99.67% and 99.34% for the highest and average accuracy respectively. This performance shows that our architecture, with multiple networks as the feature extractor in the Outer networks, provides better features for the network to learn different variations in the images.
On the Osteosarcoma dataset, Table 3 shows that the lowest accuracy achieved by our model, 95.83% on the NCT vs NT classification category, is 3.23% higher than the best performance reported in arunachalam2019viable. Moreover, our model's best performance, 100% accuracy on the VT vs NT category, represents a 7.4% improvement.
The model achieves 100% precision on two classification categories, NCT vs VT and VT vs NT, with an average of 98.25% over all three categories. Similarly, the model obtains a specificity of 99.18%, shown in Table 3; these precision and specificity figures are 5.35% and 3.08% higher, respectively, than the results in fu2020deep. As illustrated in Figure 5, the model produces zero misclassifications in detecting viable tumors versus non-tumors on the Osteosarcoma dataset, which clearly shows the ability and reliability of the proposed model for the classification of biomedical images.
5 Conclusion
In this work, we proposed a novel and reliable CNN, called C-Net, for the binary classification of biomedical images. The architecture was tested on two different datasets, BreakHis and Osteosarcoma, which differ in image size, data volume, and underlying disease. To ensure reliability, the proposed model was evaluated using several metrics. The C-Net model achieved an average accuracy of 99.34% and an average F1-score of 99.52%, both of which outperformed all other models tested on the BreakHis dataset. In addition, the proposed model improved the accuracy and F1-score by 2.71% and 4.76%, respectively, compared to all prior results on the Osteosarcoma dataset. The experimental results, using a broad range of evaluation metrics, showed that the proposed model is a promising architecture with the potential to generalize to other biomedical image classification tasks.