Ensemble of CNN classifiers using Sugeno Fuzzy Integral Technique for Cervical Cytology Image Classification

08/21/2021 ∙ by Rohit Kundu, et al. ∙ IEEE 5

Cervical cancer is the fourth most common category of cancer, affecting more than 500,000 women annually, owing to the slow detection procedure. Early diagnosis can help in treating and even curing cancer, but the tedious, time-consuming testing process makes it impossible to conduct population-wise screening. To aid the pathologists in efficient and reliable detection, in this paper, we propose a fully automated computer-aided diagnosis tool for classifying single-cell and slide images of cervical cancer. The main concern in developing an automatic detection tool for biomedical image classification is the low availability of publicly accessible data. Ensemble Learning is a popular approach for image classification, but simplistic approaches that leverage pre-determined weights to classifiers fail to perform satisfactorily. In this research, we use the Sugeno Fuzzy Integral to ensemble the decision scores from three popular pretrained deep learning models, namely, Inception v3, DenseNet-161 and ResNet-34. The proposed Fuzzy fusion is capable of taking into consideration the confidence scores of the classifiers for each sample, and thus adaptively changing the importance given to each classifier, capturing the complementary information supplied by each, thus leading to superior classification performance. We evaluated the proposed method on three publicly available datasets, the Mendeley Liquid Based Cytology (LBC) dataset, the SIPaKMeD Whole Slide Image (WSI) dataset, and the SIPaKMeD Single Cell Image (SCI) dataset, and the results thus yielded are promising. Analysis of the approach using GradCAM-based visual representations and statistical tests, and comparison of the method with existing and baseline models in literature justify the efficacy of the approach.




1 Introduction

One of the most prevalent and deadliest diseases of the 21st century is cancer which is caused by the uncontrolled growth of human body cells. Statistically, this is the second leading cause of death worldwide, causing around 9.6 million deaths every year and about 1/6th of the total deaths of the population throughout the globe. Late diagnosis and prognosis of cervical cancer often lead to deaths without receiving adequate treatment, mostly in poor and middle-income countries where the living index is low and healthcare infrastructure is insufficient.

Papanicolaou smear or the Pap smear test is the most common and widely used method in cervical cytology for the screening of abnormal lesions and cervical cancer. However, the assessment of cervical cytology requires expert physicians, which is expensive and time-consuming. Also, inter and intra-human variability of assessment or incorrect prognosis due to human errors may worsen the patient condition and can even be fatal in some cases. To avoid ambiguity in diagnosis, people tend to take the opinions of multiple experts. In the present work, we have utilized that kind of strategy for the automated detection of cervical cancer, where multiple CNN classifiers have been used to generate predictions, and the decision scores have been ensembled to conclude the final predictions.

Figure 1: Overall workflow of the proposed framework

In the literature, there are some simple fusion schemes nannia2020ensemble that combine CNN features, mostly by late fusion of several feature sets, specifically by majority voting, weighted probability ensembles, etc. These schemes combine different CNN features by simple addition, multiplication or averaging. Besides, experiments with different fusion methods yield fixed weights that are then used for a weighted mean of the decision scores. Therefore, there remains an opportunity to optimize the fusion of different CNN or machine learning-based classifiers by adaptively adjusting the importance of every classifier for every single image. This can be done by conditioning the weightage of one classifier upon the others before it, as is done in fuzzy fusion, which remains largely unexplored in the particular domain of cervical cytology. The proposed method performs better than other popularly used fusion methods, as shown by the comparison with other ensemble approaches in Section 4.6.


The rest of the paper is organised as follows: Section 2 provides a brief literature survey of the existing classification approaches and fusion methods in the domain of cervical cytology; Section 3 is the detailed description of the proposed method, where we have implemented transfer learning-based decision score generation followed by fuzzy fusion; Section 4 contains information about the datasets used and the results we have obtained along with the comparison of different existing methods; Section 6 is the brief description of the outcome of our experiments and the future improvements that can be made to enhance the classification performance further.

1.1 Overview and Contributions

The high rate of cervical cancer cases in the world, especially in developing and underdeveloped countries, is mainly due to inadequate screening. However, detection of cervical cancer is no easy feat, taking long hours to detect a single case, thus making regular population-wide screening an impossible task. This calls for automation in the detection procedure, and thus in this paper, we propose a framework for reliable automated detection of cervical cancer employing deep neural networks and Ensemble Learning using Fuzzy Fusion. The overview of the proposed framework is shown in Figure 1.

The contributions of this paper are as follows:

  1. The Sugeno Fuzzy Integral is introduced for the first time for cervical cell classification to fuse the decision scores of multiple CNN classifiers. The ensemble improves on the performance of three popular individual CNN classifiers, namely Inception v3 szegedy2016rethinking, ResNet-34 he2016deep and DenseNet-161 huang2017densely, and has therefore been used in the present research to make accurate predictions on the small-sized available datasets. We have performed experiments with several ensemble approaches, but the Sugeno Fuzzy Integral-based ensemble outperformed the traditional methods nannia2020ensemble since it is capable of using adaptive weights based on the confidence of predictions by the individual classifiers for each test sample, leading to superior predictions.

  2. The proposed method has been tested on two publicly available datasets: the SIPaKMeD Pap Smear image dataset plissiti2018sipakmed and the Mendeley Liquid Based Cytology dataset hussain2020liquid. The SIPaKMeD dataset has both whole slide images (WSI) and single-cell images (SCI), and hence both types have been used separately for evaluation. Promising results have been obtained by the framework, which is reliable for practical use.

Thus, we have developed an automated framework, based on Deep Learning and a novel ensemble approach, for the classification of Pap-stained and Liquid-Based Cytology images, which is otherwise a laborious task for cyto-technicians.

Figure 2: Confusion Matrices of the respective test sets of (a) Mendeley LBC Dataset (b) SIPaKMeD Whole Slide Images Dataset (c) SIPaKMeD Single Cell Images Dataset

2 Literature Survey

For many decades, extensive research has been done to develop improved algorithms and methods for computer-aided diagnosis of cervical cancer mitra2021cytology. In the past few years, various machine learning algorithms have been proposed for the detection and classification of cancerous cell images, such as the Support Vector Machine used by Ashok et al. ashok2016comparison and the K-Nearest Neighbour classifier used by sharma2016classification.

Win et al. win2020computer proposed a method in which nuclei were detected using a shape-based iterative method, and the overlapping cytoplasm was separated by a marker-controlled watershed approach. Features were extracted from regions of segmented nuclei and a Random Forest classifier was used for feature selection. For classification, they used a bagging ensemble classifier, which combined the results of LDA, SVM, KNN, boosted trees and bagged trees. They achieved 98.27% accuracy in two-class and 94.09% accuracy in five-class classification on the SIPaKMeD dataset. Jia et al. proposed a new framework based on a strong feature Convolutional Neural Networks-Support Vector Machine (CNN-SVM) model to classify cervical cells. The strong features extracted by the Gray-Level Co-occurrence Matrix and Gabor filters were fused with abstract features from the hidden layers of the CNN, and the fused features were input into the SVM for classification. Basak et al. used a deep learning-based method wherein they extracted deep features from multiple CNN models and applied a two-step feature enhancement procedure using Principal Component Analysis (PCA) and Grey Wolf Optimizer (GWO) to reduce the dimensionality of the feature set for efficient classification.

Ensemble Learning is a popular technique to incorporate the salient features of multiple models. Kuko et al. kuko2019ensemble proposed a method that applies Random Forest classifiers with another layer of ensemble learning based on rotation of the image. Each image is rotated 8 times by 45 degrees and 8 Random Forest models are trained. After classification, an ensemble voting technique tallies the votes of the 8 models and the most-voted class is selected as the final classification. This method achieved an accuracy of 90.37% on binary classification. Xue et al. xue2020application used a weighted voting based ensemble learning (EL) method. They developed transfer learning (TL) structures based on Inception-V3, Xception, VGG-16 and ResNet-50 and then, to enhance the classification performance, introduced a weighted voting based EL strategy. An experiment for classifying benign cells from malignant ones was carried out on the Herlev dataset, obtaining an overall accuracy of 98.37%. Sarwar et al. sarwar2015hybrid used an ensemble system developed using Random subset space, Radial basis function network, Multiclass classifier, Random forest, Bagging, Rotation Forest, J48 graft, Ensemble of Nested Dichotomies (END), Decorate, PART, Random Committee, Filtered Classifier, Decision Table, Multiple back propagation artificial neural network, and Naïve Bayes. The final classification decision is obtained by aggregating the output of all possible candidate trees for the multiclass problem. The overall accuracy of the system for the two-class problem was 98.57% on the Herlev dataset.

3 Proposed Method

In the present study, we have used three different CNN architectures (Transfer Learning) for generating the confidence scores on the datasets: Inception v3, DenseNet-161 and ResNet-34. The decision scores from these classifiers have been fused using the Sugeno Fuzzy Integral to generate the final predictions. These steps are explained in detail in this section.

3.1 Inception v3

One of the most popular deep learning networks for transfer learning is Inception v3 szegedy2016rethinking, which consists of several inception blocks. It takes a fixed-size input image and produces feature maps of different dimensions in different layers. The inception block of Inception v3 allows different filters to be applied for feature extraction from a single feature map. The features obtained with the different filters are concatenated and passed on to the next layer for deeper feature extraction. In this study, we have evaluated Inception v3 with the ReLU activation function. In each case, the model is trained for 100 epochs with cross-entropy loss, which is optimized by an SGD optimizer with a learning rate of 0.001.

3.2 DenseNet-161

DenseNet was proposed by huang2017densely to address the vanishing gradient problem in deep neural networks. The building blocks of DenseNets are densely connected to each other, so fewer parameters need to be learnt by the network. These networks have a very narrow architecture and add small sets of feature maps. This network also takes a fixed-size input image and, similar to Inception v3, in our study we have trained this model for 200 epochs with an SGD optimizer and a learning rate of 0.001. For DenseNet-161 also we have used the ReLU activation function.

3.3 ResNet-34

ResNet he2016deep is also an advanced convolutional neural network, with residual skip connections embedded in it. There are several versions of ResNet, such as ResNet-18, ResNet-34, ResNet-50 and ResNet-152. Owing to the skip connections, the vanishing gradient problem is taken care of despite the depth of the architecture. Similar to DenseNet, every version of ResNet expects an input image of a standard dimension. We have evaluated ResNet-34 in this study. To maintain consistency, the number of training epochs, the optimizer, the learning rate etc. have been fixed to the values mentioned for the above CNNs.

3.4 Ensemble: Sugeno Fuzzy Integral

To leverage the strengths of several individual CNN classifiers instead of a single one, we propose an integration of multiple classifiers utilizing fuzzy fusion in this paper. As shown in Figure 1, the confidence scores from multiple classifiers are treated directly as the input of the fuzzy fusion. Fuzzy fusion has been used previously in pattern recognition tasks, specifically in classifier fusion wu2016fuzzy; liu2009machinery, and has shown promising results. However, no such applications in the domain of cervical cytology have been found so far. The fusion scheme harnesses additional information from a classifier, namely the uncertainty of its decision scores. Fuzzy measures generalize aggregation operators over a set of confidence values by attaching a weight to each source.

If $X = \{x_1, x_2, \dots, x_N\}$ is the set of classifiers, the fuzzy measure is the worth value of a subset $S \subseteq X$ and, as introduced in sugeno1993fuzzy, can have values in the range of [0, 1] and can be represented by the function $g: 2^X \rightarrow [0, 1]$. $g(S) = 1$ represents that the classifier set can be considered consistent, whereas $g(S) = 0$ represents that it cannot be trusted because its results are inconsistent. For all $A, B \subseteq X$, the fuzzy measure can be characterized by the monotonic property as in Equation 1.

$$A \subseteq B \implies g(A) \le g(B) \tag{1}$$

Fuzzy density $g_i = g(\{x_i\})$ is defined as the fuzzy measure of set S when S contains a single element, and is the measure of the worth value of an individual classifier. Some studies have used fuzzy density values predefined based on the experience of the researcher; however, that does not ensure superior integration of the classifiers. Instead, following the original work of tong2016speech, we have set the fuzzy density values the same as the classifier accuracy measure on the test set, to give weightage to the optimal classifiers and punish the inferior ones. Following the work of tahani1990information, the Sugeno fuzzy measure can be conceptualized with an additional characteristic: if $A \cap B = \emptyset$, then there is always a $\lambda$ such that:

$$g(A \cup B) = g(A) + g(B) + \lambda \, g(A) \, g(B) \tag{2}$$

The value of $\lambda$ can be obtained by solving the following equation, where $\lambda$ is greater than -1.

$$\lambda + 1 = \prod_{i=1}^{N} \left(1 + \lambda g_i\right) \tag{3}$$

where $N$ is the number of CNN classifiers, which is 3 in our case. $\lambda$ has the following characteristics:

  1. $\lambda = 0$ when $\sum_{i=1}^{N} g_i = 1$

  2. $\lambda < 0$ when $\sum_{i=1}^{N} g_i > 1$

  3. $\lambda > 0$ when $\sum_{i=1}^{N} g_i < 1$

Among the existing fuzzy integral and aggregation methods, such as Fuzzy min-max mesiar2008fuzzy, ordered weighted averaging operators like ordered weighted averaging OR (OWA-OR) cheng2012combining and ordered weighted averaging AND (OWA-AND) cho1995fuzzy, the Sugeno integral sugeno1993fuzzy and the Choquet integral murofushi1989interpretation, we have implemented the Sugeno and Choquet integrals in this work, and have selected the best result from these two. The steps for calculating the fuzzy integrals are described below.

First, the N classifiers are sorted according to their output scores:

$$h(x_{(1)}) \ge h(x_{(2)}) \ge \dots \ge h(x_{(N)}) \tag{4}$$

where $h(x_{(i)})$ represents the $i$-th largest output value of the classifiers, with $i = 1, 2, \dots, N$, where N is the number of classifiers. Next, we calculate the Choquet and Sugeno fuzzy integrals by means of:

$$\text{Choquet} = \sum_{i=1}^{N} \left[ h(x_{(i)}) - h(x_{(i+1)}) \right] g(A_i), \quad h(x_{(N+1)}) = 0 \tag{5}$$

$$\text{Sugeno} = \max_{1 \le i \le N} \min\left( h(x_{(i)}), \, g(A_i) \right) \tag{6}$$

where $g(A_i)$, with $A_i = \{x_{(1)}, \dots, x_{(i)}\}$, is defined from the definition of Sugeno fuzzy measures as follows:

$$g(A_1) = g_{(1)}, \qquad g(A_i) = g_{(i)} + g(A_{i-1}) + \lambda \, g_{(i)} \, g(A_{i-1}), \quad 1 < i \le N \tag{7}$$

Thus, through fuzzy integrals, the robustness is experimentally found to be higher than that of the previously obtained normalized softmax probabilities, and the time complexity of the algorithm is a function of the number of classifiers and the number of classes. The pseudo-code for computing the Sugeno Fuzzy Integral for the ensemble of CNN classifiers' decisions is shown in Algorithm 1.

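Input: Set of Decision Scores (from different base learners): D = {d_1, ..., d_N}
Input: Set of Fuzzy Measures: G = {g_1, ..., g_N}, with λ obtained from Equation 3
Output: Final Predictions on test samples: predictions

  predictions ← Initialize empty list of final predictions
  for class index c = 1 to num_classes do
      h ← Permutation of the class-c scores in D in descending order
      g ← Permutation of G corresponding to h
      A ← g_1
      pred ← min(h_1, A)
     for i = 2 to N do
         A ← g_i + A + λ · g_i · A
         pred ← max(pred, min(h_i, A))
     end for
     predictions[c] ← pred
  end for
Algorithm 1 Ensemble of Classifiers Decisions using Sugeno Fuzzy Integral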
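The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`solve_lambda`, `fused_measures`, `sugeno_integral`, `choquet_integral`, `fuse`) are hypothetical, and the use of bisection to find λ is a choice made here, since the text only states the equation to be solved. Fuzzy densities are assumed to lie in (0, 1].

```python
from typing import List, Sequence, Tuple


def solve_lambda(densities: Sequence[float], iters: int = 200) -> float:
    """Non-zero root (> -1) of lambda + 1 = prod(1 + lambda * g_i), by bisection."""
    if len(densities) < 2 or not all(0.0 < g <= 1.0 for g in densities):
        raise ValueError("need at least two fuzzy densities, each in (0, 1]")
    total = sum(densities)
    if abs(total - 1.0) < 1e-6:
        return 0.0   # densities are already additive
    if max(densities) >= 1.0:
        return -1.0  # limiting case: one classifier is fully trusted

    def f(lam: float) -> float:
        prod = 1.0
        for g in densities:
            prod *= 1.0 + lam * g
        return prod - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0, -1e-7   # f(lo) > 0 > f(hi): root lies in (-1, 0)
    else:
        lo, hi = 1e-7, 1.0     # f(lo) < 0: grow hi until f(hi) > 0
        while f(hi) <= 0.0:
            hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(lo) > 0.0) == (f(mid) > 0.0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fused_measures(sorted_densities: Sequence[float], lam: float) -> List[float]:
    """g(A_1) = g_(1); g(A_i) = g_(i) + g(A_{i-1}) + lam * g_(i) * g(A_{i-1})."""
    measures, acc = [], 0.0
    for g in sorted_densities:
        acc = g + acc + lam * g * acc
        measures.append(acc)
    return measures


def _sorted_scores_and_measures(scores, densities, lam):
    # Sort classifiers by descending score and carry their densities along.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    h = [scores[i] for i in order]
    g = fused_measures([densities[i] for i in order], lam)
    return h, g


def sugeno_integral(scores, densities, lam) -> float:
    h, g = _sorted_scores_and_measures(scores, densities, lam)
    return max(min(hi, gi) for hi, gi in zip(h, g))


def choquet_integral(scores, densities, lam) -> float:
    h, g = _sorted_scores_and_measures(scores, densities, lam)
    h = h + [0.0]  # h(x_(N+1)) = 0
    return sum((h[i] - h[i + 1]) * g[i] for i in range(len(g)))


def fuse(sample_scores, densities, integral=sugeno_integral) -> Tuple[int, List[float]]:
    """sample_scores[k][c] is the score of classifier k for class c (one test sample)."""
    lam = solve_lambda(densities)
    n_classes = len(sample_scores[0])
    fused = [integral([clf[c] for clf in sample_scores], densities, lam)
             for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__), fused
```

For instance, with two hypothetical classifiers of densities 0.2 and 0.3, `fuse([[0.9, 0.1], [0.4, 0.6]], [0.2, 0.3])` fuses each class column separately and returns the index of the class with the largest fused score together with the fused scores.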

4 Results and Discussion

In this section, we first briefly describe the two publicly available datasets used. Then we evaluate the performance of the proposed framework on these datasets and compare the results to other popular approaches used in literature to justify the viability of the method used.

4.1 Description of Datasets

In the present study, we have used two publicly accessible datasets: the SIPaKMeD Pap Smear dataset by Plissiti et al. plissiti2018sipakmed and the Mendeley Liquid Based Cytology (LBC) dataset by Hussain et al. hussain2020liquid, which are briefly described below.

4.1.1 SIPaKMeD Pap Smear Dataset

| Class | Category | WSI | SCI |
|---|---|---|---|
| 1 | Superficial-Intermediate | 126 | 831 |
| 2 | Parabasal | 108 | 787 |
| 3 | Koilocytotic | 238 | 825 |
| 4 | Metaplastic | 271 | 793 |
| 5 | Dyskeratotic | 223 | 813 |
| - | Total | 966 | 4049 |

Table 1: SIPaKMeD Pap Smear Dataset Distribution. WSI: Whole Slide Images, SCI: Single Cell Images

The SIPaKMeD dataset consists of 4049 images of isolated cells that have been manually cropped from 966 cluster cell images of Pap smear slides, which are also included. The cells are classified into five different classes by expert cytopathologists. Normal cells are divided into two categories (superficial-intermediate, parabasal), abnormal but not malignant cells are divided into two categories (koilocytes and dyskeratotic) and the final category is benign (metaplastic) cells. Both the Whole Slide Images (WSI) and the Single Cell Images (SCI) have been used separately for the present study. The distribution of images in the dataset is shown in Table 1.

4.1.2 Mendeley Liquid Based Cytology Dataset

| Class | Category | Number of Images |
|---|---|---|
| 1 | NILM | 613 |
| 2 | HSIL | 113 |
| 3 | LSIL | 163 |
| 4 | SCC | 74 |
| - | Total | 963 |

Table 2: Mendeley LBC Dataset Distribution. NILM: Negative for Intra-epithelial Malignancy, HSIL: High Squamous Intra-epithelial Lesion, LSIL: Low Squamous Intra-epithelial Lesion, SCC: Squamous Cell Carcinoma

The Liquid Based Cytology (LBC) Dataset by Hussain et al. hussain2020liquid contains 963 images classified into four different classes. The Pap smear images were captured at 40x magnification from samples collected from 460 patients and prepared using the liquid-based cytology technique. In this dataset, 613 images belong to the normal cell category and 350 images belong to the abnormal cell category. The distribution of these images is given in Table 2.

4.2 Metrics for Performance Evaluation

For our work, we have used Accuracy, Precision, Recall and F1-score for evaluating the performance of the proposed framework. True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) are the basic elements that help determine the values of these metrics, and they can be defined as follows:

  • True Positive: The predicted result is positive and the sample is labelled as positive.

  • False Positive: The predicted result is positive, while the sample is labelled as negative. This is also called a Type I Error.

  • True Negative: The predicted result is negative and the sample is labelled as negative.

  • False Negative: The predicted result is negative, while the sample is labelled as positive. This is also called a Type II Error.

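Based on these four elements, we can calculate the metrics: accuracy, precision, recall and F1 score. For a multi-class system ($K$ classes), if we have a confusion matrix $M$, with the rows depicting the predicted class and the columns depicting the true class, these evaluation metrics can be formulated as in Equations 8, 9, 10 and 11.

$$\text{Accuracy} = \frac{\sum_{i=1}^{K} M_{ii}}{\sum_{i=1}^{K} \sum_{j=1}^{K} M_{ij}} \tag{8}$$

$$\text{Precision}_i = \frac{M_{ii}}{\sum_{j=1}^{K} M_{ij}} \tag{9}$$

$$\text{Recall}_i = \frac{M_{ii}}{\sum_{j=1}^{K} M_{ji}} \tag{10}$$

$$\text{F1-Score}_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} \tag{11}$$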
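As a small illustration of these metrics, the sketch below computes them from a confusion matrix stored as a list of rows, following the convention stated above (rows are predicted classes, columns are true classes). The function name `classification_metrics` and the choice to return 0 for an empty row or column are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def classification_metrics(
    cm: Sequence[Sequence[int]],
) -> Tuple[float, List[Tuple[float, float, float]]]:
    """Return overall accuracy and per-class (precision, recall, F1).

    cm[i][j] counts samples predicted as class i whose true class is j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total

    per_class = []
    for i in range(k):
        predicted_i = sum(cm[i])                      # row sum: predicted as i
        actual_i = sum(cm[j][i] for j in range(k))    # column sum: truly i
        precision = cm[i][i] / predicted_i if predicted_i else 0.0
        recall = cm[i][i] / actual_i if actual_i else 0.0
        denom = precision + recall
        f1 = 2.0 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    return accuracy, per_class


if __name__ == "__main__":
    # Toy two-class matrix: 8 and 9 correct predictions, 3 errors in total.
    acc, stats = classification_metrics([[8, 2], [1, 9]])
    print(f"accuracy={acc:.3f}")
    for idx, (p, r, f1) in enumerate(stats):
        print(f"class {idx}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Aggregate precision, recall and F1 can then be obtained by averaging the per-class values; whether that average is weighted by class size is not stated in the text and is left open here.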


4.3 Implementation

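| Dataset | Class | Precision (%) | Recall (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Mendeley LBC | High Squamous Intra-epithelial Lesion | 100 | 93.75 | 96.77 | 93.75 |
| Mendeley LBC | Low Squamous Intra-epithelial Lesion | 100 | 100 | 100 | 100 |
| Mendeley LBC | Negative for Intra-epithelial Malignancy | 100 | 100 | 100 | 100 |
| Mendeley LBC | Squamous Cell Carcinoma | 88.24 | 100 | 93.75 | 100 |
| Mendeley LBC | Aggregate | 99.08 | 98.95 | 98.97 | 99.48 |
| SIPaKMeD WSI | Dyskeratotic | 91.67 | 97.78 | 94.62 | 97.78 |
| SIPaKMeD WSI | Koilocytotic | 100 | 87.5 | 93.33 | 87.5 |
| SIPaKMeD WSI | Metaplastic | 96.36 | 96.36 | 96.36 | 96.36 |
| SIPaKMeD WSI | Parabasal | 100 | 100 | 100 | 100 |
| SIPaKMeD WSI | Superficial Intermediate | 89.66 | 100 | 94.55 | 100 |
| SIPaKMeD WSI | Aggregate | 95.54 | 96.33 | 95.77 | 96.33 |
| SIPaKMeD SCI | Dyskeratotic | 98.18 | 99.39 | 98.78 | 99.39 |
| SIPaKMeD SCI | Koilocytotic | 98.75 | 95.18 | 96.93 | 95.15 |
| SIPaKMeD SCI | Metaplastic | 95.73 | 98.74 | 97.21 | 98.74 |
| SIPaKMeD SCI | Parabasal | 100 | 100 | 100 | 100 |
| SIPaKMeD SCI | Superficial Intermediate | 100 | 99.4 | 99.7 | 99.4 |
| SIPaKMeD SCI | Aggregate | 98.55 | 98.53 | 98.53 | 98.54 |

Table 3: Class-wise and Net Results obtained on the three datasets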

| Dataset | Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Mendeley LBC | Inception v3 | 97.96 | 97.95 | 97.56 | 97.75 |
| Mendeley LBC | DenseNet-161 | 98.44 | 97.06 | 98.44 | 97.63 |
| Mendeley LBC | ResNet-34 | 98.12 | 97.96 | 98.12 | 98.04 |
| SIPaKMeD WSI | Inception v3 | 88.27 | 88.63 | 89.97 | 89.13 |
| SIPaKMeD WSI | DenseNet-161 | 93.08 | 92.38 | 93.08 | 92.69 |
| SIPaKMeD WSI | ResNet-34 | 93.28 | 92.22 | 93.29 | 92.62 |
| SIPaKMeD SCI | Inception v3 | 94.34 | 94.31 | 94.38 | 94.31 |
| SIPaKMeD SCI | DenseNet-161 | 97.28 | 97.28 | 97.29 | 97.28 |
| SIPaKMeD SCI | ResNet-34 | 97.17 | 97.27 | 97.22 | 97.19 |

Table 4: Results obtained by the base classifiers (before fusion) on the three datasets used in this research.

The datasets used in the present study have been split into train, validation and test sets in a 3:1:1 ratio. The three pre-trained CNN models have been fine-tuned on the datasets by freezing the weights of the top 5 layers and training for 50 epochs. The probability distributions produced by the models have been saved and fused for the final classification using the Sugeno Fuzzy Integral. The confusion matrices thus obtained on the test sets of the respective datasets are shown in Figure 2. The class-wise results and the aggregate results over all the classes are tabulated in Table 3 for the three datasets used. The results obtained before the ensemble, that is, the results obtained by the base classifiers, are shown in Table 4.
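A 3:1:1 split can be produced in several ways; the sketch below shows one possibility. It is an assumption of this example, not a detail given in the text, that the split is stratified per class and driven by a fixed random seed. The function name `split_3_1_1` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def split_3_1_1(
    labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, val, test) index lists in an approximate 3:1:1 ratio.

    Each class is shuffled and split separately so that every subset keeps
    roughly the class proportions of the full dataset.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_fifth = len(indices) // 5          # one fifth each for val and test
        test.extend(indices[:n_fifth])
        val.extend(indices[n_fifth:2 * n_fifth])
        train.extend(indices[2 * n_fifth:])  # the remaining ~three fifths
    return train, val, test


if __name__ == "__main__":
    # Class sizes taken from Table 2 (Mendeley LBC): NILM, HSIL, LSIL, SCC.
    labels = ["NILM"] * 613 + ["HSIL"] * 113 + ["LSIL"] * 163 + ["SCC"] * 74
    train_idx, val_idx, test_idx = split_3_1_1(labels)
    print(len(train_idx), len(val_idx), len(test_idx))
```

Because the per-class fifths are rounded down, the remainder of each class goes to the training subset, so the resulting ratio is only approximately 3:1:1.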

4.4 Verification of Complementarity of Features

| Distribution P | Distribution Q | $D_{KL}(P \| Q)$ | $D_{JS}(P \| Q)$ |
|---|---|---|---|
| Inception v3 | DenseNet-161 | 2.356 | 0.131 |
| DenseNet-161 | Inception v3 | 0.611 | |
| Inception v3 | ResNet-34 | 6.055 | 0.115 |
| ResNet-34 | Inception v3 | 0.540 | |
| DenseNet-161 | ResNet-34 | 3.300 | 0.162 |
| ResNet-34 | DenseNet-161 | 0.884 | |

Table 5: KL and JS Divergences between individual models on Mendeley LBC Dataset

To verify the complementary or dissimilar nature of the features of the pre-trained models used to extract the confidence scores on the datasets, two statistical divergence metrics are used: the Kullback-Leibler Divergence (KLD) kullback1951information; kullback1997information and the Jensen-Shannon Divergence (JSD) menendez1997jensen.

| Distribution P | Distribution Q | $D_{KL}(P \| Q)$ | $D_{JS}(P \| Q)$ |
|---|---|---|---|
| Inception v3 | DenseNet-161 | 0.502 | 0.152 |
| DenseNet-161 | Inception v3 | 0.367 | |
| Inception v3 | ResNet-34 | 0.584 | 0.156 |
| ResNet-34 | Inception v3 | 0.418 | |
| DenseNet-161 | ResNet-34 | 0.212 | 0.132 |
| ResNet-34 | DenseNet-161 | 0.208 | |

Table 6: KL and JS Divergences between individual models on SIPaKMeD WSI Dataset

The KLD is a non-symmetric measure of dissimilarity between two probability distributions on the same probability space. Let there be a probability space $\mathcal{X}$ and two probability distributions $P$ and $Q$ on this space such that, for every discrete variable $x \in \mathcal{X}$, $P(x) > 0$ and $Q(x) > 0$. Then the discrete form of the KLD defined from $Q$ to $P$ is given as Equation 12, $\ln$ being the natural logarithm.

$$D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \ln \frac{P(x)}{Q(x)} \tag{12}$$

| Distribution P | Distribution Q | $D_{KL}(P \| Q)$ | $D_{JS}(P \| Q)$ |
|---|---|---|---|
| Inception v3 | DenseNet-161 | 0.355 | 0.129 |
| DenseNet-161 | Inception v3 | 0.160 | |
| Inception v3 | ResNet-34 | 0.250 | 0.134 |
| ResNet-34 | Inception v3 | 0.337 | |
| DenseNet-161 | ResNet-34 | 0.178 | 0.118 |
| ResNet-34 | DenseNet-161 | 0.211 | |

Table 7: KL and JS Divergences between individual models on SIPaKMeD SCI Dataset

As $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$, a symmetrical statistical divergence has been derived from the KLD, called the Jensen-Shannon Divergence (JSD). JSD is, effectively, a smoothed form of KLD. For the same probability distributions $P$ and $Q$ as mentioned above, let $M$ be another probability distribution such that $M = \frac{1}{2}(P + Q)$. Then the JSD (discrete form) is given by Equation 13.

$$D_{JS}(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M) \tag{13}$$

The KLD and JSD measures between the decision scores of each pair of CNN classifiers, are shown in Tables 5, 6 and 7 for the Mendeley, SIPaKMeD WSI and SIPaKMeD SCI datasets respectively.
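The two divergences can be illustrated with a short sketch for a single pair of discrete distributions. The function names are hypothetical, and how the per-sample values were aggregated into the table entries is not described in the text, so only the pairwise computation is shown.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Discrete KLD D(P || Q) with the natural logarithm.

    Terms with P(x) = 0 contribute nothing; if Q(x) = 0 where P(x) > 0
    the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Discrete JSD: average KLD of P and Q to their mixture M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical softmax outputs of two classifiers for one sample.
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(kl_divergence(p, q), kl_divergence(q, p))  # differ: KLD is asymmetric
    print(js_divergence(p, q), js_divergence(q, p))  # equal: JSD is symmetric
```

With the natural logarithm, the JSD of two distributions is bounded above by ln 2, which is consistent with the JSD columns in Tables 5, 6 and 7 staying well below the KLD values.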

4.5 Comparison with different back-bone CNNs

As mentioned earlier, we have evaluated the fuzzy measures over three popular pre-trained CNNs, namely Inception v3, DenseNet-161 and ResNet-34. The results obtained by combinations of these models on all three datasets are given in Figure 3. It can be observed that for SIPaKMeD SCI the maximum accuracy is achieved by combining the probability distributions of all three neural networks, whereas for the SIPaKMeD WSI and Mendeley LBC datasets the ensemble of the Inception v3 and DenseNet-161 models achieves the best result.

Figure 3: Comparison with different CNN architectures

4.6 Comparison with Other Ensemble Approaches

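| Ensemble Method | Mendeley LBC (%) | SIPaKMeD WSI (%) | SIPaKMeD SCI (%) |
|---|---|---|---|
| Majority Voting | 95.68 | 94.37 | 97.64 |
| Average | 94.76 | 93.88 | 97.29 |
| Weighted Average | 98.96 | 95.11 | 98.03 |
| Product Rule | 92.15 | 93.37 | 97.29 |
| Maximum Rule | 92.15 | 94.89 | 97.54 |
| Choquet Fuzzy | 98.96 | 95.41 | 98.40 |
| Sugeno Fuzzy | 99.48 | 96.33 | 98.54 |

Table 8: Comparison of accuracies obtained by Fuzzy Fusion with other popular ensemble techniques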

The same probability distributions have been used to compute the predictions on the datasets using some popular ensembling procedures. The results thus obtained are tabulated in Table 8. Among the non-fuzzy ensemble techniques used, the weighted probability average ensemble with weights {0.5, 2.0, 1.0} for {Inception v3, DenseNet-161, ResNet-34} (weights set experimentally) gave results closest to the Sugeno Fuzzy Integral ensemble. The fuzzy measures used for the Sugeno Fuzzy Integral are defined over {Inception v3, DenseNet-161} for the Mendeley LBC and SIPaKMeD WSI datasets and over {Inception v3, DenseNet-161, ResNet-34} for the SIPaKMeD SCI dataset. The fuzzy measures have been set through extensive experiments on multiple runs of the framework.
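For reference, the conventional ensemble rules listed in Table 8 can be sketched as below for a single test sample. The function names are hypothetical and the probability values in the example are made up for illustration; only the weights {0.5, 2.0, 1.0} come from the text.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def majority_vote(probs: Sequence[Sequence[float]]) -> int:
    """Each classifier votes for its top class; the most frequent vote wins."""
    votes = Counter(argmax(p) for p in probs)
    return votes.most_common(1)[0][0]


def average_rule(probs: Sequence[Sequence[float]],
                 weights: Optional[Sequence[float]] = None) -> int:
    """Plain average when weights is None, weighted average otherwise."""
    if weights is None:
        weights = [1.0] * len(probs)
    n_classes = len(probs[0])
    fused: List[float] = [
        sum(w * p[c] for w, p in zip(weights, probs)) / sum(weights)
        for c in range(n_classes)
    ]
    return argmax(fused)


def product_rule(probs: Sequence[Sequence[float]]) -> int:
    n_classes = len(probs[0])
    return argmax([math.prod(p[c] for p in probs) for c in range(n_classes)])


def maximum_rule(probs: Sequence[Sequence[float]]) -> int:
    n_classes = len(probs[0])
    return argmax([max(p[c] for p in probs) for c in range(n_classes)])


if __name__ == "__main__":
    # Hypothetical two-class outputs of Inception v3, DenseNet-161, ResNet-34.
    probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
    print("majority:", majority_vote(probs))
    print("average:", average_rule(probs))
    print("weighted:", average_rule(probs, weights=[0.5, 2.0, 1.0]))
    print("product:", product_rule(probs))
    print("maximum:", maximum_rule(probs))
```

All of these rules apply the same fixed weighting to every test sample, which is the limitation that the fuzzy integral ensemble is intended to address.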

4.7 Results with different fuzzy measures

| Fuzzy Measure: Inception v3 | Fuzzy Measure: DenseNet-161 | Fuzzy Measure: ResNet-34 | Accuracy (%) |
|---|---|---|---|
| 0.5 | 0.5 | 0.1 | 95.33 |
| 0.5 | 0.1 | 0.5 | 95.36 |
| 0.1 | 0.5 | 0.5 | 98.54 |
| 1 | 0.5 | 0.5 | 97.52 |
| 0.5 | 1 | 0.5 | 95.36 |
| 0.5 | 0.5 | 1 | 97.57 |
| 0.5 | 1 | 0.1 | 97.52 |

Table 9: Results obtained by different fuzzy measures of the fuzzy integral ensemble on SIPaKMeD SCI dataset

| Fuzzy Measure: Inception v3 | Fuzzy Measure: DenseNet-161 | Accuracy (%) |
|---|---|---|
| 1 | 0.5 | 96.33 |
| 0.5 | 1 | 90.31 |
| 0.5 | 0.1 | 30.31 |
| 0.1 | 0.5 | 94.36 |

Table 10: Results obtained by different fuzzy measures of the fuzzy integral ensemble on SIPaKMeD WSI dataset

| Fuzzy Measure: Inception v3 | Fuzzy Measure: DenseNet-161 | Accuracy (%) |
|---|---|---|
| 1 | 0.5 | 99.48 |
| 0.5 | 1 | 79.48 |
| 0.5 | 0.1 | 79.58 |
| 0.1 | 0.5 | 87.96 |

Table 11: Results obtained by different fuzzy measures of the fuzzy integral ensemble on Mendeley LBC dataset

Different experiments with the fuzzy measures have been conducted and the best set of weights has been chosen. The variation of accuracy with the change in fuzzy measures is shown in Tables 9, 10 and 11 for the SIPaKMeD SCI, SIPaKMeD WSI and Mendeley LBC datasets respectively.

4.8 Comparison with Existing Models

| Work | Approach | Accuracy (%) on SIPaKMeD SCI | Accuracy (%) on SIPaKMeD WSI |
|---|---|---|---|
| Kiran GV et al. gv2019automatic | Feature Extraction and PCA | 99.63 | 96.37 |
| Shi et al. shi2019graph | Graph Convolutional Network | 98.37 | - |
| Plissiti et al. plissiti2018sipakmed | Features: Deep fully CNNs; Classifiers: SVM and CNNs | 95.35 | - |
| Win et al. win2020computer | Features: RF; Classifiers: LDA, SVM, KNN and Decision trees | 94.09 | - |
| Sevi et al. sevihealth | CNNs | 88.40 | - |
| Proposed approach | Sugeno Fuzzy Integral Ensemble | 98.54 | 96.33 |

Table 12: Comparison of the proposed framework with existing methods in literature on the SIPaKMeD SCI and WSI datasets

In this section, we compare the performance of our proposed approach with previously reported works. No works have been reported so far on the Mendeley dataset, therefore works on the SIPaKMeD dataset are presented for comparison. The comparative results on SIPaKMeD are given in Table 12. In kiran2019automatic, the reported accuracy on SIPaKMeD WSI is 96.37%, which is almost the same as ours. Shi et al. shi2019graph achieve an impressive result of 98.37% on the 5-class SIPaKMeD dataset. Kiran GV et al. gv2019automatic extracted features from the ResNet-34 CNN model using transfer learning and applied Principal Component Analysis in the penultimate feature layer of the CNN for the final feature set selection and classification. They achieved an accuracy of 99.63% on the SIPaKMeD SCI dataset and 96.37% on the SIPaKMeD WSI dataset employing 5-fold cross-validation. However, we have implemented a different approach that does not require extraction of features and takes the opinion of multiple experts (CNN models), making the performance robust across the different datasets used, and is computationally efficient while keeping classification performance on par with the state of the art. The proposed approach outperforms most of the works reported so far, which supports the reliability of the model.

4.9 GradCAM Analysis

Figure 4: GradCAM activations using the three base learners (Inception v3, DenseNet-161 and ResNet-34) on the three datasets used: (a)-(d) Mendeley LBC dataset, (e)-(h) SIPaKMeD WSI dataset and (i)-(l) SIPaKMeD SCI dataset. Within each group, the panels show, in order, the original image and the activations of Inception v3, DenseNet-161 and ResNet-34.

In this section, we use Gradient-weighted Class Activation Maps (GradCAM) by Selvaraju et al. selvaraju2017grad to visually represent the distinguishing regions in the single-cell and whole-slide Pap-stained images that enable the base classifiers to make their predictions. The results are shown in Figure 4 for the three datasets used in this study. GradCAM uses the gradients of the class score with respect to the feature maps of the last convolution layer to weight the contribution of each feature map towards the class prediction made by the CNN classifier.
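The core GradCAM computation can be sketched without any deep learning framework, assuming the activations of the last convolution layer and the gradients of the class score with respect to them have already been extracted. The function name `grad_cam` and the nested-list layout `[channel][row][column]` are assumptions of this example; upsampling the map to the input image size is omitted.

```python
from typing import List

FeatureMaps = List[List[List[float]]]  # indexed as [channel][row][column]


def grad_cam(activations: FeatureMaps, gradients: FeatureMaps) -> List[List[float]]:
    """Return a heat map in [0, 1] with the spatial size of the feature maps.

    Each channel weight is the spatial mean of that channel's gradients; the
    map is the ReLU of the weighted sum of activations, scaled by its maximum.
    """
    n_rows = len(activations[0])
    n_cols = len(activations[0][0])

    # Channel importance: global average pooling of the gradients.
    weights = [
        sum(sum(row) for row in grad) / (n_rows * n_cols) for grad in gradients
    ]

    # Weighted combination of the feature maps followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[r][c] for w, act in zip(weights, activations)))
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]

    peak = max(max(row) for row in cam)
    if peak == 0.0:
        return cam  # nothing supports the class; keep the all-zero map
    return [[value / peak for value in row] for row in cam]


if __name__ == "__main__":
    # Two toy 2x2 feature maps with constant gradients 0.5 and -1.0.
    acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
    grads = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
    for row in grad_cam(acts, grads):
        print(row)
```

The ReLU keeps only the regions that contribute positively to the chosen class, which is why the highlighted areas in such maps correspond to evidence for the prediction rather than against it.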

For all three datasets, Figure 4 shows that the three classifiers focus on different regions of the corresponding original image. For example, for the SIPaKMeD SCI image in Figure 4(i), the ResNet-34 model (Figure 4(l)) puts attention solely on the nucleus of the cell, whereas DenseNet-161 (Figure 4(k)) focuses on the nucleus as well as on the cytoplasm of the cell. Inception v3 in Figure 4(j) focuses on the outliers and on the nucleus. Clearly, the three models take into account different aspects of the image. Thus, when the ensemble of these three models is computed, the prediction incorporates the complementary information provided by the different classifiers, and a superior prediction is made. Similarly, for the whole slide images of the Mendeley LBC dataset (Figure 4(a)) and the SIPaKMeD WSI dataset (Figure 4(e)), different classifiers focus on different cells within the slide, which further enables the ensemble model to aggregate the discerning information from the base learners to compute a prediction.

4.10 Error Analysis

(a) SIPaKMeD SCI: Superficial Intermediate
(b) SIPaKMeD WSI: Metaplastic
(c) Mendeley LBC: NILM
Figure 5: Examples of instances where the ensemble approach made correct predictions although not all contributing classifiers predicted correctly

The proposed framework shows robust and reliable performance in the cervical cytology image classification task. For example, Figure 5 shows instances where some individual classifiers predicted wrong classes while the ensemble approach made correct predictions. Figure 5(a) shows a test image from the SIPaKMeD SCI dataset belonging to the class "Superficial Intermediate", which was predicted correctly by the fuzzy ensemble method despite the image containing multiple nuclei. This image was classified incorrectly by Inception v3 and ResNet-34, but correctly by DenseNet-161. The confidence of DenseNet-161 in its prediction for this instance was much higher than that of Inception v3 and ResNet-34 in theirs. As a result, the ensemble method gave priority to the decision of the DenseNet-161 model and predicted the sample to be "Superficial Intermediate". Similarly, Figure 5(b) shows a sample from the SIPaKMeD WSI dataset which was predicted correctly by DenseNet-161 to be "Metaplastic" but wrongly by Inception v3 as "Dyskeratotic". Figure 5(c) shows a correct prediction from the Mendeley LBC dataset as belonging to the class "NILM", where DenseNet-161 made the correct prediction while Inception v3 predicted it to be "HSIL".
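The behaviour described above, where one confident classifier overrides two less confident ones, can be illustrated with a small sketch of a Sugeno fuzzy integral built on a Sugeno lambda-measure. This is an illustration under stated assumptions, not the implementation used in this work: the function names, the fuzzy densities (0.4, 0.5, 0.4) and the confidence scores are all hypothetical values chosen for the example, and the lambda parameter is found by simple bisection.

```python
def solve_lambda(densities):
    """Solve prod(1 + lam * g_i) = 1 + lam (lam > -1) by bisection.

    Assumes at least two fuzzy densities, each strictly between 0 and 1.
    """
    assert len(densities) >= 2 and all(0.0 < g < 1.0 for g in densities)
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities are already additive

    def f(lam):
        prod = 1.0
        for g in densities:
            prod *= 1.0 + lam * g
        return prod - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0, -1e-6      # non-zero root lies in (-1, 0)
    else:
        lo, hi = 1e-6, 1.0        # non-zero root is positive
        while f(hi) <= 0.0:
            hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def sugeno_integral(scores, densities, lam):
    """Sugeno integral of one class's scores across the classifiers."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best, measure = 0.0, 0.0
    for i in order:
        # Measure of the set of classifiers seen so far (lambda-measure rule).
        measure = densities[i] + measure + lam * densities[i] * measure
        best = max(best, min(scores[i], measure))
    return best


# Hypothetical densities and scores, ordered as
# [Inception v3, DenseNet-161, ResNet-34]; three classes, true class 0.
densities = [0.4, 0.5, 0.4]
scores = [[0.35, 0.40, 0.25],   # Inception v3: weakly prefers class 1
          [0.90, 0.05, 0.05],   # DenseNet-161: confident in class 0
          [0.38, 0.22, 0.40]]   # ResNet-34: weakly prefers class 2
lam = solve_lambda(densities)
fused = [sugeno_integral([s[c] for s in scores], densities, lam) for c in range(3)]
predicted = max(range(3), key=lambda c: fused[c])
```

With these toy numbers the fused score of class 0 is the highest even though two of the three classifiers individually prefer other classes, which mirrors the situation in Figure 5(a).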

Figure 6: Misclassified HSIL class image of Mendeley LBC dataset

Figure 6 shows the only misclassified sample from the Mendeley LBC dataset. The image belongs to the class HSIL but was predicted to be of class SCC by the proposed model.

(a) Dyskeratotic
(b) Koilocytotic
(c) Metaplastic
Figure 7: Misclassified samples of SIPaKMeD WSI dataset, originally belonging to (a) Dyskeratotic (b) Koilocytotic and (c) Metaplastic classes

Figure 7 shows some misclassified samples from the SIPaKMeD WSI dataset. The most probable reason for the wrong classification, in this case, is the presence of several types of cells in a single image. For example, in Figure 7(a), the number of "Metaplastic" cells is greater than the number of "Dyskeratotic" cells, which led to the originally "Dyskeratotic" class image being classified as "Metaplastic". The case is reversed for Figure 7(c), where most cells are of the "Dyskeratotic" class, and thus an originally "Metaplastic" class image is classified as "Dyskeratotic". These wrong predictions might be due to the improper placement of these images in the classes while creating the dataset.

(a) Dyskeratotic
(b) Koilocytotic
(c) Metaplastic
(d) Superficial Intermediate
Figure 8: Misclassified samples of SIPaKMeD SCI dataset, originally belonging to (a) Dyskeratotic (b) Koilocytotic (c) Metaplastic and (d) Superficial Intermediate classes

Figure 8 shows some misclassified samples from the SIPaKMeD SCI dataset. The possible reasons for the misclassifications are poor image quality, which leaves the nuclei unclear, as in Figure 8(a) and (b), and the presence of multiple cell nuclei in the image, which is not expected in a single-cell image dataset, as in Figure 8(c) and (d).

5 Statistical Analysis: McNemar’s Test

Compared with    McNemar's test p-value (one column per dataset)
Inception v3     0.00012    1.88E-08    0.0005
DenseNet-161     0          0           0.0455
ResNet-34        0.0217     0.0036      0.00433
Table 13: Results from McNemar's test: the null hypothesis is rejected in every case

McNemar's test dietterich1998approximate has been performed to justify the viability of the proposed framework with respect to the constituent models in the ensemble. Table 13 shows the results of the test on all three datasets. The p-value is lower than 0.05 (the 5% significance level) in every case, and hence the null hypothesis that the ensemble and the compared base model perform alike can be rejected. This indicates that the predictions of the proposed framework differ significantly from those of the individual models used to form the ensemble, which, together with its higher accuracy, supports the claim that the ensemble is significantly better.
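For reference, the test can be sketched as follows. This is a minimal illustration using the continuity-corrected chi-square form of McNemar's test, not the script used to produce Table 13: the function name `mcnemar_test` and the label lists are hypothetical, and the p-value relies on the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar's test for two classifiers.

    Returns (statistic, p_value). Only the samples on which exactly one
    of the two classifiers is correct contribute to the statistic.
    """
    a_only = 0  # A correct, B wrong
    b_only = 0  # A wrong, B correct
    for t, a, b in zip(y_true, pred_a, pred_b):
        if a == t and b != t:
            a_only += 1
        elif a != t and b == t:
            b_only += 1
    if a_only + b_only == 0:
        return 0.0, 1.0  # the classifiers never disagree in correctness
    statistic = (abs(a_only - b_only) - 1) ** 2 / (a_only + b_only)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical labels: 20 samples, ensemble vs. one base model.
y_true = [1] * 20
ensemble_pred = [1] * 18 + [0] * 2
base_pred = [0] * 15 + [1] * 5
statistic, p_value = mcnemar_test(y_true, ensemble_pred, base_pred)
reject_null = p_value < 0.05
```

In this toy example the ensemble alone is correct on 15 samples and the base model alone is correct on 2, so the null hypothesis is rejected at the 5% level.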

6 Conclusions & Future Work

In this paper, we propose a fuzzy-fusion based CNN integration method to address the problem of classification of pap-smear based cervical cytology images. The decision scores obtained from CNN classifiers are used as the input of the fuzzy-integral to perform the final classification. With classification accuracies of 99.48%, 96.33%, and 98.54% on Mendeley LBC, SIPaKMeD whole slide images (WSI) and SIPaKMeD single-cell images (SCI) datasets respectively, our proposed method has shown superior performance as compared to other simple fusion methods and has outperformed several existing methods on these datasets. The proposed Sugeno Fuzzy Integral based ensemble is the first such implementation in this domain, and its adaptive weighting system based on the confidence scores of contributing classifiers makes it perform better than the traditional ensemble schemes previously used in the literature as evident from Table 8.

Graph convolution networks (GCN) and attention-gated networks have also shown promising performance in several domains, which motivates us to experiment with fuzzy fusion-based methods on these test-beds in the future. The fuzzy measures are currently selected based on the individual classifier performance on test sets, which is not the optimal solution. Hence, we further plan to implement an evolutionary meta-heuristic optimization algorithm for the selection of the fuzzy measures of the classifiers, which might further improve the overall classification performance. We might also incorporate other CNN classifiers to form the ensemble in the future.