1 Introduction
Pulmonary optical endomicroscopy (POE) uses fibered confocal fluorescence microscopy, which can provide diagnostic information about fibrosis and inflammation of the distal air spaces associated with lung disease salaun2012. It is a new, real-time imaging technology that provides imagery of the pulmonary alveoli at a microscopic level. However, when acquired in clinical use, a POE image sequence can contain a large proportion, around 25%, of uninformative frames such as pure noise and motion artefacts perperidis2017. Before any further data analysis, these uninformative frames must first be removed from the dataset. In clinical examination, the detection of uninformative frames is currently carried out manually. This manual operation is time-consuming and laborious. Therefore, automatic detection is necessary to speed up data analysis and shorten diagnostic time. Our work aims at developing an automatic detection method to remove uninformative frames from a sequence of images. Fig. 1 shows four informative and four uninformative images. We can observe that the textures in these two kinds of images are very different. Hence, we can consider the detection problem as a classification one.
Deep learning is an advanced machine learning technique showing powerful classification ability. It consists in estimating the parameters of the activation functions by minimizing a loss function comparing the output of the neural network with the desired output. It has been shown to outperform state-of-the-art specialized methods Zhou2019; Li2020 in various applications. One class of deep learning techniques is supervised deep learning. In supervised techniques, the considered data are labeled; for instance: informative versus uninformative POE frames. The loss function evaluates the difference between the labels estimated by the network and the true labels, and the parameters are updated in order to minimize this loss. To achieve this, we generally use a Convolutional Neural Network (CNN) fukushima1988; lecun1998. Fig. 2 presents the general architecture of CNNs. Many classification methods based on CNNs have been developed in the medical imaging field yadav2019; ahn2016; heung2017; amyar2019
. The definition of the loss function used for training the network is important for achieving good accuracy. Most studies using entropy in deep learning have focused on its most common forms, such as the quadratic metric or the Shannon cross-entropy. The quadratic metric is generally used when the desired output is deterministic, for instance in image segmentation. Cross-entropy loss functions are generally used for probabilistic decisions, for instance in classification. Shannon entropy is the best-known entropy, and the derived cross-entropy can easily be interpreted as a "distance" between probabilities Sen2015. Moreover, Shannon entropy is most relevant when the data follow a Gibbsian distribution (i.e. the exponential of the opposite of a convex energy function). Unlike Shannon's entropy, Havrda-Charvat entropy Roselin2014; Chen2010; Maksa2008; Kurt2014; Zhu2020 does not require a specific distribution of the data in order to keep its specificity and accuracy. In this paper, we propose to implement a CNN with a loss function derived from Havrda-Charvat's entropy to classify the studied frames into their proper categories (informative/uninformative), to compare the results with a Shannon-based cross-entropy, and to find out what improvement can be achieved.

2 Related Works
2.1 Data classification
There exist many methods for classifying data. Some methods, such as Support Vector Machines (SVM), are geometrical cortez1995. SVM is inspired by the Hahn-Banach theorem brezis and is a supervised learning method consisting in separating two convex subsets by a hyperplane. SVM has been generalized to the nonlinear case through the introduction of a metric kernel Crammer2001, and sollich2002 proposes to estimate the kernel via Bayesian inference. The decision tree method breiman1984 constructs tree-based classifiers both in regression and in classification: a tree is built by gradually dividing a population into two subpopulations in order to optimize the homogeneity of the populations according to their labels. Decision trees have been generalized into random forests breiman2001, which group together a multitude of independent decision trees. These are built from the same learning base using different random processes. Combining several decision trees makes it possible to reduce the influence of noisy data during the learning phase. Random forests have been used successfully in classification, for instance in radiomics parmar2015. Other classification methods are statistical. Amongst statistical methods, logistic regression vittinghoff2006 consists in learning the parameters of the logistic classification function from annotated data. In likelihood maximization barbu2006, a probabilistic model representing the probability of each class as a function of the observation is given; the decision consists in choosing the most likely class. Such likelihood maximization is also used in the more recent Conditional Random Fields (CRF) for image segmentation magnano2014 and for classification liliana2017. Another kind of statistical method is the Bayesian one, in which we have prior knowledge about the class membership. Bayesian methods bernardoSmith1994; ghosh2006 have been widely used in image segmentation lanchantin2008 and data classification priya2015. More recently, deep-learning methods are more and more used. Amongst applications of deep learning to data classification, one can quote text recognition audebert2019; majeedi2018 or spam detection dada2019. The most classical deep-learning method for supervised classification is based on Convolutional Neural Networks (CNN). Generally, a CNN is composed of two sets of layers: the first one applies convolution maps in order to reduce the data, and the second one is a fully connected network. There are different CNN-based architectures. Amongst them, the LeNet architecture lecun1998 is the first successful CNN, used to classify digits; AlexNet krizhevsky2012 was the first CNN applied to computer vision and was submitted to the ImageNet ILSVRC challenge in 2012; ZFNet is an improvement of AlexNet proposed in
zeiler2013. Many applications of deep learning to the classification of optical images have already been made in moccia2020; li2019; chang2018.

2.2 Generalized entropies and their applications
From a technical point of view, there are several ways to generalize the classical Shannon entropy and the different metrics and divergences, as summarized in basseville2010. There are two main ways to generalize the Shannon entropy: the first one consists in replacing the integrated functional in the expression of the Shannon entropy, and the second one relies on an axiomatic approach kumar2014. As underlined in Li2020, the Havrda-Charvat and Shannon entropies are the only ones which satisfy the strong recursivity property. Even if generalizations of the Shannon entropy such as Havrda-Charvat are not recent, most of their applications are in clustering Li2020; Sen2015; Chen2010; Zhu2020 or coding theory kumar2012InternalJournal. In kumar2014, a weakened recursivity property is studied and a generalization of Havrda-Charvat satisfying this property is proposed. This property makes its use easier in the case of multilabel classification, for instance for gene expression analysis Li2020. In Roselin2014, results of mammogram image classification are compared using different parameterized entropies: Rényi, Havrda-Charvat, Tsallis and Kapur's entropy. It appears that Tsallis entropy gives the best results in their study. However, Tsallis entropy has two parameters whereas Havrda-Charvat has only one and gives slightly lower performance. Havrda-Charvat entropy has also been used for clustering in Chen2010, by replacing the Shannon entropy with the Havrda-Charvat entropy in the Jensen-Shannon divergence. In Maksa2008, a functional equation from which the Havrda-Charvat, Shannon and Tsallis entropies can be derived is proposed. Moreover, the stability in the sense of Hyers-Ulam hyers2004 of this functional equation is studied.
3 Method
In this section, we present the neural network that we use for the classification of the optical images.
We use a supervised method for the classification of optical images of lungs into two classes: an informative class and an uninformative class. The method that we use is a Convolutional Neural Network (CNN) whose architecture is represented in Fig. 2. The loss functions we use for the supervised learning are entropy-based functions. These loss functions compare two probability laws. The first one is the output of the CNN and represents the estimated probability that the input image belongs to the informative class; this estimated probability is computed by the sigmoid stage of the CNN. The second probability is a Dirac probability whose value is 1 if the image is informative and 0 otherwise. An entropy-based loss function is constructed by choosing an entropy function and a divergence for comparing the two distributions. The first entropy we consider is the classical Shannon entropy and the second one is the Havrda-Charvat entropy, which generalizes the Shannon entropy.
3.1 Supervised classification of optical images
In this paper, we focus on classifying optical images into two classes: informative and uninformative images. The set of possible states is \Omega = \{0, 1\}, where 1 corresponds to the event "informative image" and 0 to "uninformative image". In supervised classification, we train the network from an annotated database. Let N be the number of annotated images. To the i-th image, whose true label is denoted y_i \in \Omega, we associate the following Dirac probability:

p_i(x) = \delta_{y_i}(x) = \begin{cases} 1 & \text{if } x = y_i, \\ 0 & \text{otherwise,} \end{cases} \qquad x \in \Omega. \qquad (1)
Let q_i be the output of the CNN for the i-th image, i.e. the estimated probability law on \Omega. The loss function to be minimized is:

\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} D(p_i, q_i), \qquad (2)

where D is the chosen entropy-divergence.
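The construction of the Dirac targets (1) and the averaged divergence (2) can be sketched as follows. This is a minimal NumPy illustration of ours, not the paper's implementation; the function names are arbitrary and the Shannon cross-entropy is used only as an example divergence:

```python
import numpy as np

def dirac(label, n_classes=2):
    """Dirac (one-hot) probability of Eq. (1): all mass on the true class."""
    p = np.zeros(n_classes)
    p[label] = 1.0
    return p

def mean_divergence(labels, q_pred, divergence):
    """Loss of Eq. (2): average of the chosen entropy-divergence
    D(p_i, q_i) over the N annotated images."""
    return float(np.mean([divergence(dirac(y, len(q)), q)
                          for y, q in zip(labels, q_pred)]))

# Shannon cross-entropy as an example divergence D
shannon_ce = lambda p, q: -np.sum(p * np.log(np.asarray(q) + 1e-12))
```

For a confident, correct prediction the loss is small; e.g. with output q = (0.1, 0.9) for a label-1 image, the corresponding term equals -log 0.9, about 0.105.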
The two following paragraphs detail how one can generalize the Shannon entropy and how an entropy-divergence can be built from an entropy.
3.2 Generalized entropies
An entropy function is a concave function from a subset of probability densities to the real line. The best-known entropy function is the Shannon entropy; for discrete probabilities (i.e. probabilities defined on a countable space \Omega), it is defined by:

H(p) = -\sum_{x \in \Omega} p(x) \log p(x)\, \mu(x), \qquad (3)

where \mu is a measure on \Omega; in practice, \mu is taken to be the counting measure, but it can also represent prior information on the state x.
As explained in basseville2010, there are several ways to generalize the Shannon entropy. A classical way to define a generalized entropy is:

H_\phi(p) = -\sum_{x \in \Omega} \phi(p(x))\, \mu(x), \qquad (4)

where \phi is a convex function defined on [0, 1].
In this paper, we propose to use the Havrda-Charvat entropy kumar2014, which belongs to a parametric family whose convex functional is given by:

\phi_\alpha(u) = \frac{u^\alpha - u}{\alpha - 1}, \qquad (5)

where \alpha > 0 and \alpha \neq 1. By studying the limit as \alpha \to 1, we recover \phi_1(u) = u \log u. As a consequence, the Havrda-Charvat entropy whose parameter \alpha equals 1 (understood as this limit) coincides with the Shannon entropy.
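The Shannon limit invoked here can be made explicit: writing u^{\alpha-1} = e^{(\alpha-1)\ln u} and expanding the exponential to first order in (\alpha - 1),

```latex
\lim_{\alpha \to 1} \phi_\alpha(u)
  = \lim_{\alpha \to 1} u \, \frac{u^{\alpha-1} - 1}{\alpha - 1}
  = \lim_{\alpha \to 1} u \, \frac{e^{(\alpha-1)\ln u} - 1}{\alpha - 1}
  = u \ln u .
```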
By replacing \phi by \phi_\alpha in (4), we deduce the expression of the Havrda-Charvat entropy:

H_\alpha(p) = \frac{1}{1 - \alpha} \sum_{x \in \Omega} \left( p(x)^\alpha - p(x) \right) \mu(x), \qquad (6)

and if \mu is the counting measure:

H_\alpha(p) = \frac{1}{1 - \alpha} \left( \sum_{x \in \Omega} p(x)^\alpha - 1 \right). \qquad (7)
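As a quick sanity check of Eq. (7), the entropy can be evaluated numerically. This is an illustrative sketch of ours, not code from the paper:

```python
import numpy as np

def havrda_charvat_entropy(p, alpha):
    """Havrda-Charvat entropy of Eq. (7) for the counting measure;
    alpha = 1 is handled as its Shannon limit."""
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        pos = p[p > 0]
        return float(-np.sum(pos * np.log(pos)))  # Shannon entropy
    return float((np.sum(p ** alpha) - 1.0) / (1.0 - alpha))
```

For the uniform law p = (0.5, 0.5), alpha = 2 gives (2 x 0.25 - 1)/(1 - 2) = 0.5, while the Shannon limit gives log 2, about 0.693.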
3.3 Entropy-divergences and cross-entropy
There exist several ways to construct a loss function from a given entropy. In this paper, we consider the cross-entropy, as it is the most used in deep learning when the entropy is the Shannon entropy. From an entropy of the form (4), the cross-entropy is defined as:

D_\phi(p, q) = -\sum_{x \in \Omega} p(x)\, \frac{\phi(q(x))}{q(x)}\, \mu(x). \qquad (8)

In the case where \phi = \phi_1 (i.e. \alpha = 1) and \mu is the counting measure, we recover the classical cross-entropy:

D_1(p, q) = -\sum_{x \in \Omega} p(x) \log q(x), \qquad (9)

and for \alpha \neq 1, the Havrda-Charvat cross-entropy is defined by:

D_\alpha(p, q) = \frac{1}{1 - \alpha} \left( \sum_{x \in \Omega} p(x)\, q(x)^{\alpha - 1} - 1 \right). \qquad (10)
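A direct implementation of the two cross-entropies (9) and (10) can be sketched as follows. This is our illustrative code; the clipping of q is an implementation detail we add to avoid log 0:

```python
import numpy as np

def hc_cross_entropy(p, q, alpha, eps=1e-12):
    """Havrda-Charvat cross-entropy of Eq. (10); alpha = 1 falls back
    to the classical Shannon cross-entropy of Eq. (9)."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(q)))
    return float((np.sum(p * q ** (alpha - 1.0)) - 1.0) / (1.0 - alpha))
```

With a Dirac target p = (0, 1) and output q = (0.1, 0.9), alpha = 2 gives (0.9 - 1)/(1 - 2) = 0.1, and alpha = 1 gives -log 0.9.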
In this paper, we compare classification results for several loss functions: Havrda-Charvat-based loss functions with \alpha ranging from 1 to 2, the case \alpha = 1 corresponding to the Shannon cross-entropy.
4 Results and analysis
4.1 Data
In this study, the dataset gathers several kinds of cases. The potential diseases are: asbestosis, idiopathic pulmonary fibrosis, hypersensitivity pneumopathy and sclerodermia, with some healthy people as the control group. We used 3895 images, of which 2313 were informative and 1582 were uninformative. These frames were obtained from image sequences provided by the CHU (University Hospital) of Rouen, Normandy. The size of the images is either 512x512 or 500x500 pixels, depending on the sequence. We normalized all images to a 128x128 resolution in order to train the CNN at a reasonable resolution, balancing image size against processing time. The data were obtained as video sequences and turned into frames for the analysis.
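The normalization step can be sketched as follows. Nearest-neighbour resampling is our arbitrary choice for illustration (the paper does not specify the interpolation used); it handles both the 512x512 and 500x500 frames:

```python
import numpy as np

def resize_to(img, size=128):
    """Nearest-neighbour resampling of a square grayscale frame to
    size x size pixels (hypothetical stand-in for the 128x128
    normalization; bilinear interpolation would work equally well)."""
    h, w = img.shape
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[np.ix_(rows, cols)]
```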
4.2 Implementation
We chose a simple CNN architecture. The two main kinds of layers in a CNN are the following:

1) The first layers are the convolutional layers. Their function is to reduce the input to its most prominent features, in order to retain the most important information while reducing the dimensions of the input image. These layers are often associated with an activation function, which determines the value of the output, and possibly a max-pooling layer, which further reduces the dimensions and helps extract features.

2) The second set of layers are the fully connected layers, whose role is to establish a decision rule that leads, on the last layer, to the classification of the input image.
We implemented the following layers: five convolutional layers with decreasing numbers of filters (128, 64, 32, 16 and 8), each using a convolution kernel of dimensions (3,3) and the Rectified Linear Unit (ReLU) activation, followed by four dense layers of decreasing size (128, 64, 32 and 16). The total number of trainable parameters in this architecture was around 6.6 million. The validation split was set at 70/30, meaning that 70% of the data served as a training set and the remaining 30% were used as a validation set. The batch size, i.e. the number of samples passed simultaneously through the CNN, was 64. We also added dropout layers after the first three dense layers; these layers force the network to drop certain connections, reducing overfitting.
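The architecture described above can be sketched in Keras as follows. This is a minimal reconstruction under assumptions: the padding, the pooling placement and the 0.5 dropout rate are ours, and this sketch does not reproduce the roughly 6.6 million parameters of the actual network:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_model():
    """Sketch: five (3,3) ReLU convolutional layers with 128..8 filters,
    each followed by max-pooling, then dense layers 128..16 with dropout
    after the first three, and a sigmoid output for the binary decision."""
    model = models.Sequential([layers.Input(shape=(128, 128, 1))])
    for filters in (128, 64, 32, 16, 8):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D())
    model.add(layers.Flatten())
    for units in (128, 64, 32):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(0.5))  # dropout after the first three dense layers
    model.add(layers.Dense(16, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    return model
```

Calling model.fit with validation_split=0.3 and batch_size=64 would then reproduce the 70/30 split and batch size described above.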
4.3 Learning conditions
We use balanced datasets, with about half of the images being uninformative and half informative. In this way, we prevent the availability bias that would orient the model toward the most available kind of data; such a bias can skew predictions in a way explained by the class distribution rather than by the image content. We used the Python language with the Keras library, a powerful, easy-to-use Python library for developing and evaluating deep learning models.
The evaluation criterion was the accuracy computed with the CNN, i.e. the proportion of correct identifications over the total size of the dataset.
4.4 Experimentation
A comparison study between the different entropies is carried out as a function of the number of epochs to assess their accuracy (see TABLE 1). N represents the number of epochs set for the network's training, and \alpha is the coefficient of the same name in the Havrda-Charvat formula. The following table was made for a 2947-image dataset (1625 informative / 1322 uninformative):

TABLE 1. Accuracy as a function of N and \alpha.

 N  | \alpha = 1.0 | 1.1  | 1.3  | 1.5  | 2.0
 20 |     0.59     | 0.65 | 0.62 | 0.63 | 0.59
 30 |     0.59     | 0.68 | 0.73 | 0.59 | 0.59
 40 |     0.70     | 0.79 | 0.77 | 0.76 | 0.59

To support our observations, we also computed the specificity and sensitivity for the same values of \alpha at 40 epochs.

             | \alpha = 1.0 | 1.1  | 1.3  | 1.5  | 2.0
 Specificity |     0.87     | 0.91 | 0.92 | 0.78 | 0.99
 Sensitivity |     0.71     | 0.79 | 0.78 | 0.75 | 0.59

We can observe that the number of epochs has a moderate influence on the end results, except, of course, when the number of epochs is deliberately too small to properly reach convergence. During testing, we noticed that the results tended to present a greater disparity the fewer epochs we set. This is due to the fact that the network selects a random subset of input images from the dataset to perform both training and testing. By increasing the number of epochs, we make sure that the network reaches convergence and thus diminish the random factor in the subset's analysis.
The values of sensitivity and specificity allow us to rule out the value \alpha = 2.0. These results concur with the previous ones. The remaining values are satisfactory regarding the model's accuracy.
When increasing the number of epochs beyond 40, we obtained very high results, including with Shannon's entropy. This can be explained by the fact that our sample remains small in comparison to what a neural network might need, so its learning can turn into overfitting: the network learns to recognize the patterns it has been given so well that it becomes unable to recognize other ones if they differ too much from the training set. As a rule of thumb, the more epochs the better; but if a network seems to have reached stability after a certain number of epochs, the following ones tend to improve it only by a small amount, and that is when the network enters the domain of overfitting, which is to be avoided. That is why we need to carefully study how the network converges and prevent it from training for too long.
The best results for the supervised setting were 79% of correct classification when using Havrda-Charvat with \alpha = 1.1 (77% with \alpha = 1.3) and 40 epochs. This is a slight, yet noticeable improvement over Shannon's entropy. Moreover, for 40 epochs, the respective accuracies for Shannon's entropy and Havrda-Charvat are 70% and 79% in the first experiment, and 77% and 79% in the second experiment, which makes the latter a clear improvement over Shannon. It demonstrates that the Havrda-Charvat formula can be an improvement when it comes to classification, and that the loss function derived from it is a better alternative to the more common one.
As a trend, we noticed that the Havrda-Charvat formula yielded better results than Shannon's. This can be explained by the fact that Shannon's entropy needs data of a certain type to be fully relevant, i.e. data that can be distributed as the exponential of a convex function; in other words, it needs a single extremum to safely move toward. If the data do not meet this requirement, Shannon's entropy cannot be considered completely reliable. Havrda-Charvat is a generalized version of Shannon's entropy and does not need any specific conditions on the data. The parameter \alpha in Havrda-Charvat can be fixed depending on the data or estimated. Having no prior information about the nature of our data, this more general entropy can thus give better performance.
We can also observe that the parameter \alpha should not be increased to high values. When \alpha got close to 2, we quickly noticed, after 20 epochs, that our network stopped improving and remained stuck at accuracy values close to 0.60.
The interval of \alpha giving good results is between 1.1 and 1.5 in both experiments. We will study how to automatically estimate this value in upcoming work.
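A first, naive way to automate the estimation of \alpha announced here is a validation-based grid search. In this sketch of ours, train_and_score is a hypothetical user-supplied callable that trains the CNN with the Havrda-Charvat loss for a given \alpha and returns the validation accuracy:

```python
def best_alpha(train_and_score, alphas=(1.0, 1.1, 1.3, 1.5, 2.0)):
    """Return the alpha in the candidate grid with the highest
    validation score reported by train_and_score."""
    return max(alphas, key=train_and_score)
```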
4.5 Validation loss study
We notice, thanks to figure 3, that for the selected values of \alpha, the training and validation errors decrease at the same pace for N = 40. For Shannon's entropy, the two losses are higher than for the two Havrda-Charvat-based models. We can deduce that, despite also slightly decreasing, Shannon's entropy is less relevant and trustworthy.
We can deduce that our results for 40 epochs regarding \alpha are coherent with the tests. However, for a number of epochs N greater than 40, the phenomenon of overfitting starts to appear. We tested the models for N = 100 epochs, and the overfitting phenomenon appears and increases as training proceeds. For the size of the studied dataset, 40 epochs can be considered sufficient, but as the dataset's size increases, a greater number of epochs will become usable without triggering any overfitting.
To sum up, for further study, this value of \alpha seems to be the most promising and an improvement over the commonly used Shannon entropy.
Regarding the other value, we can notice that, after 40 to 50 epochs, the validation loss tends to increase while the training loss reaches its minimum. If this value were to be used, one would have to keep the number of epochs below 50 for a dataset of comparable size, or adapt it to reflect this level of fitting, since increasing epochs beyond this point would soon lower the adaptability of the network due to overfitting.
We notice that the validation error tends to move away from the training error when N becomes greater than 30, and this phenomenon grows as the epochs increase. Again, we can notice that the number of epochs should not be increased beyond 40, as the two losses grow away from one another at a rapid rate.
Moreover, we notice that the phenomenon is greater with Shannon's entropy than with Havrda-Charvat's, which is another element in favor of the new loss function over the more commonly used Shannon-based loss. It already starts at N = 40 epochs for the Shannon-based loss, as depicted in figure 3, hence the lower reliability of the results achieved with it, including sensitivity and specificity.
4.6 Image classification
Among the correctly classified images, we find those with obvious, clearly identified structures that the network identifies as informative:
Conversely, the obviously uninformative ones are recognized by the algorithm and deemed as noise:
Despite the encouraging results achieved by the network, some limitations exist.
5 Limitations
5.1 Classification errors
During the experiments, we noticed that several images were classified in the wrong category. For example, the following images were classified as informative although they were uninformative:
As an explanation for these mistakes, we can say that the data are very noisy and that any structure in the image can pass for information, thus making the image informative according to the neural network. Structures are hard to take into account, since the feature map at the end of the convolutional layers is flattened to meet the fully connected layers' input requirements.
Noise makes it difficult for the algorithm to notice when an artifact occurs, for example when the patient moves or when the scope sees things that are not relevant to the current examination, such as blood.
Conversely, when genuine structures become blurrier or are mixed with a lot of noise, the picture can be deemed uninformative although a structure is present:
Thus, noisy images still presenting informative elements are susceptible to being considered as pure noise and therefore uninformative, the useful data being lost among the noise values.
A possible solution would be to preprocess the data with smoothing filters in order to reduce the noise's influence on the signal.
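Such a preprocessing step could be as simple as a mean filter. This is our sketch; a Gaussian or median filter would be an equally plausible choice:

```python
import numpy as np

def mean_filter(img, k=3):
    """k x k mean filter with edge padding: each output pixel is the
    average of its k x k neighbourhood, attenuating pixel-level noise."""
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)
```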
5.2 Data Scarcity and Overfitting
As presented in the previous part, a phenomenon of overfitting appears as we increase the number of epochs. This phenomenon comes from the small dataset that we use. Usually, the more epochs the better, but with smaller databases the network tends to learn its training set too well and becomes unable to adapt to any other data. This is why, for further research, we will need more labeled data.
Another option regarding overfitting is to lower the complexity of the network by reducing the number of neurons in existing layers or by deleting some layers without altering the global structure of the network.
Regarding the scarcity of labeled data, a solution is to design algorithms that are less impacted by small datasets, or that can work without labels on bigger sets of images.
5.3 Balanced dataset
In the introduction of this paper, we described the uninformative data as representing about 25% of the available frames of endomicroscopy videos. However, to make the learning less biased, we made each category (uninformative/informative) closer to half of the training dataset.
These data were chosen in order to be closer to a balanced dataset, but without cutting any sequence of frames. To make sure the dataset is fully representative of the real acquired data, further work should proceed with 25% of the frames being uninformative, to see whether the results are similar. Imbalance in a dataset introduces biases that lead the algorithm to determine that, since one class has a greater probability of occurring than the other, any analyzed frame is naturally more prone to be categorized as such, regardless of its characteristics: this is an availability issue.
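One plausible way to obtain such a near 50/50 split is to undersample the majority class. This is our sketch, and note that it ignores the constraint of not cutting frame sequences mentioned above:

```python
import numpy as np

def balanced_indices(labels, seed=0):
    """Indices of a balanced subset: all frames of the minority class plus
    an equal-size random sample of the majority class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(labels == 0), np.flatnonzero(labels == 1)
    minority, majority = sorted((idx0, idx1), key=len)
    keep = rng.choice(majority, size=len(minority), replace=False)
    return np.sort(np.concatenate([minority, keep]))
```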
5.4 Hyperparameter
In this study, we set the parameter \alpha to predetermined values and obtained the corresponding results. We noticed that, depending on the data available, the best value of this hyperparameter tended to vary. This is why we can assume that there may exist values of \alpha in the considered interval yielding better results. In order to further improve our findings, the algorithm needs to be modified in order to automatically test values of \alpha and provide the one which gives the best results.

6 Conclusion
In this paper, we designed a CNN classifier with a new loss function based on the Havrda-Charvat entropy. Most CNN classifiers use the Shannon entropy, while the Havrda-Charvat entropy is a generalized Shannon entropy; therefore, it can outperform the latter when the nature of the data does not satisfy certain conditions. Our application aims to classify pulmonary optical endomicroscopy images, in which informative and uninformative images are not easy to distinguish. The proposed classifier achieves an accuracy of 79% (77% for the second set), better than the one obtained with Shannon's entropy. In future work, we will analyze the endomicroscopy sequences after removing the uninformative images in order to help pathological diagnosis.
7 Acknowledgment
This project was co-financed by the European Union with the European Regional Development Fund (ERDF, 18P03390/18E01750/18P02733) and by the Haute-Normandie Regional Council via the M2SINUM project.
References
 (1) M. Salaün, R. Modzelewski, J.-P. Marie, et al., In vivo assessment of the pulmonary microcirculation in elastase-induced emphysema using probe-based confocal fluorescence microscopy, IntraVital 1:2 (2012) 122–131.
 (2) A. Perperidis, A. Akram, P. McCool, et al., Automated Detection of Uninformative Frames in Pulmonary Optical Endomicroscopy, IEEE Trans. on Biomedical Engineering 64 (1) (2017).
 (3) T. Zhou, et al., A review: Deep Learning for medical segmentation using multimodality fusion, Array/Elsevier 34 (100004) (2019).
 (4) H. Li, et al., Minimum Entropy Clustering and Applications to Gene Expression Analysis.
 (5) K. Fukushima, Neocognitron: A hierarchical neural network capable of visual pattern recognition, Neural Networks 1 (2) (1988) 119–130.
 (6) Y. LeCun, L. Bottou, Y. Bengio, et al., Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2323.
 (7) S. S. Yadav, S. M. Jadhay, Deep convolutional neural network based medical image classification for disease diagnosis, Journal Big Data 6 (113) (2019).
 (8) E. Ahn, A. Kumar, J. Kim, et al., X-ray image classification using domain transferred convolutional neural networks and local sparse spatial pyramid, in: IEEE International Symposium on Biomedical Imaging (ISBI), Prague, Czech Republic, 2016.
 (9) S. Heung-Il, L. Seong-Whan, S. Dinggang, Deep ensemble learning of sparse regression models for brain disease diagnosis for the Alzheimer's Disease, Medical Image Analysis (2017) 101–113.
 (10) A. Amyar, S. Ruan, I. Gardin, et al., 3D RPETNET: Development of a 3D PET Imaging Convolutional Neural Network for Radiomics Analysis and Outcome Prediction, IEEE Transactions on Radiation and Plasma Medical Sciences 3 (2019) 225–231.
 (11) H. Sen, A. Agarwal, Shannon and Non Shannon Entropy Based MRI Image Segmentation, International Bulletin of Mathematical Research (2015) 290–296.
 (12) R. Roselin, et al., Mammogram Image Classification: Non-Shannon Entropy based Ant-Miner, International Journal of Computational Intelligence and Informatics 4 (2014).
 (13) T. Chen, et al., Groupwise Point-set registration using a novel CDF-based Havrda-Charvát Divergence, International Journal of Computer Vision 86 (1) (2010) 111–124.
 (14) G. Maksa, The stability of the entropy of degree alpha, J. Math. Anal. Appl. 346 (2008) 17–21.
 (15) B. K. et al., A novel automatic suspicious mass regions identification using Havrda-Charvat entropy and Otsu's N thresholding, Computer Methods and Programs in Biomedicine 114 (2014) 349–360.
 (16) Q. Zhu, et al., A new loss function for CNN classifier based on predefined evenly-distributed class centroids, IEEE 8 (2) (2020) 10888–10895.
 (17) C. Cortes, V. Vapnik, Support-Vector Networks, Machine Learning 20 (1995) 273–297.
 (18) H. Brezis, Functional Analysis, theory and applications, Masson, 1987.
 (19) K. Crammer, Y. Singer, On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, Journal of Machine Learning Research 2 (2001) 265–292.
 (20) P. Sollich, Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities, Machine Learning 46 (2002) 21–52.
 (21) L. Breiman, J. Friedman, R. Olsen, et al., Classification and Regression Trees, Wadsworth and Brooks, 1984.
 (22) L. Breiman, Random Forests, Machine Learning 45 (1) (2001) 5–32.
 (23) C. Parmar, P. Grossmann, J. Bussink, et al., Machine Learning methods for Quantitative Radiomic Biomarkers, Scientific Reports 5 (1) (2015).
 (24) E. Vittinghoff, C. E. McCulloch, Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression, American Journal of Epidemiology 165 (6) (2006) 710–718.
 (25) V. Barbu, N. Limnios, Maximum likelihood estimation for hidden semi-Markov models, Comptes Rendus de l'Académie des Sciences 342 (2006) 201–205.
 (26) C. S. Magnano, A. Soni, S. Natarajan, et al., Conditional Random Fields for brain tissue segmentation, SDM 2014, 2014.
 (27) D. Y. Liliana, C. Basaruddin, A review on conditional random fields as a sequential classifier in machine learning, ICECOS 2017, 2017.
 (28) J. M. Bernardo, A. F. M. Smith, Bayesian theory, Wiley, 1994.
 (29) J. K. Ghosh, M. Delampady, T. Samanta, An introduction to Bayesian analysis. Theory and methods, Springer Texts in Statistics, 2006.
 (30) P. Lanchantin, J. Lapuyade-Lahorgue, W. Pieczynski, Unsupervised segmentation of triplet Markov chains hidden with long memory noise, Signal Processing 88 (5) (2008) 1134–1151.
 (31) T. Priya, S. Prasad, H. Wu, Superpixels for Spatially Reinforced Bayes Classification of Hyperspectral Images, IEEE Geoscience and Remote Sensing Letters 12 (5) (2015) 1071–1075.
 (32) N. Audebert, C. Herold, K. Slimani, C. Vidal, Multimodal deep networks for text and image-based document classification, ECML 2020, 2019.
 (33) A. Majeedi, A supervised learning methodology for realtime disguised face recognition in the wild, Industrial Conference on Robotics and Computer Vision (ICRCV 2018), 2018.
 (34) E. G. Dada, J. S. Bassi, H. Chiroma, et al., Machine learning for email spam filtering: review, approaches and open research problems, Heliyon 5 (6) (2019).
 (35) A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems (2012).
 (36) M. D. Zeiler, R. Fergus, Visualizing and Understanding Convolutional Networks, Computer Vision and Pattern Recognition (2013).
 (37) S. Moccia, L. Romeo, L. Migliorelli, E. Frontoni, Deep Learners and Deep Learner Descriptors for Medical Applications, Vol. 186, Springer, Intelligent Systems Reference Library, 2020, Ch. Supervised CNN Strategies for Optical Image Segmentation and Classification in Interventional Medicine.
 (38) L. Yibo, L. Mingjun, Z. Senyue, Classification of optical remote sensing images based on Convolutional Neural Network, International Conference on Control, Decision and Information Technologies (CoDit2019), 2019.
 (39) J. Chang, V. Sitzmann, X. Dun, et al., Hybrid opticalelectronic convolutional neural networks with optimized diffractive optics for image classification, Scientific reports (2018).
 (40) M. Basseville, Information: entropies, divergences et moyennes, Tech. rep., INRIA (2010).
 (41) S. Kumar, A generalization of the Havrda-Charvat and Tsallis entropy and its axiomatic characterization, Abstract and Applied Analysis (2014).
 (42) S. Kumar, A. Kumar, A Coding theorem on Havrda-Charvat and Tsallis's Entropy, International Journal of Computer Applications (1) (2012).
 (43) S.-M. Jung, Hyers-Ulam Stability of Linear Differential Equations of First Order, Applied Mathematics Letters 17 (2004) 1135–1140.