Deep learning has demonstrated the ability to achieve highly accurate results in application domains such as speech recognition, image recognition, language translation and other complex problems. It has attracted the attention of the media and the wider public. It has also proven to be very valuable and efficient in automating the usually laborious and sometimes controversial pre-processing stage of feature extraction. The main criticism of deep learning usually relates to its ‘black-box’ nature and to its requirements for huge amounts of labeled data, computational resources (GPU accelerators as a standard), long training times (hours), and high power and energy consumption. Indeed, a traditional deep learning algorithm (e.g. a convolutional neural network) involves hundreds of millions of weights/coefficients/parameters that require iterative optimization procedures. In addition, these parameters are abstract and detached from the physical nature of the problem being modelled. However, the automated way of extracting them is very attractive in high-throughput applications to complex problems such as image processing, where human expertise may be unavailable or very expensive.
Feature extraction is an important pre-processing stage, which defines the data space and may influence the accuracy of the end result. We therefore build on this very useful property of traditional deep learning and combine it with another important recent result in the deep learning domain, namely transfer learning. This concept postulates that knowledge in the form of a model architecture learned in one context can be re-used in another context. Transfer learning helps to considerably reduce the training time. Moreover, it may also help to improve the accuracy of the models.
Building on the two main achievements of deep learning, namely top accuracy and automatic feature extraction for complex problems such as image classification, we try to address its deficiencies: the lack of explainability, the computational burden, the power and energy resources required, and the limited ability to self-adapt and evolve. Interpretability and explainability are extremely important for high-stakes applications, such as autonomous cars, medical or court decisions, etc. For example, it is extremely important to know why a car took some action, especially if this car is involved in an accident.
Traditionally, machine learning and pattern recognition required substantial domain expertise to design a feature extractor that could transform the raw data into a feature vector, which defines the data space within which the learning subsystem could detect or classify data patterns. Deep learning offers a new way to extract abstract features automatically. Moreover, pre-trained structures can be reused for different tasks through the transfer learning technique, which considerably reduces the training time and may also help to improve the accuracy of the models. In this paper, we propose a new approach, xDNN, that offers both a high level of explainability and top accuracy.
The proposed approach, xDNN, offers a new deep learning architecture that combines reasoning and learning in a synergy. It is based on prototypes and the data density as well as typicality, an empirically derived probability density function (pdf). It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user's perspective, the proposed approach is clearly understandable. We tested it on well-known benchmark data sets such as iRoads and Caltech-256; xDNN outperforms the other methods, including deep learning, in terms of accuracy and training time and, moreover, offers a clearly explainable classifier. In fact, the result on the very hard Caltech-256 problem (which has 257 classes) represents a world record.
The remainder of this paper is organized as follows: The next section introduces the proposed explainable deep learning approach. The experimental data employed in the analysis and results are presented in the results section. Discussion is presented in the last section of this paper.
II Explainable Deep Neural Network
II-A Architecture and Training of the Proposed xDNN
The proposed explainable deep neural network (xDNN) classifier is formed of several layers, each with a very clear semantic and functional meaning. In addition to this internal clarity and transparency, it offers a set of prototype-based rules that are very clear from the user's point of view. Prototypes are selected data samples (images) that the user can easily view and understand, and whose similarity to other validation images the user can appreciate. xDNN offers a synergy between statistical learning and reasoning. Most other approaches treat the two as a dichotomy and prefer one over the other. We advocate and demonstrate that learning and reasoning can work together in synergy and produce very impressive results. Indeed, the proposed xDNN method outperforms all published results [15, 8, 2] in terms of accuracy. Moreover, it is also far ahead in terms of training time, computational simplicity, and the low power and energy required. The proposed approach can be described as a feedforward neural network with an incremental learning algorithm that autonomously self-develops and evolves its structure, adding new prototypes to reflect the possibly changing (dynamically evolving) data pattern. As shown in Figure 3, xDNN is composed of the following layers:
Features descriptor layer;
Features descriptor layer: (Defines the data space)
The features descriptor layer is the first phase of the proposed xDNN method. This layer is in charge of extracting a global feature vector from each image. It can be formed by more traditional ‘handcrafted’ methods such as GIST or HoG. Alternatively, it can be formed by the fully connected layer (FCL) of pre-trained convolutional neural networks such as AlexNet, VGG–VD–16 and Inception, or of residual neural networks such as ResNet or Inception-ResNet. Using a pre-trained deep neural network allows automatic extraction of more abstract and discriminative high-level features. In this paper, the pre-trained VGG–VD–16 DCNN is employed for feature extraction. VGG–VD–16 has a simple structure and has been reported to achieve better performance than other pre-trained deep neural networks. The first fully connected layer of VGG–VD–16 provides a 4096-dimensional vector.
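a) The values are then standardized using the following equation (1):
$$\hat{x}_{i,j} = \frac{x_{i,j} - \mu(x_{i,j})}{\sigma(x_{i,j})} \qquad (1)$$
where $\hat{x}_i$ denotes the standardized feature vector of the image $I_i$ ($x_{i,j}$ are the values provided by the FCL), $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and the standard deviation, $i = 1, 2, \dots$ denotes the time stamp or the ID of the image, and $j = 1, 2, \dots, n$ indexes the features of $x$; in our case $n = 4096$.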
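b) The standardized values are normalised to bring them into the range [0;1]:
$$\bar{x}_{i,j} = \frac{\hat{x}_{i,j} - \min_{i}(\hat{x}_{i,j})}{\max_{i}(\hat{x}_{i,j}) - \min_{i}(\hat{x}_{i,j})} \qquad (2)$$
where $\bar{x}_{i,j}$ denotes the normalized value of the feature vector and $\hat{x}_{i,j}$ is the standardized value from the previous step. For clarity, in the rest of the paper we will use $x$ instead of $\bar{x}$.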
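The two pre-processing steps, standardization followed by scaling to [0;1], can be sketched as below. This is a minimal illustration rather than the authors' implementation: it assumes the statistics are computed per feature across the available images, and the function names and the small random matrix (standing in for the FCL output) are hypothetical.

```python
import numpy as np


def standardize(features: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance scaling of every feature column (assumed per feature)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)  # guard: constant features map to 0
    return (features - mean) / std


def min_max_normalize(features: np.ndarray) -> np.ndarray:
    """Rescale every feature column to the range [0, 1]."""
    lowest = features.min(axis=0)
    span = features.max(axis=0) - lowest
    span = np.where(span == 0.0, 1.0, span)  # guard: constant features map to 0
    return (features - lowest) / span


# Made-up stand-in for the FCL output: 5 images with 8 features each
# (the paper works with one feature vector per image from VGG-VD-16).
rng = np.random.default_rng(0)
raw = rng.normal(loc=3.0, scale=2.0, size=(5, 8))

standardized = standardize(raw)
normalized = min_max_normalize(standardized)
```

Applying the steps in this order means that every column of `normalized` spans exactly [0, 1], whatever the scale of the raw activations.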
Meta-parameters for the xDNN are initialized with the first observed data sample (image). The proposed algorithm works per class; therefore, all the calculations are done for each class separately.
$$\mu \leftarrow x_1; \quad P \leftarrow 1 \qquad (3)$$
where $\mu$ denotes the global mean of the data samples of the given class and $P$ is the total number of prototypes identified from the observed data samples (images).
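Each class $c$ is initialized by the first data sample of that class:
$$\mathbf{C}_1 \leftarrow \{x_1\}; \quad p_1 \leftarrow x_1; \quad \hat{I}_1 \leftarrow I_1; \quad S_1 \leftarrow 1; \quad r_1 \leftarrow r^{*} \qquad (4)$$
where $p_1$ is the vector of features that describes the prototype $\hat{I}_1$ of the data cloud $\mathbf{C}_1$; $\hat{I}_1$ is the identified prototype; $S_1$ is the corresponding support (number of members) associated with this prototype; and $r_1$ is the corresponding radius of the area of influence of $\mathbf{C}_1$.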
In this paper, we use $r^{*} = \sqrt{2 - 2\cos(30^{\circ})}$, the same as in previous work; the rationale is that two vectors for which the angle between them is less than $\pi/6$ or $30^{\circ}$ are pointing in close/similar directions. That is, two feature vectors are considered similar if the angle between them is smaller than 30 degrees. Note that $r^{*}$ is data derived, not a problem- or user-specific parameter. In fact, it can be defined without prior knowledge of the specific problem or data through the following equation (5).
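The 30-degree rationale can be related to a Euclidean distance with a short calculation. The sketch below assumes unit-length vectors, for which the distance between two vectors separated by an angle θ is sqrt(2 − 2 cos θ); the function name is hypothetical, and the unit-length assumption is made for illustration only.

```python
import math

import numpy as np


def radius_from_angle(angle_deg: float) -> float:
    """Euclidean distance between two unit vectors separated by angle_deg degrees."""
    return math.sqrt(2.0 - 2.0 * math.cos(math.radians(angle_deg)))


r = radius_from_angle(30.0)

# Cross-check with two explicit unit vectors that are 30 degrees apart.
a = np.array([1.0, 0.0])
b = np.array([math.cos(math.radians(30.0)), math.sin(math.radians(30.0))])
dist = float(np.linalg.norm(a - b))
```

Under this assumption a 30-degree angle corresponds to a radius of roughly 0.52.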
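The density layer defines the mutual proximity of the images in the data space defined by the features from the previous layer. If the Euclidean form of distance is used, the data density has a Cauchy form (6):
$$D(x_i) = \frac{1}{1 + \frac{\|x_i - \mu\|^2}{\sigma^2}} \qquad (6)$$
where $D$ is the density, $\mu$ is the global mean, and $\sigma^2$ is the variance. The reason it is Cauchy is not arbitrary: it can be demonstrated theoretically that if Euclidean or Mahalanobis types of distance in the feature space are considered, the data density reduces to the Cauchy type given in equation (6). Density can also be updated online:
$$D(x_i) = \frac{1}{1 + \frac{\|x_i - \mu_i\|^2}{\Sigma_i - \|\mu_i\|^2}} \qquad (7)$$
where the mean, $\mu_i$, and the scalar product, $\Sigma_i$, can be updated recursively as follows:
$$\mu_i = \frac{i-1}{i}\mu_{i-1} + \frac{1}{i}x_i, \quad \mu_1 = x_1 \qquad (8)$$
$$\Sigma_i = \frac{i-1}{i}\Sigma_{i-1} + \frac{1}{i}\|x_i\|^2, \quad \Sigma_1 = \|x_1\|^2 \qquad (9)$$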
Data samples (images) that are closer to the global mean have higher density values. Therefore, the value of the data density indicates how strongly a particular data sample is influenced by other data samples in the data space due to their mutual proximity.
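The Cauchy-type density and its online update can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes the variance is the mean squared Euclidean distance to the mean, so that it equals the mean squared norm minus the squared norm of the mean, and the names `cauchy_density` and `RecursiveStats` are hypothetical.

```python
import numpy as np


def cauchy_density(x: np.ndarray, mean: np.ndarray, variance: float) -> float:
    """Cauchy-type density: 1 / (1 + ||x - mean||^2 / variance)."""
    return 1.0 / (1.0 + float(np.sum((x - mean) ** 2)) / variance)


class RecursiveStats:
    """Keeps the mean and the mean squared norm up to date, one sample at a time."""

    def __init__(self, x: np.ndarray) -> None:
        self.count = 1
        self.mean = np.array(x, dtype=float)
        self.mean_sq_norm = float(np.dot(x, x))

    def update(self, x: np.ndarray) -> None:
        self.count += 1
        k = self.count
        self.mean = (k - 1) / k * self.mean + x / k
        self.mean_sq_norm = (k - 1) / k * self.mean_sq_norm + float(np.dot(x, x)) / k

    @property
    def variance(self) -> float:
        # Assumed identity: mean ||x - mean||^2 = mean ||x||^2 - ||mean||^2
        return self.mean_sq_norm - float(np.dot(self.mean, self.mean))


# Made-up data: 50 feature vectors with 4 features each, already in [0, 1].
rng = np.random.default_rng(1)
samples = rng.random((50, 4))

stats = RecursiveStats(samples[0])
for sample in samples[1:]:
    stats.update(sample)

# Batch quantities, used only to cross-check the recursive ones.
batch_mean = samples.mean(axis=0)
batch_var = float(np.mean(np.sum((samples - batch_mean) ** 2, axis=1)))

density_at_mean = cauchy_density(stats.mean, stats.mean, stats.variance)
density_far_away = cauchy_density(stats.mean + 2.0, stats.mean, stats.variance)
```

The recursive form never stores past samples, which is what makes a single pass over the data sufficient; the density is highest (equal to 1) at the mean and decays with distance from it.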
The prototypes identification layer is the core of the proposed xDNN classifier. This layer is responsible for providing the clearly explainable model. The xDNN classifier is free from prior assumptions about the data distribution type, as well as about the random or deterministic nature of the data. Instead, it extracts the actual distribution empirically from the data samples (images), bottom up. The prototypes are independent of each other. Therefore, one can change the structure by adding a new prototype without influencing the already existing prototypes. In other words, the proposed xDNN is highly parallelizable and suitable for evolving forms of application, where new prototypes may be added if the data pattern requires it. The proposed xDNN method is trained per class, forming a set of prototypes per class; all the calculations are therefore done for each class separately. Prototypes are the local peaks of the data density (and typicality) identified in the previous layers/stages of the algorithm from the images of the corresponding class, based on their feature vectors. The prototypes can be used to form linguistic logical rules of the following form:
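$R_c$: IF ($I \sim \hat{I}_P$) THEN (class $c$)
where $\sim$ stands for similarity, which can also be seen as a fuzzy degree of membership; $\hat{I}_P$ is the identified prototype; $P$ is the number of identified prototypes; $c$ is the class, $c = 1, 2, \dots, C$; and $I$ denotes an image.
One rule per prototype can be formed. All rules per class can be combined together using logical OR, also known as disjunction or S-norm:
$R_c$: IF ($I \sim \hat{I}_1$) OR ($I \sim \hat{I}_2$) OR … OR ($I \sim \hat{I}_P$) THEN (class $c$)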
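We call all data points associated with a prototype data clouds, because their shape is not regular (i.e. not hyper-spherical, hyper-ellipsoidal, etc.) and the prototype is not necessarily the statistical or geometric mean, but an actual image. The algorithm absorbs the new data samples one by one by assigning them to the nearest (in the feature space) prototype:
$$j^{*} = \arg\min_{j = 1, \dots, P} \|x_i - p_j\|^2 \qquad (11)$$
where $x_i$ is the feature vector of the new data sample and $p_j$, $j = 1, \dots, P$, are the existing prototypes.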
If the following condition is met:
It means that the new data sample is outside the area of influence of the nearest prototype. Therefore, its vector of features becomes a new prototype of a new data cloud, with meta-parameters initialized by equation (13). Add a new data cloud:
Otherwise, the data cloud parameters are updated online by equation (14). It has to be stressed that all calculations per data cloud are performed on the basis of the data points associated with that data cloud only (i.e. locally, not globally on the basis of all data points).
The xDNN learning procedure can be summarized by the following algorithm.
xDNN: Learning Procedure
Read the first feature vector sample representing the image of the class;
Initialize the meta-parameters with this sample;
FOR i = 2, …
    Read the next feature vector sample;
    Calculate the quantities needed for the test below according to equation (9);
    IF Equation (12) holds
        Create rule according to Equation (13);
    ELSE
        Search for the nearest prototype according to Equation (11);
        Update rule according to Equation (14);
    END
END
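The control flow of the learning procedure can be sketched in a few lines. This is a simplified illustration, not the authors' algorithm: the condition in equation (12) and the update in equation (14) are not reproduced in the text, so the sketch assumes that a sample starts a new data cloud when it lies farther than a fixed radius from its nearest prototype, and that absorbing a sample only increases the support of that prototype. The names `Prototype` and `learn_class` and the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Prototype:
    features: np.ndarray  # feature vector of the prototype image
    image_id: int         # index of the actual training image
    support: int = 1      # number of samples absorbed by this data cloud


def learn_class(samples: np.ndarray, radius: float) -> List[Prototype]:
    """One non-iterative pass over the feature vectors of a single class."""
    prototypes = [Prototype(samples[0].copy(), image_id=0)]
    for i in range(1, len(samples)):
        x = samples[i]
        distances = [float(np.linalg.norm(x - p.features)) for p in prototypes]
        nearest = int(np.argmin(distances))
        if distances[nearest] > radius:
            # ASSUMED test: outside the area of influence, so start a new cloud.
            prototypes.append(Prototype(x.copy(), image_id=i))
        else:
            # ASSUMED update: the nearest cloud absorbs the sample.
            prototypes[nearest].support += 1
    return prototypes


# Toy class with two obvious groups of samples.
samples = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.1], [1.1, 1.0]])
protos = learn_class(samples, radius=0.5)
```

Each sample is visited exactly once and only the nearest prototype is touched, which reflects the per-cloud, non-iterative character of the procedure described above.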
In the MegaClouds layer the clouds formed by the prototypes in the previous layer are merged if the neighbouring prototypes have the same class label. In other words, they are merged if they belong to the same class. MegaClouds are used to facilitate the human interpretability. Figure 5 illustrates the formation of the MegaClouds.
Rules in the MegaClouds layer have the following format:
$R_c$: IF ($I \sim MC_1$) OR ($I \sim MC_2$) OR … OR ($I \sim MC_M$) THEN (class $c$)
where $MC_1, \dots, MC_M$ are the MegaClouds, i.e. the areas formed by merging the clouds, and $M$ is the number of identified MegaClouds. The multimodal typicality, $\tau$, can also be used to illustrate the MegaClouds, as shown in Figure 6.
II-B Architecture and Validation of the Proposed xDNN
Architecture for the validation process of the proposed xDNN method is illustrated by Figure 7.
The validation process of xDNN is composed of the following layers:
Features descriptor layer;
Similarity layer (density);
These layers are described in detail as follows:
Features descriptor layer:
This layer is the same as the features descriptor layer described for the training process.
In this layer the degrees of similarity to the nearest prototypes (per class) are extracted for each unlabeled (new/validation) data sample/image defined as follows:
where denotes the similarity degree.
Local (per class) decision-making layer:
Local (per class) decision-making is based on the ‘winner-takes-all’ principle and can be obtained by:
$$\lambda_c = \max_{j = 1, \dots, P} S(x, p_j)$$
where $\lambda_c$ is the local winner for class $c$, i.e. the highest degree of similarity $S$ between the validation sample $x$ and the $P$ prototypes $p_j$ of that class.
Global decision-making layer: The global decision-making layer forms the final decision: it assigns labels to the validation images based on their degree of similarity to the prototypes obtained by the prototype identification layer, as illustrated by Figure 7, and determines the winning class.
In order to determine the overall degree of satisfaction, the maximum of the local, per class winners is applied.
The label is obtained by the following equation (18):
$$\text{class} = \arg\max_{c = 1, \dots, C} \lambda_c \qquad (18)$$
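The validation path (similarity to prototypes, per-class winner, global winner) can be sketched as below. This is a hedged illustration: the similarity formula is not reproduced in the text, so a Cauchy-type similarity with unit scale is assumed here, and the function names, class labels and prototype values are made up for the example.

```python
from typing import Dict, Tuple

import numpy as np


def similarity(x: np.ndarray, prototype: np.ndarray) -> float:
    """ASSUMED Cauchy-type similarity: 1 / (1 + squared Euclidean distance)."""
    return 1.0 / (1.0 + float(np.sum((x - prototype) ** 2)))


def classify(
    x: np.ndarray, prototypes_per_class: Dict[str, np.ndarray]
) -> Tuple[str, Dict[str, float]]:
    # Local decision: 'winner-takes-all' over the prototypes of each class.
    local_winners = {
        label: max(similarity(x, p) for p in prototypes)
        for label, prototypes in prototypes_per_class.items()
    }
    # Global decision: the class whose local winner is the largest.
    winner = max(local_winners, key=local_winners.get)
    return winner, local_winners


# Made-up prototypes for two driving-condition classes.
prototypes = {
    "night": np.array([[0.1, 0.1], [0.2, 0.0]]),
    "snow": np.array([[0.9, 0.9]]),
}

label, scores = classify(np.array([0.15, 0.05]), prototypes)
other_label, _ = classify(np.array([0.8, 1.0]), prototypes)
```

Because the decision is traced back to a single winning prototype, the prototype image itself can be shown to the user as the explanation for the assigned label.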
III Experimental Data
We validated the proposed approach, xDNN, using several complex, well-known image classification benchmark datasets (iRoads and Caltech-256).
III-A iRoads Dataset
The iRoads dataset was considered in the analysis first. The dataset contains 4,656 image frames recorded from moving vehicles on a diverse set of road scenes, by day and by night and under various weather and lighting conditions, as described below:
Daylight - 903 images
Night - 1050 images
Rainy day - 1049 images
Rainy night - 431 images
Snowy - 569 images
Sun strokes - 307 images
Tunnel - 347 images
III-B Caltech-256 Dataset
Caltech-256 has 30,607 images divided into 257 object categories (one of which is the background).
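III-C Performance Evaluation
The performance of the classification methods is usually evaluated based on their accuracy index, which is defined as follows:
$$\text{ACC}(\%) = \frac{TP + TN}{TP + FP + TN + FN} \times 100$$
where $TP$, $FP$, $TN$ and $FN$ denote true positives, false positives, true negatives and false negatives, respectively.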
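The accuracy index is the share of correctly classified samples among all classified samples. A minimal sketch, with made-up counts and a hypothetical function name, expressed as a percentage to match the reported results:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Percentage of correct decisions among all decisions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one classified sample is required")
    return 100.0 * (tp + tn) / total


# Made-up confusion counts: 85 correct decisions out of 100.
example = accuracy(tp=45, tn=40, fp=5, fn=10)
```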
All the experiments were conducted with MATLAB 2018a using a personal computer with a 1.8 GHz Intel Core i5 processor, 8-GB RAM, and MacOS operating system. The classification experiments were executed using 10-fold cross validation under the same ratio of training-to-testing (80% to 20%) sample sets.
IV Results and Analysis
Computational simulations were performed to assess the accuracy of the proposed explainable deep learning method, xDNN, against other state-of-the-art approaches.
IV-A iRoads Dataset
Table I shows that the proposed xDNN method provides the best result in terms of classification accuracy as well as time/complexity and simplicity of the model structure (number of parameters/prototypes). The number of model parameters for xDNN (and DRB) is, strictly speaking, zero, because the 2 parameters (mean, $\mu$, and standard deviation, $\sigma$) per prototype (data cloud) are derived from the data and are not algorithmic or user-defined parameters. For the kNN method, one can argue that the number of parameters is the number of data samples, $N$. The proposed explainable DNN surpasses in accuracy the state-of-the-art VGG–VD–16 algorithm, a well-established deep convolutional neural network. Moreover, the top layer of the proposed xDNN consists of a very small number of MegaClouds (27 in total or, on average, 4 MegaClouds per class), which makes it very easy to explain and visualize. For comparison, our earlier version of deep rule-based models, called DRB, also produced a high accuracy and was trained a bit faster, but ended up with 521 prototypes (on average 75 prototypes per class). With xDNN we generate meaningful rules as well as an analytical description of the typicality, which is the empirically derived pdf in a closed form that lends itself to further analysis and processing.
|Method|Accuracy|Time (s)|# Parameters|
|VGG–VD–16|99.51%|836.28|Not reported|
|SVM|94.17%|5.67|Not reported|
|Naive Bayes|88.35%|5.31|Not reported|
The MegaClouds generated by the proposed xDNN model can be visualized in terms of rules, as illustrated by Figure 8.
Voronoi tessellation can also be used to visualize the resulting MegaClouds, as illustrated by Figure 9.
The typicality for the classes ‘night scene’ and ‘snow scene’ is given in Figure 10.
Typicality can also be used for interpretability and explainability, as it corresponds to the pdf. One can use the typicality to represent the likelihood that an image represents a specific type of driving condition. For a given image, a vector of features can be extracted, standardized and normalized, and then used to demonstrate the likelihood of a certain type of driving condition, as shown in Figure 10.
IV-B Caltech-256 Dataset
Results for Caltech-256 are presented in Table II.
|Method|Accuracy|
|SVM(1)|24.6%|
Results presented in Table II demonstrate that the proposed xDNN approach obtains the best classification accuracy reported so far for this complex problem, namely 75.41%. The proposed approach surpasses all of its competitors, offering the highest accuracy as well as a clearly explainable model. xDNN produced on average 3 MegaClouds per class (721 in total), which are clearly explainable. Rules have the following format:
Experiments have demonstrated that the proposed xDNN approach is able to produce highly accurate results surpassing state-of-the-art methods for different challenging datasets. Moreover, xDNN presents highly interpretable results that can be presented in the form of logical rules, Voronoi tessellations, and/or typicality (empirically derived form of pdf) in a closed analytical form allowing further analysis. Because of its recursive, non-iterative and non-parametric form it allows computationally very efficient implementations to be realized.
In this paper we propose a new method, the explainable deep neural network (xDNN), that directly addresses the bottlenecks of traditional deep learning approaches and offers a clearly explainable internal architecture that can outperform the existing methods. The proposed xDNN approach requires very little computational resources (no need for GPUs) and short training times (in the order of seconds). xDNN is prototype-based. Prototypes are actual training data samples (images) that are local peaks of the empirical data distribution, called typicality, as well as of the data density. This generative model is identified in a closed form and equates to the pdf, but is derived automatically and entirely from the training data with no user- or problem-specific thresholds, parameters or intervention. The proposed xDNN offers a new deep learning architecture that combines reasoning and learning in a synergy. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user's perspective, the proposed approach is clearly understandable. Results for well-known benchmark data sets such as iRoads and Caltech-256 show that xDNN outperforms the other methods, including state-of-the-art deep learning approaches (VGG–VD–16), in terms of accuracy and training time, and offers a clearly explainable classifier. In fact, the result on the very hard Caltech-256 problem (which has 257 classes) represents a world record (see https://martin-thoma.com/sota/). Future research will concentrate on the development of a tree-based architecture, synthetic data generation, and local optimization in order to improve the proposed deep explainable approach.
[1] (2017) A generalized methodology for data analysis. IEEE Transactions on Cybernetics 48 (10), pp. 2981–2993.
[2] (2018) Deep rule-based classifier with human-level performance and characteristics. Information Sciences 463, pp. 196–213.
[3] (2019) Empirical fuzzy sets and systems. In Empirical Approach to Machine Learning, pp. 135–155.
[4] (2012) Autonomous learning systems: from data streams to knowledge in real-time. John Wiley & Sons.
[5] (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
[6] (2016) Deep learning. MIT Press.
[7] (2007) Caltech-256 object category dataset.
[8] (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1904–1916.
[9] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[10] (2015) Deep transfer metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 325–333.
[11] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[12] (2015) Deep learning. Nature 521 (7553), pp. 436–444.
[13] (2012) Architectural study of HOG feature extraction processor for real-time object detection. In 2012 IEEE Workshop on Signal Processing Systems, pp. 197–202.
[14] (2016) Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (7), pp. 1476–1481.
[15] (2013) Vehicle detection based on multi-feature clues and Dempster-Shafer fusion theory. In Pacific-Rim Symposium on Image and Video Technology, pp. 60–72.
[16] (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215.
[17] (2015) Deep learning in neural networks: an overview. Neural Networks 61, pp. 85–117.
[18] (2018) The deep learning revolution. MIT Press.
[19] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[20] (2019) Actively semi-supervised deep rule-based classifier applied to adverse driving scenarios. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[21] (2019) Novelty detection and learning from extremely weak supervision. arXiv preprint arXiv:1911.00616.
[22] (2013) Classifying web videos using a global video descriptor. Machine Vision and Applications 24 (7), pp. 1473–1485.
[23] (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
[24] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
[25] (2018) The Microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5934–5938.
[26] (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
[27] (2015) Supervised representation learning: transfer learning with deep autoencoders. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
Towards Explainable Deep Neural Networks (xDNN). Preprint submitted to Neural Networks Journal. Plamen Angelov and Eduardo Soares are with the School of Computing and Communications, Lancaster University, Lancaster, LA1 4WA, UK.