In this paper we introduce a new deep neural network (DNN) architecture which departs from the amorphous and highly abstract “black box” model structure and moves towards a deep machine reasoning (DMR) architecture. It is based on the following principal differences from the traditional approach: i) use of prototypes as the core of the method; ii) use of a decision tree (DT) for decision making (class labelling) instead of a flat “winner takes all” type function; iii) use of similarity as a measure of association to prototypes; iv) the possibility to express the method in the form of human-interpretable IF-THEN rules with a partial degree of satisfaction and to visualise it by Voronoi tessellation or by prototypes.
The staggering increase in the amount and complexity of data sets and streams led to a move from rule-based systems (fuzzy, Bayesian inference, Markov decision processes, Q tables in reinforcement learning, case-based reasoning, etc.) towards DNNs, which have proven their efficiency in a number of problems ranging from speech and image recognition and language translation to games. This abundance of data led, however, to the temptation to take a shortcut from data to solutions, driven entirely by accuracy and ignoring the depth of understanding of the problem at hand and the insights to be gained.
In DMR we make use of the strong properties of DNNs and add new mechanisms to address their shortcomings. For example, DNNs are very efficient feature extractors, especially for image processing problems. We use this in DMR, and we also use a layered structure/architecture. We further benefit from the transfer learning approach.
Traditional classifiers assume balanced classes, but in practice classes are usually (highly) imbalanced. For example, in fault detection and identification the amount of data about the faulty cases is usually significantly smaller than the amount of data for “normal” operation. In social applications, for example, this leads to possible unfairness when the data is highly imbalanced with dominating class(es) and minority class(es).
Finally, traditional statistical modelling is heavily influenced by averages and starts with assumptions about the data distributions, which are then put to a test by parametrisation. We take the opposite approach: we start with the observed data samples and generalise from these to local densities and global multivariate generative distributions. These empirically derived distributions have discrete and continuous forms. Their discrete forms are exact, while the continuous forms, which are needed for inference, are local estimates.
Prototype-based models have demonstrated their high efficiency, e.g. discriminative models such as kNN and SVM and, to a lesser extent, RBF and LVQ. The latter two are also good in terms of explainability. Explainability is undoubtedly the Achilles heel of DNNs, and the solution we propose is a synergy between reasoning and learning rather than the current dichotomy.
In this paper we offer a new deep learning architecture and method that builds upon our recently introduced xDNN  method by adding two important novelties, namely: i) using a DT to determine the winning class label, and ii) balancing the classes by synthesising data around the prototypes determined from the available training data.
We validated the new DMR method on three well-known benchmark problems, namely Faces-1999, Caltech-101 and Caltech-256. Both Caltech problems are very hard and there is a public record of the best results achieved so far. We had already surpassed one of them (Caltech-101) with xDNN. With DMR we surpass our own xDNN “world record”. Furthermore, we also surpass the best recorded results on the Caltech-256 and Faces-1999 problems. Moreover, DMR does not require GPUs, is computationally lean and can continue to train on new data without the need for full re-training.
The remainder of the paper is organised as follows: Section II introduces the concept and novelties of the proposed approach. Section III presents the proposed architecture used during the training phase. Section IV outlines the learning procedure, and Section V introduces the architecture of the DMR used during the validation phase. Section VI illustrates the explainability of DMR in terms of IF-THEN rules. Numerical experiments are presented in Section VII, results are analysed in Section VIII and the paper is concluded in Section IX.
II Concept and novelties of the proposed approach
The problem we consider in this paper is to design a classifier with deep architecture that is explainable-by-design due to the use of prototypes . Prototypes are a small subset of the training data that are highly representative. This is because they are the local peaks of the distribution .
Let us denote the training data set of points by $x = \{x_1, \dots, x_N\} \subset \mathbb{R}^n$ with corresponding class labels $y_1, \dots, y_N \in \{1, \dots, C\}$. Here, $N$ is the number of training data samples and $n$ is their dimensionality (number of features); $C$ is the number of classes. DMR starts by selecting a set of descriptive prototypes for each class; $M_i$ is the total number of prototypes of class $i$, $i = 1, \dots, C$. Notice that typically $M_i > 1$, i.e. we usually consider more than a single prototype per class. The prototype extraction process (which can be both offline and online) is described in more detail elsewhere. At the heart of practically all prototype-based methods is the concept that the prototypes of class $i$ are designed to be close to many training points of class $i$ and far from training points of the other classes. As has been pointed out, “This idea captures the sense in which the word prototypical is commonly used”.
The power of prototype-based approaches stems from the fact that they are explainable-by-design and easy for users to understand, because the prototypes represent samples of the training data, e.g. images. They can also be used for classification: any new data sample with an unknown label can be associated with the nearest prototype from the per-class prototype sets.
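As an illustration only, the association of an unlabelled sample with its nearest prototype can be sketched in a few lines of Python. The function name `nearest_prototype_label`, the dictionary layout and the toy two-dimensional prototypes are assumptions made for this sketch and are not taken from the DMR implementation; the Euclidean distance is assumed as the proximity measure.

```python
import math


def nearest_prototype_label(x, prototypes_by_class):
    """Return (label, prototype) of the prototype closest to x.

    prototypes_by_class maps a class label to a list of prototype vectors.
    Euclidean distance is an assumption of this sketch.
    """
    best_label, best_proto, best_dist = None, None, math.inf
    for label, protos in prototypes_by_class.items():
        for proto in protos:
            d = math.dist(x, proto)
            if d < best_dist:
                best_label, best_proto, best_dist = label, proto, d
    return best_label, best_proto


# Hypothetical prototypes: two per class, in a 2-D feature space.
prototypes = {
    "cat": [(0.0, 0.0), (1.0, 0.5)],
    "dog": [(5.0, 5.0), (6.0, 4.0)],
}
label, proto = nearest_prototype_label((0.9, 0.4), prototypes)
```

Because the winning prototype is returned together with the label, the decision can be explained by showing the training sample (e.g. the image) that the prototype corresponds to.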
II-A Decision Tree layer
In traditional DNNs, the decision is flat, made en bloc in the form of a single-stage “winner takes all” function as in eq. (1), and is the last layer of the network. In xDNN we also followed this popular decision concept, but split it into two stages: i) per-class winner, and ii) global decision across classes. In DMR, similarly to xDNN, the decision mechanism is part of the architecture used for validation of the results, because the training is per class and no decision on the class label is needed during training. In this paper, the proposed DMR uses a multi-layer DT formed by pairwise comparison of the top two classes in terms of minimum error in training, as detailed in Section V and Fig. (4). The reason the result is significantly different is that the Voronoi tessellation regions of the data clouds that are formed around each prototype (local zones of influence) are significantly different when binary decisions are made.
II-B Balancing classes through synthesising training data strategically
The second innovation of the proposed method is related to the balancing of the classes. We achieve this by synthetic data augmentation. In this paper we propose a different approach from our recently published one for synthesising data for highly imbalanced classification problems. The difference is that in this paper we synthesise data around prototypes, which makes the synthetic data more likely to have the same class as the prototype. The method starts by identifying a population of pairwise neighbouring data samples from minority classes around prototypes. Then, it imposes a Gaussian disturbance on these data samples and, finally, it generates synthetic samples by creating linear interpolations between these extrapolated samples. A further difference from our recent method is that in this paper we use the standard deviation, $\sigma$, as the radius of influence around the prototype rather than the absolute distance of first order. We then augment the training data set with this synthetically generated data set, as shown in Fig. (1); see the augmented prototypes layer.
III Architecture of the proposed DMR approach (during the training phase)
The architecture of the proposed classifier can be represented as a multi-layered DNN with a very clear semantic and functional meaning by design. The architectures for the training and for the validation phases are different, as detailed in Figs. 1 and 4. The training phase is performed per class (except for the last layer) and includes the following layers:
Input (features) layer
This is the first layer, which defines the data space. The number of inputs is determined by the nature of the problem that the data describe. In many problems these are clearly known physical or biomedical variables, e.g. velocities, pressure, temperature, etc. In image processing problems, size and shape of objects or HoG were traditionally used, as well as more abstract methods like GIST. More recently, convolutional neural networks (CNN) like AlexNet, VGG–VD–16, Inception, ResNet and Inception–ResNet have proven to be very efficient at encoding images and representing them as a highly abstract vector of the outputs from the Fully Connected Layer (FCL). The proposed DMR architecture is agnostic to the source of the feature vector that the input layer represents. It can be any of the above. In this paper, without any loss of generality, we use a 4096-dimensional vector formed by the outputs from the first FCL of a VGG–VD–16 pre-trained on ImageNet.
Data density layer
The data density, $D$, is defined by a Cauchy function:

$$D(x) = \frac{1}{1 + \dfrac{\|x - \mu\|^2}{\sigma^2}} \tag{2}$$

where $D$ is the density, $\mu$ is the global mean, and $\sigma^2$ is the variance. It has been demonstrated theoretically that, starting from the mutual proximity of the data samples in the data space and using a Euclidean (or Mahalanobis) type of distance, $D$ takes the form of a Cauchy function. Moreover, the data density can be updated recursively. The value of the data density represents the closeness to the mean and is in the range $(0, 1]$. It attains its maximum (of 1) when $x = \mu$. $D$ is indicative of the centrality of a data sample and of its suitability to be a prototype due to its proximity to other data samples.
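A minimal sketch of the data density computation is given below, assuming the Cauchy form $D(x) = 1/(1 + \|x - \mu\|^2/\sigma^2)$ with a scalar variance taken as the mean squared Euclidean distance to the global mean. The helper names and the toy data are hypothetical, and the batch (non-recursive) computation is used only for brevity.

```python
def mean_and_variance(samples):
    """Global mean vector and scalar variance (mean squared distance to the mean)."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dim)]
    variance = sum(
        sum((s[k] - mean[k]) ** 2 for k in range(dim)) for s in samples
    ) / n
    return mean, variance


def cauchy_density(x, mean, variance):
    """Cauchy-type data density: equals 1 at the mean and decays towards 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return 1.0 / (1.0 + sq_dist / variance)


# Four toy samples placed symmetrically around (1, 1).
samples = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, variance = mean_and_variance(samples)
densities = [cauchy_density(s, mean, variance) for s in samples]
```

In this toy example every sample is equally far from the mean, so all four densities are equal; a sample located exactly at the mean would obtain the maximum value of 1.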
Conditional probability layer
The conditional probability, $p(C|x)$, can be estimated from the empirically observed data; in the empirical approach it is also called typicality. It can be given by eq. (3). The integral of $p(C|x)$ is equal to 1, the same as for the pdf, but it is multi-modal:

$$p(C|x) = \frac{\sum_{i=1}^{M} N_i D_i(x)}{\sum_{i=1}^{M} N_i \int_x D_i(x)\,dx} \tag{3}$$

where $D_i$ is the data density, eq. (2), of the $i$-th data cloud and $N_i$ denotes the number of data samples associated with (support of) the $i$-th data cloud, $i = 1, \dots, M$; $M$ is the number of data clouds. Notice that since $p(C|x)$ is empirically derived, it is not constrained by any prior assumptions about the data distribution type or even about the random or deterministic nature of the data. This is clearly more realistic than the common approach, which (for theoretical convenience) assumes randomness and independence of the features of the experimentally observed data, which is usually far from reality.
The next layer consists of the prototypes. This is the core layer of the proposed DMR architecture, responsible for providing an explainable-by-design model. Prototypes are the local peaks of the data density (and, respectively, local peaks of the conditional probability, eq. (3)) identified in the previous layers/stages. The proposed DMR algorithm absorbs new data samples by assigning each of them to the nearest prototype. In this way, each prototype forms a data cloud of the data that it represents. These “data clouds” form a Voronoi tessellation, illustrated in Fig. 2.
The prototypes are independent from each other. Therefore, one can change the structure by adding a new prototype without influencing the other already existing prototypes. In other words, the proposed DMR network is highly parallelizable and suitable for dynamically evolving applications with non-stationary data streams and evolving data patterns where new prototypes may be added if the data pattern requires this. The proposed DMR network is trained per class forming a set of prototypes per class. Therefore, all the calculations are done for each class separately. New prototypes are added to this layer when the following condition is met :
If that is the case, then the feature vector of the current training data sample becomes a new prototype and forms a new data cloud.
Synthetic data augmentation
This mechanism is not a separate layer but a feedback process that takes information from the prototypes layer, augments the training data set (in the form of synthetically added feature vectors close to the existing prototypes) and expands the prototypes layer by balancing the number of prototypes per class. This mechanism is one of the two novelties of the proposed approach in comparison with our recent xDNN method. Its rationale and main functionality have been described in Section II-B. In effect, the amount of training data is augmented by feeding back information from the prototypes layer. As a result, the prototypes layer is expanded so that the number of prototypes per class is balanced. This is visualised in Fig. (1), where the red solid rectangle includes the black dotted one (original prototypes) but also adds the prototypes which result from adding synthetic training data.
The final layer of the training architecture forms the mega-clouds. Unlike the previous layers, it is cross-class. At this layer the prototypes from all classes are put together and, once this is done, all the adjacent data clouds that have the same class label are combined into mega-clouds, see Fig. (3). Notice that the number of mega-clouds is significantly smaller than the number of prototypes, so the interpretability improves significantly.
IV Learning Procedure
The learning of DMR is summarised by the following pseudo-code. The proposed architecture is feed-forward, with the exception of the synthetic data augmentation mechanism, which feeds back from the prototypes layer to the input layer. The proposed method can work both in batch mode and online, on a per-sample basis.
DMR: Learning Procedure
Synthetic Data Generation
The synthetic data generation uses $n$-dimensional randomly generated vectors (where $n$ is the number of features) sampled from Gaussian distributions, with $\sigma$ being the standard deviation, and an $n$-dimensional random vector whose elements follow the uniform distribution within the range [0,1].
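The three steps of the synthetic data generation (neighbour selection around a prototype, Gaussian disturbance, and linear interpolation with uniformly distributed weights) can be sketched as follows. This is a hedged illustration and not the authors' pseudo-code: the function name, the separate `radius` and `noise_std` parameters, and the choice to pick pairs of neighbours at random are all assumptions of this sketch.

```python
import math
import random


def synthesise_around_prototype(prototype, samples, radius, noise_std, n_new, rng):
    """Generate n_new synthetic feature vectors near one prototype.

    radius    -- radius of influence around the prototype (e.g. one standard deviation)
    noise_std -- standard deviation of the Gaussian disturbance (assumed parameter)
    """
    # Step 1: neighbouring samples of the minority class inside the radius.
    neighbours = [s for s in samples if math.dist(s, prototype) <= radius]
    if len(neighbours) < 2:
        return []  # not enough neighbours to interpolate between
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(neighbours, 2)
        # Step 2: impose a Gaussian disturbance on both samples.
        a_dist = [v + rng.gauss(0.0, noise_std) for v in a]
        b_dist = [v + rng.gauss(0.0, noise_std) for v in b]
        # Step 3: element-wise linear interpolation with uniform weights in [0, 1).
        w = [rng.random() for _ in a_dist]
        synthetic.append(
            [wk * ak + (1.0 - wk) * bk for wk, ak, bk in zip(w, a_dist, b_dist)]
        )
    return synthetic


rng = random.Random(0)
minority = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (3.0, 3.0)]
new_points = synthesise_around_prototype(
    (0.1, 0.0), minority, radius=0.5, noise_std=0.05, n_new=5, rng=rng
)
```

The distant sample (3.0, 3.0) lies outside the radius of influence and is therefore never used, which keeps the synthetic vectors close to the prototype and hence likely to share its class.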
V Validation Architecture
The architecture of DMR for the validation phase (see Fig. 4) has the following layers.
Input (features) layer
The first layer is exactly the same as in the training phase and has been described in Section III.
Ranked prototypes layer
In this layer we rank-order all the prototypes in terms of minimum error during training. Then we organise them in overlapping pairs: we start with the top two prototypes (those providing the smallest error), then take the pair of the second and the third best, then the pair of the third and the fourth, etc. In this way, all prototypes take part twice except the best one and the worst one, see Fig. (4). The output of this layer is the degree of similarity, $S$, between the unlabelled data sample, $x$, and the respective prototype. The activation functions of the neurons of this layer are defined as follows:

$$S(x, \pi_j) = \frac{1}{1 + \dfrac{\|x - \pi_j\|^2}{\sigma_j^2}}$$

where $\pi_j$ denotes the $j$-th ranked prototype and $\sigma_j^2$ the variance of its data cloud. It is easy to see that for the similarity we use the same Cauchy function as for the data density, eq. (2).
Maximum similarity layer
Each neuron of this layer performs a simple max operation over the pair of similarity values coming from the previous layer.
The winner goes forward.
Pair-wise confidence checks layer
In this layer we check whether the confidence in the better of the two potential outcomes is high enough. In this paper we use a threshold of 0.9, which means 90% similarity of the new, unlabelled data sample to a prototype. The neurons of this layer are linked with each other, forming a competitive layer. This link is activated if the confidence check fails (see Fig. 2). The flow of information to the next layer is conditional on the outcome of the confidence check. First, the top pair of prototypes is checked. If the winner surpasses the threshold, it is the overall winner. Otherwise, the flow goes down to the next pair (in the same layer of the network; the key in Fig. 4 is closed) and so on.
Pair-wise winners layer
Pair-wise decisions are made to determine the winning prototype from the candidate pair which passed the confidence check in the preceding layer.
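The cascade formed by the ranked prototypes, maximum similarity, confidence check and pair-wise winner layers can be sketched as below. The sketch assumes the Cauchy-type similarity $1/(1 + \|x - \pi\|^2/\sigma^2)$ and the 0.9 threshold mentioned above; the function names and the toy prototypes are hypothetical. The fallback to the globally most similar prototype when no pair passes the confidence check is an assumption of this sketch, since this case is not specified in the text.

```python
def similarity(x, prototype, variance):
    """Cauchy-type similarity between a sample and a prototype, in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return 1.0 / (1.0 + sq_dist / variance)


def pairwise_decision(x, ranked, threshold=0.9):
    """Walk down overlapping pairs of ranked prototypes.

    ranked -- list of (label, prototype, variance), smallest training error first.
    Returns (label, similarity) of the first pair winner above the threshold.
    """
    sims = [similarity(x, proto, var) for _, proto, var in ranked]
    for j in range(len(ranked) - 1):
        # Maximum similarity layer: winner of the pair (j, j + 1).
        winner = j if sims[j] >= sims[j + 1] else j + 1
        # Confidence check: accept the winner or move on to the next pair.
        if sims[winner] >= threshold:
            return ranked[winner][0], sims[winner]
    # Assumed fallback: no pair was confident enough.
    best = max(range(len(ranked)), key=sims.__getitem__)
    return ranked[best][0], sims[best]


# Hypothetical ranked prototypes on a line, each with unit variance.
ranked = [("A", (0.0, 0.0), 1.0), ("B", (4.0, 0.0), 1.0), ("C", (8.0, 0.0), 1.0)]
label, sim = pairwise_decision((7.9, 0.0), ranked)
fallback_label, fallback_sim = pairwise_decision((1.0, 0.0), ranked)
```

For the first sample the pair (A, B) fails the confidence check, so the flow moves to the pair (B, C), where prototype C wins with a similarity above the threshold.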
VI Explaining the DMR network as a set of IF…THEN rules
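One of the main advantages of the proposed DMR approach is that it is explainable-by-design and can be represented, for example, in the form of IF…THEN rules. People can easily understand rules and prototypes. These are often easy to visualise, e.g. in the case of images, and can also be expressed as a set of linguistic rules of the following form:

$$\text{IF } (x \sim \pi_j) \text{ THEN } (\text{class } c)$$

where $x$ is the data sample, $\pi_j$ is a prototype of class $c$, and $\sim$ denotes “similar to”; it can also be seen as a fuzzy degree of membership. One rule per prototype can be formed. All rules of a class can be combined using logical OR, also known as disjunction or S-norm:

$$\text{IF } (x \sim \pi_1) \text{ OR } (x \sim \pi_2) \text{ OR } \dots \text{ OR } (x \sim \pi_{M_c}) \text{ THEN } (\text{class } c)$$

where $M_c$ is the number of prototypes of class $c$.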
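As a small illustration, a per-class rule can be assembled from prototype names and its degree of satisfaction evaluated from the individual similarity values. The function names are hypothetical, and the use of the maximum as the S-norm (one common choice of disjunction) is an assumption of this sketch rather than a statement about the DMR implementation.

```python
def class_rule(class_label, prototype_names):
    """Build a linguistic rule that ORs together all prototypes of one class."""
    antecedent = " OR ".join(f"(x ~ {name})" for name in prototype_names)
    return f"IF {antecedent} THEN (class is {class_label})"


def rule_firing_degree(similarities):
    """Degree of satisfaction of the OR-combined rule (max S-norm, assumed)."""
    return max(similarities)


rule = class_rule("face", ["P1", "P2"])
degree = rule_firing_degree([0.2, 0.7])
```

Here `rule` reads as a sentence a person can inspect, while `degree` expresses how strongly the sample satisfies it, given its similarity to each prototype of the class.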
VII Numerical Experiments
We validated the proposed DMR approach using several complex, well-known image classification benchmark data sets (Faces-1999, Caltech-101 and Caltech-256). Descriptions of the data sets are given below:
The Faces-1999 data set contains 450 frontal face images of 27 different people. This data set is highly unbalanced.
The Caltech-101 data set contains 9144 images divided into 102 categories (one of which is the background). The Caltech-101 data set is highly unbalanced and is widely used as a benchmarking data set.
The Caltech-256 data set has 30,607 images divided into 257 object categories (one of which is the background).
VII-B Performance Evaluation
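The performance of classification methods is usually evaluated based on their accuracy, which is defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$ and $FN$ denote the numbers of true positives, true negatives, false positives and false negatives, respectively.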
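For concreteness, the accuracy can be computed from the four confusion counts as sketched below. The helper names and the toy labels are hypothetical, and the one-vs-rest counting for a chosen positive class is an assumption of this example.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP and FN for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (tp + tn) / total


y_true = ["face", "face", "other", "other", "face"]
y_pred = ["face", "other", "other", "face", "face"]
tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive="face")
acc = accuracy(tp, tn, fp, fn)
```

In this toy example three of the five predictions are correct, so the accuracy is 0.6.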
All the experiments were conducted with MATLAB 2018a using a personal computer with a 1.8 GHz Intel Core i5 processor, 8-GB RAM, and MacOS operating system. The classification experiments were executed using 10-fold cross validation under the same ratio of training-to-testing (80% to 20%) sample sets.
VIII Results and Analysis
Computational simulations were performed to assess the accuracy of the proposed explainable tree-based deep learning method (DMR), against other state-of-the-art approaches.
VIII-A Faces Data set
Table I shows that the proposed DMR method provides a better result in terms of classification accuracy than its state-of-the-art competitors. The number of model parameters for DMR (and xDNN) is, strictly speaking, zero, because the two parameters (mean, $\mu$, and standard deviation, $\sigma$) per prototype (data cloud) are derived from the data and are not algorithmic or user-defined parameters. However, the tree-based structure of the proposed DMR and the mechanism for balancing the classes allow the result to surpass all others. The proposed deep reasoning through a layered pair-wise DT exploits and benefits from the old principle of divide et impera.
VIII-B Caltech-101 Data set
Table II shows the results for the challenging Caltech-101 data set. It can be seen that the proposed DMR method provides the best result in terms of classification accuracy. The Caltech-101 data set is hugely unbalanced, and the inner data augmentation mechanism of the proposed DMR method favours balancing the data and, consequently, improves the final classification result. Moreover, the tree-based structure of the proposed method allows interpretability and also favours the improvement in the classification accuracy of the given model.
The proposed explainable tree-based DNN surpasses in terms of accuracy the state-of-the-art VGG–VD–16 algorithm which is a well-established convolutional deep neural network. Moreover, it could also surpass other state-of-art approaches.
VIII-C Caltech-256 Data set
Results for Caltech-256 are presented in Table III.
|SVM(1) ||24.6 %|
These results demonstrate that the proposed DMR approach obtains the best classification accuracy ever reported for this complex problem, namely, 77.54%. The proposed approach not only surpasses all published competitors but also offers a clearly explainable model.
DMR even surpasses our recently introduced xDNN approach, which reported the world's best result for this classification problem on 5 December 2019.
In this paper we introduce DMR, a prototype-based explainable DNN with DT inference and a balanced number of prototypes per class regardless of possible imbalances in the training data. The proposed method offers two main novelties, namely: i) using a DT to determine the winning class label, and ii) balancing the classes by synthesising data around the prototypes determined from the available training data. It demonstrates excellent performance on three well-known benchmark problems (Caltech-101, Caltech-256 and Faces-1999); for the first two, these are the best results published. The proposed approach is explainable-by-design and computationally efficient (no need for GPUs, a high degree of parallelisation possible, no iterative search procedures and no parameter optimisation). Furthermore, it offers the ability to learn continuously (life-long), adapting smoothly to new data patterns. It is a step towards bringing machine learning and automated reasoning closer together into what we call deep machine reasoning, aiming not only at high levels of accuracy but also at deeper understanding and insight.
- (2019) Empirical fuzzy sets and systems. In Empirical Approach to Machine Learning, pp. 135–155.
-  (2019) Towards explainable deep neural networks (xDNN). External Links: Cited by: Towards Deep Machine Reasoning: a Prototype- based Deep Neural Network with Decision Tree Inference ††thanks: Plamen Angelov, and Eduardo Soares are with the School of Computing and Communications, Lancaster University, Lancaster, LA1 4WA, UK. E-mails: email@example.com; firstname.lastname@example.org., §I, §I, §I, §II-A, §II, §II, item 4, item 5, §VIII-C, TABLE III.
- (2012) Autonomous learning systems: from data streams to knowledge in real-time. John Wiley & Sons.
- (2015) Fully unsupervised fault detection and identification based on recursive density estimation and self-evolving cloud-based classifier. Neurocomputing 150, pp. 289–303.
- (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2002) Face recognition with radial basis function (rbf) neural networks. IEEE transactions on neural networks 13 (3), pp. 697–710.
- (2010) Stabilization and disturbance attenuation over a gaussian communication channel. IEEE Transactions on Automatic Control 55 (3), pp. 795–799.
- (2016) Deep learning. MIT press.
- (2007) Caltech-256 object category dataset.
- (2019) A self-adaptive synthetic over-sampling technique for imbalanced classification. arXiv preprint arXiv:1911.11018.
- (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
- (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1904–1916.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- (1996) LVQ pak: the learning vector quantization program package. Technical report.
- (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
- Architectural study of hog feature extraction processor for real-time object detection. In 2012 IEEE Workshop on Signal Processing Systems, pp. 197–202.
- (1993) Discriminability-based transfer between neural networks. In Advances in neural information processing systems, pp. 204–211.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2019) Fair-by-design explainable models for prediction of recidivism. arXiv preprint arXiv:1910.02043.
- (2013) Classifying web videos using a global video descriptor. Machine vision and applications 24 (7), pp. 1473–1485.
- Least squares support vector machine classifiers. Neural processing letters 9 (3), pp. 293–300.
- Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
- (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
- (1999) Caltech frontal face database. California Institute of Technology.
- (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833.
- (2017) Learning k for knn classification. ACM Transactions on Intelligent Systems and Technology (TIST) 8 (3), pp. 1–19.