Towards Explainable Deep Neural Networks (xDNN)

12/05/2019 · by Plamen Angelov, et al. · Lancaster University

In this paper, we propose an elegant solution that directly addresses the bottlenecks of the traditional deep learning approaches and offers a clearly explainable internal architecture that can outperform the existing methods, requires very little computational resources (no need for GPUs) and short training times (in the order of seconds). The proposed approach, xDNN, uses prototypes. Prototypes are actual training data samples (images), which are local peaks of the empirical data distribution, called typicality, as well as of the data density. This generative model is identified in a closed form and equates to the pdf, but it is derived automatically and entirely from the training data with no user- or problem-specific thresholds, parameters or intervention. The proposed xDNN offers a new deep learning architecture that combines reasoning and learning in a synergy. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users. We tested it on well-known benchmark data sets such as iRoads and Caltech-256. xDNN outperforms the other methods, including deep learning, in terms of accuracy and time to train, and offers a clearly explainable classifier. In fact, the result on the very hard Caltech-256 problem (which has 257 classes) represents a world record.


I Introduction

Deep learning has demonstrated the ability to achieve highly accurate results in different application domains such as speech recognition [25], image recognition [9], language translation [12] and other complex problems [6], and it has attracted the attention of the media and the wider public [18]. It has also proven to be very valuable and efficient in automating the usually laborious and sometimes controversial pre-processing stage of feature extraction. The main criticism towards deep learning is usually related to its ‘black-box’ nature and its requirements for huge amounts of labeled data, computational resources (GPU accelerators as a standard), long training times (hours), and high power and energy consumption [16]. Indeed, a traditional deep learning algorithm (e.g., a convolutional neural network) involves hundreds of millions of weights/coefficients/parameters that require iterative optimization procedures. In addition, these hundreds of millions of parameters are abstract and detached from the physical nature of the problem being modelled. However, the automated way to extract them is very attractive in high-throughput applications of complex problems, such as image processing, where human expertise may simply be unavailable or very expensive.

Feature extraction is an important pre-processing stage, which defines the data space and may influence the level of accuracy the end result provides. Therefore, we consider this a very useful property of traditional deep learning and build upon it, combined with another important recent result in the deep learning domain, namely, transfer learning. This concept postulates that knowledge in the form of a model architecture learned in one context can be re-used and useful in another context [10]. Transfer learning helps to considerably reduce the amount of time used for training. Moreover, it may also help to improve the accuracy of the models [27].

Building upon the two main achievements of deep learning (top accuracy combined with an automatic approach to feature extraction for complex problems, such as image classification), we try to address its deficiencies, such as the lack of explainability [16], the computational burden, the power and energy resources required, and the limited ability to self-adapt and evolve [21]. Interpretability and explainability are extremely important for high-stakes applications, such as autonomous cars, medical or court decisions, etc. For example, it is extremely important to know the reasons why a car took some action, especially if this car is involved in an accident [5].

The state-of-the-art classifiers offer a choice between higher explainability at the price of lower accuracy, or vice versa (Figure 1). Before deep learning [17], machine learning and pattern recognition required substantial domain expertise to model a feature extractor that could transform the raw data into a feature vector defining the data space within which the learning subsystem could detect or classify data patterns [12]. Deep learning offers a new way to extract abstract features automatically. Moreover, pre-trained structures can be reused for different tasks through the transfer learning technique [10], which helps to considerably reduce the amount of time used for training and may also help to improve the accuracy of the models [27]. In this paper, we propose a new approach, xDNN, that offers both a high level of explainability and top accuracy.

Fig. 1: Trade-off between accuracy and explainability.

The proposed approach, xDNN, offers a new deep learning architecture that combines reasoning and learning in a synergy. It is based on prototypes and the data density [3], as well as typicality, an empirically derived pdf [1]. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users. We tested it on well-known benchmark data sets such as iRoads [15] and Caltech-256 [7]; xDNN outperforms the other methods, including deep learning, in terms of accuracy and time to train and, moreover, offers a clearly explainable classifier. In fact, the result on the very hard Caltech-256 problem (which has 257 classes) represents a world record [8].

The remainder of this paper is organized as follows: The next section introduces the proposed explainable deep learning approach. The experimental data employed in the analysis and results are presented in the results section. Discussion is presented in the last section of this paper.

II Explainable Deep Neural Network

II-A Architecture and Training of the proposed xDNN

The proposed explainable deep neural network (xDNN) classifier is formed of several layers with very clear semantic and functional meaning. In addition to its internal clarity and transparency, it also offers a set of prototype-based rules that is very clear from the user's point of view. Prototypes are selected data samples (images) that the user can easily view, understand, and appreciate the similarity of to other validation images. xDNN offers a synergy between statistical learning and reasoning, bringing both together. In most of the other approaches there is a dichotomy and a preference of one over the other. We advocate and demonstrate that learning and reasoning can work together in a synergy and produce very impressive results. Indeed, the proposed xDNN method outperforms all published results [15, 8, 2] in terms of accuracy. Moreover, in terms of time for training, computational simplicity, and the low power and energy required, it is also far ahead. The proposed approach can be described as a feedforward neural network with an incremental learning algorithm that autonomously self-develops and evolves its structure, adding new prototypes to reflect a possibly changing (dynamically evolving) data pattern [21]. As shown in Figure 3, xDNN is composed of the following layers:

  1. Features descriptor layer;

  2. Density layer;

  3. Typicality layer;

  4. Prototypes layer;

  5. MegaClouds layer;

Fig. 2: Pre-training a traditional deep neural network (the weights of the network are being optimized/trained). Using the transfer learning concept, this architecture with its weights is used as a feature extractor (the last fully connected layer is considered as a feature vector). Adapted from [19].
Fig. 3: xDNN training architecture (per class).
  1. Features descriptor layer: (Defines the data space)

    The Feature Descriptor Layer is the first phase of the proposed xDNN method. This layer is in charge of extracting a global feature vector from the images. It can be formed by more traditional ‘handcrafted’ methods such as GIST [22] or HOG [13]. Alternatively, it can be formed by the fully connected layer (FCL) of pre-trained convolutional neural networks such as AlexNet [11], VGG–VD–16 [19], and Inception [24], or residual neural networks such as ResNet [9] or Inception-ResNet [23], etc. Using a pre-trained deep neural network allows automatic extraction of more abstract and discriminative high-level features. In this paper, the pre-trained VGG–VD–16 DCNN is employed for feature extraction. According to [14], VGG–VD–16 has a simple structure and can achieve better performance in comparison with other pre-trained deep neural networks. The first fully connected layer of VGG–VD–16 provides a 4096-dimensional feature vector.
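    As an illustration, the feature extraction step could be implemented as in the following sketch. The paper's experiments were run in MATLAB; this is a hedged Python/PyTorch equivalent (the torchvision model, the 224x224 input size and the ImageNet normalization constants are standard for VGG-16, but the exact pre-processing used by the authors is an assumption):

    import torch
    from torchvision import models, transforms
    from PIL import Image

    # Load a pre-trained VGG-16 (VGG-VD-16) and keep only the first fully
    # connected layer of the classifier head, which outputs 4096 features.
    vgg = models.vgg16(pretrained=True)
    vgg.eval()
    fc1 = torch.nn.Sequential(*list(vgg.classifier.children())[:1])

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def extract_features(image_path):
        # Forward pass up to (and including) the first FC layer only.
        x = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
        with torch.no_grad():
            x = vgg.features(x)
            x = vgg.avgpool(x)
            x = torch.flatten(x, 1)
            return fc1(x).squeeze(0).numpy()   # shape: (4096,)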

    a) The values are then standardized using the following equation (1):

    \hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}    (1)

    where $\hat{x}_{i,j}$ denotes a standardized feature $j$ of the image $x_i$ ($x_{i,j}$ are the values provided by the FCL, and $\mu_j$ and $\sigma_j$ are their mean and standard deviation); $i = 1, 2, \dots, N$ denotes the time stamp or the ID of the image; $j = 1, 2, \dots, n$ refers to the number of features of the given image, in our case $n = 4096$.

    b) The standardized values are normalised to bring them to the range [0;1]:

    \bar{x}_{i,j} = \frac{\hat{x}_{i,j} - \min_i(\hat{x}_{i,j})}{\max_i(\hat{x}_{i,j}) - \min_i(\hat{x}_{i,j})}    (2)

    where $\bar{x}_{i,j}$ denotes the normalized value of the feature vector. For clarity, in the rest of the paper we will use $x$ instead of $\bar{x}$.
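    A minimal sketch of this pre-processing step (assuming, as the per-image index $i$ in equations (1) and (2) suggests, that the statistics and the min/max are taken per feature across the training images; the function name is illustrative):

    import numpy as np

    def standardize_and_normalize(X):
        # X: N x n matrix of FCL outputs, one 4096-dimensional row per image.
        # Equation (1): standardize each feature by its mean and std. deviation.
        X_hat = (X - X.mean(axis=0)) / X.std(axis=0)
        # Equation (2): rescale each standardized feature to the range [0, 1].
        mn, mx = X_hat.min(axis=0), X_hat.max(axis=0)
        return (X_hat - mn) / (mx - mn)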

    Initialization:

    Meta-parameters for the xDNN are initialized with the first observed data sample (image). The proposed algorithm works per class; therefore, all the calculations are done for each class separately.

    \mu \leftarrow \bar{x}_1; \quad P \leftarrow 1    (3)

    where $\mu$ denotes the global mean of the data samples of the given class, and $P$ is the total number of the identified prototypes from the observed data samples (images).

    Each class is initialized by the first data sample of that class:

    p_1 \leftarrow \bar{x}_1; \quad S_1 \leftarrow 1; \quad r_1 \leftarrow r^*    (4)

    where $p_1$ is the vector of features that describes the first prototype of the class; $S_1$ is the corresponding support (number of members) associated with this prototype; $r_1$ is the corresponding radius of the area of influence of $p_1$.

    In this paper, we use the same $r^*$ as [3]; the rationale is that two vectors $x$ and $p$ for which the angle between them is less than $\pi/6$ (30°) are pointing in close/similar directions. That is, we consider two feature vectors to be similar if the angle between them is smaller than 30 degrees. Note that $r^*$ is data derived, not a problem- or user-specific parameter. In fact, it can be defined without prior knowledge of the specific problem or data through the following equation (5):

    r^* = \sqrt{2 - 2\cos(30°)} \approx 0.5176    (5)
  2. Density layer:

    The density layer defines the mutual proximity of the images in the data space defined by the features from the previous layer. The data density, if a Euclidean form of distance is used, has a Cauchy form (6) [3]:

    D(x) = \frac{1}{1 + \frac{\|x - \mu\|^2}{\sigma^2}}    (6)

    where $D$ is the density, $\mu$ is the global mean, and $\sigma^2$ is the variance. The reason it is Cauchy is not arbitrary [3]. It can be demonstrated theoretically that if Euclidean or Mahalanobis types of distances in the feature space are considered, the data density reduces to the Cauchy type referred to in equation (6). Density can also be updated online [4]:

    D(x_i) = \frac{1}{1 + \|x_i - \mu_i\|^2 + X_i - \|\mu_i\|^2}    (7)

    where the mean $\mu_i$ and the scalar product $X_i$ can be updated recursively as follows:

    \mu_i = \frac{i-1}{i}\mu_{i-1} + \frac{1}{i}x_i, \quad \mu_1 = x_1    (8)

    X_i = \frac{i-1}{i}X_{i-1} + \frac{1}{i}\|x_i\|^2, \quad X_1 = \|x_1\|^2    (9)

    Data samples (images) that are closer to the global mean have higher density values. Therefore, the value of the data density indicates how strongly a particular data sample is influenced by other data samples in the data space due to their mutual proximity.
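    A minimal numeric sketch of the recursive density computation described by equations (7)-(9) (plain NumPy; the function and variable names are illustrative, not from the paper):

    import numpy as np

    def update_density_state(x, mu, X_sp, i):
        # Equations (8) and (9): recursive updates of the global mean and of
        # the average scalar product after seeing the i-th sample.
        mu = ((i - 1) / i) * mu + x / i
        X_sp = ((i - 1) / i) * X_sp + np.dot(x, x) / i
        return mu, X_sp

    def density(x, mu, X_sp):
        # Equation (7): online Cauchy data density; note that
        # X_sp - ||mu||^2 plays the role of the variance sigma^2.
        return 1.0 / (1.0 + np.sum((x - mu) ** 2) + X_sp - np.dot(mu, mu))

    Samples closer to the global mean yield values of $D$ closer to 1, matching the interpretation above.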

  3. Typicality layer:

    Typicality $\tau$ is an empirically derived form of probability distribution function (pdf). It is given by equation (10). The value of $\tau$, even at its local peaks, is much less than 1, while the integral of $\tau$ over the data space is $\int \tau \, dx = 1$ [3]:

    \tau(x) = \frac{D(x)}{\int_x D(x)\,dx}    (10)
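    In a discrete setting, where typicality is evaluated only at the observed samples, the integral in (10) reduces to a sum; a one-line illustrative analogue (our assumption for illustration):

    import numpy as np

    def typicality(densities):
        # densities: array of D(x_i) evaluated at the N observed samples.
        # tau then sums to 1 over the data set, mirroring the unit integral.
        return densities / densities.sum()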
  4. Prototypes layer:

    The prototype identification layer is the core of the proposed xDNN classifier. This layer is responsible for providing the clearly explainable model. The xDNN classifier is free from prior assumptions about the data distribution type, as well as about the random or deterministic nature of the data. In contrast, it extracts the actual distribution empirically from the data samples (images), bottom up [3]. The prototypes are independent from each other. Therefore, one can change the structure by adding a new prototype without influencing the already existing prototypes. In other words, the proposed xDNN is highly parallelizable and suitable for evolving forms of application where new prototypes may be added (if the data pattern requires this). The proposed xDNN method is trained per class, forming a set of prototypes per class; therefore, all the calculations are done for each class separately. Prototypes are the local peaks of the data density (and typicality) identified in the previous layers/stages of the algorithm from the images of the corresponding class, based on their feature vectors. The prototypes can be used to form linguistic logical rules of the following form:

    R_j: IF (I ∼ I_j) THEN (class c),   j = 1, 2, …, P

    where ∼ stands for similarity (it can also be seen as a fuzzy degree of membership); $I_j$ is the $j$-th identified prototype; $P$ is the number of identified prototypes; $c$ is the class; $I$ denotes an image.

    One rule per prototype can be formed. All rules per class can be combined together using logical OR, also known as disjunction or S-norm:

    R: IF (I ∼ I_1) OR (I ∼ I_2) OR … OR (I ∼ I_P) THEN (class c)

    Figure 4 illustrates the area of influence of the identified prototypes. These areas around the identified prototypes are called data clouds [3]. Thus, each prototype defines a data cloud.

    Fig. 4: Identified prototypes – Voronoi tessellation.

    We call all data points associated with a prototype data clouds, because their shape is not regular (e.g., hyper-spherical, hyper-ellipsoidal, etc.) and the prototype is not necessarily the statistical or geometric mean, but an actual image [3]. The algorithm absorbs the new data samples one by one by assigning them to the nearest (in the feature space) prototype:

    j^* = \arg\min_{j=1,\dots,P} \|\bar{x}_i - p_j\|^2    (11)

    If the following condition [3] is met:

    IF \; D(\bar{x}_i) \geq \max_{j=1,\dots,P} D(p_j) \;\; OR \;\; D(\bar{x}_i) \leq \min_{j=1,\dots,P} D(p_j)    (12)

    it means that $\bar{x}_i$ is out of the influence area of $p_{j^*}$. Therefore, the vector of features $\bar{x}_i$ becomes a new prototype of a new data cloud with meta-parameters initialized by equation (13). Add a new data cloud:

    P \leftarrow P + 1; \quad p_P \leftarrow \bar{x}_i; \quad S_P \leftarrow 1; \quad r_P \leftarrow r^*    (13)

    Otherwise, the parameters of the nearest data cloud are updated online by equation (14). It has to be stressed that all calculations per data cloud are performed on the basis of the data points associated with that data cloud only (i.e., locally, not globally, on the basis of all data points):

    p_{j^*} \leftarrow \frac{S_{j^*}\,p_{j^*} + \bar{x}_i}{S_{j^*} + 1}; \quad S_{j^*} \leftarrow S_{j^*} + 1; \quad r_{j^*}^2 \leftarrow \frac{r_{j^*}^2 + \left(1 - \|p_{j^*}\|^2\right)}{2}    (14)

    The xDNN learning procedure can be summarized by the following algorithm.

    xDNN: Learning Procedure

    1:  Read the first feature vector $\bar{x}_1$ representing an image of the class $c$;
    2:  Set $\mu \leftarrow \bar{x}_1$; $P \leftarrow 1$; $p_1 \leftarrow \bar{x}_1$; $S_1 \leftarrow 1$; $r_1 \leftarrow r^*$;
    3:  FOR $i = 2, \dots$
    4:     Read $\bar{x}_i$;
    5:     Calculate $D(\bar{x}_i)$ and $D(p_j)$ ($j = 1, \dots, P$) according to equations (7)-(9);
    6:     IF equation (12) holds
    7:        Create a new rule (prototype) according to equation (13);
    8:     ELSE
    9:        Search for the nearest prototype $p_{j^*}$ according to equation (11);
    10:       Update the rule according to equation (14);
    11:    END
    12: END
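    Putting the pieces together, the per-class learning loop could look like the following sketch. It mirrors equations (5)-(14) as reconstructed above; the function name and the use of NumPy are illustrative assumptions, not the authors' MATLAB implementation:

    import numpy as np

    R_STAR = np.sqrt(2 - 2 * np.cos(np.radians(30)))   # equation (5), approx. 0.5176

    def learn_prototypes(X):
        # X: normalized feature vectors of ONE class, one row per image.
        protos, supports, radii = [X[0].copy()], [1.0], [R_STAR]
        mu, X_sp = X[0].copy(), float(np.dot(X[0], X[0]))
        for i, x in enumerate(X[1:], start=2):
            mu = ((i - 1) / i) * mu + x / i                      # equation (8)
            X_sp = ((i - 1) / i) * X_sp + np.dot(x, x) / i       # equation (9)
            def d(v):                                            # equation (7)
                return 1.0 / (1.0 + np.sum((v - mu) ** 2)
                              + X_sp - np.dot(mu, mu))
            dens = [d(p) for p in protos]
            if d(x) >= max(dens) or d(x) <= min(dens):           # condition (12)
                # The sample lies outside the influence of the existing
                # prototypes: open a new data cloud (equation (13)).
                protos.append(x.copy()); supports.append(1.0); radii.append(R_STAR)
            else:
                # Assign to the nearest prototype (equation (11)) and update
                # its mean, support and radius online (equation (14)).
                j = int(np.argmin([np.sum((x - p) ** 2) for p in protos]))
                protos[j] = (supports[j] * protos[j] + x) / (supports[j] + 1)
                supports[j] += 1.0
                radii[j] = np.sqrt((radii[j] ** 2
                                    + (1 - np.dot(protos[j], protos[j]))) / 2)
        return protos, supports, radii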

  5. MegaClouds layer:

    In the MegaClouds layer, the data clouds formed by the prototypes in the previous layer are merged if the neighbouring prototypes have the same class label. In other words, they are merged if they belong to the same class. MegaClouds are used to facilitate human interpretability. Figure 5 illustrates the formation of the MegaClouds; a minimal merging sketch is given at the end of this subsection.

    Fig. 5: MegaClouds – Voronoi tessellation.

    Rules in the MegaClouds layer have the following format:

    R: IF (I ∼ MC_1) OR (I ∼ MC_2) OR … OR (I ∼ MC_mc) THEN (class c)

    where $MC_1, \dots, MC_{mc}$ are the MegaClouds, i.e., the areas formed from the merging of the data clouds, and $mc$ is the number of identified MegaClouds. Multimodal typicality, $\tau$, can also be used to illustrate the MegaClouds, as illustrated by Figure 6.

    Fig. 6: Typicality for the iRoads dataset.
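    The merging step referenced above could be sketched as follows. The paper merges neighbouring (Voronoi-adjacent) data clouds of the same class; this illustrative sketch approximates adjacency by overlap of the areas of influence, which is an assumption on our part:

    import numpy as np

    def merge_into_megaclouds(protos, radii, labels):
        # Union-find over data clouds: merge two clouds when they carry the
        # same class label and their areas of influence overlap (a simple
        # stand-in for Voronoi adjacency).
        parent = list(range(len(protos)))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]   # path compression
                a = parent[a]
            return a
        for a in range(len(protos)):
            for b in range(a + 1, len(protos)):
                if labels[a] == labels[b] and \
                   np.linalg.norm(protos[a] - protos[b]) <= radii[a] + radii[b]:
                    parent[find(a)] = find(b)
        megaclouds = {}
        for a in range(len(protos)):
            megaclouds.setdefault(find(a), []).append(a)
        return list(megaclouds.values())   # each group of indices is one MegaCloud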

II-B Architecture and Validation of the proposed xDNN

Architecture for the validation process of the proposed xDNN method is illustrated by Figure 7.

Fig. 7: Architecture for the validation process of the proposed xDNN.

The validation process of xDNN is composed of the following layers:

  1. Features descriptor layer;

  2. Similarity layer (density);

  3. Local decision-making;

  4. Global decision-making.

These layers are described in detail as follows:

  1. Features descriptor layer:

    This layer operates in the same way as the features descriptor layer described for the training process.

  2. Similarity layer (density):

    In this layer, the degrees of similarity to the nearest prototypes (per class) are extracted for each unlabeled (new/validation) data sample/image, defined as follows:

    S(x, p_j) = \frac{1}{1 + \|x - p_j\|^2}    (15)

    where $S$ denotes the similarity degree, which has the same Cauchy form as the data density (6), evaluated with respect to the prototype $p_j$.

  3. Local (per class) decision-making layer:

    Local (per class) decision-making is calculated based on the ‘winner-takes-all’ principle and can be obtained by:

    \lambda_c(x) = \max_{j=1,\dots,P_c} S(x, p_j)    (16)

    where $P_c$ is the number of prototypes of class $c$.
  4. Global decision-making layer: The global decision-making layer is in charge of forming the decision by assigning labels to the validation images based on the degrees of similarity to the prototypes obtained in the prototype identification layer, as illustrated by Figure 7, and by determining the winning class:

    \lambda^*(x) = \max_{c=1,\dots,C} \lambda_c(x)    (17)

    In order to determine the overall degree of satisfaction, the maximum of the local, per-class winners is applied. The label is then obtained by the following equation (18):

    label(x) = \arg\max_{c=1,\dots,C} \lambda_c(x)    (18)
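    The whole validation path, from per-prototype similarity to the final label, could be sketched as below (equations (15)-(18) as reconstructed above; the Cauchy similarity form and the dictionary-based interface are assumptions):

    import numpy as np

    def predict(x, prototypes_per_class):
        # prototypes_per_class: {class_label: list of prototype vectors}.
        scores = {}
        for label, protos in prototypes_per_class.items():
            # Equation (15): similarity of x to each prototype of the class.
            sims = [1.0 / (1.0 + np.sum((x - p) ** 2)) for p in protos]
            # Equation (16): per-class 'winner-takes-all'.
            scores[label] = max(sims)
        # Equations (17)-(18): the class whose local winner is globally maximal.
        return max(scores, key=scores.get)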

III Experimental Data

We validated our proposed approach, xDNN, using several complex, well-known image classification benchmark datasets (iRoads and Caltech-256).

III-A iRoads dataset

The iROADS dataset [15] was considered in the analysis first. The dataset contains 4,656 image frames recorded from moving vehicles on a diverse set of road scenes, during day and night, under various weather and lighting conditions, as described below:

  • Daylight - 903 images

  • Night - 1050 images

  • Rainy day - 1049 images

  • Rainy night - 431 images

  • Snowy - 569 images

  • Sun strokes - 307 images

  • Tunnel - 347 images

III-B Caltech-256

Caltech-256 has 30,607 images divided into 257 object categories (one of which is the background) [7].

III-C Performance Evaluation

The performance of the classification methods is usually evaluated based on their accuracy, defined as follows:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (19)

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

All the experiments were conducted in MATLAB 2018a on a personal computer with a 1.8 GHz Intel Core i5 processor, 8 GB of RAM, and the macOS operating system. The classification experiments were executed using 10-fold cross-validation with the same training-to-testing ratio (80% to 20%).

IV Results and Analysis

Computational simulations were performed to assess the accuracy of the proposed explainable deep learning method, xDNN, against other state-of-the-art approaches.

IV-A iRoads Dataset

Table I shows that the proposed xDNN method provides the best result in terms of classification accuracy as well as time/complexity and simplicity of the model structure (number of parameters/prototypes). The number of model parameters for xDNN (and DRB) is, strictly speaking, zero, because the two parameters (mean, $\mu$, and standard deviation, $\sigma$) per prototype (data cloud) are derived from the data and are not algorithmic or user-defined parameters. For the kNN method one can argue that the number of parameters is the number of data samples, N. The proposed explainable DNN surpasses in terms of accuracy the state-of-the-art VGG–VD–16 algorithm, which is a well-established convolutional deep neural network. Moreover, the proposed xDNN has at its top layer a very small number of MegaClouds (27 or, on average, 4 MegaClouds per class), which makes it very easy to explain and visualize. For comparison, our earlier deep rule-based model, DRB [2], also produced high accuracy and was trained a bit faster, but ended up with 521 prototypes (on average 75 prototypes per class) [20]. With xDNN we generate meaningful rules as well as an analytical description of the typicality, the empirically derived pdf, in a closed form that lends itself to further analysis and processing.


Method               Accuracy   Time (s)   # Parameters
xDNN                 99.59%     4.32       27
VGG–VD–16 [20]       99.51%     836.28     Not reported
DRB [20]             99.02%     2.95       521
SVM [20]             94.17%     5.67       Not reported
KNN [20]             93.49%     4.43       4656
Naive Bayes [20]     88.35%     5.31       Not reported

TABLE I: Performance Comparison: iRoads Dataset

MegaClouds generated by the proposed xDNN model can be visualized in terms of rules, as illustrated by Figure 8.

IF (I ∼ [prototype image 1]) OR (I ∼ [prototype image 2]) OR … OR (I ∼ [prototype image n]) THEN ‘Daylight scene’

Fig. 8: xDNN rule generated for the ‘Daylight scene’.

Voronoi tessellation can also be used to visualize the resulting MegaClouds, as illustrated by Figure 9.

Fig. 9: MegaClouds for the iRoads dataset.

The typicality for the classes ‘night scene’ and ‘snow scene’ is given in Figure 10.

Fig. 10: Typicality for the iRoads dataset (2D), 2 classes, representing ‘night scene’ and ‘snow scene’.

Typicality can also be used for interpretability and explainability, as it corresponds to the pdf. One can use the typicality to represent the likelihood that an image represents a specific type of driving conditions. For a given image, a vector of features can be extracted, standardized, and normalized, and then used to estimate the likelihood of a certain type of driving condition, as shown in Fig. 10.

IV-B Caltech-256 Dataset

Results for Caltech-256 are presented in Table II.


Method            Accuracy
xDNN              75.41%
SVM(1) [26]       24.6%
SVM(2) [26]       39.6%
SVM(3) [26]       46.0%
SVM(4) [26]       51.3%
SVM(5) [26]       65.6%
SVM(7) [26]       71.7%
Softmax(5) [26]   65.7%
Softmax(7) [26]   74.2%

TABLE II: Performance Comparison: Caltech-256 Dataset

The results presented in Table II demonstrate that the proposed xDNN approach obtains the best classification accuracy reported so far worldwide for this complex problem, namely 75.41%. The proposed approach surpassed all of the competitors, offering the highest accuracy as well as a clearly explainable model. xDNN produced on average 3 MegaClouds per class (721 in total), which are clearly explainable. The rules have the following format:

IF (x ∼ [prototype image 1]) OR (x ∼ [prototype image 2]) OR (x ∼ [prototype image 3]) THEN ‘CD’

Experiments have demonstrated that the proposed xDNN approach is able to produce highly accurate results, surpassing state-of-the-art methods on different challenging datasets. Moreover, xDNN produces highly interpretable results that can be presented in the form of logical rules, Voronoi tessellations, and/or typicality (an empirically derived form of pdf) in a closed analytical form allowing further analysis. Because of its recursive, non-iterative and non-parametric form, it allows computationally very efficient implementations.

V Conclusion

In this paper we propose a new method, the explainable deep neural network (xDNN), that directly addresses the bottlenecks of the traditional deep learning approaches and offers a clearly explainable internal architecture that can outperform the existing methods. The proposed xDNN approach requires very little computational resources (no need for GPUs) and short training times (in the order of seconds). The proposed approach, xDNN, is prototype-based. Prototypes are actual training data samples (images), which are local peaks of the empirical data distribution, called typicality, as well as of the data density. This generative model is identified in a closed form and equates to the pdf, but is derived automatically and entirely from the training data with no user- or problem-specific thresholds, parameters or intervention. The proposed xDNN offers a new deep learning architecture that combines reasoning and learning in a synergy. It is non-iterative and non-parametric, which explains its efficiency in terms of time and computational resources. From the user perspective, the proposed approach is clearly understandable to human users. Results on well-known benchmark data sets such as iRoads and Caltech-256 show that xDNN outperforms the other methods, including state-of-the-art deep learning approaches (VGG–VD–16), in terms of accuracy and time to train, and offers a clearly explainable classifier. In fact, the result on the very hard Caltech-256 problem (which has 257 classes) represents a world record [8] (https://martin-thoma.com/sota/). Future research will concentrate on the development of a tree-based architecture, synthetic data generation, and local optimization in order to improve the proposed deep explainable approach.

References

  • [1] P. P. Angelov, X. Gu, and J. C. Príncipe (2017) A generalized methodology for data analysis. IEEE Transactions on Cybernetics 48 (10), pp. 2981–2993.
  • [2] P. P. Angelov and X. Gu (2018) Deep rule-based classifier with human-level performance and characteristics. Information Sciences 463, pp. 196–213.
  • [3] P. P. Angelov and X. Gu (2019) Empirical fuzzy sets and systems. In Empirical Approach to Machine Learning, pp. 135–155.
  • [4] P. Angelov (2012) Autonomous Learning Systems: From Data Streams to Knowledge in Real-Time. John Wiley & Sons.
  • [5] F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
  • [6] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep Learning. MIT Press.
  • [7] G. Griffin, A. Holub, and P. Perona (2007) Caltech-256 object category dataset.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1904–1916.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [10] J. Hu, J. Lu, and Y. Tan (2015) Deep transfer metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 325–333.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [12] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444.
  • [13] K. Mizuno, Y. Terachi, K. Takagi, S. Izumi, H. Kawaguchi, and M. Yoshimoto (2012) Architectural study of HOG feature extraction processor for real-time object detection. In 2012 IEEE Workshop on Signal Processing Systems, pp. 197–202.
  • [14] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun (2016) Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (7), pp. 1476–1481.
  • [15] M. Rezaei and M. Terauchi (2013) Vehicle detection based on multi-feature clues and Dempster-Shafer fusion theory. In Pacific-Rim Symposium on Image and Video Technology, pp. 60–72.
  • [16] C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215.
  • [17] J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural Networks 61, pp. 85–117.
  • [18] T. J. Sejnowski (2018) The Deep Learning Revolution. MIT Press.
  • [19] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [20] E. Soares, P. Angelov, B. Costa, and M. Castro (2019) Actively semi-supervised deep rule-based classifier applied to adverse driving scenarios. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
  • [21] E. Soares and P. Angelov (2019) Novelty detection and learning from extremely weak supervision. arXiv preprint arXiv:1911.00616.
  • [22] B. Solmaz, S. M. Assari, and M. Shah (2013) Classifying web videos using a global video descriptor. Machine Vision and Applications 24 (7), pp. 1473–1485.
  • [23] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
  • [25] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke (2018) The Microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5934–5938.
  • [26] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
  • [27] F. Zhuang, X. Cheng, P. Luo, S. J. Pan, and Q. He (2015) Supervised representation learning: transfer learning with deep autoencoders. In Twenty-Fourth International Joint Conference on Artificial Intelligence.