Deep Within-Class Covariance Analysis for Acoustic Scene Classification

11/10/2017 ∙ by Hamid Eghbal-zadeh, et al.

Within-Class Covariance Normalization (WCCN) is a powerful post-processing method for normalizing the within-class covariance of a set of data points. WCCN projects the observations into a linear sub-space where the within-class variability is reduced. This property has proven to be beneficial in subsequent recognition tasks. The central idea of this paper is to reformulate the classic WCCN as a Deep Neural Network (DNN) compatible version. We propose Deep Within-Class Covariance Analysis (DWCCA), which can be incorporated in a DNN architecture. This formulation enables us to exploit the beneficial properties of WCCN while still allowing for training with Stochastic Gradient Descent (SGD) in an end-to-end fashion. We investigate the advantages of DWCCA on deep neural networks with convolutional layers for supervised learning. Our results on Acoustic Scene Classification show that with DWCCA we can achieve equal or superior performance in a VGG-style deep neural network.




1 Introduction

Deep neural networks (DNNs) are the state of the art in supervised learning in many areas such as image classification. Using the power of convolutional layers, DNNs are able to learn novel features that are often superior to engineered features. The characteristics of the representations learned by DNNs in an end-to-end fashion are influenced by different factors such as the network architecture, the training data [1] and the optimization algorithm. One of the drawbacks of such models is that their success is tied to an enormous amount of training data. Although DNNs are the state of the art in the ImageNet competition [2], which provides over 1 million data samples, in challenges such as DCASE-2016 [3], where only slightly more than 1 thousand recordings are provided, other methods such as factor analysis [4] and matrix factorization [5] outperform DNNs. This is also reflected in the challenge results [6], as DNNs with fewer parameters [7] performed better than models with deeper architectures [4].

To influence the characteristics of the learned representation and overcome shortcomings of end-to-end deep learning, many have tried to integrate classical machine learning algorithms into a deep learning framework in the form of a layer or an objective function. Examples of such methods are Deep Linear Discriminant Analysis (DeepLDA) [8], which learns linearly separable latent representations on top of a DNN, and Deep Canonical Correlation Analysis (DCCA) [9], which is used to produce correlated representations of multi-modal acoustic and articulatory speech data. These methods influence the learned representation to have characteristics similar to their respective conventional counterparts.

Conventional Within-Class Covariance Normalization (WCCN) [10] is a powerful method used to reduce the covariance of classes by projecting the observations into a linear sub-space where the within-class variability is reduced. WCCN is used with different features such as i-vectors [11] and Maximum-Likelihood Linear Regression (MLLR) [12] for speaker, language and music artist recognition [13], Over-Complete Local Binary Patterns (OCLBP) for face recognition [14] and Gaussianized Vector Representation for video and image modeling [15].

In this paper, we reformulate the classical WCCN as a Deep Neural Network (DNN) compatible version by proposing the Deep Within-Class Covariance Analysis (DWCCA) layer. Our DWCCA layer can be incorporated at arbitrary positions in a DNN architecture, which enables us to utilize the beneficial properties of WCCN directly within a DNN while still allowing for end-to-end training with Stochastic Gradient Descent (SGD). We provide empirical results to demonstrate that by adding a DWCCA layer to a DNN (in our case, a VGG-style network [16]) we can influence the covariance of the learned representation, which results in smaller per-class covariance and smaller overall within-class covariance. These properties lead to similar or superior performance on the task of Acoustic Scene Classification (ASC) compared to a DNN without DWCCA. We further provide deeper insights by analyzing the covariance of the representations learned by the network and visualizing a 2D version of them.

2 Deep Within-Class Covariance Analysis

We start this section by introducing a common notation which will be used throughout the paper. Based on this notation we first describe Within-Class Covariance Normalization (WCCN) and show how we cast it as a deep learning compatible version.

2.1 Conventional Within-Class Covariance Normalization

Let $X = \{x_1, \dots, x_N\}$ denote a set of $N$ $d$-dimensional observations (feature vectors) belonging to $C$ different classes $c \in \{1, \dots, C\}$. The observations are in the present case either hand-crafted features (e.g. i-vectors) or any intermediate hidden representation of a deep neural network.

WCCN is a linear projection that provides an effective compensation for the within-class variability and has proven to be effective in combination with different scoring functions [17, 18]. The WCCN projection scales the feature space in the opposite direction of its within-class covariance matrix, which has the advantage that finding decision boundaries on the WCCN-projected data becomes easier [19]. The within-class covariance $S_w$ is estimated as:

$$S_w = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} (x_i^c - \bar{x}_c)(x_i^c - \bar{x}_c)^T$$

where $\bar{x}_c = \frac{1}{N_c} \sum_{i=1}^{N_c} x_i^c$ is the mean feature vector of class $c$ and $x_i^c$ are the samples belonging to this class. $N_c$ is the number of observations of each class in the training set. We use the inverse of matrix $S_w$ to normalize the direction of the projected feature vectors. The WCCN projection matrix $B$ can be estimated using the Cholesky decomposition as $S_w^{-1} = B B^T$.
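The conventional WCCN estimate described above can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the optional `eps` ridge term and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def within_class_covariance(X, y):
    """Average of the per-class covariance matrices (each normalised by its class size)."""
    classes = np.unique(y)
    dim = X.shape[1]
    S_w = np.zeros((dim, dim))
    for c in classes:
        Xc = X[y == c]
        centered = Xc - Xc.mean(axis=0)          # subtract the class mean
        S_w += centered.T @ centered / len(Xc)
    return S_w / len(classes)

def wccn_projection(X, y, eps=0.0):
    """Lower-triangular B with B @ B.T equal to the inverse within-class covariance."""
    S_w = within_class_covariance(X, y)
    # Assumption: an optional ridge (eps > 0) for ill-conditioned estimates;
    # it is not part of the description in the text.
    S_w = S_w + eps * np.eye(S_w.shape[0])
    return np.linalg.cholesky(np.linalg.inv(S_w))

# Synthetic toy data: 3 classes with correlated within-class noise.
rng = np.random.default_rng(0)
n_classes, per_class, dim = 3, 200, 4
y = np.repeat(np.arange(n_classes), per_class)
mixing = np.eye(dim) + 0.5 * np.triu(rng.normal(size=(dim, dim)), k=1)
class_means = rng.normal(scale=5.0, size=(n_classes, dim))
X = rng.normal(size=(len(y), dim)) @ mixing + class_means[y]

B = wccn_projection(X, y)
X_proj = X @ B                                   # row-wise application of B^T x
S_w_proj = within_class_covariance(X_proj, y)    # close to the identity matrix
```

On the data used to estimate the projection, the within-class covariance of the projected features is the identity up to numerical error, which is the sense in which the within-class variability is normalized.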

2.2 Deep Within-Class Covariance Analysis

Based on the definitions above we propose DWCCA, a DNN-compatible formulation of WCCN. The parameters of our networks are optimized with ADAM [20], a variation of Stochastic Gradient Descent (SGD) with faster convergence, using mini-batches of size $m$. This optimization strategy implies that each parameter update is computed only on a small subset of the entire training set. The deterministic version of WCCN described above is usually estimated on the entire training (development) dataset, which by the definition of SGD is not available in the present case. Another reason why computing the projection matrix on the entire dataset might not be feasible is that the training set may simply be too large [21]. In the following we propose the DWCCA layer, which helps to circumvent these problems.

Figure 1: Eigenvalue analysis of the covariance of the network predictions and 2D visualizations of the network predictions on the evaluation data. (a) Eigenvalues of the covariance of the network predictions on all of the evaluation set. (b) Eigenvalues of the covariance of the network predictions on all examples of class City Center in the evaluation set. (c) 2D PCA projected predictions of the network without DWCCA on the evaluation set. (d) 2D PCA projected predictions of the network with DWCCA on the evaluation set.

Instead of computing the within-class covariance matrices on the entire training set, we provide an estimate on the observations of the respective mini-batches. Given these estimates, we compute the corresponding mini-batch projection matrix $B_b$ and use it to maintain a moving-average projection matrix $\tilde{B}$ as $\tilde{B} = (1 - \alpha)\tilde{B} + \alpha B_b$. In addition, the moving average is applied to the means of each class and to the within-class covariance to provide a better estimate of these parameters over the training set. This is done similarly to the computation of the mean and standard deviation for normalizing the activations in batch normalization [22]. The hyperparameter $\alpha$ controls the influence of the data in the current batch on the final DWCCA projection matrix. The output of this processing step is the DWCCA-projected data of the respective mini-batch. The DWCCA layer can be seen as a special dense layer with a predefined weight matrix, the projection matrix $\tilde{B}$, with the difference that the parameters are computed via the activations of the previous layer and not learned via SGD. Since the proposed covariance normalization is applied directly during optimization and implemented as a layer within the network, we have to establish gradient flow through it. This is required for training our networks with back-propagation. We implement the DWCCA layer using the automatic differentiation framework Theano [23], which already provides the derivatives of matrix inverses and the Cholesky decomposition. We refer the interested reader to [24] for details on this derivative.
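To make the mini-batch procedure more concrete, the following NumPy sketch shows only the forward pass of such a layer; the gradient flow discussed above is deliberately left out. It rests on several assumptions that are not confirmed by the text: the class name `DWCCASketch`, the default values of `alpha` and `eps`, the exponential form of the moving-average update, and the choice to project with the moving-average matrix during training are all hypothetical.

```python
import numpy as np

class DWCCASketch:
    """Forward-pass-only sketch of a mini-batch WCCN layer (all names hypothetical)."""

    def __init__(self, dim, alpha=0.1, eps=1e-4):
        self.alpha = alpha                 # weight of the current batch (assumed value)
        self.eps = eps                     # assumed ridge for numerical stability
        self.running_B = np.eye(dim)       # moving-average projection matrix

    def _batch_projection(self, H, y):
        dim = H.shape[1]
        classes = np.unique(y)
        S_w = np.zeros((dim, dim))
        for c in classes:
            Hc = H[y == c]
            centered = Hc - Hc.mean(axis=0)
            S_w += centered.T @ centered / len(Hc)
        S_w = S_w / len(classes) + self.eps * np.eye(dim)
        return np.linalg.cholesky(np.linalg.inv(S_w))

    def forward(self, H, y=None, training=True):
        if training:                       # labels are only needed while training
            B_b = self._batch_projection(H, y)
            self.running_B = (1 - self.alpha) * self.running_B + self.alpha * B_b
        return H @ self.running_B

def within_class_trace(H, y):
    """Total within-class variance, averaged over classes."""
    classes = np.unique(y)
    return sum(H[y == c].var(axis=0).sum() for c in classes) / len(classes)

# Toy activations: 15 classes, 15 examples each, strong within-class noise.
rng = np.random.default_rng(0)
n_classes, per_class, dim = 15, 15, 8
y = np.repeat(np.arange(n_classes), per_class)
class_means = rng.normal(size=(n_classes, dim))
layer = DWCCASketch(dim)
for _ in range(50):
    H = class_means[y] + 3.0 * rng.normal(size=(len(y), dim))
    out = layer.forward(H, y, training=True)

H_eval = class_means[y] + 3.0 * rng.normal(size=(len(y), dim))
out_eval = layer.forward(H_eval, training=False)   # uses the stored matrix only
```

At inference time the sketch behaves like a dense layer with fixed weights, which mirrors the description of the layer as a dense layer whose weight matrix is computed from activations rather than learned.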

3 Experiments


Conv(pad-2, stride-2)-


Conv(pad-1, stride-1)--BN-ReLu
Max-Pooling + Drop-Out()
Conv(pad-1, stride-1)--BN-ReLu
Conv(pad-1, stride-1)--BN-ReLu
Max-Pooling + Drop-Out()
Conv(pad-1, stride-1)--BN-ReLu
Conv(pad-1, stride-1)--BN-ReLu
Conv(pad-1, stride-1)--BN-ReLu
Conv(pad-1, stride-1)--BN-ReLu
Max-Pooling + Drop-Out()
Conv(pad-0, stride-1)--BN-ReLu
Conv(pad-0, stride-1)--BN-ReLu
Conv(pad-0, stride-1)--BN-ReLu
DWCCA (if applied)
15-way Soft-Max
Table 1: Model Specifications. BN: Batch Normalization, ReLu: Rectified Linear Activation Function, CCE: Categorical Cross Entropy. For training, a constant batch size of 225 samples (15 examples of each class) is used.

3.1 Dataset

To evaluate the performance of different methods, we use the TUT database for ASC and sound event detection (TUT16) [3]. We use both the development and evaluation sets for performance comparison in terms of accuracy. On the development set, we use the four-fold cross-validation (CV) split provided with the dataset. For the evaluation set, we train on the development set and test on the evaluation set.

3.2 Baseline Systems

Our first baseline is a VGG-style CNN [4] which operates on audio spectrograms. Our second baseline is a gated recurrent neural network optimized with DeepLDA [25]. This provides a good comparison with DeepLDA, as it is a deep learning compatible version of conventional LDA. Our last baseline is a VGG-style network with the same architecture as in [4]. We call this baseline VGG-baseline; it is slightly modified from [4] as follows to be suitable for our DWCCA experiments. First, we reduce the spectrogram dimensionality by using mel-spectrograms instead of log spectrograms. Second, to introduce higher variance, we use longer durations (more frames per example). Third, we use stratified training with mini-batches of 225 samples containing 15 examples from each class. These changes are necessary since DWCCA requires samples of all the classes for the within-class covariance computation. This results in larger batch sizes than usual, which is compensated for by the lower feature dimensionality so that everything fits within our Titan X GPU memory. Instead of SGD with momentum, we use ADAM, and we do not use an l2-norm penalty, as it is reported in [26] that it does not improve generalization. We use the same learning rate schedule as in [4].
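The stratified mini-batches mentioned above (15 examples from each of the 15 classes, 225 in total) could be drawn along the following lines. This is only a sketch: the function name `stratified_batches` and the toy label array are hypothetical, and the source does not describe how the authors actually sampled their batches.

```python
import numpy as np

def stratified_batches(labels, per_class, rng):
    """Yield index arrays that hold exactly `per_class` examples of every class."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Shuffle the indices of each class once per epoch.
    pools = {c: rng.permutation(np.flatnonzero(labels == c)) for c in classes}
    n_batches = min(len(pool) for pool in pools.values()) // per_class
    for b in range(n_batches):
        start = b * per_class
        batch = np.concatenate([pools[c][start:start + per_class] for c in classes])
        yield rng.permutation(batch)       # mix the classes within the batch

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(15), 60)      # hypothetical toy set: 15 classes, 60 clips each
batches = list(stratified_batches(labels, per_class=15, rng=rng))
first = batches[0]
counts = np.bincount(labels[first], minlength=15)
```

Every batch then contains all classes, so a per-class mean and covariance can be computed inside each batch.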

3.3 Results

For the DWCCA experiments, our setup is identical to VGG-baseline with a DWCCA layer added before the softmax. The moving-average hyperparameter of DWCCA is set to the same value as the corresponding parameter used in batch normalization. In [27] the authors suggest that the eigenvalues of the covariance of the neural network activations can be used to investigate the dynamics of learning in NNs. In this paper, we use the eigenvalues of the covariance of the final output activations of our network to study the covariance of the classes in the predicted scores. In Figure 1.a, the eigenvalues of the covariance of the predictions of the network are provided for the cases with and without the DWCCA layer. As can be clearly seen, applying DWCCA yields a flatter eigenvalue spectrum, which suggests that the variance of the predictions is spread more evenly across dimensions than in the case without DWCCA. Also, as explained in [27], minimizing the ratio $\lambda_{max}/\lambda_{min}$ is beneficial for network learning, where $\lambda$ denotes the eigenvalues of the covariance of the activations. Figure 1.b shows a similar behavior for the scores of the City Center class only, which had better performance with DWCCA. Comparing Figure 1.c and 1.d reveals that in a PCA-projected 2D representation the classes appear with smaller within-class variability when DWCCA is applied, as they shrink into a line or gather in one area. In the results reported in Table 2, on the validation set DWCCA outperforms VGG-baseline on three of the four folds and [4] on two of them. On average, [4] performs better than VGG-baseline. This can be explained by the fact that the features used in VGG-baseline are of lower dimensionality than those in [4]. Applying DWCCA improves the average performance over VGG-baseline. Comparing the performance on the evaluation set shows that all three methods ([4], VGG-baseline and DWCCA) achieve similar performance. Comparing DeepLDA and DWCCA reveals that DWCCA outperforms DeepLDA on the evaluation set. The development-set results of DeepLDA were not reported in [25].
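The eigenvalue analysis used in this section can be reproduced in outline with NumPy. The sketch below is illustrative only: the helper names and the two synthetic score matrices are hypothetical stand-ins, not the network predictions analysed in Figure 1.

```python
import numpy as np

def covariance_eigenvalues(scores):
    """Eigenvalues, largest first, of the covariance of row-wise observations."""
    cov = np.cov(scores, rowvar=False)
    return np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order

def eigenvalue_ratio(scores):
    """Ratio of the largest to the smallest covariance eigenvalue."""
    eig = covariance_eigenvalues(scores)
    return eig[0] / eig[-1]

rng = np.random.default_rng(0)
# Hypothetical score matrices: one with very uneven variance per dimension,
# one with roughly equal variance in every dimension.
stretched = rng.normal(size=(1000, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])
flat = rng.normal(size=(1000, 5))

ratio_stretched = eigenvalue_ratio(stretched)
ratio_flat = eigenvalue_ratio(flat)        # much closer to 1
```

A flatter spectrum corresponds to a ratio closer to one. Note that softmax outputs sum to one for every example, so their covariance has one eigenvalue that is numerically zero; in that case the ratio is only meaningful over the remaining directions.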

(%)            fold1   fold2   fold3   fold4   eval
VGG [4]        80.69   75.52   77.85   83.90   83.30
DeepLDA [25]   -       -       -       -       79.1
VGG baseline   82.06   75.51   72.14   78.08   83.58
DWCCA          84.48   71.72   79.19   78.76   83.33
Fused          -       -       -       -       85.64
Table 2: Audio scene classification accuracy on the provided DCASE-2016 test set with provided cross-validation splits. Methods marked with an asterisk (*) used the score calibration projection which was trained on the same set as the test set because of the lack of a validation set in the provided cross-validation splits.
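The averages discussed in the text can be recomputed from the fold accuracies in Table 2. The short sketch below assumes an unweighted mean over the four folds; the source does not state how the averages were computed, so the numbers it produces may differ slightly from those the authors reported.

```python
# Fold accuracies (%) copied from Table 2.
fold_accuracy = {
    "VGG [4]": [80.69, 75.52, 77.85, 83.90],
    "VGG baseline": [82.06, 75.51, 72.14, 78.08],
    "DWCCA": [84.48, 71.72, 79.19, 78.76],
}
# Evaluation-set accuracies (%) copied from Table 2.
eval_accuracy = {"VGG [4]": 83.30, "VGG baseline": 83.58, "DWCCA": 83.33, "Fused": 85.64}

# Assumption: plain arithmetic mean over folds (folds are not weighted by size).
mean_accuracy = {name: sum(acc) / len(acc) for name, acc in fold_accuracy.items()}
for name, mean in mean_accuracy.items():
    print(f"{name}: {mean:.2f}")

dwcca_gain = mean_accuracy["DWCCA"] - mean_accuracy["VGG baseline"]
best_single = max(v for k, v in eval_accuracy.items() if k != "Fused")
fusion_gain = eval_accuracy["Fused"] - best_single
print(f"DWCCA gain over VGG baseline (folds): {dwcca_gain:.2f} points")
print(f"Fusion gain over best single system (eval): {fusion_gain:.2f} points")
```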

The class-wise performances are provided in Table 3. As can be seen, adding DWCCA changed the behavior of the network: it improved the performance on 6 out of 15 classes and left it unchanged for 2 of the classes. As the network performs differently when DWCCA is added, we late-fused the probabilities of the two networks using linear logistic regression and report the results under Fused. The results in both tables show that the two networks carry complementary information: the fusion outperforms all the baselines, reaching 85.64% accuracy on the evaluation set compared to at most 83.58% for the individual systems in Table 2.
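Late fusion with linear logistic regression can be sketched as follows. This is not the authors' fusion code: concatenating the class probabilities of the two systems as input features, training by plain gradient descent, and every name and hyperparameter below are assumptions made for illustration, and the data are synthetic.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_fusion(features, labels, n_classes, lr=0.5, steps=300):
    """Multinomial logistic regression trained with full-batch gradient descent."""
    n, d = features.shape
    X = np.hstack([features, np.ones((n, 1))])   # append a bias column
    Y = np.eye(n_classes)[labels]                # one-hot targets
    W = np.zeros((d + 1, n_classes))
    losses = []
    for _ in range(steps):
        P = softmax(X @ W)
        losses.append(-np.mean(np.log(P[np.arange(n), labels] + 1e-12)))
        W -= lr * X.T @ (P - Y) / n              # gradient of the mean cross-entropy
    return W, losses

def fuse(W, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return softmax(X @ W)

# Synthetic stand-ins for the class probabilities of two systems.
rng = np.random.default_rng(0)
n_classes, n = 3, 300
labels = rng.integers(0, n_classes, size=n)
onehot = np.eye(n_classes)[labels]
probs_a = softmax(2.0 * onehot + rng.normal(size=(n, n_classes)))
probs_b = softmax(2.0 * onehot + rng.normal(size=(n, n_classes)))

features = np.hstack([probs_a, probs_b])         # assumed fusion input
W, losses = fit_fusion(features, labels, n_classes)
fused = fuse(W, features)
```

In practice the fusion weights would be fitted on held-out data rather than on the data being scored; the toy example fits and applies them on the same samples only to keep the sketch short.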

4 Conclusion

We presented the DWCCA layer, a DNN-compatible version of the classic WCCN which is used to normalize the within-class covariance of the hidden activations. In contrast to classic WCCN, DWCCA is formulated as an interchangeable network component that can be directly incorporated inside a DNN. This has the advantage that it allows for joint end-to-end training with SGD and back-propagation and provides a better internal representation of the network. We showed that DWCCA achieved similar or superior performance while providing a representation with lower within-class covariance.

5 Acknowledgment

The authors acknowledge Hasan Bahari of KU Leuven for helpful discussions about this work. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan X GPU used for this research.

(%)                VGG baseline   DWCCA   Fused
Forest path        88             92      90
Office             87             96      96
Grocery store      62             64      73
Cafe/Restaurant    92             100     98
Bus                92             96      98
Metro station      88             83      87
Park               83             74      77
Library            81             81      81
Tram               72             69      71
Residential area   88             82      84
Train              92             87      87
Car                91             85      90
City center        78             84      84
Home               70             70      76
Beach              86             83      88
Average            83             83      85
Table 3: The class-wise F1-score comparing the performance of different methods for classification of different audio scenes on the DCASE-2016 provided evaluation dataset.
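The class-wise comparison between VGG baseline and DWCCA can be checked directly from the scores in Table 3. The snippet below is a small bookkeeping sketch; the variable names are hypothetical and the values are copied from the table.

```python
# Class-wise F1-scores (%) from Table 3: (VGG baseline, DWCCA).
f1 = {
    "Forest path": (88, 92), "Office": (87, 96), "Grocery store": (62, 64),
    "Cafe/Restaurant": (92, 100), "Bus": (92, 96), "Metro station": (88, 83),
    "Park": (83, 74), "Library": (81, 81), "Tram": (72, 69),
    "Residential area": (88, 82), "Train": (92, 87), "Car": (91, 85),
    "City center": (78, 84), "Home": (70, 70), "Beach": (86, 83),
}

improved = [c for c, (base, dwcca) in f1.items() if dwcca > base]
unchanged = [c for c, (base, dwcca) in f1.items() if dwcca == base]
degraded = [c for c, (base, dwcca) in f1.items() if dwcca < base]

print("improved:", improved)
print("unchanged:", unchanged)
print("degraded:", degraded)
```

The two systems err on different classes, which is consistent with the observation that fusing them helps.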


References

  • [1] Pang Wei Koh and Percy Liang, “Understanding black-box predictions via influence functions,” in Proceedings of the 34th International Conference on Machine Learning, 2017.
  • [2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
  • [3] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, “Tut database for acoustic scene classification and sound event detection,” in Signal Processing Conference (EUSIPCO), 2016 24th European. IEEE, 2016, pp. 1128–1132.
  • [4] Hamid Eghbal-zadeh, Bernhard Lehner, Matthias Dorfer, and Gerhard Widmer, “Cp-jku submissions for dcase-2016: a hybrid approach using binaural i-vectors and deep convolutional neural networks,” 2016.
  • [5] Victor Bisot, Romain Serizel, Slim Essid, and Gael Richard, “Supervised nonnegative matrix factorization for acoustic scene classification,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
  • [6] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, “Assessment of human and machine performance in acoustic scene classification: Dcase 2016 case study,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2017.
  • [7] Michele Valenti, Aleksandr Diment, Giambattista Parascandolo, Stefano Squartini, and Tuomas Virtanen, “Dcase 2016 acoustic scene classification using convolutional neural networks,” in Proc. Workshop Detection Classif. Acoust. Scenes Events, 2016, pp. 95–99.
  • [8] Matthias Dorfer, Rainer Kelz, and Gerhard Widmer, “Deep Linear Discriminant Analysis,” International Conference on Learning Representations (ICLR), 2016.
  • [9] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu, “Deep canonical correlation analysis,” Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
  • [10] A Hatch and A Stolcke, “Generalized linear kernels for one-versus-all classification: application to speaker recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings., 2006.
  • [11] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” Audio, Speech, and Language Processing, IEEE Transactions on, 2011.
  • [12] Christopher J Leggetter and Philip C Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden markov models,” Computer Speech & Language, 1995.
  • [13] Mohamad Hasan Bahari, Najim Dehak, Hugo Van Hamme, Lukas Burget, Ahmed M. Ali, and Jim Glass, “Non-negative factor analysis of Gaussian mixture model weight adaptation for language and dialect recognition,” IEEE Transactions on Audio, Speech and Language Processing, 2014.
  • [14] Oren Barkan, Jonathan Weill, Lior Wolf, and Hagai Aronowitz, “Fast high dimensional vector multiplication face recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.
  • [15] Xiaodan Zhuang, Modeling audio and visual cues for real-world event detection, Ph.D. thesis, University of Illinois at Urbana-Champaign, 2011.
  • [16] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [17] Najim Dehak, Reda Dehak, James Glass, Douglas Reynolds, and Patrick Kenny, “Cosine Similarity Scoring without Score Normalization Techniques,” Proceedings of Odyssey 2010 - The Speaker and Language Recognition Workshop (Odyssey), 2010.
  • [18] Ming Li, Andreas Tsiartas, Maarten Van Segbroeck, and Shrikanth S Narayanan, “Speaker verification using simplified and supervised i-vector modeling,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7199–7203.
  • [19] Ao Hatch, “Within-class covariance normalization for SVM-based speaker recognition,” Interspeech, 2006.
  • [20] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [21] Weiran Wang and Karen Livescu, “Large-scale approximate kernel canonical correlation analysis,” International Conference on Learning Representations (ICLR) (arXiv:1511.04773), 2015.
  • [22] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, 2015.
  • [23] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio, “Theano: a CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), 2010, Oral Presentation.
  • [24] Stephen P Smith, “Differentiation of the cholesky algorithm,” Journal of Computational and Graphical Statistics, 1995.
  • [25] Matthias Zöhrer and Franz Pernkopf, “Gated recurrent networks applied to acoustic scene classification and acoustic event detection,” IEEE AASP Chall. Detect. Classif. Acoust. Scenes Events (DCASE), 2016.
  • [26] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
  • [27] Yann Le Cun, Ido Kanter, and Sara A Solla, “Eigenvalues of covariance matrices: Application to neural-network learning,” Physical Review Letters, vol. 66, no. 18, pp. 2396, 1991.