End-to-End Auditory Object Recognition via Inception Nucleus

by   Mohammad K. Ebrahimpour, et al.

Machine learning approaches to auditory object recognition are traditionally based on engineered features such as those derived from the spectrum or cepstrum. More recently, end-to-end classification systems for image and auditory recognition have been developed that learn features jointly with classification, resulting in improved classification accuracy. In this paper, we propose a novel end-to-end deep neural network that maps raw waveform inputs to sound class labels. Our network includes an "inception nucleus" that optimizes the size of convolutional filters on the fly, which dramatically reduces engineering effort. Classification results compared favorably against current state-of-the-art approaches, besting them by 10.4 percentage points on the UrbanSound8k dataset. Analyses of learned representations revealed that filters in the earlier hidden layers learned wavelet-like transforms to extract features that were informative for classification.




1 Introduction

Deep Convolutional Neural Networks (CNNs) have proven effective in learning to classify large sets of categories when given very large numbers of training examples [23, 9]. One of the advantages of deep CNNs in object recognition is their ability to learn useful features in an end-to-end manner by mapping raw data, such as RGB pixels, to class labels.
In contrast, auditory object recognition is typically implemented based on engineered features [4, 24]. One of the most powerful types of engineered representation for speech recognition tasks is based on the mel-frequency cepstrum [16], which is essentially the discrete cosine transform of the log mel-scaled windowed spectra. Researchers have used such engineered features as inputs to CNNs for audio classification tasks, such as Automatic Speech Recognition (ASR) [20] and music analysis [28]. In these cases, CNNs are typically applied to two-dimensional feature maps created by arranging the log-mel cepstral features of each frame along the time axis. This feature map creates locality in both the time and frequency domains [1], which means that the machine learning problem can be framed as an image classification problem.
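As a rough illustration of the cepstral step mentioned above, the sketch below applies an unnormalized DCT-II to the log filterbank energies of a single frame. The function name `dct_ii` and the toy energies are illustrative assumptions, not taken from the cited systems; a real front end would also include windowing, a mel filterbank, and normalization.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_total = len(values)
    return [
        sum(v * math.cos(math.pi / n_total * (n + 0.5) * k)
            for n, v in enumerate(values))
        for k in range(n_total)
    ]

# Hypothetical filterbank energies for one frame (made-up numbers).
energies = [1.0, 2.0, 4.0, 8.0]
log_energies = [math.log(e) for e in energies]
cepstrum = dct_ii(log_energies)
```

Stacking one such coefficient vector per frame along the time axis gives the two-dimensional feature map that the CNNs above treat as an image.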

However, cepstral features were designed specifically for speech recognition and may not be optimal for other types of audio classification tasks. More generally, pre-engineered features are tailored to the problem they were designed to solve, which means they may not transfer readily to other problem domains. Another potential problem with engineered features is that they must be computed as inputs to the classification system, such as a deep convolutional network. On-line computation of spectral or cepstral features can be costly in terms of time and power, especially for edge computing applications that do not have access to cloud computing servers.

Inception Nucleus Nets Configurations
Inception | Inception-FA | Inception-FI | Inception-BN
289 K | 789 K | 479 K | 292 K
Input ()
Conv1D,32,80,4 | Inception Nucleus: | Conv1D,32,80,4 with BN
Inception Nucleus: | Inception Nucleus: | Inception Nucleus: | Inception Nucleus:
Conv1D,[64,4,4] | Conv1D,[64,20,4] | Conv1D,[64,4,4] | Conv1D,[64,4,4]-BN
Conv1D,[64,8,4] | Conv1D,[64,40,4] | Conv1D,[64,8,4] | Conv1D,[64,8,4]-BN
Conv1D,[64,16,4] | Conv1D,[64,60,4] | Conv1D,[64,16,4] | Conv1D,[64,16,4]-BN
Max Pooling 1D, 64,10,1
Reshape (put the channels first)
Conv2D,32,,1 | Conv2D,32,,-BN
Max Pooling 2D,32,,2
Conv2D,64,,1 | Conv2D,64,,1-BN
Conv2D,64,,1 | Conv2D,64,,1-BN
Max Pooling 2D,64,,2
Conv2D,128,,1 | Conv2D,128,,1-BN
Max Pooling 2D,128,,2
Conv2D,10,,1 | Conv2D,10,,1-BN
Global Average Pooling
Table 1: Our proposed deep neural network architectures. Each column belongs to a network. The third row indicates the number of parameters. The convolutional layer parameters are denoted as “conv (1D or 2D),(number of channels),(kernel size),(stride).” Layers with batch normalization are denoted with BN.

More recently, researchers have developed deep learning networks that take raw waveforms as input, rather than using pre-engineered features. This approach is known as end-to-end audio classification. For instance, Dai et al. proposed five CNNs with different architectures and varying numbers of parameters, and achieved impressive accuracy on the UrbanSound8k dataset [6]. Tokozume and Harada proposed EnvNet, an 8-layer neural network that takes the raw waveform as input but requires careful selection of hyperparameters to choose appropriately sized kernels [25]. AclNet [13] is another end-to-end CNN architecture, inspired by MobileNet [10] because of its computational efficiency. AclNet achieved human-level accuracy on the ESC50 dataset with only 155k parameters and 49.3 million multiply-adds per second [13]. Finally, Ravanelli and Bengio proposed a speaker recognition network based on raw waveforms [19].

Relation to prior work. We present a deep CNN that learns to classify broad categories of sounds directly from raw audio waveforms. In comparison to previous end-to-end audio classification efforts [6, 25, 13], we make use of a novel combination of 1D and 2D convolutional layers and, most importantly, “inception nucleus” layers. The inception nucleus approach, described in Section 2, reduces sensitivity to prespecified filter sizes by depending on adaptation during learning. In comparison to prior work, the proposed method also greatly reduces the number of parameters while outperforming current state-of-the-art networks on the UrbanSound8k dataset by 10.4 percentage points. Thus, our CNN is a strong candidate for low-power always-on sound classification applications. In addition, we analyze the learned representations, using visualizations to reveal wavelet-like transforms in early layers that support deeper representations which remain discriminative and meaningful even with reduced dimensionality.

2 Proposed Method

Figure 1: Inception nucleus. The input comes from the previous layer and is passed to the 1D convolutional layers with kernel sizes of 4, 8, and 16 to capture a variety of features. The convolutional layer parameters are denoted as “conv1D,(number of channels),(kernel size),(stride).” All of the receptive fields are concatenated channel-wise in the concatenation layer.

Our proposed end-to-end neural network takes time-domain waveform data — not engineered representations — and processes it through several 1D convolutions, the inception nucleus, and 2D convolutions to map the input to the desired outputs. The details of the proposed architectures are described in Table 1. The overall design can be summarized as follows:

Fully Convolutional Network. We propose an inception nucleus convolution layer that contains a series of 1D convolutional layers followed by nonlinearities (i.e., ReLU layers) to reduce the sensitivity of the architecture to kernel size. Convolutional networks are well suited to audio signals for two reasons. First, as with images, we want the network to be translation invariant, and the weight sharing that provides this invariance also keeps the number of parameters small. Second, convolutional networks allow us to stack layers, which gives us the opportunity to detect higher-level concepts through a series of lower-level detectors. We used Global Average Pooling (GAP) in our architectures to aggregate the spatial information in the last convolutional layer and map this information onto class labels. GAP greatly reduces the number of parameters, making the network relatively light to implement.

Variable Length Input/Output. Since sound can vary in temporal length, we want our network to handle variable-length inputs. To do this, we use a fully convolutional network. Because convolutional layers apply the same filters at every location, each layer can be convolved over whatever length of input it receives.
The input layer to our network is a 1D array representing the audio waveform, denoted as $X$; its length is the clip duration (about 4 seconds) multiplied by the sampling rate. The network is designed to learn a set of parameters, $\Theta = \{\theta_1, \dots, \theta_k\}$, to map the input to the prediction, $\hat{Y}$, based on nested mapping functions, given by Eq 1.

$$\hat{Y} = F(X \mid \Theta) = f_k(\dots f_2(f_1(X \mid \theta_1) \mid \theta_2) \dots \mid \theta_k) \qquad (1)$$

where $k$ is the number of hidden layers and $f_i$ is a typical convolution layer followed by a pooling operation.
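To make the variable-length argument concrete, the following sketch computes the output length of a strided 1D convolution for clips of different durations. It assumes no padding ("valid" mode), which the text does not specify, and the sample rate is a hypothetical value chosen only for illustration.

```python
def conv1d_output_length(input_length, kernel_size, stride):
    """Length after a 1D convolution without padding ('valid' mode)."""
    if input_length < kernel_size:
        raise ValueError("input is shorter than the kernel")
    return (input_length - kernel_size) // stride + 1

# First layer of Table 1: kernel size 80, stride 4.
# ASSUMED_SAMPLE_RATE is a hypothetical value, not taken from the paper.
ASSUMED_SAMPLE_RATE = 8000
lengths = {
    seconds: conv1d_output_length(seconds * ASSUMED_SAMPLE_RATE, 80, 4)
    for seconds in (1, 2, 4)
}
```

The same filter weights serve every clip length; only the number of output time steps changes, which is why a pooling step that averages over the remaining positions can still yield a fixed-size prediction.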

Inception Nucleus Layer. We propose the use of an inception nucleus to produce a more robust architecture for sound classification. This approach also makes the architecture less sensitive to idiosyncratic variance in audio files. A schematic representation of the inception nucleus appears in Fig 1. The inputs to the inception nucleus are the feature maps of the previous layer. Three 1D convolutions with different kernel sizes are then applied to the inputs to capture a variety of features. We test the kernel sizes listed in Table 1 in our experiments, for example 4, 8, and 16, or 20, 40, and 60. (See Section 3.) After obtaining the feature maps from these convolutional layers, we concatenate them in a channel-wise manner.
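The branch-and-concatenate structure of the inception nucleus can be sketched in plain Python as follows. This is a minimal illustration, not the authors' Keras implementation: the zero padding that keeps all branches the same length, the random weights standing in for learned filters, and the small filter count are all assumptions made for the example.

```python
import random

def conv1d_same(signal, kernels, stride):
    """Strided 1D convolution (cross-correlation) followed by ReLU.

    signal:  list of input channels, each a list of floats of equal length
    kernels: list of filters, each a list (one per input channel) of weights
    Zero padding keeps the output length at ceil(length / stride),
    independent of the kernel size (an assumption of this sketch).
    """
    length = len(signal[0])
    kernel_size = len(kernels[0][0])
    out_length = -(-length // stride)  # ceiling division
    pad_total = max((out_length - 1) * stride + kernel_size - length, 0)
    pad_left = pad_total // 2
    padded = [[0.0] * pad_left + list(ch) + [0.0] * (pad_total - pad_left)
              for ch in signal]
    outputs = []
    for kernel in kernels:
        row = []
        for t in range(out_length):
            start = t * stride
            acc = sum(w * padded[c][start + i]
                      for c, weights in enumerate(kernel)
                      for i, w in enumerate(weights))
            row.append(max(acc, 0.0))  # ReLU
        outputs.append(row)
    return outputs

def inception_nucleus(signal, kernel_sizes=(4, 8, 16), filters=2, stride=4, seed=0):
    """Run one branch per kernel size and concatenate the branches channel-wise."""
    rng = random.Random(seed)
    merged = []
    for k in kernel_sizes:
        # Random weights are placeholders for filters learned during training.
        kernels = [[[rng.uniform(-1, 1) for _ in range(k)] for _ in signal]
                   for _ in range(filters)]
        merged.extend(conv1d_same(signal, kernels, stride))
    return merged

features = inception_nucleus([[float(i % 5) for i in range(64)]])
```

Because every branch uses the same stride and produces the same number of time steps, the outputs can be stacked along the channel axis; here three branches of two filters each give six output channels.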

Reshape. After applying 1D convolutions on the waveforms and obtaining low-level features, we treat the resulting feature map as a grayscale image, that is, a two-dimensional array with a single channel whose width and height are the time and feature axes of the 1D output. For simplicity, we transpose the tensor so that the channels come first (Table 1). From here, we apply normal 2D convolutions with the VGG standard kernel size of $3 \times 3$ and stride = 1 [23]. The pooling layers use stride = 2. We also implemented the inception nucleus with batch normalization to analyze the effect of batch normalization on our approach, as explained in Section 3.

Global Average Pooling (GAP). In the last convolutional layer we compute GAP to aggregate the most abstract features over the spatial dimensions and reduce the number of outputs to class labels. We use GAP instead of max pooling to reduce the number of parameters and avoid adding fully connected layers at the end of the network. It has been noted in the computer vision literature that aggregating features across spatial locations and channels keeps important information while reducing the number of parameters [7, 29]. We intentionally did not use fully connected layers with a softmax activation function to avoid overfitting, since fully connected layers greatly increase the number of parameters. GAP was implemented as follows:

$$\hat{y}_c = \frac{1}{W H} \sum_{i=1}^{W} \sum_{j=1}^{H} A_{i,j,c}, \qquad c = 1, \dots, C$$

where $W$, $H$, and $C$ are the width, height, and channel of the last feature map ($A$).
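A minimal sketch of global average pooling is given below. The channels-first nested-list layout and the function name are assumptions made for illustration; the point is only that each channel collapses to a single score regardless of its spatial size.

```python
def global_average_pooling(feature_map):
    """Average each channel over both spatial axes.

    feature_map is indexed as [channel][row][column]; the result has one
    value per channel, whatever the spatial size.
    """
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled

# Two channels with the same 2x3 spatial size (toy values).
scores = global_average_pooling([
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 6.0]],
])
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

With one channel per class in the final convolutional layer, the pooled vector can be read directly as class scores, so no fully connected layer (and none of its weights) is needed.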

3 Experimental Results

We tested our network on the UrbanSound8k dataset, which contains 10 kinds of environmental sounds in urban areas, such as drilling, car horn, and children playing [21]. The dataset consists of 8732 audio clips of 4 seconds or less. We zero-padded the samples that were shorter than 4 seconds. To speed computation, the audio waveforms were down-sampled and standardized to zero mean and unit variance. We shuffled the training data to enhance variability in the training set.
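The preprocessing described above can be sketched as follows. The order of operations (pad first, then standardize) and the per-clip statistics are assumptions of this sketch, since the text does not state them, and the function name is hypothetical.

```python
import math

def pad_and_standardize(waveform, target_length):
    """Zero-pad (or truncate) to target_length, then scale to zero mean, unit variance."""
    clipped = list(waveform[:target_length])
    padded = clipped + [0.0] * (target_length - len(clipped))
    mean = sum(padded) / target_length
    variance = sum((x - mean) ** 2 for x in padded) / target_length
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0] * target_length  # silent clip: nothing to scale
    return [(x - mean) / std for x in padded]

# A three-sample toy clip padded to eight samples.
clip = pad_and_standardize([0.5, -0.5, 1.0], target_length=8)
```

In practice `target_length` would be the clip duration times the sampling rate, so that every example fed to the network in a batch has the same shape.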
We trained the CNN models using the Adam [14] optimizer, a variant of stochastic gradient descent that adaptively tunes the step size for each dimension. We used Glorot weight initialization [8] and trained each model in mini-batches until convergence.
To avoid overfitting, all weight parameters were penalized by their norm. Our models were implemented in Keras [5] and trained using a GeForce GTX 1080 Ti GPU.
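For reference, Glorot (uniform) initialization draws each weight from a range set by the layer's fan-in and fan-out. The sketch below applies it to a 1D convolution kernel; the helper names are hypothetical, the single input channel is an assumption (a mono waveform), and the fan computation follows the common convention of multiplying the kernel size by the channel counts.

```python
import math
import random

def glorot_limit(fan_in, fan_out):
    """Bound of the uniform range U(-limit, limit) used by Glorot initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def init_conv1d_kernel(kernel_size, in_channels, out_channels, rng):
    """Glorot-uniform weights indexed as [tap][in_channel][out_channel]."""
    limit = glorot_limit(kernel_size * in_channels, kernel_size * out_channels)
    return [[[rng.uniform(-limit, limit) for _ in range(out_channels)]
             for _ in range(in_channels)]
            for _ in range(kernel_size)]

# First layer of Table 1: 32 filters of size 80 on an assumed mono input.
kernel = init_conv1d_kernel(80, 1, 32, random.Random(0))
```

Scaling the range by the fans keeps the variance of activations roughly constant from layer to layer at the start of training.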

Model | Test accuracy (%) | Parameters
M3-fc [6] | 46.82 | 129M
M5-fc [6] | 62.76 | 18M
M11-fc [6] | 68.29 | 1.8M
M18-fc [6] | 64.93 | 8.7M
M3-Big [6] | 57.55 | 0.5M
RCNN [22] | 71.68 | 3.7M
ACLNet [13] | 65.32 | 2M
EnvNet-v2 [26] | 78 | 101M
PiczakCNN [17] | 73 | 26M
VGG [18] | 70 | 77M
Inception Nucleus-BN (Ours) | 83.2 | 292K
Inception Nucleus-FA (Ours) | 70.9 | 789K
Inception Nucleus-FI (Ours) | 75.3 | 479K
Inception Nucleus (Ours) | 88.4 | 289K
Table 2: Accuracy of different approaches on the UrbanSound8k dataset. The first column gives the name of the method, the second column the accuracy of the model on the test set, and the third column the number of parameters. Our proposed method has the fewest parameters and achieves the highest test accuracy.

Table 2 provides classification performance on the testing set, along with the number of parameters of each model, for the UrbanSound8k dataset. The table shows that our CNN outperformed other methods in terms of test classification accuracy while using the fewest parameters. Preliminary simulations revealed that fully connected layers at the end of the network caused overfitting due to an explosion in the number of weight parameters. These preliminary results led us to use a fully convolutional network with a reduced number of parameters.
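The headline margin can be checked directly from the accuracies in Table 2. The snippet below only restates the table's numbers; the dictionary layout and variable names are illustrative.

```python
# Test accuracy (%) copied from Table 2.
prior_work = {
    "M3-fc": 46.82, "M5-fc": 62.76, "M11-fc": 68.29, "M18-fc": 64.93,
    "M3-Big": 57.55, "RCNN": 71.68, "ACLNet": 65.32, "EnvNet-v2": 78.0,
    "PiczakCNN": 73.0, "VGG": 70.0,
}
ours = {
    "Inception Nucleus": 88.4, "Inception Nucleus-BN": 83.2,
    "Inception Nucleus-FI": 75.3, "Inception Nucleus-FA": 70.9,
}

best_prior = max(prior_work, key=prior_work.get)
best_ours = max(ours, key=ours.get)
# Difference in percentage points, rounded to one decimal place.
margin = round(ours[best_ours] - prior_work[best_prior], 1)
```

The gap between the best proposed model and the strongest prior entry (EnvNet-v2) comes out to 10.4 percentage points, matching the figure quoted in the abstract and in Section 1.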

Figure 2: Illustrating 3 filters in the first convolutional layer. The visualization indicates that learned representations in the early layers implemented wavelet-like audio filters.
Figure 3: Illustration of the top two components of the t-SNE of the last convolutional layer.

We note that the deeper networks (M5, M11, and M18) can improve performance if their architectures are well designed. However, our inception nucleus model reaches 88.4% accuracy (Table 2), which outperforms the reported test accuracy of CNNs on spectrogram input using the same dataset by a large margin [17]. Also, inception nucleus-FI achieves very good results in terms of both accuracy and number of parameters. This result suggests that if we let the network learn useful features for the desired task in the convolutional layers, recognition performance and generalization are improved over pre-engineered features.

Kernel Analysis. We also analyzed the learned kernels of our Inception Nucleus model in the very first layer of our neural network. Interestingly, the network learns wavelet transforms at the first convolutional layer, as has been found by other researchers [2, 27]. Some of those filters are illustrated in Fig 2.

Representation Analysis. To better understand the learned representations in the Inception Nucleus model, we extracted features from the last convolutional layer (before applying GAP) and applied t-SNE to reduce the dimensionality to two [15]. The results, shown in Fig. 3, suggest that the network learned meaningful and discriminative features, as the different classes are fairly well distinguished from each other.

Depth Analysis. We found that deeper networks with larger numbers of parameters were less generalizable, as indicated by poorer performance on the test set. For example, M18 has 8.7M parameters (see Table 2) but only achieves 64.93% accuracy, compared with our inception nucleus network, which achieves 88.4% with only 289K parameters. This finding runs counter to results from the image recognition literature, in which deeper networks tend to perform better than shallower ones [9, 12, 11]. The observed detriment of additional hidden layers may be attributable to the limited number of training examples, which can be tested in future studies with larger datasets.

Kernel Size Analysis. Dai et al. [6] found that smaller kernel sizes are insufficient to capture the necessary bandpass filter characteristics in the earlier convolutional layers. Our results indicate that, with the Inception Nucleus-FI, large kernel sizes are more effective in the first convolutional layer. By contrast, large kernel sizes in the second layer reduce performance substantially (e.g., using the Inception Nucleus-FA with large kernels in the second layer decreased performance by 13 points). We conclude that a larger inception nucleus is more suitable for the first layer, with smaller kernels in later convolutional layers.

Batch Normalization. We tested whether batch normalization (BN) improves performance in our CNN, as it can for very deep neural networks. Without BN, our inception nucleus achieves 88.4% accuracy, while with BN it achieves 83.2% (Table 2). This decrease in accuracy with BN may have been observed because our CNNs did not have enough layers to show the advantage of BN.

4 Conclusion

In this study, we developed, optimized, and tested CNNs up to 13 layers deep that used an inception nucleus to overcome problems with choosing kernel sizes. The CNNs were trained to perform end-to-end sound classification, and they were benchmarked on the UrbanSound8k dataset. Compared with competitors, our networks showed better performance with fewer parameters, reaching up to 88.4% accuracy using only 289K parameters (Table 2). The ability to perform end-to-end computations effectively using so few parameters may be useful for edge computing applications, especially with optimized hardware, such as neuromorphic implementations of deep networks [3]. Our results indicate that end-to-end computation does not detract from performance by forgoing cepstral or spectral features. To the contrary, our networks outperformed competitors that used log-mel spectrogram inputs [17]. Visualizations of kernels learned in the earliest layer revealed wavelet-like transforms that build up to more abstract and discriminating learned features in deeper layers. In summary, we have demonstrated effective end-to-end sound classification with an efficient deep learning network.


This research was supported in part by a gift from Accenture Labs (LLP) to Cognitive and Information Sciences at the University of California, Merced (PI Kello).


  • [1] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu (2014) Convolutional neural networks for speech recognition. ASLP.
  • [2] Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. In NIPS.
  • [3] P. Blouw, X. Choo, E. Hunsberger, and C. Eliasmith (2019) Benchmarking keyword spotting efficiency on neuromorphic hardware. In NCEW.
  • [4] S. Chachada and C. J. Kuo (2014) Environmental sound recognition: a survey. APSIPA.
  • [5] F. Chollet et al. (2015) Keras. Note: https://keras.io
  • [6] W. Dai, C. Dai, S. Qu, J. Li, and S. Das (2017) Very deep convolutional neural networks for raw waveforms. In ICASSP.
  • [7] M. K. Ebrahimpour, J. Li, Y. Yu, J. Reesee, A. Moghtaderi, M. Yang, and D. C. Noelle (2019) Ventral-dorsal neural networks: object detection via selective attention. In WACV.
  • [8] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  • [11] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR.
  • [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In PCVPR.
  • [13] J. J. Huang and J. J. A. Leanos (2018) AclNet: efficient end-to-end audio classification cnn. arXiv preprint arXiv:1811.06669.
  • [14] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [15] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. JMLR.
  • [16] K. J. Piczak (2015) Environmental sound classification with convolutional neural networks. In MLSP.
  • [17] K. J. Piczak (2015) Environmental sound classification with convolutional neural networks. In MLSP.
  • [18] J. Pons and X. Serra (2019) Randomly weighted cnns for (music) audio classification. In ICASSP.
  • [19] M. Ravanelli and Y. Bengio (2018) Speaker recognition from raw waveform with sincnet. In SLT.
  • [20] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak (2015) Convolutional, long short-term memory, fully connected deep neural networks. In ICASSP.
  • [21] J. Salamon, C. Jacoby, and J. P. Bello (2014) A dataset and taxonomy for urban sound research. In ICM.
  • [22] J. Sang, S. Park, and J. Lee (2018) Convolutional recurrent neural networks for urban sound classification using raw waveforms. In EUSIPCO.
  • [23] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [24] D. Stowell and M. D. Plumbley (2014) Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning. PeerJ.
  • [25] Y. Tokozume and T. Harada (2017) Learning environmental sounds with end-to-end convolutional neural network. In ICASSP.
  • [26] Y. Tokozume and T. Harada (2017) Learning environmental sounds with end-to-end convolutional neural network. In ICASSP.
  • [27] Z. Tüske, P. Golik, R. Schlüter, and H. Ney (2014) Acoustic modeling with deep neural networks using raw time signal for lvcsr. In ACISCA.
  • [28] A. Van den Oord, S. Dieleman, and B. Schrauwen (2013) Deep content-based music recommendation. In NIPS.
  • [29] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR.