Sound recognition (SR) is the art and science of having a machine identify sounds (Jurafsky and Martin, 2014). These sounds go well beyond speech and music: examples include barking dogs, breaking glass and crying babies. Sound recognition is a key strategic technology that will be embedded in most connected devices offering AI capabilities. For example, most of us have come across smartphones and other devices with voice assistants such as Apple Siri, Amazon Alexa or Google Assistant. These applications are becoming dominant and, in a way, are invading human interactions. In addition, voice-controlled systems have been used successfully in medicine and in education for blind and/or handicapped people (Okada et al., 1998; Mohamed et al., 2014). In the near future we will see rapid growth in the number of devices with embedded speech recognition that assist us in our everyday lives. This requires developing and optimizing sound recognition algorithms that are fast enough to work in real time and that support many different embedded platforms (Solovyev et al., 2018).
Classical approaches to sound recognition are based on understanding the components of human speech. A phoneme is a contrastive unit in the sound system of a language: it distinguishes the meanings of words and corresponds to a set of similar sounds that may be pronounced in one or more ways. For example, the word “speech” has four phonemes: S P I CH (Lee and Hon, 1989; Gruhn et al., 2011). To find phonemes, speech signals are treated as slowly time-varying, so that their characteristics are stationary over a short period of time. In the feature extraction step, acoustic observations are extracted in frames of typically 25 ms. For the acoustic samples in each frame, a multi-dimensional vector is calculated and a fast Fourier transform is applied to that vector (Lee and Hon, 1989; Shrawankar and Thakare, 2013) to transform a function of time, in this case the signal, into its frequency components. In the decoding step, calculations are made to find which sequence of words is the most probable match to the feature vectors. For this step, three things have to be present: an acoustic model with a hidden Markov model (HMM) for each unit (phoneme or word), a dictionary containing possible words and their phoneme sequences, and a language model with the likelihoods of words or word sequences (Jurafsky and Martin, 2014). Neural networks as acoustic models for HMM-based speech recognition were originally introduced over 20 years ago (Bourlard and Morgan, 2012; McClelland and Elman, 1986). Much of this original work developed the basic ideas of the hybrid DNN-HMM systems that are used in modern, state-of-the-art automatic SR systems (Mohamed et al., 2012). However, until recently, neural networks were not a standard tool in real-time automatic SR systems. Computational constraints and the amount of available training data severely limited the pace at which progress could be made.
Recently, deep learning-based approaches have demonstrated performance improvements over conventional machine learning methods in many different applications (LeCun et al., 2015). Neural networks with memory capabilities have been reported to make speech recognition up to 99 percent accurate (Hinton et al., 2012; LeCun et al., 2015; Goodfellow et al., 2016). Neural networks such as LSTMs have taken over the field of natural language processing (Gers et al., 1999; Graves and Schmidhuber, 2005; Greff et al., 2017). A person's speech can also be understood and processed into text by models that retain the last words of the sentence as context (Van Den Oord et al., 2016; Abdel-Hamid et al., 2014). Moreover, these scientific results now have state-of-the-art applications across the whole pipeline, from sound recognition to machine translation (Wu et al., 2016; Amodei et al., 2016; Bahdanau et al., 2014).
In this paper, we consider several deep learning-based approaches to the problem of sound classification that we applied in the TensorFlow Speech Recognition Challenge organized by the Google Brain team on the Kaggle platform (https://www.kaggle.com/c/tensorflow-speech-recognition-challenge). We review 1D convolutional neural networks that use raw sound files as input, and image-based convolutional neural networks that use 2D representations of sound: spectrograms, mel-spectrograms (melgrams) and MFCC. We also describe the details of our solution, which allowed us to reach good classification accuracy and finish the challenge in 8th place among 1315 teams.
2.1 Data set description and pre-processing
The training data consists of 60,000 audio files spread across 32 directories, with the folder name being the label of the audio clip (Warden, 2018). There are fewer labels to predict than folders (12 vs 32 in the train set). The labels to be identified are yes, no, up, down, left, right, on, off, stop and go. Everything else should be assigned to either the silence or the unknown class. The distribution of classes in the train data set is shown in Fig. 1. Each audio file in the data set is a 1-second clip of a voice command, converted into a 16-bit little-endian PCM-encoded WAVE file at a 16,000 Hz sample rate. The data set is not completely cleaned up. For example, several files in the train data set are not exactly 1 second long. Moreover, there are no silence files as such. Instead, we are provided with longer recordings of background noise that can be split into 1-second fragments. In addition, one can mix background noise with word files to obtain additional augmentations of the sounds during training. The test data set contains an audio folder with 150,000+ files in the format . The task is to predict the correct label. Note that not all of the files are evaluated for the leaderboard score.
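The two preparation steps mentioned above (cutting long background recordings into 1-second silence clips and bringing clips of irregular length to exactly 1 second) can be sketched as follows. This is a minimal illustration with NumPy on synthetic data, not the code used for the challenge; the helper names `split_into_clips` and `fix_length` are hypothetical, and truncating from the end of an over-long clip is an assumption.

```python
import numpy as np

SAMPLE_RATE = 16000  # samples per second, as stated for the data set


def split_into_clips(wave, clip_len=SAMPLE_RATE):
    """Cut a long recording into non-overlapping clips of clip_len samples."""
    n_clips = len(wave) // clip_len  # the incomplete tail is dropped
    return wave[: n_clips * clip_len].reshape(n_clips, clip_len)


def fix_length(wave, target_len=SAMPLE_RATE):
    """Zero-pad at the start, or truncate (assumption), to target_len samples."""
    if len(wave) >= target_len:
        return wave[:target_len]
    pad = np.zeros(target_len - len(wave), dtype=wave.dtype)
    return np.concatenate([pad, wave])


# Synthetic stand-in for a 3.5-second background-noise recording.
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.01, size=int(3.5 * SAMPLE_RATE)).astype(np.float32)

silence_clips = split_into_clips(noise)   # three 1-second "silence" examples
short_clip = fix_length(noise[:12000])    # a 0.75-second clip padded to 1 second
```

With real data, the synthetic `noise` array would be replaced by samples read from a WAVE file.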
In the train data set, audio files are named so that the first element is the subject id of the person who gave the voice command and the last element indicates repeated commands. Repeated commands occur when the subject repeats the same word multiple times. The subject id is not provided for the test data, and one can assume that the majority of commands in the test data come from subjects not seen in the train set (Warden, 2018). The files in the training audio are not uniquely named across labels, but they are unique if the label folder is included. For example, is found in 14 folders, but that file is a different speech command in each folder.
One important step before training the models is cleaning up the data. Some files have very low sound volume. Some of these are corrupt and contain only noise, while others are essentially background noise without any spoken word. To remove or correctly re-label these files, it helps to sort all files by the dynamic range of their output volume and then check whether there is a minimum sound level below which all files can be classified as silence. For better results, one can manually double-check the candidates by listening to them and looking at their spectrograms.
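A minimal sketch of this ranking step is given below. It assumes that "dynamic range" is measured as the peak-to-peak amplitude of the waveform; the threshold value, the file names and the helper names are hypothetical and only serve the illustration.

```python
import numpy as np


def dynamic_range(wave):
    """Peak-to-peak amplitude of a waveform (assumed measure of dynamic range)."""
    return float(np.max(wave) - np.min(wave))


def silence_candidates(clips, threshold):
    """Return (name, range) pairs below the threshold, quietest first."""
    ranked = sorted((dynamic_range(w), name) for name, w in clips.items())
    return [(name, r) for r, name in ranked if r < threshold]


# Synthetic clips standing in for files from the train set.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clips = {
    "spoken.wav": 0.5 * np.sin(2 * np.pi * 220 * t),    # normal volume
    "quiet.wav": 0.002 * np.sin(2 * np.pi * 220 * t),   # barely audible
    "empty.wav": np.zeros(16000),                       # digital silence
}

# The threshold is a hypothetical value; in practice it is chosen by inspection.
candidates = silence_candidates(clips, threshold=0.01)
```

The returned candidates are the files worth listening to and inspecting manually before they are re-labelled as silence.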
2.2 Sound representation
Although the deep learning approach eliminates the need for hand-engineered features, we still need to choose a representation for our data. The sound signal is a one-dimensional time-domain signal, and in this form it is difficult to see how its frequency content changes. If we convert the sound signal to the frequency domain via the Fourier transform (FT), we obtain the frequency distribution of the signal. At the same time, however, the time-domain information is lost, which makes it impossible to see how the frequency distribution changes over time. Many joint time-frequency analysis methods have emerged to solve this problem.
One of the classic methods for joint time-frequency analysis is the short-time Fourier transform (STFT). The STFT is a mathematical transformation related to the FT that determines the frequency and phase of the sinusoidal components in a local region of a time-varying signal. The idea of the STFT is to first choose a window function with time-frequency localization, and then to assume that the signal is stationary over the short duration of the analysis window, so that the windowed signal is stationary within each finite time width. The STFT uses fixed window functions; the most commonly used include the Hanning window, the Hamming window, and the Blackman-Harris window (Jurafsky and Martin, 2014). The Hamming window, a generalized cosine window, is used in this article. It is usually represented as

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$$

where $N$ is the window length. This function is a member of the cosine-sum family. The Hamming window can efficiently reflect the attenuation relationship between energy and time at a certain moment. Fig. 2 shows the log-spectrogram, i.e. the logarithm of the STFT values, computed with a window size of ms and a step of ms. In addition, we convert the power spectrogram (amplitude squared) to decibel (dB) units. This computes the scaling $10 \log_{10}(S / \mathrm{ref})$ in a numerically stable way (zeros in the output correspond to positions where $S = \mathrm{ref}$).
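The STFT-based log-spectrogram described above can be sketched with plain NumPy as follows. This is an illustrative sketch rather than the scipy/librosa code used in this work: the 20 ms window and 10 ms step are assumed placeholder values, the function name `log_spectrogram` is hypothetical, and the dB scaling is taken relative to the maximum power.

```python
import numpy as np


def log_spectrogram(wave, sample_rate=16000, window_ms=20.0, step_ms=10.0,
                    eps=1e-10):
    """Power spectrogram in dB from Hamming-windowed frames.

    window_ms and step_ms are assumed values chosen for illustration.
    """
    win = int(round(window_ms * sample_rate / 1000))
    hop = int(round(step_ms * sample_rate / 1000))
    window = np.hamming(win)  # 0.54 - 0.46 * cos(2*pi*n / (N - 1))

    # Slice the signal into overlapping frames of `win` samples.
    n_frames = 1 + (len(wave) - win) // hop
    frames = np.stack([wave[i * hop: i * hop + win] for i in range(n_frames)])

    # FFT of each windowed frame, then power (amplitude squared).
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2

    # dB relative to the maximum power; eps avoids log(0).
    ref = max(float(power.max()), eps)
    return 10.0 * np.log10(np.maximum(power, eps) / ref)


# A 1-second, 1 kHz test tone sampled at 16 kHz.
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 1000 * t)
spec = log_spectrogram(tone)  # rows are time frames, columns are frequency bins
```

With these settings each frequency bin is 50 Hz wide, so the energy of the test tone concentrates in bin 20, and the largest value in the output is 0 dB because the reference is the maximum power.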
Spectrograms usually take the form of a large map. To bring the sound features to a suitable size, they are often transformed into a Mel spectrum via a Mel-scale filter bank. The unit of frequency is the hertz (Hz), and the frequency range of human hearing is roughly 20 Hz to 20 kHz. Human auditory perception does not relate to scale units such as Hz in a linear manner. For example, if we have adapted to a sound of a given frequency and that frequency is then doubled, our ears perceive only that the frequency has increased slightly, and we do not realize that it has doubled. The mapping that converts an ordinary frequency scale to the Mel-frequency scale is as follows (Deng and O’Shaughnessy, 2003):

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right),$$

where $f$ is the frequency in Hz and $m$ is the corresponding Mel frequency.
This is a logarithmic relationship between Hz and Mel frequency. At low frequencies the Mel frequency changes rapidly with Hz, whereas at high frequencies it changes slowly. This reflects the fact that human ears are sensitive to changes in low-frequency sounds and less responsive to changes in high-frequency sounds.
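The behaviour of the Mel scale can be checked numerically. The sketch below assumes the commonly used form of the mapping, $m = 2595 \log_{10}(1 + f/700)$, and compares how many Mel units a 100 Hz step covers at low and at high frequencies; the function names are chosen for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to Mel (assumed form: 2595 * log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# The same 100 Hz step covers many more Mel units at low frequencies.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(7100.0) - hz_to_mel(7000.0)
```

Under this form of the mapping, 1000 Hz corresponds to approximately 1000 Mel, and `low_step` is several times larger than `high_step`.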
Another well-known speech feature extraction method is based on Mel-frequency cepstral coefficients (MFCC). It is one of the most popular feature extraction techniques in speech recognition and works in the frequency domain using the Mel scale. MFCC is a representation of the real cepstrum of a windowed short-time signal derived from the fast FT of that signal. The difference from the real cepstrum is that a nonlinear frequency scale is used, which approximates the behaviour of the auditory system. Additionally, these coefficients are robust to variations in speakers and recording conditions (Dave, 2013). MFCCs use a Mel-scale filter bank in which the higher-frequency filters have greater bandwidth than the lower-frequency filters, while their temporal resolutions are the same (Deng and O’Shaughnessy, 2003). The last step is to calculate the discrete cosine transform (DCT) of the outputs of the filter bank (see Fig. 2).
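For illustration only, the MFCC pipeline described above (Mel filter bank, logarithm, DCT) can be sketched with NumPy as shown below. The number of filters (40), the number of coefficients (13), the unnormalised DCT-II and all function names are assumptions made for this sketch, which runs on random data.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                             n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, centre, right = hz_points[i: i + 3]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def dct_ii(x, n_coeffs):
    """Unnormalised DCT-II along the last axis, keeping n_coeffs terms."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T


def mfcc(power_spec, sample_rate=16000, n_mels=40, n_coeffs=13):
    """MFCC from a power spectrogram of shape (frames, n_fft // 2 + 1)."""
    n_fft = 2 * (power_spec.shape[-1] - 1)
    bank = mel_filter_bank(n_mels, n_fft, sample_rate)
    mel_energy = power_spec @ bank.T            # Mel power spectrogram
    return dct_ii(np.log(mel_energy + 1e-10), n_coeffs)


# Random stand-in for a power spectrogram with 99 frames and 161 bins.
rng = np.random.default_rng(0)
power_spec = rng.random((99, 161))
bank = mel_filter_bank(40, 320, 16000)
coeffs = mfcc(power_spec)
```

The intermediate `mel_energy` array is the Mel power spectrogram; the DCT then decorrelates the log filter-bank outputs, and the higher-frequency filters in `bank` span more FFT bins than the lower-frequency ones.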
In classical, but still state-of-the-art, systems, MFCC or similar features are used as the input instead of spectrograms. However, in end-to-end (often neural-network-based) systems, the most common input features are probably raw spectrograms or mel power spectrograms. One reason is that MFCC decorrelates the features, whereas neural networks deal well with correlated features. Therefore, in what follows we use only the raw wave signal as input for the 1D convolutional networks, and log-spectrograms and mel power spectrograms as input images for the 2D networks. For this work we use the scipy (https://scipy.org/) and librosa (https://librosa.github.io/) Python libraries. This is standard code for conversion into spectrograms, and the software is free, so it can be modified to suit one's needs.
2.3 1D convolutional neural networks
Convolutional neural networks (CNN) have demonstrated remarkable success in many visual recognition challenges (Shvets et al., 2018; Rakhlin et al., 2018; Iglovikov et al., 2017). Such 2D CNNs learn to recognize the local structure within an image. Inspired by these successes, our hypothesis for this section is that a 1D CNN will be able to recognize local structure in our time-dependent signals. We consider two types of 1D CNN adapted from 2D architectures, VGG (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016), with the raw signal as input (see Fig. 3). The first type is based on the VGG16 architecture, in which several convolutional layers alternate with Batchnorm and MaxPooling layers, followed at the end by two blocks of fully connected layers. There are several differences from the standard VGG16 architecture for 2D images. For example, in the first two layers we use a MaxPooling operation that reduces the size of the tensor by a factor of 4. This is done to reduce the dimensionality of the problem faster. In addition, we have more pooling layers than in the 2D case, because the input vector is long and we need to reduce it by the end of the network, before the classification block. As a result, the spatial size of the feature map before flattening is equal to 16. In contrast to VGG16, we use a single convolution block in the first two layers and then increase this to two consecutive blocks. For activation we use the ReLU function placed after the Batchnorm operations. As the kernel size we took the value 9, which is somewhat similar to 3x3 in the case of a two-dimensional convolution. As far as we understand, however, the network may well work with smaller values such as 3. Dropout with a rate of 0.5 is applied at the end of the network, within the fully connected layers, to prevent over-fitting. Similarly to VGG16, we implemented ResNet34/ResNet50 for the 1D case. As in the previous case, the convolution layers have a kernel size of 9 for all layers apart from the first one, where the kernel size is equal to 80. In addition, all pooling layers have a factor equal to 4. The details of the networks are provided in Fig. 3.
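As a worked check of the size arithmetic for the VGG-like 1D network, the sketch below tracks the length of the feature map through the pooling layers. It assumes an input of 4x4096 = 16384 samples (the 1D input length used in this work), convolutions that preserve the length, two initial poolings by a factor of 4 as described above, and a hypothetical schedule of six further poolings by a factor of 2; the exact schedule is given by Fig. 3 and is not reproduced here.

```python
def feature_map_length(input_len, pool_factors):
    """Length of a 1D feature map after successive non-overlapping poolings.

    Convolutions are assumed to be padded so that they keep the length.
    """
    length = input_len
    for factor in pool_factors:
        length //= factor
    return length


INPUT_LEN = 4 * 4096                      # 16384 samples
# Hypothetical pooling schedule: two poolings by 4, then six poolings by 2.
ASSUMED_POOLS = [4, 4, 2, 2, 2, 2, 2, 2]

final_len = feature_map_length(INPUT_LEN, ASSUMED_POOLS)
```

Under these assumptions the total reduction is 4 * 4 * 2^6 = 1024, which brings 16384 samples down to a feature map of spatial size 16, consistent with the value stated above.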
2.4 2D convolutional neural networks
In the 2D case, we use log-spectrograms and mel power spectrograms as input to a standard set of convolutional neural networks available in PyTorch (https://pytorch.org/) and Keras (https://keras.io/). We use only a subset of the networks, namely VGG16, VGG19, ResNet50, InceptionV3, Xception and InceptionResnetV2 (also shown in Fig. 4). One very important point is that fine-tuning was forbidden by the competition rules. As a result, we train all the networks from randomly initialized weights. Moreover, we train all of the networks with the same set of hyper-parameters, apart from the input size, which differs between networks.
We train all models using 4-fold cross-validation. The data are split between folds by voice id, which means that the voice of one participant never falls into different folds. Since it was known that all 12 classes were distributed approximately evenly in the test set, we generated each batch with the same number of instances from each class. For simplicity, we chose the batch size to be a multiple of 12.
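The fold assignment and the balanced batch generation can be sketched as follows. This is a minimal illustration using only the standard library: assigning folds by a hash of the speaker id is one possible way to keep a voice within a single fold (an assumption, not necessarily the method used here), and the file names follow a hypothetical `<speaker>_<repeat>.wav` pattern with only three classes for brevity.

```python
import random
import zlib
from collections import defaultdict


def fold_of(speaker_id, n_folds=4):
    """Map a speaker id to a fold deterministically, so a voice never spans folds."""
    return zlib.crc32(speaker_id.encode("utf-8")) % n_folds


def balanced_batch(files_by_label, per_class, rng):
    """Draw the same number of files from every class (with replacement)."""
    batch = []
    for label, files in sorted(files_by_label.items()):
        batch.extend((rng.choice(files), label) for _ in range(per_class))
    rng.shuffle(batch)
    return batch


# Hypothetical (label, file name) pairs; the speaker id is the first element.
files = [("yes", "a1_0.wav"), ("yes", "a1_1.wav"), ("no", "b2_0.wav"),
         ("no", "c3_0.wav"), ("up", "a1_0.wav"), ("up", "c3_1.wav")]

folds = defaultdict(list)
files_by_label = defaultdict(list)
for label, name in files:
    speaker = name.split("_")[0]
    folds[fold_of(speaker)].append((label, name))
    files_by_label[label].append(name)

# Two instances per class; with 12 classes the batch size is a multiple of 12.
batch = balanced_batch(files_by_label, per_class=2, rng=random.Random(0))
```

Because the fold depends only on the speaker id, every recording of a given speaker lands in the same fold regardless of its label folder.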
In speech recognition, data augmentation helps models generalize and makes them robust against variations in speed, volume, pitch or background noise. In our case the augmentations consisted of the following operations: 1) a random change of playback speed in the range 0.7 to 1.4; 2) a time shift by a random value in the range -0.1 to +0.1 second; 3) the addition of random background noise at a level from 0 to 0.05 of the maximum volume of the current waveform. Further, if the resulting waveform was longer than the input size of the neural network, it was randomly cropped to the desired length. If it was shorter than the desired length, it was padded with zeros at the beginning. In the 1D case all audio files are vectors of length 16000. We first chose 3x4096 = 12288 as the input size for the networks and randomly selected such a segment from the 16000 samples as an augmentation, but then realized that this most likely has a negative effect on accuracy. As a result, the input to the neural networks became a vector of length 4x4096 = 16384.
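The augmentation chain described above can be sketched with NumPy as shown below. The parameter ranges are those given in the text; resampling by linear interpolation, the scaling of the noise relative to its own peak and all function names are assumptions made for this illustration.

```python
import numpy as np

SAMPLE_RATE = 16000
INPUT_LEN = 4 * 4096  # 16384 samples


def change_speed(wave, rate):
    """Resample by linear interpolation; rate > 1 plays faster (shorter clip)."""
    new_len = int(round(len(wave) / rate))
    positions = np.linspace(0, len(wave) - 1, new_len)
    return np.interp(positions, np.arange(len(wave)), wave)


def time_shift(wave, shift):
    """Shift by `shift` samples, filling the vacated part with zeros."""
    out = np.zeros_like(wave)
    if shift > 0:
        out[shift:] = wave[:-shift]
    elif shift < 0:
        out[:shift] = wave[-shift:]
    else:
        out[:] = wave
    return out


def add_noise(wave, noise, level):
    """Mix in noise whose peak is `level` times the peak of the waveform."""
    peak = np.max(np.abs(wave))
    noise = np.resize(noise, len(wave))  # repeat or cut noise to match length
    return wave + noise * (level * peak / (np.max(np.abs(noise)) + 1e-10))


def fit_length(wave, rng, target_len=INPUT_LEN):
    """Randomly crop a long clip, or zero-pad a short one at the beginning."""
    if len(wave) > target_len:
        start = int(rng.integers(0, len(wave) - target_len + 1))
        return wave[start: start + target_len]
    return np.concatenate([np.zeros(target_len - len(wave)), wave])


def augment(wave, noise, rng):
    wave = change_speed(wave, rng.uniform(0.7, 1.4))
    wave = time_shift(wave, int(rng.uniform(-0.1, 0.1) * SAMPLE_RATE))
    wave = add_noise(wave, noise, rng.uniform(0.0, 0.05))
    return fit_length(wave, rng)


rng = np.random.default_rng(0)
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
wave = np.sin(2 * np.pi * 440 * t)        # synthetic stand-in for a command
noise = rng.normal(size=SAMPLE_RATE)      # synthetic background noise
augmented = augment(wave, noise, rng)     # always INPUT_LEN samples long
```

Whatever the random speed factor, the last step guarantees that the network always receives a vector of 16384 samples.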
The quantitative comparison of our models’ performance is presented in Table 1. In this table we report the classification accuracy of the single-network models, as well as of the ensemble of all models, on the validation, public leaderboard and private leaderboard data sets. The validation results are averaged over the 4 folds. One can notice that all the models provide similar performance, even those based on 1D convolutional neural networks. The ensemble of the models performs better because it reduces the variance between models and gets rid of outliers. In our case the ensemble is computed as the mean of the softmax probabilities. It can also be computed as a majority vote between the models, but the performance is then slightly worse. The difference in accuracy between the validation and leaderboard data sets can be understood by considering the confusion matrix. It shows that commands that are not unknown are predicted as unknown, and that these predictions are rather uncertain (0.25-0.4 for unknown, a little less for some other class). We can slightly improve the performance by tuning a threshold for the unknown class. In addition, the test data set contains many files that are poorly labeled and spoken by subjects whose ids are not contained in the train data set. One way to improve the performance of the models is to substantially enrich the train data set with many other examples and to improve the quality of the signals.
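The two ensembling rules and the threshold on the unknown class can be sketched as follows. The probabilities are made-up numbers for three models, two samples and three classes, and falling back to the runner-up class when the unknown probability is below the threshold is one possible reading of "a threshold for the unknown class" (an assumption); the threshold value 0.5 is hypothetical.

```python
import numpy as np


def mean_ensemble(probs):
    """probs has shape (n_models, n_samples, n_classes); average, then argmax."""
    return probs.mean(axis=0).argmax(axis=1)


def majority_vote(probs):
    """Each model votes for its top class; ties go to the lowest class index."""
    votes = probs.argmax(axis=2)                       # (n_models, n_samples)
    n_classes = probs.shape[2]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)],
                      axis=1)                          # (n_samples, n_classes)
    return counts.argmax(axis=1)


def threshold_unknown(mean_probs, unknown_idx, threshold):
    """Keep 'unknown' only when its probability reaches the threshold."""
    preds = mean_probs.argmax(axis=1)
    masked = mean_probs.copy()
    masked[:, unknown_idx] = -1.0                      # exclude 'unknown'
    runner_up = masked.argmax(axis=1)
    weak = (preds == unknown_idx) & (mean_probs[:, unknown_idx] < threshold)
    return np.where(weak, runner_up, preds)


# Classes: 0 = yes, 1 = no, 2 = unknown (illustrative subset of the 12 classes).
probs = np.array([
    [[0.90, 0.05, 0.05], [0.35, 0.25, 0.40]],   # model A
    [[0.40, 0.50, 0.10], [0.35, 0.25, 0.40]],   # model B
    [[0.40, 0.50, 0.10], [0.30, 0.30, 0.40]],   # model C
])

by_mean = mean_ensemble(probs)
by_vote = majority_vote(probs)
with_threshold = threshold_unknown(probs.mean(axis=0), unknown_idx=2,
                                   threshold=0.5)
```

In this toy example the two rules disagree on the first sample, because one confident model outweighs two uncertain ones in the mean, and the uncertain unknown prediction for the second sample is replaced by the runner-up class once the threshold is applied.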
In this study we presented a deep learning-based approach to speech command classification in the TensorFlow Speech Recognition Challenge organized by the Google Brain team on the Kaggle platform. We showed different representations of sound commands, such as wave frames and spectrograms, that are used as input to 1D and 2D convolutional networks. The most popular approaches to this problem are based on fine-tuning of ImageNet networks, but we showed that approaches based on 1D convolutional networks provide very similar performance. This work can be extended by considering different recurrent neural network architectures and their combinations with convolutional networks and/or siamese networks. Given the similar performance of several models, we conclude that data preparation plays a major role in further improvements. With an understanding of how to process sound on a machine, one can also build one's own sound classification systems. The general rule in deep learning is that the data is the key component: the larger the data set, the better the accuracy (Warden, 2018).
- Abdel-Hamid et al. (2014) Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing, 22(10):1533–1545, 2014.
- Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bourlard and Morgan (2012) Herve A Bourlard and Nelson Morgan. Connectionist speech recognition: a hybrid approach, volume 247. Springer Science & Business Media, 2012.
- Dave (2013) Namrata Dave. Feature extraction methods lpc, plp and mfcc in speech recognition. International journal for advance research in engineering and technology, 1(6):1–4, 2013.
- Deng and O’Shaughnessy (2003) Li Deng and Douglas O’Shaughnessy. Speech processing: a dynamic and optimization-oriented approach. CRC Press, 2003.
- Gers et al. (1999) Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm. 1999.
- Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT press Cambridge, 2016.
- Graves and Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
- Greff et al. (2017) Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. Lstm: A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10):2222–2232, 2017.
- Gruhn et al. (2011) Rainer E Gruhn, Wolfgang Minker, and Satoshi Nakamura. Statistical pronunciation modeling for non-native speech processing. Springer Science & Business Media, 2011.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
- Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012.
- Iglovikov et al. (2017) Vladimir Iglovikov, Alexander Rakhlin, Alexandr Kalinin, and Alexey Shvets. Pediatric bone age assessment using deep convolutional neural networks. arXiv preprint arXiv:1712.05053, 2017.
- Jurafsky and Martin (2014) Dan Jurafsky and James H Martin. Speech and language processing, volume 3. Pearson London, 2014.
- LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- Lee and Hon (1989) K-F Lee and H-W Hon. Speaker-independent phone recognition using hidden markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1641–1648, 1989.
- McClelland and Elman (1986) James L McClelland and Jeffrey L Elman. The trace model of speech perception. Cognitive psychology, 18(1):1–86, 1986.
- Mohamed et al. (2012) Abdel-rahman Mohamed, George E Dahl, Geoffrey Hinton, et al. Acoustic modeling using deep belief networks. IEEE Trans. Audio, Speech & Language Processing, 20(1):14–22, 2012.
- Mohamed et al. (2014) Samir A Elsagheer Mohamed, Allam Shehata Hassanin, and Mohamed Tahar Ben Othman. Educational system for the holy quran and its sciences for blind and handicapped people based on google speech api. Journal of Software Engineering and Applications, 7(03):150, 2014.
- Okada et al. (1998) Shinichiro Okada, Yoshiaki Tanaba, Hideyuki Yamauchi, and Shoichi Sato. Single-surgeon thoracoscopic surgery with a voice-controlled robot. The Lancet, 351(9111):1249, 1998.
- Rakhlin et al. (2018) Alexander Rakhlin, Alexey Shvets, Vladimir Iglovikov, and Alexandr A Kalinin. Deep convolutional neural networks for breast cancer histology image analysis. arXiv preprint arXiv:1802.00752, 2018.
- Shrawankar and Thakare (2013) Urmila Shrawankar and Vilas M Thakare. Techniques for feature extraction in speech recognition system: A comparative study. arXiv preprint arXiv:1305.1145, 2013.
- Shvets et al. (2018) Alexey Shvets, Alexander Rakhlin, Alexandr A Kalinin, and Vladimir Iglovikov. Automatic instrument segmentation in robot-assisted surgery using deep learning. arXiv preprint arXiv:1803.01207, 2018.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Solovyev et al. (2018) Roman A Solovyev, Alexandr A Kalinin, Alexander G Kustov, Dmitry V Telpukhov, and Vladimir S Ruhlov. Fpga implementation of convolutional neural networks with fixed-point calculations. arXiv preprint arXiv:1808.09945, 2018.
- Van Den Oord et al. (2016) Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In SSW, page 125, 2016.
- Warden (2018) Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.