Acoustic modeling with deep learning has demonstrated remarkable performance improvements in automatic speech recognition [1, 2, 3, 4]. Deep neural networks (DNNs) are trained to label each frame of processed speech data with the state of a hidden Markov model (HMM). This labeling is difficult, however, because acoustic features vary widely in frequency and articulation rate depending on harmonics of the vocal tract and characteristic speaking styles.
Efforts to effectively handle these variations can be categorized into feature-level and model-level approaches. Amongst feature-level approaches, speaker-adapted methods such as fMLLR have been proposed. Acoustic features concatenated with i-vectors, which represent speaker information, have also been employed as input for DNNs [6, 7]. Model-level approaches have employed hybrid NN-HMM systems with convolutional neural networks (CNNs) [8, 9, 10] and recurrent neural networks (RNNs) [11, 12, 13].
In particular, CNNs have advantages in terms of capturing local features through weight sharing while remaining robust to slight translations of these features through pooling. These structural advantages enable CNNs to model speech data such as spectrograms or mel filter-banks directly, without feature-level engineering. Previous researchers introduced time-delayed neural networks (TDNNs), which are CNNs with convolutions along the time axis to learn the temporal dynamics of features [14, 15, 16]. Other researchers have applied convolutions along the frequency axis to attain invariance to frequency shifts [8, 17]. However, the acoustic features of speech differ across frequency regions, so weight sharing along the frequency axis may not be appropriate. The limited weight sharing method, in which weights are convolved only within a subsection of frequency bands, has been employed in efforts to overcome this problem. However, settling on appropriate band divisions and filters will require further work.
One limitation common to the preceding approaches is that most of them have employed only one or two convolution and pooling layers. Another limitation is that the relationship or topography of filters trained in supervised learning has not been intensively investigated. For unsupervised feature extraction, previous researchers imposed sparsity terms over small groups or neighborhoods in feature maps of image [18, 19, 20] and speech data [21, 22]. They attained topographically organized maps of smoothly varying oriented edge filters or tonotopic disordered topography of spectrotemporal features, such as those found in the primary visual or auditory cortex (V1 and A1, respectively).
In this paper, we argue that convolution along the time axis is more effective than along the frequency axis for acoustic models. So that the network learns temporal dynamics adequately, we increase the depth of convolution layers that have small filters. Instead of frequency-axis convolution and pooling, we propose the addition of a convolutional maxout layer, namely an intermap pooling (IMP) layer, in order to increase robustness to spectral variations. A convolutional maxout network has been proposed previously; however, it applied convolutions along the frequency axis. We show that IMP CNNs with time-axis convolution reduce word error rates further than CNNs with frequency-axis convolution. As a result, the IMP CNNs can both model temporal dynamics and remain robust to spectral variations.
II Convolutional Neural Networks
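CNNs consist of alternating convolution and pooling layers, followed by fully connected layers at the top. Let $x^{l}$ stand for the input to the $l$-th convolution layer having $K$ filters, with the $k$-th convolution filter denoted $w^{l}_{k}$, with $P$ and $Q$ denoting a filter's height and width respectively, and $C$ designating the number of feature maps of the input. A bias term $b^{l}_{k}$ is shared inside the $k$-th feature map. Thus, from any input $x^{l}$, the output $y^{l}$ can be calculated as
$$y^{l}_{k,i,j} = \sigma\left( \sum_{c=1}^{C} \sum_{p=1}^{P} \sum_{q=1}^{Q} w^{l}_{k,c,p,q}\, x^{l}_{c,\,i+p-1,\,j+q-1} + b^{l}_{k} \right),$$
where $1 \le i \le H$ and $1 \le j \le W$, $H$ and $W$ are the height and width of each output feature map, and $\sigma(\cdot)$ is the activation function.
An intramap pooling layer, typically called “max-pooling”, propagates the maximum value from each sub-region in each feature map. For non-overlapping sub-regions with height $R$ and width $S$, the output from this pooling layer is given by
$$z^{l}_{k,i,j} = \max_{1 \le r \le R,\; 1 \le s \le S} y^{l}_{k,\,(i-1)R+r,\,(j-1)S+s}.$$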
Intramap pooling layers have blurring effects on feature maps, with the result being that the CNN is more robust to locally translated features.
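As a rough illustration of the two layer types described above, the following sketch applies a filter along the time axis only and then max-pools inside the resulting feature map. It is a minimal single-channel example under stated assumptions, not the implementation used in the experiments: the function names, the toy input, the ReLU activation, and the absence of padding are all choices made here for illustration.

```python
def conv_time(feature_map, kernel, bias=0.0):
    """Valid cross-correlation of every frequency row along the time axis.

    feature_map: list of rows (frequency bins), each a list of values over time.
    kernel: weights of a 1 x Q filter, shared across all rows and time steps.
    The ReLU activation (max with 0) is an assumption for this sketch.
    """
    q = len(kernel)
    out = []
    for row in feature_map:
        out.append([
            max(0.0, sum(kernel[k] * row[t + k] for k in range(q)) + bias)
            for t in range(len(row) - q + 1)
        ])
    return out


def intramap_max_pool_time(feature_map, size):
    """Non-overlapping max pooling along time inside a single feature map."""
    return [
        [max(row[t:t + size]) for t in range(0, len(row) - size + 1, size)]
        for row in feature_map
    ]


# Toy feature map: 2 frequency bins x 6 frames.
x = [[0.0, 1.0, 3.0, 2.0, 0.0, 1.0],
     [1.0, 0.0, 0.0, 4.0, 1.0, 0.0]]
y = conv_time(x, kernel=[1.0, 0.0, -1.0])  # 2 x 4 after a 1x3 filter
z = intramap_max_pool_time(y, size=2)      # 2 x 2 after pooling by 2
```

Pooling halves the temporal length here, which is the blurring effect that makes the output tolerant of small shifts in time.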
III Intermap Pooling Layers
There are several categories of acoustic features such as harmonics, formants, and on/offsets (i.e., start and end points of speech). Spectral variations of acoustic features appear as shifts in the frequency axis over time (spectro-temporal modulation). In order to ensure the robustness of our model to spectral variations, we propose the addition of a convolutional maxout layer, the intermap pooling (IMP) layer. Like the maxout networks, this layer groups the filters and pools the feature maps inside a group.
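Specifically, an intermap pooling layer partitions the feature maps into a set of groups. Each group then propagates the maximum activation value at each position. Formally, the output of the $g$-th group consisting of $G$ consecutive feature maps is given by
$$\hat{y}_{g,i,j} = \max_{1 \le m \le G} y_{(g-1)G+m,\,i,\,j},$$
where $y_{k,i,j}$ is the activation of the $k$-th feature map at position $(i, j)$.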
The structural comparison of intermap and intramap pooling layers is shown in Fig. 1. Note that the method pursued in this paper does not introduce any additional learning terms except for the intermap grouping of filters.
The central idea of the IMP CNN is that the filters in each group learn common but spectrally variant features, such as frequency-shifted harmonics, and the pooled feature map is invariant to those feature variations within the group. The pooled feature maps are representative of the feature maps in each group. Through supervised learning, the pooled feature maps become discriminative features for recognizing phonemes. For a phoneme, since spectral variations among different speakers and utterances are not discriminative information, the individual filters in a group spontaneously represent common but spectrally variant features, even though the layer does not ensure this.
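The grouping and pooling step can be sketched in a few lines. This is a minimal illustration assuming feature maps stored as nested lists and non-overlapping groups of consecutive maps; the function name and the toy values are hypothetical.

```python
def intermap_pool(feature_maps, group_size):
    """Max over each group of `group_size` consecutive feature maps, per position.

    feature_maps: list of maps, each a list of rows of equal shape.
    Returns len(feature_maps) // group_size pooled maps of the same shape.
    """
    if len(feature_maps) % group_size != 0:
        raise ValueError("number of feature maps must be divisible by group_size")
    pooled = []
    for start in range(0, len(feature_maps), group_size):
        group = feature_maps[start:start + group_size]
        pooled.append([
            [max(values) for values in zip(*rows)]  # max across maps at (i, j)
            for rows in zip(*group)                 # i-th row of every map
        ])
    return pooled


# Four 1 x 3 feature maps pooled in groups of two.
maps = [[[1, 5, 0]], [[2, 1, 0]], [[0, 0, 7]], [[3, 3, 3]]]
pooled = intermap_pool(maps, group_size=2)
```

The spatial size of each map is unchanged; only the number of maps shrinks, by a factor of the group size.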
IV Deep CNN Architecture for Acoustic Modeling
Since short-term temporal dynamics are shared within every frame of a given speech sample, sharing filters along the time axis is reasonable. However, sharing filters along the frequency axis may not be suitable, because features within lower frequency-band regions are significantly different from those in the higher regions. Instead of convolution along the frequency axis, our architecture employs an intermap pooling layer following the first convolution layer. This approach demonstrates robustness not only to frequency-shifted features but also to spectro-temporally distorted features. Moreover, it does not require engineering effort to account for the varying characteristics of different frequency bands.
Sufficient depth in the convolution and pooling layers is necessary to precisely represent complex acoustic features with temporal and spectral variations. Individual frames must also be labeled as finely as the number of HMM states, which runs into the thousands in tri-phone modeling. However, context-windowed inputs are small (e.g., 21 frames), and stacking multiple intramap pooling layers decreases the feature map size in proportion to the pooling size, thereby restricting the depth of CNNs. Since previous researchers have chosen large convolution filter and intramap pooling sizes, sufficient increases in the depth of CNNs have not been realized.
As illustrated in Fig. 1, the IMP CNN architecture applies convolution and intramap pooling layers only along the time axis. The pooling size of the intramap pooling layers is small so that it does not decrease temporal resolution much. Furthermore, motivated by the performance of very deep CNNs, we inserted convolution layers with small filters (of size 1x3) between two intramap pooling layers. The combination of filters before pooling layers increases non-linearity, and this results in a network that has rich feature expressions.
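The depth constraint can be made concrete with a little arithmetic on the time axis. The sketch below tracks how a 21-frame context window shrinks through a stack of layers; both layer stacks are hypothetical examples chosen for illustration (they are not the configurations of Fig. 2), and unpadded convolutions with non-overlapping pooling are assumed.

```python
def temporal_length(frames, layers):
    """Track the time-axis length of a feature map through a stack of layers.

    layers: sequence of ("conv", width) or ("pool", size) tuples. Convolutions
    are assumed unpadded and pooling non-overlapping, both along time only.
    """
    lengths = [frames]
    for kind, size in layers:
        if kind == "conv":
            frames = frames - size + 1
        elif kind == "pool":
            frames = frames // size
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        if frames < 1:
            raise ValueError("feature map vanished along the time axis")
        lengths.append(frames)
    return lengths


# Hypothetical stacks: small 1x3 filters with pooling size 2 versus larger sizes.
small = [("conv", 3), ("conv", 3), ("pool", 2), ("conv", 3), ("conv", 3), ("pool", 2)]
large = [("conv", 9), ("pool", 3), ("conv", 4)]

small_lengths = temporal_length(21, small)  # six layers, 2 frames remain
large_lengths = temporal_length(21, large)  # three layers, 1 frame remains
```

With these example sizes, the small-filter stack fits twice as many layers into the same window before the temporal axis is exhausted.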
V Experimental Results
V-A Experimental setup
We conducted experiments using the 300-hour Switchboard-I Release 2 (SWBD) dataset, which is a conversational telephone speech task, as well as the Wall Street Journal (WSJ) corpus and the Aurora4 database, which are read speech. We used the 81-hour training dataset (SI-284) of the WSJ corpus. The Aurora4 database is a subset of the WSJ corpus in which clean utterances are corrupted with different noise types and/or convolved with microphone distortions. The Aurora4 results reported below are for the IMP CNN trained on the multi-condition training dataset.
The raw speech signal is processed via short-time Fourier transform (STFT) with a 25 ms Hamming window and 10 ms window shifts. We used 40-dimensional log-mel filter bank features without the energy coefficient, and concatenated frames with a context window size of 21 (±10 frames) to feed them into the networks as inputs. We trained the GMM-HMM system over fMLLR features. The forced alignment of each frame by the GMM-HMM baseline system is the target label of the neural networks.
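The context windowing can be sketched as follows. This is a minimal illustration, assuming frames are stored as plain lists and that utterance boundaries are padded by repeating the first or last frame; the padding rule and the function name are assumptions, since the edge handling is not described above.

```python
def splice_frames(features, context=10):
    """Stack each frame with `context` frames on either side (2 * context + 1 total).

    features: list of frames, each a list of filter-bank values.
    Edge frames are padded by repeating the first/last frame (an assumption).
    Returns one window per input frame.
    """
    n = len(features)
    windows = []
    for t in range(n):
        window = [
            features[min(max(t + d, 0), n - 1)]  # clamp the index at the edges
            for d in range(-context, context + 1)
        ]
        windows.append(window)
    return windows


# 30 frames of 40-dimensional dummy features; frame t is filled with the value t.
utterance = [[float(t)] * 40 for t in range(30)]
windows = splice_frames(utterance, context=10)
```

Each network input is then a 21-frame by 40-bin patch centered on the frame being labeled.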
After random initialization of weights and biases from the Gaussian distributions $\mathcal{N}(0, 0.01)$ and $\mathcal{N}(0, 0.5)$ respectively, the CNNs were optimized by the stochastic gradient descent (SGD) method. For CNNs deeper than 9 layers in particular, training became infeasible, because each layer back-propagates errors by multiplying them with its small initial weights, resulting in vanishing gradients. Therefore, we increased the standard deviation ($\sigma$) of the Gaussian distribution in lower layers. Each layer is trained with a momentum of 0.9, an L2-decay term of 0.0005, and a mini-batch size of 512. After one epoch of training, the trained model is accepted if the validation cost decreases. Otherwise, the trained model is rejected and training starts again from the latest accepted model with a halved learning rate. The initial learning rate is 0.01, and the training stops after 50 epochs. Our implementation is built upon the KALDI toolkit.
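The accept/reject schedule can be written as a short loop. The sketch below is one reading of the procedure under stated assumptions: `train_epoch` and `validation_cost` are hypothetical callables supplied by the caller, rejected epochs count toward the 50-epoch budget, and the initial model's validation cost serves as the first reference value. The toy quadratic problem exists only to exercise the loop.

```python
import copy


def train_with_halving(model, train_epoch, validation_cost,
                       initial_lr=0.01, max_epochs=50):
    """Accept an epoch only if validation cost drops; else roll back and halve lr."""
    lr = initial_lr
    best_model = copy.deepcopy(model)
    best_cost = validation_cost(best_model)
    for _ in range(max_epochs):
        # Always train a copy of the latest accepted model.
        candidate = train_epoch(copy.deepcopy(best_model), lr)
        cost = validation_cost(candidate)
        if cost < best_cost:
            best_model, best_cost = candidate, cost  # accept
        else:
            lr *= 0.5                                # reject and halve
    return best_model, best_cost, lr


# Toy problem (hypothetical): minimize (w - 1)^2 with one gradient step per "epoch".
def toy_cost(m):
    return (m["w"] - 1.0) ** 2


def toy_epoch(m, lr):
    m["w"] -= lr * 2.0 * (m["w"] - 1.0)
    return m


final_model, final_cost, final_lr = train_with_halving(
    {"w": 0.0}, toy_epoch, toy_cost, initial_lr=1.5)
```

In the toy run the first step overshoots and is rejected, so the learning rate is halved once and every later epoch is accepted.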
For the SWBD task, we decode speech using a trigram language model (LM) with a 30k-word vocabulary, which is trained on 3M words, and then we rescore the decoding results using a 4-gram LM, which is trained on the Fisher English Part 1 transcripts. For the WSJ and Aurora4 corpora, we used a 146K-word extended dictionary and the pruned trigram language model, exactly as in the ‘s5’ recipe in KALDI.
V-B Convolution axis and depth of CNNs
Fig. 3 shows the decoding results of CNNs on the SWBD evaluation sets with various depths, from 6 layers up to 15 layers (configurations are described in Fig. 2). Deeper CNNs produced lower WERs, with the 15-layer CNN achieving a WER of 12.8% on the SWB set and 18.6% on the total evaluation set. Moreover, convolution along the time axis consistently outperforms convolution along the frequency axis. Furthermore, CNNs trained over log-mel features had a lower WER than those trained over fMLLR features when the CNN has more than 9 layers. These results show that weight sharing along the time axis more effectively reduces the WER, and that increased non-linearity obviates preprocessing for speaker adaptation.
|Features (convolution axis)||SWB: 9L||SWB: 9L-IMP||Total: 9L||Total: 9L-IMP|
|log-mel (time)||13.2||12.7||18.8||18.5|
|fMLLR (time)||13.3||13.5||19.0||19.1|
V-C IMP CNNs
Decoding results of IMP CNNs with different numbers of maps and pooling sizes are compared in Table I. We further investigated an intermap pooling layer in which the groups overlap each other with a stride of one (IMPO). All CNNs with an intermap pooling layer performed better than the 9-layer CNN without one. In particular, the ‘9L-IMP(512, 4)’ CNN performs the best, with a WER of 12.7% on the SWB test set, a 3.78% relative improvement over the 9-layer CNN. Note from Table II that when the IMP layer is applied to CNNs along the frequency axis or over fMLLR features, performance declines. In addition, the IMP CNN performed well on the WSJ and Aurora4 corpora, as shown in Tables III and IV, respectively. It is remarkable that IMP layers contribute robustness to spectral variations in both clean and noisy conditions.
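The overlapping variant differs from plain intermap pooling only in how groups are formed. The sketch below slides a window of consecutive feature maps with a stride of one; it is an illustrative reading with a hypothetical function name, and it assumes that groups do not wrap around past the last feature map.

```python
def intermap_pool_overlapping(feature_maps, group_size, stride=1):
    """Max over sliding groups of consecutive feature maps, per position.

    With stride=1 neighboring groups share group_size - 1 maps. Groups do not
    wrap around the end of the map list (an assumption of this sketch).
    """
    pooled = []
    for start in range(0, len(feature_maps) - group_size + 1, stride):
        group = feature_maps[start:start + group_size]
        pooled.append([
            [max(values) for values in zip(*rows)]  # max across maps at (i, j)
            for rows in zip(*group)
        ])
    return pooled


# Four 1 x 2 feature maps, groups of two, stride one: three pooled maps.
maps = [[[1, 0]], [[0, 2]], [[3, 0]], [[0, 0]]]
pooled = intermap_pool_overlapping(maps, group_size=2)
```

Because adjacent groups share filters, neighboring filters are pushed toward similar features, which is consistent with the one-dimensional ordering of filters reported for the IMPO layer.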
V-D Analysis on learnt filters
Learnt filters of the first convolution layer are visualized in Fig. 4(a). There are five categories of spectrotemporal features in the filters. (1) Harmonic features are narrow in the low-frequency region and (2) broad in the high-frequency region. (3) The on/offset-detecting filters are temporally selective, but are also sensitive to several frequencies. (4) The features of Gabor-like filters are centered on some frequency bands, which presumably detect formants. (5) The features of formant changes are directional diagonal lines, i.e., spectrotemporal modulations, in the middle frequency bands. Note that different features appear in different frequency bands, and that local features of each type have different bandwidth sizes.
The trained filters in each group of the intermap pooling layer are presented in Fig. 4(b). Importantly, most filters in a group belong to a common category. For example, the filters in the harmonic extractor and formant change detector groups have marginally shifted features on the frequency axis. This figure verifies that intermap pooling leads the filters of a group to extract common but spectrally variant features, although there are no additional architectural constraints to guarantee this.
The consecutive trained filters of the IMPO layer are drawn in Fig. 4(c). The filters form a 1-dimensional topological map, where neighboring filters respond to similar spectrotemporal features. Along the topological map axis, filters appear discontinuously between feature categories, reflecting the fact that feature categories become definitely distinguishable to the system as it is trained. In recent neurophysiological studies, there is consensus that multiple tonotopic maps exist in the human auditory system. However, few studies suggest that this topography includes other sound features, such as temporal, spectral, and joint modulations [31, 32, 33]. The trained topography may provide a clue as to how human auditory neurons organize to efficiently process information in A1.
|Features||Model||SWB||Total|
|fMLLR||Maxout 7L||14.2||20.0|
|log-mel filterbanks||Maxout 7L||14.6||20.7|
|log-mel filterbanks||CNN 15L||12.8||18.6|
|MFCC + i-vectors||TDNN 4L||12.9||19.2|
|VTL-warped log-mel||CNN 8L (2conv+6fc)||12.6||-|
|VTL-warped log-mel||CNN 13L (10conv+3fc)||11.8||-|
V-E Comparison of the IMP CNN
For comparison, we trained maxout networks with 7 layers, 2,000 hidden neurons, and 400 groups on both fMLLR and filter-bank features. The decoding results are summarized in Table V. The ‘9L-IMP(512, 4)’ IMP CNN improved on the GMM-HMM baseline (19.5%) and the maxout network (14.6%), a relative improvement of 34.87% and 13.01%, respectively. It also performs on par with a 15-layer CNN, i.e., a non-IMP CNN with six additional convolution layers. Finally, the IMP CNN is compared with the TDNN and the CNNs that employed 2-dimensional convolutions [34, 35]. Note that we compare only against previous results obtained without sequence training such as sMBR. Even though our deep CNN did not use any speaker adaptation techniques, it yielded a comparable word error rate simply by employing intermap pooling and increasing depth.
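The relative improvements quoted above follow from the usual definition, the WER reduction divided by the baseline WER. The short check below recomputes them from the WERs stated in the text; the function name is chosen here for illustration.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


imp_cnn_wer = 12.7  # '9L-IMP(512, 4)' on the SWB set
vs_gmm_hmm = relative_improvement(19.5, imp_cnn_wer)  # about 34.87
vs_maxout = relative_improvement(14.6, imp_cnn_wer)   # about 13.01

print(f"vs GMM-HMM: {vs_gmm_hmm:.2f}%  vs maxout: {vs_maxout:.2f}%")
```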
The experiments in this paper demonstrate that convolution along the time axis is more effective than along the frequency axis when processing speech. Depth in convolution layers is crucial for the sufficient representation of the complex temporal dynamics inherent in the acoustic features of speech. In order to achieve greater robustness to spectral variations in speech recognition, we proposed the addition of intermap pooling (IMP) to CNNs. Through visualization of the trained filters, we verified that filters grouped together learn similar spectrotemporal features and form a topological map. Finally, even without any speaker adaptation techniques, the proposed IMP CNN delivered competitive performance on the Switchboard, WSJ, and Aurora4 databases.
References
[1] A.-R. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic Modeling using Deep Belief Networks,” Audio, Speech, Lang. Process. IEEE Trans., vol. 20, no. 1, pp. 14–22, 2010.
[2] G. Dahl, D. Yu, L. Deng, and A. Acero, “Large vocabulary continuous speech recognition with context-dependent DBN-HMMs,” INTERSPEECH, pp. 4688–4691, 2011.
[3] J. Lee and S.-Y. Lee, “Deep learning of speech features for improved phonetic recognition,” Acoust. Speech Signal Process. (ICASSP), 2011 IEEE Int. Conf. on., no. August, pp. 1249–1252, 2011.
[4] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-R. Mohamed, N. Jaitly, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition,” Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
[5] S. P. Rath, D. Povey, K. Veselý, and J. H. Černocký, “Improved feature processing for Deep Neural Networks,” INTERSPEECH, pp. 1–5, 2013.
[6] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” 2013 IEEE Work. Autom. Speech Recognit. Underst., pp. 55–59, 2013.
[7] A. Senior and I. Lopez-Moreno, “Improving DNN Speaker Independence With I-Vector Inputs,” 2014 IEEE Int. Conf. Acoust. Speech Signal Process., pp. 225–229, 2014.
[8] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” IEEE Int. Conf. Acoust. Speech Signal Process., 2012.
[9] O. Abdel-Hamid, L. Deng, and D. Yu, “Exploring Convolutional Neural Network Structures and Optimization Techniques for Speech Recognition,” INTERSPEECH, no. August, pp. 3366–3370, 2013.
[10] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional Neural Networks for Speech Recognition,” Audio, Speech, Lang. Process. IEEE/ACM Trans., vol. 22, no. 10, pp. 1533–1545, 2014.
[11] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech Recognition With Deep Recurrent Neural Networks,” Acoust. Speech Signal Process. (ICASSP), 2013 IEEE Int. Conf. on. IEEE, no. 3, pp. 6645–6649, 2013.
[12] A. Graves, N. Jaitly, and A.-R. Mohamed, “Hybrid speech recognition with Deep Bidirectional LSTM,” Autom. Speech Recognit. Underst. (ASRU), 2013 IEEE Work. on. IEEE, pp. 273–278, 2013.
[13] H. Sak, A. Senior, and F. Beaufays, “Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling,” INTERSPEECH, no. September, pp. 338–342, 2014.
[14] A. Waibel, T. Hanazawa, G. E. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” Acoust. Speech Signal Process. IEEE Trans., vol. 37, no. 3, pp. 328–339, 1989.
[15] H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” Adv. Neural Inf. Process. Syst., pp. 1–9, 2009.
[16] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” INTERSPEECH, pp. 2–6, 2015.
[17] T. N. Sainath, A.-R. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep Convolutional neural networks for LVCSR,” Acoust. Speech Signal Process. (ICASSP), 2013 IEEE Int. Conf. on., pp. 10–14, 2013.
[18] A. Hyvärinen and P. Hoyer, “Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces,” Neural Comput., vol. 12, no. 7, pp. 1705–1720, 2000.
[19] A. Hyvärinen, P. O. Hoyer, and M. Inki, “Topographic independent component analysis,” Neural Comput., vol. 13, no. 7, pp. 1527–1558, 2001.
[20] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun, “Learning Invariant Features through Topographic Filter Maps,” Comput. Vis. Pattern Recognit., pp. 1605–1612, 2009.
[21] T. Kim and S.-Y. Lee, “Learning self-organized topology-preserving complex speech features at primary auditory cortex,” Neurocomputing, vol. 65-66, pp. 793–800, 2005.
[22] H. Terashima and M. Okada, “The topographic unsupervised learning of natural sounds in the auditory cortex,” Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.
[23] M. Cai, Y. Shi, J. Kang, J. Liu, and T. Su, “Convolutional maxout neural networks for low-resource speech recognition,” Proc. 9th Int. Symp. Chinese Spok. Lang. Process. ISCSLP 2014, pp. 133–137, 2014.
[24] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout Networks,” arXiv Prepr., pp. 1319–1327, 2013.
[25] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv Prepr., pp. 1–14, 2015.
[26] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD telephone speech corpus for research and development,” Proc. 1992 IEEE Int. Conf. Acoust. Speech, Signal Process., vol. 1, pp. 517–520, 1992.
[27] D. B. Paul and J. M. Baker, “The Design for the Wall Street Journal-based CSR Corpus,” Proc. Work. Speech Nat. Language, 1994.
[28] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” Autom. Speech Recognit. Underst. (ASRU), 2011 IEEE Work. on. IEEE, pp. 1–4, 2011.
[29] C. Cieri, D. Miller, and K. Walker, “The Fisher corpus: a Resource for the Next Generations of Speech-to-Text,” Proc. LREC, vol. 4, pp. 69–71, 2004.
[30] M. Saenz and D. R. Langers, “Tonotopic mapping of human auditory cortex,” Hear. Res., vol. 307, pp. 42–52, 2014.
[31] B. Barton, J. H. Venezia, K. Saberi, G. Hickok, and A. A. Brewer, “Orthogonal acoustic dimensions define auditory field maps in human cortex,” Proc. Natl. Acad. Sci., 2012.
[32] M. Herdener, F. Esposito, K. Scheffler, P. Schneider, N. K. Logothetis, K. Uludag, and C. Kayser, “Spatial representations of temporal and spectral sound cues in human auditory cortex,” Cortex, vol. 49, no. 10, pp. 2822–2833, 2013.
[33] R. Santoro, M. Moerel, F. De Martino, R. Goebel, K. Ugurbil, E. Yacoub, and E. Formisano, “Encoding of Natural Sounds at Multiple Spectral and Temporal Resolutions in the Human Auditory Cortex,” PLoS Comput. Biol., vol. 10, no. 1, p. e1003412, 2014.
[34] G. Saon, H.-K. J. Kuo, S. Rennie, and M. Picheny, “The IBM 2015 English Conversational Telephone Speech Recognition System,” Interspeech 2015, 2015.
[35] T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, “Very deep multilingual convolutional neural networks for LVCSR,” pp. 2–6, 2016.