CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments

11/07/2018 ∙ by Nelson Yalta, et al. ∙ 0

Casual conversations involving multiple speakers and noises from surrounding devices are part of everyday environments and pose challenges for automatic speech recognition systems. These challenges are the target of the CHiME-5 challenge. In the present study, an attempt is made to overcome these challenges by employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system. The system comprises an attention-based encoder-decoder neural network that directly generates text as output from a sound input. The multichannel CNN encoder, which uses residual connections and batch renormalization, is trained with augmented data, including white noise injection. The experimental results show that the word error rate (WER) was reduced by 11.9




1 Introduction

Automatic speech recognition (ASR) makes it possible for machines to understand human languages and follow human voice commands. Current ASR systems implemented with deep learning techniques have improved performance in near- and far-field settings [1, 2] under diverse environmental conditions [3]. Recently, ASR systems implemented with end-to-end models [4, 5, 6, 7] have gained attention because end-to-end models learn to map acoustic feature sequences directly to character sequences without any intermediate modeling, unlike conventional ASR systems, which combine separate components (such as the acoustic model, pronunciation lexicon, and language models that are based on deep learning [1, 8]).

The two major approaches of end-to-end models, connectionist temporal classification (CTC) [9, 5, 10] and attention-based models [4, 11], have achieved promising recognition results. CTC-based models [9] solve sequential learning problems based on Markov assumptions [5, 10]. Attention-based models align acoustic frames and decoded symbols by using an attention mechanism [4, 11]. Recent studies on end-to-end models showed that a joint CTC-attention model improves the recognition performance over either approach alone [6, 12]. The joint model trains an attention-based encoder with an attached CTC objective for regularization. Furthermore, the CTC objective is employed during the decoding phase to improve the model results [13].

Although end-to-end models are comparable to or even more advantageous than conventional ASR systems [6, 7], recognizing speech signals robustly under adverse scenarios, including casual conversation, noisy environments, and low resources (i.e., the CHiME-5 task [14]), is nevertheless challenging. In fact, most of the competitive systems in the fifth CHiME challenge, except for [12], employ conventional ASR methods with multichannel speech enhancement techniques [15, 16, 17, 18]. In this study, this challenging scenario is addressed using an end-to-end ASR model. To boost the speech recognition performance under these conditions, we propose an extension of a joint CTC-attention model that uses residual connections for the CNN and accepts multichannel inputs.

First, we explore the use of multichannel inputs [19, 20] for noisy environments under the fifth CHiME challenge scenario [14] to train our model. The fifth CHiME challenge collects speech materials from casual conversations in real home scenarios. The challenge considers distant multi-microphone speech captured by four binaural microphone pairs and six Kinect microphone arrays and features two tracks, namely, the single-array track and the multiple-array track. Specifically, our multichannel end-to-end approach focuses on the single-array track, and we evaluated several configurations for a joint CTC-attention model with an end-to-end toolkit named ESPnet [21].

This paper presents extensions of a joint CTC-attention model. The performance was evaluated and compared to that of a conventional joint CTC-attention model. The introduced extensions are as follows:

  • Parallel CNN-encoder with residual connections. We employed the data from both microphone types (Kinect and binaural) to improve the performance for noisy speech recognition. Furthermore, we observed that augmenting the data on the binaural side with white noise reduced the word error rate (WER) by 1% absolute and yielded better performance than employing dropout.

  • Batch Renormalization [22]. This normalization improves the training process for small mini-batches using the moving averages of the mean and variance during training and inference.

  • Multilevel language modeling (LM) [23]. This modeling technique combines the open-vocabulary capability of a character-based LM with the strength of a word-based LM in modeling long sequences.

Compared to the WER of a standard joint model, the absolute WER improved by 11.9% when using the proposed multichannel joint CTC-attention with residual connections.

2 End-to-End ASR Overview

The framework employs a joint CTC-attention model that processes the audio features and generates text as an output.

2.1 Joint CTC-Attention Model

The key idea of a joint CTC-attention model is to overcome 1) the conditional independence of the targets assumed in the CTC model and 2) the misalignments in the attention model produced by the noise in real-environment speech recognition tasks [24]. A joint CTC-attention model uses a shared encoder to train an attention model encoder with a CTC objective function as an auxiliary task. This model uses the multi-task learning (MTL) framework to achieve the desired training.

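For an audio input $X = (x_t \mid t = 1, \dots, T)$ of length $T$, CTC generates an output sequence of shorter length $L$ for the $L$-length letter sequence $C = \{c_l \in \mathcal{U} \mid l = 1, \dots, L\}$, with $L \le T$ and set of distinct characters $\mathcal{U}$. CTC generates an intermediate "blank" symbol, which represents the omission of the output label. This special symbol is introduced to generate a framewise letter sequence $Z = \{z_t \in \mathcal{U} \cup \{\langle\text{blank}\rangle\} \mid t = 1, \dots, T\}$. Assuming conditional independence between each output, CTC models the probability distributions over all possible label sequences to maximize $p(C \mid X)$ as follows:

$$p(C \mid X) \approx \sum_{Z} \prod_{t=1}^{T} p(z_t \mid z_{t-1}, C)\, p(z_t \mid X)\, \frac{p(C)}{p(Z)}$$

where $p(z_t \mid z_{t-1}, C)$ is the label transition probability, and $p(C)$ and $p(Z)$ are label prior distributions. Similar to the conventional hybrid ASR, $p(z_t \mid X)$ represents the framewise posterior distribution and is modeled by using a deep encoder, such as the bidirectional long short-term memory (BiLSTM), convolutional neural network (CNN) + BiLSTM, etc., as follows:

$$p(z_t \mid X) = \mathrm{Softmax}(\mathrm{Lin}(h_t)), \qquad h_t = \mathrm{Encoder}(X)$$

where $\mathrm{Lin}(\cdot)$ denotes a linear layer that converts hidden vector $h_t$ to a $(|\mathcal{U}| + 1)$-dimensional vector (the extra dimension accounts for <blank>), and $\mathrm{Softmax}(\cdot)$ denotes a softmax activation function.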
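As a rough illustration of the sum over framewise sequences described above, the following sketch computes the CTC probability of a label sequence with the standard forward recursion and checks it against brute-force enumeration. It is a minimal pure-Python version under stated assumptions: the framewise posteriors are a made-up toy matrix, index 0 is taken to be the blank symbol, the transition term is reduced to the usual allowed-path constraint, and the names `ctc_forward`, `collapse`, and `brute_force` are illustrative rather than taken from the paper or from ESPnet.

```python
from itertools import product

BLANK = 0  # assumed index of the <blank> symbol


def ctc_forward(posteriors, labels):
    """Sum the probability of every framewise sequence that collapses to `labels`."""
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for c in labels:
        ext += [c, BLANK]
    S, T = len(ext), len(posteriors)

    alpha = [0.0] * S
    alpha[0] = posteriors[0][ext[0]]
    if S > 1:
        alpha[1] = posteriors[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * posteriors[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or in the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def collapse(path):
    """Merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for z in path:
        if z != prev and z != BLANK:
            out.append(z)
        prev = z
    return out


def brute_force(posteriors, labels):
    """Reference: enumerate every framewise sequence (only feasible for toy sizes)."""
    T, K = len(posteriors), len(posteriors[0])
    total = 0.0
    for path in product(range(K), repeat=T):
        if collapse(path) == list(labels):
            p = 1.0
            for t, z in enumerate(path):
                p *= posteriors[t][z]
            total += p
    return total


# Toy posteriors over {blank, 1, 2} for T = 4 frames.
toy = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
    [0.7, 0.1, 0.2],
]
p_fast = ctc_forward(toy, [1, 2])
p_slow = brute_force(toy, [1, 2])
```

The recursion visits each frame once, whereas the enumeration grows exponentially with the number of frames, which is why the dynamic-programming form is the one used in practice.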

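On the other hand, an attention-based model does not make any conditional independence assumptions for $p(C \mid X)$. The posterior probability $p(C \mid X)$ is directly estimated based on the chain rule:

$$p(C \mid X) = \prod_{l=1}^{L} p(c_l \mid c_1, \dots, c_{l-1}, X)$$

where $p(c_l \mid c_1, \dots, c_{l-1}, X)$ is represented as:

$$p(c_l \mid c_1, \dots, c_{l-1}, X) = \mathrm{Decoder}(r_l, q_{l-1}, c_{l-1}), \qquad h_t = \mathrm{Encoder}(X), \qquad r_l = \sum_{t=1}^{T} a_{lt} h_t$$

where $\mathrm{Decoder}(\cdot) \triangleq \mathrm{Softmax}(\mathrm{Lin}(\mathrm{LSTM}(\cdot)))$ is a recurrent neural network with hidden vector $q_{l-1}$, previous output $c_{l-1}$, and a letter-wise hidden vector $r_l$. The attention weight $a_{lt}$ represents a soft alignment and is obtained as follows:

$$a_{lt} = \mathrm{Attention}(\{a_{l-1,t}\}_{t=1}^{T}, q_{l-1}, h_t)$$

with $\mathrm{Attention}(\cdot)$ as a content-based attention mechanism with convolutional features [25]. Although attention-based ASR implicitly combines the acoustic model, lexicon, and language model in a single framework (i.e., encoder, decoder, and attention), making predictions conditioned on all previous predictions, the alignment can become impaired owing to the use of explicit alignment without monotonic constraints.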
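To make the soft alignment concrete, the sketch below computes attention weights over a few encoder frames and the resulting letter-wise context vector as their weighted sum. It is a simplified illustration, not the mechanism used in the paper: it assumes plain dot-product scoring between the previous decoder state and each encoder state, and it omits the convolutional features over previous attention weights. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(query, encoder_states):
    """Return the attention weights and the letter-wise context vector."""
    # Dot-product scoring is an assumption made for brevity.
    scores = [sum(q * h for q, h in zip(query, h_t)) for h_t in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h_t[d] for w, h_t in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Three encoder frames with two dimensions each, and one previous decoder state.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_prev = [2.0, 0.0]
a_l, r_l = attend(q_prev, h)
```

Nothing in this computation forces the weights to move forward in time from one output letter to the next, which is the lack of a monotonic constraint that the text refers to.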

Figure 1: Parallel Encoder

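The use of a joint CTC-attention model with the MTL approach improves performance in the ASR task and reduces irregular alignments during training and inference. This MTL objective maximizes the logarithmic linear combination of the CTC and attention objectives:

$$\mathcal{L}_{\mathrm{MTL}} = \lambda \log p_{\mathrm{ctc}}(C \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(C \mid X)$$

where $p_{\mathrm{ctc}}$ and $p_{\mathrm{att}}$ denote the CTC and attention posteriors, and $\lambda$ is a tunable parameter with values $0 \le \lambda \le 1$.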
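The interpolation between the two objectives can be written in a few lines. The sketch below assumes the CTC and attention log-likelihoods for one utterance are already available as plain floats; the weight `lam` plays the role of the tunable parameter between 0 and 1, and the value 0.5 as well as the toy likelihoods are arbitrary choices for illustration, not settings reported in the paper.

```python
import math


def mtl_objective(log_p_ctc, log_p_att, lam):
    """Interpolate the CTC and attention log-likelihoods with weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * log_p_ctc + (1.0 - lam) * log_p_att


# Toy log-likelihoods for one utterance; lam = 0.5 is an arbitrary illustrative value.
log_p_ctc = math.log(0.02)
log_p_att = math.log(0.08)
objective = mtl_objective(log_p_ctc, log_p_att, lam=0.5)
loss = -objective  # training minimizes the negative objective
```

Setting the weight to 0 recovers a pure attention objective and setting it to 1 recovers pure CTC, so the weight controls how strongly the CTC branch regularizes the shared encoder.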

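As the joint CTC-attention model considers the CTC probabilities during inference, the model can find a better alignment of the hypothesis to the input speech. During inference, an RNN-LM ($p_{\mathrm{lm}}$) trained separately is integrated using a scaling factor for the log probabilities. Then, the most probable character sequence $\hat{C}$ is obtained as follows:

$$\hat{C} = \arg\max_{C \in \mathcal{U}^{*}} \left\{ \lambda \log p_{\mathrm{ctc}}(C \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(C \mid X) + \gamma \log p_{\mathrm{lm}}(C) \right\}$$

where each hypothesis $C$ ends with <eos> (eos: the end-of-sentence symbol), $\gamma$ is the scaling factor in the log probability domain, and $p_{\mathrm{lm}}(C)$ is the LM probability computed as $\prod_{l} p_{\mathrm{lm}}(c_l \mid c_1, \dots, c_{l-1})$.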
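The effect of the LM scaling factor can be illustrated by rescoring a small list of complete hypotheses. This is a simplification under stated assumptions: an actual decoder would apply the combined score inside a beam search over partial hypotheses, whereas the sketch only ranks finished ones. The hypothesis texts and all log probabilities are invented, and `lam` and `gamma` stand for the CTC weight and the LM scaling factor.

```python
def joint_score(log_p_ctc, log_p_att, log_p_lm, lam, gamma):
    """Combine CTC, attention, and LM log probabilities for one hypothesis."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att + gamma * log_p_lm


def best_hypothesis(hypotheses, lam, gamma):
    """Pick the hypothesis with the highest combined score."""
    return max(
        hypotheses,
        key=lambda h: joint_score(h["ctc"], h["att"], h["lm"], lam, gamma),
    )


# Hypothetical n-best list with made-up log probabilities.
nbest = [
    {"text": "turn on the light", "ctc": -4.0, "att": -3.0, "lm": -5.0},
    {"text": "turn on the lied", "ctc": -3.5, "att": -3.2, "lm": -9.0},
]
without_lm = best_hypothesis(nbest, lam=0.5, gamma=0.0)
with_lm = best_hypothesis(nbest, lam=0.5, gamma=0.3)
```

In this toy case the acoustically preferred but linguistically unlikely hypothesis wins when the LM is switched off, and the LM term reverses the ranking once `gamma` is positive.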

3 Adaptation for Multichannel ASR in Noisy environments

The idea of our model is to use a parallel deep CNN encoder with residual connections, batch renormalization, and a multilevel RNN-LM network as an extension for a joint CTC-attention end-to-end ASR with multichannel input. In the next subsections, we describe each individual extension in detail.

3.1 Parallel Multichannel Encoder

To boost the accuracy of the joint CTC-attention model applied in the fifth CHiME challenge, we employ both Kinect and binaural microphone arrays, supplied with the corpus, during training using a parallel multichannel encoder (Fig. 1). The multichannel encoder comprises two CNNs, each of which processes one array during a minibatch step. During decoding, only the CNN encoder for the Kinect array is used, since the binaural array cannot be used in the distant ASR scenario. Unlike training solely with a single channel or with multichannel input from the Kinect array, the use of the binaural array enriches the possible input feature combinations, regularizes the network training, and therefore improves the model performance.

3.2 Residual Connections

Using residual (i.e., skip) connections presents several benefits. Skip connections improve the back-propagation of the gradient to the bottom layers, thus, easing the training on very deep networks [26]. Studies showed that residual or skip connections eliminate the overlaps, consistent deactivation, and linear dependence singularities of nodes in a neural network [27].

Let $H(x)$ be the learned mapping of a network. Then, the network can also learn the mapping $F(x) = H(x) - x$ for a given input $x$. Residual learning is then denoted as follows:

$$H(x) = F(x) + x$$
Residual learning is implemented in any feedforward neural network using a skip connection (Fig. 1), which is presented as an identity mapping. A network can be trained end-to-end with this implementation using any deep learning framework. In practice, this implementation improves the performance of the model; however, it increases the computing time.

In this work, residual learning is implemented using three convolutional layers, namely, two convolutional layers with a kernel filter size of $3 \times 3$ to calculate $F(x)$ and one with a kernel filter size of $1 \times 1$, which is used as the skip connection.
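The structure of a residual block, with either an identity shortcut or a learned shortcut, can be sketched independently of any deep learning framework. In the example below the residual branch and the shortcut are plain Python functions on lists of floats; `toy_branch` stands in for the stacked convolutions and `toy_projection` for a convolution on the skip path. Both are hypothetical placeholders, so this shows only how the two paths are summed, not the actual layers of the model.

```python
def residual_block(x, residual_fn, skip_fn=None):
    """Return F(x) + skip(x); the skip path defaults to the identity mapping."""
    shortcut = x if skip_fn is None else skip_fn(x)
    fx = residual_fn(x)
    if len(fx) != len(shortcut):
        raise ValueError("residual and shortcut outputs must have the same size")
    return [a + b for a, b in zip(fx, shortcut)]


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def toy_branch(x):
    """Toy residual branch standing in for the stacked convolutions."""
    return relu([0.5 * a - 1.0 for a in x])


def toy_projection(x):
    """Toy shortcut standing in for a learned projection (here a fixed scaling)."""
    return [2.0 * a for a in x]


x = [1.0, 2.0, 4.0]
identity_out = residual_block(x, toy_branch)
projected_out = residual_block(x, toy_branch, toy_projection)
```

A learned shortcut is typically needed when the residual branch changes the number of channels, because the two paths must have matching shapes before they are added; it also adds parameters and computation relative to the identity shortcut.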

3.3 Batch Renormalization

Batch normalization has become a standard implementation for deep neural networks [28]. A model implemented with batch normalization is trained with moving averages of the mean and variance of the mini-batch. Moving averages are used to avoid dependence of the normalized activations for a given input sample on the other samples of the mini-batch. In addition, the mean and variance are computed over all training data to employ them for inference. However, the use of the mean and variance has a significant drawback when mini-batches with few samples are employed [22].

Batch renormalization [22] applies a per-dimension affine transformation to the normalized activations. The statistical differences between mini-batches are corrected by fixed parameters, ensuring that the computed activations depend only on a single example, and thus the performance for models trained with small mini-batches is improved. Also, batch renormalization employs the overall calculated mean and variance in the training process. During training, the layers above observe the same activations that would be generated for inference, unlike batch normalization, which uses the overall mean and variance only for inference.
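The per-dimension correction can be illustrated for a single feature. The sketch below follows the batch renormalization formulation of [22]: the mini-batch statistics are used for normalization, and two correction terms, called `r` and `d` here, map the result onto the moving statistics. Several assumptions are made for brevity: the learned scale and shift and the moving-average update are omitted, the clipping limits `r_max` and `d_max` are illustrative defaults, and the toy mini-batch is invented.

```python
import math


def clip(v, lo, hi):
    """Clamp v to the interval [lo, hi]."""
    return max(lo, min(hi, v))


def batch_renorm_train(batch, mu, sigma, r_max=3.0, d_max=5.0, eps=1e-5):
    """Normalize one feature over a mini-batch using batch renormalization.

    mu and sigma are the moving mean and standard deviation. In a real
    framework r and d are treated as constants during back-propagation.
    """
    n = len(batch)
    mu_b = sum(batch) / n
    sigma_b = math.sqrt(sum((x - mu_b) ** 2 for x in batch) / n + eps)
    r = clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = clip((mu_b - mu) / sigma, -d_max, d_max)
    return [(x - mu_b) / sigma_b * r + d for x in batch]


def batch_norm_inference(batch, mu, sigma):
    """Inference-time normalization with the moving statistics."""
    return [(x - mu) / sigma for x in batch]


# A small mini-batch whose statistics differ from the moving averages.
batch = [1.0, 2.0, 6.0]
mu, sigma = 2.0, 1.5
train_out = batch_renorm_train(batch, mu, sigma)
infer_out = batch_norm_inference(batch, mu, sigma)
```

As long as neither correction term is clipped, the training-time output coincides with the inference-time output, which is the property described in the paragraph above.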

To boost the accuracy of the joint model, we implement the model with batch renormalization in the CNN layers (Fig. 1). This implementation improves the performance of the proposed models, yielding an additional absolute WER reduction of 0.2%.

3.4 Multilevel RNN-LM

Prior studies have shown that integrating the joint CTC-attention model with a character-based recurrent neural network language model (RNN-LM) improves the recognition accuracy [13]. A word-based LM suffers from the out-of-vocabulary (OOV) problem, unlike a character-based LM, which has the advantage of open-vocabulary ASR [23]. However, it is difficult for a character-based LM to model linguistic constraints across a long sequence of characters. In a previous study [23], this problem was overcome by implementing a multilevel LM and combining it with the decoder network. The multilevel LM first ranks the hypotheses using the character-based LM, and then the word-based LM rescores known words. The OOV score is provided by the character-based LM.
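The division of labor between the two LMs can be sketched with toy models. The example below is a strong simplification and not the method of [23]: it uses context-free toy probabilities in place of LSTM LMs, scores complete word sequences rather than partial hypotheses inside the decoder, and leaves out any OOV penalty. The vocabulary, the probabilities, and the function names are all hypothetical.

```python
import math

# Hypothetical toy LMs: real systems would use context-dependent LSTM LMs.
WORD_LOGP = {"open": math.log(0.2), "the": math.log(0.5), "door": math.log(0.1)}
CHAR_LOGP = math.log(1.0 / 27.0)  # uniform over 26 letters plus a word boundary


def char_lm_score(word):
    """Character-level log probability of a word, including its boundary symbol."""
    return (len(word) + 1) * CHAR_LOGP


def multilevel_score(words):
    """Score known words with the word LM and OOV words with the character LM."""
    total = 0.0
    for w in words:
        if w in WORD_LOGP:
            total += WORD_LOGP[w]
        else:
            total += char_lm_score(w)
    return total


known = multilevel_score(["open", "the", "door"])
with_oov = multilevel_score(["open", "the", "fridge"])
```

The word "fridge" is outside the toy vocabulary, so its score falls back to the character-level model instead of being rejected, which is the open-vocabulary behavior the multilevel LM is meant to preserve.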

4 Experimental Setup

We used the fifth CHiME challenge ASR benchmarks to show the effectiveness of the proposed extensions for the joint CTC-attention model. The fifth CHiME challenge comprises tasks of conversational ASR employing distant multi-microphones in real home environments [14]. The speech material consists of natural conversational speech recorded with six Kinect microphone arrays and four binaural microphone pairs. It comprises a total of 40 h of training data, 4 h of development data, and 5 h of evaluation data. The challenge features two tracks, namely, the single-array track and the multiple-array track. We considered the single-array track.

We evaluated the model trained with subsets of different sizes depending on the number of channels. A subset of 275K utterances selected randomly from both Kinect and binaural arrays was used for training baseline models with a single channel. For multichannel input, we compared the use of the Kinect array only with the combination of the Kinect and binaural arrays. The Kinect array yielded around 375K utterances. When this array was combined with the binaural array, around 480K utterances were obtained. In addition, we evaluated the results of augmenting the data with white noise added to the binaural array, which yielded around 560K utterances for training.

The baseline joint model architecture follows a setup similar to that adopted previously [13]. The input features for all models were 80-dimensional log-mel filter bank coefficients with pitch features computed every 10 ms. The joint model comprised an encoder of four convolutional layers motivated by VGG [29] (called VGG), followed by six stacked bidirectional long short-term memory (BLSTM) layers. The convolutional layers had a kernel filter size of $3 \times 3$, and the BLSTM layers each had 320 cell units. The decoder network had a 1-layer LSTM with 300 cells and a CTC network. The attention network employed location-based attention [25], where 10 centered convolution filters of width 100 were used to extract the convolutional features. The character-based and word-based LMs were trained using corpus transcriptions [23]. The character-based LM was built as a 2-layer LSTM with 650 units trained with ADAM optimization [30]. The word-based LM was built as a 1-layer LSTM with 650 units trained with stochastic gradient descent optimization and a word vocabulary of 5K. The OOV rate was 2.69% for the training set and 2.87% for the development set.

The joint model was optimized with MTL, the AdaDelta algorithm [31], and gradient clipping [32]. The model was implemented by using the Chainer deep learning framework [33] in the ESPnet toolkit [21]. Unless otherwise indicated, the model was trained for 15 epochs using a mini-batch of 25 for input lengths less than or equal to 750 frames using four NVIDIA K80 GPUs.

5 Experiments

Method                  Channels  Single-array WER (%)  Binaural WER (%)
End-to-end baseline     1         94.7                  67.2
Joint model (a)         1         90.8                  61.5
Joint model 512 (b)     1         89.2                  61.1
Kinect Array (c)        4         88.3                  -
Parallel Encoder (d)    4+2       85.4                  55.6
Table 1: Comparison of overall WER for systems tested on the development set.

Method                  Channels  Single-array WER (%)  Binaural WER (%)
VGG                     4+2       85.4                  55.6
RES                     4+2       85.1                  55.8
ResBRN                  4+2       85.0                  54.4
Table 2: Comparison of CNN architectures tested on the development set.

5.1 Parallel MultiChannel Encoder

Table 1 lists the WER for the proposed multichannel parallel encoder and the end-to-end baseline for the fifth CHiME challenge. For 1-channel input, we employed the beamformed data from the reference microphone of the development set. We achieved a 3.9% absolute reduction in WER after using a character-based LM trained with the ADAM optimizer (a). An additional 0.3% reduction was obtained after increasing the number of cell units of the BLSTM to 512 (b).

For the multichannel input, we employed data without any additional preprocessing, and the number of BLSTM layers was reduced to three because of memory limitations. The number of cell units of the BLSTM was maintained at 512. The use of the Kinect array as input reduced the WER by 0.9% compared to the WER of the best 1-channel model. In addition, a reduction of 3.6% was achieved for the parallel encoder based on the original VGG.

5.2 Residual Connections and Batch Renormalization

Table 2 lists the WER for the parallel encoder (VGG) implemented with residual connections (RES) and batch renormalization (ResBRN). We observed that the residual connections resulted in an additional absolute reduction of 0.3% in the Single-Array Track WER. After training the residual connections with batch renormalization, the joint model provides an additional reduction of 0.1% and 1.4% on the Single-Array Track and binaural tasks, respectively.

5.3 Multilevel LM

Table 3 lists the WER for the multilevel LM used with a VGG encoder and compares it to that of the end-to-end baseline. For 1-channel input, an absolute reduction of 5.9% was achieved in the WER. The use of parallel encoder resulted in an additional 3.5% improvement.

5.4 Data Augmentation

In addition to the abovementioned results, we finally report the WER for a model with a parallel encoder trained with augmented data. For these experiments, we used a character-based LM. The augmented data were obtained by adding simulated white noise to the binaural array. The signal-to-noise ratio was randomly selected in the range of 7 to 20 dB.
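A white noise augmentation step of this kind can be sketched as follows. The example draws a signal-to-noise ratio between 7 and 20 dB, scales Gaussian noise to match it, and adds the noise to a clean signal. Several details are assumptions, since the text does not specify them: the SNR is drawn uniformly, the noise is zero-mean Gaussian, and a synthetic 440 Hz tone at 16 kHz stands in for a binaural recording. The function names are illustrative.

```python
import math
import random


def add_white_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise scaled to the requested SNR in dB."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that 10 * log10(signal_power / scaled_noise_power) == snr_db.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR of a noisy signal relative to its clean version."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


rng = random.Random(0)
# One second of a 440 Hz tone at 16 kHz stands in for a binaural recording.
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(16000)]
snr_db = rng.uniform(7.0, 20.0)  # SNR range reported in the text
noisy = add_white_noise(clean, snr_db, rng)
```

Scaling by the measured noise power, rather than by its expected value, makes each augmented utterance hit the drawn SNR exactly, which keeps the augmentation level easy to audit.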

The data in Table 4 show that the augmented data work when noise is added to the binaural array. We compared the results with those for models trained with dropout added to the convolutional layer. Overall, our final model trained with white noise performs better, providing absolute improvements of 11.9% and 8.9% over the end-to-end and GMM baselines, respectively. It is also close to the state-of-the-art lattice-free MMI (LF-MMI) baseline without using any phonemic information or finite state transducer decoding.
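The absolute improvements quoted above follow directly from the single-array development-set WERs reported in Tables 1 and 4. The short check below reproduces the arithmetic; the variable names are illustrative, and the remaining gap to the LF-MMI baseline is derived from the same table values rather than stated in the text.

```python
# Single-array development-set WERs (%) copied from Tables 1 and 4.
wer_end_to_end_baseline = 94.7
wer_gmm_baseline = 91.7
wer_lf_mmi_baseline = 81.3
wer_resbrn_augmented = 82.8

# Absolute differences, rounded to one decimal to hide floating-point noise.
gain_over_end_to_end = round(wer_end_to_end_baseline - wer_resbrn_augmented, 1)
gain_over_gmm = round(wer_gmm_baseline - wer_resbrn_augmented, 1)
gap_to_lf_mmi = round(wer_resbrn_augmented - wer_lf_mmi_baseline, 1)
```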

Method                  Channels  Single-array WER (%)  Binaural WER (%)
End-to-end baseline     1         94.7                  67.2
1-Channel               1         88.8                  59.8
Parallel Encoder        4+2       85.3                  55.1
Table 3: Effectiveness of the multilevel LM for the VGG encoder tested on the development set.

Method                  Channels  Single-array WER (%)  Binaural WER (%)
GMM [14]                1         91.7                  72.8
LF-MMI TDNN [14]        1         81.3                  47.9
VGG                     4+2       84.6                  54.4
RES + Dropout           4+2       83.8                  64.0
RES                     4+2       83.0                  52.9
ResBRN                  4+2       82.8                  51.8
Table 4: White noise data augmentation for the binaural microphone. Comparison of overall WER for systems tested on the development set.

6 Conclusions

In this paper, we present a study of extensions for a joint CTC-attention model based on residual learning, batch renormalization, multilevel LM, and white noise augmentation. These extensions improve the performance of end-to-end models in everyday environment ASR, resulting in a WER reduction of 11.9%. The models showed improvements over the baseline even when no additional preprocessing (such as beamforming) was performed for the input.