Multi-Stream End-to-End Speech Recognition

06/17/2019 ∙ by Ruizhi Li, et al. ∙ MERL

Attention-based methods and Connectionist Temporal Classification (CTC) networks have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR with parallel streams represented by separate encoders aiming to capture diverse information. On top of the regular attention networks, the Hierarchical Attention Network (HAN) is introduced to steer the decoder toward the most informative encoders. A separate CTC network is assigned to each stream to force monotonic alignments. Two representative frameworks are proposed and discussed: the Multi-Encoder Multi-Resolution (MEM-Res) framework and the Multi-Encoder Multi-Array (MEM-Array) framework. In the MEM-Res framework, two heterogeneous encoders with different architectures, temporal resolutions and separate CTC networks work in parallel to extract complementary information from the same acoustics. Experiments are conducted on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate (WER) reductions of 18.0-32.1% and a WER of 3.6% on the WSJ eval92 test set. The MEM-Array framework aims at improving far-field ASR robustness using multiple microphone arrays, each of which is processed by a separate encoder. Compared with the best single-array results, the proposed framework achieves relative WER reductions of 3.7% and 9.7% on the AMI and DIRHA multi-array corpora, respectively, and also outperforms conventional fusion strategies.




I Introduction

Recent advancements in deep neural networks have enabled several practical applications of automatic speech recognition (ASR) technology. The main paradigm for an ASR system is the so-called hybrid approach, which involves training a Deep Neural Network (DNN) to predict context-dependent phoneme states (or senones) from the acoustic features. During inference, the predicted senone distributions are provided as inputs to the decoder, which combines them with a lexicon and a language model to estimate the word sequence. Despite the impressive accuracy of the hybrid system, it requires a hand-crafted pronunciation dictionary based on linguistic assumptions, extra training steps to derive context-dependent phonetic models, and text preprocessing such as tokenization for languages without explicit word boundaries. Consequently, it is quite difficult for non-experts to develop ASR systems for new applications, especially for new languages.

End-to-End (E2E) speech recognition approaches are designed to directly output word or character sequences from the input audio signal. Such a model subsumes several disjoint components of the hybrid ASR model (acoustic model, pronunciation model, language model) into a single neural network. As a result, all the components of an E2E model can be trained jointly to optimize a single objective. Two dominant end-to-end architectures for ASR are Connectionist Temporal Classification (CTC) [2, 3, 4] and attention-based encoder-decoder [5, 6] models. While CTC efficiently addresses a sequential problem (mapping speech vectors to a word sequence) by using dynamic programming to avoid the alignment pre-construction step, it assumes conditional independence of the label sequence given the input. The attention model does not assume conditional independence of the label sequence, resulting in a more flexible model. However, attention-based methods encounter difficulty in satisfying the monotonic property of speech-label alignments. To alleviate this issue, a joint CTC/Attention framework was proposed in [7, 8, 9]. The joint model was shown to provide state-of-the-art E2E results on several benchmark datasets [9].

In this work, we propose a multi-stream architecture within the joint CTC/Attention framework. Multi-stream paradigm was successfully used in hybrid ASR [10, 11, 12, 13] motivated by observations of multiple parallel processing streams in the human speech processing cognitive system. For instance, forming streams by band-pass filtering the signal with stream dropout was proposed to deal with noise robustness scenario mimicking human auditory process [12, 10]. However, multi-stream approaches have not been investigated for E2E ASR models. This paper is an extension of our prior study [14], which successfully applied the proposed multi-stream concept to multi-array ASR. In this work, we present a general formulation to the multi-stream framework and two practical E2E applications (MEM-Res and MEM-Array models) with additional experiments and discussions. The framework has the following highlights:

  1. Multiple Encoders in parallel acting as information streams. Two ways of forming the streams have been proposed in this work according to different applications: Parallel encoders with different architectures and temporal resolutions perform on the same acoustics, which we refer to as Multi-Encoder Multi-Resolution (MEM-Res) model; Parallel input speech from multiple microphone arrays are fed into separate but identical encoders, which we refer to as Multi-Encoder Multi-Array (MEM-Array) model.

  2. The Hierarchical Attention Network (HAN) [15, 16, 17] is introduced to dynamically combine knowledge from parallel streams. Several studies have shown that attention-based models benefit from having multiple attention mechanisms [18, 19, 20, 15, 16, 17]. Inspired by the advances in hierarchical attention mechanisms for document classification [15], multi-modal video description [16] and machine translation [17], we adapt HAN into our multi-stream model. The encoder that carries the most discriminative information for the prediction can dynamically receive a higher weight. On top of the per-encoder attention mechanism, stream attention is employed to steer the decoder toward the stream that carries more task-related information.

  3. Each encoder is associated with a separate CTC network to guide the frame-wise alignment process for each stream to potentially achieve better performance.

In the MEM-Res model, two parallel encoders with heterogeneous structures are mutually complementary in characterizing the speech signal. In E2E ASR, the encoder acts as an acoustic model providing higher-level features for decoding. The Bi-directional Long Short-Term Memory (BLSTM) network has been widely used as the encoder architecture due to its ability to model temporal sequences and their long-term dependencies; the deep Convolutional Neural Network (CNN) was introduced to model local spectral correlations and reduce spectral variations in the E2E framework [8, 21]. The encoder architecture combining a CNN with recurrent layers was suggested to address the limitations of the LSTM. While temporal subsampling in the RNN and max-pooling in the CNN aim to reduce the computational complexity and enhance robustness, subsampling is likely to result in a loss of temporal resolution. The MEM-Res framework proposes to combine both RNN-based and CNN-RNN-based networks to form a complementary multi-stream encoder.

In addition to MEM-Res, the MEM-Array model is another application of our multi-stream E2E framework. Far-field ASR using multiple microphone arrays has become an important strategy in the speech community for smart-speaker scenarios in meeting rooms or home environments [22, 23, 24]. Individually, a microphone array is able to bring a substantial performance improvement with algorithms such as beamforming [25] and masking [26]. However, what kind of information can be extracted from each array and how to make multiple arrays cooperate are still open challenges. Time synchronization among arrays is one of the main challenges that most distributed setups face [27]. Without any prior knowledge of speaker-array distance or video monitoring, it is difficult to estimate which array carries more reliable information or is less corrupted.

According to the reports from the CHiME-5 challenge [24], which targets the problem of multi-array conversational speech recognition in home environments, the common ways of utilizing multiple arrays in the hybrid ASR system are selecting the array with the highest Signal-to-Noise/Signal-to-Interference Ratio (SNR/SIR) for decoding [28] or fusing the decoding results by voting for the most confident words [29], e.g. ROVER [30]. Similar to our previous work [31][32], combination using the classifier’s posterior probabilities followed by lattice generation has been an alternative approach [33, 34, 13]. The posteriors from a well-trained classifier decorrelate the input features, but preserve more distinctive speech information than the words obtained after the full decoding stage. In terms of the combination strategy, ASR performance monitors have been designed [35] to generate stream confidences that guide the linear fusion of array streams. While most E2E ASR studies address single-channel tasks or multi-channel tasks from one microphone array [36, 37, 38, 39], the multi-array scenario is still unexplored within the E2E framework. The MEM-Array model is proposed to address this problem. The output of each microphone array is modeled by a separate encoder. Multiple encoders with the same configuration act as the acoustic models for individual arrays. Note that we integrate beamformed signals instead of using all multi-channel signals in the multi-stream framework, which is computationally efficient. This design can also make use of powerful beamforming algorithms.

This paper is organized as follows: Section II explains the joint CTC/Attention model. The proposed multi-stream framework, including MEM-Res and MEM-Array, is described in Section III. Experiments with results and several analyses for the MEM-Res and MEM-Array models are presented in Section IV and Section V, respectively. Finally, conclusions are drawn in Section VI.

II Joint CTC/Attention Mechanism

In this section, we review the joint CTC/attention architecture, which takes advantage of both CTC and attention-based end-to-end ASR approaches during training and decoding.

II-A Connectionist Temporal Classification (CTC)

CTC enforces a monotonic mapping from a $T$-length speech feature sequence, $X = \{x_t \in \mathbb{R}^D \mid t = 1, \dots, T\}$, to an $L$-length letter sequence, $C = \{c_l \in \mathcal{U} \mid l = 1, \dots, L\}$. Here $x_t$ is a $D$-dimensional acoustic vector at frame $t$, and $c_l$ is the letter at position $l$, drawn from $\mathcal{U}$, a set of distinct letters.

The CTC network introduces a many-to-one function from frame-wise latent variable sequences, $Z = \{z_t \in \mathcal{U} \cup \{\text{blank}\} \mid t = 1, \dots, T\}$, to letter predictions with shorter lengths. With several conditional independence assumptions, the posterior distribution, $p(C|X)$, is represented as follows:

$$p(C|X) \approx \sum_{Z} \prod_{t=1}^{T} p(z_t|X) \triangleq p_{ctc}(C|X), \qquad (1)$$

where $p(z_t|X)$ is a frame-wise posterior distribution, which is often modeled using a BLSTM. We also define the CTC objective function $p_{ctc}(C|X)$. CTC has the benefits that it enforces the monotonic behavior of speech-label alignments and avoids both the HMM/GMM construction step and the preparation of a pronunciation dictionary.
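The sum over all frame-wise latent sequences that collapse to a given letter sequence can be evaluated efficiently with the standard CTC forward recursion. The following is a minimal sketch under stated assumptions, not the implementation used in this work: the blank symbol is assumed to have index 0, the frame-wise posteriors are given as plain nested lists, and the function names are hypothetical. A brute-force enumeration is included only to check the recursion on tiny inputs.

```python
import math
from itertools import product

BLANK = 0  # index of the CTC blank symbol (an assumption of this sketch)


def ctc_probability(posteriors, labels):
    """Return p_ctc(C|X) via the forward algorithm.

    posteriors[t][k] is the frame-wise posterior p(z_t = k | X);
    labels is the letter sequence C as a list of non-blank indices.
    """
    # Interleave blanks: C = (c1, c2) becomes (blank, c1, blank, c2, blank).
    ext = [BLANK]
    for c in labels:
        ext += [c, BLANK]
    S, T = len(ext), len(posteriors)

    alpha = [0.0] * S
    alpha[0] = posteriors[0][ext[0]]
    if S > 1:
        alpha[1] = posteriors[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different letters.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * posteriors[t][ext[s]]
        alpha = new

    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def collapse(path):
    """Many-to-one CTC mapping: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for z in path:
        if z != prev and z != BLANK:
            out.append(z)
        prev = z
    return out


def ctc_probability_bruteforce(posteriors, labels):
    """Sum prod_t p(z_t|X) over every path Z that collapses to the labels."""
    num_symbols = len(posteriors[0])
    total = 0.0
    for path in product(range(num_symbols), repeat=len(posteriors)):
        if collapse(path) == list(labels):
            total += math.prod(posteriors[t][z] for t, z in enumerate(path))
    return total
```

The brute-force version enumerates every path and is exponential in the number of frames, which is why the recursion is used in practice; both enforce the same monotonic left-to-right alignment.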

II-B Attention-based Encoder-Decoder

As one of the most commonly used sequence modeling techniques, the attention-based framework selectively encodes an audio sequence of variable length into a fixed-dimension vector representation, which is then consumed by the decoder to produce a distribution over the outputs. We can directly estimate the posterior distribution $p(C|X)$ using the chain rule:

$$p(C|X) = \prod_{l=1}^{L} p(c_l \mid c_1, \dots, c_{l-1}, X) \triangleq p_{att}(C|X), \qquad (2)$$

where $p_{att}(C|X)$ is defined as the attention-based objective function. Typically, a BLSTM-based encoder transforms the speech vectors $X$ into frame-wise hidden vectors $h_t$. If the encoder subsamples the input by a factor $s$, there will be $T/s$ time steps in $H = \{h_1, \dots, h_{T/s}\}$. The letter-wise context vector $r_l$ is formed as a weighted summation of the frame-wise hidden vectors $H$ using a content-based attention network [6]:

$$r_l = \sum_{t=1}^{T/s} a_{lt} h_t, \qquad (3)$$

$$a_{lt} = \mathrm{ContentAttention}(q_{l-1}, h_t), \qquad (4)$$

where $a_{lt}$ is the attention weight, a soft alignment of $h_t$ for output $c_l$, and $q_{l-1}$ is the previous decoder state. ContentAttention($\cdot$) is described as follows:

$$e_{lt} = g^{\top} \tanh(\mathrm{Lin}(q_{l-1}) + \mathrm{LinB}(h_t)), \qquad (5)$$

$$a_{lt} = \mathrm{Softmax}(\{e_{lt}\}_{t=1}^{T/s}), \qquad (6)$$

where $g$ is a learnable vector parameter and $\{e_{lt}\}_{t=1}^{T/s}$ is a $T/s$-dimensional vector. LinB($\cdot$) and Lin($\cdot$) represent linear transformations with and without a bias term, respectively.

In comparison to CTC, not requiring conditional independence assumptions is one of the advantages of using the attention-based model. However, the attention is too flexible to satisfy the monotonic alignment constraint in speech recognition tasks.
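To make the content-based attention step concrete, the sketch below computes the attention weights over the frame-wise hidden vectors and the resulting letter-wise context vector for one decoding step. It is an illustration under stated assumptions: vectors and matrices are plain Python lists, and the parameter names (`W_q`, `W_h`, `b_h`, `g`) and function names are hypothetical rather than taken from any toolkit.

```python
import math


def linear(W, x, b=None):
    """Linear map W x, optionally with a bias b; W is a list of rows."""
    y = [sum(w * v for w, v in zip(row, x)) for row in W]
    if b is not None:
        y = [yi + bi for yi, bi in zip(y, b)]
    return y


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def content_attention(q_prev, H, W_q, W_h, b_h, g):
    """Return (attention weights over the rows of H, context vector).

    q_prev is the previous decoder state; H holds the frame-wise hidden
    vectors after encoder subsampling.
    """
    proj_q = linear(W_q, q_prev)                  # Lin(q_{l-1}), no bias
    energies = []
    for h_t in H:
        proj_h = linear(W_h, h_t, b_h)            # LinB(h_t), with bias
        hidden = [math.tanh(a + b) for a, b in zip(proj_q, proj_h)]
        energies.append(sum(gi * hi for gi, hi in zip(g, hidden)))
    weights = softmax(energies)                   # one weight per frame
    dim = len(H[0])
    context = [sum(w * h_t[d] for w, h_t in zip(weights, H)) for d in range(dim)]
    return weights, context
```

Nothing in this computation constrains the weights to move left to right over the frames from one output letter to the next, which is the flexibility that the CTC branch is meant to counteract.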

II-C Joint CTC/Attention

The joint CTC/Attention architecture benefits from both CTC and attention-based models since the attention-based encoder-decoder is trained together with CTC within the Multi-Task Learning (MTL) framework. The encoder is shared between the CTC and attention-based models. The objective function to be maximized is a logarithmic linear combination of the CTC and attention objectives, i.e., $p_{ctc}(C|X)$ and $p_{att}^{\dagger}(C|X)$:

$$\mathcal{L}_{MTL} = \lambda \log p_{ctc}(C|X) + (1 - \lambda) \log p_{att}^{\dagger}(C|X), \qquad (7)$$

where $\lambda$ is a tunable scalar satisfying $0 \le \lambda \le 1$. $p_{att}^{\dagger}(C|X)$ is an approximated letter-wise objective where the probability of a prediction is conditioned on previous true labels.

During inference, the joint CTC/Attention model performs a label-synchronous beam search. The most probable letter sequence $\hat{C}$ given the speech input $X$ is computed according to

$$\hat{C} = \arg\max_{C \in \mathcal{U}^{*}} \{\lambda \log p_{ctc}(C|X) + (1 - \lambda) \log p_{att}(C|X) + \gamma \log p_{lm}(C)\}, \qquad (8)$$

where the external RNN-LM probability $\log p_{lm}(C)$ is added with a scaling factor $\gamma$. For each partial hypothesis $h$ in the beam search, the joint score, i.e., the log probability of the hypothesized label sequence, can be computed as

$$\alpha(h) = \lambda \alpha_{ctc}(h) + (1 - \lambda) \alpha_{att}(h) + \gamma \alpha_{lm}(h), \qquad (9)$$

where the attention decoder score, $\alpha_{att}(h)$, can be accumulated recursively from the hypothesis scores of the previous step. For the CTC score, $\alpha_{ctc}(h)$, we utilize the CTC prefix probability, defined as the cumulative probability of all label sequences that have $h$ as their prefix [40, 41]. In this work, we use the look-ahead word-based language model [42] to give the RNN-LM score, $\alpha_{lm}(h)$. This language model enables us to decode with only a word-based model, rather than using a multi-level LM which uses a character-level LM until the identity of the word is determined.
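The way the three scores are combined for a partial hypothesis during beam search can be sketched as follows. This is an illustration only: the hypothesis dictionaries, the function names and the toy log-probabilities are hypothetical, and the CTC prefix score and the language-model score are assumed to have been computed elsewhere.

```python
def joint_score(att_logp, ctc_logp, lm_logp, ctc_weight, lm_weight):
    """Weighted sum of the CTC, attention and language-model log-scores."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return (ctc_weight * ctc_logp
            + (1.0 - ctc_weight) * att_logp
            + lm_weight * lm_logp)


def prune(hypotheses, beam_width, ctc_weight, lm_weight):
    """Keep the beam_width best partial hypotheses by joint score."""
    scored = [
        (joint_score(h["att"], h["ctc"], h["lm"], ctc_weight, lm_weight), h["text"])
        for h in hypotheses
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:beam_width]


# Toy log-probabilities (made up for illustration, not measured values).
hypotheses = [
    {"text": "the cat", "att": -1.0, "ctc": -4.0, "lm": -2.0},
    {"text": "the cap", "att": -1.5, "ctc": -2.0, "lm": -2.5},
    {"text": "thee cat", "att": -0.8, "ctc": -9.0, "lm": -6.0},
]
best_joint = prune(hypotheses, beam_width=2, ctc_weight=0.3, lm_weight=1.0)
best_attention_only = prune(hypotheses, beam_width=1, ctc_weight=0.0, lm_weight=0.0)
```

In this toy example the attention score alone prefers "thee cat", while the joint score ranks it last because its CTC and language-model scores are poor.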

III Proposed Multi-Stream Framework

The proposed multi-stream architecture is shown in Fig. 1. For ease of exposition, we focus on the two-stream architecture. Two encoders operate in parallel to capture information in different ways, followed by an attention fusion mechanism together with per-encoder CTC. An external RNN-LM is also involved during the inference step. We describe the details of each component in the following sections.

Fig. 1: The Multi-Stream End-to-End Framework.

III-A Parallel Encoders as Multi-Stream

Similar to acoustic modeling in conventional ASR, the encoder maps the audio features into higher-level feature representations for use by the CTC and attention models:

$$h^i = \mathrm{Encoder}^i(X^i), \qquad (10)$$

where the superscript $i$ denotes the index of $\mathrm{Encoder}^i$ corresponding to stream $i$, and $h^i$ is the frame-wise hidden vector sequence of stream $i$ introduced in Sec. II-B. Here $i \in \{1, \dots, N\}$, and $N$ denotes the number of streams. $X^i$ in Eq. (10) represents a $T^i$-length speech feature sequence, i.e., $X^i = \{x_t^i \mid t = 1, \dots, T^i\}$. Note that it is not mandatory to have frame-level synchronization across all streams since $T^i$, $i \in \{1, \dots, N\}$, could be different in the proposed model. Together with the stream-specific subsampling factor $s^i$, stream $i$ will have $T^i / s^i$ time instances at the encoder-output level. Rounding of $T^i / s^i$ is performed in the encoder, depending on the architecture.

For simplicity, the multi-stream model with $N = 2$ is depicted in Fig. 1, where two encoders in parallel take different input features, $X^1$ with $T^1$ frames and $X^2$ with $T^2$ frames, respectively. Each encoder operates at a different temporal resolution with subsampling factors $s^1$ and $s^2$, where subsampling could be performed in the RNN or in the max-pooling layers of the CNN.
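As a small worked example of the bookkeeping above, the sketch below computes the number of encoder-output time steps per stream. Ceiling rounding is an assumption made here for illustration only, since the rounding depends on the encoder architecture; the function name and the stream lengths are hypothetical.

```python
import math


def encoder_output_length(num_frames, subsampling_factor):
    """Number of encoder time steps for one stream.

    Assumes ceiling rounding of num_frames / subsampling_factor; the
    rounding actually applied depends on the encoder architecture.
    """
    if num_frames < 1 or subsampling_factor < 1:
        raise ValueError("num_frames and subsampling_factor must be positive")
    return math.ceil(num_frames / subsampling_factor)


# Two streams that are not frame-synchronous: (input frames, subsampling factor).
streams = {"stream_1": (103, 1), "stream_2": (98, 4)}
output_lengths = {
    name: encoder_output_length(frames, factor)
    for name, (frames, factor) in streams.items()
}
```

Because fusion happens on per-stream context vectors rather than on individual frames, the two output lengths do not need to match.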

III-B Hierarchical Attention

Since the encoders model the speech signals differently by catching acoustic knowledge in their own ways, encoder-level fusion is suitable to boost the network’s ability to retrieve the relevant information. We adapt Hierarchical Attention Network (HAN) in [15] for information fusion. The decoder with HAN is trained to selectively attend to appropriate encoder, based on the context of each prediction in the sentence as well as the higher-level acoustic features from encoders, to achieve a better prediction.

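The letter-wise context vectors, $r_l^1$ and $r_l^2$, from individual encoders are computed as follows:

$$r_l^1 = \sum_{t=1}^{T^1/s^1} a_{lt}^1 h_t^1, \qquad r_l^2 = \sum_{t=1}^{T^2/s^2} a_{lt}^2 h_t^2, \qquad (11)$$

where the attention weights $a_{lt}^i$, where $i \in \{1, 2\}$, are obtained using a content-based attention mechanism. Note that since the encoders perform downsampling, the summations in Eq. (11) run up to $T^i / s^i$ for each individual stream.

The fusion context vector $r_l$ is obtained as a convex combination of $r_l^1$ and $r_l^2$, as illustrated in the following:

$$r_l = \beta_l^1 r_l^1 + \beta_l^2 r_l^2, \qquad (12)$$

$$\beta_l^i = \mathrm{ContentAttention}(q_{l-1}, r_l^i), \quad i \in \{1, 2\}. \qquad (13)$$

The stream-level attention weight, $\beta_l^i$, where $i \in \{1, 2\}$ and $\sum_i \beta_l^i = 1$, is estimated according to the previous decoder state $q_{l-1}$ and the context vector $r_l^i$ from an individual encoder, as described in Eq. (13). The fusion context vector is then fed into the decoder to predict the next letter.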
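The two levels of attention described above can be sketched for one decoding step as follows. This is a simplified illustration, not the implementation used in this work. It assumes that all streams produce hidden vectors of the same dimension and that a single set of stream-level attention parameters scores every per-stream context vector; all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def content_attention(q_prev, vectors, params):
    """Weights over `vectors` from energies g^T tanh(Lin(q) + LinB(v))."""
    W_q, W_v, b_v, g = params
    proj_q = [sum(w * x for w, x in zip(row, q_prev)) for row in W_q]
    energies = []
    for v in vectors:
        proj_v = [sum(w * x for w, x in zip(row, v)) + b
                  for row, b in zip(W_v, b_v)]
        energies.append(sum(gi * math.tanh(a + c)
                            for gi, a, c in zip(g, proj_q, proj_v)))
    return softmax(energies)


def weighted_sum(weights, vectors):
    """Convex combination of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def hierarchical_attention(q_prev, streams, frame_params, stream_params):
    """Fuse per-stream context vectors with stream-level attention.

    streams[i] is the list of hidden vectors of stream i; the lists may
    have different lengths because each stream has its own resolution.
    """
    contexts = []
    for H_i, params_i in zip(streams, frame_params):
        a_i = content_attention(q_prev, H_i, params_i)   # frame-level weights
        contexts.append(weighted_sum(a_i, H_i))          # per-stream context
    beta = content_attention(q_prev, contexts, stream_params)  # stream weights
    fused = weighted_sum(beta, contexts)                 # fusion context vector
    return beta, contexts, fused
```

Adding a further stream only appends one more entry to `streams` and `frame_params`; the stream-level softmax then normalizes over the additional context vector as well.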

III-C Training and Decoding with Per-encoder CTC

In the CTC/Attention model with a single encoder, the CTC objective serves as an auxiliary task to speed up the procedure of realizing monotonic alignment and to provide a sequence-level objective. In the multi-stream framework, we introduce per-encoder CTC, where a separate CTC mechanism is active for each encoder stream during training and decoding. Sharing one set of CTC among encoders is a soft constraint that limits the potential of diverse encoders to reveal complementary information. When the encoders have different temporal resolutions and network architectures, per-encoder CTC can further align speech with labels in a monotonic order and customize the sequence modeling of individual streams.

During training and decoding, we follow Eq. (7) and the decoding rule of Sec. II-C, with the CTC objective changed in the following way:

$$\log p_{ctc}(C|X) = \frac{1}{N} \sum_{i=1}^{N} \log p_{ctc^i}(C|X^i), \qquad (14)$$

where $N$ is the number of streams and the joint CTC loss is the average of the per-encoder CTCs. In the beam search, the CTC prefix score of the hypothesized sequence $h$ is altered as follows:

$$\alpha_{ctc}(h) = \frac{1}{N} \sum_{i=1}^{N} \alpha_{ctc^i}(h), \qquad (15)$$

where equal weight is assigned to each CTC network.
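The equal-weight averaging of per-encoder CTC scores can be sketched as follows. The function names and the toy log-scores are hypothetical, and the per-encoder CTC log-scores are assumed to have been computed separately for each stream.

```python
def average_ctc_score(per_encoder_log_scores):
    """Equal-weight average of the per-encoder CTC log-scores."""
    if not per_encoder_log_scores:
        raise ValueError("at least one encoder stream is required")
    return sum(per_encoder_log_scores) / len(per_encoder_log_scores)


def multistream_joint_score(att_logp, ctc_logps, lm_logp, ctc_weight, lm_weight):
    """Joint hypothesis score in which the CTC term is the per-encoder average."""
    ctc_logp = average_ctc_score(ctc_logps)
    return (ctc_weight * ctc_logp
            + (1.0 - ctc_weight) * att_logp
            + lm_weight * lm_logp)


# Toy log-scores for a two-stream model (illustrative values only).
score = multistream_joint_score(
    att_logp=-1.0, ctc_logps=[-2.0, -4.0], lm_logp=-2.0,
    ctc_weight=0.5, lm_weight=1.0,
)
```

Averaging in the log domain corresponds to taking the geometric mean of the per-encoder CTC probabilities, so a hypothesis that one CTC network considers very unlikely is penalized even if the other network scores it well.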

III-D Multi-Encoder Multi-Resolution

Fig. 2: Multi-Encoder Multi-Resolution Architecture.

As one realization of the multi-stream framework, we propose a Multi-Encoder Multi-Resolution (MEM-Res) architecture that has two encoders, one RNN-based and one CNN-RNN-based. Both encoders take the same input features in parallel while operating at different temporal resolutions, aiming to capture complementary information in the speech, as depicted in Fig. 2.

The RNN-based encoder is designed to model temporal sequences with their long-range dependencies. Subsampling in the BLSTM is often used to decrease the computational cost, but it might discard information that could be better modeled by the RNN. In MEM-Res, the BLSTM encoder has only BLSTM layers that extract the frame-wise hidden vectors $h^1$ without subsampling in any layer, i.e. $s^1 = 1$:

$$h^1 = \mathrm{Encoder}^1(X) \triangleq \mathrm{BLSTM}(X), \qquad (16)$$

where the BLSTM encoder is labeled as index $i = 1$.

The combination of CNN and RNN allows the convolutional feature extractor applied to the input to reveal local correlations in both the time and frequency dimensions. The RNN block on top of the CNN makes it easier to learn temporal structure from the CNN output, rather than modeling the raw speech features, which carry more underlying variations. The pooling layer is essential in the CNN to reduce the spatial size of the representation and to control over-fitting. In MEM-Res, we use the initial layers of the VGG net architecture [43], listed in Table I, followed by BLSTM layers as the VGGBLSTM encoder, labeled as index $i = 2$:

$$h^2 = \mathrm{Encoder}^2(X) \triangleq \mathrm{VGGBLSTM}(X). \qquad (17)$$

The two max-pooling layers, each with stride 2, downsample the input features by a factor of $s^2 = 4$ in both the temporal and spectral directions.

Convolution 2D   in = 1, out = 64, filter = 3×3
Convolution 2D   in = 64, out = 64, filter = 3×3
Maxpool 2D       patch = 2×2, stride = 2×2
Convolution 2D   in = 64, out = 128, filter = 3×3
Convolution 2D   in = 128, out = 128, filter = 3×3
Maxpool 2D       patch = 2×2, stride = 2×2
TABLE I: Initial Six-Layer VGG Configurations
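The effect of the six layers in Table I on the feature-map size can be traced with a short shape calculation. The sketch below rests on assumptions that the table does not state: the 3×3 convolutions are taken to use stride 1 with padding that preserves the map size, and odd sizes are rounded up in the pooling layers (`ceil_mode=True`). The function name is hypothetical.

```python
import math

# (kind, out_channels) in Table I order. The 3x3 convolutions are assumed to
# use stride 1 with size-preserving padding, so only pooling changes the size.
VGG_LAYERS = [
    ("conv", 64), ("conv", 64), ("pool", None),
    ("conv", 128), ("conv", 128), ("pool", None),
]


def vgg_output_shape(num_frames, num_features, in_channels=1, ceil_mode=True):
    """Return (channels, time, frequency) after the six layers of Table I."""
    rounding = math.ceil if ceil_mode else math.floor
    channels, time, freq = in_channels, num_frames, num_features
    for kind, out_channels in VGG_LAYERS:
        if kind == "conv":
            channels = out_channels
        else:
            # 2x2 max-pooling with stride 2 halves both axes.
            time, freq = rounding(time / 2), rounding(freq / 2)
    return channels, time, freq


# 100 frames of 83-dimensional features (80 filterbank + 3 pitch).
channels, time_steps, freq_bins = vgg_output_shape(100, 83)
```

Under these assumptions, 100 input frames become 25 time steps, which matches the overall subsampling factor of 4 for this encoder.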

III-E Multi-Encoder Multi-Array

In this section, we present another realization of multi-stream framework for the multi-array ASR task, i.e. Multi-Encoder Multi-Array (MEM-Array) model.

III-E1 Conventional Multi-Array ASR

In our previous work, we proposed a stream attention framework to improve far-field performance in the hybrid approach using distributed microphone array(s) [32]. Specifically, we generated more reliable Hidden Markov Model (HMM) state posterior probabilities by linearly combining the posteriors from each array stream, under the supervision of ASR performance monitors.

In general, the posterior combination strategy outperformed conventional methods, such as signal-level fusion and the word-level technique ROVER [30], in the prescribed multi-array configuration. Accordingly, stream attention weights estimated from the de-correlated intermediate features should be more reliable. We adopt this assumption in MEM-Array framework.

III-E2 Multi-Array Architecture with Stream Attention

Fig. 3: Multi-Encoder Multi-Array Architecture.

Based on the multi-stream model, the proposed MEM-Array architecture in Fig. 3 has two encoders, each mapping the speech features of a single array to higher-level representations $h^i$, where $i \in \{1, 2\}$ denotes the index of $\mathrm{Encoder}^i$ corresponding to array $i$. Note that $\mathrm{Encoder}^1$ and $\mathrm{Encoder}^2$ have the same configuration and receive parallel speech data collected from multiple microphone arrays. As introduced in Sec. III-D, CNN layers are often used together with BLSTM layers on top to extract frame-wise hidden vectors. We explore two types of encoder structures: BLSTM (RNN-based) and VGGBLSTM (CNN-RNN-based) [44]:

$$h^i = \mathrm{Encoder}^i(X^i), \quad \mathrm{Encoder}^i(\cdot) = \mathrm{BLSTM}(\cdot) \text{ or } \mathrm{VGGBLSTM}(\cdot). \qquad (18)$$

Note that the BLSTM encoders are equipped with an additional projection layer after each BLSTM layer. In both encoder architectures, a subsampling factor of $s = 4$ is applied to decrease the computational cost. Specifically, the convolution layers of the VGGBLSTM encoder downsample the input features by a factor of 4, so there is no subsampling in its recurrent layers.

In the multi-stream setting, one inherent problem is that the contribution of each stream (array) changes dynamically. In particular, when one of the streams receives corrupted audio, the network should be able to pay more attention to the other streams for the sake of robustness. Inspired by the advances in linear posterior combination [32] and hierarchical attention fusion [15, 16, 17], a stream-level fusion of the letter-wise context vectors is used in this work to achieve encoder selectivity, as introduced in Sec. III-B.

In comparison to fusion on the frame-wise hidden vectors $h_t^i$, stream-level fusion can deal with temporal misalignment between multiple arrays at the stream level. Furthermore, adding an extra microphone array can be implemented simply as an additional term in Eq. (12).

IV Experiments: MEM-Res Model

IV-A Experimental Setup

We demonstrated our proposed MEM-Res model using two datasets: WSJ1 [45] (81 hours) and CHiME-4 [46] (18 hours). In WSJ1, we used the standard configuration: “si284” for training, “dev93” for validation, and “eval92” for test. The CHiME-4 dataset is a noisy speech corpus recorded or simulated using a tablet equipped with 6 microphones in four noisy environments: a cafe, a street junction, public transport, and a pedestrian area. For training, we used both “tr05_real” and “tr05_simu” with additional WSJ1 corpora to support end-to-end training. “dt05_multi_isolated_1ch_track” was used for validation. We evaluated the real recordings with 1, 2, 6-channel in the evaluation set. The BeamformIt [47] method was applied to multi-channel evaluation. In all experiments, 80-dimensional mel-scale filterbank coefficients with additional 3-dimensional pitch features served as the input features.

Model                          et05_real_1ch   eval92
BLSTM (Single-Encoder)
  CTC                          62.7            36.4
  ATT                          50.2            20.8
  CTC+ATT                      29.2            4.6
VGGBLSTM (Single-Encoder)
  CTC                          50.6            19.1
  ATT                          42.2            17.2
  CTC+ATT                      29.6            5.6
MEM-Res (Proposed Model)
  CTC                          49.1            15.2
  ATT                          44.3            18.9
  CTC(shared)+ATT              26.8            4.4
  CTC(shared)+ATT+HAN          26.9            4.3
  CTC(per-enc)+ATT             26.6            4.1
  CTC(per-enc)+ATT+HAN         26.4            3.6
Previous Studies
  RNN-CTC [3]                  -               8.2
  Eesen [4]                    -               7.4
  Temporal LS + Cov. [48]      -               6.7
  E2E+regularization [49]      -               6.3
  Scatt+pre-emp [50]           -               5.7
  Joint e2e+look-ahead LM [42] -               5.1
  EE-LF-MMI [52]               -               4.1
TABLE II: Comparison among Single-Encoder End-to-End Models with BLSTM or VGGBLSTM as the Encoder, the MEM-Res Model and Prior End-to-End Models. (WER in %: CHiME-4 et05_real_1ch, WSJ1 eval92)

The BLSTM encoder contained four BLSTM layers, in which each layer had 320 cells in both directions followed by a 320-unit linear projection layer. The VGGBLSTM encoder combined the convolution layers with an RNN-based network that had the same architecture as the BLSTM encoder. A content-based attention mechanism with 320 attention units was used in both the encoder-level and frame-level attention mechanisms. The decoder was a one-layer unidirectional LSTM with 300 cells. We used 50 distinct labels including 26 English letters and other special tokens, i.e., punctuation marks and sos/eos.

We incorporated the look-ahead word-level RNN-LM [42], a 1-layer LSTM with 1000 cells and a 65K vocabulary, that is, a 65K-dimensional output in the Softmax layer. In addition to the original speech transcriptions, the WSJ text data with 37M words from 1.6M sentences was supplied as training data. The RNN-LM was trained separately for 60 epochs using Stochastic Gradient Descent (SGD).

The MEM-Res model was implemented using the PyTorch backend of ESPnet. Training was performed using the AdaDelta algorithm with gradient clipping on a single “GTX 1080ti” GPU. The mini-batch size was set to 15. We also applied a unigram label smoothing technique to avoid over-confident predictions. The beam width was set to 30 for WSJ1 and 20 for CHiME-4 in decoding. For models jointly trained with CTC and attention objectives, separate values of the CTC weight $\lambda$ were used for training and for decoding. The RNN-LM scaling factor $\gamma$ was the same for all experiments, except that a different value was used when decoding attention-only models.

IV-B Results

The overall experimental results on WSJ1 and CHiME-4 are shown in Table II. Compared to the best joint CTC/Attention single-encoder models, the proposed MEM-Res model with per-encoder CTC and HAN achieved relative WER improvements of 9.6% (29.2% → 26.4%) on CHiME-4 and 21.7% (4.6% → 3.6%) on WSJ1. We compared the MEM-Res model with other end-to-end approaches, and it outperformed all of the systems from previous studies. We also designed experiments with fixed encoder-level attention weights; the MEM-Res model with HAN outperformed the ones without parameterized stream attention. Moreover, per-encoder CTC consistently enhanced the performance with or without HAN. In particular, on WSJ1 the model with HAN shows a notable decrease in WER (4.3% → 3.6%) with per-encoder CTC. Our results further confirm the effectiveness of the joint CTC/Attention architecture in comparison to models with either the CTC or the attention network alone.

Data             Single-Encoder (21.9M)   Proposed Model (21.3M)
et05_real_1ch    32.2                     26.4 (18.0%)
et05_real_2ch    26.8                     21.9 (18.3%)
et05_real_6ch    21.7                     17.2 (20.8%)
eval92           5.3                      3.6 (32.1%)
TABLE III: Comparison between the MEM-Res Model and VGGBLSTM Single-Encoder Model with Similar Network Size. (WER: WSJ1, CHiME-4)

For a fair comparison, we increased the number of BLSTM layers in the VGGBLSTM encoder from 4 to 8 to train a single-encoder model of comparable size. In Table III, the MEM-Res system outperforms this single-encoder model by a significant margin with a similar number of parameters (21.3M vs. 21.9M). On CHiME-4, we evaluated the model using real test data from the 1-, 2- and 6-channel tracks, resulting in an average relative improvement of 19% over the three setups. On WSJ1, we achieved 3.6% WER on eval92 with our MEM-Res framework, a 32.1% relative improvement.
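The relative improvements quoted above follow from the absolute WERs in Table III. The sketch below recomputes them; the function and variable names are hypothetical. Because the table entries are themselves rounded, a recomputed value can differ from the printed one in the last digit (the 6-channel row gives about 20.7% here, against 20.8% in the table).

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# (single-encoder WER, MEM-Res WER) pairs taken from Table III.
table_iii = {
    "et05_real_1ch": (32.2, 26.4),
    "et05_real_2ch": (26.8, 21.9),
    "et05_real_6ch": (21.7, 17.2),
    "eval92": (5.3, 3.6),
}
reductions = {
    name: relative_wer_reduction(baseline, proposed)
    for name, (baseline, proposed) in table_iii.items()
}
chime4_average = sum(
    reductions[name] for name in table_iii if name.startswith("et05")
) / 3
```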

Data             (4,4)   (2,4)   (1,4)
et05_real_1ch    29.1    27.0    26.4
eval92           4.5     4.2     3.6
TABLE IV: Effect of Multi-Resolution Configuration $(s^1, s^2)$, where $s^1$ and $s^2$ are the Subsampling Factors for the BLSTM Encoder and the VGGBLSTM Encoder, respectively. (WER: WSJ1, CHiME-4)

The results in Table IV show the contribution of multiple resolutions. The WER went up in both datasets as the subsampling factor $s^1$ was increased toward $s^2$. In other words, the fusion worked better when the two encoders were more heterogeneous, which supports our hypothesis. As shown in Table V, we analyzed the average stream-level attention weight of one encoder as we gradually decreased its number of LSTM layers while keeping the other encoder in its original configuration. The aim was to show that HAN is able to attend to the appropriate encoder when seeking the right knowledge. As the table suggests, attention shifts away from the modified encoder toward the unchanged one as we intentionally make the former weaker.

# LSTM Layers   Average Stream Attention   WER (%)
0               0.27                       30.6
1               0.52                       29.8
2               0.75                       28.9
3               0.82                       27.8
4               0.81                       26.4
TABLE V: Analysis of Hierarchical Attention Mechanism when Fixing One Encoder and Changing the Number of LSTM Layers in the Other. (WER: CHiME-4)

V Experiments: MEM-Array Model

V-A Experimental Setup

Two datasets, the AMI Meeting Corpus and DIRHA, were used to demonstrate the MEM-Array model. The AMI Meeting Corpus consists of 100 hours of far-field recordings from 3 meeting rooms (Edinburgh, Idiap and TNO Room) [22]. The recordings used a range of signals synchronized to a common timeline. Two arrays were placed in each meeting room to record the sentences, one of which was a 10 cm radius circular array of 8 omni-directional microphones placed between the speakers. The setup of the second microphone array differed among the rooms, as detailed in Table VI. The DIRHA dataset was collected in a real apartment setting with typical domestic background noise and reverberation [23]. In this configuration, a total of 32 microphones were placed in the living room (26 microphones) and in the kitchen (6 microphones). The microphone network consists of 2 circular arrays of 6 microphones (located on the ceilings of the living room and the kitchen), a linear array of 11 sensors (located in the living room) and 9 microphones distributed on the living-room walls. During the recording, the speaker was asked to move to a different position and take a different orientation after reading several sentences.

In both datasets, we chose two microphone arrays as parallel streams (denoted Str1 and Str2) to train the proposed E2E system, as shown in Table VI. For each microphone array, all the simulations or recordings were combined into a single channel using delay-and-sum (DS) beamforming with the BeamformIt toolkit [47]. The AMI training set consists of 81 hours of speech. The development (Dev) and evaluation (Eval) sets each contain 9 hours of meeting recordings. We used the Dev set for cross validation and the Eval set for testing. A contaminated version of the original WSJ (Wall Street Journal) corpus was used for DIRHA training. Two streams were generated by convolving the WSJ0 and WSJ1 clean utterances with the circular-array and linear-array impulse responses, respectively. Recorded noises were added as well. We used the DIRHA Simulation set (generated in the same way as the training data) for cross validation and the DIRHA Real set for testing, which consisted of 3 male and 3 female native US speakers uttering 409 WSJ sentences.

Dataset   Str1 (Stream 1)        Str2 (Stream 2)
AMI       8-mic Circular Array   Edinburgh: 8-mic Circular Array
                                 Idiap: 4-mic Circular Array
                                 TNO: 10-mic Linear Array
DIRHA     6-mic Circular Array   11-mic Linear Array
TABLE VI: Description of the Array Configuration in the Two-Stream E2E Experiments.

Experiments were conducted with the configuration as described in Table VII:

Feature
  Single Stream         80-dim fbank + 3-dim pitch
  Multi Stream          Str1: 80+3; Str2: 80+3
Model
  Encoder type          BLSTM or VGGBLSTM
  Encoder layers        BLSTM: 4; VGGBLSTM [44]: 6 (CNN) + 4
  Encoder units         320 cells (BLSTM layers)
  (Stream) Attention    Content-based
  Decoder type          1-layer 300-cell LSTM
  CTC weight (train)    AMI: 0.5; DIRHA: 0.2
  CTC weight (decode)   AMI: 0.3; DIRHA: 0.3
RNN-LM
  Type                  Look-ahead Word-level RNNLM [42]
  Train data            AMI: AMI; DIRHA: WSJ0-1 + extra WSJ text data
  LM weight             AMI: 0.5; DIRHA: 1.0
TABLE VII: Experimental Configuration (MEM-Array)

V-B Results

We defined two kinds of E2E architectures in these results discussions: single-stream architecture, which had only one encoder without stream attention and multi-stream architecture, which had several encoders with each corresponding to one microphone array and had stream attention mechanism as well.

V-B1 Single-array results

First, we explored the ASR performance of each individual array (single stream). As shown in Table VIII, the single-stream system with the VGGBLSTM-based encoder outperforms the one with the BLSTM encoder in both Character Error Rate (CER) and WER. Joint training of the CTC and attention-based models helps because CTC enforces monotonic attention alignments, rather than leaving the model to merely estimate the desired alignment for long sequences. With the RNNLM, we see a dramatic decrease in WER on both datasets. The Str1 WERs on AMI Eval and DIRHA Real were 56.9% and 35.1%, respectively. For simplicity, we report only the CTC/Attention single-stream result with RNNLM for Str2, since the same trend was observed and only the WER is compared in the following results.

Model (Single Stream) | AMI Eval CER(%) | AMI Eval WER(%) | DIRHA Real CER(%) | DIRHA Real WER(%)
BLSTM (Str1)
Attention | 45.1 | 60.9 | 42.7 | 68.7
+ CTC | 41.7 | 63.0 | 38.5 | 74.8
+ Word RNNLM | 41.7 | 59.1 | 29.4 | 47.4
VGGBLSTM (Str1)
Attention | 43.2 | 59.7 | 39.5 | 71.4
+ CTC | 40.2 | 62.0 | 30.1 | 61.8
+ Word RNNLM | 39.6 | 56.9 | 21.2 | 35.1
VGGBLSTM (Str2) | 45.6 | 64.0 | 22.5 | 38.4
TABLE VIII: Exploration of Best Encoder and Decoding Strategy for Single-Stream E2E Model.

V-B2 Multi-array results

As shown in Table IX, the proposed stream attention framework achieves 3.7% (56.9 to 54.9) and 9.7% (35.1 to 31.7) relative WER reductions on the AMI and DIRHA datasets, respectively. The hierarchical attention serves to emphasize the more reliable stream. In addition, we compared the multi-stream framework with conventional strategies in which a single-stream system is trained on Fbank and pitch features that are either the concatenation of the Str1 and Str2 features or extracted from audio obtained by aligning and averaging the waveforms of the two streams. The multi-stream framework outperformed both. To verify that the improvement did not simply come from the larger number of model parameters, we doubled the BLSTM layers (4 to 8) in the VGGBLSTM encoder and trained a single-stream CTC/Attention system with a comparable number of parameters (33.7M vs. 31.6M). The proposed system still achieved lower WERs on both datasets.

Model (Att + CTC + RNNLM) | #Param | AMI Eval | DIRHA Real
Single-stream model
Concatenating Str1&Str2 | 23.3M | 56.7 | 33.5
WAV alignment and average | 26.2M | 56.7 | 43.5
+ model parameter extension | 33.7M | 56.9 | 39.6
Multi-stream model
Proposed framework | 31.6M | 54.9 | 31.7
TABLE IX: WER(%) Comparison between the Proposed Multi-Stream Approach and Alternative Single-Stream Strategies.

During the inference stage of the multi-stream model, we examined how the stream attention weights change once one of the streams is corrupted by noise. Fig. 4 shows an example from the DIRHA Real set, before and after the input features of Str1 are corrupted by additive Gaussian noise with zero mean and unit variance. After the corruption, the alignment between characters and acoustic frames of Str1 becomes blurred (Fig. 4(c)), indicating that the information from Str1 should be trusted less. As expected, a positive shift of the attention weights for Str2 can be observed (upper line in Fig. 4(e)).

Fig. 4: Comparison of the alignments between characters (y-axis) and acoustic frames (x-axis) before ((a) Str1; (b) Str2) and after ((c) Str1; (d) Str2) noise corruption of Str1. (e) shows the attention weight shift of Str2 between the two cases (x-axis is the letter sequence).
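The stream-level weights examined in Fig. 4 can be illustrated with a minimal NumPy sketch of content-based stream attention. It assumes the additive scoring form w^T tanh(W q + V r_i + b) over the per-stream context vectors r_i, with q the decoder state. The parameter names, the dimensions and the random (untrained) values are hypothetical, so the sketch shows only the mechanics of the fusion, not the trained behavior in the figure.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def stream_attention(query, contexts, W, V, w, b):
    """Fuse per-stream context vectors with content-based attention.

    query:    decoder state, shape (d_q,)
    contexts: one context vector per stream, shape (n_streams, d_ctx)
    Returns (stream weights, fused context vector).
    """
    energies = np.array([w @ np.tanh(W @ query + V @ r + b) for r in contexts])
    weights = softmax(energies)   # one weight per stream, summing to 1
    fused = weights @ contexts    # weighted sum of the stream contexts
    return weights, fused

# Toy usage with random, untrained parameters (all dimensions are assumptions).
rng = np.random.default_rng(0)
d_q, d_ctx, d_att = 4, 6, 5
W = rng.normal(size=(d_att, d_q))
V = rng.normal(size=(d_att, d_ctx))
w = rng.normal(size=d_att)
b = np.zeros(d_att)
query = rng.normal(size=d_q)
contexts = rng.normal(size=(2, d_ctx))  # context vectors of Str1 and Str2
weights, fused = stream_attention(query, contexts, W, V, w, b)
```

Because the fusion operates on one context vector per stream at each decoding step, the streams do not need to be frame-synchronous.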

V-B3 Comparison with hybrid system

Table X compares the proposed E2E framework with the conventional hybrid ASR approach. In [32], we designed three scenarios using different subsets of the 32 microphones and 2 arrays in the DIRHA dataset. Averaged over the Real test sets of the three cases, our proposed DNN posterior combination approach and the ROVER technique reduced the WER of the hybrid system by a relative 7.2% and 5.8%, respectively. Meanwhile, the stream attention-based two-stream E2E system already achieves a relative 9.7% WER reduction, even though it uses fewer streams (two) than the hybrid system (six). Setting aside the absolute WER gap between the hybrid and E2E ASR systems, we believe that the proposed E2E approach has considerable potential to improve further with more array streams.

System | #Num | Method | Best Stream | WER (relative reduction)
Hybrid | 6 | post. comb. | 29.2 | 27.1 (7.2%)
Hybrid | 6 | ROVER | 29.2 | 27.5 (5.8%)
E2E | 2 | proposed | 35.1 | 31.7 (9.7%)
TABLE X: WER(%) Comparison between the Hybrid and End-to-End System on DIRHA Dataset. #Num Denotes the Number of Streams.
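The relative reductions in parentheses in Table X follow from the best-stream and fused WERs. The short sketch below reproduces that arithmetic. The helper name `relative_reduction` and the dictionary layout are hypothetical, and the WER values are copied from the table.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# (best single-stream WER, fused WER) pairs taken from Table X.
table_x = {
    "Hybrid, post. comb.": (29.2, 27.1),
    "Hybrid, ROVER": (29.2, 27.5),
    "E2E, proposed": (35.1, 31.7),
}

for name, (best_stream, fused) in table_x.items():
    print(f"{name}: {relative_reduction(best_stream, fused):.1f}% relative reduction")
```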

VI Conclusion

In this work, we presented a multi-stream framework for building an end-to-end ASR system. Higher-level frame-wise acoustic features were extracted by parallel encoders with various configurations of input features, architectures and temporal resolutions. Stream attention was achieved through a hierarchical connection between the decoder and the encoders. We also showed that assigning a CTC network to each individual encoder further helped diverse encoders to reveal complementary information.

Two realizations of the multi-stream framework have been proposed, the MEM-Res model and the MEM-Array model, which target different applications. In the MEM-Res architecture, RNN-based and CNN-RNN-based encoders, with subsampling only in the convolutional layers, characterized the same speech in different ways. The model outperformed various single-encoder models, reaching state-of-the-art performance on WSJ among end-to-end systems. For further study, exploring advanced convolutional layers, such as ResNet, and self-attention layers has the potential to improve the WER even more. In multi-array scenarios, it is crucial to take advantage of all the information that the arrays share and contribute. The MEM-Array model represents each array with one encoder, followed by attention fusion at the context-vector level, so no frame synchronization of the parallel streams is required. Thanks to the joint training of per-encoder CTC and attention, substantial WER reductions were obtained on both the AMI and DIRHA corpora, demonstrating the potential of the proposed architecture. Efficient extension to more streams and exploration of scheduled training of the encoders remain to be investigated.


This work is supported by the National Science Foundation under Grants No. 1704170 and No. 1743616, and by a Google faculty award to Hynek Hermansky.


  • [1] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal processing magazine, vol. 29, 2012.
  • [2] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006, pp. 369–376.
  • [3] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in ICML, 2014, pp. 1764–1772.
  • [4] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in ASRU, 2015, pp. 167–174.
  • [5] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2015.
  • [6] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in NIPS, 2015, pp. 577–585.
  • [7] S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in ICASSP, 2017, pp. 4835–4839.
  • [8] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in INTERSPEECH, 2017.
  • [9] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [10] S. H. R. Mallidi, “A practical and efficient multistream framework for noise robust speech recognition,” Ph.D. dissertation, Johns Hopkins University, 2018.
  • [11] H. Hermansky, “Multistream recognition of speech: Dealing with unknown unknowns,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1076–1088, 2013.
  • [12] S. H. Mallidi and H. Hermansky, “Novel neural network based fusion for multistream asr,” in ICASSP.   IEEE, 2016, pp. 5680–5684.
  • [13] H. Hermansky, “Coding and decoding of messages in human speech communication: Implications for machine recognition of speech,” Speech Communication, 2018.
  • [14] X. Wang, R. Li, S. H. Mallidi, T. Hori, S. Watanabe, and H. Hermansky, “Stream attention-based multi-array end-to-end speech recognition,” in ICASSP.   IEEE, 2019, pp. 7105–7109.
  • [15] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in NAACL HLT, 2016, pp. 1480–1489.
  • [16] C. Hori, T. Hori, T.-Y. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi, “Attention-based multimodal fusion for video description,” in ICCV.   IEEE, 2017, pp. 4203–4212.
  • [17] J. Libovický and J. Helcl, “Attention strategies for multi-source sequence-to-sequence learning,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, 2017, pp. 196–202.
  • [18] T. Hayashi, S. Watanabe, T. Toda, and K. Takeda, “Multi-head decoder for end-to-end speech recognition,” in INTERSPEECH, 2018, pp. 801–805.
  • [19] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in ICASSP.   IEEE, 2018, pp. 4774–4778.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008.
  • [21] Y. Zhang, W. Chan, and N. Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in ICASSP, 2017.
  • [22] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., “The ami meeting corpus: A pre-announcement,” in International Workshop on Machine Learning for Multimodal Interaction.   Springer, 2005, pp. 28–39.
  • [23] M. Ravanelli, P. Svaizer, and M. Omologo, “Realistic multi-microphone data simulation for distant speech recognition,” in INTERSPEECH, 2016.
  • [24] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth ’chime’ speech separation and recognition challenge: Dataset, task and baselines,” in Interspeech, 2018, pp. 1561–1565.
  • [25] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, “An analysis of environment, microphone and data simulation mismatches in robust speech recognition,” Computer Speech & Language, vol. 46, pp. 535–557, 2017.
  • [26] Z. Wang, X. Wang, X. Li, Q. Fu, and Y. Yan, “Oracle performance investigation of the ideal masks,” in IWAENC 2016.   IEEE, 2016, pp. 1–5.
  • [27] S. Markovich-Golan, A. Bertrand, M. Moonen, and S. Gannot, “Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks,” Signal Processing, vol. 107, pp. 4–20, 2015.
  • [28] J. Du et al., “The ustc-iflytek systems for chime-5 challenge,” in CHiME-5, 2018.
  • [29] N. Kanda et al., “The hitachi/jhu chime-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays,” in CHiME-5, 2018.
  • [30] J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (rover),” in ASRU.   IEEE, 1997, pp. 347–354.
  • [31] X. Wang, Y. Yan, and H. Hermansky, “Stream attention for far-field multi-microphone asr,” arXiv preprint arXiv:1711.11141, 2017.
  • [32] X. Wang, R. Li, and H. Hermansky, “Stream attention for distributed multi-microphone speech recognition,” in INTERSPEECH, 2018, pp. 3033–3037.
  • [33] H. Misra, H. Bourlard, and V. Tyagi, “New entropy based combination rules in hmm/ann multi-stream asr,” in ICASSP, vol. 2.   IEEE, 2003, pp. II–741.
  • [34] F. Xiong et al., “Channel selection using neural network posterior probability for speech recognition with distributed microphone arrays in everyday environments,” in CHiME-5, 2018.
  • [35] S. H. Mallidi, T. Ogawa, and H. Hermansky, “Uncertainty estimation of dnn classifiers,” in ASRU.   IEEE, 2015, pp. 283–288.
  • [36] T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao, “Unified architecture for multichannel end-to-end speech recognition with neural beamforming,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1274–1288, 2017.
  • [37] S. Braun, D. Neil, J. Anumula, E. Ceolini, and S.-C. Liu, “Multi-channel attention for end-to-end speech recognition,” in INTERSPEECH, 2018, pp. 17–21.
  • [38] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, “Multichannel end-to-end speech recognition,” in ICML.   JMLR. org, 2017, pp. 2632–2641.
  • [39] S. Kim and I. Lane, “End-to-end speech recognition with auditory attention for multi-microphone distance speech recognition,” in INTERSPEECH, 2017, pp. 3867–3871.
  • [40] A. Graves, “Supervised sequence labelling with recurrent neural networks,” Ph.D. dissertation, Universität München, 2008.
  • [41] T. Hori, S. Watanabe, and J. Hershey, “Joint ctc/attention decoding for end-to-end speech recognition,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 518–529.
  • [42] T. Hori, J. Cho, and S. Watanabe, “End-to-end speech recognition with word-based rnn language models,” in SLT.   IEEE, 2018, pp. 389–396.
  • [43] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [44] J. Cho, M. K. Baskar, R. Li, M. Wiesner, S. H. Mallidi, N. Yalta, M. Karafiat, S. Watanabe, and T. Hori, “Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling,” in SLT, 2018.
  • [45] L. D. Consortium, “CSR-II (wsj1) complete,” Linguistic Data Consortium, Philadelphia, vol. LDC94S13A, 1994.
  • [46] E. Vincent, S. Watanabe, J. Barker, and R. Marxer, “The 4th chime speech separation and recognition challenge,” 2016.
  • [47] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2022, 2007.
  • [48] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” arXiv preprint arXiv:1612.02695, 2016.
  • [49] Y. Zhou, C. Xiong, and R. Socher, “Improved regularization techniques for end-to-end speech recognition,” arXiv preprint arXiv:1712.07108, 2017.
  • [50] N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, “End-to-end speech recognition from the raw waveform,” arXiv preprint arXiv:1806.07098, 2018.
  • [51] Y. Wang, X. Deng, S. Pu, and Z. Huang, “Residual convolutional ctc networks for automatic speech recognition,” arXiv preprint arXiv:1702.07793, 2017.
  • [52] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end speech recognition using lattice-free mmi,” INTERSPEECH, pp. 12–16, 2018.
  • [53] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “Espnet: End-to-end speech processing toolkit,” in INTERSPEECH, 2018, pp. 2207–2211.