Quaternion Neural Networks for Multi-channel Distant Speech Recognition

05/18/2020 · Xinchi Qiu et al. · University of Oxford

Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between each signal. In this paper, we propose to capture these inter- and intra-structural dependencies with quaternion neural networks, which can jointly process multiple signals as whole quaternion entities. The quaternion algebra replaces the standard dot product with the Hamilton one, thus offering a simple and elegant way to model dependencies between elements. The quaternion layers are then coupled with a recurrent neural network, which can learn long-term dependencies in the time domain. We show that a quaternion long short-term memory neural network (QLSTM), trained on the concatenated multi-channel speech signals, outperforms an equivalent real-valued LSTM on two different tasks of multi-channel distant speech recognition.


1 Introduction

State-of-the-art speech recognition systems perform reasonably well in close-talking conditions. However, their performance degrades significantly in more realistic distant-talking scenarios, since the signals are corrupted with noise and reverberation [wolfel2009distant, li2015robust, dsp_thesis]. A common approach to improve the robustness of distant speech recognizers relies on the adoption of multiple microphones [brandstein2013microphone, benesty2008microphone]. Multiple microphones, either in the form of arrays or distributed networks, capture different views of an acoustic scene that are combined to improve robustness.

A common practice is to combine the microphones using signal processing techniques such as beamforming [kellermann]. The goal of beamforming is to achieve spatial selectivity (i.e., privilege the areas where a target speaker is speaking), limiting the effects of both noise and reverberation. One way to perform spatial filtering is provided by delay-and-sum beamforming, which simply performs a time alignment followed by a sum of the recorded signals [KnappCarter]. More sophisticated techniques are filter-and-sum beamforming [filt_sum], which filters the signals before summing them up, and super-directive beamforming [super_dir], which further enhances the target speech by suppressing the contributions of the noise sources from other directions.
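To make the delay-and-sum idea concrete, the following minimal NumPy sketch aligns the channels against a reference microphone and averages them. It is an illustrative sketch only: the delay estimation is reduced to plain cross-correlation (practical systems typically use GCC-PHAT [KnappCarter]), and all names are hypothetical rather than taken from any existing toolkit.

```python
import numpy as np

def delay_and_sum(signals, ref=0):
    """Naive delay-and-sum beamforming over time-domain recordings.

    signals: array of shape (n_mics, n_samples).
    ref:     index of the reference microphone.
    Returns a single beamformed channel.
    """
    n_mics, n_samples = signals.shape
    output = np.zeros(n_samples)
    for m in range(n_mics):
        # Delay of channel m w.r.t. the reference, estimated by plain
        # cross-correlation (GCC-PHAT would be more robust in practice).
        corr = np.correlate(signals[ref], signals[m], mode="full")
        delay = int(np.argmax(corr)) - (n_samples - 1)
        # Time-align the channel and accumulate.
        output += np.roll(signals[m], delay)
    return output / n_mics

# Toy example: the same pulse train observed with different delays on four mics.
rng = np.random.default_rng(0)
clean = rng.standard_normal(1000)
mics = np.stack([np.roll(clean, d) for d in (0, 3, 5, 2)])
enhanced = delay_and_sum(mics)
```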

An alternative that is gaining significant popularity is end-to-end (E2E) multi-channel ASR [heymann2017beamnet, braun2018multi, tara, unified, 7472778, Kim2017]. Here, the core idea is to replace the signal processing part with an end-to-end differentiable neural network that is jointly trained with the speech recognizer. This makes the speech processing pipeline significantly simpler and allows the different modules composing the system to better match each other. The most straightforward approach is to concatenate the speech features of the different microphones and feed them to a neural network [6854663]. However, this approach forces the network to deal with very high-dimensional data, and might thus make learning the complex relationships between microphones difficult due to the numerous independent neural parameters. To mitigate this issue, it is common to inject prior knowledge or inductive biases into the model. For instance, [tara] suggested an adaptive neural beamformer that performs filter-and-sum beamforming using learned filters. Similar techniques have been proposed in [unified, 7472778]. In all the aforementioned works, the microphone combination is not implemented with an arbitrary function, but with a restricted pool of functions such as beamforming ones. This introduces a regularization effect that helps the convergence of the speech recognizer.

In this paper, we propose a novel approach to model the complex inter- and intra-microphone dependencies that occur in multi-microphone ASR. Our inductive bias relies on the use of quaternion algebra. Quaternions extend complex numbers and define four-dimensional vectors composed of a real part and three imaginary components. The standard dot product is replaced with the Hamilton product, which offers a simple and elegant way to learn dependencies across input channels by sharing weights across them. More precisely, Quaternion Neural Networks (QNN) have recently been the object of several research efforts focusing on image processing [parcollet2019survey1, isokawa2003quaternion, parcollet2019quaternion], 3D sound event detection [comminiello2019quaternion] and single-channel speech recognition [QRNNparcollet2018quaternion]. To the best of our knowledge, our work is the first that proposes the use of quaternions in a multi-microphone speech processing scenario, which is a particularly suitable application. Our approach combines the speech features extracted from different channels into the four dimensions of a set of quaternions (Section 2.3). We then employ a Quaternion Long Short-Term Memory (QLSTM) neural network [QRNNparcollet2018quaternion]. This way, our architecture not only models the latent intra- and inter-microphone correlations with the quaternion algebra, but also jointly learns time dependencies with recurrent connections.

Our QLSTM achieves promising results on both a simulated multi-channel version of TIMIT and the DIRHA corpus [7404805], which are characterized by the presence of significant levels of non-stationary noise and reverberation. In particular, we outperform both a beamforming baseline (15% relative improvement) and a real-valued model with the same number of parameters (8% relative improvement). In the interest of reproducibility, we release the code within PyTorch-Kaldi [pytorchkaldi] (https://github.com/mravanelli/pytorch-kaldi/).

2 Methodology

This section first describes the quaternion algebra (Section 2.1) and quaternion long short-term memory neural networks (Section 2.2). Finally, the quaternion representation of multi-channel signals is introduced in Section 2.3.

2.1 Quaternion Algebra

A quaternion Q is an extension of a complex number to the four-dimensional space [hamilton1899elements]. A quaternion is written as:

Q = r + x\mathbf{i} + y\mathbf{j} + z\mathbf{k},    (1)

with r, x, y, and z four real numbers, and 1, \mathbf{i}, \mathbf{j}, and \mathbf{k} the quaternion unit basis. In a quaternion, r is the real part, while x\mathbf{i} + y\mathbf{j} + z\mathbf{k}, with \mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1, is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. In the same manner as for complex numbers, the conjugate Q^* of Q is defined as:

Q^* = r - x\mathbf{i} - y\mathbf{j} - z\mathbf{k},    (2)

and a unitary quaternion Q^{\triangleleft} (i.e. a quaternion whose norm is equal to 1) is defined as:

Q^{\triangleleft} = \frac{Q}{\sqrt{r^2 + x^2 + y^2 + z^2}}.    (3)

The Hamilton product \otimes between Q_1 and Q_2 is determined by the products of the basis elements and the distributive law:

Q_1 \otimes Q_2 = (r_1 r_2 - x_1 x_2 - y_1 y_2 - z_1 z_2)
                + (r_1 x_2 + x_1 r_2 + y_1 z_2 - z_1 y_2)\mathbf{i}
                + (r_1 y_2 - x_1 z_2 + y_1 r_2 + z_1 x_2)\mathbf{j}
                + (r_1 z_2 + x_1 y_2 - y_1 x_2 + z_1 r_2)\mathbf{k}.    (4)

Analogously to complex numbers, quaternions also have a matrix representation, defined in such a way that quaternion addition and multiplication correspond to matrix addition and matrix multiplication. An example of such a matrix is:

Q_{mat} =
\begin{bmatrix}
 r & -x & -y & -z \\
 x &  r & -z &  y \\
 y &  z &  r & -x \\
 z & -y &  x &  r
\end{bmatrix}.    (5)

Following this representation, the Hamilton product can be written as a matrix multiplication as follows:

Q_1 \otimes Q_2 =
\begin{bmatrix}
 r_1 & -x_1 & -y_1 & -z_1 \\
 x_1 &  r_1 & -z_1 &  y_1 \\
 y_1 &  z_1 &  r_1 & -x_1 \\
 z_1 & -y_1 &  x_1 &  r_1
\end{bmatrix}
\begin{bmatrix}
 r_2 \\ x_2 \\ y_2 \\ z_2
\end{bmatrix}.    (6)

Using the matrix representation of quaternions turns out to be particularly suitable for computations on modern GPUs, compared to a less efficient object-oriented implementation of quaternion arithmetic.
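As an illustration, the short PyTorch sketch below builds the 4x4 matrix of Eq. (5) and computes the Hamilton product of Eq. (6) as a matrix-vector product. This is a minimal sketch with illustrative function names, not the implementation used in the paper.

```python
import torch

def hamilton_matrix(q):
    """4x4 real matrix (Eq. 5) associated with the quaternion q = (r, x, y, z)."""
    r, x, y, z = q.unbind()
    return torch.stack([
        torch.stack([r, -x, -y, -z]),
        torch.stack([x,  r, -z,  y]),
        torch.stack([y,  z,  r, -x]),
        torch.stack([z, -y,  x,  r]),
    ])

def hamilton_product(q1, q2):
    """Hamilton product Q1 (x) Q2 computed as the matrix-vector product of Eq. (6)."""
    return hamilton_matrix(q1) @ q2

# Example: multiply two quaternions stored as (r, x, y, z) tensors.
q1 = torch.tensor([1.0, 2.0, 3.0, 4.0])
q2 = torch.tensor([0.5, -1.0, 0.0, 2.0])
print(hamilton_product(q1, q2))
```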

2.2 Quaternion Long Short-Term Memory Networks

Equivalently to standard LSTM models, a QLSTM is composed of a forget gate f_t, an input gate i_t, a cell input activation vector \tilde{c}_t, a cell state c_t, and an output gate o_t. In a QLSTM layer, however, the inputs x_t, hidden states h_t, cell states c_t, biases b, and weight parameters W are quaternion numbers. All multiplications are thus replaced with the Hamilton product. Different activation functions defined in the quaternion domain can be used [qactivate, parcollet2019survey1]. In this work, we follow the split approach defined as:

\alpha(Q) = f(r) + f(x)\mathbf{i} + f(y)\mathbf{j} + f(z)\mathbf{k},    (7)

where f is any real-valued activation function (e.g. ReLU, sigmoid, ...). Indeed, fully quaternion-valued activation functions have been demonstrated to be hard to train due to numerous singularities [parcollet2019survey1]. Then, the output layer is commonly defined in the real-valued space to be combined with traditional loss functions (e.g. cross-entropy) [qback2], due to the real-valued nature of the labels implied by the considered speech recognition task. Therefore, a QLSTM layer can be summarised with the following equations:

f_t = \sigma(W_f \otimes x_t + U_f \otimes h_{t-1} + b_f),
i_t = \sigma(W_i \otimes x_t + U_i \otimes h_{t-1} + b_i),
\tilde{c}_t = \alpha(W_c \otimes x_t + U_c \otimes h_{t-1} + b_c),
c_t = f_t \times c_{t-1} + i_t \times \tilde{c}_t,
o_t = \sigma(W_o \otimes x_t + U_o \otimes h_{t-1} + b_o),
h_t = o_t \times \alpha(c_t),    (8)

with the two split activations \sigma (sigmoid) and \alpha (tanh) as described in Eq. 7. As shown in [QRNNparcollet2018quaternion], QLSTM models can be trained following the quaternion-valued backpropagation through time. Finally, weight initialisation is crucial to train deep neural networks effectively [glorot2010understanding]. Hence, a well-adapted quaternion weight initialisation process [QRNNparcollet2018quaternion, QCNN] is applied: quaternion neural parameters are sampled with respect to their polar form and a random distribution following common initialization criteria [glorot2010understanding, he2015delving].
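To make the weight sharing and the split activation concrete, the following PyTorch sketch implements a quaternion-valued linear transform, the building block reused inside each QLSTM gate. It assumes inputs stored as concatenated [r | x | y | z] blocks and uses a plain scaled-normal initialisation instead of the polar-form quaternion initialisation of [QRNNparcollet2018quaternion]; class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class QuaternionLinear(nn.Module):
    """Quaternion-valued affine transform: out = W (x) input + b (Hamilton product).

    Inputs and outputs are real tensors of shape (batch, 4 * n_quat), laid out
    as concatenated [r | x | y | z] blocks. Only four real weight matrices are
    learned, against sixteen for an unconstrained real-valued layer of the same
    size: this is the weight-sharing effect of the Hamilton product.
    """

    def __init__(self, in_quat, out_quat):
        super().__init__()
        # One real matrix per component of the quaternion weight (simple
        # scaled-normal init; the paper uses the polar-form quaternion init).
        shape = (out_quat, in_quat)
        self.w_r = nn.Parameter(torch.randn(shape) * 0.05)
        self.w_i = nn.Parameter(torch.randn(shape) * 0.05)
        self.w_j = nn.Parameter(torch.randn(shape) * 0.05)
        self.w_k = nn.Parameter(torch.randn(shape) * 0.05)
        self.bias = nn.Parameter(torch.zeros(4 * out_quat))

    def forward(self, x):
        r, i, j, k = x.chunk(4, dim=-1)
        # Hamilton product W (x) x, with the weight as the left operand
        # (same component pattern as the matrix of Eq. 5).
        out_r = r @ self.w_r.t() - i @ self.w_i.t() - j @ self.w_j.t() - k @ self.w_k.t()
        out_i = r @ self.w_i.t() + i @ self.w_r.t() - j @ self.w_k.t() + k @ self.w_j.t()
        out_j = r @ self.w_j.t() + i @ self.w_k.t() + j @ self.w_r.t() - k @ self.w_i.t()
        out_k = r @ self.w_k.t() - i @ self.w_j.t() + j @ self.w_i.t() + k @ self.w_r.t()
        return torch.cat([out_r, out_i, out_j, out_k], dim=-1) + self.bias


def split_activation(q, f):
    """Split activation of Eq. (7): apply the real activation f component-wise."""
    return f(q)


# Sketch of a single QLSTM-style gate: a sigmoid split activation applied to a
# quaternion-valued affine transform of the input frame.
gate = QuaternionLinear(in_quat=40, out_quat=64)
x = torch.randn(8, 160)                         # 8 frames, 40 quaternions each
f_t = split_activation(gate(x), torch.sigmoid)  # shape (8, 256)
```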

2.3 Quaternion Representation of Multi-channel Signals

Figure 1: Illustration of the integration of multiple microphones with a quaternion dense layer. Each microphone is encapsulated by one component of a set of quaternions. All the neural parameters are quaternion numbers.

We propose to use quaternion numbers in a multi-microphone speech processing scenario. More precisely, quaternion numbers offer the possibility to encode up to four microphones (Fig. 1). Common acoustic features (e.g. MFCCs, FBANKs, ...) are therefore computed from each microphone signal x_m(t), with m \in [1, 4], and then concatenated to compose a quaternion as follows:

Q(t) = x_1(t) + x_2(t)\mathbf{i} + x_3(t)\mathbf{j} + x_4(t)\mathbf{k}.    (9)

Internal relations are captured with the specific weight-sharing property of the Hamilton product. By using Hamilton products, quaternion weight components are shared across the four components of each quaternion input, creating relations between the elements, as demonstrated in [parcollet2019quaternion]. More precisely, real-valued networks treat their inputs as a group of uni-dimensional elements that might end up unrelated to each other, potentially decorrelating the four microphone signals. Conversely, quaternion networks consider each time frame as an entity of four related elements. Hence, internal relations are naturally captured and learned through the process. Indeed, a small variation in one of the microphones results in an important change in the internal representation, affecting the encoding of the three other microphones.
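A minimal sketch of this packing step is given below, assuming the same [r | x | y | z] block layout as the quaternion layer sketched in Section 2.2; names and shapes are illustrative.

```python
import torch

def quaternion_input(mic_feats):
    """Pack per-microphone features into quaternion components (Eq. 9).

    mic_feats: list of 4 tensors, each of shape (time, feat_dim), one per
    microphone. Returns a tensor of shape (time, 4 * feat_dim) laid out as
    [r | x | y | z] blocks: feature f of microphone m becomes the m-th
    component of the f-th quaternion.
    """
    assert len(mic_feats) == 4, "one quaternion encodes exactly four channels"
    return torch.cat(mic_feats, dim=-1)

# Example with random FBANK-like features: 4 mics, 200 frames, 40 bins each.
feats = [torch.randn(200, 40) for _ in range(4)]
x = quaternion_input(feats)   # (200, 160) -> 40 quaternions per frame
```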

It is worth noticing that four microphones may be limiting for realistic applications. For instance, the latest CHiME-6 challenge [watanabe2020chime] proposes various recordings obtained from six microphones in different scenarios. This difficulty could easily be avoided by considering these tasks as a special case of higher algebras, such as octonions (eight dimensions) or sedenions (sixteen dimensions). Nevertheless, this paper proposes to first consider four dimensions to evaluate the viability of high-dimensional neural networks for distant and multi-microphone ASR. Finally, quaternion neural networks are known to be more computationally intensive than real-valued neural networks. Indeed, a Hamilton product involves 16 multiplications and 12 additions, compared to a single multiplication for a standard real-valued product. Nonetheless, the training time can be reduced with the matrix representation defined in Eq. (6), and can be drastically improved with simple linear algebra properties [cariow2020fast].

3 Experimental Protocol

A perturbed, multi-channel version of TIMIT [timit], presented hereafter, is first used as a preliminary task to investigate the impact of the Hamilton product. Then, the DIRHA dataset [cristoforetti2014dirha] is used to verify the scalability of the proposed approach to more realistic conditions.

3.1 TIMIT Dataset

The TIMIT corpus contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The training dataset consists of the standard 3,696 sentences uttered by 462 speakers, while the testing dataset consists of 192 sentences uttered by 24 speakers. A validation dataset composed of 400 sentences uttered by 50 speakers is used for hyper-parameter tuning.

In our experiments, we created a multi-channel simulated version of TIMIT using the impulse responses measured in [ravanelli2012impulse, ir_selection] (the perturbation can be re-created following https://github.com/SHINE-FBK/DIRHA_English_wsj). The reference environment is the living room of a real apartment, characterized by a significant amount of reverberation. The four considered microphones are placed on the ceiling of the room. Data are created considering all the different positions, and different positions are used for training and testing data. We also integrate a single-channel signal obtained with delay-and-sum beamforming as a baseline comparison [KnappCarter]. Input features consist of Mel filter-bank energies (FBANK) without derivatives, extracted with Kaldi [kaldi]. To show that the obtained gain in performance is independent of the input features, we also propose MFCC coefficients as an alternative set of features.
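For illustration only, the sketch below extracts Kaldi-compatible FBANK features per channel with torchaudio instead of the Kaldi binaries used in the paper; the file names and the number of Mel bins (40) are assumptions.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Hypothetical per-channel recordings of the same utterance.
channel_files = ["utt1_mic1.wav", "utt1_mic2.wav", "utt1_mic3.wav", "utt1_mic4.wav"]

mic_feats = []
for path in channel_files:
    waveform, sample_rate = torchaudio.load(path)
    # Kaldi-compatible Mel filter-bank energies, no delta features.
    fbank = kaldi.fbank(waveform, num_mel_bins=40, sample_frequency=sample_rate)
    mic_feats.append(fbank)

# Each element of mic_feats has shape (num_frames, 40) and can be packed into
# quaternion components as in the sketch of Section 2.3.
```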

3.2 DIRHA Dataset

To validate our model in a more realistic scenario, a set of experiments is also conducted with the larger DIRHA-English corpus [7404805]. As with the generated TIMIT dataset, the reference context is a domestic environment characterized by the presence of non-stationary noise and acoustic reverberation. Training is based on the original Wall Street Journal 5k (WSJ) corpus contaminated with a set of impulse responses measured in a real apartment [timitrealistic, ravanelli2017contaminated]. Both a real and a simulated dataset are used for testing, each consisting of WSJ sentences uttered by six native American speakers. Note that a validation set of WSJ sentences is used for hyper-parameter tuning. Only the first four microphones of the circular array are used in our experiments to fit the quaternion representation. A single-channel signal obtained with delay-and-sum beamforming is also used as a baseline comparison [KnappCarter]. It is worth noting that we also used MFCC coefficients as features, in addition to FBANKs, to evaluate the robustness of the model to the input representation.

3.3 Neural Network Architectures

We fix the number of neural parameters to be the same for both the LSTM and the QLSTM, following the models studied in [QRNNparcollet2018quaternion]. The QLSTM model is composed of bidirectional QLSTM layers followed by a linear layer with a softmax activation function for classification. Output labels are the different HMM states of the Kaldi decoder. Due to the weight-sharing property of quaternion neural networks, a quaternion layer requires four times fewer real-valued weight parameters than an unconstrained real-valued layer with the same number of real-valued nodes. The LSTM model is composed of 4 bidirectional LSTM layers whose size is chosen to ensure the same number of neural parameters as the QLSTM, followed by the same linear layer to obtain posterior probabilities. Dropout is applied across all (Q)LSTM layers. Quaternion parameters are initialised with the specific initialisation defined in [QRNNparcollet2018quaternion], while LSTM parameters are initialised with the Glorot criterion [glorot2010understanding].

Training is performed with the RMSprop optimizer with vanilla hyper-parameters over 24 epochs. The learning rate is halved every time the loss on the validation set increases, ensuring good convergence. Finally, both the LSTM and the QLSTM are manually implemented in PyTorch to alleviate any variation due to different implementations.
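A possible PyTorch sketch of this training configuration is given below. The initial learning rate, the stand-in model, and the use of ReduceLROnPlateau (which halves the learning rate as soon as the validation loss stops decreasing) are assumptions, not the exact recipe of the paper.

```python
import torch
import torch.nn as nn

# Stand-in acoustic model: in the paper this would be the (Q)LSTM stack.
model = nn.LSTM(input_size=160, hidden_size=128, num_layers=4, bidirectional=True)

initial_lr = 1e-3  # placeholder value: the paper's initial LR is not reproduced here
optimizer = torch.optim.RMSprop(model.parameters(), lr=initial_lr)

# Halve the learning rate whenever the validation loss stops decreasing.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=0)

for epoch in range(24):
    # ... one training epoch over the contaminated corpus would go here ...
    val_loss = float(torch.rand(1))  # placeholder for the real validation loss
    scheduler.step(val_loss)
```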

4 Results and Discussions

Models | Signals               | Test (FBANK) | Test (MFCC)
QLSTM  | 1 microphone (copied) | 32.1 ± 0.02  | 34.2 ± 0.13
LSTM   | 1 microphone          | 32.3 ± 0.14  | 35.0 ± 0.23
LSTM   | beamforming           | 31.1 ± 0.11  | 33.4 ± 0.07
LSTM   | 4 microphones         | 30.2 ± 0.16  | 32.8 ± 0.09
QLSTM  | 4 microphones         | 28.7 ± 0.06  | 30.4 ± 0.11
Table 1: Results expressed in terms of Phoneme Error Rate (PER, %, lower is better) of both QLSTM and LSTM models on the TIMIT distant phoneme recognition task with different acoustic features. Results are averaged over several runs.
Models | Signals       | Test Real (MFCC) | Test Sim. (MFCC) | Test Real (FBANK) | Test Sim. (FBANK)
LSTM   | beamforming   | 35.1             | 33.7             | 35.0              | 33.0
LSTM   | 4 microphones | 32.7             | 26.4             | 31.6              | 26.3
QLSTM  | 4 microphones | 29.8             | 23.8             | 29.7              | 23.4
Table 2: Results expressed in terms of Word Error Rate (WER, %, lower is better) of both QLSTM and LSTM based models on the DIRHA dataset with different acoustic features. "Test Sim." corresponds to the simulated test set of the corpus, while "Test Real" is the set composed of real recordings.

The results on the distant multi-channel TIMIT dataset are reported in Table 1. From this comparison, it emerges that the QLSTM with four microphones outperforms the other approaches. Our best QLSTM model, in fact, obtains a PER of 28.7% against a PER of 30.2% achieved with a standard real-valued LSTM. In both cases, the best performance is obtained with FBANK features. Interestingly, Table 1 shows that the concatenation of the four input signals with a real-valued LSTM outperforms the delay-and-sum beamforming approach. Similar findings have already emerged in previous works on multi-channel ASR [6854663] and may be due to the ability of modern neural networks to obtain disentangled and informative representations from noisy inputs.

We can now investigate in more detail the role played by the quaternion algebra on learning cross-microphone dependencies. One way to do it is to overwrite the quaternion dimensions with the features extracted from the same microphone (see the first row of Table 1). In this case, we expect that our QLSTM will fail to learn cross-microphone dependencies, simply because we have a single feature vector replicated multiple times. For a fair comparison, the aforementioned experiment is conducted by selecting the best microphone of the array (i.e. LA4).

From the first and the second rows of Table 1, one can note that the single-channel QLSTM and LSTM perform roughly the same. As expected, in fact, the single-channel QLSTM is not able to model useful dependencies when all the quaternion dimensions are filled with the same feature vector. Nonetheless, switching to the four-channel signal brings an average PER improvement of 3.6% for the QLSTM compared to 2.2% for the LSTM, showing a higher gain obtained on multiple channels with the QLSTM. This illustrates the ability of the QLSTM to better capture latent relations across the different microphones.

To provide some experimental evidence on a more realistic task, we evaluate our model on the DIRHA dataset. The results reported in Table 2 confirm the trend observed with TIMIT. Indeed, with MFCC features, Word Error Rates (WER) of 29.8% and 23.8% are obtained for the QLSTM on the real and simulated test sets respectively, compared to 32.7% and 26.4% for the equivalent real-valued LSTM. The same remark holds when feeding our models with FBANK features, with a best WER of 23.4% obtained with the QLSTM compared to 26.3% for the LSTM. As a side note, the accuracies reported in Table 2 are slightly worse than those given in [pytorchkaldi]. Indeed, the latter work includes a specific batch normalisation that is not applied in our experiments due to the very high complexity of the Quaternion Batch-Normalisation (QBN) introduced in [gaudet2018deep]. As a matter of fact, the current equations of the QBN induce a large increase of the VRAM consumption. As expected, the WERs observed on the real test set are also higher than those on the simulated one, due to more complex and realistic perturbations.

As shown by both the TIMIT and DIRHA experiments, the performance improvement observed with the QLSTM is independent of the initial acoustic representation, implying that a similar increase in accuracy may be expected with other acoustic features such as fMLLR or PLP. Interestingly, the single-channel delay-and-sum beamforming approach gives the worst results among the investigated methods that exploit all four microphones, on both TIMIT and DIRHA.

5 Conclusion

Summary. This paper proposed to perform multi-channel speech recognition with an LSTM based on quaternion numbers. Our experiments, conducted on multi-channel TIMIT and DIRHA, have shown that: 1) given the same number of parameters, our multi-channel QLSTM significantly outperforms an equivalent LSTM network; 2) the performance improvement is observed with different features, implying that a similar increase in accuracy may be expected with other acoustic representations such as fMLLR or PLP; 3) our QLSTM learns internal latent relations across microphones. Therefore, the initial intuition that quaternion neural networks are suitable for multi-channel distant automatic speech recognition has been verified.

Perspectives. One limitation of the current approach is that quaternion neural networks can only deal with four-dimensional input signals. Even though popular devices such as the Microsoft Kinect or the ReSpeaker are based on 4-microphone arrays, future efforts will focus on generalising this paradigm to an arbitrary number of microphones by considering, for instance, higher-dimensional algebras such as octonions and sedenions, or by investigating other methods of weight sharing for multi-channel ASR. Finally, despite recent works investigating efficient quaternion computations, the current training and inference processes of the QLSTM remain slower than those of an LSTM. Therefore, effort should be put into developing and implementing faster training procedures.

6 Acknowledgements

This work was supported by the EPSRC through MOA (EP/S001530/) and Samsung AI. We would also like to thank Elena Rastorgueva and Renato De Mori for their helpful comments and discussions.

References