Resource-Efficient Speech Mask Estimation for Multi-Channel Speech Enhancement

07/22/2020 ∙ by Lukas Pfeifenberger, et al. ∙ 0

While machine learning techniques are traditionally resource intensive, we are currently witnessing an increased interest in hardware- and energy-efficient approaches. This need for resource-efficient machine learning is primarily driven by the demand for embedded systems and their usage in ubiquitous computing and IoT applications. In this article, we provide a resource-efficient approach for multi-channel speech enhancement based on Deep Neural Networks (DNNs). In particular, we use reduced-precision DNNs for estimating a speech mask from noisy, multi-channel microphone observations. This speech mask is used to obtain either the Minimum Variance Distortionless Response (MVDR) or Generalized Eigenvalue (GEV) beamformer. In the extreme case of binary weights and reduced-precision activations, a significant reduction of execution time and memory footprint is possible while still obtaining an audio quality almost on par with single-precision DNNs and a slightly larger Word Error Rate (WER) for single-speaker scenarios using the WSJ0 speech corpus.




I Introduction

Deep Neural Networks (DNNs), the workhorse of speech-based user interaction systems, prove particularly effective when large amounts of data and plenty of computing resources are available. However, in many real-world applications the limited computing infrastructure, latency requirements and power constraints during the operation phase effectively rule out most of the current resource-hungry DNN approaches. Therefore, several key challenges have to be considered jointly to facilitate the usage of DNNs in edge-computing implementations:

  • Efficient representation: The model complexity measured by the number of model parameters should match the limited resources of the computing hardware, in particular regarding memory footprint.

  • Computational efficiency: The model should be computationally efficient during inference, exploiting the available hardware optimally with respect to time and energy. Power constraints are key for embedded systems, as the device lifetime for a given battery charge needs to be maximized.

  • Prediction quality: The focus is usually on optimizing the prediction quality of the models. For embedded devices, model complexity versus prediction quality trade-offs must be considered to achieve good prediction performance while simultaneously reducing computational complexity and memory requirements.

DNN models use GPUs to enable efficient processing, where single-precision floating-point numbers are common for parameter representation and arithmetic operations. To facilitate deep models in today’s consumer electronics, the model usually has to be scaled down to be implemented efficiently on embedded or low-power systems. Most research emphasizes one of the following two techniques: (i) reduce the model size in terms of the number of weights and/or neurons [17, 16, 15, 60, 25, 68], or (ii) reduce the arithmetic precision of parameters and/or computational units [10, 64, 35, 43]. Evidently, these two basic techniques are almost “orthogonal directions” towards efficiency in DNNs, and they can be naturally combined, e.g. one can both sparsify the model and reduce its arithmetic precision. Both strategies reduce the memory footprint accordingly and are vital for the deployment of DNNs in many real-world applications. This is especially important as reduced memory requirements are one of the main factors in reducing energy consumption [16, 8, 24]. Apart from that, model size reduction and sparsification techniques such as weight pruning [32, 17, 16], weight sharing [51], knowledge distillation [22] or special weight matrix structures [28, 9, 62] also impact the computational demand measured in terms of the number of arithmetic operations. Unfortunately, this reduction usually does not directly translate into savings of wall-clock time, as current hardware and software are not well designed to exploit model sparseness [63]. In contrast, reducing parameter precision proves quite effective for improving execution time on CPUs [55, 47] and specialized hardware such as FPGAs [52]. When the precision of the inference process is driven to the extreme, i.e. assuming binary or ternary weights in conjunction with binary inputs and binary activation functions, floating- or fixed-point multiplications are replaced by hardware-friendly logical XNOR and bitcount operations, i.e. DNNs essentially reduce to a logical circuit. Training such discrete-valued DNNs (due to finite precision, any DNN is in fact discrete valued; we use this term here to highlight the extremely low number of values) is delicate, as they cannot be directly optimized using gradient-based methods. However, the computational savings obtained on today’s computer architectures are of great interest, especially when it comes to human-machine interaction (HMI) systems where latency and energy efficiency play an important role.

Due to the steadily decreasing cost of consumer electronics, many state-of-the-art HMI systems include multi-channel speech enhancement (MCSE) as a pre-processing stage. In particular, beamformers (BF) spatially separate background noise from the desired speech signal. Common BF methods include the Minimum Variance Distortionless Response (MVDR) beamformer [56] and the Generalized Eigenvalue (GEV) beamformer [58]. Both MVDR and GEV beamformers are frequently combined with DNN-based mask estimators, which estimate a gain mask used to obtain the spatial Power Spectral Density (PSD) matrices of the desired and interfering sound sources. Mask-based BFs are amongst the state-of-the-art beamforming approaches [21, 20, 19, 13, 39, 46, 40, 41, 67]. However, they are computationally demanding and need to be trimmed down to facilitate their usage in low-resource applications.

In this paper, we investigate the trade-off between performance and resources in MCSE systems. We exploit Bi-directional Long Short-Term Memory (BLSTM) architectures to estimate a gain mask from noisy speech signals, and combine it with both the GEV and MVDR beamformers [39]. We analyze the computational demands of the overall system and highlight techniques to reduce the computational load. In particular, we observe that the computational effort for mask estimation is orders of magnitude larger than that of the subsequent beamforming. Hence, we concentrate on efficient mask estimation in the remainder of the paper. We limit the numerical precision of the weights of the mask estimation DNN, and of the result of each processing step in the forward pass (i.e. inference), to either 8, 4 or 1 bits. This makes the MCSE system both resource and memory efficient. We report both perceptual audio quality and speech intelligibility of the overall MCSE system in terms of SNR improvement and Word Error Rate (WER) using the Google Speech-to-Text API [49]. In particular, we use the WSJ0 corpus [37] in conjunction with simulated room acoustics to obtain 6-channel data of multiple, moving speakers [36]. When reducing the numerical precision in the forward pass, the system still yields competitive results for single-speaker scenarios with a slightly increased WER. Additionally, we show that reduced-precision DNNs can be readily exploited on today’s hardware, by benchmarking the core operation of binary DNNs (BNNs), i.e. binary matrix multiplication, on NVIDIA Tesla K80 and ARM Cortex-A57 architectures.

The paper is structured as follows. In Section II we introduce the MCSE system. We highlight both MVDR and GEV beamformers and introduce DNN-based mask estimators. Section III provides details about the computational complexity of the MCSE system. We introduce reduced-precision LSTMs and discuss efficient representations in detail. In Section IV we present experiments of the MCSE system. In particular, the experimental setup and the results in terms of SNR and WER accuracy are discussed. Section V concludes the paper.

II Multi-Channel Speech Enhancement System

The acoustic environment of our MCSE system consists of up to independent sound sources, i.e. human speech or ambient noise. The sound sources may be non-stationary, due to moving speakers, and their spatial and temporal characteristics are unknown.

The speech enhancement system itself is composed of a circular microphone array with microphones, a DNN to estimate gain masks from the noisy microphone observations, and a broadband beamformer to isolate the desired signal, as shown in Figure 1.

Fig. 1: System overview, showing the microphone signals and the beamformer output in frequency domain.

The signal at the microphones is a mixture of all sources, i.e. in the short-time Fourier transform (STFT) domain

$$\mathbf{Z}(k,l) = \sum_{c=1}^{C} \mathbf{X}_c(k,l), \tag{1}$$

where $C$ is the number of sound sources and the time-frequency bins of all $M$ microphones are stacked into a vector $\mathbf{Z}(k,l) \in \mathbb{C}^{M}$. The vector $\mathbf{X}_c(k,l)$ represents the $c$th sound source at all microphones at frequency bin $k$ and time frame $l$. (For the sake of brevity, the frequency and time frame indices will be omitted where the context is clear.) Each sound source is composed of a monaural recording $S_c(k,l)$ convolved with the Acoustic Transfer Function (ATF) $\mathbf{A}_c(k,l)$, i.e.

$$\mathbf{X}_c(k,l) = \mathbf{A}_c(k,l)\, S_c(k,l), \tag{2}$$

where $\mathbf{A}_c(k,l)$ models the acoustic path from the $c$th sound source to the microphones, including all reverberations and reflections caused by the room acoustics [31]. In the near field of the array, the ATFs can be modeled by a finite impulse response (FIR) filter [5]. The filter characteristics vary with the movement of the speaker, i.e. the filter is non-stationary. Without loss of generality, we specify the first source to be the desired source, i.e. $\mathbf{S} = \mathbf{X}_1$, and the interfering signal as the sum of the remaining sources, i.e. $\mathbf{N} = \sum_{c=2}^{C} \mathbf{X}_c$. The spatial PSD matrix for the desired signal is given as [26]

$$\boldsymbol{\Phi}_{SS} = \mathrm{E}\{\mathbf{S}\mathbf{S}^H\}, \tag{3}$$

and for the interfering signal

$$\boldsymbol{\Phi}_{NN} = \mathrm{E}\{\mathbf{N}\mathbf{N}^H\}. \tag{4}$$

The aim of beamforming is to recover the desired source while suppressing the interfering sources at the same time. We use a filter-and-sum beamformer [7], where each microphone signal is weighted with the beamforming weights $\mathbf{W}(k,l)$ prior to summation into the result $Y(k,l)$, i.e.

$$Y(k,l) = \mathbf{W}^H(k,l)\, \mathbf{Z}(k,l), \tag{5}$$

where $\mathbf{W}(k,l) \in \mathbb{C}^{M}$ and $(\cdot)^H$ denotes the conjugate transpose.
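The filter-and-sum operation can be illustrated with a minimal NumPy sketch. The function name `filter_and_sum` and the array layout (frequency bins, frames, microphones) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def filter_and_sum(W, Z):
    """Apply beamforming weights to a multi-channel STFT.

    W: complex array of shape (K, M), one weight vector per frequency bin.
    Z: complex array of shape (K, L, M), the stacked microphone STFTs.
    Returns Y of shape (K, L) with Y[k, l] = W[k]^H Z[k, l].
    """
    return np.einsum("km,klm->kl", W.conj(), Z)

# Toy example: K=3 frequency bins, L=4 frames, M=2 microphones.
rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 4, 2)) + 1j * rng.standard_normal((3, 4, 2))
W = np.zeros((3, 2), dtype=complex)
W[:, 0] = 1.0  # trivial "beamformer" that simply selects microphone 0
Y = filter_and_sum(W, Z)
```

With these trivial weights the output equals the first microphone signal, which is a convenient sanity check before plugging in MVDR or GEV weights.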

II-A MVDR Beamformer

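The MVDR beamformer [5, 48] minimizes the signal energy at the output of the beamformer, while maintaining an undistorted response with respect to the steering vector $\mathbf{v}$, i.e. its weights are

$$\mathbf{W}_{\mathrm{MVDR}} = \frac{\boldsymbol{\Phi}_{NN}^{-1}\mathbf{v}}{\mathbf{v}^H \boldsymbol{\Phi}_{NN}^{-1}\mathbf{v}}, \tag{6}$$

where $\boldsymbol{\Phi}_{NN}$ is the spatial PSD matrix of the interfering signal. The steering vector guides the beamformer towards the direction of the desired signal. This direction can be determined using Direction Of Arrival (DOA) estimation algorithms [4, 14, 50, 38]. However, in real-world applications this is sub-optimal, as it does not consider reverberation and multi-path propagation. Assuming that the PSD matrix $\boldsymbol{\Phi}_{SS}$ of the desired source is known, the steering vector can be obtained in the signal subspace [48] using the eigenvalue decomposition (EVD) of $\boldsymbol{\Phi}_{SS}$. In particular, the eigenvector belonging to the largest eigenvalue is used as steering vector, i.e.

$$\mathbf{v} = \mathcal{P}\{\boldsymbol{\Phi}_{SS}\},$$

where $\mathcal{P}\{\cdot\}$ denotes the principal eigenvector.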
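As a rough illustration of the two steps described above (steering vector from the principal eigenvector of the speech PSD matrix, then the standard MVDR solution), consider the following NumPy sketch for a single frequency bin. The function names and the toy matrices are assumptions for this example only.

```python
import numpy as np

def principal_eigenvector(Phi):
    """Eigenvector of a Hermitian matrix belonging to its largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(Phi)  # eigenvalues come in ascending order
    return eigvecs[:, -1]

def mvdr_weights(Phi_SS, Phi_NN):
    """MVDR weights with the steering vector taken from the speech PSD matrix."""
    v = principal_eigenvector(Phi_SS)
    num = np.linalg.solve(Phi_NN, v)        # Phi_NN^{-1} v without explicit inverse
    return num / np.vdot(v, num)            # np.vdot conjugates its first argument

# Toy example: rank-1 speech PSD and spatially white noise (hypothetical values).
a = np.array([1.0, 1j, -1.0])
Phi_SS = np.outer(a, a.conj())
Phi_NN = np.eye(3, dtype=complex)
v = principal_eigenvector(Phi_SS)
W = mvdr_weights(Phi_SS, Phi_NN)
```

The distortionless constraint can be checked numerically: `np.vdot(W, v)` should be 1 up to rounding.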


II-B GEV Beamformer

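An alternative to the MVDR beamformer is the GEV beamformer [58, 59]. It determines the filter weights to maximize the SNR at the beamformer output, i.e.

$$\mathbf{W}_{\mathrm{GEV}} = \arg\max_{\mathbf{W}} \frac{\mathbf{W}^H \boldsymbol{\Phi}_{SS} \mathbf{W}}{\mathbf{W}^H \boldsymbol{\Phi}_{NN} \mathbf{W}}. \tag{7}$$

Eq. (7) can be rewritten as a generalized eigenvalue problem [59]:

$$\boldsymbol{\Phi}_{NN}^{-1}\boldsymbol{\Phi}_{SS}\,\mathbf{W}_{\mathrm{GEV}} = \xi\,\mathbf{W}_{\mathrm{GEV}}, \tag{9}$$

where $\mathbf{W}_{\mathrm{GEV}}$ is the eigenvector belonging to the largest eigenvalue $\xi$ of $\boldsymbol{\Phi}_{NN}^{-1}\boldsymbol{\Phi}_{SS}$. To compensate for the amplitude distortions [58] of the beamforming filter $\mathbf{W}_{\mathrm{GEV}}$, we choose the two reference implementations from [39], i.e. the GEV-PAN and GEV-BAN postfilters.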
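One way to solve such a generalized eigenvalue problem numerically is to whiten it with a Cholesky factor of the noise PSD matrix, which turns it into an ordinary Hermitian eigenvalue problem. The sketch below follows that route for a single frequency bin; it is an illustrative assumption, not the solver used by the authors, and the function names and toy matrices are hypothetical.

```python
import numpy as np

def gev_weights(Phi_SS, Phi_NN):
    """Principal generalized eigenvector of the matrix pair (Phi_SS, Phi_NN)."""
    L = np.linalg.cholesky(Phi_NN)        # Phi_NN = L L^H
    L_inv = np.linalg.inv(L)
    A = L_inv @ Phi_SS @ L_inv.conj().T   # whitened, Hermitian problem
    eigvals, eigvecs = np.linalg.eigh(A)  # ascending eigenvalues
    u = eigvecs[:, -1]
    W = L_inv.conj().T @ u                # map back: W = L^{-H} u
    return W, eigvals[-1]

def output_snr(W, Phi_SS, Phi_NN):
    """Ratio of speech to noise power at the beamformer output."""
    return np.real(W.conj() @ Phi_SS @ W) / np.real(W.conj() @ Phi_NN @ W)

# Toy example with hypothetical PSD matrices.
a = np.array([1.0, 1j, -1.0])
Phi_SS = np.outer(a, a.conj()) + 0.1 * np.eye(3)
Phi_NN = np.diag([1.0, 2.0, 4.0]).astype(complex)
W, xi = gev_weights(Phi_SS, Phi_NN)
```

The largest eigenvalue `xi` equals the output SNR achieved by `W`, and no other weight vector (for instance `a` itself) can exceed it.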

II-C PSD Matrix Estimation

The spatial PSD matrix can be approximated using


and the gain mask for the speech signal. Analogously, can be estimated using the gain mask for the interfering signal. Note that the window length defines the number of time frames used for estimating the PSD matrices. For moving sources, has to be sufficiently large to obtain well estimated PSD matrices. If is too large, the estimated PSD matrices might fail to adapt quickly enough to changes in the spatial characteristics of the moving sources. An alternative is provided by recursive estimation, i.e.


This online processing [6] allows the MVDR or GEV beamformer to be adapted at each time frame . The recursive estimation is initialized using Eq. (10).
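To make the idea of mask-based PSD estimation concrete, the following NumPy sketch shows one common form of a mask-weighted block estimate and of a recursive per-frame update. The exact normalization and window placement of Eq. (10) and Eq. (11) are not recoverable here, so the formulas below, the smoothing factor `alpha` and all function names are assumptions.

```python
import numpy as np

def masked_psd(Z, mask, eps=1e-10):
    """Mask-weighted average of outer products over all frames of a block.

    Z: (K, L, M) complex STFT, mask: (K, L) gains in [0, 1].
    Returns PSD matrices of shape (K, M, M).
    """
    num = np.einsum("kl,klm,kln->kmn", mask, Z, Z.conj())
    return num / (mask.sum(axis=1)[:, None, None] + eps)

def recursive_psd(Phi_prev, z, p, alpha=0.95):
    """One-frame recursive update for a single frequency bin.

    Phi_prev: (M, M) previous estimate, z: (M,) current observation,
    p: scalar mask value, alpha: assumed smoothing factor.
    """
    return alpha * Phi_prev + (1.0 - alpha) * p * np.outer(z, z.conj())

rng = np.random.default_rng(0)
Z = rng.standard_normal((2, 50, 3)) + 1j * rng.standard_normal((2, 50, 3))
Phi = masked_psd(Z, np.ones((2, 50)))
Phi_next = recursive_psd(np.zeros((3, 3), dtype=complex), Z[0, 0], 1.0, alpha=0.5)
```

A larger `alpha` (or a longer block) gives smoother estimates but slower adaptation to moving sources, which is the trade-off discussed above.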

II-D Recursive Eigenvector Tracking

If Eq. (10) is used, the generalized Eigenvalue decomposition in Eq. (9) has to be performed for every time-frequency bin. This expensive operation can be circumvented by recursive Eigenvector tracking using Oja’s method [18]; i.e.


Note that the GEV-PAN and GEV-BAN postfilters from [39] are not affected by the normalization operation in Eq. (12). When using the MVDR beamformer, tracking the largest eigenvector of is done in a similar fashion.
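The textbook form of Oja's rule with explicit renormalization can be sketched as follows. This is not a reproduction of Eq. (12): for simplicity the example tracks the principal eigenvector of a fixed Hermitian matrix, and the step size `mu`, the test matrix and the function name are assumptions.

```python
import numpy as np

def oja_step(w, A, mu):
    """One Oja update towards the principal eigenvector of a Hermitian matrix A."""
    Aw = A @ w
    w = w + mu * (Aw - np.vdot(w, Aw) * w)  # np.vdot(w, Aw) = w^H A w
    return w / np.linalg.norm(w)            # renormalize to unit length

# Build a Hermitian test matrix with known eigenvectors (columns of Q).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)))
A = Q @ np.diag([4.0, 1.0, 0.5, 0.1]) @ Q.conj().T

w = np.ones(4, dtype=complex) / 2.0         # unit-norm starting vector
for _ in range(500):
    w = oja_step(w, A, mu=0.2)

alignment = abs(np.vdot(Q[:, 0], w))        # 1.0 means perfectly aligned
```

Each step costs one matrix-vector product instead of a full eigenvalue decomposition, which is why this kind of tracking is attractive when the PSD matrices change at every frame.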

II-E DNN-based Speech Mask Estimation

The DNN used to estimate the gain mask for the beamformer uses the noisy microphone observations as features. In particular, the features per time-frequency-bin are defined as , where is a whitened and phase-normalized version of . Further details on whitening can be found in [39]. For microphones, contains real-valued elements. The DNN processes frequency bins at a time, hence each time frame uses the feature vector as input. It contains elements.

Figure 2 shows the architecture of the DNN consisting of Dense layers and BLSTM units. Similar architectures for speech mask estimation can be found in [12, 20, 66, 40].

Fig. 2: Mask estimation DNN.

The first BLSTM layer consists of two separate LSTM units [23], each with neurons for each frequency bin. While the first LSTM processes the data in forward direction (i.e. one time frame after another), the second LSTM operates in backward direction. The outputs of both LSTMs are then concatenated into an intermediate vector with elements. The second and third layers are Dense layers. The first three layers reduce the feature vector size from elements per time-frequency bin down to 1. Note that those layers have very few weights, as they consist of independent units with neurons each. The fourth layer is a BLSTM processing all frequency bins at a time. Finally, three separate Dense layers are used. The first estimates the mask for the desired source, the second estimates the mask for the interfering sources, and the third estimates the mask for time-frequency bins which are not assigned to either of the other two classes. The activation function of these output layers is a softmax, so that the three masks sum to 1 for each time-frequency bin, i.e. .

III Computational Efficiency of the MCSE System

III-A Complexity Analysis of the MCSE System

Table I shows both the computational complexity and the number of multiply-and-accumulate (MAC) operations for the proposed DNN-based mask estimator (cf. Section II-E). Overall, 5562e6 MAC operations are needed to compute a gain mask given a multi-channel signal with microphones, frequency bins and frames. Table II shows the MAC operations of a static and a dynamic beamformer, needed to infer the target speech. Static beamformers, which do not track moving targets, have a reduced computational overhead compared to dynamic variants, which compute the beamforming weights for every time step. In both cases, however, the computational complexity of the beamformer is orders of magnitude lower than that of the DNN-based mask estimator. This indicates that significant computational savings can be obtained by optimizing the DNN with respect to resource efficiency.

Layer          Weights      MAC
BLSTM           590976     295e6
Dense layer       6156       3e6
Dense layer     526338     263e6
BLSTM          8421408    4211e6
Dense layer    1579014     790e6
Total         11123892    5562e6
TABLE I: Computational complexity for the DNN-based mask estimator.

Mode      Layer     MAC
static    Eq. 10    18e6
static    Eq. 9     0.1e6
static    Total     18.1e6
dynamic   Eq. 10    18e6
dynamic   Eq. 9     55e6
dynamic   Total     73e6
TABLE II: Computational complexity of a static and a dynamic GEV beamformer.
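The MAC counts in Table I are consistent with one MAC per weight and time frame. Dividing the total MAC count by the total number of weights gives roughly 500, so the following sketch assumes 500 frames; this value is inferred from the table and is not stated explicitly in the surviving text. The layer labels are descriptive names chosen for this example, and the footprint estimate counts weights only.

```python
# Weights per layer, copied from Table I.
weights = {
    "BLSTM (1st)": 590_976,
    "Dense (2nd)": 6_156,
    "Dense (3rd)": 526_338,
    "BLSTM (4th)": 8_421_408,
    "Dense (output)": 1_579_014,
}
FRAMES = 500  # assumption inferred from Table I: 5562e6 MAC / 11123892 weights

# One multiply-and-accumulate per weight and time frame.
macs = {name: n * FRAMES for name, n in weights.items()}
total_weights = sum(weights.values())
total_macs = sum(macs.values())

# Memory footprint of the weights alone for each bit-width, in megabytes.
footprint_mb = {bits: total_weights * bits / 8 / 1e6 for bits in (32, 8, 4, 1)}

for name, n in macs.items():
    print(f"{name:15s} {n / 1e6:7.0f}e6 MAC")
print(f"{'Total':15s} {total_macs / 1e6:7.0f}e6 MAC")
for bits, mb in footprint_mb.items():
    print(f"{bits:2d}-bit weights: {mb:6.2f} MB")
```

Compared with the 73e6 MAC of the dynamic beamformer in Table II, the mask estimator dominates the cost by about two orders of magnitude.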

Reducing the precision of the DNN-based mask estimator reduces the computational complexity and memory consumption of the overall MCSE system. Reduced-precision DNNs can be realized via bit-packing schemes with the help of processor-specific GEMM instructions [1], or can be implemented on a DSP or FPGA.

Computational savings for various 8-bit DNN models on both ARM processors and GPUs have been reported in [54, 53, 45, 1]. In particular, [55] reported that speech recognition performance is maintained when quantizing the neural network parameters to 8-bit fixed-point, while the system runs 3 times faster on an x86 architecture.

In order to demonstrate the advantages that binary computations achieve on other general-purpose processors, we implemented matrix-multiplication operators for NVIDIA GPUs and ARM CPUs. BNNs can be implemented very efficiently as 1-bit scalar products, i.e. the multiplication of two vectors $\mathbf{x}$ and $\mathbf{y}$ of length $N$ reduces to a bit-wise xnor() operation, followed by counting the number of set bits with popc(), i.e.

$$\mathbf{x} \cdot \mathbf{y} = 2\,\mathrm{popc}\big(\mathrm{xnor}(\mathbf{x}, \mathbf{y})\big) - N, \qquad x_i, y_i \in \{-1, +1\}, \tag{13}$$

where $x_i$ and $y_i$ denote the $i$th element of $\mathbf{x}$ and $\mathbf{y}$, respectively. We use the matrix-multiplication algorithms of the MAGMA and Eigen libraries and replace float multiplications by xnor() operations, as given in Eq. (13). Our CPU implementation uses NEON vectorization in order to fully exploit SIMD instructions on ARM processors. We report the execution times on GPUs and ARM CPUs in Table III. As can be seen, binary arithmetic offers considerable speed-ups over single precision with manageable implementation effort. This also affects energy consumption, since binary values require fewer off-chip accesses and operations. Performance results for x86 architectures are not reported because neither the SSE nor the AVX ISA extensions support vectorized popc().
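The XNOR and popcount trick can be demonstrated in a few lines of plain Python, using arbitrary-precision integers as bit vectors. This is only meant to show the arithmetic identity; it says nothing about the MAGMA, Eigen or NEON implementations, and the helper names are made up for this example.

```python
def pack_bits(v):
    """Pack a sequence of +1/-1 values into an int: bit i is set where v[i] == +1."""
    word = 0
    for i, x in enumerate(v):
        if x == 1:
            word |= 1 << i
    return word

def binary_dot(x_bits, y_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    xnor = ~(x_bits ^ y_bits) & mask  # bit is 1 where both signs agree
    agree = bin(xnor).count("1")      # popcount
    return 2 * agree - n              # (#agreements) - (#disagreements)

x = [1, -1, -1, 1, 1, -1, 1, 1]
y = [1, 1, -1, -1, 1, -1, -1, 1]
result = binary_dot(pack_bits(x), pack_bits(y), len(x))
reference = sum(a * b for a, b in zip(x, y))
```

Each agreeing position contributes +1 and each disagreeing position contributes -1, so the result is twice the number of agreements minus the vector length. On hardware, one machine word processes 32 or 64 such elements at once.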

arch   matrix size   time (float32)   time (binary)   speed-up
GPU        256           0.14 ms         0.05 ms         2.8
GPU        513           0.34 ms         0.06 ms         5.7
GPU       1024           1.71 ms         0.16 ms        10.7
GPU       2048          12.87 ms         1.01 ms        12.7
ARM        256           3.65 ms         0.42 ms         8.7
ARM        513          16.73 ms         1.43 ms        11.7
ARM       1024         108.94 ms         8.13 ms        13.4
ARM       2048         771.33 ms        58.81 ms        13.1
TABLE III: Performance metrics for matrix-matrix multiplications on an NVIDIA Tesla K80 (GPU) and an ARM Cortex-A57 (ARM).

III-B Reduced Precision DNNs

We exploit reduced-precision weights and limit the numerical precision of the DNN-based mask estimator to either 8- or 4-bit fixed-point representations or to binary weights. Recently, there have been numerous approaches to train DNNs with limited precision [64, 61, 57, 11].

III-B1 DNN with Low-precision Weights

The weights and activations of a DNN often lie within a small range, making it possible to introduce quantization schemes. Implementations like [35, 47] use reduced precision for their DNN’s weights. In [55], a speed-up of inference by a factor of 3 has been reported for a fixed-point implementation on general-purpose hardware. Hence, we consider a fixed-point representation of the computed values in the forward pass of our DNN [11]. In particular, we use 8- and 4-bit weights, which are represented in the Q2.6 and Q2.2 fractional formats, respectively. After each layer, we use batch normalization to ensure that the activations fit within the representable range of the respective format. The accumulation of the values in the dot products and the batch normalization are performed with high precision, while the multiplication is performed at lower precision.
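As an illustration of what rounding to the Q2.6 and Q2.2 formats means, the sketch below quantizes a few example values. It assumes that the sign bit is counted among the two integer bits (two's complement), which is consistent with total word lengths of 8 and 4 bits; the function name and the example values are hypothetical.

```python
import numpy as np

def quantize_fixed_point(x, int_bits, frac_bits):
    """Round to a signed fixed-point grid and clip to its representable range.

    Assumes two's complement with the sign bit counted among int_bits,
    so Q2.6 covers [-2, 2 - 2**-6] in steps of 2**-6.
    """
    step = 2.0 ** -frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - step
    return np.clip(np.round(x / step) * step, lo, hi)

w = np.array([-3.0, -0.3, 0.01, 0.4, 1.99, 2.5])
w_q26 = quantize_fixed_point(w, int_bits=2, frac_bits=6)  # 8-bit weights
w_q22 = quantize_fixed_point(w, int_bits=2, frac_bits=2)  # 4-bit weights
```

The 8-bit grid has a step of 1/64 and the 4-bit grid a step of 1/4, so small weights such as 0.01 survive in Q2.6 but collapse to zero in Q2.2. Values outside the range saturate, which is why keeping activations in range matters.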

During training we compute the gradient and update the weights using float32, while the precision is only reduced in the forward pass (the derivative is computed with respect to the quantized weights as in [10, 64, 35]). This is known as the straight-through estimator (STE) [10, 64], where the parameter update is performed in full precision. Usually, when deploying the DNN in an application, only the forward-pass calculations are required. Hence, the reduced-precision weights can be used, reducing memory requirements by a factor of 4 (8-bit) or 8 (4-bit) compared to 32-bit weight representations. Figure 3 shows a reduced-precision LSTM cell. Besides the well-known gating and vector-matrix computations of LSTMs, bit-clipping operations are introduced after each mathematical operation. Details of the LSTM cell can be found in [23].

Fig. 3: Reduced precision LSTM cell, using bit-clipping to either 4- or 8 bit fixed-point representation after each mathematical operation.

III-B2 DNN with Binary Weights

In [10], binary-weight DNNs are trained using the STE, i.e. deterministic or stochastic rounding is used during forward propagation, and the full-precision weights are updated based on the gradients of the quantized weights. In [27], the STE is used to quantize both the weights and the activations to a single bit using sign functions. The authors of [33] trained ternary weights by setting weights below or above a certain threshold to , or zero otherwise. This has been extended in [65] by learning the factors and using gradient updates, and by applying a different threshold.

Fig. 4: Binary LSTM, using both binary weights and binary activation functions and scaling parameters .

When dealing with recurrent architectures such as LSTMs, [35] observed that recent reduced-precision techniques for BNNs [10, 27] cannot be directly extended to recurrent layers. In particular, a simple reduction of the precision of the forward pass to 1 bit in the LSTM layers suffers from severe performance degradation as well as the vanishing gradient problem. In [65, 3], batch normalization and weight scaling are applied to the recurrent weights to overcome this problem. We adopt this approach, i.e. we introduce a trainable scaling parameter , which maps the range of the recurrent activations to . Hence, each of the recurrent weight matrices and has its own scaling factor, i.e. . See also Fig. 4. This limits the recurrent weights to small values and prevents the LSTM from reaching unstable states, i.e. it avoids accumulating the cell states to large numbers. For binary weights, the LSTM cell equations are given as:


where and are a binary version (i.e. hard sigmoid and sign function [10]) of the well-known sigmoid and tanh activation functions. The weights and biases are the binary network parameters (i.e. with values of ), and are the scaling parameters for the recurrent network weights.
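The ingredients named above (sign-binarized weights, a hard sigmoid and a sign function as activations, and one scaling factor per recurrent weight matrix) can be combined into a single LSTM step as in the following sketch. It is a simplified assumption of how such a cell could look, not a reproduction of the paper's cell equations: batch normalization is omitted, the hard sigmoid follows the common definition clip((x + 1) / 2, 0, 1), and all names and values are hypothetical.

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear replacement for the logistic sigmoid."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def sign(x):
    """Binarize to {-1, +1}; zero is mapped to +1."""
    return np.where(x >= 0, 1.0, -1.0)

def binary_lstm_step(x, h, c, W, U, b, alpha):
    """One LSTM step with binarized parameters.

    W, U, b: dicts keyed by gate ('i', 'f', 'o', 'g') holding real-valued
    latent parameters that are binarized on the fly.
    alpha: dict with one scaling factor per recurrent matrix U[gate].
    """
    pre = {g: sign(W[g]) @ x + alpha[g] * (sign(U[g]) @ h) + sign(b[g])
           for g in "ifog"}
    i, f, o = (hard_sigmoid(pre[g]) for g in "ifo")
    g_in = sign(pre["g"])            # sign replaces tanh on the cell input
    c_new = f * c + i * g_in
    h_new = o * sign(c_new)          # sign replaces tanh on the cell state
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 5, 3
W = {g: rng.standard_normal((n_hid, n_in)) for g in "ifog"}
U = {g: rng.standard_normal((n_hid, n_hid)) for g in "ifog"}
b = {g: rng.standard_normal(n_hid) for g in "ifog"}
alpha = {g: 0.5 for g in "ifog"}     # hypothetical scaling factors

h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(4):
    h, c = binary_lstm_step(rng.standard_normal(n_in), h, c, W, U, b, alpha)
```

Because the gates lie in [0, 1] and the cell input is +1 or -1, the cell state can grow by at most 1 per step in this sketch, and the scaling factors limit how strongly the recurrent path drives the gates.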

IV Experiments

IV-A Experimental Setup

The performance of the multi-channel speech enhancement system is demonstrated by simulating a typical living room scenario with two static speakers S1 and S2, two moving speakers D1 and D2, and an isotropic background noise source I, similar to [39]. The floor plan of the setup is shown in Figure 5. The circular microphone array with microphones and a diameter of is shown in red, labeled as Mic. Head movements of the static speakers S1 and S2 are simulated by random 3D position changes within . The trajectories of the moving speakers D1 and D2 are random within a region of 2 m × 4 m on both sides of the microphone array. The movement velocity is constant at .

Fig. 5: Shoebox model of a living room showing two stationary sound sources S1 and S2, and two dynamic sound sources D1 and D2. The microphone array (Mic) is visualized as red circle.

We specify five scenarios for our experiments using this shoebox model:

  1. Random vs. isotropic (R-I): A static speaker with head movements is the random source. The position is randomly selected in the room for each new utterance to prevent the model from learning the position of the speaker.

  2. Static1 vs. isotropic (S1-I): A stationary speaker at fixed position S1 and an isotropic background noise are used in this scenario. The head movements cause a varying phase especially at higher frequencies.

  3. Static1 vs. static2 + isotropic (S1-S2I): Two simultaneously talking speakers at position S1 and S2 embedded in isotropic background noise are used in this scenario.

  4. Dynamic1 vs. isotropic (D1-I): The speaker moving in region D1 has to be tracked in the presence of ambient background noise. This challenges the tracking capabilities of the DNN mask estimation.

  5. Dynamic1 vs. dynamic2 + isotropic (D1-D2I): The capability to separate two speakers moving in D1 and D2, embedded in background noise, is analysed.

These experimental setups are summarized in Table IV:

Experiment #   Desired source   Interfering source(s)
1              random R         isotropic I
2              S1               isotropic I
3              S1               S2, isotropic I
4              D1               isotropic I
5              D1               D2, isotropic I
TABLE IV: Experimental setups using a virtual shoebox model.

IV-B Data Generation

We use the Image Source Method (ISM) [36, 44] to simulate the ATFs in Eq. (2). This makes it possible to generate multi-channel recordings from a monaural source. The room is modeled as a shoebox with a reflection coefficient of for each wall. The reflection order is , which results in a reverberation time of . We generate a new set of ATFs every for the moving sources. The isotropic background noise is determined as


where is the monaural noise source, and and denote the eigenvalue and eigenvector matrices of the spatial coherence matrix for a spherical sound field [31]. The vector denotes a uniformly distributed phase between
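The construction described here can be sketched for a single frequency bin as follows. The spatial coherence of a spherically isotropic field between two microphones at distance d is the well-known sin(kd)/(kd) function; the way the monaural noise, the random phases and the eigendecomposition are combined below is an assumption, as are the array radius of 5 cm, the phase range of [-π, π), the speed of sound of 343 m/s and all function names.

```python
import numpy as np

def spherical_coherence(mic_pos, freq_hz, c=343.0):
    """Coherence matrix sin(kd)/(kd) of a spherically isotropic sound field."""
    d = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    return np.sinc(2.0 * freq_hz * d / c)  # np.sinc(x) = sin(pi x) / (pi x)

def isotropic_noise(n_mono, Gamma, rng):
    """Spread monaural noise frames n_mono (shape (L,)) over M microphones."""
    eigvals, E = np.linalg.eigh(Gamma)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    phase = rng.uniform(-np.pi, np.pi, size=(Gamma.shape[0], n_mono.shape[0]))
    v = n_mono[None, :] * np.exp(1j * phase)  # M copies with independent phases
    return (E * np.sqrt(eigvals)) @ v         # E diag(sqrt(eigvals)) v

rng = np.random.default_rng(0)
angles = 2.0 * np.pi * np.arange(6) / 6
# Circular array with 6 microphones and a hypothetical radius of 5 cm.
mic_pos = 0.05 * np.stack([np.cos(angles), np.sin(angles), np.zeros(6)], axis=1)
Gamma = spherical_coherence(mic_pos, freq_hz=1000.0)

n_mono = np.ones(20000, dtype=complex)  # unit-power placeholder noise frames
N = isotropic_noise(n_mono, Gamma, rng)
cov = (N @ N.conj().T) / N.shape[1]     # empirical spatial covariance
```

With enough frames, the empirical covariance `cov` approaches the target coherence matrix `Gamma`, which confirms that the generated channels have the intended spatial statistics.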


IV-C Training and Testing

We use 12776 utterances from the si_tr_s set of the WSJ0 corpus [37] as speech sources in Eq. (2) for training. Additionally, 20 hours of different sound categories from YouTube [42] are used as isotropic background noise. All recordings are sampled at 16 kHz and converted to the frequency domain with bins and 75% overlapping blocks. The sources are mixed with equal volume. For testing, we use 2907 utterances from the si_et_05 set of the WSJ0 corpus mixed with YouTube noise.

The ground truth gain masks required for training can be obtained for the desired signal as:


The mask for the interfering signals is given as:


The weak signal components, which do not contribute to any of the PSD matrices, are obtained as:


Parameter specifies the amount of energy per frequency bin required for the signal to be assigned to either the desired or interfering class label. Note that the calculation of the ground truth masks requires the corresponding signal energies and to be known, which is why we used the ISM rather than existing multi-channel speech databases such as [34].

By setting for each time-frequency bin, we can use the cross-entropy as loss function. For each (B)LSTM or dense layer, a tanh activation and batch normalization [29] are applied. We train a separate DNN for each of the five scenarios in Table IV. Model optimization is done using stochastic gradient descent with ADAM [30], minimizing the cross-entropy between the optimal binary mask and the estimated mask of the respective model. To avoid overfitting, we use early stopping by observing the error on the validation set every 20 epochs.

IV-D Performance Evaluation

We use three different beamformers for each gain mask: the MVDR, GEV-BAN and GEV-PAN (see Section II). The estimates of the PSD matrices are obtained using Eq. (10), where blocks. We apply the BeamformIt toolkit [2] as baseline. It uses DOA estimation [4] followed by an MVDR beamformer. To evaluate the performance of the enhanced signals , we use the Google Speech-to-Text API [49] to perform Automatic Speech Recognition (ASR). Furthermore, we determine the SNR improvement as:


where the optimal binary mask is used to measure the energy of the desired and interfering components in the beamformer output and the noisy inputs , respectively. The can be computed without having access to the beamforming weights , as is the case of the BeamformIt toolkit.

IV-E Results

While improvements of memory footprint and computation time are independent of the underlying tasks, the prediction accuracy highly depends on the complexity of the data and the used neural network. Simple data sets allow for aggressive quantization without affecting prediction performance significantly, while binary quantization results in severe prediction degradation on more complex data sets.

Figure 6 shows speech mask estimations using (a) 32-, (b) 8-, (c) 4- and (d) 1-bit DNNs for the mixture of scenario (S1-I) of the WSJ0 utterance “When its initial public offering is completed Ashland is expected to retain a 46% stake” from si_et_05. As noted in Section II-E, the activation function of the output layer is a full-precision softmax function. The reduction of the weight precision introduces artifacts in (b), (c) and (d).

Fig. 6: Speech mask estimation using (a) 32-, (b) 8-, (c) 4-, and (d) 1-bit DNNs.
Fig. 7: (a) Original WSJ0 utterance “When its initial public offering is completed Ashland is expected to retain a 46% stake”, (b) background car noise, (c) mixture, (d) reconstructed speech using BeamformIt, (e-h) speech estimates using 32-, 8-, 4-, and 1-bit DNNs and a GEV-BAN beamformer, respectively.

Figure 7 shows the corresponding log-spectrograms. In particular, (a) shows the original source signal, (b) the noise, and (c) the mixture, while (d-h) show the source signals reconstructed using BeamformIt and using 32-, 8-, 4-, and 1-bit DNNs with a GEV-BAN beamformer, respectively. Reduced-precision DNNs generate reasonable predictions compared to the single-precision baseline. BeamformIt is not able to remove the low-frequency components of the car noise. The reduced-precision DNNs are able to attenuate the car noise in the background in a similar way as the 32-bit baseline DNN.

This is also reflected in Table V, which shows the SNR improvement on the test set for experiments 1 to 5. Mask-based beamformers outperform BeamformIt in all five experiments. Reducing the bit-width slightly degrades the SNR performance, but it also reduces the memory footprint of the models. There is a small difference between the mask-based beamformers, i.e. GEV performs slightly better than MVDR. In general, 8-bit mask estimators achieve competitive SNR scores, comparable to the full-precision baseline.

exp.   bits   method       GEV-BAN   GEV-PAN   MVDR
1      32     BeamformIt      -         -      -0.57
1      32     DNN            8.09      8.37     7.36
1      8      DNN            7.61      8.00     6.77
1      4      DNN            4.36      5.81     4.17
1      1      DNN            5.47      6.30     4.96
2      32     BeamformIt      -         -      -0.46
2      32     DNN            8.57      8.76     7.95
2      8      DNN            8.43      8.63     7.87
2      4      DNN            7.50      8.05     6.21
2      1      DNN            7.60      8.03     6.72
3      32     BeamformIt      -         -      -0.20
3      32     DNN           11.69     11.96    10.48
3      8      DNN           10.09     10.53    10.88
3      4      DNN           10.29     10.96     6.65
3      1      DNN           10.71     11.21     8.83
4      32     BeamformIt      -         -      -0.19
4      32     DNN            8.44      8.72     7.63
4      8      DNN            8.11      8.46     7.27
4      4      DNN            7.01      7.78     5.49
4      1      DNN            6.62      7.30     5.76
5      32     BeamformIt      -         -       0.35
5      32     DNN           12.73     13.15    10.72
5      8      DNN           12.09     12.62     9.50
5      4      DNN           10.26     11.09     4.60
5      1      DNN           11.25     12.01     7.31
TABLE V: SNR improvement (in dB) of GEV and MVDR beamformers for the five experiments conducted on WSJ0 test sets.

Table VI reports the word error rate (WER). We use the 6-channel data processed with 32-, 8-, 4-, and 1-bit DNNs for speech mask estimation using GEV-PAN, GEV-BAN and MVDR beamformers. Ground-truth transcriptions were generated using the original WSJ0 recordings. Single-precision networks obtained the best overall WER in all experiments. Among the reduced-precision networks, 8-bit DNNs produce competitive results when using MVDR beamformers. For the 4- and 1-bit variants the performance degrades. For experiments with more than one dominant source, BeamformIt fails. In general, results for single-speaker scenarios (experiments 1, 2 and 4) are better.

bits  experiment 1  GEV BAN  GEV PAN     MVDR
32    BeamformIt          -        -    21.38
32    DNN              9.14    11.71     9.69
8     DNN             12.13    16.57    10.62
4     DNN             22.14    21.89    15.24
1     DNN             29.97    40.23    15.07

bits  experiment 2  GEV BAN  GEV PAN     MVDR
32    BeamformIt          -        -    22.77
32    DNN              8.15    11.10     9.64
8     DNN              8.89    10.67    10.17
4     DNN             12.79    17.56    12.21
1     DNN             10.88    15.57    11.20

bits  experiment 3  GEV BAN  GEV PAN     MVDR
32    BeamformIt          -        -    84.68
32    DNN             15.38    17.48    16.24
8     DNN             24.84    26.02    24.69
4     DNN             21.94    29.18    25.76
1     DNN             20.78    26.79    21.89

bits  experiment 4  GEV BAN  GEV PAN     MVDR
32    BeamformIt          -        -    22.95
32    DNN             13.99    19.12    14.63
8     DNN             16.00    21.72    16.47
4     DNN             27.66    37.00    20.08
1     DNN             26.69    38.37    19.68

bits  experiment 5  GEV BAN  GEV PAN     MVDR
32    BeamformIt          -        -    80.90
32    DNN             19.80    27.01    21.04
8     DNN             24.33    33.23    23.81
4     DNN             37.21    49.52    43.03
1     DNN             31.92    44.09    31.18
TABLE VI: Word error rates on enhanced speech data of WSJ0 corpus. Speech has been processed with GEV-PAN, GEV-BAN and MVDR beamformers using 32-, 8-, 4-, and 1-bit DNNs for speech mask estimation.
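For readers unfamiliar with the metric in Table VI, the following minimal sketch computes a WER as the word-level edit distance between a reference and a hypothesis transcription, divided by the number of reference words and expressed in percent. This is the standard textbook definition; the text normalization (lowercasing, whitespace splitting) and the example sentences are assumptions for illustration and may differ from the scoring actually used for the table.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcription must not be empty")

    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# Hypothetical transcriptions: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2f}")
```

In this example two of the six reference words are affected, so the WER is one third, i.e. about 33.33.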

V Conclusion

We introduced a resource-efficient approach for multi-channel speech enhancement using DNNs for speech mask estimation. In particular, we reduced the precision to 8, 4 and 1 bit. We used a recurrent neural network structure capable of learning long-term relations. Limiting the bit-width of the DNNs reduces the memory footprint and improves the computational efficiency, while the degradation in speech mask estimation performance is marginal. When deploying the DNN in speech processing front-ends, only the reduced-precision weights and forward-pass calculations are required. This supports speech enhancement on low-cost, low-power and limited-resource front-end hardware. We conducted five experiments simulating various cocktail party scenarios using the WSJ0 corpus. In particular, we evaluated different beamforming architectures, i.e. MVDR, GEV-BAN, and GEV-PAN, combined with low bit-width mask estimators. MVDR beamformers using 8-bit reduced-precision DNNs for estimating the speech mask obtain competitive SNR scores compared to the single-precision baselines. Furthermore, the same architecture achieves competitive WERs in single-speaker scenarios, measured with the Google Speech-to-Text API. If multiple speakers are introduced, the performance degrades. In the case of binary DNNs, we show a significant reduction of the memory footprint while still obtaining an audio quality which is only slightly lower than that of single-precision DNNs. We show that these trade-offs can be readily exploited on today’s hardware by benchmarking the core operation of binary DNNs on NVIDIA and ARM architectures.
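To make the efficiency argument for binary DNNs concrete, the sketch below shows one common way to evaluate a dot product between two vectors with entries in {-1, +1}: each entry is stored as a single bit, and the result follows from a bitwise XOR and a population count. It is assumed here, not stated above, that this is the kind of core operation meant; the helper names are hypothetical, and real NVIDIA or ARM implementations would use hardware popcount instructions on packed machine words rather than Python integers.

```python
def pack_bits(values):
    """Pack +1/-1 values into one integer, one bit per value (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1}^n vectors, computed from their packed bits.

    Equal bits contribute +1 and differing bits contribute -1, so the
    result is n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount of the XOR
    return n - 2 * mismatches

# Hypothetical binary weight and activation vectors.
weights = [1, -1, 1, 1, -1, -1, 1, -1]
activations = [1, 1, -1, 1, -1, 1, 1, -1]

fast = binary_dot(pack_bits(weights), pack_bits(activations), len(weights))
reference = sum(w * x for w, x in zip(weights, activations))
print(fast, reference)
```

Storing one bit per weight instead of a 32-bit float is also where the memory saving comes from: the packed representation needs 1/32 of the space of single-precision weights.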

In the future, we aim to implement the system on target hardware and measure its resource consumption and run time.


  • [1] A. Abdelfattah, S. Tomov, and J. Dongarra (2019-05) Fast batched matrix multiplication for small sizes using half-precision arithmetic on gpus. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vol. , pp. 111–122. External Links: ISSN 1530-2075 Cited by: §III-A, §III-A.
  • [2] X. Anguera, C. Wooters, and J. Hernando (2007-09) Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing 15 (7), pp. 2011–2021. Cited by: §IV-D.
  • [3] A. Ardakani, Z. Ji, S. C. Smithson, B. H. Meyer, and W. J. Gross (2019) Learning recurrent binary/ternary weights. In International Conference on Learning Representations, Cited by: §III-B2.
  • [4] J. Benesty, J. Chen, and Y. Huang (2008) Microphone array signal processing. Springer, Berlin–Heidelberg–New York. Cited by: §II-A, §IV-D.
  • [5] J. Benesty, M. M. Sondhi, and Y. Huang (2008) Springer handbook of speech processing. Springer, Berlin–Heidelberg–New York. Cited by: §II-A, §II.
  • [6] C. Böddeker, H. Erdogan, T. Yoshioka, and R. Haeb-Umbach (2018) Exploring practical aspects of neural mask-based beamforming for far-field speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6697–6701. Cited by: §II-C.
  • [7] M. Brandstein and D. Ward (2001) Microphone arrays. Springer, Berlin–Heidelberg–New York. Cited by: §II.
  • [8] Y. Chen, J. Emer, and V. Sze (2016)

    Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks

    In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, Piscataway, NJ, USA, pp. 367–379. External Links: ISBN 978-1-4673-8947-1, Link, Document Cited by: §I.
  • [9] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang (2015) An exploration of parameter redundancy in deep networks with circulant projections. In

    International Conference on Computer Vision (ICCV)

    pp. 2857–2865. Cited by: §I.
  • [10] M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems (NIPS), pp. 3123–3131. Cited by: §I, §III-B1, §III-B2, §III-B2, §III-B2, footnote 4.
  • [11] M. Courbariaux, Y. Bengio, and J. David (2015) Training deep neural networks with low precision multiplications. In International Conference on Learning Representations (ICLR) Workshop, Vol. abs/1412.7024. Cited by: §III-B1, §III-B.
  • [12] L. Deng, M.L. Seltzer, D. Yu, A. Acero, A. Mohamed, and G.E. Hinton (2010) Binary coding of speech spectrograms using a deep auto-encoder.. In Interspeech, pp. 1692–1695. Cited by: §II-E.
  • [13] H. Erdogan, J. Hershey, S. Watanabe, M. Mandel, and J. L. Roux (2016) Improved MVDR beamforming using single-channel mask prediction networks. In Interspeech, Cited by: §I.
  • [14] S. Gannot, D. Burshtein, and E. Weinstein (2001-08) Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Transactions on Signal Processing 49 (8). Cited by: §II-A.
  • [15] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, and E. Choi (2018) MorphNet: fast & simple resource-constrained structure learning of deep networks. In

    2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018

    pp. 1586–1595. Cited by: §I.
  • [16] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), Cited by: §I.
  • [17] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143. Cited by: §I.
  • [18] S. S. Haykin (2009) Neural networks and learning machines. Third edition, Pearson Education. Cited by: §II-D.
  • [19] J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach (2015) BLSTM supported GEV beamformer front-end for the 3RD CHiME challenge. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 444–451. Cited by: §I.
  • [20] J. Heymann, L. Drude, and R. Haeb-Umbach (2016-03) Neural network based spectral mask estimation for acoustic beamforming. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 196–200. Cited by: §I, §II-E.
  • [21] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani (2016) Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 4, pp. 5210–5214. Cited by: §I.
  • [22] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop @ NIPS, Cited by: §I.
  • [23] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II-E, §III-B1.
  • [24] M. Horowitz (2014-02) 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Vol. , pp. 10–14. External Links: Document, ISSN 0193-6530 Cited by: §I.
  • [25] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861. Cited by: §I.
  • [26] Y. Huang, J. Benesty, and J. Chen (2006) Acoustic mimo signal processing. Springer, Berlin–Heidelberg–New York. Cited by: §II.
  • [27] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 4107–4115. Cited by: §III-B2, §III-B2.
  • [28] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1mb model size. CoRR abs/1602.07360. Cited by: §I.
  • [29] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In JMLR, pp. 448–456. Cited by: §IV-C.
  • [30] D. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §IV-C.
  • [31] H. Kuttruff (2009) Room acoustics. 5th edition, Spoon Press, London–New York. Cited by: §II, §IV-B.
  • [32] Y. LeCun, J. S. Denker, and S. A. Solla (1989) Optimal brain damage. In Advances in Neural Information Processing Systems (NIPS), pp. 598–605. Cited by: §I.
  • [33] F. Li, B. Zhang, and B. Liu (2016) Ternary weight networks. CoRR abs/1605.04711. Cited by: §III-B2.
  • [34] M. Lincoln, I. McCowan, J. Vepa, and H. K. Maganti (2005-11) The multi-channel wall street journal audio visual corpus (mc-wsj-av): specification and initial experiments. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2005., Vol. , pp. 357–362. Cited by: §IV-C.
  • [35] J. Ott, Z. Lin, Y. Zhang, S. Liu, and Y. Bengio (2016) Recurrent neural networks with limited numerical precision. CoRR abs/1608.06902. Cited by: §I, §III-B1, §III-B2, footnote 4.
  • [36] H. A. P. and Gannot,Sharon (2007) Generating sensor signals in isotropic noise fields. The Journal of the Acoustical Society of America 122 (6), pp. 3464–3470. Cited by: §I, §IV-B.
  • [37] D. B. Paul and J. M. Baker (1992) The design for the wall street journal-based csr corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT ’91, Stroudsburg, PA, USA, pp. 357–362. External Links: ISBN 1-55860-272-0 Cited by: §I, §IV-C.
  • [38] L. Pfeifenberger and F. Pernkopf (2014-05) Blind source extraction based on a direction-dependent a-priori SNR. In Interspeech, Cited by: §II-A.
  • [39] L. Pfeifenberger, M. Zöhrer, and F. Pernkopf (2019-12) Eigenvector-based speech mask estimation for multi-channel speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 2162–2172. External Links: ISSN 2329-9304 Cited by: §I, §I, §II-B, §II-D, §II-E, §IV-A.
  • [40] L. Pfeifenberger, M. Zöhrer, and F. Pernkopf (2017-03) DNN-based speech mask estimation for eigenvector beamforming. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, pp. 66–70. Cited by: §I, §II-E.
  • [41] L. Pfeifenberger, M. Zöhrer, and F. Pernkopf (2017-08)

    Eigenvector-based speech mask estimation using logistic regression

    In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, 2017, pp. 2660–2664. Cited by: §I.
  • [42] (2018) PyTube – a lightweight, pythonic, dependency-free, library for downloading youtube videos.. External Links: Link Cited by: §IV-C.
  • [43] W. Roth, G. Schindler, M. Zöhrer, L. Pfeifenberger, S. Tschiatschek, R. Peharz, H. Fröning, F. Pernkopf, and Z. Ghahramani (2019) Resource-efficient neural networks for embedded systems. JMLR submitted. Cited by: §I.
  • [44] R. Scheibler, E. Bezzam, and I. Dokmanic (2017) Pyroomacoustics: A python package for audio room simulations and array processing algorithms. CoRR abs/1710.04196. Cited by: §IV-B.
  • [45] G. Schindler, M. Zöhrer, F. Pernkopf, and H. Fröning (2018) Towards efficient forward propagation on resource-constrained systems. In European Conference on Machine Learning (ECML), (English). Cited by: §III-A.
  • [46] T. Schrank, L. Pfeifenberger, M. Zöhrer, J. Stahl, P. Mowlaee, and F. Pernkopf (2016) Deep beamforming and data augmentation for robust speech recognition: results of the 4th CHiME challenge. In Proc. of the 4th Intl. Workshop on Speech Processing in Everyday Environments (CHiME 2016), Cited by: §I.
  • [47] S. Shin, K. Hwang, and W. Sung (2016) Fixed-point performance analysis of recurrent neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 976–980. Cited by: §I, §III-B1.
  • [48] M. G. Shmulik, S. Gannot, and I. Cohen (2009-08)

    Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals

    IEEE Transactions on Audio, Speech, and Language Processing 17 (6). Cited by: §II-A, §II-A.
  • [49] (2018) SpeechRecognition – a library for performing speech recognition, with support for several engines and apis, online and offline.. External Links: Link Cited by: §I, §IV-D.
  • [50] R. Talmon, I. Cohen, and S. Gannot (2009-05) Relative transfer function identification using convolutive transfer function approximation. IEEE Transactions on audio, speech, and language processing 17 (4). Cited by: §II-A.
  • [51] K. Ullrich, E. Meeds, and M. Welling (2017) Soft weight-sharing for neural network compression. In International Conference on Learning Representations (ICLR), Cited by: §I.
  • [52] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. H. W. Leong, M. Jahre, and K. A. Vissers (2017) FINN: A framework for fast, scalable binarized neural network inference. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ISFPGA), pp. 65–74. Cited by: §I.
  • [53] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers (2017) FINN: a framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, pp. 65–74. Cited by: §III-A.
  • [54] Y. Umuroglu and M. Jahre (2017) Streamlined deployment for quantized neural networks. CoRR abs/1709.04060. Cited by: §III-A.
  • [55] V. Vanhoucke, A. Senior, and M. Z. Mao (2011) Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop @ NIPS, Cited by: §I, §III-A, §III-B1.
  • [56] B. D. V. Veen and K. M. Buckley (1988-04) Beamforming: a versatile approach to spatial filtering. IEEE International Conference on Acoustics, Speech, and Signal Processing 5 (5), pp. 4–24. Cited by: §I.
  • [57] N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan (2018) Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems 31, pp. 7675–7684. Cited by: §III-B.
  • [58] E. Warsitz and R. Haeb-Umbach (2007) Blind acoustic beamforming based on generalized eigenvalue decomposition. In IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, pp. 1529–1539. Cited by: §I, §II-B, §II-B.
  • [59] E. Warsitz, A. Krueger, and R. Haeb-Umbach (2008) Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 73–76. Cited by: §II-B, §II-B.
  • [60] C. W. Wu (2018) ProdSumNet: reducing model parameters in deep neural networks via product-of-sums matrix decompositions. CoRR abs/1809.02209. Cited by: §I.
  • [61] S. Wu, G. Li, F. Chen, and L. Shi (2018) Training and inference with integers in deep neural networks. Cited by: §III-B.
  • [62] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. J. Smola, L. Song, and Z. Wang (2015) Deep fried convnets. In International Conference on Computer Vision (ICCV), pp. 1476–1483. Cited by: §I.
  • [63] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen (2016) Cambricon-X: An accelerator for sparse neural networks. In International Symposium on Microarchitecture (MICRO), pp. 20:1–20:12. Cited by: §I.
  • [64] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou (2016) DoReFa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160. External Links: Link, 1606.06160 Cited by: §I, §III-B1, §III-B, footnote 4.
  • [65] C. Zhu, S. Han, H. Mao, and W. J. Dally (2017) Trained ternary quantization. In International Conference on Learning Representations (ICLR), Cited by: §III-B2, §III-B2.
  • [66] M. Zöhrer, R. Peharz, and F. Pernkopf (2015) Representation learning for single-channel source separation and bandwidth extension. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (12), pp. 2398–2409. External Links: Document, ISSN 2329-9290 Cited by: §II-E.
  • [67] M. Zöhrer, L. Pfeifenberger, G. Schindler, H. Fröning, and F. Pernkopf (2018-04) Resource efficient deep eigenvector beamforming. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Cited by: §I.
  • [68] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2017) Learning transferable architectures for scalable image recognition. CoRR abs/1707.07012. Cited by: §I.