Incorporating Symbolic Sequential Modeling for Speech Enhancement

04/30/2019 ∙ by Chien-Feng Liao, et al. ∙ Academia Sinica ∙ National Institute of Information and Communications Technology

In a noisy environment, a lossy speech signal can be automatically restored by a listener if he/she knows the language well. That is, with the built-in knowledge of a "language model", a listener may effectively suppress noise interference and retrieve the target speech signals. Accordingly, we argue that familiarity with the underlying linguistic content of spoken utterances benefits speech enhancement (SE) in noisy environments. In this study, in addition to the conventional modeling for learning the acoustic noisy-clean speech mapping, an abstract symbolic sequential modeling is incorporated into the SE framework. This symbolic sequential modeling can be regarded as a "linguistic constraint" in learning the acoustic noisy-clean speech mapping function. In this study, the symbolic sequences for acoustic signals are obtained as discrete representations with a Vector Quantized Variational Autoencoder algorithm. The obtained symbols are able to capture high-level phoneme-like content from speech signals. The experimental results demonstrate that the proposed framework can significantly improve the SE performance in terms of perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) on the TIMIT dataset.




1 Introduction

Speech enhancement (SE) has been commonly used as a front-end module in speech-related applications, such as robust automatic speech recognition (ASR) [1], automatic speaker recognition, and assistive listening devices [2, 3]. Recently, deep learning (DL)-based SE models have also been proposed and extensively investigated [4, 5, 6, 7, 8]. The main idea in these DL-based SE models is to learn the complex mapping functions between noisy speech and clean speech. In most studies, the mapping functions are learned based on a large quantity of well-prepared noisy-clean speech pairs in the acoustic domain without considering the underlying linguistic structure.

In a noisy environment, listeners can automatically restore noise-masked speech based on their knowledge of a “language model”, and this restoring ability depends on the effectiveness of the internal “language model”. For example, non-native listeners require considerably more effort to understand speech in noisy environments [9]. Such findings indicate that linguistic information helps to retrieve target speech signals from noisy ones. Accordingly, it is argued in this study that it is beneficial to incorporate text information (phonemes or words) into an SE system for improved performance.

In [10], oracle transcription is used to extract time-aligned text features as auxiliary input to the DNN model. Even though this can be formulated as a text-to-speech application, it is not practical in SE scenarios to assume that ground-truth transcription is available. Several studies incorporate recognition results or outputs from acoustic models. In [11], a phone-class feature is appended to standard acoustic features as input for de-reverberation. In [6], an ASR system and an SE system are trained iteratively, where each system’s input depends on the other’s output. In [12, 13], a set of DNNs were trained as enhancement models, one for each specific phoneme. At inference time, an ASR system or a phoneme classifier was used to determine which DNN to use. Even though promising results have been obtained, these approaches have major drawbacks. First, the recognition model is not jointly trained, and thus the two systems cannot be optimized together. If the ASR output is incorrect, the errors propagate to the downstream SE system. Second, coupling SE with a heavyweight ASR system may be undesirable because SE is commonly used as a preprocessor. To overcome these obstacles, [14] proposed learning a Deep Mixture of Experts (DMoE) network where the experts are DNNs whose outputs are combined by a gating DNN. The gating DNN is trained to assign a combination weight to each expert. This splits the acoustic space into sub-areas in an unsupervised manner, which is similar to our proposed method.

Oord et al. [15] recently proposed the Vector Quantized Variational Autoencoder (VQ-VAE), in which the stochastic continuous latent variables of the original VAE are replaced with deterministic discrete latent variables. It maintains a set of prototype vectors, i.e., a learnable codebook of predefined size. During the forward pass, feature vectors produced by the encoder are replaced with their nearest neighbors in the codebook. Although this quantization component acts as an information bottleneck and can regularize the power of the encoder, the discrete latent variables are more interpretable and tend to learn higher-level representations, which can naturally correspond to phoneme-like features for given speech signal inputs. In [16], a comprehensive study of VQ-VAE applied to speech data was carried out, and it was demonstrated that VQ-VAE achieves better interpretability and information separation (such as disentangling speaker characteristics) than VAEs and AEs. Furthermore, the extracted representation allowed for accurate mapping into phonemes and achieved competitive performance on an unsupervised acoustic unit discovery task. Overall, the characteristics of the VQ-VAE make it a suitable component to reinforce an SE system with high-level linguistic information.

Figure 1: Proposed system consisting of a U-Net architecture, a symbolic encoder, and an attention mechanism. Conv1Ds and Deconvs are in the format (filterWidth, outputChannels), and the down-sampling and up-sampling rates are both 2. FC (outputChannels) denotes the fully connected layer.

In this study, an SE system with U-Net architecture [17, 18, 19, 20] is proposed. Moreover, a “symbolic encoder” is developed, consisting of DNNs and the vector quantization mechanism in VQ-VAE. The extracted symbolic sequence is then connected to the U-Net via a multi-head attention mechanism [21]. Thereby, the two components can be jointly trained without the need for any supervised transcription or explicit constraints. The results demonstrate a significant improvement in terms of objective measures including perceptual evaluation of speech quality (PESQ) [22] and short-time objective intelligibility (STOI) [23].

The rest of the paper is organized as follows. In Section 2, the proposed approach is detailed, including each component of the system and the objective functions. The experimental settings and results are presented in Section 3. Finally, Section 4 concludes the paper.

2 System architecture

Consider a paired training dataset of utterances $(x, y)$, where $x$ is the input noisy speech and $y$ is the target clean speech. The proposed system is shown in Figure 1. It consists of the following parts: an encoder network of convolutional layers that extracts the feature sequence, and a second encoder network, called the symbolic encoder, that consists of fully connected layers and extracts the symbolic sequence by vector quantization. A multi-head attention function and skip-connections connect the two encoder outputs to the decoder. All components are jointly trained using the mean-squared-error (MSE) loss between the clean speech $y$ and the enhanced speech $\hat{y}$:

$$\mathcal{L}_{\mathrm{MSE}} = \lVert y - \hat{y} \rVert_2^2$$

The quantization mechanism and the multi-head attention mechanism will now be briefly explained; for more detailed information readers may refer to [15] and [21], respectively.

| SNR (dB) | Noisy PESQ | Noisy STOI | U-Net PESQ | U-Net STOI | U-Net-MOL PESQ | U-Net-MOL STOI | Proposed (64) PESQ | Proposed (64) STOI | Oracle PESQ | Oracle STOI |
| -6 | 1.213 | 0.532 | 1.685 | 0.602 | 1.800 | 0.619 | 1.828 | 0.624 | 1.961 | 0.703 |
| -3 | 1.353 | 0.598 | 1.880 | 0.669 | 1.974 | 0.681 | 2.045 | 0.693 | 2.140 | 0.741 |
| 0 | 1.517 | 0.669 | 2.071 | 0.725 | 2.140 | 0.736 | 2.240 | 0.750 | 2.306 | 0.776 |
| 3 | 1.702 | 0.739 | 2.237 | 0.770 | 2.290 | 0.779 | 2.416 | 0.794 | 2.456 | 0.806 |
| 6 | 1.902 | 0.823 | 2.387 | 0.805 | 2.424 | 0.813 | 2.581 | 0.830 | 2.592 | 0.831 |
| Avg. | 1.537 | 0.669 | 2.052 | 0.714 | 2.126 | 0.725 | 2.222 | 0.738 | 2.291 | 0.771 |

Table 1: Average PESQ and STOI scores of the baseline models and the proposed method on the test set under three unseen noise environments at five SNR levels, together with the average scores across all SNRs. The unprocessed test set is denoted by Noisy. The size of the symbolic book is shown in parentheses. Excluding Oracle, the proposed method attains the highest score for each metric in every row.

2.1 Symbolic Encoder

The symbolic encoder reads a sequence of acoustic features as input. Here, mel-frequency cepstral coefficients (MFCCs) are used, as suggested in [16]. A sequence of hidden vectors $Z = (z_1, \dots, z_T)$, $z_t \in \mathbb{R}^{d}$, is extracted by the fully connected layers, where $d$ is the dimensionality and $T$ denotes the sequence length. A symbolic book $B = \{e_1, \dots, e_M\}$ that contains a set of prototype vectors $e_k \in \mathbb{R}^{d}$ is maintained, where $M$ is the size of the book. Each hidden vector is replaced by the nearest prototype vector in the symbolic book. That is, $q(z_t) = e_k$, where $k = \arg\min_j \lVert z_t - e_j \rVert_2$. During the training phase, the prototypes in the symbolic book are updated as a function of exponential moving averages of $z_t$. This method is presented in the original paper [15] as an alternative way to update the book, and has the advantage of faster training than using an auxiliary loss. To prevent the outputs of the symbolic encoder from growing without bound, [15] also uses a “commitment loss” to encourage the symbolic encoder to produce vectors lying close to the prototypes. Overall, the full system is optimized with two loss terms: the MSE between the enhanced acoustic features $\hat{y}$ and the clean target features $y$, and the commitment loss:

$$\mathcal{L} = \lVert y - \hat{y} \rVert_2^2 + \beta \, \lVert Z - \mathrm{sg}[q(Z)] \rVert_2^2$$

Here, $q$ is applied to each hidden vector in $Z$, $\beta$ is a hyperparameter that controls the importance of the commitment loss, and $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation. It should be noted that the gradient of the loss can be backpropagated through the quantization step to the symbolic encoder using the straight-through estimator.
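The quantization, commitment loss, and moving-average book update described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the decay value of 0.99, the use of a mean instead of a sum in the commitment term, and the initialization of the running statistics are all assumptions, and NumPy has no automatic differentiation, so the stop-gradient and straight-through estimator appear only as comments.

```python
import numpy as np

def quantize(z, book):
    """Replace each hidden vector with its nearest prototype.

    z:    (T, d) hidden vectors from the symbolic encoder
    book: (M, d) prototype vectors (the symbolic book)
    Returns the quantized vectors (T, d) and the chosen token indices (T,).
    """
    # Squared Euclidean distance between every hidden vector and every prototype.
    dist = ((z[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)  # (T, M)
    idx = dist.argmin(axis=1)
    # In an autograd framework, the straight-through estimator would copy the
    # gradient of the quantized output directly onto z at this point.
    return book[idx], idx

def commitment_loss(z, z_q, beta=0.2):
    # z_q plays the role of sg[q(z)]: it is treated as a constant.
    # Averaging over elements (rather than summing) is an assumption.
    return beta * np.mean((z - z_q) ** 2)

def ema_update(book, counts, sums, z, idx, decay=0.99):
    """One exponential-moving-average update of the symbolic book.

    counts: (M,) running number of vectors assigned to each prototype
    sums:   (M, d) running sum of the vectors assigned to each prototype
    """
    one_hot = np.eye(book.shape[0])[idx]                       # (T, M)
    counts = decay * counts + (1 - decay) * one_hot.sum(axis=0)
    sums = decay * sums + (1 - decay) * (one_hot.T @ z)
    new_book = sums / np.maximum(counts, 1e-5)[:, None]
    return new_book, counts, sums
```

With running statistics initialized as `counts = np.ones(M)` and `sums = book.copy()`, each call to `ema_update` moves every selected prototype slightly toward the mean of the hidden vectors assigned to it, while unselected prototypes stay where they are.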


2.2 Multi-head Attention

Multi-head attention (MHA) was first proposed in the transformer architecture [21] for machine translation, and has recently been explored in various speech-related tasks including end-to-end ASR [25] and text-to-speech systems [26]. MHA extends the conventional attention mechanism to have multiple heads, where each head generates a different attention weight vector. This allows the decoder to jointly retrieve information from different representation subspaces at different positions, which facilitates focusing on the various structures of the symbolic sequence. The input consists of queries $Q$, keys $K$, and values $V$, i.e., $\mathrm{MHA}(Q, K, V)$. In this study, MHA is used before each layer in the decoder. Every time-step of the decoder output acts as a query to attend on the symbolic sequence. The output of MHA is concatenated with the skip-connection, and both are fed to the next decoder layer. Formally, let $S$ denote the symbolic sequence and $c_l$ the skip-connection from the encoder at each layer $l$. The output of each layer in the decoder is the following:

$$d_l = \mathrm{Decoder}_l\big(\big[\mathrm{MHA}(d_{l-1}, S, S);\, c_l\big]\big)$$

where $l$ is the depth of the decoder layer, $[\cdot\,;\cdot]$ denotes concatenation, and $d_0$ is the encoder output.
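To make the data flow concrete, the following NumPy sketch shows decoder time-steps querying the symbolic sequence and the result being concatenated with a skip-connection. It is an illustration under assumptions: the weight matrices `w_q`, `w_k`, `w_v`, the helper `decoder_layer_input`, the (time, feature) array layout, and the projection of the values are hypothetical choices, and the output projection and positional encodings of the full transformer are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, w_q, w_k, w_v, n_heads=4):
    """Scaled dot-product attention with several heads.

    q:    (Tq, dq) queries, here the decoder time-steps
    k, v: (Ts, ds) keys and values, here the symbolic sequence
    w_*:  projection matrices mapping inputs to a shared width D
    """
    Q, K, V = q @ w_q, k @ w_k, v @ w_v          # (Tq, D), (Ts, D), (Ts, D)
    D = Q.shape[-1]
    dh = D // n_heads

    def split(x):
        # (T, D) -> (n_heads, T, dh)
        return x.reshape(x.shape[0], n_heads, dh).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)   # (n_heads, Tq, Ts)
    weights = softmax(scores, axis=-1)                  # one weight vector per head
    out = weights @ Vh                                  # (n_heads, Tq, dh)
    # Merge the heads back into a single (Tq, D) matrix.
    return out.transpose(1, 0, 2).reshape(-1, D), weights

def decoder_layer_input(d_prev, symbols, skip, w_q, w_k, w_v, n_heads=4):
    """Concatenate the attention output with the skip-connection (feature axis)."""
    attended, _ = multi_head_attention(d_prev, symbols, symbols, w_q, w_k, w_v, n_heads)
    return np.concatenate([attended, skip], axis=-1)
```

Each of the four heads yields its own attention weights over the symbolic sequence, so a single decoder time-step can attend to several tokens in different ways at once.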

2.3 Model Details

The symbolic encoder consists of four fully connected layers, each followed by a ReLU activation function and a dropout layer [27] with a drop rate of 0.2. A linear projection layer then maps the hidden vectors to the dimensionality used for quantization. A one-dimensional (1-D) convolutional layer that slides along the time axis gives the symbolic sequence contextual information. Four heads are used in MHA, whose output leads to point (a) in Figure 1. As in the original transformer, positional encodings are added to the inputs of the MHA, providing some information about the position of the tokens in the sequence. Queries and keys are first passed through a linear projection layer with 256 nodes before being divided into multiple heads. For the encoder, the frequency axis is treated as the channel axis; thus, 1-D convolutional layers are used. The sequence length is down-sampled at each layer using a stride of 2 instead of pooling layers. The decoder is a mirrored version of the encoder with deconvolutional layers and a larger kernel width. LeakyReLU is used as the activation function in both the encoder and the decoder. Finally, the decoder output is projected back to the frequency dimension using a 1-D convolution with a kernel width of 1.

Figure 2: Left: Histogram: each bin represents the token index, and the value shows how many times this token was chosen, given the corresponding phoneme. Right: The element on location (i,j) represents JS-divergence between the histogram from the i-th phoneme and the histogram from the j-th phoneme. Darker color implies larger divergence. Some phonemes were omitted owing to space limitations.

3 Experiments

The experiments were conducted on the TIMIT database [28]. A total of 3696 utterances from the TIMIT training set (excluding SA files) were randomly sampled and corrupted with 100 noise types from [29] at six SNR levels, i.e., 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and -5 dB, to obtain a 40-hour multi-condition training set consisting of pairs of clean and noisy speech utterances. Another 100 utterances were randomly sampled to construct the validation set. They were mixed with cafeteria babble noise, which does not appear in the training set, at four SNR levels (-4 dB, 0 dB, 4 dB, and 8 dB). The 192 utterances from the core test set of the TIMIT database were used to construct the test set for each combination of noise types and SNR levels. To evaluate the system on unseen noise types, three other noise types, namely Buccaneer1, Destroyer engine, and HF channel from the NOISEX-92 corpus [30], were adopted. In the following experiments, the SE algorithm is evaluated in terms of speech quality and speech intelligibility, measured by PESQ and STOI, respectively. Higher scores represent better performance.
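Corrupting a clean utterance with noise at a target SNR, as in the data preparation above, can be sketched as follows. The function name `mix_at_snr`, the random choice of the noise segment, and the tiling of short noise recordings are assumptions for illustration; the source does not describe these details.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=None):
    """Add noise to a clean waveform so that the mixture has the given SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    # Repeat the noise if it is shorter than the utterance (assumption).
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    # Pick a random noise segment of the same length as the utterance.
    start = rng.integers(0, len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(segment ** 2)
    if p_noise == 0:
        raise ValueError("noise segment is silent; cannot set an SNR")
    # Scale the noise so that 10*log10(p_clean / p_scaled_noise) equals snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * segment
```

For example, calling `mix_at_snr(clean, noise, snr_db)` once for each of the six training SNR levels yields one noisy version of the same clean utterance per level.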

3.1 Implementation

The sampling rate of the speech data was 16 kHz. For the encoder input, time-frequency (T-F) features were extracted using a 512-point short-time Fourier transform (STFT) with a Hamming window size of 32 ms and a hop size of 16 ms, resulting in feature vectors consisting of 257-point STFT log-power spectra (LPS). For the symbolic encoder, standard 13-dimensional MFCC features (extracted at the same frame rate as the LPS features) were used and concatenated with their first and second temporal derivatives. MFCCs are often used in speech recognition because they are pitch invariant and slightly robust to noise. In preliminary experiments, better quantization behavior was observed with MFCCs than with LPS. The input was a segment of 64 frames (approximately 1 s), normalized by its mean and standard deviation before being fed to the system. Finally, the decoder outputs were synthesized back to the waveform signal via the inverse Fourier transform and an overlap-add method. The phases of the noisy signals were used for the inverse Fourier transform. All models were trained on minibatches of 32 using the Adam optimizer [31]. The weight of the commitment loss was set to 0.2, which is close to the original setting in VQ-VAE, and it did not have a significant impact on performance. Early stopping was performed based on the validation set to prevent overfitting.
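The LPS feature extraction described above can be sketched in NumPy. At 16 kHz, a 32 ms window is 512 samples and a 16 ms hop is 256 samples, so a 512-point FFT yields 257 frequency bins per frame. The function names, the absence of padding at the signal edges, the small constant `eps` inside the logarithm, and the use of a single global mean and standard deviation per segment are assumptions, not details from the source.

```python
import numpy as np

def log_power_spectra(wave, n_fft=512, hop=256, eps=1e-10):
    """Frame a waveform and return (LPS, phase), each of shape (n_frames, 257).

    Assumes len(wave) >= n_fft; no padding is applied at the edges.
    """
    window = np.hamming(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=n_fft, axis=1)   # n_fft // 2 + 1 = 257 bins
    lps = np.log(np.abs(spec) ** 2 + eps)         # log-power spectrum
    phase = np.angle(spec)                        # noisy phase, kept for resynthesis
    return lps, phase

def normalize(segment):
    """Zero-mean, unit-variance normalization of one input segment."""
    return (segment - segment.mean()) / (segment.std() + 1e-8)
```

A 64-frame segment spans 512 + 63 × 256 = 16640 samples, i.e., 1.04 s of audio, which matches the "approximately 1 s" stated above.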

3.2 Baseline model

We constructed the baseline model by excluding the symbolic encoder component, i.e., the left part of Figure 1 without MHA. This model is denoted by U-Net. Subsequently, the multi-objective learning method proposed in [32] was applied to the baseline model. The input of the U-Net was augmented by MFCC features, and an additional objective was added during training to predict clean MFCCs. This baseline is denoted by U-Net-MOL. Finally, to demonstrate the benefit of using real text information as in [10], the phoneme-level transcriptions provided by the TIMIT corpus were used to obtain frame-wise phoneme labels. The input MFCCs of the symbolic encoder were then replaced by phoneme embeddings (the embeddings are jointly learned). Quantization was discarded because the real phonetic information was provided. This is considered an oracle model, as it takes correct transcriptions as input, and is denoted by Oracle.

| Book size M | PESQ | STOI |
| 39 | 2.061 | 0.711 |
| 64 | 2.108 | 0.713 |
| 128 | 2.027 | 0.712 |
| 256 | 2.041 | 0.711 |

Table 2: Average PESQ and STOI performance on the validation set for different sizes of the symbolic book.

3.3 Results

Table 1 presents the average PESQ and STOI scores on the test set for the different systems. “Noisy” denotes unprocessed noisy speech, and the proposed model is shown with a symbolic book size of 64 as a representative. From this table, it can be observed that Oracle performed the best, as expected. This also confirmed the hypothesis that, given correct text information, the SE system can be more robust to noisy environments. Furthermore, the proposed model outperformed U-Net and U-Net-MOL at every SNR level. It should be noted that the proposed system had fewer trainable parameters than the baselines, because MHA reduces the dimensionality, as mentioned in Section 2.3. Thus, the improvement was not due to model complexity. Table 2 shows the denoising ability of the proposed method on the validation set with different symbolic book sizes. It can be seen that performance peaked for a size of 64. During the experiments, it was also observed that the symbolic book suffered from the “index collapse” problem [33] (some tokens are never activated throughout training) for sizes larger than 256, implying that 256 tokens are sufficient for exploring the acoustic units, whereas adding more is of no benefit.

3.4 Interpretation of symbolic sequence

An advantage of the discrete representation learned by the VQ-VAE is the interpretability of individual tokens in the symbolic book. Here, a visualization method was developed to connect input acoustic features to the activated tokens. Figure 2 (left) shows histograms corresponding to phoneme classes. More specifically, noisy utterances from the test set were passed through the symbolic encoder to obtain the symbolic sequences. Given the frame-wise phoneme labels, a histogram for each phoneme class can then be plotted. Each bin represents a token index, and the value shows how many times this token was chosen for frames that belong to the corresponding phoneme. The histograms were normalized to become probability distribution functions (PDFs), i.e., each sums to 1. It can be seen that phonemes with similar pronunciation also have similar distributions in the histograms. For example, the phonemes in each of the pairs (aa, aw), (m, n), and (ch, sh) have similar distributions, whereas phonemes in different pairs have different distributions.

For a more complete understanding of the relations within the phoneme set, the Jensen-Shannon (JS) divergence between the phonemes was measured. Figure 2 (right) shows the result as a heat map. Each element represents the distance between two PDFs, and a darker color corresponds to a larger distance. As JS-divergence is symmetric, the heat map is also a symmetric matrix. Some light-colored squares are located along the diagonal, which implies that phonemes with similar pronunciation are clustered together, e.g., vowels have lighter colors with each other and are completely separated from fricatives. The heat map greatly facilitates the visualization of the relationship between phonemes. For instance, it shows that ch is very close to s, z, and sh. Overall, the symbolic encoder was demonstrated to be responsive to phonetic content. However, it was also observed that some phonemes that are pronounced differently lie near each other. A likely explanation is that the noise affected the input MFCCs, thus confusing the symbolic encoder. One possible solution is to explicitly constrain the symbolic encoder to become noise-invariant by adding a discriminator and using adversarial training as in [34]. This is left as future work.
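The per-phoneme token histograms and the pairwise JS-divergence matrix behind Figure 2 can be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, phonemes and tokens are assumed to be given as integer arrays with one entry per frame, and a base-2 logarithm is used so that the divergence lies in [0, 1].

```python
import numpy as np

def token_histograms(tokens, phonemes, n_tokens, n_phonemes):
    """Count how often each token is chosen for each phoneme, then normalize rows.

    tokens, phonemes: integer arrays with one entry per frame.
    Returns an (n_phonemes, n_tokens) matrix whose non-empty rows sum to 1.
    """
    hist = np.zeros((n_phonemes, n_tokens))
    np.add.at(hist, (phonemes, tokens), 1)        # unbuffered in-place counting
    totals = hist.sum(axis=1, keepdims=True)
    return hist / np.maximum(totals, 1)           # avoid dividing by zero for unseen phonemes

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute exactly zero.
        return np.sum(a * np.log2((a + eps) / (b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_matrix(hist):
    """Symmetric matrix of JS divergences between all pairs of phoneme histograms."""
    n = hist.shape[0]
    return np.array([[js_divergence(hist[i], hist[j]) for j in range(n)]
                     for i in range(n)])
```

Two phonemes that never share a token get a divergence of 1, while two phonemes with identical token distributions get 0, which corresponds to the darkest and lightest cells of the heat map, respectively.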

4 Conclusion and future work

A novel approach for incorporating phonetic content into an SE system was proposed, without the need for a recognition system or any transcriptions during training. The symbolic encoder used the vector quantization method proposed in VQ-VAE to extract discrete representations. Consequently, the symbolic encoder learned to divide the input MFCCs into acoustic units automatically, and achieved a notable performance improvement over the baseline systems. The representations were further interpreted by visualizing the behavior of the symbolic encoder, and it was confirmed that the encoder was phoneme-sensitive. In future studies, the effect of different noise types on the symbolic encoder will be investigated, and noise-invariant training will be performed to extract purer symbolic sequences. Furthermore, an explicit language model constraint based on the learned symbols may be even more useful to the SE system.