The objective of speech enhancement (SE) is to transform low-quality speech signals into enhanced-quality speech signals [SE]. In many speech-related applications, such as automatic speech recognition (ASR) [ASR1] and speech emotion recognition [emotion2], SE is used as a preprocessor to remove noise components from speech signals. In many portable or assistive-hearing devices, such as mobile phones [tan2019real], hearing aids [aids], and cochlear implants [implants], SE is crucial for increasing speech intelligibility and quality in noisy environments.
In the past few years, deep learning (DL)-based models have been widely used for SE [SE1, SE2, SE3, SE4, SE5, SE6, DAELu, CHLee, IA-NET]. Various deep neural networks, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks, have been used as fundamental models in SE systems. In these systems, a metric is defined to measure the distance between the enhanced output and the clean reference, and the DL models are trained to minimize this distance. The L1 and L2 (mean-square-error) losses are commonly used because of their ease of computation and differentiability. However, these two losses may not be optimal for specific tasks, and thus other metrics have been used as the loss to train DL models [metricgan, wang2018maximum].
In addition to model types and loss functions, another important consideration for the success of an SE system is its ability to adapt to new environments, particularly when deployed on embedded devices. In real-world situations, the noise in the testing environment is often unseen in the training set; moreover, the noise types often vary over time. The mismatch between training and testing environments can significantly degrade SE performance. Therefore, it is necessary to identify an approach that can efficiently and effectively adapt DL models to new testing conditions and improve SE performance. Thus far, several domain adaptation approaches [adaptation1, adaptation2, fine-tune1, fine-tune2] have been proposed to address the training-testing acoustic mismatch issue, which is also known as the domain shift problem. Although the noise-adapted models derived by these conventional approaches can provide improved SE results, they often suffer from a catastrophic forgetting effect [forget1, forget2]. In other words, when DL models adapt to a new noise environment, they usually perform poorly in previously adapted noise environments.
In this paper, we propose a regularization-based incremental learning strategy for adapting DL-based SE models to new environments (speakers and noise types) while handling the catastrophic forgetting issue. The proposed method is termed SERIL. SERIL exploits the advantages of two well-known incremental learning algorithms: (1) exploiting information from the whole past optimization path [SI] and (2) a curvature-based strategy [EWC]. We evaluated SERIL using two datasets: the Voice Corpus Bank (VCB) corpus [VCB] and the TIMIT corpus [TIMIT], which were used to form the training and testing sets, respectively. The overall SERIL procedure includes two phases: offline and online. In the offline phase, we first trained the DL model on utterances from the VCB corpus with 13 different types of noise. In the online phase, SERIL first adapted the pre-trained model based on a small amount of adaptation data; the adapted model was then used for SE. A direct fine-tuning model adaptation approach was implemented for comparison. Experimental results show that both SERIL and the direct fine-tuning approach effectively adapt the SE model to new environments and improve SE performance, compared with the pre-trained DL model without adaptation. Moreover, compared to the direct fine-tuning approach, SERIL maintained high SE performance on all previously learned types of noise, thus effectively addressing the catastrophic forgetting problem.
2 Related Works and Motivation
An intuitive SE method to overcome the mismatch problem is to collect as many types of noise as possible to increase the generalization ability [CHLee]. However, it is impractical to cover the infinite variety of noise that may be encountered in real situations. Several studies [fine-tune1, fine-tune2] have proposed directly fine-tuning a pre-trained model to improve performance in a target domain. When entering a new circumstance, these algorithms focus only on the current noise domain and ignore the memory of previously learned noise types. In many applications, such as edge devices, the type of noise changes frequently, and it is common to re-encounter learned types of noise. However, the adapted SE model cannot perform well on previously learned noise types. This effect is called catastrophic forgetting [forget1, forget2]. Although the SE model can be fine-tuned every time the environment changes, the repeated model adaptation process results in high computation and time costs.
The above limitations of adaptation methods based on direct fine-tuning motivated us to apply incremental learning to SE. Incremental learning is also known as continual learning or life-long learning. Figure 1 illustrates the relationship between direct fine-tuning and incremental learning. Training trajectories are illustrated in a schematic parameter space, with parameter regions leading to good performance on the source (yellow region) and target (blue region) domains, denoted as tasks A and B, respectively. After learning task A, the parameters are located at θ_A. As shown by the dashed arrow in Figure 1, when the SE model is adapted by taking gradient steps to minimize the loss of task B alone, the resulting θ_B lies outside the good-performance region of task A, i.e., what was already learned in task A is forgotten. In contrast, in incremental learning, the SE model weights are updated toward the target domain while retaining the knowledge learned from the source domain. This is often realized by finding the overlapping region of the source and target domains. The learning trajectory of incremental learning, shown by the solid arrow in Figure 1, illustrates this concept. In this way, incremental learning can help the resulting model provide good SE results in the target domain while maintaining satisfactory performance in the source domain.
3 The SERIL System
3.1 Architecture and loss function of the SERIL system
The architecture of the SERIL system is depicted in Figure 2. The system performs SE in the spectral domain. Speech waveforms are first converted into time-frequency features using a 512-point short-time Fourier transform (STFT) with a Hamming window of 32 ms and a hop size of 16 ms. Each feature vector consists of 257 elements. The enhanced spectral features are then converted back into the waveform domain by the inverse STFT with an overlap-add method. In the SERIL system, the first three layers are LSTM layers (unidirectional LSTMs were used to enable real-time inference). The hidden dimension of each LSTM is 257. A fully connected layer is concatenated to the output of the last LSTM layer for scaling.
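The front end described above can be sketched in a few lines. The following is a minimal pure-Python illustration (the function name and the use of magnitude spectra are assumptions for illustration; a practical system would use an optimized FFT library):

```python
import math
import cmath

def stft_features(signal, n_fft=512, hop=256):
    """Split a waveform into Hamming-windowed frames and return one-sided
    magnitude spectra. At a 16 kHz sampling rate, n_fft=512 is a 32 ms
    window and hop=256 a 16 ms hop; each frame yields 257 (= 512/2 + 1)
    feature elements, matching the dimensions described above."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_fft - 1))
              for n in range(n_fft)]
    features = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        # Naive one-sided DFT (an optimized FFT would be used in practice).
        spectrum = [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                            for n in range(n_fft)))
                    for k in range(n_fft // 2 + 1)]
        features.append(spectrum)
    return features

# A 1 kHz tone sampled at 16 kHz: spectral energy should peak at bin 32,
# since the bin spacing is 16000 / 512 = 31.25 Hz.
sig = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(1024)]
feats = stft_features(sig)
```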
As mentioned earlier, the L1 and L2 norms are commonly used as loss functions to train DL-based SE models. In this study, we derived another loss function based on the short-time spectral amplitude SDR (STSA-SDR), which was shown to provide better results than the L1 and L2 norms in our preliminary experiments. In a previous study, Kolbæk et al. [TSDR] reported that using the time-domain SDR [SDR1, SDR2] as the loss can help SE models achieve improved performance. Because the input and output of SERIL are both spectral features, we need to modify the original SDR loss to use it in the spectral domain. We note that the SDR can be regarded as the ratio of the energy of the enhanced speech projected onto the clean speech space over the energy of the enhanced speech projected onto the orthogonal complement of the clean speech space. By Parseval's theorem [Parseval] and the linearity of the Fourier transform, this energy ratio in the time domain is equivalent to that in the time-frequency domain. Therefore, we define the STSA-SDR as follows:

STSA-SDR(Ŝ, S) = 10 log₁₀ ( ‖P_S(Ŝ)‖² / ‖Ŝ − P_S(Ŝ)‖² ),

where P_S(Ŝ) = (⟨Ŝ, S⟩ / ‖S‖²) S denotes the projection of Ŝ onto S. Given the noisy spectral features Y, the SE model aims to generate the enhanced spectral features Ŝ. Ŝ is computed by Ŝ = SE(Y), where S is the target clean spectral features. In addition, by the equivalence above, the STSA-SDR is equal to the time-domain SDR; thus, we denote our loss function as L_SDR = −STSA-SDR(Ŝ, S).
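Under this projection-based definition, the loss can be sketched as follows (a minimal pure-Python sketch on flattened real-valued feature vectors; the function name and the eps smoothing term are illustrative assumptions, not from the paper):

```python
import math

def stsa_sdr_loss(enhanced, clean, eps=1e-8):
    """Negative spectral-domain SDR on flattened real feature vectors.

    The enhanced features are decomposed into a component along the clean
    feature direction (target) and an orthogonal residual; the SDR is the
    energy ratio of the two, and training minimizes its negative."""
    dot = sum(e * c for e, c in zip(enhanced, clean))
    clean_energy = sum(c * c for c in clean)
    scale = dot / (clean_energy + eps)
    target = [scale * c for c in clean]              # projection onto clean
    residual = [e - t for e, t in zip(enhanced, target)]
    num = sum(t * t for t in target)
    den = sum(r * r for r in residual)
    return -10.0 * math.log10((num + eps) / (den + eps))
```

A perfect estimate drives the residual energy to (nearly) zero, so the loss becomes strongly negative; noisier estimates yield larger loss values.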
3.2 Curvature-based regularization strategy
Considering the losses in the previous and new acoustic environments, L_old and L_new, respectively, the total loss can be formulated as:

L_total(θ) = L_new(θ) + L_old(θ).   (1)
Because the training data of the previous environment are usually not accessible online, we cannot calculate L_old directly. Instead, we can assume that the loss of the previous environment is revealed by the learned SE model parameters, θ*. By approximating L_old(θ) using the second-order Taylor expansion at θ*, we have

L_old(θ) ≈ L_old(θ*) + gᵀ(θ − θ*) + ½ (θ − θ*)ᵀ H (θ − θ*),   (2)
where g is the gradient ∇L_old(θ*); H is the Hessian matrix of L_old at θ*; and L_old(θ*) is a constant. Because the elements in g are generally small enough to be ignored (θ* is near a minimum of L_old), we can obtain the approximate form L_old(θ) ≈ ½ (θ − θ*)ᵀ H (θ − θ*). Similar to elastic weight consolidation [EWC, Online-EWC], we ignore the cross terms in H to improve computational efficiency. The approximate form becomes

L_old(θ) ≈ λ Σ_i Ω_i (θ_i − θ*_i)²,   (3)

where λ is a hyperparameter; i is the index of the parameters in the model; θ_i and θ*_i are the i-th parameters in the current and previous environments, respectively; and Ω_i is the i-th diagonal element of H. The intuitive interpretation of Ω_i is the local curvature, which indicates how sensitive the performance in the previous acoustic environment is to changes in the i-th parameter.
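The quadratic penalty described above can be sketched as follows (a minimal pure-Python illustration; the function name and list-based parameter representation are assumptions for illustration):

```python
def curvature_penalty(params, prev_params, omega, lam):
    """Quadratic penalty: lam * sum_i omega_i * (theta_i - theta_prev_i)^2.

    omega_i approximates the diagonal curvature of the previous
    environment's loss at its optimum, so parameters that mattered for the
    old environment are pulled back toward their previous values more
    strongly than unimportant ones."""
    return lam * sum(w * (p - q) ** 2
                     for w, p, q in zip(omega, params, prev_params))
```

In the example below, only the second parameter has nonzero importance, so only its deviation from the previous value is penalized.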
Kolouri et al. [SCP] provided a different, geometric explanation of the regularization term, which can be applied to our scenario: the term can be interpreted as the expectation of the squared difference between the loss values over the training samples of the previous environment, i.e., E[(L_old(θ) − L_old(θ*))²]. Similar to (3), this distance can be approximated by a quadratic form in (θ − θ*), which is also derived from the second-order Taylor expansion of L_old at θ*. Referring to [Online-EWC, RWalk, SCP], we apply the interpolation approach to the case of multiple tasks. Given Ω_i^(t−1) derived from all previous tasks, Ω_i is updated as

Ω_i^(t) = α Ω_i^(t−1) + (1 − α) F_i^(t),   (4)

where t is the index of the task; α is a hyperparameter in [0, 1]; F_i^(t) denotes the Ω_i derived from the t-th task alone; and Ω_i^(t) is the interpolation result of Ω_i^(t−1) and F_i^(t), corresponding to the accumulated curvature information of past tasks.
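The per-task interpolation update can likewise be sketched (a minimal illustration; the function name is an assumption):

```python
def update_importance(omega_prev, curvature_new, alpha):
    """Blend the importance accumulated over past tasks with the curvature
    estimated on the task just finished: alpha weights the old accumulation
    and (1 - alpha) weights the new estimate, per parameter."""
    return [alpha * old + (1.0 - alpha) * new
            for old, new in zip(omega_prev, curvature_new)]
```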
3.3 Path optimization augmenting approach
Although Ω_i is well motivated for avoiding catastrophic forgetting, the commonly used curvature-based methods [EWC, Online-EWC] of deriving Ω_i rely on point estimation, which only captures local curvature information around θ*. In contrast, the path optimization-based method [SI] considers the information over the entire optimization path on the loss surface. In particular, the importance score is determined by accumulating contributions over the entire training trajectory, as illustrated in Figure 3.
By using the first-order Taylor approximation and setting t_s and t_e as the start and end steps of the t-th task, the change in loss over the time from t_s to t_e can be written as

L(θ(t_e)) − L(θ(t_s)) ≈ Σ_i ∫_{t_s}^{t_e} (∂L/∂θ_i)(θ(τ)) θ′_i(τ) dτ = −Σ_i w_i^t,   (5)

where i is the index of the SE model parameter. To simplify the description, we denote −∫_{t_s}^{t_e} (∂L/∂θ_i)(θ(τ)) θ′_i(τ) dτ as w_i^t. Therefore, the change in the total loss can be represented as the summation of the individual loss changes associated with each parameter. We put a minus sign in the definition of w_i^t to make its sign consistent with the regularization term. Practically, we replace the integral with the discrete sum over training iterations, Σ_n (∂L/∂θ_i)(θ_n)(θ_{n,i} − θ_{n−1,i}), where n is the index of the iteration. From [SI], the importance scores as we begin to train the t-th task can be defined as

ω_i^t = Σ_{s<t} w_i^s / ((Δ_i^s)² + ξ),   (6)

where s is the index of a task before the t-th task; θ_i^s is the i-th parameter of the SE model derived from training the s-th task; Δ_i^s is θ_i^s − θ_i^{s−1}; and ξ is a hyperparameter with a positive value.
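A sketch of this path-based importance computation, under the assumption that per-step gradients and parameter snapshots are recorded during training (all names are illustrative):

```python
def path_importance(grad_history, param_history, xi=0.1):
    """Path-based importance: for each parameter, accumulate
    -gradient * (parameter step) over the whole training trajectory, then
    divide by the squared total displacement plus a damping constant xi.

    grad_history[n] holds the gradient vector at iteration n, and
    param_history[n] the parameter vector, with one extra final entry."""
    n_params = len(param_history[0])
    contrib = [0.0] * n_params
    for step, grads in enumerate(grad_history):
        for i in range(n_params):
            delta = param_history[step + 1][i] - param_history[step][i]
            contrib[i] += -grads[i] * delta
    return [contrib[i]
            / ((param_history[-1][i] - param_history[0][i]) ** 2 + xi)
            for i in range(n_params)]
```

A parameter that moved consistently downhill (negative gradient, positive step) accumulates a large positive score, marking it as important for the task just learned.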
Similar to [RWalk], we combine the advantages of the curvature-based [EWC, Online-EWC] and path optimization-based [SI] approaches. The importance of parameter θ_i when training the t-th task can be written as βΩ_i^(t−1) + (1 − β)ω_i^t. Therefore, the training loss is defined as:

L(θ) = L_new(θ) + λ Σ_i (βΩ_i^(t−1) + (1 − β)ω_i^t)(θ_i − θ_i^(t−1))²,   (7)

where t is the index of the task (if t is zero, L(θ) is equivalent to L_new(θ)); θ_i^(t−1) is the i-th parameter after training the (t−1)-th task; and β is a scalar with a value in [0, 1], which determines the weight of the two strategies.
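Putting the two strategies together, the combined training loss can be sketched as follows (a minimal illustration; the names and flat parameter lists are assumptions, not from the paper):

```python
def seril_loss(new_task_loss, params, prev_params, curv_importance,
               path_scores, lam, beta):
    """Total loss: the new environment's loss plus a quadratic penalty whose
    per-parameter weight blends the curvature-based importance (weight beta)
    with the path-based importance (weight 1 - beta)."""
    penalty = sum((beta * c + (1.0 - beta) * s) * (p - q) ** 2
                  for c, s, p, q in zip(curv_importance, path_scores,
                                        params, prev_params))
    return new_task_loss + lam * penalty
```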
4 Experiment and Analysis
4.1 Experimental Setup
We evaluated the proposed SERIL system on two speech corpora: VCB [VCB] and TIMIT [TIMIT]. Three data sets were prepared, namely the training, adaptation, and testing sets. For the training set, 2,000 utterances were randomly selected from the VCB corpus. Each utterance was contaminated with 13 types of noise (obtained from the NOISEX-92 database [NOISEX-92]) at 6 signal-to-noise ratio (SNR) levels (ranging from -3 dB to 12 dB with a step of 3 dB), amounting to 156,000 (= 2,000 × 13 × 6) paired noisy-clean utterances in total. This training set is termed D₀. To prepare the adaptation sets, we randomly selected another 300 utterances from the VCB corpus. These 300 utterances were contaminated with four other types of noise (obtained from the Nonspeech database [NonSpeech]), namely cough, door moving, footsteps, and clap, at the same 6 SNR levels (from -3 dB to 12 dB with a step of 3 dB) to form 4 adaptation sets, termed D₁, D₂, D₃, and D₄. Each set contained 1,800 (= 300 × 6) paired noisy-clean utterances.
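Contaminating a clean utterance with noise at a target SNR can be sketched as follows (a minimal pure-Python illustration; the function name is an assumption, and a real pipeline would operate on sampled waveform arrays):

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean-to-noise power ratio equals the
    requested SNR in dB, then add it sample-wise to the clean waveform."""
    clean_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Solve 10*log10(P_clean / (gain^2 * P_noise)) = snr_db for gain.
    gain = math.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(clean, noise)]
```

At 0 dB the clean and scaled-noise powers are equal; lower SNR values increase the noise gain accordingly.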
For the testing sets, we selected 1,680 utterances from the TIMIT data set. There were a total of five testing sets. The first testing set, E₀, corresponded to the training set D₀. The other four testing sets, E₁ to E₄, corresponded to the adaptation sets D₁ to D₄. The testing set E₀ contained 1,680 noisy utterances, with the same noise types and SNR levels as those used in D₀; each utterance was contaminated with one of the 13 noise types at a particular SNR level (one of the 6 SNR levels, randomly specified). Each of the testing sets E₁ to E₄ also contained 1,680 noisy utterances, and each utterance was contaminated with the corresponding noise type at a particular SNR level (one of the 6 SNR levels, randomly specified).
Three standardized evaluation metrics were used to measure performance: the perceptual evaluation of speech quality (PESQ) [PESQ], the short-time objective intelligibility measure (STOI) [STOI], and the extended STOI (eSTOI) [eSTOI]. PESQ was designed to evaluate the quality of processed speech; the higher the PESQ score, the better the speech quality. Both STOI and eSTOI were designed to measure speech intelligibility; the higher the STOI and eSTOI scores, the better the speech intelligibility. In addition, we also report the STSA-SDR scores to illustrate the learning process; the higher the STSA-SDR score, the smaller the distortion of the spectral features.
4.2 Experimental Results
First, we compared SERIL and the direct fine-tuning approach in terms of adaptation capability and the degree of catastrophic forgetting. We used the training set D₀ to train one baseline model, termed M₀. Then, based on the four adaptation sets, we sequentially adapted the model from M₀ to M₁ using D₁, M₁ to M₂ using D₂, M₂ to M₃ using D₃, and M₃ to M₄ using D₄. The five models (M₀ to M₄) were then tested on the five testing sets (E₀ to E₄). The STSA-SDR scores of the five models on the five testing sets are shown in Figure 4. The results of the baseline model without adaptation and the scores of the unprocessed noisy speech are also given for comparison.
From the figure, we note that although the baseline model performs well on E₀, where the noise types and SNR levels are matched between the training and testing stages, notable degradation is observed under the mismatched conditions (cf. the gray lines on E₁ to E₄). Further, both SERIL and the direct fine-tuning approach effectively adapt the SE model to each target domain and achieve good performance. For example, in Figure 4(b), M₁ achieves the best performance on E₁ for both SERIL and the direct fine-tuning approach. However, the model trained by direct fine-tuning tends to forget the previously learned SE capability, whereas the model trained by SERIL maintains good SE performance on previously learned noise types. For instance, in Figure 4(b), the performance on E₁ of the subsequently adapted models trained by direct fine-tuning is considerably reduced, showing that the adapted model has "forgotten" the SE capability for the previously learned noise type. This is because each noise type has different structural characteristics in different frequency bands, so direct fine-tuning without proper constraints can severely distort the modeling of previous noise environments. In contrast, the performance drop of the SERIL system in the same training-testing case is relatively minor. Consistent trends can be observed across all testing sets.
Table 1 shows the STSA-SDR, PESQ, STOI, and eSTOI scores of the final model (M₄) learned using the fine-tuning method and SERIL on the five testing sets. The scores of the unprocessed noisy speech and the baseline model without adaptation (M₀) are also listed for comparison. Several observations can be drawn from the table. First, SERIL performs as well as direct fine-tuning in the current noise environment in terms of all metrics (cf. the "clap" column in Table 1). Second, SERIL always outperforms direct fine-tuning in previous environments in terms of all metrics (cf. the "original" to "footsteps" columns in Table 1). Third, SERIL performs better than the baseline model in all testing environments except for "original", which is a matched training-testing condition for the baseline model. It is worth noting that compared with the direct fine-tuning approach, SERIL requires only a small amount of additional computation and storage to set the constraints when performing model adaptation. Nevertheless, SERIL produces performance comparable to the direct fine-tuning approach in each new environment while overcoming the catastrophic forgetting problem in old environments.
5 Concluding Remarks
When deploying an SE system in real-world applications, it is common to encounter new noisy environments and to revisit previous ones. Although the direct fine-tuning approach can effectively adapt SE models to new environments, the adapted SE model may suffer from the catastrophic forgetting problem. The proposed SERIL model not only yields performance comparable to the direct fine-tuning approach but also effectively overcomes the catastrophic forgetting problem. To the best of our knowledge, this paper is the first work to incorporate incremental learning into SE tasks. Our experimental results confirmed the effectiveness of the proposed SERIL system for SE model adaptation while avoiding catastrophic forgetting. Based on the promising results, we believe that the proposed SERIL model can be used in various edge-computing devices, where the acoustic conditions change frequently and the cost of online retraining is high. In addition, we note that using an appropriate weight, β, to combine the curvature-based and path optimization-based strategies can provide better SE performance in most tasks. The derivation of an algorithm that can automatically determine the optimal β is worthy of further study.