PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network

by   Dacheng Yin, et al.

Time-frequency (T-F) domain masking is a mainstream approach for single-channel speech enhancement. Recently, attention has also been paid to phase prediction in addition to amplitude prediction. In this paper, we propose a phase-and-harmonics-aware deep neural network (DNN), named PHASEN, for this task. Unlike previous methods that directly use a complex ideal ratio mask to supervise the DNN learning, we design a two-stream network, where the amplitude stream and the phase stream are dedicated to amplitude and phase prediction, respectively. We discover that the two streams should communicate with each other, and that this is crucial to phase prediction. In addition, we propose frequency transformation blocks to capture long-range correlations along the frequency axis. Visualization shows that the learned transformation matrix spontaneously captures the harmonic correlation, which has been proven to be helpful for T-F spectrogram reconstruction. With these two innovations, PHASEN acquires the ability to handle detailed phase patterns and to utilize harmonic patterns, yielding a 1.76dB SDR improvement on the AVSpeech + AudioSet dataset. It also achieves significant gains over Google's network on this dataset. On the Voice Bank + DEMAND dataset, PHASEN outperforms previous methods by a large margin on four metrics.




1 Introduction

Single-channel speech noise reduction aims at separating the clean speech from a noise-corrupted speech signal. Existing methods can be classified into two categories according to the signal domain they work on. The time domain methods directly operate on the one-dimensional (1D) raw waveform of speech signals, while the time-frequency (T-F) domain methods manipulate the two-dimensional (2D) speech spectrogram. Mainstream methods in the second category formulate the speech noise reduction problem as predicting a T-F mask over the input spectrogram. Early T-F masking methods only try to recover the amplitude of the target speech. When the importance of phase information was recognized, the complex ideal ratio mask (cIRM) [27] was proposed, aiming at faithfully recovering the complex T-F spectrogram.

Figure 1: Straightforward cIRM estimation does not achieve desired results. Although the imaginary part of the cIRM, as shown in (b), contains much information, that of a predicted cRM, as shown in (c), is almost zero.

Williamson et al. [27] observed that, in Cartesian coordinates, structure exists in both real and imaginary components of the cIRM, so they designed deep neural network (DNN)-based methods to estimate the real and imaginary parts of the cIRM. However, in our evaluations of a modern DNN-based cIRM estimation method [1], we find that simply changing the training target to cIRM does not produce the desired prediction results. Fig. 1(a) shows the amplitude of the noisy signal, where the stripe pattern is caused by noise. Fig. 1(b) and (c) show the imaginary parts of the ideal mask and the estimated mask, respectively. Surprisingly, Fig. 1(c) is almost zero, meaning that the estimated cIRM degenerates into an IRM. In other words, the phase information is not recovered at all.

This observation motivates us to design a novel architecture to improve the phase prediction. A straightforward idea is to separately predict the amplitude mask and the phase with a two-stream network. However, Williamson et al. [27] also pointed out that, in polar coordinates, structure does not exist in the phase spectrogram. This suggests that independent phase estimation is very difficult, if not completely impossible. In view of this, we add two-way information exchange to the two-stream architecture, so that the predicted amplitude can guide the prediction of phase. Results show that such information exchange is critical to the successful phase prediction of the target speech.

In the design of the amplitude stream, we find that conventional CNN kernels which are widely used in image processing do not capture the harmonics in T-F spectrogram. The reason is that correlations in natural images are mostly local while those in speech T-F spectrogram along the frequency axis are mostly non-local. In particular, at a given point of time, the value at a base frequency is strongly correlated with the values at its overtones. Unfortunately, previous DNN models cannot efficiently exploit harmonics although backbones like U-net [7] and dilated 2D convolution [1] can increase the receptive field. In this paper, we propose to insert frequency transformation blocks (FTBs) to capture global correlations along the frequency axis. Visualization of FTB weights shows that FTBs spontaneously learn the correlations among harmonics.

In a nutshell, we design a phase-and-harmonics-aware speech enhancement network, dubbed PHASEN, for monaural speech noise reduction. The contributions of this work are three-fold:

  • We propose a novel two-stream DNN architecture with two-way information exchange for efficient speech noise reduction in T-F domain. The proposed architecture is capable of recovering phase information of the target speech.

  • We design frequency transformation blocks in the amplitude stream to efficiently exploit global frequency correlations, especially the harmonic correlation in spectrogram.

  • We carry out comprehensive experiments to justify the design choices and to demonstrate the performance superiority of PHASEN over existing noise reduction methods.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the proposed PHASEN architecture and its implementation details. Section 4 shows the experimental results. Section 5 concludes this paper with discussions on limitations and future work.

Figure 2: The proposed two-stream PHASEN architecture. The amplitude stream (Stream A) is in the upper portion and the phase stream (Stream P) is in the lower portion. The outputs of Stream A and Stream P are the amplitude mask and the estimated (complex) phase, respectively. Three two-stream blocks (TSBs) are stacked in the network.

2 Related Work

This section reviews both time-frequency domain methods and time-domain methods for single-channel speech enhancement. Within T-F domain methods, we are only interested in T-F masking methods. Special emphasis is placed on phase estimation and the utilization of harmonics.

2.1 T-F Domain Masking Methods

T-F domain masking methods for speech enhancement usually operate in three steps. First, the input time-domain waveform is transformed into the T-F domain and represented by a T-F spectrogram. Second, a multiplicative mask is predicted based on the input spectrogram and applied to it. Last, an inverse transform is applied to the modified spectrogram to obtain the real-valued time-domain signal. The most widely used T-F spectrogram is computed by the short-time Fourier transform (STFT), and it can be converted back to a time-domain signal by the inverse STFT (iSTFT). The key problems to be solved in T-F domain masking methods are what type of mask to use and how to predict it.

Early T-F masking methods only try to estimate the amplitudes of a spectrogram by using real-valued ideal binary mask (IBM) [5], ideal ratio mask (IRM) [19], or spectral magnitude mask (SMM) [26]. After the enhanced amplitudes are obtained, they are combined with the noisy phase to produce the enhanced speech. Later, research [13] reveals that phase plays an important role in speech quality and intelligibility. In order to recover phase, phase sensitive mask (PSM) [2] and cIRM [27] are proposed. PSM is still a real-valued mask, extending SMM by simply adding a phase measure. In contrast, cIRM is a complex-valued mask which has the potential to faithfully recover both amplitude and phase of the clean speech.
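The masks listed above can be made concrete with a small NumPy sketch. It uses commonly cited textbook definitions, which are assumptions here rather than formulas quoted from the cited papers; the function name `tf_masks` and the toy spectrograms are hypothetical.

```python
import numpy as np

def tf_masks(clean, noise, eps=1e-8):
    """Compute common T-F masks from complex STFTs of clean speech and noise.

    The definitions below are widely used textbook forms (an assumption of
    this sketch), not formulas copied from the cited papers.
    """
    noisy = clean + noise
    s_mag, n_mag, y_mag = np.abs(clean), np.abs(noise), np.abs(noisy)
    ibm = (s_mag > n_mag).astype(np.float64)               # 1 where speech dominates
    irm = np.sqrt(s_mag**2 / (s_mag**2 + n_mag**2 + eps))  # real-valued, in [0, 1]
    smm = s_mag / (y_mag + eps)                            # clean / noisy magnitude
    psm = smm * np.cos(np.angle(clean) - np.angle(noisy))  # SMM scaled by a phase term
    cirm = clean / (noisy + eps)                           # complex-valued ratio
    return {"ibm": ibm, "irm": irm, "smm": smm, "psm": psm, "cirm": cirm}

# Toy complex "spectrograms" with 4 time steps and 5 frequency bands.
rng = np.random.default_rng(0)
shape = (4, 5)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise
masks = tf_masks(clean, noise)

# Only the complex mask restores both amplitude and phase of the clean speech.
recovered = masks["cirm"] * noisy
```

Under these definitions the PSM equals the real part of the cIRM, which illustrates why PSM stays real-valued while cIRM can also correct the phase.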

Williamson et al. [27] propose a DNN-based approach to estimate the real and imaginary components of the cIRM, so that both amplitude and phase spectra can be simultaneously enhanced. However, their experimental results show that using cIRM does not achieve significantly better results than using PSM. We believe that the potential of a complex mask is not fully exploited. In [1], a much deeper neural network with dilated convolution and bi-LSTM is employed for speech separation with visual cues. It also achieves state-of-the-art speech enhancement performance when visual cues are absent. We carry out experiments on this network and surprisingly find that the imaginary components of the estimated cIRM are almost zero. This suggests that directly using cIRM to supervise a single-stream DNN cannot achieve satisfactory results.

There exist some other methods [20, 21, 11] which perform phase reconstruction asynchronously with amplitude estimation. Their goal is to reconstruct phase based on a given amplitude spectrogram, which could be the amplitude spectrogram of clean speech or the output of any speech denoising model. In particular, Takahashi et al. [20] observe the difficulty in phase regression, so they treat the phase estimation problem as a classification problem by discretizing phase values and assigning class indices to them. While all these methods demonstrate the benefits of phase reconstruction, they do not fully utilize the rich information in the input noisy phase spectrogram.

2.2 Time Domain Methods

Time domain methods belong to the other camp for speech enhancement. We briefly mention several pieces of work here because they are proposed to avoid the phase prediction problem in T-F domain methods. SEGAN [15] uses generative adversarial networks (GANs) to directly predict the 1D waveform of the clean speech. Rethage et al. [17] modify Wavenet for the speech enhancement task. Conv-TasNet [10] uses a learnable encoder-decoder in the time domain as an alternative to the hand-crafted STFT-iSTFT for a speech separation task. However, when it is applied to the speech enhancement task, the 2ms frame length appears to be too short. TCNN [14] adopts a similar approach to TasNet, but it uses a non-linear encoder-decoder and a longer frame length than TasNet. Although these methods sidestep the difficult phase estimation problem, they also give up the benefits of speech enhancement in the T-F domain, as it is widely recognized that most speech and noise patterns are separately distributed or easily distinguishable on T-F domain features. As a result, the performance of time domain methods is not among the first tier in the speech enhancement task.

2.3 Harmonics in Spectrogram

Plapous et al. [16] discover that common noise reduction algorithms suppress some harmonics existing in the original signal, so that the enhanced signal sounds degraded. They propose to regenerate the distorted speech frequency bands by taking into account the harmonic characteristic of speech. Other research [9, 12] also shows that phase correlation between harmonics can be used for speech phase reconstruction. A recent work [25] further proposes a phase reconstruction method based on harmonic enhancement using the fundamental frequency and phase distortion feature. All these works demonstrate the importance of harmonics in speech enhancement. In this paper, we also try to exploit harmonic correlation, but this is achieved by designing an integral block in the end-to-end learning DNN.

3 PHASEN Architecture

3.1 Overview

The basic idea behind PHASEN is to separate the predictions of amplitude and phase, as the two prediction tasks may need different features. In our design, we use two parallel streams, denoted by stream A for amplitude mask prediction and stream P for phase prediction. The entire PHASEN architecture is shown in Fig. 2.

The input to the network is the STFT spectrogram, denoted by $S^{in} \in \mathbb{R}^{T \times F \times 2}$. Here, $S^{in}$ is a complex-valued spectrogram, where $T$ represents the number of time steps and $F$ represents the number of frequency bands. $S^{in}$ is fed into both streams, and two different groups of 2D convolutional layers are used to produce feature $S_A^{0} \in \mathbb{R}^{T \times F \times C_A}$ for stream A and feature $S_P^{0} \in \mathbb{R}^{T \times F \times C_P}$ for stream P. Here, $C_A$ and $C_P$ are the numbers of channels for stream A and stream P, respectively.

The key component in PHASEN is the stacked two-stream blocks (TSBs), in which stream A and stream P features are computed separately. Note that at the end of each TSB, stream A and stream P exchange information. This design is critical to the phase estimation, as phase itself does not have structure and is hard to estimate [27]. However, with the information from the amplitude stream, the features for phase estimation are significantly improved. In Section 4, we will visualize the difference between the estimated phase spectrograms when the information communication is present and absent. The output features of the three TSBs are denoted by $S_A^{i}$ and $S_P^{i}$, for $i \in \{1, 2, 3\}$. They have the same dimensions as $S_A^{0}$ and $S_P^{0}$. In stream A, frequency transformation blocks (FTBs) are used to capture non-local correlation along the frequency axis.

After the three TSBs, $S_A^{3}$ and $S_P^{3}$ are used to predict the amplitude mask and the phase. For $S_A^{3}$, the number of channels is first reduced by a convolution. The result is then reshaped into a 1D feature map per time step and finally fed into a Bi-LSTM and three fully connected (FC) layers to predict an amplitude mask $M \in \mathbb{R}^{T \times F}$. Sigmoid is used as the activation function of the last FC layer. For the other FC layers, ReLU is used as the activation function.

For $S_P^{3}$, a convolution is used to reduce the channel number to 2 to form a complex-valued feature map, where the two channels correspond to the real and the imaginary parts. Then, the amplitude of this complex feature map is normalized to 1 for each T-F bin. As such, the feature map only contains phase information. The phase prediction result is denoted by $\Psi$.

Finally, the predicted spectrogram can be computed by:

$$S^{out} = \mathrm{abs}(S^{in}) \circ M \circ \Psi \qquad (1)$$

where $\circ$ denotes element-wise multiplication.
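As a minimal sketch of this final step, the following NumPy code normalizes a two-channel phase feature map to unit amplitude and combines it with the input amplitude and an amplitude mask. The function name `compose_output`, the array shapes, and the random stand-in tensors are assumptions for illustration; in the network the mask and the phase features come from the two streams.

```python
import numpy as np

def compose_output(noisy_spec, amp_mask, phase_feat, eps=1e-8):
    """Combine the input amplitude, an amplitude mask and a predicted phase.

    noisy_spec: complex array (T, F), the input STFT spectrogram.
    amp_mask:   real array (T, F), e.g. a sigmoid output in [0, 1].
    phase_feat: real array (T, F, 2), real and imaginary channels.
    """
    phase = phase_feat[..., 0] + 1j * phase_feat[..., 1]
    phase = phase / (np.abs(phase) + eps)  # unit amplitude in every T-F bin
    return np.abs(noisy_spec) * amp_mask * phase

# Random stand-ins for the network outputs (hypothetical values).
rng = np.random.default_rng(0)
T, F = 6, 257
noisy_spec = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
amp_mask = rng.uniform(0.0, 1.0, size=(T, F))
phase_feat = rng.standard_normal((T, F, 2))

enhanced = compose_output(noisy_spec, amp_mask, phase_feat)
```

Because the phase map has unit amplitude, the output amplitude is determined by the mask alone, while the output phase is fully replaced by the prediction instead of being copied from the noisy input.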

3.2 Two-Stream Blocks (TSBs)


In each TSB, three 2D convolutional layers are used for stream A to handle local time-frequency correlation of the input feature. To capture global correlation on the frequency axis, such as harmonic correlation, we propose frequency transformation blocks (FTBs) to be used before and after the three convolutional layers. The FTB design will be detailed in the next subsection. The combination of 2D convolutions and FTBs efficiently captures both global and local correlations, allowing the following blocks to extract high-level features for amplitude prediction. With $S_A^{i-1}$ denoting the stream A input of the $i$-th TSB, stream A of the $i$-th TSB performs the following computation:

$$S_{A,0}^{i} = \mathrm{FTB}_{in}^{i}(S_A^{i-1}) \qquad (2)$$

$$S_{A,j}^{i} = \mathrm{conv}_{A,j}^{i}(S_{A,j-1}^{i}), \quad j \in \{1, 2, 3\} \qquad (3)$$

$$S_{A,4}^{i} = \mathrm{FTB}_{out}^{i}(S_{A,3}^{i}) \qquad (4)$$

Here, $\mathrm{conv}_{A,j}^{i}$ represents the $j$-th convolutional layer in stream A of the $i$-th TSB. $S_{A,j}^{i}$ and $S_{A,j-1}^{i}$ represent its output and input, respectively. $\mathrm{FTB}_{in}^{i}$ and $\mathrm{FTB}_{out}^{i}$ represent the FTBs before and after the three 2D convolutional layers. Each 2D convolutional layer is followed by batch normalization (BN) and a ReLU activation function.

Stream P is designed to be light-weight. We only use two 2D convolutional layers in each TSB to process the input feature $S_P^{i-1}$. Mathematically,

$$S_{P,0}^{i} = S_P^{i-1} \qquad (5)$$

$$S_{P,j}^{i} = \mathrm{conv}_{P,j}^{i}(S_{P,j-1}^{i}), \quad j \in \{1, 2\} \qquad (6)$$

Here, $\mathrm{conv}_{P,j}^{i}$ represents the $j$-th convolutional layer in stream P of the $i$-th TSB. $S_{P,j}^{i}$ and $S_{P,j-1}^{i}$ denote its output and input, respectively. The second convolutional layer uses a kernel size of 25×1 to capture long-range time-domain correlation. Global layer normalization (gLN) is performed before each convolutional layer. In stream P, no activation function is used. We will later show in ablation studies that this choice increases performance.
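To illustrate the normalization used in the phase stream, here is a minimal NumPy sketch of global layer normalization for one sample. It follows the usual gLN idea of normalizing over all non-batch dimensions with a per-channel gain and bias; applying it to a (channels, time, frequency) tensor and the function name `global_layer_norm` are assumptions of this sketch, not details taken from the text.

```python
import numpy as np

def global_layer_norm(x, gamma, beta, eps=1e-8):
    """Global layer normalization for a single sample.

    x:     real array (C, T, F).
    gamma: per-channel gain, shape (C, 1, 1).
    beta:  per-channel bias, shape (C, 1, 1).
    Statistics are computed over all channels, time steps and frequencies.
    """
    mean = x.mean()
    var = x.var()
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
C, T, F = 4, 10, 16
x = 3.0 * rng.standard_normal((C, T, F)) + 5.0
gamma = np.ones((C, 1, 1))   # identity gain for the demonstration
beta = np.zeros((C, 1, 1))   # zero bias for the demonstration
y = global_layer_norm(x, gamma, beta)
```

Unlike batch normalization, the statistics here do not depend on other samples in the batch, so the output for one utterance is unaffected by what else is in the batch.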

Information Communication

Information communication is critical to the success of the two-stream structure. Without the information from stream A, stream P by itself cannot successfully make phase predictions. Conversely, successfully predicted phases can also help stream A to better predict amplitude. The communication takes place just before the TSB generates its output features. Let $S_{A,4}^{i}$ and $S_{P,2}^{i}$ be the amplitude features and phase features computed from eq. (4) and eq. (6). The output features of the $i$-th TSB after information communication can be written as:

$$S_A^{i} = f_{P2A}(S_{A,4}^{i}, S_{P,2}^{i}) \qquad (7)$$

$$S_P^{i} = f_{A2P}(S_{P,2}^{i}, S_{A,4}^{i}) \qquad (8)$$

where $f_{P2A}$ and $f_{A2P}$ are the information communication functions of the two directions. In this work, we adopt the attention mechanism. For both directions, we have:

$$f(x_1, x_2) = x_1 \circ \mathrm{Tanh}(\mathrm{conv}(x_2)) \qquad (9)$$

Here, $\circ$ denotes element-wise multiplication and $\mathrm{conv}$ represents a convolution. The number of output channels is the same as the number of channels in $x_1$.
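The attention-style exchange can be sketched in NumPy as a gate computed from one stream and multiplied element-wise into the other. The tanh gate, the pointwise (1x1) convolution implemented as a matrix product over channels, the channel counts, and the name `gated_exchange` are assumptions of this sketch.

```python
import numpy as np

def gated_exchange(target, source, weight, bias):
    """Modulate `target` features with a gate computed from `source` features.

    target: (T, F, C_t) features of the stream being updated.
    source: (T, F, C_s) features of the other stream.
    weight: (C_s, C_t) pointwise convolution weights (assumed 1x1 kernel).
    bias:   (C_t,) bias of the pointwise convolution.
    """
    gate = np.tanh(source @ weight + bias)  # gate has as many channels as target
    return target * gate                    # element-wise multiplication

rng = np.random.default_rng(0)
T, F, C_A, C_P = 5, 8, 6, 3  # hypothetical sizes
amp_feat = rng.standard_normal((T, F, C_A))
phase_feat = rng.standard_normal((T, F, C_P))

# One set of weights per direction.
w_p2a, b_p2a = rng.standard_normal((C_P, C_A)), np.zeros(C_A)
w_a2p, b_a2p = rng.standard_normal((C_A, C_P)), np.zeros(C_P)

amp_out = gated_exchange(amp_feat, phase_feat, w_p2a, b_p2a)    # P to A
phase_out = gated_exchange(phase_feat, amp_feat, w_a2p, b_a2p)  # A to P
```

Both directions are computed from the features before the exchange, so neither stream sees an already modulated version of the other within the same block.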

Figure 3: Flowchart of the proposed FTBs. The kernel size of Conv 1D is 9.

3.3 Frequency Transformation Blocks (FTBs)

Non-local correlations exist in a T-F spectrogram along the frequency axis. A typical example is the correlation among harmonics, which has been shown to be helpful for the reconstruction of corrupted T-F spectrograms. However, simply stacking several 2D convolution layers with small kernels cannot capture such global correlation. Therefore, we design FTBs to be inserted at the beginning and the end of each TSB, so that the output features of a TSB have a full-frequency receptive field. At the core of an FTB is the learning of a transformation matrix, which is applied on the frequency axis. Fig. 3 shows the flowchart of the proposed FTB. With $S^{in}$ denoting the input feature of an FTB, the three groups of operations in each FTB can be represented by:

$$S^{a} = S^{in} \circ \mathrm{Att}_{TF}(S^{in}) \qquad (10)$$

$$S^{tr} = \text{Freq-FC}(S^{a}) \qquad (11)$$

$$S^{out} = \mathrm{conv}(\mathrm{concat}(S^{tr}, S^{in})) \qquad (12)$$

Eq. (10) describes the T-F attention module $\mathrm{Att}_{TF}$, as highlighted in the dotted box in Fig. 3. With the input feature $S^{in}$, it uses 2D and 1D convolutional layers to predict an attention map, which is then point-wise multiplied with $S^{in}$ to obtain $S^{a}$. The 2D convolution reduces the channel number, and the kernel size of the 1D convolution is 9.

Freq-FC is the key component in an FTB. It contains a trainable frequency transformation matrix (FTM) which is applied to the feature map slice at each point in time. Let $X^{tr} \in \mathbb{R}^{F \times F}$ denote the trainable FTM and let $S^{a}(t_0) \in \mathbb{R}^{F \times C_A}$ denote the feature slice at time step $t_0$. The transformation can be simply represented by the following equation:

$$S^{tr}(t_0) = X^{tr} \cdot S^{a}(t_0) \qquad (13)$$

The transformed feature slice at time step $t_0$, denoted by $S^{tr}(t_0)$, has the same dimension as $S^{a}(t_0)$. Stacking the slices along the time axis gives the transformed feature map $S^{tr}$. After Freq-FC, each T-F bin in $S^{tr}$ contains the information from all the frequency bands of $S^{a}$. This allows the following blocks to exploit global frequency correlations for amplitude and phase estimation.

The output of an FTB, denoted by $S^{out}$, is calculated by concatenating $S^{tr}$ with $S^{in}$ and fusing them with a convolution. In the proposed FTBs, batch normalization (BN) and ReLU are used after all convolutional layers as normalization method and activation function.
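The Freq-FC step amounts to multiplying every time slice of the feature map by the same square matrix along the frequency axis. The NumPy sketch below shows this under assumed shapes; the name `freq_fc`, the (time, frequency, channel) layout, and the toy sizes are hypothetical, and the attention module and the fusing convolution are left out.

```python
import numpy as np

def freq_fc(feat, ftm):
    """Apply one frequency transformation matrix to every time slice.

    feat: (T, F, C) feature map.
    ftm:  (F, F) transformation matrix shared across time and channels.
    Returns an array of shape (T, F, C) with out[t] = ftm @ feat[t].
    """
    return np.einsum("gf,tfc->tgc", ftm, feat)

rng = np.random.default_rng(0)
T, F, C = 3, 8, 2  # hypothetical sizes
feat = rng.standard_normal((T, F, C))
ftm = rng.standard_normal((F, F))  # stands in for the trainable matrix

transformed = freq_fc(feat, ftm)

# With an all-ones matrix, every output bin is the sum over all input bands,
# which makes the full-frequency receptive field explicit.
summed = freq_fc(feat, np.ones((F, F)))
```

A single matrix of this kind lets a bin at a base frequency draw directly on bins at its overtones, something a stack of small convolution kernels can only reach after many layers.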

3.4 Implementation

PHASEN is implemented in PyTorch. The dimension of feature maps and the kernel size of convolutional layers are shown in Fig. 2 and Fig. 3. Both streams use convolution operations with zero padding, dilation=1 and stride=1, making sure the input and output feature map sizes are the same. All audio is resampled to 16kHz. The STFT is calculated using a Hann window, whose window length is 25ms. The hop length is 10ms and the FFT size is 512.
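The STFT settings above translate into 400-sample windows, 160-sample hops and 257 frequency bands at 16kHz. The NumPy sketch below computes such a spectrogram; the absence of boundary padding, the symmetric Hann window from `np.hanning`, and the function name `stft` are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

SAMPLE_RATE = 16000
WIN_LENGTH = SAMPLE_RATE * 25 // 1000  # 25 ms -> 400 samples
HOP_LENGTH = SAMPLE_RATE * 10 // 1000  # 10 ms -> 160 samples
N_FFT = 512                            # gives N_FFT // 2 + 1 = 257 bands

def stft(signal):
    """Return a complex spectrogram of shape (num_frames, 257)."""
    n_frames = 1 + (len(signal) - WIN_LENGTH) // HOP_LENGTH
    window = np.hanning(WIN_LENGTH)
    # Index matrix: row k selects the samples of frame k.
    idx = np.arange(WIN_LENGTH)[None, :] + HOP_LENGTH * np.arange(n_frames)[:, None]
    frames = signal[idx] * window
    # Each 400-sample frame is zero-padded to 512 samples by rfft.
    return np.fft.rfft(frames, n=N_FFT, axis=1)

# One second of a 1 kHz tone as a toy input.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = stft(tone)
peak_bin = int(np.argmax(np.abs(spec).mean(axis=0)))
```

Each band is 16000 / 512 = 31.25 Hz wide, so the 1 kHz tone is expected to peak at band index 32.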

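The network is trained using MSE loss on the power-law compressed STFT spectrogram. The loss consists of two parts, amplitude loss $L_a$ and phase-aware loss $L_p$, and the total loss is a weighted sum of the two:

$$L_a = \mathrm{MSE}(\mathrm{abs}(S^{out}_{cprs}), \mathrm{abs}(S^{gt}_{cprs}))$$

$$L_p = \mathrm{MSE}(S^{out}_{cprs}, S^{gt}_{cprs})$$

where $S^{out}_{cprs}$ and $S^{gt}_{cprs}$ are the power-law compressed spectrograms of the output spectrogram $S^{out}$ and the ground truth spectrogram $S^{gt}$. The compression is performed on the amplitude with exponent $p$ (that is, $A^{p}$, where $A$ is the amplitude of the spectrogram).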

Note that instead of only using pure phase, the whole spectrogram (phase and amplitude) is taken into consideration for the phase-aware loss. In this way, the phase of T-F bins with higher amplitude is emphasized, helping the network to focus on the high-amplitude T-F bins where most speech signals are located.
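A minimal NumPy sketch of the two loss terms is given below. The compression exponent `p=0.3`, the use of the mean squared modulus as the complex MSE, and the function names are assumptions of this sketch; how the two terms are weighted in the total loss is not shown.

```python
import numpy as np

def power_law_compress(spec, p):
    """Compress the amplitude A of a complex spectrogram to A**p, keeping phase."""
    return (np.abs(spec) ** p) * np.exp(1j * np.angle(spec))

def amplitude_and_phase_aware_loss(pred, target, p=0.3):
    """Return (amplitude loss, phase-aware loss) on compressed spectrograms.

    p=0.3 is an assumed value for illustration only.
    """
    pred_c = power_law_compress(pred, p)
    target_c = power_law_compress(target, p)
    amp_loss = np.mean((np.abs(pred_c) - np.abs(target_c)) ** 2)
    phase_aware_loss = np.mean(np.abs(pred_c - target_c) ** 2)
    return amp_loss, phase_aware_loss

rng = np.random.default_rng(0)
target = rng.standard_normal((6, 257)) + 1j * rng.standard_normal((6, 257))

# Same amplitudes as the target, but every phase rotated by 0.5 rad.
rotated = target * np.exp(0.5j)
amp_rot, phase_rot = amplitude_and_phase_aware_loss(rotated, target)

# An unrelated prediction for comparison.
other = rng.standard_normal((6, 257)) + 1j * rng.standard_normal((6, 257))
amp_other, phase_other = amplitude_and_phase_aware_loss(other, target)
```

The rotated example has a near-zero amplitude loss but a clearly positive phase-aware loss, and since each bin's error is scaled by its compressed amplitude, high-amplitude bins contribute most to the phase-aware term.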

4 Experiments

4.1 Datasets

Two datasets are used in our experiments.

AVSpeech+AudioSet: This is a large dataset proposed by [1]. Audio from the AVSpeech dataset is used as clean speech. It is collected from YouTube, containing 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people and languages. The noisy speech is a mixture of the above clean speech segments with AudioSet [3], which contains a total of more than 1.7 million 10-second segments of 526 kinds of noise. The noisy speech is synthesized by a weighted linear combination of a speech segment and a noise segment, both of which are 3-second segments randomly sampled from the speech and noise datasets. The mixture and its speech segment form a noisy-clean speech pair. In our experiments, 100k segments randomly sampled from the AVSpeech dataset and the “Balanced Train” part of AudioSet are used to synthesize the training set, while the validation set is the same as the one used in [1], synthesized from the test part of the AVSpeech dataset and the evaluation part of AudioSet.

Voice Bank+DEMAND: This is an open dataset proposed by [23]. Speech of 30 speakers from the Voice Bank corpus [1] is selected as clean speech: 28 speakers are included in the training set and 2 are in the validation set. The noisy speech is synthesized by mixing clean speech with noise from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) [22]. A total of 40 different noise conditions are considered in the training set and 20 different conditions are considered in the test set. Finally, the training and test sets contain 11572 and 824 noisy-clean speech pairs, respectively. Both speakers and noise conditions in the test set are totally unseen in the training set. Our system comparison is partly done on this dataset.

4.2 Evaluation Metrics

The following six metrics are used to evaluate PHASEN and state-of-the-art competitors. For all of them, higher is better.

  • SDR [24]: Signal-to-distortion ratio from the mir_eval library;

  • PESQ: Perceptual evaluation of speech quality (from -0.5 to 4.5).

  • CSIG [6]: Mean opinion score (MOS) prediction of the signal distortion attending only to the speech signal (from 1 to 5).

  • CBAK [6]: MOS prediction of the intrusiveness of background noise (from 1 to 5).

  • COVL [6]: MOS prediction of the overall effect (from 1 to 5).

  • SSNR: Segmental SNR.
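To make the last metric concrete, the sketch below computes a simplified segmental SNR: the SNR is evaluated per frame, clipped to a fixed range, and averaged. The frame length, the non-overlapping frames, the clipping range of -10 dB to 35 dB, and the function name are assumptions of this sketch; it is not the evaluation script used for the reported numbers.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, min_db=-10.0, max_db=35.0, eps=1e-10):
    """Simplified segmental SNR in dB over non-overlapping frames."""
    n_frames = len(clean) // frame_len
    clean = clean[: n_frames * frame_len].reshape(n_frames, frame_len)
    enhanced = enhanced[: n_frames * frame_len].reshape(n_frames, frame_len)
    signal_energy = np.sum(clean ** 2, axis=1)
    error_energy = np.sum((clean - enhanced) ** 2, axis=1)
    snr_db = 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
    # Clipping keeps silent or perfectly matched frames from dominating the mean.
    return float(np.mean(np.clip(snr_db, min_db, max_db)))

rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)
noise = rng.standard_normal(4096)

ssnr_light = segmental_snr(clean, clean + 0.1 * noise)  # mild residual noise
ssnr_heavy = segmental_snr(clean, clean + 1.0 * noise)  # strong residual noise
ssnr_perfect = segmental_snr(clean, clean)              # hits the upper clip
```

Because the average is taken over per-frame values in dB, a few badly enhanced frames lower the score more than they would in a single global SNR.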

4.3 Ablation Study

In the ablation study, networks of different settings are trained with the same random seed for 1 million steps. Adam optimizer with a fixed learning rate of 0.0002 is used and the batch size is set to 8. We use mean SDR and PESQ on test dataset as the evaluation metric.

The ablation results are shown in Table 1. Among these methods, PHASEN represents our full model. PHASEN-baseline represents a single-stream network which uses cIRM as the training target. We use the network structure of stream A for PHASEN-baseline and replace the FTBs with 5×5 convolutions. The comparison between PHASEN and PHASEN-baseline shows that our two innovations, namely the two-stream architecture and FTBs, provide a total of 1.76dB improvement on SDR and 0.53 improvement on PESQ.

Method SDR(dB) PESQ
PHASEN-baseline 15.08 2.87
PHASEN-1strm 15.99 2.98
PHASEN-w/o-FTBs 16.10 3.31
PHASEN-w/o-A2PP2A 16.13 3.33
PHASEN-w/o-P2A 16.62 3.38
PHASEN 16.84 3.40
Table 1: Ablation study on AVSpeech + AudioSet

Two-Stream Architecture

PHASEN-1strm shows the performance of a single-stream architecture with cIRM as the training target. In this experiment, stream P and information communication are removed from the PHASEN architecture, while FTBs are preserved. The output of stream A is the predicted cRM. Comparison between PHASEN-1strm and PHASEN shows that the two-stream architecture provides 0.85dB gain on SDR and 0.42 gain on PESQ. The large gain on PESQ indicates that the proposed two-stream architecture can largely improve the perceptual quality of the denoised speech.


Frequency Transformation Blocks

The proposed method uses FTBs at both the beginning and the end of each TSB. In the ablation study, PHASEN-w/o-FTBs replaces all the FTBs in the PHASEN architecture with 5×5 convolutions. By comparing PHASEN to PHASEN-w/o-FTBs, we find that FTBs provide 0.74dB and 0.09 gain on SDR and PESQ, respectively. We have also tried to replace the FTBs at either location of each TSB with 5×5 convolutions. Both attempts result in a 0.31dB-0.39dB drop on SDR and a 0.03-0.05 drop on PESQ, showing that FTBs at both locations are equally important and that the gain is cumulative.

To better understand FTBs, we visualize the weights of the FTM, the matrix that reflects the learned global frequency correlation. Fig. 4 shows that the energy map of the FTM resembles the harmonic correlation, especially when higher harmonics (larger H) are taken into consideration. This phenomenon confirms that FTBs indeed capture the harmonic correlation, and that harmonic correlation is useful to a speech enhancement network, because the network learns this correlation spontaneously.

Figure 4: Comparison of harmonic correlation at different levels with the learned FTM weights. The value of H is shown in the upper-left corner of each sub-figure.

Information communication mechanism

PHASEN-w/o-P2A and PHASEN-w/o-A2PP2A are two settings that remove the information communication mechanism partly and fully. The former removes the communication from stream P to stream A, and the latter removes the communication in both directions. A significant gain of 0.49dB on SDR and 0.05 on PESQ is observed when comparing PHASEN-w/o-P2A to PHASEN-w/o-A2PP2A. This indicates that the information in the intermediate steps of amplitude prediction is very helpful to phase prediction. Comparing our full model PHASEN with PHASEN-w/o-P2A, we also see that when integrating stream P information into stream A, the model gets a 0.22dB gain on SDR and a 0.02 gain on PESQ. This shows that phase features can also help amplitude prediction.

Fig. 5 also confirms the above improvements through visualization. Here, because the predicted phase spectrogram has few visible patterns, we visualize $\Psi / \Psi^{in}$, which represents the phase difference between the predicted phase spectrogram and the input noisy spectrogram. The division operation in this formula is in the complex domain, $\Psi$ is the predicted phase and $\Psi^{in}$ represents the phase spectrogram of the input noisy speech. From the visualization, we can conclude that the information communication mechanism not only significantly improves the phase prediction, but also helps remove amplitude artifacts. To summarize, information communication in both directions is useful in PHASEN, while direction “A2P” plays a key role.

Figure 5: The effect of the information communication mechanism. Best viewed in color. We use the same input noisy speech as the one in Fig.2 to produce the results. (a),(b),(c): Amplitude of the predicted spectrogram, and real part and imaginary part of the phase difference, in setting PHASEN-w/o-A2PP2A. Significant amplitude artifacts are observed in (a) on frequency bands where speech is overwhelmed by noise. In every T-F bin, (c) is almost zero and (b) is almost one, indicating failure of phase prediction. (d),(e),(f): Amplitude of the predicted spectrogram, and real part and imaginary part of the phase difference, in setting PHASEN. In (e) and (f), phase prediction is clearly visible in T-F bins where noise overwhelms speech. (d) also shows fewer artifacts on the amplitude spectrogram.

Other ablations

Apart from the results shown in Table 1, we also perform ablations on the activation function of stream P and on the normalization functions of both streams.

The proposed method uses no activation function in stream P. Though this design is counter-intuitive, it is inspired by previous work [10] and also supported by the ablation study. We try adding ReLU or Tanh as the activation function after each convolutional layer in stream P except the last one. However, this causes a 0.02dB-0.16dB drop on SDR. Moreover, if ReLU is added after the last convolutional layer in stream P, a huge drop of 5.52dB on SDR and 0.2 on PESQ is observed.

The proposed method uses gLN in stream P and BN in stream A. We test the other normalization method for each stream. A performance drop of 0.97dB on SDR and 0.12 on PESQ is observed if gLN is used in stream A, while a drop of 0.09dB on SDR and 0.02 on PESQ is observed if BN is used in stream P.

From these two experiments, we can observe significant difference between phase prediction and amplitude mask prediction. This supports our design of using two streams to accomplish the two prediction tasks.

4.4 System Comparison

We carry out system comparison on both datasets mentioned in section 4.1.

AVSpeech + AudioSet

On this large dataset, we compare our method with two other recent methods, Conv-TasNet [10] and “Google” [1]. Conv-TasNet is a time domain method. The result of Conv-TasNet is produced using the released code, trained for the same number of epochs and on the same data as our PHASEN. “Google” is a T-F domain masking method which uses cIRM as supervision. The method is intended for both speech noise reduction and speech separation. We compare PHASEN with their audio-only, 1S+noise setting. The result in Table 2 shows that our method outperforms both Conv-TasNet and “Google”. Note that this is achieved even though we only use a small fraction of the training steps (1M/5M) and data (100k/2.4M) used by “Google”. Such superior performance on a large dataset demonstrates that our method can generalize to various speakers and various kinds of noisy environments. It suggests that PHASEN is readily applicable to complicated real-world environments.

Method                          SDR(dB)  PESQ
Conv-TasNet                     14.19    2.93
Google(5M step, 2.4M speech)    16.00    -
PHASEN(1M step, 100k speech)    16.84    3.40
Table 2: System comparison on AVSpeech + AudioSet

Voice Bank + DEMAND

Apart from using a large dataset, we also train our model on the small but commonly used Voice Bank + DEMAND dataset, so that we can fairly compare PHASEN with many other methods. In this experiment, our network is trained on the training set for 40 epochs, with the Adam optimizer using a warm-up step number of 6000, a learning rate of 0.0005, and a batch size of 12.

Table 3 shows the comparison result. Firstly, our method has a very large gain over time-domain methods like SEGAN [15], Wavenet [17], and DFL [4] on all the five metrics, even though these time-domain methods are free of the phase-prediction problem. This demonstrates the advantage of our method over the time-domain methods in capturing phase-related information. Also, our method shows great improvement over time-frequency domain methods like MMSE-GAN [18] on all metrics, indicating the superiority of our network design. Finally, we also compare our method with a recent hybrid model of time-domain and time-frequency domain called MDPhD [8]. Our method significantly outperforms it on four metrics, and there is only a small difference of about 0.04dB on the SSNR metric.

Method     SSNR(dB)  PESQ  CSIG  CBAK  COVL
Noisy      1.68      1.97  3.35  2.44  2.63
SEGAN      7.73      2.16  3.48  2.94  2.80
Wavenet    -         -     3.62  3.23  2.98
DFL        -         -     3.86  3.33  3.22
MMSE-GAN   -         2.53  3.80  3.12  3.14
MDPhD      10.22     2.70  3.85  3.39  3.27
PHASEN     10.18     2.99  4.21  3.55  3.62
Table 3: System comparison on Voice Bank + DEMAND

5 Conclusion

We have proposed a two-stream architecture with two-way information communication for efficient phase prediction in monaural speech enhancement. We have also designed a learnable frequency transformation matrix in the network. It spontaneously learns a pattern that is consistent with harmonic correlation. Comprehensive ablation studies have been carried out, justifying almost every design choice we have made in PHASEN. Comparison with state-of-the-art systems on both the AVSpeech+AudioSet and Voice Bank+DEMAND datasets demonstrates the superior performance of PHASEN. Note that the current design of PHASEN does not allow it to be used for low-latency applications, such as voice over IP. In the future, we plan to explore the potential of PHASEN in low-latency settings and mobile settings, which require a smaller model size and shorter inference time. We also plan to extend this architecture to other related tasks such as speech separation.


  • [1] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619. Cited by: §1, §1, §2.1, §4.1, §4.1, §4.4.
  • [2] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux (2015) Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In ICASSP 2015, pp. 708–712. Cited by: §2.1.
  • [3] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In ICASSP 2017, pp. 776–780. Cited by: §4.1.
  • [4] F. G. Germain, Q. Chen, and V. Koltun (2018) Speech denoising with deep feature losses. arXiv preprint arXiv:1806.10522. Cited by: §4.4.
  • [5] G. Hu and D. Wang (2001) Speech segregation based on pitch tracking and amplitude modulation. In Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575), pp. 79–82. Cited by: §2.1.
  • [6] Y. Hu and P. C. Loizou (2007) Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 16 (1), pp. 229–238. Cited by: 3rd item, 4th item, 5th item.
  • [7] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde (2017) Singing voice separation with deep u-net convolutional networks. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 323–332. Cited by: §1.
  • [8] J. Kim, J. Yoo, S. Chun, A. Kim, and J. Ha (2018) Multi-domain processing via hybrid denoising networks for speech enhancement. arXiv preprint arXiv:1812.08914. Cited by: §4.4.
  • [9] M. Krawczyk and T. Gerkmann (2014) STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (12), pp. 1931–1940. Cited by: §2.3.
  • [10] Y. Luo and N. Mesgarani (2019) Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266. Cited by: §2.2, §4.3, §4.4.
  • [11] Y. Masuyama, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada (2019) Deep griffin–lim iteration. In ICASSP 2019, pp. 61–65. Cited by: §2.1.
  • [12] P. Mowlaee and J. Kulmer (2015) Harmonic phase estimation in single-channel speech enhancement using phase decomposition and snr information. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (9), pp. 1521–1532. Cited by: §2.3.
  • [13] K. Paliwal, K. Wójcicki, and B. Shannon (2011) The importance of phase in speech enhancement. Speech Communication 53 (4), pp. 465–494. Cited by: §2.1.
  • [14] A. Pandey and D. Wang (2019) TCNN: temporal convolutional neural network for real-time speech enhancement in the time domain. In ICASSP 2019, pp. 6875–6879. Cited by: §2.2.
  • [15] S. Pascual, A. Bonafonte, and J. Serra (2017) SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452. Cited by: §2.2, §4.4.
  • [16] C. Plapous, C. Marro, and P. Scalart (2005) Speech enhancement using harmonic regeneration. In Proceedings.(ICASSP’05)., Vol. 1, pp. I–157. Cited by: §2.3.
  • [17] D. Rethage, J. Pons, and X. Serra (2018) A wavenet for speech denoising. In ICASSP 2018, pp. 5069–5073. Cited by: §2.2, §4.4.
  • [18] M. H. Soni, N. Shah, and H. A. Patil (2018) Time-frequency masking-based speech enhancement using generative adversarial network. In ICASSP 2018, pp. 5039–5043. Cited by: §4.4.
  • [19] S. Srinivasan, N. Roman, and D. Wang (2006) Binary and ratio time-frequency masks for robust speech recognition. Speech Communication 48 (11), pp. 1486–1501. Cited by: §2.1.
  • [20] N. Takahashi, P. Agrawal, N. Goswami, and Y. Mitsufuji (2018) PhaseNet: discretized phase modeling with deep neural networks for audio source separation. In Interspeech, pp. 2713–2717. Cited by: §2.1.
  • [21] S. Takamichi, Y. Saito, N. Takamune, D. Kitamura, and H. Saruwatari (2018) Phase reconstruction from amplitude spectrograms based on von-mises-distribution deep neural network. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 286–290. Cited by: §2.1.
  • [22] J. Thiemann, N. Ito, and E. Vincent (2013) The diverse environments multi-channel acoustic noise database: a database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America 133 (5), pp. 3591–3591. Cited by: §4.1.
  • [23] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi (2016) Investigating rnn-based speech enhancement methods for noise-robust text-to-speech. In SSW, pp. 146–152. Cited by: §4.1.
  • [24] E. Vincent, R. Gribonval, and C. Févotte (2006) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 14 (4), pp. 1462–1469. Cited by: 1st item.
  • [25] Y. Wakabayashi, T. Fukumori, M. Nakayama, T. Nishiura, and Y. Yamashita (2018) Single-channel speech enhancement with phase reconstruction based on phase distortion averaging. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 26 (9), pp. 1559–1569. Cited by: §2.3.
  • [26] Y. Wang, A. Narayanan, and D. Wang (2014) On training targets for supervised speech separation. IEEE/ACM transactions on audio, speech, and language processing 22 (12), pp. 1849–1858. Cited by: §2.1.
  • [27] D. S. Williamson, Y. Wang, and D. Wang (2016) Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24 (3), pp. 483–492. Cited by: §1, §1, §1, §2.1, §2.1, §3.1.