Improved Lite Audio-Visual Speech Enhancement

08/30/2020 ∙ by Shang-Yi Chuang, et al. ∙ Academia Sinica

Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE) tasks, seeking a solution that uses visual data as auxiliary and complementary input to reduce the noise of noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm. Compared to conventional AVSE systems, LAVSE requires less online computation and moderately solves the user privacy problem on facial data. In this study, we extend LAVSE to improve its ability to address three practical issues often encountered in implementing AVSE systems, namely, the requirement for additional visual data, audio-visual asynchronization, and low-quality visual data. The proposed system is termed improved LAVSE (iLAVSE), which uses a convolutional recurrent neural network architecture as the core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that compared to conventional AVSE systems, iLAVSE can effectively overcome the aforementioned three practical issues and can improve enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.




1 Introduction

Speech is the most natural and convenient means for human-human and human-machine communications. In recent years, various speech-related applications have been developed and have facilitated our daily lives. For most of these applications, however, the performance may be affected by acoustic distortions, which may lower the quality of the input speech. These acoustic distortions may come from different sources, such as recording sensors, background noise, and reverberations. To alleviate the distortion issue, many approaches have been proposed, and speech enhancement (SE) is one of them. The goal of SE is to enhance low-quality speech signals to improve quality and intelligibility. SE systems have been widely used as front-end processes in automatic speech recognition El-Solh et al. ; Li et al. (2015); Vincent et al. (2018), speaker recognition Li et al. (2011b), speech coding Li et al. (2011a), hearing aids Levit (2001); Venema (2006); Healy et al. (2019), and cochlear implants Chen et al. (2015); Lai et al. (2016) to improve the performance of target tasks.

Traditional SE methods are generally designed based on the properties of speech and noise signals. A class of approaches estimates the statistics of speech and noise signals to design a gain/filter function, which is then used to suppress the noise components in noisy speech. Notable examples belonging to this class include the Wiener filter Scalart and others ; Chen et al. (2008) and its extensions Hänsler and Schmidt (2006), such as the minimum mean square error spectral estimator Makhoul (1975); Quatieri and McAulay (1992), maximum a posteriori spectral amplitude estimator Lotter and Vary (2005); Suhadi et al. (2010), and maximum likelihood spectral amplitude estimator McAulay and Malpass (1980); Kjems and Jensen . Another class of approaches considers the temporal properties or data distributions of speech and noise signals. Notable examples include harmonic models Frazier et al. , linear prediction models Atal and Schroeder (1979); Ephraim (1992), hidden Markov models Rabiner and Juang (1986), singular value decomposition Hu and Loizou , and Karhunen-Loeve transform Rezayee and Gazor (2001). In recent years, numerous machine-learning-based SE methods have been proposed. These approaches generally learn a model from training data in a data-driven manner. Then, the trained model is used to convert the noisy speech signals into the clean speech signals. Notable machine-learning-based SE methods include compressive sensing Wang et al. (2016), sparse coding Eggert and Korner ; Chin et al. (2017), non-negative matrix factorization Mohammadiha et al. (2013), and robust principal component analysis Candès et al. (2011); Huang et al. .

More recently, deep learning (DL) has become a popular and effective machine learning algorithm Ronneberger et al. ; He et al. ; Vaswani et al. and has brought significant progress in the SE field Zhang and Wang (2016); Pascual et al. ; Michelsanti and Tan ; Luo and Mesgarani ; Zhang et al. ; Xu and Fosler-Lussier ; Kim et al. ; Hu et al. ; Yang et al. . Based on the deep structure, an effective representation of the noisy input signal can be extracted and used to reconstruct a clean signal Williamson et al. (2015); Wang and Chen (2018); Zheng and Zhang (2018); Plantinga et al. ; Wang et al. ; Qi et al. ; Carbajal et al. (2020). Various DL-based model structures, including deep denoising autoencoders Lu et al. ; Xia and Bao (2014), fully connected neural networks Liu et al. ; Xu et al. (2015); Kolbæk et al. (2017), convolutional neural networks (CNNs) Fu et al. ; Pandey and Wang (2019), recurrent neural networks (RNNs), and long short-term memory (LSTM) Campolucci et al. (1999); Weninger et al. ; Erdogan et al. ; Chen et al. ; Weninger et al. ; Sun et al. , have been used as the core model of an SE system and have been proven to provide better performance than traditional statistical and machine-learning methods. Another well-known advantage of DL models is that they can flexibly fuse data from different domains. Recently, researchers have tried to incorporate text Kinoshita et al. , bone-conducted signals Yu et al. (2020), and visual cues Wu et al. ; Michelsanti et al. (2019); Iuzzolino and Koishida ; Gu et al. (2020) into SE systems as auxiliary and complementary information to achieve better SE performance. Among them, visual cues are the most common and intuitive because most devices can capture audio and visual data simultaneously. Numerous audio-visual SE (AVSE) systems have been proposed and confirmed to be effective Hou et al. (2018); Ideli et al. ; Adeel et al. (2019); Michelsanti et al. (2020). In our previous work, a lite AVSE (LAVSE) approach was proposed to handle the immense visual data and potential privacy issues Chuang et al. . The LAVSE system uses an autoencoder (AE)-based compression network along with a latent feature quantization unit Wu et al. ; Hsu et al. to successfully reduce the size of visual data and handle the privacy issues.

In this study, we further explore three practical issues that are often encountered when implementing AVSE systems in real-world scenarios: (1) the requirement of additional visual data (usually much larger than audio data), (2) audio-visual asynchronization, and (3) low-quality visual data. To address these issues, we extend the LAVSE system to an improved LAVSE (iLAVSE) system, which is formed by a multimodal convolutional RNN (CRNN) architecture in which the recurrent part is realized by an LSTM layer. The audio data are provided as input directly to the SE model, while the visual input is first processed by a three-unit data compression module CRQ (C for color channel, R for resolution, and Q for bit quantization) and a pretrained AE module. In CRQ, we adopt three data compression units: reducing the number of channels, reducing the resolution, and reducing the number of bits. The AE is formed by a deep convolutional architecture and can extract meaningful and compact representations, which are then quantized and used as input of the CRNN AVSE model. Based on the visual data compression CRQ module and AE module, the size of visual input is significantly reduced, and the privacy issue can be addressed well.

Audio-visual asynchronization is a common issue that may arise from low-quality audio-visual sensors. We propose to handle this issue based on a data augmentation scheme. The problem of low-quality visual data also includes the failure of the sensor to capture the visual signal. A practical example is the use of an AVSE system in a car driving scenario. When the car passes through a tunnel, the visual information disappears due to insufficient light. We solve this problem through a zero-pad training scheme. The proposed iLAVSE was evaluated on the Taiwan Mandarin speech with video (TMSV) dataset Chuang et al. . Based on the special design of model architecture and data augmentation, iLAVSE can effectively overcome the above three issues and provide more robust SE performance than LAVSE and several related SE methods.

The remainder of this paper is organized as follows. Section 2 reviews related work on AVSE systems and data quantization techniques. Section 3 introduces the proposed iLAVSE system. Section 4 presents our experimental setup and results. Finally, Section 5 provides the concluding remarks.

2 Related Work

2.1 AVSE

(a) AVDCNN Hou et al. (2018).
(b) LAVSE Chuang et al. .
Figure 1: Previous AVSE systems.

In this section, we review several existing AVSE systems. In Hou et al. , a fully connected network was used to jointly process audio and visual inputs to perform SE. Since the fully connected architecture cannot effectively process visual information, the AVSE system in Hou et al. is only slightly better than its audio-only SE counterpart. In order to further improve the performance, a multimodal deep CNN SE (termed AVDCNN) system Hou et al. (2018) was subsequently proposed. As shown in Fig. 1(a) (ISTFT denotes inverse short time Fourier transform; FC denotes fully connected layers; Conv denotes convolutional layers; Pool denotes max-pooling layers), the AVDCNN system consists of several convolutional layers to process audio and visual data. Experimental results show that compared with the audio-only deep CNN system, the AVDCNN system can effectively improve the SE performance. Later, Gabbay et al. proposed another visual SE (VSE) model, which has a similar architecture to AVDCNN, but does not reconstruct the visual part in the output layer Gabbay et al. . In the meantime, a looking-to-listen (L2L) system was proposed, which uses estimated complex masks to reconstruct enhanced spectral features Ephrat et al. (2018). In Sadeghi et al. (2020), a variational AE (VAE) model was used as the basis model to build the AVSE system. The authors also investigated the possibility of using a strong pretrained model for visual feature extraction and performing SE in an unsupervised manner.

Unlike audio-only SE systems, the above-mentioned AVSE systems require additional visual input, which causes additional hardware and computational costs. In addition, the use of facial or lip images may cause privacy issues. Some work has been done to deal with these two issues. In Chuang et al. , the LAVSE system was proposed to effectively reduce the size of visual input and user identifiability. As shown in Fig. 1(b), the LAVSE system uses an AE to extract meaningful and compact representations of visual data as the input of the SE model to reduce computational costs and appropriately solve the privacy problem in facial information.

2.2 Data Quantization

Figure 2: Single-Precision Floating-Point Format.

Quantization is a simple and effective way to reduce the size of data. Fig. 2 shows the data format of single-precision floating-point in IEEE 754 Institute of Electrical and Electronics Engineers (1985). There are 32 base-2 bits, including 1 sign bit, 8 exponent bits, and 23 mantissa bits. The decimal value of a single-precision floating-point representation is calculated as

$$v_{10} = (-1)^{s_2} \times 2^{(e_2)_{10} - 127} \times (1.m_2)_{10},$$

where $s$, $e$, and $m$ are the sign, exponent, and mantissa fields, and the subscripts 2 and 10 denote base-2 and base-10, respectively. The sign bit determines whether the value represented is positive ($s_2 = 0$) or negative ($s_2 = 1$). The exponent bits store an unsigned integer that is offset by a bias of 127 ($2^7 - 1$), so that negative exponents can also be represented. The mantissa bits are the significant figures. The decimal value of the 32-bit representation in Fig. 2 is 0.20314788.

Obviously, the representation range of values is determined by the exponential term, and the mantissa term accounts for the precision part. Therefore, quantizing the mantissa bits does not change the range, but only reduces the precision of the original value. Based on this property, an exponent-only floating-point quantized neural network (EOFP-QNN) has been proposed to reduce the mantissa bits of the SE model parameters in Hsu et al. . Experimental results have confirmed that by moderately reducing the mantissa bits, the size of the model parameters can be reduced while the overall SE capability can be improved. In this study, we followed the same idea, keeping only the sign and exponent bits, and removing all mantissa bits to perform visual data compression.
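The bit layout and the mantissa-removal idea can be illustrated with a minimal Python sketch that uses only the standard library. The helper names (`fields`, `decode`, `drop_mantissa`) are hypothetical and do not come from the original work. The sketch assumes normalized values and keeps all 8 exponent bits, so it does not reproduce settings that also shorten the exponent field.

```python
import struct

def float_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x (x is rounded to single precision)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fields(x: float):
    """Split x into (sign, biased exponent, mantissa) bit fields."""
    bits = float_to_bits(x)
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def decode(sign: int, exponent: int, mantissa: int) -> float:
    """Decimal value of a normalized number: (-1)^s * 2^(e - 127) * 1.m"""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)

def drop_mantissa(x: float) -> float:
    """Keep the sign and exponent bits and set all 23 mantissa bits to zero."""
    return bits_to_float(float_to_bits(x) & 0xFF800000)
```

With these helpers, `drop_mantissa(0.20314788)` returns `0.125`: the magnitude is rounded down to the nearest power of two, so the range is preserved while the precision is reduced.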

3 Proposed iLAVSE System

As mentioned earlier, this study investigates three practical issues: (1) the requirement of additional visual data, (2) audio-visual data asynchronization, and (3) low-quality visual data. We propose three approaches to address these issues respectively: (1) visual data compression, (2) compensation on audio-visual asynchronization, and (3) zero-pad training. By integrating the above three approaches with the CRNN AVSE architecture, the proposed iLAVSE can perform SE well even under unfavorable testing conditions. In this section, we first present the overall system of iLAVSE. Then, we describe the three issues and our solutions.

3.1 iLAVSE System

Figure 3: The proposed iLAVSE system.
Figure 4: The proposed CRQ module.

The proposed iLAVSE system is illustrated in Fig. 3. As shown in the figure, the iLAVSE system includes three stages: a data preprocessing stage, a CRNN-based AVSE stage, and a data reconstruction stage. The functions of iLAVSE are as follows,

$$
\begin{aligned}
Z_i^{t \pm K} &= \mathrm{Encoder}_{AE}(\mathrm{CRQ}(I_i^{t \pm K})), \\
A_i^{t \pm K} &= \mathrm{AudioNet}(X_i^{t \pm K}), \\
V_i^{t \pm K} &= \mathrm{Qua}_{latent}(Z_i^{t \pm K}), \\
F_i^{t} &= \mathrm{FusionNet}([A_i^{t \pm K}; V_i^{t \pm K}]), \\
\hat{Y}_i^{t} &= \mathrm{Linear}_{a}(F_i^{t}), \quad \hat{V}_i^{t} = \mathrm{Linear}_{v}(F_i^{t}),
\end{aligned}
$$

where $i = 1, \dots, N$ denotes the $i$-th training utterance, and $N$ is the number of the training utterances; $t = 1, \dots, T_i$ denotes the $t$-th sample frame, $K$ is the size of the concatenated frames for a context window, and $T_i$ is the number of frames of the $i$-th utterance. We have implemented three data compression functions in iLAVSE, which are outlined in green blocks in Fig. 3. CRQ is a three-unit data compression module used to compress the visual image data. As shown in Fig. 4, the CRQ module consists of Colimg, Resimg, and Quaimg, denoting color channel reduction, resolution reduction, and bit quantization, respectively. Qualatent stands for the bit quantization of the latent feature extracted by EncoderAE, the encoder part of a pretrained AE.

In the data preprocessing stage, the waveform of the noisy data is transformed into log1p spectral features ($X$) by using the short time Fourier transform (STFT), while the visual image data ($I$) are compressed and transformed into latent features ($Z$) by the CRQ module and EncoderAE. Next, in the CRNN AVSE stage, the audio spectral features pass through an audio net composed of convolutional and pooling layers to extract the audio latent features ($A$), and the Qualatent unit further quantizes the visual input to $V$. Then, the audio latent features and the quantized visual latent features are concatenated as $[A; V]$, which is then sent into the fusion net and turned into $F$. Then, the fused features are decoded into the audio spectral features ($\hat{Y}$) and the visual latent features ($\hat{V}$), each through a linear layer. During testing, the former (with the phase of the noisy speech) is reconstructed into the speech waveform using the inverse STFT in the data reconstruction stage.

Note that we choose the log1p feature Lu et al. because its projecting range can avoid some minimum values in the data. For a magnitude close to zero, log maps the value to a large negative number (unbounded as the magnitude approaches zero), whereas log1p maps it to a value close to zero. This characteristic enables the log1p feature to be easily normalized and trained.
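The difference between the two projections can be checked numerically. The following is a small sketch using Python's standard `math` module; the magnitudes in `EXAMPLE_MAGNITUDES` are illustrative values chosen here, not values taken from the original experiments.

```python
import math

# Illustrative spectral magnitudes (assumed for this sketch only).
EXAMPLE_MAGNITUDES = (1e-6, 1e-3, 1.0, 100.0)

def log_feature(magnitude: float) -> float:
    """Plain logarithm: diverges towards -infinity as magnitude -> 0."""
    return math.log(magnitude)

def log1p_feature(magnitude: float) -> float:
    """log(1 + magnitude): stays non-negative and maps 0 to 0."""
    return math.log1p(magnitude)

if __name__ == "__main__":
    for m in EXAMPLE_MAGNITUDES:
        print(f"magnitude={m:g}  log={log_feature(m):.4f}  log1p={log1p_feature(m):.6f}")
```

For small magnitudes the plain logarithm produces large negative outliers, while log1p keeps the projected values in a bounded range near zero.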

3.2 Three Practical Issues and Proposed Solutions

3.2.1 Visual Data Compression

For AVSE systems, the main goal is to use visual data as an auxiliary input to retrieve the clean speech signals from the distorted speech signals. However, the size of visual data is generally much larger than that of audio data, which may cause unfavorable hardware and computational costs when implementing the AVSE system. Our previous work has proven that visual data may not require very high precision, and the original image sequence can be replaced by meaningful and compact representations extracted by an AE Chuang et al. . In this study, we further explore directly reducing the size of visual data by the CRQ compression module. The AE is directly applied to the compressed image sequence to extract a compact representation. The extracted representation is then further compressed by Qualatent and sent to the CRNN-based AVSE stage in iLAVSE.

Visual Feature Extraction by a CNN-based AE
Figure 5: The AE model for visual input data compression.

As mentioned earlier, iLAVSE uses the three visual data compression units in the CRQ module, namely Colimg, Resimg, and Quaimg, to perform color channel reduction, resolution reduction, and bit quantization, respectively. The size of the original image sequence can be notably reduced by the three units. The compressed visual data is then passed to EncoderAE, and the latent representation is used as the visual representation. As shown in Fig. 5, we use a 2D-convolution-layer-only AE to process the visual input data. For a given visual input, the AE is trained to reconstruct the input images.

Generally, captured images are saved in RGB (three channels) or grayscale (one channel) format. Therefore, to make the iLAVSE system applicable to different scenarios, we consider both RGB and grayscale visual inputs to train the AE model. As a result, this AE model can reconstruct RGB and grayscale images.

In addition, we use images with different resolutions to train the AE model. Since the lip images are about 100 to 250 pixels square, we designed three settings to reduce the resolution—64, 32, and 16 (pixels square). When using a resolution of 64, for example, the original image at sizes of 100 to 250 pixels square is resized to 64 pixels square.

For data quantization, we first quantize the values of an input image by removing the mantissa bits in the floating-point representation. To train the AE, we place the quantized and original images at the input and output, respectively. In real-world applications, the AE model can reconstruct the original visual data from the quantized version.
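The three compression units can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: channel averaging for the color reduction, nearest-neighbor sampling for the resolution reduction, and mantissa removal that keeps all 8 exponent bits for the bit quantization. The function names (`to_gray`, `resize_nearest`, `quantize`, `crq`) are hypothetical, and the original system may use different operations.

```python
import struct

def to_gray(image):
    """Color channel reduction: average the RGB channels (assumed method)."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in image]

def resize_nearest(image, size):
    """Resolution reduction to size x size by nearest-neighbor sampling (assumed method)."""
    height, width = len(image), len(image[0])
    return [[image[(r * height) // size][(c * width) // size] for c in range(size)]
            for r in range(size)]

def quantize(value):
    """Bit quantization: zero the 23 mantissa bits of the float32 pattern."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFF800000))[0]

def crq(image, size):
    """Apply color reduction, resolution reduction, and bit quantization in order."""
    small = resize_nearest(to_gray(image), size)
    return [[quantize(v) for v in row] for row in small]
```

For example, a 4 x 4 RGB image whose pixels are all `(0.3, 0.6, 0.9)` becomes a 2 x 2 single-channel image whose entries are all `0.5` after `crq(image, 2)`.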

Latent Feature Compression
(a) 32-bit AE features.
(b) 3-bit AE features.
Figure 6: Original and quantized visual latent features.
Figure 7: The distributions of original and quantized visual latent features.

After extracting the latent feature by passing the compressed images to the AE, Qualatent in Fig. 3 can further reduce the number of bits of each latent feature element. The quantized visual latent features are then used in the CRNN AVSE stage. Fig. 6 shows the visual latent features before and after the Qualatent module. In real-world applications, the EncoderAE module and Qualatent unit can be installed in a low-quality visual sensor, thereby improving the online computing efficiency and greatly reducing the transmission costs.

To further confirm that the quantized latent representation can be used to replace the original latent representation, we plotted the distributions of the latent representations before and after applying bit quantization in Fig. 7. The lighter green bins represent the feature before Qualatent is applied, and the darker green bins represent the feature after Qualatent is applied. We can see that the darker green bins cover the range of the lighter green bins well, indicating that we can use the quantized latent feature to replace the original latent feature.

3.2.2 Compensation of Audio-Visual Asynchronization

(a) Synchronous audio and visual data.
(b) Asynchronous audio and visual data.
Figure 8: Synchronous and asynchronous audio and visual data.

Multimodal data asynchronization is a common issue in multimodal learning. We also encountered this problem when implementing the AVSE system. The ideal situation is that the audio and visual data are precisely synchronized in time. Otherwise, the ancillary visual information may not be helpful or may even worsen the SE performance. Fig. 8 shows the synchronous and asynchronous situations of audio and visual data. Owing to audio-visual asynchronization, the video frames are not aligned with the speech well. In this study, we propose a data augmentation approach to alleviate this audio-visual asynchronization issue. The main idea is to artificially simulate various asynchronous audio-visual data to train the AVSE systems.
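One way to simulate asynchronous data for augmentation is to shift the visual frame sequence against the audio frames. The sketch below is a hedged illustration, not the authors' procedure: the function name `shift_frames`, the sign convention of `offset`, and the choice to pad by repeating the edge frame are all assumptions made here.

```python
def shift_frames(frames, offset):
    """Shift a visual frame sequence by `offset` frames relative to the audio.

    offset > 0 delays the visual stream, offset < 0 advances it.
    The sequence length is preserved by repeating the edge frame (assumed padding).
    """
    frames = list(frames)
    n = len(frames)
    if n == 0 or offset == 0:
        return frames
    if offset > 0:
        k = min(offset, n)
        return [frames[0]] * k + frames[:n - k]
    k = min(-offset, n)
    return frames[k:] + [frames[-1]] * k

def augment_with_offsets(frames, offsets):
    """Create one shifted copy of the visual stream per offset."""
    return {offset: shift_frames(frames, offset) for offset in offsets}
```

For instance, `shift_frames([1, 2, 3, 4], 1)` gives `[1, 1, 2, 3]`, which pairs each audio frame with the lip image from one frame earlier.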

3.2.3 Zero-Pad Training

(a) Low-quality lip images.
(b) Low-quality latent features.
Figure 9: Low-quality visual data.

Because visual data are regarded as an auxiliary input to the AVSE systems, a necessary requirement is that low-quality visual conditions will not degrade the SE performance. Under poor lighting conditions, such as in a tunnel or at a night market, the quality of video frames may be poor. Fig. 9(a) shows an example in which a segment of frames (in the middle region) has very poor quality. Using all of the video frames directly may degrade the AVSE performance. To overcome this problem, we intend to let iLAVSE dynamically decide whether video data should be used. More specifically, when the quality of a segment of image frames is poor (which can be determined using an additional light sensor), iLAVSE can directly discard the visual information. In this study, we prepare the training data by replacing the visual latent features of low-quality frames with zero, as shown in Fig. 9(b). In this way, iLAVSE can perform SE based only on the audio input, without considering visual information, when the video frames are of low quality. Note that this study only considered the case in which the low-quality condition occurs in a consecutive segment of frames, not in sporadic frames. However, we believe that the proposed zero-pad training method is suitable for different low-quality visual data scenarios.
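The preparation of zero-padded training data can be sketched as follows. This is a minimal illustration that assumes the latent features are stored as one vector per frame; the function name `zero_pad_segment` and its arguments are hypothetical, and how the segment position and length are chosen is not specified here.

```python
def zero_pad_segment(latents, start, length):
    """Return a copy of per-frame latent vectors with one consecutive segment zeroed.

    latents: list of per-frame feature vectors (lists of floats).
    start, length: first frame index and number of frames treated as low quality.
    """
    padded = [list(vector) for vector in latents]
    first = max(start, 0)
    last = min(start + length, len(padded))
    for t in range(first, last):
        padded[t] = [0.0] * len(padded[t])
    return padded
```

For example, zeroing two frames starting at index 1 in a four-frame sequence leaves frames 0 and 3 untouched and replaces frames 1 and 2 with all-zero vectors.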

4 Experiments

This section presents the experimental setup and results. Two standardized evaluation metrics were used to evaluate the SE performance: perceptual evaluation of speech quality (PESQ) Rix et al. and short-time objective intelligibility measure (STOI) Taal et al. (2011). PESQ was developed to evaluate the quality of processed speech, and the score ranges from -0.5 to 4.5. A higher PESQ score indicates that the enhanced speech has better speech quality. STOI was designed to evaluate the speech intelligibility. The score typically ranges from 0 to 1. A higher STOI value indicates better speech intelligibility.

(a) Audio-only SE.
(b) Dual-path-audio-only SE.
Figure 10: Architectures of two audio-only SE systems.

Two audio-only baseline SE systems were implemented for comparison. Their model architectures are illustrated in Fig. 10. Fig. 10(a) is a system with the visual part in the iLAVSE system deleted, and Fig. 10(b) is a system with a dual-path audio model. The additional audio net in Fig. 10(b) is used to increase the number of model parameters to be the same as in the iLAVSE model. This system tests whether additional improvements can be achieved by simply increasing the number of model parameters.

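The loss function for training iLAVSE is based on the mean square error computed from both the audio and visual parts,

$$\mathcal{L} = \mathrm{MSE}(\hat{Y}, Y) + \mu \times \mathrm{MSE}(\hat{V}, V),$$

where $Y$ and $\hat{Y}$ denote the clean and enhanced audio spectral features, $V$ and $\hat{V}$ denote the visual latent features at the input and output of the AVSE model, and the weight $\mu$ is determined empirically. For training the two audio-only SE systems, $\mu = 0$ is used.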
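A combined loss of this kind can be sketched in plain Python, assuming the audio and visual mean square errors are joined as a weighted sum with a scalar weight. The names `mse`, `avse_loss`, and `mu` are hypothetical, and no particular weight value is implied.

```python
def mse(pred, target):
    """Mean square error between two equal-length sequences of numbers."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def avse_loss(audio_pred, audio_target, visual_pred, visual_target, mu):
    """Audio MSE plus mu times visual MSE; mu = 0 gives an audio-only loss."""
    loss = mse(audio_pred, audio_target)
    if mu != 0:
        loss += mu * mse(visual_pred, visual_target)
    return loss
```

Setting `mu` to zero skips the visual term entirely, which matches the case in which a system has no visual output to reconstruct.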

In this study, all the SE models were implemented using the PyTorch Paszke et al. library. The optimizer is Adam Kingma and Ba with a learning rate of .

4.1 Experimental Setup

In this section, the details of the dataset and the implementation steps of iLAVSE and other SE systems are introduced.

4.1.1 Dataset

We evaluated the proposed system on the TMSV dataset. The dataset contains video recordings of 18 native speakers (13 males and 5 females), each speaking 320 utterances of Mandarin sentences, with the script of the Taiwan Mandarin hearing in noise test Huang (2005). Each sentence has 10 Chinese characters, and the length of each utterance is approximately 2–4 seconds. The utterances were recorded in a recording studio with sufficient light, and the speakers were filmed from the front view. The video was recorded at a resolution of 1920 × 1080 pixels at 50 frames per second. The audio was recorded at a sampling rate of 48 kHz.

In this study, we selected the video files from 8 speakers (4 males and 4 females) to form the training set. For each speaker, among the 320 utterances, the 1st to the 200th utterances were selected. The utterances were artificially corrupted by 100 types of noise Hu (2004) at 5 different signal-to-noise ratio (SNR) levels, from -12 dB to 12 dB with a step of 6 dB. The 201st to 320th video recordings of 2 other speakers (1 male and 1 female) were used to form the testing set. Six types of noise were selected, which are common in car-driving scenarios, including baby cry, engine noise, background talkers, music, pink noise, and street noise. We artificially generated noisy utterances by contaminating the clean testing speech with these 6 types of noise at 4 low SNR levels, including -1, -4, -7, and -10 dB. The speakers, speech contents, noise types, and SNR levels were all mismatched in the training and testing sets.
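Corrupting clean speech at a target SNR amounts to scaling the noise before adding it. The sketch below shows the standard power-based scaling in plain Python; the function names (`power`, `mix_at_snr`) and the toy signals are assumptions for illustration, and the original mixing script may differ in details such as noise segment selection.

```python
import math

def power(signal):
    """Average power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech-to-noise power ratio equals snr_db, then add it.

    Returns (noisy_signal, gain), where gain is the factor applied to the noise.
    """
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    noisy = [s + gain * n for s, n in zip(speech, noise)]
    return noisy, gain

# The five training SNR levels: -12 dB to 12 dB with a step of 6 dB.
TRAIN_SNRS_DB = list(range(-12, 13, 6))
```

A lower target SNR produces a larger gain, so the same noise recording contributes more energy to the mixture.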

4.1.2 Audio and Visual Feature Extraction

The recorded speech signals were downsampled to 16 kHz and mixed into monaural waveforms. The speech waveforms were converted into spectrograms with STFT. The window size of STFT was 512, corresponding to 32 milliseconds. The hop length was 320, so the interval between each frame was 20 milliseconds. The audio data was formatted at 50 frames per second and was aligned with the video data. For each speech frame, the log1p magnitude spectrum Lu et al. was extracted, and the value was normalized to zero mean and unit standard deviation. The normalization process was conducted at the utterance level; that is, the mean and standard deviation vectors were calculated on all frames of an utterance. The length of the context window was 5, i.e., ±2 frames were concatenated to the central frame. Accordingly, the dimension of the final frame-based audio feature vector was 257 × 5.

For each frame in the video, the contour of the lips was detected using a 68-point facial landmark detector with Dlib King (2009), and the RGB channels were retained. The extracted lip images were approximately 100 pixels square to 250 pixels square. The AE was trained on the lip images in the training set. The latent representation (2048-dimensional) of the AE was used as the visual input to the CRNN-based AVSE stage. As with the audio feature, ±2 frames were concatenated to the central frame. Therefore, the dimension of the frame-based visual feature vector was 2048 × 5.
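The framing arithmetic and the context-window concatenation can be checked with a short Python sketch. The constants come from the setup described above; the function name `context_windows` and the edge handling (repeating the first or last frame at utterance boundaries) are assumptions made for this illustration.

```python
SAMPLE_RATE = 16000   # Hz, after downsampling
WINDOW_SIZE = 512     # STFT window length in samples
HOP_LENGTH = 320      # STFT hop length in samples
CONTEXT = 2           # frames concatenated on each side of the central frame

FREQ_BINS = WINDOW_SIZE // 2 + 1                 # 257 one-sided frequency bins
WINDOW_MS = 1000 * WINDOW_SIZE / SAMPLE_RATE     # 32.0 ms
HOP_MS = 1000 * HOP_LENGTH / SAMPLE_RATE         # 20.0 ms
FRAMES_PER_SECOND = SAMPLE_RATE / HOP_LENGTH     # 50.0, matching the video rate

def context_windows(frames, context=CONTEXT):
    """Concatenate each frame with its +/- `context` neighbors.

    Frames beyond the utterance boundary are replaced by the nearest
    valid frame (assumed edge handling).
    """
    n = len(frames)
    windows = []
    for t in range(n):
        window = []
        for d in range(-context, context + 1):
            index = min(max(t + d, 0), n - 1)
            window.extend(frames[index])
        windows.append(window)
    return windows
```

With 257-dimensional audio frames, each window has 257 × 5 = 1285 values; with 2048-dimensional visual latent features, each window has 2048 × 5 = 10240 values.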

4.2 Experimental Result

4.2.1 AVSE Versus Audio-Only SE

System      PESQ    STOI
Noisy       1.001   0.587
AOSE        1.282   0.616
AOSE(DP)    1.283   0.610
AVDCNN      1.337   0.641
LAVSE       1.374   0.646
Table 1: Average PESQ and STOI scores of the two audio-only SE systems and the two existing AVSE systems over SNRs of -1, -4, -7 and -10 dB.

The two audio-only SE systems shown in Fig. 10 were used as the baselines. The results of the audio-only SE (denoted as AOSE) and dual-path audio-only SE (denoted as AOSE(DP)) systems are shown in Table 1. As mentioned earlier, AOSE(DP) has a similar number of model parameters to LAVSE. From the results in Table 1, we note that AOSE and AOSE(DP) yield similar performance in terms of PESQ and STOI. The result suggests that the additional path with extra parameters cannot provide improvements for the audio-only SE system in this task. Table 1 also lists the results of two existing AVSE systems, namely AVDCNN Hou et al. (2018) and LAVSE Chuang et al. . Compared to AOSE and AOSE(DP), both AVDCNN and LAVSE yield higher PESQ and STOI scores, confirming the effectiveness of incorporating visual data into the SE system.

4.2.2 Visual Data Compression

Figure 11: Original uncompressed images of lips.
(a) RGB 64 input.
(b) RGB 32 input.
(c) RGB 16 input.
(d) GRAY 64 input.
(e) GRAY 32 input.
(f) GRAY 16 input.
(g) RGB 64 output.
(h) RGB 32 output.
(i) RGB 16 output.
(j) GRAY 64 output.
(k) GRAY 32 output.
(l) GRAY 16 output.
Figure 12: The lip images (input and output of the AE module) with different resolutions in RGB or GRAY.

In this set of experiments, we first examined the ability of iLAVSE to incorporate compressed visual data. Fig. 11 shows a sequence of original lip images. As shown in Fig. 3, the visual data preprocessing is carried out by a CRQ module, which implements three units: Colimg, Resimg, and Quaimg. Then, after the latent representation is extracted by EncoderAE, Qualatent further quantizes the bits of the latent representation. In other words, there are four units that perform visual data reduction. We represent the entire reduction process as {Colimg; Resimg; Quaimg; Qualatent} = {A; B; C; D}, where A is either RGB or GRAY (for grayscale), B denotes the image resolution, C indicates the image data quantization, and D stands for the latent feature quantization. In Fig. 12, several versions of compressed visual data are presented. In the top row, the three figures from left to right (i.e., Fig. 12(a), Fig. 12(b), and Fig. 12(c)) denote the RGB images with resolutions of 64, 32, and 16, respectively. In the second row, the three figures from left to right (i.e., Fig. 12(d), Fig. 12(e), and Fig. 12(f)) denote the GRAY images with resolutions of 64, 32, and 16. The reconstructed output generated by the autoencoder corresponding to each input is shown in Fig. 12(g), Fig. 12(h), Fig. 12(i), Fig. 12(j), Fig. 12(k), and Fig. 12(l), respectively. Comparing each pair of input and output, we confirmed that the AE can reconstruct the input images well at different resolutions (64, 32, and 16) in either RGB or GRAY.

System       PESQ (RGB)   PESQ (GRAY)   STOI (RGB)   STOI (GRAY)
AOSE(DP)     1.283 (audio only)         0.610 (audio only)
iLAVSE 64    1.374        1.378         0.646        0.646
iLAVSE 32    1.371        1.375         0.644        0.645
iLAVSE 16    1.374        1.358         0.646        0.649
Table 2: The performance of iLAVSE using lip images with reduced channel numbers and resolutions. The underlined scores are the same as those of LAVSE in Table 1 because the iLAVSE with the {RGB, 64} setup is equivalent to LAVSE.

Then, we evaluated iLAVSE with different types of compressed visual data. The results are listed in Table 2. From the table, we first see that iLAVSE outperforms AOSE(DP) in terms of PESQ and STOI with different compressed visual data. Moreover, compared to LAVSE (the underlined scores), we note that iLAVSE can still achieve comparable performance even though the resolution of the visual data has been notably reduced. For example, the {GRAY, 16} case in Table 2 strikes a good balance between the data compression ratio of 48 ($(3 \times 64 \times 64) / (1 \times 16 \times 16)$) and the PESQ and STOI scores. Therefore, we decided to use {GRAY, 16} as a representative setup in the following discussion.

(a) {RGB, 16, 5bits(i)} input.
(b) {RGB, 16, 5bits(i)} output.
(c) {GRAY, 16, 5bits(i)} input.
(d) {GRAY, 16, 5bits(i)} output.
Figure 13: AE lip images in 5 bits (1 sign bit and 4 exponential bits).
Total bits   PESQ (R)   PESQ (G)   STOI (R)   STOI (G)
1            1.333      1.296      0.619      0.615
3            1.250      1.295      0.628      0.613
5            1.361      1.398      0.644      0.641
7            1.374      1.379      0.640      0.644
9            1.386      1.387      0.642      0.642
32           1.374      1.358      0.646      0.649
Table 3: The performance of iLAVSE with or without image quantization (the original image has 32 bits), R: {RGB, 64} and G: {GRAY, 16}. The underlined scores are the same as those of LAVSE.

Next, we investigated quantized images. The input and output (reconstructed) images in RGB and GRAY are shown in the left and right columns in Fig. 13, respectively. The original 32-bit images were reduced to 5-bit images (1 sign bit and 4 exponential bits). From the figures, we observe that the AE can reconstruct the quantized image well. We also evaluated iLAVSE with the quantized images. The results are shown in Table 3. The PESQ and STOI scores reveal that when the numerical precision of the input image is reduced to 5 bits (1 sign bit and 4 exponential bits), iLAVSE still maintains satisfactory performance. When the number of bits is further reduced, the PESQ and STOI scores both decrease notably. Compared to LAVSE, which uses raw visual data, the overall compression ratio of the CRQ module from {RGB, 64, 32bits(i)} to {GRAY, 16, 5bits(i)} is 307.2 times, which is calculated as follows.

$$\frac{3 \times 64 \times 64 \times 32}{1 \times 16 \times 16 \times 5} = 307.2$$
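Both compression ratios quoted in this section follow from counting bits per image as channels × resolution × resolution × bits per value. The short sketch below reproduces that arithmetic; the function names `visual_size_bits` and `compression_ratio` are hypothetical.

```python
def visual_size_bits(channels, resolution, bits):
    """Number of bits needed for one square image with the given setup."""
    return channels * resolution * resolution * bits

def compression_ratio(original, compressed):
    """Ratio of original to compressed size; each argument is (channels, resolution, bits)."""
    return visual_size_bits(*original) / visual_size_bits(*compressed)

# {RGB, 64} to {GRAY, 16} at the same bit depth: ratio of 48.
ratio_color_resolution = compression_ratio((3, 64, 32), (1, 16, 32))

# {RGB, 64, 32bits(i)} to {GRAY, 16, 5bits(i)}: ratio of 307.2.
ratio_crq = compression_ratio((3, 64, 32), (1, 16, 5))
```

The factor between the two ratios, 32 / 5 = 6.4, is the contribution of the image bit quantization alone.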


4.2.3 Latent Feature Quantization

In this set of experiments, we investigated the impact of the bit quantization in the Qualatent unit on the visual latent representation. We intended to use fewer bits to represent the original 32-bit latent representation. The compressed representation was used as the visual feature input of the AVSE model. Fig. 5(a) and Fig. 5(b) depict the latent representations of lip features before and after applying data quantization (from 32 bits to 3 bits). As can be seen from the figures, the user identity has been almost completely removed, thereby moderately addressing the privacy problem.

We further evaluated iLAVSE with latent representation quantization. The number of bits was reduced from 32 to 1, 3, 5, 7 and 9 (1 sign bit and 0, 2, 4, 6, and 8 exponential bits). The results are listed in Table 4. From the table, we can note that for different types of visual input, latent representations with different levels of quantization provide similar performance in terms of PESQ and STOI. For example, when quantizing the latent representation to 3 bits, PESQ = 1.410 and STOI = 0.641 under the condition of {GRAY, 16, 5bits(i)}, which are much better than the performance of AOSE(DP) (PESQ = 1.283 and STOI = 0.610) and comparable to the performance of LAVSE (PESQ = 1.374 and STOI = 0.646).
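To make the bit budget concrete, the following sketch illustrates only the 9-bit setting (1 sign bit and 8 exponential bits) for an IEEE 754 single-precision value: the 23 mantissa bits are zeroed, so each value collapses to a signed power of two. This is a hedged illustration, not the Qualatent implementation. The function name `keep_sign_and_exponent` is hypothetical, truncation (rather than rounding) is an assumption, and how the exponent is represented when fewer than 8 exponential bits are kept is not specified in this section and is not covered here.

```python
import struct


def keep_sign_and_exponent(x):
    """Zero the 23 mantissa bits of a float32, keeping 1 sign bit and 8 exponent bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # float32 bit pattern as an integer
    bits &= 0xFF800000                                   # mask: sign bit + 8 exponent bits
    (y,) = struct.unpack(">f", struct.pack(">I", bits))  # back to a float
    return y


latent = [0.75, -3.0, 1.0, 0.0]
quantized = [keep_sign_and_exponent(v) for v in latent]
print(quantized)  # [0.5, -2.0, 1.0, 0.0]
```

Each quantized value needs only 9 of the original 32 bits, at the cost of losing all precision within a power-of-two interval.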

Total bits    PESQ (R)    PESQ (G)    STOI (R)    STOI (G)
1             1.365       1.374       0.642       0.642
3             1.337       1.410       0.642       0.641
5             1.343       1.413       0.643       0.641
7             1.357       1.391       0.643       0.641
9             1.362       1.373       0.643       0.643
32            1.374       1.398       0.646       0.641
Table 4: The performance of iLAVSE with or without latent quantization, R: {RGB, 64, 32bits(i)} and G: {GRAY, 16, 5bits(i)} (1 sign bit + 4 exponential bits).

4.2.4 Further Analysis

In this set of experiments, we evaluated the SE systems compared in this study at different SNR levels. The AVDCNN system using the original high-quality images is denoted as “AVSE”. For LAVSE, we used the {RGB, 64, 32bits(i), 32bits(l)} setup. For iLAVSE, we used {GRAY, 16, 5bits(i), 3bits(l)}, where (i) and (l) denote the quantization units applied to the images and the latent features, respectively. The PESQ and STOI scores for different SNR levels are shown in Fig. 14. It can be seen from the figure that all four SE systems have higher PESQ and STOI scores than the “Noisy” speech. In addition, the iLAVSE system consistently outperforms the other three SE systems at all SNR levels in terms of PESQ, and maintains satisfactory performance in terms of STOI.

(a) PESQ.
(b) STOI.
Figure 14: The performance of different SE systems at different SNR levels. LAVSE: {RGB, 64, 32bits(i), 32bits(l)}, iLAVSE: {GRAY, 16, 5bits(i), 3bits(l)}.

We further examined the spectrogram and waveform of the “Noisy” speech and the enhanced speech provided by AOSE(DP), LAVSE, and iLAVSE. An example under the condition of street noise at -7 dB is shown in Fig. 15. The spectrogram and waveform of the clean speech are also plotted for comparison. From the figure, we see that iLAVSE can suppress the noise components in the noisy speech more effectively than AOSE(DP), thus confirming the effectiveness of using the visual information. In addition, we note that the output plots of iLAVSE and LAVSE are very similar, which suggests that iLAVSE can still provide satisfactory performance even with compressed visual data.

(a) Clean waveform.
(b) Clean spectrogram.
(c) Noisy waveform.
(d) Noisy spectrogram.
(e) AOSE(DP) waveform.
(f) AOSE(DP) spectrogram.
(g) LAVSE waveform.
(h) LAVSE spectrogram.
(i) iLAVSE waveform.
(j) iLAVSE spectrogram.
Figure 15: The waveforms and spectrograms of an example speech utterance under the condition of street noise at -7 dB.

4.2.5 Asynchronization Compensation

We simulated the audio-visual asynchronization condition by offsetting the visual and audio data streams of each utterance in the time domain. We designed 5 asynchronization conditions, i.e., 5 specific offset ranges (OFR): [-1, 1], [-2, 2], [-3, 3], [-4, 4], and [-5, 5]. For example, for OFR = [-1, 1], the offset range is from -1 to 1. An offset of -1, 0, or 1 frame (each frame = 20 ms) was randomly selected (with equal probability) and used to shift the audio stream, so that the audio-visual asynchronization was -1, 0, or 1 frame. In this way, we prepared 5 sets of training data with different degrees of audio-visual asynchronization. For the testing set, we simulated the audio-visual asynchronization condition using each fixed offset in [-5, 5]. Therefore, the audio-visual testing data contained 11 different degrees of asynchronization.
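The offset procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `sample_offset` and `shift_frames` are hypothetical, a positive offset is taken to mean a delay of the audio stream, and frames shifted in from outside the utterance are filled with zeros (the text does not say how the edges are handled).

```python
import random


def sample_offset(max_offset, rng):
    """Pick an integer frame offset with equal probability from [-max_offset, max_offset]."""
    return rng.randint(-max_offset, max_offset)


def shift_frames(frames, offset, pad=0.0):
    """Delay (offset > 0) or advance (offset < 0) a frame sequence while keeping its length.

    Zero padding at the edges is an assumption made for this sketch.
    """
    n = len(frames)
    if offset > 0:
        return [pad] * min(offset, n) + frames[: max(n - offset, 0)]
    if offset < 0:
        k = min(-offset, n)
        return frames[k:] + [pad] * k
    return list(frames)


rng = random.Random(0)
offset = sample_offset(1, rng)               # OFR = [-1, 1]
audio = [0.1, 0.2, 0.3, 0.4]                 # one value per 20 ms frame, for illustration
shifted = shift_frames(audio, offset)
print(offset, offset * 20, shifted)          # offset in frames, offset in ms, shifted stream
```

For testing, the same `shift_frames` helper would be called with each fixed offset from -5 to 5 instead of a sampled one.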

Because the iLAVSE model was trained with 5 different OFRs, namely [-1, 1], [-2, 2], [-3, 3], [-4, 4], and [-5, 5], we obtained 5 iLAVSE models, termed iLAVSE(OFR1), iLAVSE(OFR2), iLAVSE(OFR3), iLAVSE(OFR4), and iLAVSE(OFR5). These 5 models were then tested on the 11 different offsets (each a fixed offset in [-5, 5]). The results are shown in Fig. 16. The results of Noisy, AOSE(DP), and iLAVSE trained without audio-visual asynchronization (denoted as iLAVSE(OFR0)) are also listed for comparison. All iLAVSE systems in this experiment used the original visual data.

Please note that, in both figures, the central point (cf. Test Offset = 0) represents the audio-visual synchronous condition. A “Test Offset” value farther from the central point indicates a more severe audio-visual asynchronous situation. “Test Offset = -5” and “Test Offset = 5” are the most severe conditions, where the audio and visual signals are misaligned by 5 frames (100 ms) in both cases.

From Fig. 16, we can note that when “Test Offset = 0”, iLAVSE(OFR0) achieves the best performance. This is reasonable because in this case, there is no asynchronous data in training and testing. When the asynchronization condition becomes severe, iLAVSE(OFR5) achieves better performance than other models. We also note that when the “Test Offset” values lie in [-3, 3], iLAVSE(OFR5) always outperforms Noisy and AOSE(DP). The results confirm the effectiveness of including audio-visual asynchronous data (as augmented training data) to train the iLAVSE system to overcome the asynchronization issue.

(a) PESQ.
(b) STOI.
Figure 16: The PESQ and STOI scores of iLAVSE trained and tested with different audio-visual asynchronous data.

4.2.6 Zero-Pad Training

We simulated the low-quality visual data condition by applying a low-quality percentage range (LPR) to the visual data. The low-quality percentage (LP) determines the percentage of missing frames in the visual data, and the LPR indicates the range of randomly assigned LPs for each batch. For example, if LPR is set to 10, LP will be randomly selected from 0% to 10%; if LP is set to 4% for a batch with a length of 150 frames, a sequence of 6 ($150 \times 4\% = 6$) frames of the visual data will be replaced with zeros. In this experiment, we chose LPRs {0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100} for training, and set LPs {0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100} to test the performance on specific percentages of missing visual data. The starting point of the missing visual part was randomly assigned for each batch.
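The zero-padding step can be sketched as below. This is an illustrative sketch under assumptions rather than the training code: the names `sample_lp` and `zero_out_block` are hypothetical, the LP is drawn uniformly from [0, LPR], the number of missing frames is obtained by rounding, and each visual frame is represented as a plain list of feature values.

```python
import random


def sample_lp(lpr, rng):
    """Draw a low-quality percentage (LP) from [0, lpr]; uniform sampling is an assumption."""
    return rng.uniform(0, lpr)


def zero_out_block(frames, lp, rng):
    """Replace one contiguous block covering lp percent of the frames with zero vectors.

    Returns the modified copy, the start index of the block, and its length.
    """
    n = len(frames)
    n_missing = round(n * lp / 100)         # e.g. 150 frames at LP = 4% gives 6 frames
    start = rng.randint(0, n - n_missing)   # random starting point of the missing part
    out = list(frames)
    for i in range(start, start + n_missing):
        out[i] = [0.0] * len(frames[i])
    return out, start, n_missing


rng = random.Random(0)
visual = [[1.0, 1.0] for _ in range(150)]   # 150 dummy visual frames
degraded, start, n_missing = zero_out_block(visual, lp=4, rng=rng)
print(start, n_missing)
```

During training, `sample_lp` would supply a new LP for every batch, whereas testing would call `zero_out_block` with one of the fixed LP values.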

The iLAVSE models trained with the 11 different LPRs are denoted as iLAVSE(LPRn), where n = 0, …, 10. The training set of iLAVSE(LPR0) did not contain missing visual data. A larger value of n in LPRn indicates a more severe low-quality visual data condition. The results are presented in Fig. 17, where the x-axis represents the LP value used for testing. The results in the figure show that without involving low-quality visual data in training (iLAVSE(LPR0)), the performance drops rapidly when visual data loss occurs in the testing data. The PESQ and STOI scores are even worse than those of Noisy and AOSE(DP). On the other hand, the iLAVSE models trained with low-quality visual data (even with low LPRs) are robust against all LP testing conditions. When the LP of the testing data is very high, the performance of iLAVSE converges to that of AOSE(DP), which shows that the benefit from visual information becomes negligible.

(a) PESQ.
(b) STOI.
(c) PESQ zoom in.
(d) PESQ zoom in.
Figure 17: The PESQ and STOI scores of iLAVSE trained with different LPRs and tested on specific LP conditions.

5 Conclusions

The proposed iLAVSE system includes three stages: a data preprocessing stage, a CRNN-based AVSE stage, and a data reconstruction stage. The preprocessing stage uses a CRQ module and an AE module to extract a compact latent representation as the visual input to the AVSE stage. In our experiments, instead of sending the original visual image to the AVSE stage, we can notably reduce the input size to about 0.33% ($\approx 1/307.2$) of the original visual image without reducing the PESQ and STOI scores. We addressed the audio-visual asynchronization and low-quality visual data issues using a data augmentation scheme and a zero-pad training approach, respectively. Experimental results showed that iLAVSE can effectively deal with the three practical issues and provide better SE performance than AOSE and related AVSE systems. Therefore, the results confirmed that the proposed iLAVSE system is robust against unfavorable conditions and can be suitably implemented in real-world applications. In the future, we will incorporate other neural network architectures, objective functions, and compression techniques Puzicha et al. (2000); Patil and Jondhale; Celebi (2011) into the proposed system. Meanwhile, we plan to further use the complementary information provided by visual data, combined with self-supervised and meta learning, to improve the applicability of iLAVSE.


This work was supported by the Ministry of Science and Technology [109-2221-E-001-016-, 109-2634-F-008-006-, 109-2218-E-011-010-].


  • A. Adeel, M. Gogate, A. Hussain, and W. M. Whitmer (2019) Lip-reading driven deep learning approach for speech enhancement. IEEE Transactions on Emerging Topics in Computational Intelligence, pp. 1–10. Cited by: §1.
  • B. Atal and M. Schroeder (1979) Predictive coding of speech signals and subjective error criteria. IEEE Transactions on Acoustics, Speech, and Signal Processing 27 (3), pp. 247–254. Cited by: §1.
  • P. Campolucci, A. Uncini, F. Piazza, and B. D. Rao (1999) On-line learning algorithms for locally recurrent neural networks. IEEE Transactions on Neural Networks 10 (2), pp. 253–271. Cited by: §1.
  • E. J. Candès, X. Li, Y. Ma, and J. Wright (2011) Robust principal component analysis?. Journal of the ACM 58 (3), pp. 1–37. Cited by: §1.
  • G. Carbajal, R. Serizel, E. Vincent, and E. Humbert (2020) Joint nn-supported multichannel reduction of acoustic echo, reverberation and noise. Cited by: §1.
  • M. E. Celebi (2011) Improving the performance of k-means for color quantization. Image and Vision Computing 29 (4), pp. 260–271. Cited by: §5.
  • F. Chen, Y. Hu, and M. Yuan (2015) Evaluation of noise reduction methods for sentence recognition by mandarin-speaking cochlear implant listeners. Ear and hearing 36 (1), pp. 61–71. Cited by: §1.
  • J. Chen, J. Benesty, Y. A. Huang, and E. J. Diethorn (2008) Fundamentals of noise reduction. In Springer Handbook of Speech Processing, pp. 843–872. Cited by: §1.
  • [9] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks. In Proc. Interspeech 2015, Cited by: §1.
  • Y. Chin, J. Wang, C. Huang, K. Wang, and C. Wu (2017) Speaker identification using discriminative features and sparse representation. IEEE Transactions on Information Forensics and Security 12 (8), pp. 1979–1987. Cited by: §1.
  • [11] S. Chuang, Y. Tsao, C. Lo, and H. Wang Lite audio-visual speech enhancement. In Proc. Interspeech 2020, Cited by: §1, §1, 0(b), §2.1, §3.2.1, §4.2.1.
  • [12] J. Eggert and E. Korner Sparse coding and nmf. In Proc. IJCNN 2004, Cited by: §1.
  • [13] A. El-Solh, A. Cuhadar, and R. A. Goubran Evaluation of speech enhancement techniques for speaker identification in noisy environments. In Proc. ISM 2007, Cited by: §1.
  • Y. Ephraim (1992) Statistical-model-based speech enhancement systems. Proceedings of the IEEE 80 (10), pp. 1526–1555. Cited by: §1.
  • A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics 37 (4), pp. 1–11. Cited by: §2.1.
  • [16] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proc. ICASSP 2015, Cited by: §1.
  • [17] R. Frazier, S. Samsam, L. Braida, and A. Oppenheim Enhancement of speech by adaptive filtering. In Proc. ICASSP 1976, Cited by: §1.
  • [18] S. Fu, T. Hu, Y. Tsao, and X. Lu Complex spectrogram enhancement by convolutional neural network with multi-metrics learning. In Proc. MLSP 2017, Cited by: §1.
  • [19] A. Gabbay, A. Shamir, and S. Peleg Visual speech enhancement. In Proc. Interspeech 2018, Cited by: §2.1.
  • R. Gu, S. Zhang, Y. Xu, L. Chen, Y. Zou, and D. Yu (2020) Multi-modal multi-channel target speech separation. IEEE Journal of Selected Topics in Signal Processing 14 (3), pp. 530–541. Cited by: §1.
  • E. Hänsler and G. Schmidt (2006) Topics in acoustic echo and noise control: selected methods for the cancellation of acoustical echoes, the reduction of background noise, and speech processing. Springer Science & Business Media. Cited by: §1.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun Deep residual learning for image recognition. In Proc. CVPR 2016, Cited by: §1.
  • E. W. Healy, M. Delfarah, E. M. Johnson, and D. Wang (2019) A deep learning algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker and reverberation. The Journal of the Acoustical Society of America 145 (3), pp. 1378–1388. Cited by: §1.
  • [24] J. Hou, S. Wang, Y. Lai, J. Lin, Y. Tsao, H. Chang, and H. Wang Audio-visual speech enhancement using deep neural networks. In Proc. APSIPA 2016, Cited by: §2.1.
  • J. Hou, S. Wang, Y. Lai, Y. Tsao, H. Chang, and H. Wang (2018) Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2), pp. 117–128. Cited by: §1, 0(a), §2.1, §4.2.1.
  • [26] Y. Hsu, Y. Lin, S. Fu, Y. Tsao, and T. Kuo A study on speech enhancement using exponent-only floating point quantized neural network (eofp-qnn). In Proc. SLT 2018, Cited by: §1, §2.2.
  • G. Hu (2004) 100 nonspeech environmental sounds. Note: Available: Cited by: §4.1.1.
  • [28] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement. In Proc. Interspeech 2020, Cited by: §1.
  • [29] Y. Hu and P. C. Loizou A subspace approach for enhancing speech corrupted by colored noise. In Proc. ICASSP 2002, Cited by: §1.
  • M. Huang (2005) Development of taiwan mandarin hearing in noise test. Department of speech language pathology and audiology, National Taipei University of Nursing and Health Science. Cited by: §4.1.1.
  • [31] P. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson Singing-voice separation from monaural recordings using robust principal component analysis. In Proc. ICASSP 2012, Cited by: §1.
  • [32] E. Ideli, B. Sharpe, I. V. Bajić, and R. G. Vaughan Visually assisted time-domain speech enhancement. In Proc. GlobalSIP 2019, Cited by: §1.
  • Institute of Electrical and Electronics Engineers (1985) IEEE standard for binary floating-point arithmetic. ANSI/IEEE Std 754-1985. Cited by: §2.2.
  • [34] M. L. Iuzzolino and K. Koishida AV (se) 2: audio-visual squeeze-excite speech enhancement. In Proc. ICASSP 2020, Cited by: §1.
  • [35] J. Kim, M. El-Khamy, and J. Lee T-gsa: transformer with gaussian-weighted self-attention for speech enhancement. In Proc. ICASSP 2020, Cited by: §1.
  • D. E. King (2009) Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10, pp. 1755–1758. Cited by: §4.1.2.
  • [37] D. P. Kingma and J. Ba Adam: a method for stochastic optimization. In Proc. ICLR 2015, Cited by: §4.
  • [38] K. Kinoshita, M. Delcroix, A. Ogawa, and T. Nakatani Text-informed speech enhancement with deep neural networks. In Proc. Interspeech 2015, Cited by: §1.
  • [39] U. Kjems and J. Jensen Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement. In Proc. EUSIPCO 2012, Cited by: §1.
  • M. Kolbæk, Z. Tan, and J. Jensen (2017) Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM Transactions on Audio, Speech and Language Processing 25 (1), pp. 153–167. Cited by: §1.
  • Y. Lai, F. Chen, S. Wang, X. Lu, Y. Tsao, and C. Lee (2016) A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation. IEEE Transactions on Biomedical Engineering 64 (7), pp. 1568–1578. Cited by: §1.
  • H. Levit (2001) Noise reduction in hearing aids: an overview. J. Rehabil. Res. Develop. 38 (1), pp. 111–121. Cited by: §1.
  • J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong (2015) Robust automatic speech recognition: a bridge to practical applications. Academic Press. Cited by: §1.
  • J. Li, S. Sakamoto, S. Hongo, M. Akagi, and Y. Suzuki (2011a) Two-stage binaural speech enhancement with wiener filter for high-quality speech communication. Speech Communication 53 (5), pp. 677–689. Cited by: §1.
  • J. Li, L. Yang, J. Zhang, Y. Yan, Y. Hu, M. Akagi, and P. C. Loizou (2011b) Comparative intelligibility investigation of single-channel noise-reduction algorithms for chinese, japanese, and english. The Journal of the Acoustical Society of America 129 (5), pp. 3291–3301. Cited by: §1.
  • [46] D. Liu, P. Smaragdis, and M. Kim Experiments on deep learning for speech denoising. In Proc. Interspeech 2014, Cited by: §1.
  • T. Lotter and P. Vary (2005) Speech enhancement by MAP spectral amplitude estimation using a super-gaussian speech model. EURASIP Journal on Advances in Signal Processing 2005 (7), pp. 354850. Cited by: §1.
  • [48] X. Lu, Y. Tsao, S. Matsuda, and C. Hori Speech enhancement based on deep denoising autoencoder. In Proc. Interspeech 2013, Cited by: §1.
  • [49] Y. Lu, C. Liao, X. Lu, J. Hung, and Y. Tsao Incorporating broad phonetic information for speech enhancement. In Proc. Interspeech 2020, Cited by: §3.1, §4.1.2.
  • [50] Y. Luo and N. Mesgarani Tasnet: time-domain audio separation network for real-time, single-channel speech separation. In Proc. ICASSP 2018, Cited by: §1.
  • J. Makhoul (1975) Linear prediction: a tutorial review. Proceedings of the IEEE 63 (4), pp. 561–580. Cited by: §1.
  • R. McAulay and M. Malpass (1980) Speech enhancement using a soft-decision noise suppression filter. IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (2), pp. 137–145. Cited by: §1.
  • D. Michelsanti, Z. Tan, S. Sigurdsson, and J. Jensen (2019) Deep-learning-based audio-visual speech enhancement in presence of lombard effect. Speech Communication 115, pp. 38–50. Cited by: §1.
  • D. Michelsanti, Z. Tan, S. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen (2020) An overview of deep-learning-based audio-visual speech enhancement and separation. arXiv preprint arXiv:2008.09586. Cited by: §1.
  • [55] D. Michelsanti and Z. Tan Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification. In Proc. Interspeech 2017, Cited by: §1.
  • N. Mohammadiha, P. Smaragdis, and A. Leijon (2013) Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Transactions on Audio, Speech, and Language Processing 21 (10), pp. 2140–2151. Cited by: §1.
  • A. Pandey and D. Wang (2019) A new framework for cnn-based speech enhancement in the time domain. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (7), pp. 1179–1188. Cited by: §1.
  • [58] S. Pascual, A. Bonafonte, and J. Serrà SEGAN: speech enhancement generative adversarial network. In Proc. Interspeech 2017, Cited by: §1.
  • [59] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. NIPS 2019, Cited by: §4.
  • [60] R. Patil and K. Jondhale Edge based technique to estimate number of clusters in k-means color image segmentation. In Proc. ICCSIT 2010, Cited by: §5.
  • [61] P. Plantinga, D. Bagchi, and E. Fosler-Lussier Phonetic feedback for speech enhancement with and without parallel speech data. In Proc. ICASSP 2020, Cited by: §1.
  • J. Puzicha, M. Held, J. Ketterer, J. M. Buhmann, and D. W. Fellner (2000) On spatial quantization of color images. IEEE Transactions on Image Processing 9 (4), pp. 666–682. Cited by: §5.
  • [63] J. Qi, H. Hu, Y. Wang, C. H. Yang, S. M. Siniscalchi, and C. Lee Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement. In Proc. Interspeech 2020, Cited by: §1.
  • T. F. Quatieri and R. J. McAulay (1992) Shape invariant time-scale and pitch modification of speech. IEEE Transactions on Signal Processing 40 (3), pp. 497–510. Cited by: §1.
  • L. Rabiner and B. Juang (1986) An introduction to hidden markov models. IEEE ASSP Magazine 3 (1), pp. 4–16. Cited by: §1.
  • A. Rezayee and S. Gazor (2001) An adaptive klt approach for speech enhancement. IEEE Transactions on Speech and Audio Processing 9 (2), pp. 87–95. Cited by: §1.
  • [67] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In Proc. ICASSP 2001, Cited by: §4.
  • [68] O. Ronneberger, P. Fischer, and T. Brox U-net: convolutional networks for biomedical image segmentation. In Proc. MICCAI 2015, Cited by: §1.
  • M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud (2020) Audio-visual speech enhancement using conditional variational auto-encoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1788–1800. Cited by: §2.1.
  • [70] P. Scalart et al. Speech enhancement based on a priori signal to noise estimation. In Proc. ICASSP 1996, Cited by: §1.
  • S. Suhadi, C. Last, and T. Fingscheidt (2010) A data-driven approach to a priori snr estimation. IEEE Transactions on Audio, Speech, and Language Processing 19 (1), pp. 186–195. Cited by: §1.
  • [72] L. Sun, J. Du, L. Dai, and C. Lee Multiple-target deep learning for lstm-rnn based speech enhancement. In Proc. HSCMA 2017, Cited by: §1.
  • C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2011) An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 19 (7), pp. 2125–2136. Cited by: §4.
  • [74] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin Attention is all you need. In Proc. NIPS 2017, Cited by: §1.
  • T. Venema (2006) Compression for clinicians, chapter 7. The many faces of compression. Thomson Delmar Learning. Cited by: §1.
  • E. Vincent, T. Virtanen, and S. Gannot (2018) Audio source separation and speech enhancement. John Wiley & Sons. Cited by: §1.
  • D. Wang and J. Chen (2018) Supervised speech separation based on deep learning: an overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (10), pp. 1702–1726. Cited by: §1.
  • J. Wang, Y. Lee, C. Lin, S. Wang, C. Shih, and C. Wu (2016) Compressive sensing-based speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), pp. 2122–2131. Cited by: §1.
  • [79] S. Wang, W. Li, S. M. Siniscalchi, and C. Lee A cross-task transfer learning approach to adapting deep speech enhancement models to unseen background noise using paired senone classifiers. In Proc. ICASSP 2020, Cited by: §1.
  • [80] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller Speech enhancement with lstm recurrent neural networks and its application to noise-robust asr. In Proc. LVA/ICA 2015, Cited by: §1.
  • [81] F. Weninger, F. Eyben, and B. Schuller Single-channel speech separation with memory-enhanced recurrent neural networks. In Proc. ICASSP 2014, Cited by: §1.
  • D. S. Williamson, Y. Wang, and D. Wang (2015) Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and language Processing 24 (3), pp. 483–492. Cited by: §1.
  • [83] J. Wu, Y. Xu, S. Zhang, L. Chen, M. Yu, L. Xie, and D. Yu Time domain audio visual speech separation. In Proc. ASRU 2019, Cited by: §1.
  • [84] S. Wu, G. Li, F. Chen, and L. Shi Training and inference with integers in deep neural networks. Cited by: §1.
  • B. Xia and C. Bao (2014) Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Communication 60, pp. 13–29. Cited by: §1.
  • [86] S. Xu and E. Fosler-Lussier Spatial and channel attention based convolutional neural networks for modeling noisy speech. In Proc. ICASSP 2019, Cited by: §1.
  • Y. Xu, J. Du, L. Dai, and C. Lee (2015) A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing 23 (1), pp. 7–19. Cited by: §1.
  • [88] C. Yang, J. Qi, P. Chen, X. Ma, and C. Lee Characterizing speech adversarial examples using self-attention u-net enhancement. In Proc. ICASSP 2020, Cited by: §1.
  • C. Yu, K. Hung, S. Wang, Y. Tsao, and J. Hung (2020) Time-domain multi-modal bone/air conducted speech enhancement. IEEE Signal Processing Letters 27, pp. 1035–1039. Cited by: §1.
  • X. Zhang and D. Wang (2016) A deep ensemble learning method for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (5), pp. 967–977. Cited by: §1.
  • [91] Y. Zhang, Q. Duan, Y. Liao, J. Liu, R. Wu, and B. Xie Research on speech enhancement algorithm based on sa-unet. In Proc. ICMCCE 2019, Cited by: §1.
  • N. Zheng and X. Zhang (2018) Phase-aware speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (1), pp. 63–76. Cited by: §1.