Room acoustics can degrade audio quality in recordings produced outside of professional recording studios. Acoustics in everyday environments such as the home commonly produce undesirable effects, including excessive coloration from early reflections and masking from late reflections. One may wish to mitigate these effects by dereverberating the speech signals. This type of speech enhancement can be applied in a post-processing stage, where the full signal has been recorded in advance, or in real time, when low latency is required. This technical report addresses the former case.
We focus on the single-channel case, where only one microphone was used to record the speech signal. The single-channel case is challenging because, unlike in the multi-channel case, a reverberation reduction system cannot exploit the differences in arrival times of reflections across microphones. In the recent REVERB challenge, only one of the eleven submitted single-channel systems both reduced the perceived amount of reverberation and improved overall quality.
The task of reverberation reduction has been framed in multiple ways over the past five decades. Habets describes three trends. The first involves modeling the acoustic system and using this information to equalize or filter the reverberant signal. The second approach is similar to denoising, treating the source and reverberation signals as independent. The third directly estimates the dry signal without attempting to model the room acoustics.
The techniques in the first trend, which involves modeling the acoustic system, provide clean results under specific conditions and can provide the ability to control the Room Impulse Response (RIR). However, they can be challenging to develop even when the RIR is known. Though the RIR is a linear transformation, in realistic room acoustics an exact inverse is often either unstable or acausal. Exact equalization requires very long filters, which increases computational complexity and numerical instability. In some cases, computing the inverse is intractable [4, 5]. Another issue is that imperfections in the RIR measurement affect the result. Faller describes how inversion-based filtering only works in the “sweet spot”. Moving away from the “sweet spot” to other positions in the room degrades the quality. Furthermore, changes in object positions in the room, temperature, or humidity alter the RIR, which can be described as a weakly non-stationary process. The RIR estimate then needs to be updated across the signal instead of being computed only once. For these reasons, techniques that remove only the most problematic components of the reverberation while leaving the rest intact have often been chosen for practical use.
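The instability of exact inversion can be seen even for a two-tap filter. The following numpy sketch (a toy example of ours, not taken from the report) shows that an RIR whose transfer function has a zero outside the unit circle admits no stable causal inverse:

```python
import numpy as np

# Toy two-tap "RIR": a direct path followed by a stronger reflection.
# Its transfer function H(z) = 1 + 1.5 z^{-1} has a zero at z = -1.5.
h = np.array([1.0, 1.5])

# Zeros of the transfer polynomial (coefficients in descending powers of z).
zeros = np.roots(h)

# A zero outside the unit circle means H(z) is non-minimum-phase, so the
# exact causal inverse 1/H(z) is unstable.
print(np.any(np.abs(zeros) > 1.0))  # True

# The taps of the causal inverse satisfy g[0] = 1/h[0] and
# g[n] = -h[1] * g[n-1] / h[0], i.e., g[n] = (-1.5)^n here: they diverge.
g = np.zeros(20)
g[0] = 1.0 / h[0]
for n in range(1, 20):
    g[n] = -h[1] * g[n - 1] / h[0]
print(abs(g[-1]) > 1e3)  # True: the inverse filter blows up
```

A stable inverse for such a filter must be acausal or regularized, which is one reason exact equalization leads to the very long filters mentioned above.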
Approaches involving denoising or source separation do not necessarily generalize to the dereverberation task. Generalization is most problematic for models in which the target and noise signals are assumed independent, for example, time-frequency bin masking. Masking works best when bins belong either to the signal or to the noise, with little overlap between the two. A reverberant signal, though, is a sum of delayed and filtered versions of the original signal, which are highly correlated and likely present in the same time-frequency bins. Another assumption commonly made for denoising that does not apply to dereverberation is that every time frame can be treated separately. This assumption does not generalize well because RIRs often last longer than one Short-Time Fourier Transform (STFT) window.
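The frame-independence problem can be made concrete with a small numpy sketch (our toy example): a synthetic half-second RIR smears a single dry sample across many 32 ms STFT frames, so no single frame contains all the information needed to undo the reverberation.

```python
import numpy as np

sr = 16000
window = 512  # one 32 ms STFT window at 16 kHz

# Toy exponentially decaying RIR lasting 0.5 s, far longer than one window.
t = np.arange(int(0.5 * sr))
rir = np.exp(-t / (0.1 * sr)) * np.random.default_rng(0).standard_normal(t.size)

# A single-sample "dry" impulse gets smeared across the RIR's full length.
dry = np.zeros(sr)
dry[0] = 1.0
wet = np.convolve(dry, rir)[: dry.size]

# Number of STFT frames that receive energy from that one dry sample.
frames_touched = int(np.ceil(np.count_nonzero(wet) / window))
print(frames_touched > 1)  # True: the RIR outlasts a single frame
```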
Direct estimation approaches focus instead on estimating a clean signal given a reverberant input. They do not directly estimate the room acoustics or separate the dry and reverberant signals. Direct estimation can produce good results when the model is able to represent the distribution of the target outputs, i.e., speech signals. However, the nature of the direct estimation approach means that some useful information about the room acoustics goes unused, which could hurt the model’s ability, for example, to adapt to unseen room conditions.
In this report, we conduct preparatory work for developing a joint model, which combines the direct estimation approach with estimation of the RIR. A joint model, which shares parameters between these two tasks, can produce higher accuracy than separate models. We exploit the fact that we have access both to pairs of dry and reverberant signals and to the RIRs used to generate these pairs via convolution. Access to the target dry speech and RIR signals allows us to train both components of the model in a supervised manner. In this report, we explore separate models for each task and provide an overview of how these models can be combined into a single one.
Section 2 covers related work. In section 3, we provide an overview of the joint training technique. We also describe the models we use separately for the tasks of dry speech estimation and RIR estimation. In section 4, we describe our experiments along with the dataset and pre-processing steps. Finally, section 5 concludes the report.
2 Related work
Deep neural networks (DNNs) have been investigated for the reverberation reduction task. Fully connected models that learn and process the signal frame by frame are proposed in [8, 9]. The first model outputs a magnitude STFT, while the second outputs both real and imaginary components. Williamson et al. predict one frame at a time from an input of multiple consecutive frames. Each input frame includes a set of different time-frequency features extracted from the audio.
Ernst et al. use a U-net to estimate dry speech directly from reverberant and noisy speech. The authors predict the magnitude spectrogram and use the reverberant phase to reconstruct the audio signal. Their results are state-of-the-art. In addition to supervised training, the authors experiment with Generative Adversarial Networks (GANs) to improve over their supervised U-net model, but performance decreases slightly.
GANs for the dereverberation task have also been proposed by Li et al. The authors experiment with supervised models and further train the best-performing one, a Long Short-Term Memory (LSTM) network, using adversarial training. The downstream task in that work, however, is automatic character recognition rather than audio signal enhancement.
Engel et al. introduce a DSP-based DNN that includes differentiable oscillators, filters, and reverberation components, allowing for separate control over parameters such as pitch and loudness. The model has an encoder-decoder structure and shows potential for encoding the dry component of the signal while discarding the room reverberation, then reconstructing a dry signal, optionally convolving it with a different RIR.
We also mention relevant models designed for related tasks such as speech denoising. Tan et al. introduce the convolutional recurrent neural network (CRNN) model. The CRNN combines the feature extraction capability of convolutional filters with the sequential modeling of an LSTM in an encoder-decoder setup. Huang et al. address the speech separation task, using joint training to improve accuracy by estimating both the mask and the target speech signal. Relevant work also includes non-negative matrix factorization approaches, which dereverberate magnitude STFTs separately over every frequency bin. This technique greatly reduces the number of parameters and inspires our approach to RIR prediction.
3.1 Joint training
Figure 1 illustrates our joint training approach, which involves estimating both the dry signal and the RIR. Our assumption is that sharing some of the parameters used for the two tasks can improve accuracy compared to what the two models would reach when trained separately. The model would learn to extract the structure of the RIR from the reverberant signal, which would also make it better able to predict the dry signal. In addition to the dry speech and RIR training targets, we introduce a third target: the RIR estimate output by the model is convolved with the dry signal to produce a reconstruction of the reverberant signal. Given that we synthesize reverberant speech by convolving the dry signal and the RIR, all three targets are known, and we can train the joint model in a supervised manner with three weighted loss components. In the following sections, we describe the models we use separately for the two tasks. Once the separate models produce sufficiently accurate results, they can be combined into a joint model, with the individual loss terms summed in a weighted manner.
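As a concrete sketch, the three weighted loss components could look as follows in numpy (the mean-squared-error form and the weights are our illustrative choices, not taken from the report):

```python
import numpy as np

# Hypothetical three-term supervised loss: dry-speech error, RIR error,
# and a reconstruction error obtained by convolving the RIR estimate with
# the dry signal, as described in the joint training approach above.
def joint_loss(dry_hat, rir_hat, dry, rir, wet,
               w_dry=1.0, w_rir=1.0, w_rec=1.0):
    # Target 1: dry-speech estimate vs. ground-truth dry signal.
    l_dry = np.mean((dry_hat - dry) ** 2)
    # Target 2: RIR estimate vs. ground-truth RIR.
    l_rir = np.mean((rir_hat - rir) ** 2)
    # Target 3: reconstruct the reverberant input by convolving the
    # estimated RIR with the dry signal (here the ground truth).
    rec = np.convolve(dry, rir_hat)[: wet.size]
    l_rec = np.mean((rec - wet) ** 2)
    return w_dry * l_dry + w_rir * l_rir + w_rec * l_rec

# With perfect estimates, all three terms vanish.
rng = np.random.default_rng(0)
dry = rng.standard_normal(1000)
rir = rng.standard_normal(64)
wet = np.convolve(dry, rir)[: dry.size]
print(joint_loss(dry, rir, dry, rir, wet))  # 0.0
```

In an actual training loop, these would be differentiable tensor operations so that gradients flow through all three terms into the shared parameters.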
3.2 Direct estimation
We first compare two existing models for estimating the dry speech directly from the reverberant signal. We modify the LSTM proposed in
by using a Bi-directional Gated Recurrent Unit (Bi-GRU)
instead. The model includes residual connections between layers. Given that we use a bi-directional model, we use half the hidden dimension used by Wang et al., i.e., 380, which yields the original 760 dimensions when the two directions are combined. Unlike the authors, we do not use exponential learning-rate decay and rely solely on the step-size adaptation of the Adam optimizer. The target is also different, as we estimate the log-STFT instead of recognizing characters. We empirically find that the training parameters used by Wang et al. produce the best results for our model.
For comparison, we also implement the U-net introduced by Ernst et al.
3.3 Room impulse response estimation
We then explore using a fully convolutional model to estimate the RIR from the reverberant signal. We again choose to operate in the magnitude-frequency domain. We introduce a convolutional model designed to capture the time-invariant filter structure across a few seconds of the speech signal. It includes filters that span multiple time frames but only one frequency bin at a time, under the assumption that the reverberation can be isolated per frequency because the reverberant signal is a linear sum of delayed and filtered copies of the dry signal. The structure is as follows, with the first two values indicating the filter size (time and frequency axes) and the third the number of filters: (9, 1, 16), (14, 1, 32), (27, 1, 64), (27, 1, 32), (27, 1, 16), (28, 1, 4), (187, 1, 126). Each convolution is followed by an exponential linear unit activation, except for the final one, which is followed by a rectified linear unit activation. After the final layer, the filter axis with 126 dimensions becomes the time axis of the RIR. We apply this modification because the time-varying structures of the speech and RIR signals are unrelated.
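The per-frequency constraint can be illustrated with a small numpy sketch (ours): a kernel of shape (k_t, 1) is equivalent to convolving each frequency bin of the magnitude STFT independently along the time axis, leaving the frequency resolution untouched.

```python
import numpy as np

def conv_time_only(spec, kernel):
    """Convolve a (time, freq) magnitude STFT with a time-axis-only filter.

    Equivalent to a 2-D convolution with a (k_t, 1) kernel: each frequency
    bin is filtered independently along time.
    """
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), axis=0, arr=spec)

spec = np.random.default_rng(0).random((313, 257))  # ~5 s of 32 ms frames
out = conv_time_only(spec, np.ones(9) / 9)          # 9-frame smoothing filter
print(out.shape)  # (313, 257): frequency resolution is preserved
```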
4 Experimental results
4.1 Dataset
For RIRs, we use publicly available recordings downloaded using Kaldi scripts (https://github.com/kaldi-asr/kaldi/blob/master/egs/aspire/s5/local/multi_condition/rirs/). Data sources consist of the Aachen Impulse Response Database, the C4DM Room Impulse Response Dataset, the RWCP Sound Scene Database in Real Acoustical Environments, the REVERB Challenge dataset’s RIRs (omitting the noise signals), and the PORI Concert Hall Impulse Responses. Combined, these provide a total of 1069 RIRs. Additionally, Steinmetz kindly provided RIRs used for research on the NeuralReverberator (https://www.christiansteinmetz.com/projects-blog/neuralreverberator). This set includes 693 RIRs from a variety of sources.
We split the RIRs into 1362, 200, and 200 samples for training, validation, and testing, respectively. We group together signals that are too similar, e.g., those involving the same source in the same room, the same receiver in the same room, or the same RIR recorded with different microphones or microphone rotations. When the configurations were unknown, we grouped RIRs by the codes found in the file names. We assign groups larger than 20 to the training set to keep the validation and test sets more balanced. We also limit the size of each group to 100 and discard the remaining RIRs. This data preparation results in a total of 823 groups distributed across the training, validation, and test sets.
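A simplified version of this group-aware split can be sketched as follows (a hypothetical helper of ours; the report targets fixed set sizes of 1362/200/200, whereas this sketch simply splits the held-out groups evenly between validation and test):

```python
import random

def split_groups(groups, cap=100, big=20, seed=0):
    """Split {group_name: [rir_ids]} into train/val/test, keeping each
    group in exactly one set, capping group size, and sending large
    groups to training so validation and test stay balanced."""
    rng = random.Random(seed)
    train, heldout = [], []
    for members in groups.values():
        members = members[:cap]        # discard RIRs beyond the cap
        if len(members) > big:
            train.extend(members)      # large groups: training only
        else:
            heldout.append(members)    # small groups: eligible for val/test
    rng.shuffle(heldout)
    half = len(heldout) // 2
    val = [r for g in heldout[:half] for r in g]
    test = [r for g in heldout[half:] for r in g]
    return train, val, test

groups = {f"room{i}": [f"room{i}_rir{j}" for j in range(5)] for i in range(10)}
groups["hall"] = [f"hall_rir{j}" for j in range(30)]
train, val, test = split_groups(groups)
print(len(train), len(val), len(test))  # 30 25 25
```

Keeping whole groups in one set prevents near-duplicate RIRs from leaking between training and evaluation.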
4.2 Format and pre-processing
We process the input signals at a sample rate of 16000 Hz. We convolve each dry speech signal with multiple randomly selected RIRs. We align the dry and reverberant speech in time by applying the RIR delay to the dry speech. We remove the near-silent part at the beginning of the signal. Finally, we truncate or pad the speech signals to a length of 5 seconds. We also zero-pad the RIRs to a minimum length of 2 seconds but do not truncate longer ones.
We compute the STFT with a frame size of 32 ms and a hop length of 16 ms. We normalize the audio by dividing each STFT by its maximum.
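At a 16 kHz sample rate, these settings correspond to a 512-sample window with a 256-sample hop. A minimal numpy sketch of the framing and maximum normalization follows (the Hann analysis window is our assumption; the report does not specify one):

```python
import numpy as np

sr, win, hop = 16000, 512, 256  # 16 kHz, 32 ms frames, 16 ms hop

def magnitude_stft(x):
    # Slice the signal into overlapping frames, window, and take |FFT|.
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])
    frames *= np.hanning(win)          # analysis window (our choice)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return spec / spec.max()           # normalize by the maximum magnitude

x = np.random.default_rng(0).standard_normal(5 * sr)  # 5 s of audio
spec = magnitude_stft(x)
print(spec.shape, spec.max())  # (311, 257) with peak exactly 1.0
```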
4.3 Current results
The U-net introduced by Ernst et al. produced the best results for the dry speech direct estimation task. The Bi-GRU produced too many artifacts to be usable in practice. We also explored fully connected models but found them less accurate than the U-net. For the RIR estimation task, the convolutional model output impulses that had the desired decay but were not precise enough. The convolutional model might not be powerful enough to capture the time-invariant filter from the speech signal. RIRs have an unusual structure that might be difficult for a DNN to predict: a short, loud impulse followed by a rapid decay. Signals commonly estimated with DNNs, such as speech or other time series, have distributions that remain relatively stable over time. To properly estimate an RIR, the DNN needs to concentrate most of the output energy into a small region.
We believe a time-domain approach can improve the precision of the U-net by including phase in the estimation. We also expect to improve RIR estimation by using a sequential model, such as an LSTM in the time-frequency domain or a CRNN in the time domain. The time alignment of the dry and reverberant speech, along with the removal of any silence at the beginning of the RIR signal, can also be refined for better precision.
5 Conclusion
In this technical report, we describe a joint modeling approach to estimating dry speech and the room impulse response from a reverberant speech input. We provide an overview of separate deep learning models for each task and describe how these can be combined in future work. We operate on magnitude-STFT data. We find that a magnitude-STFT U-net provides good results for direct speech estimation. For room impulse response estimation, we introduce a convolutional model and plan to improve it further by modeling the sequential nature of the data. Though there has been some prior work using representations that include phase, namely time-domain signals and complex STFTs, we leave such representations for future work.
-  C. Faller, “Modifying audio signals for reproduction with reduced room effect,” in 147th Audio Engineering Society Conv., 2019.
-  K. Kinoshita, M. Delcroix, S. Gannot, E. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj et al., “A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 7, 2016.
-  E. Habets, “Fifty years of reverberation reduction: From analog signal processing to machine learning,” in AES 60th Conf. on DREAMS, 2016.
-  S. Neely and J. Allen, “Invertibility of a room impulse response,” The Journal of the Acoustical Society of America, vol. 66, no. 1, pp. 165–169, 1979.
-  S. Cecchi, A. Carini, and S. Spors, “Room response equalization—a review,” Applied Sciences, vol. 8, no. 1, p. 16, 2018.
-  J. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Int. Conf. Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
-  O. Ernst, S. Chazan, S. Gannot, and J. Goldberger, “Speech dereverberation using fully convolutional networks,” in European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 390–394.
-  K. Han, Y. Wang, D. Wang, W. Woods, I. Merks, and T. Zhang, “Learning spectral mapping for speech dereverberation and denoising,” Trans. Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982–992, 2015.
-  D. Williamson and D. Wang, “Speech dereverberation and denoising using complex ratio masks,” in Int. Conf. Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5590–5594.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Int. Conf. Medical Image Computing and Computer-assisted Intervention. Springer, 2015, pp. 234–241.
-  C. Li, T. Wang, S. Xu, and B. Xu, “Single-channel speech dereverberation via generative adversarial training,” arXiv preprint arXiv:1806.09325, 2018.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  J. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable digital signal processing,” arXiv preprint arXiv:2001.04643, 2020.
-  K. Tan and D. Wang, “A convolutional recurrent neural network for real-time speech enhancement,” in Interspeech, 2018, pp. 3229–3233.
-  P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” Trans. Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.
-  N. Mohammadiha, P. Smaragdis, and S. Doclo, “Joint acoustic and spectral modeling for speech dereverberation using non-negative representations,” in Int. Conf. Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4410–4414.
-  J. Eaton, N. Gaubitch, A. Moore, and P. Naylor, “Estimation of room acoustic parameters: The ACE challenge,” Trans. Audio, Speech, and Language Processing, vol. 24, no. 10, pp. 1681–1693, 2016.
-  N. Bryan, “Data augmentation and deep convolutional neural networks for blind room acoustic parameter estimation,” arXiv preprint arXiv:1909.03642, 2019.
-  K. Wang, J. Zhang, S. Sun, Y. Wang, F. Xiang, and L. Xie, “Investigating generative adversarial networks based speech dereverberation for robust speech recognition,” arXiv preprint arXiv:1803.10132, 2018.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
-  V. Nair and G. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. Int. Conf. Machine Learning (ICML), 2010, pp. 807–814.
-  M. Jeub, M. Schafer, and P. Vary, “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in Int. Conf. Digital Signal Processing. IEEE, 2009, pp. 1–5.
-  R. Stewart and M. Sandler, “Database of omnidirectional and B-format room impulse responses,” in Int. Conf. Acoustics, Speech and Signal Processing. IEEE, 2010, pp. 165–168.
-  S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada, “Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition,” 2000.
-  K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas et al., “The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech,” in Workshop Applications Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2013, pp. 1–4.
-  J. Merimaa, T. Peltonen, and T. Lokki, “Concert hall impulse responses Pori, Finland: Reference,” Tech. Rep., 2005.
-  Y. Luo and N. Mesgarani, “Real-time single-channel dereverberation and separation with time-domain audio separation network,” in Interspeech, 2018, pp. 342–346.