FAST-RIR: Fast neural diffuse room impulse response generator

by Anton Ratnarajah, et al.
University of Maryland

We present a neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment. Our FAST-RIR takes rectangular room dimensions, listener and speaker positions, and reverberation time as inputs and generates specular and diffuse reflections for a given acoustic environment. Our FAST-RIR is capable of generating RIRs for a given input reverberation time with an average error of 0.02s. We evaluate our generated RIRs in automatic speech recognition (ASR) applications using Google Speech API, Microsoft Speech API, and Kaldi tools. We show that our proposed FAST-RIR with batch size 1 is 400 times faster than a state-of-the-art diffuse acoustic simulator (DAS) on a CPU and gives similar performance to DAS in ASR experiments. Our FAST-RIR is 12 times faster than an existing GPU-based RIR generator (gpuRIR). We show that our FAST-RIR outperforms gpuRIR by 2.5% in far-field ASR experiments.




Code Repositories


This is the official implementation of our neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment.


1 Introduction

Room impulse response (RIR) generators are used to simulate large-scale far-field speech training data [15, 16, 8]. A synthetic far-field speech training dataset is created by convolving clean speech with RIRs generated for different acoustic environments and adding background noise [16, 29]. The acoustic environment can be described using room geometry, speaker and listener positions, and room acoustic materials.

In recent years, an increasing number of RIR generators have been introduced to generate a realistic RIR for a given acoustic environment [1, 27, 30, 19]. Accurate RIR generators can generate RIRs with various acoustic effects (e.g., diffraction, scattering, early reflections, late reverberations) [18]. A limitation of accurate RIR generators is that they are computationally expensive, and the time taken to generate RIRs depends on the geometric complexity of the acoustic environment. Also, traditional RIR generators rely on the empirical Sabine formula [25] to generate RIRs with the expected reverberation time. Reverberation time ($T_{60}$) is the time required for the sound energy to decay by 60 decibels [17].
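As a brief illustration of the Sabine formula mentioned above, the sketch below estimates $T_{60}$ for a rectangular room from its volume and total absorption, using the common metric form $T_{60} \approx 0.161\,V/A$. The function name, the per-surface absorption coefficients, and the example room are illustrative assumptions, not values from this paper.

```python
def sabine_t60(length, width, height, absorption):
    """Estimate T60 (seconds) of a shoebox room with Sabine's formula.

    `absorption` maps each surface name to an absorption coefficient.
    The coefficients are hypothetical inputs chosen by the caller.
    """
    volume = length * width * height
    areas = {
        "floor": length * width,
        "ceiling": length * width,
        "wall_front": width * height,
        "wall_back": width * height,
        "wall_left": length * height,
        "wall_right": length * height,
    }
    # Total absorption A = sum of (surface area * absorption coefficient).
    total_absorption = sum(areas[name] * absorption[name] for name in areas)
    return 0.161 * volume / total_absorption


# Hypothetical room with the same absorption coefficient on every surface.
coeffs = {name: 0.3 for name in (
    "floor", "ceiling", "wall_front", "wall_back", "wall_left", "wall_right")}
t60 = sabine_t60(10.0, 7.0, 3.0, coeffs)
print(f"Estimated T60: {t60:.2f} s")
```

The formula needs only the room volume and surface absorption, which is why a single $T_{60}$ value can stand in for the room materials as an input parameter.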

With advancements in deep neural-network-based far-field speech processing, the demand for on-the-fly simulation of far-field speech training datasets with hundreds of thousands of room configurations similar to the testing environment is increasing [14, 3, 31, 13]. The CPU-based offline simulation of far-field speech with balanced distribution requires a lot of computation time and disk space [1, 30], thus it is not scalable for production-level ASR training. One strategy to improve the speed of RIR generation is parallelizing most of the stages in the existing RIR generators and making the algorithm compatible for running on GPUs [3, 9].

Main Contributions: We propose a neural-network-based fast diffuse room impulse response generator (FAST-RIR) that can be directly controlled using rectangular room dimensions, listener and speaker positions, and $T_{60}$. $T_{60}$ implicitly reflects the characteristics of the room materials such as the floor, ceiling, walls, and furniture. Our FAST-RIR takes a constant amount of time to generate an RIR for any given acoustic environment, and yields accurate $T_{60}$.

Our FAST-RIR architecture is trained to generate both specular and diffuse reflections for a given acoustic environment. Diffuse reflection is widely observed in real-world environments, and modeling it is important for generating accurate RIRs. We show that our FAST-RIR can generate RIRs 400 times faster than the state-of-the-art diffuse acoustic simulator (DAS) [30] on a single CPU and 12 times faster than gpuRIR [3] on a single GPU. The RIRs generated using our FAST-RIR perform similarly to the RIRs generated using the DAS and outperform gpuRIR by up to 2.5% in far-field automatic speech recognition (ASR) experiments. Our FAST-RIR can generate RIRs for a given input $T_{60}$ with an average error of 0.02s.

2 Related Works

The RIR generators developed over the decades can be divided into three groups: wave-based, ray-based, and neural-network-based techniques. Wave-based techniques are designed to give the most accurate results by solving wave equations [22, 26]. However, the wave-based techniques are only feasible for generating RIRs for less complicated scenes at low frequencies. Ray-based techniques are less accurate than the wave-based approach because the wave nature of the sound is neglected. The image method [1] and diffuse acoustic simulators [30] are commonly used ray-based methods in speech-related tasks. The image method only models specular reflections while diffuse acoustic simulators accurately model both diffuse and specular reflections.

Recently, neural-network-based RIR generators [23, 24, 28] have been introduced to generate RIRs for a given acoustic environment. IR-GAN [23] is a GAN-based RIR generator that is trained on a real-world RIR dataset to generate realistic RIRs. However, IR-GAN does not take conventional environmental parameters as input by design, making it less configurable than traditional RIR generators.

Figure 1: The architecture of our FAST-RIR. Our Generator network takes acoustic environment details as input and generates corresponding RIR as output. Our Discriminator network discriminates between the generated RIR and the ground truth RIR for the given acoustic environment during training.

3 Our Approach

To generate RIRs for a given acoustic environment, we propose a one-dimensional conditional generator network. Our generator network takes room geometry, listener and speaker positions, and $T_{60}$ as inputs, which are the common inputs used by all traditional RIR generators, and generates RIRs as raw-waveform audio. Our FAST-RIR generates RIRs of length 4096 samples at a 16 kHz sampling rate.

3.1 Modified Conditional GAN

We propose a modified conditional GAN architecture to precisely generate an RIR for a given condition. A GAN [11] consists of a generator ($G$) and a discriminator ($D$) network that are trained alternately to compete with each other. The network $G$ is trained to learn a mapping from noise vector samples ($z$) drawn from a prior distribution $p_z$ to the data distribution $p_{data}$. The network $G$ is optimized to produce samples that are difficult for $D$ to distinguish from real samples ($x$) taken from the true data distribution, while $D$ is optimized to differentiate samples generated by $G$ from real samples. The networks $G$ and $D$ are trained to optimize the following two-player min-max game with value function $V(D, G)$.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (1)$$

Conditional GAN (CGAN) [20, 10] is an extended version of GAN where both the generator and discriminator networks are conditioned on additional information $y$. The generator network in CGAN is conditioned on the random noise $z$ and $y$. The vector $z$ is used to generate multiple different samples satisfying the given condition $y$. In our work, we train our FAST-RIR to generate a single sample precisely for a given condition. Our FAST-RIR is a modified CGAN architecture where the generator network is only conditioned on $y$.

3.2 FAST-RIR

We combine the rectangular room dimensions, the listener location, and the source location, each represented in 3D Cartesian coordinates, together with $T_{60}$ into a ten-dimensional vector embedding. We normalize the vector embedding within the range -1.2 to 1.2 using the largest room dimension in the training dataset.
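The ten-dimensional embedding can be sketched as three 3D vectors plus one scalar. The text does not spell out the exact normalization, so the affine map below (from [0, largest dimension] to [-1.2, 1.2], applied to every entry including $T_{60}$), the ordering of the entries, and the function name are assumptions made only for illustration. The value 11 m is the largest room length of the training set described in Section 3.2.5.

```python
import numpy as np

MAX_ROOM_DIM = 11.0  # largest room dimension (m) in the training dataset


def build_embedding(room_dims, listener_pos, source_pos, t60,
                    max_dim=MAX_ROOM_DIM):
    """Stack room size, listener, source, and T60 into a 10-D vector.

    Assumed normalization: values in [0, max_dim] map linearly to
    [-1.2, 1.2]. The paper's exact scheme may differ.
    """
    raw = np.concatenate([room_dims, listener_pos, source_pos, [t60]])
    raw = raw.astype(float)
    return 2.4 * raw / max_dim - 1.2


embedding = build_embedding(
    room_dims=[9.0, 7.0, 3.0],      # length, width, height in metres
    listener_pos=[2.0, 3.5, 1.5],   # hypothetical listener position
    source_pos=[6.0, 2.0, 1.7],     # hypothetical speaker position
    t60=0.45,                       # reverberation time in seconds
)
print(embedding.shape)
```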

For each embedding, we generate an RIR using DAS and use it as ground truth to train our network. Our objective function for the generator network consists of the modified CGAN error, the mean square error, and the $T_{60}$ error. The discriminator network is trained using the modified CGAN objective function.

3.2.1 Generator Modified CGAN Error

The generator is trained with a modified CGAN error (Equation 2) to generate RIRs that are difficult for the discriminator to differentiate from RIRs generated using DAS.


3.2.2 Mean Square Error (MSE)

We compare each sample of the RIR generated using our FAST-RIR with the corresponding sample of the RIR generated using DAS for each embedding to calculate the MSE (Equation 3).


3.2.3 $T_{60}$ Error

We generate RIRs using our FAST-RIR and calculate their $T_{60}$ using a method based on ISO 3382-1:2009. We compare the $T_{60}$ of each generated RIR with the $T_{60}$ given as input to the network in the embedding to obtain the $T_{60}$ error (Equation 4).
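One common way to estimate $T_{60}$ from an impulse response, in the spirit of ISO 3382, is Schroeder backward integration followed by a line fit to the decay curve. The sketch below fits the range from -5 dB to -35 dB and extrapolates to 60 dB. It is an illustrative assumption about how such a measurement can be done, not the authors' exact procedure, and the function name and parameters are hypothetical.

```python
import numpy as np


def estimate_t60(rir, fs=16000, decay_db=30.0):
    """Estimate T60 (seconds) from an RIR via Schroeder integration."""
    energy = np.asarray(rir, dtype=float) ** 2
    # Energy decay curve: energy remaining from each sample to the end.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)

    # Fit a line to the decay between -5 dB and -(5 + decay_db) dB.
    start = int(np.argmax(edc_db <= -5.0))
    end = int(np.argmax(edc_db <= -5.0 - decay_db))
    if end <= start:
        raise ValueError("RIR does not decay far enough for the fit")
    t = np.arange(len(energy)) / fs
    slope, _ = np.polyfit(t[start:end], edc_db[start:end], 1)

    # Extrapolate the fitted slope (dB per second) to a 60 dB decay.
    return -60.0 / slope


# Synthetic check: an exponential decay built to have T60 = 0.2 s.
fs = 16000
t = np.arange(4096) / fs
synthetic_rir = np.exp(-np.log(1000.0) * t / 0.2)
measured_t60 = estimate_t60(synthetic_rir, fs)
print(f"Measured T60: {measured_t60:.3f} s")
```

The per-RIR error is then the gap between this measured value and the $T_{60}$ entry of the input embedding.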


3.2.4 Full Objective

We train the generator and the discriminator alternately to minimize the generator objective function (Equation 5) and maximize the discriminator objective function (Equation 6). We control the relative importance of the MSE and the $T_{60}$ error using a separate weight for each term.
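One way to write the three generator terms and the two objectives, consistent with the descriptions in Sections 3.2.1 to 3.2.4, is sketched below. The notation is assumed for illustration: $\Gamma$ is the input embedding, $G_N$ and $D_N$ are the generator and discriminator, $R_N(\Gamma) = G_N(\Gamma)$ and $R_D(\Gamma)$ are the RIRs from FAST-RIR and DAS, $S$ is the number of samples per RIR, and $\lambda_{MSE}$, $\lambda_{T_{60}}$ are the weights. The absolute difference in the $T_{60}$ term is also an assumption.

```latex
\begin{align*}
\mathcal{L}_{CGAN} &= \mathbb{E}_{\Gamma \sim p_{data}}
  \left[\log\left(1 - D_N(G_N(\Gamma))\right)\right] \\
\mathcal{L}_{MSE} &= \mathbb{E}_{\Gamma \sim p_{data}}
  \left[\frac{1}{S}\sum_{s=1}^{S}
  \left(R_N(\Gamma, s) - R_D(\Gamma, s)\right)^2\right] \\
\mathcal{L}_{T_{60}} &= \mathbb{E}_{\Gamma \sim p_{data}}
  \left[\left|T_{60}(R_N(\Gamma)) - \Gamma_{T_{60}}\right|\right] \\
\mathcal{L}_{G_N} &= \mathcal{L}_{CGAN}
  + \lambda_{MSE}\,\mathcal{L}_{MSE}
  + \lambda_{T_{60}}\,\mathcal{L}_{T_{60}} \\
\mathcal{L}_{D_N} &= \mathbb{E}_{\Gamma \sim p_{data}}
  \left[\log D_N(R_D(\Gamma))\right]
  + \mathbb{E}_{\Gamma \sim p_{data}}
  \left[\log\left(1 - D_N(G_N(\Gamma))\right)\right]
\end{align*}
```

In this form the generator minimizes $\mathcal{L}_{G_N}$ while the discriminator maximizes $\mathcal{L}_{D_N}$, which mirrors the standard GAN value function with the noise input replaced by the embedding.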


3.2.5 Implementation

Network Architecture: We adapt the generator network and the discriminator network proposed in Stage-I of the StackGAN architecture [32] and modify the networks. StackGAN takes a text description and a noise vector as input and generates a photo-realistic two-dimensional (2D) image as output. Our FAST-RIR takes acoustic environment details as input and generates an RIR as a one-dimensional (1D) raw-waveform audio output. We flatten the 2D convolutions into 1D to process 1D RIRs in both networks.

Unlike photo-realistic images, raw-waveform audio exhibits periodicity. Donahue et al. [5] suggest that filters with larger receptive fields are needed to process low frequencies (large wavelength signals) in the audio. We enlarge the receptive field of the original generator and of the encoder in the discriminator by increasing the kernel size (i.e., a 3 × 3 2D convolution becomes a length-41 1D convolution) and the strides (i.e., stride 2 × 2 becomes stride 41). We also replace the upsampling layer and the following convolutional layer with a transposed convolutional layer.

Dataset: The sizes of the existing real-world RIR datasets [7, 4, 29] are insufficient to train our FAST-RIR. Therefore, we generate 75,000 medium-sized room impulse responses using a DAS [30] to create a training dataset. We choose 15 evenly spaced room lengths within the range 8m to 11m, 10 evenly spaced room widths between 6m and 8m, and 5 evenly spaced room heights between 2.5m and 3.5m to generate RIRs. We position the speaker and the listener at random positions within the room and generate 100 different RIRs for each combination of room dimensions (15 × 10 × 5). The $T_{60}$ values of our training dataset are between 0.2s and 0.7s.
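The size of this training grid can be checked with a short sketch. It assumes that "evenly spaced" includes both endpoints of each range and that positions are drawn uniformly inside the room. Neither detail is stated in the text, and the helper name is hypothetical.

```python
import itertools

import numpy as np

# Evenly spaced room dimensions in metres (endpoints included by assumption).
lengths = np.linspace(8.0, 11.0, 15)
widths = np.linspace(6.0, 8.0, 10)
heights = np.linspace(2.5, 3.5, 5)

room_grid = list(itertools.product(lengths, widths, heights))
RIRS_PER_ROOM = 100
total_rirs = len(room_grid) * RIRS_PER_ROOM


def sample_positions(room, rng):
    """Draw a speaker and a listener position uniformly inside the room."""
    room = np.asarray(room, dtype=float)
    speaker = rng.uniform(0.0, room)
    listener = rng.uniform(0.0, room)
    return speaker, listener


rng = np.random.default_rng(0)
speaker, listener = sample_positions(room_grid[0], rng)
print(len(room_grid), total_rirs)
```

With 750 room shapes and 100 position pairs per shape, the grid yields the 75,000 RIRs quoted above.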

Training: We iteratively train the generator and the discriminator using the RMSprop optimizer with batch size 128 and learning rate 8x . Every 40 epochs, we decay the learning rate by a factor of 0.7.

4 Experiment and Results

4.1 Baselines

We randomly select 30,000 different acoustic environments within the range of the training dataset (Section 3.2.5). We generate RIRs corresponding to the selected acoustic environments using the image method [1], gpuRIR [3], DAS [30], and FAST-RIR to evaluate the performance of our proposed FAST-RIR. IR-GAN [23] cannot precisely generate RIRs for given speaker and listener positions; therefore, we did not use IR-GAN in our experiments.

4.2 Runtime

We evaluate the runtime for generating 30,000 RIRs using the image method, gpuRIR, DAS, and FAST-RIR on an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20 GHz and a GeForce RTX 2080 Ti GPU (Table 1). The gpuRIR is optimized to run on a GPU; therefore, we generate RIRs using gpuRIR only on a GPU. For a fair comparison with the CPU implementations of the image method and DAS, we also generate RIRs using our FAST-RIR with batch size 1 on a CPU.

From Table 1, we can see that our proposed FAST-RIR with batch size 1 is 400 times faster than DAS [30] on a CPU. Our FAST-RIR is optimized to run on a GPU, so we also compare its performance with an existing GPU-based RIR generator, gpuRIR [3]. gpuRIR is faster than our FAST-RIR with batch size 1, but batch size 1 is not the real use case of our generator. To the best of our knowledge, gpuRIR does not leverage batch parallelization, whereas our FAST-RIR supports it. With batch size 64, our proposed FAST-RIR is 12 times faster than gpuRIR.
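The two speedup figures follow directly from the timings in Table 1, as the short check below shows. The variable names are illustrative. The numbers are the average per-RIR time on a CPU and the total time for 30,000 RIRs on a GPU.

```python
# Average time per RIR on a CPU, in seconds (Table 1).
das_cpu_avg = 30.05
fast_rir_cpu_avg = 0.07

# Total time for 30,000 RIRs on a GPU, in seconds (Table 1).
gpurir_total = 16.63
fast_rir_bs64_total = 1.33

cpu_speedup = das_cpu_avg / fast_rir_cpu_avg
gpu_speedup = gpurir_total / fast_rir_bs64_total

print(f"FAST-RIR vs DAS on CPU: {cpu_speedup:.0f}x")
print(f"FAST-RIR (batch 64) vs gpuRIR on GPU: {gpu_speedup:.1f}x")
```

Both ratios come out slightly above the rounded figures of 400 and 12 quoted in the text.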

| RIR Generator | Hardware | Total Time | Avg Time |
|---|---|---|---|
| DAS [30] | CPU | $9.01 \times 10^{5}$ s | 30.05 s |
| Image Method [1] | CPU | $4.49 \times 10^{3}$ s | 0.15 s |
| FAST-RIR (Batch Size 1) | CPU | $2.15 \times 10^{3}$ s | 0.07 s |
| gpuRIR [3] | GPU | 16.63 s | $5.5 \times 10^{-4}$ s |
| FAST-RIR (Batch Size 1) | GPU | 34.12 s | $1.1 \times 10^{-3}$ s |
| FAST-RIR (Batch Size 64) | GPU | 1.33 s | $4.4 \times 10^{-5}$ s |
| FAST-RIR (Batch Size 128) | GPU | 1.77 s | $5.9 \times 10^{-5}$ s |

Table 1: The runtime for generating 30,000 RIRs using image method, gpuRIR, DAS, and our FAST-RIR. Our FAST-RIR significantly outperforms all other methods in runtime.

| $T_{60}$ Range | Crop RIR at $T_{60}$ | $T_{60}$ Error |
|---|---|---|
| 0.2s - 0.25s | No | 0.068s |
| 0.2s - 0.25s | Yes | 0.033s |
| 0.25s - 0.7s | - | 0.021s |
| 0.2s - 0.7s | No | 0.029s |
| 0.2s - 0.7s | Yes | 0.023s |

Table 2: $T_{60}$ error of our FAST-RIR for 30,000 testing acoustic environments. We report the $T_{60}$ error for RIRs cropped at $T_{60}$ and for full RIRs. We only crop RIRs with $T_{60}$ below 0.25s.

4.3 $T_{60}$ Error

Table 2 shows the $T_{60}$ error of the generated RIRs calculated using Equation 4. The testing $T_{60}$ error of our FAST-RIR is higher for input $T_{60}$ below 0.25s (0.068s) than for input $T_{60}$ above 0.25s (0.021s).

Our FAST-RIR is trained to generate RIRs with durations slightly above 0.25s. For input $T_{60}$ below 0.25s, the generated RIR has a noisy output between $T_{60}$ and 0.25s. We notice that cropping the generated RIRs at $T_{60}$ improves the overall $T_{60}$ error from 0.029s to 0.023s.
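The cropping step can be sketched as a simple truncation of the waveform at the sample index corresponding to the input $T_{60}$. The function name and the exact rounding are assumptions. The 0.25s threshold and the 16 kHz rate come from the text.

```python
import numpy as np


def crop_rir_at_t60(rir, t60, fs=16000, threshold=0.25):
    """Truncate the RIR at T60 when T60 is below the threshold (seconds)."""
    if t60 < threshold:
        return rir[: int(round(t60 * fs))]
    return rir


generated = np.zeros(4096)                    # placeholder for a generated RIR
short_rir = crop_rir_at_t60(generated, 0.2)   # cropped to 0.2 s
long_rir = crop_rir_at_t60(generated, 0.5)    # left untouched
print(len(short_rir), len(long_rir))
```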

4.4 Simulated Speech Comparison

We simulate reverberant speech by convolving clean speech from the LibriSpeech test-clean dataset [21] with different RIRs (Equation 7).
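The convolution in Equation 7 can be written as $x_r[t] = x_c[t] \circledast r[t]$, where the symbols $x_r$ (reverberant speech), $x_c$ (clean speech), and $r$ (RIR) are notation assumed here for illustration. A minimal NumPy sketch with a hypothetical function name:

```python
import numpy as np


def reverberate(clean_speech, rir):
    """Convolve clean speech with an RIR to simulate reverberant speech."""
    return np.convolve(clean_speech, rir, mode="full")


# Toy example: a delayed, attenuated echo added to a short signal.
clean = np.array([1.0, 0.5, -0.25, 0.0])
toy_rir = np.array([1.0, 0.0, 0.3])   # direct path plus one reflection
reverberant = reverberate(clean, toy_rir)
print(reverberant)
```

For full-length utterances an FFT-based convolution is usually faster, but the result is the same operation.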


We decode the simulated reverberant speech using the Google Speech API and the Microsoft Speech API. Table 3 shows the Word Error Rate (WER) of the decoded speech. No text normalization was applied in either case, as only the relative WER differences between different RIR generators are of interest. For the Google Speech API, we report WER for the clean and reverberant LibriSpeech test sets that are successfully decoded. The results of each speech API show that, compared with the reverberant speech simulated using traditional RIR generators, reverberant speech simulated using our FAST-RIR is closer to the reverberant speech simulated using DAS [30]. We provide reverberant speech audio examples, spectrograms, and the source code for reproducibility on GitHub.
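The relative WER changes reported in Table 3 can be reproduced from the absolute WER values, taking DAS as the baseline. The dictionary layout and function name below are illustrative.

```python
# WER [%] from Table 3 as (Google API, Microsoft API).
wer = {
    "DAS": (6.56, 2.63),
    "gpuRIR": (9.39, 3.78),
    "Image Method": (9.03, 3.86),
    "FAST-RIR": (7.14, 2.76),
}


def relative_change(value, baseline):
    """Relative change with respect to the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


baseline = wer["DAS"]
relative = {
    name: tuple(relative_change(v, b) for v, b in zip(values, baseline))
    for name, values in wer.items()
    if name != "DAS"
}
for name, (google, microsoft) in relative.items():
    print(f"{name}: Google {google:+.0f}%, Microsoft {microsoft:+.0f}%")
```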

4.5 Far-field Automatic Speech Recognition

We want to ensure that our FAST-RIR generates RIRs that are better than or as good as existing RIR generators for ASR. We use the AMI corpus [2] for our far-field ASR experiments. AMI contains close-talk speech data recorded using Individual Headset Microphones (IHM) and distant speech data recorded using Single Distant Microphones (SDM).

We use a modified Kaldi recipe to evaluate our FAST-RIR. The modified Kaldi recipe takes IHM data as the training set and tests the model using SDM data. The IHM data can be considered clean speech because the echo effects in IHM data are negligible when compared to SDM data. We augment far-field speech data by reverberating the IHM data with different RIR sets using Equation 7. The 30,000 RIRs generated using the image method, gpuRIR, DAS, and FAST-RIR are used in our experiment.

The IHM data consists of 687 long recordings. Instead of reverberating a speech recording using a single RIR, we do segment-level speech reverberation, as proposed in [29]. We split each recording at the beginning of every stretch of at least 3 seconds of continuous silence. We split at the beginning of the silence so that the reverberation tail of one segment does not overlap the speech of the next. This splits the IHM data into 17,749 segments. We reverberate each segment using a randomly selected RIR from an RIR dataset (either image method, gpuRIR, DAS, DAS-cropped, or our FAST-RIR).
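The splitting rule can be sketched on a per-sample silence mask. How silence is detected is not specified in the text, so the boolean mask is taken as a given input here, and the function name is hypothetical.

```python
import numpy as np


def split_at_long_silences(is_silent, fs, min_silence_s=3.0):
    """Return (start, end) sample ranges, cutting at the beginning of
    every silent run that lasts at least `min_silence_s` seconds."""
    is_silent = np.asarray(is_silent, dtype=bool)
    min_len = int(min_silence_s * fs)

    # Pad with False so every silent run has a rising and a falling edge.
    padded = np.concatenate(([False], is_silent, [False]))
    edges = np.diff(padded.astype(int))
    run_starts = np.flatnonzero(edges == 1)
    run_ends = np.flatnonzero(edges == -1)   # exclusive end indices

    cuts = [int(s) for s, e in zip(run_starts, run_ends)
            if e - s >= min_len and s > 0]
    bounds = [0] + cuts + [len(is_silent)]
    return list(zip(bounds[:-1], bounds[1:]))


# Toy mask at 10 Hz: speech, 4 s silence, speech, 1 s silence, speech.
mask = np.array([False] * 20 + [True] * 40 + [False] * 30
                + [True] * 10 + [False] * 20)
segments = split_at_long_silences(mask, fs=10)
print(segments)
```

Each resulting segment ends with its trailing silence, which leaves room for the reverberation tail before the next segment starts.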

Table 4 presents far-field ASR development and test WER for far-field SDM data. We can see that our FAST-RIR outperforms gpuRIR [3] by up to 2.5% absolute WER. DAS [30] with full-duration RIRs and DAS with RIRs cropped to the same duration as our FAST-RIR (DAS-cropped) perform similarly in the far-field ASR experiment. The performance of DAS and FAST-RIR shows no significant difference.

| Clean Speech (Testing Dataset) | RIR | WER [%], Google API | WER [%], Microsoft API |
|---|---|---|---|
| Libri | DAS (baseline) [30] | 6.56 | 2.63 |
| Libri | gpuRIR [3] | 9.39 (+43%) | 3.78 (+44%) |
| Libri | Image Method [1] | 9.03 (+38%) | 3.86 (+47%) |
| Libri | FAST-RIR (ours) | 7.14 (+9%) | 2.76 (+5%) |

Table 3: Automatic speech recognition (ASR) results obtained using Google Speech API and Microsoft Speech API. We simulate a reverberant speech testing dataset by convolving clean speech from the LibriSpeech dataset with different RIR datasets. We compare the reverberant speech simulated using the image method, gpuRIR, and our FAST-RIR with the reverberant speech simulated using DAS. We show that the relative WER change from our method is the smallest.

| Clean Speech (Training Dataset) | RIR | WER [%], dev | WER [%], eval |
|---|---|---|---|
| IHM | None | 55.0 | 64.2 |
| IHM | Image Method [1] | 51.7 | 56.1 |
| IHM | gpuRIR [3] | 52.2 | 55.5 |
| IHM | DAS [30] | 47.9 | **52.5** |
| IHM | DAS-cropped [30] | 48.3 | 52.6 |
| IHM | FAST-RIR (ours) | **47.8** | 53.0 |

Table 4: Far-field ASR results obtained for far-field speech data recorded by single distant microphones (SDM) in the AMI corpus. The best results are shown in bold.

5 Discussion and Future Work

We propose a novel FAST-RIR architecture to generate a large RIR dataset on the fly. We show that our FAST-RIR performs similarly in ASR experiments when compared to the RIR generator (DAS [30]), which is used to generate a training dataset to train our FAST-RIR. Our FAST-RIR can be easily trained with RIR generated using any state-of-the-art accurate RIR generator to improve its performance in ASR experiments while keeping the speed of RIR generation the same.

Although we trained our FAST-RIR for limited room dimensions ranging from (8m, 6m, 2.5m) to (11m, 8m, 3.5m) using 75,000 RIRs, we believe that our FAST-RIR will give a similar performance when trained on a larger range of room dimensions with a much larger number of RIRs. We would like to evaluate the performance of our FAST-RIR in multi-channel ASR [6] and speech separation [12] tasks.


  • [1] J. B. Allen and D. A. Berkley (1979-04) Image method for efficiently simulating small-room acoustics. Acoustical Society of America Journal 65 (4), pp. 943–950. External Links: Document Cited by: §1, §1, §2, §4.1, Table 1, Table 3, Table 4.
  • [2] J. Carletta et al. (2005) The AMI meeting corpus: a pre-announcement. In Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction, MLMI’05, pp. 28–39. External Links: ISBN 3540325492, Link, Document Cited by: §4.5.
  • [3] D. Diaz-Guerra, A. Miguel, and J. R. Beltran (2021) GpuRIR: a python library for room impulse response simulation with gpu acceleration. Multimedia Tools and Applications 80 (4), pp. 5653–5671. Cited by: §1, §1, §4.1, §4.2, §4.5, Table 1, Table 3, Table 4.
  • [4] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor (2015) The ACE challenge - corpus description and performance evaluation. In WASPAA, pp. 1–5. Cited by: §3.2.5.
  • [5] J. H. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts (2019) GANSynth: adversarial neural audio synthesis. In ICLR (Poster), Cited by: §3.2.5.
  • [6] A.S. S. et al. (2021) Directional ASR: A new paradigm for E2E multi-speaker speech recognition with source localization. In ICASSP, pp. 8433–8437. Cited by: §5.
  • [7] K. K. et al. (2017) The REVERB challenge: A benchmark task for reverberation-robust ASR techniques. In New Era for Robust Speech Recognition, Exploiting Deep Learning, pp. 345–354. Cited by: §3.2.5.
  • [8] T. N. S. et al. (2017) Multichannel signal processing with deep neural networks for automatic speech recognition. IEEE ACM Trans. Audio Speech Lang. Process. 25 (5), pp. 965–979. Cited by: §1.
  • [9] Z. Fu and J. Li (2016) GPU-based image method for room impulse response calculation. Multim. Tools Appl. 75 (9), pp. 5205–5221. Cited by: §1.
  • [10] J. Gauthier (2015) Conditional generative adversarial networks for convolutional face generation. In Tech Report, Cited by: §3.1.
  • [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: §3.1.
  • [12] R. Gu, S. Zhang, Y. Zou, and D. Yu (2021) Complex neural spatial filter: enhancing multi-channel target speech separation in complex domain. IEEE Signal Process. Lett. 28, pp. 1370–1374. Cited by: §5.
  • [13] M. Karafiát, F. Grézl, L. Burget, I. Szöke, and J. Černocký (2015) Three ways to adapt a CTS recognizer to unseen reverberated speech in BUT system for the ASpIRE challenge. In Proc. Interspeech 2015, pp. 2454–2458. External Links: Document Cited by: §1.
  • [14] M. Karafiát, K. Veselý, K. Zmolíková, M. Delcroix, S. Watanabe, L. Burget, J. H. Cernocký, and I. Szöke (2017) Training data augmentation and data selection. In New Era for Robust Speech Recognition, Exploiting Deep Learning, pp. 245–260. Cited by: §1.
  • [15] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani (2017) Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home. In Proc. Interspeech 2017, pp. 379–383. External Links: Document Cited by: §1.
  • [16] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur (2017) A study on data augmentation of reverberant speech for robust speech recognition. In ICASSP, pp. 5220–5224. Cited by: §1.
  • [17] H. Kuttruff (2009) Room acoustics. Spon Press. Cited by: §1.
  • [18] S. Liu and D. Manocha (2020) Sound synthesis, propagation, and rendering: a survey. arXiv preprint arXiv:2011.05538. Cited by: §1.
  • [19] P. Masztalski, M. Matuszewski, K. Piaskowski, and M. Romaniuk (2020) StoRIR: stochastic room impulse response generation for audio data augmentation. In INTERSPEECH, pp. 2857–2861. Cited by: §1.
  • [20] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §3.1.
  • [21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In ICASSP, pp. 5206–5210. Cited by: §4.4.
  • [22] N. Raghuvanshi, R. Narain, and M. C. Lin (2009) Efficient and accurate sound propagation using adaptive rectangular decomposition. IEEE Trans. Vis. Comput. Graph. 15 (5), pp. 789–801. Cited by: §2.
  • [23] A. Ratnarajah, Z. Tang, and D. Manocha (2021) IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition. In Proc. Interspeech 2021, pp. 286–290. External Links: Document Cited by: §2, §4.1.
  • [24] A. Ratnarajah, Z. Tang, and D. Manocha (2021) TS-rir: translated synthetic room impulse responses for speech augmentation. arXiv preprint arXiv:2103.16804. Cited by: §2.
  • [25] W. C. Sabine and M. D. Egan (1994) Collected papers on acoustics. Acoustical Society of America. Cited by: §1.
  • [26] S. Sakamoto, A. Ushiyama, and H. Nagatomo (2006) Numerical analysis of sound propagation in rooms using the finite difference time domain method. The Journal of the Acoustical Society of America 120 (5), pp. 3008–3008. External Links: Document Cited by: §2.
  • [27] C. Schissler and D. Manocha (2016-09) Interactive sound propagation and rendering for large multi-source scenes. ACM Trans. Graph. 36 (1). External Links: ISSN 0730-0301, Link, Document Cited by: §1.
  • [28] N. Singh, J. Mentch, J. Ng, M. Beveridge, and I. Drori (2021-10) Image2Reverb: cross-model reverb impulse response synthesis. In ICCV, Cited by: §2.
  • [29] I. Szöke, M. Skácel, L. Mosner, J. Paliesek, and J. H. Cernocký (2019) Building and evaluation of a real room impulse response dataset. IEEE J. Sel. Top. Signal Process. 13 (4), pp. 863–876. Cited by: §1, §3.2.5, §4.5.
  • [30] Z. Tang, L. Chen, B. Wu, D. Yu, and D. Manocha (2020) Improving reverberant speech training using diffuse acoustic simulation. In ICASSP, pp. 6969–6973. Cited by: §1, §1, §1, §2, §3.2.5, §4.1, §4.2, §4.4, §4.5, Table 1, Table 3, Table 4, §5.
  • [31] Z. Tang and D. Manocha (2021) Scene-aware far-field automatic speech recognition. External Links: 2104.10757 Cited by: §1.
  • [32] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, Cited by: §3.2.5.