In recent years, the usage of open-source toolkits for the development and deployment of state-of-the-art machine learning applications has grown rapidly. General-purpose open-source toolkits such as TensorFlow [tensorflow2015-whitepaper] and PyTorch [NEURIPS2019_9015] are used extensively. However, building applications for different domains requires additional domain-specific functionality. To accelerate development, we established the torchaudio toolkit, which provides building blocks for machine learning applications in the audio and speech domain.
To provide building blocks for modern machine learning applications in the audio/speech domain, we aim for each building block to have three important properties: (1) GPU compute capability, (2) automatic differentiability, and (3) production readiness. GPU compute capability accelerates training and inference. Automatic differentiability allows these functionalities to be incorporated directly into neural networks, enabling full end-to-end learning. Production readiness means that trained models can be easily ported to various environments, including mobile devices running the Android and iOS platforms.
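The differentiability property can be illustrated with a small PyTorch sketch (plain PyTorch, not torchaudio's actual implementation): a windowed magnitude spectrum is computed from a waveform, and gradients flow back to the raw samples, which is what allows such DSP operations to sit inside an end-to-end trained network. Moving the tensors with `.to("cuda")` would exercise the GPU property in the same way.

```python
import torch

# Gradients flow through a windowed magnitude spectrum, so a DSP
# operation like this can sit inside a network trained end to end.
x = torch.randn(400, requires_grad=True)
window = torch.hann_window(400)
magnitude = torch.fft.rfft(x * window).abs()

magnitude.sum().backward()
print(x.grad.shape)  # gradient w.r.t. the raw waveform: torch.Size([400])
```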
The stability of the toolkit is our top priority. Our goal is not to include all state-of-the-art technologies, but rather to provide high-quality canonical implementations that users can build upon. The functionalities provided in torchaudio include basic input/output of audio files, audio/speech datasets, audio/speech operations, and implementations of canonical machine learning models. To the best of our knowledge, torchaudio is the only toolkit that has building blocks that span these four functionalities, with the majority satisfying the three aforementioned properties.
To understand how our implementation compares with others, we conduct an empirical study. We benchmark five audio/speech operations and three machine learning models against publicly available reference implementations. For the audio/speech operations, our implementations achieve or exceed parity in run time performance. For the machine learning models, our implementations achieve or exceed parity in output quality.
torchaudio is designed to facilitate new projects in the audio/speech domain. We have built a thriving community on GitHub with many users and developers. As of September 2021, we have addressed numerous issues and merged numerous pull requests. Many users have contributed code, a large portion of them external contributors. We also observe extensive usage of torchaudio in the open-source community: numerous projects are forked from torchaudio, and numerous public repositories on GitHub depend on torchaudio.
This paper is organized as follows. We start by reviewing existing open-source audio/speech toolkits in Section 2. Next, we discuss torchaudio’s design principles in Section 3, and then introduce the structure of torchaudio and its available functionalities in Section 4. Finally, we present the empirical study we conducted in Section 5.
2 Related work
Arguably, modern deep learning applications for audio/speech processing are mainly developed within the numpy [harris2020array], TensorFlow [tensorflow2015-whitepaper], or PyTorch [NEURIPS2019_9015] ecosystems. Users tend to commit to one ecosystem when building their applications, to avoid complicated dependencies and conversions between ecosystems that increase the cost of maintenance. There are many excellent open-source toolkits that provide audio/speech-related building blocks and functionalities within each ecosystem. For example, librosa [mcfee2015librosa] is built for numpy, DDSP [engel2020ddsp] and TensorFlowASR (https://github.com/TensorSpeech/TensorFlowASR) are created for TensorFlow, and torchaudio is designed to work with PyTorch. torchaudio is the go-to toolkit for basic audio/speech functionalities inside the PyTorch ecosystem.
torchaudio provides important low-level functionalities such as audio input/output, spectrogram computation, and a unified interface for accessing datasets. In the PyTorch ecosystem, there are many useful audio/speech toolkits available, including Asteroid [Pariente2020Asteroid], ESPnet [watanabe2018espnet], Espresso [wang2019espresso], fairseq [ott2019fairseq], NeMo [kuchaiev2019nemo], Pytorch-Kaldi [pytorch-kaldi], and SpeechBrain [ravanelli2021speechbrain]. These toolkits provide ready-to-use models for various applications, including speech recognition, speech enhancement, speech separation, speaker recognition, and text-to-speech. Notably, all of these toolkits have torchaudio as a dependency so that they do not have to re-implement the basic operations (we say a package depends on torchaudio if it has import torchaudio in its repository).
3 Design Principles
torchaudio is designed to provide building blocks for the development of audio applications within the PyTorch ecosystem. The functionalities in torchaudio are built to be compatible with PyTorch core functionalities like neural network containers and data loading utilities. This allows users to easily incorporate functionalities in torchaudio into their use cases. For simplicity and ease of use, torchaudio does not depend on any other Python packages except PyTorch.
We ensure that most of our functionalities satisfy three key properties: (1) GPU compute capability, (2) automatic differentiability, and (3) production readiness. To ensure GPU compute capability, we implement all computation-heavy operations, such as convolution and matrix multiplication, using GPU-compatible logic. To ensure automatic differentiability, we perform a gradient test on all necessary functionalities. Lastly, for production readiness, we make sure that most of our functionalities are compilable into TorchScript (https://pytorch.org/docs/stable/jit.html). TorchScript is an intermediate representation that can be generated from Python code, serialized, and later loaded by a process with no Python dependency, such as systems where only C++ is supported (https://github.com/pytorch/audio/tree/main/examples/libtorchaudio) and mobile platforms, including Android (https://github.com/pytorch/android-demo-app/tree/master/SpeechRecognition) and iOS (https://github.com/pytorch/ios-demo-app/tree/master/SpeechRecognition).
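The TorchScript workflow can be sketched as follows. The `Peak` module here is a hypothetical toy transform, not part of torchaudio; the point is only that an `nn.Module` compiled with `torch.jit.script` produces an artifact that can be serialized and later run without a Python interpreter.

```python
import torch

class Peak(torch.nn.Module):
    """Hypothetical minimal transform used only to illustrate scripting."""
    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # peak absolute amplitude of the waveform
        return waveform.abs().max()

# Compile to TorchScript; scripted.save("peak.pt") would then produce an
# artifact loadable from C++ or mobile runtimes without Python.
scripted = torch.jit.script(Peak())

x = torch.tensor([0.1, -0.5, 0.3])
print(float(scripted(x)))  # 0.5
```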
Since users depend on torchaudio for foundational building blocks, we must maintain its stability and availability on a wide range of platforms, particularly as its user base grows. We support all platforms that PyTorch supports, which includes the major operating systems (Windows, MacOS, and Linux) with the major Python versions. Features are classified into three release statuses: stable, beta, and prototype. Once a feature reaches the stable release status, we aim to maintain backward compatibility whenever possible. Any backward-compatibility-breaking change to a stable feature can be released only after two release cycles have elapsed from when it was proposed. On the other hand, beta and prototype features can be less stable in terms of their APIs, but they allow users to benefit from newly implemented features earlier. The difference between beta and prototype is that prototype features are generally even less stable and are only accessible from the source code or nightly builds, not from PyPI or Conda. Before the official release of a new version, we use release candidates to collect user feedback.
We apply modern software development practices to ensure the quality of the codebase. For code readability, our code linting complies with PEP-8 recommendations, with only small modifications. All development is conducted openly on GitHub, and all functions are thoroughly documented using Sphinx (https://www.sphinx-doc.org/en/master/). We have also implemented a large number of test cases using the pytest (https://pytest.org/) framework and use CircleCI (https://circleci.com/) for continuous integration testing.
We strive to be selective about the features we implement. We aim to maintain only the most essential functionalities to keep the design of the package lean. This makes it easier to keep each release stable and the maintenance cost down. We provide models that researchers commonly employ as baselines for comparison. For example, Tacotron2 [shen2018natural] is usually used as the baseline when developing text-to-speech systems; therefore, we include an implementation of Tacotron2.
For sophisticated downstream applications, we provide examples and tutorials (https://pytorch.org/tutorials/), making it easy for users to adapt torchaudio to various applications and use cases.
4 Package Structure and Functionalities
Here, we provide an overview of the functionalities and structure of torchaudio. torchaudio supports all four categories of functionalities mentioned previously, including audio input/output (I/O), audio/speech datasets, audio/speech operations, and audio/speech models.
4.1 Audio input/output (I/O)
Audio I/O is implemented under the backend submodule to provide a user-friendly interface for loading audio files into PyTorch tensors and saving tensors into audio files. For this purpose, we ported SoX (https://sourceforge.net/projects/sox/) into torchaudio and made it torchscriptable, meaning it is production-ready and can be used for on-device streaming ASR. Optionally, torchaudio also provides interfaces for other backends such as soundfile and kaldi-io (https://github.com/vesis84/kaldi-io-for-python).
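The load/save round trip that this interface exposes can be pictured with a dependency-free analogue built on Python's stdlib wave module (torchaudio itself returns a float tensor plus sample rate rather than raw PCM bytes):

```python
import io
import math
import struct
import wave

# Write one second of a 440 Hz sine to an in-memory WAV file, then read
# it back -- a stdlib analogue of torchaudio.save / torchaudio.load.
sample_rate = 16000
pcm = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / sample_rate))
       for n in range(sample_rate)]

buf = io.BytesIO()
with wave.open(buf, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)      # 16-bit PCM
    f.setframerate(sample_rate)
    f.writeframes(struct.pack("<%dh" % len(pcm), *pcm))

buf.seek(0)
with wave.open(buf, "rb") as f:
    rate = f.getframerate()
    frames = f.readframes(f.getnframes())
print(len(frames))  # 32000 bytes = 16000 samples * 2 bytes each
```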
4.2 Audio/speech datasets
Access to audio/speech datasets is implemented under the datasets submodule. The goal of this submodule is to provide a user-friendly interface for accessing commonly used datasets, including Librispeech [panayotov2015librispeech], VCTK [yamagishi2019cstr], LJSpeech [ljspeech17], and others. These dataset interfaces greatly simplify data pipelines and are built to be compatible with the data loading utilities provided in PyTorch, i.e. torch.utils.data.DataLoader (https://pytorch.org/docs/stable/data.html). This allows users to access a wide range of off-the-shelf features, including customizable data loading order, automatic batching, multi-process data loading, and automatic memory pinning.
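The compatibility contract is simply that each dataset is a map-style torch.utils.data.Dataset. The toy dataset below is hypothetical (synthetic tensors instead of real audio), but it shows the shape of the interface and how DataLoader batching then comes for free:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyAudioDataset(Dataset):
    """Hypothetical stand-in for a torchaudio dataset; the real datasets
    similarly return per-item tuples that DataLoader can collate."""
    def __init__(self, n_items: int = 8, n_samples: int = 16000):
        self.items = [torch.randn(1, n_samples) for _ in range(n_items)]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, i: int):
        waveform = self.items[i]
        sample_rate = 16000
        return waveform, sample_rate

loader = DataLoader(ToyAudioDataset(), batch_size=4, shuffle=True)
for waveforms, rates in loader:
    print(waveforms.shape)  # torch.Size([4, 1, 16000])
    break
```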
4.3 Audio/speech operations
We implemented common audio operations under three submodules – functional, transforms, and sox_effects.
The functional submodule provides a variety of commonly used operations, which can be grouped into four categories: general utility, complex utility, filtering, and feature extraction. In general utilities, we provide operations such as creating the discrete cosine transform matrix. In complex utilities, we provide functions such as computing the complex norm. For filtering, we include operations like bandpass filters. Finally, for feature extraction, we have algorithms such as the computation of the spectral centroid.
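As an example of the feature-extraction category, the spectral centroid is the magnitude-weighted mean frequency of a frame. The numpy sketch below is illustrative and not torchaudio's exact implementation; for a pure tone the centroid lands on the tone's frequency:

```python
import numpy as np

def spectral_centroid(waveform: np.ndarray, sample_rate: int) -> float:
    """Magnitude-weighted mean frequency of a single frame
    (illustrative sketch, not torchaudio's implementation)."""
    magnitude = np.abs(np.fft.rfft(waveform))
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sample_rate)
    return float(np.sum(freqs * magnitude) / np.sum(magnitude))

# A pure 1 kHz tone: the centroid sits at roughly 1 kHz.
sr = 16000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
print(round(spectral_centroid(tone, sr)))  # 1000
```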
The transforms implemented under the transforms submodule are subclasses of torch.nn.Module and are designed to work as neural network building blocks. They interface with PyTorch’s neural network containers (e.g. torch.nn.Sequential, torch.nn.ModuleList) and can therefore be seamlessly integrated with all the neural network features in PyTorch. The functionalities implemented in this submodule include Spectrogram/InverseSpectrogram, mel-frequency cepstrum coefficients (MFCC), minimum variance distortionless response (MVDR) beamforming, the RNN-Transducer loss, and more.
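Because each transform is an nn.Module, chaining them is just ordinary container composition. The `Gain` module below is a hypothetical minimal transform written for illustration, not a torchaudio class:

```python
import torch

class Gain(torch.nn.Module):
    """Hypothetical minimal transform in the style of torchaudio.transforms."""
    def __init__(self, gain: float):
        super().__init__()
        self.gain = gain

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return waveform * self.gain

# Transforms are nn.Modules, so they chain inside standard containers.
pipeline = torch.nn.Sequential(Gain(2.0), Gain(0.5))
x = torch.ones(4)
print(pipeline(x))  # tensor([1., 1., 1., 1.])
```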
SoX is a popular command line program that provides many audio processing functionalities. To make these functionalities more accessible, we adapted the source code of SoX version 14.4.2 and ported it into torchaudio as the sox_effects submodule. This module is fully torchscriptable and supports many sound effects, such as resampling, pitch shift, etc.
4.4 Machine learning models
We implement various canonical machine learning models in torchaudio.models across a wide range of audio-related applications. For speech recognition, we have DeepSpeech [hannun2014deep], HuBERT [hsu2021hubert], Wav2letter [collobert2016wav2letter], and Wav2Vec 2.0 [baevski2020wav2vec]. For speech separation, we have Conv-TasNet [luo2019conv]. For text-to-speech and neural vocoding, we have Tacotron2 [shen2018natural] paired with WaveRNN [kalchbrenner2018efficient]. These models are selected to support basic downstream tasks.
5 Empirical evaluations
This section compares our implementations with others. We present performance benchmarks for five audio/speech operations and three machine learning models. We benchmark the audio/speech operations using run time as the main performance metric. For the machine learning models, the performance measures are chosen based on the task they perform. The experiments are conducted on an Amazon AWS p4d.24xlarge instance with NVIDIA A100 GPUs.
5.1 Audio/Speech Operations
The pioneering librosa [mcfee2015librosa] is arguably one of the most commonly used audio toolkits in the Python community. We select five operations that are implemented in both torchaudio and librosa and compare their run time: phase vocoder, the Griffin-Lim algorithm, MFCC, spectral centroid, and spectrogram. For librosa, we use version 0.8.0 installed from PyPI. The inputs are all of floating point data type. For MFCC, spectral centroid, and spectrogram, we measure the run time over many runs; for the more expensive Griffin-Lim algorithm and phase vocoder, we measure over fewer runs. Operations in torchaudio can be compiled into TorchScript at runtime and can be accelerated with a GPU, so we also include just-in-time compiled (jitted) and GPU-accelerated versions of the torchaudio operations.
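A minimal version of this timing methodology looks like the helper below (an illustrative sketch, not the paper's actual benchmark harness). On GPU, one would also call torch.cuda.synchronize() before reading the clock, since CUDA kernels launch asynchronously:

```python
import time

def bench(fn, n_runs: int = 100) -> float:
    """Average wall-clock run time of fn over n_runs calls, in seconds.
    Illustrative sketch; on GPU, synchronize before reading the clock."""
    fn()  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs

avg = bench(lambda: sum(i * i for i in range(1000)))
print(avg > 0.0)  # True
```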
The benchmark results are shown in Figure 1, with two main findings. First, on CPU, torchaudio achieves even or slightly better speed, and its GPU support allows it to take further advantage of hardware acceleration. Second, the jitted version runs slightly slower on CPU but slightly faster on GPU; this difference, however, is marginal.
5.2 Audio/Speech Applications
Here, we benchmark the performance of three models: WaveRNN [kalchbrenner2018efficient], Tacotron2 [shen2018natural], and Conv-TasNet [luo2019conv]. We compare each model independently with a popular open-source implementation available online.
Table 1: WaveRNN evaluation results.

| Model | PESQ (↑) | STOI (↑) | MOS (↑) |
|---|---|---|---|
| fatchord's WaveRNN (https://github.com/fatchord/WaveRNN) | 3.43 | 0.96 | 3.78 ± 0.08 |
| torchaudio WaveRNN | 3.68 | 0.97 | 3.88 ± 0.08 |
| Nvidia WaveGlow | 3.78 | 0.97 | 3.72 ± 0.05 |
| torchaudio WaveRNN | 3.63 | 0.97 | 3.86 ± 0.05 |
| ground truth | - | - | 4.09 ± 0.05 |
For WaveRNN, we consider the task of reconstructing a waveform from the corresponding mel-spectrogram. We measure the reconstruction performance with the wide-band version of the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) on the LJSpeech [ljspeech17] dataset. When evaluating with PESQ, we resample the waveform to the rate required by the wide-band measure and compute the score with a publicly available implementation (https://github.com/ludlows/python-pesq). We compare the performance of our model with the popular WaveRNN implementation from fatchord (https://github.com/fatchord/WaveRNN) as well as WaveGlow [prenger2019waveglow] from Nvidia (https://github.com/NVIDIA/DeepLearningExamples). The results are shown in Table 1. Note that the pretrained fatchord and Nvidia models are trained on different dataset splits; we therefore train and evaluate our WaveRNNs on each of their splits, respectively.
In addition, we also consider a subjective evaluation metric, the mean opinion score (MOS). For the comparison with fatchord's WaveRNN, we use the full test set; for the comparison with Nvidia's WaveGlow, we randomly select audio examples from the test set. The generated samples are sent to Amazon's Mechanical Turk, a human rating service, where human raters score each audio sample. Each evaluation is conducted independently, so the outputs of two different models are not directly compared when raters assign a score. Each audio sample is rated by multiple raters on a fixed scale, where the lowest score means the rater perceived the audio as unlike human speech, and the highest score means the rater perceived it as close to human speech.
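The "mean ± interval" values reported for MOS can be computed as a sample mean with a normal-approximation confidence interval. The sketch below is one common formulation; the exact interval used in the paper is not stated, and the score list here is invented for illustration:

```python
import math

def mos_with_ci(scores, z: float = 1.96):
    """Mean opinion score with a normal-approximation 95% confidence
    half-width (sketch; the paper's exact interval is not specified)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mean, half_width

mean, ci = mos_with_ci([4.0, 3.5, 4.5, 4.0])
print(f"{mean:.2f} +/- {ci:.2f}")  # 4.00 +/- 0.40
```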
The results in Table 1 show that our implementation achieves performance similar to both fatchord's and Nvidia's models, which verifies the validity of our implementation. In addition, our implementation utilizes the latest DistributedDataParallel (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) for multi-GPU training, which gives an additional run time benefit over fatchord's implementation.
For Tacotron2, we measure performance with the mel cepstral distortion (MCD) metric and compare our implementation with Nvidia's implementation (https://github.com/NVIDIA/DeepLearningExamples). The evaluation is conducted on the LJSpeech dataset. We train our Tacotron2 using the training/testing split from Nvidia's repository and compare it with Nvidia's pretrained model. We train with essentially the default hyperparameters provided in Nvidia's repository, which specify the number of epochs, the initial learning rate, the weight decay, and the gradient norm clipping threshold, and use the Adam optimizer. Table 2 shows that the difference in MCD between the two implementations is very small, which verifies the validity of our implementation.
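MCD is a frame-averaged Euclidean distance between aligned mel-cepstra, scaled into dB. The sketch below uses one common formulation (10/ln10 · √2 · per-frame L2 distance, 0th coefficient excluded); the paper does not state which exact variant it uses, and the toy cepstra are invented for illustration:

```python
import math

def mel_cepstral_distortion(ref, syn):
    """Frame-averaged MCD in dB over aligned mel-cepstra (0th coefficient
    assumed excluded). A common formulation, not necessarily the paper's."""
    k = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for c_ref, c_syn in zip(ref, syn):
        total += k * math.sqrt(sum((a - b) ** 2 for a, b in zip(c_ref, c_syn)))
    return total / len(ref)

# Identical cepstra give zero distortion.
ref = [[1.0, 2.0], [0.5, -1.0]]
print(mel_cepstral_distortion(ref, ref))  # 0.0
```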
We train the Conv-TasNet model on the sep_clean task of the Libri2Mix dataset. The sample rate is 8000 Hz. We follow the same model configuration and training strategy as the Asteroid training pipeline: 1) we use the negative scale-invariant signal-to-distortion ratio (Si-SDR) as the loss function with a permutation invariant training (PIT) criterion; 2) we clip the gradient with a threshold of 5; 3) we use 1e-3 as the initial learning rate and halve it if the validation loss stops decreasing for 5 epochs. We evaluate the performance by Si-SDR improvement (Si-SDRi) and signal-to-distortion ratio improvement (SDRi). Table 3 shows that our implementation slightly outperforms Asteroid's on both the Si-SDRi and SDRi metrics.
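The Si-SDR-with-PIT criterion described above can be sketched in numpy (an illustrative reference implementation of the standard definitions, not the training code): Si-SDR projects the estimate onto the reference to make the score scale-invariant, and PIT takes the best speaker permutation, so a separator whose outputs come out in swapped order is not penalized.

```python
import itertools
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (standard definition)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10(np.sum(s_target ** 2) / np.sum(e_noise ** 2))

def pit_si_sdr(ests, refs) -> float:
    """Best mean Si-SDR over speaker permutations (the PIT criterion;
    the training loss is the negative of this value)."""
    return max(
        np.mean([si_sdr(ests[p[i]], refs[i]) for i in range(len(refs))])
        for p in itertools.permutations(range(len(ests)))
    )

rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(8000), rng.standard_normal(8000)
# Estimates in swapped order, with a little additive noise.
ests = [s2 + 0.1 * rng.standard_normal(8000), s1 + 0.1 * rng.standard_normal(8000)]
print(pit_si_sdr(ests, [s1, s2]) > 15.0)  # True: PIT finds the swap
```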
Table 3 columns: Si-SDRi (dB, ↑) and SDRi (dB, ↑).
This paper provides a brief summary of the torchaudio toolkit. With an increasing number of users, this project is under active development to serve our goal of accelerating the development and deployment of audio-related machine learning applications. The roadmap for future work can be found on our GitHub page (https://github.com/pytorch/audio).
We thank the release engineers, Nikita Shulga and Eli Uriegas, for helping with creating releases. We thank all contributors on GitHub, including but not limited to Bhargav Kathivarapu, Chin-Yun Yu, Emmanouil Theofanis Chourdakis, Kiran Sanjeevan, and Tomás Osório (due to the page limit, we only include the names of those who contributed a substantial number of lines as of September 2021). We thank Abdelrahman Mohamed, Alban Desmaison, Andrew Gibiansky, Brian Vaughan, Christoph Boeddeker, Fabian-Robert Stöter, Faro Stöter, Jan Schlüter, Keunwoo Choi, Manuel Pariente, Mirco Ravanelli, Piotr Żelasko, Samuele Cornell, Yossi Adi, and Zhi-Zheng Wu for the meaningful discussions about the design of torchaudio. We also thank the Facebook marketing team that helped promote this work.