Asteroid: the PyTorch-based audio source separation toolkit for researchers

by Manuel Pariente, et al.

This paper describes Asteroid, the PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all the neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Asteroid and its most important features. Experimental results obtained with Asteroid's recipes show that our implementations are at least on par with most results reported in the reference papers. The toolkit is publicly available at .







1 Introduction

Audio source separation, which aims to separate a mixture signal into individual source signals, is essential to robust speech processing in real-world acoustic environments [BookEVincent]. Classical open-source toolkits such as FASST [FASST2014], HARK [HARK2009], ManyEars [ManyEars2013] and openBliSSART [OpenBLISSART], which are based on probabilistic modelling, non-negative matrix factorization, sound source localization and/or beamforming, have been successful in the past decade. However, they are now largely outperformed by deep learning-based approaches, at least on the task of single-channel source separation [DPCLHershey2016, PITYu2016, LSTMLuo2018, ConvLuo2018, Wavesplit2020Zeghidour].

Several open-source toolkits have emerged for deep learning-based source separation. These include nussl (Northwestern University Source Separation Library) [NUSSLManilow2018], ONSSEN (An Open-source Speech Separation and Enhancement Library) [OnssenNi2019], Open-Unmix [OpenUnmix], and countless isolated implementations replicating some important papers, e.g., kaituoxu/TasNet, kaituoxu/Conv-TasNet, yluo42/TAC, JusperLee/Conv-TasNet, JusperLee/Dual-Path-RNN-Pytorch, tky1117/DNN-based_source_separation, and ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation.

Both nussl and ONSSEN are written in PyTorch [PyTorchPaszke2019] and provide training and evaluation scripts for several state-of-the-art methods. However, data preparation steps are not provided and experiments are not easily configurable from the command line. Open-Unmix does provide a complete pipeline from data preparation through evaluation, but only for the Open-Unmix model on the music source separation task. Regarding the isolated implementations, some of them only contain the model, while others provide training scripts but assume that the training data has already been generated. Very few provide the complete pipeline, and among the ones providing evaluation scripts, differences can often be found, e.g., discarding short utterances, or splitting utterances into chunks and discarding the last one.

This paper describes Asteroid (Audio source separation on Steroids), a new open-source toolkit for deep learning-based audio source separation and speech enhancement, designed for researchers and practitioners. Based on PyTorch, one of the most widely used dynamic neural network toolkits, Asteroid is meant to be user-friendly and easily extensible, to promote reproducible research, and to enable easy experimentation. As such, it supports a wide range of datasets and architectures, and comes with recipes reproducing some important papers. Asteroid is built on the following principles:

  1. Abstract only where necessary, i.e., use as much native PyTorch code as possible.

  2. Allow importing third-party code with minimal changes.

  3. Provide all steps from data preparation to evaluation.

  4. Enable recipes to be configurable from the command line.

We present the audio source separation framework in Section 2. We describe Asteroid’s main features in Section 3 and their implementation in Section 4. We provide example experimental results in Section 5 and conclude in Section 6.

2 General framework

While Asteroid is not limited to a single task, single-channel source separation is currently its main focus. Hence, we will only consider this task in the rest of the paper. Let $x$ be a single-channel recording of $J$ sources in noise:

$$x(t) = \sum_{j=1}^{J} s_j(t) + n(t),$$

where $\{s_j\}_{j=1}^{J}$ are the source signals and $n$ is an additive noise signal. The goal of source separation is to obtain the source estimates $\{\widehat{s}_j\}_{j=1}^{J}$ given $x$.

Most state-of-the-art neural source separation systems follow the encoder-masker-decoder approach depicted in Fig. 1 [LSTMLuo2018, ConvLuo2018, tzinis2019twostep, DPRNNLuo2020]. The encoder computes a short-time Fourier transform (STFT)-like representation $X$ by convolving the time-domain signal $x$ with an analysis filterbank. This representation is fed to the masker network, which estimates one mask per source. The masks are then multiplied entrywise with $X$ to obtain the source estimates $\widehat{S}_j$ in the STFT-like domain. The time-domain source estimates $\widehat{s}_j$ are finally obtained by applying transposed convolutions to $\widehat{S}_j$ with a synthesis filterbank. The three networks are jointly trained using a loss function computed on the masks or their embeddings [DPCLHershey2016, DPCL+Isik2016, DANetChen2017], on the STFT-like domain estimates [PITYu2016, tzinis2019twostep, Demyst2020Heitkaemper], or directly on the time-domain estimates [LSTMLuo2018, ConvLuo2018, CompehensiveBahmaninezhad2019, Wavesplit2020Zeghidour, DPRNNLuo2020].

Figure 1: Typical encoder-masker-decoder architecture.
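The encoder-masker-decoder flow can be sketched in a few lines of NumPy (all names here are illustrative, not Asteroid's API). With an orthogonal filterbank and non-overlapping frames, decoding exactly inverts encoding, and since decoding is linear, masks that sum to one yield source estimates that sum back to the mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, filters):
    """Analysis: split x into non-overlapping frames and project on the filters.
    filters: (n_filters, kernel_size); returns (n_filters, n_frames)."""
    k = filters.shape[1]
    frames = x.reshape(-1, k).T          # (kernel_size, n_frames)
    return filters @ frames

def decode(tf_rep, filters):
    """Synthesis: transposed operation (overlap-add degenerates to
    concatenation for non-overlapping frames)."""
    return (filters.T @ tf_rep).T.reshape(-1)

# Orthogonal filterbank (orthonormal rows), so filters.T is its inverse.
k = 16
filters, _ = np.linalg.qr(rng.standard_normal((k, k)))

x = rng.standard_normal(4 * k)           # toy mixture signal
X = encode(x, filters)                   # STFT-like representation

# A masker network would predict one mask per source; here we fake two
# masks that sum to one, apply them entrywise, and decode each estimate.
m1 = rng.uniform(size=X.shape)
m2 = 1.0 - m1
s1_hat = decode(m1 * X, filters)
s2_hat = decode(m2 * X, filters)
```

Because the masks sum to one and decoding is linear, `s1_hat + s2_hat` reconstructs `x` up to floating-point error.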

3 Functionality

Asteroid follows the encoder-masker-decoder approach, and provides various choices of filterbanks, masker networks, and loss functions. It also provides training and evaluation tools and recipes for several datasets. We detail each of these below.

3.1 Analysis and synthesis filterbanks

As shown in [CompehensiveBahmaninezhad2019, USSKavalerov2019, FilterbankDesign2019Pariente, MultiPhaseDitter2019], various filterbanks can be used to train end-to-end source separation systems. A natural abstraction is to separate the filterbank object from the encoder and decoder objects. This is what we do in Asteroid. All filterbanks inherit from the Filterbank class. Each Filterbank can be combined with an Encoder or a Decoder, which respectively follow the nn.Conv1d and nn.ConvTranspose1d interfaces from PyTorch for consistency and ease of use. Notably, the STFTFB filterbank computes the STFT using simple convolutions, and the default filterbank matrix is orthogonal.
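To make the STFT-as-convolution view concrete, here is a small sketch (our own variable names, not the STFTFB code): for a single frame, "convolving" with cosine and negated-sine filters reproduces the real and imaginary parts of the DFT computed by np.fft.rfft.

```python
import numpy as np

n = 16                                   # kernel (frame) size
t = np.arange(n)
freqs = np.arange(n // 2 + 1)            # rfft frequency bins

# Analysis filterbank: one cosine and one negated-sine filter per bin.
# The sine filters are negated to match the DFT sign convention.
cos_filters = np.cos(2 * np.pi * freqs[:, None] * t[None, :] / n)
sin_filters = -np.sin(2 * np.pi * freqs[:, None] * t[None, :] / n)

frame = np.random.default_rng(1).standard_normal(n)

# Convolution of a single frame reduces to an inner product per filter.
re = cos_filters @ frame
im = sin_filters @ frame

spec = np.fft.rfft(frame)                # reference DFT
```

Stacking `cos_filters` and `sin_filters` gives exactly the kind of fixed filterbank matrix that a convolutional encoder can hold.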

Asteroid supports free filters [LSTMLuo2018, ConvLuo2018], discrete Fourier transform (DFT) filters [USSKavalerov2019, Demyst2020Heitkaemper], analytic free filters [FilterbankDesign2019Pariente], improved parameterized sinc filters [SincNetRavanelli2018, FilterbankDesign2019Pariente] and the multi-phase Gammatone filterbank [MultiPhaseDitter2019]. Automatic pseudo-inverse computation and dynamic filters (computed at runtime) are also supported. Because some of the filterbanks are complex-valued, we provide functions to compute magnitude and phase, and to apply magnitude or complex-valued masks. We also provide interfaces to NumPy [NumPyVanDerWalt2011]. Additionally, the Griffin-Lim [GriffinLim1984, FastGriffinLim2013] and multi-input spectrogram inversion (MISI) [MISI2010] algorithms are provided.
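As a minimal illustration of magnitude masking on a complex-valued representation (hypothetical helper names, not Asteroid's function names), a real mask scales each time-frequency bin's magnitude while keeping the mixture phase:

```python
import numpy as np

def magphase(spec):
    """Split a complex STFT-like representation into magnitude and phase."""
    return np.abs(spec), np.angle(spec)

def apply_mag_mask(spec, mask):
    """Scale magnitudes by a real-valued mask while keeping the mixture phase."""
    mag, phase = magphase(spec)
    return (mask * mag) * np.exp(1j * phase)

# Toy complex representation: one rfft frame of a random signal.
spec = np.fft.rfft(np.random.default_rng(2).standard_normal(64))
masked = apply_mag_mask(spec, np.full(spec.shape, 0.5))
```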

3.2 Masker network


Asteroid provides implementations of widely used masker networks: TasNet's stacked long short-term memory (LSTM) network [LSTMLuo2018], Conv-TasNet's temporal convolutional network (with or without skip connections) [ConvLuo2018], and the dual-path recurrent neural network (DPRNN) of [DPRNNLuo2020]. Open-Unmix [OpenUnmix] is also supported for music source separation.

3.3 Loss functions — Permutation invariance

Asteroid supports several loss functions: mean squared error, scale-invariant signal-to-distortion ratio (SI-SDR) [ConvLuo2018, SISDRLeroux2019], scale-dependent SDR [SISDRLeroux2019], signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ) [PMSQE2018Donas], and the affinity loss for deep clustering [DPCLHershey2016].
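For illustration, SI-SDR can be computed in a few lines of NumPy (a sketch under our own naming, not Asteroid's implementation); multiplying the estimate by any nonzero scalar leaves the value unchanged, which is the scale invariance the loss is named after.

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB [SISDRLeroux2019]: project the estimate
    onto the reference, then compare target vs. residual energy."""
    ref = ref - ref.mean()
    est = est - est.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)  # optimal scaling
    target = alpha * ref
    noise = est - target
    return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))

rng = np.random.default_rng(3)
src = rng.standard_normal(8000)                 # toy reference source
est = src + 0.1 * rng.standard_normal(8000)     # noisy estimate
```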

Whenever the sources are of the same nature, a permutation-invariant training (PIT) loss shall be used [PITYu2016, uPITKolbaek2017]. Asteroid provides an optimized, versatile implementation of PIT losses. Let $\widehat{s} = [\widehat{s}_1, \dots, \widehat{s}_J]$ and $s = [s_1, \dots, s_J]$ be the estimated and true source signals, respectively. We denote by $s_{\sigma} = [s_{\sigma(1)}, \dots, s_{\sigma(J)}]$ a permutation of $s$ by $\sigma \in \mathcal{S}_J$, where $\mathcal{S}_J$ is the set of permutations of $[1, \dots, J]$. A PIT loss is defined as

$$\mathcal{L}_{\text{PIT}}(\theta) = \min_{\sigma \in \mathcal{S}_J} \mathcal{L}(\widehat{s}, s_{\sigma}), \qquad (2)$$

where $\mathcal{L}$ is a classical (permutation-dependent) loss function, which depends on the network's parameters $\theta$ through $\widehat{s}$.

We assume that, for a given permutation hypothesis $\sigma$, the loss can be written as

$$\mathcal{L}(\widehat{s}, s_{\sigma}) = \mathcal{G}\big(\ell(\widehat{s}_1, s_{\sigma(1)}), \dots, \ell(\widehat{s}_J, s_{\sigma(J)})\big), \qquad (3)$$

where $\ell$ computes the pairwise loss between a single true source and its hypothesized estimate, and $\mathcal{G}$ is the reduce function, usually a simple mean operation. Denoting by $M$ the $J \times J$ pairwise loss matrix with entries $M_{ij} = \ell(\widehat{s}_i, s_j)$, we can rewrite (3) as

$$\mathcal{L}(\widehat{s}, s_{\sigma}) = \mathcal{G}\big(M_{1\sigma(1)}, \dots, M_{J\sigma(J)}\big), \qquad (4)$$

and reduce the computational complexity from $\mathcal{O}(J!\,J)$ to $\mathcal{O}(J^2)$ pairwise loss computations by pre-computing $M$'s terms. Taking advantage of this, Asteroid provides PITLossWrapper, a simple yet powerful class that can efficiently turn any pairwise loss or permutation-dependent loss into a PIT loss.
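The factorization above can be sketched as follows (a toy NumPy version with hypothetical names, not the actual PITLossWrapper code): precompute the J-by-J pairwise loss matrix once, then search the J! permutations over that matrix.

```python
import numpy as np
from itertools import permutations

def pairwise_mse(est, ref):
    """M[i, j] = mean squared error between estimate i and true source j.
    est, ref: (J, T) arrays."""
    diff = est[:, None, :] - ref[None, :, :]
    return (diff ** 2).mean(axis=-1)              # (J, J)

def pit_loss(est, ref):
    """Min over permutations sigma of mean_j M[sigma(j), j],
    reusing the precomputed pairwise matrix M."""
    M = pairwise_mse(est, ref)                    # J**2 losses, computed once
    J = M.shape[0]
    return min(
        (M[list(sigma), range(J)].mean(), sigma)
        for sigma in permutations(range(J))
    )

rng = np.random.default_rng(4)
ref = rng.standard_normal((3, 100))
est = ref[[2, 0, 1]] + 0.01 * rng.standard_normal((3, 100))  # shuffled estimates
loss, sigma = pit_loss(est, ref)
```

Here `sigma` maps each true source index to the estimate matched to it; the wrapper finds the permutation undoing the shuffle.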

3.4 Datasets

Asteroid provides baseline recipes for the following datasets: wsj0-2mix and wsj0-3mix [DPCLHershey2016], WHAM [WHAMWichern2019], WHAMR [Whamr2019Maciejewski], LibriMix [Librimix2020], FUSS [FUSS2020], Microsoft's Deep Noise Suppression challenge dataset (DNS) [DNSChallenge2020], SMS-WSJ [SMSWSJ2019Drude], Kinect-WSJ [KinectWSJ2019], and MUSDB18 [musdb18]. Their characteristics are summarized and compared in Table 1. wsj0-2mix and MUSDB18 are today's reference datasets for speech and music separation, respectively. WHAM, WHAMR, LibriMix, SMS-WSJ and Kinect-WSJ are recently released datasets which address some shortcomings of wsj0-2mix. FUSS is the first open-source dataset to tackle the separation of arbitrary sounds. Note that wsj0-2mix is a subset of WHAM, which is a subset of WHAMR.

               wsj0-mix  WHAM    WHAMR   LibriMix  DNS          SMS-WSJ  Kinect-WSJ  MUSDB18  FUSS
Source types   speech    speech  speech  speech    speech       speech   speech      music    sounds
# sources      2 or 3    2       2       2 or 3    1            2        2           4        0 to 4
Noise                                                           *        **
# channels     1         1       1       1         1            6        4           2        1
Sampling rate  16k       16k     16k     16k       16k          16k      16k         16k      16k
Hours          30        30      30      210       100 (+aug.)  85       30          10       55 (+aug.)
Release year   2015      2019    2019    2020      2020         2019     2019        2017     2020

Table 1: Datasets currently supported by Asteroid. * White sensor noise. ** Background environmental scenes.

3.5 Training

For training source separation systems, Asteroid offers a thin wrapper around PyTorch-Lightning [LightningFalcon2019] that seamlessly enables distributed training, experiment logging and more, without sacrificing flexibility. Regarding optimizers, we rely on native PyTorch and on torch-optimizer: PyTorch provides basic optimizers such as SGD and Adam, and torch-optimizer provides state-of-the-art optimizers such as RAdam, Ranger or Yogi.

3.6 Evaluation

Evaluation is performed using a dedicated evaluation sub-toolkit of [PB_BSS_Drude]. It natively supports most metrics used in source separation: SDR, signal-to-interference ratio (SIR), signal-to-artifacts ratio (SAR) [SDRVincent2006], SI-SDR [SISDRLeroux2019], PESQ [PESQRix2001], and short-time objective intelligibility (STOI) [STOITaal2011].

4 Implementation

Asteroid follows Kaldi-style recipes [Povey2011Kaldi], which involve several stages as depicted in Fig. 2. These recipes implement the entire pipeline from data download and preparation to model training and evaluation. We show the typical organization of a recipe's directory in Fig. 3. The entry point of a recipe is a single script that executes the following stages:

  • Stage 0: Download data that is needed for the recipe.

  • Stage 1: Generate mixtures with the official scripts, optionally perform data augmentation.

  • Stage 2: Gather data information into text files expected by the corresponding DataLoader.

  • Stage 3: Train the source separation system.

  • Stage 4: Separate test mixtures and evaluate.

In the first stage, the necessary data is downloaded (if available) into a storage directory specified by the user. We use the official scripts provided by the dataset's authors to generate the data, and optionally perform data augmentation. All the information required by the dataset's DataLoader, such as filenames and paths, utterance lengths, speaker IDs, etc., is then gathered into text files under data/. The training stage is finally followed by the evaluation stage. Throughout the recipe, log files are saved under logs/ and generated data is saved under exp/.

Figure 2: Typical recipe flow in Asteroid.
├── data/                 # Output of stage 2
├── exp/                  # Stores experiments
├── logs/                 # Stores experiment logs
├── local/
│   ├── conf.yml          # Training configuration
│   └──                   # Dataset-specific preparation
├── utils/
│   ├──                   # Kaldi-style bash parser
│   └──                   # Package-level utilities
├──                       # Entry point
├──                       # Model definition
├──                       # Training script
└──                       # Evaluation script
Figure 3: Typical directory structure of a recipe.
Figure 4: Simplified code example.

As can be seen in Fig. 4, the model class, which is a direct subclass of PyTorch's nn.Module, is defined in a dedicated model file and imported by both the training and evaluation scripts. Instead of defining constants in the training and evaluation scripts, most of them are gathered in a YAML configuration file conf.yml. An argument parser is created from this configuration file so that all these values can be modified from the command line when calling the main script. The resulting modified configuration is saved in exp/ to enable future reuse. Other arguments such as the experiment name, the number of GPUs, etc., are directly passed to the main script.
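This configuration-to-command-line mechanism can be sketched as follows (a toy version with made-up config values; Asteroid's actual parsing utilities differ): flatten a conf.yml-style nested dict into an argument parser so every entry becomes an overridable option with its YAML value as default.

```python
import argparse

# Stand-in for the contents of conf.yml (values here are made up).
conf = {
    "training": {"epochs": 200, "batch_size": 8, "lr": 1e-3},
    "filterbank": {"n_filters": 512, "kernel_size": 16},
}

def parser_from_conf(conf):
    """Create one --section-key option per config entry, defaulting to
    the config value and keeping its type."""
    parser = argparse.ArgumentParser()
    for section, entries in conf.items():
        for key, value in entries.items():
            parser.add_argument(f"--{section}-{key}",
                                type=type(value), default=value)
    return parser

parser = parser_from_conf(conf)
args = parser.parse_args(["--training-lr", "0.01"])  # command-line override
```

Unspecified options keep their config defaults, so the full (possibly modified) configuration can be dumped from `args` for reuse.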

5 Example results

To illustrate the potential of Asteroid, we compare the performance of state-of-the-art methods as reported in the corresponding papers with that of our implementations. We do so on two common source separation datasets: wsj0-2mix [DPCLHershey2016] and WHAMR [Whamr2019Maciejewski]. wsj0-2mix consists of a 30 h training set, a 10 h validation set, and a 5 h test set of single-channel two-speaker mixtures without noise and reverberation. Utterances taken from the Wall Street Journal (WSJ) corpus are mixed together at random SNRs. Speakers in the test set are different from those in the training and validation sets. WHAMR [Whamr2019Maciejewski] is a noisy and reverberant extension of wsj0-2mix. Experiments are conducted on the 8 kHz min version of both datasets. Note that we use the wsj0-2mix separation, WHAM clean separation, and WHAMR anechoic clean separation tasks interchangeably, as the datasets only differ by a global scale.

Table 2 reports SI-SDR improvements on the test set of wsj0-2mix for several well-known source separation systems. For most architectures, our implementation outperforms the original results. In Table 3, we reproduce Table 2 of [Whamr2019Maciejewski], which reports the performance of an improved TasNet architecture (more recurrent units, overlap-add synthesis) on the four main tasks of WHAMR: anechoic separation, noisy anechoic separation, reverberant separation, and noisy reverberant separation. On all four tasks, Asteroid's recipes achieve better results than originally reported.

                                      Reported  Using Asteroid
Deep Clustering [DPCLHershey2016]       10.8
TasNet [LSTMLuo2018]                    10.8        15.0
Conv-TasNet [ConvLuo2018]               15.2        16.2
TwoStep [tzinis2019twostep]             16.1        15.2
DPRNN () [DPRNNLuo2020]                 16.0        17.7
DPRNN () [DPRNNLuo2020]                 18.8        19.3
Wavesplit [Wavesplit2020Zeghidour]      20.4         -

Table 2: SI-SDR (dB) on the wsj0-2mix test set for several architectures. ks stands for kernel size, i.e., the length of the encoder and decoder filters.
Noise  Reverb  Reported [Whamr2019Maciejewski]  Using Asteroid
  ✗      ✗                 14.2                      16.8
  ✓      ✗                 12.0                      13.7
  ✗      ✓                  8.9                      10.6
  ✓      ✓                  9.2                      11.0

Table 3: SI-SDR (dB) on the four WHAMR tasks using the improved TasNet architecture of [Whamr2019Maciejewski].

In both Tables 2 and 3, our implementations outperform the original ones in most cases. Most often, the aforementioned architectures are trained on 4-second segments. For the architectures requiring a large amount of memory (e.g., Conv-TasNet and DPRNN), we reduce the length of the training segments in order to increase the batch size and stabilize gradients. This, as well as using weight decay for recurrent architectures, increased the final performance of our systems.

Asteroid was designed such that writing new code is very simple and results can be obtained quickly. For instance, starting from stage 2, writing the TasNet recipe used in Table 3 took less than a day, and the results were generated with the single command in Fig. 5, where the GPU ID is specified with the --id argument.

for task in clean noisy reverb reverb_noisy; do
    ./ --stage 3 --task $task --id $n
done
Figure 5: Example command line usage.

6 Conclusion

In this paper, we have introduced Asteroid, a new open-source audio source separation toolkit designed for researchers and practitioners. Comparative experiments have shown that results obtained with Asteroid are competitive on several datasets and for several architectures. The toolkit was designed so that it can quickly be extended with new network architectures or new benchmark datasets. In the near future, pre-trained models will be made available, and we intend to interface with ESPnet to enable end-to-end multi-speaker speech recognition.