Deep learning relies on hierarchical representations that are commonly learned in a supervised way from large corpora. Access to such annotated corpora, however, is often expensive, making it of paramount interest to study techniques able to extract knowledge from unlabelled data. Some early examples of successful unsupervised learning approaches include deep autoencoders [13], which were mainly used to effectively pre-train deep neural networks. More recent techniques include variational autoencoders and generative adversarial networks. A related field that is gaining popularity is self-supervised learning, where targets are computed from the signal itself [8, 20]
by applying known transformations to the input data. Contrary to fully unsupervised approaches, self-supervision makes it easy to incorporate expert-derived priors into the training process by tasking the model to recover known signal transformations (which are cheaply derived without humans in the loop). Self-supervised learning was first adopted within the computer vision community to learn representations by solving various auxiliary tasks, such as colorizing grayscale images or solving puzzles from image patches. Self-supervised learning has also been applied successfully in language modeling, leading to models like BERT [7, 19]. Recently, a similar paradigm has been used to infer audio representations [16, 6], including learning speech representations with mutual information [36, 27]. Despite the recent progress, applying self-supervised learning to speech remains a challenge. Speech signals entail a complex hierarchical structure (samples → phonemes → syllables → words → sentences → semantic content) that contains relevant information at different time scales. Speech is also characterized by considerable variability, due to within- and cross-speaker variations, disfluencies, different languages, acoustic environments, and recording setups. It is thus difficult to infer relevant latent structures without any supervision guidance. Our recent attempt to learn speech representations with a multi-task self-supervised approach led us to the development of a problem-agnostic speech encoder (PASE), which turned out to learn meaningful speech information such as speaker identities, phonemes, and emotions. The underlying assumption is that each self-supervised task provides a different “view” of the speech signal, and by combining these different views into a unique representation, the model can learn more comprehensive representations.
PASE relies on a convolutional encoder followed by an ensemble of small neural networks, called workers, that are jointly trained to solve multiple self-supervised tasks. Our initial PASE variant provided promising results on several small-scale speech tasks; however, it was not explicitly designed to learn features robust against noise and reverberation.
In this paper, we improve the latter aspect with PASE+, a revised version of PASE
that significantly boosts its performance in challenging speech recognition tasks. Improvements include the development of an online speech distortion module that transforms clean speech segments into contaminated variants using reverberation, additive noise, temporal/frequency masking, clipping, and overlapped speech. Then, we combine our convolutional encoder with a quasi-recurrent neural network (QRNN). The QRNN can learn long-term dependencies very efficiently using convolutional gates across the time steps and a minimal recurrent pooling function. Finally, we introduce several novel workers that estimate a large variety of known speech transformations, for which self-supervision ground-truth targets are extracted from the original clean signals. This way, we not only take advantage of data augmentation, but we also encourage our encoder to perform denoising and learn distortion-invariant features. Our approach is different from earlier DNN-based acoustic feature extractors, as they exploited shared phonetic knowledge in a supervised manner, while PASE relies only on raw signals and self-supervised learning. Results, reported on the TIMIT, DIRHA, and CHiME-5 datasets, show that PASE+ significantly improves over PASE and outperforms traditional hand-crafted features. To foster reproducibility, we have made the PASE code and the pre-trained models publicly available at https://github.com/santi-pdp/pase.
2 Self-Supervised Learning with PASE+
PASE+, once pre-trained in a self-supervised way, can be used either as a standalone feature extractor (with frozen weights) or as part of a target acoustic model (with supervised training) that solves some task of interest (e.g., speech recognition). As shown in Fig. 1, PASE+ is equipped with a speech distortion module, a speech encoder that converts raw samples into a higher-level representation, and a set of twelve workers that are fed by the shared encoded features and cooperatively solve different self-supervised tasks. In this section, we describe how PASE+ is pre-trained in a self-supervised manner, with a particular focus on the main improvements proposed on top of the original PASE (cf. blue squares in Fig. 1).
2.1 Online speech contamination
To improve the robustness, we introduce a module that artificially contaminates the input speech with several distortions (see Tab. 1
for a summary). The contamination is active during self-supervised training only, and happens on-the-fly such that every input sentence is distorted in a different way. Each distortion transform is activated with a certain probability, and each speech segment can be corrupted by multiple distortions simultaneously. The activation probabilities were tuned during the model ablation, which showed that reverberation, additive noise, and frequency masking contribute the most to ASR performance. Reverberation is introduced by convolving the input signal with a set of 1300 impulse responses derived with the image method. The impulse responses simulate different acoustic conditions, with reverberation times ranging from 0.3 to 0.9 seconds. Additive noises are extracted from the FreeSound and DIRHA datasets [29, 30]
and include both background and non-stationary noises such as alarms, door knocks, telephone ringing, television, and many others. The signal-to-noise ratio (SNR) is randomly sampled between 0 and 10 dB. Frequency masking is performed by filtering the time signal with band-stop filters. Other considered disturbances include temporal masking (i.e., a random number of consecutive samples set to zero), clipping (i.e., a random degree of saturation), and overlapped speech (one non-dominant speaker in the background). The disturbances considered in this work are similar to those proposed in SpecAugment. As will be shown in Sec. 4, the contamination module proved important and is here used for the first time in a self-supervised training regime.
|Distortion|Prob.|Description|
|Reverberation|0.5|Convolution with a large set of impulse responses derived with the image method.|
|Additive Noise|0.4|Non-stationary noises from the FreeSound and the DIRHA datasets.|
|Frequency Mask|0.4|Convolution with band-stop filters that randomly drop one band of the spectrum.|
|Temporal Mask|0.2|Replacing a random number of consecutive samples with zeros.|
|Clipping|0.2|Adding a random degree of saturation to simulate clipping conditions.|
|Overlapped Speech|0.1|Adding another speech signal in the background that overlaps with the main one.|
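As an illustration, the on-the-fly contamination can be sketched as below. This is a minimal, hypothetical sketch (the function names are placeholders, and the impulse responses and noise clips stand in for the corpora described above):

```python
import numpy as np

rng = np.random.default_rng(0)

def reverberate(x, ir):
    """Convolve the signal with an impulse response, truncated to len(x)."""
    return np.convolve(x, ir)[: len(x)]

def add_noise(x, noise, snr_db):
    """Mix an additive noise at the requested signal-to-noise ratio (dB)."""
    p_sig = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise[: len(x)]

def temporal_mask(x, max_len=1000):
    """Zero out a random run of consecutive samples."""
    y = x.copy()
    start = rng.integers(0, max(1, len(x) - max_len))
    y[start : start + rng.integers(1, max_len)] = 0.0
    return y

def clip(x, saturation):
    """Saturate the waveform at a fraction of its peak amplitude."""
    threshold = saturation * np.max(np.abs(x))
    return np.clip(x, -threshold, threshold)

def contaminate(x, impulse_responses, noises):
    """Activate each distortion independently with its probability (Tab. 1)."""
    if rng.random() < 0.5:  # reverberation
        x = reverberate(x, impulse_responses[rng.integers(len(impulse_responses))])
    if rng.random() < 0.4:  # additive noise, SNR drawn in [0, 10] dB
        x = add_noise(x, noises[rng.integers(len(noises))], rng.uniform(0.0, 10.0))
    if rng.random() < 0.2:  # temporal masking
        x = temporal_mask(x)
    if rng.random() < 0.2:  # clipping
        x = clip(x, rng.uniform(0.3, 0.9))
    return x
```

Because the distortions are sampled independently, a given segment can be corrupted by several of them at once, matching the description above.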
2.2 PASE+ encoder
Similarly to the original PASE, the first layer of the encoder is based on SincNet, which performs the convolution of the raw input waveform with a set of parameterized sinc functions implementing rectangular band-pass filters. The subsequent layers are composed of seven convolutional blocks (Fig. 1
). Each block employs a one-dimensional convolution, followed by batch normalization (BN)
, and a parametric rectified linear unit (PReLU) activation. The set of convolutions emulates a sliding window with a shift of 10 ms, as done in common speech feature extractors. PASE+ improves our previous encoder architecture as follows:
Skip connections: the final representation is the sum of the features discovered by the intermediate convolutional layers, which are linearly projected and downsampled to match the dimensionality and length of the output embedding sequence. Skip connections introduce shortcuts in the encoder architecture, which shuttle different levels of abstraction to the final representation and improve gradient flow.
Quasi-RNN: PASE+ can learn long-term dependencies efficiently with a QRNN placed on top of the convolutional layers. The QRNN is based on multiplicative gates implemented with 1-D convolutions and a minimalist recurrent pooling function, as shown in the following equations:

$Z = \tanh(W_Z * X)$, $F = \sigma(W_F * X)$, $O = \sigma(W_O * X)$,
$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot z_t$, $h_t = o_t \odot c_t$,

where $Z$, $F$, and $O$ are the multiplicative gates parameterized by the weight matrices $W_Z$, $W_F$, and $W_O$, $*$ and $\odot$ denote convolution and element-wise product, while $c_t$ and $h_t$ are the cell-state and hidden-state vectors at time $t$, respectively. The QRNN gates do not rely on previous computations and can be computed in parallel for all the time steps. The QRNN provides a performance similar to that of more traditional LSTM or GRU models at a lower computational load. To the best of our knowledge, QRNNs are here used for the first time in a multi-task self-supervised setting.
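To make the SincNet first layer of the encoder concrete, the sketch below builds one SincNet-style rectangular band-pass filter as the difference of two windowed low-pass sinc functions. It is an illustrative approximation only: the kernel size, sampling rate, and normalization are assumptions, and in SincNet the two cutoff frequencies are the learned parameters.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=251, fs=16000):
    """Band-pass FIR filter: difference of two low-pass sinc filters,
    smoothed with a Hamming window. f_low/f_high are cutoffs in Hz."""
    # Symmetric time axis in seconds (kernel_size is assumed odd).
    t = np.arange(-(kernel_size // 2), kernel_size // 2 + 1) / fs

    def low_pass(fc):
        # Ideal low-pass impulse response with cutoff fc.
        return 2.0 * fc * np.sinc(2.0 * fc * t)

    h = (low_pass(f_high) - low_pass(f_low)) * np.hamming(kernel_size)
    # Normalize to (approximately) unit gain in the passband.
    return h / np.max(np.abs(np.fft.rfft(h, 4096)))
```

Convolving the raw waveform with a bank of such kernels gives the band-limited outputs that the subsequent convolutional blocks process.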
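The QRNN equations above can be sketched in code as follows. The gate computation is fully parallel over time (a width-1 convolution, i.e. a per-frame linear map, is assumed here for brevity), while only the cheap fo-pooling recurrence is sequential:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrnn_gates(X, Wz, Wf, Wo):
    """Compute all gates in parallel over the whole sequence X (T, D)."""
    return np.tanh(X @ Wz), sigmoid(X @ Wf), sigmoid(X @ Wo)

def qrnn_fo_pool(Z, F, O):
    """Minimalist recurrent fo-pooling: the only sequential computation.
    Z, F, O have shape (T, H); F and O are already in (0, 1)."""
    c = np.zeros(Z.shape[1])
    h = np.empty_like(Z)
    for t in range(Z.shape[0]):
        c = F[t] * c + (1.0 - F[t]) * Z[t]  # cell state
        h[t] = O[t] * c                     # hidden state
    return h
```

Since the matrix products inside `qrnn_gates` touch no recurrent state, they parallelize across all time steps, which is where the speed advantage over LSTMs and GRUs comes from.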
2.3 Workers
Workers are implemented as small feed-forward neural networks (typically one hidden layer with 256 hidden units) that solve twelve self-supervised tasks, defined as regression or binary classification problems. Their capacity is deliberately limited to encourage the encoder to discover high-level features that can be successfully exploited even by classifiers with limited modeling power. Importantly, the workers' supervised targets are extracted from the original clean signals and not from the distorted versions. This way, we force PASE+ to perform implicit denoising and learn robust features.
2.3.1 Regression Tasks
Regression workers are trained to minimize the mean squared error (MSE) between speech features (used as labels) and network predictions. The motivation is to leverage well-known speech transformations to inject prior knowledge into the encoder. In our previous work, we used four regression workers that estimate common speech features, namely the log power spectrum (LPS), MFCCs, prosody features, and the speech waveform itself in an autoencoder fashion. Given the crucial importance of these tasks, we here extend the regressors in the following way:
Adding more features: we added new workers that estimate 40 FBANK and 40 Gammatone features.
Estimating longer context: PASE+ estimates all the speech features along with their first and second derivatives. Moreover, instead of estimating the current feature only, we jointly estimate multiple features within a context window of seven neighbouring frames. This way, we help our local representation to embed information from a larger context.
Estimating features on longer windows: we further added new workers that estimate the aforementioned features computed over longer analysis windows of 200 ms, rather than the shorter windows used by the other regressors (see long workers in Fig. 1). We found this useful because it makes our representation more aware of the average characteristics of the speech signal within a local context.
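The context-window targets described above can be illustrated as follows. This is a minimal sketch (the function names are hypothetical, and simple first-order differences stand in for the actual delta features):

```python
import numpy as np

def deltas(feat):
    """First-order differences as a simple stand-in for delta features.
    feat has shape (T, D); the first frame's delta is zero-padded."""
    return np.vstack([np.zeros((1, feat.shape[1])), np.diff(feat, axis=0)])

def context_targets(feat, context=7):
    """Stack each frame with its context//2 left and right neighbours, so
    a worker jointly predicts a window of frames (edges are edge-padded).
    Returns an array of shape (T, context * D)."""
    half = context // 2
    padded = np.pad(feat, ((half, half), (0, 0)), mode="edge")
    T = feat.shape[0]
    return np.stack([padded[t : t + context].reshape(-1) for t in range(T)])
```

A regression worker would then be trained with MSE against `context_targets(np.hstack([feat, deltas(feat), deltas(deltas(feat))]))`, i.e., the features, their derivatives, and the seven-frame window combined.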
2.3.2 Binary Tasks
Binary workers solve tasks that capture higher-level information from the speech signal. These tasks rely on a pre-defined sampling strategy that draws anchor, positive, and negative samples from the pool of PASE-encoded representations. The adopted neural networks are simple binary classifiers that are trained to maximize the mutual information between the anchor and the positive representations. Mutual information is a very meaningful measure of divergence that can capture complex non-linear relationships between random variables [3, 14]. Depending on the adopted sampling strategy, we can derive different tasks. The ones used in PASE+ are the following:
Local info max (LIM): as proposed in 
, we draw the positive sample from PASE features extracted within the same sentence as the anchor, and a negative sample from another random sentence (which likely belongs to a different speaker). Given the large receptive field in PASE+, each encoded frame embeds a relatively large context. This worker can thus learn how to discriminate between speakers, since the speaker identity is a reliable constant factor within random sequences of local features.
Global info max (GIM): differently from LIM, this worker relies on global information. The anchor and positive representations are obtained by averaging all the PASE features extracted from long chunks of 2 seconds belonging to the same sentence. The negative one is obtained from another sentence. GIM encourages PASE to learn representations containing high-level information on the input sequence, which are hopefully complementary to those learned by LIM.
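The two sampling strategies can be sketched as below. This is a simplified illustration with hypothetical function names: `feats_a` and `feats_b` stand for the frame-level PASE features of two different sentences, and the chunking details follow the text above only loosely:

```python
import numpy as np

rng = np.random.default_rng(0)

def lim_samples(feats_a, feats_b):
    """LIM: anchor and positive are local frames from the same sentence
    (feats_a); the negative comes from another random sentence (feats_b)."""
    anchor = feats_a[rng.integers(len(feats_a))]
    positive = feats_a[rng.integers(len(feats_a))]
    negative = feats_b[rng.integers(len(feats_b))]
    return anchor, positive, negative

def gim_samples(feats_a, feats_b):
    """GIM: anchor and positive are global averages over two long chunks
    of the same sentence; the negative is the average of another sentence."""
    half = len(feats_a) // 2
    anchor = feats_a[:half].mean(axis=0)
    positive = feats_a[half:].mean(axis=0)
    negative = feats_b.mean(axis=0)
    return anchor, positive, negative
```

In both cases the binary worker then scores (anchor, positive) pairs against (anchor, negative) pairs, which lower-bounds the mutual information between the paired representations.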
2.4 Self-supervised Training
Encoder and workers are jointly trained with backpropagation by optimizing a total loss that is computed as the average of the individual worker costs. We conducted experiments adding dynamic weights to each worker, exploring, for instance, the use of hypervolume maximization. The considered methods, however, did not provide benefits in our framework when compared to a simpler unweighted loss average. All the neural networks are optimized with Adam, using an initial learning rate that is annealed with a polynomial scheduler. We use mini-batches of 32 waveform chunks, each 2 seconds long. The system is trained for 200 epochs.
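The unweighted multi-task objective and the polynomial annealing can be sketched as follows (the initial learning rate and the polynomial power are not specified in this excerpt, so both appear here as placeholder parameters):

```python
def total_loss(worker_losses):
    """Total cost: plain (unweighted) average of the per-worker costs."""
    return sum(worker_losses) / len(worker_losses)

def poly_lr(lr0, step, total_steps, power=1.0):
    """Polynomial learning-rate decay from lr0 down to zero."""
    return lr0 * (1.0 - step / total_steps) ** power
```

With twelve workers, `total_loss` simply averages twelve scalar costs per batch, so no task dominates the gradient by construction.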
3 Corpora and ASR setup
Self-supervised training is performed with a portion of 50 hours of the LibriSpeech dataset. Target speech recognition experiments are performed with different out-of-domain datasets. The first set of experiments is carried out using TIMIT. Along with the original clean version, we generated a contaminated version of TIMIT using noise sequences and real impulse responses (different from those used for self-supervised training). To assess our approach on a larger dataset, we also employ the DIRHA dataset. Training and validation sets are based on the original WSJ-5k corpus (consisting of 7138 sentences uttered by 83 speakers) simulated in a domestic environment. The test set is composed of 409 WSJ sentences recorded by six American speakers in a domestic environment with a reverberation time of 0.7 seconds and an average SNR of 10 dB. Finally, a set of experiments is performed with the CHiME-5 dataset, which is based on real recordings of dinner parties. CHiME-5 is a challenging task characterized by noise, reverberation, overlapping speech, and conversational speech.
This work uses hybrid HMM-DNN speech recognizers. TIMIT and DIRHA experiments are performed with the PyTorch-Kaldi toolkit
using a six-layer multi-layer perceptron and a light GRU, respectively. The performance reported on TIMIT is the average of the phone error rates (PER%) obtained by running each experiment three times with different seeds. CHiME-5 results are based on Kaldi  and rely on a time-delay neural network (TDNN) [37, 24] acoustic model trained on PASE+ features.
4.1 Model ablation
First of all, we present some results showing the effectiveness of the proposed interventions. Tab. 2 reports the PER (%) obtained on the clean and noisy versions of TIMIT when we progressively improve the original version of PASE. For this experiment, PASE is frozen and used as a simple feature extractor.
|Model|PER clean (%)|PER noisy (%)|
|PASE (10h)|18.6|41.1|
|+ 50 hours|18.3|39.9|
|+ Speech distortions|18.1|37.6|
|+ Skip connection|18.0|36.2|
|+ Embedding 256|17.8|34.8|
|+ New workers|17.2|33.8|
The first row shows the results achieved with the original version of PASE, which was trained on only 10 hours of LibriSpeech. In the second row, we observe some benefits in both clean and noisy conditions when training PASE with 50 hours. Then, we show the impact of the contamination module. Interestingly, adding online distortions improves the performance not only in the noisy context but also in the clean one. Data augmentation, in fact, acts as a powerful regularizer that helps especially when the supervised classifier is trained with a small dataset like TIMIT. The QRNN and skip-connection rows show the improvement due to the revised encoder. As the table shows, the QRNN has a major impact in noisy and reverberated conditions, where embedding a longer context is more important. We also found some benefits when increasing the dimensionality of the representation from 100 to 256 (see Embedding 256). This work does not aim to compress the signal, but rather to represent it in a form that can be better exploited by the subsequent supervised classifier. Increasing the dimensionality helps to better disentangle the most important factors of variability that characterize speech. Finally, we report the results achieved when adopting an extended set of workers. This leads to a substantial improvement, confirming the importance of adding more relevant self-supervised tasks. We observe some benefits even though the speech representations estimated by some regressors (e.g., MFCCs, FBANK, Gammatone) are highly correlated. We think that this could add another regularization effect that helps to learn more robust representations.
Overall, the proposed PASE+ significantly outperforms our previous version, with a relative improvement of 9.5% in the clean scenario and of 17.7% in noisy conditions.
4.2 Comparison with standard speech features
We now compare PASE+ with the most popular hand-crafted features used in speech recognition. Table 3 compares MFCC, FBANK, and Gammatone features, as well as the concatenation of all three, with PASE+ features on TIMIT (Rev+Noise) and DIRHA. The results show that PASE+ (Frozen) outperforms all the hand-crafted features, with a relative WER improvement of 13.5% over their best performance on DIRHA. PASE+ (Frozen) also outperforms the supervised end-to-end baseline PASE (Supervised), which is trained directly from the raw waveform without taking advantage of the self-supervised pre-training. As observed previously, the best performance is achieved when fine-tuning the encoder representation during supervised training, PASE+ (FineTune), leading to a relative improvement of 3.1% over the frozen version.
Finally, Table 4 reports the WER (%) obtained on the CHiME-5 dataset. In this distant-speech scenario, PASE+ acting as a feature extractor (i.e., frozen weights pre-trained on 50 hours of LibriSpeech data) performs better than the MFCC-based system (3.0% relative improvement) and approaches the speaker-adaptively trained MFCC+ivectors variant (-1.3% relative difference). We also report combination scenarios in which PASE+ features were found complementary to both MFCCs and ivectors, offering relative performance improvements of 1.2% and 3.7% when combined with MFCC and MFCC+ivector features, respectively. Our results on an extremely challenging task such as CHiME-5 confirm the transferability of self-supervised PASE+ features to highly mismatched and realistic acoustic environments.
|MFCC + ivectors|74.1|65.7|
|PASE (Frozen, 10h)|77.9|72.0|
This work studied a multi-task self-supervised approach for robust speech recognition. The proposed PASE+ architecture is based on an online speech distortion module, a convolutional encoder coupled with a QRNN layer, and a set of workers solving self-supervised problems. PASE+ turned out to significantly outperform standard acoustic features on different speech recognition tasks when used with frozen weights, and offers further gains when optimized end-to-end with the target acoustic model objective (here tuned for speech recognition). PASE+ also offers a remarkable level of transferability: our system is trained on artificially distorted signals from a subset of LibriSpeech and provides a good performance even in challenging acoustic scenarios with unseen and realistic noisy environments.
This work investigated the potential of pure self-supervised representations. A natural extension will be the exploration of semi-supervised frameworks, where additional supervised workers (e.g., a speech or a speaker recognizer) can be combined with the self-supervised ones to learn a better representation. We believe that future speech processing technologies will benefit more from this type of pre-trained models, as happens nowadays in computer vision and natural language processing. Supported by the evidence from the reported experiments, in future work we will explore the usability of PASE+ for other downstream tasks (e.g., speaker, emotion, and language recognition), as well as in sequence-to-sequence neural speech recognition.
The work reported here was started at JSALT 2019, and supported by JHU with gifts from Amazon, Facebook, Google, and Microsoft. This work was also supported by NSERC, Samsung, Compute Canada, NCI/Intersect Australia and the project TEC2015-69266-P (MINECO/FEDER, UE). Special thanks to Maurizio Omologo, Dmitriy Serdyuk, Loren Lugosch, Renato De Mori, Najim Dehak, Hynek Hermansky, and all the JSALT-coop team for helpful discussions.
-  (2019) Multi-objective training of generative adversarial networks with multiple discriminators. In Proc. of ICLR, Cited by: §2.4.
-  (1979) Image method for efficiently simulating small-room acoustics. JASA 65 (4), pp. 943–950. Cited by: §2.1.
-  (2018) Mutual information neural estimation. In Proc. of ICML, Cited by: §2.3.2.
-  (2006) Greedy layer-wise training of deep networks. In Proc. of NIPS, Cited by: §1.
-  (2017) Quasi-Recurrent Neural Networks. Proc. of ICLR. Cited by: §1, 2nd item.
-  (2019) Unsupervised speech representation learning using WaveNet autoencoders. CoRR abs/1901.08810. Cited by: §1.
-  (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. Cited by: §1.
-  (2017) Multi-task self-supervised visual learning. In Proc. of ICCV, Cited by: §1.
-  (1993) DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM. NIST. Cited by: §3.
-  (2019) Rethinking learning rate schedules for stochastic optimization. In OpenReview, Cited by: §2.4.
-  (2014) Generative adversarial nets. In Proc. of NIPS, Cited by: §1.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proc. of ICCV, Cited by: §2.2.
-  (2006) A fast learning algorithm for deep belief nets. Neural Computation 18, pp. 1527–1554. Cited by: §1.
-  (2018) Learning deep representations by mutual information estimation and maximization. Proc. of ICLR. Cited by: §2.3.2.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. of ICML, Cited by: §2.2.
-  (2018) Unsupervised learning of semantic audio representations. In Proc. of ICASSP, Cited by: §1.
-  (2014) Auto-encoding variational Bayes. In Proc. of ICLR, Cited by: §1.
-  (2015) Adam: A method for stochastic optimization. In Proc. of ICLR, Cited by: §2.4.
-  (2020) ALBERT: A lite BERT for self-supervised learning of language representations. In Proc. of ICLR, Cited by: §1.
-  (2016) Shuffle and learn: unsupervised learning using temporal order verification. In Proc. of ECCV, Cited by: §1.
-  (2016) Unsupervised learning of visual representations by solving jigsaw puzzle. In Proc. of ECCV, Cited by: §1.
-  (2015) Librispeech: an ASR corpus based on public domain audio books. In Proc. of ICASSP, Cited by: §3.
-  (2019) Learning problem-agnostic speech representations from multiple self-supervised tasks. In Proc. of Interspeech, Cited by: §1, §2.2, §2.3.1, §2, §4.1, §4.2, Table 2.
-  (2015) A time delay neural network architecture for efficient modeling of long temporal contexts. In Proc. of Interspeech, Cited by: §3.
-  (2011) The Kaldi Speech Recognition Toolkit. In Proc. of ASRU, Cited by: §3.
-  (2018) Interpretable convolutional filters with SincNet. Proc. of IRASL@NIPS. Cited by: §2.2.
-  (2019) Learning speaker representations with mutual information. In Proc. of Interspeech, Cited by: §1, 1st item.
-  (2017) Improving speech recognition by revising gated recurrent units. In Proc. of Interspeech, Cited by: §3.
-  (2015) The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments. In Proc. of ASRU, Cited by: §2.1, §3.
-  (2015) Contaminated speech training methods for robust DNN-HMM distant speech recognition. In Proc. of Interspeech, Cited by: §2.1.
-  (2019) The PyTorch-Kaldi Speech Recognition Toolkit. In Proc. of ICASSP, Cited by: §3.
-  (2016) Realistic multi-microphone data simulation for distant speech recognition. In Proc. of Interspeech, Cited by: §3.
-  (2007) Gammatone features and feature combination for large vocabulary speech recognition. In Proc. of ICASSP, Cited by: 1st item.
-  (2013) Multi-level adaptive networks in tandem and hybrid ASR systems. In Proc. of ICASSP, pp. 6975–6979. Cited by: §1.
-  (2018) The fifth ‘CHiME’ speech separation and recognition challenge: dataset, task and baselines. In Proc. of Interspeech, Cited by: §3.
-  (2018) Representation learning with contrastive predictive coding. Arxiv. Cited by: §1.
-  (1989) Phoneme recognition using time-delay neural networks. IEEE TASLP 37 (3), pp. 328–339. Cited by: §3.
-  (2016) Colorful image colorization. In Proc. of ECCV, Cited by: §1.
-  (2019) SpecAugment: a simple augmentation method for automatic speech recognition. In Proc. of Interspeech, Cited by: §2.1.