The goal of acoustic representation learning is to transform raw or surface features into high-level features that are more accessible to acoustic tasks[tzanetakis2000marsyas]. Making acoustic representations more general and robust is critical to improving the performance of acoustic tasks. However, the labeled data available for a specific acoustic task may be limited, so the learned representations can be less robust and performance can degrade on unseen data. On the other hand, there exists a wide variety of acoustic tasks, ranging from speaker verification[reynolds2000speaker] and speech recognition[povey2011kaldi] to event and scene detection[schuller2018interspeech]. With supervised learning, a representation that is useful for one task may be less suited to another. It is therefore worthwhile to explore how to utilize all kinds of datasets to learn a general and robust representation for all kinds of acoustic tasks.
Unsupervised pre-training provides an appealing way to learn more general and robust high-level features that are less specialized towards solving a single supervised task. The training objective of unsupervised pre-training depends only on the acoustic features themselves and not on any downstream target. Because of this, much more unlabeled data can be utilized, so that a larger and more general model can be learned. At the same time, the learned representations can be used directly or fine-tuned for specific downstream tasks.
Contrastive Predictive Coding (CPC)[oord2018representation] provides a universal unsupervised learning approach for extracting useful representations from high-dimensional data. It uses an autoregressive mechanism to predict future information and can therefore only be applied to uni-directional models. Masked Predictive Coding (MPC)[jiang2019improving] was proposed to pre-train on speech data in an unsupervised manner for speech recognition. It uses a bidirectional Transformer based architecture with a Masked-LM[devlin2018bert] like structure to perform predictive coding. The pre-trained representations can be further fine-tuned to improve specific speech recognition tasks. However, representations pre-trained with this method have not yet been applied to other kinds of acoustic tasks, and the performance of this unsupervised pre-training method on non-speech audio tasks remains unknown.
In this paper, we take inspiration from MPC and use a Transformer[vaswani2017attention] based unsupervised pre-training method for acoustic representation learning. The Transformer based encoder can be pre-trained on a large amount of unlabeled audio from various kinds of datasets. After pre-training, all that remains is to add a decoder layer targeted at the downstream task and fine-tune the whole model. We demonstrate that our method learns a more general and robust acoustic representation, which significantly improves the performance of various kinds of acoustic tasks.
2 Related Work
Contrastive Predictive Coding (CPC) provides a universal unsupervised learning approach, and the learned representation achieves strong performance in four domains: speech, images, text and reinforcement learning in 3D environments. The model is mainly composed of two parts: a non-linear encoder $g_{enc}$ and an autoregressive model $g_{ar}$. Given an input sequence $(x_1, x_2, \dots, x_T)$, $g_{enc}$ encodes each observation $x_t$ into a latent embedding space, $z_t = g_{enc}(x_t)$, and $g_{ar}$ accepts $z_{\le t}$ to produce a context representation $c_t = g_{ar}(z_{\le t})$. To predict future observations $x_{t+k}$, a density ratio $f_k(x_{t+k}, c_t)$ is modelled so as to maximally preserve the mutual information between $x_{t+k}$ and $c_t$. To optimize $g_{enc}$ and $g_{ar}$, the following contrastive loss is minimized:
$$\mathcal{L}_N = -\mathbb{E}_X\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right]$$
where $N$ represents the number of samples in $X = \{x_1, \dots, x_N\}$, with one positive sample drawn from the distribution $p(x_{t+k} \mid c_t)$ and the rest being negative samples drawn from the distribution $p(x_{t+k})$.
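The contrastive loss described above can be illustrated for a single prediction step. This is a minimal sketch, not the CPC implementation: it assumes the density ratio is scored as the exponential of a dot product between a candidate embedding and an already projected context vector, and the function name `info_nce_loss` is hypothetical.

```python
import math


def info_nce_loss(context, candidates, positive_index):
    """Contrastive loss for one prediction step.

    context: context vector (list of floats), assumed to be already
        projected into the same space as the candidates.
    candidates: N encoded samples, exactly one of which is the true
        future sample; the others act as negatives.
    positive_index: position of the true future sample in `candidates`.
    """
    # Assumed scoring function: log of the density ratio is a dot product.
    scores = [sum(a * b for a, b in zip(cand, context)) for cand in candidates]
    # Numerically stable log-sum-exp over all N candidates (the denominator).
    peak = max(scores)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scores))
    # Negative log-probability of picking the positive among the N candidates.
    return log_denominator - scores[positive_index]
```

When every candidate scores the same, the loss equals log N, which is the value a model with no predictive information would obtain; the loss falls as the positive sample is scored above the negatives.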
Autoregressive Predictive Coding (APC)[chung2019unsupervised] also proposed an autoregressive model for unsupervised speech representation learning. It uses a deep LSTM network and trains the model to predict frames several steps ahead of the current frame. APC has demonstrated a strong capability for extracting useful phone and speaker information.
To learn a general high-level acoustic representation, we pre-train a Transformer based encoder in an unsupervised manner. The architecture of the Transformer based encoder is illustrated in Figure 1(a).
Figure 1(b) shows our unsupervised pre-training procedure. 15% of the frames of the acoustic feature sequence are masked with zeros, and the objective of pre-training is similar to that of [jiang2019improving]: to restore the masked frames given the left and right context features. However, our method differs from [jiang2019improving] in two respects. First, we use a different masking mechanism. The CNN modules of the Transformer based encoder downsample the input, so that the frame sequence is shortened N-fold. To preserve the masked information after downsampling, we split the frames into chunks of N frames each, randomly select 15% of all chunks, and mask all frames of the selected chunks with zeros. Second, the Transformer encoder is followed by a feed-forward layer that outputs the prediction, in which each frame-level prediction corresponds to N real frames of the input sequence. With these changes, we use the L1 loss to minimize the gap between the predicted frames and the corresponding real frames.
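The chunk-level masking and the L1 objective can be sketched as follows. This is only an illustration under stated assumptions: the names `mask_chunks` and `l1_loss` are hypothetical, a trailing partial chunk is never masked, at least one chunk is always masked, and the text does not say whether the loss is restricted to masked positions, so the helper simply compares whatever sequences it is given.

```python
import random


def mask_chunks(frames, chunk_size, mask_ratio=0.15, rng=None):
    """Zero out whole chunks of `chunk_size` consecutive frames.

    frames: list of feature vectors (lists of floats).
    Returns (masked_frames, sorted ids of the masked chunks).
    """
    if rng is None:
        rng = random.Random()
    # Assumption: frames left over after the last full chunk are not masked.
    num_chunks = len(frames) // chunk_size
    # Assumption: mask at least one chunk whenever any full chunk exists.
    num_masked = max(1, round(mask_ratio * num_chunks)) if num_chunks else 0
    chosen = set(rng.sample(range(num_chunks), num_masked))
    masked = []
    for i, frame in enumerate(frames):
        if i // chunk_size in chosen:
            masked.append([0.0] * len(frame))  # every frame of the chunk is zeroed
        else:
            masked.append(list(frame))
    return masked, sorted(chosen)


def l1_loss(predicted, target):
    """Mean absolute error between two equally shaped frame sequences."""
    total = sum(abs(p - t)
                for pred_frame, true_frame in zip(predicted, target)
                for p, t in zip(pred_frame, true_frame))
    count = sum(len(true_frame) for true_frame in target)
    return total / count
```

Masking whole chunks of N frames means that each downsampled position seen by the Transformer layers is either fully masked or fully visible, which is the stated motivation for the change.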
For fine-tuning, the Transformer encoder needs to be pre-trained only once and can then be adapted to a variety of acoustic tasks, regardless of whether the downstream task deals with speech or non-speech acoustic sequences and whether its output is a sequence or a tag. All that is needed is to add a decoder layer after the pre-trained encoder and fine-tune the whole model for the specific task. The choice of decoder layer depends on the task, as shown in Figure 1(c). We use a Transformer decoder for seq-to-seq tasks and task-specific pooling layers for tagging tasks.
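For tagging tasks, the simplest decoder of this kind averages the encoder outputs over time and applies one feed-forward projection. The sketch below is a plain-Python illustration with hypothetical names (`average_pool`, `linear`, `tag`) and toy weights; it is not the model code used in the experiments.

```python
def average_pool(encoder_outputs):
    """Average frame-level vectors into one utterance-level vector."""
    num_frames = len(encoder_outputs)
    dim = len(encoder_outputs[0])
    return [sum(frame[d] for frame in encoder_outputs) / num_frames
            for d in range(dim)]


def linear(vector, weights, bias):
    """Feed-forward projection producing one score per class."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def tag(encoder_outputs, weights, bias):
    """Return the index of the highest-scoring class for the whole sequence."""
    scores = linear(average_pool(encoder_outputs), weights, bias)
    return max(range(len(scores)), key=scores.__getitem__)
```

Because pooling collapses the time axis, the same head works for sequences of any length, which is what allows one pre-trained encoder to serve different tagging tasks.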
To demonstrate the effectiveness of our unsupervised pre-training method on various kinds of acoustic tasks, we selected three representative kinds of tasks: speech translation, speech emotion recognition and acoustic event detection.
To pre-train a model on a larger dataset so that it can be adapted to various kinds of downstream tasks, we merge the MuST-C En-De (408 hours), Librispeech[panayotov2015librispeech] (960 hours) and ESC-US[piczak2015esc] (347 hours) datasets into one dataset of about 1,715 hours, which we call OpenAudio. ESC is an open dataset for environmental sound classification, and ESC-US is its compilation of 250k unlabeled clips. We did not use speed perturbation for pre-training, but for fine-tuning on every downstream task we used speed perturbation with factors of 0.9 and 1.1 for data augmentation.
We use 40-dimensional Mel filter-bank features extracted from the audio signals with a window size of 25 ms and a step size of 10 ms, both for pre-training and for fine-tuning on all downstream tasks.
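To give a sense of the resulting sequence lengths, the number of feature frames for a clip can be estimated from the window and step sizes. The helper below is a hypothetical sketch that assumes no padding at the signal edges; feature extraction toolkits differ on this point, so actual counts may be off by a frame or two.

```python
def num_frames(duration_ms, window_ms=25, step_ms=10):
    """Number of full analysis windows that fit in a signal (no padding)."""
    if duration_ms < window_ms:
        return 0
    # The first window starts at 0 ms; each further window starts step_ms later.
    return 1 + (duration_ms - window_ms) // step_ms
```

Under these assumptions a 10 s segment yields 998 frames of 40 dimensions each.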
4.2 Experimental setups
For the Transformer based model, we use the structure described above with a hidden dimension of 256, a feed-forward size of 2048, 4 attention heads, a dropout rate of 0.1 and 12 encoder layers for all tasks.
We pre-trained our model on OpenAudio only once and fine-tuned it on each downstream task. Pre-training ran on 4 GPUs with a total batch size of 256 for 50 epochs. We used the Adam optimizer[kingma2014adam] with the warmup schedule of [vaswani2017attention], according to the formula:
$$\mathrm{lrate} = k \cdot d_{model}^{-0.5} \cdot \min\left(n^{-0.5},\; n \cdot \mathrm{warmup}_n^{-1.5}\right)$$
where $n$ is the step number and $d_{model}$ is the hidden dimension of the model. k = 0.5 and warmup_n = 8000 were chosen for all experiments. For comparison, we also pre-trained our model on each task using only its own training data, with the same setup as described above.
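The schedule can be sketched in code. This assumes it has the form of the warmup rule of [vaswani2017attention] multiplied by the constant k, and that the model dimension in that rule is the 256-dimensional hidden size; the function and argument names are hypothetical.

```python
def learning_rate(step, k=0.5, d_model=256, warmup_steps=8000):
    """Transformer-style warmup schedule scaled by a constant k.

    The rate grows linearly for `warmup_steps` steps and then decays
    proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)  # avoid raising 0 to a negative power at step 0
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Under these assumptions the rate peaks at step 8000 at roughly 3.5e-4. A larger k scales the whole curve up, and a larger warmup value postpones and lowers the peak.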
4.3 Speech translation
The aim of speech translation is to translate speech in one language directly into text in another language. We used the MuST-C English-to-German (En-De) and English-to-French (En-Fr) datasets[di2019must], which are commonly used in previous speech translation studies[indurthi2020end, Mattia2019Adapting, inaguma2020espnet]. For each target language, MuST-C comprises at least 385 hours of audio recordings from English TED Talks. For fine-tuning, we used a 6-layer Transformer decoder as the decoder layer. To avoid overfitting, we also used label smoothing with a rate of 0.1. Similar to [inaguma2020espnet], we used a vocabulary of 8k units based on byte pair encoding (BPE)[sennrich-etal-2016-neural]. The model was trained on 4 GPUs with a total batch size of 512 for 50 epochs. We used the same optimizer as in pre-training, except that k = 2.5 and warmup_n = 25000. For evaluation, we averaged the best 5 checkpoints from training. We used beam search with a beam size of 10, and performance was evaluated using case-sensitive 4-gram BLEU[papineni2002bleu] on the tst-COMMON set.
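Averaging the best checkpoints can be sketched as an element-wise mean over parameter values. The sketch below uses plain dictionaries in place of real model state and hypothetical function names; it also assumes "best" means the highest validation score, which the text does not specify.

```python
def best_checkpoints(scored, top_n=5):
    """Pick the `top_n` checkpoints with the highest validation score.

    scored: list of (score, checkpoint) pairs.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return [checkpoint for _, checkpoint in ranked[:top_n]]


def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    checkpoints: list of dicts mapping parameter name -> flat list of floats.
    """
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(checkpoint[name] for checkpoint in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged
```

Averaging several nearby checkpoints tends to smooth out the noise of individual training steps, which is why it is a common final step before decoding.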
According to [inaguma2020espnet] and [Mattia2019Adapting], Transformer based models provide state-of-the-art results for end-to-end speech translation on the MuST-C datasets. However, their performance depends on ASR pre-training, which requires English transcripts. In our experiments, as shown in Table 1, the Transformer pre-trained on its own training audio performs comparably to the Transformer pre-trained with ASR. Moreover, the Transformer pre-trained on OpenAudio achieves BLEU scores that exceed those of [inaguma2020espnet] pre-trained with ASR on both datasets.
Unlike current end-to-end speech translation methods, our method provides not only better performance but also a simpler training scheme that needs no source-language transcripts of the speech, which makes it more practical for industrial applications. Combining our unsupervised pre-training method with current supervised pre-training mechanisms is also a promising way to further improve performance.
4.4 Speech emotion recognition
The IEMOCAP database[busso2008iemocap] is used for our experiments on speech emotion recognition. We used the recordings for which the majority of annotators agreed on the emotion label, covering 4 emotions: angry, happy, sad and neutral. Happy and excited were combined as happy in order to balance the number of samples in each emotion class. The dataset contains 5,531 utterances (1,103 angry, 1,636 happy, 1,708 neutral, 1,084 sad) grouped into 5 sessions. We conduct 5-fold cross validation on IEMOCAP, taking samples from 8 speakers as the training and development sets and those from the remaining 2 speakers as the respective test set. We use the macro-averaged F1-score, which is calculated for each class separately and averaged over all classes. For fine-tuning, we add an average pooling layer followed by one feed-forward layer. To test how the gain from unsupervised pre-training depends on the type of decoder layer, we also conducted experiments on models with a multi-head attention layer[zhu2018self] with 5 heads. The model was trained on 4 GPUs with a total batch size of 64 for 25 epochs, with the same optimizer as in pre-training. For evaluation, we averaged the best 5 checkpoints from training. We used UAR, defined as the unweighted average of the class-specific recalls achieved by the system, as our metric.
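UAR as defined above can be computed directly from the reference and predicted labels. The following is a minimal sketch with a hypothetical function name, not the evaluation script used for the reported numbers.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Recall is computed only for classes that occur in the reference labels.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)
```

Unlike plain accuracy, this metric does not reward a system for favouring the larger classes: a classifier that always predicts the majority class of a two-class problem scores 0.5 regardless of how imbalanced the data is.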
In our experiments, as shown in Table 2, we achieve a mean UAR of 64.9%, which is significantly better than the state-of-the-art result on this setup. According to [Michael2019Improving] and to the best of our knowledge, [Rozgic2012Ensemble] and [xia2015leveraging] presented the best results under conditions that almost match our setup. Specifically, they also use 4 emotion classes and merge happy and excited into one class, but they used leave-one-speaker-out cross validation whereas we use leave-one-session-out cross validation. Compared with [Michael2019Improving], which provides another unsupervised pre-training method, our Transformer based model with pre-training achieves better performance.
We can also see that no matter whether the decoder uses an average pooling layer or a multi-head attention layer, the performance gains using pre-training are similar.
|Model||Pre-training method||Pre-training data||UAR (%)|
|Rozgic et al.[Rozgic2012Ensemble]||-||-||60.9|
|Xia et al.[xia2015leveraging]||-||-||62.5|
|Michael et al.[Michael2019Improving]||Autoencoder||Libri + Ted||59.5|
|+ Attention pooling||-||-||60.3|
|+ Attention pooling||Ours||OpenAudio||64.7|
4.5 Sound event detection
We used the DCASE2018 task5 dataset[Dekkers2017] for sound event detection. It contains continuous recordings of one person living in a vacation home over a period of one week. The continuous recordings were split into audio segments of 10 s, and each segment represents one activity. The dataset covers 10 kinds of activities, such as cooking and eating. DCASE2018 task5 provides development and evaluation datasets. We use the macro-averaged F1-score as the metric for this task. The model was trained on 4 GPUs with a total batch size of 128 for 50 epochs, with the same optimizer as in pre-training except that k = 0.3. For evaluation, we averaged the best 5 checkpoints from training. As for speech emotion recognition, we conducted experiments with a model using an average pooling layer and a model using a multi-head attention layer with 5 heads.
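The macro-averaged F1-score can be sketched as follows. The function name is hypothetical, and the set of classes is assumed to be every label that appears in either the references or the predictions; the official DCASE scoring tool may handle edge cases differently.

```python
def macro_f1(y_true, y_pred):
    """F1 computed per class, then averaged with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Averaging over classes rather than over segments gives rare activities the same weight as frequent ones.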
We compared our work with the technical reports of the top three teams[Inoue2018, Liu2018, Liao2018] listed on the DCASE community website. Table 3 shows that, with pre-training on OpenAudio, the Transformer based model achieves better performance than all of them on the development set and better than two of them on the evaluation set. Considering that they used well-designed hand-crafted features with various kinds of data augmentation and ensemble tricks, our method presents a simple but effective training scheme. The results also show that, as for speech emotion recognition, the performance gains from pre-training are similar whether the decoder uses an average pooling layer or a multi-head attention layer. This suggests that the benefit of our pre-training method does not depend on the choice of decoder.
|Model||Pre-training method||Pre-training data||F1 dev (%)||F1 eval (%)|
|Inoue et al.[Inoue2018]||-||-||90.0||88.4|
|Liu et al.[Liu2018]||-||-||89.8||87.5|
|Liao et al.[Liao2018]||-||-||89.8||86.7|
|+ Attention pooling||-||-||89.7||85.5|
|+ Attention pooling||Ours||OpenAudio||91.2||87.8|
4.6 Effect on convergence
The experiments also show that pre-training not only improves performance but also makes the model converge faster. Figure 2 shows that at almost every epoch of all three tasks, the metrics of the pre-trained Transformer are better than those of the base model, and the Transformer pre-trained on OpenAudio performs best.
On the other hand, the DCASE2018 task5 and IEMOCAP datasets are both relatively small compared with the En-De dataset, and the convergence curves of Base in Figure 2(a) and 2(c) show severe instability (obvious drops in the metrics at some epochs). Because our pre-training method utilizes much more data, the pre-trained model is much more stable than Base. Our pre-training method can thus significantly stabilize convergence on relatively small datasets.
In this work, we explored a Transformer based encoder with Masked-LM like pre-training for acoustic representation learning. We conducted experiments on three kinds of tasks: speech translation, speech emotion recognition and sound event detection. We pre-train the model on a large dataset combining the Librispeech, MuST-C and ESC-US datasets and fine-tune it on each task. For speech translation, the BLEU score improves by a relative 12.2% and 8.4% on the MuST-C En-De and En-Fr datasets respectively compared with the Transformer without pre-training, and is better than that of the Transformer pre-trained with ASR. For sound event detection, the F1 score improves by an absolute 1.7% and 2.4% on the DCASE2018 task5 development and evaluation sets compared with our base Transformer. For speech emotion recognition, the UAR improves by an absolute 4.3% on the IEMOCAP dataset compared with our base Transformer.
Compared with current state-of-the-art acoustic systems, our method provides a more general and robust acoustic representation across different kinds of acoustic tasks. It is easy to transfer, easy to build without many hand-crafted designs, and more practical for industrial applications. This suggests that our method is a promising alternative for acoustic representation learning.