Many studies have shown that textual information is essential for building speech recognition systems and language models (LMs). Recently, several important studies on representation learning [baevski2019vq, baevski2020wav2vec, chung2020generative, pascual2019learning, liu2020mockingjay] and semi-supervised training [xu2020iterative, park2020improved, masumura2020sequence] have explored using large amounts of speech data without corresponding text annotations and demonstrated significant improvements in speech recognition performance. This suggests that such systems may be able to learn their own LM from raw audio alone. It is therefore hoped that spoken language modeling tasks can eventually be done without any text annotations.
The Zero Resource Speech (ZeroSpeech) Challenge 2021 [nguyen2020zero] is designed to tackle such unsupervised LM training using only raw speech data as input. The evaluation uses a suite of four black-box, zero-shot metrics, which probe the quality of the trained models at four linguistic levels: phonetics, lexicon, syntax and semantics. The baseline system consists of three components: an acoustic model, a clustering module (k-means), and an LM. The acoustic model is built upon Contrastive Predictive Coding (CPC) [oord2018representation], where the representation of the audio is learned by predicting future frames using an autoregressive model. After training the CPC model, the baseline system trains a k-means clustering module on the outputs of the final layer of the autoregressive model to convert each audio file into a sequence of discrete units. Finally, the LM is trained with the discretized units as pseudo-labels.
In this challenge, the final goal is to solve several discrimination tasks. However, the representation obtained by the CPC model is not sufficiently linguistically discriminative, since the CPC model itself is trained for a prediction task. To address this issue, we propose a method that combines the CPC model with a deep cluster method [caron2018deep, xie2016unsupervised, guo2017deep, hsu2020hubert]. We train an autoregressive model for phoneme classification using pseudo-labels obtained by clustering the outputs of a CPC network with k-means. A phoneme-discriminative representation is then obtained by a second round of clustering on the outputs of the final layer of the autoregressive model. Note that we call this phoneme classification in the sense of classifying pseudo-labels, which are likely to capture phonetic information.
Furthermore, we examine replacing the Transformer [vaswani2017attention] layer of the CPC model with a Conformer [gulati2020conformer] layer. Conformer incorporates a convolutional neural network (CNN) [lecun1995convolutional] inside the Transformer to handle not only global but also local contexts, and its usefulness has been recognized in speech recognition tasks [gulati2020conformer, zhang2020pushing, huang2020improving, guo2020recent]. Likewise, we expect that capturing both contexts with the Conformer network yields more precise phonetic and lexical representations. We apply the two methods separately and confirm that each outperforms the baseline on the phonetic metric. In addition, we observe that the proposed method combining the Conformer CPC model with the deep cluster method outperforms the baseline on the lexical metric. This suggests that the two methods have a complementary effect on both tasks.
2 Challenge Overview
In this section, we briefly introduce the baseline system and the task of the ZeroSpeech Challenge 2021 [nguyen2020zero].
2.1 Baseline System
The baseline system consists of a speech representation learning model, a clustering model, and a language model. Fig 1 illustrates the architecture of the baseline system.
2.1.1 Contrastive Predictive Coding
The speech representation model is based on CPC, a self-supervised representation learning method proposed in [oord2018representation]. Instead of using a conditional generative model to predict the future input signal, the CPC model learns the representation by maximizing the mutual information between the current context and future embeddings. The CPC model consists of two modules. First, given an input speech signal $x$, a non-linear encoder $g_{\mathrm{enc}}$ maps it to a $T$-length sequence of embeddings with a lower time resolution: $z = g_{\mathrm{enc}}(x)$, where $z = (z_1, \dots, z_T)$. Then, an autoregressive encoder $g_{\mathrm{ar}}$ aggregates the information from $z_{\le t}$, producing a context latent representation $c_t = g_{\mathrm{ar}}(z_{\le t})$. The CPC model is optimized by minimizing the noise-contrastive estimation (NCE)-based loss [gutmann2010noise]. At each time $t$, given the context representation $c_t$ and its future embeddings $z_{t+k}$ ($k = 1, \dots, K$), the loss is defined as:

$$\mathcal{L}_t = -\frac{1}{K} \sum_{k=1}^{K} \log \frac{\exp\left(z_{t+k}^{\top} \phi_k(c_t)\right)}{\sum_{\tilde{z} \in \mathcal{N}_t} \exp\left(\tilde{z}^{\top} \phi_k(c_t)\right)} \tag{1}$$

where $\mathcal{N}_t$ is a set of negative embedding samples and $\phi_k$ is a transformation for each step $k$. In this challenge, we use two different versions of the CPC model, CPC-small and CPC-big, the differences between which are given in Table 1.
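As a rough illustration of the loss above, the following dependency-free Python sketch computes the NCE term for a single time step. The function name `nce_loss_step`, the toy vectors, and the choice to pass in already-transformed context vectors (instead of applying a learned Transformer or Conformer predictor) are assumptions made for illustration. The positive embedding is also included among the candidates in the denominator, as is common in contrastive implementations; this too is an assumption rather than a detail taken from the system described here.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def nce_loss_step(predictions, positives, negatives):
    """Average contrastive loss over K prediction steps at one time index.

    predictions[k]: transformed context vector for step k + 1 (hypothetical input)
    positives[k]:   true future embedding for step k + 1
    negatives:      negative embeddings, shared across steps in this toy
    """
    total = 0.0
    for pred, pos in zip(predictions, positives):
        # Candidate scores: the positive first, then all negatives.
        scores = [dot(pos, pred)] + [dot(neg, pred) for neg in negatives]
        # Numerically stable log-sum-exp for the denominator.
        peak = max(scores)
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += -(scores[0] - log_denom)
    return total / len(predictions)

# A prediction aligned with the positive gives a loss close to zero ...
aligned = nce_loss_step([[1.0, 0.0]], [[5.0, 0.0]], [[0.0, 5.0], [-5.0, 0.0]])
# ... while an uninformative prediction gives log(number of candidates).
uninformative = nce_loss_step([[0.0, 0.0]], [[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]])
```

In this toy, the loss shrinks as the predicted vector scores the true future embedding higher than the negatives, which is the behaviour the training objective rewards.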
2.1.2 Clustering and Language Models
To train a spoken language model (sLM) on pseudo-labels, the raw speech signal needs to be mapped to a sequence of discrete symbols. The pre-trained CPC model first generates a sequence of representations given the raw speech signal as input. These representations are then used to train a k-means clustering model; 50 clusters are used in this work.
After training, the clustering model is applied to the speech representations of the training data to produce class labels. Each class label can be regarded as a pseudo linguistic subword unit. Using these label sequences as pseudo-text data, we can train an sLM. In this work, we trained a BERT language model, which consists of multiple Transformer layers [vaswani2017attention]. Note that the BERT model is trained only with the masked language model objective, following [liu2019roberta]. Finally, the score of the language model on the pseudo-label sequence is regarded as a pseudo-probability (PP).
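The mapping from frame-level representations to pseudo-text can be sketched as follows. This is a minimal illustration in plain Python, not the baseline implementation: the function names, the one-dimensional toy frames, the use of 3 clusters instead of 50, and the deterministic farthest-first initialisation are all assumptions chosen to keep the example small and reproducible.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], x))

def fit_kmeans(frames, k, iterations=10):
    """Lloyd's algorithm with farthest-first initialisation (an assumption)."""
    centroids = [list(frames[0])]
    while len(centroids) < k:
        far = max(frames, key=lambda f: min(sq_dist(f, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for f in frames:
            buckets[nearest(centroids, f)].append(f)
        for j, members in enumerate(buckets):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def to_pseudo_text(centroids, frames):
    """Replace each frame by its cluster index, giving a 'sentence' of units."""
    return " ".join(str(nearest(centroids, f)) for f in frames)

# Toy stand-ins for frame-level representations of the clustering data.
train_frames = [[0.0], [0.2], [1.0], [1.3], [2.0], [2.4]]
centroids = fit_kmeans(train_frames, k=3)

# A new "utterance" is quantized with the already trained centroids.
utterance = [[0.05], [1.2], [1.1], [2.3], [2.1]]
pseudo_text = to_pseudo_text(centroids, utterance)  # "0 2 2 1 1"
```

The resulting unit sequences play the role of text when training the sLM; the cluster indices themselves carry no meaning beyond identifying a centroid.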
2.2 Dataset
The training data comprises audio from the LibriSpeech 960h dataset [panayotov2015librispeech] and the Libri-light dataset [kahn2020libri]. The CPC-small model is trained on the 100-hour clean subset (train-clean-100) of LibriSpeech, while the CPC-big model is trained on a 6K-hour subset of Libri-light. The k-means clustering is performed on the train-clean-100 subset to obtain the centroid coordinates. The trained k-means model then assigns pseudo-label sequences to the LibriSpeech 960h data, which become the training set for the language model.
Each of the four metrics is evaluated on its dev and test sets, which are specially designed for the corresponding task. Please refer to the challenge description [nguyen2020zero] for more details of how the evaluation data are generated.
2.3 Evaluation Metrics
The performance of the spoken language model is evaluated using four different metrics, each corresponding to a task at a specific linguistic level: phonetics, lexicon, syntax and semantics.
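Phonetics. The ABX metric [schatz2013evaluating] measures how well the representation discriminates between phonetic minimal pairs (e.g. “aba” and “apa”). Given the speech sounds $a$, $b$, and $x$, where $a$ and $b$ are from two categories $A$ and $B$ ($a \in A$, $b \in B$), and $x$ belongs to category $A$, it computes the probability that the two sounds from the same category are closer than the two sounds from different categories:

$$p(A, B) = \frac{1}{n_A (n_A - 1)\, n_B} \sum_{\substack{a, x \in A \\ a \neq x}} \sum_{b \in B} \left( \mathbb{1}_{d(a, x) < d(b, x)} + \frac{1}{2}\, \mathbb{1}_{d(a, x) = d(b, x)} \right)$$

where $d$ is a distance between two sounds, and $n_A$ and $n_B$ represent the cardinalities of categories $A$ and $B$.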
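The ABX computation described above can be sketched in a few lines. The function name `abx_score`, the scalar toy "sounds", and the absolute-difference distance are assumptions for illustration only; the challenge toolkit computes distances between sequences of frame-level representations.

```python
def abx_score(cat_a, cat_b, dist):
    """Fraction of (a, b, x) triples in which x is closer to a than to b.

    a and x are distinct items of category A, b is an item of category B.
    Ties count as half a success.
    """
    n_a, n_b = len(cat_a), len(cat_b)
    total = 0.0
    for i, a in enumerate(cat_a):
        for j, x in enumerate(cat_a):
            if i == j:
                continue  # a and x must be different tokens
            for b in cat_b:
                d_ax, d_bx = dist(a, x), dist(b, x)
                if d_ax < d_bx:
                    total += 1.0
                elif d_ax == d_bx:
                    total += 0.5
    return total / (n_a * (n_a - 1) * n_b)

def abs_diff(u, v):
    """Toy distance between scalar 'sounds' (a stand-in, not the real metric)."""
    return abs(u - v)

separable = abx_score([0.0, 0.1, 0.2], [1.0, 1.1], abs_diff)  # categories far apart
with_tie = abx_score([0.0, 1.0], [2.0], abs_diff)             # one triple is a tie
```

ABX results are usually reported as an error rate, i.e. one minus a score of this kind, so lower values indicate a more discriminative representation.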
Lexicon. The sWUGGY “spot-the-word” task [le2017comparing] tests whether the sLM can discriminate an existing word from a lexically similar non-word (e.g. “brick” and “blick”). The metric is the proportion of pairs for which the PP of the real word is higher than that of the non-word: $\mathrm{PP}(\text{word}) > \mathrm{PP}(\text{non-word})$.
Syntax. The sBLIMP acceptability task, adapted from BLIMP [warstadt2020blimp], tests whether the sLM can discriminate a grammatical sentence from an ungrammatical one (e.g. “dogs eat meat” and “dogs eats meat”). A pair is counted as correct if the PP of the grammatical sentence is greater than that of the ungrammatical one: $\mathrm{PP}(\text{grammatical}) > \mathrm{PP}(\text{ungrammatical})$.
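Both the lexical and the syntactic metric reduce to the same pairwise comparison of pseudo-probabilities. A minimal sketch follows, with a hypothetical function name and made-up log-scale scores:

```python
def pairwise_accuracy(pairs):
    """Fraction of pairs whose first score is strictly higher than the second.

    Each pair is (pp_positive, pp_negative): the PP of the real word or
    grammatical sentence, and the PP of the non-word or ungrammatical sentence.
    """
    pairs = list(pairs)
    correct = sum(1 for positive, negative in pairs if positive > negative)
    return correct / len(pairs)

# Made-up scores for four pairs; the last pair is a tie and counts as wrong.
toy_pairs = [(-1.0, -2.0), (-3.0, -2.5), (-0.5, -4.0), (-2.0, -2.0)]
accuracy = pairwise_accuracy(toy_pairs)
```

Since each pair has exactly one correct answer, a model that scores at random is expected to reach about 50% on either metric.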
Semantics. The sSIMI similarity task measures the similarity between the representations of pairs of words and compares the results with human judgments. The metric is computed as Spearman’s rank correlation coefficient between the semantic similarity scores given by the model and the human scores in the dataset.
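For reference, Spearman’s rank correlation can be computed as the Pearson correlation of the ranks. The sketch below is a plain-Python illustration with hypothetical function names and toy scores; tied values receive their average rank.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for idx in order[start:end + 1]:
            ranks[idx] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(model_scores, human_scores):
    """Spearman's rank correlation between model and human similarity scores."""
    return pearson(average_ranks(model_scores), average_ranks(human_scores))

# Toy word-pair similarities: the model orders the pairs exactly as humans do.
agreement = spearman([0.1, 0.4, 0.35, 0.8], [1, 3, 2, 4])
disagreement = spearman([1, 2, 3], [3, 2, 1])
```

Only the ordering of the scores matters, so the model’s similarity values do not need to be on the same scale as the human ratings.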
|Model||CPC model configuration||Training data||Input to k-means|
|CPC-small||-layer LSTM||LibriSpeech clean-100h||2nd layer of LSTM|
|CPC-big||-layer LSTM||Libri-light clean-6kh||2nd layer of LSTM|
3 Proposed System
The two proposed methods are described below. As each of these methods modifies a separate component in the baseline system, they can be used in combination.
3.1 CPC with deep cluster
All four evaluation metrics in this challenge are discriminative tasks. However, as mentioned above, the representation learned by the baseline system is not sufficiently linguistically discriminative. To solve this problem, our system combines the CPC model with the deep cluster method [caron2018deep, xie2016unsupervised, guo2017deep]. Deep cluster is a clustering method initially designed for image processing. It alternates between k-means clustering of the features produced by a neural network and updating the network weights by classifying the cluster assignment of each feature. HuBERT [hsu2020hubert] is similar to our method in that it uses the deep cluster method to perform self-learning. Fig 2 illustrates the architecture of our method. First, we follow the same procedure as the baseline system up to the k-means clustering step, which gives us a discretized pseudo-label for each feature frame. Then, we randomly initialize a new model with the same architecture as the original one. (In a preliminary experiment, we compared initializing the network with the first-round CPC weights against reinitializing it randomly; the latter gave better performance.) This time, however, the objective is to classify the pseudo-label of each feature frame with the cross-entropy (CE) criterion, which is more straightforward than the NCE loss.
Finally, we execute the second-round k-means clustering with the outputs of the final layer of the autoregressive model. A phoneme discriminative representation is achieved by imposing a phoneme classification task with the pseudo-labels on the autoregressive model.
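The two-round procedure can be sketched on toy data as follows. This is an illustration under strong simplifying assumptions, not the system itself: a softmax regression trained on fixed two-dimensional features stands in for the autoregressive network, its logits stand in for the final-layer outputs, two clusters are used, and the k-means initialisation is a deterministic farthest-first heuristic. All function names are hypothetical.

```python
import math

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(centroids, x):
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], x))

def kmeans(points, k, iterations=10):
    # Deterministic farthest-first initialisation (an assumption for this toy).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(centroids, p)].append(p)
        for j, members in enumerate(buckets):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def logits(weights, bias, x):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def train_classifier(features, labels, num_classes, epochs=100, lr=0.1):
    """Toy stand-in for the second network: softmax regression trained with CE."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]  # fresh initialisation
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            out = logits(weights, bias, x)
            peak = max(out)
            exps = [math.exp(o - peak) for o in out]
            norm = sum(exps)
            for c in range(num_classes):
                # Gradient of the cross-entropy w.r.t. logit c: softmax - one-hot.
                grad = exps[c] / norm - (1.0 if c == y else 0.0)
                bias[c] -= lr * grad
                for d in range(dim):
                    weights[c][d] -= lr * grad * x[d]
    return weights, bias

def deep_cluster(features, k):
    # Round 1: cluster the first-stage features to obtain pseudo-labels.
    first = kmeans(features, k)
    pseudo_labels = [nearest(first, x) for x in features]
    # Train a freshly initialised classifier on the pseudo-labels.
    weights, bias = train_classifier(features, pseudo_labels, k)
    # Round 2: cluster the classifier outputs to obtain the final units.
    outputs = [logits(weights, bias, x) for x in features]
    second = kmeans(outputs, k)
    return pseudo_labels, [nearest(second, o) for o in outputs]

features = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.3, 4.9], [4.8, 5.2]]
round1, round2 = deep_cluster(features, k=2)
```

On such well-separated toy data the second-round units simply reproduce the first-round grouping; the point of the sketch is the order of operations (cluster, classify pseudo-labels from a fresh initialisation, cluster the new outputs), not any gain in discriminability.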
3.2 Conformer CPC
We propose Conformer CPC, which replaces the Transformer classifier in Eq. (1) with a Conformer block. The block contains two Feed Forward modules sandwiching the Multi-Headed Self-Attention [vaswani2017attention] module and the Convolution module. For the input $c_t$ in Eq. (1) to a Conformer block, the output $y_t$ of the block is:

$$\begin{aligned} \tilde{c}_t &= c_t + \tfrac{1}{2}\,\mathrm{FFN}(c_t) \\ c'_t &= \tilde{c}_t + \mathrm{MHSA}(\tilde{c}_t) \\ c''_t &= c'_t + \mathrm{Conv}(c'_t) \\ y_t &= \mathrm{LayerNorm}\left(c''_t + \tfrac{1}{2}\,\mathrm{FFN}(c''_t)\right) \end{aligned}$$

where FFN refers to the Feed Forward module, MHSA refers to the Multi-Head Self-Attention module, and Conv refers to the Convolution module as described in [gulati2020conformer]. This network can capture not only long-term contexts via the self-attention module but also local contexts via the Convolution module. We therefore expect it to yield more precise phonetic and lexical representations.
|Embedding||Training Data||within ()||across ()|
|Baseline : CPC-small||LS-100h||/||6.24||8.48||8.17||13.55|
|Baseline : CPC-small||LS-460h||/||6.19||7.34||8.71||13.02|
|Proposed: Conformer CPC-small||LS-100h||/||5.78||7.83||8.23||13.59|
|Proposed: Conformer CPC-small||LS-460h||/||5.40||7.17||7.55||12.19|
|Proposed: Conformer CPC-small+DC||LS-460h||LS-460h||4.05||5.38||6.12||10.60|
|Baseline : CPC-big||LL-6kh||/||3.41||4.18||4.85||7.64|
|Proposed: CPC-big+DC (1024units)||LL-6kh||LS-960h||3.11||3.98||4.96||7.92|
|System||Training Data||sWUGGY ()||sBLIMP ()||sSIMI ()|
|Baseline : CPC-small||LS-100h||/||65.79||52.88||-0.09||9.23|
|Baseline : CPC-small||LS-460h||/||66.21||52.79||-0.67||4.92|
|Proposed: Conformer CPC-small||LS-100h||/||62.22||52.96||0.90||7.22|
|Proposed: Conformer CPC-small||LS-460h||/||66.10||53.39||-1.84||5.17|
|Proposed: Conformer CPC-small+DC||LS-460h||LS-460h||67.21||53.38||-0.17||7.07|
|Baseline : CPC-big||LL-6kh||/||65.81||52.91||3.88||5.56|
|Proposed: CPC-big+DC (1024units)||LL-6kh||LS-960h||62.64||54.06||-1.65||4.81|
4 Experiments
4.1 Experimental Setup
Following the baseline system [nguyen2020zero], the encoder consists of five 1-D convolutional layers with kernel sizes of (10, 8, 4, 4, 4) and stride sizes of (5, 4, 2, 2, 2). The downsampling factor of the encoder is 160, and the embedding sequence has a frame rate of 100 Hz. Then, a multi-layer long short-term memory (LSTM) [hochreiter1997long] network is used as the autoregressive encoder. We use two CPC configurations, CPC-small and CPC-big, the differences between which are given in Table 1.
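The downsampling factor and frame rate quoted above follow directly from the stride sizes. The short check below assumes 16 kHz input audio (the sampling rate of LibriSpeech) and ignores padding when estimating the receptive field; the variable names are illustrative.

```python
from math import prod  # available from Python 3.8

kernel_sizes = (10, 8, 4, 4, 4)
strides = (5, 4, 2, 2, 2)
sample_rate_hz = 16_000  # assumption: 16 kHz input audio

# Each layer divides the time resolution by its stride.
downsampling = prod(strides)
frame_rate_hz = sample_rate_hz / downsampling

# Receptive field of one output frame in input samples (padding ignored).
receptive_field, jump = 1, 1
for kernel, stride in zip(kernel_sizes, strides):
    receptive_field += (kernel - 1) * jump
    jump *= stride

receptive_field_ms = 1000 * receptive_field / sample_rate_hz
```

Under these assumptions each embedding is emitted every 10 ms and summarizes roughly 29 ms of audio.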
The transformation in Eq. (1) is a 1-layer Transformer or Conformer network, the parameters of which are as follows: The number of attention heads is and the hidden unit size is . The number of hidden units for the feed-forward layers is . As for Conformer, the kernel size of the convolution module is . During the training of CPC models, we applied dropout [srivastava2014dropout] with a rate of for the Transformer and Conformer block in the same way as existing studies [vaswani2017attention, gulati2020conformer] to achieve a better generalization. We also applied dropout with a rate of for the outputs of the CPC prediction network before taking the product with . in (1) was set to .
The number of iterations for k-means clustering was set to . This is the same for the first-round clustering and the second-round one. The language model was based on BERT [liu2019roberta]. We reduced the number of parameters by considering the training time. The model consists of Transformer layers, each of which has attention heads with hidden dimensionality of . The dimensionality of feed-forward layers is . The sLM can be trained within 60 hours on a single GPU using the pseudo-text of LibriSpeech 960h.
All models were implemented in PyTorch using CPC_audio (https://github.com/facebookresearch/CPC_audio) and fairseq (https://github.com/pytorch/fairseq). The former is a modified version of CPC that stabilizes training by replacing batch normalization [ioffe2015batch] with channel-wise normalization. The latter was used only for sLM training.
We identified three baseline systems: CPC-small trained on LibriSpeech 100h and 460h, respectively, and CPC-big trained on Libri-light 6kh. Seven proposed systems that combine different methods and training data sizes were included. The proposed methods do not necessarily require the same configuration for the initial autoregressive network and the network for phoneme classification. For this reason, we also compared a system in which the size of the hidden units in the network for phoneme classification was increased from to .
4.2 Results and Discussion
4.2.1 ABX metric
In Table 2, we present the ABX results for the baseline systems and our proposed systems before clustering. The proposed CPC-small systems outperform the CPC-small baseline trained on the same data in nearly all conditions; the exception is Conformer CPC-small trained on LS-100h, which is marginally worse than the baseline in the across-speaker condition (8.23 vs. 8.17 and 13.59 vs. 13.55). The combination of CPC with deep cluster and Conformer CPC improves the performance by up to 35% relative to the baseline (from 6.19 to 4.05 in the within-speaker condition for systems trained on LS-460h), although not as much as CPC with deep cluster alone. This shows that the two proposed methods make the representation of the CPC network more linguistically discriminative. Comparing the CPC-big models, we see that our system outperforms the baseline only in the within-speaker condition. One possible reason is that the training data for the second-stage phoneme classification task was LibriSpeech 960h, which is small compared with the Libri-light 6kh used to train the baseline CPC-big.
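The relative improvement can be recomputed from the Table 2 rows for the systems trained on LS-460h. The helper name is hypothetical, and the values are copied from the table in column order (two within-speaker columns followed by two across-speaker columns).

```python
def relative_improvement(baseline, proposed):
    """Relative error reduction in percent, for scores where lower is better."""
    return 100.0 * (baseline - proposed) / baseline

# ABX scores from Table 2, systems trained on LS-460h.
baseline_460h = [6.19, 7.34, 8.71, 13.02]  # Baseline: CPC-small
proposed_460h = [4.05, 5.38, 6.12, 10.60]  # Proposed: Conformer CPC-small+DC

gains = [relative_improvement(b, p) for b, p in zip(baseline_460h, proposed_460h)]
best_gain = max(gains)
```

The largest reduction is obtained in the first within-speaker column and rounds to the 35% quoted in the text.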
4.2.2 sWUGGY metric
Table 3 compares the sWUGGY, sBLIMP, and sSIMI metrics of the baseline and the proposed systems. When applied independently, the two proposed methods failed to outperform the CPC-small baseline systems. Better performance on the ABX metric therefore does not necessarily guarantee better performance on the sWUGGY metric. However, the best performance is achieved when the two proposed methods are applied simultaneously, i.e., Conformer CPC-small+DC. Compared to the CPC-small trained on LibriSpeech 460h data, Conformer CPC-small+DC achieves a relative improvement of 1.5% (66.21 to 67.21). This result suggests that the two methods have a complementary effect on the lexical metric.
4.2.3 sBLIMP metric
The proposed CPC-big system with deep cluster achieves the highest score among all methods in Table 3. This is also the top result in this challenge (the leaderboard can be viewed at https://zerospeech.com/2021/results.html). In addition, all Conformer CPC systems outperform all baseline systems regardless of the amount of training data. This suggests that the Conformer block helps the model learn higher-level linguistic features.
4.2.4 sSIMI metric
Across all methods, including the proposed and baseline systems, no system is significantly better on both the synthetic (synth.) and LibriSpeech (libri.) sets. The amount of training data does not directly contribute to performance improvement, even when comparing only among the baseline systems. The proposed systems generally achieve performance that is close to that of the baseline systems.
5 Conclusion
In this paper, we have proposed a system that combines CPC with deep cluster. In deep cluster, we first prepare pseudo-labels by clustering the outputs of a CPC network with k-means. Then, we train an additional autoregressive classifier to predict these pseudo-labels in a supervised manner. A phoneme-discriminative representation is obtained by a second round of clustering on the outputs of the final layer of the autoregressive model. In addition, we show that replacing the Transformer layer with a Conformer layer leads to a further gain on the lexical metric. Experimental results show relative improvements of 35% on the phonetic metric, 1.5% on the lexical metric, and 2.3% on the syntactic metric compared to the CPC-small baseline trained on LibriSpeech 460h data. This result suggests that both methods have a complementary effect on the lexical metric.