1 Introduction
Current spoken language technologies cover only about two percent of the world’s languages, because most systems require a large amount of paired data resources: a sizeable collection of spoken audio and the corresponding text transcriptions. On the other hand, most of the world’s languages are severely under-resourced, and some even lack a written form. Zero-resource speech research is an extreme case among low-resource approaches: it learns the elements of a language solely from untranscribed raw audio data. This completely unsupervised technique attempts to mimic early language acquisition in humans. The Zero Resource Speech Challenge (ZeroSpeech) [1, 2, 3] directly addresses this issue and offers participants the opportunity to advance the state of the art in the core tasks of zero-resource speech technology.
In ZeroSpeech 2015 and 2017, the goal was to discover an appropriate speech representation of the underlying language of a dataset [1, 2]. The ZeroSpeech 2019 challenge [3] confronts the problem of constructing a speech synthesizer without any text or phonetic labels: TTS without T. The task requires the full system not only to discover subword units in an unsupervised way but also to resynthesize the speech with the same content in a different target speaker’s voice; it thus includes both ASR and TTS components. In this paper, we describe our submitted system for the ZeroSpeech Challenge 2019 and focus on constructing end-to-end systems.
The top performers in discovering speech representations in ZeroSpeech 2015 and 2017 were dominated by a Bayesian non-parametric approach that clusters speech features without supervision using a Dirichlet process Gaussian mixture model (DPGMM) [4, 5]. However, the DPGMM model is too sensitive to acoustic variations and often produces too many subword units and a relatively high-dimensional posteriorgram, which implies a high computational cost for learning and inference as well as a greater tendency to overfit [6]. It is therefore difficult to synthesize a speech waveform from the resulting DPGMM-based acoustic units. To tackle these problems and achieve the best trade-off, an optimization method is required to balance and improve both components. Recently, Tjandra et al. [7, 8, 9] proposed a machine speech chain (see Figure 1) that enables ASR and TTS to assist each other when they receive unpaired data, by allowing them to infer the missing pair and optimize both models with a reconstruction loss. However, since that architecture is based on an attention-based sequence-to-sequence framework that transforms a dynamic-length input into a dynamic-length output without decoding at the frame level (one symbol per frame), it is less suitable for this challenge.
Inspired by a similar idea, we propose to utilize a frame-based vector quantized variational autoencoder (VQ-VAE) [10] and a multi-scale codebook-to-spectrogram (Code2Spec) inverter trained with mean squared error (MSE) and adversarial losses. The VQ-VAE encodes the speech into a latent space and forces each latent vector onto the nearest codebook entry, leading to a compressed representation. The inverter then generates a magnitude spectrogram in the target voice, given the codebook vectors from the VQ-VAE. In our experiments, we also investigate other clustering algorithms such as K-means and GMM and compare them with the VQ-VAE results in terms of ABX score and bit rate.
2 Vector Quantized Variational Autoencoder (VQ-VAE)
A vector quantized variational autoencoder (VQ-VAE) [10] is a variant of the variational autoencoder architecture. It has several differences from a standard autoencoder or a variational autoencoder (VAE) [11]. First, the encoder generates discrete latent variables instead of continuous ones to represent the input data. Second, instead of a one-to-one mapping between the input data and the latent variables, the VQ-VAE forces each latent variable to be represented by the closest codebook vector.
Figure 2 illustrates the encoding and decoding processes of the conditional VQ-VAE model. Here $x$ is the input data, $s$ is the speaker ID associated with $x$, $y$ is a discrete latent variable, and $\hat{x}$ is the reconstructed input. Encoder $Enc_\theta$ and decoder $Dec_\theta$ can be represented by any differentiable transformation (e.g., linear, convolution, recurrent layer) parameterized by $\theta$. Codebook $E \in \mathbb{R}^{K \times D}$ is a collection of $K$ continuous codebook vectors $e_i$ with $D$ dimensions. Speaker embedding $V$ maps speaker ID $s$ into a continuous representation $v_s$. In the encoding step, encoder $Enc_\theta$ projects input $x$ into continuous representation $z = Enc_\theta(x)$. The posterior distribution is generated by a discretization process:

$$p(y = c \mid x) = \begin{cases} 1 & \text{if } c = \arg\min_i \|z - e_i\|_2 \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

$$e_c = \mathrm{Quantize}(z) = e_k, \quad k = \arg\min_i \|z - e_i\|_2 \quad (2)$$

In the discretization process, we choose the closest codebook vector $e_c$ based on the index with the smallest distance (e.g., L2-norm distance) from continuous representation $z$. To decode the data, we use codebook vector $e_c$ and speaker embedding $v_s$ and feed both into decoder $Dec_\theta$ to reconstruct the original data: $\hat{x} = Dec_\theta(e_c, v_s)$.
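The discretization in Eqs. (1)-(2) is just a nearest-neighbor lookup in the codebook. A minimal NumPy sketch (the two-dimensional toy codebook and inputs below are illustrative, not the sizes used in our experiments):

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous encoder output z_t to its closest codebook vector."""
    # z: (T, D) encoder outputs; codebook: (K, D) codebook vectors e_i
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)  # (T, K) L2 distances
    idx = dists.argmin(axis=1)   # c = argmin_i ||z - e_i||_2
    return idx, codebook[idx]    # discrete codes and quantized vectors e_c

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
z = np.array([[0.1, -0.1], [1.9, 2.2]])
idx, e_c = quantize(z, codebook)  # idx -> [0, 2]
```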
In VQ-VAE, we formulate the training objective:

$$\mathcal{L} = -\log p(x \mid e_c, v_s) + \|\mathrm{sg}(z) - e_c\|_2^2 + \beta\,\|z - \mathrm{sg}(e_c)\|_2^2 \quad (3)$$

where function $\mathrm{sg}(\cdot)$ stops the gradient, defined as:

$$\mathrm{sg}(x) = x, \qquad \frac{\partial\,\mathrm{sg}(x)}{\partial x} = 0 \quad (4)$$

There are three terms in loss $\mathcal{L}$. The first is a negative log-likelihood that resembles a reconstruction loss and optimizes the encoder and decoder parameters. The second optimizes the codebook vectors $e_i$ and is called the codebook loss. The third forces the encoder to generate representations near the codebook and is called the commitment loss. Coefficient $\beta$ scales the commitment loss.
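The numeric values of the three loss terms can be checked with a small sketch; the stop-gradient $\mathrm{sg}(\cdot)$ only changes which parameters receive gradients, not the values themselves. The setting beta = 0.25 below is an assumed illustration, not necessarily our experimental value:

```python
import numpy as np

def vqvae_loss(z, e_c, recon_nll, beta=0.25):
    # Numeric value of Eq. (3); beta = 0.25 is an assumed example setting.
    codebook_loss = np.sum((z - e_c) ** 2)           # term 2: moves e_c toward sg(z)
    commitment_loss = beta * np.sum((z - e_c) ** 2)  # term 3: moves z toward sg(e_c)
    return recon_nll + codebook_loss + commitment_loss

z = np.array([[0.1, -0.1]])      # continuous encoder output
e_c = np.array([[0.0, 0.0]])     # its nearest codebook vector
total = vqvae_loss(z, e_c, recon_nll=1.0)  # 1.0 + 0.02 + 0.005
```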
3 Codebook-to-Spectrogram Inverter
The codebook-to-spectrogram (Code2Spec) inverter is a module that reconstructs a speech signal representation (e.g., a linear magnitude spectrogram) $x^R$, given a sequence of codebook vectors $[e_{c_1}, \ldots, e_{c_{T_c}}]$.

In Fig. 3, we illustrate our code-to-speech inverter model. The length $T_c$ of the codebook sequence might be shorter than the spectrogram length $T_s$, depending on the time reduction $r$ of the VQ-VAE encoder. Therefore, to obtain identical lengths for the codebook and speech representation sequences, we copy each codebook vector $e_{c_t}$ $r$ times. The duplicated codebook sequence is then given to the inverter, which consists of multiple layers of multi-scale 1D convolution, batch normalization [12], and LeakyReLU [13] nonlinearity. In addition to the inverter, we also have a discriminator module. The discriminator predicts whether a given spectrogram is real data or was generated by the inverter, while the inverter tries to generate realistic spectrograms that deceive the discriminator [14, 15, 16]. The Code2Spec inverter has several training objectives:

$$\hat{x}^R = Inv_{\theta_{Inv}}([e_{c_1}, \ldots, e_{c_{T_c}}]) \quad (5)$$

$$\mathcal{L}_{MSE} = \|\hat{x}^R - x^R\|_2^2 \quad (6)$$

$$\mathcal{L}_{ADV} = \big(Disc_{\theta_{Disc}}(\hat{x}^R) - 1\big)^2 \quad (7)$$

$$\mathcal{L}_{DISC} = \big(Disc_{\theta_{Disc}}(x^R) - 1\big)^2 + Disc_{\theta_{Disc}}(\hat{x}^R)^2 \quad (8)$$
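To make the input and output lengths match, each codebook vector is simply repeated $r$ times before entering the inverter. The sketch below shows this duplication plus toy values of the reconstruction and LSGAN-style adversarial terms; the random arrays, discriminator scores, and the weight 0.1 are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.standard_normal((5, 64))   # (T_c, D) codebook sequence from the VQ-VAE
frames = np.repeat(codes, 4, axis=0)   # copy each code r=4 times -> (20, 64)

real = rng.random((20, 1025))          # ground-truth linear magnitude spectrogram
fake = rng.random((20, 1025))          # inverter prediction (placeholder values)
d_real, d_fake = 0.9, 0.2              # toy discriminator outputs in [0, 1]

mse = ((fake - real) ** 2).mean()           # reconstruction term
adv = (d_fake - 1.0) ** 2                   # LSGAN-style inverter loss
disc = (d_real - 1.0) ** 2 + d_fake ** 2    # LSGAN-style discriminator loss
inv_total = mse + 0.1 * adv                 # 0.1 is a hypothetical balance weight
```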
After defining the multiple objectives for training, we update the module parameters $\theta_{Inv}$ and $\theta_{Disc}$ with the following equations:

$$\theta_{Inv} \leftarrow Optim\big(\theta_{Inv}, \nabla_{\theta_{Inv}}(\mathcal{L}_{MSE} + \gamma\,\mathcal{L}_{ADV})\big) \quad (9)$$

$$\theta_{Disc} \leftarrow Optim\big(\theta_{Disc}, \nabla_{\theta_{Disc}}\mathcal{L}_{DISC}\big) \quad (10)$$

where $Optim(\cdot)$ is a gradient optimization function (e.g., SGD, Adam [19]) and $\gamma$ is the coefficient that balances the MSE and adversarial losses. In the inference stage, given the predicted linear magnitude spectrogram $\hat{x}^R$, we reconstruct the missing phase spectrogram with the Griffin-Lim algorithm [20] and apply the inverse short-term Fourier transform (STFT) to generate the waveform.
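The Griffin-Lim phase reconstruction step can be sketched with SciPy's STFT routines; the window and hop sizes below are assumptions for illustration, not our exact analysis settings:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=512, hop=128, n_iter=30, seed=0):
    """Iteratively estimate a phase consistent with the given magnitude,
    then invert with the STFT (minimal sketch of the algorithm in [20])."""
    rng = np.random.default_rng(seed)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    noverlap = n_fft - hop
    for _ in range(n_iter):
        # inverse STFT with the current phase, then re-analyze to update it
        _, wav = istft(mag * angles, nperseg=n_fft, noverlap=noverlap)
        _, _, spec = stft(wav, nperseg=n_fft, noverlap=noverlap)
        spec = spec[:, :mag.shape[1]]  # align frame counts after the round trip
        if spec.shape[1] < mag.shape[1]:
            spec = np.pad(spec, ((0, 0), (0, mag.shape[1] - spec.shape[1])))
        angles = np.exp(1j * np.angle(spec))
    _, wav = istft(mag * angles, nperseg=n_fft, noverlap=noverlap)
    return wav

# toy check: analyze a sine, keep only the magnitude, and resynthesize
sine = np.sin(2 * np.pi * 440 * np.arange(8192) / 16000)
_, _, S = stft(sine, nperseg=512, noverlap=384)
wav = griffin_lim(np.abs(S), n_iter=10)
```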
4 Experiment
In this section, we describe the feature extraction, the preliminary models, and our proposed models for this challenge. All of the results were evaluated using evaluate.sh on the English test set.

4.1 Experimental Setup
There are two datasets for two languages: English data for the development dataset and a surprise Austronesian language for the test dataset. Each language dataset contains several subsets: (1) a Voice Dataset for speech synthesis, (2) a Unit Discovery Dataset, (3) an optional Parallel Dataset from the target voice to another speaker’s voice, and (4) a Test Dataset. The source corpora of the surprise language are described in [21, 22], and further details can be found in [3]. In this work, we only use (1)-(2) for training and (4) for testing.
For the speech input, we experimented with several feature types, such as Mel-spectrograms (80 dimensions, 25-ms window size, 10-ms time steps) and MFCCs (13 dimensions plus deltas and double-deltas for 39 dimensions in total, 25-ms window size, 10-ms time steps). Both MFCC and Mel-spectrogram features are generated with the Librosa package [23].
4.2 Official baseline and topline model
ZeroSpeech 2019 provides official baselines and toplines. The baseline is a pipeline with a simple acoustic unit discovery system based on DPGMM and a speech synthesizer based on Merlin, while the topline uses gold phoneme transcriptions to train a phoneme-based ASR system with Kaldi and a phoneme-based TTS with Merlin. The performance is shown in Table 1.
Model     ABX    Bit rate
Baseline  35.63  71.98
Topline   29.85  37.73
4.3 Preliminary model
We started to explore this challenge using a simpler method and gradually increased our model’s complexity.
4.3.1 Direct feature representation
We directly evaluated the ABX scores and bit rates of the Mel-spectrogram and MFCC as speech representations. In Table 2, we report each feature extraction method with respect to its ABX score and bit rate. In our preliminary experiments, MFCC produced better performance on the ABX metric than the Mel-spectrogram. Therefore, for the rest of our discussion, we only focus on utilizing MFCC features. However, even though MFCC has the better ABX score, the bit rate remains too high.
Feature   ABX     Bit rate
Mel-Spec  30.291  1738.38
MFCC      21.114  1737.47
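The challenge's official bit rate is computed by the provided evaluation scripts; the sketch below only illustrates the underlying idea, an assumed symbols-per-second times empirical-entropy estimate, and why coarser discrete representations lower it:

```python
import numpy as np
from collections import Counter

def bitrate(symbols, duration_s):
    # Assumed ZeroSpeech-style estimate: symbol rate (symbols/s) times the
    # empirical entropy (bits/symbol) of the discovered unit distribution.
    p = np.array(list(Counter(symbols).values()), dtype=float) / len(symbols)
    entropy = -(p * np.log2(p)).sum()
    return len(symbols) / duration_s * entropy

# 100 frames drawn uniformly from 4 units over 10 s -> 10 sym/s * 2 bits
rate = bitrate([0, 1, 2, 3] * 25, 10.0)  # -> 20.0
```

Fewer symbols per second (larger stride) or a more peaked unit distribution both reduce the estimate, which is the trade-off explored in the following sections.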
4.3.2 KMeans
We trained mini-batch K-means (with the scikit-learn toolkit [24]) on the MFCC features and varied the cluster size: 64, 128, 256. We represent each data point (a speech frame) by the centroid vector closest to that frame and calculate the ABX score with DTW and the cosine distance. Table 3 reports all the models and their configurations with respect to their ABX scores and bit rates.
ABX / Bit rate
#C   1T           2T           4T
64   23.56 / 553  25.97 / 280  29.41 / 136
128  23.16 / 649  24.24 / 321  28.12 / 161
256  21.90 / 744  23.73 / 369  27.17 / 182
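The frame-wise K-means quantization above can be sketched with scikit-learn; the random array stands in for real MFCC frames, and 64 clusters matches the smallest configuration in Table 3:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((2000, 39))  # stand-in for real 39-dim MFCC frames

km = MiniBatchKMeans(n_clusters=64, n_init=3, random_state=0).fit(mfcc)
labels = km.predict(mfcc)                # one discrete unit index per frame
quantized = km.cluster_centers_[labels]  # frame replaced by its nearest centroid
```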
4.3.3 Gaussian Mixture Model (GMM)
We trained GMMs with diagonal covariance matrices (with the scikit-learn toolkit [24]) on the MFCC features and varied the number of mixture components: 64, 128, and 256. We represent each data point (a speech frame) by the posterior probability of each mixture component, computed via Bayes’ rule,
and calculate the ABX score with DTW and KL-divergence. In Table 4, we report all of the models and their configurations with respect to their ABX scores and bit rates.

ABX / Bit rate
#C   1T            2T           4T
64   20.81 / 1647  22.67 / 676  29.82 / 257
128  19.61 / 1705  23.06 / 704  31.19 / 281
256  18.93 / 1691  23.39 / 757  32.99 / 306
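Likewise, the GMM posteriorgram representation can be sketched with scikit-learn, where predict_proba applies Bayes' rule to yield per-component responsibilities (random data again stands in for MFCC frames):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((2000, 39))  # stand-in for real MFCC frames

gmm = GaussianMixture(n_components=64, covariance_type="diag",
                      max_iter=20, random_state=0).fit(mfcc)
posteriorgram = gmm.predict_proba(mfcc)  # (2000, 64) responsibilities, rows sum to 1
```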
4.4 Proposed model
ABX / Bit rate
#CL  1T           2T           4T           8T
64   27.46 / 606  25.51 / 302  26.15 / 138  28.81 / 70
128  27.65 / 686  24.29 / 347  25.04 / 165  30.87 / 79
256  27.63 / 787  24.37 / 349  24.17 / 184  30.51 / 79
512  27.69 / 871  23.59 / 400  24.63 / 180  32.02 / 74
4.4.1 VQ-VAE
Next, we describe our encoder and decoder architecture in Fig. 4, illustrated with a four-times sequence-length reduction. For the input and output targets, we use the MFCC features and explore different stride sizes to reduce the time length by factors of 1, 2, 4, and 8. We use a speaker embedding with 32 dimensions and codebook embeddings with 64 dimensions. We varied the number of codebook entries: 64, 128, 256, 512. Batch normalization [12] and LeakyReLU [13] activations were applied to every layer except the last encoder and decoder layers. The decoder input is a concatenation of the codebook and speaker embeddings along the channel axis. We set the commitment loss coefficient $\beta$.

4.4.2 Multi-scale Code2Spec inverter
In Fig. 4, we also describe our inverter architecture. The input is a codebook sequence with 64 dimensions, and the target output is a sequence of linear magnitude spectrogram frames with 1025 dimensions. The first four layers have multiple kernels with different sizes across the time axis. All convolution layers have stride 1 and “same” padding. Batch normalization and LeakyReLU activations are applied to every layer except the last one before the output prediction. For the adversarial loss, we found that LSGAN [18] is more stable than WGAN [17], so LSGAN is used in every model. We independently trained the inverter to generate the target speaker’s voice with the train/voice set. We have two inverters for the English set and one for the surprise set.

4.4.3 Model training
4.4.4 Results and Discussion
Table 5 reports all models and their configurations with respect to their ABX scores and bit rates. Considering the balance between the ABX discrimination score and the bit-rate compression, we submitted two proposed systems: (1) 256 codebooks with a stride of 4 to reduce the time length, and (2) 256 codebooks with a stride of 2.
We also attempted to further enhance the synthesized voice with several techniques, such as WaveNet [26, 27] and GAN-based voice conversion [28]. The WaveNet decoder is conditioned on frame-wise linguistic or acoustic features with a 5-ms time shift (80 times smaller than the speech sample rate). Since the sample rate of our system’s codebook embeddings was 320 times smaller than that of the speech samples, WaveNet could not produce satisfying results. GANs are known to be effective for achieving high-quality voice conversion with clean input data [29, 30]. However, our task is more challenging because our generated voice always contains some distortion; therefore, the GAN-based voice conversion approach failed to improve our performance. As future work, we will investigate GAN-based speech enhancement [31] approaches to further improve our results.
5 Conclusions
We described our approach to the ZeroSpeech Challenge 2019 for unsupervised unit discovery. We explored many different possibilities: feature extraction, clustering algorithms, and embedding representations. For our final submission, we utilized a VQ-VAE to extract a sequence of codebook vectors. The codebook generated by the VQ-VAE achieves a better trade-off between ABX score and bit rate than the other models, such as K-means, GMM, or direct feature representation. To reconstruct speech from the codebook, we trained a Code2Spec inverter to generate the corresponding linear magnitude spectrogram. The combination of VQ-VAE and Code2Spec significantly improved the intelligibility (in CER), the MOS, and the ABX discrimination scores compared to the official ZeroSpeech 2019 baseline and even the topline.
6 Acknowledgements
Part of this work was supported by JSPS KAKENHI Grant Numbers JP17H06101 and JP17K00237.
References
 [1] M. Versteegh, R. Thiolliere, T. Schatz, X. N. Cao, X. Anguera, A. Jansen, and E. Dupoux, “The zero resource speech challenge 2015,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 [2] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 323–330, IEEE, 2017.
 [3] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, L. Besacier, S. Sakti, and E. Dupoux, “The zero resource speech challenge 2019: TTS without T,” in Twentieth Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), 2019.
 [4] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 [5] M. Heck, S. Sakti, and S. Nakamura, “Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 740–746, IEEE, 2017.
 [6] B. Wu, S. Sakti, J. Zhang, and S. Nakamura, “Optimizing DPGMM clustering in zero-resource setting based on functional load,” in Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pp. 1–5.

 [7] A. Tjandra, S. Sakti, and S. Nakamura, “Listening while speaking: Speech chain by deep learning,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 301–308, IEEE, 2017.
 [8] A. Tjandra, S. Sakti, and S. Nakamura, “Machine speech chain with one-shot speaker adaptation,” in Proc. Interspeech 2018, pp. 887–891, 2018.

 [9] A. Tjandra, S. Sakti, and S. Nakamura, “End-to-end feedback loss in speech chain framework via straight-through estimator,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019.
 [10] A. van den Oord, O. Vinyals, et al., “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.
 [11] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
 [12] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.

 [13] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
 [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 [15] Y. Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 84–96, Jan 2018.
 [16] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino, “Generative adversarial networkbased postfilter for statistical parametric speech synthesis,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4910–4914, IEEE, 2017.
 [17] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017.

 [18] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2017.
 [19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [20] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
 [21] S. Sakti, R. Maia, S. Sakai, T. Shimizu, and S. Nakamura, “Development of HMM-based Indonesian speech synthesis,” in Proc. Oriental COCOSDA, 2008.
 [22] S. Sakti, E. Kelana, R. Hammam, S. Sakai, K. Markov, and S. Nakamura, “Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project,” 2008.
 [23] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” 2015.

 [24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS Workshop, 2017.
 [26] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
 [27] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” arXiv preprint arXiv:1901.08810, 2019.
 [28] B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, “Adaptive wavenet vocoder for residual compensation in ganbased voice conversion,” in 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 282–289, IEEE, 2018.
 [29] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5279–5283, IEEE, 2018.
 [30] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” arXiv preprint arXiv:1704.00849, 2017.
 [31] Z. Meng, J. Li, Y. Gong, et al., “Cycle-consistent speech enhancement,” arXiv preprint arXiv:1809.02253, 2018.