Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling

by Siyuan Feng, et al.
The Chinese University of Hong Kong

This study addresses the problem of unsupervised subword unit discovery from untranscribed speech. It forms the basis of the ultimate goal of ZeroSpeech 2019, building text-to-speech systems without text labels. In this work, unit discovery is formulated as a pipeline of phonetically discriminative feature learning and unit inference. One major difficulty in robust unsupervised feature learning is dealing with speaker variation. Here, robustness towards speaker variation is achieved by applying adversarial training and FHVAE based disentangled speech representation learning. The two approaches, as well as their combination, are compared within a DNN-bottleneck feature (DNN-BNF) architecture. Experiments are conducted on ZeroSpeech 2019 and 2017. Experimental results on ZeroSpeech 2017 show that both approaches are effective, the latter more prominently so, and that their combination brings a further marginal improvement in the across-speaker condition. Results on ZeroSpeech 2019 show that in the ABX discriminability task, our approaches significantly outperform the official baseline and are competitive with or even outperform the official topline. The proposed unit sequence smoothing algorithm improves synthesis quality, at the cost of a slight decrease in ABX discriminability.





1 Introduction

Nowadays speech processing is dominated by deep learning techniques. Deep neural network (DNN) acoustic models (AMs) for the tasks of automatic speech recognition (ASR) and speech synthesis have shown impressive performance for major languages such as English and Mandarin. Typically, training a DNN AM requires large amounts of transcribed data. For a large number of low-resource languages, for which very limited or no transcribed data are available, conventional methods of acoustic modeling are ineffective or even inapplicable.

In recent years, there has been an increasing research interest in zero-resource speech processing, i.e., only a limited amount of raw speech data (e.g. hours or tens of hours) are given while no text transcriptions or linguistic knowledge are available. The Zero Resource Speech Challenges (ZeroSpeech) 2015 [1], 2017 [2] and 2019 [3] precisely focus on this area. One problem tackled by ZeroSpeech 2015 and 2017 is subword modeling, learning frame-level speech representation that is discriminative to subword units and robust to linguistically-irrelevant factors such as speaker change. The latest challenge ZeroSpeech 2019 goes a step further by aiming at building text-to-speech (TTS) systems without any text labels (TTS without T) or linguistic expertise. Specifically, one is required to build an unsupervised subword modeling sub-system to automatically discover phoneme-like units in the concerned language, followed by applying the learned units altogether with speech data from which the units are inferred to train a TTS. Solving this problem may partially assist psycholinguists in understanding young children’s language acquisition mechanism [3].

This study addresses unsupervised subword modeling in ZeroSpeech 2019, which is also referred to as acoustic unit discovery (AUD). It is an essential problem and forms the basis of TTS without T. The exact goal of this problem is to represent untranscribed speech utterances by discrete subword unit sequences, which is slightly different from subword modeling in the contexts of ZeroSpeech 2017 & 2015. In practice, it can be formulated as an extension to the previous two challenges. For instance, after learning the subword discriminative feature representation at frame level, the discrete unit sequences can be inferred by applying vector quantization methods followed by collapsing consecutive repetitive symbolic patterns. In the previous two challenges, several unsupervised representation learning approaches were proposed for comparison, such as cluster posteriorgrams (PGs) [4, 5, 6], DNN bottleneck features [7, 8], autoencoders (AEs) [9, 10], variational AEs (VAEs) [11, 12] and siamese networks [13, 14, 15].

One major difficulty in unsupervised subword modeling is dealing with speaker variation. The huge performance degradation caused by speaker variation reported in ZeroSpeech 2017 [2] implies that speaker-invariant representation learning is crucial and remains unsolved. In ZeroSpeech 2019, a speaker-independent subword unit inventory is highly desirable for building a TTS without T system. In the literature, many works focused on improving the robustness of unsupervised feature learning towards speaker variation. One direction is to apply linear transform methods. Heck et al. [6] estimated fMLLR features in an unsupervised manner. Works in [7, 16] estimated fMLLR using a pre-trained out-of-domain ASR. Chen et al. [8] applied vocal tract length normalization (VTLN). Another direction is to employ DNNs. Zeghidour et al. [14] proposed to train subword and speaker same-different tasks within a triamese network to untangle linguistic and speaker information. Chorowski et al. [12] used a speaker embedding as a condition of the VAE decoder to free the encoder from capturing speaker information. Tsuchiya et al. [17] applied speaker adversarial training in a task related to the zero-resource scenario, but transcriptions of a target language were used in model training.

In this paper, we extend our recent research findings [11] on applying disentangled speech representations learned by factorized hierarchical VAE (FHVAE) models [18] to improve speaker-invariant subword modeling. The contributions of this study are several. First, FHVAE based speaker-invariant learning is compared with speaker adversarial training in a strictly unsupervised scenario. Second, the combination of adversarial training and disentangled representation learning is studied. Third, our proposed approaches are evaluated on the latest challenge, ZeroSpeech 2019, as well as on ZeroSpeech 2017 for completeness. To the best of our knowledge, a direct comparison of the two approaches and their combination has not been studied before.

2 System description

2.1 General framework

Figure 1: General framework of our proposed approaches

The general framework of our proposed approaches is illustrated in Figure 1. Given untranscribed speech data, the first step is to learn speaker-invariant features to support frame labeling. The FHVAE model [18] is adopted for this purpose. FHVAEs disentangle the linguistic content and speaker information encoded in speech into different latent representations. Compared with raw MFCC features, FHVAE reconstructed features conditioned on the latent linguistic representation are expected to keep linguistic content unchanged while being more speaker-invariant. Details of the FHVAE structure and feature reconstruction methods are described in Section 2.2.

The reconstructed features are fed as inputs to a Dirichlet process Gaussian mixture model (DPGMM) [19] for frame clustering, as was done in [4]. The frame-level cluster labels are regarded as pseudo phone labels to support supervised DNN training. Motivated by successful applications of adversarial training [20] in a wide range of domain-invariant learning tasks [21, 22, 23, 24], this work adds an auxiliary adversarial speaker classification task to explicitly target speaker-invariant feature learning. After speaker adversarial multi-task learning (AMTL) DNN training, the softmax PG representation from the pseudo phone classification task is used to infer subword unit sequences. The resultant unit sequences are regarded as pseudo transcriptions for subsequent TTS training.

2.2 Speaker-invariant feature learning by FHVAEs

The FHVAE model formulates the generation process of sequential data by imposing sequence-dependent and sequence-independent priors on different latent variables [18]. It consists of an inference model and a generation model. Let $D = \{X^{(i)}\}_{i=1}^{M}$ denote a speech dataset with $M$ sequences. Each $X = \{x^{(n)}\}_{n=1}^{N}$ contains $N$ speech segments, where each $x^{(n)}$ is composed of fixed-length consecutive frames. The FHVAE model generates a sequence $X$ from a random process as follows: (1) an s-vector $\mu_2$ is drawn from a prior distribution $p(\mu_2) = \mathcal{N}(0, \sigma_{\mu_2}^2 I)$; (2) latent segment variables $z_1^{(n)}$ and latent sequence variables $z_2^{(n)}$ are drawn from $p(z_1) = \mathcal{N}(0, \sigma_{z_1}^2 I)$ and $p(z_2 \mid \mu_2) = \mathcal{N}(\mu_2, \sigma_{z_2}^2 I)$ respectively; (3) each speech segment $x^{(n)}$ is drawn from $p(x \mid z_1^{(n)}, z_2^{(n)})$. Here $\mathcal{N}(\cdot,\cdot)$ denotes a normal distribution, and the mean and variance of $p(x \mid z_1, z_2)$ are parameterized by DNNs. The joint probability for $X$ is formulated as,

$$p(X, Z_1, Z_2, \mu_2) = p(\mu_2) \prod_{n=1}^{N} p(x^{(n)} \mid z_1^{(n)}, z_2^{(n)})\, p(z_1^{(n)})\, p(z_2^{(n)} \mid \mu_2). \quad (1)$$

Since exact posterior inference is intractable, the FHVAE introduces an inference model $q$ to approximate the true posterior,

$$q(Z_1, Z_2, \mu_2 \mid X) = q(\mu_2) \prod_{n=1}^{N} q(z_1^{(n)} \mid x^{(n)}, z_2^{(n)})\, q(z_2^{(n)} \mid x^{(n)}). \quad (2)$$

Here $q(z_1 \mid x, z_2)$ and $q(z_2 \mid x)$ are all diagonal Gaussian distributions, whose mean and variance values are parameterized by two DNNs. For $\mu_2$, during FHVAE training, a trainable lookup table containing the posterior mean of $\mu_2$ for each training sequence is updated. During testing, maximum a posteriori (MAP) estimation is used to infer $\mu_2$ for unseen test sequences. FHVAEs optimize the discriminative segmental variational lower bound defined in [18]. It contains a discriminative objective that prevents $\mu_2$ from being the same for all utterances.

After FHVAE training, $z_1$ encodes segment-level factors, e.g. linguistic information, while $z_2$ encodes sequence-level factors that are relatively consistent within an utterance. By concatenating training utterances of the same speaker into a single sequence for FHVAE training, the learned s-vector $\mu_2$ is expected to be discriminative to speaker identity. This work applies s-vector unification [11] to generate a reconstructed feature representation that keeps linguistic content unchanged and is more speaker-invariant than the original representation. Specifically, a representative speaker with his/her s-vector (denoted as $\tilde{\mu}_2$) is chosen from the dataset. Next, for each speech segment $x$ of an arbitrary speaker $s$, its corresponding latent sequence variable $z_2$ inferred from $x$ is transformed to $\hat{z}_2 = z_2 - \mu_2^{(s)} + \tilde{\mu}_2$, where $\mu_2^{(s)}$ denotes the s-vector of speaker $s$. Finally, the FHVAE decoder reconstructs the speech segment $\hat{x}$ conditioned on $z_1$ and $\hat{z}_2$. The features $\hat{x}$ form our desired speaker-invariant representation.
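The s-vector unification step amounts to a simple shift in the latent sequence space. A minimal numpy sketch of this transformation (variable names are ours, not from the paper's implementation):

```python
import numpy as np

def unify_s_vector(z2, mu2_speaker, mu2_representative):
    """Shift a segment's latent sequence variable away from its own
    speaker's s-vector and onto the representative speaker's s-vector."""
    return z2 - mu2_speaker + mu2_representative

# toy 4-dimensional latent sequence variable
z2 = np.array([0.5, -0.2, 0.1, 0.0])
mu2_spk = np.array([0.4, -0.1, 0.0, 0.2])   # s-vector of the source speaker
mu2_rep = np.array([0.1, 0.3, -0.2, 0.0])   # s-vector of the representative speaker

z2_hat = unify_s_vector(z2, mu2_spk, mu2_rep)
# z2_hat is then fed to the FHVAE decoder together with z1
```

The decoder sees the same segment-level content ($z_1$ untouched) but a sequence-level variable centered on the representative speaker, which is what makes the reconstruction speaker-invariant.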

2.3 Speaker adversarial multi-task learning

Speaker adversarial multi-task learning (AMTL) simultaneously trains a subword classification network $M_P$, a speaker classification network $M_S$ and a shared-hidden-layer feature extractor $M_F$, where $M_P$ and $M_S$ are set on top of $M_F$, as illustrated in Figure 1. In AMTL, the error of $M_S$ is reversely propagated to $M_F$, such that the output layer of $M_F$ is forced to learn speaker-invariant features so as to confuse $M_S$, while $M_S$ tries to correctly classify the outputs of $M_F$ into their corresponding speakers. At the same time, $M_P$ learns to predict the correct DPGMM labels of input features and back-propagates errors to $M_F$ in the usual way.

Let $\theta_F$, $\theta_P$ and $\theta_S$ denote the network parameters of $M_F$, $M_P$ and $M_S$, respectively. With the stochastic gradient descent (SGD) algorithm, these parameters are updated as,

$$\theta_P \leftarrow \theta_P - \eta \frac{\partial L_P}{\partial \theta_P}, \quad (3)$$
$$\theta_F \leftarrow \theta_F - \eta \left( \frac{\partial L_P}{\partial \theta_F} - \lambda \frac{\partial L_S}{\partial \theta_F} \right), \quad (4)$$
$$\theta_S \leftarrow \theta_S - \eta \frac{\partial L_S}{\partial \theta_S}, \quad (5)$$

where $\eta$ is the learning rate, $\lambda$ is the adversarial weight, and $L_P$ and $L_S$ are the loss values of the subword and speaker classification tasks respectively, both in terms of cross-entropy. To implement Eqt. (4), a gradient reversal layer (GRL) [20] is placed between $M_F$ and $M_S$. The GRL acts as an identity transform during forward-propagation and changes the sign of the loss gradient during back-propagation. After training, the output of $M_F$ is a speaker-invariant and subword-discriminative bottleneck feature (BNF) representation of the input speech. Besides, the softmax output representation of $M_P$ is believed to carry less speaker information than that obtained without speaker adversarial training.
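The feature-extractor update can be sketched in a few lines; with an adversarial weight of zero it reduces to plain multi-task learning (the function name and the toy gradient values below are illustrative only):

```python
def amtl_step(theta_F, grad_P_wrt_F, grad_S_wrt_F, eta, lam):
    """One SGD step for the shared feature extractor: descend on the
    subword loss gradient, ascend on the speaker loss gradient
    (the sign flip realized by the gradient reversal layer)."""
    return theta_F - eta * (grad_P_wrt_F - lam * grad_S_wrt_F)

# toy scalar example
theta = 1.0
theta_plain = amtl_step(theta, grad_P_wrt_F=0.5, grad_S_wrt_F=0.2, eta=0.1, lam=0.0)
theta_adv   = amtl_step(theta, grad_P_wrt_F=0.5, grad_S_wrt_F=0.2, eta=0.1, lam=0.5)
```

With `lam > 0`, the speaker-loss gradient pushes the shared parameters in the direction that *hurts* the speaker classifier, which is exactly the confusion effect described above.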

2.4 Subword unit inference and smoothing

Subword unit sequences for the concerned untranscribed speech utterances are inferred from the softmax PG representation of the subword classification task in the speaker AMTL DNN. For each input frame, the DPGMM label with the highest probability in the PG representation is regarded as the subword unit assigned to that frame. These frame-level unit labels are further processed by collapsing consecutive repetitive labels to form pseudo transcriptions.
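Concretely, unit inference reduces to a per-frame argmax over the posteriorgram followed by run-length collapsing. A sketch under our own naming (not the authors' code):

```python
from itertools import groupby

import numpy as np

def infer_units(pg):
    """pg: (frames, clusters) posteriorgram. Returns the collapsed
    DPGMM-label sequence used as a pseudo transcription."""
    frame_labels = pg.argmax(axis=1)                   # most probable unit per frame
    return [int(u) for u, _ in groupby(frame_labels)]  # collapse repeats

pg = np.array([[0.9, 0.1],
               [0.8, 0.2],
               [0.3, 0.7],
               [0.4, 0.6]])
print(infer_units(pg))  # → [0, 1]
```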

We observed non-smoothness in the unit sequences inferred by the above method, i.e., frame-level unit labels that are isolated without temporal repetition. Considering that ground-truth phonemes generally span at least several frames, these non-smooth labels are unwanted. This work proposes an empirical method to filter out part of the non-smooth unit labels, which is summarized in Algorithm 1.

Input: Frame-level unit labels $l_1, \dots, l_T$
Output: Pseudo transcription
1 Run-length encode $l_1, \dots, l_T$ into $\{(u_k, c_k)\}_{k=1}^{K}$, where $u_k$ is a unit label and $c_k$ its number of consecutive repetitions; set $k \leftarrow 1$;
2 while $k \leq K$ do
3       if $c_k = 1$ then
4             remove the isolated run $(u_k, c_k)$; merge the neighboring runs if their labels become identical;
5       end if
6       $k \leftarrow k + 1$;
7 end while
8 Output the remaining labels $u_k$ in order as the pseudo transcription.
Algorithm 1 Unit sequence smoothing
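The smoothing idea can be sketched as follows; the one-frame threshold and the drop-then-recollapse behavior are our reading of the description above, and the original implementation may differ in details:

```python
from itertools import groupby

def smooth_units(frame_labels, min_run=2):
    """Drop unit runs shorter than min_run frames (isolated, non-smooth
    labels), then collapse consecutive repetitions into a transcription."""
    runs = [(u, sum(1 for _ in g)) for u, g in groupby(frame_labels)]
    kept = [u for u, c in runs if c >= min_run]
    return [u for u, _ in groupby(kept)]   # re-collapse after removal

print(smooth_units([5, 5, 5, 7, 3, 3]))   # isolated 7 removed → [5, 3]
print(smooth_units([5, 5, 7, 5, 5]))      # removal merges the two 5-runs → [5]
```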

3 ZeroSpeech 2017 experiments

3.1 Dataset and evaluation metric

The ZeroSpeech 2017 development dataset consists of three languages: English, French and Mandarin. Speaker information is given for the training sets but unknown for the test sets. Detailed information on the dataset, including training set durations, can be found in [2].

The evaluation metric is ABX subword discriminability. Basically, given speech segments $A$, $B$ and $X$, where $A$ belongs to phoneme category $a$, $B$ belongs to category $b$, and $a$ and $b$ are minimal pairs differing in the central sound (e.g., “beg”-“bag”), the task is to decide whether $X$ belongs to $a$ or $b$. Each pair of $A$ and $B$ is spoken by the same speaker. Depending on whether $X$ is spoken by the same speaker as $A$ and $B$, ABX error rates for across-/within-speaker conditions are evaluated separately.
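In its simplest form, an ABX trial compares the distance from X to A against the distance from X to B, and the error rate is the fraction of trials decided wrongly. A schematic version on fixed-length feature vectors (the real evaluation aggregates frame sequences with DTW; the names and toy distance here are ours):

```python
def abx_error_rate(trials, dist):
    """trials: list of (x, a, b) where x truly belongs to the category of a.
    An error is counted whenever x is closer to b than to a."""
    errors = sum(1 for x, a, b in trials if dist(x, b) < dist(x, a))
    return errors / len(trials)

def l1(u, v):
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

trials = [
    ((0.0, 1.0), (0.1, 0.9), (1.0, 0.0)),   # x close to a: correct decision
    ((0.9, 0.1), (0.0, 1.0), (1.0, 0.0)),   # x close to b: error
]
print(abx_error_rate(trials, l1))  # → 0.5
```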

3.2 System setup

The FHVAE model is trained on the merged training sets of all three target languages. Input features are fixed-length segments of consecutive MFCC frames, with cepstral mean normalization (CMN) applied at speaker level. During training, speech utterances spoken by the same speaker are concatenated into a single training sequence. During the inference of the latent variables $z_1$ and $z_2$, input segments are shifted by one frame. To match the length of the latent variable sequence with the original features, the first and last frames are padded. To generate speaker-invariant reconstructed MFCCs using the s-vector unification method, a representative speaker is selected from the training sets; in this work the English speaker “s4018” is chosen. The encoder and decoder networks of the FHVAE are both LSTMs, and the latent variables $z_1$ and $z_2$ are of fixed dimensionality. FHVAE training is implemented using an open-source tool [18].

The FHVAE based speaker-invariant MFCC features, with Δ and ΔΔ appended, are fed as inputs to DPGMM clustering. Training data for the three languages are clustered separately, with the number of clustering iterations set per language; the numbers of clusters are determined by the DPGMM. The obtained frame labels support multilingual DNN training. DNN input features are MFCC+CMVN. The shared feature extractor is a feed-forward network whose nonlinear function is sigmoid, except for the linear BN layer. The speaker classification part contains three sub-networks, one for each language; each sub-network contains a GRL, a feed-forward layer (FFL) and a softmax layer. The subword classification part also contains three sub-networks, each having an FFL and a softmax layer. During AMTL DNN training, the learning rate decays exponentially from its initial value, and the speaker adversarial weight is varied over a range of values. After training, BNFs extracted from the feature extractor are evaluated in the ABX task. The DNN is implemented using the Kaldi [27] nnet3 recipe. DPGMM is implemented using the tools developed by [19].

DPGMM clustering on raw MFCC features is also implemented to generate alternative DPGMM labels for comparison; the numbers of clustering iterations and resulting clusters again vary per language. The DNN structure and training procedure are the same as mentioned above.

3.3 Experimental results

Average ABX error rates on BNFs over the three target languages with different values of the adversarial weight are shown in Figure 2.

Figure 2: Average ABX error rates on BNF over languages

In this figure, a zero adversarial weight denotes that speaker adversarial training is not applied. From the dashed (blue) lines, it can be observed that speaker adversarial training reduces ABX error rates in both across- and within-speaker conditions. The amount of improvement is in accordance with the findings reported in [17], despite the fact that [17] exploited English transcriptions during training. The dash-dotted (red) lines show that when DPGMM labels generated from reconstructed MFCCs are employed in DNN training, the positive impact of speaker adversarial training in the across-speaker condition is relatively limited, and a negative impact is observed in the within-speaker condition. From Figure 2, it can be concluded that for the purpose of improving the robustness of subword modeling towards speaker variation, frame labeling based on disentangled speech representation learning is more prominent than speaker adversarial training.

4 ZeroSpeech 2019 experiments

4.1 Dataset and evaluation metrics

ZeroSpeech 2019 [3] provides untranscribed speech data for two languages. English is used for development while the surprise language (Indonesian) [28, 29] is used for test only. Each language pack consists of training and test sets. The training set consists of a unit discovery dataset for building unsupervised subword models, and a voice dataset for training the TTS system. Details of ZeroSpeech 2019 datasets are listed in Table 1.

English Surprise
Duration #speakers Duration #speakers
Training Unit hrs hrs
Voice hrs hrs
Test hr hr
Table 1: ZeroSpeech 2019 datasets

There are two categories of evaluation metrics in ZeroSpeech 2019. The metrics for text embeddings, e.g. subword unit sequences, BNFs and PGs, are ABX discriminability and bitrate. Bitrate measures the amount of information carried in the inferred unit sequences. The metrics for synthesized speech waveforms are character error rate (CER), speaker similarity (SS, 1 to 5, larger is better) and mean opinion score (MOS, 1 to 5, larger is better), all evaluated by native speakers.
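Bitrate can be estimated from the symbol stream as the symbol rate times the empirical entropy of the unit distribution. A sketch of this computation (the challenge's official estimator may differ in details such as silence handling):

```python
import math
from collections import Counter

def bitrate(symbols, duration_seconds):
    """Estimated bits per second of a unit sequence: (symbols per second)
    times (entropy of the empirical symbol distribution, in bits)."""
    n = len(symbols)
    counts = Counter(symbols)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy / duration_seconds

print(bitrate(["a", "b", "a", "b"], duration_seconds=2.0))  # → 2.0
```

This makes the quality/quantity trade-off concrete: smoothing removes symbols, lowering the bitrate, while finer-grained units raise it.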

4.2 System setup

FHVAE model training and speaker-invariant MFCC reconstruction follow the configurations used for ZeroSpeech 2017. The unit dataset is used for training. During MFCC reconstruction, a male speaker for each of the two languages is randomly selected as the representative speaker for s-vector unification; our recent findings [11] showed that male speakers are more suitable than female speakers for generating speaker-invariant features. The IDs of the selected speakers are “S015” for English and “S002” for Surprise. In DPGMM clustering, the same number of clustering iterations is used for both languages. Input features are reconstructed MFCCs with Δ and ΔΔ appended; the numbers of clusters are determined by the DPGMM. The speaker AMTL DNN structure and training procedure follow the ZeroSpeech 2017 configurations, with two differences: the adversarial speaker sub-network is placed on top of the FFL in the subword classification branch instead of on top of the shared feature extractor, and the DNN is trained in a monolingual manner. After DNN training, PGs for the voice and test sets are extracted, as are BNFs for the test set. Adversarial weights over a range of values are evaluated on the English test set.

The TTS model is trained with the voice dataset and its subword unit sequences inferred from PGs. TTS training is implemented using the tools of [30] in the same way as in the baseline. The trained TTS synthesizes speech waveforms according to unit sequences inferred from test speech utterances. Algorithm 1 is applied to the voice set and optionally applied to the test set.

4.3 Experimental results

ABX error rates on subword unit sequences, PGs and BNFs with different values of the adversarial weight, evaluated on the English test set, are shown in Figure 3.

Figure 3: ABX error rates on unit sequence, PG and BNF with different adversarial weights evaluated on English test set

Algorithm 1 is not applied at this stage. It is observed that speaker adversarial training achieves absolute error rate reductions on the PG and BNF representations. The unit sequence representation does not benefit from adversarial training; therefore, the optimal adversarial weight for unit sequences is zero. The performance gap between frame-level PGs and unit sequences measures the phoneme discriminability distortion caused by the unit inference procedure in this work.

We fix the adversarial weight to zero to train the TTS model, and synthesize test speech waveforms using the trained TTS. Experimental results of our submission systems are summarized in Table 2.

English Surprise
Baseline [3]
Topline [3]
Table 2: Comparison of baseline, topline and our submission

In this table, “+SM” denotes applying sequence smoothing to the test set unit labels. Compared with the official baseline, our proposed approaches significantly improve unit quality in terms of ABX discriminability, with absolute error rate reductions on both the English and Surprise sets for the system without SM. When SM is applied, the ABX error rate increases, but improvements in all the other evaluation metrics are observed. This implies that for the goal of speech synthesis, there is a trade-off between the quality and quantity of the learned subword units. Besides, our ABX performance is competitive with, or even better than, the supervised topline.

Our systems do not outperform the baseline in terms of synthesis quality. One possible explanation is that our learned subword units are much more fine-grained than those of the baseline AUD, making the baseline TTS less suitable for our AUD system. In the future, we plan to investigate alternative TTS models to take full advantage of the learned subword units.

5 Conclusions

This study tackles robust unsupervised subword modeling in the zero-resource scenario. Robustness towards speaker variation is achieved by combining speaker adversarial training and FHVAE based disentangled speech representation learning. Our proposed approaches are evaluated on ZeroSpeech 2019 and ZeroSpeech 2017. Experimental results on ZeroSpeech 2017 show that both approaches are effective, the latter more prominently so, and that their combination brings a further marginal improvement in the across-speaker condition. Results on ZeroSpeech 2019 show that our approaches achieve significant ABX error rate reductions relative to the baseline system. The proposed unit sequence smoothing algorithm improves synthesis quality, at the cost of a slight decrease in ABX discriminability.

6 Acknowledgements

This research is partially supported by the Major Program of National Social Science Fund of China (Ref:13&ZD189), a GRF project grant (Ref: CUHK 14227216) from Hong Kong Research Grants Council and a direct grant from CUHK Research Committee.


  • [1] M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A. Jansen, and E. Dupoux, “The zero resource speech challenge 2015.” in Proc. INTERSPEECH, 2015, pp. 3169–3173.
  • [2] E. Dunbar, X.-N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” in Proc. ASRU, 2017, pp. 323–330.
  • [3] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, L. Besacier, S. Sakti, and E. Dupoux, “The zero resource speech challenge 2019: TTS without T,” in Submitted to INTERSPEECH, 2019.
  • [4] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel inference of Dirichlet process gaussian mixture models for unsupervised acoustic modeling: A feasibility study,” in Proc. INTERSPEECH, 2015, pp. 3189–3193.
  • [5] T. K. Ansari, R. Kumar, S. Singh, S. Ganapathy, and S. Devi, “Unsupervised HMM posteriograms for language independent acoustic modeling in zero resource conditions,” in Proc. ASRU, 2017, pp. 762–768.
  • [6] M. Heck, S. Sakti, and S. Nakamura, “Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to zerospeech 2017,” in Proc. ASRU, 2017, pp. 740–746.
  • [7] H. Shibata, T. Kato, T. Shinozaki, and S. Watanabe, “Composite embedding systems for zerospeech2017 track 1,” in Proc. ASRU, 2017, pp. 747–753.
  • [8] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multilingual bottle-neck feature learning from untranscribed speech,” in Proc. ASRU, 2017, pp. 727–733.
  • [9] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, “A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge,” in Proc. INTERSPEECH, 2015, pp. 3199–3203.
  • [10] H. Kamper, M. Elsner, A. Jansen, and S. Goldwater, “Unsupervised neural network based feature extraction using weak top-down constraints,” in Proc. ICASSP, 2015, pp. 5818–5822.
  • [11] S. Feng and T. Lee, “Improving purely unsupervised subword modeling via disentangled speech representation learning and transformation,” in submitted to INTERSPEECH, 2019.
  • [12] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using wavenet autoencoders,” in arXiv, 2019.
  • [13] R. Thiollière, E. Dunbar, G. Synnaeve, M. Versteegh, and E. Dupoux, “A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling,” in Proc. INTERSPEECH, 2015, pp. 3179–3183.
  • [14] N. Zeghidour, G. Synnaeve, N. Usunier, and E. Dupoux, “Joint learning of speaker and phonetic similarities with siamese networks,” in Proc. INTERSPEECH, 2016, pp. 1295–1299.
  • [15] R. Riad, C. Dancette, J. Karadayi, N. Zeghidour, T. Schatz, and E. Dupoux, “Sampling strategies in siamese networks for unsupervised speech representation learning,” in Proc. INTERSPEECH, 2018, pp. 2658–2662.
  • [16] S. Feng and T. Lee, “Exploiting speaker and phonetic diversity of mismatched language resources for unsupervised subword modeling,” in Proc. INTERSPEECH, 2018, pp. 2673–2677.
  • [17] T. Tsuchiya, N. Tawara, T. Ogawa, and T. Kobayashi, “Speaker invariant feature extraction for zero-resource languages with adversarial learning,” in Proc. ICASSP, 2018, pp. 2381–2385.
  • [18] W. Hsu, Y. Zhang, and J. R. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Proc. NIPS, 2017, pp. 1876–1887.
  • [19] J. Chang and J. W. Fisher III, “Parallel sampling of DP mixture models using sub-cluster splits,” in Advances in NIPS, 2013, pp. 620–628.
  • [20] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proc. ICML, 2015, pp. 1180–1189.
  • [21] S. Sun, B. Zhang, L. Xie, and Y. Zhang, “An unsupervised deep domain adaptation approach for robust speech recognition,” Neurocomputing, vol. 257, pp. 79–87, 2017.
  • [22] Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gong, and B. Juang, “Speaker-invariant training via adversarial learning,” in Proc. ICASSP, 2018, pp. 5969–5973.
  • [23] J. Yi, J. Tao, Z. Wen, and Y. Bai, “Language-adversarial transfer learning for low-resource speech recognition,” IEEE/ACM Trans. ASLP, vol. 27, no. 3, pp. 621–630, 2019.
  • [24] Z. Peng, S. Feng, and T. Lee, “Adversarial multi-task deep features and unsupervised back-end adaptation for language recognition,” in Proc. ICASSP, 2019.
  • [25] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in Proc. OSDI, 2016, pp. 265–283.
  • [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, vol. abs/1412.6980, 2014.
  • [27] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.
  • [28] S. Sakti, R. Maia, S. Sakai, T. Shimizu, and S. Nakamura, “Development of HMM-based indonesian speech synthesis,” in Proc. O-COCOSDA, 2008, pp. 215–220.
  • [29] S. Sakti, E. Kelana, H. Riza, S. Sakai, K. Markov, and S. Nakamura, “Development of indonesian large vocabulary continuous speech recognition system within A-STAR project,” in Proc. TCAST, 2008, pp. 19–24.
  • [30] Z. Wu, O. Watts, and S. King, “Merlin: An open source neural network speech synthesis system,” in Proc. INTERSPEECH, 2016, pp. 202–207.