Human infants acquire knowledge of a language by mere immersion in a language-speaking community. The process is not yet completely understood, and it is difficult to reproduce with current automatic speech recognition (ASR) technologies, where the dominant paradigm is supervised learning with large human-annotated data sets. The idea behind the Zero Resource Speech Challenge is to inspire the development of speech recognition under the extreme situation where a whole language has to be learned from scratch [2, 3]. The goal of this challenge is to find linguistic units directly from raw audio with no knowledge of the language, the speaker, or any other supplementary information. The challenge includes two tracks, which focus on subword units and word units respectively. In the first track, unsupervised subword modeling, the aim is to construct a framewise feature representation of speech sounds that is robust to within-speaker and across-speaker variation. Dynamic Time Warping (DTW) is performed on sequences of these features for predefined phone pair intervals to extract the warping distance. The performance of the features is evaluated using ABX discriminability on within-speaker and across-speaker phone pairs. The second track focuses on the discovery of word units, and the aim is to extract timing information for such word units in the hypothesized vocabularies derived from the speech corpus. The intervals in which each word unit appears in the corpus are then evaluated on parsing, clustering and matching quality. This paper documents the work submitted to the challenge, within the Interspeech 2015 technical program, by a team organized at National Taiwan University.
In this work, we propose a completely unsupervised framework, the Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN), for the task. A Multi-layered Acoustic Tokenizer (MAT) is used to generate multiple sets of acoustic tokens. Each acoustic token set is specified by a pair of hyperparameters representing the model granularities of the tokens. As a naming convention, we call an acoustic token set obtained from a hyperparameter pair a layer. Each layer carries complementary knowledge about the corpus and the language behind it. Since it is well known that speech signals have multi-level structures, including at least phonemes and words, which are helpful in analysing or decoding speech, these sets of acoustic tokens can be further mutually reinforced. The multi-layered token labels generated by the MAT are then used as the training targets of a Multi-target Deep Neural Network (MDNN) to learn framewise bottleneck features (BNFs). The BNFs are then used as feedback to both the MAT and the MDNN in the next iteration. The BNFs from the MDNN are evaluated in Track 1, while the time intervals for the acoustic tokens obtained in the MAT are evaluated in Track 2.
2 Proposed Approach
2.1 Overview of the proposed framework
The framework of the approach is shown in Fig.1. In the left part, the Multi-layered Acoustic Tokenizer (MAT) produces many sets of acoustic tokens using unsupervised HMMs, each describing different aspects of the given corpus. Each token set is specified by two hyperparameters describing the HMM configuration. A set of acoustic tokens is obtained for each configuration by iteratively optimizing the token models and the token labels on the given acoustic corpus. Multiple pairs of hyperparameters are selected, producing multi-layered token labels for the given corpus; these are used as the training targets of the Multi-target Deep Neural Network (MDNN) in the right part of Fig.1. The MDNN learns its parameters with these multi-layered token labels from the MAT as its targets, so the knowledge carried by the different token sets on different layers is fused. Bottleneck features are then extracted from this MDNN. In the first iteration, some initial acoustic features are used for both the MAT and the MDNN. This gives the first set of bottleneck features. In the second iteration, these bottleneck features are used as feedback to both the MAT (to replace the initial acoustic features) and the MDNN (to be concatenated with the initial acoustic features to produce tandem features). Such feedback can be continued iteratively. The complete framework is referred to as the Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN) in this paper. The output of the MDNN (bottleneck features) is evaluated in Track 1 of the Challenge, while the time intervals for the acoustic token labels at the output of the MAT are evaluated in Track 2 of the Challenge.
2.2 Multi-layered Acoustic Tokenizer
The goal in this step is to obtain multiple sets of acoustic tokens, each defined by some hyperparameters, which capture complementary aspects of the corpus. There is no knowledge regarding the corpus at all, so the process here is completely unsupervised.
2.2.1 Unsupervised Token Discovery for Each layer of MAT
Using unsupervised HMMs, it is straightforward to discover acoustic tokens from the corpus for a chosen hyperparameter pair that determines the HMM configuration (number of states per model and number of distinct models) [11, 12, 13, 14, 15]. This can be achieved by first finding an initial label set based on a set of assumed tokens for all features in the corpus as in (1). Then, in each iteration, the HMM parameters are trained with the label set obtained in the previous iteration as in (2), and the new label set is obtained by token decoding with the newly trained parameters as in (3).
The training process is repeated for enough iterations to obtain a converged set of token HMMs. The processes (2) and (3) are referred to as token model optimization and token label optimization in the left part of Fig.1.
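The alternation between steps (1), (2) and (3) can be illustrated with a deliberately simplified sketch. It is not the HMM training used in this work: as an assumption made only for brevity, each token is represented by a single mean vector instead of a multi-state HMM, the initial labels are random, and decoding is a framewise nearest-mean search. The function name `discover_tokens` and its parameters are hypothetical.

```python
import numpy as np

def discover_tokens(features, num_tokens, num_iters=10, seed=0):
    """Alternate between token model optimization and token label optimization.

    Simplification (assumption): every token is one mean vector, not an HMM,
    and labels are decoded frame by frame with a nearest-mean rule.
    """
    rng = np.random.default_rng(seed)
    # Step (1): initial label set, here a random assignment of frames to tokens.
    labels = rng.integers(0, num_tokens, size=len(features))
    for _ in range(num_iters):
        # Step (2): re-estimate every token model from the current labels.
        # An empty token is re-seeded with a randomly chosen frame.
        means = np.stack([
            features[labels == k].mean(axis=0) if np.any(labels == k)
            else features[rng.integers(len(features))]
            for k in range(num_tokens)
        ])
        # Step (3): decode a new label set with the updated models.
        dists = ((features[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: the labels no longer change
        labels = new_labels
    return means, labels
```

With real HMMs, step (2) would correspond to parameter re-estimation and step (3) to decoding over whole utterances, but the control flow stays the same.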
2.2.2 Granularity Space of Multi-layered Acoustic Token Sets
The process explained above can be performed with different HMM configurations, each characterized by two hyperparameters: the number of states in each acoustic token HMM, and the total number of distinct acoustic tokens during initialization. The transcription of a signal decoded with these tokens can be considered as a temporal segmentation of the signal, so the HMM length (the number of states in each HMM) represents the temporal granularity. The set of all distinct acoustic tokens can be considered as a segmentation of the phonetic space, so the total number of distinct acoustic tokens represents the phonetic granularity. This gives a two-dimensional representation of the acoustic token configurations in terms of temporal and phonetic granularities, as in Fig.2. Any point in this two-dimensional space corresponds to an acoustic token configuration. Acoustic tokens in different layers have different model granularities and extract complementary characteristics of the corpus and the language behind it, so they jointly capture knowledge about the corpus. Although the hyperparameters can be selected arbitrarily in this two-dimensional space, here we select a set of temporal granularities and a set of phonetic granularities, forming a two-dimensional array of hyperparameter pairs in the granularity space.
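The two-dimensional array of hyperparameter pairs can be enumerated as a Cartesian product. The sketch below uses the values reported in Section 3 of this paper; the variable names are illustrative only.

```python
from itertools import product

# Temporal granularity: number of states per token HMM (values from Section 3).
temporal_granularities = [3, 5, 7, 9]
# Phonetic granularity: number of distinct tokens (values from Section 3).
phonetic_granularities = [50, 100, 300, 500]

# Each (states, tokens) pair defines one layer of the MAT.
layers = list(product(temporal_granularities, phonetic_granularities))
print(len(layers))  # 16
```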
2.3 Mutual Reinforcement of Multi-layered Tokens
Because all the layers obtained in the MAT above are learned in an unsupervised fashion, they are not precise. However, we have many layers, each corresponding to a different pair of hyperparameters, so they can be mutually reinforced. This is explained here and shown in Fig.3; it includes token boundary fusion and LDA-based token label re-initialization, as in Fig.3(a).
2.3.1 Token Boundary Fusion
Fig.3(b) shows the token boundaries when a part of an utterance is segmented into acoustic tokens on different layers with different hyperparameter pairs. We define a boundary function on each layer over the possible boundaries between every pair of adjacent frames within the utterance, indexed by the time index of each possible boundary. On each layer, the boundary function is 1 if the possible boundary is a token boundary and 0 otherwise. The boundary functions for all layers are then weighted and averaged to give a joint boundary function. The weights account for the fact that smaller or shorter HMMs generate more boundaries. The peaks of the joint boundary function are then selected based on its second derivatives and some filtering and thresholding. This gives the new segmentation of the utterance, as shown at the bottom of Fig.3(b).
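A minimal sketch of the fusion step follows. The weighted average matches the description above, but the weights are left as an input because their exact form is not given here, and the peak picker (local maxima above a threshold) is a simplified stand-in, assumed for illustration, for the second-derivative, filtering and thresholding procedure. The name `fuse_boundaries` is hypothetical.

```python
import numpy as np

def fuse_boundaries(layer_boundaries, weights, threshold=0.5):
    """Fuse per-layer 0/1 boundary functions into a joint boundary function.

    layer_boundaries: array of shape (num_layers, num_positions) with 0/1 entries.
    weights: one non-negative weight per layer (their values are an assumption).
    """
    b = np.asarray(layer_boundaries, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Weighted average over layers at every possible boundary position.
    joint = (w[:, None] * b).sum(axis=0) / w.sum()
    # Simplified peak picking: local maxima that reach the threshold.
    last = len(joint) - 1
    peaks = [
        t for t in range(len(joint))
        if joint[t] >= threshold
        and (t == 0 or joint[t] > joint[t - 1])
        and (t == last or joint[t] >= joint[t + 1])
    ]
    return joint, peaks

layer_boundaries = [
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
]
joint, peaks = fuse_boundaries(layer_boundaries, weights=[1.0, 1.0, 1.0])
print(peaks)  # [1, 4]
```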
2.3.2 LDA-based Token Label Re-initialization
As shown in Fig.3(c), each new segment obtained above usually consists of a sequence of acoustic tokens on each layer, based on the tokens defined on that layer. We now consider all the tokens on all the different layers as different words, so we have a vocabulary of words formed from the tokens of every layer. A new segment here is thus considered as a document (bag-of-words) composed of words (tokens) collected from all the different layers. Latent Dirichlet Allocation (LDA) is performed for topic modeling, and then each document (new segment) is labeled with its most probable topic. Because a topic in LDA is characterized by a word distribution, a token distribution across different layers may here also represent a certain acoustic characteristic or a certain acoustic token. By setting the number of topics in LDA to the number of distinct tokens (the phonetic granularities of subsection 2.2.2), we have a new initial label set as in (1) of subsection 2.2.1, in which each new segment obtained here is a new acoustic token whose ID is the topic ID obtained by LDA. This new initial label set is then used to re-train all the acoustic tokens on all layers of the MAT as in (1), (2) and (3).
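The construction of the documents that are passed to LDA can be sketched as follows. The sketch only builds the bag-of-words counts and leaves the topic model itself to an external tool. The data layout, the overlap rule used to decide whether a token belongs to a segment, and the name `segments_to_documents` are assumptions made for illustration.

```python
from collections import Counter

def segments_to_documents(segment_bounds, layer_tokens):
    """Turn fused segments into bag-of-words documents over all layers.

    segment_bounds: list of (start, end) frame intervals, end exclusive.
    layer_tokens: dict mapping a layer name to a list of (start, end, token_id).
    A word is the pair (layer, token_id), so tokens of different layers never collide.
    """
    documents = []
    for seg_start, seg_end in segment_bounds:
        bag = Counter()
        for layer, tokens in layer_tokens.items():
            for tok_start, tok_end, tok_id in tokens:
                # Assumed rule: a token counts if it overlaps the segment at all.
                if tok_start < seg_end and tok_end > seg_start:
                    bag[(layer, tok_id)] += 1
        documents.append(bag)
    return documents

segments = [(0, 10), (10, 20)]
tokens = {
    "layer_a": [(0, 5, 1), (5, 12, 2), (12, 20, 1)],
    "layer_b": [(0, 10, 7), (10, 20, 8)],
}
docs = segments_to_documents(segments, tokens)
```

Each resulting counter plays the role of one document, and the topic with the highest posterior for that document would become the new token label of the segment.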
2.4 The Multi-target DNN (MDNN)
As shown in the right part of Fig.1, the token label sequence from a layer (with a given pair of hyperparameters) is a valid target for supervised framewise training, although it is obtained in an unsupervised way. In the initial work here, we do not use the HMM states as the target, but simply take the token label as the training target. As shown in Fig.1, there are multi-layered token labels with different hyperparameter pairs for each utterance, so we jointly consider all the multi-layered token labels by learning the parameters of a single DNN with a uniformly weighted cross-entropy objective at the output layer. As a result, the bottleneck features (BNFs) extracted from this DNN automatically fuse all the knowledge about the corpus and the language behind it that was learned from the different sets of acoustic tokens.
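The uniformly weighted cross-entropy objective can be written down compactly. The sketch below assumes one softmax output head per layer on top of a shared network and computes only the loss value; the function names are hypothetical and the network itself is omitted.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis of a 2-D array."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def multi_target_loss(logits_per_layer, labels_per_layer):
    """Uniformly weighted cross-entropy over all token layers.

    logits_per_layer: list of (num_frames, num_tokens_in_layer) arrays.
    labels_per_layer: list of (num_frames,) integer token labels.
    """
    losses = []
    for logits, labels in zip(logits_per_layer, labels_per_layer):
        logp = log_softmax(logits)
        # Framewise cross-entropy for this layer's output head.
        losses.append(-logp[np.arange(len(labels)), labels].mean())
    # Uniform weighting: a plain mean over layers.
    return float(np.mean(losses))

# Two layers with 4 and 2 tokens; all-zero logits give log(4) and log(2).
logits = [np.zeros((5, 4)), np.zeros((5, 2))]
labels = [np.array([0, 1, 2, 3, 0]), np.array([0, 1, 0, 1, 0])]
loss = multi_target_loss(logits, labels)
```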
2.5 The Iterative Learning Framework for MAT-DNN
Once the BNFs are extracted from the MDNN in iteration 1, they can be taken as the input of the MAT on the left of Fig.1(c), replacing the initial acoustic features. The MAT then generates updated sets of multi-layered token labels, which can be used as the updated training targets of the MDNN. The input features of the MDNN can also be updated by concatenating the initial acoustic features with the newly extracted BNFs to form tandem features. This process can be repeated for several iterations until satisfactory results are obtained. The tandem features used as the input of the MDNN can be further augmented by concatenating unsupervised features obtained from other systems, such as a Deep Boltzmann Machine [18, 19] trained on MFCC. Although this differs from the conventional recurrent neural network (RNN), in which the recurrent structure is included in back-propagation training, the concatenation of the bottleneck features with other features in the next iteration of the MDNN is a kind of recurrent structure.
3 Experimental Setup
The general framework of the MAT-DNN presented above allows several flexible configurations. In this work, we train the MAT-DNN in the following manner. We use 3, 5, 7, or 9 states per token HMM and 50, 100, 300, or 500 distinct tokens in the MAT, which gives a total of 16 layers.
In the first iteration, we use 39-dimensional Mel-frequency Cepstral Coefficients (MFCC) with energy, delta and double delta as the initial acoustic features for the input to both the MAT and the MDNN. For the input of the MDNN, we concatenate the MFCC over a window of 4 frames before and after (39x9 dimensions) with an i-vector (400 dimensions) trained on the MFCC of each evaluation interval. The topology of the DNN is set to 751(input)-256(hidden)-256(hidden)-39(bottleneck)-(target), with 3 hidden layers. Even without the feedback and tandem features, the MAT-DNN is a powerful self-contained unsupervised feature extractor. We compared the BNFs extracted in the first iteration with the Deep Boltzmann Machine (DBM) posteriorgrams mentioned in section 2.5, which use the same MFCC as input. To make the comparison fair, we keep the dimensionality of these features at 39. For the DBM, we used the 39-dimensional MFCC with a window of 5 frames before and after as the input. The configuration we used for the DBM is 429(visible)-256(hidden)-256(hidden)-39(hidden). We originally extracted another set of LSTM-RNN autoencoder bottleneck features as a further baseline, but its performance was slightly worse than that of the MFCC, so we omit it from the discussion here.
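The 751-dimensional MDNN input can be reproduced with a short sketch: 39 MFCC dimensions over 9 frames give 351 dimensions, and the 400-dimensional i-vector is appended to every frame. How the utterance edges are handled is not stated here, so the edge padding below is an assumption, as are the function names.

```python
import numpy as np

def splice(frames, context=4):
    """Stack every frame with `context` frames before and after it.

    Edges are padded by repeating the first and last frame (an assumption).
    """
    num_frames = frames.shape[0]
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + num_frames] for i in range(2 * context + 1)])

def mdnn_input(mfcc, ivector, context=4):
    """Concatenate the spliced MFCC with one i-vector repeated for every frame."""
    spliced = splice(mfcc, context)
    tiled = np.tile(ivector, (len(mfcc), 1))
    return np.hstack([spliced, tiled])

mfcc = np.random.default_rng(0).normal(size=(10, 39))
ivector = np.zeros(400)
x = mdnn_input(mfcc, ivector)
print(x.shape)  # (10, 751)
```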
In the second iteration, we concatenate the original MFCC, the BNFs extracted in the first iteration, the DBM posteriorgrams, and the i-vector, forming a 1453-dimensional (39x9+39x9+39x9+400) input to the MDNN. We used the updated transcriptions as the targets and extracted the BNFs as the features. The MAT is trained using zrst, a Python wrapper for the HTK toolkit and SRILM that we developed for training unsupervised HMMs with varying model granularity. LDA for the mutual reinforcement is performed with MALLET. The MFCC were extracted using the HTK toolkit. The i-vectors were extracted using Kaldi. The DBM posteriorgrams were extracted using libdnn. The MDNN was trained using Caffe.
3.1 Track 1
The two official corpora are the Buckeye corpus and the NCHLT Xitsonga Speech corpus, in English and Tsonga respectively. They are used in the evaluation based on the ABX discriminability test, including across-speaker and within-speaker tests. The final results are reported as error percentages, so lower is better. Our results for track 1 are presented in Table 1.
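For readers unfamiliar with the ABX test, the following is a rough sketch of the idea and not the official evaluation tool. Given a triple (A, B, X) in which X belongs to the same category as A, an error is counted when X is not closer to A than to B under a DTW distance. The Euclidean frame distance, the normalization by the summed sequence lengths, and the treatment of ties as errors are all assumptions of this sketch.

```python
import numpy as np

def dtw_distance(x, y):
    """Length-normalized DTW distance between two feature sequences.

    x, y: arrays of shape (num_frames, dim). Frame distance is Euclidean
    (an assumption; other frame distances can be substituted).
    """
    n, m = len(x), len(y)
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Standard recursion: extend the cheapest of the three predecessors.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m] / (n + m)

def abx_error(triples):
    """Error percentage over (A, B, X) triples where X shares A's category."""
    errors = [dtw_distance(a, x) >= dtw_distance(b, x) for a, b, x in triples]
    return 100.0 * float(np.mean(errors))

a = np.zeros((3, 2))
b = np.full((4, 2), 5.0)
x = np.zeros((5, 2))
print(abx_error([(a, b, x)]))  # 0.0
```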
Rows (1) and (11) are the official baseline MFCC features and the official topline supervised phone posteriorgrams provided by the challenge organizers, respectively. Row (2) is our baseline of the MFCC features, the initial acoustic features used to train all systems in this work. Row (3) is for the DBM posteriorgrams extracted from the MFCC of row (2), serving as a strong unsupervised baseline. The results in rows (4), (5) and (6) are the performance of the bottleneck features extracted in the first iteration of the MAT-DNN without applying mutual reinforcement (MR) (4), applying MR once (5), and applying MR twice (6). Row (9) is similar to row (5), except that we use a wider bottleneck layer with 256 dimensions instead of 39. Rows (7) and (8) are the performance of the bottleneck features extracted in the second iteration of the MAT-DNN without applying MR (7) and applying MR once (8). The MAT of the MAT-DNN in (7) and (8) is trained using the BNFs of row (5). Row (10) is similar to row (8), except that only the MFCC and i-vectors are concatenated as input, without other features.
All the features from row (2) to (10), except for (9), are confined to 39 dimensions. This allows a fast and fair comparison of different algorithms. We observe that, as a stand-alone feature extractor without any iterations, the MAT-DNN in row (5) outperforms the DBM baseline in (3). The effect of mutual reinforcement can be seen in the improvement from row (4) to rows (5) and (6), and from row (7) to row (8). We observe that a single iteration of mutual reinforcement of the targets of the MAT-DNN is enough to bring a large improvement to the system. The effect of iterations in the MAT-DNN can be seen by comparing rows (2), (5) and (8), which correspond to 0, 1, and 2 iterations respectively. Although the performance improvement from row (2) to row (5) is notable, performance dropped in the second iteration in (8). To investigate the reasons for this drop, we widened the bottleneck features to 256 dimensions in (9) and observed a dramatic improvement in performance. It is possible that we have not explored the full potential of the MAT-DNN, as comparison between algorithms was the original goal when we designed the experiments. With a better tuned set of parameters, improvement in subsequent iterations is to be expected on track 1. Nonetheless, the benefit of the second iteration is better observed in track 2.
3.2 Track 2
The evaluation tool for track 2 provided by the challenge organizers gives five main metrics plus two more scores: NED and coverage. Fig.4 shows the results for (a) English and (b) Tsonga in NED, as well as the F-measures for the five main metrics: matching, grouping, type, token, and boundary, each in a subgraph. We omit coverage here because it is almost 100% in all cases, so there are six subfigures in each of Fig.4(a) and (b). In each subfigure, the results for four cases are shown; they correspond to the four MAT targets used for the MDNN bottleneck features listed in rows (4), (5), (6) and (8) of Table 1. For each of these token sets, the three or six groups of bars correspond to different numbers of states per token HMM (3, 5, 7 or 3, 5, 7, 9, 11, 13), while in each group the four bars correspond to the number of distinct tokens (50, 100, 300, 500 from left to right); these two values are the hyperparameters of the token sets. Bars in blue are better than the JHU baseline, while those in white are worse. Only the results jointly considering both within-talker and across-talker conditions are shown.
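As a reminder of what the NED score measures, the sketch below computes a normalized edit distance between two symbol sequences, for example the phoneme transcriptions of two segments that were matched to each other. Dividing the Levenshtein distance by the length of the longer sequence is our understanding of the usual definition and should be treated as an assumption; the official tool additionally averages over all matched pairs.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def ned(a, b):
    """Edit distance normalized by the longer length (assumed definition)."""
    return edit_distance(a, b) / max(len(a), len(b), 1)

print(edit_distance("kitten", "sitting"))  # 3
```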
From Fig.4(a) for English, it can be seen that the proposed token sets perform well in type, token and boundary scores, although much worse in matching and grouping. We see in many cases the benefits brought by MR (e.g. (6) vs (5) in type of Fig.4(a)) and by the second iteration (e.g. (8) vs (6) in boundary of Fig.4(a)), especially for small values of . In many groups for a given number of states, smaller numbers of distinct tokens seemed better, probably because 50 is close to the total number of phonemes in the language. Also, a general trend is that larger numbers of states were better, probably because HMMs with more states were better at modelling the relatively long units; this may directly lead to the higher type, token and boundary scores.
Similar observations can be made for Tsonga in Fig.4(b), and the overall performance seemed to be even better, as the proposed token sets perform well even in matching scores. The improvements brought by MR, the bottleneck features and the second iteration are better observed here, giving the best cases for all five main scores. This is probably due to the fact that more sets of tokens were available for MR and the MAT-DNN on Tsonga than on English. We can conclude from this observation that more token sets introduce more robustness, which leads to better token sets for the next iteration. When the number of states goes to 13, we see that without MR (in (4) of Fig.4(b)) almost all metrics degrade except for matching scores, but with MR almost all the scores (except for NED) consistently increase as the number of states becomes larger. This suggests that MR can also prevent degradation when detecting relatively long units.
including Precision (P), Recall (R) and F-scores (F). These three example sets are also marked in Fig.4. In Table 2, the scores better than the JHU baseline are in bold. The much higher NED and coverage scores suggest that the proposed approach is a highly permissive matching algorithm. The much higher parsing scores (type, token and boundary scores), especially the Recall and F-scores, imply that the proposed approach is more successful in discovering word-like units. However, the matching and grouping scores are much worse, probably because the discovered tokens cover almost the whole corpus, including short pauses and silence, and therefore many tokens are actually noise. Another possible reason is that the numbers of distinct tokens used are much smaller than the size of the real word vocabulary, so the same token label is used for signal segments of varying characteristics, which degrades the grouping quality.
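The relation between Precision, Recall and F-score can be illustrated on boundary positions. The sketch below counts a hypothesized boundary as correct only when it coincides exactly with a reference boundary; this exact-match rule and the function name are assumptions for illustration, and the official evaluation tool may score boundaries differently.

```python
def boundary_prf(hypothesis, reference):
    """Precision, recall and F-score of hypothesized boundary positions.

    A hypothesized boundary is a hit only if the same position appears in
    the reference (exact match, an assumption of this sketch).
    """
    hyp, ref = set(hypothesis), set(reference)
    hits = len(hyp & ref)
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

p, r, f = boundary_prf([10, 20, 30, 40], [10, 20, 50])
```

In this example two of the four hypothesized boundaries are correct and two of the three reference boundaries are found, so a system that proposes many boundaries can reach a high recall at the cost of precision.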
This paper summarizes the preliminary work done for the Zero Resource Speech Challenge in Interspeech 2015. We propose a MAT-DNN to generate multi-layer token sets and fuse the various knowledge in different token sets in the bottleneck features. We present the complete results on all evaluations we tested up to the submission deadline, with a hope that these results serve as good references for future investigations.
-  G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
-  C.-y. Lee and J. Glass, “A nonparametric bayesian approach to acoustic model discovery,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 2012, pp. 40–49.
-  M.-h. Siu, H. Gish, A. Chan, W. Belfield, and S. Lowe, “Unsupervised training of an hmm-based self-organizing unit recognizer with applications to topic classification and keyword discovery,” Computer Speech & Language, vol. 28, no. 1, pp. 210–223, 2014.
-  T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, “Evaluating speech features with the minimal-pair abx task: Analysis of the classical mfc/plp pipeline,” in INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association, 2013, pp. 1–5.
-  B. Ludusan, M. Versteegh, A. Jansen, G. Gravier, X.-N. Cao, M. Johnson, and E. Dupoux, “Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems,” in Language Resources and Evaluation Conference, 2014.
-  C.-T. Chung, C.-a. Chan, and L.-s. Lee, “Unsupervised spoken term detection with spoken queries by multi-level acoustic patterns with varying model granularity,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
-  Y.-c. Pan and L.-s. Lee, “Performance analysis for lattice-based speech indexing approaches using words and subword units,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 18, no. 6, pp. 1562–1574, 2010.
-  C.-T. Chung, W.-N. Hsu, C.-Y. Lee, and L.-S. Lee, “Enhancing automatically discovered multi-level acoustic patterns considering context consistency with applications in spoken term detection,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.
-  N. T. Vu, J. Weiner, and T. Schultz, “Investigating the learning effect of multilingual bottle-neck features for asr,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
-  K. Vesely, M. Karafiát, F. Grezl, M. Janda, and E. Egorova, “The language-independent bottleneck features,” in Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012, pp. 336–341.
-  A. Jansen and K. Church, “Towards unsupervised training of speaker independent acoustic models.” in INTERSPEECH, 2011, pp. 1693–1696.
-  H. Gish, M.-h. Siu, A. Chan, and W. Belfield, “Unsupervised training of an hmm-based speech recognizer for topic classification.” in INTERSPEECH, 2009, pp. 1935–1938.
-  M.-H. Siu, H. Gish, A. Chan, and W. Belfield, “Improved topic classification and keyword discovery using an hmm-based speech recognizer trained without supervision.” in INTERSPEECH, 2010, pp. 2838–2841.
-  C.-T. Chung, C.-a. Chan, and L.-s. Lee, “Unsupervised discovery of linguistic structure including two-level acoustic patterns using three cascaded stages of iterative optimization,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8081–8085.
-  M. Creutz and K. Lagus, “Unsupervised models for morpheme segmentation and morphology learning,” ACM Transactions on Speech and Language Processing (TSLP), vol. 4, no. 1, p. 3, 2007.
-  D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” the Journal of machine Learning research, vol. 3, pp. 993–1022, 2003.
-  R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, 2009, pp. 448–455.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  A. Kanagasundaram, R. Vogt, D. B. Dean, S. Sridharan, and M. W. Mason, “I-vector based speaker recognition on short utterances,” in Proceedings of the 12th Annual Conference of the International Speech Communication Association. International Speech Communication Association (ISCA), 2011, pp. 2341–2344.
-  C.-T. Chung, “zrst,” https://github.com/C2Tao/zrst, 2014.
-  S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey et al., The HTK book. Entropic Cambridge Research Laboratory Cambridge, 1997, vol. 2.
-  A. Stolcke et al., “Srilm-an extensible language modeling toolkit.” in INTERSPEECH, 2002.
-  A. K. McCallum, “MALLET: A Machine Learning for Language Toolkit,” 2002.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB.
-  P.-W. Chou, “libdnn,” https://github.com/botonchou/libdnn, 2014.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
-  M. A. Pitt, L. Dilley, K. Johnson, S. Kiesling, W. Raymond, E. Hume, and E. Fosler-Lussier, “Buckeye corpus of conversational speech (2nd release),” Columbus, OH: Department of Psychology, Ohio State University, 2007.
-  N. J. De Vries, M. H. Davel, J. Badenhorst, W. D. Basson, F. De Wet, E. Barnard, and A. De Waal, “A smartphone-based asr data collection tool for under-resourced languages,” Speech communication, vol. 56, pp. 119–131, 2014.
-  A. Jansen and B. Van Durme, “Efficient spoken term discovery using randomized algorithms,” in Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. IEEE, 2011, pp. 401–406.