Automatic speech recognition (ASR) has achieved great success and is widely used in modern society [1, 2, 3]. However, existing algorithms must learn from a large amount of annotated data, which makes developing speech technology for a new low-resource language challenging. Annotating audio data for speech recognition is expensive, but unannotated audio data is relatively easy to collect. If a machine could acquire the word patterns behind speech signals from a large collection of unannotated speech data, without alignment with text, it would be able to learn a new language from speech in a novel linguistic environment with little supervision. Much research has been directed toward this goal [4, 5, 6, 7, 8, 9, 10, 11].
Audio segment representation remains an open problem and an active research area [12, 13, 14, 15, 16]. In previous work, a sequence-to-sequence autoencoder (SA) was used to represent variable-length audio segments as fixed-length vectors [17, 18]. In SA, the RNN encoder reads an audio segment represented as an acoustic feature sequence and maps it to a vector representation; the RNN decoder maps the vector back to the input sequence of the encoder. SA requires only audio segments without human annotation, which makes it suitable for low-resource applications. It has been shown that the vector representation contains phonetic information [17, 18, 19, 20].
In text, Word2Vec [21] transforms each word into a fixed-dimension semantic vector used as the basic component of natural language processing applications. Word2Vec is useful because it is learned from a large collection of documents without supervision. In this paper, we propose a similar method to extract semantic representations from audio without supervision. First, phonetic embeddings with little speaker- or environment-dependent information are extracted from audio segments by SA with adversarial training for disentangling information. Then, the phonetic embeddings are used to obtain semantic embeddings by a skip-gram model. Unlike typical Word2Vec, which takes one-hot representations of words as input, the proposed model takes phonetic embeddings from SA as input.
Given a set of word embeddings learned from text, if we can map the audio semantic embeddings to the textual semantic embedding space, the text corresponding to the semantic embeddings of the audio segments becomes available. In this way, unsupervised ASR would be achieved. The idea is inspired by unsupervised machine translation with monolingual corpora only [22, 23]. Because most languages share the same expressive power and are used to describe similar human experiences across cultures, they should share similar statistical properties. For example, one can expect the most frequent words to be shared. Therefore, given two sets of word embeddings of two languages, the representations of these words can be similar up to a linear transformation [23, 24].
In our task, the targets we want to align are not two different languages, but audio and text of the same language. We believe the alignment is feasible because the frequencies and contextual relations of words are close in the audio and text domains for the same language. The mapping method used in this stage is an EM-based method, Mini-Batch Cycle Iterative Closest Point (MBC-ICP) [24], which was originally proposed for unsupervised machine translation. Here, given two sets of embeddings, that is, semantic embeddings from text and audio, MBC-ICP iteratively aligns the vectors in the two sets by Principal Component Analysis (PCA) and an affine transformation matrix. After mapping the semantic embeddings from audio to those learned from text, the text corresponding to the audio segments is directly known.
To the best of our knowledge, this is the first work attempting to achieve word-level ASR without any alignment between speech and text.
2 Proposed Method
The proposed framework of unsupervised ASR consists of three stages:
1. Extracting phonetic embeddings from word-level audio segments using SA with discrimination.
2. Training semantic embeddings from phonetic embeddings.
3. Unsupervised transformation from audio semantic embeddings to textual semantic embeddings.
2.1 Extracting Phonetic Embeddings from Word-Level Audio Segments Using SA with Discrimination
In the proposed framework, we assume that in an audio collection, each utterance is already segmented into word-level segments. Although unsupervised segmentation is still challenging, there are already many approaches available [25, 26]. We denote the audio collection as , which consists of word-level audio segments, , where is the feature vector of the tth time frame and is the number of time frames of the segment. The goal is to disentangle the phonetic and speaker information in acoustic features, and extract a vector representation with phonetic information.
As shown in Figure 1, we pass a sequence of acoustic features into a phonetic encoder and a speaker encoder to obtain a phonetic vector and a speaker vector . Then we take the phonetic and speaker vectors as inputs of the decoder to reconstruct the acoustic features . The phonetic vector will be used in the next stage. The two encoders and the decoder are jointly learned by minimizing the reconstruction loss below:
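As a rough illustration of the reconstruction objective described above, the sketch below computes a squared-error loss between the input acoustic features and the decoder output. The function name `reconstruction_loss`, the per-frame averaging and the toy data are assumptions made for illustration; the encoders and the decoder themselves are not modeled here.

```python
import numpy as np

def reconstruction_loss(segments, reconstructions):
    """Squared reconstruction error, averaged over all time frames.

    segments, reconstructions: lists of arrays of shape (T_i, feature_dim),
    one pair per word-level audio segment (T_i may differ per segment).
    """
    total = 0.0
    n_frames = 0
    for x, x_hat in zip(segments, reconstructions):
        total += np.sum((x - x_hat) ** 2)  # summed over frames and dimensions
        n_frames += x.shape[0]
    return total / n_frames

# Toy data: two segments of 39-dim features (hypothetical values).
rng = np.random.default_rng(0)
segments = [rng.normal(size=(t, 39)) for t in (12, 20)]
perfect = [x.copy() for x in segments]   # an ideal decoder output
shifted = [x + 0.1 for x in segments]    # a decoder output that is off by 0.1
loss_perfect = reconstruction_loss(segments, perfect)
loss_shifted = reconstruction_loss(segments, shifted)
print(loss_perfect, loss_shifted)
```

In the framework itself, the reconstructions would come from the decoder fed with the phonetic and speaker vectors, and the loss would be minimized jointly over the two encoders and the decoder.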
2.1.2 Training Criteria for Speaker Encoder
In the following discussion, we also assume the speakers of the segments are known. Suppose the segment is uttered by speaker . If speaker information is not available, we can simply assume that the segments from the same utterance are uttered by the same speakers, and the approach below can still be applied. is learned to minimize the following loss :
If and are uttered by the same speaker (), we want their speaker embeddings and to be as close as possible. On the other hand, if , we want the distance of and larger than a threshold .
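The pairwise speaker criterion described above can be sketched as a contrastive loss. This is a minimal sketch under assumptions: the squared Euclidean distance and the name `speaker_pair_loss` are our choices for illustration, and the default margin of 0.01 mirrors the value reported for the speaker loss term in the experimental setup.

```python
import numpy as np

def speaker_pair_loss(v_i, v_j, same_speaker, margin=0.01):
    """Contrastive loss on a pair of speaker embeddings.

    Same speaker: penalize the distance directly.
    Different speakers: penalize only if the distance is below the margin.
    """
    dist = float(np.sum((v_i - v_j) ** 2))  # squared Euclidean distance (assumption)
    if same_speaker:
        return dist
    return max(margin - dist, 0.0)

a = np.array([0.2, 0.1])
b = np.array([0.2, 0.1])   # identical to a
c = np.array([1.0, -1.0])  # far from a

loss_same_close = speaker_pair_loss(a, b, same_speaker=True)   # no penalty
loss_diff_close = speaker_pair_loss(a, b, same_speaker=False)  # full margin penalty
loss_diff_far = speaker_pair_loss(a, c, same_speaker=False)    # no penalty
print(loss_same_close, loss_diff_close, loss_diff_far)
```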
2.1.3 Training Criteria for Phonetic Encoder
As shown in Figure 1, the discriminator takes two phonetic vectors and as inputs and tries to tell if the two vectors come from the same speaker. The learning target of the phonetic encoder is to “fool” the discriminator, keeping it from discriminating correctly. In this way, only phonetic information is contained in the phonetic vector, and the speaker information in original acoustic features is encoded in the speaker vector. The discriminator learns to maximize in (3), while the phonetic encoder learns to minimize .
The whole optimization procedure of the discriminator and the other parts is iteratively minimizing and .
2.2 Training Semantic Embeddings from Phonetic Embeddings
Similar to the Word2Vec skip-gram model [21], we use two encoders and to train the semantic embeddings from phonetic embeddings (Figure 2). On one hand, given a segment , we feed its phonetic vector obtained from the previous stage into , and output the semantic embedding of the segment . On the other hand, given the context window size, which is a hyperparameter, if a segment is in the context window of , then its phonetic vector is a context vector of . For each context vector of , we feed it into , and output its context embedding .
Given a pair of phonetic vectors , the training criterion for and is to maximize the similarity of and if and are contextual, while minimizing their similarity otherwise. The basic idea parallels textual Word2Vec. Two different words that appear in similar contexts have similar semantics; thus, if two different phonetic embeddings corresponding to different words have the same context, they will be close to each other after being projected by . and learn to minimize the semantic loss as follows:
The sigmoid of dot product of and is used to evaluate the similarity. If and are in the same context window, we want and to be as similar as possible. We also use the negative sampling technique, in which only some pairs are randomly sampled as negative examples instead of enumerating all possible negative pairs.
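A minimal sketch of the skip-gram objective with negative sampling, as described above, is given below. The function names and the toy vectors are assumptions for illustration; in the proposed framework the semantic and context embeddings would be produced by the two encoders from phonetic vectors rather than being given directly.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -np.logaddexp(0.0, -z)

def semantic_loss(sem, ctx_pos, ctx_neg):
    """Negative-sampling loss for one training example.

    sem:     (d,)   semantic embedding of the center segment
    ctx_pos: (d,)   context embedding of a segment inside the window
    ctx_neg: (k, d) context embeddings of k randomly sampled segments
    """
    pos_term = log_sigmoid(sem @ ctx_pos)               # pull contextual pair together
    neg_term = np.sum(log_sigmoid(-(ctx_neg @ sem)))    # push sampled pairs apart
    return float(-(pos_term + neg_term))

sem = np.array([2.0, 0.0])
good = semantic_loss(sem, np.array([2.0, 0.0]), np.array([[-2.0, 0.0]]))
bad = semantic_loss(sem, np.array([-2.0, 0.0]), np.array([[2.0, 0.0]]))
print(good < bad)
```

The loss is small when the center embedding agrees with its true context and disagrees with the sampled negatives, which is the behavior the paragraph above describes.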
2.3 Unsupervised Transformation from Audio to Text
We have a set of audio semantic embeddings obtained from the last stage, where is the number of audio segments in the audio collection. On the other hand, given a text collection, we can obtain textual semantic embeddings by typical word embedding models like skip-gram. Here is the word embedding of the -th word in the text collection, and there are words in the text database. Although both and contain semantic information, they are not in the same space, that is, the same dimension in and would not correspond to the same semantic meaning. Here we want to learn a transformation to transform an embedding to in the textual semantic space.
MBC-ICP is used here; its procedure is described below. Given two sets of embeddings, and , they are projected to their top principal components by PCA respectively. Let the projected vectors of and be and . The -th column of , , is the PCA projection of , while the -th column of , , is the PCA projection of . Both the dimensionality of and are . If can be mapped to the space of by an affine transformation, and would be similar after PCA. The above PCA mapping technique is commonly used [27, 28].
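The PCA projection step can be sketched as follows. This is an illustration under assumptions: embeddings are stored as rows (the text above stores them as columns), the function name `pca_project` is ours, and the sizes (128-dimensional embeddings projected to 100 components) follow the values given in the experimental setup.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto their top-k principal components.

    X: (n, dim) array whose rows are embeddings. Returns an (n, k) array.
    """
    Xc = X - X.mean(axis=0, keepdims=True)             # center the data
    # Rows of Vt are the principal directions, sorted by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
audio_embeddings = rng.normal(size=(500, 128))  # toy stand-in for audio semantic embeddings
projected = pca_project(audio_embeddings, k=100)
print(projected.shape)
```

The same function would be applied separately to the audio and textual embedding sets before the transformation matrices are learned.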
Then a pair of transformation matrices, and , is learned, where transforms an in to the space of , that is, , while maps to the space of . and are learned iteratively by the following algorithm. We assume that two kinds of semantic embedding are likely the same after PCA projection, so we initialize the transformation matrices as identity matrices. Then in each iteration, the following steps are conducted:
For each , find the nearest from all , denoted as .
For each , find the nearest from all , denoted as .
Optimize and by minimizing:
In the first and the second terms, we want to transform and respectively to their nearest neighbors in the other space, and . We include cycle constraints as the third and fourth terms in (5) to ensure that both and are unchanged after being transformed to the other space and back.
Equation (5) is solved by gradient descent. After is eventually obtained, given , we can find in which is nearest to among all the columns of . Then we consider the -th word in the text database corresponds to the -th audio segment, or the -th word is the recognition result of the -th audio segment.
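One iteration of the procedure above (nearest-neighbor matching followed by evaluation of the objective with cycle constraints) can be sketched as below. This is not the reference MBC-ICP implementation: the row-vector layout, the unsquared Euclidean norm, the `cycle_weight` parameter and all function names are assumptions made for illustration, and the gradient step itself is omitted.

```python
import numpy as np

def nearest_indices(src, tgt):
    """For each row of src, return the index of the closest row of tgt."""
    d2 = (np.sum(src ** 2, axis=1)[:, None]
          - 2.0 * src @ tgt.T
          + np.sum(tgt ** 2, axis=1)[None, :])
    return np.argmin(d2, axis=1)

def cycle_icp_loss(A, B, T_ab, T_ba, cycle_weight=1.0):
    """Matching terms plus cycle terms for the current transformations.

    A: (n, k) audio embeddings, B: (m, k) text embeddings, one per row.
    T_ab maps the audio space to the text space; T_ba maps back.
    """
    A2B = A @ T_ab.T
    B2A = B @ T_ba.T
    nn_a = nearest_indices(A2B, B)   # nearest text embedding for each mapped audio one
    nn_b = nearest_indices(B2A, A)   # nearest audio embedding for each mapped text one
    match = (np.linalg.norm(A2B - B[nn_a], axis=1).sum()
             + np.linalg.norm(B2A - A[nn_b], axis=1).sum())
    cycle = (np.linalg.norm(A - A2B @ T_ba.T, axis=1).sum()
             + np.linalg.norm(B - B2A @ T_ab.T, axis=1).sum())
    return float(match + cycle_weight * cycle)

# Toy check: A is a rotated copy of B, so the rotation R is the ideal mapping.
rng = np.random.default_rng(0)
k = 5
B = rng.normal(size=(40, k))
R, _ = np.linalg.qr(rng.normal(size=(k, k)))  # random orthogonal matrix
A = B @ R
loss_identity = cycle_icp_loss(A, B, np.eye(k), np.eye(k))  # initialization
loss_true = cycle_icp_loss(A, B, R, R.T)                    # ideal solution
print(loss_identity > loss_true)
```

In an actual run, the two matrices would start as identity matrices, and the matching and optimization steps would alternate over mini-batches until the nearest-neighbor assignments stabilize.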
If some aligned pairs of audio and textual semantic embeddings are available, we can also train the transformation matrix in a supervised/semi-supervised way, in which we directly minimize the distance from the true embedding in the first two terms of (5), rather than from the nearest embedding.
|WS353 [34, 35]||0.441||0.203||0.374||0.109|
|WS353R [34, 35]||0.385||0.164||0.348||0.102|
|WS353S [34, 35]||0.465||0.224||0.367||0.122|
3.1 Experimental Setup
We used LibriSpeech [36] as the audio collection in our experiments. LibriSpeech is a corpus of read English speech suitable for training and evaluating speech recognition systems. It is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. In our experiments, the dataset was segmented according to the word boundaries obtained by forced alignment with respect to the reference transcriptions. We used the 960-hour “clean+others” speech sets from LibriSpeech for model training. 39-dimensional MFCCs were used as the acoustic features.
The phonetic encoder, speaker encoder and decoder in the first stage are all 2-layer GRUs with hidden layer sizes of 256, 256 and 512, respectively. The discriminator is a fully-connected feedforward network with 2 hidden layers of size 256. The threshold used in the speaker loss term is set to 0.01. The discriminator and the other parts of this stage are trained iteratively as in WGAN [37].
In the second stage, the two encoders are both 2-hidden-layer fully-connected feedforward networks with hidden size 256. The size of the embedding vectors is 128, the context window size is 5, the number of negative samples is 5, the sampling factor is 0.001, and the minimum-count threshold is 5. Although textual Word2Vec has an unsupervised training procedure, it relies on subsampling, which is an important step during training. Subsampling needs the frequencies of words, which we cannot obtain during unsupervised training in the audio domain. In this preliminary work, we compromise by using known word labels to do subsampling, but we believe that, with a proper statistical clustering algorithm to classify and group phonetic vectors, completely unsupervised ASR can be achieved by this framework in the future.
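To make the dependence of subsampling on word frequencies concrete, the sketch below computes keep probabilities with the subsampling formula from the original Word2Vec paper, where a word with relative frequency f is discarded with probability 1 - sqrt(t / f). The function name and the toy token list are assumptions; the default threshold t = 0.001 matches the sampling factor reported above. Practical Word2Vec implementations may use a slightly different formula.

```python
import math
from collections import Counter

def keep_probabilities(tokens, t=0.001):
    """Probability of keeping each word type during subsampling."""
    counts = Counter(tokens)
    total = len(tokens)
    # Frequent words get a keep probability below 1; rare words are always kept.
    return {w: min(1.0, math.sqrt(t / (c / total))) for w, c in counts.items()}

# Toy corpus: word labels of segments (hypothetical counts).
tokens = ["the"] * 900 + ["cat"] * 99 + ["zebra"] * 1
probs = keep_probabilities(tokens)
for word in ("the", "cat", "zebra"):
    print(word, round(probs[word], 4))
```

Since the computation needs a count per word type, it cannot be carried out on unlabeled audio segments without first grouping segments into word-like clusters, which is why known word labels were used in this work.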
We trained two sets of textual semantic embeddings. The first set (denoted as OHW in the following discussion) was trained on the manual transcriptions of LibriSpeech by a typical skip-gram model with one-hot representations as input. The second set (denoted as PEW) was also trained on LibriSpeech, but using phonetic information. To generate PEW, we represented each word as a sequence of phonemes, and used a sequence-to-sequence autoencoder to encode the phoneme sequence into an embedding of size 256. We then took the embeddings of the phoneme sequences as input to the skip-gram model. Because the audio semantic embeddings were also learned from phonetic embeddings, we believe PEW would have a distribution more similar to that of the audio semantic embeddings than OHW.
While each word in text has a unique semantic representation, segments corresponding to the same word can have different semantic representations, so the distributions of textual and audio semantic embeddings are too different to be mapped together. In this work, we used known word labels to average the audio semantic embeddings corresponding to the same word, so that each word has a unique semantic representation in both audio and text. We realize that this is another unrealistic setup in the unsupervised scenario, and we will develop techniques to address this issue in the future. Finally, in the third stage, we applied MBC-ICP [24] to the 5000 most frequent words and projected the embeddings to the top 100 principal components. Hence, the affine transformation matrix from audio embeddings to text embeddings is 100 × 100, and vice versa. The mini-batch size was set to 200.
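The averaging of audio semantic embeddings by word label can be sketched as follows. The function name `average_by_word` and the toy data are assumptions for illustration.

```python
import numpy as np

def average_by_word(embeddings, labels):
    """Average the embeddings of all segments that share a word label.

    embeddings: (n_segments, dim) array; labels: list of n_segments strings.
    Returns the sorted word list and a (n_words, dim) array of mean embeddings.
    """
    words = sorted(set(labels))
    index = {w: i for i, w in enumerate(words)}
    sums = np.zeros((len(words), embeddings.shape[1]))
    counts = np.zeros(len(words))
    for vec, w in zip(embeddings, labels):
        sums[index[w]] += vec
        counts[index[w]] += 1
    return words, sums / counts[:, None]

segment_embeddings = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
segment_labels = ["cat", "cat", "dog"]
words, word_embeddings = average_by_word(segment_embeddings, segment_labels)
print(words)
print(word_embeddings)
```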
3.2 Evaluation of Word Representations
Those benchmark corpora include pairs of words. Here we want to know whether the audio semantic embeddings can capture semantic meanings like textual semantic embeddings. We calculated cosine similarities of word pairs in the benchmark datasets using audio semantic embeddings and textual semantic embeddings, and evaluated the rank correlation of the cosine similarities between audio and textual embeddings. The code is released at https://github.com/grtzsohalf/Towards-Unsupervised-ASR. The results are reported as Spearman’s rank correlation coefficients.
We measured correlations between two textual semantic embeddings, OHW and PEW, mentioned in Section 3.1 and four types of audio embeddings. The four types of audio embeddings are: semantic embeddings trained from phonetic vectors with disentanglement (SE/SAD), semantic embeddings trained from vectors extracted by SA without disentanglement (SE/SA), phonetic embeddings extracted by SA with disentanglement (PE/SAD), and embeddings extracted by SA without disentanglement (PE/SA).
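The evaluation described above can be sketched as follows: cosine similarities are computed for each benchmark word pair under two embedding sets, and the two similarity lists are compared by Spearman's rank correlation. The function names and toy embeddings are assumptions, and this simple rank computation does not average tied ranks, which a library routine such as `scipy.stats.spearmanr` would do.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranks(values):
    # Ordinal ranks; ties are not averaged (a simplifying assumption).
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(len(values))
    return r

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(np.asarray(a)), ranks(np.asarray(b)))[0, 1])

def embedding_agreement(pairs, emb_x, emb_y):
    """Rank correlation of pairwise cosine similarities under two embedding sets."""
    sims_x = [cosine(emb_x[w1], emb_x[w2]) for w1, w2 in pairs]
    sims_y = [cosine(emb_y[w1], emb_y[w2]) for w1, w2 in pairs]
    return spearman(sims_x, sims_y)

# Toy embeddings: emb_y is a rescaled copy of emb_x, so the rankings agree.
rng = np.random.default_rng(0)
vocab = ["king", "queen", "apple", "car", "road"]
emb_x = {w: rng.normal(size=8) for w in vocab}
emb_y = {w: 2.0 * v for w, v in emb_x.items()}
pairs = [("king", "queen"), ("apple", "car"), ("car", "road"), ("king", "apple")]
score = embedding_agreement(pairs, emb_x, emb_y)
print(score)
```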
The results are shown in Table 1 and Table 2. Table 1 shows the correlations between OHW and four audio embedding sets. Similarly, Table 2 presents the correlations between PEW and four audio embedding sets. In Tables 1 and 2, we found that disentanglement improved the correlation scores in most cases, and audio semantic embeddings outperformed embeddings extracted from SA in most cases. The results verify the first two stages in our proposed method are both helpful for extracting embeddings including semantic information. It can also be inferred from the two tables that correlation performance of PEW is better than OHW as expected because PEW is learned from the phonetic embeddings of text. Since PEW is more similar to the audio embeddings, it will make the transformation in the next stage easier.
|Labeled Pairs||top 1||top 10||top 100|
3.3 Transformation from Audio to Text
The results of MBC-ICP are shown in Table 3 and Table 4. In Table 3, we compare the top-10 nearest-neighbor accuracies of the four audio embedding sets mentioned above using 5000 labeled pairs. SE/SAD achieved the best result, which again shows that the first two stages of our proposed method are both effective. In Table 4, both the unsupervised and semi-supervised results with SE/SAD are further reported. The numbers of labeled pairs are 0, 1000, 2000 and 5000, respectively. The results include top-1, top-10 and top-100 nearest-neighbor accuracies. We observe that although unsupervised MBC-ICP may not generate a perfect matching, semi-supervised learning achieved high transformation accuracies. This shows that there exists a good affine transformation matrix that maps between audio and textual semantic embeddings. However, this matrix cannot be easily found by the completely unsupervised approach.
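The top-k nearest-neighbor accuracy used in these tables can be sketched as below. This is an illustration under assumptions: Euclidean distance is used for the nearest-neighbor search, row i of the mapped audio embeddings is assumed to correspond to row i of the text embeddings, and the function name is ours.

```python
import numpy as np

def top_k_accuracy(mapped_audio, text, k):
    """Fraction of audio embeddings whose true word is among the k nearest text embeddings.

    mapped_audio: (n, d) audio embeddings after transformation to the text space.
    text:         (n, d) text embeddings; row i is the correct match for row i.
    """
    d2 = (np.sum(mapped_audio ** 2, axis=1)[:, None]
          - 2.0 * mapped_audio @ text.T
          + np.sum(text ** 2, axis=1)[None, :])
    top = np.argsort(d2, axis=1)[:, :k]                       # k nearest text indices
    hits = (top == np.arange(len(mapped_audio))[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
text = rng.normal(size=(50, 8))
perfect = text.copy()       # an ideal transformation
reversed_map = text[::-1]   # every audio embedding lands on a wrong word
acc_perfect = top_k_accuracy(perfect, text, k=1)
acc_wrong = top_k_accuracy(reversed_map, text, k=1)
print(acc_perfect, acc_wrong)
```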
4 Conclusions and Future Work
In this work, we propose a three-stage framework towards unsupervised ASR using only unaligned speech and text. Through the experiments, we showed that semantic audio embeddings can be directly extracted from audio. Although we did not obtain satisfactory results with unsupervised learning, we verified via semi-supervised learning that there is an affine matrix transforming semantic embeddings from audio to text. How to find this affine matrix in an unsupervised setup is still under investigation. Because some oracle settings were used in the experiments, we are now conducting experiments under more realistic setups. We believe that, with further improvement of this framework, completely unsupervised ASR could be achieved in the near future.
-  M. Cooke, P. Green, L. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech communication, vol. 34, no. 3, pp. 267–285, 2001.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
-  F. Sha and L. K. Saul, “Large margin hidden markov models for automatic speech recognition,” in Advances in neural information processing systems, 2007, pp. 1249–1256.
-  E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” in ASRU, 2017.
-  A. Garcia and H. Gish, “Keyword spotting of arbitrary words using minimal speech resources,” in ICASSP, 2006.
-  A. Jansen and K. Church, “Towards unsupervised training of speaker independent acoustic models,” in INTERSPEECH, 2011.
-  A. Park and J. Glass, “Unsupervised pattern discovery in speech,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 16, no. 1, pp. 186–197, Jan 2008.
-  V. Stouten, K. Demuynck, and H. Van hamme, “Discovering phone patterns in spoken utterances by non-negative matrix factorization,” Signal Processing Letters, IEEE, vol. 15, pp. 131 –134, 2008.
-  N. Vanhainen and G. Salvi, “Word discovery with beta process factor analysis,” in INTERSPEECH, 2012.
-  H. Wang, C.-C. Leung, T. Lee, B. Ma, and H. Li, “An acoustic segment modeling approach to query-by-example spoken term detection,” in ICASSP, 2012.
-  H. Kamper, K. Livescu, and S. Goldwater, “An embedded segmental K-means model for unsupervised segmentation and clustering of speech,” in ASRU, 2017.
-  K. Levin, A. Jansen, and B. Van Durme, “Segmental acoustic indexing for zero resource keyword search,” in ICASSP, 2015.
-  S. Bengio and G. Heigold, “Word embeddings for speech recognition,” in INTERSPEECH, 2014.
-  G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in ICASSP, 2015.
-  S. Settle, K. Levin, H. Kamper, and K. Livescu, “Query-by-example search with discriminative neural acoustic word embeddings,” INTERSPEECH, 2017.
-  A. Jansen, M. Plakal, R. Pandya, D. Ellis, S. Hershey, J. Liu, C. Moore, and R. A. Saurous, “Towards learning semantic audio representations from unlabeled data,” in NIPS Workshop on Machine Learning for Audio Signal Processing (ML4Audio), 2017.
-  Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, “Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” arXiv preprint arXiv:1603.00982, 2016.
-  C.-H. Shen, J. Y. Sung, and H.-Y. Lee, “Language transfer of audio word2vec: Learning audio segment representations without target language data,” in arXiv, 2017.
-  Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, “Unsupervised adaptation with domain separation networks for robust speech recognition,” in ASRU, 2017.
-  W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Advances in neural information processing systems, 2017, pp. 1876–1887.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
-  G. Lample, L. Denoyer, and M. Ranzato, “Unsupervised machine translation using monolingual corpora only,” arXiv preprint arXiv:1711.00043, 2017.
-  A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou, “Word translation without parallel data,” arXiv preprint arXiv:1710.04087, 2017.
-  Y. Hoshen and L. Wolf, “An iterative closest point method for unsupervised word translation,” arXiv preprint arXiv:1801.06126, 2018.
-  Y.-H. Wang, C.-T. Chung, and H.-y. Lee, “Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries,” arXiv preprint arXiv:1703.07588, 2017.
-  O. Scharenborg, V. Wan, and M. Ernestus, “Unsupervised speech segmentation: An analysis of the hypothesized phone boundaries,” The Journal of the Acoustical Society of America, vol. 127, no. 2, pp. 1084–1095, 2010.
-  P. Daras, A. Axenopoulos, and G. Litos, “Investigating the effects of multiple factors towards more accurate 3-d object retrieval,” IEEE Transactions on multimedia, vol. 14, no. 2, pp. 374–388, 2012.
-  F. Li, D. Stoddart, and C. Hitchens, “Method to automatically register scattered point clouds based on principal pose estimation,” Optical Engineering, vol. 56, no. 4, p. 044107, 2017.
-  E. Bruni, N. Tram, M. Baroni et al., “Multimodal distributional semantics,” The Journal of Artificial Intelligence Research, vol. 49, pp. 1–47, 2014.
-  K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch, “A word at a time: computing word relatedness using temporal semantic analysis,” in Proceedings of the 20th international conference on World wide web. ACM, 2011, pp. 337–346.
-  H. Rubenstein and J. B. Goodenough, “Contextual correlates of synonymy,” Communications of the ACM, vol. 8, no. 10, pp. 627–633, 1965.
-  T. Luong, R. Socher, and C. Manning, “Better word representations with recursive neural networks for morphology,” in Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013, pp. 104–113.
-  F. Hill, R. Reichart, and A. Korhonen, “Simlex-999: Evaluating semantic models with (genuine) similarity estimation,” Computational Linguistics, vol. 41, no. 4, pp. 665–695, 2015.
-  L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin, “Placing search in context: The concept revisited,” in Proceedings of the 10th international conference on World Wide Web. ACM, 2001, pp. 406–414.
-  E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa, “A study on similarity and relatedness using distributional and wordnet-based approaches,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009, pp. 19–27.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, 2017, pp. 5769–5779.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  S. Jastrzebski, D. Leśniak, and W. M. Czarnecki, “How to evaluate word embeddings? on importance of data efficiency and simple supervised tasks,” arXiv preprint arXiv:1702.02170, 2017.