Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

by   Sung-Feng Huang, et al.

Embedding audio signal segments into vectors of fixed dimensionality is attractive because it makes all subsequent processing, such as modeling, classification, or indexing, easier and more efficient. The previously proposed Audio Word2Vec was shown to represent audio segments for spoken words as such vectors, carrying information about the phonetic structure of the signal segments. However, each linguistic unit (word, syllable, or phoneme in text form) corresponds to an unlimited number of audio segments whose vector representations are inevitably spread over the embedding space, which causes some confusion. It is therefore desirable to better cluster the audio embeddings so that those corresponding to the same linguistic unit are more compactly distributed. In this paper, inspired by Siamese networks, we propose several approaches to achieve this goal. These include identifying positive and negative pairs from unlabeled data for Siamese style training, disentangling acoustic factors such as speaker characteristics from the audio embedding, handling unbalanced data distribution, and having the embedding process learn from the adjacency relationships among data points. All of these can be done in an unsupervised way. Improved performance was obtained in preliminary experiments on the LibriSpeech data set, including clustering characteristics analysis and applications in spoken term detection.




1 Introduction

Speech recognition technologies have been very successful. But for good accuracy in real-world applications, machines still have to learn from huge quantities of annotated data. This makes the development of speech technologies for a new language challenging. For low-resourced languages, collecting huge quantities of data is difficult, and having them annotated is often prohibitively hard. Yet more than 95% of the world's languages are low-resourced, and many of them lack linguistic analysis or even a written form. Compared to annotating audio data, obtaining unannotated audio data of reasonable size is relatively achievable. If machines can automatically learn the acoustic patterns for linguistic units (words, syllables, phonemes, etc.) within the speech signals from an unannotated speech data set of reasonable size, recognition models for those units may be constructed, and speech recognition may become possible for a new language in a new environment with minimum supervision. Imagine a Hokkien-speaking family obtaining an intelligent device at home: this device does not know Hokkien at all in the beginning, but by hearing people speaking Hokkien for some time, it may automatically learn the language. The goal of this paper is one step towards this vision [1].

With the above long-term goal in mind, it is highly desirable to embed audio signal segments of variable length (probably corresponding to linguistic units such as words, syllables, or phonemes) into vectors of fixed dimensionality, serving as latent representations of the signal segments [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. This naturally makes all subsequent processing easier, such as computation, clustering, modeling, classification, and indexing. Good application examples include speaker identification [12], emotion classification [13], and spoken term detection (STD) [14, 15, 16, 17, 18, 19]. In these applications, standard processing mechanisms can be easily applied to such vector representations of the audio segments, achieving the purpose much more efficiently than processing the variable-length signal segments directly [5, 6, 20, 21, 22].

Audio Word2Vec was proposed and can be trained in an unsupervised way with a sequence-to-sequence autoencoder, with the embeddings for the input audio segments extracted from the bottleneck layer [4, 15]. It has been shown that the vector representations obtained in this way carry the phonetic information about the audio segments [15]. It was then further shown in Segmental Audio Word2Vec [23] that dividing utterances into audio segments and embedding them as sequences of vectors can be jointly learned in an unsupervised way. Such unsupervised approaches for audio segment embedding are attractive because no annotation is needed. However, each linguistic unit (word, syllable, phoneme) corresponds to an unlimited number of audio realizations, each with its own vector representation, and the spread of these vector representations inevitably leads to confusion, especially when no human labels are available. For example, although the embeddings for the realizations of the word “brother” are very close to each other in the vector space, and so are those for the word “bother”, the spread of the two groups causes some confusion between them. A Siamese convolutional neural network [24, 25, 26, 27] was trained using side information to obtain embeddings for which same-word pairs were closer and different-word pairs were better separated, but human annotation is required under this supervised learning scenario.

Siamese networks learning from same-word and different-word pairs [20] can be useful in learning better audio embeddings for linguistic units which are discrete, but labeled data is needed. In this paper, inspired by the concept of Siamese networks, we propose a set of approaches to learn better audio embeddings based on the adjacency relationships among data points. This includes identifying positive and negative pairs from unlabeled data for Siamese style training, disentangling acoustic factors such as speaker characteristics from the audio embedding, and handling the unbalanced data distribution. All these can be done in an unsupervised way, and very encouraging results were observed in the initial experiments.

2 Proposed Approach

Because the goal of improving audio embedding is challenging, in the initial effort here we slightly simplify the task by assuming all training utterances have been properly segmented into spoken linguistic units (words, syllables, phonemes). Many approaches for segmenting utterances automatically have been developed [23], and automatic segmentation plus audio embedding has been jointly trained successfully and reported before [23], so such an assumption is reasonable here.

Below we denote the audio corpus as $X = \{x_i\}_{i=1}^{M}$, which consists of $M$ spoken linguistic units, each represented as an acoustic feature sequence $x_i$ of length $T$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{iT})$. In the subsections below, we try to perform the Siamese style training in an unsupervised way over audio segments.

Figure 1: Overview of the proposed approach.

2.1 Siamese Networks Considered for Unlabeled Audio Data

The overview of the proposed approach is shown in Fig. 1. Siamese networks are typically trained on a collection of positive and negative pairs of data points to make sure positive pairs are closer and negative pairs far apart. We wish to use such a concept to improve the audio embeddings considered here.

With labeled data, pairs with the same label are considered positive, and negative otherwise. Here we consider unlabeled data sets. One way to obtain such pairs is to learn them directly from Euclidean proximity, e.g., by “labeling” points $x_i$, $x_j$ as positive if $\|x_i - x_j\|$ is small, or if one is among the nearest neighbors of the other over the whole data set, and as negative otherwise. Such a Siamese network can then be trained by minimizing the contrastive loss,

$$L_{siamese} = \sum_{(i,j) \in P} \|z_i - z_j\|^2 + \sum_{(i,j) \in N} \max(c - \|z_i - z_j\|, 0)^2, \tag{1}$$

where $z_i$ is the network output for $x_i$, $c$ is a margin, and $P$ and $N$ denote the positive and negative pair sets.
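As an illustration, a contrastive loss of the standard form (squared distance for positive pairs, squared hinge with a margin for negative pairs) can be sketched in NumPy as follows. The function name, the argument layout, and the toy embeddings are assumptions made for this sketch and are not taken from the authors' implementation.

```python
import numpy as np

def contrastive_loss(z, pos_pairs, neg_pairs, margin=1.0):
    """Standard contrastive loss over index pairs into the embedding matrix z.

    z          : (n, d) array of embeddings (hypothetical toy input here)
    pos_pairs  : iterable of (i, j) index pairs that should be close
    neg_pairs  : iterable of (i, j) index pairs that should be at least `margin` apart
    """
    z = np.asarray(z, dtype=float)
    loss = 0.0
    for i, j in pos_pairs:
        # positive pairs are penalized by their squared Euclidean distance
        loss += np.sum((z[i] - z[j]) ** 2)
    for i, j in neg_pairs:
        # negative pairs are penalized only when closer than the margin
        dist = np.linalg.norm(z[i] - z[j])
        loss += max(margin - dist, 0.0) ** 2
    return float(loss)

# Toy example: points 0 and 1 are close, point 2 is far away.
z = np.array([[0.0, 0.0], [0.0, 0.5], [3.0, 4.0]])
print(contrastive_loss(z, pos_pairs=[(0, 1)], neg_pairs=[(0, 2), (1, 2)]))
```

In this toy case only the positive pair contributes, since both negative pairs are already farther apart than the margin.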

There are two basic problems in applying the above concept to the scenario considered here: (a) we cannot define positive or negative pairs from raw data, due to the variable length of the audio segments and the disturbance caused by speaker characteristics, and (b) it is time-consuming to find nearest neighbors for each point, since a large number of data points is required for training audio embeddings. These problems are addressed below.

2.2 Phonetic Embedding with Speaker Characteristics Disentangled

Figure 2: Phonetic embedding with speaker characteristics disentangled.

Audio embedding represents each audio segment for a linguistic unit (a word, syllable, or phoneme) as a vector of fixed dimensionality. This partly solves the first problem mentioned above, i.e., that the audio segments have variable lengths. Even with fixed dimensionality, a linguistic unit with a given phonetic content corresponds to an infinite number of audio realizations with varying acoustic factors such as speaker characteristics, microphone characteristics, background noise, etc. All these acoustic factors are jointly referred to as speaker characteristics here for simplicity. They clearly work against the goal of embedding signals for the same linguistic unit into vectors very close to each other, which is why we wish to disentangle such factors here.

As shown in the middle of Figure 2, following exactly the prior work [28], a sequence of acoustic features $x_i = (x_{i1}, x_{i2}, \ldots, x_{iT})$ is fed to a phonetic encoder $E_p$ and a speaker encoder $E_s$ to obtain a phonetic vector $v_{p_i}$ (in orange) and a speaker vector $v_{s_i}$ (in green). The phonetic and speaker vectors $v_{p_i}$, $v_{s_i}$ are then used together by the decoder $D$ to reconstruct the acoustic features $x'_i$. The phonetic vector $v_{p_i}$ is used as the phonetic embedding, i.e., the audio embedding considered here, carrying primarily the phonetic information in the signal. The two encoders $E_p$, $E_s$ and the decoder $D$ are jointly learned by minimizing the reconstruction loss.

The training of the speaker encoder $E_s$ requires speaker information for the audio segments. Assume the audio segment $x_i$ is uttered by speaker $s_i$. When the speaker information is not available, we can simply assume that the audio segments in the same utterance are produced by the same speaker. As shown in the lower part of Figure 2, $E_s$ is learned to minimize the contrastive loss. That is, if $x_i$ and $x_j$ are uttered by the same speaker ($s_i = s_j$), we want their speaker embeddings $v_{s_i}$ and $v_{s_j}$ to be as close as possible. But if $s_i \neq s_j$, we want the distance between $v_{s_i}$ and $v_{s_j}$ to be larger than a threshold.

As shown in the upper right corner of Figure 2, a speaker discriminator $D_s$ takes two phonetic vectors $v_{p_i}$ and $v_{p_j}$ as input and tries to tell whether the two vectors come from the same speaker. The learning target of the phonetic encoder $E_p$ is to “fool” this speaker discriminator $D_s$, keeping it from discriminating the speaker identity correctly. In this way, only the phonetic information is learned in the phonetic vector $v_{p_i}$, while only the speaker characteristics are encoded in the speaker vector $v_{s_i}$.

2.3 Identify Positive and Negative Pairs within each Mini-Batch

Finding nearest neighbors for each data point is costly, with time complexity of $O(M^2)$, where $M$ is the corpus size. To reduce the time and computing costs, we instead create a $k$-nearest neighbors graph among the data points in each mini-batch, and use it to approximate the distribution of the full data set. In this way, the time complexity can be reduced to $O(B^2)$ per mini-batch, where $B$ is the mini-batch size.
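The mini-batch neighbor search described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name `knn_pairs_in_batch` is hypothetical, a pair is treated as positive if either point is among the other's nearest neighbors, and all remaining pairs in the mini-batch are treated as negative.

```python
import numpy as np

def knn_pairs_in_batch(emb, k):
    """Build positive/negative index pairs from a k-nearest-neighbor graph
    computed inside one mini-batch (requires k < batch size)."""
    emb = np.asarray(emb, dtype=float)
    n = emb.shape[0]
    # Pairwise squared Euclidean distances: quadratic in the batch size only.
    sq = np.sum(emb ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of the k nearest neighbors per point
    # Store each undirected edge once as (smaller index, larger index).
    pos = {(min(i, int(j)), max(i, int(j))) for i in range(n) for j in nn[i]}
    all_pairs = {(i, j) for i in range(n) for j in range(i + 1, n)}
    neg = all_pairs - pos
    return sorted(pos), sorted(neg)

# Toy mini-batch with two tight groups of two points each.
batch = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
pos, neg = knn_pairs_in_batch(batch, k=1)
print(pos, neg)
```

The distance matrix has only as many rows and columns as the mini-batch, which is what keeps the neighbor search cheap compared with searching the whole corpus.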

2.4 Siamese Style Training

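We can simply apply the Siamese style loss function as an extra requirement in training the phonetic embedding in subsection 2.2, so that the Siamese requirement is jointly trained:

$$L = \sum_{(i,j) \in P} \|v_{p_i} - v_{p_j}\|^2 + \sum_{(i,j) \in N} \max(c - \|v_{p_i} - v_{p_j}\|, 0)^2, \tag{2}$$

where $P$, $N$ are the sets of positive and negative pairs selected in each mini-batch, $c$ is the margin, and $v_{p_i}$, $v_{p_j}$ are the phonetic embeddings obtained in this way.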

On the other hand, we can also pretrain the audio embeddings with speaker characteristics disentangled as in subsection 2.2, and then, on top of the obtained phonetic embeddings $v_{p_i}$, train another Siamese model to further transform them to a new space where similar points are more compactly clustered, i.e., adjacency-based clustering. The training loss function for this extra model is:

$$L = \sum_{(i,j) \in P} \|z_i - z_j\|^2 + \sum_{(i,j) \in N} \max(c - \|z_i - z_j\|, 0)^2, \tag{3}$$

where the phonetic embeddings obtained in subsection 2.2, $v_{p_i}$, $v_{p_j}$, are transformed to the new embeddings $z_i$, $z_j$, and $c$ is the margin.

2.5 Dealing with Unbalanced Data

The distribution of linguistic units is unbalanced. For low-frequency units we may not be able to find more than two audio segments in a mini-batch. When we create a $k$-nearest neighbor graph within the mini-batch, audio segments for such units would be forced to reduce their distance to audio segments corresponding to different linguistic units. On the other hand, for high-frequency units with more than $k$ corresponding audio segments in a mini-batch, such audio segments would be separated into two or more clusters.

Therefore, instead of creating a $k$-nearest neighbor graph for a mini-batch, we create a fully-connected graph for each mini-batch, label the top-$m$ pairs of data points with the shortest distances between them as the positive pairs, and randomly select other pairs of data points as negative pairs. In this way, the probability that the data points of a positive pair correspond to the same linguistic unit may be higher.
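A minimal NumPy sketch of this pair selection is given below. The function name and the parameters `n_pos`, `n_neg`, and `seed` are hypothetical names chosen for this sketch; how many negative pairs are drawn is not stated in the text, so it is left as a free parameter here.

```python
import numpy as np

def top_pairs_in_batch(emb, n_pos, n_neg, seed=0):
    """Pick the n_pos closest pairs in a mini-batch as positives and sample
    n_neg of the remaining pairs uniformly at random as negatives."""
    emb = np.asarray(emb, dtype=float)
    b = emb.shape[0]
    # All unordered pairs (i < j) of the fully-connected graph.
    iu, ju = np.triu_indices(b, k=1)
    dists = np.linalg.norm(emb[iu] - emb[ju], axis=1)
    order = np.argsort(dists)             # pairs sorted by increasing distance
    pos_idx = order[:n_pos]
    rest = order[n_pos:]
    rng = np.random.default_rng(seed)
    neg_idx = rng.choice(rest, size=min(n_neg, rest.size), replace=False)
    pos = [(int(iu[t]), int(ju[t])) for t in pos_idx]
    neg = [(int(iu[t]), int(ju[t])) for t in neg_idx]
    return pos, neg

# Toy mini-batch with two tight groups of two points each.
batch = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
pos, neg = top_pairs_in_batch(batch, n_pos=2, n_neg=2)
print(pos, neg)
```

Because the positives are chosen globally over the mini-batch rather than per point, a unit that appears only once in the batch is not forced into a positive pair with a segment of a different unit.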

3 Experimental Setup

3.1 Dataset

We used LibriSpeech [29] as the audio corpus in the experiments, which is a corpus of read speech in English derived from audiobooks. This corpus contains about 1000 hours of speech sampled at 16 kHz, uttered by 2484 speakers. We randomly sampled 100 speakers from the “clean” and “other” sets, with about 40 hours of speech for training and another 40 hours for testing, and 39-dim MFCCs were extracted as the acoustic features $x_i$. The audio signals were segmented into three levels of linguistic units: word, syllable, and phoneme.

3.2 Model Implementation

The phonetic encoder $E_p$, speaker encoder $E_s$ and decoder $D$ were either 2-layer bi-directional GRUs or 3-layer CNNs with dense layers. The size of the embedding vectors was 256. The speaker discriminator $D_s$ was a fully-connected feedforward network with 2 hidden layers of size 128. The value of the margin $c$ used in Eqs. (2) and (3) was set to 1.

4 Experimental Results

In the following subsections, we evaluate four kinds of audio embeddings: (a) Audio Word2Vec [4], which is simply an auto-encoder; (b) Audio Word2Vec with speaker characteristics disentangled as in subsection 2.2 [30, 28]; (c) proposed approach as in Eq.(2); and (d) proposed approach as in Eq.(3).

4.1 Analysis of Embedding Characteristics

We first compared the average cosine similarity of intra- and inter-class pairs for three different levels of linguistic units. Intra-class pairs were evaluated between audio segments corresponding to the same linguistic unit, while inter-class pairs were evaluated between segments belonging to different units. In addition to the four kinds of audio embeddings mentioned above, we also included two further baselines: the audio embedding obtained in subsection 2.2 with minimizing the overall L1 loss as an extra requirement in the training process ((b)+L1), and the audio embedding obtained in subsection 2.2 with minimizing the overall L2 loss as an extra requirement in the training process ((b)+L2).

The results are listed in Table 1. Row (d), for the proposed approach of Eq.(3), gave the highest intra-class average cosine similarity and the largest difference between intra- and inter-class average cosine similarity for all three levels of linguistic units, which indicates that this approach offered better clustered audio embeddings. In other words, the embeddings corresponding to the same linguistic unit were more compactly distributed, even without extra annotation.

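| Embedding | Word intra | Word inter | Word diff. | Syllable intra | Syllable inter | Syllable diff. | Phoneme intra | Phoneme inter | Phoneme diff. |
|---|---|---|---|---|---|---|---|---|---|
| (a) | .104 | .036 | .068 | .139 | .050 | .089 | .135 | -.003 | .138 |
| (b) | .083 | .024 | .059 | .131 | .039 | .092 | .088 | -.003 | .091 |
| (b)+L1 | .085 | .017 | .068 | .112 | .038 | .074 | .102 | -.003 | .105 |
| (b)+L2 | .078 | .024 | .054 | .107 | .037 | .070 | .090 | -.003 | .093 |
| (c) | .074 | .030 | .044 | .096 | .032 | .064 | .110 | .011 | .099 |
| (d) | .222 | .022 | .200 | .236 | .035 | .201 | .245 | .051 | .194 |

Table 1: Average cosine similarity of intra- and inter-class pairs for three different levels of linguistic units. The “diff.” columns give the intra-class value minus the inter-class value.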
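The intra- and inter-class averages reported in this kind of analysis can be computed along the following lines. This is a sketch assuming that every unordered pair of segments is counted once and that all embeddings are nonzero; the function name and the toy data are illustrative only.

```python
import numpy as np

def intra_inter_cosine(emb, labels):
    """Average cosine similarity over same-label (intra) and
    different-label (inter) pairs, plus their difference."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    # Normalize rows so that the dot product equals cosine similarity.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Consider each unordered pair (i < j) exactly once.
    iu, ju = np.triu_indices(len(labels), k=1)
    same = labels[iu] == labels[ju]
    pair_sims = sim[iu, ju]
    intra = float(pair_sims[same].mean())
    inter = float(pair_sims[~same].mean())
    return intra, inter, intra - inter

# Toy data: two classes with orthogonal embeddings.
emb = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
labels = ["a", "a", "b", "b"]
print(intra_inter_cosine(emb, labels))
```

A larger gap between the two averages means that segments of the same unit sit closer together relative to segments of different units.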

4.2 Analysis of Unsupervised Clustering

Figure 3: Total accuracy in Eq.(6) for clustering characteristics analysis: (a) Audio Word2Vec, (b) Audio Word2Vec with speaker characteristics disentangled as in subsection 2.2, (c) proposed with Eq.(2), (d) proposed with Eq.(3). Results for different levels of linguistic units have similar trends, so only the results for words are shown.

In the experiment here, we used k-means, an unsupervised clustering method, for analysis. All three levels of linguistic units were tested for comparison, so the number of labels was fixed at 70, which is the total number of phoneme classes. For words and syllables, the 70 most frequent units were selected as labels for the experiments.

Given the clustering results, we could construct a confusion matrix $C \in \mathbb{R}^{U \times K}$, where $U$ is the number of linguistic units tested (70), $K$ is the number of clusters (tested up to 280), and $C_{uk}$ indicates the count of data points having label $u$ but assigned to cluster $k$. This count was first normalized to give $\bar{C}_{uk}$.

For each label $u$ we obtained the cluster $k_u$ yielding the highest $\bar{C}_{uk}$ and assumed it was the cluster for label $u$,

$$k_u = \arg\max_{k} \bar{C}_{uk}. \tag{5}$$

The total accuracy was then evaluated by summing over all labels $u$,

$$\text{Total accuracy} = \sum_{u=1}^{U} \bar{C}_{u k_u}. \tag{6}$$

Higher total accuracy would be obtained if all the data points of the same label were in the same cluster.
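The evaluation just described can be sketched as follows. The normalization step is not fully specified in the text above, so this sketch assumes the counts are divided by the total number of data points, which keeps the total accuracy between 0 and 1. The function name and the toy labels are also assumptions.

```python
import numpy as np

def total_accuracy(labels, clusters, n_labels, n_clusters):
    """Confusion-matrix-based total accuracy for a clustering result.

    labels, clusters : integer arrays giving the true label and the
                       assigned cluster of each data point
    """
    labels = np.asarray(labels, dtype=int)
    clusters = np.asarray(clusters, dtype=int)
    conf = np.zeros((n_labels, n_clusters))
    # conf[u, k] counts the points with label u assigned to cluster k.
    np.add.at(conf, (labels, clusters), 1)
    # ASSUMPTION: normalize by the total count (not confirmed by the source).
    conf_norm = conf / conf.sum()
    # For each label, take the cluster holding the largest share of its points.
    best = conf_norm.argmax(axis=1)
    acc = conf_norm[np.arange(n_labels), best].sum()
    return float(acc), best

# Toy example: label 0 is split over clusters 0 and 1, label 1 sits in cluster 2.
labels = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 2, 2, 2]
acc, best = total_accuracy(labels, clusters, n_labels=2, n_clusters=3)
print(acc, best)
```

In the toy example, five of the six points fall in the cluster assigned to their label, so splitting a label across clusters directly lowers the score.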

The results for the three levels of linguistic units are similar, so only the clustering results for words are shown in Fig. 3 for clarity. The clustering performance of k-means is known to depend on the number of clusters $K$, so various values of $K$ were tested. First, we found that feature disentanglement improved the total accuracy (curve (b) vs. (a)), and the proposed Siamese style training in Eq.(3) gave further improvement (curve (d) vs. (a)). As shown in Fig. 3, curve (d) for the proposed approach of Eq.(3) gave the highest total accuracy at nearly all values of $K$, which indicates that this approach substantially improved the performance in unsupervised clustering.

4.3 Analysis of Spoken Term Detection

| top-$n$ | (a) | (b) | (c) | (d) | (d) - (a) | (d) - (b) |
|---|---|---|---|---|---|---|
| 1 | 33.59% | 33.53% | 33.68% | 33.88% | 0.29% | 0.35% |
| 5 | 34.34% | 34.39% | 34.59% | 35.13% | 0.79% | 0.74% |
| 10 | 34.70% | 34.79% | 35.05% | 35.67% | 0.97% | 0.88% |
| 20 | 35.10% | 35.23% | 35.53% | 36.20% | 1.10% | 0.97% |
| 40 | 35.52% | 35.70% | 36.05% | 36.71% | 1.19% | 1.01% |
| 60 | 35.78% | 36.00% | 36.37% | 37.00% | 1.22% | 1.00% |

Table 2: Spoken term detection (MAP) with 80 queries.

We used the 960 hours of “clean” and “other” parts of LibriSpeech data set as the target archive for detection, which consisted of 1478 audio books with 5466 chapters. Each chapter included 1 to 204 utterances or 5 to 6529 spoken words. In our experiments, 80 queries were chosen from the words used in these 960 hours of speech with the top 80 TF-IDF scores, and the chapters were taken as the spoken documents to be retrieved. The audio realization of each query was randomly sampled from LibriSpeech data set, and our goal was to retrieve documents containing those queries (words, not necessarily the exact audio realizations). We used mean average precision (MAP) as the evaluation metric for the spoken term detection test.
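For reference, mean average precision can be computed from ranked relevance judgments as in the sketch below. It uses the common definition in which the precision is averaged over the ranks of the relevant documents, and it assumes every relevant document appears somewhere in the ranked list; the function names and toy rankings are illustrative.

```python
import numpy as np

def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance : sequence of 0/1 flags, one per retrieved document,
                       in ranked order (1 = document contains the query term)
    """
    rel = np.asarray(ranked_relevance, dtype=bool)
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                    # relevant documents seen so far
    ranks = np.arange(1, rel.size + 1)       # 1-based rank positions
    # Precision at each rank where a relevant document was retrieved.
    return float(np.mean(hits[rel] / ranks[rel]))

def mean_average_precision(ranked_lists):
    """Mean of the per-query average precisions."""
    return float(np.mean([average_precision(r) for r in ranked_lists]))

# Toy example with two queries.
print(mean_average_precision([[1, 0, 1], [0, 1]]))
```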

For each query $q$ and each document $d$, the relevance score of $d$ with respect to $q$, $S(q, d)$, is defined as follows:

$$S(q, d) = \frac{1}{n} \sum_{w \in W_n(q, d)} \cos\big(e(q), e(w)\big), \tag{7}$$

where $e(w)$ is the audio embedding of a spoken word $w$, $\cos(\cdot, \cdot)$ represents cosine similarity, $W_n(q, d)$ is the set of the top $n$ spoken words in $d$ with the highest cosine similarity values $\cos(e(q), e(w))$, and $n$ is a parameter. In other words, the documents were ranked by the average of the top $n$ cosine similarities between the spoken words in $d$ and the query $q$.
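The ranking rule described above, averaging the highest cosine similarities between the query and the spoken words of a document, can be sketched as follows. The function name and the toy vectors are assumptions, and the sketch averages over all words when a document has fewer words than the requested number, a case the text does not address.

```python
import numpy as np

def relevance_score(query_emb, doc_word_embs, n):
    """Average of the n highest cosine similarities between the query
    embedding and the embeddings of the spoken words in one document."""
    q = np.asarray(query_emb, dtype=float)
    w = np.asarray(doc_word_embs, dtype=float)
    q = q / np.linalg.norm(q)
    w = w / np.linalg.norm(w, axis=1, keepdims=True)
    sims = w @ q                              # cosine similarity per spoken word
    top = np.sort(sims)[::-1][:n]             # the n largest similarities
    return float(top.mean())

# Toy document with three spoken words.
query = [1.0, 0.0]
doc = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(relevance_score(query, doc, n=2))
```

Documents would then be sorted by this score in decreasing order to produce the ranked list for each query.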

The results are listed in Table 2. As can be seen from the table, column (d) for the proposed approach of Eq.(3) offered better detection performance than all the other kinds of audio embeddings at all values of $n$. Because the queries are high-frequency terms that usually appear multiple times in a document, the detection performance of all kinds of audio embeddings gradually improved as $n$ increased over the range tested. The two rightmost columns list the differences between the proposed approach in Eq.(3) (column (d)) and columns (a) and (b). The proposed method gave consistent gains, up to 1.22% absolute over (a) and 1.01% absolute over (b). As shown in the table, $n = 40$ gave the maximum difference between columns (d) and (b). This is probably related to the fact that the number of utterances in a document is roughly 40 on average. The difference grew as $n$ was increased from 1 to 40, but became slightly smaller as $n$ was increased beyond 40.

5 Conclusions and Future Work

In this paper we propose a framework for embedding audio segments into better clustered vector representations of fixed dimensionality. It includes identifying positive and negative pairs from unlabeled data for Siamese style training, disentangling acoustic factors such as speaker characteristics from the audio embedding, and handling unbalanced data distribution. The proposed methods improved the results in both the clustering analysis and spoken term detection. In future work, we aim to distill only the linguistic information from audio segments.