Unsupervised speech modeling is the task of discovering and modeling speech units at various levels from audio recordings without using any prior linguistic information. It is an interesting, challenging and impactful research problem, as phonetic, lexical and even semantic information could be acquired without transcribing and understanding the given speech data. The technology is particularly useful for data preparation in scenarios where 1) a large (even unlimited) amount of audio data is readily available online but untranscribed, or 2) a large amount of audio recordings is available for a low-resource language for which no structured linguistic knowledge or documentation can be found.
Spoken term discovery is a representative task of unsupervised speech modeling. It aims to discover repeatedly occurring words and/or phrases from untranscribed audio. The problem is commonly tackled with a two-stage approach. In the first stage, a set of subword units is automatically discovered from untranscribed speech data, and these units are in turn used to represent the speech data as a symbol sequence. In the second stage, variable-length sequence matching and clustering are performed on the subword sequence representations. One major drawback of this approach is that subword decoding errors in the first stage propagate and degrade the outcome of spoken term discovery in the second stage.
The present study investigates the use of Siamese and Triplet networks in spoken term discovery. Siamese networks have been commonly applied to pattern classification or matching problems when only weak labels are available. We propose to train a Siamese/Triplet network with a small dataset of matched and mismatched sequence pairs, and to use the trained network to generate feature representations for unseen subword sequences. The training dataset is constructed from hypothesized spoken term clusters produced by a baseline spoken term discovery system developed in our previous study. With the new feature representations learned by the Siamese/Triplet network, re-clustering of subword sequences is carried out to generate an improved set of discovered spoken terms.
2 Related Work
2.1 Spoken term discovery
Spoken term discovery aims to find and extract repeatedly occurring sequential patterns from audio in an unsupervised manner. In general, a spoken term discovery system performs three tasks one after the other: segmentation, matching and clustering [versteegh2015zero]. There are two main approaches to spoken term discovery. In the first approach, pattern discovery is done directly on acoustic features. Word-level speech segments are matched using sequence matching algorithms such as segmental DTW. The matching can be based on conventional frame-level features [park2005towards] or fixed-dimension segment representations [kamper2017embedded, thual2018k]. The second approach involves a two-stage process. Unsupervised subword modeling is first carried out on the untranscribed audio, resulting in a symbolic representation known as the pseudo transcription of speech. Sequential pattern discovery is then performed by local alignment or string matching, followed by clustering of the sequential patterns [jansen2011efficient, harwath2013zero, siu2014unsupervised, sung2018unsupervised]. The resulting clusters correspond to the spoken terms discovered in the given audio dataset.
2.2 Siamese and Triplet networks
The Siamese neural network was proposed in [bromley1994signature]. It consists of two identical sub-network components that share their learnable parameters. Through the two sub-network components, a Siamese network is trained to perform a designated classification task on a pair of data samples. The most common task is to determine whether the two input samples are from the same class or not. In other words, the exact class identities of individual training samples are not needed. Training a Siamese network requires fewer training samples than training conventional neural network classifiers [koch2015siamese]. Siamese networks are widely used in image processing, and have been shown to be able to compare samples from unseen classes in one-shot classification [koch2015siamese]. The Triplet network [hoffer2015deep] is an extension of the Siamese network. It consists of three identical sub-networks, which process three input samples in parallel: one reference sample, one matched sample and one mismatched sample. The network is trained to capture the similarity between the matched sample and the reference, and the dissimilarity between the mismatched sample and the reference.
2.3 Siamese network on spoken term detection/discovery
It has been shown that Siamese networks are able to learn new representations from audio signals, which facilitate spoken word classification [kamper2016deep]. They are also able to generate effective representations for spoken term detection [svec2017relevance, zhu2018siamese]. While existing work assumes that matched and mismatched pairs for training the Siamese network are available, one challenge in unsupervised spoken term discovery is that no information is given to the system other than the recording itself. To apply a Siamese network to learning segment representations, reliable matched and mismatched pairs are required for training the network.
Relatively little work has been done on unsupervised generation of matched and mismatched training pairs. One line of work identifies frame-level training samples: after segmentation, frames from the same segment are treated as matched pairs, and frames from adjacent segments are treated as mismatched pairs [bhati2019unsupervised]. Another extracts training examples from an available spoken term discovery system, with sampling based on the distributions of speakers and of matched/mismatched pairs [riad2018sampling].
3 Proposed System
To generate reliable matched and mismatched pairs, we rely on information generated by a trained spoken term discovery system. Subword units and term clusters are learned in an unsupervised manner, and training pairs are identified by evaluating the discovered term clusters based on the discovered subword units.
In our previous work [sung2018unsupervised], a two-stage spoken term discovery approach was investigated on recordings of classroom lectures. The audio signals are first converted into frame-level bottleneck features using a multilingual deep neural network model. A set of subword-level speech units is discovered based on the bottleneck features. The discovered subword units are treated as phonemes and serve as the acoustic modeling units in a conventional ASR system. The audio signals are in turn decoded by the ASR system into a pseudo transcription. Sequential pattern matching is applied to the pseudo transcription to obtain segments represented as subword sequences, followed by clustering of the subword sequences. The resulting clusters were shown to be strongly associated with keywords or key phrases that occur frequently in the audio signals. In particular, clusters formed by long subword sequences are generally able to represent meaningful whole words or phrases. Nevertheless, many of the resulting clusters, especially those formed by short sequences, do not provide much useful information for spoken term discovery.
3.1 Training data for Siamese network
The intended problem of spoken term discovery assumes the absence of any kind of data labels for supervised model training. To address this issue, we adopt the Siamese/Triplet network, which can be trained with weakly labelled data to learn robust segment-level representations of speech. The required segments and their “weak” labels, which tell whether a pair of speech segments contain the same or different spoken terms, are obtained by leveraging the preliminary clustering results of the two-stage approach described above. Simply speaking, the clusters with high “purity” are used to provide the training data and their labels.
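Let $C$ denote a cluster initially determined by the two-stage approach. $C$ contains a number of speech segments that hypothetically correspond to the same word or phrase. Consider two segments $s_i$ and $s_j$ in $C$, and let $d(s_i, s_j)$ be the Levenshtein distance between the symbol representations of $s_i$ and $s_j$. We calculate the mean $\mu_C$ and standard deviation $\sigma_C$ of the Levenshtein distances of all pairs of segments in $C$, i.e.,
$$\mu_C = \frac{1}{N_C} \sum_{i<j} d(s_i, s_j), \qquad \sigma_C = \sqrt{\frac{1}{N_C} \sum_{i<j} \left( d(s_i, s_j) - \mu_C \right)^2},$$
where $N_C = |C|(|C|-1)/2$ is the number of segment pairs in $C$.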
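A small value of the mean $\mu_C$ implies that members in the cluster $C$ have similar pseudo representations. A small value of the standard deviation $\sigma_C$ means that the distances between different member pairs are similar. These two measures can be used to indicate the purity of $C$. We propose to retain a set of clusters with $\mu_C$ and $\sigma_C$ below certain empirically determined thresholds, i.e.,
$$\mu_C \le \alpha \, \bar{L}_C, \qquad \sigma_C \le \beta \, \bar{L}_C,$$
where $\bar{L}_C$ denotes the average length of symbol sequences in $C$, and $\alpha$ and $\beta$ are the threshold factors.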
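The purity test above can be illustrated with a minimal Python sketch. It assumes that each segment is given as a list of subword symbols, that the population standard deviation is used, and that the thresholds scale with the average sequence length; the names `alpha` and `beta`, their default values and the symbol labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev


def levenshtein(a, b):
    """Edit distance between two symbol sequences (lists or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]


def cluster_stats(cluster):
    """Mean and (population) std of all pairwise distances in a cluster."""
    dists = [levenshtein(a, b) for a, b in combinations(cluster, 2)]
    return mean(dists), pstdev(dists)


def is_pure(cluster, alpha=0.2, beta=0.15):
    """Assumed rule: both statistics stay below a fraction of the average
    sequence length. alpha and beta are hypothetical parameter names."""
    if len(cluster) < 2:
        return False  # no pairs to evaluate
    mu, sigma = cluster_stats(cluster)
    avg_len = mean(len(seq) for seq in cluster)
    return mu <= alpha * avg_len and sigma <= beta * avg_len


# Toy clusters of subword sequences (symbols are made up).
tight = [["s1", "s4", "s7", "s2", "s9", "s3"],
         ["s1", "s4", "s7", "s2", "s9", "s3"],
         ["s1", "s4", "s7", "s2", "s9", "s5"]]
loose = [["s1", "s4", "s7"], ["s2", "s9", "s3"], ["s1", "s9", "s8"]]
print(is_pure(tight), is_pure(loose))
```

In this toy example the first cluster differs in a single symbol and passes the test, while the second cluster contains mostly unrelated sequences and is rejected.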
Let the collection of retained “pure” clusters be denoted by . Speech segments in the same cluster are believed to contain the same spoken term and therefore are used to form matching pairs for the training of Siamese/Triplet network. On the other hand, contrasting training pairs are formed by segments from contrasting clusters that have large inter-cluster distance. Consider clusters and in , and define
and are selected as contrasting clusters if
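As an illustration of how training examples could be assembled once the pure clusters and the contrasting cluster pairs are known, the following Python sketch enumerates matched pairs, mismatched pairs and triplets. The cluster identifiers, segment identifiers and the `contrasting` list are hypothetical inputs; the sketch does not implement the contrasting-cluster criterion itself, and it takes triplet references only from the first cluster of each contrasting pair, which is a simplification.

```python
from itertools import combinations, product


def matched_pairs(pure_clusters):
    """All within-cluster segment pairs, treated as matched."""
    return [(a, b) for segs in pure_clusters.values()
            for a, b in combinations(segs, 2)]


def mismatched_pairs(pure_clusters, contrasting):
    """Cross-cluster pairs for every pair of contrasting cluster ids."""
    return [(a, b) for m, n in contrasting
            for a, b in product(pure_clusters[m], pure_clusters[n])]


def triplets(pure_clusters, contrasting):
    """(reference, matched, mismatched) tuples; the reference and matched
    segments come from cluster m, the mismatched segment from cluster n."""
    out = []
    for m, n in contrasting:
        for ref, pos in combinations(pure_clusters[m], 2):
            out.extend((ref, pos, neg) for neg in pure_clusters[n])
    return out


# Hypothetical pure clusters, keyed by cluster id, holding segment ids.
clusters = {"c1": ["a1", "a2", "a3"], "c2": ["b1", "b2"]}
contrasting = [("c1", "c2")]  # assumed to be selected beforehand

print(len(matched_pairs(clusters)),
      len(mismatched_pairs(clusters, contrasting)),
      len(triplets(clusters, contrasting)))
```

The number of matched pairs grows quadratically with cluster size, which is consistent with a few hundred clusters yielding a very large number of training pairs.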
3.2 Siamese/Triplet network
The Siamese network consists of two identical convolutional neural networks (CNNs) with shared parameters. In the proposed model, the two CNNs take in the bottleneck features from a pair of speech segments, denoted and , and their outputs are the respective learnt representations, denoted and . If and are a matched pair, the overall output of the Siamese network is trained to be . If they are a mismatched pair, the output is trained to be .
The network parameters are trained to minimize the contrastive loss function defined as
The Triplet network is composed of the same type of CNN components. It takes three segments , and as the input, where and are a matched pair, and and are a mismatched pair. The Triplet network aims at embedding matching samples closer together while keeping contrasting samples apart in the representation space. The Triplet loss function is given as
where denotes the margin between matched and mismatched samples from .
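For reference, the widely used margin-based forms of the two losses can be sketched in plain Python as below. This is an illustration under stated assumptions, not necessarily the exact formulation used in this work: embeddings are plain lists of floats, the distance is Euclidean, constant scaling factors are omitted, and the `margin` default is arbitrary.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(e1, e2, is_match, margin=1.0):
    """Pull matched pairs together; push mismatched pairs apart until
    their distance exceeds the margin."""
    d = euclidean(e1, e2)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(ref, pos, neg, margin=1.0):
    """Hinge on the gap between the reference-matched distance and the
    reference-mismatched distance."""
    return max(0.0, euclidean(ref, pos) - euclidean(ref, neg) + margin)


ref, near, far = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]
print(contrastive_loss(ref, near, is_match=True))   # squared distance
print(triplet_loss(ref, near, far))                 # margin satisfied
print(triplet_loss(ref, far, near))                 # margin violated
```

Both losses become zero once mismatched samples are separated by more than the margin, so well-separated pairs stop contributing to the gradient during training.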
The Siamese/Triplet network is trained to learn segment representations that can be used to measure the similarity between segments. Our idea is to apply re-clustering to the speech segments so as to achieve spoken term discovery. Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [campello2013density] is adopted. The clustering algorithm uses the data samples to construct a minimum spanning tree of the distance-weighted graph. Each node of the tree represents a data sample, and the weight of the edge connecting two nodes represents the distance between the data samples. A cluster hierarchy is built from the tree. The hierarchy is then condensed based on a minimum cluster size, and finally the stable clusters are obtained.
In some cases, HDBSCAN may produce many micro-clusters in high-density regions. Methods that combine DBSCAN and HDBSCAN have therefore been introduced, for example by adding a cluster selection epsilon, a distance threshold below which clusters are not split any further, so that DBSCAN-like results are obtained for dense regions [malzer2019hybrid]. This hybrid clustering approach is also considered in our implementation.
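The first step of the clustering procedure, building a minimum spanning tree over the samples, can be sketched in plain Python as follows. The sketch assumes the mutual reachability distance commonly used by HDBSCAN as the edge weight and Prim's algorithm for the tree; the function names, the `min_samples` default and the toy points are illustrative, and the later steps (condensing the hierarchy and selecting stable clusters) are not implemented here.

```python
import math


def mutual_reachability(points, min_samples=2):
    """Pairwise mutual reachability distances between points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Core distance: distance to the min_samples-th nearest point,
    # counting the point itself (one common convention).
    core = [sorted(row)[min_samples - 1] for row in dist]
    return [[max(core[i], core[j], dist[i][j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]


def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense weight matrix; returns (i, j, w) edges."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge into each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, weights[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v], parent[v] = weights[u][v], u
    return edges


# Two well-separated toy groups of 2-D "segment representations".
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
mst = minimum_spanning_tree(mutual_reachability(points))
for i, j, w in sorted(mst, key=lambda e: e[2]):
    print(i, j, round(w, 2))
```

Removing the heaviest edges of the tree one by one splits the data into progressively smaller groups, which is the hierarchy that HDBSCAN subsequently condenses.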
4 Experimental Setup
As in our previous work, lecture recordings from MIT Online Courses are used in the experiments. The lecture “Recursion and Dictionaries” in the course “Introduction to Computer Science and Programming in Python” (PYTH) is analyzed. The Zerospeech Challenge Mandarin dataset is also used to evaluate the system [dunbar2017zero]. For both datasets, the initial spoken term clusters and the corresponding segmentation information are obtained by the baseline two-stage system detailed in [sung2018unsupervised]. The training data for the Siamese and Triplet networks are created from the initial clusters as described in Section 3.1.
4.2 Network training
A simple CNN is used as the sub-network component in the Siamese/Triplet network. The CNN comprises fully-connected layers with a ReLU activation function. The output layer is a linear layer of dimension. The input contains frame-level bottleneck features from speech segments. The variable-length feature sequences are zero-padded to derive fixed-length sequences for all segments. The Siamese and Triplet networks are trained for no more than 20 epochs until a reasonable loss value is attained. After training, the networks are used to transform all segments obtained from the sequence alignment step into fixed-dimension segment representations. Subsequently, HDBSCAN is applied to cluster the representations into spoken term clusters. For the lecture recording, only HDBSCAN is used. For the Zerospeech data, both HDBSCAN and its hybrid extension are tested, with the cluster selection epsilon set to 0.2.
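The zero-padding step can be sketched as follows. The sketch assumes that each segment is a list of frame vectors, that padding frames are appended at the end, and that the target length defaults to the longest segment; the optional truncation and the name `max_frames` are assumptions added for illustration.

```python
def zero_pad(segments, max_frames=None):
    """Pad each frame sequence with zero vectors to a common length.

    If max_frames is given and a segment is longer, it is truncated
    (an assumption; by default the longest segment sets the length)."""
    if max_frames is None:
        max_frames = max(len(seg) for seg in segments)
    dim = len(segments[0][0])  # feature dimension of one frame
    padded = []
    for seg in segments:
        frames = [list(f) for f in seg[:max_frames]]
        frames += [[0.0] * dim for _ in range(max_frames - len(frames))]
        padded.append(frames)
    return padded


# Two toy segments with 2 and 3 frames of 2-dimensional features.
segments = [[[0.1, 0.2], [0.3, 0.4]],
            [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]]
batch = zero_pad(segments)
print([len(seg) for seg in batch], batch[0][-1])
```

After padding, all segments share the same shape and can be stacked into a single input tensor for the CNN.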
4.3 Performance evaluation
We compare the performance of the proposed systems, which re-cluster the segments, against the two-stage baseline system. The proposed systems have two implementations, in which a Siamese network and a Triplet network, respectively, are used to learn segment representations. Different metrics are used to evaluate the systems on different datasets.
For lecture recordings, the system performance is assessed by comparing the resulting spoken term clusters with the ground-truth text transcriptions. Specifically, the most frequent word trigrams, bigrams and unigrams are retrieved, with function words excluded. The purity, uniqueness and coverage of the obtained clusters with reference to the frequent n-grams are evaluated. The purity of a cluster is defined as the percentage of its members corresponding to the same n-gram item. Uniqueness refers to the number of different clusters corresponding to the same n-gram item. Coverage measures the degree to which an n-gram item is covered by the discovered term clusters.
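A minimal Python sketch of these three measures is given below. It assumes that every cluster is represented by the list of reference n-gram labels of its members, obtained by aligning the segments with the transcription. Purity and uniqueness follow the definitions above, with a cluster assigned to its most frequent label; the coverage formula (occurrences found in clusters divided by occurrences in the reference) is an assumed reading, and all names and data are hypothetical.

```python
from collections import Counter


def purity(cluster_labels):
    """Majority label of a cluster and the share of members carrying it."""
    label, count = Counter(cluster_labels).most_common(1)[0]
    return label, count / len(cluster_labels)


def uniqueness(clusters, item):
    """Number of clusters whose majority label is the given n-gram item."""
    return sum(1 for labels in clusters if purity(labels)[0] == item)


def coverage(clusters, item, n_reference):
    """Assumed definition: occurrences of the item found in any cluster,
    divided by its number of occurrences in the reference transcription."""
    found = sum(labels.count(item) for labels in clusters)
    return found / n_reference


# Toy clusters; each entry is the reference label of one member segment.
clusters = [["return"] * 10 + ["returns"],
            ["dictionary", "dictionary", "the dictionary"],
            ["return", "return"]]

print(purity(clusters[0]))
print(uniqueness(clusters, "return"))
print(coverage(clusters, "return", n_reference=20))
```

A uniqueness value above one, as for “return” in this toy example, indicates that the same term has been split across several clusters.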
For the Zerospeech data, the evaluation metrics of the spoken term discovery track are adopted. They include grouping and type scores that measure the clustering quality, token and boundary scores that measure the parsing quality, and NED and coverage that measure the matching quality of the discovered term clusters.
5 Results and Analysis
5.1 Quality of spoken term clusters on lecture recording
In the two-stage baseline system, initial clusters are obtained by selecting matched sub-sequences that contain no fewer than 3 subword units, followed by leader clustering with a radius of 0.8 and margin of 2.2. From the 48-minute lecture recording, 564 initial term clusters are obtained, containing a total of 16,917 hypothesized segments. 286 “pure” clusters are selected by setting = 0.2, = 0.15, and = 0.4, = 0.2, from which 347,016 Siamese training pairs and 16,889 Triplet training tuples are generated. Applying HDBSCAN to the segment representations generated by the Siamese and Triplet networks, we obtain 1,649 and 1,994 term clusters respectively.
The most frequent n-gram items in the reference transcription and the corresponding term clusters discovered by the 3 systems are listed in Table 1. The clusters discovered by the Siamese system show the smallest intra-cluster variation. Take the unigram item “dictionary” as an example. The clusters discovered by the baseline contain the variants “a dictionary”, “the dictionary” and “dictionaries”, while the clusters discovered by the Siamese system contain only “dictionary”. The Siamese system is more capable than the baseline of generating clusters of short terms, e.g., “return” and “times”, with a high purity of 90%. Segments that are missed in the initial clusters by the baseline can be re-discovered by the Siamese system, e.g., “in the dictionary”. This shows that the proposed model has the ability to handle unseen segments.
The purity of the discovered spoken term clusters is improved by the proposed method. The purity of the cluster for the term “return” increases from 7.7% to 90.9%, and that of “is equal to” increases from 66.7% to 100%. However, as the matching criteria are tightened, it becomes more likely that multiple clusters correspond to the same term. This is undesirable, especially for words with polyphones or high phonetic complexity, when the coverage is already very high in the initial clusters discovered by the baseline.
For the Triplet system, the terms in a cluster appear to be looser than in the case of the Siamese system, i.e., more variants are included in a discovered cluster. For example, “in the dictionary” and “how the dictionary” are grouped into the same cluster, and “the simpler version of”, “the simpler version” and “the simple example” are in the same cluster. The clustering performance shows great variation, from a high purity of 100% to a low of 25% for terms that are discovered by the other systems with high purity. This poor performance contradicts the understanding that the Triplet network usually performs better than the Siamese network [hoffer2015deep]. One possible reason is that the training examples for the Triplet network are insufficient, and the result is therefore not conclusive.
5.2 Zerospeech 2017 challenge result comparison
The proposed method is also evaluated on the Zerospeech Mandarin dataset. To follow the design of selecting only the longer sub-sequences to generate training samples, the baseline model clusters sequences with no fewer than 3 subword units. From the 2.5-hour recording, 1,899,591 Siamese training pairs and 618,146 Triplet training tuples are obtained by setting = 0.2, = 0.2, and = 0.4, = 0.2. All the sub-sequences are then transformed into segment representations by the Siamese and Triplet networks. The performance of the systems is listed in Table 2.
The results show that the baseline model discovers only 65% of the words. The Siamese and Triplet networks are able to learn effective segment representations that lead to the discovery of terms not covered before, raising the coverage by at least 20%. Better cluster quality, with higher token, type and boundary scores, is achieved as well. Similar coverage is achieved by the Siamese and Triplet networks with both clustering algorithms, with the Triplet network being slightly better.
Regarding the clustering algorithms, HDBSCAN and its hybrid extension have different strengths and weaknesses in terms of grouping quality and NED. HDBSCAN produces a lower NED but gives limited improvement in the grouping score, while the hybrid method gives a better grouping score at the cost of a higher NED. With the hybrid method, some clusters become very large and include too many variations of similar terms, resulting in a very large number of n-pairs.
In this work, we make an attempt to use Siamese and Triplet networks for spoken term discovery in a completely unsupervised scenario. The initial segmentation and cluster information is obtained from another spoken term discovery system. The clusters with high confidence are used to generate matched and mismatched pairs and tuples for training the Siamese and Triplet networks. The networks are used to generate representations for all the available segments, followed by HDBSCAN on the segment representations to obtain a new set of spoken term clusters.
It is shown that, even when the exact labels of the segments are unavailable, a Siamese/Triplet network can still be trained when a small set of high-confidence matched and mismatched data pairs is available. The segment representations generated by the Siamese and Triplet networks can outperform the baseline two-stage model. In the lecture recording experiment, the result is not conclusive for the Triplet network. However, the experiment on the Zerospeech dataset shows that the Triplet network is slightly better than the Siamese network in learning segment representations for spoken term discovery when trained on sufficient data.