Unsupervised Spoken Term Discovery Based on Re-clustering of Hypothesized Speech Segments with Siamese and Triplet Networks

11/28/2020
by   Man-Ling Sung, et al.

Spoken term discovery from untranscribed speech audio can be achieved via a two-stage process. In the first stage, the unlabelled speech is decoded into a sequence of subword units that are learned and modelled in an unsupervised manner. In the second stage, partial sequence matching and clustering are performed on the decoded subword sequences, resulting in a set of discovered words or phrases. A limitation of this approach is that the results of subword decoding may be erroneous, and these errors affect the subsequent steps. Siamese and Triplet networks offer one way to learn segment representations that can improve the discovery process, but in a completely unsupervised scenario no training examples are available for them. In this paper, we propose to generate training examples from the initial hypothesized sequence clusters. The Siamese/Triplet network is trained on the hypothesized examples to measure the similarity between two speech segments, and this similarity is then used to re-cluster all hypothesized subword sequences to achieve spoken term discovery. Experimental results show that the proposed approach is effective in obtaining training examples for Siamese and Triplet networks, improving the efficacy of spoken term discovery compared with the original two-stage method.
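
The idea of drawing training examples from hypothesized clusters can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `make_triplets` and `triplet_loss`, the Euclidean distance, the hinge form of the loss, the margin value, and the toy embeddings are all assumptions made for illustration. Segments that share a hypothesized cluster label serve as anchor/positive pairs, and a segment from any other cluster serves as the negative.

```python
import itertools
import random

import numpy as np


def make_triplets(cluster_ids, rng, negatives_per_pair=1):
    """Build (anchor, positive, negative) index triplets from hypothesized labels.

    cluster_ids[i] is the hypothesized cluster of segment i (hypothetical input).
    """
    by_cluster = {}
    for idx, cid in enumerate(cluster_ids):
        by_cluster.setdefault(cid, []).append(idx)

    triplets = []
    for cid, members in by_cluster.items():
        # Candidate negatives: every segment hypothesized to be a different term.
        outsiders = [i for i, c in enumerate(cluster_ids) if c != cid]
        if len(members) < 2 or not outsiders:
            continue  # a singleton cluster yields no positive pair
        for anchor, positive in itertools.permutations(members, 2):
            for _ in range(negatives_per_pair):
                triplets.append((anchor, positive, rng.choice(outsiders)))
    return triplets


def triplet_loss(embeddings, triplets, margin=0.5):
    """Mean hinge loss max(0, d(a, p) - d(a, n) + margin) with Euclidean d."""
    a, p, n = (embeddings[[t[k] for t in triplets]] for k in range(3))
    d_pos = np.linalg.norm(a - p, axis=1)
    d_neg = np.linalg.norm(a - n, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


# Toy data: five segments in two hypothesized clusters, with 2-D embeddings
# standing in for the output of a segment encoder (assumed, not from the paper).
cluster_ids = [0, 0, 1, 1, 1]
embeddings = np.array(
    [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]]
)
triplets = make_triplets(cluster_ids, random.Random(0))
loss = triplet_loss(embeddings, triplets)
print(len(triplets), loss)
```

In this sketch the loss is zero once same-cluster segments lie closer together than different-cluster segments by at least the margin. Because the cluster labels are themselves hypotheses, some triplets will be wrong; the sketch does nothing to filter them.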

research
11/03/2020

Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features

The present study tackles the problem of automatically discovering spoke...
research
04/30/2018

Sampling strategies in Siamese Networks for unsupervised speech representation learning

Recent studies have investigated siamese network architectures for learn...
research
07/26/2020

Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery

Unsupervised spoken term discovery consists of two tasks: finding the ac...
research
11/07/2018

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

Embedding audio signal segments into vectors with fixed dimensionality i...
research
08/03/2020

Unsupervised Discovery of Recurring Speech Patterns Using Probabilistic Adaptive Metrics

Unsupervised spoken term discovery (UTD) aims at finding recurring segme...
research
12/10/2020

Direct multimodal few-shot learning of speech and images

We propose direct multimodal few-shot models that learn a shared embeddi...
research
10/21/2022

Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings

The paper describes a novel approach to Spoken Term Detection (STD) in l...
