Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

11/07/2018
by Sung-Feng Huang, et al.

Embedding audio signal segments into vectors of fixed dimensionality is attractive because it makes all subsequent processing, such as modeling, classification, or indexing, easier and more efficient. Audio Word2Vec, proposed previously, was shown to represent audio segments of spoken words as such vectors, carrying information about the phonetic structures of the signal segments. However, each linguistic unit (word, syllable, or phoneme in text form) corresponds to an unlimited number of audio segments, whose vector representations are inevitably spread over the embedding space, and this causes some confusion. It is therefore desirable to cluster the audio embeddings more tightly, so that those corresponding to the same linguistic unit are more compactly distributed. In this paper, inspired by Siamese networks, we propose several approaches to achieve this goal, including identifying positive and negative pairs from unlabeled data for Siamese-style training, disentangling acoustic factors such as speaker characteristics from the audio embedding, handling unbalanced data distributions, and having the embedding process learn from the adjacency relationships among data points. All of these can be done in an unsupervised way. Improved performance was obtained in preliminary experiments on the LibriSpeech data set, including analysis of clustering characteristics and applications to spoken term detection.
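The sketch below illustrates the Siamese-style idea described in the abstract; it is not the authors' implementation. Assuming PyTorch, a GRU encoder maps variable-length acoustic feature sequences to fixed-dimensional embeddings, nearest-neighbor adjacency serves as an unsupervised stand-in for positive pairs, and a contrastive margin loss pulls positives together while pushing randomly sampled negatives apart. The names AudioEncoder, mine_pairs, and siamese_loss, as well as all hyperparameters, are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of Siamese-style training for audio
# embeddings with adjacency-based positive mining; names and hyperparameters
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """GRU encoder mapping a variable-length acoustic feature sequence
    (e.g. MFCC frames) to a fixed-dimensional, L2-normalized embedding."""
    def __init__(self, feat_dim=39, hidden_dim=128, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, emb_dim)

    def forward(self, x):                       # x: (batch, time, feat_dim)
        _, h = self.rnn(x)                      # h: (1, batch, hidden_dim)
        return F.normalize(self.proj(h.squeeze(0)), dim=-1)

def mine_pairs(emb):
    """Unsupervised pair selection: each segment's nearest neighbour in the
    embedding space is treated as a positive (adjacency), and a randomly
    permuted segment as a negative (occasional false negatives are tolerated
    in this sketch)."""
    with torch.no_grad():
        dist = torch.cdist(emb, emb)            # (N, N) pairwise distances
        dist.fill_diagonal_(float("inf"))       # exclude self-matches
        pos_idx = dist.argmin(dim=1)            # adjacency-based positives
        neg_idx = torch.randperm(emb.size(0))   # random negatives
    return pos_idx, neg_idx

def siamese_loss(emb, pos_idx, neg_idx, margin=0.5):
    """Contrastive margin loss: positives pulled together, negatives pushed
    at least `margin` apart in squared Euclidean distance."""
    d_pos = (emb - emb[pos_idx]).pow(2).sum(dim=1)
    d_neg = (emb - emb[neg_idx]).pow(2).sum(dim=1)
    return (d_pos + F.relu(margin - d_neg)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    encoder = AudioEncoder()
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    # Dummy batch: 32 audio segments, 80 frames of 39-dim MFCCs each.
    segments = torch.randn(32, 80, 39)
    emb = encoder(segments)
    pos_idx, neg_idx = mine_pairs(emb)
    loss = siamese_loss(emb, pos_idx, neg_idx)
    loss.backward()
    optimizer.step()
    print(f"contrastive loss: {loss.item():.4f}")
```

In the paper's setting, positives would come from adjacency relationships across the whole corpus rather than within a single random batch, and further components (e.g. for disentangling speaker characteristics and handling unbalanced data) would be added on top of this basic objective.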


Related research

07/21/2018 - Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval
Word embedding or Word2Vec has been successful in offering semantics for...

08/07/2018 - Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection
While Word2Vec represents words (in text) as vectors carrying semantic i...

07/19/2017 - Language Transfer of Audio Word2Vec: Learning Audio Segment Representations without Target Language Data
Audio Word2Vec offers vector representations of fixed dimensionality for...

11/08/2016 - Discriminative Acoustic Word Embeddings: Recurrent Neural Network-Based Approaches
Acoustic word embeddings, fixed-dimensional vector representations of...

11/28/2020 - Unsupervised Spoken Term Discovery Based on Re-clustering of Hypothesized Speech Segments with Siamese and Triplet Networks
Spoken term discovery from untranscribed speech audio could be achieved ...

04/01/2018 - Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings
Unsupervised discovery of acoustic tokens from audio corpora without ann...

10/21/2022 - Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings
The paper describes a novel approach to Spoken Term Detection (STD) in l...
