Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings

10/21/2022
by   Jan Švec, et al.

The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work builds on an earlier approach that used Siamese neural networks for STD and extends it to directly localize a spoken term and estimate its relevance score. A phoneme recognizer generates a phoneme confusion network, and a deep LSTM network projects each segment of that confusion network into an embedding space. A second deep LSTM network projects the searched term into the same embedding space. The relevance score is computed as a dot product in the embedding space and calibrated with a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses at the word and phoneme levels. The method is evaluated experimentally on MALACH data in English and Czech.
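
The scoring step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the arrays stand in for the outputs of the two LSTM networks, the calibration parameters `scale` and `bias` are hypothetical, and the threshold-based `localize` function is just one simple assumed way to read a location off the probability sequence, since the abstract does not specify the procedure.

```python
import numpy as np


def sigmoid(x):
    """Logistic function mapping raw scores to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def relevance_probabilities(segment_emb, query_emb, scale=1.0, bias=0.0):
    """Dot-product relevance score per segment, calibrated by a sigmoid.

    segment_emb: array of shape (T, D), one embedding per confusion-network
                 segment (assumed to come from the first network).
    query_emb:   array of shape (D,), embedding of the searched term
                 (assumed to come from the second network).
    scale, bias: hypothetical calibration parameters, not from the paper.
    """
    scores = segment_emb @ query_emb          # shape (T,)
    return sigmoid(scale * scores + bias)     # shape (T,)


def localize(probs, threshold=0.5):
    """Assumed localization rule: contiguous runs of segments whose
    probability is at least `threshold`.

    Returns a list of (start, end, score) tuples, where `end` is exclusive
    and `score` is the maximum probability inside the run.
    """
    hits = []
    start = None
    for t, p in enumerate(probs):
        if p >= threshold and start is None:
            start = t                          # a run begins here
        elif p < threshold and start is not None:
            hits.append((start, t, float(np.max(probs[start:t]))))
            start = None                       # the run has ended
    if start is not None:                      # run reaches the last segment
        hits.append((start, len(probs), float(np.max(probs[start:]))))
    return hits


if __name__ == "__main__":
    # Toy embeddings: segments 1 and 2 point in the query direction.
    segments = np.array([[-5.0, 0.0], [5.0, 0.0], [5.0, 0.0], [-5.0, 0.0]])
    query = np.array([1.0, 0.0])
    probs = relevance_probabilities(segments, query)
    print(localize(probs))
```

In this toy input the two middle segments have a large positive dot product with the query, so they form a single detected run covering segments 1 and 2. In a real system the embeddings would be learned so that segments overlapping the searched term score high and all others score low.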

research
10/21/2022

Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer

In recent years, the standard hybrid DNN-HMM speech recognizers are outp...
research
11/02/2022

Transformer-based encoder-encoder architecture for Spoken Term Detection

The paper presents a method for spoken term detection based on the Trans...
research
03/07/2021

CNN-based Spoken Term Detection and Localization without Dynamic Programming

In this paper, we propose a spoken term detection algorithm for simultan...
research
01/08/2023

Analyzing the Representational Geometry of Acoustic Word Embeddings

Acoustic word embeddings (AWEs) are vector representations such that dif...
research
11/07/2018

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

Embedding audio signal segments into vectors with fixed dimensionality i...
research
11/28/2020

Unsupervised Spoken Term Discovery Based on Re-clustering of Hypothesized Speech Segments with Siamese and Triplet Networks

Spoken term discovery from untranscribed speech audio could be achieved ...
research
11/24/2022

Analysis on English Vocabulary Appearance Pattern in Korean CSAT

A text-mining-based word class categorization method and LSTM-based voca...
