Matching Latent Encoding for Audio-Text based Keyword Spotting

06/08/2023
by Kumari Nishu, et al.

Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown high-quality results, but the key challenge of semantically aligning the two embeddings for multi-word keywords of different sequence lengths remains largely unsolved. In this paper, we propose an audio-text-based end-to-end model architecture for flexible keyword spotting that builds upon learned audio and text embeddings. Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence by exploiting the monotonic alignment of spoken content. The proposed model consists of an encoder block that produces audio and text embeddings, a projector block that projects the individual embeddings into a common latent space, and an audio-text aligner containing the DSP algorithm, which aligns the audio and text embeddings to determine whether the spoken content matches the text. Experimental results show that DSP is more effective than other partitioning schemes, and the proposed architecture outperformed the state-of-the-art results on the public dataset in terms of Area Under the ROC Curve (AUC) and Equal-Error-Rate (EER) by 14.4
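The partitioning idea in the abstract can be illustrated with a small dynamic program. The sketch below is not the authors' DSP implementation: the abstract does not state the objective, so the squared Euclidean frame-to-word cost, the mean pooling of each segment, and the function names `partition_audio` and `pool_segments` are all assumptions made for illustration. It only shows how a monotonic alignment can split T audio frames into exactly N contiguous, non-empty segments, one per word.

```python
import numpy as np


def partition_audio(audio, text):
    """Assign each of T audio frames to one of N words, monotonically.

    audio: (T, D) array of frame embeddings (assumed already projected).
    text:  (N, D) array of word embeddings in the same latent space.
    Returns (labels, total_cost); labels[t] is the word index of frame t.
    The squared Euclidean cost is an assumption, not taken from the paper.
    """
    T, N = len(audio), len(text)
    if T < N:
        raise ValueError("need at least one audio frame per word")

    # cost[t, n]: distance between frame t and word n
    cost = ((audio[:, None, :] - text[None, :, :]) ** 2).sum(axis=-1)

    # dp[t, n]: minimum cost of covering frames 0..t with words 0..n,
    # where frame t belongs to word n
    dp = np.full((T, N), np.inf)
    back = np.zeros((T, N), dtype=int)  # 1 if frame t starts word n
    dp[0, 0] = cost[0, 0]

    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = dp[t - 1, n]
            advance = dp[t - 1, n - 1] if n > 0 else np.inf
            if advance < stay:
                dp[t, n] = advance + cost[t, n]
                back[t, n] = 1
            else:
                dp[t, n] = stay + cost[t, n]

    # Walk back from the last frame of the last word
    labels = np.empty(T, dtype=int)
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels[t] = n
        n -= back[t, n]
    return labels, dp[T - 1, N - 1]


def pool_segments(audio, labels, n_words):
    """Mean-pool each segment so the audio sequence has length n_words."""
    return np.stack([audio[labels == n].mean(axis=0) for n in range(n_words)])


if __name__ == "__main__":
    # Toy data: three frames near the first word, two near the second
    audio = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
    text = np.array([[0.0, 0.0], [5.0, 5.0]])
    labels, total = partition_audio(audio, text)
    print(labels, total)
    print(pool_segments(audio, labels, len(text)))
```

After pooling, the audio and text sequences have the same length, so they can be compared position by position. In the paper that comparison is performed by the audio-text aligner; here it is left out because the abstract gives no detail on it.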

research
08/12/2023

Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

Spotting user-defined/flexible keywords represented in text frequently u...
research
06/30/2022

Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting

In this paper, we propose a novel end-to-end user-defined keyword spotti...
research
11/18/2020

Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

End-to-end (E2E) spoken language understanding (SLU) systems can infer t...
research
09/02/2020

Seeing wake words: Audio-visual Keyword Spotting

The goal of this work is to automatically determine whether and when a w...
research
09/14/2023

Aligning Speakers: Evaluating and Visualizing Text-based Diarization Using Efficient Multiple Sequence Alignment (Extended Version)

This paper presents a novel evaluation approach to text-based speaker di...
research
10/09/2021

End-to-end Keyword Spotting using Xception-1d

The field of conversational agents is growing fast and there is an incre...
research
07/04/2016

Sequence to Backward and Forward Sequences: A Content-Introducing Approach to Generative Short-Text Conversation

Using neural networks to generate replies in human-computer dialogue sys...
