Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

08/12/2023
by Kumari Nishu, et al.

Spotting user-defined (flexible) keywords given as text typically relies on an expensive text encoder that is analyzed jointly with an audio encoder in a shared embedding space. This approach can suffer from heterogeneous modality representations (i.e., a large mismatch between the text and audio embeddings) and from increased complexity. In this work, we propose a novel architecture that efficiently detects arbitrary keywords using an audio-compliant text encoder, which inherently has a representation homogeneous with the audio embedding and is also much smaller than a compatible text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors extracted from the paired audio encoder on rich speech datasets. We further augment our method with confusable keyword generation to develop an audio-text embedding verifier with strong discriminative power. Experimental results show that our scheme outperforms the state-of-the-art results on the LibriPhrase hard dataset, increasing the Area Under the ROC Curve (AUC) metric from 84.21 23.36
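
The text-encoder pipeline described in the abstract (text, then phonemes via G2P, then a sequence of representative phoneme vectors taken from the audio encoder) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `TOY_LEXICON` stands in for a trained G2P model, the frame embeddings are made-up numbers rather than real audio-encoder outputs, and taking the per-phoneme mean as the "representative" vector is one plausible choice, not necessarily the one used in the paper.

```python
from collections import defaultdict

# Hypothetical toy lexicon standing in for a trained G2P model.
TOY_LEXICON = {
    "hey": ["HH", "EY"],
    "siri": ["S", "IH", "R", "IY"],
}


def g2p(text):
    """Convert text to a flat phoneme list using the toy lexicon."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON[word])
    return phonemes


def build_phoneme_table(labeled_frames):
    """Average audio-encoder embeddings per phoneme label.

    labeled_frames: iterable of (phoneme, embedding) pairs, assumed to come
    from an audio encoder run on speech with phoneme-level alignments.
    The mean is an assumed notion of "representative" vector.
    """
    sums = {}
    counts = defaultdict(int)
    for phoneme, vec in labeled_frames:
        if phoneme not in sums:
            sums[phoneme] = [0.0] * len(vec)
        sums[phoneme] = [s + v for s, v in zip(sums[phoneme], vec)]
        counts[phoneme] += 1
    return {p: [s / counts[p] for s in total] for p, total in sums.items()}


def encode_text(text, table):
    """Map text to a sequence of phoneme vectors living in the audio space."""
    return [table[p] for p in g2p(text)]


# Made-up 2-dimensional embeddings, purely for illustration.
frames = [("HH", [1.0, 0.0]), ("HH", [3.0, 2.0]), ("EY", [0.0, 4.0])]
table = build_phoneme_table(frames)
keyword_embedding = encode_text("hey", table)
```

Because the lookup table is built from the audio encoder's own outputs, the text embedding lies in the same space as the audio embedding by construction, and the text side needs only a G2P step plus a table lookup rather than a separate learned encoder.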


