Towards localisation of keywords in speech using weak supervision

12/14/2020
by Kayode Olaleye, et al.

Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available. We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly. In the first, only the presence or absence of a word is indicated, i.e. a bag-of-words (BoW) labelling. In the second, visual context is provided in the form of an image paired with an unlabelled utterance; a model then needs to be trained in a self-supervised fashion using the paired data. For keyword localisation, we adapt a saliency-based method typically used in the vision domain. We compare this to an existing technique that performs localisation as part of the network architecture. While the saliency-based method is more flexible (it can be applied without architectural restrictions), we identify a critical limitation when using it for keyword localisation. Of the two forms of supervision, the visually trained model performs worse than the BoW-trained model. We show qualitatively that the visually trained model sometimes locates semantically related words, but this is not consistent. While our results show that there is some signal allowing for localisation, they also call for other localisation methods better matched to these forms of weak supervision.
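As a rough illustration of the first form of supervision, the sketch below builds a BoW label for a single utterance. It is not taken from the paper: the function name `bow_label`, the three-word vocabulary and the example sentence are all hypothetical, and a transcript is used here only to make the example concrete. The point it shows is that the label records which keywords occur but nothing about where they occur.

```python
def bow_label(transcript, vocabulary):
    """Return a multi-hot list: 1 if the keyword occurs anywhere in the utterance, else 0."""
    # Using a set discards word order and position, which is exactly
    # the location information a BoW label does not carry.
    words = set(transcript.lower().split())
    return [1 if keyword in words else 0 for keyword in vocabulary]


# Hypothetical keyword vocabulary and utterance, chosen for illustration only.
vocabulary = ["dog", "ball", "beach"]
print(bow_label("a dog runs after a ball", vocabulary))
print(bow_label("a ball rolls past a dog", vocabulary))
```

Both calls yield the same label even though the keywords appear at different positions, so a model trained on such labels has to infer location on its own.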

research
02/02/2022

Keyword localisation in untranscribed speech using visually grounded speech models

Keyword localisation is the task of finding where in a speech utterance ...
research
06/16/2021

Attention-Based Keyword Localisation in Speech using Visual Grounding

Visually grounded speech models learn from images paired with spoken cap...
research
10/12/2022

Towards visually prompted keyword localisation for zero-resource spoken languages

Imagine being able to show a system a visual depiction of a keyword and ...
research
06/13/2018

Visually grounded cross-lingual keyword spotting in speech

Recent work considered how images paired with speech can be used as supe...
research
04/24/2019

On the Contributions of Visual and Textual Supervision in Low-resource Semantic Speech Retrieval

Recent work has shown that speech paired with images can be used to lear...
research
05/30/2023

Understanding temporally weakly supervised training: A case study for keyword spotting

The currently most prominent algorithm to train keyword spotting (KWS) m...
research
03/07/2023

Self-supervised speech representation learning for keyword-spotting with light-weight transformers

Self-supervised speech representation learning (S3RL) is revolutionizing...
