
Towards visually prompted keyword localisation for zero-resource spoken languages

10/12/2022
by   Leanne Nortje, et al.

Imagine being able to show a system a visual depiction of a keyword and having it find the spoken utterances that contain this keyword in a zero-resource speech corpus. We formalise this task and call it visually prompted keyword localisation (VPKL): given an image of a keyword, detect whether the keyword occurs in an utterance and predict where it occurs. To do VPKL, we propose a speech-vision model with a novel localising attention mechanism, which we train with a new keyword sampling scheme. We show that these innovations give improvements in VPKL over an existing speech-vision model. We also compare to a visual bag-of-words (BoW) model where images are automatically tagged with visual labels and paired with unlabelled speech. Although this visual BoW can be queried directly with a written keyword (while ours takes image queries), our new model still outperforms the visual BoW in both detection and localisation, giving a 16% relative improvement in localisation F1.
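
To make the two parts of the task concrete (detection, then localisation), here is a minimal sketch. It is not the model from the paper: it assumes the image query and each speech frame have already been mapped into a shared embedding space, and it replaces the localising attention mechanism with plain cosine similarity. The function names, the embeddings and the threshold value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def detect_and_localise(query_emb, frame_embs, threshold=0.5):
    """Toy stand-in for VPKL on one utterance.

    query_emb  : embedding of the image query (assumed to be given)
    frame_embs : one embedding per speech frame (assumed to be given)
    threshold  : hypothetical detection threshold on the best frame score
    """
    # Score every speech frame against the image query.
    sims = [cosine(query_emb, frame) for frame in frame_embs]
    # Detection: is the best-matching frame similar enough to the query?
    best = max(range(len(sims)), key=sims.__getitem__)
    score = sims[best]
    detected = score >= threshold
    # Localisation: report the best frame only when the keyword is detected.
    return detected, (best if detected else None), score


# Three frames in a 4-dimensional toy space; the query matches frame 2.
frames = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(detect_and_localise([0.0, 0.0, 1.0, 0.0], frames))
```

In this sketch the detection score is the maximum per-frame similarity and the predicted location is the frame that attains it, so a single set of scores serves both decisions.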


06/16/2021

Attention-Based Keyword Localisation in Speech using Visual Grounding

Visually grounded speech models learn from images paired with spoken cap...
10/29/2021

Visual Keyword Spotting with Attention

In this paper, we consider the task of spotting spoken keywords in silen...
08/23/2021

End-to-End Open Vocabulary Keyword Search

Recently, neural approaches to spoken content retrieval have become popu...
12/31/2020

EfficientNet-Absolute Zero for Continuous Speech Keyword Spotting

Keyword spotting is a process of finding some specific words or phrases ...
02/11/2020

Phoneme Boundary Detection using Learnable Segmental Features

Phoneme boundary detection plays an essential first step for a variety o...
01/25/2020

Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling

We propose smoothed max pooling loss and its application to keyword spot...
12/14/2020

Towards localisation of keywords in speech using weak supervision

Developments in weakly supervised and self-supervised models could enabl...