
Towards visually prompted keyword localisation for zero-resource spoken languages

by Leanne Nortje, et al.

Imagine being able to show a system a visual depiction of a keyword and finding spoken utterances that contain this keyword from a zero-resource speech corpus. We formalise this task and call it visually prompted keyword localisation (VPKL): given an image of a keyword, detect and predict where in an utterance the keyword occurs. To do VPKL, we propose a speech-vision model with a novel localising attention mechanism which we train with a new keyword sampling scheme. We show that these innovations give improvements in VPKL over an existing speech-vision model. We also compare to a visual bag-of-words (BoW) model where images are automatically tagged with visual labels and paired with unlabelled speech. Although this visual BoW can be queried directly with a written keyword (while ours takes image queries), our new model still outperforms the visual BoW in both detection and localisation, giving a 16% relative improvement in localisation F1.
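
To make the task definition concrete, the following is a minimal sketch of the VPKL interface only: an image query goes in, and a detection decision plus a predicted frame position come out. It is not the paper's model or its localising attention mechanism. The function name `vpkl_query`, the cosine-similarity scoring, the `threshold` value, and the toy embeddings are all assumptions made for illustration, and the sketch assumes every embedding is non-zero.

```python
import numpy as np


def vpkl_query(image_emb, frame_embs, threshold=0.5):
    """Toy VPKL interface (hypothetical): detect a keyword and localise it.

    image_emb:  (d,) embedding of the image query (assumed given).
    frame_embs: (T, d) embeddings of the T speech frames (assumed given).
    Returns (detected, best_frame_or_None, per_frame_scores).
    """
    # Cosine similarity between the image query and every speech frame.
    image_unit = image_emb / np.linalg.norm(image_emb)
    frame_units = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    scores = frame_units @ image_unit

    # Detection: does any frame match the query strongly enough?
    best_frame = int(np.argmax(scores))
    detected = bool(scores[best_frame] >= threshold)

    # Localisation: report the best-matching frame only when detected.
    return detected, (best_frame if detected else None), scores


# Toy example: the query matches the second frame (index 1) exactly.
query = np.array([1.0, 0.0])
frames = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
detected, position, scores = vpkl_query(query, frames)
```

In this toy setting, detection is a thresholded maximum over per-frame scores and localisation is the position of that maximum; a real system would learn the embeddings and the scoring jointly from paired images and speech.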




Attention-Based Keyword Localisation in Speech using Visual Grounding

Visually grounded speech models learn from images paired with spoken cap...

Visual Keyword Spotting with Attention

In this paper, we consider the task of spotting spoken keywords in silen...

End-to-End Open Vocabulary Keyword Search

Recently, neural approaches to spoken content retrieval have become popu...

EfficientNet-Absolute Zero for Continuous Speech Keyword Spotting

Keyword spotting is a process of finding some specific words or phrases ...

Phoneme Boundary Detection using Learnable Segmental Features

Phoneme boundary detection plays an essential first step for a variety o...

Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling

We propose smoothed max pooling loss and its application to keyword spot...

Towards localisation of keywords in speech using weak supervision

Developments in weakly supervised and self-supervised models could enabl...