Attention-Based Keyword Localisation in Speech using Visual Grounding

06/16/2021
by Kayode Olaleye, et al.

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word 'backstroke' for the query keyword 'swimming'.
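
To make the idea concrete, the sketch below shows one way an attention mechanism over a convolutional speech encoder can produce both a keyword detection score and a frame-level localisation, in the spirit of the model described above. This is a minimal PyTorch illustration under stated assumptions: the class name AttentionKeywordLocaliser, the layer sizes, the vocabulary size, and the use of log-Mel input features are illustrative choices, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of attention-based keyword detection and
# localisation over a convolutional speech encoder. Layer sizes, names,
# and the feature front end are assumptions for illustration only.
import torch
import torch.nn as nn


class AttentionKeywordLocaliser(nn.Module):
    def __init__(self, n_mels=40, vocab_size=67, hidden=256):
        super().__init__()
        # Convolutional encoder over speech frames (time axis preserved).
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4), nn.ReLU(),
        )
        # One learned query vector per keyword in the fixed vocabulary.
        self.keyword_queries = nn.Embedding(vocab_size, hidden)
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, feats, keyword_ids):
        """feats: (batch, n_mels, frames); keyword_ids: (batch,)"""
        enc = self.encoder(feats)                        # (B, H, T)
        q = self.keyword_queries(keyword_ids)            # (B, H)
        # Dot-product attention of the keyword query over the frames.
        attn_logits = torch.einsum("bh,bht->bt", q, enc)
        attn = torch.softmax(attn_logits, dim=-1)        # (B, T)
        # Attention-weighted summary used for detection (is the keyword present?).
        context = torch.einsum("bt,bht->bh", attn, enc)
        detect_logit = self.scorer(context).squeeze(-1)  # (B,)
        # Localisation: the frame with the highest attention weight.
        loc_frame = attn.argmax(dim=-1)                  # (B,)
        return detect_logit, attn, loc_frame


# Usage: score and localise two keywords in 1-second utterances (100 frames).
model = AttentionKeywordLocaliser()
feats = torch.randn(2, 40, 100)
keyword_ids = torch.tensor([3, 12])
logit, attn, frame = model(feats, keyword_ids)
```

In a visually grounded setup like the one described in the abstract, the detection output would be trained against the soft keyword probabilities produced by the image tagger, while the attention weights receive no direct alignment supervision and are only read out at test time to localise the keyword.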


Related research

02/02/2022  Keyword localisation in untranscribed speech using visually grounded speech models
Keyword localisation is the task of finding where in a speech utterance ...

10/12/2022  Towards visually prompted keyword localisation for zero-resource spoken languages
Imagine being able to show a system a visual depiction of a keyword and ...

04/15/2019  Semantic query-by-example speech search using visual grounding
A number of recent studies have started to investigate how speech system...

03/23/2017  Visually grounded learning of keyword prediction from untranscribed speech
During language acquisition, infants have the benefit of visual cues to ...

12/14/2020  Towards localisation of keywords in speech using weak supervision
Developments in weakly supervised and self-supervised models could enabl...

09/08/2023  Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning
Visually grounded speech systems learn from paired images and their spok...

05/25/2023  Visually grounded few-shot word acquisition with fewer shots
We propose a visually grounded speech model that acquires new words and ...
