DeepAI AI Chat
Log In Sign Up

Attention-Based Keyword Localisation in Speech using Visual Grounding

06/16/2021
by   Kayode Olaleye, et al.
0

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word 'backstroke' for the query keyword 'swimming'.

READ FULL TEXT

page 1

page 2

page 3

page 4

02/02/2022

Keyword localisation in untranscribed speech using visually grounded speech models

Keyword localisation is the task of finding where in a speech utterance ...
10/12/2022

Towards visually prompted keyword localisation for zero-resource spoken languages

Imagine being able to show a system a visual depiction of a keyword and ...
04/15/2019

Semantic query-by-example speech search using visual grounding

A number of recent studies have started to investigate how speech system...
03/23/2017

Visually grounded learning of keyword prediction from untranscribed speech

During language acquisition, infants have the benefit of visual cues to ...
05/25/2023

Visually grounded few-shot word acquisition with fewer shots

We propose a visually grounded speech model that acquires new words and ...
12/14/2020

Towards localisation of keywords in speech using weak supervision

Developments in weakly supervised and self-supervised models could enabl...
03/28/2022

Word Discovery in Visually Grounded, Self-Supervised Speech Models

We present a method for visually-grounded spoken term discovery. After t...